Big Data Analytics (second term 2023/24)
IMPORTANT INFO
This course will be conducted online using Zoom (Join Zoom Meeting: https://us02web.zoom.us/j/5469311951)
Course aims and objectives
In this four-day course, students will learn how to employ some widely discussed methods from the literature to analyze political texts and to extract from them useful information for testing their own theories.
- Packages to install in R for the first week
- Packages to install in R for the second week
First day
15 March 2023 - Morning session
Theory: An introduction to text analytics
Reference texts (1; 2; 3):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
- Grossman, Jonathan, and Ami Pedahzur (2020). Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them, Political Science Quarterly, 135(2): 225-257
Lab class: An introduction to the Quanteda package (a) packages to install for Lab 1m; b) script for Lab 1m; datasets: a) Boston tweets sample (csv file; rds file); b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR))
15 March 2023 - Afternoon session
Theory: Supervised and unsupervised scaling models
Reference texts (1; 2; 3; 4; 5; 6):
- Laver, Michael, Kenneth Benoit, and John Garry. 2003. Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2): 311-331
- Egerod, Benjamin C.K., and Robert Klemmensen (2020). Scaling Political Positions from Text: Assumptions, Methods and Pitfalls. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 27
- Martin, Lanny W., and Georg Vanberg. 2008. A robust transformation procedure for interpreting political text. Political Analysis, 16: 93-100
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2008. A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3): 705-722.
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2009. How to Avoid Pitfalls in Statistical Analysis of Political Texts: The Case of Germany. German Politics, 18(3): 323-344.
- Curini, Luigi, Hino, Airo, and Atsushi Osaki. 2020. Intensity of government–opposition divide as measured through legislative speeches and what we can learn from it. Analyses of Japanese parliamentary debates, 1953–2013, Government and Opposition, 55(2), 184-201
Lab class: How to implement the Wordscores & Wordfish algorithms (a) packages to install for Lab 1a; b) Lab 1a scripts (Wordscores & Wordfish); c) dataset for the first part of the Lab (party manifestos dataset); d) dataset for the second part of the Lab (music dataset))
Second day
16 March 2023 - Morning session
Theory: Unsupervised classification models
Reference texts (1; 2):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Chan, Chung-hong, and Marius Sältzer. 2020. oolong: An R package for validating automated content analysis tools. Journal of Open Source Software, 5(55): 2461
Lab class: How to implement a Topic Model (a) packages to install for Lab 2m; b) Lab 2m script; c) dataset: Guardian 2016)
16 March 2023 - Afternoon session
Theory: Some Advancements in Topic Modeling: the Structural Topic Model and the Semi-Supervised (Structural) Topic Model
Reference texts (1; 2; 3):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley. 2014. STM: R Package for Structural Topic Models. Journal of Statistical Software
- Eshima, Shusei, Kosuke Imai, and Tomoya Sasaki (2023). Keyword-Assisted Topic Models, American Journal of Political Science, DOI: 10.1111/ajps.12779
Lab class: How to implement a Structural Topic Model and a Semi-supervised (structural) topic model (a) packages to install for Lab 2a; b) Lab 2a scripts (part I: STM; part II: keyATM); datasets: a) New York Times articles (csv file; rds file); b) data for topical content analysis)
Third day
22 March 2023 - Morning session
Theory: (Part 1): Dictionary models; (Part 2): An introduction to supervised classification models
Reference texts (1; 2; 3):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
Lab class: How to implement a dictionary model and a Naive Bayes model (a) packages to install for Lab 3m; b) Lab 3m scripts (part I: dictionary models; part II: Naive Bayes model); datasets: 1) Laver & Garry dictionary; 2) sample of tweets discussing Donald Trump (csv file; rds file); 3) disaster training-set (csv file; rds file); 4) disaster test-set (csv file; rds file); 5) airlines training-set (csv file; rds file). EXTRA: a) converting an external dictionary to a Quanteda dictionary; b) split-half reliability test; c) the meaning of a compressed sparse matrix)
22 March 2023 - Afternoon session
Theory: Some further ML algorithms
Reference text (1):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to implement Random Forest and Support Vector Machine ML algorithms (a) packages to install for Lab 3a; b) Lab 3a script)
Fourth day
23 March 2023 - Morning session
Theory: (Part 1): How to validate your ML results; (Part 2): The importance of a good training set
Reference texts (1; 2; 3; 4; 5; 6):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Jordan, Soren, Hannah L. Paul, and Andrew Q. Philips (2022). How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
- Arnold, Christian, Luka Biedebach, Andreas Küpfer, and Marcel Neunhoeffer (2023). The Role of Hyperparameters in Machine Learning, Political Science Research and Methods, forthcoming
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
Lab class: How to compute the internal and external validity of an ML algorithm & inter-coder reliability (a) packages to install for Lab 4m; b) Lab 4m scripts (part I: external validity; part II: internal validity; part III: inter-coder reliability); functions to compute external and internal validity: for 2 class labels; for more than 2 class labels; rds for internal validity: random forest; svm; NB with 3 classes)
23 March 2023 - Afternoon session
Theory: An introduction to word embedding techniques
Reference text (1):
- Rodriguez, Pedro L., and Arthur Spirling (2022). Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research, Journal of Politics, 84(1), 101-115
Lab class: How to implement GloVe (scripts: a) packages to install for Lab 4a; b) Lab 4a script; datasets: a) movie reviews dataset (csv file; rds file); b) pre-trained WE on Google news; c) pre-trained WE on Facebook posts)
Course Assignment (datasets to be employed for the Assignment: 1) 2018 Donald Trump tweets; 2) sample of UK party manifestos since 2010; 3) UK training-set tweets; 4) UK test-set tweets)