Big Data Analytics (first term 2023/24)
Overview of the course
IMPORTANT INFO
To register your final mark for this course, please enroll in the Big Data Analytics exam of 15 December 2023.
Course aims and objectives
Students will learn how to apply widely discussed methods from the literature to analyze political texts and to extract from them useful information for testing their own theories.
First Lecture
28/09/23, 10:30-12:30 Theory: An introduction to text analytics
Reference texts: (1; 2, 3)
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
- Grossman, Jonathan, and Ami Pedahzur (2020). Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them, Political Science Quarterly, 135(2): 225-257
29/09/23, 10:30-12:30 Lab class: An introduction to the Quanteda package (a) packages to install for Lab 1; b) script for Lab 1; datasets: a) Boston tweets sample; b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR); EXTRA: a1) An explanation of cosine similarity; b1) An explanation of chi-squared)
Second Lecture
02/10/23, 13:45-15:45 Theory: (Part 1): From words to positions: supervised scaling models; (Part 2): CMP dataset
Reference texts (1; 2; 3, 4)
- Laver, Michael, Kenneth Benoit, and John Garry. 2003. Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2): 311-331
- Egerod, Benjamin C.K., and Robert Klemmensen (2020). Scaling Political Positions from Text: Assumptions, Methods and Pitfalls. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 27
- Martin, Lanny W., and Georg Vanberg. 2008. A robust transformation procedure for interpreting political text. Political Analysis, 16: 93-100
- Bräuninger, Thomas, and Nathalie Giger. 2018. Strategic Ambiguity of Party Positions in Multi-Party Competition. Political Science Research and Methods, 6(3): 527-548
03/10/23, 13:45-15:45 Lab class: How to implement the Wordscores algorithm (a) Lab 2 scripts (part I: Wordscores; part II: CMP); b) dataset for the Wordscores part of the Lab; EXTRA: a1) How Wordscores works)
Third Lecture
05/10/23, 10:30-12:30 Theory: From words to positions: Unsupervised scaling models
Reference texts (1; 2; 3; 4)
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2008. A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3): 705-722.
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2009. How to Avoid Pitfalls in Statistical Analysis of Political Texts: The Case of Germany. German Politics, 18(3): 323-344.
- Curini, Luigi, Airo Hino, and Atsushi Osaka. 2020. Intensity of government–opposition divide as measured through legislative speeches and what we can learn from it: Analyses of Japanese parliamentary debates, 1953–2013. Government and Opposition, 55(2): 184-201
- Lauderdale, Benjamin E., and Alexander Herzog (2016). Measuring Political Positions from Legislative Speech, Political Analysis (2016) 24:374–39
06/10/23, 10:30-12:30 Lab class: How to implement the Wordfish algorithm (a) packages to install for Lab 3; b) Lab 3 scripts (part I: Wordfish; part II: Wordshoal); EXTRA: a) estimating bootstrap confidence intervals in Wordfish)
First assignment (due: 12 October 2023) (dataset for Assignment 1. To open this file, please use the data compression tool WinRAR)
Fourth Lecture
12/10/23, 10:30-12:30 Theory: From words to issues: unsupervised classification models
Reference text (1):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4): 1064-1082
13/10/23, 10:30-12:30 Lab class: How to implement a Topic Model (a) packages to install for Lab 4; b) Lab 4 script; c) dataset for Lab 4: Guardian 2016; EXTRA: a) Clustering Models; b) how to estimate a cluster model)
Second assignment (due: 19 October 2023) (dataset for Guardian 2013)
Fifth Lecture
19/10/23, 10:30-12:30 Theory: (Part 1): From words to issues: structural topic models; (Part 2): Dictionary models
Reference texts (1, 2):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley. 2014. STM: R Package for Structural Topic Models. Journal of Statistical Software
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
20/10/23, 10:30-12:30 Lab class: How to implement a Structural Topic Model and dictionary models (a) packages to install for Lab 5; b) Lab 5 scripts (part I: STM; part II: dictionaries); datasets: a) NYT; b) data for topical content analysis; c) sample of tweets discussing Donald Trump; EXTRA: a) converting an external dictionary to Quanteda; b) split-half reliability test)
Third Assignment (due: 26 October 2023) (dataset for Assignment 3. To open this file, use the command readRDS("Trump2018.rds"))
Sixth Lecture
26/10/23, 10:30-12:30 Theory: (Part 1): From words to issues: semi-supervised classification models; (Part 2): An introduction to supervised classification models
Reference texts (1, 2, 3, 4, 5):
- Kohei Watanabe and Yuan Zhou (2020) Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review, DOI: 10.1177/0894439320907027
- Shusei Eshima, Kosuke Imai, and Tomoya Sasaki (2023). Keyword-Assisted Topic Models, American Journal of Political Science, DOI: 10.1111/ajps.12779
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
27/10/23, 10:30-12:30 Lab class: How to implement a semi-supervised classification model (a) packages to install for Lab 6; b) Lab 6 script; EXTRA: a) computing coherence and exclusivity with keyATM)
Fourth Assignment (due: 2 November 2023)
Seventh Lecture
02/11/23, 10:30-12:30 Theory: From words to issues: supervised classification models (first part)
Reference text (1):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
03/11/23, 10:30-12:30 Lab class: How to implement supervised classification models (a) packages to install for Lab 7; b) Lab 7 script; datasets: 1) disasters training-set; 2) disasters test-set; EXTRA: The meaning of a compressed sparse matrix)
Fifth Assignment (due: 9 November 2023) (datasets for Assignment 5: a) UK training set; b) UK test set)
Eighth Lecture
09/11/23, 10:30-12:30 Theory: How to validate the results from a ML algorithm
Reference texts (1, 2, 3, 4, 5, 6):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Jordan, Soren, Hannah L. Paul, and Andrew Q. Philips (2022). How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
- Arnold, Christian, Luka Biedebach, Andreas Küpfer, and Marcel Neunhoeffer (2023). The Role of Hyperparameters in Machine Learning, Political Science Research and Methods, 2024
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
10/11/23, 10:30-12:30 Lab class: How to compute internal and external validity of a ML algorithm (a) packages to install for Lab 8; b) Lab 8 scripts (part I: internal validity; part II: external validity); datasets: 1) disasters validation set; 2) airplane training-set; functions to compute cross-validation: for 2 class labels; for more than 2 class labels - (to open the last two files, please use the data compression tool WinRAR)
Sixth Assignment (due: 16 November 2023) (datasets for Assignment 6: a) training-set; b) validation-set; c) test-set)
Ninth Lecture
16/11/23, 10:30-12:30 Theory: (Part 1): From words to issues: supervised classification models (second part); (Part 2): The importance of the training set
Reference texts (1; 2):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
17/11/23, 10:30-12:30 Lab class: How to implement supervised classification models & inter-coder reliability (a) packages to install for Lab 9; b) Lab 9 scripts (part I: ML algorithms; part II: internal validity; part III: external validity; part IV: inter-coder reliability)
Seventh Assignment (due: 23 November 2023)
Tenth Lecture
23/11/23, 10:30-12:30 Theory: An introduction to word embedding models
Reference texts (1, 2, 3, 4):
- Rodriguez, Pedro L., and Arthur Spirling (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1): 101-115
- Rudkowsky, Elena, et al. (2018). More than Bags of Words: Sentiment Analysis with Word Embeddings. Communication Methods and Measures, 12(2-3): 140-157
- Wankmüller, Sandra (2019). Introduction to Neural Transfer Learning With Transformers for Social Science Text Analysis. Sociological Methods & Research, 00491241221134527
- Laurer, M., van Atteveldt, W., Casas, A., & Welbers, K. (2022). Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI. Political Analysis, 1-33.
Eighth Assignment (due: 30 November 2023) (dataset for Assignment 8)