Big Data Analytics (first term 2024/25)
Overview of the course
IMPORTANT INFO
To register your final mark for this course, please enroll in the Big Data Analytics exam of 13 December 2024.
Course aims and objectives
Students will learn how to apply widely discussed methods from the literature to analyze political texts and to extract useful information from them for testing their own theories.
First Lecture
19/09/24, 10:30-12:30 Theory: An introduction to text analytics
Reference texts: (1; 2, 3)
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
- Grossman, Jonathan, and Pedahzur Ami (2020). Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them, Political Science Quarterly, 135(2): 225-257
20/09/24, 10:30-12:30 Lab class: An introduction to the Quanteda package (a) packages to install for Lab 1; b) script for Lab 1; datasets: a) Boston tweets sample (.csv; .rds); b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR))
Second Lecture
26/09/24, 10:30-12:30 Theory: (Part 1): From words to positions: supervised scaling models; (Part 2): CMP dataset; (Part 3): Google Colab
Reference texts (1; 2)
- Laver, Michael, Kenneth Benoit, John Garry. 2003. Extracting Policy Positions from political texts using words as data. American Political Science Review, 97(02), 311-331
- Egerod, Benjamin C.K., and Robert Klemmensen (2020). Scaling Political Positions from text. Assumptions, Methods and Pitfalls. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 27
27/09/24, 10:30-12:30 Lab class: How to implement the Wordscores algorithm (a) packages to install for Lab 2; b) Lab 2 scripts (part I: Wordscores; part II: CMP; part III: Google Colab); c) party manifesto dataset (unzip the txt files in a folder); d) music dataset)
First assignment (due: 3 October 2024) (dataset for Assignment 1. To open this file, please use the data compression tool WinRAR)
Third Lecture
03/10/24, 10:30-12:30 Theory: From words to positions: Unsupervised scaling models
Reference texts (1; 2; 3; 4)
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2008. A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3): 705-722.
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2009. How to Avoid Pitfalls in Statistical Analysis of Political Texts: The Case of Germany. German Politics, 18(3): 323-344.
- Curini, Luigi, Hino, Airo, and Atsushi Osaki. 2020. Intensity of government–opposition divide as measured through legislative speeches and what we can learn from it. Analyses of Japanese parliamentary debates, 1953–2013, Government and Opposition, 55(2), 184-201
- Lauderdale, Benjamin E., and Alexander Herzog (2016). Measuring Political Positions from Legislative Speech, Political Analysis (2016) 24:374–39
04/10/24, 10:30-12:30 Lab class: How to implement the Wordfish algorithm (a) packages to install for Lab 3; b) Lab 3 scripts (part I: Wordfish; part II: Wordshoal); EXTRA: estimating bootstrap confidence intervals in Wordfish)
Second assignment (due: 10 October 2024)
Fourth Lecture
10/10/24, 10:30-12:30 Theory: From words to issues: unsupervised classification models
Reference texts (1; 2):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082
- Chan, Chung-hong, and Marius Sältzer. "oolong: An R package for validating automated content analysis tools." The Journal of Open Source Software: JOSS 5.55 (2020): 2461
Third assignment (due: 17 October 2024) (dataset for Guardian 2013 - .csv; .rds)
Fifth Lecture
17/10/24, 10:30-12:30 Theory: (Part 1): Some Advancements in Topic Modeling: the Structural Topic Model and the Semi-Supervised (Structural) Topic Model; (Part 2): A practical introduction to the Bayesian framework
Reference texts (1, 2, 3):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley. 2014. STM: R Package for Structural Topic Models. Journal of Statistical Software
- Shusei Eshima, Kosuke Imai, and Tomoya Sasaki (2023). Keyword-Assisted Topic Models, American Journal of Political Science, DOI: 10.1111/ajps.12779
18/10/24, 10:30-12:30 Lab class: How to implement a Structural Topic Model and a Semi-supervised (structural) topic model (a) packages to install for Lab 5; b) Lab 5 scripts (part I: STM; part II: keyATM); datasets for Lab 5: a) NYT economic articles (.csv; .rds); b) data for topical content analysis)
Fourth Assignment (due: 24 October 2024) (dataset for Assignment 4. To open this file, use the command readRDS("Trump2018.rds"))
Sixth Lecture
24/10/24, 10:30-12:30 Theory: (Part 1): Supervised classification methods: automatic tagging; (Part 2): An introduction to supervised classification models
Reference texts (1, 2, 3):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
25/10/24, 10:30-12:30 Lab class: How to implement dictionary models and a semi-supervised classification model (a) packages to install for Lab 6; b) Lab 6 scripts (part I: dictionaries; part II: NB model); datasets for Lab 6: a) Laver and Garry policy dictionary; b) sample of tweets discussing Donald Trump (.rds file); c) disaster dataset - training-set (.csv; .rds); d) disaster dataset - test-set (.csv; .rds); e) US airlines training-set (.csv; .rds); f) US airlines test-set (.csv; .rds); EXTRA: a) converting an external dictionary to a Quanteda format; b) split-half reliability test; c) meaning of compressed sparse matrices)
Fifth Assignment (due: 31 October 2024) (datasets for Assignment 5: a) UK training set (.csv; .rds); b) UK test set (.csv; .rds))
Seventh Lecture
30/10/24, 13:30-15:30 Theory: From words to issues: supervised classification models (part I)
Reference text (1):
- Olivella, Santiago, and Shoub Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
31/10/24, 10:30-12:30 Lab class: How to implement a RF and a SVM: (a) packages to install for Lab 7; b) Lab 7 script; c) Google Colab notebook about Keras package)
Sixth Assignment (due: 7 November 2024)
Eighth Lecture
07/11/24, 10:30-12:30 Theory: From words to issues: supervised classification models (part I: Neural Network Models; part II: Gradient boosting models)
Reference text (1):
- Olivella, Santiago, and Shoub Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
08/11/24, 10:30-12:30 Lab class: How to implement supervised classification models (a) packages to install for Lab 8; b) Lab 8 scripts (part I: Gradient boosting; part II: Neural Network Models - R script; Google Colab Notebook))
Seventh Assignment (due: 14 November 2024)
Ninth Lecture
14/11/24, 10:30-12:30 Theory: Part I: How to validate the results from a ML algorithm; Part II: The importance of the training set
Reference texts (1, 2, 3, 4, 5, 6):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Soren Jordan, Hannah L. Paul, Andrew Q. Philips, How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, 2022, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
- Arnold, Christian, Biedebach Luka, Küpfer Andreas, and Neunhoeffer Marcel. (2023). The Role of Hyperparameters in Machine Learning, Political Science Research and Methods, 2024
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
15/11/24, 10:30-12:30 Lab class: How to compute internal and external validity of a ML algorithm (a) packages to install for Lab 9; b) Lab 9 scripts (part I: external validity; part II: global interpretation; part III: inter-coder reliability); c) functions to compute cross-validation: for 2 class labels; for more than 2 class labels (to open the two files, please use the data compression tool WinRAR); d) .rds file for the grid-search of the NN algorithm; e) .rds files for the global interpretation exercise (to open the file, please use the data compression tool WinRAR); f) Google Colab notebook about Text package; g) Google Colab notebook about Grafzahl package)
Eighth Assignment (due: 28 November 2024) (datasets for Assignment 8 (.csv; .rds))
Tenth Lecture
21/11/24, 10:30-12:30 Theory: Word Embedding techniques (with an extension to LLMs)
Reference texts (1, 2):
- Rodriguez Pedro L. and Spirling Arthur (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115
- Kjell, O., Giorgi, S., & Schwartz, H. A. (2023, May 1). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000542
Ninth Assignment (due: 5 December 2024) (datasets for Assignment 9 (.csv; .rds))