Applied Scaling & Classification Techniques using text data
(academic year 2025/26)
Syllabus
Course aims and objectives
Students will learn how to apply widely discussed methods from the literature to analyze texts and extract useful information for testing their own theories.
First Lecture - Room 3-902
08/01/26 10:40-12:20 Theory: An introduction to text analytics (part 1): the Bag-of-Words approach
Reference texts (1; 2):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
Second Lecture - Room 3-902
15/01/26 10:40-12:20 Theory: Unsupervised classification methods: the Topic Model (and beyond)
Reference texts (1; 2):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082
15/01/26 17:00-18:40 Lab class: How to implement a Topic Model and a Structural Topic Model. a) packages to install for Lab 2; b) script for Lab 2 - R script; Google Colab notebook; dataset: NYT economic articles (.rds); c) data for topical content analysis; d) dataset of the searchK exploration. EXTRA 1: computing FREX statistics in topicmodels; EXTRA 2: how to use a trained topic model with a new test-set
First assignment (due: 22 January 2026) (dataset for the first part: dataset about lyrics of songs; dataset for the second part: Guardian 2016 (.csv; .rds); dataset for the third part: Trump 2018 tweets). Note: to open a .rds file, use the command readRDS("NAME OF THE FILE.rds")
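A minimal sketch of the readRDS() note above, assuming base R only. The file path and the small example data frame are placeholders made up for illustration; they are not course files. In practice, point readRDS() at the .rds file downloaded for the assignment.

```r
# Hypothetical path: a temporary file stands in for the downloaded .rds file.
path <- file.path(tempdir(), "example.rds")

# Create and save a small example object so the sketch runs on its own.
example <- data.frame(
  id = 1:3,
  text = c("first text", "second text", "third text"),
  stringsAsFactors = FALSE
)
saveRDS(example, file = path)

# readRDS() returns the stored object, so assign the result to a name.
dat <- readRDS(path)

# Inspect the structure before any further processing.
str(dat)
```

Unlike load(), readRDS() does not restore the object under its original name, so the result has to be assigned explicitly. The .csv versions of the datasets can be read with read.csv() instead.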
Third Lecture - Room 3-902
19/01/26 10:40-12:20 Theory: (part 1): An introduction to supervised classification models; (part 2): Supervised classification models: Naïve Bayes & Random Forest
Reference text (1):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
19/01/26 17:00-18:40 Lab class: How to implement supervised classification models: NB & RF. a) packages to install for Lab 3; b) script for Lab 3 - R script; Google Colab notebook; c) datasets for Lab 3: disaster dataset - training-set (.csv; .rds); disaster dataset - test-set (.csv; .rds); US airlines training-set (.csv; .rds); US airlines test-set (.csv; .rds). EXTRA 1: Meaning of a compressed sparse matrix; EXTRA 2: using KERAS3 on Google Colab
Fourth Lecture - Room 3-902
22/01/26 10:40-12:20 Theory: Neural Network Models
22/01/26 17:00-18:40 Lab class: How to implement a NN model. Script for Lab 4 (Neural Network Models - R script; Google Colab Notebook)
Second assignment (due: 29 January 2026) (datasets for Assignment 2: a) UK training set (.csv; .rds); b) UK test set (.csv; .rds))
Fifth Lecture - Room 3-902
26/01/26 10:40-12:20 Theory: (part 1): How to validate a ML algorithm: external validity; (part 2): How to validate a ML algorithm: global interpretation
Reference texts (1; 2; 3; 4; 5):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Arnold, Christian, Luka Biedebach, Andreas Küpfer, and Marcel Neunhoeffer (2024). The Role of Hyperparameters in Machine Learning Models and How to Tune Them, Political Science Research and Methods
- Park, Ju Yeon, and Jacob M. Montgomery. "Toward a framework for creating trustworthy measures with supervised machine learning for text." Political Science Research and Methods (2025): 1-17
- Jordan, Soren, Hannah L. Paul, and Andrew Q. Philips (2022). How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
26/01/26 17:00-18:40 Lab class: How to compute external validity and global interpretation of a ML algorithm: a) packages to install for Lab 5; b) scripts for Lab 5 (part I: external validity - R script; Google Colab Notebook; part II: global interpretation - R script; Google Colab Notebook); c) functions to compute cross-validation and global interpretation for class-labels; d) .rds file for the grid-search for the Random Forest (multi-class case); e) .rds file for the grid-search of the NN algorithm (binary case; multi-class case); f) .rds files for the global interpretation part (to open the file, please use the data compression tool WinRAR); EXTRA: The importance of a good training-set
Sixth Lecture - Room 3-902
29/01/26 10:40-12:20 Theory: Static Word Embedding techniques
Reference text (1):
- Rodriguez, Pedro L., and Arthur Spirling (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115
29/01/26 17:00-18:40 Lab class: How to compute static WE. a) packages to install for Lab 6; b) scripts for Lab 6 (R script; Google Colab Notebook); c) movie reviews dataset (.csv; .rds); d) movie review test-set (.rds); e) pre-trained WE on Google news; f) pre-trained WE on Facebook posts; EXTRA 1: Naïve Bayes classifier with continuous features; EXTRA 2: Installing TRANSFORMERS on Google Colab
Third assignment (due: 5 February 2026) (dataset for the assignment: training-set: .csv; .rds; test-set: .csv; .rds)
Seventh Lecture - Room 3-902
02/02/26 10:40-12:20 Theory: An introduction to Large Language Models (LLMs) with special attention to BERT
Reference texts (1; 2):
- Kjell, O., Giorgi, S., & Schwartz, H. A. (2023, May 1). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000542
- Park, Ju Yeon, and Jacob M. Montgomery. "Toward a framework for creating trustworthy measures with supervised machine learning for text." Political Science Research and Methods (2025): 1-17
Fourth assignment (due: 12 February 2026) (dataset for the assignment: training-set: .csv; .rds; test-set: .csv; .rds)