Text Classification Algorithms: From Bag-of-Words to Dynamic Word Embeddings (second term 2025/26)
Syllabus
IMPORTANT INFO
This course will be conducted online using Teams here.
Course aims and objectives
In this four-day course, students will learn how to employ several widely discussed methods from the literature to analyze a corpus of texts and to extract from it useful information for testing their own theories.
First day
20 March 2026 - Morning session
Theory: An introduction to text analytics: the Bag-of-Words approach
Reference texts: (1; 2)
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
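The core idea of the Bag-of-Words approach can be previewed in a few lines: each document is reduced to a vector of word counts and word order is discarded. A minimal sketch in Python with scikit-learn (the lab scripts themselves are in R / Colab; the toy documents below are invented for illustration):

```python
# Bag-of-Words: each document becomes a row of word counts in a
# document-term matrix; word order is discarded.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the government raised taxes",
    "taxes were raised by the government",
    "the parliament debated the budget",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(dtm.shape)  # (3, 9): 3 documents, 9 distinct terms
print(sorted(vectorizer.vocabulary_))
```

Note that the first two documents, although worded differently, end up with very similar count vectors: this order-invariance is both the strength and the main limitation of the approach.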
20 March 2026 - Afternoon session
Theory: (part 1): An introduction to supervised classification models; (part 2): Supervised classification models: Naïve Bayes & Random Forest
Reference text (1):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
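As a hedged preview of this session (toy data, Python/scikit-learn rather than the R used in the labs), both classifiers covered here can be trained on the same bag-of-words features:

```python
# Toy example: a Naive Bayes and a Random Forest classifier trained on
# a small document-term matrix (invented movie-review snippets).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

texts = ["great wonderful film", "awful boring film",
         "wonderful acting", "boring and awful plot"]
labels = ["pos", "neg", "pos", "neg"]

X = CountVectorizer().fit_transform(texts)

nb = MultinomialNB().fit(X, labels)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

print(nb.predict(X))  # in-sample predictions
print(rf.predict(X))
```

The two models embody different assumptions: Naïve Bayes treats word counts as conditionally independent given the class, while the Random Forest averages many decision trees over the same features.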
Second day
21 March 2026 - Morning session
Theory: Neural Network Models
Lab class: How to implement a NN model. Script for Lab 2m (Neural Network Models - R script; Google Colab Notebook); EXTRA: Google Colab Notebook for keras3 with GPU
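A minimal feed-forward neural network classifier on bag-of-words features can be sketched with scikit-learn's MLPClassifier (the lab itself uses keras3 via R and Colab; the toy data below is invented):

```python
# A tiny feed-forward neural network (one hidden layer of 8 units)
# trained on bag-of-words counts from invented review snippets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["good great film", "bad awful movie", "great acting", "awful plot"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(texts).toarray()
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, labels)
print(clf.predict(X))  # in-sample predictions
```

In keras3 the same architecture would be a Dense hidden layer plus a sigmoid output; the scikit-learn version just hides the training loop.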
21 March 2026 - Afternoon session
Theory: (Part 1): How to validate a ML algorithm: external validity; (Part 2): The importance of a good training-set
Reference text (1, 2, 3; 4; 5, 6):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Arnold, Christian, Luka Biedebach, Andreas Küpfer, and Marcel Neunhoeffer (2024). The Role of Hyperparameters in Machine Learning, Political Science Research and Methods
- Park, Ju Yeon, and Jacob M. Montgomery. "Toward a framework for creating trustworthy measures with supervised machine learning for text." Political Science Research and Methods (2025): 1-17
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
Lab class: How to compute external validity of a ML algorithm: a) scripts for Lab 2a (R script; Google Colab Notebook); b) functions to compute cross-validation and global interpretation; c) .rds file for the grid-search for the Random Forest (multi-class case); d) .rds file for the grid-search of the NN algorithm (binary case; multi-class case)
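The logic of the cross-validation part of this lab can be sketched in Python with scikit-learn (the lab functions themselves are in R; the toy reviews below are invented). Wrapping the vectorizer and the classifier in a single pipeline matters: it refits the vocabulary inside each fold, so no information from the held-out texts leaks into training:

```python
# k-fold cross-validation for out-of-sample (external) validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "great wonderful film", "superb acting and plot", "loved this movie",
    "a wonderful experience", "great direction and film",
    "awful boring movie", "terrible plot and acting", "hated this film",
    "a boring experience", "awful direction and plot",
]
labels = ["pos"] * 5 + ["neg"] * 5

# Vectorizer + classifier as one estimator: refit per fold, no leakage.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")

print(scores)        # one accuracy per fold
print(scores.mean()) # cross-validated accuracy estimate
```

A grid search (as in the Random Forest and NN .rds files above) simply repeats this procedure for each hyperparameter combination and keeps the best-scoring one.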
Third day
27 March 2026 - Morning session
Theory: How to validate a ML algorithm: global interpretation
Reference text (1):
- Jordan, Soren, Hannah L. Paul, and Andrew Q. Philips (2022). How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
Lab class: How to compute global interpretation of a ML algorithm: a) scripts for Lab 3m (R script; Google Colab Notebook); b) .rds files for the global interpretation part (to open this file, please use the data compression tool WinRAR)
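One common global-interpretation tool is permutation importance: shuffle one feature at a time and measure how much the model's performance drops. A small sketch with scikit-learn on synthetic data (illustrative only; the lab files are in R):

```python
# Permutation importance: features whose shuffling hurts performance
# most are the ones the model actually relies on.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # by construction, only feature 0 matters

rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

print(result.importances_mean)  # feature 0 should dominate
```

With text classifiers the "features" are (many) terms, but the principle is identical: shuffling an informative word's column degrades accuracy, shuffling an uninformative one does not.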
27 March 2026 - Afternoon session
Theory: Static Word Embedding techniques
Reference texts (1):
- Rodriguez, Pedro L., and Arthur Spirling (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115
Lab class: How to compute static WE. a) scripts for Lab 3A (R script; Google Colab Notebook); b) movie reviews dataset (.csv; .rds); c) movie review test-set (.rds); d) pre-trained WE on Google news; e) pre-trained WE on Facebook posts; EXTRA 1: Naïve Bayes classifier with continuous features; EXTRA 2: Installing TRANSFORMERS on Google Colab
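A common way to turn static word embeddings into document features, as done in this lab, is to average the vectors of the words in each document. A minimal sketch with NumPy, using invented 4-dimensional vectors as stand-ins for real pre-trained embeddings such as the Google News ones:

```python
# Document vectors as the average of their (static) word vectors.
import numpy as np

# Hypothetical toy embeddings; real ones are 300-dimensional and pre-trained.
embeddings = {
    "movie":  np.array([0.9, 0.1, 0.0, 0.2]),
    "film":   np.array([0.8, 0.2, 0.1, 0.1]),
    "tax":    np.array([0.0, 0.9, 0.8, 0.1]),
    "budget": np.array([0.1, 0.8, 0.9, 0.0]),
}

def doc_vector(tokens):
    """Average the embeddings of the tokens found in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = doc_vector(["movie", "film"])
d2 = doc_vector(["tax", "budget"])
print(cosine(d1, d2))  # low: the two documents are about different topics
```

These averaged vectors can then feed any of the classifiers from the earlier sessions, which is also why the EXTRA on Naïve Bayes with continuous features is relevant here.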
Fourth day
28 March 2026 - Morning session
Theory: An introduction to Large Language Models (LLMs) with special attention to BERT
Reference texts (1; 2):
- Kjell, O., Giorgi, S., & Schwartz, H. A. (2023, May 1). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000542
- Park, Ju Yeon, and Jacob M. Montgomery. "Toward a framework for creating trustworthy measures with supervised machine learning for text." Political Science Research and Methods (2025): 1-17
Lab class: How to compute dynamic WE. (a) Scripts for Lab 4M - R script; Google Colab Notebook; b) functions to compute dynamic WE; c) .rds file to use in the lab (to open this file, please use the data compression tool WinRAR); d) dataset for the lab (.csv; .rds)
28 March 2026 - Afternoon session
Theory: Encoders (and beyond)
Reference texts (1, 2, 3):
- Laurer, M., et al. (2024). Building Efficient Universal Classifiers with Natural Language Inference, arXiv
- Burnham, M. (2024). Stance detection: a practical guide to classifying political beliefs in text, Political Science Research and Methods, 1-18, doi:10.1017/psrm.2024.35
- Ornstein, Joseph T., Elise N. Blasingame, and Jake S. Truscott (2023). How to train your stochastic parrot: Large language models for political texts, Political Science Research and Methods, 1-18
Lab class: How to fine-tune an encoder and how to estimate NLI and zero-shot models. Scripts for Lab 4A (a) part I: how to fine-tune an encoder - R script; Google Colab Notebook; part II: how to estimate NLI, zero-shot and sentiment models - R script; Google Colab Notebook; part III: how to fine-tune a NLI, zero-shot and sentiment model - R script; Google Colab Notebook; part IV: applying NLI to a real dataset - R script; Google Colab Notebook; b) .rds to use in the lab (to open this file, please use the data compression tool WinRAR); c) folder with a fine-tuned BERT (this is a big folder!); d) folder with a fine-tuned NLI (this is a big folder!); e) folder with a fine-tuned Zero Shot Model (this is a big folder!); f) folder with a fine-tuned Sentiment Model (this is a big folder!); g) folder with the fine-tuned NLI with a real dataset (this is a big folder!)
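The recipe behind zero-shot classification with an NLI model, which parts II–IV of this lab implement with real transformers, can be shown schematically: each candidate label is turned into a hypothesis ("this text is about {label}"), the NLI model scores how strongly the text entails it, and the best-scoring label wins. In the sketch below `entailment_score` is a mock word-overlap stand-in, not a real NLI model:

```python
# Schematic zero-shot classification via NLI (mock scorer, illustration only).
def entailment_score(premise, hypothesis):
    # Mock: count shared words. A real NLI model (a fine-tuned transformer)
    # would return an entailment probability instead.
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return len(shared)

def zero_shot(text, labels, template="this text is about {}"):
    scores = {lab: entailment_score(text, template.format(lab))
              for lab in labels}
    return max(scores, key=scores.get), scores

text = "the parliament passed a new tax law"
label, scores = zero_shot(text, ["tax policy", "sports", "weather"])
print(label)  # "tax policy"
```

Because the labels enter only through the hypothesis template, the same NLI model can classify into label sets it never saw during training, which is what makes these "universal" classifiers attractive.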
IMPORTANT! A list of exercises to assess your understanding of the material discussed throughout the course is available HERE