Luigi Curini
  • Home
  • CV & Interests
  • Courses
    • 2025/26 >
      • Big Data Analytics
      • Game Theory for Social Scientists
      • Scienza Politica
      • Applied Scaling & Classification Techniques in Political Science
      • Text Classification Algorithms
  • Publications
    • Scientific Publications
    • Articles on press OP/EDS
    • Interviews
  • Personal
  • News


Text Classification Algorithms: From Bag-of-Words to Dynamic Word Embeddings (second term 2025/26)

Syllabus

 IMPORTANT INFO
This course will be conducted online using Teams here.

Course aims and objectives
In this four-day course, students will learn how to employ several widely discussed methods advanced in the literature to analyze a corpus of texts and to extract from it useful information for testing their own theories.

  • Packages to install in R for the first week
  • ​Packages to install in R for the second week

First day
20 March 2026 - Morning session
Theory: 
An introduction to text analytics: the Bag-of-Words approach 

Reference texts:  (1; 2)
  • ​Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
  • Benoit, Kenneth (2020). Text as Data: An Overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
Lab class: An introduction to the Quanteda package. a) Script for Lab 1m - R script; Google Colab notebook; datasets: a) Boston tweets sample (.csv; .rds); b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR); EXTRA: a) How to read and save files from your Google Drive in Google Colab
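The lab itself uses quanteda in R; as a minimal stand-in illustration of the bag-of-words idea, the sketch below builds a document-feature matrix in Python with scikit-learn (the toy corpus is invented, not course data):

```python
# Minimal bag-of-words sketch: turn a toy corpus into a sparse
# document-feature matrix, analogous to quanteda's dfm() in R.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the economy is growing",
    "the economy is shrinking",
    "parliament debates the economy",
]

vectorizer = CountVectorizer(lowercase=True)
dfm = vectorizer.fit_transform(corpus)   # rows = documents, columns = tokens

n_docs, n_features = dfm.shape           # 3 documents, 7 unique tokens
vocabulary = sorted(vectorizer.vocabulary_)
```

Each document is reduced to token counts, discarding word order; this is the representation all of the first day's classifiers operate on.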

20 March 2026 - Afternoon session
Theory: (part 1):  An introduction to supervised classification models; (part 2): Supervised classification models: Naïve Bayes & Random Forest
Reference text (1):
  • Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to implement supervised classification models: NB & RF. a) script for Lab 1a - R script; Google Colab notebook; datasets for Lab 2: a) disaster dataset - training-set (.csv; .rds); b) disaster dataset - test-set (.csv; .rds); c) US airlines training-set (.csv; .rds); d) US airlines test-set (.csv; .rds); EXTRA 1: Meaning of a compressed sparse matrix; EXTRA 2: using a single dfm for training and test set; EXTRA 3: using KERAS3 on Google Colab
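As a hedged sketch of the two classifiers covered in this session (the lab's own scripts are in R; the texts and labels below are invented, loosely echoing the disaster dataset):

```python
# Naive Bayes and Random Forest on a toy labeled corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

train_texts = ["flood hits the city", "earthquake damages homes",
               "lovely day at the beach", "great food and music"]
train_labels = [1, 1, 0, 0]          # 1 = disaster, 0 = not (toy labels)
test_texts = ["earthquake hits the beach"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)    # reuse the SAME feature space as training

nb = MultinomialNB().fit(X_train, train_labels)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, train_labels)

nb_pred = nb.predict(X_test)
rf_pred = rf.predict(X_test)
```

Note that the test set is projected into the training vocabulary with `transform`, not `fit_transform`; this is the same point as the lab's EXTRA on using a single dfm for training and test set.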

Second day
21 March 2026 - Morning session
Theory: Neural Network Models

Lab class: How to implement a NN model. Script for Lab 2m (Neural Network Models - R script; Google Colab Notebook); EXTRA: Google Colab Notebook for keras3 with GPU 
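The lab implements the network with keras3; as a lighter stand-in, the sketch below fits a small feed-forward network on bag-of-words features with scikit-learn's MLPClassifier (toy data, invented labels):

```python
# A one-hidden-layer feed-forward network on bag-of-words features,
# conceptually similar to a small dense Keras model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["good service", "bad service", "good flight", "bad flight",
         "great crew", "terrible crew"]
labels = [1, 0, 1, 0, 1, 0]          # toy sentiment labels

X = CountVectorizer().fit_transform(texts)

nn = MLPClassifier(hidden_layer_sizes=(8,),  # one hidden layer, 8 units
                   max_iter=2000, random_state=1)
nn.fit(X, labels)
pred = nn.predict(X)                  # in-sample predictions, for illustration only
```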

21 March 2026 - Afternoon session
Theory: (Part 1): How to validate a ML algorithm: external validity; (Part 2): The importance of a good training-set
Reference text (1, 2, 3; 4;  5, 6):
  • Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
  • Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
  • Arnold, Christian, Biedebach, Luka, Küpfer, Andreas, and Neunhoeffer, Marcel (2024). The Role of Hyperparameters in Machine Learning, Political Science Research and Methods
  • Park, Ju Yeon, and Jacob M. Montgomery. "Toward a framework for creating trustworthy measures with supervised machine learning for text." Political Science Research and Methods (2025): 1-17​
  • Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
  • Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020

Lab class:  How to compute external validity of a ML algorithm: a) scripts for Lab 2a (R script; Google Colab Notebook); b) functions to compute cross-validation and global interpretation;  c) .rds file for the grid-search for the Random Forest (multi-class case); d) .rds file for the grid-search of the NN algorithm (binary case; multi-class case)
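The core of external validation is scoring the model on data it never saw during fitting. A minimal sketch of k-fold cross-validation (the lab's own cross-validation functions are in R; the toy corpus and labels below are invented):

```python
# 4-fold cross-validation: each fold is held out once and scored,
# giving an out-of-sample accuracy estimate instead of in-sample fit.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

texts = ["tax cuts now", "raise taxes", "cut spending", "more spending",
         "lower taxes please", "increase public spending",
         "austerity works", "invest in services"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]     # toy labels

X = CountVectorizer().fit_transform(texts)
scores = cross_val_score(MultinomialNB(), X, labels, cv=4)
mean_accuracy = scores.mean()          # average held-out accuracy
```

The same loop is where a grid search over hyperparameters (as in the lab's .rds grid-search files) would sit: refit and cross-validate once per candidate configuration.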

Third day
27 March 2026 - Morning session
Theory: How to validate a ML algorithm: global interpretation
Reference text (1):
  • Soren Jordan, Hannah L. Paul, Andrew Q. Philips, How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, 2022, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
Lab class: How to compute global interpretation of a ML algorithm: a) scripts for Lab 3m (R script; Google Colab Notebook); b) .rds files for the global interpretation part (to open this file, please use the data compression tool WinRAR)
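One standard global-interpretation tool is permutation feature importance: shuffle one feature at a time and measure how much the model's accuracy drops. A sketch with synthetic data (the lab's own interpretation functions may differ):

```python
# Permutation importance: a feature matters globally if shuffling
# it degrades the fitted model's predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)         # by construction, only feature 0 matters

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
importances = result.importances_mean  # one mean importance score per feature
```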

27 March 2026 - Afternoon session
Theory: Static Word Embedding techniques
Reference texts (1):
  • Rodriguez Pedro L. and Spirling Arthur (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115​

Lab class: How to compute static WE. (a) scripts for Lab 3A - R script; Google Colab Notebook); b) movie reviews dataset (.csv; .rds); c) movie review test-set (.rds); d) pre-trained WE on Google news; e) pre-trained WE on Facebook posts; EXTRA 1: Naïve Bayes classifier with continuous features; EXTRA 2: Installing TRANSFORMERS on Google Colab
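The defining property of static embeddings is that each word gets one fixed vector regardless of context, and a document can be represented by averaging the vectors of its words. A self-contained sketch with tiny hand-made 3-dimensional vectors standing in for pre-trained embeddings like the Google News ones used in the lab:

```python
# Averaging static word vectors into document vectors, then
# comparing documents by cosine similarity.
import numpy as np

# Toy "pre-trained" embeddings (invented; real ones have 100-300 dims).
embeddings = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.0]),
    "bad":   np.array([-0.9, 0.1, 0.0]),
    "movie": np.array([0.0, 0.0, 1.0]),
}

def doc_vector(text):
    """Average the embeddings of known tokens (zero vector if none known)."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_pos = doc_vector("good great movie")
v_neg = doc_vector("bad movie")
# "good movie" lands nearer the positive document than the negative one.
sim_pos = cosine(doc_vector("good movie"), v_pos)
sim_neg = cosine(doc_vector("good movie"), v_neg)
```

These averaged vectors can then feed any of the earlier classifiers in place of bag-of-words counts, which is the bridge the lab makes to classification with continuous features.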

Fourth day
28 March 2026 - Morning session
Theory: An introduction to Large Language Models (LLMs) with special attention to BERT
Reference texts (1; 2):
  • Kjell, O., Giorgi, S., & Schwartz, H. A. (2023, May 1). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000542
  • Park, Ju Yeon, and Jacob M. Montgomery. "Toward a framework for creating trustworthy measures with supervised machine learning for text." Political Science Research and Methods (2025): 1-17

Lab class: How to compute dynamic WE. (a) Scripts for Lab 4M - R script; Google Colab Notebook; b) functions to compute dynamic WE; c) .rds file to use in the lab  (to open this file, please use the data compression tool WinRAR); d) dataset for the lab (.csv; .rds)

28 March 2026 - Afternoon session
Theory: Encoders (and beyond)
Reference texts (1, 2, 3):
  • Laurer et al. (2024). Building Efficient Universal Classifiers with Natural Language Inference, arXiv
  • Burnham, M. (2024). Stance Detection: A Practical Guide to Classifying Political Beliefs in Text. Political Science Research and Methods, 1-18, doi:10.1017/psrm.2024.35
  • Ornstein, Joseph T., Elise N. Blasingame, and Jake S. Truscott (2023). How to Train Your Stochastic Parrot: Large Language Models for Political Texts. Political Science Research and Methods, 1-18

Lab class: How to fine-tune an encoder and how to estimate NLI and zero-shot models. Scripts for Lab 4A (a) part I: how to fine-tune an encoder - R script; Google Colab Notebook; part II: how to estimate NLI, zero-shot and sentiment models - R script; Google Colab Notebook; part III: how to fine-tune a NLI, zero-shot and sentiment model - R script; Google Colab Notebook; part IV: applying NLI to a real dataset - R script; Google Colab Notebook; b) .rds to use in the lab  (to open this file, please use the data compression tool WinRAR); c) folder with a fine-tuned BERT (this is a big folder!); d) folder with a fine-tuned NLI (this is a big folder!); e) folder with a fine-tuned Zero Shot Model (this is a big folder!); f) folder with a fine-tuned Sentiment Model (this is a big folder!); g) folder with the fine-tuned NLI with a real dataset (this is a big folder!) 

IMPORTANT! A list of exercises to assess your understanding of the material discussed throughout the course is available HERE