Luigi Curini
  • Home
  • CV & Interests
  • Courses
    • 2025/26 >
      • Big Data Analytics
      • Game Theory for Social Scientists
      • Scienza Politica
      • Applied Scaling & Classification Techniques in Political Science
  • Publications
    • Scientific Publications
    • Articles in the Press / Op-Eds
    • Interviews
  • Personal
  • News


Big Data Analytics (first term 2025/26)

Overview of the course

IMPORTANT INFO
To register your final mark for this course, please enroll in the Big Data Analytics exam of 18 December 2025.

Course aims and objectives
Students will learn how to apply widely used methods from the text-as-data literature to analyze political texts and to extract from them information useful for testing their own theories.

First Lecture
18/09/25, 10:30-12:30 Theory: An introduction to text analytics (part 1): the Bag-of-Words approach 
Reference texts (1; 2):
  • Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
  • Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26

19/09/25, 10:30-12:30 Lab class: An introduction to the quanteda package. a) packages to install for Lab 1; b) script for Lab 1 - R script; Google Colab notebook; datasets: a) Boston tweets sample (.csv; .rds); b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR); EXTRA: a) How to open files stored in your Google Drive on Google Colab; b) dataset of song lyrics

Second Lecture
25/09/25, 10:30-12:30 Theory: Unsupervised classification methods: the Topic Model (and beyond)
Reference texts (1; 2):
  • Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
  • Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4): 1064-1082

26/09/25, 10:30-12:30 Lab class: How to implement a Topic Model and a Structural Topic Model. a) packages to install for Lab 2; b) script for Lab 2 - R script; Google Colab notebook; dataset: NyT economic articles (.rds); c) data for topical content analysis; d) dataset of the searchK exploration; e) LDA results; f) STM results. EXTRA:  computing FREX statistics for a topic model
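The lab estimates topic models with R's stm/topicmodels packages; as a sketch of the same unsupervised idea, here is an LDA fit in Python with scikit-learn (a toy corpus stands in for the NYT articles used in the lab):

```python
# Unsupervised topic model: documents are mixtures of latent topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks markets economy inflation rates",
    "economy inflation growth markets",
    "elections votes campaign candidates",
    "campaign candidates votes turnout",
]
dfm = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
theta = lda.fit_transform(dfm)   # document-topic proportions, one row per doc

# Each row of theta sums to 1: a mixture over the 2 topics.
print(theta.shape)
```

Choosing `n_components` is the analogue of the searchK exploration done in the lab.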

First assignment (due: 2 October 2025) (dataset for the first part: Guardian 2016 (.csv; .rds)) (dataset for the second part: Trump 2018 tweets. To open this file, use the command readRDS("Trump2018.rds"))

Third Lecture
09/10/25, 10:30-12:30 Theory: (part 1):  An introduction to supervised classification models; (part 2): Supervised classification models: Naïve Bayes & Random Forest
Reference text (1):
  • Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56

10/10/25, 10:30-12:30 Lab class: How to implement supervised classification models: NB & RF. a) packages to install for Lab 3; b) script for Lab 3 - R script; Google Colab notebook; datasets for Lab 3: a) disaster dataset - training-set (.csv; .rds); b) disaster dataset - test-set (.csv; .rds); c) US airlines training-set (.csv; .rds); d) US airlines test-set (.csv; .rds); EXTRA: Meaning of a compressed sparse matrix
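The lab implements these models in R; the logic — fit on a labeled training set, build the test DFM with the training vocabulary, then predict — can be sketched in Python with scikit-learn (toy texts below are mine, not the disaster/airlines data):

```python
# Supervised classification with Naive Bayes and Random Forest.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

train_texts = ["flood hits the city", "earthquake damage reported",
               "great concert last night", "loved the new movie"]
train_labels = ["disaster", "disaster", "other", "other"]
test_texts = ["earthquake hits the city", "the movie was great"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)   # reuse the TRAINING vocabulary on the test set

nb = MultinomialNB().fit(X_train, train_labels)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, train_labels)

print(nb.predict(X_test))   # ['disaster' 'other']
print(rf.predict(X_test))
```

Both classifiers consume the same sparse document-feature matrix, which is why the lab's EXTRA on compressed sparse matrices matters here.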

Second assignment (due: 16 October 2025) (datasets for the assignment: a) UK training set (.csv; .rds); b) UK test set (.csv; .rds))

Fourth Lecture
16/10/25, 10:30-12:30 Theory:  Supervised classification models: Support Vector Machine & Regularized Regression

17/10/25, 10:30-12:30 Lab class: How to implement supervised classification models: SVM & RR. a) packages to install for Lab 4; b) script for Lab 4 - R script; Google Colab notebook; EXTRA: How to implement BERT in Google Colab
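As with the previous lab, the course material is in R; a minimal Python sketch of the two estimators (linear SVM and an L2-penalized logistic regression as the regularized-regression example, toy data of my own):

```python
# SVM and regularized regression on a tiny TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

texts = ["cut taxes now", "lower taxes for families",
         "expand public healthcare", "invest in public schools"]
labels = ["right", "right", "left", "left"]

X = TfidfVectorizer().fit_transform(texts)

svm = LinearSVC().fit(X, labels)
# L2-penalized logistic regression; smaller C means a stronger penalty.
ridge_like = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)

print(svm.predict(X))
```

The penalty term is what distinguishes regularized regression from plain logistic regression: it shrinks coefficients to avoid overfitting the high-dimensional DFM.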

Third assignment (due: 23 October 2025) 

Fifth Lecture
23/10/25, 10:30-12:30 Theory: Neural Network Models
24/10/25, 10:30-12:30 Lab class: How to implement a NN model. Script for Lab 5 (Neural Network Models - R script; Google Colab Notebook)

Fourth assignment (due: 30 October 2025) 

Sixth Lecture
30/10/25, 10:30-12:30 Theory: (part 1): How to validate a ML algorithm (first part); (part 2): The importance of a good training-set

Reference texts (1; 2; 3; 4; 5; 6):
  • Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
  • Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
  • Arnold, Christian, Luka Biedebach, Andreas Küpfer, and Marcel Neunhoeffer (2024). The Role of Hyperparameters in Machine Learning, Political Science Research and Methods
  • Park, Ju Yeon, and Jacob M. Montgomery (2025). Toward a Framework for Creating Trustworthy Measures with Supervised Machine Learning for Text. Political Science Research and Methods, 1-17
  • Curini, Luigi, and Robert Fahey (2020). Sentiment Analysis and Social Media. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 29
  • Barberá, Pablo et al. (2020) Automated Text Classification of News Articles: A Practical Guide, Political Analysis, DOI: 10.1017/pan.2020.8

31/10/25, 10:30-12:30 Lab class: How to compute external validity of a ML algorithm. Scripts for Lab 6 (part I: external validity - R script; Google Colab Notebook; part II: inter-coder reliability - R script; Google Colab Notebook;  b) functions to compute cross-validation and global interpretation for class-labels; c) .rds file for the grid-search for the Random Forest (multi-class case); d) .rds file for the grid-search for the RR (multi-class case); e) .rds file for the grid-search of the NN algorithm (binary case; multi-class case)
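The cross-validation logic used in the lab (in R) is the same everywhere: repeatedly hold out one fold, fit on the rest, and score on the held-out fold. A sketch in Python with scikit-learn on synthetic data (dataset and model are illustrative only):

```python
# 5-fold cross-validation: average held-out accuracy estimates out-of-sample performance.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean())   # mean accuracy across the 5 held-out folds
```

Grid search over hyperparameters (as in the lab's grid-search .rds files) simply repeats this procedure for each candidate configuration and keeps the best cross-validated score.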

Fifth assignment (due: 6 November 2025) 

Seventh Lecture
6/11/25, 10:30-12:30 Theory: How to validate a ML algorithm (second part)
Reference text (1):
  • Jordan, Soren, Hannah L. Paul, and Andrew Q. Philips (2022). How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378

7/11/25, 10:30-12:30 Lab class: How to compute a local and a global interpretation of a ML algorithm. Scripts for Lab 7 (a) part I: global interpretation - R script; Google Colab Notebook; part II: local interpretation - R script; Google Colab Notebook); b) packages to install for Lab 7; c) .rds files for the global interpretation part (to open the file, please use the data compression tool WinRAR); d) functions to fit LIME; e) EXTRA: how to fit IML and LIME with tabular data - R script; f) dataset for the EXTRA part: first dataset; second dataset
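The lab uses R tools (iml, LIME); one common global-interpretation technique it covers, permutation feature importance, can be sketched in Python (synthetic data; this is an illustration, not the lab's pipeline):

```python
# Global interpretation: shuffle one feature at a time and measure how much
# the model's accuracy drops — larger drops mean more important features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # mean accuracy drop per feature
```

Local methods such as LIME instead explain a single prediction by fitting a simple surrogate model around that one observation.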

Sixth assignment (due: 13 November 2025)


Eighth Lecture
13/11/25, 10:30-12:30 Theory: Static Word Embedding techniques 
Reference text (1):
  • Rodriguez Pedro L. and Spirling Arthur (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115​

14/11/25, 10:30-12:30 Lab class: How to compute static WE. Script for Lab 8 (a) part I: computing static WE - R script; Google Colab Notebook; part II: LIME and WE - R script; Google Colab Notebook); b) movie reviews dataset (.csv; .rds); c) social-disaster dataset (.csv; .rds); d) file with the results of the weighted average embeddings exercise for the movie dataset; e) pre-trained WE on Google news; f) pre-trained WE on Facebook posts
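The averaged-embeddings exercise in this lab boils down to one operation: represent a document as the (possibly weighted) mean of its word vectors. A numpy sketch with random toy vectors standing in for trained word2vec/GloVe embeddings (everything below is illustrative):

```python
# Document embedding = mean of the embeddings of its words.
import numpy as np

rng = np.random.default_rng(0)
# Toy 50-dimensional "embeddings"; a real pipeline would load pre-trained vectors.
embeddings = {w: rng.normal(size=50)
              for w in ["the", "movie", "was", "great", "terrible"]}

def doc_embedding(tokens, embeddings):
    """Average the vectors of in-vocabulary tokens into one dense document vector."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

doc_vec = doc_embedding("the movie was great".split(), embeddings)
print(doc_vec.shape)   # one dense vector per document
```

This dense representation replaces the sparse DFM as input to the classifiers from earlier labs.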

Seventh Assignment (due: 20 November 2025) (dataset for the assignment: .csv; .rds)


Ninth Lecture
20/11/25, 10:30-12:30  Theory: An introduction to Large Language Models (LLMs) with special attention to BERT
Reference texts (1; 2):
  • Kjell, O., Giorgi, S., & Schwartz, H. A. (2023, May 1). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000542
  • Park, Ju Yeon, and Jacob M. Montgomery (2025). Toward a Framework for Creating Trustworthy Measures with Supervised Machine Learning for Text. Political Science Research and Methods, 1-17

21/11/25, 10:30-12:30 Lab class: How to compute dynamic WE. Scripts for Lab 9 (a) part I: computing dynamic WE - R script; Google Colab Notebook; part II: entering the black box - R script; Google Colab Notebook; b) packages to install for Lab 9; c) functions to compute dynamic WE; d) Lab slides: The problems with using LIME on BERT; e) .rds file to use in the lab; f) dataset for the lab (.csv; .rds)

Eighth Assignment (due: 27 November 2025) (dataset for the assignment - training-set: .csv; .rds; test-set: .csv; .rds)

Tenth Lecture
27/11/25, 10:30-12:30 Theory: Encoders (and beyond)
Reference texts (1; 2; 3):
  • Laurer, Moritz, et al. (2024). Building Efficient Universal Classifiers with Natural Language Inference, arXiv
  • Burnham, Michael (2024). Stance Detection: A Practical Guide to Classifying Political Beliefs in Text. Political Science Research and Methods, 1-18. doi:10.1017/psrm.2024.35
  • Ornstein, Joseph T., Elise N. Blasingame, and Jake S. Truscott (2023). How to Train Your Stochastic Parrot: Large Language Models for Political Texts. Political Science Research and Methods, 1-18
 
28/11/25, 10:30-12:30  Lab class: How to fine-tune an encoder and how to estimate NLI and zero-shot models. Scripts for Lab 10 (a) part I: how to fine-tune an encoder - R script; Google Colab Notebook; part II: how to estimate NLI, zero-shot and sentiment models - R script; Google Colab Notebook; part III: fine-tuning a NLI, zero-shot and sentiment model - R script; Google Colab Notebook; part IV: applying NLI to a real dataset - R script; Google Colab Notebook; b) .rds to use in the lab; c) folder with a fine-tuned BERT (this is a big folder!); d) folder with a fine-tuned NLI (this is a big folder!)

Ninth Assignment (due: 4 December 2025) (training-set; validation-set; fine-tuning dataset)