LUIGI Curini


Classification algorithms in text analytics (second term 2024/25)

Syllabus

IMPORTANT INFO
This course will be conducted online using Teams.

Course aims and objectives
In this four-day course, students will learn how to apply widely discussed methods from the literature to analyze a corpus of texts and to extract from it information useful for testing their own theories.

  • Packages to install in R for the first week
  • Packages to install in R for the second week

First day
28 March 2025 - Morning session
Theory: 
An introduction to text analytics (part 1): the Bag-of-Words approach 

Reference texts (1; 2):
  • Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
  • Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26

Lab class: An introduction to the Quanteda package. Scripts for Lab 1 (R script; Google Colab Notebook); datasets for Lab 1: a) Boston tweets sample (.csv; .rds); b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR); EXTRA: a) How to open files stored in your Google Drive

28 March 2025 - Afternoon session
Theory: (part 1) Supervised classification methods: automatic tagging; (part 2) An introduction to supervised classification models
Reference texts (1):
  • Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
Lab class: How to implement dictionary models. Scripts for Lab 2 (R script; Google Colab Notebook); datasets for Lab 2: a) Laver and Garry policy dictionary; b) sample of tweets discussing Donald Trump (.rds file); EXTRA: a) converting an external dictionary to the Quanteda format; b) split-half reliability test

Second day
29 March 2025 - Morning session
Theory: Supervised classification models - Part 1
Reference text (1):
  • Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56

Lab class: How to implement a supervised classification model. Scripts for Lab 3 (R script; Google Colab Notebook); datasets for Lab 3: a) disaster dataset - training-set (.csv; .rds); b) disaster dataset - test-set (.csv; .rds); c) US airlines training-set (.csv; .rds); d) US airlines test-set (.csv; .rds); EXTRA: Meaning of a compressed sparse matrix

29 March 2025 - Afternoon session
Theory: (Part 1) Supervised classification models - Part 2; (Part 2) The importance of a good training-set
Reference text (1):
  • Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56

Lab class: How to implement a supervised classification model and run an inter-coder reliability test. Scripts for Lab 4 (part I: supervised classification models - R script; Google Colab Notebook; part II: inter-coder reliability test - R script; Google Colab Notebook); EXTRA: a Google Colab notebook example on the Keras package

Third day
4 April 2025 - Morning session
Theory: Neural Network Models

Lab class: How to implement a NN model. Scripts for Lab 5 (Neural Network Models - R script; Google Colab Notebook)

4 April 2025 - Afternoon session
Theory: How to validate a ML algorithm
Reference texts (1; 2; 3; 4):
  • Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
  • Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
  • Jordan, Soren, Paul, Hannah L., and Philips, Andrew Q. (2022). How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
  • Arnold, Christian, Biedebach, Luka, Küpfer, Andreas, and Neunhoeffer, Marcel (2023). The Role of Hyperparameters in Machine Learning, Political Science Research and Methods, 2024

Lab class: How to compute internal and external validity of a ML algorithm. Scripts and files for Lab 6: a) scripts (part I: external validity - R script; part II: global interpretation - R script; Google Colab Notebook for both external validity and global interpretation); b) functions to compute cross-validation and global interpretation; c) .rds file for the grid-search for the Random Forest (multi-labels case); d) .rds file for the grid-search of the NN algorithm (binary case; multi-labels case); e) .rds files for the global interpretation exercise (to open the file, please use the data compression tool WinRAR); f) Google Colab notebook about the Text package

Fourth day
5 April 2025 - Morning session
Theory: An introduction to text analytics (part 2): the Word Embedding approach
Reference text (1):
  • Rodriguez, Pedro L., and Spirling, Arthur (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115

Lab class: How to implement GloVe and word2vec. Scripts for Lab 7 (R script; Google Colab Notebook). Dataset & files for the lab: a) movie reviews dataset (.csv; .rds); b) .rds file with the results of the RF model via permutation; c) pre-trained word embeddings (WE) on Google news; d) pre-trained WE on Facebook posts

5 April 2025 - Afternoon session
Theory: An introduction to Large Language Models (LLMs), with special attention to BERT
Reference text (1):
  • Kjell, O., Giorgi, S., & Schwartz, H. A. (2023, May 1). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000542

Lab class: How to implement BERT. Scripts for Lab 8 (R script; Google Colab Notebook). Dataset & files for the lab: a) movie reviews dataset (.csv; .rds); b) pre-trained WE on Google news; c) BERT results via text package (.rds file 1; .rds file 2); d) social disaster dataset (.csv; .rds)