Classification algorithms in text analytics (second term 2024/25)
Syllabus
IMPORTANT INFO
This course will be conducted online using Teams.
Course aims and objectives
In this four-day course, students will learn how to apply widely discussed methods from the literature to analyze a corpus of texts and extract information useful for testing their own theories.
First day
28 March 2025 - Morning session
Theory: An introduction to text analytics (part 1): the Bag-of-Words approach
Reference texts (1, 2):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
Lab class: An introduction to the Quanteda package. Script for Lab 1 - R script; Google Colab notebook; datasets: a) Boston tweets sample (.csv; .rds); b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR); EXTRA: a) How to open files stored in your Google Drive on Google Colab
28 March 2025 - Afternoon session
Theory: (part 1): Supervised classification methods: automatic tagging; (part 2): An introduction to supervised classification models
Reference text (1):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
Second day
29 March 2025 - Morning session
Theory: Supervised classification models - Part 1
Reference text (1):
- Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to implement a supervised classification model. Scripts for Lab 3 (R script; Google Colab Notebook); datasets for Lab 3: a) disaster dataset - training-set (.csv; .rds); b) disaster dataset - test-set (.csv; .rds); c) US airlines training-set (.csv; .rds); d) US airlines test-set (.csv; .rds); EXTRA: Meaning of a compressed sparse matrix
29 March 2025 - Afternoon session
Theory: (Part 1): Supervised classification models - Part 2; (Part 2): The importance of a good training-set
Reference text (1):
- Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to implement a supervised classification model and run an inter-coder reliability test. Scripts for Lab 4 (part I: supervised classification models - R script; Google Colab Notebook; part II: inter-coder reliability test - R script; Google Colab Notebook); EXTRA: a Google Colab notebook example on the Keras package
Third day
4 April 2025 - Morning session
Theory: Neural Network Models
Lab class: How to implement a NN model. Scripts for Lab 5 (Neural Network Models - R script; Google Colab Notebook)
4 April 2025 - Afternoon session
Theory: How to validate an ML algorithm
Reference texts (1, 2, 3, 4):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Jordan, Soren, Paul, Hannah L., and Philips, Andrew Q. (2022). How to Cautiously Uncover the “Black Box” of Machine Learning Models for Legislative Scholars, Legislative Studies Quarterly, https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
- Arnold, Christian, Biedebach, Luka, Küpfer, Andreas, and Neunhoeffer, Marcel (2024). The Role of Hyperparameters in Machine Learning Models and How to Tune Them, Political Science Research and Methods
Lab class: How to compute the internal and external validity of an ML algorithm. Files for Lab 6: a) scripts (part I: external validity - R script; part II: global interpretation - R script; Google Colab Notebook for both external validity and global interpretation); b) functions to compute cross-validation and global interpretation; c) .rds file for the grid-search for the Random Forest (multi-labels case); d) .rds file for the grid-search of the NN algorithm (binary case; multi-labels case); e) .rds files for the global interpretation exercise (to open the file, please use the data compression tool WinRAR); f) Google Colab notebook on the text package
Fourth day
5 April 2025 - Morning session
Theory: An introduction to text analytics (part 2): the Word Embedding approach
Reference text (1):
- Rodriguez, Pedro L., and Spirling, Arthur (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115
Lab class: How to implement GloVe and word2vec. Scripts for Lab 7 (R script; Google Colab Notebook). Dataset & files for the lab: a) movie reviews dataset (.csv; .rds); b) .rds file with the results of the RF model via permutation; c) pre-trained WE on Google news; d) pre-trained WE on Facebook posts
5 April 2025 - Afternoon session
Theory: An introduction to Large Language Models (LLMs), with special attention to BERT
Reference text (1):
- Kjell, O., Giorgi, S., & Schwartz, H. A. (2023, May 1). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000542
Lab class: How to implement BERT. Scripts for Lab 8 (R script; Google Colab Notebook). Dataset & files for the lab: a) movie reviews dataset (.csv; .rds); b) pre-trained WE on Google news; c) BERT results via text package (.rds file 1; .rds file 2); d) social disaster dataset (.csv; .rds)