Big Data Analytics (second term 2022/23)
IMPORTANT INFO
This course will be conducted online using Zoom (Join Zoom Meeting: https://us02web.zoom.us/j/5469311951)
Course aims and objectives
In this four-day course, students will learn how to apply widely discussed methods from the literature to analyze political texts and to extract from them information useful for testing their own theories.
First day
17 March 2023 - Morning session
Theory: An introduction to text analytics
Reference texts: (1; 2, 3)
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
- Grossman, Jonathan, and Ami Pedahzur (2020). Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them, Political Science Quarterly, 135(2): 225-257
Lab class: An introduction to the Quanteda package - script: Lab 1 script; datasets: a) Boston tweets sample; b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR); EXTRA: a1) An explanation of the chi-squared; b1) An explanation of cosine similarity
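As a rough illustration of the cosine similarity mentioned in the EXTRA material, the sketch below compares two short texts by their word counts. It is written in plain Python rather than with the Quanteda package used in the lab; the function name and the example sentences are assumptions made for illustration and are not taken from the course scripts.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    # Naive tokenization: lowercase and split on whitespace.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction, so define its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences: two of three words are shared.
print(round(cosine_similarity("tax cuts now", "tax reform now"), 3))  # 0.667
```

Values close to 1 indicate that two texts use words in similar proportions; 0 means they share no words.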
17 March 2023 - Afternoon session
Theory: From words to issues: unsupervised classification models - the Topic Model
Reference text (1):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082
Lab class: How to implement a Topic Model - script: Lab 2 script; dataset: Guardian 2016
Second day
18 March 2023 - Morning session
Theory: From words to issues: unsupervised classification models - the Structural Topic Model
Reference texts (1):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley. 2014. STM: R Package for Structural Topic Models. Journal of Statistical Software
Lab class: How to implement a Structural Topic Model - scripts: (part I: STM; part II: Twitter data); datasets: a) NyT; b) data for topical content analysis
18 March 2023 - Afternoon session
Theory: From words to issues: semi-supervised classification models
Reference texts (1, 2):
- Kohei Watanabe and Yuan Zhou (2020) Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review, DOI: 10.1177/0894439320907027
- Shusei Eshima, Kosuke Imai, and Tomoya Sasaki (2023). Keyword Assisted Topic Models, American Journal of Political Science, DOI: 10.1111/ajps.12779
Lab class: How to implement a semi-supervised classification model - scripts: a) Lab 4 script; b) Twitter newspapers example; dataset: Twitter newspapers data; EXTRA: a1) computing coherence and exclusivity with keyATM
Third day
24 March 2023 - Morning session
Theory: Dictionary models
Reference texts (1):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
Lab class: How to implement dictionary models - scripts: Lab 1B script (part I); Lab 1B script (part II: dictionary applied to Twitter); EXTRA: a1) converting an external dictionary to Quanteda; b1) split-half reliability test
24 March 2023 - Afternoon session
Theory: From words to issues: supervised classification models (part I)
Reference text (1):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to implement supervised classification models (script: a) Lab 2B script; datasets: 1) disaster training-set; 2) disaster test-set; EXTRA: slides about the meaning of a compressed sparse matrix; an alternative approach to running a machine-learning (ML) model)
Fourth day
25 March 2023 - Morning session
Theory: From words to issues: supervised classification models (part II)
Reference text (1):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to implement supervised classification models (script: a) Lab 3B script)
25 March 2023 - Afternoon session
Theory: Cross validation & the importance of the training set
Reference texts (1; 2, 3):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
Lab class: How to apply k-fold cross validation (scripts: a) Lab 4B script (part I); b) Lab 4B script (part II); datasets: a) first training-set for the lab; b) second training-set for the lab; EXTRA: script for multi-class variable; script for computing inter-coder reliability)
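For readers new to k-fold cross validation, the splitting step can be sketched as follows. This is a minimal plain-Python illustration, not the Lab 4B script: the function name `k_fold_splits`, the `seed` parameter, and the toy sizes are all assumptions made for the example.

```python
import random


def k_fold_splits(n_items: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")

    # Shuffle once so that folds do not depend on the original ordering.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]

    # Each fold serves as the test set exactly once;
    # the remaining k - 1 folds form the training set.
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage: 10 labelled documents, 5 folds.
for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))
```

In practice a classifier would be fitted on each training set and scored on the matching test set, and the k scores averaged to estimate out-of-sample performance.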
FINAL Assignment