Big Data Analytics (second term 2020/21)
Overview of the course
IMPORTANT INFO
This course will be conducted online using Zoom (Join Zoom Meeting: https://zoom.us/j/2335812074?pwd=aitDcWNFd1g3QVR1NHIxMit1bWpzUT09)
Course aims and objectives
In this four-day course, students will learn how to apply widely discussed methods from the literature to analyze political texts and to extract from them information useful for testing their own theories.
First Lecture
10 March 2021 - Morning session
Theory class: An introduction to text analytics
Reference texts (1; 2; 3):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
- Grossman, Jonathan, and Pedahzur, Ami (2020). Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them, Political Science Quarterly, 135(2): 225-257
Lab class: An introduction to the quanteda package (scripts: a) packages to install; b) Lab 1A script; datasets: a) Boston tweets sample; b) inaugural US presidential speeches sample; to open this file, please use the data compression tool WinRAR)
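As a rough illustration of the document-feature matrix that bag-of-words text analysis starts from, here is a minimal sketch in plain Python. It is not the Lab 1A script and does not use the quanteda API; the function names `tokenize` and `build_dfm` and the two example sentences are assumptions made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_dfm(docs):
    """Return (sorted feature list, one row of word counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    features = sorted(set().union(*counts))
    # Counter returns 0 for words a document does not contain.
    matrix = [[c[f] for f in features] for c in counts]
    return features, matrix


# Hypothetical two-document corpus.
docs = ["We will cut taxes.", "Taxes fund public schools; schools matter."]
features, dfm = build_dfm(docs)
for doc, row in zip(docs, dfm):
    print(row, doc)
```

Each row is one document and each column one word, so word order is discarded and only counts remain.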
10 March 2021 - Afternoon session
Theory class: From words to positions: supervised & unsupervised scaling models
Reference texts (1; 2; 3; 4):
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2008. A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3): 705-722.
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2009. How to Avoid Pitfalls in Statistical Analysis of Political Texts: The Case of Germany. German Politics, 18(3): 323-344.
- Laver, Michael, Kenneth Benoit, and John Garry. 2003. Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2): 311-331
- Egerod, Benjamin C.K., and Robert Klemmensen (2020). Scaling Political Positions from Text: Assumptions, Methods and Pitfalls. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 27
Lab class: How to implement the Wordscores and Wordfish algorithms (scripts: a) Lab 1B script; b) Lab 1C script; datasets: a) sample of Japanese legislative speeches (to open these files, please use the data compression tool WinRAR); b) UK party programs 1992 and 1997 (to open this file, please use the data compression tool WinRAR))
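The core of the Wordscores idea (Laver, Benoit and Garry 2003) can be sketched in a few lines of plain Python. This is a simplified illustration, not the Lab 1B script: it computes raw word scores and a raw text score only, and omits the rescaling and uncertainty estimates discussed in the paper. The reference texts, their positions (-1 and +1), and the function names are hypothetical.

```python
from collections import Counter


def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    total = len(tokens)
    return {w: n / total for w, n in Counter(tokens).items()}


def wordscores(reference_texts, reference_positions):
    """Score each word as the expected position of the reference text it came from."""
    freqs = [relative_freqs(t) for t in reference_texts]
    vocab = set().union(*freqs)
    scores = {}
    for w in vocab:
        f = [fr.get(w, 0.0) for fr in freqs]
        total = sum(f)
        # f_r / total is the probability of reading reference text r given word w.
        scores[w] = sum(fi / total * pos for fi, pos in zip(f, reference_positions))
    return scores


def score_text(tokens, scores):
    """Mean word score of a new text, using only words seen in the reference texts."""
    scored = [w for w in tokens if w in scores]
    if not scored:
        return None
    return sum(scores[w] for w in scored) / len(scored)


# Hypothetical reference texts with assumed positions -1 (left) and +1 (right).
left = "welfare welfare tax jobs".split()
right = "tax tax market jobs".split()
scores = wordscores([left, right], [-1.0, 1.0])
new_text_score = score_text("welfare jobs market market".split(), scores)
print(scores, new_text_score)
```

Words used only by one reference text inherit that text's position, while words used equally by both score at the midpoint.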
Second Lecture
11 March 2021 - Morning session
Theory class: From words to issues: unsupervised classification models
Reference texts (1; 2):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4): 1064-1082
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley. 2014. STM: R Package for Structural Topic Models. Journal of Statistical Software
11 March 2021 - Afternoon session
Theory class (Part 1): From words to issues: dictionary approaches & semi-supervised classification models
Reference texts (1, 2, 3, 4):
- Kohei Watanabe and Yuan Zhou (2020) Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review, DOI: 10.1177/0894439320907027
- Shusei Eshima, Kosuke Imai, and Tomoya Sasaki (2020). Keyword Assisted Topic Models, arXiv:2004.05964v1
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
Lab class: How to implement a dictionary and a semi-supervised classification model (scripts: a) Lab 2B script; b) Lab 2C script)
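A dictionary approach boils down to counting how often the keywords of each category occur in a text. The sketch below shows that step in plain Python; it is not the Lab 2B script, and the categories, keywords, and the function name `apply_dictionary` are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical dictionary: category -> set of keywords.
DICTIONARY = {
    "economy": {"tax", "taxes", "jobs", "budget"},
    "immigration": {"border", "migrants", "asylum"},
}


def apply_dictionary(text, dictionary):
    """Return, for each category, the share of tokens matching its keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for tok in tokens:
        for category, words in dictionary.items():
            if tok in words:
                hits[category] += 1
    n = len(tokens) or 1  # avoid dividing by zero on empty input
    return {category: hits[category] / n for category in dictionary}


shares = apply_dictionary(
    "Lower taxes create jobs, and a secure border matters.", DICTIONARY
)
print(shares)
```

Dividing by the number of tokens makes texts of different lengths comparable; exact-match lookup like this misses inflected forms unless they are listed or the tokens are stemmed first.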
Third Lecture
18 March 2021 - Morning session
Theory class: From words to issues: supervised classification models
Reference text (1):
- Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to get access to Twitter data (script: Lab 3A script)
18 March 2021 - Afternoon session
Theory class: From words to issues: supervised classification models
Reference text (1):
- Olivella, Santiago, and Shoub, Kelsey (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
Lab class: How to implement supervised classification models (scripts: a) packages to install; b) Lab 3B script; datasets: a) social disaster training-set; b) social disaster test-set; c) Nationality)
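To make the training-set and test-set logic of supervised classification concrete, here is a minimal multinomial Naive Bayes classifier with add-one smoothing in plain Python. Naive Bayes is chosen here only as a simple example of a supervised model; the syllabus does not say which algorithms Lab 3B covers. The toy training texts, labels, and function names are hypothetical and are not the course datasets.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the class and per-class word counts needed for Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(tokens, model):
    """Return the class with the highest log posterior for a tokenized text."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_label in class_counts.items():
        # Add-one (Laplace) smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        logp = math.log(n_label / n_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical hand-labelled training set.
train_docs = [
    "flood destroyed homes".split(),
    "earthquake damaged the bridge".split(),
    "great concert last night".split(),
    "lovely dinner with friends".split(),
]
train_labels = ["disaster", "disaster", "other", "other"]
model = train_nb(train_docs, train_labels)
print(predict_nb("flood damaged homes".split(), model))
```

The model is fitted on labelled texts only and is then applied to texts it has not seen, which is why the quality of the training set matters so much.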
Fourth Lecture
19 March 2021 - Morning session
Theory class: How to validate the results you get from machine learning algorithms & The importance of the training set
Reference texts (1, 2, 3, 4):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
Lab class: How to apply k-fold cross-validation & How to run an inter-coder reliability analysis (scripts: a) package to install; b) Lab 4A; c) Lab 4B; d) Lab 4C; e) Lab 4D; datasets: a) movie reviews training-set; b) movie reviews test-set; c) UK tweets)
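The two ideas named in this lab can each be sketched briefly in plain Python: splitting data into k folds so that every observation is used for testing exactly once, and measuring agreement between two coders with Cohen's kappa. Kappa is used here as one common reliability measure; the syllabus does not state which measure the lab scripts use. The function names and the toy codings are assumptions for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n items."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def cohen_kappa(coder_a, coder_b):
    """Agreement between two coders, corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1,
    that is, when both coders use a single identical label throughout.
    """
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical codings of six texts by two human coders.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(coder_a, coder_b)

splits = list(k_fold_indices(n=10, k=5))
print(kappa, splits[0])
```

In cross-validation the model is refitted on each training split and evaluated on the held-out fold, and the k performance scores are then averaged.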
19 March 2021 - Afternoon session
Theory class: An introduction to word embeddings
Reference texts (1, 2):
- Rodriguez, Pedro L., and Spirling, Arthur (2021). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research. Journal of Politics, forthcoming
- Rudkowsky, Elena, et al. (2018). More than Bags of Words: Sentiment Analysis with Word Embeddings. Communication Methods and Measures, 12(2-3): 140-157
Lab class: How to implement a word-embedding procedure (script: Lab 4E; dataset: pre-trained word embeddings)
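Once words are represented as vectors, similarity between words is usually measured with the cosine of the angle between their vectors. The sketch below shows that computation in plain Python. The three-dimensional vectors are made up for illustration and are not taken from the pre-trained embeddings used in Lab 4E; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


# Toy embeddings (hypothetical values).
embeddings = {
    "tax": [0.9, 0.1, 0.0],
    "budget": [0.8, 0.2, 0.1],
    "asylum": [0.0, 0.1, 0.9],
}
print(nearest("tax", embeddings))
```

Cosine similarity ignores vector length and compares direction only, so frequent and rare words can be compared on the same footing.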
Course Assignment (due: 23 April 2021) (Texts for the first part of the Assignment)