Big Data Analytics (first term 2022/23)
Overview of the course
IMPORTANT INFO
To register your final mark for this course, please enroll in the Big Data Analytics exam of 14 December 2022
Course aims and objectives
Students will learn how to employ widely discussed methods from the literature to analyze political texts and to extract from them useful information for testing their own theories.
First Lecture
22/09/22, 10:00-12:00 Theory: An introduction to text analytics
Reference texts (1, 2, 3):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Benoit, Kenneth (2020). Text as data: An overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 26
- Grossman, Jonathan, and Pedahzur Ami (2020). Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them, Political Science Quarterly, 135(2): 225-257
23/09/22, 10:00-12:00 Lab class: An introduction to the Quanteda package (a) packages to install for Lab 1; b) Lab 1 slides; scripts: Lab 1 script; datasets: a) Boston tweets sample; b) Inaugural US Presidential speeches sample (to open this file, please use the data compression tool WinRAR); EXTRA: a1) An explanation of cosine similarity; b1) An explanation of the chi-squared test; c1) how to deal with Japanese and Chinese languages; d1) sample of Japanese legislative speeches (to open these files, please use the data compression tool WinRAR)
First assignment (due: 28 September 2022)
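Lab 1's EXTRA material introduces cosine similarity between documents. As a rough illustration of the measure itself (the labs use R/quanteda; the toy texts below are made up), here is a bag-of-words version in Python:

```python
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Documents sharing most of their vocabulary score close to 1.
print(round(cosine_similarity("the economy is growing",
                              "the economy is shrinking"), 2))  # 0.75
```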
Second Lecture
29/9/22, 10:00-12:00 Theory: From words to positions: supervised scaling models
Reference texts (1, 2, 3, 4):
- Laver, Michael, Kenneth Benoit, John Garry. 2003. Extracting Policy Positions from political texts using words as data. American Political Science Review, 97(02), 311-331
- Egerod, Benjamin C.K., and Robert Klemmensen (2020). Scaling Political Positions from text. Assumptions, Methods and Pitfalls. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 27
- Martin, Lanny W., and Georg Vanberg. 2008. A robust transformation procedure for interpreting political text. Political Analysis, 16: 93-100
- Bräuninger, Thomas, and Nathalie Giger (2018). Strategic Ambiguity of Party Positions in Multi-Party Competition, Political Science Research and Methods, 6(3), 527-548
30/9/22, 10:00-12:00 Lab class: How to implement the Wordscores algorithm (a) packages to install for Lab 2; b) Lab 2 script (part I: Wordscores; part II: rtweet and REST API); c) dataset for the first part of the Lab; EXTRA: a1) How Wordscores works)
Second assignment (due: 5 October 2022) (dataset for Assignment 2. To open this file, please use the data compression tool WinRAR)
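Lab 2 implements Wordscores in quanteda. To make the algorithm itself concrete, below is a minimal Python sketch of the Laver, Benoit and Garry (2003) procedure, using invented reference texts with known positions -1 and +1:

```python
from collections import Counter

def wordscores(ref_texts, ref_scores, virgin_text):
    """Minimal Wordscores: score every word from reference texts with known
    positions, then score a new ('virgin') text as the frequency-weighted
    mean of its scorable words."""
    freqs = []
    for text in ref_texts:
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        freqs.append({w: k / total for w, k in counts.items()})
    word_score = {}
    for w in set().union(*freqs):
        total = sum(f.get(w, 0.0) for f in freqs)
        # P(reference r | word w), averaged over the reference positions
        word_score[w] = sum(f.get(w, 0.0) / total * s
                            for f, s in zip(freqs, ref_scores))
    counts = Counter(virgin_text.lower().split())
    scored = {w: k for w, k in counts.items() if w in word_score}
    n = sum(scored.values())
    return sum(k / n * word_score[w] for w, k in scored.items())

# Invented reference texts at positions -1 (left) and +1 (right).
print(wordscores(["taxes welfare taxes", "markets freedom markets"],
                 [-1, 1], "taxes markets"))  # 0.0
```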
Third Lecture
10/10/22, 14:00-16:00 Theory: From words to positions: Unsupervised scaling models
Reference texts (1, 2, 3):
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2008. A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3): 705-722.
- Proksch, Sven-Oliver, and Slapin, Jonathan B. 2009. How to Avoid Pitfalls in Statistical Analysis of Political Texts: The Case of Germany. German Politics, 18(3): 323-344.
- Curini, Luigi, Hino, Airo, and Atsushi Osaki. 2020. Intensity of government–opposition divide as measured through legislative speeches and what we can learn from it. Analyses of Japanese parliamentary debates, 1953–2013, Government and Opposition, 55(2), 184-201
11/10/22, 14:00-16:00 Lab class: How to implement the Wordfish algorithm (scripts: a) packages to install; b) Lab 3 slides; c) Lab 3 scripts (part I: Wordfish; part II: rtweet streaming API; part III: rtweet and geodata); d) dataset; EXTRA: a1) estimating bootstrap confidence intervals in Wordfish; b1) slides about Wordshoal; c1) estimating Wordshoal; d1) convert emoji to text)
Third assignment (due: 17 October 2022)
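Lab 3 estimates Wordfish in R. The model behind it treats each word count as a Poisson draw whose rate combines a document fixed effect, a word fixed effect, and the product of a word weight and the document's latent position. As an illustration (not the estimation routine, which alternates maximization over document and word parameters), a Python sketch of the log-likelihood being maximized, with placeholder parameter values:

```python
import math

def wordfish_loglik(counts, alpha, psi, beta, theta):
    """Log-likelihood of the Wordfish model:
    counts[i][j] ~ Poisson(exp(alpha[i] + psi[j] + beta[j] * theta[i])),
    where theta[i] is document i's latent position."""
    ll = 0.0
    for i, row in enumerate(counts):
        for j, y in enumerate(row):
            lam = math.exp(alpha[i] + psi[j] + beta[j] * theta[i])
            ll += y * math.log(lam) - lam - math.lgamma(y + 1)
    return ll

# Two toy documents, two words, placeholder parameter values.
ll = wordfish_loglik([[10, 2], [1, 8]],
                     alpha=[0.0, 0.0], psi=[1.0, 1.0],
                     beta=[-1.0, 1.0], theta=[-1.0, 1.0])
```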
Fourth Lecture
13/10/22, 10:00-12:00 Theory: From words to issues: unsupervised classification models
Reference text (1):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082
14/10/22, 10:00-12:00 Lab class: How to implement a Topic Model (scripts: a) packages to install; b) Lab 4 script (topic model); EXTRA: a1) estimating a cluster model; dataset: Guardian 2016)
Fourth assignment (due: 21 October 2022) (dataset for Assignment 4)
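Lab 4 fits a topic model in R. To show the mechanics behind LDA-style topic models, below is a toy collapsed Gibbs sampler in pure Python on an invented four-document corpus; it illustrates the estimation idea only and is not production code:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA: resample each token's topic
    given all other assignments, then report doc-topic proportions."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    ndk = [[0] * k for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # topic totals
    z = [[0] * len(d) for d in docs]            # topic assignments
    for di, d in enumerate(docs):               # random initialization
        for wi, w in enumerate(d):
            t = rng.randrange(k)
            z[di][wi] = t
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    V = len(vocab)
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]                   # remove current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][s] + alpha) * (nkw[s][w] + beta)
                           / (nk[s] + V * beta) for s in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[di][wi] = t                   # record the new draw
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [[(c + alpha) / (sum(row) + k * alpha) for c in row] for row in ndk]

docs = [["tax", "economy", "tax"], ["war", "army", "war"],
        ["economy", "tax"], ["army", "war"]]
theta = lda_gibbs(docs, k=2)  # docs 0/2 should load on one topic, 1/3 on the other
```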
Fifth Lecture
20/10/22, 10:00-12:00 Theory: (Part 1): From words to issues: structural topic models; (Part 2): Dictionary models
Reference texts (1, 2):
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley. 2014. STM: R Package for Structural Topic Models. Journal of Statistical Software
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
21/10/22, 10:00-12:00 Lab class: How to implement a Structural Topic Model and dictionary models (scripts: a) packages to install; b) Lab 5 scripts (part I: STM; part II: dictionaries; part III: dictionaries and Twitter); datasets: a) NYT; b) data for topical content analysis; EXTRA: a1) converting an external dictionary to Quanteda; b1) split-half reliability test)
Fifth Assignment (due: 27 October 2022)
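Dictionary models, covered in this lecture, simply measure how often a text matches entries of a predefined lexicon. A minimal Python sketch with a hypothetical two-category toy lexicon (the labs use quanteda dictionaries):

```python
def dictionary_score(text, dictionary):
    """Share of tokens matching each dictionary category.
    `dictionary` maps category -> set of words."""
    tokens = text.lower().split()
    n = len(tokens)
    return {cat: sum(t in words for t in tokens) / n
            for cat, words in dictionary.items()}

# Hypothetical toy sentiment lexicon.
lexicon = {"positive": {"good", "great"}, "negative": {"bad", "awful"}}
print(dictionary_score("great film bad ending", lexicon))
# → {'positive': 0.25, 'negative': 0.25}
```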
Sixth Lecture
27/10/22, 10:00-12:00 Theory: (Part 1): From words to issues: semi-supervised classification models; (Part 2): An introduction to supervised classification models
Reference texts (1, 2, 3, 4):
- Kohei Watanabe and Yuan Zhou (2020) Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review, DOI: 10.1177/0894439320907027
- Shusei Eshima, Kosuke Imai, and Tomoya Sasaki (2020). Keyword Assisted Topic Models, arXiv:2004.05964v1
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
28/10/22, 10:00-12:00 Lab class: How to implement a semi-supervised classification model (scripts: a) packages to install; b) Lab 6 script; EXTRA a1) computing coherence and exclusivity with keyATM)
Sixth Assignment (due: 2 November 2022)
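The semi-supervised models in this lecture (e.g., keyATM) steer topic estimation with researcher-chosen seed keywords. The sketch below is not keyATM; it only illustrates the seeded-keyword intuition, with invented seed lists, by assigning a document to the topic whose seeds it matches most:

```python
def seeded_classify(tokens, seed_words):
    """Assign a document to the topic whose seed keywords it matches most
    often (ties go to the first topic listed)."""
    scores = {topic: sum(t in words for t in tokens)
              for topic, words in seed_words.items()}
    return max(scores, key=scores.get)

# Hypothetical seed lists for two topics.
seeds = {"economy": {"tax", "budget"}, "security": {"war", "army"}}
print(seeded_classify(["tax", "war", "tax"], seeds))  # economy
```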
Seventh Lecture
3/11/22, 10:00-12:00 Theory: From words to issues: supervised classification models (first part)
Reference text (1):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
4/11/22, 10:00-12:00 Lab class: How to implement supervised classification models (scripts: a) packages to install; b) Lab 7 script; datasets: 1) disasters training-set; 2) disasters test-set; EXTRA: slides about the meaning of a compressed sparse matrix)
Seventh Assignment (due: 9 November 2022) (datasets for Assignment 7: a) UK training set; b) UK test set)
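Lab 7 trains supervised classifiers in R on the disasters data. As a self-contained illustration of one standard supervised text classifier (on made-up examples, not the course datasets), a multinomial Naive Bayes with add-one smoothing:

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Multinomial Naive Bayes: per-class word counts plus class priors."""
    word_counts = defaultdict(Counter)   # class -> word frequency table
    class_counts = Counter(labels)
    for text, y in zip(texts, labels):
        word_counts[y].update(text.lower().split())
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, class_counts, vocab

def predict_nb(model, text):
    """Pick the class with the highest smoothed log-posterior."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y, cy in class_counts.items():
        total = sum(word_counts[y].values())
        lp = math.log(cy / n)            # log prior
        for w in text.lower().split():   # add-one smoothed log likelihoods
            lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Invented training examples in the spirit of the disasters data.
texts = ["fire flood damage", "flood rescue teams",
         "party music fun", "music concert tonight"]
labels = ["disaster", "disaster", "other", "other"]
model = train_nb(texts, labels)
print(predict_nb(model, "flood and fire reported"))  # disaster
```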
Eighth Lecture
10/11/22, 10:00-12:00 Theory: (Part 1): From words to issues: supervised classification models (second part); (Part 2): How to validate the results from a ML algorithm
Reference texts (1, 2, 3):
- Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297
- Cranmer, Skyler J. and Desmarais, Bruce A. (2017) What Can We Learn from Predictive Modeling?, Political Analysis, 25: 145-166
- Curini, Luigi, and Robert Fahey. 2020. Sentiment Analysis. In: Luigi Curini and Robert Franzese (eds.), Sage Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 29
11/11/22, 10:00-12:00 Lab class: How to apply k-fold cross validation (scripts: a) package to install; b) Lab 8 script (part A); c) Lab 8 script (part B); d) Lab 8 script (part C); e) Lab 8 script (part D); f) training-set for the lab; g) second training-set for the lab)
Eighth Assignment (due: 16 November 2022)
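Lab 8 applies k-fold cross validation: split the data into k folds, hold out each fold once as a test set, fit on the rest, and average the held-out accuracies. A minimal Python sketch of that recipe, with a majority-class baseline standing in for the course's real classifiers:

```python
from collections import Counter

def k_fold_splits(n, k):
    """Yield (train, test) index lists; each item lands in exactly one test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validate(X, y, k, fit, predict):
    """Mean held-out accuracy over k folds for any fit/predict pair."""
    accs = []
    for train, test in k_fold_splits(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        accs.append(sum(predict(model, X[i]) == y[i] for i in test) / len(test))
    return sum(accs) / len(accs)

# Majority-class baseline standing in for a real classifier.
fit_majority = lambda X, y: Counter(y).most_common(1)[0][0]
predict_majority = lambda model, x: model
```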
Ninth Lecture
17/11/22, 10:00-12:00 Theory: (Part 1): From words to issues: supervised classification models (third part); (Part 2): The importance of the training set
Reference texts (1, 2):
- Olivella, Santiago, and Kelsey Shoub (2020). Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science & International Relations, London, Sage, chapter 56
- Barberá, Pablo et al. (2020). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, DOI: 10.1017/pan.2020
Ninth Assignment (due: 23 November 2022)
Tenth Lecture
24/11/22, 11:00-13:00 Theory: An introduction to word embeddings
Reference texts (1, 2):
- Rodriguez, Pedro L., and Arthur Spirling (2022). Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research, Journal of Politics, 84(1), 101-115
- Rudkowsky Elena, et al. (2018). More than Bags of Words: Sentiment Analysis with Word Embeddings. Communication Methods and Measures. 12:2-3, 140-157
25/11/22, 16:00-18:00 Lab class: How to implement a proportional algorithm and a word-embedding procedure. (scripts: a) packages to install; b) Lab 10 script; dataset for the lab: training-set; test-set; pre-trained WE)
Tenth Assignment (due: 30 November 2022)
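Lab 10 works with pre-trained word embeddings in R. One common way to use them downstream is to represent each document as the average of its word vectors (a "bag of embeddings"). A Python sketch with a made-up two-dimensional lookup table standing in for real pre-trained vectors:

```python
def doc_embedding(tokens, embeddings):
    """Document vector as the mean of its known word vectors.
    `embeddings` maps word -> vector; unknown words are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# Made-up 2-dimensional embeddings (real pre-trained ones have hundreds of dims).
toy = {"tax": [1.0, 0.0], "economy": [0.8, 0.2], "war": [0.0, 1.0]}
print(doc_embedding(["tax", "economy"], toy))  # [0.9, 0.1]
```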