Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them

Jonathan Grossman and Ami Pedahzur

2020, Political Science Quarterly

https://doi.org/10.1002/POLQ.13032

Big data are a salient feature of the information tsunami that characterized the end of the twentieth and the beginning of the twenty-first century. As a result of the incessant rise in computational power, communication velocity, and storage capacity, new knowledge is accumulating at an exponential rate. Between 2006 and 2011, the amount of data in the world increased almost ninefold. Today, it is expected to double every two years.[1]

Computer and data scientists have been studying big databases for a while. Some of them see the analysis of such data as a panacea for all scientific questions: an omnipotent power that renders theory and within-case analysis redundant and can predict future trends and generate policy recommendations with the magic of numbers. Conversely, social scientists in general and political scientists in particular have only recently turned their attention to big data analysis. Compared with most data analysts, political scientists are far more suspicious of big data, and rightly so.

[1] Min Chen, Shiwen Mao, and Yunhao Liu, “Big Data: A Survey,” Mobile Networks and Applications 19 (April 2014): 171–209, at 171.

JONATHAN GROSSMAN is a postdoctoral fellow at the Leonard Davis Institute for International Relations, the Hebrew University of Jerusalem. His main areas of research are social science research methodology, diaspora politics, and diplomatic history.
AMI PEDAHZUR is a professor of government at the University of Texas at Austin. His most recent research explores the evolution of warfare since the industrial revolution.

POLITICAL SCIENCE QUARTERLY | Volume 135 Number 2 2020 | www.psqonline.org © 2020 Academy of Political Science DOI: 10.1002/polq.13032

Our objective in this article is to solidify the conceptual foundations for the use of big data in political science and policy research.
We discuss the main benefits of big data in the discipline and, more importantly, their potential to advance scholarship in the field by drawing on both structured and unstructured data.

We contend that political scientists have not expressed much excitement about the rise of big data, for good reason. The analysis of increasingly large troves of data has been a feature of political research for several decades now. As investigators were unveiling the benefits of such large data depots, they became increasingly aware of their limitations: big databases have offered insufficient information about context, processes over time, and interactions among variables and actors. Without the ability to capture these nuances, generalizations and statistical inferences based on big data have often been flawed or superficial. These problems stem, to a large degree, from the structured nature of the big data sets that researchers commonly analyze.

Discussions of big data tend to ignore the fact that such structured databases constitute only a small part of what qualifies as big data, whereas most data in the world are unstructured. The recent surge in the literature on evidence-based research methods that focus on in-depth analyses of unstructured sources is thus timely. While data scientists often refer to unstructured data of any size as a hurdle, we maintain that in light of the advancements in both technology and political science methodology, unstructured big data are now easier to collect and organize and can prove useful in solving some of the most pertinent issues that preoccupy the discipline. When used with caution and rigor, they can help us test theories and establish causal relationships without losing meaning or context.

To substantiate our claims, we begin by clarifying the terminology for data and big data.
Second, we contextualize this discussion by reviewing the recent, and quite limited, use of the term “big data” in political science and policy publications; the actual analysis of big data in such studies, which is considerably more prevalent; and the challenges of using structured big data in social science research. Third, we briefly discuss, through examples, the merits of unstructured big data and their potential to contribute to burgeoning evidence-based methodological approaches in political science. Fourth, we argue that the research tradition of historical institutionalism and the research method of process tracing can particularly benefit from incorporating unstructured big data sources. In the final section, we discuss the reasons why most historical institutionalists and process tracers are still unenthusiastic about big data, and we offer a pathway to overcome these obstacles by utilizing unstructured big data in an organized and systematic manner.

WHAT MAKES DATA “BIG”?

Big data are a product of the information revolution: breakthroughs in information technology in the late twentieth century, both in hardware and software, that have transformed human society, economy, and culture in most parts of the world.[2] From a historical perspective, the information revolution is still in its infancy. While it is too early to fully assess its magnitude, it would not be presumptuous to assert that this revolution has had a profound impact on political science and policy. Among other changes, it reshaped how the scholarly community in general and political scientists in particular approached scientific endeavors, that is, the ways they collected, organized, and analyzed data.

To explain this shift, we first have to delineate the differences between structured and unstructured data. For most people, the word “data” means an organized set of values, such as an Excel spreadsheet.
These are structured data: data that can fit squarely into a table or a relational database, where every row is an observation, every column a variable, and the cells at the intersection of rows and columns contain values. Consequently, the entire database or parts of it can be subject to quantitative analysis. Election results, census records, financial transactions, temperature measurements, and GPS coordinates are all examples of structured data; they transform observable phenomena into measurable and legible forms.

Data of this kind, however, are but a fraction of the total amount of data in the world. According to different estimates,[3] 80 to 95 percent of existing data are unstructured data, that is, data that cannot fit snugly into rows and columns. Unstructured data may take the form of text, audio, video, or any other observable manifestation. The content of a political speech, the video recording of that speech, the blog post commenting on the video, and the academic article analyzing the post are all unstructured.[4]

[2] See Manuel Castells, The Rise of the Network Society, 2nd ed., vol. 1, The Information Age: Economy, Society, and Culture (Malden, MA: Wiley-Blackwell, 2010), 28–33.
[3] “Data, Data Everywhere,” The Economist, 27 February 2010, accessed at https://www.economist.com/special-report/2010/02/25/data-data-everywhere, 9 April 2020; Drew Robb, “Semi-Structured Data,” Datamation, 3 July 2017, accessed at https://www.datamation.com/big-data/semi-structured-data.html, 9 April 2020; and Christie Schneider, “The Biggest Data Challenges That You Might Not Even Know You Have,” IBM Watson Blog, 25 May 2016, accessed at https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/, 9 April 2020.
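The row-and-column definition of structured data can be made concrete with a small sketch. The Python snippet below is an illustration only: the `election_results` records, the `turnout_pct` variable, and the helper names `column` and `mean_of` are hypothetical and are not taken from the article or from any real data set. It shows why structured data lend themselves to quantitative analysis: once every observation carries the same variables, a whole column can be summarized in one step.

```python
from statistics import mean

# Hypothetical structured records: each dict is one row (an observation)
# and each key is one column (a variable). All values are invented.
election_results = [
    {"district": "A", "year": 2016, "turnout_pct": 61.5},
    {"district": "B", "year": 2016, "turnout_pct": 54.0},
    {"district": "C", "year": 2016, "turnout_pct": 70.5},
]


def column(rows, variable):
    """Return every value of one variable, i.e. one column of the table."""
    return [row[variable] for row in rows]


def mean_of(rows, variable):
    """Summarize a numeric column by its arithmetic mean."""
    return mean(column(rows, variable))


average_turnout = mean_of(election_results, "turnout_pct")
print(f"Mean turnout across districts: {average_turnout:.1f}%")
```

A speech transcript or a video recording offers no comparable column to average, which is why such material has to be coded into a table before it can be analyzed in this way.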
Because of the messy and eclectic nature of unstructured data, attempts to investigate them with conventional statistical methods would often be futile. To analyze such unruly data in a quantitative way, one needs to impose a structure upon them by coding selected data points as observations and variables, that is, rows and columns. As one data analyst commented, “In analytics there is no such thing as unstructured data, just data that structure has not yet been applied to.”[5] For example, researchers interested in event data review media reports and other unstructured sources to identify discrete political incidents such as protests, terrorist attacks, or violent clashes. Based on their codebook (a document that specifies and standardizes the project’s rules for collecting and reviewing raw data and transforming them into structured data), they have to decide whether an incident meets the inclusion criteria, in which case they would code it as an observation in a structured database. In accordance with the codebook, they would assign values to the attributes of each event, for instance, the date, location, and type of a protest or terrorist attack or the number of participants or casualties.[6] The reverse procedure of turning a structured database into unstructured data is hardly feasible, as data structuring is, by nature, a reductive process that inevitably entails the loss of details and context.

Such preparation of data for quantitative investigation is not a novel approach: the quantification of unstructured data has existed since the invention of writing.[7] However, the technological developments of the information age enabled the automated processing of massive amounts of raw data rapidly and at a low cost.[8]

[4] On the differences between structured and unstructured data, see Rob Kitchin, The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences (Los Angeles: Sage, 2014), 5–6; and Amir Gandomi and Murtaza Haider, “Beyond the Hype: Big Data Concepts, Methods, and Analytics,” International Journal of Information Management 35 (April 2015): 137–144, at 138.
[5] Hjalmar Gislason, “There Is No ‘Unstructured Data’ in Analytics,” Medium, 15 July 2017, accessed at https://medium.com/@hjalli/there-is-no-unstructured-data-in-analytics-8c5d06944b23, 9 April 2020.
[6] Philip A. Schrodt, “Event Data in Foreign Policy Analysis,” in Laura Neack, Jeanne A.K. Hey, and Patrick Jude Haney, eds., Foreign Policy Analysis: Continuity and Change in Its Second Generation (Englewood Cliffs, NJ: Prentice Hall, 1995), 145–166.
[7] Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think (Boston: Houghton Mifflin Harcourt, 2013), chap. 5.
[8] Wei Wang, Ryan Kennedy, David Lazer, and Naren Ramakrishnan, “Growing Pains for Global Monitoring of Societal Events,” Science 353 (September 2016): 1502–1503.

Another approach to structuring data is through text-as-data methods, in which scholars use computer software to turn unstructured data into quantitative data, based on such parameters as the frequency of words in documents or the type of sentiments expressed in them. In text-as-data projects, investigators usually break the text into specific units of analysis, such as articles, chapters, paragraphs, sentences, or single words.
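A minimal sketch of this step, assuming plain English text and deliberately crude rules, might look as follows. The function names (`split_sentences`, `tokenize`, `document_term_matrix`) and the sample text are hypothetical; they are not the procedure of any study cited here, and real projects typically rely on dedicated text-analysis libraries and far more careful preprocessing. The sketch splits a text into sentence units and then builds a small table in which each row is a unit and each column is a term count, which is the structured form that statistical analysis requires.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentence units after ., ! or ? (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(units):
    """Return (vocabulary, rows): one row of term counts per unit of analysis."""
    vocabulary = sorted({token for unit in units for token in tokenize(unit)})
    rows = []
    for unit in units:
        counts = Counter(tokenize(unit))
        # Counter returns 0 for terms that do not occur in this unit.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Invented sample text, used only to exercise the functions above.
sample = "Big data are everywhere. Scholars debate big data! Few define the term."
units = split_sentences(sample)
vocabulary, rows = document_term_matrix(units)

for unit, row in zip(units, rows):
    print(unit, dict(zip(vocabulary, row)))
```

Choosing sentences as the unit is itself an assumption; the same functions work unchanged if `units` holds paragraphs, chapters, or whole articles.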
After splitting the text into units, investigators use computer software to quantify these units and analyze the resulting data statistically.[9] For example, in their political theory article, Lisa Blaydes and colleagues examined nearly 10,000 sections from 46 medieval political advice books; they identified recurring terms in the text, combined synonyms into single terms, and nested individual themes (such as a ruler’s moral character) under broader ones (in this case, “the art of rulership”). Finally, they tracked the occurrence and weight of these themes in medieval books over time. The statistical analysis shows that while the emphasis on religious issues in political advice texts written by Christians declined over time, Muslim writers continued using religious discourse throughout the period in question. Moreover, the analysis identifies the mass migration of various peoples to the Middle East during the High Middle Ages as a possible influence on Muslim political thought. In particular, the invasion of Turks and Mongols into the Islamic world between the eleventh and thirteenth centuries correlates with a strong emphasis in political advice texts on the desired qualities of the ideal ruler.[10] Such methods enable large-scale conclusions that were previously inaccessible or only accessible through more time-consuming and labor-intensive research. On a smaller scale, we model the text-as-data approach in the next section by counting the occurrence of the term “big data” in political science articles.

The amount of unstructured data in the world is not only massive but also keeps growing. Consider the case of academic scholarship, the one type of unstructured data that pertains to every political scientist, including purely quantitative ones. We compared the number of political science journals and articles indexed by the Web of Science Social Sciences Citation Index at two points in time: 2007 and 2017.
We sampled the three most relevant subfields for the purpose of this article: political science, international relations, and public administration. As Figure 1 shows, in one decade, all subfields experienced a dramatic rise. Most notably, the number of political science journals rose by 82 percent. As for the articles, the increase is even more staggering. Political science and public administration experienced 100 percent and 99 percent jumps, respectively.

FIGURE 1. Increase in the Number of Political Science Journals and Articles, 2007–2017
Source: Charts created with data from a Web of Science search conducted on 16 February 2019. Left panel: Journals indexed in the Social Sciences Citation Index in 2007 and 2017 by Web of Science subject category. Right panel: Articles in journals indexed in the Social Sciences Citation Index in 2007 and 2017 by Web of Science subject category.

[9] Justin Grimmer and Brandon M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts,” Political Analysis 21 (Summer 2013): 267–297; and John Wilkerson and Andreu Casas, “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges,” Annual Review of Political Science 20 (2017): 529–544.
[10] Lisa Blaydes, Justin Grimmer, and Alison McQueen, “Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds,” Journal of Politics 80 (October 2018): 1150–1167.

Simply put, to launch a new study today, political scientists must first review roughly twice as much literature as they did a decade ago, if not more. These numbers show that even the most basic scholarly engagement with unstructured data (that is, the process of producing a literature review) already involves struggling with large quantities of material that amass at an inexorable rate.
In fact, as one calculation shows, even hypothetical scholars dedicating all their time to reading core publications in their field (and assuming that they are high-speed English readers) would miserably fail to keep up with the literature.[11] The information revolution has turned the search for relevant literature into a desperate race against time and, consequently, a constant source of anxiety for many researchers.[12]

[11] Michael Billig, Learn to Write Badly: How to Succeed in the Social Sciences (Cambridge: Cambridge University Press, 2013), 27–28.
[12] Kenneth Einar Himma, “The Concept of Information Overload: A Preliminary Step in Understanding the Nature of a Harmful Information-Related Condition,” Ethics and Information Technology 9 (December 2007): 259–272.
[13] David Bawden and Lyn Robinson, “The Dark Side of Information: Overload, Anxiety and Other Paradoxes and Pathologies,” Journal of Information Science 35 (April 2009): 180–191; and Ann Blair, Too Much to Know: Managing Scholarly Information before the Modern Age (New Haven, CT: Yale University Press, 2010), 55.
[14] Steve Lohr, “The Origins of ‘Big Data’: An Etymological Detective Story,” Bits Blog (New York Times), 1 February 2013, accessed at https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story, 9 April 2020.

While this information overload is not a recent phenomenon,[13] the term “big data” is relatively new. It first emerged in the information technology industry in the mid-1990s[14] and made its academic debut in a 1998 computer science paper.[15] In the two decades that followed, it gained popularity rapidly.[16] This proliferation notwithstanding, there is little agreement, both inside academia and outside it, as to what exactly constitutes big data.
The most common definitions are based on the “three Vs” framework that Doug Laney presented in an unpublished 2001 report (which, curiously, did not include the words “big data” at all).[17] According to these definitions, data are big if they are high in volume (the sheer size of the data set is large), velocity (data are produced in or almost in real time), and variety (data come in different types and formats and may be structured or unstructured).[18] Over the years, authors have come up with additional Vs, such as veracity, variability, and value,[19] but volume, velocity, and variety remain the core attributes. Many big data definitions also underline the technological innovations and capabilities and the sophisticated methods required to gather, store, and analyze such data.[20]

[15] Francis X. Diebold, “On the Origin(s) and Development of the Term ‘Big Data’” (Social Science Research Network, 26 September 2012), accessed at https://papers.ssrn.com/abstract=2152421, 9 April 2020.
[16] A Google Scholar search (21 September 2019) yielded more than 600,000 papers (excluding patents) in which the expression “big data” occurred, while a Web of Science core collection query of papers containing “big data” in any search field produced more than 73,000 results.

These attributes of big data, despite their prevalence in the literature, are more of a general guideline than a definition. Since the introduction of the term, specialists have been attaching to it different and often contrasting meanings. By way of illustration, consider a survey that the University of California, Berkeley, School of Information conducted in 2014. Attempting to clarify the ambiguity around the term, its author invited 43 experts in various sectors and industries (from tech and education to food and fashion) to define big data.[21] A majority of respondents (29) highlighted the novel technological and methodological capabilities required to collect, store, organize, and analyze such data.
Many respondents referred to the volume of big data (22) and their cost and potential value, whether commercial, scientific, or intellectual (17). Fewer participants mentioned data variety, complexity, or messiness (18) and data velocity (11).[22] Alongside these more standard conceptualizations, many responses failed to offer a clear definition of big data and, instead, relied on vague language: “storytelling,” “challenges and opportunities,” “a cultural shift,” “a rhetorical device,” and even “anything related to data analytics or visualization” were among the answers given.

[17] Doug Laney, “3D Data Management: Controlling Data Volume, Velocity and Variety” (research note, META Group, Stamford, CT, 6 February 2001).
[18] Erik W. Kuiler, “From Big Data to Knowledge: An Ontological Approach to Big Data Analytics,” Review of Policy Research 31 (July 2014): 311–318, at 311; and Andrej Zwitter, “Big Data and International Relations,” Ethics & International Affairs 29 (Winter 2015): 377–389, at 378–379.
[19] Gandomi and Haider, “Beyond the Hype,” 139.
[20] Andrea De Mauro, Marco Greco, and Michele Grimaldi, “A Formal Definition of Big Data Based on Its Essential Features,” Library Review 65 (2016): 122–135, at 125–127; and David Lazer and Jason Radford, “Data Ex Machina: Introduction to Big Data,” Annual Review of Sociology 43 (2017): 19–39, at 20–21.
[21] Jennifer Dutcher, “What Is Big Data?,” Berkeley School of Information Blog, 3 September 2014, accessed at https://web.archive.org/web/20180121162550/https://datascience.berkeley.edu/what-is-big-data, 9 April 2020.

As these answers suggest, many definitions of big data are ambiguous and therefore lack the qualities that make good concepts in the social sciences.[23] Further, the way data analysts in the business sector or the STEM disciplines (science, technology, engineering, and mathematics) define big data is not entirely compatible with the way social scientists should view it.
For a start, the big data sets that social scientists analyze are large in terms of their disciplines, but they are often considerably smaller in volume and less resource-intensive than what many STEM scientists or data analysts refer to as big data.[24] For these reasons, social science researchers need a definition and use of big data appropriate to their field.

The two common features of big data that are most relevant to political science are the variety of data and the technological means required to extract, organize, and analyze them. For political scientists, then, big data should be defined as structured and unstructured data of different provenances, which they can access, review, and process with the help of digital technology. Of utmost importance here is the variety of big data: today, more than 98 percent of stored data in the world are digital (the rest being stored on paper, film, vinyl, and other analog media).[25] As most of these data are unstructured, the study of many political questions would greatly benefit from the wealth of knowledge that the information revolution has made available to us in the form of organizational and personal, official and unofficial, digitized or digitally born documents, photos, video and audio recordings, websites, art, video games, and so on.[26] Yet scholars in the discipline have largely ignored unstructured big data so far (although, as will be shown later, this situation is gradually changing). As we discuss in the next section, while political scientists have been reluctant to join the conversation about big data, many of them have actually been using structured big data without referring to them as such.

[22] Replication materials for our content analysis of the University of California, Berkeley, survey can be accessed at https://github.com/jonathan-grossman/Political-Science-and-Big-Data.
[23] See John Gerring, Social Science Methodology: A Unified Framework (Cambridge: Cambridge University Press, 2012), 116–131.
[24] Lev Manovich, “Trending: The Promises and the Challenges of Big Social Data,” in Matthew K. Gold, ed., Debates in the Digital Humanities (Minneapolis: University of Minnesota Press, 2012), 461; and Alex Street, Thomas A. Murray, John Blitzer, and Rajan S. Patel, “Estimating Voter Registration Deadline Effects with Web Search Data,” Political Analysis 23 (Spring 2015): 225–241, at 238.
[25] Mayer-Schönberger and Cukier, Big Data, 9.
[26] See Sharan B. Merriam and Elizabeth J. Tisdell, Qualitative Research: A Guide to Design and Implementation, 4th ed. (San Francisco: Jossey-Bass, 2016), chap. 7.

BIG DATA AND POLITICAL SCIENCE

Political scientists witnessed the transformation of big data into a “buzzword,”[27] a “catchall term,”[28] and even a “meme.”[29] However, a content analysis of the 133 political science articles about big data that were indexed in the Web of Science Social Sciences Citation Index[30] (as of 1 January 2019) suggests that the discussion of big data in the discipline is still in its infancy.[31] The earliest article on big data in a political science journal was published in 2012.[32] As Figure 2 shows, since 2014, there has been a marked increase, although not a dramatic or steady one, in the number of such articles. To a large degree, this discussion is currently confined to specialized journals (such as those focusing on intelligence and technology, for example Intelligence and National Security or Policy & Internet), methodological contributions, and special issues and symposia: 30 of the 133 articles were published in the same issue as at least two other articles from the database, and 41 articles share the same volume with at least two other articles. For example, eight articles were published in the same issue and volume of Review of Policy Research in 2014, and four articles were published in one volume of Politics and Governance in 2018.
FIGURE 2. Political Science Articles about Big Data, by Year
Source: Chart created with data from a Web of Science search conducted on 1 January 2019 with the following specifications: DOCUMENT TYPE = “Articles”; WEB OF SCIENCE CATEGORIES = “Political Science,” “International Relations,” and “Public Administration.”

[27] Christopher Eldridge, Christopher Hobbs, and Matthew Moran, “Fusing Algorithms and Analysts: Open-Source Intelligence in the Age of ‘Big Data,’” Intelligence and National Security 33 (2018): 391–406, at 393; and Damien Van Puyvelde, Stephen Coulthart, and M. Shahriar Hossain, “Beyond the Buzzword: Big Data and National Security Decision-Making,” International Affairs 93 (November 2017): 1397–1416.
[28] Connie L. McNeely and Jong-on Hahm, “The Big (Data) Bang: Policy, Prospects, and Challenges,” Review of Policy Research 31 (July 2014): 304–310, at 304.
[29] Jan Youtie, Alan L. Porter, and Ying Huang, “Early Social Science Research about Big Data,” Science and Public Policy 44 (February 2016): 65–74, at 65.
[30] The Social Sciences Citation Index is part of Clarivate Analytics’ Web of Science academic index (https://webofknowledge.com).
[31] See the online appendix for the methodology and results of the content analysis, at https://github.com/jonathan-grossman/Political-Science-and-Big-Data/blob/master/Online-Appendix.pdf. Replication materials can be accessed at https://github.com/jonathan-grossman/Political-Science-and-Big-Data.
[32] The full list of articles and all the variables in this analysis can be accessed at https://github.com/jonathan-grossman/Political-Science-and-Big-Data.

Not all articles whose topic was categorized as “big data” by the Web of Science discuss big data in any meaningful way. At least in some of them, “big data” serves as a buzzword (or, perhaps, as clickbait) to catch the attention of readers or journal editors without much actual discussion of this concept.
Thus, in 16 articles, the expression “big data” only occurs in the article title, abstract, list of references, or list of keywords, but not in the content of the article. Fifteen articles mention big data only once in the narrative. In 35 articles, the term occurs three times or fewer. Only 52 articles, fewer than 40 percent of the total, offer a definition of big data that is workable to some extent. Further, the discussion of big data in political science is highly fragmented: fewer than half of the articles in the database cite one another.[33]

[33] For a citation network analysis of the 133 articles, see the online appendix at https://github.com/jonathan-grossman/Political-Science-and-Big-Data/blob/master/Online-Appendix.pdf.

As these content and citation analyses indicate, political scientists have only recently begun to integrate the term “big data” into their scholarship. Discussions of big data have yet to occupy a niche in mainstream political science publications. Nonetheless, the tables of contents of leading quantitatively oriented political science journals confirm that political scientists have been using digital tools to structure and analyze large data repositories for some time now, albeit without explicitly referring to their data as “big” or taking part in recent debates about big data. Established research fields such as electoral studies, public opinion research, event data analysis, and comparative policy analysis are only a few examples of fields in which big data sources are prevalent.
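The term counts reported earlier in this section (articles in which “big data” appears only in the title, abstract, or keywords, articles that mention it once in the narrative, and so on) are the kind of result that a very simple text-as-data routine can produce. The following Python sketch is offered under stated assumptions and is not the authors’ replication code: the article records, their field names (`title`, `abstract`, `keywords`, `body`), and the category labels are all hypothetical. Only the final line uses figures from the text, namely 52 of 133 articles.

```python
import re

# Case-insensitive match for the exact phrase, bounded by word boundaries.
TERM = re.compile(r"\bbig data\b", re.IGNORECASE)


def body_mentions(article):
    """Count occurrences of the term in the narrative of the article."""
    return len(TERM.findall(article["body"]))


def metadata_mentions(article):
    """Count occurrences in the title, abstract, and keywords."""
    fields = ("title", "abstract", "keywords")
    return sum(len(TERM.findall(article.get(field, ""))) for field in fields)


def classify(article):
    """Assign an article to a (hypothetical) category by body mentions."""
    mentions = body_mentions(article)
    if mentions == 0:
        return "metadata only" if metadata_mentions(article) else "no mention"
    if mentions == 1:
        return "single mention"
    if mentions <= 3:
        return "two or three mentions"
    return "more than three mentions"


def share(part, whole):
    """Express part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Invented records, used only to exercise the functions above.
articles = [
    {"title": "Big Data and Elections", "abstract": "We study turnout.",
     "keywords": "big data; voting", "body": "Turnout varies across districts."},
    {"title": "Conflict Forecasting", "abstract": "", "keywords": "",
     "body": "Big data promise better forecasts."},
    {"title": "Policy Analytics", "abstract": "", "keywords": "",
     "body": "Big data matter. Critics of big data disagree. "
             "Big data persist. So does big data hype."},
]

for article in articles:
    print(article["title"], "->", classify(article))

# Figures reported in the text: 52 of the 133 articles offer a definition.
print(share(52, 133), "percent")
```

Counting phrase occurrences says nothing about whether a definition is “workable”; that judgment still requires a human coder working from a codebook.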
The proliferation of big databases in these fields has been a gradual process rather than a revolution: the size and complexity of data that political scientists analyze have been increasing for decades, since before the information age, in keeping with technological advancements in storage and processing capabilities.[34] And yet, political scientists have been reluctant to identify their sources as big data when there was no analytical justification for using this term. For instance, Gary King and colleagues’ 2013 censorship study on China features an automated collection, coding, and analysis of 11,382,221 social media posts from 1,382 Chinese websites. The researchers sampled each post several times to track changes to it. They found that the Chinese government was likely to censor posts mentioning collective action even when the posts were supportive of the government, and less likely to delete posts that did not mention collective action even when these posts were highly critical of the government.[35] Adopting a similar approach to the study of civil conflicts, Nils Metternich and his colleagues created “a dataset of over two million machine-coded daily events” extracted from more than 75 news sources. By applying network analysis and game theory methods to this database, they created a model to predict antigovernment conflicts in Thailand.[36] Although the term “big data” does not occur even once in these and similar papers, they would qualify as studies of structured big data according to common definitions. When studying the use of big data in political science, then, we should consider that while papers rarely mention the term “big data,” many of them use this kind of data in practice.

Political scientists who engage in such research endeavors are predominantly working within the quantitative tradition.
They detect trends and patterns in structured databases and employ statistical methods to differentiate signal from noise, identify correlations, and generate predictions and policy recommendations. In many instances, this use of big data for research is necessary. No one would expect demographers to conduct their own national census or students of voter turnout rates to be present at voting centers throughout Election Day to record the attendance of each and every voter.

34 Michael D. Ward, Nils W. Metternich, Cassy L. Dorff, Max Gallop, Florian M. Hollenbach, Anna Schultz, and Simon Weschle, “Learning from the Past and Stepping into the Future: Toward a New Generation of Conflict Prediction,” International Studies Review 15 (December 2013): 473–490.
35 Gary King, Jennifer Pan, and Margaret E. Roberts, “How Censorship in China Allows Government Criticism but Silences Collective Expression,” American Political Science Review 107 (May 2013): 326–343.
36 Nils W. Metternich, Cassy Dorff, Max Gallop, Simon Weschle, and Michael D. Ward, “Antigovernment Networks in Civil Conflicts: How Network Structures Affect Conflictual Behavior,” American Journal of Political Science 57 (October 2013): 892–911.

Structured data are better equipped than unstructured data to address some problems, just as quantitative methods can answer certain questions better than qualitative ones.37 When data are objective and fully represent the phenomenon in question, such analyses can be highly effective. Unfortunately, this is not always the case. At the outset, big databases seemed to offer a solution to the challenges associated with quantitative data analysis: why use a sample when we can study the entire population? However, this optimism was short-lived.
An ever-growing body of literature indicates that big data are not as representative or free of bias, manipulation, and interpretation as we would like to believe.38 For instance, while Facebook has been the most popular online social platform across generations, most social scientists work with Twitter data because the latter are considerably easier to collect and structure than Facebook posts.39 In addition, 10 million tweets may represent 10 million people, but only those people who have an internet connection and Twitter account and actively tweet. As a case in point, one study of the 2009 German parliamentary election claimed that the proportion of tweets mentioning a political party before the election could predict the number of votes that this party would gain.40 However, as a later study revealed, that analysis was correct only because the researchers had arbitrarily removed from the results the small but much-discussed Pirate Party, which was mentioned in about a third of the tweets but won only 2.1 percent of the votes.41 In a similar vein, an ever-updating stream of news media articles may capture hundreds or thousands of events each day, but it will fail to include unreported events or events reported by news outlets that are not represented in the specific index that the researchers consulted.

37 See Gary Goertz and James Mahoney, A Tale of Two Cultures: Qualitative and Quantitative Research in the Social Sciences (Princeton, NJ: Princeton University Press, 2012), 2–3.
38 danah boyd and Kate Crawford, “Critical Questions for Big Data,” Information, Communication & Society 15 (June 2012): 662–679, at 666–71; Lazer and Radford, “Data Ex Machina,” 28–31; and Matthew J. Salganik, Bit by Bit: Social Research in the Digital Age (Princeton, NJ: Princeton University Press, 2018), 17–41.
39 Ines Mergel, “Building Holistic Evidence for Social Media Impact,” Public Administration Review 77 (July–August 2017): 489–495, at 490.
40 Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe, “Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment,” Fourth International AAAI Conference on Weblogs and Social Media, May 2010, accessed at https://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1441, 9 April 2020.
41 Andreas Jungherr, Pascal Jürgens, and Harald Schoen, “Why the Pirate Party Won the German Election of 2009 or The Trouble with Predictions: A Response to Tumasjan, A., Sprenger, T. O., Sander, P. G., & Welpe, I. M. ‘Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment,’” Social Science Computer Review 30 (May 2012): 229–234.

Nils Weidmann compared coverage of insurgent attacks in Afghanistan in a military database with coverage of such events in a media-based data set. Weidmann concluded that an event was more likely to be reported in the media when the number of casualties was relatively high, the event took place relatively close to a city or a town, and the area had cellphone coverage.42 However, official data are not necessarily more reliable than media reports. Governments, corporations, and other entities may restrict access to some or all of their data, manipulate them, or fail to collect data efficiently. According to a recent study of officer-involved fatal shootings in California and Texas, open-source records are in fact more accurate and comprehensive than official databases; the former report 25 to 50 percent more such cases than state or federal databases, which suffer from underreporting and classification errors.43

Another problem arises when researchers forgo asking questions altogether. In the age of big data, some analysts have challenged the scientific method of raising questions, formulating hypotheses, and using theory.
Instead, they claim that since we have so much data at our disposal, these data should be allowed to “speak for themselves,” revealing patterns and correlations that would tell us what is important without our knowing in advance what we were hoping to find.44 While this inductive approach is mainly characteristic of analysts in the business sector, such as ones searching for promising investment opportunities, a growing number of data and STEM scientists now purport to solve social problems with the power of big data. Ascribing little importance to unquantifiable details, these pundits often lack not only training in social science methods but also substantial knowledge of the unique social, cultural, and political circumstances of the issue under study. Instead, they confidently interpret the available structured data according to patterns and correlations that they find in them,45 while paying little attention, if any, to social theory. As one review of the literature on Twitter and political action shows, even researchers who include theoretical discussion in their papers do not always ground the results of their big data analysis in any theory.46

42 Nils B. Weidmann, “A Closer Look at Reporting Bias in Conflict Event Data,” American Journal of Political Science 60 (January 2016): 206–218.
43 Howard E. Williams, Scott W. Bowman, and Jordan Taylor Jung, “The Limitations of Government Databases for Analyzing Fatal Officer-Involved Shootings in the United States,” Criminal Justice Policy Review 30 (March 2019): 201–222, at 216.
44 Rob Kitchin, “Big Data, New Epistemologies and Paradigm Shifts,” Big Data & Society 1 (April–June 2014): 1–12, at 3–4; and Mayer-Schönberger and Cukier, Big Data, chap. 4.
45 Kitchin, “Big Data,” 3–5.

Not surprisingly, such data-centered works tend to be superficial, if not erroneous.
Their authors, lacking regional or historical context, often infer causation from spurious correlations.47 Without a theoretical, conceptual, and contextual framework and clearly defined hypotheses to guide data collection and analysis, any datum in the vast ocean of big data may appear relevant and important.48 The website Spurious Correlations records such instances of statistically significant yet absurd findings. For example, there is a 95.86 percent correlation between the number of civil engineering doctorates that are awarded each year and the consumption of mozzarella cheese in the United States.49 Researchers who downplay the importance of context (and, sometimes, of common sense) and do not know their subject matter well may conclude that the fact that two variables are correlated necessarily means that one caused the other.50

Some scholars maintained that big data were likely to encompass not only explanatory powers but also predictive ones. However, as it soon turned out, our ability to accurately forecast outcomes such as terrorist attacks and election results has yet to improve significantly.51 For instance, the vast majority of predictive models for the 2016 U.S. presidential election, which forecast that Hillary Clinton would defeat Donald Trump, proved to be wrong.52 Political events are often highly complex. They involve a large number of agents who interact with one another in unexpected and often irrational ways and conceal their motives and strategies. Neither the laws of nature nor a finite set of actors, which render large volumes of structured data highly effective in forecasting the weather or determining the best move in a game of chess, exist in most political interactions.53

46 Peter Cihon and Taha Yasseri, “A Biased Review of Biases in Twitter Studies on Political Collective Action,” Frontiers in Physics 4 (August 2016), https://doi.org/10.3389/fphy.2016.00034.
47 Gary Smith, “The Exaggerated Promise of So-Called Unbiased Data Mining,” Wired, 11 January 2019, accessed at https://www.wired.com/story/the-exaggerated-promise-of-data-mining, 9 April 2020.
48 Gary King, Robert O. Keohane, and Sidney Verba, Designing Social Inquiry: Scientific Inference in Qualitative Research (Princeton, NJ: Princeton University Press, 1994), 44.
49 Spurious Correlations, accessed at http://tylervigen.com/spurious-correlations, 9 April 2020.
50 Nate Silver, The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t (New York: Penguin, 2015), 253.
51 Lars-Erik Cederman and Nils B. Weidmann, “Predicting Armed Conflict: Time to Adjust Our Expectations?,” Science 355 (February 2017): 474–476.
52 James Markarian, “What the Election Taught Us about Predictive Analytics,” Forbes, 8 February 2017, accessed at https://www.forbes.com/sites/forbestechcouncil/2017/02/08/what-the-election-taught-us-about-predictive-analytics, 9 April 2020.

Further, the prediction of political events is often based on flawed data, especially when computer programs automatically extract and code qualitative sources such as news reports without proper quality control by experts. Consequently, biases and errors in unstructured sources might be integrated into structured databases and distort the results of statistical analyses.54

While political science is not immune to these practices, the discipline as a whole is actually moving toward a more theory-guided and context-informed analysis of big data. As part of this trend, researchers have been paying increasing attention to the content of unstructured big data sources. Most of these studies employ text-as-data approaches that turn the unstructured sources into highly structured data sets ready for statistical analysis at the cost of historical detail.
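To make the idea of turning text into structured data concrete, the following is a minimal sketch of one common text-as-data step, a document-term matrix, in which every row is a document and every column a word count. It is an illustration only and not the procedure of any study cited here; the function names and the two-sentence corpus are hypothetical and were invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that a document does not contain.
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical toy corpus, invented for illustration only.
documents = [
    "The ambassador warned that force may be used.",
    "Talks continued; no threat of force was made.",
]
vocabulary, rows = document_term_matrix(documents)
```

The resulting table is ready for statistical analysis, but it no longer records word order, authorship, or circumstance, which is the loss of historical detail that the passage above describes.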
One recent example is Azusa Katagiri and Eric Min’s article about the effectiveness of private and public diplomatic signals, which uses the 1958–1963 Berlin Crisis as a case study. To test their theoretical hypotheses that actions speak louder than words and that private signals are both less noisy and more effective than public ones, Katagiri and Min processed more than 18,000 U.S. diplomatic documents. By building and training statistical models adapted to the unique circumstances of the Berlin Crisis, the authors identified documents discussing Soviet threats to use force and translated their narrative into quantitative data. Based on the results of their statistical analysis, Katagiri and Min argue that Soviet material actions were considerably more effective than private signals in influencing how U.S. policymaking elites perceived Soviet threats. However, private signals of Soviet resolve were still more compelling than public ones in convincing U.S. officials that the Soviet Union was willing to use force to achieve its objectives in Berlin. Thus, the quantitative analysis supports the researchers’ hypotheses.55 Quantitative studies of this type capitalize on the potential of unstructured big data, but the insight derived from them only applies to general trends and therefore should not replace a contextualized in-depth reading of historical documents, as the authors of one such work frankly admit.56

53 Cederman and Weidmann, “Predicting Armed Conflict”; Keith Dowding, “So Much to Say: Response to Commentators,” Political Studies Review 15 (May 2017): 217–230, at 226; and Silver, The Signal and the Noise, 2.
54 Cederman and Weidmann, “Predicting Armed Conflict,” 475; Idean Salehyan, “Best Practices in the Collection of Conflict Data,” Journal of Peace Research 52 (January 2015): 105–109, at 108; and Wang et al., “Growing Pains.”
55 Azusa Katagiri and Eric Min, “The Credibility of Public and Private Signals: A Document-Based Approach,” American Political Science Review 113 (February 2019): 156–172.
56 Blaydes, Grimmer, and McQueen, “Mirrors for Princes and Sultans,” 1153–1154.

Many other scholars supplement their quantitative analysis of structured big data with a thorough analysis of traditional sources such as oral interviews.57 For instance, in their study of the influence of election campaigns on voter turnout rates, Ryan Enos and Anthony Fowler quantitatively investigate a big data repository of voter participation but reinforce their statistical findings with insight from oral and email conversations with campaign managers and strategists.58 Still other researchers analyze a random sample of unstructured big data. Greg Distelhorst and Diana Fu, in their article about citizenship in authoritarian regimes, employ this method. After compiling a database of more than 8,000 online appeals by Chinese citizens to local authorities, the authors randomly selected a subsample of 500 documents for close reading and interpretive analysis.59 Such applications of unstructured “small” data have proven useful for contextualizing the statistical investigation of structured big data sets and for enhancing sampling validity. At the same time, they do not tap into the full potential of unstructured big data.

THE PROMISE OF UNSTRUCTURED BIG DATA

Political research that is primarily based on the scrutiny of mostly unstructured texts has become more rigorous, systematic, and transparent than ever.
Taking advantage of the rich detail found in such traditional sources as archival documents, oral interviews, and memoirs, as well as of the investigators’ theoretical, historical, and regional expertise,60 scholars have been able to tackle meaningful questions about political behavior in innovative ways and to publish their findings in mainstream political science journals. This direct contact between political scientists and historical data is a significant break from past practices: not long ago, secondary literature written by historians was the main source of information for political scientists working on historical case studies, who rarely consulted primary data.61 Today, many of them collect their own historical evidence.

57 Tobias Bornakke and Brian L. Due, “Big–Thick Blending: A Method for Mixing Analytical Insights from Big and Thick Data Sources,” Big Data & Society 5 (January–June 2018), https://doi.org/10.1177/2053951718765026.
58 Ryan D. Enos and Anthony Fowler, “Aggregate Effects of Large-Scale Campaigns on Voter Turnout,” Political Science Research and Methods 6 (October 2018): 733–751.
59 Greg Distelhorst and Diana Fu, “Performing Authoritarian Citizenship: Public Transcripts in China,” Perspectives on Politics 17 (March 2019): 106–121, at 110.
60 Marcus Kreuzer, “The Structure of Description: Evaluating Descriptive Inferences and Conceptualizations,” Perspectives on Politics 17 (March 2019): 122–139, at 125.
61 Ian S. Lustick, “History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias,” American Political Science Review 90 (September 1996): 605–618; and Theda Skocpol, “Emerging Agendas and Recurrent Strategies in Historical Sociology,” in Theda Skocpol, ed., Vision and Method in Historical Sociology (Cambridge: Cambridge University Press, 1984), 356–385, at 382.
Two recent articles in this journal illustrate both the potential and challenges of the dramatic increase in unstructured data for advancing political science research. In his 2018 article, Christopher Lamb shows how careful organization and examination of a large number of unstructured sources enable political scientists to correct common misconceptions in the literature and, as a result, develop better theories. Lamb makes this point with regard to the Mayaguez incident, a highly complex and contested historical event in which U.S. and Cambodian forces clashed over the seizure of a U.S. cargo ship by the Khmer Rouge in May 1975. Lamb demonstrates that even though a staggering number of sources on the incident have become available to researchers, there is still no consensus as to what really happened there and how the United States managed the crisis.62 Moreover, Lamb argues that the rise in the number of sources on the incident did not contribute to our understanding of it: scholars used the newly declassified information to describe the crisis (the qualitative equivalent of letting the data “speak for themselves”) rather than explaining it through the testing of alternative hypotheses.63

Lamb uses the available evidence on the Mayaguez incident, including records that have only recently been declassified, to build a painstakingly detailed chronology of the events in Cambodia and Washington on 12–15 May 1975. He then draws on this timeline to criticize key arguments in the extant literature on the crisis. For instance, he cites a “widely overlooked interview” with Veteran magazine, in which former Secretary of Defense James R. Schlesinger admitted that he had purposely delayed a White House order to sink the Cambodian fishing boat carrying the U.S. crew of the Mayaguez.
As Schlesinger retroactively explained, “it seemed to me it would destroy our own purposes to sink a ship which would have killed the Americans we were trying to save.”64 Other scholars failed to notice this crucial piece of evidence that did not exist in conventional sources such as government records. This oversight is indicative of a typical problem in the information era: unstructured data are not only growing, they are also scattered across many places and are often hard to locate. Finding a needle in a haystack is only possible if we can find the haystack.

Another example of how the growing availability of unstructured data, coupled with a systematic approach to analyzing these data, can result in more accurate political science research is Stefano Recchia’s 2015 article.

62 Christopher J. Lamb, “The Mayaguez Crisis: Correcting 30 Years of Scholarship,” Political Science Quarterly 133 (Spring 2018): 35–76.
63 Lamb, “The Mayaguez Crisis,” 74–75.
64 Lamb, “The Mayaguez Crisis,” 53.

This article offers a new reading of France’s March 2003 thwarting of a U.S.-British plan to introduce a resolution draft at the United Nations Security Council, which would have authorized the use of force against Iraq. Relying on declassified documents and oral interviews, Recchia constructs a comprehensive timeline of the events that led to President Jacques Chirac’s public threat to veto the proposed resolution. Recchia uses these rich data to show, by means of counterfactual analysis, that France did not categorically oppose the use of force as many specialists argue. In fact, President Chirac was likely to support or at least abstain on a similar resolution draft that would have allowed Iraq more time to comply with the Security Council’s demands regarding the control and surrendering of certain weapons before resorting to armed force. The George W. Bush administration, which was already preparing the invasion of Iraq, was unwilling to postpone it.
Instead, Washington and its allies decided to carry out the attack without United Nations support.65

Lamb’s and Recchia’s works are examples of how the meticulous collection and sequencing of unstructured data, and particularly the construction of detailed timelines consisting of distinct interactions between actors, can revolutionize research. In the age of big data, as more and more information on such political events as the 1975 Mayaguez incident and the 2003 invasion of Iraq becomes available, it is increasingly challenging to keep track of the details of these events for inferential purposes. A growing number of political scientists have attempted to address these challenges by integrating methodological and conceptual contributions from such disciplines as economics, psychology, history, and law into their research. As we propose in the next section, the political science literature on historical institutionalism and process tracing offers a productive methodological framework for working with unstructured big data.

HISTORICAL INSTITUTIONALISM, PROCESS TRACING, AND THE PROMISE OF UNSTRUCTURED BIG DATA

Historical institutionalism adds dynamism to political science research by introducing time as a focal element, while process tracing allows researchers to systematically arbitrate between rival hypotheses to find causal mechanisms. Both historical institutionalism and process tracing emphasize evidence. Unstructured big data can provide such evidence, for example in the form of news articles from numerous origins, the full text of or excerpts from multiple books and accounts, or diplomatic correspondence from the archives of foreign ministries.

65 Stefano Recchia, “Did Chirac Say ‘Non’? Revisiting UN Diplomacy on Iraq, 2002–03,” Political Science Quarterly 130 (Winter 2015): 625–654.
By harnessing the potential of such data, historical institutionalists and process tracers (who are sometimes the same researchers) can dramatically enhance the accuracy and scope of their work and magnify the impact of their scholarship.

Historical Institutionalism and Big Data

Since the 1990s, scholars working within the research tradition of historical institutionalism have been underscoring the significance of time for political analysis. Historical institutionalism centers on identifying temporal sequences (that is, the timing and order of events) as critical for making causal claims and testing theories regarding change and continuity in the life of institutions (rules, procedures, norms, and organizations). To further understand the influence of temporality on political processes, historical institutionalists have introduced into the discussion of formal and informal institutions such concepts as path dependence, positive and negative feedback, and increasing and diminishing returns.66 A central quest of many historical institutionalist research designs is to identify critical junctures: short periods of time (relative to the length of the institutional path) during which structural constraints on political actors are relaxed so that “the range of plausible choices open to powerful political actors expands substantially and the consequences of their decisions for the outcome of interest are potentially much more momentous.”67 Unlike the previous generation of historical institutionalists, who mostly relied on the scholarship of historians,68 many current researchers are working closely with primary sources, which render their work more comprehensive and less prone to selection bias.69 In both cases, researchers use sources that are, for the most part, unstructured.

66 Giovanni Capoccia and R. Daniel Kelemen, “The Study of Critical Junctures: Theory, Narrative, and Counterfactuals in Historical Institutionalism,” World Politics 59 (April 2007): 341–369; Orfeo Fioretos, Tulia Gabriela Falleti, and Adam D. Sheingate, eds., The Oxford Handbook of Historical Institutionalism (Oxford: Oxford University Press, 2016); Orfeo Fioretos, ed., International Politics and Institutions in Time (New York: Oxford University Press, 2017); B. Guy Peters, Institutional Theory in Political Science: The New Institutionalism, 3rd ed. (New York: Continuum, 2012), chap. 4; Paul Pierson, “Increasing Returns, Path Dependence, and the Study of Politics,” American Political Science Review 94 (June 2000): 251–267; Paul Pierson, Politics in Time: History, Institutions, and Social Analysis (Princeton, NJ: Princeton University Press, 2004); Paul Pierson and Theda Skocpol, “Historical Institutionalism in Contemporary Political Science,” in Ira Katznelson and Helen V. Milner, eds., Political Science: State of the Discipline (New York: W.W. Norton, 2002), 693–721; Thomas Rixen, Lora Anne Viola, and Michael Zürn, eds., Historical Institutionalism and International Relations: Explaining Institutional Development in World Politics (Oxford: Oxford University Press, 2016); and Sven Steinmo, Kathleen Ann Thelen, and Frank Longstreth, eds., Structuring Politics: Historical Institutionalism in Comparative Analysis (Cambridge: Cambridge University Press, 1992).
67 Capoccia and Kelemen, “The Study of Critical Junctures,” 343.
68 For example, Ruth Berins Collier and David Collier, Shaping the Political Arena: Critical Junctures, the Labor Movement, and Regime Dynamics in Latin America (Notre Dame, IN: University of Notre Dame Press, 2002); and Peter A. Hall, “The Movement from Keynesianism to Monetarism: Institutional Analysis and British Economic Policy in the 1970s,” in Sven Steinmo, Kathleen Ann Thelen, and Frank Longstreth, eds., Structuring Politics: Historical Institutionalism in Comparative Analysis (Cambridge: Cambridge University Press, 1992), 90–113.

For example, a 2014 article by Ramazan Kilinç draws on the historical institutionalist literature on critical junctures and self-reinforcing sequences to explain democratic consolidation using a Turkish case study. Kilinç relies on various English and Turkish unstructured sources (research literature, news articles, party programs, political speeches, and legal records) to define the two-year period that followed the 1997 military intervention as a critical juncture for democratic consolidation in the country; during that period, important Islamist groups, whose political participation was curtailed by the military, adopted a liberal democratic stance in a successful attempt to gain access to power. Because these groups chose to pursue a democratic path at that specific point in time, they became path dependent (that is, committed to the democratic discourse that they defended and to the constituencies that supported them because of this democratic position). This commitment ignited a self-reinforcing sequence that resulted in democratic consolidation: first, the Islamist Justice and Development (AK) Party won the 2002 parliamentary elections as the flagbearer of democratization, market economy, social reforms, and EU membership. Subsequently, the AK Party carried out measures that limited the military’s ability to intervene again in the political system and broadened the party’s own base of support.70

A salient characteristic of historical institutionalism is that it seeks to explain real-world cases rather than predict general political behavior.71 While the use of structured big data to generate predictions has been problematic in many instances, using unstructured big data sources to answer historical riddles is more feasible and, when done systematically, as evidenced in Kilinç’s analysis of Turkish politics, less fallible. Real-world cases, after all, need real-world empirical evidence. That is not to say that historical institutionalism is merely descriptive and divorced from theory. Historical institutionalist explanations can be generalized and elaborated into theories that may, in turn, be employed to predict institutional behavior.

69 Orfeo Fioretos, “Institutions and Time in International Relations,” in Orfeo Fioretos, ed., International Politics and Institutions in Time (New York: Oxford University Press, 2017), 25.
70 Ramazan Kilinç, “Critical Junctures, Catalysts, and Democratic Consolidation in Turkey,” Political Science Quarterly 129 (Summer 2014): 293–318.
71 Sven Steinmo, “Historical Institutionalism,” in Donatella Della Porta and Michael Keating, eds., Approaches and Methodologies in the Social Sciences: A Pluralist Perspective (Cambridge: Cambridge University Press, 2008), 134.

Thus, drawing on his temporal analysis, Kilinç also makes the theoretical argument that, with respect to democratic consolidation, “the timing of events facilitates or impedes the causal processes that structural conditions lead.”72 In addition, unlike many other approaches, historical institutionalism views political phenomena as structured in time and space and therefore deeply embedded in historical context.73 One piece of good advice for historically oriented political scientists who wish to understand the context of the problem they attempt to address is not to limit their purview to analyzing a handful of key primary documents.
Rather, as Alexander George and Andrew Bennett suggest, they should survey a large array of media accounts from the period under study to understand not only the facts directly related to their question but also the climate in which decisions and actions were taken and the information that was available to actors, policymakers, and the public at the time.74 As Deborah Larson remarks, “Journalistic analyses and interpretations of speeches provide a code book by which to decipher the meaning of a document.”75 While following a single newspaper over time may import into the study the biases held by specific journalists and editors, using a multitude of sources can lead to a more balanced depiction of the zeitgeist. Browsing unstructured big data repositories such as digital collections of newspapers or, with respect to recent years, online news archives and social media posts can offer this balance.

The latter point illuminates the two features that make unstructured big data particularly appealing for historical institutionalists: they are laden with voluminous and variegated historical narrative, and they are often digitally searchable. Historical institutionalists must identify the point at which an institution started following a particular path as well as when it stopped following it.76 For this purpose, they must have enough information on the institution and the different actors that influenced and were influenced by it over what is often a very long stretch of time.77

72 Kilinç, “Critical Junctures, Catalysts, and Democratic Consolidation in Turkey,” 295.
73 Peter A. Hall, “Politics as a Process Structured in Space and Time,” in Orfeo Fioretos, Tulia G. Falleti, and Adam Sheingate, eds., The Oxford Handbook of Historical Institutionalism (Oxford: Oxford University Press, 2016), 31–50.
74 Alexander L. George and Andrew Bennett, Case Studies and Theory Development in the Social Sciences (Cambridge, MA: MIT Press, 2005), 97.
75 Deborah Welch Larson, “Sources and Methods in Cold War History: The Need for a New Theory-Based Archival Approach,” in Colin Elman and Miriam Fendius Elman, eds., Bridges and Boundaries: Historians, Political Scientists, and the Study of International Relations (Cambridge, MA: MIT Press, 2001), 347.
76 Pierson, Politics in Time, esp. 44–46.

When formulating a critical juncture argument, they equally need to engage in a massive pursuit of historical evidence in order to map the main actors in the juncture and their interactions, build and test counterfactual arguments, and understand the legacy of the juncture.78 To that end, identifying the dates and locations of events and the participants in them with a maximal degree of certainty is vital. Sources, however, often suffer from description bias, providing inconsistent or inaccurate dates and names. Thus, the more independent sources we have on an item and the more diversified and reliable they are, the better our chances to eliminate errors and discrepancies and establish an accurate time frame.79

As the information age matures, such digital primary and secondary sources accumulate at an unprecedented rate. Archives, libraries, publishers, museums, media outlets, government agencies, for-profit and non-profit organizations, businesses and companies, and even private amateurs and aficionados are increasingly digitizing their collections and granting direct access to them, for free or for a fee. New digital archives appear every day, while existing ones are growing bigger and bigger as digitally born material is being published and analog material is being scanned. In its strategic plan for the years 2018–2022, the U.S.
National Archives and Records Administration committed to scanning and making available to the public half a billion pages by 2024.80 Other countries are increasingly committed to the digitization of records. The government of New Zealand, for example, launched the digitization project of its national archives in 2017.81 The United Kingdom archives government information that was published online, including tweets by official government organizations.82 Private archiving projects have also become more common and sophisticated.

77 Kathleen Thelen and James Conran, “Institutional Change,” in Orfeo Fioretos, Tulia G. Falleti, and Adam Sheingate, eds., The Oxford Handbook of Historical Institutionalism (Oxford: Oxford University Press, 2016), 57.
78 Giovanni Capoccia, “Critical Junctures and Institutional Change,” in James Mahoney and Kathleen Thelen, eds., Advances in Comparative‐Historical Analysis (Cambridge: Cambridge University Press, 2015), 169–173.
79 Marc Trachtenberg, The Craft of International History: A Guide to Method (Princeton, NJ: Princeton University Press, 2006), 147–148.
80 National Archives and Records Administration, “Strategic Plan 2018–2022,” February 2018, accessed at https://www.archives.gov/about/plans-reports/strategic-plan/strategic-plan-2018-2022, 14 August 2019.
81 Archives New Zealand, “What’s Been Digitised,” 31 May 2019, accessed at https://archives.govt.nz/search-the-archive/what-we-have/whats-been-digitised, 9 April 2020.
82 U.K. Government Web Archive, Twitter Archive, accessed at https://webarchive.nationalarchives.gov.uk/twitter/, 9 April 2020.
Today, for example, invaluable information on the operations and members of British commando forces in World War II, including detailed timelines, maps, photos, letters, and booklets, can be found on personal websites founded and maintained by the veterans’ families.83

No less meaningful for historical institutionalists is the ability to digitally mine, save, organize, search, and retrieve big data sources. Given the gargantuan and constantly growing volume of historical records, researchers can never read all the relevant sources all the way through (although they should peruse enough sources to establish the indispensable historical context). Luckily, the possibility of digitally searching these sources’ content allows investigators to quickly and easily find those chunks of text (and, increasingly, of audio, video, and visual material) that pertain to their research questions and units of analysis. As Paul Pierson, one of the forefathers of historical institutionalism in political science, muses, the digital revolution “makes it possible to examine huge quantities of text, increasing researchers’ ability to accurately map mass and elite political expression over time.”84 In the online appendix,85 we suggest several ways of identifying and finding important elements in a large text corpus through regular expressions—patterns that describe recurring strings in the text—as well as other methods.86

To illustrate both the value of unstructured data and the prospects of unstructured big data in such research, consider Kathryn Sikkink’s contribution to a recent volume on historical institutionalism in international politics. Sikkink demonstrates that the consolidation of human rights institutions in Latin America in the 1980s harks back to the 1940s. This argument undermines an important study of this topic, according to which the consolidation of such institutions only happened in the 1970s, when social movements in the region began to advocate for human rights.
Sikkink explains that the post–World War II support of Latin American governments for human rights set in motion a prolonged critical juncture during which the path of human rights institutions slowly and incrementally changed, finally stabilizing and reaching path dependence in the 1980s. Arriving at this conclusion required not only the methodological toolbox of historical institutionalism, but also the careful examination of various historical accounts, legal documents, and organizational reports over a long span of time.87 Sikkink’s study showcases the crucial role that sensitivity to historical context and detail plays when building sound historical institutionalist arguments. However, in the digital age, these arguments can become more robust by drawing on the immense pool of independent historical sources that the internet offers. Unstructured big data such as minutes of discussions in Latin American and other governments and parliaments, op‐eds and reports in Latin American and other newspapers, or the texts of political speeches could have reinforced (or, perhaps, modified) Sikkink’s conclusions.

83 See, for example, the Combined Operations Project, accessed at https://www.combinedops.com/about_site_background.htm, 9 April 2020; and Commando Veterans Archive, accessed at http://www.commandoveterans.org/, 9 April 2020.
84 Paul Pierson, “Power in Historical Institutionalism,” in Orfeo Fioretos, Tulia G. Falleti, and Adam Sheingate, eds., The Oxford Handbook of Historical Institutionalism (Oxford: Oxford University Press, 2016), 138.
85 Available at https://github.com/jonathan-grossman/Political-Science-and-Big-Data/blob/master/Online-Appendix.pdf.
86 On regular expressions, see Jeffrey E.F. Friedl, Mastering Regular Expressions, 3rd ed. (Sebastopol, CA: O’Reilly, 2006), 4.
Navigating such large collections of documents has become easier thanks to the digital searchability of many online sources; by using keywords such as “human rights” (or “derechos humanos”/“direitos humanos” in Spanish/Portuguese), researchers could follow these institutions more closely and accurately.

Process Tracing and Big Data

When historical institutionalists want to make causal claims and theorize causal mechanisms based on empirical evidence, they must choose an adequate method of analysis.88 Process tracing—a qualitative method of within‐case analysis that is popular among historical institutionalists89 as well as other political scientists—can particularly benefit from unstructured big data. Drawing on theoretical and methodological contributions from history and psychology, process tracers put rival hypotheses to the test of existing evidence in order to establish causality and identify causal mechanisms.90 In formal process tracing, researchers use Bayesian reasoning to evaluate the probability of a hypothesis given the evidence at hand and their prior knowledge, which they translate into numerical values. Later on, they update this probability as new evidence unfolds.91 Alternatively, the causal inference in process tracing may take a more narrative, although by no means less rigorous, form.92 In either case, the collection and assessment of evidence are theory‐guided, context‐sensitive, and explicit (although many other case studies, such as Lamb’s article on the Mayaguez incident and Recchia’s article on France’s threat to veto the attack against Iraq, employ an implicit form of process tracing). Like historical institutionalists, process tracers must be highly familiar with their subject matter and its history, and they are expected to describe every stage of their inference in a transparent way that allows readers to replicate it.93

87 Kathryn Sikkink, “Timing and Sequencing in International Politics: Latin America’s Contributions to Human Rights,” in Orfeo Fioretos, ed., International Politics and Institutions in Time (New York: Oxford University Press, 2017), 231–250.
88 Fioretos, “Institutions and Time in International Relations,” 23–24.
89 Capoccia and Kelemen, “The Study of Critical Junctures,” 358.
90 Derek Beach and Rasmus Brun Pedersen, Process‐Tracing Methods: Foundations and Guidelines (Ann Arbor: University of Michigan Press, 2013); Andrew Bennett and Jeffrey T. Checkel, eds., Process Tracing: From Metaphor to Analytic Tool (Cambridge: Cambridge University Press, 2015); George and Bennett, Case Studies and Theory Development, chap. 10; John Gerring, Case Study Research: Principles and Practices (Cambridge: Cambridge University Press, 2007), chap. 7; and Ingo Rohlfing, Case Studies and Causal Inference: An Integrative Framework (Basingstoke: Palgrave Macmillan, 2012), chap. 6.

Jacob Ricks and Amy Liu, in an appendix to their methodological article about process tracing, demonstrate how this method can help researchers test hypotheses. In one of their exemplary case studies, the outcome of interest is the prominence, since 2001, of the Thai Rak Thai (TRT) Party in Thailand—a country characterized by weak political parties and unstable coalitions. Ricks and Liu ask whether the TRT’s unprecedented success was attributable to the personal clout of its leader, Thaksin Shinawatra, or to the institutional changes brought about by the 1997 Constitution.
After presenting these two hypotheses, the authors establish timelines containing the most important events that happened between the hypothesized causes and the outcome of interest; create causal graphs that visualize possible causal links between these events; identify counterfactual outcomes—what would have happened had Thaksin not founded the TRT Party or had the 1997 constitutional changes not been implemented; and determine which types of evidence are needed to support or refute each hypothesis.94 Putting the two hypotheses to the test of evidence, Ricks and Liu survey a rich body of secondary literature on Thai politics. These data support the institutional hypothesis and weaken the personal one: Thaksin already led a political party in the 1996 election (that is, before the 1997 constitutional reforms) and suffered a crushing defeat; therefore, his personality was probably not sufficient to attain electoral victory. Moreover, there is evidence that after the 2001 election, constitutional restrictions prevented the secession of one of the major factions that composed the TRT and thus ensured the party’s integrity.

91 Tasha Fairfield and Andrew E. Charman, “Explicit Bayesian Analysis for Process Tracing: Guidelines, Opportunities, and Caveats,” Political Analysis 25 (May 2017): 363–380.
92 David Collier, “Understanding Process Tracing,” PS: Political Science & Politics 44 (October 2011): 823–830.
93 Beach and Pedersen, Process‐Tracing Methods, 123–126; Christopher Darnton, “Archives and Inference: Documentary Evidence in Case Study Research and the Debate over U.S. Entry into World War II,” International Security 42 (Winter 2017/18): 84–126; and Kreuzer, “The Structure of Description.”
94 Jacob I. Ricks and Amy H. Liu, “Process‐Tracing Research Designs: A Practical Guide,” PS: Political Science & Politics 51 (October 2018): 1–5, appendix, at 9–18.
This evidence lends support to the claim that it was the 1997 constitutional reform that allowed the TRT to remain united and form a strong coalition. While Ricks and Liu could not find decisive evidence in the surveyed literature to prove or discredit either hypothesis beyond any doubt, their deep familiarity with the Thai political context allows them to conclude that given the evidence at hand, the institutional hypothesis best explains the TRT’s success. At the same time, they do not discount the rival hypothesis and acknowledge Thaksin’s individual contribution to this outcome.95 This example shows researchers developing and employing good research design by relying on process tracing as well as area expertise and knowledge of context to solve a political puzzle.96

When inferring causality, time is of the essence. Like historical institutionalists, process tracers seek to identify exact time frames and attach events and interactions to specific points in time.97 They are required to establish a detailed and accurate timeline divided into years, months, days, or even hours and minutes, depending on the level of granularity of inference. Such a timeline constitutes the first layer of research. Once it is identified, documented, and verified, it provides a foundation for multilayered analysis that may include temporal, spatial, and networked elements. As with historical institutionalism, unstructured big data sources such as digitized archival documents, interview transcripts, news articles, book manuscripts, and even secondary literature can provide thick political, historical, social, and cultural description for this purpose.98 Thus, to test the hypothesis that Thai constitutional reform accounted for the TRT Party’s success, Ricks and Liu had to show that the Constitution predated the TRT’s electoral victory.
Otherwise, the hypothesis would be invalid.99 While tracing this particular sequence was not difficult to do—the Constitution was changed in 1997 and the election took place in 2001, as could be easily verified in any reliable chronology of political events in Thailand—other timelines require much greater fragmentation and could thus benefit markedly from drawing on big data.

95 Ricks and Liu, “Process‐Tracing Research Designs,” appendix, 15–18.
96 See also James Mahoney, “Process Tracing and Historical Explanation,” Security Studies 24 (June 2015): 200–218.
97 Collier, “Understanding Process Tracing,” 824; and Ricks and Liu, “Process‐Tracing Research Designs,” 2.
98 See Kitchin, “Big Data,” 10.
99 Ricks and Liu, “Process‐Tracing Research Designs,” appendix, 16.

In the 1962 Cuban missile crisis, for example, a myriad of interactions between many actors occurred within a few days. It is now possible to reconstruct many of these exchanges thanks to a great selection of primary documents from various provenances that are available online; typing the phrase “documents on the Cuban missile crisis” into any web search engine will return an overwhelming number of relevant results. This, however, is not enough. Because of the density of events related to this crisis, process tracers would need to know and record the exact hour—and even the minute—in which these events started and ended to identify causal links between the many distinct interactions.

Even though historical institutionalists and process tracers use data that are, for the most part, unstructured, so far they have paid scant attention to the potential of unstructured big data.
By embracing the big data revolution, adherents of these approaches can attain unprecedented levels of granularity and hence of accuracy and reliability, in keeping with the growing standards of transparency and reproducibility in the social sciences.100 However, unstructured big data also pose new challenges for political scientists. The final section of this article discusses these challenges as well as the best ways to face them.

THE CHALLENGES OF UNSTRUCTURED BIG DATA AND HOW TO OVERCOME THEM

The vast quantity of big data tests scholars’ ability to conduct research with measurable goals and within a reasonable time frame. Not only is the volume of data overwhelming, but data can also be misleading and onerous. Finding the evidence that we need in an ocean of irrelevant and false information is a constant challenge. Even if we encounter some details that seem pertinent, we still have to verify their accuracy. In this respect, the big data revolution is a blessing and a curse, as it magnifies and exacerbates reliability problems that have always been part of any inquiry of unstructured sources. While the amount of data in the world keeps growing, most available data are of dubious quality.101 As Tom Nichols bluntly puts it, “The Internet lets a billion flowers bloom, and most of them stink.”102

100 R. Michael Alvarez, Ellen M. Key, and Lucas Núñez, “Research Replication: Practical Considerations,” PS: Political Science & Politics 51 (April 2018): 422–426; and Sean Yom, “Analytic Transparency, Radical Honesty, and Strategic Incentives,” PS: Political Science & Politics 51 (April 2018): 416–421.
101 Silver, The Signal and the Noise, 250.
102 Thomas M. Nichols, The Death of Expertise: The Campaign against Established Knowledge and Why It Matters (New York: Oxford University Press, 2017), 108.

The sheer amount of data about a certain topic is not necessarily an indicator of accuracy.
Making quantitative judgments regarding the quality of data—that is, attempting to determine which pieces of information are correct based solely on the number of sources that support them—would often be precarious. In the age of “fake news,” 99 news articles that back up a particular claim might be the product of one disinformation campaign, whereas the single report that substantiates the rival hypothesis might emanate from the only independent journalist bold enough to uncover the truth or committed enough to defend it. Moreover, using evidence volume as a proxy for evidence reliability might result in the exclusion of minorities and women, who are often underrepresented in big data repositories.103 Thus, the researchers’ expertise in the context and history of the period and phenomena they investigate and their ability to evaluate the reliability of their sources are prerequisites for a good unstructured big data analysis. Data scientists or quantitatively oriented political scientists who were not trained in historical methods, or even a historically oriented investigator who does not specialize in the area or period in question, are likely to encounter difficulties when dealing with such sources without guidance from or collaboration with other experts.104

When researching unstructured data, we often strive to find the needles in haystacks—those elusive pieces of evidence that provide insight and truth and can confirm a hypothesis, impugn rival explanations, or both.105 When our data consist of a large number of documents, it is easy to miss a name that occurs only once in the entire corpus. Robert A. Caro, the famous biographer and two‐time Pulitzer Prize winner, recalls in his memoir the invaluable advice of a trusted editor: “Turn every page. Never assume anything. Turn every goddamned page.”106 However, when there are too many pages, it might be impossible to turn each and every one of them.
Caro writes that when he embarked on his research at the Lyndon Baines Johnson Library and Museum in Austin, Texas, and learned that the LBJ archives contain around 32 million pages, he had to concede that “there would be no turning every page here.”107 His realization highlights the indispensability of digitally searching big data by keywords, strings, and regular expressions, since carefully reviewing and annotating all the relevant sources, as investigators used to do prior to the information era (and, as Caro’s anecdote demonstrates, were often unable to accomplish even in the age of small data), is no longer feasible—there are simply too many books to read, video clips to watch, or audio recordings to hear.

Moreover, even when we know how to find and extract the information we need, we must be able to keep track of these details in order to analyze them. Historians who examine primary sources arrange their research notes and present their findings in a logical order for their work to be meaningful.108 Political scientists studying historical data are no different in this regard.

103 Brooke Foucault Welles, “On Minorities and Outliers: The Case for Making Big Data Small,” Big Data & Society 1 (April–June 2014), https://doi.org/10.1177/2053951714540613.
104 See George and Bennett, Case Studies and Theory Development, 96.
105 Stephen Van Evera, Guide to Methods for Students of Political Science (Ithaca, NY: Cornell University Press, 1997), 30–32.
106 Robert A. Caro, Working: Researching, Interviewing, Writing (New York: Alfred A. Knopf, 2019), 11.
107 Caro, Working, 84.
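To make the idea of searching a corpus by keywords and regular expressions more concrete, the following is a minimal Python sketch. It is not the procedure described in the online appendix. The three-sentence corpus, the document identifiers, the date pattern, and the function name `search_corpus` are all hypothetical and serve only as illustration; the keyword terms are the ones mentioned earlier in the article.

```python
import re

# Hypothetical corpus: invented placeholder sentences, not real documents.
CORPUS = {
    "doc_en": "The delegates affirmed that human rights bind every state.",
    "doc_es": "La defensa de los derechos humanos exige tribunales independientes.",
    "doc_other": "The ministry reviewed import tariffs on 14 March 1981.",
}

# Keyword pattern: English, Spanish, and Portuguese terms, case-insensitive.
KEYWORDS = re.compile(
    r"\b(?:human rights|derechos humanos|direitos humanos)\b", re.IGNORECASE
)

# Illustrative pattern for English dates such as "14 March 1981" (an assumption;
# real archives use many more date formats).
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATES = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")


def search_corpus(corpus, pattern, context=30):
    """Return (document id, matched text, surrounding snippet) for every match."""
    hits = []
    for doc_id, text in corpus.items():
        for match in pattern.finditer(text):
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            hits.append((doc_id, match.group(0), text[start:end]))
    return hits


if __name__ == "__main__":
    for doc_id, matched, snippet in search_corpus(CORPUS, KEYWORDS):
        print(f"{doc_id}: '{matched}' in ...{snippet}...")
```

Returning a snippet of surrounding text with each hit is a deliberate choice: it lets the researcher judge relevance in context before opening the full source, which fits the article's insistence on close reading.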
To make sense of unstructured sources, and especially of large digital collections of such sources, we must treat our data according to clear organizing principles.109 Unlike structuring strategies whose goal is the statistical analysis of data, historical institutionalism and process tracing do not entail the irrevocable quantification or reduction of fine‐grained historical narrative. Although scholars engaging in such research endeavors must be able to locate, organize, and retrieve only those facts relevant to their questions and cast aside the noise, they must also be able to consult the raw material whenever they need to for the purposes of context, replication, or further inquiry.

One possible way to attain this goal is to create a qualitative codebook in the form of a spreadsheet or a relational database. Such a codebook contains, in a tabular and digitally searchable form, all the germane dates, events, actors, and locations, as well as links to the original sources. On the one hand, the codebook would give researchers a bird’s‐eye view of all the meaningful items in their study and allow them to array actors and events in order of time, importance, influence over the outcome of interest, or any other order. A relational database can additionally allow researchers to create causal links and other connections between actors and events. On the other hand, immediate access to the original sources would permit researchers the close and careful reading of historical sources that is central to historical institutionalism and process tracing, offset the loss of context that is characteristic of structured big data, and ensure the replicability of their study.

108 Anthony Brundage, Going to the Sources: A Guide to Historical Research and Writing, 6th ed. (Hoboken, NJ: Wiley, 2017), 114–121.
109 George and Bennett, Case Studies and Theory Development, 90.
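As a rough illustration of a codebook kept as a relational database, the Python sketch below uses the standard-library `sqlite3` module. The table and column names, the helper functions, and the choice of SQLite are assumptions made for this example and are not taken from the online appendix. The three sample rows paraphrase the Thai case as summarized in this article, with year-only dates because the article gives no finer detail.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sources (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    url   TEXT                    -- link back to the original material, if any
);
CREATE TABLE events (
    id          INTEGER PRIMARY KEY,
    event_date  TEXT NOT NULL,    -- year or YYYY-MM-DD; text sorts chronologically
    actor       TEXT,
    location    TEXT,
    description TEXT NOT NULL,
    source_id   INTEGER NOT NULL REFERENCES sources(id)
);
CREATE TABLE causal_links (
    cause_id  INTEGER NOT NULL REFERENCES events(id),
    effect_id INTEGER NOT NULL REFERENCES events(id),
    note      TEXT
);
"""

INSERT_EVENT = (
    "INSERT INTO events (event_date, actor, location, description, source_id) "
    "VALUES (?, ?, ?, ?, ?)"
)


def build_codebook():
    """Create an in-memory codebook and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    source_id = conn.execute(
        "INSERT INTO sources (title, url) VALUES (?, ?)",
        ("Ricks and Liu, appendix (as summarized in this article)", None),
    ).lastrowid
    rows = [
        ("1996", "Thaksin Shinawatra", "Thailand", "Party led by Thaksin loses the election", source_id),
        ("1997", None, "Thailand", "Constitutional reform", source_id),
        ("2001", "Thai Rak Thai Party", "Thailand", "TRT wins the election", source_id),
    ]
    event_ids = [conn.execute(INSERT_EVENT, row).lastrowid for row in rows]
    # Hypothesized link between the 1997 reform and the 2001 result.
    conn.execute(
        "INSERT INTO causal_links (cause_id, effect_id, note) VALUES (?, ?, ?)",
        (event_ids[1], event_ids[2], "institutional hypothesis"),
    )
    conn.commit()
    return conn


def timeline(conn):
    """Return events in chronological order together with their source title."""
    return conn.execute(
        "SELECT e.event_date, e.description, s.title "
        "FROM events AS e JOIN sources AS s ON s.id = e.source_id "
        "ORDER BY e.event_date"
    ).fetchall()
```

Calling `timeline(build_codebook())` returns the events sorted by date, each paired with the title of the source that documents it. Keeping `causal_links` in its own table is what lets one event serve as cause or effect in several hypothesized relations, something a single flat spreadsheet handles less gracefully.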
To further improve and contextualize the retrieval of relevant details, researchers can allocate columns in the spreadsheet to descriptive metadata—data that describe the content of a resource and allow users to look up that resource based on its attributes in a way similar to the category labels in an email inbox or a reference manager.110 Throughout the research process, the investigators should constantly update the codebook in accordance with the data they find while documenting their steps and decisions to guarantee their project’s transparency and replicability. In the online appendix, we provide more elaborate instructions on how to create such a codebook.111

The qualitative codebook is just one method for organizing unstructured big data, and it requires some computer knowledge and skills. More traditional methods, such as identifying causal mechanisms by drawing causal graphs112 or set diagrams,113 or using an intuitive timeline and mind‐mapping software that help writers to keep track of their narrative and characters, may be equally useful in organizing the plethora of details found in unstructured big data. In any event, analyzing such data is most effective when the data are machine‐readable, fully searchable, and, preferably, downloaded in toto, so that researchers, reviewers, and readers can revisit the data even if the creators or owners of the data change or remove them from their original location. Manually downloading massive amounts of web pages, text documents, or media files can be a time‐consuming task.
More efficient solutions would be clipping web pages with a designated browser extension that imports the source into some software, such as a note‐taking application, a computer‐assisted qualitative data analysis (CAQDA) package, or a reference manager; running an automated script that extracts the data from the internet;114 or using a more user‐friendly desktop or web‐based scraping service that boasts a graphical user interface. Alternatively, researchers can delegate data mining to a host of human workers through crowdsourcing web services such as Amazon Mechanical Turk.115

110 Richard Gartner, Metadata: Shaping Knowledge from Antiquity to the Semantic Web (Cham, Switzerland: Springer, 2016), 6–7; and Richard Pearce‐Moses, A Glossary of Archival and Records Terminology, Archival Fundamentals Series (Chicago: Society of American Archivists, 2005), 113.
111 Available at https://github.com/jonathan-grossman/Political-Science-and-Big-Data/blob/master/Online-Appendix.pdf.
112 Ricks and Liu, “Process‐Tracing Research Designs,” 2–3.
113 James Mahoney, Khairunnisa Mohamedali, and Christoph Nguyen, “Causality and Time in Historical Institutionalism,” in Orfeo Fioretos, Tulia G. Falleti, and Adam Sheingate, eds., The Oxford Handbook of Historical Institutionalism (Oxford: Oxford University Press, 2016), 71–88.
114 Gabe Ignatow and Rada Mihalcea, An Introduction to Text Mining: Research Design, Data Collection, and Analysis (Los Angeles: Sage, 2018).
115 Wilkerson and Casas, “Large‐Scale Computerized Text Analysis,” 531.

Finally, regardless of the ways data are found, downloaded, and organized, we believe that human researchers rather than machines should be in control of analyzing unstructured big data to ensure that nuanced contextual analyses accompany any quantitative analysis.
We do not reject the possibility that throughout the collection and examination of data, technology can help us unveil patterns that we have not yet considered. As it is often practically impossible to read or view all the available evidence, automated methods can help us with this effort. For example, we can use CAQDA software and text‐as‐data methods to check the frequency of words in a corpus of sources. In this way, we can become familiar with the concepts, actors, locations, and ideas that predominated at different moments in time, visualize those findings as a word cloud or a list of words,116 and narrow down both our digital searches and our analytic focus. Nonetheless, we do not recommend a wholly inductive approach. As this article has indicated multiple times, there are no perfect digital shortcuts to immersing ourselves in the context and history of the political phenomenon that we wish to explain.

CONCLUSION

“Big data” might not be a buzzword in political science yet, but big data are nevertheless widely used by political scientists. So far, this use has been almost exclusively in the form of statistical analyses of structured databases. Through this article, we seek to challenge this state of affairs by advocating the analysis of unstructured big data in political research, especially in the context of historical institutionalism and process tracing. Given that most data in the world are unstructured to some degree, making inferences from exclusively structured data may result in a “drunkard’s search” bias in that it would often include only those data that are easiest to obtain and analyze. To deliberately confine our investigation to what constitutes, at best, 20 percent of the available data in the world is to restrict the scope of our research, if not cherry‐pick our evidence.
In the age of information, the nexus between accessible unstructured data of unprecedented magnitude and novel methods, both quantitative and qualitative, for the analysis of such data has the potential to shed light on formerly hidden political mysteries. It can lead us from descriptive or correlative studies to an accurate depiction of causal political processes and mechanisms and thus enhance our understanding of the world in which we live.

116 Klaus Krippendorff, Content Analysis: An Introduction to Its Methodology, 3rd ed. (Los Angeles: Sage, 2013), 189–192.

Of course, as we emphasize throughout the article, investigators should choose their method and data according to their research question: some questions are best answered with the statistical inference of structured data, others with in‐depth analyses of unstructured sources, and still others with a combination of both. One common approach, for example, is identifying a causal effect with the former and theorizing the causal mechanism with the latter.117 In any case, researchers should do their best to avoid the pitfalls of big data while acknowledging that avoiding them altogether is unfeasible.

More research and documented findings are necessary before the majority of political scientists feel at ease studying structured or unstructured big data. Although big data analysis can be done today using common software that is widely available for free or at a low price, it still requires varying degrees of computational skills. Python and R, for example, are two powerful open‐source programming languages that many political scientists use to extract, process, and analyze big data sources. On the one hand, these languages are freely available and enjoy a vast universe of users, developers, downloadable packages and modules, online courses, blogs, and guidebooks.
On the other hand, they are very unintuitive—mastering them involves a steep learning curve and a considerable investment of time; moreover, these skills may come at the expense of acquiring others that may be more useful for certain research projects, such as gaining expertise in the language, history, and politics of the studied area or country.118 Certainly, a wide digital divide exists between scholars who know how to write algorithms in a programming language, those who do not code but are capable of using relatively sophisticated (and often pricey) digital tools with a graphical user interface (for example, relational databases, text mining applications, or CAQDA software), and those with only basic digital literacy. For the moment, big data are fully accessible only to the first group and, with a lesser degree of functionality, to the second one.119 Developing software packages that would simplify the search, extraction, organization, and analysis of such data for members of all groups is among the most pressing challenges that social science is facing today. Making these tools available for free or at an affordable price, so as to increase equal access across different countries, institutions, and individuals, is another major challenge. Given the ever‐increasing availability of user‐friendly digital tools for social science research, on the one hand, and the rapid proliferation of open‐source software, on the other hand, there is good reason to believe that these challenges are temporary.

117 Gerring, Case Study Research, 43.
118 See Thomas B. Pepinsky, “The Return of the Single‐Country Study,” Annual Review of Political Science 22 (2019): 187–203.
119 See also Henry E. Brady, “The Challenge of Big Data and Data Science,” Annual Review of Political Science 22 (2019): 297–323.
Scholars may also increasingly develop the necessary skill sets to code data or collaborate across disciplines and beyond academia with people who already possess these skills. We look forward to the contributions that these advances will make to the study of big data in political science.*

*We thank the editors and anonymous reviewers of Political Science Quarterly and Raelene Camille Wyse for their helpful comments on earlier drafts of this article.
