
    Mining the ‘Internet Graveyard’: Rethinking the Historians’ Toolkit

    “Mining the Internet Graveyard” argues that the advent of massive quantities of born-digital historical sources necessitates a rethinking of the historians’ toolkit. The contours of a third wave of computational history are outlined, a trend marked by ever-increasing amounts of digitized information (especially web-based), falling digital storage costs, a move to the cloud, and a corresponding increase in the computational power available to process these sources. Following this, the article uses a case study of an early born-digital archive at Library and Archives Canada – Canada’s Digital Collections project (CDC) – to bring some of these problems into view. An array of off-the-shelf data analysis solutions, coupled with code written in Mathematica, helps bring context to and retrieve information from a digital collection on a previously inaccessible scale. The article concludes with an illustration of the various computational tools available, as well as a call for greater digital literacy in history curricula and professional development.
    Social Sciences and Humanities Research Council || 430-2013-061

    Trawling and trolling for terrorists in the digital Gulf of Bothnia: Cross-lingual text mining for the emergence of terrorism in Swedish and Finnish newspapers, 1780–1926

    In pursuing the historical emergence of the discourse on terrorism, this study trawls the “digital Gulf of Bothnia” in the form of a corpus of combined Swedish and Finnish digitized newspaper texts. Through a cross-lingual exploration of the uses of the concept of terrorism in historical Swedish and Finnish news, we examine meanings anchored in the two culturally close but still decidedly different national political contexts. The study is an outcome of an integrative interdisciplinary effort.
    Peer reviewed

    Can we forecast conflict? A framework for forecasting global human societal behavior using latent narrative indicators

    The ability to successfully forecast impending societal unrest, from riots and protests to assassinations and coups, would fundamentally transform the ability of nations to proactively address instability around the world, intervening before unrest accelerates to conflict or prepositioning assets to enhance preventive activity. It would also enhance the ability of social scientists to quantitatively study the underpinnings of how and why grievances transition from agitated individuals to population-scale physical unrest. Recognizing this potential, the US government has funded research on “conflict early warning” and conflict forecasting for more than 40 years, and current unclassified approaches incorporate nearly every imaginable type of data, from telephone call records to traffic signals, tribal and cultural linkages to satellite imagery. Yet, current approaches have yielded poor outcomes: one recent study showed that the top models of civil war onset miss 90% of the cases they supposedly explain. At the same time, emerging work in the economics disciplines is finding that new approaches, especially those based on latent linguistic indicators, can offer significant predictive power over future physical behavior. The information environment around us records not just factual information, but also a rich array of cultural and contextual influences that offer a window into national consciousness. A growing body of literature has shown that measuring the linguistic dimensions of this real-time consciousness can accurately forecast many broad social behaviors, ranging from box office sales to the stock market itself.
In fact, the United States intelligence community believes so strongly in the ability of surface-level indicators to forecast future physical unrest more successfully than current approaches that it now has an entire program devoted to such “Open Source Indicators.” Yet, few studies have explored the application of these methods to the forecasting of non-economic human societal behavior, and those that exist have focused primarily on large-bore events such as militarized disputes, epidemics, and regime change. One of the reasons for this is the lack of high-resolution cross-national longitudinal data on societal conflict equivalent to the daily indicators available in economics research. This dissertation therefore presents a novel framework for evaluating these new classes of latent-based forecasting measures on high-resolution geographically-enriched quantitative databases of human behavior. To demonstrate this framework, an archive of 4.7 million news articles totaling 1.3 billion words, consisting of the entirety of international news coverage from Agence France Presse, the Associated Press, and Xinhua over the last 30 years, is used to construct a database of more than 29 million global events in over 300 categories using the TABARI coding system and CAMEO event taxonomy, resulting in the largest event database created in the academic literature. The framework is then applied to examine the hypothesis of latent forecasting as a classification problem, demonstrating the ability of a simple example-based classifier not only to return potentially actionable forecasts from latent discourse indicators, but also to quantitatively model the topical traces of the metanarratives that underlie them. The results of this dissertation demonstrate that this new framework provides a powerful new evaluative environment for exploring the emerging class of latent indicators and modeling approaches, and that even rudimentary classification-based models may have significant forecasting potential.
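    The abstract does not specify the example-based classifier in detail. As a hedged illustration of the general idea only, a 1-nearest-neighbor classifier over bag-of-words discourse snippets might look like the sketch below; all snippets, labels, and the query are invented for illustration.

```python
# A minimal sketch of an "example-based" latent-indicator classifier: documents are
# bags of words, each labeled with an (invented) subsequent outcome, and a new
# document takes the label of its most similar labeled example (cosine similarity).
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_example_label(query, examples):
    """Return the label of the most similar labeled example (1-NN)."""
    vq = vectorize(query)
    best_label, best_sim = None, -1.0
    for text, label in examples:
        sim = cosine(vq, vectorize(text))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Invented training examples: discourse snippets labeled by the outcome that followed.
examples = [
    ("protests strikes grievances crowds anger in the capital", "unrest"),
    ("riots clashes police crowds unrest spreading", "unrest"),
    ("trade talks growth agreement calm markets stable", "calm"),
    ("festival tourism stable economy calm season", "calm"),
]

print(nearest_example_label("anger and protests in the capital as crowds gather",
                            examples))  # → unrest
```

    A real evaluation at the scale described above would of course operate over millions of coded events rather than toy snippets, but the classification framing is the same: forecasts are read off the labels of the most similar historical discourse profiles.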

    Railroads and the Making of Modern America -- Tools for Spatio-Temporal Correlation, Analysis, and Visualization

    This project aims to integrate large-scale data sources from the Digging into Data repositories with other types of relevant data on the railroad system, already assembled by the project directors. Our project seeks to develop useful tools for spatio-temporal visualization of these data and the relationships among them. Our interdisciplinary team includes computer science, history, and geography researchers. Because the railroad "system" and its spatio-temporal configuration appeared differently from locality to locality and region to region, we need to adjust how we "locate" and "see" the system. By applying data mining and pattern recognition techniques, software systems can be created that dynamically redefine the way spatial data are represented. Utilizing analytic processes common in computer science, we propose to develop a software framework that allows these embedded concepts to be visualized and further studied.

    Computational approaches to semantic change (Volume 6)

    Semantic change — how the meanings of words change over time — has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th centuries, an emergence that ushered in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change remains the least understood. The study of semantic change has nonetheless progressed steadily, accumulating over more than a century a vast store of knowledge encompassing many languages and language families. Historical linguists also realized early on the potential of computers as research tools, with papers at the very first international conferences on computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans.
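    One common data-driven approach to semantic change (not tied to any single chapter of this volume) compares a word's co-occurrence profile across time slices: the more its contexts diverge, the more its meaning has likely shifted. The sketch below illustrates the idea on invented toy corpora, where "broadcast" shifts from sowing seed to radio transmission while "mill" stays put.

```python
# A minimal sketch of distributional semantic-change detection: build a context
# (co-occurrence) vector for a word in each period and compare the two periods
# with cosine similarity. Lower similarity suggests greater meaning change.
from collections import Counter
import math

def context_vector(target, sentences, window=2):
    """Count words appearing within `window` tokens of each occurrence of `target`."""
    ctx = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok == target:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        ctx[toks[j]] += 1
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented toy corpora: "broadcast" changes meaning, "mill" does not.
period_1900 = ["the farmer would broadcast seed across the field",
               "workers gathered at the mill by the river"]
period_1950 = ["the station will broadcast the evening news",
               "workers gathered at the mill by the river"]

changed_sim = cosine(context_vector("broadcast", period_1900),
                     context_vector("broadcast", period_1950))
stable_sim = cosine(context_vector("mill", period_1900),
                    context_vector("mill", period_1950))
print(changed_sim < stable_sim)  # → True: the changed word's contexts diverge more
```

    At scale, the raw count vectors are typically replaced by diachronic word embeddings trained per period and aligned before comparison, but the underlying signal is the same.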

    Developing natural language processing instruments to study sociotechnical systems

    Identifying temporal linguistic patterns and tracing social amplification across communities has always been vital to understanding modern sociotechnical systems. Now, well into the age of information technology, the growing digitization of text archives powered by machine learning systems has enabled an enormous number of interdisciplinary studies to examine the coevolution of language and culture. However, most research in that domain investigates formal textual records, such as books and newspapers. In this work, I argue that the study of conversational text derived from social media is just as important. I present four case studies to identify and investigate societal developments in longitudinal social media streams with high temporal resolution spanning over 100 languages. These case studies show how everyday conversations on social media encode a unique perspective that is often complementary to observations derived from more formal texts. This unique perspective improves our understanding of modern sociotechnical systems and enables future research in computational linguistics, social science, and behavioral science.
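    A typical first step in identifying temporal linguistic patterns of the kind described above is flagging days on which a term's usage breaks sharply from its recent baseline. The sketch below is a generic illustration, not the dissertation's method; the daily counts are invented.

```python
# A minimal sketch of usage-spike detection in a longitudinal stream: flag a day
# when its count exceeds the mean of the preceding window by `threshold` standard
# deviations.
import statistics

def spike_days(daily_counts, baseline=7, threshold=3.0):
    """Indices of days whose count is > threshold sigma above the preceding window."""
    spikes = []
    for i in range(baseline, len(daily_counts)):
        window = daily_counts[i - baseline:i]
        mean = statistics.mean(window)
        sd = statistics.pstdev(window) or 1.0  # avoid division by zero on flat windows
        if (daily_counts[i] - mean) / sd > threshold:
            spikes.append(i)
    return spikes

counts = [12, 10, 11, 13, 9, 12, 11, 95, 14, 12]  # invented daily mentions of a term
print(spike_days(counts))  # → [7]
```

    Running this per term and per language over a multi-year stream yields candidate moments of social amplification for closer qualitative inspection.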

    Trachoma and Ocular Chlamydial Infection in the Era of Genomics.

    Trachoma is a blinding disease usually caused by infection with Chlamydia trachomatis (Ct) serovars A, B, and C in the upper tarsal conjunctiva. Individuals in endemic regions are repeatedly infected with Ct throughout childhood. A proportion of individuals experience prolonged or severe inflammatory episodes that are known to be significant risk factors for ocular scarring in later life. Continued scarring often leads to trichiasis and in-turning of the eyelashes, which causes pain and can eventually cause blindness. The mechanisms driving the chronic immunopathology in the conjunctiva, which largely progresses in the absence of detectable Ct infection in adults, are likely to be multifactorial. Socioeconomic status, education, and behavior have been identified as contributing to the risk of scarring and inflammation. We focus on the contribution of host and pathogen genetic variation, bacterial ecology of the conjunctiva, and host epigenetic imprinting, including small RNA regulation by both host and pathogen, in the development of ocular pathology. Each of these factors or processes contributes to pathogenic outcomes in other inflammatory diseases, and we outline their potential role in trachoma.

    State of the art 2015: a literature review of social media intelligence capabilities for counter-terrorism

    Overview: This paper is a review of how information and insight can be drawn from open social media sources. It focuses on the specific research techniques that have emerged, the capabilities they provide, the possible insights they offer, and the ethical and legal questions they raise. These techniques are considered relevant and valuable in so far as they can help to maintain public safety by preventing terrorism, preparing for it, protecting the public from it, and pursuing its perpetrators. The report also considers how far this can be achieved against the backdrop of radically changing technology and public attitudes towards surveillance. This is an updated version of a 2013 paper on the same subject, State of the Art. Since 2013, there have been significant changes in social media, how it is used by terrorist groups, and the methods being developed to make sense of it. The paper is structured as follows: Part 1 is an overview of social media use, focused on how it is used by groups of interest to those involved in counter-terrorism. This includes new sections on trends across social media platforms and on Islamic State (IS). Part 2 provides an introduction to the key approaches of social media intelligence (henceforth ‘SOCMINT’) for counter-terrorism. Part 3 sets out a series of SOCMINT techniques. For each technique, its capabilities and insights are outlined, the validity and reliability of the method are considered, and its possible application to counter-terrorism work is explored. Part 4 outlines a number of important legal, ethical and practical considerations when undertaking SOCMINT work.

    Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence

    The increasing capacities of large language models (LLMs) present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks previously allocated to human labor. This contribution proposes a systematic mixed methods framework to harness qualitative analytic expertise, machine scalability, and rigorous quantification, with attention to transparency and replicability. Sixteen machine-assisted case studies are showcased as proof of concept. Tasks include linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film, social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments. LLM (and human) annotations may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies illustrate how tasks that previously required months of team effort and complex computational pipelines can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, this approach is not intended to replace, but to augment researcher knowledge and skills. With these opportunities in sight, qualitative expertise and the ability to pose insightful questions have arguably never been more critical.
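    The bootstrapping idea mentioned in the abstract can be sketched generically: resample the annotated items with replacement and report a percentile interval for the human-LLM agreement rate, which can then feed into downstream statistical models. The labels below are invented, and this sketch makes no claim about the contribution's actual pipeline.

```python
# A minimal sketch of bootstrapping an uncertainty interval for the agreement rate
# between paired human and LLM annotations of the same items.
import random

def bootstrap_agreement(human, llm, n_boot=2000, seed=0):
    """95% percentile interval for the human-LLM agreement rate."""
    rng = random.Random(seed)
    n = len(human)
    rates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        rates.append(sum(human[i] == llm[i] for i in idx) / n)
    rates.sort()
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

# Invented paired labels; the observed agreement here is 0.8 (40 of 50 items match).
human = ["pos", "neg", "pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos"] * 5
llm   = ["pos", "neg", "pos", "neg", "neg", "neu", "pos", "neg", "pos", "pos"] * 5

lo, hi = bootstrap_agreement(human, llm)
print(round(lo, 2), round(hi, 2))  # a 95% interval around the observed 0.8
```

    Carrying the interval (rather than a point estimate) into subsequent modeling is what lets annotation error and variation be accounted for instead of ignored.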