Content Analysis of 150 Years of British Periodicals
Previous studies have shown that it is possible to detect macroscopic patterns of cultural change over periods of centuries by analyzing large textual time series, specifically digitized books. This method promises to empower scholars with a quantitative and data-driven tool to study culture and society, but its power has been limited by the use of data from books and by simple analytics based essentially on word counts. This study addresses these problems by assembling a vast corpus of regional newspapers from the United Kingdom, incorporating very fine-grained geographical and temporal information that is not available for books. The corpus spans 150 years and comprises millions of articles, representing 14% of all British regional outlets of the period. Simple content analysis of this corpus allowed us to detect specific events, such as wars, epidemics, coronations, or conclaves, with high accuracy, whereas more refined techniques from artificial intelligence enabled us to move beyond counting words by detecting references to named entities. These techniques allowed us to observe both a systematic underrepresentation of women in the news and a steady increase in their presence during the 20th century, as well as changes in the geographic focus of various concepts. We also estimate the dates when electricity overtook steam and trains overtook horses as a means of transportation, both around the year 1900, along with other cultural transitions. We believe that these data-driven approaches can complement the traditional method of close reading in detecting trends of continuity and change in historical corpora.
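As a rough illustration of the word-count analytics the abstract describes (a minimal sketch, not the authors' pipeline; the `articles` input format and the toy data are assumptions), relative term frequencies per year, and a crossover such as electricity overtaking steam, might be computed like this:

```python
from collections import Counter, defaultdict

def term_frequency_series(articles, terms):
    """Relative frequency of each term per year.

    articles: iterable of (year, text) pairs -- a stand-in for the
    newspaper corpus, whose real schema is not given in the abstract.
    """
    hits = defaultdict(Counter)  # year -> term -> count
    totals = Counter()           # year -> total tokens
    for year, text in articles:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        for tok in tokens:
            if tok in terms:
                hits[year][tok] += 1
    return {t: {y: hits[y][t] / totals[y] for y in sorted(totals)} for t in terms}

# Toy data only; the first year in which "electricity" outstrips "steam"
# would be the estimated crossover date.
toy = [(1890, "the steam engine and more steam"), (1905, "electricity in every home")]
series = term_frequency_series(toy, {"steam", "electricity"})
crossover = min(y for y in series["steam"]
                if series["electricity"][y] > series["steam"][y])
print(crossover)  # 1905 on this toy corpus
```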
Biased Embeddings from Wild Data: Measuring, Understanding and Removing
Many modern Artificial Intelligence (AI) systems make use of data embeddings, particularly in the domain of Natural Language Processing (NLP). These embeddings are learnt from data that has been gathered "from the wild" and have been found to contain unwanted biases. In this paper we make three contributions towards measuring, understanding and removing such biases. We present a rigorous way to measure some of these biases, based on the use of word lists created for social psychology applications; we observe how gender bias in occupations in the embeddings reflects actual gender bias in the same occupations in the real world; and finally we demonstrate how a simple projection can significantly reduce the effects of embedding bias. All this is part of an ongoing effort to understand how trust can be built into AI systems.
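A minimal sketch of projection-style debiasing in the spirit of the abstract (not necessarily the paper's exact method; the function names and example pair list are illustrative assumptions):

```python
import numpy as np

def bias_direction(emb, pairs):
    """Estimate a bias direction as the normalized mean difference of
    definitional word pairs, e.g. [("she", "he"), ("woman", "man")].
    emb: dict mapping word -> 1-D numpy vector."""
    diffs = np.array([emb[a] - emb[b] for a, b in pairs])
    d = diffs.mean(axis=0)
    return d / np.linalg.norm(d)

def remove_bias_component(emb, d):
    """Project every vector onto the subspace orthogonal to d, removing
    the component of each embedding along the bias direction."""
    return {w: v - (v @ d) * d for w, v in emb.items()}
```

After the projection, similarity scores between occupation words and gendered attribute words should flatten, which is one way the claimed reduction in bias could be checked.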
Words are Malleable: Computing Semantic Shifts in Political and Media Discourse
Recently, researchers started to pay attention to the detection of temporal shifts in the meaning of words. However, most (if not all) of these approaches restricted their efforts to uncovering change over time, thus neglecting other valuable dimensions such as social or political variability. We propose an approach for detecting semantic shifts between different viewpoints, broadly defined as sets of texts that share a specific metadata feature, which can be a time period but also a social entity such as a political party. For each viewpoint, we learn a semantic space in which each word is represented as a low-dimensional neural embedded vector. The challenge is to compare the meaning of a word in one space to its meaning in another space and to measure the size of the semantic shift. We compare the effectiveness of a measure based on optimal transformations between the two spaces with a measure based on the similarity of the neighbors of the word in the respective spaces. Our experiments demonstrate that a combination of the two performs best. We show that semantic shifts occur not only over time, but also across different viewpoints within a short period of time. For evaluation, we demonstrate how this approach captures meaningful semantic shifts and can help improve other tasks such as contrastive viewpoint summarization and ideology detection (measured as classification accuracy) in political texts. We also show that the two laws of semantic change which were empirically shown to hold for temporal shifts also hold for shifts across viewpoints: frequent words are less likely to shift meaning, while words with many senses are more likely to do so.
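One common way to realise an "optimal transformation" between two embedding spaces is orthogonal Procrustes alignment; the sketch below assumes that reading (the paper's exact formulation may differ), and its names are illustrative:

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal matrix W minimising ||A @ W - B||_F, where the rows of
    A and B are embeddings of the shared vocabulary, in the same order,
    taken from the two viewpoint spaces."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def shift_size(a_vec, b_vec, W):
    """Cosine distance between a word's aligned vector and its vector in
    the second space: a candidate measure of semantic shift."""
    a = a_vec @ W
    return 1.0 - (a @ b_vec) / (np.linalg.norm(a) * np.linalg.norm(b_vec))
```

A neighbor-based measure would instead compare the overlap of each word's nearest neighbors in the two spaces; the abstract reports that combining the two kinds of measure works best.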
Bioinformatics and Classical Literary Study
This paper describes the Quantitative Criticism Lab, a collaborative initiative between classicists, quantitative biologists, and computer scientists to apply ideas and methods drawn from the sciences to the study of literature. A core goal of the project is the use of computational biology, natural language processing, and machine learning techniques to investigate authorial style, intertextuality, and related phenomena of literary significance. As a case study in our approach, here we review the use of sequence alignment, a common technique in genomics and computational linguistics, to detect intertextuality in Latin literature. Sequence alignment is distinguished by its ability to find inexact verbal similarities, which makes it ideal for identifying phonetic echoes in large corpora of Latin texts. Although especially suited to Latin, sequence alignment in principle can be extended to many other languages.
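For intuition, here is a minimal local-alignment (Smith-Waterman) sketch in the spirit of the genomics technique the abstract describes; the scoring scheme and the example phrases are illustrative assumptions, not the Lab's actual parameters:

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two character sequences.
    Inexact matching is what lets the method catch near-echoes that
    exact string search would miss."""
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# Two made-up Latin-like phrases: a high score despite spelling variation
print(smith_waterman("litora multum ille", "litora multa illa"))
```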
History Playground: A Tool for Discovering Temporal Trends in Massive Textual Corpora
Recent studies have shown that macroscopic patterns of continuity and change over the course of centuries can be detected through the analysis of time series extracted from massive textual corpora. Similar data-driven approaches have already revolutionised the natural sciences and are widely believed to hold similar potential for the humanities and social sciences, driven by the mass-digitisation projects that are currently under way and coupled with the ever-increasing number of documents which are "born digital". As such, new interactive tools are required to discover and extract macroscopic patterns from these vast quantities of textual data. Here we present History Playground, an interactive web-based tool for discovering trends in massive textual corpora. The tool makes use of scalable algorithms to first extract trends from textual corpora, before making them available for real-time search and discovery, presenting users with an interface to explore the data. Included in the tool are algorithms for standardization, regression, and change-point detection in the relative frequencies of n-grams, as well as multi-term indices and comparison of trends across different corpora.
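A minimal sketch of two of the listed ingredients, z-score standardization and single change-point detection on an n-gram frequency series; the tool's actual algorithms are not specified in the abstract, so this is only one plausible reading:

```python
import numpy as np

def standardize(series):
    """z-score an n-gram frequency series so trends from corpora of
    different sizes can be compared on one axis."""
    s = np.asarray(series, dtype=float)
    return (s - s.mean()) / s.std()

def change_point(series):
    """Single change point found by minimising the summed squared error
    of a two-segment piecewise-constant fit -- one simple member of the
    change-point detection family."""
    s = np.asarray(series, dtype=float)
    def sse(x):
        return ((x - x.mean()) ** 2).sum() if len(x) else 0.0
    costs = [sse(s[:k]) + sse(s[k:]) for k in range(1, len(s))]
    return 1 + int(np.argmin(costs))  # index where the new regime starts
```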
Toward the Optimized Crowdsourcing Strategy for OCR Post-Correction
Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach to digitization is to scan the documents into images, and then convert the images into text using Optical Character Recognition (OCR) algorithms. However, the outcome of OCR processing of historical documents is usually inaccurate and requires post-processing error correction. This study investigates how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which crowdsourcing methodology is the most effective in different scenarios and for various research objectives. A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on the Amazon Mechanical Turk platform. The workers had to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were devised. The analysis suggests that in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal structure of the experiment is two-phase with a scanned image. In terms of efficiency, the best results were obtained when using longer texts in a single-stage structure with no image. The study provides practical recommendations to researchers on how to build the optimal crowdsourcing task for OCR post-correction. The developed methodology can also be utilized to create gold-standard historical texts for automatic OCR post-correction. This is the first attempt to systematically investigate the influence of various factors on crowdsourcing-based OCR post-correction and to propose an optimal strategy for this process.
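The paper devises its own accuracy and efficiency measures, which the abstract does not reproduce; as a hedged stand-in, a character error rate (CER) against a reference transcription is one conventional way such correction accuracy could be scored:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions,
    substitutions), computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_error_rate(hypothesis, reference):
    """CER of a worker-corrected text against a reference transcription;
    an illustrative measure, not the one the paper defines."""
    return levenshtein(hypothesis, reference) / max(1, len(reference))
```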
From Critique to Audit: A Pragmatic Approach to the Climate Emergency
Rethinking the work of academics in a time of pressing deadlines for climate action, this paper offers a series of new pragmatic strategies that academics can take up. It suggests a climate pledge in which university teachers promise 5% or more of their teaching time to linking the field of their traditional research to climate issues. It suggests that humanists, social scientists and data scientists need not only to critique the logic of extraction that propels our climate catastrophe, but also to audit individual institutions, writers, and politicians for their continuing engagement with climate, or lack thereof.