Accessing the content of nineteenth-century periodicals: the Science in the Nineteenth-Century Periodical project
Nineteenth-century periodicals significantly outnumber books from that era, and present historians with an immensely valuable set of sources, but their use is constrained by the difficulty of identifying relevant material. For many periodicals, contents pages and volume indexes have been the only guide, and the few subject indexes that exist usually provide only an indication of the subjects mentioned in the article titles. By contrast, the Science in the Nineteenth-Century Periodical project (SciPer) indexed the science content of general-interest periodicals by skim-reading the entire text. The project’s approach to indexing is described, and the relative merits of indexing and digitization in helping researchers locate relevant material are discussed. The article concludes that, notwithstanding the more sophisticated search interfaces of more recent retrodigitization projects, human indexing still has an important role to play in providing access to the content of historic periodicals and in mapping their data structure.
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research seeking to automatically process facsimiles and extract information from them is multiplying, with document layout analysis as a first essential step. While the identification and categorization of segments of interest in document images have seen significant progress in recent years thanks to deep learning techniques, many challenges remain, among them the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Through a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other questions, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models over a strong visual baseline, as well as better robustness to high material variance.
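The abstract above describes combining visual and textual features before classification. A minimal sketch of that fusion idea, under invented names and sizes (the paper's actual model is a deep segmentation network, not a linear scorer):

```python
import numpy as np

# Toy illustration of multimodal late fusion: each image patch has a visual
# feature vector (e.g. from a CNN) and a textual one (e.g. an embedding of
# the OCR text falling inside the patch); they are concatenated before a
# per-patch classification step. All dimensions here are arbitrary.
rng = np.random.default_rng(0)

def fuse(visual, textual):
    # late fusion by concatenation along the feature axis
    return np.concatenate([visual, textual], axis=-1)

def predict(features, weights, bias):
    # linear scorer over fused features; argmax picks a class per patch
    return (features @ weights + bias).argmax(axis=-1)

n_patches, d_vis, d_txt, n_classes = 8, 16, 8, 4
visual = rng.normal(size=(n_patches, d_vis))
textual = rng.normal(size=(n_patches, d_txt))

fused = fuse(visual, textual)                # shape (8, 24)
W = rng.normal(size=(d_vis + d_txt, n_classes))
labels = predict(fused, W, np.zeros(n_classes))  # one class id per patch
```

Swapping the linear scorer for a trained network gives the multimodal setup the abstract compares against its visual-only baseline.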
Article Segmentation in Digitised Newspapers
Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction, but the lack of article segmentation impedes them. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We address these issues with a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent, practical guidelines on handling borderline cases, and a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We therefore contribute a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces character-level article segmentations nearly as accurate as those of costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering.
We propose several techniques for block representation and contribute a novel, highly compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art approach of Bansal et al. (2014). We contribute an innovative 2D Markov model approach that captures reading-order dependencies and reduces the structured labelling problem to a Markov chain that we decode with the Viterbi (1967) algorithm. Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation, and our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
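The label propagation step mentioned above (Zhu and Ghahramani, 2002) can be sketched in a few lines. This is a generic toy version, not the thesis's implementation: seed labels on headline blocks are spread to unlabelled body blocks through a block-similarity matrix, with seeds clamped each iteration.

```python
import numpy as np

def propagate(W, seeds, n_iter=50):
    """Spread one-hot seed labels through similarity matrix W.

    W: (n_blocks, n_blocks) nonnegative similarities.
    seeds: (n_blocks, n_labels) one-hot rows for labelled blocks, zero rows
    for unlabelled ones. Returns an article id per block.
    """
    P = W / W.sum(axis=1, keepdims=True)   # row-normalised transition matrix
    F = seeds.astype(float).copy()
    clamped = seeds.sum(axis=1) > 0        # seed rows stay fixed
    for _ in range(n_iter):
        F = P @ F
        F[clamped] = seeds[clamped]
    return F.argmax(axis=1)

# toy page: blocks 0 and 1 are headlines of two articles (the seeds);
# blocks 2 and 3 are body text, each strongly similar to one headline
W = np.array([[1.0, 0.1, 0.9, 0.1],
              [0.1, 1.0, 0.1, 0.9],
              [0.9, 0.1, 1.0, 0.1],
              [0.1, 0.9, 0.1, 1.0]])
seeds = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])
assigned = propagate(W, seeds)  # → array([0, 1, 0, 1]): body joins its headline
```

In the thesis the similarities would come from the block representations (e.g. similarity embeddings) rather than a hand-written matrix.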
Noise-Robust De-Duplication at Scale
Identifying near duplicates within large, noisy text corpora has a myriad of
applications that range from de-duplicating training datasets, reducing privacy
risk, and evaluating test set leakage, to identifying reproduced news articles
and literature within large corpora. Across these diverse applications, the
overwhelming majority of work relies on N-grams. Limited efforts have been made
to evaluate how well N-gram methods perform, in part because it is unclear how
one could create an unbiased evaluation dataset for a massive corpus. This
study uses the unique timeliness of historical news wires to create a 27,210
document dataset, with 122,876 positive duplicate pairs, for studying
noise-robust de-duplication. The time-sensitivity of news makes comprehensive
hand labelling feasible - despite the massive overall size of the corpus - as
duplicates occur within a narrow date range. The study then develops and
evaluates a range of de-duplication methods: hashing and N-gram overlap (which
predominate in the literature), a contrastively trained bi-encoder, and a
re-rank style approach combining a bi- and cross-encoder. The neural approaches
significantly outperform hashing and N-gram overlap. We show that the
bi-encoder scales well, de-duplicating a 10 million article corpus on a single
GPU card in a matter of hours. The public release of our NEWS-COPY
de-duplication dataset will facilitate further research and applications.
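The N-gram overlap baseline that this study evaluates against can be illustrated with a minimal sketch. Function names and the threshold are illustrative, not from the paper: documents are shingled into word N-grams and pairs above a Jaccard-similarity cutoff are flagged as near duplicates.

```python
def ngrams(text, n=3):
    # shingle a document into word n-grams
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # set overlap: |intersection| / |union|
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, n=3, threshold=0.5):
    # brute-force pairwise comparison; real systems use hashing (e.g. MinHash)
    # to avoid the quadratic scan over a massive corpus
    shingled = [ngrams(d, n) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if jaccard(shingled[i], shingled[j]) >= threshold]

docs = [
    "the president spoke to reporters at the white house today",
    "the president spoke to reporters at the white house on monday",
    "local team wins championship game in overtime thriller",
]
dupes = near_duplicates(docs)  # → [(0, 1)]: the two wire variants match
```

The paper's point is that noisy OCR breaks exact N-gram matches, which is where the contrastively trained bi-encoder pulls ahead of this kind of baseline.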
A Massive Scale Semantic Similarity Dataset of Historical English
A diversity of tasks use language models trained on semantic similarity data.
While there are a variety of datasets that capture semantic similarity, they
are either constructed from modern web data or are relatively small datasets
created in the past decade by human annotators. This study utilizes a novel
source, newly digitized articles from off-copyright, local U.S. newspapers, to
assemble a massive-scale semantic similarity dataset spanning 70 years from
1920 to 1989 and containing nearly 400M positive semantic similarity pairs.
Historically, around half of articles in U.S. local newspapers came from
newswires like the Associated Press. While local papers reproduced articles
from the newswire, they wrote their own headlines, which form abstractive
summaries of the associated articles. We associate articles and their headlines
by exploiting document layouts and language understanding. We then use deep
neural methods to detect which articles are from the same underlying source, in
the presence of substantial noise and abridgement. The headlines of reproduced
articles form positive semantic similarity pairs. The resulting publicly
available HEADLINES dataset is significantly larger than most existing semantic
similarity datasets and covers a much longer span of time. It will facilitate
the application of contrastively trained semantic similarity models to a
variety of tasks, including the study of semantic change across space and time.
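The pair-construction logic described above can be sketched abstractly. This is a hypothetical reduction of the pipeline, not the paper's code: once reproduced articles are grouped into clusters of the same underlying wire story, every pair of headlines within a cluster becomes a positive semantic-similarity pair.

```python
from itertools import combinations

def headline_pairs(clusters):
    """clusters: lists of headlines written by different local papers
    for the same underlying newswire article. Each within-cluster pair
    of headlines is a positive semantic-similarity example."""
    pairs = []
    for headlines in clusters:
        pairs.extend(combinations(headlines, 2))
    return pairs

# invented toy clusters standing in for detected reproductions
clusters = [
    ["Mayor Unveils Budget", "City Budget Plan Announced"],
    ["Storm Hits Coast", "Coastal Storm Causes Damage", "Hurricane Lashes Shore"],
]
pairs = headline_pairs(clusters)  # 1 + 3 = 4 positive pairs
```

Because the pairs grow quadratically within each cluster, even modest clusters over 70 years of newspapers yield the hundreds of millions of pairs the abstract reports.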
Relation between early life socioeconomic position and all cause mortality in two generations. A longitudinal study of Danish men born in 1953 and their parents
Objective: To examine (1) the relation between parental socioeconomic position and all cause mortality in two generations, (2) the relative importance of mother’s educational status and father’s occupational status on offspring mortality, and (3) the effect of factors in the family environment on these relations.
Design: A longitudinal study with record linkage to the Civil Registration System. The data were analysed using Cox regression models.
Setting: Copenhagen, Denmark.
Subjects: 2890 men born in 1953, whose mothers were interviewed regarding family social background in 1968. The vital status of this population and their parents was ascertained from April 1968 to January 2002.
Main outcome measures: All cause mortality in study participants, their mothers, and fathers.
Results: A similar pattern of relations was found between parental social position and all cause mortality in adult life in the three father-mother-offspring triads formed by the cohort of men born in 1953, their parents, and their grandparents. The educational status of mothers showed no independent effect on total mortality when father’s occupational social class was included in the model in either of the triads. Low material wealth was the indicator that remained significantly associated with adult all cause mortality in a model also including parental social position and the intellectual climate of the family in 1968. In the men born in 1953 the influence of material wealth was strongest for deaths later in adult life.
Conclusion: Father’s occupational social class is associated with adult mortality in all members of the mother-father-offspring triad. Material wealth seems to be an explanatory factor for this association.
Discovering emerging topics in textual corpora of galleries, libraries, archives, and museums institutions
For some decades now, galleries, libraries, archives, and museums (GLAM) institutions have provided access to information resources in digital format. Although some datasets are openly available, they are often not used to their full potential. Recently, approaches such as the so-called Labs within GLAM institutions have promoted the reuse of digital collections in innovative and inspiring ways. In this article, we explore a straightforward computational procedure to identify emerging topics in periodical materials such as newspapers, bibliographies, and journals. The method is illustrated in three use cases based on public digital collections. Tools of this type are expected to encourage further use of the digital collections by researchers.
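One straightforward way to operationalise "emerging topics" of the kind this article describes is to compare term frequencies between an earlier and a later time window and rank terms by relative growth. This sketch is an assumption about the general technique, not the article's specific procedure; the smoothing and threshold are illustrative.

```python
from collections import Counter

def emerging_terms(early_docs, late_docs, min_count=2):
    # count word occurrences in each time window
    early = Counter(w for d in early_docs for w in d.lower().split())
    late = Counter(w for d in late_docs for w in d.lower().split())
    # score late-window terms by smoothed growth over the early window
    scores = {term: count / (early.get(term, 0) + 1)
              for term, count in late.items() if count >= min_count}
    return sorted(scores, key=scores.get, reverse=True)

# invented toy snippets standing in for two periods of a digitised periodical
early = ["steam engine railway", "railway steam"]
late = ["electric tram electric tram", "electric light"]
ranked = emerging_terms(early, late)  # → ['electric', 'tram']
```

In practice one would apply this over OCR-cleaned, stopword-filtered text per decade or per volume, which is the kind of reuse of open GLAM collections the article advocates.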