12,464 research outputs found

    Accessing the content of nineteenth-century periodicals: the Science in the Nineteenth-Century Periodical project

    Nineteenth-century periodicals significantly outnumber books from that era and present historians with an immensely valuable set of sources, but their use is constrained by the difficulty of identifying relevant material. For many periodicals, contents pages and volume indexes have been the only guide, and the few subject indexes that exist usually indicate only the subjects mentioned in article titles. By contrast, the Science in the Nineteenth-Century Periodical project (SciPer) indexed the science content of general-interest periodicals by skim-reading the entire text. The project’s approach to indexing is described, and the relative merits of indexing and digitization in helping researchers locate relevant material are discussed. The article concludes that, notwithstanding the more sophisticated search interfaces of recent retrodigitization projects, human indexing still has an important role to play in providing access to the content of historic periodicals and in mapping their data structure.

    Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

    The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research efforts seeking to automatically process facsimiles and extract information from them are multiplying, with document layout analysis as a first, essential step. While the identification and categorization of segments of interest in document images have seen significant progress in recent years thanks to deep learning techniques, many challenges remain, among them the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Through a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other questions, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models over a strong visual baseline, as well as better robustness to high material variance.
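
    The fusion idea can be made concrete with a minimal sketch, assuming PyTorch and invented feature shapes rather than the paper's actual architecture: visual feature maps from any CNN backbone are concatenated channel-wise with text embeddings rasterized onto the same spatial grid (e.g. word vectors painted into the pixels each OCR token covers), and 1x1 convolutions classify every pixel.

        # Hypothetical multimodal fusion head (a sketch, not the paper's model):
        # concatenate visual and textual feature maps, then classify per pixel.
        import torch
        import torch.nn as nn

        class MultimodalSegHead(nn.Module):
            def __init__(self, visual_ch=256, text_ch=64, n_classes=5):
                super().__init__()
                self.fuse = nn.Conv2d(visual_ch + text_ch, 128, kernel_size=1)
                self.classify = nn.Conv2d(128, n_classes, kernel_size=1)

            def forward(self, visual_feats, text_feats):
                # visual_feats: (B, visual_ch, H, W) from a CNN backbone
                # text_feats:   (B, text_ch, H, W), word embeddings rasterized
                #               onto the page grid from OCR token positions
                x = torch.cat([visual_feats, text_feats], dim=1)
                return self.classify(torch.relu(self.fuse(x)))

        head = MultimodalSegHead()
        logits = head(torch.randn(2, 256, 64, 64), torch.randn(2, 64, 64, 64))
        print(logits.shape)  # torch.Size([2, 5, 64, 64])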

    Article Segmentation in Digitised Newspapers

    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction, but the lack of article segmentation impedes these applications.

    We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We address these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches.

    Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve this problem by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces character-level article segmentations nearly as well as costly human annotators.

    We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. We propose several techniques for block representation and contribute a novel, highly compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance.

    Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art approach of Bansal et al. (2014). We contribute an innovative 2D Markov model approach that captures reading-order dependencies and reduces the structured labelling problem to a Markov chain that we decode with the Viterbi (1967) algorithm. Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation.

    Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
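
    The headline label-spreading idea can be illustrated with a hedged sketch: random vectors stand in for the thesis's similarity embeddings, and a greedy nearest-labelled-neighbour rule stands in for full Zhu and Ghahramani label propagation. Headline blocks seed article labels, and every other block adopts the label of its most similar already-labelled block.

        # Toy illustration of spreading headline labels to similar text blocks
        # (simplified stand-in for label propagation; embeddings are random).
        import numpy as np

        rng = np.random.default_rng(0)
        blocks = rng.normal(size=(8, 16))   # 8 block embeddings, dimension 16
        labels = np.full(8, -1)             # -1 means "no article assigned yet"
        labels[[0, 4]] = [0, 1]             # blocks 0 and 4 are headlines

        blocks /= np.linalg.norm(blocks, axis=1, keepdims=True)
        sim = blocks @ blocks.T             # cosine similarity matrix
        np.fill_diagonal(sim, -np.inf)

        # Each unlabelled block takes the label of its most similar labelled
        # block, repeated until every block belongs to an article.
        while (labels == -1).any():
            for i in np.where(labels == -1)[0]:
                neighbours = np.where(labels != -1)[0]
                labels[i] = labels[neighbours[np.argmax(sim[i, neighbours])]]
        print(labels)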

    Noise-Robust De-Duplication at Scale

    Identifying near duplicates within large, noisy text corpora has a myriad of applications, ranging from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible, despite the massive overall size of the corpus, because duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and a cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10-million-article corpus on a single GPU card in a matter of hours. The public release of our NEWS-COPY de-duplication dataset will facilitate further research and applications.
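
    For concreteness, a minimal sketch of the N-gram overlap style of baseline the study benchmarks against; the tokenization and threshold here are assumptions, not the paper's settings.

        # Sketch of an N-gram Jaccard-overlap duplicate check, the kind of
        # baseline the neural methods are compared to (threshold is assumed).
        def ngrams(text, n=3):
            tokens = text.lower().split()
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

        def is_duplicate(a, b, n=3, threshold=0.5):
            ga, gb = ngrams(a, n), ngrams(b, n)
            if not ga or not gb:
                return False
            jaccard = len(ga & gb) / len(ga | gb)
            return jaccard >= threshold

        wire = "The president signed the bill into law on Tuesday afternoon"
        local = "The president signed the bill into law on Tuesday officials said"
        print(is_duplicate(wire, local))  # True: near-duplicate leads overlap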

    A Massive Scale Semantic Similarity Dataset of Historical English

    A diversity of tasks use language models trained on semantic similarity data. While a variety of datasets capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning the 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of the articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles with their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.
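
    A small sketch of how such positive pairs arise once reproduced articles have been clustered (the clusters below are invented for illustration): every pair of headlines within a cluster summarizes the same underlying wire story, so each pair is a positive example.

        # Turn clusters of reproduced wire articles into positive
        # semantic-similarity pairs from their locally written headlines.
        from itertools import combinations

        # Each cluster: headlines different papers gave the same wire article
        clusters = [
            ["Senate Passes Farm Bill", "Farm Measure Clears Senate"],
            ["Flood Waters Recede in Ohio Valley",
             "Ohio Valley Begins Cleanup as Flood Ebbs",
             "Rivers Fall; Valley Towns Dig Out"],
        ]

        positive_pairs = [pair for cluster in clusters
                          for pair in combinations(cluster, 2)]
        for a, b in positive_pairs:
            print(f"{a!r}  <->  {b!r}")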

    Relation between early life socioeconomic position and all cause mortality in two generations. A longitudinal study of Danish men born in 1953 and their parents

    Objective: To examine (1) the relation between parental socioeconomic position and all cause mortality in two generations, (2) the relative importance of mother’s educational status and father’s occupational status for offspring mortality, and (3) the effect of factors in the family environment on these relations. Design: A longitudinal study with record linkage to the Civil Registration System. The data were analysed using Cox regression models. Setting: Copenhagen, Denmark. Subjects: 2890 men born in 1953, whose mothers were interviewed about family social background in 1968. The vital status of this population and their parents was ascertained from April 1968 to January 2002. Main outcome measures: All cause mortality in study participants, their mothers, and their fathers. Results: A similar pattern of relations between parental social position and all cause mortality in adult life was found in the three father-mother-offspring triads comprising the cohort of men born in 1953, their parents, and their grandparents. The educational status of mothers showed no independent effect on total mortality when father’s occupational social class was included in the model in any of the triads. Low material wealth was the indicator that remained significantly associated with adult all cause mortality in a model also including parental social position and the intellectual climate of the family in 1968. Among the men born in 1953, the influence of material wealth was strongest for deaths later in adult life. Conclusion: Father’s occupational social class is associated with adult mortality in all members of the mother-father-offspring triad. Material wealth seems to be an explanatory factor for this association.
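
    As a hedged illustration of the analytic design only: a Cox proportional-hazards fit with the lifelines library and its bundled example data. The study's actual covariates (father's occupational class, mother's education, material wealth) would enter as dataframe columns in the same way; lifelines is one common implementation, not necessarily what the authors used.

        # Illustrative Cox proportional-hazards fit (stand-in data, not the
        # study's cohort); hazard ratios are reported per covariate.
        from lifelines import CoxPHFitter
        from lifelines.datasets import load_rossi

        df = load_rossi()                  # bundled example survival data
        cph = CoxPHFitter()
        cph.fit(df, duration_col="week", event_col="arrest")
        cph.print_summary()                # coefficients and hazard ratios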

    Discovering emerging topics in textual corpora of galleries, libraries, archives, and museums institutions

    For some decades now, galleries, libraries, archives, and museums (GLAM) institutions have provided access to information resources in digital format. Although some datasets are openly available, they are often not used to their full potential. Recently, approaches such as the so-called Labs within GLAM institutions have promoted the reuse of digital collections in innovative and inspiring ways. In this article, we explore a straightforward computational procedure to identify emerging topics in periodical materials such as newspapers, bibliographies, and journals. The method is illustrated in three use cases based on public digital collections. Tools of this type are expected to promote further use of the digital collections by researchers.
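
    One straightforward procedure of this kind can be sketched as follows (the toy corpus is invented): compare each term's relative frequency in a recent window against an earlier one and flag the largest increases as emerging topics.

        # Toy emerging-topic detector: rank terms by growth in relative
        # frequency between an early and a late window of a periodical corpus.
        from collections import Counter

        docs_by_year = {
            1900: "telegraph telegraph railway cholera",
            1901: "telegraph railway railway wireless",
            1902: "wireless wireless railway aviation",
            1903: "wireless aviation aviation aviation",
        }

        def rel_freq(years):
            counts = Counter(w for y in years for w in docs_by_year[y].split())
            total = sum(counts.values())
            return {w: c / total for w, c in counts.items()}

        early, late = rel_freq([1900, 1901]), rel_freq([1902, 1903])
        growth = {w: late[w] - early.get(w, 0.0) for w in late}
        for term, delta in sorted(growth.items(), key=lambda kv: -kv[1])[:3]:
            print(f"{term}: +{delta:.2f}")   # aviation and wireless emerge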