7 research outputs found

    Novel Event Detection and Classification for Historical Texts

    Event processing is an active area of research in the Natural Language Processing community, but the resources and automatic systems developed so far have mainly addressed contemporary texts. However, the recognition and elaboration of events are a crucial step when dealing with historical texts, particularly in the current era of massive digitization of historical sources: research in this domain can lead to the development of methodologies and tools that assist historians in their work, while also having an impact on the field of Natural Language Processing. Our work aims at shedding light on the complex concept of events when dealing with historical texts. More specifically, we introduce new annotation guidelines for event mentions and types, categorised into 22 classes. We then annotate a historical corpus accordingly and compare two approaches for automatic event detection and classification following this novel scheme. We believe this work can foster research in a field of inquiry so far underestimated in the area of Temporal Information Processing. To this end, we release new annotation guidelines, a corpus and new models for automatic annotation.
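
    A minimal sketch of how event mentions and their classes could be represented for a sequence-labeling setup of this kind. The class name, span format, and example sentence are illustrative assumptions, not the paper's actual guidelines or 22-class inventory.

```python
# Convert annotated event spans into BIO tags for token-level
# event detection and classification (illustrative sketch).
from typing import List, Tuple

def spans_to_bio(tokens: List[str],
                 spans: List[Tuple[int, int, str]]) -> List[str]:
    """spans: (start_token, end_token_exclusive, event_class)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the event mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["The", "treaty", "was", "signed", "in", "Vienna", "in", "1815", "."]
spans = [(3, 4, "AGREEMENT")]               # hypothetical event class
print(spans_to_bio(tokens, spans))
# ['O', 'O', 'O', 'B-AGREEMENT', 'O', 'O', 'O', 'O', 'O']
```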

    The challenges of German archival document categorization on insufficient labeled data

    Document exploration in archives is often challenging due to the lack of organization into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach, built on top of the word embeddings, to enhance the exploration of the data. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
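
    A minimal sketch of dataless categorization in the spirit described above: a document vector is built as a TF-IDF-weighted average of word embeddings and assigned to the category whose label embedding is closest by cosine similarity. The toy random vectors, vocabulary, and category labels are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

rng = np.random.default_rng(0)
# Stand-in word embeddings; a real system would load pre-trained vectors.
vocab_vectors = {w: rng.normal(size=50) for w in
                 ["letter", "minister", "war", "trade", "harbor", "tax"]}

def embed(words, weights=None):
    """Average the embeddings of known words, optionally TF-IDF weighted."""
    vecs = [vocab_vectors[w] * (weights.get(w, 1.0) if weights else 1.0)
            for w in words if w in vocab_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

docs = ["letter from the minister about the war",
        "harbor trade and tax records"]
tfidf = TfidfVectorizer().fit(docs)

# Categories are described only by a few seed words (dataless setting).
categories = {"politics": ["minister", "war"], "economy": ["trade", "tax"]}
cat_vecs = {c: embed(ws) for c, ws in categories.items()}

for doc in docs:
    weights = dict(zip(tfidf.get_feature_names_out(),
                       tfidf.transform([doc]).toarray()[0]))
    dvec = embed(doc.split(), weights)
    best = max(cat_vecs, key=lambda c: np.dot(dvec, cat_vecs[c]) /
               (np.linalg.norm(dvec) * np.linalg.norm(cat_vecs[c]) + 1e-9))
    print(doc, "->", best)
```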

    Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization

    In recent years, pre-trained language models (PLMs) have shown remarkable advancements on the extractive summarization task across diverse domains. However, there remains a lack of research specifically in the historical domain. In this paper, we propose a novel method for extractive historical single-document summarization that leverages a domain-aware historical bidirectional language model, pre-trained on a large-scale historical corpus. We then fine-tune the language model specifically for the task of extractive historical single-document summarization. One major challenge for this task is the lack of annotated datasets for historical summarization. To address this issue, we construct a dataset by collecting archived historical documents from the Centre Virtuel de la Connaissance sur l’Europe (CVCE) group at the University of Luxembourg. Furthermore, to better learn the structural features of the input documents, we use a sentence position embedding mechanism that enables the model to encode the position of each sentence. The overall experimental results on our historical dataset collected from the CVCE group show that our method outperforms recent state-of-the-art methods in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on extractive historical text summarization.
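
    A minimal sketch of sentence scoring with a sentence position embedding, as the mechanism above suggests. The dimensions, random sentence encodings, and linear scoring head are illustrative assumptions; the paper fine-tunes a historical-domain PLM rather than using random vectors.

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    def __init__(self, dim=128, max_sents=64):
        super().__init__()
        self.pos_emb = nn.Embedding(max_sents, dim)   # sentence position embedding
        self.score = nn.Linear(dim, 1)                # per-sentence relevance score

    def forward(self, sent_vecs):                     # (num_sents, dim)
        positions = torch.arange(sent_vecs.size(0))
        return self.score(sent_vecs + self.pos_emb(positions)).squeeze(-1)

sent_vecs = torch.randn(10, 128)                      # stand-in for PLM sentence encodings
scores = SentenceScorer()(sent_vecs)
summary_idx = scores.topk(3).indices.sort().values    # keep top-3 sentences in document order
print(summary_idx.tolist())
```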

    Building and Comparing Lemma Embeddings for Latin. Classical Latin versus Thomas Aquinas

    This paper presents a new set of lemma embeddings for the Latin language. The embeddings are trained on a manually annotated corpus of texts belonging to the Classical era: different models, architectures and dimensions are tested and evaluated using a novel benchmark for the synonym selection task. In addition, we release vectors pre-trained on the “Opera Maiora” by Thomas Aquinas, thus providing a resource to analyze Latin from a diachronic perspective. The embeddings built upon the two training corpora are compared to each other to support diachronic lexical studies. The words showing the highest usage change between the two corpora are reported, and a selection of them is discussed.
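
    A minimal sketch of a synonym selection evaluation of the kind used to compare such embeddings: for each target lemma, the model picks the candidate with the highest cosine similarity, and accuracy is the fraction of correct picks. The toy random vectors and the lemma item are illustrative assumptions, not the released benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in lemma embeddings; a real evaluation would load the trained vectors.
vectors = {w: rng.normal(size=100) for w in
           ["bellum", "proelium", "pax", "ager", "gladius"]}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each item: (target lemma, true synonym, distractor lemmas)
items = [("bellum", "proelium", ["pax", "ager", "gladius"])]

correct = 0
for target, gold, distractors in items:
    candidates = [gold] + distractors
    pick = max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))
    correct += (pick == gold)

print("synonym selection accuracy:", correct / len(items))
```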

    Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence

    The increasing capacities of large language models (LLMs) present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks previously typically allocated to human labor. This contribution proposes a systematic mixed methods framework to harness qualitative analytic expertise, machine scalability, and rigorous quantification, with attention to transparency and replicability. Sixteen machine-assisted case studies are showcased as proof of concept. Tasks include linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film, social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments. LLM (and human) annotations may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies illustrate how tasks previously requiring potentially months of team effort and complex computational pipelines can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, this approach is not intended to replace, but to augment, researcher knowledge and skills. With these opportunities in sight, qualitative expertise and the ability to pose insightful questions have arguably never been more critical.
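
    A minimal sketch of propagating annotation uncertainty via bootstrapping, in the spirit of the approach mentioned above: LLM labels are resampled and randomly flipped at the estimated error rate, so the downstream statistic gets an interval rather than a point value. The binary labels, agreement rate, and statistic of interest are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(42)
llm_labels = rng.integers(0, 2, size=500)    # stand-in binary LLM annotations
agreement = 0.9                              # assumed LLM-vs-human agreement on a gold subset
error_rate = 1.0 - agreement

boot_estimates = []
for _ in range(2000):
    sample = rng.choice(llm_labels, size=llm_labels.size, replace=True)
    flips = rng.random(sample.size) < error_rate
    sample = np.where(flips, 1 - sample, sample)   # inject label noise at the error rate
    boot_estimates.append(sample.mean())           # statistic of interest (here: class share)

low, high = np.percentile(boot_estimates, [2.5, 97.5])
print(f"estimated share: {np.mean(boot_estimates):.3f}  95% CI: [{low:.3f}, {high:.3f}]")
```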