13 research outputs found

    Adapting vs. Pre-training Language Models for Historical Languages

    As large language models such as BERT are becoming increasingly popular in Digital Humanities (DH), the question has arisen as to how such models can be made suitable for application to specific textual domains, including that of 'historical text'. Large language models like BERT can be pre-trained from scratch on a specific textual domain and achieve strong performance on a series of downstream tasks. However, this is a costly endeavour, both in terms of the computational resources and the substantial amounts of training data it requires. An appealing alternative, then, is to employ existing 'general purpose' models (pre-trained on present-day language) and subsequently adapt them to a specific domain by further pre-training. Focusing on the domain of historical text in English, this paper demonstrates that pre-training on domain-specific (i.e. historical) data from scratch yields a generally stronger background model than adapting a present-day language model. We show this on the basis of a variety of downstream tasks, ranging from established tasks such as Part-of-Speech tagging, Named Entity Recognition and Word Sense Disambiguation, to ad-hoc tasks like Sentence Periodization, which are specifically designed to test historically relevant processing.
    Language Use in Past and Presen

    Computational approaches to semantic change (Volume 6)

    Semantic change (how the meanings of words change over time) has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th century, ushering in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change is the least understood. Ever since, the study of semantic change has progressed steadily, accumulating a vast store of knowledge for over a century, encompassing many languages and language families. Historical linguists also realized early on the potential of computers as research tools, with papers at the very first international conferences in computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea-change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans.

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after establishing CLARIN as a European Research Infrastructure Consortium.