81 research outputs found

    Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

    Get PDF
    Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches to 96.3\% accuracy in texts written by Agricola and 87.7\% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and Github.Comment: la 28e Conf\'erence sur le Traitement Automatique des Langues Naturelles (TALN

    On New Text Corpora For Minority Languages On The Helsinki korp.csc.fi Server

    Get PDF
    The korp.csc.fi server in Finland provides text corpora of multiple varieties for numerous languages large and small. The Korp infrastructure is developed by the Swedish Språkbanken in the University and Gothenburg, and the source code is released under MIT license. Open nature of the systems makes it easily transferred into new environments, and there are already numerous Korp installations available. The one we discuss is maintained by the Language Bank of Finland.Peer reviewe

    Model-Based Evaluation of Multilinguality

    Full text link
    • 

    corecore