108 research outputs found

    The only option is open: Why should language technology and resources be free?

    Proceedings of the NODALIDA 2011 Workshop Visibility and Availability of LT Resources. Editors: Sjur Nørstebø Moshagen and Per Langgård. NEALT Proceedings Series, Vol. 13 (2011), 1–2. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/1697

    An Italian to Catalan RBMT system reusing data from existing language pairs

    This paper presents an Italian→Catalan RBMT system built automatically by combining the linguistic data of the existing pairs Spanish–Catalan and Spanish–Italian. Lightweight manual postprocessing is carried out to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.

    OmniLingo: Listening- and speaking-based language learning

    In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications, and a demonstration client built using the architecture. The architecture is based on the InterPlanetary File System (IPFS) and puts user sovereignty over data at the forefront.

    Data-Driven Morphological Analysis for Uralic Languages

    This paper describes an initial set of experiments in data-driven morphological analysis of Uralic languages. The paper differs from previous work in that our work covers both lemmatization and generating ambiguous analyses. While hand-crafted finite-state transducers represent the state of the art in morphological analysis for most Uralic languages, we believe that there is a place for data-driven approaches, especially with respect to making up for lack of completeness in the lexicon. We present results for nine Uralic languages that show that, at least for basic nominal morphology for six out of the nine languages, data-driven methods can achieve an F-score of over 90%, providing results that approach those of finite-state techniques. We also compare our system to an earlier approach to Finnish data-driven morphological analysis (Silfverberg and Hulden, 2018) and show that our system outperforms this baseline.

    Towards an open-source universal-dependency treebank for Erzya

    This article describes the first steps towards an open-source dependency treebank for Erzya based on universal dependency (UD) annotation standards. The treebank contains 610 sentences with 6661 tokens and is based on texts from a range of open-source and public domain original Erzya sources. This ensures its free availability and extensibility. Texts in the treebank are first morphologically analyzed and disambiguated, after which they are annotated manually for dependency structure. In the article we present some issues in dependency syntax for Erzya and how they are analyzed in the universal-dependency framework. Preliminary statistics are given for dependency parsing of Erzya, along with points of interest for future research.

    Delineating Turkic non-finite verb forms by syntactic function

    In this paper, we argue against the primary categories of non-finite verb used in the Turkology literature: “participle” (причастие ‹pričastije›) and “converb” (деепричастие ‹dejepričastije›). We argue that both of these terms conflate several discrete phenomena, and that they furthermore are not coherent as umbrella terms for these phenomena. Based on detailed study of the non-finite verb morphology and syntax of a wide range of Turkic languages (presented here are Turkish, Kazakh, Kyrgyz, Tatar, Tuvan, and Sakha), we instead propose delineation of these categories according to their morphological and syntactic properties. Specifically, we propose that more accurate categories are verbal noun, verbal adjective, verbal adverb, and infinitive. This approach has far-reaching implications for the study of syntactic phenomena in Turkic languages, ranging from relative clauses to clause chaining.

    A morphological analyser for Maltese

    This article describes the development of a free/open-source morphological description of Maltese, originally created as the analysis component in a rule-based machine translation system for Maltese to Arabic and later applied to other tasks. The lexicon formalism we use is lttoolbox, part of the Apertium machine translation platform. An evaluation of the analyser shows that the coverage is adequate, at 84.90%, while precision is 92.5% on a large automatically annotated test set and 96.2% on a smaller hand-validated set.

    A Free/Open-Source Morphological Analyser and Generator for Sakha

    We present, to our knowledge, the first ever published morphological analyser and generator for Sakha, a marginalised language of Siberia. The transducer, developed using HFST, has coverage of solidly above 90%, and high precision. In the development of the analyser, we have expanded linguistic knowledge about Sakha, and developed strategies for complex grammatical patterns. The transducer is already being used in downstream tasks, including computer-assisted language learning applications for linguistic maintenance and computational linguistic shared tasks.

    Keyword spotting for audiovisual archival search in Uralic languages

    Publisher Copyright: © 2021 IWCLUL 2021 - 7th International Workshop on Computational Linguistics of Uralic Languages, Proceedings. All rights reserved. In this study we investigate the potential of using Automatic Speech Recognition (ASR) for keyword spotting for four Uralic languages: Finnish, Hungarian, Estonian and Komi. These languages also represent different levels on the high and low resource continuum. Although the accuracy of the ASR systems shows there is a long way to go, we show that they still have potential to be useful for downstream tasks such as keyword spotting. By using a simple text search after running ASR, we are already able to achieve an F1 score of between 0.15 and 0.33, a precision of nearly 0.90 for Estonian and Hungarian, and a precision of 0.76 for Komi.