13 research outputs found

    Progress of the PRINCIPLE project: promoting MT for Croatian, Icelandic, Irish and Norwegian

    This paper updates the progress made on the PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focuses on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which have been identified as low-resource languages, especially for building effective machine translation (MT) systems. We report initial achievements of the project and ongoing activities aimed at promoting the uptake of neural MT for the low-resource languages of the project.

    Achievements of the PRINCIPLE project: promoting MT for Croatian, Icelandic, Irish and Norwegian

    This paper provides an overview of the main achievements of the completed PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focused on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which are severely low-resource languages, especially for building effective machine translation (MT) systems. We report the achievements of the project primarily in terms of the large amounts of data collected for all four low-resource languages and the promotion of the uptake of neural MT (NMT) for these languages.

    Mot en trebank for amerikanorsk

    This article presents a method for automatically assigning syntactic dependency relations to part of the Corpus of American Norwegian Speech (CANS). Different machine learning techniques and corpora are used. Finally, a measure of expected accuracy is given, along with a comparison with another relatively recently published treebank for spoken Norwegian.

    Tagging a Norwegian Dialect Corpus

    This paper describes an evaluation of five data-driven part-of-speech (PoS) taggers for spoken Norwegian. The taggers all rely on different machine learning mechanisms: decision trees, hidden Markov models (HMMs), conditional random fields (CRFs), long short-term memory networks (LSTMs), and convolutional neural networks (CNNs). We discuss some of the challenges posed by tagging spoken, as opposed to written, language, and in particular the wide range of dialects found in the recordings of the LIA (Language Infrastructure made Accessible) project. The results show that the taggers based on either conditional random fields or neural networks perform much better than the rest, with the LSTM tagger achieving the highest score.

    The LIA Treebank of Spoken Norwegian Dialects

    This article presents the LIA treebank of transcribed spoken Norwegian dialects. It consists of dialect recordings made between 1950 and 1990, which have been digitised, transcribed, and subsequently annotated with morphological and dependency-style syntactic analysis as part of the LIA (Language Infrastructure made Accessible) project at the University of Oslo. In this article, we describe the LIA material of dialect recordings and its transcription, transliteration and further morphosyntactic annotation. We focus in particular on the extension of the native NDT annotation scheme to spoken language phenomena, such as pauses and various types of disfluencies, and present the subsequent conversion of the treebank to the Universal Dependencies scheme. The treebank currently consists of 13,608 tokens, distributed over 1,396 segments taken from three different dialects of spoken Norwegian. The LIA treebank annotation is an ongoing effort, and future releases will extend the current data set.

    Sharing high-quality language resources in the legal domain to develop neural machine translation for under-resourced European languages

    This article reports some of the main achievements of the European Union-funded PRINCIPLE project in collecting high-quality language resources (LRs) in the legal domain for four under-resourced European languages: Croatian, Irish, Norwegian, and Icelandic. After illustrating the significance of this work for developing translation technologies in the context of the European Union and the European Economic Area, the article outlines the main steps of data collection, curation, and sharing of the LRs gathered with the support of public and private data contributors. This is followed by a description of the development pipeline and key features of the state-of-the-art, bespoke neural machine translation (MT) engines for the legal domain that were built using this data. The MT systems were evaluated with a combination of automatic and human methods to validate the quality of the LRs collected in the project, and the high-quality LRs were subsequently shared with the wider community via the ELRC-SHARE repository. The main challenges encountered in this work are discussed, emphasising the importance and the key benefits of sharing high-quality digital LRs.