Machine Translation for English--Inuktitut with Segmentation, Data Acquisition and Pre-Training

Author: Edman Lukas
Kelly Kevin
Minnema Gosse
Roest Christian
Spenader Jennifer
Toral Antonio
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/11/2020

ARTS repository - University of Groningen

Translating to and from low-resource polysynthetic languages presents numerous challenges for NMT. We present the results of our systems for the English--Inuktitut language pair for the WMT 2020 translation tasks. We investigated the importance of correct morphological segmentation, whether or not adding data from a related language (Greenlandic) helps, and whether using contextual word embeddings improves translation. While each method showed some promise, the results are mixed.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Multilingual representations and models for improved low-resource language processing

Author: Jalili Sabet Masoud
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 18/07/2022

Word representations are the cornerstone of modern NLP. Representing words or characters as real-valued vectors, static representations that capture semantics and encode meaning, has been popular among researchers. In more recent years, pretrained language models that use large amounts of data to create contextualized representations have achieved great performance in various tasks such as Semantic Role Labeling. These large pretrained language models are capable of storing and generalizing information and can be used as knowledge bases. Language models can produce multilingual representations while using only monolingual data during training. These multilingual representations can be beneficial in many tasks such as Machine Translation. Further, knowledge extraction models that relied only on information extracted from English resources can now benefit from extra resources in other languages. Although these results were achieved for high-resource languages, there are thousands of languages that do not have large corpora. Moreover, for other tasks such as machine translation, if large monolingual data is not available, the models need parallel data, which is scarce for most languages. Further, many languages lack tokenization models, and splitting the text into meaningful segments such as words is not trivial. Although using subwords helps the models achieve better coverage of unseen data and new words in the vocabulary, generalizing over low-resource languages with different alphabets and grammars is still a challenge. This thesis investigates methods to overcome these issues for low-resource languages. In the first publication, we explore the degree of multilinguality in multilingual pretrained language models. We demonstrate that these language models can produce high-quality word alignments without using parallel training data, which is not available for many languages.
In the second paper, we extract word alignments for all available language pairs in the public bible corpus (PBC). Further, we created a tool for exploring these alignments, which are especially helpful in studying low-resource languages. The third paper investigates word alignment in multiparallel corpora and exploits graph algorithms for extracting new alignment edges. In the fourth publication, we propose a new model to iteratively generate cross-lingual word embeddings and extract word alignments when only small parallel corpora are available. Lastly, the fifth paper finds that aggregating different granularities of text can improve word alignment quality. We propose using subword sampling to produce such granularities.

Digitale Hochschulschriften der LMU

Word Alignment for Languages with Scarce Resources

Author: Martin Joel
Mihalcea Rada, 1974-
Pedersen Ted
Publication date: 01/06/2005

This paper presents the task definition, resources, participating systems, and comparative results for the shared task on word alignment, which was organized as part of the Association for Computational Linguistics (ACL) 2005 Workshop on Building and Using Parallel Texts. The shared task included English-Inuktitut, Romanian-English, and English-Hindi sub-tasks, and drew the participation of ten teams from around the world with a total of 50 systems.

UNT Digital Library

Universal Phone Recognition with a Multilingual Allophone System

Author: Anastasopoulos Antonios
Black Alan W
Dalmia Siddharth
Lee Matthew
Li Juncheng
Li Xinjian
Littell Patrick
Metze Florian
Mortensen David R.
Neubig Graham
Yao Jiali
Publication date: 26/02/2020

Multilingual models can improve language processing, particularly for low-resource situations, by sharing parameters across languages. Multilingual acoustic models, however, generally ignore the difference between phonemes (sounds that can support lexical contrasts in a particular language) and their corresponding phones (the sounds that are actually spoken, which are language independent). This can lead to performance degradation when combining a variety of training languages, as identically annotated phonemes can actually correspond to several different underlying phonetic realizations. In this work, we propose a joint model of both language-independent phone and language-dependent phoneme distributions. In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute in low-resource conditions. Additionally, because we are explicitly modeling language-independent phones, we can build a (nearly-)universal phone recognizer that, when combined with the PHOIBLE large, manually curated database of phone inventories, can be customized into 2,000 language-dependent recognizers. Experiments on two low-resourced indigenous languages, Inuktitut and Tusom, show that our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world. Comment: ICASSP 202

arXiv.org e-Print Archive

Crossref