Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data
This paper presents the systems submitted by the University of Groningen to the English-Kazakh language pair (both translation directions) for the WMT 2019 news translation task. We explore the potential benefits of (i) morphological segmentation (both unsupervised and rule-based), given the agglutinative nature of Kazakh, (ii) data from two additional languages (Turkish and Russian), given the scarcity of English-Kazakh data, and (iii) synthetic data, both for the source and for the target language. Our best submissions ranked second for Kazakh-English and third for English-Kazakh in terms of the BLEU automatic evaluation metric.
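The unsupervised segmentation route mentioned above is typically realized with a data-driven subword learner such as byte-pair encoding (BPE). The sketch below is a minimal, illustrative BPE learner, not the authors' actual pipeline; the toy corpus and merge count are assumptions chosen to show how a frequent agglutinative suffix surfaces as a subword unit.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict."""
    # Represent each word as a tuple of symbols (initially characters).
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the best merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus with a frequent Turkic-style plural suffix "lar".
corpus = {"kitaplar": 5, "dostlar": 4, "kitap": 3, "dost": 2}
merges, vocab = learn_bpe(corpus, 10)
```

Because "la" and "ar" are the most frequent adjacent pairs in this toy corpus, the learner's first merges begin assembling the shared suffix, which is the behavior that makes such methods attractive for agglutinative languages like Kazakh.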
Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
On understanding character-level models for representing morphology
Morphology is the study of how words are composed of smaller units of meaning (morphemes). It allows humans to create, memorize, and understand words in their language. To process and understand human languages, we expect our computational models to also learn morphology. Recent advances in neural network models provide us with models that compose word representations from smaller units like word segments, character n-grams, or characters. These so-called subword unit models do not explicitly model morphology, yet they achieve impressive performance across many multilingual NLP tasks, especially on languages with complex morphological processes. This thesis aims to shed light on the following questions: (1) What do subword unit models learn about morphology? (2) Do we still need prior knowledge about morphology? (3) How do subword unit models interact with morphological typology?
First, we systematically compare various subword unit models and study their performance across language typologies. We show that models based on characters are particularly effective because they learn orthographic regularities which are consistent with morphology. To understand which aspects of morphology are not captured by these models, we compare them with an oracle with access to explicit morphological analysis. We show that in the case of dependency parsing, character-level models are still poor at representing words with ambiguous analyses. We then demonstrate how explicit modeling of morphology is helpful in such cases. Finally, we study how character-level models perform in low-resource, cross-lingual NLP scenarios, and whether they can facilitate cross-linguistic transfer of morphology across related languages. While we show that cross-lingual character-level models can improve low-resource NLP performance, our analysis suggests that this is mostly because of the structural similarities between languages, and we do not yet find any strong evidence of cross-linguistic transfer of morphology. This thesis presents a careful, in-depth study and analysis of character-level models and their relation to morphology, providing insights and future research directions for building morphologically aware computational NLP models.
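The composition step the thesis studies, building a word representation from smaller units such as character n-grams, can be sketched in the fastText style: hash each n-gram into a fixed embedding table and average the rows. The class name, dimensions, and hashing scheme below are illustrative assumptions, not the thesis's actual models.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams with boundary markers (fastText-style)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

class NgramComposer:
    """Compose a word vector as the mean of hashed n-gram embeddings."""
    def __init__(self, dim=64, buckets=10_000, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.1, size=(buckets, dim))
        self.buckets = buckets

    def embed(self, word):
        # Deterministically hash each n-gram into a bucket of the table.
        idx = [zlib.crc32(g.encode()) % self.buckets
               for g in char_ngrams(word)]
        return self.table[idx].mean(axis=0)

composer = NgramComposer()
# Morphological relatives share many n-grams ("<wa", "wal", "alk", ...),
# so their vectors are composed from overlapping rows of the table.
v_walked, v_walking = composer.embed("walked"), composer.embed("walking")
```

This overlap is exactly the "orthographic regularities consistent with morphology" effect: related surface forms get correlated vectors without any explicit morphological analysis.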
Survey of Low-Resource Machine Translation
We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.
Unsupervised Morphological Segmentation and Part-of-Speech Tagging for Low-Resource Scenarios
With the high cost of manually labeling data and the increasing interest in low-resource languages, for which human annotators might not be even available, unsupervised approaches have become essential for processing a typologically diverse set of languages, whether high-resource or low-resource. In this work, we propose new fully unsupervised approaches for two tasks in morphology: unsupervised morphological segmentation and unsupervised cross-lingual part-of-speech (POS) tagging, which have been two essential subtasks for several downstream NLP applications, such as machine translation, speech recognition, information extraction and question answering.
We propose a new unsupervised morphological-segmentation approach that utilizes Adaptor Grammars (AGs), nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFGs), where a PCFG models word structure in the task of morphological segmentation. We implement the approach as a publicly available morphological-segmentation framework, MorphAGram, that enables unsupervised morphological segmentation through the use of several proposed language-independent grammars. In addition, the framework allows for the use of scholar knowledge, when available, in the form of affixes that can be seeded into the grammars. The framework handles the cases when the scholar-seeded knowledge is either generated from language resources, possibly by someone who does not know the language, as weak linguistic priors, or generated by an expert in the underlying language as strong linguistic priors. Another form of linguistic priors is the design of a grammar that models language-dependent specifications. We also propose a fully unsupervised learning setting that approximates the effect of scholar-seeded knowledge through self-training. Moreover, since there is no single grammar that works best across all languages, we propose an approach that picks a nearly optimal configuration (a learning setting and a grammar) for an unseen language, a language that is not part of the development. Finally, we examine multilingual learning for unsupervised morphological segmentation in low-resource setups.
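A full Adaptor Grammar sampler is beyond a short snippet, but the seeding idea above, supplying scholar-provided affixes that bias the analysis of a word into stem and suffixes, can be illustrated with a toy segmenter. The suffix list, the greedy longest-match rule, and the minimum stem length are hypothetical illustrations only, not the MorphAGram grammars themselves.

```python
def segment(word, suffixes, min_stem=3):
    """Greedily strip seeded suffixes from the right, longest match first.

    A toy stand-in for grammar-based segmentation with scholar-seeded
    affixes; min_stem prevents the stem from being stripped away entirely.
    """
    morphs = []
    while True:
        match = max((s for s in suffixes
                     if word.endswith(s) and len(word) - len(s) >= min_stem),
                    key=len, default=None)
        if match is None:
            break
        morphs.append(match)
        word = word[:-len(match)]
    return [word] + morphs[::-1]

# Hypothetical seeded suffixes (English-like, for illustration).
seeds = {"s", "ing", "ed", "er", "ness"}
segment("walkers", seeds)   # ['walk', 'er', 's']
segment("walking", seeds)   # ['walk', 'ing']
```

In the actual framework such seeds act as priors inside a probabilistic grammar rather than as hard rules, which is what lets weak or noisy scholar knowledge still help without being trusted absolutely.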
For unsupervised POS tagging, two cross-lingual approaches have been widely adopted: 1) annotation projection, where POS annotations are projected across an aligned parallel text from a source language, for which a POS tagger is available, to the target language prior to training a POS model; and 2) zero-shot model transfer, where a model of a source language is directly applied to texts in the target language. We propose an end-to-end architecture for unsupervised cross-lingual POS tagging via annotation projection in truly low-resource scenarios that do not assume access to parallel corpora that are large in size or represent a specific domain. We integrate and expand the best practices in alignment and projection and design a rich neural architecture that exploits non-contextualized and transformer-based contextualized word embeddings, affix embeddings and word-cluster embeddings. Additionally, since parallel data might be available between the target language and multiple source languages, as in the case of the Bible, we propose different approaches for learning from multiple sources. Finally, we combine our work on unsupervised morphological segmentation and unsupervised cross-lingual POS tagging by conducting unsupervised stem-based cross-lingual POS tagging via annotation projection, which relies on the stem as the core unit of abstraction for alignment and projection, a choice that is beneficial for low-resource morphologically complex languages. We also examine morpheme-based alignment and projection, the use of linguistic priors towards better POS models, and the use of segmentation information as learning features in the neural architecture.
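The projection step in approach (1) can be sketched very simply: given word alignments between a tagged source sentence and an untagged target sentence, copy each source tag onto its aligned target token. This is a deliberately minimal illustration of that single step, not the full architecture described above; function and tag names are illustrative.

```python
def project_tags(source_tags, alignments, target_len, unk="X"):
    """Project POS tags from source to target along word alignments.

    source_tags: list of POS tags, one per source token
    alignments:  list of (source_index, target_index) pairs
    target_len:  number of target tokens
    Unaligned target tokens receive the placeholder tag `unk`.
    """
    projected = [unk] * target_len
    for src_i, tgt_j in alignments:
        projected[tgt_j] = source_tags[src_i]
    return projected

# Source "the dog sleeps" is tagged; the 2-token target aligns to dog/sleeps.
src_tags = ["DET", "NOUN", "VERB"]
align = [(1, 0), (2, 1)]
project_tags(src_tags, align, 2)  # ['NOUN', 'VERB']
```

The projected tags then serve as (noisy) supervision for training a target-language tagger, which is why alignment quality and the handling of unaligned tokens matter so much in low-resource settings.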
We conduct comprehensive evaluation and analysis to assess the performance of our approaches to unsupervised morphological segmentation and unsupervised POS tagging, and show that they achieve state-of-the-art performance on the two morphology tasks when evaluated on a large set of languages of different typologies: analytic, fusional, agglutinative and synthetic/polysynthetic.