Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages
Morphological segmentation for polysynthetic languages is challenging,
because a word may consist of many individual morphemes and training data can
be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define
the state of the art for morphological segmentation in high-resource settings
and for (mostly) European languages, we first show that they also obtain
competitive performance for Mexican polysynthetic languages in minimal-resource
settings. We then propose two novel multi-task training approaches (one with
and one without the need for external unlabeled resources) and two corresponding
data augmentation methods, improving over the neural baseline for all languages.
Finally, we explore cross-lingual transfer as a third way to fortify our neural
model and show that we can train one single multi-lingual model for related
languages while maintaining comparable or even improved performance, thus
reducing the number of parameters by close to 75%. We provide our morphological
segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for
future research.
Comment: Long Paper, 16th Annual Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies
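Neural morphological segmentation of the kind described above is commonly framed as character-level sequence transduction: the source is the character sequence of a word, and the target is the same sequence with morpheme-boundary symbols inserted. A minimal sketch of this data preparation step, assuming this framing (the word and segmentation below are invented for illustration, not drawn from the paper's datasets):

```python
def to_seq2seq_pair(word, morphemes, boundary="!"):
    """Build (source, target) character sequences for a seq2seq
    segmentation model: the source is the raw word's characters,
    the target the same characters with a boundary symbol inserted
    between consecutive morphemes."""
    assert "".join(morphemes) == word, "morphemes must concatenate to the word"
    source = list(word)
    target = []
    for i, morpheme in enumerate(morphemes):
        if i > 0:
            target.append(boundary)
        target.extend(morpheme)
    return source, target

# Hypothetical example (not real data from the paper):
src, tgt = to_seq2seq_pair("nepiine", ["ne", "pii", "ne"])
# src == ['n', 'e', 'p', 'i', 'i', 'n', 'e']
# tgt == ['n', 'e', '!', 'p', 'i', 'i', '!', 'n', 'e']
```

A model trained on such pairs learns to copy characters and decide where to emit the boundary symbol, which is what makes seq2seq architectures applicable to segmentation.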
Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages
Machine translation from polysynthetic to fusional languages is a challenging
task, which gets further complicated by the limited amount of parallel text
available. Thus, translation performance is far from the state of the art for
high-resource and more intensively studied language pairs. To shed light on the
phenomena which hamper automatic translation to and from polysynthetic
languages, we study translations from three low-resource, polysynthetic
languages (Nahuatl, Wixarika and Yorem Nokki) into Spanish and vice versa.
Doing so, we find that in a morpheme-to-morpheme alignment a significant amount
of information contained in polysynthetic morphemes has no Spanish counterpart,
and its translation is often omitted. We further conduct a qualitative analysis
and, thus, identify morpheme types that are commonly hard to align or ignored
in the translation process.
Comment: To appear in "All Together Now? Computational Modeling of
Polysynthetic Languages" Workshop, at COLING 2018
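A morpheme-to-morpheme alignment as studied above can be represented as a set of (source-index, target-index) pairs; source morphemes whose index appears in no pair are exactly those with no target-language counterpart. A minimal sketch of this bookkeeping, assuming that representation (the morphemes and alignment below are hypothetical):

```python
def unaligned_source_morphemes(src_morphemes, alignment):
    """Return the source-side morphemes whose index occurs in no
    (source, target) alignment pair, i.e. morphemes with no
    counterpart in the target language."""
    aligned_indices = {src_idx for src_idx, _ in alignment}
    return [m for i, m in enumerate(src_morphemes) if i not in aligned_indices]

# Hypothetical alignment: morphemes at indices 1 and 3 have no counterpart
dropped = unaligned_source_morphemes(["a", "b", "c", "d"], [(0, 0), (2, 1)])
# dropped == ["b", "d"]
```

Aggregating such unaligned morphemes over a corpus, grouped by morpheme type, is one way to quantify which kinds of information tend to be lost in translation.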
LowResourceEval-2019: a shared task on morphological analysis for low-resource languages
The paper describes the results of the first shared task on morphological
analysis for the languages of Russia, namely, Evenki, Karelian, Selkup, and
Veps. For the languages in question, only small-sized corpora are available.
The tasks include morphological analysis, word form generation and morpheme
segmentation. Four teams participated in the shared task. Most of them use
machine-learning approaches, outperforming the existing rule-based ones. The
article describes the datasets prepared for the shared tasks and contains an
analysis of the participants' solutions. Language corpora in different
formats were transformed into the CoNLL-U format. This universal format makes the
datasets comparable to other language corpora and facilitates their use in
other NLP tasks.
Comment: 16 pages, 4 tables, 2 figures, published in the conference proceedings
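The CoNLL-U format mentioned above stores one token per line as ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#`-prefixed comment lines and blank lines separating sentences. A minimal parser sketch for that layout (the sample token line is invented, not taken from the shared-task corpora):

```python
# The ten standard CoNLL-U column names, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list
    of token dicts keyed by the ten standard column names."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if line.startswith("#"):      # comment line, e.g. "# text = ..."
            continue
        if not line.strip():          # blank line ends the current sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:                        # flush a trailing sentence
        sentences.append(tokens)
    return sentences

# Hypothetical single-token sentence:
sents = parse_conllu("1\tword\tlemma\tNOUN\t_\t_\t0\troot\t_\t_\n")
# sents[0][0]["form"] == "word"
```

Morphological annotations such as those produced in the shared task would live in the FEATS column, which is why converting all corpora to this format makes them directly comparable.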