36,871 research outputs found
Modeling Target-Side Inflection in Neural Machine Translation
NMT systems have problems with large vocabulary sizes. Byte-pair encoding
(BPE) is a popular approach to solving this problem, but while BPE allows the
system to generate any target-side word, it does not enable effective
generalization over the rich vocabulary in morphologically rich languages with
strong inflectional phenomena. We introduce a simple approach to overcome this
problem by training a system to produce the lemma of a word and its
morphologically rich POS tag, which is then followed by a deterministic
generation step. We apply this strategy for English-Czech and English-German
translation scenarios, obtaining improvements in both settings. We furthermore
show that the improvement is not due to only adding explicit morphological
information.Comment: Accepted as a research paper at WMT17. (Updated version with
corrected references.
Paradigm Completion for Derivational Morphology
The generation of complex derived word forms has been an overlooked problem
in NLP; we fill this gap by applying neural sequence-to-sequence models to the
task. We overview the theoretical motivation for a paradigmatic treatment of
derivational morphology, and introduce the task of derivational paradigm
completion as a parallel to inflectional paradigm completion. State-of-the-art
neural models, adapted from the inflection task, are able to learn a range of
derivation patterns, and outperform a non-neural baseline by 16.4%. However,
due to semantic, historical, and lexical considerations involved in
derivational morphology, future work will be needed to achieve performance
parity with inflection-generating systems.Comment: EMNLP 201
One-Shot Neural Cross-Lingual Transfer for Paradigm Completion
We present a novel cross-lingual transfer method for paradigm completion, the
task of mapping a lemma to its inflected forms, using a neural encoder-decoder
model, the state of the art for the monolingual task. We use labeled data from
a high-resource language to increase performance on a low-resource language. In
experiments on 21 language pairs from four different language families, we
obtain up to 58% higher accuracy than without transfer and show that even
zero-shot and one-shot learning are possible. We further find that the degree
of language relatedness strongly influences the ability to transfer
morphological knowledge.Comment: Accepted at ACL 201
The impact of morphological errors in phrase-based statistical machine translation from German and English into Swedish
We have investigated the potential for improvement in target language morphology when translating into Swedish from English and German, by measuring the errors made by a state of the art phrase-based statistical machine translation system. Our results show that there is indeed a performance gap to be filled by better modelling of inflectional morphology and compounding; and that the gap is not filled by
simply feeding the translation system with more training data
Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English
The necessity of using a fixed-size word vocabulary in order to control the
model complexity in state-of-the-art neural machine translation (NMT) systems
is an important bottleneck on performance, especially for morphologically rich
languages. Conventional methods that aim to overcome this problem by using
sub-word or character-level representations solely rely on statistics and
disregard the linguistic properties of words, which leads to interruptions in
the word structure and causes semantic and syntactic losses. In this paper, we
propose a new vocabulary reduction method for NMT, which can reduce the
vocabulary of a given input corpus at any rate while also considering the
morphological properties of the language. Our method is based on unsupervised
morphology learning and can be, in principle, used for pre-processing any
language pair. We also present an alternative word segmentation method based on
supervised morphological analysis, which aids us in measuring the accuracy of
our model. We evaluate our method in Turkish-to-English NMT task where the
input language is morphologically rich and agglutinative. We analyze different
representation methods in terms of translation accuracy as well as the semantic
and syntactic properties of the generated output. Our method obtains a
significant improvement of 2.3 BLEU points over the conventional vocabulary
reduction technique, showing that it can provide better accuracy in open
vocabulary translation of morphologically rich languages.Comment: The 20th Annual Conference of the European Association for Machine
Translation (EAMT), Research Paper, 12 page
How much hybridisation does machine translation need?
This is the peer reviewed version of the following article: [Costa-jussà, M. R. (2015), How much hybridization does machine translation Need?. J Assn Inf Sci Tec, 66: 2160–2165. doi:10.1002/asi.23517], which has been published in final form at [10.1002/asi.23517]. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.Rule-based and corpus-based machine translation (MT)have coexisted for more than 20 years. Recently, bound-aries between the two paradigms have narrowed andhybrid approaches are gaining interest from bothacademia and businesses. However, since hybridapproaches involve the multidisciplinary interaction oflinguists, computer scientists, engineers, and informa-tion specialists, understandably a number of issuesexist.While statistical methods currently dominate researchwork in MT, most commercial MT systems are techni-cally hybrid systems. The research community shouldinvestigate the bene¿ts and questions surrounding thehybridization of MT systems more actively. This paperdiscusses various issues related to hybrid MT includingits origins, architectures, achievements, and frustra-tions experienced in the community. It can be said thatboth rule-based and corpus- based MT systems havebene¿ted from hybridization when effectively integrated.In fact, many of the current rule/corpus-based MTapproaches are already hybridized since they do includestatistics/rules at some point.Peer ReviewedPostprint (author's final draft
Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic
Language modeling for an inflected language
such as Arabic poses new challenges for speech recognition and
machine translation due to its rich morphology. Rich morphology
results in large increases in out-of-vocabulary (OOV) rate and
poor language model parameter estimation in the absence of large
quantities of data. In this study, we present a joint
morphological-lexical language model (JMLLM) that takes
advantage of Arabic morphology. JMLLM combines
morphological segments with the underlying lexical items and
additional available information sources with regards to
morphological segments and lexical items in a single joint model.
Joint representation and modeling of morphological and lexical
items reduces the OOV rate and provides smooth probability
estimates while keeping the predictive power of whole words.
Speech recognition and machine translation experiments in
dialectal-Arabic show improvements over word and morpheme
based trigram language models. We also show that as the
tightness of integration between different information sources
increases, both speech recognition and machine translation
performances improve
- …