An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification
End-to-end neural machine translation has overtaken statistical machine
translation in terms of translation quality for some language pairs, especially
those with large amounts of parallel data. Besides this palpable improvement,
neural networks provide several new properties. A single system can be trained
to translate between many languages at almost no additional cost other than
training time. Furthermore, internal representations learned by the network
serve as a new semantic representation of words (or sentences) which, unlike
standard word embeddings, are learned in an essentially bilingual or even
multilingual context. In view of these properties, the contribution of the
present work is two-fold. First, we systematically study the NMT context
vectors, i.e. output of the encoder, and their power as an interlingua
representation of a sentence. We assess their quality and effectiveness by
measuring similarities across translations, as well as semantically related and
semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the
first point, we identify parallel sentences in comparable corpora, obtaining an
F1 of 98.2% on data from a shared task when using only NMT context vectors. Using
context vectors jointly with similarity measures, F1 reaches 98.9%.
Comment: 11 pages, 4 figures
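The idea of comparing encoder context vectors across sentences can be illustrated with a minimal sketch. It assumes each sentence is represented by a list of per-token context vectors, pools them by averaging, and flags a pair as parallel when the cosine similarity exceeds a threshold. The pooling choice, the threshold value, and the function names (`pool_context_vectors`, `cosine_similarity`, `is_parallel`) are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]


def pool_context_vectors(context_vectors: Sequence[Sequence[float]]) -> Vector:
    """Average per-token encoder outputs into one fixed-size sentence vector.

    Mean pooling is an assumption made for this sketch.
    """
    if not context_vectors:
        raise ValueError("need at least one context vector")
    dim = len(context_vectors[0])
    count = len(context_vectors)
    return [sum(vec[i] for vec in context_vectors) / count for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_parallel(
    src_context: Sequence[Sequence[float]],
    tgt_context: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value; would be tuned on held-out data
) -> bool:
    """Flag a sentence pair as parallel when pooled vectors are similar enough."""
    src_vec = pool_context_vectors(src_context)
    tgt_vec = pool_context_vectors(tgt_context)
    return cosine_similarity(src_vec, tgt_vec) >= threshold
```

In practice the context vectors would come from a trained multilingual NMT encoder; the toy vectors used to exercise this sketch only stand in for them.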
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistics
Chinese-Catalan: A neural machine translation approach based on pivoting and attention mechanisms
This article addresses machine translation from Chinese to Catalan using neural pivot strategies trained without any direct parallel data. Catalan is linguistically very similar to Spanish, which motivates the use of Spanish as the pivot language. For the neural architecture, we use the state-of-the-art Transformer model, which is based solely on attention mechanisms. Additionally, this work provides new resources to the community: a human-developed gold standard of 4,000 sentences between Catalan and Chinese and all the other United Nations official languages (Arabic, English, French, Russian, and Spanish). Results show that the standard pseudo-corpus (synthetic pivot) approach performs better than the cascade approach.
Peer Reviewed
Postprint (author's final draft)
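The difference between the two pivot strategies named in this abstract can be sketched as follows. In the cascade, two systems are chained at translation time (source to pivot, then pivot to target). In the pseudo-corpus approach, the pivot side of a source-pivot parallel corpus is machine-translated into the target language, and the resulting synthetic source-target pairs are used to train a direct system. The dictionary-lookup translators and all names below are hypothetical stand-ins for trained NMT systems, used only to keep the sketch runnable.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]


def cascade_translate(
    source: str, src_to_pivot: Translator, pivot_to_tgt: Translator
) -> str:
    """Cascade: chain two systems at inference time (source -> pivot -> target)."""
    return pivot_to_tgt(src_to_pivot(source))


def build_pseudo_corpus(
    src_pivot_pairs: List[Tuple[str, str]], pivot_to_tgt: Translator
) -> List[Tuple[str, str]]:
    """Pseudo-corpus: translate the pivot side of a source-pivot corpus into the
    target language, yielding synthetic source-target training pairs."""
    return [(src, pivot_to_tgt(pivot)) for src, pivot in src_pivot_pairs]


# Toy lookup tables standing in for trained Chinese-Spanish and
# Spanish-Catalan systems (assumptions for illustration only).
ZH_ES = {"谢谢": "gracias"}
ES_CA = {"gracias": "gràcies"}


def zh_to_es(sentence: str) -> str:
    return ZH_ES.get(sentence, sentence)


def es_to_ca(sentence: str) -> str:
    return ES_CA.get(sentence, sentence)


# Cascade: both systems are needed every time a sentence is translated.
cascade_output = cascade_translate("谢谢", zh_to_es, es_to_ca)

# Pseudo-corpus: the Spanish-Catalan system is used once, offline, to create
# Chinese-Catalan training data for a single direct model.
synthetic_pairs = build_pseudo_corpus([("谢谢", "gracias")], es_to_ca)
```

One practical consequence of the pseudo-corpus design is that only a single model runs at translation time, so decoding errors from the first system are not fed into a second one.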