A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor in its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.

Comment: 44 pages, to appear in Computational Linguistics
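The "amount of reordering" the survey discusses is often quantified over word alignments. A minimal illustrative sketch (not taken from the survey itself) counts pairwise inversions, i.e. a normalized Kendall tau distance, over the permutation of source positions induced by an alignment:

```python
def kendall_tau_distance(perm):
    """Count pairwise order inversions in a target-side permutation of
    source positions; 0 means a monotone alignment, higher means more
    reordering."""
    inversions = 0
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            if perm[i] > perm[j]:
                inversions += 1
    return inversions

def normalized_reordering(perm):
    """Normalize by the maximum possible number of inversions."""
    n = len(perm)
    max_inv = n * (n - 1) // 2
    return kendall_tau_distance(perm) / max_inv if max_inv else 0.0

# A German-like subordinate clause moves the verb to the end:
print(normalized_reordering([0, 1, 2, 3]))   # monotone: 0.0
print(normalized_reordering([3, 0, 1, 2]))   # verb jump: 0.5
```

Measures of this kind let one compare how much reordering two language pairs require before choosing an SMT framework.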
A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation
Text alignment is crucial to the accuracy of Machine Translation (MT)
systems, many NLP tools, and any other text processing task requiring bilingual
data. This research proposes a language-independent sentence alignment approach,
validated in Polish-to-English experiments (Polish word order is not position-sensitive). This
alignment approach was developed on the TED Talks corpus, but can be used for
any text domain or language pair. The proposed approach implements various
heuristics for sentence recognition. Some of them value synonyms and semantic
text structure analysis as a part of additional information. Minimization of
data loss was ensured. The solution is compared to other sentence alignment
implementations, and an improvement in MT system score for text processed
with the described tool is demonstrated.

Comment: corpora filtration, text alignment, corpora improvement. arXiv admin
note: text overlap with arXiv:1509.0888
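A hedged sketch of the kind of scoring such an aligner might combine: a Gale-Church-style sentence-length score plus a lexical/synonym overlap heuristic. The function names, weights, and toy lexicon below are illustrative assumptions, not the paper's actual implementation:

```python
import math

def length_score(src, tgt, c=1.0, s2=6.8):
    """Gale-Church-style length similarity: penalize sentence pairs whose
    character lengths diverge (the scaled difference is roughly normal)."""
    ls, lt = len(src), len(tgt)
    mean = (ls + lt * c) / 2
    delta = (lt * c - ls) / math.sqrt(s2 * mean) if mean else 0.0
    return -abs(delta)

def lexical_overlap(src_tokens, tgt_tokens, bilingual_lexicon):
    """Fraction of source tokens whose known translations (or synonyms)
    appear in the candidate target sentence."""
    if not src_tokens:
        return 0.0
    tgt_set = {t.lower() for t in tgt_tokens}
    hits = sum(1 for w in src_tokens
               if bilingual_lexicon.get(w.lower(), set()) & tgt_set)
    return hits / len(src_tokens)

def pair_score(src, tgt, lexicon, w_len=0.3, w_lex=0.7):
    """Combine both heuristics; the best-scoring candidate wins."""
    return (w_len * length_score(src, tgt)
            + w_lex * lexical_overlap(src.split(), tgt.split(), lexicon))

# Toy Polish-English lexicon (hypothetical):
lexicon = {"kot": {"cat"}, "pies": {"dog"}}
good = pair_score("kot pies", "the cat and the dog", lexicon)
bad = pair_score("kot pies", "weather report tomorrow", lexicon)
```

Combining a length prior with semantic evidence is what lets such approaches stay language-independent while still exploiting synonym information.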
A Deep Source-Context Feature for Lexical Selection in Statistical Machine Translation
A definitive version of this work was published in Pattern Recognition Letters 75 (2016) 24-29, DOI 10.1016/j.patrec.2016.02.014; this is the author's accepted version, which may not reflect changes from the publishing process (peer review, editing, corrections, structural formatting and other quality control mechanisms).

This paper presents a methodology to address lexical disambiguation in a standard phrase-based statistical
machine translation system. Similarity among source contexts is used to select appropriate translation
units. This information is introduced as a novel feature of the phrase-based model and is used to select
the translation units extracted from the training sentences most similar to the sentence to be translated. The
similarity is computed through a deep autoencoder representation, which yields effective low-dimensional
embeddings of the data and statistically significant BLEU score improvements on two different
tasks (English-to-Spanish and English-to-Hindi).
Gupta, PA.; Costa-Jussa, MR.; Rosso, P.; Banchs, R. (2016). A Deep Source-Context Feature for Lexical Selection in Statistical Machine Translation. Pattern Recognition Letters 75:24-29. https://doi.org/10.1016/j.patrec.2016.02.014
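The source-context feature can be pictured as follows: each sentence is mapped to a low-dimensional vector (here assumed to come from the autoencoder), and each translation unit is scored by the similarity between the input sentence and the training sentence it was extracted from. This is a simplified sketch with hypothetical names, not the authors' code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def source_context_feature(input_vec, training_vecs):
    """For each translation unit, score it by the similarity between the
    input-sentence embedding and the embedding of the training sentence
    the unit came from; mapped to [0, 1] and log-scaled so it can enter
    a log-linear phrase-based model as one more feature."""
    return [math.log(max(1e-9, (1 + cosine(input_vec, tv)) / 2))
            for tv in training_vecs]

# Units from similar training contexts get higher (less negative) scores:
scores = source_context_feature([1.0, 0.0],
                                [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

The log-linear weight of such a feature would then be tuned like any other phrase-table score.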
Innovative technologies for under-resourced language documentation: The BULB Project
The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. To achieve this, we will develop tools tailored to the needs of documentary linguists by building on technology and expertise from natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and of the French translations at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that support the linguists in their work, taking into account the linguists' needs and the technology's capabilities. Data collection has begun for all three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.
Overview of the IWSLT 2012 Evaluation Campaign
We report on the ninth evaluation campaign organized by
the IWSLT workshop. This year, the evaluation offered multiple
tracks on lecture translation based on the TED corpus, and
one track on dialog translation from Chinese
to English based on the Olympic trilingual corpus.
In particular, the TED tracks included a speech transcription
track in English, a speech translation track from English to French,
and text translation tracks from English to French and from Arabic
to English. In addition to the official tracks, ten unofficial
MT tracks were offered that required translating TED talks into English
from either Chinese, Dutch, German, Polish, Portuguese (Brazilian), Romanian, Russian, Slovak,
Slovene, or Turkish.
Sixteen teams participated in the evaluation and submitted a total of 48 primary runs.
All runs were evaluated with objective metrics, while runs of the official translation
tracks were also ranked by crowd-sourced judges.
In particular, subjective ranking for the TED task was performed on a progress test, which permitted
direct comparison of this year's results against the best results from the 2011 round of the evaluation campaign.

Marcello Federico; Mauro Cettolo; Luisa Bentivogli; Michael Paul; Sebastian Stüker
Multilingual Neural Machine Translation for Low-Resource Languages
In recent years, Neural Machine Translation (NMT) has been shown to be more effective than phrase-based statistical methods, thus quickly becoming the state of the art in machine translation (MT). However, NMT systems are limited in translating low-resourced languages, due to the significant amount of parallel data that is required to learn useful mappings between languages. In this work, we show how so-called multilingual NMT can help to tackle the challenges associated with low-resourced language translation. The underlying principle of multilingual NMT is to force the creation of hidden representations of words in a shared semantic space across multiple languages, thus enabling positive parameter transfer across languages. Along this direction, we present multilingual translation experiments with three languages (English, Italian, Romanian) covering six translation directions, utilizing both recurrent neural networks and transformer (or self-attentive) neural networks. We then focus on the zero-shot translation problem, that is, how to leverage multilingual data in order to learn translation directions that are not covered by the available training material. To this end, we introduce our recently proposed iterative self-training method, which incrementally improves a multilingual NMT model on a zero-shot direction by relying only on monolingual data. Our results on TED talks data show that multilingual NMT outperforms conventional bilingual NMT, that transformer NMT outperforms recurrent NMT, and that zero-shot NMT outperforms conventional pivoting methods and even matches the performance of a fully-trained bilingual system.
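Multilingual NMT of this kind is typically trained on a mixed corpus in which each source sentence is prefixed with an artificial token naming the desired target language, so a single shared model learns which language to produce; that same mechanism is what makes zero-shot directions possible. A minimal sketch of the preprocessing step (the token format and helper names are assumptions following common practice, not this paper's code):

```python
def tag_for_multilingual(src_sentence, target_lang):
    """Prepend an artificial target-language token (e.g. <2it>) that tells
    the shared encoder-decoder which language to generate."""
    return f"<2{target_lang}> {src_sentence}"

def build_training_corpus(parallel_data):
    """parallel_data: iterable of (src_lang, tgt_lang, src, tgt) tuples
    covering several translation directions; all pairs are tagged and
    mixed into a single corpus for one shared model."""
    return [(tag_for_multilingual(src, tgt_lang), tgt)
            for src_lang, tgt_lang, src, tgt in parallel_data]

corpus = build_training_corpus([
    ("en", "it", "the cat sleeps", "il gatto dorme"),
    ("en", "ro", "the cat sleeps", "pisica doarme"),
])
```

At inference time, prepending an unseen combination's tag (e.g. Italian source with a Romanian tag) is what yields a zero-shot translation direction.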
From feature to paradigm: deep learning in machine translation
In recent years, deep learning algorithms have revolutionized several areas, including speech, image and natural language processing. The specific field of Machine Translation (MT) has not remained unaffected. Integration of deep learning into MT ranges from re-modeling existing features of standard statistical systems to the development of entirely new architectures. Among the different neural networks, research works use feed-forward neural networks, recurrent neural networks and the encoder-decoder schema. These architectures are able to tackle challenges such as low-resource settings or morphological variation. This manuscript focuses on describing how these neural networks have been integrated to enhance different aspects and models of statistical MT, including language modeling, word alignment, translation, reordering, and rescoring. We then report on the new neural MT approach, together with a description of the foundational related works and recent approaches on using subwords, characters and multilingual training, among others. Finally, we include an analysis of the corresponding challenges and future work in using deep learning in MT.
Linguistic Structure in Statistical Machine Translation
This thesis investigates the influence of linguistic structure in statistical machine translation. We develop a word reordering model based on syntactic parse trees and address the issues of pronouns and morphological agreement with a source discriminative word lexicon that predicts the translation of individual words using structural features. When used in phrase-based machine translation, the models improve translation for language pairs with different word order and morphological variation.