The ADAPT system description for the IWSLT 2018 Basque to English translation task
In this paper we present the ADAPT system built for the
Basque to English Low Resource MT Evaluation Campaign.
Basque is a low-resourced, morphologically rich language.
This poses a challenge for Neural Machine Translation (NMT) models,
which usually achieve better performance when trained
on large data sets.
Accordingly, we used synthetic data to improve the translation quality produced by a model built using only authentic
data. Our proposal uses back-translated data to: (a) create
new sentences, so the system can be trained with more data;
and (b) translate sentences that are close to the test set, so the
model can be fine-tuned to the document to be translated.
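The back-translation idea described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `backtranslate` function stands in for a trained English-to-Basque model and is a stub here so the example runs.

```python
# Sketch of back-translation: pair monolingual target-language (English)
# sentences with machine-generated source-language (Basque) translations,
# then add those synthetic pairs to the authentic training data.

def backtranslate(english_sentence):
    # Stub standing in for a trained English->Basque NMT model.
    return "EU(" + english_sentence + ")"

def build_synthetic_corpus(monolingual_english):
    """Yield synthetic (Basque, English) pairs from monolingual English."""
    return [(backtranslate(en), en) for en in monolingual_english]

authentic = [("kaixo mundua", "hello world")]
synthetic = build_synthetic_corpus(["good morning", "see you soon"])
training_data = authentic + synthetic  # the NMT model trains on the union
```

The synthetic source side is noisier than authentic data, but the target side is genuine English, which is what makes the extra pairs useful for training.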
The IWSLT 2018 Evaluation Campaign
The International Workshop on Spoken Language Translation
(IWSLT) 2018 Evaluation Campaign featured two tasks: the
low-resourced machine translation task and the speech translation
task. In the first task, manually transcribed speech had
to be translated from Basque to English. Since this translation
direction is an under-resourced language pair, participants
were encouraged to use additional parallel data from
related languages. In the second task, participants had
to translate English audio into German text by building a full
speech-translation system. In the baseline condition, participants
were free to use any architecture, while they were restricted
to a single model for the end-to-end task.
This year, eight research groups took part in the Basque-English
translation task, and nine in the speech translation
task.
Transductive data-selection algorithms for fine-tuning neural machine translation
Machine Translation models are trained to translate a variety of documents from one language into another. However, models specifically trained for the particular characteristics of the documents tend to perform better. Fine-tuning is a technique for adapting an NMT model to some domain. In this work, we want to use this technique to adapt the model to a given test set. In particular, we use transductive data selection algorithms, which take advantage of the information in the test set to retrieve sentences from a larger parallel set.
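A transductive selection method of the kind this abstract refers to, Feature Decay Algorithms (FDA), can be sketched roughly as below. This is a simplified illustration under stated assumptions (bigram features, greedy selection, a fixed decay factor), not the authors' implementation.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fda_select(candidates, test_sentences, k, decay=0.5):
    """Greedy FDA-style selection: score candidates by the n-grams they
    share with the test set, decaying each feature's value every time a
    selected sentence covers it, so later picks favour new coverage."""
    features = Counter()
    for sent in test_sentences:
        for g in ngrams(sent.split()):
            features[g] = 1.0
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda s: sum(features.get(g, 0.0)
                                           for g in ngrams(s.split())))
        selected.append(best)
        pool.remove(best)
        for g in ngrams(best.split()):
            if g in features:
                features[g] *= decay  # covered features lose value
    return selected
```

The decay step is what distinguishes FDA from plain overlap ranking: once a test-set n-gram is covered, repeating it contributes less, pushing the selection toward diverse coverage of the test set.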
Cascade or Direct Speech Translation? A Case Study
Speech translation has traditionally been tackled under a cascade approach, chaining speech recognition and machine translation components to translate from an audio source in a given language into text or speech in a target language. Leveraging deep learning approaches to natural language processing, recent studies have explored the potential of direct end-to-end neural modelling to perform the speech translation task. Though several benefits may come from end-to-end modelling, such as a reduction in latency and error propagation, the comparative merits of each approach still deserve detailed evaluation and analysis. In this work, we compare state-of-the-art cascade and direct approaches on the under-resourced Basque–Spanish language pair, which features challenging phenomena such as marked differences in morphology and word order. This case study thus complements other studies in the field, which mostly revolve around the English language. We describe and analyse in detail the mintzai-ST corpus, prepared from the sessions of the Basque Parliament, and evaluate the strengths and limitations of cascade and direct speech translation models trained on this corpus, with variants exploiting additional data as well. Our results indicate that, despite significant progress with end-to-end models, which may outperform alternatives in some cases in terms of automated metrics, a cascade approach proved optimal overall in our experiments and manual evaluations. © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
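The cascade architecture contrasted in this abstract amounts to composing two components. The sketch below only shows that composition; the two model callables are stubs standing in for trained ASR and MT systems, not the paper's actual models.

```python
# Cascade speech translation: an ASR component produces a source-language
# transcript, which a separate MT component then translates. Errors in the
# transcript propagate into the translation, which is the trade-off the
# abstract discusses against direct end-to-end modelling.

def cascade_speech_translation(audio, asr_model, mt_model):
    """Chain speech recognition and machine translation."""
    transcript = asr_model(audio)  # source-language text from audio
    return mt_model(transcript)    # target-language text

# Stubs standing in for trained Basque ASR and Basque->Spanish MT models.
asr_stub = lambda audio: "kaixo mundua"
mt_stub = lambda text: "hola mundo"
result = cascade_speech_translation(b"raw-audio-bytes", asr_stub, mt_stub)
```

A direct model would instead map audio to target text in a single network, removing the intermediate transcript and the error propagation that comes with it.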
Adaptation of machine translation models with back-translated data using transductive data selection methods
Data selection has proven its merit for improving Neural Machine Translation (NMT) when applied to authentic data. But the benefit of using synthetic data in NMT training, produced by the popular back-translation technique, raises the question of whether data selection could also be useful for synthetic data. In this work we use Infrequent n-gram Recovery (INR) and Feature Decay Algorithms (FDA), two transductive data selection methods, to obtain subsets of sentences from synthetic data. These methods ensure that selected sentences share n-grams with the test set, so the NMT model can be adapted to translate it. Performing data selection on back-translated data creates new challenges, as the source side may contain noise originated by the model used in the back-translation. Hence, finding n-grams present in the test set becomes more difficult. Despite that, in our work we show that adapting a model with a selection of synthetic data is a useful approach.
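The INR criterion named in this abstract can be sketched roughly as follows. This is a simplified single-pass illustration under stated assumptions (bigram features, a simple occurrence threshold), not the authors' implementation.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def inr_select(candidates, test_sentences, threshold=1):
    """INR-style selection sketch: keep a candidate only if it supplies at
    least one test-set n-gram that is still 'infrequent', i.e. seen fewer
    than `threshold` times in the sentences selected so far."""
    needed = set()
    for sent in test_sentences:
        needed.update(ngrams(sent.split()))
    counts = Counter()
    selected = []
    for cand in candidates:
        cand_grams = set(ngrams(cand.split())) & needed
        if any(counts[g] < threshold for g in cand_grams):
            selected.append(cand)
            counts.update(cand_grams)
    return selected
```

With noisy back-translated source sentences, fewer candidates contain clean test-set n-grams, which is the difficulty the abstract points out: the intersection `cand_grams` shrinks, so recovery of infrequent n-grams gets harder.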
Survey of Low-Resource Machine Translation
We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world, and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.