Language Modelling Approaches to Adaptive Machine Translation
Consistency is a key requirement of high-quality translation. It is
especially important to adhere to pre-approved terminology and adapt to
corrected translations in domain-specific projects. Machine translation (MT)
has achieved significant progress in the area of domain adaptation. However,
in-domain data scarcity is common in translation settings, due to the lack of
specialised datasets and terminology, or inconsistency and inaccuracy of
available in-domain translations. In such scenarios where there is insufficient
in-domain data to fine-tune MT models, producing translations that are
consistent with the relevant context is challenging. While real-time adaptation
can make use of smaller amounts of in-domain data to improve the translation on
the fly, it remains challenging due to supported context limitations and
efficiency constraints. Large language models (LLMs) have recently shown
interesting capabilities of in-context learning, where they learn to replicate
certain input-output text generation patterns, without further fine-tuning.
Such capabilities have opened new horizons for domain-specific data
augmentation and real-time adaptive MT. This work attempts to address two main
relevant questions: 1) in scenarios involving human interaction and continuous
feedback, can we employ language models to improve the quality of adaptive MT
at inference time? and 2) in the absence of sufficient in-domain data, can we
use pre-trained large-scale language models to improve the process of MT domain
adaptation? Comment: PhD thesis
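The inference-time adaptation described above can be pictured as prompt construction: previously approved or human-corrected translation pairs are injected as in-context examples, so the model imitates them when translating a new sentence. The following is a minimal sketch; the function name, prompt layout, and the medical-domain example pair are illustrative, not taken from the thesis.

```python
# Sketch of real-time adaptive MT prompting: human-corrected translation
# pairs become in-context examples for the next translation request.
def build_adaptive_prompt(feedback_memory, new_source,
                          src_lang="English", tgt_lang="Spanish"):
    """Compose a few-shot prompt from approved/corrected translation pairs."""
    lines = []
    for src, corrected in feedback_memory:
        lines.append(f"{src_lang}: {src}\n{tgt_lang}: {corrected}")
    # The new sentence goes last, leaving the target side open for the LLM.
    lines.append(f"{src_lang}: {new_source}\n{tgt_lang}:")
    return "\n\n".join(lines)

memory = [
    ("The patient shows no adverse reactions.",
     "El paciente no presenta reacciones adversas."),
]
prompt = build_adaptive_prompt(memory, "The patient shows mild symptoms.")
```

As more corrections arrive from the human reviewer, they are appended to the memory, so each subsequent prompt reflects the latest approved terminology.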
Fine-tuning Large Language Models for Adaptive Machine Translation
This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose
large language model (LLM), for adaptive machine translation (MT). The
fine-tuning process involves utilising a combination of zero-shot and one-shot
translation prompts within the medical domain. The primary objective is to
enhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt
translations to the required domain at inference time. The results,
particularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned
model, demonstrating quality improvements in both zero-shot and one-shot
translation scenarios, surpassing Mistral 7B's baseline performance. Notably,
the fine-tuned Mistral outperforms ChatGPT "gpt-3.5-turbo" in zero-shot
translation while achieving comparable one-shot translation quality. Moreover,
the zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B's
performance, and its one-shot translation quality surpasses that of NLLB 3.3B.
These findings emphasise the significance of fine-tuning efficient LLMs like
Mistral 7B to yield high-quality zero-shot translations comparable to
task-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved
in one-shot translation are comparable to those of commercial LLMs such as
ChatGPT. Our experiments demonstrate that, with a relatively small dataset of
20,000 segments that incorporate a mix of zero-shot and one-shot prompts,
fine-tuning significantly enhances Mistral's in-context learning ability,
especially for real-time adaptive MT.
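The mix of zero-shot and one-shot training prompts described above can be sketched as a data-formatting step; the exact prompt template and field names below are assumptions for illustration, not the paper's published format.

```python
# Sketch of building mixed fine-tuning examples: zero-shot prompts translate
# directly, one-shot prompts prepend a similar approved pair (fuzzy match).
def format_example(src, tgt, fuzzy=None):
    """Return one fine-tuning record in either zero-shot or one-shot style."""
    if fuzzy is None:  # zero-shot
        prompt = f"Spanish: {src}\nEnglish:"
    else:              # one-shot: a retrieved fuzzy match guides the style
        f_src, f_tgt = fuzzy
        prompt = (f"Spanish: {f_src}\nEnglish: {f_tgt}\n\n"
                  f"Spanish: {src}\nEnglish:")
    return {"prompt": prompt, "completion": " " + tgt}

zero = format_example("Tome dos comprimidos al día.", "Take two tablets a day.")
one = format_example("Tome dos comprimidos al día.", "Take two tablets a day.",
                     fuzzy=("Tome un comprimido al día.", "Take one tablet a day."))
```

Training on both styles at once is what teaches the model to exploit the one-shot example when present while still translating well without one.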
Adaptive Machine Translation with Large Language Models
Consistency is a key requirement of high-quality translation. It is
especially important to adhere to pre-approved terminology and adapt to
corrected translations in domain-specific projects. Machine translation (MT)
has achieved significant progress in the area of domain adaptation. However,
real-time adaptation remains challenging. Large-scale language models (LLMs)
have recently shown interesting capabilities of in-context learning, where they
learn to replicate certain input-output text generation patterns, without
further fine-tuning. By feeding an LLM at inference time with a prompt that
consists of a list of translation pairs, it can then simulate the domain and
style characteristics. This work aims to investigate how we can utilize
in-context learning to improve real-time adaptive MT. Our extensive experiments
show promising results at translation time. For example, GPT-3.5 can adapt to a
set of in-domain sentence pairs and/or terminology while translating a new
sentence. We observe that the translation quality with few-shot in-context
learning can surpass that of strong encoder-decoder MT systems, especially for
high-resource languages. Moreover, we investigate whether we can combine MT
from strong encoder-decoder models with fuzzy matches, which can further
improve translation quality, especially for less supported languages. We
conduct our experiments across five diverse language pairs, namely
English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French
(EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).
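The idea of combining fuzzy matches with output from a strong encoder-decoder model can be sketched as a single prompt that shows the LLM both the retrieved pairs and an MT draft to refine. The layout and the French example strings below are hypothetical.

```python
# Sketch: fuzzy-matched translation pairs plus an encoder-decoder MT draft,
# combined into one prompt for the LLM to produce an improved translation.
def build_prompt_with_fuzzy_and_mt(fuzzy_matches, mt_draft, source,
                                   src_lang="English", tgt_lang="French"):
    """Few-shot prompt: in-domain pairs first, then the new sentence with
    its MT draft, leaving the final target side open for the LLM."""
    parts = [f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in fuzzy_matches]
    parts.append(f"{src_lang}: {source}\nMT: {mt_draft}\n{tgt_lang}:")
    return "\n\n".join(parts)

demo = build_prompt_with_fuzzy_and_mt(
    [("Close the door.", "Fermez la porte.")],
    "Fermez la fenetre.",
    "Close the window.",
)
```

For less supported languages, the MT draft gives the LLM a strong starting point, while the fuzzy matches supply the in-domain terminology and style.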
The ADAPT system description for the STAPLE 2020 English-to-Portuguese translation task
This paper describes the ADAPT Centre’s submission to STAPLE (Simultaneous Translation and Paraphrase for Language Education) 2020, a shared task of the 4th Workshop on Neural Generation and Translation (WNGT), for the English-to-Portuguese translation task. In this shared task, the participants were asked to produce high-coverage sets of plausible translations given English prompts (input source sentences). We present our English-to-Portuguese machine translation (MT) models that were built applying various strategies, e.g. data and sentence selection, monolingual MT for generating alternative translations, and combining multiple n-best translations. Our experiments show that adding the aforementioned techniques to the
baseline yields excellent performance on the English-to-Portuguese translation task.
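One of the strategies named above, combining multiple n-best translations into a high-coverage candidate set, can be sketched as a simple ordered merge; the deduplication key and example strings are illustrative, not the submission's exact procedure.

```python
# Sketch: merge n-best lists from several MT systems into one deduplicated,
# high-coverage set of plausible translations, preserving first-seen order.
def combine_nbest(nbest_lists):
    seen, combined = set(), []
    for nbest in nbest_lists:
        for hyp in nbest:
            key = hyp.strip().lower()  # naive normalisation for dedup
            if key not in seen:
                seen.add(key)
                combined.append(hyp)
    return combined

cands = combine_nbest([["Obrigado!", "Obrigada!"],
                       ["Obrigado!", "Muito obrigado!"]])
```

In the shared task setting, a larger and more diverse candidate set directly improves coverage of the reference translation sets.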
Arabisc: context-sensitive neural spelling checker
Traditional statistical approaches to spelling correction usually consist of two consecutive processes – error detection and correction – and they are generally computationally intensive. Current state-of-the-art neural spelling correction models usually attempt to correct spelling errors directly over an entire sentence, which, as a consequence, lacks control of the process, e.g. they are prone to overcorrection. In recent years, recurrent neural networks (RNNs), in particular long short-term memory (LSTM) hidden units, have proven increasingly popular and powerful models for many natural language processing (NLP) problems. Accordingly, we made use of a bidirectional LSTM language model (LM) for our context-sensitive spelling detection and correction model, which is shown to afford considerable control over the correction process. While the use of LMs for spelling checking and correction is not new to this line of NLP research, our proposed approach makes better use of the rich neighbouring context, not only from before the word to be corrected, but also after it, via a dual-input deep LSTM network. Although in theory our proposed approach can be applied to any language, we carried out our experiments on Arabic, which we believe adds additional value given the fact that there are limited linguistic resources readily available in Arabic in comparison to many languages. Our experimental results demonstrate that the proposed methods are effective in both improving the quality of correction suggestions and minimising overcorrection.
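The dual-input idea, scoring a candidate correction against both its left and its right context, can be illustrated with a toy stand-in: bigram counts play the role of the paper's bidirectional LSTM LM. The tiny corpus and misspelling below are invented for illustration.

```python
from collections import Counter

# Toy stand-in for the dual-input LSTM LM: a candidate correction is scored
# by how well it fits BOTH the preceding and the following word, here using
# simple bigram counts instead of neural context encoders.
corpus = "the cat sat on the mat the cat ran on the road".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def score(left, candidate, right):
    """Higher when the candidate fits both left and right context."""
    return bigrams[(left, candidate)] + bigrams[(candidate, right)]

def correct(left, word, right, candidates):
    """Pick the candidate correction that best fits the surrounding context."""
    return max(candidates, key=lambda c: score(left, c, right))

best = correct("the", "czt", "sat", ["cat", "cut", "cot"])
```

Using the right-hand context as well as the left is what lets the checker reject candidates that only look plausible given the preceding words.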
Terminology-aware sentence mining for NMT domain adaptation: ADAPT’s submission to the Adap-MT 2020 English-to-Hindi AI translation shared task
This paper describes the ADAPT Centre’s submission to the Adap-MT 2020 AI Translation Shared Task for English-to-Hindi. The neural machine translation (NMT) systems that we built to translate AI domain texts are state-of-the-art Transformer models. In order to improve the translation quality of our NMT systems, we made use of both in-domain and out-of-domain data for training and employed different fine-tuning techniques for adapting our NMT systems to this task, e.g. mixed fine-tuning and on-the-fly self-training. For this, we mined parallel sentence pairs and monolingual sentences from large out-of-domain data, and the mining process was facilitated through automatic extraction of terminology from the in-domain data. This paper outlines the experiments we carried out for this task and reports the performance of our NMT systems on the evaluation test set.
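The terminology-facilitated mining step can be sketched as filtering a large out-of-domain corpus for sentence pairs whose source side mentions an automatically extracted in-domain term. The function and example terms below are a simplified illustration, not the submission's exact pipeline.

```python
# Sketch of terminology-aware sentence mining: keep out-of-domain pairs
# whose source side contains one of the extracted in-domain terms.
def mine_by_terms(parallel_pairs, terms):
    terms_lower = [t.lower() for t in terms]
    return [(s, t) for s, t in parallel_pairs
            if any(term in s.lower() for term in terms_lower)]

pairs = [
    ("Neural networks learn distributed representations.", "target sentence 1"),
    ("The weather was pleasant yesterday.", "target sentence 2"),
]
mined = mine_by_terms(pairs, ["neural network", "machine translation"])
```

The mined pairs can then feed mixed fine-tuning, while mined monolingual sentences supply input for on-the-fly self-training.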
Translation word-level auto-completion: what can we achieve out of the box?
Research on Machine Translation (MT) has achieved important breakthroughs in several areas. While there is much more to be done in order to build on this success, we believe that the language industry needs better ways to take full advantage of current achievements. Due to a combination of factors, including time, resources, and skills, businesses tend to apply pragmatism to their AI workflows. Hence, they concentrate more on outcomes, e.g. delivery, shipping, releases, and features, and adopt high-level working production solutions where possible. Among the features thought to be helpful for translators are sentence-level and word-level translation auto-suggestion and auto-completion. Suggesting alternatives can inspire translators and limit their need to refer to external resources, which hopefully boosts their productivity. This work describes our submissions to WMT's shared task on word-level auto-completion, for the Chinese-to-English, English-to-Chinese, German-to-English, and English-to-German language directions. We investigate the possibility of using pre-trained models and out-of-the-box features from available libraries. We employ random sampling to generate diverse alternatives, which yields good results. Furthermore, we introduce our open-source API, based on CTranslate2, to serve translations, auto-suggestions, and auto-completions.
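The sampling-based approach above can be sketched as follows: draw diverse translation samples, then offer as completions the words that match what the translator has typed so far. This is a simplified illustration; the real task also conditions on the source sentence and surrounding target context, and the sample strings here are invented.

```python
# Sketch of word-level auto-completion over sampled translations: collect the
# distinct words at a given position that extend the typed prefix.
def word_autocomplete(sampled_translations, position, typed_prefix):
    suggestions = []
    for hyp in sampled_translations:
        words = hyp.split()
        if position < len(words) and words[position].startswith(typed_prefix):
            if words[position] not in suggestions:
                suggestions.append(words[position])
    return suggestions

samples = ["the quick brown fox", "the quiet brown fox", "a quick red fox"]
completions = word_autocomplete(samples, 1, "qui")
```

Random sampling (rather than beam search) is what makes the candidate pool diverse enough to cover plausible alternative word choices.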
Domain Terminology Integration into Machine Translation: Leveraging Large Language Models
This paper discusses the methods that we used for our submissions to the WMT
2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech
(EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to
advance machine translation (MT) by challenging participants to develop systems
that accurately translate technical terms, ultimately enhancing communication
and understanding in specialised domains. To this end, we conduct experiments
that utilise large language models (LLMs) for two purposes: generating
synthetic bilingual terminology-based data, and post-editing translations
generated by an MT model through incorporating pre-approved terms. Our system
employs a four-step process: (i) using an LLM to generate bilingual synthetic
data based on the provided terminology, (ii) fine-tuning a generic
encoder-decoder MT model, with a mix of the terminology-based synthetic data
generated in the first step and a randomly sampled portion of the original
generic training data, (iii) generating translations with the fine-tuned MT
model, and (iv) finally, leveraging an LLM for terminology-constrained
automatic post-editing of the translations that do not include the required
terms. The results demonstrate the effectiveness of our proposed approach in
improving the integration of pre-approved terms into translations. The number
of terms incorporated into the translations of the blind dataset increases from
an average of 36.67% with the generic model to an average of 72.88% by the end
of the process. In other words, successful utilisation of terms nearly doubles
across the three language pairs. Comment: WMT 2023
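The reported metric, the percentage of required terms incorporated into the translations, can be computed with a straightforward check; the matching here is naive lowercase substring search, and the automotive example terms are invented for illustration.

```python
# Sketch of the term-incorporation metric behind the reported
# 36.67% -> 72.88% improvement: the share of required target terms
# that actually appear in the corresponding translations.
def term_success_rate(translations, term_lists):
    hits = total = 0
    for translation, terms in zip(translations, term_lists):
        for tgt_term in terms:
            total += 1
            if tgt_term.lower() in translation.lower():
                hits += 1
    return 100.0 * hits / total if total else 0.0

rate = term_success_rate(
    ["The gearbox was replaced.", "Check the brake pads."],
    [["gearbox"], ["brake pad", "calliper"]],
)
```

In the described pipeline, this check also identifies which outputs still miss required terms and therefore need the LLM post-editing step (iv).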
Domain-specific text generation for machine translation
Preservation of domain knowledge from the source to target is crucial in any translation
workflow. It is common in the translation industry to receive highly specialized projects,
where there is hardly any parallel in-domain data. In such scenarios where there is insufficient
in-domain data to fine-tune Machine Translation (MT) models, producing translations that
are consistent with the relevant context is challenging. In this work, we propose a novel
approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs)
for domain-specific data augmentation for MT, simulating the domain characteristics of
either (a) a small bilingual dataset, or (b) the monolingual source text to be translated.
Combining this idea with back-translation, we can generate huge amounts of synthetic
bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art
Transformer architecture. We employ mixed fine-tuning to train models that significantly
improve translation of in-domain texts. More specifically, in both scenarios, our proposed
methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on
the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of
human evaluation corroborates the automatic evaluation results.
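The mixed fine-tuning step above can be sketched as composing the training set: oversample the (synthetic) in-domain data and concatenate it with out-of-domain data, optionally tagging each side by domain. The oversampling ratio and the `<in>`/`<out>` tags below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of mixed fine-tuning data composition: oversampled in-domain
# (here, synthetic back-translated) pairs mixed with out-of-domain pairs,
# each tagged with its domain.
def mixed_finetuning_data(in_domain, out_of_domain, oversample=2):
    mixed = [("<in>", pair) for pair in in_domain] * oversample
    mixed += [("<out>", pair) for pair in out_of_domain]
    return mixed

data = mixed_finetuning_data(
    [("src_a", "tgt_a")],
    [("src_b", "tgt_b"), ("src_c", "tgt_c")],
)
```

Mixing rather than training on in-domain data alone is what lets the model adapt to the domain without forgetting its general translation ability.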