Trivial Transfer Learning for Low-Resource Neural Machine Translation
Transfer learning has been proven to be an effective technique for neural machine translation under low-resource conditions. Existing methods require a common target language, language relatedness, or specific training tricks and regimes. We present a simple transfer learning method, where we first train a "parent" model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus. This "child" model performs significantly better than the baseline trained for the low-resource pair only. We are the first to show this for targeting different languages, and we observe the improvements even for unrelated languages with different alphabets.
Comment: Accepted as a WMT18 research paper, Proceedings of the 3rd Conference on Machine Translation 2018
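A minimal sketch of the parent-child recipe described above: train on the high-resource pair, then continue training the very same parameters while only the corpus is swapped. The tiny PyTorch model and random tensors below are hypothetical stand-ins for a real NMT system and parallel corpora (which, in the paper's setup, would also share a vocabulary); only the training-loop pattern is the point.

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 16)  # stand-in for a full seq2seq NMT model
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    def train(next_batch, steps):
        for _ in range(steps):
            src, tgt = next_batch()
            opt.zero_grad()
            loss = loss_fn(model(src), tgt)
            loss.backward()
            opt.step()

    # "Parent" phase: high-resource language pair (random placeholder data).
    parent_batch = lambda: (torch.randn(32, 16), torch.randn(32, 16))
    train(parent_batch, steps=1000)

    # "Child" phase: only the training corpus is replaced; the weights (and
    # vocabulary) are kept, so training simply continues on the new pair.
    child_batch = lambda: (torch.randn(8, 16), torch.randn(8, 16))
    train(child_batch, steps=200)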
GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to
detect translation quality errors, specifically for the quality estimation
setting without the need for human reference translations. Based on the power
of large language models (LLMs), GEMBA-MQM employs a fixed three-shot prompting
technique, querying the GPT-4 model to mark error quality spans. Compared to
previous works, our method has language-agnostic prompts, thus avoiding the
need for manual prompt preparation for new languages.
While preliminary results indicate that GEMBA-MQM achieves state-of-the-art
accuracy for system ranking, we advise caution when using it in academic works
to demonstrate improvements over other methods due to its dependence on the
proprietary, black-box GPT model.
Comment: Accepted to WMT 2023
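The abstract's fixed few-shot setup can be sketched with the OpenAI Python client (v1 API), as below. The system instruction and the single demonstration are hypothetical paraphrases of the idea, not the paper's released three-shot template, and the returned span list would still need parsing into MQM categories.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM = ("You are an annotator for machine translation quality. "
              "List MQM error spans with category and severity, or 'no errors'.")

    # One illustrative demonstration; GEMBA-MQM uses three fixed ones.
    DEMO_USER = ("Source (English): The cat sat on the mat.\n"
                 "Translation (German): Die Katze sass auf dem Hund.")
    DEMO_ASSISTANT = '"dem Hund" - accuracy/mistranslation - major'

    def mark_error_spans(source, translation, model="gpt-4"):
        messages = [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": DEMO_USER},
            {"role": "assistant", "content": DEMO_ASSISTANT},
            {"role": "user",
             "content": f"Source: {source}\nTranslation: {translation}"},
        ]
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content  # error spans, to be parsed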
Large Language Models Are State-of-the-Art Evaluators of Translation Quality
We describe GEMBA, a GPT-based metric for assessment of translation quality,
which works both with a reference translation and without. In our evaluation,
we focus on zero-shot prompting, comparing four prompt variants in two modes,
based on the availability of the reference. We investigate nine versions of GPT
models, including ChatGPT and GPT-4. We show that our method for translation
quality assessment only works with GPT-3.5 and larger models. Compared with results from the WMT22 Metrics shared task, our method achieves state-of-the-art accuracy in both modes when measured against MQM-based human labels. Our results are
valid on the system level for all three WMT22 Metrics shared task language
pairs, namely English into German, English into Russian, and Chinese into
English. This provides a first glimpse into the usefulness of pre-trained,
generative large language models for quality assessment of translations. We
publicly release all our code and prompt templates used for the experiments
described in this work, as well as all corresponding scoring results, to allow
for external validation and reproducibility.
Comment: Accepted at EAMT, 10 pages, 8 tables, one figure
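The zero-shot setup can be sketched in the same way: a single prompt asks for a quality score on a 0-100 scale, optionally including the reference. The prompt wording, the hard-coded language pair, and the assumption that the model answers with a bare number are illustrative choices, not the released GEMBA templates.

    from openai import OpenAI

    client = OpenAI()

    def score_translation(src, hyp, ref=None, model="gpt-4"):
        prompt = ("Score the following translation from English to German on a "
                  "continuous scale from 0 (no meaning preserved) to 100 "
                  "(perfect meaning and grammar).\n"
                  f"Source: {src}\nTranslation: {hyp}\n")
        if ref is not None:  # reference-based mode; omit for quality estimation
            prompt += f"Reference: {ref}\n"
        prompt += "Score:"
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return float(resp.choices[0].message.content.strip())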
SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window
Reference-based metrics that operate at the sentence level typically
outperform quality estimation metrics, which have access only to the source and
system output. This is unsurprising, since references resolve ambiguities that
may be present in the source. We investigate whether additional source context
can effectively substitute for a reference. We present a metric, SLIDE (SLiding
Document Evaluator), which operates on blocks of sentences using a window that
slides over each document in the test set, feeding each chunk into an
unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains
significantly higher pairwise system accuracy than its sentence-level baseline,
in some cases even eliminating the gap with reference-based metrics. This suggests that source context may provide the same information as a human reference.
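A minimal sketch of the sliding-window idea: consecutive source and output sentences are concatenated into fixed-size blocks, and each block is passed unchanged to an off-the-shelf QE scorer. The qe_score callable, window size, stride, and mean aggregation below are placeholders rather than the paper's exact settings.

    def slide_document(src_sents, mt_sents, qe_score, window=6, stride=3):
        """Score one document by sliding a sentence window over it."""
        scores = []
        for start in range(0, max(len(src_sents) - window, 0) + 1, stride):
            src_chunk = " ".join(src_sents[start:start + window])
            mt_chunk = " ".join(mt_sents[start:start + window])
            scores.append(qe_score(src_chunk, mt_chunk))  # unmodified QE model
        return sum(scores) / len(scores)  # document score = mean over windows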
Results of the WMT17 Neural MT Training Task
This paper presents the results of the WMT17 Neural MT Training Task.
The objective of this task is to explore the methods of training a fixed neural architecture, aiming primarily at the best translation quality and, as a secondary goal, shorter training time.
Task participants were provided with a complete neural machine translation system, fixed training data and the configuration of the network.
The translation was performed in the English-to-Czech direction, and the task was divided into two subtasks with different configurations: one scaled to fit on a 4GB GPU card and another on an 8GB card.
We received 3 submissions for the 4GB variant and 1 submission for the 8GB variant; we also provided our own run for each of the sizes and two baselines.
We translated the test set with the trained models and evaluated the outputs using several automatic metrics.
We also report the results of the human evaluation of the submitted systems.
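For illustration only, scoring a submission with one common automatic metric could look like the sketch below, using the sacreBLEU library; the file names are placeholders, and the task itself reports several metrics in addition to the human evaluation.

    import sacrebleu

    with open("system.translation.cs") as f:
        hypotheses = [line.strip() for line in f]
    with open("newstest.reference.cs") as f:
        references = [line.strip() for line in f]

    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.2f}")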
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT
Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that
utilizing ChatGPT for assessing the quality of machine translation (MT)
achieves state-of-the-art performance at the system level but performs poorly
at the segment level. To further improve the performance of LLMs on MT quality
assessment, we conduct an investigation into several prompting methods, and
propose a new prompting method called Error Analysis Prompting (EAPrompt) by
combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al.,
2022). Our results on WMT22 indicate that prompting LLMs like ChatGPT with
error analysis can generate human-like MT evaluations at both the system and
segment level. Additionally, we are the first to uncover limitations of ChatGPT as an MT evaluator; for example, changing the order of the inputs may significantly influence the judgment when multiple translations are provided in a single query. This work offers a preliminary exploration of prompting LLMs as evaluators to improve the reliability of translation evaluation metrics under the error analysis paradigm.
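A rough sketch of the two-stage pattern described above: the model is first asked to analyze errors step by step, and the identified major and minor errors are then converted into a score. The prompt wording, the fragile last-line parsing, and the -5/-1 severity weights are illustrative assumptions, not the paper's exact template.

    from openai import OpenAI

    client = OpenAI()

    def error_analysis_score(src, hyp, model="gpt-3.5-turbo"):
        ask = (f"Source: {src}\nTranslation: {hyp}\n"
               "Step 1: Identify all translation errors and label each as "
               "major or minor.\n"
               "Step 2: On the final line, report 'major=<n> minor=<n>'.")
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": ask}])
        last = resp.choices[0].message.content.strip().splitlines()[-1]
        counts = dict(kv.split("=") for kv in last.split())  # "major=1 minor=2"
        # Weighted penalty in the spirit of MQM: majors hurt more than minors.
        return -5 * int(counts["major"]) - 1 * int(counts["minor"])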
Neural Monkey: The Current State and Beyond
Neural Monkey is an open-source toolkit for sequence-to-sequence learning. The focus of this paper is to present the current state of the toolkit to its intended audience, which includes students and researchers, both those active in the deep learning community and newcomers. For each of these target groups, we describe the most relevant features of the toolkit, including the simple configuration scheme, methods of model inspection that promote useful intuitions, and a modular design for easy prototyping. We summarize relevant contributions to the research community made using this toolkit and discuss its characteristics with respect to other existing systems. We conclude with a set of proposals for future development.