25 research outputs found
"A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation
Quality Estimation (QE) is the task of evaluating the quality of a translation when a reference translation is not available. The goal of QE aligns with the task of corpus filtering, where we assign a quality score to each sentence pair in the pseudo-parallel corpus. We propose a Quality Estimation based filtering approach to extract high-quality parallel data from the pseudo-parallel corpus. To the best of our knowledge, this is a novel adaptation of the QE framework to extracting a high-quality parallel corpus from a pseudo-parallel corpus. Training with this filtered corpus improves the Machine Translation (MT) system's performance by up to 1.8 BLEU points over the baseline model for the English-Marathi, Chinese-English, and Hindi-Bengali language pairs, where the baseline model is trained on the whole pseudo-parallel corpus. Our few-shot QE model, transfer-learned from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, shows an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair compared to the baseline model. This demonstrates the promise of transfer learning in this setting. QE systems typically require on the order of 7K-25K training instances; our Hindi-Bengali QE model is trained on only 500 instances, about 1/40th of the typical requirement, and achieves comparable performance. All the scripts and datasets utilized in this study will be made publicly available.
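The filtering idea above can be sketched as a simple score-and-threshold pass over the pseudo-parallel corpus. The `length_ratio_score` heuristic below is a hypothetical stand-in for a learned QE model (a real QE model would predict a quality score from the sentence pair directly); it is not the paper's method, just an illustration of the pipeline shape:

```python
def length_ratio_score(src, tgt):
    # Hypothetical stand-in for a trained QE model: penalize extreme
    # length mismatch between source and target sentences.
    n_src, n_tgt = len(src.split()), len(tgt.split())
    return min(n_src, n_tgt) / max(n_src, n_tgt)

def filter_pseudo_parallel(pairs, qe_score, threshold=0.5):
    """Keep only sentence pairs whose QE score clears the threshold."""
    return [(s, t) for s, t in pairs if qe_score(s, t) >= threshold]

# Toy pseudo-parallel corpus: one plausible pair, one noisy pair.
pairs = [
    ("the cat sat on the mat", "billi chatai par baithi"),
    ("hello", "yah ek bahut lamba aur asambandhit vakya hai"),
]
filtered = filter_pseudo_parallel(pairs, length_ratio_score)
```

In practice the scorer would be the few-shot QE model described above, and the threshold would be tuned on held-out data.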
Neural Machine Translation for Low Resource Languages using Bilingual Lexicon Induced from Comparable Corpora
Resources for non-English languages are scarce, and this paper addresses this problem in the context of machine translation by automatically extracting parallel sentence pairs from multilingual articles available on the Internet. We use an end-to-end Siamese bidirectional recurrent neural network to generate parallel sentences from comparable multilingual articles in Wikipedia. We then show that using the harvested dataset improves BLEU scores on both NMT and phrase-based SMT systems for the low-resource language pairs English-Hindi and English-Tamil, compared to training exclusively on the limited bilingual corpora collected for these language pairs.
Comment: 8 pages, 3 figures, 4 tables, NAACL-SRW (2018)
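The mining step above can be sketched as pair scoring in a shared embedding space: assuming some cross-lingual encoder (in the paper, the Siamese network) maps sentences from either language into one vector space, candidate pairs are kept when their cosine similarity clears a threshold. The toy lookup-table "encoders" below are hypothetical placeholders for that trained network:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_parallel(candidates, encode_src, encode_tgt, threshold=0.8):
    # Keep candidate pairs the encoders score as likely mutual translations.
    return [(s, t) for s, t in candidates
            if cosine(encode_src(s), encode_tgt(t)) >= threshold]

# Hypothetical toy embeddings; in practice both sides come from one Siamese encoder.
src_vecs = {"the river is wide": [0.9, 0.1], "good morning": [0.1, 0.9]}
tgt_vecs = {"nadi chaudi hai": [0.85, 0.15], "shubh ratri": [0.2, 0.8]}

kept = mine_parallel(
    [("the river is wide", "nadi chaudi hai"), ("the river is wide", "shubh ratri")],
    src_vecs.get, tgt_vecs.get)
```

Only the genuinely parallel candidate survives the threshold; the mismatched pair is discarded.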
Quality Estimation of Machine Translated Texts based on Direct Evidence from Training Data
Current Machine Translation systems achieve very good results on a growing variety of language pairs and datasets. However, it is now well known that they produce fluent translation outputs that can nevertheless contain serious meaning errors. The Quality Estimation task deals with estimating the quality of translations produced by a Machine Translation system without relying on reference translations. A number of approaches have been suggested over the years. In this paper, we show that the parallel corpus used as training data for the MT system holds direct clues for estimating the quality of the translations it produces. Our experiments show that this simple and direct method holds promise for quality estimation of translations produced by any purely data-driven machine translation system.
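One way to operationalize "direct clues from the training data" (a hedged sketch, not necessarily the paper's exact scoring) is to measure what fraction of an MT output's n-grams were ever observed on the target side of the training corpus; outputs built from unseen n-grams are more likely to contain errors:

```python
def ngrams(tokens, n):
    # Set of n-grams (as tuples) occurring in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage_score(hypothesis, training_targets, n=2):
    """Fraction of hypothesis n-grams seen on the target side of the training data."""
    seen = set()
    for sent in training_targets:
        seen |= ngrams(sent.split(), n)
    hyp = ngrams(hypothesis.split(), n)
    if not hyp:
        return 0.0
    return len(hyp & seen) / len(hyp)

# Toy target side of a training corpus, and one MT output to score.
score = coverage_score("the cat sat on the mat", ["the cat sat", "on the mat"], n=2)
```

Here four of the hypothesis's five bigrams appear in the training targets, giving a coverage of 0.8.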
Findings of the 2015 Workshop on Statistical Machine Translation
This paper presents the results of the WMT15 shared tasks, which included a standard news translation task, a metrics task, a tuning task, a task for run-time estimation of machine translation quality, and an automatic post-editing task. This year, 68 machine translation systems from 24 institutions were submitted to the ten translation directions in the standard translation task. An additional 7 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had three subtasks, with a total of 10 teams submitting 34 entries. The pilot automatic post-editing task had a total of 4 teams submitting 7 entries.
Findings of the 2014 Workshop on Statistical Machine Translation
This paper presents the results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation quality, and a metrics task. This year, 143 machine translation systems from 23 institutions were submitted to the ten translation directions in the standard translation task. An additional 6 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had four subtasks, with a total of 10 teams submitting 57 entries.
A Survey of Word Reordering Model in Statistical Machine Translation
Machine translation is the process of translating one natural language into another by computer. In statistical machine translation, word reordering is a major challenge for distant language pairs, and it is an important factor in translation quality and efficiency. Word reordering is especially challenging for Indian languages, which have large structural differences from English, as in the English-Hindi pair. This paper presents a description of statistical machine translation, reordering models, and reordering types.
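As a concrete example of one family of reordering models such a survey covers, phrase-based SMT systems (e.g., Moses) commonly use a distance-based distortion penalty that discounts jumps between consecutively translated source phrases; the parameter names below are illustrative:

```python
def distortion_penalty(prev_phrase_end, curr_phrase_start, alpha=0.9):
    # Distance-based reordering: monotone translation (the next phrase starts
    # right after the previous one, distance 0) costs nothing; each skipped
    # source position multiplies the score by a discount factor alpha.
    distance = abs(curr_phrase_start - prev_phrase_end - 1)
    return alpha ** distance

monotone = distortion_penalty(3, 4)  # next phrase immediately follows: distance 0
jump = distortion_penalty(3, 7)      # phrase jumps ahead three positions
```

Lexicalized reordering models refine this by conditioning the monotone/swap/discontinuous choice on the phrases themselves rather than on distance alone.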
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
India has a rich linguistic landscape, with languages from 4 major language families spoken by over a billion people. The 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given this linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation model supporting all 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpus for Indic languages. BPCC contains a total of 230M bitext pairs, of which 126M are newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2.