Improved Statistical Machine Translation Using Paraphrases
Parallel corpora are crucial for training SMT systems. However, for many language pairs they are available only in very limited quantities. For these language pairs a huge portion of phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with these otherwise unknown source language phrases. Our results show that augmenting a state-of-the-art SMT system with paraphrases leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
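The coverage figure quoted above can be illustrated with a small sketch. It assumes that coverage means the fraction of unique test-set word types for which the system has some translation, either directly or via a paraphrase whose translation is known. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def unigram_coverage(test_sentences, known_words):
    """Fraction of unique test-set unigrams that appear in known_words."""
    test_types = {tok for sent in test_sentences for tok in sent.split()}
    if not test_types:
        return 0.0
    return len(test_types & set(known_words)) / len(test_types)


def coverage_with_paraphrases(test_sentences, known_words, paraphrases):
    """Coverage when an unknown word also counts as covered if at least
    one of its paraphrases is a known word.

    paraphrases: dict mapping a word to a list of candidate paraphrases
    (an assumed, illustrative data structure).
    """
    known = set(known_words)
    test_types = {tok for sent in test_sentences for tok in sent.split()}
    if not test_types:
        return 0.0
    covered = {
        w for w in test_types
        if w in known or any(p in known for p in paraphrases.get(w, []))
    }
    return len(covered) / len(test_types)


# Toy example: five unique test words, two of them known directly.
test_set = ["el gato come", "el perro duerme"]
known = {"el", "gato", "can"}
paras = {"perro": ["can"]}

base = unigram_coverage(test_set, known)
augmented = coverage_with_paraphrases(test_set, known, paras)
```

In the toy data, "perro" is unseen but its paraphrase "can" is known, so coverage rises from 2 of 5 types to 3 of 5.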
Paraphrasing and Translation
Paraphrasing and translation have previously been treated as unconnected natural language
processing tasks. Whereas translation represents the preservation of meaning
when an idea is rendered in the words of a different language, paraphrasing represents
the preservation of meaning when an idea is expressed using different words in the
same language. We show that the two are intimately related. The major contributions
of this thesis are as follows:
• We define a novel technique for automatically generating paraphrases using
bilingual parallel corpora, which are more commonly used as training data for
statistical models of translation.
• We show that paraphrases can be used to improve the quality of statistical
machine translation by addressing the problem of coverage and introducing a degree
of generalization into the models.
• We explore the topic of automatic evaluation of translation quality, and show that
the current standard evaluation methodology cannot be guaranteed to correlate
with human judgments of translation quality.

Whereas previous data-driven approaches to paraphrasing were dependent upon
either data sources which were uncommon, such as multiple translations of the same
source text, or language-specific resources such as parsers, our approach is able to
harness more widely available parallel corpora and can be applied to any language which has
a parallel corpus. The technique was evaluated by replacing phrases with their
paraphrases, and asking judges whether the meaning of the original phrase was retained
and whether the resulting sentence remained grammatical. Paraphrases extracted from
a parallel corpus with manual alignments are judged to be accurate (both meaningful
and grammatical) 75% of the time, retaining the meaning of the original phrase 85%
of the time. Using automatic alignments, meaning can be retained at a rate of 70%.

Being a language-independent and probabilistic approach allows our method to be
easily integrated into statistical machine translation. A paraphrase model derived from
parallel corpora other than the one used to train the translation model can be used to
increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but
a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that
augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000
sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%,
with more than half of the newly covered items accurately translated, as opposed to
none in current approaches.
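The fallback described in this abstract (paraphrase an unseen phrase, then translate the paraphrase) can be sketched as follows. The abstract does not give a scoring formula, so the sketch assumes the commonly used pivot formulation, in which a paraphrase is scored by summing over shared foreign phrases: p(e2 | e1) = Σ_f p(f | e1) · p(e2 | f). All function names, table layouts, and toy entries are hypothetical.

```python
from collections import defaultdict


def pivot_paraphrases(phrase_to_pivot, pivot_to_phrase, phrase):
    """Score paraphrases of `phrase` by pivoting through foreign phrases.

    phrase_to_pivot[e1][f] is assumed to hold p(f | e1);
    pivot_to_phrase[f][e2] is assumed to hold p(e2 | f).
    Returns a dict {e2: sum_f p(f | e1) * p(e2 | f)}, excluding e1 itself.
    """
    scores = defaultdict(float)
    for pivot, p_pivot in phrase_to_pivot.get(phrase, {}).items():
        for candidate, p_cand in pivot_to_phrase.get(pivot, {}).items():
            if candidate != phrase:
                scores[candidate] += p_pivot * p_cand
    return dict(scores)


def translate_with_fallback(phrase, translation_table, paraphrase_scores):
    """Translate directly if possible; otherwise translate the
    highest-scoring paraphrase that has a known translation."""
    if phrase in translation_table:
        return translation_table[phrase]
    ranked = sorted(paraphrase_scores, key=paraphrase_scores.get, reverse=True)
    for candidate in ranked:
        if candidate in translation_table:
            return translation_table[candidate]
    return None  # still out of vocabulary


# Toy tables (illustrative values only).
phrase_to_pivot = {"under control": {"unter kontrolle": 1.0}}
pivot_to_phrase = {"unter kontrolle": {"under control": 0.5, "in check": 0.5}}
translation_table = {"in check": "bajo control"}

scores = pivot_paraphrases(phrase_to_pivot, pivot_to_phrase, "under control")
result = translate_with_fallback("under control", translation_table, scores)
```

A full system would add the paraphrase-derived entries to the phrase table with an extra feature reflecting the paraphrase probability rather than picking a single best candidate; the greedy lookup here is only meant to show the idea.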
Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation
This paper is interested in improving the quality of Arabic-English statistical machine translation (SMT) on highly dialectal Arabic text using morphological knowledge. We present a light-weight rule-based approach to producing Modern Standard Arabic (MSA) paraphrases of dialectal Arabic out-of-vocabulary words and low frequency words. Our approach extends an existing MSA analyzer with a small number of morphological clitics and transfer rules. The generated paraphrase lattices are input to a state-of-the-art phrase-based SMT system resulting in improved BLEU scores on a blind test set by 0.56 absolute BLEU (or 1.5% relative).
Incorporating source-language paraphrases into phrase-based SMT with confusion networks
To increase the model coverage, source-language paraphrases have been utilized to boost SMT system performance. Previous
work showed that word lattices constructed from paraphrases are able to reduce out-of-vocabulary words and to express inputs in different ways for better translation quality.
However, such a word-lattice-based method suffers from two problems: 1) path duplications in word lattices decrease the capacities for potential paraphrases; 2) lattice decoding in SMT dramatically increases the search space and results in poor time efficiency. Therefore, in this paper, we adopt word confusion networks as the input structure to carry source-language paraphrase information. Similar to previous work, we use word lattices to build word confusion networks for merging of duplicated paths and faster decoding. Experiments are carried out on small-, medium- and large-scale English–Chinese
translation tasks, and we show that compared with the word-lattice-based method, the decoding time on three tasks is reduced significantly (up to 79%) while comparable
translation quality is obtained on the large-scale task.
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.
Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Contextual bitext-derived paraphrases in automatic MT evaluation
In this paper we present a novel method for deriving paraphrases during automatic MT evaluation using only the source and reference texts, which are necessary for
the evaluation, and word and phrase alignment software. Using target language paraphrases produced through word and
phrase alignment, a number of alternative reference sentences are constructed automatically for each candidate translation. The method produces lexical and low-level
syntactic paraphrases that are relevant to the domain at hand, does not use external knowledge resources, and can be
combined with a variety of automatic MT evaluation system
On the Evaluation of Semantic Phenomena in Neural Machine Translation Using Natural Language Inference
We propose a process for investigating the extent to which sentence
representations arising from neural machine translation (NMT) systems encode
distinct semantic phenomena. We use these representations as features to train
a natural language inference (NLI) classifier based on datasets recast from
existing semantic annotations. In applying this process to a representative NMT
system, we find its encoder appears most suited to supporting inferences at the
syntax-semantics interface, as compared to anaphora resolution requiring
world-knowledge. We conclude with a discussion on the merits and potential
deficiencies of the existing process, and how it may be improved and extended
as a broader framework for evaluating semantic coverage.
Comment: To be presented at NAACL 2018 - 11 page
Using TERp to augment the system combination for SMT
TER-Plus (TERp) is an extended TER evaluation metric incorporating morphology, synonymy and paraphrases.
There are three new edit operations in TERp: Stem Matches, Synonym Matches and Phrase Substitutions (Paraphrases). In this paper, we propose a TERp-based augmented system combination in terms of the backbone selection and consensus decoding network. Combining the new properties
of the TERp, we also propose a two-pass decoding strategy for the lattice-based phrase-level confusion network (CN) to generate the final result. The experiments conducted on the NIST2008 Chinese-to-English test set show that our TERp-based augmented system combination framework achieves significant improvements in terms of BLEU and TERp scores compared to the state-of-the-art word-level system combination framework and a TER-based combination strategy.