Incorporating source-language paraphrases into phrase-based SMT with confusion networks
To increase model coverage, source-language paraphrases have been utilized to boost SMT system performance. Previous work showed that word lattices constructed from paraphrases are able to reduce out-of-vocabulary words and to express inputs in different ways for better translation quality. However, such a word-lattice-based method suffers from two problems: 1) path duplications in word lattices decrease the capacities for potential paraphrases; 2) lattice decoding in SMT dramatically increases the search space and results in poor time efficiency. Therefore, in this paper, we adopt word confusion networks as the input structure to carry source-language paraphrase information. Similar to previous work, we use word lattices to build word confusion networks for merging of duplicated paths and faster decoding. Experiments are carried out on small-, medium- and large-scale English–Chinese translation tasks, and we show that compared with the word-lattice-based method, the decoding time on three tasks is reduced significantly (up to 79%) while comparable translation quality is obtained on the large-scale task.
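As an illustration of the input structure named in the abstract above, here is a minimal sketch of a word confusion network: a sequence of columns, each holding weighted alternative words, where an epsilon entry lets a column be skipped. The example sentence, the probabilities and the helper names (`num_paths`, `best_path`) are assumptions made for illustration; the paper's own lattice-to-network conversion is not reproduced here.

```python
from math import prod

EPS = ""  # epsilon arc: choosing it emits no word for that column

# Hypothetical confusion network; each column lists (word, probability) pairs.
cn = [
    [("the", 1.0)],
    [("movie", 0.6), ("film", 0.4)],
    [("was", 1.0)],
    [("great", 0.5), ("excellent", 0.3), ("very", 0.2)],
    [(EPS, 0.8), ("good", 0.2)],
]

def num_paths(network):
    """Every path picks exactly one arc per column, so counts multiply."""
    return prod(len(column) for column in network)

def best_path(network):
    """Pick the most probable arc in each column and drop epsilon arcs."""
    words = [max(column, key=lambda arc: arc[1])[0] for column in network]
    return [w for w in words if w != EPS]

print(num_paths(cn))
print(" ".join(best_path(cn)))
```

Because columns are chosen independently, such a network is compact but can also encode word sequences that no single paraphrase contained (here, "the movie was very"), which is one reason translation quality has to be checked separately from decoding speed.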
Facilitating translation using source language paraphrase lattices
For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out-of-vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small-, medium- and large-scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resource-limited language pairs, but also for resource-sufficient pairs to some extent.
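The lattice construction described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`build_lattice`, `paths`), the example sentence, the paraphrase table and the edge weights are all hypothetical, and the paper's actual weight estimation is not shown.

```python
from collections import defaultdict

def build_lattice(words, paraphrases):
    """Build a word lattice over node ids 0..len(words).

    paraphrases maps a source span (start, end) to a list of
    (replacement_tokens, weight); weights here are made-up values.
    Returns {node: [(next_node, token, weight), ...]}.
    """
    edges = defaultdict(list)
    # Original sentence: node i --word--> node i + 1, weight 1.0.
    for i, word in enumerate(words):
        edges[i].append((i + 1, word, 1.0))
    next_free = len(words) + 1  # ids for new intermediate nodes
    for (start, end), alternatives in paraphrases.items():
        for tokens, weight in alternatives:
            node = start
            for k, token in enumerate(tokens):
                is_last = k == len(tokens) - 1
                target = end if is_last else next_free
                if not is_last:
                    next_free += 1
                # Put the paraphrase weight on its first edge only.
                edges[node].append((target, token, weight if k == 0 else 1.0))
                node = target
    return edges

def paths(edges, node, final):
    """Yield (tokens, score) for every path from node to final."""
    if node == final:
        yield [], 1.0
        return
    for nxt, token, weight in edges[node]:
        for rest, score in paths(edges, nxt, final):
            yield [token] + rest, weight * score

words = "he bought a car".split()
table = {(1, 2): [(["purchased"], 0.4)], (2, 4): [(["an", "automobile"], 0.3)]}
lattice = build_lattice(words, table)
for tokens, score in paths(lattice, 0, len(words)):
    print(round(score, 2), " ".join(tokens))
```

A decoder that accepts lattice input would then search over all of these paths at once instead of translating only the original word sequence.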
Bootstrapping Lexical Choice via Multiple-Sequence Alignment
An important component of any generation system is the mapping dictionary, a
lexicon of elementary semantic expressions and corresponding natural language
realizations. Typically, labor-intensive knowledge-based methods are used to
construct the dictionary. We instead propose to acquire it automatically via a
novel multiple-pass algorithm employing multiple-sequence alignment, a
technique commonly used in bioinformatics. Crucially, our method leverages
latent information contained in multi-parallel corpora -- datasets that supply
several verbalizations of the corresponding semantics rather than just one.
We used our techniques to generate natural language versions of
computer-generated mathematical proofs, with good results on both a
per-component and overall-output basis. For example, in evaluations involving a
dozen human judges, our system produced output whose readability and
faithfulness to the semantic input rivaled that of a traditional generation
system.
Comment: 8 pages; to appear in the proceedings of EMNLP-200
Using TERp to augment the system combination for SMT
TER-Plus (TERp) is an extended TER evaluation metric incorporating morphology, synonymy and paraphrases.
There are three new edit operations in TERp: Stem Matches, Synonym Matches and Phrase Substitutions (Paraphrases). In this paper, we propose a TERp-based augmented system combination in terms of the backbone selection and consensus decoding network. Combining the new properties of the TERp, we also propose a two-pass decoding strategy for the lattice-based phrase-level confusion network (CN) to generate the final result. The experiments conducted on the NIST2008 Chinese-to-English test set show that our TERp-based augmented system combination framework achieves significant improvements in terms of BLEU and TERp scores compared to the state-of-the-art word-level system combination framework and a TER-based combination strategy.
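To give a feel for how stem and synonym matches soften a word-level edit distance, here is a simplified sketch. It is not the TERp metric: it has no shifts and no phrase substitutions, and the cost values, the toy stemmer and the synonym table are assumptions chosen purely for illustration.

```python
def relaxed_edit_cost(hyp, ref, synonyms, stem, costs=None):
    """Word-level edit distance with cheaper stem and synonym matches."""
    costs = costs or {"ins": 1.0, "del": 1.0, "sub": 1.0, "stem": 0.2, "syn": 0.2}

    def sub_cost(h, r):
        if h == r:
            return 0.0
        if stem(h) == stem(r):
            return costs["stem"]
        if r in synonyms.get(h, ()) or h in synonyms.get(r, ()):
            return costs["syn"]
        return costs["sub"]

    n, m = len(hyp), len(ref)
    # d[i][j] = cheapest way to turn hyp[:i] into ref[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + costs["del"]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + costs["ins"]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + costs["del"],
                d[i][j - 1] + costs["ins"],
                d[i - 1][j - 1] + sub_cost(hyp[i - 1], ref[j - 1]),
            )
    return d[n][m]

def toy_stem(word):
    """Hypothetical stemmer: strips a single trailing 's'."""
    return word[:-1] if word.endswith("s") else word

hyp = "the cats sat on a couch".split()
ref = "the cat sat on a sofa".split()
cost = relaxed_edit_cost(hyp, ref, {"couch": {"sofa"}}, toy_stem)
print(cost)
```

With plain substitutions the two sentences would differ by two full edits; with the relaxed costs the stem match and the synonym match are each charged only a fraction of an edit.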
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.
Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Contextual bitext-derived paraphrases in automatic MT evaluation
In this paper we present a novel method for deriving paraphrases during automatic MT evaluation using only the source and reference texts, which are necessary for the evaluation, and word and phrase alignment software. Using target language paraphrases produced through word and phrase alignment, a number of alternative reference sentences are constructed automatically for each candidate translation. The method produces lexical and low-level syntactic paraphrases that are relevant to the domain in hand, does not use external knowledge resources, and can be combined with a variety of automatic MT evaluation systems.
Phrase-level System Combination for Machine Translation Based on Target-to-Target Decoding
In this paper, we propose a novel lattice-based MT combination methodology that we call Target-to-Target Decoding (TTD). The combination process is carried out as a “translation” from backbone to the combination result. This perspective suggests the use of existing phrase-based MT techniques in the combination framework. We show how phrase extraction rules and confidence estimations inspired from machine translation improve results. We also propose system-specific LMs for estimating N-gram consensus. Our results show that our approach yields a strong improvement over the best single MT system and competes with other state-of-the-art combination systems
The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations are. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfair penalization of translation
hypotheses which could easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem since they are labor intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity---the primary open problem
that I address in this dissertation---one that has a serious effect on the
SMT parameter tuning process.
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, "translate" any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance of a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection.
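The phrasal paraphrase extraction attributed above to Bannard and Callison-Burch (2005) is commonly summarized as pivoting through foreign phrases: p(e2 | e1) = sum over f of p(f | e1) * p(e2 | f). The sketch below illustrates only that phrasal scoring step, not the sentential paraphraser of the thesis; the function name `pivot_paraphrases`, the German pivot phrases and all probabilities are made-up assumptions.

```python
from collections import defaultdict

def pivot_paraphrases(p_f_given_e, p_e_given_f, phrase):
    """Score paraphrases of `phrase` by marginalizing over pivot phrases f."""
    scores = defaultdict(float)
    for f, p_fe in p_f_given_e.get(phrase, {}).items():
        for e2, p_ef in p_e_given_f.get(f, {}).items():
            if e2 != phrase:  # skip the trivial self-paraphrase
                scores[e2] += p_fe * p_ef
    return dict(scores)

# Hypothetical phrase-table fragments (English <-> German).
p_f_given_e = {"under control": {"unter kontrolle": 0.8, "im griff": 0.2}}
p_e_given_f = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "in check": 0.25, "in hand": 0.25},
}

scores = pivot_paraphrases(p_f_given_e, p_e_given_f, "under control")
for candidate, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(round(score, 2), candidate)
```

Candidates reached through several pivot phrases accumulate probability mass, which is why "in check" outranks "in hand" in this toy table.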