590 research outputs found
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Data-driven sentence simplification: Survey and benchmark
Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output, is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common datasets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments
CCG-augmented hierarchical phrase-based statistical machine translation
Augmenting Statistical Machine Translation (SMT) systems with syntactic information aims at improving translation quality. Hierarchical Phrase-Based (HPB) SMT takes a step toward incorporating syntax in Phrase-Based (PB) SMT by modelling one aspect of language syntax, namely the hierarchical structure of phrases. Syntax Augmented Machine Translation (SAMT) further incorporates syntactic information extracted using context free phrase structure grammar (CF-PSG) in the HPB SMT model. One of the main challenges facing CF-PSG-based augmentation approaches for SMT systems emerges from the difference in the definition of the constituent in CF-PSG and the ‘phrase’ in SMT systems, which hinders the ability of CF-PSG to express the syntactic function of many SMT phrases. Although the SAMT approach to solving this problem using ‘CCG-like’ operators to combine constituent labels improves syntactic constraint coverage, it significantly increases their sparsity, which restricts translation and negatively affects its quality.
In this thesis, we address the problems of sparsity and limited coverage of syntactic constraints facing the CF-PSG-based syntax augmentation approaches for HPB SMT using Combinatory Cateogiral Grammar (CCG). We demonstrate that
CCG’s flexible structures and rich syntactic descriptors help to extract richer, more expressive and less sparse syntactic constraints with better coverage than CF-PSG,
which enables our CCG-augmented HPB system to outperform the SAMT system. We also try to soften the syntactic constraints imposed by CCG category nonterminal labels by extracting less fine-grained CCG-based labels. We demonstrate that CCG label simplification helps to significantly improve the performance of our CCG category HPB system. Finally, we identify the factors which limit the coverage of the syntactic constraints in our CCG-augmented HPB model. We then try to tackle these factors by extending the definition of the nonterminal label to be composed of a sequence of CCG categories and augmenting the glue grammar with CCG combinatory rules. We demonstrate that our extension approaches help to significantly increase the scope of the syntactic constraints applied in our CCG-augmented HPB model and achieve significant improvements over the HPB SMT baseline
The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning''.
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfair penalization of translation
hypotheses which could easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem since they are labor intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity---the primary open problem
that I address in this dissertation---one that has a serious effect on the
SMT parameter tuning process.
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, ``translate'' any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection
- …