The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations are. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfairly penalizing translation
hypotheses, which could easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem since they are labor intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity---the primary open problem
that I address in this dissertation---one that has a serious effect on the
SMT parameter tuning process.
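The feedback step described above can be illustrated with a minimal sketch of multi-reference scoring. This is not the full BLEU metric used in real tuning (which combines n-grams up to length 4 with a brevity penalty); it shows only clipped unigram precision, enough to see why a single reference unfairly penalizes valid lexical variation:

```python
from collections import Counter

def clipped_unigram_precision(hypothesis, references):
    """Modified unigram precision: each hypothesis word is credited up to
    the maximum number of times it appears in any single reference."""
    hyp_counts = Counter(hypothesis)
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(c, max_ref_counts[w]) for w, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

hyp = "the felines sat on the mat".split()
single_ref = ["the cats sat on the mat".split()]
multi_refs = single_ref + ["the felines were sitting on the mat".split()]

# A valid lexical choice ("felines") is penalized with one reference
# but credited when a second reference happens to contain it.
print(clipped_unigram_precision(hyp, single_ref))   # 5/6
print(clipped_unigram_precision(hyp, multi_refs))   # 1.0
```

With only one reference the hypothesis loses credit for a perfectly valid word choice; the second reference restores it, which is exactly the reference-sparsity effect the thesis targets.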
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, "translate" any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
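The phrasal pivoting step of Bannard and Callison-Burch (2005), which the sentential paraphraser builds on, marginalizes over foreign phrases f: p(e2 | e1) = Σ_f p(e2 | f) p(f | e1). A minimal sketch, with invented toy probabilities rather than entries from a real phrase table:

```python
def pivot_paraphrase_probs(p_f_given_e, p_e_given_f, phrase):
    """Bilingual pivoting: score each candidate paraphrase e2 of an English
    phrase e1 by summing p(e2|f) * p(f|e1) over all foreign pivots f."""
    scores = {}
    for f, p_fe in p_f_given_e.get(phrase, {}).items():
        for e2, p_ef in p_e_given_f.get(f, {}).items():
            if e2 != phrase:  # exclude the phrase itself
                scores[e2] = scores.get(e2, 0.0) + p_ef * p_fe
    return scores

# Toy phrase-table probabilities (illustrative numbers only).
p_f_given_e = {"under control": {"unter kontrolle": 1.0}}
p_e_given_f = {"unter kontrolle": {"under control": 0.6,
                                   "in check": 0.3,
                                   "curbed": 0.1}}
print(pivot_paraphrase_probs(p_f_given_e, p_e_given_f, "under control"))
# {'in check': 0.3, 'curbed': 0.1}
```

The sentential paraphraser goes further by packaging such rules into a full English-to-English decoder, but the pivot computation above is the phrasal core it starts from.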
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance of a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection.
Improving Statistical Machine Translation Using Comparable Corpora
With thousands of languages in the world, and the increasing speed and quantity of information being distributed across the world, automatic translation between languages by computers, Machine Translation (MT), has become an increasingly important area of research. State-of-the-art MT systems rely not upon hand-crafted translation rules written by human experts, but rather on learned statistical models that translate a source language to a target language. These models are typically generated from large, parallel corpora containing copies of text in both the source and target languages. The co-occurrence of words across languages in parallel corpora allows the creation of translation rules that specify the probability of translating words or phrases from one language to the other. Monolingual corpora, containing text only in one language--primarily the target language--are not used to model the translation process, but are used to better model the structure of the target language. Unlike parallel data, which require expensive human translators to generate, monolingual data are cheap and widely available.
Similar topics and events to those in a source document that is being translated often occur in documents in a comparable monolingual corpus. In much the same way that a human translator would use world knowledge to aid translation, the MT system may be able to use these relevant documents from comparable corpora to guide translation by biasing the translation system to produce output more similar to the relevant documents. This thesis seeks to answer the following questions: (1) Is it possible to improve a modern, state-of-the-art translation system by biasing the MT output to be more similar to relevant passages from comparable monolingual text? (2) What level of similarity is necessary to exploit these techniques? (3) What is the nature of the relevant passages that are needed during the application of these techniques?
To answer these questions, this thesis describes a method for generating new translation rules from monolingual data specifically targeted at the document being translated. Rule generation leverages the existing translation system and topical overlap between the foreign source text and the monolingual text, and unlike regular translation rule generation does not require parallel text. For each source document to be translated, potentially comparable documents are selected from the monolingual data using cross-lingual information retrieval. By biasing the MT system towards the selected relevant documents and then measuring the similarity of the biased output to the relevant documents using Translation Edit Rate Plus (TERp), it is possible to identify sub-sentential regions of the source and comparable documents that are possible translations of each other. This process results in the generation of new translation rules, where the source side is taken from the document to be translated and the target side is fluent target-language text taken from the monolingual data. The use of these rules results in improvements over a state-of-the-art statistical translation system. These techniques are most effective when there is a high degree of similarity between the source and relevant passages, such as when they report on the same news stories, but approximately half of the benefit can still be achieved when the passages are only historically or topically related.
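The similarity measurement is based on translation edit rate. Full TERp adds shifts, stemming, synonymy and paraphrase matching; the sketch below implements only the simplest word-level variant (insertions, deletions, substitutions over reference length) to make the underlying quantity concrete:

```python
def word_edit_rate(hyp, ref):
    """Simplified translation edit rate: word-level Levenshtein distance
    (no block shifts, unlike full TER/TERp) divided by reference length."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = edits to turn the first i hyp words into the first j ref words
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(h)][len(r)] / len(r)

print(word_edit_rate("the troops entered the city",
                     "the soldiers entered the city"))  # 0.2
```

A low edit rate between the biased MT output and a comparable passage signals that the two regions are likely translations of each other, which is the trigger for extracting a new rule.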
The discovery of the feasibility of improving MT by using comparable passages to bias MT output provides a basis for future investigation of problems of this type. Ultimately, the goal is to provide a framework within which translation rules may be generated without additional parallel corpora, thus allowing researchers to test longstanding hypotheses about machine translation in the face of scarce parallel resources.
Monolingual Sentence Rewriting as Machine Translation: Generation and Evaluation
In this thesis, we investigate approaches to paraphrasing entire sentences within the constraints of a given task, which we call monolingual sentence rewriting. We introduce a unified framework for monolingual sentence rewriting, and apply it to three representative tasks: sentence compression, text simplification, and grammatical error correction. We also perform a detailed analysis of the evaluation methodologies for each task, identify bias in common evaluation techniques, and propose more reliable practices.
Monolingual rewriting can be thought of as translating between two types of English (such as from complex to simple), and therefore our approach is inspired by statistical machine translation. In machine translation, a large quantity of parallel data is necessary to model the transformations from input to output text. Parallel bilingual data naturally occurs between common language pairs (such as English and French), but for monolingual sentence rewriting, there is little existing parallel data and annotation is costly. We modify the statistical machine translation pipeline to harness monolingual resources and insights into task constraints in order to drastically diminish the amount of annotated data necessary to train a robust system. Our method generates more meaning-preserving and grammatical sentences than earlier approaches and requires less task-specific data.
Once candidate sentences are generated, it is crucial to have reliable evaluation methods. Sentential paraphrases must fulfill a variety of requirements: preserve the meaning of the original sentence, be grammatical, and meet any stylistic or task-specific constraints. We analyze common evaluation practices and propose better methods that more accurately measure the quality of output. Often overlooked, robust automatic evaluation methodology is necessary for improving systems, and this work presents new metrics and outlines important considerations for reliably measuring the quality of generated text.
Deep neural networks for identification of sentential relations
Natural language processing (NLP) is one of the most important technologies in the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate mostly in language: web search, advertisement, emails, customer service, language translation, etc. There are a large variety of underlying tasks and machine learning models powering NLP applications.
Recently, deep learning approaches have obtained exciting performance across a broad array of NLP tasks. These models can often be trained in an end-to-end paradigm without traditional, task-specific feature engineering.
This dissertation focuses on a specific NLP task --- sentential relation
identification. Successfully identifying the relations of two sentences can contribute greatly to some downstream NLP problems. For example, in open-domain question answering, if the system can recognize that a new question is a paraphrase of a previously observed question, the known answers can be returned directly,
avoiding redundant reasoning. Identifying such relations is also helpful for discovering latent knowledge, such as inferring "the weather is good today" from the description "it is sunny today". This dissertation presents deep neural networks (DNNs) developed to handle the sentential relation identification problem. More specifically, the problem is addressed in the following three aspects.
(i) Sentential relation representation is built on the matching between phrases of arbitrary lengths. Stacked Convolutional Neural Networks (CNNs) are employed to model the sentences, so that each filter covers a local phrase; filters at lower levels span shorter phrases and filters at higher levels span longer phrases. Stacking CNNs makes it possible to model sentence phrases at different granularities and levels of abstraction.
(ii) Phrase matches contribute differently to different tasks. This motivates us to propose an attention mechanism for CNNs, differing from the popular line of research on attention mechanisms in Recurrent Neural Networks (RNNs). Attention is implemented in both the convolution and pooling layers of deep CNNs, in order to determine automatically which phrase of one sentence matches a specific phrase of the other sentence. These matches are expected to be indicative of the final decision. A further contribution regarding attention is inspired by the observation that some sentential relation identification tasks, like answer selection for multiple-choice question answering, are mainly determined by phrase alignments of stronger degree; in contrast, tasks such as textual entailment benefit more from phrase alignments of weaker degree. This motivates us to propose a dynamic "attentive pooling" that selects phrase alignments of different intensities for different task categories.
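The attention-over-phrase-alignments idea can be sketched with NumPy. This is a parameter-free illustration only; in the dissertation's models the attention weights and phrase representations come from learned CNN layers, whereas here the phrase vectors are arbitrary inputs and alignment is plain dot-product similarity:

```python
import numpy as np

def attentive_pooling(S1, S2):
    """Toy attentive pooling: align the phrase representations of two
    sentences (rows of S1 and S2), then weight each phrase by how strongly
    it matches the other sentence before pooling into a sentence vector."""
    A = S1 @ S2.T                        # alignment scores between all phrase pairs
    w1 = A.max(axis=1)                   # best-match score per phrase of sentence 1
    w2 = A.max(axis=0)                   # best-match score per phrase of sentence 2
    a1 = np.exp(w1) / np.exp(w1).sum()   # softmax attention weights
    a2 = np.exp(w2) / np.exp(w2).sum()
    return a1 @ S1, a2 @ S2              # attention-weighted sentence vectors

rng = np.random.default_rng(0)
S1 = rng.random((4, 8))   # 4 phrase vectors for sentence 1
S2 = rng.random((6, 8))   # 6 phrase vectors for sentence 2
v1, v2 = attentive_pooling(S1, S2)
print(v1.shape, v2.shape)  # (8,) (8,)
```

Selecting alignments of "different intensities" would correspond to pooling over different ranks of the alignment matrix A rather than always the maximum, as the dynamic attentive pooling does per task category.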
(iii) In certain scenarios, sentential relations can only be successfully identified with specific background knowledge, such as multiple-choice question answering based on passage comprehension. In this case, the relation between two sentences (question and answer candidate) depends not only on the semantics of the two sentences, but also on the information encoded in the given passage.
Overall, the work in this dissertation models sentential relations with hierarchical DNNs, different attention mechanisms, and different sources of background knowledge. All systems achieve state-of-the-art performance on representative tasks.
Paraphrasing and Translation
Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words of a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows:
• We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation.
• We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models.
• We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.
Whereas previous data-driven approaches to paraphrasing depended upon either uncommon data sources, such as multiple translations of the same source text, or language-specific resources, such as parsers, our approach is able to harness more widely available parallel corpora and can be applied to any language that has a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases and asking judges whether the meaning of the original phrase was retained and whether the resulting sentence remained grammatical. Paraphrases extracted from a parallel corpus with manual alignments are judged to be accurate (both meaningful and grammatical) 75% of the time, retaining the meaning of the original phrase 85% of the time. Using automatic alignments, meaning can be retained at a rate of 70%.
Being a language-independent and probabilistic approach allows our method to be easily integrated into statistical machine translation. A paraphrase model derived from parallel corpora other than the one used to train the translation model can be used to increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
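The coverage mechanism can be sketched as follows. This is a toy illustration; the phrase-table entries and paraphrase scores are invented, and a real SMT decoder would integrate the paraphrase probability into its model scores rather than applying a hard fallback:

```python
def translate_with_paraphrase_fallback(words, phrase_table, paraphrases):
    """If a word has no entry in the phrase table, fall back to translating
    its best-scored known paraphrase (toy version of the coverage technique)."""
    output = []
    for w in words:
        if w in phrase_table:
            output.append(phrase_table[w])
            continue
        # try paraphrases of the unseen word, best-scored first
        for p, _score in sorted(paraphrases.get(w, {}).items(),
                                key=lambda kv: -kv[1]):
            if p in phrase_table:
                output.append(phrase_table[p])
                break
        else:
            output.append(w)  # pass through untranslated
    return output

phrase_table = {"car": "voiture", "red": "rouge"}          # learned translations
paraphrases = {"automobile": {"car": 0.8, "vehicle": 0.2}} # pivot paraphrases
print(translate_with_paraphrase_fallback(["red", "automobile"],
                                         phrase_table, paraphrases))
# ['rouge', 'voiture']
```

The unseen word "automobile" is paraphrased to "car", whose translation is known, so the sentence is covered instead of passing the word through untranslated.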
Paraphrase identification using knowledge-lean techniques
This research addresses the problem of identification of sentential paraphrases; that is, the ability of an estimator to predict well whether two sentential text fragments are paraphrases. The paraphrase identification task has practical importance in the Natural Language Processing (NLP) community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding. This includes key applications such as machine translation, information retrieval and question answering amongst others. Over the course of the last decade, a growing body of research has been conducted on paraphrase identification and it has become an individual working area of NLP.
Our objective is to investigate whether techniques that concentrate on automated understanding of text while requiring fewer resources can achieve results comparable to methods employing more sophisticated NLP processing tools and other resources. These techniques, which we call “knowledge-lean”, range from simple, shallow overlap methods based on lexical items or n-grams through to more sophisticated methods that employ automatically generated distributional thesauri.
The work begins by focusing on techniques that exploit lexical overlap and text-based statistical techniques that require few NLP tools. We investigate the question “To what extent can these methods be used for a paraphrase identification task?” On two gold-standard datasets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs).
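The knowledge-lean feature extraction can be sketched as follows. The exact feature set used in the thesis is not specified here, so this is an assumed minimal variant: n-gram overlap precision, recall and F1 for n = 1..3, which would then be fed to a classifier such as scikit-learn's svm.SVC:

```python
def ngram_overlap_features(s1, s2, max_n=3):
    """Knowledge-lean features for a sentence pair (token lists): precision,
    recall and F1 of n-gram overlap for n = 1..max_n, computed without any
    language-specific tools or external resources."""
    feats = []
    for n in range(1, max_n + 1):
        g1 = {tuple(s1[i:i + n]) for i in range(len(s1) - n + 1)}
        g2 = {tuple(s2[i:i + n]) for i in range(len(s2) - n + 1)}
        overlap = len(g1 & g2)
        p = overlap / len(g2) if g2 else 0.0
        r = overlap / len(g1) if g1 else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        feats.extend([p, r, f])
    return feats

s1 = "the cat sat on the mat".split()
s2 = "the cat is on the mat".split()
print(ngram_overlap_features(s1, s2))
```

Because the features operate on surface tokens only, the same extractor applies unchanged to colloquial Twitter text or to a resource-poor language such as Turkish.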
These techniques do not require any language-specific tools or external resources and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research and to experiment on another language that is poor in resources. The scarcity of available paraphrase data led us to construct our own corpus: a paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection, including a variety of texts. While there is still debate as to whether binary or fine-grained judgements best suit a paraphrase corpus, we chose to provide data for a sentential textual similarity task using fine-grained scoring, knowing that this could be converted to binary scoring, but not the other way around. The correlation between the results from different corpora is promising. Therefore, it can be surmised that languages poor in resources can benefit from knowledge-lean techniques.
The discovery of the strengths of knowledge-lean techniques led us to extend the work with a new perspective on techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research applies word2vec to larger fragments of text, such as phrases, sentences and even paragraphs, we present a new approach by introducing vectors of character n-grams that carry the same attributes as word vectors. The proposed method is able to capture syntactic as well as semantic relations without semantic knowledge, and is shown to be competitive on Twitter data compared to more sophisticated methods.
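The character n-gram decomposition underlying such vectors can be sketched as follows. The boundary-padding symbol and n-gram length are assumptions for illustration; the thesis does not fix them here:

```python
def char_ngrams(word, n=3, pad="#"):
    """Decompose a word into overlapping character n-grams (with boundary
    padding) so that a word2vec-style model can be trained over n-grams
    instead of whole words, capturing sub-word regularities such as
    shared stems and suffixes."""
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("walking"))
# ['#wa', 'wal', 'alk', 'lki', 'kin', 'ing', 'ng#']
```

Morphologically related words ("walking", "walked") share most of their n-grams, which is why such representations can recover syntactic relations without any semantic resources.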
Quantifying Cross-lingual Semantic Similarity for Natural Language Processing Applications
Translation and cross-lingual access to information are key technologies in a global economy. Even though the quality of machine translation (MT) output is still far from the level of human translations, many real-world applications have emerged for which MT can be employed. Machine translation supports human translators in computer-assisted translation (CAT), providing the opportunity to improve translation systems based on human interaction and feedback. Moreover, many tasks that involve natural language processing operate in a cross-lingual setting, where there is no need for perfectly fluent translations and the transfer of meaning can be modeled by employing MT technology. This thesis describes cumulative work in the field of cross-lingual natural language processing in a user-oriented setting. A common denominator of the presented approaches is their anchoring in an alignment between texts in two different languages to quantify the similarity of their content.