
46 research outputs found

Improving Statistical Machine Translation Using Comparable Corpora

Author: Snover Matthew Garvey
Publication date: 01/01/2010

With thousands of languages in the world, and the increasing speed and quantity of information being distributed across the world, automatic translation between languages by computers, Machine Translation (MT), has become an increasingly important area of research. State-of-the-art MT systems rely not upon hand-crafted translation rules written by human experts, but rather on learned statistical models that translate a source language to a target language. These models are typically generated from large, parallel corpora containing copies of text in both the source and target languages. The co-occurrence of words across languages in parallel corpora allows the creation of translation rules that specify the probability of translating words or phrases from one language to the other. Monolingual corpora, containing text only in one language--primarily the target language--are not used to model the translation process, but are used to better model the structure of the target language. Unlike parallel data, which require expensive human translators to generate, monolingual data are cheap and widely available. Similar topics and events to those in a source document that is being translated often occur in documents in a comparable monolingual corpus. In much the same way that a human translator would use world knowledge to aid translation, the MT system may be able to use these relevant documents from comparable corpora to guide translation by biasing the translation system to produce output more similar to the relevant documents. This thesis seeks to answer the following questions: (1) Is it possible to improve a modern, state-of-the-art translation system by biasing the MT output to be more similar to relevant passages from comparable monolingual text? (2) What level of similarity is necessary to exploit these techniques? (3) What is the nature of the relevant passages that are needed during the application of these techniques? 
To answer these questions, this thesis describes a method for generating new translation rules from monolingual data specifically targeted for the document that is being translated. Rule generation leverages the existing translation system and topical overlap between the foreign source text and the monolingual text, and unlike regular translation rule generation does not require parallel text. For each source document to be translated, potentially comparable documents are selected from the monolingual data using cross-lingual information retrieval. By biasing the MT system towards the selected relevant documents and then measuring the similarity of the biased output to the relevant documents using Translation Edit Rate Plus (TERp), it is possible to identify sub-sentential regions of the source and comparable documents that are possible translations of each other. This process results in the generation of new translation rules, where the source side is taken from the document to be translated and the target side is fluent target language text taken from the monolingual data. The use of these rules results in improvements over a state-of-the-art statistical translation system. These techniques are most effective when there is a high degree of similarity between the source and relevant passages--such as when they report on the same news stories--but some benefit, approximately half, can be achieved when the passages are only historically or topically related. The discovery of the feasibility of improving MT by using comparable passages to bias MT output provides a basis for future investigation on problems of this type. Ultimately, the goal is to provide a framework within which translation rules may be generated without additional parallel corpora, thus allowing researchers to test longstanding hypotheses about machine translation in the face of scarce parallel resources.

Digital Repository at the University of Maryland

An Unsupervised Knowledge Free Algorithm for the Learning of Morphology in Natural Languages - Master's Thesis, May 2002

Author: Snover Matthew G.
Publication venue: Washington University Open Scholarship
Publication date: 18/04/2002

This thesis describes an unsupervised system to learn natural language morphology, specifically suffix identification, from unannotated text. The system is language independent, so that it can learn the morphology of any human language. For English this means identifying “-s”, “-ing”, “-ed”, “-tion” and many other suffixes, in addition to learning which stems they attach to. The system uses no prior knowledge, such as part of speech tags, and learns the morphology by simply reading in a body of unannotated text. The system consists of a generative probabilistic model which is used to evaluate hypotheses, and a directed search and a hill-climbing search which are used in conjunction to find a highly probable hypothesis. Experiments applying the system to English and Polish are described.
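
To make the task in this abstract concrete, here is a minimal sketch of suffix discovery from an unannotated word list. It is not the thesis's generative model or its search procedure; it is a much simpler counting heuristic, assumed here for illustration only: an ending counts as a candidate suffix when removing it leaves a string that is itself an observed word, and it does so for at least `min_stems` different words. The function name and both parameters are hypothetical.

```python
from collections import defaultdict


def candidate_suffixes(words, max_len=4, min_stems=2):
    """Return {suffix: sorted stems} for endings that attach to several known words.

    Illustrative heuristic only (an assumption, not the thesis's method):
    a split stem+suffix is accepted when the stem also occurs as a word.
    """
    vocab = set(words)
    stems_by_suffix = defaultdict(set)
    for word in vocab:
        # Try every ending of length 1..max_len that leaves a non-empty stem.
        for k in range(1, max_len + 1):
            if len(word) > k:
                stem, suffix = word[:-k], word[-k:]
                if stem in vocab:
                    stems_by_suffix[suffix].add(stem)
    # Keep only endings seen with enough distinct stems.
    return {
        suffix: sorted(stems)
        for suffix, stems in stems_by_suffix.items()
        if len(stems) >= min_stems
    }


corpus = ["walk", "walks", "walked", "walking",
          "talk", "talks", "talked", "talking"]
suffixes = candidate_suffixes(corpus)
```

On this toy word list the heuristic recovers "s", "ed" and "ing", each attached to the stems "talk" and "walk". A probabilistic model of the kind the abstract describes would instead score whole hypotheses about stems and suffixes rather than rely on a fixed threshold.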

Washington University St. Louis: Open Scholarship

Training Autoregressive Speech Recognition Models with Limited in-domain Supervision

Author: Hartmann William
Keith Francis
Li Chak-Fai
Snover Matthew
Publication date: 26/10/2022

Advances in self-supervised learning have significantly reduced the amount of transcribed audio required for training. However, the majority of work in this area is focused on read speech. We explore limited supervision in the domain of conversational speech. While we assume the amount of in-domain data is limited, we augment the model with open source read speech data. The XLS-R model has been shown to perform well with limited adaptation data and serves as a strong baseline. We use untranscribed data for self-supervised learning and semi-supervised training in an autoregressive encoder-decoder model. We demonstrate that by using the XLS-R model for pseudotranscription, a much smaller autoregressive model can outperform a finetuned XLS-R model when transcribed in-domain data is limited, reducing WER by as much as 8% absolute. Comment: Submitted to IEEE ICASSP 202
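
The result in this abstract is stated in word error rate (WER). As a reminder of what that number measures, the sketch below computes WER in the standard way, as the word-level edit distance (substitutions, insertions, deletions) between a hypothesis and a reference, divided by the reference length. The function name and the whitespace tokenization are assumptions for illustration; the paper's own scoring setup is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer_sub = word_error_rate("a b c d", "a x c d")          # one substitution
wer_del = word_error_rate("the cat sat on the mat",
                          "the cat sat on mat")          # one deletion
```

An "absolute" reduction is a difference in WER points. With made-up numbers, going from 30% to 22% WER is an 8% absolute reduction, which is about 27% relative to the 30% starting point.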

arXiv.org e-Print Archive

Expected Dependency Pair Match: Predicting translation quality with expected syntactic structure

Author: Jeremy G Kahn
Mari Ostendorf
Matthew Snover
Publication date: 03/04/2020

Recent efforts aimed at improving over standard machine translation evaluation methods (BLEU, TER) have investigated mechanisms for accounting for allowable wording differences either in terms of syntactic structure or synonyms/paraphrases. This paper explores an approach for combining scores from partial syntactic dependency matches with standard local n-gram matches using a statistical parser, and taking advantage of parse probabilities in deriving expected scores based on the N-best parses for the hypothesized sentence translation. The new scoring metric, Expected Dependency Pair Match (EDPM), is shown to be superior to BLEU and TER in terms of correlation to human judgements and as a per-document and per-sentence predictor of HTER, using mean subtraction to account for document difficulty. Further, we explore the potential benefit of combining the n-gram and syntactic features of EDPM with the alternative wording features of TERp, with experiments showing that there is a benefit to accounting for syntactic structure on top of the semantic equivalency features.

CiteSeerX

Unsupervised Paraphrasing via Deep Reinforcement Learning

Author: Banerjee Satanjeev
Barzilay Regina
Dolan William B
Fader Anthony
Gupta Ankush
Heafield Kenneth
Hovy Eduard H
Knight Kevin
Papineni Kishore
Ross Stéphane
Silver David
Snover Matthew
Zhao Shiqi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/07/2020

Paraphrasing is expressing the meaning of an input sentence in different wording while maintaining fluency (i.e., grammatical and syntactical correctness). Most existing work on paraphrasing uses supervised models that are limited to specific domains (e.g., image captions). Such models can neither be straightforwardly transferred to other domains nor generalize well, and creating labeled training data for new domains is expensive and laborious. The need for paraphrasing across different domains and the scarcity of labeled training data in many such domains call for exploring unsupervised paraphrase generation methods. We propose Progressive Unsupervised Paraphrasing (PUP): a novel unsupervised paraphrase generation method based on deep reinforcement learning (DRL). PUP uses a variational autoencoder (trained using a non-parallel corpus) to generate a seed paraphrase that warm-starts the DRL model. Then, PUP progressively tunes the seed paraphrase guided by our novel reward function which combines semantic adequacy, language fluency, and expression diversity measures to quantify the quality of the generated paraphrases in each iteration without needing parallel sentences. Our extensive experimental evaluation shows that PUP outperforms unsupervised state-of-the-art paraphrasing techniques in terms of both automatic metrics and user studies on four real datasets. We also show that PUP outperforms domain-adapted supervised algorithms on several datasets. Our evaluation also shows that PUP achieves a great trade-off between semantic similarity and diversity of expression.

arXiv.org e-Print Archive

Crossref

Serrated Lesions of the Colorectum: Review and Recommendations From an Expert Panel

Author: Ahnen Dennis J
Baron John A
Batts Kenneth P
Burke Carol A
Burt Randall W
Church James
Goldblum John R
Guillem José G
Kahi Charles J
Kalady Matthew F
Odze Robert D
Ogino Shuji
O'Brien Michael J
Parry Susan
Rex Douglas K
Snover Dale C
Torlakovic Emina Emilia
Wise Paul E
Young Joanne
Publication date: 01/01/2012

Serrated lesions of the colorectum are the precursors of perhaps one-third of colorectal cancers. Cancers arising in serrated lesions are usually in the proximal colon, and account for a disproportionate fraction of cancer identified after colonoscopy.

PubMed Central

Carolina Digital Repository

Taking MT evaluation metrics to extremes: beyond correlation with human judgments

Author: Bahdanau Dzmitry
Baig Taimur
Banerjee Satanjeev
Berentsen Geir Drage
Bojar Ondřej
Callison-Burch Chris
Coughlin Deborah
Culy Christopher
Denkowski Michael
Fomicheva Marina
Giménez Jesús
Graham Yvette
Hjort Nils Lid
Junczys-Dowmunt Marcin
Levene Howard
Liu Ding
Moore Robert C.
Nießen Sonja
Papineni Kishore
Snover Matthew
Specia Lucia
Sutskever Ilya
Tillmann Christoph
Williams Evan James
Publication venue: 'MIT Press - Journals'
Publication date: 12/06/2019

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.

Crossref

Spiral - Imperial College Digital Repository

White Rose Research Online