101 research outputs found
An Open Source Toolkit for Word-level Confidence Estimation in Machine Translation
Recently, a growing need for Confidence Estimation (CE) for Statistical Machine Translation (SMT) systems in Computer Aided Translation (CAT) has been observed. However, most CE toolkits are optimized for a single target language (mainly English) and, as far as we know, none is dedicated to this specific task and freely available. This paper presents an open-source toolkit for predicting the quality of the words in an SMT output, whose novel contributions are (i) support for various target languages, and (ii) the ability to handle features of different types (system-based, lexical, syntactic and semantic). In addition, the toolkit integrates a wide variety of Natural Language Processing and Machine Learning tools to pre-process data, extract features and estimate confidence at word level. Features for Word-level Confidence Estimation (WCE) can easily be added or removed via a configuration file. We validate the toolkit in the WCE evaluation framework of the WMT shared task on two language pairs: French-English and English-Spanish. The toolkit is made available to the research community with ready-made scripts to launch full experiments on these language pairs, achieving state-of-the-art and reproducible performance.
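The configurable, per-word feature extraction described above can be sketched as follows. This is a minimal illustration, not the toolkit's actual interface: the feature names, the `CONFIG` dictionary standing in for the configuration file, and the `posterior` argument are all assumptions for the example.

```python
# Illustrative sketch of word-level feature extraction for WCE.
# CONFIG mimics the toolkit's feature-toggle configuration file;
# all names here are hypothetical.
CONFIG = {"lexical": True, "system": True, "syntactic": False}

def extract_features(word, index, sentence, posterior):
    """Build a feature dict for one target word; which feature
    groups are emitted is controlled by CONFIG."""
    feats = {}
    if CONFIG["lexical"]:
        feats["word_length"] = len(word)
        feats["is_numeric"] = word.isdigit()
        feats["prev_word"] = sentence[index - 1] if index > 0 else "<s>"
    if CONFIG["system"]:
        # system-based feature: the decoder's posterior probability
        feats["posterior"] = posterior
    return feats

sentence = ["the", "cat", "sat"]
feats = extract_features("cat", 1, sentence, posterior=0.82)
```

A binary classifier trained on such feature dictionaries would then label each output word as correct or incorrect.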
Coherence in Machine Translation
Coherence ensures individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in another language. In Machine Translation, however, for reasons of modeling and computational complexity, sentences are pieced together from words or phrases based on short context windows and with no access to extra-sentential context.
In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as evidenced via discourse relations.
For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. To address these issues, I adapt a state-of-the-art syntax model, which also results in improved performance on the monolingual task. The results clearly indicate how much more difficult the new task is than the task of detecting shuffled texts. I propose a new coherence model that explores the cross-lingual transfer of discourse relations in machine translation. This model is novel in that it measures the correctness of a discourse relation by comparison to the source text rather than to a reference translation. I identify patterns of incoherence common across different language pairs, and create a corpus of machine-translated output annotated with coherence errors for evaluation purposes. I then examine lexical coherence in a multilingual context, as a preliminary study for cross-lingual transfer. Finally, I determine how the new and adapted models correlate with human judgements of translation quality, and suggest that general evaluation within machine translation would benefit from a coherence component that evaluates the translation output with respect to the source text.
SemEval-2017 Task 1: semantic textual similarity - multilingual and cross-lingual focused evaluation
Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state of the art. The 2017 task focuses on multilingual and cross-lingual pairs, with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well-performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).
Findings of the 2017 Conference on Machine Translation
This paper presents the results of the WMT17 shared tasks, which included three machine translation (MT) tasks (news, biomedical, and multimodal), two evaluation tasks (metrics and run-time estimation of MT quality), an automatic post-editing task, a neural MT training task, and a bandit learning task.
Examining lexical coherence in a multilingual setting
This paper presents a preliminary study of lexical coherence and cohesion in the context of multiple languages.
We explore two entity-based frameworks in a multilingual setting in an attempt to understand how lexical coherence is realised across different languages. These frameworks (an entity-grid model and an entity-graph metric) have previously been used for assessing coherence in a monolingual setting. We apply them to a multilingual setting for the first time, assessing whether entity-based coherence frameworks could help ensure lexical coherence in a Machine Translation context.
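The intuition behind an entity-graph metric like the one mentioned above can be sketched as follows: sentences are linked when they share an entity, and coherence is estimated from the density of those links. This is a toy illustration under simplifying assumptions, not the metric's actual formulation; in particular, each sentence is represented here as a ready-made set of entity mentions rather than the output of an entity extractor.

```python
# Toy sketch of an entity-graph-style coherence score: link sentence
# pairs that share an entity, normalise by the number of pairs.
def entity_graph_score(sentences):
    """sentences: list of sets of entity mentions (illustrative)."""
    n = len(sentences)
    if n < 2:
        return 0.0
    edges = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            if sentences[i] & sentences[j]:  # shared entity links the pair
                edges += 1
    # fraction of sentence pairs that are connected
    return edges / (n * (n - 1) / 2)

doc = [{"court", "ruling"}, {"ruling", "judge"}, {"weather"}]
score = entity_graph_score(doc)
```

Applied across source and target documents, such a score would let one compare how entity-based coherence carries over under translation.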
Better Evaluation of ASR in Speech Translation Context Using Word Embeddings
This paper investigates the evaluation of ASR in the spoken language translation context. More precisely, we propose a simple extension of the WER metric that penalizes substitution errors differently according to their context, using word embeddings. For instance, the proposed metric should catch near matches (mainly morphological variants) and penalize this kind of error less, since it has a more limited impact on translation performance. Our experiments show that the new metric correlates better with SLT performance than WER does. Oracle experiments are also conducted and show the ability of our metric to find better hypotheses (to be translated) in the ASR N-best list. Finally, a preliminary experiment where ASR tuning is based on our new metric shows encouraging results. For reproducible experiments, the code implementing our modified WER and the corpora used are made available to the research community.
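The core idea of an embedding-weighted WER can be sketched as follows: standard Levenshtein alignment, but with a substitution cost that shrinks when the reference and hypothesis words have similar embeddings. This is a hedged sketch of the general idea, not the paper's released code; the toy embedding table `TOY_VECS` stands in for real pretrained word embeddings.

```python
import math

# Toy embeddings standing in for pretrained word vectors (assumption).
TOY_VECS = {
    "run": (1.0, 0.1), "running": (0.95, 0.2),
    "walk": (0.1, 1.0), "the": (0.5, 0.5),
}

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def sub_cost(a, b):
    """Substitution cost in [0, 1]: near matches cost less."""
    if a == b:
        return 0.0
    if a in TOY_VECS and b in TOY_VECS:
        return 1.0 - max(0.0, cos(TOY_VECS[a], TOY_VECS[b]))
    return 1.0  # out-of-vocabulary pairs get the full penalty

def soft_wer(ref, hyp):
    """Levenshtein distance with embedding-weighted substitutions,
    normalised by reference length."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)
    for j in range(m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,      # deletion
                d[i][j - 1] + 1.0,      # insertion
                d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
            )
    return d[n][m] / max(n, 1)

score = soft_wer(["the", "run"], ["the", "running"])
```

Here the substitution "run" vs. "running" is a morphological near match, so `score` stays far below the 0.5 that plain WER would assign to the same pair.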
Word Confidence Estimation for Machine Translation
Document-Level Machine Translation Quality Estimation
Assessing Machine Translation (MT) quality at the document level is a challenge, as metrics need to account for many linguistic phenomena at different levels. Large units of text encompass different linguistic phenomena and, as a consequence, a machine-translated document can have different problems. It is hard for humans to evaluate documents with regard to document-wide phenomena (e.g. coherence) as they are easily distracted by problems at other levels (e.g. grammar). Although standard automatic evaluation metrics (e.g. BLEU) are often used for this purpose, they focus on n-gram matches and often disregard document-wide information. Therefore, although such metrics are useful for comparing different MT systems, they may not reflect nuances of quality in individual documents.
Machine-translated documents can also be evaluated according to the task they will be used for. Methods based on measuring the distance between machine translations and post-edited machine translations are widely used for task-based purposes. Another task-based method is to use reading comprehension questions about the machine-translated document as a proxy for its quality. Quality Estimation (QE) is an evaluation approach that attempts to predict the quality of MT output using trained Machine Learning (ML) models. This approach is flexible in that it can build QE models from any type of quality assessment. Thus far, for document-level QE, BLEU-style metrics have been used as quality labels, leading to unreliable predictions, as document information is neglected. Challenges of document-level QE encompass the choice of adequate labels, the use of appropriate features and the study of appropriate ML models.
In this thesis we focus on feature engineering, the design of quality labels and the use of ML methods for document-level QE. Our new features can be classified as document-wide (using shallow document information), discourse-aware (using information about discourse structures) and consensus-based (using other machine translations as pseudo-references). New labels are proposed in order to overcome the lack of reliable labels for document-level QE. Two different approaches are proposed: one aimed at MT for assimilation, with a low quality requirement, and another aimed at MT for dissemination, with a high quality requirement. The assimilation labels use reading comprehension questions as a proxy for document quality. The dissemination approach uses a two-stage post-editing method to derive the quality labels. Different ML techniques are also explored for the document-level QE task, including the appropriate use of regression or classification and the study of kernel combination to deal with features of different natures (e.g. handcrafted features versus consensus features). We show that, in general, QE models predicting our new labels and using our discourse-aware features are more successful than models predicting automatic evaluation metrics. Regarding ML techniques, no firm conclusions could be drawn, given that different models performed similarly throughout the experiments.
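A consensus-based feature of the kind mentioned above can be sketched as follows: the overlap between a system's translation and pseudo-references produced by other MT systems. This is a simplified illustration, not the thesis's actual feature set; the bigram-overlap formulation and the function names are assumptions for the example.

```python
# Illustrative sketch of a consensus-based QE feature: mean n-gram
# overlap between the hypothesis and pseudo-references (other MT
# systems' outputs for the same source).
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def consensus_feature(hyp, pseudo_refs, n=2):
    """Mean fraction of hyp's n-grams found in each pseudo-reference."""
    h = ngrams(hyp, n)
    if not h or not pseudo_refs:
        return 0.0
    scores = [len(h & ngrams(ref, n)) / len(h) for ref in pseudo_refs]
    return sum(scores) / len(scores)

hyp = "the cat sat on the mat".split()
refs = ["the cat sat on a mat".split(), "a cat is on the mat".split()]
feat = consensus_feature(hyp, refs)
```

The intuition is that when several independent systems agree with the hypothesis, it is more likely to be of acceptable quality; the resulting score is one input feature for the trained QE model.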
Self-Supervised and Controlled Multi-Document Opinion Summarization
We address the problem of unsupervised abstractive summarization of collections of user-generated reviews with self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on a standard log-likelihood loss. We address the problem of hallucinations through the use of control codes, to steer the generation towards more coherent and relevant summaries. Finally, we extend the Transformer architecture to allow for multiple reviews as input. Our benchmarks on two datasets against graph-based and recent neural abstractive unsupervised models show that our proposed method generates summaries of superior quality and relevance. This is confirmed by our human evaluation, which focuses explicitly on the faithfulness of generated summaries. We also provide an ablation study, which shows the importance of the control setup in controlling hallucinations and achieving high sentiment and topic alignment of the summaries with the input reviews.