Predicting sentence translation quality using extrinsic and language independent features
We develop a top-performing model for automatic, accurate, and language independent prediction of sentence-level statistical machine translation (SMT) quality with or without looking at the translation outputs.
We derive various feature functions measuring the closeness of a given test sentence to the training data and
the difficulty of translating the sentence.
We describe \texttt{mono} feature functions that are based on statistics of only one side of the parallel
training corpora and \texttt{duo} feature functions that incorporate statistics involving both source and
target sides of the training data.
Overall, we describe novel, language independent, and SMT-system-extrinsic features for predicting SMT performance, which also rank high in feature ranking evaluations.
We experiment with different learning settings, with or without looking at the translations, which helps differentiate the contributions of different feature sets.
We apply partial least squares and feature subset selection, both of which improve the results, and we present a ranking of the top features selected for each learning setting, providing an exhaustive analysis of the extrinsic features used.
We show that by looking only at the test source sentences, without using the translation outputs at all, we can
achieve better performance than a baseline system that uses features dependent on the SMT models that generated the
translations.
Furthermore, when also looking at the translation outputs, our prediction system achieves the 2nd best performance
overall according to the official results of the Quality Estimation Task (QET) challenge.
Our representation and features achieve the top performance in QET among the models using the SVR learning model.
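To ground the setup, below is a minimal, hypothetical sketch of sentence-level QE in this spirit: two illustrative mono-style source-side features (sentence length and bigram coverage of the training corpus) feeding an SVR regressor. The feature set, toy data, and scores are stand-ins rather than the paper's actual features.

# A minimal sketch (not the authors' implementation) of sentence-level QE
# with SVR, using two illustrative extrinsic, translation-independent
# features computed from the source sentence and the training corpus.
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def mono_features(src_sentence, train_sentences):
    """'mono'-style features: statistics of the source side only (hypothetical set)."""
    toks = src_sentence.split()
    train_bigrams = set()
    for s in train_sentences:
        train_bigrams |= ngrams(s.split(), 2)
    seen = ngrams(toks, 2)
    coverage = len(seen & train_bigrams) / max(len(seen), 1)  # closeness to training data
    return [len(toks), coverage]

# toy data: source sentences with gold quality scores (e.g., HTER-like labels)
train_src = ["the cat sat on the mat", "a dog barked loudly"]
X = [mono_features(s, train_src) for s in ["the cat barked", "quantum flux capacitors"]]
y = [0.2, 0.8]  # lower = better translation quality

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X, y)
print(model.predict([mono_features("the dog sat", train_src)]))

With real QET data, X would hold the full feature vectors for each source sentence (and optionally its translation) and y the gold quality labels.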
Machine translation evaluation resources and methods: a survey
We introduce a survey of Machine Translation (MT) evaluation covering both manual and automatic evaluation methods. Traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. Advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria. We classify automatic evaluation methods into two categories: the lexical similarity scenario and the application of linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic features and semantic features. The syntactic features include part-of-speech tags, phrase types, and sentence structures, while the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Deep learning models for evaluation have been proposed only very recently. Subsequently, we introduce methods for evaluating the MT evaluation measures themselves, including different correlation scores, as well as the recent quality estimation (QE) tasks for MT.
This paper differs from existing work \cite{GALEprogram2009, EuroMatrixProject2007} in several aspects: it introduces recent developments in MT evaluation measures, a classification spanning manual to automatic evaluation measures, an introduction to the recent QE tasks for MT, and a concise organization of the content.
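As a concrete illustration of the lexical similarity scenario surveyed above, the sketch below computes clipped unigram precision, recall, and F-measure against a single reference. This is a toy instance of the metric family, not any specific published metric; measures such as METEOR add stemming, synonyms, and word-order penalties on top.

# A minimal sketch of the lexical-similarity family of automatic metrics:
# clipped unigram precision, recall, and F-measure against one reference.
from collections import Counter

def precision_recall_f1(hypothesis, reference):
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())        # clipped unigram matches
    p = overlap / max(sum(hyp.values()), 1)    # precision
    r = overlap / max(sum(ref.values()), 1)    # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1("the cat sat on mat", "the cat sat on the mat"))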
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures adopted in which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them.
Comment: Published in the Journal of AI Research (JAIR), volume 61, pp. 75-170. 118 pages, 8 figures, 1 table.
What do Neural Machine Translation Models Learn about Morphology?
Neural machine translation (MT) models obtain state-of-the-art performance
while maintaining a simple, end-to-end architecture. However, little is known
about what these models learn about source and target languages during the
training process. In this work, we analyze the representations learned by
neural MT models at various levels of granularity and empirically evaluate the
quality of the representations for learning morphology through extrinsic
part-of-speech and morphological tagging tasks. We conduct a thorough
investigation along several parameters: word-based vs. character-based
representations, depth of the encoding layer, the identity of the target
language, and encoder vs. decoder representations. Our data-driven,
quantitative evaluation sheds light on important aspects of the neural MT
system and its ability to capture word structure.
Comment: Updated decoder experiments.
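The extrinsic-probing methodology can be summarized in a short sketch: freeze the trained NMT encoder, extract per-token hidden states, and train a simple classifier to predict part-of-speech or morphological tags from them, using the probe's accuracy as a measure of what the representations encode. In this illustrative version, random vectors stand in for real encoder states, so the probe scores near chance; with genuine representations the same code quantifies their morphological content.

# A minimal sketch of extrinsic probing: classify POS tags from frozen
# per-token representations. Random vectors are stand-ins for encoder states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_tokens, dim, num_tags = 1000, 64, 12
H = rng.normal(size=(num_tokens, dim))             # stand-in for encoder states
tags = rng.integers(0, num_tags, size=num_tokens)  # stand-in for gold POS tags

Xtr, Xte, ytr, yte = train_test_split(H, tags, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("probe accuracy:", probe.score(Xte, yte))  # ~chance here; informative with real states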
What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models
Despite the remarkable evolution of deep neural networks in natural language
processing (NLP), their interpretability remains a challenge. Previous work
largely focused on what these models learn at the representation level. We
break this analysis down further and study individual dimensions (neurons) in
the vector representation learned by end-to-end neural models in NLP tasks. We
propose two methods: Linguistic Correlation Analysis, based on a supervised
method to extract the most relevant neurons with respect to an extrinsic task,
and Cross-model Correlation Analysis, an unsupervised method to extract salient
neurons w.r.t. the model itself. We evaluate the effectiveness of our
techniques by ablating the identified neurons and reevaluating the network's
performance for two tasks: neural machine translation (NMT) and neural language
modeling (NLM). We further present a comprehensive analysis of neurons with the
aim to address the following questions: i) how localized or distributed are
different linguistic properties in the models? ii) are certain neurons
exclusive to some properties and not others? iii) is the information more or
less distributed in NMT vs. NLM? and iv) how important are the neurons
identified through the linguistic correlation method to the overall task? Our
code is publicly available as part of the NeuroX toolkit (Dalvi et al. 2019).
Comment: AAAI 2019, 10 pages, AAAI Conference on Artificial Intelligence (AAAI 2019).
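The ablation protocol can be pictured with a small sketch: rank neurons (dimensions) by a relevance score, zero out the top-k in the representation, and measure how much a downstream probe degrades. The relevance proxy and toy data below are illustrative stand-ins, not the paper's Linguistic Correlation Analysis.

# A minimal sketch of ablate-and-reevaluate: rank neurons by a simple
# relevance proxy, zero out the top-k, and compare probe accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 32))
y = (X[:, 3] + 0.1 * rng.normal(size=500) > 0).astype(int)  # property lives in neuron 3

clf = LogisticRegression(max_iter=1000).fit(X, y)
relevance = np.abs(clf.coef_[0])                 # weight magnitude as a simple proxy
top_k = np.argsort(relevance)[::-1][:4]          # most relevant neurons

X_ablated = X.copy()
X_ablated[:, top_k] = 0.0                        # ablate identified neurons
print("full:", clf.score(X, y), "ablated:", clf.score(X_ablated, y))

Since the synthetic label depends on neuron 3, ablating the top-ranked neurons visibly hurts the probe, mirroring the paper's test of how localized a linguistic property is.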
Referential translation machines for predicting translation quality
We use referential translation machines (RTM) for quality estimation of translation outputs. RTMs are a computational model for identifying the translation acts between any two data sets with respect to interpretants selected in the same domain, which are effective when making monolingual and bilingual similarity judgments. RTMs achieve top performance in automatic, accurate, and language independent prediction of sentence-level and word-level statistical machine translation (SMT) quality. RTMs remove the need to access any SMT system specific information or prior knowledge of the training data or models used when generating the translations and achieve the top performance in the WMT13 quality estimation task (QET13). We improve our RTM models with the Parallel FDA5 instance selection model, with
additional features for predicting the translation performance, and with improved learning models.
We develop RTM models for each WMT14 QET (QET14) subtask, obtain improvements over QET13 results, and rank 1st in all of the tasks and subtasks of QET14.
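To give a flavor of instance selection for such models, the sketch below implements a simplified feature-decay selection in the spirit of FDA5/ParFDA: training sentences are scored by the test-set n-grams they cover, and a feature's weight decays each time it is covered so that selection favors diverse coverage. The scoring and decay details here are simplified assumptions, not the actual FDA5 algorithm.

# A minimal sketch of feature-decay instance selection (simplified guess,
# not the actual FDA5 algorithm): greedily pick training sentences that
# cover test-set bigrams, decaying each covered feature's weight.
def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fda_select(train, test_sentences, k=2, decay=0.5):
    weight = {}
    for s in test_sentences:
        for g in ngrams(s.split()):
            weight[g] = 1.0
    selected, pool = [], list(train)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda s: sum(weight.get(g, 0.0) for g in ngrams(s.split())))
        for g in ngrams(best.split()):
            if g in weight:
                weight[g] *= decay     # decay covered features
        selected.append(best)
        pool.remove(best)
    return selected

train = ["the cat sat down", "the cat ran", "dogs bark at night"]
print(fda_select(train, ["the cat sat on the mat"]))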
Towards Optimizing MT for Post-Editing Effort: Can BLEU Still Be Useful?
We propose a simple, linear-combination automatic evaluation measure (AEM) to approximate post-editing (PE) effort. Effort is measured both as PE time and as the number of PE operations performed. The ultimate goal is to define an AEM that can be used to optimize machine translation (MT) systems to minimize PE effort without having to perform infeasible repeated PE during optimization. As PE effort is expected to be an extensive magnitude (i.e., one growing linearly with sentence length, which may simply be added up to represent the effort for a set of sentences), we use a linear combination of extensive and pseudo-extensive features. One such pseudo-extensive feature, (1 - BLEU) times the length of the reference, proves to be almost as good a predictor of PE effort as the best combination of extensive features. Surprisingly, effort predictors computed using independently obtained reference translations perform reasonably close to those using actual post-edited references. In this early stage of the research, and given the inherent complexity of carrying out experiments with professional post-editors, we carried out an automatic evaluation of the proposed AEMs rather than a manual evaluation measuring the effort needed to post-edit the output of an MT system tuned on these AEMs. The results obtained seem to support current tuning practice using BLEU, while pointing at some limitations. Apart from this intrinsic evaluation, an extrinsic evaluation was also carried out in which the proposed AEMs were used to build synthetic training corpora for MT quality estimation, with results comparable to those obtained when training with measured PE effort.
Work supported by the Spanish government through project EFFORTUNE (TIN2015-69632-R) and through grant PRX16/00043 for Mikel L. Forcada, and by the European Commission through the QT21 project (H2020 No. 645452).
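The pseudo-extensive feature discussed above is easy to make concrete: (1 - BLEU) multiplied by the reference length grows with both sentence length and translation error, so it can be combined linearly with extensive features to predict PE effort. The sketch below uses NLTK's sentence-level BLEU; the feature set, toy sentence pairs, and PE times are illustrative, not the paper's experimental setup.

# A minimal sketch of a linear-combination AEM for PE effort using the
# pseudo-extensive feature (1 - BLEU) * reference length. Toy data only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.linear_model import LinearRegression

def features(hyp, ref):
    bleu = sentence_bleu([ref.split()], hyp.split(),
                         smoothing_function=SmoothingFunction().method1)
    ref_len = len(ref.split())
    return [(1.0 - bleu) * ref_len,   # pseudo-extensive: grows with length and error
            len(hyp.split())]         # extensive: hypothesis length

pairs = [("the cat sat", "the cat sat on the mat"),
         ("a dog", "a dog barked loudly")]
pe_time = [12.0, 9.5]  # toy post-editing times (seconds)

model = LinearRegression().fit([features(h, r) for h, r in pairs], pe_time)
print(model.predict([features("the cat", "the cat sat")]))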
Referential translation machines for predicting translation quality and related statistics
We use referential translation machines (RTMs) for predicting translation performance. RTMs pioneer a language independent approach to all similarity tasks and remove the need to access any task or domain specific information or resource. We improve our RTM models with the ParFDA instance selection model (Bicici et al., 2015), with additional features for predicting the translation performance, and with improved learning models. We develop RTM models for each WMT15 QET (QET15) subtask and obtain improvements over QET14 results.
RTMs achieve top performance in QET15, ranking 1st in the document- and sentence-level prediction tasks and 2nd in the word-level prediction task.