2,605 research outputs found
Learning labelled dependencies in machine translation evaluation
Recently, novel MT evaluation metrics have been presented which go beyond pure string matching and which correlate
better with human judgements than other existing metrics. Other research in this area has presented machine learning
methods which learn directly from human judgements. In this paper, we present a novel combination of dependency- and
machine learning-based approaches to automatic MT evaluation, and demonstrate higher correlations with human judgement than existing state-of-the-art methods.
In addition, we examine the extent to which our novel method can be generalised across different tasks and domains.
An algorithm for cross-lingual sense-clustering tested in an MT evaluation setting
Unsupervised sense induction methods offer a solution to the
problem of scarcity of semantic resources. These methods
automatically extract semantic information from textual data
and create resources adapted to specific applications and domains of interest. In this paper, we present a clustering algorithm for cross-lingual sense induction which generates
bilingual semantic inventories from parallel corpora. We describe the clustering procedure and the obtained resources. We then proceed to a large-scale evaluation by integrating the resources into a Machine Translation (MT) metric (METEOR). We show that the use of the data-driven sense-cluster inventories leads to better correlation with human judgments of translation quality, compared to precision-based metrics, and to improvements similar to those obtained when a handcrafted semantic resource is used.
Improving the objective function in minimum error rate training
In Minimum Error Rate Training (MERT), the parameters of an SMT system are tuned on a certain evaluation metric to improve translation quality. In this paper, we present empirical results showing that parameters tuned on one metric (e.g. BLEU) may not lead to optimal scores on that same metric. The score can be improved significantly by tuning on an entirely different metric (e.g. METEOR: 0.82 BLEU points, or 3.38% relative improvement, on the WMT08 English–French dataset). We analyse the impact of the choice of objective function in MERT and further propose three strategies for combining different metrics to reduce the bias of any single metric, obtaining parameters that receive better scores (0.99 BLEU points, or 4.08% relative improvement) on evaluation metrics than those tuned on the standalone metric itself.
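One way to picture the metric-combination idea is a weighted vote over normalized metric scores when selecting among candidate parameter sets. The candidate names and score values below are invented for illustration; this is a minimal sketch, not one of the paper's actual three strategies:

```python
# Hypothetical dev-set scores for three candidate parameter sets
# (numbers are made up for illustration).
candidates = {
    "params_A": {"bleu": 0.281, "meteor": 0.493},
    "params_B": {"bleu": 0.274, "meteor": 0.512},
    "params_C": {"bleu": 0.279, "meteor": 0.505},
}

def normalize(scores):
    # Min-max normalize so metrics on different scales are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

bleu_n = normalize({k: v["bleu"] for k, v in candidates.items()})
meteor_n = normalize({k: v["meteor"] for k, v in candidates.items()})

# Uniform linear combination of the two normalized metrics reduces the
# bias of relying on either metric alone.
combined = {k: 0.5 * bleu_n[k] + 0.5 * meteor_n[k] for k in candidates}
best = max(combined, key=combined.get)
```

Here `params_A` wins on BLEU and `params_B` on METEOR, but the balanced candidate `params_C` wins the combined score, which is the kind of single-metric bias the paper aims to reduce.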
Capturing lexical variation in MT evaluation using automatically built sense-cluster inventories
The strict character of most existing Machine Translation (MT) evaluation metrics does not permit them to capture lexical variation in translation. However, a central
issue in MT evaluation is the high correlation that the metrics should have with human judgments of translation quality. In order to achieve a higher correlation, the identification of sense correspondences between the compared translations becomes crucial. Given
that most metrics look for exact correspondences, the evaluation results are often misleading concerning translation quality. Moreover, existing metrics do not permit one to make a conclusive estimation of the impact of Word Sense Disambiguation techniques on
MT systems. In this paper, we show how information acquired by an unsupervised semantic analysis method can be used to render MT evaluation more sensitive to lexical semantics. The sense inventories built by this data-driven method are incorporated into METEOR: they replace WordNet for evaluation in English and render METEOR’s synonymy module operable in French. The evaluation results demonstrate that the use of these inventories gives rise to an increase in the number of matches and the correlation with human judgments of translation quality, compared to precision-based metrics.
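A minimal sketch of how a sense-cluster inventory can loosen exact token matching in a METEOR-style matcher. The clusters and tokens below are invented for illustration; this is not METEOR's actual matching pipeline (which also handles stemming and alignment):

```python
# Toy sense-cluster inventory (invented): each set groups words that a
# data-driven induction method might place in one cluster.
clusters = [
    {"buy", "purchase", "acquire"},
    {"car", "automobile"},
]

def same_cluster(a, b):
    # Two distinct words "match" if some cluster contains both.
    return any(a in c and b in c for c in clusters)

def match_count(hyp_tokens, ref_tokens):
    # Greedy one-to-one matching: exact match or shared sense cluster.
    matched = 0
    remaining = list(ref_tokens)
    for h in hyp_tokens:
        for i, r in enumerate(remaining):
            if h == r or same_cluster(h, r):
                matched += 1
                del remaining[i]
                break
    return matched

# "purchase"/"buy" and "car"/"automobile" differ as strings but match
# via the cluster inventory, so all three tokens align.
n = match_count(["purchase", "a", "car"], ["buy", "a", "automobile"])
```

With exact matching only, this pair would score a single match ("a"); the cluster inventory recovers the two lexical variants, which is the effect the paper measures at scale.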
Sample Complexity of Sample Average Approximation for Conditional Stochastic Optimization
In this paper, we study a class of stochastic optimization problems, referred to as \emph{Conditional Stochastic Optimization} (CSO), of the form $\min_{x \in \mathcal{X}} \mathbb{E}_{\xi} f_\xi\big(\mathbb{E}_{\eta|\xi}[g_\eta(x,\xi)]\big)$, which finds a wide spectrum of applications including portfolio selection, reinforcement learning, robust learning, causal inference, and so on. Assuming availability of samples from the distribution $\mathbb{P}(\xi)$ and samples from the conditional distribution $\mathbb{P}(\eta|\xi)$, we establish the sample complexity of the sample average approximation (SAA) for CSO under a variety of structural assumptions, such as Lipschitz continuity, smoothness, and error bound conditions. We show that the total sample complexity improves from $\mathcal{O}(d/\epsilon^4)$ to $\mathcal{O}(d/\epsilon^3)$ when assuming smoothness of the outer function, and further to $\mathcal{O}(1/\epsilon^2)$ when the empirical function satisfies the quadratic growth condition. We also establish the sample complexity of a modified SAA when $\xi$ and $\eta$ are independent. Several numerical experiments further support our theoretical findings.
Keywords: stochastic optimization, sample average approximation, large deviations theory
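The nested SAA estimator can be sketched on a toy instance. The choice of $f$, $g$, and the distributions below is illustrative only (a quadratic outer function whose true minimizer is $\mathbb{E}[\xi]$), not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def saa_objective(x, etas):
    # Inner SAA: estimate E_{eta|xi}[g_eta(x, xi)] with m conditional
    # samples per outer sample, here g_eta(x, xi) = x - eta.
    inner = (x - etas).mean(axis=1)
    # Outer SAA: apply f(y) = y^2 and average over the n outer samples.
    return (inner ** 2).mean()

n, m = 200, 50                                      # outer / inner sample sizes
xis = rng.normal(1.0, 1.0, size=n)                  # samples xi ~ P(xi)
etas = rng.normal(xis[:, None], 1.0, size=(n, m))   # samples eta | xi ~ N(xi, 1)

# Minimize the empirical CSO objective over a grid (toy 1-D feasible set).
xs = np.linspace(-2.0, 4.0, 601)
vals = [saa_objective(x, etas) for x in xs]
x_hat = xs[int(np.argmin(vals))]
# With f(y) = y^2 and eta | xi ~ N(xi, 1), the population minimizer is
# E[xi] = 1.0, so x_hat should land near 1 as n and m grow.
```

Note the composition: the inner expectation must be estimated separately for each outer sample, which is exactly why the total sample complexity scales worse than in ordinary stochastic optimization and why the smoothness and quadratic-growth conditions in the abstract help.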
The integration of machine translation and translation memory
We design and evaluate several models for integrating Machine Translation (MT) output into a Translation Memory (TM) environment to facilitate the adoption of MT technology
in the localization industry.
We begin with the integration on the segment level via translation recommendation and translation reranking. Given an input to be translated, our translation recommendation
model compares the output from the MT and TM systems, and presents the better one to the post-editor. Our translation reranking model combines k-best lists from both systems,
and generates a new list according to estimated post-editing effort. We perform both automatic and human evaluation on these models. When measured against the consensus of
human judgement, the recommendation model obtains 0.91 precision at 0.93 recall, and the reranking model obtains 0.86 precision at 0.59 recall. The high precision of these models indicates that they can be integrated into TM environments without the risk of deteriorating the quality of the post-editing candidate, and can thereby preserve TM assets and established cost estimation methods associated with TMs.
We then explore methods for a deeper integration of translation memory and machine translation on the sub-segment level. We predict whether phrase pairs derived from fuzzy matches could be used to constrain the translation of an input segment. Using a series of novel linguistically-motivated features, our constraints lead both to more consistent translation output, and to improved translation quality, reflected by a 1.2 improvement in BLEU score and a 0.72 reduction in TER score, both of statistical significance (p < 0.01).
In sum, we present our work in three aspects: 1) translation recommendation and translation reranking models that provide access to high-quality MT outputs in the TM environment, 2)
a sub-segment translation memory and machine translation integration model that improves both translation consistency and translation quality, and 3) a human evaluation pipeline to validate the effectiveness of our models with human judgements.
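The segment-level recommendation step can be pictured as a precision-oriented threshold rule: only show the post-editor the MT output when its estimated quality beats the TM fuzzy match by a safety margin. The function name, scores, and margin below are hypothetical, not the thesis's actual quality-estimation features or classifier:

```python
def recommend(mt_score, tm_score, margin=0.1):
    """Hypothetical recommender: present the MT output only when its
    estimated quality exceeds the TM fuzzy match by a safety margin;
    otherwise keep the TM match, preserving existing TM assets."""
    return "MT" if mt_score - tm_score > margin else "TM"

choice_confident = recommend(0.85, 0.60)   # MT clearly better -> "MT"
choice_cautious = recommend(0.70, 0.65)    # margin too small -> "TM"
```

Biasing the decision toward the TM in borderline cases trades recall for precision, which matches the reported operating point (0.91 precision at 0.93 recall): a false "MT" recommendation risks degrading the post-editing candidate, while a false "TM" recommendation merely forgoes a small saving.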