
    TermEval: an automatic metric for evaluating terminology translation in MT

    Terminology translation plays a crucial role in domain-specific machine translation (MT). Preserving domain knowledge from source to target is arguably the factor of greatest concern to customers in the translation industry, especially in critical domains such as medical, transportation, military, legal, and aerospace. Despite its importance to the translation industry, however, the evaluation of terminology translation has received comparatively little attention in MT research. Term translation quality in MT is usually assessed by domain experts, in academia or in industry, and to the best of our knowledge there is as yet no publicly available solution for evaluating terminology translation in MT automatically. In particular, manual intervention is typically needed, which by nature is a time-consuming and highly expensive task. This is impractical in an industrial setting, where customised MT systems often need to be updated for many reasons (e.g. the availability of new training data or of better MT techniques). Hence, there is a genuine need for a faster and less expensive solution that helps end-users identify term translation problems in MT instantly. In this study, we propose TermEval, an automatic metric for evaluating terminology translation in MT. Since, to the best of our knowledge, no gold-standard dataset exists for measuring terminology translation quality in MT, we semi-automatically create one from an English-Hindi judicial-domain parallel corpus. We train state-of-the-art phrase-based SMT (PB-SMT) and neural MT (NMT) models in two translation directions, English-to-Hindi and Hindi-to-English, and use TermEval to evaluate their terminology translation performance on the created gold-standard test set. To measure the correlation between TermEval scores and human judgments, the translations of each source term in the gold-standard test set are validated by human evaluators. The high correlation between TermEval and human judgments demonstrates the effectiveness of the proposed terminology translation evaluation metric. We also carry out a comprehensive manual evaluation of terminology translation and present our observations.
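
    The abstract does not reproduce TermEval's exact formulation, so the following is only a minimal sketch of the general idea: score an MT system by the fraction of gold-standard term translations that appear in its output. All names are hypothetical, and the naive surface matching is a simplification; the published metric may differ substantially.

        # Hypothetical sketch of a term-level translation accuracy score in
        # the spirit of TermEval; not the published metric itself.
        def term_accuracy(hypotheses, gold_terms):
            """hypotheses: MT output sentences (target language).
            gold_terms: one set per sentence, holding the reference
            translations of the terms in the corresponding source sentence."""
            matched = total = 0
            for hyp, terms in zip(hypotheses, gold_terms):
                for term in terms:
                    total += 1
                    if term.lower() in hyp.lower():  # naive surface match
                        matched += 1
            return matched / total if total else 0.0

        print(term_accuracy(["the high court upheld the appeal"],
                            [{"high court", "appeal"}]))  # -> 1.0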

    Machine translation evaluation resources and methods: a survey

    We introduce a survey of Machine Translation (MT) evaluation covering both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness; more advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria. We classify the automatic evaluation methods into two categories: lexical similarity and the application of linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic and semantic features: the syntactic features include part-of-speech tags, phrase types, and sentence structures, while the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Deep learning models for evaluation have been proposed only very recently. We also introduce methods for assessing MT evaluation itself, including different correlation scores, as well as the recent quality estimation (QE) tasks for MT. This paper differs from existing works [GALEprogram2009, EuroMatrixProject2007] in several respects: it covers recent developments in MT evaluation measures, classifies measures along the spectrum from manual to automatic, introduces the recent QE tasks for MT, and presents the material concisely.
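
    As a concrete instance of the lexical-similarity family discussed above, unigram precision, recall, and F-measure between a hypothesis and a reference can be computed as below; this is a generic textbook illustration, not a specific metric from the survey.

        from collections import Counter

        def unigram_prf(hypothesis, reference):
            """Unigram precision, recall, and F1 between two tokenised
            sentences, with clipped counts so repeated words are not
            over-credited."""
            hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
            overlap = sum((hyp & ref).values())  # clipped matches
            p = overlap / max(sum(hyp.values()), 1)
            r = overlap / max(sum(ref.values()), 1)
            f = 2 * p * r / (p + r) if p + r else 0.0
            return p, r, f

        print(unigram_prf("the cat sat on the mat", "the cat is on the mat"))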

    Exploiting source similarity for SMT using context-informed features

    In this paper, we introduce context-informed features into a log-linear phrase-based SMT framework; these features enable us to exploit source similarity in addition to the target similarity modeled by the language model. We present a memory-based classification framework that enables the estimation of these features while avoiding sparseness problems. We evaluate the performance of our approach on Italian-to-English and Chinese-to-English translation tasks using a state-of-the-art phrase-based SMT system, and report significant improvements in both BLEU and NIST scores when adding the context-informed features.
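
    For readers unfamiliar with the log-linear framework: a candidate translation is scored as a weighted sum of feature functions, so adding context-informed features amounts to extending that sum. A minimal sketch follows, with hypothetical feature names, values, and weights (in the paper, the context-informed feature values come from a memory-based classifier).

        import math

        def loglinear_score(features, weights):
            """Log-linear model score: sum_i lambda_i * h_i(e, f),
            with feature values already in the log domain."""
            return sum(weights[name] * value for name, value in features.items())

        # Hypothetical feature values for one candidate translation:
        features = {
            "lm": math.log(1e-4),               # target language model
            "phrase_trans": math.log(0.3),      # phrase translation probability
            "context_informed": math.log(0.6),  # source-context classifier score
        }
        weights = {"lm": 0.5, "phrase_trans": 0.3, "context_informed": 0.2}
        print(loglinear_score(features, weights))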

    BIKE: Bilingual Keyphrase Experiments

    This paper presents a novel strategy for translating lists of keyphrases. Typical keyphrase lists appear in scientific articles, information retrieval systems, and web-page metadata. Our system combines a statistical translation model trained on a bilingual corpus of scientific papers with sense-focused look-up in a large bilingual terminological resource. For the latter, we developed a novel technique that benefits from viewing the keyphrase list as contextual help for sense disambiguation. The optimal combination of modules was discovered by a genetic algorithm. Our work applies to the French-English language pair.
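
    The genetic-algorithm step can be pictured as searching for module weights that rank the correct translation first on a development set. The sketch below is a toy reconstruction under that assumption; the module scores, data, and GA parameters are all hypothetical.

        import random

        # Toy dev set: (candidate list, gold translation). Each candidate
        # carries scores from the two modules (names hypothetical).
        DEV = [
            ([{"target": "réseau de neurones", "tm": 0.7, "dict": 0.9},
              {"target": "filet neuronal", "tm": 0.8, "dict": 0.1}],
             "réseau de neurones"),
        ]

        def fitness(w):
            """Fraction of dev items whose gold translation is ranked
            first by the weighted combination of module scores."""
            hits = 0
            for cands, gold in DEV:
                best = max(cands, key=lambda c: w[0] * c["tm"] + w[1] * c["dict"])
                hits += best["target"] == gold
            return hits / len(DEV)

        def genetic_search(pop_size=20, generations=30):
            pop = [[random.random(), random.random()] for _ in range(pop_size)]
            for _ in range(generations):
                pop.sort(key=fitness, reverse=True)
                parents = pop[: pop_size // 2]  # selection: keep the fittest half
                children = [[g + random.gauss(0, 0.1) for g in random.choice(parents)]
                            for _ in range(pop_size - len(parents))]  # mutation
                pop = parents + children
            return max(pop, key=fitness)

        print(genetic_search())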

    Real-Time Statistical Speech Translation

    This research investigates statistical machine translation approaches to translating speech automatically in real time. Such systems can be used in a pipeline with speech recognition and synthesis software to produce a real-time voice communication system between speakers of different languages. We obtained three main data sets from spoken proceedings that represent three different types of human speech: the TED, Europarl, and OPUS parallel text corpora were used as the basis for training language models and for development tuning and testing of the translation system. We also conducted experiments involving part-of-speech tagging, compound splitting, linear language-model interpolation, truecasing, and morphosyntactic analysis. We evaluated the effects of a variety of data preparation steps on the translation results using the BLEU, NIST, METEOR, and TER metrics, and examined which metric is most suitable for the Polish-English (PL-EN) language pair.
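
    Of the preparation steps listed, linear language-model interpolation is easy to state precisely: the interpolated probability is a convex combination of the component models' probabilities. A minimal sketch, with made-up component models standing in for ones trained on the different corpora:

        def interpolate_lms(models, lambdas):
            """Linear LM interpolation: p(w | h) = sum_i lambda_i * p_i(w | h),
            with the lambdas summing to one (tuned on development data)."""
            assert abs(sum(lambdas) - 1.0) < 1e-9
            def p(word, history):
                return sum(l * m(word, history) for m, l in zip(models, lambdas))
            return p

        # Hypothetical component models (e.g. TED-trained vs. Europarl-trained):
        ted = lambda w, h: 0.02 if w == "parliament" else 0.001
        europarl = lambda w, h: 0.10 if w == "parliament" else 0.0005
        p = interpolate_lms([ted, europarl], [0.6, 0.4])
        print(p("parliament", ("the", "european")))  # 0.6*0.02 + 0.4*0.10 = 0.052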

    Discourse Structure in Machine Translation Evaluation

    In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can improve the correlation of various existing machine translation evaluation metrics with human judgments, at both the segment and the system level. This suggests that discourse information is complementary to the information used by many existing evaluation metrics, and thus could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTK-party. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that: (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation's RST tree to the reference's tree is positively correlated with translation quality.
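
    The "simple linear combination" the article refers to can be sketched as interpolating an existing metric's segment scores with the discourse-tree similarity; the weight below is hypothetical (in the article, such combinations are judged by correlation with human assessments).

        def combined_metric(base_scores, discourse_scores, alpha=0.8):
            """Linearly combine an existing MT metric with a discourse-aware
            RST-tree similarity (alpha is a hypothetical mixing weight)."""
            return [alpha * b + (1 - alpha) * d
                    for b, d in zip(base_scores, discourse_scores)]

        # e.g. segment-level metric scores and RST-tree kernel similarities:
        print(combined_metric([0.31, 0.52], [0.40, 0.45]))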

    Lexical Features for Statistical Machine Translation

    In modern phrasal and hierarchical statistical machine translation systems, two major features model translation: rule translation probabilities and lexical smoothing scores. The rule translation probabilities are computed as maximum likelihood estimates (MLEs) of an entire source (or target) phrase translating to a target (or source) phrase. The lexical smoothing scores are also a likelihood estimate of a source (target) phrase translating to a target (source) phrase, but they are computed using independent word-to-word translation probabilities. Intuitively, it would seem that the lexical smoothing score is a less powerful estimate of translation likelihood due to this independence assumption, but I present the somewhat surprising result that lexical smoothing is far more important to the quality of a state-of-the-art hierarchical SMT system than rule translation probabilities. I posit that this is due to a fundamental data sparsity problem: the average word-to-word translation is seen many more times than the average phrase-to-phrase translation, so the word-to-word translation probabilities (or lexical probabilities) are far better estimated. Motivated by this result, I present a number of novel methods for modifying the lexical probabilities to improve the quality of our MT output. First, I examine two methods of lexical probability biasing, where for each test document a set of secondary lexical probabilities is extracted and interpolated with the primary lexical probability distribution. Biasing each document with the probabilities extracted from its own first-pass decoding output provides a small but consistent gain of about 0.4 BLEU. Second, I contextualize the lexical probabilities by factoring in additional information such as the previous or next word. The key to the success of this context-dependent lexical smoothing is a backoff model, where our "trust" in a context-dependent probability estimate is directly proportional to how many times it was seen in the training data. In this way, I avoid the estimation problem seen in translation rules, where the amount of context is high but the probability estimation is inaccurate. When using the surrounding words as context, this feature provides a gain of about 0.6 BLEU on Arabic and Chinese. Finally, I describe several types of discriminatively trained lexical features, along with a new optimization procedure called Expected-BLEU optimization. This new optimization procedure is able to robustly estimate weights for thousands of decoding features, which can in effect discriminatively optimize a set of lexical probabilities to maximize BLEU. I also describe two other discriminative feature types, one of which is the part-of-speech analogue to lexical probabilities, and the other of which estimates training corpus weights based on lexical translations. The discriminative features produce a gain of 0.8 BLEU on Arabic and 0.4 BLEU on Chinese.
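
    The lexical-probability biasing described above amounts to interpolating a document-specific translation table with the global one. A minimal sketch under that reading, with a hypothetical interpolation weight and toy tables:

        def biased_lexprob(primary, secondary, lam=0.9):
            """Interpolate a document-specific ('secondary') word-to-word
            translation table with the global ('primary') one; lam is a
            hypothetical interpolation weight."""
            words = set(primary) | set(secondary)
            return {w: lam * primary.get(w, 0.0) + (1 - lam) * secondary.get(w, 0.0)
                    for w in words}

        # p(target word | source word "bank") from two toy tables; the
        # secondary table is imagined extracted from first-pass output:
        primary = {"banque": 0.6, "rive": 0.4}
        secondary = {"rive": 1.0}
        print(biased_lexprob(primary, secondary))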