170,093 research outputs found

    Comparison of Data Selection Techniques for the Translation of Video Lectures

    Full text link
    [EN] For the task of online translation of scientific video lectures, using huge models is not possible. To obtain smaller and more efficient models, we perform data selection. In this paper, we present a qualitative and quantitative comparison of several data selection techniques based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755 (transLectures) and the Spanish MINECO Active2Trans (TIN2012-31723) research project. Wuebker, J.; Ney, H.; Martínez-Villaronga, A.; Giménez Pastor, A.; Juan Císcar, A.; Servan, C.; Dymetman, M. et al. (2014). Comparison of Data Selection Techniques for the Translation of Video Lectures. Association for Machine Translation in the Americas. http://hdl.handle.net/10251/54431
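    As a rough illustration of the cross-entropy criterion mentioned in this abstract, below is a minimal Python sketch of cross-entropy-difference scoring in the spirit of Moore & Lewis (2010), with add-one-smoothed unigram language models standing in for the larger n-gram models used in practice. The corpora and function names are hypothetical, and this is a sketch of the general technique rather than the paper's exact method.

```python
import math
from collections import Counter

def train_unigram_lm(corpus):
    """Train a simple add-one-smoothed unigram LM (a stand-in for the
    larger n-gram models used in practice)."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    V = len(counts) + 1  # +1 reserves mass for unknown tokens
    return lambda tok: (counts.get(tok, 0) + 1) / (total + V)

def cross_entropy(lm, sentence):
    """Per-token cross-entropy of a sentence under a unigram LM."""
    toks = sentence.split()
    return -sum(math.log2(lm(t)) for t in toks) / max(len(toks), 1)

def select(pool, in_domain_lm, general_lm, top_k):
    """Rank candidates by cross-entropy difference H_in(s) - H_gen(s);
    lower scores look more in-domain."""
    scored = sorted(pool, key=lambda s: cross_entropy(in_domain_lm, s)
                                        - cross_entropy(general_lm, s))
    return scored[:top_k]

# Hypothetical corpora, for illustration only.
in_domain = ["the lecture covers neural networks", "gradient descent converges"]
general = ["the cat sat on the mat", "stocks rose sharply today"]
pool = ["backpropagation updates the weights", "the dog chased the ball"]

lm_in = train_unigram_lm(in_domain)
lm_gen = train_unigram_lm(general)
print(select(pool, lm_in, lm_gen, top_k=1))
```

    Scoring by the difference of the two cross-entropies, rather than by the in-domain cross-entropy alone, is what keeps the selection from simply favouring short or frequent-word sentences.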

    Training Machine Translation for Human Acceptability

    Get PDF
    Discriminative training, a.k.a. tuning, is an important part of Statistical Machine Translation. This step optimises the weights of the several statistical models and heuristics used in a machine translation system, in order to balance their relative effect on the translation output. Different weights lead to significant changes in the quality of translation outputs, so selecting appropriate weights is of key importance. This thesis addresses three major problems with current discriminative training methods in order to improve translation quality. First, we design more accurate automatic machine translation evaluation metrics that correlate better with human judgements. An automatic evaluation metric is used in the loss function of most discriminative training methods; however, which metric is best for this purpose is still an open question. In this thesis we propose two novel evaluation metrics that achieve better correlation with human judgements than the current de facto standard, the BLEU metric. We show that these metrics can improve translation quality when used in discriminative training. Second, we design an algorithm to select sentence pairs for training the discriminative learner from large pools of freely available parallel sentences. These resources tend to be noisy and include translations of varying degrees of quality and suitability for the translation task at hand, especially if obtained using crowdsourcing methods. Nevertheless, they are crucial when professionally created training data is scarce or unavailable. There is very little previous research on data selection for discriminative training. Our novel data selection algorithm requires neither knowledge of the test set nor decoding outputs, and is thus more generally useful and efficient. Our experiments show that with this data selection algorithm, translation quality consistently improves over strong baselines. Finally, the third component of the thesis is a novel weighted ranking-based optimisation algorithm for discriminative training. In contrast to previous approaches, this technique assigns a different weight to each training instance according to its reachability and its relationship to the test sentence being decoded, a form of transductive learning. Our experimental results show improvements over a modern state-of-the-art method across different language pairs. Overall, the proposed approaches lead to better translation quality than strong baselines in our experiments, both in isolation and when combined, and can easily be applied to most existing statistical machine translation approaches.
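    To make the weighted ranking-based optimisation idea concrete, here is a minimal Python sketch of a pairwise-ranking tuning step in the spirit of PRO (Hopkins & May, 2011), with a per-instance weight standing in for the transductive weighting the abstract describes. The function and its parameters are illustrative assumptions, not the thesis's exact algorithm.

```python
import random

def weighted_pro_update(hypotheses, weights, learning_rate=0.1,
                        n_pairs=50, instance_weight=1.0):
    """One weighted pairwise-ranking step over an n-best list.

    `hypotheses` is a list of (feature_vector, metric_score) pairs for
    one source sentence; `instance_weight` scales the update, a crude
    stand-in for the per-instance weighting described in the abstract.
    """
    for _ in range(n_pairs):
        (f_a, s_a), (f_b, s_b) = random.sample(hypotheses, 2)
        if s_a == s_b:
            continue  # no preference between equally scored hypotheses
        if s_b > s_a:  # make `a` the metric-preferred hypothesis
            (f_a, s_a), (f_b, s_b) = (f_b, s_b), (f_a, s_a)
        model_a = sum(w * x for w, x in zip(weights, f_a))
        model_b = sum(w * x for w, x in zip(weights, f_b))
        if model_a <= model_b:  # model ranking disagrees with the metric
            weights = [w + learning_rate * instance_weight * (xa - xb)
                       for w, xa, xb in zip(weights, f_a, f_b)]
    return weights

# Hypothetical 3-feature n-best list: (features, BLEU-like score).
nbest = [([1.0, 0.2, 0.5], 0.41), ([0.6, 0.9, 0.1], 0.27),
         ([0.8, 0.1, 0.9], 0.35)]
print(weighted_pro_update(nbest, weights=[0.0, 0.0, 0.0]))
```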

    Preference Learning for Machine Translation

    Get PDF
    Automatic translation of natural language is still (as of 2017) a long-standing but unmet promise. While advancing at a fast rate, the underlying methods are still far from being able to reliably capture the syntax or semantics of arbitrary utterances of natural language, let alone transport the encoded meaning into a second language. However, it is possible to build useful translating machines when the target domain is well known and the machine is able to learn and adapt efficiently and promptly from new inputs. This is possible thanks to efficient and effective machine learning methods which can be applied to automatic translation. In this work we present and evaluate methods for three distinct scenarios: a) we develop algorithms that can learn from very large amounts of data by exploiting pairwise preferences defined over competing translations, which can be used to make a machine translation system robust to arbitrary texts from varied sources, but also enable it to adapt effectively to new domains of data; b) we describe a method that is able to efficiently learn external models which adhere to fine-grained preferences extracted from a restricted selection of translated material, e.g. for adapting to users or groups of users in a computer-aided translation scenario; c) we develop methods for two machine translation paradigms, neural and traditional statistical machine translation, to directly adapt to user-defined preferences in an interactive post-editing scenario, learning precisely adapted machine translation systems. In all of these settings, we show that machine translation can be made significantly more useful by careful optimisation via preference learning.
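    The pairwise preferences in scenario (a) are commonly learned with a logistic (Bradley-Terry) loss over competing translations. The sketch below shows one SGD step on that loss; it is a generic illustration of learning from pairwise preferences under a linear model, not the thesis's exact algorithms, and the feature vectors are made up.

```python
import math

def preference_sgd_step(weights, feats_preferred, feats_rejected, lr=0.05):
    """One SGD step on the logistic preference loss
    -log(sigmoid(w . (f_pref - f_rej))): push the model score of the
    preferred translation above that of the rejected one."""
    diff = [a - b for a, b in zip(feats_preferred, feats_rejected)]
    margin = sum(w * d for w, d in zip(weights, diff))
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    # Gradient of the negative log-likelihood is -(1 - sigmoid) * diff,
    # so gradient descent moves the weights along +diff.
    return [w + lr * (1.0 - sigmoid) * d for w, d in zip(weights, diff)]

# Hypothetical features of a preferred vs. a rejected translation.
w = [0.0, 0.0, 0.0]
for _ in range(100):  # repeated preferences sharpen the ranking
    w = preference_sgd_step(w, [0.9, 0.2, 1.0], [0.1, 0.8, 0.3])
print(w)
```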

    Compact Personalized Models for Neural Machine Translation

    Full text link
    We propose and compare methods for gradient-based domain adaptation of self-attentive neural machine translation models. We demonstrate that a large proportion of model parameters can be frozen during adaptation, with minimal or no reduction in translation quality, by encouraging structured sparsity in the set of offset tensors during learning via group lasso regularization. We evaluate this technique for both batch and incremental adaptation across multiple data sets and language pairs. Our system architecture, combining a state-of-the-art self-attentive model with compact domain adaptation, provides high-quality personalized machine translation that is both space and time efficient. Comment: Published at the 2018 Conference on Empirical Methods in Natural Language Processing.
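    A minimal sketch of the group-lasso idea: adaptation learns per-tensor offsets on top of the frozen base model, and a group-lasso proximal step shrinks each offset group toward zero, setting it exactly to zero (i.e. freezing that parameter block) when its norm falls below the threshold. The code below assumes NumPy and hypothetical parameter-group names; it illustrates the general technique, not the paper's implementation.

```python
import numpy as np

def group_lasso_prox(offsets, lam, lr):
    """Proximal step for group-lasso-regularized offset tensors.

    Each group (e.g. one weight matrix's offset) is scaled toward zero
    and zeroed out entirely when its L2 norm is below lr * lam, which
    is what effectively freezes that block of the adapted model."""
    pruned = {}
    for name, delta in offsets.items():
        norm = np.linalg.norm(delta)
        scale = max(0.0, 1.0 - lr * lam / norm) if norm > 0 else 0.0
        pruned[name] = scale * delta  # all-zero => group is frozen
    return pruned

# Hypothetical offsets for two parameter groups of an adapted model:
# the small group is pruned to zero, the large one merely shrinks.
offsets = {"encoder.layer0.attn": np.array([0.02, -0.01]),
           "decoder.layer5.ffn": np.array([0.8, -0.6])}
print(group_lasso_prox(offsets, lam=1.0, lr=0.05))
```

    Because whole groups reach exactly zero under this penalty, only the surviving offset tensors need to be stored per user or domain, which is where the space efficiency claimed in the abstract comes from.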