
    MATREX: the DCU MT system for WMT 2010

    This paper describes the DCU machine translation system in the evaluation campaign of the Joint Fifth Workshop on Statistical Machine Translation and Metrics in ACL-2010. We describe the modular design of our multi-engine machine translation (MT) system, with particular focus on the components used in this participation. We participated in the English–Spanish and English–Czech translation tasks, employing our multi-engine architecture for translation. We also participated in the system combination task, which was carried out using the MBR decoder and the confusion network decoder.
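
    As a rough illustration of the MBR decoding used in the system combination task, the sketch below (Python) selects from an n-best list the hypothesis with the highest expected gain against its competitors; the normalised unigram overlap is a stand-in for the true gain function, which the abstract does not specify.

        # Minimal MBR selection over an n-best list (illustrative only).
        from collections import Counter

        def gain(hyp, ref):
            """Normalised unigram overlap between two token lists (a crude BLEU stand-in)."""
            overlap = sum((Counter(hyp) & Counter(ref)).values())
            return overlap / max(len(hyp), len(ref), 1)

        def mbr_select(nbest):
            """nbest: list of (tokens, posterior_probability) pairs."""
            best, best_score = None, float("-inf")
            for hyp, _ in nbest:
                # Expected gain of `hyp` against the whole list.
                score = sum(p * gain(hyp, other) for other, p in nbest)
                if score > best_score:
                    best, best_score = hyp, score
            return best

        nbest = [("el gato negro".split(), 0.5),
                 ("el gato".split(), 0.3),
                 ("un gato negro".split(), 0.2)]
        print(" ".join(mbr_select(nbest)))  # -> el gato negro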

    Towards improving English-Latvian translation: a system comparison and a new rescoring feature

    This paper presents a comparative study of two alternative approaches to statistical machine translation (SMT) and their application to a task of English-to-Latvian translation. Furthermore, a novel feature intended to reflect the relatively free word order of the Latvian language is proposed and successfully applied at the n-best-list rescoring step. Moving beyond the automatic translation quality scores classically presented in MT research papers, we contribute a manual error analysis of the MT systems' output that helps to shed light on the advantages and disadvantages of the SMT systems under consideration.
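
    The rescoring step operates on top of the decoder's log-linear score. Below is a minimal sketch of how such a feature slots in; the word-order feature shown is a hypothetical stand-in, not the feature proposed in the paper.

        # Minimal n-best rescoring with one extra feature (illustrative).
        def rescore(nbest, weights, extra_feature, extra_weight):
            """nbest: list of (hypothesis, [feature_values]) from the decoder."""
            rescored = []
            for hyp, feats in nbest:
                score = sum(w * f for w, f in zip(weights, feats))
                score += extra_weight * extra_feature(hyp)  # new rescoring feature
                rescored.append((score, hyp))
            return max(rescored)[1]

        nbest = [("viņš ātri skrien", [-4.2, -1.0]),
                 ("ātri viņš skrien", [-4.5, -0.8])]
        order_feature = lambda hyp: -len(hyp.split())  # hypothetical stand-in
        print(rescore(nbest, [1.0, 0.5], order_feature, 0.1))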

    Discriminative training of continuous-space models in machine translation (Apprentissage discriminant des modèles continus en traduction automatique)

    Over the past few years, neural network (NN) architectures have been successfully applied to many Natural Language Processing (NLP) applications, such as Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT). For the language modeling task, these models consider linguistic units (i.e., words and phrases) through their projections into a continuous (multi-dimensional) space, and the estimated distribution is a function of these projections. Known as continuous-space models (CSMs), their peculiarity hence lies in this exploitation of a continuous representation, which can be seen as an attempt to address the sparsity issue of conventional discrete models. In the context of SMT, these techniques have been applied to neural-network-based language models (NNLMs) included in SMT systems, and to continuous-space translation models (CSTMs). These models have led to significant and consistent gains in SMT performance, but they are also very expensive to train and to use for inference, especially for systems involving large vocabularies. To overcome this issue, the Structured Output Layer (SOUL) architecture and Noise Contrastive Estimation (NCE) have been proposed: the former modifies the standard structure of the output layer, while the latter approximates maximum-likelihood estimation (MLE) by a sampling method. All these approaches share the same estimation criterion, the MLE; however, using this procedure results in an inconsistency between the objective function defined for parameter estimation and the way models are used in the SMT application. The work presented in this dissertation aims to design new performance-oriented and global training procedures for CSMs to overcome these issues. The main contributions lie in the investigation and evaluation of efficient training methods for (large-vocabulary) CSMs which aim (a) to reduce the total training cost, and (b) to improve the efficiency of these models when used within the SMT application. On the one hand, the training and inference cost can be reduced directly (using the SOUL structure or the NCE algorithm) or by reducing the number of iterations via faster convergence. This thesis provides an empirical analysis of these solutions on different large-scale SMT tasks. On the other hand, we propose a discriminative training framework which optimizes the performance of the whole system containing the CSM as a component model.
    The experimental results show that this framework is effective both for training and for adapting CSMs within SMT systems, opening promising research perspectives.
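
    For concreteness, NCE replaces the expensive softmax normalisation of a large-vocabulary CSM with a binary discrimination task between each observed word and k sampled noise words. A minimal numpy sketch of the per-example loss follows; the scores and noise probabilities are illustrative.

        # NCE loss for one training example (illustrative sketch).
        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def nce_loss(s_data, s_noise, pn_data, pn_noise, k):
            """s_data: model score (unnormalised log-prob) of the observed word;
            s_noise: scores of the k noise words; pn_*: their probabilities
            under the noise distribution (typically a unigram model)."""
            d_data = s_data - np.log(k * pn_data)    # log-odds that the word is real
            d_noise = s_noise - np.log(k * pn_noise)
            return -(np.log(sigmoid(d_data)) + np.sum(np.log(sigmoid(-d_noise))))

        loss = nce_loss(s_data=2.1, s_noise=np.array([0.3, -0.5, 1.0]),
                        pn_data=0.01, pn_noise=np.array([0.05, 0.2, 0.01]), k=3)
        print(loss)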

    A syntactified direct translation model with linear-time decoding

    Recent syntactic extensions of statistical translation models work with a synchronous context-free or tree-substitution grammar extracted from an automatically parsed parallel corpus. The decoders accompanying these extensions typically exceed quadratic time complexity. This paper extends the Direct Translation Model 2 (DTM2) with syntax while maintaining linear-time decoding. We employ a linear-time parsing algorithm based on an eager, incremental interpretation of Combinatory Categorial Grammar (CCG). As every input word is processed, the local parsing decisions resolve ambiguity eagerly by selecting a single supertag–operator pair for extending the dependency parse incrementally. Alongside translation features extracted from the derived parse tree, we explore syntactic features extracted from the incremental derivation process. Our empirical experiments show that our model significantly outperforms the state-of-the-art DTM2 system.
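
    The linear-time guarantee follows from the eager commitment to a single supertag–operator pair per word, so the work per word is constant. Below is a schematic sketch of that decoding loop, with stub scorer and combinator; the paper defines these over CCG categories.

        # Eager, incremental one-best parsing loop (schematic sketch).
        def eager_parse(words, candidates, score, combine):
            """candidates(word) -> iterable of (supertag, operator) pairs;
            score(state, word, pair) -> float; combine(state, pair) -> state."""
            state, derivation = None, []
            for w in words:
                # Resolve ambiguity eagerly: commit to the single best pair.
                pair = max(candidates(w), key=lambda p: score(state, w, p))
                state = combine(state, pair)
                derivation.append((w, pair))
            return state, derivation

        words = "we translate text".split()
        cands = lambda w: [("NP", "fwd-apply"), ("S\\NP", "bwd-apply")]
        score = lambda st, w, p: len(p[0])   # placeholder scorer
        comb = lambda st, p: (st, p)         # placeholder parse state
        print(eager_parse(words, cands, score, comb)[1])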

    Incorporating translation quality-oriented features into log-linear models of machine translation

    The current state-of-the-art approach to Machine Translation (MT) has limitations which could be alleviated by the use of syntax-based models. Although the benefits of syntax use in MT are becoming clear with the ongoing improvements in string-to-tree and tree-to-string systems, tree-to-tree systems such as Data Oriented Translation (DOT) have, until recently, suffered from a lack of training resources, and as a consequence are currently immature, lacking key features compared to Phrase-Based Statistical MT (PB-SMT) systems. In this thesis we propose avenues to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems. Noting that both types of systems score translations using probabilities not necessarily related to the quality of the translations they produce, we introduce a training mechanism which takes translation quality into account by averaging the edit distance between a translation unit and the translation units used in oracle translations. This training mechanism could in principle be adapted to a very broad class of MT systems. In particular, we show that when translating Spanish sentences into English, it leads to improvements in the translation quality of both PB-SMT and DOT. In addition, we show how our method leads to a PB-SMT system which uses significantly fewer resources and translates significantly faster than the original, while maintaining the improvements in translation quality. We then address the issue of the limited feature set in DOT by defining a new DOT model which is able to exploit features of the complete source sentence. We introduce a feature into this new model which conditions each target word on the source context it is associated with, and we also make the first attempt at incorporating a language model (LM) into a DOT system. We investigate different estimation methods for our lexical feature (namely Maximum Entropy and improved Kneser-Ney), reporting on their empirical performance. After describing methods which enable us to improve the efficiency of our system and allow us to scale to larger training data sizes, we evaluate the performance of our new model on English-to-Spanish translation, obtaining significant translation quality improvements compared to the original DOT system.
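
    A minimal sketch of the quality-oriented scoring idea, with translation units simplified to token sequences (the thesis defines them over phrase pairs and DOT fragments): a unit is scored by its average edit distance to the units used in oracle translations, lower being better.

        # Average edit distance of a translation unit to oracle units (sketch).
        def edit_distance(a, b):
            """Standard Levenshtein distance between two token sequences."""
            m, n = len(a), len(b)
            d = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                d[i][0] = i
            for j in range(n + 1):
                d[0][j] = j
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                                  d[i - 1][j - 1] + cost)
            return d[m][n]

        def quality_score(unit, oracle_units):
            """Mean edit distance of `unit` to the oracle translation units."""
            return sum(edit_distance(unit, o) for o in oracle_units) / len(oracle_units)

        print(quality_score("the black cat".split(),
                            ["the black cat".split(), "a black cat".split()]))  # 0.5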

    Statistical Machine Translation Features with Multitask Tensor Networks

    We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interactions among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features. Comment: 11 pages (9 content + 2 references), 2 figures; accepted to ACL 2015 as a long paper.
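
    A tensor layer augments the usual linear transform with a bilinear form per output unit, which is what captures the higher-order interactions between units. A minimal numpy sketch follows; the shapes and the tanh nonlinearity are illustrative, not the paper's exact architecture.

        # Bilinear tensor layer over two input vectors (illustrative sketch).
        import numpy as np

        def tensor_layer(x, y, W, V, b):
            """x: (dx,), y: (dy,), W: (k, dx, dy), V: (k, dx + dy), b: (k,)."""
            bilinear = np.einsum('i,kij,j->k', x, W, y)  # second-order interactions
            linear = V @ np.concatenate([x, y])          # standard first-order part
            return np.tanh(bilinear + linear + b)

        rng = np.random.default_rng(0)
        x, y = rng.normal(size=4), rng.normal(size=4)
        out = tensor_layer(x, y, rng.normal(size=(3, 4, 4)),
                           rng.normal(size=(3, 8)), np.zeros(3))
        print(out.shape)  # (3,)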

    Phrase extraction and rescoring in statistical machine translation

    The lack of linguistically motivated translation units or phrase pairs in Phrase-based Statistical Machine Translation (PB-SMT) systems is a well-known source of error. One approach to minimise such errors is to supplement the standard PB-SMT models with phrase pairs extracted from parallel treebanks (linguistically annotated and aligned corpora). In this thesis, we extend the treebank-based phrase extraction framework with percolated dependencies, a hitherto unutilised knowledge source, and evaluate its usability through more than a dozen syntax-aware phrase extraction models. However, the improvement in system performance is neither consistent nor conclusive despite the proven advantages of linguistically motivated phrase pairs. This leads us to hypothesize that the PB-SMT pipeline is flawed, as it often fails to access perfectly good phrase pairs while searching for the highest-scoring translation (decoding). A model error occurs when the highest-probability translation (the actual output of a PB-SMT system) according to a statistical machine translation model is not the most accurate translation it can produce. In the second part of this thesis, we identify and attempt to trace these model errors across state-of-the-art PB-SMT decoders by locating the position of oracle translations (the translation most similar to a reference translation, i.e. the expected output of a PB-SMT system) in the n-best lists generated by a PB-SMT decoder. We analyse the impact of individual decoding features on the quality of translation output and introduce two rescoring algorithms to minimise the low ranking of oracles in the n-best lists. Finally, we extend our oracle-based rescoring approach to a reranking framework by rescoring the n-best lists with additional reranking features. We observe limited but encouraging success and conclude by speculating on how our oracle-based rescoring of n-best lists can help the PB-SMT system (supplemented with multiple treebank-based phrase extractions) get optimal performance out of linguistically motivated phrase pairs.
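
    Locating oracles amounts to ranking the n-best hypotheses by similarity to the reference and reporting the position of the most similar one. A minimal sketch using a unigram F-measure as a stand-in for the standard MT metrics used in the thesis:

        # Find the rank of the oracle translation in an n-best list (sketch).
        from collections import Counter

        def f1(hyp, ref):
            overlap = sum((Counter(hyp) & Counter(ref)).values())
            if overlap == 0:
                return 0.0
            p, r = overlap / len(hyp), overlap / len(ref)
            return 2 * p * r / (p + r)

        def oracle_rank(nbest, ref):
            """nbest: decoder-ordered list of token lists; returns (rank, oracle)."""
            score, i, hyp = max((f1(h, ref), i, h) for i, h in enumerate(nbest))
            return i + 1, hyp  # 1-based position in the n-best list

        nbest = ["the cat sat".split(), "a cat sat down".split(), "cat sat".split()]
        print(oracle_rank(nbest, "a cat sat down".split()))  # oracle at rank 2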

    Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic

    Language modeling for an inflected language such as Arabic poses new challenges for speech recognition and machine translation due to its rich morphology. Rich morphology results in large increases in the out-of-vocabulary (OOV) rate and poor language model parameter estimation in the absence of large quantities of data. In this study, we present a joint morphological-lexical language model (JMLLM) that takes advantage of Arabic morphology. JMLLM combines morphological segments with the underlying lexical items, together with additional available information sources regarding morphological segments and lexical items, in a single joint model. Joint representation and modeling of morphological and lexical items reduces the OOV rate and provides smooth probability estimates while keeping the predictive power of whole words. Speech recognition and machine translation experiments in dialectal Arabic show improvements over word- and morpheme-based trigram language models. We also show that as the tightness of integration between the different information sources increases, the performance of both speech recognition and machine translation improves.
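
    The OOV reduction from combining morphological segments with lexical items can be illustrated with a simple back-off: score in-vocabulary words with the lexical model and decompose unknown words into morpheme probabilities. This is illustrative only; the JMLLM itself is a single joint model, not a two-model back-off.

        # Word/morpheme back-off scoring (illustrative sketch, not the JMLLM).
        def score_word(word, history, lex_prob, morph_prob, segment, vocab):
            """lex_prob / morph_prob: trigram probability functions;
            segment: morphological analyser mapping a word to its segments."""
            if word in vocab:
                return lex_prob(word, history)
            p, morph_history = 1.0, list(history)
            for m in segment(word):  # e.g. Arabic prefix + stem + suffix
                p *= morph_prob(m, tuple(morph_history[-2:]))
                morph_history.append(m)
            return p

        vocab = {"the", "book"}
        lex = lambda w, h: 0.1                # placeholder trigram models
        mor = lambda m, h: 0.3
        seg = lambda w: ["kitab", "+hum"]     # hypothetical segmentation
        print(score_word("kitabhum", ("the", "book"), lex, mor, seg, vocab))  # 0.09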