15 research outputs found

    Latent Domain Translation Models in Mix-of-Domains Haystack


    Improving English to Spanish out-of-domain translations by morphology generalization and generation

    This paper presents a detailed study of a method for morphology generalization and generation to address out-of-domain translations in English-to-Spanish phrase-based MT. The paper studies whether the morphological richness of the target language causes poor translation quality when translating out of domain. In detail, the approach first translates into simplified Spanish forms and then predicts the final inflected forms through a morphology generation step based on shallow and deep-projected linguistic information available from both the source- and target-language sentences. The obtained results highlight the importance of generalization, and therefore generation, for dealing with out-of-domain data.
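The two-step "generalize then generate" idea above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the toy lookup table and feature strings are hypothetical, whereas the paper predicts inflections with learned models over shallow and deep-projected linguistic features.

```python
# Step 1 (generalization): the MT system outputs simplified (lemmatized)
# Spanish words plus the morphological features still to be resolved.
# Step 2 (generation): a separate model predicts the inflected surface
# form; a toy lookup table stands in for that model here.

def inflect(lemma, feats, table):
    """Return the inflected form for (lemma, feats); fall back to the
    simplified form when the generation step has no prediction."""
    return table.get((lemma, feats), lemma)

# Hypothetical generation table: (lemma, features) -> inflected form.
GENERATION_TABLE = {
    ("comprar", "3sg.past"): "compró",
    ("casa", "f.sg"): "casa",
}

# Simplified output of step 1 for a toy sentence.
simplified = [("comprar", "3sg.past"), ("casa", "f.sg")]

# Step 2: morphology generation over the simplified translation.
inflected = [inflect(lemma, feats, GENERATION_TABLE)
             for lemma, feats in simplified]
```

Decoupling the two steps is what makes generalization pay off out of domain: the translation model only has to cover lemmas, and the generation step supplies inflections it may never have seen paired with those lemmas in the training data.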

    Cost-sensitive active learning for computer-assisted translation

    This is the author's version of a work that was accepted for publication in Pattern Recognition Letters. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition Letters, Volume 37, 1 February 2014, Pages 124–134, DOI: 10.1016/j.patrec.2013.06.007.

    Machine translation technology is not perfect. To be successfully embedded in real-world applications, it must compensate for its imperfections by interacting intelligently with the user within a computer-assisted translation framework. The interactive-predictive paradigm, where a statistical translation model and a human expert collaborate to generate the translation, has been shown to be an effective computer-assisted translation approach. However, the exhaustive supervision of all translations and the use of non-incremental translation models penalize the productivity of conventional interactive-predictive systems. We propose a cost-sensitive active learning framework for computer-assisted translation whose goal is to make the translation process as painless as possible. In contrast to conventional active learning scenarios, the proposed framework is designed to minimize not only how many translations the user must supervise but also how difficult each translation is to supervise. To do that, we address the two potential drawbacks of the interactive-predictive translation paradigm. On the one hand, user effort is focused on those translations whose supervision is considered more "informative", thus maximizing the utility of each user interaction. On the other hand, we use a dynamic machine translation model that is continually updated with user feedback after deployment. We empirically validated each of the technical components in simulation and quantified the user effort saved. We conclude that both selective translation supervision and translation model updating lead to important user-effort reductions, and consequently to improved translation productivity.

    Work supported by the European Union Seventh Framework Program (FP7/2007-2013) under the CasMaCat project (grant agreement No. 287576), by the Generalitat Valenciana under grant ALMPR (Prometeo/2009/014), and by the Spanish Government under grant TIN2012-31723. The authors thank Daniel Ortiz-Martínez for providing the log-linear SMT model with incremental features and the corresponding online learning algorithms, and the anonymous reviewers for their criticisms and suggestions.

    González Rubio, J.; Casacuberta Nolla, F. (2014). Cost-sensitive active learning for computer-assisted translation. Pattern Recognition Letters, 37(1):124–134. https://doi.org/10.1016/j.patrec.2013.06.007
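The cost-sensitive selection idea above can be sketched as trading off informativeness against supervision effort, rather than ranking by informativeness alone as in conventional active learning. The scoring functions below are illustrative stand-ins (confidence-based informativeness, length-based cost), not the measures used in the paper.

```python
# Hypothetical sketch of cost-sensitive active selection: pick the
# translations whose supervision buys the most information per unit of
# user effort, within a fixed effort budget.

def informativeness(candidate):
    # Stand-in: low model confidence => supervision is more informative.
    return 1.0 - candidate["confidence"]

def supervision_cost(candidate):
    # Stand-in: longer source sentences take more effort to supervise.
    return max(len(candidate["source"].split()), 1)

def select_for_supervision(candidates, budget):
    """Greedily pick candidates by informativeness-per-cost ratio until
    the effort budget is exhausted; the rest are accepted unsupervised."""
    ranked = sorted(candidates,
                    key=lambda c: informativeness(c) / supervision_cost(c),
                    reverse=True)
    chosen, spent = [], 0
    for c in ranked:
        cost = supervision_cost(c)
        if spent + cost <= budget:
            chosen.append(c)
            spent += cost
    return chosen
```

In a full interactive-predictive loop, each supervised translation would also be fed back to incrementally update the translation model, which is the second component the paper validates.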

    Leveraging bilingual terminology to improve machine translation in a CAT environment

    This work focuses on the extraction and integration of automatically aligned bilingual terminology into a Statistical Machine Translation (SMT) system in a Computer-Aided Translation (CAT) scenario. We evaluate a framework that, taking as input a small set of parallel documents, gathers domain-specific bilingual terms and injects them into an SMT system to enhance translation quality. In particular, we investigate several strategies to extract and align terminology across languages and to integrate it into an SMT system. We compare two terminology injection methods that can be easily used at run-time without altering the normal activity of an SMT system: XML markup and a cache-based model. We test the cache-based model on two different domains (information technology and medical) in English, Italian and German, showing significant improvements ranging from 2.23 to 6.78 BLEU points over a baseline SMT system and from 0.05 to 3.03 compared to the widely used XML markup approach.
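A run-time terminology cache of the kind described above can be sketched as a small store of extracted term pairs consulted before the system's general translation resources. This is a deliberately simplified stand-in: the actual cache-based model plugs into the SMT decoder's scoring, whereas here a word-level lookup with a fallback illustrates only the override behaviour. All term pairs are hypothetical examples.

```python
# Illustrative sketch of run-time terminology injection via a cache:
# domain term pairs, gathered from a small set of parallel documents,
# override the system's generic translations without retraining it.

class TerminologyCache:
    def __init__(self):
        self.pairs = {}

    def add(self, source_term, target_term):
        # Later additions overwrite earlier ones, so the cache can be
        # refreshed as new domain documents arrive.
        self.pairs[source_term] = target_term

    def lookup(self, source_term):
        return self.pairs.get(source_term)

def translate_tokens(tokens, cache, fallback):
    """Translate token by token, preferring cached domain terms and
    deferring to the generic system (fallback) otherwise."""
    out = []
    for tok in tokens:
        hit = cache.lookup(tok)
        out.append(hit if hit is not None else fallback(tok))
    return out

cache = TerminologyCache()
cache.add("driver", "driver di periferica")  # hypothetical IT-domain pair
```

The contrast with XML markup is that markup annotates each input sentence with the desired translations, while the cache is consulted by the system itself, which is what allows it to be updated once and applied across all subsequent sentences.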

    Programari per a la personalització de motors de TA. Anàlisi de productes

    The aim of this Master's Degree Dissertation is to analyze software that allows the customization of machine translation engines. For this purpose, six different tools (Machine Translation Training Tool, ModernMT, MTradumàtica, LetsMT, KantanMT and Microsoft Translation Hub) were chosen and their installation and engine training were explored. Moreover, two corpora (Chinese-Catalan and French-Catalan) were created in order to train the engines of these programs with specific language combinations. A theoretical explanation of MT systems, quality in MT and the state of the art is provided. Likewise, the resources used throughout the dissertation, the tools' main features, the creation of both corpora, and the installation and training of the software are described in detail. Once the engines had been trained, a summary of the most significant features of each tool and an analysis of the quality of the MT were prepared. Apart from these results, this dissertation highlights the methods relating to the installation and training of the programs.

    Multilingual Neural Translation

    Machine translation (MT) refers to the technology that can automatically translate content in one language into other languages. Being an important research area in the field of natural language processing, machine translation has typically been considered one of the most challenging yet exciting problems. Thanks to research progress in data-driven statistical machine translation (SMT), MT has recently become capable of providing adequate translation services in many language directions and has been widely deployed in various practical applications and scenarios. Nevertheless, there exist several drawbacks in the SMT framework. The major drawbacks of SMT lie in its dependence on separately trained components, its simple modeling approach, and its ignorance of global context in the translation process. These inherent drawbacks prevent over-tuned SMT models from gaining any further noticeable improvements. Furthermore, SMT is unable to formulate a multilingual approach in which more than two languages are involved. The typical workaround is to develop multiple pair-wise SMT systems and connect them in a complex bundle to perform multilingual translation. These limitations have called for innovative approaches to address them effectively.

    On the other hand, research on artificial neural networks has progressed rapidly since the beginning of the last decade, thanks to improvements in computation, i.e., faster hardware. Among other machine learning approaches, neural networks are known to be able to capture complex dependencies and learn latent representations. Naturally, it is tempting to apply neural networks to machine translation. First attempts revolved around replacing SMT sub-components with neural counterparts. Later attempts were more revolutionary, fundamentally replacing the whole core of SMT with neural networks, an approach now popularly known as neural machine translation (NMT). NMT is an end-to-end system that directly estimates the translation model between the source and target sentences. Furthermore, it was later discovered to capture the inherent hierarchical structure of natural language. This is the key property of NMT that enables a new training paradigm and a less complex approach to multilingual machine translation using neural models. This thesis plays an important role in the evolution of machine translation by contributing to the transition from using neural components in SMT to completely end-to-end NMT and, most importantly, by being among the pioneers in building a neural multilingual translation system.

    First, we propose an advanced neural-based component: the neural network discriminative word lexicon, which provides global coverage of the source sentence during the translation process. We aim to alleviate the problems of phrase-based SMT models that are caused by the way phrase-pair likelihoods are estimated: such models are unable to gather information from beyond the phrase boundaries. In contrast, our discriminative word lexicon exploits both the local and global contexts of the source sentences and models the translation using deep neural architectures. Our model greatly improves translation quality when applied to different translation tasks. Moreover, our proposed model motivated the later development of end-to-end NMT architectures, in which both the source and target sentences are represented with deep neural networks.

    The second, and also the most significant, contribution of this thesis is the idea of extending an NMT system to a multilingual neural translation framework without modifying its architecture. Based on the ability of deep neural networks to model complex relationships and structures, we utilize NMT to learn and share cross-lingual information to benefit all translation directions. To achieve that purpose, we take two steps: first, incorporating language information into the training corpora so that the NMT system learns a common semantic space across languages, and second, forcing the NMT system to translate into the desired target language. The compelling aspect of the approach compared to other multilingual methods is that our multilingual extension is conducted in the preprocessing phase, so no change needs to be made inside the NMT architecture. Our proposed method, a universal approach for multilingual MT, enables seamless coupling with any NMT architecture, thus making the multilingual expansion of NMT systems effortless. Our experiments, and studies by others, have successfully employed the approach with numerous different NMT architectures, showing its universality. Our multilingual neural machine translation accommodates cross-lingual information in a learned common semantic space to improve all translation directions together. It is then effectively applied and evaluated in various scenarios. We develop a multilingual translation system that relies on both source and target data to boost the quality of a single translation direction. Another system can be deployed as a multilingual translation system that only needs to be trained once on a multilingual corpus but is able to translate between many languages simultaneously, with quality more favorable than that of many translation systems trained separately. Such a system, able to learn from large corpora of well-resourced language pairs such as English → German or English → French, has proved to enhance translation directions of low-resourced language pairs like English → Lithuanian or German → Romanian. Even more, we show that this kind of approach can be applied to the extreme case of zero-resourced translation, where no parallel data is available for training, without the need for pivot techniques.

    The research topics of this thesis are not limited to broadening the application scope of our multilingual approach; we also focus on improving its efficiency in practice. Our multilingual models have been further improved to adequately address multilingual systems with a large number of languages. The proposed strategies demonstrate that they are effective at achieving better performance in multi-way translation scenarios with greatly reduced training time. Beyond academic evaluations, we deployed the multilingual ideas in the lecture-themed spontaneous speech translation service (Lecture Translator) at KIT. Interestingly, a derivative product of our systems, a multilingual word embedding corpus available in a dozen languages, can serve as a useful resource for cross-lingual applications such as cross-lingual document classification, information retrieval, textual entailment or question answering. Detailed analysis shows excellent performance with regard to semantic similarity metrics when using the embeddings on standard cross-lingual classification tasks.
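The preprocessing-only multilingual extension described in this abstract is commonly realized by prepending an artificial token naming the desired target language to each source sentence. A minimal sketch of that data preparation step follows; the `<2xx>` token format and the toy sentence pairs are assumptions for illustration, not taken from the thesis.

```python
# Minimal sketch of a preprocessing-level multilingual extension:
# an artificial target-language token is prepended to every source
# sentence, so one unmodified NMT architecture can be trained on the
# pooled data and steered toward any of the target languages.

def add_target_token(source_sentence, target_lang):
    """Prepend a target-language token, e.g. '<2de>' for German."""
    return f"<2{target_lang}> {source_sentence}"

# Toy multilingual training data: (source, target language, target).
training_pairs = [
    ("Good morning", "de", "Guten Morgen"),
    ("Good morning", "fr", "Bonjour"),
]

# Only the training data changes; the NMT system itself is untouched.
prepared = [(add_target_token(src, lang), tgt)
            for src, lang, tgt in training_pairs]
```

Because the token is just another vocabulary item, the same mechanism covers the zero-resource case mentioned above: at inference time the model can be asked for a language pair it never saw directly, relying on the shared semantic space learned from the other directions.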

    Entrenament de motors de traducció automàtica estadística especialitzats en farmàcia i medicina entre el castellà i el romanés

    The aim of this Master's Degree Dissertation is to train statistical machine translation engines between Romanian and Spanish, specialized in the pharmaceutical and medical domain, on the MTradumàtica platform. A theoretical explanation is provided of the existing machine translation systems and, in particular and in more depth, of statistical machine translation and the models involved in the translation process. Likewise, every step of the engine training is explained in detail, from the search for linguistic resources and the preparation and conversion of the materials into accepted formats, to the training of the translation engine itself on the platform. Once the engines have been trained, the quality of the machine translation is analyzed with automatic evaluation metrics and the results are compared with other existing engines. Furthermore, the peculiarities of the Romanian language and the way they affect the engine results are also studied, both in relation to morphology and to the encoding of its characters on electronic devices.