3 research outputs found

    Improving statistical machine translation through adaptation and learning

    Get PDF
    With the arrival of free on-line machine translation (MT) systems, came the possibility to improve automatic translations with the help of daily users. One of the methods to achieve such improvements is to ask to users themselves for a better translation. It is possible that the system had made a mistake and if the user is able to detect it, it would be a valuable help to let the user teach the system where it made the mistake so it does not make it again if it finds a similar situation. Most of the translation systems you can find on-line provide a text area for users to suggest a better translation (like Google translator) or a ranking system for them to use (like Microsoft's). In 2009, as part of the Seventh Framework Programme of the European Commission, the FAUST project started with the goal of developing "machine translation (MT) systems which respond rapidly and intelligently to user feedback". Specifically, one of the project objective was to "develop mechanisms for instantaneously incorporating user feedback into the MT engines that are used in production environments, ...". As a member of the FAUST project, this thesis focused on developing one such mechanism. Formally, the general objective of this work was to design and implement a strategy to improve the translation quality of an already trained Statistical Machine Translation (SMT) system, using translations of input sentences that are corrections of the system's attempt to translate them. To address this problem we divided it in three specific objectives: 1. Define a relation between the words of a correction sentence and the words in the system's translation, in order to detect the errors that the former is aiming to solve. 2. Include the error corrections in the original system, so it learns how to solve them in case a similar situation occurs. 3. Test the strategy in different scenarios and with different data, in order to validate the applications of the proposed methodology. The main contributions made to the SMT field that can be found in this Ph.D. thesis are: - We defined a similarity function that compares an MT system output with a translation reference for that output and align the errors made by the system with the correct translations found in the reference. This information is then used to compute an alignment between the original input sentence and the reference. - We defined a method to perform domain adaptation based on the alignment mentioned before. Using this alignment with an in-domain parallel corpus, we extract new translation units that correspond both to units found in the system and were correctly chosen during translation and new units that include the correct translations found in the reference. These new units are then scored and combined with the units in the original system in order to improve its quality in terms of both human an automatic metrics. - We succesfully applied the method in a new task: to improve a SMT translation quality using post-editions provided by real users of the system. In this case, the alignment was computed over a parallel corpus build with post-editions, extracting translation units that correspond both to units found in the system and were correctly chosen during translation and new units that include the corrections found in the feedback provided. - The method proposed in this dissertation is able to achieve significant improvements in translation quality with a small learning material, corresponding to a 0.5% of the training material used to build the original system. Results from our evaluations also indicate that the improvement achieved with the domain adaptation strategy is measurable by both automatic a human-based evaluation metrics.Esta tesis propone un nuevo método para mejorar un sistema de Traducción Automática Estadística (SMT por sus siglas en inglés) utilizando post-ediciones de sus traducciones automáticas. La estrategia puede asociarse con la adaptación de dominio, considerando las post-ediciones obtenidas a través de usuarios reales del sistema de traducción como el material del dominio a adaptar. El método compara las post-ediciones con las traducciones automáticas con la finalidad de detectar automáticamente los lugares en los que el traductor cometió algún error, para poder aprender de ello. Una vez los errores han sido detectados se realiza un alineado a nivel de palabras entre las oraciones originales y las postediciones, para extraer unidades de traducción que son luego incorporadas al sistema base de manera que se corrijan los errores en futuras traducciones. Nuestros resultados muestran mejoras estadísticamente significativas a partir de un conjunto de datos que representa en tamaño un 0, 5% del material utilizado durante el entrenamiento. Junto con las medidas automáticas de calidad, también presentamos un análisis cualitativo del sistema para validar los resultados. Las mejoras en la traducción se observan en su mayoría en el léxico y el reordenamiento de palabras, seguido de correcciones morfológicas. La estrategia, que introduce los conceptos de corpus aumentado, función de similaridad y unidades de traducción derivadas, es probada con dos paradigmas de SMT (traducción basada en N-gramas y en frases), con dos pares de lengua (Catalán-Español e Inglés-Español) y en diferentes escenarios de adaptación de dominio, incluyendo un dominio abierto en el cual el sistema fue adaptado a través de peticiones recogidas por usuarios reales a través de internet, obteniendo resultados similares durante todas las pruebas. Los resultados de esta investigación forman parte del projecto FAUST (en inglés, Feedback Analysis for User adaptive Statistical Translation), un proyecto del Séptimo Programa Marco de la Comisión Europea

    Statistical machine translation system and computational domain adaptation

    Get PDF
    Statističko strojno prevođenje temeljeno na frazama jedan je od mogućih pristupa automatskom strojnom prevođenju. U radu su predložene metode za poboljšanje kvalitete strojnog prijevoda prilagodbom određenih parametara u modelu sustava za statističko strojno prevođenje. Ideja rada bila jest izgraditi sustave za statističko strojno prevođenje temeljeno na frazama za hrvatski i engleski jezik. Sustavi su trenirani za dva jezična smjera, na dvije domene, na paralelnim korpusima različitih veličina i obilježja za hrvatsko-engleski i englesko-hrvatski jezični par, nakon čega proveden postupak ugađanja sustava. Istraženi su hibridni sustavi koji objedinjuju značajke obiju domena. Time je ispitan izravan utjecaj adaptacije domene na kvalitetu automatskog strojnog prijevoda hrvatskog jezika, a nova saznanja mogu koristiti pri izgradnji novih sustava. Provedena je automatska i ljudska evaluacija (vrednovanje) strojnih prijevoda, a dobiveni rezultati uspoređeni su s rezultatima strojnih prijevoda dobivenih primjenom postojećih web servisa za statističko strojno prevođenje.Phrase-based statistical machine translation is one of possible automatic machine translation approaches. This work proposes methods for increasing the quality of machine translation by adapting certain parameters in the statistical machine translation model. The idea was to build phrase-based statistical machine translation systems for Croatian and English language. The systems were be trained for two directions, on two domains, on parallel corpora of different sizes and characteristics for Croatian-English and English-Croatian language pair, after which the tuning procedure was conducted. Afterwards, hybrid systems which combine features of both domains were investigated. Thereby the direct impact of domain adaptation on the quality of automatic machine translation of Croatian language was explored, whereas new findings can be utilised for building new systems. Automatic and human evaluation of machine translations were carried out, while obtained results were compared with results obtained from applying existing statistical machine translation web services

    Mixing multiple translation models in Statistical Machine Translation

    No full text
    Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. In this paper, we evaluate performance on a domain adaptation setting where we translate sentences from the medical domain. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation.
    corecore