1,031 research outputs found

    GREAT: open source software for statistical machine translation

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/s10590-011-9097-6[EN] In this article, the first public release of GREAT as an open-source, statistical machine translation (SMT) software toolkit is described. GREAT is based on a bilingual language modelling approach for SMT, which is so far implemented for n-gram models based on the framework of stochastic finite-state transducers. The use of finite-state models is motivated by their simplicity, their versatility, and the fact that they present a lower computational cost, if compared with other more expressive models. Moreover, if translation is assumed to be a subsequential process, finite-state models are enough for modelling the existing relations between a source and a target language. GREAT includes some characteristics usually present in state-of-the-art SMT, such as phrase-based translation models or a log-linear framework for local features. Experimental results on a well-known corpus such as Europarl are reported in order to validate this software. A competitive translation quality is achieved, yet using both a lower number of model parameters and a lower response time than the widely-used, state-of-the-art SMT system Moses. © 2011 Springer Science+Business Media B.V.Study was supported by the EC (FEDER, FSE), the Spanish government (MICINN, MITyC, “Plan E”, under Grants MIPRCV “Consolider Ingenio 2010”, iTrans2 TIN2009-14511, and erudito.com TSI-020110-2009-439), and the Generalitat Valenciana (Grant Prometeo/2009/014).González Mollá, J.; Casacuberta Nolla, F. (2011). GREAT: open source software for statistical machine translation. Machine Translation. 25(2):145-160. https://doi.org/10.1007/s10590-011-9097-6S145160252Amengual JC, Benedí JM, Casacuberta F, Castaño MA, Castellanos A, Jiménez VM, Llorens D, Marzal A, Pastor M, Prat F, Vidal E, Vilar JM (2000) The EUTRANS-I speech translation system. Mach Transl 15(1-2): 75–103Andrés-Ferrer J, Juan-Císcar A, Casacuberta F (2008) Statistical estimation of rational transducers applied to machine translation. Appl Artif Intell 22(1–2): 4–22Bangalore S, Riccardi G (2002) Stochastic finite-state models for spoken language machine translation. Mach Transl 17(3): 165–184Berstel J (1979) Transductions and context-free languages. B.G. Teubner, Stuttgart, GermanyCasacuberta F, Vidal E (2004) Machine translation with inferred stochastic finite-state transducers. Comput Linguist 30(2): 205–225Casacuberta F, Vidal E (2007) Learning finite-state models for machine translation. Mach Learn 66(1): 69–91Foster G, Kuhn R, Johnson H (2006) Phrasetable smoothing for statistical machine translation. In: Proceedings of the 11th Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, pp 53–61González J (2009) Aprendizaje de transductores estocásticos de estados finitos y su aplicación en traducción automática. PhD thesis, Universitat Politècnica de València. Advisor: Casacuberta FGonzález J, Casacuberta F (2009) GREAT: a finite-state machine translation toolkit implementing a grammatical inference approach for transducer inference (GIATI). In: Proceedings of the EACL Workshop on Computational Linguistic Aspects of Grammatical Inference, Athens, Greece, pp 24–32Kanthak S, Vilar D, Matusov E, Zens R, Ney H (2005) Novel reordering approaches in phrase-based statistical machine translation. In: Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, Ann Arbor, MI, pp 167–174Karttunen L (2001) Applications of finite-state transducers in natural language processing. In: Proceedings of the 5th Conference on Implementation and Application of Automata, London, UK, pp 34–46Kneser R, Ney H (1995) Improved backing-off for n-gram language modeling. In: Proceedings of the 20th IEEE International Conference on Acoustic, Speech and Signal Processing, Detroit, MI, pp 181–184Knight K, Al-Onaizan Y (1998) Translation with finite-state devices. In: Proceedings of the 3rd Conference of the Association for Machine Translation in the Americas, Langhorne, PA, pp 421–437Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Proceedings of the 9th Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp 388–395Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th Machine Translation Summit, Phuket, Thailand, pp 79–86Koehn P (2010) Statistical machine translation. Cambridge University Press, Cambridge, UKKoehn P, Hoang H (2007) Factored translation models. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, pp 868–876Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, pp 177–180Kumar S, Deng Y, Byrne W (2006) A weighted finite state transducer translation template model for statistical machine translation. Nat Lang Eng 12(1): 35–75Li Z, Callison-Burch C, Dyer C, Ganitkevitch J, Khudanpur S, Schwartz L, Thornton WNG, Weese J, Zaidan OF (2009) Joshua: an open source toolkit for parsing-based machine translation. In: Procee- dings of the ACL Workshop on Statistical Machine Translation, Morristown, NJ, pp 135–139Llorens D, Vilar JM, Casacuberta F (2002) Finite state language models smoothed using n-grams. Int J Pattern Recognit Artif Intell 16(3): 275–289Marcu D, Wong W (2002) A phrase-based, joint probability model for statistical machine translation. In: Proceedings of the 7th Conference on Empirical Methods in Natural Language Processing, Morristown, NJ, pp 133–139Mariño JB, Banchs RE, Crego JM, de Gispert A, Lambert P, Fonollosa JAR, Costa-jussà MR (2006) N-gram-based machine translation. Comput Linguist 32(4): 527–549Medvedev YT (1964) On the class of events representable in a finite automaton. In: Moore EF (eds) Sequential machines selected papers. Addison Wesley, Reading, MAMohri M, Pereira F, Riley M (2002) Weighted finite-state transducers in speech recognition. Comput Speech Lang 16(1): 69–88Och FJ, Ney H (2002) Discriminative training and maximum entropy models for statistical machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, pp 295–302Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1): 19–51Ortiz D, García-Varea I, Casacuberta F (2005) Thot: a toolkit to train phrase-based statistical translation models. In: Proceedings of the 10th Machine Translation Summit, Phuket, Thailand, pp 141–148Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, pp 311–318Pérez A, Torres MI, Casacuberta F (2008) Joining linguistic and statistical methods for Spanish-to-Basque speech translation. Speech Commun 50: 1021–1033Picó D, Casacuberta F (2001) Some statistical-estimation methods for stochastic finite-state transducers. Mach Learn 44: 121–142Rosenfeld R (1996) A maximum entropy approach to adaptive statistical language modeling. Comput Speech Lang 10: 187–228Simard M, Plamondon P (1998) Bilingual sentence alignment: balancing robustness and accuracy. Mach Transl 13(1): 59–80Singh AK, Husain S (2007) Exploring translation similarities for building a better sentence aligner. In: Proceedings of the 3rd Indian International Conference on Artificial Intelligence, Pune, India, pp 1852–1863Steinbiss V, Tran BH, Ney H (1994) Improvements in beam search. In: Proceedings of the 3rd International Conference on Spoken Language Processing, Yokohama, Japan, pp 2143–2146Torres MI, Varona A (2001) k-TSS language models in speech recognition systems. Comput Speech Lang 15(2): 127–149Vidal E (1997) Finite-state speech-to-speech translation. In: Proceedings of the 22nd IEEE International Conference on Acoustic, Speech and Signal Processing, Munich, Germany, pp 111–114Vidal E, Thollard F, de la Higuera C, Casacuberta F, Carrasco RC (2005) Probabilistic finite-state machines–Part II. IEEE Trans Pattern Anal Mach Intell 27(7): 1025–1039Viterbi A (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13(2): 260–26

    Spanish generation from Spanish Sign Language using a phrase-based translation system

    Get PDF
    This paper describes the development of a Spoken Spanish generator from Spanish Sign Language (LSE – Lengua de Signos Española) in a specific domain: the renewal of Identity Document and Driver’s license. The system is composed of three modules. The first one is an interface where a deaf person can specify a sign sequence in sign-writing. The second one is a language translator for converting the sign sequence into a word sequence. Finally, the last module is a text to speech converter. Also, the paper describes the generation of a parallel corpus for the system development composed of more than 4,000 Spanish sentences and their LSE translations in the application domain. The paper is focused on the translation module that uses a statistical strategy with a phrase-based translation model, and this paper analyses the effect of the alignment configuration used during the process of word based translation model generation. Finally, the best configuration gives a 3.90% mWER and a 0.9645 BLEU

    Statistical approaches for natural language modelling and monotone statistical machine translation

    Full text link
    Esta tesis reune algunas contribuciones al reconocimiento de formas estadístico y, más especícamente, a varias tareas del procesamiento del lenguaje natural. Varias técnicas estadísticas bien conocidas se revisan en esta tesis, a saber: estimación paramétrica, diseño de la función de pérdida y modelado estadístico. Estas técnicas se aplican a varias tareas del procesamiento del lenguajes natural tales como clasicación de documentos, modelado del lenguaje natural y traducción automática estadística. En relación con la estimación paramétrica, abordamos el problema del suavizado proponiendo una nueva técnica de estimación por máxima verosimilitud con dominio restringido (CDMLEa ). La técnica CDMLE evita la necesidad de la etapa de suavizado que propicia la pérdida de las propiedades del estimador máximo verosímil. Esta técnica se aplica a clasicación de documentos mediante el clasificador Naive Bayes. Más tarde, la técnica CDMLE se extiende a la estimación por máxima verosimilitud por leaving-one-out aplicandola al suavizado de modelos de lenguaje. Los resultados obtenidos en varias tareas de modelado del lenguaje natural, muestran una mejora en términos de perplejidad. En a la función de pérdida, se estudia cuidadosamente el diseño de funciones de pérdida diferentes a la 0-1. El estudio se centra en aquellas funciones de pérdida que reteniendo una complejidad de decodificación similar a la función 0-1, proporcionan una mayor flexibilidad. Analizamos y presentamos varias funciones de pérdida en varias tareas de traducción automática y con varios modelos de traducción. También, analizamos algunas reglas de traducción que destacan por causas prácticas tales como la regla de traducción directa; y, así mismo, profundizamos en la comprensión de los modelos log-lineares, que son de hecho, casos particulares de funciones de pérdida. Finalmente, se proponen varios modelos de traducción monótonos basados en técnicas de modelado estadístico .Andrés Ferrer, J. (2010). Statistical approaches for natural language modelling and monotone statistical machine translation [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/7109Palanci

    A Phrase-Level Machine Translation Approach For Disfluency Detection Using Weighted Finite State Transducers

    Get PDF
    We propose a novel algorithm to detect disfluency in speech by reformulating the problem as phrase-level statistical machine translation using weighted finite state transducers. We approach the task as translation of noisy speech to clean speech. We simplify our translation framework such that it does not require fertility and alignment models. We tested our model on the Switchboard disfluency-annotated corpus. Using an optimized decoder that is developed for phrase-based translation at IBM, we are able to detect repeats, repairs and filled pauses for more than a thousand sentences in less than a second with encouraging results. Index Terms: disfluency detection, machine translation, speech-to-speech translation

    Novel statistical approaches to text classification, machine translation and computer-assisted translation

    Full text link
    Esta tesis presenta diversas contribuciones en los campos de la clasificación automática de texto, traducción automática y traducción asistida por ordenador bajo el marco estadístico. En clasificación automática de texto, se propone una nueva aplicación llamada clasificación de texto bilingüe junto con una serie de modelos orientados a capturar dicha información bilingüe. Con tal fin se presentan dos aproximaciones a esta aplicación; la primera de ellas se basa en una asunción naive que contempla la independencia entre las dos lenguas involucradas, mientras que la segunda, más sofisticada, considera la existencia de una correlación entre palabras en diferentes lenguas. La primera aproximación dió lugar al desarrollo de cinco modelos basados en modelos de unigrama y modelos de n-gramas suavizados. Estos modelos fueron evaluados en tres tareas de complejidad creciente, siendo la más compleja de estas tareas analizada desde el punto de vista de un sistema de ayuda a la indexación de documentos. La segunda aproximación se caracteriza por modelos de traducción capaces de capturar correlación entre palabras en diferentes lenguas. En nuestro caso, el modelo de traducción elegido fue el modelo M1 junto con un modelo de unigramas. Este modelo fue evaluado en dos de las tareas más simples superando la aproximación naive, que asume la independencia entre palabras en differentes lenguas procedentes de textos bilingües. En traducción automática, los modelos estadísticos de traducción basados en palabras M1, M2 y HMM son extendidos bajo el marco de la modelización mediante mixturas, con el objetivo de definir modelos de traducción dependientes del contexto. Asimismo se extiende un algoritmo iterativo de búsqueda basado en programación dinámica, originalmente diseñado para el modelo M2, para el caso de mixturas de modelos M2. Este algoritmo de búsqueda nCivera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502Palanci
    corecore