8 research outputs found

    Current trends in computer linguistics and problem of the machine translation of Arabic

    The aim of this paper is to present some problems concerning the machine translation of Arabic in the context of selected NLP theories and their evolution. The first attempts at electronic machine translation in Europe started only a little more than fifty years ago. Is that enough time to perceive some aspects of this evolution? Although many of the concepts are still valid, the situation in 2012 is quite different from that of even twelve years earlier. The older works of N. Chomsky and D. Cohen remain useful, but CFG now seems to be supplemented by newer theories, which have disadvantages of their own. Some interesting problems occur in automatic translation when the number of grammatical cases is smaller in the source language than in the target language.

    Learning Dependency Translation Models as Collections of Finite State Head Transducers

    The paper defines weighted head transducers, finite-state machines that perform middle-out string transduction. These transducers are strictly more expressive than the special case of standard left-to-right finite-state transducers. Dependency transduction models are then defined as collections of weighted head transducers that are applied hierarchically. A dynamic programming search algorithm is described for finding the optimal transduction of an input string with respect to a dependency transduction model. A method for automatically training a dependency transduction model from a set of input-output example strings is presented. The method first searches for hierarchical alignments of the training examples guided by correlation statistics, and then constructs the transitions of head transducers that are consistent with these alignments. Experimental results are given for applying the training method to translation from English to Spanish and Japanese.

    Constrained word alignment models for statistical machine translation

    Word alignment is a fundamental and crucial component in Statistical Machine Translation (SMT) systems. Despite the enormous progress made in the past two decades, this task remains an active research topic simply because the quality of word alignment is still far from optimal. Most state-of-the-art word alignment models are grounded on statistical learning theory treating word alignment as a general sequence alignment problem, where many linguistically motivated insights are not incorporated. In this thesis, we propose new word alignment models with linguistically motivated constraints in a bid to improve the quality of word alignment for Phrase-Based SMT systems (PB-SMT). We start the exploration with an investigation into segmentation constraints for word alignment by proposing a novel algorithm, namely word packing, which is motivated by the fact that one concept expressed by one word in one language can frequently surface as a compound or collocation in another language. Our algorithm takes advantage of the interaction between segmentation and alignment, starting with some segmentation for both the source and target language and updating the segmentation with respect to the word alignment results using state-of-the-art word alignment models; thereafter a refined word alignment can be obtained based on the updated segmentation. In this process, the updated segmentation acts as a hard constraint on the word alignment models and reduces the complexity of the alignment models by generating more 1-to-1 correspondences through word packing. Experimental results show that this algorithm can lead to statistically significant improvements over the state-of-the-art word alignment models. Given that word packing imposes "hard" segmentation constraints on the word aligner, which is prone to introducing noise, we propose two new word alignment models using syntactic dependencies as soft constraints. 
The first model is a syntactically enhanced discriminative word alignment model, where we use a set of feature functions to express the syntactic dependency information encoded in both source and target languages. On the one hand, this model enjoys great flexibility in its capacity to incorporate multiple features; on the other hand, it is designed to facilitate model tuning for different objective functions. Experimental results show that using syntactic constraints can improve the performance of the discriminative word alignment model, which also leads to better PB-SMT performance compared to using state-of-the-art word alignment models. The second model is a syntactically constrained generative word alignment model, where we add a syntactic coherence model over the target phrases in the context of HMM word-to-phrase alignment. The advantages of our model are that (i) the addition of the syntactic coherence model preserves the efficient parameter estimation procedures; and (ii) the flexibility of the model can be increased so that it can be tuned according to different objective functions. Experimental results show that tuning this model properly leads to a significant gain in MT performance over the state-of-the-art.

    Novel statistical approaches to text classification, machine translation and computer-assisted translation

    This thesis presents several contributions in the fields of automatic text classification, machine translation and computer-assisted translation within the statistical framework. In automatic text classification, a new application called bilingual text classification is proposed, together with a series of models aimed at capturing this bilingual information. To this end, two approaches to this application are presented: the first is based on a naive assumption of independence between the two languages involved, while the second, more sophisticated, considers the existence of a correlation between words in different languages. The first approach led to the development of five models based on unigram models and smoothed n-gram models. These models were evaluated on three tasks of increasing complexity, the most complex of which was analysed from the point of view of a document indexing aid system. The second approach is characterised by translation models capable of capturing correlation between words in different languages. In our case, the translation model chosen was the M1 model together with a unigram model. This model was evaluated on the two simplest tasks, outperforming the naive approach, which assumes independence between words in different languages coming from bilingual texts. In machine translation, the word-based statistical translation models M1, M2 and HMM are extended under the mixture modelling framework, with the aim of defining context-dependent translation models. Likewise, an iterative search algorithm based on dynamic programming, originally designed for the M2 model, is extended to the case of mixtures of M2 models. This search algorithm ...
    Civera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502

    Idiom treatment experiments in machine translation

    Idiomatic expressions pose a particular challenge for today's machine translation systems, because they must be translated according to their sense rather than literally. This dissertation shows how, with the help of a corpus and morphosyntactic rules, such idiomatic expressions can be recognized and ultimately translated correctly. The first chapter gives a general introduction to the field of machine translation and then focuses on the subfield of Example-based Machine Translation. A substantial part of the dissertation is then devoted to the theory of idiomatic expressions. The practical part describes how the hybrid Example-based Machine Translation system METIS-II was enabled, with the help of morphosyntactic rules, to process certain idiomatic expressions correctly and to translate them. The following chapter deals with the workings of the transfer system CAT2 and its handling of idiomatic expressions. The last part of the thesis evaluates three commercial systems, namely SYSTRAN, T1 Langenscheidt, and Power Translator Pro, with respect to their handling of continuous and discontinuous idiomatic expressions. For this purpose, both small corpora and parts of the extensive Europarl corpus and of the Digital Lexicon of the German Language in the 20th Century were processed, first manually and then automatically. The dissertation concludes with the findings of this evaluation.

    A tree-to-tree model for statistical machine translation

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 227-234). In this thesis, we take a statistical tree-to-tree approach to solving the problem of machine translation (MT). In a statistical tree-to-tree approach, first the source-language input is parsed into a syntactic tree structure; then the source-language tree is mapped to a target-language tree. This kind of approach has several advantages. For one, parsing the input generates valuable information about its meaning. In addition, the mapping from a source-language tree to a target-language tree offers a mechanism for preserving the meaning of the input. Finally, producing a target-language tree helps to ensure the grammaticality of the output. A main focus of this thesis is to develop a statistical tree-to-tree mapping algorithm. Our solution involves a novel representation called an aligned extended projection, or AEP. The AEP, inspired by ideas in linguistic theory related to tree-adjoining grammars, is a parse-tree-like structure that models clause-level phenomena such as verbal argument structure and lexical word-order. The AEP also contains alignment information that links the source-language input to the target-language output. Instead of learning a mapping from a source-language tree to a target-language tree, the AEP-based approach learns a mapping from a source-language tree to a target-language AEP. The AEP is a complex structure, and learning a mapping from parse trees to AEPs presents a challenging machine learning problem. In this thesis, we use a linear structured prediction model to solve this learning problem. A human evaluation of the AEP-based translation approach in a German-to-English task shows significant improvements in the grammaticality of translations. This thesis also presents a statistical parser for Spanish that could be used as part of a Spanish/English translation system. By Brooke Alissa Cowan. Ph.D.