457 research outputs found

    Creating a strong statistical machine translation system by combining different decoders

    Machine translation is an important field in Natural Language Processing. The need for it arises from the increasing amount of data available online: most data is now digital, and this share is expected to grow. Since manual human translation takes considerable time and effort, machine translation is needed to cover all available languages. Much research has been done to make machine translation faster and more reliable across language pairs. Machine translation is now being coupled with deep learning and neural networks, and new topics are being studied and tested, such as using neural machine translation as a replacement for classical statistical machine translation. In this thesis, we study the effect of data preprocessing and decoder type on translation output. We then demonstrate two ways to enhance translation from English to Arabic. The first approach uses a two-decoder system: the first decoder translates from English to Arabic, and the second is a post-processing decoder that retranslates the first Arabic output into Arabic again to fix some of the translation errors. We then study the results of different kinds of decoders and their contributions on the test set. The results of this study lead to the second approach, which combines different decoders to create a stronger one. This approach uses a classifier to categorize English sentences based on their structure; the output of the classifier is the decoder best suited to translate the sentence. Both approaches increased the BLEU score, albeit by different amounts. The classifier showed an increase of ~0.1 BLEU points, while the post-processing decoder showed an increase of between ~0.3 and ~11 BLEU points on two different test sets. Finally, we compare our results to Google Translate to see how well we do against a well-known translator. Our best machine translation system scored 5 absolute points compared to Google Translate on the ISI corpus test set, and we were 9 absolute points lower on the UN corpus test set.
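The two approaches in this abstract can be pictured as two ways of composing decoders. The sketch below is only an illustration under loud assumptions: the abstract does not specify the classifier, its features, or the decoders, so `toy_classifier`, `decoder_a`, `decoder_b`, and `fix_output` are hypothetical stand-ins, not the thesis's components.

```python
from typing import Callable, Dict

# A "decoder" here is just any function from a sentence to a sentence.
Decoder = Callable[[str], str]


def two_pass(sentence: str, translate: Decoder, post_edit: Decoder) -> str:
    """First approach: translate, then feed the output to a second decoder."""
    return post_edit(translate(sentence))


def route(sentence: str,
          classify: Callable[[str], str],
          decoders: Dict[str, Decoder]) -> str:
    """Second approach: a classifier picks which decoder handles the sentence."""
    return decoders[classify(sentence)](sentence)


# Hypothetical stand-ins so the sketch runs; real systems would plug in
# trained decoders and a classifier over sentence structure.
def decoder_a(s: str) -> str:
    return "[A] " + s


def decoder_b(s: str) -> str:
    return "[B] " + s


def fix_output(s: str) -> str:
    return s.replace("[A]", "[A+post]")


def toy_classifier(s: str) -> str:
    # Assumption for illustration only: route by sentence length.
    return "a" if len(s.split()) <= 5 else "b"
```

Calling `route("a short sentence", toy_classifier, {"a": decoder_a, "b": decoder_b})` sends the sentence to `decoder_a`, while `two_pass("hello", decoder_a, fix_output)` chains the two stand-in decoders.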

    Workshop on statistical machine translation for curious translators

    The author was an honorary collaborator of the Departamento de Lenguajes y Sistemas Informáticos in December 2016. Slides from the workshop "Workshop on Statistical Machine Translation for Curious Translators", given by Víctor Manuel Sánchez Cartagena at the Universidad de Alicante in December 2016.

    Providing morphological information for SMT using neural networks

    Treating morphologically complex words (MCWs) as atomic units in translation does not yield desirable results. Such words are complicated constituents with meaningful subunits. A complex word in a morphologically rich language (MRL) can correspond to several words or even a full sentence in a simpler language, which means the surface form of complex words should be accompanied by auxiliary morphological information in order to provide a precise translation and a better alignment. In this paper we follow this idea and propose two different methods to convey such information to statistical machine translation (SMT) models. In the first model we enrich factored SMT engines by introducing a new morphological factor that relies on subword-aware word embeddings. In the second model we focus on the language-modeling component. We explore a subword-level neural language model (NLM) to capture sequence-, word- and subword-level dependencies. Our NLM is able to approximate better scores for conditional word probabilities, so the decoder generates more fluent translations. We studied two languages, Farsi and German, in our experiments and observed significant improvements for both.

    Text Representation for Nonconcatenative Morphology

    The last six years have seen immense improvement in the translation quality of neural machine translation (NMT). With the help of neural networks, NMT has achieved state-of-the-art results in translation quality. However, NMT still does not achieve translation quality near human levels. In this thesis, we propose new approaches to improve the language representation used as input to NMT. This can be achieved by exploiting language-specific knowledge, such as phonetic alterations, morphology, and syntax. We propose a new approach that improves the language representation by exploiting morphological phenomena in Turkish and Hebrew, and we show that the proposed segmentation approaches can improve translation quality. We used several different segmentation approaches and compared them with each other. All of the segmentation approaches are rooted in language-specific morphological analysis of Turkish and Hebrew. We also looked at the effect of each specific segmentation approach on translation quality. We trained six different transformer models with different segmentation approaches and compared them with each other. For each segmentation approach, we evaluated translation quality using two automatic metrics and human evaluation. We observed that the segmentation approaches can improve translation quality according to human evaluation, but not according to the automatic metrics. We emphasize the importance of human evaluation for NMT and show that automatic metrics can often be misleading.

    EUSMT: incorporating linguistic information to SMT for a morphologically rich language. Its use in SMT-RBMT-EBMT hybridation

    148 pp., with graphs. This thesis is set in the framework of machine translation for Basque. Having developed a Rule-Based Machine Translation (RBMT) system for Basque in the IXA group (Mayor, 2007), we decided to tackle the Statistical Machine Translation (SMT) approach and experiment with how to adapt it to the peculiarities of the Basque language. First, we analyzed the impact of the agglutinative nature of Basque and the best way to deal with it. To address this, we split Basque words into the lemma and a set of tags that represent the morphological information expressed by the inflection. By dividing each Basque word in this way, we aim to reduce the sparseness produced by the agglutinative nature of Basque and the small amount of training data. Similarly, we studied the differences in word order between Spanish and Basque, examining different techniques for dealing with them. We confirm the weakness of basic SMT in dealing with large word order differences between the source and target languages. Distance-based reordering, the technique used by the baseline system, does not have enough information to properly handle large word order differences, so every technique tested in this work (based on both statistics and manually generated rules) outperforms the baseline. Once we had obtained a more accurate SMT system, we made the first attempts to combine different MT systems into a hybrid one that would allow us to get the best of the different paradigms. The hybridization attempts carried out in this PhD dissertation are preliminary, but even so, this work can help us determine the next steps. Carried out with a researcher-training grant from the Basque Government (BFI05.326).
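To make the remark about distance-based reordering concrete, here is a minimal sketch of the standard distortion distance used in phrase-based SMT, which sums |start_i - end_(i-1) - 1| over the source phrases in the order they are translated. The function name and the 0-indexed span convention are assumptions for illustration; this is not code from the thesis or from any particular toolkit.

```python
from typing import List, Tuple


def distortion_cost(spans: List[Tuple[int, int]]) -> int:
    """Total distance-based reordering cost.

    `spans` lists the (start, end) source positions (0-indexed, inclusive)
    of each phrase in the order the decoder translates them.
    """
    cost = 0
    prev_end = -1  # position just before the first source word
    for start, end in spans:
        # Zero when the phrase starts right after the previous one ends.
        cost += abs(start - prev_end - 1)
        prev_end = end
    return cost


# Monotone translation: every phrase follows the previous one.
monotone = distortion_cost([(0, 0), (1, 3), (4, 5)])

# Translating the final phrase second, then jumping back.
reordered = distortion_cost([(0, 0), (4, 5), (1, 3)])
```

The cost depends only on how far the decoder jumps, not on which words are involved, which is one way to see why such a model carries little information about systematic word order differences.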