1,809 research outputs found

    Hybrid Arabic–French machine translation using syntactic re-ordering and morphological pre-processing

    This is an accepted manuscript of an article published by Elsevier BV in Computer Speech & Language on 08/11/2014, available online: https://doi.org/10.1016/j.csl.2014.10.007 The accepted version of the publication may differ from the final published version. Arabic is a highly inflected, morpho-syntactically complex language that differs in many ways from more heavily studied languages. It may thus require good pre-processing, as it presents significant challenges for Natural Language Processing (NLP), specifically for Machine Translation (MT). This paper aims to examine how Statistical Machine Translation (SMT) can be improved using rule-based pre-processing and language analysis. We describe a hybrid translation approach coupling an Arabic–French statistical machine translation system using the Moses decoder with additional morphological rules that reduce the morphology of the source language (Arabic) to a level that makes it closer to that of the target language (French). Moreover, we introduce additional swapping rules for a structural matching between the source language and the target language. Two structural changes involving the positions of the pronouns and verbs in both the source and target languages have been attempted. The results show an improvement in the quality of translation and a gain in terms of BLEU score after introducing a pre-processing scheme for Arabic and applying these rules based on morphological variations and verb re-ordering (VS into SV constructions) in the source language (Arabic) according to their positions in the target language (French). Furthermore, a learning curve shows the improvement in terms of BLEU score under scarce- and large-resource conditions.
The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines. This paper is based upon work supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant number 356097-08.
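The verb re-ordering step (VS into SV) mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' rule set: the coarse tag names (`VERB`, `DET`, `NOUN`, `PRON`, `ADJ`), the function name, and the assumption that the subject is the first noun phrase directly after a sentence-initial verb are all hypothetical simplifications, and the input is assumed to be already tokenized and POS-tagged.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); tag names are assumptions


def reorder_vs_to_sv(tokens: List[Token]) -> List[Token]:
    """Move a sentence-initial verb to just after the first noun phrase.

    Simplified, hypothetical rule: the subject is taken to be
    DET* (NOUN | PRON) ADJ* immediately following the verb.
    Sentences that do not match are returned unchanged.
    """
    if len(tokens) < 2 or tokens[0][1] != "VERB":
        return list(tokens)

    i = 1
    # Skip any determiners that open the subject noun phrase.
    while i < len(tokens) and tokens[i][1] == "DET":
        i += 1
    # Require a head noun or pronoun; otherwise leave the sentence alone.
    if i >= len(tokens) or tokens[i][1] not in ("NOUN", "PRON"):
        return list(tokens)
    i += 1
    # Arabic adjectives follow the noun, so absorb trailing adjectives.
    while i < len(tokens) and tokens[i][1] == "ADJ":
        i += 1

    # Subject span first, then the verb, then the rest of the sentence.
    return tokens[1:i] + [tokens[0]] + tokens[i:]


# Transliterated example: "kataba al-waladu risalatan" (wrote the-boy a-letter)
vso = [("kataba", "VERB"), ("al-waladu", "NOUN"), ("risalatan", "NOUN")]
svo = reorder_vs_to_sv(vso)
```

Applied before training and decoding, a rule of this kind would bring the Arabic word order closer to the SV order of French without touching the SMT engine itself, which matches the spirit of the approach described in the abstract.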

    Statistical Parsing by Machine Learning from a Classical Arabic Treebank

    Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state-of-the-art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision.
A practical result of the annotation effort is the corpus website: http://corpus.quran.com, an educational resource with over two million users per year.

    Creating a strong statistical machine translation system by combining different decoders

    Machine translation is a very important field in Natural Language Processing. The need for machine translation arises due to the increasing amount of data available online. Most of our data is now digital, and this is expected to increase over time. Since manual human translation takes a lot of time and effort, machine translation is needed to cover all of the languages available. A lot of research has been done to make machine translation faster and more reliable between different language pairs. Machine translation is now being coupled with deep learning and neural networks. New topics in machine translation are being studied and tested, such as applying neural machine translation as a replacement for classical statistical machine translation. In this thesis, we also study the effect of data pre-processing and decoder type on translation output. We then demonstrate two ways to enhance translation from English to Arabic. The first approach uses a two-decoder system: the first decoder translates from English to Arabic, and the second is a post-processing decoder that retranslates the first Arabic output into Arabic again to fix some of the translation errors. We then study the results of different kinds of decoders and their contributions to the test set. The results of this study lead to the second approach, which combines different decoders to create a stronger one. The second approach uses a classifier to categorize the English sentences based on their structure. The output of the classifier is the decoder best suited to translate the English sentence. Both approaches increased the BLEU score, albeit by different margins. The classifier showed an increase of ~0.1 BLEU points, while the post-processing decoder showed an increase of between ~0.3 and ~11 BLEU points on two different test sets. Finally, we compare our results to Google Translate to see how well we do in comparison to a well-known translator.
Our best machine translation system scored 5 absolute points compared to Google Translate on the ISI corpus test set, and we were 9 absolute points lower in the case of the UN corpus test set.
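The second approach in this abstract, a classifier that picks the decoder best suited to each English sentence, can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's system: `classify_structure` is a toy stand-in for a trained classifier, its labels (`question`, `short`, `long`) and the 20-token threshold are invented for the example, and the decoders are hypothetical callables rather than real Moses or neural decoders.

```python
from typing import Callable, Dict

Decoder = Callable[[str], str]  # hypothetical interface: English in, Arabic out


def classify_structure(sentence: str) -> str:
    """Toy stand-in for a trained sentence-structure classifier.

    The labels and the 20-token threshold are assumptions made
    purely for illustration.
    """
    if sentence.rstrip().endswith("?"):
        return "question"
    return "long" if len(sentence.split()) > 20 else "short"


def translate(sentence: str,
              decoders: Dict[str, Decoder],
              route: Dict[str, str]) -> str:
    """Send the sentence to the decoder assigned to its structure label."""
    label = classify_structure(sentence)
    decoder_name = route[label]
    return decoders[decoder_name](sentence)


# Placeholder decoders that only tag their input, so the routing is visible.
decoders: Dict[str, Decoder] = {
    "phrase_based": lambda s: "[phrase_based] " + s,
    "hierarchical": lambda s: "[hierarchical] " + s,
}
route = {"question": "hierarchical", "short": "phrase_based", "long": "hierarchical"}

routed = translate("Where is the station?", decoders, route)
```

In a real system, the routing table would be learned from which decoder scored best on each sentence type in held-out data; here it is fixed by hand to keep the sketch self-contained.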