    Combination of Arabic Preprocessing Schemes for Statistical Machine Translation

    Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes, resulting in improved translation quality. NRC publication: Ye

    CCG-augmented hierarchical phrase-based statistical machine translation

    Augmenting Statistical Machine Translation (SMT) systems with syntactic information aims at improving translation quality. Hierarchical Phrase-Based (HPB) SMT takes a step toward incorporating syntax in Phrase-Based (PB) SMT by modelling one aspect of language syntax, namely the hierarchical structure of phrases. Syntax Augmented Machine Translation (SAMT) further incorporates syntactic information extracted using context-free phrase structure grammar (CF-PSG) in the HPB SMT model. One of the main challenges facing CF-PSG-based augmentation approaches for SMT systems emerges from the difference in the definition of the constituent in CF-PSG and the ‘phrase’ in SMT systems, which hinders the ability of CF-PSG to express the syntactic function of many SMT phrases. Although the SAMT approach to solving this problem using ‘CCG-like’ operators to combine constituent labels improves syntactic constraint coverage, it significantly increases their sparsity, which restricts translation and negatively affects its quality. In this thesis, we address the problems of sparsity and limited coverage of syntactic constraints facing the CF-PSG-based syntax augmentation approaches for HPB SMT using Combinatory Categorial Grammar (CCG). We demonstrate that CCG’s flexible structures and rich syntactic descriptors help to extract richer, more expressive and less sparse syntactic constraints with better coverage than CF-PSG, which enables our CCG-augmented HPB system to outperform the SAMT system. We also try to soften the syntactic constraints imposed by CCG category nonterminal labels by extracting less fine-grained CCG-based labels. We demonstrate that CCG label simplification helps to significantly improve the performance of our CCG category HPB system. Finally, we identify the factors which limit the coverage of the syntactic constraints in our CCG-augmented HPB model. We then try to tackle these factors by extending the definition of the nonterminal label to be composed of a sequence of CCG categories and augmenting the glue grammar with CCG combinatory rules. We demonstrate that our extension approaches help to significantly increase the scope of the syntactic constraints applied in our CCG-augmented HPB model and achieve significant improvements over the HPB SMT baseline.

    Anaphora resolution for Arabic machine translation: a case study of nafs

    PhD Thesis. In the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing. This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria. Egyptian Government