193 research outputs found

    Extended Parallel Corpus for Amharic-English Machine Translation

    Full text link
    This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be useful for machine translation of an under-resourced language, Amharic. The corpus is larger than previously compiled corpora; it is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models using the corpus. In the automatic evaluation, neural machine translation models outperform phrase-based statistical machine translation models.Comment: Accepted to 2nd AfricanNLP workshop at EACL 202

    Effectiveness of query expansion in searching the Holy Quran

    Get PDF
    Modern Arabic text is written without diacritical marks (short vowels), which causes considerable ambiguity at the word level in the absence of context. Exceptional from this is the Holy Quran, which is endorsed with short vowels and other marks to preserve the pronunciation and hence, the correctness of sensing its words. Searching for a word in vowelized text requires typing and matching all its diacritical marks, which is cumbersome and preventing learners from searching and understanding the text. The other way around, is to ignore these marks and fall in the problem of ambiguity. In this paper, we provide a novel diacritic-less searching approach to retrieve from the Quran relevant verses that match a user’s query through automatic query expansion techniques. The proposed approach utilizes a relational database search engine that is scalable, portable across RDBMS platforms, and provides fast and sophisticated retrieval. The results are presented and the applied approach reveals future directions for search engines

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Get PDF
    Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor of its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be mostly driven by empirical trials. To orientate the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature in advanced reordering modeling. We then question why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic

    Diacritization as a machine translation problem and as a sequence labeling problem

    Get PDF
    In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and combined as well as a model which uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem, and propose a solution using conditional random fields (Lafferty et al., 2001). All these techniques are compared through word error rate and diacritization error rate both in terms of full diacritization and ignoring vowel endings. The empirical experiments showed that the machine translation approaches perform better than the sequence labeling approaches concerning the error rates

    A rules based system for named entity recognition in modern standard Arabic

    Get PDF
    The amount of textual information available electronically has made it difficult for many users to find and access the right information within acceptable time. Research communities in the natural language processing (NLP) field are developing tools and techniques to alleviate these problems and help users in exploiting these vast resources. These techniques include Information Retrieval (IR) and Information Extraction (IE). The work described in this thesis concerns IE and more specifically, named entity extraction in Arabic. The Arabic language is of significant interest to the NLP community mainly due to its political and economic significance, but also due to its interesting characteristics. Text usually contains all kinds of names such as person names, company names, city and country names, sports teams, chemicals and lots of other names from specific domains. These names are called Named Entities (NE) and Named Entity Recognition (NER), one of the main tasks of IE systems, seeks to locate and classify automatically these names into predefined categories. NER systems are developed for different applications and can be beneficial to other information management technologies as it can be built over an IR system or can be used as the base module of a Data Mining application. In this thesis we propose an efficient and effective framework for extracting Arabic NEs from text using a rule based approach. Our approach makes use of Arabic contextual and morphological information to extract named entities. The context is represented by means of words that are used as clues for each named entity type. Morphological information is used to detect the part of speech of each word given to the morphological analyzer. Subsequently we developed and implemented our rules in order to recognise each position of the named entity. Finally, our system implementation, evaluation metrics and experimental results are presented.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Arabic opinion mining using combined classification approach

    Get PDF
    In this paper, we present a combined approach that automatically extracts opinions from Arabic documents. Most research efforts in the area of opinion mining deal with English texts and little work with Arabic text. Unlike English, from our experiments, we found that using only one method on Arabic opinioned documents produce a poor performance. So, we used a combined approach that consists of three methods. At the beginning, lexicon based method is used to classify as much documents as possible. The resultant classified documents used as training set for maximum entropy method which subsequently classifies some other documents. Finally, k-nearest method used the classified documents from lexicon based method and maximum entropy as training set and classifies the rest of the documents. Our experiments showed that in average, the accuracy moved (almost) from 50% when using only lexicon based method to 60% when used lexicon based method and maximum entropy together, to 80% when using the three combined methods

    A Rules Based System for Named Entity Recognition in Modern Standard Arabic

    Get PDF
    The amount of textual information available electronically has made it difficult formany users to find and access the right information within acceptable time. Researchcommunities in the natural language processing (NLP) field are developing tools andtechniques to alleviate these problems and help users in exploiting these vast resources.These techniques include Information Retrieval (IR) and Information Extraction (IE). Thework described in this thesis concerns IE and more specifically, named entity extraction inArabic. The Arabic language is of significant interest to the NLP community mainly due toits political and economic significance, but also due to its interesting characteristics.Text usually contains all kinds of names such as person names, company names,city and country names, sports teams, chemicals and lots of other names from specificdomains. These names are called Named Entities (NE) and Named Entity Recognition(NER), one of the main tasks of IE systems, seeks to locate and classify automaticallythese names into predefined categories. NER systems are developed for differentapplications and can be beneficial to other information management technologies as it canbe built over an IR system or can be used as the base module of a Data Mining application.In this thesis we propose an efficient and effective framework for extracting Arabic NEsfrom text using a rule based approach. Our approach makes use of Arabic contextual andmorphological information to extract named entities. The context is represented by meansof words that are used as clues for each named entity type. Morphological information isused to detect the part of speech of each word given to the morphological analyzer.Subsequently we developed and implemented our rules in order to recognise each positionof the named entity. Finally, our system implementation, evaluation metrics andexperimental results are presented

    Dialectal Arabic to English Machine Translation: Pivoting through Modern Standard Arabic.

    Get PDF
    Abstract Modern Standard Arabic (MSA) has a wealth of natural language processing (NLP) tools and resources. In comparison, resources for dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, are still lacking. We present ELISSA, a machine translation (MT) system for DA to MSA. ELISSA employs a rule-based approach that relies on morphological analysis, transfer rules and dictionaries in addition to language models to produce MSA paraphrases of DA sentences. ELISSA can be employed as a general preprocessor for DA when using MSA NLP tools. A manual error analysis of ELISSA's output shows that it produces correct MSA translations over 93% of the time. Using ELISSA to produce MSA versions of DA sentences as part of an MSA-pivoting DA-to-English MT solution, improves BLEU scores on multiple blind test sets between 0.6% and 1.4%
    • …
    corecore