96 research outputs found

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor of its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be mostly driven by empirical trials. To orientate the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature in advanced reordering modeling. We then question why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them. Comment: 44 pages, to appear in Computational Linguistics

    Turkish lexicon expansion by using finite state automata

    © 2019 The Authors. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdf Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon by using a morphological segmentation system, reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing the phonological operations that are applied to words whenever a suffix is added. Each FSA state corresponds to either a stem or a suffix category. Stems are clustered based on their parts of speech (i.e. noun, verb, or adjective) and suffixes are clustered based on their allomorphic features. We generated approximately 1 million word forms by using only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the out-of-vocabulary size in other NLP applications. Although our experiments were performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish. Published version
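
The idea of generating word forms by walking an automaton whose states are stem and suffix categories can be illustrated with a minimal sketch. This is not the authors' model: the state names, the `TRANSITIONS` table, and the helper functions are hypothetical, and only two well-known Turkish suffixes are covered (plural -lar/-ler and locative -da/-de/-ta/-te), with allomorphs chosen by vowel harmony and consonant voicing.

```python
# Toy illustration only; state names and helpers are assumptions, not the paper's model.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def last_vowel(word):
    """Return the last vowel of a word; it drives vowel harmony."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel found in {word!r}")


def plural(word):
    """Attach -lar after a back vowel, -ler after a front vowel."""
    return word + ("lar" if last_vowel(word) in BACK_VOWELS else "ler")


def locative(word):
    """Attach -da/-de, devoiced to -ta/-te after a voiceless consonant."""
    consonant = "t" if word[-1] in VOICELESS else "d"
    vowel = "a" if last_vowel(word) in BACK_VOWELS else "e"
    return word + consonant + vowel


# Morphotactics: each state lists (suffix function, next state) arcs.
TRANSITIONS = {
    "NOUN_STEM": [(plural, "PLURAL"), (locative, "LOCATIVE")],
    "PLURAL": [(locative, "LOCATIVE")],
    "LOCATIVE": [],
}


def generate(stem, state="NOUN_STEM"):
    """Enumerate every word form reachable from the given state."""
    forms = []
    for apply_suffix, next_state in TRANSITIONS[state]:
        form = apply_suffix(stem)
        forms.append(form)
        forms.extend(generate(form, next_state))
    return forms
```

Calling `generate("kitap")` walks the arcs stem → plural → locative and stem → locative, so a single stem yields several surface forms; a fuller automaton would add more suffix categories and stem classes.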

    A rules based system for named entity recognition in modern standard Arabic

    The amount of textual information available electronically has made it difficult for many users to find and access the right information within an acceptable time. Research communities in the natural language processing (NLP) field are developing tools and techniques to alleviate these problems and help users exploit these vast resources. These techniques include Information Retrieval (IR) and Information Extraction (IE). The work described in this thesis concerns IE and, more specifically, named entity extraction in Arabic. The Arabic language is of significant interest to the NLP community mainly due to its political and economic significance, but also due to its interesting characteristics. Text usually contains all kinds of names such as person names, company names, city and country names, sports teams, chemicals and lots of other names from specific domains. These names are called Named Entities (NE), and Named Entity Recognition (NER), one of the main tasks of IE systems, seeks to automatically locate these names and classify them into predefined categories. NER systems are developed for different applications and can be beneficial to other information management technologies, as they can be built over an IR system or used as the base module of a Data Mining application. In this thesis we propose an efficient and effective framework for extracting Arabic NEs from text using a rule-based approach. Our approach makes use of Arabic contextual and morphological information to extract named entities. The context is represented by means of words that are used as clues for each named entity type. Morphological information is used to detect the part of speech of each word given to the morphological analyzer. Subsequently we developed and implemented our rules in order to recognise each position of the named entity. Finally, our system implementation, evaluation metrics and experimental results are presented. EThOS - Electronic Theses Online Service, GB, United Kingdom
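
The use of clue words as context for each named entity type can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's rule set: the `CLUE_WORDS` table and `extract_entities` function are hypothetical, only the single token following a clue word is tagged, and the morphological analysis the thesis relies on is omitted.

```python
# Hypothetical clue-word table; a real system would use far larger lists
# plus part-of-speech information from a morphological analyzer.
CLUE_WORDS = {
    "الدكتور": "PERSON",        # "Dr."
    "السيد": "PERSON",          # "Mr."
    "شركة": "ORGANIZATION",     # "company"
    "مدينة": "LOCATION",        # "city"
}


def extract_entities(text):
    """Tag the token that follows a clue word with the clue's entity type."""
    tokens = text.split()
    entities = []
    for clue, candidate in zip(tokens, tokens[1:]):
        entity_type = CLUE_WORDS.get(clue)
        if entity_type is not None:
            entities.append((candidate, entity_type))
    return entities


# "Dr. Ahmed said that the company Aramco is in the city of Riyadh"
sentence = "قال الدكتور أحمد إن شركة أرامكو في مدينة الرياض"
entities = extract_entities(sentence)
```

Here each of the three clue words in the sentence triggers one rule, so the name that follows it is extracted with the corresponding type. Handling multi-token names and clue words that follow the entity would need additional rules.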

    Anaphora resolution for Arabic machine translation: a case study of nafs

    PhD Thesis. In the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing. This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria. Egyptian Government

    A Rules Based System for Named Entity Recognition in Modern Standard Arabic

    The amount of textual information available electronically has made it difficult for many users to find and access the right information within an acceptable time. Research communities in the natural language processing (NLP) field are developing tools and techniques to alleviate these problems and help users exploit these vast resources. These techniques include Information Retrieval (IR) and Information Extraction (IE). The work described in this thesis concerns IE and, more specifically, named entity extraction in Arabic. The Arabic language is of significant interest to the NLP community mainly due to its political and economic significance, but also due to its interesting characteristics. Text usually contains all kinds of names such as person names, company names, city and country names, sports teams, chemicals and lots of other names from specific domains. These names are called Named Entities (NE), and Named Entity Recognition (NER), one of the main tasks of IE systems, seeks to automatically locate these names and classify them into predefined categories. NER systems are developed for different applications and can be beneficial to other information management technologies, as they can be built over an IR system or used as the base module of a Data Mining application. In this thesis we propose an efficient and effective framework for extracting Arabic NEs from text using a rule-based approach. Our approach makes use of Arabic contextual and morphological information to extract named entities. The context is represented by means of words that are used as clues for each named entity type. Morphological information is used to detect the part of speech of each word given to the morphological analyzer. Subsequently we developed and implemented our rules in order to recognise each position of the named entity. Finally, our system implementation, evaluation metrics and experimental results are presented.

    A morphological-syntactical analysis approach for Arabic textual tagging

    Part-of-Speech (POS) tagging is the process of labeling or classifying each word in written text with its grammatical category or part of speech, i.e. noun, verb, preposition, adjective, etc. It is the most common disambiguation process in the field of Natural Language Processing (NLP). POS tagging systems are often preprocessors in many NLP applications. The Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the letters of the word. An Arabic text is partially vocalised (vocalisation is also referred to as diacritisation or vowelisation) when a diacritical mark is assigned to one or at most two letters in the word. Diacritics in Arabic texts are extremely important, especially at the end of the word. They help not only in determining the correct POS tag for each word in the sentence, but also in providing full information regarding the inflectional features, such as tense, number, gender, etc., of the words in the sentence. They add semantic information to words, which helps with resolving ambiguity in the meaning of words. Furthermore, diacritics ascribe grammatical functions to the words, differentiating the word from other words, and determining the syntactic position of the word in the sentence. This thesis presents a rule-based Part-of-Speech tagging system called AMT, short for Arabic Morphosyntactic Tagger. The main function of the AMT system is to assign the correct tag to each word in an untagged raw partially-vocalised Arabic corpus, and to produce a POS tagged corpus without using a manually tagged or untagged lexicon (dictionary) for training. Two different techniques were used in this work: the pattern-based technique and the lexical and contextual technique. The rules in the pattern-based technique are based on the pattern of the testing word. A novel algorithm, the Pattern-Matching Algorithm (PMA), has been designed and introduced in this work. The aim of this algorithm is to match the testing word with its correct pattern in the pattern lexicon. The lexical and contextual technique, on the other hand, is used to assist the pattern-based technique in assigning the correct tag to those words that do not have a pattern to follow. The rules in the lexical and contextual technique are based on the character(s), the last diacritical mark, the word itself, and the tags of the surrounding words. The importance of utilizing the diacritic feature of the Arabic language to reduce the lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag set and a new partially-vocalised Arabic corpus to test AMT have been compiled and presented in this work. The AMT system has achieved an average accuracy of 91%.
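
Matching a word against a pattern lexicon can be illustrated with a small sketch. It is not the thesis's PMA, whose details the abstract does not give: the `PATTERN_LEXICON` entries, the tag names, and the matching rule are assumptions, and diacritics are ignored. It follows the conventional notation in which the letters ف, ع and ل stand for the root consonants of a pattern and every other letter must match literally.

```python
# Letters that act as root-consonant placeholders in a pattern (assumed convention).
ROOT_SLOTS = set("فعل")

# Hypothetical pattern lexicon: pattern -> coarse POS tag.
PATTERN_LEXICON = {
    "فاعل": "NOUN",    # active participle, e.g. "writer"
    "مفعول": "NOUN",   # passive participle, e.g. "written"
    "يفعل": "VERB",    # imperfect verb, e.g. "he writes"
}


def matches(word, pattern):
    """A word fits a pattern if lengths agree and all non-slot letters are equal."""
    if len(word) != len(pattern):
        return False
    return all(p in ROOT_SLOTS or p == w for w, p in zip(word, pattern))


def tag_by_pattern(word):
    """Return (pattern, tag) for the first matching pattern, or None."""
    for pattern, tag in PATTERN_LEXICON.items():
        if matches(word, pattern):
            return pattern, tag
    return None
```

A word for which `tag_by_pattern` returns `None` is the kind of case the abstract hands over to the lexical and contextual rules. Unvocalised patterns like these can be ambiguous, which is one reason the thesis stresses diacritics.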

    Arabic named entity recognition

    This doctoral thesis describes the research carried out to determine the best techniques for building an Arabic Named Entity Recognizer. Such a system would be able to identify and classify the named entities found in open-domain Arabic text. The Named Entity Recognition (NER) task helps other Natural Language Processing tasks (for example, Information Retrieval, Question Answering, Machine Translation, etc.) achieve better results thanks to the enrichment it adds to the text. The literature contains various works that investigate the NER task for a specific language or from a language-independent perspective. However, very few works studying this task for Arabic have been published so far. Arabic has a special orthography and a complex morphology, and these aspects pose new challenges for NER research. A complete investigation of NER for Arabic would not only provide the techniques needed to achieve high performance, but would also provide an error analysis and a discussion of the results that benefit the NER research community. The main objective of this thesis is to meet that need. To this end, we have: 1. Carried out a study of the different aspects of Arabic related to this task; 2. Analysed the state of the art in NER; 3. Compared the results obtained by different machine learning techniques; 4. Developed a method based on the combination of different classifiers, where each classifier deals with a single class of named entities and uses the feature set and machine learning technique best suited to that class. Our experiments were evaluated on nine test sets. Benajiba, Y. (2009). Arabic named entity recognition [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318 Palanci