17 research outputs found

    Using Arabic Numbers (Singular, Dual, and Plurals) Patterns To Enhance Question Answering System Results

    In information retrieval, answering a user's question directly is difficult: a search engine returns a ranked list of documents that contain the query's keywords or phrases, so the user must still search inside those documents for the answer, and there may be no answer at all. The alternative to a search engine is a question answering system, which returns the exact answer to a natural-language question when one can be found. Such a system accepts a question in natural language and then applies several processing steps to extract the exact answer. In general, a question answering system has three main components: a question classification module, an information retrieval module, and an answer extraction module. Here a question answering system is applied to the Holy Quran, which is written and cited in Arabic, and some characteristics of the Arabic language are used to improve answer extraction; one of the most important is grammatical number (singular, dual, and plural). A prototype was built that uses special patterns to process number in Arabic, which enhances the answers by adding more words and meanings. A corpus of questions and their answers from the Holy Quran was used to test the system.

    Memory-based morphological analysis generation and part-of-speech tagging of Arabic

    Proposals for a normalized representation of Standard Arabic full form lexica

    Standardized lexical resources are an important prerequisite for the development of robust, wide-coverage natural language processing applications. We therefore applied the Lexical Markup Framework (LMF), a recent ISO initiative towards standards for designing, implementing, and representing lexical resources, to a test bed of data for an Arabic full form lexicon. Apart from minor structural accommodations needed to take into account the traditional root-based organization of Arabic dictionaries, the LMF proposal appeared to be suitable for our purpose, especially because it manages the hierarchical data structure (the LMF core model) separately from the elementary linguistic descriptors (data categories).

    Turkish lexicon expansion by using finite state automata

    © 2019 The Authors. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher's website: https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdf
    Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms, so sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon by taking a morphological segmentation system and reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing the phonological operations that are applied to a word whenever a suffix is added. Each FSA state corresponds to either a stem category or a suffix category. Stems are clustered by part of speech (i.e. noun, verb, or adjective), and suffixes are clustered by their allomorphic features. We generated approximately 1 million word forms from only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the out-of-vocabulary size in other NLP applications. Although our experiments were performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
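    The generation idea in this abstract (states for stem and suffix categories, with phonological operations applied when a suffix is attached) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the three states, the two suffixes (plural and locative), and the simplified vowel-harmony and consonant-assimilation rules are not taken from the article, whose automaton is far larger.

```python
# Toy finite-state generator for Turkish noun forms.
# The states, suffixes, and stems are illustrative assumptions,
# not the automaton described in the article.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def last_vowel_is_back(word):
    """Return True if the last vowel of the word is a back vowel."""
    for ch in reversed(word):
        if ch in BACK_VOWELS:
            return True
        if ch in FRONT_VOWELS:
            return False
    raise ValueError(f"no vowel found in {word!r}")


def plural(word):
    """Attach the plural suffix, choosing -lar or -ler by vowel harmony."""
    return word + ("lar" if last_vowel_is_back(word) else "ler")


def locative(word):
    """Attach the locative suffix: -da/-de, or -ta/-te after a voiceless consonant."""
    onset = "t" if word[-1] in VOICELESS else "d"
    vowel = "a" if last_vowel_is_back(word) else "e"
    return word + onset + vowel


# Morphotactics: state -> list of (suffix operation, next state).
TRANSITIONS = {
    "NOUN_STEM": [(plural, "PLURAL"), (locative, "CASE")],
    "PLURAL": [(locative, "CASE")],
    "CASE": [],
}


def generate(form, state="NOUN_STEM"):
    """Return every word form reachable from `state`, starting with `form`."""
    forms = [form]
    for attach, next_state in TRANSITIONS[state]:
        forms.extend(generate(attach(form), next_state))
    return forms


print(generate("ev"))     # ['ev', 'evler', 'evlerde', 'evde']
print(generate("kitap"))  # ['kitap', 'kitaplar', 'kitaplarda', 'kitapta']
```

    Keeping the morphotactics in a transition table and the orthographic changes in the suffix functions mirrors the separation the abstract describes, which is why adding a suffix category in this sketch only requires a new function and a new table entry.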

    A computational model of modern standard arabic verbal morphology based on generation

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Filosofía y Letras, Departamento de Lingüística, Lenguas Modernas, Lógica y Filosofía de la Ciencia y Teoría de la Literatura y Literatura Comparada. Date of defense: 29-01-2013. The computational handling of non-concatenative morphologies is still a challenge in the field of natural language processing. Amongst the various areas of research, Arabic morphology stands out due to its highly complex structure. We propose a model for Arabic verbal morphology based on a root-and-pattern approach that is both computationally consistent and elegantly formalized. Our model defines an abstract representation of prosodic templates and a set of intertwined morphemes that operate at different phonological levels, as well as a separate module of rewrite rules to deal with morphophonological and orthographic alterations. Our verbal system model asserts that Arabic exhibits two conjugational classes. The computational system, named Jabalín, is focused on generation: the program generates a fully annotated lexicon of verbal forms, which is subsequently used to develop a morphological analyzer and generator. The input of the system consists of a lexicon of 15,452 verb lemmas of both Classical Arabic and Modern Standard Arabic, taken from El-Dahdah (1991), comprising a total of 3,706 roots. The output of the system is a lexicon of 1,684,268 inflected verbal forms. We carried out an evaluation against a lexicon of inflected verbs provided by the analyzer ElixirFM (Smrž, 2007a; 2007b), which we considered a gold standard, achieving a precision of 99.52%. Additionally, we compared our lexicon with a list of the most frequent verb lemmas, including the most frequent verbs from each conjugation, taken from Buckwalter and Parkinson (2010). The list includes 825 verbs, all of which are included in our lexicon and passed an evaluation test with 99.27% accuracy.
Jabalín is available under a GNU license, and can be accessed and tested through an online interface at http://elvira.lllf.uam.es/jabalin/, hosted at the LLI-UAM lab. The Jabalín interface provides different functionalities: analyze a form, generate the inflectional paradigm of a verb lemma, derive a root, show quantitative data, and explore the database, which includes data from the evaluation. Key words: Computational Linguistics, Natural Language Processing, Arabic Computational Morphology, Root-and-Pattern Morphology, Non-concatenative Morphology, Templatic Morphology, Root-and-Prosody Morphology, Computational Prosodic Morphology.
Non-concatenative morphological systems remain one of the greatest challenges for natural language processing. Among the various lines of research, the study of Arabic morphology stands out because of the system's great structural complexity. This research project proposes a model of Arabic verbal morphology that is based on a root-and-pattern approach and is formally elegant and computationally consistent. The proposed model rests mainly on an abstract formalization of prosodic templates and their interaction with the morphological material. In parallel, the system has a module of rules that handle the morphophonological and orthographic alterations of Arabic. The model of the verbal system proposes, and is built on, the idea that Arabic has only two conjugational classes. The computational system, called Jabalín, is oriented toward generation: the program generates a lexicon of verbal forms with their associated linguistic information. The lexicon is then used to develop a morphological analyzer and generator.
As input, the system receives a lexicon of 15,452 verb lemmas (taken from El-Dahdah, 1991), which combines vocabulary from both Classical Arabic and Modern Standard Arabic and has a total of 3,706 roots. The output is a lexicon of 1,684,268 inflected verbal forms. An evaluation was carried out against a lexicon of verbal forms extracted from the analyzer ElixirFM (Smrž, 2007a; 2007b), with a precision of 99.52%. The lexicon was also evaluated against a list of the most frequent verbs (including the most frequent lemmas of each conjugation type) taken from Buckwalter and Parkinson (2010). All 825 verbs in the list are included in our lexicon of verb lemmas and show a precision of 99.27%. The Jabalín system, developed under a GNU license, also has a web interface where queries can be made in Arabic, http://elvira.lllf.uam.es/jabalin/, hosted at the LLI-UAM. The interface offers several functions: analyze a form, generate the inflection of a verb lemma, derive a root, show quantitative data, and explore the database, which includes the evaluation data. Key words: Computational Linguistics, Natural Language Processing, Arabic Computational Morphology, root-and-pattern morphology, non-concatenative morphology, templatic morphology, root-and-prosody morphology, computational prosodic morphology.
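    As a rough illustration of the root-and-pattern approach named in this abstract, the sketch below interleaves the consonants of a root with a vocalic template. The function name `interdigitate`, the `C` placeholder notation, and the Latin transliteration are assumptions made for this example; they are not Jabalín's actual representation, and the rewrite rules the thesis uses for morphophonological and orthographic alterations are left out entirely.

```python
def interdigitate(root, template):
    """Fill each 'C' slot of the template with the next root consonant.

    `root` is a sequence of consonants, e.g. ("k", "t", "b");
    `template` is a string such as "CaCaC" in which 'C' marks a consonant slot.
    Both notations are hypothetical conventions for this sketch.
    """
    if template.count("C") != len(root):
        raise ValueError("template slots and root consonants do not match")
    consonants = iter(root)
    # Copy vowels through unchanged; replace each slot with a root consonant.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


root = ("k", "t", "b")  # the root associated with writing
print(interdigitate(root, "CaCaC"))   # katab  (perfective active stem)
print(interdigitate(root, "CuCiC"))   # kutib  (perfective passive stem)
print(interdigitate(root, "CaaCiC"))  # kaatib (active participle)
```

    A generation-oriented system along these lines would apply many templates to every root in the input lexicon and then pass each result through rewrite rules, which is where weak and geminate roots would be handled; this sketch only covers the regular case.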

    Blend formation tendencies, from English to Arabic : a comparative study

    PhD Thesis. Blending in English is a widely recognized means for forming new lexemes by joining two or more existing words in a way where at least one of them is shortened (Algeo 1991: 10). Familiar examples are brunch from breakfast and lunch, slanguage from slang and language, and chortle from chuckle and snort (Algeo 1977: 49). Linguistic studies of English blends, which are numerous, have focused in particular on the following three features of blends: the cut-off point in the source words, the proportional contributions from the source words to the blend, and the stress pattern of the blend. The main aim of the present research is to examine Arabic blends in the light of the blend formation tendencies that have been identified with respect to these features in English. Blends in Classical Arabic are generally formed by joining the first two root consonants of each source word and imposing the prosodic pattern CaCCaC on them. Typical examples of Classical Arabic blends are /ʕabdar(ij)/ 'someone from the family of Abdul Dār' < /ʕabd/ 'slave' and /da:r/ 'house', /ʕabqas(ij)/ 'someone from the family of Abdul Qays' < /ʕabd/ 'slave' and /qajs/ 'a male name', and /ʕabʃam(ij)/ 'someone from the family of Abdi Shams' < /ʕabd/ 'slave' and /ʃams/ 'sun', all names for Arab tribes in the 6th century AD. However, such Classical blends are few in number. The more numerous blends that have been formed in Arabic in recent times do not appear to follow this root-and-pattern template. Examples are /fawsʕawt(ij)/ 'supersonic' < /fawq/ 'above' and /sʕawt(ij)/ 'sound', and /qabħarb/ 'pre-war' < /qabl/ 'before' and /ħarb/ 'war'. Since no linguistic study has investigated in depth the structure of modern Arabic blends, the main aim of this thesis is to uncover the regularities that are found in these modern formations and in that way contribute to understanding the structure of Arabic words in general and blends in particular.
The main research question in this study is: To what extent do the blend formation tendencies identified in English apply to blend formation in Arabic? The data for Arabic come from published resources as well as a survey and an experiment, both designed to collect novel blends by asking native speakers of Arabic to form blends from a list of word pairs. These data were examined in light of the main features and tendencies related to blend formation in English. The overall result of the investigation is that there is a high degree of resemblance between modern Arabic blends and English blends. This is the case for both the established Modern Arabic blends and the novel invented blends. In this respect, they differ notably from the established blends of Classical Arabic. The main tendencies for forming Arabic blends that have been identified in this study are: (1) There is a general tendency for the cut-off points in source words to occur at syllabic joints, with the greatest preference for them to occur between syllabic constituents. (2) There is a general tendency for the greater proportional contribution to come from the shorter source word, and for source words of equal phonemic length to contribute equal proportions to the blend. (3) There is a general tendency for the stress pattern of the blend to be identical to that of the source word that has the same syllabic size as the blend. The Directorate of Scholarships and Cultural Relations of the Iraqi Ministry of Higher Education and Scientific Research.
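    The Classical Arabic pattern described in this abstract, two consonants from each source word placed into CaCCaC, can be written out as a small sketch. The function name `classical_blend` is hypothetical, and the caller supplies the two consonants taken from each word: the abstract does not spell out how they are selected when a source word contains a long vowel or glide (as in /da:r/ and /qajs/), so this sketch does not attempt that step.

```python
def classical_blend(first_pair, second_pair):
    """Place four consonants into the prosodic pattern CaCCaC.

    Each argument is a pair of consonants contributed by one source word.
    Choosing those consonants is left to the caller (an assumption of this
    sketch), because the selection rule is not fully stated in the abstract.
    """
    (c1, c2), (c3, c4) = first_pair, second_pair
    return f"{c1}a{c2}{c3}a{c4}"


# /ʕabd/ 'slave' + /ʃams/ 'sun' -> /ʕabʃam/
print(classical_blend(("ʕ", "b"), ("ʃ", "m")))  # ʕabʃam
# /ʕabd/ 'slave' + /da:r/ 'house' -> /ʕabdar/
print(classical_blend(("ʕ", "b"), ("d", "r")))  # ʕabdar
```

    The modern blends cited in the abstract, such as /qabħarb/, do not fit this fixed template, which is the contrast the thesis investigates.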

    First International Workshop on Lexical Resources

    International audience. Lexical resources are one of the main sources of linguistic information for research and applications in Natural Language Processing and related fields. In recent years, advances have been achieved both in symbolic aspects of lexical resource development (lexical formalisms, rule-based tools) and in statistical techniques for the acquisition and enrichment of lexical resources, both monolingual and multilingual. The latter have allowed for faster development of large-scale morphological, syntactic and/or semantic resources, for widely used as well as resource-scarce languages. Moreover, the notion of a dynamic lexicon is increasingly used to take into account the fact that the lexicon undergoes permanent evolution. This workshop aims at sketching a broad picture of the state of the art in the domain of lexical resource modeling and development. It is also dedicated to research on the application of lexical resources for improving corpus-based studies and language processing tools, both in NLP and in other language-related fields, such as linguistics, translation studies, and didactics.