172 research outputs found

    Applying dynamic Bayesian networks in transliteration detection and generation

    Get PDF
    Peter Nabende promoveert op methoden die programma’s voor automatisch vertalen kunnen verbeteren. Hij onderzocht twee systemen voor het genereren en vergelijken van transcripties: een DBN-model (Dynamische Bayesiaanse Netwerken) waarin Pair Hidden Markovmodellen zijn geïmplementeerd en een DBN-model dat op transductie is gebaseerd. Nabende onderzocht het effect van verschillende DBN-parameters op de kwaliteit van de geproduceerde transcripties. Voor de evaluatie van de DBN-modellen gebruikte hij standaard dataverzamelingen van elf taalparen: Engels-Arabisch, Engels-Bengaals, Engels-Chinees, Engels-Duits, Engels-Frans, Engels-Hindi, Engels-Kannada, Engels-Nederlands, Engels-Russisch, Engels-Tamil en Engels-Thai. Tijdens het onderzoek probeerde hij om verschillende modellen te combineren. Dat bleek een goed resultaat op te leveren

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    Get PDF
    International audienceThis work presents a method that enables Arabic NLP community to build scalable lexical resources. The proposed method is low cost and efficient in time in addition to its scalability and extendibility. The latter is reflected in the ability for the method to be incremental in both aspects, processing resources and generating lexicons. Using a corpus; firstly, tokens are drawn from the corpus and lemmatized. Secondly, finite state transducers (FSTs) are generated semi-automatically. Finally, FSTsare used to produce all possible inflected verb forms with their full morphological features. Among the algorithm’s strength is its ability to generate transducers having 184 transitions, which is very cumbersome, if manually designed. The second strength is a new inflection scheme of Arabic verbs; this increases the efficiency of FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources coverage is high. Our resources cover more than 70% Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially and not vocalized verbal inflected forms. All these resources are being made public and currently used as an open package in the Unitex framework available under the LGPL license

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems

    Graphemic Normalization of the Perso-Arabic Script

    Full text link
    Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions. This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community. We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies, insufficient literacy, and loss or lack of orthographic tradition. We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques especially for languages with a paucity of resources.Comment: Pre-print to appear in the Proceedings of Grapholinguistics in the 21st Century (G21C), 2022. Telecom Paris, Palaiseau, France, June 8-10, 2022. 41 pages, 38 tables, 3 figure
    • …
    corecore