Applying dynamic Bayesian networks in transliteration detection and generation
Peter Nabende's doctoral research concerns methods that can improve machine translation programs. He investigated two systems for generating and comparing transliterations: a DBN (dynamic Bayesian network) model in which pair hidden Markov models are implemented, and a DBN model based on transduction. Nabende examined the effect of various DBN parameters on the quality of the transliterations produced. To evaluate the DBN models he used standard datasets for eleven language pairs: English-Arabic, English-Bengali, English-Chinese, English-German, English-French, English-Hindi, English-Kannada, English-Dutch, English-Russian, English-Tamil, and English-Thai. During the research he also tried combining different models, which proved to give good results
A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs
This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost, time-efficient, scalable, and extensible; it is incremental both in processing resources and in generating lexicons. Given a corpus, tokens are first drawn from the corpus and lemmatized. Second, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially, and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license
Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the World Wide Web and digital libraries; and (iv) evaluation of NLP systems
Graphemic Normalization of the Perso-Arabic Script
Since its original appearance in 1991, the Perso-Arabic script representation
in Unicode has grown from 169 to over 440 atomic isolated characters spread
over several code pages representing standard letters, various diacritics and
punctuation for the original Arabic and numerous other regional orthographic
traditions. This paper documents the challenges that Perso-Arabic presents
beyond the best-documented languages, such as Arabic and Persian, building on
earlier work by the expert community. We particularly focus on the situation in
natural language processing (NLP), which is affected by multiple, often
neglected, issues such as the use of visually ambiguous yet canonically
nonequivalent letters and the mixing of letters from different orthographies.
Among the contributing conflating factors are the lack of input methods, the
instability of modern orthographies, insufficient literacy, and loss or lack of
orthographic tradition. We evaluate the effects of script normalization on
eight languages from diverse language families in the Perso-Arabic script
diaspora on machine translation and statistical language modeling tasks. Our
results indicate statistically significant improvements in performance in most
conditions for all the languages considered when normalization is applied. We
argue that better understanding and representation of Perso-Arabic script
variation within regional orthographic traditions, where those are present, is
crucial for further progress of modern computational NLP techniques especially
for languages with a paucity of resources.
Comment: Pre-print to appear in the Proceedings of Grapholinguistics in the
21st Century (G21C), 2022. Telecom Paris, Palaiseau, France, June 8-10, 2022.
41 pages, 38 tables, 3 figures
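
The abstract above mentions visually ambiguous yet canonically nonequivalent letters. A minimal Python sketch of what that can mean in practice is given here. It is not the paper's method: the mapping table `PERSIAN_FOLD` and the function `normalize_persian` are hypothetical names introduced for illustration, and the two letter pairs are just well-known examples (Arabic Kaf versus Keheh, Arabic Yeh versus Farsi Yeh). The point is that standard Unicode NFC normalization leaves such pairs distinct, so a language-specific folding step is needed on top of it.

```python
import unicodedata

# Hypothetical, deliberately tiny folding table for Persian text.
# A real normalizer would need a much larger, language-specific table.
PERSIAN_FOLD = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}


def normalize_persian(text: str) -> str:
    """Apply NFC, then fold look-alike letters (illustrative only)."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(str.maketrans(PERSIAN_FOLD))


# The same word typed with Keheh and with Arabic Kaf.
with_keheh = "\u06A9\u062A\u0627\u0628"
with_kaf = "\u0643\u062A\u0627\u0628"

# NFC alone does not unify the two spellings; the folding step does.
nfc_equal = (unicodedata.normalize("NFC", with_keheh)
             == unicodedata.normalize("NFC", with_kaf))
folded_equal = normalize_persian(with_keheh) == normalize_persian(with_kaf)
print(nfc_equal, folded_equal)
```

Which direction to fold in depends on the target language's orthography; the choice of Persian forms here is an assumption made only for the example.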