308 research outputs found

A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

Author: Abdelali Ahmed
Doumi Noureddine
Lehireche Ahmed
Maurel Denis
Publication venue: IJIT
Publication date: 01/02/2016

This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost and time-efficient, as well as scalable and extensible: it is incremental both in processing resources and in generating lexicons. Starting from a corpus, tokens are first extracted and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources have high coverage: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially, and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
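The generation step described in this abstract (walking a transducer to emit every inflected form together with its morphological features) can be illustrated with a toy sketch. This is not the authors' algorithm and does not use the Unitex graph format; the function names, the tag strings, and the transliterated stem `katab` are assumptions chosen for illustration, and only five past-tense suffixes are modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: state -> list of (feature_tag, output, next_state).
Transitions = Dict[str, List[Tuple[str, str, str]]]


def build_past_tense_fst(stem: str) -> Transitions:
    """Build a tiny transducer for a few past-tense forms of one stem.

    The stem is emitted first, then exactly one suffix transition is taken.
    """
    return {
        "start": [("", stem, "stem")],
        "stem": [
            ("+Past+3+Masc+Sg", "a", "final"),
            ("+Past+3+Fem+Sg", "at", "final"),
            ("+Past+1+Sg", "tu", "final"),
            ("+Past+2+Masc+Sg", "ta", "final"),
            ("+Past+2+Fem+Sg", "ti", "final"),
        ],
        "final": [],
    }


def generate(fst: Transitions, start: str = "start",
             final: str = "final") -> List[Tuple[str, str]]:
    """Enumerate every path from start to final as (surface_form, tags)."""
    results = []
    stack = [(start, "", "")]  # (state, accumulated tags, accumulated surface)
    while stack:
        state, tags, surface = stack.pop()
        if state == final:
            results.append((surface, tags))
            continue
        for tag, output, next_state in fst.get(state, []):
            stack.append((next_state, tags + tag, surface + output))
    return sorted(results)


if __name__ == "__main__":
    for surface, tags in generate(build_past_tense_fst("katab")):
        print(surface, tags)
```

A real resource of the size reported above would need many more transitions per transducer (vocalization variants, number, mood, and so on); the sketch only shows the path-enumeration idea.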

Directory of Open Access Journals

Hal-Diderot

Sociolinguistically-aware computational models of Mandarin-English codeswitching

Author: Yi Irene
Publication venue: 'Linguistic Society of America'
Publication date: 05/05/2022

Current research on computational modeling of codeswitching has focused on the use of syntactic constraints as model predictors (Li & Fung 2014; Li & Vu 2019). However, proposed syntactic constraints (Poplack 1978; Poplack 1980; Myers-Scotton 1993; Belazi et al. 1994) are largely based on Spanish-English codeswitching, and are violated repeatedly (and potentially systematically) by codeswitching involving other languages. Thus, a computational model trained on these syntactic constraints, when applied to codeswitching between language pairs other than Spanish-English, may not capture the naturalistic patterns of those languages in codeswitching contexts. This paper demonstrates the value of sociolinguistic factors as predictors in training a Classification and Regression Tree (CART) model on novel Mandarin-English codeswitching data, which come from 12 bilingual speakers of two different generations from Grand Rapids, Michigan. Participants also answered metalinguistic questions about their own language practices and attitudes and completed a written Language History Questionnaire (LHQ) (Li et al. 2020), which asked for self-evaluations of language habits (proficiency, immersion, and dominance in the two languages). LHQ responses were then quantified into numerical scores serving as sociolinguistic predictors in the CART model. The model, which highlighted that age, L2 Dominance, and L1 Immersion were among the top predictors, achieved an accuracy of 0.804, with an area under the ROC curve of 0.692. This is comparable to, if not more powerful than, previous computational studies (e.g. Li & Fung 2014) that trained models using only proposed syntactic constraints as predictors. This paper shows the importance of sociolinguistic factors in computational research previously focused on syntactic constraints; the intersection of these methodologies could improve a cross-linguistic and computational understanding of codeswitching patterns.
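The abstract reports two different evaluation metrics, accuracy and area under the ROC curve, which need not agree. As a minimal sketch of how each is computed, the following uses made-up labels and scores (they are not data from the paper), and the function names are assumptions for illustration. AUC is computed here with the pairwise-ranking formulation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Illustrative toy data only: 1 = switch, 0 = no switch (hypothetical coding).
    y_true = [0, 0, 0, 1, 1]
    scores = [0.10, 0.40, 0.35, 0.80, 0.30]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
    print("accuracy:", accuracy(y_true, y_pred))
    print("AUC:", roc_auc(y_true, scores))
```

On imbalanced data, accuracy can be high simply because the majority class is predicted well, while AUC reflects how well the scores rank positives above negatives; this is one reason the two figures can diverge.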

Proceedings Published by the LSA (Linguistic Society of America)

A Semi-Automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

Author: Ahmed Abdelali
Ahmed Lehireche
Denis Maurel
Noureddine Doumi
Publication venue: 'MECS Publisher'
Publication date: 01/01/2016

Typologically robust statistical machine translation: Understanding and exploiting differences and similarities between languages in machine translation

Author: Daiber J.
Publication date: 01/01/2018

Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

Author: Peter Spyns
Jan Odijk
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2020
Field of study

Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologie