308 research outputs found

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low cost and time efficient, as well as scalable and extensible. Its extensibility is reflected in the method's ability to be incremental in both processing resources and generating lexicons. The method starts from a corpus. First, tokens are drawn from the corpus and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources have high coverage: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized, and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.

    Sociolinguistically-aware computational models of Mandarin-English codeswitching

    Current research on computational modeling of codeswitching has focused on the use of syntactic constraints as model predictors (Li & Fung 2014; Li & Vu 2019). However, proposed syntactic constraints (Poplack 1978; Poplack 1980; Myers-Scotton 1993; Belazi et al. 1994) are largely based around Spanish-English codeswitching, and are violated repeatedly (and potentially systematically) by codeswitching involving other languages. Thus, a computational model trained on these syntactic constraints, when applied to codeswitching involving language pairs other than Spanish-English, may not capture the naturalistic patterns of those languages in codeswitching contexts. This paper demonstrates the value of sociolinguistic factors as predictors in training a Classification and Regression Tree (CART) model on novel Mandarin-English codeswitch data, which come from 12 bilingual speakers of two different generations from Grand Rapids, Michigan. Participants also answered metalinguistic questions about their own language practices and attitudes and completed a written Language History Questionnaire (LHQ) (Li et al. 2020), which asked for self-evaluations of language habits (proficiency, immersion, and dominance in the two languages). LHQ responses were then quantified into numerical scores serving as sociolinguistic predictors in the CART model. The model, which highlighted that age, L2 Dominance, and L1 Immersion were among the top predictors, achieved an accuracy of 0.804 with the area under its ROC curve being 0.692. This is comparable to, if not more powerful than, previous computational studies (e.g. Li & Fung 2014) that trained models using only proposed syntactic constraints as predictors. This paper shows the importance of sociolinguistic factors in computational research previously focused on syntactic constraints; the intersection of these methodologies could improve a cross-linguistic and computational understanding of codeswitching patterns.

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies