DaCToR: A data collection tool for the RELATER project
Collecting domain-specific data for under-resourced languages, e.g., dialects of languages, can be very expensive, potentially financially prohibitive, and time-consuming. Moreover, in the case of rarely written languages, normalizing non-canonical transcriptions may be another time-consuming but necessary task. To collect domain-specific data in such circumstances in a time- and cost-efficient way, collecting read data from pre-prepared texts is often a viable option. To collect data in the domain of psychiatric diagnosis in Arabic dialects for the RELATER project, we have prepared the data collection tool DaCToR, which collects texts read by speakers in the countries and districts in which the dialects are spoken. In this paper we describe our tool, its purpose within the RELATER project, and the dialects we have started to collect with it.
The Effects of Factorizing Root and Pattern Mapping in Bidirectional Tunisian - Standard Arabic Machine Translation
The development of natural language processing tools for dialects faces a severe lack of resources. In cases of diglossia, as in Arabic, one variant, Modern Standard Arabic (MSA), has many resources that can be used to build natural language processing tools, whereas the other variants, the Arabic dialects, are resource-poor. Taking advantage of the closeness of MSA and its dialects, one way to address the problem of limited resources is to translate the dialect into MSA so that the tools developed for MSA can be used. In this paper we describe an architecture for such a translation and evaluate it on Tunisian Arabic verbs. Our approach relies on modeling the translation process over the deep morphological representations of roots and patterns commonly used to model Semitic morphology. We compare different techniques for performing the cross-lingual mapping. Our evaluation demonstrates that a decent-coverage root+pattern lexicon of Tunisian and MSA, combined with a backoff that assumes the mappings of roots and patterns are independent, is optimal in reducing overall ambiguity and increasing recall.
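The idea of a joint root+pattern lexicon with an independence backoff can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the symbols `r1`, `p1`, `R1`, `P1`, the three dictionaries, and the function `map_verb` are hypothetical placeholders and are not taken from the paper's architecture or data.

```python
from itertools import product

# Hypothetical toy lexicons. Lowercase symbols stand for Tunisian roots and
# patterns, uppercase symbols for MSA ones. None of this is real data.
JOINT_LEXICON = {
    ("r1", "p1"): {("R1", "P1")},  # a (root, pattern) pair attested as a unit
}
ROOT_MAP = {"r1": {"R1"}, "r2": {"R2", "R3"}}
PATTERN_MAP = {"p1": {"P1"}, "p2": {"P2"}}


def map_verb(root, pattern):
    """Return candidate MSA (root, pattern) pairs for a Tunisian pair.

    The joint lexicon is consulted first. If the pair is not covered, the
    function backs off to mapping root and pattern independently and
    returns every combination of the two candidate sets.
    """
    joint = JOINT_LEXICON.get((root, pattern))
    if joint:
        return set(joint)
    roots = ROOT_MAP.get(root, set())
    patterns = PATTERN_MAP.get(pattern, set())
    return set(product(roots, patterns))


print(map_verb("r1", "p1"))  # covered by the joint lexicon
print(map_verb("r2", "p2"))  # backoff: independent root and pattern mapping
```

In this toy setup the joint lexicon keeps ambiguity low for covered pairs, while the backoff still returns candidates for unseen combinations, at the cost of a larger candidate set.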
TArC: Incrementally and semi-automatically collecting a Tunisian arabish corpus
This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media to facilitate writing in the informal contexts of Computer-Mediated Communication (CMC) and text messaging. The realization of Arabish varies amongst dialects, and each Arabish code-system is under-resourced, in the same way as most of the Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has increased considerably. With this in mind, TArC will be a useful support for different types of analyses, computational and linguistic, as well as for training NLP tools. In this article we describe preliminary work on the TArC semi-automatic construction process and some of the first analyses we developed on TArC. In addition, to provide a complete overview of the challenges faced during the building process, we present the main Tunisian dialect characteristics and their encoding in Tunisian Arabish.
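To make the notion of arithmographs concrete, here is a minimal sketch that rewrites digits used as letters into phonetic symbols. The three-entry mapping (2 for the glottal stop, 3 for the voiced pharyngeal fricative, 7 for the voiceless pharyngeal fricative) reflects commonly cited Arabizi conventions and is an assumption here, not the encoding scheme used in TArC; as the abstract notes, conventions vary amongst dialects. The function name `expand_arithmographs` is hypothetical.

```python
# Assumed, commonly cited digit-to-sound conventions, given as IPA symbols.
# This is NOT the TArC encoding and is deliberately incomplete.
ARITHMOGRAPHS = {"2": "ʔ", "3": "ʕ", "7": "ħ"}


def expand_arithmographs(token):
    """Replace digits used as letters with phonetic symbols.

    A token is treated as containing arithmographs only if it mixes letters
    and digits, so plain numbers such as "2020" are left unchanged.
    """
    has_letter = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    if not (has_letter and has_digit):
        return token
    return "".join(ARITHMOGRAPHS.get(ch, ch) for ch in token)


for tok in ["3arbi", "2020", "salam"]:
    print(tok, "->", expand_arithmographs(tok))
```

The letter-plus-digit test is a crude heuristic: it would also rewrite tokens such as product codes, which is one reason real pipelines need more context than a single token.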
TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus
This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media to facilitate writing in the informal contexts of Computer-Mediated Communication (CMC) and text messaging. The realization of Arabish varies amongst dialects, and each Arabish code-system is under-resourced, in the same way as most of the Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has increased considerably. With this in mind, TArC will be a useful support for different types of analyses, computational and linguistic, as well as for training NLP tools. In this article we describe preliminary work on the TArC semi-automatic construction process and some of the first analyses we developed on TArC. In addition, to provide a complete overview of the challenges faced during the building process, we present the main Tunisian dialect characteristics and their encoding in Tunisian Arabish.
Comment: Paper accepted at the Language Resources and Evaluation Conference (LREC) 202
A web-based interface to calculate phonological neighborhood density for words and nonwords in Modern Standard Arabic
The availability of online databases (e.g., Balota et al., 2007) and calculators (e.g., Storkel & Hoover, 2010) has contributed to an increase in psycholinguistic-related research, to the development of evidence-based treatments in clinical settings, and to scientifically supported training programs in the language classroom. The benefit of online language resources is limited by the fact that the majority of such resources provide information only for the English language (Vitevitch, Chan & Goldstein, 2014). To address the lack of diversity in these resources for languages that differ phonologically and morphologically from English, the present article describes an online database to compute phonological neighborhood density (i.e., the number of words that sound similar to a given word) for words and nonwords in Modern Standard Arabic (MSA). A full description of how the calculator can be used is provided. It can be freely accessed at https://calculator.ku.edu/density/about
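As an illustration of the quantity being computed, the sketch below counts phonological neighbors in a toy lexicon. It assumes the widely used operationalization in which two words are neighbors if they differ by exactly one phoneme substitution, insertion, or deletion; the abstract only says "sound similar", so this criterion, the function names, and the toy lexicon are all assumptions and do not describe the calculator's actual implementation or data.

```python
def is_neighbor(a, b):
    """True if phoneme tuples a and b differ by exactly one substitution,
    insertion, or deletion (assumed neighbor criterion)."""
    if a == b:
        return False
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # Lengths differ by one: deleting one phoneme from the longer form
    # must yield the shorter form.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def neighborhood_density(target, lexicon):
    """Count the lexicon entries that are neighbors of target.

    The target may be a nonword; if it is itself in the lexicon it is not
    counted as its own neighbor.
    """
    return sum(is_neighbor(target, word) for word in lexicon)


# Toy lexicon of made-up phoneme sequences, for illustration only.
LEXICON = [
    ("k", "a", "t"),
    ("b", "a", "t"),
    ("k", "a", "t", "s"),
    ("a", "t"),
    ("d", "o", "g"),
]

print(neighborhood_density(("k", "a", "t"), LEXICON))  # word in the lexicon
print(neighborhood_density(("g", "a", "t"), LEXICON))  # nonword
```

Words are represented as tuples of phoneme symbols rather than strings so that multi-character phoneme labels count as single units.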
Sentiment Analysis for the Low-Resourced Latinised Arabic "Arabizi"
The expansion of digital communication media from private mobile messaging into the public sphere through social media has given data science research and industry an opportunity to mine the resulting big data for automated information extraction. A popular information extraction task is sentiment analysis, which aims to extract opinion polarity (positive, negative, or neutral) from written natural language. Sentiment analysis has helped organisations better understand the public's opinion of events, news, public figures, and products.
However, sentiment analysis has advanced for the English language ahead of Arabic. While sentiment analysis for Arabic is developing in the literature of Natural Language Processing (NLP), a popular variety of Arabic, Arabizi, has been overlooked for sentiment analysis advancements.
Arabizi is an informal transcription of the spoken dialectal Arabic in Latin script used for social texting. It is known to be common among the Arab youth, yet it is overlooked in efforts on Arabic sentiment analysis for its linguistic complexities.
Like Arabic, Arabizi is rich in inflectional morphology, but it is also code-switched with English or French and transcribed without adhering to a standard orthography. The rich morphology, inconsistent orthography, and code-switching compound one another and multiply the lexical sparsity of the language: each Arabizi word can be spelled in many ways, and other languages are mixed into the same textual context. The resulting high degree of lexical sparsity defies the very basics of sentiment analysis, the classification of positive and negative words. Arabizi also faces a severe shortage of the data resources required for any sentiment analysis approach.
In this thesis, we address this gap by conducting research on sentiment analysis for Arabizi. We addressed the sparsity challenge by harvesting Arabizi data from multilingual social media text, using deep learning, to build Arabizi resources for sentiment analysis. We developed six new morphologically and orthographically rich Arabizi sentiment lexicons and set the baseline for Arabizi sentiment analysis on social media.
Phonological Phenomena of Tunisian Arabic
This bachelor thesis deals with the phonology of Tunisian Arabic and aims to introduce and describe selected phonological phenomena of the most important Tunisian dialect, the dialect of the capital Tunis and its surroundings. The work is based on secondary literature. The introduction first presents the chosen transcription and briefly mentions historical influences on the Tunisian dialect and its sociolinguistic status in the context of Arabic dialectology. The main part of the thesis focuses on phonological characteristics such as syllable structure, stress assignment, vowel phonemes, consonant phonemes, diphthongs, changes in vowel length, consonant gemination, and consonant assimilation. As a survey, the thesis tries to cover as wide a spectrum of phenomena as possible and treats some of them in more detail. The topics given the most space are metathesis and vowel reduction. Finally, some morphonological phenomena related to the phonological phenomena are considered. All important phenomena are illustrated with real language examples. In some cases, a comparison with Modern Standard Arabic is also made to illustrate the differences. So far, very few publications have been devoted to Tunisian Arabic and its phonology. The aim of this work is to present the phonology of Tunisian Arabic through selected phonological phenomena.
Department of Middle Eastern Studies (Katedra Blízkého východu), Faculty of Arts (Filozofická fakulta)
Multi-Task sequence prediction for Tunisian Arabizi multi-level annotation
In this paper we propose a multi-task sequence prediction system, based on recurrent neural networks, and use it to annotate a Tunisian Arabizi corpus on multiple levels. The annotations performed are text classification, tokenization, PoS tagging, and encoding of Tunisian Arabizi into CODA* Arabic orthography. The system is trained to predict all the annotation levels in cascade, starting from the Arabizi input. To show the effectiveness of our neural architecture, we evaluate the system on the TIGER German corpus, suitably converting the data into a multi-task problem. We also show how we used the system to annotate a Tunisian Arabizi corpus, which was afterwards manually corrected and used to further evaluate sequence models on Tunisian data. Our system is developed for the Fairseq framework, which allows fast and easy use for any other sequence prediction problem.
Creating Parallel Arabic Dialect Corpus: Pitfalls to Avoid
Creating parallel corpora is a difficult problem that many researchers try to address. In the context of under-resourced languages such as Arabic dialects, the problem is more complicated because of the nature of these spoken languages. In this paper, we share our experience of creating a parallel corpus that contains several dialects and Modern Standard Arabic (MSA). We highlight the most important choices we made and how well these choices worked.
- …