160 research outputs found

    DaCToR: A data collection tool for the RELATER project

    Collecting domain-specific data for under-resourced languages, e.g., dialects of languages, can be very expensive, potentially financially prohibitive, and time-consuming. Moreover, in the case of rarely written languages, normalizing non-canonical transcriptions can be another time-consuming but necessary task. To collect domain-specific data in such circumstances in a time- and cost-efficient way, collecting data read from pre-prepared texts is often a viable option. To collect data in the domain of psychiatric diagnosis in Arabic dialects for the RELATER project, we have prepared the data collection tool DaCToR, which collects texts read by speakers in the countries and districts where the dialects are spoken. In this paper we describe our tool, its purpose within the RELATER project, and the dialects we have started to collect with it.

    The Effects of Factorizing Root and Pattern Mapping in Bidirectional Tunisian - Standard Arabic Machine Translation

    The development of natural language processing tools for dialects faces a severe lack of resources. In cases of diglossia, as in Arabic, one variant, Modern Standard Arabic (MSA), has many resources that can be used to build natural language processing tools, whereas the other variants, the Arabic dialects, are resource-poor. Taking advantage of the closeness of MSA and its dialects, one way to address the problem of limited resources is to translate the dialect into MSA so that the tools developed for MSA can be used. In this paper we describe an architecture for such a translation and evaluate it on Tunisian Arabic verbs. Our approach models the translation process over the deep morphological representations of roots and patterns commonly used to model Semitic morphology. We compare different techniques for performing the cross-lingual mapping. Our evaluation demonstrates that using a root+pattern lexicon of Tunisian and MSA with decent coverage, together with a backoff that assumes the mappings of roots and patterns are independent, is optimal in reducing overall ambiguity and increasing recall.
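    The lexicon-with-backoff idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `map_root_pattern`, the three dictionaries, and all of their entries are hypothetical placeholders chosen for illustration. The sketch first looks up the source (root, pattern) pair in a joint lexicon; only if that fails does it back off to mapping the root and the pattern separately and combining the results, which is what assuming independence amounts to.

```python
from typing import Dict, List, Tuple

RootPattern = Tuple[str, str]

# Hypothetical toy lexicons. The entries are placeholders, not data from the paper.
JOINT_LEXICON: Dict[RootPattern, List[RootPattern]] = {
    ("r1", "p1"): [("R1", "P1")],
}
ROOT_MAP: Dict[str, List[str]] = {"r1": ["R1"], "r2": ["R2a", "R2b"]}
PATTERN_MAP: Dict[str, List[str]] = {"p1": ["P1"], "p2": ["P2"]}


def map_root_pattern(root: str, pattern: str) -> List[RootPattern]:
    """Return candidate target (root, pattern) pairs for a source pair."""
    # Step 1: prefer the joint lexicon, which records whole (root, pattern) pairs.
    joint = JOINT_LEXICON.get((root, pattern))
    if joint:
        return list(joint)
    # Step 2: back off. Treat the root mapping and the pattern mapping as
    # independent and combine every mapped root with every mapped pattern.
    roots = ROOT_MAP.get(root, [])
    patterns = PATTERN_MAP.get(pattern, [])
    return [(r, p) for r in roots for p in patterns]


if __name__ == "__main__":
    print(map_root_pattern("r1", "p1"))  # found in the joint lexicon
    print(map_root_pattern("r2", "p2"))  # produced by the independence backoff
```

    In this toy setup the backoff can return several candidates for one input, which shows the trade-off the abstract mentions: backing off raises recall for pairs missing from the joint lexicon, while the joint lexicon keeps ambiguity low for the pairs it covers.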

    TArC: Incrementally and semi-automatically collecting a Tunisian arabish corpus

    This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media to facilitate writing in the informal settings of computer-mediated communication (CMC) and text messaging. The realization of Arabish varies among dialects, and each Arabish code-system is under-resourced, as are most of the Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has increased considerably. With this in mind, TArC will be a useful resource for different types of analyses, computational and linguistic, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses we carried out on it. In addition, to give a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and their encoding in Tunisian Arabish.

    TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus

    This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media to facilitate writing in the informal settings of computer-mediated communication (CMC) and text messaging. The realization of Arabish varies among dialects, and each Arabish code-system is under-resourced, as are most of the Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has increased considerably. With this in mind, TArC will be a useful resource for different types of analyses, computational and linguistic, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses we carried out on it. In addition, to give a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and their encoding in Tunisian Arabish.
    Comment: Paper accepted at the Language Resources and Evaluation Conference (LREC) 202

    A web-based interface to calculate phonological neighborhood density for words and nonwords in Modern Standard Arabic

    The availability of online databases (e.g., Balota et al., 2007) and calculators (e.g., Storkel & Hoover, 2010) has contributed to an increase in psycholinguistic-related research, to the development of evidence-based treatments in clinical settings, and to scientifically supported training programs in the language classroom. The benefit of online language resources is limited by the fact that the majority of such resources provide information only for the English language (Vitevitch, Chan & Goldstein, 2014). To address the lack of diversity in these resources for languages that differ phonologically and morphologically from English, the present article describes an online database to compute phonological neighborhood density (i.e., the number of words that sound similar to a given word) for words and nonwords in Modern Standard Arabic (MSA). A full description of how the calculator can be used is provided. It can be freely accessed at https://calculator.ku.edu/density/about
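    To make the measure concrete, here is a minimal Python sketch of a neighborhood density count. It assumes the common operationalization in which two words are neighbors when they differ by exactly one phoneme substitution, insertion, or deletion; the abstract itself only says "words that sound similar", so this criterion, the function names, and the toy lexicon are assumptions and do not describe how the online calculator is implemented.

```python
from typing import Iterable, Tuple

Phonemes = Tuple[str, ...]


def is_neighbor(a: Phonemes, b: Phonemes) -> bool:
    """True if a and b differ by exactly one phoneme edit (assumed criterion)."""
    if a == b:
        return False  # a word is not its own neighbor
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one substituted phoneme.
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: deleting one phoneme from the longer
    # sequence must yield the shorter one.
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood_density(target: Phonemes, lexicon: Iterable[Phonemes]) -> int:
    """Count lexicon entries that are neighbors of the target (word or nonword)."""
    return sum(is_neighbor(target, word) for word in lexicon)


# Hypothetical toy lexicon of placeholder phoneme sequences, not real MSA data.
LEXICON = [
    ("b", "a", "t"),
    ("k", "a", "t"),       # one substitution away from b-a-t
    ("b", "a", "t", "a"),  # one insertion away from b-a-t
    ("a", "t"),            # one deletion away from b-a-t
    ("k", "u", "b"),       # not a neighbor of b-a-t
]

if __name__ == "__main__":
    print(neighborhood_density(("b", "a", "t"), LEXICON))
```

    Because the target is passed in separately from the lexicon, the same function covers nonwords: a sequence that is absent from the lexicon simply has its neighbors counted among the listed words.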

    Phonological Phenomena of Tunisian Arabic

    This bachelor thesis deals with the phonology of Tunisian Arabic and aims to introduce and describe selected phonological phenomena of the most important Tunisian dialect, that of the capital Tunis and its surroundings. The work is based on secondary literature. The introduction first presents the chosen transcription and briefly mentions historical influences on the Tunisian dialect and its sociolinguistic status in the context of Arabic dialectology. The main part of the thesis focuses on phonological characteristics such as syllable structure, stress assignment, vowel phonemes, consonant phonemes, diphthongs, vowel length changes, consonant gemination and consonant assimilation. As a survey, the thesis tries to cover as wide a spectrum of phenomena as possible and treats some of them in more detail. The topics given the most space are metathesis and vowel reduction. Finally, some morphonological phenomena related to the phonological ones are considered. All important phenomena are illustrated with real language examples. In some cases, a comparison with Modern Standard Arabic is made to illustrate the differences. So far, very few publications have been devoted to Tunisian Arabic and its phonology. The aim of this work is to introduce the phonology of Tunisian Arabic through selected phonological phenomena.
    Department of Middle Eastern Studies, Faculty of Arts

    Multi-Task sequence prediction for Tunisian Arabizi multi-level annotation

    In this paper we propose a multi-task sequence prediction system, based on recurrent neural networks, that annotates a Tunisian Arabizi corpus on multiple levels. The annotations performed are text classification, tokenization, PoS tagging and encoding of Tunisian Arabizi into CODA* Arabic orthography. The system is trained to predict all the annotation levels in cascade, starting from the Arabizi input. To show the effectiveness of our neural architecture, we evaluate the system on the TIGER German corpus, suitably converting the data to obtain a multi-task problem. We also show how we used the system to annotate a Tunisian Arabizi corpus, which was afterwards manually corrected and used to further evaluate sequence models on Tunisian data. Our system is developed for the Fairseq framework, which allows fast and easy use for any other sequence prediction problem.

    Creating Parallel Arabic Dialect Corpus: Pitfalls to Avoid

    Creating parallel corpora is a difficult problem that many researchers try to address. For under-resourced languages such as Arabic dialects, the problem is further complicated by the nature of these spoken languages. In this paper, we share our experience of creating a parallel corpus that contains several dialects and Modern Standard Arabic (MSA). We highlight the most important choices we made and how well they worked.