16 research outputs found

    A Pilot Study on Arabic Multi-Genre Corpus Diacritization Annotation

    No full text
    Arabic script writing is typically underspecified for short vowels and other mark up, referred to as diacritics. Apart from the lexical ambiguity found in words, similar to that exhibited in other languages, the lack of diacritics in written Arabic script adds another layer of ambiguity which is an artifact of the orthography. Diacritization of written text has a significant impact on Arabic NLP applications. In this paper, we present a pilot study on building a diacritized multi-genre corpus in Arabic. We annotate a sample of nondiacritized words extracted from five text genres. We explore different annotation strategies: Basic where we present only the bare undiacritized forms to the annotators, Intermediate (Basic forms+their POS tags), and Advanced (automatically diacritized words). We present the impact of the annotation strategy on annotation quality. Moreover, we study different diacritization schemes in the process

    Overview for the First Shared Task on Language Identification in Code-Switched Data

    No full text
    We present an overview of the first shared task on language identification on code-switched data. The shared task in-cluded code-switched data from four lan

    Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions

    No full text
    International audienceThis paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multi-word expressions. We present the annotation methodology, focusing on changes from last year's shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed
    corecore