2 research outputs found

Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data

Authors: Al-Badrashiny Mohamed, AlGhamdi Fahad, AlMarwani Nada, Diab Mona, Ghoneim Mahmoud, Hawwari Abdelati
Publication date: 27/09/2019

We present our effort to create a large multi-layered representational repository of linguistic code-switched Arabic data. The process involves developing clear annotation standards and guidelines, streamlining the annotation process, and implementing quality control measures. We used two main protocols for annotation: in-lab gold annotations and crowdsourced annotations. We developed a web-based annotation tool to facilitate the management of the annotation process. The current version of the repository contains a total of 886,252 tokens, each tagged with one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic and represents three data genres: tweets, commentaries, and discussion fora. The overall inter-annotator agreement is 93.1%.

arXiv.org e-Print Archive

Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task

Authors: Aguilar Gustavo, AlGhamdi Fahad, Diab Mona, Hirschberg Julia, Solorio Thamar, Soto Victor
Publication date: 10/06/2019

In the third shared task of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focus on Named Entity Recognition (NER) on code-switched social media data. We divide the shared task into two competitions based on the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. We use Twitter data and 9 entity types to establish a new dataset for code-switched NER benchmarks. In addition to the code-switching (CS) phenomenon, the diversity of the entities and the challenges of social media text make the task considerably difficult. As a result, the best scores of the competitions are 63.76% and 71.61% for ENG-SPA and MSA-EGY, respectively. We present the scores of 9 participants and discuss the most common challenges among submissions. Comment: ACL 2018 (CALCS

arXiv.org e-Print Archive