131 research outputs found
Language and Dialect Identification of Cuneiform Texts
This article introduces a corpus of cuneiform texts from which the dataset
for the use of the Cuneiform Language Identification (CLI) 2019 shared task was
derived as well as some preliminary language identification experiments
conducted using that corpus. We also describe the CLI dataset and how it was
derived from the corpus. In addition, we provide some baseline language
identification results using the CLI dataset. To the best of our knowledge, the
experiments detailed here are the first to apply automatic language
identification methods to cuneiform data.
Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models
Peer reviewed
TwistBytes - identification of Cuneiform languages and German dialects at VarDial 2019
We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro-averaged F1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro-averaged F1 of 74.7%.
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, co-located with COLING’2018. This year, the campaign included five shared tasks: two task re-runs, Arabic Dialect Identification (ADI) and German Dialect Identification (GDI), and three new tasks, Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.
Non peer reviewed
A Report on the Third VarDial Evaluation Campaign
Non peer reviewed
SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German
Swiss German is a dialect continuum whose natively acquired dialects
significantly differ from the formal variety of the language. These dialects
are mostly used for verbal communication and do not have a standard orthography.
This has led to a lack of annotated datasets, rendering the use of many NLP
methods infeasible. In this paper, we introduce the first annotated parallel
corpus of spoken Swiss German across 8 major dialects, plus a Standard German
reference. Our goal has been to create and to make available a basic dataset
for employing data-driven NLP applications in Swiss German. We present our data
collection procedure in detail and validate the quality of our corpus by
conducting experiments with recent neural models for speech synthesis
- …