31,412 research outputs found
Natural language processing for similar languages, varieties, and dialects: A survey
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.Non peer reviewe
The GW/LT3 VarDial 2016 shared task system for dialects and similar languages detection
This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers (task 1) and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the-art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%
Language and Dialect Identification of Cuneiform Texts
This article introduces a corpus of cuneiform texts from which the dataset
for the use of the Cuneiform Language Identification (CLI) 2019 shared task was
derived as well as some preliminary language identification experiments
conducted using that corpus. We also describe the CLI dataset and how it was
derived from the corpus. In addition, we provide some baseline language
identification results using the CLI dataset. To the best of our knowledge, the
experiments detailed here are the first time automatic language identification
methods have been used on cuneiform data
Neural Machine Translation into Language Varieties
Both research and commercial machine translation have so far neglected the
importance of properly handling the spelling, lexical and grammar divergences
occurring among language varieties. Notable cases are standard national
varieties such as Brazilian and European Portuguese, and Canadian and European
French, which popular online machine translation services are not keeping
distinct. We show that an evident side effect of modeling such varieties as
unique classes is the generation of inconsistent translations. In this work, we
investigate the problem of training neural machine translation from English to
specific pairs of language varieties, assuming both labeled and unlabeled
parallel texts, and low-resource conditions. We report experiments from English
to two pairs of dialects, EuropeanBrazilian Portuguese and European-Canadian
French, and two pairs of standardized varieties, Croatian-Serbian and
Indonesian-Malay. We show significant BLEU score improvements over baseline
systems when translation into similar languages is learned as a multilingual
task with shared representations.Comment: Published at EMNLP 2018: third conference on machine translation (WMT
2018
- …