2 research outputs found

Discriminating between similar languages in Twitter using label propagation

Authors: Galle Matthias, Radford Will
Publication date: 19/07/2016

Identifying the language of social media messages is an important first step in linguistic processing. Existing models for Twitter focus on content analysis, which is successful for dissimilar language pairs. We propose a label propagation approach that takes the social graph of tweet authors into account as well as content to better tease apart similar languages. This results in state-of-the-art shared task performance of 76.63%, 1.4% higher than the top system.

arXiv.org e-Print Archive

Automatic Identification of Closely-related Indian Languages: Resources and Experiments

Authors: Alok Deepak, Basit Abdul, Dawer Yogesh, Jain Mayank, Kumar Ritesh, Lahiri Bornini, Ojha Atul Kr.
Publication date: 26/03/2018

In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.

Comment: Paper accepted at the 4th Workshop in Indian Languages Data and Resources (WILDRE - 4), 11th edition of the Language Resources and Evaluation Conference (LREC - 2018), 7-12 May 2018, Miyazaki (Japan)

arXiv.org e-Print Archive