Discriminating between similar languages in Twitter using label propagation
Identifying the language of social media messages is an important first step
in linguistic processing. Existing models for Twitter focus on content
analysis, which is successful for dissimilar language pairs. We propose a label
propagation approach that takes the social graph of tweet authors into account
as well as content to better tease apart similar languages. This results in
state-of-the-art shared task performance of , higher than the
top system.
Automatic Identification of Closely-related Indian Languages: Resources and Experiments
In this paper, we discuss an attempt to develop an automatic language
identification system for 5 closely-related Indo-Aryan languages of India:
Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora
of varying length for these languages from various resources. We discuss the
method of creation of these corpora in detail. Using these corpora, a language
identification system was developed, which currently gives state-of-the-art
accuracy of 96.48%. We also used these corpora to study the similarity between
the 5 languages at the lexical level, which is the first data-based study of
the extent of closeness of these languages.
Comment: Paper accepted at the 4th Workshop in Indian Languages Data and
Resources (WILDRE-4), 11th edition of the Language Resources and Evaluation
Conference (LREC 2018), 7-12 May 2018, Miyazaki (Japan).