2 research outputs found

    Discriminating between similar languages in Twitter using label propagation

    Identifying the language of social media messages is an important first step in linguistic processing. Existing models for Twitter focus on content analysis, which is successful for dissimilar language pairs. We propose a label propagation approach that takes the social graph of tweet authors into account as well as content to better tease apart similar languages. This results in state-of-the-art shared task performance of 76.63%, 1.4% higher than the top system.

    Automatic Identification of Closely-related Indian Languages: Resources and Experiments

    In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.
    Comment: Paper accepted at the 4th Workshop in Indian Languages Data and Resources (WILDRE - 4), 11th edition of the Language Resources and Evaluation Conference (LREC - 2018), 7-12 May 2018, Miyazaki (Japan)