Language Identification of Hindi-English tweets using code-mixed BERT
Language identification of social media text has been an interesting problem
of study in recent years. Social media messages are predominantly code-mixed
in non-English-speaking states. Prior knowledge from pre-trained contextual
embeddings has shown state-of-the-art results for a range of downstream tasks.
Recently, models such as BERT have shown that, given a large amount of unlabeled
data, pretrained language models are even more beneficial for learning
common language representations. Extensive experiments exploiting transfer
learning and fine-tuning BERT models to identify language on Twitter are
presented in this paper. The work utilizes a data collection of
Hindi-English-Urdu codemixed text for language pre-training and Hindi-English
codemixed text for subsequent word-level language classification. The results
show that the representations pre-trained over codemixed data produce better
results than their monolingual counterparts.
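
Word-level language classification with a subword model such as BERT usually needs one preparatory step: the per-word language tags have to be mapped onto subword tokens. The sketch below illustrates one common convention, in which the first piece of each word carries the tag and continuation pieces receive an ignore index so they can be excluded from the loss. It is not taken from the paper. The chunking tokenizer `toy_wordpieces`, the two-tag label set, and the example tweet are all hypothetical stand-ins; the value -100 is chosen because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

IGNORE = -100                # assumption: label index skipped by the loss
LABELS = {"hi": 0, "en": 1}  # hypothetical tag set (Hindi, English)


def toy_wordpieces(word: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer.

    Splits a word into fixed-size chunks and marks continuation
    chunks with a '##' prefix, in the style of WordPiece output.
    """
    if not word:
        return []
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words: List[str], tags: List[str]) -> Tuple[List[str], List[int]]:
    """Expand word-level tags to subword level.

    The first piece of each word gets the word's label id; every
    continuation piece gets IGNORE.
    """
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, tag in zip(words, tags):
        pieces = toy_wordpieces(word)
        if not pieces:
            continue  # skip empty words so tokens and labels stay aligned
        tokens.extend(pieces)
        label_ids.extend([LABELS[tag]] + [IGNORE] * (len(pieces) - 1))
    return tokens, label_ids


# Invented romanized Hindi-English example, for illustration only.
words = ["yaar", "weekend", "pe", "movie", "dekhte", "hain"]
tags = ["hi", "en", "hi", "en", "hi", "hi"]
tokens, label_ids = align_labels(words, tags)
for token, label in zip(tokens, label_ids):
    print(f"{token}\t{label}")
```

With a real tokenizer the same alignment would be driven by the tokenizer's own word-to-token mapping rather than by fixed-size chunks.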