Elevating Code-mixed Text Handling through Auditory Information of Words
With the growing popularity of code-mixed data, there is an increasing need
for better handling of this type of data, which poses a number of challenges,
such as dealing with spelling variations, multiple languages, different
scripts, and a lack of resources. Current language models face difficulty in
effectively handling code-mixed data as they primarily focus on the semantic
representation of words and ignore the auditory phonetic features. This leads
to difficulties in handling spelling variations in code-mixed text. In this
paper, we propose an effective approach for creating language models for
handling code-mixed textual data using auditory information of words from
SOUNDEX. Our approach includes a pre-training step based on
masked-language-modelling, which includes SOUNDEX representations (SAMLM) and a
new method of providing input data to the pre-trained model. Through
experimentation on various code-mixed datasets (of different languages) for
sentiment, offensive and aggression classification tasks, we establish that our
novel language modeling approach (SAMLM) results in improved robustness towards
adversarial attacks on code-mixed classification tasks. Additionally, our SAMLM
based approach also results in better classification results over the popular
baselines for code-mixed tasks. We use the explainability technique, SHAP
(SHapley Additive exPlanations) to explain how the auditory features
incorporated through SAMLM assist the model to handle the code-mixed text
effectively and increase robustness against adversarial attacks. Source code has
been made available at https://github.com/20118/DefenseWithPhonetics and
https://www.iitp.ac.in/~ai-nlp-ml/resources.html#Phonetics.
Comment: Accepted to EMNLP 202
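To make the role of SOUNDEX in the abstract above more concrete, here is a minimal sketch of the classic American Soundex encoding in Python. It is not the authors' SAMLM pipeline, only the underlying phonetic code; the function name `soundex` and the Hinglish spelling variants used below ("bahut"/"bohot", "pyaar"/"pyar") are illustrative assumptions and do not come from the paper.

```python
# Classic American Soundex (illustrative sketch, not the paper's code).
# Consonants are grouped into six digit classes; vowels and Y carry no digit.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of `word` ("" if it has no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    # Pad with zeros and truncate to one letter plus three digits.
    return (first + "".join(encoded) + "000")[:4]


# Spelling variants of the same romanized word map to the same code.
print(soundex("bahut"), soundex("bohot"))
print(soundex("pyaar"), soundex("pyar"))
```

Because spelling variants that sound alike collapse to one code, a model that sees the code alongside the token has a signal that is stable under the character-level perturbations typical of code-mixed text. How the codes are actually fed to the model in SAMLM is described in the paper, not here.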
LT3 at SemEval-2020 Task 9: cross-lingual embeddings for sentiment analysis of Hinglish social media text
This paper describes our contribution to the SemEval-2020 Task 9 on Sentiment Analysis for
Code-mixed Social Media Text. We investigated two approaches to solve the task of Hinglish
sentiment analysis. The first approach uses cross-lingual embeddings resulting from projecting
Hinglish and pre-trained English FastText word embeddings in the same space. The second
approach incorporates pre-trained English embeddings that are incrementally retrained with a set
of Hinglish tweets. The results show that the second approach performs best, with an F1-score of
70.52% on the held-out test data.
Hate Me Not: Detecting Hate Inducing Memes in Code Switched Languages
The rise in the number of social media users has led to an increase in the hateful content posted online. In countries like India, where multiple languages are spoken, these abhorrent posts are written in an unusual blend of code-switched languages. This hate speech is depicted with the help of images to form "Memes", which create a long-lasting impact on the human mind. In this paper, we take up the task of hate and offense detection from multimodal data, i.e. images (Memes) that contain text in code-switched languages. We first present a novel triply annotated Indian political Memes (IPM) dataset, which comprises memes from various Indian political events that have taken place post-independence and are classified into three distinct categories. We also propose a binary-channelled CNN-cum-LSTM model that processes the images with the CNN and the text with the LSTM to get state-of-the-art results for this task.