CS-Embed at SemEval-2020 Task 9: The effectiveness of code-switched word embeddings for sentiment analysis
The growing popularity of sentiment analysis of social media posts has
naturally led to sentiment analysis of posts that mix multiple languages, a
practice known as code-switching. While recent research on code-switched
posts has focused on the use of multilingual word embeddings, these
embeddings were not trained on code-switched data. In this work, we present
word embeddings trained on code-switched tweets, specifically tweets that mix
Spanish and English, known as Spanglish. We explore the embedding space to
discover how it captures the meanings of words in both languages. We test the
effectiveness of these embeddings by participating in SemEval-2020 Task 9:
Sentiment Analysis on Code-Mixed Social Media Text. We utilised them to train
a sentiment classifier that achieves an F1 score of 0.722, above the
competition baseline of 0.656, with our team (CodaLab username francesita)
ranking 14th out of 29 participating teams.
Comment: Accepted at SemEval-2020, COLING 2020
Code-Mixed Probes Show How Pre-Trained Models Generalise on Code-Switched Text
Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained Language Models handle code-switched text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on abilities of these models to generalise representations to CS corpora. We release all our code and data, including the novel corpus, at https://github.com/francesita/code-mixed-probes
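A probing setup along these lines can be sketched as a linear classifier trained on frozen sentence representations to detect code-switched text. This is not the paper's setup: `encode` below is a hypothetical stand-in for a frozen PLM encoder (a deterministic hashed bag-of-words) so the snippet stays self-contained, and the sentences and labels are illustrative.

```python
# Sketch of a code-switching detection probe: only the linear probe is
# trained; the "encoder" is frozen (here, a hash-based stand-in).
import zlib
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(sentence, dim=32):
    # Stand-in for mean-pooled PLM hidden states: a deterministic
    # hashed bag-of-words vector (illustration only).
    vec = np.zeros(dim)
    for tok in sentence.lower().split():
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec

# Tiny parallel-style dataset: code-switched (label 1) vs monolingual (0).
code_switched = [
    "me encanta this party",
    "vamos to the beach ahora",
    "that pelicula was muy buena",
]
monolingual = [
    "i love this party",
    "we go to the beach now",
    "that movie was very good",
]
X = np.stack([encode(s) for s in code_switched + monolingual])
y = np.array([1] * 3 + [0] * 3)

# The probe: a single linear layer over frozen features.
probe = LogisticRegression().fit(X, y)
print("train accuracy:", probe.score(X, y))
```

In an actual probing study, high probe accuracy on frozen representations is taken as evidence that the encoder already captures the property in question; the probe's capacity is kept deliberately small so it cannot learn the task on its own.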