3 research outputs found

    Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

    Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study, we investigate how pre-trained language models (PLMs) handle code-switched text along three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed, naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models generalise effectively to code-switched text, shedding light on how these models transfer their representations to code-switched corpora. We release all our code and data, including the novel corpus, at https://github.com/francesita/code-mixed-probes
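    The third probing dimension (semantic consistency) can be illustrated with a minimal sketch: if a model represents meaning consistently, the vector for a code-switched sentence should lie closer to the vectors of its parallel monolingual translations than to unrelated text. The sentence vectors below are hypothetical toy values for illustration, not outputs of any actual PLM from the study.

    ```python
    import math

    def cosine(a, b):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    # Hypothetical mean-pooled sentence vectors (toy values for illustration):
    cs_sentence = [0.5, 0.5, 0.1]     # code-switched sentence
    en_parallel = [0.55, 0.45, 0.12]  # its English translation
    es_parallel = [0.48, 0.52, 0.09]  # its Spanish translation
    unrelated   = [0.0, 0.1, 0.9]     # a semantically unrelated sentence

    # A semantically consistent model keeps parallel translations closer
    # to the code-switched sentence than unrelated text is.
    print(cosine(cs_sentence, en_parallel) > cosine(cs_sentence, unrelated))  # True
    print(cosine(cs_sentence, es_parallel) > cosine(cs_sentence, unrelated))  # True
    ```

    In practice the sentence vectors would be pooled from a PLM's token representations; the comparison itself is the same.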

    A Computational Study in the Detection of English–Spanish Code-Switches

    Code-switching is the linguistic phenomenon in which a multilingual person alternates between two or more languages in a conversation, whether spoken or written. This thesis studies the automatic detection of code-switching occurring specifically between English and Spanish in two corpora. Twitter and other social media sites provide an abundance of linguistic data that researchers can use for countless experiments. Collecting data is fairly easy for a study of monolingual text, but a study that requires code-switched data faces a complication, as APIs accept only one language as a parameter. This thesis focuses on identifying code-switching in both Twitter data and the Miami-Bangor corpus, through three different experiments. Our first experiment is a logistic regression model in which we attempt to distinguish code-switched data from monolingual data. The second experiment uses a novel Word2Vec average nearest neighbor (WANN) classifier based on word embeddings to detect code-switching. The third experiment uses Doc2Vec, where the model uses the mean vector of each document to learn to distinguish between code-switched and monolingual data. Each of these experiments is performed twice, once with tweets and once with the Miami-Bangor corpus. The results show that the WANN model performs best on Twitter data, while the Doc2Vec model performs best on the Miami-Bangor corpus. However, both approaches did well, and their performances are comparable.
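    The core of the WANN idea described above can be sketched in a few lines: represent each document by the mean of its word vectors, then label a new document with the class of its nearest labelled document under cosine similarity. The embedding table, documents, and labels below are toy, hypothetical values for illustration, not the thesis's actual data or trained Word2Vec model.

    ```python
    import math

    # Toy "pretrained" word embeddings (hypothetical values for illustration).
    EMB = {
        "the": [0.9, 0.1], "cat": [0.8, 0.2], "runs": [0.7, 0.3],    # English-like
        "el":  [0.1, 0.9], "gato": [0.2, 0.8], "corre": [0.3, 0.7],  # Spanish-like
    }

    def doc_vector(tokens):
        """Mean of the word vectors for the tokens found in the embedding table."""
        vecs = [EMB[t] for t in tokens if t in EMB]
        dim = len(next(iter(EMB.values())))
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    def cosine(a, b):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    def wann_predict(tokens, labelled_docs):
        """Label a document by its nearest labelled document vector (cosine)."""
        v = doc_vector(tokens)
        return max(labelled_docs, key=lambda d: cosine(v, doc_vector(d[0])))[1]

    train = [
        (["the", "cat", "runs"], "monolingual"),
        (["el", "gato", "corre"], "monolingual"),
        (["the", "gato", "runs"], "code-switched"),
    ]
    print(wann_predict(["the", "gato", "corre"], train))  # code-switched
    print(wann_predict(["el", "gato"], train))            # monolingual
    ```

    The Doc2Vec experiment differs mainly in how the document vector is produced (learned jointly with the model rather than averaged from word vectors); the classification step is analogous.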