28 research outputs found
Identifying and Modeling Code-Switched Language
Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during written or spoken communication. The importance of developing language technologies that are able to process code-switched language is immense, given the large populations that routinely code-switch. Current NLP and Speech models break down when used on code-switched data, interrupting the language processing pipeline in back-end systems and forcing users to communicate in ways which for them are unnatural.
There are four main challenges that arise in building code-switched models: lack of code-switched data on which to train generative language models; lack of multilingual language annotations on code-switched examples which are needed to train supervised models; little understanding of how to leverage monolingual and parallel resources to build better code-switched models; and finally, how to use these models to learn why and when code-switching happens across language pairs. In this thesis, I look into different aspects of these four challenges.
The first part of this thesis focuses on how to obtain reliable corpora of code-switched language. We collected a large corpus of code-switched language from social media using a combination of sets of anchor words that exist in one language and sentence-level language taggers. The newly obtained corpus is superior to other corpora collected via different strategies when it comes to the amount and type of bilingualism in it. It also helps train better language tagging models. We also proposed a new annotation scheme to obtain part-of-speech tags for code-switched English-Spanish language. The annotation scheme is composed of three subtasks: automatic labeling, word-specific question labeling, and question-tree word labeling. The part-of-speech labels obtained for the Miami Bangor corpus of English-Spanish conversational speech show very high agreement and accuracy.
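The anchor-word idea described above can be illustrated with a minimal sketch. It assumes one plausible way of combining the two signals: a text is flagged as a code-switching candidate when it contains an anchor word from a language other than the one a sentence-level tagger assigns. The anchor sets, the `is_candidate` function, and the stand-in tagger are all hypothetical and are not taken from the thesis.

```python
import re
from typing import Callable, Dict, Set

# Hypothetical toy anchor sets: words assumed to exist in only one language.
ANCHORS: Dict[str, Set[str]] = {
    "en": {"the", "because", "with"},
    "es": {"pero", "porque", "entonces"},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def anchor_languages(text: str, anchors: Dict[str, Set[str]] = ANCHORS) -> Set[str]:
    """Return the languages for which at least one anchor word occurs in the text."""
    tokens = tokenize(text)
    return {lang for lang, words in anchors.items() if tokens & words}


def is_candidate(
    text: str,
    sentence_tagger: Callable[[str], str],
    anchors: Dict[str, Set[str]] = ANCHORS,
) -> bool:
    """Flag text containing an anchor from a language other than the sentence-level tag."""
    tagged = sentence_tagger(text)
    return any(lang != tagged for lang in anchor_languages(text, anchors))


# Stand-in for a real sentence-level language tagger (assumption): always says "en".
def always_english(text: str) -> str:
    return "en"


print(is_candidate("I went to the store pero estaba cerrado", always_english))
print(is_candidate("I went to the store", always_english))
```

In practice the stand-in tagger would be replaced by a trained sentence-level language identifier, and the anchor sets would be derived from large monolingual word lists.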
The second section of this thesis focuses on the tasks of part-of-speech tagging and language modeling. For the first task, we proposed a state-of-the-art approach to part-of-speech tagging of code-switched English-Spanish data based on recurrent neural networks. Our models were tested on the Miami Bangor corpus on the task of POS tagging alone, for which we achieved 96.34% accuracy, and on joint part-of-speech and language ID tagging, which achieved similar POS tagging accuracy (96.39%) and very high language ID accuracy (98.78%).
For the task of language modeling, we first conducted an exhaustive analysis of the relationship between cognate words and code-switching. We then proposed a set of cognate-based features that helped improve language modeling performance by a relative 12%. Furthermore, we showed that these features can also be used across language pairs and still obtain performance improvements.
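As a reading aid for the relative figure above, the following sketch shows how a relative improvement is commonly computed for a lower-is-better metric such as perplexity. The abstract does not name the metric or give the underlying scores, so the function name and the numbers below are hypothetical and chosen only to illustrate the arithmetic.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction of a lower-is-better metric, relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical scores, not taken from the thesis.
baseline_score = 100.0
improved_score = 88.0

print(f"{relative_improvement(baseline_score, improved_score):.0%}")
```

A relative improvement is measured against the baseline value, which distinguishes it from an absolute difference in the metric itself.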
Finally, we tackled the question of how to use monolingual resources for code-switching models by pre-training state-of-the-art cross-lingual language models on large monolingual corpora and fine-tuning them on the tasks of language modeling and word-level language tagging on code-switched data. We obtained state-of-the-art results on both tasks.
What Code-Switching Strategies are Effective in Dialogue Systems?
Since most people in the world today are multilingual, code-switching is ubiquitous in spoken and written interactions. Paving the way for future adaptive, multilingual conversational agents, we incorporate linguistically-motivated strategies of code-switching into a rule-based goal-oriented dialogue system. We collect and release CommonAmigos, a corpus of 587 human-computer text conversations between our dialogue system and human users in mixed Spanish and English. From this new corpus, we analyze the amount of elicited code-switching, preferred patterns of user code-switching, and the impact of user demographics on code-switching. Based on these exploratory findings, we give recommendations for future effective code-switching dialogue systems, highlighting user's language proficiency and gender as critical considerations.
Measuring Entrainment in Spontaneous Code-switched Speech
It is well-known that interlocutors who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). However, such studies of entrainment in code-switched domains have been extremely few and restricted to human-machine textual interactions. Our work studies code-switched spontaneous speech between humans by answering the following questions: 1) Do patterns of written and spoken entrainment in monolingual settings generalize to code-switched settings? 2) Do patterns of entrainment on code-switching in generated text generalize to spontaneous code-switched speech? We find evidence of affirmative answers to both of these questions, with important implications for the potentially "universal" nature of entrainment as a communication phenomenon, and potential applications in inclusive and interactive speech technology.
Code-Switching in Multilinguals: A Narrative Elicitation Study with L1 Arabic, L2 English, L3 Norwegian Speakers. The Role of Cognates, Dominance and Typological proximity
This study investigates the phenomenon of code-switching in multilinguals. Participants in this study speak Arabic as L1, English as L2, and Norwegian as L3. The focus is on two main patterns in code-switching: the insertion of cognates and the direction of cross-linguistic influence. More specifically, we investigate whether the co-activation effect on cognates increases the likelihood of code-switching cognates compared with non-cognates. In addition, we examine which factor, dominance or typological proximity, has more influence on the directionality of the switches.
A group of 41 participants was interviewed to elicit data for this study. Two elicitation tasks were employed: the MAIN task by Gagarina (2012) and the Picture Descriptive Task adapted from Lloyd-Smith’s study. Each task consists of two depicted stories. Participants had to tell a story based on the presented pictures. All their narratives were recorded and then transcribed.
The results showed that there were more code-switch instances among cognates than non-cognates in the English narratives, and the difference between the cognate code-switches and the non-cognate code-switches was significant. This significant difference is attributed to the strong activation of the Norwegian language, which led to a strong representation of cognates in the mental lexicon. On the other hand, participants did not produce more cognates than non-cognates in the Norwegian narratives, and the difference was not significant. This can be explained by the weak activation level of English, which led to a weaker representation of the cognates in the mental lexicon.
Regarding the direction, the results revealed that there were code-switches from all languages, but Norwegian was the strongest donor. The role of dominance was seen between English and Norwegian, whereas the dominance of the participants’ L1 had no effect due to the lack of typological proximity between Arabic and the two Germanic languages in this study.
Keywords: code-switching, insertion, directionality, cognates, cross-linguistic influence, multilingual
Proceedings of the VIIth GSCP International Conference
The 7th International Conference of the Gruppo di Studi sulla Comunicazione Parlata, dedicated to the memory of Claire Blanche-Benveniste, chose as its main theme Speech and Corpora. The wide international origin of the 235 authors from 21 countries and 95 institutions led to papers on many different languages. The 89 papers of this volume reflect the themes of the conference: spoken corpora compilation and annotation, together with the related technological fields; the relation between prosody and pragmatics; speech pathologies; and different papers on phonetics, speech and linguistic analysis, pragmatics and sociolinguistics. Many papers are also dedicated to speech and second language studies. The online publication with FUP allows direct access to sound and video linked to papers (when downloaded).