1,718 research outputs found

    Automatic processing of code-mixed social media content

    Get PDF
    Code-mixing or language-mixing is a linguistic phenomenon where multiple language mix together during conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) tagger and parsers perform poorly because such tools are generally trained with monolingual content. Thus there is a need for code-mixed NLP. This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a world-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised and code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learn- ing techniques. We find that among a dictionary-based system, a character-n-gram based linear model, a character-n-gram based first order Conditional Random Fields (CRF) and a recurrent neural network in the form of a Long Short Term Memory (LSTM) that consider words as well as characters, LSTM outperformed the other methods. We also took part in the First Workshop of Computational Approaches to Code-Switching, EMNLP, 2014 where we achieved the highest token-level accuracy in the word-level language identification task of Nepali-English. The second target of this research is part-of-speech (POS) tagging. POS tagging methods for code- mixed data (e.g. pipeline and stacked systems and LSTM-based neural models) have been implemented, among them, neural approach outperformed the other approach. Further, we investigate building a joint model to perform language identification and POS tagging jointly. We compare between a factorial CRF (FCRF) based joint model and three LSTM-based multi-task models for word-level language identification and POS tagging. The neural models achieve good accuracy in language identification and POS tagging by outperforming the FCRF approach. Further- more, we found that it is better to go for a multi-task learning approach than to perform individual task (e.g. language identification and POS tagging) using neural approach. Comparison between the three neural approaches revealed that without using task-specific recurrent layers, it is possible to achieve good accuracy by careful handling of output layers for these two tasks e.g. LID and POS tagging

    Neural Sequence Labeling on Social Media Text

    Get PDF
    As social media (SM) brings opportunities to study societies across the world, it also brings a variety of challenges to automate the processing of SM language. In particular, most of the textual content in SM is considered noisy; it does not always stick to the rules of the written language, and it tends to have misspellings, arbitrary abbreviations, orthographic inconsistencies, and flexible grammar. Additionally, SM platforms provide a unique space for multilingual content. This polyglot environment requires modern systems to adapt to a diverse range of languages, imposing another linguistic barrier to processing and understanding of text from SM domains. This thesis aims at providing novel sequence labeling approaches to handle noise and linguistic code-switching (i.e., the alternation of languages in the same utterance) in SM text. In particular, the first part of this thesis focuses on named entity recognition for English SM text, where I propose linguistically-inspired methods to address phonological writing and flexible syntax. Besides, I investigate whether the performance of current state-of-the-art models relies on memorization or contextual generalization of entities. In the second part of this thesis, I focus on three sequence labeling tasks for code-switched SM text: language identification, part-of-speech tagging, and named entity recognition. Specifically, I propose transfer learning methods from state-of-the-art monolingual and multilingual models, such as ELMo and BERT, to the code-switching setting for sequence labeling. These methods reduce the demand for code-switching annotations and resources while exploiting multilingual knowledge from large pre-trained unsupervised models. The methods presented in this thesis are meant to benefit higher-level NLP applications oriented to social media domains, including but not limited to question-answering, conversational systems, and information extraction

    How Do Multilingual Encoders Learn Cross-lingual Representation?

    Get PDF
    NLP systems typically require support for more than one language. As different languages have different amounts of supervision, cross-lingual transfer benefits languages with little to no training data by transferring from other languages. From an engineering perspective, multilingual NLP benefits development and maintenance by serving multiple languages with a single system. Both cross-lingual transfer and multilingual NLP rely on cross-lingual representations serving as the foundation. As BERT revolutionized representation learning and NLP, it also revolutionized cross-lingual representations and cross-lingual transfer. Multilingual BERT was released as a replacement for single-language BERT, trained with Wikipedia data in 104 languages. Surprisingly, without any explicit cross-lingual signal, multilingual BERT learns cross-lingual representations in addition to representations for individual languages. This thesis first shows such surprising cross-lingual effectiveness compared against prior art on various tasks. Naturally, it raises a set of questions, most notably how do these multilingual encoders learn cross-lingual representations. In exploring these questions, this thesis will analyze the behavior of multilingual models in a variety of settings on high and low resource languages. We also look at how to inject different cross-lingual signals into multilingual encoders, and the optimization behavior of cross-lingual transfer with these models. Together, they provide a better understanding of multilingual encoders on cross-lingual transfer. Our findings will lead us to suggested improvements to multilingual encoders and cross-lingual transfer
    corecore