544 research outputs found
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Persona-aware Generative Model for Code-mixed Language
Code-mixing and script-mixing are prevalent across online social networks and
multilingual societies. However, a user's preference toward code-mixing depends
on the socioeconomic status, demographics of the user, and the local context,
which existing generative models mostly ignore while generating code-mixed
texts. In this work, we make a pioneering attempt to develop a persona-aware
generative model to generate texts resembling real-life code-mixed texts of
individuals. We propose a Persona-aware Generative Model for Code-mixed
Generation, PARADOX, a novel Transformer-based encoder-decoder model that
encodes an utterance conditioned on a user's persona and generates code-mixed
texts without monolingual reference data. We propose an alignment module that
re-calibrates the generated sequence to resemble real-life code-mixed texts.
PARADOX generates code-mixed texts that are semantically more meaningful and
linguistically more valid. To evaluate the personification capabilities of
PARADOX, we propose four new metrics -- CM BLEU, CM Rouge-1, CM Rouge-L and CM
KS. On average, PARADOX achieves 1.6 points better CM BLEU, 47% better
perplexity and 32% better semantic coherence than the non-persona-based
counterparts.Comment: 4 tables, 4 figure
A Computational Study in the Detection of English–Spanish Code-Switches
Code-switching is the linguistic phenomenon where a multilingual person alternates between two or more languages in a conversation, whether that be spoken or written. This thesis studies the automatic detection of code-switching occurring specifically between English and Spanish in two corpora.
Twitter and other social media sites have provided an abundance of linguistic data that is available to researchers to perform countless experiments. Collecting the data is fairly easy if a study is on monolingual text, but if a study requires code-switched data, this becomes a complication as APIs only accept one language as a parameter. This thesis focuses on identifying code-switching in both Twitter data and the Miami-Bangor corpus. This is done by conducting three different experiments. Our first experiment is a logistic regression model where we attempt to distinguish code-switched data from monolingual data. The second experiment is using a novel Word2Vec average nearest neighbor (WANN) classifier based on word embeddings to detect code-switching. The third experiment uses Doc2Vec, where the model uses the mean vector of each document to learn and distinguish between code-switched and monolingual data. Each of these experiments are performed twice, once with tweets and once with the Miami Bangor corpus. The results show that the WANN model performs best on Twitter data. The Doc2Vec model performs best on the Miami Bangor corpus. However, both approaches did well and the performances are comparable
Automatic processing of code-mixed social media content
Code-mixing or language-mixing is a linguistic phenomenon where multiple language mix together during conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) tagger and parsers perform poorly because such tools are generally trained with monolingual content. Thus there is a need for code-mixed NLP. This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a world-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised and code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learn- ing techniques. We find that among a dictionary-based system, a character-n-gram based linear model, a character-n-gram based first order Conditional Random Fields (CRF) and a recurrent neural network in the form of a Long Short Term Memory (LSTM) that consider words as well as characters, LSTM outperformed the other methods. We also took part in the First Workshop of Computational Approaches to Code-Switching, EMNLP, 2014 where we achieved the highest token-level accuracy in the word-level language identification task of Nepali-English. The second target of this research is part-of-speech (POS) tagging. POS tagging methods for code- mixed data (e.g. pipeline and stacked systems and LSTM-based neural models) have been implemented, among them, neural approach outperformed the other approach. Further, we investigate building a joint model to perform language identification and POS tagging jointly. We compare between a factorial CRF (FCRF) based joint model and three LSTM-based multi-task models for word-level language identification and POS tagging. The neural models achieve good accuracy in language
identification and POS tagging by outperforming the FCRF approach. Further- more, we found that it is better to go for a multi-task learning approach than to perform individual task (e.g. language identification and POS tagging) using neural approach. Comparison between the three neural approaches revealed that without using task-specific recurrent layers, it is possible to achieve good accuracy by careful handling of output layers for these two tasks e.g. LID and POS tagging
#Languagemixing on Twitter
The influence of the English language on the world stage is such that it now constitutes a kind of global Lingua Franca. As such, English has supplanted French as the language of diplomacy, of culture, and of social prestige. This role reversal entails some residual opposition in France, and in consequence, the use of English expressions and vocabulary by French continues to be a controversial subject in France, as it has been for decades. Regulations are still being implemented to control the French language. Nowadays, social media has been an important tool in our society. Twitter has become a popular means of communication used in a variety of fields, such as politics, journalism, and academia. This widely used online platform has an impact on the way people express themselves and is changing language usage worldwide at an unprecedented pace. The language used online reflects the linguistic battle that has been going on for several decades in French society today. In my dissertation, I investigate the factors prompting the use of English and French language mixing on Twitter in France. The use of acronyms, hashtags as well as another language may be used as strategies to reach a wider audience. The need for visibility and audience maximization seem to be important factors for linguistic choice on Twitter. This study enables a deeper understanding of users' linguistic behavior online. The implications are important and allow for a rise in awareness of intercultural and cross-language exchanges.Includes bibliographical reference
- …