1,416 research outputs found
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Recommended from our members
What Code-Switching Strategies are Effective in Dialogue Systems?
Since most people in the world today are multilingual, code-switching is ubiquitous in spoken and written interactions. Paving the way for future adaptive, multilingual conversational agents, we incorporate linguistically-motivated strategies of code-switching into a rule-based goal-oriented dialogue system. We collect and release CommonAmigos, a corpus of 587 human-computer text conversations between our dialogue system and human users in mixed Spanish and English. From this new corpus, we analyze the amount of elicited code-switching, preferred patterns of user code-switching, and the impact of user demographics on code-switching. Based on these exploratory findings, we give recommendations for future effective code-switching dialogue systems, highlighting user\u27s language proficiency and gender as critical considerations
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study about the existing CSW data sets (68) across language pairs in terms of
the collection and preparation (e.g. transcription and annotation) stages. This
in-depth analysis reveals that \textbf{a)} most CSW data involves English
ignoring other language pairs/tuples \textbf{b)} there are flaws in terms of
representativeness in data collection and preparation stages due to ignoring
the location based, socio-demographic and register variation in CSW. In
addition, lack of clarity on the data selection and filtering stages shadow the
representativeness of CSW data sets. We conclude by providing a short
check-list to improve the representativeness for forthcoming studies involving
CSW data collection and preparation.Comment: Accepted for EMNLP'23 Findings (to appear on EMNLP'23 Proceedings
Recommended from our members
Identifying and Modeling Code-Switched Language
Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during written or spoken communication. The importance of developing language technologies that are able to process code-switched language is immense, given the large populations that routinely code-switch. Current NLP and Speech models break down when used on code-switched data, interrupting the language processing pipeline in back-end systems and forcing users to communicate in ways which for them are unnatural.
There are four main challenges that arise in building code-switched models: lack of code-switched data on which to train generative language models; lack of multilingual language annotations on code-switched examples which are needed to train supervised models; little understanding of how to leverage monolingual and parallel resources to build better code-switched models; and finally, how to use these models to learn why and when code-switching happens across language pairs. In this thesis, I look into different aspects of these four challenges.
The first part of this thesis focuses on how to obtain reliable corpora of code-switched language. We collected a large corpus of code-switched language from social media using a combination of sets of anchor words that exist in one language and sentence-level language taggers. The newly obtained corpus is superior to other corpora collected via different strategies when it comes to the amount and type of bilingualism in it. It also helps train better language tagging models. We also have proposed a new annotation scheme to obtain part-of-speech tags for code-switched English-Spanish language. The annotation scheme is composed of three different subtasks including automatic labeling, word-specific questions labeling and question-tree word labeling. The part-of-speech labels obtained for the Miami Bangor corpus of English-Spanish conversational speech show very high agreement and accuracy.
The second section of this thesis focuses on the tasks of part-of-speech tagging and language modeling. For the first task, we proposed a state-of-the-art approach to part-of-speech tagging of code-switched English-Spanish data based on recurrent neural networks.Our models were tested on the Miami Bangor corpus on the task of POS tagging alone, for which we achieved 96.34% accuracy, and joint part-of-speech and language ID tagging,which achieved similar POS tagging accuracy (96.39%) and very high language ID accuracy (98.78%).
For the task of language modeling, we first conducted an exhaustive analysis of the relationship between cognate words and code-switching. We then proposed a set of cognate-based features that helped improve language modeling performance by 12% relative points. Furthermore, we showed that these features can also be used across language pairs and still obtain performance improvements.
Finally, we tackled the question of how to use monolingual resources for code-switching models by pre-training state-of-the-art cross-lingual language models on large monolingual corpora and fine-tuning them on the tasks of language modeling and word-level language tagging on code-switched data. We obtained state-of-the-art results on both tasks
Active learning in annotating micro-blogs dealing with e-reputation
Elections unleash strong political views on Twitter, but what do people
really think about politics? Opinion and trend mining on micro blogs dealing
with politics has recently attracted researchers in several fields including
Information Retrieval and Machine Learning (ML). Since the performance of ML
and Natural Language Processing (NLP) approaches are limited by the amount and
quality of data available, one promising alternative for some tasks is the
automatic propagation of expert annotations. This paper intends to develop a
so-called active learning process for automatically annotating French language
tweets that deal with the image (i.e., representation, web reputation) of
politicians. Our main focus is on the methodology followed to build an original
annotated dataset expressing opinion from two French politicians over time. We
therefore review state of the art NLP-based ML algorithms to automatically
annotate tweets using a manual initiation step as bootstrap. This paper focuses
on key issues about active learning while building a large annotated data set
from noise. This will be introduced by human annotators, abundance of data and
the label distribution across data and entities. In turn, we show that Twitter
characteristics such as the author's name or hashtags can be considered as the
bearing point to not only improve automatic systems for Opinion Mining (OM) and
Topic Classification but also to reduce noise in human annotations. However, a
later thorough analysis shows that reducing noise might induce the loss of
crucial information.Comment: Journal of Interdisciplinary Methodologies and Issues in Science -
Vol 3 - Contextualisation digitale - 201
Systematic Analysis of Language Transcripts Solutions: A Tutorial
Purpose:
In the early 1980s, researchers and speech-language pathologists (SLPs) collaborated to develop the Systematic Analysis of Language Transcripts (SALT). Research and development over the ensuing decades has culminated into SALT Solutions, a set of tools to assist SLPs to efficiently complete language sample analysis (LSA) with their clients. In this tutorial, we describe how SALT can assist with the accurate identification of children with language disorders and provide a rich description of children\u27s functional language use. After summarizing the multiple elicitation methods developed by the SALT team, we provide case studies, showing how to select an elicitation method that aligns with a child\u27s characteristics. We then summarize major considerations when transcribing, analyzing, and interpreting language samples with SALT. We revisit our case studies to illustrate how SALT adds value to the comprehensive assessment of language in children. Conclusions:
LSA is a powerful assessment tool for children suspected of having language disorders. The SALT suite of solutions provides a toolkit to assist SLPs with their comprehensive language assessments
- …