Distributed Representations of Geographically Situated Language
We introduce a model for incorporating contextual information (such as geography) in learning vector-space representations of situated language. In contrast to approaches to multimodal representation learning that have used properties of the object being described (such as its color), our model includes information about the subject (i.e., the speaker), allowing us to learn the contours of a word’s meaning that are shaped by the context in which it is uttered. In a quantitative evaluation on the task of judging geographically informed semantic similarity between representations learned from 1.1 billion words of geo-located tweets, our joint model outperforms comparable independent models that learn meaning in isolation.
Semantic Variation in Online Communities of Practice
We introduce a framework for quantifying semantic variation of common words
in Communities of Practice and in sets of topic-related communities. We show
that while some meaning shifts are shared across related communities, others
are community-specific, and therefore independent from the discussed topic. We
propose such findings as evidence in favour of sociolinguistic theories of
socially-driven semantic variation. Results are evaluated using an independent
language modelling task. Furthermore, we investigate extralinguistic features
and show that factors such as prominence and dissemination of words are related
to semantic variation.
Comment: 13 pages, Proceedings of the 12th International Conference on
Computational Semantics (IWCS 2017
HBert + BiasCorp -- Fighting Racism on the Web
Subtle and overt racism is still present both in physical and online
communities today and has impacted many lives in different segments of the
society. In this short piece of work, we present how we're tackling this
societal issue with Natural Language Processing. We are releasing BiasCorp, a
dataset containing 139,090 comments and news segments from three specific
sources - Fox News, BreitbartNews and YouTube. The first batch (45,000 manually
annotated) is ready for publication. We are currently in the final phase of
manually labeling the remaining dataset using Amazon Mechanical Turk. BERT has
been used widely in several downstream tasks. In this work, we present hBERT,
where we modify certain layers of the pretrained BERT model with the new
Hopfield Layer. hBERT generalizes well across different distributions with the
added advantage of a reduced model complexity. We are also releasing a
JavaScript library and a Chrome Extension Application, to help developers make
use of our trained model in web applications (say, a chat application) and to
help users identify and report racially biased content on the web, respectively.
Training Temporal Word Embeddings with a Compass
Temporal word embeddings have been proposed to support the analysis of word
meaning shifts during time and to study the evolution of languages. Different
approaches have been proposed to generate vector representations of words that
embed their meaning during a specific time interval. However, the training
process used in these approaches is complex, may be inefficient, or may
require large text corpora. As a consequence, these approaches may be difficult
to apply in resource-scarce domains or by scientists with limited in-depth
knowledge of embedding models. In this paper, we propose a new heuristic to
train temporal word embeddings based on the Word2vec model. The heuristic
consists in using atemporal vectors as a reference, i.e., as a compass, when
training the representations specific to a given time interval. The use of the
compass simplifies the training process and makes it more efficient.
Experiments conducted using state-of-the-art datasets and methodologies suggest
that our approach outperforms or equals comparable approaches while being more
robust in terms of the required corpus size.
Comment: Accepted at AAAI201
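The compass heuristic described in this abstract can be illustrated with a minimal sketch. It assumes a tiny skip-gram model with negative sampling written in plain Python; the abstract only states that the method is based on Word2vec, so the architecture, the function names (`train`, `train_with_compass`), the hyperparameters, and the toy corpus are all illustrative assumptions rather than the authors' implementation. The context matrix is first trained on the whole corpus, then frozen and reused as the shared reference while each time slice learns its own target vectors.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow on extreme dot products.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def init_matrix(vocab, dim, rng):
    # One small random vector per word.
    return {w: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for w in vocab}


def train(sentences, targets, contexts, rng, update_contexts,
          window=2, negatives=2, lr=0.05, epochs=20):
    # Toy skip-gram with negative sampling (an assumed setup, not the paper's).
    vocab = list(contexts)
    for _ in range(epochs):
        for sent in sentences:
            for i, word in enumerate(sent):
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    # One observed pair plus a few randomly drawn negatives.
                    pairs = [(sent[j], 1.0)]
                    pairs += [(rng.choice(vocab), 0.0) for _ in range(negatives)]
                    for ctx, label in pairs:
                        t, c = targets[word], contexts[ctx]
                        dot = sum(a * b for a, b in zip(t, c))
                        g = lr * (label - sigmoid(dot))
                        new_t = [a + g * b for a, b in zip(t, c)]
                        if update_contexts:
                            contexts[ctx] = [b + g * a for a, b in zip(t, c)]
                        targets[word] = new_t


def train_with_compass(slices, dim=8, seed=0):
    rng = random.Random(seed)
    all_sents = [s for sents in slices.values() for s in sents]
    vocab = sorted({w for s in all_sents for w in s})
    # Step 1: atemporal training on the full corpus yields the compass.
    atemporal_targets = init_matrix(vocab, dim, rng)
    compass = init_matrix(vocab, dim, rng)
    train(all_sents, atemporal_targets, compass, rng, update_contexts=True)
    # Step 2: each slice learns its own target vectors against the frozen compass.
    slice_vectors = {}
    for name, sents in slices.items():
        slice_targets = init_matrix(vocab, dim, rng)
        train(sents, slice_targets, compass, rng, update_contexts=False)
        slice_vectors[name] = slice_targets
    return compass, slice_vectors


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy corpus: "apple" shifts from a fruit context to a tech context.
slices = {
    "early": [["apple", "fruit", "tree"], ["fruit", "tree", "garden"]],
    "late": [["apple", "phone", "screen"], ["phone", "screen", "device"]],
}
compass, vectors = train_with_compass(slices)
print(cosine(vectors["early"]["apple"], vectors["late"]["apple"]))
```

Because every slice is trained against the same frozen context vectors, the slice-specific vectors live in one shared space and can be compared directly, for instance with the cosine similarity printed above, without a separate alignment step.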
A Comprehensive View of the Biases of Toxicity and Sentiment Analysis Methods Towards Utterances with African American English Expressions
Language is a dynamic aspect of our culture that changes when expressed in
different technologies/communities. Online social networks have enabled the
diffusion and evolution of different dialects, including African American
English (AAE). However, this increased usage is not without barriers. One
particular barrier is how sentiment (Vader, TextBlob, and Flair) and toxicity
(Google's Perspective and the open-source Detoxify) methods present biases
towards utterances with AAE expressions. Consider Google's Perspective to
understand bias. Here, an utterance such as ``All n*ggers deserve to die
respectfully. The police murder us.'' reaches a higher toxicity score than
``African-Americans deserve to die respectfully. The police murder us.''. This
score difference likely arises because the tool cannot understand the
re-appropriation of the term ``n*gger''. One explanation for this bias is that
AI models are trained on limited datasets, in which such a term is more likely
to appear in a toxic utterance. While this may be
plausible, the tool will make mistakes regardless. Here, we study bias on two
Web-based (YouTube and Twitter) datasets and two spoken English datasets. Our
analysis shows how most models present biases towards AAE in most settings. We
isolate the impact of AAE expression usage via linguistic control features from
the Linguistic Inquiry and Word Count (LIWC) software, grammatical control
features extracted via Part-of-Speech (PoS) tagging from Natural Language
Processing (NLP) models, and the semantics of utterances by comparing sentence
embeddings from recent language models. We present consistent results on how a
heavy usage of AAE expressions may cause the speaker to be considered
substantially more toxic, even when speaking about nearly the same subject. Our
study complements similar analyses focusing on small datasets and/or one method
only.
Comment: Under peer review
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201