168 research outputs found
Five sources of bias in natural language processing
Recently, there has been an increased interest in demographically grounded bias in natural language processing (NLP) applications. Much of the recent work has focused on describing bias and providing an overview of bias in a larger context. Here, we provide a simple, actionable summary of this recent work. We outline five sources where bias can occur in NLP systems: (1) the data, (2) the annotation process, (3) the input representations, (4) the models, and finally (5) the research design (or how we conceptualize our research). We explore each of the bias sources in detail in this article, including examples and links to related work, as well as potential counter-measures.
On the gap between adoption and understanding in NLP
No abstract available.
HONEST: measuring hurtful sentence completion in language models
No abstract available.
Twitter-demographer: a flow-based tool to enrich Twitter data
Twitter data have become essential to Natural Language Processing (NLP) and
social science research, driving various scientific discoveries in recent
years. However, the textual data alone are often not enough to conduct studies:
social scientists in particular need more variables to perform their analysis and
control for various factors. How we augment this information, such as users'
location, age, or tweet sentiment, has ramifications for anonymity and
reproducibility, and requires dedicated effort. This paper describes
Twitter-Demographer, a simple, flow-based tool to enrich Twitter data with
additional information about tweets and users. Twitter-Demographer is aimed at
NLP practitioners and (computational) social scientists who want to enrich
their datasets with aggregated information, facilitating reproducibility, and
providing algorithmic privacy-by-design measures for pseudo-anonymity. We
discuss our design choices, inspired by the flow-based programming paradigm, to
use black-box components that can easily be chained together and extended. We
also analyze the ethical issues related to the use of this tool, and the
built-in measures to facilitate pseudo-anonymity.
Welcome to the modern world of pronouns: identity-inclusive Natural Language Processing beyond gender
The world of pronouns is changing, from a closed class of words with few
members to a much more open set of terms that reflect identities. However,
Natural Language Processing (NLP) is barely reflecting this linguistic shift,
even though recent work outlined the harms of gender-exclusive language
technology. Particularly problematic is the current modeling of 3rd person
pronouns, as it largely ignores various phenomena like neopronouns, i.e.,
pronoun sets that are novel and not (yet) widely established. This omission
contributes to the discrimination of marginalized and underrepresented groups,
e.g., non-binary individuals. However, other identity-expression phenomena
beyond gender are also ignored by current NLP technology. In this paper, we
provide an overview of 3rd person pronoun issues for NLP. Based on our
observations and ethical considerations, we define a series of desiderata for
modeling pronouns in language technology. We evaluate existing and novel
modeling approaches w.r.t. these desiderata qualitatively, and quantify the
impact of a more discrimination-free approach on established benchmark data.
Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence
Topic models extract meaningful groups of words from documents, allowing for
a better understanding of data. However, the solutions are often not coherent
enough, and thus harder to interpret. Coherence can be improved by adding more
contextual knowledge to the model. Recently, neural topic models have become
available, while BERT-based representations have further pushed the state of
the art of neural models in general. We combine pre-trained representations and
neural topic models. Pre-trained BERT sentence embeddings indeed support the
generation of more meaningful and coherent topics than either standard LDA or
existing neural topic models. Results on four datasets show that our approach
effectively increases topic coherence.