168 research outputs found

Five sources of bias in natural language processing

Author: Hovy Dirk
Prabhumoye Shrimai
Publication venue: 'Wiley'
Publication date: 01/01/2021

Recently, there has been an increased interest in demographically grounded bias in natural language processing (NLP) applications. Much of the recent work has focused on describing bias and providing an overview of bias in a larger context. Here, we provide a simple, actionable summary of this recent work. We outline five sources where bias can occur in NLP systems: (1) the data, (2) the annotation process, (3) the input representations, (4) the models, and finally (5) the research design (or how we conceptualize our research). We explore each of the bias sources in detail in this article, including examples and links to related work, as well as potential counter-measures.

Archivio istituzionale della Ricerca - Bocconi

PubMed Central

On the gap between adoption and understanding in NLP

Author: Bianchi Federico
Hovy Dirk
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2021

No abstract available

Archivio istituzionale della Ricerca - Bocconi

Open Access Repository

HONEST: measuring hurtful sentence completion in language models

Author: Bianchi Federico
Hovy Dirk
Nozza Debora
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2021

No abstract available

Archivio istituzionale della Ricerca - Bocconi

XLM-EMO: multilingual emotion prediction in social media text

Author: Bianchi Federico
Hovy Dirk
Nozza Debora
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2022

Archivio istituzionale della Ricerca - Bocconi

Twitter-demographer: a flow-based tool to enrich Twitter data

Author: Bianchi Federico
Cutrona Vincenzo
Hovy Dirk
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2022

Twitter data have become essential to Natural Language Processing (NLP) and social science research, driving various scientific discoveries in recent years. However, the textual data alone are often not enough to conduct studies: social scientists especially need more variables to perform their analysis and control for various factors. How we augment this information, such as users' location, age, or tweet sentiment, has ramifications for anonymity and reproducibility, and requires dedicated effort. This paper describes Twitter-Demographer, a simple, flow-based tool to enrich Twitter data with additional information about tweets and users. Twitter-Demographer is aimed at NLP practitioners and (computational) social scientists who want to enrich their datasets with aggregated information, facilitating reproducibility, and providing algorithmic privacy-by-design measures for pseudo-anonymity. We discuss our design choices, inspired by the flow-based programming paradigm, to use black-box components that can easily be chained together and extended. We also analyze the ethical issues related to the use of this tool, and the built-in measures to facilitate pseudo-anonymity.

arXiv.org e-Print Archive

Archivio istituzionale della Ricerca - Bocconi

Pipelines for social bias testing of large language models

Author: Bianchi Federico
Hovy Dirk
Nozza Debora
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2022

Archivio istituzionale della Ricerca - Bocconi

Language invariant properties in Natural Language Processing

Author: Bianchi Federico
Hovy Dirk
Nozza Debora
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2022

Archivio istituzionale della Ricerca - Bocconi

Welcome to the modern world of pronouns: identity-inclusive Natural Language Processing beyond gender

Author: Crowley Archie
Hovy Dirk
Lauscher Anne
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2022

The world of pronouns is changing: from a closed class of words with few members to a much more open set of terms to reflect identities. However, Natural Language Processing (NLP) is barely reflecting this linguistic shift, even though recent work outlined the harms of gender-exclusive language technology. Particularly problematic is the current modeling of 3rd person pronouns, as it largely ignores various phenomena like neopronouns, i.e., pronoun sets that are novel and not (yet) widely established. This omission contributes to the discrimination of marginalized and underrepresented groups, e.g., non-binary individuals. However, other identity-expression phenomena beyond gender are also ignored by current NLP technology. In this paper, we provide an overview of 3rd person pronoun issues for NLP. Based on our observations and ethical considerations, we define a series of desiderata for modeling pronouns in language technology. We evaluate existing and novel modeling approaches w.r.t. these desiderata qualitatively, and quantify the impact of a more discrimination-free approach on established benchmark data.

arXiv.org e-Print Archive

Archivio istituzionale della Ricerca - Bocconi

Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

Author: Bianchi Federico
Hovy Dirk
Terragni Silvia
Publication venue
Publication date: 08/04/2020

Topic models extract meaningful groups of words from documents, allowing for a better understanding of data. However, the solutions are often not coherent enough, and thus harder to interpret. Coherence can be improved by adding more contextual knowledge to the model. Recently, neural topic models have become available, while BERT-based representations have further pushed the state of the art of neural models in general. We combine pre-trained representations and neural topic models. Pre-trained BERT sentence embeddings indeed support the generation of more meaningful and coherent topics than either standard LDA or existing neural topic models. Results on four datasets show that our approach effectively increases topic coherence.

arXiv.org e-Print Archive

Archivio istituzionale della Ricerca - Bocconi