156 research outputs found
Hypothesis Engineering for Zero-Shot Hate Speech Detection
Standard approaches to hate speech detection rely on sufficient annotated hate speech data being available. Extending previous work that repurposes natural language inference (NLI) models for zero-shot text classification, we propose a simple approach that combines multiple hypotheses to improve English NLI-based zero-shot hate speech detection. We first conduct an error analysis for vanilla NLI-based zero-shot hate speech detection and then develop four strategies based on this analysis. The strategies use multiple hypotheses to predict various aspects of an input text and combine these predictions into a final verdict. We find that the zero-shot baseline used for the initial error analysis already outperforms commercial systems and fine-tuned BERT-based hate speech detection models on HateCheck. Combining the proposed strategies further increases the zero-shot accuracy of 79.4% on HateCheck by 7.9 percentage points (pp) and the accuracy of 69.6% on ETHOS by 10.0 pp.
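To make the multi-hypothesis idea concrete, here is a minimal sketch of NLI-based zero-shot detection that scores several hypotheses and combines them with a simple any-entailed rule. The model choice, the hypothesis wordings, and the combination rule are illustrative assumptions, not the paper's exact strategies.

```python
# Sketch: zero-shot hate speech detection by combining multiple NLI hypotheses.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "facebook/bart-large-mnli"          # any English NLI model would do
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Each hypothesis probes a different aspect of the input text (illustrative).
HYPOTHESES = [
    "This text contains hate speech.",
    "This text attacks or dehumanises a group of people.",
    "This text contains a slur.",
]

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # MNLI-style models typically order labels (contradiction, neutral, entailment);
    # check model.config.id2label before relying on index 2.
    return probs[2].item()

def is_hateful(text: str, threshold: float = 0.5) -> bool:
    # Simple combination rule: flag the text if any single hypothesis is entailed.
    return max(entailment_prob(text, h) for h in HYPOTHESES) >= threshold
```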
“It's Not Just Hate”: A Multi-Dimensional Perspective on Detecting Harmful Speech Online
Well-annotated data is a prerequisite for good Natural Language Processing models. Too often, though, annotation decisions are governed by optimizing time or annotator agreement. We make a case for nuanced efforts in an interdisciplinary setting for annotating offensive online speech. Detecting offensive content is rapidly becoming one of the most important real-world NLP tasks. However, most datasets use a single binary label, e.g., for hate or incivility, even though each concept is multi-faceted. This modeling choice not only severely limits nuanced insights, but also hurts performance. We show that a more fine-grained multi-label approach to predicting incivility and hateful or intolerant content addresses both conceptual and performance issues. We release a novel dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels for different aspects of incivility and intolerance. Our dataset not only allows for a more nuanced understanding of harmful speech online; models trained on it also outperform or match performance on benchmark datasets. Warning: This paper contains examples of hateful language that some readers might find offensive.
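The sketch below shows the kind of multi-label setup the paper argues for: one independent sigmoid output per annotated dimension rather than a single binary label. The label names and the base model are placeholders, not the released dataset's schema.

```python
# Sketch: multi-label classifier with one independent decision per dimension.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["profanity", "insult", "threat", "negative_stereotyping",
          "dehumanisation", "rejection"]            # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",      # per-label sigmoid + BCE loss
)

def predict(text: str, threshold: float = 0.5) -> dict:
    """Return an independent yes/no decision for each label."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return {label: bool(p >= threshold) for label, p in zip(LABELS, probs)}
```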
Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection
Behavioural testing -- verifying system capabilities by validating human-designed input-output pairs -- is an alternative evaluation method for natural language processing systems, proposed to address the shortcomings of the standard approach: computing metrics on held-out data. While behavioural tests capture human prior knowledge and insights, there has been little exploration of how to leverage them for model training and development. With this in mind, we explore behaviour-aware learning by examining several fine-tuning schemes using HateCheck, a suite of functional tests for hate speech detection systems. To address potential pitfalls of training on data originally intended for evaluation, we train and evaluate models on different configurations of HateCheck by holding out categories of test cases, which enables us to estimate performance on potentially overlooked system properties. The fine-tuning procedure led to improvements in classification accuracy on held-out functionalities and identity groups, suggesting that models can potentially generalise to overlooked functionalities. However, performance on held-out functionality classes and on i.i.d. hate speech detection data decreased, which indicates that generalisation occurs mostly across functionalities from the same class and that the procedure led to overfitting to the HateCheck data distribution.
Comment: 9 pages, 5 figures. Accepted at the First Workshop on Efficient Benchmarking in NLP (NLP Power!)
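A minimal sketch of the held-out-functionality configuration described above: hold out whole functionality categories of HateCheck, fine-tune on the rest, and reserve the held-out cases for evaluation. The file name, column names, and chosen hold-out set are assumptions about the released test suite, not the paper's exact setup.

```python
# Sketch: hold out HateCheck functionalities before behaviour-aware fine-tuning.
import pandas as pd

df = pd.read_csv("hatecheck_test_cases.csv")       # released HateCheck test suite

HELD_OUT = {"slur_h", "profanity_nh"}              # hypothetical functionality hold-out
train_df = df[~df["functionality"].isin(HELD_OUT)]
eval_df = df[df["functionality"].isin(HELD_OUT)]

# train_df feeds a standard fine-tuning loop (e.g. a transformers Trainer with
# test_case as input and label_gold as target); eval_df then estimates how well
# the fine-tuned model generalises to functionalities it never saw in training.
```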
Improving the evaluation and effectiveness of hate speech detection models
Online hate speech is a widespread and deeply harmful problem. To tackle hate at scale, we need models that can automatically detect it. This has motivated research in Natural Language Processing (NLP) to develop text-based hate speech detection models. In recent years, these models have improved substantially, following general advances in language modelling.
In my thesis, I show that impressive headline results paint an incomplete picture of model quality. I argue that much progress in hate speech detection so far has rested on simplifying assumptions, which, while useful in some settings, we need to move past in order to develop truly effective models. In particular, I argue that current standards for model evaluation tend to be overly aggregated, static, monolithic and English language-centric, because of four common simplifying assumptions, which I use to structure my thesis.
I discuss core concepts in an introduction and literature review. Then, I present four Chapters, which each challenge one of the four simplifying assumptions. Assumption 1 is that model accuracy equals model quality. I introduce a suite of functional tests for hate speech detection models, which enables fine-grained diagnostic insights and reveals critical weaknesses in seemingly accurate models. Assumption 2 is that hate speech today equals hate speech tomorrow. I find that model performance degrades over time and explore temporal adaptation as a remedy. Assumption 3 is that hate speech for me equals hate speech for you. I evidence subjectivity in labelling hate speech and introduce two contrasting data annotation paradigms for managing subjectivity. Assumption 4 is that hate speech in English equals all hate speech. I explore data-efficient strategies for expanding detection into more under-resourced languages.
Overall, my thesis seeks to 1) work towards better, more comprehensive quality standards for hate speech detection models, and 2) improve models against these standards.
HateCheck: functional tests for hate speech detection models
Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.
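To illustrate how such a test suite is used diagnostically, the sketch below runs an off-the-shelf detector over HateCheck test cases and reports accuracy per functionality instead of a single aggregate score. The detector, file name, column names, and label mapping are illustrative assumptions.

```python
# Sketch: per-functionality accuracy of a hate speech detector on HateCheck.
import pandas as pd
from transformers import pipeline

clf = pipeline("text-classification",
               model="cardiffnlp/twitter-roberta-base-hate")   # example detector

df = pd.read_csv("hatecheck_test_cases.csv")

def to_gold_label(prediction: dict) -> str:
    # Map the detector's label space onto HateCheck's hateful / non-hateful labels;
    # adjust this mapping to the chosen model's id2label.
    return "hateful" if prediction["label"].lower() == "hate" else "non-hateful"

df["pred"] = [to_gold_label(p) for p in clf(df["test_case"].tolist(), truncation=True)]
accuracy = (df["pred"] == df["label_gold"]).groupby(df["functionality"]).mean()
print(accuracy.sort_values())        # weakest functionalities first
```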
Testing Hateful Speeches against Policies
In recent years, many software systems have adopted AI techniques, especially deep learning. Due to their black-box nature, AI-based systems bring challenges to traceability, because AI system behaviors are based on models and data, whereas requirements or policies are rules in the form of natural or programming language. To the best of our knowledge, there is a limited amount of work on how AI and deep neural network-based systems behave against rule-based requirements and policies. This experience paper examines deep neural network behaviors against rule-based requirements described in natural language policies. In particular, we focus on a case study that checks AI-based content moderation software against content moderation policies. First, using crowdsourcing, we collect natural language test cases that match each moderation policy and name this dataset HateModerate. Second, using the test cases in HateModerate, we test the failure rates of state-of-the-art hate speech detection software and find that these models have high failure rates for certain policies. Finally, since manual labeling is costly, we propose an automated approach to augment HateModerate by fine-tuning OpenAI's large language models to automatically match new examples to policies. The dataset and code for this work can be found on our anonymous website: https://sites.google.com/view/content-moderation-project
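A minimal sketch of the per-policy failure-rate analysis described above follows; the file name, column names, and label encoding are illustrative assumptions about HateModerate's layout rather than the released schema.

```python
# Sketch: failure rate of a hate speech detector per content moderation policy.
import pandas as pd

df = pd.read_csv("hatemoderate_test_cases.csv")    # assumed columns: policy_id, example, label

def failure_rate_per_policy(df: pd.DataFrame, predict) -> pd.Series:
    """predict: callable mapping an example string to a binary hate/non-hate label."""
    failed = df["example"].map(predict) != df["label"]
    return failed.groupby(df["policy_id"]).mean().sort_values(ascending=False)

# Policies at the top of the resulting ranking are the requirements the detector
# fails most often, i.e. the high-failure-rate policies the abstract refers to.
```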
Benchmarking Offensive and Abusive Language in Dutch Tweets
We present an extensive evaluation of different fine-tuned models to detect instances of offensive and abusive language in Dutch across three benchmarks: a standard held-out test set, a task-agnostic functional benchmark, and a dynamic test set. We also investigate the use of data cartography to identify high-quality training data. Our results show a relatively good quality of the manually annotated data used to train the models, while highlighting some critical weaknesses. We also find good portability of the trained models across the same language phenomena. As for data cartography, we find a positive impact only on the functional benchmark, and only when selecting data per annotated dimension rather than using the entire training material.
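For readers unfamiliar with data cartography (training dynamics in the sense of Swayamdipta et al., 2020), the sketch below shows the selection step under the assumption that the gold-class probability of each training example has been logged after every epoch; the thresholds and file layout are illustrative, not those used in the paper.

```python
# Sketch: data cartography selection from logged training dynamics.
import numpy as np

# probs[i, e] = probability assigned to example i's gold label after epoch e
probs = np.load("gold_probs_per_epoch.npy")        # assumed shape: (n_examples, n_epochs)

confidence = probs.mean(axis=1)                    # average certainty on the gold label
variability = probs.std(axis=1)                    # how much that certainty fluctuates

# Keep ambiguous examples (high variability) plus mid-confidence ones, dropping
# trivially easy and hopelessly hard items; thresholds are illustrative.
keep = (variability > 0.2) | ((confidence > 0.25) & (confidence < 0.75))
selected_indices = np.where(keep)[0]
```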
Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?
Generation-based data augmentation (DA) has been presented in several works as a way to improve offensive language detection. However, the effectiveness of generative DA has been shown only in limited scenarios, and the potential injection of biases when using generated data to classify offensive language has not been investigated. Our aim is to analyze the feasibility of generative data augmentation in more depth, with two main focuses. First, we investigate the robustness of models trained on generated data in a variety of data augmentation setups, both novel and already presented in previous work, and compare their performance on four widely used English offensive language datasets that present inherent differences in terms of content and complexity. In addition, we analyze models using the HateCheck suite, a series of functional tests created to challenge hate speech detection systems. Second, we investigate potential lexical bias issues through a qualitative analysis of the generated data. We find that the potential positive impact of generative data augmentation on model performance is unreliable, and generative DA can also have unpredictable effects on lexical bias.
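As a concrete picture of what generation-based DA involves, here is a minimal sketch that prompts a language model with labelled seed examples and adds the continuations to the training set under each seed's label. The prompt format and GPT-2 choice are illustrative; as the abstract notes, the effect of such augmentation on performance and lexical bias is unreliable.

```python
# Sketch: generation-based data augmentation from labelled seed examples.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # illustrative generator

def augment(seed_texts, label, n_per_seed=3):
    """Generate continuations for each seed and inherit the seed's label."""
    augmented = []
    for seed in seed_texts:
        outputs = generator(seed, num_return_sequences=n_per_seed,
                            max_new_tokens=40, do_sample=True)
        for out in outputs:
            new_text = out["generated_text"][len(seed):].strip()
            if new_text:
                augmented.append({"text": new_text, "label": label})
    return augmented
```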
Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data
Most research on hate speech detection has focused on English, where a sizeable amount of labeled training data is available. However, to expand hate speech detection into more languages, approaches that require minimal training data are needed. In this paper, we test whether natural language inference (NLI) models, which perform well in zero- and few-shot settings, can benefit hate speech detection performance in scenarios where only a limited amount of labeled data is available in the target language. Our evaluation on five languages demonstrates large performance improvements of NLI fine-tuning over direct fine-tuning in the target language. However, the effectiveness of previous work that proposed intermediate fine-tuning on English data is hard to match. Only in settings where the English training data does not match the test domain can our customised NLI formulation outperform intermediate fine-tuning on English. Based on our extensive experiments, we propose a set of recommendations for hate speech detection in languages where minimal labeled training data is available.
Comment: 15 pages, 7 figures. Accepted at the 7th Workshop on Online Abuse and Harms (WOAH), ACL 2023
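A minimal sketch of the NLI formulation for a limited-data target language: each labelled example becomes a premise-hypothesis pair so that a multilingual NLI model can be fine-tuned from its existing entailment head. The hypothesis wording and label ids are assumptions and must match the chosen model's configuration, not the paper's exact formulation.

```python
# Sketch: casting a small labelled hate speech dataset as NLI pairs.
from datasets import Dataset

HYPOTHESIS = "This text is hate speech."     # hypothesis wording is an assumption

# Label ids must follow the chosen NLI model's config (model.config.label2id);
# the values below are placeholders.
ENTAILMENT, CONTRADICTION = 0, 2

def to_nli_pairs(texts, hate_labels):
    return Dataset.from_dict({
        "premise": texts,
        "hypothesis": [HYPOTHESIS] * len(texts),
        "label": [ENTAILMENT if y == 1 else CONTRADICTION for y in hate_labels],
    })

# The resulting pairs can fine-tune a multilingual NLI model (e.g. an XNLI-trained
# XLM-R) with a standard sequence-classification Trainer, reusing its NLI head
# instead of training a fresh classification layer.
```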
SGHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Singapore
To address the limitations of current hate speech detection models, we
introduce \textsf{SGHateCheck}, a novel framework designed for the linguistic
and cultural context of Singapore and Southeast Asia. It extends the functional
testing approach of HateCheck and MHC, employing large language models for
translation and paraphrasing into Singapore's main languages, and refining
these with native annotators. \textsf{SGHateCheck} reveals critical flaws in
state-of-the-art models, highlighting their inadequacy in sensitive content
moderation. This work aims to foster the development of more effective hate
speech detection tools for diverse linguistic environments, particularly for
Singapore and Southeast Asia contexts
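The porting step can be pictured as below: machine-translate English test cases into a target language and queue every draft for native-annotator refinement. An open translation model stands in here for the large language models the paper uses; the language code, file names, and column name are assumptions.

```python
# Sketch: draft translations of HateCheck test cases for native-annotator review.
import pandas as pd
from transformers import pipeline

translator = pipeline("translation",
                      model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn", tgt_lang="zsm_Latn")   # English -> Malay

df = pd.read_csv("hatecheck_test_cases.csv")
df["draft_translation"] = [t["translation_text"]
                           for t in translator(df["test_case"].tolist())]
df["needs_native_review"] = True     # every draft is refined by native annotators
df.to_csv("sghatecheck_drafts_ms.csv", index=False)
```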
