156 research outputs found
Hypothesis Engineering for Zero-Shot Hate Speech Detection
Standard approaches to hate speech detection rely on sufficient annotated hate speech data being available. Extending previous work that repurposes natural language inference (NLI) models for zero-shot text classification, we propose a simple approach that combines multiple hypotheses to improve English NLI-based zero-shot hate speech detection. We first conduct an error analysis for vanilla NLI-based zero-shot hate speech detection and then develop four strategies based on this analysis. The strategies use multiple hypotheses to predict various aspects of an input text and combine these predictions into a final verdict. We find that the zero-shot baseline used for the initial error analysis already outperforms commercial systems and fine-tuned BERT-based hate speech detection models on HateCheck. Combining the proposed strategies further increases the zero-shot accuracy of 79.4% on HateCheck by 7.9 percentage points (pp) and the accuracy of 69.6% on ETHOS by 10.0 pp.
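To make the multi-hypothesis idea concrete, here is a minimal sketch of NLI-based zero-shot detection that scores several hypotheses and combines them with a simple any-entailed rule. The model choice, the hypothesis wordings, and the combination rule are illustrative assumptions, not the paper's exact strategies.

```python
# Sketch: zero-shot hate speech detection by combining multiple NLI hypotheses.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "facebook/bart-large-mnli"          # any English NLI model would do
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Each hypothesis probes a different aspect of the input text (illustrative).
HYPOTHESES = [
    "This text contains hate speech.",
    "This text attacks or dehumanises a group of people.",
    "This text contains a slur.",
]

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # MNLI-style models typically order labels (contradiction, neutral, entailment);
    # check model.config.id2label before relying on index 2.
    return probs[2].item()

def is_hateful(text: str, threshold: float = 0.5) -> bool:
    # Simple combination rule: flag the text if any single hypothesis is entailed.
    return max(entailment_prob(text, h) for h in HYPOTHESES) >= threshold
```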
“It's Not Just Hate”: A Multi-Dimensional Perspective on Detecting Harmful Speech Online
Well-annotated data is a prerequisite for good Natural Language Processing models. Too often, though, annotation decisions are governed by optimizing time or annotator agreement. We make a case for nuanced efforts in an interdisciplinary setting for annotating offensive online speech. Detecting offensive content is rapidly becoming one of the most important real-world NLP tasks. However, most datasets use a single binary label, e.g., for hate or incivility, even though each concept is multi-faceted. This modeling choice not only severely limits nuanced insights, but also hurts performance. We show that a more fine-grained multi-label approach to predicting incivility and hateful or intolerant content addresses both conceptual and performance issues. We release a novel dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels for different aspects of incivility and intolerance. Our dataset not only allows for a more nuanced understanding of harmful speech online; models trained on it also outperform or match performance on benchmark datasets. Warning: This paper contains examples of hateful language that some readers might find offensive.
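The sketch below shows the kind of multi-label setup the paper argues for: one independent sigmoid output per annotated dimension rather than a single binary label. The label names and the base model are placeholders, not the released dataset's schema.

```python
# Sketch: multi-label classifier with one independent decision per dimension.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["profanity", "insult", "threat", "negative_stereotyping",
          "dehumanisation", "rejection"]            # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",      # per-label sigmoid + BCE loss
)

def predict(text: str, threshold: float = 0.5) -> dict:
    """Return an independent yes/no decision for each label."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return {label: bool(p >= threshold) for label, p in zip(LABELS, probs)}
```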
Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection
Behavioural testing -- verifying system capabilities by validating human-designed input-output pairs -- is an alternative evaluation method for natural language processing systems, proposed to address the shortcomings of the standard approach: computing metrics on held-out data. While behavioural tests capture human prior knowledge and insights, there has been little exploration of how to leverage them for model training and development. With this in mind, we explore behaviour-aware learning by examining several fine-tuning schemes using HateCheck, a suite of functional tests for hate speech detection systems. To address potential pitfalls of training on data originally intended for evaluation, we train and evaluate models on different configurations of HateCheck by holding out categories of test cases, which enables us to estimate performance on potentially overlooked system properties. The fine-tuning procedure led to improvements in classification accuracy on held-out functionalities and identity groups, suggesting that models can potentially generalise to overlooked functionalities. However, performance on held-out functionality classes and on i.i.d. hate speech detection data decreased, which indicates that generalisation occurs mostly across functionalities from the same class and that the procedure led to overfitting to the HateCheck data distribution.
Comment: 9 pages, 5 figures. Accepted at the First Workshop on Efficient Benchmarking in NLP (NLP Power!)
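A minimal sketch of the held-out-functionality configuration described above: hold out whole functionality categories of HateCheck, fine-tune on the rest, and reserve the held-out cases for evaluation. The file name, column names, and chosen hold-out set are assumptions about the released test suite, not the paper's exact setup.

```python
# Sketch: hold out HateCheck functionalities before behaviour-aware fine-tuning.
import pandas as pd

df = pd.read_csv("hatecheck_test_cases.csv")       # released HateCheck test suite

HELD_OUT = {"slur_h", "profanity_nh"}              # hypothetical functionality hold-out
train_df = df[~df["functionality"].isin(HELD_OUT)]
eval_df = df[df["functionality"].isin(HELD_OUT)]

# train_df feeds a standard fine-tuning loop (e.g. a transformers Trainer with
# test_case as input and label_gold as target); eval_df then estimates how well
# the fine-tuned model generalises to functionalities it never saw in training.
```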
Improving the evaluation and effectiveness of hate speech detection models
Online hate speech is a widespread and deeply harmful problem. To tackle hate at scale, we need models that can automatically detect it. This has motivated research in Natural Language Processing (NLP) to develop text-based hate speech detection models. In recent years, these models have improved substantially, following general advances in language modelling.
In my thesis, I show that impressive headline results paint an incomplete picture of model quality. I argue that much progress in hate speech detection so far has rested on simplifying assumptions, which, while useful in some settings, we need to move past in order to develop truly effective models. In particular, I argue that current standards for model evaluation tend to be overly aggregated, static, monolithic and English language-centric, because of four common simplifying assumptions, which I use to structure my thesis.
I discuss core concepts in an introduction and literature review. Then, I present four Chapters, which each challenge one of the four simplifying assumptions. Assumption 1 is that model accuracy equals model quality. I introduce a suite of functional tests for hate speech detection models, which enables fine-grained diagnostic insights and reveals critical weaknesses in seemingly accurate models. Assumption 2 is that hate speech today equals hate speech tomorrow. I find that model performance degrades over time and explore temporal adaptation as a remedy. Assumption 3 is that hate speech for me equals hate speech for you. I evidence subjectivity in labelling hate speech and introduce two contrasting data annotation paradigms for managing subjectivity. Assumption 4 is that hate speech in English equals all hate speech. I explore data-efficient strategies for expanding detection into more under-resourced languages.
Overall, my thesis seeks to 1) work towards better, more comprehensive quality standards for hate speech detection models, and 2) improve models against these standards.
HateCheck: functional tests for hate speech detection models
Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.
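To illustrate how such a test suite is used diagnostically, the sketch below runs an off-the-shelf detector over HateCheck test cases and reports accuracy per functionality instead of a single aggregate score. The detector, file name, column names, and label mapping are illustrative assumptions.

```python
# Sketch: per-functionality accuracy of a hate speech detector on HateCheck.
import pandas as pd
from transformers import pipeline

clf = pipeline("text-classification",
               model="cardiffnlp/twitter-roberta-base-hate")   # example detector

df = pd.read_csv("hatecheck_test_cases.csv")

def to_gold_label(prediction: dict) -> str:
    # Map the detector's label space onto HateCheck's hateful / non-hateful labels;
    # adjust this mapping to the chosen model's id2label.
    return "hateful" if prediction["label"].lower() == "hate" else "non-hateful"

df["pred"] = [to_gold_label(p) for p in clf(df["test_case"].tolist(), truncation=True)]
accuracy = (df["pred"] == df["label_gold"]).groupby(df["functionality"]).mean()
print(accuracy.sort_values())        # weakest functionalities first
```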
Testing Hateful Speeches against Policies
In recent years, many software systems have adopted AI techniques, especially deep learning. Due to their black-box nature, AI-based systems bring challenges to traceability, because AI system behaviors are based on models and data, whereas requirements or policies are rules in the form of natural or programming language. To the best of our knowledge, there is a limited amount of work on how AI and deep neural network-based systems behave against rule-based requirements and policies. This experience paper examines deep neural network behaviors against rule-based requirements described in natural language policies. In particular, we focus on a case study that checks AI-based content moderation software against content moderation policies. First, using crowdsourcing, we collect natural language test cases that match each moderation policy and name this dataset HateModerate. Second, using the test cases in HateModerate, we test the failure rates of state-of-the-art hate speech detection software and find that these models have high failure rates for certain policies. Finally, since manual labeling is costly, we propose an automated approach to augment HateModerate by fine-tuning OpenAI's large language models to automatically match new examples to policies. The dataset and code for this work can be found on our anonymous website: https://sites.google.com/view/content-moderation-project
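A minimal sketch of the per-policy failure-rate analysis described above follows; the file name, column names, and label encoding are illustrative assumptions about HateModerate's layout rather than the released schema.

```python
# Sketch: failure rate of a hate speech detector per content moderation policy.
import pandas as pd

df = pd.read_csv("hatemoderate_test_cases.csv")    # assumed columns: policy_id, example, label

def failure_rate_per_policy(df: pd.DataFrame, predict) -> pd.Series:
    """predict: callable mapping an example string to a binary hate/non-hate label."""
    failed = df["example"].map(predict) != df["label"]
    return failed.groupby(df["policy_id"]).mean().sort_values(ascending=False)

# Policies at the top of the resulting ranking are the requirements the detector
# fails most often, i.e. the high-failure-rate policies the abstract refers to.
```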
Benchmarking Offensive and Abusive Language in Dutch Tweets
We present an extensive evaluation of different fine-tuned models to detect instances of offensive and abusive language in Dutch across three benchmarks: a standard held-out test set, a task-agnostic functional benchmark, and a dynamic test set. We also investigate the use of data cartography to identify high-quality training data. Our results show a relatively good quality of the manually annotated data used to train the models, while highlighting some critical weaknesses. We also find good portability of the trained models across the same language phenomena. As for data cartography, we find a positive impact only on the functional benchmark, and only when selecting data per annotated dimension rather than using the entire training material.
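For readers unfamiliar with data cartography (training dynamics in the sense of Swayamdipta et al., 2020), the sketch below shows the selection step under the assumption that the gold-class probability of each training example has been logged after every epoch; the thresholds and file layout are illustrative, not those used in the paper.

```python
# Sketch: data cartography selection from logged training dynamics.
import numpy as np

# probs[i, e] = probability assigned to example i's gold label after epoch e
probs = np.load("gold_probs_per_epoch.npy")        # assumed shape: (n_examples, n_epochs)

confidence = probs.mean(axis=1)                    # average certainty on the gold label
variability = probs.std(axis=1)                    # how much that certainty fluctuates

# Keep ambiguous examples (high variability) plus mid-confidence ones, dropping
# trivially easy and hopelessly hard items; thresholds are illustrative.
keep = (variability > 0.2) | ((confidence > 0.25) & (confidence < 0.75))
selected_indices = np.where(keep)[0]
```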
Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?
Generation-based data augmentation (DA) has been presented in several works as a way to improve offensive language detection. However, the effectiveness of generative DA has been shown only in limited scenarios, and the potential injection of biases when using generated data to classify offensive language has not been investigated. Our aim is to analyze the feasibility of generative data augmentation in more depth, with two main focuses. First, we investigate the robustness of models trained on generated data in a variety of data augmentation setups, both novel and already presented in previous work, and compare their performance on four widely used English offensive language datasets that present inherent differences in terms of content and complexity. In addition, we analyze models using the HateCheck suite, a series of functional tests created to challenge hate speech detection systems. Second, we investigate potential lexical bias issues through a qualitative analysis of the generated data. We find that the potential positive impact of generative data augmentation on model performance is unreliable, and generative DA can also have unpredictable effects on lexical bias.
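As a concrete picture of what generation-based DA involves, here is a minimal sketch that prompts a language model with labelled seed examples and adds the continuations to the training set under each seed's label. The prompt format and GPT-2 choice are illustrative; as the abstract notes, the effect of such augmentation on performance and lexical bias is unreliable.

```python
# Sketch: generation-based data augmentation from labelled seed examples.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # illustrative generator

def augment(seed_texts, label, n_per_seed=3):
    """Generate continuations for each seed and inherit the seed's label."""
    augmented = []
    for seed in seed_texts:
        outputs = generator(seed, num_return_sequences=n_per_seed,
                            max_new_tokens=40, do_sample=True)
        for out in outputs:
            new_text = out["generated_text"][len(seed):].strip()
            if new_text:
                augmented.append({"text": new_text, "label": label})
    return augmented
```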
Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data
Most research on hate speech detection has focused on English, where a sizeable amount of labeled training data is available. However, to expand hate speech detection into more languages, approaches that require minimal training data are needed. In this paper, we test whether natural language inference (NLI) models, which perform well in zero- and few-shot settings, can benefit hate speech detection performance in scenarios where only a limited amount of labeled data is available in the target language. Our evaluation on five languages demonstrates large performance improvements of NLI fine-tuning over direct fine-tuning in the target language. However, the effectiveness of previous work that proposed intermediate fine-tuning on English data is hard to match. Only in settings where the English training data does not match the test domain can our customised NLI formulation outperform intermediate fine-tuning on English. Based on our extensive experiments, we propose a set of recommendations for hate speech detection in languages where minimal labeled training data is available.
Comment: 15 pages, 7 figures. Accepted at the 7th Workshop on Online Abuse and Harms (WOAH), ACL 2023
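A minimal sketch of the NLI formulation for a limited-data target language: each labelled example becomes a premise-hypothesis pair so that a multilingual NLI model can be fine-tuned from its existing entailment head. The hypothesis wording and label ids are assumptions and must match the chosen model's configuration, not the paper's exact formulation.

```python
# Sketch: casting a small labelled hate speech dataset as NLI pairs.
from datasets import Dataset

HYPOTHESIS = "This text is hate speech."     # hypothesis wording is an assumption

# Label ids must follow the chosen NLI model's config (model.config.label2id);
# the values below are placeholders.
ENTAILMENT, CONTRADICTION = 0, 2

def to_nli_pairs(texts, hate_labels):
    return Dataset.from_dict({
        "premise": texts,
        "hypothesis": [HYPOTHESIS] * len(texts),
        "label": [ENTAILMENT if y == 1 else CONTRADICTION for y in hate_labels],
    })

# The resulting pairs can fine-tune a multilingual NLI model (e.g. an XNLI-trained
# XLM-R) with a standard sequence-classification Trainer, reusing its NLI head
# instead of training a fresh classification layer.
```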
SGHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Singapore
To address the limitations of current hate speech detection models, we
introduce \textsf{SGHateCheck}, a novel framework designed for the linguistic
and cultural context of Singapore and Southeast Asia. It extends the functional
testing approach of HateCheck and MHC, employing large language models for
translation and paraphrasing into Singapore's main languages, and refining
these with native annotators. \textsf{SGHateCheck} reveals critical flaws in
state-of-the-art models, highlighting their inadequacy in sensitive content
moderation. This work aims to foster the development of more effective hate
speech detection tools for diverse linguistic environments, particularly for
Singapore and Southeast Asia contexts
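The porting step can be pictured as below: machine-translate English test cases into a target language and queue every draft for native-annotator refinement. An open translation model stands in here for the large language models the paper uses; the language code, file names, and column name are assumptions.

```python
# Sketch: draft translations of HateCheck test cases for native-annotator review.
import pandas as pd
from transformers import pipeline

translator = pipeline("translation",
                      model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn", tgt_lang="zsm_Latn")   # English -> Malay

df = pd.read_csv("hatecheck_test_cases.csv")
df["draft_translation"] = [t["translation_text"]
                           for t in translator(df["test_case"].tolist())]
df["needs_native_review"] = True     # every draft is refined by native annotators
df.to_csv("sghatecheck_drafts_ms.csv", index=False)
```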
