32 research outputs found
Temporal adaptation of BERT and performance on downstream document classification: insights from social media
Language use differs between domains and even within a domain, language use changes over time. For pre-trained language models like BERT, domain adaptation through continued pre-training has been shown to improve performance on in-domain downstream tasks. In this article, we investigate whether temporal adaptation can bring additional benefits. For this purpose, we introduce a corpus of social media comments sampled over three years. It contains unlabelled data for adaptation and evaluation on an upstream masked language modelling task as well as labelled data for fine-tuning and evaluation on a downstream document classification task. We find that temporality matters for both tasks: temporal adaptation improves upstream and temporal fine-tuning downstream task performance. Time-specific models generally perform better on past than on future test sets, which matches evidence on the bursty usage of topical words. However, adapting BERT to time and domain does not improve performance on the downstream task over only adapting to domain. Token-level analysis shows that temporal adaptation captures event-driven changes in language use in the downstream task, but not those changes that are actually relevant to task performance. Based on our findings, we discuss when temporal adaptation may be more effective
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset
Hate speech detection models are only as good as the data they are trained on. Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca.\ 11k examples. During data collection, we explore new strategies for supporting annotators, to create more diverse adversarial examples more efficiently and provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness. Further, we find that mixing multiple support strategies is most advantageous. We make GAHD publicly available at https://github.com/jagol/gahd
Two contrasting data annotation paradigms for subjective NLP tasks
Labelled data is the foundation of most natural language processing tasks. However, labelling data is difficult and there often are diverse valid beliefs about what the correct data labels should be. So far, dataset creators have acknowledged annotator subjectivity, but rarely actively managed it in the annotation process. This has led to partly-subjective datasets that fail to serve a clear downstream use. To address this issue, we propose two contrasting paradigms for data annotation. The descriptive paradigm encourages annotator subjectivity, whereas the prescriptive paradigm discourages it. Descriptive annotation allows for the surveying and modelling of different beliefs, whereas prescriptive annotation enables the training of models that consistently apply one belief. We discuss benefits and challenges in implementing both paradigms, and argue that dataset creators should explicitly aim for one or the other to facilitate the intended use of their dataset. Lastly, we conduct an annotation experiment using hate speech data that illustrates the contrast between the two paradigms
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
Labelled data is the foundation of most natural language processing tasks.
However, labelling data is difficult and there often are diverse valid beliefs
about what the correct data labels should be. So far, dataset creators have
acknowledged annotator subjectivity, but rarely actively managed it in the
annotation process. This has led to partly-subjective datasets that fail to
serve a clear downstream use. To address this issue, we propose two contrasting
paradigms for data annotation. The descriptive paradigm encourages annotator
subjectivity, whereas the prescriptive paradigm discourages it. Descriptive
annotation allows for the surveying and modelling of different beliefs, whereas
prescriptive annotation enables the training of models that consistently apply
one belief. We discuss benefits and challenges in implementing both paradigms,
and argue that dataset creators should explicitly aim for one or the other to
facilitate the intended use of their dataset. Lastly, we conduct an annotation
experiment using hate speech data that illustrates the contrast between the two
paradigms.Comment: Accepted at NAACL 2022 (Main Conference
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models
In this paper, we address the concept of "alignment" in large language models
(LLMs) through the lens of post-structuralist socio-political theory,
specifically examining its parallels to empty signifiers. To establish a shared
vocabulary around how abstract concepts of alignment are operationalised in
empirical datasets, we propose a framework that demarcates: 1) which dimensions
of model behaviour are considered important, then 2) how meanings and
definitions are ascribed to these dimensions, and by whom. We situate existing
empirical literature and provide guidance on deciding which paradigm to follow.
Through this framework, we aim to foster a culture of transparency and critical
evaluation, aiding the community in navigating the complexities of aligning
LLMs with human populations.Comment: Socially Responsible Language Modelling Research (SoLaR) @ NeurIPs
202
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Without proper safeguards, large language models will readily follow
malicious instructions and generate toxic content. This motivates safety
efforts such as red-teaming and large-scale feedback learning, which aim to
make models both helpful and harmless. However, there is a tension between
these two objectives, since harmlessness requires models to refuse complying
with unsafe prompts, and thus not be helpful. Recent anecdotal evidence
suggests that some models may have struck a poor balance, so that even clearly
safe prompts are refused if they use similar language to unsafe prompts or
mention sensitive topics. In this paper, we introduce a new test suite called
XSTest to identify such eXaggerated Safety behaviours in a structured and
systematic way. In its current form, XSTest comprises 200 safe prompts across
ten prompt types that well-calibrated models should not refuse to comply with.
We describe XSTest's creation and composition, and use the test suite to
highlight systematic failure modes in a recently-released state-of-the-art
language model.Comment: v1 to document initial data releas
HateCheck: functional tests for hate speech detection models
Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Training large language models to follow instructions makes them perform
better on a wide range of tasks, generally becoming more helpful. However, a
perfectly helpful model will follow even the most malicious instructions and
readily generate harmful content. In this paper, we raise concerns over the
safety of models that only emphasize helpfulness, not safety, in their
instruction-tuning. We show that several popular instruction-tuned models are
highly unsafe. Moreover, we show that adding just 3% safety examples (a few
hundred demonstrations) in the training set when fine-tuning a model like LLaMA
can substantially improve their safety. Our safety-tuning does not make models
significantly less capable or helpful as measured by standard benchmarks.
However, we do find a behavior of exaggerated safety, where too much
safety-tuning makes models refuse to respond to reasonable prompts that
superficially resemble unsafe ones. Our study sheds light on trade-offs in
training LLMs to follow instructions and exhibit safe behavior