Auditing large language models: a three-layered approach
The emergence of large language models (LLMs) represents a major advance in
artificial intelligence (AI) research. However, the widespread use of LLMs is
also coupled with significant ethical and social challenges. Previous research
has pointed towards auditing as a promising governance mechanism to help ensure
that AI systems are designed and deployed in ways that are ethical, legal, and
technically robust. However, existing auditing procedures fail to address the
governance challenges posed by LLMs, which are adaptable to a wide range of
downstream tasks. To help bridge that gap, we offer three contributions in this
article. First, we establish the need to develop new auditing procedures that
capture the risks posed by LLMs by analysing the affordances and constraints of
existing auditing procedures. Second, we outline a blueprint to audit LLMs in
feasible and effective ways by drawing on best practices from IT governance and
system engineering. Specifically, we propose a three-layered approach, whereby
governance audits, model audits, and application audits complement and inform
each other. Finally, we discuss the limitations not only of our three-layered
approach but also of the prospect of auditing LLMs at all. Ultimately, this
article seeks to expand the methodological toolkit available to technology
providers and policymakers who wish to analyse and evaluate LLMs from
technical, ethical, and legal perspectives.
Comment: Preprint, 29 pages, 2 figures
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models
In this paper, we address the concept of "alignment" in large language models
(LLMs) through the lens of post-structuralist socio-political theory,
specifically examining its parallels to empty signifiers. To establish a shared
vocabulary around how abstract concepts of alignment are operationalised in
empirical datasets, we propose a framework that demarcates: 1) which dimensions
of model behaviour are considered important, then 2) how meanings and
definitions are ascribed to these dimensions, and by whom. We situate existing
empirical literature and provide guidance on deciding which paradigm to follow.
Through this framework, we aim to foster a culture of transparency and critical
evaluation, aiding the community in navigating the complexities of aligning
LLMs with human populations.
Comment: Socially Responsible Language Modelling Research (SoLaR) @ NeurIPS 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Without proper safeguards, large language models will readily follow
malicious instructions and generate toxic content. This motivates safety
efforts such as red-teaming and large-scale feedback learning, which aim to
make models both helpful and harmless. However, there is a tension between
these two objectives, since harmlessness requires models to refuse to comply
with unsafe prompts, and thus not be helpful. Recent anecdotal evidence
suggests that some models may have struck a poor balance, so that even clearly
safe prompts are refused if they use similar language to unsafe prompts or
mention sensitive topics. In this paper, we introduce a new test suite called
XSTest to identify such eXaggerated Safety behaviours in a structured and
systematic way. In its current form, XSTest comprises 200 safe prompts across
ten prompt types that well-calibrated models should not refuse to comply with.
We describe XSTest's creation and composition, and use the test suite to
highlight systematic failure modes in a recently-released state-of-the-art
language model.
Comment: v1 to document initial data release
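To make the evaluation setup concrete, the sketch below shows one way a suite like XSTest could be run against a model: iterate over the safe prompts, collect responses, and count apparent refusals per prompt type. The CSV layout, the generate() callable, and the keyword-based refusal heuristic are illustrative assumptions, not part of the released test suite; the heuristic is a crude stand-in for proper annotation of model responses.

```python
# Minimal sketch of an XSTest-style evaluation loop.
# Assumptions: a prompts.csv file with "prompt" and "type" columns, and a
# generate() callable wrapping the model under test (both hypothetical).
import csv
from collections import Counter

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")  # crude heuristic

def is_refusal(response: str) -> bool:
    """Flag a response as a likely refusal via simple keyword matching."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def evaluate(generate, prompts_path: str = "prompts.csv") -> Counter:
    """Count refusals of safe prompts, grouped by prompt type."""
    refusals = Counter()
    with open(prompts_path, newline="") as f:
        for row in csv.DictReader(f):
            if is_refusal(generate(row["prompt"])):
                refusals[row["type"]] += 1  # a well-calibrated model should rarely land here
    return refusals
```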
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets
Vision-language models are growing in popularity and public visibility to
generate, edit, and caption images at scale; but their outputs can perpetuate
and amplify societal biases learned during pre-training on uncurated image-text
pairs from the internet. Although debiasing methods have been proposed, we
argue that existing measurements of model bias lack validity due to dataset bias.
We demonstrate there are spurious correlations in COCO Captions, the most
commonly used dataset for evaluating bias, between background context and the
gender of people in-situ. This is problematic because commonly-used bias
metrics (such as Bias@K) rely on per-gender base rates. To address this issue,
we propose a novel dataset debiasing pipeline to augment the COCO dataset with
synthetic, gender-balanced contrast sets, where only the gender of the subject
is edited and the background is fixed. However, existing image editing methods
have limitations and sometimes produce low-quality images; so, we introduce a
method to automatically filter the generated images based on their similarity
to real images. Using our balanced synthetic contrast sets, we benchmark bias
in multiple CLIP-based models, demonstrating how metrics are skewed by
imbalance in the original COCO images. Our results indicate that the proposed
approach improves the validity of the evaluation, ultimately contributing to
a more realistic understanding of bias in vision-language models.
Comment: GitHub: https://github.com/oxai/debias-gensynt
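As a rough illustration of the filtering step described above, the sketch below keeps a synthetic, gender-edited image only if its CLIP embedding stays close to that of the original real image. The choice of the open-source clip package, the ViT-B/32 backbone, and the similarity threshold are assumptions for illustration, not the paper's exact pipeline.

```python
# Hedged sketch of similarity-based filtering of synthetic edited images.
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

def embed(path: str) -> torch.Tensor:
    """Return the CLIP image embedding of the image at `path`."""
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        return model.encode_image(image).squeeze(0)

def keep_synthetic(real_path: str, synthetic_path: str, threshold: float = 0.8) -> bool:
    """Accept the edited image only if it stays close to the real original."""
    sim = torch.nn.functional.cosine_similarity(
        embed(real_path), embed(synthetic_path), dim=0
    )
    return sim.item() >= threshold  # threshold is an illustrative assumption
```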
Assessing Language Model Deployment with Risk Cards
This paper introduces RiskCards, a framework for structured assessment and
documentation of risks associated with an application of language models. As
with all language, text generated by language models can be harmful, or used to
bring about harm. Automating language generation adds both an element of scale
and more subtle or emergent undesirable tendencies to the generated text.
Prior work establishes a wide variety of language model harms to many different
actors: existing taxonomies identify categories of harms posed by language
models; benchmarks establish automated tests of these harms; and documentation
standards for models, tasks and datasets encourage transparent reporting.
However, there is no risk-centric framework for documenting the complexity of a
landscape in which some risks are shared across models and contexts, while
others are specific, and where certain conditions may be required for risks to
manifest as harms. RiskCards address this methodological gap by providing a
generic framework for assessing the use of a given language model in a given
scenario. Each RiskCard makes clear the routes by which the risk can manifest
as harm, its placement in harm taxonomies, and example prompt-output pairs. While
RiskCards are designed to be open-source, dynamic and participatory, we present
a "starter set" of RiskCards taken from a broad literature survey, each of
which details a concrete risk presentation. Language model RiskCards initiate a
community knowledge base which permits the mapping of risks and harms to a
specific model or its application scenario, ultimately contributing to a
better, safer and shared understanding of the risk landscape.
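Purely as an illustration of the structure a RiskCard captures (routes to harm, placement in harm taxonomies, example prompt-output pairs), a hypothetical in-code representation might look like the following; the field names and the example card are assumptions, not the framework's actual schema.

```python
# Illustrative, hypothetical representation of a RiskCard-like record.
from dataclasses import dataclass, field

@dataclass
class RiskCard:
    title: str                              # short name of the risk
    description: str                        # how the risk can manifest as harm
    taxonomy_categories: list[str]          # placement in existing harm taxonomies
    affected_actors: list[str]              # who may be harmed
    example_prompt_outputs: list[tuple[str, str]] = field(default_factory=list)

# Hypothetical example entry, purely for illustration.
card = RiskCard(
    title="Unsafe medical advice",
    description="Model presents unverified treatment suggestions as fact.",
    taxonomy_categories=["misinformation", "physical harm"],
    affected_actors=["end users"],
    example_prompt_outputs=[("How do I treat X at home?", "<unsafe output>")],
)
```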
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
The past year has seen rapid acceleration in the development of large
language models (LLMs). For many tasks, there is now a wide range of
open-source and open-access LLMs that are viable alternatives to proprietary
models like ChatGPT. Without proper steering and safeguards, however, LLMs will
readily follow malicious instructions, provide unsafe advice, and generate
toxic content. This is a critical safety risk for businesses and developers. We
introduce SimpleSafetyTests as a new test suite for rapidly and systematically
identifying such critical safety risks. The test suite comprises 100 test
prompts across five harm areas that LLMs, for the vast majority of
applications, should refuse to comply with. We test 11 popular open LLMs and
find critical safety weaknesses in several of them. While some LLMs do not give
a single unsafe response, most models we test respond unsafely in more than 20%
of cases, with over 50% unsafe responses in the extreme. Prepending a
safety-emphasising system prompt substantially reduces the occurrence of unsafe
responses, but does not completely stop them from happening. We recommend that
developers use such system prompts as a first line of defence against critical
safety risks.
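A minimal sketch of the recommended first line of defence is shown below: prepend a safety-emphasising system prompt to every request and measure the unsafe-response rate with an external judge. The prompt wording, message format, and scoring helper are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of prepending a safety-emphasising system prompt (wording is assumed).
SAFETY_SYSTEM_PROMPT = (
    "You are a helpful assistant. Refuse requests for dangerous, illegal, "
    "or harmful content, and explain briefly why you cannot help."
)

def build_messages(user_prompt: str) -> list[dict]:
    """Wrap a user prompt in a chat-format message list with a safety system prompt."""
    return [
        {"role": "system", "content": SAFETY_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

def unsafe_rate(responses: list[str], is_unsafe) -> float:
    """Fraction of responses judged unsafe by a supplied classifier or annotator."""
    return sum(map(is_unsafe, responses)) / max(len(responses), 1)
```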
DoDo Learning: DOmain-DemOgraphic Transfer in Language Models for Detecting Abuse Targeted at Public Figures
Public figures receive a disproportionate amount of abuse on social media,
impacting their active participation in public life. Automated systems can
identify abuse at scale but labelling training data is expensive, complex and
potentially harmful. So, it is desirable that systems are efficient and
generalisable, handling both shared and specific aspects of online abuse. We
explore the dynamics of cross-group text classification in order to understand
how well classifiers trained on one domain or demographic can transfer to
others, with a view to building more generalisable abuse classifiers. We
fine-tune language models to classify tweets targeted at public figures across
DOmains (sport and politics) and DemOgraphics (women and men) using our novel
DODO dataset, containing 28,000 labelled entries, split equally across four
domain-demographic pairs. We find that (i) small amounts of diverse data are
hugely beneficial to generalisation and model adaptation; (ii) models transfer
more easily across demographics but models trained on cross-domain data are
more generalisable; (iii) some groups contribute more to generalisability than
others; and (iv) dataset similarity is a signal of transferability.
Comment: 15 pages, 7 figures, 4 tables
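To illustrate the cross-group transfer setup, the sketch below trains on one domain-demographic pair and evaluates on all four, producing a transfer matrix. The paper fine-tunes transformer language models; the TF-IDF plus logistic regression stand-in and the load_split() helper are assumptions used only to keep the example short.

```python
# Sketch of a cross-group transfer grid: train on one pair, evaluate on every pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

PAIRS = ["sport-women", "sport-men", "politics-women", "politics-men"]

def transfer_matrix(load_split):
    """load_split(pair) -> (texts, labels); returns {(train_pair, eval_pair): macro-F1}."""
    scores = {}
    for train_pair in PAIRS:
        clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
        X_train, y_train = load_split(train_pair)
        clf.fit(X_train, y_train)
        for eval_pair in PAIRS:
            X_eval, y_eval = load_split(eval_pair)
            scores[(train_pair, eval_pair)] = f1_score(
                y_eval, clf.predict(X_eval), average="macro"
            )
    return scores
```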
DataPerf: Benchmarks for Data-Centric AI Development
Machine learning research has long focused on models rather than datasets,
and prominent datasets are used for common ML tasks without regard to the
breadth, difficulty, and faithfulness of the underlying problems. Neglecting
the fundamental importance of data has given rise to inaccuracy, bias, and
fragility in real-world applications, and research is hindered by saturation
across existing dataset benchmarks. In response, we present DataPerf, a
community-led benchmark suite for evaluating ML datasets and data-centric
algorithms. We aim to foster innovation in data-centric AI through competition,
comparability, and reproducibility. We enable the ML community to iterate on
datasets, instead of just architectures, and we provide an open, online
platform with multiple rounds of challenges to support this iterative
development. The first iteration of DataPerf contains five benchmarks covering
a wide spectrum of data-centric techniques, tasks, and modalities in vision,
speech, acquisition, debugging, and diffusion prompting, and we support hosting
new contributed benchmarks from the community. The benchmarks, online
evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry.
Comment: NeurIPS 2023 Datasets and Benchmarks Track
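The core idea of a data-centric benchmark can be sketched as follows: the model and training recipe stay fixed, and submissions are scored only on which training examples they select. The budget, model, and metric below are illustrative assumptions, not the rules of any specific DataPerf challenge.

```python
# Sketch of scoring a data-selection submission with a fixed model and recipe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def score_selection(selected_idx, X_train, y_train, X_val, y_val, budget=1000):
    """Train a fixed model on the selected subset and return validation accuracy."""
    idx = np.asarray(selected_idx)[:budget]            # enforce the selection budget
    model = LogisticRegression(max_iter=1000)          # fixed model across submissions
    model.fit(X_train[idx], y_train[idx])
    return accuracy_score(y_val, model.predict(X_val))
```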