Query-Efficient Black-Box Red Teaming via Bayesian Optimization
The deployment of large-scale generative models is often restricted by their
potential risk of causing harm to users in unpredictable ways. We focus on the
problem of black-box red teaming, where a red team generates test cases and
interacts with the victim model to discover a diverse set of failures with
limited query access. Existing red teaming methods construct test cases based
on human supervision or a language model (LM), and query all test cases in a
brute-force manner without incorporating any information from past evaluations,
resulting in a prohibitively large number of queries. To this end, we propose
Bayesian red teaming (BRT), a set of novel query-efficient black-box red teaming
methods based on Bayesian optimization that iteratively identify diverse positive
test cases leading to model failures by utilizing a pre-defined user input
pool and past evaluations. Experimental results on various user input pools
demonstrate that our method consistently finds a significantly larger number of
diverse positive test cases under the limited query budget than the baseline
methods. The source code is available at
https://github.com/snu-mllab/Bayesian-Red-Teaming.
Comment: ACL 2023 Long Paper - Main Conference
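The abstract does not spell out the optimization loop, so the following is a minimal sketch of pool-based Bayesian red teaming under stated assumptions: a Gaussian-process surrogate over placeholder input embeddings, a UCB acquisition rule, and a toy victim_score oracle standing in for one black-box query. None of these choices are taken from the authors' released code.

```python
# Minimal sketch of pool-based Bayesian red teaming (not the authors' BRT).
# Assumptions: fixed pool of user inputs with precomputed embeddings, a
# black-box score where higher means closer to a model failure.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 16))         # placeholder embeddings of user inputs

def victim_score(x):                       # stand-in for one black-box query
    return float(np.tanh(x @ rng.normal(size=16)))

queried, scores = [], []
first = int(rng.integers(len(pool)))       # seed the surrogate with one random query
queried.append(first); scores.append(victim_score(pool[first]))

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), normalize_y=True)
for _ in range(50):                        # limited query budget
    gp.fit(pool[queried], scores)
    mu, sigma = gp.predict(pool, return_std=True)
    ucb = mu + 1.0 * sigma                 # also explore high-uncertainty inputs
    ucb[queried] = -np.inf                 # never re-query the same test case
    nxt = int(np.argmax(ucb))
    queried.append(nxt); scores.append(victim_score(pool[nxt]))

positives = [i for i, s in zip(queried, scores) if s > 0.9]
print(f"{len(positives)} positive test cases found in {len(queried)} queries")
```

Each iteration spends one query on the pool item the surrogate considers most promising or most uncertain, which is how the budget avoids brute-force enumeration of the pool.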
Who Wrote this Code? Watermarking for Code Generation
Large language models for code have recently shown remarkable performance in
generating executable code. However, this rapid advancement has been
accompanied by many legal and ethical concerns, such as code licensing issues,
code plagiarism, and malware generation, making watermarking machine-generated
code a very timely problem. Despite such imminent needs, we discover that
existing watermarking and machine-generated text detection methods for LLMs
fail to function properly on code generation tasks. Hence, in this work, we
propose a new watermarking method, SWEET, that significantly improves upon
previous approaches for watermarking machine-generated code. Our proposed
method selectively applies watermarking only to tokens whose entropy surpasses
a defined threshold. Experiments on code generation benchmarks
show that our watermarked code has superior quality compared to code produced
by the previous state-of-the-art LLM watermarking method. Furthermore, our
watermarking method also outperforms DetectGPT on the task of machine-generated
code detection.
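As a rough illustration of the selective mechanism described above, the sketch below gates a hash-seeded green-list bias on next-token entropy, in the style of earlier green/red-list watermarking schemes. The vocabulary size, green fraction GAMMA, bias DELTA, and entropy gate TAU are illustrative placeholders, not SWEET's settings.

```python
# Entropy-gated ("selective") watermarking sketch: bias green-list tokens
# only at decoding steps whose entropy exceeds a threshold. Toy constants.
import hashlib
import numpy as np

VOCAB, GAMMA, DELTA, TAU = 1000, 0.5, 2.0, 1.2  # vocab, green fraction, bias, gate

def green_list(prev_token: int) -> np.ndarray:
    # Seed the green/red vocabulary split with the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    perm = np.random.default_rng(seed).permutation(VOCAB)
    mask = np.zeros(VOCAB, dtype=bool)
    mask[perm[: int(GAMMA * VOCAB)]] = True
    return mask

def watermarked_step(logits: np.ndarray, prev_token: int) -> int:
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    if entropy > TAU:                      # gate: skip low-entropy (forced) tokens
        logits = logits + DELTA * green_list(prev_token)
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    return int(np.random.default_rng().choice(VOCAB, p=probs))
```

Low-entropy steps (e.g., syntactically forced tokens in code) are left unbiased, which is the intuition for why selective watermarking preserves code quality better than biasing every token.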
KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application
Large language models (LLMs) learn not only natural text generation abilities
but also social biases against different demographic groups from real-world
data. This poses a critical risk when deploying LLM-based applications.
Existing research and resources are not readily applicable in South Korea due
to the differences in language and culture, both of which significantly affect
the biases and targeted demographic groups. This limitation requires localized
social bias datasets to ensure the safe and effective deployment of LLMs. To
this end, we present KoSBi, a new social bias dataset of 34k pairs of
contexts and sentences in Korean covering 72 demographic groups in 15
categories. We find that through filtering-based moderation, social biases in
generated content can be reduced by 16.47%p on average for HyperCLOVA (30B and
82B), and GPT-3.
Comment: 17 pages, 8 figures, 12 tables, ACL 2023
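The abstract reports the effect of filtering-based moderation without describing its mechanics; one plausible realization is a generate-then-filter loop like the sketch below, where generate and is_safe are hypothetical stand-ins for the LLM and for a bias classifier trained on KoSBi-style labels.

```python
# Generate-then-filter moderation sketch: resample candidate generations
# until a safety/bias classifier accepts one, or give up after max_tries.
from typing import Callable, Optional

def moderated_generate(prompt: str,
                       generate: Callable[[str], str],
                       is_safe: Callable[[str], bool],
                       max_tries: int = 5) -> Optional[str]:
    """Return the first classifier-approved generation, else None."""
    for _ in range(max_tries):
        candidate = generate(prompt)
        if is_safe(candidate):
            return candidate
    return None  # caller can fall back to a canned safe response
```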
SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created Through Human-Machine Collaboration
The potential social harms that large language models pose, such as
generating offensive content and reinforcing biases, are steeply rising.
Existing works focus on coping with this concern while interacting with
ill-intentioned users, such as those who explicitly make hate speech or elicit
harmful responses. However, discussions on sensitive issues can become toxic
even if the users are well-intentioned. For safer models in such scenarios, we
present the Sensitive Questions and Acceptable Responses (SQuARe) dataset, a
large-scale Korean dataset of 49k sensitive questions with 42k acceptable and
46k non-acceptable responses. The dataset was constructed leveraging HyperCLOVA
in a human-in-the-loop manner based on real news headlines. Experiments show
that acceptable response generation significantly improves for HyperCLOVA and
GPT-3, demonstrating the efficacy of this dataset.
Comment: 19 pages, 10 figures, ACL 2023
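As a loose illustration of the human-in-the-loop construction, the sketch below alternates machine drafting with human labeling; llm_draft and human_label are hypothetical callables, and feeding accepted pairs back as demonstrations is an assumption about the workflow rather than the paper's documented pipeline.

```python
# Human-in-the-loop dataset construction sketch: an LLM drafts sensitive
# question/response pairs from headlines, humans label them, and accepted
# pairs seed the next round as demonstrations.
def build_dataset(headlines, llm_draft, human_label, rounds=3):
    demos, dataset = [], []
    for _ in range(rounds):
        for headline in headlines:
            question, response = llm_draft(headline, demos)
            label = human_label(question, response)  # "acceptable" / "non-acceptable"
            dataset.append((question, response, label))
            if label == "acceptable":
                demos.append((question, response))   # bootstrap better drafts
    return dataset
```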
Beyond Fact Verification: Comparing and Contrasting Claims on Contentious Topics
As the importance of identifying misinformation grows, many
researchers focus on verifying textual claims on the web. One of the most
popular tasks to achieve this is fact verification, which retrieves an evidence
sentence from a large knowledge source such as Wikipedia to either verify or
refute each factual claim. However, while such a problem formulation is helpful
for detecting false claims and fake news, it cannot catch the subtle differences
between factually consistent claims that may still implicitly bias readers,
especially on contentious topics such as political, gender, or racial issues.
In this study, we propose ClaimDiff, a novel dataset to
compare the nuance between claim pairs in both a discriminative and a
generative manner, with the underlying assumption that one is not necessarily
more true than the other. This differs from existing fact verification datasets
that verify the target sentence with respect to an absolute truth. We hope this
task assists people in making more informed decisions among various sources of
media.
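Since ClaimDiff frames claim comparison as both a discriminative and a generative task, a minimal evaluation harness might look like the sketch below; classify and describe are hypothetical fine-tuned models, not artifacts released with the dataset.

```python
# Pairwise claim comparison sketch: a discriminative check for whether two
# claims differ in nuance, plus a generative description of that difference.
from typing import Callable, Tuple

def compare_claims(claim_a: str, claim_b: str,
                   classify: Callable[[str], float],
                   describe: Callable[[str], str],
                   threshold: float = 0.5) -> Tuple[bool, str]:
    pair = f"Claim A: {claim_a}\nClaim B: {claim_b}"
    differs = classify(pair) > threshold          # discriminative: nuance present?
    diff = describe(pair) if differs else ""      # generative: explain the nuance
    return differs, diff
```

Note that neither claim is assumed to be more true than the other; the harness only surfaces how the framings diverge, matching the dataset's stated assumption.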