The deployment of large-scale generative models is often restricted by their
potential risk of causing harm to users in unpredictable ways. We focus on the
problem of black-box red teaming, where a red team generates test cases and
interacts with the victim model to discover a diverse set of failures with
limited query access. Existing red teaming methods construct test cases based
on human supervision or language model (LM) and query all test cases in a
brute-force manner without incorporating any information from past evaluations,
resulting in a prohibitively large number of queries. To this end, we propose
Bayesian red teaming (BRT), novel query-efficient black-box red teaming methods
based on Bayesian optimization, which iteratively identify diverse positive
test cases leading to model failures by utilizing a pre-defined user input
pool and past evaluations. Experimental results on various user input pools
demonstrate that our method consistently finds a significantly larger number of
diverse positive test cases than the baseline methods under a limited query budget.
The source code is available at
https://github.com/snu-mllab/Bayesian-Red-Teaming.

Comment: ACL 2023 Long Paper - Main Conference
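
The abstract describes a pool-based Bayesian optimization loop: fit a surrogate model on past (test case, failure score) evaluations, then select the next pool element to query with an acquisition function. The sketch below illustrates only that generic loop; `embed`, `score_failure`, the Matern-kernel GP, the UCB acquisition, and the 0.5 positive threshold are illustrative assumptions, not BRT's actual surrogate, acquisition, or diversity mechanism.

```python
"""Minimal sketch of pool-based Bayesian optimization for black-box red teaming.

Assumptions (not from the paper): `embed`, `score_failure`, the GP surrogate,
the UCB acquisition, and the positive threshold are illustrative stand-ins.
"""
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def red_team(pool_texts, embed, score_failure, query_budget=100, n_init=10, beta=2.0):
    """Iteratively pick test cases from a fixed pool using a GP surrogate.

    pool_texts:    list[str], the pre-defined user input pool
    embed:         callable(str) -> np.ndarray, fixed text embedder (assumption)
    score_failure: callable(str) -> float, queries the victim model once and
                   returns a failure score (higher = more likely a failure)
    """
    X_pool = np.stack([embed(t) for t in pool_texts])
    unevaluated = set(range(len(pool_texts)))

    # Warm start: evaluate a few random test cases to initialize the surrogate.
    rng = np.random.default_rng(0)
    evaluated_idx = list(rng.choice(len(pool_texts), size=n_init, replace=False))
    scores = [score_failure(pool_texts[i]) for i in evaluated_idx]
    unevaluated -= set(evaluated_idx)

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(query_budget - n_init):
        # Fit the surrogate on all past evaluations.
        gp.fit(X_pool[evaluated_idx], np.asarray(scores))

        # UCB acquisition over the remaining pool: favor a high predicted
        # failure score plus uncertainty, which also spreads queries out.
        cand = np.fromiter(unevaluated, dtype=int)
        mean, std = gp.predict(X_pool[cand], return_std=True)
        best = cand[int(np.argmax(mean + beta * std))]

        evaluated_idx.append(best)
        scores.append(score_failure(pool_texts[best]))
        unevaluated.remove(best)

    # Positive test cases: evaluated inputs whose score exceeds a threshold.
    return [pool_texts[i] for i, s in zip(evaluated_idx, scores) if s > 0.5]
```

Because every query's outcome is folded back into the surrogate, the loop concentrates the remaining budget on promising regions of the pool instead of evaluating all test cases brute-force.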