Language Models (Mostly) Know What They Know
We study whether language models can evaluate the validity of their own
claims and predict which questions they will be able to answer correctly. We
first show that larger models are well-calibrated on diverse multiple choice
and true/false questions when they are provided in the right format. Thus we
can approach self-evaluation on open-ended sampling tasks by asking models to
first propose answers, and then to evaluate the probability "P(True)" that
their answers are correct. We find encouraging performance, calibration, and
scaling for P(True) on a diverse array of tasks. Performance at self-evaluation
further improves when we allow models to consider many of their own samples
before predicting the validity of one specific possibility. Next, we
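As a minimal sketch of this self-evaluation loop (the prompt wording and the model interface below are illustrative assumptions, not the paper's exact protocol):

    # Hypothetical P(True) self-evaluation sketch. `sample` and
    # `true_token_prob` are stand-ins for a language-model API: one
    # draws n completions, the other returns the probability the model
    # assigns to a given next token. The prompt template is illustrative.

    def p_true(question: str, sample, true_token_prob,
               n_samples: int = 5) -> list[tuple[str, float]]:
        # 1. Ask the model to propose several candidate answers.
        candidates = sample(f"Question: {question}\nAnswer:", n=n_samples)
        scored = []
        for answer in candidates:
            # 2. Show the model the question, its other samples for
            #    context, and one specific candidate.
            others = "\n".join(a for a in candidates if a != answer)
            prompt = (
                f"Question: {question}\n"
                f"Possible answers:\n{others}\n"
                f"Proposed answer: {answer}\n"
                "Is the proposed answer correct? (True/False):"
            )
            # 3. P(True) is the probability assigned to the "True" token.
            scored.append((answer, true_token_prob(prompt, token=" True")))
        return scored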
investigate whether models can be trained to predict "P(IK)", the probability
that "I know" the answer to a question, without reference to any particular
proposed answer. Models perform well at predicting P(IK) and partially
generalize across tasks, though they struggle with calibration of P(IK) on new
tasks. The predicted P(IK) probabilities also increase appropriately in the
presence of relevant source materials in the context, and in the presence of
hints towards the solution of mathematical word problems. We hope these
observations lay the groundwork for training more honest models, and for
investigating how honesty generalizes to cases where models are trained on
objectives other than the imitation of human writing.
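A hedged sketch of how a P(IK) predictor might be trained; the scalar head, field names, and training target below are assumptions for illustration, not the paper's specified architecture:

    import torch
    import torch.nn as nn

    class PIKHead(nn.Module):
        # Hypothetical scalar head on top of a language model: maps the
        # final hidden state of a question (no proposed answer) to P(IK),
        # the probability the model can answer it correctly.
        def __init__(self, hidden_size: int):
            super().__init__()
            self.proj = nn.Linear(hidden_size, 1)

        def forward(self, question_hidden: torch.Tensor) -> torch.Tensor:
            return torch.sigmoid(self.proj(question_hidden))

    def pik_loss(head: PIKHead, hidden: torch.Tensor,
                 frac_correct: torch.Tensor) -> torch.Tensor:
        # Fit with binary cross-entropy against the fraction of the
        # model's own sampled answers that were actually correct.
        return nn.functional.binary_cross_entropy(
            head(hidden).squeeze(-1), frac_correct)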
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to
simultaneously discover, measure, and attempt to reduce their potentially
harmful outputs. We make three main contributions. First, we investigate
scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B
parameters) and 4 model types: a plain language model (LM); an LM prompted to
be helpful, honest, and harmless; an LM with rejection sampling; and a model
trained to be helpful and harmless using reinforcement learning from human
feedback (RLHF). We find that the RLHF models are increasingly difficult to red
team as they scale, and we find a flat trend with scale for the other model
types. Second, we release our dataset of 38,961 red team attacks for others to
analyze and learn from. We provide our own analysis of the data and find a
variety of harmful outputs, which range from offensive language to more subtly
harmful non-violent unethical outputs. Third, we exhaustively describe our
instructions, processes, statistical methodologies, and uncertainty about red
teaming. We hope that this transparency accelerates our ability to work
together as a community in order to develop shared norms, practices, and
technical standards for how to red team language models.
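A short sketch of the kind of scaling analysis described above, assuming the released attacks are available as records with a per-attack success rating and model metadata (the field names here are illustrative assumptions, not the dataset's actual schema):

    from collections import defaultdict

    def mean_attack_success(records: list[dict]) -> dict:
        # Group hypothetical attack-success ratings by model type and
        # parameter count to inspect the scaling trend: RLHF models
        # should become harder to red team as they scale, while the
        # other model types stay roughly flat.
        groups = defaultdict(list)
        for r in records:
            groups[(r["model_type"], r["num_params"])].append(r["rating"])
        return {k: sum(v) / len(v) for k, v in groups.items()}

    # Usage with made-up values for two of the model types:
    demo = [
        {"model_type": "rlhf", "num_params": "52B", "rating": 0.2},
        {"model_type": "plain_lm", "num_params": "52B", "rating": 0.7},
    ]
    print(mean_attack_success(demo))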
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Large language models (LLMs) may not equitably represent diverse global
perspectives on societal issues. In this paper, we develop a quantitative
framework to evaluate whose opinions model-generated responses are more similar
to. We first build a dataset, GlobalOpinionQA, comprised of questions and
answers from cross-national surveys designed to capture diverse opinions on
global issues across different countries. Next, we define a metric that
quantifies the similarity between LLM-generated survey responses and human
responses, conditioned on country. With our framework, we run three experiments
on an LLM trained to be helpful, honest, and harmless with Constitutional AI.
By default, LLM responses tend to be more similar to the opinions of certain
populations, such as those from the USA, and some European and South American
countries, highlighting the potential for biases. When we prompt the model to
consider a particular country's perspective, responses shift to be more similar
to the opinions of the prompted populations, but can reflect harmful cultural
stereotypes. When we translate GlobalOpinionQA questions to a target language,
the model's responses do not necessarily become the most similar to the
opinions of speakers of those languages. We release our dataset for others to
use and build on. Our data is at
https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide
an interactive visualization at https://llmglobalvalues.anthropic.com.
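As a sketch of the similarity metric described above, here is one plausible instantiation, comparing the model's distribution over answer options with a country's aggregated survey distribution via 1 minus the Jensen-Shannon distance; the exact metric in the paper may differ, so treat this as an assumption:

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def country_similarity(model_probs: list[np.ndarray],
                           country_probs: list[np.ndarray]) -> float:
        # Per question: 1 - JS distance (base 2, so bounded in [0, 1])
        # between the model's distribution over answer options and the
        # country's human response distribution. Average over questions;
        # higher means the model's answers track that country's opinions.
        sims = [1.0 - jensenshannon(m, c, base=2)
                for m, c in zip(model_probs, country_probs)]
        return float(np.mean(sims))

    # Usage with a single hypothetical three-option question:
    model = [np.array([0.7, 0.2, 0.1])]
    country = [np.array([0.6, 0.3, 0.1])]
    print(country_similarity(model, country))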