12 research outputs found

Toy Models of Superposition

Author: Amodei Dario
Chen Carol
Drain Dawn
Elhage Nelson
Grosse Roger
Hatfield-Dodds Zac
Henighan Tom
Hume Tristan
Kaplan Jared
Kravec Shauna
Lasenby Robert
McCandlish Sam
Olah Christopher
Olsson Catherine
Schiefer Nicholas
Wattenberg Martin
Publication date: 21/09/2022

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

Comment: Also available at https://transformer-circuits.pub/2022/toy_model/index.htm

arXiv.org e-Print Archive

Towards Understanding Sycophancy in Language Models

Author: Askell Amanda
Bowman Samuel R.
Cheng Newton
Durmus Esin
Duvenaud David
Hatfield-Dodds Zac
Johnston Scott R.
Korbak Tomasz
Kravec Shauna
Maxwell Timothy
McCandlish Sam
Ndousse Kamal
Perez Ethan
Rausch Oliver
Schiefer Nicholas
Sharma Mrinank
Tong Meg
Yan Da
Zhang Miranda
Publication date: 27/10/2023

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behavior known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

Comment: 32 pages, 20 figures

arXiv.org e-Print Archive

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Author: Bowman Samuel R.
Brauner Jan
Chandrasekaran Venkatesa
Chen Anna
Chen Carol
Cheng Newton
Denison Carson
Durmus Esin
Hatfield-Dodds Zac
Hernandez Danny
Hubinger Evan
Joseph Nicholas
Kaplan Jared
Kernion Jackson
Lanham Tamera
Lukošiūtė Kamilė
Maxwell Tim
McCandlish Sam
Nguyen Karina
Perez Ethan
Radhakrishnan Ansh
Rausch Oliver
Schiefer Nicholas
Showk Sheer El
Publication date: 25/07/2023

As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.

Comment: For few-shot examples and prompts, see https://github.com/anthropics/DecompositionFaithfulnessPape

arXiv.org e-Print Archive

Language Models (Mostly) Know What They Know

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

Comment: 23+17 pages; refs added, typos fixed
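As a rough illustration of what "well-calibrated" means for a quantity like P(True), the sketch below bins stated probabilities and compares each bin's mean confidence with its observed accuracy. Expected calibration error is a standard metric chosen here for illustration; the abstract does not say which calibration measure the paper uses, and the function name, bin count, and sample data are all hypothetical.

```python
def expected_calibration_error(probs, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - confidence| over equal-width bins.

    probs:   stated probabilities in [0, 1] (hypothetical P(True) values)
    correct: 1 if the corresponding answer was right, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        # A probability of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, c))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += len(members) / total * abs(accuracy - confidence)
    return ece


# Hypothetical data: ten answers, each given P(True) = 0.9.
stated = [0.9] * 10
calibrated = expected_calibration_error(stated, [1] * 9 + [0])  # 9 of 10 right
overconfident = expected_calibration_error(stated, [0] * 10)    # none right
print(calibrated, overconfident)
```

A score near zero means stated probabilities track observed accuracy; the second call shows the score growing when confidence and accuracy diverge.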

arXiv.org e-Print Archive

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

arXiv.org e-Print Archive

Specific versus General Principles for Constitutional AI

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

arXiv.org e-Print Archive

Scaling Laws and Interpretability of Learning from Repeated Data

Author: Amodei Dario
Brown Tom
Conerly Tom
DasSarma Nova
Drain Dawn
El-Showk Sheer
Elhage Nelson
Hatfield-Dodds Zac
Henighan Tom
Hernandez Danny
Hume Tristan
Johnston Scott
Joseph Nicholas
Kaplan Jared
Mann Ben
McCandlish Sam
Olah Chris
Olsson Catherine
Publication date: 20/05/2022

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.

Comment: 23 pages, 22 figures
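The "0.1% repeated 100 times" and "other 90% unique" figures in the abstract fit together through simple arithmetic, sketched below. The sketch assumes that "0.1% of the data" means a subset whose size is 0.1% of the total training-token budget and that each pass over it counts toward that budget; the function name is hypothetical and the abstract does not spell out this accounting.

```python
from fractions import Fraction


def repeated_token_share(subset_fraction, repeats):
    """Share of all training tokens taken up by a subset seen `repeats` times.

    Assumes subset_fraction is measured against the total token budget.
    """
    return subset_fraction * repeats


subset_fraction = Fraction(1, 1000)  # 0.1% of the data
repeats = 100                        # each repeated item is seen 100 times

repeated_share = repeated_token_share(subset_fraction, repeats)
unique_share = 1 - repeated_share

print(f"repeated tokens: {float(repeated_share):.0%}")
print(f"unique tokens:   {float(unique_share):.0%}")
```

Under this reading, the repeated subset accounts for one tenth of the tokens seen in training, which leaves the remaining nine tenths unique, consistent with the abstract's figures.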

arXiv.org e-Print Archive

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Author: Askell Amanda
Bakhtin Anton
Chen Carol
Clark Jack
Durmus Esin
Ganguli Deep
Hatfield-Dodds Zac
Hernandez Danny
Joseph Nicholas
Kaplan Jared
Liao Thomas I.
Lovitt Liane
McCandlish Sam
Nguyen Karina
Schiefer Nicholas
Sikder Orowa
Tamkin Alex
Thamkul Janel
Publication date: 28/06/2023

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com

arXiv.org e-Print Archive

HypothesisWorks/hypothesis: Hypothesis for Python - version 4.5.9

You can read the changelog for this release here

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

xarray

Author: Abernathey Ryan
Amici Alessandro
Banihirwe Anderson
Barghini Aureliana
Bell Ray
Bovy Benoît
Cherian Deepak
Clark Spencer
Fitzgerald Clark
Fujii Keisuke
Hatfield-Dodds Zac
Hauser Mathias
Hoyer Stephan
Imperiale Guido
Joseph Hamman
Kleeman Alex
Kluyver Thomas
Magin Justus
Maussion Fabien
Munroe James
Mühlbauer Kai
Nicholas Thomas
Omotani John
Roos Maximilian
Roszko Maximilian K.
Westling Jimmy
Wolfram Phillip J.
Publication venue: Zenodo
Publication date: 28/01/2022

N-D labeled arrays and datasets in Python. If you use this software, please cite it as below.

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY