    Causal Abstraction with Soft Interventions

    Causal abstraction provides a theory describing how several causal models can represent the same system at different levels of detail. Existing theoretical proposals limit the analysis of abstract models to "hard" interventions fixing causal variables to constant values. In this work, we extend causal abstraction to "soft" interventions, which assign possibly non-constant functions to variables without adding new causal connections. Specifically, (i) we generalize τ-abstraction from Beckers and Halpern (2019) to soft interventions, (ii) we propose a further definition of soft abstraction to ensure a unique map ω between soft interventions, and (iii) we prove that our constructive definition of soft abstraction guarantees that the intervention map ω has a specific and necessary explicit form.
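    A minimal formal sketch, in assumed notation rather than the paper's own: a hard intervention replaces a variable's mechanism with a constant, a soft intervention replaces it with a new function of the same parents, and an abstraction pairs a state map τ with an intervention map ω that must commute with it.

        % Sketch in assumed notation, in the spirit of Beckers & Halpern (2019).
        % Hard intervention: fix X to a constant value.
        \[ \mathrm{do}(X = x) : \qquad f_X \;\mapsto\; x \]
        % Soft intervention: replace the mechanism but keep the parent set Pa(X),
        % so no new causal connections are introduced.
        \[ \mathrm{do}(X = g_X) : \qquad f_X \;\mapsto\; g_X\big(\mathrm{Pa}(X)\big) \]
        % Abstraction: a state map tau and an intervention map omega such that
        % abstracting the intervened low-level model agrees with applying the
        % mapped intervention to the high-level model, for every allowed i.
        \[ \tau\big(\mathcal{M}_L^{\,i}(u)\big) \;=\; \mathcal{M}_H^{\,\omega(i)}\big(\tau(u)\big) \]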

    Linear Representations of Sentiment in Large Language Models

    Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks, with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real-world datasets such as Stanford Sentiment Treebank. Through this case study we mount a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.
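    The paper's own pipeline is not reproduced here; the sketch below shows one standard way such a linear direction is often estimated (difference of class-mean activations) and then ablated by projection. The tensor shapes, the random stand-in activations, and the projection-to-zero ablation are illustrative assumptions, not the authors' exact method.

        # Sketch: estimate a linear "sentiment direction" as the difference of
        # class-mean activations, then ablate it by projecting it out.
        # Shapes and data are stand-ins, not the paper's actual pipeline.
        import torch

        def sentiment_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
            """pos_acts, neg_acts: (n_examples, d_model) residual-stream activations."""
            direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
            return direction / direction.norm()

        def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
            """Remove each activation's component along `direction`."""
            coeffs = acts @ direction                          # (n,) projections
            return acts - coeffs[:, None] * direction[None, :]

        # Toy usage with random stand-in activations (d_model = 768).
        pos, neg = torch.randn(128, 768), torch.randn(128, 768)
        d = sentiment_direction(pos, neg)
        ablated = ablate_direction(torch.randn(4, 768), d)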

    Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

    Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient-descent method grounded in a theory of causal abstraction that has uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable Boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models. Our tool is extensible to larger LLMs and is released publicly at `https://github.com/stanfordnlp/pyvene`.
    Comment: NeurIPS 2023, with author correction.
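    Boundless DAS itself ships in the linked repository; as a rough, library-free illustration of the operation it builds on, the sketch below performs a distributed interchange intervention: rotate base and source activations with a learned orthogonal map, swap the first k coordinates, and rotate back. Fixing k as a hard cutoff is a simplification of the learned (soft) boundaries that give Boundless DAS its name; all dimensions are assumptions.

        # Rough sketch of a DAS-style distributed interchange intervention:
        # swap a k-dimensional learned subspace of the source activation into
        # the base activation. Boundless DAS learns the boundary rather than
        # fixing k; that part is omitted here.
        import torch
        import torch.nn as nn
        from torch.nn.utils.parametrizations import orthogonal

        class InterchangeIntervention(nn.Module):
            def __init__(self, d_model: int, k: int):
                super().__init__()
                # Parametrization keeps the rotation matrix orthogonal while training.
                self.rotate = orthogonal(nn.Linear(d_model, d_model, bias=False))
                self.k = k

            def forward(self, base: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
                R = self.rotate.weight                          # (d_model, d_model)
                base_rot, src_rot = base @ R.T, source @ R.T
                mixed = torch.cat([src_rot[..., : self.k],      # aligned subspace from source
                                   base_rot[..., self.k :]], dim=-1)
                return mixed @ R                                # back to model coordinates

        # Toy usage on stand-in activations.
        iv = InterchangeIntervention(d_model=512, k=8)
        patched = iv(torch.randn(2, 512), torch.randn(2, 512))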

    A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

    We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.
    Comment: 20 pages, 14 figures.

    pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

    Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce pyvene, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. pyvene supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters. We show how pyvene provides a unified and extensible framework for performing interventions on neural models and for sharing the intervened-upon models with others. We illustrate the power of the library via interpretability analyses using causal abstraction and knowledge localization. We publish our library through the Python Package Index (PyPI) and provide code, documentation, and tutorials at https://github.com/stanfordnlp/pyvene.
    Comment: 8 pages, 3 figures.
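    pyvene's configuration format is documented in the repository above; to keep this summary self-contained, the sketch below shows the bare mechanism such interventions build on in plain PyTorch: a forward hook that overwrites one module's output. It is not pyvene's API, only an illustration of the kind of operation the library wraps, configures, and makes trainable; the tiny model is a stand-in.

        # Library-free sketch of a static intervention on a model-internal state:
        # a forward hook that replaces one module's output. See the pyvene
        # repository for the actual, configuration-driven API.
        import torch
        import torch.nn as nn

        model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
        target = model[1]                       # intervene on the ReLU's output

        def zero_intervention(module, inputs, output):
            # Returning a tensor from a forward hook replaces the module's output.
            # One could instead patch in activations cached from a "source" run,
            # or add a trainable steering vector.
            return torch.zeros_like(output)

        handle = target.register_forward_hook(zero_intervention)
        with torch.no_grad():
            intervened = model(torch.randn(2, 16))
        handle.remove()                         # restores the original behavior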

    Dynabench: Rethinking Benchmarking in NLP

    We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
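    Dynabench is a web platform rather than a library; as a schematic of the human-and-model-in-the-loop protocol described above, the sketch below keeps an annotator's example only when it fools the target model and a second human validator agrees with the annotator's label. All three callables are placeholders, not Dynabench interfaces.

        # Schematic of human-and-model-in-the-loop collection (not Dynabench's
        # actual implementation): keep an example only if the target model
        # misclassifies it while a human validator confirms the annotator's label.
        from typing import Callable, List, Tuple

        Example = Tuple[str, str]  # (text, gold_label)

        def collect_round(annotator: Callable[[], Example],
                          model_predict: Callable[[str], str],
                          validator: Callable[[str], str],
                          n_attempts: int) -> List[Example]:
            collected: List[Example] = []
            for _ in range(n_attempts):
                text, label = annotator()
                if model_predict(text) != label and validator(text) == label:
                    collected.append((text, label))  # enters the next benchmark round
            return collected

        # Toy usage with stub callables.
        new_data = collect_round(
            annotator=lambda: ("gripping plot, wooden acting, still loved it", "positive"),
            model_predict=lambda text: "negative",
            validator=lambda text: "positive",
            n_attempts=3,
        )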