Causal Abstraction with Soft Interventions
Causal abstraction provides a theory describing how several causal models can
represent the same system at different levels of detail. Existing theoretical
proposals limit the analysis of abstract models to "hard" interventions fixing
causal variables to be constant values. In this work, we extend causal
abstraction to "soft" interventions, which assign possibly non-constant
functions to variables without adding new causal connections. Specifically, (i)
we generalize τ-abstraction from Beckers and Halpern (2019) to soft
interventions, (ii) we propose a further definition of soft abstraction to
ensure a unique map between soft interventions, and (iii) we prove
that our constructive definition of soft abstraction guarantees that the
intervention map has a specific and necessary explicit form.
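As a concrete illustration (not the paper's formalism), the toy sketch below samples a three-variable structural causal model X → Y → Z and contrasts a hard intervention, which fixes Y to a constant, with a soft intervention, which replaces Y's mechanism by a new function of its existing parent X without adding causal connections; all names and mechanisms are illustrative.

```python
# Toy structural causal model X -> Y -> Z, illustrating hard vs. soft interventions.
import random

def sample_scm(intervene_y=None):
    """intervene_y: None (observational), a constant (hard), or a function of x (soft)."""
    x = random.gauss(0.0, 1.0)
    if intervene_y is None:
        y = 2.0 * x                                  # original mechanism for Y
    elif callable(intervene_y):
        y = intervene_y(x)                           # soft: new, possibly non-constant function of Y's existing parent
    else:
        y = intervene_y                              # hard: Y fixed to a constant value
    z = y + random.gauss(0.0, 0.1)                   # downstream effect of Y on Z
    return x, y, z

observational = sample_scm()
hard = sample_scm(intervene_y=1.0)                   # do(Y := 1)
soft = sample_scm(intervene_y=lambda x: 0.5 * x)     # Y still depends only on its parent X
```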
Linear Representations of Sentiment in Large Language Models
Sentiment is a pervasive feature in natural language text, yet it is an open
question how sentiment is represented within Large Language Models (LLMs). In
this study, we reveal that across a range of models, sentiment is represented
linearly: a single direction in activation space mostly captures the feature
across a range of tasks with one extreme for positive and the other for
negative. Through causal interventions, we isolate this direction and show it
is causally relevant in both toy tasks and real world datasets such as Stanford
Sentiment Treebank. Through this case study, we conduct a thorough investigation
of what a single direction means on a broad data distribution.
We further uncover the mechanisms that involve this direction, highlighting
the roles of a small subset of attention heads and neurons. Finally, we
discover a phenomenon which we term the summarization motif: sentiment is not
solely represented on emotionally charged words, but is additionally summarized
at intermediate positions without inherent sentiment, such as punctuation and
names. We show that in Stanford Sentiment Treebank zero-shot classification,
76% of above-chance classification accuracy is lost when ablating the sentiment
direction, nearly half of which (36%) is due to ablating the summarized
sentiment direction exclusively at comma positions.
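For orientation only, the sketch below shows one generic way a linear direction like this can be extracted and ablated: a unit-norm difference of class-mean activations, removed by projection. The paper's own procedure may differ; `pos_acts` and `neg_acts` are hypothetical tensors of activations assumed to have been collected elsewhere.

```python
import torch

def sentiment_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference of class-mean activations (shape [d_model])."""
    d = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return d / d.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out the component of each activation along `direction`."""
    coeffs = acts @ direction                      # [n]
    return acts - coeffs[:, None] * direction      # [n, d_model]

# Toy usage with random stand-ins for collected activations:
pos_acts, neg_acts = torch.randn(32, 64), torch.randn(32, 64)
direction = sentiment_direction(pos_acts, neg_acts)
ablated = ablate_direction(pos_acts, direction)
```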
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Obtaining human-interpretable explanations of large, general-purpose language
models is an urgent goal for AI safety. However, it is just as important that
our interpretability methods are faithful to the causal dynamics underlying
model behavior and able to robustly generalize to unseen inputs. Distributed
Alignment Search (DAS) is a powerful gradient descent method grounded in a
theory of causal abstraction that has uncovered perfect alignments between
interpretable symbolic algorithms and small deep learning models fine-tuned for
specific tasks. In the present paper, we scale DAS significantly by replacing
the remaining brute-force search steps with learned parameters -- an approach
we call Boundless DAS. This enables us to efficiently search for interpretable
causal structure in large language models while they follow instructions. We
apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf,
solves a simple numerical reasoning problem. With Boundless DAS, we discover
that Alpaca does this by implementing a causal model with two interpretable
boolean variables. Furthermore, we find that the alignment of neural
representations with these variables is robust to changes in inputs and
instructions. These findings mark a first step toward faithfully understanding
the inner workings of our ever-growing and most widely deployed language
models. Our tool is extensible to larger LLMs and is released publicly at
`https://github.com/stanfordnlp/pyvene`.
Comment: NeurIPS 2023, with author correction.
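As a rough illustration of the core operation behind DAS-style methods, the sketch below performs a distributed interchange intervention: rotate the activations from two forward passes with an orthogonal matrix, swap a low-dimensional subspace, and rotate back. The function, matrix, and dimensions here are illustrative stand-ins, not the paper's implementation, where the rotation is learned by gradient descent and (in Boundless DAS) the subspace boundaries are learned as well.

```python
import torch

def interchange_intervention(base_act, source_act, rotation, k):
    """Swap the first k rotated coordinates of `base_act` with those of `source_act`.

    base_act, source_act: [d] activations from two forward passes (base and source inputs).
    rotation: [d, d] orthogonal matrix (learned by gradient descent in DAS).
    k: width of the candidate subspace (a learned boundary in Boundless DAS).
    """
    rotated_base = rotation @ base_act
    rotated_source = rotation @ source_act
    mixed = torch.cat([rotated_source[:k], rotated_base[k:]])   # swap the candidate subspace
    return rotation.T @ mixed                                   # rotate back to the model's basis

d = 16
rotation, _ = torch.linalg.qr(torch.randn(d, d))                # random orthogonal matrix for the demo
patched = interchange_intervention(torch.randn(d), torch.randn(d), rotation, k=4)
```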
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
We respond to the recent paper by Makelov et al. (2023), which reviews
subspace interchange intervention methods like distributed alignment search
(DAS; Geiger et al. 2023) and claims that these methods potentially cause
"interpretability illusions". We first review Makelov et al. (2023)'s technical
notion of what an "interpretability illusion" is, and then we show that even
intuitive and desirable explanations can qualify as illusions in this sense. As
a result, their method of discovering "illusions" can reject explanations they
consider "non-illusory". We then argue that the illusions Makelov et al. (2023)
see in practice are artifacts of their training and evaluation paradigms. We
close by emphasizing that, though we disagree with their core characterization,
Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the
field of interpretability forward.
Comment: 20 pages, 14 figures.
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Interventions on model-internal states are fundamental operations in many
areas of AI, including model editing, steering, robustness, and
interpretability. To facilitate such research, we introduce pyvene,
an open-source Python library that supports customizable interventions on a
range of different PyTorch modules. pyvene supports complex
intervention schemes with an intuitive configuration format, and its
interventions can be static or include trainable parameters. We show how
pyvene provides a unified and extensible framework for performing
interventions on neural models and sharing the intervened-upon models with
others. We illustrate the power of the library via interpretability analyses
using causal abstraction and knowledge localization. We publish our library
through the Python Package Index (PyPI) and provide code, documentation, and
tutorials at https://github.com/stanfordnlp/pyvene.
Comment: 8 pages, 3 figures.
Dynabench: Rethinking Benchmarking in NLP
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.