52 research outputs found
Path-Specific Objectives for Safer Agent Incentives
We present a general framework for training safe agents whose naive
incentives are unsafe. As an example, manipulative or deceptive behaviour can
improve rewards but should be avoided. Most approaches fail here: agents
maximize expected return by any means necessary. We formally describe settings
with 'delicate' parts of the state which should not be used as a means to an
end. We then train agents to maximize the causal effect of actions on the
expected return which is not mediated by the delicate parts of state, using
Causal Influence Diagram analysis. The resulting agents have no incentive to
control the delicate state. We further show how our framework unifies and
generalizes existing proposals.Comment: Presented at AAAI 202
Tracr: Compiled Transformers as a Laboratory for Interpretability
We show how to "compile" human-readable programs into standard decoder-only
transformer models. Our compiler, Tracr, generates models with known structure.
This structure can be used to design experiments. For example, we use it to
study "superposition" in transformers that execute multi-step algorithms.
Additionally, the known structure of Tracr-compiled models can serve as
ground-truth for evaluating interpretability methods. Commonly, because the
"programs" learned by transformers are unknown it is unclear whether an
interpretation succeeded. We demonstrate our approach by implementing and
examining programs including computing token frequencies, sorting, and
parenthesis checking. We provide an open-source implementation of Tracr at
https://github.com/google-deepmind/tracr.Comment: Presented at NeurIPS 2023 (Spotlight
- …