
52 research outputs found

Path-Specific Objectives for Safer Agent Incentives

Authors: Carey, Ryan; Everitt, Tom; Farquhar, Sebastian
Publication date: 21/04/2022

We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state, using Causal Influence Diagram analysis. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.

Comment: Presented at AAAI 202
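The idea of maximizing only the effect on return that is not mediated by the delicate state can be illustrated with a toy model. The sketch below is not the authors' implementation: the two actions, the `delicate_state` and `reward` functions, the numbers, and the choice of a baseline action are all assumptions made for illustration. It contrasts a naive objective, where the delicate state responds to the action, with a path-specific one, where the delicate state is evaluated as if the baseline action had been taken.

```python
# Toy structural model. Every function and number here is a hypothetical
# stand-in chosen for illustration, not taken from the paper.
ACTIONS = ["honest", "manipulate"]
BASELINE = "honest"  # assumed reference action for the delicate state


def delicate_state(action):
    """Delicate part of the state (e.g. a user's opinion) as a function of the action."""
    return 1.0 if action == "manipulate" else 0.0


def reward(action, opinion):
    """Return depends on the action directly and on the delicate state."""
    direct = 1.0 if action == "honest" else 0.8
    return direct + 0.5 * opinion


def naive_objective(action):
    # The delicate state responds to the action, so influencing it pays off.
    return reward(action, delicate_state(action))


def path_specific_objective(action, baseline=BASELINE):
    # The delicate state is held at its value under the baseline action,
    # so the path action -> delicate state -> reward contributes nothing.
    return reward(action, delicate_state(baseline))


naive_best = max(ACTIONS, key=naive_objective)
path_specific_best = max(ACTIONS, key=path_specific_objective)
print(naive_best, path_specific_best)
```

In this toy setting the naive objective prefers the action that shifts the delicate state, while the path-specific objective does not, because changes to the delicate state no longer affect the score.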

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Tracr: Compiled Transformers as a Laboratory for Interpretability

Authors: Farquhar, Sebastian; Kramár, János; Lindner, David; McGrath, Thomas; Mikulik, Vladimir; Rahtz, Matthew
Publication date: 03/11/2023

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.

Comment: Presented at NeurIPS 2023 (Spotlight
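To give a feel for the kind of human-readable program the abstract mentions, here is a minimal sketch of the token-frequency example in plain Python. It does not use the Tracr library or its API: `select` and `selector_width` are hypothetical stand-ins, written here as ordinary functions, that loosely mimic attention-style "compare every query position with every key position" primitives.

```python
def select(keys, queries, predicate):
    """Boolean matrix: entry [q][k] is True when predicate(key, query) holds.

    Hypothetical stand-in for an attention-pattern primitive.
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector):
    """Number of selected key positions for each query position."""
    return [sum(row) for row in selector]


def token_frequencies(tokens):
    """For each position, count how often that position's token occurs in the sequence."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


print(token_frequencies(["a", "b", "c", "a"]))
```

Each position attends to all positions holding the same token, and the width of that selection is the token's frequency. A compiler in the spirit of the abstract would map such primitives onto attention and MLP layers; that step is not shown here.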

arXiv.org e-Print Archive