LEACE: Perfect linear concept erasure in closed form
Concept erasure aims to remove specified features from a representation. It
can be used to improve fairness (e.g. preventing a classifier from using gender
or race) and interpretability (e.g. removing a concept to observe changes in
model behavior). In this paper, we introduce LEAst-squares Concept Erasure
(LEACE), a closed-form method which provably prevents all linear classifiers
from detecting a concept while inflicting the least possible damage to the
representation. We apply LEACE to large language models with a novel procedure
called "concept scrubbing," which erases target concept information from every
layer in the network. We demonstrate the usefulness of our method on two tasks:
measuring the reliance of language models on part-of-speech information, and
reducing gender bias in BERT embeddings. Code is available at
https://github.com/EleutherAI/concept-erasure
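Because the eraser is available in closed form, the core computation fits in a few lines. The NumPy sketch below is a minimal illustration of least-squares erasure, assuming X is an (n, d) feature matrix and Z an (n, k) matrix of concept labels; the function name, eigenvalue cutoff, and whitening-based construction are illustrative choices for this sketch, not the interface of the released concept-erasure package.

```python
import numpy as np

def leace_eraser(X, Z, eps=1e-9):
    """Sketch of closed-form least-squares concept erasure.

    X: (n, d) features; Z: (n, k) concept labels (e.g. one-hot).
    Returns a function mapping an (m, d) batch to its erased version.
    """
    X, Z = np.asarray(X, float), np.asarray(Z, float)
    mu_x, mu_z = X.mean(0), Z.mean(0)
    Xc, Zc = X - mu_x, Z - mu_z

    sigma_xx = Xc.T @ Xc / len(X)   # feature covariance
    sigma_xz = Xc.T @ Zc / len(X)   # cross-covariance with the concept

    # Whitening transform W = sigma_xx^{-1/2} via the eigendecomposition,
    # restricted to non-degenerate directions.
    eigval, eigvec = np.linalg.eigh(sigma_xx)
    keep = eigval > eps
    V = eigvec[:, keep]
    W = V @ np.diag(eigval[keep] ** -0.5) @ V.T
    W_pinv = V @ np.diag(eigval[keep] ** 0.5) @ V.T

    # Orthogonal projection onto the concept-correlated directions,
    # computed in whitened space.
    A = W @ sigma_xz
    P = A @ np.linalg.pinv(A)

    def erase(x):
        # Remove the concept-correlated component, leaving the rest intact.
        return x - (W_pinv @ P @ W @ (x - mu_x).T).T

    return erase
```

By construction, the erased features have zero cross-covariance with Z, which is exactly the condition under which no linear classifier can recover the concept, and the whitening step keeps the edit as small as possible in a least-squares sense.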
Eliciting Latent Predictions from Transformers with the Tuned Lens
We analyze transformers from the perspective of iterative inference, seeking
to understand how model predictions are refined layer by layer. To do so, we
train an affine probe for each block in a frozen pretrained model, making it
possible to decode every hidden state into a distribution over the vocabulary.
Our method, the tuned lens, is a refinement of the earlier "logit lens"
technique, which yielded useful insights but is often brittle.
We test our method on various autoregressive language models with up to 20B
parameters, showing it to be more predictive, reliable and unbiased than the
logit lens. With causal experiments, we show the tuned lens uses similar
features to the model itself. We also find the trajectory of latent predictions
can be used to detect malicious inputs with high accuracy. All code needed to
reproduce our results can be found at
https://github.com/AlignmentResearch/tuned-lens
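As a concrete picture of what "an affine probe for each block" means, the sketch below decodes an intermediate hidden state by passing it through a learned affine translator and then through the model's frozen final LayerNorm and unembedding. It is a minimal sketch with assumed names (TunedLensProbe, d_model, unembed); the released tuned-lens package has its own interfaces.

```python
import torch
import torch.nn as nn

class TunedLensProbe(nn.Module):
    """Illustrative per-layer affine probe in the spirit of the tuned lens.

    d_model: hidden size of the transformer.
    unembed: the frozen final LayerNorm + unembedding head of the model.
    """
    def __init__(self, d_model: int, unembed: nn.Module):
        super().__init__()
        # Affine "translator" initialised to the identity, so the probe
        # starts out as the plain logit lens and only learns a correction.
        self.translator = nn.Linear(d_model, d_model)
        nn.init.eye_(self.translator.weight)
        nn.init.zeros_(self.translator.bias)
        self.unembed = unembed  # kept frozen

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Map the intermediate hidden state toward the final-layer basis,
        # then decode it with the model's own unembedding into vocab logits.
        return self.unembed(self.translator(hidden))
```

Each probe would be trained with the base model frozen so that its output distribution matches the model's final-layer distribution (for example by minimising a KL divergence); initialising the translator to the identity means training starts from the plain logit lens and only has to learn the per-layer correction.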
imitation: Clean Imitation Learning Implementations
imitation provides open-source implementations of imitation and reward
learning algorithms in PyTorch. We include three inverse reinforcement learning
(IRL) algorithms, three imitation learning algorithms and a preference
comparison algorithm. The implementations have been benchmarked against
previous results, and automated tests cover 98% of the code. Moreover, the
algorithms are implemented in a modular fashion, making it simple to develop
novel algorithms in the framework. Our source code, including documentation and
examples, is available at https://github.com/HumanCompatibleAI/imitation
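To make the family of algorithms concrete, here is a minimal behavioural-cloning sketch, the simplest imitation learning algorithm, written as a plain PyTorch loop. It illustrates the general idea only and is not the imitation library's API; the tensor shapes, network size, and function name are illustrative assumptions.

```python
import torch
import torch.nn as nn

def behavioural_cloning(obs: torch.Tensor, acts: torch.Tensor,
                        n_actions: int, epochs: int = 10) -> nn.Module:
    """Fit a policy to expert data by supervised learning.

    obs: (n, obs_dim) float tensor of expert observations.
    acts: (n,) long tensor of expert discrete-action indices.
    """
    policy = nn.Sequential(
        nn.Linear(obs.shape[1], 64), nn.ReLU(),
        nn.Linear(64, n_actions),            # logits over discrete actions
    )
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(obs), acts)    # imitate the expert's choices
        loss.backward()
        opt.step()
    return policy
```

Inverse RL and preference-comparison methods replace this direct supervised objective with a learned reward model, but the library's modular structure means they share the same policy, data, and training abstractions.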
Adversarial Policies Beat Superhuman Go AIs
We attack the state-of-the-art Go-playing AI system KataGo by training
adversarial policies against it, achieving a >97% win rate against KataGo
running at superhuman settings. Our adversaries do not win by playing Go well.
Instead, they trick KataGo into making serious blunders. Our attack transfers
zero-shot to other superhuman Go-playing AIs, and is comprehensible to the
extent that human experts can implement it without algorithmic assistance to
consistently beat superhuman AIs. The core vulnerability uncovered by our
attack persists even in KataGo agents adversarially trained to defend against
our attack. Our results demonstrate that even superhuman AI systems may harbor
surprising failure modes. Example games are available at https://goattack.far.ai/.
Comment: Accepted to ICML 2023; see the paper for the changelog.
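The attack rests on a simple asymmetry: the victim is held fixed while only the adversary is optimised against it. The toy sketch below illustrates that framing with a made-up three-move zero-sum game; nothing in it (the payoff matrix, network sizes, or training loop) corresponds to the actual KataGo attack.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
victim = nn.Linear(3, 3)                     # frozen "victim" policy over 3 moves
for p in victim.parameters():
    p.requires_grad_(False)

adversary = nn.Linear(3, 3)                  # trainable adversarial policy
opt = torch.optim.Adam(adversary.parameters(), lr=1e-2)
payoff = torch.randn(3, 3)                   # toy zero-sum payoff for the adversary

for _ in range(200):
    state = torch.randn(16, 3)               # toy observations
    pi_adv = adversary(state).softmax(-1)
    pi_vic = victim(state).softmax(-1)
    # Expected adversary payoff under both (differentiable) policies.
    reward = torch.einsum('bi,ij,bj->b', pi_adv, payoff, pi_vic).mean()
    opt.zero_grad()
    (-reward).backward()                     # maximise payoff against the frozen victim
    opt.step()
```

Because only the adversary's parameters receive gradients, it is free to find whatever inputs exploit the fixed victim's blind spots rather than learning to play the game well, which is the sense in which the resulting policy is "adversarial" rather than strong.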