LEACE: Perfect linear concept erasure in closed form
Concept erasure aims to remove specified features from a representation. It
can be used to improve fairness (e.g. preventing a classifier from using gender
or race) and interpretability (e.g. removing a concept to observe changes in
model behavior). In this paper, we introduce LEAst-squares Concept Erasure
(LEACE), a closed-form method which provably prevents all linear classifiers
from detecting a concept while inflicting the least possible damage to the
representation. We apply LEACE to large language models with a novel procedure
called "concept scrubbing," which erases target concept information from every
layer in the network. We demonstrate the usefulness of our method on two tasks:
measuring the reliance of language models on part-of-speech information, and
reducing gender bias in BERT embeddings. Code is available at
https://github.com/EleutherAI/concept-erasure
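Because the eraser is available in closed form, the core computation fits in a few lines. The NumPy sketch below is a minimal illustration of least-squares erasure, assuming X is an (n, d) feature matrix and Z an (n, k) matrix of concept labels; the function name, eigenvalue cutoff, and whitening-based construction are illustrative choices for this sketch, not the interface of the released concept-erasure package.

```python
import numpy as np

def leace_eraser(X, Z, eps=1e-9):
    """Sketch of closed-form least-squares concept erasure.

    X: (n, d) features; Z: (n, k) concept labels (e.g. one-hot).
    Returns a function mapping an (m, d) batch to its erased version.
    """
    X, Z = np.asarray(X, float), np.asarray(Z, float)
    mu_x, mu_z = X.mean(0), Z.mean(0)
    Xc, Zc = X - mu_x, Z - mu_z

    sigma_xx = Xc.T @ Xc / len(X)   # feature covariance
    sigma_xz = Xc.T @ Zc / len(X)   # cross-covariance with the concept

    # Whitening transform W = sigma_xx^{-1/2} via the eigendecomposition,
    # restricted to non-degenerate directions.
    eigval, eigvec = np.linalg.eigh(sigma_xx)
    keep = eigval > eps
    V = eigvec[:, keep]
    W = V @ np.diag(eigval[keep] ** -0.5) @ V.T
    W_pinv = V @ np.diag(eigval[keep] ** 0.5) @ V.T

    # Orthogonal projection onto the concept-correlated directions,
    # computed in whitened space.
    A = W @ sigma_xz
    P = A @ np.linalg.pinv(A)

    def erase(x):
        # Remove the concept-correlated component, leaving the rest intact.
        return x - (W_pinv @ P @ W @ (x - mu_x).T).T

    return erase
```

By construction, the erased features have zero cross-covariance with Z, which is exactly the condition under which no linear classifier can recover the concept, and the whitening step keeps the edit as small as possible in a least-squares sense.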
Eliciting Latent Predictions from Transformers with the Tuned Lens
We analyze transformers from the perspective of iterative inference, seeking
to understand how model predictions are refined layer by layer. To do so, we
train an affine probe for each block in a frozen pretrained model, making it
possible to decode every hidden state into a distribution over the vocabulary.
Our method, the tuned lens, is a refinement of the earlier "logit lens"
technique, which yielded useful insights but is often brittle.
We test our method on various autoregressive language models with up to 20B
parameters, showing it to be more predictive, reliable and unbiased than the
logit lens. With causal experiments, we show the tuned lens uses similar
features to the model itself. We also find the trajectory of latent predictions
can be used to detect malicious inputs with high accuracy. All code needed to
reproduce our results can be found at
https://github.com/AlignmentResearch/tuned-lens
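As a concrete picture of what "an affine probe for each block" means, the sketch below decodes an intermediate hidden state by passing it through a learned affine translator and then through the model's frozen final LayerNorm and unembedding. It is a minimal sketch with assumed names (TunedLensProbe, d_model, unembed); the released tuned-lens package has its own interfaces.

```python
import torch
import torch.nn as nn

class TunedLensProbe(nn.Module):
    """Illustrative per-layer affine probe in the spirit of the tuned lens.

    d_model: hidden size of the transformer.
    unembed: the frozen final LayerNorm + unembedding head of the model.
    """
    def __init__(self, d_model: int, unembed: nn.Module):
        super().__init__()
        # Affine "translator" initialised to the identity, so the probe
        # starts out as the plain logit lens and only learns a correction.
        self.translator = nn.Linear(d_model, d_model)
        nn.init.eye_(self.translator.weight)
        nn.init.zeros_(self.translator.bias)
        self.unembed = unembed  # kept frozen

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Map the intermediate hidden state toward the final-layer basis,
        # then decode it with the model's own unembedding into vocab logits.
        return self.unembed(self.translator(hidden))
```

Each probe would be trained with the base model frozen so that its output distribution matches the model's final-layer distribution (for example by minimising a KL divergence); initialising the translator to the identity means training starts from the plain logit lens and only has to learn the per-layer correction.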
imitation: Clean Imitation Learning Implementations
imitation provides open-source implementations of imitation and reward
learning algorithms in PyTorch. We include three inverse reinforcement learning
(IRL) algorithms, three imitation learning algorithms and a preference
comparison algorithm. The implementations have been benchmarked against
previous results, and automated tests cover 98% of the code. Moreover, the
algorithms are implemented in a modular fashion, making it simple to develop
novel algorithms in the framework. Our source code, including documentation and
examples, is available at https://github.com/HumanCompatibleAI/imitation
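To make the family of algorithms concrete, here is a minimal behavioural-cloning sketch, the simplest imitation learning algorithm, written as a plain PyTorch loop. It illustrates the general idea only and is not the imitation library's API; the tensor shapes, network size, and function name are illustrative assumptions.

```python
import torch
import torch.nn as nn

def behavioural_cloning(obs: torch.Tensor, acts: torch.Tensor,
                        n_actions: int, epochs: int = 10) -> nn.Module:
    """Fit a policy to expert data by supervised learning.

    obs: (n, obs_dim) float tensor of expert observations.
    acts: (n,) long tensor of expert discrete-action indices.
    """
    policy = nn.Sequential(
        nn.Linear(obs.shape[1], 64), nn.ReLU(),
        nn.Linear(64, n_actions),            # logits over discrete actions
    )
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(obs), acts)    # imitate the expert's choices
        loss.backward()
        opt.step()
    return policy
```

Inverse RL and preference-comparison methods replace this direct supervised objective with a learned reward model, but the library's modular structure means they share the same policy, data, and training abstractions.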
Adversarial Policies Beat Superhuman Go AIs
We attack the state-of-the-art Go-playing AI system KataGo by training
adversarial policies against it, achieving a >97% win rate against KataGo
running at superhuman settings. Our adversaries do not win by playing Go well.
Instead, they trick KataGo into making serious blunders. Our attack transfers
zero-shot to other superhuman Go-playing AIs, and is comprehensible to the
extent that human experts can implement it without algorithmic assistance to
consistently beat superhuman AIs. The core vulnerability uncovered by our
attack persists even in KataGo agents adversarially trained to defend against
our attack. Our results demonstrate that even superhuman AI systems may harbor
surprising failure modes. Example games are available at https://goattack.far.ai/.
Comment: Accepted to ICML 2023; see the paper for the changelog.
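The attack rests on a simple asymmetry: the victim is held fixed while only the adversary is optimised against it. The toy sketch below illustrates that framing with a made-up three-move zero-sum game; nothing in it (the payoff matrix, network sizes, or training loop) corresponds to the actual KataGo attack.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
victim = nn.Linear(3, 3)                     # frozen "victim" policy over 3 moves
for p in victim.parameters():
    p.requires_grad_(False)

adversary = nn.Linear(3, 3)                  # trainable adversarial policy
opt = torch.optim.Adam(adversary.parameters(), lr=1e-2)
payoff = torch.randn(3, 3)                   # toy zero-sum payoff for the adversary

for _ in range(200):
    state = torch.randn(16, 3)               # toy observations
    pi_adv = adversary(state).softmax(-1)
    pi_vic = victim(state).softmax(-1)
    # Expected adversary payoff under both (differentiable) policies.
    reward = torch.einsum('bi,ij,bj->b', pi_adv, payoff, pi_vic).mean()
    opt.zero_grad()
    (-reward).backward()                     # maximise payoff against the frozen victim
    opt.step()
```

Because only the adversary's parameters receive gradients, it is free to find whatever inputs exploit the fixed victim's blind spots rather than learning to play the game well, which is the sense in which the resulting policy is "adversarial" rather than strong.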