Distributed representations provide a vector space that captures meaningful
relationships between data instances. The distributed nature of these
representations, however, entangles multiple attributes or concepts of data
instances (e.g., the topic or sentiment of a text, or author characteristics
such as age and gender). Recent work has proposed the task of concept erasure,
in which, rather than making a concept predictable, the goal is to remove an
attribute from distributed representations while retaining as much of the
other information in the original representation space as possible. In this
paper, we propose a new distance metric learning-based objective, the
Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure.
KRaM fits a transformation of representations to match a specified distance
measure (defined by a labeled concept to erase) using a modified
rate-distortion function. Specifically, KRaM's objective function aims to make
instances with similar concept labels dissimilar in the learned representation
space while retaining other information. We find that optimizing KRaM
effectively erases categorical, continuous, and vector-valued concepts from
data representations across diverse domains. We
also provide a theoretical analysis of several properties of KRaM's objective.
To assess the quality of the learned representations, we propose an alignment
score to evaluate their similarity with the original representation space.
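The abstract does not give the alignment score's formula; as one concrete stand-in, the hypothetical sketch below measures how well local neighborhood structure is preserved between the original and erased spaces. The name `alignment_score` and the k-nearest-neighbor overlap criterion are assumptions, not the paper's definition.

```python
import torch

def alignment_score(X, Z, k=10):
    """Hypothetical alignment score (an assumed stand-in, not the paper's
    exact metric): mean fraction of each point's k nearest neighbors that
    is preserved between the original space X and the erased space Z."""
    def knn(A):
        D = torch.cdist(A, A)
        D.fill_diagonal_(float("inf"))  # exclude self-matches
        return D.topk(k, largest=False).indices
    nx, nz = knn(X), knn(Z)
    overlap = [len(set(a.tolist()) & set(b.tolist())) for a, b in zip(nx, nz)]
    return sum(overlap) / (k * len(overlap))  # 1.0 = identical neighborhoods
```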
Additionally, we conduct experiments to showcase KRaM's efficacy in various
settings, from erasing binary gender variables in word embeddings to
vector-valued variables in GPT-3 representations.