SPINE: SParse Interpretable Neural Embeddings
Prediction without justification has limited utility. Much of the success of
neural models can be attributed to their ability to learn rich, dense and
expressive representations. While these representations capture the underlying
complexity and latent trends in the data, they are far from being
interpretable. We propose a novel variant of denoising k-sparse autoencoders
that generates highly efficient and interpretable distributed word
representations (word embeddings), beginning with existing word representations
from state-of-the-art methods like GloVe and word2vec. Through large scale
human evaluation, we report that our resulting word embedddings are much more
interpretable than the original GloVe and word2vec embeddings. Moreover, our
embeddings outperform existing popular word embeddings on a diverse suite of
benchmark downstream tasks.Comment: AAAI 201
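The core mechanism is easy to sketch: encode a pretrained embedding into nonnegative codes, keep only the k largest activations, and train to reconstruct the clean vector from a corrupted input. Below is a minimal illustration of that k-sparse denoising idea in PyTorch, not the authors' released implementation; the dimensions, noise level, and the class name `KSparseAutoencoder` are illustrative assumptions, and the published objective differs in detail (it is a variant of the k-sparse autoencoder with additional sparsity terms).

```python
# Minimal k-sparse denoising autoencoder sketch (illustrative only, not
# the SPINE authors' code). Input: pretrained embeddings of shape (N, d);
# output: sparse codes of shape (N, h) with at most k active units each.
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, d=300, h=1000, k=150):
        super().__init__()
        self.encoder = nn.Linear(d, h)
        self.decoder = nn.Linear(h, d)
        self.k = k

    def forward(self, x):
        z = torch.relu(self.encoder(x))          # nonnegative codes
        topk = torch.topk(z, self.k, dim=-1)     # k-sparsity: keep the k largest
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z_sparse = z * mask
        return self.decoder(z_sparse), z_sparse

model = KSparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 300)                         # stand-in for GloVe vectors
x_noisy = x + 0.1 * torch.randn_like(x)          # denoising: corrupt the input
x_hat, z = model(x_noisy)
loss = nn.functional.mse_loss(x_hat, x)          # reconstruct the clean vector
loss.backward()
opt.step()
```

After training, each code dimension can be inspected by its top-activating words, which is the kind of property the human evaluation above measures.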
When Are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity
Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of "higher order" expansion conditions on the topic-word matrix or the population structure of the model. These higher-order expansion conditions allow for overcomplete models and require the existence of a perfect matching from latent topics to higher-order observed words. We establish that random structured topic models are identifiable with high probability (w.h.p.) in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity, which is contained in the class of Tucker decompositions but is more general than the Candecomp/Parafac (CP) decomposition.
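The combinatorial heart of these expansion conditions is a perfect matching from latent topics to (tuples of) higher-order observed words. As a loose illustration only, and a simplification of the paper's formal condition, the sketch below checks whether a random sparse topic-word support admits a matching that saturates every topic with a distinct word; the sizes and density are arbitrary assumptions.

```python
# Illustrative check of the matching ingredient behind the expansion
# conditions: does a random sparse topic-word support admit a matching
# that saturates every topic? (The paper's actual condition matches
# topics to higher-order *tuples* of words; this is a simplification.)
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

rng = np.random.default_rng(0)
n_topics, n_words, density = 50, 200, 0.05
support = (rng.random((n_topics, n_words)) < density).astype(np.int8)

# perm_type='column' returns, for each topic (row), its matched word (column),
# or -1 if that topic could not be matched.
match = maximum_bipartite_matching(csr_matrix(support), perm_type='column')
saturated = (match >= 0).all()
print(f"every topic matched to a distinct word: {saturated}")
```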
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
CLIP embeddings have demonstrated remarkable performance across a wide range
of computer vision tasks. However, these high-dimensional, dense vector
representations are not easily interpretable, restricting their usefulness in
downstream applications that require transparency. In this work, we empirically
show that CLIP's latent space is highly structured, and consequently that CLIP
representations can be decomposed into their underlying semantic components. We
leverage this understanding to propose a novel method, Sparse Linear Concept
Embeddings (SpLiCE), for transforming CLIP representations into sparse linear
combinations of human-interpretable concepts. Distinct from previous work,
SpLiCE does not require concept labels and can be applied post hoc. Through
extensive experimentation with multiple real-world datasets, we validate that
the representations output by SpLiCE can explain and even replace traditional
dense CLIP representations, maintaining equivalent downstream performance while
significantly improving their interpretability. We also demonstrate several use
cases of SpLiCE representations including detecting spurious correlations,
model editing, and quantifying semantic shifts in datasets.
Comment: 17 pages, 8 figures. Code is provided at
https://github.com/AI4LIFE-GROUP/SpLiC
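At its core, this kind of decomposition is sparse coding: solve for a sparse, nonnegative weight vector over a dictionary of concept embeddings. The sketch below illustrates that idea with scikit-learn's Lasso on random stand-in data; the dictionary construction, solver, and normalization used by SpLiCE itself live in the linked repository, so every name and number here is an assumption.

```python
# Sketch of decomposing a dense embedding into a sparse, nonnegative
# combination of concept vectors (illustrative; not the SpLiCE code).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, n_concepts = 512, 1000
concept_dict = rng.standard_normal((d, n_concepts))
concept_dict /= np.linalg.norm(concept_dict, axis=0)   # unit-norm concept vectors

clip_embedding = rng.standard_normal(d)                # stand-in for a CLIP vector

# positive=True keeps weights nonnegative, so each concept either
# contributes or is absent; alpha trades reconstruction for sparsity.
solver = Lasso(alpha=0.05, positive=True, fit_intercept=False)
solver.fit(concept_dict, clip_embedding)
weights = solver.coef_

active = np.flatnonzero(weights)
print(f"{len(active)} active concepts out of {n_concepts}")
reconstruction = concept_dict @ weights                # sparse approximation
```

The few active concepts (and their weights) then serve as a human-readable explanation of the embedding, which is the transparency property the abstract describes.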
Representation Learning: A Review and New Perspectives
The success of machine learning algorithms generally depends on data
representation, and we hypothesize that this is because different
representations can entangle and hide more or less the different explanatory
factors of variation behind the data. Although specific domain knowledge can be
used to help design representations, learning with generic priors can also be
used, and the quest for AI is motivating the design of more powerful
representation-learning algorithms implementing such priors. This paper reviews
recent work in the area of unsupervised feature learning and deep learning,
covering advances in probabilistic models, auto-encoders, manifold learning,
and deep networks. This motivates longer-term unanswered questions about the
appropriate objectives for learning good representations, for computing
representations (i.e., inference), and the geometrical connections between
representation learning, density estimation and manifold learning.
Toy Models of Superposition
Neural networks often pack many unrelated concepts into a single neuron - a
puzzling phenomenon known as 'polysemanticity' which makes interpretability
much more challenging. This paper provides a toy model where polysemanticity
can be fully understood, arising as a result of models storing additional
sparse features in "superposition." We demonstrate the existence of a phase
change, a surprising connection to the geometry of uniform polytopes, and
evidence of a link to adversarial examples. We also discuss potential
implications for mechanistic interpretability.
Comment: Also available at
https://transformer-circuits.pub/2022/toy_model/index.htm
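The toy model is small enough to reproduce in a few lines. The sketch below follows the spirit of the paper's ReLU-output setup, reconstructing sparse features through a bottleneck via x_hat = ReLU(W^T W x + b); the dimensions, sparsity level, and optimizer settings are illustrative choices, not the published configuration.

```python
# Toy superposition setup in the spirit of the paper: n sparse features
# compressed into m < n dimensions through a tied-weight ReLU model.
# Dimensions and sparsity here are illustrative assumptions.
import torch

n_features, m_dims, sparsity = 20, 5, 0.95
W = torch.randn(m_dims, n_features, requires_grad=True)
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic features: each is zero with probability `sparsity`.
    x = torch.rand(1024, n_features)
    x = x * (torch.rand_like(x) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)     # x_hat = ReLU(W^T W x + b)
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With features much sparser than dimensions, the Gram matrix W^T W
# develops off-diagonal interference between features: superposition.
print(torch.round(W.T @ W, decimals=2))
```

Sweeping the sparsity level in a setup like this is how the paper exhibits the phase change between dedicated dimensions and superposed features.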
Interpreting Neural Networks through the Polytope Lens
Mechanistic interpretability aims to explain what a neural network has
learned at a nuts-and-bolts level. What are the fundamental primitives of
neural network representations? Previous mechanistic descriptions have used
individual neurons or their linear combinations to understand the
representations a network has learned. But there are clues that neurons and
their linear combinations are not the correct fundamental units of description:
directions cannot describe how neural networks use nonlinearities to structure
their representations. Moreover, many instances of individual neurons and their
combinations are polysemantic (i.e. they have multiple unrelated meanings).
Polysemanticity makes interpreting the network in terms of neurons or
directions challenging since we can no longer assign a specific feature to a
neural unit. In order to find a basic unit of description that does not suffer
from these problems, we zoom in beyond just directions to study the way that
piecewise linear activation functions (such as ReLU) partition the activation
space into numerous discrete polytopes. We call this perspective the polytope
lens. The polytope lens makes concrete predictions about the behavior of neural
networks, which we evaluate through experiments on both convolutional image
classifiers and language models. Specifically, we show that polytopes can be
used to identify monosemantic regions of activation space (while directions are
not in general monosemantic) and that the density of polytope boundaries
reflects semantic boundaries. We also outline a vision for what mechanistic
interpretability might look like through the polytope lens.
Comment: 22/11/22 initial upload
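The basic object behind the polytope lens is the activation pattern: two inputs lie in the same polytope exactly when every ReLU is on or off identically for both, and within a polytope the network computes a single affine map. A minimal sketch on a small random MLP, where the architecture and test inputs are arbitrary assumptions:

```python
# Identify which polytope an input lies in by its ReLU activation pattern.
# Inputs with identical patterns share a polytope, on which the network
# restricts to one affine map. Small random MLP for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU(),
                    nn.Linear(16, 16), nn.ReLU())

def polytope_code(x):
    """Binary activation pattern identifying x's polytope."""
    code = []
    h = x
    for layer in mlp:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            code.append((h > 0).flatten())   # which units are active
    return torch.cat(code)

a = torch.tensor([[0.10, 0.20]])
b = torch.tensor([[0.11, 0.21]])             # nearby point
same_region = torch.equal(polytope_code(a), polytope_code(b))
print(f"same polytope: {same_region}")
```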