863 research outputs found
Disentangling Disentanglement in Variational Autoencoders
We develop a generalisation of disentanglement in variational autoencoders (VAEs)—decomposition of the latent representation—characterising it as the fulfilment of two factors: a) the latent encodings of the data having an appropriate level of overlap, and b) the aggregate encoding of the data conforming to a desired structure, represented through the prior. Decomposition permits disentanglement, i.e. explicit independence between latents, as a special case, but also allows for a much richer class of properties to be imposed on the learnt representation, such as sparsity, clustering, independent subspaces, or even intricate hierarchical dependency relationships. We show that the β-VAE varies from the standard VAE predominantly in its control of latent overlap and that for the standard choice of an isotropic Gaussian prior, its objective is invariant to rotations of the latent representation. Viewed from the decomposition perspective, breaking this invariance with simple manipulations of the prior can yield better disentanglement with little or no detriment to reconstructions. We further demonstrate how other choices of prior can assist in producing different decompositions and introduce an alternative training objective that allows the control of both decomposition factors in a principled manner
Disentangling Factors of Variation by Mixing Them
We propose an approach to learn image representations that consist of
disentangled factors of variation without exploiting any manual labeling or
data domain knowledge. A factor of variation corresponds to an image attribute
that can be discerned consistently across a set of images, such as the pose or
color of objects. Our disentangled representation consists of a concatenation
of feature chunks, each chunk representing a factor of variation. It supports
applications such as transferring attributes from one image to another, by
simply mixing and unmixing feature chunks, and classification or retrieval
based on one or several attributes, by considering a user-specified subset of
feature chunks. We learn our representation without any labeling or knowledge
of the data domain, using an autoencoder architecture with two novel training
objectives: first, we propose an invariance objective to encourage that
encoding of each attribute, and decoding of each chunk, are invariant to
changes in other attributes and chunks, respectively; second, we include a
classification objective, which ensures that each chunk corresponds to a
consistently discernible attribute in the represented image, hence avoiding
degenerate feature mappings where some chunks are completely ignored. We
demonstrate the effectiveness of our approach on the MNIST, Sprites, and CelebA
datasets.Comment: CVPR 201
Towards Visually Explaining Variational Autoencoders
Recent advances in Convolutional Neural Network (CNN) model interpretability
have led to impressive progress in visualizing and understanding model
predictions. In particular, gradient-based visual attention methods have driven
much recent effort in using visual attention maps as a means for visual
explanations. A key problem, however, is these methods are designed for
classification and categorization tasks, and their extension to explaining
generative models, e.g. variational autoencoders (VAE) is not trivial. In this
work, we take a step towards bridging this crucial gap, proposing the first
technique to visually explain VAEs by means of gradient-based attention. We
present methods to generate visual attention from the learned latent space, and
also demonstrate such attention explanations serve more than just explaining
VAE predictions. We show how these attention maps can be used to localize
anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD
dataset. We also show how they can be infused into model training, helping
bootstrap the VAE into learning improved latent space disentanglement,
demonstrated on the Dsprites dataset
Learning Disentangled Representations with Reference-Based Variational Autoencoders
Learning disentangled representations from visual data, where different
high-level generative factors are independently encoded, is of importance for
many computer vision tasks. Solving this problem, however, typically requires
to explicitly label all the factors of interest in training images. To
alleviate the annotation cost, we introduce a learning setting which we refer
to as "reference-based disentangling". Given a pool of unlabeled images, the
goal is to learn a representation where a set of target factors are
disentangled from others. The only supervision comes from an auxiliary
"reference set" containing images where the factors of interest are constant.
In order to address this problem, we propose reference-based variational
autoencoders, a novel deep generative model designed to exploit the
weak-supervision provided by the reference set. By addressing tasks such as
feature learning, conditional image generation or attribute transfer, we
validate the ability of the proposed model to learn disentangled
representations from this minimal form of supervision
- …