Neural Discrete Representation Learning
Learning useful representations without supervision remains a key challenge
in machine learning. In this paper, we propose a simple yet powerful generative
model that learns such discrete representations. Our model, the Vector
Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways:
the encoder network outputs discrete, rather than continuous, codes; and the
prior is learnt rather than static. In order to learn a discrete latent
representation, we incorporate ideas from vector quantisation (VQ). Using the
VQ method allows the model to circumvent issues of "posterior collapse" --
where the latents are ignored when they are paired with a powerful
autoregressive decoder -- typically observed in the VAE framework. Pairing
these representations with an autoregressive prior, the model can generate
high-quality images, videos, and speech, as well as perform high-quality
speaker conversion and unsupervised learning of phonemes, providing further
evidence of the utility of the learnt representations.
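The quantisation step at the heart of VQ-VAE can be pictured as a nearest-neighbour lookup into a learnable codebook. A minimal NumPy sketch (function and variable names are ours, not from the paper's code; the straight-through gradient trick used in training is only described in the docstring):

```python
import numpy as np

def vector_quantise(z_e, codebook):
    """Replace each continuous encoder output with its nearest codebook entry.

    z_e      : (n, d) continuous encoder outputs
    codebook : (K, d) learnable embedding vectors
    Returns the quantised vectors and their discrete indices. During training,
    gradients are copied straight through from z_q back to z_e.
    """
    # Squared Euclidean distance from every encoder output to every code.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # the discrete latent codes
    z_q = codebook[indices]          # quantised representation
    return z_q, indices

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K = 8 codes of dimension 4
z_e = rng.normal(size=(3, 4))
z_q, indices = vector_quantise(z_e, codebook)
```

The discrete indices are the representation over which a learnt autoregressive prior can later be fit.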
Representation Learning with Contrastive Predictive Coding
While supervised learning has enabled great progress in many applications,
unsupervised learning has not seen such widespread adoption, and remains an
important and challenging endeavor for artificial intelligence. In this work,
we propose a universal unsupervised learning approach to extract useful
representations from high-dimensional data, which we call Contrastive
Predictive Coding. The key insight of our model is to learn such
representations by predicting the future in latent space by using powerful
autoregressive models. We use a probabilistic contrastive loss which induces
the latent space to capture information that is maximally useful to predict
future samples. It also makes the model tractable by using negative sampling.
While most prior work has focused on evaluating representations for a
particular modality, we demonstrate that our approach is able to learn useful
representations, achieving strong performance across four distinct domains:
speech, images, text, and reinforcement learning in 3D environments.
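The probabilistic contrastive loss amounts to a softmax cross-entropy that scores the predicted future latent against the true future and a set of negative samples. A hedged NumPy sketch (our own naming, not the paper's code):

```python
import numpy as np

def contrastive_loss(pred, future, negatives):
    """Score a predicted future latent against the true future (positive)
    and latents drawn from other sequences (negative samples).

    pred      : (d,) prediction from the autoregressive model
    future    : (d,) true future latent
    negatives : (n, d) negative samples
    Returns -log of the softmax probability assigned to the positive.
    """
    scores = np.concatenate(([pred @ future], negatives @ pred))
    scores -= scores.max()                     # numerical stability
    return -(scores[0] - np.log(np.exp(scores).sum()))

rng = np.random.default_rng(1)
d = 16
future = rng.normal(size=d)
negatives = rng.normal(size=(10, d))
loss_good = contrastive_loss(future, future, negatives)   # aligned prediction
loss_bad = contrastive_loss(-future, future, negatives)   # misaligned prediction
```

Minimising this loss pushes the latent space toward information that distinguishes the true future from the negatives, which is how the objective stays tractable via negative sampling.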
Associative Compression Networks for Representation Learning
This paper introduces Associative Compression Networks (ACNs), a new
framework for variational autoencoding with neural networks. The system differs
from existing variational autoencoders (VAEs) in that the prior distribution
used to model each code is conditioned on a similar code from the dataset. In
compression terms this equates to sequentially transmitting the dataset using
an ordering determined by proximity in latent space. Since the prior need only
account for local, rather than global, variations in the latent space, the
coding cost is greatly reduced, leading to rich, informative codes. Crucially,
the codes remain informative when powerful, autoregressive decoders are used,
which we argue is fundamentally difficult with normal VAEs. Experimental
results on MNIST, CIFAR-10, ImageNet and CelebA show that ACNs discover
high-level latent features such as object class, writing style, pose and facial
expression, which can be used to cluster and classify the data, as well as to
generate diverse and convincing samples. We conclude that ACNs are a promising
new direction for representation learning: one that steps away from IID
modelling, and towards learning a structured description of the dataset as a
whole.

Comment: Revised to clarify the difference between the ACN and IID losses.
Pixel Recurrent Neural Networks
Modeling the distribution of natural images is a landmark problem in
unsupervised learning. This task requires an image model that is at once
expressive, tractable and scalable. We present a deep neural network that
sequentially predicts the pixels in an image along the two spatial dimensions.
Our method models the discrete probability of the raw pixel values and encodes
the complete set of dependencies in the image. Architectural novelties include
fast two-dimensional recurrent layers and an effective use of residual
connections in deep recurrent networks. We achieve log-likelihood scores on
natural images that are considerably better than the previous state of the art.
Our main results also provide benchmarks on the diverse ImageNet dataset.
Samples generated from the model appear crisp, varied, and globally coherent.
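Modelling "the discrete probability of the raw pixel values" means a 256-way softmax per pixel, with the image likelihood factorised by the chain rule over a fixed pixel ordering. A sketch of the resulting log-likelihood computation (here `logits` stands in for the network's per-pixel predictions, an assumption of ours):

```python
import numpy as np

def image_log_likelihood(pixels, logits):
    """Chain-rule log-likelihood of an image under a sequential pixel model.

    pixels : (H, W) integer array of raw pixel values in [0, 255]
    logits : (H, W, 256) per-pixel logits; entry (i, j) is assumed to be
             conditioned on all pixels preceding (i, j) in raster order
    """
    # Stable log-softmax over the 256 possible values of each pixel.
    m = logits.max(axis=-1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    H, W = pixels.shape
    return sum(log_probs[i, j, pixels[i, j]] for i in range(H) for j in range(W))

pixels = np.array([[0, 17], [255, 3]])
logits = np.zeros((2, 2, 256))       # a uniform model, for illustration
ll = image_log_likelihood(pixels, logits)
```

Under the uniform model each pixel contributes log(1/256), which is exactly the baseline the reported log-likelihood scores improve on.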
Multi-Format Contrastive Learning of Audio Representations
Recent advances suggest the advantage of multi-modal training in comparison
with single-modal methods. In contrast to this view, in our work we find that
similar gain can be obtained from training with different formats of a single
modality. In particular, we investigate the use of the contrastive learning
framework to learn audio representations by maximizing the agreement between
the raw audio and its spectral representation. We find a significant gain using
this multi-format strategy against the single-format counterparts. Moreover, on
the downstream AudioSet and ESC-50 classification tasks, our audio-only
approach achieves new state-of-the-art results, with a mean average precision
of 0.376 and an accuracy of 90.5%, respectively.
Count-Based Exploration with Neural Density Models
Bellemare et al. (2016) introduced the notion of a pseudo-count, derived from
a density model, to generalize count-based exploration to non-tabular
reinforcement learning. This pseudo-count was used to generate an exploration
bonus for a DQN agent and combined with a mixed Monte Carlo update was
sufficient to achieve state of the art on the Atari 2600 game Montezuma's
Revenge. We consider two questions left open by their work: First, how
important is the quality of the density model for exploration? Second, what
role does the Monte Carlo update play in exploration? We answer the first
question by demonstrating the use of PixelCNN, an advanced neural density model
for images, to supply a pseudo-count. In particular, we examine the intrinsic
difficulties in adapting Bellemare et al.'s approach when assumptions about the
model are violated. The result is a more practical and general algorithm
requiring no special apparatus. We combine PixelCNN pseudo-counts with
different agent architectures to dramatically improve the state of the art on
several hard Atari games. One surprising finding is that the mixed Monte Carlo
update is a powerful facilitator of exploration in the sparsest of settings,
including Montezuma's Revenge.
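The pseudo-count itself is derived from the density model's probability of a state before (ρ) and after (ρ′) updating on it. A minimal sketch of the formula and a count-based bonus (the bonus constants are illustrative, not the paper's settings):

```python
def pseudo_count(rho, rho_prime):
    """Pseudo-count N(x) = rho(x) * (1 - rho'(x)) / (rho'(x) - rho(x)),
    where rho is the density model's probability of x before updating on x
    and rho' is its probability afterwards (assumed rho' > rho)."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def exploration_bonus(n_hat, beta=0.05):
    """Count-based intrinsic reward; beta is a tuning constant."""
    return beta / (n_hat + 0.01) ** 0.5

# For the empirical frequency estimator the pseudo-count recovers the true
# count: seeing x in 2 of 10 steps gives rho = 2/10, and after one more
# observation of x, rho' = 3/11.
n_hat = pseudo_count(2 / 10, 3 / 11)
```

The point of the paper's first question is what happens when ρ comes from a strong neural model such as PixelCNN rather than this idealised empirical estimator, so the assumptions behind the formula (e.g. ρ′ > ρ) can be violated.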
Generating Diverse High-Fidelity Images with VQ-VAE-2
We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE)
models for large scale image generation. To this end, we scale and enhance the
autoregressive priors used in VQ-VAE to generate synthetic samples of much
higher coherence and fidelity than possible before. We use simple feed-forward
encoder and decoder networks, making our model an attractive candidate for
applications where the encoding and/or decoding speed is critical.
Additionally, VQ-VAE requires sampling an autoregressive model only in the
compressed latent space, which is an order of magnitude faster than sampling in
the pixel space, especially for large images. We demonstrate that a multi-scale
hierarchical organization of VQ-VAE, augmented with powerful priors over the
latent codes, is able to generate samples with quality that rivals that of
state-of-the-art Generative Adversarial Networks on multifaceted datasets such
as ImageNet, while not suffering from GANs' known shortcomings such as mode
collapse and lack of diversity.
On Variational Bounds of Mutual Information
Estimating and optimizing Mutual Information (MI) is core to many problems in
machine learning; however, bounding MI in high dimensions is challenging. To
establish tractable and scalable objectives, recent work has turned to
variational bounds parameterized by neural networks, but the relationships and
tradeoffs between these bounds remain unclear. In this work, we unify these
recent developments in a single framework. We find that the existing
variational lower bounds degrade when the MI is large, exhibiting either high
bias or high variance. To address this problem, we introduce a continuum of
lower bounds that encompasses previous bounds and flexibly trades off bias and
variance. On high-dimensional, controlled problems, we empirically characterize
the bias and variance of the bounds and their gradients and demonstrate the
effectiveness of our new bounds for estimation and representation learning.

Comment: ICML 201
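The high-bias failure mode at large MI is easy to see for the InfoNCE-style lower bound, whose estimate can never exceed the log of the batch size. A sketch under our own naming, where `scores[i, j]` plays the role of the critic value f(x_i, y_j):

```python
import numpy as np

def infonce_bound(scores):
    """InfoNCE-style lower bound on I(X; Y) from a (B, B) critic-score
    matrix whose diagonal holds the positive pairs. The estimate is capped
    at log(B), so it saturates when the true MI is large."""
    B = scores.shape[0]
    # Row-wise log-sum-exp, computed stably.
    m = scores.max(axis=1, keepdims=True)
    log_norm = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).ravel()
    return np.log(B) + np.mean(np.diag(scores) - log_norm)

rng = np.random.default_rng(2)
# Even with a near-perfect critic (huge diagonal scores), the estimate
# cannot rise above log(64) ~ 4.16 nats.
scores = rng.normal(size=(64, 64)) + 50.0 * np.eye(64)
estimate = infonce_bound(scores)
```

This cap is the bias side of the bias-variance tradeoff the paper's continuum of bounds is designed to navigate.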
Conditional Image Generation with PixelCNN Decoders
This work explores conditional image generation with a new image density
model based on the PixelCNN architecture. The model can be conditioned on any
vector, including descriptive labels or tags, or latent embeddings created by
other networks. When conditioned on class labels from the ImageNet database,
the model is able to generate diverse, realistic scenes representing distinct
animals, objects, landscapes and structures. When conditioned on an embedding
produced by a convolutional network given a single image of an unseen face, it
generates a variety of new portraits of the same person with different facial
expressions, poses and lighting conditions. We also show that conditional
PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally,
the gated convolutional layers in the proposed model improve the log-likelihood
of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet,
with greatly reduced computational cost.
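The gated convolutional layer combines a tanh "feature" path with a sigmoid "gate" path, each shifted by a projection of the conditioning vector h. A sketch with the convolution outputs and conditioning projections assumed precomputed (the names are ours):

```python
import numpy as np

def gated_unit(conv_f, conv_g, cond_f, cond_g):
    """Gated activation y = tanh(W_f * x + V_f h) * sigmoid(W_g * x + V_g h).

    conv_f, conv_g : (H, W, C) outputs of the two masked convolutions
    cond_f, cond_g : (C,) projections of the conditioning vector h,
                     broadcast over all spatial positions
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    return np.tanh(conv_f + cond_f) * sigmoid(conv_g + cond_g)

rng = np.random.default_rng(3)
conv_f, conv_g = rng.normal(size=(2, 4, 4, 8))   # two (H, W, C) feature maps
cond_f, cond_g = rng.normal(size=(2, 8))         # e.g. class-embedding shifts
out = gated_unit(conv_f, conv_g, cond_f, cond_g)
```

Because the conditioning enters as a per-channel shift, the same layer accepts any vector h: a class label embedding, or an embedding produced by another network, as in the portrait experiments.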
Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
This paper investigates recently proposed approaches for defending against
adversarial examples and evaluating adversarial robustness. We motivate
'adversarial risk' as an objective for achieving models robust to worst-case
inputs. We then frame commonly used attacks and evaluation metrics as defining
a tractable surrogate objective to the true adversarial risk. This suggests
that models may optimize this surrogate rather than the true adversarial risk.
We formalize this notion as 'obscurity to an adversary,' and develop tools and
heuristics for identifying obscured models and designing transparent models. We
demonstrate that this is a significant problem in practice by repurposing
gradient-free optimization techniques into adversarial attacks, which we use to
decrease the accuracy of several recently proposed defenses to near zero. We
hope that our formulations and results will help researchers develop more
powerful defenses.
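A representative gradient-free ingredient is a finite-difference gradient estimate from random sign perturbations, which needs only loss evaluations. This sketch is ours, in the spirit of such attacks rather than the paper's implementation:

```python
import numpy as np

def gradient_free_estimate(loss, x, c=0.01, n_samples=2000, seed=0):
    """Estimate the gradient of `loss` at x from function values alone,
    via symmetric random-sign perturbations (SPSA-style). An attacker can
    feed such estimates into a standard ascent loop even when a defense
    obscures the model's true gradients."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        delta = rng.choice([-1.0, 1.0], size=x.shape)
        grad += (loss(x + c * delta) - loss(x - c * delta)) / (2.0 * c) * delta
    return grad / n_samples

# Sanity check against a quadratic whose true gradient is 2x.
x = np.array([1.0, -2.0, 0.5])
g = gradient_free_estimate(lambda v: (v ** 2).sum(), x)
```

A model that is merely "obscured" keeps its worst-case inputs; it only hides the gradient signal that white-box attacks rely on, which estimators like this recover.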