Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling
The unconditional generation of high fidelity images is a longstanding
benchmark for testing the performance of image decoders. Autoregressive image
models have been able to generate small images unconditionally, but the
extension of these methods to large images where fidelity can be more readily
assessed has remained an open problem. Among the major challenges are the
capacity to encode the vast previous context and the sheer difficulty of
learning a distribution that preserves both global semantic coherence and
exactness of detail. To address the former challenge, we propose the Subscale
Pixel Network (SPN), a conditional decoder architecture that generates an image
as a sequence of sub-images of equal size. The SPN compactly captures
image-wide spatial dependencies and requires a fraction of the memory and the
computation required by other fully autoregressive models. To address the
latter challenge, we propose to use Multidimensional Upscaling to grow an image
in both size and depth via intermediate stages utilising distinct SPNs. We
evaluate SPNs on the unconditional generation of CelebAHQ of size 256 and of
ImageNet from size 32 to 256. We achieve state-of-the-art likelihood results in
multiple settings, set up new benchmark results in previously unexplored
settings and are able to generate very high fidelity large scale samples on the
basis of both datasets.
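As a rough illustration of the subscale ordering described above (a minimal sketch, not the authors' implementation), an N×N image can be divided into S×S equally sized sub-images by taking every S-th pixel at each of the S×S offsets, and the original image can be recovered by interleaving them back:

```python
import numpy as np

def subscale_slices(image, s):
    """Split an image into s*s equally sized sub-images by taking
    every s-th pixel at each of the s*s offsets (subscale ordering)."""
    h, w = image.shape[:2]
    assert h % s == 0 and w % s == 0
    # Sub-image (i, j) holds pixels at rows i, i+s, ... and cols j, j+s, ...
    return [image[i::s, j::s] for i in range(s) for j in range(s)]

def reassemble(subs, s):
    """Inverse operation: interleave the sub-images back into the full image."""
    h, w = subs[0].shape[:2]
    out = np.empty((h * s, w * s), dtype=subs[0].dtype)
    for k, sub in enumerate(subs):
        i, j = divmod(k, s)
        out[i::s, j::s] = sub
    return out
```

The SPN then models the sub-images autoregressively in this order, each conditioned on the previously generated ones; the slicing itself is the only part sketched here.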
MaCow: Masked Convolutional Generative Flow
Flow-based generative models, conceptually attractive due to tractability of
both the exact log-likelihood computation and latent-variable inference, and
efficiency of both training and sampling, have led to a number of impressive
empirical successes and spawned many advanced variants and theoretical
investigations. Despite their computational efficiency, the density estimation
performance of flow-based generative models significantly falls behind that of
state-of-the-art autoregressive models. In this work, we introduce masked
convolutional generative flow (MaCow), a simple yet effective architecture of
generative flow using masked convolution. By restricting the local connectivity
in a small kernel, MaCow enjoys the properties of fast and stable training, and
efficient sampling, while achieving significant improvements over Glow for
density estimation on standard image benchmarks, considerably narrowing the gap
to autoregressive models.
Comment: In Proceedings of the Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019).
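The masked convolution that restricts local connectivity can be illustrated by a binary mask over a small kernel (a sketch in the PixelCNN mask convention; MaCow's actual scheme, which combines masks in several directions, differs in detail):

```python
import numpy as np

def masked_conv_kernel(size, mask_type="A"):
    """Binary mask over a small 2D kernel: zero out weights at and after
    the centre (type 'A') or strictly after it (type 'B'), so each output
    depends only on pixels above / to the left of the current one."""
    mask = np.ones((size, size), dtype=np.float32)
    c = size // 2
    # Zero the rest of the centre row (including the centre for type 'A')
    mask[c, c + (1 if mask_type == "B" else 0):] = 0.0
    # Zero all rows below the centre
    mask[c + 1:, :] = 0.0
    return mask
```

Multiplying a convolution's weights by such a mask before applying it keeps the dependency structure triangular, which is what allows the flow to be inverted efficiently within a small kernel's receptive field.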
Hierarchical Autoregressive Image Models with Auxiliary Decoders
Autoregressive generative models of images tend to be biased towards
capturing local structure, and as a result they often produce samples which are
lacking in terms of large-scale coherence. To address this, we propose two
methods to learn discrete representations of images which abstract away local
detail. We show that autoregressive models conditioned on these representations
can produce high-fidelity reconstructions of images, and that we can train
autoregressive priors on these representations that produce samples with
large-scale coherence. We can recursively apply the learning procedure,
yielding a hierarchy of progressively more abstract image representations. We
train hierarchical class-conditional autoregressive models on the ImageNet
dataset and demonstrate that they are able to generate realistic images at
resolutions of 128×128 and 256×256 pixels. We also perform a
human evaluation study comparing our models with both adversarial and
likelihood-based state-of-the-art generative models.
Comment: Updated: added human evaluation results, incorporated review feedback.
Natural Image Manipulation for Autoregressive Models Using Fisher Scores
Deep autoregressive models are among the most powerful generative models that
exist today, achieving state-of-the-art bits per dim. However, they lie at a
strict disadvantage when it comes to controlled sample generation compared to
latent variable models. Latent variable models such as VAEs and normalizing
flows allow meaningful semantic manipulations in latent space, which
autoregressive models do not have. In this paper, we propose using Fisher
scores as a method to extract embeddings from an autoregressive model to use
for interpolation and show that our method provides more meaningful sample
manipulation compared to alternative embeddings such as network activations.
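The Fisher score of an observation is the gradient of its log-likelihood with respect to the model parameters. The paper computes this gradient through a deep autoregressive model; a univariate Gaussian keeps the idea self-contained in a short sketch:

```python
import numpy as np

def fisher_score_gaussian(x, mu, sigma):
    """Fisher score of a univariate Gaussian: the gradient of
    log p(x; mu, sigma) with respect to the parameters (mu, sigma).
    log p = -0.5*log(2*pi) - log(sigma) - (x - mu)**2 / (2*sigma**2)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu) ** 2 / sigma**3 - 1.0 / sigma
    return np.array([d_mu, d_sigma])
```

Treating these gradients as embeddings gives a vector space in which samples can be interpolated; the deep-model analogue replaces the two-parameter gradient with the (typically dimensionality-reduced) gradient over all network weights.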
Scaling Autoregressive Video Models
Due to the statistical complexity of video, the high degree of inherent
stochasticity, and the sheer amount of data, generating natural video remains a
challenging task. State-of-the-art video generation models often attempt to
address these issues by combining sometimes complex, usually video-specific
neural network architectures, latent variable models, adversarial training and
a range of other methods. Despite their often high complexity, these approaches
still fall short of generating high quality video continuations outside of
narrow domains and often struggle with fidelity. In contrast, we show that
conceptually simple autoregressive video generation models based on a
three-dimensional self-attention mechanism achieve competitive results across
multiple metrics on popular benchmark datasets, for which they produce
continuations of high fidelity and realism. We also present results from
training our models on Kinetics, a large scale action recognition dataset
comprised of YouTube videos exhibiting phenomena such as camera movement,
complex object interactions and diverse human movement. While modeling these
phenomena consistently remains elusive, we hope that our results, which include
occasional realistic continuations, encourage further research on comparatively
complex, large scale datasets such as Kinetics.
Comment: International Conference on Learning Representations (ICLR) 2020.
Generating Long Sequences with Sparse Transformers
Transformers are powerful sequence models, but require time and memory that
grow quadratically with the sequence length. In this paper we introduce sparse
factorizations of the attention matrix which reduce this to O(n√n). We
also introduce a) a variation on architecture and initialization to train
deeper networks, b) the recomputation of attention matrices to save memory, and
c) fast attention kernels for training. We call networks with these changes
Sparse Transformers, and show they can model sequences tens of thousands of
timesteps long using hundreds of layers. We use the same architecture to model
images, audio, and text from raw bytes, setting a new state of the art for
density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate
unconditional samples that demonstrate global coherence and great diversity,
and show it is possible in principle to use self-attention to model sequences
of length one million or more.
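One of the factorized patterns can be illustrated with the strided attention mask (a sketch of the pattern only, not the paper's fast kernels): each position attends to the previous few positions and to positions at a fixed stride back.

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean attention mask for a strided sparse pattern: position i may
    attend to position j <= i if j lies within the previous `stride` steps,
    or if (i - j) is a multiple of `stride`. With stride ~ sqrt(n) each row
    has O(sqrt(n)) ones, versus O(n) for dense causal attention."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            if i - j < stride or (i - j) % stride == 0:
                mask[i, j] = True
    return mask
```

In practice the pattern is split across attention heads and computed with custom kernels rather than a dense boolean mask; the mask above only shows which connections survive the factorization.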
Generative Model with Dynamic Linear Flow
Flow-based generative models are a family of exact log-likelihood models with
tractable sampling and latent-variable inference, hence conceptually attractive
for modeling complex distributions. However, flow-based models are limited by
density estimation performance issues as compared to state-of-the-art
autoregressive models. Autoregressive models, which also belong to the family
of likelihood-based methods, however, suffer from limited parallelizability. In
this paper, we propose Dynamic Linear Flow (DLF), a new family of invertible
transformations with partially autoregressive structure. Our method benefits
from the efficient computation of flow-based methods and high density
estimation performance of autoregressive methods. We demonstrate that the
proposed DLF yields state-of-the-art performance on ImageNet 32x32 and 64x64 out
of all flow-based methods, and is competitive with the best autoregressive
model. Additionally, our model converges 10 times faster than Glow (Kingma and
Dhariwal, 2018). The code is available at https://github.com/naturomics/DLF.
Comment: 12 pages, 7 figures.
Generating Diverse High-Fidelity Images with VQ-VAE-2
We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE)
models for large scale image generation. To this end, we scale and enhance the
autoregressive priors used in VQ-VAE to generate synthetic samples of much
higher coherence and fidelity than possible before. We use simple feed-forward
encoder and decoder networks, making our model an attractive candidate for
applications where the encoding and/or decoding speed is critical.
Additionally, VQ-VAE requires sampling an autoregressive model only in the
compressed latent space, which is an order of magnitude faster than sampling in
the pixel space, especially for large images. We demonstrate that a multi-scale
hierarchical organization of VQ-VAE, augmented with powerful priors over the
latent codes, is able to generate samples with quality that rivals that of
state of the art Generative Adversarial Networks on multifaceted datasets such
as ImageNet, while not suffering from GANs' known shortcomings such as mode
collapse and lack of diversity.
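The quantization step at the heart of VQ-VAE maps each encoder output vector to its nearest entry in a learned codebook; the autoregressive prior is then trained over the resulting discrete indices. A minimal sketch of the lookup (not the full model, which also learns the codebook and uses a straight-through gradient):

```python
import numpy as np

def quantize(z, codebook):
    """Vector quantization: map each encoder output vector in z (n, d)
    to the index of its nearest codebook entry (k, d) under L2 distance,
    and return both the indices and the quantized vectors."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, k)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]
```

Because sampling then happens over these compact index grids rather than raw pixels, generation is an order of magnitude faster, as the abstract notes.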
Classification Accuracy Score for Conditional Generative Models
Deep generative models (DGMs) of images are now sufficiently mature that they
produce nearly photorealistic samples and obtain scores similar to the data
distribution on heuristics such as Frechet Inception Distance (FID). These
results, especially on large-scale datasets such as ImageNet, suggest that DGMs
are learning the data distribution in a perceptually meaningful space and can
be used in downstream tasks. To test this latter hypothesis, we use
class-conditional generative models from a number of model
classes---variational autoencoders, autoregressive models, and generative
adversarial networks (GANs)---to infer the class labels of real data. We
perform this inference by training an image classifier using only synthetic
data and using the classifier to predict labels on real data. The performance
on this task, which we call Classification Accuracy Score (CAS), reveals some
surprising results not identified by traditional metrics; these results
constitute our contributions. First, when using a state-of-the-art GAN
(BigGAN-deep), Top-1 and Top-5 accuracy decrease by 27.9% and 41.6%,
respectively, compared to the
original data; and conditional generative models from other model classes, such
as Vector-Quantized Variational Autoencoder-2 (VQ-VAE-2) and Hierarchical
Autoregressive Models (HAMs), substantially outperform GANs on this benchmark.
Second, CAS automatically surfaces particular classes for which generative
models fail to capture the data distribution, failures that were previously
unreported in the literature. Third, we find traditional GAN metrics such as Inception Score
(IS) and FID neither predictive of CAS nor useful when evaluating non-GAN
models. Furthermore, in order to facilitate better diagnoses of generative
models, we open-source the proposed metric.
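The CAS protocol itself is simple to state: train a classifier on synthetic samples only, then measure its accuracy on real data. A minimal sketch, where `sample_fn`, `train_classifier`, and `accuracy` are hypothetical helpers standing in for a conditional sampler, a classifier training routine, and an evaluation loop:

```python
def classification_accuracy_score(sample_fn, real_images, real_labels,
                                  classes, n_per_class,
                                  train_classifier, accuracy):
    """Train a classifier purely on class-conditional synthetic samples,
    then report its accuracy on held-out real data. All helper callables
    are placeholders for the actual sampler / classifier / evaluator."""
    synth_x, synth_y = [], []
    for c in classes:
        xs = sample_fn(c, n_per_class)   # draw samples conditioned on class c
        synth_x.extend(xs)
        synth_y.extend([c] * n_per_class)
    clf = train_classifier(synth_x, synth_y)
    return accuracy(clf, real_images, real_labels)
```

A generative model that has captured the class-conditional distributions well will yield a classifier whose real-data accuracy approaches one trained on real data, which is what the score measures.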
Latent Video Transformer
The video generation task can be formulated as a prediction of future video
frames given some past frames. Recent generative models for videos face the
problem of high computational requirements. Some models require up to 512
Tensor Processing Units for parallel training. In this work, we address this
problem via modeling the dynamics in a latent space. After the transformation
of frames into the latent space, our model predicts latent representations for
the next frames in an autoregressive manner. We demonstrate the performance of
our approach on the BAIR Robot Pushing and Kinetics-600 datasets. The approach
reduces the hardware requirement to 8 Graphics Processing Units for training
while maintaining comparable generation quality.