Large Scale GAN Training for High Fidelity Natural Image Synthesis
Despite recent progress in generative image modeling, successfully generating
high-resolution, diverse samples from complex datasets such as ImageNet remains
an elusive goal. To this end, we train Generative Adversarial Networks at the
largest scale yet attempted, and study the instabilities specific to such
scale. We find that applying orthogonal regularization to the generator renders
it amenable to a simple "truncation trick," allowing fine control over the
trade-off between sample fidelity and variety by reducing the variance of the
Generator's input. Our modifications lead to models which set the new state of
the art in class-conditional image synthesis. When trained on ImageNet at
128x128 resolution, our models (BigGANs) achieve an Inception Score (IS) of
166.5 and Frechet Inception Distance (FID) of 7.4, improving over the previous
best IS of 52.52 and FID of 18.65.
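At inference time, the truncation trick amounts to drawing the latent vector from a truncated normal rather than from the full prior. A minimal numpy/scipy sketch; the `generator` in the usage comment is a placeholder for a trained model:

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_z(batch_size: int, dim: int, threshold: float) -> np.ndarray:
    """Sample latents from a standard normal truncated to [-threshold, threshold].

    Lower thresholds reduce the variance of the Generator's input, trading
    sample variety for fidelity as described above.
    """
    return truncnorm.rvs(-threshold, threshold, size=(batch_size, dim))

# Usage with a hypothetical trained model:
# samples = generator(truncated_z(batch_size=8, dim=128, threshold=0.5))
```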
High-Fidelity Image Generation With Fewer Labels
Deep generative models are becoming a cornerstone of modern machine learning.
Recent work on conditional generative adversarial networks has shown that
learning complex, high-dimensional distributions over natural images is within
reach. While the latest models are able to generate high-fidelity, diverse
natural images at high resolution, they rely on a vast quantity of labeled
data. In this work we demonstrate how one can benefit from recent work on self-
and semi-supervised learning to outperform the state of the art on both
unsupervised ImageNet synthesis, as well as in the conditional setting. In
particular, the proposed approach is able to match the sample quality (as
measured by FID) of the current state-of-the-art conditional model BigGAN on
ImageNet using only 10% of the labels and outperform it using 20% of the
labels.

Comment: Mario Lucic, Michael Tschannen, and Marvin Ritter contributed equally to this work. ICML 2019 camera-ready version. Code available at https://github.com/google/compare_ga
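One self-supervision signal used in this line of work is rotation prediction: an auxiliary head classifies which of four rotations was applied to each image, so the network learns semantic features without labels. A hedged PyTorch sketch; `features` and `rot_head` are hypothetical stand-ins for a discriminator backbone and a 4-way linear classifier:

```python
import torch
import torch.nn.functional as F

def rotation_ssl_loss(images, features, rot_head):
    """Auxiliary self-supervised loss: predict the rotation applied to each image.

    images: (N, C, H, W) batch; features: backbone mapping images to vectors;
    rot_head: 4-way linear classifier. Both modules are hypothetical names.
    """
    rotated, labels = [], []
    for k in range(4):  # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    logits = rot_head(features(torch.cat(rotated)))
    return F.cross_entropy(logits, torch.cat(labels))
```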
GANSynth: Adversarial Neural Audio Synthesis
Efficient audio synthesis is an inherently difficult machine learning task,
as human perception is sensitive to both global structure and fine-scale
waveform coherence. Autoregressive models, such as WaveNet, model local
structure at the expense of global latent structure and slow iterative
sampling, while Generative Adversarial Networks (GANs) have global latent
conditioning and efficient parallel sampling, but struggle to generate
locally-coherent audio waveforms. Herein, we demonstrate that GANs can in fact
generate high-fidelity and locally-coherent audio by modeling log magnitudes
and instantaneous frequencies with sufficient frequency resolution in the
spectral domain. Through extensive empirical investigations on the NSynth
dataset, we demonstrate that GANs are able to outperform strong WaveNet
baselines on automated and human evaluation metrics, and efficiently generate
audio several orders of magnitude faster than their autoregressive
counterparts.

Comment: Colab Notebook: http://goo.gl/magenta/gansynth-dem
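The spectral representation described here can be computed directly from a short-time Fourier transform: take the log magnitude, unwrap the phase along the time axis, and finite-difference it to obtain instantaneous frequency. A simplified numpy/scipy sketch; the actual GANSynth pipeline uses different STFT parameters:

```python
import numpy as np
from scipy.signal import stft

def logmag_and_if(audio, fs=16000, nperseg=1024):
    """Waveform -> (log magnitude, instantaneous frequency) spectral images.

    Note inst_freq has one fewer time frame than log_mag due to np.diff.
    """
    _, _, spec = stft(audio, fs=fs, nperseg=nperseg)
    log_mag = np.log(np.abs(spec) + 1e-6)        # avoid log(0)
    phase = np.unwrap(np.angle(spec), axis=-1)   # unwrap phase along time
    inst_freq = np.diff(phase, axis=-1)          # finite difference in time
    return log_mag, inst_freq
```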
Generative Adversarial Network in Medical Imaging: A Review
Generative adversarial networks have gained a lot of attention in the
computer vision community due to their capability of data generation without
explicitly modelling the probability density function. The adversarial loss
brought by the discriminator provides a clever way of incorporating unlabeled
samples into training and imposing higher order consistency. This has proven to
be useful in many cases, such as domain adaptation, data augmentation, and
image-to-image translation. These properties have attracted researchers in the
medical imaging community, and we have seen rapid adoption in many traditional
and novel applications, such as image reconstruction, segmentation, detection,
classification, and cross-modality synthesis. Based on our observations, this
trend will continue and we therefore conducted a review of recent advances in
medical imaging using the adversarial training scheme with the hope of
benefiting researchers interested in this technique.

Comment: 24 pages; v4; added missing references from before Jan 1st 2019; accepted to MedIA
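For reference, the adversarial training scheme the surveyed works build on is the minimax objective of Goodfellow et al., where the discriminator $D$ learns to separate real samples from generated ones while the generator $G$ learns to fool it:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$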
High Fidelity Face Manipulation with Extreme Poses and Expressions
Face manipulation has shown remarkable advances with the flourish of
Generative Adversarial Networks. However, due to the difficulties of
controlling structures and textures, it is challenging to model poses and
expressions simultaneously, especially for extreme manipulation at high
resolution. In this paper, we propose a novel framework that simplifies
face manipulation into two correlated stages: a boundary prediction stage and a
disentangled face synthesis stage. The first stage models poses and expressions
jointly via boundary images. Specifically, a conditional encoder-decoder
network is employed to predict the boundary image of the target face in a
semi-supervised way. Pose and expression estimators are introduced to improve
the prediction performance. In the second stage, the predicted boundary image
and the input face image are encoded into the structure and the texture latent
space by two encoder networks, respectively. A proxy network and a feature
threshold loss are further imposed to disentangle the latent space.
Furthermore, due to the lack of high-resolution face manipulation databases to
verify the effectiveness of our method, we collect a new high-quality
Multi-View Face (MVF-HQ) database. It contains 120,283 images at 6000x4000
resolution from 479 identities with diverse poses, expressions, and
illuminations. MVF-HQ is much larger in scale and much higher in resolution
than publicly available high-resolution face manipulation databases. We will
release MVF-HQ soon to push forward the advance of face manipulation.
Qualitative and quantitative experiments on four databases show that our method
dramatically improves the synthesis quality.

Comment: Accepted by IEEE Transactions on Information Forensics and Security (TIFS)
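The two-stage design can be pictured as the data flow below. This is a toy PyTorch sketch with single-convolution stand-ins; all module names are hypothetical and the paper's actual networks are full conditional encoder-decoders:

```python
import torch
import torch.nn as nn

# Hypothetical single-layer stand-ins for the paper's networks.
boundary_net  = nn.Conv2d(3, 1, 3, padding=1)   # stage 1: face -> boundary image
structure_enc = nn.Conv2d(1, 8, 3, padding=1)   # stage 2: boundary -> structure code
texture_enc   = nn.Conv2d(3, 8, 3, padding=1)   # stage 2: face -> texture code
decoder       = nn.Conv2d(16, 3, 3, padding=1)  # fused codes -> manipulated face

face = torch.randn(1, 3, 64, 64)                # toy input
boundary = boundary_net(face)                   # predict target boundary image
codes = torch.cat([structure_enc(boundary), texture_enc(face)], dim=1)
manipulated = decoder(codes)                    # disentangled synthesis
```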
Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks
In this paper we introduce a generative parametric model capable of producing
high quality samples of natural images. Our approach uses a cascade of
convolutional networks within a Laplacian pyramid framework to generate images
in a coarse-to-fine fashion. At each level of the pyramid, a separate
generative convnet model is trained using the Generative Adversarial Nets (GAN)
approach (Goodfellow et al.). Samples drawn from our model are of significantly
higher quality than alternate approaches. In a quantitative assessment by human
evaluators, our CIFAR10 samples were mistaken for real images around 40% of the
time, compared to 10% for samples drawn from a GAN baseline model. We also show
samples from models trained on the higher resolution images of the LSUN scene
dataset.
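The pyramid itself takes only a few lines of numpy/scipy; here `zoom` stands in for the blur-and-subsample operators, and the sketch assumes a square grayscale image with power-of-two side length:

```python
import numpy as np
from scipy.ndimage import zoom

def laplacian_pyramid(img, levels=3):
    """Decompose an image into band-pass residuals plus a coarse base."""
    pyramid = []
    for _ in range(levels):
        low = zoom(zoom(img, 0.5), 2.0)   # downsample, then upsample back
        pyramid.append(img - low)         # residual (high-frequency band)
        img = zoom(img, 0.5)              # recurse on the coarse image
    pyramid.append(img)                   # coarsest level
    return pyramid
```

Sampling runs this in reverse: starting from a coarse generated image, each level upsamples it and adds a residual produced by that level's conditional GAN.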
Pros and Cons of GAN Evaluation Measures
Generative models, in particular generative adversarial networks (GANs), have
received significant attention recently. A number of GAN variants have been
proposed and have been utilized in many applications. Despite large strides in
terms of theoretical progress, evaluating and comparing GANs remains a daunting
task. While several measures have been introduced, as of yet, there is no
consensus as to which measure best captures strengths and limitations of models
and should be used for fair model comparison. As in other areas of computer
vision and machine learning, it is critical to settle on one or a few good
measures to steer the progress in this field. In this paper, I review and
critically discuss more than 24 quantitative and 5 qualitative measures for
evaluating generative models with a particular emphasis on GAN-derived models.
I also provide a set of 7 desiderata followed by an evaluation of whether a
given measure or a family of measures is compatible with them.
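Among the measures this literature discusses, FID is one of the most widely reported: the Fréchet distance between Gaussians fitted to Inception activations of real and generated images. A minimal numpy/scipy sketch, assuming the means and covariances have already been estimated (e.g., with np.mean and np.cov):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians (mu, sigma)."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):      # sqrtm can return tiny imaginary parts
        covmean = covmean.real        # from numerical noise; drop them
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```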
Decompose to manipulate: Manipulable Object Synthesis in 3D Medical Images with Structured Image Decomposition
The performance of medical image analysis systems is constrained by the
quantity of high-quality image annotations. Such systems require data to be
annotated by experts with years of training, especially when diagnostic
decisions are involved. Such datasets are thus hard to scale up. In this
context, it is hard for supervised learning systems to generalize to the cases
that are rare in the training set but would be present in real-world clinical
practice. We believe that the synthetic image samples generated by a system
trained on real data can be useful for improving supervised learning
tasks in medical image analysis applications. Allowing the image synthesis
to be manipulable could help synthetic images provide complementary information
to the training data rather than simply duplicating the real-data manifold. In
this paper, we propose a framework for synthesizing 3D objects, such as
pulmonary nodules, in 3D medical images with manipulable properties. The
manipulation is enabled by decomposing the object of interest into its
segmentation mask and a 1D vector containing the residual information. The
synthetic object is refined and blended into the image context with two
adversarial discriminators. We evaluate the proposed framework on lung nodules
in 3D chest CT images and show that the proposed framework could generate
realistic nodules with manipulable shapes, textures, and locations. By
sampling from both the synthetic nodules and the real nodules from 2800 3D CT
volumes during classifier training, we show that the synthetic patches
improve overall nodule detection performance by an average of 8.44% in
competition performance metric (CPM) score.
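The decomposition might be pictured as below. This is a heavily simplified PyTorch toy with hypothetical single-layer stand-ins; in particular, how the residual vector is computed here is an assumption, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: an object patch is split into a segmentation mask
# and a 1D residual vector, then resynthesized from the (manipulable) pair.
mask_encoder = nn.Conv3d(1, 1, 3, padding=1)                # patch -> soft mask
residual_enc = nn.Sequential(nn.Flatten(), nn.Linear(16**3, 32))
generator = nn.Linear(32 + 16**3, 16**3)                    # (vector, mask) -> patch

patch = torch.randn(1, 1, 16, 16, 16)                       # toy 3D nodule patch
mask = torch.sigmoid(mask_encoder(patch))
residual = residual_enc(patch - patch * mask)               # assumed residual input
recon = generator(torch.cat([residual, mask.flatten(1)], dim=1))
```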
Semantic Image Synthesis with Spatially-Adaptive Normalization
We propose spatially-adaptive normalization, a simple but effective layer for
synthesizing photorealistic images given an input semantic layout. Previous
methods directly feed the semantic layout as input to the deep network, which
is then processed through stacks of convolution, normalization, and
nonlinearity layers. We show that this is suboptimal as the normalization
layers tend to "wash away" semantic information. To address the issue, we
propose using the input layout for modulating the activations in normalization
layers through a spatially-adaptive, learned transformation. Experiments on
several challenging datasets demonstrate the advantage of the proposed method
over existing approaches, regarding both visual fidelity and alignment with
input layouts. Finally, our model allows user control over both semantics and
style. Code is available at https://github.com/NVlabs/SPADE.

Comment: Accepted as a CVPR 2019 oral paper
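The layer itself is compact; a simplified PyTorch sketch of the idea follows (the published SPADE layer additionally passes the layout through a shared hidden convolution and uses batch normalization, so this is not the official implementation):

```python
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveNorm(nn.Module):
    """Normalize activations, then modulate them with a per-pixel scale and
    bias predicted from the semantic layout."""

    def __init__(self, channels, label_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(label_channels, channels, 3, padding=1)
        self.beta = nn.Conv2d(label_channels, channels, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the layout to the activation's spatial size, then modulate.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        return self.norm(x) * (1 + self.gamma(segmap)) + self.beta(segmap)
```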
Identity Preserving Face Completion for Large Ocular Region Occlusion
We present a novel deep learning approach to synthesize complete face images
in the presence of large ocular region occlusions. This is motivated by the
recent surge of VR/AR displays that hinder face-to-face communication. Different from
the state-of-the-art face inpainting methods that have no control over the
synthesized content and can only handle frontal face pose, our approach can
faithfully recover the missing content under various head poses while
preserving the identity. At the core of our method is a novel generative
network with dedicated constraints to regularize the synthesis process. To
preserve the identity, our network takes an arbitrary occlusion-free image of
the target identity to infer the missing content, and uses its high-level CNN
features as an identity prior to regularize the search space of the generator.
Since the input reference image may have a different pose, a pose map and a
novel pose discriminator are further adopted to supervise the learning of
implicit pose transformations. Our method is capable of generating coherent
facial inpainting with consistent identity over videos with large variations
in head motion. Experiments on both synthesized and real data demonstrate that
our method greatly outperforms the state-of-the-art methods in terms of both
synthesis quality and robustness.

Comment: 12 pages, 9 figures
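The identity prior can be sketched as a feature-matching penalty between the completed face and the occlusion-free reference. Here `feat_extractor` is a placeholder for a pretrained face-recognition network, and the L1 form is an assumption rather than the paper's exact loss:

```python
import torch.nn.functional as F

def identity_prior_loss(feat_extractor, completed, reference):
    """Distance between high-level CNN features of the completed face and an
    occlusion-free reference image of the same identity."""
    return F.l1_loss(feat_extractor(completed), feat_extractor(reference))
```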