Towards Diverse and Natural Image Descriptions via a Conditional GAN
Despite substantial progress in recent years, image captioning techniques are
still far from perfect. Sentences produced by existing methods, e.g. those
based on RNNs, are often overly rigid and lacking in
variability. This issue is related to a learning principle widely used in
practice, that is, to maximize the likelihood of training samples. This
principle encourages high resemblance to the "ground-truth" captions while
suppressing other reasonable descriptions. Conventional evaluation metrics,
e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we
explore an alternative approach, aiming to improve naturalness and
diversity -- two essential properties of human expression. Specifically, we
propose a new framework based on Conditional Generative Adversarial Networks
(CGAN), which jointly learns a generator to produce descriptions conditioned on
images and an evaluator to assess how well a description fits the visual
content. It is noteworthy that training a sequence generator is nontrivial. We
overcome the difficulty by Policy Gradient, a strategy stemming from
Reinforcement Learning, which allows the generator to receive early feedback
along the way. We tested our method on two large datasets, where it performed
competitively against real people in our user study and outperformed other
methods on various tasks.
Comment: Accepted in ICCV 2017 as an oral paper.
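The core training trick here is to treat the caption generator as a policy and the evaluator's score as a reward. Below is a minimal REINFORCE-style sketch of that idea in PyTorch; for simplicity it uses a single terminal reward (the paper's Policy Gradient setup provides feedback at intermediate steps as well), and the `generator` and `evaluator` interfaces are hypothetical placeholders, not the authors' code.

```python
# REINFORCE-style update for a conditional caption generator, assuming a
# hypothetical RNN `generator` (conditioned on image features) and an
# `evaluator` that scores how well a caption fits an image.
import torch

def policy_gradient_step(generator, evaluator, images, optimizer, max_len=20):
    log_probs, tokens = [], []
    state = generator.init_state(images)               # condition on the image
    inp = generator.start_token(images.size(0))
    for _ in range(max_len):
        logits, state = generator.step(inp, state)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                            # sample, not argmax
        log_probs.append(dist.log_prob(tok))
        tokens.append(tok)
        inp = tok
    captions = torch.stack(tokens, dim=1)              # (batch, max_len)
    with torch.no_grad():
        reward = evaluator(images, captions)           # fit-to-image score
    # Push up the log-probability of sampled tokens in proportion to reward.
    loss = -(torch.stack(log_probs, dim=1).sum(dim=1) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Sampling rather than greedy decoding is what lets the gradient explore diverse captions; the evaluator's score replaces the usual likelihood target that pulls generations toward the ground-truth captions.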
Semi-supervised FusedGAN for Conditional Image Generation
We present FusedGAN, a deep network for conditional image synthesis with
controllable sampling of diverse images. Fidelity, diversity and controllable
sampling are the main quality measures of a good image generation model. Most
existing models fall short in at least one of these aspects. The FusedGAN can perform
controllable sampling of diverse images with very high fidelity. We argue that
controllability can be achieved by disentangling the generation process into
various stages. In contrast to stacked GANs, where multiple stages of GANs are
trained separately with full supervision of labeled intermediate images, the
FusedGAN has a single-stage pipeline with built-in stacking of GANs. Unlike
existing methods, which require full supervision with paired conditions and
images, the FusedGAN can effectively leverage more abundant images without
corresponding conditions in training, to produce more diverse samples with high
fidelity. We achieve this by fusing two generators: one for unconditional image
generation, and the other for conditional image generation, where the two
partly share a common latent space thereby disentangling the generation. We
demonstrate the efficacy of the FusedGAN on fine-grained image generation
tasks such as text-to-image and attribute-to-face generation.
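As a rough illustration of the fusing idea, the sketch below splits generation into an unconditional stage that maps noise to a shared intermediate feature and a conditional stage that decodes that feature together with a condition vector (e.g. a text or attribute embedding). All module names, shapes, and layer choices are assumptions for illustration, not the paper's architecture.

```python
# Two fused generator stages sharing one latent structure (PyTorch sketch).
import torch
import torch.nn as nn

class UnconditionalStage(nn.Module):
    """Noise -> shared 8x8 'structure' feature map."""
    def __init__(self, z_dim=100, feat_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, feat_ch, 8),     # 1x1 -> 8x8
            nn.BatchNorm2d(feat_ch), nn.ReLU(True))
    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

class ConditionalStage(nn.Module):
    """Shared feature + condition vector -> 16x16 RGB image."""
    def __init__(self, feat_ch=256, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + cond_dim, 128, 4, 2, 1),  # 8 -> 16
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.Conv2d(128, 3, 3, padding=1), nn.Tanh())
    def forward(self, feat, cond):
        # Broadcast the condition over the spatial grid, then concatenate.
        c = cond[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        return self.net(torch.cat([feat, c], dim=1))
```

Unpaired images can train the unconditional stage alone, while paired data trains both stages end to end, which is how the shared latent space lets the model exploit images that lack corresponding conditions.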
Generative Adversarial Text to Image Synthesis
Automatic synthesis of realistic images from text would be interesting and
useful, but current AI systems are still far from this goal. However, in recent
years generic and powerful recurrent neural network architectures have been
developed to learn discriminative text feature representations. Meanwhile, deep
convolutional generative adversarial networks (GANs) have begun to generate
highly compelling images of specific categories, such as faces, album covers,
and room interiors. In this work, we develop a novel deep architecture and GAN
formulation to effectively bridge these advances in text and image modeling,
translating visual concepts from characters to pixels. We demonstrate the
capability of our model to generate plausible images of birds and flowers from
detailed text descriptions.
Comment: ICML 2016.
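A minimal sketch of the conditioning mechanism in the spirit of this paper: compress the text embedding, concatenate it with the noise vector, and upsample to pixels with a DCGAN-style decoder. The dimensions and layers below are illustrative assumptions, not the published model.

```python
# Text-conditioned DCGAN-style generator (PyTorch sketch).
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    def __init__(self, z_dim=100, txt_dim=1024, proj_dim=128):
        super().__init__()
        # Compress the pretrained text embedding to a small conditioning code.
        self.project = nn.Sequential(
            nn.Linear(txt_dim, proj_dim), nn.LeakyReLU(0.2, True))
        self.deconv = nn.Sequential(                       # (z + text) -> 64x64 RGB
            nn.ConvTranspose2d(z_dim + proj_dim, 512, 4),  # 1x1   -> 4x4
            nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1),         # 4x4   -> 8x8
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),         # 8x8   -> 16x16
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),          # 16x16 -> 32x32
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh()) # 32x32 -> 64x64
    def forward(self, z, text_embedding):
        zt = torch.cat([z, self.project(text_embedding)], dim=1)
        return self.deconv(zt[:, :, None, None])

# e.g. TextConditionedGenerator()(torch.randn(4, 100), torch.randn(4, 1024))
```

In the paper's setup the discriminator also sees the text, so it can penalize images that are realistic but mismatched to the description.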
Channel-Recurrent Autoencoding for Image Modeling
Despite recent successes in synthesizing faces and bedrooms, existing
generative models struggle to capture more complex image types, potentially due
to the oversimplification of their latent space constructions. To tackle this
issue, building on Variational Autoencoders (VAEs), we integrate recurrent
connections across channels to both inference and generation steps, allowing
the high-level features to be captured in global-to-local, coarse-to-fine
manners. Combined with adversarial loss, our channel-recurrent VAE-GAN
(crVAE-GAN) outperforms VAE-GAN in generating a diverse spectrum of high
resolution images while maintaining the same level of computational efficiency.
Our model produces interpretable and expressive latent representations to
benefit downstream tasks such as image completion. Moreover, we propose two
novel regularizations to enhance training: a KL-objective weighting scheme
over time steps, and mutual information maximization between transformed
latent variables and the outputs.
Comment: Code: https://github.com/WendyShang/crVAE. Supplementary materials:
http://www-personal.umich.edu/~shangw/wacv18_supplementary_material.pdf
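To make the channel recurrence concrete, the sketch below treats blocks of channels in a feature map as time steps of an LSTM, so later channel blocks are computed conditioned on earlier ones, giving the global-to-local, coarse-to-fine behavior the abstract describes. The block size and shapes are assumptions, not the paper's exact layer.

```python
# Channel-recurrent layer: LSTM across channel blocks (PyTorch sketch).
import torch
import torch.nn as nn

class ChannelRecurrentBlock(nn.Module):
    def __init__(self, channels=256, block=32, spatial=8):
        super().__init__()
        assert channels % block == 0
        self.block = block
        feat = block * spatial * spatial          # one flattened channel block
        self.lstm = nn.LSTM(feat, feat, batch_first=True)
    def forward(self, x):                         # x: (B, C, H, W), H = W = spatial
        b, c, h, w = x.shape
        # Channel blocks become the sequence dimension: (B, C//block, block*H*W).
        seq = x.view(b, c // self.block, self.block * h * w)
        out, _ = self.lstm(seq)
        return out.view(b, c, h, w)
```

In a crVAE this kind of layer would appear in both the inference and generation paths; the per-time-step KL weighting mentioned above would then weight the KL term of each channel block separately.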