736 research outputs found
Improving Image Captioning with Conditional Generative Adversarial Nets
In this paper, we propose a novel
conditional-generative-adversarial-nets-based image captioning framework as an
extension of the traditional reinforcement-learning (RL)-based encoder-decoder
architecture. To deal with the inconsistent evaluation problem among different
objective language metrics, we design "discriminator" networks that
automatically and progressively determine whether a generated caption is
human-described or machine-generated. Two kinds of discriminator
architectures (CNN and RNN-based structures) are introduced since each has its
own advantages. The proposed algorithm is generic so that it can enhance any
existing RL-based image captioning framework, and we show that the conventional
RL training method is just a special case of our approach. Empirically, we show
consistent improvements over all language evaluation metrics for different
state-of-the-art image captioning models. In addition, the well-trained
discriminators can also be viewed as objective image captioning evaluators.
Comment: 12 pages; 33 figures; 36 references; accepted by AAAI 2019
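As a rough illustration of the idea (not the authors' code), the sketch below mixes a language-metric reward with a discriminator score inside a self-critical policy-gradient loss; the function names, the CIDEr-style metric input, and the weight lam are assumptions for illustration.

import torch

def mixed_reward(metric_score, disc_prob, lam=0.5):
    # metric_score: language-metric reward (e.g. CIDEr) for each sampled caption [batch]
    # disc_prob: discriminator probability that the caption is human-written [batch]
    # lam = 1 recovers conventional metric-only RL training as a special case
    return lam * metric_score + (1.0 - lam) * disc_prob

def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    # sample_log_probs: summed log-probabilities of the sampled caption [batch]
    # greedy_reward: reward of the greedily decoded caption, used as a baseline
    advantage = (sample_reward - greedy_reward).detach()
    return -(advantage * sample_log_probs).mean()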
Towards Diverse and Natural Image Descriptions via a Conditional GAN
Despite the substantial progress in recent years, image captioning
techniques are still far from perfect. Sentences produced by existing
methods, e.g. those based on RNNs, are often overly rigid and lacking in
variability. This issue is related to a learning principle widely used in
practice, that is, to maximize the likelihood of training samples. This
principle encourages high resemblance to the "ground-truth" captions while
suppressing other reasonable descriptions. Conventional evaluation metrics,
e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we
explore an alternative approach, with the aim of improving the naturalness and
diversity -- two essential properties of human expression. Specifically, we
propose a new framework based on Conditional Generative Adversarial Networks
(CGAN), which jointly learns a generator to produce descriptions conditioned on
images and an evaluator to assess how well a description fits the visual
content. It is noteworthy that training a sequence generator is nontrivial. We
overcome the difficulty with Policy Gradient, a strategy stemming from
Reinforcement Learning, which allows the generator to receive early feedback
along the way. We tested our method on two large datasets, where it performed
competitively against real people in our user study and outperformed other
methods on various tasks.
Comment: Accepted in ICCV 2017 as an oral paper
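As a hedged sketch of the early-feedback idea (names such as sample_completion are hypothetical, not the paper's API), a partial caption can be scored by completing it several times and averaging the evaluator's judgments of the completed sentences:

import torch

def rollout_feedback(generator, evaluator, image, partial_tokens, max_len, n_rollouts=4):
    # Monte Carlo rollouts: finish the partial caption several times by sampling,
    # then let the conditional evaluator score each completed sentence.
    scores = []
    for _ in range(n_rollouts):
        full_caption = generator.sample_completion(image, partial_tokens, max_len)
        scores.append(evaluator(image, full_caption))
    # The averaged score serves as the per-step reward in the policy-gradient update.
    return torch.stack(scores).mean(dim=0)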
C4Synth: Cross-Caption Cycle-Consistent Text-to-Image Synthesis
Generating an image from its description is a challenging task worth solving
because of its numerous practical applications ranging from image editing to
virtual reality. All existing methods use a single caption to generate a
plausible image. A single caption by itself can be limited and may not
capture the variety of concepts and behaviors that may be present in the
image. We propose two deep generative models that generate an image by making
use of multiple captions describing it. This is achieved by ensuring
'Cross-Caption Cycle Consistency' between the multiple captions and the
generated image(s). We report quantitative and qualitative results on the
standard Caltech-UCSD Birds (CUB) and Oxford-102 Flowers datasets to validate
the efficacy of the proposed approach.
Comment: To appear in the proceedings of the IEEE Winter Conference on
Applications of Computer Vision (WACV), 2019
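A minimal sketch of what a cross-caption cycle-consistency objective could look like, assuming a text-to-image generator G, an image captioner C, and a sentence embedder embed; all names are illustrative rather than the paper's modules:

import torch.nn.functional as F

def cross_caption_cycle_loss(G, C, embed, captions):
    # captions: several ground-truth descriptions of the same image
    loss = 0.0
    for i, cap in enumerate(captions):
        generated_image = G(embed(cap))                # synthesize an image from caption i
        recovered_caption = C(generated_image)         # describe the synthesized image
        next_cap = captions[(i + 1) % len(captions)]   # cycle to the next caption
        loss = loss + (1.0 - F.cosine_similarity(
            embed(recovered_caption), embed(next_cap), dim=-1)).mean()
    return loss / len(captions)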
Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks
Given a photo taken outside, can we predict the immediate future, e.g., how
the clouds will move in the sky? We address this problem by presenting a generative
adversarial network (GAN) based two-stage approach to generating realistic
time-lapse videos of high resolution. Given the first frame, our model learns
to generate long-term future frames. The first stage generates frames with
realistic content. The second stage refines the generated video
from the first stage by enforcing it to be closer to real videos with regard to
motion dynamics. To further encourage vivid motion in the final generated
video, a Gram matrix is employed to model the motion more precisely. We build a
large scale time-lapse dataset, and test our approach on this new dataset.
Using our model, we are able to generate realistic high-resolution videos of
32 frames. Quantitative and qualitative experimental results demonstrate the
superiority of our model over state-of-the-art models.
Comment: To appear in Proceedings of CVPR 2018
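As a rough sketch of how a Gram matrix can summarize motion statistics (the tensor layout and loss here are assumptions, not the paper's exact formulation):

import torch
import torch.nn.functional as F

def gram_matrix(feats):
    # feats: per-frame feature maps of shape [batch, frames, channels, H, W];
    # correlations across the stacked frame/channel axis capture temporal dynamics.
    b, t, c, h, w = feats.shape
    f = feats.reshape(b, t * c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (t * c * h * w)

def motion_gram_loss(fake_feats, real_feats):
    # Push the generated video's motion statistics toward those of real videos.
    return F.mse_loss(gram_matrix(fake_feats), gram_matrix(real_feats))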
Semantically Invariant Text-to-Image Generation
Work on image captioning has demonstrated models capable of generating
plausible text given input images or videos. Further, recent work in image
generation has shown significant improvements in image quality when text is
used as a prior. Our work ties these concepts together by creating an
architecture that can enable bidirectional generation of images and text. We
call this network Multi-Modal Vector Representation (MMVR). Along with MMVR, we
propose two improvements to text-conditioned image generation. First, an
n-gram-metric-based cost function is introduced that generalizes the caption
with respect to the image. Second, multiple semantically similar sentences
are shown to help generate better images. Qualitative and quantitative
evaluations demonstrate that MMVR improves upon existing text conditioned image
generation results by over 20%, while integrating visual and text modalities.
Comment: 5 pages, 5 figures; published in the 2018 25th IEEE International
Conference on Image Processing (ICIP)
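A hedged stand-in for an n-gram-based cost (a clipped n-gram precision turned into a cost; not necessarily the exact metric used in the paper):

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_cost(hypothesis, references, max_n=4):
    # hypothesis: token list for the generated caption; references: list of token lists.
    # Returns 1 minus the mean clipped n-gram precision, so lower is better.
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        if not hyp_counts:
            continue
        best = Counter()
        for ref in references:
            ref_counts = ngrams(ref, n)
            for g in hyp_counts:
                best[g] = max(best[g], ref_counts[g])
        clipped = sum(min(count, best[g]) for g, count in hyp_counts.items())
        precisions.append(clipped / sum(hyp_counts.values()))
    if not precisions:
        return 1.0
    return 1.0 - sum(precisions) / len(precisions)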