CanvasGAN: A simple baseline for text to image generation by incrementally patching a canvas
We propose a new recurrent generative model for generating images from text
captions while attending on specific parts of text captions. Our model creates
images by incrementally adding patches on a "canvas" while attending on words
from text caption at each timestep. Finally, the canvas is passed through an
upscaling network to generate images. We also introduce a new method for
generating visual-semantic sentence embeddings based on self-attention over
text. We compare our model's generated images with those generated by Reed et
al.'s model and show that our model is a stronger baseline for text to image
generation tasks.
Comment: CVC 201
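As a rough illustration of the mechanism described in this abstract, the following PyTorch-style sketch shows a recurrent generator that attends over caption word embeddings at each timestep, adds a patch to a small canvas, and finally upscales the canvas into an image. All names, dimensions, and the upscaler layout are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CanvasGenerator(nn.Module):
    # Sketch: recurrent canvas generator. At each timestep it attends over the
    # word embeddings of the caption, updates a GRU state, and adds a patch to
    # a low-resolution canvas; an upscaling network produces the final image.
    def __init__(self, word_dim=256, hidden_dim=256, canvas_ch=3, canvas_size=16, steps=8):
        super().__init__()
        self.steps = steps
        self.rnn = nn.GRUCell(word_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim + word_dim, 1)  # additive attention score per word
        self.to_patch = nn.Linear(hidden_dim, canvas_ch * canvas_size * canvas_size)
        self.canvas_shape = (canvas_ch, canvas_size, canvas_size)
        # hypothetical upscaler: 16x16 canvas -> 64x64 RGB image
        self.upscale = nn.Sequential(
            nn.ConvTranspose2d(canvas_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, word_embs):  # word_embs: (batch, num_words, word_dim)
        B, T, _ = word_embs.shape
        h = word_embs.new_zeros(B, self.rnn.hidden_size)
        canvas = word_embs.new_zeros(B, *self.canvas_shape)
        for _ in range(self.steps):
            # attend over caption words conditioned on the current hidden state
            scores = self.attn(torch.cat([h.unsqueeze(1).expand(-1, T, -1), word_embs], dim=-1))
            alpha = torch.softmax(scores, dim=1)        # (B, T, 1) weights over words
            context = (alpha * word_embs).sum(dim=1)    # attended caption context
            h = self.rnn(context, h)
            canvas = canvas + self.to_patch(h).view(B, *self.canvas_shape)  # add a patch
        return self.upscale(canvas)

Under these assumed shapes, CanvasGenerator()(torch.randn(2, 12, 256)) returns a (2, 3, 64, 64) batch of images.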
C4Synth: Cross-Caption Cycle-Consistent Text-to-Image Synthesis
Generating an image from its description is a challenging task worth solving
because of its numerous practical applications ranging from image editing to
virtual reality. All existing methods use a single caption to generate a
plausible image. A single caption by itself can be limited and may not capture
the variety of concepts and behavior that may be present in the
image. We propose two deep generative models that generate an image by making
use of multiple captions describing it. This is achieved by ensuring
'Cross-Caption Cycle Consistency' between the multiple captions and the
generated image(s). We report quantitative and qualitative results on the
standard Caltech-UCSD Birds (CUB) and Oxford-102 Flowers datasets to validate
the efficacy of the proposed approach.
Comment: To appear in the proceedings of IEEE Winter Conference on Applications of Computer Vision, WACV-201
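One plausible reading of the cross-caption cycle-consistency idea, sketched here in PyTorch: the image generated from caption i is encoded back into the text-embedding space and pulled toward the next caption of the same image, so that concepts spread across all captions constrain the generator. The generator and image_encoder callables, the cosine objective, and the caption ordering are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def cross_caption_cycle_loss(caption_embs, generator, image_encoder):
    # caption_embs: (num_captions, batch, emb_dim) embeddings of captions that
    # all describe the same images; generator maps a caption embedding to an
    # image, image_encoder maps an image back into the caption embedding space.
    n = caption_embs.size(0)
    loss = caption_embs.new_zeros(())
    for i in range(n):
        img = generator(caption_embs[i])            # image conditioned on caption i
        img_emb = image_encoder(img)                # project image back to text space
        target = caption_embs[(i + 1) % n]          # cycle to the next caption
        loss = loss + (1.0 - F.cosine_similarity(img_emb, target, dim=-1)).mean()
    return loss / n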
Adversarial Learning of Semantic Relevance in Text to Image Synthesis
We describe a new approach that improves the training of generative
adversarial nets (GANs) for synthesizing diverse images from a text input. Our
approach is based on the conditional version of GANs and expands on previous
work leveraging an auxiliary task in the discriminator. Our generated images
are not limited to certain classes and do not suffer from mode collapse while
semantically matching the text input. A key element of our training method is
how we form positive and negative training examples with respect to the class
label of a given image. Instead of selecting random training examples, we perform
negative sampling based on the semantic distance from a positive example in the
class. We evaluate our approach using the Oxford-102 flower dataset, adopting
the inception score and multi-scale structural similarity index (MS-SSIM)
metrics to assess discriminability and diversity of the generated images. The
empirical results indicate greater diversity in the generated images,
especially when we gradually select more negative training examples closer to a
positive example in the semantic space.
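A minimal sketch of the semantic-distance-based negative sampling described above: candidate caption embeddings are ranked by distance to the positive example, and a closeness parameter that grows over training shifts the selection window toward the nearest (hardest) negatives, mirroring the gradual selection the abstract mentions. The function name, the Euclidean distance, and the linear window schedule are illustrative assumptions.

import torch

def sample_semantic_negatives(pos_emb, cand_embs, k, closeness=0.0):
    # pos_emb: (D,) embedding of the positive example
    # cand_embs: (N, D) embeddings of candidate negatives from the same class
    # closeness in [0, 1]: 0 selects the farthest k candidates, 1 the nearest k
    dists = torch.norm(cand_embs - pos_emb.unsqueeze(0), dim=1)  # semantic distances
    order = torch.argsort(dists)                                 # nearest first
    start = int((cand_embs.size(0) - k) * (1.0 - closeness))     # slide window toward hard negatives
    return order[start:start + k]                                # indices of selected negatives

For example, calling sample_semantic_negatives(pos, pool, k=16, closeness=epoch / num_epochs) would pick progressively harder negatives as training proceeds (epoch and num_epochs are assumed training-loop variables).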