2,362 research outputs found
Semantic bottleneck for computer vision tasks
This paper introduces a novel method for the representation of images that is
semantic by nature, addressing the question of computation intelligibility in
computer vision tasks. More specifically, our proposition is to introduce what
we call a semantic bottleneck in the processing pipeline, which is a crossing
point in which the representation of the image is entirely expressed with
natural language , while retaining the efficiency of numerical representations.
We show that our approach is able to generate semantic representations that
give state-of-the-art results on semantic content-based image retrieval and
also perform very well on image classification tasks. Intelligibility is
evaluated through user centered experiments for failure detection
C4Synth: Cross-Caption Cycle-Consistent Text-to-Image Synthesis
Generating an image from its description is a challenging task worth solving
because of its numerous practical applications ranging from image editing to
virtual reality. All existing methods use one single caption to generate a
plausible image. A single caption by itself, can be limited, and may not be
able to capture the variety of concepts and behavior that may be present in the
image. We propose two deep generative models that generate an image by making
use of multiple captions describing it. This is achieved by ensuring
'Cross-Caption Cycle Consistency' between the multiple captions and the
generated image(s). We report quantitative and qualitative results on the
standard Caltech-UCSD Birds (CUB) and Oxford-102 Flowers datasets to validate
the efficacy of the proposed approach.Comment: To appear in the proceedings of IEEE Winter Conference on
Applications of Computer Vision, WACV-201
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?
In neural image captioning systems, a recurrent neural network (RNN) is
typically viewed as the primary `generation' component. This view suggests that
the image features should be `injected' into the RNN. This is in fact the
dominant view in the literature. Alternatively, the RNN can instead be viewed
as only encoding the previously generated words. This view suggests that the
RNN should only be used to encode linguistic features and that only the final
representation should be `merged' with the image features at a later stage.
This paper compares these two architectures. We find that, in general, late
merging outperforms injection, suggesting that RNNs are better viewed as
encoders, rather than generators.Comment: Appears in: Proceedings of the 10th International Conference on
Natural Language Generation (INLG'17
CanvasGAN: A simple baseline for text to image generation by incrementally patching a canvas
We propose a new recurrent generative model for generating images from text
captions while attending on specific parts of text captions. Our model creates
images by incrementally adding patches on a "canvas" while attending on words
from text caption at each timestep. Finally, the canvas is passed through an
upscaling network to generate images. We also introduce a new method for
generating visual-semantic sentence embeddings based on self-attention over
text. We compare our model's generated images with those generated Reed et.
al.'s model and show that our model is a stronger baseline for text to image
generation tasks.Comment: CVC 201
What value do explicit high level concepts have in vision to language problems?
Much of the recent progress in Vision-to-Language (V2L) problems has been
achieved through a combination of Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs). This approach does not explicitly represent
high-level semantic concepts, but rather seeks to progress directly from image
features to text. We propose here a method of incorporating high-level concepts
into the very successful CNN-RNN approach, and show that it achieves a
significant improvement on the state-of-the-art performance in both image
captioning and visual question answering. We also show that the same mechanism
can be used to introduce external semantic information and that doing so
further improves performance. In doing so we provide an analysis of the value
of high level semantic information in V2L problems.Comment: Accepted to IEEE Conf. Computer Vision and Pattern Recognition 2016.
Fixed titl
- …