Region-Based Image Retrieval Revisited
We revisit the region-based image retrieval (RBIR) technique. In early attempts
at RBIR in the late 1990s, researchers devised many ways to specify region-based
queries and spatial relationships; however, the methods then available to
characterize the regions, such as color histograms, were very limited. Here,
we revisit RBIR by incorporating semantic specification of objects and
intuitive specification of spatial relationships. Our contributions are the
following. First, to support multiple aspects of semantic object specification
(category, instance, and attribute), we propose a multitask CNN feature that
allows us to use deep learning techniques and to jointly handle multi-aspect
object specification. Second, to help users specify spatial relationships among
objects in an intuitive way, we propose techniques for recommending spatial
relationships. In particular, by mining the search results, the system can
recommend feasible spatial relationships among the objects. The system can also
recommend likely spatial relationships from the assigned object category names
based on a language prior. Moreover, object-level inverted indexing supports very fast
shortlist generation, and re-ranking based on spatial constraints provides
users with an instant RBIR experience.
Comment: To appear in ACM Multimedia 2017 (Oral).
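As a concrete illustration of the indexing-and-re-ranking pipeline described in this abstract, the following minimal Python sketch builds an object-level inverted index for fast shortlist generation and re-ranks the shortlist with a spatial predicate. The names (ObjectIndex, left_of, rerank) and the box format are our own illustrative assumptions, not the paper's implementation.

from collections import defaultdict

class ObjectIndex:
    """Maps an object label to the images (and boxes) that contain it."""

    def __init__(self):
        # label -> list of (image_id, box), where box = (x, y, w, h)
        self.postings = defaultdict(list)

    def add(self, image_id, label, box):
        self.postings[label].append((image_id, box))

    def shortlist(self, labels):
        """Fast shortlist: images containing every queried label."""
        sets = [{img for img, _ in self.postings[l]} for l in labels]
        return set.intersection(*sets) if sets else set()

def left_of(box_a, box_b):
    """Example spatial predicate: center of A lies left of center of B."""
    return box_a[0] + box_a[2] / 2 < box_b[0] + box_b[2] / 2

def rerank(index, shortlist, label_a, label_b, predicate):
    """Keep images where some (A, B) box pair satisfies the predicate."""
    hits = []
    for img in sorted(shortlist):
        boxes_a = [b for i, b in index.postings[label_a] if i == img]
        boxes_b = [b for i, b in index.postings[label_b] if i == img]
        if any(predicate(a, b) for a in boxes_a for b in boxes_b):
            hits.append(img)
    return hits

index = ObjectIndex()
index.add("img1", "person", (10, 20, 30, 60))   # person left of dog
index.add("img1", "dog", (70, 40, 25, 25))
index.add("img2", "person", (80, 20, 30, 60))   # person right of dog
index.add("img2", "dog", (10, 40, 25, 25))

candidates = index.shortlist(["person", "dog"])
print(rerank(index, candidates, "person", "dog", left_of))  # ['img1']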
ADVISE: Symbolism and External Knowledge for Decoding Advertisements
In order to convey the most content in their limited space, advertisements
embed references to outside knowledge via symbolism. For example, a motorcycle
stands for adventure (a positive property the ad wants associated with the
product being sold), and a gun stands for danger (a negative property to
dissuade viewers from undesirable behaviors). We show how to use symbolic
references to better understand the meaning of an ad. We further show how
anchoring ad understanding in general-purpose object recognition and image
captioning improves results. We formulate the ad understanding task as matching
the ad image to human-generated statements that describe the action that the ad
prompts, and the rationale it provides for taking this action. Our proposed
method outperforms the state of the art on this task, and on an alternative
formulation of question-answering on ads. We show additional applications of
our learned representations for matching ads to slogans, and clustering ads
according to their topic, without extra training.
Comment: To appear, Proceedings of the European Conference on Computer Vision (ECCV).
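The matching formulation this abstract describes can be pictured as a shared embedding space trained with a triplet ranking loss. The PyTorch snippet below is a minimal sketch under assumed feature dimensions and margin; it is not ADVISE's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingModel(nn.Module):
    """Embeds ad images and candidate statements into a shared space."""

    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # project image features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # project text features

    def forward(self, img_feat, txt_feat):
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img, txt

def triplet_loss(img, pos_txt, neg_txt, margin=0.2):
    """Rank the matching statement above a non-matching one by a margin."""
    pos_sim = (img * pos_txt).sum(-1)   # cosine similarity (unit vectors)
    neg_sim = (img * neg_txt).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

model = MatchingModel()
img = torch.randn(8, 2048)   # stand-in pooled CNN features for 8 ad images
pos = torch.randn(8, 300)    # features of the matching statements
neg = torch.randn(8, 300)    # features of sampled non-matching statements
img_e, pos_e = model(img, pos)
_, neg_e = model(img, neg)
triplet_loss(img_e, pos_e, neg_e).backward()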
Dual Attention on Pyramid Feature Maps for Image Captioning
Generating natural sentences from images is a fundamental learning task for
visual-semantic understanding in multimedia. In this paper, we propose to apply
dual attention on pyramid image feature maps to fully explore the
visual-semantic correlations and improve the quality of generated sentences.
Specifically, with the full consideration of the contextual information
provided by the hidden state of the RNN controller, the pyramid attention can
better localize the visually indicative and semantically consistent regions in
images. On the other hand, the contextual information can help re-calibrate the
importance of feature components by learning the channel-wise dependencies, to
improve the discriminative power of visual features for better content
description. We conducted comprehensive experiments on three well-known
datasets, Flickr8K, Flickr30K, and MS COCO, and achieved impressive results in
generating descriptive and fluent natural sentences from images. Using either
convolutional visual features or more informative bottom-up attention features,
our composite captioning model achieves very promising performance in a
single-model setting. The proposed pyramid attention and dual attention methods
are highly modular and can be inserted into various image captioning models
to further improve performance.
Comment: in IEEE Transactions on Multimedia, 202
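A minimal PyTorch sketch of the two attention components this abstract describes: spatial attention over a pyramid of region features guided by the RNN hidden state, plus channel-wise re-calibration of the attended feature. Layer sizes and shapes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.spatial = nn.Linear(feat_dim + hidden_dim, 1)  # scores regions
        self.channel = nn.Linear(hidden_dim, feat_dim)      # gates channels

    def forward(self, pyramid, hidden):
        # pyramid: list of (B, H_i * W_i, feat_dim) region features per level
        regions = torch.cat(pyramid, dim=1)                      # (B, R, D)
        h = hidden.unsqueeze(1).expand(-1, regions.size(1), -1)  # (B, R, H)
        scores = self.spatial(torch.cat([regions, h], dim=-1))   # (B, R, 1)
        alpha = F.softmax(scores, dim=1)                         # spatial attention
        context = (alpha * regions).sum(dim=1)                   # (B, D)
        gate = torch.sigmoid(self.channel(hidden))               # channel weights
        return gate * context                                    # re-calibrated feature

att = DualAttention()
pyramid = [torch.randn(2, 196, 512), torch.randn(2, 49, 512)]  # two scales
hidden = torch.randn(2, 512)                                   # RNN hidden state
print(att(pyramid, hidden).shape)                              # torch.Size([2, 512])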
Spatial-Semantic Image Search by Visual Feature Synthesis
The performance of image retrieval has improved tremendously in recent years through the use of deep feature representations. Most existing methods, however, aim to retrieve images that are visually similar or semantically relevant to the query, irrespective of spatial configuration. In this paper, we develop a spatial-semantic image search technology that enables users to search for images with both semantic and spatial constraints by manipulating concept text-boxes on a 2D query canvas. We train a convolutional neural network to synthesize appropriate visual features that capture the spatial-semantic constraints from the user canvas query. We directly optimize the retrieval performance of the visual features when training our deep neural network. These visual features are then used to retrieve images that are both spatially and semantically relevant to the user query. Experiments on large-scale datasets such as MS-COCO and Visual Genome show that our method outperforms baseline and state-of-the-art methods in spatial-semantic image search.
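One way to picture the query pipeline this abstract describes, as a hedged sketch: rasterize the concept text-boxes onto a 2D canvas of concept embeddings, synthesize a visual feature with a small CNN, and rank database images by cosine similarity. The vocabulary, canvas encoding, and network below are illustrative assumptions, not the paper's trained model.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = {"person": 0, "dog": 1, "car": 2}   # toy concept vocabulary
EMB = nn.Embedding(len(VOCAB), 16)          # concept embeddings

def rasterize(boxes, size=32):
    """boxes: list of (concept, x0, y0, x1, y1) in [0, 1] canvas coordinates."""
    canvas = torch.zeros(16, size, size)
    for concept, x0, y0, x1, y1 in boxes:
        emb = EMB(torch.tensor(VOCAB[concept])).detach()
        canvas[:, int(y0 * size):int(y1 * size),
               int(x0 * size):int(x1 * size)] = emb[:, None, None]
    return canvas

synth = nn.Sequential(   # synthesizes a visual feature from the canvas
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128),
)

query = rasterize([("person", 0.1, 0.2, 0.4, 0.9),
                   ("dog", 0.6, 0.5, 0.9, 0.9)])
q = F.normalize(synth(query.unsqueeze(0)), dim=-1)   # (1, 128) query feature
db = F.normalize(torch.randn(100, 128), dim=-1)      # stand-in image features
scores = q @ db.T                                    # cosine similarities
print(scores.topk(5).indices)                        # top-5 retrieved image ids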
Multimodal knowledge integration for object detection and visual reasoning
We humans still perceive and reason in a different way than artificial intelligence models. We witness, we listen, we touch; we understand the world via multi-modal sensing, while machine models rely on only a single or a few modalities and ignore abundant information. In this thesis, we explore techniques for reducing the perception gap between machines and humans, focusing on two families of tasks: reasoning and detection. First, we incorporate information from text, audio, motion, and external knowledge bases for training computer vision models. We find that data inputs from more extensive channels provide complementary information that improves models. Second, we study how multimodal inputs can be fully utilized. We argue that most existing deep learning methods are prone to paying too much attention to shallow patterns in the input features, which causes the resulting models to be biased. We propose robust training to overcome this issue. Third, we extend the benefits of multi-modal information from the inputs to the supervision signals, by learning a weakly supervised detection model from the natural supervision of textual captions or audio narrations. With the help of NLP constituency parsing, it is possible to extract structural knowledge from the captions and narrations, and hence to determine the entities and relations of visual objects.
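As a rough sketch of the caption-parsing step this abstract mentions: the thesis refers to constituency parsing, but spaCy's dependency-based noun chunks make a lightweight stand-in for extracting candidate object entities and verb-mediated relations. Everything below is an illustrative approximation, not the thesis's pipeline. Requires: pip install spacy && python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

def entities_and_relations(caption):
    doc = nlp(caption)
    # Candidate object entities: head lemmas of noun chunks.
    entities = [chunk.root.lemma_ for chunk in doc.noun_chunks]
    # Candidate relations: (subject, verb, object) triples.
    relations = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            for s in subjects:
                for o in objects:
                    relations.append((s.lemma_, token.lemma_, o.lemma_))
    return entities, relations

print(entities_and_relations("A man throws a frisbee to his dog."))
# e.g. (['man', 'frisbee', 'dog'], [('man', 'throw', 'frisbee')])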