ADVISE: Symbolism and External Knowledge for Decoding Advertisements
In order to convey the most content in their limited space, advertisements
embed references to outside knowledge via symbolism. For example, a motorcycle
stands for adventure (a positive property the ad wants associated with the
product being sold), and a gun stands for danger (a negative property to
dissuade viewers from undesirable behaviors). We show how to use symbolic
references to better understand the meaning of an ad. We further show how
anchoring ad understanding in general-purpose object recognition and image
captioning improves results. We formulate the ad understanding task as matching
the ad image to human-generated statements that describe the action that the ad
prompts, and the rationale it provides for taking this action. Our proposed
method outperforms the state of the art on this task, and on an alternative
formulation of question-answering on ads. We show additional applications of
our learned representations for matching ads to slogans, and clustering ads
according to their topic, without extra training.
Comment: To appear, Proceedings of the European Conference on Computer Vision (ECCV).
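At its core, the matching formulation above can be read as metric learning between an ad image and candidate action-rationale statements. Below is a minimal sketch of such a triplet ranking setup; the cosine scoring, margin value, and function names are illustrative assumptions, not ADVISE's exact architecture, which additionally incorporates symbol regions and captioning signals.

```python
# Minimal sketch of matching an ad image to human-generated statements.
# Encoders producing img_emb / stmt_emb are assumed to exist upstream.
import torch
import torch.nn.functional as F

def triplet_matching_loss(img_emb, pos_stmt_emb, neg_stmt_emb, margin=0.2):
    """Rank the matching action-rationale statement above a mismatched one."""
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_stmt_emb, dim=-1)
    neg = F.normalize(neg_stmt_emb, dim=-1)
    pos_sim = (img * pos).sum(dim=-1)  # cosine similarity to true statement
    neg_sim = (img * neg).sum(dim=-1)  # cosine similarity to distractor
    return F.relu(margin - pos_sim + neg_sim).mean()

# At inference, the ad is matched to the highest-scoring statement:
# scores = image_embedding @ statement_embeddings.T; pred = scores.argmax()
```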
SMAN: Stacked Multi-Modal Attention Network for cross-modal image-text retrieval
This article tackles cross-modal image-text retrieval, an interdisciplinary topic in both the computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from the heavy computational burden of exhaustively aggregating the similarities between visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that uses a stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal and multimodal information as guidance to perform multiple-step attention reasoning, so that the fine-grained correlation between image and text can be modeled. As a consequence, we can discover the semantically meaningful visual regions or sentence words that contribute to measuring cross-modal similarity more precisely. Moreover, we present a novel bidirectional ranking loss that enforces matched multimodal instance pairs to lie closer together than mismatched ones. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods.
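To make the bidirectional ranking loss concrete, here is a minimal sketch, assuming a precomputed image-text similarity matrix whose diagonal holds the matched pairs; the hinge form and margin value are common conventions for such losses, not necessarily SMAN's exact formulation.

```python
# Sketch of a bidirectional (image-to-text and text-to-image) ranking loss.
# Input: sim[i, j] = similarity between image i and text j; sim is square
# and its diagonal contains the matched pairs.
import torch

def bidirectional_ranking_loss(sim, margin=0.2):
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    # image-to-text: every non-matching text should score below the true pair
    cost_i2t = (margin + sim - pos).clamp(min=0)
    # text-to-image: every non-matching image should score below the true pair
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    return (cost_i2t.masked_fill(mask, 0).sum()
            + cost_t2i.masked_fill(mask, 0).sum())
```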
Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching
Existing image-text matching approaches typically infer the similarity of an
image-text pair by capturing and aggregating the affinities between the text
and each independent object of the image. However, they ignore the connections
between the objects that are semantically related. These objects may
collectively determine whether the image corresponds to a text or not. To
address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN)
which processes images and sentences symmetrically by recurrent neural networks
(RNN). In particular, given an input image-text pair, our model reorders the
image objects based on the positions of their most related words in the text.
In the same way as extracting the hidden features from word embeddings, the
model leverages RNN to extract high-level object features from the reordered
object inputs. We validate that the high-level object features contain useful
joint information of semantically related objects, which benefit the retrieval
task. To compute the image-text similarity, we incorporate a Multi-attention
Cross Matching Model into DP-RNN. It aggregates the affinity between objects
and words with cross-modality guided attention and self-attention. Our model
achieves the state-of-the-art performance on Flickr30K dataset and competitive
performance on MS-COCO dataset. Extensive experiments demonstrate the
effectiveness of our model.
Comment: Accepted by AAAI-2020.
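The reordering step described above lends itself to a short sketch. The following is a minimal illustration, assuming detector region features and word embeddings already live in a shared space; the cosine similarity and GRU encoder are assumptions made for illustration, and DP-RNN's full model adds the Multi-attention Cross Matching Model on top.

```python
# Sketch: reorder image objects by the sentence positions of their most
# related words, then encode them with an RNN just like a word sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

def reorder_objects(obj_feats, word_feats):
    """obj_feats: (num_objects, dim) region features from a detector.
    word_feats: (num_words, dim) word embeddings of the paired sentence."""
    sim = F.normalize(obj_feats, dim=-1) @ F.normalize(word_feats, dim=-1).t()
    best_word_pos = sim.argmax(dim=1)  # most related word for each object
    order = best_word_pos.argsort()    # arrange objects in sentence order
    return obj_feats[order]

obj_feats, word_feats = torch.randn(8, 256), torch.randn(12, 256)
rnn = nn.GRU(256, 256, batch_first=True)
ordered = reorder_objects(obj_feats, word_feats).unsqueeze(0)
obj_hidden, _ = rnn(ordered)  # high-level object features, like word hiddens
```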
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns
visual concepts, words, and semantic parsing of sentences without explicit
supervision on any of them; instead, our model learns by simply looking at
images and reading paired questions and answers. Our model builds an
object-based scene representation and translates sentences into executable,
symbolic programs. To bridge the learning of two modules, we use a
neuro-symbolic reasoning module that executes these programs on the latent
scene representation. Analogous to human concept learning, the perception
module learns visual concepts based on the language description of the object
being referred to. Meanwhile, the learned visual concepts facilitate learning
new words and parsing new sentences. We use curriculum learning to guide the
searching over the large compositional space of images and language. Extensive
experiments demonstrate the accuracy and efficiency of our model on learning
visual concepts, word representations, and semantic parsing of sentences.
Further, our method allows easy generalization to new object attributes,
compositions, language concepts, scenes and questions, and even new program
domains. It also empowers applications including visual question answering and
bidirectional image-text retrieval.
Comment: ICLR 2019 (Oral). Project page: http://nscl.csail.mit.edu
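To illustrate the executor idea, here is a minimal sketch of soft (probabilistic) program execution over an object-based scene representation; the two-operation program syntax, concept embeddings, and sigmoid scoring are simplifications assumed for illustration, not NS-CL's actual domain-specific language.

```python
# Sketch: execute a symbolic program, e.g. [('filter', 'red'), ('count',)],
# on latent object features, keeping every step differentiable.
import torch
import torch.nn.functional as F

def run_program(program, obj_feats, concept_emb):
    mask = torch.ones(obj_feats.size(0))  # soft set of selected objects
    for op, *args in program:
        if op == 'filter':
            # probability each object exhibits the concept, via cosine score
            logits = F.normalize(obj_feats, dim=-1) @ F.normalize(
                concept_emb[args[0]], dim=-1)
            mask = mask * torch.sigmoid(logits / 0.1)
        elif op == 'count':
            return mask.sum()  # differentiable count over the soft set
    return mask

objs = torch.randn(5, 64)
concepts = {'red': torch.randn(64)}
answer = run_program([('filter', 'red'), ('count',)], objs, concepts)
```

Because execution is differentiable, the answer loss can back-propagate into both the concept embeddings and the perception module, which is what allows concepts to be learned without direct supervision.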
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.