Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive
the world more as continuous video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For the dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html.
Comment: Accepted to CVPR 2018, 10 pages, 6 figures
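As a rough illustration of the kind of multi-cue integration the abstract describes, the sketch below fuses appearance, motion, gaze, and spatio-temporal context features with a language embedding to score candidate objects. All module names, dimensions, and the late-fusion design are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): late fusion of appearance,
# motion, gaze, and spatio-temporal context cues with a language embedding
# to score candidate objects for a referring expression.
import torch
import torch.nn as nn

class MultiCueReferringScorer(nn.Module):
    def __init__(self, app_dim=2048, motion_dim=1024, gaze_dim=64,
                 ctx_dim=512, lang_dim=300, hidden=512):
        super().__init__()
        # One projection per visual cue; all dimensions are assumptions.
        self.proj = nn.ModuleDict({
            "appearance": nn.Linear(app_dim, hidden),
            "motion": nn.Linear(motion_dim, hidden),
            "gaze": nn.Linear(gaze_dim, hidden),
            "context": nn.Linear(ctx_dim, hidden),
        })
        self.lang_proj = nn.Linear(lang_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, cues, lang_emb):
        # cues: dict of cue name -> (num_objects, cue_dim) tensors
        # lang_emb: (lang_dim,) embedding of the referring expression
        fused = sum(torch.relu(p(cues[k])) for k, p in self.proj.items())
        lang = torch.relu(self.lang_proj(lang_emb))          # (hidden,)
        return self.score(fused * lang).squeeze(-1)          # (num_objects,)

# Example: score 8 candidate objects against one description.
cues = {"appearance": torch.randn(8, 2048), "motion": torch.randn(8, 1024),
        "gaze": torch.randn(8, 64), "context": torch.randn(8, 512)}
scores = MultiCueReferringScorer()(cues, torch.randn(300))
print(scores.argmax().item())  # index of the predicted referred object
```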
Learning Visual Question Answering by Bootstrapping Hard Attention
Attention mechanisms in biological perception are thought to select subsets
of perceptual information for more sophisticated processing which would be
prohibitive to perform on all sensory inputs. In computer vision, however,
there has been relatively little exploration of hard attention, where some
information is selectively ignored, in spite of the success of soft attention,
where information is re-weighted and aggregated, but never filtered out. Here,
we introduce a new approach for hard attention and find it achieves very
competitive performance on a recently released visual question answering
dataset, equalling and in some cases surpassing similar soft attention
architectures while entirely ignoring some features. Even though the hard
attention mechanism is thought to be non-differentiable, we found that the
feature magnitudes correlate with semantic relevance, and provide a useful
signal for our mechanism's attentional selection criterion. Because hard
attention selects important features of the input information, it can also be
more efficient than analogous soft attention mechanisms. This is especially
important for recent approaches that use non-local pairwise operations, whereby
computational and memory costs are quadratic in the size of the set of
features.
Comment: ECCV 2018
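A minimal sketch of the selection idea described above, assuming a top-k rule over feature-vector L2 norms; the exact criterion and the value of k are assumptions for illustration, not the paper's specification.

```python
# Illustrative sketch: hard attention that keeps only the k spatial feature
# vectors with the largest L2 norm and discards the rest, using magnitude
# as the relevance signal. The top-k rule and k value are assumptions.
import torch

def hard_attention_by_magnitude(features, k=16):
    """features: (batch, num_positions, dim) conv-feature map flattened
    over space. Returns (batch, k, dim) containing only the kept vectors."""
    norms = features.norm(dim=-1)                     # (batch, num_positions)
    topk = norms.topk(k, dim=-1).indices              # (batch, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, features.size(-1))
    return features.gather(1, idx)                    # (batch, k, dim)

# Downstream non-local pairwise operations now cost O(k^2) rather than
# O(num_positions^2), which is where the efficiency gain comes from.
feats = torch.randn(2, 14 * 14, 512)                  # e.g. a 14x14 feature map
selected = hard_attention_by_magnitude(feats, k=16)
print(selected.shape)                                 # torch.Size([2, 16, 512])
```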
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
We propose a neural sequence-to-sequence model for direction following, a
task that is essential to realizing effective autonomous agents. Our
alignment-based encoder-decoder model with long short-term memory recurrent
neural networks (LSTM-RNN) translates natural language instructions to action
sequences based upon a representation of the observable world state. We
introduce a multi-level aligner that empowers our model to focus on sentence
"regions" salient to the current world state by using multiple abstractions of
the input sentence. In contrast to existing methods, our model uses no
specialized linguistic resources (e.g., parsers) or task-specific annotations
(e.g., seed lexicons). It is therefore generalizable, yet still achieves the
best results reported to date on a benchmark single-sentence dataset and
competitive results for the limited-training multi-sentence setting. We analyze
our model through a series of ablations that elucidate the contributions of the
primary components of our model.
Comment: To appear at AAAI 2016 (an extended version of a NIPS 2015 Multimodal Machine Learning workshop paper)
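For a concrete, if simplified, picture of an alignment-based encoder-decoder for direction following, the sketch below attends over instruction words at a single level; the paper's multi-level aligner, dimensions, vocabulary, and action set are replaced with assumed placeholders.

```python
# Illustrative sketch (not the authors' model): an LSTM encoder-decoder with
# attention over instruction words that emits one discrete action per step,
# conditioned on the current observable world state.
import torch
import torch.nn as nn

class InstructionFollower(nn.Module):
    def __init__(self, vocab=1000, n_actions=6, emb=64, hidden=128, world=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder_cell = nn.LSTMCell(hidden + world, hidden)
        self.attn = nn.Linear(hidden, hidden)
        self.action_head = nn.Linear(hidden, n_actions)

    def forward(self, instruction, world_states):
        # instruction: (T_in,) word ids; world_states: (T_out, world) observations
        enc, _ = self.encoder(self.embed(instruction).unsqueeze(0))  # (1, T_in, hidden)
        enc = enc.squeeze(0)
        h = torch.zeros(1, enc.size(-1)); c = torch.zeros_like(h)
        actions = []
        for obs in world_states:
            # Attend to instruction words conditioned on the decoder state.
            weights = torch.softmax(self.attn(enc) @ h.squeeze(0), dim=0)  # (T_in,)
            context = (weights.unsqueeze(-1) * enc).sum(0)                 # (hidden,)
            h, c = self.decoder_cell(torch.cat([context, obs]).unsqueeze(0), (h, c))
            actions.append(self.action_head(h).argmax(-1))
        return torch.stack(actions).squeeze(-1)  # (T_out,) predicted action ids

model = InstructionFollower()
actions = model(torch.randint(0, 1000, (7,)), torch.randn(5, 32))
print(actions)  # five predicted actions for a seven-word instruction
```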
Multimodal Convolutional Neural Networks for Matching Image and Sentence
In this paper, we propose multimodal convolutional neural networks (m-CNNs)
for matching image and sentence. Our m-CNN provides an end-to-end framework
with convolutional architectures to exploit image representation, word
composition, and the matching relations between the two modalities. More
specifically, it consists of one image CNN encoding the image content, and one
matching CNN learning the joint representation of image and sentence. The
matching CNN composes words into different semantic fragments and learns the
inter-modal relations between the image and the composed fragments at different
levels, thus fully exploiting the matching relations between image and sentence.
Experimental results on benchmark databases of bidirectional image and sentence
retrieval demonstrate that the proposed m-CNNs can effectively capture the
information necessary for image and sentence matching. Specifically, our
proposed m-CNNs achieve state-of-the-art performance for bidirectional image
and sentence retrieval on the Flickr30K and Microsoft COCO databases.
Comment: Accepted by ICCV 2015
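The sketch below gives a rough idea of the two-part design the abstract describes: an image feature is projected and combined with word embeddings, then composed by a 1-D convolutional matching network into a single match score. Layer choices and dimensions are assumptions, not the authors' m-CNN.

```python
# Illustrative sketch (not the authors' m-CNN): an image-feature projection plus
# a 1-D convolutional "matching" network that composes word embeddings together
# with the image feature and outputs an image-sentence match score.
import torch
import torch.nn as nn

class ImageSentenceMatcher(nn.Module):
    def __init__(self, img_dim=2048, word_dim=300, hidden=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        # 1-D convolutions compose adjacent words into larger fragments.
        self.match_cnn = nn.Sequential(
            nn.Conv1d(word_dim + hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.score = nn.Linear(hidden, 1)

    def forward(self, img_feat, word_embs):
        # img_feat: (batch, img_dim) global image feature (e.g. from an image CNN)
        # word_embs: (batch, seq_len, word_dim) sentence word embeddings
        img = self.img_proj(img_feat)                              # (batch, hidden)
        img = img.unsqueeze(-1).expand(-1, -1, word_embs.size(1))  # (batch, hidden, seq_len)
        x = torch.cat([word_embs.transpose(1, 2), img], dim=1)     # (batch, word+hidden, seq_len)
        joint = self.match_cnn(x).squeeze(-1)                      # (batch, hidden)
        return self.score(joint).squeeze(-1)                       # (batch,) match scores

scores = ImageSentenceMatcher()(torch.randn(4, 2048), torch.randn(4, 12, 300))
print(scores.shape)  # torch.Size([4]); a higher score means a better match
```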