A Survey of Current Datasets for Vision and Language Research
Integrating vision and language has long been a dream in work on artificial
intelligence (AI). In the past two years, we have witnessed an explosion of
work that brings together vision and language from images to videos and beyond.
The available corpora have played a crucial role in advancing this area of
research. In this paper, we propose a set of quality metrics for evaluating and
analyzing the vision & language datasets and categorize them accordingly. Our
analyses show that the most recent datasets have been using more complex
language and more abstract concepts; however, there are different strengths and
weaknesses in each.
Comment: To appear in EMNLP 2015, short proceedings. Dataset analysis and
discussion expanded, including an initial examination into reporting bias for
one of them. F.F. and N.M. contributed equally to this work.
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
We introduce a model for bidirectional retrieval of images and sentences
through a multi-modal embedding of visual and natural language data. Unlike
previous models that directly map images or sentences into a common embedding
space, our model works on a finer level and embeds fragments of images
(objects) and fragments of sentences (typed dependency tree relations) into a
common space. In addition to a ranking objective seen in previous work, this
allows us to add a new fragment alignment objective that learns to directly
associate these fragments across modalities. Extensive experimental evaluation
shows that reasoning on both the global level of images and sentences and the
finer level of their respective fragments significantly improves performance on
image-sentence retrieval tasks. Additionally, our model provides interpretable
predictions since the inferred inter-modal fragment alignment is explicit.
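The ranking objective mentioned in this abstract can be illustrated with a generic bidirectional max-margin loss. This is a minimal sketch, not the paper's exact formulation: the function name, the score-matrix layout (matching image-sentence pairs on the diagonal), and the margin value are all assumptions made for illustration, and the fragment alignment objective is not covered.

```python
from typing import List


def bidirectional_ranking_loss(scores: List[List[float]], margin: float = 1.0) -> float:
    """Generic hinge ranking loss over an image-sentence score matrix.

    Assumption: scores[i][j] is the similarity of image i and sentence j,
    and the correct pairs sit on the diagonal (image k matches sentence k).
    """
    n = len(scores)
    loss = 0.0
    for k in range(n):
        positive = scores[k][k]
        for l in range(n):
            if l == k:
                continue
            # Image k should score higher with its own sentence than with sentence l.
            loss += max(0.0, margin - positive + scores[k][l])
            # Sentence k should score higher with its own image than with image l.
            loss += max(0.0, margin - positive + scores[l][k])
    return loss


# Hypothetical score matrices for two image-sentence pairs.
well_separated = [[2.0, 0.5], [0.0, 3.0]]
confused = [[0.0, 1.0], [1.0, 0.0]]
loss_good = bidirectional_ranking_loss(well_separated)
loss_bad = bidirectional_ranking_loss(confused)
```

The loss is zero once every correct pair beats every incorrect pair by at least the margin in both retrieval directions, which is why the same objective serves image-to-sentence and sentence-to-image retrieval.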
Predicting Motivations of Actions by Leveraging Text
Understanding human actions is a key problem in computer vision. However,
recognizing actions is only the first step of understanding what a person is
doing. In this paper, we introduce the problem of predicting why a person has
performed an action in images. This problem has many applications in human
activity understanding, such as anticipating or explaining an action. To study
this problem, we introduce a new dataset of people performing actions annotated
with likely motivations. However, the information in an image alone may not be
sufficient to automatically solve this task. Since humans can rely on their
lifetime of experiences to infer motivation, we propose to give computer vision
systems access to some of these experiences by using recently developed natural
language models to mine knowledge stored in massive amounts of text. While we
are still far away from fully understanding motivation, our results suggest
that transferring knowledge from language into vision can help machines
understand why people in images might be performing an action.
Comment: CVPR 201
Text to 3D Scene Generation with Rich Lexical Grounding
The ability to map descriptions of scenes to 3D geometric representations has
many applications in areas such as art, education, and robotics. However, prior
work on the text to 3D scene generation task has used manually specified object
categories and language that identifies them. We introduce a dataset of 3D
scenes annotated with natural language descriptions and learn from this data
how to ground textual descriptions to physical objects. Our method successfully
grounds a variety of lexical terms to concrete referents, and we show
quantitatively that our method improves 3D scene generation over previous work
using purely rule-based methods. We evaluate the fidelity and plausibility of
3D scenes generated with our grounding approach through human judgments. To
ease evaluation on this task, we also introduce an automated metric that
strongly correlates with human judgments.
Comment: 10 pages, 7 figures, 3 tables. To appear in ACL-IJCNLP 201
Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a
Visual Turing Test. By combining the latest advances in image representation and
natural language processing, we propose Neural-Image-QA, an end-to-end
formulation of this problem for which all parts are trained jointly. In
contrast to previous efforts, we are facing a multi-modal problem where the
language output (answer) is conditioned on visual and natural language input
(image and question). Our approach Neural-Image-QA doubles the performance of
the previous best approach on this problem. We provide additional insights into
the problem by analyzing how much information is contained only in the language
part for which we provide a new human baseline. To study human consensus, which
is related to the ambiguities inherent in this challenging task, we propose two
novel metrics and collect additional answers that extend the original DAQUAR
dataset to DAQUAR-Consensus.
Comment: ICCV'15 (Oral)
Automatic Generation of Grounded Visual Questions
In this paper, we propose the first model to be able to generate visually
grounded questions with diverse types for a single image. Visual question
generation is an emerging topic which aims to ask questions in natural language
based on visual input. To the best of our knowledge, there is a lack of automatic
methods that generate meaningful questions of various types for the same visual input.
To circumvent the problem, we propose a model that automatically generates
visually grounded questions with varying types. Our model takes as input both
images and the captions generated by a dense caption model, samples the most
probable question types, and generates the questions in sequence. The
experimental results on two real world datasets show that our model outperforms
the strongest baseline in terms of both correctness and diversity by a wide
margin.
Comment: VQ
Scene Graph Generation by Iterative Message Passing
Understanding a visual scene goes beyond recognizing individual objects in
isolation. Relationships between objects also constitute rich semantic
information about the scene. In this work, we explicitly model the objects and
their relationships using scene graphs, a visually-grounded graphical structure
of an image. We propose a novel end-to-end model that generates such structured
scene representation from an input image. The model solves the scene graph
inference problem using standard RNNs and learns to iteratively improve its
predictions via message passing. Our joint inference model can take advantage
of contextual cues to make better predictions on objects and their
relationships. The experiments show that our model significantly outperforms
previous methods for generating scene graphs using the Visual Genome dataset and
inferring support relations with the NYU Depth v2 dataset.
Comment: CVPR 201
Detecting Visual Relationships with Deep Relational Networks
Relationships among objects play a crucial role in image understanding.
Despite the great success of deep learning techniques in recognizing individual
objects, reasoning about the relationships among objects remains a challenging
task. Previous methods often treat this as a classification problem,
considering each type of relationship (e.g. "ride") or each distinct visual
phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with
significant difficulties caused by the high diversity of visual appearance for
each kind of relationship or the large number of distinct visual phrases. We
propose an integrated framework to tackle this problem. At the heart of this
framework is the Deep Relational Network, a novel formulation designed
specifically for exploiting the statistical dependencies between objects and
their relationships. On two large datasets, the proposed method achieves
substantial improvement over the state of the art.
Comment: To appear in CVPR 2017 as an oral paper
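The difficulty this abstract attributes to treating each visual phrase as its own category can be made concrete with a quick count. This is an illustrative sketch only: the function name and the vocabulary sizes below are hypothetical and are not taken from the paper or its datasets.

```python
def count_visual_phrases(num_objects: int, num_predicates: int) -> int:
    """Upper bound on distinct (subject, predicate, object) phrases.

    Assumption: any object category may appear as subject or object,
    so the bound is num_objects * num_predicates * num_objects.
    """
    return num_objects * num_predicates * num_objects


# Hypothetical vocabulary sizes, chosen only for illustration.
num_objects = 100
num_predicates = 70
phrase_categories = count_visual_phrases(num_objects, num_predicates)
predicate_categories = num_predicates
```

Under these assumed sizes, a phrase-level classifier would face up to 700,000 categories, while a predicate-level classifier has only 70 categories but must absorb the full visual diversity of each one.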