Visual7W: Grounded Question Answering in Images
We have seen great progress in basic perceptual tasks such as object
recognition and detection. However, AI models still fail to match humans in
high-level vision tasks due to their limited capacity for deeper reasoning.
Recently, the new task of visual question answering (QA) has been proposed to
evaluate a model's capacity for deep image understanding. Previous work has
established only a loose, global association between QA sentences and images.
However, many questions and answers, in practice, relate to local regions in
the images. We establish a semantic link between textual descriptions and image
regions through object-level grounding. This grounding enables a new type of QA
with visual answers, in addition to the textual answers used in previous work. We study the
visual QA tasks in a grounded setting with a large collection of 7W
multiple-choice QA pairs. Furthermore, we evaluate human performance and
several baseline models on the QA tasks. Finally, we propose a novel LSTM model
with spatial attention to tackle the 7W QA tasks.
Comment: CVPR 2016
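Below is a minimal sketch of the kind of model the abstract describes: an LSTM encodes the question, a soft attention layer pools convolutional features over spatial cells, and each multiple-choice answer is scored against the fused question-image representation. The PyTorch framing, dimensions, layer names, and dot-product scoring are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionQA(nn.Module):
    """Hypothetical LSTM + spatial-attention model for multiple-choice visual QA."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Attention MLP: scores each spatial cell against the question state.
        self.att_img = nn.Linear(feat_dim, hidden_dim)
        self.att_q = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.fuse = nn.Linear(hidden_dim + feat_dim, hidden_dim)

    def forward(self, img_feats, question, answers):
        # img_feats: (B, K, feat_dim) conv features over K spatial cells
        # question:  (B, Tq) token ids; answers: (B, C, Ta) token ids per choice
        _, (hq, _) = self.lstm(self.embed(question))
        hq = hq[-1]                                        # (B, hidden)
        # Soft attention over the K cells, conditioned on the question.
        scores = self.att_out(torch.tanh(
            self.att_img(img_feats) + self.att_q(hq).unsqueeze(1)))  # (B, K, 1)
        alpha = F.softmax(scores, dim=1)
        v = (alpha * img_feats).sum(dim=1)                 # attended visual feature
        qv = torch.tanh(self.fuse(torch.cat([hq, v], dim=1)))
        # Encode each candidate answer with the same LSTM; score by dot product.
        B, C, Ta = answers.shape
        _, (ha, _) = self.lstm(self.embed(answers.reshape(B * C, Ta)))
        ha = ha[-1].reshape(B, C, -1)                      # (B, C, hidden)
        return (ha * qv.unsqueeze(1)).sum(dim=2)           # (B, C) choice scores

Training would then minimize cross-entropy over the per-choice scores, e.g. F.cross_entropy(model(feats, q, a), correct_idx).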
Joint Video and Text Parsing for Understanding Events and Answering Queries
We propose a framework for parsing video and text jointly for understanding
events and answering user queries. Our framework produces a parse graph that
represents the compositional structures of spatial information (objects and
scenes), temporal information (actions and events) and causal information
(causalities between events and fluents) in the video and text. The knowledge
representation of our framework is based on a spatial-temporal-causal And-Or
graph (S/T/C-AOG), which jointly models possible hierarchical compositions of
objects, scenes and events as well as their interactions and mutual contexts,
and specifies the prior probability distribution over the parse graphs. We
present a probabilistic generative model for joint parsing that captures the
relations between the input video/text, their corresponding parse graphs and
the joint parse graph. Based on the probabilistic model, we propose a joint
parsing system consisting of three modules: video parsing, text parsing and
joint inference. Video parsing and text parsing produce two parse graphs from
the input video and text respectively. The joint inference module produces a
joint parse graph by performing matching, deduction and revision on the video
and text parse graphs. The proposed framework has three objectives. First, we
aim at deep semantic parsing of video and text that goes beyond traditional
bag-of-words approaches. Second, we perform parsing and reasoning across the
spatial, temporal, and causal dimensions based on the joint S/T/C-AOG
representation. Third, we show that deep joint parsing facilitates subsequent
applications such as generating narrative text descriptions and answering
queries in the form of who, what, when, where, and why. We empirically
evaluated our system against ground-truth annotations and on the accuracy of
query answering, and obtained satisfactory results.
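The three-module pipeline the abstract outlines (video parsing, text parsing, and joint inference via matching, deduction, and revision) can be illustrated with a toy Python sketch. The graph schema, confidence field, and threshold below are illustrative assumptions that stand in for the paper's S/T/C-AOG machinery, not its actual implementation.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str             # e.g. "person", "open-door" (object, action, or event)
    kind: str              # "spatial" | "temporal" | "causal"
    confidence: float = 1.0

@dataclass
class ParseGraph:
    nodes: list = field(default_factory=list)
    edges: set = field(default_factory=set)   # (src_label, relation, dst_label)

def joint_inference(video_pg, text_pg, min_conf=0.3):
    joint = ParseGraph()
    # Matching: unify nodes that carry the same label and kind in both graphs.
    matched = ({(n.label, n.kind) for n in video_pg.nodes}
               & {(n.label, n.kind) for n in text_pg.nodes})
    # Revision: drop low-confidence nodes that the other modality does not support.
    for pg in (video_pg, text_pg):
        for n in pg.nodes:
            supported = (n.label, n.kind) in matched
            duplicate = any(m.label == n.label and m.kind == n.kind
                            for m in joint.nodes)
            if (supported or n.confidence >= min_conf) and not duplicate:
                joint.nodes.append(n)
    # Deduction: keep relations asserted by either modality whose endpoints survived.
    kept = {n.label for n in joint.nodes}
    for src, rel, dst in video_pg.edges | text_pg.edges:
        if src in kept and dst in kept:
            joint.edges.add((src, rel, dst))
    return joint

For example, a weakly detected "open-door" event in the video graph survives revision because the text graph asserts it too:

video = ParseGraph([Node("person", "spatial", 0.9),
                    Node("open-door", "temporal", 0.2)],
                   {("person", "agent-of", "open-door")})
text = ParseGraph([Node("open-door", "temporal", 0.8)], set())
print(joint_inference(video, text).edges)   # {('person', 'agent-of', 'open-door')}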