A Dataset for Movie Description
Descriptive video service (DVS) provides linguistic descriptions of movies
and allows visually impaired people to follow a movie along with their peers.
Such descriptions are by design mainly visual and thus naturally form an
interesting data source for computer vision and computational linguistics. In
this work we propose a novel dataset which contains transcribed DVS, which is
temporally aligned to full length HD movies. In addition we also collected the
aligned movie scripts which have been used in prior work and compare the two
different sources of descriptions. In total the Movie Description dataset
contains a parallel corpus of over 54,000 sentences and video snippets from 72
HD movies. We characterize the dataset by benchmarking different approaches for
generating video descriptions. Comparing DVS to scripts, we find that DVS is
far more visual and describes precisely what is shown rather than what should
happen according to the scripts created prior to movie production.
Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a
Visual Turing Test. By combining latest advances in image representation and
natural language processing, we propose Neural-Image-QA, an end-to-end
formulation to this problem for which all parts are trained jointly. In
contrast to previous efforts, we are facing a multi-modal problem where the
language output (answer) is conditioned on visual and natural language input
(image and question). Our approach Neural-Image-QA doubles the performance of
the previous best approach on this problem. We provide additional insights into
the problem by analyzing how much information is contained only in the language
part for which we provide a new human baseline. To study human consensus, which
is related to the ambiguities inherent in this challenging task, we propose two
novel metrics and collect additional answers which extend the original DAQUAR
dataset to DAQUAR-Consensus. Comment: ICCV'15 (Oral).
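
The abstract describes the setup only at a high level; the following is a minimal PyTorch sketch of a jointly trained image-plus-question model in the same spirit. It is not the authors' Neural-Image-QA architecture: the CNN feature is treated as an extra input token, the layer sizes are arbitrary assumptions, and answering is simplified to classification over a fixed answer set rather than word-by-word decoding.

    import torch
    import torch.nn as nn

    class NeuralImageQA(nn.Module):
        """Sketch of a jointly trained image+question answering model: a CNN image
        feature is prepended as a pseudo-token to the embedded question words, an
        LSTM reads the sequence, and its final state scores a fixed answer set."""
        def __init__(self, vocab_size, num_answers, img_dim=2048, emb_dim=512, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.img_proj = nn.Linear(img_dim, emb_dim)   # map the CNN feature into embedding space
            self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.classifier = nn.Linear(hid_dim, num_answers)

        def forward(self, img_feat, question_ids):
            # img_feat: (B, img_dim) pre-extracted CNN feature, question_ids: (B, T) word indices
            img_tok = self.img_proj(img_feat).unsqueeze(1)        # (B, 1, emb_dim)
            words = self.embed(question_ids)                      # (B, T, emb_dim)
            _, (h, _) = self.lstm(torch.cat([img_tok, words], dim=1))
            return self.classifier(h[-1])                         # logits over the answer set
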
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained
from large language or visual datasets has been successfully explored in recent
years. However, tasks such as visual question answering require combining these
vector representations with each other. Approaches to multimodal pooling
include element-wise product or sum, as well as concatenation of the visual and
textual representations. We hypothesize that these methods are not as
expressive as an outer product of the visual and textual vectors. As the outer
product is typically infeasible due to its high dimensionality, we instead
propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and
expressively combine multimodal features. We extensively evaluate MCB on the
visual question answering and grounding tasks. We consistently show the benefit
of MCB over ablations without MCB. For visual question answering, we present an
architecture which uses MCB twice, once for predicting attention over spatial
features and again to combine the attended representation with the question
representation. This model outperforms the state-of-the-art on the Visual7W
dataset and the VQA challenge. Comment: Accepted to EMNLP 2016.
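
To make the pooling step described above concrete, here is a minimal NumPy sketch of compact bilinear pooling: each vector is projected with a count sketch, and the sketch of their outer product is obtained as a circular convolution of the two sketches via FFT. The feature sizes and output dimension are illustrative, and the attention mechanism and training machinery of the full architecture are omitted.

    import numpy as np

    def count_sketch(x, h, s, d):
        """Project an n-dim vector x into d dims using hash indices h and random signs s."""
        y = np.zeros(d)
        np.add.at(y, h, s * x)
        return y

    def mcb_pool(v, q, d=16000, seed=0):
        """Compact bilinear pooling: the count sketch of the outer product of v and q
        equals the circular convolution of their individual sketches (computed via FFT)."""
        rng = np.random.default_rng(seed)
        hv = rng.integers(0, d, size=v.size); sv = rng.choice([-1, 1], size=v.size)
        hq = rng.integers(0, d, size=q.size); sq = rng.choice([-1, 1], size=q.size)
        cs_v = count_sketch(v, hv, sv, d)
        cs_q = count_sketch(q, hq, sq, d)
        return np.real(np.fft.ifft(np.fft.fft(cs_v) * np.fft.fft(cs_q)))

    # toy usage: pool a visual feature with a question embedding (sizes are illustrative)
    pooled = mcb_pool(np.random.rand(2048), np.random.rand(1024))
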
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs that are temporally
aligned to full-length movies. In addition, we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
Deep models are the de facto standard in visual decision problems due to their
impressive performance on a wide array of visual tasks. On the other hand,
their opaqueness has led to a surge of interest in explainable systems. In this
work, we emphasize the importance of model explanation in various forms such as
visual pointing and textual justification. The lack of data with justification
annotations is one of the bottlenecks of generating multimodal explanations.
Thus, we propose two large-scale datasets with annotations that visually and
textually justify a classification decision for various activities, i.e., ACT-X,
and for question answering, i.e., VQA-X. We also introduce a multimodal
methodology for generating visual and textual explanations simultaneously. We
quantitatively show that training with the textual explanations not only yields
better textual justification models, but also models that better localize the
evidence that supports their decision.
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
Deep models that are both effective and explainable are desirable in many
settings; prior explainable models have been unimodal, offering either
image-based visualization of attention weights or text-based generation of
post-hoc justifications. We propose a multimodal approach to explanation, and
argue that the two modalities provide complementary explanatory strengths. We
collect two new datasets to define and evaluate this task, and propose a novel
model which can provide joint textual rationale generation and attention
visualization. Our datasets define visual and textual justifications of a
classification decision for activity recognition tasks (ACT-X) and for visual
question answering tasks (VQA-X). We quantitatively show that training with the
textual explanations not only yields better textual justification models, but
also better localizes the evidence that supports the decision. We also
qualitatively show cases where visual explanation is more insightful than
textual explanation, and vice versa, supporting our thesis that multimodal
explanation models offer significant benefits over unimodal approaches.
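
To make the joint-explanation idea concrete, here is a minimal PyTorch sketch of a model that returns an answer, a spatial attention map (the visual explanation), and a textual justification decoded conditioned on the attended features and the predicted answer. It is a generic sketch under assumed feature dimensions, not the authors' published architecture.

    import torch
    import torch.nn as nn

    class ExplainableVQA(nn.Module):
        """Sketch of a multimodal-explanation model: predict an answer with spatial
        attention (the visual explanation) and decode a textual justification
        conditioned on the attended features and the predicted answer."""
        def __init__(self, v_dim=2048, q_dim=512, num_answers=1000, vocab=10000, hid=512):
            super().__init__()
            self.att = nn.Linear(v_dim + q_dim, 1)                 # attention logit per region
            self.answer_head = nn.Linear(v_dim + q_dim, num_answers)
            self.embed = nn.Embedding(vocab, hid)
            self.init_h = nn.Linear(v_dim + q_dim + num_answers, hid)
            self.decoder = nn.LSTM(hid, hid, batch_first=True)
            self.word_out = nn.Linear(hid, vocab)

        def forward(self, v_feats, q_feat, expl_ids):
            # v_feats: (B, N, v_dim) region features, q_feat: (B, q_dim), expl_ids: (B, T)
            B, N, _ = v_feats.shape
            q_rep = q_feat.unsqueeze(1).expand(B, N, q_feat.size(-1))
            alpha = torch.softmax(self.att(torch.cat([v_feats, q_rep], -1)).squeeze(-1), dim=1)
            v_att = (alpha.unsqueeze(-1) * v_feats).sum(1)         # attended visual summary
            ans_logits = self.answer_head(torch.cat([v_att, q_feat], -1))
            h0 = torch.tanh(self.init_h(torch.cat([v_att, q_feat, ans_logits], -1))).unsqueeze(0)
            out, _ = self.decoder(self.embed(expl_ids), (h0, torch.zeros_like(h0)))
            return ans_logits, alpha, self.word_out(out)           # answer, attention map, word logits
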