53,675 research outputs found
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015
Regularizing Deep Networks by Modeling and Predicting Label Structure
We construct custom regularization functions for use in supervised training
of deep neural networks. Our technique is applicable when the ground-truth
labels themselves exhibit internal structure; we derive a regularizer by
learning an autoencoder over the set of annotations. Training thereby becomes a
two-phase procedure. The first phase models labels with an autoencoder. The
second phase trains the actual network of interest by attaching an auxiliary
branch that must predict output via a hidden layer of the autoencoder. After
training, we discard this auxiliary branch.
We experiment in the context of semantic segmentation, demonstrating this
regularization strategy leads to consistent accuracy boosts over baselines,
both when training from scratch, or in combination with ImageNet pretraining.
Gains are also consistent over different choices of convolutional network
architecture. As our regularizer is discarded after training, our method has
zero cost at test time; the performance improvements are essentially free. We
are simply able to learn better network weights by building an abstract model
of the label space, and then training the network to understand this
abstraction alongside the original task.Comment: to appear at CVPR 201
Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization
A deeper understanding of video activities extends beyond recognition of
underlying concepts such as actions and objects: constructing deep semantic
representations requires reasoning about the semantic relationships among these
concepts, often beyond what is directly observed in the data. To this end, we
propose an energy minimization framework that leverages large-scale commonsense
knowledge bases, such as ConceptNet, to provide contextual cues to establish
semantic relationships among entities directly hypothesized from video signal.
We mathematically express this using the language of Grenander's canonical
pattern generator theory. We show that the use of prior encoded commonsense
knowledge alleviate the need for large annotated training datasets and help
tackle imbalance in training through prior knowledge. Using three different
publicly available datasets - Charades, Microsoft Visual Description Corpus and
Breakfast Actions datasets, we show that the proposed model can generate video
interpretations whose quality is better than those reported by state-of-the-art
approaches, which have substantial training needs. Through extensive
experiments, we show that the use of commonsense knowledge from ConceptNet
allows the proposed approach to handle various challenges such as training data
imbalance, weak features, and complex semantic relationships and visual scenes.Comment: Accepted to WACV 201
Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age
Simultaneous Localization and Mapping (SLAM)consists in the concurrent
construction of a model of the environment (the map), and the estimation of the
state of the robot moving within it. The SLAM community has made astonishing
progress over the last 30 years, enabling large-scale real-world applications,
and witnessing a steady transition of this technology to industry. We survey
the current state of SLAM. We start by presenting what is now the de-facto
standard formulation for SLAM. We then review related work, covering a broad
set of topics including robustness and scalability in long-term mapping, metric
and semantic representations for mapping, theoretical performance guarantees,
active SLAM and exploration, and other new frontiers. This paper simultaneously
serves as a position paper and tutorial to those who are users of SLAM. By
looking at the published research with a critical eye, we delineate open
challenges and new research issues, that still deserve careful scientific
investigation. The paper also contains the authors' take on two questions that
often animate discussions during robotics conferences: Do robots need SLAM? and
Is SLAM solved
Interpreting Deep Visual Representations via Network Dissection
The success of recent deep convolutional neural networks (CNNs) depends on
learning hidden representations that can summarize the important factors of
variation behind the data. However, CNNs often criticized as being black boxes
that lack interpretability, since they have millions of unexplained model
parameters. In this work, we describe Network Dissection, a method that
interprets networks by providing labels for the units of their deep visual
representations. The proposed method quantifies the interpretability of CNN
representations by evaluating the alignment between individual hidden units and
a set of visual semantic concepts. By identifying the best alignments, units
are given human interpretable labels across a range of objects, parts, scenes,
textures, materials, and colors. The method reveals that deep representations
are more transparent and interpretable than expected: we find that
representations are significantly more interpretable than they would be under a
random equivalently powerful basis. We apply the method to interpret and
compare the latent representations of various network architectures trained to
solve different supervised and self-supervised training tasks. We then examine
factors affecting the network interpretability such as the number of the
training iterations, regularizations, different initializations, and the
network depth and width. Finally we show that the interpreted units can be used
to provide explicit explanations of a prediction given by a CNN for an image.
Our results highlight that interpretability is an important property of deep
neural networks that provides new insights into their hierarchical structure.Comment: *B. Zhou and D. Bau contributed equally to this work. 15 pages, 27
figure
- …