50,877 research outputs found
Watch and Learn: Semi-Supervised Learning of Object Detectors from Videos
We present a semi-supervised approach that localizes multiple unknown object
instances in long videos. We start with a handful of labeled boxes and
iteratively learn and label hundreds of thousands of object instances. We
propose criteria for reliable object detection and tracking for constraining
the semi-supervised learning process and minimizing semantic drift. Our
approach does not assume exhaustive labeling of each object instance in any
single frame, or any explicit annotation of negative data. Working in such a
generic setting allow us to tackle multiple object instances in video, many of
which are static. In contrast, existing approaches either do not consider
multiple object instances per video, or rely heavily on the motion of the
objects present. The experiments demonstrate the effectiveness of our approach
by evaluating the automatically labeled data on a variety of metrics like
quality, coverage (recall), diversity, and relevance to training an object
detector.Comment: To appear in CVPR 201
Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video
Manual annotations of temporal bounds for object interactions (i.e. start and
end times) are typical training input to recognition, localization and
detection algorithms. For three publicly available egocentric datasets, we
uncover inconsistencies in ground truth temporal bounds within and across
annotators and datasets. We systematically assess the robustness of
state-of-the-art approaches to changes in labeled temporal bounds, for object
interaction recognition. As boundaries are trespassed, a drop of up to 10% is
observed for both Improved Dense Trajectories and Two-Stream Convolutional
Neural Network.
We demonstrate that such disagreement stems from a limited understanding of
the distinct phases of an action, and propose annotating based on the Rubicon
Boundaries, inspired by a similarly named cognitive model, for consistent
temporal bounds of object interactions. Evaluated on a public dataset, we
report a 4% increase in overall accuracy, and an increase in accuracy for 55%
of classes when Rubicon Boundaries are used for temporal annotations.Comment: ICCV 201
RGBD Datasets: Past, Present and Future
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been
released. These have propelled advances in areas from reconstruction to gesture
recognition. In this paper we explore the field, reviewing datasets across
eight categories: semantics, object pose estimation, camera tracking, scene
reconstruction, object tracking, human actions, faces and identification. By
extracting relevant information in each category we help researchers to find
appropriate data for their needs, and we consider which datasets have succeeded
in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which
are currently underexplored, and suggest that future directions may include
synthetic data and dense reconstructions of static and dynamic scenes.Comment: 8 pages excluding references (CVPR style
High-speed Video from Asynchronous Camera Array
This paper presents a method for capturing high-speed video using an
asynchronous camera array. Our method sequentially fires each sensor in a
camera array with a small time offset and assembles captured frames into a
high-speed video according to the time stamps. The resulting video, however,
suffers from parallax jittering caused by the viewpoint difference among
sensors in the camera array. To address this problem, we develop a dedicated
novel view synthesis algorithm that transforms the video frames as if they were
captured by a single reference sensor. Specifically, for any frame from a
non-reference sensor, we find the two temporally neighboring frames captured by
the reference sensor. Using these three frames, we render a new frame with the
same time stamp as the non-reference frame but from the viewpoint of the
reference sensor. Specifically, we segment these frames into super-pixels and
then apply local content-preserving warping to warp them to form the new frame.
We employ a multi-label Markov Random Field method to blend these warped
frames. Our experiments show that our method can produce high-quality and
high-speed video of a wide variety of scenes with large parallax, scene
dynamics, and camera motion and outperforms several baseline and
state-of-the-art approaches.Comment: 10 pages, 82 figures, Published at IEEE WACV 201
Computationally Efficient Target Classification in Multispectral Image Data with Deep Neural Networks
Detecting and classifying targets in video streams from surveillance cameras
is a cumbersome, error-prone and expensive task. Often, the incurred costs are
prohibitive for real-time monitoring. This leads to data being stored locally
or transmitted to a central storage site for post-incident examination. The
required communication links and archiving of the video data are still
expensive and this setup excludes preemptive actions to respond to imminent
threats. An effective way to overcome these limitations is to build a smart
camera that transmits alerts when relevant video sequences are detected. Deep
neural networks (DNNs) have come to outperform humans in visual classifications
tasks. The concept of DNNs and Convolutional Networks (ConvNets) can easily be
extended to make use of higher-dimensional input data such as multispectral
data. We explore this opportunity in terms of achievable accuracy and required
computational effort. To analyze the precision of DNNs for scene labeling in an
urban surveillance scenario we have created a dataset with 8 classes obtained
in a field experiment. We combine an RGB camera with a 25-channel VIS-NIR
snapshot sensor to assess the potential of multispectral image data for target
classification. We evaluate several new DNNs, showing that the spectral
information fused together with the RGB frames can be used to improve the
accuracy of the system or to achieve similar accuracy with a 3x smaller
computation effort. We achieve a very high per-pixel accuracy of 99.1%. Even
for scarcely occurring, but particularly interesting classes, such as cars, 75%
of the pixels are labeled correctly with errors occurring only around the
border of the objects. This high accuracy was obtained with a training set of
only 30 labeled images, paving the way for fast adaptation to various
application scenarios.Comment: Presented at SPIE Security + Defence 2016 Proc. SPIE 9997, Target and
Background Signatures I
- …