A fully-convolutional neural network for background subtraction of unseen videos
Background subtraction is a basic task in computer vision
and video processing often applied as a pre-processing step
for object tracking, people recognition, etc. Recently, a number of successful background-subtraction algorithms have
been proposed; however, nearly all of the top-performing
ones are supervised. Crucially, their success relies upon
the availability of some annotated frames of the test video
during training. Consequently, their performance on completely “unseen” videos is undocumented in the literature.
In this work, we propose a new, supervised, background-subtraction algorithm for unseen videos (BSUV-Net) based
on a fully-convolutional neural network. The input to our
network consists of the current frame and two background
frames captured at different time scales along with their semantic segmentation maps. In order to reduce the chance
of overfitting, we also introduce a new data-augmentation
technique which mitigates the impact of illumination difference between the background frames and the current frame.
On the CDNet-2014 dataset, BSUV-Net outperforms state-of-the-art algorithms evaluated on unseen videos in terms of
several metrics including F-measure, recall and precision.
Accepted manuscript
Computationally Efficient Target Classification in Multispectral Image Data with Deep Neural Networks
Detecting and classifying targets in video streams from surveillance cameras
is a cumbersome, error-prone and expensive task. Often, the incurred costs are
prohibitive for real-time monitoring. This leads to data being stored locally
or transmitted to a central storage site for post-incident examination. The
required communication links and archiving of the video data are still
expensive and this setup excludes preemptive actions to respond to imminent
threats. An effective way to overcome these limitations is to build a smart
camera that transmits alerts when relevant video sequences are detected. Deep
neural networks (DNNs) have come to outperform humans in visual classification
tasks. The concept of DNNs and Convolutional Networks (ConvNets) can easily be
extended to make use of higher-dimensional input data such as multispectral
data. We explore this opportunity in terms of achievable accuracy and required
computational effort. To analyze the precision of DNNs for scene labeling in an
urban surveillance scenario we have created a dataset with 8 classes obtained
in a field experiment. We combine an RGB camera with a 25-channel VIS-NIR
snapshot sensor to assess the potential of multispectral image data for target
classification. We evaluate several new DNNs, showing that the spectral
information fused together with the RGB frames can be used to improve the
accuracy of the system or to achieve similar accuracy with a 3x smaller
computation effort. We achieve a very high per-pixel accuracy of 99.1%. Even
for scarcely occurring, but particularly interesting classes, such as cars, 75%
of the pixels are labeled correctly with errors occurring only around the
border of the objects. This high accuracy was obtained with a training set of
only 30 labeled images, paving the way for fast adaptation to various
application scenarios.
Comment: Presented at SPIE Security + Defence 2016, Proc. SPIE 9997, Target and
Background Signatures I
General Dynamic Scene Reconstruction from Multiple View Video
This paper introduces a general approach to dynamic scene reconstruction from
multiple moving cameras without prior knowledge or limiting constraints on the
scene structure, appearance, or illumination. Existing techniques for dynamic
scene reconstruction from multiple wide-baseline camera views primarily focus
on accurate reconstruction in controlled environments, where the cameras are
fixed and calibrated and background is known. These approaches are not robust
for general dynamic scenes captured with sparse moving cameras. Previous
approaches for outdoor dynamic scene reconstruction assume prior knowledge of
the static background appearance and structure. The primary contributions of
this paper are twofold: an automatic method for initial coarse dynamic scene
segmentation and reconstruction without prior knowledge of background
appearance or structure; and a general robust approach for joint segmentation
refinement and dense reconstruction of dynamic scenes from multiple
wide-baseline static or moving cameras. Evaluation is performed on a variety of
indoor and outdoor scenes with cluttered backgrounds and multiple dynamic
non-rigid objects such as people. Comparison with state-of-the-art approaches
demonstrates improved accuracy in both multiple view segmentation and dense
reconstruction. The proposed approach also eliminates the requirement for prior
knowledge of scene structure and appearance.
Object-based 2D-to-3D video conversion for effective stereoscopic content generation in 3D-TV applications
Three-dimensional television (3D-TV) has gained increasing popularity in the broadcasting domain, as it enables enhanced viewing experiences in comparison to conventional two-dimensional (2D) TV. However, its application has been constrained by the lack of essential content, i.e., stereoscopic videos. To alleviate this content shortage, an economical and practical solution is to reuse the huge media resources that are available in monoscopic 2D and convert them to stereoscopic 3D. Although stereoscopic video can be generated from monoscopic sequences using depth measurements extracted from cues like focus blur, motion and size, the quality of the resulting video may be poor, as such measurements are usually arbitrarily defined and appear inconsistent with the real scenes. To help solve this problem, a novel method for object-based stereoscopic video generation is proposed which features i) optical-flow based occlusion reasoning in determining depth ordinal, ii) object segmentation using improved region-growing from masks of determined depth layers, and iii) a hybrid depth estimation scheme using content-based matching (inside a small library of true stereo image pairs) and depth-ordinal based regularization. Comprehensive experiments have validated the effectiveness of our proposed 2D-to-3D conversion method in generating stereoscopic videos of consistent depth measurements for 3D-TV applications.
Geometry meets semantics for semi-supervised monocular depth estimation
Depth estimation from a single image represents a very exciting challenge in
computer vision. While other image-based depth sensing techniques leverage
the geometry between different viewpoints (e.g., stereo or structure from
motion), the lack of these cues within a single image renders the monocular
depth estimation task ill-posed. For inference, state-of-the-art
encoder-decoder architectures for monocular depth estimation rely on effective
feature representations learned at training time. For unsupervised training of
these models, geometry has been effectively exploited by suitable image
warping losses computed from views acquired by a stereo rig or a moving camera.
In this paper, we take a further step forward, showing that learning semantic
information from images also makes it possible to effectively improve monocular
depth estimation. In particular, by leveraging semantically labeled images
together with unsupervised signals gained by geometry through an image warping
loss, we propose a deep learning approach aimed at joint semantic segmentation
and depth estimation. Our overall learning framework is semi-supervised, as we
deploy groundtruth data only in the semantic domain. At training time, our
network learns a common feature representation for both tasks and a novel
cross-task loss function is proposed. The experimental findings show that
jointly tackling depth prediction and semantic segmentation improves
depth estimation accuracy. In particular, on the KITTI dataset our network
outperforms state-of-the-art methods for monocular depth estimation.
Comment: 16 pages, Accepted to ACCV 201