Combining Background Subtraction Algorithms with Convolutional Neural Network
Accurate and fast extraction of foreground objects is a key prerequisite for a
wide range of computer vision applications such as object tracking and
recognition. Thus, numerous background subtraction methods for foreground
object detection have been proposed in recent decades. However, it is still
regarded as a tough problem due to a variety of challenges such as illumination
variations, camera jitter, dynamic backgrounds, shadows, and so on. Currently,
there is no single method that can handle all the challenges in a robust way.
In this letter, we try to solve this problem from a new perspective by
combining different state-of-the-art background subtraction algorithms to
create a more robust and more advanced foreground detection algorithm. More
specifically, an encoder-decoder fully convolutional neural network
architecture is trained to automatically learn how to leverage the
characteristics of different algorithms to fuse the results produced by
different background subtraction algorithms and output a more precise result.
Comprehensive experiments on the CDnet 2014 dataset demonstrate that
the proposed method outperforms all of the individual background
subtraction algorithms considered. We also show that our solution is more
efficient than other combination strategies.
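As a point of comparison for the fusion idea described above, the following is a minimal sketch of a much simpler combination strategy: pixel-wise majority voting over binary foreground masks. It is not the encoder-decoder network from the abstract, which learns how to weight the algorithms. The function name `majority_vote` and the toy masks are assumptions made for illustration.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def majority_vote(masks: List[Mask]) -> Mask:
    """Fuse binary foreground masks by pixel-wise majority voting.

    Illustrative baseline only; the abstract's method replaces this
    fixed rule with a trained fully convolutional network.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    height, width = len(masks[0]), len(masks[0][0])
    for m in masks:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all masks must have the same shape")

    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            votes = sum(m[y][x] for m in masks)
            # Foreground only if strictly more than half of the inputs agree.
            row.append(1 if 2 * votes > len(masks) else 0)
        fused.append(row)
    return fused


# Hypothetical outputs of three background subtraction algorithms (2x3 frame).
mask_a = [[1, 0, 0], [1, 1, 0]]
mask_b = [[1, 1, 0], [0, 1, 0]]
mask_c = [[0, 0, 0], [1, 1, 1]]

fused = majority_vote([mask_a, mask_b, mask_c])
print(fused)
```

A fixed voting rule treats every algorithm as equally reliable at every pixel, which is the limitation a learned fusion is meant to address.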
BSUV-Net: a fully-convolutional neural network for background subtraction of unseen videos
Background subtraction is a basic task in computer vision and video processing, often applied as a pre-processing step for object tracking, people recognition, etc. Recently, a number of successful background-subtraction algorithms have been proposed; however, nearly all of the top-performing ones are supervised. Crucially, their success relies upon the availability of some annotated frames of the test video during training. Consequently, their performance on completely “unseen” videos is undocumented in the literature. In this work, we propose a new, supervised, background-subtraction algorithm for unseen videos (BSUV-Net) based on a fully-convolutional neural network. The input to our network consists of the current frame and two background frames captured at different time scales along with their semantic segmentation maps. In order to reduce the chance of overfitting, we also introduce a new data-augmentation technique which mitigates the impact of illumination difference between the background frames and the current frame. On the CDNet-2014 dataset, BSUV-Net outperforms state-of-the-art algorithms evaluated on unseen videos in terms of several metrics including F-measure, recall and precision. Accepted manuscript
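The metrics named above (precision, recall and F-measure) can be sketched for a single pair of binary masks as follows. This is a minimal illustration using the standard definitions; the function name `precision_recall_f` and the toy masks are assumptions, and a benchmark's official evaluation may aggregate over frames or treat some pixels differently.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def precision_recall_f(pred: Mask, truth: Mask) -> Tuple[float, float, float]:
    """Compute precision, recall and F-measure for the foreground class."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # foreground predicted and present
            elif p and not t:
                fp += 1  # foreground predicted but absent
            elif t and not p:
                fn += 1  # foreground present but missed

    # Guard against empty denominators by returning 0.0 in that case.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical prediction and ground truth for a 2x3 frame.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]

precision, recall, f_measure = precision_recall_f(pred, truth)
print(precision, recall, f_measure)
```

In this toy case there are two true positives, one false positive and one false negative, so all three values equal 2/3.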
ClusterNet: Detecting Small Objects in Large Scenes by Exploiting Spatio-Temporal Information
Object detection in wide area motion imagery (WAMI) has drawn the attention
of the computer vision research community for a number of years. WAMI poses
a number of unique challenges including extremely small object sizes, both
sparse and densely-packed objects, and extremely large search spaces (large
video frames). Nearly all state-of-the-art methods in WAMI object detection
report that appearance-based classifiers fail in this challenging data and
instead rely almost entirely on motion information in the form of background
subtraction or frame-differencing. In this work, we experimentally verify the
failure of appearance-based classifiers in WAMI, such as Faster R-CNN and a
heatmap-based fully convolutional neural network (CNN), and propose a novel
two-stage spatio-temporal CNN which effectively and efficiently combines both
appearance and motion information to significantly surpass the state-of-the-art
in WAMI object detection. To reduce the large search space, the first stage
(ClusterNet) takes in a set of extremely large video frames, combines the
motion and appearance information within the convolutional architecture, and
proposes regions of objects of interest (ROOBI). These ROOBI can contain from
one to clusters of several hundred objects due to the large video frame size
and varying object density in WAMI. The second stage (FoveaNet) then estimates
the centroid location of all objects in that given ROOBI simultaneously via
heatmap estimation. The proposed method exceeds state-of-the-art results on the
WPAFB 2009 dataset by 5-16% for moving objects and nearly 50% for stopped
objects, as well as being the first proposed method in wide area motion imagery
to detect completely stationary objects. Comment: Main paper is 8 pages.
Supplemental section contains a walk-through of our method (using a
qualitative example) and qualitative results for the WPAFB 2009 dataset.
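The abstract above describes locating object centroids via heatmap estimation. One common way to read centroids off a predicted heatmap is to keep cells that exceed a threshold and are local maxima among their neighbours. The sketch below assumes that approach; the abstract does not specify how FoveaNet post-processes its heatmaps, and the function name `heatmap_peaks`, the threshold value and the toy heatmap are all hypothetical.

```python
from typing import List, Tuple


def heatmap_peaks(
    heatmap: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (row, col) cells that are >= threshold and strictly greater
    than every cell in their 8-neighbourhood.

    Assumed post-processing for illustration; flat plateaus yield no peak
    because the comparison is strict.
    """
    height = len(heatmap)
    width = len(heatmap[0]) if height else 0
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < threshold:
                continue
            # Collect the neighbours that fall inside the heatmap bounds.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value > n for n in neighbours):
                peaks.append((y, x))
    return peaks


# Hypothetical heatmap for one small region containing two objects.
heatmap = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.3, 0.1, 0.2],
    [0.1, 0.3, 0.2, 0.4, 0.8],
    [0.0, 0.1, 0.1, 0.3, 0.4],
]

centroids = heatmap_peaks(heatmap, threshold=0.5)
print(centroids)
```

Because all peaks in a region are extracted in one pass, the number of objects does not need to be known in advance, which matches the abstract's description of estimating all centroids simultaneously.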
Deep Occlusion Reasoning for Multi-Camera Multi-Target Detection
People detection in single 2D images has improved greatly in recent years.
However, comparatively little of this progress has percolated into multi-camera
multi-people tracking algorithms, whose performance still degrades severely
when scenes become very crowded. In this work, we introduce a new architecture
that combines Convolutional Neural Nets and Conditional Random Fields to
explicitly model those ambiguities. One of its key ingredients is high-order
CRF terms that model potential occlusions and give our approach its robustness
even when many people are present. Our model is trained end-to-end and we show
that it outperforms several state-of-the-art algorithms on challenging scenes
- …