Unsupervised Deep Context Prediction for Background Foreground Separation
In many advanced video-based applications, background modeling is a pre-processing step used to eliminate redundant data, for instance in tracking or video surveillance. Over the past years, background subtraction has usually been based on low-level or hand-crafted features such as raw color components, gradients, or local binary patterns. The performance of background subtraction algorithms suffers in the presence of challenges such as dynamic backgrounds, photometric variations, camera jitter, and shadows. To handle these challenges and achieve accurate background modeling, we propose a unified framework based on image inpainting: an unsupervised, hybrid generative adversarial algorithm for visual feature learning based on context prediction. We also present a solution for random region inpainting that fuses center region inpainting and random region inpainting using Poisson blending. Furthermore, we evaluate foreground object detection by combining our proposed method with morphological operations. A comparison with 12 state-of-the-art methods shows the stability of our method for background estimation and foreground detection.
Comment: 17 pages
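The fusion and post-processing steps described above can be sketched compactly. Below is a minimal, hedged illustration in Python with OpenCV (not the authors' code): a patch produced by an inpainting model is blended into the current background estimate with Poisson (seamless) cloning, and the thresholded frame/background difference is cleaned up with morphological operations. Function names, shapes, and the threshold value are assumptions for illustration.

import cv2
import numpy as np

def fuse_inpainted_region(background, inpainted_patch, region_mask, center):
    # Poisson-blend an inpainted patch into the background estimate.
    # region_mask is an 8-bit mask of the inpainted region, center its (x, y) location.
    return cv2.seamlessClone(inpainted_patch, background, region_mask,
                             center, cv2.NORMAL_CLONE)

def foreground_mask(frame, background, thresh=30):
    # Threshold the frame/background difference, then remove speckle noise and
    # fill small holes with morphological opening and closing.
    diff = cv2.absdiff(frame, background)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask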
Predicting the Future with Transformational States
An intelligent observer looks at the world and sees not only what is, but
what is moving and what can be moved. In other words, the observer sees how the
present state of the world can transform in the future. We propose a model that
predicts future images by learning to represent the present state and its
transformation given only a sequence of images. To do so, we introduce an
architecture with a latent state composed of two components designed to capture
(i) the present image state and (ii) the transformation between present and
future states, respectively. We couple this latent state with a recurrent neural network (RNN) core that predicts future frames by applying the accumulated state transformation to past states with a learned operator. We describe how this model can be integrated into an
encoder-decoder convolutional neural network (CNN) architecture that uses
weighted residual connections to integrate representations of the past with
representations of the future. Qualitatively, our approach generates image
sequences that are stable and capture realistic motion over multiple predicted
frames, without requiring adversarial training. Quantitatively, our method
achieves prediction results comparable to state-of-the-art results on standard
image prediction benchmarks (Moving MNIST, KTH, and UCF101).
Comment: 24 pages, including supplementary material
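The transformational-state idea can be illustrated with a small PyTorch sketch. The recurrent cell, the bilinear operator, and the dimensions below are assumptions rather than the paper's architecture; the point is only the structure of a latent state split into a content part and an accumulated transformation that is applied by a learned operator.

import torch
import torch.nn as nn

class TransformRNNCore(nn.Module):
    # Latent state = (content state, accumulated transformation). Each step updates the
    # transformation with a recurrent cell and applies it to the content state via a
    # learned (here: bilinear) operator to produce the next predicted state.
    def __init__(self, dim):
        super().__init__()
        self.update_transform = nn.GRUCell(dim, dim)       # accumulates the transformation
        self.apply_transform = nn.Bilinear(dim, dim, dim)  # learned operator T(state)

    def forward(self, state, transform, steps):
        predictions = []
        for _ in range(steps):
            transform = self.update_transform(state, transform)   # accumulate
            state = self.apply_transform(state, transform)        # present -> future
            predictions.append(state)
        return torch.stack(predictions, dim=1)

# Usage with hypothetical dimensions: an encoder (not shown) would produce the present
# state and the transformation between the two most recent frames; a decoder (not shown)
# would map each predicted state back to an image.
core = TransformRNNCore(dim=128)
state, transform = torch.randn(4, 128), torch.randn(4, 128)
future_states = core(state, transform, steps=5)   # (4, 5, 128)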
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our models is a two-stream architecture in which we investigate different fusion mechanisms for integrating spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
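As a rough illustration of the two-stream design with different fusion mechanisms, here is a minimal PyTorch sketch (layer sizes are illustrative, not the paper's networks): a spatial stream over the RGB frame and a temporal stream over optical flow, fused either by channel concatenation or element-wise summation before the saliency head.

import torch
import torch.nn as nn

class TwoStreamSaliency(nn.Module):
    def __init__(self, fusion="concat"):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())   # RGB stream
        self.temporal = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU())  # flow stream
        self.fusion = fusion
        self.head = nn.Conv2d(64 if fusion == "concat" else 32, 1, 1)             # saliency map

    def forward(self, rgb, flow):
        s, t = self.spatial(rgb), self.temporal(flow)
        fused = torch.cat([s, t], dim=1) if self.fusion == "concat" else s + t
        return torch.sigmoid(self.head(fused))

# saliency = TwoStreamSaliency("sum")(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64))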
Unsupervised Learning of Dense Optical Flow, Depth and Egomotion from Sparse Event Data
In this work we present a lightweight, unsupervised learning pipeline for
\textit{dense} depth, optical flow and egomotion estimation from sparse event
output of the Dynamic Vision Sensor (DVS). To tackle this low-level vision task, we use a novel encoder-decoder neural network architecture, ECN.
Our work is the first monocular pipeline that generates dense depth and
optical flow from sparse event data only. The network works in self-supervised
mode and has just 150k parameters. We evaluate our pipeline on the MVSEC self-driving dataset and present results for depth, optical flow, and egomotion
estimation. Due to the lightweight design, the inference part of the network
runs at 250 FPS on a single GPU, making the pipeline ready for realtime
robotics applications. Our experiments demonstrate significant improvements
upon previous works that used deep learning on event data, as well as the
ability of our pipeline to perform well during both day and night.
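For intuition about the shape of such a lightweight pipeline, here is a compact encoder-decoder sketch in PyTorch with a shared encoder over an event-count image and separate heads for dense depth and optical flow. Channel counts and the single-channel event representation are assumptions; this does not reproduce the actual ECN architecture.

import torch
import torch.nn as nn

class TinyEventNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared features from event counts
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.depth_head = nn.Sequential(                   # dense depth decoder
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))
        self.flow_head = nn.Sequential(                    # dense optical flow decoder
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 2, 4, stride=2, padding=1))

    def forward(self, events):
        feat = self.encoder(events)
        return self.depth_head(feat), self.flow_head(feat)

# depth, flow = TinyEventNet()(torch.randn(1, 1, 128, 128))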
Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Learning to estimate 3D geometry in a single frame and optical flow from consecutive frames by watching unlabeled videos with deep convolutional networks has made significant progress recently. Current state-of-the-art (SoTA) methods treat the two tasks independently. One typical assumption of existing depth estimation methods is that the scenes contain no independently moving objects, while object motion can easily be modeled using optical flow. In this paper,
we propose to address the two tasks as a whole, i.e. to jointly understand
per-pixel 3D geometry and motion. This eliminates the need for a static-scene assumption and enforces the inherent geometric consistency during the learning process, yielding significantly improved results for both tasks. We call our method "Every Pixel Counts++" or "EPC++". Specifically, during
training, given two consecutive frames from a video, we adopt three parallel
networks to predict the camera motion (MotionNet), dense depth map (DepthNet),
and per-pixel optical flow between two frames (OptFlowNet) respectively. The
three types of information are fed into a holistic 3D motion parser (HMP), and
per-pixel 3D motion of both the rigid background and moving objects is disentangled and recovered. Comprehensive experiments were conducted on datasets with different scenes, including driving scenarios (KITTI 2012 and
KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D) and synthetic
animation (MPI Sintel dataset). Performance on the five tasks of depth
estimation, optical flow estimation, odometry, moving object segmentation and
scene flow estimation shows that our approach outperforms other SoTA methods.
Code will be available at: https://github.com/chenxuluo/EPC.
Comment: Chenxu Luo, Zhenheng Yang, and Peng Wang contributed equally; TPAMI submission
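The core geometric step that a holistic 3D motion parser relies on, computing the rigid-background flow induced by camera motion from predicted depth and pose, can be sketched as follows. Shapes and conventions (4x4 pose matrix, 3x3 intrinsics K, pixel-space flow) are assumptions of this illustration, not the EPC++ code.

import torch

def rigid_flow(depth, pose, K):
    # Back-project pixels with the predicted depth, move them by the predicted camera
    # motion, re-project, and take the pixel displacement as the rigid-background flow.
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1)   # homogeneous pixels
    rays = torch.linalg.inv(K) @ pix
    pts = rays.unsqueeze(0) * depth.reshape(B, 1, -1)                    # 3D points, frame 1
    pts = torch.cat([pts, torch.ones(B, 1, H * W)], 1)
    pts2 = (pose @ pts)[:, :3]                                           # points in frame 2
    proj = K @ pts2
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                      # re-projected pixels
    return (uv - pix[:2].unsqueeze(0)).reshape(B, 2, H, W)

# Disentangling in the spirit of the HMP: whatever the full optical flow explains beyond
# the rigid flow is attributed to independently moving objects.
# object_flow = full_flow - rigid_flow(depth, pose, K)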
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Large-scale labeled data are generally required to train deep neural networks
in order to obtain better performance in visual feature learning from images or
videos for computer vision applications. To avoid extensive cost of collecting
and annotating large-scale datasets, as a subset of unsupervised learning
methods, self-supervised learning methods are proposed to learn general image
and video features from large-scale unlabeled data without using any
human-annotated labels. This paper provides an extensive review of deep
learning-based self-supervised general visual feature learning methods from
images or videos. First, the motivation, general pipeline, and terminologies of
this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the main
components and evaluation metrics of self-supervised learning methods are
reviewed followed by the commonly used image and video datasets and the
existing self-supervised visual feature learning methods. Finally, quantitative
performance comparisons of the reviewed methods on benchmark datasets are
summarized and discussed for both image and video feature learning. Lastly, the paper concludes with a set of promising future directions for self-supervised visual feature learning.
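To make the self-supervised setting concrete, here is one generic pretext task of the kind such surveys cover (rotation prediction), sketched in PyTorch. The backbone and dimensions are hypothetical; the pseudo-label (the applied rotation) comes for free from the unlabeled image.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationPretext(nn.Module):
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone                   # any feature extractor (assumed)
        self.classifier = nn.Linear(feat_dim, 4)   # predict 0 / 90 / 180 / 270 degrees

    def forward(self, images):                     # square, unlabeled images (B, 3, H, H)
        k = torch.randint(0, 4, (images.size(0),))
        rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                               for img, r in zip(images, k)])
        return F.cross_entropy(self.classifier(self.backbone(rotated)), k)

# Usage with a toy backbone (illustrative):
# backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
#                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
# loss = RotationPretext(backbone, feat_dim=8)(torch.randn(4, 3, 32, 32))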
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Different from image analysis, motion and temporal information is a crucial
factor affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been extensively researched, they fail to simultaneously consider inter-frame motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
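The idea of gating the current frame's prediction with memory from the previous frame and motion from the frame difference can be sketched as below. The gating scheme here is an illustrative stand-in for the paper's step-gained formulation, and all layer sizes are assumptions.

import torch
import torch.nn as nn

class MemoryMotionFusion(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.appearance = nn.Conv2d(3, ch, 3, padding=1)   # current-frame features
        self.motion = nn.Conv2d(3, ch, 3, padding=1)       # frame-difference (motion) features
        self.gate = nn.Conv2d(2 * ch + 1, 1, 1)            # mixes features with the memory map
        self.head = nn.Conv2d(2 * ch, 1, 1)

    def forward(self, frame, prev_frame, prev_saliency):
        feat = torch.cat([torch.relu(self.appearance(frame)),
                          torch.relu(self.motion(frame - prev_frame))], dim=1)
        gate = torch.sigmoid(self.gate(torch.cat([feat, prev_saliency], dim=1)))
        current = torch.sigmoid(self.head(feat))
        return gate * current + (1 - gate) * prev_saliency   # memory-gated saliency map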
Learning by Inertia: Self-supervised Monocular Visual Odometry for Road Vehicles
In this paper, we present iDVO (inertia-embedded deep visual odometry), a
self-supervised learning based monocular visual odometry (VO) for road
vehicles. When modelling the geometric consistency between adjacent frames, most deep VO methods ignore the temporal continuity of the camera pose, which results in severe jagged fluctuations in the velocity curves. Observing that road vehicles exhibit smooth dynamics most of the time, we design an inertia loss function that describes abnormal motion variation and helps the model learn the continuity of long-term camera ego-motion. Based on the recurrent convolutional neural
network (RCNN) architecture, our method implicitly models the dynamics of road
vehicles and the temporal consecutiveness by the extended Long Short-Term
Memory (LSTM) block. Furthermore, we develop a dynamic hard-edge mask that handles the non-consistency caused by fast camera motion by blocking out the boundary regions, making the overall non-consistency mask more efficient. The
proposed method is evaluated on the KITTI dataset, and the results demonstrate
state-of-the-art performance with respect to other monocular deep VO and SLAM
approaches.
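The inertia idea translates directly into a simple smoothness term over consecutive ego-motion predictions. The exact loss used in iDVO may differ; the sketch below, with an assumed (B, T, 6) pose parameterization and an assumed weighting, only illustrates penalizing abrupt motion variation.

import torch

def inertia_loss(poses):
    # poses: (B, T, 6) per-step ego-motion (translation + rotation parameters).
    # Penalize abrupt changes between adjacent predictions, i.e. non-smooth dynamics.
    return (poses[:, 1:] - poses[:, :-1]).abs().mean()

# total_loss = photometric_loss + 0.1 * inertia_loss(predicted_poses)   # weight is an assumption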
S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation
We propose a sequential variational autoencoder to learn disentangled
representations of sequential data (e.g., videos and audios) under
self-supervision. Specifically, we exploit the benefits of some readily
accessible supervisory signals from input data itself or some off-the-shelf
functional models and accordingly design auxiliary tasks for our model to
utilize these signals. With the supervision of the signals, our model can
easily disentangle the representation of an input sequence into static factors
and dynamic factors (i.e., time-invariant and time-varying parts).
Comprehensive experiments across videos and audios verify the effectiveness of
our model on representation disentanglement and generation of sequential data,
and demonstrate that our model with self-supervision performs comparably to, if not better than, the fully supervised model with ground-truth labels, and outperforms state-of-the-art unsupervised models by a large margin.
Comment: to appear in CVPR 2020
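The static/dynamic factorization can be illustrated with a minimal encoder sketch in PyTorch: one time-invariant code per sequence and one time-varying code per frame. The pooling, GRU, and dimensions are simplifications and assumptions, not the S3VAE architecture.

import torch
import torch.nn as nn

class StaticDynamicEncoder(nn.Module):
    def __init__(self, feat_dim=64, z_static=16, z_dynamic=16):
        super().__init__()
        self.static_head = nn.Linear(feat_dim, 2 * z_static)     # mu, logvar of static factor
        self.dynamic_rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.dynamic_head = nn.Linear(feat_dim, 2 * z_dynamic)   # mu, logvar per frame

    def forward(self, frame_feats):            # (B, T, feat_dim) from a frame encoder (assumed)
        mu_s, logvar_s = self.static_head(frame_feats.mean(dim=1)).chunk(2, dim=-1)
        h, _ = self.dynamic_rnn(frame_feats)
        mu_d, logvar_d = self.dynamic_head(h).chunk(2, dim=-1)
        z_static = mu_s + torch.randn_like(mu_s) * (0.5 * logvar_s).exp()    # reparameterize
        z_dynamic = mu_d + torch.randn_like(mu_d) * (0.5 * logvar_d).exp()
        return z_static, z_dynamic             # time-invariant and time-varying factors

# z_s, z_d = StaticDynamicEncoder()(torch.randn(2, 8, 64))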
GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
We propose GeoNet, an unsupervised learning framework that jointly estimates monocular depth, optical flow, and camera ego-motion from videos. The three components
are coupled by the nature of 3D scene geometry, jointly learned by our
framework in an end-to-end manner. Specifically, geometric relationships are
extracted over the predictions of individual modules and then combined as an
image reconstruction loss, reasoning about static and dynamic scene parts
separately. Furthermore, we propose an adaptive geometric consistency loss to
increase robustness towards outliers and non-Lambertian regions, which resolves
occlusions and texture ambiguities effectively. Experimentation on the KITTI
driving dataset reveals that our scheme achieves state-of-the-art results in
all of the three tasks, performing better than previous unsupervised methods
and comparably with supervised ones.
Comment: Accepted to CVPR 2018; code will be made available at https://github.com/yzcjtr/GeoNe
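The image reconstruction loss that couples the module predictions can be sketched as flow-based view synthesis: warp the source frame toward the target with the predicted flow and compare photometrically. Conventions below (flow in pixels, L1 photometric error, align_corners=True) are assumptions of this sketch, not GeoNet's exact formulation.

import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    # Sample the source image at locations displaced by the predicted flow field.
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).unsqueeze(0) + flow        # target pixel locations
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0                          # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(image, torch.stack([gx, gy], dim=-1), align_corners=True)

def reconstruction_loss(target, source, flow):
    # L1 photometric error between the target frame and the flow-warped source frame.
    return (target - warp_with_flow(source, flow)).abs().mean()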