64 research outputs found
Learning Video Object Segmentation with Visual Memory
This paper addresses the task of segmenting moving objects in unconstrained
videos. We introduce a novel two-stream neural network with an explicit memory
module to achieve this. The two streams of the network encode spatial and
temporal features in a video sequence respectively, while the memory module
captures the evolution of objects over time. The module to build a "visual
memory" in video, i.e., a joint representation of all the video frames, is
realized with a convolutional recurrent unit learned from a small number of
training video sequences. Given a video frame as input, our approach assigns
each pixel an object or background label based on the learned spatio-temporal
features as well as the "visual memory" specific to the video, acquired
automatically without any manually-annotated frames. The visual memory is
implemented with convolutional gated recurrent units, which allows to propagate
spatial information over time. We evaluate our method extensively on two
benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show
state-of-the-art results. For example, our approach outperforms the top method
on the DAVIS dataset by nearly 6%. We also provide an extensive ablative
analysis to investigate the influence of each component in the proposed
framework
Deep Lidar CNN to Understand the Dynamics of Moving Vehicles
Perception technologies in Autonomous Driving are experiencing their golden
age due to the advances in Deep Learning. Yet, most of these systems rely on
the semantically rich information of RGB images. Deep Learning solutions
applied to the data of other sensors typically mounted on autonomous cars (e.g.
lidars or radars) are not explored much. In this paper we propose a novel
solution to understand the dynamics of moving vehicles of the scene from only
lidar information. The main challenge of this problem stems from the fact that
we need to disambiguate the proprio-motion of the 'observer' vehicle from that
of the external 'observed' vehicles. For this purpose, we devise a CNN
architecture which at testing time is fed with pairs of consecutive lidar
scans. However, in order to properly learn the parameters of this network,
during training we introduce a series of so-called pretext tasks which also
leverage on image data. These tasks include semantic information about
vehicleness and a novel lidar-flow feature which combines standard image-based
optical flow with lidar scans. We obtain very promising results and show that
including distilled image information only during training, allows improving
the inference results of the network at test time, even when image data is no
longer used.Comment: Presented in IEEE ICRA 2018. IEEE Copyrights: Personal use of this
material is permitted. Permission from IEEE must be obtained for all other
uses. (V2 just corrected comments on arxiv submission
RVOS: end-to-end recurrent network for video object segmentation
Multiple object video object segmentation is a challenging task, specially for the zero-shot case, when no object mask is given at the initial frame and the model has to find the objects to be segmented along the sequence. In our work, we propose a Recurrent network for multiple object Video Object Segmentation (RVOS) that is fully end-to-end trainable. Our model incorporates recurrence on two different domains: (i) the spatial, which allows to discover the different object instances within a frame, and (ii) the temporal, which allows to keep the coherence of the segmented objects along time. We train RVOS for zero-shot video object segmentation and are the first ones to report quantitative results for DAVIS-2017 and YouTube-VOS benchmarks. Further, we adapt RVOS for one-shot video object segmentation by using the masks obtained in previous time steps as inputs to be processed by the recurrent module. Our model reaches comparable results to state-of-the-art techniques in YouTube-VOS benchmark and outperforms all previous video object segmentation methods not using online learning in the DAVIS-2017 benchmark. Moreover, our model achieves faster inference runtimes than previous methods, reaching 44ms/frame on a P100 GPU.Peer ReviewedPostprint (published version
Online Adaptation of Convolutional Neural Networks for Video Object Segmentation
We tackle the task of semi-supervised video object segmentation, i.e.
segmenting the pixels belonging to an object in the video using the ground
truth pixel mask for the first frame. We build on the recently introduced
one-shot video object segmentation (OSVOS) approach which uses a pretrained
network and fine-tunes it on the first frame. While achieving impressive
performance, at test time OSVOS uses the fine-tuned network in unchanged form
and is not able to adapt to large changes in object appearance. To overcome
this limitation, we propose Online Adaptive Video Object Segmentation (OnAVOS)
which updates the network online using training examples selected based on the
confidence of the network and the spatial configuration. Additionally, we add a
pretraining step based on objectness, which is learned on PASCAL. Our
experiments show that both extensions are highly effective and improve the
state of the art on DAVIS to an intersection-over-union score of 85.7%.Comment: Accepted at BMVC 2017. This version contains minor changes for the
camera ready versio
- …