The 2017 DAVIS Challenge on Video Object Segmentation
We present the 2017 DAVIS Challenge on Video Object Segmentation, a public
dataset, benchmark, and competition specifically designed for the task of video
object segmentation. Following in the footsteps of other successful initiatives,
such as ILSVRC and PASCAL VOC, which established the avenue of research in the
fields of scene classification and semantic segmentation, the DAVIS Challenge
comprises a dataset, an evaluation methodology, and a public competition with a
dedicated workshop co-located with CVPR 2017. The DAVIS Challenge follows up on
the recent publication of DAVIS (Densely-Annotated VIdeo Segmentation), which
has fostered the development of several novel state-of-the-art video object
segmentation techniques. In this paper we describe the scope of the benchmark,
highlight the main characteristics of the dataset, define the evaluation
metrics of the competition, and present a detailed analysis of the results of
the challenge participants.
Comment: Challenge website: http://davischallenge.or
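The J (region similarity) and F (contour accuracy) measures that the DAVIS evaluation methodology reports throughout these abstracts can be illustrated with a minimal sketch. This is a simplification, not the official protocol: the real F measure matches boundary pixels within a small tolerance band, which is omitted here, and all function names are my own.

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary(mask):
    """Crude boundary extraction: object pixels with a differing 4-neighbour."""
    m = mask.astype(bool)
    b = np.zeros_like(m)
    b[:-1, :] |= m[:-1, :] != m[1:, :]
    b[:, :-1] |= m[:, :-1] != m[:, 1:]
    return b & m

def f_measure(pred, gt):
    """Contour accuracy F: F1 score between the two boundary pixel sets."""
    pb, gb = boundary(pred), boundary(gt)
    if pb.sum() == 0 or gb.sum() == 0:
        return float(pb.sum() == gb.sum())
    precision = (pb & gb).sum() / pb.sum()
    recall = (pb & gb).sum() / gb.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The benchmark's headline number is the mean of J and F over all objects.
pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
jf = 0.5 * (jaccard(pred, gt) + f_measure(pred, gt))
```

With identical masks both terms are 1, so the mean J&F is 1.0; real methods are scored by averaging these two quantities over every annotated object and frame.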
The 2018 DAVIS Challenge on Video Object Segmentation
We present the 2018 DAVIS Challenge on Video Object Segmentation, a public
competition specifically designed for the task of video object segmentation. It
builds upon the DAVIS 2017 dataset, which was presented in the previous edition
of the DAVIS Challenge, and added 100 videos with multiple objects per sequence
to the original DAVIS 2016 dataset. Motivated by the analysis of the results of
the 2017 edition, the main track of the competition will be the same as in
the previous edition (segmentation given the full mask of the objects in the
first frame -- semi-supervised scenario). This edition, however, also adds an
interactive segmentation teaser track, where the participants will interact
with a web service simulating the input of a human that provides scribbles to
iteratively improve the result.
Comment: Challenge website: http://davischallenge.org
Video Object Segmentation using Tracked Object Proposals
We present an approach to semi-supervised video object segmentation, in the
context of the DAVIS 2017 challenge. Our approach combines category-based
object detection, category-independent object appearance segmentation and
temporal object tracking. We are motivated by the fact that the object's
semantic category tends not to change throughout the video, while its appearance
and location can vary considerably. In order to capture the specific object
appearance independent of its category, for each video we train a fully
convolutional network using augmentations of the given annotated frame. We
refine the appearance segmentation mask with the bounding boxes provided either
by a semantic object detection network, when applicable, or by a previous frame
prediction. By introducing a temporal continuity constraint on the detected
boxes, we are able to improve the object segmentation mask of the appearance
network and achieve competitive results on the DAVIS datasets.
Comment: All authors contributed equally, CVPR-2017 workshop, DAVIS-2017 Challenge
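The temporal continuity constraint on the detected boxes can be sketched as a simple IoU gate between the previous-frame box and the current detections. The threshold and the fall-back behaviour are illustrative assumptions, not the authors' exact procedure.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def filter_detections(prev_box, detections, min_iou=0.3):
    """Keep only detections that overlap the previous-frame box;
    fall back to the previous box when none survive the gate."""
    kept = [d for d in detections if box_iou(prev_box, d) >= min_iou]
    return max(kept, key=lambda d: box_iou(prev_box, d)) if kept else prev_box
```

Gating like this suppresses spurious far-away detections of the same semantic category, which is the role the temporal continuity constraint plays in the pipeline described above.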
PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation
We address semi-supervised video object segmentation, the task of
automatically generating accurate and consistent pixel masks for objects in a
video sequence, given the first-frame ground truth annotations. Towards this
goal, we present the PReMVOS algorithm (Proposal-generation, Refinement and
Merging for Video Object Segmentation). Our method separates this problem into
two steps, first generating a set of accurate object segmentation mask
proposals for each video frame and then selecting and merging these proposals
into accurate and temporally consistent pixel-wise object tracks over a video
sequence in a way which is designed to specifically tackle the difficult
challenges involved with segmenting multiple objects across a video sequence.
Our approach surpasses all previous state-of-the-art results on the DAVIS 2017
video object segmentation benchmark with a J & F mean score of 71.6 on the
test-dev dataset, and achieves first place in both the DAVIS 2018 Video Object
Segmentation Challenge and the YouTube-VOS 1st Large-scale Video Object
Segmentation Challenge.
Comment: Accepted for publication in ACCV1
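The second step, selecting and merging per-frame proposals into consistent tracks, can be caricatured as a greedy linker that trades off a proposal's own confidence against its overlap with the track so far. The weighting and function names here are assumptions for illustration; the actual PReMVOS merging objective is richer than this.

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_proposals(first_mask, per_frame_proposals, w_temporal=0.5):
    """Greedily extend a track: in each frame pick the proposal whose score
    combines its own confidence with its overlap with the previous mask.
    per_frame_proposals: list (over frames) of lists of (mask, confidence)."""
    track = [first_mask]
    for proposals in per_frame_proposals:
        best = max(proposals,
                   key=lambda p: (1 - w_temporal) * p[1]
                                 + w_temporal * mask_iou(track[-1], p[0]))
        track.append(best[0])
    return track
```

The temporal term is what keeps a high-confidence but spatially inconsistent proposal from hijacking the track, which is the failure mode the merging step is designed to avoid.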
UnOVOST: Unsupervised Offline Video Object Segmentation and Tracking
We address Unsupervised Video Object Segmentation (UVOS), the task of
automatically generating accurate pixel masks for salient objects in a video
sequence and of tracking these objects consistently through time, without any
input about which objects should be tracked. Towards solving this task, we
present UnOVOST (Unsupervised Offline Video Object Segmentation and Tracking)
as a simple and generic algorithm which is able to track and segment a large
variety of objects. This algorithm builds up tracks in a number of stages, first
grouping segments into short tracklets that are spatio-temporally consistent,
before merging these tracklets into long-term consistent object tracks based on
their visual similarity. In order to achieve this we introduce a novel
tracklet-based Forest Path Cutting data association algorithm which builds up a
decision forest of track hypotheses before cutting this forest into paths that
form long-term consistent object tracks. When evaluating our approach on the
DAVIS 2017 Unsupervised dataset we obtain state-of-the-art performance with a
mean J&F score of 67.9% on the val, 58% on the test-dev and 56.4% on the
test-challenge benchmarks, obtaining first place in the DAVIS 2019 Unsupervised
Video Object Segmentation Challenge. UnOVOST performs competitively even with
many semi-supervised video object segmentation algorithms, although it is not
given any input as to which objects should be tracked and segmented.
Comment: Accepted for publication at WACV 202
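The first stage, grouping segments into spatio-temporally consistent tracklets, can be sketched as greedy linking by mask IoU across consecutive frames. The threshold and the greedy order are illustrative assumptions; the paper's Forest Path Cutting stage that merges tracklets afterwards is considerably more involved.

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def build_tracklets(frames, tau=0.5):
    """frames: list (over time) of lists of binary masks.
    Link a segment to an existing tracklet when its IoU with that
    tracklet's latest mask exceeds tau; otherwise start a new tracklet.
    Each tracklet absorbs at most one segment per frame."""
    tracklets = []
    for masks in frames:
        used = set()
        for m in masks:
            best, best_iou = None, tau
            for i, t in enumerate(tracklets):
                if i in used:
                    continue
                iou = mask_iou(t[-1], m)
                if iou > best_iou:
                    best, best_iou = i, iou
            if best is None:
                tracklets.append([m])
            else:
                tracklets[best].append(m)
                used.add(best)
    return tracklets
```

Tracklets built this way are short but reliable; the harder problem, which the visual-similarity merging stage addresses, is re-joining them after occlusions break the IoU chain.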
Learning to Segment Instances in Videos with Spatial Propagation Network
We propose a deep learning-based framework for instance-level object
segmentation. Our method mainly consists of three steps. First, we train a
generic model based on ResNet-101 for foreground/background segmentations.
Second, based on this generic model, we fine-tune it to learn instance-level
models and segment individual objects by using augmented object annotations in
first frames of test videos. To distinguish different instances in the same
video, we compute a pixel-level score map for each object from these
instance-level models. Each score map indicates the objectness likelihood and
is only computed within the foreground mask obtained in the first step. To
further refine this per frame score map, we learn a spatial propagation
network. This network aims to learn how to propagate a coarse segmentation mask
spatially based on the pairwise similarities in each frame. In addition, we
apply a filter on the refined score map that aims to recognize the best
connected region using spatial and temporal consistencies in the video.
Finally, we decide the instance-level object segmentation in each video by
comparing score maps of different instances.
Comment: CVPR 2017 Workshop on DAVIS Challenge. Code is available at
http://github.com/JingchunCheng/Seg-with-SP
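The final step, deciding the instance label per pixel by comparing the score maps of different instances within the foreground mask, amounts to a masked argmax. A minimal sketch, with array shapes and names assumed:

```python
import numpy as np

def assign_instances(score_maps, foreground):
    """score_maps: (K, H, W) per-instance objectness likelihoods.
    foreground: (H, W) boolean mask from the generic foreground model.
    Label each foreground pixel with its highest-scoring instance (1..K);
    background pixels keep label 0."""
    labels = np.zeros(foreground.shape, dtype=np.int64)
    labels[foreground] = np.argmax(score_maps, axis=0)[foreground] + 1
    return labels
```

Restricting the argmax to the foreground mask is exactly what the abstract describes: the instance-level models only compete inside regions the generic model already considers object-like.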
Self-supervised Video Object Segmentation
The objective of this paper is self-supervised representation learning, with
the goal of solving semi-supervised video object segmentation (a.k.a. dense
tracking). We make the following contributions: (i) we propose to improve the
existing self-supervised approach, with a simple, yet more effective memory
mechanism for long-term correspondence matching, which resolves the challenge
caused by the disappearance and reappearance of objects; (ii) by augmenting
the self-supervised approach with an online adaptation module, our method
successfully alleviates tracker drifts caused by spatial-temporal
discontinuity, e.g. occlusions, dis-occlusions, or fast motion; (iii) we
explore the efficiency of self-supervised representation learning for dense
tracking; surprisingly, we show that a powerful tracking model can be trained
with as few as 100 raw video clips (equivalent to a duration of 11 minutes),
indicating that low-level statistics have already been effective for tracking
tasks; (iv) we demonstrate state-of-the-art results among the self-supervised
approaches on DAVIS-2017 and YouTube-VOS, as well as surpassing most methods
trained with millions of manual segmentation annotations, further bridging the
gap between self-supervised and supervised learning. Code is released to
foster further research (https://github.com/fangruizhu/self_sup_semiVOS).
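The memory mechanism for long-term correspondence matching can be sketched as soft label copying through a feature-affinity softmax, the standard formulation in self-supervised dense tracking. The shapes, names, and temperature value below are illustrative, not taken from the paper.

```python
import numpy as np

def propagate_labels(query_feat, mem_feat, mem_labels, temperature=0.1):
    """query_feat: (N, D) per-pixel features of the current frame.
    mem_feat: (M, D) features of pixels stored in the memory.
    mem_labels: (M, K) one-hot object labels of those memory pixels.
    Returns (N, K) soft labels copied from visually similar memory pixels."""
    sim = query_feat @ mem_feat.T / temperature   # (N, M) affinities
    sim -= sim.max(axis=1, keepdims=True)         # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
    return attn @ mem_labels                      # label propagation
```

Keeping features of earlier frames in the memory is what lets a reappearing object recover its label: its pixels still match the stored features even after frames of total occlusion.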
FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation
Many of the recent successful methods for video object segmentation (VOS) are
overly complicated, heavily rely on fine-tuning on the first frame, and/or are
slow, and are hence of limited practical use. In this work, we propose FEELVOS
as a simple and fast method which does not rely on fine-tuning. In order to
segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding
together with a global and a local matching mechanism to transfer information
from the first frame and from the previous frame of the video to the current
frame. In contrast to previous work, our embedding is only used as an internal
guidance of a convolutional network. Our novel dynamic segmentation head allows
us to train the network, including the embedding, end-to-end for the multiple
object segmentation task with a cross entropy loss. We achieve a new state of
the art in video object segmentation without fine-tuning with a J&F measure of
71.5% on the DAVIS 2017 validation set. We make our code and models available
at https://github.com/tensorflow/models/tree/master/research/feelvos.
Comment: CVPR 2019 camera-ready version
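The global matching mechanism can be sketched as computing, for every current-frame pixel, the distance to its nearest first-frame pixel belonging to the object, yielding a distance map that guides the segmentation head. Shapes and names below are assumptions; the brute-force nearest-neighbour search is only for illustration.

```python
import numpy as np

def global_match(cur_emb, ref_emb, ref_mask):
    """cur_emb: (N, D) pixel embeddings of the current frame.
    ref_emb: (M, D) pixel embeddings of the first frame.
    ref_mask: (M,) boolean, True for pixels of the target object.
    Returns (N,) distance of each current pixel to the nearest object pixel."""
    obj = ref_emb[ref_mask]                                  # (M_obj, D)
    d = np.linalg.norm(cur_emb[:, None, :] - obj[None, :, :], axis=2)
    return d.min(axis=1)                                     # distance map
```

FEELVOS feeds such distance maps (computed globally against the first frame and locally against the previous frame) into the segmentation head as internal guidance rather than thresholding them into a hard decision.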
Fast User-Guided Video Object Segmentation by Interaction-and-Propagation Networks
We present a deep learning method for interactive video object
segmentation. Our method is built upon two core operations, interaction and
propagation, and each operation is conducted by Convolutional Neural Networks.
The two networks are connected both internally and externally so that the
networks are trained jointly and interact with each other to solve the complex
video object segmentation problem. We propose a new multi-round training scheme
for interactive video object segmentation so that the networks can learn
how to understand the user's intention and update incorrect estimations during
the training. At test time, our method produces high-quality results and
also runs fast enough to work with users interactively. We evaluated the
proposed method quantitatively on the interactive track benchmark at the DAVIS
Challenge 2018. We outperformed other competing methods by a significant margin
in both speed and accuracy. We also demonstrated that our method works well
with real user interactions.
Comment: CVPR 201
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames
Semi-supervised video object segmentation has made significant progress on
real and challenging videos in recent years. The current paradigm for
segmentation methods and benchmark datasets is to segment objects in a video
given a single annotation in the first frame. However, we find that
segmentation performance across the entire video varies dramatically when
selecting an alternative frame for annotation. This paper addresses the problem
of learning to suggest the single best frame across the video for user
annotation; this is, in fact, never the first frame of the video. We achieve this by
introducing BubbleNets, a novel deep sorting network that learns to select
frames using a performance-based loss function that enables the conversion of
expansive amounts of training examples from already existing datasets. Using
BubbleNets, we are able to achieve an 11% relative improvement in segmentation
performance on the DAVIS benchmark without any changes to the underlying method
of segmentation.
Comment: CVPR 201
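The deep sorting idea can be sketched as a bubble sort over frames driven by a learned pairwise comparator. Here `better(a, b)` is a hypothetical stand-in for the network's prediction that annotating frame a would yield better segmentation than annotating frame b; only the sorting scaffold is shown.

```python
def bubble_rank(frames, better):
    """Bubble-sort frames by predicted annotation quality.
    better(a, b): True if frame a is predicted to be the better
    annotation frame. After sorting, order[0] is the suggestion."""
    order = list(frames)
    n = len(order)
    for i in range(n):
        for j in range(n - 1 - i):
            if better(order[j + 1], order[j]):
                order[j], order[j + 1] = order[j + 1], order[j]
    return order
```

Using only relative comparisons sidesteps the need to regress an absolute quality score per frame, which is much harder to supervise; any consistent comparator yields a full ranking.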