The 2018 DAVIS Challenge on Video Object Segmentation
We present the 2018 DAVIS Challenge on Video Object Segmentation, a public
competition specifically designed for the task of video object segmentation. It
builds upon the DAVIS 2017 dataset, which was presented in the previous edition
of the DAVIS Challenge and which added 100 videos with multiple objects per
sequence to the original DAVIS 2016 dataset. Motivated by the analysis of the
results of the 2017 edition, the main track of the competition will be the
same as in the previous edition (segmentation given the full mask of the
objects in the first frame -- the semi-supervised scenario). This edition, however, also adds an
interactive segmentation teaser track, where the participants will interact
with a web service simulating the input of a human that provides scribbles to
iteratively improve the result.
Comment: Challenge website: http://davischallenge.org
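
To make the teaser track's interaction loop concrete, here is a hypothetical client sketch; every name on the `service`, `segmenter`, and `feedback` objects is an illustrative assumption, as the real protocol is defined by the challenge toolkit at davischallenge.org.

```python
# Hypothetical client loop for the interactive track: the web service plays
# the human annotator and returns corrective scribbles after each submission.

def interactive_session(service, segmenter, sequence, max_rounds=8):
    """Iteratively refine masks from simulated human scribbles."""
    scribbles = service.start(sequence)             # initial scribbles, no masks yet
    masks = None
    for _ in range(max_rounds):
        masks = segmenter.predict(sequence, scribbles, prev_masks=masks)
        feedback = service.submit(sequence, masks)  # service scores the masks
        if feedback.done:                           # interaction budget exhausted
            break
        scribbles = feedback.scribbles              # new scribbles on the worst errors
    return masks
```
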
PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation
We address semi-supervised video object segmentation, the task of
automatically generating accurate and consistent pixel masks for objects in a
video sequence, given the first-frame ground truth annotations. Towards this
goal, we present the PReMVOS algorithm (Proposal-generation, Refinement and
Merging for Video Object Segmentation). Our method separates the problem into
two steps: first generating a set of accurate object segmentation mask
proposals for each video frame, and then selecting and merging these proposals
into accurate, temporally consistent pixel-wise object tracks, in a way
designed specifically to tackle the difficulties of segmenting multiple
objects across a video sequence.
Our approach surpasses all previous state-of-the-art results on the DAVIS 2017
video object segmentation benchmark with a J & F mean score of 71.6 on the
test-dev dataset, and achieves first place in both the DAVIS 2018 Video Object
Segmentation Challenge and the YouTube-VOS 1st Large-scale Video Object
Segmentation Challenge.
Comment: Accepted for publication in ACCV 2018
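
As a rough illustration of the merging step, the sketch below greedily links per-frame mask proposals into tracks, scoring each candidate by mask IoU with the track's last mask plus ReID similarity. The two-cue score and its weights are simplifying assumptions; the actual PReMVOS merging combines more cues and objectives.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(a, b).sum() / union

def link_proposals(per_frame_proposals, reid_sim, iou_w=0.5, reid_w=0.5):
    """Greedily extend one track per first-frame proposal, frame by frame.

    per_frame_proposals: list over frames; each entry is a list of
    (mask, reid_vector) pairs. reid_sim(u, v) -> similarity in [0, 1].
    """
    tracks = [[p] for p in per_frame_proposals[0]]
    for props in per_frame_proposals[1:]:
        taken = set()                            # each proposal used once per frame
        for track in tracks:
            last_mask, last_reid = track[-1]
            scores = [
                iou_w * mask_iou(last_mask, m) + reid_w * reid_sim(last_reid, r)
                if i not in taken else -1.0
                for i, (m, r) in enumerate(props)
            ]
            if not scores:
                continue
            best = int(np.argmax(scores))
            if scores[best] > 0:                 # require some evidence to extend
                track.append(props[best])
                taken.add(best)
    return tracks
```
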
Fast User-Guided Video Object Segmentation by Interaction-and-Propagation Networks
We present a deep learning method for interactive video object
segmentation. Our method is built upon two core operations, interaction and
propagation, each conducted by a convolutional neural network.
The two networks are connected both internally and externally so that the
networks are trained jointly and interact with each other to solve the complex
video object segmentation problem. We propose a new multi-round training scheme
for interactive video object segmentation so that the networks can learn
how to understand the user's intention and update incorrect estimations during
training. At test time, our method produces high-quality results and
also runs fast enough to work with users interactively. We evaluated the
proposed method quantitatively on the interactive track benchmark at the DAVIS
Challenge 2018. We outperformed other competing methods by a significant margin
in both speed and accuracy. We also demonstrated that our method works
well with real user interactions.
Comment: CVPR 2019
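
A minimal sketch of one interaction-propagation round, assuming the two trained networks are given; the function and attribute names are illustrative, not the authors' code.

```python
# One user round: the user corrects a single frame with scribbles, the
# interaction network updates that frame's mask, and the propagation
# network spreads the correction forward and backward through the video.

def one_round(interact_net, propagate_net, frames, masks, user):
    t = user.pick_worst_frame(masks)             # frame the user chooses to fix
    scribble = user.draw_scribbles(frames[t], masks[t])
    masks[t] = interact_net(frames[t], masks[t], scribble)
    for s in range(t + 1, len(frames)):          # propagate forward ...
        masks[s] = propagate_net(frames[s], masks[s - 1], masks[s])
    for s in range(t - 1, -1, -1):               # ... and backward
        masks[s] = propagate_net(frames[s], masks[s + 1], masks[s])
    return masks
```

Repeating such rounds during training, rather than training each network in isolation, is what lets the two networks adapt to each other and to iterative user corrections.
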
FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation
Many of the recent successful methods for video object segmentation (VOS) are
overly complicated, heavily rely on fine-tuning on the first frame, and/or are
slow, and are hence of limited practical use. In this work, we propose FEELVOS
as a simple and fast method which does not rely on fine-tuning. In order to
segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding
together with a global and a local matching mechanism to transfer information
from the first frame and from the previous frame of the video to the current
frame. In contrast to previous work, our embedding is only used as an internal
guidance of a convolutional network. Our novel dynamic segmentation head allows
us to train the network, including the embedding, end-to-end for the multiple
object segmentation task with a cross entropy loss. We achieve a new state of
the art in video object segmentation without fine-tuning with a J&F measure of
71.5% on the DAVIS 2017 validation set. We make our code and models available
at https://github.com/tensorflow/models/tree/master/research/feelvos.
Comment: CVPR 2019 camera-ready version
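
The global matching step can be sketched in a few lines of PyTorch: for each pixel of the current frame, take the distance to its nearest first-frame object pixel in embedding space. This is a simplified sketch; the paper additionally squashes distances into a bounded range and adds a local variant that matches against a small spatial window in the previous frame.

```python
import torch

def global_match(cur_emb, ref_emb, ref_mask):
    """Nearest-neighbour embedding distance to the first-frame object.

    cur_emb:  (H*W, C) pixel embeddings of the current frame.
    ref_emb:  (H*W, C) pixel embeddings of the first frame.
    ref_mask: (H*W,) bool, True where the first frame shows the object
              (assumed non-empty). Small values in the output mean
              "this pixel looks like the object".
    """
    obj = ref_emb[ref_mask]            # (N_obj, C) object-pixel embeddings
    d = torch.cdist(cur_emb, obj)      # (H*W, N_obj) pairwise distances
    return d.min(dim=1).values         # (H*W,) distance map
```

The resulting distance map is exactly the kind of soft cue the abstract describes: it is handed to the segmentation head as internal guidance rather than being thresholded into a mask directly.
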
SkelNetOn 2019: Dataset and Challenge on Deep Learning for Geometric Shape Understanding
We present the SkelNetOn 2019 Challenge and the Deep Learning for Geometric
Shape Understanding workshop, which aim to utilize existing and to develop novel deep learning
architectures for shape understanding. We observed that unlike traditional
segmentation and detection tasks, geometry understanding is still a new area
for deep learning techniques. SkelNetOn aims to bring together researchers from
different domains to foster learning methods on global shape understanding
tasks. We aim to improve and evaluate the state-of-the-art shape understanding
approaches, and to serve as reference benchmarks for future research. Similar
to other challenges in computer vision, SkelNetOn proposes three datasets and
corresponding evaluation methodologies, all coherently bundled into three
competitions with a dedicated workshop co-located with the CVPR 2019 conference. In
this paper, we describe and analyze characteristics of datasets, define the
evaluation criteria of the public competitions, and provide baselines for each
task.
Comment: Dataset paper for the SkelNetOn Challenge, in association with the Deep Learning for Geometric Shape Understanding Workshop at CVPR 2019
ScribbleBox: Interactive Annotation Framework for Video Object Segmentation
Manually labeling video datasets for segmentation tasks is extremely
time-consuming. In this paper, we introduce ScribbleBox, a novel interactive
framework for annotating object instances with masks in videos. In particular,
we split annotation into two steps: annotating objects with tracked boxes, and
labeling masks inside these tracks. We introduce automation and interaction in
both steps. Box tracks are annotated efficiently by approximating the
trajectory using a parametric curve with a small number of control points which
the annotator can interactively correct. Our approach tolerates a modest amount
of noise in the box placements, so typically only a few clicks are needed to
annotate tracked boxes to sufficient accuracy. Segmentation masks are
corrected via scribbles which are efficiently propagated through time. We show
significant performance gains in annotation efficiency over past work: our
ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with 9.14 clicks per box
track and 4 frames of scribble annotation.
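
As an illustration of the trajectory-approximation idea, the sketch below fits a smoothing spline to a noisy box track with SciPy. The spline and the `smooth` factor are stand-ins for the paper's parametric curve with a small number of annotator-correctable control points.

```python
import numpy as np
from scipy import interpolate

def fit_box_track(boxes, smooth=50.0):
    """Approximate a (possibly noisy) box track with a smoothing spline.

    boxes: (T, 4) array of [x1, y1, x2, y2] per frame, e.g. raw tracker
    output (assumes T is at least a handful of frames). Larger 'smooth'
    values yield fewer knots, i.e. a curve an annotator could correct by
    dragging a small number of control points.
    """
    T = len(boxes)
    t = np.linspace(0.0, 1.0, T)
    fitted = np.empty((T, 4))
    for c in range(4):                           # fit each coordinate separately
        tck = interpolate.splrep(t, boxes[:, c], s=smooth)
        fitted[:, c] = interpolate.splev(t, tck)
    return fitted
```
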
Key Instance Selection for Unsupervised Video Object Segmentation
This paper proposes key instance selection for unsupervised video object
segmentation (UVOS), based on video saliency covering objectness and dynamics. Our
method takes frames sequentially and extracts object proposals with
corresponding masks for each frame. We link objects according to their
similarity until the M-th frame and then assign them unique IDs (i.e.,
instances). The similarity measure takes into account multiple properties,
such as the ReID descriptor, the expected trajectory, and the semantic
co-segmentation result. After the M-th frame, we select K IDs based on video saliency and frequency of
appearance; then only these key IDs are tracked through the remaining frames.
Thanks to these technical contributions, our results are ranked third on the
leaderboard of the UVOS DAVIS challenge.
Comment: Ranked 3rd in the 'Unsupervised DAVIS Challenge' (CVPR 2019)
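
The final selection step can be sketched as a simple ranking: keep the K instance IDs that are both salient and frequently observed over the first M frames. The combined score (saliency times appearance count) is an illustrative assumption, not the paper's exact criterion.

```python
# Illustrative key-ID selection. 'instances' maps each linked instance ID
# to its accumulated statistics over the first M frames.

def select_key_ids(instances, K):
    """instances: {id: {"saliency": float in [0, 1], "count": int}}"""
    def score(item):
        _, stats = item
        return stats["saliency"] * stats["count"]   # salient AND frequent
    ranked = sorted(instances.items(), key=score, reverse=True)
    return [obj_id for obj_id, _ in ranked[:K]]     # IDs tracked afterwards
```
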
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames
Semi-supervised video object segmentation has made significant progress on
real and challenging videos in recent years. The current paradigm for
segmentation methods and benchmark datasets is to segment objects in video
provided a single annotation in the first frame. However, we find that
segmentation performance across the entire video varies dramatically when
selecting an alternative frame for annotation. This paper addresses the problem
of learning to suggest the single best frame across the video for user
annotation; this is, in fact, never the first frame of the video. We achieve this by
introducing BubbleNets, a novel deep sorting network that learns to select
frames using a performance-based loss function that enables the conversion of
expansive amounts of training examples from already existing datasets. Using
BubbleNets, we are able to achieve an 11% relative improvement in segmentation
performance on the DAVIS benchmark without any changes to the underlying method
of segmentation.
Comment: CVPR 2019
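
The deep-sorting idea can be sketched as a plain bubble sort whose comparisons come from a learned network: the comparator predicts which of two candidate frames would yield better video-wide segmentation if chosen for annotation. The `compare` callable below stands in for the trained BubbleNets model.

```python
# Sketch of deep-sorting frame selection. compare(a, b) > 0 means frame a
# is predicted to give better video-wide segmentation than frame b when
# used as the annotation frame; 'compare' stands in for the trained model.

def select_annotation_frame(frames, compare):
    order = list(range(len(frames)))
    for _ in range(len(order) - 1):              # classic bubble-sort passes
        for i in range(len(order) - 1):
            if compare(frames[order[i]], frames[order[i + 1]]) > 0:
                # the predicted-better frame bubbles toward the end
                order[i], order[i + 1] = order[i + 1], order[i]
    return order[-1]                             # index of the suggested frame
```

Because only pairwise comparisons are needed, relative performance labels can be mined from existing datasets, which is what lets the training set grow so large.
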
RANet: Ranking Attention Network for Fast Video Object Segmentation
Although online learning (OL) techniques have boosted the performance of
semi-supervised video object segmentation (VOS) methods, the huge time costs of
OL greatly restrict their practicality. Matching-based and propagation-based
methods run at a faster speed by avoiding OL techniques. However, they are
limited by sub-optimal accuracy, due to mismatching and drifting problems. In
this paper, we develop a real-time yet very accurate Ranking Attention Network
(RANet) for VOS. Specifically, to integrate the insights of matching-based and
propagation-based methods, we employ an encoder-decoder framework to learn
pixel-level similarity and segmentation in an end-to-end manner. To better
utilize the similarity maps, we propose a novel ranking attention module, which
automatically ranks and selects these maps for fine-grained VOS performance.
Experiments on DAVIS-16 and DAVIS-17 datasets show that our RANet achieves the
best speed-accuracy trade-off, e.g., with 33 milliseconds per frame and
J&F=85.5% on DAVIS-16. With OL, our RANet reaches J&F=87.1% on DAVIS-16,
exceeding state-of-the-art VOS methods. The code can be found at
https://github.com/Storife/RANet.
Comment: Accepted by ICCV 2019. 10 pages, 7 figures, 6 tables. The supplementary file can be found at https://csjunxu.github.io/paper/2019ICCV/RANet_supp.pdf; code is available at https://github.com/Storife/RANet
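
A minimal PyTorch sketch of the ranking-and-selection idea: score each similarity map, sort by score, and keep (or zero-pad to) a fixed number of channels so the decoder always sees the same input shape. The `scorer` callable and the padding scheme are simplifying assumptions about the ranking attention module, not the released code.

```python
import torch

def ranking_attention(sim_maps, scorer, K=256):
    """Rank similarity maps and keep a fixed-size subset of the best.

    sim_maps: (N, H, W) similarity maps; N varies with the number of
    foreground pixels in the reference frame. scorer maps the stack to
    an (N,) importance score per map (assumed, e.g., a small conv net).
    """
    scores = scorer(sim_maps)                     # (N,) learned importance
    order = torch.argsort(scores, descending=True)
    ranked = sim_maps[order[:K]]                  # best maps first
    if ranked.shape[0] < K:                       # pad so the decoder input
        pad = torch.zeros(K - ranked.shape[0],    # shape is always (K, H, W)
                          *sim_maps.shape[1:], device=sim_maps.device)
        ranked = torch.cat([ranked, pad], dim=0)
    return ranked
```
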
Video Object Segmentation with Language Referring Expressions
Most state-of-the-art semi-supervised video object segmentation methods rely
on a pixel-accurate mask of a target object provided for the first frame of a
video. However, obtaining a detailed segmentation mask is expensive and
time-consuming. In this work we explore an alternative way of identifying a
target object, namely by employing language referring expressions. Besides
being a more practical and natural way of pointing out a target object, using
language specifications can help to avoid drift as well as make the system more
robust to complex dynamics and appearance variations. Leveraging recent
advances in language grounding models designed for images, we propose an
approach to extend them to video data, ensuring temporally coherent
predictions. To evaluate our method we augment the popular video object
segmentation benchmarks, DAVIS'16 and DAVIS'17, with language descriptions of
target objects. We show that our language-supervised approach performs on par
with methods that have access to a pixel-level mask of the target object on
DAVIS'16, and is competitive with methods using scribbles on the challenging
DAVIS'17 dataset.
Comment: ACCV 2018: 14th Asian Conference on Computer Vision
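
One simple way to make per-frame grounding temporally coherent, in the spirit described above, is to re-rank each frame's candidate boxes by grounding score plus overlap with the previous selection. The greedy scheme and the weight `lam` below are illustrative assumptions, not the paper's model.

```python
import numpy as np

def temporally_coherent_grounding(ground_scores, boxes, iou, lam=0.5):
    """Pick one box per frame: high language-grounding score, smooth track.

    ground_scores: list over frames of (N_t,) per-box grounding scores.
    boxes: list over frames of (N_t, 4) candidate boxes.
    iou(a, b): IoU between two boxes. Greedy; a dynamic program over the
    same objective would give the globally optimal track.
    """
    picks = [int(np.argmax(ground_scores[0]))]       # best box in frame 0
    for t in range(1, len(boxes)):
        prev = boxes[t - 1][picks[-1]]
        total = [
            ground_scores[t][i] + lam * iou(prev, b) # language + coherence
            for i, b in enumerate(boxes[t])
        ]
        picks.append(int(np.argmax(total)))
    return picks
```
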