The 2018 DAVIS Challenge on Video Object Segmentation
We present the 2018 DAVIS Challenge on Video Object Segmentation, a public
competition specifically designed for the task of video object segmentation. It
builds upon the DAVIS 2017 dataset, which was presented in the previous edition
of the DAVIS Challenge, and added 100 videos with multiple objects per sequence
to the original DAVIS 2016 dataset. Motivated by the analysis of the results of
the 2017 edition, the main track of the competition will be the same as in
the previous edition (segmentation given the full mask of the objects in the
first frame -- semi-supervised scenario). This edition, however, also adds an
interactive segmentation teaser track, where the participants will interact
with a web service simulating the input of a human that provides scribbles to
iteratively improve the result.
Comment: Challenge website: http://davischallenge.org
RANet: Ranking Attention Network for Fast Video Object Segmentation
Although online learning (OL) techniques have boosted the performance of
semi-supervised video object segmentation (VOS) methods, the huge time costs of
OL greatly restrict their practicality. Matching based and propagation based
methods run at a faster speed by avoiding OL techniques. However, they are
limited by sub-optimal accuracy, due to mismatching and drifting problems. In
this paper, we develop a real-time yet very accurate Ranking Attention Network
(RANet) for VOS. Specifically, to integrate the insights of matching based and
propagation based methods, we employ an encoder-decoder framework to learn
pixel-level similarity and segmentation in an end-to-end manner. To better
utilize the similarity maps, we propose a novel ranking attention module, which
automatically ranks and selects these maps for fine-grained VOS performance.
Experiments on DAVIS-16 and DAVIS-17 datasets show that our RANet achieves the
best speed-accuracy trade-off, e.g., with 33 milliseconds per frame and
J&F=85.5% on DAVIS-16. With OL, our RANet reaches J&F=87.1% on DAVIS-16,
exceeding state-of-the-art VOS methods. The code can be found at
https://github.com/Storife/RANet.
Comment: Accepted by ICCV 2019. 10 pages, 7 figures, 6 tables. The supplementary file can be found at https://csjunxu.github.io/paper/2019ICCV/RANet_supp.pdf; Code is available at https://github.com/Storife/RANet
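As a rough illustration of the ranking idea described above, the sketch below (not the authors' code; the scoring head, tensor shapes, and the choice of k are assumptions) computes one pixel-wise similarity map per template pixel, scores each map with a small learned head, and keeps only the highest-ranked maps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rank_and_select_similarity_maps(template_feat, query_feat, score_head, k=256):
    """template_feat: (C, Nt) features of the first-frame object pixels.
    query_feat: (C, H, W) current-frame features.
    score_head: small learned network scoring the importance of each similarity map.
    Returns the k highest-scoring similarity maps, ordered by score."""
    C, H, W = query_feat.shape
    t = F.normalize(template_feat, dim=0)                       # (C, Nt)
    q = F.normalize(query_feat.flatten(1), dim=0)               # (C, H*W)
    sim = (t.t() @ q).view(-1, H, W)                            # (Nt, H, W): one map per template pixel
    scores = score_head(sim.unsqueeze(1)).mean(dim=(1, 2, 3))   # (Nt,) learned importance per map
    idx = scores.argsort(descending=True)                       # rank maps by predicted importance
    k = min(k, sim.shape[0])
    return sim[idx[:k]], scores[idx[:k]]

# usage with a toy scoring head (an assumption, not the paper's design):
score_head = nn.Conv2d(1, 1, kernel_size=3, padding=1)
maps, s = rank_and_select_similarity_maps(torch.randn(64, 300), torch.randn(64, 30, 40), score_head, k=128)
```

The selected maps would then be fed to a decoder to produce the final segmentation; that part is omitted here.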
Video Object Segmentation using Space-Time Memory Networks
We propose a novel solution for semi-supervised video object segmentation. By
the nature of the problem, available cues (e.g. video frame(s) with object
masks) become richer with the intermediate predictions. However, the existing
methods are unable to fully exploit this rich source of information. We resolve
the issue by leveraging memory networks and learning to read relevant information
from all available sources. In our framework, the past frames with object masks
form an external memory, and the current frame as the query is segmented using
the mask information in the memory. Specifically, the query and the memory are
densely matched in the feature space, covering all the space-time pixel
locations in a feed-forward fashion. In contrast to previous approaches, the
abundant use of guidance information allows us to better handle challenges
such as appearance changes and occlusions. We validate our method
on the latest benchmark sets and achieve state-of-the-art performance
(overall score of 79.4 on Youtube-VOS val set, J of 88.7 and 79.2 on DAVIS
2016/2017 val set respectively) while having a fast runtime (0.16 second/frame
on DAVIS 2016 val set).
Comment: ICCV 2019
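The core read operation of a space-time memory can be sketched as dense key-value attention over all past-frame locations. The snippet below is only a schematic illustration under assumed tensor shapes, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def memory_read(mem_key, mem_val, query_key):
    """mem_key: (Ck, T, H, W) keys from past frames with masks.
    mem_val: (Cv, T, H, W) values from past frames with masks.
    query_key: (Ck, H, W) key of the current frame.
    Returns per-pixel features read from memory, shape (Cv, H, W)."""
    Ck, T, H, W = mem_key.shape
    Cv = mem_val.shape[0]
    mk = mem_key.reshape(Ck, T * H * W)             # flatten all space-time memory locations
    mv = mem_val.reshape(Cv, T * H * W)
    qk = query_key.reshape(Ck, H * W)
    affinity = mk.t() @ qk / (Ck ** 0.5)            # (T*H*W, H*W): dense space-time matching
    weights = F.softmax(affinity, dim=0)            # normalize over memory locations
    read = mv @ weights                             # (Cv, H*W): weighted read of memory values
    return read.reshape(Cv, H, W)
```

A decoder would combine the read features with the query-frame features to predict the mask; only the read step is shown here.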
Improving Image co-segmentation via Deep Metric Learning
Deep Metric Learning (DML) is helpful in computer vision tasks. In this
paper, we first introduce DML into image co-segmentation. We propose a novel
Triplet loss for Image Segmentation, called IS-Triplet loss for short, and
combine it with the traditional image segmentation loss. Unlike the general DML
task, which learns a metric between images, we treat each pixel as a sample and
use its embedded features in a high-dimensional space to form triplets. By
optimizing the IS-Triplet loss, we force the distance between pixels of
different categories to be greater than that between pixels of the same
category, so that pixels from different categories are easier to distinguish in
the high-dimensional feature space. We further present an efficient triplet
sampling strategy that makes the computation of the IS-Triplet loss feasible.
Finally, the
IS-Triplet loss is combined with 3 traditional image segmentation losses to
perform image segmentation. We apply the proposed approach to image
co-segmentation and test it on the SBCoseg dataset and the Internet dataset.
The experimental results show that our approach effectively improves the
discrimination of pixel categories in the high-dimensional space and thus helps
the traditional losses achieve better image segmentation performance with fewer
training epochs.
Comment: 11 pages, 5 figures
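A minimal sketch of a pixel-level triplet loss of the kind described above is given below; the random sampling scheme and the margin are placeholder choices, not the paper's exact IS-Triplet formulation or sampling strategy:

```python
import torch
import torch.nn.functional as F

def pixel_triplet_loss(emb, labels, margin=1.0, num_triplets=1024):
    """emb: (N, D) per-pixel embeddings; labels: (N,) pixel categories."""
    n = emb.shape[0]
    a = torch.randint(0, n, (num_triplets,))
    p = torch.randint(0, n, (num_triplets,))
    q = torch.randint(0, n, (num_triplets,))
    # keep only triplets where the positive shares the anchor's category and the negative does not
    valid = (labels[a] == labels[p]) & (labels[a] != labels[q])
    if valid.sum() == 0:
        return emb.new_zeros(())
    a, p, q = a[valid], p[valid], q[valid]
    d_ap = F.pairwise_distance(emb[a], emb[p])
    d_an = F.pairwise_distance(emb[a], emb[q])
    # hinge: push different-category pixels farther apart than same-category pixels by the margin
    return F.relu(d_ap - d_an + margin).mean()

# in training, such a term would simply be added to the segmentation loss, e.g.
# total_loss = seg_loss + 0.1 * pixel_triplet_loss(emb, labels)   # weight is an assumption
```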
ScribbleBox: Interactive Annotation Framework for Video Object Segmentation
Manually labeling video datasets for segmentation tasks is extremely time
consuming. In this paper, we introduce ScribbleBox, a novel interactive
framework for annotating object instances with masks in videos. In particular,
we split annotation into two steps: annotating objects with tracked boxes, and
labeling masks inside these tracks. We introduce automation and interaction in
both steps. Box tracks are annotated efficiently by approximating the
trajectory using a parametric curve with a small number of control points which
the annotator can interactively correct. Our approach tolerates a modest amount
of noise in the box placements, thus typically only a few clicks are needed to
annotate tracked boxes to a sufficient accuracy. Segmentation masks are
corrected via scribbles which are efficiently propagated through time. We show
significant performance gains in annotation efficiency over past work. We show
that our ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with 9.14 clicks
per box track, and 4 frames of scribble annotation.
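To illustrate the idea of approximating a box trajectory with a low-dimensional parametric curve, here is a hypothetical sketch that fits one polynomial per box coordinate to a few annotated keyframes and interpolates the rest; the paper's actual curve model and interactive correction loop are not reproduced:

```python
import numpy as np

def fit_box_track(times, boxes, degree=3):
    """times: (T,) frame indices of annotated keyframes; boxes: (T, 4) [x1, y1, x2, y2].
    Returns a function box(t) evaluating the fitted curve at any frame index."""
    coeffs = [np.polyfit(times, boxes[:, k], degree) for k in range(4)]
    return lambda t: np.array([np.polyval(c, t) for c in coeffs])

# usage: annotate a few keyframes, interpolate the remaining frames
track = fit_box_track(np.array([0, 10, 20, 30]),
                      np.array([[5, 5, 50, 60],
                                [8, 6, 54, 63],
                                [15, 9, 60, 70],
                                [25, 12, 68, 80]]))
print(track(17))  # predicted box at frame 17
```

In an interactive setting, the annotator would adjust a small number of control points until the interpolated boxes are accurate enough, which is far cheaper than drawing a box per frame.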
Collaborative Video Object Segmentation by Foreground-Background Integration
This paper investigates the principles of embedding learning to tackle the
challenging semi-supervised video object segmentation task. Unlike previous
practices that explore embedding learning using only pixels from the foreground
object(s), we argue that the background should be treated equally and thus propose the
Collaborative video object segmentation by Foreground-Background Integration
(CFBI) approach. Our CFBI implicitly imposes the feature embedding from the
target foreground object and its corresponding background to be contrastive,
promoting the segmentation results accordingly. With the feature embedding from
both foreground and background, our CFBI performs the matching between the
reference and the predicted sequence at both the pixel and instance levels,
making CFBI robust to various object scales. We conduct extensive
experiments on three popular benchmarks, i.e., DAVIS 2016, DAVIS 2017, and
YouTube-VOS. Our CFBI achieves the performance (J&F) of 89.4%, 81.9%, and
81.4%, respectively, outperforming all the other state-of-the-art methods.
Code: https://github.com/z-x-yang/CFBI.
Comment: ECCV 2020, Spotlight
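The foreground-background matching idea can be sketched as follows; the nearest-neighbour pixel-level term and the mean-embedding instance-level term below are simplifying assumptions rather than the CFBI architecture itself:

```python
import torch
import torch.nn.functional as F

def fg_bg_match(ref_feat, ref_mask, cur_feat):
    """ref_feat: (C, H, W) reference-frame features; ref_mask: (H, W) binary object mask
    (assumed to contain both foreground and background pixels); cur_feat: (C, H, W).
    Returns an (H, W) foreground probability map for the current frame."""
    C, H, W = cur_feat.shape
    r = F.normalize(ref_feat.flatten(1), dim=0)           # (C, HW_ref)
    c = F.normalize(cur_feat.flatten(1), dim=0)           # (C, HW_cur)
    sim = r.t() @ c                                        # cosine similarity of every ref/cur pixel pair
    fg = ref_mask.flatten().bool()
    # pixel level: nearest foreground / background match for every current-frame pixel
    pix_fg = sim[fg].max(dim=0).values
    pix_bg = sim[~fg].max(dim=0).values
    # instance level: similarity to the mean foreground / background embedding
    inst_fg = (r[:, fg].mean(dim=1, keepdim=True).t() @ c).squeeze(0)
    inst_bg = (r[:, ~fg].mean(dim=1, keepdim=True).t() @ c).squeeze(0)
    fg_score = pix_fg + inst_fg
    bg_score = pix_bg + inst_bg
    # treat foreground and background evidence contrastively
    return torch.softmax(torch.stack([bg_score, fg_score]), dim=0)[1].reshape(H, W)
```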
In defense of OSVOS
As a milestone for video object segmentation, one-shot video object
segmentation (OSVOS) has improved segmentation accuracy by a large margin over
conventional optical-flow based methods. Its excellent performance mainly
benefits from a three-step training mechanism: (1) acquiring object features on
the base dataset (i.e., ImageNet); (2) training the parent network on the
training set of the target dataset (i.e., DAVIS-2016) so that it can
differentiate the object of interest from the background; and (3) online
fine-tuning on the object of interest in the first frame of the target test set
to overfit its appearance, after which the model can be used to segment the
same object in the remaining frames of that video. In this paper, we argue
that in step (2) OSVOS tends to 'overemphasize' generic semantic object
information while 'diluting' the instance cues of the object(s), which largely
hinders the whole training process. By adding a
common module, video loss, which we formulate with various forms of constraints
(including weighted BCE loss, high-dimensional triplet loss, as well as a novel
mixed instance-aware video loss), to the training of the parent network in step (2),
the network is better prepared for step (3), i.e., online fine-tuning
on the target instance. Through extensive experiments using different network
structures as the backbone, we show that the proposed video loss module can
improve the segmentation performance significantly, compared to that of OSVOS.
Meanwhile, since video loss is a common module, it can be generalized to other
fine-tuning based methods and similar vision tasks such as depth estimation and
saliency detection.
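As an illustration of one of the constraint forms listed above, the sketch below shows a class-balanced weighted BCE computed over the frames of a clip; the exact weighting used in the paper and the mixed instance-aware variant are not reproduced here:

```python
import torch
import torch.nn.functional as F

def weighted_bce_video_loss(logits, masks):
    """logits, masks: (T, 1, H, W) predictions and ground-truth masks for a clip.
    Foreground and background terms are re-weighted by their relative frequency so
    that the (typically small) object is not dominated by the background."""
    pos = masks.sum()
    neg = masks.numel() - pos
    w_pos = neg / (pos + neg)                        # rarer class gets the larger weight
    w_neg = pos / (pos + neg)
    weight = torch.where(masks > 0.5, w_pos, w_neg)  # per-pixel weight map
    return F.binary_cross_entropy_with_logits(logits, masks, weight=weight)
```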
Self-Supervised Visual Representation Learning from Hierarchical Grouping
We create a framework for bootstrapping visual representation learning from a
primitive visual grouping capability. We operationalize grouping via a contour
detector that partitions an image into regions, followed by merging of those
regions into a tree hierarchy. A small supervised dataset suffices for training
this grouping primitive. Across a large unlabeled dataset, we apply this
learned primitive to automatically predict hierarchical region structure. These
predictions serve as guidance for self-supervised contrastive feature learning:
we task a deep network with producing per-pixel embeddings whose pairwise
distances respect the region hierarchy. Experiments demonstrate that our
approach can serve as state-of-the-art generic pre-training, benefiting
downstream tasks. We additionally explore applications to semantic region
search and video-based object instance tracking.
Comment: Accepted by NeurIPS 2020
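The region-guided feature learning step can be illustrated with a simple contrastive objective over randomly sampled pixel pairs, where region membership comes from the automatically predicted segmentation; this flattens the hierarchy into a single partition and is only a schematic sketch, not the paper's loss:

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(emb, regions, margin=1.0, num_pairs=2048):
    """emb: (N, D) per-pixel embeddings; regions: (N,) predicted region id per pixel."""
    n = emb.shape[0]
    i = torch.randint(0, n, (num_pairs,))
    j = torch.randint(0, n, (num_pairs,))
    d = F.pairwise_distance(emb[i], emb[j])
    same = (regions[i] == regions[j]).float()
    # pull same-region pairs together, push different-region pairs beyond the margin
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()
```

In the paper's setting the hierarchy would additionally modulate how strongly pairs at different merge levels are pulled or pushed; that weighting is omitted here.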
Fast Pixel-Matching for Video Object Segmentation
Video object segmentation, which aims to segment the foreground objects given the
annotation of the first frame, has been attracting increasing attention. Many
state-of-the-art approaches have achieved great performance by relying on
online model updating or mask-propagation techniques. However, most online
models require high computational cost due to model fine-tuning during
inference. Most mask-propagation based models are faster but with relatively
low performance due to their failure to adapt to object appearance variation. In this
paper, we aim to design a new model that strikes a good balance between speed
and performance. We propose a model, called NPMCA-net, which directly localizes
foreground objects based on mask-propagation and non-local technique by
matching pixels in the reference and target frames. Since we bring in information
from both the first and previous frames, our network is robust to large object
appearance variation, and can better adapt to occlusions. Extensive experiments
show that our approach achieves new state-of-the-art performance while remaining
fast (86.5% IoU on DAVIS-2016 and 72.2% IoU on DAVIS-2017, at a speed of
0.11 s per frame) under the same level of comparison.
Source code is available at https://github.com/siyueyu/NPMCA-net.
Comment: Accepted by Signal Processing: Image Communication
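A schematic sketch of non-local pixel matching against both the first and the previous frame is shown below; the fusion by simple averaging and the nearest-match readout are assumptions, not the NPMCA-net design:

```python
import torch
import torch.nn.functional as F

def localize_from_references(target_feat, ref_feats, ref_masks):
    """target_feat: (C, H, W) current-frame features.
    ref_feats / ref_masks: lists holding the first-frame and previous-frame
    features (C, H, W) and their object masks (H, W).
    Returns an (H, W) foreground evidence map for the target frame."""
    C, H, W = target_feat.shape
    t = F.normalize(target_feat.flatten(1), dim=0)              # (C, HW)
    evidence = []
    for feat, mask in zip(ref_feats, ref_masks):
        r = F.normalize(feat.flatten(1), dim=0)                 # (C, HW_ref)
        sim = r.t() @ t                                         # non-local matching: every ref pixel vs every target pixel
        fg = mask.flatten().bool()
        evidence.append(sim[fg].max(dim=0).values)              # best match to the object in this reference
    return torch.stack(evidence).mean(dim=0).reshape(H, W)      # fuse first-frame and previous-frame evidence
```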
VideoMatch: Matching based Video Object Segmentation
Video object segmentation is challenging yet important in a wide variety of
applications for video analysis. Recent works formulate video object
segmentation as a prediction task using deep nets to achieve appealing
state-of-the-art performance. Due to the formulation as a prediction task, most
of these methods require fine-tuning during test time, such that the deep nets
memorize the appearance of the objects of interest in the given video. However,
fine-tuning is time-consuming and computationally expensive, hence the
algorithms are far from real time. To address this issue, we develop a novel
matching based algorithm for video object segmentation. In contrast to
memorization based classification techniques, the proposed approach learns to
match extracted features to a provided template without memorizing the
appearance of the objects. We validate the effectiveness and the robustness of
the proposed method on the challenging DAVIS-16, DAVIS-17, Youtube-Objects and
JumpCut datasets. Extensive results show that our method achieves comparable
performance without fine-tuning and is much more favorable in terms of
computational time.
Comment: Accepted to ECCV 2018
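The matching-without-fine-tuning idea can be sketched as a soft matching of each query pixel against foreground and background template pixels from the first frame; the top-k average readout below is one plausible instantiation under assumed shapes, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def soft_match(template_feat, template_mask, query_feat, k=20):
    """template_feat: (C, H, W) first-frame features; template_mask: (H, W) object mask
    (assumed to contain both foreground and background pixels); query_feat: (C, H, W).
    Returns an (H, W) foreground probability map for the query frame."""
    C, H, W = query_feat.shape
    t = F.normalize(template_feat.flatten(1), dim=0)        # (C, HW_t)
    q = F.normalize(query_feat.flatten(1), dim=0)           # (C, HW_q)
    sim = t.t() @ q                                          # (HW_t, HW_q) cosine similarities
    fg = template_mask.flatten().bool()
    # average of the k strongest similarities to foreground / background template pixels
    fg_score = sim[fg].topk(min(k, int(fg.sum())), dim=0).values.mean(dim=0)
    bg_score = sim[~fg].topk(min(k, int((~fg).sum())), dim=0).values.mean(dim=0)
    return torch.softmax(torch.stack([bg_score, fg_score]), dim=0)[1].reshape(H, W)
```

Because no test-time fine-tuning is involved, segmenting a new video only requires a forward pass per frame, which is what makes this family of methods fast.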