MobileVOS: Real-Time Video Object Segmentation: Contrastive Learning meets Knowledge Distillation
This paper tackles the problem of semi-supervised video object segmentation
on resource-constrained devices, such as mobile phones. We formulate this
problem as a distillation task, whereby we demonstrate that small
space-time-memory networks with finite memory can achieve results
competitive with the state of the art, but at a fraction of the
computational cost (32 milliseconds per frame on a Samsung Galaxy S22).
Specifically, we provide a theoretically grounded framework that unifies
knowledge distillation with supervised contrastive representation learning.
These models are able to jointly benefit from both pixel-wise contrastive
learning and distillation from a pre-trained teacher. We validate this loss
by achieving J&F competitive with the state of the art on both the standard
DAVIS and YouTube-VOS benchmarks, despite running up to 5x faster and with
32x fewer parameters. Comment: CVPR 2023
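For readers who want the mechanics, below is a minimal PyTorch sketch of a combined pixel-wise distillation and supervised contrastive objective in the spirit of the unified loss the abstract describes. The temperatures, the 0.5 weighting, the 256-pixel subsample, and the use of logits as stand-in embeddings are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def distill_contrastive_loss(student_logits, teacher_logits, labels,
                             tau_kd=1.0, tau_ct=0.1, alpha=0.5, n_pix=256):
    """student_logits, teacher_logits: (B, C, H, W); labels: (B, H, W) ints."""
    # 1) Pixel-wise knowledge distillation against soft teacher targets.
    log_p = F.log_softmax(student_logits / tau_kd, dim=1)
    q = F.softmax(teacher_logits / tau_kd, dim=1)
    kd = F.kl_div(log_p, q, reduction="batchmean") * tau_kd ** 2

    # 2) Supervised pixel contrastive term on a random subset of pixels:
    #    same-label pixels attract, different-label pixels repel.
    B, C, H, W = student_logits.shape
    emb = F.normalize(student_logits.flatten(2).transpose(1, 2), dim=-1)
    idx = torch.randperm(H * W, device=emb.device)[:n_pix]
    f, y = emb[:, idx], labels.flatten(1)[:, idx]        # (B, n, C), (B, n)
    sim = f @ f.transpose(1, 2) / tau_ct                 # (B, n, n)
    eye = torch.eye(f.size(1), dtype=torch.bool, device=f.device)
    pos = (y.unsqueeze(2) == y.unsqueeze(1)) & ~eye      # positives, no self
    logit = sim.masked_fill(eye, float("-inf"))
    log_prob = logit - logit.logsumexp(-1, keepdim=True)
    log_prob = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    ct = -(log_prob.sum(-1) / pos.sum(-1).clamp(min=1)).mean()
    return alpha * kd + (1 - alpha) * ct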
Learning the What and How of Annotation in Video Object Segmentation
Video Object Segmentation (VOS) is crucial for several applications, from
video editing to video data generation. Training a VOS model requires an
abundance of manually labeled training videos. The de-facto traditional way of
annotating objects requires humans to draw detailed segmentation masks on the
target objects at each video frame. This annotation process, however, is
tedious and time-consuming. To reduce this annotation cost, in this paper, we
propose EVA-VOS, a human-in-the-loop annotation framework for video object
segmentation. Unlike the traditional approach, we introduce an agent that
iteratively predicts both which frame ("What") to annotate and which
annotation type ("How") to use. The annotator then annotates only the
selected frame, which is used to update a VOS module, leading to significant
gains in annotation
time. We conduct experiments on the MOSE and the DAVIS datasets and we show
that: (a) EVA-VOS leads to masks with accuracy close to human agreement,
3.5x faster than the standard way of annotating videos; (b) our frame selection
achieves state-of-the-art performance; (c) EVA-VOS yields significant
performance gains in terms of annotation time compared to all other methods and
baselines. Comment: Accepted to WACV 2024
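The "What"/"How" loop can be summarized with a short sketch. Here `frame_selector`, `type_selector`, `human_annotate`, and the `vos_model` interface are hypothetical stand-ins for the paper's components, not its actual API.

```python
def annotate_video(frames, vos_model, frame_selector, type_selector,
                   human_annotate, budget_seconds):
    """Iterate: pick a frame ("What"), pick an annotation type ("How"),
    collect the annotation, and refresh the VOS model's predictions."""
    annotations, spent = {}, 0.0
    while spent < budget_seconds:
        # "What": the frame whose annotation should improve the masks most.
        f = frame_selector.pick(frames, vos_model, annotations)
        # "How": a full mask or a cheaper correction for that frame.
        ann_type = type_selector.pick(frames[f], vos_model)
        annotation, cost = human_annotate(frames[f], ann_type)
        annotations[f] = annotation
        spent += cost
        # Propagate the new label through the video with the VOS module.
        vos_model.update(frames, annotations)
    return vos_model.predict(frames), annotations
```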
Boosting Video Object Segmentation via Space-time Correspondence Learning
Current top-performing solutions for video object segmentation (VOS)
typically follow a matching-based regime: for each query frame, the
segmentation mask is inferred according to its correspondence to the
previously processed frames and the first annotated frame. They simply
exploit the supervisory signals from the groundtruth masks for learning mask
prediction only, without posing any constraint on the space-time
correspondence matching, which, however, is the fundamental building block
of such a regime. To alleviate this crucial yet
commonly ignored issue, we devise a correspondence-aware training framework,
which boosts matching-based VOS solutions by explicitly encouraging robust
correspondence matching during network learning. Through comprehensively
exploring the intrinsic coherence in videos on pixel and object levels, our
algorithm reinforces the standard, fully supervised training of mask
segmentation with label-free, contrastive correspondence learning. Without
requiring extra annotation cost during training, causing speed delays during
deployment, or modifying the architecture, our algorithm provides solid
performance gains on four widely used benchmarks, i.e., DAVIS 2016 & 2017
and YouTube-VOS 2018 & 2019, on top of popular
matching-based VOS solutions. Comment: CVPR 2023; Project page:
https://github.com/wenguanwang/VOS_Correspondence
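A common way to realize label-free correspondence learning is a cross-frame cycle-consistency objective; the sketch below is one such instantiation under assumed shapes, not necessarily the exact loss the paper uses.

```python
import torch
import torch.nn.functional as F

def cycle_correspondence_loss(feat1, feat2, tau=0.07):
    """feat1, feat2: (N, C) pixel embeddings from two frames of one video."""
    f1 = F.normalize(feat1, dim=-1)
    f2 = F.normalize(feat2, dim=-1)
    a12 = F.softmax(f1 @ f2.t() / tau, dim=-1)   # frame 1 -> frame 2 affinity
    a21 = F.softmax(f2 @ f1.t() / tau, dim=-1)   # frame 2 -> frame 1 affinity
    round_trip = a12 @ a21                       # rows still sum to one
    target = torch.arange(f1.size(0), device=f1.device)
    # Each pixel should land back on itself after the round trip.
    return F.nll_loss(torch.log(round_trip + 1e-8), target)
```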
Deep Learning for Video Object Segmentation: A Review
As one of the fundamental problems in the field of video understanding, video object segmentation aims at segmenting objects of interest throughout the given video sequence. Recently, with the advancements of deep learning techniques, deep neural networks have shown outstanding performance improvements in many computer vision applications, with video object segmentation being one of the most advocated and intensively investigated. In this paper, we present a systematic review of the deep learning-based video object segmentation literature, highlighting the pros and cons of each category of approaches. Concretely, we start by introducing the definition, background concepts and basic ideas of algorithms in this field. Subsequently, we summarise the datasets for training and testing a video object segmentation algorithm, as well as common challenges and evaluation metrics. Next, previous works are grouped and reviewed based on how they extract and use spatial and temporal features, where their architectures, contributions and the differences among them are elaborated. Finally, the quantitative and qualitative results of several representative methods on a dataset with many remaining challenges are provided and analysed, followed by further discussions on future research directions. This article is expected to serve as a tutorial and source of reference for learners intending to quickly grasp the current progress in this research area and for practitioners interested in applying video object segmentation methods to their problems. A public website is built to collect and track the related works in this field: https://github.com/gaomingqi/VOS-Review
LVOS: A Benchmark for Long-term Video Object Segmentation
Existing video object segmentation (VOS) benchmarks focus on short-term
videos, which last only about 3-5 seconds and in which objects are visible
most of the time. These videos are poorly representative of practical
applications, and the absence of long-term datasets restricts further
investigation of VOS in realistic scenarios. In this paper, we present a new
benchmark dataset named LVOS, which consists of 220 videos with a
total duration of 421 minutes. To the best of our knowledge, LVOS is the first
densely annotated long-term VOS dataset. The videos in our LVOS last 1.59
minutes on average, which is 20 times longer than videos in existing VOS
datasets. Each video includes various attributes, especially challenges
arising in the wild, such as long-term reappearance and cross-temporally
similar objects. Based on LVOS, we assess existing video object segmentation
algorithms and propose a Diverse Dynamic Memory network (DDMemory) that
consists of three complementary memory banks to exploit temporal information
adequately. The experimental results demonstrate the strengths and weaknesses
of prior methods, pointing to promising directions for further study. Data
and code are available at https://lingyihongfd.github.io/lvos.github.io/.
Comment: Accepted by ICCV 2023. Project page:
https://lingyihongfd.github.io/lvos.github.io/
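The abstract gives few details on DDMemory itself, but reading from several complementary banks typically amounts to standard attention over each bank followed by fusion. The sketch below is a generic illustration under assumed shapes, with a simple mean in place of a learned fusion; the bank roles and interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def read_memory(query, banks, tau=1.0):
    """query: (HW, C); each bank holds 'keys' (M, C) and 'values' (M, C)."""
    reads = []
    for bank in banks:  # e.g. a current, a recent, and a global bank
        attn = F.softmax(query @ bank["keys"].t() / tau, dim=-1)  # (HW, M)
        reads.append(attn @ bank["values"])                       # (HW, C)
    # Fuse the complementary readouts; a learned fusion could replace the mean.
    return torch.stack(reads).mean(dim=0)
```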
Video Object Segmentation using Point-based Memory Network
Recent years have witnessed the prevalence of memory-based methods for Semi-supervised Video Object Segmentation (SVOS), which utilise past frames efficiently for label propagation. When conducting feature matching, fine-grained multi-scale feature matching has typically been performed using all query points, which inevitably results in redundant computations and makes the fusion of multi-scale results ineffective. In this paper, we develop a new Point-based Memory Network, termed PMNet, to perform fine-grained feature matching on hard samples only, assuming that easy samples can already obtain satisfactory matching results without the need for complicated multi-scale feature matching. Our approach first generates an uncertainty map from the initial decoding outputs. Next, the fine-grained features at uncertain locations are sampled to match the memory features at the same scale. Finally, the matching results are further decoded to provide a refined output. The point-based scheme works with the coarsest feature matching in a complementary and efficient manner. Furthermore, we propose an approach to adaptively perform global or regional matching based on the motion history of memory points, making our method more robust against ambiguous backgrounds. Experimental results on several benchmark datasets demonstrate the superiority of our proposed method over state-of-the-art methods.
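A minimal sketch of the point-based idea: derive per-pixel uncertainty from the initial decoding, then gather fine-grained features only at the most uncertain locations for further matching. The margin-based uncertainty measure, the top-k budget, and the matching spatial sizes are assumptions, not the paper's exact design.

```python
import torch

def sample_uncertain_points(coarse_logits, fine_feats, k=1024):
    """coarse_logits: (B, C, H, W); fine_feats: (B, D, H, W)."""
    probs = coarse_logits.softmax(dim=1)
    top2 = probs.topk(2, dim=1).values                         # (B, 2, H, W)
    # Uncertainty as a small margin between the two most likely classes.
    uncertainty = (top2[:, 0] - top2[:, 1]).neg().flatten(1)   # (B, HW)
    k = min(k, uncertainty.size(1))
    idx = uncertainty.topk(k, dim=1).indices                   # (B, k)
    B, D = fine_feats.shape[:2]
    flat = fine_feats.flatten(2)                               # (B, D, HW)
    points = flat.gather(2, idx.unsqueeze(1).expand(B, D, k))  # (B, D, k)
    return points, idx  # match `points` against memory, then scatter back
```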
Per-Clip Video Object Segmentation
Recently, memory-based approaches have shown promising results on
semi-supervised video object segmentation. These methods predict object
masks frame-by-frame with the help of a frequently updated memory of
previous masks. Different from
this per-frame inference, we investigate an alternative perspective by treating
video object segmentation as clip-wise mask propagation. In this per-clip
inference scheme, we update the memory at an interval and simultaneously
process a set of consecutive frames (i.e., a clip) between the memory
updates. The
scheme provides two potential benefits: accuracy gain by clip-level
optimization and efficiency gain by parallel computation of multiple frames. To
this end, we propose a new method tailored for the per-clip inference.
Specifically, we first introduce a clip-wise operation to refine the features
based on intra-clip correlation. In addition, we employ a progressive matching
mechanism for efficient information-passing within a clip. With the synergy of
two modules and a newly proposed per-clip based training, our network achieves
state-of-the-art performance on Youtube-VOS 2018/2019 val (84.6% and 84.6%) and
DAVIS 2016/2017 val (91.9% and 86.1%). Furthermore, our model shows a strong
speed-accuracy trade-off across varying memory update intervals, offering
great flexibility. Comment: CVPR 2022; Code is available at
https://github.com/pkyong95/PCVOS
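The per-clip inference scheme reduces to a simple loop; the `model` interface below is a hypothetical stand-in, and `clip_len` is an illustrative interval.

```python
def segment_video(model, frames, first_mask, clip_len=5):
    """Per-clip inference: one memory refresh per clip, not per frame."""
    memory = model.init_memory(frames[0], first_mask)
    masks = [first_mask]
    for start in range(1, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        # Every frame in the clip attends to the same fixed memory, so the
        # clip can be decoded in parallel rather than strictly sequentially.
        clip_masks = model.segment_clip(clip, memory)
        masks.extend(clip_masks)
        # Refresh the memory once per clip using the clip's last result.
        memory = model.update_memory(memory, clip[-1], clip_masks[-1])
    return masks
```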
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
We present XMem, a video object segmentation architecture for long videos
with unified feature memory stores inspired by the Atkinson-Shiffrin memory
model. Prior work on video object segmentation typically only uses one type of
feature memory. For videos longer than a minute, a single feature memory model
tightly links memory consumption and accuracy. In contrast, following the
Atkinson-Shiffrin model, we develop an architecture that incorporates multiple
independent yet deeply-connected feature memory stores: a rapidly updated
sensory memory, a high-resolution working memory, and a compact, thus sustained,
long-term memory. Crucially, we develop a memory potentiation algorithm that
routinely consolidates actively used working memory elements into the long-term
memory, which avoids memory explosion and minimizes performance decay for
long-term prediction. Combined with a new memory reading mechanism, XMem
greatly exceeds state-of-the-art performance on long-video datasets while being
on par with state-of-the-art methods (that do not work on long videos) on
short-video datasets. Code is available at https://hkchengrex.github.io/XMem.
Comment: Accepted to ECCV 2022. Project page: https://hkchengrex.github.io/XMem
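A heavily simplified sketch of the consolidation step: promote the most frequently read working-memory entries into long-term memory and reset the working store, keeping memory size bounded. XMem's actual potentiation additionally aggregates neighboring values into prototypes; the interface and the `keep` budget here are assumptions.

```python
import torch

def consolidate(work_keys, work_vals, usage, lt_keys, lt_vals, keep=128):
    """work_keys/work_vals: (N, C); usage: (N,) read counts; lt_*: (M, C)."""
    top = usage.topk(min(keep, usage.numel())).indices
    # Promote the most frequently read working entries into long-term memory.
    lt_keys = torch.cat([lt_keys, work_keys[top]])
    lt_vals = torch.cat([lt_vals, work_vals[top]])
    # Start a fresh, empty working memory; memory growth now stays bounded.
    empty_k = work_keys.new_zeros((0, work_keys.size(1)))
    empty_v = work_vals.new_zeros((0, work_vals.size(1)))
    return empty_k, empty_v, lt_keys, lt_vals
```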
Image and Video Segmentation of Appearance-Volatile Objects
Segmentation is a process of partitioning a digital image or frame into multiple regions or objects. The goal of segmentation is to identify and locate the objects of interest along with their boundaries. Recent segmentation approaches often follow such a pipeline: they first train the model on a collected dataset and then evaluate the trained model on a given image or video. They assume that the appearance of an object is consistent between training and testing sets. However, the appearance of an object may change under different photography conditions. How to effectively segment objects with volatile appearance remains under-explored. In this work, we present a framework for image and video segmentation of appearance-volatile objects, including two novel modules: uncertain region refinement and a feature bank. For image segmentation, we designed a new confidence loss and a fine-grained segmentation module to enhance the segmentation accuracy in uncertain regions.
For video segmentation, we proposed a matching-based algorithm in which feature banks are created to store features for region matching and classification. We introduced an adaptive feature bank update scheme to dynamically absorb new features and discard obsolete features.
We compared our algorithm with state-of-the-art methods on public benchmarks. Our algorithm outperforms the existing methods and produces more reliable and accurate segmentation results.
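An adaptive feature bank of this kind can be sketched as follows: features similar to an existing entry refresh it, sufficiently novel features are absorbed, and entries unmatched for too long are evicted. The similarity threshold and age limit are illustrative assumptions rather than the work's exact scheme.

```python
import torch
import torch.nn.functional as F

def update_bank(bank, feats, last_used, step, sim_thresh=0.9, max_age=50):
    """bank: (M, C); feats: (N, C) new-frame features; last_used: (M,)."""
    sim = F.normalize(feats, dim=-1) @ F.normalize(bank, dim=-1).t()  # (N, M)
    best, idx = sim.max(dim=-1)
    # Features close to an existing entry refresh that entry's timestamp.
    last_used[idx[best >= sim_thresh]] = step
    # Sufficiently novel features are absorbed as new entries.
    novel = best < sim_thresh
    bank = torch.cat([bank, feats[novel]])
    last_used = torch.cat([last_used,
                           torch.full((int(novel.sum()),), step,
                                      dtype=last_used.dtype,
                                      device=last_used.device)])
    # Entries unmatched for too long are discarded as obsolete.
    fresh = (step - last_used) <= max_age
    return bank[fresh], last_used[fresh]
```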
Breaking the "Object" in Video Object Segmentation
The appearance of an object can be fleeting when it transforms. As eggs are
broken or paper is torn, their color, shape and texture can change
dramatically, preserving virtually nothing of the original except for the
identity itself. Yet, this important phenomenon is largely absent from existing
video object segmentation (VOS) benchmarks. In this work, we close the gap by
collecting a new dataset for Video Object Segmentation under Transformations
(VOST). It consists of more than 700 high-resolution videos, captured in
diverse environments, which are 20 seconds long on average and densely labeled
with instance masks. A careful, multi-step approach is adopted to ensure that
these videos focus on complex object transformations, capturing their full
temporal extent. We then extensively evaluate state-of-the-art VOS methods and
make a number of important discoveries. In particular, we show that existing
methods struggle when applied to this novel task and that their main limitation
lies in over-reliance on static appearance cues. This motivates us to
propose a few modifications to the top-performing baseline that improve its
capabilities
by better modeling spatio-temporal information. But more broadly, the hope is
to stimulate discussion on learning more robust video object representations
- …