STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Existing methods for instance segmentation in videos typically involve
multi-stage pipelines that follow the tracking-by-detection paradigm and model a
video clip as a sequence of images. Multiple networks are used to detect
objects in individual frames, and then associate these detections over time.
Hence, these methods are often non-end-to-end trainable and highly tailored to
specific tasks. In this paper, we propose a different approach that is
well-suited to a variety of tasks involving instance segmentation in videos. In
particular, we model a video clip as a single 3D spatio-temporal volume, and
propose a novel approach that segments and tracks instances across space and
time in a single stage. Our problem formulation is centered around the idea of
spatio-temporal embeddings which are trained to cluster pixels belonging to a
specific object instance over an entire video clip. To this end, we introduce
(i) novel mixing functions that enhance the feature representation of
spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that
can reason about temporal context. Our network is trained end-to-end to
learn spatio-temporal embeddings as well as parameters required to cluster these
embeddings, thus simplifying inference. Our method achieves state-of-the-art
results across multiple datasets and tasks. Code and models are available at
https://github.com/sabarim/STEm-Seg.
Comment: 28 pages, 6 figures
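
The idea of clustering pixels of a 3D spatio-temporal volume by their embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: the function name assign_instances, the fixed Gaussian bandwidth sigma, the probability threshold, and the hand-written embeddings and centers are all hypothetical, chosen only to show that a pixel in any frame joins an instance when its embedding lies close to that instance's center, so segmentation and tracking reduce to one clustering step.

```python
import math

def assign_instances(embeddings, centers, sigma=1.0, threshold=0.5):
    """Assign each voxel (t, y, x) of a clip to an instance or to background.

    embeddings: dict mapping (t, y, x) -> tuple of floats (hypothetical values;
                a real network would predict these per pixel)
    centers:    dict mapping instance id -> tuple of floats (assumed known here)
    Returns a dict mapping (t, y, x) -> instance id, or None for background.
    """
    labels = {}
    for voxel, emb in embeddings.items():
        best_id, best_p = None, threshold
        for inst_id, center in centers.items():
            # Squared Euclidean distance in embedding space.
            d2 = sum((e - c) ** 2 for e, c in zip(emb, center))
            # Gaussian-style membership probability (illustrative assumption).
            p = math.exp(-d2 / (2 * sigma ** 2))
            if p > best_p:
                best_id, best_p = inst_id, p
        labels[voxel] = best_id
    return labels

# Toy clip with two frames (t = 0, 1) and two pixels per frame.
embeddings = {
    (0, 0, 0): (0.1, 0.0),    # near center 1 in frame 0
    (1, 0, 0): (0.0, 0.1),    # near center 1 in frame 1
    (0, 0, 1): (5.0, 5.0),    # exactly at center 2
    (1, 0, 1): (20.0, 20.0),  # far from every center
}
centers = {1: (0.0, 0.0), 2: (5.0, 5.0)}
labels = assign_instances(embeddings, centers)
```

In this toy clip the pixels at (0, 0, 0) and (1, 0, 0) receive the same instance id although they lie in different frames, which is the sense in which clustering over the whole volume replaces a separate association step.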