4 research outputs found

    STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos

    Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg. Comment: 28 pages, 6 figures
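
    The clustering idea in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `cluster_instance`, the Gaussian scoring rule, the hand-picked `center` and `sigma`, and the 0.5 threshold are all assumptions made for illustration (the abstract states that the network learns the clustering parameters). The sketch treats a clip as one (T, H, W) volume with an embedding per pixel and selects every pixel, in every frame, whose embedding lies close to one instance center.

```python
import numpy as np

def cluster_instance(embeddings, center, sigma, threshold=0.5):
    """Hypothetical helper: select pixels of one instance over a whole clip.

    embeddings: array of shape (T, H, W, E), one E-dim embedding per pixel
    center:     array of shape (E,), assumed instance center in embedding space
    sigma:      float, assumed bandwidth of the Gaussian around the center
    Returns a boolean mask of shape (T, H, W).
    """
    diff = (embeddings - center) / sigma
    # Gaussian score in (0, 1]; equals 1 where the embedding matches the center.
    prob = np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    return prob > threshold

# Toy clip: 3 frames of 4x4 pixels with 2-D embeddings.
T, H, W = 3, 4, 4
emb = np.zeros((T, H, W, 2))            # background embeds at (0, 0)
for t in range(T):
    emb[t, t:t + 2, 0:2] = [1.0, 1.0]   # instance A moves down one row per frame
emb[:, 0:2, 2:4] = [-1.0, -1.0]         # instance B stays in place

# One call yields the mask of instance A in all frames at once,
# so segmentation and association over time happen in a single step.
mask_a = cluster_instance(emb, center=np.array([1.0, 1.0]), sigma=0.5)
```

    Because the mask is computed over the whole volume, no separate per-frame matching step is needed in this toy setting; real embeddings are noisy, which is why the bandwidth matters.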

    An empirical study of detection-based video instance segmentation

    Video instance segmentation (VIS) is a composite task that requires the joint detection, tracking, and segmentation of objects in a video. In this work, we introduce a complete framework for VIS, which integrates the strengths of instance segmentation and general object tracking in addressing the unique challenges of VIS. In developing the framework, we investigate effective ways of coordinating the two components for maximum benefit while thoroughly investigating their separate contributions. Our approach improves over the official baseline by an absolute 14.4% in mAP and achieves second place in the 2019 YouTubeVIS challenge.