Event-Free Moving Object Segmentation from Moving Ego Vehicle
Moving object segmentation (MOS) in dynamic scenes is challenging for
autonomous driving, especially for sequences obtained from moving ego vehicles.
Most state-of-the-art methods leverage motion cues obtained from optical flow
maps. However, because these methods often rely on optical flow that is
pre-computed from successive RGB frames, they neglect the temporal
information carried by events occurring between frames, which limits their
practicality in real-life situations. To address these
limitations, we propose to exploit event cameras for better video
understanding, which provide rich motion cues without relying on optical flow.
To foster research in this area, we first introduce a novel large-scale dataset
called DSEC-MOS for moving object segmentation from moving ego vehicles.
Subsequently, we devise EmoFormer, a novel network able to exploit the event
data. For this purpose, we fuse the event prior with spatial semantic maps to
distinguish moving objects from the static background, adding another level of
dense supervision around the objects of interest, namely the moving ones. Our
proposed network relies on event data only for training and does not require
event input during inference, making it directly comparable to frame-only
methods in terms of efficiency and more widely applicable. An exhaustive
comparison with 8 state-of-the-art video object segmentation methods shows
that our method significantly outperforms all of them.
Project Page: https://github.com/ZZY-Zhou/DSEC-MOS
Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
In this paper, we propose a simple yet effective approach for self-supervised
video object segmentation (VOS). Our key insight is that the inherent
structural dependencies present in DINO-pretrained Transformers can be
leveraged to establish robust spatio-temporal correspondences in videos.
Furthermore, simple clustering on this correspondence cue is sufficient to
yield competitive segmentation results. Previous self-supervised VOS techniques
mostly resort to auxiliary modalities or use iterative slot attention to
assist in object discovery, which restricts their general applicability and
imposes higher computational requirements. To deal with these challenges, we
develop a simplified architecture that capitalizes on the objectness emerging
from DINO-pretrained Transformers, bypassing the need for additional modalities
or slot attention. Specifically, we first introduce a single spatio-temporal
Transformer block to process the frame-wise DINO features and establish
spatio-temporal dependencies in the form of self-attention. Subsequently,
utilizing these attention maps, we implement hierarchical clustering to
generate object segmentation masks. To train the spatio-temporal block in a
fully self-supervised manner, we employ semantic and dynamic motion consistency
coupled with entropy normalization. Our method demonstrates state-of-the-art
performance across multiple unsupervised VOS benchmarks and particularly excels
in complex real-world multi-object video segmentation tasks such as
DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will
be released at https://github.com/shvdiwnkozbw/SSL-UVOS
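The two inference steps described in this abstract (self-attention over
spatio-temporal tokens, then hierarchical clustering on the attention maps) can
be illustrated with a toy sketch. This is not the authors' implementation: it
assumes single-head attention without learned projections, symmetrised
attention as the affinity, and average linkage, and the function names
self_attention and agglomerative_clusters are hypothetical. The hand-written
2-D vectors stand in for frame-wise DINO features.

```python
import math


def self_attention(features):
    """Row-wise softmax of scaled dot products between all token features.

    Assumption: no learned query/key projections; tokens from all frames are
    simply concatenated into one list.
    """
    dim = len(features[0])
    attention = []
    for query in features:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attention.append([e / total for e in exps])
    return attention


def agglomerative_clusters(attention, num_clusters):
    """Average-linkage agglomerative clustering on symmetrised attention.

    Assumption: affinity between tokens i and j is the mean of the two
    attention weights; the paper's exact clustering criterion may differ.
    """
    n = len(attention)
    affinity = [
        [0.5 * (attention[i][j] + attention[j][i]) for j in range(n)]
        for i in range(n)
    ]
    clusters = [[i] for i in range(n)]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                link = sum(
                    affinity[i][j] for i in clusters[a] for j in clusters[b]
                ) / pairs
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy tokens: three resemble one object, three resemble another.
tokens = [[4.0, 0.0], [4.2, 0.1], [3.9, -0.1],
          [0.0, 4.0], [0.1, 4.1], [-0.1, 3.9]]
attention = self_attention(tokens)
labels = agglomerative_clusters(attention, num_clusters=2)
print(labels)
```

In this sketch each label plays the role of one segmentation mask: tokens that
attend strongly to each other end up in the same cluster. The pairwise search
is quadratic per merge, so it is only meant for a handful of toy tokens.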