In Defense of Clip-based Video Relation Detection
Video Visual Relation Detection (VidVRD) aims to detect visual relationship
triplets in videos using spatial bounding boxes and temporal boundaries.
Existing VidVRD methods can be broadly categorized into bottom-up and top-down
paradigms, depending on their approach to classifying relations. Bottom-up
methods follow a clip-based approach where they classify relations of short
clip tubelet pairs and then merge them into long video relations. On the other
hand, top-down methods directly classify long video tubelet pairs. While recent
video-based methods utilizing video tubelets have shown promising results, we
argue that the effective modeling of spatial and temporal context plays a more
significant role than the choice between clip tubelets and video tubelets. This
motivates us to revisit the clip-based paradigm and explore the key success
factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM)
that enriches the object-based spatial context and relation-based temporal
context based on clips. We demonstrate that clip tubelets can achieve
performance superior to that of most video-based methods. Moreover, clip
tubelets offer greater flexibility in model design and help alleviate the
limitations associated with video tubelets, such as the challenging long-term
object tracking problem and the loss of temporal information in long-term
tubelet feature compression. Extensive experiments conducted on two challenging
VidVRD benchmarks validate that our HCM achieves a new state-of-the-art
performance, highlighting the effectiveness of incorporating advanced spatial
and temporal context modeling within the clip-based paradigm.
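The bottom-up, clip-based paradigm described in this abstract first classifies relation triplets on short clips and then merges them into long video relations. Below is a minimal sketch of one way such merging can be done, by greedily chaining clip-level triplets whose tubelets overlap at clip boundaries; the names (Relation, merge_clip_relations) and the boundary-box IoU rule are illustrative assumptions, not the paper's actual procedure.

```python
from dataclasses import dataclass, field


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


@dataclass
class Relation:
    subj_cls: str                                    # subject category, e.g. "dog"
    pred: str                                        # predicate, e.g. "chase"
    obj_cls: str                                     # object category, e.g. "person"
    start: int                                       # first frame index (inclusive)
    end: int                                         # last frame index (inclusive)
    subj_boxes: list = field(default_factory=list)   # one subject box per frame
    obj_boxes: list = field(default_factory=list)    # one object box per frame


def merge_clip_relations(clip_rels, iou_thresh=0.5):
    """Greedily chain clip-level triplets into long, video-level relations."""
    merged = []
    for rel in sorted(clip_rels, key=lambda r: r.start):
        for vid_rel in merged:
            same_triplet = (vid_rel.subj_cls, vid_rel.pred, vid_rel.obj_cls) == \
                           (rel.subj_cls, rel.pred, rel.obj_cls)
            adjacent = rel.start <= vid_rel.end + 1
            # Simplified association rule: compare the boxes at the clip boundary.
            if same_triplet and adjacent \
                    and box_iou(vid_rel.subj_boxes[-1], rel.subj_boxes[0]) > iou_thresh \
                    and box_iou(vid_rel.obj_boxes[-1], rel.obj_boxes[0]) > iou_thresh:
                vid_rel.end = max(vid_rel.end, rel.end)
                vid_rel.subj_boxes += rel.subj_boxes
                vid_rel.obj_boxes += rel.obj_boxes
                break
        else:
            merged.append(rel)
    return merged
```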
Understanding Masked Autoencoders From a Local Contrastive Perspective
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised
learning with its simple yet effective masking and reconstruction strategies.
However, although MAE achieves state-of-the-art performance across various
downstream vision tasks, the mechanisms that drive its efficacy remain less
well explored than those of the canonical contrastive learning paradigm.
In this paper, we first propose a local perspective to explicitly extract a
local contrastive form from MAE's reconstructive objective at the patch level.
We then introduce a new empirical framework, called Local Contrastive MAE
(LC-MAE), to analyze both reconstructive and contrastive aspects of MAE. LC-MAE
reveals that MAE learns invariance to random masking and ensures distribution
consistency between the learned token embeddings and the original images.
Furthermore, we dissect the contribution of the decoder and random masking to
MAE's success, revealing both the decoder's learning mechanism and the dual
role of random masking as data augmentation and effective receptive field
restriction. Our experimental analysis sheds light on the intricacies of MAE
and summarizes some useful design methodologies, which can inspire more
powerful visual self-supervised methods.
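To illustrate what a local contrastive form at the patch level can look like, the sketch below treats each reconstructed masked patch as a query whose positive key is its own target patch and whose negatives are the other masked patches of the image. This is a generic patch-level InfoNCE loss written for illustration only; the function name, temperature, and tensor shapes are assumptions, not the exact objective derived in the paper.

```python
import torch
import torch.nn.functional as F


def patch_level_infonce(pred_patches, target_patches, temperature=0.1):
    """Patch-level InfoNCE: row i of the logits matches reconstruction i to target i.

    pred_patches, target_patches: (num_masked_patches, patch_dim) tensors.
    """
    q = F.normalize(pred_patches, dim=-1)               # queries: reconstructed patches
    k = F.normalize(target_patches, dim=-1)             # keys: ground-truth patches
    logits = q @ k.t() / temperature                    # (N, N) cosine-similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positive key = same patch index
    return F.cross_entropy(logits, labels)


# Usage sketch with random stand-ins for decoder outputs and patch targets.
pred = torch.randn(49, 768)      # hypothetical reconstructions of 49 masked patches
target = torch.randn(49, 768)    # hypothetical targets for those same patches
loss = patch_level_infonce(pred, target)
```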
Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving
Robotic perception requires the modeling of both 3D geometry and semantics.
Existing methods typically focus on estimating 3D bounding boxes, neglecting
finer geometric details and struggling to handle general, out-of-vocabulary
objects. 3D occupancy prediction, which estimates the detailed occupancy states
and semantics of a scene, is an emerging task to overcome these limitations. To
support 3D occupancy prediction, we develop a label generation pipeline that
produces dense, visibility-aware labels for any given scene. This pipeline
comprises three stages: voxel densification, occlusion reasoning, and
image-guided voxel refinement. We establish two benchmarks derived from the
Waymo Open Dataset and the nuScenes Dataset, namely Occ3D-Waymo and
Occ3D-nuScenes. Furthermore, we provide an extensive analysis of the proposed
benchmarks with various baseline models. Lastly, we propose a new model, dubbed
the Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior
performance on the Occ3D benchmarks. The code, data, and benchmarks are
released at https://tsinghua-mars-lab.github.io/Occ3D/.
Comment: Accepted to NeurIPS 2023
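To make the label generation idea concrete, the sketch below shows the basic voxelization step such a pipeline builds on: aggregated, semantically labeled LiDAR points are binned into a fixed grid and each occupied voxel takes the majority label. The occlusion reasoning and image-guided refinement stages mentioned above are omitted, and the function name, grid range, and resolution are illustrative assumptions rather than the Occ3D specification.

```python
import numpy as np


def voxelize_semantic(points, labels, grid_min, voxel_size, grid_shape, free_label=0):
    """Bin labeled points into a voxel grid; each occupied voxel takes the majority label.

    points: (N, 3) xyz coordinates; labels: (N,) non-negative integer class ids.
    """
    occ = np.full(grid_shape, free_label, dtype=np.int32)
    idx = np.floor((points - grid_min) / voxel_size).astype(np.int64)
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, labels = idx[in_bounds], labels[in_bounds]
    flat = np.ravel_multi_index((idx[:, 0], idx[:, 1], idx[:, 2]), grid_shape)
    for voxel in np.unique(flat):                    # majority vote per occupied voxel
        occ.flat[voxel] = np.bincount(labels[flat == voxel]).argmax()
    return occ


# Usage with illustrative (not Occ3D-specified) grid settings: a 200 x 200 x 16
# grid at 0.4 m resolution around the ego vehicle, filled with random points.
pts = np.random.uniform([-40.0, -40.0, -1.0], [40.0, 40.0, 5.4], size=(10000, 3))
lbl = np.random.randint(1, 17, size=10000)
grid = voxelize_semantic(pts, lbl, grid_min=np.array([-40.0, -40.0, -1.0]),
                         voxel_size=0.4, grid_shape=(200, 200, 16))
```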