MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning
Learning robust and scalable visual representations from massive multi-view
video data remains a challenge in computer vision and autonomous driving.
Existing pre-training methods either rely on expensive supervised learning with
3D annotations, which limits scalability, or focus on single-frame or monocular
inputs, neglecting temporal information. We propose MIM4D, a novel
pre-training paradigm based on dual masked image modeling (MIM). MIM4D
leverages both spatial and temporal relations by training on masked multi-view
video inputs. It constructs pseudo-3D features using continuous scene flow and
projects them onto the 2D plane for supervision. To address the lack of dense 3D
supervision, MIM4D reconstructs pixels via 3D volumetric differentiable
rendering to learn geometric representations. We demonstrate that MIM4D
achieves state-of-the-art performance on the nuScenes dataset for visual
representation learning in autonomous driving. It significantly improves over
existing methods on multiple downstream tasks, including BEV segmentation (8.7%
IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP). Our
work offers a new choice for learning representations at scale in autonomous
driving. Code and models are released at https://github.com/hustvl/MIM4
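To make the dual masked image modeling objective concrete, here is a minimal sketch
of masked multi-view video modeling in PyTorch: patches of multi-view video frames
are randomly masked, a placeholder encoder/decoder reconstructs them, and only the
masked positions are supervised with a pixel loss. The shapes and modules are
hypothetical, and the scene-flow warping and 3D volumetric rendering described above
are omitted; this is not the released MIM4D code.

```python
# Toy sketch of masked multi-view video modeling (hypothetical shapes/modules;
# MIM4D additionally warps features with continuous scene flow and supervises
# pixel reconstruction through 3D volumetric differentiable rendering).
import torch
import torch.nn as nn

B, T, V, C, H, W = 2, 3, 6, 3, 64, 64   # batch, time, views, channels, height, width
patch, mask_ratio = 16, 0.5

frames = torch.rand(B, T, V, C, H, W)

# Patchify each frame and randomly mask a fraction of the patches.
num_patches = (H // patch) * (W // patch)
keep = torch.rand(B, T, V, num_patches) > mask_ratio          # True = visible
mask = ~keep

# Placeholder encoder/decoder: per-patch linear projections.
dim = 128
encoder = nn.Linear(patch * patch * C, dim)
decoder = nn.Linear(dim, patch * patch * C)

# Rearrange frames into (B, T, V, num_patches, patch*patch*C).
patches = frames.unfold(4, patch, patch).unfold(5, patch, patch)
patches = patches.permute(0, 1, 2, 4, 5, 3, 6, 7).reshape(B, T, V, num_patches, -1)

# Zero out masked patches before encoding, then reconstruct all patches.
visible = patches * keep.unsqueeze(-1)
recon = decoder(encoder(visible))

# Supervise only the masked positions with a pixel reconstruction loss.
loss = ((recon - patches) ** 2)[mask].mean()
loss.backward()
print(float(loss))
```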
Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction
Online lane graph construction is a promising but challenging task in
autonomous driving. Previous methods usually model the lane graph at the pixel
or piece level, and recover the lane graph by pixel-wise or piece-wise
connection, which breaks the continuity of the lane. Human drivers focus
on and drive along the continuous and complete paths instead of considering
lane pieces. Autonomous vehicles also require path-specific guidance from the
lane graph for trajectory planning. We argue that the path, which indicates the
traffic flow, is the primitive of the lane graph. Motivated by this, we propose
to model the lane graph in a novel path-wise manner, which preserves the
continuity of the lane and encodes traffic information for planning. We present
a path-based online lane graph construction method, termed LaneGAP, which
learns the path end-to-end and recovers the lane graph via a Path2Graph
algorithm. We qualitatively and quantitatively demonstrate the superiority of
LaneGAP over conventional pixel-based and piece-based methods on the challenging
nuScenes and Argoverse2 datasets. Abundant visualizations show LaneGAP can cope
with diverse traffic conditions. Code and models will be released at
https://github.com/hustvl/LaneGAP to facilitate future research.
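To illustrate the path-wise formulation, the sketch below implements a
Path2Graph-style merge under assumed conventions: each path is an ordered polyline,
overlapping waypoints are snapped to shared nodes, and consecutive waypoints become
directed edges, recovering forks and merges. The snapping tolerance and data layout
are illustrative assumptions, not the LaneGAP implementation.

```python
# Illustrative Path2Graph-style merge (assumed data layout; not the LaneGAP code).
# Each path is an ordered list of (x, y) waypoints; overlapping waypoints are
# snapped to shared graph nodes, and consecutive waypoints become directed edges.
from collections import defaultdict

def paths_to_graph(paths, tol=0.5):
    """Merge a set of continuous paths into a node/edge lane graph."""
    nodes = {}                      # snapped key -> node id
    coords = []                     # node id -> representative (x, y)
    edges = defaultdict(set)        # node id -> successor node ids

    def node_id(pt):
        key = (round(pt[0] / tol), round(pt[1] / tol))   # grid snapping
        if key not in nodes:
            nodes[key] = len(coords)
            coords.append(pt)
        return nodes[key]

    for path in paths:
        ids = [node_id(p) for p in path]
        for a, b in zip(ids, ids[1:]):
            if a != b:
                edges[a].add(b)
    return coords, {k: sorted(v) for k, v in edges.items()}

# Two paths sharing a common segment (a lane split).
paths = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.5)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, -0.5)],
]
coords, edges = paths_to_graph(paths)
print(coords)   # merged nodes
print(edges)    # fork appears at the shared node
```

Because each path is continuous end to end, the graph topology (forks and merges)
emerges from node sharing rather than from stitching short lane pieces back together.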
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
Learning a human-like driving policy from large-scale driving demonstrations
is promising, but the uncertainty and non-deterministic nature of planning make
it challenging. In this work, to cope with the uncertainty problem, we propose
VADv2, an end-to-end driving model based on probabilistic planning. VADv2 takes
multi-view image sequences as input in a streaming manner, transforms sensor
data into environmental token embeddings, outputs a probabilistic
distribution over actions, and samples one action to control the vehicle. Only
with camera sensors, VADv2 achieves state-of-the-art closed-loop performance on
the CARLA Town05 benchmark, significantly outperforming all existing methods.
It runs stably in a fully end-to-end manner, even without the rule-based
wrapper. Closed-loop demos are presented at https://hgao-cv.github.io/VADv2.
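The probabilistic planning head can be sketched as a categorical distribution over a
discretized trajectory vocabulary: environmental token embeddings score each
candidate action, and one action is sampled for control. The vocabulary size,
pooling, and scoring layers below are placeholder assumptions rather than the VADv2
architecture.

```python
# Toy sketch of a probabilistic planning head (hypothetical sizes/layers;
# not the VADv2 architecture): score a fixed action vocabulary against
# environmental token embeddings and sample one action to execute.
import torch
import torch.nn as nn

num_actions, dim, num_tokens = 4096, 256, 300

# Planning vocabulary: each entry is a candidate future-trajectory embedding.
action_embed = nn.Embedding(num_actions, dim)
scorer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

# Environmental token embeddings produced upstream from multi-view images.
env_tokens = torch.rand(1, num_tokens, dim)
scene = env_tokens.mean(dim=1)                      # (1, dim) pooled scene context

# Score every candidate action conditioned on the scene context.
candidates = action_embed.weight.unsqueeze(0)       # (1, num_actions, dim)
logits = scorer(candidates + scene.unsqueeze(1)).squeeze(-1)   # (1, num_actions)

# Probabilistic planning: a categorical distribution over actions, sample one.
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                              # index into the trajectory vocabulary
print(action.item(), dist.probs[0, action].item())
```

Predicting a distribution and sampling, rather than regressing a single trajectory,
is what lets such a head represent the uncertainty and non-deterministic nature of
planning described above.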
VMA: Divide-and-Conquer Vectorized Map Annotation System for Large-Scale Driving Scene
High-definition (HD) maps serve as essential infrastructure for autonomous
driving. In this work, we build a systematic vectorized map annotation
framework (termed VMA) for efficiently generating HD maps of large-scale driving
scenes. We design a divide-and-conquer annotation scheme to solve the spatial
extensibility problem of HD map generation, and abstract map elements with a
variety of geometric patterns as a unified point sequence representation, which
can be extended to most map elements in the driving scene. VMA is highly
efficient and extensible, requiring negligible human effort, and flexible in
terms of spatial scale and element type. We quantitatively and qualitatively
validate the annotation performance on real-world urban and highway scenes, as
well as the NYC Planimetric Database. VMA significantly improves map generation
efficiency while requiring little human effort. On average, VMA takes 160 minutes
to annotate a scene spanning hundreds of meters and reduces human cost by 52.3%,
showing great application value.
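A minimal sketch of the divide-and-conquer idea, under assumed tile sizes and
element formats (not the VMA system): the scene extent is split into overlapping
tiles, each tile is annotated independently, and every map element is kept in the
unified form of an ordered point sequence.

```python
# Illustrative divide-and-conquer annotation sketch (assumed tile size and
# element format; not the VMA system). Map elements of any geometric pattern
# share a unified representation: an ordered sequence of (x, y) points.
def split_into_tiles(extent, tile=60.0, overlap=10.0):
    """Yield (x0, y0) origins of square tiles covering a rectangular extent."""
    (xmin, ymin), (xmax, ymax) = extent
    step = tile - overlap
    x = xmin
    while x < xmax:
        y = ymin
        while y < ymax:
            yield (x, y)
            y += step
        x += step

def annotate_tile(origin, tile=60.0):
    """Placeholder per-tile annotator: returns map elements as point sequences."""
    x0, y0 = origin
    divider = [(x0 + t, y0 + tile / 2) for t in range(0, int(tile) + 1, 10)]
    return [{"type": "lane_divider", "points": divider}]

def annotate_scene(extent):
    elements = []
    for origin in split_into_tiles(extent):
        elements.extend(annotate_tile(origin))   # per-tile results merge into one map
    return elements

# A scene spanning several hundred meters is annotated tile by tile.
scene_extent = ((0.0, 0.0), (300.0, 120.0))
elements = annotate_scene(scene_extent)
print(len(elements), elements[0]["type"], elements[0]["points"][:2])
```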
VAD: Vectorized Scene Representation for Efficient Autonomous Driving
Autonomous driving requires a comprehensive understanding of the surrounding
environment for reliable trajectory planning. Previous works rely on dense
rasterized scene representation (e.g., agent occupancy and semantic map) to
perform planning, which is computationally intensive and misses the
instance-level structure information. In this paper, we propose VAD, an
end-to-end vectorized paradigm for autonomous driving, which models the driving
scene as a fully vectorized representation. The proposed vectorized paradigm
has two significant advantages. On one hand, VAD exploits the vectorized agent
motion and map elements as explicit instance-level planning constraints, which
effectively improves planning safety. On the other hand, VAD runs much faster
than previous end-to-end planning methods by getting rid of the
computation-intensive rasterized representation and hand-designed
post-processing steps. VAD achieves state-of-the-art end-to-end planning
performance on the nuScenes dataset, outperforming the previous best method by
a large margin. Our base model, VAD-Base, greatly reduces the average collision
rate by 29.0% and runs 2.5x faster. Besides, a lightweight variant, VAD-Tiny,
greatly improves the inference speed (up to 9.3x) while achieving comparable
planning performance. We believe the excellent performance and the high
efficiency of VAD are critical for the real-world deployment of an autonomous
driving system. Code and models will be released at
https://github.com/hustvl/VA to facilitate future research.
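To illustrate how vectorized outputs can serve as explicit instance-level planning
constraints, the sketch below computes a simple collision penalty between an ego
plan and agents' forecast trajectories, all represented as polylines of future
waypoints. The safety margin and data layout are illustrative assumptions, not the
VAD planning constraints.

```python
# Minimal sketch of an instance-level collision constraint on vectorized
# trajectories (assumed layout and threshold; not the VAD loss). The ego plan
# and agent motion forecasts are polylines: (T, 2) arrays of future waypoints.
import numpy as np

def collision_penalty(ego_plan, agent_trajs, safe_dist=1.5):
    """Penalize ego waypoints closer than safe_dist to any agent at the same
    future timestep (hinge on the vectorized point-wise distances)."""
    ego = ego_plan[None, :, :]                            # (1, T, 2)
    dists = np.linalg.norm(agent_trajs - ego, axis=-1)    # (N_agents, T)
    violation = np.clip(safe_dist - dists, 0.0, None)     # > 0 only when too close
    return violation.sum()

T = 6
ego_plan = np.stack([np.linspace(0, 15, T), np.zeros(T)], axis=-1)      # straight ahead
agent_a = np.stack([np.linspace(1, 16, T), np.full(T, 0.5)], axis=-1)   # close, same lane
agent_b = np.stack([np.linspace(0, 15, T), np.full(T, 8.0)], axis=-1)   # far lane
agent_trajs = np.stack([agent_a, agent_b])                              # (2, T, 2)

print(collision_penalty(ego_plan, agent_trajs))   # nonzero: agent_a violates the margin
```

Because agents and map elements stay in vectorized form, such a constraint operates
directly on instance-level structure instead of a dense rasterized occupancy grid.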