Observation of sub-Doppler absorption in the Λ-type three-level Doppler-broadened cesium system
Thanks to the atomic coherence induced in a coupling-laser-driven atomic system,
sub-Doppler absorption has been observed in a Doppler-broadened cesium vapor cell
via the Λ-type three-level scheme. The linewidth of the sub-Doppler
absorption peak becomes narrower as the frequency detuning of the coupling laser
increases. The results are in agreement with the theoretical prediction by G.
Vemuri et al. [PRA, Vol. 53 (1996), p. 2842]. Comment: 12 pages, 5 figures, to appear in Applied Physics
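The sub-Doppler character of such a Λ-scheme resonance has a standard textbook explanation (stated here for context, not taken from the paper): for co-propagating, nearly degenerate probe and coupling beams, the two-photon detuning is almost velocity-independent, so the coherent dark-state resonance survives Doppler averaging even though the one-photon lines do not.

```latex
% Textbook \Lambda-system relations (illustrative, not from the paper).
% Probe: |1> -> |3>, Rabi frequency \Omega_p, detuning \Delta_p;
% coupling: |2> -> |3>, Rabi frequency \Omega_c, detuning \Delta_c.
% For an atom with velocity v and co-propagating beams (k_p \approx k_c):
\delta_{2\mathrm{ph}} = \Delta_p - \Delta_c - (k_p - k_c)\,v
  \;\approx\; \Delta_p - \Delta_c ,
% so the two-photon (dark-state) resonance
|D\rangle \propto \Omega_c\,|1\rangle - \Omega_p\,|2\rangle
% is velocity-insensitive and produces a sub-Doppler feature at \Delta_p = \Delta_c.
```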
Language Prompt for Autonomous Driving
A new trend in the computer vision community is to capture objects of
interest following a flexible human command expressed as a natural language
prompt. However, the progress of using language prompts in driving scenarios is
stuck in a bottleneck due to the scarcity of paired prompt-instance data. To
address this challenge, we propose the first object-centric language prompt set
for driving scenes within 3D, multi-view, and multi-frame space, named
NuPrompt. It expands the nuScenes dataset by constructing a total of 35,367
language descriptions, each referring to an average of 5.3 object tracks. Based
on the object-text pairs from the new benchmark, we formulate a new
prompt-based driving task, i.e., employing a language prompt to predict the
described object trajectory across views and frames. Furthermore, we provide a
simple end-to-end baseline model based on Transformer, named PromptTrack.
Experiments show that our PromptTrack achieves impressive performance on
NuPrompt. We hope this work can provide new insights for the autonomous
driving community. The dataset and code will be made public at
https://github.com/wudongming97/Prompt4Driving.
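To make the object-centric pairing concrete, a prompt-instance record in such a benchmark can be pictured as one language description tied to the IDs of the object tracks it refers to across cameras and frames. The schema below is hypothetical (field names and values are ours, not NuPrompt's actual format).

```python
from dataclasses import dataclass, field

@dataclass
class PromptRecord:
    """Hypothetical object-centric prompt annotation (illustrative only;
    this is not the actual NuPrompt schema)."""
    scene_token: str                # driving scene the prompt belongs to
    prompt: str                     # natural-language description of the targets
    track_ids: list[str] = field(default_factory=list)  # referred object tracks

record = PromptRecord(
    scene_token="scene-0103",
    prompt="the pedestrians crossing in front of the ego vehicle",
    track_ids=["instance_12", "instance_47"],
)
print(record.prompt, "->", record.track_ids)
```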
VLM-Eval: A General Evaluation on Video Large Language Models
Despite the rapid development of video Large Language Models (LLMs), a
comprehensive evaluation is still absent. In this paper, we introduce a unified
evaluation that encompasses multiple video tasks, including captioning,
question answering, retrieval, and action recognition. In addition to
conventional metrics, we showcase how GPT-based evaluation can match human-like
performance in assessing response quality across multiple aspects. We propose a
simple baseline: Video-LLaVA, which uses a single linear projection and
outperforms existing video LLMs. Finally, we evaluate video LLMs beyond
academic datasets, where they show encouraging recognition and reasoning
capabilities in driving scenarios with only hundreds of video-instruction pairs
for fine-tuning. We hope our work can serve as a unified evaluation for video
LLMs and help extend them to more practical scenarios. The evaluation code will be
available soon.
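GPT-based evaluation of this kind typically asks a judge model to score a candidate answer against a reference. A minimal sketch follows; `call_judge_llm` is a placeholder for whatever LLM API is used, and the rubric wording is ours, not the paper's.

```python
JUDGE_TEMPLATE = """You are grading a video QA answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate the candidate's correctness from 1 to 5 and reply with only the number."""

def call_judge_llm(prompt: str) -> str:
    # Placeholder: route this to an actual LLM chat-completion endpoint.
    raise NotImplementedError

def gpt_score(question: str, reference: str, candidate: str) -> int:
    """Ask a judge LLM for a 1-5 correctness score of a model's answer."""
    reply = call_judge_llm(JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    return int(reply.strip().split()[0])  # tolerate trailing text after the score
```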
Cross Modal Transformer: Towards Fast and Robust 3D Object Detection
In this paper, we propose a robust 3D detector, named Cross Modal Transformer
(CMT), for end-to-end 3D multi-modal detection. Without explicit view
transformation, CMT takes image and point-cloud tokens as inputs and
directly outputs accurate 3D bounding boxes. The spatial alignment of
multi-modal tokens is performed by encoding the 3D points into multi-modal
features. The core design of CMT is quite simple, yet its performance is
impressive: it achieves 74.1% NDS (state-of-the-art with a single model) on the
nuScenes test set while maintaining fast inference speed. Moreover, CMT remains
robust even when the LiDAR input is missing. Code is released at
https://github.com/junjie18/CMT
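The shared-coordinate alignment can be pictured as encoding sampled 3D points into positional embeddings added to both image and LiDAR tokens, so cross-attention operates in one implicit 3D space. Below is a generic PyTorch sketch of such a coordinate-based positional encoding; the MLP design and layer sizes are our assumptions, not CMT's exact architecture.

```python
import torch
import torch.nn as nn

class Coords3DPositionEncoding(nn.Module):
    """Map 3D point coordinates to token embeddings so that image tokens
    (via points sampled along camera rays) and LiDAR tokens (via voxel
    centers) share one positional space. Generic sketch, not CMT's module."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        return self.mlp(xyz)  # (..., 3) -> (..., dim)

pe = Coords3DPositionEncoding(dim=256)
lidar_tokens = torch.randn(1, 1000, 256) + pe(torch.rand(1, 1000, 3))  # voxel centers
image_tokens = torch.randn(1, 4000, 256) + pe(torch.rand(1, 4000, 3))  # ray-sampled points
```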
ADriver-I: A General World Model for Autonomous Driving
Typically, autonomous driving adopts a modular design, which divides the full
stack into perception, prediction, planning and control parts. Though
interpretable, such modular design tends to introduce a substantial amount of
redundancy. Recently, multimodal large language models (MLLMs) and diffusion
techniques have demonstrated strong comprehension and generation abilities,
respectively. In this paper, we first introduce the concept of the
interleaved vision-action pair, which unifies the format of visual features and
control signals. Based on the vision-action pairs, we construct a general world
model based on MLLM and diffusion model for autonomous driving, termed
ADriver-I. It takes the vision-action pairs as inputs and autoregressively
predicts the control signal of the current frame. The generated control signals
together with the historical vision-action pairs are further conditioned to
predict the future frames. With the predicted next frame, ADriver-I performs
further control-signal prediction. This process can be repeated indefinitely,
allowing ADriver-I to drive autonomously in a world created by itself.
Extensive experiments are conducted on nuScenes and our large-scale private
datasets. ADriver-I shows impressive performance compared to several
constructed baselines. We hope our ADriver-I can provide some new insights for
future autonomous driving and embodied intelligence. Comment: Tech Report
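The closed-loop rollout the abstract describes can be summarized in a few lines: a policy model predicts the current action from the interleaved history, and a world model then renders the next frame conditioned on that history plus the new action. The sketch below uses placeholder `mllm_policy` and `diffusion_world_model` callables; it illustrates only the loop structure, not ADriver-I's actual interfaces.

```python
def rollout(frames, actions, mllm_policy, diffusion_world_model, horizon=10):
    """Autoregressive vision-action rollout (illustrative loop only).

    frames, actions: interleaved history lists of equal length.
    mllm_policy(frames, actions) -> control signal for the current frame.
    diffusion_world_model(frames, actions) -> predicted next frame.
    """
    for _ in range(horizon):
        action = mllm_policy(frames, actions)           # predict current control
        actions.append(action)
        frame = diffusion_world_model(frames, actions)  # imagine the next frame
        frames.append(frame)                            # feed it back as history
    return frames, actions
```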
Realization of strong coupling between deterministic single-atom arrays and a high-finesse miniature optical cavity
We experimentally demonstrate strong coupling between a one-dimensional (1D)
single-atom array and a high-finesse miniature cavity. The atom array is
obtained by loading single atoms into a 1D optical tweezer array with
dimensions of 1 × 11. A deterministic number of atoms is thereby
obtained, and the atom number is determined by imaging the atom array on a CCD
camera in real time. By precisely controlling the position and spacing of the
atom array in the high finesse Fabry--Perot cavity, all the atoms in the array
are strongly coupled to the cavity simultaneously. The vacuum Rabi splitting
spectra are discriminated for deterministic atom numbers from 1 to 8, and the
dependence of the collective enhancement of the coupling strength on
atom number is validated at the single-atom level. Comment: Main text: 7 pages, 5 figures; Supplementary material: 5 pages, 4
figures
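For context, the collective enhancement referred to here is the standard Tavis-Cummings scaling (a textbook result, not a claim of the paper): N identical atoms coupled to one cavity mode act as a single emitter with a √N-enhanced coupling.

```latex
% Tavis-Cummings collective coupling (textbook result, stated for context):
% N identical atoms, each coupled to the cavity mode with strength g.
g_N = \sqrt{N}\, g ,
\qquad
\Delta\omega_{\mathrm{VRS}} = 2 g_N = 2\sqrt{N}\, g
% (on-resonance vacuum Rabi splitting of the two normal modes).
```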
Far3D: Expanding the Horizon for Surround-view 3D Object Detection
Recently, 3D object detection from surround-view images has made notable
advancements with its low deployment cost. However, most works have primarily
focused on close perception range while leaving long-range detection less
explored. Expanding existing methods directly to cover long distances poses
challenges such as heavy computation costs and unstable convergence. To address
these limitations, this paper proposes a novel sparse query-based framework,
dubbed Far3D. By utilizing high-quality 2D object priors, we generate 3D
adaptive queries that complement the 3D global queries. To efficiently capture
discriminative features across different views and scales for long-range
objects, we introduce a perspective-aware aggregation module. Additionally, we
propose a range-modulated 3D denoising approach to address query error
propagation and mitigate convergence issues in long-range tasks. Notably,
Far3D demonstrates state-of-the-art performance on the challenging Argoverse 2
dataset, covering a wide perception range of 150 meters and surpassing several
LiDAR-based approaches.
Meanwhile, Far3D exhibits superior performance compared to previous methods on
the nuScenes dataset. The code is available at
https://github.com/megvii-research/Far3D. Comment: Accepted by AAAI 2024
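The "2D priors to 3D adaptive queries" step boils down to unprojecting confident 2D detections to 3D anchor positions using a depth estimate and the camera geometry. A minimal numpy sketch, assuming a pinhole intrinsic matrix K and a per-box depth; this mirrors the general idea, not Far3D's exact query-generation module, and the intrinsic values shown are illustrative.

```python
import numpy as np

def box_center_to_3d(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Unproject a 2D box center (u, v) with an estimated depth to a 3D point
    in the camera frame, using pinhole intrinsics K. Generic sketch only."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

K = np.array([[1266.4,    0.0, 816.3],
              [   0.0, 1266.4, 491.5],
              [   0.0,    0.0,   1.0]])  # illustrative surround-view intrinsics
anchor = box_center_to_3d(u=900.0, v=500.0, depth=42.0, K=K)
print(anchor)  # 3D position that would seed an adaptive query
```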
Stream Query Denoising for Vectorized HD Map Construction
To enhance perception in complex, large-scale autonomous-driving scenarios,
recent work has focused on temporal modeling, with particular emphasis on
streaming methods, which typically propagate temporal information through
stream queries. Despite the prevalence of this approach,
the direct application of the streaming paradigm to the construction of
vectorized high-definition maps (HD-maps) fails to fully harness the inherent
potential of temporal information. This paper introduces the Stream Query
Denoising (SQD) strategy as a novel approach for temporal modeling in
high-definition map (HD-map) construction. SQD is designed to facilitate the
learning of temporal consistency among map elements within the streaming model.
Concretely, queries are built by adding noise to the ground-truth map elements
of the preceding frame, and the model is trained to denoise them into the
ground truth of the current frame, thereby simulating the prediction process
inherent in stream queries. The SQD strategy can be applied to existing
streaming methods (e.g.,
StreamMapNet) to enhance their temporal modeling. The proposed SQD-MapNet is
StreamMapNet equipped with SQD. Extensive experiments on nuScenes and
Argoverse 2 show that our method consistently outperforms existing methods in
both close-range and long-range settings. The code will be available soon.
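The denoising-query construction can be sketched in a few lines: perturb the previous frame's ground-truth polyline points with noise to form extra decoder queries, and supervise them with the current frame's ground truth. The additive Gaussian scheme and scale below are our assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def make_denoising_queries(prev_gt_polylines, noise_scale=0.5, rng=None):
    """Build noised queries from the previous frame's GT map elements.

    prev_gt_polylines: list of (N_pts, 2) arrays of BEV polyline points.
    Returns perturbed copies to be fed as extra decoder queries; the training
    target is the *current* frame's ground truth. Illustrative additive
    Gaussian noise, not the paper's exact perturbation scheme.
    """
    rng = rng or np.random.default_rng()
    return [pts + rng.normal(scale=noise_scale, size=pts.shape)
            for pts in prev_gt_polylines]

divider = np.array([[0.0, 0.0], [5.0, 0.2], [10.0, 0.1]])  # toy lane divider
noised = make_denoising_queries([divider])
```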
Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
The field of autonomous driving increasingly demands high-quality annotated
training data. In this paper, we propose Panacea, an innovative approach to
generate panoramic and controllable videos in driving scenarios, capable of
yielding an unlimited number of diverse, annotated samples pivotal for
autonomous driving advancements. Panacea addresses two critical challenges:
'Consistency' and 'Controllability.' Consistency ensures temporal and
cross-view coherence, while Controllability ensures the alignment of generated
content with corresponding annotations. Our approach integrates a novel 4D
attention mechanism and a two-stage generation pipeline to maintain coherence,
supplemented by the ControlNet framework for fine-grained control via
Bird's-Eye-View (BEV) layouts. Extensive qualitative and quantitative
evaluations of Panacea on the nuScenes dataset prove its effectiveness in
generating high-quality multi-view driving-scene videos. This work notably
propels the field of autonomous driving by effectively augmenting the training
dataset used for advanced BEV perception techniques. Comment: Project page: https://panacea-ad.github.io
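One common way to obtain the cross-view and temporal coherence Panacea targets is to factorize attention over the view and time axes of a (time, view, space) token grid, attending along one axis at a time. The PyTorch sketch below shows that generic factorization; it is our illustration of the idea, not Panacea's actual 4D attention module.

```python
import torch
import torch.nn as nn

class FactorizedViewTimeAttention(nn.Module):
    """Attend along the view axis, then the time axis, of a
    (batch, time, view, tokens, dim) grid. Generic factorization sketch."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, t, v, n, d = x.shape
        # Cross-view attention: sequences of length v for each (frame, token).
        y = x.permute(0, 1, 3, 2, 4).reshape(b * t * n, v, d)
        y, _ = self.view_attn(y, y, y)
        # Temporal attention: sequences of length t for each (view, token).
        y = y.reshape(b, t, n, v, d).permute(0, 3, 2, 1, 4).reshape(b * v * n, t, d)
        y, _ = self.time_attn(y, y, y)
        return y.reshape(b, v, n, t, d).permute(0, 3, 1, 2, 4)  # (b, t, v, n, d)

x = torch.randn(1, 4, 6, 16, 256)  # 4 frames, 6 cameras, 16 spatial tokens
print(FactorizedViewTimeAttention()(x).shape)
```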