Sparse4D v3: Advancing End-to-End 3D Detection and Tracking
In autonomous driving perception systems, 3D detection and tracking are the
two fundamental tasks. This paper delves deeper into this field, building upon
the Sparse4D framework. We introduce two auxiliary training tasks (Temporal
Instance Denoising and Quality Estimation) and propose decoupled attention to
make structural improvements, leading to significant enhancements in detection
performance. Additionally, we extend the detector into a tracker using a
straightforward approach that assigns instance IDs during inference, further
highlighting the advantages of query-based algorithms. Extensive experiments
conducted on the nuScenes benchmark validate the effectiveness of the proposed
improvements. With ResNet50 as the backbone, we observe gains of 3.0%, 2.2%,
and 7.6% in mAP, NDS, and AMOTA, reaching 46.9%, 56.1%, and 49.0%,
respectively. Our best model achieves 71.9% NDS and 67.7% AMOTA on the
nuScenes test set. Code will be released at
https://github.com/linxuewu/Sparse4D
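
The detector-to-tracker extension described above lends itself to a compact sketch. Below is a minimal, illustrative version of instance ID assignment at inference time, assuming queries propagated from the previous frame carry their old ID; the score threshold and field names are hypothetical, not the authors' exact implementation.

```python
# Minimal sketch of query-based ID assignment at inference time: a temporal
# instance propagated across frames keeps its ID, a newborn instance draws a
# fresh one, so no dedicated tracking head or post-hoc association is needed.
# Threshold and field names are illustrative assumptions.
from itertools import count

_next_id = count()

def assign_instance_ids(instances, score_thresh=0.3):
    """instances: list of dicts with 'score' and optional 'track_id'
    (set when the query was propagated from a previous frame)."""
    tracked = []
    for inst in instances:
        if inst["score"] < score_thresh:
            continue  # low-confidence queries emit no track
        if inst.get("track_id") is None:
            inst["track_id"] = next(_next_id)  # newborn instance
        tracked.append(inst)
    return tracked
```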
Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion
Bird's-eye-view (BEV) based methods have recently made great progress in the
multi-view 3D detection task. Compared with BEV-based methods, sparse-based
methods lag behind in performance but still have many non-negligible merits.
To push sparse 3D detection further, we introduce a novel method, named
Sparse4D, which iteratively refines anchor boxes via sparsely sampling and
fusing spatial-temporal features. (1) Sparse 4D Sampling: for each 3D anchor,
we assign multiple 4D keypoints, which are then projected onto
multi-view/scale/timestamp image features to sample the corresponding
features; (2) Hierarchy Feature Fusion: we hierarchically fuse the sampled
features across views/scales, timestamps, and keypoints to generate a
high-quality instance feature. In this way, Sparse4D can efficiently and
effectively achieve 3D detection without relying on dense view transformation
or global attention, and is friendlier to deployment on edge devices.
Furthermore, we introduce an instance-level depth reweight module to alleviate
the ill-posed nature of 3D-to-2D projection. In experiments, our method
outperforms all sparse-based methods and most BEV-based methods on the
detection task of the nuScenes dataset.
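
To make the Sparse 4D Sampling step concrete, here is a minimal sketch of keypoint feature sampling for a single view, scale, and timestamp (the paper iterates this over all of them and fuses the results hierarchically). All function and argument names are illustrative assumptions, not the authors' API.

```python
# Sketch: project an anchor's 3D keypoints into one image feature map and
# bilinearly sample per-keypoint features. Single view/scale/timestamp only.
import torch
import torch.nn.functional as F

def sample_keypoint_features(feat_map, keypoints_3d, proj_matrix):
    """feat_map: (1, C, H, W) image features; keypoints_3d: (K, 3) keypoints
    of one anchor; proj_matrix: (3, 4) camera projection. Returns (K, C)."""
    K = keypoints_3d.shape[0]
    pts_h = torch.cat([keypoints_3d, keypoints_3d.new_ones(K, 1)], dim=1)
    uvw = pts_h @ proj_matrix.T                    # project to image plane
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-5)  # perspective divide
    _, _, H, W = feat_map.shape
    grid = uv.clone()
    grid[:, 0] = uv[:, 0] / (W - 1) * 2 - 1        # normalize to [-1, 1]
    grid[:, 1] = uv[:, 1] / (H - 1) * 2 - 1
    sampled = F.grid_sample(feat_map, grid.view(1, K, 1, 2), align_corners=True)
    return sampled.view(feat_map.shape[1], K).T    # (K, C)
```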
EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction
Motion prediction is a crucial task in autonomous driving, and one of its
major challenges lies in the multimodality of future behaviors. Many
successful works have utilized mixture models which require identification of
positive mixture components, and correspondingly fall into two main lines:
prediction-based and anchor-based matching. The prediction clustering
phenomenon in prediction-based matching makes it difficult to pick
representative trajectories for downstream tasks, while anchor-based matching
suffers from limited regression capability. In this paper, we
introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to
define the positive and negative components for multimodal motion prediction
based on mixture models. We enable anchors to evolve and redistribute
themselves under specific scenes for an enlarged regression capacity.
Furthermore, we select distinct anchors before matching them with the ground
truth, which results in impressive scoring performance. Our approach enhances
all metrics compared to the baseline MTR, particularly with a notable relative
reduction of 13.5% in Miss Rate, resulting in state-of-the-art performance on
the Waymo Open Motion Dataset. Code is available at
https://github.com/Longzhong-Lin/EDA.
Comment: Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024)
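
As an illustration of the "distinct anchors" idea, the sketch below greedily keeps high-scoring anchors whose trajectory endpoints are well separated, suppressing near-duplicate modes before matching. The greedy NMS-style rule and the distance threshold are assumptions for exposition, not the paper's exact selection procedure.

```python
# Sketch: select distinct anchors by endpoint separation, highest score first.
import numpy as np

def select_distinct_anchors(endpoints, scores, dist_thresh=2.5):
    """endpoints: (N, 2) trajectory endpoints; scores: (N,) confidences.
    Returns indices of kept anchors in descending score order."""
    order = np.argsort(-scores)
    kept = []
    for i in order:
        # keep anchor i only if it is far enough from all kept anchors
        if all(np.linalg.norm(endpoints[i] - endpoints[j]) > dist_thresh
               for j in kept):
            kept.append(i)
    return kept
```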
Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration
State-of-the-art convolutional neural networks are enormously costly in both
compute and memory, demanding massively parallel GPUs for execution. Such
networks strain the computational capabilities and energy available to embedded
and mobile processing platforms, restricting their use in many important
applications. In this paper, we push the boundaries of hardware-efficient CNN
design by proposing BCNN with Separable Filters (BCNNw/SF), which applies
Singular Value Decomposition (SVD) on BCNN kernels to further reduce
computational and storage complexity. To enable its implementation, we provide
a closed form of the gradient over SVD to calculate the exact gradient with
respect to every binarized weight in backward propagation. We verify BCNNw/SF
on the MNIST, CIFAR-10, and SVHN datasets, and implement an accelerator for
CIFAR-10 on FPGA hardware. Our BCNNw/SF accelerator realizes memory savings of
17% and an execution time reduction of 31.3% compared to BCNN, with only minor
accuracy sacrifices.
Comment: 9 pages, 6 figures, accepted for Embedded Vision Workshop (CVPRW)
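
The separable-filter idea can be illustrated in a few lines: SVD factorizes a k x k kernel into rank-1 column/row filters, so one 2D convolution becomes two cheap 1D passes, and the factors are then binarized. This sketch omits the paper's closed-form gradient over SVD and the training details; names are illustrative.

```python
# Sketch: rank-1 SVD factorization of a conv kernel, followed by binarization.
import numpy as np

def separable_binarized_filters(kernel):
    """kernel: (k, k) real-valued weight. Returns binarized column and row
    filters whose outer product approximates the binarized kernel."""
    U, S, Vt = np.linalg.svd(kernel)
    col = U[:, 0] * np.sqrt(S[0])   # rank-1 factor: column (k x 1) filter
    row = Vt[0, :] * np.sqrt(S[0])  # rank-1 factor: row (1 x k) filter
    return np.sign(col), np.sign(row)
```

Replacing each k x k binary kernel with a k x 1 and a 1 x k pass is what drives the reported memory and execution-time savings on FPGA.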
DynStatF: An Efficient Feature Fusion Strategy for LiDAR 3D Object Detection
Augmenting LiDAR input with multiple previous frames provides richer semantic
information and thus boosts performance in 3D object detection. However, the
crowded point clouds of multiple frames can degrade precise position
information due to motion blur and inaccurate point projection. In this work, we
propose a novel feature fusion strategy, DynStaF (Dynamic-Static Fusion), which
enhances the rich semantic information provided by the multi-frame (dynamic
branch) with the accurate location information from the current single-frame
(static branch). To effectively extract and aggregate complementary features,
DynStaF contains two modules, Neighborhood Cross Attention (NCA) and
Dynamic-Static Interaction (DSI), operating through a dual pathway
architecture. NCA takes the features in the static branch as queries and the
features in the dynamic branch as keys (values). When computing the attention,
we address the sparsity of point clouds and take only neighborhood positions
into consideration. NCA fuses two features at different feature map scales,
followed by DSI providing the comprehensive interaction. To analyze our
proposed strategy DynStaF, we conduct extensive experiments on the nuScenes
dataset. On the test set, DynStaF increases the performance of PointPillars in
NDS by a large margin from 57.7% to 61.6%. When combined with CenterPoint, our
framework achieves 61.0% mAP and 67.7% NDS, leading to state-of-the-art
performance without bells and whistles.
Comment: Accepted to the CVPR 2023 Workshop on End-to-End Autonomous Driving
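
A minimal sketch of what Neighborhood Cross Attention might look like on BEV feature maps is given below: static-branch features act as queries, and each position attends only to a local window of dynamic-branch keys/values, respecting point-cloud sparsity. The shapes, window size, and plain dot-product scoring are assumptions, not the authors' exact module.

```python
# Sketch: per-position cross attention over a local neighborhood window,
# with static features as queries and dynamic features as keys/values.
import torch
import torch.nn.functional as F

def neighborhood_cross_attention(static_feat, dynamic_feat, window=3):
    """static_feat, dynamic_feat: (C, H, W). Returns fused (C, H, W)."""
    C, H, W = static_feat.shape
    pad = window // 2
    # Gather the (window x window) dynamic neighborhood of every position.
    neigh = F.unfold(dynamic_feat.unsqueeze(0), window, padding=pad)
    neigh = neigh.view(C, window * window, H * W)
    q = static_feat.view(C, 1, H * W)
    attn = torch.softmax((q * neigh).sum(0) / C ** 0.5, dim=0)  # (w*w, H*W)
    out = (neigh * attn.unsqueeze(0)).sum(1)                    # (C, H*W)
    return out.view(C, H, W)
```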
Double-Flow-based Steganography without Embedding for Image-to-Image Hiding
As an emerging concept, steganography without embedding (SWE) hides a secret
message without directly embedding it into a cover. Thus, SWE has the unique
advantage of being immune to typical steganalysis methods and can better
protect the secret message from being exposed. However, existing SWE methods
are generally criticized for their poor payload capacity and low fidelity of
recovered secret messages. In this paper, we propose a novel
steganography-without-embedding technique, named DF-SWE, which addresses the
aforementioned drawbacks and produces diverse and natural stego images.
Specifically, DF-SWE employs a reversible circulation of double flow to build a
reversible bijective transformation between the secret image and the generated
stego image. Hence, it provides a way to directly generate stego images from
secret images without a cover image. By leveraging the invertible property,
DF-SWE can recover a secret image from a generated stego image in a nearly
lossless manner, increasing the fidelity of extracted secret images.
To the best of our knowledge, DF-SWE is the first SWE method that can hide
large images and multiple images into one image with the same size,
significantly enhancing the payload capacity. According to the experimental
results, DF-SWE achieves a payload capacity of 24-72 BPP, 8000-16000 times
that of its competitors, while producing diverse images to minimize the
exposure risk. Importantly, DF-SWE can be applied to the steganography of
secret images in various domains without requiring training data from the
corresponding domains. This domain-agnostic property suggests that DF-SWE can
1) be applied to hiding private data and 2) be deployed in resource-limited
systems.
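
The near-lossless recovery claim rests on the bijectivity of flow models. The sketch below demonstrates that property with a single additive coupling layer (NICE-style) standing in for the paper's double-flow architecture: the inverse pass recovers the input exactly. The tiny tanh "network" is a hypothetical stand-in for a learned module.

```python
# Sketch: an additive coupling layer is exactly invertible, which is the
# property DF-SWE exploits to recover the secret image from the stego image.
import numpy as np

def coupling_forward(x, shift_net):
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 + shift_net(x1)])

def coupling_inverse(z, shift_net):
    z1, z2 = np.split(z, 2)
    return np.concatenate([z1, z2 - shift_net(z1)])

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
shift = lambda h: np.tanh(W @ h)        # stand-in for a learned network
secret = rng.normal(size=16)
stego_latent = coupling_forward(secret, shift)
recovered = coupling_inverse(stego_latent, shift)
assert np.allclose(secret, recovered)   # exact recovery via invertibility
```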