On the Adversarial Robustness of Camera-based 3D Object Detection
In recent years, camera-based 3D object detection has gained widespread
attention for its ability to achieve high performance with low computational
cost. However, the robustness of these methods to adversarial attacks has not
been thoroughly examined. In this study, we conduct the first comprehensive
investigation of the robustness of leading camera-based 3D object detection
methods under various adversarial conditions. Our experiments reveal five
interesting findings: (a) the use of accurate depth estimation effectively
improves robustness; (b) depth-estimation-free approaches do not show superior
robustness; (c) bird's-eye-view-based representations exhibit greater
robustness against localization attacks; (d) incorporating multi-frame benign
inputs can effectively mitigate adversarial attacks; and (e) addressing
long-tail problems can enhance robustness. We hope our work can provide
guidance for the design of future camera-based object detection modules with
improved adversarial robustness.
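As a concrete illustration of the attack surface such a study probes, the following is a minimal sketch of an untargeted PGD-style perturbation of a camera input. Here `detector`, `detection_loss`, and `targets` are hypothetical placeholders for a generic camera-based 3D detector and its training loss, not the paper's actual evaluation code.

# Minimal sketch of an untargeted PGD-style attack on a camera input.
# `detector`, `detection_loss`, and `targets` are hypothetical placeholders,
# not the paper's API.
import torch

def pgd_attack(detector, detection_loss, image, targets,
               eps=8/255, alpha=2/255, steps=10):
    """Return an adversarially perturbed copy of `image` within an L-inf eps-ball."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        loss = detection_loss(detector(adv), targets)      # loss to maximize
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv + alpha * grad.sign()                     # gradient ascent step
        adv = image + torch.clamp(adv - image, -eps, eps)   # project back to eps-ball
        adv = adv.clamp(0.0, 1.0)                           # stay a valid image
    return adv.detach()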
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
The recent work CLIPA presents an inverse scaling law for CLIP training --
whereby the larger the image/text encoders used, the shorter the sequence
length of image/text tokens that can be applied in training. This finding
enables us to train high-performance CLIP models with significantly reduced
computations. Building upon this work, we hereby present CLIPA-v2 with two key
contributions. Technically, we find this inverse scaling law is also applicable
in the finetuning stage, enabling further reduction in computational needs.
Empirically, we explore CLIPA at scale, extending the experiments up to the
H/14 model with ~13B image-text pairs seen during training.
Our results are exciting -- within a $10,000 budget our best model reaches 81.1% zero-shot ImageNet accuracy, and an extra $4,000 further elevates it to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
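To make the idea of shorter image token sequences concrete, here is a minimal sketch in which a random subset of patch tokens is kept before the vision encoder. The function name, tensor shapes, and keep ratio are illustrative assumptions and not the released CLIPA code.

# Illustrative sketch: shorten the image token sequence by randomly keeping
# a subset of patch tokens, in the spirit of CLIPA's reduced token lengths.
# Shapes and names are assumptions, not the released CLIPA code.
import torch

def subsample_patch_tokens(patch_tokens, keep_ratio=0.5):
    """patch_tokens: (batch, num_patches, dim) -> shorter token sequence."""
    b, n, d = patch_tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Independent random permutation per sample; keep the first n_keep indices.
    idx = torch.argsort(torch.rand(b, n, device=patch_tokens.device), dim=1)[:, :n_keep]
    return torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))

tokens = torch.randn(4, 196, 768)             # e.g. ViT patch embeddings
short = subsample_patch_tokens(tokens, 0.25)  # 49 tokens instead of 196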
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
Automated audio captioning aims at generating natural language descriptions
for given audio clips, not only detecting and classifying sounds, but also
summarizing the relationships between audio events. Recent research advances in
audio captioning have introduced additional guidance to improve the accuracy of
audio events in generated sentences. However, temporal relations between audio
events have received little attention, even though revealing such complex
relations is a key component of summarizing audio content. Therefore, this paper aims to
better capture temporal relationships in caption generation with sound event
detection (SED), a task that locates events' timestamps. We investigate the
best approach to integrate temporal information in a captioning model and
propose a temporal tag system to transform the timestamps into comprehensible
relations. Results evaluated by the proposed temporal metrics suggest that
substantial improvement is achieved in temporal relation generation.
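As a toy illustration of how SED timestamps can be turned into comprehensible relations, the sketch below assigns a coarse relation tag to a pair of events. The tag vocabulary and tolerance are illustrative assumptions rather than the paper's exact tag system.

# Toy sketch: map two sound events' SED timestamps to a coarse temporal
# relation tag. The tag vocabulary and tolerance are illustrative assumptions,
# not the paper's exact tag system.
def temporal_tag(on_a, off_a, on_b, off_b, tol=0.5):
    """Each argument is an onset/offset timestamp in seconds for events A and B."""
    if abs(on_a - on_b) <= tol and abs(off_a - off_b) <= tol:
        return "simultaneous"
    if off_a <= on_b:
        return "A before B"
    if off_b <= on_a:
        return "B before A"
    return "overlapping"

print(temporal_tag(0.0, 2.0, 3.0, 5.0))  # -> "A before B"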
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting
the high locality inherent in LLM inference, characterized by a power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, are consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
PowerInfer exploits such an insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access, while
cold-activated neurons are computed on the CPU, thus significantly reducing GPU
memory demands and CPU-GPU data transfers. PowerInfer further integrates
adaptive predictors and neuron-aware sparse operators, optimizing the
efficiency of neuron activation and computational sparsity. Evaluation shows
that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x
while retaining model accuracy.
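A conceptual sketch of the hot/cold split is shown below: neurons whose activation frequency over a calibration set exceeds a threshold are marked for GPU placement, and the rest stay on the CPU. The threshold and data layout are illustrative assumptions, not PowerInfer's actual scheduling code.

# Conceptual sketch of PowerInfer-style neuron placement: frequently activated
# ("hot") neurons are pinned to the GPU, the rest ("cold") stay on the CPU.
# Threshold and layout are illustrative assumptions, not PowerInfer's code.
import torch

def split_hot_cold(activation_counts, num_samples, hot_threshold=0.8):
    """activation_counts: (num_neurons,) count of calibration samples activating each neuron."""
    freq = activation_counts.float() / num_samples
    hot = torch.nonzero(freq >= hot_threshold, as_tuple=False).flatten()
    cold = torch.nonzero(freq < hot_threshold, as_tuple=False).flatten()
    return hot, cold

counts = torch.tensor([950, 10, 990, 300, 870])
hot, cold = split_hot_cold(counts, num_samples=1000)
print(hot.tolist(), cold.tolist())  # -> [0, 2, 4] [1, 3]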
DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation
3D perception based on the representations learned from multi-camera
bird's-eye-view (BEV) is trending as cameras are cost-effective for mass
production in the autonomous driving industry. However, there exists a distinct
performance gap between multi-camera BEV and LiDAR based 3D object detection.
One key reason is that LiDAR captures accurate depth and other geometry
measurements, while it is notoriously challenging to infer such 3D information
from merely image input. In this work, we propose to boost the representation
learning of a multi-camera BEV based student detector by training it to imitate
the features of a well-trained LiDAR based teacher detector. We propose
an effective balancing strategy that forces the student to focus on learning the
crucial features from the teacher, and generalize knowledge transfer to
multi-scale layers with temporal fusion. We conduct extensive evaluations on
multiple representative models of multi-camera BEV. Experiments reveal that our
approach renders significant improvement over the student models, leading to
the state-of-the-art performance on the popular benchmark nuScenes.
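As a rough illustration of feature imitation with region balancing, the sketch below computes a weighted MSE between student and teacher BEV feature maps, upweighting foreground cells. The specific weighting is a generic placeholder rather than DistillBEV's exact balancing strategy.

# Hedged sketch of a weighted feature-imitation loss between a camera (student)
# BEV feature map and a LiDAR (teacher) BEV feature map. The foreground
# weighting is a generic placeholder, not DistillBEV's exact strategy.
import torch
import torch.nn.functional as F

def bev_imitation_loss(student_feat, teacher_feat, fg_mask, fg_weight=5.0):
    """Feature maps: (B, C, H, W); fg_mask: (B, 1, H, W) with 1 at object regions."""
    weight = 1.0 + (fg_weight - 1.0) * fg_mask          # upweight foreground cells
    per_cell = F.mse_loss(student_feat, teacher_feat, reduction="none").mean(1, keepdim=True)
    return (weight * per_cell).sum() / weight.sum()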
Improving Audio Caption Fluency with Automatic Error Correction
Automated audio captioning (AAC) is an important cross-modality translation
task, aiming at generating descriptions for audio clips. However, captions
generated by previous AAC models often suffer from "false-repetition" errors
caused by the training objective. We therefore propose a new task of AAC error
correction, aiming to reduce such errors by post-processing AAC outputs. To
tackle this problem, we use observation-based rules to corrupt captions without
errors, for pseudo grammatically-erroneous sentence generation. One pair of
corrupted and clean sentences can thus be used for training. We train a neural
network-based model on the synthetic error dataset and apply the model to
correct real errors in AAC outputs. Results on two benchmark datasets indicate
that our approach significantly improves fluency while maintaining semantic
information.
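To illustrate what an observation-based corruption rule might look like, the toy sketch below injects a false repetition into a clean caption to form a (corrupted, clean) training pair. The rule and example caption are illustrative, not the paper's full rule set.

# Toy sketch of one observation-based corruption rule: inject a "false
# repetition" into a clean caption to create a (corrupted, clean) training
# pair. Illustrative only; not the paper's full rule set.
import random

def corrupt_with_repetition(caption, rng=random):
    words = caption.split()
    if len(words) < 4:
        return caption
    start = rng.randrange(0, len(words) - 2)
    span = words[start:start + 3]                              # pick a short phrase
    corrupted = words[:start + 3] + span + words[start + 3:]  # repeat it in place
    return " ".join(corrupted)

clean = "a dog barks while a car passes by"
pair = (corrupt_with_repetition(clean), clean)  # synthetic training example
print(pair[0])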
FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning
Federated learning (FL) is an emerging paradigm in machine learning, where a
shared model is collaboratively learned using data from multiple devices to
mitigate the risk of data leakage. While recent studies posit that Vision
Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in
addressing data heterogeneity in FL, the specific architectural components that
underpin this advantage have yet to be elucidated. In this paper, we
systematically investigate the impact of different architectural elements, such
as activation functions and normalization layers, on the performance within
heterogeneous FL. Through rigorous empirical analyses, we are able to offer the
first-of-its-kind general guidance on micro-architecture design principles for
heterogeneous FL.
Intriguingly, our findings indicate that with strategic architectural
modifications, pure CNNs can achieve a level of robustness that either matches
or even exceeds that of ViTs when handling heterogeneous data clients in FL.
Additionally, our approach is compatible with existing FL techniques and
delivers state-of-the-art solutions across a broad spectrum of FL benchmarks.
The code is publicly available at https://github.com/UCSC-VLAA/FedConv.
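For readers unfamiliar with the federated setting, the following minimal sketch shows FedAvg-style aggregation of client model weights. It is a generic illustration of federated learning, not part of FedConv itself.

# Minimal sketch of FedAvg-style weight aggregation; a generic illustration
# of the federated setting studied here, not FedConv-specific code.
import torch

def fedavg(client_state_dicts, client_num_samples):
    """Weighted average of client model parameters by local dataset size."""
    total = sum(client_num_samples)
    avg = {}
    for key in client_state_dicts[0]:
        avg[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(client_state_dicts, client_num_samples)
        )
    return avg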
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Compared with ample visual-text pre-training research, few works explore
audio-text pre-training, mostly due to the lack of sufficient parallel
audio-text data. Most existing methods incorporate the visual modality as a
pivot for audio-text pre-training, which inevitably induces data noise. In this
paper, we propose BLAT: Bootstrapping Language-Audio pre-training based on
Tag-guided synthetic data. We utilize audio captioning to generate text
directly from audio, without the aid of the visual modality so that potential
noise from modality mismatch is eliminated. Furthermore, we propose caption
generation under the guidance of AudioSet tags, leading to more accurate
captions. With the above two improvements, we curate high-quality, large-scale
parallel audio-text data, based on which we perform audio-text pre-training.
Evaluation on a series of downstream tasks indicates that BLAT achieves SOTA
zero-shot classification performance on most datasets and significant
performance improvement when fine-tuned on downstream tasks, suggesting the
effectiveness of our synthetic data.
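As an indication of the kind of objective used once parallel audio-text data is available, the sketch below shows a standard symmetric contrastive loss over paired audio and text embeddings. The shapes and temperature are illustrative assumptions; this is the generic CLIP-style objective rather than BLAT's exact training code.

# Hedged sketch of a symmetric contrastive loss over paired audio and text
# embeddings, the standard objective in audio-text pre-training. Shapes and
# temperature are illustrative assumptions, not BLAT's training code.
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired audio/text."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                    # (batch, batch) similarities
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))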