CrossFusion: Interleaving Cross-modal Complementation for Noise-resistant 3D Object Detection
Recent studies have shown that combining the LiDAR and camera modalities is both
necessary and typical for 3D object detection. Existing fusion strategies, however,
essentially over-rely on the LiDAR modality and insufficiently exploit the abundant
semantics from the camera sensor. As a result, when LiDAR features are corrupted,
these methods cannot fall back on information from the other modality, since the
corruption introduces a large domain gap. Motivated by this, we propose CrossFusion,
a more robust and noise-resistant scheme that makes full use of camera and LiDAR
features with the designed cross-modal complementation strategy. Extensive
experiments show that our method not only outperforms state-of-the-art methods under
the setting without an extra depth estimation network, but also remains
noise-resistant under specific malfunction scenarios without re-training, improving
mAP by 5.2% and NDS by 2.4%.
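
As a rough illustration of the cross-modal complementation idea described above, the sketch below shows a bidirectional cross-attention block in which camera and LiDAR features complement each other; the module and parameter names (CrossModalBlock, token shapes, number of blocks) are assumptions for illustration, not the paper's actual architecture.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Hypothetical sketch: each modality attends to and is complemented by the other."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cam_from_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lidar_from_cam = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_cam = nn.LayerNorm(dim)
        self.norm_lidar = nn.LayerNorm(dim)

    def forward(self, cam_feat, lidar_feat):
        # cam_feat, lidar_feat: (B, N_tokens, dim) flattened image / BEV tokens
        cam_upd, _ = self.cam_from_lidar(cam_feat, lidar_feat, lidar_feat)
        lidar_upd, _ = self.lidar_from_cam(lidar_feat, cam_feat, cam_feat)
        # Residual complementation: a corrupted modality can be partially
        # recovered from the other one instead of dominating the fusion.
        return self.norm_cam(cam_feat + cam_upd), self.norm_lidar(lidar_feat + lidar_upd)

# Interleaving: stacking several such blocks lets complementation happen repeatedly.
fusion_blocks = nn.ModuleList([CrossModalBlock() for _ in range(3)])
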
Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification
Prompt learning has become a popular approach for adapting large
vision-language models, such as CLIP, to downstream tasks. Typically, prompt
learning relies on a fixed prompt token or an input-conditional token to fit a
small amount of data under full supervision. While this paradigm can generalize
to a certain range of unseen classes, it may struggle when the domain gap
increases, such as in fine-grained classification and satellite image
segmentation. To address this limitation, we propose Retrieval-enhanced Prompt
learning (RePrompt), which introduces retrieval mechanisms to cache the
knowledge representations from downstream tasks. We first construct a retrieval
database from training examples, or from external examples when available. We
then integrate this retrieval-enhanced mechanism into various stages of a
simple prompt learning baseline. By referencing similar samples in the training
set, the enhanced model is better able to adapt to new tasks with few samples.
Our extensive experiments over 15 vision datasets, including 11 downstream
tasks in the few-shot setting and 4 domain generalization benchmarks, demonstrate
that RePrompt achieves considerably improved performance. Our proposed approach
provides a promising solution to the challenges faced by prompt learning when
the domain gap increases. The code and models will be made available.
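
As a rough sketch of the retrieval mechanism described above, the snippet below caches L2-normalised features of the few-shot training set and retrieves the nearest neighbours of a query; the function names, feature dimensions, and the way retrieved samples would be injected into the prompt learner are assumptions, not the released RePrompt implementation.

import numpy as np

def build_retrieval_db(train_features, train_labels):
    # Store L2-normalised training features as the retrieval database (the "cache").
    keys = train_features / np.linalg.norm(train_features, axis=1, keepdims=True)
    return {"keys": keys, "labels": np.asarray(train_labels)}

def retrieve(db, query_feature, k=4):
    # Return the k most similar cached samples for one query feature.
    q = query_feature / np.linalg.norm(query_feature)
    sims = db["keys"] @ q                      # cosine similarity against the cache
    idx = np.argsort(-sims)[:k]
    return db["keys"][idx], db["labels"][idx], sims[idx]

# Hypothetical usage: the retrieved features/labels would then be fed into the
# prompt tokens (or a late-fusion head) of a CLIP-style prompt learner.
db = build_retrieval_db(np.random.randn(16, 512), np.arange(16) % 4)
neighbours, labels, scores = retrieve(db, np.random.randn(512), k=4)
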
Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Large pre-trained models (LPMs), such as LLaMA and ViT-G, have shown
exceptional performance across various tasks. Although parameter-efficient
fine-tuning (PEFT) has emerged to cheaply fine-tune these large models on
downstream tasks, their deployment is still hindered by the vast model scale
and computational costs. Neural network pruning offers a solution for model
compression by removing redundant parameters, but most existing methods rely on
computing parameter gradients. However, obtaining the gradients is
computationally prohibitive for LPMs, which necessitates the exploration of
alternative approaches. To this end, we propose a unified framework for
efficient fine-tuning and deployment of LPMs, termed LoRAPrune. We first design
a PEFT-aware pruning criterion, which utilizes the values and gradients of
Low-Rank Adaptation (LoRA), rather than the gradients of pre-trained parameters,
for importance estimation. We then propose an iterative pruning procedure to
remove redundant parameters while maximizing the advantages of PEFT. Thus, our
LoRAPrune delivers an accurate, compact model for efficient inference in a
highly cost-effective manner. Experimental results on various tasks demonstrate
that our method achieves state-of-the-art results. For instance, in the VTAB-1k
benchmark, LoRAPrune utilizes only 0.76% of the trainable parameters and
outperforms magnitude and movement pruning methods by a significant margin,
achieving a mean Top-1 accuracy that is 5.7% and 4.3% higher, respectively.
Moreover, our approach achieves comparable performance to PEFT methods,
highlighting its efficacy in delivering high-quality results while benefiting
from the advantages of pruning.
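
As a rough sketch of the PEFT-aware criterion described above, the snippet below scores each pre-trained weight using only the values and gradients of the LoRA factors, so the gradient of the frozen weight itself is never materialised; the exact saliency formula shown is an illustrative assumption, not necessarily the one used in LoRAPrune.

import torch

def lora_importance(W, A, B):
    # W: frozen pre-trained weight (out x in); B: (out x r), A: (r x in) LoRA factors.
    # Assumes A.grad and B.grad have been populated by a backward pass.
    effective_w = W + B @ A                   # weight actually used in the forward pass
    approx_grad = B.grad @ A + B @ A.grad     # low-rank surrogate for dL/dW
    return (effective_w * approx_grad).abs()  # first-order (Taylor-style) saliency

# Hypothetical use inside an iterative pruning loop:
#   scores = lora_importance(layer.weight, layer.lora_A, layer.lora_B)
#   mask = scores > torch.quantile(scores, prune_ratio)   # keep the most important weights
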
Pedestrian Attribute Recognition in Video Surveillance Scenarios Based on View-attribute Attention Localization
Pedestrian attribute recognition in surveillance scenarios is still a
challenging task due to the inaccurate localization of specific attributes. In
this paper, we propose a novel view-attribute localization method based on
attention (VALA), which uses view information to guide the recognition process
to focus on specific attributes, and an attention mechanism to localize the
areas corresponding to those attributes. Concretely, view information is
leveraged by a view prediction branch to generate four view weights that
represent the confidences of attributes under different views. The view weights
are then fed back to compose specific view-attributes, which participate in and
supervise deep feature extraction. In order to explore the spatial location
of a view-attribute, regional attention is introduced to aggregate spatial
information and encode inter-channel dependencies of the view feature.
Subsequently, a fine-grained attribute-specific region is localized, and the
regional attention produces regional weights for the view-attribute at different
spatial locations. The final view-attribute recognition result is obtained by
combining the view weights with the regional weights.
Experiments on three widely used datasets (RAP, RAPv2, and PA-100K) demonstrate
the effectiveness of our approach compared with state-of-the-art methods.
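
As a rough sketch of the flow described above, the snippet below combines a view-prediction branch (four view confidences) with a regional attention map over the backbone feature before attribute classification; the layer names, channel sizes, and the simple way the two sets of weights are combined are assumptions for illustration, not VALA's actual design.

import torch
import torch.nn as nn

class ViewAttributeHead(nn.Module):
    def __init__(self, in_ch=2048, num_views=4, num_attrs=51):
        super().__init__()
        self.view_branch = nn.Linear(in_ch, num_views)           # front/back/left/right confidences
        self.region_attn = nn.Conv2d(in_ch, 1, kernel_size=1)    # spatial (regional) attention
        self.attr_head = nn.Linear(in_ch, num_attrs)

    def forward(self, feat_map):
        # feat_map: (B, C, H, W) backbone features
        pooled = feat_map.mean(dim=(2, 3))                        # (B, C) global context
        view_w = torch.softmax(self.view_branch(pooled), dim=-1)  # (B, num_views) view weights
        region_w = torch.sigmoid(self.region_attn(feat_map))      # (B, 1, H, W) regional weights
        attended = (feat_map * region_w).mean(dim=(2, 3))         # pool the attended region
        attr_logits = self.attr_head(attended)                    # (B, num_attrs)
        # Stand-in for the paper's combination of view and regional weights:
        # scale the attribute scores by the confidence of the dominant view.
        return attr_logits * view_w.max(dim=1, keepdim=True).values
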