COP: Customized Deep Model Compression via Regularized Correlation-Based Filter-Level Pruning
Neural network compression empowers the effective yet unwieldy deep
convolutional neural networks (CNN) to be deployed in resource-constrained
scenarios. Most state-of-the-art approaches prune the model at the filter level
according to the "importance" of filters. Despite their success, we notice they
suffer from at least two of the following problems: 1) The redundancy among
filters is not considered because the importance is evaluated independently. 2)
Cross-layer filter comparison is unachievable since the importance is defined
locally within each layer. Consequently, we must manually specify layer-wise
pruning ratios. 3) They are prone to generate sub-optimal solutions because
they neglect the inequality between reducing parameters and reducing
computational cost. Removing the same number of parameters at different
positions in the network may reduce the computational cost by different
amounts. To address the above problems, we develop a novel algorithm named COP
(correlation-based pruning), which can detect redundant filters efficiently.
We enable cross-layer filter comparison through global
normalization. We add parameter-quantity and computational-cost regularization
terms to the importance, which enables users to customize the compression
according to their preference (smaller or faster). Extensive experiments show
that COP significantly outperforms competing methods. The code is released at
https://github.com/ZJULearning/COP.
Comment: 7 pages, 4 figures; accepted by IJCAI 2019
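
As a rough illustration of the idea (not the authors' exact formulation), the sketch below scores the filters of one convolutional layer by pairwise Pearson correlation, standardizes the scores so they are comparable across layers, and subtracts parameter-quantity and computational-cost penalties; the function name and the knobs alpha and beta are hypothetical stand-ins for the smaller-or-faster preference.

    import numpy as np

    def cop_like_score(weights, out_hw, alpha=1e-6, beta=1e-9):
        """Correlation-based filter scoring, minimal sketch.

        weights: (C_out, C_in, kH, kW) kernel of one conv layer.
        out_hw:  (H, W) of the layer's output feature map.
        alpha, beta: illustrative strengths of the parameter-quantity and
        computational-cost penalties (assumptions, not from the paper).
        """
        flat = weights.reshape(weights.shape[0], -1)
        flat = flat - flat.mean(axis=1, keepdims=True)
        flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
        corr = np.abs(flat @ flat.T)          # pairwise Pearson correlations
        np.fill_diagonal(corr, 0.0)
        score = 1.0 - corr.max(axis=1)        # a filter mirrored elsewhere is redundant
        # Standardize per layer so scores from different layers are comparable,
        # removing the need for hand-set layer-wise pruning ratios.
        score = (score - score.mean()) / (score.std() + 1e-8)
        params = flat.shape[1]                        # parameters one filter adds
        flops = params * out_hw[0] * out_hw[1]        # MACs one filter adds
        return score - alpha * params - beta * flops  # prune the lowest scores

Raising beta relative to alpha biases pruning toward early, high-resolution layers, where each filter costs the most computation; raising alpha favors a smaller model instead.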
OBMO: One Bounding Box Multiple Objects for Monocular 3D Object Detection
Compared to typical multi-sensor systems, monocular 3D object detection has
attracted much attention due to its simple configuration. However, there is
still a significant gap between LiDAR-based and monocular-based methods. In
this paper, we find that the ill-posed nature of monocular imagery can lead to
depth ambiguity. Specifically, objects with different depths can appear with
the same bounding boxes and similar visual features in the 2D image.
Unfortunately, the network cannot accurately distinguish different depths from
such non-discriminative visual features, resulting in unstable depth training.
To facilitate depth learning, we propose a simple yet effective plug-and-play
module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of
suitable pseudo labels
by shifting the 3D bounding box along the viewing frustum. To constrain the
pseudo-3D labels to be reasonable, we carefully design two label scoring
strategies to represent their quality. In contrast to the original hard depth
labels, such soft pseudo labels with quality scores allow the network to learn
a reasonable depth range, boosting training stability and thus improving final
performance. Extensive experiments on the KITTI and Waymo benchmarks show that
our method improves state-of-the-art monocular 3D detectors by a significant
margin under the moderate setting on the KITTI validation set, in both BEV mAP
and 3D mAP. Code has been released at https://github.com/mrsempress/OBMO.
Comment: 10 pages, 7 figures
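
A minimal sketch of the core trick as described above; the shift offsets and the scoring rule are illustrative assumptions, not the paper's exact label-scoring strategies.

    import numpy as np

    def obmo_pseudo_labels(center3d, depth_shifts=(-0.8, -0.4, 0.4, 0.8)):
        """Generate soft pseudo labels by sliding a 3D box along its viewing ray.

        center3d: (x, y, z) box center in camera coordinates, z is depth.
        depth_shifts: hypothetical offsets in meters.
        """
        c = np.asarray(center3d, dtype=float)
        ray = c / c[2]                   # ray from the camera through the box center
        labels = []
        for dz in depth_shifts:
            pseudo = ray * (c[2] + dz)   # same ray, so the 2D box stays (nearly) fixed
            quality = np.exp(-abs(dz))   # soft score: larger shift, lower quality
            labels.append((pseudo, quality))
        return labels

Because all pseudo centers project to (almost) the same 2D box, they encode exactly the depth ambiguity the network faces, and the quality scores turn the hard depth label into a weighted range.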
SelFLoc: Selective Feature Fusion for Large-scale Point Cloud-based Place Recognition
Point cloud-based place recognition is crucial for mobile robots and
autonomous vehicles, especially when the global positioning sensor is not
accessible. LiDAR points are scattered on the surface of objects and buildings,
which have strong shape priors along different axes. To enhance message
passing along particular axes, we design the Stacked Asymmetric Convolution
Block (SACB), one of the main contributions of this paper. Comprehensive
experiments demonstrate that the asymmetric convolutions and the corresponding
strategies employed by SACB contribute to a more effective representation of
point cloud features. On this basis, the Selective Feature Fusion Block (SFFB), which is
formed by stacking point- and channel-wise gating layers in a predefined
sequence, is proposed to selectively boost salient local features in certain
key regions, as well as to align the features before the fusion phase. SACBs and
SFFBs are combined to construct a robust and accurate architecture for point
cloud-based place recognition, which is termed SelFLoc. Comparative
experimental results show that SelFLoc achieves state-of-the-art (SOTA)
performance on the Oxford benchmark and three in-house benchmarks, with an
improvement of 1.6 absolute percentage points in mean average recall@1.
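
For intuition, here is a minimal dense-convolution sketch of a stacked asymmetric block; the paper operates on point cloud features, so the real SACB will differ in backbone, sparsity handling, and layer ordering.

    import torch
    import torch.nn as nn

    class SACB(nn.Module):
        """Stacked Asymmetric Convolution Block, minimal sketch.

        Dense 3D convolutions stand in for the paper's point cloud backbone;
        each asymmetric kernel passes messages along a single axis, matching
        the strong axis-aligned shape priors of LiDAR scenes.
        """
        def __init__(self, channels):
            super().__init__()
            self.conv_x = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
            self.conv_y = nn.Conv3d(channels, channels, (1, 3, 1), padding=(0, 1, 0))
            self.conv_z = nn.Conv3d(channels, channels, (1, 1, 3), padding=(0, 0, 1))
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            # Stack axis-wise message passing, then add a residual connection
            # so the block can fall back to the isotropic features.
            out = self.act(self.conv_x(x))
            out = self.act(self.conv_y(out))
            out = self.act(self.conv_z(out))
            return x + out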
General Rotation Invariance Learning for Point Clouds via Weight-Feature Alignment
Compared to 2D images, 3D point clouds are much more sensitive to rotations.
We expect point features describing certain patterns to remain invariant under
rotation transformations. There are many recent SOTA works dedicated to
rotation-invariant learning for 3D point clouds. However, current
rotation-invariant methods lack generalizability to point clouds of open
scenes due to their reliance on the global distribution, i.e., the global
scene and backgrounds. Considering that the output activation is a function of
the pattern and its orientation, we need to eliminate the effect of the
orientation. In this paper, inspired by the idea that the network weights can be
considered a set of points distributed in the same 3D space as the input
points, we propose Weight-Feature Alignment (WFA) to construct a local
Invariant Reference Frame (IRF) via aligning the features with the principal
axes of the network weights. Our WFA algorithm provides a general solution for
the point clouds of all scenes. WFA ensures that the model meets its target:
the response activity is a necessary and sufficient indicator of the degree of
pattern matching. Practically, we perform experiments on the point clouds of
both single objects and open large-range scenes. The results suggest that our
method almost bridges the gap between rotation invariance learning and normal
methods.
Comment: 4 figures
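
The following sketch shows one plausible reading of weight-feature alignment for a first point-wise layer whose weight rows live in the same 3D space as the points; the principal-axis construction via SVD is an assumption for illustration (SVD sign ambiguities are ignored), not the paper's exact algorithm.

    import numpy as np

    def principal_axes(points):
        """Principal axes of a centered 3D point set (rows of the returned 3x3)."""
        centered = points - points.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt

    def wfa_response(patch, weight):
        """Align a local patch (N x 3) with first-layer weights (K x 3), viewed
        as K points in the same space, before matching. The response then
        depends on the pattern, not on how the patch happens to be rotated.
        """
        r_w = principal_axes(weight)
        r_p = principal_axes(patch)
        centered = patch - patch.mean(axis=0, keepdims=True)
        aligned = centered @ r_p.T @ r_w   # express the patch in the weight frame
        return aligned @ weight.T          # (N x K) rotation-invariant activations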
A Study of Unsupervised Evaluation Metrics for Practical and Automatic Domain Adaptation
Unsupervised domain adaptation (UDA) methods facilitate the transfer of
models to target domains without labels. However, these methods necessitate a
labeled target validation set for hyper-parameter tuning and model selection.
In this paper, we aim to find an evaluation metric capable of assessing the
quality of a transferred model without access to target validation labels. We
begin with a metric based on the mutual information of the model predictions.
Through empirical analysis, we identify three prevalent issues with this
metric: 1) It does not account for the source structure. 2) It can be easily
attacked. 3) It fails to detect negative transfer caused by the over-alignment
of source and target features. To address the first two issues, we incorporate
source accuracy into the metric and employ a new MLP classifier that is held
out during training, significantly improving the result. To tackle the final
issue, we integrate this enhanced metric with data augmentation, resulting in a
novel unsupervised UDA metric called the Augmentation Consistency Metric (ACM).
Additionally, we empirically demonstrate the shortcomings of previous
experiment settings and conduct large-scale experiments to validate the
effectiveness of our proposed metric. Furthermore, we employ our metric to
automatically search for the optimal hyper-parameter set, achieving superior
performance compared to manually tuned sets across four common benchmarks.
Code will be available soon.
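
A hedged sketch of the three ingredients the abstract lists (prediction mutual information, held-out source accuracy, augmentation consistency); the combination rule and the weight lam are illustrative assumptions, not the paper's formula.

    import torch
    import torch.nn.functional as F

    def mutual_info(logits):
        """Info-max score H(E[p]) - E[H(p)]; higher means confident, diverse predictions."""
        p = F.softmax(logits, dim=1)
        marginal = p.mean(dim=0)
        h_marginal = -(marginal * marginal.clamp_min(1e-8).log()).sum()
        h_cond = -(p * p.clamp_min(1e-8).log()).sum(dim=1).mean()
        return h_marginal - h_cond

    def augmentation_consistency(logits_clean, logits_aug):
        """Fraction of target samples whose predicted class survives augmentation."""
        return (logits_clean.argmax(1) == logits_aug.argmax(1)).float().mean()

    def acm_like_score(logits_clean, logits_aug, src_acc, lam=1.0):
        # src_acc: accuracy of a held-out MLP classifier on labeled source data,
        # guarding against trivially "confident" but negatively transferred models.
        return (mutual_info(logits_clean)
                + src_acc
                + lam * augmentation_consistency(logits_clean, logits_aug))

Models are then ranked by this score on unlabeled target data, so hyper-parameters can be selected without any target validation labels.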
MCS: Multi-Target Masked Point Modeling with Learnable Codebook and Siamese Decoders
Masked point modeling has become a promising scheme of self-supervised
pre-training for point clouds. Existing methods reconstruct either the original
points or related features as the objective of pre-training. However,
considering the diversity of downstream tasks, it is necessary for the model to
have both low- and high-level representation modeling capabilities to capture
geometric details and semantic contexts during pre-training. To this end,
MCS is proposed to endow the model with both abilities. Specifically, with a
masked point cloud as input, MCS introduces two decoders to predict masked
representations and the original points simultaneously. Since an extra decoder
would double the parameters of the decoding process and might lead to
overfitting, we propose siamese decoders to keep the number of learnable
parameters unchanged. Further, we propose an online codebook that projects
continuous tokens into discrete ones before reconstructing masked points. In
this way, we force the decoder to operate through combinations of tokens
rather than memorizing each token. Comprehensive experiments show that MCS
achieves superior performance on both classification and segmentation tasks,
outperforming existing methods.
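
A minimal sketch of the siamese-decoder idea with illustrative sizes; the actual decoder architecture, codebook size, and prediction heads are assumptions here, not the paper's specification.

    import torch
    import torch.nn as nn

    class SiameseDecoders(nn.Module):
        """One shared (siamese) transformer decoder serves both reconstruction
        targets, so adding the second objective adds no decoder parameters;
        only the lightweight prediction heads differ.
        """
        def __init__(self, dim=384, vocab=8192, points_per_patch=32):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
            self.decoder = nn.TransformerEncoder(layer, num_layers=4)   # shared weights
            self.token_head = nn.Linear(dim, vocab)                 # discrete codebook ids
            self.point_head = nn.Linear(dim, 3 * points_per_patch)  # raw xyz coordinates

        def forward(self, feats_sem, feats_geo):
            # The same decoder weights are applied to both streams (e.g. the
            # visible tokens plus mask queries for each target), so the extra
            # decoding target costs no extra learnable parameters.
            return (self.token_head(self.decoder(feats_sem)),
                    self.point_head(self.decoder(feats_geo)))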
CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention
While features of different scales are perceptually important to visual
inputs, existing vision transformers do not yet take advantage of them
explicitly. To this end, we first propose a cross-scale vision transformer,
CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short
distance attention (LSDA). On the one hand, CEL blends each token with multiple
patches of different scales, providing the self-attention module itself with
cross-scale features. On the other hand, LSDA splits the self-attention module
into a short-distance one and a long-distance counterpart, which not only
reduces the computational burden but also keeps both small-scale and
large-scale features in the tokens. Moreover, through experiments on
CrossFormer, we observe two further issues that affect vision transformers'
performance, i.e., progressively enlarging self-attention maps and amplitude explosion.
Thus, we further propose a progressive group size (PGS) paradigm and an
amplitude cooling layer (ACL) to alleviate the two issues, respectively. The
CrossFormer equipped with PGS and ACL is called CrossFormer++. Extensive
experiments show that CrossFormer++ outperforms other vision transformers on
image classification, object detection, instance segmentation, and semantic
segmentation tasks. The code will be available at
https://github.com/cheerss/CrossFormer.
Comment: 16 pages, 7 figures
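
A minimal sketch of a cross-scale embedding layer: several convolutions with different kernel sizes but a single stride, concatenated along channels so every token mixes small and large patches centered on the same position. The kernel sizes and channel split below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CrossScaleEmbedding(nn.Module):
        """Cross-scale embedding layer (CEL), minimal sketch."""
        def __init__(self, in_ch=3, dim=96, kernels=(4, 8, 16, 32), stride=4):
            super().__init__()
            dims = [dim // 2, dim // 4, dim // 8, dim // 8]  # larger patches get fewer channels
            self.projs = nn.ModuleList([
                nn.Conv2d(in_ch, d, kernel_size=k, stride=stride, padding=(k - stride) // 2)
                for k, d in zip(kernels, dims)
            ])

        def forward(self, x):
            # All branches share one stride, so their outputs align spatially
            # and each token sees 4x4 up to 32x32 patches at once.
            return torch.cat([proj(x) for proj in self.projs], dim=1)

Giving the small-kernel branch most of the channels keeps the cost close to a plain patch embedding while still injecting coarse context into every token.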
Model Compression and Efficient Inference for Large Language Models: A Survey
Transformer-based large language models have achieved tremendous success.
However, the significant memory and computational costs incurred during the
inference process make it challenging to deploy large models on
resource-constrained devices. In this paper, we investigate compression and
efficient inference methods for large language models from an algorithmic
perspective. Regarding taxonomy, similar to smaller models, compression and
acceleration algorithms for large language models can still be categorized into
quantization, pruning, distillation, compact architecture design, and dynamic
networks. However, large language models have two prominent characteristics
compared to smaller models: (1) Most compression algorithms require
fine-tuning or even retraining the model after compression, and the most
notable aspect of large models is the very high cost of such fine-tuning or
training; therefore, many algorithms for large models, such as quantization
and pruning, have started to explore tuning-free variants. (2) Large models
emphasize versatility and generalization rather than performance on a single
task. Hence, many algorithms, such as knowledge distillation, focus on how to
preserve versatility and generalization after compression. Since these two
characteristics were not very pronounced in early large models, we further
distinguish large language models into medium models and "real" large models.
Additionally, we provide an introduction to some mature frameworks for
efficient inference of large models, which can support basic compression or
acceleration algorithms, greatly facilitating model deployment for users.
Comment: 47 pages, reviews 380 papers; the work is ongoing
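
As a concrete example of the tuning-free direction mentioned in (1), here is a minimal round-to-nearest weight-quantization sketch; production methods layer grouping, activation-aware scaling, or GPTQ-style error compensation on top of this baseline.

    import torch

    def quantize_rtn(weight, bits=4):
        """Tuning-free round-to-nearest quantization with per-output-row scales.

        Returns the dequantized weights, the packed integer codes, and the scales.
        """
        qmax = 2 ** (bits - 1) - 1
        scale = weight.abs().amax(dim=1, keepdim=True) / qmax  # per-row scale
        scale = scale.clamp_min(1e-8)                          # guard all-zero rows
        q = (weight / scale).round().clamp(-qmax - 1, qmax)    # integer codes
        return q * scale, q.to(torch.int8), scale

    # Example: quantize one linear layer of a hypothetical model, no fine-tuning.
    w = torch.randn(4096, 4096)
    w_hat, q, s = quantize_rtn(w, bits=4)
    print((w - w_hat).abs().mean())  # average quantization error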