Few Shot Network Compression via Cross Distillation
Model compression has been widely adopted to obtain lightweight deep
neural networks. Most prevalent methods, however, require fine-tuning with
sufficient training data to ensure accuracy, which can be hindered by
privacy and security concerns. As a compromise between privacy and performance,
in this paper we investigate few-shot network compression: given a few samples
per class, how can we effectively compress the network with a negligible
performance drop? The core challenge of few-shot network compression lies in
the high estimation errors with respect to the original network during inference, since the
compressed network can easily overfit the few training instances. The
estimation errors can propagate and accumulate layer by layer and eventually
degrade the network output. To address the problem, we propose cross
distillation, a novel layer-wise knowledge distillation approach. By
interweaving the hidden layers of the teacher and student networks, the layer-wise
accumulated estimation errors can be effectively reduced. The proposed method
offers a general framework compatible with prevalent network compression
techniques such as pruning. Extensive experiments on benchmark datasets
demonstrate that cross distillation can significantly improve the student
network's accuracy when only a few training instances are available. Comment: AAAI 202
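Below is a minimal PyTorch sketch of the layer-wise cross-distillation idea as we read it from the abstract: each student block is also fed the teacher's hidden features, so errors do not compound across layers. The block lists, the squared-error loss, and the mixing weight mu are illustrative assumptions, not the authors' exact formulation.

import torch
import torch.nn as nn

def cross_distillation_loss(teacher_blocks, student_blocks, x, mu=0.5):
    # Run teacher and student layer by layer. The "cross" path feeds the teacher's
    # hidden features into the student block, training the student to be correct
    # even when its own upstream errors are removed, which limits layer-wise
    # error accumulation.
    t_feat, s_feat = x, x
    loss = x.new_zeros(())
    for t_block, s_block in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            t_next = t_block(t_feat)        # teacher forward pass (frozen)
        s_next = s_block(s_feat)            # student on its own (drifting) features
        s_cross = s_block(t_feat)           # student block fed the teacher's features
        loss = loss + mu * (s_cross - t_next).pow(2).mean() \
                    + (1.0 - mu) * (s_next - t_next).pow(2).mean()
        t_feat, s_feat = t_next, s_next
    return loss

# Toy usage with shape-compatible blocks (hypothetical):
# teacher = [nn.Linear(16, 16) for _ in range(3)]
# student = [nn.Linear(16, 16) for _ in range(3)]
# loss = cross_distillation_loss(teacher, student, torch.randn(8, 16))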
RTN: Reparameterized Ternary Network
To deploy deep neural networks on resource-limited devices, quantization has
been widely explored. In this work, we study extremely low-bit networks,
which offer tremendous speed-ups and memory savings through quantized activations and
weights. We first identify three overlooked issues in extremely low-bit networks:
the squashed range of quantized values, gradient vanishing during
backpropagation, and the unexploited hardware acceleration of ternary networks.
By reparameterizing the quantized activation and weight vectors with a full-precision
scale and offset applied to a fixed ternary vector, we decouple the range and magnitude
from the direction to mitigate these three issues. The learnable scale and offset
automatically adjust the range and sparsity of the quantized values without
gradient vanishing. A novel encoding and computation pattern are designed to
support efficient computing for our reparameterized ternary network (RTN).
Experiments on ResNet-18 for ImageNet demonstrate that the proposed RTN achieves a
much better trade-off between bitwidth and accuracy, with up to a 26.76%
relative accuracy improvement over state-of-the-art methods. Moreover,
we validate the proposed computation pattern on a Field Programmable Gate Array
(FPGA), where it brings 46.46x and 89.17x savings in power and area, respectively,
compared with full-precision convolution. Comment: To appear at AAAI-2
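A rough PyTorch sketch of the reparameterization described above: a fixed ternary direction in {-1, 0, +1} is rescaled by a learnable full-precision scale and offset. The ternarization threshold and the straight-through gradient estimator are our assumptions for illustration; the paper's encoding and hardware computation pattern are not reproduced here.

import torch
import torch.nn as nn

class TernaryReparam(nn.Module):
    # Rescales a fixed ternary vector with a learnable full-precision scale and
    # offset, decoupling range/magnitude from direction so the quantizer's range
    # and sparsity can adapt without the gradient vanishing.
    def __init__(self, threshold=0.05):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.offset = nn.Parameter(torch.zeros(1))
        self.threshold = threshold

    def forward(self, w):
        # Hard ternarization into {-1, 0, +1} (non-differentiable on its own).
        t = torch.sign(w) * (w.abs() > self.threshold).float()
        # Straight-through estimator: forward uses t, gradients flow through w.
        t = w + (t - w).detach()
        return self.scale * t + self.offset

# Example: q = TernaryReparam()(torch.randn(64, 64))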
Visually Guided Generative Text-Layout Pre-training for Document Intelligence
Prior studies show that pre-training techniques can boost the performance of
visual document understanding (VDU), which typically requires models to
perceive and reason over both document texts and layouts (e.g.,
locations of texts and table cells). To this end, we propose visually guided
generative text-layout pre-training, named ViTLP. Given a document image, the
model optimizes hierarchical language and layout modeling objectives to
generate the interleaved text and layout sequence. In addition, to address the
limitation of Transformers in processing long documents, we introduce a
straightforward yet effective multi-segment generative pre-training scheme,
enabling ViTLP to process word-intensive documents of any length. ViTLP can
function as a native OCR model to localize and recognize texts in document
images. Moreover, ViTLP can be effectively applied to various downstream VDU
tasks. Extensive experiments show that ViTLP achieves competitive performance
over existing baselines on benchmark VDU tasks, including information
extraction, document classification, and document question answering. Comment: Accepted to NAACL 2024 main conference. The first version of this
paper was submitted to OpenReview
(https://openreview.net/forum?id=ARtBIBAmNR) in June 202
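For concreteness, here is a small Python sketch of one way to build an interleaved text-layout target sequence, assuming word-level OCR boxes normalized to [0, 1] and a 1000-bin location vocabulary; ViTLP's actual tokenization and layout granularity may differ.

def interleave_text_layout(words, boxes, bins=1000):
    # words: list of word strings; boxes: list of (x0, y0, x1, y1) in [0, 1].
    # Returns a flat sequence alternating each word with its discretized layout tokens.
    seq = []
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        seq.append(word)
        seq.extend(f"<loc_{int(round(v * (bins - 1)))}>" for v in (x0, y0, x1, y1))
    return seq

# Example:
# interleave_text_layout(["Total:", "$42.00"],
#                        [(0.10, 0.85, 0.18, 0.88), (0.20, 0.85, 0.30, 0.88)])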
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Vision-language pre-trained models have achieved impressive performance on
various downstream tasks. However, their large model sizes hinder their
utilization on platforms with limited computational resources. We find that
directly using smaller pre-trained models and applying magnitude-based pruning
on CLIP models leads to inflexibility and inferior performance. Recent efforts
at VLP compression either adopt uni-modal compression metrics, resulting in
limited performance, or involve costly mask-search processes with learnable
masks. In this paper, we first propose the Module-wise Pruning Error (MoPE)
metric, which accurately assesses the importance of a CLIP module by the performance
decline on cross-modal tasks when that module is pruned. Using the MoPE metric, we introduce a unified pruning
framework applicable to both pre-training and task-specific fine-tuning
compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge
from the teacher model, significantly reducing pre-training costs while
maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning
from width to depth yields highly competitive task-specific models. Extensive
experiments in two stages demonstrate the effectiveness of the MoPE metric, and
MoPE-CLIP outperforms previous state-of-the-art VLP compression methods. Comment: 18 pages, 8 figures, Published in CVPR202
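The following Python sketch conveys the module-wise pruning-error idea: score each candidate module by the drop in a cross-modal validation metric when it is ablated. Replacing a module with nn.Identity and the evaluate_fn interface are assumptions for illustration; the paper's exact ablation procedure and metric may differ.

import copy
import torch.nn as nn

def module_pruning_error(model, module_names, evaluate_fn):
    # evaluate_fn(model) -> float score on a cross-modal task (e.g. retrieval recall).
    # Returns {module_name: score_drop}; a larger drop marks a more important module.
    base_score = evaluate_fn(model)
    errors = {}
    for name in module_names:
        pruned = copy.deepcopy(model)
        parent = pruned
        *path, leaf = name.split(".")
        for part in path:
            parent = getattr(parent, part)
        setattr(parent, leaf, nn.Identity())   # ablate this module only
        errors[name] = base_score - evaluate_fn(pruned)
    return errors

# Hypothetical usage:
# errors = module_pruning_error(clip_model,
#                               ["visual.transformer.resblocks.11"],
#                               evaluate_retrieval_recall)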