NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning
Fine-tuning a pre-trained language model (PLM) has emerged as the predominant
strategy in many natural language processing applications. However, both
fine-tuning PLMs and running inference on them are expensive, especially on
edge devices with low computing power. Some general approaches (e.g.,
quantization and distillation) have been widely studied to reduce the
compute/memory cost of PLM fine-tuning, while very few one-shot compression
techniques have been explored. In this paper, we investigate the neural
tangent kernel (NTK), which reveals the gradient descent dynamics of neural
networks, of the multilayer perceptron (MLP) modules in a PLM, and propose to
construct a lightweight PLM through NTK-approximating MLP fusion. To achieve
this, we view the MLP as a bundle of sub-MLPs and cluster them into a given
number of centroids, which can then be restored as a compressed MLP that,
perhaps surprisingly, approximates the NTK of the original PLM well. Extensive
experiments on PLM fine-tuning for both natural language understanding (NLU)
and generation (NLG) tasks verify the effectiveness of the proposed MLP
fusion. Our code is available at https://github.com/weitianxin/MLP_Fusion.
Comment: ICML 2023
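To make the clustering idea concrete, here is a minimal sketch (not the authors' implementation) of fusing a two-layer feed-forward network, assuming each hidden unit is treated as one sub-MLP described by its incoming weights, bias, and outgoing weights; sub-MLPs are clustered with k-means and a smaller MLP is rebuilt from the centroids, with the outgoing centroid weights rescaled by cluster size so the summed contribution is roughly preserved. All function and variable names are illustrative assumptions.

```python
# Sketch of sub-MLP clustering for an FFN y = W2 @ act(W1 @ x + b1) + b2.
import numpy as np
from sklearn.cluster import KMeans

def fuse_ffn(W1, b1, W2, k, seed=0):
    """Compress an FFN with h hidden units down to k fused hidden units."""
    d_in = W1.shape[1]
    # Each sub-MLP i is (row W1[i], bias b1[i], column W2[:, i]).
    sub_mlps = np.concatenate([W1, b1[:, None], W2.T], axis=1)  # (h, d_in + 1 + d_out)
    km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(sub_mlps)
    sizes = np.bincount(km.labels_, minlength=k).astype(float)  # units per cluster
    centers = km.cluster_centers_
    W1_c = centers[:, :d_in]                            # fused incoming weights (k, d_in)
    b1_c = centers[:, d_in]                             # fused biases (k,)
    W2_c = (centers[:, d_in + 1:] * sizes[:, None]).T   # rescaled outgoing weights (d_out, k)
    return W1_c, b1_c, W2_c

# Toy example with small dimensions: compress 256 hidden units to 64.
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(256, 64)), rng.normal(size=256), rng.normal(size=(64, 256))
W1_c, b1_c, W2_c = fuse_ffn(W1, b1, W2, k=64)
```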
Distributionally Robust Circuit Design Optimization under Variation Shifts
Due to significant process variations, designers have to optimize the
statistical performance distribution of nano-scale IC designs in most cases.
This problem has been investigated for decades under the formulation of
stochastic optimization, which minimizes the expected value of a performance
metric while assuming that the distribution of process variations is exactly
given. This paper rethinks variation-aware circuit design optimization from
a new perspective. First, we discuss the variation shift problem: the actual
density function of process variations almost always differs from the given
model and is often unknown. Consequently, we propose to formulate
variation-aware circuit design optimization as a distributionally robust
optimization problem, which does not require the exact distribution of
process variations. By selecting an appropriate uncertainty set for the
probability density function of process variations, we solve the shift-aware
circuit optimization problem using distributionally robust Bayesian
optimization. This method is validated on both a photonic IC and an
electronic IC. The optimized circuits show excellent robustness against
variation shifts, maintaining strong performance under many possible
distributions of process variations that differ from the given statistical
model. This work has the potential to open a new research direction and
inspire subsequent research at different levels of the EDA flow under the
variation-shift setting.
Comment: accepted by ICCAD 2023, 8 pages
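As a toy illustration of the shift-aware objective, the sketch below assumes the uncertainty set is a finite family of candidate variation distributions around the nominal model and scores a design by its worst-case expected performance over that family; a simple random-search outer loop stands in for the Bayesian optimizer used in the paper. The circuit model, names, and parameters are all illustrative assumptions.

```python
import numpy as np

def circuit_metric(design, variation):
    """Placeholder performance metric (lower is better), e.g. delay or loss."""
    return np.sum((design + variation - 1.0) ** 2, axis=-1)

def worst_case_expected_metric(design, candidate_dists, n_samples=2000, seed=0):
    """Worst-case Monte Carlo estimate of E[metric] over a finite uncertainty set."""
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for mean, std in candidate_dists:  # each (mean, std) defines one shifted Gaussian
        v = rng.normal(mean, std, size=(n_samples, design.size))
        worst = max(worst, circuit_metric(design, v).mean())
    return worst

# Uncertainty set: nominal N(0, 0.05) plus shifted and widened variants.
candidate_dists = [(0.0, 0.05), (0.02, 0.05), (-0.02, 0.05), (0.0, 0.10)]

# Robust search over a 4-parameter design space (random search as a stand-in for BO).
rng = np.random.default_rng(1)
best_x, best_val = None, np.inf
for _ in range(200):
    x = rng.uniform(0.5, 1.5, size=4)
    val = worst_case_expected_metric(x, candidate_dists)
    if val < best_val:
        best_x, best_val = x, val
print(best_x, best_val)
```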
Dense Vision Transformer Compression with Few Samples
Few-shot model compression aims to compress a large model into a more compact
one with only a tiny training set (even without labels). Block-level pruning
has recently emerged as a leading technique for achieving high accuracy and
low latency in few-shot CNN compression. However, few-shot compression for
Vision Transformers (ViT) remains largely unexplored and presents new
challenges. In particular, traditional few-shot CNN methods suffer from
sparse compression: they can only produce a very small number of compressed
models of different sizes. This paper proposes a novel framework for few-shot
ViT compression named DC-ViT. Instead of dropping an entire block, DC-ViT
selectively eliminates the attention module while retaining and reusing
portions of the MLP module. DC-ViT enables dense compression, producing
numerous compressed models that densely populate the range of model
complexity. DC-ViT outperforms state-of-the-art few-shot compression methods
by a significant margin of 10 percentage points, along with lower latency, in
the compression of ViT and its variants.
Comment: Accepted to CVPR 2024. Note: Jianxin Wu is a contributing author for
the arXiv version of this paper but is not listed as an author in the CVPR
version due to his role as Program Chair.
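The PyTorch sketch below illustrates the kind of block modification the abstract describes, under the assumption that "eliminating the attention module while reusing part of the MLP" amounts to keeping only a residual MLP branch whose hidden width is a tunable fraction of the original; sweeping that fraction across blocks is what would yield many model sizes. The class and method names are hypothetical, not the DC-ViT implementation.

```python
import torch
import torch.nn as nn

class CompressedViTBlock(nn.Module):
    """A ViT block with the attention branch removed and a slimmed MLP retained."""
    def __init__(self, dim=768, mlp_hidden=3072, keep_ratio=0.5):
        super().__init__()
        kept = max(1, int(mlp_hidden * keep_ratio))  # portion of the MLP reused
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, kept)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(kept, dim)

    @torch.no_grad()
    def init_from_block(self, fc1_weight, fc1_bias, fc2_weight, fc2_bias):
        """Reuse the first `kept` hidden units of the original block's MLP weights."""
        kept = self.fc1.out_features
        self.fc1.weight.copy_(fc1_weight[:kept])
        self.fc1.bias.copy_(fc1_bias[:kept])
        self.fc2.weight.copy_(fc2_weight[:, :kept])
        self.fc2.bias.copy_(fc2_bias)

    def forward(self, x):
        # Only the MLP residual branch remains; the attention branch is dropped.
        return x + self.fc2(self.act(self.fc1(self.norm(x))))

# Varying keep_ratio per block produces many compressed models (dense compression).
x = torch.randn(2, 197, 768)
y = CompressedViTBlock(keep_ratio=0.25)(x)
```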