Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Large foundation models are becoming ubiquitous, but training them from
scratch is prohibitively expensive. Thus, efficiently adapting these powerful
models to downstream tasks is increasingly important. In this paper, we study a
principled finetuning paradigm, Orthogonal Finetuning (OFT), for downstream
task adaptation. Despite demonstrating good generalizability, OFT still uses a
fairly large number of trainable parameters due to the high dimensionality of
orthogonal matrices. To address this, we start by examining OFT from an
information transmission perspective, and then identify a few key desiderata
that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast
Fourier transform algorithm enables efficient information transmission, we
propose an efficient orthogonal parameterization using butterfly structures. We
apply this parameterization to OFT, creating a novel parameter-efficient
finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a
special case, BOFT introduces a generalized orthogonal finetuning framework.
Finally, we conduct an extensive empirical study of adapting large vision
transformers, large language models, and text-to-image diffusion models to
various downstream tasks in vision and language.
Comment: Technical Report (33 pages, 18 figures
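The idea of building an orthogonal matrix from butterfly structures can be illustrated with a minimal sketch. The abstract does not specify the construction, so this is an assumption: it uses plain Givens rotations arranged in the Cooley-Tukey pairing pattern rather than whatever parameterization the paper actually adopts. The names `butterfly_factor` and `butterfly_orthogonal` are hypothetical, and the dimension `d = 8` is arbitrary.

```python
import numpy as np


def butterfly_factor(d, level, angles):
    """One sparse factor: rotate each index pair (i, i + 2**level) by its own angle."""
    stride = 1 << level
    B = np.zeros((d, d))
    pair = 0
    for i in range(d):
        if i & stride:
            # i is the upper partner of a pair; it was filled in with its lower partner.
            continue
        j = i | stride
        c, s = np.cos(angles[pair]), np.sin(angles[pair])
        B[i, i], B[i, j] = c, -s
        B[j, i], B[j, j] = s, c
        pair += 1
    return B


def butterfly_orthogonal(d, angles):
    """Product of log2(d) butterfly factors; orthogonal because every factor is."""
    levels = d.bit_length() - 1
    assert 1 << levels == d, "this sketch assumes d is a power of two"
    assert angles.shape == (levels, d // 2)
    R = np.eye(d)
    for level in range(levels):
        R = butterfly_factor(d, level, angles[level]) @ R
    return R


d = 8
rng = np.random.default_rng(0)
angles = rng.normal(size=(3, d // 2))   # trainable parameters in this sketch
R = butterfly_orthogonal(d, angles)

W0 = rng.normal(size=(d, d))            # stand-in for a frozen pretrained weight
W = R @ W0                              # orthogonally transformed weight

n_butterfly = angles.size               # (d / 2) * log2(d) angles
n_dense = d * (d - 1) // 2              # degrees of freedom of a dense rotation
```

In this toy setting the butterfly product needs 12 angles where a dense rotation of the same size has 28 degrees of freedom, and the gap widens as `d` grows. Because `R` is orthogonal, `W.T @ W` equals `W0.T @ W0`, so the pairwise inner products between the columns of the weight matrix are unchanged.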
DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models
Gigantic pre-trained models have become central to natural language
processing (NLP), serving as the starting point for fine-tuning towards a range
of downstream tasks. However, two pain points persist for this paradigm: (a) as
the pre-trained models grow bigger (e.g., 175B parameters for GPT-3), even the
fine-tuning process can be time-consuming and computationally expensive; (b)
the fine-tuned model has the same size as its starting point by default, which
is neither sensible due to its more specialized functionality, nor practical
since many fine-tuned models will be deployed in resource-constrained
environments. To address these pain points, we propose a framework for
resource- and parameter-efficient fine-tuning by leveraging the sparsity prior
in both weight updates and the final model weights. Our proposed framework,
dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two
key objectives: (i) parameter-efficient fine-tuning, by enforcing
sparsity-aware low-rank updates on top of the pre-trained weights; and (ii)
resource-efficient inference, by encouraging a sparse weight structure in
the final fine-tuned model. We leverage sparsity in these two directions by
exploiting both unstructured and structured sparse patterns in pre-trained
language models via a unified approach. Extensive experiments and in-depth
investigations, with diverse network backbones (i.e., BERT, RoBERTa, and GPT-2)
on dozens of datasets, consistently demonstrate impressive
parameter-/inference-efficiency, while maintaining competitive downstream
performance. For instance, DSEE saves about 25% inference FLOPs while achieving
comparable performance, with 0.5% trainable parameters on BERT. Code is
available at https://github.com/VITA-Group/DSEE.
Comment: Preprin
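The two kinds of sparsity described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the released DSEE code: the update is modeled as a low-rank product plus a sparse residual, the final weights are pruned by magnitude, and all sizes, the rank, and the pruning ratio are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, n_sparse = 16, 16, 2, 8   # arbitrary toy sizes

W0 = rng.normal(size=(d_out, d_in))          # stand-in for a frozen pretrained weight

# (i) Sparsity-aware low-rank update: low-rank term U @ V plus a sparse residual S.
U = rng.normal(size=(d_out, rank)) * 0.01
V = rng.normal(size=(rank, d_in)) * 0.01
support = rng.choice(d_out * d_in, size=n_sparse, replace=False)
S = np.zeros(d_out * d_in)
S[support] = rng.normal(size=n_sparse) * 0.01
S = S.reshape(d_out, d_in)

W = W0 + U @ V + S                           # fine-tuned dense weight

# (ii) Sparse final weights: unstructured magnitude pruning (assumed criterion).
sparsity = 0.25
k = int(sparsity * W.size)                   # number of entries to zero out
threshold = np.sort(np.abs(W), axis=None)[k]
mask = np.abs(W) >= threshold                # keep everything above the k smallest
W_sparse = W * mask

trainable = U.size + V.size + n_sparse       # parameters updated during tuning
fraction = trainable / W0.size               # share of the full weight matrix
```

Only `U`, `V`, and the nonzero entries of `S` would be trained in this sketch, while `mask` determines which entries of the final weight survive for inference.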