Transformers play a vital role in natural language processing
(NLP) and computer vision (CV), especially for constructing large language
models (LLMs) and large vision models (LVMs). Model compression methods reduce
the memory and computational cost of Transformers, which is a necessary step for
deploying large language/vision models on practical devices. Given the unique
architecture of the Transformer, featuring alternating attention and feed-forward
network (FFN) modules, specific compression techniques are usually
required. The efficiency of these compression methods is also paramount, as
retraining large models on the entire training dataset is usually impractical.
This survey provides a comprehensive review of recent compression methods, with
a specific focus on their application to Transformer-based models. The
compression methods are primarily categorized into pruning, quantization,
knowledge distillation, and efficient architecture design (Mamba, RetNet, RWKV,
etc.). In each category, we discuss compression methods for both language and
vision tasks, highlighting common underlying principles. Finally, we delve into
the relations among various compression methods, and discuss further
directions in this domain.