Search CORE

24 research outputs found

Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

Author: Choudhary Samridhi
Kunzmann Siegfried
Yang Zi
Zhang Zheng
Publication venue
Publication date: 01/06/2023
Field of study

Fine-tuned transformer models have shown superior performances in many natural language tasks. However, the large model size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and ultimately runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces model parameters. A quantization-aware training with learnable scale factors is used to further obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve the convergence, a layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer. The performance is demonstrated in two natural language understanding tasks, showing up to

63\times

compression ratio, little accuracy loss and remarkable inference and training speedup

arXiv.org e-Print Archive

LAMBO: Large Language Model Empowered Edge Intelligence

Author: Dong Li
Jiang Feibo
Pan Cunhua
Peng Yubo
Schober Robert
Wang Kezhi
Yang Kun
Publication venue
Publication date: 29/08/2023
Field of study

Next-generation edge intelligence is anticipated to bring huge benefits to various applications, e.g., offloading systems. However, traditional deep offloading architectures face several issues, including heterogeneous constraints, partial perception, uncertain generalization, and lack of tractability. In this context, the integration of offloading with large language models (LLMs) presents numerous advantages. Therefore, we propose an LLM-Based Offloading (LAMBO) framework for mobile edge computing (MEC), which comprises four components: (i) Input embedding (IE), which is used to represent the information of the offloading system with constraints and prompts through learnable vectors with high quality; (ii) Asymmetric encoderdecoder (AED) model, which is a decision-making module with a deep encoder and a shallow decoder. It can achieve high performance based on multi-head self-attention schemes; (iii) Actor-critic reinforcement learning (ACRL) module, which is employed to pre-train the whole AED for different optimization tasks under corresponding prompts; and (iv) Active learning from expert feedback (ALEF), which can be used to finetune the decoder part of the AED while adapting to dynamic environmental changes. Our simulation results corroborate the advantages of the proposed LAMBO framework.Comment: To be submitted for possible journal publicatio

arXiv.org e-Print Archive

MARS: Masked Automatic Ranks Selection in Tensor Decompositions

Author: Kodryan Maxim
Kropotov Dmitry
Vetrov Dmitry
Publication venue
Publication date: 18/06/2021
Field of study

Tensor decomposition methods are known to be efficient for compressing and accelerating neural networks. However, the problem of optimal decomposition structure determination is still not well studied while being quite important. Specifically, decomposition ranks present the crucial parameter controlling the compression-accuracy trade-off. In this paper, we introduce MARS -- a new efficient method for the automatic selection of ranks in general tensor decompositions. During training, the procedure learns binary masks over decomposition cores that "select" the optimal tensor structure. The learning is performed via relaxed maximum a posteriori (MAP) estimation in a specific Bayesian model. The proposed method achieves better results compared to previous works in various tasks

arXiv.org e-Print Archive