Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding
Fine-tuned transformer models have shown superior performance in many
natural language tasks. However, the large model size prohibits deploying
high-performance transformer models on resource-constrained devices. This paper
proposes a quantization-aware tensor-compressed training approach to reduce the
model size, arithmetic operations, and ultimately runtime latency of
transformer-based models. We compress the embedding and linear layers of
transformers into small low-rank tensor cores, which significantly reduces
model parameters. A quantization-aware training with learnable scale factors is
used to further obtain low-precision representations of the tensor-compressed
models. The developed approach can be used for both end-to-end training and
distillation-based training. To improve the convergence, a layer-by-layer
distillation is applied to distill a quantized and tensor-compressed student
model from a pre-trained transformer. The performance is demonstrated in two
natural language understanding tasks, showing up to compression
ratio, little accuracy loss, and remarkable inference and training speedup.
LAMBO: Large Language Model Empowered Edge Intelligence
Next-generation edge intelligence is anticipated to bring huge benefits to
various applications, e.g., offloading systems. However, traditional deep
offloading architectures face several issues, including heterogeneous
constraints, partial perception, uncertain generalization, and lack of
tractability. In this context, the integration of offloading with large
language models (LLMs) presents numerous advantages. Therefore, we propose an
LLM-Based Offloading (LAMBO) framework for mobile edge computing (MEC), which
comprises four components: (i) Input embedding (IE), which is used to represent
the information of the offloading system with constraints and prompts through
learnable vectors with high quality; (ii) Asymmetric encoder-decoder (AED)
model, which is a decision-making module with a deep encoder and a shallow
decoder. It can achieve high performance based on multi-head self-attention
schemes; (iii) Actor-critic reinforcement learning (ACRL) module, which is
employed to pre-train the whole AED for different optimization tasks under
corresponding prompts; and (iv) Active learning from expert feedback (ALEF),
which can be used to finetune the decoder part of the AED while adapting to
dynamic environmental changes. Our simulation results corroborate the
advantages of the proposed LAMBO framework.
Comment: To be submitted for possible journal publication.
MARS: Masked Automatic Ranks Selection in Tensor Decompositions
Tensor decomposition methods are known to be efficient for compressing and
accelerating neural networks. However, the problem of optimal decomposition
structure determination is still not well studied while being quite important.
Specifically, decomposition ranks are the crucial parameter controlling the
compression-accuracy trade-off. In this paper, we introduce MARS -- a new
efficient method for the automatic selection of ranks in general tensor
decompositions. During training, the procedure learns binary masks over
decomposition cores that "select" the optimal tensor structure. The learning is
performed via relaxed maximum a posteriori (MAP) estimation in a specific
Bayesian model. The proposed method achieves better results compared to
previous works in various tasks.
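
To make the idea of binary masks that "select" a rank more concrete, here is a minimal Python sketch for the simplest case, a two-factor matrix decomposition W = A diag(mask) B. This is a simplification and not the MARS procedure itself: the mask is given by hand rather than learned through relaxed MAP estimation, and the function names `masked_product` and `prune_factors` are hypothetical.

```python
def masked_product(A, B, mask):
    """Compute A @ diag(mask) @ B for A (m x R), B (R x n), mask of length R."""
    rank = len(mask)
    return [
        [sum(A[i][r] * mask[r] * B[r][j] for r in range(rank))
         for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def prune_factors(A, B, mask):
    """Drop the rank components whose mask entry is 0."""
    keep = [r for r, bit in enumerate(mask) if bit]
    A_small = [[row[r] for r in keep] for row in A]
    B_small = [B[r] for r in keep]
    return A_small, B_small


# Rank-3 factors with the middle component masked out.
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
mask = [1, 0, 1]

W_masked = masked_product(A, B, mask)
A_small, B_small = prune_factors(A, B, mask)
W_pruned = masked_product(A_small, B_small, [1] * len(B_small))

print(W_masked)
print(W_pruned)
print("selected rank:", sum(mask))
```

The masked and pruned products agree, so once a mask is fixed, the zeroed components can be removed and the smaller factors stored instead.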