61 research outputs found
MARS: Masked Automatic Ranks Selection in Tensor Decompositions
Tensor decomposition methods are known to be efficient for compressing and
accelerating neural networks. However, the problem of determining the optimal
decomposition structure remains understudied despite its importance.
Specifically, decomposition ranks present the crucial parameter controlling the
compression-accuracy trade-off. In this paper, we introduce MARS -- a new
efficient method for the automatic selection of ranks in general tensor
decompositions. During training, the procedure learns binary masks over
decomposition cores that "select" the optimal tensor structure. The learning is
performed via relaxed maximum a posteriori (MAP) estimation in a specific
Bayesian model. The proposed method achieves better results than previous
work across a variety of tasks.
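The masking mechanism described above can be sketched in a few lines. This is not the authors' implementation; the layer sizes and logit values below are made up for illustration, and the relaxed MAP training loop is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical low-rank layer W ≈ U @ diag(mask) @ V, with a learnable
# binary mask over the r rank components. MARS learns such masks via
# relaxed MAP estimation; here we only sketch the masking itself.
m, n, r = 8, 6, 4
U = rng.standard_normal((m, r))
V = rng.standard_normal((r, n))

# Relaxed (continuous) mask obtained from unconstrained logits
# (assumed values for illustration).
logits = np.array([3.0, 2.0, -4.0, -1.0])
mask = 1.0 / (1.0 + np.exp(-logits))   # sigmoid relaxation in [0, 1]

W_masked = U @ np.diag(mask) @ V       # forward pass with the soft mask

# After training, the mask is binarized: components with mask < 0.5 are
# pruned, shrinking the decomposition rank.
keep = mask > 0.5
effective_rank = int(keep.sum())
U_pruned, V_pruned = U[:, keep], V[keep, :]
print(effective_rank)  # → 2
```

Binarizing the mask after training prunes entire rank components, so the compressed layer stores only `U_pruned` and `V_pruned`.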
Tensor shape search for efficient compression of tensorized data and neural networks
Low Rank Optimization for Efficient Deep Learning: Making A Balance between Compact Architecture and Fast Training
Deep neural networks have achieved great success in many data processing
applications. However, their high computational complexity and storage cost make
deep learning difficult to deploy on resource-constrained devices, and their
high power consumption is not environmentally friendly. In this paper, we focus
on
low-rank optimization for efficient deep learning techniques. In the space
domain, deep neural networks are compressed by low rank approximation of the
network parameters, which directly reduces the storage requirement with a
smaller number of network parameters. In the time domain, the network
parameters can be trained in a few subspaces, which enables efficient training
for fast convergence. Model compression in the spatial domain is summarized
into three categories: pre-train, pre-set, and compression-aware methods. These
can be combined with complementary techniques such as sparse pruning,
quantization, and entropy coding into an integrated framework with lower
computational complexity and storage cost. Besides summarizing recent technical
advances, we report two findings that motivate future work: one is that the
effective rank outperforms other sparsity measures for network compression; the
other is a balance between the spatial and temporal domains for tensorized
neural networks.
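The space-domain compression surveyed above can be illustrated with a minimal truncated-SVD sketch. The sizes are toy values and this is a generic "pre-train"-style compression, not any specific method from the survey:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "trained" weight matrix with true rank 8, compressed to rank k = 4
# by replacing W with two factors A (m x k) and B (k x n), W ≈ A @ B.
m, n, k = 64, 32, 4
W = rng.standard_normal((m, 8)) @ rng.standard_normal((8, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]   # absorb singular values into the left factor
B = Vt[:k, :]

# Storage drops from m*n to k*(m + n) parameters.
params_full = m * n
params_lowrank = k * (m + n)
print(params_full, params_lowrank)  # → 2048 384
```

The truncation error is governed by the discarded singular values, which is the usual accuracy-compression trade-off the abstract refers to.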
Low-Rank Tensorized Neural Networks With Tensor Geometry Optimization
Deep neural networks have demonstrated significant achievements across various fields, yet their memory and time complexities present obstacles for implementing them on resource-constrained devices. Compressing deep neural networks using tensor decomposition can decrease both memory usage and computational costs. The performance of a low-rank tensorized network depends on the choice of hyperparameters, including the tensor rank and geometry. Previous studies have concentrated on identifying optimal tensor ranks. This thesis studies the effect of the tensor geometry used for folding data for low-rank tensor compression. It is demonstrated that tensor geometry significantly affects the compression efficiency of tensorized data and model parameters. Consequently, a novel mathematical formulation is developed to optimize tensor geometry. The tensor geometry optimization model is adopted for efficient deployment of low-rank neural networks. The presented tensor geometry optimization model is combinatorial and thus challenging to solve. Therefore, surrogate and relaxed versions of the model are developed, and various methods, including integer linear programming, graph optimization, and random search algorithms, are applied to solve the presented optimization model. The proposed tensor geometry optimization achieved a notable reduction in both the memory and time complexities of neural networks while maintaining accuracy. The developed methods can be applied for hardware-software co-design of artificial intelligence (AI) accelerators, particularly on resource-constrained devices.
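A small sketch of why tensor geometry matters: folding the same 4096 entries into different shapes changes the storage cost of a tensor-train decomposition, even at a fixed rank. The shapes and the uniform rank below are hypothetical illustrations; the thesis optimizes this choice rigorously rather than enumerating it:

```python
import numpy as np

def tt_params(shape, r):
    """Parameter count of a tensor-train with uniform internal rank r.

    Core k has size ranks[k] * shape[k] * ranks[k+1], with boundary
    ranks fixed to 1.
    """
    ranks = [1] + [r] * (len(shape) - 1) + [1]
    return sum(ranks[k] * shape[k] * ranks[k + 1] for k in range(len(shape)))

# Different geometries (foldings) of the same 4096 entries.
geometries = [(4096,), (64, 64), (16, 16, 16), (8, 8, 8, 8), (4,) * 6]
r = 4
for g in geometries:
    assert np.prod(g) == 4096
    print(g, tt_params(g, r))  # more, smaller modes → fewer parameters here
```

At rank 4 the counts fall from 4096 (no folding) to 512, 384, 320, and 288 as the folding gets finer, which is the geometry effect the abstract describes; in general the best geometry also depends on how the ranks grow with the folding.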
Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding
Fine-tuned transformer models have shown superior performance in many
natural language tasks. However, the large model size prohibits deploying
high-performance transformer models on resource-constrained devices. This paper
proposes a quantization-aware tensor-compressed training approach to reduce the
model size, arithmetic operations, and ultimately runtime latency of
transformer-based models. We compress the embedding and linear layers of
transformers into small low-rank tensor cores, which significantly reduces
model parameters. A quantization-aware training with learnable scale factors is
used to further obtain low-precision representations of the tensor-compressed
models. The developed approach can be used for both end-to-end training and
distillation-based training. To improve the convergence, a layer-by-layer
distillation is applied to distill a quantized and tensor-compressed student
model from a pre-trained transformer. The performance is demonstrated on two
natural language understanding tasks, showing a substantial compression ratio,
little accuracy loss, and remarkable inference and training speedups.
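The quantization-aware part can be sketched with a fake-quantization step: weights are rounded on an integer grid in the forward pass but kept in floating point so training can proceed. The learnable scale factor from the paper is replaced here with a fixed max-based scale (an assumption for the sketch; in training the scale would be updated by gradients, with a straight-through estimator for the rounding):

```python
import numpy as np

def fake_quantize(x, scale, bits=8):
    """Simulated signed integer quantization with a given scale factor."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer grid
    return q * scale                                   # back to float

rng = np.random.default_rng(2)
w = rng.standard_normal(1000).astype(np.float32)

# Fixed scale covering the full range, so nothing is clipped; a learnable
# scale would trade clipping against resolution.
scale = np.abs(w).max() / 127
w_q = fake_quantize(w, scale, bits=8)

err = np.abs(w - w_q).max()
print(err <= scale / 2 + 1e-6)  # rounding error bounded by half a step → True
```

The same fake-quantized forward pass can wrap the low-rank tensor cores, which is the combination of quantization and tensor compression the abstract describes.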