26,866 research outputs found

    Accelerating and Compressing Deep Neural Networks for Massive MIMO CSI Feedback

    Full text link
    The recent advances in machine learning and deep neural networks have made them attractive candidates for wireless communications functions such as channel estimation, decoding, and downlink channel state information (CSI) compression. However, most of these neural networks are large and inefficient making it a barrier for deployment in practical wireless systems that require low-latency and low memory footprints for individual network functions. To mitigate these limitations, we propose accelerated and compressed efficient neural networks for massive MIMO CSI feedback. Specifically, we have thoroughly investigated the adoption of network pruning, post-training dynamic range quantization, and weight clustering to optimize CSI feedback compression for massive MIMO systems. Furthermore, we have deployed the proposed model compression techniques on commodity hardware and demonstrated that in order to achieve inference gains, specialized libraries that accelerate computations for sparse neural networks are required. Our findings indicate that there is remarkable value in applying these model compression techniques and the proposed joint pruning and quantization approach reduced model size by 86.5% and inference time by 76.2% with minimal impact to model accuracy. These compression methods are crucial to pave the way for practical adoption and deployments of deep learning-based techniques in commercial wireless systems.Comment: IEEE ICC 2023 Conferenc

    Hardware-Friendly Model Compression techniques for Deep Learning Accelerators

    Get PDF
    The objective of the proposed research is to introduce solutions to make energy-efficient \gls{dnn} accelerators to be deployable on edge devices through developing hardware-aware \gls{dnn} compression methods. The rising popularity of intelligent mobile devices and the computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. In particular, we proposed four compression techniques for energy and memory efficient \gls{dnn} computing. In the first method, \gls{lgps}, we present a hardware-aware pruning method where the locations of non-zero weights are derived in real-time from a \gls{lfsr}. Using the proposed method, we demonstrate a total saving of energy and area up to 63.96\% and 64.23\% for VGG-16 network on down-sampled ImageNet, respectively for iso-compression-rate and iso-accuracy. Secondly, We achieved ultra-low bit-precision deep learning model by developing a quantization scheme through knowledge distillation and gradual quantization for pruned network. Thirdly, we propose a novel model compression scheme that allows inference to be carried out using bit-level sparsity, which can be efficiently implemented using in-memory computing macros. We introduce a method called BitS-Net to leverage the benefits of bit-sparsity (where the number of zeros is more than number of ones in binary representation of weight/activation values) when applied to Compute-In-Memory (\gls{cim}) with Resistive Random Access Memory (\gls{rram}) to develop energy efficient \gls{dnn} accelerators operating in the inference mode. We demonstrate that BitS-Net improves the energy efficiency by up to 5x for ResNet models on the ImageNet dataset. In the last part, to achieve highly energy-efficient DNN, we introduce a novel twofold sparsity method (Twofold Sparsity, twofoldS-Net) to sparsify the DNN models in bit- and network-level, simultaneously. We added two separate regularizations to the loss function in order to achieve bit- and network-level sparsity at the same time. We sparsify the model in network-level, by adding a mask generated by \gls{lfsr}. For bit-level sparsity, we quantize the network to 8-bit representation in two's complement format. During inference we take advantage of \gls{cim} architecture and \gls{lfsr} indexing. We have shown that by using our proposed method we are able to sparsify the network and design a highly energy-efficient deep learning accelerator to eventually bring \gls{ai} to our daily lives.Ph.D

    Efficient Edge Intelligence in the Era of Big Data

    Get PDF
    Indiana University-Purdue University Indianapolis (IUPUI)Smart wearables, known as emerging paradigms for vital big data capturing, have been attracting intensive attentions. However, one crucial problem is their power-hungriness, i.e., the continuous data streaming consumes energy dramatically and requires devices to be frequently charged. Targeting this obstacle, we propose to investigate the biodynamic patterns in the data and design a data-driven approach for intelligent data compression. We leverage Deep Learning (DL), more specifically, Convolutional Autoencoder (CAE), to learn a sparse representation of the vital big data. The minimized energy need, even taking into consideration the CAE-induced overhead, is tremendously lower than the original energy need. Further, compared with state-of-the-art wavelet compression-based method, our method can compress the data with a dramatically lower error for a similar energy budget. Our experiments and the validated approach are expected to boost the energy efficiency of wearables, and thus greatly advance ubiquitous big data applications in era of smart health. In recent years, there has also been a growing interest in edge intelligence for emerging instantaneous big data inference. However, the inference algorithms, especially deep learning, usually require heavy computation requirements, thereby greatly limiting their deployment on the edge. We take special interest in the smart health wearable big data mining and inference. Targeting the deep learning’s high computational complexity and large memory and energy requirements, new approaches are urged to make the deep learning algorithms ultra-efficient for wearable big data analysis. We propose to leverage knowledge distillation to achieve an ultra-efficient edge-deployable deep learning model. More specifically, through transferring the knowledge from a teacher model to the on-edge student model, the soft target distribution of the teacher model can be effectively learned by the student model. Besides, we propose to further introduce adversarial robustness to the student model, by stimulating the student model to correctly identify inputs that have adversarial perturbation. Experiments demonstrate that the knowledge distillation student model has comparable performance to the heavy teacher model but owns a substantially smaller model size. With adversarial learning, the student model has effectively preserved its robustness. In such a way, we have demonstrated the framework with knowledge distillation and adversarial learning can, not only advance ultra-efficient edge inference, but also preserve the robustness facing the perturbed input.2023-06-0

    Distributed learning and inference in deep models

    Get PDF
    In recent years, the size of deep learning problems has been increased significantly, both in terms of the number of available training samples as well as the number of parameters and complexity of the model. In this thesis, we considered the challenges encountered in training and inference of large deep models, especially on nodes with limited computational power and capacity. We studied two classes of related problems; 1) distributed training of deep models, and 2) compression and restructuring of deep models for efficient distributed and parallel execution to reduce inference times. Especially, we considered the communication bottleneck in distributed training and inference of deep models. Data compression is a viable tool to mitigate the communication bottleneck in distributed deep learning. However, the existing methods suffer from a few drawbacks, such as the increased variance of stochastic gradients (SG), slower convergence rate, or added bias to SG. In my Ph.D. research, we have addressed these challenges from three different perspectives: 1) Information Theory and the CEO Problem, 2) Indirect SG compression via Matrix Factorization, and 3) Quantized Compressive Sampling. We showed, both theoretically and via simulations, that our proposed methods can achieve smaller MSE than other unbiased compression methods with fewer communication bit-rates, resulting in superior convergence rates. Next, we considered federated learning over wireless multiple access channels (MAC). Efficient communication requires the communication algorithm to satisfy the constraints imposed by the nodes in the network and the communication medium. To satisfy these constraints and take advantage of the over-the-air computation inherent in MAC, we proposed a framework based on random linear coding and developed efficient power management and channel usage techniques to manage the trade-offs between power consumption and communication bit-rate. In the second part of this thesis, we considered the distributed parallel implementation of an already-trained deep model on multiple workers. Since latency due to the synchronization and data transfer among workers adversely affects the performance of the parallel implementation, it is desirable to have minimum interdependency among parallel sub-models on the workers. To achieve this goal, we developed and analyzed RePurpose, an efficient algorithm to rearrange the neurons in the neural network and partition them (without changing the general topology of the neural network) such that the interdependency among sub-models is minimized under the computations and communications constraints of the workers.Ph.D
    • …
    corecore