PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
With the emergence of a spectrum of high-end mobile devices, many
applications that formerly required desktop-level computation capability are
being transferred to these devices. However, executing the inference of Deep
Neural Networks (DNNs) is still challenging considering high computation and
storage demands, especially when real-time performance with high accuracy is
needed. Weight pruning of DNNs has been proposed, but existing schemes represent two
extremes in the design space: non-structured pruning is fine-grained, accurate,
but not hardware friendly; structured pruning is coarse-grained,
hardware-efficient, but with higher accuracy loss. In this paper, we introduce
a new dimension: fine-grained pruning patterns inside coarse-grained
structures, revealing a previously unknown point in the design space. With the
higher accuracy enabled by fine-grained pruning patterns, the unique insight is
to use the compiler to regain and guarantee high hardware efficiency. In other
words, our method achieves the best of both worlds, and is desirable across
theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an
end-to-end framework to efficiently execute DNNs on mobile devices with the help
of a novel model compression technique (pattern-based pruning based on extended
ADMM solution framework) and a set of thorough architecture-aware compiler- and
code generation-based optimizations (filter kernel reordering, compressed
weight storage, register load redundancy elimination, and parameter
auto-tuning). Evaluation results demonstrate that PatDNN outperforms three
state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba
Mobile Neural Network with speedup up to 44.5x, 11.4x, and 7.1x, respectively,
with no accuracy compromise. Real-time inference of representative large-scale
DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.

Comment: To be published in the Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20).
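
As an illustration of the core idea, the sketch below applies a small, hypothetical library of 4-entry patterns to 3x3 kernels, keeping for each kernel the pattern that preserves the most weight magnitude. The paper derives its pattern set via an extended ADMM framework, so treat the pattern choice and selection rule here as assumptions.

```python
# Minimal sketch of pattern-based kernel pruning (illustrative only;
# the PATTERNS library and greedy selection rule are assumptions).
import numpy as np

# Hypothetical pattern library: each pattern keeps 4 of the 9 weights.
PATTERNS = [
    np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]]),
    np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]]),
    np.array([[0, 0, 0], [1, 1, 0], [1, 1, 0]]),
    np.array([[0, 0, 0], [0, 1, 1], [0, 1, 1]]),
]

def prune_kernel(kernel):
    """Keep the pattern that preserves the most weight magnitude."""
    best = max(PATTERNS, key=lambda m: np.abs(kernel * m).sum())
    return kernel * best

# Apply per 3x3 kernel of a conv weight tensor (out_ch, in_ch, 3, 3).
weights = np.random.randn(64, 32, 3, 3)
pruned = np.stack([
    np.stack([prune_kernel(k) for k in filt]) for filt in weights
])
```

Because every kernel ends up matching one of a few known sparsity patterns, a compiler can reorder and pack the computation for these patterns ahead of time, which is what lets fine-grained sparsity stay hardware-friendly.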
FedDIP: Federated Learning with Extreme Dynamic Pruning and Incremental Regularization
Federated Learning (FL) has been successfully adopted for distributed
training and inference of large-scale Deep Neural Networks (DNNs). However,
DNNs are characterized by an extremely large number of parameters,
yielding significant challenges in exchanging these parameters among
distributed nodes and in managing memory. Although recent DNN compression
methods (e.g., sparsification, pruning) tackle such challenges, they do not
holistically consider an adaptively controlled reduction of parameter exchange
while maintaining high accuracy levels. We therefore contribute a novel
FL framework (coined FedDIP), which combines (i) dynamic model pruning with
error feedback to eliminate redundant information exchange, which contributes
to significant performance improvement, with (ii) incremental regularization
that can achieve extreme sparsity of models. We provide convergence
analysis of FedDIP and report on a comprehensive performance and comparative
assessment against state-of-the-art methods using benchmark data sets and DNN
models. Our results showcase that FedDIP not only controls the model sparsity
but efficiently achieves similar or better performance compared to other model
pruning methods adopting incremental regularization during distributed model
training. The code is available at: https://github.com/EricLoong/feddip.

Comment: Accepted for publication at ICDM 2023 (full version on arXiv). The associated code is available at https://github.com/EricLoong/feddip.
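
A minimal sketch of one ingredient, magnitude pruning with error feedback, is given below: the residual carried between rounds is what keeps pruned-away information from being lost. The function name, the top-k rule, and the sparsity level are illustrative assumptions, not the authors' API.

```python
# Minimal sketch of magnitude pruning with error feedback (one ingredient
# FedDIP combines with incremental regularization; names are assumptions).
import numpy as np

def prune_with_error_feedback(update, residual, sparsity=0.99):
    """Transmit only the largest-magnitude entries; carry the rest forward."""
    corrected = update + residual                 # re-inject past error
    k = max(1, int(corrected.size * (1 - sparsity)))
    thresh = np.partition(np.abs(corrected).ravel(), -k)[-k]
    mask = np.abs(corrected) >= thresh
    sparse_update = corrected * mask              # what the client sends
    new_residual = corrected - sparse_update      # error kept locally
    return sparse_update, new_residual
```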
Model-Architecture Co-design of Deep Neural Networks for Embedded Systems
In deep learning, a convolutional neural network (ConvNet or CNN) is a powerful tool for building embedded applications that use data to make predictions. An application running on an embedded system typically has limited access to memory, processing power, and storage. Implementing deep convolutional neural network-based inference on resource-constrained devices can be very challenging, as these environments usually cannot make use of the massive computing power and storage available in cloud server environments. Furthermore, the constantly evolving nature of modern deep network architectures aggravates the problem by making it necessary to balance flexibility against specialisation, so as not to lose the ability to adapt. However, much of the baseline architecture of a deep convolutional neural network has stayed the same. With careful optimisation of the most common and widely occurring layer architectures, it is typically possible to accelerate these emerging workloads on resource-constrained embedded systems.
This thesis makes four contributions. I first developed a lossy three-stage low-rank approximation scheme that can reduce the computational complexity of a pre-trained model by 3-5x and up to 8-9x for individual convolutional layers. This scheme requires restructuring of the convolutional layers and generally suits the scenario where both the training data and trained model are available.
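
A minimal sketch of the general idea, a truncated SVD over flattened filters, appears below; the thesis' three-stage scheme is more involved, and the rank chosen here is an arbitrary assumption for illustration.

```python
# Minimal sketch of low-rank approximation of a conv layer via truncated
# SVD (illustrative only; rank=32 is an assumption).
import numpy as np

def low_rank_factorize(W, rank):
    """Split (out_ch, in_ch*k*k) weights into two thin factors."""
    out_ch = W.shape[0]
    M = W.reshape(out_ch, -1)                     # flatten each filter
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :rank] * S[:rank]                    # (out_ch, rank)
    B = Vt[:rank]                                 # (rank, in_ch*k*k)
    return A, B  # replace one conv with a rank-r pair of cheaper layers

W = np.random.randn(128, 64, 3, 3)
A, B = low_rank_factorize(W, rank=32)
# MACs per output position drop from 128*64*9 to 32*64*9 + 128*32,
# roughly a 3x reduction at this rank.
```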
In many scenarios, the training data is not available for fine-tuning away any loss in prediction accuracy when structural changes are made to a model as a post-processing step. Besides the lack of training data, there are other situations where the architecture of a model cannot be changed after training. My second contribution handles this scenario with a low-level optimisation scheme that, unlike the low-rank approximation scheme, requires no changes to the model architecture. This novel scheme uses a modified version of the Cook-Toom algorithm to reduce the computational intensity of commonly occurring dense and spatial convolutional layers and speeds up inference by 2-4x.
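
For concreteness, the sketch below shows the classic 1D Cook-Toom (Winograd) F(2,3) transform, which computes two outputs of a 3-tap convolution with 4 multiplications instead of 6; the modified algorithm used in the thesis differs in its details.

```python
# The 1D Cook-Toom/Winograd F(2,3) transform: two outputs of a 3-tap
# convolution in 4 multiplies instead of 6.
import numpy as np

G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]])

def f23(d, g):
    """y = A^T [(G g) * (B^T d)]: 4-point input tile, 3-tap filter."""
    return At @ ((G @ g) * (Bt @ d))

d = np.random.randn(4)            # input tile
g = np.random.randn(3)            # filter taps
direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(f23(d, g), direct)
```

The pre-transform (B^T) and post-transform (A^T) stages are exactly the extra pipeline steps mentioned in the next paragraph.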
My third contribution is an efficient implementation of the Cook-Toom class of algorithms on Arm's ubiquitous low-power Cortex processors. Unlike direct convolution, computing convolutions with the modified Cook-Toom algorithm requires a different data processing pipeline, as it involves pre- and post-transformations of the intermediate activations. I introduced a multi-channel multi-region (MCMR) scheme to enable an efficient implementation of the fast Cook-Toom algorithm. I demonstrate that, by effectively using SIMD instructions together with the MCMR scheme, an average 2-3x and a peak 4x per-layer speedup is achievable.
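
An illustrative batched variant in the spirit of MCMR is sketched below: transforming many input tiles (regions) and channels at once turns the elementwise-multiply stage into long, SIMD-friendly vector operations. The real Cortex kernels are hand-tuned; the shapes and names here are assumptions.

```python
# Illustrative multi-channel, multi-region batching of F(2,3)
# (assumed shapes; not the thesis' actual Cortex implementation).
import numpy as np

G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]])

def f23_multichannel(D, filt):
    """D: (regions, channels, 4) input tiles; filt: (channels, 3) taps."""
    U = filt @ G.T                         # filter transform, once per layer
    V = np.einsum('kt,nct->nck', Bt, D)    # transform every tile at once
    M = (U[None] * V).sum(axis=1)          # long, vectorisable multiplies
    return M @ At.T                        # (regions, 2) output pixels

D = np.random.randn(8, 16, 4)              # 8 regions, 16 input channels
filt = np.random.randn(16, 3)
out = f23_multichannel(D, filt)            # accumulated over channels
```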
My final contribution is the Cook-Toom accelerator, a custom hardware architecture for modern convolutional neural networks. This accelerator architecture is designed from the ground up to address some of the limitations of a resource-constrained SIMD processor. I also illustrate how new, emerging layer types can be mapped efficiently to the same flexible architecture without any modification.
Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss
Reducing the bit-widths of the activations and weights of deep networks makes
them more efficient to compute and store in memory, which is crucial for
deployment on resource-limited devices, such as mobile phones. However,
decreasing bit-widths with quantization generally yields drastically degraded
accuracy. To tackle this problem, we propose to learn to quantize activations
and weights via a trainable quantizer that transforms and discretizes them.
Specifically, we parameterize the quantization intervals and obtain their
optimal values by directly minimizing the task loss of the network. This
quantization-interval learning (QIL) allows the quantized networks to maintain
the accuracy of the full-precision (32-bit) networks at bit-widths as low as
4 bits, and minimizes the accuracy degradation under further bit-width reduction
(i.e., 3 and 2 bits). Moreover, our quantizer can be trained on a heterogeneous
dataset, and thus can be used to quantize pretrained networks without access to
their training data. We demonstrate the effectiveness of our trainable
quantizer on the ImageNet dataset with various network architectures, such as
ResNet-18, ResNet-34, and AlexNet, on which it outperforms existing methods and
achieves state-of-the-art accuracy.
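
A minimal sketch of the interval-learning idea with a straight-through estimator follows; QIL's actual transformer is more expressive than this, so the simple two-parameter interval is an assumption for illustration.

```python
# Minimal sketch of a trainable quantization interval with a
# straight-through estimator (the parameterization is an assumption;
# QIL's transformer is richer).
import torch
import torch.nn as nn

class IntervalQuantizer(nn.Module):
    def __init__(self, bits=4):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.lo = nn.Parameter(torch.tensor(-1.0))  # learned interval edges
        self.hi = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        # Map the learned interval to [0, 1], clipping values outside it.
        xn = ((x - self.lo) / (self.hi - self.lo)).clamp(0, 1)
        q = torch.round(xn * self.levels) / self.levels
        # Straight-through estimator: quantize forward, identity backward.
        xq = xn + (q - xn).detach()
        return xq * (self.hi - self.lo) + self.lo

# The interval edges receive gradients from the task loss, so training
# adjusts the quantization range directly.
quant = IntervalQuantizer(bits=4)
w = torch.randn(16, requires_grad=True)
loss = quant(w).pow(2).sum()
loss.backward()
```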