AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training
It is usually infeasible to fit and train an entire large deep neural network (DNN) model on a single edge device due to its limited resources. To facilitate intelligent applications across edge devices, researchers have proposed partitioning a large model into several sub-models and deploying each of them to a different edge device, so that the devices collaboratively train the DNN model. However, training can be significantly slowed down by the communication overhead of the large amount of data transmitted from one device to another, as well as by sub-optimal partition points caused by inaccurate prediction of the computation latency at each edge device. In this paper, we propose AccEPT, an acceleration scheme for edge collaborative pipeline-parallel training. In particular, we propose a lightweight adaptive latency predictor that accurately estimates the computation latency of each layer on different devices and adapts to unseen devices through continuous learning. The proposed latency predictor therefore leads to better model partitioning, which balances the computation load across participating devices. Moreover, we propose a bit-level, computation-efficient data compression scheme for the data transmitted between devices during training. Our numerical results demonstrate that the proposed approach speeds up edge pipeline-parallel training by up to 3 times in the considered experimental settings.
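To make the latency-aware partitioning idea concrete, here is a minimal sketch, not AccEPT's actual algorithm: given per-layer latencies predicted for two devices, it exhaustively picks the split point that minimizes the latency of the slower pipeline stage. The function name and the example latency values are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): choose the layer index at
# which to split a model across two devices so that the predicted
# per-stage computation latencies are as balanced as possible.

def pick_partition(lat_dev_a, lat_dev_b):
    """lat_dev_a[i] / lat_dev_b[i]: predicted latency of layer i on each device."""
    n = len(lat_dev_a)
    best_k, best_cost = 1, float("inf")
    for k in range(1, n):  # layers [0, k) on device A, [k, n) on device B
        stage_a = sum(lat_dev_a[:k])
        stage_b = sum(lat_dev_b[k:])
        cost = max(stage_a, stage_b)  # throughput is limited by the slower stage
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Example: predicted latencies (ms) for a 6-layer model on two devices.
print(pick_partition([4, 8, 6, 3, 2, 1], [6, 12, 9, 5, 3, 2]))  # -> (3, 18)
```

In a real pipeline the search would also account for the transmission latency of the activations crossing the partition point, which is exactly the overhead the paper's compression scheme targets.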
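The abstract does not specify how the bit-level compression scheme works, so the following is only a generic illustration of compressing inter-device traffic: quantizing float32 activations to int8 before transmission cuts the payload to a quarter. The helper names are hypothetical.

```python
import numpy as np

# Generic illustration only: quantize float32 activations to int8 before
# sending them to the next device, reducing transmitted bytes by 4x.

def compress(x: np.ndarray):
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def decompress(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s = compress(x)
print(np.abs(decompress(q, s) - x).max())  # reconstruction error, roughly scale/2
```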
A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification
Recent advancements in machine learning achieved by Deep Neural Networks (DNNs) have been significant. While demonstrating high accuracy, DNNs are associated with a huge number of parameters and computations, which leads to high memory usage and energy consumption. As a result, deploying DNNs on devices with constrained hardware resources poses significant challenges. To overcome this, various compression techniques have been widely employed to optimize DNN accelerators. A promising approach is quantization, in which full-precision values are stored in low bit-width precision. Quantization not only reduces memory requirements but also replaces high-cost operations with low-cost ones. DNN quantization offers flexibility and efficiency in hardware design, making it a widely adopted technique in various methods. Since quantization has been extensively utilized in previous works, there is a need for an integrated report that provides an understanding, analysis, and comparison of different quantization approaches. Consequently, we present a comprehensive survey of quantization concepts and methods, with a focus on image classification. We describe clustering-based quantization methods and explore the use of a scale-factor parameter for approximating full-precision values. Moreover, we thoroughly review the training of quantized DNNs, including the use of the straight-through estimator and quantized regularization. We explain the replacement of floating-point operations with low-cost bitwise operations in a quantized DNN and the sensitivity of different layers to quantization. Furthermore, we highlight the evaluation metrics for quantization methods and important benchmarks for the image classification task. We also present the accuracy
of the state-of-the-art methods on CIFAR-10 and ImageNet.
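To illustrate the scale-factor parameter mentioned in the survey, the textbook formulation of symmetric uniform quantization approximates a full-precision tensor x by scale * round(x / scale), with the scale chosen so that the value range of x fits the available integer grid. This is a generic sketch, not a specific method from the survey:

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    # Symmetric uniform quantization: approximate x with scale * q, where
    # q is an integer in [-2^(b-1), 2^(b-1) - 1] and scale maps the value
    # range of x onto that grid.
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.abs(x).max()), 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q, scale

x = np.random.randn(3, 3).astype(np.float32)
q, scale = quantize(x)
print(np.abs(scale * q - x).max())  # error bounded by roughly scale / 2
```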
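The straight-through estimator makes such quantizers trainable: rounding has zero gradient almost everywhere, so the backward pass pretends quantization is the identity. A minimal PyTorch sketch of the standard detach trick follows; the function name and defaults are illustrative assumptions.

```python
import torch

def ste_quantize(x, scale, num_bits=8):
    # Straight-through estimator: the forward pass sees the quantized
    # value, while the backward pass ignores round/clamp, so gradients
    # flow through as if quantization were the identity.
    qmax = 2 ** (num_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_hat = q * scale
    return x + (x_hat - x).detach()

w = torch.randn(4, requires_grad=True)
y = ste_quantize(w, scale=0.1).sum()
y.backward()
print(w.grad)  # all ones: the gradient of sum() passed straight through
```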
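The replacement of floating-point arithmetic with bitwise operations is clearest in the extreme 1-bit case: a dot product between two {-1, +1} vectors packed into machine words reduces to an XNOR followed by a popcount. This is a generic sketch, not tied to any particular accelerator design from the survey:

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    # Dot product of two length-n vectors with entries in {-1, +1},
    # packed as n-bit integers (bit = 1 encodes +1, bit = 0 encodes -1).
    # Matching bits contribute +1, mismatching bits -1, so
    # dot = matches - (n - matches) = 2 * matches - n.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    return 2 * bin(xnor).count("1") - n

# Example: a = [+1, -1, +1, +1] -> 0b1011, b = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))  # a . b = 1 - 1 - 1 + 1 = 0
```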
Enabling Deep Learning on Edge Devices
Deep neural networks (DNNs) have succeeded in many different perception tasks, e.g., computer vision, natural language processing, and reinforcement learning. These high-performing DNNs rely heavily on intensive resource consumption. For example, training a DNN requires large dynamic memory, a large-scale dataset, and a large number of computations (a long training time); even inference with a DNN demands a large amount of static storage, many computations (a long inference time), and substantial energy. Therefore, state-of-the-art DNNs are often deployed on a cloud server with a large number of super-computers, a high-bandwidth communication bus, a shared storage infrastructure, and a high-capacity power supply.
Recently, newly emerging intelligent applications, e.g., AR/VR, mobile assistants, and the Internet of Things, require us to deploy DNNs on resource-constrained edge devices. Compared to a cloud server, edge devices often have a rather small amount of resources. To deploy DNNs on edge devices, we need to reduce the size of DNNs, i.e., we target a better trade-off between resource consumption and model accuracy.
In this dissertation, we studied four edge intelligence scenarios, i.e., Inference on Edge Devices, Adaptation on Edge Devices, Learning on Edge Devices, and Edge-Server Systems, and developed different methodologies to enable deep learning in each scenario. Since current DNNs are often over-parameterized, our goal is to find and reduce the redundancy of the DNNs in each scenario.
Comment: PhD thesis at ETH Zurich.
Adaptive Loss-Aware Quantization for Multi-Bit Networks
We investigate the compression of deep neural networks by quantizing their weights and activations into multiple binary bases, known as multi-bit networks (MBNs), which accelerate inference and reduce storage for deployment on low-resource mobile and embedded platforms. We propose Adaptive Loss-aware Quantization (ALQ), a new MBN quantization pipeline that is able to achieve an average bitwidth below one bit without notable loss in inference accuracy. Unlike previous MBN quantization solutions that train a quantizer by minimizing the reconstruction error of the full-precision weights, ALQ directly minimizes the quantization-induced error on the loss function, involving neither gradient approximation nor the maintenance of full-precision weights. ALQ also exploits strategies including adaptive bitwidth, smooth bitwidth reduction, and iteratively trained quantization to allow a smaller network size without loss in accuracy. Experimental results on popular image datasets show that ALQ outperforms state-of-the-art compressed networks in terms of both storage and accuracy. Code is available at https://github.com/zqu1992/ALQ.
Comment: To appear in CVPR 2020.
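The multi-bit representation underlying ALQ expresses a weight vector as a weighted sum of binary bases, w ~ sum_i alpha_i * b_i with each b_i in {-1, +1}^n. The sketch below uses generic greedy residual binarization to convey this representation; it is not ALQ's loss-aware training pipeline, and all names are illustrative.

```python
import numpy as np

def multi_bit_decompose(w, num_bases=3):
    # Greedy residual binarization: approximate w as sum_i alpha_i * b_i
    # with b_i in {-1, +1}. Each step binarizes the current residual and
    # picks the scale alpha that minimizes the remaining L2 error.
    residual = w.astype(np.float64).copy()
    alphas, bases = [], []
    for _ in range(num_bases):
        b = np.sign(residual)
        b[b == 0] = 1.0
        alpha = np.abs(residual).mean()  # least-squares optimal scale for a sign basis
        alphas.append(alpha)
        bases.append(b)
        residual -= alpha * b
    return np.array(alphas), np.array(bases)

w = np.random.randn(16)
alphas, bases = multi_bit_decompose(w)
w_hat = (alphas[:, None] * bases).sum(axis=0)
print(np.linalg.norm(w - w_hat) / np.linalg.norm(w))  # shrinks as num_bases grows
```

With one basis this reduces to classic binary-weight quantization; adding bases trades storage for fidelity, which is the knob an adaptive-bitwidth method like ALQ tunes per weight group.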