23 research outputs found
PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
With the emergence of a spectrum of high-end mobile devices, many
applications that formerly required desktop-level computation capability are
being transferred to these devices. However, executing the inference of Deep
Neural Networks (DNNs) is still challenging considering high computation and
storage demands, specifically, if real-time performance with high accuracy is
needed. Weight pruning of DNNs has been proposed, but existing schemes represent two
extremes in the design space: non-structured pruning is fine-grained, accurate,
but not hardware friendly; structured pruning is coarse-grained,
hardware-efficient, but with higher accuracy loss. In this paper, we introduce
a new dimension, fine-grained pruning patterns inside the coarse-grained
structures, revealing a previously unknown point in design space. With the
higher accuracy enabled by fine-grained pruning patterns, the unique insight is
to use the compiler to regain and guarantee high hardware efficiency. In other
words, our method achieves the best of both worlds, and is desirable across
theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an
end-to-end framework to efficiently execute DNNs on mobile devices with the help
of a novel model compression technique (pattern-based pruning based on extended
ADMM solution framework) and a set of thorough architecture-aware compiler- and
code generation-based optimizations (filter kernel reordering, compressed
weight storage, register load redundancy elimination, and parameter
auto-tuning). Evaluation results demonstrate that PatDNN outperforms three
state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba
Mobile Neural Network with speedup up to 44.5x, 11.4x, and 7.1x, respectively,
with no accuracy compromise. Real-time inference of representative large-scale
DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.
Comment: To be published in the Proceedings of Twenty-Fifth International
Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS 20
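
The idea of pattern-based pruning described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract names an extended ADMM solution framework, whereas the code below simply picks, for each 3x3 kernel, the mask that retains the most squared weight magnitude. The choices of 4 kept weights per kernel and an always-kept center weight, as well as the helper names `candidate_patterns` and `prune_kernel`, are assumptions made for illustration.

```python
from itertools import combinations


def candidate_patterns(keep=4, size=9, center=4):
    """Enumerate masks over a flattened 3x3 kernel.

    Assumption for this sketch: each pattern keeps `keep` weights and
    always includes the center position.
    """
    others = [i for i in range(size) if i != center]
    return [frozenset((center,) + combo) for combo in combinations(others, keep - 1)]


def prune_kernel(kernel, patterns):
    """Zero every weight outside the pattern that retains the most squared magnitude."""
    flat = [w for row in kernel for w in row]
    best = max(patterns, key=lambda p: sum(flat[i] ** 2 for i in p))
    pruned_flat = [w if i in best else 0.0 for i, w in enumerate(flat)]
    width = len(kernel[0])
    pruned = [pruned_flat[r * width:(r + 1) * width] for r in range(len(kernel))]
    return pruned, best


# Hypothetical kernel values, chosen only to make the selection visible.
kernel = [[0.1, -0.9, 0.2],
          [0.8, 0.5, -0.05],
          [0.0, 0.7, 0.3]]
patterns = candidate_patterns()
pruned, pattern = prune_kernel(kernel, patterns)
print(sorted(pattern))
print(pruned)
```

Because every kernel ends up with one of a fixed set of masks, a compiler can in principle group kernels by pattern and generate specialized code per group, which is the kind of regularity the abstract's filter kernel reordering and compressed weight storage appear to rely on.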
SoC-Cluster as an Edge Server: an Application-driven Measurement Study
Huge electricity consumption is a severe issue for edge data centers. To this
end, we propose a new form of edge server, namely SoC-Cluster, that
orchestrates many low-power mobile system-on-chips (SoCs) through an on-chip
network. For the first time, we have developed a concrete SoC-Cluster server
that consists of 60 Qualcomm Snapdragon 865 SoCs in a 2U rack. Such a server
has been commercialized successfully and deployed at large scale on edge
clouds. The current dominant workload on those deployed SoC-Clusters is cloud
gaming, as mobile SoCs can seamlessly run native mobile games.
The primary goal of this work is to demystify whether SoC-Cluster can
efficiently serve more general-purpose, edge-typical workloads. Therefore, we
built a benchmark suite that leverages state-of-the-art libraries for two
killer edge workloads, i.e., video transcoding and deep learning inference. The
benchmark comprehensively reports the performance, power consumption, and other
application-specific metrics. We then performed a thorough measurement study
and directly compared SoC-Cluster with traditional edge servers (with Intel CPU
and NVIDIA GPU) with respect to physical size, electricity, and billing. The
results reveal the advantages of SoC-Cluster, especially its high energy
efficiency and the ability to proportionally scale energy consumption with
various incoming loads, as well as its limitations. The results also provide
insightful implications and valuable guidance to further improve SoC-Cluster
and land it in broader edge scenarios
RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices
Mobile devices are becoming an important carrier for deep learning tasks, as
they are being equipped with powerful, high-end mobile CPUs and GPUs. However,
it is still challenging to execute 3D Convolutional Neural Networks
(CNNs) in real time while maintaining high inference accuracy.
The reason is that their more complex model structure and higher model
dimensionality overwhelm the available computation/storage resources on mobile
devices. A natural approach is to turn to deep learning weight pruning techniques. However,
the direct generalization of existing 2D CNN weight pruning methods to 3D CNNs
is not ideal for fully exploiting mobile parallelism while achieving high
inference accuracy.
This paper proposes RT3D, a model compression and mobile acceleration
framework for 3D CNNs, seamlessly integrating neural network weight pruning and
compiler code generation techniques. We propose and investigate two structured
sparsity schemes, i.e., vanilla structured sparsity and kernel group
structured (KGS) sparsity, both of which are friendly to mobile acceleration. The vanilla
sparsity removes whole kernel groups, while KGS sparsity is a more fine-grained
structured sparsity that enjoys higher flexibility while exploiting full
on-device parallelism. We propose a reweighted regularization pruning algorithm
to achieve the proposed sparsity schemes. The inference time speedup due to
sparsity approaches the pruning rate of the whole model's FLOPs (floating
point operations). RT3D demonstrates up to 29.1x speedup in end-to-end
inference time compared with current mobile frameworks supporting 3D CNNs,
with moderate 1%-1.5% accuracy loss. The end-to-end inference time for 16 video
frames could be within 150 ms, when executing representative C3D and R(2+1)D
models on a cellphone. For the first time, real-time execution of 3D CNNs is
achieved on off-the-shelf mobiles.
Comment: To appear in Proceedings of the 35th AAAI Conference on Artificial
Intelligence (AAAI-21
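
The reweighted regularization mentioned in this abstract can be sketched in a few lines. The abstract does not give the formula, so the form below (a per-group penalty weight set to the inverse of the group's squared norm, recomputed periodically) is an assumption based on common reweighted group-sparsity methods. The function names, the `eps` value, and the example weights are all hypothetical.

```python
def group_sq_norms(groups):
    """Squared L2 norm of each weight group (e.g., one kernel group)."""
    return [sum(w * w for w in group) for group in groups]


def update_alphas(groups, eps=1e-3):
    """Recompute penalty weights: groups with small norms get large penalties.

    Assumed form: alpha_g = 1 / (||W_g||^2 + eps).
    """
    return [1.0 / (n + eps) for n in group_sq_norms(groups)]


def reweighted_penalty(groups, alphas):
    """Regularization term added to the training loss: sum_g alpha_g * ||W_g||^2."""
    return sum(a * n for a, n in zip(alphas, group_sq_norms(groups)))


# Two hypothetical weight groups: one near zero, one clearly important.
groups = [[0.01, -0.02], [1.0, 2.0]]
alphas = update_alphas(groups)
penalty = reweighted_penalty(groups, alphas)
print(alphas, penalty)
```

The design intent of such a scheme is that groups already close to zero are pushed harder toward zero, while large groups are penalized only lightly, so the set of groups to remove emerges during training rather than being fixed in advance.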
SoD: Statically Optimizing Dynamic Deep Neural Network
Though many compilation and runtime systems have been developed for DNNs in
recent years, the focus has largely been on static DNNs. Dynamic DNNs, where
tensor shapes and sizes and even the set of operators used are dependent upon
the input and/or execution, are becoming common. This paper presents SoD, a
comprehensive framework for optimizing Dynamic DNNs. The basis of our approach
is a classification of common operators that form DNNs, and the use of this
classification towards a Rank and Dimension Propagation (RDP) method. This
framework statically determines the shapes of operators as known constants,
symbolic constants, or operations on these. Next, using RDP we enable a series
of optimizations, like fused code generation, execution (order) planning, and
even runtime memory allocation plan generation. By evaluating the framework on
10 emerging Dynamic DNNs and comparing it against several existing systems, we
demonstrate both reductions in execution latency and memory requirements, with
RDP-enabled key optimizations responsible for much of the gains. Our evaluation
results show that SoD runs up to faster than these systems
while saving up to peak memory consumption
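
The abstract describes shapes being determined statically as known constants, symbolic constants, or operations on these. A minimal sketch of that idea follows; it is not the paper's RDP implementation. The representation (ints for known dimensions, strings for symbolic ones, tuples for expressions) and the helper names `add_dims`, `matmul_shape`, and `concat_shape` are assumptions for illustration.

```python
def add_dims(a, b):
    """Add two dimensions; fold constants, otherwise keep a symbolic expression."""
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    return ("add", a, b)


def matmul_shape(x, y):
    """Shape of x @ y for x: (..., m, k) and y: (k, n)."""
    # This sketch conservatively rejects differing symbols; a real system
    # would have to defer such a check to runtime.
    if x[-1] != y[0]:
        raise ValueError(f"inner dimensions differ: {x[-1]} vs {y[0]}")
    return x[:-1] + (y[1],)


def concat_shape(x, y, axis):
    """Shape of concatenating x and y along `axis`."""
    if len(x) != len(y):
        raise ValueError("rank mismatch")
    for i, (a, b) in enumerate(zip(x, y)):
        if i != axis and a != b:
            raise ValueError(f"dimension {i} differs: {a} vs {b}")
    return x[:axis] + (add_dims(x[axis], y[axis]),) + x[axis + 1:]


# "batch" and "seq" are symbolic: unknown until the input arrives.
tokens = ("batch", "seq", 768)
hidden = matmul_shape(tokens, (768, 3072))
extended = concat_shape(tokens, ("batch", 1, 768), axis=1)
static = concat_shape((2, 3), (2, 5), axis=1)
print(hidden, extended, static)
```

Even with symbolic dimensions, the rank and the constant dimensions are known ahead of execution, which is the kind of information that could drive fused code generation or memory planning as the abstract describes.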
PockEngine: Sparse and Efficient Fine-tuning in a Pocket
On-device learning and efficient fine-tuning enable continuous and
privacy-preserving customization (e.g., locally fine-tuning large language
models on personalized data). However, existing training frameworks are
designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and
lack the optimizations for learning on the edge, which faces challenges of
resource limitations and edge hardware diversity. We introduce PockEngine: a
tiny, sparse and efficient engine to enable fine-tuning on various edge
devices. PockEngine supports sparse backpropagation: it prunes the backward
graph and sparsely updates the model with measured memory saving and latency
reduction while maintaining the model quality. Secondly, PockEngine is
compilation first: the entire training graph (including forward, backward and
optimization steps) is derived at compile-time, which reduces the runtime
overhead and brings opportunities for graph transformations. PockEngine also
integrates a rich set of training graph optimizations, including operator
reordering and backend switching, which further reduce the training cost.
PockEngine supports diverse applications, frontends and hardware
backends: it flexibly compiles and tunes models defined in
PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We
evaluated PockEngine on both vision models and large language models.
PockEngine achieves up to 15x speedup over off-the-shelf TensorFlow
(Raspberry Pi) and 5.6x memory saving for back-propagation (Jetson AGX Orin).
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin
at 550 tokens/s, 7.9x faster than the PyTorch
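
The backward-graph pruning described in this abstract can be illustrated with a small sketch. It is not PockEngine's implementation: it assumes a simple chain of layers, and the function name `prune_backward`, the layer names, and the op labels are hypothetical. The point is that once the set of trainable layers is fixed ahead of time, the backward operations that are never needed can be dropped before execution.

```python
def prune_backward(layers):
    """Return the backward ops that must run, in execution order.

    `layers` is a list of (name, trainable) pairs in forward order.
    A weight gradient is needed only for trainable layers; an input
    gradient is needed only while an earlier trainable layer remains.
    """
    trainable_idx = [i for i, (_, trainable) in enumerate(layers) if trainable]
    if not trainable_idx:
        return []
    first = trainable_idx[0]
    ops = []
    for i in range(len(layers) - 1, first - 1, -1):
        name, trainable = layers[i]
        if trainable:
            ops.append(f"grad_weight:{name}")
        if i > first:
            ops.append(f"grad_input:{name}")
    return ops


# Hypothetical model: only the last two layers are fine-tuned.
layers = [("conv1", False), ("conv2", False), ("conv3", True), ("fc", True)]
ops = prune_backward(layers)
print(ops)
```

In this example, back-propagation stops at `conv3`, so the frozen early layers need no gradient computation, and their activations would not need to be stored for the backward pass.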