263 research outputs found
PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
With the emergence of a spectrum of high-end mobile devices, many
applications that formerly required desktop-level computation capability are
being transferred to these devices. However, executing the inference of Deep
Neural Networks (DNNs) is still challenging considering high computation and
storage demands, specifically, if real-time performance with high accuracy is
needed. Weight pruning of DNNs is proposed, but existing schemes represent two
extremes in the design space: non-structured pruning is fine-grained, accurate,
but not hardware friendly; structured pruning is coarse-grained,
hardware-efficient, but with higher accuracy loss. In this paper, we introduce
a new dimension, fine-grained pruning patterns inside the coarse-grained
structures, revealing a previously unknown point in design space. With the
higher accuracy enabled by fine-grained pruning patterns, the unique insight is
to use the compiler to re-gain and guarantee high hardware efficiency. In other
words, our method achieves the best of both worlds, and is desirable across
theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an
end-to-end framework to efficiently execute DNN on mobile devices with the help
of a novel model compression technique (pattern-based pruning based on extended
ADMM solution framework) and a set of thorough architecture-aware compiler- and
code generation-based optimizations (filter kernel reordering, compressed
weight storage, register load redundancy elimination, and parameter
auto-tuning). Evaluation results demonstrate that PatDNN outperforms three
state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba
Mobile Neural Network with speedup up to 44.5x, 11.4x, and 7.1x, respectively,
with no accuracy compromise. Real-time inference of representative large-scale
DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.Comment: To be published in the Proceedings of Twenty-Fifth International
Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS 20
Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search
Though recent years have witnessed remarkable progress in single image
super-resolution (SISR) tasks with the prosperous development of deep neural
networks (DNNs), the deep learning methods are confronted with the computation
and memory consumption issues in practice, especially for resource-limited
platforms such as mobile devices. To overcome the challenge and facilitate the
real-time deployment of SISR tasks on mobile, we combine neural architecture
search with pruning search and propose an automatic search framework that
derives sparse super-resolution (SR) models with high image quality while
satisfying the real-time inference requirement. To decrease the search cost, we
leverage the weight sharing strategy by introducing a supernet and decouple the
search problem into three stages, including supernet construction,
compiler-aware architecture and pruning search, and compiler-aware pruning
ratio search. With the proposed framework, we are the first to achieve
real-time SR inference (with only tens of milliseconds per frame) for
implementing 720p resolution with competitive image quality (in terms of PSNR
and SSIM) on mobile platforms (Samsung Galaxy S20)
- …