
    Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs

    The spread of deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNNs). Works have mainly focused on: i) efficient DNN architectures, ii) network optimisation techniques such as pruning and quantisation, iii) optimised algorithms to speed up the execution of the most computationally intensive layers, and iv) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on cross-level optimisation, as the space of approaches becomes too large to test and obtain a globally optimised solution, leading to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyse the methods to improve the deployment of DNNs across the different levels of software optimisation. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs. The framework relies on a Reinforcement Learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimised solution that improves performance and reduces memory usage on embedded CPU platforms. We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms, achieving up to 4x improvement in performance and over 2x reduction in memory with negligible loss in accuracy with respect to the BLAS floating-point implementation.
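
    One of the per-layer choices such a search can toggle is integer quantisation of the weights, which is where much of the reported memory saving typically comes from. A minimal, hedged sketch of the idea in NumPy (not the paper's framework; the function names and the symmetric per-tensor scheme are illustrative assumptions):

    import numpy as np

    def quantize_int8(w):
        """Map float32 weights to int8 plus one scale factor (~4x smaller storage)."""
        scale = np.abs(w).max() / 127.0            # largest magnitude maps to +/-127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)   # toy fully connected layer
    q, s = quantize_int8(w)
    err = np.abs(dequantize(q, s) - w).max()
    print(f"storage: {w.nbytes} B -> {q.nbytes} B, max abs error {err:.4f}")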

    Performance engineering of data-intensive applications

    Data-intensive programs deal with big chunks of data and often contain compute-intensive characteristics. Among various HPC application domains, big data analytics, machine learning and the more recent deep-learning models are well-known data-intensive applications. An efficient design of such applications demands extensive knowledge of the target hardware and software, particularly the memory/cache hierarchy and the data communication among threads/processes. This requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in superfluous runtime, let alone hardware incompatibilities while porting the code to other platforms. In this dissertation, we introduce a set of tools and methods for the performance engineering of parallel data-intensive programs. We start with performance profiling to gain insights into thread communication and relevant code optimizations. Then, narrowing our scope to deep-learning applications, we introduce our tools for enhancing the performance portability and scalability of convolutional neural networks (ConvNets) in the inference and training phases.

    Our first contribution is a novel performance-profiling method to unveil potential communication bottlenecks caused by data-access patterns and thread interactions. Our findings show that data shared between a pair of threads should be reused within reasonably short intervals to preserve data locality, yet existing profilers neglect this and mainly report the communication volume. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. Our experiments show that applying the relevant optimizations improves performance in the Rodinia benchmarks by up to 56%.

    For the next contribution, we developed a framework for the automatic generation of efficient and performance-portable convolution kernels, including Winograd convolutions, for various GPU platforms. We employed a synergy of meta-programming, symbolic execution, and auto-tuning. The results demonstrate efficient kernels generated through an automated optimization pipeline, with runtimes close to vendor deep-learning libraries, and the minimal programming effort required confirms the performance portability of our approach. Furthermore, our symbolic execution method exploits repetitive patterns in Winograd convolutions, enabling us to reduce the number of arithmetic operations by up to 62% without compromising numerical stability.

    Lastly, we investigate methods to scale the performance of ConvNets in the training and inference phases. Our specialized training platform, equipped with a novel topology-aware network pruning algorithm, enables rapid training, neural architecture search, and network compression. Thus, AI model training can easily be scaled to a multitude of compute nodes, leading to faster model design with lower operating costs. Furthermore, the network compression component scales a ConvNet model down by removing redundant layers, preparing the model for a leaner deployment. Altogether, this work demonstrates the necessity and shows the benefit of performance engineering and parallel programming methods in accelerating emerging data-intensive workloads. With the help of the proposed tools and techniques, we pinpoint data communication bottlenecks and achieve performance portability and scalability in data-intensive applications.
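
    A minimal sketch of the profiling idea behind the first contribution (not the dissertation's tool, which works on real hardware traces; the toy trace and names below are assumptions): from an interleaved list of (thread, cache line) accesses, measure how many other accesses separate an access by one thread from the next access to the same line by a different thread. Large gaps suggest the shared line has likely left the cache before it is reused.

    from collections import defaultdict

    trace = [  # hypothetical interleaved trace: (thread id, cache line)
        (0, 'A'), (0, 'B'), (1, 'A'),            # 'A' is reused by thread 1 quickly
        (0, 'C'), (1, 'D'), (0, 'D'), (1, 'B'),  # 'B' is reused by thread 1 late
    ]

    last_access = {}                 # cache line -> (thread, position in trace)
    comm_gaps = defaultdict(list)    # (producer thread, consumer thread) -> gaps

    for pos, (tid, line) in enumerate(trace):
        if line in last_access:
            prev_tid, prev_pos = last_access[line]
            if prev_tid != tid:                        # inter-thread communication
                comm_gaps[(prev_tid, tid)].append(pos - prev_pos - 1)
        last_access[line] = (tid, pos)

    for pair, gaps in comm_gaps.items():
        print(f"threads {pair}: {len(gaps)} transfers, gaps {gaps}")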

    Development and Parallelisation of the Winograd Fast Convolution Algorithm on Low-Power ARM Platforms

    Ongoing progress in the design and development of deep convolutional neural networks, together with their increasing accuracy in computer vision, has led to their adoption in a wide range of smart applications. Most mobile and embedded devices, whether phones, tablets, navigation systems, embedded boards or robots, incorporate ARM processors and require efficient, power-aware execution of these networks. From a computational point of view, the inference phase of a convolutional neural network reduces largely to two basic operations: matrix multiplication, for fully connected layers, and convolution, for convolutional layers. This work addresses the development and parallelisation of the Winograd fast convolution algorithm, a transformation-based variant of the convolution that aims to maximise inference performance by reducing the number of arithmetic operations performed. In particular, it details the implementations and optimisations carried out to exploit the nature of the fast convolution algorithm on ARM processors, and compares them with the direct convolution and with approaches based on the im2col or im2row transformations followed by a matrix multiplication. The implementation also explores several vectorised and parallel versions of the algorithm based on ARM NEON intrinsic functions and the OpenMP parallel programming environment.
    Cuadrillero Geer, DJ. (2021). Desarrollo y paralelización del algoritmo de convolución rápida de Winograd sobre plataformas ARM de bajo consumo. Universitat Politècnica de València. http://hdl.handle.net/10251/179017
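
    As a minimal illustration of the fast-convolution idea described above (not the thesis code, which targets ARM NEON intrinsics and OpenMP), the 1-D Winograd F(2,3) case in NumPy produces two outputs of a 3-tap convolution with 4 element-wise multiplications instead of the direct method's 6:

    import numpy as np

    # Standard F(2,3) transform matrices (Winograd minimal filtering).
    BT = np.array([[1, 0, -1,  0],
                   [0, 1,  1,  0],
                   [0, -1, 1,  0],
                   [0, 1,  0, -1]], dtype=np.float32)
    G  = np.array([[1.0,  0.0, 0.0],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0,  0.0, 1.0]], dtype=np.float32)
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=np.float32)

    def winograd_f23(d, g):
        """Two convolution outputs from a 4-element input tile and a 3-tap filter."""
        U = G @ g                    # filter transform (precomputable once per filter)
        V = BT @ d                   # input transform
        return AT @ (U * V)          # 4 multiplies in U*V, plus cheap additions

    d = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)    # input tile
    g = np.array([0.5, 1.0, -1.0], dtype=np.float32)        # 3-tap filter
    direct = np.array([d[i]*g[0] + d[i+1]*g[1] + d[i+2]*g[2] for i in range(2)])
    print(winograd_f23(d, g), direct)                       # same two outputs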

    Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs

    The Winograd or Cook-Toom class of algorithms helps to reduce the overall compute complexity of many modern deep convolutional neural networks (CNNs). Although there has been a lot of research on the model and algorithmic optimization of CNNs, little attention has been paid to the efficient implementation of these algorithms on embedded CPUs, which usually have frugal memory and low power budgets. This research work aims to fill this gap and focuses on the efficient implementation of Winograd or Cook-Toom based convolution on modern Arm Cortex-A CPUs, widely used in mobile devices today. Specifically, we demonstrate a reduction in inference latency by using a set of optimization strategies that improve the utilization of computational resources, and by effectively leveraging the ARMv8-A NEON SIMD instruction set. We evaluated our proposed region-wise multi-channel implementations on the Arm Cortex-A73 platform using several representative CNNs. The results show significant full-network performance improvements, of up to 60%, over existing im2row/im2col based optimization techniques.
    PhD student funded by an EPSRC doctoral training account.
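
    For context, a hedged sketch of the im2row/im2col-plus-GEMM baseline that the kernels above are compared against (single channel, stride 1, no padding; production kernels handle channels and batches and hand the flattened patches to an optimised GEMM):

    import numpy as np

    def im2col(x, kh, kw):
        """Flatten every kh x kw patch of a single-channel image into a row,
        so the convolution becomes one matrix multiplication."""
        H, W = x.shape
        oh, ow = H - kh + 1, W - kw + 1
        cols = np.empty((oh * ow, kh * kw), dtype=x.dtype)
        for i in range(oh):
            for j in range(ow):
                cols[i * ow + j] = x[i:i+kh, j:j+kw].ravel()
        return cols

    x = np.arange(25, dtype=np.float32).reshape(5, 5)    # toy 5x5 input
    k = np.ones((3, 3), dtype=np.float32) / 9.0          # 3x3 box filter
    y = (im2col(x, 3, 3) @ k.ravel()).reshape(3, 3)      # convolution as a GEMM
    print(y)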