
    Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs

    The spread of deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNNs). Works have mainly focused on: i) efficient DNN architectures, ii) network optimisation techniques such as pruning and quantisation, iii) optimised algorithms to speed up the execution of the most computationally intensive layers, and iv) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on cross-level optimisation, as the space of approaches becomes too large to test and obtain a globally optimised solution, leading to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyse the methods to improve the deployment of DNNs across the different levels of software optimisation. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs. The framework relies on a Reinforcement Learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimised solution that improves performance and reduces memory usage on embedded CPU platforms. We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms, achieving up to 4x improvement in performance and over 2x reduction in memory with negligible loss in accuracy with respect to the BLAS floating-point implementation.
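
    One of the per-layer choices such a search can toggle is integer quantisation of the weights, which is where much of the reported memory saving typically comes from. A minimal, hedged sketch of the idea in NumPy (not the paper's framework; the function names and the symmetric per-tensor scheme are illustrative assumptions):

    import numpy as np

    def quantize_int8(w):
        """Map float32 weights to int8 plus one scale factor (~4x smaller storage)."""
        scale = np.abs(w).max() / 127.0            # largest magnitude maps to +/-127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)   # toy fully connected layer
    q, s = quantize_int8(w)
    err = np.abs(dequantize(q, s) - w).max()
    print(f"storage: {w.nbytes} B -> {q.nbytes} B, max abs error {err:.4f}")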

    Performance engineering of data-intensive applications

    Data-intensive programs deal with big chunks of data and often contain compute-intensive characteristics. Among various HPC application domains, big data analytics, machine learning and the more recent deep-learning models are well-known data-intensive applications. An efficient design of such applications demands extensive knowledge of the target hardware and software, particularly the memory/cache hierarchy and the data communication among threads/processes. This requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in superfluous runtime, let alone hardware incompatibilities while porting the code to other platforms. In this dissertation, we introduce a set of tools and methods for the performance engineering of parallel data-intensive programs. We start with performance profiling to gain insights into thread communication and relevant code optimizations. Then, narrowing our scope to deep-learning applications, we introduce our tools for enhancing the performance portability and scalability of convolutional neural networks (ConvNets) in the inference and training phases.

    Our first contribution is a novel performance-profiling method to unveil potential communication bottlenecks caused by data-access patterns and thread interactions. Our findings show that data shared between a pair of threads should be reused within reasonably short intervals to preserve data locality, yet existing profilers neglect this and mainly report the communication volume. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. Our experiments show that applying the relevant optimizations improves performance in the Rodinia benchmarks by up to 56%.

    For the next contribution, we developed a framework for the automatic generation of efficient and performance-portable convolution kernels, including Winograd convolutions, for various GPU platforms. We employed a synergy of meta-programming, symbolic execution, and auto-tuning. The results demonstrate efficient kernels generated through an automated optimization pipeline, with runtimes close to vendor deep-learning libraries, and the minimal programming effort required confirms the performance portability of our approach. Furthermore, our symbolic execution method exploits repetitive patterns in Winograd convolutions, enabling us to reduce the number of arithmetic operations by up to 62% without compromising numerical stability.

    Lastly, we investigate methods to scale the performance of ConvNets in the training and inference phases. Our specialized training platform, equipped with a novel topology-aware network pruning algorithm, enables rapid training, neural architecture search, and network compression. Thus, AI model training can easily be scaled to a multitude of compute nodes, leading to faster model design with lower operating costs. Furthermore, the network compression component scales a ConvNet model down by removing redundant layers, preparing the model for a leaner deployment. Altogether, this work demonstrates the necessity and shows the benefit of performance engineering and parallel programming methods in accelerating emerging data-intensive workloads. With the help of the proposed tools and techniques, we pinpoint data communication bottlenecks and achieve performance portability and scalability in data-intensive applications.
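
    A minimal sketch of the profiling idea behind the first contribution (not the dissertation's tool, which works on real hardware traces; the toy trace and names below are assumptions): from an interleaved list of (thread, cache line) accesses, measure how many other accesses separate an access by one thread from the next access to the same line by a different thread. Large gaps suggest the shared line has likely left the cache before it is reused.

    from collections import defaultdict

    trace = [  # hypothetical interleaved trace: (thread id, cache line)
        (0, 'A'), (0, 'B'), (1, 'A'),            # 'A' is reused by thread 1 quickly
        (0, 'C'), (1, 'D'), (0, 'D'), (1, 'B'),  # 'B' is reused by thread 1 late
    ]

    last_access = {}                 # cache line -> (thread, position in trace)
    comm_gaps = defaultdict(list)    # (producer thread, consumer thread) -> gaps

    for pos, (tid, line) in enumerate(trace):
        if line in last_access:
            prev_tid, prev_pos = last_access[line]
            if prev_tid != tid:                        # inter-thread communication
                comm_gaps[(prev_tid, tid)].append(pos - prev_pos - 1)
        last_access[line] = (tid, pos)

    for pair, gaps in comm_gaps.items():
        print(f"threads {pair}: {len(gaps)} transfers, gaps {gaps}")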

    Development and Parallelisation of the Winograd Fast Convolution Algorithm on Low-Power ARM Platforms

    Ongoing progress in the design and development of deep convolutional neural networks, together with their increasing accuracy in computer vision, has led to their adoption in a wide range of smart applications. Most mobile and embedded devices, whether phones, tablets, navigation systems, embedded boards or robots, incorporate ARM processors and require efficient, power-aware execution of these networks. From a computational point of view, the inference phase of a convolutional neural network reduces largely to two basic operations: matrix multiplication, for fully connected layers, and convolution, for convolutional layers. This work addresses the development and parallelisation of the Winograd fast convolution algorithm, a transformation-based variant of the convolution that aims to maximise inference performance by reducing the number of arithmetic operations performed. In particular, it details the implementations and optimisations carried out to exploit the nature of the fast convolution algorithm on ARM processors, and compares them with the direct convolution and with approaches based on the im2col or im2row transformations followed by a matrix multiplication. The implementation also explores several vectorised and parallel versions of the algorithm based on ARM NEON intrinsic functions and the OpenMP parallel programming environment.
    Cuadrillero Geer, DJ. (2021). Desarrollo y paralelización del algoritmo de convolución rápida de Winograd sobre plataformas ARM de bajo consumo. Universitat Politècnica de València. http://hdl.handle.net/10251/179017
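
    As a minimal illustration of the fast-convolution idea described above (not the thesis code, which targets ARM NEON intrinsics and OpenMP), the 1-D Winograd F(2,3) case in NumPy produces two outputs of a 3-tap convolution with 4 element-wise multiplications instead of the direct method's 6:

    import numpy as np

    # Standard F(2,3) transform matrices (Winograd minimal filtering).
    BT = np.array([[1, 0, -1,  0],
                   [0, 1,  1,  0],
                   [0, -1, 1,  0],
                   [0, 1,  0, -1]], dtype=np.float32)
    G  = np.array([[1.0,  0.0, 0.0],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0,  0.0, 1.0]], dtype=np.float32)
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=np.float32)

    def winograd_f23(d, g):
        """Two convolution outputs from a 4-element input tile and a 3-tap filter."""
        U = G @ g                    # filter transform (precomputable once per filter)
        V = BT @ d                   # input transform
        return AT @ (U * V)          # 4 multiplies in U*V, plus cheap additions

    d = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)    # input tile
    g = np.array([0.5, 1.0, -1.0], dtype=np.float32)        # 3-tap filter
    direct = np.array([d[i]*g[0] + d[i+1]*g[1] + d[i+2]*g[2] for i in range(2)])
    print(winograd_f23(d, g), direct)                       # same two outputs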

    Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs

    The Winograd or Cook-Toom class of algorithms helps to reduce the overall compute complexity of many modern deep convolutional neural networks (CNNs). Although there has been a lot of research on the model and algorithmic optimization of CNNs, little attention has been paid to the efficient implementation of these algorithms on embedded CPUs, which usually have frugal memory and low power budgets. This research work aims to fill this gap and focuses on the efficient implementation of Winograd or Cook-Toom based convolution on modern Arm Cortex-A CPUs, widely used in mobile devices today. Specifically, we demonstrate a reduction in inference latency by using a set of optimization strategies that improve the utilization of computational resources, and by effectively leveraging the ARMv8-A NEON SIMD instruction set. We evaluated our proposed region-wise multi-channel implementations on the Arm Cortex-A73 platform using several representative CNNs. The results show significant full-network performance improvements, of up to 60%, over existing im2row/im2col based optimization techniques.
    PhD student funded by an EPSRC doctoral training account.
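
    For context, a hedged sketch of the im2row/im2col-plus-GEMM baseline that the kernels above are compared against (single channel, stride 1, no padding; production kernels handle channels and batches and hand the flattened patches to an optimised GEMM):

    import numpy as np

    def im2col(x, kh, kw):
        """Flatten every kh x kw patch of a single-channel image into a row,
        so the convolution becomes one matrix multiplication."""
        H, W = x.shape
        oh, ow = H - kh + 1, W - kw + 1
        cols = np.empty((oh * ow, kh * kw), dtype=x.dtype)
        for i in range(oh):
            for j in range(ow):
                cols[i * ow + j] = x[i:i+kh, j:j+kw].ravel()
        return cols

    x = np.arange(25, dtype=np.float32).reshape(5, 5)    # toy 5x5 input
    k = np.ones((3, 3), dtype=np.float32) / 9.0          # 3x3 box filter
    y = (im2col(x, 3, 3) @ k.ravel()).reshape(3, 3)      # convolution as a GEMM
    print(y)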