    Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

    Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications, ranging from computer vision for medicine to autonomous driving, as well as other sectors such as security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks that require significant computational power during both training and inference. A single inference of a DL model may require billions of multiply-and-accumulate operations, making DL extremely compute- and energy-hungry. In scenarios where several sophisticated algorithms must be executed with limited energy and low latency, there is a need for cost-effective hardware platforms capable of energy-efficient DL execution. This paper first introduces the key properties of two brain-inspired models, the Deep Neural Network (DNN) and the Spiking Neural Network (SNN), and then analyzes techniques for producing efficient and high-performance designs. It summarizes and compares work on four leading execution platforms (CPU, GPU, FPGA and ASIC) and describes the main state-of-the-art solutions, giving particular prominence to the last two, since they offer greater design flexibility and have the potential for high energy efficiency, especially for inference. In addition to hardware solutions, this paper discusses some of the important security issues that DNN and SNN models may face during execution, and offers a comprehensive section on benchmarking that explains how to assess the quality of different networks and the hardware systems designed for them.
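
The "billions of multiply-and-accumulate operations" figure can be made concrete with a rough count. The sketch below is only an illustration: the layer shapes are hypothetical (not taken from the survey), and the formula assumes standard dense convolutions with no grouping, ignoring bias terms.

```python
def conv2d_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    """MAC count of one standard (ungrouped) 2-D convolution, bias ignored."""
    # Every output element needs in_ch * k_h * k_w multiply-accumulates.
    return out_h * out_w * out_ch * in_ch * k_h * k_w


def dense_macs(in_features, out_features):
    """MAC count of one fully connected layer, bias ignored."""
    return in_features * out_features


# Hypothetical layer shapes, chosen only for illustration.
first_conv = conv2d_macs(224, 224, 64, 3, 3, 3)
second_conv = conv2d_macs(224, 224, 64, 64, 3, 3)
classifier = dense_macs(4096, 1000)

total_macs = first_conv + second_conv + classifier
print(f"first conv:  {first_conv:,} MACs")
print(f"second conv: {second_conv:,} MACs")
print(f"classifier:  {classifier:,} MACs")
print(f"total:       {total_macs:,} MACs")
```

Under these assumed shapes, two early convolution layers alone already exceed one billion MACs, which is why the convolutional part of a network usually dominates the compute budget.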

    PoET-BiN: Power Efficient Tiny Binary Neurons

    SUMMARY The success of neural networks in image classification has inspired various hardware implementations on embedded systems such as FPGAs, embedded processors and graphics processing units. These systems are often power-constrained. However, neural networks consume a great deal of power through multiply-accumulate operations and the memory accesses needed to fetch weights. Quantization and pruning have been proposed to address this problem. Although effective, these techniques do not take into account the underlying architecture of the hardware used. In this work, we propose an energy-efficient, truth-table-based implementation of a binary neuron on resource-constrained embedded systems. A modified decision tree approach forms the foundation of the proposed implementation in the binary domain. A LUT access consumes far less energy than the equivalent multiply-accumulate operation it replaces. Moreover, the modified decision tree algorithm eliminates the need to access memory. We used the proposed binary neurons to implement the classification layer of networks used to solve the MNIST, SVHN and CIFAR-10 datasets, with near state-of-the-art results. The power reduction for the classification layer reaches three orders of magnitude for the MNIST dataset and five orders of magnitude for the SVHN and CIFAR-10 datasets.----------ABSTRACT The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphics Processing Units. 
These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and pruning have been proposed to address this issue. Though effective, these techniques do not take into account the underlying architecture of the embedded hardware. In this work, we propose PoET-BiN, a Look-Up Table (LUT) based, power-efficient implementation on resource-constrained embedded devices. A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses. We applied the PoET-BiN architecture to implement the classification layers of networks trained on the MNIST, SVHN and CIFAR-10 datasets, with near state-of-the-art results. The energy reduction for the classifier portion reaches up to six orders of magnitude compared to a floating-point implementation and up to three orders of magnitude compared to recent binary quantized neural networks.
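
The idea that a table lookup can stand in for a multiply-accumulate can be sketched in a few lines. This is not the PoET-BiN algorithm (which builds its tables from a modified Decision Tree). It is a generic illustration that assumes a binary neuron with inputs in {0, 1} interpreted as {-1, +1}, hypothetical ±1 weights and a hypothetical threshold.

```python
from itertools import product


def neuron_mac(weights, threshold, bits):
    """Reference binary neuron: accumulate w * x with x in {-1, +1}."""
    acc = sum(w * (2 * b - 1) for w, b in zip(weights, bits))
    return 1 if acc >= threshold else 0


def build_lut(weights, threshold):
    """Precompute the neuron output for all 2**k input patterns."""
    k = len(weights)
    lut = [0] * (1 << k)
    for bits in product((0, 1), repeat=k):
        index = 0
        for b in bits:  # first input bit becomes the most significant bit
            index = (index << 1) | b
        lut[index] = neuron_mac(weights, threshold, bits)
    return lut


def neuron_lut(lut, bits):
    """Inference by table lookup: no multiplications, no weight fetches."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return lut[index]


# Hypothetical 6-input neuron; 6 inputs match a common FPGA LUT size.
weights = [1, -1, 1, 1, -1, 1]
threshold = 0
lut = build_lut(weights, threshold)
print(neuron_lut(lut, (1, 0, 1, 1, 0, 1)))
```

The table grows as 2^k with the number of inputs k, so this only pays off for small fan-in, which is why approaches in this family restrict each neuron to a handful of inputs.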

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows. Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    NEURAghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on Zynq SoCs

    Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN software stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state-of-the-art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.

    Efficient FPGA-Based Inference Architectures for Deep Learning Networks

    Deep learning has become the state-of-the-art technique for many classification and regression applications. Deep learning models, such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), deploy dozens of hidden layers with hundreds of neurons to obtain a meaningful representation of the input data. The power of DNNs and CNNs comes from the fact that they are trained by learning extracted features rather than by task-specific algorithms. However, this comes at the expense of a high computational cost for the training and inference processes. This calls for high-performance, energy-efficient accelerators, especially for inference when real-time processing matters. FPGAs offer an attractive platform for accelerating DNN and CNN inference owing to their performance, configurability and energy efficiency. In this thesis, we address three main problems. First, we examine the problem of implementing traditional fully connected DNNs precisely and efficiently on FPGAs. Although Binary Neural Networks (BNNs) use a compact one-bit data representation compared with the fixed-point and floating-point data of traditional DNNs and CNNs, they may still require too many computational and memory resources. Therefore, we study the problem of implementing BNNs on FPGAs as the second problem. Finally, we focus on introducing FPGAs as hardware accelerators to a wider range of software developers, especially those who lack FPGA programming knowledge. 
To address the first problem, and since the efficient implementation of non-linear activation functions is essential to implementing deep learning models on FPGAs, we introduce a non-linear activation function implementation based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed interpolation architecture combines arithmetic operations on stored samples of the hyperbolic tangent function and on the input data. This solution offers 3× better precision than previous work while using a similar amount of computational resources and a small amount of memory. Different combinations of DCTIF parameters can be chosen to trade off precision against the overall circuit complexity of the hyperbolic tangent function.----------ABSTRACT: Deep learning has evolved to become the state-of-the-art technique for numerous classification and regression applications. Deep learning models, such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), deploy dozens of hidden layers with hundreds of neurons to learn a meaningful representation of the input data. The power of DNNs and CNNs comes from the fact that they are trained through feature learning rather than task-specific algorithms. However, this comes at the expense of high computational cost for both training and inference processes. This necessitates high-performance and energy-efficient accelerators, especially for inference where real-time processing matters. FPGAs offer an appealing platform for accelerating the inference of DNNs and CNNs due to their performance, configurability and energy-efficiency. In this thesis, we address three main problems. Firstly, we consider the problem of realizing a precise but efficient implementation of traditional fully connected DNNs in FPGAs. 
Although Binary Neural Networks (BNNs) use a compact data representation (1-bit) compared to the fixed-point and floating-point representations in traditional DNNs and CNNs, they may still need too many computational and memory resources. Therefore, we study the problem of implementing BNNs in FPGAs as the second problem. Finally, we focus on introducing FPGAs as accelerators to a wider range of software developers, especially those who do not possess FPGA programming knowledge. To address the first problem, and since efficient implementation of non-linear activation functions is essential to the implementation of deep learning models on FPGAs, we introduce a non-linear activation function implementation based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed interpolation architecture combines arithmetic operations on the stored samples of the hyperbolic tangent function and on input data. It achieves almost 3× better precision than previous works while using a similar amount of computational resources and a small amount of memory. Various combinations of DCTIF parameters can be chosen to trade off the accuracy and the overall circuit complexity of the tanh function. In an attempt to address the first and third problems, we introduce a Single hidden layer Neural Network (SNN) multiplication-free overlay architecture with fully connected DNN-level performance. This FPGA inference overlay can be used for applications that are normally solved with fully connected DNNs. The overlay avoids the time needed to synthesize, place, route and regenerate a new bitstream when the application changes. The SNN overlay inputs and activations are quantized to power-of-two values, which allows utilizing shift units instead of multipliers. Since the overlay is an SNN, we fill the FPGA chip with the maximum possible number of neurons that can work in parallel in the hidden layer. 
We evaluate the proposed architecture on typical benchmark datasets and demonstrate higher throughput with respect to the state-of-the-art while achieving the same accuracy. In addition, the SNN overlay makes the power and versatility of FPGAs available to a wider DNN user community and improves DNN design efficiency.
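
The abstract above states that quantizing inputs and activations to power-of-two values lets shift units replace multipliers. A minimal sketch of that idea follows. It assumes integer (fixed-point) weights, nearest-power-of-two rounding in the log domain, and a hypothetical exponent range of -4 to 3; none of these choices are taken from the thesis.

```python
import math


def quantize_pow2(x, min_exp=-4, max_exp=3):
    """Approximate x by sign * 2**exp; returns (sign, exp). Zero maps to sign 0."""
    if x == 0:
        return 0, min_exp
    sign = 1 if x > 0 else -1
    exp = round(math.log2(abs(x)))          # nearest exponent in the log domain
    exp = max(min_exp, min(max_exp, exp))   # clamp to the assumed exponent range
    return sign, exp


def shift_multiply(weight_int, sign, exp):
    """Compute weight_int * sign * 2**exp using shifts only."""
    if exp >= 0:
        shifted = weight_int << exp
    else:
        # Right shift floors toward negative infinity for negative integers.
        shifted = weight_int >> -exp
    return sign * shifted


def neuron_shift_only(weights_int, activations):
    """Dot product in which every multiplication is replaced by a shift."""
    acc = 0
    for w, a in zip(weights_int, activations):
        sign, exp = quantize_pow2(a)
        acc += shift_multiply(w, sign, exp)
    return acc


# Hypothetical integer weights and real-valued activations.
print(neuron_shift_only([16, -32, 8], [0.5, 2.0, 0.0]))
```

Because only the exponent and sign of each activation need to be stored, the datapath reduces to a barrel shifter and an adder, which is the property the overlay exploits.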

    Artificial neural networks acceleration on field-programmable gate arrays considering model redundancy

    Artificial Neural Networks (ANNs) have developed dramatically over the last ten years and have been successfully applied in many important areas. A natural follow-up topic is to deploy ANNs to a wider range of hardware platforms. However, modern ANN models may aim for millisecond- or even nanosecond-level latency per input, while commonly requiring millions of operations and gigabyte-scale data access to process each input. This intrinsically high computational complexity introduces hardware challenges to the system implementation. Meanwhile, the integration of computing resources on hardware platforms is hampered by the slowing down of Moore’s Law. Therefore, it is important to study new design methods for ANN hardware systems that produce high model accuracy with low resource usage. The Field-Programmable Gate Array (FPGA) is a natural fit for this topic due to its reconfigurability and flexibility. These features allow us to implement customised data paths and data representations in hardware, which makes the FPGA the primary platform in this research. The main topics discussed in this thesis include neural network redundancy and its impact on hardware systems. The main goal is to reduce hardware complexity by reducing neural network redundancy while maintaining accuracy. To achieve this, redundancy is first categorised into two types: model-level and data-level. Then, each type is studied in isolation before both are combined in a single system design. First, to study model-level redundancy, dropout, an algorithm that reduces model-level redundancy during training, is implemented and used here to reduce hardware cost. Our proposed system achieves a 50% reduction in DSP usage and 33% to 47% less on-chip memory usage compared to conventional implementations. Second, in terms of data-level redundancy, we aim to study how data precision affects hardware cost and system throughput. 
Our experiments show that reduced-precision data incur negligible or even no accuracy loss compared with full-precision data on the tested benchmarks. In particular, 4-bit fixed point presents a good trade-off between model accuracy and hardware cost compared to the other tested data representations. Third, we studied the interactive effect of reducing both model- and data-level redundancy and proposed an FPGA accelerator design for Redundancy-Reduced (RR-) MobileNet [Hea17]. Our proposed RR-MobileNet system achieves a state-of-the-art latency, 7.85 ms, for single image processing in ImageNet inference. Finally, a design guideline is proposed as step-by-step guidance for redundancy-reduced neural network system design.
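
To make the data-level redundancy discussion concrete, here is a rough sketch of signed 4-bit fixed-point quantization. The split into 2 integer bits (including sign) and 2 fractional bits is an assumption made for illustration. The thesis states only that 4-bit fixed point was among the tested representations.

```python
def to_fixed(x, total_bits=4, frac_bits=2):
    """Quantize x to a signed fixed-point integer code with saturation."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))      # -8 for 4 bits
    q_max = (1 << (total_bits - 1)) - 1   # +7 for 4 bits
    code = round(x * scale)
    return max(q_min, min(q_max, code))


def from_fixed(code, frac_bits=2):
    """Convert an integer code back to the real value it represents."""
    return code / (1 << frac_bits)


# Hypothetical weights; values outside [-2.0, 1.75] saturate.
for w in (0.8, -0.3, 5.0, -3.0):
    code = to_fixed(w)
    print(f"{w:+.2f} -> code {code:+d} -> {from_fixed(code):+.2f}")
```

With this assumed format, representable values run from -2.0 to 1.75 in steps of 0.25, so the choice of fractional bits trades range against resolution for a fixed hardware cost.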

    Flexible Computing Systems For AI Acceleration At The Extreme Edge Of The IoT

    Embedding intelligence in extreme edge devices allows distilling raw data acquired from sensors into actionable information, directly on IoT end-nodes. This computing paradigm, in which end-nodes no longer depend entirely on the Cloud, offers undeniable benefits, driving a large research area (TinyML) to deploy leading Machine Learning (ML) algorithms on microcontroller-class devices. To fit the limited memory storage capability of these tiny platforms, full-precision Deep Neural Networks (DNNs) are compressed by representing their data down to byte and sub-byte formats, in the integer domain. However, the current generation of micro-controller systems can barely cope with the computing requirements of the resulting Quantized Neural Networks (QNNs). This thesis tackles the challenge from many perspectives, presenting solutions both at software and hardware levels, exploiting parallelism, heterogeneity and software programmability to guarantee high flexibility and high energy-performance proportionality. The first contribution, PULP-NN, is an optimized software computing library for QNN inference on parallel ultra-low-power (PULP) clusters of RISC-V processors, showing one order of magnitude improvements in performance and energy efficiency, compared to current State-of-the-Art (SoA) STM32 micro-controller systems (MCUs) based on ARM Cortex-M cores. The second contribution is XpulpNN, a set of RISC-V domain specific instruction set architecture (ISA) extensions to deal with sub-byte integer arithmetic computation. The solution, including the ISA extensions and the micro-architecture to support them, achieves energy efficiency comparable with dedicated DNN accelerators and surpasses the efficiency of SoA ARM Cortex-M based MCUs, such as the low-end STM32M4 and the high-end STM32H7 devices, by up to three orders of magnitude. 
To overcome the von Neumann bottleneck while guaranteeing the highest flexibility, the final contribution integrates an Analog In-Memory Computing accelerator into the PULP cluster, creating a fully programmable heterogeneous fabric that demonstrates end-to-end inference capabilities of SoA MobileNetV2 models, showing two orders of magnitude performance improvements over current SoA analog/digital solutions.
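
The sub-byte integer formats mentioned above can be illustrated with a small software sketch that stores two signed 4-bit values per byte. This is not the XpulpNN ISA or the PULP-NN API. The function names, the low-nibble-first layout and the assumption that all values lie in the range -8 to 7 are hypothetical.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (assumed in -8..7), two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Masking with 0xF keeps the two's-complement nibble of each value.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Unpack bytes into signed 4-bit integers with sign extension."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def dot_int4(packed_a, packed_b):
    """Dot product of two packed int4 vectors of equal length."""
    return sum(x * y for x, y in zip(unpack_int4(packed_a), unpack_int4(packed_b)))


# Hypothetical weights and activations.
weights = pack_int4([3, -2, 7, -8])
activations = pack_int4([1, 1, -1, 2])
print(len(weights), dot_int4(weights, activations))
```

In software, the unpacking and sign extension cost extra instructions per element. Dedicated sub-byte instructions aim to remove that overhead by operating on the packed operands directly.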
