
    NEURAghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on Zynq SoCs

    Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of the Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
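
    The cooperative execution model described above can be pictured as a simple per-layer dispatch loop. The sketch below is purely illustrative and assumes hypothetical helpers (run_on_accelerator, run_on_cpu) rather than NEURAghe's actual NeuDNN interface: convolutional layers are offloaded to the FPGA-side accelerator, while the remaining layers stay on the ARM cores.

    # Minimal sketch of cooperative heterogeneous dispatch; all names and the
    # stand-in computations are assumptions, not the NeuDNN API.
    import numpy as np

    def run_on_accelerator(layer, x):
        # Placeholder for an offload call into the Convolution-Specific Processor.
        return np.maximum(x, 0.0)  # stand-in computation

    def run_on_cpu(layer, x):
        # Placeholder for a NEON-optimised routine running on the ARM cores.
        return x / (1.0 + np.exp(-x))  # stand-in computation

    ACCEL_TYPES = {"conv"}  # layer types the accelerator handles in this sketch

    def run_network(layers, x):
        for layer in layers:
            kernel = run_on_accelerator if layer["type"] in ACCEL_TYPES else run_on_cpu
            x = kernel(layer, x)
        return x

    print(run_network([{"type": "conv"}, {"type": "softmax"}], np.random.rand(4)))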

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
    Comment: Accepted for publication in ACM Computing Surveys (CSUR).

    Hardware compilation of deep neural networks: an overview

    Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research.
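
    To make the front end / platform-independent optimisation / back end structure concrete, here is a minimal illustrative skeleton; the class names, passes, and template strings are assumptions made for the sake of the example and do not come from any particular toolflow in the overview.

    # Illustrative three-stage compilation skeleton (not from the surveyed work).
    from dataclasses import dataclass, field

    @dataclass
    class Graph:
        layers: list = field(default_factory=list)

    def front_end(model_description):
        # Parse a framework-level model description into an intermediate graph.
        return Graph(layers=list(model_description))

    def optimise(graph):
        # Platform-independent passes, e.g. dropping no-op layers.
        graph.layers = [l for l in graph.layers if l != "identity"]
        return graph

    def back_end(graph):
        # Map each remaining layer onto a (hypothetical) hardware template.
        return [f"instantiate_template({l})" for l in graph.layers]

    print(back_end(optimise(front_end(["conv", "identity", "fc"]))))

    In an actual toolflow the back end would emit accelerator configurations or HDL rather than strings, but the three-stage separation is the same.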

    Low power and high performance heterogeneous computing on FPGAs

    The abstract is provided in the attachment.

    Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

    In the modern era of technology, a paradigm shift has been witnessed in the areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as computer vision, image and video processing, and robotics. In the context of developed digital technologies and the availability of authentic data and data-handling infrastructure, DNNs have become a credible choice for solving more complex real-life problems. In certain situations, the performance and accuracy of a DNN surpass those of human intelligence. However, DNNs are computationally very demanding in terms of the resources and time required to handle these computations. Furthermore, general-purpose architectures like CPUs have difficulty handling such computationally intensive algorithms. Therefore, considerable interest and effort have been invested by the research community in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of computationally intensive algorithms. This paper brings forward the various research works carried out on the development and deployment of DNNs using the aforementioned specialized hardware architectures and embedded AI accelerators. The review provides a detailed description of the specialized hardware-based accelerators used in the training and/or inference of DNNs. A comparative study of the discussed accelerators, based on factors such as power, area, and throughput, is also presented. Finally, future research and development directions are discussed, such as future trends in DNN implementation on specialized hardware accelerators. This review article is intended to serve as a guide to hardware architectures for accelerating and improving the effectiveness of deep learning research.

    Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications

    With the advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors, new opportunities are emerging for applying deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can facilitate the advancement of medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing how various technologies, ranging from emerging memristive devices, to established Field Programmable Gate Arrays (FPGAs), and mature Complementary Metal Oxide Semiconductor (CMOS) technology, can be used to develop efficient DL accelerators to solve a wide variety of diagnostic, pattern recognition, and signal processing problems in healthcare. Furthermore, we explore how spiking neuromorphic processors can complement their DL counterparts for processing biomedical signals. After providing the required background, we unify the sparsely distributed research on neural network and neuromorphic hardware implementations as applied to the healthcare domain. In addition, we benchmark various hardware platforms by performing a biomedical electromyography (EMG) signal processing task and drawing comparisons among them in terms of inference delay and energy. Finally, we provide our analysis of the field and share a perspective on the advantages, disadvantages, challenges, and opportunities that different accelerators and neuromorphic processors introduce to the healthcare and biomedical domains. This paper can serve a large audience, ranging from nanoelectronics researchers to biomedical and healthcare practitioners, in grasping the fundamental interplay between hardware, algorithms, and the clinical adoption of these tools, as we shed light on the future of deep networks and spiking neuromorphic processing systems as proponents for driving biomedical circuits and systems forward.
    Comment: Submitted to IEEE Transactions on Biomedical Circuits and Systems (21 pages, 10 figures, 5 tables).

    Automatic Generation of Execution Plans for the Efficient Execution of Convolutional Neural Networks

    Master's thesis -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, August 2020. Bernhard Egger.
    Over the past years, a large number of architectures and accelerators for Deep Neural Networks (DNNs) have been proposed. While exhibiting common features, the number and arrangement of processing elements, the sizes and types of on-chip memory, and the possibilities of parallel execution vary significantly, especially in the embedded system domain. The number of off-chip memory accesses and the performance of a DNN on a given accelerator depend not only on the supported computational patterns and the available on-chip memory but also on the sizes and shapes of each layer. Finding a computational pattern that minimizes off-chip memory accesses while maximizing performance is thus a tedious and error-prone task. This thesis presents e-PlaNNer, a compiler framework that generates an optimized execution plan for a given embedded accelerator and Convolutional Neural Network (CNN). For each layer, e-PlaNNer determines the performance-optimal configuration by considering the data movement, tiling, and work distribution. The generated execution plan is transformed to code, allowing for a fast development cycle with different CNNs and hardware accelerators. Evaluated with five neural networks under varying memory configurations and compared to previous works on the Nvidia Jetson TX2, e-PlaNNer achieves a 6x speedup and a 21.14% reduction of off-chip memory access volume on average. In addition, e-PlaNNer shows meaningful performance compared to well-known deep learning frameworks in terms of end-to-end execution.
    Contents: 1 Introduction; 2 Related Work; 3 Background (Convolutional Neural Networks, DNN Accelerator, Roofline Model); 4 Graph Level Processing (Graph Construction, Schedule Caching); 5 Convolutional Layer Analysis (Loop Structure, Loop Tiling, Dataflow); 6 Execution Planning (Architecture Configurations, Modeling Off-Chip Memory Accesses, Modeling Performance, Search Space Exploration); 7 Code Generation (Intermediate Representation, Target Code Generation); 8 Evaluation (Experimental Setup, Performance Results, Comparison of Off-chip Memory Access, Framework Results); 9 Discussion; 10 Conclusion; Bibliography.