NEURAghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on Zynq SoCs
Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state-of-the-art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
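As a rough consistency check on the figures in this abstract, the reported frame rate and the quoted peak throughput can be related through the per-frame workload of VGG-16. The ~30.7 GOps-per-frame estimate below is our own assumption (counting a multiply-accumulate as two operations), not a number from the paper:

```python
# Hypothetical back-of-the-envelope check, not data from the paper:
# VGG-16 inference costs roughly 30.7 GOps per frame when a
# multiply-accumulate counts as two operations.
GOPS_PER_VGG16_FRAME = 30.7  # assumed workload size (GOps/frame)
fps = 5.5                    # end-to-end frame rate reported in the abstract

sustained_gops = GOPS_PER_VGG16_FRAME * fps
print(f"implied sustained throughput ~= {sustained_gops:.0f} GOps/s")
```

Under this assumption the end-to-end frame rate implies a sustained throughput close to the quoted 169 GOps/s peak, i.e. the convolution engine is kept busy most of the time.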
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.

Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
Hardware compilation of deep neural networks: an overview
Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research
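A concrete instance of the platform-independent optimisation stage mentioned above is folding a BatchNorm layer into the weights and bias of the preceding linear or convolution layer, so the back end only has to emit one fused operator. The sketch below is a minimal, hypothetical helper (not taken from any particular toolflow), written for a per-output-channel linear layer:

```python
import math

def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a per-channel BatchNorm into the preceding layer's parameters.

    weights[c] holds the weights producing output channel c.
    Returns (folded_weights, folded_bias) such that applying the folded
    layer alone equals applying the original layer followed by BatchNorm.
    """
    folded_w, folded_b = [], []
    for c, w_c in enumerate(weights):
        # BatchNorm: y = gamma * (x - mean) / sqrt(var + eps) + beta
        scale = gamma[c] / math.sqrt(var[c] + eps)
        folded_w.append([w * scale for w in w_c])
        folded_b.append((bias[c] - mean[c]) * scale + beta[c])
    return folded_w, folded_b
```

Because the fold is exact (pure algebra on the affine parameters), it is safe to apply before any platform-specific lowering, which is why such rewrites sit naturally in the middle, platform-independent stage of the flow.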
Low power and high performance heterogeneous computing on FPGAs
The abstract is in the attachment.
Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey
In the modern-day era of technology, a paradigm shift has been witnessed in the areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as computer vision, image and video processing, robotics, etc. In the context of developed digital technologies and the availability of authentic data and data handling infrastructure, DNNs have been a credible choice for solving more complex real-life problems. The performance and accuracy of a DNN can exceed human intelligence in certain situations. However, it is noteworthy that a DNN is computationally cumbersome in terms of the resources and time needed to handle these computations. Furthermore, general-purpose architectures like CPUs have issues in handling such computationally intensive algorithms. Therefore, a lot of interest and effort has been invested by the research fraternity in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of computationally intensive algorithms. This paper brings forward the various research works carried out on the development and deployment of DNNs using the aforementioned specialized hardware architectures and embedded AI accelerators. The review gives a detailed description of the specialized hardware-based accelerators used in the training and/or inference of DNNs. A comparative study based on factors like power, area, and throughput is also made on the various accelerators discussed. Finally, future research and development directions are discussed, such as future trends in DNN implementation on specialized hardware accelerators.
This review article is intended to serve as a guide to hardware architectures for accelerating deep learning and improving the effectiveness of deep learning research.
Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications
With the advent of dedicated Deep Learning (DL) accelerators and neuromorphic
processors, new opportunities are emerging for applying deep and Spiking Neural
Network (SNN) algorithms to healthcare and biomedical applications at the edge.
This can facilitate the advancement of the medical Internet of Things (IoT)
systems and Point of Care (PoC) devices. In this paper, we provide a tutorial
describing how various technologies ranging from emerging memristive devices,
to established Field Programmable Gate Arrays (FPGAs), and mature Complementary
Metal Oxide Semiconductor (CMOS) technology can be used to develop efficient DL
accelerators to solve a wide variety of diagnostic, pattern recognition, and
signal processing problems in healthcare. Furthermore, we explore how spiking
neuromorphic processors can complement their DL counterparts for processing
biomedical signals. After providing the required background, we unify the
sparsely distributed research on neural network and neuromorphic hardware
implementations as applied to the healthcare domain. In addition, we benchmark
various hardware platforms by performing a biomedical electromyography (EMG)
signal processing task and drawing comparisons among them in terms of inference
delay and energy. Finally, we provide our analysis of the field and share a
perspective on the advantages, disadvantages, challenges, and opportunities
that different accelerators and neuromorphic processors introduce to healthcare
and biomedical domains. This paper can serve a large audience, ranging from nanoelectronics researchers to biomedical and healthcare practitioners, in grasping the fundamental interplay between hardware, algorithms, and the clinical adoption of these tools, as we shed light on the future of deep networks and spiking neuromorphic processing systems as drivers of biomedical circuits and systems.

Comment: Submitted to IEEE Transactions on Biomedical Circuits and Systems (21 pages, 10 figures, 5 tables)
Automatic Generation of Execution Plans for the Efficient Execution of Convolutional Neural Networks
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 8. Bernhard Egger.
Over the past years, a large number of architectures and accelerators for Deep Neural Networks (DNNs) have been proposed. While exhibiting common features, the number and arrangement of processing elements, the sizes and types of on-chip memory, and the possibilities of parallel execution vary significantly, especially in the embedded system domain. The number of off-chip memory accesses and the performance of a DNN on a given accelerator depend not only on the supported computational patterns and the available on-chip memory but also on the sizes and shapes of each layer. Finding a computational pattern that minimizes off-chip memory accesses while maximizing performance is thus a tedious and error-prone task. This thesis presents e-PlaNNer, a compiler framework that generates an optimized execution plan for a given embedded accelerator and Convolutional Neural Network (CNN). For each layer, e-PlaNNer determines the performance-optimal configuration by considering data movement, tiling, and work distribution. The generated execution plan is transformed into code, allowing for a fast development cycle with different CNNs and hardware accelerators. Evaluated with five neural networks under varying memory configurations and compared to previous work on the Nvidia Jetson TX2, e-PlaNNer achieves a 6x speedup and a 21.14% reduction of off-chip memory access volume on average. In addition, e-PlaNNer shows meaningful performance compared to well-known deep learning frameworks in terms of end-to-end execution.
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Background
3.1 Convolutional Neural Networks
3.2 DNN Accelerator
3.3 Roofline Model
Chapter 4 Graph Level Processing
4.1 Graph Construction
4.2 Schedule Caching
Chapter 5 Convolutional Layer Analysis
5.1 Loop Structure
5.2 Loop Tiling
5.3 Dataflow
Chapter 6 Execution Planning
6.1 Architecture Configurations
6.2 Modeling Off-Chip Memory Accesses
6.3 Modeling Performance
6.4 Search Space Exploration
Chapter 7 Code Generation
7.1 Intermediate Representation
7.2 Target Code Generation
Chapter 8 Evaluation
8.1 Experimental Setup
8.2 Performance Results
8.3 Comparison of Off-chip Memory Access
8.4 Framework Results
Chapter 9 Discussion
Chapter 10 Conclusion
Bibliography
Abstract (in Korean)
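The thesis outline above lists both a roofline model (3.3) and models of off-chip memory accesses and performance (6.2, 6.3). The interaction between the two can be sketched with the standard roofline formula; the function and all numbers below are illustrative assumptions, not figures from the thesis:

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable throughput under the roofline model.

    arithmetic_intensity: operations performed per byte moved to or from
    off-chip memory. A layer is memory-bound when bandwidth * intensity
    falls below the compute peak, and compute-bound otherwise.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

# Two hypothetical layers on a device with a 1000 GFLOP/s peak and
# 50 GB/s of off-chip bandwidth:
low  = roofline_gflops(1000.0, 50.0, 4.0)   # memory-bound layer
high = roofline_gflops(1000.0, 50.0, 64.0)  # compute-bound layer
print(low, high)  # 200.0 1000.0
```

This is why an execution planner that reduces off-chip traffic (raising arithmetic intensity through tiling and data reuse) can directly raise attainable performance for memory-bound layers.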