
    NEURAghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on Zynq SoCs

    Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of the Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
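
    The cooperative execution model described above can be pictured as a simple per-layer dispatch loop. The sketch below is purely illustrative and assumes hypothetical helpers (run_on_accelerator, run_on_cpu) rather than NEURAghe's actual NeuDNN interface: convolutional layers are offloaded to the FPGA-side accelerator, while the remaining layers stay on the ARM cores.

    # Minimal sketch of cooperative heterogeneous dispatch; all names and the
    # stand-in computations are assumptions, not the NeuDNN API.
    import numpy as np

    def run_on_accelerator(layer, x):
        # Placeholder for an offload call into the Convolution-Specific Processor.
        return np.maximum(x, 0.0)  # stand-in computation

    def run_on_cpu(layer, x):
        # Placeholder for a NEON-optimised routine running on the ARM cores.
        return x / (1.0 + np.exp(-x))  # stand-in computation

    ACCEL_TYPES = {"conv"}  # layer types the accelerator handles in this sketch

    def run_network(layers, x):
        for layer in layers:
            kernel = run_on_accelerator if layer["type"] in ACCEL_TYPES else run_on_cpu
            x = kernel(layer, x)
        return x

    print(run_network([{"type": "conv"}, {"type": "softmax"}], np.random.rand(4)))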

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
    Comment: Accepted for publication in ACM Computing Surveys (CSUR).

    Hardware compilation of deep neural networks: an overview

    Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research.
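
    To make the front end / platform-independent optimisation / back end structure concrete, here is a minimal illustrative skeleton; the class names, passes, and template strings are assumptions made for the sake of the example and do not come from any particular toolflow in the overview.

    # Illustrative three-stage compilation skeleton (not from the surveyed work).
    from dataclasses import dataclass, field

    @dataclass
    class Graph:
        layers: list = field(default_factory=list)

    def front_end(model_description):
        # Parse a framework-level model description into an intermediate graph.
        return Graph(layers=list(model_description))

    def optimise(graph):
        # Platform-independent passes, e.g. dropping no-op layers.
        graph.layers = [l for l in graph.layers if l != "identity"]
        return graph

    def back_end(graph):
        # Map each remaining layer onto a (hypothetical) hardware template.
        return [f"instantiate_template({l})" for l in graph.layers]

    print(back_end(optimise(front_end(["conv", "identity", "fc"]))))

    In an actual toolflow the back end would emit accelerator configurations or HDL rather than strings, but the three-stage separation is the same.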

    Low power and high performance heterogeneous computing on FPGAs

    The abstract is provided in the attachment.

    Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

    In the modern era of technology, a paradigm shift has been witnessed in the areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as computer vision, image and video processing, and robotics. In the context of developed digital technologies and the availability of authentic data and data-handling infrastructure, DNNs have become a credible choice for solving more complex real-life problems. In certain situations, the performance and accuracy of a DNN surpass those of human intelligence. However, DNNs are computationally very demanding in terms of the resources and time required to handle these computations. Furthermore, general-purpose architectures like CPUs have difficulty handling such computationally intensive algorithms. Therefore, considerable interest and effort have been invested by the research community in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of computationally intensive algorithms. This paper brings forward the various research works carried out on the development and deployment of DNNs using the aforementioned specialized hardware architectures and embedded AI accelerators. The review provides a detailed description of the specialized hardware-based accelerators used in the training and/or inference of DNNs. A comparative study of the discussed accelerators, based on factors such as power, area, and throughput, is also presented. Finally, future research and development directions are discussed, such as future trends in DNN implementation on specialized hardware accelerators. This review article is intended to serve as a guide to hardware architectures for accelerating and improving the effectiveness of deep learning research.

    Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications

    With the advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors, new opportunities are emerging for applying deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can facilitate the advancement of medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing how various technologies, ranging from emerging memristive devices, to established Field Programmable Gate Arrays (FPGAs), and mature Complementary Metal Oxide Semiconductor (CMOS) technology, can be used to develop efficient DL accelerators to solve a wide variety of diagnostic, pattern recognition, and signal processing problems in healthcare. Furthermore, we explore how spiking neuromorphic processors can complement their DL counterparts for processing biomedical signals. After providing the required background, we unify the sparsely distributed research on neural network and neuromorphic hardware implementations as applied to the healthcare domain. In addition, we benchmark various hardware platforms by performing a biomedical electromyography (EMG) signal processing task and drawing comparisons among them in terms of inference delay and energy. Finally, we provide our analysis of the field and share a perspective on the advantages, disadvantages, challenges, and opportunities that different accelerators and neuromorphic processors introduce to the healthcare and biomedical domains. This paper can serve a large audience, ranging from nanoelectronics researchers to biomedical and healthcare practitioners, in grasping the fundamental interplay between hardware, algorithms, and the clinical adoption of these tools, as we shed light on the future of deep networks and spiking neuromorphic processing systems as proponents for driving biomedical circuits and systems forward.
    Comment: Submitted to IEEE Transactions on Biomedical Circuits and Systems (21 pages, 10 figures, 5 tables).

    Automatic Generation of Execution Plans for the Efficient Execution of Convolutional Neural Networks

    Master's thesis -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, August 2020. Bernhard Egger.
    Over the past years, a large number of architectures and accelerators for Deep Neural Networks (DNNs) have been proposed. While exhibiting common features, the number and arrangement of processing elements, the sizes and types of on-chip memory, and the possibilities of parallel execution vary significantly, especially in the embedded system domain. The number of off-chip memory accesses and the performance of a DNN on a given accelerator depend not only on the supported computational patterns and the available on-chip memory but also on the sizes and shapes of each layer. Finding a computational pattern that minimizes off-chip memory accesses while maximizing performance is thus a tedious and error-prone task. This thesis presents e-PlaNNer, a compiler framework that generates an optimized execution plan for a given embedded accelerator and Convolutional Neural Network (CNN). For each layer, e-PlaNNer determines the performance-optimal configuration by considering the data movement, tiling, and work distribution. The generated execution plan is transformed to code, allowing for a fast development cycle with different CNNs and hardware accelerators. Evaluated with five neural networks under varying memory configurations and compared to previous works on the Nvidia Jetson TX2, e-PlaNNer achieves a 6x speedup and a 21.14% reduction of off-chip memory access volume on average. In addition, e-PlaNNer shows meaningful performance compared to well-known deep learning frameworks in terms of end-to-end execution.
    Contents: 1 Introduction; 2 Related Work; 3 Background (Convolutional Neural Networks, DNN Accelerator, Roofline Model); 4 Graph Level Processing (Graph Construction, Schedule Caching); 5 Convolutional Layer Analysis (Loop Structure, Loop Tiling, Dataflow); 6 Execution Planning (Architecture Configurations, Modeling Off-Chip Memory Accesses, Modeling Performance, Search Space Exploration); 7 Code Generation (Intermediate Representation, Target Code Generation); 8 Evaluation (Experimental Setup, Performance Results, Comparison of Off-chip Memory Access, Framework Results); 9 Discussion; 10 Conclusion; Bibliography.