Search CORE

2,191 research outputs found

AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

Author: Cong Jason
Wei Peng
Yu Cody Hao
Zhang Peng
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 29/07/2018
Field of study

CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs whose programming is conventionally a RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve the FPGA programmability, it still leaves programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels

arXiv.org e-Print Archive

Crossref

Scipedia

Coarse-grained reconfigurable array architectures

Author: A Lambrechts
B Bougard
B Bougard
B Mei
B Mei
B Mei
B Sutter De
G Venkataramani
H Park
H Park
J Lee
JMP Cardoso
JW Waerdt van de
K Berkel van
K Bondalapati
K Sankaralingam
KE Coons
LH Lee
M Ahn
M Gebhart
M Schlansker
M Taylor
M Woh
MD Galanis
MH Lee
S Friedman
SA Mahlke
T Oh
Y Kim
Y Kim
Y Kim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Coarse-Grained Reconﬁgurable Array (CGRA) architectures accelerate the same inner loops that beneﬁt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efﬁciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ﬂexibility, performance, and power-efﬁciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ﬁne-tuning of source code

Crossref

Ghent University Academic Bibliography

Implementation of Genetic Algorithms in FPGA-based Reconfigurable Computing Systems

Author: Alam Nahid
Publication venue: Clemson University Libraries
Publication date: 01/08/2009
Field of study

Genetic Algorithms (GAs) are used to solve many optimization problems in science and engineering. GA is a heuristics approach which relies largely on random numbers to determine the approximate solution of an optimization problem. We use the Mersenne Twister Algorithm (MTA) to generate a non-overlapping sequence of random numbers with a period of 219937-1. The random numbers are generated from a state vector that consists of 624 elements. Our work on state vector generation and the GA implementation targets the solution of a flow-line scheduling problem where the flow-lines have jobs to process and the goal is to find a suitable completion time for all jobs using a GA. The state vector generation algorithm (MTA) performs poorly in traditional von Neumann architectures due to its poor temporal and spatial locality. Therefore its performance is limited by the speed at which we can access memory. With an approximate increase of processor performance by 60% per year and a drop of memory latency only 7% per year, a new approach is needed for performance improvement. On the other hand, the GA implementation in a general-purpose microprocessor, though performs reasonably well, has scope for performance gain in a parallel implementation. The parallel implementation of the GA can work as a kernel for applications that uses a GA to reach a solution. Our approach is to implement the state vector generation process and the GA in an FPGA-based Reconfigurable Computing (RC) system with the goal of improving the overall performance. Application design for FPGA-based RC systems is not trivial and the performance improvement is not guaranteed. Designing for RC systems requires algorithmic parallelism in order to exploit the inherent parallelism of the FPGA. We are using a high-level language that provides a level of abstraction from the lower-level hardware in the RC system making it difficult to fully exploit some of the architectural benefits of the FPGA. Considering these factors, we improve the state vector generation process algorithmically. Our implementation generates state vectors 5X faster than the previous implementation in an Intel Xeon microprocessor of 2GHz. The modified algorithm is also implemented in a Xilinx Virtex-4 FPGA that results in a 2.4X speedup. Improvement in this preprocessing step accelerates GA application performance as random numbers are generated from these state vectors for the genetic operators. We simulate the basic operations of a GA in an FPGA to study its behavior in a parallel environment and analyze the results. The initial FPGA implementation of the GA runs about 7X slower than its microprocessor counterpart. The reasons are explained along with suggestions for improvement and future work

Clemson University: TigerPrints

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

Author: Bouganis Christos-Savvas
Kouris Alexandros
Venieris Stylianos I.
Publication venue
Publication date: 19/02/2018
Field of study

In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

arXiv.org e-Print Archive

Spiral - Imperial College Digital Repository