927 research outputs found
Coarse-grained reconfigurable array architectures
Coarse-Grained ReconïŹgurable Array (CGRA) architectures accelerate the same inner loops that beneïŹt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efïŹciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ïŹexibility, performance, and power-efïŹciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ïŹne-tuning of source code
High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP
This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBMâs Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs
The use of field-programmable gate arrays for the hardware acceleration of design automation tasks
This paper investigates the possibility of using Field-Programmable Gate Arrays (FrâGAS) as
reconfigurable co-processors for workstations to produce moderate speedups for most tasks
in the design process, resulting in a worthwhile overall design process speedup at low cost
and allowing algorithm upgrades with no hardware modification. The use of FPGAS as hardware
accelerators is reviewed and then achievable speedups are predicted for logic simulation
and VLSI design rule checking tasks for various FPGA co-processor arrangements
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks
Fully realizing the potential of acceleration for Deep Neural Networks (DNNs)
requires understanding and leveraging algorithmic properties. This paper builds
upon the algorithmic insight that bitwidth of operations in DNNs can be reduced
without compromising their classification accuracy. However, to prevent
accuracy loss, the bitwidth varies significantly across DNNs and it may even be
adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer
limited benefits to accommodate the worst-case bitwidth requirements, or lead
to a degradation in final accuracy. To alleviate these deficiencies, this work
introduces dynamic bit-level fusion/decomposition as a new dimension in the
design of DNN accelerators. We explore this dimension by designing Bit Fusion,
a bit-flexible accelerator, that constitutes an array of bit-level processing
elements that dynamically fuse to match the bitwidth of individual DNN layers.
This flexibility in the architecture enables minimizing the computation and the
communication at the finest granularity possible with no loss in accuracy. We
evaluate the benefits of BitFusion using eight real-world feed-forward and
recurrent DNNs. The proposed microarchitecture is implemented in Verilog and
synthesized in 45 nm technology. Using the synthesis results and cycle accurate
simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN
accelerators, Eyeriss and Stripes. In the same area, frequency, and process
technology, BitFusion offers 3.9x speedup and 5.1x energy savings over Eyeriss.
Compared to Stripes, BitFusion provides 2.6x speedup and 3.9x energy reduction
at 45 nm node when BitFusion area and frequency are set to those of Stripes.
Scaling to GPU technology node of 16 nm, BitFusion almost matches the
performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while
BitFusion merely consumes 895 milliwatts of power
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
- âŠ