HALF: Holistic Auto Machine Learning for FPGAs
Deep Neural Networks (DNNs) are capable of solving complex problems in
domains related to embedded systems, such as image and natural language
processing. To efficiently implement DNNs on a specific FPGA platform for a
given cost criterion, e.g. energy efficiency, an enormous number of design
parameters has to be considered, from the topology down to the final hardware
implementation. Interdependencies between the different design layers have to
be taken into account and explored efficiently, making it practically
impossible to find optimized solutions manually. An automatic, holistic design approach can
improve the quality of DNN implementations on FPGA significantly. To this end,
we present a cross-layer design space exploration methodology. It comprises
optimizations starting from a hardware-aware topology search for DNNs down to
the final optimized implementation for a given FPGA platform. The methodology
is implemented in our Holistic Auto Machine Learning for FPGAs (HALF)
framework, which combines an evolutionary search algorithm, various
optimization steps and a library of parametrizable hardware DNN modules. HALF
automates both the exploration process and the implementation of optimized
solutions on a target FPGA platform for various applications. We demonstrate
the performance of HALF on a medical use case for arrhythmia detection under
three different design goals: low energy, low power, and high throughput.
Our FPGA implementation outperforms a TensorRT-optimized model on
an Nvidia Jetson platform in both throughput and energy consumption.
Comment: 2021 31st International Conference on Field-Programmable Logic and
Applications (FPL). IEEE, 2021
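The cross-layer exploration described above can be illustrated with a minimal sketch. The encoding, fitness function, and cost model below are invented for illustration and are not HALF's actual components; the real framework evaluates candidate topologies against an FPGA hardware library rather than the toy analytical estimates used here.

```python
import random

random.seed(0)

LAYER_CHOICES = [8, 16, 32, 64]  # hypothetical candidate widths per layer

def estimate_accuracy(topology):
    # Stand-in for training/evaluating the DNN: wider nets score higher.
    return 1.0 - 1.0 / (1 + sum(topology))

def estimate_energy(topology):
    # Stand-in for an FPGA cost model: energy grows with MAC count.
    macs = sum(a * b for a, b in zip(topology, topology[1:]))
    return macs / 1000.0

def fitness(topology, energy_budget=2.0):
    # Hardware-aware objective: accuracy, penalized when over the energy budget.
    penalty = max(0.0, estimate_energy(topology) - energy_budget)
    return estimate_accuracy(topology) - penalty

def mutate(topology):
    # Change one layer's width at random.
    child = list(topology)
    child[random.randrange(len(child))] = random.choice(LAYER_CHOICES)
    return child

def evolve(generations=50, pop_size=20, depth=3):
    pop = [[random.choice(LAYER_CHOICES) for _ in range(depth)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]  # elitist selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=fitness)

best = evolve()
print(best, round(fitness(best), 3))
```

The key point the sketch captures is that the search objective mixes a model-quality term with a hardware cost term, so the evolutionary loop is steered toward topologies that are efficient on the target platform, not merely accurate.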
Extending High-Level Synthesis for Task-Parallel Programs
C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular
for field-programmable gate array (FPGA) accelerators in many application
domains in recent years, thanks to its competitive quality of result (QoR) and
short development cycle compared with the traditional register-transfer level
(RTL) design approach. Yet, limited by the sequential C semantics, it remains
challenging to adopt the same highly productive high-level programming approach
in many other application domains, where coarse-grained tasks run in parallel
and communicate with each other at a fine-grained level. While current HLS
tools support task-parallel programs, the productivity is greatly limited in
the code development, correctness verification, and QoR tuning cycles, due to
the poor programmability, restricted software simulation, and slow code
generation, respectively. Such limited productivity often defeats the purpose
of HLS and hinders programmers from adopting it for task-parallel FPGA
accelerators. In this paper, we extend the HLS C++ language and present a fully
automated framework with programmer-friendly interfaces, universal software
simulation, and fast code generation to overcome these limitations.
Experimental results based on a wide range of real-world task-parallel programs
show that, on average, the lines of kernel and host code are reduced by 22% and
51%, respectively, which considerably improves the programmability. The
correctness verification and the iterative QoR tuning cycles are both greatly
accelerated, by 3.2x and 6.8x respectively.
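The programming model targeted here, coarse-grained tasks running concurrently and communicating token-by-token over FIFO streams, can be sketched in Python with threads and bounded queues. The task names and three-stage pipeline are illustrative assumptions, not the paper's actual C++ API, and threads stand in for hardware tasks that would run as concurrent FPGA modules.

```python
import threading
import queue

def producer(out_stream, n):
    for i in range(n):
        out_stream.put(i)      # fine-grained, per-token communication
    out_stream.put(None)       # end-of-stream token

def doubler(in_stream, out_stream):
    # A coarse-grained task: consume a stream, transform it, forward it.
    while (tok := in_stream.get()) is not None:
        out_stream.put(tok * 2)
    out_stream.put(None)

def consumer(in_stream, results):
    while (tok := in_stream.get()) is not None:
        results.append(tok)

def run_pipeline(n=8):
    # Shallow FIFOs force the tasks to overlap rather than run sequentially.
    a, b = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    results = []
    tasks = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=doubler, args=(a, b)),
        threading.Thread(target=consumer, args=(b, results)),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results

print(run_pipeline())
```

Note how this pattern resists sequential C semantics: all three tasks are live at once and progress is driven by data availability on the streams, which is precisely what makes software simulation and code generation for such programs harder for conventional HLS flows.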
AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators
Adopting FPGA as an accelerator in datacenters is becoming mainstream for
customized computing, but the fact that FPGAs are hard to program creates a
steep learning curve for software programmers. Even with the help of high-level
synthesis (HLS), accelerator designers still have to manually perform code
reconstruction and cumbersome parameter tuning to achieve the optimal
performance. While many learning models have been leveraged by existing work to
automate the design of efficient accelerators, the unpredictability of modern
HLS tools becomes a major obstacle for them to maintain high accuracy. In this
paper, we address this problem by introducing an automated DSE
framework, AutoDSE, that leverages a bottleneck-guided gradient optimizer to
systematically find a better design point. AutoDSE finds the bottleneck of the
design in each step and focuses on high-impact parameters to overcome it,
which is similar to the approach an expert would take. The experimental results
show that AutoDSE is able to find the design point that achieves, on the
geometric mean, 19.9x speedup over one CPU core for MachSuite and Rodinia
benchmarks and 1.04x over the manually designed HLS accelerated vision kernels
in Xilinx Vitis libraries, yet with a 26x reduction in optimization pragmas.
With less than one optimization pragma per design on average, we are making
progress towards democratizing customizable computing by enabling software
programmers to design efficient FPGA accelerators.
Comment: 11 pages
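The bottleneck-guided idea described above can be sketched as a greedy loop: estimate per-stage cost, find the slowest stage, and tune only the knob with the highest impact on it. The three pragma-like knobs and the analytical latency model below are invented for illustration; the real AutoDSE evaluates candidate pragma settings by invoking an actual HLS tool.

```python
# Hypothetical design space: three pragma-like knobs.
KNOB_OPTIONS = {"unroll": [1, 2, 4, 8], "pipeline": [0, 1], "partition": [1, 2, 4]}

# Which knob most directly affects each stage (an assumed mapping).
BOTTLENECK_KNOB = {"compute": "unroll", "loop": "pipeline", "memory": "partition"}

def stage_latencies(cfg):
    # Toy analytical cost model: each stage's latency depends on one knob.
    return {
        "compute": 1000 // cfg["unroll"],
        "loop":    400 if cfg["pipeline"] else 1200,
        "memory":  800 // cfg["partition"],
    }

def latency(cfg):
    return max(stage_latencies(cfg).values())  # the slowest stage dominates

def autodse_like(cfg, steps=10):
    for _ in range(steps):
        stages = stage_latencies(cfg)
        bottleneck = max(stages, key=stages.get)   # locate the bottleneck
        knob = BOTTLENECK_KNOB[bottleneck]         # its high-impact parameter
        # Try every option for that knob; keep the best setting.
        best = min(KNOB_OPTIONS[knob],
                   key=lambda v: latency({**cfg, knob: v}))
        if best == cfg[knob]:
            break                                  # no further improvement
        cfg = {**cfg, knob: best}
    return cfg

cfg = autodse_like({"unroll": 1, "pipeline": 0, "partition": 1})
print(cfg, latency(cfg))
```

Because only the bottleneck's knob is explored at each step, the search evaluates far fewer points than an exhaustive sweep of all knob combinations, which mirrors how an expert incrementally removes the dominant bottleneck.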