For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains. However, this performance has come at the expense of programmability. In particular, FPGA developers use hardware description languages (HDLs) to implement an FPGA design. Programming with HDLs requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now FPGAs. However, this improved programmability can come at the expense of performance. To improve the performance of OpenCL kernels on FPGAs, and thus bridge the performance-programmability gap, we apply and evaluate optimization techniques on GEM, an N-body method from the OpenDwarfs benchmark suite (https://github.com/vtsynergy/OpenDwarfs), as shown in Table I, where IMP1 (short for "Implementation 1") serves as the reference implementation.
Use of restrict/const keywords and kernel vectorization: The restrict keyword, which tells the compiler that a pointer is not aliased, enables better-performing designs by eliminating unnecessary assumed memory dependencies. With respect to resources, IMP2 (no restrict) exhibits 1.31 times lower resource utilization than IMP4, as shown in Figure 1. With respect to performance, IMP4 (with restrict) runs 3.94 times faster than IMP2: the majority of memory accesses in IMP4 are cache hits, whereas the sub-optimal memory accesses in IMP2 result in cache misses and pipeline stalls. With the const keyword, we observe virtually no difference in resource utilization or execution time between IMP1 and IMP2. Kernel vectorization (SIMD), which is enabled with the appropriate OpenCL attribute, can yield easy performance gains of 5.85x (IMP4, IMP8) at the cost of increased (doubled) resource utilization.
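To make the role of the two qualifiers concrete, the following is a minimal sketch in plain C, where restrict and const have the same meaning as on OpenCL C kernel pointer arguments. The function name, the argument names, and the softened inverse-square term are all hypothetical and chosen only for illustration; this is not the GEM kernel.

```c
#include <stddef.h>

/* Hypothetical N-body-style accumulation (NOT the GEM kernel).
 * `restrict` promises that the four buffers never overlap, so a store to
 * out[v] cannot change any later load from the inputs; without it, the
 * compiler must assume such a dependency may exist.
 * `const` marks the inputs as read-only. */
void accumulate_potential(size_t n_atoms, size_t n_vertices,
                          const float *restrict atom_x,
                          const float *restrict atom_q,
                          const float *restrict vert_x,
                          float *restrict out)
{
    for (size_t v = 0; v < n_vertices; ++v) {
        float sum = 0.0f;
        for (size_t a = 0; a < n_atoms; ++a) {
            float d = vert_x[v] - atom_x[a];
            /* The +1.0f softening is an illustrative assumption that
             * keeps the denominator non-zero. */
            sum += atom_q[a] / (d * d + 1.0f);
        }
        out[v] = sum;
    }
}
```

Dropping restrict from the signature does not change the result, only what the compiler is allowed to assume about overlapping buffers. That is the kind of assumed memory dependency discussed above.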
Compiler resource-driven optimizations: When compiling with resource-driven optimization, the compiler applies a set of heuristics and selects the attributes for the optimizations to perform (e.g., loop unroll factor, kernel vectorization width, compute unit (CU) replication factor). This process does not always provide the best implementation. In GEM, we identify at least one case where the manual choice of kernel
Algorithmic refactoring: A different algorithmic implementation that solves the same problem but is tailored to the FPGA can prove very beneficial. We apply basic algorithmic refactoring in GEM by removing the complex conditional statements that handle different cases within a single kernel and altering the kernel to correspond to the case at hand. This provides a two-fold benefit, as shown going from IMP2 to IMP3: (a) better resource utilization (i.e., 10% fewer FPGA resources), and (b) better performance (i.e., 5% faster). More importantly, better resource utilization may allow wider SIMD or more compute units to fit on a given board. For example, in refactoring from IMP6 to IMP7, the reduced resource utilization of the refactored algorithm allows a SIMD length of 16, whereas the original algorithm accommodated only up to 8. As a result, the refactored algorithm executes 1.22 times faster than the original.
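The refactoring described above can be sketched in plain C as follows. The case names, the scale parameter, and the arithmetic are hypothetical placeholders, since the actual GEM cases are not listed here. The sketch only shows the shape of the change: a case-selecting conditional inside the loop body is replaced by one routine per case.

```c
#include <stddef.h>

/* Hypothetical case selector; the real GEM cases are not given in the text. */
enum case_kind { CASE_PLAIN = 0, CASE_SCALED = 1 };

/* Before: one routine covers every case, so the case-selection logic sits
 * inside the loop body and must be implemented for every iteration. */
void compute_generic(enum case_kind kind, size_t n,
                     const float *restrict in, float scale,
                     float *restrict out)
{
    for (size_t i = 0; i < n; ++i) {
        if (kind == CASE_PLAIN) {
            out[i] = in[i] * in[i];
        } else {
            out[i] = scale * in[i] * in[i];
        }
    }
}

/* After: one specialized routine per case, each with a branch-free body.
 * The caller picks the routine that matches the case at hand. */
void compute_plain(size_t n, const float *restrict in, float *restrict out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}

void compute_scaled(size_t n, const float *restrict in, float scale,
                    float *restrict out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = scale * in[i] * in[i];
    }
}
```

Each specialized routine produces the same values as the generic one for its case. On an FPGA, the expected benefit is that logic for the unused case no longer needs to be synthesized, which is consistent with the resource savings reported above.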
