The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
Supported by their high power efficiency and recent advancements in High
Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud
systems. Large amounts of work have been done so far on loop and area
optimizations for different applications on FPGAs using HLS. However, a
comprehensive analysis of the behavior and efficiency of the memory controller
of FPGAs is missing in the literature, which becomes even more crucial when the
limited memory bandwidth of modern FPGAs compared to their GPU counterparts is
taken into account. In this work, we will analyze the memory interface
generated by Intel FPGA SDK for OpenCL with different configurations for
input/output arrays, vector size, interleaving, kernel programming model,
on-chip channels, operating frequency, padding, and multiple types of
overlapped blocking. Our results point to multiple shortcomings in the memory
controller of Intel FPGAs, especially with respect to memory access alignment,
that can hinder the programmer's ability to maximize memory performance in
their design. For some of these cases, we will provide work-arounds to improve
memory bandwidth efficiency; however, a general solution will require major
changes in the memory controller itself.
Comment: Published at H2RC'19: Fifth International Workshop on Heterogeneous
High-performance Reconfigurable Computing held in conjunction with SC'1
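The abstract above lists padding among the tested configurations and names memory access alignment as a weak point. As a generic illustration of how the two relate (not taken from the paper; the function names and the 16-element vector width are hypothetical), the sketch below counts how many rows of a row-major 2D array start at an offset that is not a multiple of the vector width, before and after padding the row width.

```python
def padded_width(width, vector_size):
    """Smallest multiple of vector_size that is >= width (assumed padding rule)."""
    return -(-width // vector_size) * vector_size


def misaligned_rows(rows, width, vector_size):
    """Count rows whose first element is not at a vector-aligned offset."""
    return sum(1 for r in range(rows) if (r * width) % vector_size != 0)


if __name__ == "__main__":
    rows, width, vec = 4, 1000, 16  # hypothetical sizes, in elements
    print("unpadded:", misaligned_rows(rows, width, vec))
    print("padded:  ", misaligned_rows(rows, padded_width(width, vec), vec))
```

Padding trades a small amount of extra memory for row starts that all fall on vector boundaries; whether that helps on a given board depends on the memory controller behavior the paper measures.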
FPGA acceleration of structured-mesh-based explicit and implicit numerical solvers using SYCL
We explore the design and development of structured-mesh-based solvers on current Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) multidimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a wide range of real-world applications ranging from CFD to financial computing. A general, unified workflow is formulated for synthesizing them on Intel FPGAs, together with predictive analytic models to explore the design space and obtain near-optimal performance. The performance of designs synthesized using these techniques is benchmarked for two non-trivial applications on an Intel PAC D5005 FPGA card. Results are compared to the performance of optimized parallel implementations of the same applications on an Nvidia V100 GPU. Observed runtime results indicate that the FPGA provides better or matching performance relative to the V100 GPU. More importantly, the FPGA solutions consume 59%-76% less energy for their largest configurations, making them highly attractive for solving workloads based on these applications in production settings. The performance model predicts the runtime of designs with high accuracy, with less than 5% error for all cases tested, demonstrating its significant utility for design space exploration. With these tools and techniques, we discuss determinants for a given structured-mesh code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design, how it can be codified using SYCL, and the resulting performance.
AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Stencil computation is one of the most widely-used compute patterns in high
performance computing applications. Spatial and temporal blocking have been
proposed to overcome the memory-bound nature of this type of computation by
moving memory pressure from external memory to on-chip memory on GPUs. However,
correctly implementing those optimizations while considering the complexity of
the architecture and memory hierarchy of GPUs to achieve high performance is
difficult. We propose AN5D, an automated stencil framework which is capable of
automatically transforming and optimizing stencil patterns in a given C source
code, and generating corresponding CUDA code. Parameter tuning in our framework
is guided by our performance model. Our novel optimization strategy reduces
shared memory and register pressure in comparison to existing implementations,
allowing performance scaling up to a temporal blocking degree of 10. We achieve
the highest performance reported so far for all evaluated stencil benchmarks on
the state-of-the-art Tesla V100 GPU.
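To make the idea of temporal blocking concrete, here is a minimal sketch of overlapped temporal blocking for a 1D 3-point averaging stencil. It is a generic textbook-style illustration, not AN5D's optimization strategy or its generated CUDA code; the stencil, the function names, and the fixed-boundary convention are all assumptions. Each tile loads a halo of `steps` cells per side, advances `steps` time steps locally, and writes back only the cells that are still valid.

```python
def stencil_step(src):
    """One 3-point averaging sweep; the two end cells are held fixed."""
    dst = list(src)
    for i in range(1, len(src) - 1):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return dst


def naive(grid, steps):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(steps):
        grid = stencil_step(grid)
    return grid


def temporally_blocked(grid, steps, tile):
    """Advance `steps` time steps tile by tile using overlapped halos."""
    n = len(grid)
    out = list(grid)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        # Halo of `steps` cells per side, clamped at the real grid boundary.
        lo = max(start - steps, 0)
        hi = min(end + steps, n)
        local = grid[lo:hi]  # stands in for a copy into on-chip memory
        for _ in range(steps):
            local = stencil_step(local)
        # Stale halo values spread inward one cell per step, so only the
        # tile interior is guaranteed correct and is written back.
        out[start:end] = local[start - lo:end - lo]
    return out


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(50)]
    ref = naive(data, 10)
    blk = temporally_blocked(data, 10, 8)
    print(max(abs(a - b) for a, b in zip(ref, blk)))
```

In this sketch the full grid is read once per `steps` time steps instead of once per step, at the cost of redundant halo computation that grows with the blocking degree.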