The use of reconfigurable computing, and FPGAs in particular, to accelerate
computational kernels has the potential to be of great benefit to scientific
codes and the HPC community in general. However, whilst recent advances in FPGA
tooling have made the physical act of programming reconfigurable architectures
much more accessible, in order to gain good performance the entire algorithm
must be rethought and recast in a dataflow style. Reducing the cost of data
movement for all computing devices is critically important, and in this paper
we explore the most appropriate techniques for FPGAs. We do this by describing
the optimisation of an existing FPGA implementation of an atmospheric model's
advection scheme. Starting from an FPGA code that was over four times slower than
the CPU version, mainly due to data movement overhead, we describe the
profiling and optimisation strategies adopted to significantly reduce the
runtime and bring the performance of our FPGA kernels to a much more practical
level for real-world use. The result of this work is a set of techniques,
steps, and lessons learnt that we have found to significantly improve the
performance of FPGA-based HPC codes and that others can adopt in their own
codes to achieve similar results.

Comment: Preprint of article in 2019 IEEE/ACM International Workshop on
Heterogeneous High-performance Reconfigurable Computing (H2RC)