Accelerators are quickly emerging as the leading technology to further boost
computing performance; their main feature is a massively parallel on-chip architecture. NVIDIA
and AMD GPUs and the Intel Xeon-Phi are examples of accelerators available today. Accelerators
are power-efficient and deliver up to one order of magnitude more peak performance than
traditional CPUs. However, existing codes for traditional CPUs require substantial changes to
run efficiently on accelerators, including rewriting them in architecture-specific programming languages.
In this contribution we present our experience in porting large codes to NVIDIA GPU and Intel
Xeon-Phi accelerators. Our reference application is a CFD code based on the Lattice
Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for
processor architectures with a large degree of parallelism. However, exploiting a large
fraction of the theoretically available performance remains a challenge that is not easy to
meet. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a
D2Q37 model) that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the
equation of state of a perfect gas.
We describe in detail how we implement and optimize our LB code for Xeon-Phi and
GPUs, and then analyze performance on single- and multi-accelerator systems. Finally,
we compare results with those available on recent traditional multi-core CPUs.