We describe a high-performance implementation of the lattice-Boltzmann method
(LBM) for sparse geometries on graphics processors. In our implementation we
cover the whole geometry with a uniform mesh of small tiles and carry out
calculations for each tile independently, with proper data synchronization at
tile edges. For this method we provide both a theoretical analysis of
complexity and results from real implementations for 2D and 3D geometries.
Based on the theoretical model, we show that tiles offer significantly smaller
bandwidth overhead than solutions based on indirect addressing. For
2-dimensional lattice arrangements a reduction of memory usage is also
possible, though at the cost of diminished performance. We reached a
performance of 682 MLUPS on a GTX Titan (72\% of peak theoretical memory
bandwidth) for the D3Q19 lattice arrangement and double-precision data.

Comment: Accepted in IEEE Transactions on Parallel and Distributed Systems, 14
pages, 9 figures, 5 tables
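
The relation between the reported 682 MLUPS and the quoted 72% of peak memory bandwidth can be checked with a short calculation. The sketch below is illustrative only and rests on two assumptions that the abstract does not state: that each D3Q19 node update reads and writes all 19 distribution values exactly once (no extra traffic for node types or tile metadata), and that the GTX Titan peak memory bandwidth is 288.4 GB/s, taken from the vendor specification. The function names are hypothetical.

```python
# Rough bandwidth check for the reported LBM performance.
# ASSUMPTION: one read and one write of every distribution value per update.
# ASSUMPTION: GTX Titan peak memory bandwidth of 288.4 GB/s (vendor figure).

Q = 19                      # number of distribution functions in D3Q19
BYTES_PER_VALUE = 8         # double precision
PEAK_BANDWIDTH_GBPS = 288.4 # assumed peak bandwidth, in GB/s


def bytes_per_update(q=Q, bytes_per_value=BYTES_PER_VALUE):
    """Minimum memory traffic per lattice-node update (read + write)."""
    return 2 * q * bytes_per_value


def effective_bandwidth_gbps(mlups, q=Q, bytes_per_value=BYTES_PER_VALUE):
    """Memory bandwidth in GB/s implied by a rate in million updates/s."""
    return mlups * 1e6 * bytes_per_update(q, bytes_per_value) / 1e9


def bandwidth_utilization(mlups, peak_gbps=PEAK_BANDWIDTH_GBPS):
    """Fraction of the assumed peak bandwidth used at the given MLUPS."""
    return effective_bandwidth_gbps(mlups) / peak_gbps


if __name__ == "__main__":
    mlups = 682.0
    print(f"bytes per update:    {bytes_per_update()}")
    print(f"effective bandwidth: {effective_bandwidth_gbps(mlups):.1f} GB/s")
    print(f"utilization:         {bandwidth_utilization(mlups):.1%}")
```

Under these assumptions each update moves 304 bytes, so 682 MLUPS corresponds to roughly 207 GB/s, which is consistent with the stated 72% of peak bandwidth.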