We describe a high-performance implementation of the lattice Boltzmann method (LBM) for sparse 3D geometries on graphic processors (GPU). The main contribution of this work is a data layout that allows to minimise the number of redundant memory transactions during the propagation step of LBM. We show that by using a uniform mesh of small three-dimensional tiles and a careful data placement it is possible to utilise more than 70% of maximum theoretical GPU memory bandwidth for D3Q19 lattice and double precision numbers. The performance of our implementation is thoroughly examined and compared with other GPU implementations of LBM. The proposed method performs the best for sparse geometries with good spatial locality.
Introduction
The lattice Boltzmann method (LBM) is a new versatile and highly parallel approach where the discrete Boltzmann transport equation is solved in the velocity or moment R n space to obtain the time-dependent fluid velocity distributions. Due to a trivial LBM parallelization many researchers have explored the area of its implementations on graphics processors (graphic processing units, GPU). A thorough review is presented in [1] . A few typical optimisation techniques may be mentioned: combining all computations into the single kernel to minimise memory bandwidth usage, use of structure of arrays (SoA) instead of array of structures (AoS) data types to avoid uncoalesced memory transactions, an adaptation of all computations to the single precision floating point format to minimise a register pressure and maximise occupancy, use of two copies of data to avoid race conditions, and many others. However, the above optimisations concern dense geometries. The number of GPU implementations for sparse geometries (with many solid nodes) is much lower and they offer significantly worse performance. This can hinder the usage of GPU LBM implementations in many areas where dense geometries are not applicable. Typical examples may be biomedical or porous media simulations [2] .
Sparse geometries handling requires some form of storing information about the placement of non-solid nodes. This information can not only increase memory usage, but also generate irregular memory access patterns that can significantly decrease performance.
In general, two indirect addressing solutions are used in GPU implementations of LBM for sparse geometries: the Connectivity Matrix (CM), shown in [3] with highly optimised version in [2] , and the Fluid Index Array (FIA), presented in [4] . The connectivity matrix contains pointers to the neighbour node for each propagated f i function. The fluid index array is a kind of "bitmap" that contains −1 for every solid node in geometry or the pointer to non-solid node data. Both solutions bring an overhead in memory usage, additionally the FIA implementation results in a low utilisation of memory bandwidth due to the irregular data access pattern.
In this work we present the GPU implementation of LBM for sparse geometries where the information about geometry sparsity is stored using the uniform grid of threedimensional cubic tiles. Due to the fixed tile size, we were able to arrange data inside a tile to minimise memory traffic. For the proposed data layout, we show both a detailed theoretical analysis and results of performance measurements for dense and sparse geometries. Additionally, we analyse how the fixed tile size affects the overall performance. Provided that the tiles contain a low number of solid nodes, the performance of our implementation remains high regardless of the geometry sparsity.
The structure of the paper is as follows. In Section 2, we briefly introduce the LBM method and define basic concepts in GPU programming. Section 3 contains the description of our implementation with the detailed analysis of introduced overheads. The performance comparison with existing implementations and a verification procedure are presented in Section 4. Section 5 contains conclusions.
Background

Graphic Processing Unit
Today's GPUs are in fact mass-produced, easily programmable versatile variant of vector coprocessors. Compared with universal Central Processing Unit (CPU), they offer more computational power (due to numerous units) and a higher memory bandwidth at the cost of a diminished cache memory and a low performance of a single computational unit. Two of the most popular frameworks for GPU programming are OpenCL and CUDA (in this work we use the latter). Selected parameters of CUDA capable GPU generations are shown in Table 1 . It can be seen that the successive generations offer not only a higher memory bandwidth, but also their computational power for double precision increases faster than the memory bandwidth. In the CUDA programming model [5] a complete coprocessor (device) contains a GPU with a separate memory. The device is controlled by a host computer that launches on the GPU many instances (threads) of a GPU code (kernel ). Threads are arranged in an explicitly defined multidimensional structure: 3D grid of 3D thread blocks.
From the hardware point of view, a CUDA GPU contains a multi-channel DDR memory controller, optional L2 cache memory and from one up to over a dozen of multiprocessors. Each multiprocessor has a number of processing units, a register file, an amount of fast shared memory (used as a software controlled scratchpad), cache memories and a number of instruction decoders. One instruction decoder is common for many computational units, thus threads are run in groups called warps. The situation, when threads from the same warp need to follow different execution paths, is called thread divergence and causes a performance penalty.
Threads are assigned per multiprocessor in complete blocks. Usually, the number of blocks concurrently run on the multiprocessor is limited by the usage of registers and shared memory, lowering the multiprocessor occupancy. High occupancy and/or high instruction level parallelism (ILP) within single threads allow to achieve a high performance through masking of instruction latencies by hardware task switching.
GPU memory hierarchy contains the off-chip DDR DRAM global memory (limited to 1-32 GB depending on GPU), up to two levels of on-chip cache memory (there are separate caches for read-only access), shared memory and registers. A single transaction with the global memory requires a few hundreds of GPU clock cycles, thus memory transactions for threads from the same warp can be coalesced into one if defined restrictions are met. Additionally, limited software control over a cache usage is available.
Lattice-Boltzmann method
In the LBM, as in the conventional CFD, the geometry, initial and boundary conditions must be specified to solve the initial-value problem. The computational domain is uniformly partitioned and computational nodes are placed in vertices of adjacent cells, giving the lattice. In this study, the D3Q19 lattice structure is used (D3 is the space dimension, Q19 is the lattice links number). Let f i represent the probability distribution function along the i lattice direction, δ x and δ t are the lattice spacing and the lattice time step (usually assumed to be equal 1), δx δt is the lattice speed (also usually assumed to be equal 1 lu), e i is the unit vector along the i lattice direction, and c i = δx δt e i is the lattice velocity vector in the velocity space along the i lattice direction. In this work we use the lattice directions naming scheme shown in Fig. 1 . The core of the lattice Boltzmann method is the discretized in velocity R n space form of the Boltzmann transport equation that can be written for each i of q directions (lattice linkages) in velocity space as follows:
where Ω i is the collision operator.
2
The collision operator can be approximated with the most popular Bhatnagar-Gross-Krook model [6] ,
where τ is the relaxation time in LB units related to the lattice fluid viscosity and f eq i is the equilibrium distribution function along the i lattice direction given by the following formula for the quasi-compressible model:
and for the incompressible model [7] :
where c s is the lattice speed of sound that is a lattice constant; u is the macroscopic fluid velocity vector expressed in LB units; ρ = i f i is a fluid density expressed in LB units, and ω i is a weighting scalar for the i lattice direction. The macroscopic velocity for the quasi-compressible model can be determined from
and for the incompressible model from [8] 
Integrating Eqn. (1) from t to t + δ t along the i lattice direction and assuming that the collision term is constant during the interval, we can obtain discretized in time form of BGK-LBM equation
where r is a position vector in the velocity space. The therm on the left hand side is known as the streaming step, the latter represents the collision step. These two steps are repeated sequentially during the simulation giving velocity, density and pressure distributions at each time step.
The BGK-LBM uses the single relaxation time to characterise the collision effects. However, physically, these rates should be different during collision processes. To overcome this limitation, a collision matrix with different eigenvalues or multiple relaxation times can be used. The LBM with an MRT collision operator can be expressed as [9] :
where A is the collision matrix. The MRT-LBM model has been attracting more attention recently due to the higher stability than that of the BGK-LBM model. In our study, we use both above mentioned collision models in incompressible and quasi-compressible versions to compare their efficiency.
LBM performance limits
To compare different implementations, we employ a widely used, simple computational complexity model that allows for computing the minimum memory usage, the memory bandwidth and the computational performance. Subsequently, we compare the real results to this theoretical limit.
Since we are interested in peak performance, in the considerations that follow below we ignore boundary nodes. We also assume that the computational unit (CPU or GPU) does computations for a single node in a single LBM iteration (streaming and collision) in three stages: data read from the main memory, computations, data write to the main memory (the data for a single lattice node is buffered in registers and in the internal memory, e.g. shared/cache memory). It was shown in [10] that it is possible to do even a few time steps per one data read/write from the main memory, but according to [11] this method was difficult to apply to more complex boundary condition models.
We assume that for each lattice node the only values stored in the main memory are f i functions and an additional node type identifier. Since the previous work has shown that usually LBM implementations are memory bound, the other values (e.g. velocity or density) can be computed internally, and dropped after use without transferring them from/to the main memory.
Let q denote the number of functions f i , n d denote the number of bytes for storing a single f i value (e.g. 4 B for a single precision floating point, 8 B for double precision), and n t denote the number of bytes used to store a node type (we assume that at least one byte is necessary, although for simple cases only a few bits can be enough). Thus, the minimum number of memory bytes needed for single node datum is
For a single LBM iteration, the new f i values are computed based on the node type and the previous f i values. The newly computed f i values are stored in the memory. The minimum number of bytes transferred per one LBM iteration for a single node is then
Formulae defining the number of floating point operations (FLOP, not to be confused with floating point operations per second, FLOPS) are much more complex to determine. A simple count of the operations resulting from a naive implementation of equations (7) and (8) gives numbers significantly larger than in real implementations. Actually, many operations described by equations (7) and (8) can be skipped (e.g. multiplications by −1, 0 or 1) or used a few times (e.g. partial sums of f i ). Some of these optimisations can be detected very early (i.e. multiplications by e i constants equal to −1, 0, 1), but others are unpredictable since they are applied during the compilation and their use depends on many factors (optimisation level, registers usage constraints during compilation, machine capabilitiesa number of registers etc.). Thus, for numbers of computational operations we show only specific values obtained by a disassembling of a GPU binary code using nvdisasm utility (we count only arithmetic operations). The results are shown in Table 2 . The number of FLOP for our LBM implementation consists of two parts: computation of v, ρ and collision. The complexity of v, ρ computations depends only on the fluid model. These computations are implemented as a simple series of additions and optional divisions for the quasicompressible model. The computational complexity of collision depends practically almost only on collision model. The full cost of the quasi-compressible model is then builtin into computations of v, ρ, where additional divisions are needed for each velocity component, and into the code responsible for boundary nodes handling (not shown in Table 2). Due to a different ratio of division to collision complexity, the computational complexity of the quasicompressible model implementation is from 50% for BGK to 14% for MRT higher than for the incompressible model.
The comparison of FLOP/B ratios from Table 2 with specifications of high-performance GPUs (see Table 1) shows that LBM implementations are usually bandwidth bound. This is especially true for the newest CUDA capable architectures, for which the FLOP/B ratio usually gradually increases. The high performance LBM implementations for GPU should then focus on memory bandwidth optimisations. Notice that this is also true for general purpose CPUs, provided that their computational power is fully utilised using vector instructions.
Implementation
Tiling overview
In our implementation, the whole geometry is covered by the uniform mesh of cubic tiles, where each tile contains a 3 nodes. If a geometry size is not divisible by a, then the geometry is extended with solid nodes. The tiling is implemented by the host code and done once at the geometry load. We use a very simple tiling algorithm: first, the geometry is covered by the uniform mesh of tiles starting at node (0,0,0), and next, the tiles containing only solid nodes are removed.
In general, the tile size could be arbitrary but, as we show below, it can have a significant impact on performance due to a low tile utilisation if a tile is too large. Additionally, the number of nodes within a tile should be a multiple of the GPU warp size to avoid a low hardware utilisation. In our implementation we have chosen a = 4. Fig. 2 shows thread assignment to nodes within a tile. We use thread blocks with dimension 4, 4, 4 so both coordinates of threads and of nodes in the tile are the same. This mapping gives two warps per tile: the first warp operates on all nodes with z ∈ {0, 1}, the second warp on nodes with z ∈ {2, 3}. Thus, the tiles can be processed independently and in any order during a single LBM iteration.
Though the tiling is not a very novel technique (it is described even in NVIDIA tutorials as a method improving the performance of dense matrix multiplication), to the best of our knowledge this is the first attempt to use it for a GPU implementation of LBM for sparse geometries. Existing GPU LBM implementations for sparse geometries [3] , [4] , [2] focus on indirect addressing, what does not guarantee proper memory layout and requires additional data increasing memory and bandwidth usage. Below we will show that due to the regular memory layout the tiling offers a higher performance provided that the tiles contain a low number of solid nodes.
Tiling memory layout
Since the LBM implementations are usually bandwidthbound on high-performance GPUs, the proper memory layout resulting in good memory bandwidth utilisation is crucial to achieving high performance. Thanks to the fixed tile size in our implementation we are able to fully control data placement for nodes processed in threads from the same warp. This gives us a possibility to maximise the number of coalesced memory transactions, what results in a very high utilisation of GPU memory bandwidth (comparable to values reported for dense geometries).
The general memory layout used in our implementation is shown in Fig. 3 . All floating values for all nodes are stored in a single, continuous allValues array, what allows us to control data placement in GPU memory. In addition to the allValues array, we also use additional data arrays, e.g.: nodeType array where an information about LB nodes is stored, and tileMap array with tile bitmap. Data layout in the nodeType array is similar to allValues but for each tile only one block of data is stored. The tileMap array is a simple three-dimensional array stored in row order. A single value (e.g. f 0 ) from all nodes within the same tile forms a data block shown in Fig. 3 . Data layout in allValues array guarantees that each data block (64 double precision values) is properly aligned and can be transferred using sixteen 32-byte wide memory transactions. The only problem is such data arrangement that avoids uncoalesced/redundant transactions during propagation. To address this problem, each data block uses one of the three proposed data arrangements. Each data arrangement is optimised for different access patterns during propagation (see Fig. 4, 5 and 6 ). Let x t , y t , z t ∈ {0, 1, 2, 3} denote node coordinates inside a tile. In all below considerations we assume that x t , y t , z t are represented in the natural binary number system (unsigned types in CUDA/C language). The transformation from the x t , y t , z t node coordinates to the offset inside data block is defined by a linear mapping function L(x t , y t , z t ). We use three linear mapping functions (all equations are valid for a = 4 nodes per tile edge): .... 
where ∩ denotes bitwise AND.
All mapping functions allow to minimise the number of uncoalesced memory transaction for our thread mapping and tile size.
In our implementation we have started from the L XY Z layout for all f i data blocks and, after code profiling, we have modified layouts for these f i data blocks, for which the redundant memory traffic was observed. 
Tiling overhead
Let the overhead ∆ be defined as a ratio of additional memory/operations (∆ M -memory overhead, ∆ B -bandwidth overhead, ∆ C -computational overhead) to the minimum numbers defined in Sec. 2.3. Only one LBM time iteration is analysed since all iterations are processed in the same way. To simplify the equations, we use average values per tile, which enables us to omit multiplications by the number of tiles.
Let a denote the number of nodes per tile edge. Then, the number of nodes per tile is n tn = a 3 . If n tsn denotes the average number of solid nodes per tile (we count only tiles containing non-solid nodes), and n tf n = n tn − n tsn denotes the average number of non-solid nodes per tile, then we can define the average tile utilization factor
When we assume that for each node the number of operations and used memory are minimal (i.e. that the tiling does not have an impact on the computational complexity for single nodes), then the overhead caused by low η t results only from additional operations/memory, which must be done/allocated for all solid nodes from tile. In this case, the generic overhead value is then
In our implementation, the real values of bandwidth and computational overheads are even lower because we conditionally skip operations for solid nodes. For computational overhead, some operations are not done when the full warp of threads processes only solid nodes. This situation is rather rare (but possible), as a half of a tile must contain solid nodes. For bandwidth overhead the transfers are omitted, when values for solid nodes form a single 32-byte memory transaction. This happens quite often due to a spatial locality of geometry. Also, some transfers can be skipped, when during the propagation step the neighbour tile contains solid nodes.
For the memory overhead, Eqn. (15) allows only to find the overhead compared to an implementation where the two copies of f i values are stored for each node. However, if we use as a reference the value defined by Eqn. (9) , then the overhead must take into account the second copy of f i
The final approximation is true for n t ≪ q · n d . Thus, the memory overhead compared with minimal requirements defined by Eqn. (9) 3 ·M node ≈ 9800 B, what is almost 2500 times larger than the index in the tile bitmap. Thus, except for exceptionally sparse geometries (consisting significantly less than 1% of non-empty tiles) the tile bitmap overhead is negligible.
It can be seen from Eqn. (15) and (16) that the overhead grows quickly when the tile utilisation factor is lower than 1 (the overhead is 0.2 already for η t = 0.83). Therefore, it is critical to achieve the tile utilisation as high as possible.
In Fig. 7 and 9 we show, how η t changes for tiling of infinitely long channels running along one of the axes. Notice that the tile positions are discrete, thus only a few values of η t can occur. It can be seen that tile utilisation above 0.8 can be always achieved for channels with diameter/side at least about 40 nodes. To have a guarantee that the tile utilisation is always above 0.9, the channels must have about 100 nodes across. Though these channel dimensions can be too large for some applications, it should be noted that these values are pessimistic. It is possible to find a tile placement that gives good tile utilisation even for very small channels (for example, η t can be equal to 1 for a square channel as small as 4 × 4 nodes).
Additionally, since channels often are not parallel to the axes, then the values of η t for real cases can be closer to the average value for different tilings of the same channel size (see Fig. 8 ). The average value for different tilings of the same channel size (the green line in Fig. 7 and 9 ) exceeds η t = 0.8 for square channel 25 2 nodes and circular channel with diameter 30 nodes. For channel dimensions about 55 and 65 nodes for square and circular channels η t exceeds 0.9.
Some observations can be also extracted from for different tilings of a channel with given dimension is higher for square channels than for circular channels. It is possible to achieve higher tile utilisation for square channels, but it is also possible that the tile utilisation will be much lower than for circular ones.
The average value of different tile utilisations is higher for circular channels. The difference between the tile utilisation for circular and square channels depends greatly on the channel dimension: starting from 3 times for the smallest channels, through the 10% for channels with 20 nodes at diameter/side, and to the less than 2% for channels with at least 100 nodes.
For square channels, the two additional phenomena occur. First, if the channel dimension is larger by one node than the tile edge, then all tilings have the same tile utilisation. In this case, all tilings require the same number of tiles. The one node outside tile borders can always be placed in either "left" or "right" side of the tile, never on both sides. When the channel dimension is larger by two or more nodes than the tile edge, the number of additional tiles may differ depending on the tile placement. For example, two additional nodes can be put either in one additional tile or in two additional tiles on both sides. This behaviour can be used during geometry preparation to guarantee constant tile utilisation.
Second, the tile utilisation fluctuates with a period equal to the tile edge a = 4 nodes, i.e. for channel edge equal to k · a + m, where m ∈ {0, 1, . . . , a − 1} and integer k ≥ 0. This results from the fact that for a single period (for the same k) the same numbers of tiles are needed: k, k + 1 or k + 2 tiles along each axis. Since the number of non-solid nodes grows with m, then for the same number of tiles the tile utilisation η t becomes larger. As the lowest tile utilisation is always for m = 2, square channel side should be different from k · 4 + 2.
Implementation details
Our code is written in CUDA C language ver. 7.5, the first version with official support for many C++11 features. Thanks to this we could write a highly templated code, which can be shared between the CPU and GPU compilers. The code sharing enabled us to simplify testing because results of many parts of the code could be thoroughly checked only on the CPU side. Additionally, extensive use of C++ templates allowed us to write a generic kernel code, which later can be specialised for fluid and collision models, data layouts, data precision, enabled optimisations etc. We were then able to test many combinations of the above parameters with minimal changes to source code and without code duplication. The disadvantages of this solution are a long compilation time and a large size of the executable file containing many versions of the same kernel.
In our implementation, the computations for a single LBM time step include collision, propagation and boundary computations for all nodes within the tiles. To minimise memory transfers, we combine collision, propagation and boundary computations for a single node into a single GPU kernel. The kernel requires only one read of data from memory at the beginning, and one store of data at the kernel end. Local copies of data are stored in registers and in the shared memory. Separate tiles are processed by a separate GPU thread blocks (see Fig. 2 ), what allows for effortless synchronisation of computations within the tile. The general structure of our GPU kernel is shown in Fig. 10 The propagation is implemented as a gathering data from neighbour nodes. The propagation between two nodes is done only when the target and source nodes are not solid. To minimise a cost of checking node types, we use shared memory to store copies of node types from current and neighbour tiles (lines 2-3 in Fig. 10 ). In code responsible for loading node types from neighbour tiles (line 3) we use warp level programming to balance the load of warps assigned to the tile. We also make a copy of 3 3 tile indices from the tile bitmap array forming a cube surrounding currently processed tile (line 1). All later operations requiring information about node and/or tile types use these copies.
After loading of node and tile types the barrier is needed (line 4 in Fig. 10 ), because in a later part of the kernel the threads from different warps require the data gathered in other warps. This is the only synchronisation point in the kernel.
All later operations are done only when a node assigned to current thread is not solid (line 5). In this way, we can skip unnecessary operations as mentioned in Sec. 3.3.
The computations for a non-solid node are divided into the three parts: lines 6-12 are responsible for gathering f i values, lines 13 and 14 perform all floating point calculations and line 15 stores the computed f i values to the memory. All operations are implemented as standard except for the line 8, where we compute indices for a neighbour node. First, the index of a tile containing the neighbour node is computed -we use only the values from −1, 0, 1 due to local copy of the tile bitmap. When the neighbour tile is not empty, a few indices for the neighbour node are computed since we need both information about the neighbour node type (this is gathered from the copy in shared memory) and the value of proper f i (which is stored in memory). Notice that for different f i functions we need to use the different data layouts (see Sec. 3.2).
In the GPU kernel, we have also applied some optimisations not shown in Fig. 10 . First, the loop starting at line 7 is partially unrolled to allow computations for two f i functions in a single iteration, what increases the instruction level parallelism due to the interlacing of two independent streams of instructions. In addition, the order of iterations over q is chosen in a way that allows to share some parts of neighbour node address computations (line 8). Next, some values (e.g. v) are stored in shared memory to minimise register pressure. Shared memory is also used in the implementation of the bounce back boundary condition to avoid a significant increase of required registers. To minimise conflicts in shared memory, 64-bit mode is used. We have also tuned the usage of registers for each specialisation of the kernel since different collision models require different amounts of temporary data. The cost of divergence in the code responsible for computations of an address of the neighbour node is minimised -only integer indices are computed in divergent branches, time-consuming operations like memory accesses are done in all threads simultaneously. In places, where it is possible and useful, the number of costly divisions is reduced by replacing them with multiplications by the inverse. Finally, we have been intensively using inline methods and compiler pragmas to force loop unrolling, what allowed us to connect clean, structured code with high performance.
Results
Measurements methodology
All tests have been run on a computer with the Intel i7-3930K CPU and 64 GB quad-channel DDR3 DRAM. The installed GPU was GTX Titan (Kepler architecture) clocked at 823 MHz with 6 GB GDDR5 memory clocked at 3.004 GHz. The peak theoretical GPU memory bandwidth is 288.384 GB/s (GB/s denotes 10 9 B/s), the NVIDIA bandwidthTest utility reported 228.5 GB/s (79.2% of peak). We have been using Linux operating system with NVIDIA CUDA Toolkit 7.5 installed.
To identify performance limits we have been using three versions of kernels: kernels implementing all LBM operations, kernel with full propagation but without computations, and kernel without propagation, which only reads and writes node data for the same node (without communication between neighbour nodes). The kernels implementing all LBM operations are described by collision and fluid models (e.g. "BGK incompressible"), the kernel with propagation is marked as "propagation only" and the last kernel without full propagation is denoted as "read/write only". All kernels operate on double precision floating point numbers.
As a measure of kernel performance we have been using the number of 10 6 × non-solid (fluid and boundary) nodes processed per second (MLUPS). The processing time was measured for separate kernel runs (with necessary synchronisation). For each kernel and case we run 1000 steps and measured wall clock time for each kernel run. The first few (1-3) kernel runs usually took more time than the rest, thus before processing we removed the first five measurements. For time measurement we have been using the C++ std::chrono library with microsecond resolution. To identify performance limits not affected by the geometry sparsity, we have first measured the performance of our implementation for the cavity3D test case, a standard example of dense geometry. The performance of all kernels as a function of a geometry size is shown in Fig.  11 .
Performance for cavity3D
The performance dependency on the geometry size is similar to many other GPU implementations. For 10 5 nodes the kernels achieve about 85% of their maximum performance. To achieve more than 95% of maximum performance, at least 3 × 10 5 nodes are needed. For other GPUs the numbers of nodes can be different, but for today's machines the maximum performance should always be possible to achieve for geometries containing about 10 6 or more nodes.
The highest performance was observed for read/write only kernel: 773 MLUPS what corresponds to memory bandwidth equal to 236.5 GB/s (82% of maximal theoretical bandwidth for GTX Titan). Because in this kernel all memory transactions are coalesced, we achieved bandwidth utilisation even higher than reported by the bandwidthT est utility.
For kernel with propagation only the performance dropped by 9.6 % to 699 MLUPS resulting in bandwidth 213.9 GB/s (74.2% of maximum). It should be noted that though the propagation only kernel has more than seven times more instructions than read/write only kernel (711 vs 99 instructions), the performance penalty caused by neighbour tile data access during propagation is smaller than 10%. More than 60% of all instructions in propagation only kernel are integer arithmetic operations used for address computations.
Finally, after enabling all computations the kernel performance dropped by an additional 2.3% for BGK incompressible (684 MLUPS, 72.6% of GPU memory bandwidth) up to 32% for MRT quasi-compressible (476 MLUPS, 50.5% of GPU memory bandwidth). The large performance drop for MRT results from computational overhead -due to the large register usage per single thread and the low GPU occupancy the complex double precision computations are not masked with memory transfers.
The performance of our implementation for sparse geometries is close even to the highly optimised version for dense geometries from [1] , which utilises 76.7% of the maximum memory bandwidth for Tesla K20c and the BGK quasi-compressible model. Moreover, the implementation from [1] uses single precision numbers what allows to increase GPU utilisation due to much lower register pressure. This result shows that it is possible to process sparse geometries with performance similar to dense ones.
As can be seen in Fig. 11 , the performance of all kernel versions fluctuates for geometry edge b = 4 · k + m, where m ∈ {0, . . . , 3}. This results from the low tile utilisation for tiles placed on the geometry edges (see Section 3.3). Notice that a performance difference for the geometry edge b computed for the same k and different m decreases for larger k, e.g. for b = 100 the performance of propagation only kernel is between 672 and 699 MLUPS, and for b = 248 between 643 and 653 MLUPS. This results primarily from differences in the average tile utilisation: for the cubic cavity3D geometry the number of fully utilised tiles grows Some interesting behaviour is also shown in Fig. 11 -after achieving peak about 10 6 nodes the performance slowly decreases with a geometry size increase. The intensity of this phenomenon depends on a computational complexity of operation -the most visible is for propagation only (see Table 3 ). Also, it does not occur for the kernel without propagation (read/write only). This is a result of data propagation between neighbour tiles that increases the time needed for tile processing -for read/write only kernel there is no propagation between tiles, thus no performance decrease is observed. For smaller geometries the ratio of tile faces and edges common to two tiles to the number of tiles is smaller than for large geometries, thus the performance for smaller geometries is higher.
Propagation performance
To estimate how the communication between tiles affects the performance we have also measured the performance of propagation for a set of rectangular channels containing about 10 6 nodes. The channels differ in dimensions: we have used channels starting from 4 × 4 × 62500 nodes up to 100 × 100 × 100 nodes (1873 combinations in total). This way allowed us to generate geometries with a different number of faces and edges common to neighbour tiles.
Let η f denote the number of common faces per tile and η e denote the number of common edges per tile. The values of η f and η e are computed based on the geometric layout, e.g. for four tiles arranged in 2 × 2 × 1 mesh there are 4 common faces and 1 common edge, thus η f = 4/4 and η e = 1/4. For channel dimensions used in our measurements the values of η f and η e are the following: channel 4 × 4 × 62500 had about one common face per tile and zero edges, channel 4 × 8 × 31248 had about 1.5 common face per tile and about 0.5 common edge per tile, and so forth. We do not count the numbers of transferred faces/edges because for rectangular channels each face is always common between two tiles and each edge between four, thus the number of data transfers is proportional to η f and η e . The propagation performance as a function of common faces per tile and common edges per tile ratios is shown in Fig. 12 . The highest performance (851 MLUPS, 90% GPU memory bandwidth utilisation) was observed for channel 4 × 62500 × 4 nodes.
Most of the points shown in Fig. 12 is placed close to a plane parallel to the "Bandwidth utilisation" axis and defined by a line
which can be treated as a coarse estimation of a ratio of common edges to faces. Only for small η f (η f < 2), which corresponds to very narrow channels of dimension 1 × k tiles, the line defining the plane becomes
The bandwidth utilisation shown in Fig. 12 can be estimated as
Two main areas can be identified in the dependence shown in Eqn. (19). For η f < 2.8 and η e < 2.6 the propagation performance falls proportionally to η f and η e . Also, the η f is almost twice as important as η e . For η f < 2.8 and η e < 2.6 the propagation performance increases in inverse proportion to η f and η e . In this range η e is more important than η f , what means that the more square channels (with width similar to depth) allow to achieve a higher performance. The highest bandwidth utilisation in this range was observed for channel 100 × 100 × 100 nodes (74%, 698 MLUPS). 
Data layout impact
To measure the performance gains caused by tile memory layouts presented in Sec. 3.2 we have used nvprof utility to profile the four propagation kernels with memory layouts XYZ, XYZ + zigzagNE, XYZ + YXZ, and all three (XYZ + YXZ + zigzagNE). The assignment of memory layouts to f i data blocks is defined in Sec. 3.2. For cases XYZ + zigzagNE and XYZ + YXZ the default memory layout is XYZ. The results are shown in Table 4 .
As a test case, we have used the geometry containing 100 × 100 × 100 nodes due to its regular structure that allows to exactly compute the number of transactions. The minimal number of write transactions is 25 The measured numbers of 32-byte memory writes were independent of the used memory layout. For all layouts we have observed 4 750 000 32-byte writes, which is the minimal value. This is a result of the gather pattern, where irregular memory accesses to neighbour nodes/tiles occur only during the read stage.
The data from Table 4 show that the proposed memory layout ("all three" row) allows to achieve the number of memory transactions (both read and write) only 1.12% larger than the minimal. This is a result of both a reduction of uncoalesced memory reads (L2 reads column) and an increase of cache hit rate. Notice also that for "rw only" kernel the numbers of cache and device memory transactions is minimal, what shows the correctness of our theoretical estimations of numbers of memory transactions.
Since the YXZ layout is applied to the largest number of f i data blocks (eight), its use results in the largest performance increase (13.65% compared with 586 MLUPS for XYZ layout). The zigzagNE layout is used only for two f i data blocks and gives exactly four times smaller performance increase (3.41%). Notice that the performance increase does not depend on address computation complexity. Though zigzagNE layout requires 14 times more additional instructions per f i data block than YXZ layout (14 compared with 1 instruction per f i data block), the performance increase scales linearly only with the number of used data blocks regardless of the address computation complexity.
The reductions of the numbers of transactions (for both cache and device memories) for the zigzagNE and YXZ lay- To verify the performance of our implementation for sparse geometries we have run simulations for cases similar to presented in [2] and [4] . In Table 5 there are shown results for three arrays of random arranged spheres with a diameter of 40 lattice units. Compared with implementation of the same collision model from [2] our solution achieves the higher bandwidth utilisation for cases with porosity 70% and 80% (respectively 60% and 55.7% of peak memory bandwidth versus 49.1% and 48.6% for [2] ). For porosity 90% our implementation is slightly worse (48% vs 49.6% of peak memory bandwidth). Average tile utilisation factors for these three cases are 0.699, 0.584 and 0.512.
Performance for sparse geometries
The second test was the blood flow in cerebral aneurysm (see Fig. 13 ). We used the model similar to presented in [2] , but since our implementation does not support multi-GPU configurations, we reduced the resolution to 730 × 342 × 334 nodes (84.6 × 10 6 nodes in total, 14.8 × 10 non-solid nodes, geometry porosity 0.825, average tile utilisation η t = 0.931). The results are shown in Table 6 . For the cerebral aneurysm case our implementation achieves significantly higher GPU memory bandwidth utilisation than [2] . This is due to the high tile utilisation resulting from a good spatial locality of geometry. Notice also that the numbers from Table 6 are close to the values achieved for dense geometries (the right column in Table 3 ). It shows that when the tile utilisation is high, then the performance of our implementation is almost not dependent on geometry sparsity. Finally, we have measured the performance for aorta with coarctation case (see Fig. 14 ) similar to presented in [4] . The performance comparison is shown in Table 7 . Since this geometry is rather small (resolution 93 × 161 × 442, about 0.66 × 10 6 non-solid nodes, porosity 0.906), the performance of our implementation is up to 14% lower than for the cerebral aneurysm case shown in Table 6 . Small geometry size degrades performance not only due to the low number of nodes (as shown in Fig. 11 ) but primarily due to the reduced tile utilisation (equal to 0.807 in this case). Despite this performance degradation, the tile based solution achieves significantly higher memory bandwidth utilisation than implementation presented in [4] . Table 7 : Kernels performance for aorta model with coarctation (similar to [4] ). Results from [4] are for GTX 680 GPU, D3Q15 lattice and kernel similar to our BGK quasicompressible. The values in last row are estimated based on results presented in [4] . 
Conclusions
In this paper, the GPU based LBM implementation for fluid flows in sparse geometries is presented. In contrary to the previous solutions for sparse geometries, which use the indirect node addressing, our implementation covers the geometry with the uniform mesh of cubic tiles. Due to the fixed tile size, we have been able to carefully arrange data inside a tile what allowed to significantly decrease the GPU memory traffic. The proposed data layout allowed us to achieve up to 684 MLUPS (72.6% utilisation of maximum theoretical memory bandwidth) on GTX Titan for D3Q19 lattice, BGK incompressible model and double precision data. We have also examined the performance of our implementation for both BGK and MRT collision models in incompressible and quasi-compressible versions. The performance comparison with existing implementations shows that in most real cases our solution allows to achieve a higher performance. Especially promising results were obtained for the large geometry with good spatial locality (cerebral aneurysm model with 14.8 × 10 6 non-solid nodes), where we observed up to 1.65× higher memory bandwidth utilisation than the highly optimised implementation based on indirect node addressing. Only for the very sparse geometries with weak spatial locality (array of random arranged spheres with porosity 0.9) our implementation is slightly slower (48% vs 49.6% of peak memory bandwidth) due to a low tile utilisation.
Future work includes the extension to a multi-GPU version to increase the size of supported geometries and the search for a method to increase the tile utilisation (e.g. by a non-uniform tile placement and/or a decrease of the tile size).
