Abstract-GPU performance of the lattice Boltzmann method (LBM) depends heavily on memory access patterns. When LBM is advanced with GPUs on complex computational domains, geometric data is typically accessed indirectly, and lattice data is typically accessed lexicographically in the Structure of Array (SoA) layout. Although there are a variety of existing access patterns beyond the typical choices, no study has yet examined the relative efficacy between them. Here, we compare a suite of memory access schemes via empirical testing and performance modeling. We find strong evidence that semi-direct addressing is the superior addressing scheme for the majority of cases examined: Semi-direct addressing increases computational speed and often reduces memory consumption. For lattice layout, we find that the Collected Structure of Arrays (CSoA) layout outperforms the SoA layout. When compared to state-of-the-art practices, our recommended addressing modifications lead to performance gains between 10-40% across different complex geometries, fluid volume fractions, and resolutions. The modifications also lead to a decrease in memory consumption by as much as 17%. Having discovered these improvements, we examine a highly resolved arterial geometry on a leadership class system. On this system we present the first near-optimal strong results for LBM with arterial geometries run on GPUs. We also demonstrate that the above recommendations remain valid for large scale, many device simulations, which leads to an increased computational speed and average memory usage reductions. To understand these observations, we employ performance modeling which reveals that semi-direct methods outperform indirect methods due to a reduced number of total loads/stores in memory, and that CSoA outperforms SoA and bundling due to improved caching behavior.
Abstract-GPU performance of the lattice Boltzmann method (LBM) depends heavily on memory access patterns. When LBM is advanced with GPUs on complex computational domains, geometric data is typically accessed indirectly, and lattice data is typically accessed lexicographically in the Structure of Array (SoA) layout. Although there are a variety of existing access patterns beyond the typical choices, no study has yet examined the relative efficacy between them. Here, we compare a suite of memory access schemes via empirical testing and performance modeling. We find strong evidence that semi-direct addressing is the superior addressing scheme for the majority of cases examined: Semi-direct addressing increases computational speed and often reduces memory consumption. For lattice layout, we find that the Collected Structure of Arrays (CSoA) layout outperforms the SoA layout. When compared to state-of-the-art practices, our recommended addressing modifications lead to performance gains between 10-40% across different complex geometries, fluid volume fractions, and resolutions. The modifications also lead to a decrease in memory consumption by as much as 17%. Having discovered these improvements, we examine a highly resolved arterial geometry on a leadership class system. On this system we present the first near-optimal strong results for LBM with arterial geometries run on GPUs. We also demonstrate that the above recommendations remain valid for large scale, many device simulations, which leads to an increased computational speed and average memory usage reductions. To understand these observations, we employ performance modeling which reveals that semi-direct methods outperform indirect methods due to a reduced number of total loads/stores in memory, and that CSoA outperforms SoA and bundling due to improved caching behavior.
The lattice Boltzmann method (LBM) is increasingly being used to study problems in computational fluid dynamics. LBM is attractive due to its computational simplicity, which allows for large scale parallel implementations that exhibit favorable strong and weak scaling properties [2] , [11] , [30] , [38] [39] [40] [41] . LBM has enabled the first solutions of full body arterial flows [40] and the possibility of porous media simulations at unprecedented scales [30] .
LBM updates each lattice point with the same set of rules, and this single instruction, multiple thread (SIMT) algorithm is suited for acceleration with graphics processing units (GPUs). There is a significant quantity of work implementing LBM on GPUs, and many of these studies consider dense cubic geometries in which all lattice positions are fluid points (e.g., [9] [10] [11] , [17] , [18] , [22] , [23] , [34] [35] [36] , [42] , [50] ). A few studies using GPUs examine problems involving complex geometries, where the lattice points within a computational bounding box may be either fluid or solid; complex geometries include the examples of arterial flows and porous media simulations mentioned above [2] [3] [4] [5] , [30] , [32] , [43] .
There are a variety of memory storage techniques to track both geometric information and the data at each lattice point. For dense, cubic geometries, direct addressing schemes are universally employed and studied. For complex domains, however, the strategy for memory storage may influence both the performance and memory consumption of the algorithm; this fact was demonstrated for LBM on CPUs in [29] . There is no corresponding study that examines the relative efficacy of memory storage techniques for LBM on GPUs.
We begin such a study by examining the computational cost of different data storage strategies for solving LBM on complex geometries with GPUs. To track geometric data, we analyze direct, semi-direct, and indirect addressing schemes, along with several other schemes that we classify as locallydirect methods. Lattice data is stored according to the ordering of the lattice points, and the layout of the data. In terms of ordering, we analyze lexicographic and two-grid ordering. In terms of layout, we analyze the standard SoA layout, bundling [29] , and collected SoA layout (CSoA) [6] .
We find strong evidence that semi-direct addressing is superior for arterial and porous media geometries that use LBM, improving computational speed and reducing memory consumption; in comparison to indirect addressing schemes, semi-direct schemes improve computational speed by 5-20% over the majority of examined geometries and across different devices. We determine that the CSoA memory layout consistently provides computational acceleration with minimal coding effort and negligible memory increase; in comparison to SoA, CSoA improves computational speed by 5-20% over the majority of examined geometries.
The above work is valid for a single GPU, however large scale simulations may require many devices working in parallel. We implement the semi-direct and indirect addressing schemes along with both SoA and CSoA layouts on the leadership class Titan supercomputer. Our code base validates that the performance gains and memory reductions are consistent for computations that employ multiple GPUs. Our code is also the first to (i) show near-optimal strong scaling for arterial geometries when employing LBM on GPUs, and (ii) utilize a communication scheme to allow for semi-direct methods both to maintain their accelerated computational speed and to scale effectively to many-GPU computations.
The mechanisms of the performance gains are revealed with performance modeling. Our results are surprising considering that currently indirect addressing and SoA layouts are most heavily utilized in practice.
I. THE LATTICE BOLTZMANN METHOD
In the current work, we consider the D3Q19 lattice scheme, in which the spatial lattice is cubic and the local velocity distribution is discretized into Q = 19 discrete velocities. For the collision operator, we employ the Bhatnagar, Gross and Krook (BGK) collision operator [7] . For boundary conditions we use half-way bounce back [51] . We update resulting stencil computation for D3Q19 with the pull method (stream occurs before collision within the kernel; e.g., [17] , [42] ) which typically has superior performance to the push method (collision occurs before stream within the kernel) on GPUs. For more details on LB methods, see [47] .
II. DATA STORAGE SCHEMES
There are three primary issues in storing data for LBM: (i) how to order the lattice points into a one dimensional memory space, (ii) how to store and access the Q elements of the distribution, and (iii) how to retain geometric information. We denote these issues as '(memory) ordering,' '(memory) layout,' and '(memory) addressing,' respectively.
A. Memory ordering
To store information on a collection of two or three dimensional lattice points, lattice points must be ordered in memory along a single dimension. The standard ordering is lexicographic (dictionary order) in which consecutive coordinate values are compared to decide the order. For example, the lattice point (1, 2, 3) precedes the point (3, 2, 1) in any lexicographic ordering in which the x coordinate is compared before the z coordinate. Conversely, (1, 2, 3) is preceded by (3, 2, 1) in any lexicographic ordering in which the z coordinate is compared before the x coordinate. Lexicographical ordering is the predominant method; often the z-coordinate is the first coordinate examined and the x coordinate is the last.
In addition to lexicographical ordering, lattice points may be ordered by a multi-grid scheme. Although less common, this ordering has been employed by the HemeLB project [31] with a two-grid approach. In this approach, a coarse grid is given a width of b lattice points and the fine grid is required to have a width of one lattice point. Points are ordered according to the coarse grid location lexicographically (cubes of side length b); lattice points within the same coarse grid block are locally lexicographically ordered (see the regular locally-direct row in Figure 1 ). More-than-two-multi-grid orderings have also been employed. The most common is an oct-tree [13] , [19] Iy[P 10 ,-êy]+êx = P 2 +êx = P 10 (ir. Loc.-dir.)
Iy[3 ,-1 ]+1 = 2 +1 = 3 Fig. 1 . To exemplify the considered addressing schemes, examples are provided for a D2Q9 lattice scheme. An 8×2 (Mx×My) lattice grid is labeled by position P i , and fluid/solid points are specified by grey scale. All orderings (middle column) are lexicographic except for the regular locally-direct scheme (reg. Loc.-Dir.) which has two-grid order. An example of how each scheme accesses neighbor lattice points is presented: starting from position P 10 , data at position P 3 is accessed. In the third column, the top equation displays the accessed position, and the bottom the memory index.
although these schemes have not yet been implemented on GPUs.
The above orderings use the same lattice ordering for the Q distribution elements, however there are other techniques that employ a different ordering for each element. For example, the SHIFT method [27] considers a unique ordering for each element of the velocity distribution. When considering the standard algorithmic structure of LBM on GPUs, it is unclear if the SHIFT method would be competitive on GPUs and therefore is not analyzed in the current work.
There are also orderings that consider sets of irregular cuboids of lattice points that cover the geometry of interest: Points are lexicographically ordered within each cuboid [13] ; similar to the multi-grid ordering, lattice points are locally lexicographically ordered within the coarse blocks. We explore one such schemes in the fiber addressing scheme mentioned below (see Section II-C4).
B. Memory layout
Historically, there are two primary ideas for laying out the Q elements of distribution data in memory. The first is called the Array of Structures (AoS) in which the distribution data at each node is stored in a contiguous chunk of memory and these contiguous chunks are stored adjacently. The second is referred to as Structure of Arrays (SoA) in which Q arrays store one of the Q elements of the distribution for each lattice point; these Q arrays are stored adjacently (see Figure 2) . Given the lth ordered lattice point of N L lattice points, the qth distribution element of this point will be stored at memory location
for AoS, and
SoA takes advantage of how GPUs read from memory, allowing for contiguous memory accesses along a warp when either reading within collide (during push) or writing within stream (during pull); SoA is therefore the standard layout on GPUs (e.g., [2] [3] [4] [5] , [10] , [11] , [17] , [18] , [22] , [23] , [32] ... [34]- [36] , [42] , [50] ). With the SoA layout, varying the index q requires consecutive memory reads spaced by N L positions for each thread. To reduce this spacing while maintaining the advantage of using SoA on a GPU, we consider a type of addressing that holds smaller sub-collections of the lattice in the SoA framework. This addressing scheme has been deemed Collected Structure of Arrays (CSoA) [6] ; it has also been called Array of Structure of Array (ASA [46] or AoSoA [48] ), and tiled-AoS [24] ; we will use the notation CSoA in this work as it is specific to LBM. CSoA necessitates choosing a stride length, s, for the length of the sub-collections. Given s, the qth element of a distribution at the lth ordered lattice point is located at
To achieve aligned reads or writes on the GPU, s must be a multiple of 32 for single precision floating point numbers and 16 for double precision. In the present work s = 64 (see Figure  2 for an example with s = 2). If s = 1, CSoA is equivalent to AoS, and, if s = N L , CSoA is equivalent to SoA. Therefore CSoA is a generalized layout that includes both AoS and SoA. CSoA requires storage space for N L /s × s × Q pieces of data. In typical applications N L s, and additional required memory is negligibly small.
Another notable memory layout is called bundling [29] , [30] . Bundling groups subsets of distribution elements and stores them in contiguous groups; in existing applications three groups of distribution elements are formed: those with positive, negative, or zero z direction. Originally applied on CPUs, bundling laid out each group in the AoS-like structure, however recently an adaptation was made in [30] which is similar to laying out each bundled subgroup in a CSoA layout with a stride length of s = 16. For example, with Q = 19, there are 5 distributional velocities with positive z component. An index representing one of these directions q z+ ∈ {i} 4 i=0 at the lth ordered node is accessed in the corresponding group array at location
Velocities in different subgroups are still far away in a linear memory space with bundling. For a given bounding box, the amount of memory needed for direct addressing is
where sizeof(FLOAT) gives the number of bytes used by a floating point number, and N G is the number of grids used to store lattice data (in the current work N G = 2, i.e., an AB pattern). In complex geometries, fluid volume fractions of a minimal bounding box may be small. In many arterial geometries, for example, a minimal bounding box has a fluid volume fraction of 2%-10% -the majority of stored lattice information is superfluous, and high resolution simulations are not feasible due to memory constraints. Furthermore, direct addressing at small fluid volume fractions slows performance as GPU warps may have only a few activated threads. To avoid storing unneeded data and to activate GPU warps on a higher fraction of threads, a variety of alternative addressing schemes have been utilized, and we discuss these schemes for the remainder of this section.
2) Indirect addressing: Indirect addressing abandons implicit addressing and instead tracks local geometry by maintaining a neighbor-list for each fluid point [8] , [37] , [44] . Each fluid point requires memory storage for the Q elements of the distribution along with Q − 1 memory addresses of the neighboring lattice points (see Figure 1 ). The collection of all neighbor lists is often referred to as a connectivity matrix (with dimensions N F × (Q − 1); N F is the number of fluid lattice points in the computational domain). Neighborlists are ordered based on an enumeration of the position vectors in the particle distribution and have memory layout identical to the fluid list (AoS, SoA, CSoA, or bundled), ensuring that consecutive GPU threads may read neighbor data in consecutive memory locations. If a particular direction does not have a fluid neighbor, an out of bounds index may be used.
Typical implementations lexicographically order the fluid points [2] [3] [4] [5] , [32] , [33] . To the best of our knowledge, indirect addressing ordered lexicographically is the only sparse domain decomposition that has been applied to LBM on GPUs, with two exceptions in semi-direct addressing mentioned below.
Memory consumption of indirect addressing is dependent on the fluid volume fraction within the minimal bounding (6) where the interior sum represents the distribution data and the connectivity data respectively. Memory consumption scales linearly in p, with zero memory consumption as p → 0.
3) Semi-direct addressing: Semi-direct addressing uses implicit addressing while keeping the memory consumption low by only storing distributions at fluid nodes. On CPUs, this compression is achieved with a directly addressed array of pointers which point to structures containing distribution information at fluid nodes, and to the null pointer at non-fluid nodes [28] .
There are only two works that we know of that utilize semidirect addressing on a GPU [1] , [33] ; the former uses the original approach for CPUs, whereas the latter lays out the pointers and data contiguously in memory. The first method is more flexible, whereas the latter is more computationally efficient due to the ability to align reads or writes.
In the present work, we utilized the approach of [33] by contiguously storing all fluid lattice data in a single array with a given layout and order. To encode the connectivity data, a directly addressed vector of integers stores the memory location of each lattice point within the sparse contiguous collection of fluid distributions (see Figure 1) . If the lattice point with position (i, j, k) is not a fluid point, then an outof-bounds index is stored in this directly addressed integer vector. The directly addressed vector is used to determine the memory locations of neighboring lattice points. To encode a map from the fluid data to directly addressed integer vector, a third array is kept that associates the integer (kN + j)M + i with each fluid point at position (i, j, k).
Memory consumption of semi-direct addressing depends on the fluid volume fraction p and will consume approximately (7) where the first term in the sum holds the directly addressed memory locations and the map from the indirect to direct index, and the second term holds the stored lattice data.
4) Locally-direct addressing:
We now introduce a class of addressing schemes that have not been previously explored in the context of LBM on GPUs. The idea is to break the irregular domain into a collection of cuboids; each cuboid is directly addressed, and data structures are used to determine the relationship between cuboids. The cuboids may either be of identical sizes and placed on regular intervals, or may have irregular sizes and placed arbitrarily.
Regular locally-direct addressing: Regular locally-direct addressing relies on subdomains having predictable relationships. One natural idea is presented in [16] , [31] , in which the authors utilize a two-grid approach for CPUs. The coarse mesh is indirectly addressed with a well defined connectivity matrix. If a single lattice point within a coarse mesh location is a fluid point, then all fine lattice point data within the structure is kept and directly addressed; otherwise no lattice data is stored. It is also possible to address the coarse mesh semi-directly (see Figure 1) .
We choose to only examine a 3D thread block that spans a full coarse lattice block, so that the threads within a block access each locally direct subdomain. We restrict the domain to a cubic coarse mesh with a stride of b lattice points in all directions (in [31] 
Memory consumption of two-gird regular locally-direct addressing depends on the number of coarse mesh nodes to be addressed. Describing the fluid volume fraction of coarse mesh as p MG , simulations require either
depending on if we use an indirect, or semi-direct method for the coarse mesh (respectively); [13] where the authors use irregular domain decomposition to place directly addressed cuboids within subdomains to cover all fluid points; the subdomains are chosen to balance between minimizing fluid volume fractions while maintaining large collections of nodes to be directly addressed together. The authors advance the algorithm in parallel on CPUs. While irregular locally-direct addressing is promising for applications using GPUs, developing an algorithm to dynamically divide the domain into subdomains with high volume fraction is beyond the scope of this work Instead, we examine a fiber based addressing that is a type of locally-direct addressing. In the fiber addressing, subdomains are contiguous regions that have the same indices in two directions; this is to say that we consider collections of points with identical values of j, k in the y and z directions, and are contiguous. Each subdomain is linearly addressed from smallest to largest index i in the x direction, making the fibers locally directly addressed with a lexicographic order.
Each fiber must be able to access a given neighboring fiber. To achieve this, a four dimensional neighbor list is kept for each lattice point -each lattice point, located at (i, j, k) has access to the fluid data at positions (i, j ±1, k) and (i, j, k±1). The neighbors (i ± 1, j, k) are stored directly by incrementing or decrementing the index of the memory location by one.
To access other neighbors, multiple neighbor-lists are linked. One consequence of this addressing scheme is that the domain must be extended to include all non-fluid lattice points that have a fluid neighbor. The reason for this extension may be seen through example: if the lattice point at (i, j, k) requires data at (i ± 1, j + 1, k), but the lattice point (i, j + 1, k) is not a fluid point, the above extension there allows access data at (i ± 1, j + 1, k) (see Figure 1) .
Memory consumption for fiber addressing is
where p ext is the volume fraction of the fluid lattice points together with solid nearest neighbors. At high resolution, p ext ≈ p. When compared to indirect addressing, this fiber method saves (Q − 5) × sizeof(INT) bytes per fluid node. Both regular and irregular locally-direct addressing methods comprise a huge number of possible addressing schemes. The described methods only begin to examine the possibilities within these schemes.
D. Comparison of Memory Consumption
One of the major concerns in large scale, realistic, LBM computations on GPUs is memory consumption. To approximate memory consumption for each of the above addressing schemes, we must approximate p ext and p MG over arbitrary fluid domains; in general p ext > p and p MG > p, but on a highly resolved complex domain p ext ≈ p MG ≈ p. Taking this assumption of high resolution provides an estimate for optimal memory consumption, and with it we examine the relative memory consumptions of the above addressing schemes against indirect addressing. Indirect addressing is chosen to be the standard with which to compare other schemes because, on complex domains, it is used most prevalently. We plot the ratio of memory consumptions with each addressing scheme against the memory consumption of indirect addressing as a function of fluid volume fraction in Figure 3 (i.e., equations 5 and 7-10 against equation 6).
Locally direct schemes are the least memory intensive. Over the majority of fluid volume fractions, locally direct schemes lead to memory savings of between 15-20% when compared with indirect addressing. Semi-direct addressing consumes less memory than indirect addressing for p > (Q − 2) −1 ; for Q = 19, this corresponds to p ≈ 6%.
E. Description of tested kernels
We examine a suite of different addressing schemes. In order to describe memory layout we adopt the convention
where addressing may be D, I, SD, rLD.I, rLD.SD, or iLD.f for direct, indirect, semi-direct, regular locally-direct with coarse indirect, regular locally-direct with coarse semi-direct, or irregular locally-direct with fibers, respectively; ordering may be or 2G for lexicographic or two-grid ordering, respectively; layout may be S, C, or B for SoA, CSoA, or bundling, respectively. Scheme A list of previously examined kernels is given in Table I . For direct and regular locally-direct addressing kernels with two-grid ordering, the block structure on the GPU is (N C , 1, 1) , respectively, where N C is the number of coarse grid blocks. For indirect, semi-direct, and irregular locallydirect methods, the block structure is (B s , 1, 1) (with B s = 64) and the grid structure is ( N f /B s , 1, 1 ). All kernels are run at double precision.
The choice of addressing scheme is not enough to ensure optimal performance, and certain optimization strategies must be employed for CUDA kernels (e.g., [33] ). Our optimization strategies are consistent across kernels. To test their efficacy, we examine the D ( ,S) scheme on a 128 3 lattice at 100% fluid volume fraction and double precision. We find it runs at 99.2% of the measured memory bandwidth and 68.9% of the theoretical memory bandwidth at 100% fluid volume fraction. To calculate the observed memory bandwidth, we use the method from [49] and approximate the bandwidth to be 2 · Q · S10 6 · sizeof(FLOAT) bytes s −1 , where S is the speed in mega-fluid lattice updates per second (MFLUPS; s = 1659.9 in the above test; measured bandwidth was ).
III. GEOMETRIC TEST SUITE
We examine two examples of complex domains for real world LBM applications -porous media [30] , [43] and arterial flows [2] , [3] , [38] . We also examine an embedded cube which was the test geometry used in [29] .
To test on a real porous media data set, we use the 1024 3 data provided by [20] , [21] . Each data position holds an 8-bit grey scale value -larger values are more likely to be solid and smaller values more likely to be fluid. To distinguish fluid points from solid points, we set a threshold value: All data points above the threshold are considered to be solid (for an example see Figure 4 ). For grids that differ in size from is plotted for a suite of pull kernels run at double precision. The minimal computational cuboid is fixed to be 128 3 . In the semi-direct panels, we include the standard I ( ,S) kernel for direct comparison. The performance of direct and locally direct methods decays as volume fraction decreases; these methods are typically outperformed by semi-direct and indirect and are not shown.
1024
3 , weighted averages determine the grey scale value at each lattice point. Varying the threshold value changes the fluid volume fraction. When considering an embedded cube which represents an idealized porous media, the size of the cube determines the fluid volume fraction.
To test on a realistic arterial flow, we consider an aorta with data taken from [14] , [15] (see Figure 4) .
A. Hardware
We examine single GPU results on an NVIDIA Tesla P100 GPU and an NVIDIA Tesla K40 GPU; results on the K40 are qualitatively similar, with the exception that GPU kernels using no shared memory kernels tend to outperform kernels that use shared memory; all kernels were tested on the K40 with the exception of the bundling layouts for semi-direct addressing, which was not adapted for non-shared memory. Multi-GPU tests are performed on the Titan supercomputer at ORNL, which uses NVIDIA Tesla K20x GPUs. All devices have ECC enabled and utilized standard clock speeds.
IV. SINGLE GPU RESULTS

Varying volume fraction:
We examine performance over a wide range of fluid volume fractions over the embedded cube and porous media geometries. The minimal bounding box is fixed to be a lattice grid of 128 3 . The computational performances on the Tesla P100 GPU are presented for a variety of indirect and semi-direct addressing schemes in figure  5 ; ten time steps are taken over each tested volume fraction and kernel. We present kernels using shared memory only because they outperform kernels without shared memory on the P100. Surprisingly, semi-direct addressing outperforms indirect addressing by roughly 20% over most volume fractions and both geometries; this is unexpected as the two methods were observed to perform similarly on CPUs [29] . In fact, the SD ( ,C) kernel preforms nearly identically to the optimized direct kernels, achieving a memory bandwidth of 97.3% of the measured device bandwidth. For both indirect and semi-direct schemes, the CSoA layout (I ( ,C) ; SD ( ,C) ; stride length 64) outperforms both SoA and bundling layouts (I ( ,S) ; SD ( ,S) ; I ( ,B) ; SD ( ,B) ). The performance gain depends on the geometry and addressing choice, and varies between 5 and 20% over SoA. Lexicographic ordering is also found to be slightly superior to two-grid ordering when considering indirect addressing (I ( ,S) ; I (2G,S) ).
Direct and locally direct schemes have competitive performance for large fluid volume fractions, but their performance decays as fluid volume fraction shrinks. For the embedded cube, indirect addressing outperforms direct and locally-direct schemes for volume fractions below 30%; for the porous media, indirect addressing outperforms direct and locallydirect schemes for volume fractions below 94%. Above these volume fractions, the direct and locally direct methods perform quite well; however, below these volume fractions the performance drops precipitously. To illustrate this drop, average performances over the range of fluid volume fractions are presented in Table II for the porous media geometry.
Varying resolution: The above tests fix a resolution and change volume fraction. The choice of resolution, however, may have a large impact on performance. To test the effect of grid resolution, we examine the aorta with grid spacings ranging from 1.57mm to 0.196mm, and the porous media on grid spacings ranging from 117.2μm to 29.3μm. The aorta has a fluid volume fraction of around 2.2%. The porous media uses a variable threshold to ensure a 20% fluid volume fraction over all grid resolutions. We plot computational performance with respect to grid resolution in Figure 6 .
In both geometries, indirect addressing schemes provide more consistent performance with respect to semi-direct schemes; however, semi-direct methods typically outperform indirect addressing. Indirect addressing outperforms semidirect addressing only at low resolution in the aorta. Furthermore, CSoA is found to be the superior layout.
Regular locally-direct methods (not shown in Figure 6 ) become competitive with indirect methods at the two lower grid spacings in the aorta. However, for all other cases, locallydirect addressing methods are significantly outperformed by both semi-direct and indirect addressing.
Grid orientation effects:
The orientation of the geometry may have an effect on overall performance, as rotations may change the way and magnitude of the contiguous lattice points. To test this, we have investigated the effect of the orientation of the grid on an aorta and cylinder. Although there are minor effects, performance is robust with respect to orientation.
V. MULTIPLE-GPU RESULTS
We next confirm that the performance gains found on single GPUs scale when using multiple GPUs. We adapt the HARVEY code base [41] to run on GPUs with pull algorithms with four separate kernels for the memory schemes I ( ,S) , I ( ,C) , SD ( ,S) , and SD ( ,C) . HARVEY is an LBM fluid solver designed for large scale blood flow simulations, and has been shown to scale well on arterial geometries on over 1.6 million MPI tasks of the IBM Blue Gene/Q supercomputer [39] , [40] . To explore massive parallelization, the GPU adapted version of HARVEY was run on the Titan super computer at ORNL which consists of 18688 nodes, each equipped with an NVIDIA K20x GPU per node. Shared memory is not used because we have found that using it is inferior on the Kepler architectures.
Implementation of the indirect addressing scheme is similar to that of [30] . For semi-direct addressing, the directly addressed array is extended to include a halo, and points within the halo map to a second array that takes in the direction (q) and halo index and maps to the index of a third buffer array. The buffer array holds the communicated distribution data. The buffer array is used to communicate with MPI across GPUs. Communication schemes overlap with GPU kernel execution: In all four kernels, communication is entirely hidden when each GPU launches a sufficient number of threads. This is the first verification that semi-direct addressing has similar scaling properties to indirect addressing on multiple GPUs.
To confirm the observations on a single GP hold true at scale, tests are examined on all four memory schemes. For each kernel, HARVEY is started with a resolution of 84μm on 8 GPUs, 42μm on 64 GPUs, and 21μm on 512 GPUs; for each fixed resolution, the number of GPUs is increased until thread latency is exacerbated due to a reduced number of lattice points per GPU. The I ( ,C) , SD ( ,S) , and SD ( ,C) kernels all outperform the more standard I ( ,S) kernel, and the SD ( ,C) kernel outperforms all other methods. Speedup ratios are displayed in Figure 7 . Utilizing the SD ( ,C) kernel over I ( ,S) increases the computational speed by 10-40%. In decoupling the effect of changing addressing and layout, the I ( ,C) , and SD ( ,S) kernels both lead to typical performance gains between 5-20% when compared with I ( ,S) ; however, changing to the CSoA layout generally leads to a larger performance improvement than changing to semi-direct addressing.
The above results have favorable strong and weak scaling. To show this, the scaling behavior of the I ( ,S) kernel over the considered tests is shown in Figure 8 . For a fixed resolution, the amount of work on the GPUs decay as more devices are used and thus performance erodes. This is identical to the behavior seen in [30] . When the resolution is increased at a fixed number of GPUs, the simulations again approach optimal behavior. Other deviations from ideal scaling are explained by load imbalances -in Figure 8 we model the effect of the load imbalance by assuming the device with the most lattice points runs with the same MFLUPS as all other devices; we then determine how this load imbalance will alter ideal scaling away from that predicted by the 8 GPU case. This model demonstrates that nearly all deviation from ideal scaling is due to either thread latency or load imbalances. The favorable strong and weak scaling properties confirm that the performance gains of semi-direct and CSoA found on single GPUs hold at scale.
In addition to performance considerations, memory consumption is an important algorithmic consideration because GPUs have far less available memory when compared to CPUs. For example, a node on Titan has a single K20x GPU with 6G of global memory, but has 32G of RAM. For the complete geometry, the aorta takes up roughly 2% of the volume fraction in the minimal bounding box. Therefore, when examining the aorta on a single device, indirect schemes will consume less memory (see Figure 3) . When performing load balancing, local minimal bounding boxes for arterial systems tend to grow in fluid volume fraction -this means that as the number of GPUs increases on a fixed arterial, semidirect methods may consume less memory. To test the relative memory consumption, we examine the histogram of memory consumption with each addressing scheme over a range of GPUs. We find that as the number of GPUs increase, the average memory consumption decays -for example using 512 GPUs provides an average memory savings of 16%, which is near the optimal memory savings of 17% (see Figure 9) .
A. Impact on a real world problem
To assess the impact of the resources used on a nonidealized, real world problem, we consider a study in which several different stenosed aortas are subject to a variety of stresses, such as exercise, pregnancy or high altitudes [12] . One hundred variations on geometry, inflow, and blood viscosity were examined in this study; using the ORNL Titan billing system of 30 core hours per node along with 16 cores per node, we estimate this work, taking place on Titan CPUs, would have taken 70.8 million Titan core hours. Using the empirical multi-GPU estimates from above, we construct a model to estimate the Titan core hours needed; the model assumes room for 3GB of memory per GPU (to account for load balance imperfections), assumes high fluid volume fractions for each subdomain, and approximates MFLUPS from the empirical, rather than idealized, scaling results. We estimate that using the I ,S kernels on the GPUs would have reduced the core hours on this project to 1.2 million core hours; using the SD ,C kernel would further reduce the core hours to 0.97 million core hours. This means that our recommendations would save a total of 230,000 core hours between GPU implementations, a 19% reduction, due to a combination of reduced memory consumption and increased speed. No. GPUs 10 100 1000 Fig. 9 . Histograms of the memory usage per GPU when using the number of GPUs on the x-axis under weak scaling. The inset on the upper left shows the overall relative memory usage between semi-direct and indirect. In all cases, semi-direct uses less total memory than indirect addressing. At 512 GPUs with the aorta resolved at 21μm, there is a memory savings of 16%, which is close to the optimal limit of 17% (shown as the dashed line in the inset).
VI. PERFORMANCE MODELING
To understand why semi-direct addressing with the CSoA layout outperforms indirect addressing with the SoA layout, we use performance modeling with the Aspen modeling framework [45] . Aspen (Abstract Scalable Performance Engineering Notation) is an open-source, domain specific language and toolkit for performance modeling of HPC applications and architectures. In this work, we use the COMPASS framework [25] , one of the Aspen tools, which takes an input source code of an application and automatically generates a structured, parametrized performance model for the input application. Using COMPASS, we are able to quickly develop an Aspen application model of the kernels under investigation, and these generated models are used to predict memory access characteristics and runtime performance of the tested kernels. In [26] , we validated the performance model for the original HARVEY application by comparing resource usage (e.g., floating-point operations (flops), loads, stores, used memory, etc.) predicted by COMPASS against hardware-counter values measured by a PAPI tool. We also compared the predicted execution times by our performance model against measured values. To measure the actual resource usage and times, we compiled HARVEY with and without optimizations, and compared against both versions. The results showed that the COMPASS prediction closely follows the measured values of optimized versions for all three inputs, and scales similarly with the hardware measurements. We also validated models used in this work against empirical measurements, but we don't include the validation results due to space limitations.
A. Model-Based Performance Analysis
We first investigate why semi-direct addressing significantly outperforms indirect addressing on the tested GPUs. In applying LBM on GPUs with double precision, the D3Q19 algorithm is memory bandwidth bound, rather than compute bound. Therefore, in order to explain our results, it is crucial to analyze the amount of memory access required by each algorithm. Regular and aligned access patterns strongly affect performance when utilizing the full memory bandwidth of GPUs. We have modified the existing Aspen resource count tool to distinguish irregular memory accesses from regular ones. Figure 10 shows the predicted memory access characteristics of the indirect addressing and semi-direct addressing for the porous media geometry with 128 3 lattice points across various volume fractions. Compared to indirect addressing, semi-direct addressing reduces load operations. Given that we are memory bandwidth bound, easing the amount of data needed for transfer explains the speed-up. There are roughly 15% more data transfers needed for indirect addressing scheme which is similar to the percent performance gain between the methods.
The number of data stores/loads between CSoA and SoA is identical; to examine why CSoA outperforms SoA, we performed a postmortem analysis by fitting the Aspen prediction model into the measured execution times. Aspen reveals that SoA experiences a lower effective read/write bandwidth than CSoA. The higher effective read/write bandwidth incurred by CSoA indicates that the cache friendly behavior of CSoA is a major reason for its better performance.
VII. CONCLUSION
By examining a large range of addressing schemes, we have systematically investigated the efficacy of memory storage and access choices. In nearly all examined tests, semi-direct addressing outperforms indirect addressing, and the CSoA layout outperforms SoA and bundling layouts. These observations are determined from empirical tests on single and multi-GPU applications on realistic geometries. In testing on multiple GPUs, we have demonstrated the first positive strong scaling results on arterial geometries at high resolutions, as well as validated an implementation for semi-direct communications schemes. In determining the reason for the semi-direct and CSoA performance gains, we have employed performance modeling that reveals semi-direct schemes reduce memory transfers and CSoA layouts lead to more cache friendly behavior.
In terms of the cost to programmer time, switching code from the SoA to the CSoA layout is trivial; for comparison, switching from SoA to bundling maybe more labor intensive. Modifying existing code from indirect to semi-direct addressing is labor intensive; however, the addressing schemes are of comparable complexity when built from scratch.
In addition to these primary discoveries, there are a number of interesting tertiary findings: bundling layouts outperform SoA layouts and can be competitive with CSoA layouts; using a two grid ordering appears to only slightly decay performance compared to lexicographic ordering; and regular locally-direct schemes can have similar performance to indirect schemes, but only at high resolution; in these regimes, they will lead to significant memory reduction over other examined schemes.
These studies have also left some paths for future research. Among them are the cost of using different stride lengths in CSoA and bundling; investigating more general irregular locally-direct methods such as those found in [13] , or looking at how oct-tree addressing will effect performance on GPUs.
Overall, our work leads to the recommendation for LBM practitioners that CSoA layouts should be used in favor of other layouts for GPUs, and that semi-direct addressing should be used over indirect addressing.
