We propose a numerical approach based on the Lattice-Boltzmann (LBM) and Immersed Boundary (IB) methods to tackle the problem of the interaction of solids with an incompressible fluid flow. We explain in detail the parallelization of the complete method on two different heterogeneous platforms based on massively parallel hardware accelerators: NVIDIA Kepler GPU and Intel Xeon Phi. we describe a number of software optimizations, mainly focusing on memory management and host-accelerator communication. Based on a thorough perforrmance evaluation, we demonstrate that the baseline LBM implementation achieves satisfactory results on hardware accelerators. Unfortunately, when coupling LBM and IB methods on the heterogeneous platforms, the overheads of IB degrade the overall performance. As an alternative, we explore different heterogeneous implementations that effectively hide such overheads and allow us to exploit both the multi-core and the hardware accelerator in a co-operative way.
Introduction
The dynamics of a solid in a flow field is a research topic currently with a growing interest in many scientific communities. It is intrinsically interdisciplinary (structural mechanics, fluid mechanics, applied mathematics, . . . ) and covers a broad range of applications (e.g. aeronautics, civil engineering, biological flows, etc.) The number of works in this field reflects the growing importance of the study of the dynamics in the solid-fluid interaction [14, 15, 16, 17] . The use of accelerated, heterogeneous architectures to compute the fluid field is widely used within the Computational Fluid Dynamics (CFD) community due to the computational requirements of such applications, with significant performance results [11, 12, 13] . In contrast, the solid-fluid interaction has only recently gained wider interest.
The main objective of this work is the reduction the overhead caused by the simulation of solid-fluid interaction on heterogeneous platforms. In particular, we propose a number of hybrid strategies that distribute parts of the computation either to the hardware accelerator or the general-purpose CPU, depending on how they adapt to each part of the solver. We also give notes on specific optimizations to tune each one of the parts of the computation, adapting them to the characteristics of each architecture.
Classical fluid solvers based on the unsteady incompressible Navier Stokes equations may turn out to be inefficient or difficult to tune to achieve maximum performance on new massively-parallel platforms [18, 19] . A choice that better meets their characteristics is based on modeling the fluid flow through the Lattice Boltzmann method (LBM). Several recent works have shown that the combination of hardware accelerators and methods based on the LBM algorithm can achieve impressive performances due to the intrinsic characteristics of the algorithm. Certainly, the computing stages of LBM are amenable to fine grain paralelization in an almost straightforward way (see, for example, [11, 12] and references therein). Nevertheless, few works extend the parallel efficiency of LBM to cases involving geometries bounded by complex, moving or deformable boundaries. A very recent work that covers a subject closely related with the present contribution is the one by [13] where a new efficient 2D implementation of LBM method for fluids flowing in geometries with curved boundary using GPUs platforms is proposed. Curved boundaries are taken into account via a non equilibrium extrapolation scheme developed by [21] . S. K. Laytona et al. [31] propose a GPU implementation based on the IB projection method [32] , which computes the influence of one rigid solid into Navier-Stokes solver through the use of sparse linear algebra routines, using the open-source Cusp library to carry out these routines. This approach requires a higher computational compared with ours, attaining similar accuracy.
We consider a different approach for computing the influence of a solid immersed in a incompressible flow, which is based on different forcing approach [6] able to deal with complex, moving or deformable boundaries. It has been used on Lattice-Boltzmann [29] and Navier-Stokes [7] based solvers. In [20] , authors proposed the use of heterogeneous CPU-GPU platforms to minimize the overhead that the computing of the solid presence exhibits. This paper extends the aforementioned contribution in different ways. Our numerical framework is based on LBM coupled with an Immersed Boundary (IB) method able to deal with complex, moving or deformable boundaries [1, 2, 3, 4, 5, 6, 7, 29] . Special emphasis is given to the algorithmic and implementation techniques adopted to keep the solver highly efficient on heterogeneous host-accelerator platforms with independent memory spaces, based both on GPUs or the recently introduced Intel Xeon Phi accelerator. This paper is structured as follows. Section 2 introduces the physical problem at hand and the general numerical framework that has been selected to cope with it: Lattice-Boltzmann method (LBM) coupled with Immersed-Boundary (IB) technique based on the use of a set of Lagrangian nodes distributes along the solid boundaries. In Section 3, the specific potential parallel features of IB method over multicore and massively parallel architectures are presented. Different LBM approaches for dealing with the particular features of the architectures considered are discussed in Section 4. In Section 5, we detail the parallel strategies envisaged to optimally enhance the performance of the global LBM-IB algorithm on CPU-GPU heterogeneous platforms. Finally, Section 6 details the performance analysis of the proposed techniques and in Section 7 some conclusions are outlined.
The Lattice Boltzmann and Immersed Boundary methods
The Lattice Boltzmann method combined with an Immersed Boundary technique is highly attractive when dealing with bodies for two main reasons: the shape of the boundary, tracked by a set of Lagrangian nodes is a sufficient information to impose the boundary values; and the force of the fluid on the immersed boundary is readily available and thus easily incorporated in the set of equations that govern the dynamics of the immersed object. In addition, it is also particularly well suited for massively parallelized simulations, as the time advancement is explicit and the computational stencil is formed by few local neighbors of each computational node (support). The fluid is discretized on the regular Cartesian mesh while the shape of the solids is discretized in a Lagrange fashion by a set of points which obviously do not necessarily coincide with mesh points. The Lattice Boltzmann method has been extensively used in the past decades (see [22] for a complete overview) and it is now regarded as a powerful and efficient alternative to classical Navier Stokes solvers. In particular, LBM has been used to simulate high Reynolds turbulent flows both using Direct Numerical Simulation and Large Eddy Simulation [23] . Another challenging application where LBM has proved to be quite successful concerns aeroacoustics problems [24] . In what follows we briefly recall the basic formulation of the method. The LBM is based on an equation that governs the evolution of a discrete distribution function f i (x, t) describing the probability of finding a particle at Lattice site x at time t with speed v = e i . In this work, we consider the BGK formulation [25] that relies upon an unique relaxation time τ toward the equilibrium distribution f
ω8 ω7 ω6 ω3 ω5 ω1 Figure 1 : The standard two-dimensional 9-speed lattice (D2Q9) used.
The particles can move only along the links of a regular Lattice defined by the discrete speeds (e 0 = c(0, 0); e i = c(±1, 0), c(0, ±1), i = 1 · · · 4; e i = c(±1, ±1), c(±1, ±1), i = 5 · · · 8 with c = ∆x/∆t) so that the synchronous particle displacements ∆x i = e i ∆t never take the fluid particles away from the Lattice. For the present study, the standard two-dimensional 9-speed Lattice D2Q9 (Figure 1 ) is used [26] , but all the techniques that will be presented can be extended in a straightforward manner to three dimensional lattices.
The equilibrium function f (eq) (x, t) can be obtained by Taylor series expansion of the Maxwell-Boltzmann equilibrium distribution [27] :
In Equation 2, c s is the speed of sound (c s = 1/ √ 3) and the weight coefficients ω i are ω 0 = 4/9, ω i = 1/9, i = 1 · · · 4 and ω 5 = 1/36, i = 5 · · · 8 according to the current normalization. external volume force f (ib) (x, t), the contribution on the lattice are computed according to the formulation proposed by [21] as:
The multi-scale Chapman Enskog expansion of Equation 1, neglecting terms of O( M 2 ) and using expression 3, returns the Navier-Stokes equations with body forces and the kinematic viscosity related to lattice scaling as ν = c 2 s (τ −1/2)∆t. Without the contribution of the external volume forces stemming from the immersed boundary treatment, Equation 1 is typically advanced in time in two stages: collision and streaming.
Collision stage:
Streaming stage:
Depending on the ordering of the two stages, two different strategies arise.
The classical approach is known as the push method and performs collision before streaming. We have adopted instead for pull method [28] , which performs the steps in the opposite order. This can lead to an important performance enhancement on fine grained parallel machines. A short discussion about the different implementations and achieved performances using the two orderings will be detailed later on.
Next, we briefly introduce the Immersed Boundary method that we use both to enforce boundary values and to recover the fluid force exerted on immersed objects within the framework of the LBM algorithm [6, 7] . In the taken IB approach (as in several others), the fluid is discretized on a regular Cartesian lattice while the immersed objects are discretized and tracked in a Lagrangian fashion by a set of markers distributed along their boundaries. The general setup of the present Lattice Boltzmann-Immersed Boundary method can be recast in the following algorithmic sketch.
Given f i (x, t) compute:
Compute :
Interpolate on Lagrangian markers (volume force):
Repeat collision with body forces (see 3) and Streaming:
As outlined above, the basic idea consists in performing each time step twice.
The first one, performed without body forces, allows to predict the velocity values at the immersed boundary markers and the force distribution that restores the desired velocity boundary values at their locations. The second one applies the regularized set of singular forces and repeats the procedure advancing (using Equation 3) to determine the final values of the distribution function at the next time step. The key aspects of the algorithm and of its efficient implementation depend on the way the interpolation I and the S operators (termed as spread from now on) are applied. Here, following [6] and [7] we perform both operations (interpolation and spread) through a convolution with a compact support mollifier meant to mimic the action of a Dirac's delta. Combining the two operators we can write in a compact form: whereδ is the mollifier, to be defined later, Γ is the immersed boundary, Ω is the computational domain, and U d is the desired value on the boundary at the next time step. The discrete equivalent of 4 is simply obtained by any standard composite quadrature rule applied on the union of the supports associated to each Lagrangian marker. As an example, the quadrature needed to obtain the force distribution on the lattice nodes is given by:
where the superscript l refers to the l th component of the immersed boundary force, (x i , y j ) are the lattice nodes (Cartesian points) falling within the union of all the supports, N e is the number of Lagrangian markers and n is a value to be determined to enforce consistency between interpolation and the convolution 5 . More details about the method and in particular about the determination of the n values can be found in [7] .
In what follows we will give more details on the construction of the support cages surrounding each Lagrangian marker since it plays a key role in the parallel implementation of the IB algorithm. Figure 2 illuastrates an example of the portion of the lattice units that falls within the union of all supports. As already mentioned, the embedded boundary curve is discretized into a number of markers X I , I = 1..N e . Around each marker X I we define a rectangular cage Ω I with the following properties: (i) it must contain at least three nodes of the underlying Eulerian lattice for each direction; (ii) the number of nodes of the lattice contained in the cage must be minimized. The modified kernel, obtained as a cartesian product of the one dimensional function [8] 
will be identically zero outside the square Ω I . We take the edges of the square to measure slighlty more than three lattice spacings ∆ (i.e., the edge size is Author Re = 20 Re = 40 Re = 100 ditions are set as: Inlet: u = U, v = 0, Outlet: ∂u ∂x = ∂v ∂x = 0, Upper and lower boundaries: ∂u ∂y = 0, v = 0, Cylinder surface: u = 0, v = 0, Volume fraction: 0.5236%. When Reynolds number is 20 and 40, there is no vortex structure formed during the evolution. The flow field is laminar and steady. In contrast, for the Reynolds number of 100, the symmetrical rectangular zones disappear and an asymmetric pattern is formed. The vorticity is shed behind the circular cylinder, and vortex structures are formed downstream. This phenomenon is graphically illustrated in Figure 3 .
Two important dimensionless numbers are studied, the drag (CD = FD 0.5ρU 2 D ) and lift (CL = FL 0.5ρU 2 D ) coefficients. F D corresponds to the resistance and F L is the lifting force of the circular cylinder, ρ is the density of the fluid, and U is the velocity of inflow. In order to verify the numerical results, the coefficients were calculated and compared with the results of previous studies ( Table 1 ). The drag coefficient for Reynolds number of 20 and 40 is equal to the results presented by Zhou et al. [13] . The drag coefficient obtained for Reynolds number of 100 is identical to the results obtained by Silva et al. [17] , and the lift coefficient is close to the presented by Zhou et al. [13] .
Immersed Boundary Method Implementations
This section presents the strategy that we have adopted for the efficient parallelization of the IB algorithm when executed on CPU-GPU heterogeneous platforms. The computations related with the Lagrange markers (support) distributed on the solid/s surface can be parallelized efficiently on both, CPU and GPU. As already mentioned the whole algorithm can be seen as a two steps procedure: a first, global LBM update, and a subsequent local correction to impose the boundary values. It is well known that the memory management plays a crucial role in the performance. To compute the IB method, it is necessary to store the information about the coordinates, velocities and forces of all the Lagrangian points and their supports. A set of memory management optimizations, which depends on the access pattern, have been carried out for the IB method implementation on both platforms, multicore and GPU, to achieve an effective memory usage.
Memory management
Two different memory management approaches are proposed depending on the use of multicore or GPU, since both architectures exhibit different memory features and a substantially different hierarchy. The multicore approach stores the information of a particular lagrangian point and its support in nearby memory locations, which benefits the exploitation of coarse grain parallelism. In contrast, in order to achieve a coalesced access to global memory, the GPU approach distributes the information of all lagrangian points in a set of one-dimensional arrays. In this way, contiguous threads access to contiguous memory locations. Particular attention was paid to the access pattern to the information of the supports. It was necessary to store the information of the set of all supports in a unified array to carry out coalescing memory accesses, achieving that elements which share the same index on different supports are located in continuous memory locations. For clarity, Figure 4 shows the memory mapping performed on both platforms in a simplified example with two consecutive Lagrangian points.
As shown, the use and the exploitation of each memory hierarchy, main (CPU) and global (GPU) memory, is totally different. Furthermore, an additional advantage is found in the use of the memory structures, allowing the transfer the information of the solid(s) between both memories by carrying out a single memory transfer.
Parallelization approaches on multi-core and GPU
The degree of parallelism of the IB method is given by the number of Lagrangian points. The multicore approach carries out a coarse-grain parallelism by mapping a set of continuous Lagrangian points on each core which are solved sequentially. This distribution is well balanced and the use of the memory is optimized by using the memory structures previously described ( Figure 4 ). The set of Lagrangian points can be easily parallelized with this approach, annotating some of its loops with OpenMP pragmas.
On the GPU, the implementation consists of using two basic kernels. The After the spreading step the forces are stored in the global memory by using atomic functions. These atomic functions are performed to prevent race conditions. Particularly, we used these operations to avoid incoherent executions, since the supports of different Lagrangian points can share the same Eulerian points, as graphically shown in Figure 2 . The pseudo-code of the IFC kernel is graphically illustrated in Algorithm 1.
After the execution of the IB related computations (IFC kernel), the lattice forces as in equation 3 (Section 2) need to be determined. Before tackling this next stage it is necessary to introduce a synchronization point that guarantees that all the IB forces have been actually computed on all the points within the union of the supports. Nonetheless, the global memory access required to determine the system of lattice forces is larger than in the previous stage: 9
directions for each lattice node in the support. Also in this case to inhibit race conditions it has been necessary to resort to atomic functions. As for the case of the equilibrium distribution, also here the computation of the lattice force contributions is carried out using registers. The pseudo-code for this final kernel is given in algorithm 2.
Algorithm 2 BFC kernel. for j = 1 → 9 do 9:
[y])) 10 :
end for 12 : end for
Lattice-Boltzmann Implementations
This section explores different strategies to enhance performance of LBM according to the architectures used in this work. Special emphasis is put on memory management and how to implement the different steps of LBM.
Parallelization approach
Falta!!
Memory mapping strategies
Being a memory-bound algorithm, it is expected that memory management has a substantial impact on the attained performance. This is even more relevant for hardware accelerator processors, usually following a many-core architectural paradigm. As an example, especially on warp-oriented architectures Nx*Ny approaches can be more efficient on other type of architectures, e.g. shared memory multi-cores.
Here, we propose a number of memory access strategies adapted to LBM, namely:
1. Uncoalesced. This strategy does not suppose an elaborate management of memory. Basically, the set of 9 lattice directions is stored in consecutive locations of memory. Thus, the lattice level is stored in memory as an entre este y el anterior no queda clara con estas dos frases). This approach has been widely used on CUDA-based implementations and it is considered the optimal approach for this kind of architectures [11, 12] .
Henceforth, we use a different memory management based on Structure of Arrays (SoA). Despite this approach is more amenable to vector units, 3. Hybrid. This approach arises to take advantage of the main features of the aforementioned strategies. It consists of joining both features, a small distance among the 9 lattice-direction and a efficient exploitation of vector units. In this sense, groups of the same lattice-direction are located in memory consecutively (Figure 9 ). The vector size usually imposes the maximum number of elements to be grouped. Faltan las ventajas especficas de este approach con respecto a lo anterior).
To exploit the high data parallelism presented in LBM, we will keep in memory two copies of the lattice (f 1 f 2 ) in all implementations. Each time step alternatively takes one of the copies as the input, and writes the result to the second.
Push and pull implementations
Depending on the ordering of the collision and streaming stages two different strategies arise. The classical approach is known as push method and performs collide before streaming. Otherwise, the pull approach performs the steps in the opposite order.
The pull approach
The pull computational scheduling of LBM was introduced by [28] and has been recently consider by [12] . This is an efficient approach based on a single- A schematical sketch of the LBM implementation is given in Algorithm 3.
Basically, this strategy does not need any synchronization among steps, being very profitable for parallel architectures. Furthermore, the top regions of the memory hierarchy can be efficiently exploited, as the macroscopic level can be completely computed without transfers to/from main memory, better exploiting cache re-use.
The push approach
Next, we introduce the main characteristics of the LBM implementation based on push approach (collide-stream strategy). This computational scheduling of LBM has been used in numerous previous works [13, 30, 11] . In general, the push method divides the LBM steps into two routines: the first one computes the collide and stream steps; the second one computes the macroscopic Algorithm 3 LBM pull. on the performance when implemented on our target architectures.
Coupled LBM-IB on Heterogeneous Platforms
After describing different implementation and strategies for LBM and IB as isolated entities, we describe next two different approaches towards the integration of both techniques on heterogeneous architectures.
Coupled LBM-IB on hybrid CPU-GPU platforms
Our first parallel implementation of the LBM-IB method performs all the major steps on the GPU. The host CPU is used exclusively for a pre-processing stage that sets up the initial configuration and uploads those initial data to the GPU memory and a monitoring stage that downloads the information of each lattice node (i.e., velocity components and density) back to the CPU memory when required. As shown in Figure 10 Although this approach achieves satisfactory results, its speedups are substantially lower than those achieved by pure LBM solvers [12] . The obvious reason behind this behavior is the ratio between the characteristic volume fraction and the fluid field, which is typically very small. Therefore, the amount of data parallelism in the LBM kernel is substantially higher than in the other two kernels. In fact, for the target problems investigated, millions of threads compute the LBM kernel, while the IFC and BFC kernels only need thousands of them. But in addition, those kernels also require atomic functions due to the intrinsic characteristics of the IB method, and those operations usually degrade performance.
As an alternative to mitigate those problems, we have explored a heterogeneous implementation graphically illustrated in Figure 10 (bottom). The LBM kernel is computed on the GPU as in the previous approach but the whole IB method and an additional local correction to LBM on the supports of the Lagrangian points are performed on the CPU in a coordinated way using a pipeline. This way, we are able to overlap the prediction of the fluid field for the t + 1 iteration with the correction of the IB method on the previous iteration t at the expense of a local LBM computation of the"t + 1" iteration on the CPU and additional transfers of the supports between the GPU and the CPU at each simulation step.
Coupled LBM-IB on hybrid CPU-Intel Xeon Phi platforms
We present next an alternative hybrid implementation targeting heterogeneous CPU-Intel Xeon Phi platforms, which favors the particular features of this type of systems. Unlike the previous platform, this approach is more suitable for a proper integration among both architectures, multi-core and Intel Xeon Phi.
This approach attempts to take advantages of the transparency that this system offers in terms of programmability, being the communication among both architectures simpler. One of the main claims of Intel with respect to these platforms consists of intending that the sections to be executed on multi-core and the Intel Xeon Phi are as similar as possible. In order to follow these guidelines, a higher importance is given to multicore architecture, as it shows a profitable performance computing LBM, instead of computing only IB. This new strategy basically consists of breaking the data dependency among LBM and IB method by dividing the whole domain into two sub-domains, where the LBM and IB method are solved in one of these domains, while the rest of the fluid domain is solved in the another one. According to the features of both algorithms, LB and IB, and architectures, Intel Xeon and Intel Xeon Phi, we have proposed the approach illustrated in Figure 11 .
Next, it is detailed the main features of this approach focusing on the distribution of the computational load, as well as the inter-architectures communication. As known, the performance in term of GFLOPs shown by Phi is higher (around 3× higher) with respect to multicore architectures. Furthermore, the IB method presents some drawbacks to be efficiently managed on massively parallel architectures.
For the purpose of favoring the overlapping space, taking into account the performance of both architectures, the influence of the solid and a piece of fluid domain (around 25% in relation to the entire one domain) is computed on multicore, while the rest of the domain is solved on Phi. Obviously, the hybrid approach requires an additional cost with respect the homogeneous strategies in terms of memory transfers. In particular, the boundary regions among both domains are in need to be communicated each other from (to) both memories spaces. No additional communications are needed since each architecture works on its own space of memory (domain). The communication is based on transferring a common space composed by two columns, one for reading and one for computing. These two columns are shared by both architectures. However, the column to be computed on multicore (gray column in Figure 11 ) is read by Phi, and vice-versa (white column). These transfers involve a very small portion of the domain, and so they are very fast. In addition, as depending on the memory layout selected, the elements of the shared columns could not be contiguous in memory, they are packed/unpacked prior to/after each data transfer between memory spaces.
As described previously, we make use of 2 lattices (f1 and f2 ) in order to be able to compute in only single-loop function all the LBM steps, where each time step read from one copy and write results to the other. The communication is carried out at the end of each time step. It consists of overwriting those boundary regions which are in need to be communicated, that is the gray column in the Xeon domain and the white column in the Phi domain. In particular, it is overwritten the lattice updated in the current time step (f2/f1 for even/odd time steps).
Performance Evaluation

Experimental setup
To critically evaluate the performance of the developed LBM-IB solvers, next we consider a number of tests executed on two different heterogeneous architectures, CPU-GPU (i.e., Xeon-Kepler) and CPU-Phi system. More details about the specific architectures that have been used for performance evaluation are given in Table 2 . According to the memory requirements of the kernels, the memory hierarchy of the GPU has been configured as 16KB shared memory and 48KB L1, since our codes do not benefit from a higher amount of shared memory on the investigated tests. All the simulations have been performed using double precision and as a performance metric we have used the conventional MFLUPS metric (millions of fluid lattice updates per second) used in most LBM studies.
Immerse boundary performance evaluation
The first tests ( Figure 12 ) focus exclusively on the IB method using a synthetic simulation without considering the LBM method and analyze its acceler- ation on both multicore and GPUs. Even for a moderate number of Lagrangian nodes, we achieve substantial speedups over the sequential implementation on both platforms. Despite the overheads mentioned above, our GPU implementation is able to outperform the multicore counterpart (8 cores) from 2500 Lagrangian markers on.
Lattice-Boltzmann performance evaluation
The for Intel Xeon and 512-bit for Intel Xeon Phi. Although many approaches exist to achieve and efficient vectorization (e.g. Cilk Plus, Array Notation, SIMD directives or Auto-vectorization), we have mainly explored Auto-vectorization as a way to maintain a common code baseline among Intel architectures, guiding the compiler towards an efficient SIMD exploitation. In the following, we explore and illustrate some of the tuning approaches to adapt the LBM to our platforms.
LBM on the NVIDIA Kepler GPU
Based on insights extracted from a number of previous works [9, 10, 11, 12, 13] , we have focused on analyzing the performance of LBM on GPU by exploiting the coalesced memory mapping presented in the previous section, and a thread mapping consisting of one thread per lattice unit, leaving proved the efficiency of this approach. In the following, we will analyze both LBM approaches: push and pull. In particular, two alternative push approaches, macro-collide-stream and collide-stream-macro, are evaluated. Although the pull approach offers several advantages over the push approaches, the latter are commonly required numerically for some alternative implementations, such as multi-domain refinement [33] , so they are also worth a special attention.
Obviously, a strong point of synchronism among the collide-stream and macroscopic steps of LBM and a higher number of accesses to global memory degrade the performance of the push approaches over the pull strategy. More specifically, the macro-collide-stream push scheme achieves better performance with respect to the collide-stream-macro push one. This is mainly due to the position of the macroscopic level computation. No entiendo lo anterior!!!. . .
In both schemes, it becomes mandatory to add a synchornization point among macroscopic and mesoscopic (collide and stream) levels; however, computing the macroscopic level at the beginning allows us to use just one kernel instead of two as in the collide-stream-macro scheme. This is translated into a lower number of CPU-GPU communications and reduces the overhead of kernel invocations, and thus the impact on performance. Figure 13 1.5× in terms of MFLUPS.
LBM on the Intel Xeon
After verifying the superiority of the pull LBM approach over its main alternatives in terms of performance, we will also focus on the use of this approach for the rest of the architectures targeted in the paper. Deberiamos hablar de la estrategia de paralelizacion sobre multi/manycore. Besides the chosen parallelization scheme, we will mainly focus on the difference between each memory access strategies presented in previous sections. Here, we evaluate the three memory strategies (uncoalesced, coalesced and bleded/hybrid) and their impact on performance. To show the scalability of each approach and the influence on the parallelism, we have run two tests which make use a sequential and a parallelized implementation of the pull approach on a dual-socket CPU with 8 cores per chip. coalesced and hybrid approaches with respect to the sequential counterpart respectively. For both approaches, the memory has been adapted for the size of vector units.
No entiendo lo anterior
Focusing on scalability, Figure 15 illustrates the trend in performance by increasing the number of threads for the coalesced memory mapping. In particular, a constant trend is shown by increasing the number of thread up to achieve 32 threads, with small differences in performance when varying the grid size. However, from 16 hardware threads on, only those problems which present a sufficient workload (that is, grid size) are amenable to continue increasing the performance.
LBM on the Intel Xeon Phi
As for the Intel Xeon, we first evaluate the different memory mappings on Intel Xeon Phi. Figure 16 shows the efficiency obtained by each strategy. Contrary to what has been achieved in multicore CPU, the hybrid approach achieves here a slight gain over the coalesced one. tion of the thread-core affinity. In our case, three different pre-defined affinity strategies have been studied: scatter, balanced and compact (see [? ] for details on the rationale of each strategy). Taking into account that the best choice for our problem, that is the hybrid approach, the three thread-core distributions are evaluated for it. Figure 17 shows the performance achieved by the different strategies, being the most efficient the compact one; differences between them (in terms of performance) are reduced and not significant.
One of the common tuning options that Intel Xeon Phi exposes is the selec-
After verification of the best strategy for memory mapping and core affinity, we carry out a study about the scalability over the same test bed previously evaluated by the multicore CPU architecture (Figure 18 ). Except for the smallest grid sizes, a common trend is shown for the most scenarios, a higher number of threads achieves a better performance. In general, comparing the Xeon Phi implementation with its counterpart on the general-purpose multi-core CPU, our approach achieves a gain around 25× and 4× with comparing with the sequential and multi-threaded (32 threads) codes respectively, in terms of MFLUPS.
Coupled LBM-IB on heterogeneous platforms
After evaluating the performance of both methods (LBM and IB) in an isolated way, we present the strategies carried out for the implementation of CPUs, the first accelerated by a Nvidia Kepler GPU, and the second by an Intel Xeon Phi.
LBM-IB on hybrid CPU-GPU architectures
The performance of the complete LBM-IB solver is analyzed in Figure 19 .
We have used the same physical setting as in Section 2 for an increasing number of lattice nodes to analyze the scalability of the method. We have investigated two realistic scenarios with characteristic volume fractions of 0.5% and 1% respectively (i.e. the amount of embedded Lagrangian markers also grows with the number of lattice nodes).
The performance of the homogeneous GPU implementation of the LBM-IB method drops substantially over the pure LBM-pull implementation (Figure 19 ).
The slowdown is around 15% for a solid volume fraction of 0.5%, growing to 25% for the 1% case. In contrast, for these fractions our heterogeneous approach is able to hide the overheads of the IB method, reaching similar performance to the pure LBM-pull implementation. LBM corrections) for the heterogeneous approach does not suppose an additional cost over the pure LBM solver, representing around 65% of the total time consumed by the LBM kernel.
LBM-IB on hybrid CPU-Xeon Phi architectures
Similarly, the performance of the proposed strategy over the CPU-Xeon Phi heterogeneous platform is analyzed. In particular, we have focused on the same numerical scenario with a volume fraction equal to 1%. We exploit the pullcoalesced memory mapping on the multi-core and the pull-hybrid approach on the Intel Xeon Phi.
In general, a substantial gain in performance is achieved by the heterogeneousoverlapped approach (Figure 21-right) with respect to the heterogeneous-non overlapped one ( Figure 21 -left); as an example, for the largest grid tested, the non-overlapped approach attains a peak of roughly 300 MFLUPS, while the counterpart operlapped implementation improves this number up to roughly 450 MFLUPS. As in the CPU-GPU approach, the performance achieved by this CPU-Xeon Phi heterogeneous implementation roughly matches the performance achieved by the pure-LBM implementation (Figure 18 ), and thus, the impact of the introduction of the solid interaction is mainly hidden.
In this approach, an approapriate work distribution between both platforms is mandatory; in order to find the best load balancing among the portion of the domain executed on the multicore and the portion executed on the Intel Xeon Phi, we have visualized the trend of performance by increasing the size of the domain executed on multi-core (and thus, equivalently, decreasing the amount of work delivered to the Intel Xeon Phi).
If a non-overlapping approach is to be taken (left plot in Figure 21 ), the weak performance of the multi-core compared with the Intel Xeon Phi, together with the penalty introduced by the PCI-e bus for data transfers between memory spaces, makes it unfeasible to introduce the multi-core as a processor for part of the computation; see, also, that this effect is even more evident as the size of the (sub-)grid processed by the multi-core increases. The conclusions for the overlapped implementation (right plot in Figure 21 ) are dramatically different.
In this case, both platforms can co-operate to solve the problem, and data transfers penalty is hidden if conveniently orchestrated. In fact, the only necessity is for the workload to be correctly balanced. Thus, the percentage of the grid assigned to the multi-core and to the Intel Xeon Phi should be approximately constant, and depends on the grid size. See, for example, how the peak performance for the largest grid tested, where L x = 5120, L y = 1400 is attained when the portion of the grid assigned to the multi-core is around L x = 620, L y = 1400 (that is, around 12% of the complete grid); for the smallest grid tested, where L x = 2560, L y = 800, the peak is attained when L x = 250, L y = 800, which roughly means a 10% of the complete grid.
Conclusions
In this paper, we have investigated the performance of a coupled Lattice-Boltzmann and Immersed Boundary method that simulates the contribution of solid behavior within an incompressible fluid.
We have presented different strategies to enhance LBM performance over three current parallel architectures, Intel Xeon, NVIDIA Kepler GPU and Intel Xeon Phi. Special emphasis has been put on memory management within the many-core platforms, hiding of the impact of data transfers, and on the introduction of different numerical approaches. A more detailed study is carried out on Intel Xeon Phi, as few current works focus on LBM implementations, and even less on coupled LBM/IB approaches. On the other hand, it is possible to find a number of LBM solvers based on NVIDIA GPU architectures. In this sense, our multicore approach achieves a reasonable performance by considering a coalesced-pull approach. It obtains a peak performance around 120 MFLUPS by using a dual socket multi-core CPU. Although the Intel Xeon Phi shows a much higher performance by using a hybrid-pull scheme with respect to the coalesced-pull approach on multicore, being around 3.75× faster, the strategy based on coalesced-pull implemented on GPU outperforms these results with a extra benefit around 17%.
In addition, we have focused on the design and analysis of a hybrid implementation that takes advantage of both the accelerator (GPU/Xeon Phi) and multicore in a co-operative way. For realistic physical scenarios with realistic solid volume fractions, our heterogeneous solvers are able to hide the overheads introduced by the IB method and match the performance (in terms of MFLUPS) of state-of-the-art pure LBM solvers.
Our target problem exhibits a dynamic behavior (i.e. its computational cost varies through the time). However, we could take advantage from this characteristic by distributing those stages with a low computational cost (mainly the IB computation) over the general-purpose multi-core, and those with a high computational cost (LBM) over the accelerator, overlapping both computations in time. In particular, we have proposed two different strategies, one for CPU-GPU architectures and the second for CPU-Xeon Phi, promoting the particular features that each platform presents.
As a future research topic we plan to investigate more complex physical scenarios that require higher amount of memory, making mandatory the use of distributed memory architectures equipped with multi-GPU/Phi platforms.
