The Unified Parallel C (UPC) language from the Partitioned Global Address Space (PGAS) family unifies the advantages of shared and local memory spaces and offers a relatively straightforward code parallelisation with the Central Processing Unit (CPU). In contrast, the Computer Unified Device Architecture (CUDA) development kit gives a tool to make use of the Graphics Processing Unit (GPU). We provide a detailed comparison between these novel techniques through the parallelisation of a two-dimensional lattice Boltzmann method based fluid flow solver. Our comparison between the CUDA and UPC parallelisation takes into account the required conceptual effort, the performance gain, and the limitations of the approaches from the application oriented developers' point of view. We demonstrated that UPC led to competitive efficiency with the local memory implementation. However, the performance of the shared memory code fell behind our expectations, and we concluded that the investigated UPC compilers could not efficiently treat the shared memory space. The CUDA implementation proved to be more complex compared to the UPC approach mainly because of the complicated memory structure of the graphics card which also makes GPUs suitable for the parallelisation of the lattice Boltzmann method. 
INTRODUCTION
In the world of High Performance Computing (HPC), effective implementation and parallelisation are vital for novel scientific software. Computational Fluid Dynamics (CFD) targets fluid flow modelling, which is a typical application field of HPC. While the Message Passing Interface (MPI) [Message Passing Interface Forum 2012] has become the dominant technique in parallel computing, other approaches, like the Partitioned Authors' addresses: M. Szőke, Queen's Building, University of Bristol, Bristol, BS8 1TR, United Kingdom; email: m.szoke@bristol.ac.uk; T. I. Józsa, Max Born Crescent, King's Buildings, Alrick Building, School of Engineering, The University of Edinburgh, Edinburgh EH9 3FB, United Kingdom; email: t.jozsa@ed.ac.uk; A. Koleszár, Soliton Systems Development Center A/S, Spotorno Alle 12, Taastrup, 2630, Denmark; email: adam.koleszar@solitonsystems.com; I. Moulitsas and L. Könözsy, College Road, Cranfield University, Cranfield, MK43 0AL, United Kingdom; emails: {i.moulitsas, laszlo.konozsy}@cranfield.ac.uk. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. c 2017 ACM 0098-3500/2017/07-ART8 $15.00 DOI: http://dx.doi.org/10.1145/3085590
Performance Evaluation of a Two-Dimensional Lattice Boltzmann Solver
8:3
The lattice Boltzmann method has become quite popular on parallel architectures because of the local nature of the operations which is discussed in Section 2. The first parallel solvers were relying on CPUs, and were reported in the 1990s [Amati et al. 1997; Kandhai et al. 1998 ]. Later on, GPU architectures were proven to be a good basis to parallelise the LBM. Two-dimensional CUDA implementation was published by Tölke [2010] , where a speedup of ≈20 was reported. Several descriptions of threedimensional solvers can be found, such as Ryoo et al. [2008] , where a speedup of 12.5 was measured. Rinaldi et al. [2012] reached a speedup of 131 using advanced strategies with CUDA. The LBM solver was proved to be highly efficient in multi-GPU environment as well [Xian and Takayuki 2011] . As far as the authors know, only Valero-Lara and Jansson [2015] considered using UPC for the LBM.
The advantages of the LBM are its good scalability, explicit timestep formulation, applicability for multiphase flows [Succi 2001] , and applicability for flows with relatively high Knudsen number [Mohamed 2011] . The latter one is one of the main limitations of the NSE. The LBM has been developed for incompressible subsonic flows, it has second order of accuracy, and relatively high memory requirements due to the discretisation of particle directions.
In this article, we present and compare the performance gain achieved after the CUDA and UPC parallelisation of an in-house LBM code. The solver handles twodimensional fluid flow problems using the LBM. The comparison of the two currently applied architectures is not widely discussed from the performance point of view. Since HPC is often used by mathematicians, physicists, and engineers with limited computer science knowledge, we also aim to inspect how user-friendly CUDA and UPC are. We investigated the effect of the following factors:
(a) memory structure of UPC: shared and local variables; (b) spatial resolution; (c) hardware; (d) collision models; (e) data representation: single and double precision; (f) required programming effort for the different codes.
The UPC implementation was evaluated on two different clusters, and the results were compared to the CUDA parallelisation on two different architectures. To quantify the performance the speedup was defined as SU = t serial /t parallel , where t serial is the execution time of the serial code and t parallel is the execution time of the parallel code.
THE IMPLEMENTATION OF THE LATTICE BOLTZMANN METHOD
The Boltzmann equation was derived to describe the motion of a large number of particles on a statistical basis. The idea behind the LBM is to discretise the Boltzmann equation so that particle propagation is allowed only in certain discrete directions. The governing equation is written as Equation (1), where f = f ( x, p, t) is the distribution function, which represents the probability of "finding a molecule around position x at time t with momentum p " [Succi 2001] . c is the microscopic particle velocity, and = ( f ) is the collision operator. The Chapman-Enskog procedure forms a clear bridge between the LBM equation (1) and the NSE (in the artificial compressibility form proposed by Chorin [1967] ), thus the method can be used to describe incompressible fluid flows [He and Luo 1997] .
(1) Fig. 1 . Structure of the computational domain.
We applied the D2Q9 model, which means that the resolved flow field was twodimensional, and particle propagation was allowed in nine discrete directions (Figure 1) . The time marching was divided into four steps from the implementation point of view:
(1) Collision: The collision modelling treats the right-hand side of Eq. (1). Two collision models were analysed: (a) In the BGKW model, the collision is described by Eq. (2), which was derived by Bhatnagar et al. [1954] and by Welander [1954] . Here τ is the relaxation factor, calculated from the lattice viscosity of the fluid, and f eq is the so-called local equilibrium distribution function described by the Maxwell-Boltzmann distribution [Maxwell 1860] . It is important to note that the collision term can be computed independently for every direction after the discretisation as
(b) The Multi Relaxation Time (MRT) model was presented and reviewed by d 'Humières [1992] and d 'Humières et al. [2002] . In this case, instead of using a single constant (τ −1 ) to describe the collision, the model applies a matrix which depends on the resolved directions (D2Q9). In our case, this matrix has a dimension of 9×9. This approach yields a matrix multiplication for each lattice. Despite the fact that this model is computationally more expensive than the BGKW, it is widely applied since the flow field can be more accurately resolved.
(2) Streaming: The streaming process occurs when the directional distribution functions "travel" to the neighbouring cells: second term on the left-hand side of Equation (1). This process is presented in Figure 1 (b). (3) Boundary treatment: These steps are followed by the handling of the boundaries.
In the current article, the boundary description suggested by Zou and He [1997] was used for the moving wall, and the so-called bounce-back boundary condition [Succi 2001 ] was used to handle the no-slip condition at the stationary walls. (4) Update macroscopic: Once the distribution functions were known at the end of the time step, the macroscopic variables (density, x-and y-directional velocity components) had to be computed from the distribution functions to recover the flow field.
The sum of the listed items is referred to from now on as the main loop. The grid generations and the initialisation of the microscopic and macroscopic variables took place before the main loop.
In terms of the computational grid, the method is based on a uniform Cartesian lattice, which is represented in Figure 1(a) , where the nine directions of the streaming are displayed as well. In order to examine the effect of the spatial resolution on the performance gain, four different grids were investigated. In the following, the lattices are referred to based on the names given in Table I .
The simulated fluid flow was the well-known lid-driven cavity, which is a common validation case for CFD solvers [Ghia et al. 1982] . The aspect ratio of the domain was unity. The lid on the top moved with a defined positive x-directional velocity, while all the other boundaries were stationary walls (see Figure 1(a) ). From these conditions, the Reynolds number in the domain was defined based on the lattice quantities as Re = n x u lid /ν l , where n x = n is the number of lattices along the x-direction, u lid is the lid velocity, and ν l is the lattice viscosity, which was set to be 0.1. The Reynolds number of the simulations was 1,000. At the end of the computations, the coordinates and the macroscopic variables were saved. A qualitative validation of the flow field can be found in Appendix B. For a more comprehensive validation and verification, we refer to the work of Józsa et al. [2016] .
PARALLELISATION
The serial code was written in C, and the code was built up as it was presented in Section 2. The serial implementation was based on one-dimensional Array of Structures (AoS), similarly to the UPC version. The Cells structure included the macroscopic variables (velocity components as U and V, density as Rho, etc.) as scalars, while the distribution function F was stored as a nine-dimensional array within the Cells structure. Thus (Cells+i)->F[k] referred to the i th lattice in the domain and the corresponding distribution function in the k th discrete direction. The parallelisation process can be followed in Appendix A. The serial simulations were performed on Archer, the United Kingdom National Supercomputing Service.
First, we parallelised the solver with UPC and ran the codes on Archer and Astral. The latter is the HPC cluster of Cranfield University. The hyper-threading technology [Intel 2015 ] was switched off on both architectures. The properties of the two clusters are listed in Table II . A higher performance can be expected on Archer since it holds several optimisation properties such as hardware supported shared memory addressing. Note that the interconnection between the nodes is different in the two investigated [Chauwvin et al. 2007]. clusters. On Archer the commercial Cray C compiler was available, while the open source BUPC compiler was installed on Astral.
Second, the CUDA parallelisation was tested. The performance gain was evaluated on two different nVidia GPUs. The relevant properties of the graphical cards are listed in Table III . While the GeForce cards are cheaper devices, as they are primarily designed for computer games, the Tesla cards are more expensive, directly designed for scientific computing. The code development was carried out on a desktop using the GTX 550Ti card, while the Tesla GPU of the University of Edinburgh's Indy cluster was used for further investigations.
Unified Parallel C Approach
UPC is an extension of the standard C language but it requires its own compiler. The novelty of UPC relies on its memory structure, where the user has the opportunity to lay down variables in the shared address space and in the local memory at the same time. A schematic draw of the local and the shared memory spaces is shown in Figure 2 . Implementing UPC based codes therefore offers the opportunity to combine the advantages of the conventional parallelisation methods, such as OpenMP and MPI, which restrict the programmer to rely purely on either shared or local memory only. The UPC compiler is designed to manage and handle the data layout between the threads and nodes. This keeps the architecture based problems hidden from the user, i.e., optimisation is expected to be performed by the compiler.
The syntax of the local memory declarations is the same as in the standard C language. Data exchange between the threads is performed via the upc_memput, upc_memget, and upc_memcpy functions. The first function copies local data to the shared space, and the second copies data from the shared to the local memory. The last function performs data copy from shared to shared address space. Note that the first and second functions are similar to the MPI_Send and MPI_Recv functions.
The shared memory based data needs to be declared according to the UPC standards. In this case, the programmer must declare a compile time constant block size, which defines how many elements of a vector belong to each thread. The data is laid down between the threads in a round-robin fashion using the corresponding block size. We can see that the compile time constant restriction is the biggest drawback of the investigated language. If the programmer wishes to lay down the whole mesh-for example, here in the shared memory-then the mesh size needs to be known in advance of the compilation. In other words, different executable files are needed for different mesh sizes.
To perform shared memory based operations, UPC offers the usage of upc_forall, which is an extension of the standard C for loop. Each shared variable within a vector has an affinity term that describes which thread the given element belongs to. Based on this information, the upc_forall distributes the computational load between the threads. UPC also offers the usage of barriers, locks, shared pointers, collectives, etc. For further description, we refer to Chauwvin et al. [2007] .
In our case, two UPC codes were implemented: (a) one with shared memory implementation, and (b) another code relying on the local memory. The streaming step of these codes is given as an example in Appendix A. In the former code, the data is laid down in the shared memory, and in the latter one, the data is stored in the local memory. The shared memory based code exploits the novelty of UPC, i.e., this code relies on the upc_forall function and shared pointer declarations. This leads to a more easily readable code. The second approach, which exploits data locality and follows the logic of MPI implementations, offers better speed and lower latency time. As a disadvantage, the upc_memput, upc_memget memory operations were required; therefore, this code is more complex and consists of more lines. The computational load was distributed equally between the threads in both implementations, since the mesh size was divisible by the number of threads during the simulations.
During the compilation of the UPC codes, similarly to the serial code, the performance affecting flags were avoided. The compile command of the UPC codes-for example, using four threads-was upcc -T=4 *.c -lm -o LBMSolver. The execution simplified to upcrun LBMSolver on both HPC systems.
nVidia CUDA
The CUDA SDK contains several additional functions to make the programming of nVidia GPUs possible. The programming of the graphical card, from now on referred to as "device," requires some basic knowledge of its structure. The working units are the so-called threads, which form separate blocks. Theoretically, every thread can work in parallel. Originally the threads could perform operations only on the data which was stored in the device memory. This meant that the programmer had to work out the data transfer between the host and device memory. The most recent solution of nVidia to bridge this problem is the unified memory (available since v6.0), which enables automatic data migration between the host and the device. Nevertheless, the host memory bandwidth is evidently the bottleneck of the available performance gain. It is a rule of thumb that the data transfer between the host and the device should be minimised for high efficiency.
After the data has been copied to the device, CUDA offers an opportunity to manage the multilayer memory structure of the GPUs. In the current study, only the global and the constant memories were used. When data is copied to the device, it is stored originally in the global memory (Table III) . Every thread has access to this data; however, this memory has the smallest bandwidth on the device. To quicken the speed of the calculations, the parameters can be stored in the constant memory which has a higher bandwidth but a smaller size.
The shared memory capability becomes increasingly important when the communication between the blocks is high. For instance, in the case of the so-called prefix sum, when every element of a vector is summed. Originally, the threads within different blocks can communicate only through the relatively slow global memory. However, the threads have access to the shared memory only within the blocks; the appropriate management of this memory layer often leads to significant performance gain because of its high bandwidth and low latency. For further description on the memory structure of the GPU and on CUDA programming we refer to Nickolls et al. [2008] and Sanders and Kandrot [2010] .
The CUDA parallelisation included the following main steps (presented through the streaming in Appendix A):
CUDA 1: Several smaller structures were used instead of the Cells structure-for example, Cells_var_9d_d-to store the nine-dimensional distribution functions (Appendix A, CUDA parallelisation step 1). The inefficient for loops were avoided, so that the streaming was performed theoretically in parallel for every node in every discrete direction. CUDA 2: The so-called Structures of Array (SoA) approach was implemented, which proved to be more efficient compared to the AoS approach [Ryoo et al. 2008] . This means that every variable (e.g., density and velocity components) was stored in separate one-dimensional arrays. This SoA version of the code proved to be ≈5 times faster than the AoS version. The maximum speedup with this implementation, measured with the ultrafine lattice on the Tesla K20 card using SP, was ≈20. This is similar to the speedup achieved by Tölke [2010] on a 2D lattice, but seems infinitesimal compared to the 131 speedup reported by Rinaldi et al. [2012] using a 3D domain. CUDA 3: The SoA approach was kept but the threads were assigned to the lattices and the discrete directions were taken step by step as is shown in Appendix A. We note that the operations related to the 0 th discrete direction were ignored during the streaming in the last version of the CUDA code because they do not have any physical role. This simplification did not influence the performance significantly. After the boundary treatment was identified as the bottleneck of the computations (see Section 4, Figure 6 ), the global search, which was performed based on a Boolean mask at every timestep to find the boundary lattices, was replaced. In the last version, the boundary lattices were selected during the initialisation so that the boundary treatment kernel function "knew" the location of the boundaries in advance.
In every case, one-dimensional grids and blocks were used; furthermore, 256 threads were initialised within every block (block size). The number of the blocks (grid size) varied automatically as a function of the mesh size. This setup proved to be the most efficient computationally, although it resulted in a strong limitation in terms of the maximum mesh size. The theoretical maximum thread number of the devices is 65,535×1,024 (maximum grid size times maximum block size). In the first two implementations the threads were assigned to every distribution function in every lattice which led to a maximum cell number 65,535×256/9≈1,864,207. This limitation was overcome in the third CUDA parallelisation step by assigning the threads to the cells. This way the maximum number of lattices was nine times higher. The last version of the CUDA code was compiled with the nvcc -arch=sm_20 -rdc=true command. Here the first flag defines the virtual architecture of the device, while the second one allows the user to compile the files separately and link them at the end. (The first and the second version did not need the -rdc=true flag, since the kernel functions were in one file.) The authors note that compiling the code with a more recent virtual architecture for the K20 GPU-for instance, -arch=sm_35-would probably result in an enhanced parallel performance. The -arch=sm_20 flag was used because this was the most recent virtual architecture supported by both of the tested GPUs. The detailed analysis of the code performance can be found in Section 4.
RESULTS AND DISCUSSION
In this section, we will discuss the achieved speedup of the main loop. Firstly, the serial results will be analysed. Secondly, the UPC approaches will be evaluated. Finally, the nVidia CUDA parallelisation will be examined and compared to the UPC approaches. In addition, we will evaluate the applied parallel approaches from the code developer's point of view.
Serial Simulations
The measured iteration times of the serial simulations are listed in Table IV . As expected, the simulation time increased with the grid size. A factor of approximately 4 can be identified between the computational cost of the coarse-medium and fine-ultrafine grids, which is logical since the grid size increased exactly by the same factor. This observation is not valid between the medium and fine meshes. While the variables for the coarse and the medium mesh can be fitted into the cache, the memory requirement of the fine and the ultrafine meshes exceeds the cache size. As we expected, the MRT model was computationally more expensive than the BGKW. Compared to the BGKW model, the MRT execution time was typically 10%-50% higher, and as the number of cells increased the MRT model became relatively cheaper.
Performance Analysis of the UPC Codes
The shared and local approaches were compared in terms of the achieved speedup in Figure 3 . The dash-dotted line shows the theoretical limit, the linear speedup (SU = N threads ) in all figures presented in the current subchapter. Figure 3 (a) presents the shared memory based speedup achieved on Astral and Archer using the fine mesh and double precision. We can see that the shared memory based code led to poor parallel performance. By using any number of threads, the gained performance was far below the linear speedup. Furthermore, for lower number of threads (four and eight), the parallel code was slower than the serial. By crossing nodes, the speedup continued to increase indicating appropriate communication between the nodes. Figure 3(a) indicates that none of the simulations exploit the maximum performance of the supercomputers. Despite the hardware support of the shared memory operations on Archer, the simulations did not show better speedup results compared to Astral. We hypothesised that the compilers could not handle the shared pointers and manage the data between the shared and local memory spaces effectively. As a first step, performance analysis was conducted on Archer using CrayPAT [Cray Inc. 2015] , which showed that the data was managed and "fed" to the CPUs properly.
As a next step, we tried to find other reasons for the problem. We reckon that the poor performance might have been caused by one of the following factors, and so we took measures to overcome them: (a) usage of shared pointers. All of them were tested with static variables; (b) inappropriate time measurement. We tested different approaches such as the clock(), and the MPI_Wtime() commands; (c) inappropriate usage of the upc_forall command. Different methods were examined to distribute the computational load; for instance, working threads were defined based on affinity of shared variables (&Cells[i]) or modular division of integers (i % THREADS); (d) usage of Cells structure. The structure was eliminated (see the code samples in Appendix A); (e) lack of optimisation flags. All of the available optimisation flags (-O1, -O2, -O3) were tested.
None of these modifications resulted in better speedup in the case of shared variables, i.e., the experienced performance of the shared code was the same for each listed factor. Therefore, we concluded that the compiler cannot handle the data properly without additional modifications, and for this particular application the compiler is still not mature enough. As was presented by Zhang et al. [2011] , the problem could be resolved by outer libraries and user implemented machine level data management. Adding these low level modifications to the shared code would eliminate the main advantage of UPC, namely, the quick and user-friendly parallelisation environment. To overcome this problem the data was rather transferred to the local memory and another, MPI-like code was developed. We plotted the effect of mesh size on the speedup in Figure 5 (a) and (b) for Astral and Archer, respectively. If we consider more than 32 threads, then we may conclude that better speedup was achieved with increasing mesh size. Unexpectedly, this finding was not valid for the ultrafine mesh, where performance degradation was experienced on both architectures. To find the "leakage" in the performance we measured the time spent with the data transfer on the fine and the ultrafine meshes using 128 threads. With this setup, the halo included twice as much data on the ultrafine mesh than on the fine mesh, so that the data transfer should take roughly twice as long as well. In contrast, the data transfer took six times longer on the ultrafine mesh compared to the fine. The performance degradation experienced on the ultrafine mesh was caused by the increased communication costs, which seems to be a relatively strong limitation. We note that the ultrafine mesh consists of approximately one million cells, so the computational load of the processors is still reasonably low when 128 threads are allocated.
Performance Comparison of the UPC and CUDA Codes
In this subsection, only the local memory based UPC implementation run on 64 threads will be analysed and compared to the serial and CUDA simulations. Figure 6 shows profiling results of the different codes. Based on Figure 6 (a) and (b) one can identify the collision and the streaming as the most expensive operations. We can conclude from these two figures that the MRT model is slightly more expensive than the BGKW model. Although the boundary treatment is relevant only for the nodes on the perimeter of the domain, it still needs approximately as much time as the update macroscopic operation, because the boundary nodes are treated based on a global search.
Compared to the serial results, the UPC implementation drew attention to the inefficient boundary treatment which became one of the most expensive operations in the UPC approach (Figure 6(c) and (g) ). Furthermore, the parallelisation highlighted the increased cost of the collision in the MRT model, which was more expensive compared to the other steps (Figure 6(g)) .
The MRT collision model was found to be computationally more expensive on the GPU as well. We can see how the collision step became less expensive during the development (Figure 6(h)-(j) ), although it still took approximately 40% of the main loop. The MRT model includes summations along the discrete directions. In the first and the second CUDA development steps these summations required operations between the blocks, while in the final version the summation happened within the blocks, so that it could be done more efficiently. While the first and the second CUDA development steps were more favourable for the streaming, the third step was specifically developed to decrease the cost of the collision.
The scalability of the codes as a function of the hardware is displayed in the bar charts in Figure 7 . As we can see in Figure 7 (a) and (b), the speedups of the BGKW and the MRT models were bounded around 50 on Archer, and around 80 on the Tesla K20 card. In the case of the K20 card, the performance gap between double and single precision execution is clearly visible: while a maximum speedup of around 80 was measured with single precision arithmetic (Figure 7(a) ), a speedup around 65 could be achieved with double precision arithmetic in the case of the BGKW model (Figure 7(c)) . A similar trend can be seen in the case of the MRT model as well, with a slightly wider gap between the single precision and double precision arithmetic (Figure 7 (b) and (d)). Considering that the double precision processing power of the K20 unit is approximately a third of its single precision processing power (Table III) , it might seem surprising that the performance gap is only around 20%. If we also consider that two of the main steps in the LBM (streaming and boundary treatment) are essentially data copying, then the relatively small gap makes more sense: the high memory bandwidth of the GPU compensated for the lack of computing power.
Interestingly, the GTX550Ti device showed better parallel performance with the MRT model compared to the BGKW model (Figure 7(a) and (b) ). While this card gave higher speedup with Double Precision (DP) in the case of the BGKW model, using DP led to a drastic performance drop with the MRT model (compare Figure 7 Based on the charts, the K20 card performed in almost every case better than the GTX550Ti. In fact, the K20 card was slightly slower than the GTX550Ti only when the grid size was small. In our case, the medium grid proved to be big enough to utilise the better potentials of the K20 GPU. As the grid size increased, we could measure an increasing speedup in the case of the K20 device, while the GTX550Ti card reached its limits at the fine mesh. These results mirror the GPUs' evolution, and correlate well with the hardware parameters (e.g., CUDA cores) given in Table III. Ideally, in the case of the CPU parallelisation, when the number of threads is kept constant and the grid size changes, a nearly constant speedup can be expected. After looking at Figure 7 , it becomes clear how far away this application is from an ideal situation: increasing the number of lattices up to a certain point (fine mesh) resulted in an increasing speedup. This happened because as the problem size increased, the time spent with the halo swap decreased relative to the time spent with the computation of the different operations. The speedup on Archer with 64 threads was close to the ideal when single precision arithmetic was used on the fine mesh (Figure 7(c) and (d) ). However, using double precision arithmetic means an increased computational load for each thread; it also means increased communication between the threads. Probably this is the reason why the parallel performance of the double precision execution was lower on Archer compared to the single precision simulations when the fine mesh was investigated (Figure 7(c) and (d) ).
In order to gain a better understanding of the results, deeper analyses of the code are required. The speedup of the main parts of the code are shown in Figure 8 . It is important to recognize that, theoretically, only the speedup of the collision step should change when we consider different models. Indeed, the other operations show only a small deviation (compare Figure 8(a) and (c) with (b) and (d)). After a first look, we can see that the boundary treatment, which was identified as the bottleneck of the performance (Figure 6 ), was significantly improved in the final step of the CUDA code. This high speedup was measured, because the global search of the boundaries was replaced (see Section 3.2). Furthermore, it is also visible that the other parts of the code had a relatively uniform speedup in the case of the CUDA implementation: for the K20 card around 50 with single precision and 40 with double precision. When compared with the K20, the GTX550Ti device showed better speedup of the boundary treatment but a worse speedup of the other operations. The unexpected behaviour of the GTX550Ti card when compared to K20 can be caused by its structure which was designed for gaming, or the different CUDA release (see Table III ).
The speedup of the operations measured on Archer shows less deviation. However, it was found that the measured speedup exceeded the theoretical limit (64) several times, especially for the collision and the update macroscopic parts, while other times the parallel code underperformed. This behaviour seems to be logical if we take into account that the collision and the update macroscopic operations do not require any communication between the processes. Furthermore, the data of the partitioned mesh fit the cache of the nodes, while the same data exceeded the cache size of a single node in the case of the serial execution.
UPC and CUDA Beyond Performance
Because of the complex memory structure of the GPUs, the CUDA parallelisation was more cumbersome and time-consuming. We can see that certain parts of the code (e.g., boundary treatment) needed significant reformulation in order to reach higher performance. Furthermore, it is important to note that for a CPU related parallelisation it is reasonably well understood how the efficiency can be improved, but for a complex application, like the current study, the CUDA related optimisation requires more conceptual effort and the identification of performance loss is more complex. All in all, the final version of the CUDA code required roughly twice as many working hours as the UPC parallelisation. The final code was achieved via three steps, and there are still several opportunities to further increase the performance, for instance, using the shared memory of the cards or multiple graphical cards. Although the straightforward shared memory approach of UPC proved to be inefficient, the classical MPI-like local parallelisation technique gave acceptable results. Thanks to its simple syntax, it needed less programming effort compared to CUDA. The corresponding number of lines for the different codes are listed in Table V . We can see that the most efficient CUDA implementation took ≈40% more lines, while the longest UPC implementation needed only ≈15% more lines compared to the serial code. It is arguable whether counting the number of lines is representative of the programming effort but in this case it is well correlated to the required work. The efficient parallelisation with CUDA required roughly twice as many office hours than the two UPC implementations. Without a doubt, we can conclude that the highest amount of conceptual effort was required by the CUDA approach followed by the local UPC code and the shared UPC code.
It is important to see what kind of compromises application oriented developers need to make. The situation can be described with the help of the triangle shown in Figure 9 . The edges of the triangle contain the good properties of a high performance programming approach: low conceptual effort, high performance, and low hardware costs. For the lattice Boltzmann method, the first possible scenario includes a low cost hardware (Tesla K20 of $2,300) and a reasonably high performance but we have to pay the price of conceptual effort because of the GPU's programming environment. Another extreme scenario is when a higher budget is available (Intel E5 processors, each for $2,900), and we can work with a more flexible programming environment, and probably end up with an efficient code in a shorter period of time. This situation makes the local memory based UPC programming environment a suitable candidate. Additionally to these two cases, when the available budget is limited, one can consider running single core simulations on a cheap hardware. This would clearly require longer computations because of the low performance. The choice between the three scenarios is still usually made by the time frame, the available hardware, and skill set. However, the shared memory approach of UPC aims to provide another low effort, high performance scenario; our investigations highlighted that the compilers need further development to achieve this goal.
CONCLUSIONS
We parallelised an in-house, two-dimensional lattice Boltzmann solver using CPU and GPU parallelisation approaches. We presented the UPC implementation of the lattice Boltzmann method and compared the parallel efficiency of our CUDA and UPC codes using the two-dimensional physical problem of the lid-driven cavity. The UPC codes were tested with two different compilers on two different clusters, while the CUDA codes were run on two different GPUs. A detailed performance analysis of the different implementations was performed to provide insight into the parallel capabilities of UPC and CUDA when it comes to the lattice Boltzmann method.
The parallelisation of the collision proved to be crucial since this is the part in the algorithm where the majority of the computation happens. Based on our experience, the efficiency of this part determines the globally experienced efficiency, and it typically means favourable implementation for the update macroscopic operation as well. We would like to draw attention to the boundary treatment as well, since it can easily become the bottleneck of the parallel code, although its execution time is essentially limited by the memory bandwidth similarly to the streaming.
The UPC code using the shared memory approach showed surprisingly low performance compared to the serial code. We found that the investigated compilers could not automatically manage the data transfer between the threads efficiently. The further development of the compilers may solve this issue and make the UPC approach more user-friendly and attractive for future scientific programmers. Until then, we can enjoy the simple syntax of UPC for local memory based implementation, which was proven to be more efficient and suitable for the parallelisation of the lattice Boltzmann method.
The CUDA development was presented through three different steps which highlighted that the used data structures (namely, the AoS and the SoA approaches), and the data distribution strategies have a significant effect on the parallel performance. We can confirm that the nVidia graphics cards, especially the ones designed for scientific computing, are highly suitable for the parallelisation of the lattice Boltzmann method. Based on our measurements, a single GPU might compete with three to four supercomputer nodes (around 80 threads) or more, for a significantly lower price. To reach this high performance developers need more specific skills and programming effort when it is compared to the local UPC implementation.
CODE AVAILABILITY
The developed codes are available as open source and can be downloaded from GitHub at https://github.com/mate-szoke/ParallelLbmCranfield. The codes are available under the MIT license. The folders also include all mesh and setup files used to perform the documented simulations.
APPENDICES

A. CODE SECTIONS
