To predict the properties of fluid flow over a solid geometry is an important engineering problem. In many applications so-called panel methods (or boundary element methods) have become the standard approach to solve the corresponding partial differential equation. Since panel methods in two dimensions are computationally cheap, they are well suited as the inner solver in an optimization algorithm.
Introduction
Numerical simulations are routinely used in applications to predict the properties of fluid flow over a solid geometry. Such applications range from the design and analysis of aircrafts to constructing more efficient wind turbines. In this context, a large number of different models and numerical methods have been developed in order to efficiently compute aerodynamic quantities such as lift and drag. It is generally accepted that the compressible Navier-Stokes system is able to represent the physics that is encountered in such systems faithfully. However, since solving the full Navier-Stokes equations is often computationally prohibitive, a hierarchy of models has been introduced that are able to provide good predictions for a selected range of problems.
In the following we will assume that the flow under consideration is irrotational and slow compared to the speed of sound (i.e., compressible effects can be neglected). This assumption reduces the Navier-Stokes equations to Laplace's equation. One should note that a direct solution of Laplace's equation would result in a body with zero lift. However, by imposing an additional constraint, the so-called Kutta condition, this simple model yields very accurate results in its regime of validity (even for lifting bodies such as airfoils, rotor blades, or fins). In addition, many phenomenological corrections have been developed that are able to extend the range of validity of this simplified model considerably.
In principle, any numerical method can be used to solve Laplace's equation together with the Kutta condition. However, since we are usually interested in the fluid flow outside of a solid body, so-called panel methods (or boundary element methods) have become the standard approach. The advantage of such a method is that only the boundary has to be discretized. This implies that for a two-dimensional flow only a linear system in a single dimension has to be solved (although the corresponding matrix is no longer sparse). In addition, no error is made by introducing an artificial boundary faraway from the dynamics of interest. On modern computers a good implementation is able to compute, for example, the flow over an airfoil in less than a milliseconds (although this has not always been true in the past). Especially in the early days of computational fluid dynamics, performing such simulations was the only way to gain reasonable results. As a consequence, sophisticated software packages (such as Xfoil [5] ) have been developed that are still used in current aerodynamics research (see, for example, [8, 9, 6] ).
Panel methods are computationally cheap, and that fact makes them well suited as the inner solver in optimization algorithms. In this context the goal for the computer program is to find an ideal airfoil geometry given a target (or fitness) function. This is a nonlinear optimization problem as the geometry is the parameter under consideration. In recent years, genetic algorithms have enjoyed some success for such systems (see, for example, [1, 8, 7] ). However, their application yields a new computational challenge as they require the computation of thousands or even hundred thousands of different airfoil configuration. This is a problem which lends itself very well to parallelization. As such it can potentially profit significantly from accelerators such as graphic processing units (GPUs) or the Intel Xeon Phi. In fact, a number of papers have been published that implement panel methods on GPUs (see, for example, the work conducted in [12, 2, 11, 3] ). However, most of the literature focuses on the three dimensional case and the speedups reported seem to be somewhat optimistic and are almost never compared to a properly optimized central processing unit (CPU) code.
In this paper, we consider an implementation of a two-dimensional panel code that has been optimized for both the CPU and the Xeon Phi 7120 (using OpenMP and vectorization techniques) as well as for the NVIDIA K80 GPU (using CUDA). We use the Intel C++ compiler as well as the Intel MKL library which provides highly optimized LAPACK routines. In addition, we use the MAGMA library on the GPU (in order to further optimize the GPU code and exploit the second GPU present in the K80 expansion card).
The panel method used in this paper is introduced in section 2. We will discuss the hardware used and the corresponding performance characteristics in section 3. The timing results and details of the specific implementation under consideration are then presented in sections 4 (single GPU), 5 (Xeon Phi), and 6 (two GPU setup). Finally, we conclude in section 7.
Numerical algorithm
Panel methods are a type of boundary element methods and are widely used in fluid dynamics. In order to remedy the deficiency of Laplace's equation to describe the airflow over lifting bodies, they are supplemented by the empirically derived Kutta condition. This model, in many instances, gives a good description of lifting flow over solid bodies for low speed aerodynamics. In the following, we will limit ourselves to two-dimensional flows over wing cross sections (so-called airfoils).
The geometry of the problem is given by a sequence of points x 0 , x 1 , . . . , x n ∈ R 2 that represent the discretization of an airfoil ∂Ω. We assume that x 0 is located at the trailing edge and that x n = x 0 holds true. This setup is illustrated in Figure 1 .
The discretized geometry of the NACA 2412 airfoil is shown (for the purpose of illustration a very coarse discretization with n = 10 is employed). The control points are shown in red and the exact geometry is outlined in gray.
The goal of the numerical method is to compute an approximation to the solution of Laplace's equation in R 2 \Ω. The corresponding fundamental solution is given by
Panel methods represent the solution as a superposition of fundamental solutions and the global flow that is imposed far away from the airfoil. That is, our solution ϕ(x) will be written as
where γ(s) is the coefficient in the superposition. The stream function of the global flow with velocity
The parameter α is called the angle of attack. Laplace's equation is subject to the boundary condition ϕ| ∂Ω = C which enforces that no fluid can move perpendicular to the wall. We discretize this ansatz by assuming that the vortex strength γ(s) is constant on each panel. For a panel from x i to x i+1 this yields
is the outward pointing vector that is orthogonal to h i and satisfies |h
We have used ·, · to denote the dot product. The boundary condition is enforced at the control points (i.e., at x i+1/2 = (x i+1 + x i )/2). This yields an underdetermined system of linear equations
which we supplement by the Kutta condition
In stating the Kutta condition we have assumed that the variables are ordered such that the trailing edge is located at x 0 = x n . This, in total, gives us n equations for the n unknowns γ 0 , . . . , γ n−2 and C.
While the present numerical scheme yields good predictions for the lift coefficient, it gives completely wrong results for the drag coefficient. This is to be expected as drag is a viscous effect. However, a range of phenomenological corrections has been developed that, for attached flows, are able to predict the drag coefficient based on the inviscid solution. In our code we have implemented Thwaites' method (see, for example, [13, 10] ) in order to perform a viscosity correction. Let us note, however, that more sophisticated schemes have been developed (see, for example, [5] ). To validate our code we have run a genetic optimization algorithm based on a B-spline parametrization of the airfoils. In the genetic algorithm, tournament selection with one-point crossover is employed and the mutations are performed for a single B-spline coefficient at a time. The fitness function is proportional to the lift-to-drag ratio at zero angle of attack. A number of airfoils that were generated by our implementation are shown in Figure 2 . The lift and drag coefficients stated are computed using Xfoil and confirm that our implementation generates successively better airfoils.
Hardware
The numerical implementation of the above algorithm requires two parts of significant computational effort. First, the system of linear equations has to be assembled which requires O n 2 operations but involves the (expensive) evaluation of two logarithms and two arctan2 functions for each panel. Second,
Figure 2: Three airfoils for each generation of the genetic optimization algorithm are shown. The algorithm proceeds from the left to the right and each column represents a distinct generation. We show the best classes of airfoils (according to the lift-to-drag ratio) for a specific generation. The population size is equal to 1000.
the solution of the linear system of equations is usually done by an LU decomposition and thus involves 2 3 n 3 operations. In practice n is often between 100 and 300. In this regime both parts require substantial computational effort.
In the following we will consider the CPU, Xeon Phi, and GPU configuration listed in Table 1 . These will be used for all the numerical simulations and all the benchmarks conducted in this paper. The corresponding (peak) performance characteristics with respect to single and double precision arithmetics and the theoretical attainable memory bandwidth are listed in Some fairly representative single and double precision timing results are collected in Table 2 . These results point the clear picture that on the CPU assembling the matrix is between 2.5 and 3.5 times more expensive compared to solving the resulting linear systems. Thus, on the CPU the assembly actually dictates the performance of the algorithm to a large extend. This situation is reversed for both the Xeon Phi and the K80 GPU. For the Xeon Phi the assembly step is by approximately a factor of two faster compared to the two CPUs. Since assembly is an extremely compute bound problem and giving the similarities of the two architectures, this gain is expected based on the factor of two difference in the theoretical arithmetic performance. On the other hand, one half 1 of the K80 outperforms the same two CPUs by a factor of approximately 5.
It is striking that the performance of the linear solver is relatively poor on both the Xeon Phi as well as on the GPU. Note that in our application we are not interested in solving large linear systems (for which both of these libraries provide excellent performance) but in solving a large number of relatively small linear systems. Let us also note that the performance of the Intel MKL library on the CPU certainly suggests that there is some room for improvement for the Xeon Phi. In fact, some research has already been conducted in 1 Note that the NVIDIA K80 is a single expansion card that includes two identical GPUs each with its own separate memory. Thus, using one-half of the K80 means that we use one of the two GPUs. this direction (see, for example, [4, 14] ). The same is presumably true for the GPU. It thus seems that neither the CPU nor accelerators are ideally suited for the problem under consideration. However, since the accelerators are very efficient in the assembly step and the CPUs are very efficient in the linear solve step, the hope is that a hybrid algorithm that uses both platforms can succeed in obtaining a significant speedup compared to a CPU only implementation. The difficulty in this approach is that a large amount of data has to be transferred over the (relatively) slow PCIe bus. In the problem under consideration this means that all the assembled matrices have to be transferred from the accelerator to the CPU. Clearly, if such a scheme is to be successful some strategy has to be employed in order to mitigate this communication overhead. To present an efficient implementation and the corresponding benchmark results for both the Intel Xeon Phi and for the NVIDIA K80 is the purpose of the remainder of this paper.
GPU implementation
In this section we will consider an implementation where the assembly of the matrix is conducted exclusively on the GPU and the linear solves are performed exclusively on the CPU. This requires the transfer of a large number of matrices in each step from the GPU to the CPU. Timing results indicate that the run time of the assembly step (on the GPU) together with the required transfer of data (from the CPU to the GPU) is comparable or smaller than the time it takes to perform the linear solve (on the CPU). Thus, in order to hide the communication overhead, we interleave the assembly and transfer operation with the linear solves on the CPU. This approach is illustrated in Figure 3 . Figure 3: This figure shows a communication hiding pattern that interleaves the assembly (green; on the GPU) and copy (orange; data transfer from the GPU to the CPU) with the linear solve (blue; on the CPU). The red areas constitute the remaining overhead that decreases as we divide our problem into more and more slices.
Note that the overhead of this approach decreases as we increase the number of slices our problem is partitioned into. However, since the individual problems become smaller and smaller, overhead inherent in the different parts of the algorithm becomes more pronounced. Therefore, a compromise has to be made. In general, between 10 and 20 slices seems to yield near optimal performance in most circumstances. Table 3 : Timing results for the hybrid algorithm (GPU+CPU) illustrated in Figure 3 . The wall time (W), the time required to assemble the system (A), the time required for the linear solves (L) and the overhead due to offloading to the GPU (O) are shown. The number of slices that yield the optimal run time are shown in bold. All measured times are in units of seconds.
The timing results are given in Table 3 . We observe a speedup of 3 (single precision) and 2.9 (double precision) for adding a single K80 to the dual socket workstation. Although even a naive implementation (i.e., doing the assembly, the data transfer, and the linear solve in sequential order) results in some speedup, the communication hiding scheme employed contributes significantly to the performance of the implementation. In the case of a single socket workstation the observed speedup is approximately 3.4 (single precision) and 3.9 (double precision).
The overhead in this implementation can be partitioned into two parts:
• As we partition our problem into more and more slices the performance of the linear solver on the CPU decreases. This is a consequence of the overhead required for the asynchronous data transfer to the GPU as well as the overhead that is incurred in decreasing the batch size for the linear solver. In the numerical simulations conducted here this overhead is on the order of 10%.
• There is an inherent overhead in the interleave scheme (see the red area in Figure 3 ). This overhead decreases as we increase the number of slices.
Assuming instantaneous data transfer the optimal run time of our hybrid implementation is equal to the time for the linear solver. Our implementation is within 10 to 20% of that value.
Intel Xeon Phi implementation
In essence the implementation on the Xeon Phi is similar to the GPU implementation. However, there are two major differences. First, due to the 512 bit wide vector units, vectorization is extremely important to obtain good performance on the Xeon Phi. In order to enable the compiler to generate efficient code for the assembly step, we have added __restrict and const keywords to our code. We have used the vectorization report of the Intel C compiler to check that the compiler has indeed sufficient information to vectorize the time intensive portions of our algorithm. This has to be contrasted with the GPU implementation of the assembly step which is relatively straightforward (neither warp divergence nor coalesced memory access is a major concern in this application). Note, however, that the code for the Intel Xeon Phi is essentially identical to the optimized code for the CPU.
Second, since the assembly step takes significantly longer on the Xeon Phi compared to the GPU, it is no longer true that assembly (on the Xeon Phi) together with data transfer (from the Xeon Phi to the CPU) consumes less time than the linear solver (on the CPU). Thus, in order to obtain good performance we have to interleave all three operations as shown in Figure 4 . Figure 4: This figure shows a communication hiding pattern that interleaves the assembly (green; on the Xeon Phi), the copy (orange; data transfer from the Xeon Phi to the CPU), and the linear solve (blue; on the CPU). The red areas constitute the remaining overhead that decreases as we divide our problem into smaller and smaller slices.
The timing results for the Xeon Phi are given in Table 4 . We observe a speedup of approximately 2.5 (for both single and double precision) for adding a single Xeon Phi 7120 to the dual socket workstation. On the other hand, for a single socket workstation the observed speedup is approximately 3 (single precision) and 3.6 (double precision).
Note that the performance of the GPU implementation considered in section 4 is superior by approximately 20% (for the dual socket case) and approximately 10% (for the single socket case) compared to the Xeon Phi implementation. We should also note that, as discussed before, the interleave scheme is out of necessity somewhat more complicated than the interleave scheme that is used for the GPU code (see Figure  3) .
The performance difference between the Intel Xeon Phi and the GPU are mainly explained by the fact that the assembly step is more costly on the Xeon Phi. Therefore, it is not possible to hide the data transfer as well as on the GPU which negatively impacts the performance of the implementation.
Multiple GPU implementation
The GPU implementation in section 4 uses a single GPU to perform the assembly step of the optimization algorithm. However, as has been pointed out in the introduction, the NVIDIA K80 includes two identical GPUs within the same expansion card. Thus, so far we have only used one half of the computational potential within that package. Certainly, we can not expect a factor of two improvement when using this additional GPU as in the present implementation performance is mainly limited by the linear solve conducted on the CPU. However, the timing results given in Table 2 suggest that we could solve part of the problem (both assembly and linear solve) on the second GPU. In this situation, optimal load balancing dictates that about one quarter of the original problem is parceled out to the second GPU. Thus, in the best case we would expect an improvement in performance by approximately 30%.
There is one additional issue that deserves our attention. While the MAGMA linear algebra library includes routines that use the GPU memory as input and output, it is primarily designed to operate in an environment that includes CPUs as well as GPUs. Consequently, there is no way to execute a MAGMA routine without CPU support and in an asynchronous fashion. To avoid oversubscription (which measurements show has a negative impact on performance) we use only 15 OpenMP threads for the linear solve and execute the MAGMA call in a separate pthread.
The timing results for this implementation are shown in Table 5 . We observe a speedup of 3.1 (single precision) and 3.6 (double precision) for adding a K80 to the dual socket workstation. In the case of a single socket workstation the observed speedup is approximately 4.4 (single precision) and 5.1 (double precision). We remark that except for the dual socket single precision configuration we observe a speedup between 20 and 30% compared to the single GPU implementation.
Conclusion
We have compared the speedup that can be achieved when an Intel Xeon Phi 7120 or a NVIDIA K80 is added to a workstation with one or two Intel Xeon E5-2630 v3 CPUs. Optimization and parallelization for the CPU and Intel Xeon Phi code is done using the Intel C compiler (vectorization) and OpenMP. For the GPU we use an implementation that is based on CUDA. Since the linear solver is faster on the CPU and the assembly is faster on the Xeon Phi/GPU, the present algorithms profits from a hybrid implementation that uses both traditional CPUs as well as accelerators. The obtained results can be summarized as follows:
• Adding a K80 to the dual socket workstation results in a speedup of approximately 3.1 (single precision) and 3.6 (double precision).
• Adding a Xeon Phi 7120 to the dual socket workstation results in a speedup of approximately 2.4 (single precision) and 2.5 (double precision).
• Since the performance of the CPU only implementation is mostly dominated by the assembly step, the speedups for a single CPU are significantly larger. In this configuration we observe speedups of up to 5 for the GPU implementation and up to 3.5 for the Xeon Phi implementation.
From the results obtained we follow that for the problem under consideration the GPU implementation yields better performance compared to the Xeon Phi. What is not so clear cut is the development effort that is required for each platform. One advantage of the Xeon Phi is that once we had an optimized code for the assembly step on the CPU (using vectorization and OpenMP) we almost immediately obtained good performance on the Xeon Phi. On the other hand, the CUDA implementation of the assembly step is straightforward and due to the computational advantage of the GPU a less complicated communication hiding scheme proves sufficient. Thus, with respect to the development effort involved there is no clear winner. Table 4 : Timing results for the hybrid algorithm (Xeon Phi+CPU) illustrated in Figure 4 . The wall time (W), the time required to assemble the system (A), the time required for the linear solves (L) and the and the overhead due to offloading to the Phi (O) is shown. The number of slices that yield the optimal run time are highlighted in bold in the Table 5 : Timing results for the hybrid algorithm (GPUs+CPU) that uses both GPUs of the K80 hardware. The wall time (W), the time required to assemble the system (A), the time required for the linear solves (L) and the overhead due to offloading to the GPU (O) is shown. The number of slices that yield the optimal run time are highlighted in bold in the table. All measured times are in units of seconds.
single precision

