In the era of the accelerator, load balancing strategies that are well-understood for traditional homogeneous supercomputers must be re-worked in order to address the problem of distributing work across heterogeneous hardware such that neither the CPU nor the accelerator is left idle. Whereas partitioning for a homogeneous system need only balance node-level workload against single-node performance, partitioning for an accelerator-enabled system requires a nested partitioning scheme that ensures both an optimal intra-node and inter-node load balance. We refer to this as enclave partitioning. Our parallelization scheme allows the same shared-memory-level code to be used on both the CPU and the accelerator, and also allows inter-node communication code to be reused during CPU-accelerator communication. This is in contrast to the traditional "offload" model, in which accelerator code can differ significantly from CPU code in both form and programming language. Using a hybrid MPI-OpenMP implementation of the acoustic-elastic wave propagation, we demonstrate the efficacy of the proposed partitioning scheme on a heterogeneous, Intel R Xeon Phi TM -accelerated supercomputer (Stampede). With our approach we have realized speedups of up to 5.78x using a 7th order discretization and 6.88x for 15th order discretization relative to a baseline, pure-MPI implementation. We present strong and weak scaling results as well as individual node performance to illustrate the benefits and limits of the accelerator-enabled scientific computing.
INTRODUCTION
The trend in modern supercomputing is to move toward systems that realize the bulk of their computational power from accelerators [11] . According to the Nov 2014 version of the top500 list [18] , 75 of the clusters on the list (15%) had accelerators and this number has steadily been rising in the last few years. On most of these machines the majority of the flops are available from the accelerator. For example on Stampede at the Texas Advanced Computing Center (TACC), 2.2PetaFlop/s is the peak performance of the base cluster while an additional 7.4PF/s is available from the use of the Intel R Xeon Phi TM accelerator available on its 6400 nodes. Additionally, while the peak performance is usually 3-4 times higher for the accelerator, the total available memory is smaller on the accelerator by a similar factor. The situation is similar for other top machines utilizing accelerators. To make effective use of such machines, application codes need to parallelize across 4 levels; a) across distributed memory nodes, b) across the CPU and the accelerator, c) shared-memory parallelism with the CPU/accelerator, and d) vectorization at the finest level. The asymmetry in performance as well as memory at the cpu-accelerator makes this particularly challenging for complex applications.
Load balancing strategies that are well-understood for traditional homogeneous supercomputers must be re-worked in order to address the problem of distributing work across heterogeneous hardware such that neither the CPU nor the accelerator is left idle. A common paradigm for accelerator computing is to offload computationally intensive tasks to the accelerator. In many such cases, the CPU is left relatively idle. Additionally, this requires O(n) data transfer between the CPU and the accelerator. Given that the accelerators have less memory available, such an approach is inefficient unless the operations being offloaded are of extremely high computational efficiency. Instead, we believe it is preferable to treat the accelerator the same as any other compute node, reducing the data transfer costs to O(n 2/3 ) for 3D problems. This allows programmers to use a SingleProgram, Multiple-Data (SPMD) programming paradigm, which is much easier than programming separate logic for the CPU and the accelerator. This is particularly attractive on clusters such as Stampede and Tianhe-2 using the Intel
R

Xeon Phi
TM as the accelerator, as it is based on the x86 architecture as the CPU. This means that on such clusters, we can have true SPMD code 1 .
Figure 1: Enclave partition:
The computational domain is first partitioned among MPI processes, shown here separated and in different colors. Each MPI process then partitions its subdomain into a set of elements to assign to the CPU (transparent) and a set of elements to assign to the accelerator.
The key challenge for the equivalent treatment of CPUs and accelerators-besides the differences in performance and memory-is that the communication cost from an accelerator to any other process other than its host node is significantly higher. In this regards, they are not the same as the CPUs and therefore when we divide work across the nodes, and between the CPU and the accelerator, we need to partition such that the accelerators only require data transfers with their hosts. In other words, the task graph for the problem is such that the tasks assigned to any accelerator have data-dependencies only with their host CPUs. A novel O(n), node-local partitioning algorithm that facilitates such a data-dependency is the main contribution of this work. We call this enclave partitioning and an example of such a partition is shown in Figure 1 . The partitioning improves the overall performance of the system by ensuring proper utilization of both the CPU and the accelerator, and also makes it easier to program for such systems by using a SPMD programming paradigm. Using a hybrid MPI-OpenMP implementation of the acoustic-elastic wave propagation, we demonstrate the efficacy of the proposed partitioning scheme on Stampede. Using our approach, we have realized speedups of up to 5.78x using a 7th order discretization and 6.88x for 15th order discretization relative to a baseline, pure-MPI implementation. We present strong and weak scaling results as well as individual node performance to illustrate the benefits and limits of the coprocessoraccelerated scientific computing. To the best of our knowledge, this is the first work considering such a partition that enables SPMD programming for heterogeneous clusters.
Organization of this paper:.
The rest of this paper is organized as follows. In §2, we present the Seismic wave propagation problem-our test application-as it is important to understand the computational structure of the application to appreciate the algorithmic choices. In §3 we provide a baseline analysis that we performed prior to this work, that led to the development of the current work. In §4, we present the enclave partitioning algorithm and in §5 we discuss the implementation details of our algorithms. Finally, in §6 we present experiments and scalability results in support of enclave partitioning and conclude with directions for future research. 
SEISMIC WAVE PROPAGATION
We are primarily concerned with accelerating the solution of the acoustic-elastic wave equations to simulate the propagation of seismic waves in coupled acoustic-elastic media. Our long-term goal is the solution of the global-scale inverse problem of determining the interior structure of the earth using full seismic waveform inversion [20] . Each step of the inverse problem requires a full, explicit forward-in-time solution of the acoustic-elastic wave equations. As such, a moderate speedup in the forward solution translates to a significant speedup in the inverse solution [5] .
For isotropic media, seismic wave propagation can be described in velocity-strain form by the first-order system [10] 
where we make use of the Lamé parameters λ and µ as our material parameters. This formulation allows us to describe wave propagation in an acoustic medium by simply setting µ to zero [20] . To make the system well-posed, we assume initial conditions v(x, 0) = v0(x) and E(x, 0) = E0(x) on the problem domain B, and boundary condition Sn = t bc on the boundary of the domain, ∂B, where we prescribe the traction t bc , n is the outward unit normal vector at the boundary, and S = CE is the Cauchy stress tensor. At an interface Γ between two elastic media, we impose continuity in the velocity and the traction
whereas at an interface between an elastic medium and an acoustic medium or between two acoustic media we impose continuity between the normal component of velocity and the traction
Additional details on the mathematical formulation of the problem can be found in [20] .
Discontinuous Galerkin Method
The discontinuous Galerkin (DG) finite element method is characterized by its use of discontinuous basis functions, which allow the methods to operate on h-adaptive, nonconforming meshes [8] . Global solution continuity is enforced weakly on element boundaries through the use of a numerical flux. This makes most DG operators highly local. In this paper we will present only a cursory examination of the features of DG that are relevant to an optimal implementation on a heterogeneous system. For a more detailed treatment of the solution of the acoustic-elastic wave equations using a DG method, see [20] , [13] , [4] .
Our DG formulation uses an h-adaptive, possibly nonconforming, mesh of hexahedral curvilinear elements, which we require conforms to a 2:1 balance. A 2:1 balance indicates that only 1:1 and 2:1 size relations are allowed between neighboring elements. Following [8] , the basis of each element is composed of tensor products of 1-D Lagrange polynomials. Numerical integration points are chosen to coincide with the basis function nodes, which we have chosen to place at the Legendre-Gauss-Lobatto points (again following [8] ). Each element, then, can be conceptualized as a collection of nodes. Element face nodes collocate when elements are adjacent to each other (Figure 2a ). At the beginning of each iteration, we isolate the nodes on the faces of the elements and copy them to the "facemesh". We call a face that is adjacent to four faces from neighboring elements a "parent face", and the faces of the smaller neighboring elements "hanging faces". When transferring face data to the facemesh, we interpolate the data from a parent face to four corresponding virtual hanging faces in the face mesh. This ensures that neighboring faces always conform to one another in the facemesh (Figure 2c) .
The key kernels used in the DG formulation of the acousticelastic wave equations are shown in Table 1 . The "Locality" column in the table describes the data domain on which each kernel operates. Element-and face-local kernels can operate on each element independently, while face-global kernels require nodal values from multiple neighboring elements. In this sense of the word "global", the only global kernel in our DG formulation is the Lift kernel, which applies the fluxes from adjacent faces to the volume nodes of the elements to which they correspond (Figure 2b ). Partitioning can be thought of as untangling this curve and laying it out in one dimension, as in the color-coded curve at the bottom of the figure.
The colors represent the different processes to which these elements are assigned.
Adaptive Meshing
In this section, we briefly describe our adaptive meshing framework. Octrees are axis-aligned space partitions in 3D. The root of an octree can be associated with a topologically equivalent 3D domain that can be mapped smoothly to a cube. A forest is a collection of octrees, which are connected conformingly through faces, edges and corners [6] . Such a forest of octrees allows for a two-tier decomposition of geometric domains, where the first tier, the macromesh, is an unstructured mesh of conforming hexahedral elements (possibly with curved boundaries). Recursive octree subdivision then allows for efficient octree-based adaptivity of the mesh in the geometric domain. Localized refinement generally creates non-conforming meshes, which can still be used to represent continuous finite element solutions by incorporating algebraic constraints at the element level (hanging nodes).
The adaptive octree structure can naturally be interpreted to define a recursive space filling curve that traverses all octants in all trees in the so-called Morton-or z-order [12] . This curve provides the basis for indexing and fast partitioning of octants between processes, a fact that we exploit while assigning work across nodes. lying Gaussian densities were the same for all runs. One such mesh is shown in Figure 1 .
The fraction of execution time taken by each of the key kernels discussed in the previous section was measured for the baseline code, the results are shown in Figure 5 . As can be seen, the derivatives kernel accounts for almost half the execution time. This would normally point to a taskoffloading approach, wherein the derivative computation is offloaded to the Xeon Phi, while the CPU takes care of the rest. Unfortunately, given that the derivatives need to be computed on the volume unknowns, the overall speedup factoring-in the cost of transferring data back and forth to the accelerator was < 10%. Equating this with communication costs normally observed on homogeneous clusters, one can see that a similar setup where the communication is proportional to the surface-area instead of the volume would be highly beneficial. As mentioned earlier, it would have the added benefit of simplifying the programming by allowing us to adopt the SPMD programming paradigm. In the next section, we expand on this idea and introduce our partitioning scheme.
PARTITIONING
As previously discussed, in the discontinuous Galerkin finite element method, inter-process communication consists entirely of the transfer of shared element faces. Thus, the partitioning of the global domain into processor-local domains directly controls the amount of communication required. If partitions are created such that the number of faces shared amongst processors is minimized, then communication will also be minimized. Partitioning must take place at two levels-first, the inter-node partitioning is done using the Morton ordering of the elements to ensure load-balance at the node level. For some total number of elements K and compute nodes p, we assign K/p elements to each node with node-i receiving elements E = [(i − 1)K/p, iK/p] according to the Morton order. We assign one MPI process to each compute node, and so each MPI process owns a subdomain of contiguous 2 elements. This domain is then partitionedlocally on each node by the CPU-into two domains such 2 in the Morton order Figure 5 : Fraction of execution time for key kernels in the solution of the acoustic-elastic wave propagation problem for a fixed problem size and varying number of compute nodes. All runs were completed using a fixed number of 7th order elements.
that one domain is an enclave 3 within the other. We refer to this as enclave partitioning.
This approach differs from the way in which accelerators have traditionally been used. A common paradigm for accelerators computing is to offload whole computationally intensive tasks to the accelerator. For example, in the solution of the acoustic-elastic wave equation, one might offload the Derivatives kernel to the accelerator while the CPU computes the Flux kernel. While the Derivatives kernel is certainly computationally intensive, and maps well to accelerator architectures, the amount of data that must be transferred at each timestep is proportional to the total number of elements K owned by the compute node, times the number of degrees of freedom per element (N + 1)
3 . In contrast, transferring only a subset KMIC of elements to the accelerator and transferring only shared face data, assuming that KMIC is structured such that inter-facial area is minimized, is proportional to 6K
2 . The tradeoff is that, although less data is transferred, we must ensure the entire timestep runs optimally on the Xeon Phi TM , rather than just one or two computationally-intensive kernels. As mentioned earlier, this supports the SPMD paradigm, which is a huge advantage when the accelerator and CPU share the same architectural basis, as is the case for Stampede. While this can be an issue with using say GPUs with x86 CPUs, where it might be easier to take an incremental approach of only porting a single kernel, we believe that the SPMD approach is better and more scalable and maintainable in the long run.
The approach proposed in this paper is to treat the relationship between the host and offload processes in much the same way as the relationship between compute nodes is treated. Once an enclave has been identified within the node's domain, the enclaved domain is assigned to the accelerator and therefore only needs to communicate with the host CPU. Note that due to the nature of the partition, the inter-node communication patterns are identical to the case when no accelerator is used. This is important as we do not pay any penalty for the use of accelerators. Each host ini-tializes its accelerator, then launches an offload process that runs each timestep in parallel with the host. Synchronization is only required once per time step, when the CPU and accelerator exchange their shared face data. At the end of the computation, all of the degrees of freedom computed on the accelerator are transferred back to the host, and so we incur the cost of transferring large datasets only once.
Given the different computational capabilities of the CPU and the accelerator, enclave partitioning has to be able to produce enclaves of different sizes such that the ratio of the number of elements in the enclave to those surrounding it is similar to the relative performance of the accelerator to that of the CPU on the problem(s) of interest. We now explain our algorithm for enclave partitioning.
Enclave Partitioning
Our goal is to extract a specified number of elements E φ ⊂ E, where E φ is the set of elements owned by the accelerator and E is the initial subdomain assigned to a given node. The goal of enclave partitioning is to determine E φ such that it does not share any faces with elements / ∈ E. We would also like to minimize the number of faces shared between E φ and E \ E φ to minimize the cost of exchanging data between the CPU and the accelerator. It is important to note that it might not be possible in every case to find an enclave of the specified size satisfying these criteria. Specifically, E φ is by definition made up of interior elements, and if the number of elements in E is small, then it is entirely possible that the number of interior elements is less than requested or even zero. For real applications however, we expect to have significant number of elements in E such that it will be possible to compute E φ .
The algorithm first extracts the boundary elements E b , i.e., elements that share at least one face with elements / ∈ E. A simple version of the algorithm can then simply select the specified number of elements from E \ E b to belong to E φ . This does not however minimize the number of faces shared between E φ and E \ E φ . Heuristically, we can minimize the number of shared faces by ensuring that the boundary is largely a smooth face and not made of jagged faces. For our octree-based meshes, this implies selecting sets of elements that are contained within coarser octants. Following this heuristic, we construct a coarse interior region, Ecoarse by constructing the minimal linear octree between successive elements ∈ E b using Algorithm 1 (see [17] for a detailed discussion). This algorithm runs in O(nl), where l is the number of levels used in the octree (practically a constant) and n is the number of elements in the boundary.
Initializing E φ with the elements ∈ E contained within the "largest" octant in Ecoarse, we keep adding the "largest" adjacent (next or previous octant in the Morton order) octants till E φ contains the desired number of elements. When we speak of a "large" coarse octant O, we are not referring to the geometric size of the octant, but the number of octants in E that are overlapped by O. Generating the coarse interior region helps us in two ways. First, it attempts to minimize the effective number of shared faces between the CPU and the Xeon Phi TM . Secondly, while adding additional elements to E φ , Ecoarse acts as a skip list and speeds up the partitioning.
Determining the size of the enclave.
In order to determine the optimal number of elements to offload to the accelerator, we estimate the accelerator runAlgorithm 1. Constructing a minimal linear octree between two octants (sequential) -CompleteRegion
Input:
Two octants, a and b > a.
Output:
R, the minimal octree between a and b.
Work:
O(n log n), where n = len(R). Storage: O(n), where n = len(R).
end if 8. end for 9. Sort(R) time, CPU runtime, and PCI-bus data transfer time for a given set of partition parameters, such as the number of elements on the CPU and the accelerator, the number of faces shared between the CPU and the accelerator, and the order of the elements. We can calculate the number of FLOPs that each key kernel will execute for the given partition parameters, then use measured FLOP rates from each of the kernels (gathered under a variety of partitions and element orders) to estimate the per-timestep runtime of the kernels for the theoretical partition. We select the number of elements to offload by calculating at which point the estimated accelerator runtime plus the PCI-bus data transfer time is equal to the estimated CPU runtime, under the assumption that the global domain is a perfect cube of uniform elements, and that any accelerator subdomain will also be a cube of uniform elements. The resulting number of elements to offload KMIC is then passed as an input to our intra-node partitioning algorithm.
IMPLEMENTATION
In this section we describe the parallelization at the two lowermost levels of parallelism shown in Figure ? ?-shared memory parallelism using OpenMP and vectorization. While several of these optimizations were done to improve the performance on the Intel R Xeon Phi TM , they also helped improve the performance of the code on the CPU. This is another big advantage of the SPMD paradigm.
Vectorization
Several authors have identified vectorization as a key component in achieving good performance on the Xeon Phi [15] . Indeed, the 512-bit wide SIMD vector processing capabilities of the Xeon Phi TM allow for up to 16 DP FLOPS per clock cycle per core (for a fused multiply-add). Code which is not vectorizable, then, suffers a significant performance hit relative to vectorized code. There are several approaches to vectorization, including auto-parallelization by the Intel R compiler, the use of Intel R Cilk TM Plus directives [1] , and hand-coding using vector intrinsics or assembly. For each of the key kernels presented in Table 1 we implemented each of these vectorization strategies, though only the results for the Derivative kernel will be presented here. For all other kernels there was no significant performance difference between the three vectorization methods. The Derivative kernel calculates 27 tensor products, producing the partial derivatives of the nine field variables in the three canonical directions. There are three tensor application kernels, one for each canonical direction, and for order N elements the operation of each tensor kernel is analogous to performing N + 1 matrix-matrix multiplications of size (N + 1) × (N + 1). This is the reason that we choose 7th and 15th order elements, to allow the matrices to be 8 × 8 and 16 × 16, enabling better vectorization. Though many optimized dgemm kernels are available [19] , [7] , including the Phi-optimized Intel R Math Kernel Library [2] , these kernels achieve high performance only for matrices that are relatively large compared to the amount of data available in registers and the L1 cache. Our problem deals with much smaller data structures, however, as for, e.g., 7th order elements, four (N + 1) × (N + 1) matrices can fit in the vector registers of a single Xeon Phi TM core. A full examination of our hand-optimized tensor kernels exceeds the scope of this paper, and so we will simply say that our optimization strategy was to minimize loads and stores between registers and the cache, perform all arithmetic operations in the VPU, and amortize loads and stores by reordering instructions such that arithmetic operations could be executed concurrent to memory transfers. For the CPU, we observed a significant speedup over both compiler-vectorized and Cilk TM Plus-vectorized code, as Table 2 shows. On the Xeon Phi TM , however, although our vectorized code performs significantly better than unvectorized code, we were not able to produce performance significantly better than the Cilk TM Plus-vectorized code. Anomalously, the only vectorization method for which we observed higher performance on the Xeon Phi TM than on the CPU was the Cilk TM Plus-vectorized kernel. We expect the Xeon Phi TM to outperform the CPU by a factor of three, and are currently investigating the reasons for the weakerthan-expected performance on the Xeon Phi TM .
Multithreading and Offload
Currently, OpenMP is the method of choice for making full use of the many hardware threads available on the Xeon Phi TM [9] . While MPI libraries are available on the Xeon Phi TM , treating the Xeon Phi TM as a distributed memory system would give each core only 134 MB of RAM. When we consider that each Xeon Phi TM core can only issue a vector instruction every other clock cycle, which necessitates the use of at least two hardware threads per core, we see that using MPI to keep the Xeon Phi TM fully scheduled would lead to only 67 MB of RAM per MPI process.
With the decision to use OpenMP on the Xeon Phi TM firmly in place, we are faced with the remaining decisions of how to implement parallelism on the CPU and how to implement communication between the CPU and the Xeon Phi TM . The existing codebase we have inherited for the solution of the acoustic-elastic wave equations utilizes pure MPI for interprocess communication and parallelization, spawning one MPI process per core on each compute node. On Stampede this translates to 16 MPI processes per compute node, each process having its own address space, arrays of unknowns, and computational subdomain. At each time step, facial degrees of freedom on faces that fall on interprocessor boundaries are communicated between MPI processes. An obvious option is to maintain this distributedmemory parallelism on the CPU side of the compute nodes and launch a single MPI process on the Xeon Phi TM that then spawns OpenMP threads. This would require an overhaul of our existing MPI scheduling algorithm, however, which assigns a roughly equal amount of work to each MPI process. It would also mean that the computational subdomain owned by the Xeon Phi TM could have a potentially large number of neighboring subdomains, which would lead to a large amount of data being transferred over the PCI bus and subsequently over the InfiniBand network. The worst case data path would be a Xeon Phi TM whose subdomain was neighbored by a subdomain owned by a Xeon Phi TM on another compute node, in which case the shared faces would have to travel across the PCI bus, then between nodes via Infiniband, then back across the PCI bus after arrival at the destination node.
To reduce traffic on the PCI bus and simplify load balancing, then, we choose to assign only one MPI process to the CPU side of each compute node, which then spawns OpenMP threads on the CPU-side to take advantage of the multiple Xeon TM cores. This same MPI process also controls offload to the Xeon Phi TM and MPI communication. We restrict the subdomain assigned to the Xeon Phi TM to a subset of the elements owned by the parent MPI process whose faces do not lie on inter-processor boundaries. This eliminates the need for a reworking of our MPI load balancing algorithm, as each compute node can be given the same amount of work and the partitioning between the CPU and the Xeon Phi TM domains can be done independently and locally on each compute node.
SPMD programming differences.
Though each host MPI process does technically "offload" in the sense that it initiates the execution of the Xeon Phi TM code through offload compiler directives, the behavior of the Xeon Phi TM and the dynamic between the CPU and the Xeon Phi TM is more like that of a distinct compute node. Each Xeon Phi TM owns a set of elements forming a computational subdomain and executes full time steps in the solution of the acoustic-elastic wave equation. The only difference between the CPU and Xeon Phi TM codes lies in the communication phases of each time step. The CPU maintains communication buffers and maps to inter-processor boundary faces for both inter-node MPI communication and intranode Xeon Phi TM communication, whereas the Xeon Phi chronously via MPI, while CPU-Xeon Phi TM communication is implemented synchronously through compiler offload directives. Although the Xeon Phi TM is capable of asynchronous computation, and theoretically capable of asynchronous data transfer, we have not observed a significant time difference in data transferred between the CPU and the Xeon Phi TM when using asynchronous offload directives versus synchronous directives. We postulate that this is due to a fixed overhead in the offload runtime that, though possibly amortized by actual asynchronous transfer for larger array sizes, is significant enough for the amount of face data we need to transfer to be very nearly the same as the time required to synchronously transfer that data.
RESULTS
We present results in support of the proposed partitioning algorithm in this section. All experiments were carried out on Stampede at the Texas Advanced Computing Center (TACC). Stampede consists of 6400 compute nodes, each of which houses two 8-core Intel R Xeon TM E5 CPUs and up to two 61-core Intel R Xeon Phi TM SE10P accelerators [3] . As table 3 shows, in terms of floating point performance and memory bandwidth the Xeon Phi TM outperforms the CPU by approximately a factor of three. As with most accelerators, the amount of work that can be done by the Xeon Phi TM is constrained by its limited amount of random-access memory. To begin with, the Xeon Phi TM has only 25% as much RAM as is available to its host-node CPUs. Considering the Xeon Phi TM 's higher core count the situation becomes worse, as there are approximately 134 MB of RAM for every Xeon Phi TM core, compared to 2 GB of RAM for each CPU core. Thus, the Xeon Phi TM is most effectively used as a large SMP node, programmed using a shared-memory paradigm. Figure 7 : Incremental optimization of 15th order code.
Single-Node Optimization and Offloading
We first present the results of each step in the transformation from our baseline pure-MPI code to our vectorized, hybrid MPI-OpenMP, coprocessor-accelerated code. Figures 6  and 7 show the speedup realized after each subsequent optimization step relative to the baseline MPI-only code run on a single socket of Stampede. Some compute nodes on Stampede contain two Xeon Phi TM . Our code seamlessly handles the utilization of two Xeon Phi TM per compute node by simply scheduling two MPI processes per node, each pinned to a different socket. Each of these MPI processes then spawns 8 OpenMP threads on the 8 cores of their host socket and controls offloading to one of the two available Xeon Phi TM . Though we expected a performance hit when moving from pure MPI code to hybrid MPI-OpenMP code, we did not observe any difference in runtime between our OpenMP and pure MPI codes. We took care in our OpenMP implementation to minimize memory sharing amongst threads as much as possible so as to mimic the distributed-memory behavior of MPI processes.
Partitioning
To measure the consistency of the partitions produced by our partitioning algorithm we examine the distribution of Xeon Phi TM runtimes across a large run. Examining Figure  8 we see that most of the Xeon Phi TM runtimes are within 5% of the mean runtime. We believe this is a very good result and within acceptable bounds given the variable nature of adaptive meshes. While it might be possible to get a more uniform load, the simplicity of the current algorithm makes it very efficient.
Scaling Results
The difficulty in achieving strong scalability using coprocessorenabled code is that as the number of compute nodes is increased for a fixed number of elements, the ratio of boundary to interior elements on each node's subdomain also increases. Since only interior elements are assigned to the Xeon Phi TM , this means that the percentage of elements eligible for offload to the accelerator decreases as the number of compute nodes increases. At very high compute node counts, the performance of the Phi-accelerated code approaches the performance of the baseline code due to the increased ratio of work being assigned to the CPU. This behavior is illustrated in Figures 9 and 11 . We observed weak scaling comparable to or better than our baseline pure-MPI code. For weak scaling runs using dual-Phi compute nodes, the size of our experiments were constrained by the available number of compute nodes with two Phis. We present dual-Phi weak scaling for up to as many nodes as were available. Table 4 shows the speedup factor for our vectorized, MPI-OpenMP, Xeon Phi TM -enabled code relative to our MPI baseline for the strong scaling runs shown in Figures 9 and 10 . The "Cores" column specifies how many CPU-side cores were used in the computation. Baseline computations used one MPI process per core, while accelerated runs used one MPI process per socket, each spawning 8 OpenMP threads to make full use of the available cores. This table illustrates that when compute nodes are given nontrivial workloads, as is the case in the strong scaling runs at low core count, the Xeon Phi TM contributes to a significant speedup. 
CONCLUSION
In this work we have presented a new partitioning algorithm targeting hybrid codes on heterogeneous clusters. We demonstrated the efficacy of our algorithm using a largescale seismic wave propagation application on the Stampede Supercomputer. We observed speedups of up to 5.78x for runs using 7th order elements and up to 6.88x for 15th order elements. Our partitioning scheme was able to construct relatively consistent subdomains for the Xeon Phi TM . We were able to increase the performance of the baseline MPI code through the use of vector instructions, and were also able to maintain performance when moving from pure MPI Figure 11 : Number of elements assigned to the CPU and Xeon Phi TM in the single-socket strong scaling runs. Number of processors refers to total number of CPU cores used in baseline case, and total number of OpenMP threads used on the host CPU side in Xeon Phi TM -accelerated cases. Figure 12 : Weak scaling for 7th order elements. Phienabled runs completed with 63% of the elements given to the Xeon Phi TM (s). Baseline runs were original MPI-only code. Runs labeled "8 MPI" or "8 OMP" indicate that only one socket per node was used.
MIC CPU
to an MPI-OpenMP hybrid. On a system like Stampede where the majority of computational power is delivered by accelerators, we have seen that optimizing code for effective execution on the accelerator can have a dramatic impact on performance. The partitioning scheme also enables the developers to program using a SPMD paradigm, which does not appear straightforward while considering heterogeneous architectures. We believe our work is the first to take such an approach and will make it easier to obtain good performance and scalability on large heterogeneous clusters. 
