The gyrokinetic toroidal code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5-D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including internode 2-D domain decomposition and particle decomposition, as well as intranode shared memory partition and vectorization, have enabled pushing the scalability of the PIC method to extreme computational scales. In this article, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) coprocessors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of ion-temperature-gradient driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects, and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.
Introduction
The gyrokinetic toroidal code (GTC) (Lin et al., 1998) is a particle-in-cell (PIC) code that solves the fivedimensional (5-D) gyrokinetic Vlasov equation in full, global torus geometry to address kinetic turbulence issues in magnetically confined fusion experimental facilities known as tokamaks (Tang and Keyes 2009 ). This code was developed using FORTRAN 90 and features three levels of parallelism: a one-dimensional (1-D) domain decomposition in the toroidal dimension, a particle decomposition in each toroidal domain, and multithreaded, shared memory loop-level parallelism implemented with OpenMP (Ethier et al., 2005 (Ethier et al., , 2008 . The method scales well to a large number of processor cores with respect to the number of particles. However, the approach can suffer from performance bottlenecks when dealing with significantly increased problem size, for example, simulating large-scale plasmas, such as the burning plasma experimental facility International Thermonuclear Experimental Reactor (ITER), the largest device currently under construction in France (The ITER project 2013) . This is due to the increasing gridrelated computation and memory requirement on each 1 process. With only 1-D domain decomposition in the toroidal dimension, this algorithm has each process, keep a copy of the full poloidal grid, and the number of grid points in the poloidal grid increases 4 3 for a plasma device of 2 3 in minor radius. In order to effectively address the open questions in fusion plasma physics, such as the scaling of the energy confinement time with system size (Idomura and Nakata 2014; Lin et al., 2002; McMillan et al., 2010; Tang et al., 2014) , a key additional level of domain decomposition in the radial dimension was introduced. In this new version named GTC-Princeton or GTC-P (Adams et al. 2007; Ethier et al., 2010) , the additional domain decomposition substantially reduces the memory requirements and thus enables the code to deal with increased problem size, including being able to scale to ITER-size simulations.
The illustrative application domain on which the present studies are focused begins with the fact that the fusion of light nuclides forms the basis of energy release in the universe and can potentially be harnessed and used as a clean and sustainable supply of energy on Earth. Magnetic confinement fusion devices in which very high temperature plasma is confined in a tokamak, a donut-shaped vacuum vessel, is one of the most promising approaches to fusion reactors. In such a system, the interplay between the complex trajectories of individual charged particles and the collective effects arising from the long-range nature of electromagnetic forces leads to a wide range of waves, oscillations, and instabilities. These include larger scale (macro) instabilities that produce rapid topological changes in the device resulting in a catastrophic loss of fusion power, as well as smaller scale (micro) instabilities that gradually leak energy, and thus affect performance and economic viability. Turbulent fluctuations (''microturbulence'') are instabilities caused by the unavoidable spatial variations (gradients) in a plasma system. They can significantly increase the transport rate of heat, particles, and momentum across the confining magnetic field and severely limit the energy confinement time for a given machine size. High-fidelity predictive numerical experiments based on 5-D gyrokinetic simulations play a significant role in understanding, predicting, and possibly controlling these instabilities. However, gyrokinetic simulations are extremely compute intensive, since the phase space distribution function in 5-D needs to be accurately resolved. This is a key motivating factor for developing highly optimized modern PIC codes that can effectively carry out microturbulence simulations capable of being deployed on the most powerful supercomputing systems.
The basic particle method has long been a wellestablished approach for simulating the behavior of charged particles interacting with each other through pairwise electromagnetic forces. At each time step, the particle properties are updated according to these calculated forces. For applications on powerful modern supercomputers with deep cache hierarchy, a pure particle method is efficient with respect to both locality and arithmetic intensity (compute bound). Unfortunately, the O(N 2 ) complexity makes a particle method impractical for plasma simulations using millions of particles per process. Rather than calculating O(N 2 ) forces, the PIC method, which was introduced by J Dawson and N Birdsall (Birdsall and Langdon 1991) in 1968, employs a grid as the medium to calculate the long-range electromagnetic forces. This reduces the complexity from O(N 2 ) to O(N + M log M), where M is the number of grid points that is generally much smaller than N . However, achieving high parallel and architectural efficiency is a significant challenge for PIC methods due to potential fine-grained data hazards, irregular data access, and low arithmetic intensity. Attaining performance becomes even more complex as HPC technology evolves toward vast onnode parallelism in modern multi-and many-core designs. A deep understanding of how to improve the associated scalability will have a wide-ranging influence on numerous physical applications, which use particlemesh algorithms, including molecular dynamics, cosmology, accelerator physics, and plasma physics.
Driven by chip and system power limitations, the HPC community is moving to the era of multi-and many-core architectures with greatly increased thread and vector parallelism on shared memory processors. While the family of the GTC FORTRAN codes (the original GTC (Ethier et al., 2005) and GTC-P (Adams et al., 2007; Ethier et al., 2010) ) continues to scale well to a large number of processes, the level and efficiency of multithreading need to be increased significantly to take full advantage of the upcoming technology. In order to benefit from the computer science advances in exploiting multithreading capabilities to facilitate largescale simulations on modern systems, we have strived to develop a thread-and accelerator-optimized version of the code over the last 5 years. These efforts commenced with a complete rewrite of the original GTC FORTRAN code in the C language. The need for this change arose from our 2009 study of the GTC charge accumulation step, in which several low-level multithreaded algorithms were implemented using Pthread programming (Madduri et al., 2009 ) and compared for performance. The best algorithm was later reimplemented with OpenMP, replacing the much more complicated Pthread constructs. The C version also simplified the process of porting GTC-P to GPU and Intel Xeon Phi. Novel multi-core centric optimizations were introduced to enhance the performance of the code (Madduri et al., 2009 (Madduri et al., , 2011a (Madduri et al., , 2011b . The earlier efforts also included the development of the corresponding GPU version using the CUDA language (Ibrahim et al., 2013; Madduri et al., 2011a ). The journey continued with the development of a highly efficient radial domain decomposition . Specifically, we carefully addressed several issues identified in the original GTC-P FORTRAN code, such as load imbalance and the lack of multithreading capability for some grid-based subroutines. The multiple levels of parallelism, including internode 2-D domain decomposition and particle decomposition, and intranode shared memory partition, as well as vectorization within each core, have allowed the new GTC-P code to efficiently scale to the full capability of several top computer systems, including Sequoia at Lawrence Livermore National Laboratory (United States) and Mira at the Argonne Leadership Computing Facility (ALCF) (United States) .
In this article, we describe our approach to developing such a highly parallel PIC code that has the unique capability of efficiently simulating large size plasmas on both homogeneous and heterogeneous systems. In particular, we discuss our recent experience in porting and optimizing GTC-P on NVIDIA GPU-and Intel Xeon Phi (MIC)-accelerated systems. The performance studies with production simulation parameters are performed on numerous supercomputing platforms including Titan at the Oak Ridge Leadership Computing Facility (US), Mira at the ALCF (US), Piz Daint at the Swiss National Supercomputing Center (Switzerland), Edison at the National Energy Research Scientific Computing Center (US), Stampede at the Texas Advanced Computing Center (TACC) (US), and Blue Waters at the National Center for Supercomputing Applications (NCSA) (US). This is the first study that examines the production code with ''true weak scaling study'' on both large-scale homogeneous and heterogeneous HPC systems, as opposed to our earlier research that focused on either large size (ITER-size) simulations on CPU systems or small to moderate size simulations on GPU systems (Ibrahim et al., 2013; Madduri et al., 2011a) . Given that the PIC algorithm consists of several phases of varying computational intensity, memory access patterns, and communication patterns, our study enables exploration with respect to computer science contributions as well as from a full-scale discovery-sciencecapable applications standpoint. With regard to the important challenge of dealing with code portability on advanced supercomputing platforms, another contribution from these studies involves providing a quantitative comparison of the number of ''lines of code changes'' and associated speedup obtained from our experience in developing the modern GTC-P code across a wide range of architectures.
The sections below are organized as follows. In Section 2, we first review the 5-D gyrokinetic VlasovPoisson equation used for studying low-frequency microturbulence in fusion plasmas. Then, we describe the numerical discretization for the system based on the PIC method. In Section 3, the modern GTC-P code is introduced. We discuss its main kernels as well as the parallelization and optimization deployed. In particular, this article provides results from our new optimization techniques for GPUs and description of our experience in porting and optimizing the code on the Intel Xeon Phi system. The performance analysis and results of high resolution, longtime simulations of iontemperature-gradient (ITG) driven turbulence are given at the end of Section 6.
Gyrokinetic Vlasov-Poisson system

Governing equations
The study of low-frequency microturbulence in high temperature, magnetically confined plasmas requires the use of the kinetic model described by the Vlasov equation in six-dimensional (6-D) phase space. In the gyrokinetic approach (Lee 1987) , the dynamics of the high-frequency cyclotron motion of the charged particles in the strong magnetic field is averaged out, reducing the 6-D equation to a 5-D gyrokinetic equation
where f a (R, v k , m) is the 5-D phase space distribution function of species a in the gyrocenter coordinates R. Z a is the particle charge. m a is the particle mass. B = Bb is the nonuniform equilibrium magnetic field. m =
2B
is the magnetic moment. f is the gyrophase-averaged potential. v E , v c , and v g are the E 3 B drift, magnetic curvature drift, and grad-B drift velocities, which take the forms
where c is the speed of the light and O a = Z a B m a c is the gyrofrequency. We use the identity r 3 B = Br 3b + rB 3b = 0 ð9Þ in the low b limit. Consider a simple plasma of electrons and ions, the corresponding gyrokinetic Poisson's equation is
where the second gyrophase-averaged potential is
is assumed to be Maxwellian. n i is the gyrophase-averaged charge density
Equation (10) can be interpreted as a normal Poisson equation plus an ion polarization term (the second term on the left). In the gyrokinetic ordering limit, we usually drop the three-dimensional (3-D) Debye shielding term (the first term) in the gyrokinetic Poisson's equation as the Debye shielding term is much smaller than the ion polarization term.
In the case of adiabatic electron where
, the Poisson equation is reduced to
where n i0 and n e0 are ion and electron densities at equilibrium.
Numerical discretization
In the PIC method, the 5-D distribution function is represented by a set of particles. Specifically, the PIC simulations are being carried out using marker particles representing a small volume of phase space characterized by position, velocity, and weight. After being initialized, the particles follow trajectories computed from the characteristic curves given by the Vlasov equation. There are usually two approaches for the initial loading of the particles. They can be loaded uniformly in phase space and assigned different weights that are proportional to the value of the distribution function, or, alternatively, they can be loaded following the distribution function where each particle carries the same weight. The latter is based on the importance sampling approach of the Monte Carlo method (Aydemir 1994) and thus is preferred as it produces a low-noise loading.
To avoid particle load imbalance due to spatial gradient in the distribution function, we use a uniform background density function based on multi-scale expansion (Lee et al., 2011) . In order to address the well-known issue of particle noise in PIC methods, we use the so-called df method (Parker et al., 1993) . In our simulations, the particles discretize only the perturbation df with respect to an equilibrium state f 0 . The df method reduces noise by replacing as much of the Monte Carlo estimate as possible by an analytical formula (Aydemir 1994; Parker et al., 1993) describing f 0 and hence reducing the variance . However, it does not control the noise increase during the longtime particle evolution. For carrying out longtime simulations, we introduce coarse-graining to the system, also called phase space remapping (Chen and Parker 2007; Wang et al., 2011 Wang et al., , 2012 . The phase space grid works as a natural numerical dissipation model to stabilize the collisionless system in longtime simulations. The system of equations is advanced using numerical integration, for example, the second-order Runge Kutta method.
In the PIC method, the Poisson equation is solved on a grid. In GTC-P, we utilize a highly specialized grid that follows the magnetic field lines as they twist around the torus due to the magnetic field pitch (Figure 1 ). This allows the code to take advantage of the strong anisotropy between the fast motion of the particles along the field lines and their slow motion across them. With such a grid, the same accuracy can be achieved using much fewer poloidal planes (1=100) than a nonfield-aligned grid. Thus, a production simulation generally consists of only 32 or 64 high-resolution poloidal planes wrapped around the torus. The 3-D torus geometry is first discretized uniformly in the toroidal dimension. Each 2-D poloidal plane (cross section of the torus) is represented by an unstructured grid with uniform spacing in the radial direction (psi) and approximate uniform arc length spacing in the poloidal direction (theta). To follow the magnetic field line, each concentric circle forming the radial grid of a poloidal plane shifts by a specific poloidal angle as one goes around the torus (see Figure 1) . As discussed earlier, the gyrokinetic approximation neglects the 3-D Debye shielding term in the Poisson equation, which reduces the complex 3-D grid based solve to a fixed number of independent two-dimensional (2-D) solves. On each 2-D plane, the four points algorithm (Lin and Lee 1995) is used to approximate the second-order gyro-averaged operator (equation (11)) on the unstructured grid. The resulting diagonally dominant matrix is then solved by a weighted Jacobi iterative solver (Lin and Lee 1995) .
The modern GTC-P code
GTC-P is developed under the philosophy that in order to leverage the computer science and applied mathematics advances in a timely matter, the physical model needs to be sufficiently simple but nevertheless complete for studying the scaling of turbulent transport spanning the range from present generation experiments to the large-ITER-scale plasmas. This leads to the fact that GTC-P includes the model that takes into account all the toroidicity effects, such as the curvature drift and multiple rational surfaces, but not the noncircular cross-section effects or the fully electromagnetic, nonadiabatic electron dynamics found in some of the other gyrokinetic codes at the current stage (Jolliet et al., 2007; Ku et al., 2009; Lin and Lee 1995) . These techniques reduce the complexity of developing the algorithmic advances required to take advantage of rapidly evolving architectural platforms. It is worth mentioning that the modern GTC-P code shares the same physical model as the original FORTRAN code that has a long history of producing important scientific discoveries via state-of-art computers (Lin et al., 1998 (Lin et al., , 2002 . Besides the implementation differences, the major difference is the dissipation models used in these two codes. While the original FORTRAN version used an artificial heat bath model, the modern GTC-P introduced several numerical dissipation models (Chen and Parker 2007; McMillan et al., 2008; Wang et al., 2011) to stabilize longtime simulation.
Main kernels
Each time step of GTC-P requires the execution of six principal computational kernels. The kernels present a number of intra and internode performance and scalability challenges when running on today's multi-core and accelerated supercomputers. We highlight the kernels and their challenges here.
Charge performs a particle-to-grid interpolation in which particles deposit charge onto the charge grid using the four-point gyro-averaging method (see Figure   2 ). Particles, represented by a four-point approximation for a charged ring, may deposit charge onto as many as 32 unique grid memory locations and as few as eight.
In a shared memory environment, Figure 2 visualizes the fact that threading over either particles or points on the charge ring can result in data hazards, which must be guarded by synchronization mechanisms in order to guarantee correct results. Unfortunately, this challenge extends to SIMD architectures in which attempting to vectorize (or unroll) beyond eight-way (there are four points in the two enclosing poloidal planes) can produce intrathread data dependencies. Construction of thread (or lane) private copies of the charge grid can remedy the synchronization challenges but can exacerbate cache and memory challenges. In a distributed memory environment, the charge deposition phase will also involve bandwidth-intensive MPI communication (a reduction). For example, particles deposit charge on local grids consisting of nonoverlapping zones and ghost zones. The charge on ghost zones needs to be merged to neighboring nonoverlapping zones. In addition, if the domain is decomposed in toroidal dimension, the charge from neighboring toroidal sections needs to be merged. Additionally, the charge phase includes an MPI_Allreduce to obtain a flux-surface-averaged charge for solving the so-called ''zonal flow.' ' The poisson, field, smooth kernels solve the gyrokinetic Poisson equation, compute an electric field, and smooth the charge and potential with a filter on the grid, respectively. These three steps constitute purely grid-related work in which the floating-point operations and memory references scale with the number of poloidal grid points. As in the charge deposition phase, in a distributed memory environment, all three subroutines may involve MPI communication to update the values in the ghost zones.
Push interpolates the electric field onto particles, and using that field, advances particle phase space positions. In the push phase, the electric field values at the location of the particles are ''gathered'' and used for time advancing their phase space coordinates. This step also requires reading irregular grid locations in memory (the electric field values) corresponding to the four bounding boxes of the four points on the ring, involving data reads from up to 32 unique memory locations. Fortunately, the operations in the push step are usually independent (hazard free) and are thus relatively easier to parallelize and vectorize.
In distributed memory environment, shift identifies particles being moved, copies them to a separated buffer, and moves them to the appropriate processes.
In general, particle-related work such as charge, push, and shift scales linearly with the number of particles. Grid-related work such as poisson, smooth, and field scales with the number of poloidal grid points. Since particle to mesh ratio is in the order of 100 to 100,000 in PIC methods, the time spent in grid-related operations is a small component of the overall run time.
Parallel decomposition
GTC-P employs two levels of decompositions. First, a 2-D domain decomposition occurs in the toroidal dimension zeta and the radial dimension psi (see Figure 1 ). In order to further increase MPI parallelism, a second level of decomposition over the particles is introduced. Within each subdomain, the particles are divided between several processes wherein each process owns a fraction of the total particles in that subdomain as well as a private copy of the local grid to simplify the charge deposition. The 2-D domain decomposition and particle decomposition are implemented with MPI using three different communicators: a toroidal communicator, a radial communicator, and a particle communicator. The particles move between domains with nearest-neighbor communication (implemented with MPI_Sendrecv) within the toroidal communicator and radial communicator. To merge these private copies together, GTC-P performs an MPI_Allreduce within the particle communicator. In gyrokinetic simulations, particles move much faster in the toroidal dimension than in the radial dimension at every time step. The messages transferred within the toroidal communicator are of much larger size than the messages exchanged within the radial communicator. To optimize performance, if possible, MPI ranks should be placed such that the physical processors assigned to adjacent toroidal domains in the same toroidal communicator stay close together.
The domain decomposition in the toroidal dimension is trivial as the toroidal grid is uniformly discretized. However, with unstructured grid in circular geometry, the domain decomposition in the radial dimension is far more challenging. We begin the process by partitioning a poloidal plane into nonoverlapping domains with equal area. Assuming particle density is uniform, this partitioning divides all particles in one toroidal section equally across multiple processes. Next, the nonoverlapping domain is extended to line up with the mesh boundary in the radial direction. Finally, the valid grid is extended on each side with ghost cells accounting for charge deposition with four-point approximation. In general, three to eight ghost cells are sufficient. Although this discretization will result in more grid points for the processes close to the edge, the effect from load imbalance is small considering that the number of particles is much larger than the number of grid points (e.g. particles per cell number is much greater than 1). In the case where more than eight ghost cells are required for charge decomposition with the four-point averaging, we can either send the off-region points explicitly to the neighboring processes (Wang et al., 2012) or completely ignore those points. For the later case, benchmarking will be required to ensure that accuracy is preserved with the approximation.
The multiple levels of decomposition enables GTC-P to efficiently carry out ''true weak scaling study'' where both the grid size and the number of particles are increased in proportion to the number of processors. With a 2-D domain decomposition and particle decomposition, GTC-P pushes the scalability of the PIC method to an extreme and allows scaling on the world's largest computing systems . Note that although the simulations are in 3-D, a 2-D domain decomposition is sufficient for weak scaling since the number of grid points in the toroidal dimensional remains constant for all plasma sizes due to Landau damping physics.
Experimental platforms
In this article, we examine the performance and optimization of GTC-P on six different systems. We detail the machines here.
Edison is a 5600 node Cray XC30 MPP at NERSC (Edison Cray XC30). Each node includes two 12-core Xeon (Ivy Bridge) processors running at 2.4 GHz. Each processor includes a 30 MB L3 cache and four DDR3-1600 (the machine has since been upgraded to DDR3-1866) memory controllers which provide nearly 40 GB/s of memory bandwidth. Edison's nodes are connected together using Cray's Aries (Dragonfly) network which provides lower latency, higher bandwidth, and better scalability than torus or fat tree topologies. As each node has two NUMA nodes (one per processor), we run one 12-thread process per NUMA node.
Titan is a GPU-accelerated Cray XK7 MPP at the Oak Ridge National Lab and is composed of 18,688 nodes connected via Cray's Gemini 3-D torus like Blue Waters (Titan Cray XK7). Unlike the similar Blue Waters, there are only two Interlagos CPU processors per node with the others being replaced by a PCIeattached NVIDIA Kepler K20x GPU. Each GPU, running at 733 MHz, has 14 streaming multiprocessors each of 192 CUDA cores, 64 DP data paths, a 64 KB SRAM which can be partitioned into a cache and a scratchpad, and 32 K 32-bit registers. Each GPU includes 6 GB of GDDR5 memory. The GPU presents programmers with dramatically different performance characteristics than CPU architectures as it favors computations that maximize parallelism, present streaming access patterns with easily expressible locality, and minimize off-node (off-GPU) communication.
When using Titan, there are several programming styles which must mix MPI, OpenMP, and CUDA. First, one can run one 16-thread, GPU-accelerated process per node. This minimizes off-node computation and allows one to fully exploit the GPU. However, this demands the programmer optimize for NUMA in OpenMP-a challenging prospect. Second, one can run one eight-thread, GPU-accelerated process per node. This has all the benefits of the aforementioned style, but side steps NUMA by constraining the process to a single NUMA node. Unfortunately, this cuts CPU bandwidth and memory in half. Finally, one may ignore the GPU altogether and run two 8-thread processes per node.
Mira is an IBM Blue Gene/Q (BGQ) supercomputer at the Argonne National Lab (Mira IBM BGQ). Mira is composed of 49,152 compute nodes connected via a high-performance custom network in a 5-D torus. Each node includes a 16-core (+1 system core) 1.6 GHz SOC processor connected to 16 GB of DDR3-1333 DRAM via two 128-bit memory controllers. Each core is inorder, dual-issue, four-way multithreaded, has a 4 3 64b SIMD FMA functional unit, and includes a private 16 KB cache. One must run at least two threads per core in order to dual-issue instructions. All cores are connected via a crossbar to a 32-MB L2 cache.
Blue Waters is a Cray XE6 MPP at the NCSA (Blue Waters Cray XE6). It contains 22,500 Interlagos-based nodes connected by Cray's Gemini network into a 3-D torus. Each node includes four 8-core processors running at 2.3 GHz in which each core has a private 16 KB L1 data cache, each pair of cores shares a 256b-wide SIMD unit, each pair of cores shares a 2-MB L2 cache, and all cores share a 32-MB L3 cache. Each processor is connected to 16 GB of DDR3-1333 DRAM via a 128-bit memory controller. As such, each compute node contains four NUMA nodes and programmers are motivated to run four processes of eight threads per node. Computation must be partitioned across NUMA nodes in order to maximize performance. Each processor (NUMA node) thus provides about half the bandwidth and one-third the compute capability of a BGQ node but has a much more cache.
Piz Daint is a 5272-node GPU-accelerated Cray XC30 MPP at the Swiss National Supercomputing Center (Piz Daint Cray XC30). Piz Daint is very similar to Titan with the caveat that rather than using Cray's Gemini 3-D torus, nodes are connected via Cray's Aries dragonfly network. Aries provides much higher bandwidth, fewer hops, and overall, better scalability than Gemini. The other significant difference is the switch from Interlagos to Xeon (Sandy Bridge). As a result, there is only one NUMA node per compute node. As such, one may simply run one 8-thread process per node and ignore any NUMA optimizations. Performance per NUMA node is broadly comparable to Edison.
Stampede is a 6400-node Xeon Phi (MIC)-accelerated Infiniband cluster at TACC (Stampede). Each node includes two 8-core Xeon (Sandy Bridge) processors running at 2.7 GHz and a PCIe-connected 60-core Xeon Phi coprocessor. Each Xeon Phi core is a dualissue, in-order, four-way multithreaded core capable of executing one 8-way SIMD FMA per cycle. Each core has a private 32 KB L1 cache and a coherent 512 KB L2 cache. Each Xeon Phi is attached to 8 GB of GDDR5 memory providing 160 GB/s of streaming bandwidth. In practice, the Xeon Phi and the Kepler GPU have comparable DRAM bandwidth, but the Xeon Phi has a more conventional cache hierarchy and can run existing MPI+OpenMP applications in native mode. However, many of the programmer-friendly micro-architectural features of the commodity Xeon processors are not present in the Xeon Phi. As a result, attaining high performance can be difficult. In addition to native mode (host-only) performance experiments, we also examine the performance of Xeon/Xeon Phi clusters in symmetric (SYM) mode. In SYM mode, hosts and accelerators are viewed as piers each running an MPI process. Although simpler than offload mode and more capable of fully utilizing a heterogeneous node than native mode, load balancing between Xeon processes and Xeon Phi processes becomes a challenge.
GTC-P on modern node architectures
In order to harness the computing power of the emerging architectures, applications need to be carefully designed such that the hierarchy of parallelism provided by the hardware is fully utilized. In GTC-P, aside from internode parallelization implemented with MPI, we explore on-node scaling with shared memory threadlevel parallelism implemented with OpenMP and incore vectorization with SIMD intrinsics. The structure of array (SoA) layout is chosen for particle data to maximize spatial locality on streaming accesses. The data allocation is aligned to memory boundary to facilitate SIMD intrinsics.
For heterogeneous systems with GPU accelerators, we judiciously select computationally expensive kernels such as charge and push as well as the mandated shift kernel to run on GPUs while leave the communicationintensive subroutines such as smooth, poisson, and field on CPUs. GPU architectures exhibit multiple distinguishing computational constraints including limitedcapacity high-bandwidth memory, limited coherence support, low-overhead scheduling, and high-throughput vector computing. For instance, their small memory to core ratio restricts the use of memory replication as a technique to avoid data hazards in charge. We also found it more efficient to redundantly recompute values rather than precompute and load from memory. Additionally, loop/computation fusion was implemented to further reduce memory usage for auxiliary arrays in push. The GPU's relatively small cache capacity makes it difficult to exploit locality to avoid expensive memory accesses. We addressed this limitation previously using cooperative computation to capture locality for coscheduled threads (thread blocks) (Madduri et al., 2011b) . In this article, we also explore different techniques for explicit management of the GPU shared memory. Like CPUs, GPUs benefit greatly from sorting particles based on their cell association. Sorting helps in reducing the temporal reuse distance and aids the cache hierarchy in capturing locality. We exploit the GPUs low-overhead hardware scheduler to load-balance computation using fine-grained tasks (one particle computation per thread). Ultimately, the high overhead of moving data between the GPU and CPU host motivated us to keep particle persistently allocated on the GPU memory. Only shifted particles are moved to/from the host during communication phases.
In this section, we detail our efforts in optimizing GTC-P kernels on multi-core CPU and many-core GPU.
Charge: Since modern computer architectures rely on data locality and regular computation to tolerate access latency, it is extremely challenging to optimize the charge kernel due to the random memory access, potential fine-grained data hazard, and low computational intensity characteristics. In order to improve data locality for grid access in charge and push, the particles are binned in the radial dimension. The binning frequency is an adjustable parameter given by the users.
In multithreading environment, two particles operating by two different threads may access the same grid point and cause a read-after-write data hazard. Thus, the data access must be either guarded with a synchronization mechanism or redirected to a private copy depending on the availability of memory resources. On CPUs with up to hundreds of threads, the grid replication strategy on a per thread basis is usually used to address the issue of fine-grained data hazards (Madduri et al., 2011a; Wang et al., 2013) . The charge densities on each private copy are summed at the end of the charge phase, where the order of summation is carefully chosen to avoid synchronization. Note for large size plasma simulation, the grid replication strategy is only applicable with 2-D domain decomposition.
On GPUs with thousands of threads, the grid replication strategy is no longer applicable even with particle binning in all dimensions. This is because in gyrokinetic PIC simulation, each particle is actually a gyroparticle represented by four points on a ring of gyroradius. Even with perfectly sorting, the influential grid points from a set of particles of a single cell will spread up to 16 radii. In a circular geometry, each thread may require as much as 2 MB private memory to store the local grid information such as charge density and field-cost prohibitive with thousands of threads. This motivates us to use an update binning algorithm to implement charge deposition on the GPU. Besides binning gyroparticles according to their coordinates periodically, we also bin the points of all gyroparticles at every time step. This additional points binning step simplifies the gyrokinetic PIC algorithm to the standard PIC algorithm. As a result, a local grid private to each thread spans only up to 1 radii instead of 16. Due to reduced ghost zones, the additional point binning significantly reduces the temporary memory storage for hazard-free charge deposition. Hence, we can use the fast shared memory for charge deposition.
Specifically, the binned points are carefully organized such that all points belonging to the same supercell are processed by the same CUDA thread iteratively and the points in two adjacent supercells are processed by two consecutive CUDA threads. Every thread block holds a local copy of the grid on shared memory for charge deposition relying solely on fast shared memory atomic operations. Furthermore, atomic operations could be totally avoided by providing two copies of the partial grid in shared memory, one for charge deposition by threads with even IDs and another with odd IDs. The charge in shared memory is added to global memory at once at the end. Given the extensive use of shared memory, we configure the GPU to favor shared memory.
One drawback of the above algorithm is when the number of points in different supercells has large variations, leading to potential issues of memory waste and load imbalance. As we discussed earlier, in a Monte Carlo type particle simulation, the particles are loaded with importance sampling such that the particle number density is proportional to the distribution function f . In the df method, this translates to the particle number density is proportional to the background distribution function f 0 . As we consider uniform background density currently, we should expect relatively uniform particle number density in each cell. However, since GTC-P uses magnetic coordinates, the Jacobian from the coordinate transformation (from magnetic coordinates to Cartesian coordinates) needs be taken into account. Consequently, the particle number density will be proportional to the Jacobian J = (1 + r cos (u)) 2 , where r and u are the radial and poloidal coordinates. This issue can be addressed by developing a nonuniform grid on each poloidal ring such that the number of points on the new grid has small variations. Note that the nonuniform grid discretization is only used for the purpose of binning points.
The updated binning algorithm works especially well on NVIDIA's previous Fermi architecture, where doubleprecision atomic increments are expensive. The subsequent introduction of the NVIDIA Kepler architecture (employed in Titan and Piz Diant) enabled fast doubleprecision atomic increment (implemented via a compareand-swap operation). As a result, the performance of the charge algorithm is limited by the number of active thread blocks in a streaming multiprocessor. While Kepler preserves the same amount of shared memory as the earlier version Fermi, we observe that the updated binning algorithm is less appealing on Kepler. Alternatively, the previous implementation using cooperative computation (Madduri et al., 2011a ) that uses global atomic write performs well if designed properly. Specifically, computation is organized so that CUDA threads access successive particles when reading but is reorganized so that threads work together on one charge deposition (of up to 32 neighboring grid points) when writing. The reorganization is through transposing the data write in shared memory. Given the limited use of shared memory in this case, the GPU is configured to favor the L1 cache.
Poisson/Field/Smooth: The grid-related kernels are run on the CPUs for both homogeneous and heterogeneous systems. This is mainly because simulations typically employ high particle number densities. As a result, the time spent in grid-based kernels is substantially lower than the particle-related kernels. In addition, due to 2-D domain decomposition, all grid-based kernels involve extensive MPI communication to update the values at ghost zones.
The poloidal planes in tokamaks are in circular geometry, thus the grid-related kernels usually include irregularly nested loops in which psi is the outer loop and theta is the inner loop. On large tokamak devices, the outermost domain has little variability in the outer loop but large variability in the inner loop. Simple thread parallelization of the outer loop among threads is not sufficient for multi-core architectures with thread numbers varying up to 64. The main optimization we used for the grid-based kernels is manually flattening the nested loop into a single loop to expose more thread-level parallelism. Note the OpenMP collapse(2) clause is not applicable here as it requires that the nested loops form a rectangular iteration space.
Push: The push kernel involves two operations: gather and update. The key optimization we employ is loop fusion to avoid the write and read of temporary particle data. The loop fusion optimization also minimizes parallel thread fork-join overhead and increases computational intensity of the kernel. The data locality for grid access is improved by binning the particles in the radial dimension. For large tokamak devices, binning in the radial dimension only may lead to large cache or even TLB misses in CPUs, since the outermost rings usually consist of a large number of grid points. On GPUs, it is important to place the electrostatic field on CUDA's texture memory to maximize the read bandwidth from noncoalescing memory access.
Shift: This kernel streams through the particle array with unit access, identifies the particles to move, packs them in a separated buffer, and sends/receives them to/ from the neighboring processors with message passing. On CPUs, the performance of this kernel sensitively depends on two factors. One is the network performance of the system, in particular for nearest-neighbor point-to-point communication. The second factor is the effect of caches misses in packing the particles in a separated buffer. On heterogeneous systems with GPUs, particle packing is implemented on the GPU while the message passing remains on the CPU. Besides the network performance of the system, the performance of shift kernel will depend on how well the GPU handles the packing process and the performance of data transfer between the CPU and the GPU via the PCIe bus.
In the previous GPU implementation of particle packing (Madduri et al., 2011a) , each thread block maintains small shared buffers that are filled as it traverses its subset of the particle array. Particles are sorted into three buffers for left shift, right shift, and keep buffer. Whenever the local buffer is exhausted, the thread block atomically reserve a space in a preallocated global buffer and copy data from the local buffer to the global one. Recall that the normal iterative shift algorithm on CPUs uses array of structure data layout to pack the data for message passing, the data are transposed while flush to the global buffer. Then, the data are copied back to the CPU where the message passing is executed. Upon completion, the host transfers a list of incoming particles to the GPU, where unpacking involves filing clustered holes in the particle arrays and transposing the data back to the GPU data SoAs layout. With data transposes and atomic operations, maximizing parallelism and attaining good memory coalescing make porting shift kernel extremely challenging on GPUs. This strategy works well on the NVIDIA Fermi architecture, even though it slightly limited the number of active thread blocks (thread block occupancy).
With the introduction of the NVIDIA Kepler architecture with Streaming Multiprocessor Architecture (SMX), the core count per multiprocessor increased and the core speed decreased, while the amount of shared memory is preserved similar to earlier generation Fermi. As such, we redesigned areas of the code that stress the use of the shared memory and limit occupancy including the shift routine. In the latest version of the code, the shift and sort functionality are simultaneously executed in the same phase. Relying on the fast sorting algorithm provided by Thrust library (Thrust 2010), particles are sorted into three buffers for the left shift, right shift, and keep buffer in the GPU's global memory. By modifying the normal iterative shift algorithm on the CPUs to pass message with SoA data layout instead of array of structure, the data transpose operation is not required. Compared with the previous approach, both reliance on the GPU's shared memory and data transpose operations are reduced.
MIC-specific implementation: The Intel Xeon Phi supports several execution models, including native, offload, and SYM mode. As its programming environment is based on C/Fortran languages and MPI/OpenMP standards, it is reasonably straightforward to port the code onto the MIC processor. However, attaining high performance can be difficult on some applications. This is mainly due to several differences in the design of the processor cores on Xeon Phi (Rahman 2013) . First, MIC supports much larger number of threads than the Xeon processor, while the performance of each thread is much lower than that of the Xeon processor. The MIC also has wider vector units. Thus, high performance can only be achieved if the applications utilize all cores and vector units effectively. Additionally, although the peak performance of Xeon Phi is much higher than the Xeon processor, the memory bandwidth per core is not. As such, the lower flop per byte on MIC may require choosing a different data structure and algorithms for solving the same problem. In this article, we share our early experience in porting and optimizing GTC-P code onto Intel Xeon Phi in native mode and SYM mode. As PIC algorithm involves ''gather'' and ''scatter'' operations that features random memory access and potential data hazards, our studies show some challenges and opportunities of using Intel Xeon Phi for irregular applications.
The initial performance of the code on Xeon Phi in native mode was 2 3 -4 3 slower compared with the same code running on host CPU. As the number of threads per core increased from 1 to 4, shift performance did not scale as the number of OpenMP threads were increased, instead the runtime grew linearly with the number of threads. Further investigation concluded that this was due to the ''malloc'' and ''free'' operations performed by each thread. With 240 threads, the calls for these two operations were serializing and thus causing poor scalability. The first optimization we applied is to replace heap memory created by each thread within the OpenMP parallel region with preallocated memory buffer on shift. On multi-core CPUs and many-core GPUs, we use loop fusion to increase computational intensity and to minimize parallel thread fork-join overhead for push. On Intel Xeon Phi, we find that loop fission boosts the performance by 20%. Specifically, we implement the gather and update operations with two separate loops. For the gather loop, in order to decrease the memory access latency due to ''gather'' operation, the neighboring grid information is prefetched to level 1 cache. In addition, the grid memory read is implemented with vector intrinsics. The update loop does not include any irregular operation and can potentially have high performance on Intel Xeon Phi. We thus use #pragma omp for simd to enable SIMD instructions. Considering that the synchronization between the large number of threads in Xeon Phi could be expensive, we use the grid replication strategy, the same as we choose for multi-core CPU, to address the issue of fine-grained data hazard in charge.
In SYM mode execution, MPI processes can run fully on both host and coprocessor. However, it has been demonstrated that the latency and the bandwidth can vary in an order of magnitude depending on the pair of communication devices involved (Karpusenko and Vladimirov 2014) . The multiple levels of decompositions in GTC-P provide great flexibility in mapping the MPI ranks to physical devices with relatively optimal communication bandwidth. To enable message passing through only high bandwidth channels, we turn on particle decomposition and confine all MPI ranks among the same node to a particle communicator. With particle decomposition among the MPI ranks of the same node including the host CPUs and the MICs, we can further enhance the performance by running the communication-intensive grid-based subroutines only on the host CPUs. The grid-based charge density and electrostatic field can be copied from and to the coprocessors with MPI_Reduce and MPI_Scatter within the particle communicator. Since the collective communication only involves message passing inside a node, this avoids communications that potentially involve varying latency and bandwidth. Depending on the performance of the particle-based kernels (e.g. charge, push) on different devices, we distribute the particles between them such that the work on different devices is well balanced.
We also highlight that the current GTC-P SYM mode implementation is similar to the GPU implementation and future offload mode implementation in that only particle-based kernels are running on accelerators. Unlike the offload approach in which the host CPUs and the accelerators are running in a staggered pattern, both the host CPUs and the MICs are running concurrently in SYM mode.
Results and analysis
In this section, we demonstrate the performance of GTC-P on the six evaluated supercomputing systems using weak scaling on different plasma sizes. In contrast to the weak scaling studies by our previous studies (Ibrahim et al., 2013; Madduri et al., 2011a) and by other particle-based fusion codes (Meng et al., 2013) where the grid size was kept constant but the number of particles was scaled, our experiments represent a true weak scaling study of both the grid size and the number of particles. This is important to effectively address the outstanding open issues in fusion plasma physics such as the scaling of the energy confinement time with system size. A summary of our recent ITG simulations using the highly efficient code is presented in Section 6.4.
Configuration
A GTC-P simulation is described by several important numerical parameters: the toroidal grid size ntoroidal, the poloidal grid size mgrid, and the average particle density as measured in the ratio of particles to grid points micell; where the total number of particles in a simulation is ntoroidal 3 mgrid 3 micell. In order to demonstrate the viability of our optimizations across a wide variety of potential simulations, we explore four different grid problem sizes, labeled A, B, C, and D. Grid size A and B correspond to the majority of existing tokamaks in the world, C corresponds to the JET tokamak, the largest device currently in operation (Joint European Torus 2013), and D to ITER, the largest device currently under construction in France (The ITER project 2013) . Table 1 lists the poloidal grid sizes and the corresponding grid and particle-related memory requirement in one toroidal domain for different plasma sizes. Observe that when moving to a plasma device of one size larger (with constant micell), for example, A to B, the poloidal grid size and number of particles in one toroidal domain increase by 4 3 . As discussed earlier, a production simulation generally consists of only fixed number of poloidal planes (e.g. 64), wrapped around the torus for all plasma sizes due to Landau damping physics. As as result, in our weak scaling study, the number of processors increases 4 3 (instead of 8 3 as in some 3-D codes) as we move to a plasma one size larger. In all of our experiments below, we choose particle number density micell = 100. There are two reasons of choosing a single particle number density in our current performance study. First, micell = 100 is a typical number in all production runs using df method. Second, as suggested in earlier studies (Madduri et al., 2011a; Wang et al., 2013) , high particle density leads to improved parallel efficiency. Thus, the results in this study will establish a performance baseline for subsequent higher resolution simulation.
Performance evaluation and analysis
Since the proper placement of MPI ranks to physical nodes/processors is important to the network performance, we have applied optimized placements on each system if applicable. In gyrokinetic simulations, because the parallel velocity is much larger than the perpendicular velocity (parallel and perpendicular to the magnetic field), at each time step, the number of particles moving out of the local domain in the toroidal dimension is 10 3 more than in the radial dimension. We thus place the MPI ranks to favor the toroidal communicator. Generally, this can be achieved by assigning consecutive MPI rank numbers for processes within each toroidal communicator. On a BGQ system with 5-D torus network, we further group two or three torus dimensions The grid and particle memory requirements are for one toroidal domain only. A simulation typically consists of fixed number of toroidal domains, for example, 64. ''mpsi'' is the number of grid points in the radial dimension and ''mthetamax'' is the number of grid points in the poloidal dimension at the outermost ring.
to match 64 for an optimized placement layout by setting the environmental variable RUNJOB_MAPPING. The network resources are shared among all users on Titan and Blue Waters leading to large performance variation for the same experiment. In order to minimize this effect, we run the same experiment multiple times and record the best result. On Blue Waters, we further select a set of convex shaped computing nodes through job scheduler for optimal network performance. The similar convex node selection is not available on Titan by the time the data were collected.
6.2.1. One process per NUMA domain. NUMA nodes are increasingly viewed as the standard unit of computation. When subtracting for age bias, most NUMA nodes have a comparable amount of memory and bandwidth. As such, in our first series of experiments, we choose to run with one MPI process per NUMA node and thereby eliminate the need for NUMA optimizations. The configuration for each platform is shown at the top of Figure 3 , showing the MPI ranks 3 OpenMP threads, as well as MIC in native mode and GPU experiments. To fit the largest problem size D on Piz Daint, a 5272 compute nodes system, we choose 4096 MPI partitions with 64-way parallelization in both toroidal and radial dimensions. Our weak scaling study starts from A size problem using total 64 MPI processes with 64-way partitioning in the toroidal dimension and oneway partitioning in the radial dimension. While the number of partitioning in the toroidal dimension remains unchanged in the weak scaling study, we increase the radial partitioning by 4 3 as we move to each larger plasma size. For all experiments with 1 MPI per NUMA node, we do not apply particle decomposition. Figure 3 presents the breakdown wall-clock time for 100 time steps on each kernel for the weak-scaled Figure 3 . GTC-P weak scaling run times for 100 time steps using plasma sizes A-D with 1 MPI process per NUMA node. Table presents run configurations showing MPI ranks 3 OpenMP threads as well as MIC in native mode and GPU experiments. In all experiments, the number of particles per process~3,235,900. Due to system issues, we were unable to conduct 4096 node native mode simulations on Stampede. Run time relative to Edison is shown above each bar (times greater than 1.0 designate a slowdown). GTC-P: gyrokinetic toroidal code at Princeton.
plasma sizes A-D on our evaluated systems. Note that on GPUs, the shift and sort routines have been merged to one phase in our new GPU-based shift kernel, thus their total wall-clock time are represented by shift time only.
Observe, performance on the CPU-based architectures is generally correlated with the DRAM STREAM bandwidth per NUMA node as shown in Table 2 . However, we observe that the overall run time on MIConly and GPU-accelerated systems is substantially greater than the STREAM-predicted run time of 25% of Edison. There was no speedup on the MIC-only native mode performance over Edison on any function with shift being substantially slower. Conversely, on the GPU-accelerated systems, there was a substantial speedup on the easily parallelized and computeintensive push despite its data locality challenges with comparable performance on charge and slowdowns on push. It is likely the data locality, synchronization, and PCIe challenges of these operations impede the performance of the GPU.
On the CPU systems, when weak scaling to larger size plasmas, the time spent in charge increases for two reasons. First, there are load imbalance issues for charge interpolation on the 2-D poloidal plane. Although the number of particles is well-balanced among different processes, the particle interpolation close to the outer edge of the 2-D will suffer larger cache-or even TLB misses. The problem is exacerbated on larger sized devices where the number of grid points at the edge increases. Second, as we discussed in Section3, charge includes two type of communications: a point-to-point communication that merges values at ghost cell and a collective communication that obtains flux-surface-averaged charge for solving ''zonal flow.'' The time spent on both communications increases as the plasma size grows. By examining the code using performance tools such as CrayPat on several Cray systems (i.e. Titan, Blue Waters, and Piz Daint), we confirm that more L1, L2, and TLB cache misses occur for processors assigned to the domain close to the edge for C and D size plasmas. Although MPI communication time increases from A to D as expected, the time is mostly spent at the synchronization point before the collective communication, thus suggesting that load imbalance is accounting for the major overheads in charge kernel.
The load imbalance caused by cache misses also accounts for the performance behaviors of push as its operation only involve random memory access. The performance of shift sensitively depends on the network performance. Therefore, the time increase from A to D for shift can be used as an indicator of the network performance of a system. The time spent on the grid-based subroutine has increased with large size devices. This is due to the increasing load imbalance on large plasma sizes from the radial domain decomposition in circular geometry. We observe that the time increase on Titan and Blue Waters is especially significant, where the time is mostly spent on the point-to-point communications in the radial dimension to update the ghost cell values. This issue could be alleviated by placing the MPI ranks to physical nodes to favor the radial communicator. However, this will affect the performance of shift in toroidal communicator. The push run time remains relatively flat from A-D plasmas as expected. The GPU design relies on massive number of threads to hide the memory access latency (as opposed to hierarchies of caches on commodity CPUs). Thus, compared with the CPU-only results on Titan and Piz Daint (Figure 3) , our CPU/GPU hybrid implementation results in a significant 2 3 and 3 3 speedup on charge and push kernels, respectively. In order to understand the GPU performance in more details, we measure the time spent in PCIe, in GPU execution, and in CPU communication, respectively. Table 3 lists these breakdowns for three kernels, charge, push, and shift, from Figure 3 .
Observe that the cache miss issue on larger size plasmas is no longer significant since the GPUs rely on massive number of threads to hide the latency. The GPU execution times on Titan and Piz Daint are similar as both systems use the same NVIDIA K20X GPU accelerator. However, Piz Daint has shown better PCIe performance since it has PCIe 3.0 interface whereas Titan utilizes PCIe 2.0. It worth noting that the execution time on shift+sort increases 2 3 from A to B size plasma. This is because A does not have radial shift where only 1-way radial partition is used. The time increase on charge and shift is solely related to the network performance of the systems. It shows that the Aries NIC on Piz Daint has much better performance than the Gemini NIC on Titan, since it has 3 3 more injection bandwidth and significantly higher global bandwidth.
6.2.2. One process per node. Often, there is a desire to view a multi-socket, multi-accelerator heterogeneous compute node as the single unit of computation. Thus, in our second series of experiments, we use one MPI process per node (except on SYM mode Stampede where we use 2 MPI processes per compute node). The configuration for each platform is shown at the top of Figure 4 , showing MPI ranks 3 OpenMP threads, MIC in native mode and SYM mode, as well as GPU experiments. These experiments help us to measure the overall node (as opposed to NUMA domain) performance, while scaling the problem to the largest node configuration possible for a given problem size. As before, we choose 4096 MPI partitions for the D size simulation with 64-way partitions in both toroidal and poloidal dimensions. The performance results are presented in Figure 4 , where for each problem size, the number of computing nodes involved is the same for all systems. These experiments highlight the performance challenges associated with attempting to program multi-socket NUMA architectures using OpenMP.
The Xeon Phi offers three programming modelsoffload (akin to GPU acceleration), native (no host), and SYM (host and MIC are peers in a heterogeneous system). In order to run in SYM mode, one must balance work between CPU processes and MIC processes. Using the data from the previous section (CPU-only and MIC-only), we observe that GTC-P performance using one MIC processor per node is comparable to the performance attained using one of the 8-core Xeon E5-2680 processors. This suggests there is roughly a 2:1 performance difference between the Xeons and the MIC. When running in SYM mode on Stampede, there In this section, we use the latter wherein we assign twice as many particles to the multi-socket CPU process as the MIC process. This configuration delivered better overall performance as the grid-based subroutines are now threaded over all 16 host cores. Figure 4 shows that the SYM mode boosts the performance by 1.2 3 with respect to the CPU-only version up to 1024 nodes on Stampede. Figure 5 presents the weak scaling trend for the simulations in Figure 4 , where ideal scaling is a flat line. The slope of the results essentially shows the impact of load imbalance (due to cache design) and network performance of a given system to our application. Generally speaking, we see good scalability on all systems except Stampede where there is marked degradation in scalability on its fat tree beyond 256 nodes. Moreover, Titan, and Blue Waters both use the same Gemini 3-D torus. When subtracting for the difference in the performance of four CPUs versus. to two CPUs and a GPU, we see very similar scalability between these two machines. Interestingly, Piz Daint and Edison show different scalability despite both using the same Cray Aries Dragonfly network. Figure  5 (right) shows there is a substantial increase in time in charge on Edison while time remains roughly constant on Piz Daint and Mira. An investigation of the timing showed even the average performance over 100 iterations was highly variable on Edison. One can speculate that there is a qualitative difference in machine usage and contention, but further investigation is warranted.
Although Mira delivers the lowest performance, one should remember that it consumes roughly one-quarter of the power per node as Edison and only 8% of the full machine was utilized. As GTC-P has demonstrated effective scalability to the full capability of Mira and Titan (Tang et al., 2014; Wang et al., 2013) , alternate experiments in which the D100 problem is distributed across each full machine (accepting that work per node will differ from one machine to the next) is a potential avenue of future research.
Portability versus speedup
The computer architecture has evolved significantly over the last two decades. This revolution requires the application developers to modify their codes to take full advantage of the new hardware. Sometimes, this even means a complete rewrite of a code. Table 4 shows the trade-off between portability with respect to the number of lines changed and the achieved speedup based on our experience in porting GTC-P to different architectures. While the GPU hardware enhances the performance by up to 4.69 3 , it also requires rewriting whole kernels in CUDA. Directive-based programming models, such as OpenMP, greatly simplify the implementation of new optimizations. In the case of GPU, however, the current directive models (OpenACC 2.0, OpenMP 4.0) are still lacking several key features needed to achieve the same performance as the CUDA version. Overall, the SYM mode MIC version seems to have the best balance between portability and speedup. However, one has to remember that the speedup is solely obtained by utilizing additional resources.
ITG simulations
With the advances of computer architectures and numerical algorithm as well as the development of the modern GTC-P code, we now have the ability to study longtime behavior of turbulence transport on largescale devices (ITER size) with high phase-space resolution. Compared with the original GTC size scaling study of 7000 time steps (Lin et al., 2002) , our production run has increased the phase-space resolution by up to 12 3 and total time steps by 4 3 .
The physical parameters for the simulations are the Cyclone case (Dimits et al., 2000) which has peak ITG at r = 0:5a and local parameters: R 0 =L T = 6:9, R 0 =L n = 2:2, q = 1:4, and (r=q)(dq=dr) = 0:78. Here a The NOLC and the %LC partially indicates the level of effort made to port and optimize GTC-P to a specific architecture. The speedup is obtained from the data in Figure 3 . For example, the two GPU speedup values are calculated by comparing the kernel CPU and GPU times on Piz Daint and Titan, respectively. The speedup can be regarded as chip-to-chip performance comparison except for the MIC SYM mode.
is the minor radius, R 0 is the major radius, and L T and L n are the temperature and density gradient scale length, respectively. The safe factor q defines the amount of twist in the magnetic field as a function of radius. A homogeneous Dirichlet boundary condition is enforced for the electrostatic potential at the core and edge boundary, r = 0:1a and r = 0:9a. This model uses a circular cross section and assumes electrostatic fluctuations with adiabatic electron. The pressure gradient profile, which is the driving force for the turbulence, is defined as exp fÀ½r À 0:5a=0:35a 6 g. The velocity space nonlinearity, which is usually ignored as a high-order term, but has been shown to be important to maintain energy conservation (Dimits et al., 2000) , is included in the simulations. Figure 6 shows the time evolution of the diffusivity in Gyro-Bohm units for a wide range plasma sizes up to 30,000 time steps. We see a gradual ''rollover'' of transport level in large size devices. However, this ''rollover'' is much more gradual and the transport level's magnitude is significant lower than the value obtained from much lower resolution and shorter-duration kinetic simulation by the original GTC code early (Lin et al., 2002) . It is unclear at this point what are the exact reasons that cause the transport level difference between these two codes. Considering that the original GTC used a heat bath model whereas GTC-P used numerical dissipation to stabilize the longtime simulation, we are systematically investigating the effects of various dissipation models. A more in depth study will be presented elsewhere.
Conclusion
In this article, we have described our efforts in developing a highly parallelized PIC code which solves the 5-D Vlasov-Poisson equation with efficient utilization of modern massively parallel HPC systems. In particular, we discuss the kernel implementation on NVIDIA GPU accelerators and Intel Xeon Phi coprocessors. As being ''memory bound,'' the performance of our application is correlated to the DRAM STREAM bandwidth per NUMA node on CPU. On accelerators, however, its performance is substantially lower than the STREAM-predicted number due to data locality, synchronization and PCIe challenges in ''gather'' and ''scatter'' operations, and unavoidable data transfer through PCIe. Comparisons of the code performance on multiple supercomputer systems show that the network performance of a system has become an increasingly important factor for the overall performance of our application, especially for large-scale simulations. GTC-P currently uses two-sided MPI for distributed memory communication. In near-future work, we will actively explore the potential benefits of deploying onesided communication using MPI standard 3.0. Our future work will also involve the development of the offload version for Intel Xeon Phi based system using both Intel compiler offload directives and OpenMP 4.0 standard. Additionally, we will explore the utilization of multiple Intel Xeon Phi coprocessors in a single node. Our expectation is that if these efforts prove successful, scientific discovery in application domains such as fusion energy will be significantly enhanced. 
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author biographies
Bei Wang is an associate research scholar at Princeton University. She received her PhD from the Department of Applied Science at University of California Davis. Her research interests include advanced algorithm and software development for scientific applications in fluid dynamics and plasma physics and the associated code optimization on multi-core and many-core architectures.
Stephane Ethier is a senior computational physicist in the Computational Plasma Physics Group at the Princeton Plasma Physics Laboratory. He received his PhD from the Department of Energy and Materials of the Institut National de la Recherche Scientifique in Montreal, Canada. His current research involves largescale gyrokinetic PIC simulations of microturbulence in magnetic confinement fusion devices as well as all aspects of high-performance computing on massively parallel systems.
William Tang of Princeton University is a principal research physicist at the Princeton Plasma Physics Laboratory (PPPL), lecturer with rank and title of professor in the University's Department of Astrophysical Sciences, Plasma Physics Section, and member of the Executive Board for the University's interdisciplinary ''Princeton Institute for Computational Science and Engineering (PICSciE),'' which he helped establish and for which he served as associate director (2003) (2004) (2005) (2006) (2007) (2008) (2009) Program (2000-present) and is internationally recognized for expertise in the mathematical formalism as well as associated computational applications dealing with electromagnetic kinetic plasma behavior in complex geometries and has well over 200 publications with more than 150 peer-reviewed papers and an ''h-index'' or ''impact factor'' of 45 on the Web of Science, including over 7000 total citations. He has taught for over 30 years at Princeton University and has supervised numerous PhD students, including recipients of the Presidential Early Career Award for Samuel Williams is a computer scientist in the Future Technologies Group at LBNL. He received both his PhD and MSc degrees in computer science from the University of California at Berkeley. He received BSc degrees in electrical engineering, mathematics, and physics from Southern Methodist University. His current research interests include performance optimization and modeling for multi-and many-core architectures.
Leonid Oliker is a senior computer scientist in the Future Technologies Group at LBNL. He received BSc degrees in computer engineering and finance from the University of Pennsylvania and performed both his doctoral and postdoctoral work at the NASA Ames Research Center. Lenny has coauthored over 90 technical articles, five of which received best paper awards. His research interests include high-performance computing evaluation, multicore autotuning, and power-efficient computing.
