Abstract-The Piecewise Parabolic Method (PPM) was designed as a means of exploring compressible gas dynamics problems of interest in astrophysics, including supersonic jets, compressible turbulence, stellar convection, and turbulent mixing and burning of gases in stellar interiors. Over time, the capabilities encapsulated in PPM have coevolved with the availability of a series of high performance computing platforms. Implementation of the algorithm has adapted to and advanced with the architectural capabilities and characteristics of these machines. This adaptability of our PPM codes has enabled targeted astrophysical applications of PPM to exploit these scarce resources to explore complex physical phenomena. Here we describe the means by which this was accomplished, and set a path forward, with a new miniapp, mPPM, for continuing this process in a diverse and dynamic architecture design environment. Adaptations in mPPM for the latest high performance machines are discussed that address the important issue of limited bandwidth from locally attached main memory to the microprocessor chip.
I. INTRODUCTION
The Piecewise-Parabolic Method (PPM) grew out of algorithmic design efforts in the 1970s aimed at compressible gas dynamics, and in particular in achieving numerical representations of strong shocks that were thin when measured in grid cells. The FCT algorithm of Boris and Book [1] was the first to achieve this goal, to our knowledge, with Godunov's first-order method [2] showing important insights into how thin representations of highly nonlinear waves could be achieved from a more physical and less mathematical point of view. Van Leer successfully put these two ideas together in the MUSCL algorithm [3, 4] , with a unique formulation of an FCT-like monotonicity constraint that fit the physical framework of Godunov's approach very well. In 1984, in [5] , multiple alternative approaches were compared in 3 generic test problems, and the key features of PPM [6, 7] that influenced its design and made it successful were identified and discussed. One key feature was not discussed in that article, but emerged in exchanges with the authors of other methods -PPM ran very fast on the Cray-1 (about 50% of the Cray's peak performance), making up with this speed for a greater algorithmic complexity than was found in the alternative approaches of the day. This feature of PPM exploited a feature of the Cray-1, namely an ability to do complex arithmetic at rates that were competitive with simpler arithmetic. To researchers familiar with today's many-core devices designed upon a graphics computing model, this aspect of code competitions in the early 1980s is surely not surprising now. PPM performed more arithmetic than any of the comparison codes in [5] . Nevertheless, it delivered results that were sufficiently more accurate, according to the comparison standards set in [5] , to overcome this handicap and deliver the least time to solution at the same rough accuracy (cf. [5] ). This same kind of issue is very much with us today. However, the disparity in achieved performance on today's largest systems between one code and anothereither embodying different numerical algorithms or even the same ones -is very, very much larger. We can distinguish this very important element from the numerical algorithm by identifying it as the implementation. In this paper, we discuss implementation issues that interact strongly with machine design. These interactions are the motivation for co-design.
The initial implementation success of PPM mentioned above was due to its exploitation of vector hardware. This affected the algorithm design by making it practical to utilize multiple switches inside the algorithm in order to select, based upon the local flow solution, alternative numerical subalgorithms. This approach had been heavily avoided by other researchers up to this time, because conditional arithmetic was understood, incorrectly on the new vector machines, to be slow. PPM grew out of earlier code efforts at the Lawrence Livermore National Laboratory (LLNL) to find effective ways to implement compressible, multifluid gas dynamics on the CDC Star-100 computer, the world's first vector machine. MUSCL advection [8] was implemented in the BBC code (see [5] ) on the Star, and was used to study star formation [9] . This experience gave rise to wide use of vector logic in PPM. A capability for vector logic is found in one form or another on every computing device manufactured today. Thus this initial design choice has remained with us. On moving from the Star to the Cray-1 in the late 1970s, our PPM implementation exploited the Cray's ability to do vector arithmetic at top speed with only relatively short vectors of a few times 64 words. The Star had required vector lengths of several thousands of words, so this new capability was a liberation. It meant that instead of a code in which every temporary result was a vector as long as the entire grid, temporaries could be relatively short, enabling one to keep hundreds of them around, with meaningful names, without eating up all available memory. On the Star, most quantities had to be assigned to one of just 5 working storage vectors, named, appropriately, ws1, ws2, ws3, ws4, ws5. The effect was akin to code encryption. Programmers working with today's GPU devices may recognize this curse, although GPU compilers can alleviate it, while the compilers of the 1970s could not.
The Star experience that brought vector logic and short vector temporaries to PPM implementation also brought an additional feature, the elimination of multidimensional loops. The Star could not handle the short vectors that multidimensional loop nests create, so we simply and manually compiled these away by eliminating them totally from the FORTRAN code. This is of course something a compiler should have done, but these were early days. This feature of our PPM codes enabled it to achieve very good performance on the next class of innovative computing equipment that came along around 1990, MMPs. This feature gave PPM a significant advantage on the early nCUBE machines, and it was useful as well on Thinking Machines' Connection Machine. In early discussions at Los Alamos with Karl-Heinz Winkler and LANL's representative from Thinking Machines, the example of PPM illustrated how a vector instruction executed on each node, as distinct from a parallel SIMD instruction executed in one instance only on each node, could boost performance. This discussion arose from a misreading of the machine's documentation, which mentioned a capability for vector arithmetic. In the documentation, "vector" turned out to mean a quantity with 3 components, as opposed to the vector in computing of the day, which might have hundreds or thousands of components. This discussion led to further discussion, and ultimately to a change in Thinking Machines' compiler design resulting in the then-called "slicewise" compiler. For PPM, the resulting performance boost was about a factor of 3. In this case, the system software changed the most, but our PPM codes also had to change to begin performing parallel vector arithmetic in a SIMD mode. This added a new, highly competitive feature to our PPM code implementation.
Following the Connection Machine, the next architectural development we experienced was the birth of the SMP cluster with the SGI Power Challenge Array in 1993. This new computing system design fully exploited our PPM code's parallel vector design, as well as its ability to scale to a large number of SMPs (shared memory multiprocessors) on a relatively slow cluster network. In this event, we collaborated closely with SGI, so closely in fact that our first and their first implementation was run in their manufacturing building using equipment that was destined, after a short sojourn in this first Power Challenge Array, for a large number of separate customers. On a recordbreaking simulation of compressible turbulence on a 1024 3 grid, our PPM code achieved 5 Gflop/s. The same code ran at 11 Gflop/s on the Connection Machine CM5, but the SGI equipment was much more generic, and much less expensive. This was most definitely a co-design experience -both the machine and the code underwent significant restructuring in the course of this collaboration. At the Supercomputing 93 exhibit, we ran one of the 3 test problems from [5] interactively in the SGI exhibit booth, at about 50 times the performance achieved on a Cray processor for this problem at the time [5] was published. Because this test problem was familiar to many by this time, this demonstration of co-designed hardware and software was very effective.
With the arrival of the SMP cluster architecture, which remains with us today in the world's most powerful computing systems, came the development of the sPPM benchmark code. This code was an early example of a miniapp. It was a complete application, as its use in the Gordon Bell Award winning run in 1998 simulating an experiment to study the Richtmyer-Meshkov instability attests [10] . In this way sPPM stood apart from many previous benchmarks which were libraries of functions to be called from within an application. Examples of these that are still with us today are, of course, LinPack as well as the NAS parallel benchmarks. Because sPPM was a full application, it tested the ability of a full computing system to perform a scientific computation from beginning to end, with all associated costs included. In order to make sPPM more useful to its DOE sponsors, we eliminated from it, at their request, the overlapping of internode communication with on-node computation. Also at their request, we fully vectorized the code, which is not the most effective approach to high performance now and was not then either. There was at the time a desire to come up with a serious and useful code example that could achieve high overall performance on very large computing systems of the day. sPPM, as the 1999 Gordon Bell Award attests, did achieve this goal. In this paper, we discuss a new PPM miniapp code, mPPM. mPPM does not make these sorts of compromises to produce high Gflop/s at the cost of time to solution. It is also considerably more modern in its construction in order to adapt to computing systems available today and also those we anticipate tomorrow.
II. MPPM
Application proxies have been part of the code developers' toolkit for many years. The LINPACK benchmark came into existence [11] as what we are now calling a miniapp. Sweep3d [12] , sPPM [13] , and the NAS benchmarks [14] in some sense may also be viewed in these terms. Large scale proxies, such as LULESH [15] , are serving related purposes. The three DOE Office of Advanced Scientific Computing Research (ASCR) Co-Design Centers (Ex-MatEx, CESAR, and ExaCT) have identified the development of proxy applications as a key component of their efforts. Miniapps, and other kinds of application proxies, are being used as part of machine procurements. The Mantevo project [16] solidifies the application proxy idea, bringing a community-based, focused effort to bear on the wide variety of explorations that application proxies can enable. Here we introduce mPPM as a new element in the Mantevo project, and we discuss, through examples, how it exposes aspects of code implementation that depend upon machine and computing system design in important ways.
III. A GAS DYNAMICS PROBLEM AND ASSOCIATED
NUMERICAL ALGORITHMS mPPM is set up to solve a simple multifluid gas dynamics problem that incorporates several complicating factors commonly encountered. First, we have two distinct gases, each with its own equation of state. We use simple equations of state, so that for each gas we have
, where is the pressure, is the constant ratio of specific heats, is the density, and is the internal energy per unit mass for fluid . We simplify this further by demanding that in each grid cell containing more than a single fluid both the pressures and the temperatures are equal. This further requirement implies that in such a cell all the are the same, and all the densities are in the fixed ratios implied by the mean molecular weights, , of the different gases. For this simple assumption to be valid, we must not change the ionization or dissociation states of any of the gases during the course of the problem.
In the problem set up in mPPM, we begin with two horizontal slabs of denser gas, intended to be krypton, in a background of air within a cubical problem domain. We have periodic boundary conditions in X and Z, and at the bottom and top boundaries (in Y) we have inflow of air at a prescribed density, pressure, and Y-velocity that is assumed to be unaffected by the dynamics of the gas within the computational region. The Yboundary condition is rather artificial. It serves to generate strong horizontal shock waves, and subsequent reverberations, that drive the two slabs of denser gas toward each other. This rather artificial gas dynamics experiment allows us to study the multiple fluid instabilities of the multifluid interfaces in the problem: the Richtmyer-Meshkov instability, the RayleighTaylor instability, and the Kelvin-Helmholtz instability. We get these instabilities started by applying multiple sine-wave perturbations to the upper and lower surfaces of the dense gas slabs. It is not our purpose here to study these fluid instabilities, but instead to use this physical problem to study code performance in a meaningful context. This example problem has been chosen to exercise features in mPPM that deal with unstable multifluid interfaces, strong shocks, and multiple equations of state. The code also includes the generation of output to disk for visualization of the computed dynamics. This can be turned on or off, partially or totally, by the setting of flags in the code, so that I/O This work was supported at the University of Minnesota by a contract from Sandia National Laboratory. Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. Figure 1 . The fractional volume of the denser krypton gas is shown at problem times 0, 0.8, 1.0, and 1.8 in a volume rendering for which pure air is transparent, and progressively higher concentrations of krypton are shown as dark blue, aqua, white (a 50-50 mixture), yellow, and red (pure krypton). performance can be tested or removed from consideration as a matter of choice for any particular run. Such visualizations are shown in Fig. 1 . We see visualizations of the local fractional volume of dense, krypton gas at the outset of the problem (top left), just before the shocks reflected from the midplane strike the dense slabs (top right), just as these reflected shocks break out of the dense slabs (bottom left), and at the end of the problem, when these shocks have reached the boundaries and reflected but have not yet struck the denser slab gas (bottom right).
The numerical algorithms incorporated in mPPM have been fully described elsewhere. The fluid state is updated using the Piecewise-Parabolic Method (PPM) [5] [6] [7] in a version described in detail in [17] and with small modifications to accommodate multiple fluids that are given in [18] . The variable that gives the fraction, , of a grid cell's volume that contains the denser krypton gas is updated either according to the PPM advection scheme as described in [17] , without PPM's contact discontinuity steepening feature enabled, or by the Piecewise-Parabolic Boltzmann (PPB) advection scheme [19, 20] , that has been derived from van Leer's Scheme VI [8] by adding constraints and extending it to 3 dimensions. The PPB scheme has been fully described, in the version used in mPPM, in [20] . The PPB scheme, which is less familiar than PPM, has two principal features that are of interest to us here. First, it is able to represent gas-gas interfaces as smooth transitions from fractional volume values of 0 to 1 while keeping these representations very sharp, in terms of their widths measured in grid cells. This is possible because of applied constraints that recognize the specialness of the values 0 and 1 and also because the method uses 10 moments of within each grid cell, updated independently by the algorithm, to represent as a quadratic form the subcell structure. This feature of PPB is of course related to its accuracy, which in terms of the accumulation of error is 5 th -order [8, 19] , but it is also related to its performance. PPB has a very narrow difference stencil, because it does not need to use many other grid cells in order to form its parabolae that represent a grid cell's internal structure. It therefore appears to be computationally intensive, because it involves about 3 times as much computational labor as PPM advection. However, it is less computationally intensive than PPM because it does 3 times the labor with 10 times the data. This feature of low computational intensity associated with high formal accuracy is shared by a whole class of other numerical algorithms, such as spectral element methods or discontinuous Galerkin methods.
IV. DEALING WITH LIMITED MEMORY BANDWIDTH
If the numerical algorithm is left basically unchanged, then we are aware of only one way, short of redundant and unnecessary additional computation, in which the requirement for offchip memory bandwidth can be reduced: an on-chip cache or local store must be provided. This local store can be mainly in the form of registers, as with devices manufactured by, for example, Nvidia; it can be in the form of a programmer managed cache, as with IBM's Cell processor; or it can be in the form of multiple levels of cache supported by hardware that maps the cache locations to off-chip memory. If the requirement for memory bandwidth is to be reduced, this local store must exist. mPPM exploits on-chip storage in whatever form to reduce its requirement for off-chip data bandwidth.
It is obvious why it is possible to reduce the requirement for off-chip data bandwidth when we reflect that in the solution of the example problem described above, if we could hold the whole description of the fluid state at one time level on the processor chip, we would need NO off-chip bandwidth. With such a chip, we would reuse each stored data item at least as many times as we have time steps in the problem. Our task is therefore clear. We must find ways to get reuse of intermediate computed results before we either discard them or must send them back to off-chip memory. We find that this requires a restructuring of the application code. The numerical algorithms in mPPM are very accurate, and they therefore necessarily involve the generation of hundreds of intermediate results that need never be stored for use in the next time step. If we can fit these intermediate results into the on-chip data store, then we can reuse them and avoid a round trip to off-chip memory in each such instance.
Our problem in squeezing the data we need to reuse into the on-chip cache is made more difficult because the processor relies for its best performance on a SIMD engine that operates on multi-word data. The SIMD width is the number of words that the SIMD engine processes simultaneously. On present equipment, this ranges from 4 to 32 words. Recognizing this fact, mPPM is built to present to the processor (from each simultaneously executing thread of control) a long sequence of operations on such aligned multi-word operands. This forces the smallest unit of cached data to become at least the SIMD processing length. This in turn demands a larger cache to store this data awaiting its reuse. As SIMD processing lengths have been increasing in recent years, cache capacities have not been scaling up along with them, which makes our task of reusing cached data increasingly challenging. mPPM exposes this co-design issue by allowing the user to select the vector length at compile time, choosing between options of 16, 32, or 64 words. Here we see both code and machine design trade-offs between vector length and cache capacity. Our experimentation to date (see Table 2 ) suggests that 16 is enough, and that if cache sizes do not grow, vendors might consider reducing their requirement for simultaneous threads sharing the same core. Forcing code developers to run simultaneous threads on the same core (SMT) cuts the available cache capacity per thread dramatically, and therefore tends to defeat the performance gain that SMT was introduced to deliver.
The mPPM code exploits a briquette data structure to generate efficient vector processing with vectors that are only 16 words long. As illustrated in Fig. 2 , grid briquettes are small regions of logical grid uniformity within grid bricks. Each MPI worker process in mPPM updates an octobrick of 8 identically sized grid bricks, each of which is composed of grid briquettes. Briquettes of 4 3 cells offer opportunities for use of vectors that are 16, 32, or 64 words, which matches present hardware and which options are user selectable in mPPM.
We use briquettes to avoid unnecessary data traffic to and from the processor chip in two ways. First, we store essentially all the data required to update a grid briquette contiguously in memory in a briquette data record. These data records are then treated by the code as atomic units of off-chip data access. This forces upon us a significant amount of planning in carrying out a briquette update. This planning in turn allows us to very easily express (although, on present hardware, far less easily achieve) a complete overlapping of the communication with the off-chip memory (i.e. memory accesses) with computation. We never thought about this issue quite so seriously before 2006, which marked the appearance of IBM's Cell processor, although sPPM includes some techniques for addressing this issue. The Cell processor had a hardware DMA engine for streaming data asynchronously to and from the chip, under programmer control. Since this was the only mechanism provided, it forced us to think about accessing off-chip memory in the way we had grown accustomed to thinking about accessing memory on remote nodes, or even on disks. This mode of thought is highly beneficial, because the cost measured in potential arithmetic that could be performed while this data is in transit is similar to the cost of accesses to remote memories on computing systems of a decade or two ago. mPPM expresses a data prefetching strategy clearly, and it provides C-preprocessor flags to allow a user to test whether or not this clear strategy is working (see Table 2 ). Briquettes are updated in sequence, and briquette records are fetched one full briquette update before they are needed. In addition to storing the briquette record in a presumably cacheresident array one update interval ahead, mPPM can be asked to explicitly load the first word of each cache line involved to force such a prefetch. Alternatively prefetch intrinsic functions provided by Intel can be introduced by turning on such a flag. Data prefetching is a co-design issue, because it is critical to performance on modern processors, and strategies are implemented either in hardware or in compilers by system vendors that may or may not be effective. mPPM exposes an approach to this issue that is motivated by the Cell processor experience and that is not commonly pursued. It allows this approach to be assessed in quantifiable experiments.
We use briquettes to reduce the requirement for data traffic onto and off of the chip in a second way. Briquettes produce for us the opportunity to reduce the on-chip data footprint of our numerical algorithms to a practical minimum. This is a technique which we perfected in our work with the Cell processor and the Roadrunner machine at Los Alamos [21] . In that case, if the data footprint did not fit into 256 KB, including the instructions to be executed, the processor would simply stop. This behavior tends to get one's attention. By today's standards, 256 KB is relatively generous, although not spacious. Cell augmented this with a register file of 128 vector (4-word) registers. It was no simple matter to figure out how to make this work. The lessons we learned from this experience are built into mPPM, and we find that they are as valid today as they were when the Cell processor appeared. Briquettes enable very high performance with very small data footprints, because they allow all arithmetical operands to be aligned and to precisely fit SIMD processing size requirements. We believe this is a key to achieving very high performance with SIMD engines.
On the Cell processor, we simply declared data to live in the on-chip local store forever. On cache-based microprocessors, this cannot be done explicitly, but it can be expressed implicitly, which is almost equally effective. If we want to have a data item be cache resident, then we put it on a FORTRAN subroutine stack. We do that by declaring it inside the subroutine and nowhere else. If the subroutine is to be executed in parallel by multiple threads, we designate our OpenMP parallel region, which in mPPM is the entire code, as "default private," so that the subroutine stack is automatically private, which avoids cache coherency overhead. FORTRAN common blocks are shared by default, so that no redundant assertions of that property need be made in the code. We learned these tricks in the mid 1990s when programming SGI Power Challenge multiprocessor workstations. Data alignment, which is not explicitly supported in high level languages, is also easily achieved. On Roadruner we learned that if in a subroutine we declare local variables with lengths that are multiples of the SIMD processing length, then compilers will align them properly. We also discovered that FORTRAN common blocks are automatically aligned on page boundaries by default in all compilers of which we are aware. These facts make it possible to express cache residency and alignment in standard FORTRAN code, although FORTRAN does not explicitly recognize either as an issue. The requirement that our stack data never be expelled from cache demands that we keep its volume down. Briquettes achieve this goal quite naturally for us, because they produce the shortest vectors that we find to result in high performance.
An additional technique for reducing the cache memory footprint is incorporated in mPPM through the CFDbuilder code translation tool that comes with the miniapp package. This tool and its methodology has been described elsewhere [22] . Here it suffices to say that it performs two extremely important functions for us. First, it takes our code expression for the updating of a single briquette augmented by a ghost briquette on either side, and it automatically transforms that expression into a humongous pipelined loop of several thousand FORTRAN lines. This is the SIMD engine generalization of a code transformation that we developed about 15 years ago for improving the performance of the sPPM code on the MIPS R10000 processor chip. The transformation has been described in [23, 21] , and its benefits quantified in [22] . It is a straightforward, although massive, application of simple pipelining and loop fusion. It involves in-lining all the subroutines called by PPMMF in mPPM Figure 2 . Illustration of the hierarchical grid and, hence, data structures in mPPM -octobricks (left), bricks (center), and briquettes (right). Teams of MPI worker processes update rectangular solid regions composed of octobricks, and the entire problem domain is composed of a rectangular solid made up of such regions. All the physical state data describing a briquette is stored contiguously in memory in a briquette record. Briquette records are atomic units of data access consisting of many cache lines. The contents of these records can efficiently be transposed while the data is on chip and before the updated contents are written back.
and fusing nearly all the loops. Done manually, this transformation takes about 3 days. With CFDbuilder, it takes about 30 seconds. This transformation appears not to be what modern hardware expects, and it might defeat hardware and/or compiler heuristics for data prefetching. The reason is that the resulting loop consists of thousands of lines of instructions for operations on 16-word operands. On Intel's Xeon Phi, these are just one SIMD processing width, and therefore the standard practice of prefetching items for the next loop iteration is defeated -the loop is thousands of lines long. We nevertheless find this code to be the best performing expression that we have come up with for this hardware. We could modify our CFDbuilder code translator to automatically insert explicit prefetching hints, but these would not be prefetches from off-chip memory but instead prefetches from L2 cache into L1. We have not yet built such a feature into CFDbuilder, and simple manual experiments suggest that it might not be worth the effort.
It is not enough to pipeline all our loops into one glorious whole using CFDbuilder. This gives us all the data reuse that we desire, but it does not result in the smallest possible cache footprint. To reduce the cache footprint, CFDbuilder provides a function that is extremely difficult, confusing, and error prone for a programmer to perform manually. That function is cache memory compaction through maximal array contraction. CFDbuilder first generates the pipelined code. It processes the data in a briquette one grid plane of 4×4 cells at a time. Because this processing is pipelined, it is like a production line with several stages. Grid planes from different locations within a briquette are being processed along this production line in a single iteration of the pipelined loop. On each iteration of the pipelined loop, a new grid plane is introduced at one end of the production line and a fully updated one is produced at the other end.
Intermediate results of this complex computation are held in arrays the size of one or more grid planes. Depending upon the demands of the particular algorithm, only certain of the various pipeline (or production line) stages will require values of the corresponding grid plane for any particular quantity. For mPPM, this pipeline has 9 stages, which corresponds to the size of the PPM difference stencil. Some quantities are produced only somewhat down the pipeline and are almost immediately consumed in successive stages. For such quantities, only those few grid planes corresponding to the pipeline stages that require the quantity need to exist in the on-chip cache. CFDbuilder automatically scans the generated pipelined code to determine which grid planes must be allocated on the subroutine stack and which need not be. This reduces the cache footprint of the algorithm dramatically (see [22] ). The impact of this transformation, almost impossible to expect from even an expert and dedicated programmer, is equally dramatic. In Table 1 , we illustrate a co-design issue relating to bandwidth to off-chip memory. We examine individual subroutines in mPPM in their pipelined form with their minimized cache footprints, determined by adding up the sizes of the arrays declared on the subroutine stack. This allows us to quantify the size of the cached data they require, to count via static analysis the operations they perform (flops, with add = mult = 1, compare = 0, reciprocal = 3, sqrt = 5), and to compute the number of flops performed per word that is read from or written to off-chip memory. This procedure assumes that each routine is made into a stand-alone "kernel" that is performed on the entire data, which forces this data to be read in from off chip and written back there. The flops/word we call the computational intensity. It has the same value for either 32-bit or 64-bit precision. In the first column of numbers in the table, we give the percent of redundant computation involved for a presumed computation on a grid of 64 3 cells augmented with ghost cells. These subroutines would all run well on a Cray-1, which could pretty much deliver about 50% of its rated peak performance on kernels with computational intensities higher than unity. However, on today's equipment, all these routines, including even the PPB10 routine with its intensity of over 13, would be memory bandwidth Table 1 . Dependence of computational intensity and redundant computation fraction on size of cached data. Decomposing the mPPM algorithm into small "kernels" does not increase performance, because it reduces computational intensity. See text discussion. Figure 3 . Illustration, taken from [22] , of the code transformations automatically performed by CFDbuilder to our mPPM Fortran code expression based upon the use of a few directives in the code. The options for generation of C and of C+SIMD were based upon a tool written by our collaborator Pei-Hung Lin. These are not included in the mPPM code package along with CFDbuilder. Instead, standard Fortran code is generated in a form that produces high performance when compiled.
Computation
limited. When they are all pipelined together by CFDbuilder, however, the data they read and write comes from cache rather than main memory, so that the overall intensity of the computation rises to 33.1, about one order of magnitude higher than the routines individually. This increase in the intensity can only be accomplished by providing sufficient cache space, which is given in the table as 56.8 KB. This is the space needed per thread (where a thread performs 16-wide SIMD operations). Space is also needed of course for the instructions. These take up a comparable amount of cache storage space. The first rows in the table give measurements for entire time steps consisting of three 1-D passes each. These are performed on blocks of 64 2 ×32 cells, so that the required cached data could fit into a processor's L3 cache. Fitting the data into a cache forces increasing amounts of redundant computation, but it is clear that the computational intensity can rise to quite a high level. In this way mPPM displays a quantitative relationship between code implementation and computational intensity, which determines the requirement for off-chip memory bandwidth. The cost in redundant computation for achieving computational intensity is also made clear. We believe that the techniques we use with CFDbuilder to obtain the high intensities in Table 1 for mPPM are quite general. One simply uses the cache to enable more and more of the computation to be performed upon a single, wellcoordinated pass through the data. Surmounting the off-chip bandwidth hurdle as we have just described has revealed additional, unanticipated hurdles that we must surmount. By eliminating off-chip memory bandwidth as an impediment to performance in this way, we clarify these new issues. We find that today's microprocessors seem unable to compute at half their rated speeds, which is all we would ever expect, even when all the data they need for the computation is on the chip. Surely moving data around on the chip is immensely easier and less costly than moving it onto and off of the chip. Thus it seems astonishing that once we have everything on the chip, our code's performance is still not at the par level set in about 1979 by the Cray-1 machine -namely 50% of peak for a program written with modest care. Some other bottleneck must be involved. For the Intel Nehalem CPU, which runs mPPM at about 22% of peak, we could ignore this problem. We get at least half-way to our 50% goal, and we nearly match sPPM performance from 1998, as a percentage of peak. However, on Intel's Sandy Bridge CPU, we achieve only about 16% of peak, and on Intel's Knights Corner chip, we achieve only about 5.0% of peak. This last example is truly discouraging. However, it is helpful to realize that mPPM performance on Knights Corner is about 2 times that of an 8-core Sandy Bridge CPU (.050×2112 ≈ 2.0×.16×332.8 Gflop/s). If, by some relatively inexpensive chip redesign it would be possible to eliminate the on-chip processing bottleneck that holds us back, then performance on 60 Knights Corner cores, even at half the Sandy Bridge clock rate, would be spectacular.
Computation flops/c ell
Determining why performance is not what we expect is difficult, despite the many profiling tools that exist. A problem is that a modern CPU core is intended to do many things at once. Hence an indication that a series of loads (or prefetching of a grid briquette record) takes, say, 10% of the time does not necessarily mean that this operation has any real cost at all, since it should be overlapped with computation. As physicists, we have honed our skills at devising tests that confirm or invalidate hypotheses. We therefore can build hypotheses and test them by modifying the code. For example, if off-chip memory accesses are truly overlapped with computation, if we turn off all these accesses the code should run at the same speed. mPPM exposes through a large set of C-preprocessor flags opportunities to form such useful experiments. The above one does indeed indicate that for our code off-chip memory accesses cost almost nothing (see Table 2 ). However, on-chip bottlenecks are harder to test for.
A persistent complaint from vendor representatives about the sPPM benchmark code was that it exhibited extreme "register pressure." There was very little redundant arithmetic in the code, and results of arithmetical operations were saved for later use as a general rule. Loop bodies were long. If it is arithmetic that we pay for in a scientific application, this coding style should be optimal. Nevertheless, it does not resemble the standard benchmarks, because computed temporary results are produced, must be stored, and are not immediately consumed. This means that there must be a significant amount of traffic into and out of the register file. Since all the arithmetic, or nearly all, is done in SIMD mode, the registers in question must be large and are therefore unlikely to be numerous. The more registers we have (per thread), the less traffic between the registers and the L1 data cache will be generated. On the other hand, the more bandwidth we have on this critical on-chip channel, the fewer registers we need. This is a design trade-off for a device. For the code, we could perform redundant computation, but this is by definition self-defeating. We could also, perhaps, revise the numerical algorithms. This data traffic on the chip is generated because we are solving coupled partial differential equations. The physics says these equations are coupled, but we could attempt to approximate the physics with uncoupled equations. This approach is familiar from the well-known technique of operator splitting. It is less accurate, but on hardware with insufficient registers and/or on-chip bandwidth, it could perhaps be the best course of action. In mPPM we do not take this course, Table 2 . A few experiments run with mPPM with the worker process or processes on a single, dual-socket Intel Sandy Bridge node with 16 cores (total) running at 2.0 GHz and with the team leader and team timekeeper processes on another, similar node with attached disks. As discussed in the text, these experiments indicate only a small cost for off-chip data accesses (input deck 7 runs only 11% faster than input deck 1), better performance for shorter vectors (input deck 1 runs 53% faster than input deck 8), better performance for 2 threads per core (all 3 input decks run faster with 2 threads per core), and worse performance for the popular "MPI everywhere" mode of execution (for input deck 1).
but we do expose the problem. In future work it may be possible to quantify this trade-off, so that the costs of a particular approach could be assessed.
V. SUMMARY
We have constructed a new miniapp, mPPM, that embodies lessons we have learned over several years, beginning with our experience on the IBM Cell processor and the Roadrunner machine. mPPM provides a capability, through setting of Cpreprocessor flags, to perform a number of experiments intended to illuminate specific co-design issues and trade-offs. The issues discussed here focus on code/machine design alternatives addressing memory bandwidth. We have seen that experiments performed with Intel Sandy Bridge CPUs reveal that we have successfully freed the code from limitations imposed by off-chip memory bandwidth, since eliminating all off-chip data traffic results in only an 11% performance improvement. This accomplishment results from briquette data structures, pipelined briquette processing, and maximal array contraction. The aligned vectors that result from these transformations show that short, not long, vectors give the best performance. Finally, we have quantified for this example code the relationship between the onchip storage space available and the achievable computational intensity. Although the numbers in our Table 1 apply only to this algorithm and code, the fact that these two quantities are strictly related is, we believe, quite generally true. This relationship has significant consequences for both code and device design.
ACKNOWLEDGMENT
We have summarized experience with PPM codes as it relates to co-design over a very long period. Key collaborators in that work were Karl-Heinz Winkler, Stephen Hodson, David Porter, Tom Jacobson, Tom Ruwart, Sarah Anderson, William Dai, and Michael Knox. Terence Parr wrote an early source-tosource translator for our codes to precompile them for the CM-FORTRAN compiler on the Connection Machine. That work was an inspiration for CFDbuilder, which is built using Parr's ANTLR tool. Pei-Hung Lin's back-end precompiler for use with CFDbuilder is not included in the mPPM package, but it was an essential tool in performing our work on the Cell processor, Roadrunner machine, and the IBM Blue Gene/Q. It is with pleasure that we acknowledge the contributions of these collaborators over the years. Development of PPM algorithms and codes has been supported by a series of DOE Office of Science grants, NSF grants, and contracts from the Los Alamos and Sandia National Laboratories. Our work at the University of Minnesota has also been supported by NSF through a CISE Research Infrastructure award CNS-0708822, and by donations of equipment from IBM and Intel. We have also carried out tests on early examples of advanced hardware at Sandia's CSRI. Recent scaling experiments have been carried out through an NSF PRAC grants OCI-0832618 and OCI-1440025 for access to NCSA's Blue Waters system.
