Algorithms with low computational intensity show interesting performance and power consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) as a prototype for this scenario in order to show if and how single-chip performance and power characteristics can be generalized to the highly parallel case. LBM is an algorithm for CFD simulations that has gained popularity due to its ease of implementation and suitability for complex geometries. In this paper we perform a thorough analysis of a sparse-lattice LBM implementation on the Intel Sandy Bridge processor. Starting from a single-core performance model we can describe the intra-chip saturation characteristics of the code and its optimal operating point in terms of energy to solution as a function of the propagation method, the clock frequency, and the SIMD vectorization. We then show how these findings may be extrapolated to the massively parallel level on a petascale-class machine, and quantify the energy-saving potential of various optimizations. We find that high single-core performance and a correct choice of the number of cores used on the chip are the essential factors for lowest energy to solution with minimal loss of performance. In the highly parallel case, these guidelines are found to be even more important for fixing the optimal performance-energy operating point, especially when taking the system's baseline power consumption and the MPI communication characteristics into account. Simplistic measures often applied by users and computing centers, such as setting a low clock speed for memory-bound applications, have limited impact.
INTRODUCTION
For many years, high performance computing has focused solely on optimizing the sustained performance in order to reduce the time to solution. Owing to the increasing energy consumption and energy costs of HPC systems, additional metrics such as energy to solution are brought into focus. Adapting the processors' clock frequency to the needs of an implementation is one way to influence the energy consumption of a compute node and also to meet power capping requirements. However, the complex topology of modern compute nodes with typically at least two sockets and multi-core processors as well as the different performance bottlenecks -such as memory bandwidth, instruction throughput, and arithmetic units -and their saturation behavior make an a priori prediction difficult. Some of these bottlenecks, in particular the sustained memory performance, often depend on the processors' clock frequency.
In this paper we use a lattice-Boltzmann method (LBM) from computational fluid dynamics (CFD) as a typical example for the large class of algorithms with low computational intensity. Despite its seeming simplicity and ease of implementation, optimizing LBM on recent hardware platforms and for different application cases has been the subject of intense research in the last ten years [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] . Here we conduct a thorough analysis of performance and energy to solution on the chip and highly parallel levels for an MPI-parallel implementation of LBM. Starting from the a single-core execution-cache-memory (ECM) performance model we can describe the intra-chip saturation characteristics of two different implementations ("propagation methods" [10] ) and the optimal operating point in terms of performance and energy to solution as a function of the propagation method, the clock frequency, and the SIMD vectorization. In order to find out whether the knowledge thus gained at the chip level can be generalized to the highly parallel case we conduct scaling experiments on up to 128 nodes of a modern cluster system. This paper is organized as follows. The remainder of Sect. 1 covers related work, the basics of the lattice-Boltzmann implementations, the hardware used for testing, and a list of contributions. Sect. 2 then introduces, applies, and validates the ECM model for two different propagation methods on the Intel Sandy Bridge architecture. In Sect. 3 we use a recently introduced multi-core power model to identify the optimal operating points on the chip. Finally, Sect. 4 presents performance data for highly parallel runs and analyzes the impact of the different parameters (clock speed, number of cores per chip, SIMD vectorization, system baseline power). Sect. 5 gives a summary and an outlook to future research.
Related work
The roof line model of WILLIAMS et al. [11] provides valuable insight into how much performance can be achieved with regard to the typical limiting factors memory bandwidth and arithmetic throughput. This allows for a first assessment of how far the performance of the code at hand deviates from the maximum sustainable performance (the "light speed").
Performance modeling and prediction especially in the context of LBM are an ongoing research topic of many groups in engineering and computer science [12, 13] . Auto-tuning was used, e.g., in [14] for a magnetohydrodynamics LBM. The ILBDC LBM code we use in the present work has been optimized previously [15] and its sustained performance is close to predictions from performance models [10] .
Research in the direction of energy-saving hardware and software mechanisms focuses on models and algorithms for dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) [16] . With frequency scaling not only the computational performance of a core changes, but as a side effect also cache and memory bandwidth can be influenced [17, 18] . Any realistic modeling effort must take these effects into account. The ECM model [19, 17 ] is a refinement of the roofline model and allows a more accurate prediction of the intra-chip scaling properties of a parallel code. In [17] we have used it to model a specific variant of an LBM solver. Together with a simple power model derived for the Sandy Bridge chip we were able to explain the performance saturation and energy to solution characteristics of this solver. The analysis in this paper goes much deeper, and considers a more advanced propagation method. It also extends beyond the single chip to fathom the energy and performance properties of the solver in distributed-memory parallel runs on up to 128 nodes.
With energy efficiency moving into the focus of HPC research, modeling the power dissipation of chips has received much attention in recent years. A power model similar to the one used in this paper has been introduced in [20] . It concentrates, however, on microscopic quantities such as the energy cost for transferring data or for performing floating-point operations, and does not address multi-core issues. It would be interesting to reconcile both models for a more holistic view on chip power, but this is left for future work.
The lattice Boltzmann method
Lattice-Boltzmann methods have become a popular approach in computational fluid dynamics. However, they are also interesting for computer scientists as the core algorithm is rather short, uses a larger Jacobi-like stencil with vector instead of scalar data resulting in many concurrent memory streams and no reuse of data in a single iteration, but is straightforward to parallelize due to simple next-neighbor communication.
In the present work we employ the ILBDC code [15] , which uses a D3Q19 lattice model and implements a two-relaxation-time (TRT) collision model [21] . All calculations are performed in doubleprecision floating point arithmetic. The algorithm with the D3Q19 model can be viewed as a 19-point stencil in 3-D accessing only nearest neighbors, but has two important differences to common stencil algorithms: (i) each lattice node consists not only of one, but of 19 values (so called particle distribution functions [PDFs] ); (ii) each PDF accessed is only accessed again in the next time step, which prevents data reuse (unless complex temporal blocking schemes are employed).
The performance of a given LBM approach depends at least on the data layout and memory access patterns, the scale of arithmetic operations (i.e., how well numerical expressions are simplified and combined to avoid unnecessary operations), and their degree of SIMD vectorization. A thorough overview of more propagationstep implementations and their memory access characteristics can be found in [6, 10] .
It seems natural to store the PDFs in a 4-D array with an additional flag field, which is used to distinguish fluid and solid nodes. This is known as the marker-and-cell approach. However, LBM simulations of domains with a large fraction of solid nodes can benefit from a sparse representation of the domain [1, 2, 3, 5, 8, 9] , where only the fluid nodes are kept in a 1-D vector. Indirect accesses to PDFs of neighboring nodes are then required and accomplished through an adjacency list (IDX), which represents the topological connections of the nodes. ILBDC uses such a sparse representation.
For updating one node, optimized implementations read one PDF from each of the 19 surrounding neighbors (streaming step), compute updated values (collision step), and write the results to the PDFs of the local node. This is known as the pull scheme [4] . It is implemented in the ILBDC code together with a structure-ofarrays (SoA) data layout where all PDFs of a direction are stored consecutively in memory before the next direction follows.
To work around the data dependency problems of a combined stream-collide step, two lattices are often used, one as the source and one as the destination. Then, a fluid lattice-node update (FLUP) requires 19 PDF loads, 19 additional PDF loads because of writeallocate transfers, 19 PDF stores and 18 IDX loads of the adjacency list. Assuming double-precision floating-point numbers (eight bytes) for PDF and four-byte integers for IDX, the total number of bytes that must be transferred between CPU and memory for one FLUP is 3 × 19 × 8 (PDF) + 18 × 4 (IDX) bytes = 528 bytes. The write-allocate is triggered by the cache hardware on store misses and loads the data at the accessed location (the complete cache line, to be exact) from memory into the cache before it is updated. With non-temporal store instructions these write-allocates are avoided and the data is directly written from the processor into memory, bypassing the cache hierarchy. The number of bytes required for one FLUP decreases then to 2 × 19 × 8 (PDF) + 18 × 4 (IDX) bytes = 376 bytes. Current standard processors have difficulties with 19 concurrent write streams, in particular if they consist of nontemporal stores. As a remedy, blocking can be applied so that a node's PDFs are read in chunks and updated values are stored in a small temporary buffer, which should be small enough to fit in the L1 cache. From this buffer, two directions of the updated PDFs at a time are written with non-temporal stores to the destination lattice. We call this implementation pull-split no-NT or pull-split NT, depending on whether normal stores or non-temporal stores are used.
BAILEY'S AA pattern [7] for the PDF access allows using one single lattice only (instead of separate source and destination grids) while maintaining the possibility to update all cells in any order and in parallel. The iterations over the lattice are divided into even and odd time steps. During an even step only PDFs of the current node are accessed in each lattice site update. At the following odd step only PDFs of the neighboring nodes are accessed, which requires indirect addressing in our case of the sparse representation. With this update scheme only stores to locations in memory occur which have previously been read. No write-allocate will be performed as the data to be updated already resides in the cache. We use an optimized version where the even time step is completely SIMD-vectorizable (SSE/AVX), which can easily be accomplished as all PDFs are accessed consecutively and no indirect access is required. In the odd time step a partial vectorization is performed, which can avoid the indirect addressing and allows for vectorized execution of consecutively stored chunks of PDFs. Nodes which cannot be treated in this way are regularly updated without SIMD vectorization (i.e., in scalar mode). The fraction of nodes which can be updated with SIMD operations depends on the geometry used for the simulation. During even time steps 2 × 19 × 8 (PDF) bytes = 304 bytes per FLUP are required. In the odd time step the number of bytes needed for one FLUP depends on the fraction of vectorizable updates. The lower bound is the case when all updates can be vectorized. Here only 2 × 19 × 8 bytes = 304 bytes are required. The upper bound is reached when all updates must be scalar and indirect accesses are required, which results in 2 × 19 × 8 (PDF) + 18 × 4 (IDX) bytes = 376 bytes.
In order to have full control over the code vectorization, all performance-critical parts were implemented using SIMD compiler intrinsics.
Test bed and benchmark cases
SuperMUC [22] , which is installed at Leibniz Supercomputing Center (LRZ) 1 in Garching near Munich, is a tier-0 PRACE 2 system and one of the main federal compute resources in Germany. It is built from a number of 512-node "islands," with a fully non-blocking fat tree FDR10 InfiniBand connectivity inside each island. A compute node comprises two Intel Sandy Bridge ("SNB") EP (Xeon E5-2680) eight-core processors with a base clock frequency of 2.7 GHz. The actual clock speed of the processors can be influenced by a so-called "energy tag," which is supplied upon LoadLeveler job submission, together with a parameter specifying how much performance degradation the user wants to tolerate for their job. A heuristic-based mechanism based on hardware performance counter measurements of the user's previous jobs with the same energy tag then sets the clock frequency for the job. Since SuperMUC does not easily allow an arbitrary frequency setting, and the "turbo mode" feature of the Intel processors cannot be used at all, all single-node benchmark tests were run on a standalone Sandy Bridge EP node with the same type of CPU and otherwise similar characteristics.
The Intel Fortran/C compiler 13.1 and Intel MPI 4.1 were used in all cases. The operating system on SuperMUC is SuSE SLES11. Each MPI process was explicitly pinned to its physical core using sched_setaffinity() within the code. All benchmarks were run inside a single island to guarantee that communication is performed through the fully non-blocking fat tree. Two geometries were selected for the benchmarks in this paper. The first is an empty channel which consists only of fluid nodes except for the walls. The second geometry is a packed bed reactor, i.e., a tube filled with spheres (see Fig. 1 ). It represents a real world application case for flow simulation with this type of code. Both geometries have dimensions of 4000 × 80 × 80 nodes, resulting in 25 · 10 6 fluid nodes (≈ 3.8 GB lattice + 1.8 GB adj. list) and 19 · 10 6 fluid nodes (≈ 2.9 GB lattice + 1.4 GB adj. list) for the channel and the reactor geometry, respectively.
With these dimensions both geometries fit into the NUMA locality domain of one socket on SuperMUC. For strong scaling runs the reactor geometry was enlarged to 8000 × 160 × 160 nodes, as the smaller lattice fits in the L3 caches of 128 compute nodes and above. The large reactor geometry consists of 157 · 10 6 fluid nodes and requires around 24 GB of memory for the lattice and around 11 GB for the adjacency list.
The ILBDC code is purely MPI-parallel. All single-node measurements were thus performed with intra-node MPI only; a hybrid MPI/OpenMP version exists, but the details of the multi-threaded implementation (alignment constraints and loop peeling, encoding of obstacle-free regions for SIMD execution in the odd time step) are beyond the scope of this work. We also expect no major differences for an equivalent OpenMP-parallel version. Details abut the MPI parallelization can be found in Sect. 4.1.
Contribution
This paper makes the following contributions:
The scalar and AVX-vectorized single-core performance and intrachip saturation of an LBM implementation with AA propagation pattern is successfully modeled using the ECM model, and compared to the popular "pull-split" propagation model. We show the superiority of the AA pattern in terms of performance, and demonstrate that "best possible" performance is achieved on the chip.
The energy consumption of the LBM algorithm with AA propagation pattern is modeled on the chip level for a range of clock frequencies. Together with the ECM model we gain a coherent picture of the performance and power properties of the LBM algorithm on the chip and achieve good qualitative agreement with measurements. A region of optimal operating points w.r.t. clock speed and number of cores is identified. The system's baseline power (power consumption of everything apart from the CPUs, i.e., memory, chip sets, network, disks, etc.) is taken into account and shown to have a damping influence on the differences in energy consumption (as predicted by the power model). Even then, potential energy savings of up to 50 % can be achieved compared to a naive operating point with the inferior pull-split propagation model. Single-thread code performance and the selection of an optimal number of cores per chip (the latter depending on the former) are shown to have the largest impact on energy consumption.
In highly parallel LBM runs, we observe a loss in parallel efficiency when the CPU clock speed is reduced. This unexpected result can be explained by a strong dependence of effective inter-node and intra-node MPI communication bandwidth on the clock speed. The effective bandwidth also shows a strong negative correlation with the number of MPI processes per node. As a consequence, minimal energy to solution in the highly parallel case depends even more strongly on the proper choice of the operating point, especially on the number of cores per chip (and thus the single-thread performance). A simple non-reflective reduction of the clock speed will reduce performance and consume more energy at the same time.
ECM PERFORMANCE MODEL
The ECM model [19, 17 ] is a refinement of the roofline model [23, 24] . Note that the ECM model is not specific to the latticeBoltzmann algorithm. It has also been used successfully to describe the performance and scaling properties of streaming benchmark kernels [19, 17] , stencil smoothers [25, 26] , and medical image reconstruction kernels [27] . It is to our knowledge the first approach that can successfully model the single-thread performance and onchip scalability of data streaming applications on multicore processors.
It starts with an analysis of the serial loop instruction code to get an estimate for the "applicable peak performance," i.e., the expected maximum performance when all data comes from the L1 cache. After that, all required data transfers needed to get the data in and out of L1 are accounted for, together with the time it takes to move the cache lines up and down through the memory hierarchy levels. Pure streaming is assumed, i.e., there are no latency effects and the prefetching mechanisms work perfectly. Since it is sometimes hard to predict which of the above contributions to the serial runtime of a loop can overlap, the ECM model generates a prediction interval based on no-overlap and full-overlap assumptions, respectively.
Modeling the in-L1 execution and data transfers both require some knowledge about the architecture, such as instruction latency and throughput, data path widths, and the achievable memory bandwidth. If some of this data is unavailable, suitable microbenchmarks or tools can be employed. Especially for the in-core performance analysis the "Intel Architecture Code Analyzer" (IACA) [28] is instrumental in getting a better view of the relevant bottlenecks. All data transfers between the memory hierarchy levels occur in packets of one cache line; hence, the ECM model always considers a "unit of work" of one cache line's length. If, e.g., a loop accesses single precision floating-point arrays with unit stride on a CPU with 64-byte cache lines, the unit of work is sixteen iterations.
In order to get the scaling behavior across the cores of a multi-core chip it is assumed that each core creates some bandwidth pressure on the relevant bottleneck, which is usually a shared cache or the chip's memory interface. When the aggregate bandwidth pressure exceeds the available bandwidth, saturation sets in [29] : Figure 2 : Intra-socket strong scaling of the AA pattern LBM implementation for an empty channel, comparing AVX, SSE, and scalar code at the clock frequencies of 2.7 GHz (solid lines) and 1.2 GHz (dashed lines). The corresponding memory bandwidths are indicated for selected cases.
Here, P 0 is the single-core performance as predicted by the ECM model, and P max is the saturated full-chip performance for the relevant bottleneck (e.g., the memory bandwidth b S ),
with I being the computational intensity, i.e., how much "work" (flops, FLUPs, . . . ) is done per byte transferred over the bottleneck.
Observed chip-level performance and scaling
As a motivation for doing this kind of analysis we show in Fig. 2 the intra-socket scaling of the empty channel test case with the AA pattern for the two "extremal" clock frequencies of 2.7 GHz and 1.2 GHz, respectively, in three variants: AVX-vectorized (full 256-bit loads/stores), SSE-vectorized, and scalar. The data indicates that SIMD vectorization has a large impact in the serial case; in fact, the serial performance differs by more than a factor of two between the AVX and the scalar code. At 2.7 GHz, the gap closes as the number of cores is increased. On the full socket the scalar code is hardly 10 % slower than the AVX variant. The latter, however, reaches the same level already with four cores, which opens an opportunity for saving energy by leaving cores idle. At 1.2 GHz, the situation in the serial case is similar, but on a lower level. The single-core performance of all code variants is roughly proportional to clock speed. However, only the AVX-vectorized code shows a saturation pattern, while the SSE and scalar variants scale linearly up to eight cores without reaching a bandwidth barrier. Hence, lack of vectorization ("slow code") cannot be compensated by using more cores in this case. Moreover, the maximum memory bandwidth is correlated with the core clock frequency and varies by about 20 % across the full frequency range [18] .
In Fig. 3 we show a socket-level performance comparison of the scalar and vectorized AA pattern implementation with the pull-split pattern for both application cases (empty channel vs. packed reactor) at a clock speed of 2.7 GHz. Although there is a large fraction of obstacles in the packed reactor geometry, their presence hardly influences the performance, independent of the propagation pattern.
We also see that the pull-split pattern is not competitive, since it cannot by far saturate the memory bandwidth of the chip. The intention of applying the ECM model is to gain deeper insight into this performance behavior, and to pave the way for a practically useful energy consumption analysis.
In-core analysis
An IACA throughput analysis for the AA pattern kernel shows that the ADD port of the SNB core is the sole bottleneck of core execution for all variants (scalar, SSE, AVX), as well as for even and odd time steps, and that one loop iteration (four updates with AVX, two with SSE, one for scalar) should take about 135 cycles. In contrast, a critical path analysis reports somewhat longer execution times due to dependencies in the instruction and data flow. The critical path depends on the type of time step and has a maximum length of 163 cycles (even, AVX), 212 cycles (odd, AVX), 160 cycles (even, scalar), and 187 cycles (odd, scalar). This prediction roughly coincides with direct measurements, which we will use as an input in the following (160, 212, 158, and 160 cycles, respectively). These numbers must be multiplied by two (for AVX) or eight (for scalar) for getting execution times for one unit of work, i.e., a cache line (see the table in Fig. 4 ).
The analysis for the packed reactor case is surprisingly very similar: the even time step does not change at all, since no index access is required. In the odd time step, even when assuming no potential for vectorization (as would be the case for an extremely porous geometry) there is ample room for hiding the additional loads for the index array due to the bottleneck on the ADD port. This step is necessarily scalar, however, so the execution time is about 4 times longer per unit of work. The actual impact of this slowdown depends on the fraction of vectorizable updates. In our applications, this fraction is roughly 97 % for the empty channel and 92 % for the packed reactor case. This leads to a very small performance penalty for the latter, which was already observed in Fig. 3 . Hence, we will only consider the empty channel case for the rest of the chip-level analysis. The ECM model requires the maximum attainable memory bandwidth as an input parameter. It is known that this value depends on the number of parallel read/write streams as well as the CPU clock speed. From a data transfer perspective, the AA-pattern implementation of the D3Q19 LBM algorithm reads 19 arrays from memory, modifies their contents, and writes them back. In order to get the maximum memory bandwidth on the socket we hence use a parallel multi-stream array update benchmark (see Listing 1). It is designed to mimic the data streaming behavior of the LBM algorithm. Figure 5 shows the achieved memory bandwidth on one SNB socket with varying number of threads (cores) and clock frequencies between 1.2 GHz and 2.7 GHz (plus turbo mode). As predicted by the ECM model, the single-thread performance is proportional to the clock speed, and the saturation point is shifted to larger thread counts as the clock speed decreases: While saturation is reached near three cores with turbo mode, up to six cores are needed at the lowest frequencies. Due to the large number of read/write streams, the maximum bandwidth is significantly lower than with a standard single-stream update kernel (dashed line in Fig. 5 ). At the same time, the maximum (saturation) memory bandwidth drops by about 25 % over the whole frequency range; there is another 
Data transfers and saturation behavior on the chip
substantial drop when using the full socket (eight cores) at the lowest frequency. As of now we have no conclusive explanation for these latter effects. They do, however, influence the considerations on energy dissipation, which will be discussed in Sect. 3. In the following we will use the maximum bandwidths as measured at the respective frequencies as an input to the ECM model in order to calculate the number of cycles required to transfer cache lines between memory and L3 cache. Fig. 4 shows the complete ECM model analysis at 2.7 and 1.7 GHz, respectively. The cycle counts in square brackets are contributions from loading the adjacency information (dashed arrows), and can be ignored for the empty channel case. The achievable memory bandwidth (36 GB/s and 33 GB/s, respectively) and the clock speed enter the model when calculating the cycles for data transfers to and from main memory. Data transfers between adjacent cache levels are assumed occur at 32 bytes per cycle, so these cycle counts are independent of the clock frequency. The various execution and data transfer times may be combined in different ways to arrive at a performance prediction for the serial program:
1. The most conservative (worst case) assumption is that none of those contributions overlap with each other, so that the execution time is equal to their sum (e.g., 320+2·76+182=654 cycles for the even time step with AVX at 2.7 GHz).
2. The most optimistic assumption is that the cycles in which the L1 cache is occupied by loads and stores from the core cannot be used for reloads and evicts to L2, but all other contributions do overlap.
3. Lastly one may assume that the pure in-core execution part (everything except loads and stores) can overlap with loads and evicts from/to the L2 cache, but that there is no overlap beyond that.
None of these assumptions coincides with the roofline model, which requires the achievable memory bandwidth for each number of cores as an input parameter. The ECM model only requires the maximum (saturated) bandwidth, and predicts the scaling. Figure 6 shows a comparison of the measured performance for the AVX-vectorized AA pattern implementation with the three models described above. Apart from the region around the saturation point (3-4 cores), the third assumption provides the best fit to the data.
Validation of the performance model
For completeness the picture also shows the AVX implementation of the pull-split propagation pattern with and without non-temporal stores (triangles). Although the pull-split NT version has almost the same computational intensity as the AA pattern, it is not competitive. This failure can mainly be attributed to the fact that the pullsplit variant cannot be efficiently SIMD-vectorized on the Sandy Bridge architecture due to the indirect access in every lattice site update. More specifically, the loop which loads the neighboring distribution functions and stores intermediate results into temporary buffers is scalar. The loops which write the updated data back to the grid for the new time step can be vectorized, which makes the use of non-temporal stores possible (down-triangles).
We will ignore pull-split from now on and focus the following discussion on the AA pattern.
POWER MODEL
Recently, considerations of power dissipation and energy to solution have received much interest in the supercomputing community. The two prevalent questions are: (i) How can a parallel code be run so that its overall energy consumption until a solution is reached can be minimized, preferably under the constraint of constant time to solution? and (ii) How can a parallel computer be operated in a production environment so that overall power dissipation is minimized or kept below a given maximum?
We concentrate on the first question here. In [17] we have established a simple power model which is able to explain many of the peculiar properties of the energy-to-solution metric on a multicore processor for load-balanced codes that may show some performance saturation as the number of cores used is increased. In the following we briefly summarize its derivation and the most important conclusions drawn from it.
Power versus clock speed and energy to solution
Direct measurements on the Sandy Bridge chip [17] with the RAPL interface [30, 31] and the likwid-powermeter tool [32, 33] show an expected strong growth of power dissipation with rising clock frequency. Although it is not clear what the actual dependence should be, measurements suggest a power law with an exponent between 2 and 3. In the following we assume a quadratic behavior so that the power dissipation is
where f is the clock frequency and t is the number of cores used. Since W 1 is usually very small we neglect it in the following. The part which is linear in t is the dynamic power dissipation, whereas W 0 is the baseline power. Energy to solution is proportional to the ratio of power dissipation to performance (for a constant problem). We assume that multi-core performance can be modeled by
where f 0 is the chip's base frequency, P 0 is the serial performance at f 0 , and P max is a maximum performance (see also Sect. 2). P max may be set by the presence of some bottleneck such as the memory bandwidth, or it may be infinite if the code is perfectly scalable. Hence, the energy to solution is
It is a simple consequence from this model that a minimum energy to solution is reached near the performance saturation point, i.e., at the smallest number of cores for which P(t) = P max . If by some means P max can be increased, e.g., by choosing the AA propagation pattern, energy to solution will immediately go down proportionally. In case the available number of cores is too small to get saturation, i.e., if the performance scales well, all cores must be used for minimum energy consumption.
The single-core performance P 0 also appears only in the denominator. In the application cases we consider here, it is mainly influenced by the SIMD vectorization and thus, indirectly, by the choice of propagation pattern. The larger P 0 , the fewer cores are needed to reach saturation, so there is an inherent energy saving potential in having a fast single-threaded code.
The dependence of E on the clock frequency is more complex: if there is no saturation there may be an optimal frequency f opt for which E is minimal:
If W 0 comprises only the chip's idle power, f 0 is in the range of accessible frequencies. However, a low frequency setting has the adverse effect of extending the time to solution, which may not be desirable. Cost models other than time or energy (such as the energy-delay product) may then be needed to determine whether this is acceptable. On the other hand, if W 0 contains (the chip's share of) the baseline power of the complete node, f opt is usually near or beyond the highest possible setting, and "clock race to idle" is the strategy of choice. The details of this analysis can be found in [17] .
Energy to solution for the LBM solver on the chip
The ECM model and the power model enable a combined analysis of the energy and performance properties of the LBM algorithm. It is useful to plot energy to solution versus performance, with the number of cores used as a parameter within a data set for a specific frequency, SIMD vectorization variant, propagation method, or other property. This has been done in Fig. 7a for three different clock frequencies and turbo mode, using the AA pattern in AVX and scalar variants. In turbo mode, each data point was computed using the maximum allowed frequency for each number of active cores. The corresponding measurements are shown in Fig 7b. Note that we always show energy to solution in arbitrary units, but the values shown are coherent for a specific problem size (geometry and number of iterations).
The models are able to describe the qualitative features of energy and performance. The observed deviations are caused by (i) the inability of the ECM model to accurately describe the performance behavior in the vicinity of the saturation point, (ii) the inaccuracy in determining W 2 and W 0 , and (iii) the approximation of linear power behavior with respect to core count even with saturated codes like LBM at higher clock speeds. In addition, turbo mode does not fit perfectly into the model (5) since the SNB chip can operate beyond its thermal design power (TDP) for a limited amount of time [31] . Figure 8 : Same data as in Fig. 7b but with a power baseline of 50 W added to the socket. The circle marks a possible optimal operating point for almost minimal energy with a tolerable loss in performance. For reference, the best data for the pull-split propagation pattern (vectorized, full socket) for 1.2, 2.0, and 2.7 GHz is shown (filled squares).
This is why the deviation from the measurements is especially large with turbo mode (right-pointing triangles in Fig. 7 ).
Looking at the minimum energy point with respect to clock frequency and number of cores in the regime where performance is not saturated, we see that this point moves to smaller frequency as the core count goes up, as described by (6) . In general, all other things being equal, a faster sequential code (AVX instead of scalar) saves energy. Comparing energy to solution for the AVX codes at their respective saturation points, we can identify an "optimization space" (shaded area in Fig. 7b) , in which the desired optimal operating point should be found. Depending on the emphasis one wants to put on energy minimization vs. maximum performance, this point may be in the lower left corner of the area. In this case one would use all cores at the lowest frequency (1.2 GHz, filled circles) and sacrifice about 20 % of performance compared to the right edge of the area, which is defined by the saturation point at higher frequencies (2.0 GHz to turbo mode). Another clear conclusion is that turbo mode is of no good use for the LBM implementations studied here, neither from a performance nor from an energy point of view.
There is no single, well-defined criterion for identifying the optimal operating point on the chip level. One may certainly employ cost models such as the energy-delay product (ratio of energy and performance), but this is only one possible choice. For reference we have included a line of constant energy-delay product in Fig. 7b . From the data we have collected, using 5-6 cores at 2.0-2.3 GHz seems to provide a good compromise between performance loss and energy consumption ("as far on the lower right as possible").
While the model and the measurements yield a consistent picture on the chip level, it is clear that the chip contributes only a (however significant) part to the overall power consumption of a compute node. As mentioned above, when assessing the real energy demand for running an application, the rest of the system should be taken into account. We do this by setting W 0 = 73 W for the chip-level baseline power, which amounts to roughly 300 W of node power (assuming two-socket nodes). This is also the value measured during a LINPACK run on SuperMUC [34] . With this change we can offset the energy measurements from Fig. 7 to arrive at the data shown in Fig. 8 . Figure 9 : Parallel efficiency of the large packed bed reactor application case (8000 × 160 × 160 lattice nodes) for different frequency settings and different number of processes per chip (PPC) on up to 128 nodes of SuperMUC. The efficiency calculation was based on the four-node performance baseline.
As expected, the modified baseline power leads to a reduction of the vertical spread between the measurements for different clock frequencies. While it was possible with the chip-level (i.e., small) W 0 to have a situation where energy to solution was heavily influenced by frequency and SIMD vectorization even at a specific performance level (with a spread of up to 2× within the optimization space shown in Fig. 7) , a large W 0 reduces the spread to about 25 %. Hence, a large baseline power favors the "race to idle" principle where the most influential parameter is performance; optimizations that favor a larger saturation performance (such as the AA propagation pattern, or blocking schemes which increase the computational intensity) have the most potential for saving energy. In addition, optimized clock speed and a reduction of the number of cores used can yield second-order but still significant savings. Within the transformed optimization space (shaded area in Fig. 8) we can identify a possible optimal operating point at about 2.0 GHz and six cores, with almost minimal energy to solution and a performance loss of about 6 % compared to the highest possible saturation level. In comparison to a naive strategy of running on all cores with turbo mode enabled and a scalar kernel, more than one third of the energy can be saved.
The "race to idle" principle with respect to maximum code performance is evident from a comparison with the energy-performance data for the pull-split pattern (filled squares) in the best variant (SSE or AVX vectorized, non-temporal stores, full socket) at three different frequencies in Fig. 8 : The pull-split pattern can neither compete with AA in the performance nor in the energy dimension. Using AA, almost a factor of two in energy and 30-40 % of runtime can be saved in comparison to pull-split.
HIGHLY PARALLEL LBM SIMULA-TIONS 4.1 MPI parallelization in ILBDC
ILBDC uses an MPI parallelization with a static load balancing scheme. The sparse representation of the lattice is cut into equally sized chunks, so that each MPI rank receives the same number of fluid nodes (probably off by one). The interfaces of such generated partitions can be arbitrarily formed with different numbers of partition neighbors, as the simple cutting of the sparse representation does not consider any topological information. However, in the case of the channel and reactor benchmark geometries this method results only in a 1-D decomposition, where each rank only needs to exchange ghost PDFs with its two direct neighbors. A more extensive description of this approach can be found in [35] . We distribute the ranks linearly across the compute nodes, so that consecutive ranks are located nearby on the same node. In case of strong scaling the communication volume of a process stays constant, since each rank only gets a smaller segment of the long geometries when the number of processes goes up.
The packed bed reactor geometry was used for all the multi-node experiments, since it is the application scenario that is relevant in practice. We have shown earlier that the node-level performance (and thus power) properties are very similar to the empty channel case. Note also that "turbo mode" cannot be activated on Super-MUC, so we stick to the fixed frequencies of 1.2, 1.7, 2.3, and 2.7 GHz in the following.
Performance and energy at strong scaling

Parallel efficiency and communication performance
All variants of the AA pattern scale well up to 32 nodes (512 cores) at all frequencies, and parallel efficiency only starts to degrade below 90 % beyond that point. Scaling experiments were performed on up to 128 nodes (2048 cores), since this is where some variants start to show efficiencies as low as 60 %. In Fig. 9 we show the parallel efficiency of the strong scaling runs versus the number of nodes at the four chosen frequencies and with between four and eight processors per chip (PPC). Since the application case is too large to fit on a single node, all efficiency numbers were normalized to the four-node run.
Usually one would expect the parallel efficiency to increase as the node-level performance goes down, because communication and synchronization overheads become less important when the pure compute time goes up. On SuperMUC, the opposite is the case: The minimum parallel efficiency (at 128 nodes) varies between 76 and 63 % (depending on the number of cores per chip) for 2.7 GHz, but between 69 and 61 % at 1.2 GHz. We conclude that there must be a frequency-dependent factor which impedes scalability whenever communication overhead plays a significant role.
In order to explore the reasons for this effect we have conducted experiments with "sendrecv" from the Intel MPI benchmark suite (IMB) [36] , since it mimics the ringshift-like halo-exchange communication pattern of the ILBDC code. Each MPI process exchanges data with its neighbors: MPI_Sendrecv(to right neighbor, from left neighbor). The benchmark reports the available communication bandwidth per process. In Fig. 10 we show the results for two SuperMUC nodes in the two corner cases of one process (PPN=1) and 16 processes per node (PPN=16) for the two extremal frequencies of 1.2 and 2.7 GHz. The placement of the MPI ranks was done in the same way as for the ILBDC benchmarks: Neighboring ranks were "packed" to the same node to minimize inter-node traffic.
Although both scenarios show a dependence of the effective MPI bandwidth on the clock speed, this is especially pronounced at PPN=16, and we see a breakdown of about 35 % in communication bandwidth within the region of message sizes relevant for the The mapping of MPI ranks to cores was set for minimum internode traffic (consecutive ranks on the cores of a node). The shaded area indicates the range of message sizes for the LBM application test case (packed bed reactor geometry).
ILBDC packed reactor benchmark (shaded area). Moreover, the bandwidth of the FDR-10 IB interface cannot be saturated even at the highest frequency setting with PPN=16. We attribute both effects to the dominance of intra-node communication, which has a strong dependence on clock speed. In contrast, the saturated LBM performance with the AA pattern and AVX vectorization only drops by about 20 % over the whole frequency range (see Fig. 2 ). This explains the stronger breakdown of parallel efficiency at strong scaling and low clock speeds.
Energy and performance at scale
The question remains whether one can extrapolate the findings about energy to solution and performance from the chip to the multi-node level, and especially whether single-core optimizations, notable SIMD vectorization, have a similar impact. Figure 11 shows aggregated socket-level energy (as measured via RAPL) vs. performance with AVX for the three node counts (32, 64, and 128) at which parallel efficiency is between 90 and 60 %. Along each curve, the number of processes per chip is increased from four to eight, and the highest energy point at the top of each curve is at PPC=8. For reference we also show the corresponding lowestenergy data points for the scalar implementation (open symbols). The overall rise in energy to solution with growing node count is a trivial consequence of the decreasing parallel efficiency.
The most striking difference to the chip-level results is the notable performance degradation after the saturation point, especially at the larger node counts (32 and 64). It is caused by the drop in ringshift bandwidth (as described in the previous section) with growing PPC, and directly leads to a fast rise in energy to solution, much steeper than would be expected by the power model without communication component. Hence, it is even more crucial in the highly parallel case to select the optimal operating point, since each expendable core costs an over-proportional amount of energy: At 128 nodes and 2.7 GHz, the reduction in energy consumption when going from the full socket to the saturation point is over 40 %, but only about 25 % on a single chip (see the 2.7 GHz data in Fig. 7b ).
The strong disadvantage of scalar execution can also be seen on the highly parallel level (open symbols in Fig. 11 show the "naive" operating point of PPC=8 for this case). Since more processes are needed to reach saturation -if this is possible at all -, the slowdown at larger PPC contributes strongly to the low performance and high energy consumption. As a consequence, a well-vectorized LBM code is instrumental for optimal energy to solution, particularly in the highly parallel case when communication plays a noticeable (but not dominant) role.
Finally we need to comment on how these findings change if a realistic baseline power is used. Figure 12 shows the same data as Fig. 11 but with 100 W of constant power added per node. The results are very similar to the chip-level discussion in Sect. 3.2 above: All differences in energy to solution are damped by the larger idle power, but there is still more than 30 % gain between a naive scalar code run with PPC=8 and the possible optimal operating point (marked in Fig. 12 ) with PPC=4 and 2.3 GHz. In contrast to the case where only the chip power is considered, the lowest frequency setting of 1.2 GHz is very unfavorable: The large performance degradation together with the communication bandwidth breakdown problem and the large baseline power prohibit the use of very small frequencies, even if energy to solution were the only relevant metric. On the other hand, energy is practically constant between 1.7 and 2.7 GHz when the best PPC value is chosen, but performance is boosted by 15 % at 128 nodes.
SUMMARY AND OUTLOOK
We have analyzed the performance and energy to solution properties of a lattice-Boltzmann flow solver on the chip and highly parallel levels for an Intel Sandy Bridge EP-based system. Different propagation patterns (pull-split vs. AA), SIMD vectorization schemes (scalar vs. SSE/AVX), and the dependence on the clock speed of the processors and the number of processes per chip were studied using the chip-level ECM performance model and a simple power model. In addition to the measured chip power, the (estimated) "true" system-level power was taken into account in order to arrive at useful predictions for realistic scenarios.
We found a high single-core performance to be the pivotal instrument for reaching minimal energy without sacrificing too much time to solution. AVX vectorization and the choice of a good prop- Figure 12 : Same data as in Fig. 11 but with a realistic baseline power of 100 W added per node. The green circle marks a possible optimal operating point.
agation pattern are important components for accomplishing this goal. Technical measures such as clock speed reduction have less impact but still contribute considerably to the overall energy consumption.
When a "good" single-core code has been found and the optimal clock speed has been set, it is the identification of the performance saturation point with respect to the number of processes on the chip level, but more importantly in the highly parallel case, which decides upon minimal energy to solution. In the latter case, due to the dependency of MPI communication bandwidth on the clock speed of the processor, using too many cores per chip must be avoided when communication plays a non-negligible role.
The impact of the whole system's baseline power consumption (i.e., powered on but with idle cores) is twofold: It attenuates the differences in energy to solution caused by technical measures such as clock speed adjustments, but it emphasizes the influence of bare chip-level performance and communication overhead, especially in the highly parallel case. Hence, having a "good" single-chip code is even more important when dealing with realistic, i.e., full-system power consumption and parallel production runs.
This work opens the possibility for future research in multiple directions: We have not addressed realistic alternatives to the standard quality metrics of energy and time to solution in detail. One such cost function could be the energy-delay product, but there may be others. Moreover it will be interesting to compare our findings for a typical x86-based cluster to a modern low-power system such as the IBM Blue Gene/Q. Power capping is an additional operational constraint of modern systems and can be studied using the same methods as shown here. Finally, the ECM model and the multicore power model need refinements and adjustments to be able to deal with more complex hardware and software scenarios such as the expected flexible (per-core) frequency settings on upcoming Intel processors. Also it would be worthwhile to explore the reasons for the deviation of the ECM model from measurements near the saturation point, which does not occur for all types of code [17] .
