Abstract-We propose a new model for scaling applications with increasing power budget, which we call the iso-power-efficiency function. We show that viewing scaling in this way has advantages over the previously proposed iso-efficiency function, which assumes all processors run at maximum power. Our experimental results show that overprovisioning can result in better scaling under a power budget.
I. INTRODUCTION
For many applications, speedup saturates and parallel efficiency decreases if the problem size is held fixed while increasing the number of processors (the form of scaling known as strong scaling). For some problems, it is possible to maintain a fixed parallel efficiency by increasing both the problem size and the number of processing elements. The rate at which the problem size must increase to maintain constant efficiency for a given rate of increase of the number of processors is given by the iso-efficiency function [1]. We have developed a new scalability function called iso-power-efficiency that determines the rate at which the problem size must increase to maintain constant efficiency for a given rate of increase of the application's power budget. For a given power budget, an application can choose to use a larger number of processors running at lower power. As shown in [2], speedup can often be obtained within a given power budget by such overprovisioning. Deriving the iso-power-efficiency function for a given problem involves 1) determining optimal configurations for problem instance/power budget pairs, and 2) expressing the parallel overhead as a function of problem size and power budget. We hypothesize that the rate of growth required for problem size can be lower with iso-power-efficiency than with iso-efficiency, thus yielding better scalability. Our approach is to use a regression modeling methodology, similar to the focused regression modeling described in [3], to fit observed execution data to an iso-power-efficiency function.
Users of large shared parallel computing systems are currently charged according to the number of processing elements they use for however long they use them. We predict that this will change to charging for the amount of energy used, that is, for the number of processing elements times the average power per processor times the runtime. With future batch queueing systems, users will request a given power budget in addition to the number of processors and the estimated runtime, and they will be required to stay within that power budget. Our research on iso-power-efficiency will help users to scale their applications efficiently under this future scenario.
II. BACKGROUND
As we move towards exascale computing, we are presented with many challenges. The U.S. Department of Energy's goal is for the first exaflop machine to consume no more than 20 MW. Exascale refers to both this and larger machines that may have higher power bounds. Because of this power bound, future supercomputers will most likely be limited by the amount of power that they can consume rather than the physical hardware available. From a monetary standpoint, a megawatt of power costs about one million dollars each year. Knowing this, we find it prudent to make better use of the energy these computers use.
III. PROBLEM DESCRIPTION
Iso-efficiency is a parallel performance metric that measures scalability. Given a particular algorithm/architecture combination, the iso-efficiency function can represent the characteristics in a single expression. Using this we are able to predict the best combination of problem size and number of processors rather than brute forcing our way through every combination.
Parallel overhead is given by

$$T_o(W, c) = c\,T_c - T_1,$$

where $c$ is the number of processing elements (e.g., cores), $T_c$ is the parallel runtime, and $T_1$ is the runtime of the best known sequential implementation. The iso-efficiency function [1] tells us how we need to scale up the problem with increasing number of processing elements to maintain the same efficiency:

$$W = \frac{E}{1 - E}\,T_o(W, c) = K\,T_o(W, c),$$

where $E$ is the efficiency to be obtained, and $W = T_1$. For example, the iso-efficiency function for matrix-vector multiplication with 1-D block data decomposition is $W = K \cdot c \log c$, meaning we need to scale up the problem size not just proportionally to the increase in the number of processors, but with an additional factor of $\log c$. For iso-power-efficiency, we modify the parallel overhead function [1] to the following:

$$T_o(W, b) = \frac{b}{b_1}\,T_b - T_1,$$

where $b_1$ is the smallest overall power budget for all the cores under which we can run the application, $W = T_1$ where $T_1$ is the runtime for the best sequential version, and $T_b$ is the runtime for the best-performing configuration under overall power budget $b$. A way of interpreting this function is to say that we are now scaling our problem with respect to the normalized power budget $b/b_1$, rather than only with respect to the number of cores. Iso-power-efficiency essentially encompasses iso-efficiency. Working from the assumption that iso-efficiency runs all its cores at maximum power, we can represent $b$ as a number of cores running at maximum power and $b_1$ as one core running at maximum power. By doing this, we can see that $T_o(W, c)$ is just a special case of $T_o(W, b)$. Our goal is to show that by scaling with respect to power budget, instead of just the number of cores, we can obtain lower overhead $T_o$ and require less power to maintain a given efficiency $E$.
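The two overhead definitions can be compared numerically. The following is a minimal sketch rather than the authors' code: the function names and the sample numbers are illustrative assumptions, the efficiency is written in the standard form E = W / (W + T_o) that underlies the iso-efficiency relation of [1], and the base-2 logarithm in the matrix-vector example is chosen only for concreteness (the text says log c without fixing a base).

```python
import math


def overhead_cores(c, t_c, t_1):
    """Core-based parallel overhead: T_o(W, c) = c * T_c - T_1."""
    return c * t_c - t_1


def overhead_power(b, b_1, t_b, t_1):
    """Power-based parallel overhead: T_o(W, b) = (b / b_1) * T_b - T_1."""
    return (b / b_1) * t_b - t_1


def efficiency(t_1, overhead):
    """Efficiency E = W / (W + T_o), using W = T_1."""
    return t_1 / (t_1 + overhead)


def isoefficiency_workload(c, target_eff):
    """W = K * c * log2(c) with K = E / (1 - E) (matrix-vector, 1-D blocks)."""
    k = target_eff / (1.0 - target_eff)
    return k * c * math.log2(c)


# Hypothetical numbers: 16 cores, each at a 100 W maximum, T_c = 2 s, T_1 = 20 s.
p_max = 100.0
by_cores = overhead_cores(16, 2.0, 20.0)
# With every core at maximum power, b = c * p_max and b_1 = p_max,
# so the power-based overhead reduces to the core-based one.
by_power = overhead_power(16 * p_max, p_max, 2.0, 20.0)
```

When every core runs at maximum power the two overheads coincide, which is the special-case relationship described above; any configuration that reaches the same runtime with a smaller normalized budget yields a smaller power-based overhead.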
If we have a load-imbalanced problem and if we can vary the power per node, or even per core, then we can achieve a lower overhead in terms of power than in terms of cores. To see this, consider a task dependency graph in which not all paths have the same length. In this case, we can lower the power setting on nodes off the critical path without increasing the parallel runtime. Then

$$T_o(W, c, p_c, p_1) = c\,\frac{p_c}{p_1}\,T_c - T_1,$$

where $p_c$ is the average power cap per core and $p_1$ is the maximum power cap per core using the minimum core configuration needed for the application to run, will be less than $T_o(W, c)$ for the same $c$.
For example, consider a task dependency graph that has two paths, each of length $l$. Suppose we can scale up the problem by parallelizing each of these paths. Assume doing so for one path introduces no extra communication between the cores, but parallelizing the other path introduces communication proportional to the base-2 logarithm of the number of cores for that path, that is, $\lg(c/2)$. Assume the communication is the only parallel overhead. Then the critical path length and the parallel execution time go from $l$ to $l + \lg(c/2) = l + \lg c - 1$. Suppose that the execution time is inversely proportional to the power supplied to a node and that we scale back (i.e., cap) the power to the nodes not on the critical path by a factor equal to $l/(l + \lg c - 1)$ so that the execution time for these nodes is also $l + \lg c - 1$. Now the average power cap per node is given by averaging the power of the two nodes:

$$p_c = \frac{1}{2}\left(p_1 + \frac{l}{l + \lg c - 1}\,p_1\right) = \frac{2l + \lg c - 1}{2\,(l + \lg c - 1)}\,p_1,$$

where $p_1$ is the original power per node. Without the power capping, the overhead function is given by

$$T_o(W, c) = c\,(l + \lg c - 1) - T_1.$$

With power capping of the nodes that are off the critical path, the overhead function is given by

$$T_o(W, c, p_c, p_1) = c\,\frac{p_c}{p_1}\,(l + \lg c - 1) - T_1 = c\left(l + \frac{\lg c - 1}{2}\right) - T_1,$$

and thus we can scale up the problem size at approximately half the rate as without power capping. Whether or not we can obtain different asymptotic complexities for overhead functions for a realistic problem with and without power capping remains to be seen.
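The "half the rate" claim can be checked with a short derivation. This is a sketch under two assumptions that the text does not state explicitly: the power-weighted overhead takes the form $c\,(p_c/p_1)\,T_c - W$, and the sequential work of the scaled problem is $W = T_1 = c\,l$ (each of the $c$ cores holds $l$ units of work).

```latex
\begin{align*}
T_c &= l + \lg c - 1, \\
\frac{p_c}{p_1} &= \frac{1}{2}\left(1 + \frac{l}{l + \lg c - 1}\right)
                 = \frac{2l + \lg c - 1}{2\,(l + \lg c - 1)}, \\
T_o(W, c) &= c\,T_c - W = c\,(l + \lg c - 1) - c\,l = c\,(\lg c - 1), \\
T_o(W, c, p_c, p_1) &= c\,\frac{p_c}{p_1}\,T_c - W
                     = \frac{c\,(2l + \lg c - 1)}{2} - c\,l
                     = \frac{c\,(\lg c - 1)}{2}.
\end{align*}
```

Under these assumptions the capped overhead is exactly half the uncapped overhead, so the problem size required to hold the efficiency fixed grows at half the rate, while the asymptotic order $c \lg c$ is unchanged.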
IV. RELATED WORK

A. Overprovisioning
The idea of overprovisioning with respect to power is the basis of the research described in [2]. Overprovisioning means that more cores exist than can simultaneously run at the highest CPU clock frequency, and equivalently, at the highest power setting. The goal is to find the best configuration for an application so that the maximum performance is achieved without exceeding a given power bound. An application's scalability characteristics determine whether to use few nodes at higher power or more nodes at lower power. The investigation of overprovisioning in [2] assumes uniform power allocation per node and that the applications are perfectly load balanced. The authors use exhaustive search to find optimal configurations for a number of applications with fixed problem sizes, including SP-MZ, LU-MZ, and SPhot, under various system power bounds. For future work, they plan to develop a model to predict the optimal configuration, including allocation of non-uniform inter-node power based on critical path and load imbalance of the application (as for the example we gave above). Our work assumes such a model and we build our iso-power-efficiency model on top of it.
B. Iso-Energy-Efficiency
The concept of iso-energy-efficiency is described in [4, 5]. The energy overhead for parallel execution is given as $E_o = E_p - E_1$, where $E_1$ is the energy consumption for the sequential execution and $E_p$ is the total energy consumption for the parallel execution. Iso-energy-efficiency is defined as $EE = E_1/E_p = 1/(1 + E_o/E_1)$. This is analogous to the time-based efficiency $E = T_1/(c\,T_c)$ rather than to iso-efficiency. The energy efficiency factor $EEF = E_o/E_1$ is modeled by sets of machine-dependent and application-dependent parameters that can be measured by small-scale executions so that the EEF can be estimated for larger-scale executions. The goal is to determine what parameter settings, such as frequency, problem size, and number of processors, will minimize the energy overhead and thus maximize the energy efficiency. In contrast, our work assumes that some method exists to find the configuration under a given power budget that yields the minimum runtime, and determines how to scale up the problem to maintain the same power efficiency with increasing power budget.
V. METHODOLOGY
In order to determine how to scale a problem to obtain iso-power-efficiency, we need a model that predicts the overhead function $T_o(W, c, p_c, p_1)$ for the application, where $W$ is the workload, $c$ is the number of cores, $p_c$ is the power cap per core, and $p_1$ is full power (i.e., uncapped power) for a core. Our methodology is to sample the space of workload/power budget pairs. For a given workload and power budget, we do an exhaustive search to determine the best configuration and use the resulting parallel runtime to compute the overhead. We use each such sampled result (inputs: $W$, $b$; output: $T_o$) as a training input to our regression model for $T_o$.
We hypothesize that even with uniform inter-node power, we may still be able to achieve better scaling with a power budget. Our reasoning is that as we scale up a problem, its strong scalability characteristics change. For example, with more nodes, the communication proportion of an application may increase. As the communication fraction increases, we may achieve faster runtime by running cores at lower power. We used some of the communication-intensive benchmarks from [2], modified to have different problem sizes, to test this hypothesis.
VI. EXPERIMENTS
All of our experiments were run on the Cab cluster at Lawrence Livermore National Laboratory. Cab consists of 1296 nodes, 256 of which were available to us. Each node has 2 sockets, each of which holds an Intel Xeon E5-2670 (2.6 GHz, 8 cores) from the Sandy Bridge architecture family.
The Sandy Bridge family allows us to use Intel's Running Average Power Limit (RAPL). RAPL provides mechanisms to enforce power consumption limits, so as to stay under a given power bound, and to retrieve power measurements. This is done through several model-specific registers (MSRs). From these registers we can read that the package (PKG) power domain supports a minimum power of 51 W, a thermal specification power of 115 W, a maximum power of 180 W, and a largest time window of 0.0459 seconds. The Sandy Bridge server processor we are using can perform power clamping in units of 0.125 watts, 0.0000152 joules, and 0.000977 seconds.
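The three clamping units quoted above are powers of two (1/8 W, 1/65536 J, and 1/1024 s). The sketch below shows how such units can be decoded from a raw RAPL power-unit register value. It is an illustration only: the bit-field layout follows our understanding of Intel's documented `MSR_RAPL_POWER_UNIT` format, the raw value `0x000A1003` is a hypothetical example chosen to reproduce the units in the text, and the helper names are ours. It does not read any MSR and does not use LIBMSR.

```python
def decode_rapl_units(raw):
    """Decode power, energy, and time units from a raw power-unit register value.

    Assumed layout: bits 3:0 power exponent, bits 12:8 energy exponent,
    bits 19:16 time exponent; each unit is 1 / 2**exponent.
    """
    power_exp = raw & 0xF
    energy_exp = (raw >> 8) & 0x1F
    time_exp = (raw >> 16) & 0xF
    return {
        "power_w": 1.0 / (1 << power_exp),
        "energy_j": 1.0 / (1 << energy_exp),
        "time_s": 1.0 / (1 << time_exp),
    }


def watts_to_units(watts, units):
    """Express a power cap in multiples of the hardware power unit."""
    return int(round(watts / units["power_w"]))


# Hypothetical raw value that yields the units quoted in the text.
units = decode_rapl_units(0x000A1003)
min_cap_units = watts_to_units(51, units)
```

With these units, the 51 W minimum package cap corresponds to 408 power units, and the 0.000977-second figure is the smallest representable time step.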
All experiments for iso-power-efficiency were done with the minimum 0.000977-second time window [6]. We used the MSR-safe kernel module rather than the default MSR kernel module. MSR-safe gives us the same capabilities as the default, but avoids some of the more critical risks of full access to MSRs [7].
To manipulate the power bounds for our experiments, we used the LIBMSR library. LIBMSR provides access to Intel RAPL MSRs, allowing the user to easily adjust the desired power bounds and time intervals. The version of LIBMSR we used allowed us to simply use environment variables to set power bounds and time intervals. LIBMSR can also be used to retrieve power consumption at a given time. LIBMSR can be found at https://github.com/scalability-llnl/libmsr [7].
Both MSR-safe and LIBMSR were developed at Lawrence Livermore National Laboratory.
We began our experiments by generating many test runs for each of our benchmarks. Each run consists of a combination of processors, cap on the power per socket, and problem size. The number of processors is in the range of 16 to 1024 in multiples of 16. The cap on the power per socket is selected from a set of discrete values in the range of 51-115 watts. The problem size is dependent on the benchmark.
Every combination was run multiple times. This was done to reduce the amount of noise and to provide a better runtime sample over all the nodes available on the system. With these runs, we created a model for each benchmark that estimates the runtime given the number of processors $c$, average power per socket $P_s$, and problem size $S$. For simplicity, problem size $S$ was used in place of workload $W$. This is fine because $W$ is a function of $S$ with all other parameters of $W$ kept constant. These models are second-order linear regression models with logarithmic transformation using log-log form. Each of these models is given in more detail for each benchmark below.
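A second-order log-log regression of this kind can be sketched as follows. This is not the authors' R code: it uses NumPy least squares, the exact set of second-order terms (all squares and pairwise products of the log-transformed inputs) is our assumption, and the data are synthetic, generated from a hypothetical runtime law purely to exercise the fit.

```python
import numpy as np


def design_matrix(c, p_s, s):
    """Intercept, log terms, and all second-order products of the log terms."""
    logs = [np.log(np.asarray(v, dtype=float)) for v in (c, p_s, s)]
    cols = [np.ones_like(logs[0])] + logs
    cols += [logs[i] * logs[j] for i in range(3) for j in range(i, 3)]
    return np.column_stack(cols)


def fit_runtime_model(c, p_s, s, runtime):
    """Least-squares fit of log(runtime) on the second-order log features."""
    coef, *_ = np.linalg.lstsq(design_matrix(c, p_s, s),
                               np.log(np.asarray(runtime, dtype=float)),
                               rcond=None)
    return coef


def predict_runtime(coef, c, p_s, s):
    """Predicted runtime, transformed back from log space."""
    return np.exp(design_matrix(c, p_s, s) @ coef)


# Synthetic samples (hypothetical): cores, watts per socket, problem size.
grid = [(c, p, s)
        for c in (16, 64, 256, 1024)
        for p in (51, 70, 90, 115)
        for s in (1, 2, 4, 8)]
c, p_s, s = (np.array(col, dtype=float) for col in zip(*grid))
runtime = 100.0 * s / c**0.9 * (115.0 / p_s) ** 0.5  # made-up runtime law

coef = fit_runtime_model(c, p_s, s, runtime)
predicted = predict_runtime(coef, c, p_s, s)
```

Because the synthetic law is a pure power law, it is linear in log space and the fit reproduces it essentially exactly; measured runtimes would leave residuals, and goodness of fit would then need to be assessed, for example with R-squared as in the paper.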
From here we attempt to find the combination of processors and average power per socket that minimizes the runtime for a given power budget $b$, such that the total power across all sockets in use does not exceed $b$. This is done for many power budgets, ranging from some minimum given by the benchmark up to the maximum watts sampled for each benchmark.
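The search for the best configuration under a power budget can be sketched as an exhaustive scan over the sampled grid. The sketch below makes several assumptions: the budget constraint is taken to be (number of sockets) x (cap per socket) <= budget, with 8 cores per socket as on Cab; the intermediate cap values 65, 80, and 95 W are hypothetical, since the text gives only the 51-115 W range; and `toy_runtime` is a made-up stand-in for a fitted runtime model.

```python
import math


def best_configuration(budget_w, problem_size, runtime_model,
                       core_counts=range(16, 1025, 16),
                       socket_caps_w=(51, 65, 80, 95, 115),
                       cores_per_socket=8):
    """Return (runtime, cores, cap_w) minimizing runtime within the budget.

    Returns None when no sampled configuration fits the budget.
    """
    best = None
    for cores in core_counts:
        sockets = cores // cores_per_socket
        for cap in socket_caps_w:
            if sockets * cap > budget_w:
                continue  # configuration would exceed the power budget
            t = runtime_model(cores, cap, problem_size)
            if best is None or t < best[0]:
                best = (t, cores, cap)
    return best


def toy_runtime(cores, cap_w, size):
    """Hypothetical model: compute time shrinks with cores and with power,
    plus a communication term that grows with log2(cores)."""
    return size / (cores * math.sqrt(cap_w / 115.0)) + 0.001 * math.log2(cores)


# A budget that can hold 128 cores (16 sockets) at the full 115 W cap.
budget = 16 * 115
overprovisioned = best_configuration(budget, 1000.0, toy_runtime)
full_power_runtime = toy_runtime(128, 115, 1000.0)
```

Because the full-power configuration is itself one of the candidates, the runtime returned by the search can never be worse than the full-power runtime for the same budget; whether it is strictly better depends on how the application's runtime responds to power and core count.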
After finding these best configurations, we used these data to create two more models that use only a power budget and a problem size to estimate runtime $T_b$ and overhead time $T_o$. $T_b$ and $T_o$ are used to estimate different efficiencies $E$.
We used the NAS Parallel Benchmarks (NPB) [8] for our experiments, specifically EP (Embarrassingly Parallel), SP-MZ (Multi-zone Scalar Pentadiagonal), and LU-MZ (Lower-Upper Gauss-Seidel solver).
R was used for all statistical modeling. R is a programming language that specializes in statistical computing and graphics [9]. We used R for its ability to create linear models and produce informative plots. The following are the models used to estimate $T_o$. Given the high R-squared values for each of the three models, the sample points are fitted well, and we expect future points to be predicted accurately.

Figures 1, 2, and 3 show the amount of overhead $T_o$ using iso-efficiency and iso-power-efficiency. At all times the value of $T_o$ is lower for iso-power-efficiency, but as the power budget grows the values of $T_o$ begin to converge. Increasing the problem size widens the overhead gap, which extends the range of power budgets over which iso-power-efficiency retains its advantage.
VII. RESULTS
A. Overhead
The sharp spikes in Figure 1 occur when the average power per socket increases to stay at the power budget while the number of cores stays the same. The overhead rises while the number of cores stays the same, and drops when the number of cores increases and the average power per socket decreases to stay at the power bound. The average power per socket tends to our minimum allowed power per socket, 51 watts. When the best configuration would call for less than 51 watts per socket, we must choose a lesser combination, which generally means fewer cores at higher power.

Figures 4, 5, and 6 show the power budget needed to maintain a given efficiency at some problem size. Each of these figures, with the exception of Figure 4, shows efficiencies 0.7, 0.8, and 0.9. We are forced to use very high values of efficiency (0.997, 0.998, and 0.999) for EP in Figure 4 because of how well EP scales. For each of these figures we used problem size rather than workload for convenience. Using workload rather than problem size would alter the values on the y-axis but would not change what we are ultimately trying to show. For each of these benchmarks, as the problem size increases, the required power budget grows more slowly using our iso-power-efficiency function than using iso-efficiency.

Tables II, III, and IV give a numerical representation of Figures 4, 5, and 6 along with the slope, which is the increase in problem size needed for each given increase in power. These tables show that iso-power-efficiency requires less of an increase in input for a unit increase in watts. Each table also gives the ratio of the slopes of iso-efficiency and iso-power-efficiency. For a high efficiency this ratio is very low. As the efficiency decreases, this ratio tends to 1. This is because as the power budget increases, the optimal combination for iso-power-efficiency tends to the same combination as iso-efficiency. Generally this is because the amount of communication overhead outweighs the advantage of using more cores. This can usually be alleviated by increasing the problem size. This is also shown in our tables, with the exception of Table II, in which the slope ratio becomes smaller when comparing larger problem sizes.
B. Efficiency
VIII. CONCLUSION
We have shown that the use of overprovisioning can result in lower parallel overhead, thus enabling better iso-efficiency scaling.
As we move towards exascale computing, we will find ourselves more attentive to our energy consumption as we run HPC applications. With the true cost of computing becoming energy consumption, a function that allows applications to be scaled with power efficiency as the priority will become invaluable. This is good to keep in mind as we predict that future queuing systems will require jobs to submit a power budget. In general, we have presented such a function and shown that, for the benchmarks studied, it yields better efficiency scaling than the iso-efficiency function, which is based on using maximum power per core.
IX. FUTURE WORK
Although we were able to estimate $T_o$ given a power bound and problem size for a particular benchmark, we have yet to come up with a way of finding the optimal configuration that would result in this $T_o$. Currently, we use exhaustive search to find the optimal configuration, as does the work in [2]. We have formulated an optimization problem for finding the optimal configuration, based on machine parameters such as time per flop, energy per flop, time per byte moved, and energy per byte moved, and on the operational intensity of the application. We are using microbenchmarks to try to measure the machine parameters and hardware counters to try to estimate the application's operational intensity. We will investigate the accuracy of this model and its sensitivity to small changes in the parameters.
We hypothesize that iso-power-efficiency scaling will show further advantages over iso-efficiency scaling if we can use different power caps for different sockets, for example, by lowering the power cap on sockets off the critical path. We plan to experiment with load-imbalanced applications to verify this hypothesis.
