As we approach the exascale era, power has become a primary bottleneck. The US Department of Energy has set a power constraint of 20MW on each exascale machine. To achieve one exaflop under this constraint, power must be used intelligently to maximize performance.
INTRODUCTION
The supercomputing community is headed toward the era of exascale computing, which is slated to begin around 2020. Today's fastest supercomputer, Tianhe-2, consumes 17.8MW to deliver 33.86 PFlops [2]. If we were to build an exascale machine with today's technology, it would consume up to 350MW of power. A typical power plant generates 1GW of power, which is sufficient to power 700,000 homes [1]. The US DOE has set a power constraint of 20MW for future exascale systems to maintain a feasible electrical power demand. To reach an exaflop under this constraint, we need at least an order of magnitude improvement in power efficiency with respect to today's systems [5, 7, 19, 31].
Exascale systems are expected to be power-constrained: the size of the machine will be limited by the amount of provisioned power. Existing best practice requires provisioning power based on the theoretical maximum power draw of the machine, despite the fact that only a synthetic workload comes close to this level of power consumption. One of the key contributions in the power-constrained domain is "hardware overprovisioning" [24]. The idea is to provision much less power per node and thus provision more nodes. The benefit is that all of the scarce resource (power) will be used. The drawback is that power must be carefully scheduled within the machine in order to approach optimal performance. Fig. 1 depicts this foundational idea. Let the hardware-overprovisioned system consist of $N_{max}$ processors and let the power budgeted for this system be $P_{m/c}$ Watts. As shown in the figure, with $P_{m/c}$ Watts of total system power, only a part of the system (say $N_{alloc}$ processors, where $N_{alloc} < N_{max}$) can be utilized at peak power (collection of nodes in red). Another valid configuration is to utilize the entire system at low power. One of the several intermediate configurations is to use medium power levels and utilize a portion of the system larger than that at peak power but smaller than that at low power. In each of these configurations, a machine's power budget is uniformly distributed across a varying number of processors, i.e., each processor is allocated approximately $P_{m/c}/N_{alloc}$ Watts of power. This is a naïve approach to enforcing a power budget. Depending on an application's characteristics (memory-, compute-, and communication-boundedness), different applications achieve optimal performance on different configurations. In a nutshell, power procured for a system must be managed as a malleable resource to maximize the performance of an overprovisioned system under a power constraint.
To facilitate the selection of different power levels, hardware manufacturers are providing various features like power clamping and on-chip power measurement mechanisms (e.g., Intel's Running Average Power Limit, RAPL). Our characterization experiments lead to the following observations:
• Peak power efficiency varies across processors.
• Most importantly, efficient processors are most efficient at lower power bounds, whereas inefficient processors are most efficient at higher power bounds. The "peak" of every curve is the point at which the processor achieves maximum efficiency, i.e., maximum IPS/W. Orange curves (efficient processors) have peaks at lower power than the peaks of the red curves (less efficient processors); the rest lie in between.

Fig. 4 depicts the results of our thermal experiments. The x-axis presents processor IDs (processors are sorted in order of efficiency). The y-axis presents the measured temperature (triangles) of the processors, normalized with respect to the maximum temperature, and the unbounded power (crosses) of the processors, also normalized with respect to the maximum power. In these experiments, the processors were not capped, and they achieved uniform performance. We observe that the temperature increases as we go from efficient to inefficient processors (left to right), as does the unbounded power. However, not all inefficient processors are hotter than the efficient ones. This shows that thermal variation may be one potential cause of the variation in efficiency, but there are other factors that counter its effect, as we do not see a linear trend for temperature (in contrast to the linear trend of unbounded power). We believe that one of the contributing factors is process/manufacturing variation induced at the time of fabrication. In the end, our proposed mechanism is agnostic to the actual cause of variation; it simply exploits the fact that variation (whatever its cause) exists.
In summary, there exists variation in power efficiency across processors. There is a unique local maximum in every power efficiency curve that occurs at disparate power levels for different processors. Starting from the minimum power, increasing the power assigned to a processor leads to increasing gains in IPS. However, increasing the power beyond the peak efficiency point of a processor leads to diminishing returns. Hence, when power is limited, processors should operate at power levels close to their peak efficiency to maximize the overall efficiency of the system. Since the peak efficiency points of efficient processors lie at lower power levels than those of inefficient processors, the optimal configuration should select lower power levels for efficient processors and higher power levels for inefficient processors to maximize performance. In contrast, a naïve uniform power scheme caps all processors at identical power bounds and is therefore sub-optimal. An optimal algorithm should leverage the nonuniformity of the cluster to maximize the performance of a job under its power constraint.
To this end, we propose PTune, a power-performance variation-aware power tuner that does exactly this for each job. For every job, given a power budget, it determines the following: (1) the optimal number of processors (say $n_{opt}$); (2) the selection of those $n_{opt}$ processors; and (3) the power distribution (say $p_k$, where $1 \le k \le n_{opt}$) across the selected $n_{opt}$ processors.
PROBLEM STATEMENT
The problem statement is as follows: given a machine-level power budget, how should the machine's power be distributed across (a) jobs and (b) tasks within jobs on a given system? Part (b) is discussed later. For (a), the process of making these decisions at the macro level of jobs is called power partitioning. Each job on the machine receives its own power partition.
We address the following questions:
1. How many partitions do we need at a time? I.e., determine how many jobs should be scheduled at a time.
2. What is the size of each of the power partitions? I.e., determine the power budget assigned to each job.
For (b), at the micro level, given a hard job-level power budget $P_{J_i}$, we need to determine the optimal number of processors, $n_{opt}$, with a power distribution $(p_1, p_2, \ldots, p_{n_{opt}-1}, p_{n_{opt}})$ such that the performance of the job is maximized under its power budget. The constraint on the power distribution is expressed as
$$\sum_{k=1}^{n_{opt}} p_k \le P_{J_i}, \qquad min\_power \le p_k \le max\_power_k \quad \forall k.$$
Here, $min\_power$ is the minimum power that needs to be assigned to a processor for reliable performance and $max\_power_k$ is the maximum power consumed by the $k^{th}$ processor (uncapped power consumption) for an application. The performance of a job can be quantified in terms of the number of instructions retired per second (IPS).
For a parallel application on $n$ processors, the effective IPS is the aggregated IPS over the $n$ processors ($JobIPS_n$). Hence, the objective function is $Maximize(JobIPS_n)$. A processor's IPS is a non-linear function of the power at which it operates. Each processor can be power bounded at several levels using the RAPL capping capabilities, which force it to operate at various power levels within a fixed range. We know that unbounded power consumption varies across processors while achieving the same unbounded (peak) performance for a given application. This is depicted in Fig. 5. The x-axis indicates the power at which the processor operates and the y-axis shows the IPS (in billions) of the processor for an application. Each solid curve corresponds to the most efficient processor, while each dotted curve corresponds to the least efficient processor. The following two observations are made from this data:
1. On a single processor, the performance (IPS) achieved at any fixed power level is different for different workloads.
2. The performance of an application on two different processors at any fixed power level is not the same.
This means that when determining the optimal distribution of power across processors, it is necessary to take the processor characteristics and the application characteristics into account. (The general model holds for other performance metrics as well; we selected IPS here because it closely correlates to power in our experiments.) One solution may not fit all applications. The optimal configuration for an application on one set of processors may differ from that on another set of processors because of performance variations under a power cap.
PROPOSED SOLUTION
We propose a 2-level hierarchical approach to managing power as a resource (see Fig. 6). The parameters of the model are described in Table 1. $N_{max}$, $P_{m/c}$ and $n_{req}$ are the inputs to the model that we assume. $n_{opt}$ is calculated once for every job at its dispatch time. $N_{alloc}$, $P_{J_i}$ and $p_k$ are re-calculated every time any job is dispatched. $min\_power$ is architecturally defined for every family of processors. Table 2 is populated off-line using the characterization data. We assume that the power consumption of the interconnect is zero, i.e., interconnect power is beyond our scope, as are task-to-node mapping effects on power. We only consider processor power in this work and assume moldable jobs. DRAM power could not be included due to motherboard limitations at the time of this work. We do not expect users of the system to predict and request power in their job requests. Power decisions are made by our system software (PTune and PPartition). Users may be allowed to influence these decisions by assigning priorities to their jobs.

At the macro level, we propose PPartition, a technique for partitioning a machine's power budget across jobs while scheduling them. Once a job is dispatched by a conventional scheduler (e.g., SLURM or Maui/PBS), PPartition calculates its power budget. If the required power is not available, it steals power from previously scheduled jobs and provisions this power for the new job. If sufficient power cannot be obtained, PPartition overrides the conventional scheduler's decision based on free resources (nodes) and does not schedule the job until sufficient power is available.
At the micro level, we propose PTune, a power balancing model that determines the distribution of a job's power budget (one job at a time) across an optimal selection of processors (among all free resources) to maximize the performance of the job under its power budget.
PTUNE
PTune shrinks a job's processor allocation by eliminating less efficient processors that are expensive in terms of power, thereby maximizing the performance of the job under its power budget. Fig. 7 depicts the micro-level power tuner. For each job $J_i$, a power budget $P_{J_i}$ is calculated at the macro level by PPartition. For every job with this assigned power budget, PTune answers the following questions:
1. How many processors ($n_{opt}$) and which processors should a job run on?
2. What should be the power ($p_1, \ldots, p_{n_{opt}}$) assigned to each of the $n_{opt}$ processors?

(Figure 7: PTune)

Let us start by addressing the first question. In order to use a processor, it needs to be assigned at least the minimum power ($min\_power$), which is constant across all processors. The upper limit on a processor's power ($max\_power_k$) varies across processors. Fig. 8 shows the maximum power consumption of 600 Ivy Bridge processors when they are not power capped. The unbounded performance is uniform across all the processors. The x-axis represents all the processors sorted by power consumption and the y-axis represents the maximum power consumption in Watts. The optimal configuration for maximum performance of a job under a strict power budget consists of the maximum number of most efficient processors (from the left) such that their aggregate power consumption does not exceed the job's power budget.
Sort the Processors
The first step towards determining the optimal configuration is to sort the available processors by their relative power efficiency. This is equivalent to sorting them by their unbounded power consumption. Let the sorted set of processors be indexed by k.
We divide this distribution of processors into quartiles, viz., Q1, Q2, Q3 and Q4, in the order of efficiency and pick processors from one or more of these quartiles for evaluation purposes.
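As a concrete illustration, this sorting and quartile split might look like the following Python sketch (the names `unbounded_power`, `sort_by_efficiency`, and `quartiles` are ours, not the paper's implementation):

```python
def sort_by_efficiency(unbounded_power):
    """unbounded_power: dict of processor id -> uncapped draw in Watts.
    Lower uncapped draw at equal performance means higher efficiency,
    so ascending power order is descending efficiency order."""
    return sorted(unbounded_power, key=unbounded_power.get)

def quartiles(sorted_procs):
    """Split the efficiency-sorted processors into Q1..Q4."""
    q = len(sorted_procs) // 4
    return (sorted_procs[:q], sorted_procs[q:2*q],
            sorted_procs[2*q:3*q], sorted_procs[3*q:])
```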
Bounds on Number of Processors
The lower bound on $n$, $n^{\perp}$, can be calculated by determining the maximum number of processors that can be capped at their maximum power, $max\_power_k$, under the power budget. The selection of processors is performed in the sorted order described above. $n^{\perp}$ is given by the largest value of $n$ that satisfies the following constraint:
$$\sum_{k=1}^{n} max\_power_k \le P_{J_i}.$$
The upper bound on $n$, $n^{\top}$, represents the maximum number of processors that can be operated at $min\_power$ under the power budget. The bound $n^{\top}$ is calculated as follows:
$$n^{\top} = \left\lfloor \frac{P_{J_i}}{min\_power} \right\rfloor.$$
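In code, both bounds reduce to a few lines; a minimal sketch, assuming the processors are already in efficiency-sorted order:

```python
def lower_bound(sorted_max_power, job_budget):
    """n_perp: the largest n whose first n (most efficient) processors
    fit under the job budget at their uncapped draw."""
    total, n = 0.0, 0
    for watts in sorted_max_power:
        if total + watts > job_budget:
            break
        total += watts
        n += 1
    return n

def upper_bound(job_budget, min_power, n_available):
    """n_top: the most processors that can all be run at min_power."""
    return min(int(job_budget // min_power), n_available)
```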
The processor count, $n$, is iterated from $n^{\perp}$ to $n^{\top}$, and in each step, the next most efficient processor is added to the set of processors. Job-level performance, $JobIPS_n$, is calculated in each iteration by DistributePower() for the power budget $P_{J_i}$ and the given number of processors, $n$, where $n^{\perp} \le n \le n^{\top}$.
The optimal number of processors, $n_{opt}$, is the value of $n$ at which the job's IPS is maximized:
$$JobIPS_{n_{opt}} = \max\left(JobIPS_{n^{\perp}}, JobIPS_{n^{\perp}+1}, \ldots, JobIPS_{n^{\top}}\right).$$
PTune leads to $n_{opt} \le n$. Thus, PTune tends to reduce the number of processors required for a moldable job. The spare processors are returned to the global pool of unused resources so that they can be utilized by other jobs.
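The search for $n_{opt}$ can then be sketched as follows (illustrative Python under stated assumptions: `distribute_power` is sketched in the Power Stealing and Shifting subsection below, `get_proc_ips` stands in for the Table 2 look-up, and each processor is a record with a `max_power` field):

```python
def ptune(procs_sorted, job_budget, min_power):
    """Sweep n from n_perp to n_top, adding the next most efficient
    processor each step, and keep the n that maximizes job IPS."""
    n_perp = lower_bound([p.max_power for p in procs_sorted], job_budget)
    n_top = upper_bound(job_budget, min_power, len(procs_sorted))
    # At n_perp, all selected processors fit under the budget uncapped.
    caps = [p.max_power for p in procs_sorted[:n_perp]]
    best_ips = sum(get_proc_ips(p, c) for p, c in zip(procs_sorted, caps))
    best = (best_ips, n_perp, list(caps))
    for n in range(n_perp + 1, n_top + 1):
        job_ips, caps = distribute_power(procs_sorted[:n], job_budget,
                                         caps, min_power)
        if job_ips > best[0]:
            best = (job_ips, n, list(caps))
    return best  # (JobIPS, n_opt, per-processor power caps)
```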
Distribute Power: Mathematical Model
DistributePower() takes three inputs, viz., the number of processors, $n$, the job's power budget, $P_{J_i}$, and the power distribution across $n-1$ processors determined in the previous iteration. The output of this function is the maximum job IPS that can be achieved under $P_{J_i}$ Watts with $n$ processors. It also calculates the optimal power caps, $(p_1, \ldots, p_n)$, for the $n$ processors, which form an input for the next iteration. This can be expressed mathematically as follows:
$$DistributePower(n, P_{J_i}, (p_1, \ldots, p_n)) = DistributePower(n-1, P_{J_i} - p_n, (p_1, \ldots, p_{n-1})) + getProcIPS(n, p_n).$$
The function $getProcIPS(k, p_k)$ performs a look-up in Tab. 2 to return the expected performance (IPS) of the $k^{th}$ processor when it is capped at $p_k$ Watts.
Power Stealing and Shifting
DistributePower() consists of two main steps, viz. Power Stealing and Power Shifting.
Step 1: Power is stolen in discrete quantities ($delta\_power$) from the $n-1$ processors to provision power for the $n^{th}$ processor (see Fig. 9). The victim/donor processor is the one that suffers the minimum loss in IPS when $delta\_power$ is stolen from it. If the aggregate stolen power is at least $min\_power$, an additional ($n^{th}$) processor is added to the processor set.
Step 2: Power is shifted from a donor to a receiver in discrete quantities, $delta\_power$, across the $n$ processors. The victim/donor processor is identified in the same way as in Step 1. The receiver is the processor that gains the maximum IPS on receiving $delta\_power$.

If the required power is available, PTune determines the optimal configuration for the job and the job is scheduled. It is important to note that even though the job power budget is proportionate to the number of requested processors, PTune schedules jobs on a reduced number of processors. As a result, the machine's power depletes at a faster rate than the processing resources. If the available (unused) power is less than the calculated job power budget, power is stolen from already scheduled jobs. This is called power repartitioning (lower right blue/shaded box in Fig. 10) and is detailed next.
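Before turning to repartitioning, here is a minimal Python sketch of the two steps above, under our reading of the description (the `DELTA` quantum, the `ips_loss`/`ips_gain` helpers, and the `get_proc_ips` table look-up are our assumptions, not the authors' code):

```python
DELTA = 1.0  # Watts; the discrete stealing/shifting quantum (assumed)

def ips_loss(proc, cap):
    """IPS lost if one DELTA of power is taken from this processor."""
    return get_proc_ips(proc, cap) - get_proc_ips(proc, cap - DELTA)

def ips_gain(proc, cap):
    """IPS gained if one DELTA of power is given to this processor."""
    return get_proc_ips(proc, cap + DELTA) - get_proc_ips(proc, cap)

def distribute_power(procs, job_budget, prev_caps, min_power):
    n = len(procs)
    # The newly added n-th processor starts with the still-unused budget.
    slack = job_budget - sum(prev_caps)
    caps = list(prev_caps) + [min(slack, procs[-1].max_power)]
    # Step 1: steal DELTA from the donor with the minimum IPS loss until
    # the new processor holds at least min_power.
    while caps[-1] < min_power:
        donors = [k for k in range(n - 1) if caps[k] - DELTA >= min_power]
        if not donors:
            return float("-inf"), prev_caps  # cannot admit another processor
        donor = min(donors, key=lambda k: ips_loss(procs[k], caps[k]))
        caps[donor] -= DELTA
        caps[-1] += DELTA
    # Step 2: shift DELTA from the min-loss donor to the max-gain receiver
    # for as long as the trade improves aggregate job IPS.
    while True:
        donors = [k for k in range(n) if caps[k] - DELTA >= min_power]
        recvs = [k for k in range(n) if caps[k] + DELTA <= procs[k].max_power]
        if not donors or not recvs:
            break
        donor = min(donors, key=lambda k: ips_loss(procs[k], caps[k]))
        recv = max(recvs, key=lambda k: ips_gain(procs[k], caps[k]))
        if donor == recv or ips_gain(procs[recv], caps[recv]) <= \
                            ips_loss(procs[donor], caps[donor]):
            break
        caps[donor] -= DELTA
        caps[recv] += DELTA
    return sum(get_proc_ips(p, c) for p, c in zip(procs, caps)), caps
```

Each shift strictly increases aggregate IPS, so the loop terminates once no donor/receiver trade is profitable.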
Power Repartitioning
The power repartitioning algorithm is shown in Algorithm 1. As all of the machine power budget is already used up by the $N$ allocated processors, a fair power share for a new job requesting $n$ processors is calculated as
$$P_{J_i} = P_{m/c} \times \frac{n}{N + n}.$$

Algorithm 1 Power Repartitioning
1: procedure PowerRepartition($J_i$, $n$)
2:   $P_{J_i} \leftarrow P_{m/c} \times n / (N + n)$
3:   $n_{opt} \leftarrow PTune(P_{J_i}, n)$ ⊲ Recompute $n_{opt}$
4:   while $n_{opt} < n$ do ⊲ Repartition power across jobs to provision power for the $i^{th}$ job
5:     $n \leftarrow n_{opt}$; $P_{J_i} \leftarrow P_{m/c} \times n / (N + n)$
6:     $n_{opt} \leftarrow PTune(P_{J_i}, n)$
7:   end while
8:   for each scheduled job $J_j$ do
9:     $total\_stolen\_power \leftarrow total\_stolen\_power + ShrinkPartition(J_j)$
10:  end for
11:  if $total\_stolen\_power < P_{J_i}$ then ⊲ If enough power cannot be stolen, recompute $n_{opt}$
12:    $P_{J_i} \leftarrow total\_stolen\_power$
13:    $n_{opt} \leftarrow PTune(P_{J_i}, n)$
14:  end if
15: end procedure
The job is power tuned for the requested $n_{req}$ (assigned to $n$) processors under the $P_{J_i}$ Watts calculated above. PTune gets rid of the unaffordable, less efficient processors, if any, leading to $n_{opt} \le n$. We recompute (in the while loop) the proportionate power that the new job with $n_{opt}$ processors should have due to power partitioning across all jobs. This new power budget, $P_{J_i}$, then becomes the base for another PTune pass, and so on, until the number of processors (monotonically decreasing) for the new job reaches a fixed point (stabilizes) in the while loop. The fixed point guarantees a fair power level ($P_{J_i}$) relative to other jobs, but we still need to find other jobs from which to steal just enough power for this job.
In the following for loop, power is stolen from each of the scheduled jobs in proportion to their power budgets. This is accomplished by ShrinkPartition, which consists of (1) stealing just enough power and (2) power tuning for the remaining power of a job on the same number of processors (since we assume moldable but not malleable applications). Here, we steal as much power as possible while retaining heterogeneous power bounds across a job's processors to respect processor variations and thus ensure a high IPS under the lower power budget.
The aggregate stolen power from other jobs is offered to the new job. If the stolen power is less than the fixed power level for the new job ($P_{J_i}$), then the new job needs to be tuned one more time. If the stolen power is sufficient for this last tuning step, the new job is scheduled and the power re-tuning decisions made by ShrinkPartition for the existing jobs are enforced. If, however, the stolen power is insufficient (as determined by PTune when the power budget cannot accommodate more than $n/2$ processors), no power is redistributed, i.e., all jobs remain unchanged in their power settings and the new job is deferred until at least one other job completes.
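For concreteness, the repartitioning flow reads roughly as follows in Python (a loose sketch of Algorithm 1; the fair-share formula and the `ptune_job`/`shrink_partition` signatures are our inferences from the prose, not the published code):

```python
def repartition(new_job, n_req, scheduled_jobs, P_mc, N_alloc):
    """Find a fixed-point fair-share budget for a new job, then steal
    just enough power from already scheduled jobs to fund it."""
    n = n_req
    while True:
        P_ji = P_mc * n / (N_alloc + n)   # proportionate share (assumed form)
        n_opt, caps, _ = ptune_job(new_job, P_ji, n)
        if n_opt >= n:                    # fixed point: n no longer shrinks
            break
        n = n_opt
    # Steal power proportionally from every scheduled job via ShrinkPartition.
    stolen = sum(shrink_partition(job, P_ji, P_mc) for job in scheduled_jobs)
    if stolen < P_ji:                     # could not gather the full fair share
        P_ji = stolen
        n_opt, caps, _ = ptune_job(new_job, P_ji, n)
        if n_opt <= n // 2:               # insufficient: defer the job
            return None
    return n_opt, caps, P_ji
```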
IMPLEMENTATION
We modified the libmsr [35] library to gather the processor characterization data. We implemented a power-performance profiler using the MPI profiling interface (PMPI) that invoked various subroutines of the libmsr library to assess the power and performance of MPI applications. We captured several fixed counter values, power consumption, and completion times for each application on all the processors. The processor power consumption was measured using Intel's RAPL interface. This characterization data is made available to PTune and PPartition.
We assume that the jobs are moldable. Our power manager works in co-ordination with the conventional job scheduler. Once a job is dispatched by the conventional scheduler, the power manager (PPartition+PTune) determines its power budget, the selection of processors from those available, and the power distribution (or processor power caps) across them.
We assume a large job queue (> 384 processes) and a backfilling queue (< 48 processes). The conventional job scheduler schedules as many large jobs as it can on the machine before scheduling the backfilling jobs. We assume up to $N_{max} = 550$ processors with 12 cores each (6,600 processes). If the power manager decides to schedule the job, the power distribution across its processors (and power repartitioning, if required) is enforced using RAPL.
EXPERIMENTAL SETUP
Experiments were conducted on a 324-node Ivy Bridge cluster. Each node has two 12-core Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz processors and 128 GB of memory. We used MVAPICH2 version 1.7. The codes were compiled with the Intel compiler version 12.1. The msr-safe kernel module provides direct access to Intel RAPL registers via libmsr [35] . We used the package (PKG) domain of RAPL that provided us the capability of capping power for each of the processors in an experiment. The scheduling environment was simulated in R.
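As a side note for reproduction, per-package RAPL caps can also be set without msr-safe through the Linux powercap sysfs interface; a sketch follows (the authors capped via libmsr and the RAPL MSRs directly, and the sysfs zone layout can vary by kernel):

```python
import glob

def set_pkg_power_cap(watts):
    """Cap the PKG (package) RAPL domain of every socket. Requires root.
    Top-level zones match intel-rapl:<socket>; core/dram subzones do not."""
    for zone in glob.glob("/sys/class/powercap/intel-rapl:[0-9]"):
        with open(zone + "/constraint_0_power_limit_uw", "w") as limit:
            limit.write(str(int(watts * 1_000_000)))  # limit is in microwatts

# Example: enforce a uniform 51 W package cap, as in the experiments below.
# set_pkg_power_cap(51)
```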
We again used EP, BT, and SP from the NPB suite and CoMD from the Mantevo suite, in their pure MPI versions. We exponentially increased the node count for our experiments. The inputs were weakly scaled for different node counts. We report performance in terms of completion time in seconds and power in Watts. The reported numbers are averages across ten runs.
RESULTS
Experiments were conducted for single job power tuning and multi-job power partitioning.
Variation under Power Caps: Sorting Required
We now exploit the observed variability in the unbounded power consumption of the processor chips, which translates into variation in performance under a power constraint. This variability may be caused by factors such as manufacturing/process variation (at the CMOS/transistor level), ambient machine-room temperature at different rack positions (higher/lower relative to the floor), or others. Yet, our method handles variation irrespective of its causes. Previous work [27] and Section 2 have already established that a cluster is not homogeneous under a power constraint because of such variation. We also observe that scheduling a job on different sets of a fixed number of processors under a constant power budget leads to variation in the performance of a parallel job.
We present a selection of configurations to demonstrate this behavior in Figures 11 and 12 . The x-axis represents the codes and the number of processors. The y-axis indicates the completion time in seconds. The codes are run on several combinations of processors from one or more quartiles of the processor distribution. The numbers on the top of the bars indicate percentage slowdown with respect to the baseline. The processors are uniformly capped at 51W in this set of experiments, i.e., they maintain a constant job power budget of 8KW, 16KW, and 32KW for 16, 32, and 64 processor experiments, respectively. The baseline for 16 processor experiments (Fig. 11) is the performance on the processors belonging to quartile Q1. For 32 and 64 processors (Fig. 12) , the baseline is the performance on the processors belonging to Q1 and Q4 (also see legends). Q1 consists of the most efficient processors whereas Q4 consists of the least efficient processors. We observe a performance slowdown ranging from 2% to 18%. We observe that performance deteriorates as we include less efficient processors (Q2, Q3, Q4) in the mix. Hence, the optimal selection of nopt processors should consist of the most efficient processors from the available ones.
PTune
We evaluate the effectiveness of PTune using the aforementioned codes. In Fig. 13 , we present results for three different combinations of processors belonging to different quartiles. There are three data points corresponding to each code.
In the figures, $n_{LOWER}$ (synonymous with $n^{\perp}$) is the maximum number of processors that can operate at maximum power such that their aggregate power does not violate the job-level power constraint. This configuration most closely resembles worst-case power provisioning, as processors are not power constrained. PTune is the data point corresponding to the optimal configuration suggested by the power tuner. Uniform power corresponds to the naïve approach of distributing the job's power budget evenly across all processors in a job. This is the baseline configuration.
Performance
In Fig. 13, the y-axis represents performance (top graph) in terms of wall-clock time (in seconds) and the number of processors recommended by the power manager (bottom graph), over different codes and the quartiles to which the processors belong (x-axis).

(Figure 13: Evaluation of PTune on 16 processors from one or more quartiles)

The numbers on the bars indicate the runtime reduction and the utilized number of processors relative to the baseline, in percent. We observe a performance improvement of up to 22%. The gains depend on the combination of processors from different quartiles as well as on the workload. PTune is able to free up to 38% of the resources while achieving similar or higher performance than the baseline configuration.
Scalability
We evaluate PTune on up to 128 processors. Fig. 14 presents results addressing the scalability of PTune. PTune achieves performance improvements of as much as 29%, with a minimum of 1%. More significantly, in the case of the minimal performance improvement, PTune frees up 23% of the processors, which subsequently become available to the next scheduled jobs. We observe an error of less than 2% between the total job power consumption (measured via RAPL) of the PTune-recommended configurations and the assigned job-level power budget across all experiments.
PPartition
In this section, we perform a macro-level evaluation of our 2-level model. We simulate the conventional scheduler that dispatches jobs from multiple queues, one at a time. Let $np$ be the number of processes. The scheduler handles 3 queues: one large job queue ($np \ge 768$, or $n \ge 64$ processors) and two backfill queues ($np \le 48$, or $n \le 4$ processors; $48 < np < 768$, or $4 < n < 64$ processors). Larger jobs are scheduled first, followed by backfilling jobs, to improve system utilization. We assume $N_{max} = 550$ processors. Our job mix consists of 25% each of EP, SP, BT, and CoMD.
We assume a hardware-overprovisioned machine with a machine power budget $P_{m/c} = 28KW$. Fig. 15 depicts a scenario in which the job scheduler is oblivious of power management. The machine's power budget is uniformly distributed across all the processors. We call this naïve scheduling.

(Figure 14: Evaluation of PTune on processors from Q1 and Q4 quartiles)

The conventional scheduler schedules jobs as long as the required number of processors is available. Fig. 16 depicts the scenario in which our power manager (PTune + PPartition) coordinates with the conventional scheduler to make variation-aware power and job scheduling decisions. The x-axes in both plots represent job identifiers, ordered by the time at which they are dispatched by the conventional scheduler.
We can see that the large 64-processor job is scheduled first, followed by the backfilling jobs. The left y-axis denotes job performance in IPS. Each of the red, green and blue curves represents a job's performance as more and more jobs are scheduled over time (moving right along the x-axis). Stealing power from already-running jobs leads to a drop in their performance. In return, we are able to schedule more jobs at the expense of the performance of already running jobs. In this scenario, our scheme is able to schedule 58 jobs, whereas the power-oblivious scheduler is able to schedule only 36 jobs to run at the same time. This is because PTune schedules each job on a reduced number of processors compared to the naïve scheme. The performance of most of the first 36 jobs (those scheduled under both schemes) is at least as good under our approach as under the naïve one. In addition to these jobs, our power control is able to schedule 22 more backfill jobs that further improve the overall throughput (SysIPS) of the machine (compared to the naïve approach) under the same power constraint.
The right y-axes depict the system's power consumption as a fraction of (normalized to) the overall provisioned system power, $P_{m/c}$, in one line graph (circles), and the system's performance ($SysIPS = \sum_i JobIPS_i$), normalized to the maximum, in the other (crosses). Both graphs track each other closely, but under our power control, the machine power is fully utilized much earlier (after ≈ 10 jobs), whereas 36 jobs are required to reach this level in the naïve case. These initial jobs also achieve higher performance under our scheme (>1500 billion IPS) than in the naïve case (900 to 1500 billion IPS for backfill jobs and 2100 billion IPS for the large job) before the other jobs are scheduled, and these jobs would thus terminate earlier, as they have progressed further under our power control compared to the naïve case. This shows that when there are fewer jobs running on a machine, our power manager is able to direct all machine power to jobs where it is needed to maximize performance under a power constraint, unlike the naïve approach.

Fig. 17 compares the throughput of our scheme with three naïve schemes. The x-axis denotes the machine's power budget and the y-axis depicts the throughput of the machine (SysIPS) normalized to the maximum throughput at 39KW (left bar per set). The uniform capping schemes assume that an appropriate number of randomly selected processors on the machine are already capped at $P_{m/c}/N_{max}$, 75W (mid-way between minimum power and TDP), and TDP, respectively, such that their aggregate power does not exceed the machine's power budget. The rest of the processors in these configurations are not available to the conventional scheduler in the naïve scheme. PTune+PPartition represents our model that makes variation-aware decisions about scheduling jobs across the entire machine under a machine-level power constraint. The percentages on top of the bars indicate how much lower the throughput of each naïve scheme is compared to our solution. Our model achieves 5-35% higher throughput.

Fig. 18 depicts the performance of all the jobs that are scheduled under schemes 1-4 (top left legend) indicated along the x-axis. The y-axis denotes each job's performance normalized with respect to the aggregate job performance (SysIPS) under the respective scheme. We see that our approach (scheme 1) is able to schedule a much larger number of jobs (denser clusters in the plot) than the naïve scheduling policy (scheme 2) by trading off the performance of some jobs.

(Figure 18: Job performance. A job is represented by a triangle.)
RELATED WORK
Energy has been an important issue in high performance computing (HPC) for over a decade. Supercomputers as old as BlueGene/L have been built with the goal of maximizing power efficiency. Power-scalable clusters equipped with voltage and frequency scaling have existed for over a decade, enabling researchers to study the energy problem in HPC. Freeh et al. [14] investigated the energy-time trade-off of MPI applications to prove that it is feasible to save energy by scaling the processor down to lower energy levels, with or without a time penalty depending on the application. Springer et al. [36] proposed a combined approach of performance modeling and performance prediction for minimizing the execution times of MPI applications under energy bounds. They used voltage and frequency scaling on single cores of a small cluster of up to 10 nodes for their experiments. In addition, there is abundant work presenting algorithms that use frequency and voltage scaling mechanisms for energy savings [23, 28, 29, 13, 4]. In contrast, our work uses power capping via the Intel RAPL interface. Totoni et al. [38, 22] presented an ILP-based runtime system that schedules work on a selective subset of cores of a single multi-core chip to meet a power or performance constraint. Within-die or core-to-core variation-aware DVFS schemes [37, 16] have been proposed for chip multiprocessors. These schemes select optimal voltage and frequency set points for each of the cores to achieve an improved power-to-performance ratio for the chip. Our work differs in terms of granularity: we study variation across several processors or chips, not across the cores of a single multi-core chip. We manage resources at the processor (or chip) level. We use either all or none of the cores of the chips/multi-core processors on a machine. Our goal is to improve the performance of a parallel job scheduled on multiple processors under a strict power budget.
System-wide solutions for power-constrained systems have been proposed that aim at increasing the throughput of systems by leveraging the idea of hardware overprovisioning [12, 24, 32, 33, 11]. Sarood et al. [34] proposed a scheme for determining an optimal number of nodes under strong scaling of applications executing on an overprovisioned system while distributing power between CPU and memory. Etinski et al. [10, 9] proposed the use of dynamic voltage and frequency scaling (DVFS) at the job-scheduling level to save energy and improve overall job performance. Patki et al. [25] proposed power-aware backfilling to improve the throughput of the system. Ellsworth et al. [8] presented a power scheduler that enforces a system-wide power bound by reallocating power across the cluster. Our work differs from all of the above because our approach takes the performance variation across the processors of a cluster into account while scheduling and tuning jobs for performance.
Inadomi et al. [18] proposed performance optimizations across inhomogeneous processors. Their work provides a detailed analysis of the phenomenon across multiple clusters and a set of simple algorithms for intelligent power balancing. These algorithms, while groundbreaking, suffer from two serious limitations. First, the processor power model assumed that CPU clock frequency increases proportionally with power. While that is a useful simplification, our work here shows that the story is not nearly so simple. Second, the algorithms assumed that the ideal number of nodes to use was fixed a priori. In our approach, PPartition and PTune jointly determine the ideal number of nodes and the power budget for every job at the time of scheduling. While the result of both approaches is a power schedule, we are solving a fundamentally different problem.
Kappiah et al. [21] presented a system that saves energy at the expense of execution time by scaling down the frequencies of cores when they encounter slack time in an MPI application. Rountree et al. [28] used linear programming to establish a bound on the optimal energy savings of an MPI application and presented a runtime algorithm to save energy in HPC applications with negligible delay [29]. Power conservation by means of turning off unwanted nodes is proposed in [26]. In the solutions presented above, the authors used one core per node, and their goal was to maximize energy savings with minimal impact on execution time. In contrast to these solutions, we are intolerant of performance degradation. We use multicore processors, and our goal is to minimize completion time as long as we stay within the power budget.
SUMMARY
We presented a hierarchical, variation-aware, machine-wide solution for managing power on a hardware-overprovisioned machine. It consists of a macro-level Power Partitioner that makes power and job scheduling decisions, and a micro-level Power Tuner that determines the optimal processor selection and power caps for a job such that its performance is maximized under a power constraint. PTune achieves up to 29% improvement in performance compared to uniform power capping. It does not lead to any performance degradation, yet frees up to 40% of the resources compared to uniform power capping. PPartition is able to improve the throughput of the machine by 5-35% compared to naïve scheduling under the same machine power budget.
We established that under a power constraint, variability in performance transforms into variation in peak power efficiency. We believe that this variation in power efficiency should be one of the primary considerations in future power-management research.
