Abstract-Energy efficiency in high performance computing (HPC) will be critical to limit operating costs and carbon footprints in future supercomputing centers. Energy efficiency of a computation can be improved by reducing time to completion without a substantial increase in power drawn, or by reducing power with only a small increase in time to completion. We present an Adaptive Core-specific Runtime (ACR) that dynamically adapts core frequencies to workload characteristics, and show examples of both reductions in power and improvement in the average performance. This improvement in energy efficiency is obtained without changes to the application.
I. INTRODUCTION
Performance has been and will continue to be the fundamental driver in supercomputing. However, in the push to achieve exascale (10^18 flops/sec) performance, a commensurate increase in power is no longer feasible. Today's top supercomputers with petaflops performance (10^15 flops/sec) already consume in excess of 15MW [1]. With the Exascale Computing study [2] specifying a definitive power budget of 20MW, the performance needs to improve 10x while power less than doubles.
With tighter power budgets likely in the near future, it is imperative that each supercomputer component (hardware and software) be energy efficient. Most research to regulate energy and performance in software has revolved around Dynamic Voltage and Frequency Scaling (DVFS) [3] , [4] , [5] . Because of hardware limitations to date, DVFS research has impacted all of the cores on a multi-core processor and potentially slowed the critical path. Thus, the research has focused on finding situations where the slowdown is greatly outweighed by the energy savings. The chip-wide effect of DVFS also made effective fine-grain control of performance difficult.
With the introduction of per-core specific voltage regulators in Intel Haswell, new options for software energy control are now available. Each physical core (or 2 logical cores if using Hyper-Threading) can be independently controlled allowing only non-critical threads to have their frequency reduced.
In addition to DVFS, core-specific Software Controlled Clock Modulation has been supported by Intel since Pentium 4. The effective core frequency is adjusted nearly instantaneously by gating only a fraction of the clock cycles to that core. We call this approach Dynamic Duty Cycle Modulation (DDCM), given that the objective is to match duty cycle of the core to its work dynamically. The fine-grained control allows DDCM to save power effectively for unbalanced applications. As voltage regulators are unaffected, changing clock frequency with DDCM requires less work (and time) than with DVFS [6] .
In this paper, we present a generic policy that uses a core-specific power control like DDCM or per-core DVFS to throttle the frequencies of cores not on the critical path. The goal is to match a core's duty cycle to its workload to eliminate idle cycles. The duty cycle is given by

$$\text{Duty cycle} = \frac{\text{Time core (processor) in active state}}{\text{Total time}} \times 100$$
The amount of time a core is active can be changed either by lowering its T-state (with DDCM) or by reducing frequency (DVFS). By dynamically adapting core frequencies to workload characteristics on that core, fewer idle clock ticks occur and less power is wasted. Many HPC applications comprise multiple phases of computation, with each of the cores performing disparate amounts of work, leading to workload imbalance. With the proposed policy, a core doing more work will have a higher duty cycle (and run at a higher T-state/frequency) than one doing less work. Ideally, all threads reach the next phase boundary at almost the same time (Figure 1).
Many HPC researchers are exploring overprovisioning of processor nodes [7], [8], [9] to improve performance within the available power budgets. Many future exascale applications will have heterogeneous processor load and, with power limits, most exascale systems will have heterogeneous performance. This can lead to significant run-to-run variations in the execution time and energy consumed. The adaptive runtime framework saves energy by dynamically setting core-specific frequencies. To the extent possible in the hardware, the power saved in non-critical nodes can be allocated to the cores on the critical path. This results in execution time reductions, particularly under a hardware power cap [10].
The major work and ideas presented in this paper are:
• A generic policy that effectively utilizes per-core specific power controls to improve energy efficiency. Our previous work [10] aimed only at showing the efficacy of DDCM as an alternative to socket-wide DVFS. However, the present work offers a context for comparing DDCM (with its simple per-core hardware implementation and fast switching capability) and DVFS (more complex and costly to implement per-core but with potential for greater savings), and for showing how and when they can be used together.
• Implementation of an adaptive runtime framework (library) that uses the duty cycle inspired policy to throttle the frequencies of cores not on the critical path of an MPI application. An important feature of this implementation is that it allows the flexibility to use multiple power policies to save energy: DDCM, per-core DVFS or both. Use of this library does not require any code changes to the underlying application.
• Validation of the framework using six benchmarks (miniAMR, miniFE, CloverLeaf, HPCCG, AMG, miniGhost) and a real-world application, ParaDis. The evaluation shows an overall 20% improvement in energy efficiency with an average 1% increase in execution time on 32 nodes (1024 cores) using per-core DVFS.
• Energy optimization is shown to improve performance in certain scenarios. With a real application ParaDis, the runtime is seen to improve performance by lowering run-to-run variation and facilitating running at turbo frequencies. The performance improvement is achieved in addition to reducing power.
II. ADAPTING CORE FREQUENCY TO WORKLOAD CHARACTERISTICS
HPC applications have a large variety of CPU/memory usage patterns that are input driven and dependent on the executing application phase. System noise from diverse factors like hardware, OS, and network further complicates any static attempt to determine optimal core frequencies. These factors drove our choice to create an adaptive policy driven by runtime inputs.
Slack reclamation by trying to slow down the non-critical paths of computation is not new to energy-efficient HPC nor is the idea of an adaptive runtime. Most previous research has revolved around DVFS and its ability to obtain cubic savings in energy. The previous chip-wide requirement made it difficult to find applications where the savings did not result in significant execution time increases. Some of the previous work (see Section VI) used complex models requiring system-wide introspection that are better suited for off-line analysis. Other works required application level source code changes making them tedious and difficult for production applications.
Our previously developed adaptive policy [10] examined local system state and predicted the proper duty cycle level to use for the next application region. It saved 13.5% processor energy on one node and 20.8% on 16 nodes for several benchmarks. On a production application, ADCIRC, energy savings of 10% were obtained with only a 1-3% increase in execution time [11].
In the current work, we offer a context for comparing DDCM (with its simple per-core hardware implementation and fast switching capability) and DVFS (more complex and costly to implement per-core but with potential for greater savings). This is done by showing how the previous policy [11] can be made generic to work with per-core DVFS in current work, and other core-specific power controls that the hardware might provide in the future. Further, a novel approach to combine per-core DVFS and DDCM is presented. The combination of multiple power controls in complementary ways is shown to achieve improved energy savings.
The new generic policy reduces power by applying retrospective information during an MPI collective to predict slack at the next MPI collective. The idea is that if a core reaches the collective earlier than others, then we should slow it down (or speed it up if it arrives late) so that cores will more likely reach the next collective at the same time (Figure 1). The policy assumes that the amount of work performed between MPI collectives is relatively stable during execution. In practice, this has not been found to be overly restrictive.
Only local timing and state information are used at each core. No global communication or state is required. This allows the policy to scale to any application size. The policy is implemented within the MPI profiling interface (PMPI) and requires no application code changes. Calls to MPI_Init, MPI_Finalize and most MPI collective calls are intercepted. Data is computed or set in the prologue and used during the epilogue to determine the next phase's clock frequency. No data from another rank is required, eliminating the need for any communication. The application does need to link against our MPI library in addition to the standard MPI library, and needs to access protected machine-specific registers (MSRs) only to control power using software-controlled clock modulation. Access to MSRs can be obtained either by using libraries like msr-safe [12] or by running the application as root. For controlling power using DVFS, the acpi-cpufreq or other applicable kernel modules need to be loaded.
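The prologue/epilogue bookkeeping described above can be sketched as follows. This is a minimal Python illustration only: the names `PhaseTimer` and `wrap_collective` are hypothetical, and an actual PMPI layer would be a compiled library that wraps the MPI calls themselves, which this sketch does not attempt to reproduce.

```python
import time


class PhaseTimer:
    """Tracks compute and wait time per phase using only a local clock (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._phase_start = clock()  # end of the previous collective
        self._enter = self._phase_start
        self.compute_time = 0.0
        self.wait_time = 0.0

    def prologue(self):
        # Entering the collective: time since the last collective counts as compute.
        self._enter = self._clock()
        self.compute_time = self._enter - self._phase_start

    def epilogue(self):
        # Leaving the collective: time spent inside it is treated as waiting.
        now = self._clock()
        self.wait_time = now - self._enter
        self._phase_start = now
        return self.compute_time, self.wait_time


def wrap_collective(timer, collective, on_phase_end):
    """Return a wrapper that times `collective` and passes the result to a policy callback."""

    def wrapper(*args, **kwargs):
        timer.prologue()
        result = collective(*args, **kwargs)
        compute, wait = timer.epilogue()
        on_phase_end(compute, wait)  # a policy would choose the next clock level here
        return result

    return wrapper
```

Because the timer reads only a local clock, no information from other ranks is needed, which mirrors the communication-free property stated in the text.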
A. Working of policy with DDCM
The policy's goal is to detect and reduce imbalances. When a core is running faster than needed, that core's effective clock frequency is reduced. The new frequency is chosen to be the lowest such that core will not be the last one to arrive at the next collective. If the last core to arrive at a collective is running at full speed, the application should experience no slowdown.
The policy automates the process of selecting the clock frequency for the next application region by comparing the computing and waiting times of each core. If a core reaches a synchronization point early (i.e., has a significant fraction of waiting time), it is assumed that the core will arrive at the next synchronization point early and is a candidate to have its clock frequency reduced. The clock frequency for the next section is calibrated using the compute and waiting times for the previous region. A core doing more work will run at a higher effective frequency than one doing less work. Figure 2 shows the working of the generic core-specific adaptive runtime policy. The policy uses two rules: one to decrease the clock frequency of a core and the other to increase it. It first attempts to decrease the clock frequency.
• L*: levels/steps to change clock frequency
The next clock frequency is a function of the ratio of computing time to total time between barriers and the previous clock frequency. In practice, not every clock frequency is available; the one chosen is the lowest available frequency that would have made the core wait the least at the last barrier.
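One way to read the decrease rule is sketched below. The exact equation is not reproduced in this text, so the scaling used here (previous level multiplied by the fraction of the phase spent computing) is an assumption made for illustration, and `next_level_decrease` is a hypothetical name.

```python
def next_level_decrease(prev_level, compute_time, wait_time, levels):
    """Pick the lowest available level that still covers the work done last phase.

    Assumption: the ideal level scales the previous one by the fraction of the
    phase spent computing. `levels` holds the available clock levels in the
    same unit as `prev_level` (for example percent duty cycle or kHz).
    """
    total = compute_time + wait_time
    if total <= 0:
        return prev_level  # nothing measured, keep the current level
    target = prev_level * compute_time / total
    candidates = [lvl for lvl in sorted(levels) if lvl >= target]
    chosen = candidates[0] if candidates else max(levels)
    # This rule only lowers the level; increases are handled by the second rule.
    return min(chosen, prev_level)
```

For example, with levels of 25, 50, 75 and 100 and a core that computed for 6 time units and waited for 4 at level 100, the target is 60 and the sketch selects 75, the lowest level that does not undershoot the target.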
When the previous rule determines that the duty cycle does not need to be reduced, the policy then determines whether the duty cycle needs to be increased to prevent the current core from being the last to arrive at the next barrier, thus slowing the application. The policy aggressively increases frequency when it determines that this core may be the last to arrive at the next barrier. Increasing a core from the minimum clock frequency to the maximum takes only a few policy invocations rather than one for each possible effective clock rate level.
The equation to increase the duty cycle level is given by
The model estimates the next value for the duty cycle by comparing wait time with how close to the minimum duty cycle the last region was executed. Thus, the model again assumes some predictability between successive phases.
B. Making the policy generic (per-core DVFS)
The approach to making the adaptive runtime policy generic is straightforward. This is achieved by changing the maximum, minimum and intermediate values to the ones supported by the core-specific power control that may be provided by the hardware in the future.
For per-core DVFS, C_max is the maximum non-turbo frequency on a machine, and C_min is the lowest frequency supported by DVFS. The transitions (L*) occur at frequencies available in /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies.
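As an illustration of how these values could be obtained, the sketch below parses the sysfs frequency list and derives C_min, C_max and the intermediate levels. The function names are hypothetical, the caller is assumed to supply the maximum non-turbo frequency, and the sample values in the usage note are made up rather than taken from the test system.

```python
from pathlib import Path


def parse_available_frequencies(text):
    """Parse space-separated kHz values (sysfs format) into an ascending list."""
    return sorted(int(token) for token in text.split())


def dvfs_levels(text, max_non_turbo_khz):
    """Return (c_min, c_max, levels), keeping only frequencies up to the non-turbo maximum."""
    levels = [f for f in parse_available_frequencies(text) if f <= max_non_turbo_khz]
    if not levels:
        raise ValueError("no frequency at or below the non-turbo maximum")
    return levels[0], levels[-1], levels


def read_levels(cpu, max_non_turbo_khz):
    """Read the levels for one CPU; needs a cpufreq driver that exposes this file."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_available_frequencies")
    return dvfs_levels(path.read_text(), max_non_turbo_khz)
```

Calling `dvfs_levels("2301000 2300000 1800000 1200000\n", 2300000)` on such an illustrative list keeps the three entries at or below 2300000 kHz and drops the higher one.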
C. Combined Policy
On an Intel Haswell machine, the cores can use either T-state (DDCM) or P-state (DVFS) transitions to lower frequency. DVFS can generally support frequencies only down to about half the standard non-turbo frequency of the processor. The power saved using DVFS is higher in comparison to DDCM as both voltage and frequency are reduced. When the clock frequency needs to be reduced beyond what DVFS allows, DDCM can be used to further reduce the clock frequency. In the combined policy (Figure 3), a core starts by using the DVFS policy to lower frequency when its work corresponds to a frequency greater than or equal to the minimum frequency supported. Only once the core is running at the minimum allowed by DVFS, and it is determined that the clock rate should be further reduced, is the DDCM policy applied. By using DVFS and DDCM together, effective clock rates down to 20% of maximum are possible before hardware glitches are seen.
Once a core uses DDCM policy, it continues to use it every phase until duty cycle is increased back to 100%. Only then is the DVFS policy used. During highly unbalanced code regions, cores can have both DVFS and DDCM active, attempting to reduce the effective clock frequency as much as possible.
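The hand-off between the two controls can be sketched as a small decision function. This is an illustrative reading of the rules above, not the runtime's code: `CoreState`, `choose_control` and `effective_fraction` are hypothetical names, and modelling the effective clock rate as the product of the frequency fraction and the duty cycle is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CoreState:
    freq_khz: int            # current P-state frequency
    duty_pct: float = 100.0  # current T-state duty cycle


def choose_control(state, f_min_khz, wants_lower):
    """Return which policy ('dvfs' or 'ddcm') acts on this core in the next phase."""
    if state.duty_pct < 100.0:
        return "ddcm"  # DDCM stays in charge until the duty cycle is back at 100%
    if state.freq_khz <= f_min_khz and wants_lower:
        return "ddcm"  # DVFS floor reached and a further reduction is requested
    return "dvfs"


def effective_fraction(state, f_max_khz):
    """Effective clock rate as a fraction of the maximum non-turbo frequency (assumed model)."""
    return (state.freq_khz / f_max_khz) * (state.duty_pct / 100.0)
```

Under this assumed model, a core at half the maximum frequency with a 50% duty cycle runs at an effective 25% of the maximum clock rate.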
D. Adaptive Core-specific Runtime
The Adaptive Core-specific Runtime (ACR) implements the DVFS, DDCM and combined policies described above. It provides the user with a choice to select one of the three policies to control processor power usage, and in addition supports system-wide introspection through data reported from hardware counters. To avoid aggressive lowering of frequency/duty cycle that may cause unnecessary performance degradation a few modifications have been made to the policies in the ACR. These changes below can be easily overridden by the user to apply the policies in their purest form.
• Frequency headroom aimed at minimizing execution time penalties: The chosen clock-rate by default is rounded up to the next highest clock-rate. If the chosen value is too low the execution time penalty is seen to be larger than the potential energy savings.
• Limit on the minimum permissible clock-rate: For low clock-rates, the observed performance degradation is higher than predicted. The minimum clock-rate is thus set to be at most 18.75% of the maximum non-turbo frequency, a value that is obtained empirically. This minimum is likely to be architecture specific and can be changed by using a set of environment variables provided.
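The two safeguards above can be sketched as a post-processing step on a chosen level. This is a hedged illustration: `apply_safeguards` is a hypothetical name, levels are assumed to be expressed as fractions of the maximum non-turbo frequency in sixteenths, and the 18.75% limit is interpreted here as a floor below which the level is never set.

```python
def apply_safeguards(chosen, levels, min_fraction=0.1875, headroom=True):
    """Round the chosen level up by one step and enforce a minimum level.

    `levels` are available levels as fractions of the maximum non-turbo
    frequency; `chosen` must be one of them.
    """
    ordered = sorted(levels)
    idx = ordered.index(chosen)
    if headroom and idx + 1 < len(ordered):
        idx += 1  # frequency headroom: move to the next higher level
    # Lowest available level that respects the minimum permissible clock rate.
    floor = min((lvl for lvl in ordered if lvl >= min_fraction), default=ordered[-1])
    return max(ordered[idx], floor)
```

Passing `headroom=False` and `min_fraction=0.0` corresponds to applying the policy in its purest form, as the text allows the user to do.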
ACR provides support for user options to facilitate user customization of the framework to fit specific use cases. The options below may allow further improvement in energy savings or limit performance degradation.
1) Introduction of a limit on minimum phase length: This consideration is to avoid frequency change decisions based on characteristics of smaller non-computational phases (like startup). This limit also prevents decisions from taking place too frequently, while the history is carried forward from skipped phases.
2) Monitoring performance degradation at the end of every phase: To minimize performance deterioration, a maximum flexible slowdown factor is introduced. It is expressed as a percentage value to monitor performance degradation. When the phase degradation in the last phase is greater than the specified value, the policy is skipped in the current phase and the frequency, as well as the duty cycle, are reset to maximum. In the next phase, the policy is applied with reset values. Additionally, this serves as a rudimentary way to reset clock frequency when a phase change is detected based on changes in total phase time.
3) Support for user-annotations: A user can easily override the preselected behavior of the runtime through environment variables.
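The first two options can be sketched as a guard evaluated at the end of each phase. The function and parameter names below are hypothetical, and measuring degradation as the relative growth in total phase time over the previous phase is an assumption consistent with the description above.

```python
def phase_action(phase_time, prev_phase_time, min_phase_len=None, max_slowdown_pct=None):
    """Decide what to do at the end of a phase: 'skip', 'reset' or 'apply'.

    'skip'  : phase shorter than the minimum length, no frequency decision is made
    'reset' : phase degraded beyond the allowed slowdown, return to maximum levels
    'apply' : run the normal clock-frequency policy
    """
    if min_phase_len is not None and phase_time < min_phase_len:
        return "skip"
    if max_slowdown_pct is not None and prev_phase_time and prev_phase_time > 0:
        slowdown = 100.0 * (phase_time - prev_phase_time) / prev_phase_time
        if slowdown > max_slowdown_pct:
            return "reset"
    return "apply"
```

With both options left at `None`, the guard always returns 'apply', which corresponds to running the base policy without user options.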
Effects of OS noise and performance jitter that cause some applications to have irregular temporal patterns are somewhat smoothed by these user options and in practice, more predictable results have been observed.
ACR can be used directly by an application or embedded in libraries (e.g., MPI) to control energy with no application code changes. To measure energy, temperature and other execution metrics like frequency, it requires access to MSRs in user space through interfaces like "intel-rapl" kernel module and /sys/class/powercap/intel-rapl or RCRdaemon [13] .
The ACR interface for performance measurement is pinned to the first core in a socket, avoiding any interruption to other cores. Each call comprises only a handful of straight-line instructions, which keeps the overhead smaller than the run-to-run variation in performance and therefore undetectable.
MPI_Init and MPI_Finalize calls are intercepted to set up and clean up the infrastructure. During program execution, ACR uses one of the three policies discussed earlier to set the best effective clock rate between MPI calls. MPI_Barrier and MPI_Allreduce are intercepted in the current experiments, but other MPI collectives can be easily used as well.
III. INFRASTRUCTURE
A thirty-two blade partition of the Shepard Advanced Systems Technology Test Bed at Sandia National Laboratories is used for all experiments. This development partition exposes power and energy instrumentation at the user level and grants user control of clock frequency and modulation through msr-safe and other kernel modules.
A. System
All tests used a portion of a Penguin blade cluster. Each node has two Intel(R) Xeon(R) E5-2698 V3 CPUs, each with 16 cores at 2.3GHz, and 128GB of memory, with Hyper-Threading enabled.
B. Measurement Techniques
All reported power, energy and temperature numbers are obtained with the Intel Running Average Power Limit (RAPL) interface. To allow user-level access to the RAPL values of interest, the Resource Centric Reflection (RCR) daemon [13] is used. The RCRdaemon has been extended to provide additional performance related metrics associated with frequency, instructions retired and cache accesses.
Modern processors have enough internal heterogeneity that execution times often vary by several percent run to run [14] . The average is taken over 12 test runs for each power and software setting. To avoid energy variation with temperature, each test script ignored results from the first several minutes until the system temperature was stable.
IV. RESULTS
The evaluation of ACR uses a set of DOE mini-apps that encompass a variety of computation/memory patterns. Measurements are reported for the entire execution and not restricted to single phases. The applications can be divided into two groups.
• Benchmark mini-apps: Six mini-applications, five from the Mantevo Suite [15] (miniFE, miniGhost, CloverLeaf, miniAMR, HPCCG) and one from the NERSC-8/Trinity Benchmarks [16], AMG.
• Production application: One production DOE application, ParaDis, a free large-scale dislocation dynamics simulation code to study the fundamental mechanisms of plasticity. It was originally developed at the Lawrence Livermore National Laboratory [17].
A. Standard benchmarks
ACR attempts to act where load imbalance exists and remain dormant when work is evenly partitioned. The potential gain realistically achievable with ACR should occur when evaluating several HPC benchmarks with unbalanced workloads.
For evaluation, a number of DOE MPI mini-apps were selected to simulate a variety of load types on HPC systems. Table I gives the execution time for the default run without ACR for all applications. It also lists the ACR parameters used for runs that use ACR on 32 nodes. It can be observed that the policy works well without changing any ACR user options in most cases (represented as a "none" value). Better energy efficiency while using ACR user options is obtained for some cases, either by enhancing power reduction or by controlling performance degradation, as explained in Section IV-B. The values chosen for the user options in our experiments, especially for minimum phase length, are obtained empirically. We recommend using the user-annotations supported by ACR to skip startup or non-computation phases in practice.
1) miniFE: miniFE is intended to be the best approximation to an unstructured implicit finite element application that includes all important computational phases. The problem size on 32 nodes is 225x375x525 with the 'load imbalance' factor set at 100 to exploit the maximum load imbalance that the application can present.
2) miniGhost: miniGhost simulates highly structured stencil operations. It executes the halo exchange pattern important in structured and block-structured explicit applications. A problem size of 30x30x30 is spread across 1024 cores with 16, 8 and 8 cores along the three axes.
3) miniAMR: miniAMR does a stencil calculation on a unit cube computational domain and can emulate the interaction of different bodies in space. It uses Adaptive Mesh Refinement to better model the edges of the moving bodies. A test case of a sphere moving diagonally along 1024 cores with 16, 8, 8 cores along the x, y and z directions is used. It runs for 10 time steps.
4) CloverLeaf: CloverLeaf investigates the behavior and responses of materials subjected to varying levels of stress using a two-dimensional Eulerian formulation. The input used is the provided "clover_bm512_short.in" and corresponds to a rectangular geometry of dimension 5.0x2.0 consisting of 30720 and 15360 cells along the x and y axes respectively.
5) HPCCG: HPCCG is another approximation to an unstructured implicit finite element application but generates a synthetic linear system. The focus is entirely on the sparse iterative solver. The chosen problem size is 90x120x150.
6) AMG: AMG is a parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids. The problem consists of a 27-point stencil on a cube with size 60x60x60. The processor topology is 16, 8, 8 cores along the x, y and z directions respectively, and it uses PCG with diagonal scaling as its solver.
B. Impact of ACR user options
The impact of the user options on the Clock-Frequency policy is measurable. The performance impact on HPCCG as the user options are added is demonstrated with the per-core DVFS policy in Figure 4. The base policy (A) results in a slight improvement in performance of 0.5%. Power is reduced only marginally, by 6.1%, for an energy improvement of 6.5%. By forcing the minimum phase length to be equal to or greater than 100ms (B), power reduction is improved drastically to 20.8%. The increase in execution time, though, is so large at 31.7% that the energy consumption increases by 4.2%. For HPC applications, such a high execution slowdown is problematic.
By limiting phase degradation (C) to 10%, performance slowdown is reduced to 1.3%. The power reduction remains very similar to (B) at 19.2%. With this option, the runtime attempts to (over-)react quickly at a phase change to prevent the critical core in the next phase from running at a clock frequency below 100%. If ACR detects a phase to run greater than 10% longer than the previous instance, it resets the core clock frequency to 100%. Even with aggressive clock frequency resets, the energy saved is still 18.1%.
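The relationship between the reported power, time and energy changes follows from energy being the product of average power and execution time. The short sketch below (with a hypothetical helper name) recomputes the energy change from the percentages quoted for configurations (A), (B) and (C); the results agree with the reported figures to within rounding.

```python
def energy_change_pct(power_change_pct, time_change_pct):
    """Relative energy change (%) from relative power and time changes (%), using E = P * T."""
    ratio = (1 + power_change_pct / 100.0) * (1 + time_change_pct / 100.0)
    return (ratio - 1) * 100.0


# (A) base policy: power -6.1%, time -0.5%
base = energy_change_pct(-6.1, -0.5)
# (B) minimum phase length of 100ms: power -20.8%, time +31.7%
min_phase = energy_change_pct(-20.8, 31.7)
# (C) phase degradation limited to 10%: power -19.2%, time +1.3%
limited = energy_change_pct(-19.2, 1.3)
```

Negative values indicate energy savings, so (B) is the only configuration of the three in which the time penalty outweighs the power reduction.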
C. Benchmark Results
The results for the mini-apps are in Figure 5 . The best energy savings obtained for each application with ACR using either DDCM, DVFS or Combined is summarized in Table II . The mini-apps fall into two broad categories. miniFE and miniGhost have significant imbalanced phases with a large number of memory references. DDCM reduces the clock frequency further than DVFS resulting in greater power savings. The effect of DVFS and DDCM on memory references was not studied. The combined policy provides the best results by allowing the voltage to also be lowered during the low clock frequency phases. miniAMR does not have a large number of memory references like the above two, yet it achieves highest energy savings with Combined due to large imbalanced phases. The other mini-apps have better load balance and use DVFS's ability to lower the voltage resulting in lower energy usage than DDCM. The combined policy does not improve over DVFS for these mini-apps. The most likely cause is the overly aggressive use of DDCM during transition phases.
1) miniFE: For miniFE using DDCM, the energy saving is 22.4% and program execution time is increased by 10.1%. With DVFS the energy savings is only 18.9%, but the slowdown is reduced to only 0.1%. When the combined policy is used, energy savings increases to 36.7%. The energy reduction is achieved in spite of a 9.4% execution time increase through a 42.1% power reduction.
2) miniGhost: DDCM reduces energy on miniGhost by 31.0% with a 2.5% execution slowdown. DVFS increases execution time by 4.7% and lowers power by 25.9% resulting in an energy reduction of 22.4%. By combining the two policies, the performance penalty is only 3.8% and the energy savings increases to 33.1%.
3) miniAMR: miniAMR is an interesting example. With DVFS, execution actually speeds up slightly (2.7%). This combined with a power reduction of 14.4% results in it using 16.7% less energy. The speedup is consistent over multiple runs. It may result from the hardware moving power from the core saving energy to the core with the critical section, or it may be related to a better performance of the barrier when all processes arrive at nearly the same time (no thread is swapped out). The energy savings with DDCM and Combined are 15.3% and 19.1% with an execution time increase of 2.1% and 3.9% respectively.
4) CloverLeaf: This and the next two mini-apps, which have lower amounts of extreme imbalance, all perform best with the DVFS policy. DVFS increases CloverLeaf execution time by only 0.2%. This allows the energy reduction (17.6%) to effectively be equal to the power savings of 17.7%. In contrast, with DDCM the performance degradation of 7.7% results in only a 4.8% reduction in energy consumed. The combined policy does better than DDCM but still suffers a 2.9% execution time penalty and only reduces energy by 11.7%.
5) HPCCG: With DVFS, HPCCG is executed using 18.1% less energy. This savings is obtained with a time increase of only 1.3%. The DDCM and Combined policies see time increases of 4.5% and 4.9% respectively. The increased time results in energy savings of only 10.6% and 12.0%.
6) AMG: DVFS performs the best on AMG. Over a quarter of the energy is saved (28.2%), while only increasing execution time by 0.1%. The DVFS policy produced significant power/energy savings with a performance impact less than the typical run-to-run variation in execution time. The DDCM and Combined policies were much less effective.
D. Production applications: ParaDis
With encouraging mini-app results, testing was expanded to a small real-world application (Figure 6) with parameters shown in the last row of Table I. ParaDis performs dislocation dynamics simulations by introducing dislocation lines into a computational volume; the lines interact and move in response to forces imposed by external stress and inter-dislocation interactions. The simulation run is "form binaryjunc" with the "fm-ctab.Ta.600K.0GPa.m2.t5.dat" correction table, demonstrating the formation of a binary junction from two dislocation lines. There are 8x8x8 cells spread across 16, 8, 8 cores along the x, y and z axes on 32 nodes. The discretization range is [5.000000e+01, 2.000e+02] and the re-mesh method used is 3, with maximum steps equal to 100. With load balancing turned off, ParaDis provides an unbalanced small real-world application where the number of timesteps can be adjusted to create runs short enough for extensive testing.
Initial testing with ParaDis on 1024 cores yielded encouraging results (Table III). With the chosen number of time steps, the default case executed in 122.6 seconds on average. DDCM lowered power by 19% but took on average 5% longer to execute, while DVFS also reduced power by 19% but ran in 122.3 seconds, showing no performance degradation. The Combined policy takes only 108.7 seconds on average. The optimization meant for power reduction also decreased the execution time by 11%. When combined with the 31% reduction in power, the total energy savings is a remarkable 42%.
Upon closer inspection, a large amount of run-to-run variation is present in the 1024 core runs. Figure 7 graphs the performance of 12 runs for each of the energy policies. The default runs have a 30% run-to-run variation, from a low of 105 seconds to a high of 136 seconds. When using the Combined policy, variation was reduced to about 5%, and execution times clustered around the fastest observed for default execution (between 105 seconds and 111 seconds).
E. Understanding performance improvement for ParaDis
To better understand the performance improvement seen with ParaDis, its critical path behavior for the Default (no ACR), DDCM and DVFS cases on 24 nodes (768 cores) is shown in Figure 8. The single run chosen has the worst execution time out of 12 runs for each of the three cases. The values in subtitles denote average values across the entire execution of the run, while the values in the legend are average values taken only across the data shown in the plot. The core with the highest compute time per phase is considered the critical core. Critical cores with compute times shorter than 0.1s are discarded to avoid large average frequency values computed using the Intel APERF and MPERF counters. Only cores running at average frequencies greater than 2200MHz are considered, to ensure that the critical cores run at the maximum possible frequency and do not experience any undue slowdowns (due to policy mispredictions). Consequently, a lower percentage (89%, 74%, and 70%) of the actual critical execution is captured in Figure 8.
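The filtering described above can be sketched as follows. The average frequency over an interval is commonly estimated as the base frequency scaled by the ratio of the APERF and MPERF counter deltas; the function names and the sample tuple layout are hypothetical, and reading the counters themselves (through MSR access) is outside this sketch.

```python
def average_frequency_mhz(aperf_delta, mperf_delta, base_mhz):
    """Estimate average core frequency over an interval from APERF/MPERF deltas."""
    if mperf_delta <= 0:
        raise ValueError("MPERF delta must be positive")
    return base_mhz * aperf_delta / mperf_delta


def select_critical_samples(samples, min_compute_s=0.1, min_freq_mhz=2200.0):
    """Keep critical-core samples that pass both filters described in the text.

    `samples` is a list of (compute_time_s, avg_freq_mhz) tuples, one per phase,
    for the core with the highest compute time in that phase.
    """
    return [
        (compute_s, freq_mhz)
        for compute_s, freq_mhz in samples
        if compute_s >= min_compute_s and freq_mhz > min_freq_mhz
    ]
```

Very short intervals yield small counter deltas, which makes the ratio noisy; this is one plausible reason for discarding phases with compute times under 0.1s.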
On 24 nodes, ParaDis takes about four times longer to complete in the default case and shows a much higher workload imbalance than on 32 nodes. As a result, with DDCM the execution time for ParaDis is reduced from 405.0s to 265.2s, a reduction of 34.5% (compared to 11% on 32 nodes). For the Combined case (not shown in Figure 8), the execution time is lowered by 35.4% (261.6s) and the power by 28.9% (65.5W) for a total energy savings of 54.1% (17127.8J). The DVFS case, though, shows only a 1.3% performance improvement, running for 399.7s. This indicates that the performance improvement for the Combined case is mainly due to DDCM, and not DVFS. As in the case of 32 nodes, the run-to-run variation is seen to be greatly reduced with ACR on 24 nodes, suggesting conformity between the two execution profiles. By analyzing the critical path behavior in Figure 8, the speedup for ParaDis can be explained using two key factors:
Reduction in run-to-run variation: The two dashed lines in each plot trace the means of a bimodal distribution of critical path times. In successive phases work appears to be similar, with occasional jumps between short and long critical paths. The consistency suits ACR as the frequency for the non-critical cores can be lowered to very low values for prolonged periods. Hence, non-critical cores do not compete with the critical core for resources during a phase. This alleviates any existing contentions to explain the reduced run-to-run variation. Further, the regular work pattern reduces mispredictions in all ACR policies.
Turbo mode: Lowering the frequency of non-critical cores for prolonged periods increases the available thermal headroom, allowing critical cores under ACR with DDCM to run at higher turbo frequencies (2784.8MHz) than in the default case (2507.4MHz). Because turbo frequencies are disabled when DVFS is in operation, critical cores run at much lower frequencies (2467.3MHz), resulting in little performance improvement, if any. Critical cores running at turbo also reduce the impact of policy mispredictions during phase transitions compared to DVFS (e.g., 75% of 2.6GHz (turbo) with DDCM > 75% of 2.3GHz with DVFS). The average count of instructions retired at OS and user level by each core using DDCM (1.1E+13) is only about one-third of the default (2.9E+13) due to reduced busy waiting. The reduction in busy waiting can again be attributed mostly to critical cores running at turbo with DDCM, as this effect is not seen with DVFS (2.3E+13). Finally, even though the power is reduced substantially (21.4%) with DDCM and the average frequency across cores is only 1429.7MHz, the temperature is not reduced comparably (only a 3.2°C reduction). This indicates the effect of turbo, as heat dissipation is nonlinear.
Table IV summarizes the energy savings and other related metrics obtained by each of the policies with standard benchmarks and the real application ParaDis on 1024 cores. The least performance degradation across all applications, 0.5%, is obtained with DVFS. It also reduces power by 20.5% to achieve a commensurate average energy efficiency improvement of 20.2%. The best energy savings overall, 22.6%, is achieved with Combined, with a power reduction of 24.9%. However, the execution time increases by 2.9%. The energy improvement with DDCM is 15.1%, with a power reduction of 19.3% and an execution time penalty of 5.3%. The intent of the above comparison is not to guide the choice of one policy over another, but only to summarize the effectiveness of ACR with the three policies. Figure 5 and Figure 6 show unique characteristics of each policy depending on the nature of the application. For applications that show extreme workload imbalances and/or high memory references, the Combined or DDCM policy works best, in some cases (ParaDis) with improved performance. For applications showing more moderate imbalance, DVFS works better. The higher average time penalty seen with Combined and DDCM in Table IV is mostly due to the higher performance deterioration observed with applications that are more stable.
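The energy figures reported in this section follow from energy being the product of average power and execution time. As a rough consistency check, the sketch below recomputes energy savings from the reported changes in time and power. The function name energy_savings and its sign convention are illustrative assumptions and are not part of ACR. For Table IV the inputs are averages over applications, so the product only approximates the reported average savings.

```python
def energy_savings(time_change, power_change):
    """Fractional energy savings from fractional changes in time and power.

    Negative arguments are reductions (e.g. -0.354 means 35.4% less time).
    Assumes energy = average power x execution time.
    """
    return 1.0 - (1.0 + time_change) * (1.0 + power_change)


# ParaDis on 24 nodes, Combined policy: 35.4% less time, 28.9% less power.
paradis_combined = energy_savings(-0.354, -0.289)
print(f"ParaDis Combined: {paradis_combined:.1%}")

# Table IV averages: (execution time change, power change, reported savings).
table_iv = {
    "DVFS": (0.005, -0.205, 0.202),
    "Combined": (0.029, -0.249, 0.226),
    "DDCM": (0.053, -0.193, 0.151),
}
for policy, (dt, dp, reported) in table_iv.items():
    estimate = energy_savings(dt, dp)
    print(f"{policy}: estimated {estimate:.1%}, reported {reported:.1%}")
```

Under these assumptions the ParaDis estimate matches the reported 54.1%, and the Table IV estimates fall within about 0.2 percentage points of the reported averages.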
V. DISCUSSION
By tailoring the frequency of each core to match its work, both slack and wasted power are greatly reduced. Reducing the clock frequency for hardware threads that spend significant time at software barriers yields valuable energy savings and in most cases will be invisible to users, as the execution delay is well below the variance already observed during execution. With the reduction in power, ACR shows a corresponding reduction in the chip temperature for all applications, reducing cooling requirements.
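The idea of matching each core's frequency to its share of the work can be illustrated with a minimal sketch. This is not ACR's actual policy: the function target_frequencies, the 1200MHz floor, the 2300MHz ceiling, and the 100MHz step are all hypothetical, and the sketch assumes compute-bound work whose duration scales inversely with clock frequency.

```python
import math


def target_frequencies(busy_times, f_min=1200, f_max=2300, step=100):
    """Pick a per-core frequency (MHz) so each core finishes near the critical core.

    busy_times: seconds each core spent computing in the last phase.
    f_min, f_max, step: hypothetical frequency limits and granularity in MHz.
    """
    critical = max(busy_times)
    if critical <= 0:
        # No measurable work in the phase: park every core at the floor.
        return [f_min] * len(busy_times)
    targets = []
    for busy in busy_times:
        # Frequency at which this core would take as long as the critical core.
        ideal = f_max * busy / critical
        # Round up to the next step so the core does not become the new critical path.
        snapped = math.ceil(ideal / step) * step
        targets.append(min(f_max, max(f_min, snapped)))
    return targets


print(target_frequencies([10.0, 7.5, 5.0, 2.0]))
```

In this sketch the critical core keeps the highest frequency while cores with more slack are slowed, which is the source of the power reduction described above. A real policy would also need to handle mispredictions at phase transitions.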
VI. RELATED WORK
Most power-aware computing research has centered on DVFS with a socket-wide effect, using either inter-node [4], [5] or intra-node methods [18], [19]. Computational workloads have been analyzed to propose ways to save power [20], [21]. Models to amortize the effect of uneven work distribution through slack reclamation have been proposed [3], [4], [22]. Green Queue [23] automates the process of finding phases and optimal frequencies using power models. Automatic tuning of applications based on software performance options and processor clock frequency has also been explored [24]. The empirical software policy proposed here is similar to the intra-node models but focuses on individual core (not socket) performance. Also, the use of per-core DVFS and its combination with DDCM is a departure from the state-of-the-art solutions that previous work has provided on architectures older than Haswell. There has been work that surveys the new energy-efficiency features of Haswell [25], and some that applies these specifically to OpenMP [26]. We focus mainly on applications using MPI, considered to be the de facto interprocess communication interface for HPC.
Moving beyond DVFS, duty cycle modulation [14] and power capping [27], along with similar mechanisms on IBM Power 6 and 7 (capping) and AMD Bulldozer [28] (capping and thermal design power limits), have been explored. These solutions have focused specifically on a single power lever to improve energy efficiency. Our runtime provides the flexibility to use multiple core-specific power policies to save energy.
Applications have been profiled to determine the best configuration of nodes and power caps for overprovisioned systems [7], [8]. Resource allocation schedulers that use overprovisioning and incorporate the power-response characteristics of each job along with a power cap are being explored [9]. We show that the use of core-specific power control may lead to performance improvement with energy savings in heterogeneously performing systems, or significantly reduce power (decreasing energy) without a power cap.
There has been considerable work to design energy-efficient runtimes oblivious to the running application [29]. These are aimed at alleviating the problems caused by system factors (OS noise, congestion) for runtimes that assume temporal patterns, and also at handling dynamic workloads. Our runtime does assume temporal patterns, but we show that an adaptive solution is on par with preemptive methods in handling dynamic conditions.
A number of efforts use hardware performance counters [30], [31], [32] to compute optimal off-line settings. Several projects estimate energy usage based on hardware counters with direct correlation to cache access [33], MIPS [34] and CPU stall cycles [35]. Our runtime does not make use of any hardware performance counters and only makes lightweight dynamic adjustments to each core's individual clock frequency.
VII. CONCLUSION & FUTURE WORK
ACR uses hardware core-specific power control mechanisms and an adaptive software policy to achieve significant energy savings with minimal execution time penalties. It provides, for a number of DOE mini-apps and small applications, energy savings of more than 20% with performance within the normal run-to-run variation. With no application code modifications, ACR provides significant energy savings with no user-visible effects. For one application case (ParaDis), a significant performance improvement is observed due to the reduction in run-to-run variation and the execution of critical path cores at turbo frequencies with DDCM.
As exascale computing deploys over-provisioned systems that use per-core power limits in day-to-day operations, energy optimizations will become more important. Runtimes such as ACR will either allow more work to be run at one time by using less power or allow single applications to run faster by allowing a higher power cap on critical cores than on non-critical cores. On power-limited systems, power (and energy) optimizations will be critical. ACR demonstrates that adaptive dynamic control of power at runtime is possible.
In the future, a better understanding of the advantages and disadvantages of ACR is needed. At some scales, ACR results in a significant performance improvement. As a downside, at other configuration sizes (and user options) ACR results in significant slowdowns. A better understanding of when the chosen clock frequency is too low and how to correct it quickly is required before a system such as ACR can be deployed in a production environment.
VIII. ACKNOWLEDGEMENT
The authors would like to thank Rob Fowler, RENCI, for his valuable feedback and Nathan Gauntt, SNL, for helping with the Shepard cluster. Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. This work was also performed under the auspices of the U.S. Department of Energy XPRESS project under Contract DE-SC0008704 and Office of Science SciDAC SUPER Institute on grant DE-SC0006925.
