Optimal utilization of power is a major concern for HPC and one of the focus points on the path towards exascale, with approaches ranging from chip level to facility wide solutions. In order to evaluate the implications of these approaches and their impact on future system design, we need to understand their interaction with applications as well as their performance impact. In this work we describe the GREMLIN framework, a general framework to emulate system changes on existing platforms by resource restriction or event injection. We use this framework to understand the behavior of applications executed on power limited systems and to evaluate a solution for one of the problems resulting from operating under a power limit: the translation of manufacturing variability into heterogeneous performance, as observed in power limited HPC environments. We show that in a power limited environment manufacturing variability is a key source of performance imbalances and thus non-optimal execution. We propose a Power Balancer for redistribution of unused power and show performance gains of up to 1.5% at small to medium node counts.
MOTIVATION
Power has become the critically limited resource as we move towards exascale, driven by political and economic targets for maximal power consumption. For example, the Department of Energy (DOE) has set a 20MW limit and other countries have imposed similar targets, requiring significant advances in power reduction. While this area was traditionally left to architectural optimization, the current pace of architectural evolution is insufficient for reaching exascale performance while meeting these power targets within the anticipated timeframe. We therefore need to complement hardware advances with software techniques at the runtime and application level.
A prerequisite for this, though, is a deep understanding of how power optimization techniques and power limits influence application performance. Simulation and modeling cover important aspects of this and have been used successfully for this task, but come with their own limitations. While simulations are highly accurate, they are limited in the size of code they can run; models can provide useful trends, but can be hard to construct and often lack the accuracy to provide a deep understanding. We therefore must complement these techniques with additional approaches. In particular, we propose to use system emulation, i.e., manipulating the properties of existing systems to emulate characteristics of future platforms. While the number of characteristics that we can emulate is limited and depends on the target platform, those we can emulate allow for a realistic evaluation of full codes. Emulation therefore provides a helpful additional tool in the performance evaluation arena.
In this work we rely on the GREMLIN framework [15, 17], which realizes system emulation through resource restrictions or event injection to emulate the properties of future machines. We introduce the power components of the GREMLIN framework in detail and use their ability to emulate the power limits of future node architectures to study the performance impact of such limits. We execute our experiments on up to 256 nodes of the Cab system, a production Linux cluster at Lawrence Livermore National Laboratory (LLNL). Our results highlight the problem of manufacturing variability and the resulting performance variations.
Based on the results of our study, we propose an initial approach to counter this effect at runtime using a novel power balancing method that can mitigate the impact seen. We then evaluate its improvements, again using the GREMLIN system emulation approach.
In summary, the contributions of this work are:
• A brief introduction to the GREMLIN framework and a detailed description of the power GREMLINs, their implementation, and usage.
• A study of power/performance impact on applications in power limited environments using the GREMLIN framework at scale.
• The development of a power balancing method as a response to the lessons learned and an evaluation of our power balancer.
Section 2 gives some background and motivates why power is the critical limiter for exascale computing. Section 3 gives an overview of the GREMLIN framework used in this work and introduces the Power GREMLINs in detail. Section 4 describes our power studies using the GREMLINs and discusses their results and Section 5 introduces our power balancing approach along with its evaluation. The work concludes with related work in Section 6 and the summary in Section 7.
THE ROAD TO POWER LIMITED SYSTEMS
When projecting the power consumption of systems reaching exaFLOPS or 10^18 FLOPS (floating point operations per second) based on today's most efficient system, Shoubu at RIKEN AICS with 7,031.58 MFLOPS/watt [7], the power consumption would still exceed 142 MW. With an assumed price of 0.15 Euro/kWh or about 0.16 US$/kWh, this projection reaches energy costs as high as 187 million Euro per year or about 202.5 million US$, which would lead to prohibitive operational costs. Consequently, RIKEN, with 30MW [19], and the U.S. Department of Energy (DOE), with 20MW [1], have set limits for maximum allowable power consumption for their future systems, which can only be achieved with dramatically increased efficiency alongside architectural advances in power reduction.
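The projection above can be reproduced with a few lines of arithmetic. The following sketch uses the efficiency and Euro price quoted in the text; the assumption of continuous operation for 8,760 hours per year is ours and is not stated in the text.

```python
# Back-of-the-envelope projection of exascale power draw and energy cost.
# Assumption (not stated in the text): the system runs 24 h/day, 365 days/year.

EXAFLOPS = 1e18                    # target performance in FLOPS
MFLOPS_PER_WATT = 7031.58          # efficiency of the reference system
EURO_PER_KWH = 0.15                # assumed energy price
HOURS_PER_YEAR = 24 * 365          # assumed continuous operation


def projected_power_mw(flops: float, mflops_per_watt: float) -> float:
    """Power in MW needed to sustain `flops` at the given efficiency."""
    watts = flops / (mflops_per_watt * 1e6)
    return watts / 1e6


def yearly_cost_million_euro(power_mw: float, euro_per_kwh: float) -> float:
    """Energy cost per year in million Euro for a constant power draw."""
    kwh_per_year = power_mw * 1e3 * HOURS_PER_YEAR
    return kwh_per_year * euro_per_kwh / 1e6


power = projected_power_mw(EXAFLOPS, MFLOPS_PER_WATT)
cost = yearly_cost_million_euro(power, EURO_PER_KWH)
print(f"{power:.1f} MW, {cost:.0f} million Euro per year")
```

Under these assumptions the script yields a power draw slightly above 142 MW and a yearly cost of roughly 187 million Euro, matching the figures in the text.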
Today's centers often waste significant amounts of power, as all components are provisioned with sufficient power to run all components at full power all the time, yet this rarely happens. Instead, we should consider provisioning at expected or average power, or even below that, to allow systems to fully exploit their power budgets, while introducing hard power caps to guard against spikes that would exceed the provisioned power.
However, while allowing for significantly higher power efficiency, this leads to new challenges: the use of power capping exposes performance reductions caused by processor manufacturing variability. While these were previously masked by the ability to adjust the power consumption to compensate for less efficient hardware, adding a power cap removes this ability and forces processors to expose their varying efficiency in terms of performance variability. This introduces a new source of performance heterogeneity to systems, making optimizations harder and impeding load balance efforts. We therefore require new techniques to mitigate these effects and to provide users with a predictable and performance portable platform, even in power limited environments.
THE GREMLIN FRAMEWORK
A wide range of techniques exist to explore the performance of future systems. They range from abstract models to detailed architectural simulation. However, both of these extremes have their pros and cons: abstract models enable the fast evaluation of a wide range of parameters to understand basic tradeoffs, but often lack details for accurate absolute estimates; simulation toolkits provide excellent details, while being complex, slow and often unable to cover realistic codes.
The Role of System Emulation
Emulation seeks to cover these gaps and to provide an additional and complementary technique: the basic idea is to alter the characteristics of components, resources or ratios of resources in existing large scale systems, so that they match the expected characteristics of future systems. This enables us to emulate future generation machines, for example an exascale-like execution environment, for the individual resource of interest, and to apply this on current machines at scale, running complete and realistic codes at native speed.
GREMLIN Concept and Implementation
The GREMLIN framework, developed as part of the DOE Co-Design center ExMatEx, is designed to provide a flexible emulation environment targeting a wide range of characteristics. It relies on techniques like resource restriction (in hardware or software) or event injection (including but not limited to fault injection) to change the properties of the targeted components or resources. Further, it allows each aspect of the hardware emulation and its concomitant resource restriction to be controlled separately by design, using the highly modular approach of the GREMLINs: each technique used to impact a component or resource is implemented as its own module, i.e., its own GREMLIN, and the system then allows users to compose independent GREMLINs as needed.
For instrumentation and startup, the GREMLIN framework relies on the MPI profiling interface, PMPI, to inject itself transparently into the execution of a target application and uses the PnMPI tool [18] to compose as many independent GREMLINs as needed for a particular study. Each GREMLIN is itself implemented as an independent PMPI tool and is linkable statically at compile time or dynamically using the LD_PRELOAD mechanism. The concept of the framework is shown in Figure 1, which also shows the ability to control key parameters of each GREMLIN, enabling the control of different degrees of interference or injection.
Categories of GREMLINs
Existing GREMLINs can be grouped into three different categories: power, memory, and resilience. While we focus on power GREMLINs in this work, we first briefly describe the other two categories. Memory GREMLINs emulate properties of future memory systems by artificially limiting resources in the memory system, such as available memory or memory bandwidth. They are typically implemented by some form of resource stealing, i.e., the consumption of a targeted resource by a separate thread contending with the application. By inducing these restrictions, we can emulate increased memory pressure, e.g., caused by increased numbers of cores per node or NUMA domains: as on-node parallelism increases faster than memory size and memory bandwidth, the memory size and bandwidth available to each core will decrease. Details of a set of possible memory GREMLINs addressing these questions are described by Casas et al. [4].
Resilience GREMLINs are used to study the impact of failures in the overall system. With an increase in the total number of components, the likelihood of failure in the overall system will also increase, and this increase can be emulated by introducing artificial faults using fault injection. This can also serve as a testbed for resiliency models to be introduced in next generation communication models, for example as planned for MPI 4.0 [16]. The method of fault injection used by the resilience GREMLINs is described in Schulz et al. [17].
Power GREMLINs
Power GREMLINs can be used to study the impact of power limited systems. In contrast to the two categories above, which rely on software techniques for their implementation, power GREMLINs are typically implemented using hardware features limiting processor frequencies, e.g., using DVFS, or setting hardware power caps, e.g., using mechanisms like Intel's RAPL (Running Average Power Limit). In the following we use the latter for the GREMLINs, since RAPL allows us to emulate power limited systems in a highly realistic way and enables us to study the performance impact of power limits on real applications.
The power GREMLINs use Intel's model-specific registers (MSRs) [10] to read and write power settings. Using the RAPL functionality, individual package power limits or power caps can be set through the MSRs to implement application wide power management. The MSRs can also be used to read power consumption. There are different kinds of power GREMLINs: some simply record power data to provide basic power analysis functionality, while others actively change and limit power availability at different scales. Using the power GREMLINs it is possible to emulate a globally power bound system and also to change power settings for individual cores and packages. This makes it possible to study power and CPU over-provisioned systems. The power GREMLINs are used in the power analysis in the following sections.
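To make the register-level mechanism more concrete, the following sketch shows how a package power cap in watts could be encoded into, and decoded from, the raw value of a RAPL power-limit register. The register addresses and bit fields reflect our reading of Intel's public documentation and are assumptions in the context of this work; the sketch is not taken from the GREMLIN implementation and performs no actual register access, which on Linux would require privileged access to the MSR device.

```python
# Sketch: encode a package power cap for a RAPL power-limit register.
# Field layout follows our reading of the Intel SDM (MSR_RAPL_POWER_UNIT,
# MSR_PKG_POWER_LIMIT); it is NOT taken from the GREMLIN source code.

MSR_RAPL_POWER_UNIT = 0x606   # bits 3:0 hold the power-unit exponent
MSR_PKG_POWER_LIMIT = 0x610   # bits 14:0 limit #1, bit 15 enable


def power_unit_watts(rapl_power_unit_raw: int) -> float:
    """Decode the power unit: 1 / 2**exponent watts."""
    exponent = rapl_power_unit_raw & 0xF
    return 1.0 / (1 << exponent)


def encode_power_limit(watts: float, unit_watts: float, old_raw: int = 0) -> int:
    """Return a new register value with power limit #1 set and enabled.

    Only bits 15:0 are modified; all other bits of `old_raw` are kept.
    """
    field = int(round(watts / unit_watts))
    if not 0 <= field <= 0x7FFF:
        raise ValueError("power limit does not fit into 15 bits")
    return (old_raw & ~0xFFFF) | field | (1 << 15)


def decode_power_limit(raw: int, unit_watts: float) -> float:
    """Decode power limit #1 (bits 14:0) back into watts."""
    return (raw & 0x7FFF) * unit_watts


unit = power_unit_watts(0x3)            # hypothetical raw value: exponent 3
raw = encode_power_limit(65.0, unit)
print(hex(raw), decode_power_limit(raw, unit))
```

With a power-unit exponent of 3, one register step corresponds to 1/8 W, so a 65 W cap occupies 520 units in the limit field.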
POWER ANALYSIS USING GREMLINS
Using the power GREMLINs, we emulate an execution environment that enforces power limitations. Example use cases for this are to understand the application behavior in a CPU over-provisioned system [13] and to develop optimizations for these environments. A CPU over-provisioned system, compared to a regular system, provisions less power to each node than the node could draw at worst case, allowing it in most cases to completely utilize the provisioned power, while guarding against power spikes using power capping. This either allows a lower power budget for a system or, more interestingly, the utilization of more nodes at the same power budget. This can lead to better execution times and overall resource use as shown by Rountree et al. [14]. In the following, we investigate such environments at scale and their performance impact using the power GREMLIN infrastructure described above.
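The trade-off behind over-provisioning can be illustrated with simple arithmetic. The sketch below counts only CPU package power and takes the worst-case provisioning of 256 dual-package nodes at 115 W per package as the fixed budget; both choices are illustrative assumptions of ours, not a configuration evaluated in this work.

```python
# Sketch: how many nodes fit into a fixed CPU power budget under a cap.
# Assumptions: only CPU package power is counted, and the budget is the
# worst-case provisioning of 256 nodes; both are illustrative choices.

PACKAGES_PER_NODE = 2
TDP_WATTS = 115


def nodes_within_budget(budget_watts: int, cap_watts: int) -> int:
    """Number of whole nodes that can run with `cap_watts` per package."""
    return budget_watts // (PACKAGES_PER_NODE * cap_watts)


budget = 256 * PACKAGES_PER_NODE * TDP_WATTS   # worst-case provisioning
for cap in (115, 95, 80, 65, 50):
    print(cap, nodes_within_budget(budget, cap))
```

A lower cap admits more nodes into the same budget, but each node may then run slower, so whether the larger configuration wins depends on the application.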
Experimental Setup
For our work, we use up to 256 nodes of Cab, a 500 TFlop/s production Linux cluster at LLNL. Each node has two Intel Xeon E5-2670 CPUs with eight cores each. This "Sandy Bridge EP" 32nm processor is specified with a clock rate of 2.6 GHz when turbo mode is disabled. The available memory is 32 GB per node with a peak CPU memory bandwidth of 51.2 GB/s. The interconnect used is QLogic's InfiniBand QDR. The TDP (Thermal Design Power) specified by Intel for the Xeon E5-2670 is 115 W while operating at base frequency [9]. In general, regular workloads do not reach this power demand. The application studied in the following shows a power draw of ∼85 W [11]. Since the system is homogeneous, with the same hardware present at each node, the same performance should be expected. However, each processor shows different performance and power consumption, as we will elaborate below.
The default and maximum power cap for the processor used in this study is 115 W, which is equal to the TDP, i.e., the average power in watts drawn by the processor under an Intel defined high-complexity workload, measured while operating at base frequency with all cores active [9]. The TDP also acts as a failsafe to guarantee optimal operation without running into issues regarding heat generation. All measurements have been taken with turbo boost deactivated for better reproducibility.
Application Setup
The GREMLINs were used to analyze the behavior of different proxy applications, namely AMG2013, CoMD and NEKBONE, in scalability studies under different power bounds. The full analysis of the scaling studies is available in [11]. In the following we briefly discuss the main results, using CoMD as an example, that led to the decisions important for the work at hand.
The CoMD proxy application is a classical molecular dynamics code, developed for the ExMatEx Co-design center [6]. The observed power consumption on a single node using 16 MPI processes for CoMD is ∼85 W. Figure 2 shows the results of weak scaling tests, i.e., constant workload per process as we scale. To increase comparability, we allocate a single 256 node partition, which allows us to control node placement. To utilize all allocated nodes, the runs using fewer nodes were repeated on disjoint sets of nodes until all allocated processors were in use. This is reflected in the different widths in the violin plot. In total, 64 runs on 4 nodes, 32 runs on 8 nodes, 16 runs on 16 nodes, 8 runs on 32 nodes, and 4 runs on 64 nodes are presented, each set using the full 256 nodes allocated. All times reported are averaged over 5 runs. These runs were repeated using different per package power limits, which allows us to compare scaling behavior at 115 W, 95 W, 80 W, 65 W and 50 W.
CoMD Results
The 115 W base case as well as the 95 W case show similar behavior and are placed beneath each other. Both experience no slowdown introduced by the power cap since the power consumption of the application is ∼ 85 W and hence below the two power caps. Starting at a power cap of 80 W all nodes experience a slowdown introduced by the power limit, which is strongest at 50 W.
The results show similar scaling behavior at all power caps and thus indicate that power capping affects overall application speed, but has only minor impact on scaling behavior.
As expected, though, the individual nodes experience an increase in variance as well as performance variation with stricter power caps. This impact can be seen in the single node measurements of the same nodes shown in Figure 3. This figure shows all 256 nodes ordered according to their single node performance at 50 W. The line colored according to the set power cap shows the slowest node of the respective power setting. Noticeably, the order established for the 50 W setting is not fully maintained when changing power caps; however, the overall performance trend of the nodes is still visible.
The translation of uniform power consumption to varying performance output is only visible while operating under a power limit. When running at TDP, this slowdown is translated into different power consumption, while providing the same performance [14]. It should be noted that the observed performance is not linear in the chosen power cap and also differs for each individual processor. This performance curve also changes depending on the application characteristics and its usage of the chip.
When running applications on multiple nodes under a power cap, the node allocation will always contain nodes of different quality, and thus of varying power consumption and speed. When using these nodes for multi node runs, the performance is determined by the slowest node, as shown in [11]. This also means that the nodes faster than the colored line at their respective power level in Figure 3 can be slowed down. Only under the assumption of no communication does the slowest node not affect the time to completion of applications executed on multiple nodes.
ACTIVE POWER BALANCING
Based on the prior results, applying a homogeneous power cap is ill suited for running multi node jobs under a power bound. In the following we will introduce a simple approach that slows down more efficient processors by reducing their power budget and instead moving it to less efficient processors to aid them with catching up with the overall computation. We call this approach Power Balancing (PB).
Approach for Power Balancing
The key idea of power balancing is as follows: since the slowest processor at a particular power cap is limiting the overall performance of a multi node application run, the power of all other faster processors can be reduced to be at an equivalent speed. For this we look for the individual processors that are still faster than the slowest processor with power cap applied. Their power is reduced until the next power reduction would make them slower than the slowest processor.
The initial idea is illustrated in Figure 4. For our tests, we select 65 W as the desired power level, since 50 W is the vendor's minimal recommended power setting for the E5-2670 processor and we need enough room to lower the prescribed power caps towards this minimal recommended setting. Figure 4 shows a small example running on 4 nodes. Since Intel's RAPL allows power limit control for individual packages, and the performance differences are already visible at package level, we use package granularity for the power balancer.
Initially we run the target application at the desired power level and determine the slowest package. In the example of Figure 4 this is processor 7, or the first processor on the 4th node. The next step is to find the individual minimal power settings for each package, where the performance is still better than the slowest processor. For this we reduce the power levels step by step, and stop once the slowdown limit (red line) is exceeded. We use a simple linear search because each processor responds differently to power changes. It should also be mentioned that RAPL and the library used allow changes in 1/8 W steps, which would allow finer tuning. This was not used since 1 W steps suffice for exploring the possibilities of this solution. Figure 4 shows a dashed purple line connecting the 65 W processor performance. The dashed blue line shows the selected power settings, with each individual processor performing as well as or better than the slowest processor while requiring less overall power.
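The linear search described above can be sketched as follows. The callback `measure_runtime` is a hypothetical stand-in for a short single node calibration run of one package at a given power cap, and `toy_runtime` is a made-up model used only to make the sketch executable; neither represents measured data nor the actual implementation.

```python
# Sketch of the linear search described above. `measure_runtime` is a
# hypothetical callback standing in for a short calibration run of one
# package at a given power cap; the toy model below is NOT measured data.
from typing import Callable, Dict, Iterable


def balance_power(
    packages: Iterable[int],
    measure_runtime: Callable[[int, int], float],
    start_watts: int = 65,
    min_watts: int = 50,
    step_watts: int = 1,
) -> Dict[int, int]:
    """Return the lowest cap per package that is not slower than the
    slowest package at `start_watts`."""
    packages = list(packages)
    # Step 1: run everything at the desired cap and find the slowest package.
    slowdown_limit = max(measure_runtime(p, start_watts) for p in packages)

    caps = {}
    for p in packages:
        cap = start_watts
        # Step 2: lower the cap until the next step would exceed the limit.
        while (cap - step_watts >= min_watts
               and measure_runtime(p, cap - step_watts) <= slowdown_limit):
            cap -= step_watts
        caps[p] = cap
    return caps


def toy_runtime(package: int, watts: int) -> float:
    """Made-up model: less efficient packages (higher index) are slower."""
    inefficiency = 1.0 + 0.02 * package
    return 100.0 * inefficiency * (65.0 / watts) ** 0.5


caps = balance_power(range(4), toy_runtime)
saved = sum(65 - c for c in caps.values())
print(caps, "saved watts:", saved)
```

In this toy setting the slowest package keeps its 65 W cap, while the faster packages give up a few watts each without falling behind the slowdown limit.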
Using these power balanced settings the results of a multi node run should show the same performance while requiring less power. The slowest node is unchanged while the faster nodes require less power. A caveat here is that the individual power settings are measured individually and no communication behavior of the application is considered. Using these power settings in a multi node run actually results in a slowdown of the application, which is not the desired outcome of this technique.
This negative outcome can be overcome by reintroducing the saved power in a uniform way. The individual processors now show equalized performance while using different power settings. When reintroducing the saved power in a uniform manner, the harmonized performance stays equal and the overall application is sped up since all processors run at a higher power limit. The global power limit is still respected since only the saved power is used.
To visualize the potential power saving and capabilities of this method, Figure 5 shows the same analysis as Figure 4 with 32 nodes.
Again, the red line marks the slowest processor at 65 W. The dashed purple line shows the performance of the individual processors at 65 W, while the dashed blue line shows the power balanced package settings. The power savings of this measurement sum up to a total of 129 W. Already at a scale of 32 nodes, the potentially saved power nearly suffices to power two additional packages, i.e., one additional node, which would consume 130 W at the power limit of 65 W. The easier and more promising option is to reintroduce the power to the already available nodes. The power limit of every package can be increased by 2 W, equal to 128 W for the entire system. This speeds up the overall computation and still requires one watt less than the flat power limit of 65 W, so the global power limit is respected. In this case the global power limit is 64 * 65 W = 4160 W. After power balancing, the nodes run at a total of 4159 W.
All processors are sped up by reintroducing the reclaimed power from the faster processors. Had the power not been reclaimed from these processors, they would idle and wait for the slow nodes to reach barriers, synchronization points or their termination. In the following section we evaluate the effects of our proposed power balancing method while scaling the application.
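The bookkeeping of this redistribution step can be written down compactly. The following sketch uses the numbers quoted above for the 32 node example with 64 packages; the function name `redistribute` is ours and purely illustrative.

```python
# Sketch: hand reclaimed power back uniformly in whole-watt steps.
# Numbers are the ones quoted above for the 32-node (64-package) example.


def redistribute(reclaimed_watts: int, n_packages: int, step_watts: int = 1):
    """Return (uniform increase per package, watts left unused)."""
    steps = reclaimed_watts // (n_packages * step_watts)
    increase = steps * step_watts
    leftover = reclaimed_watts - increase * n_packages
    return increase, leftover


n_packages = 64
flat_cap = 65
global_limit = n_packages * flat_cap            # 4160 W
increase, leftover = redistribute(129, n_packages)
total_after = global_limit - leftover
print(increase, leftover, total_after)
```

Because the increase is applied in whole steps to every package, the unused remainder is always smaller than one step times the number of packages.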
Power Balancing at Scale
To evaluate our power balancing approach, we apply our technique of migrating power with a goal of maintaining a global power bound of 65 W times the number of packages. We run our proxy application CoMD at different scales from 4 to 64 nodes. As described above, initially we run all processors at 65 W and then apply our power balancing method to obtain individual package power settings. In order to reduce the influence of system noise, we run the application 50 times at each node count. The results are shown in Figure 6: the lines connect the median values of the 50 runs at each node count. We can see that the median of the purple 65 W runs is always slower than the first quartiles of the blue power balanced runs.
The scaling behavior in both cases is similar to the results at 65 W of Figure 2. The only difference is that we increased the problem size for the measurement to obtain longer runs.
Using the proposed power balancing method, we gain a mean speedup of 0.5% to 1.3%. This might seem like a minor improvement; however, it is achieved at application runtime for free, just by removing power from places where it cannot be used and reapplying it to achieve a uniform speedup of all processors once performance has become heterogeneous.
Further, for calibration we execute short single node runs running for a few seconds. In our case, a small problem size takes approximately 10 seconds of execution time, which leads to a fixed startup time that has to and can be amortized at higher node counts. Our initial single node calibration detects performance variability of 5.8% at 4 nodes up to 8.0% at 64 nodes, as the single node details in Table 1 show. The performance difference measured at 65 W for single node performance is highly dependent on the node allocation returned from the job scheduler, which then also limits the maximum speedup. If running an application using a full system, the worst and the best node are by definition always present in the node allocation, leading to the maximum potential of such a power balancing method. For a general job submission, however, this is not the case and only subsets of the system are available. Our results were obtained while using a normal job scheduler returning different node allocations for each set of measurements.
Not all of the power reclaimed can be used for power balancing. The remaining amount, i.e., the power reclaimed minus the total power shifted, can be freed and used in different parts of the system or in a power scheduler as proposed by Ellsworth et al. [5]. The PB mean speedup reported in Table 1 is the speedup obtained from applying the method at the respective node counts. The power that cannot be reapplied for speeding up processors is, in watts, always smaller than the number of packages available to the application, since power is reintroduced uniformly in 1 W steps.
RELATED WORK
With the announcement of the 20MW power constraint the interest in power limited execution found its way to the HPC community.
Rountree et al. [14] examine the impact of manufacturing variability and discuss some of the associated implications. Chips have different power consumption, while providing the same performance. The non uniform power consumption translates to non uniform performance when faced with a power limit. The power limit, however, results in uniform power consumption. This variability differs from chip to chip, and the results elaborated in that work build a basis for the work at hand.
When and how power limits come into play is pointed out by a novel power scheduling approach discussed by Ellsworth et al. [5] . The work discusses how schedulers are combined with power management using a system wide view. Using the novel power scheduler as proposed in the work results in higher system utilization, but might require jobs to run power bound. The power scheduler optimizes system wide power management resulting in higher overall system utilization. Combined with the power balancer proposed in the work at hand, both system utilization and application performance can be improved, since both approaches can work together if application level power bounds are communicated. In addition to that, both approaches comply with safely running over-provisioned systems, since both always respect the dictated power limits.
The work of Inadomi et al. [8] on manufacturing variability aware power budgeting takes a similar approach to power limited supercomputing as the work at hand. Their prerequisite for using the framework also includes single module test runs; however, they run at minimal and maximum CPU frequency for these measurements and assume a linear model, which is used to develop an application-dependent variation-aware power model table. Using models and tables, the individual optimal power and frequency settings are determined. The work at hand does add a flat configuration overhead to each application run, refraining from extensive table updates. This seems more suitable for the dynamic nature of applications developed for HPC, also regarding the changing characteristics for different input sets. Inadomi et al. use the complete HA8K system, showing the maximum capabilities of such approaches. The work at hand shows capabilities for realistic job allocations as seen in a normal production system, also taking the different quality of node allocations into account, as provided by the job scheduler.
Marathe et al. [12] propose Conductor, a runtime system combining DVFS and RAPL for power optimization with a focus on critical path analysis. Conductor has a configuration step that sets the optimal number of threads per NUMA domain and sets DVFS to minimize idle time dynamically. RAPL is used as a failsafe to prevent overstepping global power bounds. The work at hand refrains from combining DVFS and RAPL. Using only RAPL simplifies the redistribution of power, since the impact can be directly attributed to RAPL, resulting in a simpler and more controllable system. This yields positive results even when running in the power envelope of 50 to 115 W, as recommended by the manufacturer.
Barker et al. [3] propose a dynamic power allocation algorithm using power steering. The power steering approach uses P-states for controlling power consumption. This does not enforce a maximum consumption and could lead to complications for the power supply of large systems. Their main contribution shows that power steering is feasible to effectively counter load imbalances.
Bailey et al. [2] show how power limited systems can be optimized using Integer Linear Programming. The method proposed gives a theoretical upper limit to performance improvements for power constrained applications.
SUMMARY AND FUTURE WORK
In this work we covered the GREMLIN framework, a framework to emulate the characteristics of future machines on current platforms. In particular, we introduced power GREMLINs, which build on techniques like RAPL to emulate future power limited systems. As our experiments using the GREMLINs show, such systems exhibit a new form of performance heterogeneity caused by manufacturing variability, which must be mitigated using active power balancing. We show that this can improve performance by removing power from nodes that cannot use it and reapplying the surplus to the remaining nodes in a uniform way. The GREMLINs are used to emulate such an environment, which is a possible scenario for future systems. We show that already a simple approach to redistributing power using a power balancer is feasible at small to medium node counts and obtain a mean speedup of up to 1.5% at runtime.
Currently, our approach requires single node measurements prior to starting the application run. Performing the power balancing process in the first seconds of the computational intensive phase is ongoing work. This requires phase identification as well as a reliable way of measuring performance other than end to end measurements. This is not a trivial task since it should not be dependent on the application, but be provided by independent system functionality.
As mentioned in Section 5, RAPL and the library used allow power steps of as little as 1/8 W. This allows for finer tuning of the approach and will improve power reuse. To improve startup time, algorithms other than linear search can reduce the time needed for node configuration. Solutions using better search algorithms require some prior knowledge regarding the CPU, which was not taken into account in the initial approach.
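As one possible illustration of such an alternative, the sketch below replaces the linear search for a single package with bisection over caps expressed in 1/8 W units. It relies on the prior assumption that runtime never decreases when the cap is lowered, which is exactly the kind of knowledge the initial approach does not use. The callback `measure_runtime` and the model `toy_runtime` are hypothetical and do not represent measured behavior.

```python
# Sketch: bisection instead of linear search for one package's minimal cap.
# Assumption (prior knowledge not used in the approach above): runtime never
# decreases when the cap is lowered. `measure_runtime` is hypothetical.
from typing import Callable


def minimal_cap_bisect(
    measure_runtime: Callable[[int], float],
    slowdown_limit: float,
    start_units: int,
    min_units: int,
) -> int:
    """Smallest cap (in RAPL units) whose runtime stays within the limit.

    `start_units` must satisfy the limit; caps are integers so that steps
    of 1/8 W can be expressed as, e.g., 65 W = 520 units.
    """
    if measure_runtime(min_units) <= slowdown_limit:
        return min_units
    lo, hi = min_units, start_units      # lo violates, hi satisfies the limit
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure_runtime(mid) <= slowdown_limit:
            hi = mid
        else:
            lo = mid
    return hi


calls = []


def toy_runtime(units: int) -> float:
    """Made-up monotone model; counts how often it is evaluated."""
    calls.append(units)
    return 100.0 * (520.0 / units) ** 0.5


cap = minimal_cap_bisect(toy_runtime, 106.0, start_units=520, min_units=400)
print(cap / 8, "W after", len(calls), "measurements")
```

The number of calibration runs then grows with the logarithm of the search range rather than linearly, at the price of depending on the monotonicity assumption holding for every package.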
This method of power balancing is only possible since processors are subject to manufacturing variability, which makes efficient search hard. The proposed method does not rely on linear models but on actual measurements to obtain an optimal power setting, with flat overhead. As shown in the related work, power limited execution could very well become the norm. Not using power balancing would then waste resources that can be reallocated efficiently and with a low static overhead. The method presented in this work is applicable to current schedulers and does not make assumptions about the CPU distribution or the node allocation returned from the scheduler. It can therefore easily be integrated into any software stack and help optimize power utilization.
