Power consumption is a looming threat to today's computing progress. In scientific computing, a significant amount of power is spent in communication- and synchronization-related idle times. However, due to the time scale at which communication happens, transitioning into low-power states during communication idle times may introduce significant overheads in scientific applications.
INTRODUCTION
In today's supercomputers, the total power consumption of computing devices limits their practically achievable performance. This is a direct consequence of the end of Dennard's scaling, which in the last decade has caused a progressive increase of the power density required to operate each new processor generation at its maximum performance. Higher power density causes more heat to be dissipated and increases cooling costs. Altogether, these factors worsen the total cost of ownership (TCO) and the operational costs, de facto limiting the budget for the supercomputer's computational capacity.
Low-power design strategies enable computing resources to trade off their performance for power consumption by means of low-power states. These states are Dynamic Voltage and Frequency Scaling (DVFS) (also known as performance states or P-states [1]), clock gating or throttling states (T-states), and idle states which switch off unused resources (C-states [1]). These states are controlled by hardware policies, operating system policies, and, with increasing interest in recent years, in user space by the final users [2], [3], [4], [5] and by run-times [6], [7].
While O.S. policies try to maximize the utilization of the computing resources, increasing the processor's speed (P-state) proportionally to the processor's utilization with a specific focus on server and interactive workloads, two main families of power control policies are emerging in scientific computing. The first is based on the assumption that a performance penalty can be tolerated to reduce the overall energy consumption [2], [3], [4], [8]. The second is based on the assumption that it is possible to slow down a processor only when it does not execute critical tasks, to save energy without penalizing the application performance [5], [6], [7], [9]. Both approaches are based on the concept of application slack/bottleneck (memory, IO, and communication) that can be opportunistically exploited to reduce performance and save energy. However, there are drawbacks which de facto limit their usage in a production environment. The first approach causes overheads in the application time-to-solution (TtS), limiting the supercomputer throughput and capacity. The second approach depends on the capability of predicting the critical tasks in advance, with severe performance losses in case of mispredictions.
A typical HPC application is composed of several processes running on a cluster of nodes which exchange messages through a high-bandwidth, low-latency network. These processes access the network subsystem through a software interface that abstracts the network level. The Message-Passing Interface (MPI) is a communication interface that allows application processes to exchange explicit messages while abstracting the network level. Usually, as the scale of the application increases, the time spent by the application in the MPI library becomes non-negligible and impacts the overall power consumption. By default, when MPI processes are waiting in a synchronization primitive, MPI libraries use a busy-waiting mechanism. However, during MPI primitives the workload is primarily composed of wait times and IO/memory accesses, for which running the application in a low-power mode may result in lower CPU power consumption with limited impact on the execution time.
MPI libraries implement idle-waiting mechanisms, but these are not used in practice, to avoid the performance penalties caused by the low-power state transition times [10]. As a matter of fact, there is no known low-overhead and reliable mechanism for selectively reducing the energy consumption during MPI communication slack.
In this paper, we present COUNTDOWN, a methodology and a tool to save energy in scientific applications by leveraging the communication slack.
The main contributions of this manuscript are:
• An analysis of the effects and implications of fine-grain power management in today's supercomputing systems, targeting energy saving in the MPI library. Our analysis shows that in today's HPC processors there is significant latency in the HW in serving low-power state transitions. We show that this delay is the source of inefficiencies (overheads and saving losses) when applying fine-grain power management in the MPI library.
• We show with a set of initial benchmarks on a supercomputer node that (i) there is a potential for energy savings with negligible overheads in the MPI communication slack of today's scientific applications, (ii) these savings are jeopardized by the time the HW takes to serve power-state transitions, and (iii) when combined with speculative low-power states these savings can improve the execution time of the application.
• We present the COUNTDOWN library, a run-time able to automatically inspect, at fine granularity, MPI and application phases and to opportunistically inject power management policies during MPI calls. Indeed, COUNTDOWN is able to identify MPI calls with energy-saving potential, for which it is worth entering a low-power state, while leaving fast MPI calls unmodified to prevent overheads from low-power state transitions. We show that the principles proposed by COUNTDOWN can be used both to inject DVFS states and to properly configure the MPI runtime to take advantage of MPI idle-waiting mechanisms. COUNTDOWN works at execution time without requiring any prior knowledge of the application and it is completely plug-and-play: it does not require any modification of the source code or of the compilation toolchain. COUNTDOWN can be dynamically injected into the application at loading time, which means it can intercept dynamic linking to the MPI library, instrumenting all the application calls to MPI functions before the execution passes to the MPI library. The runtime also provides a static version of the library which can be injected into the application at linking time. COUNTDOWN supports C/C++ and Fortran HPC applications and most of the open-source and commercial MPI libraries.
We extend the previous conference version of this paper by evaluating COUNTDOWN with a wider set of applications and low-power state transitions. We show that the COUNTDOWN approach reduces the overhead of fine-grain power management on a real scientific workload to 1.70% for C-states, 0.02% for P-states, and 0.29% for T-states. When executed at scale, COUNTDOWN leads to energy savings of 23.32% on average for the NAS [11] parallel benchmarks and of 22.36% for an optimized QE run on 3456 cores, which increases to 37.74% when QE is executed with default parameters.
The paper is organized as follows. Section 2 presents the state of the art of power and energy management approaches for scientific computing systems. Section 3 introduces the key concepts of power saving in MPI phases of the application. Section 4 explains our COUNTDOWN runtime and the characterization of a real HPC application. Section 5 characterizes the COUNTDOWN library and reports experimental results on the power savings of production runs of applications on a Tier-0 supercomputer.
RELATED WORK
Several works have developed mechanisms and strategies to maximize the energy savings at the expense of performance. These works focus on operating the processors at a reduced frequency for the entire duration of the application [2], [3], [4]. The main drawback of these approaches is the negative impact on the application performance, which is detrimental to the data-centre cost efficiency and total cost of ownership (TCO).
Fraternali et al. [2] show the impact of frequency selection on a green HPC machine, which can lead to a significant global energy reduction in real-life applications but can also induce significant performance penalties. Auweter et al. [3] develop an energy-aware scheduler that relies on a predictive model to identify the walltime and the power consumption at different frequency levels for each application running in the system. The scheduler uses this information to select the frequency to apply to all the nodes executing a job, minimizing the energy-to-solution while allowing an unbounded slowdown in the time-to-solution. The main drawback of this approach is the selection of a fixed frequency for the entire application run, which can cause a non-negligible penalty on CPU-bound applications. Hsu et al. [4] present a different approach in which users can specify a maximum-allowed performance slowdown for their applications, while the proposed power-aware runtime reduces the applied frequency every second while respecting the user-specified constraint. For this purpose the proposed run-time estimates the dependency of the instruction throughput on the frequency and minimizes the frequency while respecting the user-specified maximum-allowed performance slowdown. Similarly to the previous approach, an energy gain is possible only by degrading the performance of the application. As a result, the main drawback of these works is the negative impact on application performance, which is detrimental to the data-centre cost efficiency and total cost of ownership (TCO).
Other works in the state of the art target energy reduction methodologies with negligible/reduced impact on the performance of the running applications.
Sundriyal et al. [12], [13], [14], [15] analyze the impact of fine-grain power management strategies in MVAPICH2 communication primitives, with a focus on send/receive [12], All-to-All [13], and AllGather communications [14]. In [12] the authors propose an algorithm to lower the P-state of the processor during send and receive primitives. The algorithm dynamically learns the best operating points for the different send and receive calls. In [13], [14], [15] the authors propose to also lower the T-state during the send/receive, AllGather, and All-to-All primitives, as this increases the power savings. In these works the authors find that the overhead of this solution is more prominent when the communication patterns are intra-node; moreover, the overhead of the proposed strategy decreases with the message size. The authors show that this overhead differs among the kernels implementing the communication primitives; thus they propose a low-power implementation of the studied MVAPICH2 primitives in which the P-state and the T-state are lowered in specific stages of these primitives. These approaches show that power savings can be achieved by entering a low-power mode during specific communication primitives. To be effective, either algorithms predicting the characteristics of the next communication phase need to be used, or the internal logic of the communication primitive needs to be adapted, which ties the solution to a specific MPI implementation. Differently, in our work we show that important savings can be achieved without changing the communication primitive logic.
Rountree et al. [16] analyze the energy savings which can be achieved in MPI parallel applications by slowing down the frequency of the processors which are not on the critical path. The authors define tasks as the regions of code between two MPI communication calls; later in the paper we will refer to tasks as phases. The critical path is defined as the chain of tasks which bounds the application execution time. Indeed, cores executing tasks on the critical path will be the last ones to reach the MPI synchronization points, forcing the other cores to wait. In [16] the authors propose a methodology for estimating offline the minimum frequency at which the waiting cores can execute without affecting the critical path and the application's time-to-solution (TtS).
A later work by the same authors [6] implements an online algorithm to identify each task and the minimum frequency at which to execute it without worsening the critical path. In a similar way, Kappiah et al. [9] developed Jitter, an online runtime based on the identification of the critical path of the application among the compute nodes involved in the run. Liu et al. [17] use a methodology similar to [9], but applied to a multi-core CPU.
The authors of [18], as in [6], [16], focus on saving power by entering a low-power state for the processes which are not on the critical path. The authors propose an algorithm to save energy by reducing the application imbalance. This is based on measuring the start and end times of each MPI_Barrier and MPI_Allreduce primitive to compute the duration of the application and MPI code. Based on that, the authors propose a feedback loop to lower the P-state and T-state if, in the previous compute and MPI region, the overhead was below a given threshold. The algorithm is based on the assumption that the duration of the current application and MPI phases is the same as that of the previous ones.
The authors of [19] show that the approaches in [6], [16], and those which estimate the duration of MPI and communication phases based on a last-value prediction, can lead to significant misprediction errors. To solve this issue, the authors propose to estimate the duration of the MPI phases with a combination of communication models and empirical observations specialized for the different groups of communication primitives. If this estimated time is long enough, they reduce the P-state. As we will show, with the proposed COUNTDOWN approach this can be achieved without a specific library implementation and without communication models.
Li et al. [7] analyze hybrid MPI/OpenMP applications in terms of performance and energy saving and develop a power-aware runtime that relies on dynamic concurrency throttling (DCT) and DVFS mechanisms. This runtime uses a combination of a power model and a time predictor for OpenMP phases to select the best core frequency when the application manifests workload imbalance. These works [7], [9], [17] have in common the prediction of future workload imbalances obtained by analyzing previous communication patterns. However, this approach can lead to frequent mispredictions in irregular applications [20], which cause performance penalties. COUNTDOWN differs from the above approaches (and complements them) because it is purely reactive and does not rely on assumptions about or estimations of the future workload imbalance.
Conficoni et al. [21] show that HPC cooling costs depend on a large number of factors, the most relevant being the total power consumption, the ambient temperature, and the cooling control policy. The temperature can change quickly and show large differences between nodes and CPUs. As a side effect, energy-saving strategies ease thermal control in supercomputers by reducing power consumption and core temperatures, thus reducing overheating situations.
Eastep et al. propose GEOPM [5], an extensible, plugin-based framework for power management in large parallel systems. GEOPM is an open-source project and exposes a set of APIs that programmers can insert into applications to combine power management strategies with HPC workloads. A plugin of the framework targets power-constrained systems, aiming to speed up the critical path by migrating power to the CPUs executing the critical-path tasks. In a similar manner, another plugin can selectively reduce the frequency of the processors in specific regions of code flagged by the user, differentiating regions into CPU, memory, IO, or disk bound. Today GEOPM is capable of identifying MPI regions and of reducing the frequency based on the MPI primitive type. However, while this solution is an interesting first step, it cannot differentiate between short and long MPI primitives and thus cannot control the overhead caused by frequency changes and by the runtime in short MPI primitives. In this manuscript we present an approach which solves this issue, opening new opportunities for MPI-aware power reduction.
Both GEOPM and the previous works use primitives of the programming-model runtime to profile application tasks and phases. Marathe et al. [22] developed Libpowermon, a profiling framework for HPC used to correlate application metrics with system-level metrics and thermal measurements. Unlike COUNTDOWN, Libpowermon implements only profiling capabilities, without implementing any power control policy.
Benini et al. [23] presented a survey of dynamic power management policies and systems to minimize power consumption under performance constraints. In particular, they show that timeout-based shutdown policies are the most effective in mitigating the overheads of power-state transitions, which detriment the savings achievable with low-power states. In this paper we leverage this property in the HPC power management context. Indeed, previous works have shown that imbalance in MPI workloads can be exploited by power management solutions; however, an overhead-free solution which can take advantage of this slack is still missing.
BACKGROUND

In this section we show the implications and challenges of triggering HW low-power states (P/C/T-states) during synchronization and communication primitives for energy savings, using two practical examples.
As a test platform we used a compute node equipped with two Intel Haswell E5-2630 v3 CPUs, with 8 cores at 2.4 GHz nominal clock speed and 85W Thermal Design Power (TDP), and a real production software stack for Intel systems (see Footnote 1).
All the tests in this section have been executed on a real scientific application, namely QuantumESPRESSO, a suite of packages for performing Density Functional Theory based simulations at the nanoscale, widely employed to estimate ground-state and excited-state properties of materials ab initio. For these single-node tests we used the CP package parallelized with MPI (see Footnote 2). To explore the system behavior under different workload distributions in a single-node evaluation, we focused on the computation of the band structure of Silicon along the main symmetry lines. The main computational kernels of QE include dense parallel linear algebra (diagonalization) and 3D parallel FFT. When executed by a user with no domain expertise and with default parameters, QE runs with a hybrid MPI parallelization strategy in which only one MPI process performs the diagonalization and all the MPI processes perform the FFT kernel. We will later refer to this case as QuantumESPRESSO CP Not Expert User (QE-CP-NEU). Conversely, when an expert user runs the same problem, the parameters are changed to better balance the workload, using all the MPI processes to compute the diagonalization kernel as well. We will later refer to this case as QuantumESPRESSO CP Expert User (QE-CP-EU). This latter case obviously results in a better time-to-solution. In the QE-CP-NEU case, while a single process works on the linear algebra kernel, the other ones remain in busy waiting on an MPI call. In the following text we compare fine-grain power management solutions with the busy-waiting mode (default mode) of the MPI library, in which processes continuously poll the CPU for the whole waiting time at MPI synchronization points.
Footnote 1: We use the Intel MPI Library 5.1 as the runtime for communication and Intel ICC/IFORT 18.0 in our toolchain. We chose the Intel software stack because it is currently used in our target systems and is well supported on most HPC machines based on Intel architectures.
Footnote 2: The most used QuantumESPRESSO packages are: (i) Car-Parrinello (CP) simulation, which prepares an initial configuration of a thermally disordered crystal of chemical elements by randomly displacing the atoms from their ideal crystalline positions; (ii) PWscf (Plane-Wave Self-Consistent Field), which solves the self-consistent Kohn and Sham (KS) equations to obtain the ground-state electronic density for a representative case study. The code uses a pseudo-potential and plane-wave approach and implements multiple hierarchical levels of parallelism with a hybrid MPI+OpenMP approach. As of today, OpenMP is generally used when MPI parallelism saturates, and it can improve the scalability in the highly parallel regime. Nonetheless, in the following we will only refer to data obtained with pure MPI parallelism, since this is the main focus of this article and this choice does not significantly impair the conclusions reported later.
Wait-mode/C-state MPI library
Usually, MPI libraries use a busy-waiting policy in collective synchronizations to avoid performance penalties. This is also the default behavior of the Intel MPI library. This library can also be configured to release the control to the idle task of the operating system (OS) during waiting times, to leverage the C-states of the system. This allows cores to enter sleep states and to be woken up by the MPI library, through an interrupt routine, when the message is ready. In the Intel MPI library, the wait-mode mechanism can be configured through the environment variable I_MPI_WAIT_MODE. This allows the library to leave the control to the idle task, reducing the power consumption of the cores waiting in MPI. Clearly, the transitions into and out of the sleep mode induce overheads in the application execution time.
Figure 1 reports all the experimental results; the wait-mode strategy is identified with CS. From it we can see the overhead induced by the wait mode w.r.t. the default busy-waiting configuration, which worsens the execution time by 25.85%. This is explained by the high number of MPI calls in the QE application, which leads to frequent sleep/wake-up transitions and high overheads.
From the same figure, we can also see that the energy saving is negative (-12.72%): the power savings obtained in the MPI primitives do not compensate for the overhead induced by the sleep/wake-up transitions. Indeed, the power reduction is 12.83%. This is confirmed by the average load of the system, which is 83.02% as an effect of the C-states in the MPI primitives. The average frequency is 2.6GHz, which is the standard turbo frequency of our target system. Surprisingly, the QE-CP-NEU case has a negative overhead (-1.08%, i.e. a speedup). This speedup is given by the turbo logic of our system. Indeed, we can see that the average frequency is slightly higher than 2.6GHz, which means that the process performing the diagonalization can leverage the power budget freed by the other processes, not involved in the diagonalization, which are waiting in a sleep state in the MPI runtime. In figure 2, we report the average frequency of the process working on the diagonalization and the average frequency of all the other MPI processes. In the target system, a single core can reach up to 3.2 GHz if only one core is running; this is what happens when all the other cores are waiting in a sleep state for the termination of the diagonalization workload. The frequency boosting unleashed by the idle mode of the MPI library and the unbalanced workload can save up to 16.69% of energy, with a power saving of 20.86%.
As a conclusion of this first exploration, we recognize that it is possible to leverage the wait mode of the MPI library to save power without increasing the execution time, but the energy savings and the impact on the TtS depend on the granularity of the MPI calls, which can lead to significant penalties if the application is characterized by frequent MPI calls.
DVFS/P-state MPI library
To overcome the overheads of C-state transitions, we focus our next exploration on active low-power states, namely DVFS (P-states). The Intel MPI library does not implement such a feature, so we manually instrumented all the MPI calls of the application with a prologue and an epilogue function to scale down and raise the frequency when the execution enters and exits an MPI call. To avoid interference with the power governor of the operating system, we disabled it in our compute node, granting complete control of the frequency scaling. We use the MSR driver to change the current P-state, writing the IA32_PERF_CTL register with the highest and lowest available P-states of the CPU, which correspond to the turbo and 1.2GHz operating points. In figure 1 we report the results of this exploration, where the P-state case is labelled with PS.
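The sketch below illustrates the kind of prologue/epilogue instrumentation described above: a plain write of the DVFS request register through the Linux MSR driver. It is a minimal sketch under stated assumptions; the helper names, the device path, and the ratio values (0x0C00 for 1.2GHz, 0x2000 for the turbo request on this Haswell part) are illustrative and not the exact code used in our tests.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERF_CTL 0x199   /* per-core DVFS (P-state) request register */

/* Write a P-state request through the Linux MSR driver (/dev/cpu/N/msr).
 * The requested ratio sits in bits [15:8]: e.g. 0x0C00 asks for
 * 12 x 100 MHz = 1.2 GHz, while 0x2000 asks for the turbo ratio of the
 * Haswell E5-2630 v3. Values are illustrative assumptions. */
static int write_perf_ctl(int cpu, uint64_t value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t ret = pwrite(fd, &value, sizeof(value), IA32_PERF_CTL);
    close(fd);
    return (ret == (ssize_t)sizeof(value)) ? 0 : -1;
}

/* Prologue/epilogue wrapped around every MPI call in this experiment. */
void mpi_prologue(int cpu) { write_perf_ctl(cpu, 0x0C00); } /* 1.2 GHz    */
void mpi_epilogue(int cpu) { write_perf_ctl(cpu, 0x2000); } /* turbo req. */
```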
In the overhead plot, in figure 1.a, we can see that the overhead is significantly reduced w.r.t. the C-state mode, from the previous 25.85% down to 5.96%. This means that the cost of scaling the frequency is lower than the cost of the sleep/wake-up transitions. However, the energy and power savings are almost zero. This is due to the fact that, as before, in QE-CP-EU all the MPI processes participate in the diagonalization, thus we have a high number of MPI calls with very short duration. This is also confirmed by the average frequency, which does not show significant variations w.r.t. the baseline (busy waiting), with a measured average frequency of 2.4GHz. The load bar reports 100% of activity, which means that there is no idle time, as expected.
Focusing on the QE-CP-NEU case, in figure 1.b, the overhead is 3.88%, which is lower than in QE-CP-EU. In addition, in this case we have significant energy and power savings, of 14.74% and 14.75% respectively. These savings are due to the workload imbalance and to the long time spent in MPI calls by the processes not involved in the diagonalization. This is confirmed by the lower average frequency (1.95GHz). The load is unaltered, as expected.
In conclusion, using DVFS for fine-grain power management instead of the idle mode allows better control of the overhead for both balanced and unbalanced workloads; however, the overhead is still significant, and in HPC domains performance is the prime goal.
DDCM/T-state MPI library
One important question is whether the overheads of fine-grain power management strategies are induced by the specific power management states used. Thus it is worth also trying duty-cycling low-power states, namely the Dynamic Duty Cycle Modulation (DDCM), also known as throttling states or T-states, available in Intel architectures. DDCM has been supported in Intel processors since the Pentium 4 and enables on-demand software-controlled clock modulation duty cycles. In Intel CPUs, DDCM is used by the HW power controller to reduce the power consumption when the CPU identifies thermal hazards. Similarly to [24], we use DDCM to reduce the power consumption of the cores in MPI calls. We manually instrumented the target application as before: in the prologue function of each MPI call we configure DDCM to 12.5% of clock cycles, which means that for each active clock cycle we gate the next seven, while in the epilogue function we restore DDCM to 100% of clock cycles. We control DDCM by writing its configuration register, called IA32_CLOCK_MODULATION, through the MSR driver.
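As a rough sketch, the DDCM prologue/epilogue reduces to two MSR writes, assuming a write_msr() helper analogous to the one shown for the P-state case; the duty-cycle encoding below (bit 4 enables modulation, the low bits select the duty cycle, 0x12 corresponding to 12.5%) is our reading of the architecture documentation and should be double-checked for the specific CPU.

```c
#include <stdint.h>

#define IA32_CLOCK_MODULATION 0x19A  /* DDCM / T-state control register */

/* Assumed helper: write one MSR on one core (see the P-state sketch). */
int write_msr(int cpu, uint32_t reg, uint64_t value);

/* Prologue: enable on-demand clock modulation at a 12.5% duty cycle
 * (bit 4 = enable, low bits = duty-cycle code). */
void ddcm_throttle(int cpu) { write_msr(cpu, IA32_CLOCK_MODULATION, 0x12); }

/* Epilogue: disable modulation, i.e. run again at 100% of clock cycles. */
void ddcm_restore(int cpu)  { write_msr(cpu, IA32_CLOCK_MODULATION, 0x00); }
```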
In figure 1.a, the DDCM results are reported with TS bars. Surprisingly, the overheads induced by the T-states are greater than those of the wait mode, and equal to 34.78%. As a consequence, the energy saving is the worst, leading to an energy penalty of 14.94%. The load is greatly reduced as an effect of the throttling, with an average of 67.78%, while the frequency is constant at 2.6GHz.
In figure 1.b, we report the T-state results for QE-CP-NEU. Also for this unbalanced workload case, the T-states are the worst. T-state transitions introduce an overhead of 15.82% as a consequence of the power reduction, with a very small energy saving, only 4.75%, and a power saving of 21.97%. The load of the system is reduced to 55.45%, similarly to the idle mode, and the frequency remains unchanged, as expected.
As a matter of fact, we have shown that phase-agnostic fine-grain power management leads to significant application overheads which may nullify the overall savings. Thus, we need to bring knowledge of the workload distribution and of the communication granularity of the application into the fine-grain power management. In the next sections we introduce the COUNTDOWN approach, which tries to solve this issue.
FRAMEWORK
COUNTDOWN is a simple run-time library for profiling and fine-grain power management, written in the C language. COUNTDOWN is based on a profiler and on an event module to inspect and react to MPI primitives. Every time the application calls an MPI primitive, COUNTDOWN profiles the call and uses a timeout strategy [23] to avoid changing the power state of the cores during extremely fast application and MPI context switches, where doing so may result only in an increase of the overhead without a significant energy and power reduction. As we will see later in this section, each time an MPI call could enter a low-power mode, COUNTDOWN defers the decision for a defined amount of time. If the MPI phase terminates within this amount of time, COUNTDOWN does not enter the low-power state, filtering out MPI phases that are too short to save energy but costly in terms of overhead.
Figure 4 depicts the components of COUNTDOWN. COUNTDOWN exposes the same interface as a standard MPI library and can intercept all MPI calls from the application. COUNTDOWN implements two wrappers to intercept MPI calls: i) the first wrapper is used for C/C++ MPI libraries; ii) the second one is used for Fortran MPI libraries. The wrapper function calls the equivalent PMPI function, preceded by a prologue and followed by an epilogue function. Both functions are used by the profiler and by the event module to inject profiling capabilities and power management strategies into the application. COUNTDOWN interacts with the HW power manager through a specific Events module in the library. The Events module can also be triggered by system signals registered as callbacks for timing purposes. COUNTDOWN can be configured through environment variables: it is possible to change the verbosity of the logging and the type of HW performance counters to monitor.
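A minimal sketch of this interception scheme is shown below: the wrapper exports the standard MPI symbol, runs a prologue and an epilogue around the equivalent PMPI call, and forwards the arguments untouched. The hook names are illustrative assumptions; COUNTDOWN's real wrappers cover the whole MPI interface and the Fortran bindings as well.

```c
#include <mpi.h>

/* Illustrative hooks: in COUNTDOWN these profile the call and arm or
 * disarm the low-power timeout (see the Event module). */
void countdown_prologue(void);
void countdown_epilogue(void);

/* Because this symbol is resolved before the MPI library's one (via
 * LD_PRELOAD or link order), the application call lands here and the
 * real implementation is reached through the PMPI profiling interface. */
int MPI_Barrier(MPI_Comm comm)
{
    countdown_prologue();
    int err = PMPI_Barrier(comm);  /* the actual MPI work */
    countdown_epilogue();
    return err;
}
```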
The library targets the instrumentation of applications through dynamic linking, as depicted in figure 3, without user intervention. When dynamic linking is not possible, COUNTDOWN also provides a fall-back static-linking library, which can be used in the toolchain of the application to inject COUNTDOWN at compilation time. The advantage of dynamic linking is the possibility to instrument every MPI-based application without any modification of the source code or of the toolchain. Linking COUNTDOWN to the application is straightforward: it is enough to set the environment variable LD_PRELOAD to the path of the COUNTDOWN library and start the application as usual.
Profiler Module
COUNTDOWN uses three different profilers targeting three different monitoring granularities.
(i) The MPI profiler is responsible for collecting all the information regarding the MPI activity. For each MPI process it collects information on the MPI communicators, the MPI groups, and the core id. In addition, the COUNTDOWN run-time library profiles each MPI call by collecting information on the type of the call, the enter and exit times, and the data exchanged with the other MPI processes.
(ii) The fine-grain micro-architectural profiler collects micro-architectural information at phase granularity: it monitors the average frequency, the time stamp counter (TSC), and the instructions retired for each MPI call and for each application phase. In Intel architectures, accessing the HW performance counters requires privileged permissions, which cannot be granted to final users on production machines. To overcome this limitation, we use the MSR_SAFE [25] driver to access the model-specific registers (MSRs) of the system; the driver can be configured to grant standard users access to a subset of privileged architectural registers while avoiding security issues. COUNTDOWN stores both the MPI and the fine-grain profiler information in a binary file which can be written to local or remote storage. Since these logs grow with the number of MPI primitives and can become significant in long computations, by default the information is summarized in a coarse-grain profiler log.
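The following sketch shows how a per-phase sample could be taken through the msr_safe driver; it is a sketch under assumptions about a typical msr-safe installation (device path and register addresses for the TSC and the fixed counter holding instructions retired), not COUNTDOWN's actual code.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_TSC        0x010  /* time stamp counter                     */
#define MSR_IA32_FIXED_CTR0 0x309  /* fixed counter 0: instructions retired  */

/* Read one MSR through the msr_safe character device, which allows
 * unprivileged access to an allow-listed subset of registers. */
static int read_msr_safe(int cpu, uint32_t reg, uint64_t *val)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr_safe", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t ret = pread(fd, val, sizeof(*val), reg);
    close(fd);
    return (ret == (ssize_t)sizeof(*val)) ? 0 : -1;
}

/* Sample taken at the prologue and at the epilogue of each MPI call; the
 * difference between the two samples gives the per-phase counts. */
void sample_phase(int cpu, uint64_t *instr, uint64_t *tsc)
{
    read_msr_safe(cpu, MSR_IA32_FIXED_CTR0, instr);
    read_msr_safe(cpu, MSR_IA32_TSC, tsc);
}
```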
(iii) The coarse-grain profiler monitors a larger set of HW performance counters available in Intel architectures without impacting the application execution time. These performance counters include those used by the fine-grain profiler plus extended metrics. At the core level, COUNTDOWN monitors C-state residencies and temperature, while for the uncore it monitors package and DRAM energy counters, C-state residencies, and package temperatures. The coarse-grain profiler, due to the high overhead of each access to the monitored set of HW performance counters, uses a time-based sampling rate: it collects data with at least Ts seconds of delay from the previous sample. At every MPI call, the fine-grain profiler checks the time stamp of the previous coarse-grain sample and, if it is older than Ts seconds, triggers a new sample. Currently Ts is configured to 1s.
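The rate-limiting logic of the coarse-grain profiler can be summarized by a guard like the one below; it is a minimal sketch of the sampling policy just described (the counters actually read are those listed above), with hypothetical function names.

```c
#include <time.h>

#define TS_SECONDS 1.0  /* minimum spacing between coarse-grain samples */

static double last_sample = 0.0;

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Called by the fine-grain profiler at every MPI call: the expensive
 * coarse-grain counters are read only if at least Ts seconds passed. */
void maybe_coarse_sample(void)
{
    double t = now_seconds();
    if (t - last_sample >= TS_SECONDS) {
        /* read package/DRAM energy, C-state residencies, temperatures */
        last_sample = t;
    }
}
```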
Event Module
COUNTDOWN interacts with the HW power controller of each core to reduce the power consumption. It uses MSR_SAFE to write the architectural register that changes the current P-state independently per core. When COUNTDOWN is enabled, the Events module decides the performance level at which to execute a given phase.
COUNTDOWN implements the timeout strategy through the standard Linux timer APIs, which expose the system calls setitimer() and getitimer() to manipulate user-space timers and register callback functions. This methodology is depicted in the top part of figure 5. When COUNTDOWN encounters an MPI phase in which it can opportunistically save energy by entering a low-power state, it registers a timer callback in the prologue routine (Event(start)); after that, the execution continues with the standard workflow of the MPI phase. When the timer expires, a system signal is raised, the "normal" execution of the MPI code is interrupted, the signal handler triggers the COUNTDOWN callback, and once the callback returns, the execution of the MPI code is resumed at the point where it was interrupted. If the "normal" execution returns to COUNTDOWN (termination of the MPI phase) before the timer expires, COUNTDOWN disables the timer in the epilogue function and the execution continues as if nothing had happened. In the callback, COUNTDOWN can be configured to enter the lowest T-state (12.5% of load), later referred to as COUNTDOWN THROTTLING, or the lowest P-state (1.2GHz), later referred to as COUNTDOWN DVFS.
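A condensed sketch of this timeout mechanism is given below, assuming hypothetical enter/exit hooks for the chosen low-power state (for instance the MSR writes sketched in Section 3); it illustrates the described logic rather than COUNTDOWN's actual implementation, and per-call bookkeeping is omitted.

```c
#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define TIMEOUT_US 500  /* defer the low-power transition by 500 us */

/* Hypothetical hooks: request the low P-state (or T-state) and restore
 * the highest performance state (see the MSR sketches in Section 3). */
void enter_low_power_state(void);
void exit_low_power_state(void);

/* Fired only if the MPI phase outlives the timeout: entering a low-power
 * state is now worth it. The interrupted MPI code resumes as soon as this
 * handler returns. */
static void timeout_callback(int sig)
{
    (void)sig;
    enter_low_power_state();
}

/* Prologue of an MPI phase: arm a one-shot user-space timer. */
void event_start(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = timeout_callback;
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval t = { .it_interval = { 0, 0 },
                           .it_value    = { 0, TIMEOUT_US } };
    setitimer(ITIMER_REAL, &t, NULL);
}

/* Epilogue of the MPI phase: disarm the timer and make sure the core
 * is back at the highest performance state. */
void event_end(void)
{
    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    setitimer(ITIMER_REAL, &off, NULL);
    exit_low_power_state();
}
```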
The Intel MPI library implements a similar strategy, but it relies on the sleep power states of the cores. Its behavior is depicted in the bottom part of figure 5. If the environment variable I_MPI_WAIT_MODE, presented in Section 3.1, is combined with the environment variable I_MPI_SPIN_COUNT, it is possible to configure the spin count for each MPI call, i.e. the time after which the MPI library leaves the execution to the idle task of the CPU. This parameter does not express a real time, but a counter which is decremented by the spinning procedure of the MPI library until it reaches zero. This allows the Intel MPI library to spin on a synchronization point for a while and, after that, to enter an idle low-power state in order to reduce the power consumption of the core. The execution is restored when a system interrupt wakes up the MPI library, signaling the end of the MPI call. Later we will refer to this mode as MPI SPIN WAIT.
In the next section we explain why the timeout logic introduced by COUNTDOWN is effective in making fine-grain power management possible and convenient in MPI parallel applications.
EXPERIMENTAL RESULTS
Framework Overheads
We evaluate the overhead of running MPI applications instrumented with the profiler module of COUNTDOWN without changing the cores' frequency. We run QE-CP-EU on a single node, which is the worst case for COUNTDOWN in terms of the number and granularity of MPI calls to profile. In this run, we counted more than 1.1 million MPI primitives for each process in the diagonalization task: our run-time library needs to profile, on average, one MPI call every 200us for each process. We measured the overhead by comparing the execution time with and without COUNTDOWN instrumentation. We repeated the test five times and report the median case. Our results show that the COUNTDOWN profiler introduces an overhead in the execution time of less than 1%. This result is also supported by the overhead measurements of the prologue and epilogue routines that COUNTDOWN injects in the application for each MPI call, which cost between 1us and 2us. We repeat the same test changing the cores' frequency to assess the overhead of fine-grain DVFS control. To measure only the overhead caused by the interaction with the DVFS knobs, we force COUNTDOWN to always write the highest P-state in the DVFS control registers. Thus, we avoid application slowdowns caused by frequency variations and measure only the overhead caused by the register accesses. Our experimental results report an overhead of 1.04% for accessing the DVFS control register and performing the profiling.
This result proves that the source of the overheads of phase-agnostic fine-grain power management is not related to issuing the low-power state transition (DVFS in this case). Why, then, did the results in Section 3 show such large overheads? Figure 6 focuses on understanding the source of this behavior by replicating the tests of Section 3 for both QE-CP-EU and QE-CP-NEU, but now entering the low-power state only for MPI phases longer than a given time threshold. For the P-state and the T-state (Figure 6.b and Figure 6.c) we obtained this by profiling in advance the duration of each MPI phase and instrumenting with the low-power command only the phases whose duration was longer than the threshold; we report the time threshold value on the x-axis. For the C-state (Figure 6.a) we leveraged the Intel MPI library's I_MPI_SPIN_COUNT parameter to filter out short phases; on the x-axis we report the I_MPI_SPIN_COUNT parameter.
From the plots we can recognize that there is a well-defined threshold of 500us for the T-state and P-state cases, and of 10K iteration steps for the C-state, after which the overhead introduced by the fine-grain power management policy is reduced and the energy savings become positive for QE-CP-EU. In the next section we analyze why this happens, focusing on the P-state case.
DVFS Overheads and Time Region Analysis
To find the reason for the higher overhead when the frequency reduction is applied in all MPI phases, as highlighted in the previous section, we report two scatter plots (figure 7): the x-axis of the left plot shows the time duration of each MPI phase, and the x-axis of the right plot the time duration of each application phase. For both plots, the y-axis reports the measured average frequency in that phase. We recall that this test was conducted by instrumenting, through COUNTDOWN, each MPI call with a prologue function setting the lowest frequency (1.2GHz) and an epilogue function setting the highest frequency (turbo).
In theory, we would expect all MPI phases to execute at the minimum frequency and all application phases to always run at the maximum frequency. Indeed, MPI phases running at high frequencies may cause energy waste, while application phases running at low frequencies cause performance penalties to the application. Our results show that for phases with a time duration between 0us and 500us the average frequency varies across the interval between the highest and lowest CPU frequency values, while above it the average frequency tends to the desired frequency for that phase. This can be explained by the response time of the HW power controller in serving P-state transitions in our Intel Haswell CPUs [10] (the same is true for Intel Broadwell's HW power controller). The HW power controller periodically reads the DVFS control register to check if the O.S. has specified a new frequency level; this interval has been reported to be 500us in a previous study [10] and matches our empirical threshold. This means that every new setting of the core's frequency issued less than 500us after the previous one may be applied or completely ignored, depending on when the register was last sampled. This can result in all sorts of average frequencies. Clearly, application phases which execute at a lower frequency than the maximum one may slow down the application, while MPI phases which execute at a higher frequency than the minimum one may lose energy savings. It is nevertheless interesting to notice that phases with a duration between 0s and 500us are more likely to show the highest frequency for the MPI phases and the lowest frequency for the application phases, which is the opposite of what we expected; we will give an explanation with the next analysis. It is thus not possible to have effective control of the frequency selection for phases shorter than 500us, while for phases longer than 500us we can see a logarithmic trend toward the requested frequency. We thus suspect that in phases shorter than 500us the average frequency depends more on the previous phase's frequency than on the requested one.
Following this intuition, in figure 8 we correlate the time duration of each application phase with the time duration of the following MPI phase and its average frequency. We report on the y-axis the time duration of the application phase, on the x-axis the time duration of the subsequent MPI phase, and with the color code the average frequency. In the left plot, we report the average frequency of the MPI phase, while in the right plot we report the average frequency of the application phase. For both plots, we can identify four regions/quadrants: (i) Application & MPI > 500us: this region contains long application phases followed by long MPI phases. Points in this region show low frequency in MPI phases and high frequency in application phases. This is the ideal behavior, where applying the frequency-scaling policy reduces energy waste in MPI with no impact on the performance of the application. Phases in this region are perfect candidates for fine-grain DVFS policies.
(ii) Application > 500us & MPI < 500us: this region contains long application phases followed by short MPI phases. Points in this region show high average frequencies for both application and MPI phases. This is explained by the short duration of the MPI phases, which does not give the HW power controller enough time to serve the request to scale down the frequency (prologue) before this setting is overwritten by the request to operate at the highest frequency (epilogue).
For this reason, fine-grain DVFS control in this region does not save energy, as the frequency reduction in the MPI phases is negligible, but it also does not deteriorate the performance, as the application phases are executed at the maximum frequency. Phases in this region should not be considered for fine-grain DVFS policies: it is preferable to leave the frequency unaltered at the highest level.
(iii) Application < 500us & MPI > 500us: this region contains short application phases followed by long MPI phases. This is the opposite of the Application > 500us & MPI < 500us region. Points in this region show low average frequencies for both application and MPI phases. This is explained by the short duration of the application phases, which does not give the HW power controller enough time to serve the request to raise the frequency (requested at the exit of the previous MPI phase) before this setting is overwritten by the request to operate at the lowest frequency (at the entrance of the following MPI phase).
Applying fine-grain DVFS policies in this region can save energy in the long MPI phases, at the cost of executing the short application phases at a reduced frequency. (iv) Application & MPI < 500us: this region contains short application phases followed by short MPI phases. Both application and MPI phases execute randomly at high and low average frequencies, due to the inability of the HW power controller to capture and serve the requested frequency changes. The average frequency at which MPI and application phases execute is strictly related to the type of the previous long phase: if it was an application phase, the following short phases will execute at a high frequency on average; on the contrary, if it was an MPI phase, the following short phases will execute at a low frequency on average. Applying fine-grain DVFS policies in this region leads to unexpected behaviors which can be detrimental to the application performance. Phases shorter than 500us should never be considered by fine-grain power managers.
Single-node Evaluation
We repeat the experiments of Section 3 using COUNTDOWN. We configure COUNTDOWN to scale down the P- and the T-states 500us after the prologue of the MPI primitives.
To reproduce the same timeout strategy leveraging the C-states, we configured MPI SPIN WAIT as described in Section 4.2, with 10K as the MPI spin count parameter.
The HW power controller of Intel CPUs has a different transition latency for sleep states w.r.t. DVFS scaling, as described in [10]. For this reason, we empirically determined the best spin count setting to maximize the energy efficiency and minimize the overhead for the target application. Figure 9 reports the experimental results using COUNTDOWN THROTTLING, COUNTDOWN DVFS, and MPI SPIN WAIT. We can see that in all cases the overhead, the energy saving, and the power saving are greatly improved w.r.t. the baseline (default MPI library, without COUNTDOWN). Figure 9.a shows the experimental results for QE-CP-EU. For the C-state mode, the overhead decreases from 25.85% to 1.70% using MPI SPIN WAIT. Using COUNTDOWN DVFS, the P-state overhead decreases from 5.96% to a negligible value, and using COUNTDOWN THROTTLING the T-state overhead decreases from 34.78% to 0.29%. All evaluations report a non-negative energy saving, in contrast to the MPI library without the timeout strategy, and with better results. The energy saving is 21.80%, 14.94%, and 11.16%, and the power saving is 6.55%, 5.77%, and 2.47%, respectively for the C-state, P-state, and T-state. These experimental results confirm our exploration of the time duration of MPI phases reported in figure 7: most of the MPI calls of this benchmark have been skipped by COUNTDOWN due to their short duration, to avoid overheads. Figure 9.b shows similar improvements for QE-CP-NEU. In this configuration, for the C-state mode the speedup increases from 1.08% to 6.14% using MPI SPIN WAIT. Using COUNTDOWN, the P-state overhead decreases from 3.88% to 1.25%, and the T-state overhead from 15.82% to 2.19%. As a result, the energy saving is 21.80%, 14.94%, and 11.16%, while the power saving corresponds to 24.61%, 19.84%, and 15.23%, respectively for the C-state, P-state, and T-state.
HPC Evaluation
After evaluating our methodology on a single compute node, we extend our exploration to a real HPC system. We use a Tier-0 HPC system based on an IBM NeXtScale cluster, which is currently listed in the Top500 supercomputer list [26]. The compute nodes of the HPC system are equipped with 2 Intel Broadwell E5-2697 v4 CPUs, with 18 cores at 2.3 GHz nominal clock speed and 135W TDP, and are interconnected with an Intel QDR (40Gb/s) Infiniband high-performance network.
To benchmark the parallel performance of our target HPC system we focused on two sets of applications. The first is the NAS benchmark suite with the large dataset E. We executed the NAS benchmarks on 29 compute nodes, with a total core count of 1024 cores. The second is the QuantumESPRESSO PWscf software configured for a complex large-scale simulation. For this purpose, we performed ten iterative steps of the self-consistent loop algorithm that optimizes the electronic density starting from the superposition of atomic charge densities. In order to obtain reasonable scaling up to the largest set of nodes used in the simulations, we chose an ad-hoc dataset. The selected system comes from an actual scientific report [27] and reproduces a layered structure of Iridium, Cobalt, and Graphene, plus a molecular compound (iron-phthalocyanine) deposited on top. The whole simulation box includes 662 atoms, which are described with 3662 KS states. The total number of plane waves is more than a million and, during the execution, the main memory occupation may peak at 2 to 6 terabytes depending on the selected parallelization parameters.
During each iteration, the CPU time is mostly spent in linear algebra (matrix-matrix multiplication and matrix diagonalization) and FFT. Both these operations are distributed over multiple processors and operate on distributed data. As a consequence, the FFT requires many All-to-All MPI communications, while the parallel diagonalization, performed with the PDSYEVD subroutine of ScaLAPACK, mostly requires MPI broadcast messages. We run QE on 96 compute nodes, using 3456 cores and 12 TB of DRAM. We use an input dataset capable of scaling to such a number of cores and we configure QE with a set of parameters optimized to avoid network bottlenecks, which would limit the scalability. We name this configuration QuantumESPRESSO PWscf Expert User (QE-PWscf-EU), to differentiate it from the same problem solved without optimizing the internal parameters, as it would be run by a user without domain-specific knowledge; we call this last configuration QuantumESPRESSO PWscf Not Expert User (QE-PWscf-NEU).
In these tests, we exclude the T-state mode because, in the single-node evaluation, it always reported worse results than the P-state mode. We also excluded the C-state mode: when we started configuring the Intel MPI library for the HPC experiments using the idle mode, we discovered that this feature is not supported in a distributed environment. The Intel MPI library overrides the request for the idle mode with the busy-wait mode when the application runs on multiple nodes. For this reason, we only use the P-state mode (COUNTDOWN DVFS) in the HPC evaluation.
We run an instance of the application with and without COUNTDOWN on the same nodes and compare the results. Figure 10 shows the results for the NAS benchmark suite [11] when executed on 1024 cores, while figure 11 shows the results for the QE-PWscf-* application when executed on 3456 cores. The different plots of Figure 10 report the time-to-solution overhead, the energy and power savings, as well as the distribution of the MPI and application phase times (the accumulated time spent in phases longer and shorter than 500us, as a percentage of the total time) for the different large-scale benchmarks and application runs. All the values are normalized against the default MPI busy-waiting policy. From Figure 10.c we can see that COUNTDOWN is capable of significantly cutting the energy consumption of the NAS benchmarks, by 6% to 50% depending on the benchmark, and that these savings follow the percentage of time each benchmark spends in MPI phases longer than 500us. From the overhead plot (Figure 10.a) we can see that all these energy savings come with a contained time-to-solution overhead, on average below 5%. These results are very promising, as they are virtually portable to any application without the need to touch the application binary. Looking at the QuantumESPRESSO (QE-PWscf-*) case reported in figure 11, we see that COUNTDOWN attains results similar to the NAS ones also in a real application production run optimized for scalability: COUNTDOWN saves 22.36% of energy with an overhead of 2.88% in the QE-PWscf-EU case. Figure 11.a shows the total time spent in application and MPI phases shorter and longer than 500us for the QE-PWscf-EU case. The x-axis reports the id of the MPI rank, while the y-axis reports the accumulated time spent in phases longer and shorter than 500us as a percentage of the total time. We can immediately see that in this real and optimized run the application spends a negligible time in phases shorter than 500us. In addition, the time spent in the MPI library and in the application is not homogeneous among the MPI processes. This is an effect of the workload parameters chosen to optimize the communications, which distribute the workload among subsets of MPI processes to minimize broadcast and all-to-all communications. Using this configuration, our experimental results report 2.88% of overhead with an energy saving of 22.36% and a power saving of 24.53% thanks to COUNTDOWN. Figure 11.b shows that in the QE-PWscf-NEU case, where the parameters are not optimized, all MPI processes have the same workload composition, as they are part of the same workgroup, and, due to the large overhead of the broadcast and all-to-all communications, all the processes spend almost 80% of the time in the MPI library. Even if suboptimal, this happens to HPC users running the application without being domain experts, thus neglecting the parameters which optimize the MPI communication, or when executing the application outside its strong-scaling linear region.
Fig. 11: (a,b) Sum of the time spent in phases longer and shorter than 500us for QE-PWscf-EU and QE-PWscf-NEU.
In this situation, COUNTDOWN increases its benefits, reaching up to 37.74% of energy saving and a power saving of 41.47%. In this condition we also notice that COUNTDOWN induces a small but relevant overhead of 6.38%. We suspect that some MPI primitives suffer more than others from the frequency scaling. We will analyze this problem in depth in future work, aiming to keep the COUNTDOWN overhead negligible; we believe this will make the adoption of COUNTDOWN wider.
The results achieved by COUNTDOWN at production scale on real applications are very promising and, if systematically adopted, would dramatically reduce the TCO of today's supercomputers.
CONCLUSION
In this paper, we presented COUNTDOWN, a methodology and a tool for profiling HPC scientific applications and injecting DVFS capabilities into standard MPI libraries. COUNTDOWN implements a timeout strategy to avoid costly performance overheads, leveraging communication slack to drastically reduce energy consumption. Our work targets real HPC systems and workloads and does not require any kind of modification to the source code or to the compilation toolchain of the application. The COUNTDOWN approach can be adopted for several low-power state technologies (P/T/C-states).
In the experimental section, we compared our system with state-of-the-art power management for MPI libraries, which can dynamically control idle and DVFS levels for MPI-based applications. Our experimental results show that using a timeout strategy to take power control decisions can drastically reduce the overheads, maximizing the energy efficiency in small and large MPI communications. Our runtime library can lead to up to 14.94% energy saving and 19.84% power saving with less than 1.5% performance penalty on a single compute node. However, the benefits of COUNTDOWN increase with the scale of the application. In NAS benchmark runs on 1K cores, COUNTDOWN always saves energy, with savings which depend on the application and range from 6% to 50%, at a negligible overhead below 6%. In a real production run of QE on more than 3K cores, COUNTDOWN saves 22.36% of energy with only 2.88% of performance overhead, and the energy saving increases to 37.74% when the application is executed by a non-expert user. COUNTDOWN unveils the limiting factors of fine-grain power management strategies targeting MPI-based applications and proposes a simple yet effective strategy to cut today's supercomputing centers' energy consumption transparently to the user.
