Nowadays, it is well established that the pace dictated by Moore's law on technological scaling comes at the cost of increasing power consumption and leads to thermally bound computing systems. Supercomputers as well as data centers are at the cutting edge of this crisis, because of aggressive performance, integration density, and sustainable power budgets [1]. The Top500 list ranks worldwide supercomputers based on their peak performance [floating point operations per second (Flops) when running a Linpack benchmark] [2]. Today, the most powerful supercomputer in Top500 is Sunway TaihuLight, which consumes 15.3 MW for delivering 93 petaflops. The second most powerful supercomputer, Tianhe-2 (ex 1st), consumes 17.8 MW for "only" 33.2 petaflops. However, the power consumption increases to 24 MW when considering also the cooling infrastructure [3]. Such an amount of cooling power serves to prevent thermal issues. Increasing the inlet coolant temperature reduces the cooling cost, but jeopardizes the thermal budget and, consequently, the performance.
Editor's note:
Thermal management in high-performance multicore platforms has become exceedingly complex due to variable workloads, thermal heterogeneity, and long thermal transients. This article addresses these complexities by sophisticated analysis of noisy thermal sensor readings, dynamic learning to adapt to the peculiarities of the hardware and the applications, and a dynamic optimization strategy.
-Axel Jantsch, TU Wien
-Nikil Dutt, University of California at Irvine

Modern multicore processors are equipped with thermal sensors and with fine-grained power management support to modulate the power consumption of the processor and peripherals, based on the current operating conditions. Intel's Running Average Power Limit (RAPL) modulates the frequency of the cores to keep the processor power consumption below a user-defined power budget. In novel Intel server processors, the frequency of each core can be scaled independently [4]. However, this flexibility is currently not fully exploited, as the operating systems in high-performance computing (HPC) installations (99.6% of Top500 supercomputers in November 2016 used Linux as the operating system) keep all the cores at the same frequency to avoid performance imbalance across the cores of the same processor. When it comes to hardware mechanisms that protect the die from overtemperature (referred to as thermal management), today's processors use two mechanisms: (1) dynamic voltage and frequency scaling (DVFS, ACPI P-states) and (2) duty cycling (throttling, ACPI T-states). Thanks to turbo logic, cores can run at frequencies above the nominal thermal design power (TDP) frequency without incurring overheating and voltage drop problems. Duty cycling is used as a protection mechanism when the die temperature exceeds a critical threshold (which in Intel Xeon E5-26XX v3 HPC-class processors varies from 69 °C to 101 °C). However, thermal throttling is strongly detrimental to core performance, as the performance loss increases linearly with the power reduction. By contrast, for DVFS low-power states, the performance loss scales sublinearly with the power reduction. Moskovsky et al. [5] show that built-in hardware mechanisms can use the DVFS states effectively for power and energy management, but fail to prevent thermal throttling. This is primarily due to the reactive nature of thermal controllers, which take corrective actions only when the chip is already too hot.
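The linear versus sublinear behavior of the two mechanisms can be illustrated with a first-order power model. The sketch below is only indicative: it assumes that dynamic power scales as f·V² with the voltage proportional to the frequency (so power scales as f³), that performance is proportional to frequency, and that static power is negligible. These are textbook simplifications and not measurements of the processors discussed here; the function names are hypothetical.

```python
def throttling_perf(power_fraction):
    """Duty cycling: the core runs only for a fraction of the time,
    so power and performance shrink by the same factor (linear loss)."""
    return power_fraction


def dvfs_perf(power_fraction):
    """DVFS under the ASSUMPTION P ~ f * V^2 with V ~ f, i.e. P ~ f^3.
    Performance is taken as proportional to f, hence perf ~ P^(1/3)."""
    return power_fraction ** (1.0 / 3.0)


# Compare the relative performance left at a few power budgets.
for p in (1.0, 0.8, 0.5):
    print(f"power {p:.0%}: throttling perf {throttling_perf(p):.2f}, "
          f"DVFS perf {dvfs_perf(p):.2f}")
```

Under these assumptions, halving the power through duty cycling halves the performance, whereas reaching the same power through DVFS costs considerably less performance.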
In addition, there are several nonidealities in the thermal characteristics of large multicore processors targeting the HPC market, namely, thermal heterogeneity, thermal capacitance, and thermal noise. Beneventi et al. [6] measure, for a top-end Intel Xeon computing node featuring 36 physical cores, up to 24 °C of temperature difference between active cores and idle cores on the same silicon die, and more than 7 °C of thermal heterogeneity when all the cores execute an identical workload and run at the same frequency. The precision of the thermal sensors is limited to 1 °C by quantization. Moreover, there is an important thermal capacitance due to the large heat sink and heat spreader, which causes long thermal transients that take minutes to reach the steady state after a change in the cores' power consumption. These nonidealities worsen the efficiency of built-in reactive controllers, as using a single frequency to control the temperature of a multicore chip subject to strong thermal heterogeneity is detrimental. This adds to the difficulty of finding a stable but thermally safe operating point.
Finally, in a supercomputing environment, users' applications run in isolation on a portion of the machine and are accounted based on the execution time. The peculiar usage model of supercomputing hardware and applications, which are neither interactive nor latency critical but require sustained peak performance, is not yet well understood in terms of unique opportunities for domain-specific power and thermal management. Each user runs a parallel workload with synchronization points on the multicore and multinode hardware. For this reason, lowering the frequency of the core running the critical thread may also slow down the threads that wait for its completion before they can progress. Thread migration is prevented to avoid performance loss.
Kong et al. [7] and Sheikh et al. [8] survey thermal management and allocation strategies. From the survey, we can conclude that most state-of-the-art works take advantage of task migration to improve purely reactive approaches, do not consider thread criticality but assume deadlines or precedence constraints, and use thermal models obtained from thermal simulators and/or make simplifying assumptions on the intracore temperature dependency and neglect the thermal transients. These make them not directly applicable to supercomputers. Even if not directly applicable, in the section "Experimental results," we will compare the proposed strategy with thermal management ones, which use static thermal models for driving the task assignment, as well as purely reactive thermal controllers, which use short predictions to tune the power consumption dynamically.
In this article, we present a self-aware optimization framework that combines noisy thermal sensor readings, a robust system identification strategy, integer linear programming (ILP) optimization, and application awareness to ensure a stable and safe working temperature. At the same time, the framework maximizes the application performance in state-of-the-art HPC nodes. We use this framework to discuss the design tradeoffs of self-aware and application-aware proactive thermal management runtimes targeting HPC systems. These include: (1) thermal capacitance; (2) noise and system identification; (3) workload awareness and job placement; and (4) HPC peculiarities.
Self- and application-aware thermal management framework
In this section, we describe the proposed framework. Figure 1 shows its building blocks.
At the bottom, we have the HPC hardware, which is composed of several compute nodes. Each compute node (Node#i in Figure 1) is composed of 1-M computing engines. We consider as computing engines only parallel processors. These are the most common case: more than 90% of Top500 systems use Intel Xeon processors, as reported by the November 2016 Top500 list. Each parallel processor (CPU#i) is internally composed of N cores and uncore logic. The uncore logic accounts for the memory controllers, last level cache, and I/O. Each processor, core, and uncore are equipped with sensors that can be read periodically from the software stack. These sensors can monitor different architectural events as well as physical parameters. We focus on three main classes of sensors: per-core activity sensors, per-core thermal sensors, and per-CPU power gauges.

Special Issue on Self-Awareness in Systems on Chip, Part II
These sensors are used for two purposes: self-learning a predictive thermal model through system identification and estimating the current state of the cores (silicon temperature, power consumption, and activity) to be used as the basis for estimating the future core temperature and power consumption in the thermal and power management problem.
At the top of the stack, we have the software applications running on the system, which consist of parallel programs composed of several threads and processes that run exclusively on the computing resources (i.e., the cores). As an effect of the computational patterns and data partitioning, threads of the same application running on different cores may require longer computation before reaching a synchronization point. This makes the core running the thread that requires the longest computation the critical one in terms of completion time of the entire application. The programmer, or the runtime of the programming model, can estimate the criticality of the threads composing the application and flag each thread with a label that represents its criticality [criticality level ($cr_x$) in Figure 1].
At the center of the stack, the proposed framework implements the Self-Learning and the Thermal/ Power Management policies. The Self-Learning policy consists of three stages.
First, binary workload (1=active/0=idle) sequences are applied to each core to recreate a pseudorandom binary sequence (PRBS) of power stress in each core. PRBS sequences have the characteristic of creating a white noise spectrum with sequences of binary inputs. This guarantees that all the modes in the thermal removal path are equally excited. The power consumption values for each core are obtained by linear regression of each core's activity (cycles in active state and cycles in idle state) against the per-CPU power consumption. Our framework uses the workloader to schedule one power stressmark on each core and uses POSIX signals to schedule and deschedule it so that it follows the PRBS sequence.
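As an illustration of this first stage, the sketch below generates a PRBS with a 7-bit linear feedback shift register and shows how a workloader could map each bit to a run/pause command. It is a minimal sketch under stated assumptions: the PRBS-7 polynomial (x^7 + x^6 + 1), the use of SIGSTOP/SIGCONT as the POSIX signals, and the names `prbs7` and `drive_stressmark` are illustrative choices, not details taken from the framework.

```python
import os
import signal
import time


def prbs7(length=127, seed=0x7F):
    """Return `length` bits of a PRBS-7 sequence (x^7 + x^6 + 1, period 127)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero, or the register stays at zero")
    bits = []
    for _ in range(length):
        # Feedback bit is the XOR of the two most significant register bits.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


def drive_stressmark(pid, bits, step_s):
    """Hypothetical workloader loop: run (1) or pause (0) process `pid`
    for `step_s` seconds per PRBS bit (SIGCONT/SIGSTOP are an assumption)."""
    for bit in bits:
        os.kill(pid, signal.SIGCONT if bit else signal.SIGSTOP)
        time.sleep(step_s)
    os.kill(pid, signal.SIGCONT)  # leave the stressmark runnable at the end


sequence = prbs7()
print(sum(sequence), "active steps out of", len(sequence))
```

One such sequence per core, each started from a different seed or offset, would give every core its own active/idle pattern; `drive_stressmark` is defined but not invoked here because it needs the PID of a real stressmark process.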
Second, during this stress pattern, the data collector monitors the sensors present on each core and CPU synchronously with the PRBS sampling time. The values collected are stored in a database and preprocessed to translate each core's activity into a per-core power profile.
Finally, the power traces and thermal responses are provided as input to a system identification algorithm, which extracts a predictive thermal model. The thermal model is in the form of a multi-input single-output (MISO) autoregressive exogenous (ARX) dynamic model for each core.
The Self-Learning policy is executed during system initialization and lasts for a few minutes. The learned thermal model is then fed as input to the Thermal/Power Management runtime. One of the main advantages of the proposed thermal modeling framework is that it is suitable to be integrated in a production environment. Indeed, it does not need to connect the computing node to lab equipment, but runs entirely in the software on the inspected node. The running overhead is below 1% and requires a small amount of data.
In contrast, the runtime executes periodically. At each iteration, it reads the application $cr_x$ for each thread running in the CPU and the sensors' state, and based on these values it takes a decision. The decisions differ depending on whether they are taken at application startup or during its execution. In the 1st step, the Thermal/Power Management runtime decides the thread-to-core mapping and the frequency at which each core is run. At the n-th step, the runtime decides only the frequency at which each core is run, without exceeding the critical temperature. Indeed, in the scientific HPC environment, once the threads of a parallel application are pinned to a core, their migration is not allowed, to avoid performance losses. The proposed Thermal/Power Management runtime uses an ILP formulation to find the optimal thread-to-core mapping and core frequencies, which maximizes the application performance while preserving a safe working temperature for each core. Thermal interaction between the cores, thermal differences, as well as thermal capacitance effects (dynamics) are accounted for in the optimization problem by means of the thermal model self-learned on the HW, which is the result of the Self-Learning policy. To account for the different decision variables available at the 1st step and the n-th step of the Thermal/Power Management runtime, two different optimization problems are used.
The Self-Learning policy
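The Self-Learning policy is implemented by adopting a distributed approach. Each core of a supercomputer node is represented by a MISO ARX model, where the output is the actual core temperature $\bar{T}(t)$ and the inputs are the power dissipated by all the N cores, $P_0(t), P_1(t), \ldots, P_{N-1}(t)$, and the uncore power $P_{unc}(t)$. Note that $P_{unc}(t)$ represents the power consumed by the uncore of the CPU, as shown in Figure 1. This model is described by the following difference equation:

$$A(z^{-1})\,\bar{T}(t) = B(z^{-1})\,u(t) + w(t) \qquad (1)$$

where $u(t)$ is the input vector given as

$$u(t) = [\,P_0(t) \;\; P_1(t) \;\; \ldots \;\; P_{N-1}(t) \;\; P_{unc}(t)\,]^T \qquad (2)$$

and where $w(t)$ is a white process with variance $\sigma_w^{2*}$. Here $z^{-1}$ is the backward shift operator [$z^{-1}x(t) = x(t-1)$], $A(z^{-1}) = 1 + a_1 z^{-1} + \cdots + a_n z^{-n}$ is a polynomial of order $n$, and $B(z^{-1})$ is a row vector of polynomials in $z^{-1}$, one for each entry of $u(t)$.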
Previous works have shown that the relation between the core temperature and the dissipated power can be described by a purely dynamic model [9] .
ARX models are widely used in system identification, as they constitute the simplest way of representing a dynamic process in the presence of uncertainties [10]. Two important features of these models are the possibility of obtaining asymptotically unbiased estimates of their parameters by means of least squares and the absence of stability problems of the associated optimal one-step-ahead predictors [10]. Nevertheless, it has been shown in [9] and [11] that the classic MISO ARX model (1) is not able to describe properly the thermal dynamics of the system, because the estimated models are characterized by relevant negative poles and/or complex conjugate poles. This is in contrast with the physics of thermal systems, where only real positive poles can exist. As explained in [9], this problem is due to the presence of a significant level of measurement noise. To take into account the presence of this noise, MISO ARX models with noisy output have been considered. The actual core temperature $\bar{T}(t)$ is thus assumed to be affected by an additive noise $v(t)$, so that the available measurement $T(t)$ is given as
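The pole condition mentioned above is easy to check on an identified model. The following sketch does so for a second-order denominator $A(z^{-1}) = 1 + a_1 z^{-1} + a_2 z^{-2}$, whose poles are the roots of $z^2 + a_1 z + a_2$. The restriction to second order, the additional requirement that poles lie below 1 (stability of a discrete-time model), and the function names are assumptions made for the sake of the example.

```python
import cmath


def poles_second_order(a1, a2):
    """Poles of 1/A(z^-1), A = 1 + a1*z^-1 + a2*z^-2: roots of z^2 + a1*z + a2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return ((-a1 + disc) / 2.0, (-a1 - disc) / 2.0)


def is_thermally_plausible(poles, tol=1e-9):
    """True if every pole is real and positive; the upper bound of 1
    (stability) is an added assumption, not stated in the text."""
    return all(abs(p.imag) < tol and 0.0 < p.real < 1.0 for p in poles)


# (1 - 0.9 z^-1)(1 - 0.5 z^-1): two real positive poles, physically sensible.
good = poles_second_order(-1.4, 0.45)
# (1 + 0.3 z^-1)(1 - 0.9 z^-1): one negative pole, as produced by noisy data.
negative = poles_second_order(-0.6, -0.27)
# Complex conjugate pair 0.5 +/- 0.5j.
oscillating = poles_second_order(-1.0, 0.5)

for name, poles in (("good", good), ("negative", negative),
                    ("oscillating", oscillating)):
    print(name, is_thermally_plausible(poles))
```

A check of this kind can serve as a quick sanity test on the output of any identification step before the model is handed to the controller.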
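$$T(t) = \bar{T}(t) + v(t) \qquad (4)$$

where $v(t)$ is a white process with variance $\sigma_v^{2*}$. From (1) and (4), the available temperature $T(t)$ can be expressed as

$$T(t) = \frac{B(z^{-1})}{A(z^{-1})}\,u(t) + \frac{1}{A(z^{-1})}\,w(t) + v(t) \qquad (5)$$

This equation emphasizes the presence of both a process disturbance, $w(t)$ filtered by $1/A(z^{-1})$, and a measurement noise, $v(t)$. The identification problem consists in estimating the coefficients of $A(z^{-1})$ and $B(z^{-1})$, i.e., the parameter vector $\theta^*$, and the noise variances $\sigma_w^{2*}$ and $\sigma_v^{2*}$, starting from the input-output data $u(1), \ldots, u(N), T(1), \ldots, T(N)$.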
The MISO ARX model with additive noise (5) cannot be identified by using the least squares method, as the presence of v(t) leads to asymptotically biased estimates [9]. For this reason, we propose an ad hoc identification algorithm that exploits the properties of the dynamic Frisch scheme [12]. The rationale behind the Frisch scheme consists in searching for the solution of the identification problem within a locus of solutions that are compatible with the covariance matrix of the noisy data [12].
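The bias of least squares under output noise can be reproduced on a toy example. The sketch below simulates a first-order, single-input model, adds white measurement noise to the output, and fits the model by plain least squares with and without that noise. It is a minimal sketch and not the identification algorithm of the article: the first-order structure, the sign convention T(t) = a·T(t−1) + b·u(t−1) + w(t), the ±1 input, and all numeric values are assumptions chosen for illustration.

```python
import random


def simulate(a, b, n, sigma_w, sigma_v, rng):
    """Simulate T_bar(t) = a*T_bar(t-1) + b*u(t-1) + w(t) and return the
    input u and the noisy readings T(t) = T_bar(t) + v(t)."""
    u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    t_bar = [0.0]
    for t in range(1, n):
        t_bar.append(a * t_bar[-1] + b * u[t - 1] + rng.gauss(0.0, sigma_w))
    t_meas = [x + rng.gauss(0.0, sigma_v) for x in t_bar]
    return u, t_meas


def least_squares_arx1(u, y):
    """Least squares fit of y(t) = a*y(t-1) + b*u(t-1) via 2x2 normal equations."""
    syy = syu = suu = sy1y = su1y = 0.0
    for t in range(1, len(y)):
        y1, u1 = y[t - 1], u[t - 1]
        syy += y1 * y1
        syu += y1 * u1
        suu += u1 * u1
        sy1y += y1 * y[t]
        su1y += u1 * y[t]
    det = syy * suu - syu * syu
    a_hat = (sy1y * suu - su1y * syu) / det
    b_hat = (su1y * syy - sy1y * syu) / det
    return a_hat, b_hat


rng = random.Random(0)
u, t_clean = simulate(0.9, 0.5, 20000, 0.1, 0.0, rng)   # no sensor noise
a_clean, b_clean = least_squares_arx1(u, t_clean)
u, t_noisy = simulate(0.9, 0.5, 20000, 0.1, 1.0, rng)   # noisy sensor
a_noisy, b_noisy = least_squares_arx1(u, t_noisy)
print(f"true a = 0.9, clean estimate = {a_clean:.3f}, noisy estimate = {a_noisy:.3f}")
```

With noise-free readings the estimate of the pole stays close to the true value, while with noisy readings it is pulled toward zero no matter how many samples are used, which is the asymptotic bias that motivates a noise-aware identification scheme.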
Let us introduce the vector $\phi(t) = [\,-T(t) \;\ldots\; -T(t-n) \;\; u \ldots\,]^T$ and its covariance $\Sigma = E[\phi(t)\,\phi^T(t)]$. It can be shown that in this context the above-mentioned locus of solutions, compatible with the covariance matrix of the noisy data $\Sigma$, consists of an interval describing a set of admissible additive noise variances. More precisely, consider the following two partitions of the same matrix $\Sigma$:

$$\Sigma = \begin{bmatrix} \sigma_T^2 & \ast \\ \ast & \ast \end{bmatrix} = \begin{bmatrix} \Sigma_{TT} & \ast \\ \ast & \ast \end{bmatrix}$$

where $\sigma_T^2$ is a scalar, $\Sigma_{TT}$ is an $(n+1) \times (n+1)$ matrix, and $\ast$ denotes the remaining blocks. We define also the scalar $\sigma_{v,\max}$, which delimits the interval $I(\Sigma)$ of admissible noise variances. Picking a single solution within this interval requires a suitable selection criterion. The criterion proposed in [11] exploits the properties of the instrumental variable (IV) approach and leads to asymptotically unbiased estimates when the covariance matrix $\Sigma$ is replaced by its sample estimate $\hat{\Sigma}$. A cost function based on the IV approach is minimized along $I(\hat{\Sigma})$ to get consistent estimates of $\theta^*$, $\sigma_v^{2*}$, and $\sigma_w^{2*}$ (see [11] for more details). Finally, the identified models can be transformed into a state-space representation of the noisy ARX model (5) to be used in the ILP problem to proactively select the optimal usage of the resources. The proposed identification method is more effective in the considered supercomputer system framework. Thanks to its robustness, it can extract the faster dynamics in the presence of quantization noise even when working with a slow sampling rate, as stated in [11].
Thermal/Power Management runtime
As introduced in the section "Self- and application-aware thermal management framework," the thermal model self-learned on the target HW is used to solve two optimization problems: the 1st step and the n-th step. The Thermal/Power Management runtime operates at the node level and is composed of two main components: the thermal-aware thread mapper and controller (TMC) and an energy-aware message passing interface (MPI) wrapper. The TMC is triggered: (a) after the job scheduler has deployed the parallel application on the portion of the HPC machine reserved for the job execution and (b) periodically with period $T_s$. At job startup (a), the TMC specifies the thread-to-core mapping, which will be maintained until application completion. Clearly, if a critical thread is mapped to a thermally inefficient core, this is likely to cause severe degradation of the final application performance. To abstract this requirement, we use a per-thread $cr_x$. A higher criticality means a higher impact of the thread performance on the final application.
The 1st step optimization problem (FSP) is solved during the initialization of the application. Its purpose is to allocate the application's threads on the available cores and to select for each of them the maximum frequency that meets the thermal constraint $T_{max}$ in the prediction interval ($PI_{FSP}$). As we will see in the experimental results, the PI (i.e., the time horizon) plays an important role. Indeed, if it is too short, the TMC cannot predict the long-term impact of a thread allocation on the cores' temperatures, as its effect is hidden by the thermal capacitance. By contrast, if the prediction interval is too long, the TMC cannot take advantage of the thermal capacitance to sustain short power bursts. In addition, as introduced earlier, not all the threads have the same criticality. This is reflected in the optimization model, which maximizes the frequency of the most critical thread, penalizing the frequencies of the others. The optimization problem considers K threads to be assigned to N cores, where there are no more threads than cores, i.e., K ≤ N. Each core can be configured with a frequency chosen from a set of M levels. The objective function (OF) maximizes the sum of the frequencies of all active cores, $\gamma_{jf}$, weighted by the criticality $cr_i$ of the thread assigned to each core.
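The role of the prediction interval can be made concrete with a single-core, first-order thermal model. The sketch below computes the largest constant power that keeps the predicted temperature under the limit at the end of the horizon. It is only an illustration: the first-order RC model, the thermal resistance, the time constant, and the temperatures are assumed values, not parameters of the identified model, and the function names are hypothetical.

```python
import math


def temp_at_horizon(t0, t_amb, power, r_th, tau, horizon):
    """Temperature after `horizon` seconds of constant power (first-order RC)."""
    decay = math.exp(-horizon / tau)
    return t_amb + (t0 - t_amb) * decay + r_th * power * (1.0 - decay)


def max_power(t0, t_amb, t_max, r_th, tau, horizon):
    """Largest constant power that keeps the temperature <= t_max at the horizon."""
    decay = math.exp(-horizon / tau)
    return (t_max - t_amb - (t0 - t_amb) * decay) / (r_th * (1.0 - decay))


# Assumed values: core at 40 C, ambient 25 C, limit 70 C, 1 C/W, tau = 60 s.
horizons = (1.0, 10.0, 100.0, math.inf)  # math.inf stands for the steady state
budgets = [max_power(40.0, 25.0, 70.0, 1.0, 60.0, h) for h in horizons]
for h, p in zip(horizons, budgets):
    print(f"horizon {h:>5}: power budget {p:8.1f} W")
```

The shorter the horizon, the larger the power budget the predictor grants, because the thermal capacitance absorbs the burst. A very short horizon therefore says little about where the temperature will settle, whereas the steady-state horizon forbids any burst above the sustainable power.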
To model the problem, we use two sets of binary decision variables
We can formulate the following ILP model with three constraints to model the assignments and the thermal bounds:
The constraint (11) specifies that a thread must be assigned to a single core that works at a given frequency. In addition, it specifies that all the threads are assigned. The constraint (12) is needed to determine the decision variables that represent the idle cores. These variables are used in the constraint (13) in case there are fewer jobs than cores, i.e., K < N. The constraint (13) guarantees that the temperature of each core does not exceed $T_{max}$ during the next prediction interval ($PI_{FSP}$). In this last constraint, GS is a gain matrix with dimension N × N. This matrix is used to calculate the temperature increment of all the cores when a core is subjected to a constant power input for $PI_{FSP}$ seconds. $T_l^0$ represents the dependency of the future temperature (at $PI_{FSP}$) on the current core temperatures. These values are derived from the state-space thermal model learned in the "Self-Learning policy." $T_a$ is the ambient temperature. When there are fewer threads than cores, the decision variable $y_i$ is used in conjunction with the vector of idle powers $\bar{p}$ to add the idle power components.
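The structure of the FSP can be illustrated on a toy instance. The sketch below enumerates every thread-to-core mapping and frequency choice instead of calling an ILP solver, which is practical only for very small problems. All names and numbers are hypothetical, and the thermal constraint is written in the assumed form `t_free[l] + sum_j gs[l][j] * p[j] <= t_max`, where `t_free` lumps together the ambient temperature and the contribution of the current thermal state.

```python
from itertools import permutations, product


def solve_fsp(crit, freqs, power, idle_power, gs, t_free, t_max):
    """Exhaustive search for the mapping and frequencies that maximize the
    criticality-weighted sum of frequencies under the thermal constraint."""
    n_cores, n_threads = len(gs), len(crit)
    best = None
    for cores in permutations(range(n_cores), n_threads):   # thread i -> cores[i]
        for levels in product(range(len(freqs)), repeat=n_threads):
            p = list(idle_power)                             # idle cores keep idle power
            for core, lvl in zip(cores, levels):
                p[core] = power[lvl]
            temps = [t_free[l] + sum(gs[l][j] * p[j] for j in range(n_cores))
                     for l in range(n_cores)]
            if max(temps) > t_max:
                continue                                     # thermally unsafe
            score = sum(c * freqs[lvl] for c, lvl in zip(crit, levels))
            if best is None or score > best[0]:
                best = (score, cores, levels)
    return best


# Toy instance: 3 cores (core 0 is the "hot" one), 2 threads, 2 frequency levels.
gs = [[1.0, 0.2, 0.1],
      [0.2, 0.7, 0.2],
      [0.1, 0.2, 0.6]]                  # degC per W over the prediction interval
score, cores, levels = solve_fsp(
    crit=[3.0, 1.0],                    # thread 0 is the critical one
    freqs=[1.2, 2.4],                   # GHz
    power=[8.0, 20.0],                  # W at each frequency level
    idle_power=[2.0, 2.0, 2.0],
    gs=gs, t_free=[40.0, 40.0, 40.0], t_max=57.0)
print("thread->core:", cores, "levels:", levels, "objective:", round(score, 2))
```

In this instance the thermal limit does not allow both threads to run at the top frequency, so the search gives the high frequency to the critical thread and keeps it away from the hot core.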
After the threads have been assigned to the cores in the FSP, the TMC must periodically solve the problem of assigning frequencies to cores only, but at a finer timescale. The i-th step problem (ISP) has the same objective function as FSP (10), as well as the same thermal model formulation. However, the PI for the ISP ($PI_{ISP}$) can generally differ from that of the FSP.
Differently from the previous case, the model considers only the active cores (K), because the thermal constraints cannot be broken by an idle core. This reduces the overall complexity. As the threads have already been allocated in FSP, in this model threads and cores do not need separate variables; thus, each $cr_x$ refers to a core.
The ISP model requires fewer constraints than FSP due to the lower number of variables.
The constraint (16) bounds each core to a selected frequency. The constraint (17) guarantees the thermal limits imposed on the model, where the set $A = \{a_i\}$ contains the indices of the active cores, the set $I = \{i_i\}$ contains the indices of the idle cores directly defined from the solution of FSP, and $A \cap I$ is empty. In general, the ISP problem is computationally simpler than the FSP problem due to the much lower number of decision variables and constraints.
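A toy version of the ISP shows how much smaller the search becomes once the mapping is fixed. As before, this is a hedged sketch: exhaustive enumeration replaces the ILP solver, the constraint form `t_free[a] + sum_j gs[a][j] * p[j] <= t_max` is an assumed stand-in for the thermal model, it is checked only on the active cores as described above, and every name and number is illustrative.

```python
from itertools import product


def solve_isp(active, crit, freqs, power, idle_power, gs, t_free, t_max):
    """Choose one frequency level per active core; idle cores keep idle power.
    `crit[k]` is the criticality attached to core `active[k]`."""
    n_cores = len(gs)
    best = None
    for levels in product(range(len(freqs)), repeat=len(active)):
        p = list(idle_power)
        for core, lvl in zip(active, levels):
            p[core] = power[lvl]
        # The thermal constraint is checked on the active cores only.
        if any(t_free[a] + sum(gs[a][j] * p[j] for j in range(n_cores)) > t_max
               for a in active):
            continue
        score = sum(c * freqs[lvl] for c, lvl in zip(crit, levels))
        if best is None or score > best[0]:
            best = (score, levels)
    return best


# Assumed short-horizon gains: smaller than long-horizon ones because the
# thermal capacitance hides part of the temperature rise.
gs_short = [[0.30, 0.06, 0.03],
            [0.06, 0.21, 0.06],
            [0.03, 0.06, 0.18]]
score, levels = solve_isp(
    active=[1, 0], crit=[3.0, 1.0],     # cores 1 and 0 host the two threads
    freqs=[1.2, 2.4], power=[8.0, 20.0],
    idle_power=[2.0, 2.0, 2.0],
    gs=gs_short, t_free=[48.0, 48.0, 45.0], t_max=57.0)
print("levels per active core:", levels, "objective:", round(score, 2))
```

The search space here is M^K frequency combinations, with no mapping variables. With the short-horizon gains assumed above, both active cores can be driven at the top frequency for the next interval.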
In the following section, we will evaluate the performance of the proposed TMC in a realistic scenario and under different tradeoffs in between the predicted horizons of the FSP and ISP problems.
Experimental results
In this section, we first describe the performance of the system identification step, which we carried out on a supercomputing node based on dual socket Intel Haswell E5-2630 v3 CPUs, with eight cores with a 2.4-GHz clock speed and 130 W TDP. We will use this thermal model together with the proposed framework to study the implications of the PI/horizon in the thermal-aware thread mapping and control of supercomputer nodes. For executing these tests, we have assigned the highest criticality to the core running the MPI process with rank 0.
The TMC optimization problem proposed in the section "Thermal/Power Management runtime" has been solved using IBM ILOG CPLEX 12.6.1. The solver calls CPLEX each time there is a new TMC problem to be solved. At each CPLEX call, the runtime builds a new instance of the problem with the new thermal parameters and the criticality of the threads, and it waits for the CPLEX results. In our tests, we conducted experiments with different PIs for both the FSP and ISP problems. We considered $PI_{FSP}$ and $PI_{ISP}$ in {1s, 10s, 100s, SS}, where SS denotes the steady state. In the following, we name these tests with the notation $PI_{FSP}$-$PI_{ISP}$. It must be noted that 1s-1s represents state-of-the-art dynamic thermal management (DTM) solutions with no thermal-aware thread-to-core mapping, while SS-SS represents state-of-the-art static DTM solutions.
For all the experiments, we set the temperature limit as 65% of the maximum temperature that can be reached by the hottest core at the maximum frequency. Figure 2a shows the accuracy for one core of the predicted temperature in the target architecture. The residual is always below ±1 °C. Figure 2b depicts the average frequency of the core that hosts the highest criticality thread and the average frequency for all the cores in each configuration. The error bar shows the variance for each configuration among different executions. Larger error bars happen when FSP has a prediction horizon that is too short to see the effect of long-term thermal evolution, and thus it cannot predict which core will hit the thermal constraint. For this case, the allocation problem FSP is trivial and threads are allocated following a simple numerical binding. If the most critical thread is lucky, it will be pinned on a "cold" core. On the other hand, if the most critical thread is unlucky, it will be mapped on a "hot" core. At the steady state, the frequencies of the cores will be limited by the ISP to respect the thermal constraint. In the other cases, the $PI_{FSP}$ is always long enough to sense the thermal constraint. The optimization model will pin the highest criticality thread on a "cold" core, making it run at the maximum frequency.
In the following analysis, we take as a baseline the SS-SS configuration, which models the state-of-the-art solutions based on static allocation of jobs and frequency. The 1s-1s and 10s-10s configurations induce performance penalties on the high criticality task, while they lead to an average performance increase of 4.97% and 4.50%, respectively, across all the cores. For the remaining configurations, we measure no penalties for the high criticality tasks and a gain of 7.46%, 7.06%, and 3.65%, respectively, for the configurations SS-1s, SS-10s, and SS-100s. These results show that short-horizon predictive models pay off in the ISP, as they make it possible to take advantage of the thermal capacitance. When also considering the problem solution overhead (see Figure 2c), the best choice is the SS-10s configuration, which induces an overhead of 0.64% that, in conjunction with the 7.06% performance gain, leads to an overall performance gain of about 6%.

In this article, we presented self-aware thermal management for HPC processors, which features a robust self-calibrating thermal model policy and an ILP-based application-aware thermal controller. Our results show that our proposed framework is suitable to learn the thermal model with high accuracy and to take advantage of the thermal heterogeneity and capacitance present in state-of-the-art HPC processors.
