This paper presents, implements, and evaluates a power-regulation technique for multicore processors, based on an integral controller with adjustable gain. The gain is designed for wide stability margins, and is computed in real time as part of the control law. The tracking performance of the control system is robust with respect to modeling uncertainties and computational errors in the loop. The main challenge in designing such a controller is that the power dissipation of program workloads varies widely and often cannot be measured accurately; hence extant controllers are either ad hoc or based on a priori modeling characterizations of the processor and workloads. Our approach is different. Leveraging the aforementioned robustness, it uses a simple textbook modeling framework and adjusts its parameters in real time by a system-identification module. In this way it trades modeling precision for fast computations in the loop, making it suitable for on-line implementation in commodity data-center processors. Consequently, the proposed controller is agnostic in the sense that it does not require a priori system characterizations. We present an implementation of the controller on Intel's fourth-generation microarchitecture, Haswell, and test it on industry benchmark programs that are used in data-center applications. Results of these experiments are presented in detail, exposing some practical challenges of implementing provably convergent power-regulation solutions in commodity multicore processors.
I. INTRODUCTION
For decades, scaling of transistors to decreasing geometries was the primary source of increased processor performance. This was accompanied by a corresponding scaling of device power, thereby keeping power densities roughly constant on a processor die. However, this behavior, known as Dennard scaling, has ended, leading to unsustainable growth in the power consumption of future processors as the number of transistors on a die increases. Therefore, to sustain performance scaling we must seek new and innovative advances in power management in multicore processors. Such advances are central to the effective operation of all modern processors in platforms ranging from mobile devices to data centers and high-performance computing (HPC) machines that drive national initiatives in key areas such as science, finance, and defense [1], [2].
In multicore processors, the relationships between workloads, power dissipation, resulting thermal fields, and their interaction with the leakage current present new and unresolved power and thermal management challenges. For example, application workloads exhibit time-varying computation and memory access behaviors, resulting in spatially and temporally varying power dissipation and non-uniform thermal fields. The cross-chip variations in temperature couple to circuit leakage and delay, increase full-chip leakage power, reduce peak throughput, degrade chip/package reliability, and increase cooling/packaging costs. Thus, effective control of power dissipation is critical to the reliable and high-performance operation of multicore processors. This paper describes a novel and effective power regulation technique and the results of an evaluation of its implementation on a commodity multicore processor.
Modern multicore processors are organized into several voltage islands, where each island may comprise one or more processing cores. Each voltage island can operate at one of several discrete power states, each defined by an operational voltage-frequency pair. A general technique for controlling power and temperature is based on setting the appropriate power state of each voltage island. This is commonly referred to as Dynamic Voltage/Frequency Scaling, or DVFS. The development of effective controls based on DVFS faces several challenges. First, the relationship between the clock frequency and core power is complicated by other factors such as the coupling between temperature and leakage power. Second, application workloads have time-varying compute and memory system behaviors, requiring a robust, adaptive control strategy to manage power dissipation. Third, distinct cores in a voltage island execute distinct instruction streams with distinct behaviors but may share a common clock and hence frequency. For example, the Intel Haswell processor tested in this paper has four cores sharing a single voltage island and executing eight hardware threads at the same frequency [3].
A number of DVFS heuristics have been proposed to control power dissipation. Prominent are heuristics for clock gating [4], thread migration [5], [6], prediction [7], and voltage scaling [8]. However, heuristics are limited in their scope and robustness. Consequently, several efforts have applied feedback control theory as an effective way to improve performance and robustness [9], [10], [11]. Generally, such controllers relied on off-line analysis of anticipated workloads [10], [1] or empirical approaches [12], [13] to derive control parameters. This includes efforts to limit operation below a maximum power constraint [14], [15]. However, all of these approaches are applied to applications that have been profiled a priori to derive the control parameters.
This paper concerns a control law that is not based on any off-line profiling (hence said to be agnostic), and that estimates the model parameters on line by least-square system identification. The control law consists of a standalone integrator with adaptive gain. It is well known that an integral control can have poor stability margins and oscillations in the system's response, hence it is often supplemented by proportional and derivative elements in order to form the PID control [16]. We use a different approach, consisting of a standalone integrator with a variable gain, designed for fast convergence, wide stability margins, and reduced oscillations as compared to fixed-gain integrators. Moreover, its tracking performance and stability are quite robust to modeling variations and computing errors in the loop, and hence we need not worry about precise model parameters. Furthermore, we can speed up the computations in the loop at the expense of precision if needed. The controller described in the sequel was first designed for regulating the dynamic power in computer cores in Ref. [17]. Subsequently it was analyzed in an abstract setting in [18], where its convergence, stability, and robustness were proved. Its performance was tested via simulation on instruction-throughput regulation [19], [20] as well as on the throughput of abstract Discrete Event Dynamic Systems (DEDS) such as queues, Petri nets, and transportation networks; see [18] and references therein. Lately the controller has been implemented on Intel's fourth-generation microarchitecture, Haswell [21], where it was tested on instruction-throughput regulation.
This paper concerns an implementation of the controller on a Haswell machine and evaluations of its application to power regulation. It makes the following specific contributions: 1) It is the first to present an implementation of an integral control for power regulation in multicore processors. 2) It is agnostic, and adjusts well to workload variations. 3) It is the first (to our knowledge) to use on-line system identification for estimating a suitable system model. 4) It is applied to timely data-center applications. 5) It converges quite fast.
The rest of the paper is organized as follows. Section II describes the problem, system model, and power-regulation technique, and recounts established results. Section III presents test results of applications of the controller to two industry benchmarks, and Section IV concludes the paper. An extended version of the paper, containing results of experiments on benchmark programs for scientific computing (Splash-2), can be found on arXiv [22].
II. POWER REGULATION TECHNIQUE
This section first describes the regulation technique in the abstract setting considered in [18] in order to highlight its general salient features. Then it discusses its particular applications to power control in multicore processors.
Consider the single-input-single-output discrete-time system shown in Figure 1, where $k = 1, 2, \ldots$ represents discrete time, $r \in \mathbb{R}$ is a reference input, $y_k \in \mathbb{R}$ is the output, $u_k$ is the control variable, and $e_k \in \mathbb{R}$ is the error signal. Generally the plant can be nonlinear and time varying, and the objective is to have the output $y_k$ track the reference $r$. The controller we choose has the form

$$u_k = u_{k-1} + A_k e_{k-1}, \tag{1}$$

where $A_k$ is the gain at time $k$, assumed to be positive. If $A_k = A$, where $A > 0$ does not depend on time $k$, then we recognize the controller as an adder, a discrete-time equivalent of an integrator. Since generally $A_k$ depends on $k$, we call the controller a variable-gain integrator. The plant generally characterizes the relationship between the control signal $\{u_k\}$ and the output process $\{y_k\}$. Of particular interest to us is the partial derivative $\partial y_{k-1}/\partial u_{k-1}$, which we assume to be nonzero. For reasons that will become apparent shortly, we would like to set the controller's gain to

$$A_k = \left(\frac{\partial y_{k-1}}{\partial u_{k-1}}\right)^{-1}.$$

In this we assume that the partial derivative $\partial y_{k-1}/\partial u_{k-1}$ is computable in real time from suitable measurements of the system, and hence can play a part in the control law. However, such real-time computations may be subject to delays and errors. Therefore an approximation may have to be used, resulting in the following definition of the controller's gain,

$$A_k = \left(\frac{\partial y_{k-1}}{\partial u_{k-1}} + \eta_{k-1}\right)^{-1}, \tag{2}$$

where $\eta_{k-1}$ denotes an additive error. To complete the characterization of the loop we note that the tracking error is

$$e_k = r - y_k, \tag{3}$$

as is evident from Figure 1. The control law consists of repeated recursive applications of Equations (2)-(1)-(3).

The rationale behind the choice of $A_k$ in Eq. (2) can be seen by considering for a moment the case where the plant is a memoryless nonlinearity described by the relation $y_k = g(u_k)$. Using this relation in Eq. (2) we recognize the controller as the Newton-Raphson method for solving the equation $r - g(u) = 0$, where the computation of the derivative term $\frac{dg}{du}(u_{k-1})$ is corrupted by the additive noise $\eta_{k-1}$.

There are well-known convergence results in the presence of substantial additive noise, including a geometric convergence rate, namely the existence of $\theta < 1$ such that

$$|e_k| \le \theta\,|e_{k-1}|; \tag{4}$$

see [23]. Therefore convergence of the Newton-Raphson method is said to be robust with respect to errors in the derivative $\frac{dg}{du}(u_{k-1})$. These results have been extended to the more general setting where the plant system is dynamic (as opposed to memoryless), stochastic, and time varying. In such a setting the term $\frac{dg}{du}(u_{k-1})$ makes no sense, but the term $\partial y_{k-1}/\partial u_{k-1}$ in Eq. (2) can be well defined. One cannot expect convergence in the form of the limit $\lim_{k\to\infty}(r - y_k) = 0$ to hold true, due to variations in the system's characteristics. However, results of the form $\limsup_{k\to\infty}|r - y_k| < \varepsilon$ were obtained in [18] for a suitable $\varepsilon > 0$ which depends on a measure of the system's variability. In particular, the geometric convergence expressed in Eq. (4) is extended as long as $|r - y_k|$ is not too small, even for fairly large errors $|\eta_{k-1}|$.
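To make the Newton-Raphson interpretation concrete, the following minimal sketch iterates Eqs. (2)-(1)-(3) on a memoryless plant. The plant $g(u) = u^3 + u$, the target $r = 10$, the starting point, and the uniformly distributed relative error in the derivative are illustrative assumptions chosen for this example; none of them is taken from the experiments reported in this paper.

```python
import random


def regulate(g, dg, r, u0, steps, rel_err=0.0, seed=0):
    """Iterate Eqs. (2)-(1)-(3) on a memoryless plant y = g(u).

    rel_err bounds the additive derivative error eta relative to the
    true derivative (an assumption made only for this illustration).
    """
    rng = random.Random(seed)
    u = u0
    e = r - g(u)                      # initial tracking error, Eq. (3)
    errors = [e]
    for _ in range(steps):
        d = dg(u)
        eta = rel_err * d * rng.uniform(-1.0, 1.0)  # derivative error
        gain = 1.0 / (d + eta)        # Eq. (2)
        u = u + gain * e              # Eq. (1)
        e = r - g(u)                  # Eq. (3)
        errors.append(e)
    return u, errors


# Hypothetical plant: g(u) = u^3 + u, so g(2) = 10.
u_final, errors = regulate(
    g=lambda u: u**3 + u,
    dg=lambda u: 3 * u**2 + 1,
    r=10.0, u0=1.0, steps=40, rel_err=0.3,
)
print(u_final, abs(errors[-1]))
```

Even with up to 30% error in the derivative, the iterates in this example approach $u = 2$; the error only slows the contraction, which is the kind of robustness that Eq. (4) expresses.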
In the context of computer processors, this technique was applied to regulate instruction throughput. The plant is modeled as a discrete event dynamic system controlled by the processor's clock rate (frequency), whose output is the average instruction throughput measured over short time frames. The technique was verified by both simulation [19], [20] and implementation on a Haswell machine [21]. In simulation the derivative term $\partial y_{k-1}/\partial u_{k-1}$ is estimated by Infinitesimal Perturbation Analysis [24], [25], and in implementation a cruder but faster approximation is used. This paper concerns power regulation, which poses a different set of challenges, and it uses a system characterization as described in the following paragraphs.
The power dissipated in a core has two major components, static power and dynamic power, respectively denoted by $P_s$ and $P_d$; thus

$$P = P_s + P_d. \tag{5}$$

The dynamic power is due to the switching activities at the gates of the core. It depends on the supply voltage $V$, clock rate (frequency) $\phi$, core's capacitance $C$, and the program workload $\alpha$ representing the switching activities in the core's transistor gates. This dependence is represented by the equation

$$P_d = \alpha C V^2 \phi; \tag{6}$$

see [26]. Generally $C$ is a constant which can be assessed empirically, but $\alpha = \alpha(t)$ varies rapidly with the program load and cannot be measured. The relationship between frequency and voltage often is affine, namely $V = a + m\phi$. In this case, and in light of Eq. (6), $P_d$ can be expressed as a third-degree polynomial in $\phi$. However, it may be impossible to empirically determine the coefficients of this polynomial due to rapid variations of $\alpha(t)$. Furthermore, often it is impossible to measure the dynamic power but only the total power $P$. Consequently we are unable to compute the coefficients of the polynomial function relating frequency to dynamic power. As a matter of fact, earlier attempts to apply the regulation algorithm on Haswell with various fixed polynomial coefficients determined off line failed to yield the desired tracking.
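For reference, the third-degree polynomial mentioned above follows by substituting the affine voltage-frequency relation into Eq. (6). The expansion below is a worked step under exactly those two assumptions and uses no symbols beyond the ones already defined:

```latex
\begin{align*}
P_d &= \alpha C V^2 \phi, \qquad V = a + m\phi \\
    &= \alpha C \,(a + m\phi)^2\, \phi \\
    &= \alpha C m^2\,\phi^3 \;+\; 2\alpha C a m\,\phi^2 \;+\; \alpha C a^2\,\phi .
\end{align*}
```

Every coefficient is proportional to $\alpha(t)$, which is why all of them vary with the workload and cannot be fixed off line.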
The static power depends on the supply voltage and temperature, while the temperature depends on the total power (see [26] ). This circular relationship between power and temperature precludes the existence of a simple model relating frequency to static power. Moreover, the temperature may vary during program execution, further complicating the prospects of a frequency-to-power tractable model that can be used in a real-time control.
As mentioned in the introduction, the Haswell machine which serves as the implementation platform consists of four cores processing eight concurrent threads. All of the cores reside in the same voltage island, hence we cannot control each one of them separately but rather control them jointly by a common frequency, the processor frequency. The controlled quantity is the average power among the four cores, called the processor power. In the setting of Figure 1, we partition the time horizon into equally spaced contiguous intervals called control cycles and denoted by $C_k$, $k = 1, 2, \ldots$; $u_k$ is the processor frequency applied during $C_k$, and $y_k$ is the average among the cores of the mean spatial and temporal power measured during $C_k$ at the cores. A key question is how to obtain an estimate of the derivative term $\partial y_{k-1}/\partial u_{k-1}$ in Equation (2). This requires knowledge of some parameters of the plant model relating $u_k$ to $y_k$, but such a model is only partly available. In fact, we mentioned that there is no analytic model for the static power, and while there is an adequate third-degree polynomial for the dynamic power, its coefficients change with time at a high rate.
We overcome these problems by the following approach. First, we search for a third-order polynomial for estimating the relation between the applied frequency and the processor power. This of course can fit the dynamic power but not the static power. However, in computing applications at the frequency range considered in this paper, the static power comprises 20%-30% of the total power, and therefore we feel confident in leveraging the aforementioned robustness of the performance of the tracking controller with respect to errors in computing $\partial y_{k-1}/\partial u_{k-1}$. Second, to cope with the rapid variations in the coefficients of this polynomial due to changes in $\alpha(t)$, we use a system identification module that runs concurrently, in real time, with the program executions on the processor.
Let us denote by $p_k(\phi) := a_k\phi^3 + b_k\phi^2 + c_k\phi + d_k$ the estimator polynomial during $C_k$; then its derivative, $p_k'(\phi) = 3a_k\phi^2 + 2b_k\phi + c_k$, provides the estimate of the derivative term in Eq. (2). A system identification module, consisting of a standard recursive least-square estimator (e.g., [27]), is used to compute the coefficient vector $(a_k, b_k, c_k, d_k)$.
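The identification step can be sketched as follows. This is a textbook recursive least-squares update written in plain Python, not the code used in the experiments; the forgetting factor `lam`, the initial covariance of $10^6$, and the synthetic cubic used to exercise it are assumptions made for illustration only.

```python
def rls_update(theta, P, phi, y, lam=0.98):
    """One recursive least-squares step for
    p(phi) = a*phi^3 + b*phi^2 + c*phi + d, with theta = [a, b, c, d].

    lam is a forgetting factor (an assumed value) that discounts old
    samples so the estimate can follow workload changes.
    """
    x = [phi**3, phi**2, phi, 1.0]                       # regressor
    Px = [sum(P[i][j] * x[j] for j in range(4)) for i in range(4)]
    denom = lam + sum(x[i] * Px[i] for i in range(4))
    gain = [v / denom for v in Px]
    err = y - sum(theta[i] * x[i] for i in range(4))     # prediction error
    theta = [theta[i] + gain[i] * err for i in range(4)]
    # P is symmetric, so x^T P equals (P x)^T.
    P = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(4)]
         for i in range(4)]
    return theta, P


def derivative(theta, phi):
    """p'(phi) = 3a*phi^2 + 2b*phi + c."""
    a, b, c, _ = theta
    return 3 * a * phi**2 + 2 * b * phi + c


# Synthetic, noise-free frequency-to-power map (hypothetical numbers).
def true_power(f):
    return 0.1 * f**3 + 0.2 * f**2 + 0.5 * f + 1.5


freqs = [0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 1.8, 2.0,
         2.2, 2.4, 2.5, 2.7, 2.9, 3.1, 3.2, 3.4]
theta = [0.0] * 4
P = [[1e6 if i == j else 0.0 for j in range(4)] for i in range(4)]
for _ in range(5):
    for f in freqs:
        theta, P = rls_update(theta, P, f, true_power(f))
slope = derivative(theta, 2.0)
print(slope)
```

In closed loop the applied frequencies are far less varied than in this sweep, so the estimates can be poorly conditioned; this is one reason the robustness of the controller to derivative errors matters.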
It must be pointed out that performance of the power regulator is affected by several practical considerations. First, the rate at which energy and power can be measured is determined by the processor vendor, which in this case is Intel. The model specific registers are updated at approximately 1 ms intervals but no timestamp is provided so it is not possible to know when the measurement interval began. Consequently, mapping measurements to application code is difficult and can cause larger deviations in regulated power than the model would otherwise achieve. Second, the current manner in which the frequency is changed is via file I/O incurring substantial latency relative to the execution time of instructions. If program behavior changes significantly during this interval, tracking becomes more challenging. For example, data center programs that possess poor spatial and temporal reference locality and are memory intensive will exhibit wide variations in average instruction execution time due to memory accesses. High latency in setting the processor frequency will make it difficult to rapidly adapt to changes in power consumption and consequently will affect the rate of convergence of the power regulator and the choice of the duration of the control cycle. Such practical considerations must be overcome by robustness in the design of the regulator.
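The measurement side of the loop can be illustrated by converting cumulative energy readings into a mean power per control cycle. The sketch below is hypothetical: the function name, the list-based input, and the idealized 1 ms spacing are assumptions, and it ignores counter wraparound as well as the unknown start of each measurement interval discussed above.

```python
def cycle_powers(energy_j, sample_ms=1.0, cycle_ms=10.0):
    """Mean power in watts for each full control cycle.

    energy_j: cumulative energy readings in joules, assumed to be
    spaced sample_ms apart (an idealization of the real counters).
    """
    n = round(cycle_ms / sample_ms)      # samples per control cycle
    dt = cycle_ms / 1000.0               # cycle length in seconds
    return [(energy_j[i + n] - energy_j[i]) / dt
            for i in range(0, len(energy_j) - n, n)]


# A constant 5 W load accumulates 0.005 J per millisecond.
readings = [0.005 * i for i in range(31)]
powers = cycle_powers(readings)
print(powers)
```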
III. RESULTS
The proposed power regulator was tested on two programs from GraphBIG [28], a suite of industry benchmarks that perform computations over graphs. We are witnessing an explosive growth in modern data science applications executing in data centers that deal with relational data, which can be represented by graph data structures with large numbers of node and edge properties. These applications have irregular memory access patterns, exhibit low spatial and temporal locality, and are characterized by low operation density, i.e., a low number of operations per byte of data accessed. Consequently, they stress the memory system and challenge optimizations for achieving high processor utilization. To cover major graph computation types and data sources encountered in such data center applications, GraphBIG incorporates representative data structures, workloads, and data sets from 21 real-world use cases from multiple application domains such as retail forecasting, data analytics, finance, and banking.
The controller was implemented as a software component of the application, which accessed energy measurements via the PAPI interface [29] (see footnote 2). The Haswell processor has a finite set of 16 frequencies, namely $\Omega := \{0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 1.8, 2.0, 2.2, 2.4, 2.5, 2.7, 2.9, 3.1, 3.2, 3.4\}$ in GHz. Therefore, we augmented Eq. (1) by projecting its right-hand side (RHS) onto $\Omega$. That is, with $P_\Omega(u) := \operatorname{argmin}\{|v - u| : v \in \Omega\}$ for $u \in \mathbb{R}$ (with $v < u$ if the argmin is not unique), we replace (1) by the following equation,

$$u_k = P_\Omega\left(u_{k-1} + A_k e_{k-1}\right). \tag{7}$$
The control algorithm consists of a recursive application of Eqs. (2)-(7)-(3). The control cycles of the algorithm can be chosen according to performance considerations such as settling times, subject to hardware constraints. At the end of each control cycle, first the model parameters are recomputed by the system identification module and then the operating frequency is assigned to the processor for the next period. In the Haswell processor that we use, energy consumption values are provided at a sampling interval of 1 ms. Hence we pick control cycles that are multiples of this interval. We test the control algorithm at two different rates associated with the control cycles of 10 ms and 30 ms, and we depict graphs of power as a function of time during the first 4,000 ms of program executions.

We tested the controller on two GraphBIG programs: Breadth-first Search (BFS), and Kcore. BFS is one of the most fundamental operations of graph computing, while Kcore encompasses topological analysis of graphs and is representative of approaches to the structural analysis of graphs. Both programs represent large-scale computations executed over clusters of servers in large data centers.

Footnote 2: Modern microprocessors include hardware counters that record the occurrences of various events during program executions, like completion of instructions' executions, cache misses, etc. The Performance Application Programming Interface (PAPI) is a publicly available software infrastructure for accessing these performance counters during program execution.
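The projected recursion of Eqs. (2)-(7)-(3) can be sketched on a toy plant as follows. The frequency set is the one listed in the text, while the static cubic `plant`, its exact slope (standing in for the identified derivative, with $\eta = 0$), and the 5 W target in this run are assumptions for illustration; they are not a model of the Haswell processor.

```python
OMEGA = [0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 1.8, 2.0,
         2.2, 2.4, 2.5, 2.7, 2.9, 3.1, 3.2, 3.4]   # GHz, from the text


def project(u, omega=OMEGA):
    """P_Omega(u): nearest element; ties go to the lower frequency."""
    return min(omega, key=lambda v: (abs(v - u), v))


def plant(phi):
    """Hypothetical static frequency-to-power map in watts."""
    return 0.1 * phi**3 + 0.2 * phi**2 + 0.5 * phi + 1.5


def plant_slope(phi):
    """Exact derivative of the hypothetical plant."""
    return 0.3 * phi**2 + 0.4 * phi + 0.5


def run(r, u0, cycles):
    """Apply Eqs. (2)-(7)-(3) for a number of control cycles."""
    u = project(u0)
    e = r - plant(u)
    trace = [(u, plant(u))]
    for _ in range(cycles):
        gain = 1.0 / plant_slope(u)   # Eq. (2) with eta = 0
        u = project(u + gain * e)     # Eq. (7)
        e = r - plant(u)              # Eq. (3)
        trace.append((u, plant(u)))
    return trace


trace = run(r=5.0, u0=3.4, cycles=10)
print(trace[-1])
```

In this static example the frequency moves from 3.4 to 2.5 and then stays at 2.4 GHz, leaving a residual of about 0.23 W because no element of the frequency set yields exactly 5 W. With a time-varying workload one would instead expect switching between adjacent frequencies.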
For both BFS and Kcore the target power is 5 W. Consider first the BFS program. For 10 ms control cycles, the results are shown in Figure 2 and Figure 3. The power vs. time graph is depicted in Figure 2. The power starts at the initial value of 8.57 W, and following an initial transient lasting 380 ms (or 38 control cycles) it settles into an oscillatory behavior about the target value of 5 W. The average power in the interval [380, 4000] ms is 5.0542 W, which is 0.0542 W more than the target level of 5 W. The frequency graph is depicted in Figure 3, and the average frequency in the interval [380, 4000] ms is 2.59 GHz.
For 30 ms control cycles, the power graph is shown in Figure 4; the frequency graph displays characteristics similar to those of the graph for 10 ms control cycles depicted in Figure 3, and hence is not shown. The power starts at the value of 2.62 W, and after a transient period of 510 ms (or 17 control cycles), it settles to oscillations in a band around 5 W. Its average in the interval [510, 4000] ms is 5.0108 W, which is 0.0108 W more than the target level of 5 W. These results are shown in Table I.

Consider next the results for Kcore. For a 10 ms control cycle, the graphs of power and frequency vs. time are shown in Figure 5 and Figure 6, respectively. The power starts at the initial value of 7.749 W, and following an initial transient lasting 400 ms (or 40 control cycles) it settles about the target value of 5 W. The average power in the interval [400, 4000] ms is 5.0124 W, which is 0.0124 W more than the target level of 5 W. The frequency graph is depicted in Figure 6, and its mean is 2.478 GHz.
For 30 ms control cycles, the power graph is shown in Figure 7. The power starts at the value of 7.23 W, and after a transient period of 480 ms (or 16 control cycles), it settles into oscillations around the target value of 5 W. The frequency graph is not shown, since it displays characteristics similar to those in Figure 6 for the case of 10 ms. The average frequency is 2.608 GHz. These results are shown in Table II.

Discussion of results. Table I and Table II display two important performance indicators of the control algorithm, namely the average error between power and its target value, and the power settling time. Also important is the variability of the power, which is depicted in the power graphs shown in Figures 2, 4, 5, and 7. The factors impacting performance include the control cycle time, quantization due to the fact that the frequency set is finite, and the balance between memory-bound instructions and compute instructions in an application program.
Longer control cycles generally result in less power variability per cycle, but in fewer control cycles in a given time period. Therefore it is hard to predict whether they would give a better performance of the control algorithm. In fact, Table I and Table II indicate slightly shorter settling times for the 10 ms control cycle, but give no clear indication about which one of the two control cycles results in smaller average errors. However, comparing Figure 2 to Figure 4, and Figure 5 to Figure 7, respectively, clearly indicates smaller power variability for the 30 ms control cycles than for the shorter, 10 ms control cycles.
As for quantization effects on frequency, they are apparent from the graphs in Figure 3 and Figure 6 , and especially from their oscillations between adjacent values in the frequency sets. We have no way of assessing their effects on the power performance.
For a given control cycle and frequency set, the fraction of memory-bound instructions in a program plays a significant role in the performance of the control algorithm. The reason is that the processor power is directly regulated by its clock rate, while the external memory system is off chip and operates in a different voltage island and hence off a different clock. Furthermore, memory operations can take one to two orders of magnitude more time than compute instructions. Therefore the cores can stall for periods of time while waiting for memory-access operations to complete. During stalls the processor could reduce the clock rate with negligible impact on its instruction-execution times. Consequently, applications of the control algorithm to memory-intensive programs are expected to display larger power and frequency variations, hence lesser performance, than applications to memory-light programs.
Both BFS and Kcore are memory-intensive programs with different but irregular patterns of memory references. Nonetheless the results obtained from the experiments are quite good especially considering that these are large-scale application programs representative of data center applications. However, it must be pointed out that we have only a few experiments to report on, and more firm conclusions regarding the efficacy of the control technique would have to wait for more extensive testing.
IV. CONCLUSION
This paper describes an output-regulation technique inspired by Newton-Raphson's algorithm for solving algebraic equations. The tracking controller has the form of an integrator with adjustable gain, designed for effective regulation. The gain is adjusted in real time by simple computations in the feedback loop. Furthermore, the regulation algorithm is robust to modeling uncertainties and computing errors in the loop, hence does not require precise models of the plant.
We implemented the controller on Intel's commodity microarchitecture, Haswell, in order to test it on industry-benchmark programs that are used in data-center applications. The control variable consists of the processor's clock rate, and the controlled quantity is the spatial and temporal average of the cores' power. Due to the lack of adequate models for performance evaluation of these systems, we programmed a system-identification algorithm that is executed in real time. Results of the experiments exhibit fast and effective convergence. To the best of our knowledge, the paper presents the first application of an integral control law implemented at the core level to programs that are executed in large data centers.
