Abstract-A new technique for performance regulation in event-driven systems, recently proposed by the authors, consists of an adaptive-gain integral control. The gain is adjusted in the control loop by a real-time estimation of the derivative of the plant function with respect to the control input. This estimation is carried out by Infinitesimal Perturbation Analysis (IPA). The main motivation comes from applications to throughput regulation in computer processors, where, to date, the proposed control technique has been tested only by simulation. The purpose of this paper is to report on its implementation on Intel's Haswell microprocessor, and to compare its performance to that obtained from a cycle-level, full-system simulation environment. The intrinsic contribution of the paper to the Workshop on Discrete Event Systems is in describing the process of taking an IPA-based design and simulation to a concrete implementation, thereby providing a bridge between theory and applications.
I. INTRODUCTION
One of the objectives of systems' control is performance regulation, namely the output tracking of a given setpoint reference despite modeling uncertainties, time-varying system characteristics, noise, and other unpredictable factors having the effects of system disturbances. A commonly practiced way to achieve tracking is by a feedback control law that includes an integrator. An integral control alone may have destabilizing effects on the closed-loop system, and hence the controller often includes proportional and derivative elements as well, thereby comprising the well-known PID control [1].
Recently there has been a growing interest in performance regulation of event-driven systems, including Discrete Event Dynamic Systems (DEDS) and Hybrid Systems (HS), and a control technique has been proposed which leverages the special structure of discrete-event dynamics [2]. The controller consists of a standalone integrator with a variable gain, adjusted in real time as part of the control law. The rule for changing the gain is designed for stabilizing the closed-loop system as well as for simplicity of implementation and robustness to computational and measurement errors. Therefore it obviates the need for proportional and derivative elements, and can be implemented in real-time environments by approximating complicated computations by simpler ones. In other words, the balance between precision and computational complexity can be tilted in favor of simple, possibly imprecise computations. A key feature of the control law is that it is based on the derivative of the plant function, namely the relation between the system's control parameter and its output, which is computed or estimated by Infinitesimal Perturbation Analysis (IPA). This will be explained in detail in the following paragraphs.

Research supported in part by the NSF under Grant Number CNS-1239225. School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA. Email: xchen318@gatech.edu, ywardi@ece.gatech.edu, sudha@ece.gatech.edu.
IPA is a well-known and well-tested technique for computing sample-performance derivatives (gradients) in DEDS, HS, and other event-driven systems with respect to controlled variables; see [3] , [4] for extensive presentations and surveys. Its salient feature is in the fact that the sample derivatives often are computable via simple, low-cost algorithms. However, this simplicity may come at the expense of their statistical unbiasedness. In situations where IPA is biased, alternative perturbation-analysis techniques have been proposed, but they may require far-larger computing efforts than the basic IPA (see [3] , [4] ). For the performance regulation technique described in this paper it has been shown that IPA need not be unbiased and, as earlier mentioned, it can rely on fast but possibly imprecise algorithms [2] .
The control system we consider is depicted in Figure 1. Assuming discrete time and one-dimensional variables, $r$ is the setpoint reference, $n = 0, 1, \ldots$ denotes the time counter, the control variable $u_n$ is the input to the plant at time $n$, $y_n$ is the corresponding output, and $e_n := r - y_n$ is the error signal at time $n$. The control law is defined by the following equation,
$$u_n = u_{n-1} + A_n e_{n-1}, \tag{1}$$
and we recognize this as an adder, the discrete-time analogue of an integrator, if the gain $A_n$ is a constant that does not depend on $n$. The plant is an event-driven dynamical system whose output $y_n$ is related to its input $u_n$ in a manner defined below and denoted by the functional term
$$y_n = L_n(u_n), \tag{2}$$
where the derivative $L_n'(u_n)$ is used to define the controller's gain $A_{n+1}$ via the equation
$$A_{n+1} = \big(L_n'(u_n)\big)^{-1}, \tag{3}$$
and the error signal is defined as
$$e_n = r - y_n. \tag{4}$$
A recursive application of Eqs. (1)-(4) defines the closed-loop system. As mentioned earlier, the plant is an event-driven system, and we denote its input, state, and output by $u(t)$, $z(t)$, and $y(t)$, respectively; the notation $t$ designates continuous time or discrete time. Let us partition the time axis $\{t \ge 0\}$ into contiguous left-closed, right-open intervals $C_1, C_2, \ldots$, called control cycles. Suppose that the input to this dynamical system is constant during each interval $C_n$ and can be changed only at the boundaries of these intervals. In the setting of the system of Figure 1, $u_n$ is the value of the input $u(t)$ during $C_n$, and the output $y_n$ is a scalar quantity resulting from measurements of $z(t)$ taken during $C_n$. Thus, $y_n$ is a random function of $u_n$ and $\{z(t)\}_{t \in C_n}$, and we denote its sample realization by Eq. (2), whose dependence on $z(t)$ is implicit. Specific examples in Section III will make this point clear.
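As a rough illustration of how Eqs. (1)-(4) interact, the following Python sketch iterates the loop on a toy plant. The deterministic plant function `600 * sqrt(u)`, its closed-form derivative (standing in for an IPA estimate), and all function names are assumptions made for this example only; the plants considered in this paper are random and time-varying.

```python
import math


def regulate(plant, derivative, r, u0, cycles):
    """Iterate Eqs. (1)-(4); returns the list of (u_n, y_n) pairs."""
    u = u0
    y = plant(u)                    # y_0 = L_0(u_0)
    gain = 1.0 / derivative(u)      # A_1 = 1 / L_0'(u_0)
    error = r - y                   # e_0 = r - y_0
    history = [(u, y)]
    for _ in range(cycles):
        u = u + gain * error        # Eq. (1): integrator with variable gain
        y = plant(u)                # Eq. (2): plant output for this cycle
        gain = 1.0 / derivative(u)  # Eq. (3): gain for the next cycle
        error = r - y               # Eq. (4): error for the next cycle
        history.append((u, y))
    return history


# Hypothetical plant: throughput (MIPS) as a concave function of frequency (GHz).
def toy_plant(u):
    return 600.0 * math.sqrt(u)


def toy_derivative(u):
    return 300.0 / math.sqrt(u)


history = regulate(toy_plant, toy_derivative, r=1200.0, u0=1.0, cycles=10)
final_u, final_y = history[-1]
print(f"u = {final_u:.6f} GHz, y = {final_y:.3f} MIPS")
```

For this toy plant the target of 1,200 MIPS corresponds to u = 4 GHz, and the iterates approach that point within a few cycles.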
The development of the proposed regulation technique has been motivated primarily by applications to computer cores, especially regarding regulation of power and instruction throughput by adjusting the core's clock rate (frequency) [5], [6]. Concerning throughput, there are no prescriptive, let alone analytic, models for the frequency-to-throughput relationship, and a complicated, intractable queueing model has had to be used for simulation. Nonetheless, a simple IPA algorithm has been developed and used to approximate the sample derivative for determining the integrator's gain via Eq. (3). The regulation technique was extensively tested on programs from an industry-based suite of benchmarks, Splash-2 [7], using a detailed simulation platform for performance assessment of computer architectures, Manifold [8]. We reported the results in [2], [6], [9], and deemed them encouraging and meriting a further exploration of the regulation technique.
In the context of IPA research, this regulation technique represents two new perspectives. First, the traditional application of IPA throughout its development has been to optimization, whereas here it is applied in a new way, namely to performance regulation. Second, much of the research of IPA has focused on its unbiasedness, whereas here, in contrast, the concern is with fast computation which may come at the expense of accuracy and unbiasedness. The main novelty of the paper as compared to References [2] , [6] , [9] is in the fact that it concerns not simulation but an actual implementation. In this we were facing new challenges associated with real-time measurements, computations, and control. Consequently we were unable to control each core separately as in [2] , [6] , [9] , and hence applied the regulation method to a processor containing multiple cores. Furthermore, due to issues with real-time computation, we were forced to take drastically cruder approximations to the IPA derivatives than in [2] , [6] , [9] , and in fact it seems that we drove the degree of imprecision to the limit. How this worked on application programs will be seen in the sequel. In any event, the work described here is, to our knowledge, the first implementation (beyond simulation) of IPA in a real-time control environment.
The rest of the paper is structured as follows. Section II summarizes relevant convergence results of the regulation technique in an abstract setting. Section III describes the system under study and presents simulation results on Manifold, while Section IV describes implementation experiments and compares the results to those obtained from simulation. Finally, Section V concludes the paper.
An extended presentation of the content of this paper can be found in the arXiv archival system [10] . 
II. CONVERGENCE RESULTS
This section recounts established results concerning convergence of the regulation technique defined by recursive applications of Eqs. (1) to (4) . Extended discussions can be found in Refs. [2] , [10] .
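Let $J : \mathbb{R} \to \mathbb{R}$ be a differentiable function, and suppose that the derivative $L_n'(u_n)$ is computed in Eq. (3) with an additive error term $\zeta_n$, and accordingly replace $L_n'(u_n)$ by the resulting term $L_n'(u_n) + \zeta_n$ in the RHS of (3). Suppose also that this term acts as an approximation of $J'(u_n)$ with an error term $\varphi_n$, and that the output $y_n$ approximates $J(u_n)$ with an error term $\psi_n$. Then Eqs. (2) and (3) become Eqs. (5) and (6), respectively:
$$y_n = J(u_n) + \psi_n, \tag{5}$$
$$A_{n+1} = \big(J'(u_n) + \varphi_n\big)^{-1}. \tag{6}$$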
To simplify the discussion, suppose that there exists a closed, finite-length interval, I, such that (i) the functions J and L n are differentiable throughout I; (ii) the function J is either convex or concave, and monotone increasing or decreasing throughout I; and (iii) there exists u ∈ I such that J(u) = 0. Various ways to relax this assumption are discussed in [2] , [10] .
Proposition 2.3 and Lemma 2.2 in [2]:
For every $\eta > 0$ and $M > 1$ satisfying the inequalities $\eta \le \min\{|J'(u)| : u \in I\}$ and $\max\{|J'(u)| : u \in I\} \le M \min\{|J'(u)| : u \in I\}$, and for every $\varepsilon > 0$, there exist $\alpha \in (0, 1)$, $\delta > 0$, and $\theta \in (0, 1)$ such that: 1). If for some $j \in \{1, 2, \ldots\}$, and for n = j, j
2). If for all $n = 1, 2, \ldots$, $u_n \in I$, $|\varphi_n| \le \alpha |J'(u_n)|$, and
In the case of exact function evaluations but errors in the derivatives' computations, namely ψ n = 0 for every n = 1, 2,..., the control algorithm amounts to Newton-Raphson's method with imprecise derivative computations for solving the equation r − J(u) = 0. General convergence results exist and can be found, for example, in Ref. [11] , and for the special case considered in this paper they imply that lim n→∞ e n = 0.
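The Newton-Raphson connection can be made explicit with a short calculation. The derivation below is only a sketch: it assumes that the gain with derivative error has the form $A_n = (J'(u_{n-1}) + \varphi_{n-1})^{-1}$, and the affine plant $J(u) = au + b$ in the second part is an illustrative special case chosen here, not a case treated in the cited results.

```latex
% Exact function evaluations (\psi_n = 0), so y_{n-1} = J(u_{n-1}) and
% e_{n-1} = r - J(u_{n-1}). Substituting into u_n = u_{n-1} + A_n e_{n-1}:
\[
  u_n = u_{n-1} + \frac{r - J(u_{n-1})}{J'(u_{n-1}) + \varphi_{n-1}} .
\]
% With \varphi_{n-1} = 0 this is the Newton-Raphson step for g(u) := r - J(u) = 0,
% since g'(u) = -J'(u) and u_n = u_{n-1} - g(u_{n-1}) / g'(u_{n-1}).
%
% Illustrative affine case J(u) = a u + b with a \neq 0:
\[
  e_n = r - J(u_n) = e_{n-1} - a\,(u_n - u_{n-1})
      = \frac{\varphi_{n-1}}{a + \varphi_{n-1}}\; e_{n-1},
\]
\[
  |e_n| \le \frac{\alpha}{1 - \alpha}\,|e_{n-1}|
  \quad\text{whenever } |\varphi_{n-1}| \le \alpha\,|a| \text{ with } \alpha < 1 .
\]
```

In this affine example the error contracts at every step as soon as the relative derivative error satisfies $\alpha < 1/2$, which gives some intuition for why a fairly imprecise derivative estimate can still be adequate.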
For the systems described in this paper we have in mind a situation where the plant is an event-driven system controlled by a real-valued parameter $u$, $J(u)$ is an expected-value function defined over a finite horizon (hence not in steady state and possibly dependent on an initial condition), $y_n = L_n(u_n) + \psi_n$ is a sample-based approximation (possibly biased) of $J(u_n)$ over the control cycle $C_n$, and $L_n'(u_n) + \varphi_n$ is a sample approximation of $J'(u_n)$.
III. SYSTEM MODELING AND SIMULATION
The system-architecture considered in this paper is based on an Out-of-Order (OOO) core technology whereby instructions may complete execution in an order different from the program order. This enables instructions' execution to be limited primarily by data dependency and not by the order in which they appear in the program. Data dependency arises when an instruction requires variables that first must be computed by other instructions. A detailed description of OOO architectures can be found in Ref. [12] . Here we provide an abstract functional and logical description, and refer the reader to [2] for a more-detailed exposition.
The functionality of an OOO core is depicted in Figure 2. Instructions are fetched sequentially from memory and placed in the instruction queue, where they are processed by functional units, or servers in the parlance of queueing theory. The queue is assumed to have unlimited storage and there is a server associated with each buffer. The processing of an instruction starts as soon as it arrives and all of its required variables become available. The instruction departs from the queue as soon as its execution is complete and the previous instruction (according to the program order) departs. In the parlance of computer architectures, an instruction is said to be committed when it departs from the queue. The instruction throughput of the core is defined and measured by the average number of instructions committed per second.
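The queueing description above can be turned into a small timing model. The sketch below is an illustration under simplifying assumptions, not the Manifold model: it ignores the finite memory queue, takes arrival times, service times, and dependency lists as given inputs (all names are hypothetical), and encodes only the two rules stated in the text, namely that processing starts once the instruction has arrived and its operands are available, and that an instruction commits once it has finished and its predecessor has committed.

```python
def commit_times(arrivals, service_times, deps):
    """Return (completion, commit) time lists for instructions in program order.

    deps[i] lists indices j < i of instructions whose results instruction i needs.
    """
    completion, commit = [], []
    for i, (arrival, service) in enumerate(zip(arrivals, service_times)):
        # Processing starts when the instruction has arrived and all operands exist.
        start = max([arrival] + [completion[j] for j in deps[i]])
        completion.append(start + service)
        # In-order commit: wait for the previous instruction to depart.
        previous_commit = commit[-1] if commit else 0
        commit.append(max(completion[i], previous_commit))
    return completion, commit


def throughput(commit):
    """Average number of committed instructions per unit time."""
    return len(commit) / commit[-1]


# Four instructions; instruction 2 needs the result of 1, instruction 3 needs 0.
completion, commit = commit_times(
    arrivals=[0, 1, 2, 3],
    service_times=[5, 1, 1, 2],
    deps=[[], [], [1], [0]],
)
print(completion, commit, throughput(commit))
```

In this example instructions 1 and 2 finish execution early (out of order) but cannot commit before instruction 0 does, which is the third stall cause described in the text that follows.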
Instructions generally are classified as computational instructions or memory instructions. Access times of external, off-chip, memory instructions are one-to-two orders of magnitude longer than those of computational instructions. Therefore most architectures make use of a hierarchical memory arrangement where on-chip cache access takes less time than external memory such as DRAM. First the cache is searched, and if the variable is found there then it is fetched and the instruction is completed. If the variable is not stored in cache (a situation known as a cache miss) then it is fetched from external memory (typically DRAM) and placed in the cache, whence it is accessed and the instruction is completed. External memory instructions can be thought of as being placed in a finite-buffer, first-in-first-out queue, designated as the memory queue in Figure 2. When this queue becomes full, the entire memory access, including cache, is stalled. Thus, there are three causes for an instruction to be stalled: a computational or memory instruction waiting for variables computed by other instructions, a memory instruction waiting for the memory queue to become non-full, and any instruction waiting (after processing) for the previous instruction to depart from the queue. We point out that instructions involving computation and L1 cache access are subject to the core's clock rate, while memory instructions involving external memory, such as DRAM, are not subject to the same clock. This complicates the application of IPA and may cause it to be biased.
A quantified discrete-event model of this process is presented in Ref. [10], and a more general description can be found in [2], which also contains a detailed algorithm for the IPA derivative of the throughput as a function of frequency. We used a cycle-level, full-system discrete event simulation platform for multi-core architectures, Manifold (see [8] for its detailed description and capabilities). The simulated model consists of a 16-core X86 processor die, where each core is in a separate clock domain and can control its own clock rate independently of other cores. We simulated two application programs from the benchmark suite Splash-2, Barnes and Water-ns [7]. Barnes is a computation-intensive, memory-light application while Water-ns is memory intensive. For each execution, all of the 16 cores of the processor run threads of the same benchmark concurrently while each one of them is controlled separately. The control cycle is set to 0.1 ms for both benchmarks, and the frequency range of the cores is set to [0.5 GHz, 5.0 GHz]. We assume a continuous frequency range for the simulations, but will consider a realistic, discrete range for the implementation described in the sequel. The target instruction throughput is set to the same value for each core, and we experiment with the target throughput values of 1,200 MIPS (Million Instructions per Second), 1,000 MIPS, and 800 MIPS. In terms of instructions per control cycle, these target values correspond to $0.12 \times 10^6$, $0.1 \times 10^6$, and $0.08 \times 10^6$, respectively. The relationship between clock frequency and instruction throughput is determined by the Manifold processor model, but its IPA derivative was computed according to the high-level instruction flow described in [10]. For each application run we present, in the following paragraphs, the results for one of the 16 cores chosen at random.
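The conversion from MIPS to instructions per control cycle quoted above is a one-line calculation, since 1 MIPS is one instruction per microsecond and the control cycle lasts 100 microseconds. The helper name below is hypothetical and serves only to check the arithmetic.

```python
def instructions_per_control_cycle(target_mips, cycle_us):
    """1 MIPS equals one instruction per microsecond, so the product is a count."""
    return target_mips * cycle_us


CONTROL_CYCLE_US = 100  # the 0.1 ms control cycle, expressed in microseconds

for target in (1200, 1000, 800):
    count = instructions_per_control_cycle(target, CONTROL_CYCLE_US)
    print(target, "MIPS ->", count, "instructions per control cycle")
```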
Consider first the target throughput of 1,200 MIPS. The throughput simulation results for the Barnes benchmark are shown in Figure 3, where the horizontal axis indicates time in ms and the vertical axis indicates instruction throughput. We discern a rise of the throughput from its initial value (measured at 643.1 MIPS) towards the target level of 1,200 MIPS, which it reaches for the first time in about 1.5 ms, or 15 control cycles. Thereafter it oscillates about the target value, which is not surprising due to the unpredictable, rapidly changing program workload. The average throughput computed over the time interval [1.5 ms, 100 ms] (soon after the throughput has reached the target value) is 1,157.4 MIPS, which is 42.6 MIPS off the target level of 1,200 MIPS.
The corresponding graph of the clock frequency vs. time is not shown here due to the lack of space, but can be seen in Figure 4 in [10]. The frequencies experience some saturation at their highest level of 5.0 GHz in the time interval [7 ms, 12 ms]. Saturation at the highest level can push down the average throughput from its target level, since it indicates that the system may be unable to raise the throughput when needed. In this case the saturation is minor and hence its effect on the throughput is not clear. However, the average throughput over the interval [30 ms, 100 ms], which excludes the saturation period, is 1,192.6 MIPS, corresponding to only a 7.4-MIPS offset as compared to the aforementioned 42.6-MIPS offset for the period [1.5 ms, 100 ms]. The effects of saturation on the average throughput are more pronounced in simulation experiments described below.
In order to reduce the throughput oscillations and frequency saturation we scale down the gain in Eq. (1) by a factor $\xi \in (0, 1)$, resulting in the equation $u_n = u_{n-1} + \xi A_n e_{n-1}$. After some experimentation on various benchmarks (excluding those tested here) we chose $\xi = 0.2$. The resulting frequencies did not saturate throughout the program's run, and yielded an average throughput of 1,198.5 MIPS, corresponding to an offset of only 1.5 MIPS. This technique, though performing well on the present example, is ad hoc and may not work well for the realistic case of a discrete frequency set since it may require a more aggressive controller to change the frequencies. Therefore we will not further pursue it in this paper.
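The effect of the scaling factor on a single update can be seen in a few lines. The sketch below assumes that saturation is enforced by clipping the computed frequency to the simulated range [0.5 GHz, 5.0 GHz]; the function name and the numerical values of the gain and error are hypothetical and chosen only to show one update that saturates at full gain and does not saturate with a scaling factor of 0.2.

```python
F_MIN_GHZ, F_MAX_GHZ = 0.5, 5.0  # simulated frequency range quoted in the text


def next_frequency(u_prev, gain, e_prev, xi=1.0):
    """One scaled update u_n = u_{n-1} + xi * A_n * e_{n-1}, clipped to the range.

    Returns the new frequency and a flag telling whether the update saturated.
    """
    raw = u_prev + xi * gain * e_prev
    clipped = min(max(raw, F_MIN_GHZ), F_MAX_GHZ)
    return clipped, clipped != raw


# Hypothetical numbers: gain in GHz per MIPS, error in MIPS.
full = next_frequency(4.8, gain=0.004, e_prev=100.0)
damped = next_frequency(4.8, gain=0.004, e_prev=100.0, xi=0.2)
print("full gain:", full)
print("scaled gain:", damped)
```

With the full gain the requested frequency exceeds 5.0 GHz and is clipped, whereas the scaled update moves to about 4.88 GHz and stays inside the range.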
For the target throughput of 1,000 MIPS, the results (not shown due to space limitations) showed a rise in throughput from its initial value of 420.5 MIPS to 1,000 MIPS in 2.1 ms, or 21 control cycles. The average throughput in the [2.1ms, 100ms] interval is 990.2 MIPS, corresponding to an offset of 9.8 MIPS of the throughput from its target value of 1,000 MIPS. The frequency saturated at its upper limit only at 5 isolated control cycles with minimal effects on the throughput.
For the throughput target of 800 MIPS, the results show a rise in the throughput from its initial value of 679.3 MIPS to 800 MIPS in 1.9 ms, or 19 iterations. The average throughput in the interval [1.9ms,100ms] was 839.6 MIPS, which is 39.6 MIPS off the target value of 800 MIPS. There was a considerable frequency saturation at the lowest level of 0.5 GHz, which explains the positive offset.
For Water-ns, consider first the throughput target of 1,200 MIPS. Simulation results of the instruction throughput are shown in Figure 4, and they indicate greater fluctuations than for Barnes. Moreover, Figure 4 indicates three periods of very low throughput, which correspond to significant saturation of the clock frequency at its upper boundary of 5 GHz (for the graph of clock frequency vs. time, please see Figure 6 in Ref. [10]). To explain this, recall that Water-ns is a memory-heavy program, and execution times of memory instructions are longer (typically by one or two orders of magnitude) than those of computational instructions. The saturation corresponds to periods of external memory access when the program's instructions run at a low throughput. In order to boost the throughput the controller sets the clock frequency to its highest possible value, which is the saturation rate of 5 GHz. However, this rate is not sufficient to raise the instruction throughput, hence the low-level throughput indicated in Figure 4 and the resulting gap between the average throughput and the target level of 1,200 MIPS.
For the target throughput of 1,000 MIPS, simulation results show the throughput increasing from its initial value of 472.1 MIPS to the target level of 1,000 MIPS in 2.3 ms, or 23 control cycles. There is considerable frequency saturation at the high limit of 5.0 GHz. The average throughput in the interval [2.3 ms, 330 ms] is 947.8 MIPS, which means an offset of 52.2 MIPS from the target throughput.
For the target throughput of 800 MIPS, simulation results indicated a rise of the throughput from its initial value of 443.3 MIPS to its target level in about 2.3 ms, or 23 control cycles. There is considerable saturation of the frequency at its lower level of 0.5 GHz, and hence a positive offset between the computed average throughput and its target level. Indeed, the average throughput in the interval [2.3 ms, 330 ms] is 862.6 MIPS, meaning an offset of 62.6 MIPS of the throughput from its target value of 800 MIPS.
All of these results are summarized in Table I.

Haswell is Intel's fourth-generation core processor architecture fabricated in the 22nm process [13]. Haswell is comprised of multiple out-of-order execution cores designed for improved power efficiency over prior generations. The version used in our study has four cores residing on each Haswell processor, and every core supports two threads.
All the cores in a Haswell processor execute at the same frequency, the processor frequency. Therefore it is not possible to control each core separately by its own frequency. Instead, we consider the average throughput among the active threads, which we call the normalized processor throughput, or normalized throughput in brief; it is the equivalent measure of the core's instruction-throughput in the Manifold model described above. We point out that typically the programmer and operating system distribute the load among the various cores in a balanced way, and in the system considered here there are 4 cores executing 8 threads, two threads to a core.
We implemented the controller by loading a C++ program to the processor via the PAPI interface [14].² Recall that the control algorithm is based on Eqs. (1), (2), (6), and (4). However, the Haswell processor admits only a finite set of 16 frequencies, and we have to modify Eq. (1) accordingly. This set of frequencies, denoted by $\Omega$, is (in GHz) $\Omega = \{0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 1.8, 2.0, 2.2, 2.4, 2.5, 2.7, 2.9, 3.1, 3.2, 3.4\}$. Denoting by $[u]$ the nearest point to $u \in \mathbb{R}$ in the set $\Omega$ (the left point in case of a tie), we modify Eq. (1) by Eq. (9), below. The control algorithm is formalized as follows.

Notation: $C_n$ - the $n$th control cycle; $r$ - the target normalized throughput; $u_n$ - the processor frequency during $C_n$; $y_n$ - the resulting measured normalized throughput; $e_n := r - y_n$.

Algorithm 1: The following steps are taken during $C_n$:
1) At the start of $C_n$, set
$$u_n = [u_{n-1} + A_n e_{n-1}]. \tag{9}$$
2) During $C_n$, measure $y_n$, and compute an approximation to the IPA derivative, $\lambda_n$.
3) At the end of $C_n$, set $A_{n+1} = \lambda_n^{-1}$.
4) At the end of $C_n$, compute $e_n := r - y_n$.

² Modern microprocessors include many hardware counters that record the occurrences of various events during program executions. Examples of such events include i) completion of the execution of an integer instruction, ii) a cache miss, or iii) an instruction that accesses memory. The Performance Application Programming Interface (PAPI) is a publicly available software infrastructure for accessing these performance counters during program execution.
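The rounding operator used in step 1 of Algorithm 1 can be sketched as follows. The frequency list is the one given in the text; the function name and the implementation via a (distance, value) key are choices made for this illustration and are not taken from the authors' C++ program.

```python
OMEGA_GHZ = [0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 1.8, 2.0,
             2.2, 2.4, 2.5, 2.7, 2.9, 3.1, 3.2, 3.4]


def nearest_frequency(u):
    """Return the point of OMEGA_GHZ nearest to u; ties go to the lower frequency."""
    # The key (distance, value) resolves equal distances by the smaller value.
    return min(OMEGA_GHZ, key=lambda w: (abs(w - u), w))


for u in (0.1, 1.04, 2.46, 3.9):
    print(u, "->", nearest_frequency(u))
```

Requests below 0.8 GHz or above 3.4 GHz are mapped to the end points of the set, so this operator also accounts for saturation at the limits of the frequency range.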
The IPA algorithm used for the Manifold simulation is too complicated for a real-time implementation, and therefore we explored approximations thereto with the objective of having them be as simple as possible. The simplest we could find was
$$\lambda_n = \frac{y_n}{u_n},$$
which we justify on the grounds that the frequency-to-throughput performance function $y_n := L_n(u_n)$ is the sum of a linear component and a nonlinear component. If the executed program consists only of computational instructions then $L_n(u_n)$ would be linear, and generally, the nonlinear component is due to external-memory instructions. Therefore, we expect the regulation technique to work better for memory-light programs than for memory-intensive programs. This is evident from the testing we performed, whose results are presented in the following paragraphs.

We partly attribute the 33.5-MIPS gap to the quantization error due to the rounding off of the frequencies to their nearest values in $\Omega$, which is evident from Figure 8 in [10].
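To see how rounding to the discrete frequency set can leave a persistent offset, the following sketch runs the four steps of Algorithm 1 on a toy plant. Everything here is an assumption made for illustration: the affine plant `300 * u + 150` (MIPS versus GHz) is hypothetical, the derivative approximation is taken to be the ratio y_n / u_n (which is exact only for a linear plant through the origin), and the function names are not taken from the actual implementation.

```python
OMEGA_GHZ = [0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 1.8, 2.0,
             2.2, 2.4, 2.5, 2.7, 2.9, 3.1, 3.2, 3.4]


def quantize(u):
    """Nearest point of OMEGA_GHZ to u, lower point in case of a tie."""
    return min(OMEGA_GHZ, key=lambda w: (abs(w - u), w))


def run_algorithm(plant, r, u0, cycles):
    """Run the quantized control loop; returns the list of (u_n, y_n) pairs."""
    u = u0
    y = plant(u)
    gain = u / y       # A_1 = 1 / lambda_0, with the assumed lambda_0 = y_0 / u_0
    error = r - y
    trace = [(u, y)]
    for _ in range(cycles):
        u = quantize(u + gain * error)  # step 1: quantized integrator update
        y = plant(u)                    # step 2: measure the throughput
        gain = u / y                    # step 3: A_{n+1} = 1 / lambda_n
        error = r - y                   # step 4: error for the next cycle
        trace.append((u, y))
    return trace


# Hypothetical plant: linear part plus a constant offset.
def toy_plant(u):
    return 300.0 * u + 150.0


trace = run_algorithm(toy_plant, r=1000.0, u0=0.8, cycles=6)
for u, y in trace:
    print(f"{u:.1f} GHz -> {y:.1f} MIPS")
```

In this toy run the frequency moves through 2.0 and 2.7 GHz and then settles at 2.9 GHz, where the throughput is 1,020 MIPS. The exact solution of about 2.83 GHz is not in the frequency set, so an offset of 20 MIPS remains.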
For the target level of 1,000 MIPS, the throughput climbed from its initial value of 633.2 MIPS to its target level in 1.5 ms, or 15 control cycles. There was no frequency saturation, and the average throughput in the [1.5 ms, 330 ms] interval is 990.6 MIPS, which means an offset of 9.4 MIPS from the target level of 1,000 MIPS.
For the target level of 800 MIPS, the throughput climbs from its initial value of 763.1 MIPS to the target level in 1.0 ms, or 10 control cycles. There was no frequency saturation, and the average throughput is 829.7 MIPS, which is 29.7 MIPS off the target level of 800 MIPS. Again, we attribute this gap to the quantization error.
Regarding Water-ns, results for the throughput target of 1,200 MIPS are shown in Figure 6. The throughput rises from an initial value of 683.1 MIPS to its target value of 1,200 MIPS in about 2.1 ms, or 21 control cycles. The throughput oscillates at its upper level more than that obtained for Barnes, and this is due to the fact that Water-ns is a memory-heavy application. For the same reason, there is considerable

Finally, for the target throughput of 800 MIPS, the throughput rises from its initial value of 698.3 MIPS to its target level in 2.4 ms, or 24 control cycles. There are considerable frequency oscillations at the upper limit of the frequency range, and the average throughput is 836.3 MIPS, which is 36.3 MIPS off the target level.
These results are summarized in Table II, showing the offset (in MIPS) of average throughput from target throughput, obtained from the Haswell implementation of Barnes and Water-ns with throughput targets of 1,200, 1,000, and 800 MIPS.
Comparing the data summarized in Table I and Table II, we see that the regulation technique performs slightly better on the Haswell implementation platform than on the Manifold simulation environment. This may be because in the simulation experiments we regulate the throughput of each core separately, while in the implementation we control the average throughput of all the cores in the processor.
V. CONCLUSIONS
This paper describes the testing of an IPA-based throughput regulation technique in multicore processors. The testing was performed on both a simulation environment and an implementation platform. Despite crude approximations that have had to be made in the implementation setting, the proposed technique performed slightly better than in the simulation setting. Future research will extend the regulation method from a centralized control of a single processor to a decentralized control of networked systems.
