Achieving high performance under a peak temperature limit is a first-order concern for VLSI designers. This paper presents a new abstract model of a thermally-managed system, where a stochastic process model is employed to capture the system performance and thermal behavior. We formulate the problem of dynamic thermal management (DTM) as the problem of minimizing the energy cost of the system for a given level of performance under a peak temperature constraint by using a controllable Markovian decision process (MDP) model. The key rationale for utilizing MDP for solving the DTM problem is to manage the stochastic behavior of the temperature states of the system under online re-configuration of its micro-architecture and/or dynamic voltage-frequency scaling. Experimental results demonstrate the effectiveness of the modeling framework and the proposed DTM technique.
INTRODUCTION
Ongoing advances in CMOS process technologies and VLSI designs have resulted in the introduction of high-performance multi-core systems on a chip (SoC). Thermal control in such systems has become a first-order concern due to the increased power density and thermal vulnerability of the chip. Localized heating is a frequent occurrence in SoC designs. Power dissipation is spatially non-uniform across the chip, resulting in the emergence of hot spots and spatial temperature gradients that can cause accelerated aging, timing errors (setup time violations), or even physical damage to the chip. To address this, dynamic thermal management (DTM) techniques have been proposed as a class of micro-architectural solutions and control strategies. They attempt to ensure thermal safety by employing runtime mechanisms that control power density and prevent excessive local heating, while enabling the highest SoC performance that meets the peak temperature constraints.
As reported in [1]-[5], the problem of thermal modeling and management has received a lot of attention. The work presented in [1] relies on a compact thermal model to achieve a temperature-aware design methodology. A thermal control mechanism used to lower the microprocessor's temperature has been derived in [2]. Predictive thermal management [3], which exploits certain properties of multimedia applications, is an example of online strategies for thermal management. In [4], design guidelines for power and thermal management for high-performance microprocessors are provided. A summary of research that combines interconnect thermal effects and reliability measures is given in [5].
Much of the past work has examined techniques for thermal modeling and management, but these techniques may be ineffective at reducing the chip temperature of multi-core (MC) systems because they have often not considered the configurability of the micro-architecture, which depends on the target application, or the uncertainty in temperature measurement (erroneous or noisy temperature reports). Furthermore, thermal models based on equivalent circuit models cannot adequately model heat generation and diffusion in structures with complex shapes and boundary conditions. Indeed, it is extremely difficult to obtain the exact solution of the heat equations that arise from realistic die conditions [6]. These difficulties render the problem of identifying hot spots stochastic.
In this paper, we present a stochastic model of a thermally-managed MC system (which we shall call TMS, for short) using a Markov decision process (MDP) model. Recall that MDP, which provides a robust theoretical framework for resource management problems, is a theory for modeling sequential decision making [7]. The key rationale for utilizing MDP for solving the DTM problem in MC systems is to manage the stochastic behavior of the temperature states of the system under dynamic re-configuration of its micro-architecture (which may take place in response to application program characteristics), while maximizing the system performance subject to the constraint that a critical temperature threshold is not exceeded locally or globally.
The remainder of this paper is organized as follows. Section 2 provides some preliminaries of the paper, while section 3 describes the details of the proposed models for a TMS. Section 4 presents a DTM problem formulation. Experimental results and conclusions are given in section 5 and section 6.
PRELIMINARIES
A modern computing system, which typically utilizes multiple cores to achieve high performance, exhibits different thermal profiles under different application programs due to its re-configurable micro-architecture. For example, its cache size varies drastically depending on the characteristics of the running threads (i.e., application programs), where these adaptive caches adjust from small sizes with fast access times to higher-capacity but slower and more power-hungry configurations. As expected, larger cache configurations, which are more prominent for dual-thread workloads, dissipate more power than smaller ones. This in turn dynamically changes the temperature profile of the SoC during program execution. Details of the functionality of the MC systems and algorithms for changing the micro-architectural configuration on-the-fly fall outside the scope of the present paper. Interested readers may refer to [8].
Application programs tend to exhibit different characteristics as a function of the program phase they are in [9]. This in turn affects the computational workload of the processor and causes a new micro-architectural configuration to be employed, which results in a different thermal profile on the chip. Figure 1 shows the IPC (Instructions per Cycle) obtained by running various application programs (e.g., SPEC CPU2000 [10]) on the Intel Core Duo processor with a typical architectural specification (cf. [11]). In this figure, the IPCs of the applications are compared to their L2 cache miss rates, where the average IPC for the CPU2000 benchmarks is measured as 0.85. It is clearly seen that a higher L2 cache miss rate accounts for a lower IPC.
An integrated circuit (device) is typically allowed to operate when the ambient air temperature, T A , surrounding the device package, is within the range of 0°C to 70°C [12]. It is expedient to define the critical temperature threshold, T crit , as the temperature above which a chip is in thermal violation, resulting in timing errors or accelerated device/interconnect aging, and a trigger temperature threshold, T trig , as the temperature above which DTM techniques are employed. A thermal manager employs temperature reduction mechanisms when the system temperature exceeds a pre-defined temperature threshold (i.e., the trigger temperature).
SYSTEM MODELING
We present a stochastic modeling technique to construct a TMS by utilizing a continuous-time Markov decision process (CTMDP).
Background
A CTMDP is a controllable continuous-time Markov process that satisfies the Markovian property [7] and takes values in a set of states s ∈ S, where state transition rates are controlled by actions a ∈ A. We adopt the conventional approach of a cost function that assigns a value to each state and action pair, i.e., when the system makes a transition from state s to another state s', it receives a cost [7]. The exponential distribution of state transition times, a defining property of the CTMDP model, is sometimes insufficient to model practical cases, especially the first request arrival in the idle state period [13], where the inter-arrival times of service requests follow a general distribution. However, since thermal management is only in effect during program execution, the quality of the model does not suffer if we assume that the task inter-arrival times are exponentially distributed during the active state period. Furthermore, the burst of program execution on a processor follows an exponential distribution [14].
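To make the exponential-transition-time assumption concrete, the following minimal sketch samples one jump of a continuous-time Markov chain from a generator matrix. The generator values, the function name `sample_transition`, and the fixed random seed are illustrative assumptions and are not taken from the paper.

```python
import random

def sample_transition(generator, state, rng):
    """Draw (next_state, holding_time) for a continuous-time Markov chain.

    generator[i][j] (i != j) is the rate of moving from i to j, and
    generator[i][i] is minus the total exit rate of state i.
    """
    row = generator[state]
    exit_rate = -row[state]
    if exit_rate <= 0:
        return state, float("inf")  # absorbing state: the chain never leaves
    holding_time = rng.expovariate(exit_rate)  # exponentially distributed sojourn
    targets = [j for j in range(len(row)) if j != state and row[j] > 0]
    weights = [row[j] for j in targets]
    next_state = rng.choices(targets, weights=weights)[0]
    return next_state, holding_time

# Hypothetical two-state generator under one fixed action (rates in 1/s).
G = [[-2.0, 2.0],
     [1.0, -1.0]]
rng = random.Random(0)
next_state, holding_time = sample_transition(G, 0, rng)
```

In a CTMDP, the chosen action would select which generator (or which parameter values inside it) applies before the jump is sampled.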
Component Models
We present a CTMDP-based model of a TMS to optimally solve the DTM problem. Figure 2 shows an abstract model of a TMS, which comprises three components: processor, application program, and thermal sensor. In this paper, for simplicity, we assume that each application is executed by one processor and that an individual thermal sensor measures the temperature of each processor in the MC system. A new application may cause micro-architectural re-configuration of the corresponding processor in order to improve the overall performance. A thermal manager (TM) receives the state (phase) of the application, reads temperature data from the thermal sensor, and issues commands to the processor under its control to manage the temperature rise above T trig . There is one TM assigned to each processor. Notice that R i , S j , and H k represent the state sets of the application program (i = 1, 2, …, l), the processor (j = 1, 2, …, m), and the temperature (k = 1, 2, …, n), respectively, where l, m, and n are the numbers of applications, processors, and thermal sensors available within an MC system. Next, we construct the CTMDP model of a single-processor system for simplicity. The CTMDP model of an MC system can be constructed in the same manner.
Modeling the Processor State
The CTMDP model of the processor is constructed as follows.
Assume that each state s ∈ S represents a combination of a microarchitectural configuration c ∈ C (e.g., register file sizing, cache sizing, or floating-point-unit disabling) and an action a ∈ A (e.g., an operating voltage-frequency (VF) setting), where the microarchitectural configuration set C = {c 1 , c 2 , …, c u } and the action set A = {a 1 , a 2 , …, a v } are available to the processor. Thus, the CTMDP model of the processor includes a state set S = {s 1 , s 2 , …, s w } and a parameterized generator matrix G proc , where w is the number of states of the processor, i.e., w = u·v. A state transition out of some state s is controlled by either an action a ∈ A or a configuration change c ∈ C. Any state transition takes a certain amount of time to complete, where this latency overhead ranges from several clock cycles to hundreds of milliseconds. A typical micro-architecture re-configuration latency, i.e., the duration between the time a decision is made to change the micro-architectural configuration and the time of the actual re-configuration, is up to tens of clock cycles [16]. Thus, a state transition in the CTMDP model of the processor takes time $\tau(s, s') = \max(\tau_{DVFS}, \tau_{ARCH})$, where τ DVFS is the transition time of dynamic voltage and frequency scaling (DVFS) and τ ARCH is the micro-architecture transition period when the system transits from state s to s'.
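The construction of the processor state set can be sketched in a few lines. The snippet below enumerates S as all (configuration, action) pairs, so that w = u·v, and takes the transition latency as the larger of the DVFS and re-configuration latencies. The labels, the latency values, and the rule that an unchanged component contributes zero latency are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical labels for the configuration set C and the action set A.
configs = ["c1", "c2", "c3"]   # micro-architectural configurations (u = 3)
actions = ["a1", "a2", "a3"]   # VF settings (v = 3)

# Each processor state pairs one configuration with one VF setting: w = u * v.
states = list(product(configs, actions))

def transition_time(s, s_next, tau_dvfs, tau_arch):
    """Latency of moving from s to s_next: the slower of the two changes.

    tau_dvfs and tau_arch are illustrative latencies in seconds. Assumption:
    a component that does not change contributes no latency.
    """
    (c, a), (c_next, a_next) = s, s_next
    t_arch = tau_arch if c != c_next else 0.0
    t_dvfs = tau_dvfs if a != a_next else 0.0
    return max(t_dvfs, t_arch)
```

For example, `transition_time(("c1", "a1"), ("c2", "a2"), 1e-4, 1e-7)` is dominated by the assumed DVFS latency of 1e-4 s.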
An example of how to construct the CTMDP model of the processor is given next. For simplicity, we suppose that the processor has three micro-architectural configurations (e.g., cache resizing) denoted by c 1 , c 2 , and c 3 , and a voltage frequency (VF) setting chosen from a finite set of actions A = {a 1 , a 2 , a 3 } is applied to the processor, where a 1 < a 2 < a 3 in terms of the VF values. Then, the abstract CTMDP model of the processor can be illustrated as shown in Figure 3 (a), where a node represents a processor state and a directed arc represents a transition between two states with the parameterized generator G proc . In Figure 3 ( 
Modeling the Application State
Application programs can be characterized by using their architecture-dependent characteristics (such as the IPC and cache-miss rate), architecture-independent characteristics (such as data and instruction temporal localities), or a combination of these two [17]. In this paper, we focus on the micro-architecture re-configurations that affect the IPC and data cache-miss rate characteristics of application programs, which subsequently result in temperature changes on the processor die. Measuring the architecture-independent characteristics may be achieved by exploiting the notion of data similarity (e.g., instruction level parallelism, data locality). However, it is not straightforward to estimate the performance of a particular architecture-independent enhancement; therefore, we do not consider them here.
Inspired by these observations, we construct a CTMDP model of an application program. The CTMDP model consists of a state set R = {r 1 , r 2 , …, r p } and a generator matrix G app , where p is the number of states that are present in the application. In our problem setup, application states r are differentiated based on the values of the IPC and the cache-miss rate. A state transition between different application states takes place autonomously, and may initiate a change in the state of the processor. An example of a four-state CTMDP model of an application, considering workload characteristics, is depicted in Figure 4 . Here, the transition rate σ r,r' in G app includes the context switch time, which is controlled by the operating system; a round-robin context switching architecture is not assumed. For example, if we make a context switch when the deadline for completing an application program is missed, then a state transition will occur with a specific transition rate.
Modeling the Temperature State
Temperature readings from thermal sensors are important to DTM techniques: by knowing the temperature profile of a chip, the TMS can respond to chip temperature changes so as to avoid thermal failure or damage of the chip, or to maximize the performance of interest under temperature constraints.
Conventionally, the junction temperature T J of the IC can be estimated with
$$T_J = T_A + P \cdot \theta_{JA}$$
where T A is the ambient temperature (°C), P is the device power dissipation (W), and θ JA is the thermal resistance from device junction to ambient (°C/W). In general, thermal failure is avoided by keeping the device θ JA value small enough that the junction temperature T J does not exceed a maximum value during operation. It is worthwhile to note that θ JA cannot be modeled directly due to the complexity of thermal models for the package, cooling system, and board stack-up [6]. In addition, θ JA is treated as a single parameter under the assumption that the device power dissipation, P, is distributed uniformly across the die, which is not a realistic assumption (i.e., the actual behavior is uncertain). To overcome this difficulty, we use an observation (i.e., the temperature reading T T of the package top obtained by a thermal sensor) as
$$T_J = T_T + P \cdot \psi_{JT}$$
where ψ JT is the junction-to-top of package thermal characterization parameter used as a measure of the temperature difference between junction and package top surface, and is estimated from JEDEC thermal tests [12] . The device power P, a major source of heat generation, is varied based on microarchitectural configurations, which are also application dependent.
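As a small numeric illustration, the sketch below evaluates the two standard junction-temperature relations, T_J = T_A + P·θ_JA and T_J = T_T + P·ψ_JT. The function names and all parameter values are hypothetical and are not taken from the paper or from any package datasheet.

```python
def junction_temp_from_ambient(t_ambient_c, power_w, theta_ja):
    """T_J estimated from the ambient temperature and junction-to-ambient resistance."""
    return t_ambient_c + power_w * theta_ja

def junction_temp_from_top(t_top_c, power_w, psi_jt):
    """T_J estimated from a package-top sensor reading and psi_JT."""
    return t_top_c + power_w * psi_jt

# Hypothetical operating point: 10 W device, 40 C ambient, 65 C package top.
tj_ambient = junction_temp_from_ambient(40.0, 10.0, 3.0)   # theta_JA = 3.0 C/W (assumed)
tj_top = junction_temp_from_top(65.0, 10.0, 0.5)           # psi_JT = 0.5 C/W (assumed)
```

With these assumed numbers both estimates agree; in practice the sensor-based estimate is preferred because θ JA lumps together effects that are hard to model.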
To construct the temperature state of the processor, we first define a set of temperatures T 0 < T 1 < … < T c , where T 0 = T trig (i.e., the trigger temperature threshold) and T c = T crit (i.e., the critical temperature threshold). The intervening temperature thresholds are defined according to the ACPI (Advanced Configuration and Power Interface) specification. Once the temperature of the processor reaches the initial trigger temperature, the thermal manager is awakened to consider the conditions and issue a thermal management decision (i.e., a system state-changing command), ensuring that the critical temperature threshold is not exceeded. 
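The mapping from a temperature reading to a discrete temperature state can be sketched as a threshold lookup. In the snippet below, the end points 60°C and 71°C are the T trig and T crit values used later in the experiments, while the intervening thresholds and the convention that index 0 means "below the trigger" are assumptions for illustration.

```python
from bisect import bisect_left

# T_0 = T_trig and T_c = T_crit; the two middle thresholds are hypothetical.
THRESHOLDS_C = [60.0, 64.0, 68.0, 71.0]

def temperature_state(temp_c, thresholds=THRESHOLDS_C):
    """Return the index k of the temperature state.

    0 means the reading has not exceeded T_trig (thermal manager idle),
    1..c index the bands between consecutive thresholds, and
    c + 1 (= len(thresholds)) means the reading is beyond T_crit.
    A reading must strictly exceed a threshold to move to the next band.
    """
    return bisect_left(thresholds, temp_c)
```

Here `temperature_state(72.0)` returns 4, which plays the role of the violating state beyond T crit .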
Integrated Model of a TMS
After constructing the CTMDP models of the processor, application, and temperature reading, we denote by X the global state set of the integrated model, defined as the Cartesian product [18] of the state sets S, R, and H, with the generator matrix G TMS which contains the transition rate from a global state x = (s, r, h) to another x' = (s', r', h'). Note that the Cartesian product is a direct product of sets such as S×R×H = {(s, r, h) | s ∈ S, r ∈ R, and h ∈ H}. The global generator matrix G TMS is calculated as the tensor sum [19] of the generator matrices G proc , G app , and G temp . Note that when two CTMDPs with generator matrices A and B are given, the generator matrix of the joint process is obtained by the tensor sum, a matrix operator, of A and B. The tensor sum C = A⊕B is given by C = A⊗I n2 + I n1 ⊗B, where n 1 is the order of A, n 2 is the order of B, I ni is the identity matrix of order n i , and ⊗ is the tensor product [19]. The tensor product C = A⊗B is defined as
$$C = A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1 n_1}B \\ \vdots & \ddots & \vdots \\ a_{n_1 1}B & \cdots & a_{n_1 n_1}B \end{bmatrix}$$
An example of the integrated CTMDP model that captures temperature evolution is provided in Figure 5 , assuming, for simplicity, that we have two states each for the processor (s 1 , s 2 ), application (r 1 , r 2 ), and temperature (h 1 , h 2 ). For example, if a microarchitectural configuration change from s 1 to s 2 takes place given application r 1 and temperature reading h 1 , the TMS transits from x 1 to x 6 via x 5 , where the temperature in the end evolves into h 2 .
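The tensor sum can be checked on small matrices. The following sketch implements the tensor (Kronecker) product and the tensor sum C = A⊗I + I⊗B in plain Python and applies them to two hypothetical 2×2 generators; the rate values are illustrative assumptions only.

```python
def kron(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def tensor_sum(a, b):
    """Tensor sum of square matrices: A (x) I_n2 + I_n1 (x) B."""
    left = kron(a, identity(len(b)))
    right = kron(identity(len(a)), b)
    return [[l + r for l, r in zip(row_l, row_r)] for row_l, row_r in zip(left, right)]

# Hypothetical generators for two independent two-state components.
G_proc = [[-2.0, 2.0], [1.0, -1.0]]
G_app = [[-0.5, 0.5], [0.25, -0.25]]
G_joint = tensor_sum(G_proc, G_app)
```

Each row of `G_joint` still sums to zero, as required of a generator matrix, and a third component such as G temp could be folded in with another call to `tensor_sum`.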
DYNAMIC THERMAL MANAGEMENT
In this section, we construct a mathematical formulation of the DTM problem that maximizes the performance metric subject to not exceeding a critical temperature threshold.
Optimal DTM Policy
After determining the relevant parameters for each state x ∈ X and each arc in the CTMDP model of the TMS, we set up a mathematical programming model that solves the DTM problem as a linear program. The goal is to find an optimal state s ∈ S, which consists of an action and a micro-architectural configuration (a, c), while minimizing the energy cost of the system for a given level of performance and a given application r under tight temperature constraints. We call the tuple (a, c) a command since the TM controls the micro-architectural configuration and the VF setting, which in turn affect the power dissipation of the processor, and thereby the resulting temperature.
In this linear program, g is the energy cost of the system for a given level of performance (i.e., the energy-delay-squared product, ED 2 P, which captures the power-performance-efficiency under voltage scaling [8] and is independent of the clock frequency) when the system is in state x and command s x is chosen; the transition probability is the probability that the next system state is x' if the system is currently in state x and command s x is taken; δ(h(x), h c+1 ) is 1 if h(x) (i.e., the current h of state x) equals h c+1 (i.e., a temperature beyond T crit ) and 0 otherwise; and P crit is a pre-defined threshold probability (i.e., the probability of exceeding the critical temperature threshold). The ED 2 P metric is computed as
$$\frac{Pwr}{IPC^{3}}$$
where Pwr denotes the processor power consumption, and the processor performance is measured as the number of instructions per cycle (IPC).
$Pwr / IPC^{3}$ is an excellent figure of merit to capture the energy cost of a given level of processor performance [8]. Note that we focus on AC line powered systems that strive to deliver maximum performance while operating under temperature constraints. Specifically, the purpose of this optimization problem is to maximize the system's power-performance-efficiency while constraining the probability that the peak temperature is greater than T crit to be less than a pre-defined probability value, P crit .
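The cost metric and the probabilistic temperature constraint can be illustrated with a deliberately simplified stand-in. The sketch below is not the linear program of this section: it only computes Pwr/IPC³ for a handful of candidate commands and picks the cheapest one whose assumed probability of exceeding T crit stays within P crit . The candidate list, field names, and all numbers are hypothetical.

```python
def ed2p(power_w, ipc):
    """Energy-delay-squared figure of merit, Pwr / IPC^3 (lower is better)."""
    return power_w / ipc ** 3

def pick_command(candidates, p_crit):
    """Cheapest candidate whose violation probability is at most p_crit, or None."""
    feasible = [c for c in candidates if c["p_violation"] <= p_crit]
    if not feasible:
        return None
    return min(feasible, key=lambda c: ed2p(c["power_w"], c["ipc"]))

# Hypothetical per-command characterization for one (application, temperature) state.
candidates = [
    {"command": ("a2", "c2"), "power_w": 30.0, "ipc": 1.2, "p_violation": 0.30},
    {"command": ("a2", "c1"), "power_w": 24.0, "ipc": 1.0, "p_violation": 0.15},
    {"command": ("a1", "c1"), "power_w": 15.0, "ipc": 0.8, "p_violation": 0.02},
]
best = pick_command(candidates, p_crit=0.2)
```

The first candidate has the lowest ED 2 P but is rejected because its assumed violation probability exceeds 0.2, which mirrors how the temperature constraint trades efficiency for thermal safety.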
Online DTM
In many cases, we are unable to know in advance the actual characteristics of the applications that are running on the processor. Thus, we must also develop an online DTM technique by constructing a pre-characterized configuration-command mapping table, where the entries of this mapping table correspond to various combinations of application types and temperature readings. Figure 6 illustrates how the thermal manager interacts with the applications and the temperature readings. In this figure, the pre-characterized mapping table is obtained through extensive offline simulation at design time, considering every possible combination of states for the processor, applications, and temperature readings. It is worthwhile to note that the thermal manager is initiated only when the temperature exceeds the initial trigger temperature threshold T trig , and it then controls the performance of the processor so that the critical temperature is not exceeded. More precisely, the thermal manager receives the states of the current application and temperature when the temperature exceeds T trig , and issues an optimal micro-architectural configuration and action set (i.e., command) to the processor.
Figure 6. Online thermal management technique.
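The online lookup described above can be sketched as a dictionary keyed by the application and temperature states. The table entries, the state labels, and the fallback to the current command when an entry is missing are assumptions for illustration; only the 60°C trigger value comes from the experimental setup.

```python
T_TRIG_C = 60.0  # trigger threshold in deg C, as used in the experiments

# Hypothetical pre-characterized table:
# (application state, temperature state) -> (VF action, configuration).
MAPPING_TABLE = {
    ("r1", "h1"): ("a2", "c2"),
    ("r1", "h2"): ("a1", "c2"),
    ("r2", "h1"): ("a2", "c1"),
    ("r2", "h2"): ("a1", "c1"),
}

def thermal_manager(app_state, temp_state, temp_c, current_command):
    """Return the command to apply; the table is consulted only above T_trig."""
    if temp_c <= T_TRIG_C:
        return current_command  # thermal manager stays idle
    # Assumption: keep the current command if the table has no matching entry.
    return MAPPING_TABLE.get((app_state, temp_state), current_command)
```

A call such as `thermal_manager("r1", "h2", 66.0, ("a2", "c2"))` returns the table entry for that state pair, while a reading at or below the trigger leaves the command unchanged.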
EXPERIMENTAL RESULTS
Experiments have been designed to evaluate the effectiveness of the proposed modeling technique and assess the performance of our optimization method. We use abstract models of the Intel Core Duo processor [20], which provides a dynamic L2 cache resizing mechanism, to construct a TMS. To simplify the experimental setup, we consider R = {r 1 , r 2 , r 3 , r 4 }, where each r is a combination of one of two IPC ranges (IPC ≤ 0.85 or IPC > 0.85) and one of two L2 cache miss rate (η) ranges (η ≤ 0.01 or η > 0.01), based on the performance distribution for application programs shown in Figure 1 . The initial trigger temperature threshold is defined as T trig = 60°C, with an ambient temperature of T A = 40°C, based on the thermal design guideline, where we use the thermal performance data of a 35×35 mm 478-pin micro-FCPGA package [20] to obtain temperature states. The on-chip temperature is estimated by utilizing T chip = T A + P⋅(θ JA − ψ JT ) based on the parameter values of the package. The device power dissipation P is assumed to be a normally distributed random variable with a known mean value and standard deviation. Figure 7 shows the results of the proposed DTM technique, where we randomly chose a sequence of 100 programs of SPEC CPU2000 (cf. Figure 1 ) with P crit set to 0.2 and T crit set to 71°C. An optimal architectural configuration and action set is selected and provided to the processor when an input (i.e., application and temperature state) is given to the mapping table, where the entries of this table correspond to various combinations of inputs and performance constraints. It is clearly seen that the peak power consumption, which results in the temperature increase, is limited by constraining the probability that the peak temperature of the system is greater than T crit to be less than P crit in our DTM policy. The time steps are abstractly defined to represent the peak power value of each program run.
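The experimental state definitions can be expressed compactly. The sketch below classifies an application into one of four states using the IPC split at 0.85 and the L2 miss-rate split at 0.01, and evaluates T chip = T A + P⋅(θ JA − ψ JT ). The assignment of the labels r1 to r4 to the four ranges, and the θ JA and ψ JT values, are assumptions; the paper does not specify them in this section.

```python
def application_state(ipc, l2_miss_rate, ipc_split=0.85, miss_split=0.01):
    """Map (IPC, L2 miss rate) to one of four labels r1..r4.

    Assumed labeling: r1 = low IPC / low miss, r2 = low IPC / high miss,
    r3 = high IPC / low miss, r4 = high IPC / high miss.
    """
    high_ipc = ipc > ipc_split
    high_miss = l2_miss_rate > miss_split
    return f"r{1 + 2 * int(high_ipc) + int(high_miss)}"

def chip_temperature(t_ambient_c, power_w, theta_ja, psi_jt):
    """T_chip = T_A + P * (theta_JA - psi_JT)."""
    return t_ambient_c + power_w * (theta_ja - psi_jt)

# Hypothetical package parameters (C/W) and power draw (W) at T_A = 40 C.
t_chip = chip_temperature(40.0, 10.0, 2.5, 0.5)
```

With the assumed parameters, a 10 W load puts the chip exactly at the 60°C trigger threshold, so any higher power draw would wake the thermal manager.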
As expected, constraining the power dissipation causes some performance (ED 2 P) degradation. It, however, guarantees the thermal safety of the system. We investigated the performance-efficiency of the proposed DTM technique, which we call stochastic DTM, or SDTM for short. We assumed that two voltage-frequency (VF) change commands are available (where a 1 < a 2 in terms of VF values). For comparison purposes, we also implemented a greedy DTM policy.
Greedy: Apply the following VF assignment strategy:
- Use a 2 at low temperatures, i.e., T trig ≤ T < (T trig + T crit )/2;
- Use a 1 at high temperatures, i.e., (T trig + T crit )/2 ≤ T < T crit .
SDTM: Apply the optimal DTM commands, based on the mathematical program formulation of the TMS.
The Greedy policy gives considerable performance benefit, similar to clock throttling techniques which throttle the clock and flush pipelines under temperature constraints. The simulation results in Table 2 (normalized), which varies the values of T crit , demonstrate that, compared to the Greedy policy, the SDTM policy, which allows exceeding T crit to the degree of P crit , achieves performance savings of up to 16.1% (average) at the cost of a 3.5% (average) power penalty. However, the results also indicate that as we move P crit to smaller values (e.g., 0.05), we can achieve 13.9% (average) performance savings with little impact on the power metric.
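The Greedy baseline is simple enough to state directly in code. The sketch below follows the two temperature bands given in the policy definition; returning `None` outside those bands is an assumption, since the definition does not say what happens below T trig or at and above T crit .

```python
def greedy_vf(temp_c, t_trig_c, t_crit_c):
    """VF command chosen by the Greedy policy for a temperature reading."""
    midpoint = (t_trig_c + t_crit_c) / 2
    if t_trig_c <= temp_c < midpoint:
        return "a2"   # lower band: keep the higher VF setting
    if midpoint <= temp_c < t_crit_c:
        return "a1"   # upper band: drop to the lower VF setting
    return None       # outside the range covered by the policy definition

# With T_trig = 60 C and T_crit = 71 C, the midpoint is 65.5 C.
command_low = greedy_vf(62.0, 60.0, 71.0)
command_high = greedy_vf(68.0, 60.0, 71.0)
```

Unlike SDTM, this rule ignores the application state and the likelihood of future temperature states, which is why it serves only as a comparison point.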
CONCLUSION
We introduced a new technique for modeling and solving the DTM problem for multi-core systems. The proposed modeling technique, based on Markov decision processes, captures the dynamic characteristics of the processor, applications, and die temperatures. From the mathematical model, we can calculate the optimal DTM policy, which maximizes the power-performance-efficiency under a peak temperature constraint.
