Abstract-This paper addresses the problem of minimizing the peak temperature of pipelined multi-core systems under hard end-to-end deadline constraints by reversely using the Pay-Burst-Only-Once principle. Periodic Thermal Management is adopted to control the temperature, and every core is periodically switched between two power modes. With the peak temperature representation, we first formulate the problem of finding the thermally optimal periodic schemes that satisfy the deadline constraints and then present a fast heuristic algorithm to solve it. Using real-life processor platforms and applications, our simulations demonstrate that our approach reduces the peak temperature by up to 15 °C on the 4-stage ARM platform compared to the sub-deadline partition approach. Moreover, our algorithm is shown to be scalable w.r.t. the number of pipeline stages, and its effectiveness is validated against a brute-force search approach.
I. INTRODUCTION
With the increasing demand for computational performance, multi-core architectures are now widely adopted in commercial products. To date, processors with 64 or more cores are available on the market. An architecture with such a high degree of parallelism poses a challenge to designers: how to extract parallelism from applications and exploit it efficiently.
Pipelined computing, which can reduce the latency of a stream application, is a promising paradigm for real-time systems. The pipelined computing connects a set of processing units in series and executes the sub-tasks of the application on the pipelined processing system. In this way, the sub-tasks can be executed simultaneously, that is, parallel processing is performed. Therefore, pipelined computing can efficiently exploit the hardware performance advantage of multi-core processors and decrease the latency.
For hard real-time pipelined systems, bounding the latency is crucial for the correctness of the system. However, as power density increases exponentially with Moore's Law, the peak temperature of modern processors rises rapidly, which seriously threatens the reliability and performance of the system. Since reducing the temperature usually requires decreasing the power consumption, which means lower performance and larger latency, the trade-off between performance and temperature constraints should be made carefully. Therefore, designing a scheduling policy that optimizes the peak temperature under an end-to-end deadline constraint for a pipelined real-time system on a multi-core processor is an important and challenging task. This paper focuses on this issue and addresses the optimization problem by reversely using the Pay-Burst-Only-Once (PBOO) principle. Our work is inspired by the work of Chen et al. [1], which minimizes the total power consumption of pipelined systems. However, their work cannot be directly transferred to temperature optimization for two reasons: (1) although temperature is a strong function of power, power management techniques that are effective for energy saving may not be suitable for temperature management [2], which has been theoretically proved in [3]; (2) the quadratic programming formulation in [1] cannot be reused, since calculating the temperature is based on convolution while computing the total power consumption is based on integration. Therefore, the problem of temperature minimization demands new analysis and a new approach, which are the main contributions of this paper:
• Based on the well-known HotSpot model, a peak temperature representation for a multi-core processor under Periodic Thermal Management (PTM) is given, where the heat flow between cores and the leakage current dependency on temperature (LDT) are considered.
• By reversely using the Pay-Burst-Only-Once principle, the optimization problem is transformed into a set of sub-problems. We formulate the sub-problem and solve it with a fast heuristic algorithm whose computing time grows approximately logarithmically with the number of stages.
• Based on two real-life platforms, a homogeneous ARM multi-processor and the Intel Single-chip Cloud Computer (SCC), we evaluate the effectiveness and efficiency of our approach by comparing it with two brute-force search approaches, one with PBOO and one without PBOO.

The rest of this paper is organized as follows: Section II gives a brief introduction to related work. Our system models are introduced in Section III, and Section IV shows a motivation example and presents the problem statement. We analyze the peak temperature in Section V and discuss our approach in Section VI. Section VII details the case studies and Section VIII concludes.
II. RELATED WORKS
In this section, we briefly review previous work on thermal-aware system scheduling policies and categorize it according to whether pipelined computing is considered.
examines only steady-state temperatures without considering the transient behavior. In this paper, we provide a peak temperature formulation that considers the transient temperature. Yong et al. [8] presented a feedback thermal control framework named Real-Time Multicore Thermal Control, which dynamically enforces both the desired temperature and the CPU utilization bounds for multi-core real-time systems through DVFS. Buyoung [9] addressed the problem of avoiding thermal hotspots on a multi-core chip by employing a runtime thermal-aware scheduler (TAS) that uses job-migration and power-gating techniques. In [10], Pradeep extended the concept of Thermal-Resiliency to multi-core architectures and then adopted a control-theoretic framework to ensure hard real-time deadlines in a dynamic thermal environment while maintaining the thermal constraints. The above works all assume simple task models, i.e., either periodic or sporadic tasks. In this paper, the task streams are modeled by a more general concept, the arrival curve, so that more information, such as the non-determinism of the event arrivals, is preserved in the model.
III. SYSTEM MODEL
Notation: All matrices and vectors are denoted by bold characters.
Hardware Model
In this paper, a multi-core processor that can handle partitioned applications is considered. The sub-tasks of a partitioned application can be mapped to and executed on different cores, which communicate with each other via FIFOs. Each core has two power dissipation modes, namely 'active' and 'sleep' mode. In 'active' mode, the core works with higher power consumption and processes input events at a fixed frequency. We also consider the mode-switching overhead. To switch core $i$ from 'active' mode to 'sleep' mode and back, $t^i_{swoff}$ and $t^i_{swon}$ time units are required, respectively. During mode switching, the power consumption equals that in 'active' mode. Moreover, no incoming event can be handled during mode switching or in 'sleep' mode. Due to the time overhead of mode switching, the time lengths for which a core stays in 'active' and 'sleep' mode must be larger than $t_{swon}$ and $t_{swoff}$, respectively: $t_{off} > t_{swoff}$, $t_{on} > t_{swon}$. For brevity, we define $\mathbf{t_{swoff}} = (t^1_{swoff}, t^2_{swoff}, \cdots, t^n_{swoff})$ and $\mathbf{t_{swon}} = (t^1_{swon}, t^2_{swon}, \cdots, t^n_{swon})$ for an $n$-core processor.

Task Model

We study streaming applications that can be split into several sub-tasks. To model general task arrivals, the concept of the arrival curve [11, 12, 13] is introduced in this paper. The upper arrival curve $\bar{\alpha}^u(\Delta)$ and the lower arrival curve $\bar{\alpha}^l(\Delta)$ are the upper and lower bounds of $R(t)$:

$$\bar{\alpha}^l(t - s) \le R(t) - R(s) \le \bar{\alpha}^u(t - s), \quad \forall\, 0 \le s \le t$$

where $R(t)$ is the cumulative workload and represents the number of tasks arriving in the time interval $[0, t)$. In the same way, the service curve $\beta(\Delta)$ is employed to model the available resources in any time interval $\Delta$. $\beta(\Delta)$ provides the upper and lower bounds of another cumulative function $C(t)$, which is the total amount of time slots that the processor provides to handle the tasks in the time interval $[0, t)$.
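The arrival-curve bound can be checked numerically on a concrete event trace. The following minimal sketch assumes a continuous, non-decreasing upper curve; the affine curve $0.15\Delta + 2$ is borrowed from the motivation example in Section IV, while the function name and the sample traces are hypothetical and serve only as an illustration.

```python
from typing import Callable, Sequence


def conforms_to_upper_curve(arrivals: Sequence[float],
                            alpha_u: Callable[[float], float]) -> bool:
    """Return True if no group of events violates the upper arrival curve.

    Assumption: alpha_u is continuous and non-decreasing, so it suffices to
    compare every group of consecutive events with alpha_u evaluated at the
    time span between the first and the last event of the group.
    """
    times = sorted(arrivals)
    for i in range(len(times)):
        for j in range(i, len(times)):
            count = j - i + 1
            span = times[j] - times[i]
            # Small tolerance guards against floating-point rounding.
            if count > alpha_u(span) + 1e-9:
                return False
    return True


def alpha_u(delta: float) -> float:
    """Affine upper arrival curve taken from the motivation example."""
    return 0.15 * delta + 2.0


ok_trace = [0.0, 0.0, 10.0, 20.0]   # hypothetical arrival times in ms
bad_trace = [0.0, 0.0, 0.0]         # three simultaneous events exceed the burst of 2
ok_result = conforms_to_upper_curve(ok_trace, alpha_u)
bad_result = conforms_to_upper_curve(bad_trace, alpha_u)
```

The quadratic scan over event pairs is chosen for clarity rather than speed; it is adequate for short traces.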
It is worth noting that the arrival curve is event-based and specifies the number of input events in any time interval $\Delta$, while the service curve describes the amount of available execution time. Therefore, the service curve $\beta(\Delta)$ should be transformed into an event-based service curve $\bar{\beta}(\Delta)$, which can be accomplished using the worst-case execution time (WCET) $c$ [14]. With these definitions, we
Thermal Model
The well-established thermal model HotSpot is employed to model the multi-core processor [15]. The vertical layout of the processor is modeled by four layers: the heat sink, heat spreader, thermal interface and silicon die layers [16]. Each layer is divided into a number of blocks according to the processing components on the die. Moreover, the blocks are ordered such that all the processing components occupy the beginning of the ordered list. To predict the temperature evolution, we take advantage of the well-known electro-thermal analogy [7, 10, 15, 16, 17, 18, 19, 20], i.e., the RC thermal network. Every thermal block is mapped onto a node of the thermal circuit. An example of the model can be found in Fig. 1. Finally, the temperature vector $\mathbf{T}(t)$ can be determined by a set of first-order differential equations [16]:

$$\mathbf{C} \cdot \frac{d\mathbf{T}(t)}{dt} = \left(\mathbf{P}(t) + \mathbf{K} \cdot \mathbf{T}^{amb}\right) - (\mathbf{G} + \mathbf{K}) \cdot \mathbf{T}(t) \qquad (3)$$

where $\mathbf{C}$ is the thermal capacitance matrix, $\mathbf{P}$ is the power dissipation vector, $\mathbf{K}$ is the thermal ground conductance matrix, $\mathbf{G}$ is the thermal conductance matrix and $\mathbf{T}^{amb}$ is the ambient temperature vector, which is defined as

$$\mathbf{T}^{amb} = T^{amb} \cdot [1, \cdots, 1]^{\top}$$

where $T^{amb}$ is the ambient temperature.
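We assume the power vector $\mathbf{P}$ is the sum of the power $\mathbf{P}_d$ due to dynamic current and the power $\mathbf{P}_l$ due to leakage current [18, 21]. In the 'active' and 'sleep' states, $\mathbf{P}_d$ is assumed to be constant, i.e., $\mathbf{P}_a$ and $\mathbf{P}_i$, respectively. The dependency of the leakage power on the temperature can be closely approximated by a linear function of the processor temperature [16, 17, 18, 22, 23]:

$$\mathbf{P}_l(t) = \mathbf{W} \cdot \mathbf{T}(t) + \mathbf{V}$$

where $\mathbf{W}$ is a diagonal matrix with constant coefficients and $\mathbf{V}$ is a vector with constant coefficients. Therefore, $\mathbf{P}$ can be represented as $\mathbf{P}(t) = \mathbf{W} \cdot \mathbf{T}(t) + \mathbf{V}_a$ in active mode or $\mathbf{P}(t) = \mathbf{W} \cdot \mathbf{T}(t) + \mathbf{V}_i$ otherwise, where $\mathbf{V}_a = \mathbf{V} + \mathbf{P}_a$ and $\mathbf{V}_i = \mathbf{V} + \mathbf{P}_i$. Rewriting (3) with this power formulation, we obtain the state-space representation of the thermal model:

$$\frac{d\mathbf{T}(t)}{dt} = \mathbf{A} \cdot \mathbf{T}(t) + \mathbf{B} \cdot \mathbf{u}(t)$$

where $\mathbf{u}(t)$ is the input vector, $\mathbf{A} = -\mathbf{C}^{-1} \cdot (\mathbf{G} + \mathbf{K} - \mathbf{W})$ and $\mathbf{B} = \mathbf{C}^{-1}$. Substituting the power formulation into (3) shows that $\mathbf{u}(t)$ collects the mode-dependent constant term ($\mathbf{V}_a$ or $\mathbf{V}_i$ for each core) and the ambient term $\mathbf{K} \cdot \mathbf{T}^{amb}$.

Since $\mathbf{A}$ and $\mathbf{B}$ are constant, the thermal model is a first-order linear time-invariant (LTI) system and the closed-form representation of the temperature is:

$$\mathbf{T}(t) = e^{\mathbf{A} \cdot t} \cdot \mathbf{T}_0 + \int_0^t \mathbf{H}(\xi) \cdot \mathbf{u}(t - \xi)\, d\xi$$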
where $\mathbf{H}(t) = e^{\mathbf{A} \cdot t} \cdot \mathbf{B}$ is the matrix describing the impulse response between any two nodes. The self-impulse response $H_{ii}(t)$ is a non-negative decreasing function [16]. Regarding $H_{ij}(t)$ where $i \neq j$, we adopt the conjecture proposed in [16] that $H_{ij}(t)$ is a non-negative unimodal function. Let $\mathbf{T}^{init}(t) = e^{\mathbf{A} \cdot t} \cdot \mathbf{T}_0$, $H_{ij}(t) = u_i(t) = 0$ for $t \le 0$, and $T^{conv}_{ij}(t) = \int_0^t H_{ij}(\xi) \cdot u_j(t - \xi)\, d\xi$; the temperature of node $i$ then yields:

$$T_i(t) = T^{init}_i(t) + \sum_{j} T^{conv}_{ij}(t)$$
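To make the closed-form representation concrete, the sketch below evaluates it for a single thermal node with a constant input and compares the result with a forward-Euler integration of the state-space equation. It is only an illustration under stated assumptions: the model is reduced to a scalar, `g_total` is a hypothetical lumped conductance standing in for $\mathbf{G} + \mathbf{K}$, and all parameter values are made up rather than taken from HotSpot.

```python
import math


def closed_form_temperature(t, T0, C, g_total, W, V, K, T_amb):
    """Scalar closed form T(t) = e^{a t} T0 + integral of e^{a s} (u / C) ds."""
    a = -(g_total - W) / C      # scalar counterpart of A = -C^{-1}(G + K - W)
    u = V + K * T_amb           # constant input while the power mode is fixed
    T_inf = -u / (C * a)        # steady state; requires g_total > W (stability)
    decay = math.exp(a * t)
    return decay * T0 + (1.0 - decay) * T_inf


def euler_temperature(t, T0, C, g_total, W, V, K, T_amb, dt=1e-4):
    """Forward-Euler integration of dT/dt = a T + u / C as a cross-check."""
    a = -(g_total - W) / C
    u = V + K * T_amb
    T = T0
    for _ in range(round(t / dt)):
        T += dt * (a * T + u / C)
    return T


# Hypothetical parameters: J/K, W/K, W/K, W, W/K, K.
params = dict(T0=300.0, C=0.03, g_total=0.3, W=0.02, V=1.5, K=0.3, T_amb=300.0)
T_closed = closed_form_temperature(0.5, **params)
T_euler = euler_temperature(0.5, **params)
```

Both values approach the steady state $u / (g_{total} - W)$; with the parameters above this is roughly 326.8 K.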
To reduce the difficulty of calculating the peak temperature, we examine the model closely and make two observations that shrink the temporal and spatial search spaces.
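1) For any $i$, $j$ and a sufficiently large $t$, we have $H_{ij}(t) \approx 0$ and $T^{init}_i(t) \approx 0$. The reason is that our system is BIBO stable [24, 25], which is assured if and only if $\int_{-\infty}^{\infty} |\mathbf{H}(t)|\, dt < \infty$ [25]. Considering that $H_{ij}(t)$ is a non-negative function, one can prove that $H_{ij}(t)$ approaches zero as $t$ approaches infinity. Since $\mathbf{T}^{init}(t) = e^{\mathbf{A} \cdot t} \cdot \mathbf{T}_0$ is governed by the same matrix exponential, $T^{init}_i(t)$ also has this property.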
2) The peak temperature can only occur on the processing component nodes. The intuition is that, according to the second law of thermodynamics, heat in our thermal model can only flow from a hotter node to a colder one. Since heat is generated by the processing components, the temperature on them will be higher.
IV. PROBLEM STATEMENT
In this paper, we deploy Periodic Thermal Management [26] to manage the temperature of the chip by periodically switching every stage between two power consumption states with an individual pair of $(t_{on}, t_{off})$. Therefore, two vectors, $\mathbf{t_{on}} = (t^1_{on}, t^2_{on}, \cdots, t^n_{on})$ and $\mathbf{t_{off}} = (t^1_{off}, t^2_{off}, \cdots, t^n_{off})$, should be determined offline to specify the PTMs deployed on the system. For the details of PTM, we refer to [26].

Now, we present a motivation example to illustrate the advantages of applying Pay-Burst-Only-Once (PBOO) for thermal optimization. For comparison, the PTM schemes are derived with two approaches: the PBOO-based approach (PBOO) and the one that partitions the end-to-end deadline into sub-deadlines for each stage, namely SDP (Sub-Deadline Partition).
Motivation example
In the example, an event stream with arrival curve $\alpha(\Delta) = 0.15\Delta + 2$ and deadline $D = 35$ ms passes through a two-stage pipelined system. The WCETs are set as $c_1 = c_2 = 1$ ms. We set $\mathbf{t_{off}} = (5, 13)$ ms and then compare the $t_{on}$ values generated by the two methods. Fig. 2 graphically illustrates the derivation process of the two methods.
Let's first examine the strategy of SDP. We divide the deadline D into two sub-deadlines, D 1 = 10ms and D 2 = 25ms, in this case. For simplicity, we adopt the bounded-delay function (BDF) [26] 
, then the slope of $bdf_{tot}$ can be determined to calculate $t_{on}$: $\rho_{tot} = 0.15$, which is much smaller than $\rho_1$ and $\rho_2$ and therefore results in smaller $t^1_{on} = 0.9$ ms and $t^2_{on} = 2.3$ ms. The pessimism of the SDP method comes from paying an additional burst and delay when $\bar{\alpha}$ is calculated for the second stage, as the pay-burst-only-once principle points out [11]. Moreover, as the number of stages increases, this effect accumulates and causes more pessimistic results. In contrast, PBOO directly calculates the total service demand and then retrieves $t_{on}$ for every stage, which pays the burst only once and yields better results. Since a smaller share of $t_{on}$ in each period means a lower processor temperature, employing PBOO achieves a lower peak temperature than SDP. Therefore, by reversely using pay-burst-only-once, we avoid paying the burst repeatedly and can better optimize the peak temperature of pipelined systems, especially for scenarios with many stages.
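The effect of paying the burst more than once can be seen in the textbook forward form of the principle. The sketch below is not the derivation used in this paper: it assumes a token-bucket arrival curve $r\Delta + s$ and hypothetical rate-latency service curves for the two stages, and compares the sum of per-stage delay bounds with the bound obtained against the concatenated service curve. The rates (0.2 events/ms) and latencies (5 ms, 13 ms) are illustrative values only.

```python
def delay_bound(burst, rate, latency):
    """Delay bound of a token-bucket flow through a rate-latency server."""
    return latency + burst / rate


def per_stage_delay(burst, arrival_rate, stages):
    """Sum the per-stage bounds; the burst is paid again at every stage."""
    total = 0.0
    for rate, latency in stages:
        if rate < arrival_rate:
            raise ValueError("service rate must not be below the arrival rate")
        total += delay_bound(burst, rate, latency)
        # The output of a rate-latency server is burstier than its input.
        burst += arrival_rate * latency
    return total


def end_to_end_delay(burst, arrival_rate, stages):
    """Bound against the concatenated service curve: the burst is paid once."""
    min_rate = min(rate for rate, _ in stages)
    if min_rate < arrival_rate:
        raise ValueError("service rate must not be below the arrival rate")
    total_latency = sum(latency for _, latency in stages)
    return delay_bound(burst, min_rate, total_latency)


# Arrival curve 0.15 * delta + 2 from the example; stage parameters are hypothetical.
stages = [(0.2, 5.0), (0.2, 13.0)]
separate = per_stage_delay(2.0, 0.15, stages)     # about 41.75 ms
combined = end_to_end_delay(2.0, 0.15, stages)    # about 28 ms
```

With these illustrative numbers, the per-stage analysis exceeds a 35 ms deadline while the end-to-end analysis meets it, which is the slack that the reverse use of the principle converts into shorter active times.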
Problem Statement

Now we define our problem as:
Given an $n$-stage pipelined platform specified by the above hardware and thermal models, an event stream with arrival curve $\alpha$ and end-to-end deadline $D$, and the WCETs $\mathbf{c} = (c_1, c_2, \cdots, c_n)$, our goal is to find the PTM schemes characterized by $\mathbf{t_{off}}$ and $\mathbf{t_{on}}$ such that the peak temperature is minimized while the deadline constraint is satisfied.
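As a small illustration of the decision variables, a per-stage PTM scheme can be represented as a record that checks the mode-switching constraints of the hardware model. This is a sketch under assumptions: the class and attribute names are hypothetical, the period is taken to be $t_{on} + t_{off}$, and the 0.5 ms overheads are made-up values. The $(t_{on}, t_{off})$ pairs are those of the PBOO result in the motivation example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PTMScheme:
    """One stage's periodic scheme; all times are in milliseconds."""
    t_on: float
    t_off: float
    t_swon: float
    t_swoff: float

    def respects_overheads(self) -> bool:
        # Hardware-model constraints: t_on > t_swon and t_off > t_swoff.
        return self.t_on > self.t_swon and self.t_off > self.t_swoff

    @property
    def period(self) -> float:
        # Assumption: one PTM period is an active phase plus a sleep phase.
        return self.t_on + self.t_off

    @property
    def active_share(self) -> float:
        return self.t_on / self.period


# PBOO values from the motivation example; the overheads are hypothetical.
schemes = [PTMScheme(0.9, 5.0, 0.5, 0.5), PTMScheme(2.3, 13.0, 0.5, 0.5)]
shares = [s.active_share for s in schemes]
```

Under these assumptions, both stages are active for roughly 15% of each period.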
V. PEAK TEMPERATURE ANALYSIS
In this section, we formulate the peak temperature of a multi-core processor that adopts PTM. Based on the first observation of the thermal model, some notations are defined. Let (1) $t^{conv}_{ij}$ denote the time point after which $H_{ij}(t)$ can be considered as zero, (2)

Lem. 3: For node $i$, the peak temperature $T_i$ equals the maximal local peak temperature that is reached after $t^{end}_i$.

The proofs of the above lemmas are omitted due to space limitations.
Lem. 4:
The peak temperature of node i is:
Proof: From Lem. 3, $T_i$ can be obtained by finding the maximum of $T_i(t)$ for $t \ge t^{end}_i$. According to Lem. 2, when $t \ge t^{end}_i$, $T_i(t)$ is a periodic function with period $t^{lcm}_p$. Therefore, the maximum of $T_i(t)$ is the local maximum within one period and can be formulated as (7).
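The argument of the proof suggests a simple numerical procedure: once the trace is periodic, scanning a single period is sufficient. The sketch below assumes that $t^{lcm}_p$ denotes the least common multiple of the per-core PTM periods, which is an interpretation of the notation rather than a definition given above. Times are converted to integer microseconds so that the least common multiple is well defined, and the temperature function is a hypothetical stand-in for $T_i(t)$.

```python
from functools import reduce
from math import gcd


def hyper_period_us(periods_us):
    """Least common multiple of integer periods given in microseconds."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods_us)


def peak_over_one_period(temperature, start_us, period_us, step_us=100):
    """Sample a periodic temperature trace over exactly one hyper-period."""
    samples = range(start_us, start_us + period_us + 1, step_us)
    return max(temperature(t) for t in samples)


# Periods of the motivation example: 0.9 + 5 ms and 2.3 + 13 ms.
period = hyper_period_us([5900, 15300])


def toy_temperature(t_us):
    """Hypothetical sawtooth trace in kelvin, periodic with 5900 us."""
    return 300.0 + (t_us % 5900) / 1000.0


peak = peak_over_one_period(toy_temperature, 0, period)
```

The sampling step bounds the resolution of the peak; a finer step or an analytical evaluation at the mode-switching instants would be needed for a tight result.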
Based on the above lemmas and definitions, we present our first main result in the following theorem.
Thm. 1:
For a multi-core processor with hardware and thermal models described above, when the PTM schemes characterized by t off and t on are applied, the peak temperature of the processor can be formulated as:
Proof: Based on the second observation of the thermal model, the peak temperature of the processor must be the peak temperature of one of the processing component nodes, which are the first $m$ nodes in the model. Therefore, we have $T = \max\{T_1, T_2, \cdots, T_m\}$.
VI. ALGORITHMS
In this section, we first transform our optimization problem into a set of sub-problems, which can be formulated as below, then provide a fast heuristic to solve the sub-problem.
A. Real-time analysis and formulation
Before giving the formulation, we first present the timing analysis that ensures all tasks finish before their deadlines. For an $n$-stage pipelined processor that employs PTM schemes characterized by $\mathbf{t_{off}}$ and $\mathbf{t_{on}}$, we define $K_i =$

1 See Thm. 1 in [1].

The right-hand side of the inequality can be upper-bounded by a set of minimum bounded-delay functions $bdf_{min}$

It is worth noting that $b$ can vary in a feasible region $[b_{min}, b_{max}]$, which is obtained from [26]. For a given $b$, the corresponding $\rho$ is calculated from:

, the deadline $D$ is satisfied when the following inequality holds:
which can be transformed to:
Then, from the peak temperature formulation, the sub-problem can be formulated for every individual pair of b and ρ.
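With this formulation, Algo. 1 provides the pseudo-code of our approach.

4: get $\rho$ from (10)
5: solve sub-problem (13) and get $T'$, $\mathbf{K}'$, $\mathbf{t_{off}}'$
6: if $T' < T_{min}$ then
7:   $T_{min} \leftarrow T'$, $\mathbf{K} \leftarrow \mathbf{K}'$, $\mathbf{t_{off}} \leftarrow \mathbf{t_{off}}'$
8: end if
9: end for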
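The outer loop of Algo. 1 can be sketched as a plain sweep over candidate values of $b$. The code below is a minimal illustration and not the implementation evaluated in this paper: `rho_of_b` stands in for (10) and `solve_subproblem` for sub-problem (13), both passed in as hypothetical callables, and the toy functions at the bottom exist only to exercise the loop.

```python
import math


def sweep_delay_parameter(b_values, rho_of_b, solve_subproblem):
    """Keep the sub-problem solution with the lowest peak temperature."""
    best_T, best_K, best_t_off = math.inf, None, None
    for b in b_values:
        rho = rho_of_b(b)                        # placeholder for (10)
        T, K, t_off = solve_subproblem(b, rho)   # placeholder for (13)
        if T < best_T:
            best_T, best_K, best_t_off = T, K, t_off
    return best_T, best_K, best_t_off


def toy_rho(b):
    """Hypothetical stand-in; the real expression is given by (10)."""
    return 1.0 / (40.0 - b)


def toy_solve(b, rho):
    """Hypothetical solver whose peak temperature is lowest at b = 10."""
    return 50.0 + abs(b - 10), [rho], [float(b)]


best_T, best_K, best_t_off = sweep_delay_parameter([5, 10, 15], toy_rho, toy_solve)
```

Since each sub-problem is independent of the others, the sweep could also be evaluated in parallel.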
B. Solving sub-problem
Now we present a fast algorithm which is inspired by the gradient descent method to solve problem (13).
Lem. 5: For problem (13), $K_i$ can be obtained safely by the following equation:

$$K_i = c_i \cdot \rho$$
Proof: From the definition of $K_i$, one can derive $t^i_{on} =$

$> 0$, which means that a larger $K_i$ results in a larger share of $t^i_{on}$, that is, a higher peak temperature. Therefore, $K_i$ should equal its lower bound $c_i \rho$ so that the peak temperature is not elevated unnecessarily.

Now, $\mathbf{t_{off}}$ remains to be determined. Intuitively, one can exhaustively search the whole search space to find the optimal solution. However, as the stage number $n$ increases, the search space expands approximately exponentially and this approach eventually becomes infeasible. For example, if $n = 8$ and every $t_{off}$ has 50 candidates in the search space, there are $50^8 = 3.90625 \times 10^{13}$ combinations in total. A brute-force search needs more than 120 years to finish if the computer can check 10000 combinations per second. Therefore, a more efficient algorithm is needed to solve sub-problem (13). Inspired by the gradient descent algorithm, we present a fast algorithm to find the optimal $\mathbf{t_{off}}$.
Algorithm 2 Sub-optimization with given $b$ and $\rho$
Input: Thermal model, $n$, $b$, $\rho$, $\mathbf{c}$, $\mathbf{t_{swon}}$, $\mathbf{t_{swoff}}$, $\xi$
Output: $\mathbf{K}$, $\mathbf{t_{off}}$
5:     for $i = 1$ to $n$ with step 1 do
7:     end for
8:     if $\min g(i) \ge 0$ then
9:       go = 0
10:     else
11:       find $i$ where $g(i) == \min g(i)$
13:     end if
14:   else
15:     for every pair of $i$, $j$ that $1 \le i, j \le n$ do
17:     end for
18:     if $\min G(i, j) \ge 0$ then
19:       go = 0
20:     else
21:       find $i$ and $j$ where $G(i, j) == \min G(i, j)$
23:     end if
24:   end if
25: end while

Algo. 2 outlines the pseudo-code of the algorithm. It takes the thermal model, the stage number $n$, the mode-switching overhead vectors, $b$, $\rho$, $\mathbf{c}$, and a fixed step size $\xi$ as input. The iteration starts at the initial point $\mathbf{t_{swoff}}$ (line 2). In every iteration, it first checks whether adding one $\xi$ to $\mathbf{t_{off}}$ would exceed the limit imposed by $b$ (line 4). If not, $\xi$ is added to one of the $t^i_{off}$ (line 12). If the limit of $b$ is hit, $\xi$ is simultaneously subtracted from another $t^j_{off}$ (line 22). In every iteration, all possible directions are checked and the direction leading to the steepest descent is selected to update $\mathbf{t_{off}}$ (lines 5-7, 11-12, 15-17, 21-22). The algorithm iterates until no move of step $\xi$ in any possible direction yields a lower peak temperature (lines 8-9, 18-19). It is worth noting that, based on a set of systematic simulations, we conjecture that the peak temperature in (8) is a convex function of $t^i_{off}$, which means $T(\mathbf{t_{off}})$ has only one minimum in the domain of $\mathbf{t_{off}}$. Under this conjecture, the result found by Algo. 2 is the global minimum in the search space.
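The search strategy described above can be sketched as a greedy step search on a grid. The code below is a simplified illustration under explicit assumptions, not the exact Algo. 2: the limit imposed by $b$ is modeled as a hypothetical budget on the sum of the $t_{off}$ entries, and `peak` is any callable standing in for the peak temperature of Thm. 1 (here a made-up convex function).

```python
def greedy_t_off(peak, t_swoff, budget, step):
    """Greedy step search for t_off starting from the switching overheads.

    Assumption: the limit imposed by b is modeled as sum(t_off) <= budget.
    """
    t_off = list(t_swoff)
    n = len(t_off)
    while True:
        current = peak(t_off)
        moves = []
        if sum(t_off) + step <= budget + 1e-12:
            # Budget not yet hit: try growing each entry by one step.
            for i in range(n):
                trial = t_off[:]
                trial[i] += step
                moves.append((peak(trial) - current, trial))
        else:
            # Budget hit: move one step from entry j to entry i.
            for i in range(n):
                for j in range(n):
                    if i == j or t_off[j] - step < t_swoff[j]:
                        continue
                    trial = t_off[:]
                    trial[i] += step
                    trial[j] -= step
                    moves.append((peak(trial) - current, trial))
        if not moves:
            return t_off
        gain, best = min(moves, key=lambda move: move[0])
        if gain >= 0:
            return t_off    # no direction lowers the peak temperature
        t_off = best


def toy_peak(t_off):
    """Hypothetical convex stand-in for the peak temperature of Thm. 1."""
    return 50 + (t_off[0] - 3) ** 2 + (t_off[1] - 6) ** 2


loose = greedy_t_off(toy_peak, [1, 1], budget=10, step=1)
tight = greedy_t_off(toy_peak, [1, 1], budget=6, step=1)
```

Every accepted move strictly lowers the objective, so the loop terminates on a bounded grid. Whether the returned point is a global minimum depends on the convexity conjecture stated above.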
VII. CASE STUDIES
We evaluate the effectiveness and feasibility of our proposed approach in this section. Three approaches are compared: (1) our gradient-descent-based PBOO algorithm (GD), (2) the brute-force-search-based PBOO algorithm (BS), and (3) the brute-force-search Sub-Deadline Partition algorithm introduced in Section IV (SDP). SDP exhaustively examines all possible sub-deadline partitions and returns the one yielding the lowest peak temperature.
A. Setup
We implement the approaches on two simulation platforms: (1) a homogeneous multi-processor ARM platform with eight cores (ARM) and (2) the Single-Chip Cloud Computer (SCC), a processor created by Intel that has 48 distinct physical cores [27]. The power and thermal parameters of the two platforms come from [16, 28] and parameter calibration. The thermal matrices $\mathbf{G}$, $\mathbf{C}$ and $\mathbf{K}$ are obtained from the HotSpot toolbox. All simulations are performed on a computer with an Intel i7-4770 processor and 16 GB of memory. To determine the layout of the activated cores, we select and activate the $n$ cores located closest to core #1.
Our simulation runs the peak temperature optimization for three partitioned applications: (1) the H.263 decoder application modeled by four tasks [29], (2) the MP3 decoder application, which can be split into five tasks [29], and (3) the MADplayer application, which consists of five tasks [30]. Their worst-case execution times are determined and scaled from [29] and [30]. All the mode-switching overheads are $\mathbf{t_{swoff}} = \mathbf{t_{swon}} = (1, \cdots, 1)$ ms. The activation periods of the three applications are set to 50 ms, 60 ms and 50 ms, respectively. The relative deadline $D$ is determined by the deadline factor $\delta$: $D = \delta \times p$, where $p$ is the activation period.
B. Results
The three applications are executed with deadline factor $\delta = 0.9$, and the peak temperatures on the different platforms are examined for three- and four-stage scenarios. Fig. 3 provides the results on the ARM platform, while Fig. 4 shows those on the SCC platform. From the figures we can see that: (1) In all cases, the peak temperatures obtained from BS and GD are identical in value, which supports the effectiveness of our fast heuristic method. (2) For both platforms, the temperature difference between our PBOO-based algorithms and SDP grows as the stage number increases. This is because SDP pays the burst more times as the stage number increases and therefore returns higher peak temperatures. (3) Compared to ARM, the peak temperature and the gap between the approaches are much lower on the SCC platform, which is due to (a) the differences in the thermal parameters, such as chip thickness and heat sink size, and (b) the fact that SCC has 48 cores, so turning on only three or four cores does not warm the whole chip sufficiently and the heat can be conducted to the environment faster.

To further confirm the effectiveness and feasibility of our approach, we simulate a randomly generated application on the two platforms and increase the stage number $n$ from 2 up to 8 on ARM and up to 12 on SCC (simulating up to 48 stages would need a huge figure to display the results, and simulating up to 12 stages is enough to assess the scalability). The WCETs of the sub-tasks are randomly generated within [6, 8] ms and the application is activated every 100 ms with $\delta = 1.2$. The results are shown in Fig. 5. Because SDP and BS suffer from search space explosion as the stage number increases, we terminate their simulations when $n$ reaches 7 and 11, respectively. For clarity, only the time expense on the SCC platform is shown in the figure.
Observe that the time required by GD is generally the lowest and that its curve is nearly flat as $n$ increases, which indicates that GD is feasible for pipelined systems with many stages. Fig. 5a also shows that the computing time consumed by SDP is always the highest and grows exponentially. This is because SDP examines all possible deadline partitions, the number of which increases exponentially with the stage number. Moreover, computing the service demand for every following stage requires numerical min-plus convolution, which incurs significant computation and memory overhead. Similarly, we find that the time overhead of BS grows exponentially as the stage number increases. Therefore, SDP and BS are not scalable with the stage number in terms of their computing resource requirements. As Fig. 5b shows, the peak temperature generated by GD always equals that from BS, which further strengthens the evidence for the effectiveness of GD. We notice that the peak temperature gap between the approaches is bigger on the ARM platform than on SCC, as explained above. Fig. 5b also demonstrates that the temperature difference between SDP and GD widens as the stage number increases, which is expected because SDP pays the burst more times and therefore generates PTM schemes in which $t_{on}$ occupies a larger share of the period.
VIII. CONCLUSION
We have proposed a new approach to minimize the peak temperature of a pipelined hard real-time system by reversely utilizing the Pay-Burst-Only-Once principle. The problem is transformed into a set of sub-problems, and a fast heuristic algorithm is proposed to solve them. We simulate our approach on two real platforms with real-life applications, and the results show that our approach reduces the peak temperature more effectively than the approach without PBOO, especially in many-stage scenarios. It is also shown that the time expense of our fast algorithm grows approximately logarithmically with the stage number, indicating that the algorithm is scalable with the number of stages.
