Energy optimization is a critical design concern for embedded systems. Combining DVFS and DPM is considered a preferable technique for reducing energy consumption. Optimal DVFS+DPM algorithms exist in the literature for periodic independent tasks running on a uniprocessor. An optimal combination of DVFS and DPM for periodic dependent tasks on multicore systems, however, has not yet been reported. The challenge of this problem is that the idle intervals of cores are not easy to model. In this article, a novel technique is proposed to directly model the idle intervals of individual cores such that both DVFS and DPM can be optimized at the same time. Based on this technique, the energy optimization problem is formulated by means of mixed integer linear programming. We also present techniques to prune the exploration space of the formulation. Experimental results using real-world benchmarks demonstrate the effectiveness of our approach compared to existing approaches.
INTRODUCTION
With the increasing requirement for high performance, multicore architectures such as MPSoCs (Multiprocessor Systems-on-Chip) are believed to be the major solution for future embedded systems, for instance, electric vehicles [Lukasiewycz et al. 2012]. Chip makers have released several MPSoCs, for instance, ARM Cortex-A15 MPCore [ARM 2012], Intel Atom processors [Intel 2009], and Marvell ARMADA MV78460 multicore processors [Marvell 2012]. Many real-time applications, especially streaming applications, can be executed on multiple processors simultaneously to achieve parallel processing. When real-time applications are executed on multicore architectures, minimizing the energy consumption is one of the major design goals, because an energy-efficient design increases the reliability and decreases the heat dissipation of the system.
This work has been partly funded by German BMBF projects ECU (grant number: 13N11936) and Car2X (grant number: 13N11933). Authors' addresses: G. Chen, K. Huang, and A. Knoll, Technical University Munich, Boltzmannstraße 3, 85748 Garching, Germany; email: kai.huang@in.tum.de. © 2014 ACM 1539-9087/2014/03-ART111. DOI: http://dx.doi.org/10.1145/2567935
Power consumption of processors mainly comes from dynamic power consumption due to switching activity and static power consumption due to the leakage current [Jejurikar et al. 2004]. In micrometer CMOS technology, dynamic power dominates the power consumption of processors. As the number of processing cores on a chip increases (e.g., Intel has produced a 48-core x86 processor as the Single-chip Cloud Computer [Intel 2012]), chip density increases, which leads to a dramatic increase of static power. According to the International Technology Roadmap for Semiconductors [ITRS 2011], leakage power increases its dominance of total power consumption as semiconductors progress toward 32nm. Thus, both dynamic power consumption and static power consumption need to be considered in the overall energy optimization of the system.
To reduce dynamic and static power consumption, two main mechanisms can be employed, that is, Dynamic Voltage Frequency Scaling (DVFS) and Dynamic Power Management (DPM), respectively. DVFS reduces the dynamic power consumption by dynamically adjusting the voltage and frequency of a processor [Jha 2001]. The disadvantage of this technique is that it offers no means to reduce static power consumption. There are two types of DVFS policies, that is, intratask DVFS and intertask DVFS. Intratask DVFS allows the processor frequency to be changed within a single task, while intertask DVFS only allows the frequency to be changed at task boundaries. We consider intertask DVFS in this article. DPM, on the other hand, exploits the idle intervals of a processor and switches the processor to a sleep mode with low static power consumption to reduce the static power [Jha 2001]. The limitation of DPM is that mode switching incurs additional energy and latency penalties. It is worthwhile to switch the processor to sleep mode only when the idle interval is longer than a certain threshold called the break-even time [Chen et al. 2013; Cheng and Goddard 2006].
In principle, DVFS and DPM counteract each other with respect to energy reduction, and the trade-off between them plays a critical role in reducing energy consumption [Gerards and Kuper 2013]. Concerning DVFS, lower frequencies result in lower dynamic power consumption, but they prolong the task execution time and shorten the idle intervals. Therefore, DVFS techniques in general reduce the opportunities for reducing static power. On the other hand, although running the system at higher frequencies can create longer sleep intervals and save more static power, DPM then costs more dynamic power and incurs mode-switch overheads. In the last decade, researchers believed that, between DVFS and DPM, DVFS should be exploited before DPM [Jha 2001; Srinivasan and Chatha 2007]. However, as transistor technology shifts toward submicron domains (e.g., Intel shifted its manufacturing technology to 22nm in 2011 [Intel 2011]), the static power increases exponentially and becomes comparable to or even greater than the dynamic power. According to Huang et al. [2011], the static power accounts for as much as 50% of the total power dissipation for high-end processors in 90nm technologies. Thus, DPM becomes more and more important for energy-efficient design, and applying DVFS before DPM will result in a suboptimal solution. A motivation example will further elaborate this issue in Section 4. Therefore, the motivation of our work is to globally integrate DVFS and DPM and find the best trade-off to reduce the total energy consumption of a system.
In this article, we present a novel energy optimization technique which optimally integrates DVFS and DPM in real-time multicore systems. We consider multicore systems in which DVFS and DPM can be applied to each core independently and each core operates at several discrete voltage and frequency levels. For a given set of applications represented as directed acyclic graphs and a mapping of the applications onto the MPSoC, our approach generates an optimal time-triggered non-preemptive schedule and an optimal frequency assignment for each task such that the total energy consumption of the MPSoC is minimized. Based on this optimal schedule, the tasks can be scheduled with the insertion of voltage-setting and mode-switching instructions. Energy-efficient code can be generated by integrating this optimal approach into compilers and real-time OSs. The contributions of our work can be summarized as follows.
-The challenge of globally integrating DVFS and DPM is that the idle intervals of the processors are hard to model because the schedule is not yet determined at the optimization stage. In contrast to the work in Srinivasan and Chatha [2007], which first integrates only DVFS with scheduling and then separately applies DPM as a final stage to generate system-level low-power designs, we propose a key technique to directly model the idle intervals of individual processors and integrate DVFS and DPM within scheduling.
-We develop an energy-minimization formulation that globally integrates DVFS and DPM and solve it by means of mixed integer linear programming. In contrast to the work in Wang et al. [2011], which is based on a genetic algorithm, our approach is guaranteed to find the optimum.
-A refinement technique is presented to prune the design space of our formulation.
-We conduct simulations using real-life applications and demonstrate the effectiveness of our approach by extensive experiments, comparing with Srinivasan and Chatha [2007].
The rest of the article is organized as follows: Section 2 reviews related work in the literature. Section 3 presents the basic models and the definition of the studied problem. Section 4 presents the motivation example and Section 5 describes the proposed approach. Experimental evaluation is presented in Section 6 and Section 7 concludes the article.
RELATED WORK
DVFS is one of the most effective techniques for energy optimization and has been used for more than a decade. Many DVFS scheduling techniques for MPSoCs have been proposed in the literature. For independent frame-based task sets, Chen and Kuo [2005] proposed an approximation algorithm based on a Kuhn-Tucker optimality condition on homogeneous multiprocessors. Hung et al. [2006] explored energy-efficient scheduling of periodic independent real-time tasks on a heterogeneous multiprocessor. Based on level packing, Xu et al. [2012] addressed the energy minimization problem for parallel independent task systems with discrete operation modes under timing constraints. For dependent tasks on MPSoCs, Zhang et al. [2002] proposed a two-phase framework that integrates task scheduling and voltage selection to minimize the energy consumption of real-time dependent tasks. Based on list scheduling, Gruian and Kuchcinski [2001] proposed a DVS scheduling technique which scales the delays of all tasks by the ratio of the timing constraint to the critical path length.
The combined DVS+DPM approach is considered a preferable technique for low-power systems [Jha 2001]. However, only a few works have considered the combination of DPM and DVS. For uniprocessors, Devadas and Aydin [2012] addressed energy minimization combining DVFS and DPM only for independent frame-based tasks, where DVFS is employed on the uniprocessor and DPM is used for peripheral devices. Based on Devadas and Aydin [2012], the state-of-the-art work in Gerards and Kuper [2013] presented a schedule for independent frame-based tasks that globally minimizes the energy consumption. However, these approaches cannot be applied to multicore platforms. In the context of MPSoCs, Bhatti et al. [2011] proposed a machine-learning mechanism that adapts the DVFS and DPM policy at runtime for independent tasks to achieve energy efficiency. For dependent tasks, Srinivasan and Chatha [2007] presented a mixed-integer linear programming (MILP) formulation, which integrates DVFS with pipelined scheduling and applies DPM as the final design step. This work applies DVFS before DPM, which results in a suboptimal solution. Based on task pipelining, Wang et al. [2011] proposed a two-phase approach to optimize the energy consumption. In the first phase, a periodic dependent task graph is transformed into a set of independent tasks by a retiming technique. In the second phase, a scheduling algorithm based on a genetic algorithm is used to optimize the energy consumption. However, this work cannot guarantee the exact optimal solution because it is based on a genetic algorithm. Besides, the approaches in Srinivasan and Chatha [2007] and Wang et al. [2011] can only handle one task graph.
In contrast to prior work, we consider the problem for MPSoCs in a way that globally integrates DVFS and DPM with scheduling, rather than applying DVS before DPM [Srinivasan and Chatha 2007], and we generate an exact optimal schedule for dependent tasks that minimizes the overall energy consumption.
MODELS AND PROBLEM DEFINITION

Hardware Model
In this article, we consider a typical multicore system [Wang et al. 2011] with local memories, as shown in Figure 1. The multicore system consists of M cores P = { p 1 , p 2 , . . . , p M }. The cores are connected via a high-bandwidth shared bus. A bus arbiter implements a given bus protocol and assigns bus access rights to individual cores. In this article, we adopt a time-triggered bus based on the time-division multiple access (TDMA) protocol.
Task Model
We consider the functionality of the entire system as an application set A, which consists of a set of independent periodic applications. An application J ∈ A is modeled as a directed acyclic task graph G (V, E, H), where the vertices V denote the set of tasks T to be executed, the edges E represent data dependencies between tasks, and H denotes the period of the application. The deadline D of the application is equal to its period. We use w ij to denote the worst case execution time (WCET) of task T i ∈ V under frequency f j and W i = {w i1 , w i2 , . . . , w is } to denote the WCET profile of task T i , where s is the total number of available frequency levels. In this article, we abstract the communication between two tasks mapped to different cores as a special task, which is mapped to the bus and whose execution time models the communication overhead. Thus, communication tasks can also be integrated into the task graph.
Time-triggered scheduling can offer a fully deterministic real-time behavior for safety-related systems. Current practice in many safety-critical domains, such as electric vehicle [Lukasiewycz et al. 2012 ] and avionics systems [Lin et al. 2007 ], favors time-triggered approach [Baruah and Fohler 2011] . In this article, we consider a periodic time-triggered non-preemptive scheduling policy. We use R to denote the set of the profiles for all tasks in applications set A. A task profile r i ∈ R is defined as a 
Energy Model
In this article, we assume cores in a MPSoC can support both DVFS and DPM. For DVFS, each core has s different voltage/frequency levels, which are denoted as
The voltage/frequency levels are sorted in ascending order. Each frequency level has an associated power consumption, and thus we have a set of power values {P 1 , P 2 , . . . , P s } corresponding to the voltage/frequency levels with P 1 < P 2 < · · · < P s . We adopt intertask DVFS in this article [Srinivasan and Chatha 2007; Wang et al. 2011; Zhong and Xu 2008], hence the frequency of the core stays constant for the entire duration of a task's execution. Besides, we adopt the same assumption as [Li and Wu 2012; Singh et al. 2013; Srinivasan and Chatha 2007; Zhang et al. 2002] and assume that the energy consumption of a task is determined by the assigned frequency. Thus, a task has a uniform power consumption during its entire execution time. Note that our approach can also be extended to the case in which the energy consumption is associated with the tasks.
The analytical processor energy model in [Martin et al. 2002; Wang and Mishra 2010; Jejurikar et al. 2004 ] is adopted in this article, whose accuracy has been verified by SPICE simulation. The dynamic power consumption of the core on one voltage/frequency level (V dd , f ) can be given by:
where V dd is the supply voltage, f is the operating frequency and C e f f the effective switching capacitance. The cycle length t cycle is given by a modified alpha power model which is verified by SPICE simulation [Martin et al. 2002; Wang and Mishra 2010; Jejurikar et al. 2004] .
where K 6 is technology constant and L d is estimated by the average logic depth of all instructions critical path in the processor. The threshold voltage V th is given below.
where V th1 , K 1 , K 2 are technology constants and V bs is the body bias voltage. The static power is mainly contributed by the subthreshold leakage current I subn , the reverse bias junction current I j and the number of devices in the circuit L g . It can be presented by:
where the reverse bias junction current I j is approximated as a constant and the subthreshold leakage current I subn can be determined as:
where K 3 , K 4 , K 5 are technology constants. To avoid junction leakage power overriding the gain in lowering I subn , V bs should be constrained between 0 and -1V. Thus, the power consumption of the processor under each voltage/frequency (V dd , f ) can be computed as:
where P on is the inherent power needed to keep the processor on, which is related to the idle power.
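The model above can be exercised numerically. The following Python sketch assumes the commonly cited form of the processor model in [Martin et al. 2002; Jejurikar et al. 2004]: dynamic power C eff · V dd ² · f, cycle time L d K 6 /(V dd − V th )^α, threshold voltage V th1 − K 1 V dd − K 2 V bs , subthreshold current K 3 e^(K 4 V dd ) e^(K 5 V bs ), and static power L g (V dd I subn + |V bs | I j ). The constant values below are illustrative assumptions for a 70nm process and are not taken from the tables of this article; the exponent α and all function names are likewise assumptions made for this sketch.

```python
import math

# ASSUMED illustrative 70nm technology constants (not taken from this article).
K1, K2, K3, K4, K5, K6 = 0.063, 0.153, 5.38e-7, 1.83, 4.19, 5.26e-12
V_TH1, ALPHA = 0.244, 1.5          # base threshold voltage, alpha-power exponent
I_J, C_EFF = 4.8e-10, 0.43e-9      # junction current, effective capacitance
L_D, L_G = 37, 4e6                 # logic depth, number of devices


def threshold_voltage(v_dd, v_bs):
    """V_th as a linear function of supply and body bias voltage."""
    return V_TH1 - K1 * v_dd - K2 * v_bs


def frequency(v_dd, v_bs):
    """Operating frequency, the inverse of the alpha-power cycle time."""
    t_cycle = L_D * K6 / (v_dd - threshold_voltage(v_dd, v_bs)) ** ALPHA
    return 1.0 / t_cycle


def dynamic_power(v_dd, v_bs):
    """Switching power C_eff * V_dd^2 * f."""
    return C_EFF * v_dd ** 2 * frequency(v_dd, v_bs)


def static_power(v_dd, v_bs):
    """Leakage power from subthreshold and reverse bias junction currents."""
    i_subn = K3 * math.exp(K4 * v_dd) * math.exp(K5 * v_bs)
    return L_G * (v_dd * i_subn + abs(v_bs) * I_J)


def total_power(v_dd, v_bs, p_on):
    """Total active power: dynamic + static + inherent on-power."""
    return dynamic_power(v_dd, v_bs) + static_power(v_dd, v_bs) + p_on


if __name__ == "__main__":
    # Five voltage levels with 50 mV steps and V_bs = -0.7 V, as in Section 6.
    for mv in range(650, 851, 50):
        v = mv / 1000.0
        print(f"{v:.2f} V  f={frequency(v, -0.7) / 1e9:.2f} GHz  "
              f"P_dyn={dynamic_power(v, -0.7):.3f} W  "
              f"P_sta={static_power(v, -0.7):.3f} W")
```

Under these assumed constants, both frequency and dynamic power grow with the supply voltage while static power grows more slowly, which is the behavior the DVFS/DPM trade-off relies on.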
Considering the overhead of switching the processor between active mode and sleep mode, the processor break-even time T BET indicates the minimum length of time that the processor should stay in sleep mode. If the interval during which the processor can stay in sleep mode is shorter than T BET , the mode-switch overheads are larger than the energy saving, and the mode switch is therefore not worthwhile. The break-even time T BET can be defined as follows:
where t sw and E sw denote the total state transition time and energy overhead, respectively. P idle and P sleep respectively represent the idle power and sleep power and we have P idle > P sleep . Given a time-triggered schedule S, the total energy consumption E t (S) in one hyperperiod can be represented as follows:
where E d (S) is the total energy consumption when the cores are executing tasks, E i (S) is the total energy consumption when the cores stay in idle mode, E s (S) is the total energy consumption when the cores stay in sleep mode, and E ov (S) is the energy consumption due to the overhead of mode switches. The energy consumption of executing task T i running at frequency f j is
When the idle interval of the core is less than the break even time T BET , that is, t idle < T BET , the core should not enter sleep mode and should stay at idle mode with high energy consumption P idle .
The energy consumed in the sleep mode, E s , is calculated by
where the sleep interval t sleep should be longer than the break even time T BET , that is, t sleep ≥ T BET , and P sleep is the power consumption when the core stays at the sleep mode.
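The idle/sleep decision described above can be sketched in Python. The break-even formula used here, T BET = max(t sw , (E sw − P sleep · t sw )/(P idle − P sleep )), is a common definition that is consistent with the symbols introduced in this section, but it is an assumption of this sketch. The accounting of the sleep energy (transition energy plus sleep power for the remainder of the interval) and the function names are also assumptions.

```python
def break_even_time(t_sw, e_sw, p_idle, p_sleep):
    """Minimum idle length for which sleeping costs no more than idling."""
    if p_idle <= p_sleep:
        raise ValueError("the model requires P_idle > P_sleep")
    return max(t_sw, (e_sw - p_sleep * t_sw) / (p_idle - p_sleep))


def idle_interval_energy(t_idle, t_bet, t_sw, e_sw, p_idle, p_sleep):
    """Return (energy, sleeps) for one idle interval of length t_idle."""
    if t_idle >= t_bet:
        # Sleep: pay the transition energy, then sleep power afterwards.
        return e_sw + p_sleep * (t_idle - t_sw), True
    # Too short to amortize the mode switch: stay in idle mode.
    return p_idle * t_idle, False


if __name__ == "__main__":
    # E_sw and P_sleep follow Section 6; P_idle = 276 mW and t_sw = 1 ms
    # are placeholder assumptions for illustration only.
    t_bet = break_even_time(t_sw=1e-3, e_sw=385e-6, p_idle=0.276, p_sleep=80e-6)
    for t in (0.5e-3, 5e-3):
        print(t, idle_interval_energy(t, t_bet, 1e-3, 385e-6, 0.276, 80e-6))
```

With this accounting, the two branches yield the same energy exactly at the break-even point whenever the second argument of the maximum is the larger one.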
Problem Statement
Given an application set A with task profiles R, a multicore architecture with M cores P = {p 1 , p 2 , . . . , p M }, each core with s discrete voltage/frequency levels, and a task mapping {T → P}, our goal is to find a voltage assignment for each task and a time-triggered schedule S such that the total energy E t (S) is minimized while the timing constraints of all applications are guaranteed.
MOTIVATION
In this section, we present a motivation example to show that the strategy of applying DVFS before DPM cannot generate the optimal solution. Consider two applications
2 }, respectively. For simplicity, communication overhead is not considered in this motivation example. The period of the application J 1 and J 2 are respectively 120ms and 60ms. The MPSoC hardware has a dual-core architecture P = {p 1 , p 2 }. Each core has two voltage/frequency levels, that is, high level f H and low level f L . The normalized 
frequency is 1 at the high level and 0.5 at the low level. The dynamic power at the high level and low level is 0.4W and 0.1W, respectively. The static power at the high level and low level is 0.16W and 0.12W, respectively. P on , which is related to the idle power, is 0.15W. The time overhead and energy overhead of a run-sleep mode switch are 25ms and 1mJ, respectively. The task profiles and mapping are listed in Table I. We compare two different schemes, that is, applying DVFS before DPM (SubOPT) and globally integrating DVFS and DPM (OPT). The frequency assignment and total power consumption P are listed in Table II and the schedules of these two schemes are shown in Figure 2. OPT achieves about 9.1% energy savings with respect to SubOPT. In Figure 2(a), all idle intervals are shorter than the break-even time and the system must stay on all the time. In SubOPT, DVFS assigns frequencies as low as possible, which increases the execution time of the tasks and shortens the idle intervals of the system. In contrast, OPT avoids this. In Figure 2(b), we can see that, by increasing the frequencies of some tasks, some idle intervals are extended far enough that core 2 can enter sleep mode. Besides, by adjusting the frequencies, some tasks on core 1 can run at a lower frequency for further power savings.
PROPOSED APPROACH
This section presents our mixed integer linear programming (MILP) approach for integrating DVFS and DPM into task scheduling. We start with an MILP formulation that focuses only on the scheduling problem. Then, we introduce a key technique to model the idle intervals of the cores and integrate DPM into the formulation. Based on the observation that the MILP formulation may suffer from state explosion, we develop a refinement, the so-called execution window analysis, to reduce the exploration space of the formulation.
Time-Triggered Task Scheduling
In this article, we consider time-triggered non-preemptive schedule. For each task T i with the profile
contains the WCETs of the task T i under different frequency settings. We use a set of binary variables c ij to describe the frequency assignment of the task T i : c ij = 1 if the task T i executes with frequency f j and c ij = 0 otherwise. In this case, the actual WCET of T i can be obtained as $\sum_{j=1}^{s} c_{ij} w_{ij}$, where s is the total number of available discrete frequency levels. As we adopt intertask DVFS, each task can be assigned only one frequency.
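As a small illustration of this encoding, the following Python sketch evaluates the actual WCET of one task from its binary frequency selection and checks that exactly one frequency is chosen. The function name and the WCET values in the usage are hypothetical.

```python
def selected_wcet(c_i, w_i):
    """Actual WCET of one task: sum over j of c_ij * w_ij.

    c_i: list of 0/1 frequency selectors for the task (one per level).
    w_i: WCET profile of the task (one WCET per frequency level).
    """
    if len(c_i) != len(w_i):
        raise ValueError("selector and WCET profile must have equal length")
    if any(c not in (0, 1) for c in c_i) or sum(c_i) != 1:
        raise ValueError("exactly one frequency level must be selected")
    return sum(c * w for c, w in zip(c_i, w_i))


if __name__ == "__main__":
    # Hypothetical profile: 40 ms at the low level, 20 ms at the high level.
    print(selected_wcet([0, 1], [40, 20]))
```

In the MILP itself the selectors are decision variables, so the one-frequency rule becomes a linear equality over c ij rather than a runtime check.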
To formulate the scheduling problem by means of MILP, we have to cope with the task dependency, deadlines, and non-preemption. We present our formulation as follows.
Let ξ denote the overheads for dynamic frequency scaling and task switch. The data dependency T j → T i requires the start time of T i to be no earlier than the finish time of T j . Note that T i or T j can also be the communication task.
For deadline constraint, task T i has to finish no later than its deadline:
The non-preemptive constraint requires that any two tasks mapped to the same core must not overlap in time, as well as the communication tasks in the bus. 
The constraints (15) and (16) ensure that either the instance of T p runs strictly before the instance of T p′ , or vice versa. Note that (15) and (16) can also be applied to the non-preemption constraint of communication tasks on the bus.
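A pairwise ordering constraint of this kind is usually written as a big-M disjunction. The following Python sketch checks such a pair of inequalities for given start times. The exact form of the inequalities, the constant `big_m`, and the function name are assumptions of this sketch and may differ from (15) and (16); z = 1 is taken to mean that the first instance runs before the second.

```python
def non_overlap_ok(s_p, w_p, s_q, w_q, z, xi, big_m):
    """Check a big-M non-overlap pair for two instances on one resource.

    s_*: start times, w_*: selected WCETs, xi: switching overhead,
    z: binary order variable (1 if the first instance runs first).
    """
    # Active when z == 1: the first instance must finish before the second starts.
    first = s_p + w_p + xi <= s_q + big_m * (1 - z)
    # Active when z == 0: the second instance must finish before the first starts.
    second = s_q + w_q + xi <= s_p + big_m * z
    return first and second


if __name__ == "__main__":
    # Disjoint placement: only the matching order variable is feasible.
    print(non_overlap_ok(0, 10, 12, 5, z=1, xi=1, big_m=1000))
    print(non_overlap_ok(0, 10, 12, 5, z=0, xi=1, big_m=1000))
    # Overlapping placement: infeasible for both values of z.
    print(non_overlap_ok(0, 10, 5, 5, z=1, xi=1, big_m=1000))
```

The constant `big_m` only needs to exceed the longest possible distance between two start times, for example the hyper-period plus the largest WCET.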
DPM Representation
5.2.1. Challenge. The challenge of globally integrating DVFS and DPM is that the idle intervals of the cores, which are determined by the schedule, are difficult to model, while the schedule itself is not yet known at the optimization stage. To illustrate this challenge, we give a simple example in Figure 3. Tasks T 1 , T 2 , T 3 , and T 4 run on the same core with periods of h, h, 2h, and h, respectively. We can make the following observations.
-The idle interval of a task is determined by its closest task. For example, the idle interval I 1 of task T 1 is determined by the task T 3 , while I 2 is determined by the task T 2 . However, it is unknown which two tasks are closest on the time axis because the execution order is not known yet.
-The idle interval of the last task in one hyper-period is determined by the first task in the following hyper-period. For example, the idle interval of the last task T 4 is composed of I 3 and I 4 . Thus, we cannot represent the idle intervals within one hyper-period.
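The number of task instances that must be ordered in this example can be checked with a few lines of Python. The sketch below assumes integer periods (here h is normalized to 1) and computes the hyper-period as the least common multiple of the periods; the function name is hypothetical.

```python
from functools import reduce
from math import gcd


def instances_per_hyper_period(periods):
    """Return (hyper_period, instance_counts) for a list of integer periods."""
    hyper = reduce(lambda a, b: a * b // gcd(a, b), periods)
    return hyper, [hyper // p for p in periods]


if __name__ == "__main__":
    # Periods h, h, 2h, h of T1..T4 with h normalized to 1.
    hyper, counts = instances_per_hyper_period([1, 1, 2, 1])
    print(hyper, counts, sum(counts))
```

For the periods of the example this gives a hyper-period of 2h and 2 + 2 + 1 + 2 = 7 task instances, which is the size of the ordering problem on this core.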
In this section, we present a novel technique to represent DPM efficiently. First, we construct the execution order matrix O by reusing the scheduling decision variables z ij p p from Section 5.1. Then, based on the constructed execution order matrix O, we propose a novel approach to represent which two task instances are closest to each other, that is, which two task instances are scheduled such that they form an idle interval. With this approach, the idle interval of a task can be modeled. Besides, we introduce the concept of virtual tasks to model the idle interval of the last task instance in one hyper-period. Finally, by taking the break-even time constraint into consideration, the decision variables that determine whether the system can enter sleep mode can be modeled as linear terms.
5.2.2. DPM Formulation. In Section 5.1, the binary variable z ij p p was adopted to model the execution order of two tasks. We reuse this binary variable z ij p p to construct a matrix O, the so-called execution order decision matrix (EODM), which is defined as:
Definition 5.1. Assume there are N tasks whose execution order needs to be decided and let z ij denote the execution order of task instances T i and T j (i ≠ j): z ij = 1 if the task instance T i finishes before the task instance T j , and 0 otherwise. An N × N execution order decision matrix O can be determined as: (1) Figure 3, there are 7 task instances in one hyper-period. To order 7 tasks, the 7 × 7 execution order decision matrix O 7×7 can be determined as follows. The rows and columns of the matrix are indexed by the task instances {T
To represent the task execution order, we give the following lemma.
LEMMA 5.2. Let ST(TS i ) and FT(TS i ) denote the start time and finish time of task instance TS i . Given an execution order decision matrix O with N task instances, compute the decision variable matrix $A = O(1 - O)^{T}$ with the elements $a_{ij} = \sum_{k=1}^{N} o_{ik}(1 - o_{jk})$.
If a ij = 1 holds, then TS j is the closest task instance of TS i and the idle interval I i after task instance TS i finishes can be represented as
To demonstrate its correctness, we present one simple example. Based on O 7×7 , determination matrix A 7×7 is calculated as 
From the matrix A 7×7 , we can locate the closest task for all tasks except T , virtual task concept is proposed. The period of a virtual task is the hyper-period of the tasks and its execution time is zero. It starts when a hyper-period starts or ends. As shown in Figure 3, there are two virtual tasks in one hyper-period, denoted as VTS and VTE, which are placed at the start and the end of the hyper-period, respectively. By adding VTS and VTE to the execution order decision matrix O and recomputing the determination matrix A, the closest instances of T 2 4 and VTS can be determined as VTE and T 1 1 , respectively. Thus, I 3 and I 4 can be modeled. At this point, we have found a principle to model the idle interval after a task finishes. However, this principle is not yet in a linear form. To simplify the DPM representation, we present a two-step technique that transforms this idle interval representation into a 0-1 representation. Before introducing this transformation, we present some properties of the relationship between the execution order decision matrix O and the decision matrix A.
PROPOSITION 1. Execution order decision matrix O and decision matrix A have following relationships. (1) o ij is 0-1 variable and a ij is integer variable bounded in
Based on the relationships between the execution order decision matrix O and the decision matrix A presented in Proposition 1, the two-step transformation is presented as follows.
Step 1 
In step 1, the value of the 0-1 variable matrix B is determined by the value of the variable matrix A. To linearize this determination, we present the following lemma.
Step 2. The 0-1 variable matrix O − B can be used to represent the closest task of each task. If TS j is the closest task instance of TS i , o ij − b ij = 1 holds. Otherwise, o ij − b ij = 0 holds. Thus, we can represent the idle interval I i of TS i directly, which can be determined as follows.
I(TS
In (19), I(VTS) and I(VTE) respectively represent the first and last idle interval in one hyper-period, for instance, I 4 and I 3 in Figure 3. Denote TS k as the last task instance in the task instance set TSS = {TS | TS ≠ VTS, TS ≠ VTE}. As the closest task of TS k is VTE, we get I(TS k ) = 0 according to (19). Thus, we can represent the idle interval set S1 = {I(TS i ) | TS i ∈ TSS} for the first N − 1 task instances, excluding the last task instance TS k . It is worth noting that I(TS k ) = 0 has no influence on the objective function. Instead of I(TS k ) = 0, I(VTS) + I(VTE) can be used to calculate the idle interval of the last task instance. For example, in Figure 3, the idle interval of T 2 4 can be computed as I 3 + I 4 .
For brevity, I(VTS) + I(VTE) is denoted as I(TS N+1
PROOF. P 1 ⇒ P 2 : We obtain −b · s 1 ≤ t ≤ b · s 2 according to t = bx and −s 1 ≤ x ≤ s 2 . Based on −s 1 ≤ x ≤ s 2 and b ∈ {0, 1}, we obtain (b − 1)(x − s 2 ) ≥ 0 and (b − 1)(x + s 1 ) ≤ 0. Hence, t − b · s 2 − x + s 2 ≥ 0 and t + b · s 1 − x − s 1 ≤ 0 hold. P 2 ⇒ P 1 : If b = 0 holds, we can prove that t = 0 and −s 1 ≤ x ≤ s 2 according to the definition of P 2 . If b = 1 holds, we obtain −s 1 ≤ t = x ≤ s 2 from P 2 . Thus, P 2 ⇒ P 1 .
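The equivalence used in this proof can be checked by brute force on a small integer grid. In the Python sketch below, P 1 is taken to be the bilinear definition t = b · x with −s 1 ≤ x ≤ s 2 and b ∈ {0, 1}, and P 2 the four linear inequalities that appear in the proof. This reading of the two statements is inferred from the proof text and is an assumption of the sketch; the function names are hypothetical.

```python
def p1(b, x, t, s1, s2):
    """Bilinear form: t equals b * x with x inside its bounds."""
    return t == b * x and -s1 <= x <= s2


def p2(b, x, t, s1, s2):
    """Linear form: the four inequalities used in the proof."""
    return (-b * s1 <= t <= b * s2
            and t - b * s2 - x + s2 >= 0
            and t + b * s1 - x - s1 <= 0)


def equivalent_on_grid(s1, s2):
    """Compare both forms for every integer (b, x, t) around the bounds."""
    grid = range(-s1 - 2, s2 + 3)
    return all(p1(b, x, t, s1, s2) == p2(b, x, t, s1, s2)
               for b in (0, 1) for x in grid for t in grid)


if __name__ == "__main__":
    print(equivalent_on_grid(3, 5))
```

The check includes points outside the bounds of x, so it also confirms that the linear form rejects out-of-range values.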
To show this transformation flow in detail, we present an example based on Figure 3. In the first step, the 0-1 variable matrix B 8×8 can be determined as follows.
⎛ 
In the second step, the 0-1 variable matrix O 8×8 − B 8×8 can be determined as follows. In this representation, the closest task is directly expressed by 0-1 variables, and each task has exactly one closest task, except for the virtual task VTE. Thus, the idle interval of the core can be formulated as (19).
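The chain from the execution order decision matrix O to the selector matrix O − B can be reproduced on a toy instance. The Python sketch below assumes that the diagonal of O is zero and that b ij = 1 exactly when a ij > 1; both are assumptions made for this illustration, as are the function names and the three-instance order used in the example (it is not the schedule of Figure 3).

```python
def order_matrix(finish_order):
    """Build O from a finish order: o[i][j] = 1 if i finishes before j."""
    n = len(finish_order)
    pos = {task: k for k, task in enumerate(finish_order)}
    return [[1 if pos[i] < pos[j] else 0 for j in range(n)] for i in range(n)]


def closest_selector(o):
    """Return (A, B, O - B) with A = O (1 - O)^T computed element-wise."""
    n = len(o)
    a = [[sum(o[i][k] * (1 - o[j][k]) for k in range(n)) for j in range(n)]
         for i in range(n)]
    # Step 1: b_ij flags successors that are not the immediate one.
    b = [[1 if a[i][j] > 1 else 0 for j in range(n)] for i in range(n)]
    # Step 2: O - B keeps a single 1 per row, at the closest successor.
    sel = [[o[i][j] - b[i][j] for j in range(n)] for i in range(n)]
    return a, b, sel


if __name__ == "__main__":
    # Hypothetical order: instance 2 finishes first, then 0, then 1.
    a, b, sel = closest_selector(order_matrix([2, 0, 1]))
    for name, m in (("A", a), ("B", b), ("O-B", sel)):
        print(name, m)
```

In this toy order, the row of instance 2 in O − B has its single 1 in the column of instance 0, and the row of the last instance is all zeros, which is the case the virtual tasks VTS and VTE are introduced to handle.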
To represent the decision of entering sleep mode, m i is used to denote the mode-switch decision for each interval I(TS i ). The core can enter into the sleep state only when the idle interval I(TS i ) is not shorter than the break-even time (i.e., I(TS i ) ≥ T BET ). Thus, the value of m i is determined by (a) I(
The idle interval I i is bounded by the period of task instance TS i , denoted as h(TS i ). Similar to Lemma 5.3, the above determination can be transformed into a linear formulation.
I(TS
Thus, the idle and sleep intervals of an individual core in one hyper-period can be represented as follows.
It is obvious that the variable I i is bounded by the period of task instance TS i , that is, (23) and (24) can be linearized according to Lemma 5.4
Objective Function
The energy overhead of mode-switch can be determined as follows:
Up to now, we have presented the formulation for integrating DVFS+DPM with task scheduling. In this article, we aim to minimize the total energy consumption in one hyper-period, and the following objective is used:
Refinement
To determine the closest task instance for one task instance, we need to check the timing information of every task instance in one hyper-period. Thus, the total number of variables increases quadratically with the number of task instances in one hyper-period, resulting in a dramatically increased exploration space for the MILP. To maintain the scalability of the approach, it is important to develop techniques that can reduce the exploration space. Here, we propose a refinement approach based on execution window analysis, which determines which task instances can possibly form the idle intervals. With this approach, we only need to check the task instances that may execute within this execution window, rather than checking every task instance in one hyper-period.
In the following, we outline how to determine the execution window for each task and how the execution window can be used to determine the set of possible task instances that needs to be checked. The worst-case execution window of task instance T
Besides, worst case execution window analysis can also be used to determine whether task instance T can be determined as follows.
Thus, we only need to explore the task instances in PCTI(T i p ) to construct the determination variables in the determination matrix A, rather than all the task instances. Then, we can formulate DPM using the techniques presented in Section 5.2.
PERFORMANCE EVALUATIONS
This section presents the case studies. We use some real-life benchmarks in the simulation. The energy parameters of the processor are collected from [Jejurikar et al. 2004; Wang and Mishra 2010; Martin et al. 2002] with 70nm technology. The CPLEX [CPLEX 2010] solver is used to solve the MILP problems. All experiments are conducted on a computer with 2.3GHz Intel 8-core CPU and 16GB memory.
Experiment Setup
To evaluate the effectiveness of our approach, we conduct the experiments on 10 task graphs: three FFT benchmarks from MiBench [Guthaus et al. 2001], two consumer application benchmarks from the Embedded Systems Synthesis Benchmarks (E3S) [Vallerio and Jha 2003], which are largely based on data from the Embedded Microprocessor Benchmark Consortium (EEMBC), and five task graphs from TGFF [Dick et al. 1998]. We run each task of the FFT benchmarks on the SimpleScalar cycle-accurate simulation platform [SimpleScalar 2003] to obtain its execution time (in cycles). The two consumer application benchmarks in E3S, that is, consumer-1 and consumer-2, are embedded consumer electronic applications. The consumer-1 application contains 7 tasks, including JPEG compression, high-pass gray-scale filter, and RGB to YIQ conversion. The consumer-2 application contains 5 tasks, including JPEG decompression, RGB to CYMK conversion, and display. In addition, we use TGFF to generate 5 periodic task graphs from the example input files that come with the software package. kbasic-1 and kbasic-2 are generated from the kbasic example input file and contain 8 and 10 tasks, respectively. kseries-parallel-1 and kseries-parallel-2 are obtained from the kseries parallel example input file and have 8 and 16 tasks, respectively. robtst, with 13 tasks, is obtained from the robtst example input file. We consider eight combinations of these applications. Details of the combinations are shown in Table III.
We consider 4-core and 8-core architectures for our experiments. The experiments are based on the classical energy model of a 70nm technology processor in [Martin et al. 2002; Wang and Mishra 2010; Jejurikar et al. 2004], whose accuracy has been verified by SPICE simulation. Table IV lists the energy parameters under 70nm technology [Martin et al. 2002; Wang and Mishra 2010]. We assume that the processor operates at five voltage levels in the range of [0.65 V, 0.85 V] with 50 mV steps. From [Wang and Mishra 2010; Jejurikar et al. 2004], the body bias voltage V bs is obtained as −0.7 V. According to the energy model in Section 3.3, we can calculate the corresponding frequency f, dynamic power P dyn, and static power P sta at each voltage level, as shown in Table V. From [Wang and Mishra 2010], P on, which is related to the idle power, is obtained as 276 mW, and the power consumption in sleep mode P sleep is set to 80 μW. The energy overhead E sw of a state transition is set to 385 μJ. The total state transition time t sw and the overhead of dynamic frequency scaling ξ are obtained by referring to the sleep mode timing specification of a commercial processor [Marvell 2009].
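To make these parameters concrete, the sketch below enumerates the five voltage levels and estimates the break-even time above which sleeping saves energy. It rests on a simplifying assumption that is not stated in the text: the idle power is approximated by P on alone (the real idle power may also include static power, which would shorten the break-even time). The function names and the default `t_sw=0.0` are illustrative only.

```python
def voltage_levels(v_min=0.65, v_max=0.85, step=0.05):
    """Enumerate the supply voltage levels (in volts) between v_min and v_max."""
    count = round((v_max - v_min) / step) + 1
    return [round(v_min + i * step, 2) for i in range(count)]


def break_even_time(e_sw, p_idle, p_sleep, t_sw=0.0):
    """Shortest idle interval (seconds) for which sleeping is worthwhile.

    Sleeping costs e_sw + p_sleep * t, staying idle costs p_idle * t, so
    sleeping wins once t > e_sw / (p_idle - p_sleep). The interval must
    also be at least as long as the transition time t_sw.
    """
    return max(t_sw, e_sw / (p_idle - p_sleep))


levels = voltage_levels()

# Values quoted in the text; idle power ~ P_on is an assumption.
E_SW = 385e-6      # J, state transition energy overhead
P_ON = 276e-3      # W, used here as an approximation of idle power
P_SLEEP = 80e-6    # W, sleep mode power
t_be = break_even_time(E_SW, P_ON, P_SLEEP)
```

Under this approximation the break-even time is roughly 1.4 ms, so idle intervals shorter than that are better spent in the idle mode.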
To evaluate the performance of our technique, we compare the energy consumption of the following techniques:
- Optimal DVFS (Dvfs-OPT): Tasks are assigned the optimal frequency without considering DPM. The processor stays in the idle mode during all idle intervals.
- Applying DVFS before DPM (SubOPT): Similar to Srinivasan and Chatha [2007], we apply optimal DVFS as the first step to obtain the frequency assignment and apply DPM as the final design step. For a fair comparison, we use our DPM technique to obtain the final result.
- Optimal DVFS-DPM (OPT): DVFS and DPM are integrated globally with scheduling.

Figure 5 shows the overall power consumption of the three compared techniques for 8 task sets on 4-core and 8-core architectures. To increase the workload on the 8-core architecture, we duplicate the task graphs or decrease their periods in the simulation. From the simulation results, we can see that applying DVFS before DPM (SubOPT) fails to achieve power savings on most benchmark sets on both architectures. This is because the optimal DVFS in the first step results in lower frequency assignments, which prolong the execution time of tasks and, at the same time, reduce the opportunity to enter the sleep state. Compared to Dvfs-OPT, SubOPT achieves on average only 2.2% and 4.7% power savings on 4-core and 8-core architectures, respectively. In addition, we observe that our approach (OPT) is more energy-efficient than SubOPT. OPT, which integrates DVFS and DPM globally with scheduling, achieves on average 10.5% (up to 16.0%) and 8.9% (up to 12.2%) power savings with respect to SubOPT on 4-core and 8-core architectures, respectively.
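The reason why a DVFS-first flow can miss sleep opportunities can be illustrated with a toy single-task, single-core model. This is not the MILP formulation of this article: the period, cycle count, transition time, and the two frequency/power levels are hypothetical numbers chosen for illustration, and the idle power is assumed to equal P on. Only P on, P sleep, and E sw are taken from the text.

```python
P_IDLE = 0.276     # W, assumption: idle power approximated by P_on
P_SLEEP = 80e-6    # W, from the text
E_SW = 385e-6      # J, from the text
T_SW = 0.5e-3      # s, hypothetical transition time
PERIOD = 10e-3     # s, hypothetical task period
CYCLES = 18e6      # hypothetical workload per period
# Hypothetical levels: frequency (Hz) -> active power (W).
LEVELS = {2.0e9: 0.50, 2.4e9: 0.56}


def idle_energy(t, allow_sleep):
    """Energy spent in an idle interval of length t seconds."""
    stay_idle = P_IDLE * t
    if not allow_sleep or t < T_SW:
        return stay_idle
    return min(stay_idle, E_SW + P_SLEEP * t)


def energy(freq, allow_sleep):
    """Energy of one period when the task runs at `freq`."""
    busy = CYCLES / freq
    return LEVELS[freq] * busy + idle_energy(PERIOD - busy, allow_sleep)


# Dvfs-OPT: choose the frequency while ignoring sleep entirely.
f_dvfs = min(LEVELS, key=lambda f: energy(f, False))
e_dvfs_opt = energy(f_dvfs, False)
# SubOPT: keep that frequency, then apply DPM to the resulting idle time.
e_sub_opt = energy(f_dvfs, True)
# OPT: choose the frequency with sleep taken into account.
f_opt = min(LEVELS, key=lambda f: energy(f, True))
e_opt = energy(f_opt, True)
```

With these made-up numbers the DVFS-first flow picks the lower frequency and leaves an idle interval too short to amortize the transition overhead, whereas the joint choice runs faster and sleeps, so `e_opt` is lower than `e_sub_opt`.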
Results
Next, we conduct an experiment to show the impact of P on on the effectiveness of our approach. We denote by α the factor that scales P on with respect to its original setting, that is, P on = α·P org on, where P org on is the original setting. We vary the factor α from 0.5 to 1.5 with a fixed step size of 0.2. The dynamic power consumption P dyn and the static power consumption P sta are compared between SubOPT and OPT. Figure 6 illustrates the results for benchmark set 7 on the 4-core architecture. Note that, when P on increases, the leakage power increasingly dominates the total power consumption. From the results, we can make the following observations: (1) By increasing the frequency of the processor to create longer idle intervals, OPT consumes more dynamic power than SubOPT. At the same time, OPT achieves significant leakage power savings, so that the overall power consumption of OPT is smaller than that of SubOPT. (2) The leakage power of SubOPT increases linearly with P on, while its dynamic power stays constant. This means that, when the leakage power becomes dominant, applying DVFS before DPM cannot create additional opportunities to enter sleep modes and reduce the leakage power. In contrast, OPT optimally handles the trade-off between dynamic power and leakage power to reduce the overall power consumption. When P on increases, OPT can increase the frequencies, at the cost of a slight increase in dynamic power, to avoid the rapid growth of leakage power.
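As a rough illustration of why a larger P on favors sleeping, the sketch below sweeps α over the range used in this experiment and computes the break-even idle time for each value. It assumes that the idle power equals α·P on (static power is ignored), which is a simplification rather than the energy model of Section 3.3.

```python
E_SW = 385e-6       # J, transition energy overhead from the text
P_SLEEP = 80e-6     # W, sleep power from the text
P_ON_ORG = 0.276    # W, original P_on from the text

# alpha from 0.5 to 1.5 in steps of 0.2.
alphas = [round(0.5 + 0.2 * i, 1) for i in range(6)]

# Break-even idle time in milliseconds, assuming idle power = alpha * P_on.
break_even_ms = {
    a: 1e3 * E_SW / (a * P_ON_ORG - P_SLEEP) for a in alphas
}

for a in alphas:
    print(f"alpha={a:.1f}  break-even ~ {break_even_ms[a]:.2f} ms")
```

The break-even time shrinks as α grows, so shorter idle intervals become worth sleeping through, which matches the observation that OPT raises frequencies to create such intervals when leakage dominates.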
Then, we discuss the impact of the period setting on the effectiveness of our approach. Similar to the definition of the factor α, we define a factor β that scales the period of each application with respect to its original setting. We vary the factor β from 0.7 to 1.3 with a fixed step size of 0.1. The dynamic power consumption P dyn, the static power consumption P sta, and the sleep time interval I are compared between SubOPT and OPT. Figure 7 illustrates the results for benchmark set 1 on the 4-core architecture. In Figure 7, the dynamic power consumption P dyn and the static power consumption P sta of both techniques decrease as the period of the application increases. This is expected because a longer period can prolong both the execution time and the idle intervals. One interesting observation is that the overall power savings of OPT with respect to SubOPT increase when β < 1.1 and decrease when β ≥ 1.1. The reason is that, in SubOPT, most tasks of the application have already been assigned the lowest frequency once the periods of the applications are large enough. Any further increase of the period then only prolongs the idle intervals, so the sleep time interval of SubOPT increases rapidly (as shown in Figure 7(b)). Thus, the leakage power of SubOPT decreases rapidly, which in turn reduces the power savings of OPT.
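The saturation effect for large β can be made explicit with a simplified single-core model. The symbols P (original period), C (total cycles per period), and f_min (lowest frequency) are introduced here only for illustration and are assumptions, not notation from the formulation.

```latex
% Assumption: every task already runs at the lowest frequency f_min,
% so the busy time C / f_min no longer depends on beta.
I(\beta) = \beta P - \frac{C}{f_{\min}}, \qquad
\frac{\mathrm{d} I}{\mathrm{d} \beta} = P
```

Under this assumption, every additional unit of period is added to the idle time, which is consistent with the rapid growth of the SubOPT sleep interval for large β.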
Finally, we conduct experiments to show the efficiency of our refinement technique. We adopt the same task graph sets as above. We compare the numbers of variables and constraints in the MILP problem as well as the solving time. Figure 8 shows the results on both 4-core and 8-core architectures for the approaches with and without refinement. Note that the refinement approach achieves the same optimal solution as the approach without refinement. The MILP solver is given a time budget of 90 minutes for each task graph set. On the 4-core architecture, the refinement approach generates the results within 2 minutes. On the 8-core architecture, the approach without refinement fails to generate results for 5 task sets because the time budget expires, while the refinement approach generates the results within 80 minutes.
In addition, the refinement approach achieves on average a 39.82% reduction in the number of variables and a 37.82% reduction in the number of constraints. The results show that the refinement technique can significantly reduce the MILP problem size.
CONCLUSION AND FUTURE WORK
This article presents an energy optimization technique for scheduling real-time tasks on MPSoC-based platforms with an optimal combination of DVFS and DPM. A key technique is proposed to directly model the idle intervals of individual processors. Based on this technique, an integrated solution to the optimal DVFS and DPM combination problem is formulated as a mixed integer linear program. Our technique generates an optimal time-triggered non-preemptive schedule and an optimal frequency assignment for each task to minimize the total energy consumption of MPSoCs. Based on this optimal schedule, the tasks can be scheduled with the insertion of voltage-setting instructions. In addition, we develop a novel technique that significantly reduces the exploration space of the MILP. Proof-of-concept simulation results demonstrate the effectiveness of our approach compared with existing approaches.
As a next step, we are interested in implementing the proposed approach on a real system and evaluating its performance. A new MPSoC based on a time-triggered architecture is presented in [Salloum et al. 2012] specifically for safety-critical embedded systems. In addition, time-triggered embedded systems are also available in industry, for instance, the TTE processor [TTE system 2007]. Our optimization process can be used to produce energy-aware time-triggered non-preemptive schedules for such time-triggered MPSoCs.
Furthermore, another interesting direction for future work is to support voltage-frequency island partitioned systems. An approach that supports such systems must satisfy two kinds of constraints: (1) cores within one island must share the same frequency; (2) cores within one island must enter the sleep mode, and switch back to the active mode, at the same time.
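The two island constraints can be expressed as a simple check on a candidate schedule. The sketch below is only an illustration of the constraints as stated, not part of the proposed formulation; the data layout (per-core lists of frequency segments and sleep intervals) and the function name `island_violations` are assumptions.

```python
def island_violations(islands, freq_segments, sleep_intervals):
    """Return (island, kind) pairs for every violated island constraint.

    islands:         island name -> list of core names
    freq_segments:   core name -> list of (start, end, frequency)
    sleep_intervals: core name -> list of (start, end)
    """
    violations = []
    for island, cores in islands.items():
        # Constraint (1): identical frequency profile on all cores.
        profiles = {tuple(sorted(freq_segments[c])) for c in cores}
        if len(profiles) > 1:
            violations.append((island, "frequency"))
        # Constraint (2): identical sleep and wake-up times on all cores.
        sleeps = {tuple(sorted(sleep_intervals[c])) for c in cores}
        if len(sleeps) > 1:
            violations.append((island, "sleep"))
    return violations


# Hypothetical 4-core platform with two islands (made-up schedule).
islands = {"island0": ["c0", "c1"], "island1": ["c2", "c3"]}
freq_segments = {
    "c0": [(0, 10, 2.0e9)], "c1": [(0, 10, 2.0e9)],
    "c2": [(0, 10, 2.4e9)], "c3": [(0, 10, 2.0e9)],
}
sleep_intervals = {
    "c0": [(6, 10)], "c1": [(6, 10)],
    "c2": [(7, 10)], "c3": [(7, 10)],
}
result = island_violations(islands, freq_segments, sleep_intervals)
```

In this example only the second island violates the shared-frequency constraint, while both islands satisfy the synchronized sleep constraint.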
