Dynamic voltage scaling (DVS) is a powerful technique for reducing dynamic power consumption in a computing system. However, as technology feature size continues to scale, leakage power is increasing and will limit the power savings obtained by DVS alone. Previous system-level real-time scheduling approaches use DVS alone to optimize power consumption without considering leakage power. To overcome this limitation, we propose a new scheduling algorithm that combines DVS and adaptive body biasing (ABB) to simultaneously optimize both dynamic power consumption and leakage power consumption for real-time distributed embedded systems. First, we derive an analytical expression to determine the optimal supply voltage and body bias voltage under a given clock frequency. Based on this expression, we compute the optimal energy consumption at a given clock frequency and analyze the tradeoff between energy consumption and execution time for a set of tasks with precedence relationships and real-time constraints. We then propose a scheduling algorithm to reduce total power consumption under given real-time constraints. The algorithm also considers variations in power con-
I. INTRODUCTION
Power consumption has become the limiting factor for both battery-operated electronics and high-performance computer systems. While dynamic power consumption has traditionally been the primary source of power consumption, leakage power consumption is becoming an increasingly important concern as the technology feature size shrinks. Supply voltage scaling, which is supported by various embedded processors such as the Intel XScale processor [1], the Transmeta Crusoe processor with LongRun power management technology [2], and AMD's mobile processor with AMD PowerNow! technology [3], is effective in reducing dynamic power consumption quadratically. However, supply voltage scaling often requires a reduction in the threshold voltage that increases the subthreshold leakage current exponentially, and hence the leakage power consumption. Leakage power is expected to become comparable to dynamic power in the forthcoming generations of technology [4]. Hence, it is increasingly important for run-time power optimization techniques to trade off supply voltage and threshold voltage to optimize dynamic power consumption and leakage power consumption jointly. Dynamic voltage scaling (DVS), an effective run-time technique to manage the dynamic power at the system level, is no longer sufficient to manage the total power consumption for the new technology generations.
Many techniques have been proposed to reduce leakage power consumption. The work in [12, 13] summarizes the most commonly used leakage reduction techniques. The leakage power dissipated by a circuit depends on the input vector to the circuit [14, 15]. Hence, input vector control has been proposed to find the input pattern that maximizes the number of disabled transistors in all transistor stacks across the design [16]. The work in [17] develops an efficient algorithm that determines a low-leakage input vector using a sampling of random vectors. Supply voltage gating [4, 18] is another approach to obtain leakage power savings, in which the power supply is disconnected from the circuit so that idle units do not consume leakage power. Optimizing dynamic power and leakage power simultaneously has been demonstrated to be important for energy reduction [...] automatically adjusts the supply and threshold voltages to minimize power consumption at the circuit level. In [26], a combined DVS and ABB method is proposed for a single voltage-scalable processor to reduce both dynamic power and leakage power. This work derives an expression to obtain an optimal tradeoff between supply voltage and body bias voltage for a given clock frequency. However, it only considers this tradeoff for a single task, and is not applicable to a scenario containing interactions among a set of tasks with real-time constraints. The circuit delay model it uses is different from the standard alpha-power model [27]. In this paper, we propose a new scheduling algorithm that addresses both dynamic power and leakage power effectively. The algorithm can be applied to heterogeneous distributed embedded [...] To tackle this issue, we propose a novel two-phase approach. First, we derive an analytical energy consumption model using the standard alpha-power model and determine the optimal supply voltage and body bias voltage for a given clock frequency.
Next, we compute the energy consumption under the optimal points for supply voltage and threshold voltage, and evaluate the curves of optimal energy consumption vs. clock period. The scheduling algorithm thus performs DVS and ABB simultaneously to trade off energy consumption and task execution time, by utilizing the convex characteristic of these curves. The rest of this paper is organized as follows. In Section II, we present some preliminary concepts. Section III uses a motivational example to illustrate the importance of considering leakage power. In Section IV, we derive the optimal tradeoff point between supply voltage and body bias voltage for a given clock frequency, as well as the curves for optimal energy consumption vs. clock period. Section V presents a new scheduling algorithm combining both DVS and ABB methods for distributed real-time embedded systems. Section VI gives the experimental results, which demonstrate the effectiveness of our proposed algorithm in optimizing both dynamic power and leakage power. Section VII draws the conclusions.
II. PRELIMINARIES
We use the following expressions to evaluate the two main sources of power consumption, dynamic power and leakage power [28], where dynamic power can be represented as

$$P_{dynamic} = \frac{1}{2} N C V_{dd}^2 f \qquad (1)$$

and subthreshold leakage power can be represented as

$$P_{leakage} = I_{sub} V_{dd} = I_s \frac{W}{L} V_{dd} \, e^{-V_{th}/(n V_T)} \qquad (2)$$

In the above equations, N is the switching activity, C is the capacitance, Isub is the subthreshold leakage current, Is and n are technology parameters, W and L are device geometries, VT is the thermal voltage, and Vdd and Vth are the supply voltage and threshold voltage, respectively. The operational frequency f can be expressed as

$$f = k \, \frac{(V_{dd} - V_{th})^{\alpha}}{V_{dd}} \qquad (3)$$

where k is a constant for a given technology process and 1 < α ≤ 2.
Another source of leakage power is the source-body and drain-body junction leakage, which is becoming increasingly important. We also consider this component of leakage power later in Section IV. There is a third source of power consumption, short-circuit power, which results from the short-circuit current path between the power supply and ground during switching. Short-circuit power is projected to be constant at around 10% for succeeding technologies [29]. We thus ignore it throughout this paper.
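As a rough illustration of the power model in this section, the following Python sketch evaluates dynamic power, subthreshold leakage power, and operating frequency. It assumes the common forms P = ½·N·C·Vdd²·f for dynamic power, Isub·Vdd with Isub = Is·(W/L)·exp(−Vth/(n·VT)) for leakage, and the alpha-power model f = k·(Vdd − Vth)^α/Vdd. The function names and all numeric values are hypothetical and are not taken from the paper.

```python
import math

def dynamic_power(n_sw, cap, vdd, freq):
    """Dynamic power, assumed form 0.5 * N * C * Vdd^2 * f."""
    return 0.5 * n_sw * cap * vdd ** 2 * freq

def subthreshold_leakage_power(i_s, w_over_l, vdd, vth, n, v_thermal):
    """Leakage power Isub * Vdd, with Isub = Is * (W/L) * exp(-Vth / (n * VT))."""
    i_sub = i_s * w_over_l * math.exp(-vth / (n * v_thermal))
    return i_sub * vdd

def operating_frequency(k, vdd, vth, alpha):
    """Alpha-power model: f = k * (Vdd - Vth)^alpha / Vdd, with 1 < alpha <= 2."""
    if not 1.0 < alpha <= 2.0:
        raise ValueError("alpha is expected to lie in (1, 2]")
    return k * (vdd - vth) ** alpha / vdd

# Hypothetical parameters: lowering Vth by 0.1 V at a fixed Vdd of 1.0 V.
leak_high_vth = subthreshold_leakage_power(1e-6, 10.0, 1.0, 0.4, 1.5, 0.026)
leak_low_vth = subthreshold_leakage_power(1e-6, 10.0, 1.0, 0.3, 1.5, 0.026)
leakage_growth = leak_low_vth / leak_high_vth  # equals exp(0.1 / (1.5 * 0.026))

# Halving Vdd at a fixed frequency cuts dynamic power to one quarter.
dynamic_ratio = dynamic_power(1.0, 1.0, 0.5, 1.0) / dynamic_power(1.0, 1.0, 1.0, 1.0)

print(f"leakage grows {leakage_growth:.1f}x, dynamic power ratio {dynamic_ratio:.2f}")
```

The two ratios show the asymmetry that motivates joint optimization: dynamic power responds quadratically to Vdd, whereas leakage responds exponentially to Vth.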
Embedded systems are frequently specified in the form of a set of task graphs. A simple specification consisting of a single task graph is shown in Fig. 1. It consists of three tasks T1 - T3 (such tasks are of coarse granularity, e.g., discrete cosine transform may be a task). Each task is annotated by its worst-case execution time (WCET) on each type of processor it can run on.
To simplify matters, for this example, only one type of processor is assumed. Edges between tasks denote communication, and are also labeled by their WCET on a given communication link.
There is no WCET assigned to the T1-T2 edge as both T1 and T2 are assigned to the same PE and intra-PE communication time is assumed to be zero based on the traditional assumption in distributed computing (this is because inter-PE communication takes much more time than intra-PE communication in a distributed system). All the source (sink) nodes have an arrival time (deadline) by when computation can begin (must finish). In general, arrival times (deadlines) may also exist for some intermediate nodes. The time interval at which the task graph is repeatedly invoked is called its period. A period may be less than, equal to or greater than the deadlines. Different task graphs in a specification may have different periods, giving rise to a multi-rate specification. The least common multiple (LCM) of all periods is called the hyperperiod. It is known that scheduling the task graphs in their hyperperiod leads to a valid schedule [30].
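The hyperperiod definition can be sketched in a few lines of Python. The helper names and the example periods below are hypothetical; periods are assumed to be integers expressed in a common time unit.

```python
from functools import reduce
from math import gcd

def hyperperiod(periods):
    """Least common multiple of all task-graph periods (integers, same unit)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def invocations_per_hyperperiod(periods):
    """Number of times each task graph is invoked within one hyperperiod."""
    h = hyperperiod(periods)
    return {p: h // p for p in periods}

# Hypothetical multi-rate specification with three task graphs.
periods = [10, 15, 25]
print(hyperperiod(periods))                  # 150
print(invocations_per_hyperperiod(periods))  # {10: 15, 15: 10, 25: 6}
```

A scheduler working over the hyperperiod would therefore place 15, 10, and 6 copies of the three task graphs, respectively.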
III. MOTIVATIONAL EXAMPLE
This section presents a motivational example to illustrate the significance of considering leakage power as technology scales down. Example 1: Consider the task graph shown in Fig. 1 [...] we extend the execution times of tasks T1, T2 and T3 to 4s, 6s, and 5s, respectively. Since the execution time of T1 is extended from 3s to 4s, the speed of processor PE1 can be scaled down by a ratio of 4/3. Given the scaled clock frequency, we consider two voltage scaling approaches to reduce power consumption. One is supply voltage scaling alone, and the other is combined supply and threshold voltage scaling. Let us first consider the 0.07µm technology, in which dynamic power constitutes 78% of total power consumption, while leakage power constitutes 22%. One approach to reduce power consumption is supply voltage scaling. Based on Equation (3), given the frequency scaling ratio of 4/3 for T1, the supply voltage Vdd of PE1 can be scaled down from 2.0V to 1.55V correspondingly (assuming α = 1.4). The dynamic power of T1 is thus reduced from 0.78c1 W to 0.35c1 W, and the leakage power is reduced [...] for T2 and T3, the power consumption is reduced to 0.56c2 W and 0.51c3 W, respectively. The average power reduction of supply and threshold voltage scaling compared to supply voltage scaling alone is 13.7%, while the reduction with respect to no voltage scaling is 49.7%, again assuming c1 = c2 = c3. Fig. 3 gives power consumption for the three different approaches under the three technologies. It can be observed that both dynamic power and leakage power are lowered by either using supply voltage scaling or combined supply and threshold voltage scaling. However, as technology scales from 0.07µm to 0.035µm, and correspondingly leakage power increases from 22% to 61%, supply voltage scaling becomes less effective in reducing power consumption. On the other hand, combined supply and threshold voltage scaling provides more power savings as leakage power increases.
This is reasonable given the fact that subthreshold leakage power is exponentially dependent on threshold voltage, while it is only linearly dependent on supply voltage.
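The supply-voltage-scaling step of Example 1 can be checked numerically. The sketch below inverts the alpha-power model f = k·(Vdd − Vth)^α/Vdd by bisection for a slowdown ratio of 4/3 with α = 1.4. The threshold voltage of 0.6 V is an assumption made here (the text does not state it), chosen because it reproduces the quoted 1.55 V; the function names are likewise hypothetical.

```python
def alpha_power_frequency(vdd, vth, alpha, k=1.0):
    """Alpha-power model: f = k * (Vdd - Vth)^alpha / Vdd."""
    return k * (vdd - vth) ** alpha / vdd

def scaled_vdd(vdd_max, vth, alpha, slowdown, tol=1e-9):
    """Bisection for the Vdd whose frequency equals f(vdd_max) / slowdown.

    The model is monotonically increasing in Vdd for Vdd > Vth, so a
    bracketing search between Vth and vdd_max is sufficient.
    """
    target = alpha_power_frequency(vdd_max, vth, alpha) / slowdown
    lo, hi = vth, vdd_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if alpha_power_frequency(mid, vth, alpha) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

VTH_ASSUMED = 0.6   # assumption, not given in the text
SLOWDOWN = 4.0 / 3.0

vdd_new = scaled_vdd(2.0, VTH_ASSUMED, 1.4, SLOWDOWN)
# Dynamic power scales with Vdd^2 * f, so the ratio is (Vdd_new / 2.0)^2 / slowdown.
dynamic_ratio = (vdd_new / 2.0) ** 2 / SLOWDOWN

print(f"scaled Vdd = {vdd_new:.2f} V")
print(f"dynamic power = {0.78 * dynamic_ratio:.2f} * c1 W")
```

Under these assumptions the scaled supply voltage comes out at about 1.55 V and the dynamic power at about 0.35c1 W, in line with the figures quoted in the example.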
IV. ENERGY CONSUMPTION MODEL FOR COMBINED DYNAMIC VOLTAGE SCALING AND ADAPTIVE BODY BIASING
This section derives the energy consumption model considering both dynamic power and leakage power for a set of tasks implemented in a distributed embedded system, under the assumption that supply voltage and body bias voltage can be scaled simultaneously in a dynamic fashion. [...] where Vth0 is the threshold voltage at zero substrate bias, γ is the body bias coefficient, φs is the surface potential, Vbs is the body bias voltage, and ΔVth(SCE), ΔVth(DIBL)
The total power consumption includes both dynamic power and leakage power. The dynamic power, Pdynamic, is given by Equation (1). The leakage power, Pleakage, is due to subthreshold leakage current, Isub, as shown in Equation (2), as well as the contributions of drain-body junction leakage current, Ij, and source-body junction leakage current, Ib [IY]. Thus, a more exact equation for leakage power is given by:
Substituting (5) into (7), we get:
D. Tradeoff between supply voltage and body bias voltage
For each task i with switching activity Ni, its energy consumption, Ei, can be expressed in terms of clock frequency fi, supply voltage Vdi and body bias voltage Vbi as:
where ti is the task's corresponding execution time under clock frequency fi, i.e., [...] There are three variables in Equation (10), fi, Vdi, and Vbi. To reduce the number of variables, from Equation (6), Vbi can be represented as a function of fi and Vdi as:
where k6, k7 and k8 are new constants. Substituting (12) into (10) gives:
$$E_i = \frac{1}{2} t_i N_i C f_i V_{di}^2 + t_i k_9 V_{di} \, e^{k_{10} V_{di} + k_{11} (f_i V_{di})^{1/\alpha}} + t_i k_{12} V_{di} + t_i k_{13} (f_i V_{di})^{1/\alpha} + t_i k_{14} \qquad (13)$$

[...] $V_{di}^{opt}$ corresponds to the optimal supply voltage leading to minimum energy consumption if the second derivative of $E_i$ with respect to $V_{di}$ at $V_{di}^{opt}$ satisfies $\partial^2 E_i / \partial V_{di}^2 > 0$. Thus, given the clock frequency $f_i$, the optimal supply voltage $V_{di}^{opt}$ can be calculated based on Equation (15), and then the optimal body bias voltage can be determined by Equation (12).
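The tradeoff described in this subsection can be explored numerically. The following Python sketch is not the paper's closed-form derivation: it assumes a simplified model in which Vth = VTH1 − K1·Vdd − K2·Vbs, frequency follows the alpha-power model, and energy per cycle is ½·N·C·Vdd² plus (Vdd·Isub + |Vbs|·Ij)/f. For a fixed clock frequency it scans Vdd, derives the body bias that meets the frequency, and keeps the lowest-energy feasible pair. Every constant and function name is a hypothetical, normalized placeholder.

```python
import math

# Hypothetical normalized constants (not from the paper).
ALPHA = 1.5                      # alpha-power exponent, 1 < alpha <= 2
K_FREQ = 1.0 / 0.8 ** ALPHA      # gives f = 1 at Vdd = 1, Vth = 0.2
VTH1, K1, K2 = 0.25, 0.05, 0.2   # assumed linear threshold-voltage model
N_VT = 0.039                     # n * thermal voltage
I0, I_J = 25.0, 1e-3             # subthreshold and junction leakage scales
VDD_MAX, VBS_MIN = 1.0, -1.0     # reverse body bias only: VBS_MIN <= Vbs <= 0

def threshold_voltage(vdd, vbs):
    return VTH1 - K1 * vdd - K2 * vbs

def frequency(vdd, vbs):
    return K_FREQ * (vdd - threshold_voltage(vdd, vbs)) ** ALPHA / vdd

def body_bias_for(freq, vdd):
    """Invert the frequency model: the Vbs that yields `freq` at supply `vdd`."""
    overdrive = (freq * vdd / K_FREQ) ** (1.0 / ALPHA)  # equals Vdd - Vth
    vth = vdd - overdrive
    return (VTH1 - K1 * vdd - vth) / K2

def energy_per_cycle(freq, vdd, vbs, n_sw=1.0, cap=1.0):
    dynamic = 0.5 * n_sw * cap * vdd ** 2
    i_sub = I0 * math.exp(-threshold_voltage(vdd, vbs) / N_VT)
    leakage = (vdd * i_sub + abs(vbs) * I_J) / freq
    return dynamic + leakage

def optimal_point(freq, n_sw=1.0, steps=2000):
    """Grid search over Vdd; returns (energy, vdd, vbs) or None if infeasible."""
    best = None
    for s in range(1, steps + 1):
        vdd = VDD_MAX * s / steps
        vbs = body_bias_for(freq, vdd)
        if not (VBS_MIN <= vbs <= 0.0):
            continue  # this Vdd cannot meet the frequency within the bias range
        e = energy_per_cycle(freq, vdd, vbs, n_sw)
        if best is None or e < best[0]:
            best = (e, vdd, vbs)
    return best

e_opt, vdd_opt, vbs_opt = optimal_point(0.5)
print(f"Vdd = {vdd_opt:.3f}, Vbs = {vbs_opt:.3f}, energy/cycle = {e_opt:.4f}")
```

With these placeholder constants the minimum lies strictly inside the feasible range: raising Vdd increases dynamic energy, while lowering it forces a lower threshold voltage and therefore more leakage. A grid search is used here only for transparency; solving the first-order condition directly, as the text describes, would be the more efficient route.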
E. Tradeoff between energy consumption and clock period
Substituting the optimal points of supply voltage and body bias voltage for a given clock frequency into Equation (10), we can derive the optimal energy consumption $E_i^{opt}(f_i, t_i)$ at a given clock frequency, which is a function of clock frequency and task execution time ti. The curves of optimal per cycle energy consumption vs. clock period for three tasks with different values of Ni are shown in Fig. 5. To analyze the tradeoff between energy consumption and clock period, we compute the negative of the first derivative of optimal energy consumption for task i with respect to its execution time, $-\partial E_i^{opt}/\partial t_i$, based on Equations (13) [...]

As seen from Fig. 5, the curve of optimal per cycle energy consumption vs. clock period at a given switching activity is convex. Hence, energy_derivative(f) is a monotonically decreasing function with respect to clock period, and correspondingly a monotonically increasing function with respect to clock frequency for a given switching activity, as shown in Fig. 6. Under the same clock frequency, energy_derivative for a higher switching activity (N = 3) is higher, which means the energy reduction rate is relatively higher. It is therefore important to allocate slack to the task with a higher energy_derivative (task with N = 3) first. After its energy_derivative drops to the point equal to the maximum energy_derivative of the task with a lower switching activity (N = 1), the energy reduction rates of both tasks should be kept at almost the same level, as seen by comparing curves N = 3 and shifted N = 1. Then the slack can be allocated to them in a balanced way.

For a single processor, optimal slack allocation can be achieved as follows. A task is defined as non-extensible if extending its execution time will lead to a violation of real-time constraints. For a set of extensible tasks ext_set, given the frequency scaling step df, the slack allocation should guarantee that for any task i in ext_set, [...] Fig. 7, the achieved total slack is 2s. Hence, slack allocation based on Equation (18) [...]

This section proposes a new scheduling algorithm addressing both dynamic power and leakage power effectively for heterogeneous distributed embedded real-time systems. This algorithm is an extension of the power-profile driven DVS scheduling algorithm [11]. In [11], the scheduling algorithm optimizes power consumption without considering leakage power for a set of tasks [...] constraints. To achieve maximum energy reduction, it identifies the optimal tradeoff point between supply voltage and body bias voltage based on Equations (12) and (15) when multiple tasks update their operating clock frequencies. It also considers the tradeoff between energy consumption and clock period using the heuristic based on Equation (18). It thus reduces dynamic power and leakage power simultaneously. To guarantee real-time constraints, it evaluates the validity of the generated schedule by checking the earliest_start_time (EST) and latest_finish_time (LFT) of scheduled events. The EST of a scheduled event is the earliest time at which it can start its execution without violating its arrival time and precedence relationships. The LFT is the latest time at which the event must finish its execution without violating any deadlines and precedence relationships. For each scheduled event i (i is either a task or an inter-PE communication edge), its EST, and LFT, [...] the set of directed edges between vertices, which represents the precedence relationships due to data dependencies and execution order constraints on any PE or communication link. We illustrate the creation of G(V, E) through Example 2. Example 2: Fig. 9 gives the specification of an embedded system in the form of two task graphs. Fig. 10 gives a feasible schedule for the task graphs on a distributed system consisting of two PEs connected by a link. We assume both PE1 and PE2 have communication buffers. The derived directed graph G(V, E) for this schedule is shown in Fig. 11. A directed edge is inserted from one event to another if one is a direct predecessor of the other (solid edges), or if one is scheduled just ahead of the other on the same PE or communication link (dotted edges).
A dummy event (small empty box) is inserted between two events to represent the transition time overhead for clock frequency, supply voltage, and body bias voltage, wherever a possible transition may occur.
In the algorithm, initially, the frequency f, supply voltage Vdd, and body bias voltage Vbs of all the tasks are initialized to fmax, Vddmax, and [...] of the PEs they are assigned to, respectively. The tasks assigned to the voltage-scalable PEs are marked as extensible and their energy_derivatives are calculated. We maintain two task lists, the extensible task list, extensible_list, and the active extensible task list, active_list. extensible_list is initialized with all the extensible tasks in decreasing order of energy_derivative. The extensible tasks with the highest energy_derivative are inserted into active_list from extensible_list. Whenever a task is added to active_list, it is removed from extensible_list at the same time. All the extensible tasks are thus either in extensible_list or in active_list. The following loop is repeated until there is no extensible task remaining in either extensible_list or active_list.
In each iteration, for each event in the order of topological sort of G(V, E), its EST is evaluated based on its current clock frequency f. Next, the task with the highest energy_derivative from active_list is chosen as the reference task, whose frequency is dropped by the frequency scaling step df. Its corresponding Vdd, Vbs, and energy_derivative are evaluated based on the new scaled frequency. Its new energy_derivative is the reference energy derivative, reference_energy_derivative. Next, in the order of reverse topological sort of G(V, E), the frequencies of all the other tasks in active_list are updated. The frequency scalings ensure that the corresponding energy_derivatives do not drop lower than the reference level, reference_energy_derivative. The frequency scalings of some tasks might influence the validity of the new schedule. For each scheduled event, its LFT is updated based on its extended execution time WCET. If EST + WCET > LFT, the frequency scaling of this task yields an invalid schedule which violates real-time constraints. Such a task is marked as non-extensible and its fi, Vdi and Vbi are restored to their old valid values. For each task in active_list, if any value of its frequency, supply voltage, or body bias voltage has reached its minimum level or it is marked as non-extensible, it is removed from active_list. Then active_list is updated in the following way. If active_list becomes empty, the extensible tasks with the highest energy_derivative are inserted into active_list from extensible_list. Otherwise, active_list is appended with tasks from extensible_list whose energy_derivatives are higher than reference_energy_derivative. If no extensible tasks are left, the algorithm terminates by returning the scaled supply voltage and body bias voltage levels.
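The core idea of the loop, giving slack first to the task with the highest energy derivative and freezing a task once a further step would violate timing, can be sketched for a much simpler setting. The Python below is a deliberately reduced illustration, not the algorithm itself: it assumes a single processor, sequential tasks sharing one deadline, no transition overhead, and a hypothetical convex energy model (dynamic energy proportional to N·f² per cycle plus a leakage term LEAK/f per cycle). It also stops extending a task when its energy derivative is no longer positive, which is an extra safeguard added here. All names and numbers are placeholders.

```python
from dataclasses import dataclass

LEAK = 0.2  # hypothetical leakage coefficient (normalized)

@dataclass
class Task:
    name: str
    cycles: float            # normalized cycle count
    n_sw: float              # switching activity N
    freq: float = 1.0        # normalized clock frequency, starts at f_max
    extensible: bool = True

def exec_time(task, freq):
    return task.cycles / freq

def energy(task, freq):
    """Hypothetical convex model: cycles * (N * f^2 + LEAK / f)."""
    return task.cycles * (task.n_sw * freq ** 2 + LEAK / freq)

def energy_derivative(task, df):
    """Energy saved per unit of added execution time for one step df."""
    f_new = task.freq - df
    saved = energy(task, task.freq) - energy(task, f_new)
    added = exec_time(task, f_new) - exec_time(task, task.freq)
    return saved / added

def allocate_slack(tasks, deadline, df=0.05, f_min=0.3):
    """Greedy slack allocation driven by the highest energy derivative."""
    while True:
        candidates = [t for t in tasks if t.extensible]
        if not candidates:
            return tasks
        ref = max(candidates, key=lambda t: energy_derivative(t, df))
        total = sum(exec_time(t, t.freq) for t in tasks)
        new_total = total - exec_time(ref, ref.freq) + exec_time(ref, ref.freq - df)
        if energy_derivative(ref, df) <= 0 or new_total > deadline:
            ref.extensible = False   # a further step would not help or breaks timing
            continue
        ref.freq -= df
        if ref.freq - df < f_min:
            ref.extensible = False   # minimum frequency level reached

# Hypothetical example: two equal-length tasks with N = 3 and N = 1, 2 units of slack.
a, b = Task("a", 3.0, 3.0), Task("b", 3.0, 1.0)
energy_before = energy(a, a.freq) + energy(b, b.freq)
allocate_slack([a, b], deadline=8.0)
energy_after = energy(a, a.freq) + energy(b, b.freq)
total_time = exec_time(a, a.freq) + exec_time(b, b.freq)
print(f"f_a = {a.freq:.2f}, f_b = {b.freq:.2f}, time = {total_time:.2f}")
```

In this toy run the task with the higher switching activity receives most of the slack and ends at the lower frequency, while the other task is slowed only once the two energy derivatives become comparable, which mirrors the balancing behaviour described for the N = 3 and N = 1 curves. The full algorithm additionally handles multiple PEs, communication edges, and EST/LFT checks over G(V, E).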
The proposed algorithm has a complexity of O(K(n + e) + (n + e)log(n + e)), where n is the number of tasks, e is the number of inter-PE communication edges, and K is the number of iteration steps. During the initializations, reordering the extensible task list, extensible_list, in decreasing order of energy_derivative requires a time of (n + e)log(n + e). In each iteration, the EST and
VI. EXPERIMENTAL RESULTS
In our experiments, we obtain system architectures through the system synthesis tool, called CORDS, described in [31]. CORDS synthesizes multi-rate, real-time, periodic distributed embedded systems. It automatically selects an allocation from a set of field [...] where η is the power efficiency, Cr is the capacitance of the power rail, and Cs is the total capacitance of the substrate and wells of the device. The constant parameters for the 0.07µm process are provided in [26]. We assume 90% power efficiency.
To demonstrate the effectiveness of our combined DVS and ABB method for distributed real-time embedded systems, we applied our proposed real-time scheduling algorithm to embedded systems in the form of the task graphs described in 
TABLE III CHARACTERISTICS OF EMBEDDED SYSTEMS IN THE FORM OF TASK GRAPHS
We compare our combined DVS and ABB method with two other schemes, no voltage scaling and DVS alone. All the comparisons are based on the same initial schedule generated by as late as possible (ALAP) list scheduling. We first set the frequency scaling step df to 20% of the maximum frequency. Fig. 12 shows the power consumption for the benchmarks for the different schemes for the 0.07µm technology. It can be observed that the combined DVS and ABB method provides more power reduction than the other two schemes. Table IV gives power reduction of DVS vs. no scaling, DVS+ABB vs. no scaling, and DVS+ABB vs. DVS alone. The combined DVS and ABB method is more effective than DVS alone from the power reduction point of view. It yields an average power reduction of 34.7% with respect to using DVS alone, and 68.3% compared to no voltage scaling. This justifies the advantages of considering dynamic power consumption and leakage power consumption simultaneously during voltage selection to reduce power consumption for distributed real-time embedded systems. As discussed earlier, the complexity of our proposed real-time scheduling algorithm depends on the frequency scaling step df. If we change the frequency scaling step to 10% of the maximum frequency, the improvement in the power reduction achieved is observed to be negligible. We can thus adjust the frequency scaling step to provide a good tradeoff between algorithm performance and complexity.
VII. CONCLUSIONS
This paper proposed a new scheduling algorithm combining DVS and ABB to optimize both dynamic power consumption and leakage power consumption for distributed real-time embedded systems. The algorithm is based on a novel two-phase approach, which can trade off energy consumption and task execution time by performing DVS and ABB simultaneously. Experimental results show that the proposed algorithm can achieve substantial power reduction for the 0.07µm technology.
