Abstract
Introduction
In CMOS devices, energy consumption per cycle is proportional to the square of the voltage, while circuit delay decreases roughly linearly with the voltage. As a result, controlling the supply voltage allows the user to trade off workload execution time for energy consumption.
Over the past few years, many researchers have studied this tradeoff. For hard real-time systems (so called because the workload is associated with a hard deadline), the tradeoff is particularly challenging. In such applications, the workload is characterized by analysis or profiling, so that its worst-case execution time can be bounded with reasonable certainty. In most cases, the workload is run periodically, with the period and deadlines known in advance. Furthermore, many real-time applications (e.g., spaceborne platforms) have constraints on their power or energy consumption. There is thus an increased need for power-management techniques for such systems; the a priori knowledge about the workload provides an added opportunity to use such techniques. Most of the power-aware voltage-scheduling work for real-time systems has concentrated on independent tasks. By contrast, in this paper, we consider voltage scheduling while executing a task graph, which defines the precedence constraints between tasks.
The problem can be informally described as follows: we are given a task graph, the worst-case execution time of each task, and the period at which the task graph is to be executed. This workload is to execute on a multiple-processor system, each processor of which has its own private memory. The assignment of tasks to processors and the order of task execution are also determined a priori and serve as input parameters. The problem is to schedule the voltage of each processor in such a way that the energy consumption is kept low. Assuming the deadline of the task graph equals its period, there will be no more than one iteration of the task graph alive in the system at any time.
Our algorithm has both an offline and an online component. The offline component performs voltage scheduling based on the worst-case execution requirements. The online component adjusts the voltage schedule as tasks complete: in most cases, tasks consume less than their worst-case time, and the remaining time can be reclaimed to run the processors slower than might otherwise be required.
The remainder of this paper is organized as follows. In Section 2 we provide a brief survey of the relevant literature. In Section 3 we outline our algorithm and in Section 4 we provide some numerical results which serve to show its effectiveness. The paper concludes with a brief discussion in Section 5.
Literature Survey
Dynamic voltage scaling (DVS) is increasingly being used to address the problem of energy-aware task scheduling. Initial work on dynamic voltage scheduling [9, 28] was in the context of non-real-time systems, where average throughput is the performance metric. Lin et al. treated this problem as an integer programming model and presented scheduling heuristics that deal with timing and resource constraints [18]. Chang et al. [8] present a dynamic programming technique for solving the multiple supply voltage scheduling problem in both non-pipelined and functionally pipelined data-paths. The scheduling problem refers to the assignment of a supply voltage level (selected from a fixed and known number of voltage levels) to each operation in a data flow graph so as to minimize the average energy consumption for given computation time or throughput constraints or both. Results for an ARM7D processor at two voltage-frequency combinations, (5.0V, 33MHz) and (3.3V, 20MHz), are presented in [4]; for the Dhrystone 1.1 benchmarks it yields 185 MIPS/watt and 579 MIPS/watt, respectively. Yao et al. assumed that the power usage is a convex function of the clock rate [31] and derived a static voltage control heuristic to reduce energy consumption. Ishihara and Yasuura presented a model of a dynamically variable voltage processor and basic theorems for power-delay optimization [15]. A static voltage scheduling problem is also proposed and formulated as an integer linear programming (ILP) problem. They point out that two voltage levels are sufficient, and that a larger number of available levels does not contribute much as long as the two voltage levels are carefully chosen. Hong et al. [12] present a nonpreemptive scheduling heuristic for low-power core-based real-time SOCs based on dynamically variable voltage hardware. In [13], voltage control is considered for the scheduling of sporadic tasks in the midst of an ambient periodic workload.
The authors point out that voltage transitions are fast: of the order of 10 to 100 μs per volt. In another instance, Burd and Brodersen discuss the design of a variable voltage processor which can switch voltage at the rate of 24 μs per volt [6]. In [20], Ma and Shin present an energy-adaptive combined static/dynamic scheduler in the Emerald operating system to execute tasks in mobile applications. The emphasis is on achieving effective use of limited energy by favoring low-energy and critical tasks. Qu and Potkonjak have presented a heuristic for maximizing system utility, which is based on quality of service (QoS), under limited energy resource conditions [24]. Shin and Choi slow the processor down to avoid idling it if the current workload is guaranteed to finish before the next job arrival [27]. A similar slowdown approach for periodic real-time tasks which can consume energy at possibly varying rates is presented in [5].
A simulation environment and benchmark suite for evaluating voltage scaling algorithms have been presented in [23]. Lee et al. introduced a dynamic voltage algorithm for fixed-priority task systems in [17]. In another work, Krishna et al. show how voltage scaling can be based on Earliest Deadline First (EDF) scheduling algorithms to get energy performance optimizations [16]. In [22], a class of novel algorithms called real-time DVS (RT-DVS) is introduced that modifies the real-time scheduler and task management service to provide significant energy savings while maintaining real-time deadline guarantees. Researchers have also proposed the concept of compiler-directed DVS [21], where the compiler sets the processor frequency and voltage with the aim of minimizing energy under real-time constraints.
In [11], the scheduling problem of independent hard real-time tasks with fixed priorities assigned in a rate monotonic or deadline monotonic manner is addressed. This method employs stochastic data to derive energy-efficient schedules taking the actual behavior of the real-time systems into account. Several papers have also recognized the need for both offline and online approaches to address the issue of energy-efficient scheduling of independent real-time tasks (see, for example, [5, 25]).
Variable voltage scheduling as a low power design technique at the behavioral synthesis stage is discussed in [26]. Given as input an unscheduled data flow graph with a timing constraint, the goal of that paper is to establish a voltage value at which each of the operations of the data flow graph would be performed while meeting its timing constraint. The authors use an iterative graph-theoretic approach to identify critical paths and assign nodes to a specific voltage level.
An interesting approach to power-conscious joint scheduling of periodic task graphs and aperiodic tasks in a distributed real-time embedded system has been proposed in [19] . Here the authors focus on the problem of efficient scheduling of a mix of task graphs and independent tasks and present an effective dynamic energy reduction heuristic for them. They use a slack-based list scheduling approach to perform static resource allocation, assignment and scheduling of the periodic task graphs. The emphasis of this work is to meet all hard real-time constraints, minimize the response time of all soft aperiodic tasks and also engage in dynamic voltage scaling and power management to reduce energy consumption. It is assumed in [19] that tasks always run to their worst-case execution times.
An initial study of tasks with precedence constraints has been made in [32] . In this paper, a static evaluation is carried out that defines the order in which the tasks are to be executed. This order is kept unchanged, even if the task execution times are much less than the worst-case. When tasks are completed ahead of their worst-case time, the slack that is thus released can be used by running the processor(s) at a lower voltage than would otherwise be required.
By contrast, in this paper, we recognize that considering precedence relationships among the tasks and the entire shape of the task graph might help us in deriving an algorithm that would achieve significant energy savings. We also believe that an effective technique should comprise both online and offline components and hence we design our offline heuristics with the online component in mind. We focus entirely on the computation aspect of this problem in a multiprocessor system and try to come up with a comprehensive solution using static scheduling and runtime strategies to achieve energy efficient scheduling of hard real-time task graphs.
The Algorithm
The given task graph, henceforth referred to as the task precedence graph (TPG), is assumed to have a hard deadline associated with it. Therefore, our algorithm tries to reduce energy expenditure by voltage scheduling in such a way that the deadline is always met.
In CMOS devices, the power consumption is proportional to the square of the voltage [7, 15]:

$$P = C_L \, N_{SW} \, v^2 f \tag{1}$$

where $C_L$ is the circuit output load capacitance, $N_{SW}$ is the number of switches per clock cycle, $f$ is the clock frequency and $v$ is the supply voltage. However, reduction of the power supply voltage causes an increase of the circuit delay, denoted by $\tau$ [7, 15]:

$$\tau = K \, \frac{v}{(v - v_T)^{\alpha}} \tag{2}$$

where $K$ is a constant depending on the process and gate size, $v_T$ is the threshold voltage, and $\alpha$ varies between 1 and 2; $\alpha = 2$ for long-channel devices, which have no velocity saturation. In this paper, we assume $\alpha = 2$ for all our numerical experiments.
System Model
Our model consists of a multi-processor system in which the processors are independent and are connected by a low-cost, fast interconnection network. The processors can operate at three voltage levels: $v_{HI}$, $v_{LO}$, and $v_{IDLE}$ ($v_{HI} > v_{LO} > v_{IDLE}$). $v_{HI}$ and $v_{LO}$ are voltages at which the processors can do useful computation, whereas $v_{IDLE}$ is the voltage necessary to sustain the system in the idle state.
The factor by which the processor is slower at voltage $v$ relative to when at the highest voltage $v_{HI}$ is

$$\mathrm{slow}(v) = \frac{v}{v_{HI}} \left( \frac{v_{HI} - v_T}{v - v_T} \right)^{\alpha} \tag{3}$$

where $v_T$ is the threshold voltage.
We define one unit of execution as the computation performed by the processor at $v_{HI}$ in unit time. Thus, one unit of execution will take $\mathrm{slow}(v)$ units of time at voltage $v$.
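The ratio of energy consumed per cycle by a processor at voltage $v$ to that consumed at voltage $v_{HI}$ is

$$\frac{E(v)}{E(v_{HI})} = \left( \frac{v}{v_{HI}} \right)^{2} \tag{4}$$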
For the task sets we study, inter-task communication only happens after each task has finished its computation. Such communication consists of the transfer of small amounts of data; the communication cost can be safely ignored as insignificant in the types of application under consideration. Since idle (sleep) power consumption is considerably lower than operational power [1], the energy cost when the processors are idle is also ignored. We should also note that our algorithm actually decreases the idle time in the processors compared with the single-voltage algorithm. Hence this is a conservative assumption, since the relative gain using our algorithm would be even higher if we had considered the energy cost associated with processor idling.
We have also considered the voltage switching cost as negligible both with respect to the time needed and the energy expended. This is justified by the fact that our algorithm has at most one voltage switch within the runtime of the task and at most one switch at the time of context switching of the tasks. We have accounted for this cost by merging this effect into the worst case profile information.
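The timing and energy model of this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the slowdown is taken as the ratio of circuit delays with delay proportional to v / (v - v_T)^alpha, energy per cycle is taken as proportional to v^2, and the voltages in the usage lines are assumed values chosen only for illustration.

```python
def slow(v, v_hi, v_t, alpha=2.0):
    """Slowdown at voltage v relative to v_hi, with delay ~ v / (v - v_t)**alpha."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return (v / v_hi) * ((v_hi - v_t) / (v - v_t)) ** alpha


def energy_ratio(v, v_hi):
    """Energy per cycle at v relative to v_hi (energy per cycle ~ v**2)."""
    return (v / v_hi) ** 2


def relative_energy(hi_units, lo_units, v_lo, v_hi):
    """Energy of a task split between v_hi and v_lo, relative to running it all at v_hi.

    Idle, communication and voltage-switching costs are ignored, as in the text.
    """
    total = hi_units + lo_units
    return (hi_units + lo_units * energy_ratio(v_lo, v_hi)) / total


# Assumed illustrative values: v_hi = 1.75 V, v_lo = 1.0 V, v_t = 0.2 V.
print("slowdown at v_lo:", round(slow(1.0, 1.75, 0.2), 3))
print("max. energy saving:", round(1.0 - energy_ratio(1.0, 1.75), 4))
```

With these assumed voltages, the maximum saving (everything run at the lower voltage) comes out at roughly 67%, which depends only on the ratio of the two operating voltages and not on the threshold voltage.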
The Details of the Algorithm
Given the tasks, we can apply any generic multiprocessor static task assignment and ordering heuristic. Under this assignment, the task graph should be able to meet the deadline when all tasks run to their worst-case execution times at the highest available voltage. Any assignment that satisfies this criterion is a valid input to our algorithm. Once we are given the assignment and task schedule order in the multiprocessor environment, we can apply our voltage scheduling heuristic to minimize the energy expenditure. We follow a two-pronged approach to achieve our objective. The approach includes an offline component: the voltage scheduling of the TPG based on the static worst-case execution profile. We use the pessimistic worst-case execution profile because the system has hard real-time requirements and a deadline miss under any circumstances would be catastrophic. This is followed by an online phase, dynamic slack reclamation.
We will use the following terms for describing our algorithm. A critical path is a path from a source to a sink of the TPG that misses the deadline under the current voltage configuration, that is, whose worst-case execution time exceeds the deadline. The reverse slack, or rslack, of a critical path is the difference between the deadline and the worst-case execution time of that path under the current voltage configuration. The start-time of a task is the latest time, relative to the beginning of the execution of the task graph, at which the task must be invoked, and the commit-time is the time by which the task must complete its execution.
Static Voltage Scheduling
We apply any static assignment heuristic, such as the one described in Section 3.2.3, to the TPG. In the cases where there are not enough processors available to exploit the parallelism inherent in the TPG, this assignment and task ordering may lead to new dependency relationships. We would, therefore, need to modify the original TPG by adding new edges to accommodate these dependencies. After the assignments and modifications are done, we apply our static voltage scheduling heuristics to the TPG.
In order to better understand this scheduling heuristic let us rephrase the optimization problem in the following way. There is a trivial solution for this problem if the deadline is met when all the tasks are run at $v_{LO}$. However, the problem becomes more interesting when some of the paths become critical paths and a decision has to be made about which task to speed up. Analyzing the expressions above, it appears that if we speed up a task that is part of a large number of critical paths, we affect many paths while paying the energy price only once.
Based on this intuition we formulated the following iterative algorithm to determine which task needs to run at $v_{HI}$ and for how many execution units. We start the procedure by assigning all the tasks to run at $v_{LO}$ and then speed them up iteratively until there are no more critical paths left. The weight associated with each task depends on the membership of the task in the set of critical paths: every time we encounter a task in a critical path we increment its weight by 1. When we have to break a tie between tasks of equal weight, we choose the task nearest to the leaf of the TPG. The rationale behind this is that we would like to schedule a task to run at $v_{HI}$ as late as we can because, during dynamic resource reclamation, we could potentially re-acquire enough slack to avoid having to run it entirely at $v_{HI}$.
[Algorithm listing, closing steps:]
    ...
        Run the entire taskId at $v_{HI}$ and mark the task so that its weight is never considered during subsequent iterations.
      end if
      Update the path execution times and remove any path which now meets the deadline from the list of critical paths.
    end while
Note that we could formulate our problem as a linear programming optimization. This, however, would yield an optimal static schedule which would not attempt to increase the opportunity for dynamic adjustments. We have nevertheless experimented with linear programming techniques, and the overall results tend to match closely those of our static heuristic. Our static heuristic appears to give near-optimal performance in most cases while also facilitating the dynamic resource reclamation in the subsequent step.
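The iterative heuristic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: all names are hypothetical, the source-to-sink paths are assumed to be enumerated explicitly, `s` stands for the slowdown factor at the lower voltage, `dist_to_leaf` is an assumed precomputed distance of each task from a leaf (used for tie-breaking), and the selected task is sped up just enough to repair the critical path through it that has the smallest overrun.

```python
def static_voltage_schedule(wcet, paths, dist_to_leaf, deadline, s):
    """Return {task: units to run at the high voltage}; the remaining units run at the low voltage.

    wcet: worst-case execution units per task; paths: list of task lists (source to sink);
    s: slowdown factor at the low voltage (s > 1).
    """
    hi = {t: 0.0 for t in wcet}   # start with every task entirely at the low voltage
    saturated = set()             # tasks already running entirely at the high voltage
    eps = 1e-9

    def path_time(path):
        # One unit takes 1 time unit at the high voltage and s time units at the low voltage.
        return sum(hi[t] + (wcet[t] - hi[t]) * s for t in path)

    while True:
        critical = [p for p in paths if path_time(p) > deadline + eps]
        if not critical:
            return hi
        # Weight of a task = number of critical paths it belongs to.
        weight = {}
        for p in critical:
            for t in p:
                if t not in saturated:
                    weight[t] = weight.get(t, 0) + 1
        if not weight:
            raise ValueError("deadline cannot be met even at the high voltage")
        # Highest weight first; ties go to the task nearest to a leaf.
        task = max(weight, key=lambda t: (weight[t], -dist_to_leaf[t]))
        overrun = min(path_time(p) - deadline for p in critical if task in p)
        needed = overrun / (s - 1.0)          # each moved unit saves (s - 1) time units
        if needed >= wcet[task] - hi[task]:
            hi[task] = float(wcet[task])      # run the entire task at the high voltage
            saturated.add(task)
        else:
            hi[task] += needed


# Hypothetical three-task graph: a -> c and b -> c.
wcet = {"a": 10, "b": 10, "c": 10}
paths = [["a", "c"], ["b", "c"]]
print(static_voltage_schedule(wcet, paths, {"a": 1, "b": 1, "c": 0}, deadline=35, s=2.0))
```

In the small usage example, task c lies on both critical paths, so speeding it up repairs both paths while the extra energy is paid only once.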
Dynamic Resource Reclamation
Once we have completed the static scheduling of the paths, we can assign a start time and a commit time to each task. Since the static analysis was based on the worst-case execution profile, each task will finish at or before its commit time during actual runtime. Thus, its successor can begin execution earlier if it has no other pending dependencies, and we can use the extra slack between the start time and the current time to slow down the processor further, under the constraint that the task still finishes by its commit time even if it runs to its worst-case execution profile. The following equation calculates the number of units of execution to be transferred from running at $v_{HI}$ to running at $v_{LO}$ (given that it takes unit time to execute one unit at $v_{HI}$):

$$\mathit{units\_to\_Lo} = \frac{\mathit{start\_time} - \mathit{current\_time}}{\mathrm{slow}(v_{LO}) - 1}$$

where $\mathit{units\_to\_Lo}$ is the additional number of units of the task that should be transferred to the portion executed at $v_{LO}$, $\mathit{current\_time}$ is the current time and $\mathit{start\_time}$ is the start time predicted during the static voltage scheduling based on the worst-case execution profile of the tasks.
This transfer of certain units of execution from $v_{HI}$ to $v_{LO}$ results in further energy savings. The next subsection first explains the algorithm, and then illustrates it through an example TPG on which the whole procedure is performed. For this example, we used a list scheduling heuristic as our static assignment algorithm.
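The reclamation rule can be sketched as a small Python function. The function name and argument names are hypothetical. The numbers in the usage lines are those of task 4 in the example task graph; the slowdown factor of 2.25 is not stated in the text but is the value implied by that task's static split (10.8 + 19.2 × 2.25 = 54 = 63 − 9).

```python
def reclaim(hi_units, lo_units, start_time, current_time, slow_lo):
    """Move units from the high-voltage portion to the low-voltage portion of a task.

    Each moved unit costs (slow_lo - 1) extra time units, so the slack gained by
    starting early is converted into moved units; the worst-case commit time is unchanged.
    """
    slack = max(0.0, start_time - current_time)
    moved = min(hi_units, slack / (slow_lo - 1.0))  # cannot move more than is at high voltage
    return hi_units - moved, lo_units + moved


# Task 4 of the example: static start time 9, actual start at 5.625.
hi, lo = reclaim(10.8, 19.2, start_time=9.0, current_time=5.625, slow_lo=2.25)
print(round(hi, 3), round(lo, 3))
# Worst-case finish time if the task starts now and runs to its worst case.
print(round(5.625 + hi + lo * 2.25, 3))
```

The cap on the moved units reflects the fact that once the whole task runs at the lower voltage, any further slack simply leaves the processor idle.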
An Example Static Assignment Scheme
The assignment problem of a task graph to a finite number of processors is, in general, an NP-complete problem [14], and many heuristics have been proposed to address it. Any of these assignments can be used in conjunction with our algorithm as long as the assignment makes the real-time task graph feasible under the worst-case profile. However, the task assignment heuristic that is followed would have an impact on the effectiveness of our voltage scheduling heuristic. The faster the entire task graph can be completed, the higher will be the effectiveness of our algorithm. We employ a list scheduling heuristic adapted from [30] as an example of how task assignment can be done. We should emphasize here that other assignment heuristics would also work with our algorithm. For example, assignments using genetic algorithms, such as [10], or assignments using simulated annealing, such as [29], can be combined with our algorithm.
The list scheduling heuristic we consider gives the highest priority to the tasks in the longest paths during the task assignment [30] . For a particular fixed voltage, this heuristic allows us to finish the execution of the entire task set in the least amount of time for most cases. Hence, the scope for exploiting the slack is likely to be high if we use this assignment. If this algorithm does not meet the deadline criteria of the task graph under worst case, we have to use another task assignment heuristic.
The assignment heuristic is based on the concept of assigning priorities to tasks by using the concept of the top level and bottom level for the tasks. We define top level as the maximum of the sum of the worst case execution units from any connected source of the TPG to the given task (excluding the execution units of the given task) and bottom level as the maximum of the sum of the worst case execution units from the given task (including the execution units of the given task) to any connected leaf of the TPG. The priority of the task is the sum of bottom level and top level. Once we assign the priority we do an offline analysis using a greedy list-scheduling algorithm to assign tasks to each of the processors and to determine the order of their execution. The heuristic we follow is that whenever we find a free slot in a processor and tasks are ready to run, we assign the ready task with the highest priority to that processor.
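The priority computation and a greedy assignment can be sketched as follows. This is a simplified variant written for illustration, not the heuristic of [30] itself: the names are hypothetical, the graph is given as a successor map, and each selected task is placed on the processor that lets it start earliest.

```python
from functools import lru_cache


def priorities(wcet, succ):
    """Priority of each task = top level + bottom level (in worst-case execution units)."""
    pred = {t: [] for t in wcet}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)

    @lru_cache(maxsize=None)
    def bottom(t):  # longest path from t to a leaf, including t itself
        return wcet[t] + max((bottom(v) for v in succ.get(t, ())), default=0)

    @lru_cache(maxsize=None)
    def top(t):     # longest path from a source to t, excluding t itself
        return max((top(u) + wcet[u] for u in pred[t]), default=0)

    return {t: top(t) + bottom(t) for t in wcet}


def list_schedule(wcet, succ, n_proc):
    """Greedy list scheduling: returns {task: (processor, start time)}."""
    prio = priorities(wcet, succ)
    pred = {t: [] for t in wcet}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    free_at = [0.0] * n_proc
    finish, placement = {}, {}
    remaining = set(wcet)
    while remaining:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [t for t in remaining if all(u in finish for u in pred[t])]
        task = max(ready, key=lambda t: prio[t])
        earliest = max((finish[u] for u in pred[task]), default=0.0)
        proc = min(range(n_proc), key=lambda p: max(free_at[p], earliest))
        start = max(free_at[proc], earliest)
        finish[task] = start + wcet[task]
        free_at[proc] = finish[task]
        placement[task] = (proc, start)
        remaining.remove(task)
    return placement


# Hypothetical diamond-shaped graph: a -> {b, c} -> d.
wcet = {"a": 2, "b": 3, "c": 1, "d": 2}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(priorities(wcet, succ))
print(list_schedule(wcet, succ, n_proc=2))
```

Tasks on the longest path all receive the same, highest priority (the length of that path), which is why they are placed first.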
An Example Taskgraph
We now provide an example to illustrate our algorithm. The example graph is shown in Figure 1. The number inside the circle represents the task number, while the two numbers on the side are the worst-case execution units (in bold) and the actual execution units at runtime for some execution instance, respectively. For this example $v_{HI}$ is chosen at 3.3V and $v_{LO}$ at 2V. The deadline of the execution of the task graph is chosen as 99, for which not all tasks can be executed at $v_{LO}$. We execute this task graph in a system with three processors. The processor assignment following our heuristic keeps the graph unchanged in this case. The priority calculation is demonstrated in [reference missing].
[Figure caption fragment: execution at $v_{HI}$ is represented by a taller rectangle.]
[...] maximum weight of 3. We choose task 4 since it is nearer to the leaf of the TPG and speed it up appropriately to make the path (with minimum rslack) consisting of tasks 2, 4 and 6 meet its deadline. We remove this path from the list of critical paths and proceed with our algorithm. In the next iteration, the weights of tasks 2, 4, 5 and 7 are all 2. We then choose task 7 and speed it up such that the path consisting of tasks 3 and 7 meets its deadline. Table 2 describes the individual iterations in detail. The second column depicts the path chosen for speeding up, the middle columns show the weights of the individual tasks in this step of the algorithm, and the last column shows the task that was selected for speeding up. We continue this iterative procedure until finally we obtain the static schedule shown in Figure 3.
We then do dynamic resource reclamation to reclaim any slack that occurs at runtime. Let us now look at task 4 and see how runtime variations affect the scheduling. After the static scheduling, the start time of task 4 is 9 and the commit time is 63, and it has been scheduled to execute 19.2 units at $v_{LO}$ and 10.8 units at $v_{HI}$. Assume now that the preceding task, task 2, did not take its worst-case time to execute and finished instead at time 5.625. Task 4 can now be started at 5.625 instead of at 9, and we can use this extra time to slow it down further such that its worst-case commit time still remains at 63. Thus, at the time of invocation it is scheduled to execute 21.9 units at $v_{LO}$ and 8.1 units at $v_{HI}$. Figure 4 shows the effect of dynamic resource reclamation on our static algorithm.
In addition, we have used the simplex method to solve the corresponding linear programming optimization problem and found the speedups required by different tasks under their worst-case profile. The simplex method yielded the same sum of speedups as our algorithm. However, the distribution of the speedups was quite different. For example, if we did static scheduling following the simplex method's solution, we would schedule the entire task 2 at the highest voltage, $v_{HI}$. This would have led to inefficient use of the slack resulting from the actual execution time being less than the worst case.
Even though we have shown an example with a single task graph, the algorithm can be used for multiple task graphs as long as they have the same period. After their assignment, the multiple task graphs can be cast as a single task graph and the same algorithm can be applied to achieve our objective.
Extension to a Multi-Voltage System
The algorithm described above was designed for processors supporting just two voltage levels. It can be easily extended to processors supporting multiple voltage levels, as follows. We use the same scheduling algorithm and assign a start time and a commit time to each task. Once this interval is fixed, we can find a unique voltage level at which the task finishes exactly in that interval without any voltage switching. After calculating this voltage level, we choose the two voltage levels supported by the processor between which the calculated voltage level lies. We then run the task at the two chosen voltage levels such that the task finishes exactly at its commit time when running under the worst case.
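The extension can be sketched as follows. This is an illustrative sketch with hypothetical names: each supported level is represented by its slowdown factor relative to the fastest level, and in the usage lines these factors are assumed to be the ratio of clock frequencies for a processor with 1000, 800, 600 and 466 MHz operating points.

```python
def split_between_levels(units, interval, slows):
    """Split a task between the two supported levels that bracket the ideal slowdown.

    units: worst-case execution units; interval: commit time minus start time;
    slows: slowdown factor of each supported level (fastest level = 1.0).
    Returns (fast_slow, units_at_fast, slower_slow, units_at_slower).
    """
    levels = sorted(slows)
    if interval < units * levels[0]:
        raise ValueError("interval too short even at the fastest level")
    if interval >= units * levels[-1]:
        # Enough time to run everything at the slowest level.
        return levels[-1], 0.0, levels[-1], float(units)
    ideal = interval / units  # slowdown that would fill the interval exactly
    for fast, slower in zip(levels, levels[1:]):
        if fast <= ideal <= slower:
            # Solve x * fast + (units - x) * slower = interval for x.
            x = (units * slower - interval) / (slower - fast)
            return fast, x, slower, units - x


# Assumed slowdown factors derived from frequency ratios.
slows = [1000.0 / f for f in (1000, 800, 600, 466)]
fast, at_fast, slower, at_slower = split_between_levels(10, 15.0, slows)
print(fast, round(at_fast, 3), round(slower, 4), round(at_slower, 3))
print(round(at_fast * fast + at_slower * slower, 6))  # worst-case duration of the task
```

Using the two neighbouring levels keeps the task's worst-case duration equal to the interval while requiring only one voltage switch within the task.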
Numerical Results
We have performed extensive simulation experiments of the algorithm described in the previous section. Here we present our results from a real-life application as a case study. We follow it up with our results obtained for a set of random task graphs to show that the algorithm can be applied efficiently to a wide range of task graphs. The application is a task graph for a random sparse matrix solver for electronic circuit simulation using the symbolic generation technique, henceforth referred to as sparse matrix. The sparse matrix task graph has 96 tasks. These task graphs have been published by the Kasahara Lab [3], and the timings are based on actual profiling done on the OSCAR multiprocessor system. The operating voltages of the processors and their frequencies have been modeled based on the Intel XScale [2].
The parameters used in the simulations are $v_{HI} = 1.75$ V, $v_{LO} = 1.0$ V, and $v_T = 0.2$ V unless specified otherwise. Using these values in (3) and (4), the maximum energy savings possible is about 67.34% if everything can be run at $v_{LO}$. The execution units of the individual tasks are uniformly distributed in the range $A$% to 100% of their worst-case profile. We have varied $A$ in the simulations and the results are presented below.
We first compare the energy savings that we get when our scheduling method is followed with a system where there is no voltage scheduling: that is, all tasks have to run at a predefined $v_{HI}$. The results are shown in Figure 5: our algorithm yields considerable energy savings. As the variance of the tasks' execution times increases, we see that we can get increasing savings from the algorithm due to the increasing slack that we can exploit at runtime. Yet, even in the case of worst-case execution ($A = 100$), the plots demonstrate that significant energy savings can be achieved because of our static algorithm. Similarly, when we vary the number of processors, we can exploit the parallelism more and hence have better performance with an increasing number of processors (see Figure 6). The plots in Figure 7 show the savings achieved by dynamic resource reclamation over the static scheduling. As predicted, we can see that the greater the variance in execution time, the better the performance. Since our adjustment is fast and happens only during context switch, we achieve substantial savings with relatively little overhead.

Table 3 (supported voltage levels):
Level   Voltage (V)   Frequency (MHz)
1       1.75          1000
2       1.40          800
3       1.20          600
4       1.00          466
We now present results for random graphs to show the effectiveness of our algorithm. We chose 60 randomly generated graphs, each consisting of 50 tasks. For each task graph we calculated the deadline necessary to schedule the task graph under $v_{HI}$, denoted $MIN$, and the deadline necessary to schedule it under $v_{LO}$, denoted $MAX$. We varied the deadline setting from 0 to 10 such that, for example, a setting of 2 means a deadline of $MIN + 2 \times (MAX - MIN)/10$. For each deadline we calculated the average relative gain and presented the results in Figure 8. The plot shows that the algorithm performs well for a large variety of task graphs.
Next we compared our algorithm with a dynamic voltage adjustment algorithm, referred to as LSSR-N [32], that chooses from an infinite number of voltage levels (see Figure 9). For this infinite-level algorithm, voltage adjustments have been considered only at the time of context switches.
Here we relaxed the constraint that $v_{HI}$ has to be fixed at a particular value and instead allowed it to have any value in the voltage range specified. $v_{HI}$ for the subsequent experiments was chosen as the minimum uniform voltage at which the tasks can execute so that the longest path meets the deadline under the worst-case scenario. Our two-voltage-level algorithm actually outperformed this infinite-level algorithm in most cases. This shows that considering the overall structure of the task graph at the time of voltage scheduling does provide substantial benefits over dynamic slack sharing heuristics.
Finally, we measured the energy savings if we had used multiple voltage levels for our algorithm instead of two voltage levels (see Figure 10). The system supporting multiple voltage levels, as expected, exhibited higher energy savings than the two-voltage-level system. The multiple-voltage-level algorithm chose the appropriate voltage from any of the voltage levels supported (see Table 3), while the dual-voltage scheme used $v_{HI} = 1.75$ V and $v_{LO} = 1.0$ V.
The results demonstrate that our algorithm can be easily incorporated into more complicated system configurations to achieve superior performance. They also show that when deadlines are more relaxed, the effect of the multiple voltage levels diminishes significantly. 
Discussion
In this paper we have considered the problem of an energy-efficient voltage scheduling heuristic for task graphs with precedence constraints. We have described a two-pronged approach to solve this problem and have demonstrated that substantial energy savings can be achieved by considering the relationships among the tasks in the graph. The focus of this paper has been on exploiting the structure of task graphs for energy minimization purposes. The voltage scheduling proposed needs at most one voltage switch during the execution of a task, and is therefore a relatively low-overhead algorithm. We have presented a simple, low-overhead voltage scheduling heuristic for executing task graphs in an energy-efficient way.
