Abstract. This paper develops energy-driven, completion-ratio-guaranteed scheduling techniques for the implementation of embedded software on multi-processor systems with multiple supply voltages. We leverage applications' performance requirements, uncertainties in execution time, and tolerance for reasonable execution failures to scale processors' supply voltages at run-time and reduce energy consumption. Specifically, we study how to trade the difference between the highest achievable completion ratio $Q_{max}$ and the required completion ratio $Q_0$ for energy savings. We propose several on-line scheduling policies, all capable of providing $Q_0$, based on different levels of knowledge about the application's execution time. We show that significant energy savings are achievable when only the worst/best case execution times are known, and that further energy reduction is possible with the probabilistic distribution of execution time. The proposed algorithms have been implemented, and their energy efficiency has been verified by simulations over real-life DSP applications and the TGFF random benchmark suite.
Introduction
Performance guarantees and energy efficiency are becoming increasingly important in the implementation of embedded software. Traditionally, the worst case execution time (WCET) is used to provide performance guarantees; however, this often leads to over-designing the system (e.g., more hardware and more energy consumed than necessary). We discuss how to implement multi-processor embedded systems that deliver performance guarantees with reduced energy consumption.
Many applications, such as multimedia and digital signal processing (DSP) applications, are characterized by repetitive processing on periodically arriving inputs (e.g., voice samples or video frames). Their processing deadlines, which are determined by the throughput of the input data streams, may occasionally be missed without being noticeable or annoying to the end user. For example, in packet audio applications, loss rates between 1% and 10% can be tolerated [2], while the tolerance for losses in low bit-rate voice applications may be significantly lower [13]. Such tolerance gives rise to slacks that can be exploited when streamlining the embedded processing associated with such applications. Specifically, when the embedded processing does not interact with a lossy communication channel, or when the channel quality is high compared to the tolerable rate of missed deadlines, we are presented with slacks in the application that can be used to reduce cost or power consumption.
Typically, slacks arise from run-time variation in task execution time and can be exploited to improve a real-time application's response time or to reduce power. For example, Bambha and Bhattacharyya examined voltage scaling for multi-processors with known computation times and hard deadline constraints [1]. Luo and Jha presented a power-conscious algorithm [14] and static battery-aware scheduling algorithms for distributed real-time battery-powered systems [15]. Zhu et al. introduced the concept of slack sharing on multi-processor systems to reduce energy consumption [26]. The essence of these works is to exploit slack via voltage scaling to reduce energy consumption without suffering any performance degradation (execution failures).
The slack we consider in this paper comes from the tolerance of execution failures or deadline misses. In particular, since the end user will not notice a small percentage of execution failures, we can intentionally drop some tasks to create slack for voltage scaling, as long as we keep the loss rate tolerable. Furthermore, much richer information than a task's WCET is available for many DSP applications. Examples include the best case execution time (BCET) and the execution times when a cache miss occurs, when an interrupt occurs, when the pipeline stalls, or when different conditional branches are taken. More importantly, most of these events are predictable, and we can obtain the probabilities that they occur from detailed timing information about the system (e.g., by sampling techniques) or by simulation on the target hardware [24]. This gives another degree of freedom for exploring on-line and offline voltage scaling for energy reduction.
Dynamic voltage scaling (DVS), which varies the supply voltage and clock frequency according to the workload at run-time, can exploit the slack generated by workload variation and achieve the highest possible energy efficiency for time-varying computational loads [3, 19]. It is arguably the most effective technique to reduce dynamic energy, which remains the dominant part of a system's energy dissipation despite the fast increase of leakage power in modern systems. The works on DVS most relevant to this paper concern energy minimization of dependent tasks on multiple-voltage processors. Schmitz and Al-Hashimi investigated, during the synthesis of distributed embedded systems, how the power variation of DVS processing elements across the executed tasks impacts energy savings [21]. Gruian and Kuchcinski introduced a scheduling approach, LEneS, that uses list-scheduling and a special priority function to derive static schedules with low energy consumption; the assignment of tasks to multiple processors is assumed to be given [8]. Luo and Jha presented a static scheduling algorithm based on critical path analysis and task execution order refinement; they also developed an on-line scheduling algorithm that reduces the energy consumption of real-time heterogeneous distributed embedded systems while providing best-effort service to soft aperiodic tasks, guaranteeing the deadlines and precedence relationships of hard real-time periodic tasks [16]. In [18], Mishra et al. proposed static and dynamic power management schemes for distributed hard real-time systems where the communication time is significant and tasks may have precedence constraints. However, these algorithms use the existing slack to reduce energy; they do not drop tasks to create more slack. Differently, Hua et al. proposed energy reduction techniques on a single processor for multimedia applications that tolerate deadline misses while providing a statistical completion ratio guarantee [10].
Finally, we mention that early efforts on multi-processor design range from design space exploration algorithms [12] to the implementation of such systems [7, 23], and scalable architectures and co-design approaches have been developed for the design of multi-processor DSP systems (e.g., see [11, 20]). These approaches, however, do not provide systematic techniques to handle voltage scaling, non-deterministic computation time, or completion ratio tolerance. Performance-driven static scheduling algorithms that allocate task graphs to multi-processors [22] can be used in conjunction with best- or average-case task computation times to generate an initial schedule for our proposed methods, which then interleave performance monitoring and voltage adjustment functionality into the schedule to streamline its performance.
A Motivational Example
We consider a simple case in which a multiple-voltage processor executes three tasks A, B, and C in that order repetitively. Table 1(a) gives each task's two possible execution times and the probabilities that they occur. Table 1(b) shows the normalized power consumption and processing speed of the processor at three different voltages.

Table 1. Characteristics of the tasks and the processor. (a): each entry shows the best/worst case execution time at $v_1$ and the probability that this execution time occurs at run time. (b): power is normalized to the power at $v_1$, and the delay column gives the normalized processing time to execute the same task at different voltages.

Suppose that each iteration of $A \to B \to C$ must be completed in 10 CPU units and that we can tolerate 40% of the 10,000 iterations missing their deadlines. We now compare the following three algorithms:
(I) For each iteration, run at the highest voltage $v_1$ until completion or the deadline, whichever happens first. (II) Assign the deadline pairs (0,6), (5,8), and (10,10) to tasks A, B, and C, respectively, and scale the voltage so that each task completes within its pair. (III) Give each task a fixed execution slot at a reduced voltage and drop the iteration whenever a task overruns its slot, with the slots chosen so that the product of the per-task completion probabilities still meets the tolerated 60% completion ratio.

Assuming that the execution time of each task follows the above probabilities, we obtain for each algorithm the completion ratio Q, each iteration's average processing time (at the different voltages), and the power consumption (Table 2). We mention that 1) algorithm I gives the highest possible completion ratio; 2) algorithm II achieves the same ratio with less energy consumption; and 3) algorithm III trades unnecessary completions for further energy reduction. Although algorithm I is a straightforward best-effort approach, the settings for algorithms II and III are not trivial: Why are the deadline pairs for A and B set as they are? Is it a coincidence that such a setting achieves the same completion ratio as algorithm I? How do we set the execution slot of each task in algorithm III to guarantee the 60% completion ratio, in particular if we cannot find values such as 80% and 75% whose product gives the desired ratio?

Table 2. Expected completion ratio and energy consumption for the three algorithms. t@v1, t@v2, and t@v3 are the average times that the processor operates at the three voltages in each iteration; E is the average energy consumption to complete one iteration; and the last column, obtained by $E \cdot 60\%/Q$, corresponds to shutting the system down once 6,000 iterations are completed.

In this paper, i) we formulate the energy minimization problem with deadline miss tolerance on multi-processor (DSP) systems; ii) we develop on-line scheduling techniques that convert deadline miss tolerance into energy reduction via DVS; iii) this departs from the conservative practice of over-implementing embedded software to meet deadlines under the WCET; and iv) our result is an algorithmic framework that integrates iterative multi-processor scheduling, voltage scaling, non-deterministic computation time, and completion ratio requirements, providing robust, energy-efficient multi-processor implementations of embedded software for DSP applications.
Problem Formulation
We consider the task graph G = (V, E) for a given application. Each vertex in the graph represents one computation, and directed edges represent the data dependencies between vertices. With each vertex $v_i$ we associate a finite set of possible execution times $\{t_{i,1} < t_{i,2} < \cdots < t_{i,k_i}\}$ and the corresponding set of probabilities $\{p_{i,1}, p_{i,2}, \ldots, p_{i,k_i}\}$ (with $\sum_{j=1}^{k_i} p_{i,j} = 1$) that these execution times occur. That is, with probability $p_{i,j}$, vertex $v_i$ requires an execution time of $t_{i,j}$. Note that $t_{i,k_i}$ is the WCET and $t_{i,1}$ is the BCET of task $v_i$. We then define the prefix sum of the occurrence probabilities

$$P_{i,l} = \sum_{j=1}^{l} p_{i,j}.$$
Clearly, $P_{i,l}$ measures the probability that the computation at vertex $v_i$ can be completed within time $t_{i,l}$, and we have $P_{i,k_i} = 1$, which means that completion is guaranteed if we allocate CPU time to vertex $v_i$ based on its WCET $t_{i,k_i}$.
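For instance, if a vertex $v_i$ has possible execution times $\{2, 5, 9\}$ occurring with probabilities $\{0.7, 0.2, 0.1\}$ (hypothetical values), then $P_{i,1} = 0.7$, $P_{i,2} = 0.9$, and $P_{i,3} = 1$: allocating 5 time units to $v_i$ completes it with probability 0.9, while allocating its WCET of 9 units guarantees completion.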
For each edge $(v_i, v_j) \in E$, there is a cost for inter-processor communication (IPC), $w_{v_i,v_j}$, which is the time to transfer data from the processor that executes $v_i$ to a different processor that will execute $v_j$. There is no IPC cost, i.e., $w_{v_i,v_j} = 0$, if vertices $v_i$ and $v_j$ are mapped to the same processor by the task scheduler. For a given datapath $\langle v_1 v_2 \cdots v_n \rangle$, its completion time is the sum of each vertex's run-time execution time $e_i$ and all the IPC costs. That is,

$$C(\langle v_1 v_2 \cdots v_n \rangle) = \sum_{i=1}^{n} e_i + \sum_{i=1}^{n-1} w_{v_i,v_{i+1}}.$$
The completion time for the entire task graph G (or equivalently the given application), denoted by C(G), is equal to the completion time of its critical path, which has the longest completion time among all its datapaths. We are also given a deadline constraint M, which specifies the maximum time allowed to complete the application. The application (or its task graph) will be executed on a multi-processor system periodically, with its deadline M as the period. We say that an iteration is successfully completed if its completion time, which depends on the run-time behavior, satisfies $C(G) \le M$. Closely related to M is a real-valued completion ratio constraint (or requirement) $Q_0 \in [0, 1]$, which gives the minimum acceptable completion ratio over a sufficiently large number of iterations. Alternatively, $Q_0$ can be interpreted as a guarantee on the probability with which an arbitrary iteration can be successfully completed.
Finally, we assume that multiple supply voltage levels are simultaneously available to each processor in the multi-processor system. Such a system can be implemented with a set of voltage regulators, each of which regulates a specific voltage for a given clock frequency. The operating system can then control the clock frequency at run-time by writing to a system control register, exactly as in [3], except that the system does not need to wait for the voltage converter to generate the desired operating voltage. In sum, we can assume that each processor can switch its operating voltage from one level to another instantaneously and independently, with dynamic power dissipation $P \propto C V_{dd}^2 f$ [4]. Furthermore, on a multiple-voltage system, for a task under any time constraint, a voltage schedule using at most two voltages minimizes the energy consumption, finishing the task exactly at its deadline [19].
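To make the two-voltage result concrete, consider (in our own notation) a task of $W$ cycles with time budget $T$ on a processor whose available clock frequencies $f_1 > f_2$ bracket the ideal frequency $W/T$. Running $t_1$ units of time at $f_1$ and the rest at $f_2$, so that all cycles finish exactly at the deadline, gives

$$f_1 t_1 + f_2 (T - t_1) = W \quad\Longrightarrow\quad t_1 = \frac{W - f_2 T}{f_1 - f_2},$$

and by the convexity of the power-speed curve this split consumes no more energy than any other schedule meeting $T$ [19].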
In this paper, we consider the following problem:
For a given task graph G with deadline M and completion ratio constraint $Q_0$, find a scheduling strategy for a multi-processor multi-voltage system (that is, a policy deciding when and at which voltage each task is executed) such that the total energy consumption is minimized while the completion ratio requirement $Q_0$ is guaranteed. It is well-known that variable-voltage task scheduling for low power is in general NP-hard [9, 19]. On the other hand, the multi-processor task scheduling problem has been studied intensively with other optimization objectives, such as completion time or IPC cost [17, 22]. In this paper, we focus on developing on-line algorithms for voltage scaling (and voltage selection in particular) on a scheduled task graph. That is, we assume that tasks have already been assigned to processors, and our goal is to determine when and at which voltage each task should be executed in order to minimize the total energy consumption while meeting the completion ratio constraint $Q_0$.
Energy-Driven Voltage Scaling Techniques with Completion Ratio Constraint
In this section, we first obtain, with a simple algorithm, the best completion ratio on a multi-processor system for a given task assignment. We then give a lower bound on the energy consumption needed to achieve this best completion ratio. Our focus is the development of on-line energy reduction algorithms that leverage a required completion ratio lower than the best achievable.
$Q_{max}$: the Highest Achievable Completion Ratio
Even when there is only one supply voltage, which results in a fixed processing speed, and each task has its own fixed execution time, the problem of determining whether a set of tasks can be scheduled on a multi-processor system to complete by a specific deadline remains NP-complete (this is the multiprocessor scheduling problem [SS8], which is NP-complete even for two processors [6]). However, for a given task assignment, the highest possible completion ratio can be trivially achieved by simply applying the highest supply voltage on all the processors. That is, each processor keeps executing whenever tasks assigned to it are ready for execution, and stops when it completes all its assigned tasks in the current iteration or when the deadline M is reached. In the latter case, if any processor has not finished its execution, we say the current iteration has failed; otherwise, we have a successful completion, or simply a completion. Clearly this naïve method is a best-effort approach in that it tries to complete as many iterations as possible. Since it operates all the processors at the highest voltage, the naïve approach provides the highest possible completion ratio, denoted by $Q_{max}$. In other words, if a completion ratio requirement cannot be achieved by this naïve approach within the given deadline M, then no other algorithm can achieve it either.
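As a reference point, a minimal sketch of this best-effort loop; the container names, and the omission of precedence edges and IPC delays, are simplifying assumptions of ours:

```python
def naive_iteration(queues, exec_time, deadline):
    """Naive best-effort sketch: every processor executes its assigned tasks
    at the highest voltage until its queue drains or the deadline M arrives.
    queues maps each processor to its scheduled task list; exec_time gives
    each task's run-time execution time at the highest voltage."""
    for tasks in queues.values():
        t = 0.0
        for task in tasks:
            t += exec_time[task]       # keep running at the highest voltage
            if t > deadline:
                return False           # some processor missed M: failure
    return True                        # all queues drained: completion
```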
When the application-specified completion ratio requirement Q 0 < Q max , a simple counting mechanism can be used to reduce energy consumption. Specifically we cut the N iterations into smaller groups and shut the system down once sufficient iterations have been completed in each group. For example, if an MPEG application requires a 90% completion ratio, we can slow down the system (or switch the CPU to other applications) whenever the system has correctly decoded 90 out of 100 consecutive frames. This counting mechanism saves total energy by preventing the system from over-serving the application.
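A minimal sketch of this counting mechanism, with hypothetical run_iteration and shut_down hooks:

```python
GROUP_SIZE = 100   # iterations per group, as in the MPEG example above

def serve_group(q0, run_iteration, shut_down):
    """Stop serving a group early once enough iterations have completed.
    run_iteration executes one iteration and reports success; shut_down
    idles (or reassigns) the processors for the skipped iterations."""
    quota = int(GROUP_SIZE * q0)       # e.g., 90 completions for Q0 = 0.9
    completed = 0
    for i in range(GROUP_SIZE):
        if completed >= quota:
            shut_down(GROUP_SIZE - i)  # quota met: skip the rest of the group
            return completed
        if run_iteration():            # True iff the deadline M was met
            completed += 1
    return completed
```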
For systems with multiple operating voltages, energy could be saved over the naïve approach in the following scenarios: i) if we knew that an iteration would complete earlier than the deadline M, we could process it at a lower voltage; and ii) if we knew that an iteration could not complete, we could stop its execution earlier. To save the maximal amount of energy, we want to determine the lowest voltage levels that lead to completion right at the deadline M, and to find the earliest time to terminate an incompletable iteration. However, additional information about the tasks' execution times (e.g., WCET, BCET, and/or the probabilistic distribution) is required to answer these questions. In the rest of this section, we propose on-line voltage scaling techniques that reduce energy with the help of such information.
BEEM: Achieving $Q_{max}$ with the Minimum Energy
The best-effort energy minimization (BEEM) technique proposed by Hua et al. gives the minimum energy consumption on a single-processor system that provides the highest achievable completion ratio [10]. We extend this approach and propose algorithm BEEM1 for multi-processor systems, as well as BEEM2, which does not assume that tasks' execution times are known a priori. We define the earliest energy-efficient completion time $T_e^v$ and the latest feasible completion time $T_l^v$ of a vertex $v$ by the following recursive formulas:

$$T_e^{v} = T_l^{v} = M \quad \text{for every sink vertex } v,$$
$$T_e^{v_i} = \min_{(v_i,v_j)\in E} \bigl( T_e^{v_j} - t_{j,k_j} - w_{v_i,v_j} \bigr), \qquad T_l^{v_i} = \min_{(v_i,v_j)\in E} \bigl( T_l^{v_j} - t_{j,1} - w_{v_i,v_j} \bigr),$$

where $t_{j,1}$ and $t_{j,k_j}$ are the BCET and WCET of vertex $v_j$, and $w_{v_i,v_j}$ is the IPC cost from vertex $v_i$ to $v_j$, which is 0 if the two vertices are assigned to the same processor.
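These recurrences can be evaluated offline in one reverse topological sweep; a sketch, with hypothetical container names:

```python
def completion_windows(succ, wcet, bcet, ipc, deadline, topo_order):
    """Compute T_e (earliest energy-efficient completion) and T_l (latest
    feasible completion) for every vertex of the scheduled task graph.
    succ maps a vertex to its successors; ipc maps an edge to its IPC cost
    (0 when both endpoints run on the same processor)."""
    Te, Tl = {}, {}
    for v in reversed(topo_order):         # successors are processed first
        if not succ[v]:                    # sink vertex: finish right at M
            Te[v] = Tl[v] = deadline
        else:
            Te[v] = min(Te[u] - wcet[u] - ipc.get((v, u), 0) for u in succ[v])
            Tl[v] = min(Tl[u] - bcet[u] - ipc.get((v, u), 0) for u in succ[v])
    return Te, Tl
```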
Lemma 1. If an algorithm minimizes energy consumption, then vertex $v_i$'s completion time cannot be earlier than $T_e^{v_i}$.
[Proof]: Clearly such an algorithm completes each iteration at the deadline M; otherwise, one could always reduce the operating voltage and processing speed (or adjust the combination of two operating voltages) of the last task to save more energy. Let t be vertex $v_i$'s completion time at run time and suppose $t < T_e^{v_i}$. For any path $u_0 = v_i, u_1, \ldots, u_k = v$ from $v_i$ to a sink node $v$, let $WCET_{u_j}$ be the worst case execution time of vertex $u_j$; then the completion time of this path is at most

$$t + \sum_{j=1}^{k} \bigl( WCET_{u_j} + w_{u_{j-1},u_j} \bigr) < M.$$

This implies that even when the WCET occurs for all the successor vertices of $v_i$ on this path, the path completes before the deadline M. Since this holds for all paths, the iteration finishes earlier than M, which cannot be the most energy efficient: a contradiction.
Lemma 2. If vertex $v_i$'s completion time $t > T_l^{v_i}$, then the current iteration cannot be completed by the deadline M.
[Proof]: Assume that the best case execution time occurs for every remaining vertex after $v_i$ completes at time t; this gives the earliest possible completion of the current iteration. By the definition of $T_l^{v_i}$, there exists at least one path $u_0 = v_i, u_1, \ldots, u_k = v$ from $v_i$ to a sink node $v$ such that

$$t + \sum_{j=1}^{k} \bigl( BCET_{u_j} + w_{u_{j-1},u_j} \bigr) > M.$$

Hence even in the best case the iteration misses the deadline M.
Combining these two lemmas with the naïve approach that achieves the highest possible completion ratio $Q_{max}$, we obtain algorithm BEEM1. Let t be the current time at which vertex v is going to be processed, and let $t'$ be v's completion time at the highest voltage, which is known because BEEM1 assumes the real execution time is available a priori. If $t' > T_l^v$, the current iteration cannot complete and is dropped immediately; if $t' \ge T_e^v$, v is executed at the highest voltage; otherwise, v is executed at the lowest voltage (or the energy-minimal combination of two voltages) that completes it exactly at $T_e^v$. However, it is unrealistic to know each task's real execution time a priori. We hence propose algorithm BEEM2, another version of BEEM that does not require tasks' run-time execution times to make decisions, yet still achieves the highest completion ratio $Q_{max}$:
Algorithm BEEM2. Let t be the current time at which vertex v is going to be processed. If $t > T_l^v$, the current iteration is not completable (Lemma 2) and is dropped; otherwise, v is executed at the lowest voltage that would complete its WCET by $T_e^v$, or at the highest voltage if no such level exists. The values $\{T_e^v, T_l^v\}$ can be computed offline only once, and each on-line decision in BEEM1 and BEEM2 requires at most two additions and two comparisons. Therefore, the on-line decision making takes constant time and does not increase the run-time complexity. Finally, as in our discussion of the naïve approach, further energy reduction is possible if the required completion ratio $Q_0 < Q_{max}$.
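A sketch of the constant-time on-line test behind BEEM2, under the assumptions above ($T_e$ and $T_l$ precomputed offline; processing rates normalized so that 1.0 is the highest-voltage rate):

```python
def beem2_decide(v, t, wcet, Te, Tl, speeds):
    """On-line voltage selection for vertex v at time t. wcet[v] is v's WCET
    at the highest voltage; speeds is the set of normalized rates.
    Returns None to drop the iteration, otherwise the chosen rate."""
    if t > Tl[v]:                 # Lemma 2: incompletable even in the best case
        return None               # terminate the current iteration early
    slack = Te[v] - t             # time budget to v's target completion T_e
    if slack <= wcet[v]:          # T_e unreachable even at full speed:
        return 1.0                # run at the highest voltage
    return min(s for s in speeds if wcet[v] / s <= slack)  # lowest feasible rate
```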
QGEM: Completion Ratio Guaranteed Energy Minimization
Both the naïve approach and the BEEM algorithms achieve the highest completion ratio. Although they can be adapted to provide exactly the required completion ratio $Q_0$ for energy reduction, they may not be the most energy-efficient way to do so when $Q_0 < Q_{max}$. In this section, we propose a hybrid offline/on-line completion ratio guaranteed energy minimization (QGEM) algorithm, which consists of three steps:
In Step 1, we seek the minimum effort (that is, the least amount of computation $t_s^i$ that we must process on each vertex $v_i$) that provides the required completion ratio $Q_0$ (Fig. 1). Starting from a full commitment to serve every task's WCET (line 2), we use a greedy heuristic to lower our commitment on the vertices along critical paths (lines 6-13). Vertex $v_j$ is selected first if the reduction from its WCET $t_{j,k_j}$ to $t_{j,k_j-1}$ (or from the current $t_{j,l}$ to $t_{j,l-1}$) maximally shortens the critical paths and minimally degrades the completion ratio, measured by their product (line 10); a sketch of this loop follows.
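Since Fig. 1 is not reproduced here, the following sketch only captures the loop under our assumptions; ratio_if and shortening_if are hypothetical evaluators of a candidate one-level reduction:

```python
def min_commitment(tasks, times, q0, ratio_if, shortening_if):
    """Step 1 sketch: start from full WCET commitments and greedily lower the
    commitment of one vertex at a time. ratio_if(v, l) and shortening_if(v, l)
    return the completion ratio and the critical-path shortening obtained if
    v were committed to execution time t_{v,l}."""
    level = {v: len(times[v]) - 1 for v in tasks}   # index k_v - 1 is the WCET
    while True:
        # candidate reductions that keep the required ratio Q0 intact
        cand = [v for v in tasks
                if level[v] > 0 and ratio_if(v, level[v] - 1) >= q0]
        if not cand:
            break
        # pick the reduction scoring best on shortening x remaining ratio
        v = max(cand, key=lambda u: shortening_if(u, level[u] - 1)
                                    * ratio_if(u, level[u] - 1))
        level[v] -= 1
    return {v: times[v][level[v]] for v in tasks}   # committed workloads t_s
```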
The goal of Step 2 is to allocate the maximum execution time $t_q^i$ for each task $v_i$ to process the minimum computation $t_s^i$, bringing the completion time L close to the deadline M (Fig. 2). Lines 3-9 repetitively scale $t_q^i$ for all the tasks. Because the IPCs are not scaled, extending the allocated execution time of each task by a factor of M/L (line 6) may not stretch the completion time from L to M. Furthermore, this extends each path unevenly, so we re-evaluate the completion time (and critical path) at line 7. To prevent an endless repetition, we stop when the scale factor r is less than a small number (line 5), which is set to $10^{-6}$ in our simulation; a sketch of this loop is given below.
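A sketch of the Step 2 stretching loop, with completion_time as a hypothetical evaluator of the schedule length L:

```python
EPS = 1e-6   # stop once the residual stretch factor is negligible

def stretch_budgets(tq, deadline, completion_time):
    """Repeatedly stretch the per-task time budgets t_q toward the deadline M.
    Since IPC delays are not scaled, one stretch by M/L rarely lands exactly
    on M, so L is re-evaluated after every pass."""
    L = completion_time(tq)
    while deadline / L - 1.0 >= EPS:       # remaining relative slack r
        factor = deadline / L
        for v in tq:
            tq[v] *= factor                # stretch every task's budget
        L = completion_time(tq)            # paths stretch unevenly: recompute
    return tq
```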
Step 3 defines the on-line voltage scheduling policy of the QGEM approach (Fig. 3): we scale the voltage to complete a task $v_i$ by its expected drop-time $D_i$, assuming that its run-time execution requirement equals the minimum workload $t_s^i$ we have committed to it (line 2). If $v_i$ demands more, it will not finish by $D_i$, and we drop the current iteration (line 4).
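A sketch of this policy; D, ts, speeds, and execute follow the hypothetical conventions of the earlier sketches:

```python
def qgem_run_task(v, t, ts, D, speeds, execute):
    """Step 3 sketch: start v at time t at the lowest rate that finishes its
    committed workload ts[v] by the drop-time D[v]. execute() returns the
    finish time, or None if v was still running (i.e., demanded more than
    ts[v]) when D[v] arrived."""
    feasible = [s for s in speeds if ts[v] / s <= D[v] - t]
    rate = min(feasible) if feasible else max(speeds)  # Step 2 makes this rare
    finish = execute(v, rate, stop_at=D[v])
    return finish      # None signals: drop the current iteration (line 4)
```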
Note that if every task $v_i$ has a real execution time no greater than $t_s^i$ in an iteration, QGEM's on-line scheduler will complete this iteration. On the other hand, if a longer execution time occurs at run-time, QGEM terminates the iteration right after the execution of that task. From the way we determine $t_s^i$ (Fig. 1), we know that the required completion ratio $Q_0$ is guaranteed. Energy saving comes from two mechanisms: the early termination of unnecessary iterations (line 5 in Fig. 3) and the use of low voltage to fully utilize the time from now until a task's expected drop-time (line 2 in Fig. 3). We confirm our claim on QGEM's completion ratio guarantee and demonstrate its energy efficiency by simulation in the next section.
Simulation Results
In this section we present simulation results to verify the efficacy of our proposed approaches. We have implemented the proposed algorithms and simulated them over a variety of real-life and random benchmark graphs. Some task graphs, such as FFT (Fast Fourier Transform), Laplace (Laplace transform), and karp10 (Karplus-Strong music synthesis algorithm with 10 voices), are extracted from popular DSP applications. The others are generated using TGFF [5], a randomized task graph generator. We assume that a set of homogeneous processors is available; however, our approaches are general enough to be applied to embedded systems with heterogeneous multi-processors.
Before applying our approaches to the benchmark graphs, we need to schedule all of the tasks onto the available processors based on a performance metric such as latency. Here we use the dynamic level scheduling (DLS) method [22]; however, our techniques could be used with any alternative static scheduling strategy. The DLS method accounts for interprocessor communication overhead when mapping precedence graphs onto multiple processors, in order to make the latency from the source to the sink as small as possible. We apply this method to the benchmarks and obtain scheduling results that include the task execution order on each processor as well as the interprocessor communication links and costs. Furthermore, we assume that interprocessor communication is full-duplex and that the intraprocessor data communication cost is negligible.
After obtaining the results from DLS, we apply the proposed algorithms to them. Our experiments have several objectives. First, we want to compare the energy consumption of the different algorithms under the same deadline and completion ratio requirements. Second, we want to investigate the impact of the completion ratio requirement and the deadline on the energy consumption of the proposed approaches. Finally, we want to study the energy efficiency of our algorithms with different numbers of processors.
We set up our experiments in the following way. For each task, there are three possible execution times, $e_0 < e_1 < e_2$, that occur with probabilities $p_0 \gg p_1 > p_2$, respectively. All processors support real-time voltage scheduling and power management (such as shutdown) mechanisms. Four voltage levels, 3.3V, 2.6V, 1.9V, and 1.2V, are available, with a threshold voltage of 0.5V. For each pair of deadline M and completion ratio $Q_0$, we simulate 1,000,000 iterations for each benchmark with each algorithm. Because naïve, BEEM1, and BEEM2 all provide the highest possible completion ratio, which is higher than the required $Q_0$, we take 100 iterations as a group and stop execution once $100\,Q_0$ iterations in the same group have been completed.

Table 3 reports the average energy consumption per iteration of the different algorithms on each benchmark with deadline constraint M and completion ratio constraint $Q_0 = 0.900$. From the table we can see that both BEEM1 and BEEM2 provide the same completion ratio with averages of nearly 29% and 26% energy savings over naïve, respectively. Compared with BEEM2, BEEM1 saves more energy because it assumes that the actual execution time is known a priori. Even without this assumption, the QGEM approach still saves more energy than BEEM2 on most benchmarks. Specifically, it provides 36% and 12% energy savings over naïve and BEEM2, respectively, and achieves a 0.9111 average completion ratio, higher than the required 0.9000. We mention that for the FFT2 benchmark, QGEM has negative energy savings compared to BEEM2 because the deadline M is so long that BEEM2 can scale down the voltage for most of the tasks and save energy.

Fig. 4 depicts the impact of the completion ratio requirement on the energy efficiency of the different algorithms with the same deadline M = 9705. We can see that as $Q_0$ decreases, the energy consumption of each algorithm decreases. However, unlike for naïve, BEEM1, and BEEM2, the energy consumption of QGEM does not change dramatically. Therefore, although QGEM consumes the least energy under a high completion ratio requirement (Fig. 4), it may consume more energy than BEEM1, BEEM2, or even naïve when $Q_0$ is low.

The impact of the deadline on the energy consumption is shown in Fig. 5 with the same $Q_0 = 0.900$. Because the naïve approach operates at the highest voltage until the required $Q_0$ is reached, when the highest possible completion ratio of the system is close to 1, its energy consumption remains constant regardless of the deadline M. In BEEM1 and BEEM2, however, the latest completion times $T_l^v$ increase with M, and the energy consumption is reduced dramatically as M increases. For QGEM, increasing the deadline also has a positive effect on the energy savings, though not as dramatic as for BEEM1 and BEEM2. Similar to the impact of the completion ratio requirement, we conclude that QGEM consumes less energy than BEEM1 and BEEM2 when the deadline is short (provided that $Q_0$ is achievable), while consuming more energy when the deadline is long.
From Table 3 and Figs. 4-5, we conclude that QGEM saves more energy than BEEM1 and BEEM2 when $Q_0$ is high and M is not too long. This conclusion is valid regardless of the number of processors. As the number of processors increases, the latency of the schedule is reduced, so the same deadline (e.g., 7275) becomes relatively longer. Hence QGEM saves more energy than BEEM1 and BEEM2 for systems with a small number of processors (e.g., 4 processors), whereas for systems with a large number of processors (e.g., more than 5), QGEM consumes more energy than BEEM1 and BEEM2.
Conclusions
Many embedded applications, such as multimedia and DSP applications, have high performance requirements yet can tolerate a certain level of execution failures. We investigate how to trade this tolerance for energy efficiency, another increasingly important concern in the implementation of embedded software.
In particular, we consider systems with multiple supply voltages that enable dynamic voltage scaling, arguably the most effective energy reduction technique. We present several on-line scheduling algorithms that scale the operating voltage based on parameters pre-determined offline. All the algorithms have low run-time complexity, yet achieve significant energy savings while providing the required performance, measured by the completion ratio.
