Abstract: In this paper, we propose a novel, energy-aware scheduling algorithm for applications running on DVS-enabled multiprocessor systems, which exploits variation in the execution times of individual tasks. In particular, our algorithm takes into account latency and resource constraints, precedence constraints among tasks and input-dependent variation in execution times of tasks to produce a scheduling solution and voltage assignment such that the average energy consumption is minimized. Our algorithm is based on a mathematical programming formulation of the scheduling and voltage assignment problem and runs in polynomial time. Experiments with randomly generated task graphs show that up to 30% savings in energy can be obtained by using our algorithm over existing techniques. We also perform experiments on two real-world applications, an MPEG-4 decoder and an MJPEG encoder. Simulations show that the scheduling solution generated by our algorithm can provide up to 25% reduction in energy consumption over greedy dynamic slack reclamation algorithms.
I. INTRODUCTION
Energy consumption is an important issue in designing battery-operated embedded systems. The goal of designers is to create a system which consumes minimal energy and satisfies performance constraints at the same time. Multi-core chips are becoming increasingly popular for embedded systems. Examples include the ARM Cortex-A9 MPCore [23] and the MIPS 1004K [24]. One of the most effective design techniques for minimizing energy consumption of processors is dynamic voltage and frequency scaling (DVFS), in which processor frequency and voltage can be adjusted depending on the workload of the processor. We first briefly examine techniques that have been proposed to exploit variation in execution time to reduce energy.
A. Uniprocessor Systems
The work in [7] [9] [12] [14] [28] models the execution time of a task as a random variable and minimizes expected energy consumption on a single processor system. A heuristic is provided in [7] for obtaining a low-energy schedule. In [9] [12], exact solutions are provided using convex optimization techniques; however, many of their assumptions, such as the ability to change the voltage to any arbitrary value at any point during the execution of a task, are not valid for practical systems. Many of these issues are addressed in [14] for uniprocessor systems. In [28], a mathematical formulation is presented to optimize the expected energy consumption (both dynamic and leakage) by using DVFS and Adaptive Body Biasing (ABB). However, a simple extension of the proposed method to multiprocessor systems leads to an exponential increase in complexity.
B. Multiprocessor Systems
Dynamic slack reclamation based techniques: In [13] [8], the authors propose techniques by which the dynamic slack is distributed among the remaining tasks. While the work in [8] does not consider precedence constraints among tasks, a list scheduling heuristic is used in [13] for tasks with precedence constraints. Such online techniques are constrained to be relatively simple and fast.
Schedule table based techniques: The idea of using a heuristic to build a schedule table at design time was proposed in [5] [3] for scheduling and voltage assignment for conditional task graphs (CTGs). However, the proposed techniques are restricted to CTGs.
Expected energy minimization: A highly complex, non-linear integer programming based method is proposed in [1] for task mapping and scheduling in multiprocessor systems. In [2], the authors attempt to balance the expected energy consumption across processors by partitioning a set of independent tasks. In [29], a dynamic programming based method is used to minimize expected energy consumption. However, the proposed method is exponential in complexity for multiprocessor systems.
The primary contributions of this paper can be stated as follows.
1) We propose a mathematical programming formulation based technique for scheduling tasks on DVFS-capable multiprocessor systems, which takes into account input-dependent variation in the execution time of tasks to reduce average energy consumption subject to a specified latency constraint. Our technique is capable of handling precedence constraints among tasks.
2) Our technique runs in polynomial time for multiprocessor systems; the solution is optimal for tree-like task graphs. This is achieved by a novel pruning method during the formulation phase that avoids the enumeration done in [28] [29]. Our algorithm constructs a schedule table at design time to provide multiple scheduling options for each task.
While complex algorithms can be used to build the schedule table at design time, the only extra processing that a system needs to perform at run-time is a table look-up.
The rest of the paper is organized as follows. Section II describes our task graph and processor models, and Section II.C provides the problem statement. Section III presents our formulation, by which variation in execution time can be exploited. Section IV describes improvements to the scheduling algorithm and discusses its time complexity. Experiments performed on randomly generated task graphs as well as two real-world applications are described in Section V.
II. PRELIMINARIES AND PROBLEM STATEMENT

A. Processor Model
We assume that the number of processors is restricted to be no more than a certain number P and that the voltage of every processor in the system can be set independently to any value within a given range [V_lower, V_upper] at run-time. The overhead for switching between voltages is assumed to be negligible compared to the execution time of individual tasks. To model the relation between energy, voltage and clock frequency, we use well-known equations for CMOS logic [15]: Figure 1. All four tasks are assumed to be identical and the WorkLoad vector for the tasks is shown in Figure 1. The worst-case workload, WCW(v), is assumed to be 100 cycles for all the tasks. The probability that the workload of a task is no greater than 75 cycles is 0.7 and the probability that the workload is between 75 and 100 cycles is 0.3. Suppose we are given a latency constraint of 450 ns for this example on a 2-processor system with processors P1 and P2. Using the model described in Section II.A, we can determine that when all the tasks run for 75 cycles, the worst-case scheduling produces a solution which consumes 44% more energy than the optimal solution.
• Start Cycles Elapsed (SCE(v)) for a task v is the number of clock cycles elapsed when all predecessors of v have completed.
For tasks with only primary inputs, SCE(v) is 0. For other tasks, Continuing to use our example in Figure 1, we show a sample schedule
III. VAR-TB: VARIATION AWARE TIME BUDGETING
Our algorithm is divided into two phases: the first phase assigns tasks to processors and the second phase determines the start time and voltage assignment for each task.
A. Task assignment heuristic
The problem of resource-constrained energy minimization subject to latency constraints has been proved to be NP-hard [8].
We use a priority-based heuristic to assign the tasks to a set of P processors. The highest-priority task is scheduled to run on the first processor that is ready to accept a new task. In our experiments, we use the difference between the ALAP and ASAP times of a task to decide its priority. These times are computed by assuming that all processors run at their highest frequency and all tasks run for their worst-case workload. After a task is assigned, the ASAP and ALAP times for the remaining tasks are re-computed. We insert fake edges between consecutive tasks running on the same processor to maintain task order.
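The assignment phase described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`topo_order`, `asap_alap`, `assign_tasks`) are hypothetical, the task with the smallest ALAP minus ASAP difference is assumed to have the highest priority, ties are broken by task name, and worst-case execution times at the highest frequency are passed in as `wc_time`.

```python
from collections import defaultdict

def topo_order(tasks, edges):
    """Topological order of the task graph (Kahn's algorithm)."""
    indeg = {t: 0 for t in tasks}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = sorted(t for t in tasks if indeg[t] == 0)
    order = []
    while ready:
        u = ready.pop(0)
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order

def asap_alap(tasks, edges, wc_time, latency):
    """ASAP and ALAP start times assuming worst-case execution times."""
    order = topo_order(tasks, edges)
    asap = {t: 0.0 for t in tasks}
    for u in order:
        for a, b in edges:
            if a == u:
                asap[b] = max(asap[b], asap[u] + wc_time[u])
    alap = {t: latency - wc_time[t] for t in tasks}
    for v in reversed(order):
        for a, b in edges:
            if b == v:
                alap[a] = min(alap[a], alap[v] - wc_time[a])
    return asap, alap

def assign_tasks(tasks, edges, wc_time, latency, num_procs):
    """Assign tasks to processors; returns (assignment, edges with fake edges)."""
    edges = set(edges)
    preds = defaultdict(set)
    for u, v in edges:  # readiness is decided on the original edges
        preds[v].add(u)
    proc_free = [0.0] * num_procs      # time at which each processor is free
    last_on_proc = [None] * num_procs  # last task placed on each processor
    finish, assignment = {}, {}
    while len(assignment) < len(tasks):
        # Re-compute ASAP/ALAP after every assignment (fake edges included).
        asap, alap = asap_alap(tasks, edges, wc_time, latency)
        ready = [t for t in tasks if t not in assignment
                 and all(u in assignment for u in preds[t])]
        # Assumption: smallest (ALAP - ASAP) means highest priority.
        task = min(ready, key=lambda t: (alap[t] - asap[t], t))
        p = min(range(num_procs), key=lambda i: proc_free[i])
        start = max([proc_free[p]] + [finish[u] for u in preds[task]])
        finish[task] = start + wc_time[task]
        proc_free[p] = finish[task]
        if last_on_proc[p] is not None:
            edges.add((last_on_proc[p], task))  # fake edge keeps task order
        last_on_proc[p] = task
        assignment[task] = p
    return assignment, edges
```

For example, with four tasks of 100 time units each, edges a -> c and b -> c, a latency of 450 and two processors, the sketch places a and c on the first processor and b and d on the second, and adds the fake edge (b, d).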
B. Task scheduling and voltage assignment
We first present the variable organization in our mathematical formulation for the scheduling and voltage assignment problem. We contrast our approach with two existing works and explain how our formulation avoids enumeration of a large number of task workloads. In Section IV.B, we explain why our approach runs in polynomial time and yet provides optimal solutions for certain kinds of task graphs.
C. Mathematical formulation of VAR-TB
The first two constraints impose non-negativity and latency constraints. The third constraint imposes precedence between u and v. Note that the only variables in the above problem are the start times, finish times and clock periods. Constraint 2 is repeated for every value of WorkLoad(v). Constraint 4 imposes bounds on the clock period variable. We explain precedence constraints with the example from Figure 1. Task
Objective Function
The objective function represents the average energy consumption and is given by Equation (5). Here, prob_{v,i} is the probability and cp_{v,i} is the clock period for task v when SCE(v) is ece_{v,i}, and C_v is a constant.
We observe that the only variable in the objective function is the clock period cp v,i . Also, the objective function is convex separable and non-increasing. The probability information is obtained by application profiling.
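To make the structure of such an objective concrete, the following sketch evaluates a probability-weighted sum over the schedule table entries of every task. The per-cycle energy law `1 / cp**2` is an assumption chosen only because it is convex and non-increasing in the clock period; it is not taken from Equation (5), and the function and parameter names are hypothetical.

```python
def average_energy(table, c, energy_of_cp=lambda cp: 1.0 / cp ** 2):
    """Probability-weighted energy over all tasks and table entries.

    table[v] is a list of (prob, cp) pairs, one pair per SCE(v) entry of task v.
    c[v] is the task constant C_v.
    energy_of_cp is an ASSUMED convex, non-increasing energy law.
    """
    total = 0.0
    for v, entries in table.items():
        # The sum is separable: each term depends on a single clock period.
        total += c[v] * sum(prob * energy_of_cp(cp) for prob, cp in entries)
    return total
```

With one task, C_v = 4, and entries (0.7, 2.0) and (0.3, 1.0), the sketch evaluates 4 * (0.7 / 4 + 0.3 / 1) = 1.9. Because every term involves exactly one clock period, increasing any cp_{v,i} can only lower the total, which is the separable, non-increasing behaviour noted above.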
IV. IMPROVING THE SCHEDULING ALGORITHM
A. Restricting the number of SCE(v) entries per task
The number of values of SCE(v) can be large even for small task graphs. To avoid this, we limit the number of values of SCE(v) per task to be less than a constant K. In our approach, we select the K values so that the area under the probability distribution curve for SCE(v) is split into K equal regions.
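One way to realize the equal-area selection is to take quantiles of the sampled SCE(v) distribution. The sketch below is an illustration under assumptions: it works on raw samples of SCE(v) (for instance from profiling runs), represents each of the K equal-probability regions by its largest sample, and merges duplicates, so it returns at most K values. The function name `select_sce_values` is hypothetical.

```python
import math

def select_sce_values(samples, k):
    """Pick at most k representative SCE(v) values from a list of samples.

    The sorted samples are split into k regions of (nearly) equal probability
    mass; each region is represented by its largest sample (an assumption).
    """
    ordered = sorted(samples)
    n = len(ordered)
    picks = []
    for j in range(1, k + 1):
        idx = math.ceil(j * n / k) - 1  # index of the last sample in region j
        picks.append(ordered[idx])
    return sorted(set(picks))  # duplicates collapse when the distribution is lumpy
```

For a task whose SCE(v) is 75 cycles in 7 of 10 samples and 100 cycles in the remaining 3, calling the function with k = 4 returns only the two distinct values 75 and 100.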
B. Time complexity
Since the number of values of SCE(v) per task is always less than a constant K, the number of variables associated with each task is no more than O(K) and the number of constraints associated with each precedence edge is no more than O(K^2). Thus, the number of variables is linear in the number of tasks, O(Kn), and the number of constraints is linear in the number of edges, O(K^2 m), where n is the number of tasks and m is the number of precedence edges. In our experiments, we set K to 16 and obtain good energy savings for graphs with as many as 95 tasks.
THEOREM 1: Since all the constraints are linear and the objective function is separable convex, algorithm VAR-TB produces a schedule table to minimize average energy consumption subject to latency and precedence constraints in polynomial time [16].
We state two theorems which show why our method is optimal for certain kinds of task graphs. Proofs are omitted due to the page limit. The main assumption is that dynamic energy is a convex, non-increasing function of clock period.
THEOREM 2: Given a chain of tasks (T_1, …, T_i) to be executed sequentially and a latency constraint L, the value of the optimal energy consumption for a single run depends only on the total number of cycles consumed during the run and is independent of the cycles consumed by the individual tasks.
Theorem 2 also holds, with a small modification, for task graphs that are trees.
THEOREM 3: Given a task T_i in a tree G, the optimum energy consumption for the sub-tree consisting of T_i and its predecessors depends only on the number of cycles consumed along each path to T_i from the primary inputs in G.
For DAGs, such a claim does not hold because of reconvergent fan-outs. However, VAR-TB is still an effective technique for DAGs.
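One way to see why a statement like Theorem 2 is plausible is the following short argument for a single run in which the cycle counts are known. It is an illustrative sketch, not the omitted proof. The symbols are assumptions introduced here: w_j is the number of cycles consumed by task T_j, cp_j is its clock period, W is the total number of cycles, and e(cp) is the dynamic energy per cycle, assumed to be the same convex, non-increasing function for every task. Bounds on the clock period are ignored.

```latex
% Assumptions: e(cp) is convex and non-increasing, identical for all tasks;
% the bounds on the clock period are ignored.
\begin{align*}
E &= \sum_{j=1}^{i} w_j \, e(cp_j), \qquad
     \sum_{j=1}^{i} w_j \, cp_j \le L, \qquad
     W = \sum_{j=1}^{i} w_j \\
\frac{E}{W} &= \sum_{j=1}^{i} \frac{w_j}{W} \, e(cp_j)
  \;\ge\; e\!\left( \sum_{j=1}^{i} \frac{w_j}{W} \, cp_j \right)
  \;\ge\; e\!\left( \frac{L}{W} \right) \\
E_{\min} &= W \, e\!\left( \frac{L}{W} \right)
  \quad \text{attained at } cp_j = \frac{L}{W} \text{ for all } j
\end{align*}
```

The first inequality is Jensen's inequality for the convex function e, and the second uses the latency constraint together with the fact that e is non-increasing. Under these assumptions the minimum energy depends on the individual w_j only through their sum W.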
C. Online algorithm
We introduce a simple online phase to reclaim slack. When a task starts, it consumes all dynamic slack available from its predecessors and completes within its assigned finish time. Thus, the complexity of the online algorithm is O(1).
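A constant-time reclamation step of this kind might look like the sketch below. It assumes that the schedule table supplies an assigned finish time for the task, that the task must be budgeted for its worst-case cycle count, and that the clock period is clamped to the range supported by the processor. The function name and parameters are hypothetical.

```python
def reclaimed_clock_period(actual_start, assigned_finish, wc_cycles, cp_min, cp_max):
    """Clock period that stretches a task over all the time it has available.

    actual_start    -- time at which the task really starts (early if its
                       predecessors finished early)
    assigned_finish -- finish time taken from the schedule table
    wc_cycles       -- worst-case number of cycles the task may need
    cp_min, cp_max  -- fastest and slowest supported clock periods
    """
    budget = assigned_finish - actual_start  # includes reclaimed dynamic slack
    cp = budget / wc_cycles                  # slowest period that still meets the finish time
    return max(cp_min, min(cp, cp_max))      # clamp to the supported range
```

A task with 100 worst-case cycles and an assigned finish time of 300 gets a clock period of 2.0 when it starts at time 100, and a slower period of 2.5 when early completion of its predecessors lets it start at time 50. The computation uses a fixed number of arithmetic operations, which is consistent with an O(1) online step.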
V. EXPERIMENTAL RESULTS
We compare the results of our algorithm VAR-TB with a DVS algorithm which considers the worst case only (WC-DVS), a worst-case DVS algorithm which performs dynamic slack reclamation [27] (WC-Reclaim), and a hypothetical method that can accurately determine the workload of each task beforehand and performs optimal scheduling (Oracle). WC-Reclaim allocates dynamic slack in proportion to the worst-case WorkLoad(v).
A. Random task-graphs
We run our scheduling algorithm on several random task graphs generated from TGFF [4] with a resource constraint of 4 processors. We add the probability distribution of the task workloads as task attributes in the TGFF description. We obtain the probability distribution of SCE(v) for every task by performing a number of Monte-Carlo simulations (10,000 in our experiments). After the optimization step, we use Monte-Carlo simulations to compute the energy consumption and determine the average energy consumption. We compute the energy savings obtained by each algorithm and plot the savings as a percentage of the energy consumption of the worst-case algorithm in Figure 2. The plot depicts the energy savings over algorithm WC-DVS for different task graphs.
As can be seen, the simple algorithm WC-Reclaim performs much better than WC-DVS, suggesting that the task graphs have significant variation in workloads to exploit. Our algorithm VAR-TB performs significantly better than algorithm WC-Reclaim; on average, the solutions provided by VAR-TB are 20-25% better than those of WC-Reclaim. Finally, we analyze the scheduling solution obtained by algorithm Oracle. VAR-TB performs well compared to Oracle, with the maximum deviation being 20% and the average deviation being 15%. Moreover, the performance of VAR-TB does not degrade as the number of tasks in the task graph increases. Additionally, VAR-TB completes within 90 seconds for all task graphs when executed on an Intel Xeon 2 GHz processor.
We have not plotted the results of algorithm DynOpt [29]. We discovered that the pruning step for DynOpt proposed in [29] causes scheduling solutions that are inferior locally but optimal globally to be discarded, because of which DynOpt performs worse than WC-Reclaim in some cases. Moreover, DynOpt and [28] will require up to 5^20 optimization passes for the medium-sized task graphs.
B. Real-world Benchmarks
We perform experiments on two real-world applications: an MPEG-4 decoder and a Motion-JPEG (MJPEG) encoder. For these benchmarks, we apply dynamic slack reclamation after every algorithm. We evaluate two different schemes: W-Aware, in which the workload of a task can be predicted from its input values, and W-Unaware, in which the workload of a task cannot be predicted from its input values.
Processor Architecture
The processor cores in our experiments are modeled after the Intel XScale processor, which has 7 voltage levels as given in [26]. All processors share a 1 MB on-chip L2 cache through a common bus and implement a MESI cache-coherence protocol. Table 2 lists the relevant parameters. We use SESC [17] to simulate our multi-processor system and obtain profiling information (probabilities and WorkLoad values). For measuring dynamic energy consumption of on-chip components, we use Wattch [19] (integrated with SESC). Energy values for read-write operations for caches and SRAM-based array structures (TLB, ROB) are obtained from Cacti [18] for 90 nm technology. For other processor components (such as the ALU and decoder), energy numbers are obtained from Wattch for 180 nm technology and scaling factors are applied as in [20]. Inter-processor communication is carried out through blocking FIFOs with a bandwidth of 300 MB/s, which are similar to the Fast Simplex Link (FSL) [21] provided by Xilinx.
1) MPEG-4 decoder
The task graph of the MPEG-4 decoder [22] is shown in Figure 3(a). The main components of the decoder are the Parser (P), Copy-Controller (CC), Inverse-DCT (IDCT), Motion Compensation (MC) and Texture Update (TU). While IDCT does not show significant variation across different runs, the P, CC, MC and TU modules exhibit significant variation (Figure 4). By unrolling the loop for one macroblock and performing loop pipelining (Figure 3(b)), a parallel version of the decoder was implemented on a 7-processor system. A performance requirement of 20 frames/second was imposed on the decoder, leading to a latency constraint of 500 µs per macroblock. We measure the energy reduction that our algorithm provides over the WC-Reclaim algorithm. Moreover, we measure how the quality of the solution is affected by the number of SCE(v) values per task. The results are summarized in Figure 5. The two curves represent the energy consumption of the schedules produced by the two schemes, W-Aware (red curve) and W-Unaware (blue curve).
From the plot, it is clear that our algorithm can achieve significant savings over WC-Reclaim: up to 40% for W-Aware and up to 22% for W-Unaware.
2) MJPEG encoder
We apply our algorithm to the MJPEG encoder [25], for which the task graph is shown in Figure 6(a). To process each Minimum Coded Unit in the incoming stream, we implement a pipelined version of the encoder using a four-processor system where each task in Figure 6(a) is assigned to a processor. Of the four tasks, only the Huffman encoding task shows significant variation (Figure 7). We perform loop unrolling to obtain the graph in Figure 6(b), on which we apply our method. We compare the energy savings obtained from our algorithm against the WC-Reclaim algorithm. As explained before, we consider two cases, W-Aware and W-Unaware, and vary the number of SCE(v) entries. Figure 8 shows the energy consumption of the two schemes normalized to the energy consumption obtained by the WC-Reclaim algorithm. For the W-Aware scheme (red curve), we obtain up to 14% savings in energy, whereas for the W-Unaware scheme (blue curve), the maximum savings we obtain is 4%. The savings are small because of the low variation seen in this benchmark.
VI. CONCLUSIONS
We present a mathematical formulation which can exploit variation in the workloads of tasks in applications to provide a low-energy scheduling solution. Our algorithm is capable of handling precedence constraints and multiple processors. We show that the schedule table can be generated in polynomial time and is optimal for special cases. Finally, our experiments show that significant energy savings can be obtained by our scheduling algorithm over dynamic slack reclamation methods.
VII. ACKNOWLEDGEMENT
