Abstract-Dynamic voltage scaling (DVS) is a widely applied power management mechanism in real-time systems. We propose an algorithm for scheduling periodic hard real-time streaming applications with linear dependencies and known probability distributions of computational requirements on chip multiprocessors (CMP). The goal of the scheduling is to minimize the expected energy consumption while satisfying two quality of service (QoS) requirements: throughput and response time. Our experiments show significant energy savings (up to 55%) over scheduling when only the worst case computational requirements are known. In addition, while dynamically reclaiming processor idle time across multiple processors yields only a small benefit when scheduling is based on the probability distribution of computational requirements, it results in significant energy savings when scheduling for the worst case, especially for applications with short deadlines.
I. INTRODUCTION
Dynamic voltage scaling (DVS) has become a widely applied energy conservation technique in real-time systems, and many processors now implement DVS (e.g., Intel's XScale [21] and AMD's Athlon 64 [2]). In addition, power issues, as well as the processor-memory speed gap and the increased difficulty of extracting more instruction-level parallelism (ILP), are making chip multiprocessors (CMP) replace single-core processors for general-purpose computing and embedded devices. For example, Intel and AMD offer quad-core processors [1], [9], and Sun offers an eight-core processor [17].
With the number of cores on a chip expected to rise, many real-time applications are, or will soon be, running on CMPs. The problem of scheduling real-time applications on CMPs or clusters of computers is not new [5], [13], [16], [25] and requires (i) mapping the application task graph to processor cores, and (ii) determining the execution speed of each task. In [23], Xu et al. considered scheduling a periodic hard real-time streaming application on a CMP with the goal of minimizing energy consumption for the worst case computational requirements, while satisfying the two QoS requirements: throughput and response time.
Workloads, however, often exhibit large variability in computational requirements [6], [15]. Therefore, scheduling for the worst case execution requirements misses the opportunity of achieving potentially greater energy savings. We are motivated by this observation to consider the same problem, but in addition use knowledge of the probability distributions of the tasks' computational requirements to minimize the expected energy consumption while also satisfying the same two QoS requirements. Specifically, we consider scheduling a periodic hard real-time application with inter-arrival time T (i.e., a throughput requirement of 1/T), deadline (response time) D, and linear precedence constraints, i.e., a linear task graph, on a CMP, where we can visualize the processors as stages of an execution pipeline. We are interested in $T \le D$ since it permits using one or more pipeline stages (processors). Our work on linear task graphs is already complex and is a first step towards CMP scheduling of general task graphs.
An algorithm of particular interest to us is Practical Inter-Task DVS (PITDVS) [24]. It schedules a set of periodic hard real-time tasks that share the same period, T, and deadline, D, with D = T, to run on a single processor using knowledge of the probability distributions of the computational requirements of the tasks. The goal of PITDVS is to schedule the tasks so as to minimize the expected energy consumption. Because PITDVS only solves the problem when T = D, but not when T < D, our initial attempt was to develop an algorithm, which we call Extended PITDVS (EPITDVS) (see Section V-A), that uses PITDVS to partition the time D among the tasks and then maps (i.e., assigns) them to processor cores such that the execution time on any core is at most T. Simulation results did not show significant energy savings (see Section VI). This prompted us to develop a better algorithm based on dynamic programming, which we call Probabilistic1D (see Section V-B), whose computed schedules achieved energy savings in over 99% of the experiments, and also achieved the highest energy savings in over 94% of all experiments.
The rest of the paper is organized as follows. Section II describes related work. Application and system models are described in Section III. Section IV describes previously developed related algorithms. We present our proposed algorithm in Section V. Experimental setup and results are described in Section VI. Section VII concludes the paper.
II. RELATED WORK
DVS is an effective energy saving scheme since it allows quadratic energy savings at the expense of only a linear increase in execution time. As a result, much work in the last decade suggested different schemes for applying DVS to conserve energy while meeting application deadlines or causing a small degradation of the user interactive experience [7], [8], [11], [12], [14], [26]; most of this work is geared towards single-processor systems. In multiprocessor and multicore systems, tasks are first mapped to cores or processors based on precedence constraints described by general [23], [27] or conditional task graphs [13], [16], [19]. Task scheduling, that is, time allocation and frequency selection, is based on either the tasks' worst case computational requirements [23], [27], or on the probability distribution of their computational requirements [4], [20], [24]. For independent tasks, where any task can be assigned to any processor, the work in [28] reclaims the time unused by a task to reduce the execution speed of future tasks, a technique we refer to as cross-stage idle time reclamation (see Section V-C).
Although we use the probability distribution of computational requirements of the application tasks to do the mapping and scheduling of tasks to cores, our problem differs from prior work in that we not only have a response time constraint but also a throughput constraint.
Closely related to our work are the scheduling algorithms GRACE [26], PACE [12], PPACE, OITDVS, PITDVS2 [24], Scheduling1D, and Scheduling2D [23]. GRACE, PACE, and PPACE are stochastic intra-task DVS schemes: based on knowledge of the deadline and the probability distribution of the computational requirements of a single task running on one processor, they dynamically change the frequency of the processor at run time to save energy while guaranteeing that the task meets its deadline. OITDVS is an optimal stochastic inter-task DVS scheme for conserving energy for a set of periodic tasks that run on one processor, share the same period, and have deadlines equal to the end of the period. OITDVS relies on the probability distribution of computational requirements of the tasks to allot time to each task. PITDVS2 is a practical version of OITDVS, which approximates the computed continuous frequency using two adjacent discrete processor frequencies [10]; we use PITDVS2 as part of our proposed algorithm. For periodic streaming applications with linear and general task graphs, the work in [23] proposes the Scheduling1D and Scheduling2D algorithms for scheduling the tasks on a CMP with the goal of minimizing the worst case energy consumption of the application.
III. MODELS AND PROBLEM DEFINITION
[...] $\gamma B$, where $t_p$ is the propagation delay, $\lambda$ is the reciprocal of the operating data rate of the interconnection network, and $\gamma$ is the energy spent to transfer one bit of data. We assume a fixed data rate, that is, $\lambda$ and $\gamma$ are fixed.
We assume the program for each task follows the stream programming style [18], where for each task instance the program first receives the data it needs through communication with its predecessor task, then executes the task, and finally sends data to the successor task.
b) System Model: Power for each core has two components: static and dynamic. When a processor core is on, it is either (i) idle, consuming $P_{idle}$ power, or (ii) executing some task at some frequency $f$, consuming total power $P(f)$, which includes the power for the memory as well as other components that are in the same clock domain as the processor. Each core can be turned on and off. When a core is off, it does not consume energy. As shown in [23], it is not always optimal to use all available cores due to the static power.
Each processor core provides M discrete frequencies, $f_1 < f_2 < \dots < f_M$. Dynamic change of frequency incurs time and energy penalties [3]; the time penalty is proportional to the difference between the two frequencies, while the energy penalty is proportional to the difference between the squares of the two frequencies. If task $\tau_i$ executes for $c'$ cycles at frequency $f$, then it executes for $c'/f$ time and consumes $P(f) \times c'/f$ energy for that $c'$-cycle portion of its execution. The assumption that the execution time is inversely proportional to the execution frequency is justified when the cache (or memory) of a core is clocked at the same frequency as the core, which would be the case in a pipelined architecture.
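As a minimal sketch of the time and energy model just described, the following Python helpers compute the execution time and energy of a block of cycles at a fixed frequency, and the two frequency-transition penalties. The power table and the proportionality constants `k_time` and `k_energy` are placeholders chosen for illustration; they are assumptions, not values from the paper.

```python
# Hypothetical power table: frequency (Hz) -> total power P(f) in watts.
# Placeholder values for illustration only.
POWER_W = {150e6: 0.08, 400e6: 0.17, 600e6: 0.40}


def execution_time(cycles, freq_hz):
    """Time in seconds to run `cycles` cycles at `freq_hz` (c'/f)."""
    return cycles / freq_hz


def execution_energy(cycles, freq_hz, power_table=POWER_W):
    """Energy in joules for `cycles` cycles at `freq_hz`: P(f) * c'/f."""
    return power_table[freq_hz] * execution_time(cycles, freq_hz)


def transition_penalties(f_from, f_to, k_time, k_energy):
    """Penalties for switching frequency.

    Time penalty is proportional to |f_to - f_from|; energy penalty is
    proportional to |f_to^2 - f_from^2|. The constants k_time and k_energy
    are assumed, hardware-specific inputs.
    """
    time_penalty = k_time * abs(f_to - f_from)
    energy_penalty = k_energy * abs(f_to ** 2 - f_from ** 2)
    return time_penalty, energy_penalty
```

For example, under these placeholder values one million cycles at 400 MHz take 2.5 ms and consume 0.17 W x 2.5 ms of energy.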
c) Problem Definition: Given a periodic application as described above, our goal is to schedule the application tasks to run on the cores of a CMP as described above so as to minimize the expected energy consumption of the application, while satisfying the QoS requirements: throughput, 1/T, and response time, D. Scheduling tasks means mapping (assigning) tasks to execute on cores and computing the frequencies at which to run each task (frequencies can be dynamically changed at run time to save energy, for example, to reclaim slack time). Due to the linear nature of the application task graph, we can visualize the cores running the tasks as being stages of an execution pipeline (only consecutive tasks can be mapped to the same core).
The throughput requirement limits the execution time on any core (pipeline stage) to at most T, including the time for communication with the next stage, while the response time requirement sets D as the upper limit of the sum of the execution times of pipeline stages. We focus on $T \le D$ because it allows more flexibility in choosing the (one or more) cores on which to execute the application tasks.
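The two constraints can be stated compactly in code. The sketch below, with the hypothetical helper name `is_feasible`, checks a candidate pipeline given the worst case time of each stage (assumed to already include communication with the next stage).

```python
def is_feasible(stage_times, T, D):
    """Check the two QoS constraints for a candidate pipeline.

    stage_times: worst case time of each pipeline stage, including the
                 communication time to the next stage.
    T: inter-arrival time (throughput constraint: every stage <= T).
    D: deadline (response time constraint: sum of stages <= D).
    """
    throughput_ok = all(t <= T for t in stage_times)
    response_ok = sum(stage_times) <= D
    return throughput_ok and response_ok
```

With T = 5 and D = 10 (in msec), a two-stage pipeline with stage times 4 and 5 is feasible, whereas a single stage of 6 violates the throughput constraint even though it meets the deadline.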
IV. SOLUTION TO RELATED SCHEDULING PROBLEMS
In this section, we review related algorithms from [23], [24]. The algorithm in Section IV-A schedules linearly dependent tasks $\tau_1, \dots, \tau_n$ on a pipeline in a way that minimizes the worst case energy consumption. The algorithm in Section IV-B schedules $\tau_1, \dots, \tau_n$ onto a single processor in a way that minimizes the expected energy consumption. These two algorithms are relevant to our work presented in Section V, which schedules $\tau_1, \dots, \tau_n$ on a pipeline in a way that minimizes the expected energy consumption.
A. Scheduling1D
Given a linear task graph, G(V, E), of a periodic hard real-time application, along with knowledge of the worst case computational and communication requirements of each task, application period T, and deadline D, Scheduling1D [23] aims to schedule the application tasks to run on a CMP conforming to the models described in Section III with the goal of minimizing the worst case energy consumption while satisfying the two QoS requirements: throughput, 1/T, and response time, D.
Scheduling1D is a dynamic programming algorithm that computes all feasible solutions, that is, all feasible mappings of tasks to processors using all the discrete frequencies supported by the processors, to find the optimal one. Since the number of solutions is exponential in the worst case, the algorithm applies an approximation technique that limits the solution search space, guaranteeing polynomial running time and finding a solution within a factor of $(1 + \epsilon)$ of the optimal energy consumption, where $\epsilon$ is a user-supplied parameter.
Scheduling1D is based on the recursive structure of the optimal solution for linear task graphs. Assuming the time available to execute tasks $\tau_i$ to $\tau_n$ is $t$, the optimal scheduling of the tasks $\tau_i$ through $\tau_n$ is represented by the vector-valued function $E_i(t) = (e, q, j, d)$, where $e$ denotes the minimum energy consumption of executing the tasks $\tau_i$ through $\tau_n$ when servicing a single instance of the application, $q$ denotes the optimal number of stages (processor cores) for the tasks $\tau_i$ through $\tau_n$, $j$ indicates that tasks $\tau_i, \tau_{i+1}, \dots, \tau_j$ are mapped to the first of the $q$ stages, and $d$ is the time used for the first stage plus the communication delay from the first stage to the next stage. The throughput requirement implies that $d \le T$. The optimal energy consumption and optimal scheduling are obtained from $E_1(D)$. We use the notation $E_i(t).e$ to denote the $e$ value of $E_i(t)$, and use a similar notation to refer to the $q$, $j$, and $d$ values of $E_i(t)$.

Excerpt of the Scheduling1D pseudocode (update step):

    // e' is the energy for the 1st stage
    e' = (P(f_s) - P_idle) [...] + P_idle T
    e = e' + γ w_{j,j+1} + E_{j+1}(t - d).e
    if e < E_i(t).e then
        E_i(t).e = e
        E_i(t).q = E_{j+1}(t - d).q + 1
        E_i(t).j = j
        E_i(t).d = d
The recursive procedure starts with $E_n(t)$, which is obtained by mapping only $\tau_n$ into a single stage (processor). Finally, it should be noted that in [23], time is discretized so that the function $E_n(t)$ is defined at discrete times (implied by the M discrete processor frequencies) rather than being a continuous function. Consequently, each of the functions $E_i(t)$ is also defined at discrete times implied by the discretization of $E_{i+1}(t), \dots, E_n(t)$ and the M discrete frequencies. For more details see [23].
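To make the recursion concrete, here is a simplified Python sketch of the recurrence for $E_i(t)$. It is not the algorithm of [23]: it omits the time discretization and the $(1 + \epsilon)$ approximation, and it rests on several assumptions introduced only for illustration: the first stage runs all of its tasks at a single frequency, its execution time is its worst case cycles divided by that frequency, and the communication time for B bits is `t_p + lam * B`. The function and parameter names are hypothetical.

```python
from functools import lru_cache


def schedule_worst_case(wcec, comm_bits, freqs, power, p_idle,
                        T, D, lam, gamma, t_p=0.0):
    """Return (e, q, j, d) for the first task with time D, or None.

    wcec[i]:      worst case cycles of task i (0-based indices).
    comm_bits[i]: bits sent from task i to task i + 1 (length n - 1).
    power(f):     total power at frequency f; p_idle: idle power.
    """
    n = len(wcec)

    @lru_cache(maxsize=None)
    def E(i, t):
        best = None
        cycles = 0
        for j in range(i, n):            # tasks i..j form the first stage
            cycles += wcec[j]
            last = (j == n - 1)
            comm_time = 0.0 if last else t_p + lam * comm_bits[j]
            comm_energy = 0.0 if last else gamma * comm_bits[j]
            for f in freqs:
                d = cycles / f + comm_time
                if d > T or d > t:       # throughput or remaining-time violation
                    continue
                # Energy of the first stage (assumed execution time: cycles / f).
                e_first = (power(f) - p_idle) * cycles / f + p_idle * T
                if last:
                    e, q = e_first, 1
                else:
                    rest = E(j + 1, t - d)
                    if rest is None:     # remaining tasks cannot be scheduled
                        continue
                    e, q = e_first + comm_energy + rest[0], rest[1] + 1
                if best is None or e < best[0]:
                    best = (e, q, j, d)
        return best

    return E(0, D)
```

Without discretization the memo table is keyed by real-valued times, so this sketch is only practical for small inputs.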
B. Practical Inter-Task DVS (PITDVS)
Given a set of periodic tasks $\tau_1, \dots, \tau_n$ that share the same period, T, and whose deadlines are equal to the end of that period, along with their order of execution and the probability distribution function of the computational requirements of each of them, PITDVS [24] aims to schedule the tasks to run on a single processor core with the goal of minimizing their expected energy consumption. The algorithm has two components, one executed offline and the other executed online. The offline part of the algorithm, called Optimal Inter-Task DVS (OITDVS), is a dynamic programming algorithm that determines the fraction of the time T that should be allocated to each task $\tau_i$ in order to minimize the expected execution energy of $\tau_1$, [...] However, the online algorithm must approximate the computed frequency $f$ by one of the available M frequencies, and must also guarantee that all tasks can meet the deadline even if they all need to execute for their worst case cycles. In [24], PITDVS2 achieved the best results, as it approximated the computed continuous frequency using two adjacent discrete frequencies [10].
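The two-adjacent-frequency approximation can be illustrated as follows. This is a generic sketch of the idea, not the PITDVS2 implementation: given a cycle count and a time budget, it splits the cycles between the two discrete frequencies that bracket the ideal continuous frequency so that the total execution time equals the budget. Frequency-transition penalties are ignored, and the function name is hypothetical.

```python
import bisect


def split_between_adjacent(cycles, time_budget, freqs):
    """Split `cycles` across the two discrete frequencies bracketing cycles/time_budget.

    freqs must be sorted in ascending order. Returns a list of
    (frequency, cycles_at_that_frequency) pairs.
    """
    f_ideal = cycles / time_budget
    if f_ideal <= freqs[0]:
        return [(freqs[0], cycles)]          # lowest frequency is fast enough
    if f_ideal > freqs[-1]:
        raise ValueError("time budget too small even at the highest frequency")
    k = bisect.bisect_left(freqs, f_ideal)   # freqs[k-1] < f_ideal <= freqs[k]
    f_lo, f_hi = freqs[k - 1], freqs[k]
    if f_hi == f_ideal:
        return [(f_hi, cycles)]
    # Solve x/f_hi + (cycles - x)/f_lo = time_budget for x.
    x_hi = f_hi * (cycles - time_budget * f_lo) / (f_hi - f_lo)
    return [(f_hi, x_hi), (f_lo, cycles - x_hi)]
```

For instance, one million cycles in 2 ms with available frequencies of 400 MHz and 600 MHz gives an ideal frequency of 500 MHz, which is emulated by running about 600,000 cycles at 600 MHz and the remaining 400,000 at 400 MHz.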
V. THE PROPOSED SCHEDULING ALGORITHMS
In this section, we present our proposed scheduling algorithms for the problem defined in Section III-c. Our goal is to use knowledge of the probability distributions of the application tasks' computational requirements to map the tasks to pipeline stages and determine the execution time to allot to each stage.
A. Extended Practical Inter-Task DVS (EPITDVS)
Our first algorithm is a natural extension of PITDVS (Section IV-B) to multiple resources (i.e., cores). The algorithm has offline and online components; the offline component works as follows: (1) Run OITDVS, the offline part of PITDVS, which uses the probability distributions of the number of execution cycles to compute the fraction $t_i$ of D to allot to each task $i$, $i = 1, \dots, n$, with the restriction that no task is allotted more time than T because of the throughput constraint. Steps 1 and 2 of the algorithm use the probability distributions of the number of execution cycles to find a balanced mapping of the tasks to pipeline stages.
Step 3 also uses the probability distributions to re-allocate times to the tasks mapped to each of the pipeline stages. Experimental results showed that this two-phase process does not provide a significant improvement in energy consumption over Scheduling1D, which, although it uses only knowledge of worst case execution times, uses dynamic programming to simultaneously map tasks to stages and allocate time to tasks within each stage. Therefore, we devised a new algorithm, presented in the next section, which applies a dynamic programming approach to solve the problem in one phase using probability distributions of execution times.
B. Probabilistic1D
Probabilistic1D is composed of offline and online components. The offline component computes a schedule by mapping tasks to pipeline stages and, for each stage, determining the fraction of the execution time to allot to each of its mapped tasks. In order to save energy, the online component dynamically adjusts the frequency of each task based on $t'$, the time remaining until the deadline of the stage it is running on, while guaranteeing that all the remaining tasks of the stage can finish execution within $t'$, even if they all require their worst case number of execution cycles.
1) The offline scheduling component: The offline scheduling part of Probabilistic1D is a dynamic program that explores different feasible task mappings to pipeline stages, and for each feasible mapping, computes the total expected application energy consumption as the sum of the expected energy consumptions for executing the individual pipeline stages. The chosen schedule is the one with the least expected energy consumption.
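The structure of such a dynamic program can be sketched in Python as follows. This is an illustrative sketch rather than the paper's pseudocode: the per-stage expected energy is supplied by a hypothetical callback `stage_energy(i, j)`, communication energy is assumed to be folded into that callback, and the stage time is fixed to T so that at most floor(D / T) stages are allowed. For every suffix of the task chain and every stage count q, only the cheapest solution is kept.

```python
import math


def probabilistic_offline(n, stage_energy, T, D):
    """Pick the mapping of tasks 0..n-1 to consecutive stages with least expected energy.

    stage_energy(i, j): expected energy of one stage running tasks i..j
                        (inclusive) within time T, or None if those tasks
                        cannot fit in T in the worst case.
    Returns (expected_energy, [index of the first task of each stage]) or None.
    """
    max_stages = math.floor(D / T)
    # best[i][q] = cheapest (energy, stage_starts) for tasks i..n-1 on q stages
    best = [dict() for _ in range(n + 1)]
    best[n][0] = (0.0, [])
    for i in range(n - 1, -1, -1):
        for j in range(i, n):                       # first stage holds tasks i..j
            e_stage = stage_energy(i, j)
            if e_stage is None:
                continue
            for q, (e_rest, starts) in best[j + 1].items():
                if q + 1 > max_stages:              # deadline constraint
                    continue
                candidate = (e_stage + e_rest, [i] + starts)
                if q + 1 not in best[i] or candidate[0] < best[i][q + 1][0]:
                    best[i][q + 1] = candidate
    if not best[0]:
        return None
    return min(best[0].values(), key=lambda solution: solution[0])
```

With a toy cost of `1.0 + (j - i + 1) ** 2` per stage and three tasks, the sketch places each task on its own stage when D / T = 3 and falls back to a single stage when D / T = 1.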
Given a subset of tasks mapped to a stage, the expected stage energy consumption varies depending on the execution time allowed for the stage. Since the throughput requirement constrains the stage time to be at most T, we use T as the stage allowed execution time (see footnote 1). Probabilistic1D recursively computes solutions for mapping tasks $\tau_i, \dots, \tau_n$ to pipeline stages. Specifically, for each $i = n, \dots, 1$, a set $E_i$ of tuples is computed such that each tuple $(e_q, q)$ in $E_i$ represents the solution with the least expected energy consumption, $e_q$, when $\tau_i, \dots, \tau_n$ are mapped to a pipeline with $q$ stages. Given that the number of stages in the pipeline cannot exceed the number of tasks being mapped, $n - i + 1$, and that enforcing the deadline constraint requires the number of stages not to exceed $\lfloor D/T \rfloor$, the set $E_i$ will be of the form

$$E_i = \{(e_q, q) : q = 1, \dots, \min\{n - i + 1, \lfloor D/T \rfloor\}\}$$
The recursive computation of $E_i$ using $E_{i+1}, \dots, E_n$ is shown in lines 4-10 of the algorithm given in Figure 3. It uses the operator $\oplus$, which is defined formally as follows:
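$$E_i \oplus (e_{new}, q) = \begin{cases} E_i \cup \{(e_{new}, q)\}, & \text{if } \nexists\, (e_{old}, q) \in E_i \\ E_i \cup \{(e_{new}, q)\} - \{(e_{old}, q)\}, & \text{if } \exists\, (e_{old}, q) \in E_i,\ e_{old} > e_{new} \\ E_i, & \text{otherwise} \end{cases}$$

a) Computing expected energy consumption of a stage: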
Before we describe how to compute the expected energy consumption of a stage, we need to describe $P_i$, the probability density function of the number of execution cycles of $\tau_i$. The execution cycles of $\tau_i$ are divided into bins of equal number of cycles. Let $bin_i$ denote the number of those bins. For task $\tau_i$, we define $P_i(l)$ as the probability that $\tau_i$ runs for a number of cycles [...]

[...] (1)

Footnote 1: We experimented with values other than T for the stage allowed execution time, but setting the stage time equal to T produced the best results with respect to the percentage of energy savings, as well as the percentage of experiments in which the application consumed less energy when scheduled using Probabilistic1D than when scheduled using the baseline algorithm, Scheduling1D.
In equation 1, the energy consumed when executing task $\tau_k$ is computed as the expected value of the energies consumed when $\tau_k$ executes for a number of cycles that falls into bins $1, 2, \dots, bin_k$. Specifically, if $c$, the number of cycles executed by $\tau_k$, falls into bin $l$, $1 \le l \le bin_k$, then the energy consumed is equal to the product of the power (the second term in equation 1) and the execution time (the third term in equation 1).
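The probability-weighted sum described above can be illustrated with a simplified Python sketch. It does not reproduce equation 1: it assumes a single fixed frequency and power for the whole task, and it charges each bin at its upper edge (bin l is assumed to cover up to l times the bin width). The function and parameter names are hypothetical.

```python
def expected_task_energy(probabilities, cycles_per_bin, freq_hz, power_w):
    """Expected energy of one task run at a fixed frequency.

    probabilities[l - 1]: probability that the cycle count falls into bin l.
    cycles_per_bin:       width of each bin in cycles.
    Each term is probability * power * execution time, with the execution
    time taken at the upper edge of the bin (an assumption of this sketch).
    """
    total = 0.0
    for l, p in enumerate(probabilities, start=1):
        cycles = l * cycles_per_bin
        total += p * power_w * (cycles / freq_hz)
    return total
```

For two equally likely bins of one million cycles each, a 1 GHz frequency, and 2 W of power, the terms are 0.5 x 2 W x 1 ms and 0.5 x 2 W x 2 ms.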
Observe that $|E_n| = 1$ and $|E_{n-1}| \le 2$, since we may either map $\tau_{n-1}$ to its own stage separate from $\tau_n$, or we may map it to the same stage as $\tau_n$. Similarly, $|E_{n-2}| \le 3$, because we may map $\tau_{n-2}$ to its own stage or map $\tau_{n-2}$ to the first stage of each solution $u \in E_{n-1}$. This is due to the fact that only consecutive tasks in the linear task graph can be mapped to the same stage.

2) The online scheduling component: The online part of Probabilistic1D applies the online part of PITDVS2 to the individual pipeline stages in order to dynamically adjust the frequency of the tasks of each stage. As mentioned above, a continuous frequency, $f$, is computed for the next task, $\tau_k$, $k = 1, \dots, n$, to run on a stage based on $c_k$, the remaining stage time, and $\sum_{j=k+1}^{l} c_j$, where $\tau_{k+1}, \dots, \tau_l$ are the tasks mapped to the same stage as $\tau_k$. $f$ is then approximated using one or two adjacent discrete frequencies of the M frequencies supported by the processor.
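The worst case guarantee of the online component can be illustrated with a small Python sketch. It is not the PITDVS2 frequency formula: it only computes a lower bound on the frequency of the next task such that, if the task needs its worst case cycles, the tasks after it on the same stage can still finish by running at the maximum frequency. Interpreting `c_k` and `remaining_cycles` as worst case cycle counts is an assumption of this sketch, as is the function name.

```python
def min_safe_frequency(c_k, remaining_cycles, t_remaining, f_max):
    """Lowest frequency for the next task that keeps the stage deadline safe.

    c_k:              worst case cycles of the next task.
    remaining_cycles: sum of worst case cycles of the tasks after it
                      on the same stage.
    t_remaining:      time left until the stage deadline.
    f_max:            highest available frequency.
    """
    time_for_rest = remaining_cycles / f_max   # reserve for later tasks at f_max
    slack = t_remaining - time_for_rest
    if slack <= 0 or c_k / slack > f_max:
        raise ValueError("stage cannot meet its deadline in the worst case")
    return c_k / slack
```

Any frequency chosen by the online policy would then be clamped from below by this bound before being mapped to discrete frequencies.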
C. Cross-Stage Idle Time Reclamation (CSITR)
Due to the variability of the workload of the different application instances, the online components of the algorithms PITDVS, EPITDVS, and Probabilistic1D all apply dynamic voltage scaling at run time to each pipeline stage in order to vary the processor frequency at which each task of the stage runs, effectively reclaiming the slack time on each stage.
However, it is possible that a stage, $s$, finishes processing a given application instance, $I_i$, earlier than its deadline. Instead of $s$ sitting idle and consuming static power, we can reclaim its idle time by starting to execute the tasks of the next application instance, $I_{i+1}$, as soon as the previous pipeline stage, $s - 1$, finishes processing $I_{i+1}$. This way, when $s$ starts to execute the application instance $I_{i+1}$, it is allowed an execution time longer than the offline computed execution time (in the case of Probabilistic1D, the stage execution time is fixed to T as mentioned above). This in turn allows the online part of the scheduling algorithm to reclaim the increased stage slack time by running tasks at even lower frequencies, thus saving more energy. To evaluate the effectiveness of the CSITR technique, in our simulations we compare the application execution energy consumption when the algorithms are run with and without reclaiming processor idle time.
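One way to picture the reclamation is to compute the time budget a stage receives for an instance. The Python sketch below is an interpretation made for illustration, under assumptions not spelled out in the text: instance k is released at time k * T, stage s nominally processes it in the window from (k + s) * T to (k + s + 1) * T, and the end of that window remains the stage deadline when the stage starts early. The function name is hypothetical.

```python
def stage_budget(s, k, T, prev_instance_finish, prev_stage_finish):
    """Start time and time budget of stage s for instance k with reclamation.

    prev_instance_finish: time at which stage s finished instance k - 1.
    prev_stage_finish:    time at which stage s - 1 finished instance k.
    Returns (start_time, budget), where budget >= T.
    """
    nominal_start = (k + s) * T
    deadline = nominal_start + T
    # The stage can start once it is free and its input is available,
    # but never later than its nominal start.
    earliest = max(prev_instance_finish, prev_stage_finish)
    start = min(earliest, nominal_start)
    return start, deadline - start
```

For example, with T = 5 msec, stage 1 nominally handles instance 1 from 10 to 15 msec; if it became free at 8 msec and its input arrived at 9 msec, it can start at 9 msec with a budget of 6 msec instead of 5.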
VI. EVALUATION
A. Experimental Setup
In this section we evaluate our proposed algorithm, Probabilistic1D, through simulations. We compare it to EPITDVS and to the baseline state-of-the-art algorithm: Scheduling1D without cross-stage idle time reclamation. In addition, we evaluate the effect of cross-stage idle time reclamation on Probabilistic1D by comparing the energy savings achieved in each case relative to the baseline algorithm.

a) Application Task Graphs: We experiment with a synthetically generated workload. We run two sets of experiments. In one set, the application task graph consists of 5 tasks, and in the other it consists of 10 tasks. For each set, we experiment with uniform and bimodal distributions of task computational requirements with a minimum workload of 12,500 cycles and a maximum of 1,250,000 cycles per task. We partition the range of task computational requirements into bins of equal number of cycles; we used 100 bins.
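A workload of this shape could be generated and binned as in the following Python sketch. The cycle range and the number of bins come from the text; everything else is assumed for illustration, in particular the bimodal generator (a mixture of two clipped Gaussians with hypothetical modes and spread), since the paper's exact bimodal parameters are not given here.

```python
import random

MIN_CYCLES, MAX_CYCLES, NUM_BINS = 12_500, 1_250_000, 100
BIN_WIDTH = MAX_CYCLES // NUM_BINS  # 12,500 cycles per bin


def bin_index(cycles):
    """1-based bin l such that (l - 1) * BIN_WIDTH < cycles <= l * BIN_WIDTH."""
    return -(-cycles // BIN_WIDTH)  # ceiling division


def sample_uniform(rng):
    return rng.randint(MIN_CYCLES, MAX_CYCLES)


def sample_bimodal(rng, low_mode=250_000, high_mode=1_000_000, spread=75_000):
    """Assumed bimodal model: pick one of two Gaussian modes, then clip."""
    mode = low_mode if rng.random() < 0.5 else high_mode
    cycles = int(rng.gauss(mode, spread))
    return max(MIN_CYCLES, min(MAX_CYCLES, cycles))


def empirical_distribution(samples):
    """Fraction of samples falling into each of the NUM_BINS bins."""
    counts = [0] * NUM_BINS
    for cycles in samples:
        counts[bin_index(cycles) - 1] += 1
    return [count / len(samples) for count in counts]
```

Calling `empirical_distribution([sample_bimodal(rng) for _ in range(10_000)])` with `rng = random.Random(0)` yields a 100-entry probability vector that can stand in for the per-task distribution.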
We use different application period, T, and deadline, D, value pairs. For the 5-task applications, we use periods of 2.5 msec to 12.5 msec with a step of 1 msec, and deadlines of 10 msec to 25 msec, also with a step of 1 msec. For the 10-task applications, we use periods of 5 msec to 25 msec, and deadlines of 20 msec to 50 msec, both with a step of 1 msec. We only run experiments where $D \ge T$. The approximation factor, $\epsilon$, used in Scheduling1D is 0.05. Each experiment is defined by the specific values of T and D, as well as the number of application tasks and their probability distributions of computational requirements. For each run, we simulate the execution of 1000 application instances and we compute the average percentage of energy savings of the schedules of Probabilistic1D and EPITDVS, both with cross-stage idle time reclamation, relative to the schedule of Scheduling1D without cross-stage idle time reclamation.
b) Communication Model: We assume a communication medium similar to the one in [23]. To compute $\lambda$ and $\gamma$ (Section III-a), we assume a transmission rate of 20 Gbytes/s, and the transmission power is set to 20% of the maximum processor power when the link is fully utilized. Communication energy and time penalties are only incurred for communication between two tasks mapped to two consecutive pipeline stages (processor cores). In our simulation, we use a normal probability distribution with small variance to generate the amount of communication data.

c) Processor Model: We simulate the Intel XScale processor (Table I) [22]. We assume processor static power is 40 mW when the processor is idle (i.e., half the power consumed when the processor is running at its lowest frequency). Dynamic change of frequency incurs time and energy penalties (Section III-b).
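As a worked illustration of the communication parameters above, the following derivation shows how $\lambda$ and $\gamma$ might be obtained from the stated link settings. It assumes the data volume is expressed in bits, writes $R$ for the link bit rate and $P_{max}$ for the maximum processor power (both symbols introduced here), and leaves $P_{max}$ symbolic.

```latex
\begin{align*}
R &= 20\ \text{Gbytes/s} \times 8\ \text{bits/byte} = 1.6 \times 10^{11}\ \text{bits/s} \\
\lambda &= \frac{1}{R} = 6.25 \times 10^{-12}\ \text{s/bit} \\
\gamma &= \frac{0.2\, P_{max}}{R} = 1.25 \times 10^{-12}\, P_{max}\ \text{J/bit} \quad (P_{max}\ \text{in watts})
\end{align*}
```

Under this reading, the energy per bit scales linearly with the maximum processor power, so a different processor model changes $\gamma$ but not $\lambda$.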
B. Experimental Results
Considering all the conducted experiments, we found that in over 94% of them, the schedules of Probabilistic1D with cross-stage idle time reclamation achieved the greatest percentage of energy savings, and only in less than 1% of the experiments did the schedules of Probabilistic1D, also with cross-stage idle time reclamation, consume more energy compared to the baseline algorithm. These are much better results than those achieved by the schedules of EPITDVS with cross-stage idle time reclamation. EPITDVS schedules achieved the greatest percentage of energy savings in less than 5% of the experiments, and consumed more energy than the baseline in 37% of the conducted experiments.
We noticed that the results of the two sets of experiments (corresponding to applications with 5 and 10 tasks) exhibited similar trends, so we only show graphs for experiments of a 10-task application with tasks having a bimodal probability distribution of computational requirements.
Figures 4, 5, and 6 show the average percentage of energy savings, relative to the baseline, for the schedules of Probabilistic1D, EPITDVS, and Scheduling1D, respectively, all with cross-stage idle time reclamation (CSITR) applied.
For a better comparison of the performance of the algorithms, we take 2D cross-sections of the plots in Figures 4, 5, and 6 at T = 5, 9, and 19 msec, corresponding to short, intermediate, and long inter-arrival times, respectively, and plot them in Figures 7, 8, and 9, respectively. From these figures, we find that for a fixed T, the average percentage of energy savings of all algorithms tends to go down when the deadline, D, is increased. This is expected since in this case Scheduling1D has more time to execute each instance of the application; it therefore schedules the stages (processors) to run slower and consume less energy, and although the schedules of, for example, Probabilistic1D still consume less energy, the relative energy savings are smaller. Also, from the same figures, we note that as D is increased, the schedules of Scheduling1D with cross-stage idle time reclamation may consume more energy (we observed a penalty of at most 1% more energy) relative to the baseline schedule of Scheduling1D without cross-stage idle time reclamation. This is due to the time and energy penalties associated with dynamic change of processor frequencies, which are not incurred by the baseline since it always uses the offline computed frequency.

Now, we consider how cross-stage idle time reclamation affects the average percentage of energy savings of the schedules produced by Probabilistic1D and Scheduling1D. Figures 10, 11, and 12 plot the percentage of energy savings of the schedules of Probabilistic1D with and without cross-stage idle time reclamation, as well as the schedules of Scheduling1D with cross-stage idle time reclamation. These plots were generated for a 10-task periodic application using the same task probability distributions described above, and for inter-arrival times T = 5, 9, and 19 msec, respectively.
For the schedules of Probabilistic1D, we found that applying cross-stage idle time reclamation improved the average percentage of energy savings by less than 3% compared to not applying it. This indicates that the offline schedules computed by Probabilistic1D capture the variability of the application workload accurately enough to minimize the time processors are idle.
For the schedules of Scheduling1D, we note that for relatively short deadlines, cross-stage idle time reclamation allows Scheduling1D to save more energy (up to 30% in our experiments) compared to the baseline. However, for longer deadlines, Scheduling1D with cross-stage idle time reclamation usually performs slightly worse (consumes less than 1% more energy) due to the time and energy penalties incurred when the processor frequency is dynamically changed.
VII. CONCLUSION AND FUTURE WORK
We consider the problem of scheduling a periodic hard real-time application with linear precedence constraints on a CMP using knowledge of the probability distributions of the computational requirements of its constituent tasks. The goal of the scheduling is to minimize the expected energy consumption. Our experimental results show that our proposed scheduling algorithm, Probabilistic1D, achieves as much as 55% average energy savings over scheduling for the worst case execution. Our experiments also demonstrate the benefits of cross-stage processor idle time reclamation at run time. For future work, we would like to extend Probabilistic1D to handle periodic hard real-time applications with general
