In this work, we investigate the potential benefit of parallelization for both meeting real-time constraints and minimizing power consumption. We consider malleable Gang scheduling of implicit-deadline sporadic tasks upon multiprocessors. By extending schedulability criteria for malleable jobs to DPM/DVFS-enabled multiprocessor platforms, we are able to derive an offline polynomial-time optimal processor/frequency selection algorithm. Simulations of our algorithm on randomly generated task systems executing on platforms having up to 16 processing cores show that the theoretical power consumption is reduced by a factor of 36 compared to the optimal non-parallel approach.
I. INTRODUCTION
Power-aware computing is at the forefront of embedded systems research due to market demands for increased battery life in portable devices and for a smaller carbon footprint of embedded systems in general. The drive to reduce system power consumption has led embedded system designers to increasingly utilize multicore processing architectures. An oft-repeated benefit of multicore platforms over computationally equivalent single-core platforms is increased power efficiency and thermal dissipation [1]. For these power benefits to be fully realized, a computer system must possess the ability to parallelize its computational workload across the multiple processing cores. However, parallel computation often comes at a cost of increasing the total computation that the system must perform due to communication and synchronization overhead of the cooperating parallel processes. In this paper, we explore the trade-off between parallelization of real-time applications and savings in the power consumption.
Obtaining power efficiency for real-time systems is a non-trivial problem due to the fact that processor power management features (e.g., clock throttling/gating, dynamic voltage/frequency scaling, etc.) often increase the execution time of jobs and/or introduce switching time overheads in order to reduce system power consumption; the increased execution time for jobs naturally puts additional temporal constraints on a real-time system. Job-level parallelism can potentially help reduce these constraints by distributing the computation to reduce the elapsed execution time of a parallel job. However, the trade-offs between parallelism, increased communication/synchronization overhead, and power reduction form a complicated and non-linear relationship. Thus, for power-aware multicore real-time systems, an important and challenging open question is: what is the optimal combination of job-level parallelism and processor power-management settings to minimize system power consumption while simultaneously ensuring real-time deadlines are met?

Pradeep M. Hettiarachchi, Nathan Fisher
Department of Computer Science
Wayne State University
{pradeepmh, fishem}@wayne.edu
In this paper, we address the above problem for implicit-deadline parallel sporadic tasks executing upon a multicore platform with unique global voltage/frequency scaling capabilities. That is, all the cores on the multicore chip are constrained to execute at the same rate. For example, the Intel Xeon E3-1200 processor has such a constraint on the voltage and frequency [2]; dynamic voltage and frequency scaling (DVFS) is only possible at package granularity (i.e., we have to choose one working frequency for all active cores of the processing platform). However, we also permit dynamic power management (DPM): this means that some cores of the platform can be switched off. In summary, we allow the selection of a subset of the cores to be active and all these chosen cores must run at the same frequency.
In the past, researchers have considered the problem of determining the optimal global frequency for such systems for non-parallel real-time tasks (see Devadas and Aydin [3] ). Parallelism contributes an additional dimension to this problem in that the system designer must also choose what is the optimal number of concurrent processors that a task should use to reduce power and meet its deadline. Our research addresses this challenge by proposing an (offline) polynomial time algorithm for determining the optimal frequency and number of active cores for a set of parallel tasks executing upon a processing platform with homogeneous frequencies.
We use a previously-proposed online scheduling algorithm by Collette et al. [4] to schedule the parallel jobs once the frequency and active core allocation has been determined.
The contributions can be summarized as follows:
• We generalize the parallel task schedulability test of Collette et al. [4] for a processing platform that may choose offline its operating frequency.
• We propose an exact offline polynomial-time algorithm for determining the optimal operating frequency and number of active cores. Given $n$ tasks and $m$ cores, our algorithm requires $O(m n^2 \log_2^2 m)$ time.
• We illustrate the power savings of a parallel scheduling approach by comparing it against the optimal non-parallel homogeneous scheduling algorithm (via simulations of randomly-generated task systems).

The main objective of this research is to provide a theoretical evaluation of the potential reduction in system power consumption that could be obtained by exploiting parallelism of real-time applications. As we will see in the next sections, significant reductions in system power are possible even when the degree of parallelism is limited. Our current on-going work in evaluating parallel implementations of real-time applications upon an actual hardware testbed is primarily motivated by the power savings observed in the simulations of this paper.
II. RELATED WORK
There are two main models of parallel tasks (i.e., tasks that may use several processors simultaneously): the Gang [4], [5], [6], [7] and the Thread model [8], [9], [10], [11]. With the Gang model, all parallel instances of the same task start and stop using the processors in unison (i.e., at the exact same time). With the Thread model, on the other hand, there is no such constraint. Hence, once a thread has been released, it can be executed on the processing platform independently of the execution of the other threads.
Most real-time research about energy saving has assumed a sequential model of computation. For example, Baruah and Anderson [12] explicitly state that "[...] in the job model typically used in real-time scheduling, individual jobs are executed sequentially. Hence, there is a certain minimum speed that the processors must have if individual jobs are to complete by their deadlines [...]". In this research, we remove this constraint by allowing each job to be executed in unison on several processing cores; this reduces the minimum required speed of the platform below the sequential limit and pushes the potential power/energy savings further.
Little research has addressed both real-time parallelization and power-consumption issues. Kong et al. [13] explored the trade-offs between power and degree of parallelism for non-real-time jobs. Recent work by Cho et al. [14] has developed processor/speed assignment algorithms for real-time parallel tasks when the processing platform allows each processor to execute at a different speed. In contrast, this work considers some restrictions on parallel processing (e.g., limited processor speedup) and power management (e.g., a single global operating frequency) that exist in many of today's multicore architectures, but are not considered in these previous papers.
III. MODELS

A. Parallel Job Model
In real-time systems, a job $J_\ell$ is characterized by its arrival time $A_\ell$, execution requirement $E_\ell$, and relative deadline $D_\ell$. The interpretation of these parameters is that for each job $J_\ell$, the system must schedule $E_\ell$ units of execution on the processing platform in the time interval $[A_\ell, A_\ell + D_\ell)$.
Traditionally, most real-time systems research has assumed that the execution of $J_\ell$ must occur sequentially (i.e., $J_\ell$ may not execute concurrently with itself on two or more different processors). However, in this paper, we deal with jobs which may be executed on different processors at the very same instant, in which case we say that job parallelism is allowed. This means that for each time unit in the interval $[A_\ell, A_\ell + D_\ell)$, several units of execution of $J_\ell$ can be executed (corresponding to the number of processors assigned to $J_\ell$). Various kinds of parallel task models exist; Goossens et al. [6] adapted parallel terminology [15] to real-time jobs as follows.
Definition 1 (Rigid, Moldable and Malleable Job). A job is said to be (i) rigid if the number of processors assigned to this job is specified externally to the scheduler a priori, and does not change throughout its execution; (ii) moldable if the number of processors assigned to this job is determined by the scheduler, and does not change throughout its execution; (iii) malleable if the number of processors assigned to this job can be changed by the scheduler during the job's execution.
As a starting point for investigating the trade-off between the power consumption and parallelism in real-time systems, we will work with the malleable job model in this paper.
B. Parallel Task Model
In real-time systems, jobs are generated by tasks. One general and popular real-time task model is the sporadic task model [16] where each sporadic task $\tau_i$ is characterized by its worst-case execution requirement $e_i$, task relative deadline $d_i$, and minimum inter-arrival time $p_i$ (also called the task's period). A task $\tau_i$ generates an infinite sequence of jobs $J_1, J_2, \ldots$ such that: 1) $J_1$ may arrive at any time after system start time; 2) successive jobs of the same task must be separated by at least $p_i$ time units (i.e., $A_{\ell+1} \ge A_\ell + p_i$); 3) each job has an execution requirement no larger than the task's worst-case execution requirement (i.e., $E_\ell \le e_i$); and 4) each job's relative deadline is equal to the task relative deadline (i.e., $D_\ell = d_i$). A useful metric of a task's computational requirement upon the system is utilization, denoted by $u_i$ and computed by $e_i/p_i$. In this paper, as we deal with parallel tasks that could be executed in unison on several processors with variable speed (DVFS/DPM enabled), tasks having a utilization value greater than 1 could still be schedulable (i.e., $u_i > 1$ is permitted). Indeed, to meet its deadline a job of a task having a utilization greater than 1 must either 1) be executed in unison on several processing cores of the platform or 2) be executed at an appropriate processor speed (higher frequency). This frequency/number of cores selection trade-off is specifically the problem we explore in this paper.
Other useful specific values are $u_{\max} \stackrel{\text{def}}{=} \max_{i=1}^{n} \{u_i\}$ and $u_{\text{sum}} \stackrel{\text{def}}{=} \sum_{i=1}^{n} u_i$.
A collection of sporadic tasks $\tau \stackrel{\text{def}}{=} \{\tau_1, \tau_2, \ldots, \tau_n\}$ is called a sporadic task system. In this paper, we assume a common subclass of sporadic task systems called implicit-deadline sporadic task systems where each $\tau_i \in \tau$ must have its relative deadline equal to its period (i.e., $d_i = p_i$).
Finally, the scheduler we use restricts periods and execution requirements to positive integer values, i.e., $e_i, p_i \in \mathbb{N}_{>0}$.
At the task level, the literature distinguishes between at least two kinds of parallelism: Multithread and Gang. In Gang parallelism, each task corresponds to an $e \times k$ rectangle where $e$ is the execution time requirement and $k$ the number of required processors, with the restriction that the $k$ processors execute the task in unison [7]. In this paper, we assume malleable Gang task scheduling (that is, tasks generating malleable jobs); Feitelson et al. [17] describe how a malleable job may be implemented.
Due to the overhead of communication and synchronization required in parallel processing, there are fundamental limitations on the speedup obtainable by any real-time job.
Assuming that a job $J_\ell$ generated by task $\tau_i$ is assigned to $k_\ell$ processors for parallel execution over some $t$-length interval, the speedup factor obtainable is denoted by $\gamma_{i,k_\ell}$.
The interpretation of this parameter is that over this $t$-length interval $J_\ell$ will complete $\gamma_{i,k_\ell} \times t$ units of execution. We let $\Gamma_i \stackrel{\text{def}}{=} (\gamma_{i,0}, \gamma_{i,1}, \ldots, \gamma_{i,m}, \gamma_{i,m+1})$ denote the multiprocessor speedup vector for jobs of task $\tau_i$ (assuming $m$ identical processing cores). The values $\gamma_{i,0} \stackrel{\text{def}}{=} 0$ and $\gamma_{i,m+1} \stackrel{\text{def}}{=} \infty$ are sentinel values used to simplify the algorithm of Section V. Throughout the rest of the paper, we will characterize a parallel sporadic task $\tau_i$ by $(e_i, p_i, \Gamma_i)$.
We apply the following two restrictions on the multiprocessor speedup vector: sub-linear speedup ratio and work-limited parallelism.
The sub-linear speedup ratio restriction represents the fact that no task can truly achieve an ideal or better than ideal speedup due to the overhead in parallelization. It also requires that the speedup factor strictly increases with the number of processors. The work-limited parallelism restriction ensures that the overhead only increases as more processors are used by the job. These restrictions place realistic bounds on the types of speedups observable by parallel applications. Notice that these constraints imply that $\forall\, 1 \le i \le n,\ 0 \le j < m : \gamma_{i,j} < \gamma_{i,j+1}$.
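As a rough illustration, the two restrictions can be checked mechanically for a given speedup vector. The sketch below is not taken from the paper: the function name `is_valid_speedup_vector` is hypothetical, and the exact inequalities are an assumption based on the prose description above, namely $1 < \gamma_{i,j'}/\gamma_{i,j} < j'/j$ for $1 \le j < j' \le m$ (sub-linear speedup ratio) and non-increasing increments $\gamma_{i,j+1} - \gamma_{i,j}$ (work-limited parallelism).

```python
from typing import Sequence

def is_valid_speedup_vector(gamma: Sequence[float], tol: float = 1e-12) -> bool:
    """Check a speedup vector against the two (assumed) restrictions.

    gamma[j - 1] is the speedup on j processors, j = 1..m. The sentinels
    gamma_0 = 0 and gamma_{m+1} = infinity are not part of the input.
    """
    m = len(gamma)
    if m == 0 or gamma[0] <= 0:
        return False
    # Assumed sub-linear speedup ratio: 1 < gamma_j2 / gamma_j1 < j2 / j1.
    for j1 in range(1, m + 1):
        for j2 in range(j1 + 1, m + 1):
            ratio = gamma[j2 - 1] / gamma[j1 - 1]
            if not (1.0 < ratio < j2 / j1):
                return False
    # Assumed work-limited parallelism: each extra core adds no more
    # speedup than the previous extra core did.
    for j in range(1, m - 1):
        gain_next = gamma[j + 1] - gamma[j]
        gain_prev = gamma[j] - gamma[j - 1]
        if gain_next > gain_prev + tol:
            return False
    return True
```

Under these assumed inequalities, the vectors (1.0, 1.5, 2.0) and (1.0, 1.2, 1.3) are accepted, while a perfectly linear vector such as (1.0, 2.0, 3.0) is rejected because its ratio is not strictly below the ideal one.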
It could be argued that this constraint is not entirely realistic: at a reasonably high number of processing cores, allocating an additional core to the task will not increase the speedup anymore. However, while our mathematical model requires that the speedup must increase with each additional processing core, the situation where adding a core does not benefit the application speedup can be modeled with $\gamma_{i,j+1} = \gamma_{i,j} + \epsilon$ where $\epsilon$ can be some arbitrarily small positive real number. Thus, this strict inequality constraint does not place any true restriction upon approximating such realistic parallel behavior.
There are no other restrictions on the actual values of these speedup parameters, i.e., $\forall\, 1 \le i \le n,\ 1 \le j \le m : \gamma_{i,j} \in \mathbb{R}_{>0}$. An example of two speedup vectors, used in our simulations, is given by Figure 2 (page 8).
C. Power/Processor Model
The parallel sporadic task system $\tau$ executes upon a multiprocessor platform with $m \in \mathbb{N}_{>0}$ identical-speed processing cores. The processing platform is enabled with both dynamic power management (DPM) and dynamic voltage and frequency scaling (DVFS) capabilities. With respect to DPM capabilities, we assume that the processing platform has the ability to turn off any number of cores between $0$ and $m - 1$.
For DVFS capabilities, in this work, we assume that there is a system-wide homogeneous frequency $f > 0$ (where $f$ is drawn from the positive continuous range, i.e., $f \in \mathbb{R}_{>0}$) which indicates the frequency at which all cores are executing at any given moment. In short, at any moment in the execution of the system, if $k \le m$ processing cores are switched on and $m - k$ cores are switched off (DPM) and the homogeneous frequency of the system is set to the value $f$ (DVFS), it means that $k$ cores run at frequency $f$ and $m - k$ cores "run" at frequency $0$.
The power function $P(f, k)$ indicates the power dissipation rate of the processing platform when executing with $k$ active cores at a frequency of $f$. We assume that $P(f, k)$ is a non-decreasing, convex function. See Section VII-A3 for an instance of this function in our simulations.
The interpretation of the frequency is that if $\tau_i$ is executing job $J_\ell$ on $k_\ell$ processors at frequency $f$ over a $t$-length interval then it will have executed $t \times \gamma_{i,k_\ell} \times f$ units of computation.
The total energy consumed by executing $k$ cores over the $t$-length interval at frequency $f$ is $t \times P(f, k)$.
Since we are considering a single system-wide homogeneous frequency, a natural question is: does the ability to dynamically change frequencies during execution contribute towards our goal of reducing power and/or meeting job deadlines? We can show that the answer to the question is "no"; it turns out that there exists a single optimum frequency for a given set of malleable real-time tasks:

Property 1 (Obtained by extension of Aydin [18], and Ishihara and Yasuura [19]). In a multiprocessor system with global homogeneous frequency in a continuous range, choosing the frequency dynamically is not necessary for optimality in terms of minimizing total consumed energy.

Proof. As [19] presented a similar result, here we prove the property for our framework.
Although we have a proof of this property for any convex form of $P(f, k)$, for space limitations in the following we will consider that $P(f, k) \propto f^3$ (notice that $k$ is a constant in our analysis). Assume we have a schedule at the constant frequency $f$ on the (multiprocessor) platform that is feasible for $\tau$. We will show that any dynamic frequency schedule that is also feasible for $\tau$ consumes no less energy.
First notice that from any dynamic frequency schedule we can obtain a constant frequency schedule (which schedules the same amount of work) by applying, sequentially, the following transformation: given a dynamic frequency schedule in the interval $[a, b]$ which works at frequency $f_1$ in $[a, \ell)$ and at frequency $f_2$ in $[\ell, b]$, we can define the constant frequency such that at that frequency the executed amount of each task $\tau_i \in \tau$ remains the same as the execution in the dynamic-frequency schedule over $[a, b]$.
Without loss of generality we will consider the schedule in the interval $[0, 1]$ working at the constant frequency $f$ and the dynamic schedule working at frequency $f + \Delta$ in $[0, \ell)$ and at the frequency $f - \Delta'$ in $[\ell, 1]$. Since the transformation must preserve the amount of work completed we must have:
$$\Delta\, \ell = \Delta'\, (1 - \ell) \qquad (1)$$
since the extra work in $[0, \ell)$ (i.e., $\Delta \ell$) must be equal to the spare work in $[\ell, 1]$ (i.e., $\Delta' (1 - \ell)$). Now we will compare the relative energy consumed by both schedules, i.e., we will show that
$$f^3 \le \ell\, (f + \Delta)^3 + (1 - \ell)\, (f - \Delta')^3. \qquad (2)$$
Using (1), the terms in $f^2$ cancel and (2) is equivalent to (by subtracting $f^3$ on both sides)
$$0 \le 3 f \left( \ell \Delta^2 + (1 - \ell) \Delta'^2 \right) + \ell \Delta^3 - (1 - \ell) \Delta'^3 = \ell \Delta\, (\Delta + \Delta')\, (3 f + \Delta - \Delta').$$
Or equivalently (dividing by $\ell \Delta$):
$$0 \le (\Delta + \Delta')\, (3 f + \Delta - \Delta'),$$
which always holds because $\Delta > 0$ and $f > 0$ (and the lower frequency $f - \Delta'$ cannot be negative).
To complete the proof we must show that our transformation preserves the amount of execution for each job of $\tau$ between its release and deadline. Consider the execution of a job of $\tau_i$ over some interval $[a, b]$ (where $a, b \in \mathbb{N}$) between its release and deadline where $\tau_i$ executed $e_1$ units over $[a, \ell)$ and $e_2$ over $[\ell, b]$ (where $\ell \in \mathbb{N}$). Furthermore, consider that one interval executes proportionally more of $\tau_i$ than the other interval. Without loss of generality, let $\frac{e_1}{\ell - a} > \frac{e_2}{b - \ell}$. We can show that the schedule consumes less of the processing time (and thus remains feasible in the single frequency schedule). The next subsection introduces an optimal parallel scheduler that can execute each task over every unit-length interval at a pre-specified rate. Using this scheduler, over ...
Since the rate of execution is higher for the first interval and lower for the second, the level of parallelism for $\tau_i$ must be greater in the first. If we instead execute $\frac{e_1 + e_2}{b - a}$ over all intervals, the total amount of execution is preserved. However, the total processing required is decreased as the speedup is a concave function over the level of parallelism (due to the properties of sub-linear speedup ratio and work-limited parallelism). Thus, between each job release and deadline of $\tau_i$, we can use a constant rate of execution.
Since we have an implicit-deadline task system, this rate is maintained over all intervals. $\blacksquare$

As a consequence, without loss of generality, we will consider systems where the number of active cores and the homogeneous frequency are decided prior to the execution of the system, i.e., offline.
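The energy comparison behind Property 1 can be checked numerically. The following sketch is only illustrative: it assumes the cubic power model $P \propto f^3$ used in the proof (with proportionality constant 1), a unit-length interval, and a hypothetical helper name `energy_constant_vs_two_level`. It compares a constant frequency $f$ with a two-level schedule running at $f + \Delta$ on $[0, \ell)$ and at $f - \Delta'$ on $[\ell, 1]$, where $\Delta'$ is chosen so that both schedules complete the same work.

```python
from typing import Tuple

def energy_constant_vs_two_level(f: float, delta: float, ell: float) -> Tuple[float, float]:
    """Return (constant-frequency energy, two-level energy) over [0, 1].

    Assumes power proportional to frequency cubed (constant factor 1).
    The two-level schedule runs at f + delta on [0, ell) and at
    f - delta_prime on [ell, 1], with delta_prime set so that the
    total work (the integral of the frequency) equals f.
    """
    if not (0.0 < ell < 1.0) or delta <= 0.0 or f <= 0.0:
        raise ValueError("need 0 < ell < 1, delta > 0 and f > 0")
    delta_prime = delta * ell / (1.0 - ell)  # work preservation
    if f - delta_prime < 0.0:
        raise ValueError("the lower frequency would be negative")
    e_const = f ** 3
    e_two_level = ell * (f + delta) ** 3 + (1.0 - ell) * (f - delta_prime) ** 3
    return e_const, e_two_level

# Example: f = 1, delta = 0.5, switch at ell = 0.5.
e_const, e_dyn = energy_constant_vs_two_level(1.0, 0.5, 0.5)
```

With these example values the two-level schedule spends more energy than the constant one for the same amount of work, which is the behavior the property describes.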
D. Scheduling Algorithm
In this paper, we use a scheduling algorithm originally developed for non-power-aware parallel real-time systems called the canonical parallel schedule [4]. The canonical scheduling approach is optimal for implicit-deadline sporadic real-time tasks with work-limited parallelism and sub-linear speedup ratio upon an identical multiprocessor platform (i.e., each processor has identical processing capabilities and speed). In this paper, we also consider an identical multiprocessor platform, but permit both the number of active processors and the homogeneous frequency $f$ for all active processors to be chosen prior to system run-time. In this subsection, we briefly define the canonical scheduling approach with respect to our power-aware setting.
Assuming the processor frequencies are identical and set to a fixed value $f$, it can be noticed that a task $\tau_i$ requires more than $k$ processors simultaneously if $u_i > \gamma_{i,k} f$; we denote by $k_i(f)$ the largest such $k$ (meaning that $k_i(f)$ is the smallest number of processor(s) such that the task $\tau_i$ is schedulable on $k_i(f) + 1$ processors at frequency $f$):
$$k_i(f) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } u_i \le \gamma_{i,1} f, \\ \max_{k=1}^{m} \{ k \mid \gamma_{i,k} f < u_i \} & \text{otherwise.} \end{cases} \qquad (3)$$
The canonical schedule fully assigns $k_i(f)$ processor(s) to $\tau_i$ and at most one additional processor is partially assigned (see [4] for details). This definition extends the original definition of $k_i$ from non-power-aware parallel systems [4].
As an example, let us consider the task system $\tau = \{\tau_1, \tau_2\}$ to be scheduled on $m = 3$ processors with $f = 1$. We have $\tau_1 = (6, 4, \Gamma_1)$ with $\Gamma_1 = (1.0, 1.5, 2.0)$ and $\tau_2 = (3, 4, \Gamma_2)$ with $\Gamma_2 = (1.0, 1.2, 1.3)$. Notice that the system is infeasible at this frequency if job parallelism is not allowed since $\tau_1$ will never meet its deadline unless it is scheduled on at least two processors (i.e., $k_1(1) = 1$). There is a feasible schedule if the task $\tau_1$ is scheduled on two processors and $\tau_2$ on a third one (i.e., $k_2(1) = 0$).
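The number of fully assigned processors in this example can be reproduced with a few lines of code. This is a minimal sketch, assuming that the speedup vector is passed without its sentinels and that it is strictly increasing; the function name `fully_assigned_processors` is hypothetical.

```python
from typing import Sequence

def fully_assigned_processors(u: float, gamma: Sequence[float], f: float) -> int:
    """Largest k in 0..m such that gamma_k * f < u (with gamma_0 = 0).

    u is the task utilization e_i / p_i, gamma[j - 1] the speedup on j
    processors, and f the homogeneous frequency. The scan stops at the
    first level fast enough for the task, which is valid because the
    speedup vector is assumed to be strictly increasing.
    """
    k = 0
    for j, speedup in enumerate(gamma, start=1):
        if speedup * f < u:
            k = j
        else:
            break
    return k

# The two tasks of the example, at frequency f = 1.
k1 = fully_assigned_processors(6 / 4, (1.0, 1.5, 2.0), 1.0)
k2 = fully_assigned_processors(3 / 4, (1.0, 1.2, 1.3), 1.0)
```

Here `k1` evaluates to 1 (one full processor plus part of a second one for the first task) and `k2` to 0, in line with the values quoted in the example.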
E. Problem Definition
We are now prepared to formally state the problem addressed in this paper.
Given a malleable, implicit-deadline sporadic task system $\tau$, a DVFS/DPM-enabled processor with $m$ cores, and canonical parallel scheduling, we will determine (offline) the optimal choice of system-wide homogeneous frequency $f$ and number of active cores $k$ such that $P(f, k)$ is minimized and no task misses a deadline in the canonical schedule.
To solve this problem, we will introduce an algorithm, based on the schedulability criteria of the canonical schedule, to determine the optimal offline frequency/number of processors combination and evaluate our solution over simulations.
IV. PRELIMINARY RESULTS
In this section, we restate the schedulability criteria for canonical scheduling under homogeneous frequencies and show that the criteria are sustainable (i.e., a schedulable system remains schedulable even if the frequency or number of active cores is increased). We will use these results in the next section to develop an algorithm for determining the optimal choice of number of active cores ($k$) and system-wide frequency ($f$).
A. Schedulability Criteria of Malleable Task System with Homogeneous Frequency
As we gave in Section III-C the mathematical interpretation of the parameter $f$ over system execution, it is easy to adapt the schedulability criteria of [4] to a power-aware schedule.
Indeed, we just have to replace $u_i$ by $\frac{u_i}{f}$ in the schedulability conditions. This leads us to the following theorem.

Theorem 1 (extended from Collette et al. [4]). A necessary and sufficient condition for an implicit-deadline sporadic malleable task system $\tau$ respecting sub-linear speedup ratio and work-limited parallelism to be schedulable by the canonical schedule on $m$ processors at frequency $f$ is given by:
$$\max_{i=1}^{n} \{ k_i(f) \} < m \quad \text{and} \quad \sum_{i=1}^{n} \left( k_i(f) + \frac{\frac{u_i}{f} - \gamma_{i,k_i(f)}}{\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}} \right) \le m. \qquad (5)$$
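A direct, unoptimized reading of this schedulability condition can be written as follows. The sketch rests on assumptions: the per-task processor demand is taken to be $k_i(f) + (u_i/f - \gamma_{i,k_i(f)}) / (\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)})$, tasks are represented as hypothetical `(e, p, gamma)` tuples without sentinels, and the names `processor_demand` and `is_schedulable` are illustrative rather than taken from [4].

```python
from typing import List, Sequence, Tuple

Task = Tuple[float, float, Sequence[float]]  # (e_i, p_i, speedups on 1..m cores)

def processor_demand(task: Task, f: float) -> Tuple[int, float]:
    """Return (k_i(f), fractional number of processors needed at frequency f)."""
    e, p, gamma = task
    u = e / p
    g = [0.0] + list(gamma)  # g[0] = 0 is the lower sentinel
    k = max(j for j in range(len(g)) if g[j] * f < u)
    if k == len(gamma):      # even all cores are too slow at this frequency
        return k, float("inf")
    return k, k + (u / f - g[k]) / (g[k + 1] - g[k])

def is_schedulable(tasks: List[Task], m: int, f: float) -> bool:
    """Assumed form of the test: every k_i(f) < m and total demand <= m."""
    demands = [processor_demand((e, p, gamma[:m]), f) for e, p, gamma in tasks]
    if any(k >= m for k, _ in demands):
        return False
    return sum(d for _, d in demands) <= m

# Example system from Section III-D.
tasks = [(6, 4, (1.0, 1.5, 2.0)), (3, 4, (1.0, 1.2, 1.3))]
feasible_at_1 = is_schedulable(tasks, 3, 1.0)
feasible_at_09 = is_schedulable(tasks, 3, 0.9)
```

For the example system on three cores, the first task needs two processors and the second three quarters of one at $f = 1$, so the test passes; at $f = 0.9$ the total demand exceeds three and the test fails.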
B. Sustainability of the Frequency for the Schedulability
In this section, we will prove an important property of our framework: sustainability. We will prove that if a system is schedulable, then increasing the homogeneous frequency will maintain the schedulability of the system. This implies that there is a unique minimum frequency for which a given task system is schedulable on a given number of processors. The algorithm introduced in Section V will use this property to efficiently search for this optimal minimum frequency. In order to prove that property, we will need some new notation and concepts.
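Definition 2. For a task $\tau_i$ and a frequency $f$, let
$$M_i(f) \stackrel{\text{def}}{=} k_i(f) + \frac{\frac{u_i}{f} - \gamma_{i,k_i(f)}}{\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}} \quad \text{and} \quad M_\tau(f) \stackrel{\text{def}}{=} \sum_{i=1}^{n} M_i(f).$$
This definition will be useful in the following main theorem.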
Theorem 2. The schedulability of the system is sustainable regarding the frequency, i.e., increasing the frequency preserves the system schedulability.
Proof Sketch: We only provide a sketch of the theorem proof.
Observe that both $k_i(f)$ and $M_i(f)$ (and also $M_\tau(f)$) are monotonically non-increasing in $f$. Thus, if the conditions of Equation (5) are satisfied for a given $f$, they will continue to be satisfied for any $f' \ge f$ since $M_i(f') \le M_i(f)$ and $k_i(f') \le k_i(f)$.

Theorem 2 implies that there is a minimum frequency for the system to be schedulable. The challenge of this section is to invert the function $M_\tau(f)$ in order to have an expression of the frequency depending on the number of processors $m$. This is not trivial because $M_\tau(f)$ is the sum of a continuous term and a discontinuous term (the expression of $k_i(f)$). Furthermore, since we assumed that $f$ is potentially any positive real number, the mathematically sound way to obtain the optimal value of $f$ is to determine it analytically. See Section VI for a discussion about this. We present an algorithm that computes the exact optimal minimum frequency for a particular task system $\tau$ and a number of active processing cores $m$ in $O(n^2 \log_2^2 m)$ time (see Algorithm 2). We then use this algorithm in conjunction with the power function $P(f, k)$ to determine the optimal number of active cores and system-wide frequency.
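Consider fixing each $k_i(f)$ term (for $i = 1, \ldots, n$) with values $\kappa_1, \kappa_2, \ldots, \kappa_n \in \{0, 1, \ldots, m - 1\}$, each corresponding to the number of processors potentially assigned to each task $\tau_i$ at a frequency $f$. Then, from Definition 3 we can replace $k_i(f)$ by $\kappa_i$ in schedulability inequations (5). The first condition is always true by choice of $\kappa_i$. For the second condition:
$$\sum_{i=1}^{n} \left( \kappa_i + \frac{\frac{u_i}{f} - \gamma_{i,\kappa_i}}{\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}} \right) \le m \iff f \ge \frac{\sum_{i=1}^{n} \frac{u_i}{\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}}}{m - \sum_{i=1}^{n} \left( \kappa_i - \frac{\gamma_{i,\kappa_i}}{\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}} \right)} \stackrel{\text{def}}{=} \Psi_\tau(m, \kappa),$$
where $\kappa \stackrel{\text{def}}{=} \langle \kappa_1, \kappa_2, \ldots, \kappa_n \rangle$. We have derived a lower bound on the frequency that satisfies Equation (5) given fixed $\kappa$. Notice that, in solving for $f$ in the above paragraph, it is possible that we have chosen values for $\kappa$ that do not correspond to the $k_i(f)$ values. If so, then the value returned by $\Psi_\tau(m, \kappa)$ may not correspond to a frequency for which $\tau$ is schedulable.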
To address this problem, we may symmetrically also fix a frequency $f$ and determine the corresponding values of $k_i(f)$ according to Equation (3). Let $\kappa_\tau(f)$ be the vector $\langle k_1(f), k_2(f), \ldots, k_n(f) \rangle$. For all $\tau_i \in \tau$, $k_i(f) < m$ if $f > u_i / \gamma_{i,m}$. Thus, if $f > \max_{i=1}^{n} \{ u_i / \gamma_{i,m} \}$ and the following inequality is satisfied, then Theorem 1 applies and $\tau$ is schedulable given frequency $f$:
$$f \ge \Psi_\tau(m, \kappa_\tau(f)). \qquad (6)$$
Recall that our goal is to minimize the non-decreasing function $P(f, k)$. Therefore, we want the smallest $f > \max_{i=1}^{n} \{ u_i / \gamma_{i,m} \}$ that satisfies Inequation (6), which leads to the following definition.

Definition 3 (Minimum optimal frequency). The minimum optimal frequency of a system $\tau$ schedulable on $m$ active processors is denoted as $f_{\min}^{(\tau, m)}$.

Consider now taking the inverse of function $k_i(f)$:
$$k_i^{-1}(\kappa) \stackrel{\text{def}}{=} \begin{cases} \left[ \frac{u_i}{\gamma_{i,\kappa+1}}, \frac{u_i}{\gamma_{i,\kappa}} \right) & \text{if } \kappa > 0, \\ \left[ \frac{u_i}{\gamma_{i,1}}, \infty \right) & \text{otherwise.} \end{cases}$$
We can see that $k_i$ ... total time complexity of the schedulability test is $O(n \log_2 m)$. See Algorithm 1 for a complete sketch of this algorithm.
In Algorithm 2, aimed at calculating $f_{\min}^{(\tau, m)}$, the value of $\kappa_i$ can also be found by binary search and takes $O(\log_2 m)$ time to compute. This is made possible by the sustainability of the system regarding the frequency (proved by Theorem 2).
Indeed, if $\tau$ is schedulable on $m$ processors with $f = \frac{u_i}{\gamma_{i,\kappa_i+1}}$, then it is also schedulable with $f = \frac{u_i}{\gamma_{i,\kappa_i}} > \frac{u_i}{\gamma_{i,\kappa_i+1}}$.
In order to calculate the complete vector $\kappa$, there will be $O(n \log_2 m)$ calls to the schedulability test. Since computing $\Psi_\tau$ is linear-time when the vector $\kappa$ is already stored in memory, the total time complexity to determine the optimal schedulable frequency for a given number of processors is $O(n^2 \log_2^2 m)$. In order to determine the optimal combination of frequency and number of processors, we simply iterate over all possible numbers of active processors $\ell = 1, 2, \ldots, m$, executing Algorithm 2 with inputs $\tau$ and $\ell$. We return the combination that results in the minimum overall power dissipation rate (computed with $P(f_{\min}^{(\tau, \ell)}, \ell)$). Thus, the overall complexity to find the optimal combination is $O(m n^2 \log_2^2 m)$.
See Algorithm 3 for the complete description.
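The search described above can be approximated by a simpler, slower procedure, sketched here for illustration only. It is not Algorithm 2 or Algorithm 3: instead of binary searches it scans every breakpoint $u_i / \gamma_{i,j}$, relying on sustainability to locate the interval that contains the minimum frequency, then applies the closed-form bound on that interval. The tuple layout `(e, p, gamma)`, all function names, the tolerance `EPS`, and the power model `k * f**3` are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Task = Tuple[float, float, Sequence[float]]  # (e_i, p_i, speedups on 1..m cores)
EPS = 1e-9  # tolerance for floating-point comparisons (assumption)

def level(u: float, g: List[float], f: float) -> int:
    """k_i(f): largest j with g[j] * f < u, where g[0] = 0 is the sentinel."""
    return max(j for j in range(len(g)) if g[j] * f < u - EPS)

def schedulable(tasks: List[Task], m: int, f: float) -> bool:
    """Assumed schedulability test on m cores at frequency f."""
    total = 0.0
    for e, p, gamma in tasks:
        u, g = e / p, [0.0] + list(gamma[:m])
        k = level(u, g, f)
        if k >= m:
            return False
        total += k + (u / f - g[k]) / (g[k + 1] - g[k])
    return total <= m + EPS

def psi(tasks: List[Task], m: int, kappa: Sequence[int]) -> float:
    """Lower bound on f for fixed levels kappa (each kappa_i < m)."""
    num, den = 0.0, float(m)
    for (e, p, gamma), k in zip(tasks, kappa):
        g = [0.0] + list(gamma[:m])
        d = g[k + 1] - g[k]
        num += (e / p) / d
        den -= k - g[k] / d
    return num / den

def min_frequency(tasks: List[Task], m: int) -> float:
    """Smallest schedulable frequency on m cores (breakpoint scan)."""
    pts = sorted({(e / p) / gj for e, p, gamma in tasks for gj in gamma[:m]})
    ok = [b for b in pts if schedulable(tasks, m, b)]
    if not ok:  # only schedulable above every breakpoint, where all k_i = 0
        return psi(tasks, m, [0] * len(tasks))
    idx = pts.index(ok[0])
    if idx == 0:
        return ok[0]
    mid = (pts[idx - 1] + ok[0]) / 2.0  # levels are constant inside the interval
    kappa = [level(e / p, [0.0] + list(gamma[:m]), mid) for e, p, gamma in tasks]
    return psi(tasks, m, kappa)

def power(f: float, k: int) -> float:
    """ASSUMED power model for illustration: k active cores, cubic in f."""
    return k * f ** 3

def best_combination(tasks: List[Task], m: int) -> Tuple[int, float]:
    """Try every number of active cores and keep the lowest-power pair."""
    best = None
    for cores in range(1, m + 1):
        f = min_frequency(tasks, cores)
        if best is None or power(f, cores) < power(best[1], best[0]):
            best = (cores, f)
    return best

tasks = [(6, 4, (1.0, 1.5, 2.0)), (3, 4, (1.0, 1.2, 1.3))]
cores, freq = best_combination(tasks, 3)
```

On the two-task example used in this paper, the scan yields 0.9375 as the minimum frequency on three cores. Which number of cores is selected in the end depends entirely on the assumed power function.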
B. An Example
Let us use the same example system as previously introduced ... ($\kappa_1 = 2$, $\kappa_2 = 0$). This implies that the optimal minimum frequency (Algorithm 2) for this system to be feasible on 3 processors is equal to $f_{\min}^{(\tau, m)} = \Psi_\tau(3, \langle 2, 0 \rangle) = 0.9375$. We can see that if we call the feasibility test function (Algorithm 1) for any frequency greater than or equal to 0.9375, it will return True; it will return False for any lower value.
C. Proof of Correctness
... Therefore, it must be that $f' = \Psi_\tau(m, \kappa_\tau(f'))$.
Since (6) has reached equality with $f'$, this is the smallest frequency such that $\tau$ is schedulable. Therefore, $f'$ must be equal to $f_{\min}^{(\tau, m)}$. $\blacksquare$
VI. PRACTICAL CONSIDERATIONS
This section discusses some of our choices and assumptions for this work with regard to reality or practical implementation.
A. Continuous range frequency selection
We made the assumption in Section III-C that the homogeneous frequency of the processing platform is drawn from the positive continuous range (i.e., $f \in \mathbb{R}_{>0}$). It is worth mentioning that in practice the available frequencies are always drawn from a discrete and finite set. As this set is reasonably small, a search over all available frequencies (e.g., by binary search) could be an applicable approach to determine the minimum optimal frequency value. This approach has an even better complexity than our analytical approach. However, we choose to provide the exact analytical solution due to the generality of the solution and the fact that continuous frequencies can be emulated with discrete frequencies (as discussed below).
Notice that having the analytical expression of $f_{\min}^{(\tau, m)}$, we can always find the smallest available frequency for which the system remains schedulable. For a given platform with discrete and finite frequency set $\mathbb{F} \stackrel{\text{def}}{=} \{f_1, f_2, \ldots, f_p\}$, with $f_1 < f_2 < \ldots < f_p$, we define the following ceiling operator, which represents the smallest available frequency greater than the given analytical frequency:
$$\lceil f \rceil_{\mathbb{F}} \stackrel{\text{def}}{=} \min \{ f_i \in \mathbb{F} \mid f_i \ge f \}.$$
We just have to take $\lceil f_{\min}^{(\tau, m)} \rceil_{\mathbb{F}}$ to select the smallest available frequency able to schedule the task system. Notice that if $\lceil f_{\min}^{(\tau, m)} \rceil_{\mathbb{F}} > f_{\min}^{(\tau, m)}$, i.e., the smallest available frequency is slightly higher than the optimal frequency, the system will not be saturated and some idle instants will appear in the execution of the system.
Symmetrically, we can also define a floor operator, which represents the greatest available frequency smaller than the given analytical frequency:
$$\lfloor f \rfloor_{\mathbb{F}} \stackrel{\text{def}}{=} \max \{ f_i \in \mathbb{F} \mid f_i \le f \}.$$
Therefore, even if the chosen running platform has only a discrete set of frequencies $\mathbb{F}$, the minimum analytical frequency $f_{\min}^{(\tau, m)}$ could be emulated by switching between the frequency above and the frequency below, thereby removing the idle instants introduced by taking too large a frequency.
Indeed, the analytical frequency $f_{\min}^{(\tau, m)}$ can be approximated by switching between the two nearest discrete frequencies $f_h \stackrel{\text{def}}{=} \lceil f_{\min}^{(\tau, m)} \rceil_{\mathbb{F}}$ and $f_\ell \stackrel{\text{def}}{=} \lfloor f_{\min}^{(\tau, m)} \rfloor_{\mathbb{F}}$.
Then we define $\alpha$ such that $f_{\min}^{(\tau, m)} = \alpha f_h + (1 - \alpha) f_\ell$.
Solving this for $\alpha$ will give us the amount of time in each time unit that we should run at the high/low frequencies to obtain $f_{\min}^{(\tau, m)}$, thus saving more power than only running the system at frequency $f_h$, which would have been selected by binary search of the set of frequencies $\mathbb{F}$. Notice that we assume for this that the overheads of switching frequency at run-time can be neglected. Ishihara and Yasuura [19] use this technique for convex power functions and extend Property 1 to a discrete set of frequencies. This justifies the need for an algorithm computing the exact analytical minimum frequency.
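The emulation of an analytical frequency with two discrete ones reduces to a short computation. The sketch below makes a few assumptions: the available frequencies are given as a sorted list, the ceiling and floor operators use non-strict comparisons, switching overhead is ignored, and the names `ceil_freq`, `floor_freq` and `high_frequency_share` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence

def ceil_freq(freqs: Sequence[float], f: float) -> Optional[float]:
    """Smallest available frequency >= f, or None if f exceeds them all."""
    i = bisect_left(freqs, f)
    return freqs[i] if i < len(freqs) else None

def floor_freq(freqs: Sequence[float], f: float) -> Optional[float]:
    """Greatest available frequency <= f, or None if f is below them all."""
    i = bisect_right(freqs, f)
    return freqs[i - 1] if i > 0 else None

def high_frequency_share(freqs: Sequence[float], f: float) -> float:
    """Fraction alpha of each time unit to spend at the higher frequency.

    Solves f = alpha * f_high + (1 - alpha) * f_low for alpha.
    """
    f_high, f_low = ceil_freq(freqs, f), floor_freq(freqs, f)
    if f_high is None or f_low is None:
        raise ValueError("f lies outside the available frequency range")
    if f_high == f_low:  # f is itself an available frequency
        return 1.0
    return (f - f_low) / (f_high - f_low)

# Hypothetical platform with four frequency levels (sorted ascending).
available = [0.5, 0.75, 1.0, 1.25]
alpha = high_frequency_share(available, 0.9375)
```

For the hypothetical levels above and a target of 0.9375, the platform would run at 1.0 for three quarters of each time unit and at 0.75 for the remainder.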
Finally, it is possible that no available frequency is higher than the computed optimal minimum analytical frequency, i.e., $f_{\min}^{(\tau, m)} > f_p = \max(\mathbb{F})$. This means that the malleable jobs of the system cannot meet their deadlines even at the highest speed available on the target platform. The system is then said to be not schedulable on the chosen platform. Notice that in our simulations, we have made the assumption that this will never be the case: we simply determine $f_{\min}^{(\tau, m)}$ such that the system is schedulable, which requires that the chosen platform is capable of running at this frequency (even if this frequency is greater than 1).
B. Linear dependency between frequency and job execution speed
In Section III-C, we made the assumption, used throughout this paper, that there is a linear dependency between processor frequency and execution time. Although this is not entirely accurate in practice (other factors can have a large impact on job execution: cache synchronisation, memory latencies, etc.), this simplifying hypothesis is often made in the scheduling literature, e.g., in the popular uniform parallel machine model [20].
To better cope with potential practical implementation, we could define, for each task, a notion of functional utilization:
Instead of replacing $u_i$ by $\frac{u_i}{f}$ in the schedulability criteria (as is done in Section IV-A), we would replace $u_i$ by this function $u_i(f)$. The definition of this function determines how the execution time of a task is affected by the selection of a given frequency $f$. As we assumed in the rest of the paper that the dependency between frequency and execution time is linear, we implicitly set $u_i(f) \stackrel{\mathrm{def}}{=} \frac{u_i}{f}$.
Other, more realistic choices could be made for the definition of $u_i(f)$ (and will be considered in future research), but notice that this could affect the optimality of the underlying scheduling algorithm. In particular, for a non-linear dependency, optimality of the canonical schedule would be lost.
It is easy to integrate this non-linear dependency with a restricted and finite set of available frequencies (as discussed in VI-A): we just have to restrict the domain of the function.
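To make the notion of functional utilization more concrete, the sketch below represents $u_i(f)$ as a Python callable. Only the linear model corresponds to the assumption made in this paper. The second model, in which a fraction `beta` of the work does not scale with frequency, is purely hypothetical and is included only to show what a non-linear choice might look like. The `restrict` helper illustrates restricting the domain to a finite frequency set.

```python
def linear_utilization(u_i):
    """u_i(f) = u_i / f: the linear dependency assumed in the paper."""
    return lambda f: u_i / f


def partially_scaling_utilization(u_i, beta):
    """Hypothetical non-linear model (an assumption, not from the paper).

    A fraction beta of the work (e.g., memory-bound) takes the same time at
    any frequency; the remaining (1 - beta) scales linearly with 1 / f.
    """
    return lambda f: u_i * (beta + (1.0 - beta) / f)


def restrict(u_of_f, freqs):
    """Restrict the domain of a functional utilization to a finite set."""
    allowed = frozenset(freqs)

    def restricted(f):
        if f not in allowed:
            raise ValueError(f"frequency {f} is not available on the platform")
        return u_of_f(f)

    return restricted
```

With such a representation, a schedulability test would call `u(f)` wherever the linear analysis uses $u_i/f$; as noted above, the optimality results would not carry over automatically to the non-linear case.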
VII. SIMULATIONS
In order to investigate the potential benefit of parallelism upon power consumption, we have evaluated our algorithm with random simulations. In this section, we describe and discuss the high-level overview of the methodology employed in our evaluation and the results obtained from our simulations. 
A. Methodology Overview
The details of each step of our simulation framework are the following.
1) Random Task Sets Generation: We randomly generate execution times $e_i$ and periods $p_i$ with Stafford's RandomVectorsFixedSum algorithm [21], which is proven [22] to generate uniformly distributed multiprocessor task systems (i.e., $u_{\mathrm{sum}} > 1$). Since in our case we want to model tasks whose individual utilization $u_i$ is not bounded by 1,² we slightly modified the way the authors of [22] use the RandomVectorsFixedSum algorithm to permit $u_i > 1$ when generating task parameters.
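The following Python sketch illustrates the kind of task-set generation described in Step 1. It is not Stafford's RandomVectorsFixedSum algorithm: it is a simpler stand-in that draws utilizations uniformly over the set of non-negative vectors with a fixed sum (by normalising exponential draws), which covers only the case where individual utilizations have no upper bound other than the total. The period range and the function name are assumptions made for illustration.

```python
import random


def generate_task_set(n, u_sum, p_min=10.0, p_max=100.0, rng=None):
    """Return a list of (execution_time, period) pairs with total utilization u_sum.

    Stand-in for RandomVectorsFixedSum: i.i.d. exponential draws divided by
    their sum are uniformly distributed on the simplex, so an individual
    utilization may exceed 1 whenever u_sum > 1.
    """
    if rng is None:
        rng = random.Random()
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    utilizations = [u_sum * d / total for d in draws]

    tasks = []
    for u in utilizations:
        period = rng.uniform(p_min, p_max)  # assumed period range
        tasks.append((u * period, period))  # e_i = u_i * p_i
    return tasks
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when averaging over many task systems with the same total utilization.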
2) Speedup Vectors Values: To fix the speedup vectors of these task systems, we have modeled the execution behavior of two kinds of parallel systems: one not significantly affected by communication and I/O overheads, called the strong parallelized system (SPS), and another one heavily affected by communication and I/O overheads, called the weak parallelized system (WPS). A task of the SPS has a speedup vector with better values than a task of the WPS. The values used in our simulations for $\Gamma_{\mathrm{sps}}$, the vector for SPS, and $\Gamma_{\mathrm{wps}}$, the vector for WPS, are presented hereafter. Notice that both vectors respect the sub-linear speedup ratio and work-limited parallelism, as defined in Section III. Simulations are done for 1 to 16 cores, so there are 16 values in each vector. The values of these two vectors are plotted in Figure 2 and explicitly presented in array format in Figure 3.
² With the theoretical interpretation that a task with $u_i > 1$ must either be scheduled on a platform with a frequency greater than 1 or be modelled as a malleable task and thus scheduled on several processing cores.
3) Power Dissipation Rate: We define the total power consumption, introduced in Section III-C, as the sum of a dynamic and a static portion, so as to closely resemble a physical processing platform [23]. The static (leakage) power consumption of a processor can be as high as 42% of total power and depends on many factors [24]. In this research, we let the processor static power equal 15% of the total dynamic power when running at unit frequency. For this simulation, we use $P(f, k) = f^3 k + 0.15 \times k$, where $f$ is the processing frequency and $k$ is the number of active cores; the two additive terms represent dynamic and static power, respectively, and both depend on $k$. This model is sound with regard to the physical behaviour of power consumption on CMOS processor platforms, and many earlier real-time studies such as [23] use it. This model gives us a tool to compare power savings in the parallel schedule w.r.t. the non-parallel schedule. It is not expressed in Watts, as it does not describe a real machine, but it is useful for comparing, in relative terms, the different solutions evaluated in our abstract simulations. Notice that it respects the constraints on the power function defined in Section III-C. Power consumption is then computed with this function for each frequency/number-of-active-cores choice made in Step 4.
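The power model of Step 3 can be written as a one-line function. The sketch below assumes the formula reads $P(f, k) = f^3 k + 0.15\,k$ (cubic dynamic term, static term equal to 15% of the unit-frequency dynamic power per core); the parameter name `static_ratio` is introduced here for illustration.

```python
def power(f, k, static_ratio=0.15):
    """Abstract power P(f, k) = f^3 * k + static_ratio * k (not in Watts).

    f is the homogeneous processing frequency, k the number of active cores.
    The first term models dynamic power, the second static (leakage) power.
    """
    return k * f ** 3 + static_ratio * k
```

Because the dynamic term is cubic in $f$ while both terms are linear in $k$, halving the frequency while doubling the number of active cores reduces the dynamic part by a factor of four and doubles the static part, which is the trade-off the selection algorithm explores.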
4) Minimum Frequency/Number of Active Cores Determination: The goal of our simulations is to compare the power consumption of task systems in the three following settings:
• when tasks are strongly parallelized (with the SPS vector);
• when tasks are weakly parallelized (with the WPS vector);
• when tasks are not parallelized, i.e., scheduled with the traditional non-parallel optimal schedule.
Therefore, for each task system generated in Step 1, we compute three distinct frequency/number-of-active-cores pairs. The first two pairs are those returned by Algorithm 3 to optimally schedule the system when tasks are strongly parallelized and weakly parallelized, respectively. With this algorithm we make use of both DVFS (frequency selection) and DPM (selection of the number of powered-on cores). For simplicity in our simulations, when evaluating the minimum frequency to schedule task systems for SPS (resp. WPS), the same vector $\Gamma_{\mathrm{sps}}$ (resp. $\Gamma_{\mathrm{wps}}$) is assigned to every task of the system. The function $P(f, k)$ used in Algorithm 3 is defined in Step 3.
For the third pair, the frequency/number of active cores selected is the optimal one w.r.t. the traditional non-parallel (i.e., sequential) technique. To compute the minimum frequency required by a non-parallel scheduling algorithm (referred to as SEQ) with a fixed number of active cores, recall that the optimal schedulability criterion for a sequential multiprocessor system on a DVFS platform where the frequency is chosen offline is the following (we assume $\gamma_{i,1} = 1\ \forall i$):
$$\begin{cases} \dfrac{u_{\mathrm{sum}}}{f} \le m \\[4pt] \dfrac{u_{\max}}{f} \le 1 \end{cases}$$
Therefore, the minimum optimal frequency for a fixed number of cores $m$ is denoted as $f_{\mathrm{seq}}^{(\tau,m)} \stackrel{\mathrm{def}}{=} \max\left\{ u_{\max}, \frac{u_{\mathrm{sum}}}{m} \right\}$.
As we want to compare parallel and sequential schedules with the same DVFS/DPM features, we want to select the frequency/number-of-active-cores pair that minimizes the function $P(f, k)$ (as in the parallel case). Therefore, we call a modified version of Algorithm 3, with the function minimumOptimalFrequency($\tau$, $\ell$) returning $f_{\mathrm{seq}}^{(\tau,\ell)}$ instead of $f_{\min}^{(\tau,\ell)}$.
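For example, consider $\tau = \{\tau_1, \tau_2\}$ to be scheduled on a platform with $m = 3$ processing cores, where $e_1 = 6$, $p_1 = 4$, $e_2 = 3$ and $p_2 = 4$. We then have $u_1 = \frac{3}{2}$ and $u_2 = \frac{3}{4}$, and Algorithm 3 returns, for each of the three settings:
$(f_{\mathrm{sps}}^{(\tau,\ell_{\mathrm{sps}})} = 0.7525,\ \ell_{\mathrm{sps}} = 3) \mapsto P(0.7525, 3) = 1.7284$
$(f_{\mathrm{wps}}^{(\tau,\ell_{\mathrm{wps}})} = 0.7875,\ \ell_{\mathrm{wps}} = 3) \mapsto P(0.7875, 3) = 1.9151$
$(f_{\mathrm{seq}}^{(\tau,\ell_{\mathrm{seq}})} = 1.5,\ \ell_{\mathrm{seq}} = 2) \mapsto P(1.5, 2) = 7.05$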
We can already see in this simple example that the gain from sequential to parallel is substantial: the sequential schedule consumes roughly 3.7 to 4.1 times the power of the parallel ones. Notice that even in the non-parallel setting, we allow:
• individual utilization not bounded ($u_i > 1$ is permitted);
• total utilization not bounded ($u_{\mathrm{sum}} > m$ is permitted);
• homogeneous frequency not bounded ($f > 1$ is permitted).
It is important to notice that utilization is not bounded. For example, it is not a problem to have a total utilization greater than $m$: it just requires a running homogeneous frequency greater than 1. As discussed in Section VI-A, if the highest frequency available on the target platform is lower than the one computed by our algorithm, the task system is not schedulable on this target platform.
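The SEQ part of Step 4 can be sketched in a few lines of Python. The sketch assumes that the sequential criterion reads $u_{\mathrm{sum}}/f \le \ell$ and $u_{\max}/f \le 1$ for $\ell$ active cores, so that the minimum frequency is $\max\{u_{\max}, u_{\mathrm{sum}}/\ell\}$, and that the power function is $P(f, k) = f^3 k + 0.15\,k$. It does not reproduce Algorithm 3 for the malleable (SPS/WPS) settings, which additionally requires the speedup vectors. All function names are hypothetical.

```python
def power(f, k, static_ratio=0.15):
    """Abstract power model assumed from Step 3: f^3 * k + static_ratio * k."""
    return k * f ** 3 + static_ratio * k


def min_sequential_frequency(utilizations, k):
    """Smallest f with u_sum / f <= k and u_max / f <= 1 (assumed criterion)."""
    return max(max(utilizations), sum(utilizations) / k)


def select_sequential(utilizations, m):
    """Return (f, k, P) minimizing P(f, k) over k = 1..m active cores."""
    best = None
    for k in range(1, m + 1):
        f = min_sequential_frequency(utilizations, k)
        p = power(f, k)
        # Strict comparison keeps the smallest k in case of a tie.
        if best is None or p < best[2]:
            best = (f, k, p)
    return best
```

For the example task system with utilizations 3/2 and 3/4 on up to 3 cores, `select_sequential([1.5, 0.75], 3)` selects 2 active cores at frequency 1.5 for a power of 7.05 (up to floating-point rounding), matching the SEQ values given in the example.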
5) Results Comparison: Using the random-task generator introduced in Step 1, we generate task systems with 8 tasks.³ The total system utilization is varied from 1.5 to 32.0 in increments of 0.1, and the number of available cores is varied from 1 to 16. The simulation runs for each task system/maximum number of cores pair. For each utilization point, we store the exact frequency and number of active cores returned by Algorithm 3 and the associated power consumed (as defined in Step 3). This is done three times, once for each setting (SPS, WPS and SEQ) as explained in Step 4: our frequency/processor selection algorithm is compared against the power required by an optimal non-parallel real-time scheduling approach. The power gains of parallelisation are plotted in Figure 4; they are computed as the quotient of the power consumed by the system in sequential mode and the power consumed in parallel (malleable) mode, using in each case the minimum frequency determined in Step 4. This allows us to express the relative gain of the parallel paradigm over the sequential one. Each data point is the average power saving for 100 different randomly generated task systems (with the same total utilization). The plotted quantities are:
• $\mathbb{E}\left[ \dfrac{P(f_{\mathrm{seq}}, k_{\mathrm{seq}})}{P(f_{\mathrm{wps}}, k_{\mathrm{wps}})} \right]$ (Fig. 4a),
• $\mathbb{E}\left[ \dfrac{P(f_{\mathrm{seq}}, k_{\mathrm{seq}})}{P(f_{\mathrm{sps}}, k_{\mathrm{sps}})} \right]$ (Fig. 4b),
where, in setting $x$, $f_x$ represents the minimum optimal frequency, $k_x$ represents the number of activated cores ($1 \le k_x \le m$), and $\mathbb{E}[\cdot]$ represents the mean over 100 values.
³ The behaviour for $n \neq 8$ would be quite similar.
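The averaged power gain plotted for each data point can be sketched as follows. The sketch assumes that the plotted quantity is the mean of the per-task-system ratios (sequential power divided by parallel power), not the ratio of the means; the function name is hypothetical, and the two input lists are assumed to hold the powers obtained for the same task systems in the same order.

```python
from statistics import mean


def mean_power_gain(seq_powers, par_powers):
    """Mean of P_seq / P_par over paired task systems (mean of ratios)."""
    if len(seq_powers) != len(par_powers):
        raise ValueError("both settings must cover the same task systems")
    return mean(s / p for s, p in zip(seq_powers, par_powers))
```

In the simulations described above, each list would contain 100 values per utilization point and per maximum number of cores, one for each randomly generated task system.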
B. Results & Discussion
The figures show that the proposed algorithm achieves substantial power savings over the optimal sequential algorithm. Furthermore, for strongly parallelized systems, the power saving is substantially larger, and it increases further with higher system utilization and a larger number of available cores. On the other hand, for weakly parallelized systems, the power gain saturates quickly and no longer increases with the system utilization and the number of available cores. We can see in Fig. 4a that, even when the application is weakly parallelized (strong communication and synchronisation overheads), the power gain can be up to 4× w.r.t. the sequential execution. Moreover, in the strongly parallelized setup, the power gain can be up to 36× w.r.t. the sequential execution (see Fig. 4b).
From these plots, there are a few noticeable trends:
• As the total utilization increases, the power savings increase (for more than 2 active processors); the savings appear to be due to the fact that high-utilization workloads require a higher frequency in the sequential approach (and thus more power), whereas they can easily be distributed amongst cores in the parallel approach.
• As the total number of available cores increases, the power savings increase; the savings appear to be due to the fact that non-parallel schedules quickly reach the limit where adding a core no longer affects the schedulability of sequential jobs. Notice that this limit is also reached (less quickly) by the WPS (where it saturates, cf. Fig. 4a).
We can conclude from this that the better the jobs are parallelized (i.e., the better the $\gamma_{i,j}$ values of the speedup vector), the greater the power savings.
One may ask whether the savings are biased by the fact that $u_i$ can be greater than 1. Indeed, in our framework, in the sequential mode of execution, the only way to meet deadlines for a task with $u_i > 1$ is to increase the frequency (to a value greater than 1). This could introduce a bias in simulations based on a randomized $u_i$ that can be either less than or greater than 1. However, power savings are still present when $\forall i: u_i < 1$, as the following simple example shows: $\tau = \{\tau_1, \tau_2\}$, where $e_1 = 1$, $p_1 = 10$, $e_2 = 3$ and $p_2 = 4$. We let the maximum number of cores of the platform be $m = 4$. For this task system, the power values are the following:
$(f_{\mathrm{sps}}^{(\tau,\ell_{\mathrm{sps}})} = 0.4266,\ \ell_{\mathrm{sps}} = 2) \mapsto P(0.4266, 2) = 0.4553$
$(f_{\mathrm{wps}}^{(\tau,\ell_{\mathrm{wps}})} = 0.4421,\ \ell_{\mathrm{wps}} = 2) \mapsto P(0.4421, 2) = 0.4728$
$(f_{\mathrm{seq}}^{(\tau,\ell_{\mathrm{seq}})} = 0.8500,\ \ell_{\mathrm{seq}} = 1) \mapsto P(0.8500, 1) = 0.7641$
By scheduling malleable jobs, we observe in this example relative power savings from 61% to 68% (depending on the tasks' speedup vector). Therefore, parallelization also helps for systems in which every task has $u_i < 1$.
VIII. CONCLUSIONS
In this paper, we show the benefits of parallelization for both meeting real-time constraints and minimizing power consumption. Our research suggests the potential for reducing the overall power consumption of real-time systems by exploiting job-level parallelism. The simulation results show that power savings can be substantial even for systems with weak parallelization, which tend to use computational resources more intelligently. For better parallelized systems, the power gains can be very high. Simulations of our algorithm on randomly generated task systems executing on platforms having up to 16 processing cores show that the theoretical power consumption is up to 36 times lower than that of the optimal non-parallel approach.
In the future, we will extend our research to investigate the power saving potential when the cores may execute at different frequencies, and we will also incorporate thermal constraints into the problem. We will consider more realistic application models such as the thread or the fork-join model. To avoid over-simplification of the platform and power model, we will also consider practical implementations of parallel power-aware online schedulers in an RTOS deployed upon an actual hardware testbed, and measure practical power savings directly on this platform.
