While Dynamic Voltage Scaling (DVS) and Dynamic Power Management (DPM) techniques are widely used in real-time embedded applications, their complex interaction is not fully understood. In this research effort, we consider the problem of minimizing the expected energy consumption in settings where the workload is known only probabilistically. By adopting a system-level power model, we formally show how the optimal processing frequency can be computed efficiently for a real-time embedded application that can use multiple devices during its execution, while still meeting the timing constraints. Our evaluations indicate that the new technique provides clear (up to 35%) energy gains over the existing solutions that are proposed for deterministic workloads. Moreover, in a non-negligible part of the parameter spectrum, the algorithm's performance is shown to be close to that of a clairvoyant algorithm that can minimize the energy consumption with advance knowledge of the exact workload.
INTRODUCTION
Energy management has become one of the most important goals in modern embedded system design, especially for battery-operated systems. In this regard, Dynamic Voltage Scaling (DVS) is recognized as one of the most effective and fundamental techniques. By exploiting the strictly convex relationship between the supply voltage and CPU power consumption, DVS attempts to save energy by scaling down the frequency along with the supply voltage. Early DVS studies can be classified in two categories. The inter-task DVS schemes [2, 10, 12] focus on allotting the CPU time to multiple tasks and perform frequency scaling only at task preemption/completion points. On the other hand, in intra-task DVS schemes [9, 17-19], the frequency changes can occur while a task is executing (in its allocated CPU time). In recent DVS research [1, 7], the concept of energy-efficient frequency was introduced after researchers observed that processing frequencies below a certain threshold can have negative effects on the system-wide energy consumption. This energy-efficient frequency is computed by balancing the off-chip device energy and CPU energy consumed during the task's execution.
Another widely-used energy management technique is Dynamic Power Management (DPM) that attempts to put the idle system components into low-power states whenever possible. In fact, off-chip devices (such as I/O devices and the main memory) have an active state and at least one low-power sleep state. However, significant transition energy/time overheads are involved in state transitions of the devices. In fact, a minimum idle interval length (called the device break-even time) is needed in DPM to justify the transition of the device to the sleep state. DPM reduces energy by putting the device into low power sleep state when the idle time is predicted to be no less than the device break-even time. DPM has been well-studied in the recently proposed power management schemes targeting different task/device settings [3, 6, 14, 15] .
While the research efforts that focus only on DVS or DPM are many, solutions that propose integrating both policies under a unified framework are relatively few [4, 5, 8, 13, 20]. Authors in [13] apply a stochastic DPM policy by using the different DVS voltage levels as multiple active power modes. The work in [20] proposes a DVS-DPM policy that maximizes the operational lifetime of an embedded system powered by a fuel cell based hybrid power source. The frequency scaling level is chosen in [8] by investigating the trade-offs between the DVS-enabled CPU and the DPM-enabled devices. The SYS-EDF algorithm is a heuristic-based energy management scheme for periodic real-time tasks [4].
The growing importance of system-wide energy management clearly mandates integration of both policies. Yet most of the existing solutions, while noteworthy, adopt heuristic-based solutions with no performance guarantees. In fact, it is somewhat surprising that the full solution for a single real-time embedded application using both DVS and DPM policies was published only recently [5]. That work captures the intriguing trade-off between DVS and DPM policies in a precise way. As illustrated in Figure 1, if the processing frequency is lowered through DVS to save energy, the task execution time is extended. As a result, the idle time is shortened, which prevents DPM from putting the devices into the low-power sleep states. On the other hand, the device energy can be reduced by executing the task at a higher frequency to obtain enough idle time for putting the devices into the low-power sleep state, although this results in additional transition energy overhead and CPU energy consumption. Moreover, the application may be using multiple devices with different power and break-even time characteristics, making an optimal solution non-trivial. An exact and formal characterization of the interplay between DVS and DPM for a real-time application using multiple devices was recently obtained in [5]. In the same work, an algorithm to compute the optimal processing frequency was also derived. However, the solution in [5] is derived assuming a known, worst-case workload. In other words, the solution is optimal only for deterministic workloads.
As we show later in this paper, if the workload exhibits probabilistic behavior, minimizing expected energy is more important and the algorithm in [5] (called DET throughout this work) becomes sub-optimal, even overly pessimistic.
Although the task's actual workload cannot be predicted in advance with full accuracy, its probability distribution function can be obtained or estimated through profiling [9, 17-19]. Our primary goal in this paper is to determine the optimal processing frequency to minimize the expected energy at the system level, when the cumulative distribution of the application's workload is known. We consider both DVS and DPM features and formally characterize the expected energy. We illustrate our technique by first focusing on an application using a single device and show how the optimal frequency can be derived in linear time. Then, we extend the solution to the case of multiple devices. Our solutions also ensure that the timing constraint of the application is met in every execution scenario. Our experimental evaluation shows that our solution yields energy savings of up to 35% over the DET algorithm [5] by exploiting the probabilistic information. Moreover, it approaches the performance of a clairvoyant algorithm that can compute the best solution using the workload information beforehand. To the best of our knowledge, this study is the first to exploit DVS and DPM properties optimally to minimize the expected system-level energy consumption, for workloads that are known only probabilistically before execution.
SYSTEM MODELS

Application Model
We consider a real-time embedded application that is invoked repetitively with the period P. Such applications are known as frame-based systems in the literature [5, 11, 17]. Whenever invoked, the application must complete its execution within the relative deadline d, which is assumed to be equal to P. The application's workload, characterized by the number of cycles, is assumed to be known only probabilistically: it can assume values between a lower bound and an upper bound that are given by best-case execution cycles (bcc) and worst-case execution cycles (wcc), respectively. The cumulative distribution function for the application's workload is:

$$F(x) = p(X \le x) \qquad (1)$$

where X is the random variable for the application's CPU cycle demand and $p(X \le x)$ represents the probability that the application will not require more than x cycles within a given frame. To approximate the cumulative distribution function F(x), we use the effective histogram technique [17, 18]. Specifically, we assume that the application's workload range is partitioned into n cycle groups delimited by the boundaries $b_0, b_1, \ldots, b_n$, where $b_0 = bcc$, $b_n = wcc$ and $b_i < b_{i+1}$. As shown in Figure 2, the probability of executing the application with no more than x cycles is estimated conservatively as $F(b_{i+1})$, when x is in the interval $(b_i, b_{i+1}]$.

Notice that $F(b_n) = 1$. Through profiling, the probability distributions for the cycle groups can be obtained [17, 18]. Also, we define the size of the cycle group i as $c_i = b_i - b_{i-1}$ for i > 0 and $c_i = b_i$ for i = 0. We assume a DVS-enabled system where the processing frequency f can change from a minimum frequency $f_{min}$ to a maximum frequency $f_{max}$. All frequency values are normalized with respect to $f_{max}$. Finally, the utilization of the application is defined as $U = WCET/P$, where $WCET = wcc/f_{max}$ denotes its worst-case execution time when it executes its maximum possible workload wcc at $f_{max}$.
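The histogram-based estimate of F(x) described above can be sketched as a simple lookup. This is a minimal illustration, not the authors' implementation: the function name `histogram_cdf`, the list layout, and the choice to return 0 below bcc are assumptions made here.

```python
from bisect import bisect_left

def histogram_cdf(x, b, F):
    """Estimate p(X <= x) from a workload histogram.

    b: sorted boundaries b[0..n] with b[0] = bcc and b[n] = wcc
    F: F[i] = F(b[i]), non-decreasing, with F[n] = 1
    For x in (b[i], b[i+1]] the estimate is F(b[i+1]), as in the text.
    Returning 0 for x < bcc is an assumption of this sketch.
    """
    if x < b[0]:
        return 0.0
    if x >= b[-1]:
        return 1.0
    i = bisect_left(b, x)  # smallest i with b[i] >= x
    return F[i]
```

For instance, with hypothetical boundaries `[2, 4, 6, 8]` and CDF values `[0.1, 0.4, 0.8, 1.0]`, a demand of 5 cycles falls in the group (4, 6] and is mapped to F(6).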
Device Model
The real-time embedded application uses a set of m devices $D = \{D_1, D_2, \ldots, D_m\}$ during its execution. Each device $D_i$ can be either in active or sleep (low-power) state, and is defined with the following parameters:

• $P^i_a$: the power consumption of the device in the active state.

Due to the periodic execution pattern, each active-to-sleep transition for a device will be eventually followed by a sleep-to-active transition. As a result, to conveniently represent the overheads, we define $E^i_{tr}$ and $T^i_{tr}$ as the total energy and time overheads, respectively, of such a pair of state transitions for the device $D_i$. As in [3, 5, 6, 14-16], we assume inter-task device scheduling. In inter-task device scheduling, all devices used by the real-time application must be in active state while it executes. In fact, the absence of knowledge about the exact time instants when the application will need a specific device, together with the non-trivial state transition costs, easily justifies the inter-task device scheduling paradigm [3, 5, 6, 15].

However, the devices can be put to sleep state when the application finishes its execution within a frame (until the beginning of the next frame). Considering the non-trivial energy and time overheads associated with state transitions, the device break-even time $B_i$ is defined to represent the lower bound on the idle interval length so that putting the device to sleep state can be justified at run-time. In addition, no idle interval can be smaller than $T^i_{tr}$; hence, $B_i$ is the larger of $T^i_{tr}$ and the idle interval length at which the energy saved by sleeping equals the transition energy overhead [3-6].
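A break-even time of this kind is commonly computed as shown in the sketch below. The exact formula used by the authors is not reproduced in the text, so this is an assumption: the active power is taken in excess of the sleep power, and the function name and the numeric values in the usage note are hypothetical.

```python
def break_even_time(e_tr, p_a_excess, t_tr):
    """Smallest idle interval for which sleeping is worthwhile.

    e_tr:       total transition energy (sleep + wake-up), in joules
    p_a_excess: active power in excess of sleep power, in watts (assumption)
    t_tr:       total transition time, in seconds
    """
    if p_a_excess <= 0:
        raise ValueError("active power must exceed sleep power")
    # Sleeping saves p_a_excess per second of idle time but costs e_tr,
    # and the idle interval must at least cover the transition time.
    return max(t_tr, e_tr / p_a_excess)
```

With hypothetical values `break_even_time(0.0096, 1.2, 0.024)`, the energy term is 8 ms and the transition time of 24 ms dominates.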
Energy Model
The total system energy consumption within a frame is the sum of static and dynamic energy consumptions. Following previous DPM work [3-6], we assume that shutting down the entire system within a frame is not an option (hence, static power is not manageable) and concentrate on the dynamic energy consumption $E_d$. $E_d$ is a function of several factors, including the CPU frequency f and power characteristics and states of individual devices (such as main memory and I/O modules). For simplicity, we assume that all device active powers are given in excess of the device sleep powers, as in [4-6].
If the application executes c cycles at the frequency f within a given frame, it will complete its execution within $t = c/f$ time units, leaving a slack period of length $d - t$ until the beginning of the next frame. A device $D_i$ can be put to sleep state during the slack period only if $d - t \ge B_i$. For example, in Figure 3, the device $D_1$ can be put to sleep state during the slack period, while $D_2$ is forced to remain in active mode.

Let DA denote the subset of devices that are forced to remain in active state during the slack period, due to short idle interval lengths. On the other hand, the devices in D − DA can be transitioned to sleep state during the slack period to save energy. The dynamic energy consumption $E_d(f)$ within a frame can be formally expressed as [5]:

$$E_d(f) = \left(a f^3 + \sum_{i=1}^{m} P^i_a\right)\frac{c}{f} + \sum_{D_i \in DA} P^i_a \left(d - \frac{c}{f}\right) + \sum_{D_i \in D - DA} E^i_{tr}$$

where the first term is the sum of the energy consumed by the CPU ($a f^3 \cdot \frac{c}{f}$) and the devices ($\sum_{i=1}^{m} P^i_a \cdot \frac{c}{f}$) during the execution period; the second term is the energy consumed by the devices that remain in active state during the slack period; and the third term is the transition energy incurred by the devices in D − DA, which need to be activated at the beginning of the next frame.
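The three energy components described above can be combined in a few lines for a known cycle count. This is a sketch under stated assumptions: the cubic CPU power term `a * f**3` follows the text, a device is assumed to sleep exactly when the slack is at least its break-even time, and the function name and tuple layout are hypothetical.

```python
def dynamic_energy(f, c, d, a, devices):
    """Dynamic energy in one frame for a known workload of c cycles.

    f: processing frequency, c: cycles, d: deadline (frame length)
    a: CPU power coefficient (CPU power is a * f**3)
    devices: list of (p_a, e_tr, break_even) tuples, one per device
    """
    t = c / f                      # execution time
    if t > d:
        raise ValueError("deadline miss: c / f exceeds d")
    slack = d - t
    # Execution period: CPU plus all devices, which must stay active.
    energy = (a * f**3 + sum(p_a for p_a, _, _ in devices)) * t
    for p_a, e_tr, break_even in devices:
        if slack >= break_even:
            energy += e_tr         # device sleeps; pay transition energy
        else:
            energy += p_a * slack  # device stays active during the slack
    return energy
```

In the hypothetical call `dynamic_energy(0.5, 10, 30, 1.0, [(2.0, 5.0, 8.0), (1.0, 3.0, 12.0)])`, the slack is 10 time units, so the first device sleeps and the second stays active.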
SYSTEM-LEVEL ENERGY: STOCHASTIC CASE
Motivational Example
By using a system-level energy model very similar to that given in Section 2.3, the work in [5] derived a precise characterization of the interplay between DVS and DPM, and then showed how to obtain the optimal frequency to minimize system-level energy. A fundamental characteristic of that solution is that it assumes a deterministic workload equal to wcc. While provisioning for worst-case scenarios is important to guarantee the timing constraints, in practice, many real-time embedded applications complete early without consuming their worst-case workloads. Further, although the actual number of execution cycles cannot be known in advance exactly, its probability distribution can be obtained through profiling [17, 18]. This, in turn, provides new and significant opportunities to minimize the expected system-level energy consumption, while meeting the application's deadline, as the following example illustrates.

Let us consider an application with deadline d = 35ms running on a system with maximum frequency 1GHz, using a single IBM Microdrive device ($P_a$ = 1.3 Watt, $E_{tr}$ = 12 Joules and B = 24ms, as specified in [3, 5]). To start with, the analysis in [5] considers only the WCET information as the basis; hence, if d − B < WCET, the DET algorithm [5] would assume that the device will need to be kept in active mode during the slack period. For example, if wcc = 12 × 10^6, DET would choose the optimal frequency as f = 0.34GHz (effectively planning to complete the application just at the deadline in the worst case) since WCET = 12ms > d − B = 11ms, as seen in Figure 4.a. Now assume that the application's workload is known probabilistically, and that it changes between bcc = 2 × 10^6 and wcc = 12 × 10^6, as shown in Figure 4.b. Then we find that DET will yield the frequency f = 0.91GHz with the corresponding expected energy $E_{DET}$ = 27.79 Joules. But if we use f = 0.69GHz, the expected energy becomes E = 26.45 Joules, leading to approximately 5% additional energy savings. As this example suggests, minimizing the expected energy consumption while exploiting the subtle interaction between DVS and DPM warrants a full analysis.
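The arithmetic behind the first part of the example (the 0.34GHz figure and the comparison of WCET with d − B) can be checked with a short script. It only reproduces these two numbers; the helper name `stretch_frequency` is hypothetical and the script does not implement the DET algorithm itself.

```python
def stretch_frequency(wcc, d):
    """Frequency (Hz) at which wcc cycles finish exactly at the deadline d (s)."""
    return wcc / d

# Values taken from the example in the text.
wcc = 12e6        # worst-case cycles
f_max = 1e9       # maximum frequency, 1 GHz
d = 35e-3         # deadline, 35 ms
B = 24e-3         # break-even time of the device, 24 ms

wcet = wcc / f_max                 # worst-case execution time at f_max
device_stays_active = wcet > d - B # no room for a sleep transition in the worst case
f_det = stretch_frequency(wcc, d)  # complete just at the deadline

print(f"WCET = {wcet * 1e3:.0f} ms, d - B = {(d - B) * 1e3:.0f} ms")
print(f"f = {f_det / 1e9:.2f} GHz")
```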
Deriving Expected Energy Function
Given a processing frequency f and a cumulative distribution function for the workload, the expected energy consumption $E_m(f)$ within a frame can be written as the sum of three components: the expected energy consumption in the execution period $E_e(f)$, that in the slack period $E_s(f)$, and the expected transition energy overhead $E_t(f)$:

$$E_m(f) = E_e(f) + E_s(f) + E_t(f) \qquad (6)$$
Let us elaborate on each of these components separately. The execution period energy Ee(f ): During its execution, the application will execute a given cycle x in the range [bcc, wcc] with a certain probability. In fact, this probability is equal to (1−F (x)), where F (x) is the cumulative distribution function defined in Equation (1) [17, 18] . Consequently, the expected value of overall (CPU + device) energy consumption during the execution period can be written as:
Further, by applying the histogram-based estimation technique, E e (f ) can be formally re-written as:
The slack period energy E s (f ): The energy consumption during the slack period is due to devices that are forced to remain in active state when their completion time within the frame does not allow an energy-efficient transition. Specifically, when the application completes at time t > d − B i for a given device D i , that device will consume a total energy of P i a (d − t) during the slack period by staying in active mode. Completion at each of such time instants occurs with a certain probability; hence, we find:
where p(t = Z) is the probability that the application will complete exactly at time Z. The transition energy $E_t(f)$: When the application completes at time $t < d - B_i$, the device $D_i$ can and should be transitioned to sleep state during the slack period. However, each such transition will result in an energy overhead of $E^i_{tr}$. If t is the completion time, the expected value of $E^i_{tr}$ is $p(t \le d - B_i)\, E^i_{tr}$. Observe that:

$$p(t \le d - B_i) = p\left(\frac{X}{f} \le d - B_i\right) = p\left(X \le (d - B_i) f\right)$$

where X is the random variable for the cycle demand of the application and f is the processing frequency. Recalling that $p(X \le x) = F(x)$, we get:

$$E_t(f) = \sum_{i=1}^{m} F\left((d - B_i) f\right) E^i_{tr} \qquad (10)$$

At this point, we are ready to develop our solution for the problem of minimizing the expected overall energy $E_m(f)$, considering the probabilistic distribution of execution cycles while satisfying the deadline constraint.
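The sum of the three expected-energy components can be evaluated numerically with the sketch below. It is not the authors' closed-form expression: it assumes that the workload takes the boundary value `b[i]` with probability `F[i] - F[i-1]` (a discretization chosen here for illustration), and the function name and data layout are hypothetical.

```python
def expected_energy(f, b, F, d, a, devices):
    """Expected dynamic energy per frame at frequency f.

    b, F:    histogram boundaries and CDF values, F[i] = F(b[i]), F[-1] = 1
    d:       deadline (frame length)
    a:       CPU power coefficient (CPU power is a * f**3)
    devices: list of (p_a, e_tr, break_even) tuples
    Assumption of this sketch: the workload equals b[i] with
    probability F[i] - F[i-1] (with F[-1] taken as 0 for i = 0).
    """
    if b[-1] / f > d:
        raise ValueError("frequency too low to meet the deadline for wcc")
    p_all = sum(p_a for p_a, _, _ in devices)
    total, prev = 0.0, 0.0
    for cycles, cdf in zip(b, F):
        prob, prev = cdf - prev, cdf
        t = cycles / f
        energy = (a * f**3 + p_all) * t          # execution period
        for p_a, e_tr, break_even in devices:
            if d - t >= break_even:
                energy += e_tr                   # transition energy
            else:
                energy += p_a * (d - t)          # active during slack
        total += prob * energy
    return total
```

For example, with hypothetical inputs `b = [10, 20]`, `F = [0.5, 1.0]`, `d = 40`, `a = 1.0` and one device `(2.0, 5.0, 25.0)`, the device sleeps only after the shorter outcome when `f = 1.0`.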
SINGLE-DEVICE MODEL
In this section, we consider the case where the application uses only m = 1 device during its execution and derive the optimal frequency to minimize the expected energy by exploiting the interaction between DPM and DVS. This also allows us to lay the technical background for the more general case that will be addressed in Section 5. For simplicity, we use the notations $P^1_a = P_a$, $B_1 = B$ and $E^1_{tr} = E_{tr}$ throughout the section. Using the findings from Section 3.2, the single-device problem can be formally expressed as to minimize the expected energy $E_m(f)$ for m = 1:

$$E_e(f) + E_s(f) + E_t(f) \qquad (11)$$

Subject to:

$$\frac{wcc}{f} \le d \qquad (12)$$

$$f_{min} \le f \le f_{max} \qquad (13)$$
Above, (12) encodes the deadline constraint while (13) indicates the feasible frequency ranges supported by the system. A non-trivial difficulty with the above optimization problem is that the unknown f appears as a parameter in the cumulative distribution function which may be of arbitrary form, making a closed form solution unlikely. However, notice that $F((d - B)f)$ can have only one of the n + 1 distinct values that correspond to $F(b_i)$ ($0 \le i \le n$) in the optimal solution. This property suggests an iterative approach: our original problem will be divided into n + 1 sub-problems by letting $F((d - B)f) = F(b_i)$ for $i = 0, \ldots, n$.

Each sub-problem will be attacked by assuming that $F((d - B)f) = F(b_i)$ and the expected energy consumption corresponding to that sub-problem will be recorded. Finally, the frequency that leads to the minimal energy consumption in any of the sub-problems will be selected as the global optimal.

In the following, we focus on the solution of these sub-problems. Notice that when $F((d - B)f) = F(b_i)$, the expression (10) can be readily re-written as a function of $F(b_i)$. Further, the properties of the histogram-based approach and simple algebraic manipulation show that, under the same condition, (9) is equivalent to:
Hence, the i-th sub-problem can be formally defined as to minimize the corresponding expected energy $E_i(f)$ (14) subject to the constraints (15), (16) and (17), which comprise the deadline and frequency range constraints of the original problem together with the new additional constraint (16); the latter is the sufficient and necessary condition to enforce $F((d - B)f) = F(b_i)$.

Let $f_{low,i}$ denote the lower bound on the feasible frequency range for this sub-problem and $f_{up,i}$ the upper bound, as implied by these constraints. It is obvious that if $f_{up,i} < f_{low,i}$, the feasible region for this sub-problem is empty. On the other hand, the case of $f_{up,i} \ge f_{low,i}$ can be solved precisely. Observe that $E_i(f)$ is a strictly convex function. Therefore, the frequency $f_i$ that minimizes $E_i(f)$ without considering constraints (15), (16) and (17) can be found by setting its derivative to zero.

The convex nature of $E_i(f)$ justifies the following two basic properties for any Δ > 0: $E_i(f) < E_i(f + \Delta)$ whenever $f \ge f_i$, and $E_i(f) < E_i(f - \Delta)$ whenever $f \le f_i$.

Based on these properties, one can obtain the following:

Theorem 1. The optimal frequency for the i-th sub-problem is equal to $f^*_i = \max\{f_{low,i}, \min\{f_{up,i}, f_i\}\}$, whenever $f_{up,i} \ge f_{low,i}$.
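The clamping rule in Theorem 1 applies to any strictly convex function restricted to an interval and can be sketched as follows. The function name `constrained_minimizer` is hypothetical, and the convex function in the usage note is an illustrative stand-in, not the paper's $E_i(f)$.

```python
def constrained_minimizer(f_unc, f_low, f_up):
    """Minimizer of a strictly convex function over [f_low, f_up].

    f_unc is the unconstrained minimizer (where the derivative is zero).
    Returns None when the feasible interval is empty.
    """
    if f_up < f_low:
        return None
    # Left of f_unc the function decreases, right of it the function
    # increases, so the closest feasible point to f_unc is optimal.
    return max(f_low, min(f_up, f_unc))
```

As an illustration, g(f) = f^2 + 2/f has its unconstrained minimum at f = 1; over the interval [0.2, 0.8] the rule returns 0.8, and over [1.5, 2.0] it returns 1.5.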
Once we evaluate the optimal solutions to all n + 1 sub-problems (each requiring O(n) time), one can easily find the global optimal (in time $O(n^2)$). However, the procedure can be further sped up by observing that once $f_i$ is computed, $f_{i+1}$ can be easily evaluated from it. The denominator in the expression for $f_i$ is indeed a constant (independent of i) and can be computed just once when evaluating $f_0$. Thus, one can easily design an algorithm that computes the $f_i$ values bottom-up, by keeping track of the best candidate seen so far across the n + 1 iterations. This optimization will help reduce the complexity to just O(n log n), where the dominant term will be due to the process of sorting the $b_i$ and $F(b_i)$ values in increasing order. If they are already sorted in the input, the overall complexity is just O(n).
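The sub-problem enumeration can be sketched end to end under a simplified workload model. The assumptions are loud ones: the workload is taken to equal `b[j]` with probability `p[j]`, the closed form for the unconstrained minimizer is derived here from that model (it is not the paper's expression), d is assumed larger than the break-even time, and all names are hypothetical. For clarity each candidate is re-evaluated in O(n), so this sketch runs in O(n^2) rather than O(n).

```python
def expected_energy_1dev(f, b, p, d, a, p_a, e_tr, be):
    """Expected energy for one device; workload is b[j] with probability p[j]."""
    total = 0.0
    for cycles, prob in zip(b, p):
        t = cycles / f
        slack_cost = e_tr if d - t >= be else p_a * (d - t)
        total += prob * ((a * f**3 + p_a) * t + slack_cost)
    return total

def optimal_frequency_1dev(b, p, d, a, p_a, e_tr, be, f_min, f_max):
    """Return (frequency, expected energy) or None if no feasible frequency."""
    if d <= be:
        raise ValueError("this sketch assumes d > break-even time")
    lo = max(f_min, b[-1] / d)            # deadline and range constraints
    if lo > f_max:
        return None
    w = sum(c * q for c, q in zip(b, p))  # expected cycles (constant term)
    s = 0.0                               # expected cycles of sleeping outcomes
    best = None
    n = len(b)
    for k in range(n + 1):                # device sleeps after outcomes 0..k-1
        if k > 0:
            s += p[k - 1] * b[k - 1]      # incremental numerator update
        low_k = lo if k == 0 else max(lo, b[k - 1] / (d - be))
        up_k = f_max if k == n else min(f_max, b[k] / (d - be))
        if up_k < low_k:
            continue                      # empty feasible region
        # Here E_k(f) = a*w*f^2 + p_a*s/f + const, so dE/df = 0 gives:
        f_unc = (p_a * s / (2.0 * a * w)) ** (1.0 / 3.0)
        f_k = max(low_k, min(up_k, f_unc))
        e_k = expected_energy_1dev(f_k, b, p, d, a, p_a, e_tr, be)
        if best is None or e_k < best[1]:
            best = (f_k, e_k)
    return best
```

With the hypothetical instance `b = [10, 20, 30]`, `p = [0.5, 0.3, 0.2]`, `d = 40`, `a = 1.0`, `p_a = 2.0`, `e_tr = 10.0`, `be = 20`, and frequencies in [0.1, 2.0], the search selects the frequency at which the second outcome just leaves enough slack for a sleep transition.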
MULTIPLE-DEVICE MODEL
In this section, we address the general problem where the application uses m devices $D_1, \ldots, D_m$ (with corresponding break-even times $B_1, \ldots, B_m$). We assume the device indices reflect the ordering of the break-even times, as shown in Figure 8. With multiple devices, the problem gains a new dimension. By adjusting the processing frequency f, the application's execution time within a frame can be controlled through DVS; but this has a direct impact on the applicability of DPM. Specifically, when the application completes at time t such that $(d - B_i) < t < (d - B_{i-1})$, the devices $D_1, \ldots, D_{i-1}$ can be transitioned to the sleep state during the slack period. However, the devices $D_i, \ldots, D_m$ should remain in active state throughout the frame. Obviously, the probabilistic behavior of the workload adds another non-trivial complexity layer to the problem. The problem can be formally stated as to minimize the expected energy consumption $E_m(f)$ given by (6), where the execution period, slack period, and transition energy figures are given by (8), (9), and (10), respectively. Again, the optimal frequency f should satisfy the deadline constraint $wcc/f \le d$ and the feasible frequency range constraint $f_{min} \le f \le f_{max}$.
Similar to the single-device case, the general problem can be seen to give rise to multiple sub-problems by assuming that the product $(d - B_i) f$ falls within a specific cycle group $a_i$ for each device $D_i$. In other words, $a_i$ will be equal to the index of the cycle group (given in the histogram) that the application is assumed to be executing at time $t = d - B_i$, in the sub-problem that corresponds to each tentative optimal solution. Since the cycle group j (from $b_j$ to $b_{j+1}$) should be executed before the cycle group j + 1 (from $b_{j+1}$ to $b_{j+2}$), one can infer that $a_i \ge a_{i+1}$ in a sub-problem. Therefore, each sub-problem will be defined by a unique ordered sequence of $a_1, \ldots, a_m$.
Following a reasoning similar to the single-device case, one can obtain the corresponding formulation of the sub-problem for a given sequence $a_1, \ldots, a_m$ as to minimize $E_{a_1,..,a_m}(f)$, the expected energy consumption with the specific values of $a_1, \ldots, a_m$, subject to the deadline and frequency range constraints and the new additional constraints (21), which are the sufficient and necessary conditions to enforce that $(d - B_i) f$ falls within the cycle group $a_i$ for every device $D_i$.

For a given sub-problem, let $f_{low}$ and $f_{up}$ denote the lower and upper bounds on the feasible frequency range obtained by combining $f_{min}$, $f_{max}$, the deadline constraint, and the constraints (21) for all i. By following steps similar to those in Section 4, we can conclude that if $f_{up} < f_{low}$, the corresponding sub-problem does not have a solution. For the case where $f_{up} \ge f_{low}$, the unconstrained minimizer $f_a$ is obtained by setting the derivative of $E_{a_1,..,a_m}(f)$ to zero.

Theorem 2. The sub-problem of minimizing $E_{a_1,..,a_m}(f)$ admits an optimal solution $f^*_a = \max\{f_{low}, \min\{f_{up}, f_a\}\}$, whenever $f_{up} \ge f_{low}$.
As a quick inspection of the formula for $f_a$ reveals, it takes only O(mn) steps to solve each sub-problem. However, the apparent computational difficulty comes from the large number of sub-problems. In fact, the total number of sub-problems is given by the number of multisets of cardinality m, with elements taken from a finite set of cardinality n, which is equal to $\binom{n+m-1}{m}$ and is prohibitively large. Nevertheless, it is possible to develop a faster solution by observing that most of the sub-problems indeed have empty feasible regions and that it is necessary and sufficient to consider at most m(n + 1) + 3 sub-problems, each of which is uniquely defined by a separate combination of the cycle group index j and device index i. The full details of this faster algorithm of overall complexity O(mn log mn) with the accompanying proofs are presented in the Appendix.
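The gap between the two sub-problem counts can be made concrete with a short calculation. The multiset count is the standard combinatorial formula; the helper names are hypothetical, and the values n = 100 and m = 3 are borrowed from the experimental settings reported later in the paper.

```python
from math import comb

def naive_subproblem_count(n, m):
    """Number of multisets of cardinality m drawn from a set of cardinality n."""
    return comb(n + m - 1, m)

def reduced_subproblem_bound(n, m):
    """Upper bound on the sub-problems the faster algorithm must consider."""
    return m * (n + 1) + 3

n, m = 100, 3  # cycle groups and devices, as in the experimental setup
print(naive_subproblem_count(n, m), reduced_subproblem_bound(n, m))
```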
EXPERIMENTAL EVALUATION
To evaluate the performance gains yielded by our solution, we constructed a discrete-event simulator in C. In our simulator, we implemented the following three schemes:
• The optimal scheme, denoted by OPT and developed in this paper, which minimizes the expected system energy based on the application's probabilistic workload information, by integrating DVS and DPM in an optimal way.
• The scheme DET (proposed in [5] ), which minimizes the system energy consumption again by considering both DVS and DPM features, but by assuming a deterministic workload (equal to wcc).
• The clairvoyant scheme (CLR), which computes the optimal frequency by using the knowledge about the actual workload (number of cycles) of the application in advance. While it is not a practical scheme (since no algorithm can know the exact workload in advance), CLR is included in our comparison to assess the extent to which our algorithm's performance approaches absolute ideal bounds by exploiting the probabilistic information.
To be consistent with the experimental settings in [5], we performed our experiments by using the actual device specifications from [3], and the CPU power consumption is modeled after the Intel XScale specifications [17]. The application uses three devices during its execution: IBM Microdrive (B = 24ms), Realtek Ethernet Chip (B = 20ms) and Simple Tech Flash Card (B = 4ms). The frame length P varied from 40ms to 100ms; the application's execution cycles are generated using a normal distribution with the mean $(bcc + wcc)/2$ and the standard deviation $(wcc - bcc)/6$. This guarantees that 99.7% of the cycles fall in the range [bcc, wcc], and cycle values outside this range are not considered. The [bcc, wcc] range is divided into n = 100 cycle groups of equal size. The maximum CPU frequency is assumed to be 1GHz.
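A workload histogram of the kind described above could be generated as in the following sketch. It is not the authors' simulator (which was written in C): the mean at the midpoint of [bcc, wcc] and a standard deviation of one sixth of the range are assumptions consistent with the 99.7% figure, samples outside the range are rejected, and the function name and parameters are hypothetical.

```python
import random

def sample_workload_histogram(bcc, wcc, n_groups, n_samples, seed=0):
    """Build boundaries b[0..n] and empirical CDF values F(b[i]).

    Cycles are drawn from a normal distribution centered on the range
    midpoint with sigma = range / 6 (assumption); draws outside
    [bcc, wcc] are discarded, as stated in the text.
    """
    rng = random.Random(seed)
    mean = (bcc + wcc) / 2.0
    sigma = (wcc - bcc) / 6.0
    width = (wcc - bcc) / n_groups
    counts = [0] * n_groups
    accepted = 0
    while accepted < n_samples:
        x = rng.gauss(mean, sigma)
        if x < bcc or x > wcc:
            continue                         # outside the range: not considered
        idx = min(int((x - bcc) / width), n_groups - 1)
        counts[idx] += 1
        accepted += 1
    boundaries = [bcc + i * width for i in range(n_groups + 1)]
    cdf, running = [0.0], 0
    for c in counts:
        running += c
        cdf.append(running / n_samples)      # F at the upper boundary of the group
    return boundaries, cdf
```

The returned lists can be passed to any routine that expects sorted boundaries with matching cumulative probabilities, for example `b, F = sample_workload_histogram(0.0, 20e6, 100, 5000)`.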
We first evaluate the effect of the worst-case execution time on the expected energy consumption with the frame length (period) = 44ms as in [5], by changing wcc from 10 × 10^6 to 40 × 10^6 cycles with bcc = 0 (Figure 5). Notice that wcc = 40 × 10^6 corresponds to an almost fully utilized system. In Figure 5, all energy consumptions are normalized to that of OPT when wcc = 40 × 10^6. As we can see, the performance of our solution OPT is very close to that of the clairvoyant scheme CLR, especially when wcc ≤ 30 × 10^6. When wcc ≤ 15 × 10^6, DET can achieve the same performance as CLR and OPT. The reason is that, in the case of very low workload, the application can be executed with the energy-efficient frequency ($f_{ee}$) [5] while still meeting the deadline and leaving enough idle time to turn off all the devices. However, when wcc increases beyond 20 × 10^6, DET chooses the frequency (f = U) to minimize the system energy under the wcc case, while OPT can still use $f_{ee}$ to achieve better expected energy savings by considering the probabilistic workload information. But once wcc > 25 × 10^6, OPT also has to increase the frequency to enforce longer idle intervals for device state transitions. In fact, when wcc = 35 × 10^6, considering the interaction of DVS and DPM, OPT is forced to use the frequency f = U just like DET. These results suggest that OPT is able to achieve performance levels that are practically indistinguishable from the clairvoyant algorithm, except when the worst-case workload is very high.

Figure 6 shows the relative performance of the three schemes as a function of the actual workload variability. Specifically, we consider an application with period = 44ms and wcc = 20 × 10^6, and we vary the ratio bcc/wcc. Clearly, the lower this ratio, the more the actual workload deviates from the worst case. The energy values are normalized with respect to OPT when bcc = wcc. We observe that the performance of OPT is almost identical to that of CLR when bcc/wcc ≤ 0.6, exhibiting performance gains of around 35% over DET. However, when the ratio exceeds 0.7, OPT is forced to use the same processing frequency as that of DET. This is primarily due to the fact that large bcc values do not leave many opportunities to optimistically increase the frequency to create long device idle intervals. In fact, when bcc = wcc, all three schemes achieve exactly the same performance.
In Figure 7, we study the impact of varying the application period on the energy consumption by setting U = wcc/P = 0.5. The energy values are normalized with respect to that of DET when P = 40ms. As we expect, the energy consumption decreases as we increase the period, and increasing slack amounts enable more device state transitions. The performance of OPT is in fact indistinguishable from that of the clairvoyant scheme CLR. OPT provides energy savings of up to 35% over DET, though the savings tend to decrease with increasing period. This is because increasing the period while keeping the device break-even times constant also enables DET to put the devices to sleep states, even when planning according to worst-case workload scenarios.
CONCLUSIONS
In this work, we considered the problem of optimally integrating DVS and DPM policies for real-time embedded applications characterized by probabilistic workload profiles, and we presented algorithms to minimize the expected system-wide energy. Our solution is based on the precise characterization of the expected energy components, using DVS and DPM properties. First, we solved the problem for an application using a single device and then generalized it to multiple devices. By observing the special characteristics of the optimal solution for the multiple-device case, we also suggested a faster algorithm which significantly reduces the search space. Our experimental evaluation shows that our algorithm can achieve significant energy savings compared to the previous algorithm proposed in [5] for deterministic workloads and performs comparably even to a clairvoyant optimal scheduler that knows the exact workload in advance. To the best of our knowledge, this is the first solution to the system-wide energy management problem based on the optimal combination of DVS and DPM for probabilistic workloads.
