Modern embedded applications usually have real-time constraints and must also consume little energy. At system level, intra-task dynamic voltage scaling (DVS) is one of the most effective techniques for energy reduction. It lowers the processor's supply voltage and clock frequency to the lowest level that still allows the real-time constraints to be met. In this paper, we present how intra-task scenarios, which capture correlations between different parts of the application, can be applied on top of existing DVS techniques, making them more effective. Furthermore, we extend our method for automatic discovery of scenarios and adapt it to the DVS requirements. We show that, by augmenting an existing DVS method with scenarios, the average energy consumption of two real-life benchmarks is reduced by 14% to 52%.
INTRODUCTION
The growing number of mobile embedded systems, such as mobile phones, PDAs, and digital cameras, has directed designers' interest toward solutions that increase the battery lifetime of these systems. The problem becomes more complex for real-time systems, where, besides reducing the energy consumption and power dissipation of the entire system, tight performance constraints must also be met.
At system level, the most effective low-power techniques for real-time systems are dynamic voltage scaling (DVS) and dynamic power management (DPM) aware scheduling [9]. DVS takes into account that the processor energy consumption depends quadratically on the supply voltage ($E \propto V_{DD}^2$), while DPM switches off the parts of the system that are not currently used, reducing their energy consumption. When both DVS and DPM are available for an architecture, it is known that it is always advantageous to exploit DVS first [9].
Depending on the granularity, there are two different approaches for DVS-aware scheduling: inter-task voltage scheduling [8, 21, 3] and intra-task voltage scheduling [11, 14, 19, 4, 16, 18]. The first one determines the voltage on a task basis, while the second one selects voltage levels within the task. In this paper, we present a method for improving the performance of existing intra-task scheduling algorithms. These algorithms exploit the slack time that appears at runtime because of the difference between the length of the worst case execution path and the current execution path. To do this, code that may change the clock frequency, based on the currently followed execution path of the program, is inserted at some points of the original program, called voltage scaling points.
The energy consumption reduction depends on the amount of slack time and on when it is observed at runtime. The earlier it is detected, the more energy may be saved. Most of the current approaches are reactive: after a piece of code is executed, the slack is detected as the number of slack cycles, which is the difference between the worst case number of execution cycles (WCEC) of that piece of code and the number of cycles taken by its current execution (EC). Dividing it by the current processor frequency gives the slack time ($t_{slack} = (WCEC - EC) / f_{CLK}$). In this paper, we propose an improved, proactive and fully automatic method for detecting the slack time during a program execution. We rely on static analysis for discovering the correlations between parts of an application [7] and use them to partition the application into different, so-called, scenarios. Because we can determine the WCEC of each scenario at design time, the processor voltage/frequency may be scaled to the adequate level as soon as it can be detected at runtime in which scenario the application is executing. Our method is platform independent, introduces a very small runtime overhead and can be applied on top of all the existing intra-task voltage scheduling algorithms.
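As a small illustration of the reactive slack computation described above, the following sketch evaluates $t_{slack} = (WCEC - EC) / f_{CLK}$. The function name and the numbers are hypothetical and only serve the example.

```python
def slack_time(wcec_cycles, executed_cycles, f_clk_hz):
    """Slack time in seconds: (WCEC - EC) / f_CLK."""
    if executed_cycles > wcec_cycles:
        raise ValueError("executed cycles cannot exceed the worst case")
    return (wcec_cycles - executed_cycles) / f_clk_hz


# Hypothetical segment: WCEC of 1000 cycles, 600 cycles actually used,
# processor clocked at 100 MHz. The slack is 400 cycles, i.e. about 4 us.
t_slack = slack_time(1000, 600, 100e6)
```

A reactive scheduler only learns this value after the segment has finished, which is why detecting the scenario earlier can save more energy.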
The paper is organized as follows. Section 2 compares our work with related approaches. A motivating example is shown in section 3. Section 4 details how scenarios may be added on top of an existing DVS-aware scheduling algorithm. In section 5, an automatic scenario-aware DVS scheduling algorithm is introduced. The experimental environment and the evaluation of our approach on two real-life benchmarks are presented in section 6. Conclusions and future plans are discussed in section 7.
RELATED WORK
An intra-task voltage scheduling mechanism which changes the supply voltage at runtime, based on the splitting of a task into several slots, was proposed in [11]. A similar technique was presented in [14], where the authors propose intelligent ways for selecting the voltage scaling points. Besides the approaches based on natural slack cycles (WCEC − EC), in [19], Shin et al. propose a static method that exploits the difference between the WCEC of different paths of the program. This approach has small runtime overhead and does not need any special support from the hardware or the operating system. It does not take into account the probability that a path is executed, missing some opportunities for average energy reduction. Extensions which overcome this limitation were proposed in [4, 16].
The only proactive approach that we are aware of is presented in [18]. It tries to identify the slack time in advance, using the combined data and control flow information of the program. Its disadvantages are that the data-flow analysis cannot be applied easily outside of a procedure, that the runtime overhead (which is sometimes large) cannot be controlled, and that there is no easy way to detect whether this overhead leads to increased energy consumption. The way we select scenarios in our approach overcomes all of these limitations. As the tool and the benchmarks used for [18] are not publicly available, and the paper does not give enough information for implementing the tool, we could not directly compare our results with theirs. However, based on the same DVS-aware scheduling algorithm [19], and using real-life multimedia benchmarks, we obtain energy reductions between 14% and 52% whereas [18] reports reductions between 4% and 40%.
The scenario concept was first used in [20] to capture the data-dependent dynamic behavior inside a thread, to better schedule a multi-thread application on a heterogeneous multi-processor architecture. As the authors also considered the possibility of changing the voltage level for each individual processor, their work can be considered as combining inter-task voltage scheduling with scenarios. The scenario concept was also introduced in [12], where the tasks are written using a combination of a hierarchical finite state machine (FSM) with a synchronous dataflow model (SDF). The disadvantage of this method is that the applications must be written using a limited model, which is a time-consuming and error-prone operation. Currently, there are no automatic ways for translating high-level sequential programming languages (like C, which is the language most used for embedded system software) to SDF. To the best of our knowledge, in [7] we were the first to introduce a technique for automatically detecting intra-task scenarios for applications written in C. We used the discovered scenarios to reduce WCEC overestimation. In this paper, we adapt our techniques to DVS.
MOTIVATING EXAMPLE
To emphasize the possible benefit of using scenarios in intra-task DVS-aware scheduling, we start with an educational example, presented in Figure 1. Note that the function g is called three times, followed by three calls of f or g, depending on the value of ct. We assume that functions f and g do not change the value of ct. The estimated WCEC, using Shaw's timing schema [17], for this piece [...]

The processor runs at a frequency (100 MHz) that allows precisely meeting the timing constraint for the estimated WCEC case. As the application execution finishes before the deadline for the selected case, the processor goes into suspend mode. In all schedules given as examples in Figure 2, the following energy model was used: (i) for each period with a constant clock frequency $f_{CLK}$, the consumed energy is computed as the product of the energy consumed per cycle ($E_{cycle} \approx V_{DD}^2$) and the number of cycles, (ii) $E_{cycle}$ in suspend mode is 0, which gives a big advantage to the schedule from (a) compared to the DVS schedules in (b) and (c), and (iii) the average time for $V_{DD}$ switching is 70 µs. Figure 2(b) shows, for the same case, how the DVS+DPM-aware scheduler presented in [19] works. After each evaluation of the if condition, a slack equal to $WCEC_g - WCEC_f$ is detected; therefore, the processor voltage is reduced, while still keeping the possibility of meeting the deadline.
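The energy model above can be sketched in a few lines. This is only an illustration under stated assumptions: voltages are normalized (1.0 stands for the voltage needed at the full clock frequency), the voltage switching time is ignored, and the cycle counts are hypothetical rather than taken from Figure 2.

```python
def schedule_energy(segments):
    """Energy of a schedule given as (v_dd, cycles) segments.

    The energy per cycle is taken as V_DD**2; a suspended processor
    contributes nothing, so idle time is simply left out of the list.
    """
    return sum(v_dd * v_dd * cycles for v_dd, cycles in segments)


# Hypothetical case: the executed path needs 600 cycles.
# (a) run at full voltage, then suspend until the deadline.
e_full_speed = schedule_energy([(1.0, 600)])
# (c) the scenario is known up front, so the whole path runs at a
# lower voltage (0.6, assumed to still meet the deadline).
e_scenario = schedule_energy([(0.6, 600)])
```

Because the energy per cycle grows with the square of the voltage, running all cycles at a uniformly lower voltage costs less than running them at full voltage and suspending afterwards.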
In [7] we introduced scenarios, which are defined as the application behavior for a specific type of input data. The set of scenarios [...] (see footnote 1). For each scenario, based on its WCEC, a DVS schedule is computed. All of these are combined together in the application global schedule. At the beginning of the execution, this schedule detects the current scenario and activates its schedule. There is slightly more overhead in the code than in the original DVS schedule, but our method of detecting and using scenarios, presented in section 5, keeps this overhead very low. For the example in Figure 1, two scenarios are defined, one for ct = 1 and another one for ct ≠ 1. Figure 2(c) shows the voltage schedule for ct = 1, assuming that the scenario can be detected at the beginning of the execution and, therefore, taking as the starting voltage level the one that precisely meets the deadline given the scenario WCEC of $3 \cdot (WCEC_f + WCEC_g) + const$.
USING SCENARIOS IN AN INTRA-TASK DVS SCHEDULING ALGORITHM
In this section, we briefly describe a state-of-the-art intra-task voltage scheduling algorithm, introduced by Shin et al. in [19], and we show how scenarios may be applied on top of it. We assume that the processor has a specific instruction change_f_V(f_CLK), which changes the processor frequency to f_CLK, adjusting the supply voltage to the corresponding voltage V_DD. Both f_CLK and V_DD can be set continuously within the operational range of the processor. There is a transition overhead for changing the frequency, during which the processor stops running.
Original DVS Scheduling Algorithm
The scheduling algorithm from [19] is based on the observation that there are large variations in the WCEC of different paths of the program. The example of Figure 3 (from [19]), which contains both a piece of code and its control flow graph (CFG), emphasizes these variations. The numbers which appear inside the CFG nodes (b_i) represent their WCEC. The back edge from b5 to b_wh models the while loop, and is annotated with its maximum number of iterations. The longest path in this example is: b1, b_wh, b3, b4, b5, b_wh, b3, b4, b5, b_wh, b3, b4, b5, b_wh, b_if, b6, b7.
The WCEC of this path is 160 cycles. If the code has a deadline of 2 µs, the processor frequency must be set to 80 MHz. If, for example, the path b1, b2, b_if, b6, b7 [...]

Footnote 1: An example of a scenario for an H.263 decoder [15] is the application behavior for any frame of type P. Together with scenarios for frame types I and B, they cover all possible behaviors.

The DVS scheduling algorithm identifies, at any moment of the execution, the longest path until the end of the program. To do this, at compile time, for each node b_i, the remaining WCEC (RWCEC) among all the paths starting with b_i is computed. In the CFG from Figure 3, the RWCEC appears between brackets near each node. The nodes related to a loop (e.g. b_wh, b3, b4, b5) are associated with multiple RWCEC values, one for each iteration count of the loop. Depending on the number of loop iterations, the RWCEC table can be implemented in the scheduler as a lookup table (array) or as a formula that computes the RWCEC at runtime based on how many loop iterations were executed. The first option is more expensive in terms of memory, and the second one in terms of computation. As the aim is to reduce the energy consumed by the application, for each loop, the RWCEC implementation option that introduces the lowest energy overhead is selected.
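The RWCEC annotation and the frequency selection described above can be sketched as follows. The CFG below is hypothetical and acyclic (it is not the CFG of Figure 3), and loops, which need one RWCEC value per iteration count, are left out of this sketch.

```python
from functools import lru_cache

# Hypothetical acyclic CFG: node -> (WCEC of the node, successor nodes).
CFG = {
    "b1": (10, ["b2", "b3"]),
    "b2": (10, ["b4"]),
    "b3": (50, ["b4"]),
    "b4": (20, []),
}


@lru_cache(maxsize=None)
def rwcec(node):
    """Remaining WCEC: longest path, in cycles, among paths starting at node."""
    cost, successors = CFG[node]
    return cost + max((rwcec(s) for s in successors), default=0)


def min_frequency_hz(cycles, deadline_s):
    """Lowest clock frequency that executes `cycles` within the deadline."""
    return cycles / deadline_s


# With the numbers from the text, 160 cycles and a 2 us deadline give
# a frequency of about 80 MHz.
f_start = min_frequency_hz(160, 2e-6)
```

Here the longest path from b1 goes through b3, so taking the edge to b2 at runtime reveals slack.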
Using the computed RWCEC, the edges (b_i, b_j) that are candidates to contain the voltage scaling points can be statically identified. At these points, code is inserted to compute the new frequency, which permits the remaining part of the application, even in the worst case, to be executed before the deadline. This code also calls the change_f_V instruction to actually change the frequency. An edge (b_i, b_j) is a candidate if none of the longest paths starting with b_i contains it. Formally, (b_i, b_j) is selected if:

[Equation (1)]

where overhead represents the cycles taken to execute the introduced code. For the loop exit nodes such as b_wh there are multiple options for selecting the RWCEC, for example the largest RWCEC or the most probable RWCEC. A detailed analysis is presented in [19]. In the example of Figure 3, the selected edges are marked with a • and numbered from one to four.

As an improvement to [19], we also exploit the case when the condition from equation 1 evaluates to false, but

[Equation (2)]

is true. This means that on the edge (b_i, b_j) some slack cycles appear, but they are not enough to make an immediate reduction of the processor supply voltage V_DD and frequency f_CLK beneficial in the context of DVS. In this case, the slack cycles are propagated downwards in the application CFG, until the next candidate edge. To take into account the propagated slack cycles in edge selection, equation 1 is modified to:

[Equation (3)]

In Figure 4, the left hand side CFG contains two selected edges: (b1, b2) and (b2, b3). Considering a voltage scaling overhead larger than five cycles, the right hand side CFG shows how the five slack cycles from edge (b1, b2) are propagated to edge (b2, b3).
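The edge selection with slack propagation can be sketched as below. The exact inequalities are not reproduced in this sketch; it assumes that the slack on an edge (b_i, b_j) is RWCEC(b_i) − WCEC(b_i) − RWCEC(b_j), and that an edge becomes a voltage scaling point when this slack plus any propagated slack exceeds the overhead. The function name and all numbers are hypothetical.

```python
def decide_edge(rwcec_i, wcec_i, rwcec_j, propagated, overhead):
    """Return (is_voltage_scaling_point, slack_passed_downwards).

    Assumed slack on the edge (b_i, b_j):
        RWCEC(b_i) - WCEC(b_i) - RWCEC(b_j) + propagated slack.
    If it exceeds the overhead, the edge gets a voltage scaling point
    and consumes the slack; otherwise the slack is passed on to the
    next candidate edge.
    """
    slack = rwcec_i - wcec_i - rwcec_j + propagated
    if slack > overhead:
        return True, 0
    return False, slack


# Two consecutive edges with an overhead of 8 cycles (hypothetical):
# the first edge exposes only 5 slack cycles, so they are propagated.
first = decide_edge(100, 10, 85, propagated=0, overhead=8)
# The second edge exposes 10 cycles of its own; with the 5 propagated
# cycles it passes the threshold and becomes a voltage scaling point.
second = decide_edge(85, 15, 60, propagated=first[1], overhead=8)
```

The propagation step mirrors the situation of Figure 4, where five slack cycles that are too few on their own are carried to the next edge.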
Scenarios Add-on
In section 3, a scenario was defined as the application behavior for a specific type of input data. Usually, the input data appears, sooner or later, in the application source code as values for specific variables. For example, let us assume that in the code of Figure 3, the values of variables cond1 and cond3 and the maximum number of while loop iterations (no_iter) can sometimes be directly detected based on the input data before executing b1. Based on these values, the application can be divided into different scenarios (e.g. the header of Table 1). The backup scenario is the worst case scenario and it is used when the variable values cannot be identified in advance or the overhead of adding a new scenario does not lead to (average) energy reduction (see footnote 2). For each scenario, the parts of the CFG that are never executed are removed and, if relevant, the maximum number of iterations is updated. For the remaining CFG, the RWCEC annotations and a DVS schedule are computed. Figure 5 shows the remaining CFG (the black part) for scenario 1. Table 1 presents, for each scenario, the computed RWCEC and the used voltage scaling points (VSPs) from the original DVS scenario. The VSPs that appear in a scenario schedule are a subset of the VSPs which would appear in the application schedule when scenarios are not considered. There are two reasons why a VSP may not appear in a scenario schedule: (i) its edge is not present in the scenario CFG (e.g. VSP2 and VSP3 for scenario 1) and (ii) no slack time can be discovered on its edge anymore (e.g. VSP1 for scenario 1).
To detect the scenario that is active at runtime, Scenario Prediction Points (SPPs) are identified in the application at compile time. In each of them, code that predicts the current scenario, based on variable values, is inserted. The overhead introduced by this code must be small; otherwise the approach may not lead to energy reduction. Also, the earlier the current scenario is predicted, the more energy might be saved. For the previous example, one SPP is enough and it appears in the CFG on the input edge of b1. In Figure 6(a), it is shown as a gray node. If, for the same example, the fact that cond3 = 0 cannot be detected before executing b1, but can still be detected before b_wh, two scenario prediction points are necessary, as shown in Figure 6(b). The overhead introduced by this prediction code is considered when the RWCEC is computed for the CFG nodes (e.g. Figure 6(b) shows the RWCEC computed for the backup scenario, considering that both SPP1 and SPP2 introduce an overhead of 5 cycles).

Footnote 2: If the application is executed multiple times, the aim is to reduce its average energy. For a better evaluation of the saving, the probability of execution of each scenario must be considered.

The scenario schedules are combined into a global schedule for the application. For each scenario, this schedule contains both a list of the used voltage scaling points (VSPs) and a RWCEC table with the RWCEC annotations needed in the scenario schedule (see Table 1). Besides this, it also incorporates the prediction code introduced in the SPPs.
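A global schedule with one scenario prediction point can be pictured as a small lookup structure plus a few comparisons. The sketch below is illustrative only: the definition of scenario 1 in terms of cond1 and cond3, the cycle counts, and all identifiers are assumptions, not values from Table 1.

```python
BACKUP = "backup"

# Hypothetical global schedule: per scenario, the voltage scaling points
# that stay active and the RWCEC (in cycles) at the task entry.
GLOBAL_SCHEDULE = {
    "scenario_1": {"vsps": {"VSP4"}, "entry_rwcec": 70},
    BACKUP: {"vsps": {"VSP1", "VSP2", "VSP3", "VSP4"}, "entry_rwcec": 165},
}


def predict_scenario(cond1, cond3):
    """Code placed at a scenario prediction point: a few comparisons."""
    if cond1 == 0 and cond3 == 0:  # assumed definition of scenario 1
        return "scenario_1"
    return BACKUP  # worst case whenever nothing better is known


def start_frequency_hz(scenario, deadline_s):
    """Lowest frequency that still meets the deadline for the scenario."""
    return GLOBAL_SCHEDULE[scenario]["entry_rwcec"] / deadline_s


scenario = predict_scenario(cond1=0, cond3=0)
f_start = start_frequency_hz(scenario, 2e-6)
```

Keeping the prediction code to a handful of comparisons is what keeps the runtime overhead of the approach small.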
AUTOMATIC SCENARIO-AWARE DVS SCHEDULING
Our approach, based on static analysis of the application source code and presented in section 5.1, consists of four steps: (1) identify the parameters that could potentially have an impact on the application execution time; (2) compute the maximum possible impact of these parameters on the application WCEC, using an improved version of the method from [7], adapted to the DVS requirements; (3) partition the application into possible scenarios, considering these parameters together with their impact, and select only scenarios that, in isolation, reduce the energy consumption; (4) compute a DVS-aware schedule for each scenario and combine these schedules into a global schedule for the application.

[Figure 8: rules for computing $IC_v$ (excerpt)]

$$IC_v(\text{if } B \text{ then } B_1 \text{ else } B_2) = \begin{cases} \ldots, & \text{if } v \text{ is compared with a constant as part of the } B \text{ condition, and } v \text{ is not modified in } B_1 \text{ and } B_2, \\ \max(IC_v(B_1), IC_v(B_2)), & \text{otherwise.} \end{cases} \quad (6)$$

$$IC_v(\text{while } B \text{ do } B_1) = \begin{cases} \ldots, & \text{if } v \text{ is part of the } B \text{ condition, and } v \text{ is not modified in } B_1, \\ 0, & \text{if } v \text{ is modified in } B_1, \\ n_{max} \cdot IC_v(B_1), & \text{otherwise.} \end{cases} \quad (7)$$

where $S$ is a non-control statement, $B$, $B_1$, $B_2$ are blocks of statements, and $n_{min}$ and $n_{max}$ are the minimum and the maximum number of loop iterations.

As the static analysis cannot provide information about how frequently scenarios appear, it cannot predict whether the energy saved due to a scenario is greater than the energy overhead it introduces in the others. To solve this, in section 5.2, step 3 of the algorithm is augmented with a profiling-based method for selecting the set of scenarios that leads to the lowest energy consumption.
The Algorithm
The four steps of the algorithm are described below:

1:
The first step is based on the fact that usually a few parameters have a significant impact on the application execution time (e.g. in a video decoder: image size and type). Many of these parameters are read at the beginning of the execution and remain constant for the rest of it. Moreover, there is usually only a small set of possible values for them (e.g. for an H.263 decoder, there is one variable which specifies the image type, with three possible values: I, B or P). In C source code, these parameters usually appear as variables or structure fields of integer or enumeration type. For each parameter, there are one or a few statements in the program that change its value (often the value is based on the program input data). In our implementation, we automatically select potential parameters based on these observations.
2:
To identify which of these parameters might influence the WCEC the most, we first compute the application WCEC using Shaw's timing schema [17]. Second, the possible impact on the WCEC of each parameter (denoted by v) is computed in the form of its so-called influence coefficient (IC). The $IC_v$ represents the maximum possible variation caused by the different values of v on the estimated application WCEC. Since it is not possible to accurately predict a scenario based on the value of v before the last write to v, we adapted the IC computation from [7] to take into account only the impact on the code after the last write statement on each execution path. Figure 7 illustrates the $IC_v$ computation for a set of execution paths that share the latest write statement on v, and also for an application that contains multiple such sets.
As it is not possible to enumerate all possible execution paths of a program, a set of recursive rules is used to compute the $IC_v$. To this end, the abstract syntax tree (AST) of the program is traversed in a post-order manner and the $IC_v$ is computed in each node. The AST leaves are the non-control statements of the program and the inner nodes correspond to syntactic compositions of blocks of statements. Three types of composition exist: sequential composition, conditional composition and iterative composition. The post-order traversal of the AST allows determining the $IC_v$ for a program segment as a function of the $IC_v$ values computed for its components. Each AST node type is associated with one of the rules shown in Figure 8.
For a non-control statement, $IC_v = 0$, as there is only one possible execution path through it, meaning that there is no variation in [...]

For conditional composition (Figure 10(a)), if at least one of its children ($B_1$ or $B_2$) changes the value of v, the application active scenario is unknown during the execution of B. It could be discovered either after the last write from the children or on the edge between B and the child that does not modify the value of v (e.g. $(B, B_1)$ in the example). The $IC_v$ computed for the loop in this case is the maximum of the $IC_v$ values computed up to each point where the scenario can be determined at runtime. For iterative composition (Figure 10(b)), if the value of v is modified in the loop body ($B_1$), the application active scenario could be discovered at runtime only in the last iteration of the loop. As, for most loops, it is almost impossible to indicate in advance which iteration is the last one, the active scenario is discovered only after the loop and hence, in this case, the computed $IC_v = 0$.
The estimated WCEC of a composition node may vary with the value of v only if v is part of its if or while condition. If v is not part of the condition, its value does not influence which if branch is taken or how many loop iterations will be executed. Therefore, equations 6 and 7 are the only ones that inject values different from 0 in the recursive computation of $IC_v$. Figure 11 graphically interprets how $IC_v$ is computed in equation 6, as the difference between the WCEC of the longest possible execution path (max term) and the WCEC of the shortest one (min term, having as arguments the WCEC of the blocks minus the impact of these blocks). As later on, for splitting into scenarios, only the comparisons of variables with constants are considered, this is taken into account in equation 6. For iterative composition, two distinct cases appear when the loop body does not change the value of v: when v is not part of the condition, and when it is (equation 7). The first case is a natural extension of the sequential composition, where the nodes $B$ and $B_1$ are executed $n_{max}$ times. When v is part of the loop condition, the $IC_v$ is computed as the difference between the lengths of the longest possible execution path through the loop (the term that contains $n_{max}$) and of the shortest one (the one with $n_{min}$).
To go beyond function borders, for each function call that has v as a parameter, a renaming is done for computing the $IC_v$ inside the function.
The four types of AST nodes cover the entire ANSI C syntax. Simple control flow statements, like for, switch, goto, can be directly mapped into while and if statements without affecting their WCEC. A few constructs are difficult to handle: recursive functions (unknown depth), back jumps (hidden loops) and dynamic function calls. The first two can be transformed into loops using different mechanisms [5]. Even though the dynamic function call seems to be a fundamental problem, it is solvable in embedded software as, usually, all possible called functions are known at design time.
In the end, the root of the AST yields the values of the ICs computed for each possible parameter.
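The recursive computation over the AST can be sketched as follows. This is one possible reading of the rules described above, not the rules of Figure 8 themselves: the sequential rule (sum the coefficients, restarting after a block that writes v) and the exact form of the condition-dependent cases are assumptions, and all class names and cycle counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stmt:                      # non-control statement
    wcec: int
    writes_v: bool = False


@dataclass
class Seq:                       # sequential composition
    parts: list


@dataclass
class If:                        # conditional composition
    cond_wcec: int
    then: object
    orelse: object
    v_vs_const: bool             # condition compares v with a constant


@dataclass
class While:                     # iterative composition
    cond_wcec: int
    body: object
    n_min: int
    n_max: int
    v_in_cond: bool


def wcec(n):
    if isinstance(n, Stmt):
        return n.wcec
    if isinstance(n, Seq):
        return sum(wcec(p) for p in n.parts)
    if isinstance(n, If):
        return n.cond_wcec + max(wcec(n.then), wcec(n.orelse))
    return n.n_max * (n.cond_wcec + wcec(n.body))   # While, simplified


def writes_v(n):
    if isinstance(n, Stmt):
        return n.writes_v
    if isinstance(n, Seq):
        return any(writes_v(p) for p in n.parts)
    if isinstance(n, If):
        return writes_v(n.then) or writes_v(n.orelse)
    return writes_v(n.body)


def ic(n):
    """Influence coefficient of v for node n (post-order recursion)."""
    if isinstance(n, Stmt):
        return 0
    if isinstance(n, Seq):
        total = 0
        for p in n.parts:        # only code after the last write counts
            total = ic(p) if writes_v(p) else total + ic(p)
        return total
    if isinstance(n, If):
        a, b = n.then, n.orelse
        if n.v_vs_const and not (writes_v(a) or writes_v(b)):
            return max(wcec(a), wcec(b)) - min(wcec(a) - ic(a), wcec(b) - ic(b))
        return max(ic(a), ic(b))
    if writes_v(n.body):         # While: scenario known only after the loop
        return 0
    if n.v_in_cond:
        per_iter = n.cond_wcec + wcec(n.body)
        return n.n_max * per_iter - n.n_min * (per_iter - ic(n.body))
    return n.n_max * ic(n.body)


# Motivating example: three calls of g, then three calls of f or g
# depending on ct. WCEC of f and g are hypothetical (20 and 50 cycles).
f, g = Stmt(20), Stmt(50)
program = Seq([
    While(1, g, 3, 3, v_in_cond=False),
    While(1, If(1, f, g, v_vs_const=True), 3, 3, v_in_cond=False),
])
ic_ct = ic(program)              # 3 * (50 - 20) = 90 cycles
```

With these numbers the coefficient equals three times the difference between the two branches, which matches the slack of $WCEC_g - WCEC_f$ observed after each if in the motivating example.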
3:
To avoid an explosion in the number of scenarios, different criteria for selecting the parameters that define scenarios might be used. The selection may incorporate knowledge about the application combined with heuristics based on the computed IC values. An example of a very simple heuristic is to select only those parameters with very large IC values.
For each selected parameter, the constants it is compared to in the source code are collected. These constants, together with the comparison operators, are used to split the set of possible values of the parameter into subsets. In the end, a scenario is characterized by the possible values of the selected parameters. Figure 12 shows how the IC for the variable ct is computed for the example presented in section 3. As could already be seen in the source code, we can automatically detect two scenarios based on the values of ct: one corresponding to ct = 1 and the other to ct ≠ 1. The splitting into scenarios does not depend on the variable y, as $IC_y = 0$ (because y changes its value in both for loops).
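The splitting of a parameter's value set by the collected comparisons can be illustrated with the sketch below. It assumes a small, enumerable value domain and a hypothetical representation of each comparison as an (operator, constant) pair; values that give the same outcome for every comparison end up in the same subset.

```python
import operator

OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def split_values(domain, comparisons):
    """Group values of `domain` by the outcome of all comparisons."""
    classes = {}
    for value in domain:
        signature = tuple(OPS[op](value, const) for op, const in comparisons)
        classes.setdefault(signature, []).append(value)
    return list(classes.values())


# For the motivating example, ct is only compared with the constant 1,
# which yields one subset for ct = 1 and one for all other values.
subsets = split_values(range(4), [("==", 1)])
```

Each resulting subset corresponds to one potential scenario; parameters compared with several constants produce more subsets.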
For each potential scenario, by using static analysis, it is computed whether, considering the overhead for scenario detection and scheduling, energy is saved when it is run. This overhead influences the energy consumption in two ways: (i) the prediction code increases the number of execution cycles and the code size (more instruction memory involves more energy) and (ii) the sizes of the RWCEC tables used by the global schedule increase. There is no supplementary cycle overhead for processor frequency computation and changing when compared to traditional DVS scheduling, as no new voltage scaling points are added in the program. If a potential scenario is not energy beneficial, it will be merged with another one.
4: For each scenario, a DVS-aware schedule is computed (e.g. using the method from [19]). All of those schedules are combined into a global one, as presented in section 4.2. This schedule also includes code for predicting the active scenario. For each of the parameters used for splitting into scenarios, this code is inserted at the points that are certainly not followed by a statement that changes the parameter value. It consists of the variable comparisons also used for the splitting.

[Figure 12: computation of $IC_{ct}$ for the example of section 3 (columns: source code, $IC_{ct}$ equation, $IC_{ct}$ value).]
Profiling for Average Energy Reduction
A scenario, generated in step 3 of our algorithm, is always beneficial for energy when it is selected at runtime. However, it causes an overhead if it is not active. If the scenario does not appear frequently enough at runtime, the total energy saved by it might be less than the energy consumed by the overhead introduced by it in the other scenarios. The following inequality is used to detect the impact of a scenario S, with a probability of appearance p(S) ∈ [0, 1], on the average energy consumption of the application:
The static analysis cannot detect whether the average energy of the application increases or decreases when a scenario is introduced. Figure 13 shows how the original algorithm is augmented with a profiling step that gathers the necessary information. This step provides energy information per scenario and scenario frequencies to the heuristic from step 3 of the algorithm presented above. This approach allows multiple iterations over steps 3 and 4 of the algorithm, which may lead to progressive refinement of the energy improvement.
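The trade-off evaluated with the profiling data can be sketched as a simple expected-value test. The form used here is an assumption made for illustration: a scenario is kept if the energy it saves when active, weighted by its probability p(S), exceeds the overhead it causes when inactive, weighted by 1 − p(S). The function name and the energy figures are hypothetical.

```python
def scenario_pays_off(p, saved_when_active, overhead_when_inactive):
    """Assumed test: p * saving > (1 - p) * overhead, with p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return p * saved_when_active > (1.0 - p) * overhead_when_inactive


# Hypothetical profile: the scenario saves 5 energy units when active
# and costs 1 unit of overhead when another scenario is active.
rare = scenario_pays_off(0.1, 5.0, 1.0)      # 0.5 vs 0.9: not worth it
frequent = scenario_pays_off(0.5, 5.0, 1.0)  # 2.5 vs 0.5: worth it
```

A scenario that fails such a test would be merged with another one, which is the kind of decision the profiling step feeds back into step 3.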
EXPERIMENTAL EVALUATION
All the presented steps were implemented on top of SUIF [2] . For our experiments, we used a micro-architecture model similar to ARM7TDMI [1] . We consider that both fCLK and VDD can be set continuously within the operational range of the processor and that there is a transition overhead of 70µsec for changing the frequency, during which the processor stops running. The considered overhead is quite large. For computing the WCEC of scenarios, we use Shaw's timing schema [17] .
We tested our method on two multimedia applications, an MP3 decoder [10] and a restricted H.263 decoder [15] that supports only I and P frames. We show that applying our approach on top of the [...]

For the MP3 decoder, we computed the influence coefficient (IC) for all possible parameters. The ones with an IC value larger than 100 are listed in Table 2. As there is a big difference between the influence coefficient values of the second and the third parameter, initially, we consider only the first two parameters for splitting into scenarios. The first line of Table 3 shows both the average energy improvement and the overhead introduced in the application by using these scenarios (cycles, instruction memory, and data memory for the RWCEC tables). Both energy reduction and overhead use the DVS algorithm of [19] as a reference. We have chosen two sets of input files, one taken from [6] and one consisting of randomly selected stereo songs downloaded from the internet. The first set, even if it is a standard benchmark for MP3 decoders, is not representative in our opinion because it tests all the extreme coding cases (which rarely appear in usual MP3 files) and mono songs (which are not often listened to). By also considering the other two parameters in step 3 of our algorithm, and based on profiling information, we progressively obtain the sets of scenarios shown in the rest of the table. The best splitting is the one shown in the second line, which leads to an energy reduction of almost 16%. The experiments show that including mixed_flag in the scenario definition introduces more overhead than savings because the extra scenarios occur too infrequently.
For the H.263 decoder, the set of scenarios that reduces the energy consumption the most has one scenario for I frames and one scenario for P frames. As the processing performed for an I frame is a proper subset of the processing done for a P frame, the application WCEC is equal to the WCEC of the scenario for P frames, which is also the backup scenario. Therefore, the only scenario that reduces the energy consumption is the one for I frames. Depending on the input stream structure, we obtained an energy reduction from 14% (for an input stream which contains six P frames for each I frame) to 45% (if the input stream contains an equal number of I and P frames).
CONCLUSIONS AND FUTURE WORK
We have presented an automatic scenario-aware DVS scheduling algorithm for reducing the energy consumption of real-time applications. It can be applied on top of all intra-task DVS-aware scheduling techniques, making them more effective. It is based on scenarios that incorporate both the correlations between different parts of the application source code and different numbers of iterations for loops. To discover scenarios, we propose an algorithm based on static analysis augmented with profiling information. This algorithm guarantees a small runtime overhead for scenario prediction, and determines at design time the set of scenarios that yields the largest energy reduction. Our method was tested on two applications and an energy reduction between 14% and 52% was obtained when compared with traditional DVS scheduling.
There is a limitation in how close to the beginning of a program a scenario can be detected, and hence in how early the processor supply voltage and clock frequency can be adapted for it. Due to this, there is also a limitation in the reduction of the energy consumption as, before a scenario is detected, the worst case situation is taken into account. We plan to overcome this limitation by considering a probabilistic approach that predicts in advance, on the basis of partial information, in which scenario the application will end up. We also want to study a combination of intra-task and inter-task scenario-based voltage scaling for multiprocessor systems.
