Introduction
Limitations on the lifetime of embedded devices, particularly battery-powered mobile devices, have driven advances in embedded architectures that extend device lifetime. Microprocessor designs support dynamic adjustment of processing speed to prolong battery life. Generally, two techniques are employed in unison. On the one hand, dynamic frequency scaling allows the speed of instruction execution to change during the operation of a device. On the other hand, dynamic voltage scaling modulates the level of the supply voltage on demand. Both schemes, referred to jointly as DVS in the following, generally work hand in hand: when the frequency is lowered by a certain degree, the voltage can also be reduced to a lower level. Furthermore, both scaling techniques impact the power consumption of a device: power scales linearly with the frequency and quadratically with the voltage. Hence, a concerted approach of dynamic frequency and voltage scaling may yield considerable power savings [6].
Real-time systems are particularly well-suited to profit from DVS. Due to periodic task execution, it is generally not feasible to utilize the range of sleeping modes that modern processors offer. Tasks are invoked frequently (on a periodic basis in the order of a few milliseconds). The time to enter a sleep mode (and the later wakeup time) is in the order of tens of milliseconds, which generally matches the order of magnitude of a real-time task's period. Hence, suspension in sleep modes is not a viable option for real-time systems. But real-time systems often have task sets that underutilize the processor. Hence, reducing the frequency of execution while still meeting deadlines through DVS is a viable option resulting in considerable power reduction.
Recently, a number of hard real-time DVS scheduling schemes have been studied, ranging from compiler support [16] over numerous static scheduling approaches [10, 19] to dynamic methods [19, 2, 9] . All of these approaches have their own merits in that they provide a solution suitable to certain systems depending on scheduling methods, utilization bounds of the task sets and architectural properties, such as scaling overhead.
Any DVS scheduling scheme is subject to the same constraints as other hard real-time systems: The worst-case execution time (WCET) of a task has to be known a priori, i.e., safe bounds on a task's execution time have to be obtained. Prior work on static timing analysis provides the means to derive relatively tight WCET bounds for simple embedded architectures, which are provably safe. A number of research groups have addressed various issues in the area of bounding the WCET of a real-time task. Conventional methods for static analysis have been extended from unoptimized programs on simple CISC processors to optimized programs on pipelined RISC processors, and from uncached architectures to instruction and data caches [18, 14, 12, 17, 23, 13]. The challenge of static timing analysis is to provide not only safe but also tight bounds on the WCET in order to permit a sufficiently high processor utilization. These analysis approaches result in tight bounds for deterministic microarchitectures with simple components.
In the context of DVS, static timing analysis is generally assumed to remain valid with frequency scaling. The conjecture is that reducing a processor's frequency still results in the same number of cycles of execution for a task. Hence, considering the processor frequency should suffice to derive safe WCET bounds. However, this simplistic view generally does not hold for any realistic architectures. Consider the impact of memory references. Any instruction or data reference that is resolved through a main memory access operates at external bus frequency. But bus frequencies generally diverge from internal processor frequencies, and they do not scale at the same rate as DVS scaling does. E.g., the first generation Compaq Ipaq has a StrongArm microprocessor (SA-1110) that scales at 8 frequencies but only supports two different external bus frequencies.
In short, when static timing analysis is applied in the context of DVS, tightness and safety assumptions may no longer hold: WCET bounds may either not be tight (considerable overestimation upon fast memory operations for lower processor frequencies) or are no longer safe (underestimation potentially leading to missed deadlines upon a reduced data bus frequency). As a result, the memory latency also has to be adjusted to discrete values according to dynamic settings for execution frequencies and memory latencies. Instead of obtaining one discrete WCET through static timing analysis, different values for each processor frequency / bus frequency pair would have to be obtained. While this may still be a feasible approach for a static schedule and for a small number of such frequency pairs, it becomes infeasible for dynamic scheduling paradigms or a large number of frequency pairs. For certain scheduling approaches that exhibit intra-task DVS, such a static approach becomes impossible if tight bounds for the WCET are to be determined since the point of frequency changes during task execution is typically unknown at static time, e.g., due to dynamic scheduling, preemption and early completion.
The contribution of this paper is to remedy this problem by promoting a new methodology for frequency-aware static timing analysis (FAST). Instead of obtaining a WCET bound for each frequency pair, FAST takes static timing analysis to a novel level suitable for dynamic scheduling. FAST expresses WCET bounds as a parametric term whose components are frequency-sensitive parameters. On the one hand, cycles are interpreted in terms of the processor frequency; on the other hand, memory accesses are expressed in terms of the memory latency overhead due to the external bus speed. This parametric expression of the WCET allows one to determine on-the-fly the WCET for a given frequency pair. This is particularly appealing when scheduling decisions occur dynamically and when the number of frequency pairs becomes large, as is the case with state-of-the-art processors with fine-grained frequency settings.
In the following, we detail the technical innovations necessitated by DVS to ensure that safe and flexible WCET predictions may be obtained. We provide motivating examples, discuss the design of our FAST analysis tool, and we show the feasibility of our approach in a set of experiments that demonstrate flexibility and competitiveness while still providing tight bounds on the WCET. Related as well as future work and a summary conclude our contributions.
Effects of frequency scaling on WCET
In this section, we motivate the need for a parametric frequency model and assess the challenges of supporting this novel model in a static timing analysis tool. We also describe the parametric frequency model in detail, and we illustrate the key features in examples.
Motivation
Real-time systems that use DVS-based scheduling scale the WCET under the assumption that the worst-case execution cycles (WCEC) remain constant when the frequency changes. This assumption holds for systems where the memory latency scales with the processor frequency (systems with on-chip memory). In contrast, for a system where the memory latency does not scale with the processor frequency (systems with dynamic memory and memory hierarchies), the WCEC of a task does not remain constant when the frequency is scaled, since an increase in the frequency typically increases the number of cycles required to access memory. This behavior is caused by a constant access latency for memory references, regardless of changing processor frequencies. By assuming that the WCEC remains constant, one ignores the fact that the WCEC decreases as the frequency is lowered, which results in WCET overestimations. Figure 1 depicts results for the C-lab real-time benchmark fft, where the actual WCEC for a system with a memory hierarchy is compared to a constant WCEC. The WCEC for the benchmark was calculated for a simple in-order pipeline with instruction and data caches. In this example, it is assumed that the memory access latency is constant. Figure 1 illustrates that the WCEC increases linearly with the processor frequency. This results from an increasing number of wait cycles for a constant-time memory latency as the frequency increases. The slope of the actual WCEC depends on the number of accesses to main memory (and the latency to frequency ratio). Hence, the slope depends on the number of misses in the instruction and data caches combined. Therefore, the accuracy of paradigms that measure the worst-case behavior of the instruction and data caches not only controls the accuracy of the WCEC but also affects the accuracy with which the WCEC can be scaled with frequency. Figure 2 depicts the WCET corresponding to the two WCEC curves in Figure 1. The actual WCET indicates that assuming a constant WCEC independent of frequency modulation results in considerable overestimations of the WCET.
Figure 1. Actual vs. Assumed WCEC for fft (vertical axis: number of cycles; curves: actual WCEC, assumed WCEC)
The objective of the work described in this paper is to accurately model the actual WCEC and, thereby, the actual WCET of real-time tasks. We derive a parametric frequency model for this purpose. The model provides WCET bounds that remain tight and accurate throughout any frequency range. The parametric model complements real-time systems employing a DVS-based scheduling scheme, and it is paramount to achieving higher power savings. Ignoring the change in WCEC with frequency results in considerably smaller power savings.
Parametric Frequency Model
Our parametric frequency model can be used for timing analysis with any simple in-order single-issue pipeline. The model is applicable to systems with or without a memory hierarchy. We consider the model in a system with a memory hierarchy in the following, and we contribute solutions to the technical challenges posed. We assume that the system is equipped with an on-chip instruction and data cache and that the main external memory has a constant access latency. To accurately model the WCET in systems with memory hierarchies, we propose a parametric frequency model that captures the effect of frequency scaling accurately by splitting the WCEC of a task into two components. The first component, $i$, captures the ideal number of cycles required to execute the task assuming perfect caches. In other words, $i$ does not scale with frequency. The second component, $m$, counts the total number of instruction and data cache misses for the task. $m$ determines the part of the WCEC that scales with frequency and depends on the memory access latency. If a system without caches is considered, $i$ would count the total number of cycles used for non-memory operations while $m$ would count the total number of memory references. Thus, the WCEC is expressed as follows:
$$\mathrm{WCEC} = i + m \cdot N \qquad (1)$$
where $N$ is the number of cycles required to access the memory, which depends on the latency of the memory and the frequency of the processor. For a simple pipeline, the WCEC can easily be converted into the WCET by dividing by the frequency. This frequency model can accurately capture the actual WCET because it separates the WCEC into two components, one that scales with processor frequency and one that does not.
Figure 2. Actual vs. Assumed WCET for fft
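The parametric model can be illustrated with a small sketch. Here `i` stands for the ideal cycle count, `m` for the number of cache misses, and `n` for the memory access cycles of the model above. The function names, the task values, and the choice of rounding the access latency up to whole cycles are assumptions made for illustration only; the 100ns latency matches the value used later in the experiments.

```python
def memory_cycles(latency_ns: int, freq_mhz: int) -> int:
    """Whole cycles one memory access takes at the given frequency.

    Assumption: the constant latency is rounded up to full cycles.
    latency_ns * freq_mhz / 1000 is the latency expressed in cycles.
    """
    return -(-latency_ns * freq_mhz // 1000)  # integer ceiling division


def wcec(i: int, m: int, n: int) -> int:
    """Worst-case execution cycles under the parametric model: i + m * n."""
    return i + m * n


def wcet_us(i: int, m: int, latency_ns: int, freq_mhz: int) -> float:
    """WCET in microseconds: cycles divided by the frequency in MHz."""
    return wcec(i, m, memory_cycles(latency_ns, freq_mhz)) / freq_mhz


# Hypothetical task: 1,000,000 ideal cycles and 5,000 cache misses.
IDEAL_CYCLES, MISSES, LATENCY_NS = 1_000_000, 5_000, 100

for freq in (100, 500, 1000):
    n = memory_cycles(LATENCY_NS, freq)
    print(freq, n, wcec(IDEAL_CYCLES, MISSES, n),
          wcet_us(IDEAL_CYCLES, MISSES, LATENCY_NS, freq))
```

For this hypothetical task, the cycle count at 1GHz is 1,500,000. Scaling that constant count down to 100MHz would suggest 15,000 microseconds, whereas the parametric model yields 1,050,000 cycles, i.e., 10,500 microseconds.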
The following examples are presented to show that the parametric model can capture the effects of different sequences of instructions in a task. Only sequences that contain data or instruction cache misses are of concern since they are affected during frequency scaling. A sequence of instructions without any cache misses can be captured exclusively by the $i$ component and represents a trivial example of our parametric model. For the following examples, let $N = 10$, as shown in the figures below. We assume separate instruction and data caches and frequency scaling under our model with an arbitrary simple in-order pipeline.
Consider a sequence of four instructions, as shown in Figure 3. This instruction sequence is executed in a processor with a simple six-stage in-order pipeline. The pipeline stages are fetch (IF), decode (ID), issue (IS), execute (EX), memory access (MEM) and write-back (WB).
1. In Figure 4, we observe the effects of an instruction cache miss. Consider instruction B resulting in a miss. While instruction B misses in the instruction cache, all other cache accesses result in hits.
2. In Figure 5, we observe the effects of a data cache miss. Instruction B misses in the data cache while all other cache accesses are hits. With $i = 9$ and $m = 1$, the WCEC is again calculated as $9 + 1N$. Since the data miss stalls the instructions that follow it in the pipeline, one can separate the number of cycles required for the memory access. However, had instruction C or any other stalled instruction performed useful work instead of being stalled, the model could overestimate, e.g., for multi-cycle floating-point operations, branch mispredictions, etc. Any such overestimation results from the overlap of useful cycles with the memory stall. In our model, the $i$ component counts these useful cycles while the $m$ component counts the data miss. The overlap is not considered by the model. For example, if instruction C took an extra cycle to execute, the new WCEC would become $10 + 1N$. The model does not consider the overlap between the data miss and the extra cycle used by instruction C. A similar problem is also observed in example 1 if the instruction miss overlaps with a high execution latency instruction.
The potential for overestimations implies that the WCET obtained still provides an upper bound on the execution time, albeit not necessarily a tight one. But removing overestimations due to instructions with high execution latencies is non-trivial because instructions may have different execution latencies. Subsequent experiments show that these design choices have only a marginal effect on the tightness of WCET bounds.
3. In Figure 6, we observe the effects of simultaneous instruction and data cache misses. Instruction B results in a data cache miss while instruction C results in an in-
Timing Analysis
In this section, we describe conventional static timing analysis and briefly contrast the approach to dynamic timing analysis methods. We specify the novel enhancements necessitated by DVS to adapt conventional static timing analysis to a frequency-aware static timing analysis (FAST) tool.
Static Timing Analysis
Schedulability analysis for hard real-time systems requires that the worst-case execution time (WCET) be safely bounded in order to ensure feasibility of scheduling a task set for a given scheduling policy, such as rate-monotonic and earliest-deadline-first scheduling [15]. If the execution time of a task were obtained through dynamic timing analysis based on experimental or trace-driven approaches, these values would not provide a safe bound of the WCET [22]. On the one hand, it is difficult to determine the worst-case input set that would exhibit the WCET even for moderately complex tasks, and exhaustive testing over the entire input space is infeasible except for trivial cases. On the other hand, even if the worst-case input set were known, the interaction between the software and hardware might cause the task to exhibit its WCET for a different input set. The cause of this behavior is architectural complexity, such as complex pipelines and caching mechanisms.
Static timing analysis is a viable alternative to dynamic timing analysis, and while various static approaches have been studied, we will constrain ourselves to one such toolset without loss of generality [11, 17, 23] . The WCET bounds obtained by static timing analysis provide a guaranteed upper bound on the computation time of a task. Static timing analysis performs the equivalent of a traversal over all execution paths to determine timing information independent of a program trace and without tracking values or program variables. Loop bodies only require a few traversals to determine the worst-case behavior of the entire loop due to an efficient fixed-point approach. As the execution paths are traversed, the behavior of the architectural components along the execution paths is captured. The paths are composed to form loops, functions and ultimately the entire application to calculate both WCEC and WCET. Figure 7 depicts an overview of the organization of this timing analysis toolset. An optimizing compiler has been modified to produce control flow and branch constraint information as a side effect of the compilation of a source file. The original research compiler VPCC/VPO [3] was replaced by GCC with a Portable Instruction Set Architecture (PISA) backend that interfaces with SimpleScalar. Real-time applications are compiled into assembly code using the GCC PISA-compiler. The control-flow graph and instruction as well as data references are extracted from the assembly code. Upper bounds on the number of iterations performed by loops are provided, a prerequisite for performing static timing analysis. A static instruction cache simulator uses the control flow information to construct a control-flow graph of the program that consists of the call graph and the control flow of each function. The program's control-flow graph is then analyzed, and a caching categorization for each instruction and data reference in the program is produced. 
Separate categorizations are provided for each loop level in which the instructions and data references are contained. The categorizations for instruction references are described in Table 1 . Next, the timing analyzer uses the control flow and constraint information, caching categorizations, and machine dependent information (e.g., pipeline characteristics) to calculate bounds on the WCET.
The approach in this paper differs from our prior toolset as follows. Our tool separates static I-cache and D-cache (instruction/data cache) analysis. The D-cache analysis currently lacks sufficiently detailed information about references for the GCC compilation phase, and D-cache analysis does not fully match the SimpleScalar model. The focus of this paper is on enhancing the timing analyzer with respect to the FAST model and PISA instruction set. But since we use our SimpleScalar-based architectural simulation environment [20] to validate our approach, we have to make
The timing analyzer uses the control-flow information and loop bounds, caching categorizations, and pipeline description to derive WCET bounds. The pipeline simulator considers the effect of structural hazards (an instruction occupying the universal function unit for multiple cycles), data hazards (a load-dependent instruction stalls for at least one cycle if it immediately follows the load), branch prediction (backward-taken/forward-not-taken), and cache misses (derived from caching categorizations) for alternative execution paths through a loop body or a function. Static branch prediction is easily accommodated by worst-case analysis: the misprediction penalty is added to the non-predicted path (not-taken path for backward branches and taken path for forward branches). Path analysis (see below) selects the longest execution path as usual. Once timings for alternate paths in a loop are obtained, a fixed-point algorithm (quickly converging in practice) is employed to safely bound the time of the loop based on its body's cycle counts.
The fixed-point approach generally requires path analysis for only a few iterations. Given the longest path for the first iteration, the next-longest path is determined for the second iteration, which may differ from the original path due to caching effects. The lengths of these paths are monotonically decreasing due to cache effects, and once we reach a fixed point, subsequent loop iterations can be safely approximated by this fixed-point timing value. When the longest paths of consecutive iterations are combined, we account for the pipeline overlap between the tail of the earlier path and the head of the path that follows. The alternative (no overlap) is tantamount to draining the pipeline between iterations. Using this fixed-point approach, the timing analyzer ultimately derives WCET bounds, first for each path, then for loops, and finally for functions within the program. A timing analysis tree is constructed, where each node of the tree corresponds to a loop or function. Nodes in the tree are processed in a bottom-up manner. In other words, the WCET for an outer loop / caller is not calculated until the times for all of its inner loops / callees are known. This means that the timing analyzer predicts the WCET for programs by first analyzing the innermost loops and functions before proceeding to higher-level loops and functions, eventually reaching the tree's root (e.g., main()). For our purposes, the timing analysis tree provides a convenient method for obtaining the WCET for a specific scope, in particular for sub-tasks. From the description in this section, it becomes evident that static timing analysis is non-trivial, even for simple pipelines.
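The bottom-up processing of the timing analysis tree can be sketched as follows. This is a deliberately simplified illustration, not the analyzer itself: the `Node` type, its fields, and the cycle counts are hypothetical, and the sketch ignores the per-iteration cache effects, the fixed-point computation, and the pipeline overlap described above by charging every iteration the same cost.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One loop or function in a (hypothetical) timing analysis tree."""
    name: str
    own_cycles: int              # worst-case cycles of the node's own longest path
    iterations: int = 1          # loop bound; 1 for a function
    children: List["Node"] = field(default_factory=list)


def worst_case_cycles(node: Node) -> int:
    """Bound a node only after all inner loops and callees are bounded."""
    inner = sum(worst_case_cycles(child) for child in node.children)
    return node.iterations * (node.own_cycles + inner)


# Hypothetical program: main() contains an outer loop with an inner loop.
inner_loop = Node("inner_loop", own_cycles=5, iterations=10)
outer_loop = Node("outer_loop", own_cycles=3, iterations=4, children=[inner_loop])
main = Node("main", own_cycles=7, children=[outer_loop])

print(worst_case_cycles(main))
```

The recursion visits the innermost loop first, so the bound for each scope (for example, `outer_loop` as a sub-task) is available before its parent is evaluated.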
Frequency-Aware Static Timing Analysis
The static timing analysis tool calculates the WCEC for a particular task. However, static timing analysis has to be performed whenever the processor frequency is changed. Re-assessing the WCET bound is paramount to temporal safety: a change in the processor frequency changes the number of cycles required to access the memory because front-side bus frequencies do not scale at all (or at least not at the same rate). Due to the change in memory latency, the WCEC information for different paths changes, which may result in a different worst-case path than before. Our frequency model can be elegantly incorporated into static timing analysis such that it calculates the number of cycles for each possible worst-case path in the program. The following technical innovations to the static timing analysis framework support such flexible calculations.
Instead of using the memory access cycles to simulate the sequence of instructions in the pipeline, the ideal number of cycles is calculated assuming all cache accesses to be hits. The instruction and data cache misses are accumulated as a side-effect to compose a first-order polynomial equation describing the WCEC.
Static timing analysis requires different paths through the same node (loop or function) to be compared. The path with the worst WCEC is used as the WCEC for the node. After integrating the frequency model into the framework, one has to compare two equations to determine which one results in a larger number of execution cycles. The challenge arises when neither equation dominates: one of them may yield a greater WCEC for some range of frequencies while the other yields a greater WCEC for the rest of the frequency range. Recall that the frequency model is a first-order polynomial. Consider the case where two equations intersect, i.e., both polynomials have a common solution. We propose three approaches to address this problem.
1. One can maintain an ordered list of equations and the ranges where subsequent polynomials represent a larger WCEC than previous ones. Since the frequency model is a first-order polynomial with different slopes, there exists an intersection point constraining the range for each equation.
2. Alternatively, a curve-fitting equation could capture the effects of both equations. This obviates the need for maintaining large numbers of equations but increases the complexity of the parametric equation. A higher-order polynomial with strict upper bounds on each base polynomial would provide a relatively close fit. The resulting curve would not be as tight as in case (1) but may suffice if the slopes of the original polynomials do not diverge significantly. This would impose more overhead on dynamic scheduling schemes that have to perform additional arithmetic to evaluate the equation upon any scheduling action.
3. Another, easier solution is to declare a valid range of frequencies for the processor. If two equations intersect outside the given range, we simply have to choose the equation that provides the higher WCEC within the valid range. If two equations intersect within this specified range, we use a simple curve-fitting technique through a first-order polynomial that provides a WCEC greater or equal to the values of either of the original equations.
By using one of the above techniques, we ensure that a FAST equation obtained always provides an upper bound on the WCEC of the task, regardless of the chosen frequency. For our FAST framework, we have used the third, the easiest technique to bound FAST equations.
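The third technique can be sketched as follows, under stated assumptions. Equations are represented as hypothetical `(i, m)` pairs for $i + m \cdot N$, and the valid frequency range is expressed as a range of memory access cycles (for a 100ns latency, 100MHz to 1GHz corresponds to 10 to 100 cycles). The source does not specify which first-order curve fit is used when two equations cross inside the range; this sketch assumes the chord through the larger value at each end of the range, which bounds both lines because each of them is linear.

```python
from fractions import Fraction


def evaluate(eq, n):
    """Cycles of the equation (i, m) for n memory access cycles."""
    i, m = eq
    return i + m * n


def bound_pair(eq_a, eq_b, n_min, n_max):
    """Return a first-order equation >= both inputs on [n_min, n_max].

    Requires n_min < n_max.
    """
    a_lo, a_hi = evaluate(eq_a, n_min), evaluate(eq_a, n_max)
    b_lo, b_hi = evaluate(eq_b, n_min), evaluate(eq_b, n_max)
    # A line that is larger at both ends is larger on the whole range.
    if a_lo >= b_lo and a_hi >= b_hi:
        return eq_a
    if b_lo >= a_lo and b_hi >= a_hi:
        return eq_b
    # The lines cross inside the range: fit a chord through the larger
    # endpoint values (an assumed fitting choice, exact via Fraction).
    lo, hi = max(a_lo, b_lo), max(a_hi, b_hi)
    slope = Fraction(hi - lo, n_max - n_min)
    return (lo - slope * n_min, slope)


# Hypothetical paths through the same loop body.
path_a, path_b = (1000, 5), (1200, 2)
bound = bound_pair(path_a, path_b, 10, 100)
print(bound, evaluate(bound, 10), evaluate(bound, 100))
```

The resulting coefficients need not be integers, which is why exact rational arithmetic is used here; an implementation could instead round both coefficients up to keep the bound safe.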
FAST-DVS Schemes
Most DVS scheduling algorithms assume that the WCEC is constant with frequency when scaling the WCET. By not considering the effect of frequency modulation on the WCEC, DVS schemes assume a considerably overestimated WCET. Thus, DVS schemes fail to completely utilize available slack because the scaled WCET is not a tight bound. We have implemented our parametric frequency model as the FAST framework. Parametric equations obtained by FAST can be used in DVS scheduling schemes to ensure that the scaled WCET remains an accurate and tight bound of the execution time for a task. Thus, we can increase the efficiency of DVS schemes and further reduce the power consumption of the system. DVS schemes can execute a task set at a lower frequency provided that a schedulability test deems the task set feasible and tasks do not exceed their WCET. For DVS schemes based on earliest-deadline-first (EDF) scheduling, the schedulability test expressed in Equation 2 must be satisfied by the task set to ensure feasibility. Equation 2 represents the utilization of the system under frequency scaling. The scaling factor in Equation 4 results in a much lower frequency. The WCET used is not exaggerated, and slack is exploited efficiently.
In our implementation work, we integrated FAST equations into DVS-EDF scheduling as proposed by Pillai and Shin through (a) static voltage scaling, (b) cycle-conserving RT-DVS and (c) look-ahead RT-DVS [19] . With only minimal changes to the original algorithms, we integrated the FAST equations into the respective DVS schemes, thereby improving energy savings obtained.
FAST - Static Voltage Scaling
The static voltage scaling scheme introduced by Pillai and Shin [19] uses the modified EDF test shown in Equation 2 to calculate the scaling factor $\alpha$. This algorithm uses all static slack in the system. The processor frequency for the entire task set is set statically. Dynamic slack produced during runtime due to early completion of tasks is not considered for frequency scaling. The FAST equations for the WCET can be integrated into the static voltage scheme as shown in Figure 8.
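One way to picture static frequency selection with FAST equations is sketched below. It is an illustration under assumptions rather than the algorithm of Figure 8: tasks are hypothetical `(i, m, period_us)` tuples, the memory latency is rounded up to whole cycles, and the EDF test is taken in its usual form (total utilization at the candidate frequency must not exceed 1). The second function models the constant-WCEC assumption for comparison.

```python
def memory_cycles(latency_ns, freq_mhz):
    """Whole cycles per memory access (assumed ceiling of latency * frequency)."""
    return -(-latency_ns * freq_mhz // 1000)


def fast_utilization(tasks, freq_mhz, latency_ns=100):
    """EDF utilization when each WCET is re-evaluated at freq_mhz."""
    n = memory_cycles(latency_ns, freq_mhz)
    return sum((i + m * n) / freq_mhz / period_us for i, m, period_us in tasks)


def constant_utilization(tasks, freq_mhz, max_mhz, latency_ns=100):
    """EDF utilization when the cycle count at max_mhz is assumed constant."""
    n = memory_cycles(latency_ns, max_mhz)
    return sum((i + m * n) / freq_mhz / period_us for i, m, period_us in tasks)


def lowest_feasible(freqs_mhz, utilization):
    """Lowest frequency whose utilization passes the EDF test, else None."""
    for freq in sorted(freqs_mhz):
        if utilization(freq) <= 1.0:
            return freq
    return None


# Hypothetical task set: (ideal cycles, cache misses, period in microseconds).
tasks = [(200_000, 1_000, 2_000), (300_000, 2_000, 5_000)]
freqs = range(100, 1001, 100)

fast_choice = lowest_feasible(freqs, lambda f: fast_utilization(tasks, f))
const_choice = lowest_feasible(freqs, lambda f: constant_utilization(tasks, f, 1000))
print(fast_choice, const_choice)
```

For this hypothetical task set, the FAST-based test admits 200MHz, whereas the constant-WCEC assumption requires 300MHz.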
FAST - Cycle-Conserving RT-DVS
The cycle conserving RT-DVS by Pillai and Shin [19] calculates the utilization for a task set at every task release and task completion. Upon task release, the utilization is calculated based on the WCET. Upon task completion, the utilization is calculated by considering the actual execution time of the completed task instead of the WCET. This algorithm uses the static slack available in the system as well as the dynamic slack generated due to early task completions. Figure 9 shows the necessary modifications to the original algorithm to incorporate the FAST equations.
The FAST cycle-conserving DVS scheme outperforms the original scheme since it takes the actual execution times as well as the scaling levels of previous tasks into account. The scheme derives the current system utilization after task completion by considering the actual execution time. In FAST cycle-conserving RT-DVS, the total number of cycles and the total number of misses experienced by a task are determined during execution, e.g., by hardware counters, which have become quite common in modern architectures. The actual execution time is also converted into a FAST equation to consider its scaling with frequency. The system utilization and the scaling factor are calculated through Equations 3 and 4.
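The conversion of counter readings into a FAST equation might look like the following sketch. The function names and counter values are hypothetical, and the conversion assumes, in line with the parametric model, that every counted miss stalled the pipeline for the full memory access time at the frequency the task ran at.

```python
def memory_cycles(latency_ns, freq_mhz):
    """Whole cycles per memory access (assumed ceiling of latency * frequency)."""
    return -(-latency_ns * freq_mhz // 1000)


def observed_equation(total_cycles, total_misses, freq_mhz, latency_ns=100):
    """Split measured cycles into an (ideal cycles, misses) pair.

    Assumes the task ran entirely at freq_mhz and that each miss cost
    exactly memory_cycles(latency_ns, freq_mhz) cycles.
    """
    n = memory_cycles(latency_ns, freq_mhz)
    return (total_cycles - total_misses * n, total_misses)


def time_us(equation, freq_mhz, latency_ns=100):
    """Execution time in microseconds of an (i, m) equation at freq_mhz."""
    i, m = equation
    return (i + m * memory_cycles(latency_ns, freq_mhz)) / freq_mhz


# Hypothetical counter readings for a task that completed at 300MHz.
equation = observed_equation(total_cycles=130_000, total_misses=1_000, freq_mhz=300)
print(equation, time_us(equation, 100))
```

With these readings, the task would have taken 1,100 microseconds at 100MHz, whereas dividing the raw cycle count by the lower frequency would suggest 1,300 microseconds.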
FAST - Look-Ahead RT-DVS
The look-ahead RT-DVS scheme by Pillai and Shin [19] finds the minimum amount of work that may be performed between now and the next scheduling event without missing any deadlines. All work is deferred until the last possible moment, also referred to as last-chance scheduling [8]. As a side effect, the frequency may be increased as execution approaches a deadline. In practice, most tasks complete execution early, i.e., prior to their WCET. Hence, the frequency rarely has to be raised to complete by a deadline. This algorithm also uses all the static slack (idle) as well as most of the dynamic slack (see appendix). The FAST look-ahead scheme also takes advantage of FAST equations to lower the energy consumption of the algorithm. The terms $i_{left}$ and $m_{left}$ describe the computation left in the form of a FAST equation. Hardware counters are employed to track total cycles completed and total misses incurred while a task is executing. The $s$ component shown in Figure 4.3 cannot be converted into a FAST equation unless considerable changes are made to the algorithm. Doing so would make the algorithm more aggressive, leading to lower frequencies. To avoid excessive modifications, only the next scheduled task is expressed in the form of a FAST equation. The experiments show that the performance of the algorithm is improved even with minimal modifications to the algorithm.
Experimental Framework
The experimental framework is divided into two sections. The first section is devoted to comparing the WCEC calculated using FAST equations, obtained from the FAST framework, to the WCEC obtained from the traditional static timing analysis tool. The second section tests and compares FAST-DVS algorithms with the original DVS algorithms proposed by Pillai and Shin [19] .
Testing the FAST Framework
We re-designed our static timing analyzer [11] [1], is based on the portable ISA (PISA) used by the SimpleScalar tool set. All instruction execution latencies are based on the MIPS R10K latencies. Specifically, a constant memory latency of 100ns is used. We use an 8KB direct-mapped instruction cache and an 8KB direct-mapped data cache. For the instruction cache categorizations, the static cache simulator of our existing tool set is used. To obtain data cache categorizations distinguishing hits and misses, we use a scheme that assumes a constant number of data accesses as misses and the remaining references as cache hits. During pipeline simulation, a static branch prediction scheme using the Ball-Larus heuristic is modeled. Both the static timing analysis tool and the FAST tool model a simple in-order six-stage pipeline.
When incorporating the frequency model into the static timing analyzer, two paths with FAST equations that result in intersecting first-order polynomials may be encountered. In this case, we resort to the third method introduced in Section 3.2 to choose the equation resulting in the worst-case behavior. First, we try to determine if one equation is always greater than the other for the valid range of frequencies (100MHz-1GHz). Otherwise, we approximate the two equations by an equation providing a safe upper bound. This may result in slight overestimations but, overall, still provides a sufficiently tight bound on the WCEC, as will be seen. We also remove the branch misprediction penalty from the FAST equation if the branch misprediction overlaps with a data miss stall. The overestimation caused by instructions with execution latencies higher than one is not removed from the equation since removing it would yield only insignificant savings.
We studied six real-time benchmarks from the C-lab real-time benchmark suite [5], commonly utilized for WCET experiments. Three floating-point benchmarks (adpcm, lms and fft) as well as three integer benchmarks (cnt, srt and mm) are analyzed. These benchmarks were compiled by the PISA GCC compiler integrated with our SimpleScalar-based tool set. From the compilation of these benchmarks, the control-flow graphs and instruction layouts were obtained, which are taken as inputs to the FAST analyzer and the static cache analyzer. The FAST output is the WCEC in the form of a parametric equation conforming to our parametric frequency model. The same benchmarks were also analyzed with the original static timing analysis tool set for comparison. The original static timing analyzer must be run separately for each frequency under consideration to account for the changed memory latency at a given processor frequency. In contrast, the FAST framework captures the same effect in an equation (derived from a single analysis step).
Testing FAST-DVS Schemes
To test the FAST-DVS schemes, we implemented the algorithms in a scheduling simulator. Implementation features include generic static voltage scaling support and the following scheduling algorithms: base EDF, cycle-conserving RT-DVS, look-ahead RT-DVS, FAST static voltage scaling, FAST cycle-conserving RT-DVS and FAST look-ahead RT-DVS. All scheduling algorithms can choose a frequency between 100MHz and 1GHz for the next scheduled task. The base EDF algorithm runs all tasks at 1GHz. During idle times in the schedule, all algorithms switch the processor to 100MHz, the lowest available frequency, since it is not realistic to put a processor into sleep mode (with millisecond overheads) when tasks are released frequently (in the order of milliseconds). Combinations of task sets resulting from the application workloads of six real-time benchmarks, namely srt, fft, mm, lms, adpcm and cnt, were studied. The task sets were fed to the simulator, and energy consumption was calculated for all scheduling algorithms. The execution times were derived by running the benchmarks on a cycle-accurate pipeline model implemented in our SimpleScalar-based simulator [20]. By exploiting a cycle-accurate architectural simulator, we can obtain the total number of cache misses as well as the total number of cycles executed. The execution times obtained from the architectural simulator are scaled with frequency using the same assumption that underlies the FAST parametric model. Namely, we assume that the total number of execution cycles does not remain constant with frequency. The same execution time scaling method is used for all voltage scaling algorithms.
To evaluate the different FAST-DVS and DVS schemes, we formed several tasksets using the cnt, srt, mm, adpcm, fft and lms benchmarks. Three groups were formed as follows: G1: cnt, srt, mm (all integer); G2: adpcm, fft, lms (all floating point); and G3: cnt, mm, fft, lms (mixed). Periods were chosen for each benchmark, and from each group two tasksets were created: one with high utilization and one with low utilization. The high utilization tasksets have a utilization of approximately 0.9, while the low utilization tasksets have a utilization of approximately 0.5.
The frequency/voltage settings used for the scheduling simulator are loosely based on Intel Xscale, which is reported to have 5 settings ranging from 150 MHz / 0.76 V to 1 GHz / 1. 
Results for FAST Framework
The WCEC equations for the six benchmarks obtained from the static timing analysis tool and the FAST tool are compiled in Table 2 and in Figure 12. The FAST scheme differs from conventional static timing analysis without parametric expressions of frequencies by less than half a percent. Hence, we conclude that the FAST equations accurately model the WCEC obtained from the static analysis tool. Since the effects of scaling on WCEC are accurately modeled by the FAST equations, the scaling of the WCET can also be accurately captured. Table 2 shows the WCEC for all six benchmarks calculated for four different frequencies using the FAST equations and compared with the corresponding WCEC obtained from the static timing analysis tool. Figure 12 plots the ratio of the WCET for the FAST tool and the static timing analysis tool. As shown in Table 2 and Figure 12, for cnt, mm and srt the FAST bounds on WCET match the bounds obtained by the static timing analyzer exactly. For fft, adpcm and lms the FAST bounds on WCET are very close to the bounds obtained by the static timing analyzer. The overestimation in these benchmarks is due to the presence of floating point operations whose execution latencies overlap with memory stalls (see Section 2.2, Figure 5). Thus, the FAST tool can accurately model the WCEC of tasks with a negligible error (within 1%) by using our parametric frequency model.
For the integer taskset G1, savings are considerable (in excess of 50%) between the original scheme and the corresponding FAST scheme for the static and cycle-conserving approaches (Figures 11(a) and 11(b)). The look-ahead scheme shows no savings or only marginal savings under FAST for high and lower utilizations, respectively. This is caused by the fact that the FAST look-ahead scheme runs the taskset at a lower frequency and has to recover by raising the frequency more often than the original look-ahead scheme.
The results are also sensitive to the task set, as a comparison with the floating-point taskset G2 shows. Figures 11(c) and 11(d) indicate that G2 still experiences considerable savings for high utilizations, and slightly lower ones for lower utilizations, under the corresponding FAST scheme. In the case of G2, savings for the static and cycle-conserving schemes are even higher. The results for the integer/floating point mix of G3 in Figures 11(e) and 11(f) show savings at levels between the G1 and G2 tasksets for the static and cycle-conserving schemes. The look-ahead version of FAST results in less significant savings, mostly because the original look-ahead scheme already saves energy very aggressively.
Results for FAST-DVS Schemes
All results depend on the FAST equations for the benchmarks. The scalability of the WCET depends on the number of misses counted during timing analysis. Due to the worst-case analysis, the number of misses is usually highly exaggerated, especially for data caches. This means that the original schemes are penalized heavily by their assumptions about scaling the WCET. Using the FAST equations, the DVS schemes can improve the tightness of the WCET, which is already highly exaggerated, thereby reducing energy consumption.
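The penalty of the conventional scaling assumption can be illustrated for static voltage scaling under EDF. The sketch below is an assumption-laden illustration, not the simulator used in the experiments: it picks the lowest frequency at which the EDF utilization test (total utilization at most 1) holds, once with a WCET scaled linearly from 1GHz and once with a frequency-aware WCET of the assumed form "ideal cycles plus misses times 100ns". The frequency steps, the task parameters and all function names are hypothetical.

```python
MEM_LATENCY_NS = 100.0                 # constant memory latency from the text
FREQS_MHZ = list(range(100, 1001, 100))  # hypothetical discrete frequency steps

# Hypothetical tasks: (ideal_cycles, worst_case_misses, period_us)
TASKS = [(400_000, 4_000, 2_000.0), (600_000, 6_000, 5_000.0)]


def fast_wcet_us(task, f_mhz):
    """Frequency-aware WCET: a miss costs fewer cycles at lower frequencies."""
    ideal, misses, _ = task
    return (ideal + misses * MEM_LATENCY_NS * f_mhz / 1000.0) / f_mhz


def naive_wcet_us(task, f_mhz, f_max=1000):
    """Conventional assumption: the cycle count at f_max stays constant."""
    return fast_wcet_us(task, f_max) * f_max / f_mhz


def lowest_feasible_frequency(tasks, wcet_fn):
    """Lowest frequency passing the EDF utilization test, or None."""
    for f in sorted(FREQS_MHZ):
        utilization = sum(wcet_fn(task, f) / task[2] for task in tasks)
        if utilization <= 1.0:
            return f
    return None


naive_choice = lowest_feasible_frequency(TASKS, naive_wcet_us)
fast_choice = lowest_feasible_frequency(TASKS, fast_wcet_us)
```

For this hypothetical task set, half of the worst-case cycles at 1GHz are memory stalls, so the frequency-aware test admits a lower frequency than the linearly scaled one. The more misses the worst-case analysis counts, the larger this gap becomes.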
Overall, the RT-DVS schemes combined with FAST equations are greedier and result in lower frequencies. The relative energy benefits are highest for the static RT-DVS scheme because it has the most scope for improvement. The cycle-conserving and look-ahead RT-DVS schemes are dynamic schemes and already scale the frequency aggressively. Adding the FAST equations to these schemes enables them to scale the frequency even more aggressively, resulting in lower energy consumption. Hence, benefits from FAST are observed in all cases.
Related Work
Recently, a number of research groups have addressed various issues in the area of predicting the worst-case execution time (WCET) of real-time programs. Conventional methods for static analysis have been extended from unoptimized programs on simple CISC processors to optimized programs on pipelined RISC processors, and from uncached architectures to instruction and data caches [18, 14, 12, 17, 23, 13] . All these methods obtain discrete values to bound the WCET in a non-parametric fashion.
Vivancos et al. describe techniques for addressing static timing analysis for variable loop bounds [21] . The so-called parametric timing analysis allows dynamic schedulers to reassess the WCET based on dynamically determined loop bounds during program execution. Chapman et al. [7] used path expressions to combine a source-oriented parametric approach of WCET analysis with timing annotations, verifying the latter through the former. Bernat and Burns also proposed using algebraic expressions to represent the WCET of subprograms, where the algebraic expression is parameterized by some of the subprogram's parameters [4] . These approaches differ in that they address fundamental problems in static timing analysis. Our FAST approach, in contrast, aims at isolating execution effects as a function of the processor frequency, a unique, unprecedented approach complementing existing work on static timing analysis.
Conclusion
In this work, novel techniques for tight and flexible static timing analysis were developed that are most suitable for, but not restricted to, dynamic scheduling schemes. The essence of our approach lies in providing frequency-aware bounds on the WCET through static timing analysis. Using a frequency-sensitive parametric model, we can capture the effect of combined DFS/DVS on the WCEC and, thus, accurately model the WCET over any frequency range. These techniques are implemented in a frequency-aware static timing analysis (FAST) tool leveraging prior expertise on static timing analysis. Experiments show the capability of FAST to derive safe upper bounds on the WCET that are almost as tight (within 1%) as those of conventional, non-parametric timing analysis. FAST equations can also be used to improve existing DVS scheduling schemes by ensuring that the effect of frequency scaling on the WCET is considered and that the WCET used is not exaggerated. This is demonstrated by incorporating FAST into three DVS scheduling schemes. Results indicate significant energy savings over the base DVS schedulers due to FAST. To the best of our knowledge, this study of DVS effects on timing analysis is unprecedented.
Modified Look-ahead DVS-EDF
A number of DVS schemes were proposed by Pillai and Shin for scheduling hard real-time systems [19]. A simple, static scaling version uniformly scales the frequency for all tasks based on utilization tests for schedulability, both for rate-monotonic and EDF scheduling. Cycle-conserving EDF temporarily lowers the utilization of a task upon its completion to the proportion given by its actual execution time. Look-ahead EDF is an extension to these schemes that capitalizes on early task completion by deferring work for future tasks in favor of scaling the current task. Scaling of the current task occurs based on a modified utilization test that benefits from both idle slots and early task completion. At any completion (both early and on time), the utilization is effectively reduced for the completing task (up until its next release time).
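The utilization bookkeeping of cycle-conserving EDF summarized above can be sketched as follows. This is a simplified illustration of the original scheme with the conventional scaling assumption (execution times expressed at the maximum frequency), not the FAST variant and not the simulator code. The class name, method names, task parameters and frequency steps are hypothetical.

```python
class CycleConservingEDF:
    """Tracks per-task utilization and picks the lowest safe frequency."""

    def __init__(self, tasks, freqs_mhz):
        # tasks: name -> (wcet_at_fmax, period), both in the same time unit
        self.tasks = tasks
        self.freqs = sorted(freqs_mhz)
        self.util = {name: c / p for name, (c, p) in tasks.items()}

    def select_frequency(self):
        """Lowest frequency f with total utilization <= f / f_max."""
        total = sum(self.util.values())
        f_max = self.freqs[-1]
        for f in self.freqs:
            if total <= f / f_max:
                return f
        return f_max

    def on_release(self, name):
        """At release, budget the full worst case again."""
        wcet, period = self.tasks[name]
        self.util[name] = wcet / period
        return self.select_frequency()

    def on_completion(self, name, actual_time_at_fmax):
        """At completion, charge only the time actually used until next release."""
        _, period = self.tasks[name]
        self.util[name] = actual_time_at_fmax / period
        return self.select_frequency()


# Hypothetical task set and frequency steps.
sched = CycleConservingEDF(
    {"T1": (3.0, 8.0), "T2": (3.0, 10.0), "T3": (1.0, 14.0)},
    [500, 750, 1000],
)
f_start = sched.select_frequency()
f_after_t1 = sched.on_completion("T1", 2.0)
f_after_t2 = sched.on_completion("T2", 1.0)
f_rerelease = sched.on_release("T1")
```

Early completions lower the summed utilization and hence the selected frequency, and the next release of the task restores its worst-case share.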
Specifically, upon task completion, Ð Ø ½ ¼ according to Cycle-Conserving EDF and Look-ahead EDF, respectively. The defer calculations of Look-ahead EDF then reassess the utilization based on future and past deadlines for released and completed tasks, respectively.
We modified Look-ahead EDF by setting Ð Ø at task completion instead of assigning a zero value. In addition, we reassess the utilization strictly based on the next deadline in the future, regardless of whether tasks are already released or not. This allows us to look ahead even further in the schedule and, thereby, potentially save additional energy by lowering frequencies more aggressively, while retaining the safety of the schedule by adhering to the EDF utilization test. If the WCET is not fully utilized, then other tasks may still benefit from early completion up to the threshold given by the idle times left in the schedule. This modified Look-ahead EDF scheme was implemented in our comparison and is shown to result in up to 34% lower energy consumption than the original scheme. On average, the modified scheme saves an additional 5-11% of energy for utilizations between 25% and 100%. At high utilizations, our modification occasionally requires 0.5-8% more energy, which is due to the original scheme considering an actual time of 0 up to the next release of a task. Hence, it would be possible to switch between the two schemes, using a utilization threshold as a trigger. Additional savings over the modified scheme due to early completion can only be obtained by considering the density of a schedule at some instant in time, such as given by the maximal schedule utilized in our feedback EDF scheme.
