Abstract
Introduction
Generating acceptable code for applications on embedded systems is challenging. Unlike most general-purpose applications, embedded applications often have to meet various stringent constraints, such as time, space, and power. Constraints on time are commonly formulated as worst-case (WC) constraints. If these constraints are not met, even only occasionally in a hard real-time system, then the system may not be considered functional. The worst-case execution time (WCET) must be calculated to determine if a timing constraint will be met.
Unfortunately, many embedded system developers empirically estimate the WCET by testing the application and measuring the execution time. Testing alone is unsafe since the WC input data is often difficult to derive, so WC timing constraints may not be met when the application is deployed. More knowledgeable developers will test, measure, and make conservative assumptions in case the timing measurements do not truly reflect the WCET, which can result in loose estimates and higher overall costs. Thus, accurate WCET predictions are required to produce safe and cost-effective embedded systems. Accurate WCET predictions can only be obtained by a tool that statically analyzes an application to calculate the WCET. Such a tool is called a timing analyzer, and the process of performing this calculation is called timing analysis.
WCET constraints can impact power consumption as well. In order to conserve power, one can determine the WC number of cycles required for a task and lower the clock rate to still meet the timing constraint with less slack. In contrast, conservative assumptions concerning WCET may result in a processor being deployed that has a higher clock rate and consumes more power.
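As a rough illustration of this trade-off, the lowest clock rate that still meets a deadline follows directly from the WC cycle count. The sketch below assumes that the cycle count does not change with the clock rate; the function name and the example numbers are hypothetical and are not taken from our experiments.

```python
def min_clock_hz(wcet_cycles: int, deadline_s: float) -> float:
    """Lowest clock rate (Hz) at which wcet_cycles complete within deadline_s."""
    if wcet_cycles < 0 or deadline_s <= 0:
        raise ValueError("cycles must be non-negative and deadline positive")
    return wcet_cycles / deadline_s


# Hypothetical task with a 1 ms deadline: a tight WCET bound of 200,000
# cycles versus a conservative estimate of 300,000 cycles.
tight = min_clock_hz(200_000, 0.001)
loose = min_clock_hz(300_000, 0.001)
```

Under these assumed numbers the conservative estimate demands a clock rate that is 1.5 times higher than the tight bound requires.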
Automatically generating acceptable code for embedded microprocessors with a compiler is often much more difficult than generating code for general-purpose processors. Besides sometimes having to meet a variety of conflicting constraints, embedded microprocessors are typically much less regular and have many specialized architectural features. Because of the large volumes typically produced for a product involving an embedded computer system, many embedded systems applications are still being developed in assembly language by hand in order to meet the imposed constraints and to deal with the difficulty of exploiting the features of the machine. In fact, two of the authors of this paper have recently spent time in industry and have personally witnessed the development and maintenance of assembly code applications. However, developing an application in assembly has many disadvantages, including higher development and maintenance costs and less portable code.
It would be desirable to develop embedded system applications in a high-level language and still be able to tune the WCET of an application. We have provided this capability by integrating a WCET timing analyzer with an interactive compilation system called VISTA (Vpo Interactive System for Tuning Applications) [1, 2]. One feature of VISTA is that it can automatically obtain performance feedback information, which can be used by both the application developer and the compiler to make phase ordering decisions. This information can include a variety of measures, such as execution time or code size. In this paper we describe how we modified VISTA so it can use WCET as one of its performance criteria.
The remainder of the paper is structured as follows. First, we review related work on improving, displaying, and estimating WCET. Second, we give a brief overview of the timing analyzer that we used in this work. Third, we summarize the StarCore SC100 and how we retargeted our compiler and timing analyzer for this processor. Fourth, we describe VISTA and how we integrated the timing analyzer with this framework. Fifth, we show the benefits that were achieved by performing searches using a genetic algorithm to improve the WCET. Finally, we discuss future plans for developing compiler optimizations to improve WCET and give the conclusions of the paper.
Related Work
A variety of techniques have been used for timing analysis of optimized code over the years [3, 4, 5, 6, 7]. However, we are unaware of any timing analyzer whose predictions are used by a compiler to select which optimizations should be applied.
While there has been much work on developing compiler optimizations to reduce execution time and, to a lesser extent, compiler optimizations to reduce space and power consumption, there has been very little work where compiler optimizations have been developed to improve WC performance. Marlowe and Masticola outlined how a variety of standard compiler optimizations could potentially affect timing constraints of critical portions in a task. However, no implementation was described [8]. Hong and Gerber developed a programming language with timing constructs and used a trace scheduling approach to improve code in what would be deemed a critical section of the program. However, no empirical results were given since the implementation did not interface with a timing analyzer to serve as a guide for the optimizations or to evaluate the impact on reducing WCET [9]. Both of these papers outlined strategies that attempt to move code outside of critical portions within an application that have been designated by a user to contain timing constraints. In contrast, most real-time systems use the WCET of entire tasks to determine if a schedule can be met. Lee et al. used WCET information to choose how to generate code on a dual instruction set processor for the ARM and the Thumb [10]. ARM code is generated for a selected subset of basic blocks that can impact the WCET. Thumb code is generated for the remaining blocks to minimize code size. In contrast, we are using WCET information to select compiler optimizations, as opposed to which instruction set to select for code generation.
A user interface was developed at Florida State University that allows users to select portions of source code and obtain timing predictions. Unlike VISTA, this interface did not allow the user to affect the generated code or provide feedback during the compilation process [11, 12, 13].
Genetic algorithms have long been used to search for solutions in a space that is too large to exhaustively evaluate. Genetic algorithms have been used to search for effective optimization sequences to improve speed, space, or a combination of both [14, 2]. Genetic algorithms have also been used in the context of timing analysis for empirically estimating the WCET, where mutations on the input resulted in different execution times (objective function) [15, 16, 17]. Our approach, in contrast, relies on a genetic algorithm to identify optimization phase sequences that result in reduced WCET, which is an orthogonal problem.
The Timing Analyzer
In this section we briefly describe the timing analyzer that we have previously developed and that served as the starting point for the timing analyzer in this study. Figure 1 depicts the organization of the framework that was used by the authors in the past to make WCET predictions. The VPO (Very Portable Optimizer) compiler [18] was modified to produce the control-flow and constraint information as a side effect of the compilation of a source file. A static cache simulator uses the control-flow information to give a caching categorization for each instruction and data memory reference in the program. The timing analyzer uses the control-flow and constraint information, caching categorizations, and machine-dependent information (e.g., pipeline characteristics) to make the timing predictions. Besides addressing architectural features, such as caching [19, 3, 20, 21, 22, 23] and pipelining [24, 3], the timing analyzer also automatically detects control-flow constraints. One type of constraint is the maximum number of iterations associated with each loop, including nonrectangular loop nests [25, 26, 27]. Another constraint type is when a branch will be taken or fall through. The timing analyzer uses these constraints to detect infeasible paths through the code or how often a given path can be executed [28, 4].
Porting to the SC100
In order to determine the effectiveness of improving the WCET for applications on an embedded processor, we ported both the VPO compiler and the timing analyzer to the StarCore SC100 processor [29]. In the past we had made WCET predictions for the MicroSPARC I, which is a general-purpose processor [30]. We were able to produce very tight WCET predictions with respect to a MicroSPARC I simulator that we had developed. Unfortunately, it is very difficult to produce cycle-accurate simulations for a general-purpose processor due to the complexity of its memory hierarchy and its interaction with an operating system that can cause execution times to vary. Unlike the MicroSPARC I, the SC100 has neither a memory hierarchy (no caches or virtual memory system) nor an OS [29]. In addition, we were able to obtain a simulator for the SC100 from StarCore [31]. Many embedded processor simulators, in contrast to general-purpose processor simulators, can very closely estimate the actual number of cycles required for an application's execution.
Some of the general features of the SC100 are as follows. The SC100 has no architectural support for floating-point operations since it is a digital signal processor and was designed instead for fixed-point arithmetic. It has 16 data registers and 16 address registers. The size of instructions can vary from one word (two bytes) to five words (ten bytes) depending upon the type of instruction, addressing modes used, and register numbers that are referenced. The SC100 has a simple five-stage pipeline, where most instructions can execute in a single stage. There are no pipeline interlocks. It is the compiler's responsibility to insert noop instructions to delay a subsequent instruction that uses the result of a preceding instruction when the result will not be available in the pipeline. Transfers of control (taken branches, unconditional jumps, calls, returns) result in a one- to three-cycle penalty depending on the addressing mode used and whether the transfer of control uses a delay slot.
We made several modifications to support timing analysis of applications compiled for the SC100. First, we modified the machine-dependent information (see Figure 1) to indicate how instructions proceed through the SC100 pipeline. We had to identify the instructions that require extra cycles in the pipeline. For instance, if a memory addressing mode on the SC100 performs an arithmetic calculation, then one additional cycle is required. Second, we updated the timing analyzer to treat all cache accesses as hits since instructions and data on the SC100 can in general be accessed in a single cycle from ROM and RAM, respectively. Thus, the static cache simulation step shown in Figure 1 is now bypassed for the SC100. Third, we had to modify the timing analyzer to address the penalty for transfers of control. When calculating the WCET of a path, we had to determine if each conditional branch in the path was taken or fell through since untaken branches are not assessed this penalty. In addition, we had to determine the size of each instruction and its alignment in memory. SC100 instructions are grouped into fetch sets, which are four words (eight bytes) in size. Transferring control to an instruction in a new fetch set that spans more than one fetch set results in an additional cycle delay.
We have found that transfer of control penalties can lead to nonintuitive WCET results. For instance, consider the flow graph in Figure 2. A superficial inspection would lead one to believe that the path 1→2→3 is the WCET path through the graph. However, if the taken branch penalty in the path 1→3 outweighs the cost of executing the instructions in block 2, then 1→3 would be the WCET path. This simple example illustrates the importance of using a timing analyzer to calculate the WCET. Simply measuring the execution time is not safe since it is very difficult to manually determine the WC paths and the input data that will cause the execution of these paths.
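The alignment check behind the fetch set penalty can be sketched in a few lines. This is only an illustration under stated assumptions: it takes the extra cycle to apply whenever the target instruction crosses an eight-byte fetch set boundary, and the function names and the caller-supplied base penalty are hypothetical rather than part of our timing analyzer.

```python
FETCH_SET_BYTES = 8  # four two-byte words, as stated in the text


def spans_fetch_sets(address: int, size_bytes: int) -> bool:
    """True if an instruction occupies more than one fetch set."""
    first = address // FETCH_SET_BYTES
    last = (address + size_bytes - 1) // FETCH_SET_BYTES
    return first != last


def transfer_penalty(base_penalty: int, target_addr: int, target_size: int) -> int:
    """Base transfer-of-control penalty plus one cycle for a spanning target."""
    extra = 1 if spans_fetch_sets(target_addr, target_size) else 0
    return base_penalty + extra
```

For example, a four-byte instruction at byte address 6 occupies bytes 6 through 9 and therefore crosses the boundary at address 8, whereas a two-byte instruction at the same address does not.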
Figure 2: Example Control-Flow Graph
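The effect can be reproduced with a small calculation. The block costs and the taken-branch penalty below are hypothetical values chosen only to show that the shorter path can be the WC path; they are not measurements from the SC100.

```python
def path_cycles(block_cycles, path, taken_edges, taken_penalty):
    """Sum block costs along a path, adding a penalty for each taken branch."""
    total = sum(block_cycles[b] for b in path)
    for src, dst in zip(path, path[1:]):
        if (src, dst) in taken_edges:
            total += taken_penalty
    return total


# Hypothetical costs: block 2 is cheap, and the edge 1->3 is a taken branch.
block_cycles = {1: 4, 2: 2, 3: 5}
taken_edges = {(1, 3)}
fall_through = path_cycles(block_cycles, [1, 2, 3], taken_edges, taken_penalty=3)
branch_taken = path_cycles(block_cycles, [1, 3], taken_edges, taken_penalty=3)
```

With these assumed numbers the path 1→2→3 costs 11 cycles while the path 1→3 costs 12 cycles, so the path that skips block 2 is the WCET path.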
Measurements indicating the accuracy of the WCET predictions produced by our timing analyzer will be shown later in the paper. In general, we could produce fairly accurate WCET predictions since some of the more problematic issues, which include memory hierarchies and operating systems, are not present on this processor.
Integrating with VISTA
This section provides a brief overview of the VISTA framework used for tuning the WCET of applications. We also describe the modifications that were required to integrate our timing analyzer with VISTA so that the current WCET can be presented to the user and can be used by the compiler when tuning an application.
The flow of information is depicted in Figure 3, which includes the VPO compiler, a viewer, and the timing analyzer described in Section 3. The programmer initially indicates a source file to be compiled and then specifies requests through the viewer, which include the order and scope of the optimization phases to be applied. After applying each optimization phase, the compiler sends information about the current instructions, control flow, and constraint information to the timing analyzer and the timing analyzer sends its WCET predictions back to the compiler. The user is presented with the state of the requested performance criteria, which for this version of VISTA includes the WCET and the code size. In previous versions of VISTA, the compiler obtained dynamic measurements after applying each optimization phase by instrumenting the code, producing the assembly code, linking and executing the program, and getting performance measures from the execution [2]. Since we used representative input data to obtain this dynamic measure, we were in effect obtaining average case execution time (ACET) information.
Figure 4 shows a snapshot of the viewer when tuning an application for the SC100. The right side of the window displays the state of the current function as a control-flow graph with RTLs representing instructions. The user also has the option to display the instructions in assembly. The left side shows the history of the different optimization phases that have been performed in the session. Note that not only is the number of transformations associated with each optimization phase depicted, but also the improvements in WCET and code size are shown. Thus, a user can easily gauge the progress that has been made at tuning the current function.
Besides applying predefined compiler optimization phases, an application developer can also specify transformations manually by inserting, modifying, and deleting instructions. Upon request the system also answers queries, such as which registers are live at a specific point in the program representation. This information can assist the developer to make safe and effective manual transformations. The ability to specify transformations manually is useful for exploiting special-purpose hardware that currently cannot be automatically exploited by the compiler.
The user also has the ability to reverse previously applied transformations, which supports experimentation when tuning an application. This is accomplished by writing the sequence of applied transformations to a file. Afterwards, VISTA reads in the intermediate code generated by the front end and applies the list of transformations to generate the program representation that was previously produced in the compilation. The sequence of applied transformations is also written to a file when the user chooses to complete the tuning of a function or terminate the session. This file is automatically read when VISTA is later invoked so these transformations can be reapplied at a later time, enabling future updates.
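The replay mechanism can be sketched as follows. This is a minimal illustration, not VISTA's implementation: it keeps the log in memory rather than in a file, models the program as a list of instruction strings, and uses a hypothetical `Session` class in which each transformation is a function from one program to another.

```python
from typing import Callable, List

Transformation = Callable[[List[str]], List[str]]


class Session:
    """Keeps the front-end output and a log of applied transformations."""

    def __init__(self, initial_code: List[str]):
        self.initial_code = list(initial_code)
        self.log: List[Transformation] = []

    def apply(self, t: Transformation) -> None:
        """Record a transformation; the program is rebuilt on demand."""
        self.log.append(t)

    def undo(self, count: int = 1) -> None:
        """Drop the last `count` transformations from the log."""
        if count > 0:
            del self.log[-count:]

    def current_code(self) -> List[str]:
        """Rebuild the program by replaying the log on the initial code."""
        code = list(self.initial_code)
        for t in self.log:
            code = t(code)
        return code
```

Because the current program is always derived from the initial code and the log, reversing a transformation only requires shortening the log, and no individual transformation needs an inverse.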
There are some initial actions that are performed by VISTA before a function can have its WCET tuned. The information to be sent to the timing analyzer from the compiler includes the number of loop iterations. The compiler detects this information as a side effect of performing a number of optimizations. Thus, VISTA was modified so that, when a function is compiled for the first time, it automatically performs a set of optimizations that allow the compiler to calculate this information. The code-improving transformations are then automatically reversed. In a second pass, the compiler performs the compulsory phases, which include register assignment (assigning pseudo registers to hardware registers) and fix entry/exit (inserting instructions to manage the run-time stack). The compiler emits the information and the timing analyzer is invoked to obtain the baseline WCET for each function. At this point VISTA can be used to tune the WCET for each function within the application.
VISTA also allows a user to specify a set of distinct optimization phases and have the compiler attempt to find the best sequence for applying these phases. Figure 5 shows the different options that VISTA provides the user to control the search. The user specifies the sequence length, which is the total number of phases applied in each sequence. We performed a set of experiments described in the next section that use the biased sampling search, which applies a genetic algorithm in an attempt to find the most effective sequence within a limited amount of time since in many cases the search space is too large to exhaustively evaluate [32]. The genetic algorithm treats each optimization phase as a gene and each sequence of phases as a chromosome. A population is the set of solutions (sequences) that are under consideration. The number of generations indicates how many sets of populations are to be evaluated. Together, the population size and the number of generations limit the total number of sequences evaluated. VISTA also allows the user to choose WCET and code size weight factors, where the relative improvement of each is used to determine the overall fitness.
Figure 5: Selecting Options to Search for Sequences
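One plausible way to combine the two weight factors into a single fitness value is sketched below. The exact formula that VISTA uses is not given here, so this weighted sum of ratios to a baseline is an assumption, as are the function name and parameter names.

```python
def fitness(wcet, size, base_wcet, base_size, wcet_weight=0.5, size_weight=0.5):
    """Weighted cost relative to a baseline; lower values are better."""
    if base_wcet <= 0 or base_size <= 0:
        raise ValueError("baseline values must be positive")
    return wcet_weight * (wcet / base_wcet) + size_weight * (size / base_size)
```

Setting `wcet_weight` to 1 and `size_weight` to 0 corresponds to optimizing for WCET only, and the reverse corresponds to optimizing for space only.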
Performing these searches can be time consuming since thousands of potential optimization sequences may need to be evaluated. Thus, VISTA provides a window showing the current status of the search. Figure 6 shows a snapshot of the status of the search that was selected in Figure 5. The percentage of sequences completed along with the best sequence and its effect on performance are displayed. The user can terminate the search at any point and accept the best sequence found so far.
Experiments
This section describes the results of a set of experiments to illustrate the effectiveness of improving the WCET by using VISTA's biased sampling search, which uses a genetic algorithm to find efficient sequences of optimization phases. Table 1 shows the benchmarks and applications we used for our experiments. These include a subset of the DSPstone fixed-point kernel benchmarks (footnote 1) and other DSP benchmarks or programs that we have used [33]. In contrast, all of the results in this section are from code that was automatically generated by VISTA.
Footnote 1: The only DSPstone fixed-point kernel benchmarks we did not include were those that could not be automatically processed by our timing analyzer. In particular, the number of iterations for loops in some benchmarks could not be statically determined by our compiler. While our framework allows a user to interactively supply this information, we excluded such programs to facilitate automating the experiments.
Note that the DSPstone fixed-point kernel benchmarks are small and do not have conditional constructs, such as if statements. The other benchmarks shown in Table 1 were selected since they do have conditional constructs, which means the WCET and ACET input data may not be the same.
Tuning for ACET or WCET may result in similar code, particularly when there are few paths through a program. However, tuning for WCET can be performed faster since the timing analyzer is used to evaluate each sequence. The analysis time required for our timing analyzer is proportional to the number of unique paths at each loop and function level in the program. In contrast, tuning for ACET typically takes much longer since the simulation time of the SC100 simulator is proportional to the number of instructions executed. We found that the average time required to tune the WCET of each function in our experiments was about 25 minutes and this would have taken several hours if we had used simulation. Table 2 shows each of the candidate code-improving phases that we used in the experiments when tuning each function with the genetic algorithm. In addition, register assignment, which is a compulsory phase that assigns pseudo registers to hardware registers, has to be performed. VISTA implicitly performs register assignment before the first code-improving phase in a sequence that requires it. After applying the last code-improving phase in a sequence, we perform another compulsory phase, fix entry/exit, which inserts instructions at the entry and exit of the function to manage the activation record on the run-time stack. Finally, we also perform additional code-improving phases after the sequence, such as instruction scheduling. For the SC100 another compulsory phase is required to insert noops when pipeline constraints need to be addressed.
Our genetic algorithm searches were accomplished in the following manner. We set the sequence (chromosome) length to be 1.25 times the number of phases that successfully applied one or more transformations by the batch compiler for the function. We felt this was a reasonable limit that gives us an opportunity to successfully apply more phases than the batch compiler could accomplish. Note that this length is much less than the number of phases attempted during the batch compilation. We set the population size (fixed number of sequences or chromosomes) to twenty, and each of these initial sequences is randomly initialized with candidate optimization phases. We performed 200 generations when searching for the best sequence for each function. We sort the sequences in the population by a fitness value based on the WCET produced by the timing analyzer and/or code size. At each generation (time step) we remove the worst sequence and three others chosen at random from the lower (poorer performing) half of the population. Each of the removed sequences is replaced by randomly selecting a pair of the remaining sequences from the upper half of the population and performing a crossover operation on them. Mutation of each optimization phase in the sequences occurs with a probability of 10% and 5% for the lower and upper halves of the population, respectively. When an optimization phase is mutated, it is randomly replaced with another phase. The four sequences subjected to crossover and the best performing sequence are not mutated. Finally, if we find identical sequences in the same population, then we replace the redundant sequences with ones that are randomly generated. The characteristics of this genetic algorithm search are very similar to those used in past studies [14, 2], except the objective function is now minimizing the WCET.
Table 2: Candidate Optimization Phases
Optimization Phase: Description
branch chaining: Replaces a branch or jump target with the target of the last jump in a jump chain.
common subexpr elim: Eliminates fully redundant calculations, which also includes constant and copy propagation.
remove unreachable code: Removes basic blocks that cannot be reached from the entry block of the function.
remove useless blocks: Removes empty blocks from the control-flow graph.
dead assignment elim: Removes assignments when the assigned value is never used.
block reordering: Removes a jump by reordering basic blocks when the target of the jump has only a single predecessor.
minimize loop jumps: Removes a jump associated with a loop by duplicating a portion of the loop.
register allocation: Replaces references to a variable within a specific live range with a register.
loop transformations: Performs loop-invariant code motion, recurrence elimination, loop strength reduction, and induction variable elimination on each loop ordered by loop nesting level. Each of these transformations can also be individually selected by the user.
merge basic blocks: Merges two consecutive basic blocks a and b when a is only followed by b and b is only preceded by a.
evaluation order determination: Reorders RTLs in an attempt to use fewer registers.
strength reduction: Replaces an expensive instruction with one or more cheaper ones.
reverse jumps: Eliminates an unconditional jump by reversing a conditional branch when it branches over the jump.
instruction selection: Combines instructions together and performs constant folding when the combined effect is a legal instruction.
remove useless jumps: Removes jumps and branches whose target is the following block.
Table 3 shows the WCET prediction results for the benchmarks in Table 1. The batch sequence results are those that are obtained from the sequence of applied phases when we use VPO's default batch optimizer. The batch compiler iteratively applies optimization phases until there are no additional improvements. Thus, the batch compiler provides a much more aggressive baseline than a compiler that always uses a fixed length of phases [2]. The observed cycles were obtained from running the compiled programs through the SC100 simulator. All input and output were accomplished by reading from and writing to global variables to avoid having to estimate the WCET of performing actual I/O. The WCET cycles are the WCET predictions obtained from our timing analyzer. The ratios show that these predictions are reasonably close to the actual WCET. The ratios for the best sequence from the GA results in Table 3 are analogous, but the code being measured was produced by the best sequence found by the genetic algorithm. The WCET GA to WCET batch ratio is the ratio of WCET cycles after applying the genetic algorithm to the WCET cycles from the code produced by the batch sequence of optimization phases.
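The genetic algorithm search described earlier in this section can be sketched as follows. This is an illustrative sketch rather than VISTA's code: the crossover operator (swapping the second halves of two parents), the result cache, and all function and parameter names are assumptions, and `evaluate` stands in for invoking the timing analyzer on the code produced by a phase sequence.

```python
import random


def crossover(a, b):
    """Swap the second halves of two parents (an assumed crossover operator)."""
    mid = len(a) // 2
    return a[:mid] + b[mid:], b[:mid] + a[mid:]


def search(phases, length, evaluate, pop_size=20, generations=200, seed=0):
    """Return (best_cost, best_sequence); evaluate maps a sequence to a cost."""
    if pop_size < 8 or len(phases) ** length < pop_size:
        raise ValueError("population too small or search space too small")
    rng = random.Random(seed)
    cache = {}

    def cost(seq):
        key = tuple(seq)
        if key not in cache:          # avoid re-evaluating a known sequence
            cache[key] = evaluate(seq)
        return cache[key]

    def random_sequence():
        return [rng.choice(phases) for _ in range(length)]

    pop = [random_sequence() for _ in range(pop_size)]
    half = pop_size // 2
    for _ in range(generations):
        pop.sort(key=cost)            # lowest cost (best) first
        # Remove the worst sequence and three random others from the lower half.
        removed = {pop_size - 1} | set(rng.sample(range(half, pop_size - 1), 3))
        survivors = [s for i, s in enumerate(pop) if i not in removed]
        # Replace them with children of random parent pairs from the upper half.
        children = []
        while len(children) < len(removed):
            p1, p2 = rng.sample(pop[:half], 2)
            children.extend(crossover(p1, p2))
        # Mutate survivors except the best; new children are not mutated.
        mutated = [survivors[0]]
        for i, seq in enumerate(survivors[1:], start=1):
            rate = 0.05 if i < half else 0.10
            mutated.append([rng.choice(phases) if rng.random() < rate else g
                            for g in seq])
        pop = mutated + children[:len(removed)]
        # Replace redundant sequences with randomly generated ones.
        seen = set()
        for i, seq in enumerate(pop):
            while tuple(seq) in seen:
                seq = random_sequence()
            seen.add(tuple(seq))
            pop[i] = seq
    best = min(pop, key=cost)
    return cost(best), best
```

Because the best sequence is never removed or mutated in this sketch, the best cost in the population cannot get worse from one generation to the next.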
We found that the average number of generations to find the best sequence was 51 out of the 200 generations attempted. Some applications, like fft, had significant improvements. The applications with larger functions tend to have more successfully applied phases, which can often lead to larger improvements when searching for an effective optimization sequence. While there were some aberrations due to the randomness of using a genetic algorithm, most of the benchmarks had improved WCETs. The WCET cycles decreased by 6.6% on average. This illustrates the benefit of using a genetic algorithm to search for effective optimization sequences to improve WCET.
In addition to improving WCET, we thought it would be interesting to see the improvement in code size. Table 4 shows the results obtained for each benchmark by applying the genetic algorithm when changing the fitness criteria. For each benchmark we performed three different searches, which are based on WCET only (optimizing for WCET), code size only (optimizing for space), and 50% for each factor (optimizing for both). For each type of search, we show the effect both on WCET and on code size. The results that are supposed to improve according to the specified fitness criteria are shown in boldface. For these results, the genetic algorithm was typically able to find a sequence for each benchmark that either achieves the same result or obtains an improved result as compared to the batch compilation. The results when optimizing for both WCET and code size showed that we were able to achieve a better overall benefit when both WCET and code size are considered.
Future Work
There is much future research that can be accomplished on tuning the WCET of embedded applications. We can vary the characteristics of the genetic algorithm search. It would be interesting to see the effect on a search as one changes aspects of the genetic algorithm, such as the sequence length, population size, number of generations, etc. In addition, it would be interesting to perform searches involving more aggressive compiler optimizations, different benchmarks, and different processors. Now that we have integrated a timing analyzer with a compiler, there are a number of compiler optimizations that we plan to develop with the goal of reducing the WCET. These optimizations will use the WCET path information provided by the timing analyzer to drive the optimizations that will be performed.
Conclusions
There are several contributions that we have presented in this paper. First, we have demonstrated that it is possible to integrate a timing analyzer with a compiler and that these WCET predictions can be used by the application developer and the compiler to make phase ordering decisions. Displaying the improvement in WCET during the tuning process allows a developer to easily gauge the progress that has been made. To the best of our knowledge, this is the first compiler that interacts with a timing analyzer to use WCET predictions during the compilation of applications. Second, we have shown that the WCET predictions can be used as a fitness criterion by a genetic algorithm that finds effective optimization sequences to improve the WCET of applications on an embedded processor. One advantage of using WCET as a fitness criterion is that the searches for an effective sequence are much faster. The development environment for many embedded systems is different from the target environment. Thus, simulators are used when testing an embedded application. Executing the timing analyzer typically requires a small fraction of the time that would be required to simulate the execution of the application. Finally, we have shown that both WCET and code size improvements can be simultaneously obtained. Both of these criteria are important factors when tuning applications for an embedded processor.
Acknowledgements
The anonymous reviewers' suggestions improved the quality of the paper. We also thank StarCore for providing the necessary software and documentation that were used in this project. This research was supported in part by NSF grants EIA-0072043, CCR-0208581, CCR-0208892, CCR-0310860, CCR-0312493, and the NASA EPSCoR program.
