Abstract
Introduction
Accurate software running time and power analysis are key to optimized system synthesis. In general, imprecise estimation of software execution costs (such as running time and power) increases design risk or leads to inefficient designs. Profiling and simulation are the state-of-the-art in industry, but since exhaustive simulation is impractical, simulation results can only cover part of the system behavior. Static analysis is a more complicated but attractive alternative. It provides lower and upper bounds reflecting data dependent control flow as well as data dependent statement execution cost. In the past, these bounds were wide due to a lack of efficient control flow analysis and architecture modeling techniques. Significant progress in both areas has made formal analysis practical.
Intervals for software execution cost depend to a certain extent on the process control flow, which in turn depends on process input data. Execution costs of the software processes and, hence, of the overall system are context dependent. We will use an example from wireless communication, where there are several paths on which different data packets are routed through a network of software processes. Important questions for the system architect can be the power consumption for sending a data packet or the time to set up a connection in a base station. The answer should take the system context into account, since for each packet type the processes react with a different control flow. Of course, simulation is always possible and statistical execution cost analysis is feasible, but the first approach is not reliable and the second is just an approximation of the complex hardware activities when executing a set of communicating software processes. We will show with realistic examples that the static analysis approach provides reliable and narrow intervals for context dependent process execution cost that are automatically evaluated by the analysis tool.
We explain the influence of data dependent control flow on software execution cost in section 2. Data dependent instruction execution is explained in section 3. In section 4 we present an example before we conclude in section 5.
Program Path Analysis
For path analysis techniques [7], a program is typically divided into basic blocks, where a basic block bb is a program segment which is only entered at the first statement and only left at the last statement [1]. Any program can be partitioned into disjoint basic blocks. Then, the program structure is represented as a directed program flow graph with basic blocks as nodes. For each basic block a cost with respect to each interval is determined. A longest and shortest path analysis on the program flow graph is used to identify a global interval. This procedure does not yet provide sufficient accuracy. For acceptable analysis precision one must identify feasible paths through a program. A feasible program path or trace is a path in this flow graph corresponding to a possible sequence of basic blocks when the program is executed from the first to the last basic block of a program. A program segment is a sequence of nodes in a program flow or syntax graph. This definition implies a hierarchy of program segments. Not all paths in the graph represent feasible program paths. A false program path is a path in the graph which cannot be executed under any input condition. False path identification is essential for programs with loops since loops correspond to cycles in the graph which lead to an infinite number of potential paths and resulting infinite cost intervals.
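The longest and shortest path analysis described above can be illustrated with a minimal sketch. It assumes an acyclic flow graph (loops would first need bounds) and uses hypothetical block names and cost intervals that are not taken from the paper.

```python
# Hypothetical acyclic program flow graph: basic block -> successor blocks.
SUCCESSORS = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

# Hypothetical per-block cost intervals (min, max), e.g. in cycles.
BLOCK_COST = {
    "entry": (2, 3),
    "then": (5, 8),
    "else": (1, 2),
    "exit": (1, 1),
}


def path_cost_interval(block):
    """Return (shortest, longest) cost from `block` to the end of the graph."""
    lo, hi = BLOCK_COST[block]
    successors = SUCCESSORS[block]
    if not successors:
        return lo, hi
    tails = [path_cost_interval(s) for s in successors]
    # Lower bound follows the cheapest tail, upper bound the most expensive one.
    return lo + min(t[0] for t in tails), hi + max(t[1] for t in tails)


GLOBAL_INTERVAL = path_cost_interval("entry")
print(GLOBAL_INTERVAL)
```

The sketch treats every graph path as feasible, which is exactly why the resulting interval can be wider than necessary when false paths exist.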
Previous Work
The approaches by Puschner and Koza [10] and Park and Shaw [9] require iteration bounds for all loops in the program, which the user must provide by loop annotation. While making formal analysis feasible, loop bounding alone is not sufficient for accurate path analysis. The approach by Gong and Gajski [5] can partially consider false paths because the user can specify the branching probabilities. As a second step in [7] and in [9], the user is asked to annotate false paths. The number of false paths can be very large. Instead of enumerating false paths or, conversely, feasible paths, a language for user annotation with regular expressions is introduced in [9]. Still, the number of required path annotations can be extremely large in practice, as demonstrated with even small examples in [7]. A major step forward was the introduction of implicit path enumeration [7]. Here, the user provides linear (in)equations to define false paths. To evaluate these (in)equations, Li and Malik map the upper and lower bound identification to two ILP problems, one optimizing for the lower and the other for the upper cost bound. Previous work by Ferdinand in [4] builds on this kind of interaction, while abstract interpretation is used for the static prediction of cache and pipeline behavior. Abstract interpretation has also been used to reduce designer interaction for loop bounding [6]. It is assumed that all executions of one basic block have the same cost. However, data dependent instruction execution and superscalar or superpipelined architectures with overlapped basic block execution as well as cache behavior lead to widely varying local path cost with respect to latency time and power consumption. This has a substantial effect on the cost interval. For these architectures, the sum-of-basic-blocks model cannot provide close bounds, but must be pessimistic to be correct. For higher accuracy, basic block sequences in program segments must be considered.
This shall be called the sum-of-program-segments model containing basic block sequences which is a major improvement compared to the state-of-the-art.
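For orientation, the implicit path enumeration idea of [7] discussed in this section is commonly written as a pair of integer linear programs over basic block execution counts. The following is only a sketch in a notation assumed here for illustration, not a reproduction of the formulation in [7]: $c_i^{\min}$ and $c_i^{\max}$ denote the cost bounds of basic block $i$, $x_i$ its execution count, $d_{ji}$ the number of times the edge from block $j$ to block $i$ is taken, and $A x \le b$ stands for the user-provided linear (in)equations.

```latex
\begin{aligned}
C^{\max} &= \max_{x} \sum_i c_i^{\max}\, x_i, \\
C^{\min} &= \min_{x} \sum_i c_i^{\min}\, x_i, \\
\text{subject to}\quad x_i &= \sum_{j \in \mathrm{pred}(i)} d_{ji} = \sum_{k \in \mathrm{succ}(i)} d_{ik} \quad \text{(structural constraints)}, \\
A\,x &\le b \quad \text{(functional constraints, e.g. loop bounds and false paths)}, \\
x_i,\ d_{ji} &\in \mathbb{N}_0 .
\end{aligned}
```

In this form each basic block contributes a single cost interval regardless of the blocks executed before it, which is the sum-of-basic-blocks assumption criticized above.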
Execution Cost

Path Classification
Program properties can be exploited to simplify path analysis for the determination of the execution cost through basic block sequences [13]. Large parts of typical embedded system programs have a single program path only. An FIR filter is a simple example and a Fast Fourier Transform is a more complex one. There is only one path executed for any input pattern, even though this path may wrap around many loops, conditional statements and even function calls which are used for program structuring and compacting. A program segment has a Single Feasible Path (SFP) when the paths through this segment do not depend on input data. A program segment with an SFP is an SFP-segment.
Previous analysis approaches give more than one execution path for SFP programs because they do not distinguish between input data dependent control flow and program structuring aids. In the best case, they may be accurate, but approaches such as [7] require much designer interaction for SFP program segments and still do not deliver the path segment costs. In case of SFP, execution would choose the one correct path and sequence for any input pattern without further designer interaction. Most practical systems also contain non-SFP parts. These have multiple feasible paths (MFP). A program segment has Multiple Feasible Paths (MFP) when the paths through the program segment depend on input data. A program segment with MFP is an MFP-segment. Isolation of SFP and MFP parts can help to exploit SFP.
In [13], SFP are exploited by finding SFP and MFP nodes in the control flow graph. Embedded MFP are cut out and analyzed separately using the ILP approach, while SFP are analyzed by simulating the timing of the only path. Costs for cutting out the MFP and the MFP cost interval delivered by the ILP solver are added. This leads to tighter cost bounds compared to [7]. In this paper, we present major improvements. The approach in [13] can only deal with one level of embedded MFP. If several levels of hierarchy with SFP and embedded MFP are present, they have to be analyzed separately, so dependencies across the hierarchical levels are lost, leading to overly pessimistic cost bounds in case of complex programs. We extend this approach to a global cost interval calculation for all levels, which provides higher analysis precision. The syntax graph is chosen instead of the control flow graph because it can directly cover the hierarchy of control structures, and rewriting the program to generate a control flow graph is not necessary.
Identification of Program Properties
Syntax Graph For the identification of SFP and MFP segments, the input program is mapped to a syntax graph. The syntax graph of a bubble sort algorithm is shown in figure 1. In this syntax graph, every control structure, such as if and for, is a hierarchical node. The basic blocks are the leaf nodes with the corresponding basic block cost. Every control structure has edges with different meanings. The "control" edge, which decides which of the paths is executed, and the "successor" edge, which leads to the next node, are part of every control structure, while the "then" and the "else" edges are specific to the if/else program segment. The same restrictions to structured programs are assumed as in [7]. Control flow enters and leaves an if/else program segment exactly twice for the given hierarchy level, once for the control structure and once for either the "then" or the "else" edge, as in figure 2. A depth first search algorithm on the syntax graph can be used to determine input data dependencies of conditions using symbolic simulation of basic blocks [13]. Every control structure which does not contain an input data dependent condition must be SFP. Leaf nodes are SFP by definition. If conditions contain input data, or symbolic execution is not successful due to the complexity of symbolic expansions, the syntax graph nodes are classified as MFP. This leads to wider cost intervals. This algorithm classifies each hierarchical node. Program segments (PrS) with MFP child nodes are classified as MFP because the multiple paths also enter and leave this hierarchical node even when their control structure is independent of input data. The for-PrS in figure 2, which shows the inner loop of figure 1, potentially has 2^iterations paths because control flow splits in the if-PrS.
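The classification rules above can be sketched as a depth-first traversal. This is a minimal illustration, not the tool's implementation: the `Node` class, the `input_dependent` flag (standing in for the outcome of symbolic simulation) and the node names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical syntax graph node; leaf nodes have no children."""
    name: str
    input_dependent: bool = False  # assumed result of symbolic simulation
    children: List["Node"] = field(default_factory=list)


def classify(node: Node, labels: Dict[str, str]) -> str:
    """Depth-first classification; records 'SFP' or 'MFP' per node name."""
    child_labels = [classify(child, labels) for child in node.children]
    # A node is MFP if its own condition depends on input data
    # or if any child node is MFP; otherwise it is SFP.
    if node.input_dependent or "MFP" in child_labels:
        labels[node.name] = "MFP"
    else:
        labels[node.name] = "SFP"
    return labels[node.name]


# Bubble sort skeleton: only the comparison depends on the array contents.
if_node = Node("if", input_dependent=True, children=[Node("swap")])
inner_for = Node("inner_for", children=[if_node, Node("j++")])
outer_for = Node("outer_for", children=[inner_for, Node("i++")])

LABELS: Dict[str, str] = {}
classify(outer_for, LABELS)
print(LABELS)
```

Both loops end up as MFP in this sketch even though their own conditions are input independent, which is the situation the text describes for the for-PrS.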
Feasible Paths in the Syntax
Figure 2. Execution paths in the graph
To treat such situations, we introduce a pseudo SFP-PrS. A pseudo SFP-PrS is an SFP-PrS with a single PaS on one level of hierarchy while lower levels may have multiple paths as in figure 3. On this level of control hierarchy, it can be treated like an SFP-PrS as we prove in [12].
Program Segment Cost Cost determination requires a Program Segment Execution PsE. It is an execution of a PaS through the complete PrS. Details can be found in [13] and in section 3. There can be a minimum and a maximum cost for a single PaS through the PrS because of data dependent instruction execution. The PrScost is the cost for the execution of a PrS. PrScost is determined according to its PrS classification.
$$
\mathrm{PrScost}(PrS_i) =
\begin{cases}
\mathrm{MSPrScost}(PrS_i) & \text{if } PrS_i \text{ is MSPrS} \\
\mathrm{MFP\text{-}PrScost}(PrS_i) & \text{if } PrS_i \text{ is MFP-PrS} \\
\text{undefined} & \text{else}
\end{cases}
$$
MFP-PrScost is computed as an ILP problem using the approach of [7], delivering the execution count $x_i$ of the distinct PaS, plus the transition costs. MFP-PrScost has a minimum and a maximum. The Transition Cost $T_{p/e,p_j}$ is the cost representing overlapping PrScost for the prologue p and the epilogue e in figure 2. These transition costs must be conservative. If no MFP-PrS in the PaS on lower levels of hierarchy is present, the recursive descent stops. The execution count $x_i$ of a PaS is solved according to section 2.2 after the equations for the embedded MFP-PrS including execution count $x_i$ have been propagated to the top level.
Global Cost Calculation For the bubble sort example in figure 3, the recursive cost calculation with the propagation of equations works in the following way: We start on the top level of the process. For its execution cost we need the cost of PrS1 and of the lower levels of hierarchy PrS2 and PrS3. After checking the two for loops, the recursive descent finds PrS3, the if/else MFP-PrS. It only contains leaf nodes. The $\mathrm{MFP\text{-}PrScost}^{if/else}$ of PrS3 is composed of the cost of the paths $PaS_{i,3}$ across "control" and "then" or "control" and "else", each of which is delivered by PsE. $x_i$ is their execution count and $T_{p/e,p_3}$ the transition cost. Then we can calculate the cost of PrS2, the inner for loop shown in figure 2. It is composed of the cost equation of the if/else MFP-PrS and the MSPrScost of the "j++"-PrS and "control"-PrS, as these are leaf nodes. Finally, the cost equation for PrS1, the outer for loop, can be given, which adds the MSPrScost of the "i++"-PrS and "control"-PrS to the cost of PrS2 and PrS3.
$$
\mathrm{MFP\text{-}PrScost}^{if/else} = \sum_{PaS_{i,3}} x_i \cdot \mathrm{PaScost}(PaS_{i,3}) + T_{p/e,p_3}
$$
Even with one level of hierarchy between PrS1 and PrS2, their MSPrScost can be delivered by the same PsE because the control flow is given by the program properties. PrScost equations for PrS3 have been propagated to the top level, where the designer can provide functional constraints bounding the $x_i$ of PrS3 instead of basic blocks according to section 2.2. This finally delivers the execution cost bounds.
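The propagation of the if/else cost equation to the top level can be illustrated with a small numeric sketch. All numbers are hypothetical, the transition cost is simplified to a single additive interval, and the ILP solver is replaced by brute-force enumeration of the execution count of the "then" path, which is only viable for toy sizes.

```python
# Hypothetical setup: bubble sort on N elements, costs in cycles.
N = 4
INNER_ITERATIONS = N * (N - 1) // 2  # executions of the if/else MFP-PrS

# Assumed PaS cost intervals (min, max) for "control"+"then" and "control"+"else".
PAS_COST = {"then": (7, 9), "else": (3, 3)}
TRANSITION = (1, 2)   # assumed conservative transition cost interval
INNER_LEAF = (2, 2)   # assumed cost of "j++" and "control" per inner iteration
OUTER_LEAF = (2, 2)   # assumed cost of "i++" and "control" per outer iteration


def mfp_cost(total, x_then_bounds):
    """Bound sum(x_i * PaScost_i) + T over all admissible execution counts."""
    lo_x, hi_x = x_then_bounds
    lows, highs = [], []
    for x_then in range(lo_x, hi_x + 1):
        x_else = total - x_then  # both paths together run `total` times
        lows.append(x_then * PAS_COST["then"][0]
                    + x_else * PAS_COST["else"][0] + TRANSITION[0])
        highs.append(x_then * PAS_COST["then"][1]
                     + x_else * PAS_COST["else"][1] + TRANSITION[1])
    return min(lows), max(highs)


def process_cost(x_then_bounds):
    """Top-level interval: MFP equation plus the single-path loop overhead."""
    mfp_lo, mfp_hi = mfp_cost(INNER_ITERATIONS, x_then_bounds)
    fixed_lo = INNER_ITERATIONS * INNER_LEAF[0] + (N - 1) * OUTER_LEAF[0]
    fixed_hi = INNER_ITERATIONS * INNER_LEAF[1] + (N - 1) * OUTER_LEAF[1]
    return mfp_lo + fixed_lo, mfp_hi + fixed_hi


UNCONSTRAINED = process_cost((0, INNER_ITERATIONS))
CONSTRAINED = process_cost((0, 2))  # hypothetical functional constraint on x_then
print(UNCONSTRAINED, CONSTRAINED)
```

With the hypothetical functional constraint that at most two swaps occur, the upper bound drops while the lower bound is unchanged, mirroring how constraints on the execution counts tighten the interval at the top level.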
There is a one-to-one correspondence between the basic blocks of the syntax graph and the nodes of the hierarchical control flow graph HCFG the syntax graph can be transformed to. For the following examples the HCFG is used to allow an easier modeling of control flow. 
Figure 3. Pseudo SFP-PrS with MFP-PrS
In the previous approach in [13], embedded MFP cost and "cut point" cost were separately analyzed and added to the simulated SFP for every MFP on different levels of hierarchy. Here, the dependencies across several levels of hierarchy are not lost because the MFP-PrScost equations, including the execution count $x_i$ and transition costs, are propagated to the top level of the syntax graph instead of adding values in the control flow graph. This generalizes the approach in [13]. As MFP-PrScost is based on PaS, single paths PaS through the MFP-PrS can be analyzed.
Context Dependent Control Flow
The path analysis approach presented up to this point is based on the identification of input data independent control flow. This improves the estimation accuracy compared to the approach in [7] and the first preliminary SFP analysis approach in [13] . Even MFP segments with input data dependent paths are analyzed with narrower bounds than in the previous basic block based approaches, as long as some segments on the lower levels of hierarchy are SFP-PrS.
In the introduction, we have argued that the designer is often interested in context dependent process behavior. Here, context is defined to be a subset of input data and/or a subset of possible process states, often called process modes. In each context, only a subset of paths through a program segment can be executed. This potentially means reduced cost bounds which could be exploited for analysis. Global process representation models [14] can support process modes, such that the distinguishable contexts are known for cost analysis. A simple example for context dependent control flow in an ATM switch component is given. In other words, the contexts "VCI = 3", corresponding to the OAM mode, and "not (VCI = 3)", corresponding to the USER mode, turn an MFP-PrS into a PrS with a single path. We will call such a PrS a Context Dependent Path program segment CDP-PrS. For analysis of the given context, it is treated like an SFP-PrS. Where this approach is not applicable, the reduced path set of a given context can further be exploited via additional structural and functional constraints [7]. In both cases, context dependent behavior can be analyzed using the same techniques as described before. The same discussion for the gain in accuracy as in [13] applies because longer sequences are achieved than with SFP identification alone. At the transitions between SFP and CDP segments, MSPrS containing both SFP-PrS and CDP-PrS can be defined.
For different modes, SFP-PrS and functional constraints for the remaining MFP-PrS stay the same, while a different block of CDP-PrS can be extracted from the MFP-PrS. This way, average cases given as artificial modes can tighten the wide intervals. Stochastic or probabilistic distribution of input data could be considered using appropriate cost functions and convolutions for the PrS cost.
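The effect of a known context on the cost interval can be sketched as follows. The mode names follow the ATM example above, while the cost numbers and the helper functions `context_of` and `cost_interval` are hypothetical and only serve as illustration.

```python
# Hypothetical cost intervals (min, max) for the two paths of the ATM example.
PATH_COST = {
    "OAM": (40, 45),   # path taken when VCI = 3
    "USER": (12, 15),  # path taken when not (VCI = 3)
}


def context_of(vci):
    """Map the VCI field to its process mode, as in the example above."""
    return "OAM" if vci == 3 else "USER"


def cost_interval(context=None):
    """Interval over all paths (MFP view) or over the single path of a context."""
    if context is None:
        paths = list(PATH_COST.values())
    else:
        paths = [PATH_COST[context]]  # CDP-PrS: one path remains
    return min(p[0] for p in paths), max(p[1] for p in paths)


print(cost_interval())               # without context
print(cost_interval(context_of(3)))  # OAM mode
print(cost_interval(context_of(7)))  # USER mode
```

Without a context the interval has to span both paths; once the mode is known, only the cost interval of the remaining path is reported.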
Architecture Modeling
A program segment execution PsE for the cost determination of a PrS uses one of the following two techniques:
Instruction Cost Addition ICA The instruction or statement execution costs in a basic block or PrS are added. We do not need input data. Host tracing is used while execution costs are taken from a table. This is a very computation time efficient approach. Instruction execution cost $c_i$ can be dependent on input data. A popular example is a shift-and-add implementation of a multiplication in a processor, delivering an interval for $c_i$. So in the tables, minimum and maximum instruction execution cost can be considered, leading to an interval for the PrS cost delivered by ICA.
Program Segment Simulation PSS The basic block or PrS is simulated using known input data and a cycle true processor model [2] which can exactly deliver processor timing or power consumption. This can be any well established, off-the-shelf processor simulator provided by the processor vendor. Processor evaluation kits implemented in hardware have been successfully used for timing or power measurement with a logic state analyzer and automatic result back annotation. As an example for PSS that delivers the execution cost of the PrS, a StrongARM simulator core is combined with the DINERO III cache simulator, delivering both instruction and data cache behavior. The source codes have been recompiled into a single simulator. Architecture modeling regarding timing and the energy dissipation model is derived from [8] and [11]. Data rates are derived from the amount of data produced or consumed on a path and its execution count from section 2.2.
The major improvement to the architecture modeling of the first SFP analysis approach in [13] is the possibility to integrate off-the-shelf processor simulators and emulators. This enables us to determine execution cost intervals for several target architectures.
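Instruction Cost Addition as described in this section can be sketched in a few lines. The instruction names, the cost table and the traced instruction sequence are hypothetical; a real table would come from the processor documentation or from measurement.

```python
# Hypothetical table of instruction cost intervals (min, max) in cycles.
# Data dependent instructions such as a shift-and-add multiply get min < max.
INSTRUCTION_COST = {
    "add": (1, 1),
    "load": (1, 3),
    "mul": (2, 5),
    "branch": (1, 3),
}


def ica(instructions):
    """Add table costs over a traced instruction sequence; no input data needed."""
    lo = sum(INSTRUCTION_COST[op][0] for op in instructions)
    hi = sum(INSTRUCTION_COST[op][1] for op in instructions)
    return lo, hi


# Assumed instruction sequence of one program segment.
TRACE = ["load", "load", "mul", "add", "branch"]
SEGMENT_INTERVAL = ica(TRACE)
print(SEGMENT_INTERVAL)
```

Because the table lookup ignores interactions between instructions, such a sketch cannot capture pipeline or cache effects; that is the role of Program Segment Simulation.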
Experiment
Conclusion
Packet Receiver The approach is applied to a process that reads a packet and loads a picture as presented below. If the picture is addressed to the component, it applies a filter. A pseudo code description is given below. In table 1, execution cost intervals with respect to latency time, power consumption and data rates without picture size or address match mode are given. The intervals as well as the path classification are given for every PrS, which is referenced by the line number it starts with. Due to the loop bounds for each context, given by the packet size bounds in the received header, we know the minimum and maximum number of pixels, leading to a CDP in line 124. SFP segments and CDP segments are merged into MSPrS, so they may not be visible in the results. The results for the complete process are given in the last line.
Table 1. Cost without modes
In table 2, different modes are explored. In the first three lines, picture sizes are derived from process modes. Address and luminance calculation modes follow. Table 1 can be found in line 3. We notice that known process modes lead to tighter, but context dependent cost intervals for the process that can be used for formal process representation supporting process modes on higher levels [14].
