Static timing analysis of embedded software is important for systems with hard real-time constraints. To accurately estimate time bounds, it is essential to model the underlying micro-architecture. In this paper, we study static timing analysis of embedded programs for modern processors with speculative execution. Speculation of conditional branch outcomes significantly improves processor performance, and hence program execution time. Although speculation is used in most modern processors, its effect on software timing has not been systematically studied before. The main contribution of our work is a parameterized framework to model different control flow speculation schemes. The accuracy of our framework is illustrated through tight timing estimates obtained for benchmark programs.
INTRODUCTION
An embedded system contains processor(s) running specific application programs which communicate with an external environment in a timely fashion. These application programs thus have real-time requirements, i.e., there are hard deadlines on the execution time of such software. Moreover, many embedded systems are safety critical. Therefore, it is important to perform static analysis of embedded software to guarantee the satisfiability of all timing constraints.
ISSS'02,
Static timing analysis can provide an upper/lower bound on the execution time of a program. These bounds are useful for schedulability analysis, hardware/software partitioning, choice of processor (design space exploration), etc. Due to its inherent importance in embedded system design, timing analysis of embedded software has been extensively studied [2, 5, 6, 9, 11, 14, 16]. Accurate timing analysis critically depends on modeling the effects of the underlying micro-architecture. Ignoring the micro-architecture can produce extremely pessimistic time bounds. This is particularly so because modern processors employ advanced micro-architectural features such as pipelines, caches, and speculative execution to speed up program execution. In the recent past, researchers have studied the effects of pipeline and cache on program execution time [5, 9, 11, 15].
The presence of branch instructions forms control dependencies between different parts of the program. These dependencies cause pipeline stalls, which can be avoided by speculating the control flow subsequent to a branch. Current generation processors perform control flow speculation through branch prediction, which predicts the outcome of branch instructions [7]. If the prediction is correct, then execution proceeds without any interruption. For an incorrect prediction, the speculatively executed instructions are undone, incurring a branch misprediction penalty. This penalty varies from 3 to 19 clock cycles. If branch prediction is not modeled, all the branches in the program must be conservatively assumed to be mispredicted for finding the maximum execution time. This pessimism results in as much as 60-70% over-estimation for some of the benchmarks in this paper, even assuming a 3 clock cycle branch misprediction penalty.
In this paper, we model the effects of speculation via branch prediction on the Worst Case Execution Time of a program, also known as its WCET. Our micro-architectural modeling is completely generic and parameterizable w.r.t. the currently used branch prediction schemes. It automatically derives linear constraints on the total misprediction count from the control flow graph of the program. These constraints can be solved by an integer linear programming (ILP) solver to compute bounds on program execution time.
MODELING BRANCH PREDICTION
Branch prediction can be static or dynamic. Static schemes associate a fixed prediction with each branch instruction via compile-time analysis. Almost all modern processors, however, predict the branch outcome dynamically based on past execution history [7]. Dynamic schemes are more accurate than static schemes, and in this work we study only dynamic branch prediction.
Dynamic schemes predict a branch depending on the execution history. The first dynamic technique proposed is called local branch prediction [7, 12], where each branch is predicted based only on its own last few outcomes. This scheme uses a $2^n$-entry branch prediction table to store the past branch outcomes, which is indexed by the n lower-order bits of the branch address. In the simplest case, each prediction table entry is 1-bit and stores the last outcome of the branch mapped to that entry. When a branch is encountered, the corresponding table entry is looked up and used as the prediction. When a branch is resolved, the corresponding table entry is updated with the outcome.
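The lookup and update steps of the 1-bit local scheme can be sketched as follows. This is a minimal illustration, not the analyzer described in this paper: the function name, the trace format, the example address, and the assumption that every table entry initially predicts "not taken" are all hypothetical.

```python
def simulate_local_1bit(trace, n_bits=2):
    """Count mispredictions of a 1-bit local predictor.

    trace: list of (branch_address, taken) pairs in execution order,
           where taken is 1 (taken) or 0 (not taken).
    The table has 2**n_bits one-bit entries. Initialising every entry
    to 0 (not taken) is an assumption of this sketch.
    """
    table = [0] * (1 << n_bits)
    mask = (1 << n_bits) - 1
    mispredictions = 0
    for address, taken in trace:
        index = address & mask        # n lower-order bits of the branch address
        if table[index] != taken:     # the stored last outcome is the prediction
            mispredictions += 1
        table[index] = taken          # update the entry once the branch resolves
    return mispredictions


# A loop branch at a hypothetical address, taken three times and then not taken.
loop_trace = [(0x4, 1), (0x4, 1), (0x4, 1), (0x4, 0)]
print(simulate_local_1bit(loop_trace))  # prints 2: the first and the last iteration
```

Because only the lower-order address bits select the entry, two branches whose addresses agree in those bits share one entry and can disturb each other's predictions.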
Most modern processors, however, use global branch prediction schemes [18] (also called correlation based schemes), which are more accurate. Examples of processors using global branch prediction include Intel Pentium Pro, AMD, Alpha as well as the embedded processors PowerPC 440GP [13] and SB-1 MIPS 64 [8]. In these schemes, the prediction of the outcome of a branch I depends not only on I's recent outcomes, but also on the outcomes of the other recently executed branches. Global schemes can exploit the fact that the behaviors of neighboring branches in a program are often correlated. Global schemes use a single shift register, called the Branch History Register (BHR), to record the outcomes of the n most recent branches. As in local schemes, there is a global branch prediction table in which the predictions are stored. The various global schemes differ from each other (and from local schemes) in the way the prediction table is looked up when a branch is encountered.
We now present our timing estimation technique to model the effects of speculation. In particular, we consider GAg, a global branch prediction scheme [12, 18], which uses the BHR as an index to look up the prediction table. However, our modeling is generic and not restricted to GAg. In Section 4, we will demonstrate how it easily captures other global prediction schemes as well as local schemes.
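To make the GAg lookup concrete, the following sketch simulates a GAg predictor on a sequence of branch outcomes. It assumes 1-bit table entries (as in the simplest local scheme), a BHR that starts at all zeros, and a table that initially predicts "not taken" everywhere; these choices and the function name are assumptions of the sketch.

```python
def simulate_gag(outcomes, k=2):
    """Count mispredictions of a GAg predictor with 1-bit table entries.

    outcomes: branch outcomes (1 = taken, 0 = not taken) in execution
              order, regardless of which branch produced them.
    k:        number of history bits, so the table has 2**k rows.
    """
    table = [0] * (1 << k)            # assumed initial predictions: not taken
    mask = (1 << k) - 1
    bhr = 0                           # assumed initial history: all zeros
    mispredictions = 0
    for taken in outcomes:
        if table[bhr] != taken:       # the BHR alone selects the table row
            mispredictions += 1
        table[bhr] = taken            # store the resolved outcome in that row
        bhr = ((bhr << 1) | taken) & mask   # shift the outcome into the BHR
    return mispredictions


print(simulate_gag([1, 1, 1, 1]))          # prints 3
print(simulate_gag([1, 0, 1, 0, 1, 0]))    # prints 2
```

Note that the branch address never appears in the lookup, which is what distinguishes GAg from the local scheme.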
Control flow graph. The starting point of our analysis is the control flow graph (CFG) of the program. The vertices of this graph are basic blocks, and an edge i → j denotes flow of control from basic block i to basic block j. We assume that the control flow graph has a unique start node and a unique end node, such that all program paths originate at the start node and terminate at the end node. Each edge i → j of the control flow graph has a label, denoted label(i → j). For any block i, if the last instruction of i is a branch then it has two outgoing edges labeled 0 and 1. Otherwise, block i has one outgoing edge with label U.
For programs with procedures and functions (recursive or otherwise), we create a separate copy of the CFG of a procedure P for every distinct call site of P in the program. Each call of P transfers control to its corresponding copy.
Flow constraints and loop bounds. Let $v_i$ denote the number of times block i is executed, and let $e_{i,j}$ denote the number of times control flows through the edge i → j. As the start and end blocks are executed exactly once,

$$v_{start} = v_{end} = 1$$

As inflow equals outflow for other basic blocks,

$$v_i = \sum_j e_{j,i} = \sum_j e_{i,j}$$
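The flow constraints can be checked on a concrete path, which may help in reading them. The sketch below uses a small hypothetical CFG (its edges and labels are assumptions made for illustration), counts block and edge executions along one path, and verifies that the start and end blocks execute once and that inflow equals outflow elsewhere.

```python
from collections import Counter

# Hypothetical CFG: (source, target) -> label, where "U" is unconditional
# and "0"/"1" are the not-taken/taken outcomes of a conditional branch.
EDGES = {
    ("start", "1"): "U",
    ("1", "2"): "0",
    ("1", "end"): "1",
    ("2", "1"): "1",
    ("2", "end"): "0",
}


def counts_along(path):
    """Return (block execution counts, edge execution counts) for a path."""
    v = Counter(path)
    e = Counter(zip(path, path[1:]))
    assert all(edge in EDGES for edge in e), "path leaves the CFG"
    return v, e


def satisfies_flow_constraints(v, e):
    """Check the start/end and inflow = outflow constraints."""
    blocks = {b for edge in EDGES for b in edge}
    for b in blocks:
        inflow = sum(e[(j, i)] for (j, i) in EDGES if i == b)
        outflow = sum(e[(i, j)] for (i, j) in EDGES if i == b)
        if b == "start":
            ok = v[b] == 1 and outflow == 1
        elif b == "end":
            ok = v[b] == 1 and inflow == 1
        else:
            ok = v[b] == inflow == outflow
        if not ok:
            return False
    return True


path = ["start", "1", "2", "1", "2", "end"]
v, e = counts_along(path)
print(satisfies_flow_constraints(v, e))  # prints True
```

Every feasible path satisfies these constraints; the converse does not hold in general, which is why loop bounds are needed in addition.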
We provide bounds on the maximum number of iterations for loops and maximum depth of recursive invocations for recursive procedures. These bounds can be user provided, or can be computed offline for certain programs [6] .
Defining execution time bounds. Let $cost_i$ be the execution time of basic block i assuming perfect branch prediction. Given the program, $cost_i$ is a fixed constant for each i. Then, the total execution time of the program is

$$\sum_i (cost_i \cdot v_i + penalty \cdot m_i)$$

where penalty is a constant denoting the penalty for a single branch misprediction, and $m_i$ is the number of times the branch in block i is mispredicted.

Introducing history patterns. To determine the prediction of a block i, we first compute the index into the prediction table. In the case of GAg, this index is the outcome of the last k branches before block i is executed. These k outcomes are recorded in the Branch History Register (BHR). Thus, if k = 2 and the last two branches were taken (1) followed by not taken (0), the index would be 10. We define $v_i^\pi$, $m_i^\pi$ and $e_{i,j}^\pi$ as the counterparts of $v_i$, $m_i$ and $e_{i,j}$ restricted to the executions of block i with history $\pi$.

Control flow among history patterns. First, we define constraints on $v_i^\pi$. This provides an upper bound on $m_i^\pi$, since a branch cannot be mispredicted more often than it is executed. Recall that our index into the prediction table is simply a history recording the past few branch outcomes. To model the change in history due to control flow, we use the left shift operator: thus $left(\pi, 0)$ shifts pattern $\pi$ to the left by one position and puts 0 as the rightmost bit. We define:

DEFINITION 1. Let i → j be an edge in the control flow graph and let $\pi$ be the history pattern at basic block i. The change in history pattern on executing i → j is given by $\Gamma(\pi, i \to j)$ where:

$$\Gamma(\pi, i \to j) = \begin{cases} \pi & \text{if } label(i \to j) = U \\ left(\pi, 0) & \text{if } label(i \to j) = 0 \\ left(\pi, 1) & \text{if } label(i \to j) = 1 \end{cases}$$
Now consider all inflows into block i in the control flow graph. Basic block i can execute with history $\pi$ only if block j executes with some history $\pi'$, control flows along the edge j → i, and $\Gamma(\pi', j \to i) = \pi$. Hence,

$$v_i^\pi = \sum_j \sum_{\pi' : \Gamma(\pi', j \to i) = \pi} e_{j,i}^{\pi'}$$

Note that for any incoming edge j → i, there can be at most two history patterns $\pi'$ such that $\Gamma(\pi', j \to i) = \pi$.
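The history update of Definition 1 and the observation about predecessor patterns can be sketched in a few lines. Patterns are represented as k-bit integers and edge labels as the strings "U", "0" and "1"; this encoding and the function names are assumptions of the sketch.

```python
def left(pi, bit, k):
    """Shift the k-bit pattern pi left by one and insert bit on the right."""
    return ((pi << 1) | bit) & ((1 << k) - 1)


def gamma(pi, label, k):
    """History pattern after following an edge with the given label."""
    if label == "U":                  # unconditional edge: history unchanged
        return pi
    return left(pi, int(label), k)    # conditional edge: shift in the outcome


def predecessors(pi, label, k):
    """All k-bit patterns that gamma maps to pi along an edge with this label."""
    return [p for p in range(1 << k) if gamma(p, label, k) == pi]


# With k = 2, pattern 01 can be reached along a taken edge from 00 or 10.
print([format(p, "02b") for p in predecessors(0b01, "1", 2)])  # ['00', '10']
```

For a conditional edge the two predecessor patterns differ only in the bit that is shifted out; for an unconditional edge there is exactly one predecessor, the pattern itself.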
Repetition of a history pattern. Suppose there is a misprediction of the branch in block i with history $\pi$. This means that certain blocks (maybe i itself) were executed with history $\pi$, and the outcomes of these branches appear in the $\pi$th row of the prediction table.

DEFINITION 2. $p_{i \to j}^{\pi}$ denotes the number of times a path from block i to block j is taken such that (a) the $\pi$th row of the prediction table is used for prediction at blocks i and j, and (b) the $\pi$th row of the prediction table is never used for prediction between blocks i and j. In these scenarios, the outcome of block i can affect the prediction of block j (and cause a misprediction). Furthermore, $p_{start \to i}^{\pi}$ ($p_{i \to end}^{\pi}$) denotes the number of times the $\pi$th row of the prediction table is looked up for the first (last) time at block i.
When the $\pi$th row of the prediction table is used at block i for branch prediction, either it is the first use of the $\pi$th row (denoted by $p_{start \to i}^{\pi}$) or the $\pi$th row was used for branch prediction last time in some block j ≠ start. Similarly, for every use of the $\pi$th row of the prediction table at block i, either it is the last use of the $\pi$th row (denoted by $p_{i \to end}^{\pi}$) or it is used for branch prediction next time in block j ≠ end.
Since $v_i^\pi$ denotes the number of times block i uses the $\pi$th row of the prediction table, we have

$$v_i^\pi = \sum_j p_{j \to i}^{\pi} = \sum_j p_{i \to j}^{\pi}$$

Now consider the mispredictions of the branch in block i with history $\pi$ in which the branch at i is taken. The number of such outcomes is $\le \sum_j p_{i \to j}^{\pi,1}$, since this denotes the total outflow from block i when it is executed with history $\pi$ and the branch at i is taken. Since the branch at i was mispredicted, the prediction in row $\pi$ of the prediction table must have been 0 (not taken). This is possible only if another block j was executed with history $\pi$, the branch of block j was not taken, and history $\pi$ never appeared between blocks j and i. The total number of such inflows into block i is at most $\sum_j p_{j \to i}^{\pi,0}$.

Putting it all together. We have derived linear inequalities on $v_i$ (execution count of block i) and $m_i$ (misprediction count of block i). We now maximize the objective function subject to these constraints using an (integer) linear programming solver to give an estimate of the Worst Case Execution Time (WCET) of the program.
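For very small control flow graphs, the quantity that the ILP bounds can be cross-checked by brute force: enumerate every path allowed by the loop bound, simulate the predictor along it, and take the maximum time. The sketch below does this instead of solving an ILP, so it is a sanity check rather than the estimation technique itself. The CFG, the block costs, the initial predictor state, and the 1-bit GAg table are all assumptions.

```python
# Hypothetical CFG, costs (in cycles) and predictor parameters.
EDGES = {
    ("start", "1"): "U",
    ("1", "2"): "0",
    ("1", "end"): "1",
    ("2", "1"): "1",
    ("2", "end"): "0",
}
COST = {"start": 1, "1": 5, "2": 7, "end": 1}
PENALTY = 3      # cycles per misprediction
K = 2            # history bits of the GAg predictor


def paths(loop_bound):
    """Yield every start-to-end path taking edge 2 -> 1 at most loop_bound times."""
    stack = [(["start"], 0)]
    while stack:
        path, loops = stack.pop()
        block = path[-1]
        if block == "end":
            yield path
            continue
        for (src, dst) in EDGES:
            if src != block:
                continue
            extra = 1 if (src, dst) == ("2", "1") else 0
            if loops + extra <= loop_bound:
                stack.append((path + [dst], loops + extra))


def execution_time(path):
    """Sum of block costs plus misprediction penalties under 1-bit GAg."""
    table = [0] * (1 << K)           # assumed initial predictions: not taken
    bhr = 0                          # assumed initial history: all zeros
    time = 0
    for block, nxt in zip(path, path[1:]):
        time += COST[block]
        label = EDGES[(block, nxt)]
        if label != "U":             # block ends in a conditional branch
            taken = int(label)
            if table[bhr] != taken:
                time += PENALTY
            table[bhr] = taken
            bhr = ((bhr << 1) | taken) & ((1 << K) - 1)
    return time + COST[path[-1]]


def brute_force_wcet(loop_bound):
    return max(execution_time(p) for p in paths(loop_bound))


print(brute_force_wcet(1))  # prints 29 for these hypothetical costs
```

Path enumeration grows quickly with the number of branches and loop iterations in general, which is the reason for formulating the problem as linear constraints rather than enumerating paths.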
AN EXAMPLE
We illustrate our estimation technique with a simple example. Consider the CFG in Figure 1. All edges of the graph are labeled. Recall that the label U denotes unconditional control flow and the label 1 (0) denotes control flow by taking (not taking) a conditional branch. We assume that a 2-bit history pattern is maintained, i.e., the prediction table has four rows for the history patterns 00, 01, 10, 11.
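Flow constraints and loop bounds. The start and end nodes execute only once. Hence

$$v_{start} = v_{end} = 1$$

From the inflows and outflows of blocks 1 and 2, we get:

$$v_1 = \sum_j e_{j,1} = \sum_j e_{1,j} \qquad v_2 = \sum_j e_{j,2} = \sum_j e_{2,j}$$

where the sums range over the edges of Figure 1. Furthermore, the edge 2 → 1 is a loop, and its bound must be given. Let us consider a bound of 100. Then, $e_{2,1} < 100$.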
Defining WCET. Let us assume a branch misprediction penalty of 3 clock cycles. The WCET of the program is obtained by maximizing

$$\sum_i (cost_i \cdot v_i + 3 \cdot m_i)$$

Recall that $cost_i$ is the execution time of block i (assuming perfect prediction), and $m_i$ is the number of mispredictions of block i. There are no mispredictions for executions of the start and end blocks as they do not have branches.
Introducing history patterns. We find out the possible history patterns for each basic block i via static analysis of the CFG. This information is denoted by the predicate $poss(i, \pi)$. The initial history at the beginning of program execution is assumed to be 00, i.e., $poss(start, \pi)$ is true iff $\pi = 00$. In our example, we obtain that $poss(1, \pi)$ is true iff $\pi \in \{00, 01\}$ and $poss(2, \pi)$ is true iff $\pi \in \{00, 10\}$.
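One way to compute the possible history patterns is a worklist propagation over the CFG, sketched below. Figure 1 is not reproduced here, so the edge set is a hypothetical CFG chosen to be consistent with the patterns stated for blocks 1 and 2; the targets of the remaining branch edges are assumed to be the end block.

```python
# Hypothetical CFG consistent with the example: (source, target) -> label.
EDGES = {
    ("start", "1"): "U",
    ("1", "2"): "0",
    ("1", "end"): "1",     # assumed target
    ("2", "1"): "1",
    ("2", "end"): "0",     # assumed target
}


def gamma(pi, label, k):
    """History after following an edge: unchanged for U, else shift in the outcome."""
    if label == "U":
        return pi
    return ((pi << 1) | int(label)) & ((1 << k) - 1)


def possible_histories(edges, k=2, start="start", initial=0):
    """Map each block to the set of history patterns it can execute with."""
    poss = {start: {initial}}
    worklist = [start]
    while worklist:
        block = worklist.pop()
        for (src, dst), label in edges.items():
            if src != block:
                continue
            reached = {gamma(pi, label, k) for pi in poss[block]}
            known = poss.setdefault(dst, set())
            if not reached <= known:      # new patterns found: propagate further
                known |= reached
                worklist.append(dst)
    return poss


poss = possible_histories(EDGES)
print(sorted(format(p, "02b") for p in poss["1"]))  # ['00', '01']
print(sorted(format(p, "02b") for p in poss["2"]))  # ['00', '10']
```

The propagation terminates because each block can gain at most 2^k patterns, and it over-approximates the patterns seen on feasible paths, which is safe for a worst-case analysis.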
We introduce the variables $v_i^\pi$ and $m_i^\pi$: the execution count and misprediction count of block i with history $\pi$.
The variables $v_{start}^\pi$, $v_{end}^\pi$ and $e_{i,j}^\pi$ are defined similarly. From the inflow we get: $v_1^{01} = e_{2,1}^{00} + e_{2,1}^{10}$. Note that the inflow from block start to block 1 is automatically disregarded in this constraint since it cannot produce a history 01 when we arrive at block 1. Also, for the inflows from block 2 the history at block 2 can be either 00 or 10. Both of these patterns produce history 01 at block 1 when control flows via the edge 2 → 1. From the outflows of block 1 with history 01 we have: $v_1^{01} = \sum_j e_{1,j}^{01}$. Constraints for other blocks and patterns are similar.
Repetition of a history pattern. To model the repetition of a history pattern along a program path, the variables $p_{i \to j}^{\pi}$ are introduced (refer to Definition 2). We now present the constraints for the pattern 01. Corresponding to the first and last occurrence of the history pattern 01 we get:
MODELING OTHER SCHEMES
We now discuss the extensions of the technique for modeling other branch prediction schemes. The prediction schemes differ from each other primarily in how they index into the prediction table. To predict a branch I, the index computed can be a function of (a) the past execution trace (history) and (b) the address of the branch instruction I. In the GAg scheme, the index computed depends solely on the history and not on the branch instruction address. Other global prediction schemes (gshare, gselect) use both the history and the branch address. In the popular gshare [12] scheme, the BHR is XOR-ed with the last n bits of the branch address to look up the prediction table. Usually, gshare results in a more uniform distribution of table indices compared to GAg. We define the index $\pi$ as $\pi = history_m \oplus address_n(I)$, where m, n are constants, $n \ge m$, $\oplus$ is XOR, $address_n(I)$ denotes the lower-order n bits of the address of branch instruction I in block i, and $history_m$ denotes the most recent m branch outcomes (which are XOR-ed with the higher-order m bits of $address_n(I)$).
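A minimal sketch of the gshare index computation follows, assuming the m history bits are aligned with the top m bits of the n address bits as the definition states. The function name and the example values are hypothetical.

```python
def gshare_index(history, address, m, n):
    """Index into a 2**n-entry table: history_m XOR address_n(I).

    history: integer holding recent branch outcomes (most recent in bit 0).
    address: address of the branch instruction.
    The m history bits are XOR-ed with the higher-order m bits of the
    n lower-order address bits, which requires n >= m.
    """
    assert n >= m, "gshare needs at least as many address bits as history bits"
    address_n = address & ((1 << n) - 1)   # lower-order n bits of the address
    history_m = history & ((1 << m) - 1)   # most recent m branch outcomes
    return address_n ^ (history_m << (n - m))


# History 10 combined with address bits 1101 (m = 2, n = 4) gives 0101.
print(format(gshare_index(0b10, 0b1101, m=2, n=4), "04b"))  # prints 0101
```

With m = 0 the index degenerates to the address bits alone, and with all address bits equal to zero it degenerates to a shifted copy of the history.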
In gselect (GAp) [18], the BHR is concatenated with the last few bits of the branch address to look up the table. The modeling is similar and is omitted for space considerations.
In local schemes, the index $\pi$ for branch instruction I is the least significant n bits of I's address, denoted $address_n(I)$ (n is a constant). Here $\pi$ is independent of the past execution history of other branches. The update of $\pi$ due to control flow is given by $\Gamma^{local}(\pi, i \to j) = address_n(J)$, where $address_n(J)$ denotes the least significant n bits of the address of the branch instruction J in basic block j.
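The local-scheme update can be sketched as a function that ignores the incoming pattern entirely. The function names and the example address are hypothetical.

```python
def local_index(address, n):
    """Least significant n bits of a branch instruction's address."""
    return address & ((1 << n) - 1)


def gamma_local(pi, address_of_branch_in_j, n):
    """Index at block j after following edge i -> j in a local scheme.

    The incoming index pi is deliberately unused: the index at block j
    depends only on the address of j's own branch instruction.
    """
    return local_index(address_of_branch_in_j, n)


# Whatever the index was at block i, block j always uses the same row.
print(gamma_local(0b000, 0x1A, 3), gamma_local(0b101, 0x1A, 3))  # prints 2 2
```

Each block therefore has exactly one possible index, which makes the set of patterns per block a singleton in the local case.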
EXPERIMENTAL RESULTS
We selected nine different benchmarks for our experiments (refer to Table 1): check, matsum, matmult, fft and fdct are loop intensive programs; isort, bsearch, dhry and eqntott execute hard-to-predict conditional branches arising from if-then-else statements within nested loops.
Methodology. We assumed zero cache misses and a perfect processor pipeline with no stalls except for the penalty due to misprediction of conditional branches. We assumed that the branch misprediction penalty is 3 clock cycles (as in the Intel Pentium processor). We used the SimpleScalar architectural simulation platform [1]. We wrote a prototype analyzer that accepts assembly language code annotated with loop bounds. Our analyzer is parameterized w.r.t. predictor table size, choice of prediction scheme and misprediction penalty. This makes our branch prediction analyzer retargetable w.r.t. various processor micro-architectures. The analyzer first disassembles the code, identifies the basic blocks and constructs the control flow graph (CFG). From the CFG, our analyzer automatically generates the objective function and the linear constraints. These constraints are then submitted to an ILP solver. For our experiments, we used CPLEX [4], a commercial ILP solver distributed by ILOG.
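The constraint-generation step can be illustrated with a short sketch that turns a labeled CFG into a textual objective and flow constraints. It is not the prototype analyzer: the CFG, the costs, the variable naming, and the plain-text output (which is not in any particular solver's input format) are all assumptions, and the history-pattern constraints are left out.

```python
# Hypothetical inputs: a labeled CFG and per-block costs in cycles.
EDGES = {
    ("start", "1"): "U",
    ("1", "2"): "0",
    ("1", "end"): "1",
    ("2", "1"): "1",
    ("2", "end"): "0",
}
COST = {"start": 1, "1": 5, "2": 7, "end": 1}
PENALTY = 3


def objective(cost, penalty, edges):
    """Objective text: block costs plus a penalty term per branching block."""
    branch_blocks = sorted({i for (i, _), label in edges.items() if label != "U"})
    terms = [f"{c} v_{b}" for b, c in cost.items()]
    terms += [f"{penalty} m_{b}" for b in branch_blocks]
    return "maximize " + " + ".join(terms)


def flow_constraints(edges):
    """Text of the start/end constraints and inflow/outflow equalities."""
    blocks = sorted({b for edge in edges for b in edge})
    lines = []
    for b in blocks:
        ins = [f"e_{j}_{i}" for (j, i) in edges if i == b]
        outs = [f"e_{i}_{j}" for (i, j) in edges if i == b]
        if b in ("start", "end"):
            lines.append(f"v_{b} = 1")
        if ins:
            lines.append(f"v_{b} = " + " + ".join(ins))
        if outs:
            lines.append(f"v_{b} = " + " + ".join(outs))
    return lines


print(objective(COST, PENALTY, EDGES))
for line in flow_constraints(EDGES):
    print(line)
```

A real front end would additionally emit the loop bounds and the per-pattern constraints before handing the problem to the solver.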
Accuracy. To evaluate the accuracy of our branch prediction modeling, we present the experiments for three different branch prediction schemes: gshare, GAg and local. Since finding the worst case input of a benchmark (which produces the actual WCET) is a human guided and tedious process, we only measured the actual WCET assuming a 4-entry prediction table. The results appear in Table 2. Even though not shown here due to space shortage, the estimation accuracy was independent of the prediction table size. Our estimation technique obtains a very tight bound on the WCET and misprediction count in all benchmarks except fft. The reason is that the number of iterations of the innermost loop of fft depends on the loop iterator variable value of the outer loops. This can be captured by providing inequalities obtained from data-flow analysis of the loop iterator variables.
Performance. We formulated the timing analysis problem (for the gshare scheme) with larger branch prediction table sizes varying from 32 to 1024 entries. Recall that in gshare, the branch instruction address is XOR-ed with the global branch history bits. In practice, the gshare scheme uses a smaller number of history bits than address bits, and XORs the history bits with the higher-order address bits [12]. The choice of the number of history bits in a processor depends on the expected workload. In our experiments, we used a maximum of 4 history bits as it produces the best overall branch prediction performance across all our benchmarks.
On a Pentium IV 1.3 GHz processor with 1 GByte of main memory, our timing estimation technique requires less than 0.5 second for all the benchmarks.
RELATED WORK
Little work has been done to study the effects of branch prediction on a program's execution time. Effects of static branch prediction have been investigated in [2, ...]. However, most current day processors (Intel Pentium, AMD, Alpha, SUN SPARC) implement dynamic branch prediction schemes, which are more difficult to model. To the best of our knowledge, [3] is the only other work on timing estimation under dynamic branch prediction. Their technique is similar to cache modeling techniques [5] and cannot be used to model global branch prediction schemes.

Table 2: Observed and estimated WCET (in number of processor cycles) and misprediction count with gshare, GAg, and local schemes.
Using Integer Linear Programming (ILP) for WCET analysis is not new. In particular, [9] reduced the WCET analysis of instruction cache behavior to an ILP problem. In [17], ILP has been used for program path analysis subsequent to abstract interpretation based micro-architectural modeling of instruction cache, pipelines, etc.
CONCLUSIONS AND FUTURE WORK
In this paper, we presented a framework to measure the effects of speculative execution on the Worst Case Execution Time of a program. Our modeling extends existing work on modeling static branch prediction [2] and uniformly captures various dynamic branch prediction schemes (which are used in both general-purpose and embedded processors [8, 13]). Using our technique, we have obtained tight timing estimates for benchmark programs under various branch prediction schemes. In the future, we plan to integrate our modeling with existing micro-architectural modeling of pipeline and cache for analyzing program execution time.
ACKNOWLEDGMENTS
This work was partially supported by National University of Singapore research grant R-252-000-088-112.
