Frequent control dependencies caused by IF-and loop- 
Introduction
The hardware/software cosynthesis system COSYMA [1] takes C programs from the embedded control domain as input and speeds them up on a combination of a programmable processor (e.g. SPARC) and an application-specific hard-wired coprocessor. Analysis of input programs, partitioning into SW and HW, and generation of the coprocessor by high-level synthesis are done automatically. In this paper we focus on the high-level synthesis part of COSYMA. Inputs are small parts of C programs, typically loops [2]. In the examples, which we took from different areas, "IF"-statements and loops with a data-dependent number of iterations occur frequently. These control dependencies seriously limit the potential parallelism.
While early scheduling algorithms focused on distributing operations within a basic block, emphasis has since shifted to scheduling across basic-block boundaries. Path-based scheduling ([3], improved in [4, 5]) schedules paths of an execution rather than basic blocks, but it does not solve the problem of control dependencies. Speculative computation (SC) potentially reduces the control dependencies by allowing an operation to be executed before it is known to be necessary. It is used in percolation scheduling [6], as part of global list scheduling [7, 8], or in other scheduling approaches [9, 10].
None of the known approaches uses the full potential of SC in the context of loops, which are typically most critical to circuit speed-ups. In [11], multiple branch prediction and error correction (MBP-SC) is employed as an SC technique with low circuit overhead. Given a loop with profiling information, MBP-SC predicts the path through the loop body that is most likely to be taken. This path is scheduled using loop pipelining (LoopP) [12], whereby operations belonging to other paths are deferred. If the path is predicted incorrectly, execution switches to a restore phase (prediction error correction). The loop itself is always predicted to continue rather than to terminate. HW overhead due to prediction error correction remains low [11].
In this paper we also combine MBP-SC with earlier techniques of parallel path execution [9] and apply it to LoopP. Compared to earlier SC techniques, it can also reduce memory access. Furthermore, we present the scheduling algorithm for MBP-SC; scheduling was previously done manually.
The rest of the paper is structured as follows. Chapter 2 explains MBP-SC using an example and chapter 3 gives the exact definition and scheduling approach. Results for practical examples are given in chapter 4.
Approach & Example
MBP-SC scheduling is explained using the printer context example in figure 1, given in a C-like description. z[] contains a sparse data structure with 16-bit records describing the code and position of characters on a single printer row. There are two fields with variable length, the first given in a Huffman code with
The CDFG is shown in figure 2. The while-condition is described by the two small multiplexers at the bottom and all CP's pointing from the comparison "<". If it is false (value "0"), no operations are executed and the loop terminates. The 3-input multiplexers and all remaining CP's represent the "switch" statement. The value of statement "case (0x4000)" is written here as "01". Value "1-" represents the "default" statement and is the complement of "01" and "00".
Node links are used to describe a constant shift and the extension of bit-widths from 16 to 32 bit. They only restructure wires and need no execution time. Access to array z[i] is done by a RAM access with operation
Now, we make use of MBP-SC to speed up the loop. As in [11], we may predict the most probable path (the one with the highest probability of being taken). It goes through the statement "case (0x0000)" and is taken in 50 iterations (prediction accuracy p = 0.5). As long as this prediction is correct, a new iteration may be started every clock cycle, which yields a speed-up of 10 compared to the schedule without SC. From [11] we know that the necessary restore phase decreases the speed-up. Here, the probability of a prediction error correction is very high (p = 0.5) and significantly limits the potential speed-up from 10 (p = 1) to 2.778.
That was the result of our previous work [11]. In the following we give an improvement and, for the first time, present a suitable scheduling algorithm.
The idea of multiple branch prediction, as used in $CS_i = \bigvee c_j$.
This way, the prediction accuracy increases to
In our example, the accuracy of the prediction set $CS_1$ = {"00", "1-"} is (50 + 40)/99 = 0.91. $CS_i$ may now contain contradictory predictions.
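The accuracy of a prediction set can be recomputed from profile counts. The following is a minimal sketch, not the authors' tool: the function name `prediction_accuracy` and the `PROFILE` table are illustrative assumptions, and the count of 9 for alternative "01" is inferred here as 99 - 50 - 40 rather than stated in the text.

```python
from fractions import Fraction

# Profile counts per switch alternative over the 99 executed loop bodies.
# 50 ("00") and 40 ("1-") come from the text; 9 ("01") is inferred as 99 - 50 - 40.
PROFILE = {"00": 50, "01": 9, "1-": 40}


def prediction_accuracy(prediction_set, profile):
    """Fraction of profiled iterations whose alternative lies in the predicted set."""
    total = sum(profile.values())
    hits = sum(count for alt, count in profile.items() if alt in prediction_set)
    return Fraction(hits, total)


cs1 = {"00", "1-"}
acc_cs1 = prediction_accuracy(cs1, PROFILE)           # (50 + 40) / 99
acc_all = prediction_accuracy(set(PROFILE), PROFILE)  # all alternatives predicted

print(acc_cs1, round(float(acc_cs1), 2))
print(acc_all)
```

Predicting every alternative trivially gives accuracy 1, at the cost of executing more operations speculatively; a smaller set trades accuracy against that overhead.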
Some of the operations may have to be executed on all predicted paths or, if not, they might at least not lead to incorrect results on other paths. Only operations where this does not hold must not be executed or must be corrected. Any set of predictions $CS_i$ defines a 3-way partition of the CP's into those CP's where the boolean expression is true, $CS_{i,T}$, those where it is false, $CS_{i,F}$, and all other CP's, where the prediction is not sufficient to decide on the value, $CS_{i,U}$. Now we will compute the corresponding schedule. In the beginning all conditions are predicted according to $CS_1$ = {"00", "1-"}.
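The 3-way partition can be sketched as a set comparison. This is an illustration under stated assumptions, not the paper's implementation: each CP is modelled only by the bit pattern under which it is active, "-" is treated as a don't-care bit, and the CP names (`cp_case00`, `cp_any`, ...) are hypothetical.

```python
from itertools import product


def expand(pattern):
    """Expand a pattern such as "1-" into the concrete bit strings it covers."""
    choices = [("0", "1") if ch == "-" else (ch,) for ch in pattern]
    return {"".join(bits) for bits in product(*choices)}


def partition_cps(cps, prediction_set):
    """Split CPs into (true, false, undecided) sets with respect to a prediction set.

    cps maps a CP name to the pattern under which that CP is active.
    """
    predicted = set().union(*(expand(p) for p in prediction_set))
    true_cps, false_cps, undecided = set(), set(), set()
    for name, pattern in cps.items():
        active = expand(pattern)
        if predicted <= active:
            true_cps.add(name)       # active on every predicted alternative
        elif predicted.isdisjoint(active):
            false_cps.add(name)      # active on no predicted alternative
        else:
            undecided.add(name)      # prediction does not decide the value
    return true_cps, false_cps, undecided


# Hypothetical CPs for the switch condition of the example.
cps = {"cp_case00": "00", "cp_case01": "01", "cp_default": "1-", "cp_any": "--"}
t, f, u = partition_cps(cps, {"00", "1-"})
print(sorted(t), sorted(f), sorted(u))
```

With the prediction set {"00", "1-"}, only the CP for "01" falls into the false class, which matches the operations that the text removes from the restricted CDFG.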
Due to the predictions, the CDFG may be simplified, or "restricted" as we will say in the sequel. All operations with a $cp \in CS_{1,F}$ pointing to them are removed from the CDFG. The result is the CDFG B in figure 3. The upper "++" operation is removed together with the CP pointing towards it, because the condition "01" is not valid with respect to $CS_1$; the same holds for the operations "++", "RD" and "+" on the left. The multiplexer inputs "01" are removed. If a prediction error occurs and MUX input "01" turns out to be correct, the multiplexer outcome will no longer be valid. Therefore, its "scope" is restricted to the predicted alternative $CS_1$. The scope of any operation which reads the multiplexer output also becomes restricted. The scope is forwarded through data dependencies as far as possible. As a result, the scope of the remaining increment is also restricted. The two inputs of the upper multiplexer read the same operation and are reduced to a single one. Because only one input remains, the complete multiplexer is removed from the CDFG and its (data) input and output are connected directly. This also happens to the 2-input multiplexers after removal of input "0". All remaining CP's with $cp \in CS_{1,T}$, i.e. those always true with respect to $CS_1$, are removed. CP's with $cp \in CS_{1,U}$ stay in the CDFG but with the modified boolean expression $f|_{CS_1}$. The remaining CDFG is scheduled using LoopP and yields a latency of two clock cycles.
As long as no prediction error occurs, the next iteration is started every two clock cycles. This happens with a probability of 99/100 * 90/99 = 0.9. That is the second main advantage of MBP-SC: LoopP can be applied to the predicted sets of program paths. The first prediction error may be detected after the fifth clock cycle (operation "&" is executed during the third clock cycle, but the controller delay adds two additional cycles). If "01" is true, a prediction error occurs. Then, two things happen. First, the knowledge about conditions increases: condition "&" is now known to be "01". Second, the executed CDFG/schedule is no longer sufficient because it was restricted to $CS_1$. A new CDFG must be determined according to the known conditions and the previously executed schedule(s). It is given as CDFG C in figure 3. There, the "<" and other operations are removed because they were already executed in schedule B. The lower increment is not removed because its scope in schedule B is not sufficient. The remaining CDFG elements are scheduled without LoopP, because we assume $CS_1$ to be true again in the next iteration. The small "W" operations write the correct values of variables i and a into those registers where schedule B assumes them to be.
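The probability quoted above is the product of the accuracies of the two predicted conditions. A short worked check follows; the helper name `no_error_probability` is illustrative, and the expected run length at the end additionally assumes that iterations are independent, which the text does not state.

```python
from fractions import Fraction


def no_error_probability(condition_accuracies):
    """Probability that every predicted condition of one iteration is predicted correctly."""
    prob = Fraction(1)
    for accuracy in condition_accuracies:
        prob *= accuracy
    return prob


p_while = Fraction(99, 100)   # loop predicted to continue
p_switch = Fraction(90, 99)   # switch predicted to lie in {"00", "1-"}
p = no_error_probability([p_while, p_switch])
print(p)

# Assumption (not from the text): independent iterations, so the number of
# iterations up to and including the first prediction error is geometric.
expected_iterations_until_error = 1 / (1 - p)
print(expected_iterations_until_error)
```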
Only the first clock cycle of schedule C is executed. Then, the controller responds to the next condition ("<" from schedule B). It is a concept of MBP-SC to fork within prediction error correction every time a new condition is computed and a prediction error may arise. A separate CDFG and schedule is computed for every branch, and each will itself fork until all conditions are known. Here, schedule C forks into schedules D (with condition "<" known to be "1") and E ("<" known to be "0"). Schedule D needs three clock cycles and then jumps back to the beginning of schedule B. The loop is restarted.
The other branch, E, is taken when the loop terminates ("<" = "0"). Although its CDFG contains no elements, the schedule needs one clock cycle to write the result values a and i to registers. This is necessary, because the loop may terminate within schedules E or
One key idea of MBP-SC is to restrict the original CDFG to predicted and known conditions and then apply a conventional scheduling algorithm to it, for
A condition may have the attribute "unchanged", "predicted" or "known". With "unchanged" no MBP-SC is applied. Such conditions are not considered in V and always remain "unchanged". All other conditions will be predicted to some alternatives. Before the condition operation is computed, the condition is "predicted"; afterwards it is "known". The controller delay is taken into account.
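The three condition attributes and the effect of the controller delay can be sketched as a small status function. The names `Status` and `status_at` are illustrative assumptions; the numbers in the usage lines are those of the "&" condition in the example (executed in the third cycle, controller delay of two cycles, hence known after the fifth cycle).

```python
from enum import Enum


class Status(Enum):
    UNCHANGED = "unchanged"   # MBP-SC is not applied to this condition
    PREDICTED = "predicted"   # condition value not yet visible to the controller
    KNOWN = "known"           # condition computed and controller delay elapsed


def status_at(cycle, exec_cycle, controller_delay, uses_mbp_sc=True):
    """Status of a condition at the end of clock cycle `cycle` (cycles count from 1)."""
    if not uses_mbp_sc:
        return Status.UNCHANGED
    if cycle >= exec_cycle + controller_delay:
        return Status.KNOWN
    return Status.PREDICTED


# Condition "&" of the example: executed in cycle 3, controller delay of 2 cycles.
print(status_at(4, exec_cycle=3, controller_delay=2))
print(status_at(5, exec_cycle=3, controller_delay=2))
```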
Let cp be a CP with expression v, pointing from operation a to b. m denotes a multiplexer controlled
The scope of block-multiplexers is never restricted.
Rule 3:
If m has only one data-input, then directly connect it with the output and remove m from the CDFG.
Rule 4:
If m has no data-inputs, then remove m. 
Rule 8:
If operation a has a side effect and its scope is restricted then insert CP's into the CDFG according to the restriction. (The cutset of the conditional expressions of these CP's must be identical to scope(a)). This rule assures that a is executed only with valid input data.
Rules may be applied several times. Rule 2 has priority over rules 3 to 8. Rule 8 is applied when no other rule matches. The restriction is complete when no rule matches any more.
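Repeated rule application until nothing matches is a fixpoint iteration. The sketch below covers only the multiplexer part and rests on several assumptions: inputs selected by a false condition are dropped (as described for the example), inputs reading the same operation are merged, and Rules 3 and 4 then apply. The data layout, the function name `restrict_muxes`, the handling of inputs that read a removed multiplexer, and the multiplexer names are all hypothetical; scopes and the remaining rules are not modelled.

```python
def restrict_muxes(muxes, false_conditions):
    """Simplify multiplexers until no rule matches.

    muxes maps a mux name to {select_value: source_name}.
    Returns (remaining muxes, bypassed muxes, removed muxes).
    """
    muxes = {name: dict(inputs) for name, inputs in muxes.items()}
    bypass = {}      # mux removed by Rule 3 -> source now wired to its output
    removed = set()  # muxes removed by Rule 4 (no data inputs left)
    changed = True
    while changed:
        changed = False
        for name in list(muxes):
            inputs = muxes[name]
            # Drop inputs selected by conditions that are false under the prediction.
            for sel in [s for s in inputs if s in false_conditions]:
                del inputs[sel]
                changed = True
            # Follow earlier simplifications of multiplexers feeding this one.
            for sel, src in list(inputs.items()):
                if src in bypass:
                    inputs[sel] = bypass[src]
                    changed = True
                elif src in removed:   # assumption: such an input has no data left
                    del inputs[sel]
                    changed = True
            sources = set(inputs.values())  # inputs reading the same operation merge
            if len(sources) == 1:           # Rule 3: connect input and output directly
                bypass[name] = sources.pop()
                del muxes[name]
                changed = True
            elif not sources:               # Rule 4: no data inputs, remove the mux
                removed.add(name)
                del muxes[name]
                changed = True
    return muxes, bypass, removed


muxes = {
    "m_lower": {"00": "m_upper", "01": "add", "1-": "shift"},
    "m_upper": {"00": "inc", "01": "inc_alt", "1-": "inc"},
    "m_dead": {"01": "rd"},
}
remaining, bypass, removed = restrict_muxes(muxes, false_conditions={"01"})
print(remaining, bypass, removed)
```

In this run `m_lower` is visited before `m_upper` has been bypassed, so a second pass is needed to rewire it, which illustrates why the rules have to be applied repeatedly.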
The overall schedule is constructed beginning with the periodic schedule (example: B). All conditions have either status "predicted" or "unchanged". Only this schedule may use LoopP; it is executed until the first prediction error is detected. If this happens in iteration n, then (1) abort the already running iterations n + 1, n + 2, ..., (2) apply prediction error correction (restore phase) to iteration n and then (3) either restart the periodic schedule with iteration n + 1 or abort the whole loop. For each possible prediction error within the periodic schedule, a separate CDFG and schedule is computed. These schedules are aborted as soon as the next condition changes its status from "predicted" to "known". The control flow forks regardless of whether a prediction error occurs or not (example: this happens after the first clock cycle of schedule C).
Average run time, HW amount and controller complexity are affected by the CP-partition. Determination of the optimal 3-way partition is a combinatorial problem. Due to limited space we will deal with its computation in a later paper. Allocation is performed as in [11].
Results
We applied MBP-SC to several inputs which are the most run-time intensive fragments from real-world programs of embedded control tasks, mostly video applications. "quick" is the inner loop of a quicksort algorithm. Lengths vary between 3 and 29 lines of C code (without comments), containing 1 to 4 loops and, on average, 3 IF-statements. HW is limited to 5 ALUs, 1 multiplier, 1 shifter and 1 RAM. Access to RAM and multiplication need two clock cycles each; all other operations need one. Up to five 2-input multiplexers may be chained with another operation within one clock cycle. A pipelined controller is used with the same delay as discussed above in the example.
Table 1 shows the results. If LoopP is used within BBS (basic block scheduling), the schedules become faster. With MBP-SC, LoopP is more effective and yields faster results. In column "MBP-SC/all", all paths are always predicted. This corresponds to SC with LoopP without branch prediction. In three cases better results are obtained if prediction is limited to a subset of paths (column "MBP-SC/set").
Table 2 shows results for some of the programs from the HLS workshop benchmark 91/92. The last two columns give the number of clock cycles necessary for execution of the loops (either for a fixed number of iterations denoted by n or, if the loop never stops, as the average number per iteration). "gcd" and "counter" are simple examples, where the fastest results are obtained when all paths are predicted and the correct result is selected by a multiplexer. If HW is restricted to two ALUs for "display", the digit "secs" is predicted to "increment" (and not to "overflow"). With four ALUs, "%secs" is predicted to "increment" and both paths of "secs" are predicted. With eight ALUs all paths are predicted.
Simple optimizations (use of common expressions, determination of constant expressions outside the loop, simplification of the expressions "temp6a" and "(a>counter) XOR (a<counter)") are applied to "fancy" before scheduling. All paths are predicted.
Summary and Conclusions
We have presented the scheduling algorithm of MBP-SC, which is used within a HW/SW cosynthesis system to speed up small fragments of C programs. It combines multiple branch prediction and parallel path execution with loop pipelining. As a main advantage, MBP-SC allows any set of possible paths to be predicted and thereby enables a tradeoff between branch prediction accuracy and the number of speculatively executed operations.
The schedule is computed by restricting the CDFG to predicted and already executed conditions. Thus, management of prediction and error correction is separated from the underlying basic scheduling algorithm. Exact rules are given for this management.
Real-world inputs show that MBP-SC significantly improves loop pipelining and gives better results than simple SC techniques which always predict all paths. 
