Abstract: Branches are a major limiting factor to instruction-level parallelism. One solution is to execute several branches simultaneously using multiway branching architectures. Such architectures are especially important when the instruction issue width becomes large. The authors study the problem of compile-time scheduling of branch operations on such architectures: an optimisation called branch merging. The scheduling attempts to bring profitable branches together for concurrent execution. It is shown that finding the optimal solution to the branch merging problem is NP-hard. A heuristic is then proposed, which relies on a cost model to direct the merging of branches and their associated basic blocks. Merged branches are then scheduled together for concurrent execution. The authors used simulation to evaluate the effectiveness of the proposed algorithm. Experiments on selected benchmark programs show that the heuristic achieves roughly a 10% performance improvement on multiway branching architectures.
Introduction
Software techniques for exploiting instruction level parallelism (ILP) have made tremendous progress in the past decade and are expected to continue improving system performance in the near future [11]. In the course of exploiting more ILP, branches remain one of the main obstacles. It is true that global optimisation techniques, such as trace scheduling [6] and percolation scheduling [17], are able to look beyond branches and move operations across basic blocks to increase ILP. The problem is that applying these optimisation techniques tends to cluster branches together [13] and increase the relative frequency of branches in the instruction stream [15] [Note 1]. Furthermore, most existing processors execute branch operations in sequence and use branch prediction to exploit ILP speculatively [9]. As the instruction issue width becomes wider and the density of branches becomes higher, several branches may be encountered in the same cycle and several predictions may be unresolved. It follows that sequential execution of branches will become a bottleneck. Also, the misprediction penalty will become very high due to deep prediction. To handle branches more effectively, one viable alternative is to execute multiple branches simultaneously [2, 16, 18].
© IEE, 1996. IEE Proceedings online no. 19960822. Paper first received 20th December 1995 and in revised form 15th July 1996. The authors are with the Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, Republic of China.
Note 1: Typical optimisation techniques are more effective in eliminating memory references and ALU operations rather than branches [9].
An architecture which is capable of evaluating multiple branches at the same time and jumping to one of several branch targets is referred to as a multiway branching architecture. More specifically, such an architecture is an $m$-way branching architecture if it can jump to one of $m$ different targets. Traditional uniprocessors can be viewed as having a two-way branching architecture. In the past, multiway branching architectures were employed mainly in VLIW processors. With increasing instruction issue width and merging of VLIW and superscalar architectures, we believe that multiway branching will become more and more important in high performance processor designs. Various multiway branching architectures have been proposed [1, 3-5, 7, 12, 13, 16, 18]. Fisher first considered multiway branching hardware for horizontally microcodable machines [5]. The proposed $2^n$-way branching hardware was able to evaluate the most general decision tree, which consists of $n$ conditions, $2^n - 1$ tests, and $2^n$ branch targets (see Fig. 1). The $2^n$ branch targets were stored in the same location of $2^n$ memory banks and were fetched at the same time. An $n$-to-$2^n$ decoder, controlled by the $n$ condition fields in the microcode, chose one out of the $2^n$ fetched words as the address for the next microinstruction. The ELI-512 proposed in [7] had a similar multiway branching unit, except that it only handled a limited if-then-else construct to reduce the hardware cost.
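The $2^n$-way selection performed by the decoder in Fisher's hardware can be illustrated with a minimal sketch. The function name `select_target` and the bit ordering (first condition is the most significant index bit) are assumptions made for illustration only; the original hardware description does not fix them.

```python
def select_target(conditions, targets):
    """Pick one of 2**n fetched branch targets from n condition bits.

    Assumption: conditions[0] is the most significant bit of the index.
    """
    n = len(conditions)
    if len(targets) != 2 ** n:
        raise ValueError("need exactly 2**n targets for n conditions")
    index = 0
    for bit in conditions:
        # Shift in one condition bit at a time, as an n-to-2**n decoder would.
        index = (index << 1) | (1 if bit else 0)
    return targets[index]
```

For two conditions, `select_target([True, False], ["t0", "t1", "t2", "t3"])` selects `"t2"`, i.e. one jump among four targets decided in a single step.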
The TRACE computer by Multiflow [3] incorporated a branching mechanism which was able to evaluate four independent tests simultaneously. A software controlled priority scheme arbitrated among the four tests. A set of condition code registers, called the branch bank, was provided to hold the results of condition evaluations. In this way, condition evaluations were independent of actual tests and branches. A similar idea was adopted in later VLIW systems [1, 4, 12]. The system proposed in [12] also considered the combining of multiway branching with predicated execution. Multiway branching can be applied to superscalar processors as well [18]. The hardware now has to collect multiple branch operations, determine their priorities, and also check for possible dependencies. Results reported in this paper are applicable to both superscalar and VLIW architectures.
Previous works on multiway branching mostly concentrated on the design of the architecture itself. The problem of scheduling multiple branches to run on such architectures is often overlooked. For example, suppose we want to schedule the decision tree shown in Fig. 1 to a four-way branching unit. Since we can have two branch operations in the same cycle, is scheduling cc1 and cc2 together the most beneficial way? Or cc2 and cc3 together? Questions like these were not addressed previously.
In this paper, we investigate the branch merging problem. Given a set of branches and associated basic blocks, the problem is to group these branches so that branches in the same group can be executed in parallel on a multiway branching architecture. The goal is to minimise the execution cycles of the resultant code. This optimisation will be referred to as branch merging. The underlying multiway branching architecture is assumed to be able to evaluate general decision trees such as that shown in Fig. 1 . Other decision tree structures can be treated as special cases.
We will show that finding the optimal solution for the general branch merging problem is NP-hard. A heuristic is then proposed to solve the problem. The heuristic relies on a cost model and profile information to direct the merging of branches. Optimisation techniques such as superblock scheduling [10] and percolation scheduling can be used in conjunction with our heuristic to further improve the code quality.
2 Motivating example
In this Section, we use an example to motivate our discussion on the branch merging problem. Basic ideas of branch merging and necessary definitions are introduced along with the example. Constraints of branch merging are described later in Section 2.2.
I
The example considered in this section is the nested if-statement shown below:
<Example 1>:
In a CDDG, there is a solid arc from a node B1 to another node B2 if the execution of B2 follows and/or is determined by the operations in B1. Thus, a solid arc will be called a control dependence arc. A control dependence arc may be associated with a value to indicate the probability that the arc will be taken during execution. This probability may be obtained from profile information. On the other hand, a dashed arc from B1 to B2 means that there is a data dependence between an operation in B1 and another in B2. Such an arc will be called a data dependence arc.
Now, suppose we want to execute Example 1 on a processor with one add/subtract unit, one load/store unit, one multiply/divide unit and one four-way branching unit. The latencies for add, subtract, multiply, divide, load, store, and branch operations are 1, 1, 2, 2, 1, 1, and 1 cycles, respectively. For simplicity, we assume that the system has a perfect cache and the compiler performs local register allocation only. Consider the CDDG in Fig. 2. To execute the operations in the test node T1, we have to load the two variables, a and b, and then perform the multiplication and branch. Thus, the execution time of T1 is equal to 1 + 1 + 2 + 1 = 5 cycles. The execution times of T2 and T3 can be obtained similarly. The average number of cycles spent in the three test nodes is thus equal to 5 + 5 x 0.9 + 4 x 0.9 x 0.9 = 12.74.
Since we have a four-way branching unit in the processor, we would like to merge two among the three test nodes for concurrent evaluation. If nodes T1 and T2 are merged, it will take one cycle each to load a and b. In the next two cycles, while a and b are being multiplied, f and g can be loaded. We need two more cycles to perform the division of f by g. Finally, in the seventh cycle, multiway branching is performed. As a result, the average number of cycles for evaluating the three tests can be reduced to 7 + 4 x 0.9 x 0.9 = 10.24. We now have a new CDDG containing a test node T1,2, which is a three-way branch (see Fig. 3). We say that T1,2 has a branch factor of 3. Note that in order to merge T1 and T2, code duplication is necessary [6, 10].
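The seven-cycle schedule of the merged node T1,2 can be reproduced with a small list scheduler. This is only a sketch under stated assumptions: the function `schedule`, the unit names, and the non-pipelined treatment of functional units are illustrative choices, not the scheduler used by the authors. The operations and latencies are those given in the example.

```python
def schedule(ops, units):
    """Greedy list scheduling; returns the cycle in which the last operation finishes.

    ops:   (name, unit, latency, deps) tuples, listed in priority order,
           with every dependence listed before the operation that needs it.
    units: number of functional units of each kind.
    Assumption: units are not pipelined (busy for the whole latency).
    """
    finish = {}                                        # name -> last busy cycle
    free_at = {u: [1] * k for u, k in units.items()}   # next free cycle per unit
    for name, unit, latency, deps in ops:
        ready = 1 + max((finish[d] for d in deps), default=0)
        slots = free_at[unit]
        s = slots.index(min(slots))                    # earliest available unit
        start = max(ready, slots[s])
        finish[name] = start + latency - 1
        slots[s] = finish[name] + 1
    return max(finish.values())

# Processor of the example: one unit of each kind.
UNITS = {"ls": 1, "md": 1, "as": 1, "br": 1}

T1 = [("ld a", "ls", 1, []), ("ld b", "ls", 1, []),
      ("a*b", "md", 2, ["ld a", "ld b"]), ("br", "br", 1, ["a*b"])]

T1_T2 = [("ld a", "ls", 1, []), ("ld b", "ls", 1, []),
         ("a*b", "md", 2, ["ld a", "ld b"]),
         ("ld f", "ls", 1, []), ("ld g", "ls", 1, []),
         ("f/g", "md", 2, ["ld f", "ld g"]),
         ("br", "br", 1, ["a*b", "f/g"])]

print(schedule(T1, UNITS), schedule(T1_T2, UNITS))  # prints: 5 7
```

The loads of f and g overlap with the multiplication of a and b, which is why the merged node needs 7 cycles rather than the 10 cycles of evaluating T1 and T2 back to back.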
Alternatively, we can merge T1 and T3 as shown in Fig. 4. The time for test evaluation can be further reduced to 5 + 5 x 0.09 + 5 x 0.81 = 9.5 cycles. From this example, we can see that choosing an optimal pair of branches for merging is very important.
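The three averages quoted in this example are probability-weighted sums of node execution times. The following sketch simply recomputes them from the (cycles, probability) terms stated in the text; the helper name `expected_cycles` is hypothetical.

```python
def expected_cycles(nodes):
    """Sum of cycles weighted by the probability that each node is executed."""
    return sum(cycles * prob for cycles, prob in nodes)

# (cycles, probability) terms exactly as quoted in the text.
sequential = expected_cycles([(5, 1.0), (5, 0.9), (4, 0.9 * 0.9)])
merge_t1_t2 = expected_cycles([(7, 1.0), (4, 0.9 * 0.9)])
merge_t1_t3 = expected_cycles([(5, 1.0), (5, 0.09), (5, 0.81)])

print(f"{sequential:.2f} {merge_t1_t2:.2f} {merge_t1_t3:.2f}")  # prints: 12.74 10.24 9.50
```

Relative to sequential evaluation, merging T1 with T2 saves 2.5 cycles on average, while merging T1 with T3 saves 3.24 cycles.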
2.2 Constraints of branch merging
To find suitable branches for merging, we have to rearrange the execution order of the branch operations. Thus constraints for code scheduling are applicable to branch merging. In the following, we briefly summarise the constraints.
Data dependency:
Data dependency must be respected at all times. Consider the CDDG shown in Fig. 5. There is a RAW dependence from the data node D1 to the test node T2. Thus, moving T2 upward to merge with T1 will miss the assignment 'b = 1' in D1 and might make a wrong branch decision. Since WAW and WAR dependences can be resolved by variable renaming at compile time, we only consider RAW dependences in the following discussion.
In restricted percolation scheduling, all the instructions that may generate execution-altering exceptions can only be placed in their home basic blocks. With this constraint no new exception will be introduced due to code scheduling, but the number of instructions that can be moved across basic blocks is severely restricted. The general percolation scheduling relaxes the constraint by assuming the availability of a nontrapping version of instructions. This greatly increases the number of instructions that can be speculated. Although this approach has problems in detecting all exceptions caused by speculated instructions [14, 19], it remains an effective optimisation, especially when the program is known to be correct.
Percolation scheduling considers all operations in the program simultaneously instead of those among the main traces as in trace scheduling. The scheduled program is first transformed into tree graphs. Then the operations are moved upward as much as possible according to the dependency relationships. Code motions are controlled by four transformation rules: delete, move operation, move jump, and unification. In general, percolation scheduling achieves better performance than trace scheduling.
3 Branch merging problem
In this paper we assume that the underlying architecture is able to perform multiway branching on general decision trees. Since such decision trees are complete binary trees, the number of branch targets is always a power of two. Therefore, in the following discussions we will focus on $2^k$-way branch merging for $k$ conditions. Given a CDDG, let $n$ be the total number of test nodes in the graph. Denote these test nodes as $T_1, T_2, \ldots, T_n$. Let $S_k$ be the power set of $\{1, 2, \ldots, n\}$, in which $T_{i_1}, T_{i_2}, \ldots, T_{i_r}$, where $x = \{i_1, i_2, \ldots, i_r\} \in S_k - \emptyset$. The costs are obtained through a cost model described in the next Section. The $2^k$-way branch merging problem, defined informally, is to find a mutually exclusive grouping of the test nodes so that, when the nodes in each group are merged and executed together by the multiway branching architecture, the total cost is minimised. A more formal definition is given below.
For any given instance of the problem, the transformation to the general $2^k$-way branch merging problem is as follows. Define $C = \{c_x \mid x \in S - \emptyset,\ |x| \le B\}$, where $S$ is the power set of $\{1, 2, \ldots, n\}$. We have:
It is easy to show that there exists a solution to the transformed instance of the general $2^k$-way branch merging problem if and only if there is a solution to the minimum sum-of-squares problem. Note that, in the transformation, $B = k$ and $J$ corresponds to the merging cost. Since the transformation can be done in polynomial time, the general $2^k$-way branch merging problem is NP-hard.
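To make the size of the search space concrete, the informal problem statement (a mutually exclusive grouping of test nodes with minimum total cost) can be solved by exhaustive enumeration for very small inputs. The sketch below is an illustration only: the names `partitions` and `best_grouping`, the `max_group` bound on group size, and the caller-supplied `cost` function are all assumptions, and the toy costs in the usage note are hypothetical rather than taken from the paper.

```python
from itertools import combinations
import math

def partitions(items, max_group):
    """Yield every partition of items into groups of at most max_group elements."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # Choose which later items share a group with the first item.
    for extra in range(0, min(max_group - 1, len(rest)) + 1):
        for mates in combinations(rest, extra):
            remaining = [x for x in rest if x not in mates]
            for tail in partitions(remaining, max_group):
                yield [(first,) + mates] + tail

def best_grouping(n, max_group, cost):
    """Exhaustively find the grouping of test nodes 1..n with minimum total cost."""
    best, best_cost = None, math.inf
    for grouping in partitions(list(range(1, n + 1)), max_group):
        total = sum(cost(group) for group in grouping)
        if total < best_cost:
            best, best_cost = grouping, total
    return best, best_cost

# Hypothetical costs for three test nodes, groups of at most two nodes.
toy = {(1,): 5.0, (2,): 4.5, (3,): 3.0, (1, 2): 7.0, (1, 3): 5.0, (2, 3): 9.0}
print(best_grouping(3, 2, lambda g: toy[g]))  # prints: ([(1, 3), (2,)], 9.5)
```

The number of candidate groupings grows very quickly with the number of test nodes, which is consistent with the need for a heuristic once the problem is known to be NP-hard.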
4 $2^k$-way branch merging algorithm
To solve the branch merging problem efficiently at compile time, we resort to heuristics. In this Section, the heuristic to solve the $2^k$-way branch merging problem is presented.
4.1 Algorithm
The proposed algorithm takes as inputs the given CDDG (representing a nested if-statement) and an architecture specification (representing the hardware constraints). The target system is assumed to support $2^k$-way branching. A cost model, which is detailed in the next subsection, is used to record the cost of merging two test nodes. Merging the test nodes may cause the corresponding data nodes to be combined. The cost model also takes this cost into account. If two test nodes are not mergeable, then the cost model will set the corresponding merging cost to infinity. Note that two test nodes are mergeable if their merge does not violate the hardware constraints and the merging constraints listed in Section 2. A cost matrix will be generated from the cost model. Table 1 shows one example. Suppose a CDDG has $n$ test nodes, $T_1, T_2, \ldots, T_n$. Its cost matrix $C$ is an $n \times n$ matrix. Entries in $C$ are divided into two categories: an entry $c_{ij}$, where $i < j$, represents the cost of merging the test nodes $T_i$ and $T_j$; an entry $c_{ji}$, where $i < j$, represents the cost of combining the data nodes associated with $T_i$ and $T_j$. A negative value means that the merge reduces the execution time. Other factors such as the size of code expansion can be incorporated into the algorithm by adjusting the cost model. Finally, the cost model also contains algorithm termination conditions, as will be shown in the next subsection.
Our $2^k$-way branch merging algorithm uses a greedy strategy to iteratively combine two mergeable test nodes. The merging must, of course, result in a reduced cost. In each iteration, only one pair of test nodes is merged. The merged node is also a candidate for merging in the next iteration. A more formal description of the algorithm is given below.
Branch merging algorithm:
/* C is the cost matrix and c_ij is an entry of C. */
/* n is the number of test nodes in the CDDG. */
Begin
Step 1: For i = 1 to n do
          For j = 1 to n do
            c_ij is computed according to the cost model.
Step 2: For i = 1 to n do
          For j = i + 1 to n do
            If c_ij + c_ji < 0 then c_ij = c_ij + c_ji
            else c_ij = ∞;
        /* Consider the combined cost of test and data nodes and only use those entries with a negative value (i.e. a gain) */
Step 3: Select c_pq such that c_pq = min{c_ij | c_ij < 0, 0 < i, j ≤ n, and i < j};
        /* Find two test nodes which can be merged with the minimum cost */
Step 4: If (Step 3 finds a mergeable pair), then merge test nodes p and q and go to Step 1. Otherwise, output the CDDG and terminate.
End
In Step 1, the entries of the cost matrix are computed according to the cost model in Section 4.2. Calculation of the costs requires code scheduling to be performed on each pair of mergeable data or test nodes of the current CDDG. In Step 2, we compute the combined costs of merging test nodes and data nodes. Only negative costs (i.e. the merge reduces execution cycles) are considered. In Step 3 we find a pair of test nodes whose merge results in the minimum cost. If such a pair is found then these two test nodes are merged; otherwise, the algorithm terminates.
Consider the example cost matrix shown in Table 1. We will pick the pair (T1, T3), because they result in the lowest merging cost: -3.30 - 0.54 = -3.84. T1 and T3 are then merged as a new test node, T1,3. This produces a CDDG$_{4\text{-way}}$. We say that a CDDG is a CDDG$_{2^k\text{-way}}$ if each of its test nodes has at most $2^k$ branches. The CDDG$_{4\text{-way}}$ can be executed on architectures supporting four-way branching. In the next iteration, our algorithm schedules code, evaluates the cost matrix, and tries to merge two test nodes again. This time a CDDG$_{8\text{-way}}$ or another CDDG$_{4\text{-way}}$ is generated depending on which test nodes are merged. By iteratively applying the proposed algorithm, a CDDG$_{2^k\text{-way}}$ can be obtained, which is executable on the target architecture. The complexity of the branch merging algorithm is $O(n^2)$, because the most time-critical step, Step 2, executes no more than $O(n^2)$ operations.
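The greedy selection can be sketched in a few lines. This is an illustrative reading of Steps 1 to 4, not the authors' implementation: `select_pair` and `branch_merge` are hypothetical names, indices are 0-based, and the cost computation and the actual node merge are left to caller-supplied callbacks. In the toy matrix, only the two entries for (T1, T3), namely -3.30 and -0.54, come from the text; the remaining entries are made up.

```python
import math

def select_pair(C):
    """Steps 2-3: return (p, q, cost) of the most profitable merge, or None.

    C[i][j] with i < j is the cost of merging test nodes i and j;
    C[j][i] is the cost of combining their data nodes (0-based indices).
    """
    best = None
    for i in range(len(C)):
        for j in range(i + 1, len(C)):
            combined = C[i][j] + C[j][i]
            # Only a negative combined cost (a gain) is a candidate.
            if combined < 0 and (best is None or combined < best[2]):
                best = (i, j, combined)
    return best

def branch_merge(nodes, cost_matrix, merge):
    """Greedy loop; cost_matrix and merge are caller-supplied callbacks."""
    while True:
        C = cost_matrix(nodes)                  # Step 1
        pick = select_pair(C)                   # Steps 2-3
        if pick is None:
            return nodes                        # Step 4: nothing profitable left
        nodes = merge(nodes, pick[0], pick[1])  # Step 4: merge and iterate

# Toy matrix: entries for (T1, T3) are from the text, the rest are hypothetical.
C = [[0.0, 1.20, -3.30],
     [0.40, 0.0, math.inf],   # inf marks a pair that is not mergeable
     [-0.54, 0.0, 0.0]]
p, q, cost = select_pair(C)
print(p, q, round(cost, 2))  # prints: 0 2 -3.84
```

Because each iteration recomputes the whole matrix before choosing a single pair, a merged node automatically becomes a candidate in the next iteration, as the text describes.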
4.2 Cost model
In our algorithm, a cost matrix must be constructed for each iteration. Before we describe the matrix, some symbols need to be introduced first.
Suppose that we have already generated CDDG$_{2^l\text{-way}}$. Let $D_{j1}, \ldots, D_{jr}$ be the data nodes immediately following the test node $T_j$. Since a test node in a CDDG$_{2^l\text{-way}}$ has at most $2^l$ data nodes, we have $0 \le r \le 2^l$. Let $D_{parent(j)}$ be the data node immediately preceding the test node $T_j$. If $T_j$ and its immediately preceding test node are merged, data nodes $D_{parent(j)}$ and $D_{ju}$, for each $1 \le u \le r$, must be combined into a set of new data nodes.
We are now ready to define the cost matrix $C$. At the beginning of each iteration, code scheduling is first performed on each node in the given CDDG$_{2^l\text{-way}}$ to find the execution cycles. The scheduling takes hardware constraints into account. From the results of the scheduling, each entry in the cost matrix $C$ is defined as follows:
Let $p(T_i)$ = the probability of executing the test node $T_i$
$NC(T_i)$ = the number of cycles to execute the test
5
In this Section, the experimental environment is first introduced. The performance of the proposed algorithm on multiway branching architectures is then studied. 
5.1 Experimental environment
Eight benchmark programs were chosen for evaluating the performance of the algorithm. Their basic characteristics are shown in Table 2. In the Table, the 'density of dynamic branches' is the proportion of dynamic conditional branches in the total number of dynamic instructions. All programs are written in C and are often used by UNIX users. The source code sizes vary from 40 lines to over 1500 lines.
The experiments were conducted in the simulation environment shown in Fig. 6. We first used DLXcc [9] to translate the test programs into DLX assembly code. The generated assembly code was then fed into the assembly optimiser, which optimised the code with register renaming, percolation scheduling and branch merging. The simulator MDLXsim was modified from DLXsim [9]. It performs two functions: one is to produce profile information including the branch probabilities required by the assembly optimiser; the other is to execute the branch merging algorithm and simulate the execution of a VLIW architecture with a multiway branching mechanism. An input file, resource configuration, is used to specify the numbers and types of functional units, the latencies of the operations and other features of the multiway branching unit.
In the experiments, we assumed a VLIW architecture with two add/subtract units, one load/store unit, one multiply/divide unit, and one multiway branching unit. The latencies of add, subtract, multiply, divide, load, store, and branch operations were 1, 1, 2, 2, 1, 1 and 2 cycles, respectively. There was no cache miss. We used percolation scheduling to schedule the operations in the programs.
5.2 Experimental results
Fig. 7 shows the resultant speedups for the selected benchmarks. Speedup is calculated by dividing the number of cycles of a sequential execution of the program by that of the optimised version. In this Figure, 'PS' represents the code optimised by percolation scheduling with branches being executed sequentially, while 'PS(4-way)' and 'PS(16-way)' represent the code with branches being executed concurrently on a four-way and a 16-way branching architecture, respectively. The execution order of the branches is not changed. Thus, the latter two show the performance improvement purely due to the multiway branching architecture. Finally, the bars labelled 'PS+BM(4-way)' and 'PS+BM(16-way)' represent speedups of the code optimised by both percolation scheduling and our proposed branch merging algorithm, with branches being executed on a four-way and a 16-way branching architecture, respectively.
The geometric means of the speedups for PS, PS(4-way), PS+BM(4-way), PS(16-way) and PS+BM(16-way) are 1.65, 1.91, 2.12, 2.10 and 2.28, respectively. In other words, on four-way and 16-way branching architectures, the proposed branch merging algorithm can improve the performance of bare hardware by (2.12 - 1.91)/1.91 = 11% and (2.28 - 2.10)/2.10 = 8.6% on the eight selected benchmarks, respectively. This is because our branch merging algorithm adjusts the execution order of branches such that more instruction parallelism can be achieved. Overall improvements due to multiway branching and branch merging are 28% for four-way branching and 38% for 16-way branching.
Note that the speedups of PS+BM(4-way) are similar to those of PS(16-way) in most cases. It means that a low cost implementation plus compiler optimisation can achieve a performance similar to that of a higher cost implementation. One important factor which affects the effectiveness of branch merging is the density of dynamic branches in the programs. For example, in the tested benchmarks matrix multiply has a branch density of only 3.2% (see Table 2) and thus has
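The improvement percentages reported for the experiments follow directly from the geometric-mean speedups. The short sketch below recomputes them; the dictionary keys and the helper `gain` are illustrative names, while the five speedup values are those reported in the text.

```python
# Geometric-mean speedups reported for the eight benchmarks.
speedup = {"PS": 1.65, "PS(4-way)": 1.91, "PS+BM(4-way)": 2.12,
           "PS(16-way)": 2.10, "PS+BM(16-way)": 2.28}

def gain(new, old):
    """Relative improvement of configuration `new` over configuration `old`."""
    return (speedup[new] - speedup[old]) / speedup[old]

bm_4way = gain("PS+BM(4-way)", "PS(4-way)")        # branch merging alone, four-way
bm_16way = gain("PS+BM(16-way)", "PS(16-way)")     # branch merging alone, 16-way
overall_4way = gain("PS+BM(4-way)", "PS")          # hardware plus branch merging
overall_16way = gain("PS+BM(16-way)", "PS")

print(f"{bm_4way:.1%} {bm_16way:.1%} {overall_4way:.0%} {overall_16way:.0%}")
# prints: 11.0% 8.6% 28% 38%
```

The same numbers also show that PS+BM(4-way) at 2.12 slightly exceeds PS(16-way) at 2.10, which is the basis for the remark about low cost hardware combined with compiler optimisation.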
Fig. 7 Speedups of benchmarks: PS, PS (four-way), PS+BM (four-way), PS (16-way), PS+BM (16-way)
In this paper, we study the problem of scheduling branch operations for concurrent execution on multiway branching architectures. Our focus is on resource-limited compile-time scheduling of branches. That is, given a set of branches, determine which ones can be executed together subject to the capacity of the branching hardware.
We have formally defined the branch merging problem and shown that finding the optimal solution is NP-hard. A heuristic is proposed to solve the problem by rearranging the execution order of the branches so that profitable branches are moved together for concurrent execution. The heuristic relies on a cost model and profile information. Our performance studies show that branch merging does improve the performance of programs.
Several issues are worth further investigation. For example, the relationships between branch merging and other branch handling techniques such as branch prediction, speculative execution and predicated execution need to be studied more thoroughly. The combination of branch merging with existing optimisation methods such as software pipelining should also be studied.
