A hyper-block represents a linear sequence of predicated instructions with a single entry and multiple exit points. To exploit the high level of instruction-level parallelism (ILP) in EPIC architectures, hyper-blocks are often used as the unit of program representation. In this paper, we study the impact of predication and the hyper-block representation on the register allocation phases. Our contribution is as follows. We show that constructing live ranges in a fine-grained manner reduces the number of interferences, which allows both faster compilation and faster runtime execution. We compare the effect of live range granularity in both hyper-blocks and conventional basic blocks. We show that predicate-aware liveness analysis can be used to obtain accurate interference graphs and to reduce false register pressure. Similarly, we demonstrate that a predicate-aware priority function can give better register allocation performance than a predicate-insensitive one. We also identify the problems of live range splitting when the liveness computations are based on predicate analysis, and propose a splitting scheme based on approximation of predicate expressions.
Introduction
Over the past decade, architectural innovations supporting instruction-level parallel processing (ILP) and compiler optimizations that work synergistically with them have become a technological reality [11]. Key aspects of this technology, popularly referred to as explicitly parallel instruction computing (EPIC), have influenced the IA-64 architecture [12]. EPIC offers several advantages, notably performance that scales with both architectural complexity and the use of increasing levels of parallelism at the fine-grained instruction level. However, compile-time analysis to enhance and detect the parallelism, as well as optimizations such as instruction scheduling and register allocation that exploit this parallelism, play a central role in harnessing the performance opportunities offered by EPIC. Since ILP within basic blocks is limited for control-intensive programs, techniques for exploiting ILP across basic block (BB) boundaries have been developed. Trace scheduling [4] and super-block (SB) scheduling [6] achieve a higher degree of instruction-level parallel processing by providing a wider scope for compile-time program analysis. Predication [8] has been included in EPIC-style architectures to enable modulo scheduling [10] and hyper-block (HB) scheduling [9]. A hyper-block represents a linear sequence of predicated instructions with a single entry and multiple exit points. Hyper-block scheduling can handle branch-intensive programs, whereas trace scheduling and super-block scheduling cannot handle clusters of traces that should be considered together for interference optimizations. However, predicated instructions in hyper-blocks [9] pose new problems for register allocation (RA) and instruction scheduling (IS).
In this paper, we explore several aspects of register allocation such as liveness granularity, priority functions and different notions of live-range interference. The graph coloring approach is accepted as a good model for register allocation. In this approach, an interference graph is constructed from the program, where each node represents a live range of a program variable. Two distinct nodes in the graph are connected if the two variables conflict with each other and cannot be allocated to the same physical register.
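As a minimal illustration of the interference graph described above, the following Python sketch connects two live ranges whenever they share a program point. The representation (a live range as a set of integer program points), the function name `build_interference_graph` and the example variables are assumptions made for illustration. A production allocator would typically test liveness at definition points rather than raw overlap.

```python
from itertools import combinations


def build_interference_graph(live_ranges):
    """Return the set of interference edges between live ranges.

    live_ranges maps a variable name to the set of program points at
    which that variable is live (a simplified, assumed representation).
    Two variables interfere if they are live at a common point.
    """
    edges = set()
    for a, b in combinations(sorted(live_ranges), 2):
        if live_ranges[a] & live_ranges[b]:
            edges.add((a, b))
    return edges


# Hypothetical live ranges: x and y overlap at point 3, z overlaps nothing.
example_ranges = {"x": {1, 2, 3}, "y": {3, 4}, "z": {5, 6}}
example_edges = build_interference_graph(example_ranges)
print(example_edges)
```

Here only x and y must be placed in different registers; z can share a register with either of them.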
A live range is a continuous group of nodes in the control flow graph where a variable is live. Liveness information can be represented at different granularities of a program, such as an operation or a basic block. The granularity of live ranges has a large effect on the number of edges (size) of the interference graph and on execution performance. Our results show that this effect is larger for hyper-blocks and super-blocks. Predicated code used in "if-conversion" for hyper-block formation introduces another challenge to register allocation, since liveness analysis that is not predicate-aware may also increase the size of the graph. In this paper, we explore the effect of fine-grained live ranges and predicate-aware liveness analysis for hyper-blocks. We also explore the effect of predication on the priority function for hyper-blocks. By making the Chow and Hennessy-style priority function predicate-aware, we obtain reduced spill code and improved execution performance.
Liveness Granularity
Using a basic block as the unit of liveness [2] makes for faster compilation, but also incurs a greater performance penalty compared to fine-grained live ranges. For region-based compilation using hyper-blocks or super-blocks, this penalty can be even greater, for two reasons. First, the register pressure in a basic block with coarse-grained live ranges is not as large as in an HB, since a basic block is usually smaller. Second, a coarse-grained live range may have extra interferences in super-blocks or hyper-blocks that are not present in basic blocks (see Figure 1). Table 1 shows the difference between basic-block-based live ranges and operation-based live ranges, using the total number of interference edges from several benchmarks chosen from digital signal processing programs and common Unix utilities. On weighted average, coarse-grained live ranges have about 54% more interference edges than fine-grained live ranges for hyper-blocks, compared with about 22% for basic blocks. This reduction makes for faster register allocation and hence a reduction in compilation time.
Table 1: The number of edges in the interference graph for coarse-grained (basic block/hyper-block) live ranges and fine-grained (operation) live ranges
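The effect of granularity on edge count can be illustrated with a small Python sketch. It computes interference edges once on operation-level live ranges and once after projecting every live range onto the blocks that contain its operations. The block layout, variable names and helper names (`count_edges`, `coarsen`) are hypothetical, and the numbers are not drawn from Table 1.

```python
from itertools import combinations


def count_edges(live_units):
    """Count pairs of variables whose live ranges share at least one unit."""
    return sum(
        1
        for a, b in combinations(sorted(live_units), 2)
        if live_units[a] & live_units[b]
    )


def coarsen(live_ops, block_of):
    """Project operation-level live ranges onto block-level live ranges."""
    return {var: {block_of[op] for op in ops} for var, ops in live_ops.items()}


# Hypothetical layout: operations 1-4 belong to HB1, operations 5-6 to HB2.
block_of = {1: "HB1", 2: "HB1", 3: "HB1", 4: "HB1", 5: "HB2", 6: "HB2"}
# Hypothetical operation-level (fine-grained) live ranges.
live_ops = {"a": {1, 2}, "b": {3, 4}, "c": {4, 5}, "d": {6}}

fine_edges = count_edges(live_ops)
coarse_edges = count_edges(coarsen(live_ops, block_of))
print(fine_edges, coarse_edges)
```

In this toy case only b and c overlap at the operation level, while at block granularity every pair that shares a block interferes. The larger the block, the more such false interferences arise.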
We also analyzed the impact of different register file sizes and granularities on execution performance (see Table 2). The Trimaran [13] simulation environment was used, and we measured the dynamic number of spill operations due to register allocation using cycle-by-cycle simulation. For each register file size, the first and second columns show the dynamic number of operations for coarse-grained and fine-grained live ranges, and the third column shows the improvement. This improvement varies with the benchmark considered and with the register file size. As the number of registers increases, more live ranges can be bound to registers rather than being spilled. The performance improvement with larger register files shows this trend in some benchmarks such as strcpy or paraffins. Beyond a certain register file size, when all live ranges can be colored, this trend flattens out.
Coloring assignment in register allocation assigns physical registers to live ranges in a certain order. Chaitin-style register allocation assigns colors to the live ranges in the reverse of the order in which they are removed from the interference graph during simplification and spilling [1]. In Chow and Hennessy-style register allocation, which we have followed in our research, the coloring assignment order is decided by a priority function [2].
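The idea of priority-ordered coloring can be sketched as follows. This is a deliberately simplified illustration and not the full Chow and Hennessy algorithm, which also handles unconstrained live ranges and live range splitting. The function name `color_by_priority`, the data layout and the example priorities are assumptions.

```python
def color_by_priority(priority, neighbors, num_registers):
    """Assign registers to live ranges in decreasing priority order.

    priority:  maps a live range name to its priority value.
    neighbors: maps a live range name to the set of live ranges it
               interferes with.
    Returns (assignment, spilled), where assignment maps a live range
    to a register index and spilled lists live ranges left in memory.
    """
    assignment = {}
    spilled = []
    for lr in sorted(priority, key=priority.get, reverse=True):
        # Registers already taken by interfering, already-colored ranges.
        used = {assignment[n] for n in neighbors.get(lr, ()) if n in assignment}
        free = [r for r in range(num_registers) if r not in used]
        if free:
            assignment[lr] = free[0]
        else:
            spilled.append(lr)
    return assignment, spilled


# Hypothetical example: three mutually interfering ranges, two registers.
priority = {"x": 9.0, "y": 1.0, "z": 5.0}
neighbors = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
assignment, spilled = color_by_priority(priority, neighbors, num_registers=2)
print(assignment, spilled)
```

With two registers, the lowest-priority range y is the one left in memory, which is the behavior a good priority function is meant to produce.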
The priority function models the savings in memory accesses when assigning a register to a live range as opposed to keeping the variable in memory. We use execution frequencies of instructions to guide register allocation (RA). In addition to using hyper-blocks to group together traces that are executed frequently (for more effective optimizations, including register allocation), we also use live range priorities based on the priority function due to Chow and Hennessy. This function captures the priority quite well in basic blocks (BBs), where all operations in a BB have the same weight. But in a hyper-block, where some operations can be nullified due to predication, it does not reflect predicated execution well. In Figure 2, the weights of B2 and B3 are 90 and 10 respectively. Using the traditional function, LiveRange(x) has a higher priority than LiveRange(y) in B2. In the corresponding hyper-block, shown in part (b) of the figure, the priorities of LiveRange(x) and LiveRange(y) are the same under the traditional priority function based on block weight, even though the two are executed a different number of times dynamically. To correct this problem, we have modified the priority function as follows to reflect frequency information using the predicate expression:
Priority(lr) =
where PR(x) is the fraction of the time the variable access actually occurs due to predication. Table 3 (a) shows the dynamic number of instructions (spill code) added due to register allocation for the predicate-aware and the predicate-unaware priority functions. The two priority functions have been tested on a number of benchmarks. To quantify the spill code added, we measure the total amount of spill code (load, store and move instructions) added by a register allocation algorithm. The performance improvement of our algorithm is measured as the factor of reduction in spill code achieved in comparison to the previous method. Table 3 (b) shows total dynamic execution cycles (for the same set of benchmarks) on a 9-way parallel machine with 4 integer units, 2 floating point units, 2 memory units and 1 branch unit. We obtained up to 20% performance improvement using our new priority function compared to using a predicate-unaware priority function.
One of the main issues with predicated code is computing accurate liveness information in order to build the sparsest possible interference graph. This is central for reducing compile time, but more so for performance.
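The predicate-aware priority function described above can be illustrated with a short Python sketch. It assumes the commonly cited shape of the Chow and Hennessy priority (load and store savings weighted by block execution weight, normalized by live range size) and scales each access by PR, the fraction of executions in which the guarding predicate is true. The cost constants, the `Access` type, the live range sizes and the mapping of x to predicate p and y to predicate q are all assumptions for illustration; they are not the exact formula or data of this paper.

```python
from dataclasses import dataclass

LOAD_SAVE = 1.0   # assumed saving per avoided load
STORE_SAVE = 1.0  # assumed saving per avoided store


@dataclass
class Access:
    kind: str            # "use" or "def"
    block_weight: float  # execution weight of the enclosing block
    pr: float            # fraction of executions where the predicate holds


def priority(accesses, size, predicate_aware=True):
    """Weighted memory-access savings of a live range, per unit of size."""
    total = 0.0
    for acc in accesses:
        saving = LOAD_SAVE if acc.kind == "use" else STORE_SAVE
        scale = acc.pr if predicate_aware else 1.0
        total += saving * acc.block_weight * scale
    return total / size


# Hypothetical hyper-block of weight 100 built from blocks of weight 90 and 10:
# x is guarded by p (true 90% of the time), y by q (true 10% of the time).
x_accesses = [Access("def", 100.0, 0.9), Access("use", 100.0, 0.9)]
y_accesses = [Access("def", 100.0, 0.1), Access("use", 100.0, 0.1)]

unaware = (priority(x_accesses, 2, False), priority(y_accesses, 2, False))
aware = (priority(x_accesses, 2), priority(y_accesses, 2))
print(unaware, aware)
```

Without the PR factor the two live ranges are indistinguishable; with it, x is ranked well above y, matching their dynamic execution counts.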
Consider the control flow graph and live ranges in Figure 3 (a) and the corresponding interference graph. In B3, for example, x is live when y is defined. Therefore, there is an edge between vertices x and y in the interference graph. Notice that there is no edge between vertices y and z, since the thread of execution moves exclusively to one of the two basic blocks B2 or B3. The accurate interference graph can be colored with no more than 2 colors.
Using if-conversion, the four basic blocks of Figure 3 (a) are merged into the single predicated block of Figure 3 (b). The operations previously executed in B2 and B3 are guarded by the predicates p and q respectively.
An interference graph using traditional live ranges is also shown in Figure 3. This interference graph contains an edge between y and z because y and z are considered live simultaneously, as traditional analysis does not examine the predicate expressions. As a result, the register requirement of this predicated block increases from 2 to 3. This increase in register requirements is mainly due to two reasons. First, without knowledge about predicates, the data flow analysis must make conservative assumptions about the side effects of the predicated operations. Second, the solutions of the data flow analysis rely heavily on the connection topology among basic blocks in the flow graph, which is altered by the if-conversion process used to construct hyper-blocks.
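The difference between the two interference graphs can be reproduced in a small Python sketch. It assumes a simplified model in which each live range is guarded by a single predicate, and in which p and q are known to be disjoint. The operation positions, the `DISJOINT` table and the function names are hypothetical; a real predicate-aware analysis tracks a predicate expression per program point and answers such questions through a predicate query system.

```python
from itertools import combinations

# Assumed predicate knowledge: p and q are never true at the same time.
DISJOINT = {frozenset(("p", "q"))}


def may_hold_together(pred_a, pred_b):
    """Conservative query: can both predicates be true in one execution?"""
    if pred_a == "TRUE" or pred_b == "TRUE":
        return True
    return frozenset((pred_a, pred_b)) not in DISJOINT


def interference_edges(live, predicate_aware):
    """live maps a variable to (set of operation positions, guarding predicate)."""
    edges = set()
    for a, b in combinations(sorted(live), 2):
        ops_a, pred_a = live[a]
        ops_b, pred_b = live[b]
        if not ops_a & ops_b:
            continue  # never live at the same position
        if predicate_aware and not may_hold_together(pred_a, pred_b):
            continue  # overlap in position only, never in execution
        edges.add((a, b))
    return edges


# Hypothetical if-converted block: x is unguarded, y lives under p, z under q.
live = {
    "x": ({1, 2, 3, 4, 5}, "TRUE"),
    "y": ({2, 3, 4}, "p"),
    "z": ({3, 4, 5}, "q"),
}
traditional = interference_edges(live, predicate_aware=False)
aware = interference_edges(live, predicate_aware=True)
print(sorted(traditional), sorted(aware))
```

The traditional graph is a triangle and needs 3 colors, while the predicate-aware graph drops the y-z edge and can be colored with 2.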
The problem of predicate-aware liveness computation has been studied in the past [5, 3]. Eichenberger and Davidson [3] consider register allocation for predicated code assuming predicated registers. "P-facts" are defined to capture logically invariant expressions in straight-line code (using both the predicated expressions in instructions and branch conditions), and analysis on these is used to define liveness. Our analysis differs in that we use a predicate query system rather than eagerly computing all the implications (the set of P-facts), which can be very expensive in larger regions. In their approach, a more complex region such as a loop is handled through "bundling" of virtual registers, and a predicate-insensitive RA is then used. Their framework does not consider spilling, and emphasis is therefore placed on minimizing register requirements in the context of modulo-scheduled loops rather than on good register allocation with a fixed number of registers. Their reported data show the reduction in registers needed for the Livermore kernel loops.
Gillies, Roy Ju, Johnson and Schlansker [5] also consider predicated register allocation, but hyper-blocks are not used in their framework for region construction. Global analysis (on whole procedures) is performed on the standard control flow graph (CFG), with BBs having (materialized) predicated code and control predicates representing the branch component from the CFG. The basis of a BB is the set of predicates used in the BB code stream. The bases of the BBs within a global scheduling region (GSR) are recursively merged to form a single common basis for data flow analysis (DFA). Interval analysis is used to guide this process. To avoid expensive global analysis, a limit (32) is imposed on the size of the basis during the recursive merging.
Table 4: The number of edges in the interference graph for predicate-aware and predicate-unaware liveness analysis
Our approach uses a similar predicate analysis, but HBs are constructed taking into account the frequency of execution. We perform local predicate analysis using a Predicate Query System (PQS) [7] in our region-based register allocation. In both approaches, due to the artificial limit on the size of the basis for each interval or the simplification used in PQS, the rest of the predicate expressions have to be approximated to what is in the basis or to TRUE.
Our study (see Table 4) shows that region-based register allocation with local predicate-aware liveness analysis sometimes gives only limited performance improvement. There are several reasons for this. First, hyper-block formation is not undertaken if there are nested inner loops inside the selected block. Second, HB construction maximizes the potential of instruction scheduling, which defeats many of the careful interference calculations of predicate-aware RA. For example, if a branch is heavily weighted in one direction, the block on the opposite side may not be included in the HB so as to maximize speculation. Third, if a block contains a function call with side effects, the block will not be included in an HB, as it prevents code motion and instruction scheduling suffers. Last, many predicate conditions are promoted to TRUE in our HB construction to get a good instruction schedule, which gives RA little chance to construct an accurate interference graph.
Conclusion
Our research has shown that fine-grained live ranges reduce the number of interferences and hence allow both faster compilation and faster runtime execution. Similarly, a predicate-aware priority function gives a better register allocation. However, due to the heuristics used in hyper-block formation, predicate-aware liveness computations do not result in significant improvements. Hence, further study is indicated in developing hyper-block formation strategies that are friendly to both instruction scheduling and register allocation.
