Abstract
1. Introduction
There has been recent interest in statically [14, 15, 12, 50, 20] and dynamically [28, 44, 16, 51] scheduled clustered ILP processor microarchitectures as a complexity-effective alternative to wide-issue monolithic microprocessors for effectively utilizing a large number of on-chip resources with minimal impact on the cycle time. In these processors, the function units are partitioned, and resources such as the register file and cache are either partitioned or replicated and then grouped together into on-chip clusters. In the replicated register file scheme all the local register files share the same name space, whereas in the partitioned register file scheme each local register file has a unique name space. The clusters are usually connected via a set of inter-cluster communication buses or a point-to-point network [42].
Several resources are required to execute an operation (OP) in a clustered processor. As in a single-cluster processor, the OPs need local resources in the cluster, such as function units for execution and register/memory to save the results. In addition, the OPs often need shared resources, such as the inter-cluster communication mechanism, to access operands that reside in remote clusters. Some form of copy operation (using either hardware techniques [16] or inter-cluster copy OPs) needs to be explicitly scheduled to access a remote register file in the case of a partitioned register file scheme, or to maintain coherency in the case of a replicated register file scheme. Clearly, a good code generation scheme is crucial to the performance of these processors in general, and especially for statically scheduled clustered ILP processors, in which each of these local and shared resources has to be explicitly reserved by the code generator on a cycle-by-cycle basis. The basic functions¹ that must be carried out by the code generator for a clustered ILP processor are: 1) cluster assignment, 2) instruction scheduling, and 3) register allocation. All three functions are closely inter-related. If these functions are performed one after another, their ordering can often have a significant impact on the performance of the generated code. For example, register allocation can affect cluster assignment and vice versa. This is because the register access delays (due to inter-cluster copy OPs) of an OP depend on the proximity of the cluster in which the operand register is defined to the cluster that tries to access it. The ordering of the cluster assignment and instruction scheduling steps can also affect the performance of the compiled code.
If scheduling is done before cluster assignment, it may not be possible to incorporate inter-cluster copy OPs in the schedule made by the earlier scheduling step, often necessitating a re-scheduling step after the cluster assignment. Cluster assignment of an OP depends on the ready times of its operands and the availability of resources, which in turn depend on the cycle in which the OPs that define the operands are scheduled. Therefore, cluster assignment, if carried out before the scheduling step, can often result in poor resource utilization and longer schedule lengths. There are also several problems with approaches that use separate phases for register allocation and instruction scheduling [19, 38]. Global register assignment, if carried out first, can create unnecessary dependences due to re-definition of registers, thereby restricting the opportunities for extracting ILP by the instruction scheduler. Instruction scheduling, if performed before register assignment, can result in inefficient use of registers, thereby increasing the register pressure and possibly causing unnecessary spills.
¹ We assume that operation selection has already been made.
In general, phase-ordered solutions make a "best effort" attempt at each phase to get a feasible cluster schedule, often making unrealistic assumptions such as infinite resources (registers, function units, etc.) or zero-time copy operations, which results in poorly performing code. Clearly, this phase-ordering problem can affect the performance of the code generated for clustered processors. However, phase-ordered solutions have certain advantages, such as relatively lower engineering complexity and better run-times compared to integrated approaches. An alternative approach is to iterate the cluster scheduling process until a feasible schedule meeting some performance criteria is reached. The main drawback of such iterative approaches is their large running time.
In this paper, we introduce a new code generation framework for clustered ILP processors called CARS (Combined cluster Assignment, Register allocation and instruction Scheduling). In CARS, as the name suggests, the cluster assignment, register allocation and instruction scheduling phases of traditional code generation schemes are performed concurrently in a single phase, thereby avoiding the drawbacks of the phase-ordered solutions mentioned above. In order to maximize the extraction of instruction-level parallelism (ILP), independent OPs (that do not cause exceptions) are often cluster scheduled out-of-order in CARS. To facilitate this, as well as to combine register allocation with cluster scheduling, we developed a new on-the-fly register allocation scheme. The scheme does not rely on any information that depends on a predetermined relative ordering of OPs, such as the live ranges and interference graphs used by traditional register allocators. Our preliminary experimental results indicate that CARS generates efficient code for a variety of benchmark programs across a spectrum of eight different clustered ILP processor configurations.
Roadmap:
In section 2 we describe the CARS framework and algorithms. The details of an implementation of CARS are given in section 3. We discuss the related work in section 4. Preliminary results of an experimental evaluation of CARS are given in section 5, followed by some comments and conclusions in section 6.
Combined Cluster Assignment, Register Allocation and Instruction Scheduling

Overview
A generic clustered ILP processor model is shown in figure 1. In this paper we assume a partitioned register file architecture with local register files containing registers with a unique/private name space. However, our scheme can be easily adapted for replicated register file architectures as well [26]. We assume that an OP can only write to its local register file and that an explicit inter-cluster copy OP is needed to access a register from a remote cluster. These copy OPs use communication function units (a local cluster resource) and the inter-cluster communication network (a shared global resource). Either single/multiple shared buses or a point-to-point network may be used for inter-cluster communication.
Our code generation framework consists of 3 stages, as shown in figure 6(b). In the first stage, some of the data structures required for the CARS algorithm are set up and initialized. The real work is done in the second stage, in which the combined cluster assignment, register allocation and instruction scheduling (henceforth referred to as CARScheduling) is carried out, followed by the final code printing stage with peephole optimizations.
[...] the unscheduled aggregate. The data ready³ nodes in the unscheduled aggregate are identified and moved into the ready list. The nodes in the ready list are selected based on a heuristic for CARScheduling. After CARScheduling, the nodes are moved to the appropriate vliws. This process is repeated until all the nodes of the DFG are scheduled.
Pre-CARS initializations
The pre-CARS initialization stage is used for preprocessing information needed by the CARS algorithm. For each node we compute its height and depth in the DFG, based on the height and depth of its dependent successor/predecessor nodes and its latency. Depth is the earliest execution cycle of the OP, counting from the beginning of the DFG; height is the latest execution cycle of the OP, counting from the end of the DFG, in an infinite resource machine. Associated with each SSA DEF that needs to be register allocated, we maintain a RegMap structure. Inside the RegMap of a DEF, we keep the number of uses (use-count) and the ID of the preferred register (prfrd-regmap) to be assigned to the DEF. DEFs and USEs of certain nodes, such as the ones at the entry and exit of the procedure and call OPs, must be assigned to specific registers as per the calling convention. We initialize their prfrd-regmap to the ID of the corresponding mapped register. The register allocator of CARS uses another field of RegMap, the register mask bit-vector (regmask-bv), to prevent a set of registers from being allocated to certain DEFs. For example, we initialize the regmask-bv of those DEFs that are live across a call OP so that caller-save registers will not be allocated to them. We also identify and flag loop-invariant DEFs, back-edge DEFs and loop-join DEFs of nodes in loops. This information is used by CARS, for example, to eliminate copy OPs along the back-edges of loops [27, 26]. Physical registers are treated as a resource in CARS. Based on the input parametric description of the machine model, we initialize the register resource structures and their bit-vector representations, local resource counters for function units, and global resource counters for shared resources such as inter-cluster buses.
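The height and depth computation described above can be sketched as two passes over the DFG in topological order. This is a minimal illustration, not the authors' implementation: the function name `depth_and_height`, the dictionary-based graph encoding, and the convention that height includes the node's own latency are all assumptions made for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def depth_and_height(latency, preds):
    """latency: {node: cycles}; preds: {node: iterable of predecessor nodes}.

    Assumed conventions (not taken from the paper):
      depth[n]  = earliest start cycle of n with unlimited resources
      height[n] = longest latency-weighted path from the start of n to the end
    """
    graph = {n: set(preds.get(n, ())) for n in latency}
    order = list(TopologicalSorter(graph).static_order())  # predecessors first
    succs = {n: set() for n in latency}
    for n, ps in graph.items():
        for p in ps:
            succs[p].add(n)
    depth, height = {}, {}
    for n in order:                       # forward pass
        depth[n] = max((depth[p] + latency[p] for p in graph[n]), default=0)
    for n in reversed(order):             # backward pass
        height[n] = latency[n] + max((height[s] for s in succs[n]), default=0)
    return depth, height

# Hypothetical four-node DFG: a feeds b and c, which both feed d.
latency = {"a": 1, "b": 2, "c": 1, "d": 1}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
depth, height = depth_and_height(latency, preds)
priority = {n: depth[n] + height[n] for n in latency}
```

Under these conventions, nodes whose depth plus height equals the maximum value over the graph lie on a critical path; in the example, `c` has one cycle of slack while `a`, `b` and `d` do not.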
³ A node becomes data ready when all of its dependent predecessor nodes are scheduled.
2.3. On-the-fly register allocation in CARS
Registers are allocated on-the-fly in CARS without using live range information [47] or explicit interference graphs [7]. In order to do this, CARS maintains and dynamically updates 1) the remaining number of uses (ruse-count) of each physical register, 2) the availability of registers (lcl-reg-bv and glbl-reg-bv), and 3) the preferred register mapping (prfrd-regmap) of DEFs, as explained below.
We use ruse-count to keep track of the live/dead status of registers. The ruse-count of a register is decremented whenever an OP that uses it is scheduled; when ruse-count becomes zero we mark the register as dead. Since the only information we have at any time during scheduling is the pre-computed use-count of SSA DEFs in the current scheduling region, we initialize and dynamically update the ruse-count as follows. As we start scheduling a new region, the DEFs (φs) of the join node are allocated the same registers as the DEFs (φ⁻¹s) of its scheduled predecessor forks. The pre-computed use-count of each join DEF is then added to its allocated register's ruse-count. Similarly, prior to scheduling fork nodes at the exits of a region, we update the ruse-count of registers used by the φ⁻¹s of the fork node. The number of unvisited join φs connected to the fork's φ⁻¹ is added to the ruse-count of the φ⁻¹'s mapped register. This prevents marking the registers allocated to DEFs that are live beyond the current scheduling region as dead (see the example in figure 2).
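The ruse-count bookkeeping above can be illustrated with a small counter per physical register. This is a sketch under stated assumptions: the class and method names are hypothetical, and the rule that visiting a join consumes one of the uses reserved at the fork is our reading of the scheme, since the text does not spell out that step.

```python
class RegisterUseTracker:
    """Tracks the remaining number of uses (ruse-count) per physical register."""

    def __init__(self, num_regs):
        self.ruse_count = [0] * num_regs

    def define(self, reg, use_count):
        # A DEF with a pre-computed use-count is mapped to `reg`.
        self.ruse_count[reg] += use_count

    def use(self, reg):
        # An OP that reads `reg` has just been scheduled.
        assert self.ruse_count[reg] > 0, "use of a dead register"
        self.ruse_count[reg] -= 1

    def hold_for_joins(self, reg, unvisited_joins):
        # Before scheduling a fork at a region exit: keep `reg` alive for
        # every join that has not been visited yet.
        self.ruse_count[reg] += unvisited_joins

    def enter_join(self, reg, join_use_count):
        # Assumption: visiting a join consumes one reserved use and adds the
        # join DEF's own pre-computed use-count.
        self.ruse_count[reg] += join_use_count - 1

    def is_dead(self, reg):
        return self.ruse_count[reg] == 0

rf = RegisterUseTracker(num_regs=4)
trace = []
rf.define(2, use_count=1)                # one use inside the first region
rf.hold_for_joins(2, unvisited_joins=1)  # value is live into the next region
rf.use(2)                                # last use in the first region
trace.append(rf.is_dead(2))              # still live across the region boundary
rf.enter_join(2, join_use_count=2)       # next region: join DEF reuses register 2
rf.use(2)
rf.use(2)
trace.append(rf.is_dead(2))              # all uses scheduled, register is dead
```

The point of the reservation at the fork is visible in the first recorded state: without `hold_for_joins`, register 2 would have been marked dead after its last use in the first region, even though a later region still reads it.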
In addition to ruse-count, the availability (live or dead status) of registers is also maintained in two bit-vectors: one representing the global status (glbl-reg-bv) and the other representing the local (i.e., within the scheduling region) status (lcl-reg-bv). All registers that are not used in the current scheduling region are marked as dead in lcl-reg-bv, whereas the status of registers in glbl-reg-bv does not depend on whether they are used in the current scheduling region or not. We use the information in these two bit-vectors to identify non-interfering lifetimes in the current scheduling region for efficient use of registers, as in a graph coloring based allocator [7]. Registers that are marked dead in glbl-reg-bv may be allocated to any DEF if they are not masked by the DEF's regmask-bv. Also, a register from the set of registers that are marked as dead in lcl-reg-bv but live in glbl-reg-bv may be allocated to a DEF, if the DEF is not live beyond the current region and the register does not belong to the set of preferred registers of all φ⁻¹s at the exits of the region. All of the above logic can be implemented by a set of logical operations on bit-vectors [26].
The φs of a join node require copy OPs along its incoming edges if its predecessor φ⁻¹s were not all allocated to the same register (see the example in figure 3). To avoid these copy OPs we use the dynamically generated prfrd-regmap information in the RegMap of DEFs as follows. The prfrd-regmap of each DEF which is live beyond the current region is initialized when the DEF is register allocated. This prfrd-regmap information is propagated to all the directly and indirectly connected φs of join nodes, as shown in figure 4. (The set of directly and indirectly connected join DEFs may be thought of as a "web" of intersecting DEF-USE chains [40], an object for register allocation in graph coloring based allocators.)
In general, if a DEF has a φ⁻¹ use and there exists a valid prfrd-regmap for any of the φs connected to this φ⁻¹, then we allocate the register specified by that prfrd-regmap to the DEF. Otherwise, we allocate a register that is marked dead in glbl-reg-bv and propagate the prfrd-regmap to all connected φs as explained above. Because the regions are scheduled strictly in topological order and registers that are live beyond the current region are never marked dead, this scheme almost completely eliminates unnecessary copy OPs that might otherwise be needed to handle mismatches in the register mapping of incoming edges of φs.
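The availability rules and the preferred-register rule above reduce to a handful of bitwise operations. The following sketch uses Python integers as bit-vectors; the function names, the "bit set means live" convention, the `exit_prfrd_bv` parameter, and the choice to apply the register mask to both candidate sets are assumptions made for illustration, not details confirmed by the paper.

```python
def candidate_registers(glbl_live_bv, lcl_live_bv, regmask_bv, exit_prfrd_bv,
                        live_beyond_region, all_regs_bv):
    """Return a bit-vector of registers that may be given to a DEF.

    Bit i set in a *_live_bv means register i is live (assumed convention).
    exit_prfrd_bv holds the preferred registers of the region-exit DEFs.
    """
    candidates = ~glbl_live_bv & all_regs_bv           # globally dead registers
    if not live_beyond_region:
        # Live somewhere else but unused in this region: reusable locally,
        # unless an exit of the region prefers that register.
        locally_dead_only = glbl_live_bv & ~lcl_live_bv
        candidates |= locally_dead_only & ~exit_prfrd_bv
    return candidates & ~regmask_bv & all_regs_bv      # honor the DEF's mask

def pick_register(prfrd_regmap, candidates):
    """Prefer a propagated mapping; otherwise take the lowest candidate bit."""
    if prfrd_regmap is not None:
        return prfrd_regmap
    if candidates == 0:
        return None                                    # caller must spill
    return (candidates & -candidates).bit_length() - 1

ALL = 0xFF                      # hypothetical 8-register cluster
glbl_live = 0b00001111          # r0..r3 hold live values
lcl_live = 0b00000011           # only r0, r1 are used in this region
regmask = 0b10000000            # r7 is masked for this DEF
exit_prfrd = 0b00000100         # r2 is preferred at a region exit

cands_global = candidate_registers(glbl_live, lcl_live, regmask, exit_prfrd,
                                   True, ALL)
cands_local = candidate_registers(glbl_live, lcl_live, regmask, exit_prfrd,
                                  False, ALL)
picked = pick_register(None, cands_local)
picked_prfrd = pick_register(5, cands_local)
```

In this example a DEF that dies inside the region gains one extra candidate (r3) compared with a DEF that is live beyond it, which is the kind of register reuse the two bit-vectors are meant to expose.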
To those DEFs that have a fixed register mapping due to our calling convention, we allocate registers as per their prfrd-regmap initialized in the pre-CARS stage. Copy OPs are inserted on-the-fly during CARScheduling for those DEFs that are live across call OPs, if necessary, by preempting the scheduling of the call OP.
To avoid pseudo name dependencies we select registers for assignment in a round-robin fashion using an efficient data structure [26]. This data structure also allows us to quickly search for dead registers in a cluster. Inside each physical register's resource structure, we also maintain a list of SSA DEFs (belonging to the same live range) that are currently mapped to the register. This information is used for spilling a live range.
Complexity:
It is clear from the above discussion that the complexity of assigning a register is bounded by the complexity of bit-vector operations, which is O(T), where T is a constant equal to the total number of registers in the processor. Therefore, the complexity of our on-the-fly register allocation scheme is linear in the number of nodes scheduled.
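The round-robin selection mentioned above can be pictured as a circular scan that starts just after the most recently allocated register. The data structure actually used is described in [26] and is not reproduced here; the function below is only a stand-in with a hypothetical name and signature.

```python
def next_dead_register(dead_bv, num_regs, last_allocated):
    """Circular search for a dead register, starting after `last_allocated`.

    dead_bv: bit i set means register i is currently dead (assumed convention).
    Returns a register index, or None if every register is live.
    """
    for offset in range(1, num_regs + 1):
        reg = (last_allocated + offset) % num_regs
        if (dead_bv >> reg) & 1:
            return reg
    return None

dead = 0b01011000                                 # r3, r4 and r6 are dead
first = next_dead_register(dead, 8, last_allocated=3)
second = next_dead_register(dead, 8, last_allocated=first)
wrapped = next_dead_register(dead, 8, last_allocated=second)
```

Rotating through the dead registers, rather than always taking the lowest-numbered one, delays the reuse of a register that has just been freed, so fewer false (name) dependences are created between nearby OPs. Each call inspects at most `num_regs` bits, in line with the constant per-allocation cost argued above.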
Spilling:
If none of the OPs in the ready list can be scheduled due to lack of dead registers, then a live register is selected for spilling based on a set of heuristics [26]. In order to spill the selected register, a spill store OP is inserted in the DFG as a new USE of the DEF mapped to the register. Spill load OPs are also inserted for all the unscheduled nodes in the current scheduling region that use the spilled register. Spilling essentially splits the web of DEF-USE chains mapped to the spilled register at the point of spilling, which sometimes necessitates insertion of spill store OPs in the already scheduled portion of the DFG. These spill store OPs will be merged into one of the vliws aggregates by the peephole compaction routine after CARScheduling. The spilled register (now residing on the stack) is renamed to a unique ID outside the name space of all local register files, and the new ID is assigned to the prfrd-regmap of the spilled DEF. This prfrd-regmap is then propagated to connected join φs, overriding the existing prfrd-regmap (if any), using the scheme described above. This facilitates easy identification of the DEFs mapped to spilled registers and their uses in the unscheduled regions, so that spill load/store OPs can be inserted in the DFG and CARScheduled on-the-fly.
The CARS Algorithm
The CARS algorithm, given in figure 5, is a modified version of the list-scheduling algorithm [49]. In order to find the best cluster to schedule an OP, we first compute the resource-constrained schedule-cycle (lines 3-5 of figure 5) in which the OP can be scheduled in each cluster, based on the following factors: 1) the cycle in which its operands are defined, 2) the cluster in which its operands are located, 3) the availability of a function unit in the current cycle, 4) the availability of a destination register, and 5) whether inter-cluster copy OP(s) can be scheduled on the source node's cluster in a cycle earlier than the current cycle. Based on the earliest-cycle computed (line 6), the OP will be either scheduled in the current cycle on one of the clusters⁴ corresponding to earliest-cycle (lines 7-10) or pushed back into the ready list after incrementing its scheduling-attempts (lines 13-16). A new vliws aggregate will be created if none of the OPs in the ready list can be scheduled in the current-vliw-cycle (lines 17-19).
This process is repeated until all OPs in the unscheduled aggregate are cluster-scheduled.
We use one of the commonly used heuristics (schedule those data ready OPs that are on the critical path first) for selecting an OP from the ready list (line 2 of figure 5). The sum of the OP's height and depth is used to identify OPs that are likely to be on the critical path(s) and to assign priority to the data ready OPs. The scheduling-attempts variable associated with each OP (updated in line 14) is used to change the OP selection heuristics so that no OP in the ready list will be repeatedly considered in succession for CARScheduling. This also ensures termination of the algorithm. The depth of a scheduled OP is often increased after scheduling due to the finite resources available in each cycle, causing the set of OPs on the critical path(s) to change dynamically during the cluster-scheduling process. Therefore, in order to greedily select OPs on the critical paths first, after scheduling each OP we update the depth of the OP and the depth of all of its dependent nodes that became data ready as a result of scheduling the OP (lines 8 and 12 of figure 5). Coupled with the prioritized selection of nodes on the critical path, this fully resource-aware cluster scheduling approach lets CARS assign and schedule OPs on the critical paths in appropriate clusters in such a way that the stretching of critical paths is minimal, subject to the finite resource constraints of the target machine.
⁴ If the OP can be scheduled on more than one cluster in the earliest-cycle, then a cluster is selected for assignment based on some simple heuristics, such as giving preference to the cluster that does not require inter-cluster copy OPs or has lower local register pressure [26].
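To make the control flow of the scheduling loop concrete, the sketch below implements a heavily simplified cycle-driven list scheduler with cluster selection. It is not the algorithm of figure 5: register availability, bus contention, retroactive copy placement and the dynamic depth update are all omitted, the priority is fixed, and the function name, parameters and tie-breaking rule (fewest remote operands, then lowest cluster index) are assumptions made for the example.

```python
def cluster_schedule(latency, preds, priority, num_clusters,
                     fus_per_cluster, copy_latency):
    """Returns ({op: (cycle, cluster)}, {op: failed scheduling attempts})."""
    placed = {}
    attempts = {op: 0 for op in latency}
    cycle = 0
    while len(placed) < len(latency):
        free = [fus_per_cluster] * num_clusters       # FUs left in this VLIW
        ready = [op for op in latency
                 if op not in placed and all(p in placed for p in preds[op])]
        ready.sort(key=lambda op: (-priority[op], op))  # critical path first
        for op in ready:
            options = []                               # (earliest, copies, cluster)
            for c in range(num_clusters):
                if free[c] == 0:
                    continue
                earliest, copies = cycle, 0
                for p in preds[op]:
                    p_cycle, p_cluster = placed[p]
                    remote = p_cluster != c
                    copies += remote
                    arrival = p_cycle + latency[p] + (copy_latency if remote else 0)
                    earliest = max(earliest, arrival)
                options.append((earliest, copies, c))
            if options and min(options)[0] == cycle:
                _, _, c = min(options)                 # best cluster this cycle
                placed[op] = (cycle, c)
                free[c] -= 1
            else:
                attempts[op] += 1                      # retry in a later cycle
        cycle += 1                                     # open a new VLIW
    return placed, attempts

# Hypothetical DFG: c consumes a and b, d consumes c; two 1-FU clusters.
latency = {"a": 1, "b": 1, "c": 1, "d": 1}
preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
priority = {"a": 3, "b": 3, "c": 3, "d": 3}            # depth + height
placed, attempts = cluster_schedule(latency, preds, priority,
                                    num_clusters=2, fus_per_cluster=1,
                                    copy_latency=1)
```

In the example, `a` and `b` execute in parallel on different clusters, so `c` must wait one extra cycle for a remote operand wherever it is placed; it is deferred once and then scheduled in cycle 2. The deferral count plays the role of the scheduling-attempts variable described above.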
Often inter-cluster copy OPs have to be inserted in the DFG and retroactively scheduled in order to access operands residing in remote clusters (line 11 of figure 5). We use an operation-driven version of the CARS algorithm for this purpose. In order to find the best VLIW in which to schedule the copy OP, the algorithm searches all the vliws aggregates starting from the DEF cycle of the operand (or from the cycle in which the join node of the current region is scheduled, if the operand is not defined in the current region) to the current cycle. We use the tree VLIW instruction [10] representation so that independent OPs from multiple regions can be scheduled in the same vliws aggregate. Due to lack of space, we shall not further describe the details of the operation-driven CARS algorithm, tree-VLIW scheduling, and scheduling of loops, which may be found elsewhere [27, 26].
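The retroactive search for a copy slot can be pictured as a scan over the already-built VLIWs between the operand's DEF cycle and the current cycle. The sketch below is an assumption-laden illustration: the function name, the per-cycle resource lists, and the "earliest feasible slot" policy are ours, and the operation-driven algorithm in [27, 26] may choose slots differently.

```python
def place_copy(bus_free, comm_free, start_cycle, current_cycle, copy_latency=1):
    """Find a VLIW in [start_cycle, current_cycle - copy_latency] for a copy OP.

    bus_free[t]:  free inter-cluster bus slots in VLIW t (shared resource)
    comm_free[t]: free communication units on the source cluster in VLIW t
    Reserves the resources and returns the chosen cycle, or None if no slot
    lets the copied value arrive by `current_cycle`.
    """
    last_start = current_cycle - copy_latency
    for t in range(start_cycle, last_start + 1):
        if bus_free[t] > 0 and comm_free[t] > 0:
            bus_free[t] -= 1
            comm_free[t] -= 1
            return t
    return None

bus_free = [0, 1, 0, 1]        # hypothetical occupancy of four earlier VLIWs
comm_free = [1, 0, 1, 1]
first_copy = place_copy(bus_free, comm_free, start_cycle=0, current_cycle=4)
second_copy = place_copy(bus_free, comm_free, start_cycle=0, current_cycle=4)
```

A copy needs both a local resource (a communication unit on the source cluster) and a shared one (a bus slot) in the same VLIW, which is why the second request in the example fails once the only VLIW offering both has been taken.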
Implementation
We have implemented a code generator based on CARS on top of the CHAMELEON [37, 39] VLIW research testbed. The input to CHAMELEON is object code (.o files) produced by a modified version of the gcc compiler. An object-code translator processes these .o files to generate an assembly-like sequential representation. The modified version of the VLIW compiler with the CARS-based backend takes this sequential code as input and outputs VLIW code (tree-instructions), as shown in Figure 6(a).
The VLIW compiler first builds the DFG and then performs a series of optimizations, as shown in figure 6(b). Compilation is performed at the procedure level and without function in-lining. The prologue and epilogue code are added to the DFG after CARScheduling the procedure. These prologue/epilogue OPs are then cluster scheduled using CARS. The vliws aggregates are then passed through a peephole optimizer and on to the final code printing stage. The output of CARS, the tree-VLIWs, is then instrumented and translated into PowerPC assembly code that emulates the target clustered ILP processor.
A parametric description of the target clustered machine model can be specified to the code generator. The number, type and latency of function units in each cluster, any kind of arbitrary mapping of register name space to local register files, number, type and latency of global interconnect are some of the configurable parameters that are currently supported by our code generator. A subset of the machine resource information is maintained on a per-VLIW basis during code generation.
Related work
To the best of our knowledge, CARS is the first code generation scheme that combines the cluster assignment, instruction scheduling and register allocation phases for a partitioned register file clustered ILP processor.
Solutions for phase-ordering problem:
Several schemes have been proposed to combine different phases of code generation for clustered as well as non-clustered processors. The most recent one, the UAS algorithm [43] for clustered VLIW processors, performs cluster assignment and scheduling of instructions in a single step, using a variation of list scheduling. The UAS algorithm, however, does not consider registers as a resource during cluster scheduling. In contrast, the CARS algorithm treats registers as one of the resources and performs on-the-fly register allocation along with cluster scheduling in a resource-constrained manner.
A variety of techniques have been proposed for combining register allocation and code scheduling for single cluster processors. Goodman and Hsu proposed an integrated scheduling technique in which register pressure is monitored to switch between two scheduling schemes [21]. The operation-driven version of the CARS algorithm is motivated by this work. However, unlike their scheme, we switch to operation-driven scheduling only for scheduling inter-cluster copy OPs. Bradlee et al. [3] proposed a variation of the Goodman-Hsu scheme and another technique. Pinter [46] proposed a technique that incorporates scheduling constraints into the interference graph used by graph coloring based allocators. Berson et al. [2] proposed a technique based on measuring the resource requirement first and then using that information for integrating register allocation in local as well as global schedulers. Brasier et al. proposed a scheme called CRAIG [4] that makes use of information obtained from a pre-pass scheduler for combining the scheduling and register allocation phases. The compiler for the TriMedia processor uses a technique much like the scheme in [2] to combine register allocation and scheduling [25]. Hanono and Devadas [22], and Novak et al. [41] proposed code generation schemes for embedded processors, which combine the code selection, register allocation, and instruction scheduling phases. Motwani et al. provide NP-completeness results for a simple instance of the combined register allocation and instruction scheduling problem (CRISP) and an algorithm called the (α, β)-Combined Heuristic [38].
Early examples of techniques that did scheduling and register allocation concurrently for single cluster VLIW machines include the resource-constrained scheduling scheme by Moon and Ebcioglu [36]. The fundamental differences between our scheme and all the above are: 1) the use of register mapping information and separate local and global register status during CARScheduling, and 2) combining all three phases involved instead of two. The LaTTe compiler handles fast scheduling and register allocation together in the context of a Java JIT compiler for a single cluster VLIW machine [29].
Register allocation:
Local register allocation via usage counts [18] is a well known technique. More recently, a number of fast global register allocation schemes have been proposed. For example, the Linear Scan register allocation schemes [47, 51] use live interval information to allocate registers in one or two passes. All these schemes and the graph coloring based allocators [7] need information about the precise ordering of OPs for computing interfering live ranges, which is only available after scheduling. The on-the-fly register allocation scheme used in CARS is not based on any such information that is available only after scheduling the entire DFG.
The preferred register map approach in CARS is similar to the Value-Location Mappings used by the Multiflow compiler [19] and the scheme used by the LaTTe just-in-time compiler [52]. However, LaTTe makes two passes for local register allocation of tree regions and it needs copy OPs due to the mapping mismatches (as illustrated in Figure 3). In contrast, CARS performs global register allocation in a single CARScheduling pass using the pre-computed use-count of DEFs and tries to prevent the mapping mismatches.
Figure 6. Implementation of CARS on the CHAMELEON VLIW research testbed.
Cluster Scheduling:
Pioneering work in code generation for clustered VLIW processors was done by Ellis [13]. The Multiflow compiler [33] performs cluster assignment using a modified version of the Bottom-Up Greedy (BUG) algorithm proposed by Ellis in a number of steps and then performs register allocation and instruction scheduling in a combined manner. Desoli's Partial Component Clustering (PCC) algorithm [9] for clustered VLIW DSP processors is an iterative algorithm that treats the clustering problem as a combinatorial optimization problem. In the initial phases of the PCC algorithm, "partial components" of the DAG are grown and then these partial components are assigned to clusters, much like the cluster scheduling scheme of the Multiflow compiler. In the subsequent phases, the initial cluster assignments are further improved iteratively. In contrast, the cluster assignment approach of our scheme is fundamentally different from the recursive propagation of preferred lists of functional units and cluster assignment as in the BUG algorithm.
Another work is the cluster scheduling for the Limited Connectivity VLIW (LC-VLIW) architecture [6] . Cluster scheduling for the LC-VLIW architecture is performed in three phases. In the first phase, the DAG is built from the compiled VLIW code for an ideal single cluster VLIW processor. The second phase uses a min-cut graph partitioning algorithm for code partitioning. In the third phase, the partitioned code is recompacted after inserting copy operations.
The dynamic binary translation scheme used in DAISY performs "Alpha [28] style" cluster scheduling (without using inter-cluster copy OPs) along with register allocation for a duplicated register file architecture [11, 12].
Multiprocessor Scheduling:
A large number of DAG clustering algorithms have been proposed for multiprocessor task scheduling [1]. Sarkar's partitioning algorithm [48] and the more recent ones such as Dominant Sequence Clustering (DSC) [53] and CASS-II [32] are examples of such algorithms. The input to these algorithms is a DAG of tasks with known edge weights corresponding to the inter-node communication delays. Clustering is carried out in multiple steps, starting with each node in a different cluster of an infinite resource machine, followed by a sequence of refinement steps in which nodes are merged by "zeroing" the edge weight (communication delay between nodes), and finally merging the clusters so that the resulting number of clusters does not exceed the number of processors in the multiprocessor. The compiler for the MIT RAW project, RAWCC [31], employs a greedy technique based on the DSC algorithm for performing cluster scheduling in multiple phases. In contrast, the CARS algorithm performs cluster scheduling with register allocation in a single pass, always assuming a finite resource machine.
A number of iterative modulo scheduling schemes have been proposed for clustered processors recently [42, 17].
Experimental Results
We used a set of programs listed in Table 1 from the SPEC95, MediaBench [30] and SPEC2000 [24] benchmark suites for performance evaluation. The compiled simulation binaries are run to completion with the input data sets of corresponding benchmark programs. These instrumented binaries upon execution provide the number of times each tree VLIW instruction and each path in it are executed. For comparing the code generated for different clustered machine configurations we used the total number of VLIWs executed as a metric, which corresponds to the infinite cache execution time in cycles.
We used two base configurations, both single cluster machines, with 8 and 16 function units and with latencies as listed in [...] are configured such that the issue width and resources of the base machine are evenly divided and assigned to each cluster. We compared the number of cycles taken to execute the code generated for 2-cluster and 4-cluster machines. Figures 7 and 8 show the speedup (ratio of the number of VLIWs executed) of four different 8 and 16 ALU clustered machines with respect to the corresponding base configurations. A speedup higher than one was observed on clustered machines for some benchmarks. This is primarily due to the aggressive peephole optimizations done after CARScheduling and also due to the non-linearity of the cluster-scheduling process. Similar observations were reported for the code generated using the PCC algorithm [9]. On average, the additional number of cycles due to clustering compared to single cluster machines is less than 10% for 2-cluster machines with 16 or 8 ALUs, whereas the corresponding figure for 4-cluster machines is 14%, which is 5 to 15% better than prior compilation schemes [14]. This shows the efficiency and scalability of the CARS-based code generation scheme. The CARS algorithm tries to distribute computation across clusters, utilizing the available inter-cluster communication bandwidth. This is evident from the observed 10% increase in average performance when going from single bus to two bus configurations. In the single cluster configuration, CARS could compile more functions without spilling than the unmodified version of the CHAMELEON compiler (which performs register allocation after scheduling for single cluster machines). This clearly shows the ability of CARS to pick different OPs from the ready list until a register is available to schedule an OP. While compiling for clustered machines, the CARS algorithm automatically migrates computation to a cluster with lower register pressure.
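For readers who want to reproduce the two metrics used above from raw VLIW counts, the arithmetic is sketched below. The orientation of the ratio (base count divided by clustered count, so that values above one favor the clustered machine) is inferred from the text, and the counts in the example are made-up placeholders, not measurements from this study.

```python
def speedup(base_vliws, clustered_vliws):
    """Ratio of VLIWs executed; above 1.0 means the clustered machine is faster."""
    return base_vliws / clustered_vliws

def overhead_percent(base_vliws, clustered_vliws):
    """Additional cycles due to clustering, relative to the base machine."""
    return 100.0 * (clustered_vliws - base_vliws) / base_vliws

# Hypothetical counts for one benchmark (illustrative only).
base, clustered = 1_000_000, 1_080_000
s = speedup(base, clustered)
extra = overhead_percent(base, clustered)
```

With these placeholder counts the clustered machine executes 8% more VLIWs than the base machine, which corresponds to a speedup slightly below one.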
At present, the un-tuned prototype implementation of CARS inserts slightly more spill load/store OPs than graph coloring based allocators, which we believe can be improved by using better spilling heuristics.
Comparison with prior work:
The UAS algorithm solves only part of the cluster scheduling problem since it does not perform register allocation. CARS, on the other hand, is a comprehensive code generation scheme that can generate executable binaries. Also, the static estimation used in the UAS study [43] does not take into account some register allocation issues, such as dealing with calling conventions and the copy operations required for reconciling register mapping mismatches at the join nodes, that can affect both the schedule length and the quality of the code. Therefore, even though the UAS study has used a subset of similar machine configurations and benchmarks, its results cannot be compared with the results re- [...] Intuitively, the UAS algorithm followed by a register allocation (and a possible rescheduling) phase is more prone to the phase-ordering problem than the schemes that use the final phase to take care of all the resource constraints, such as the Bulldog and Multiflow compilers. Only a realistic implementation which can generate executable binaries can quantitatively compare the differences between the schemes, which is beyond the scope of this paper.
Complexity Analysis:
The worst-case time complexity of the CARS algorithm is O(mn²), where m is the average retry factor, i.e., the average number of scheduling attempts before a node can be successfully scheduled. We have measured the total number of times nodes are inserted in the ready list and the total number of nodes scheduled for each benchmark in all the clustered ILP processor configurations studied. The value of m is computed as the ratio of the total number of node insertions to the total number of nodes scheduled (shown in Table 3), which clearly indicates that m is a small constant for all the benchmarks. Therefore the practical complexity of CARS is O(n²).
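The retry factor defined above is a simple ratio of two counters that a scheduler can maintain as it runs. The sketch below shows that calculation; the function name and the counter values are hypothetical and do not correspond to any row of Table 3.

```python
def retry_factor(ready_list_insertions, nodes_scheduled):
    """Average number of ready-list insertions per successfully scheduled node."""
    if nodes_scheduled == 0:
        raise ValueError("no nodes were scheduled")
    return ready_list_insertions / nodes_scheduled

# Hypothetical counters collected over one compilation run.
m = retry_factor(ready_list_insertions=1500, nodes_scheduled=1200)
```

A value of m close to 1 means that most nodes are scheduled on their first attempt, which is the situation in which the O(mn²) bound behaves like O(n²) in practice.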
The complexities of the code generation schemes used in the Bulldog compiler [13], the UAS scheme [43] and the CARS framework are listed in Table 4, which shows that the practical complexities of CARS and the UAS scheme are about the same.
Conclusion
We have presented CARS, a new code generation framework for clustered ILP processors. Our work is motivated by the phase-ordering problems of code generators for clustered ILP processors. Our scheme completely avoids the phase-ordering problem by integrating cluster assignment, instruction scheduling and register allocation into a single phase, which in turn helps eliminate unnecessary spills and other inefficiencies due to multiple phases in code generation. The fully resource-aware cluster scheduling scheme of CARS not only helps avoid unnecessary stretching of critical paths in the code but also distributes computation evenly across the clusters whenever possible. We also described an efficient on-the-fly register allocation technique developed for CARS. Even though the register allocation scheme is described in the context of code generation for clustered ILP processors, the technique is well suited for other applications such as "just-in-time compilation" and dynamic binary translation. Our experimental results show that the CARS-based code generation scheme is scalable across a wide range of clustered ILP processor configurations and generates efficient code for a variety of benchmark programs.
Table 3. Average retry factor for each benchmark (columns: Benchmark, Av. retry factor).
Table 4. Complexity of code generation schemes (recoverable entries: Bulldog/BUG, O(n² log n), O(n); UAS).
