Abstract. In Clustered Instruction-level Parallel (ILP) processors, the function units are partitioned and resources such as register file and cache are either partitioned or replicated and then grouped together into onchip clusters. We present a novel partitioned register file architecture for clustered ILP processors which exploits the temporal locality of references to remote registers in a cluster and combines multiple inter-cluster communication operations into a single broadcast operation using a new sendb instruction. Our scheme makes use of a small Caching Register Buffer (CRB) attached to the traditional partitioned local register file, which is used to store copies of remote registers. We present an efficient code generation algorithm to schedule sendb operations on-the-fly. Detailed experimental results show that a windowed CRB with just 4 entries provides the same performance as that of a partitioned register file with infinite non-architected register space for keeping remote registers.
Introduction
Registers are the primary means of inter-operation communication and interoperation dependency specification in an ILP processor. In contemporary processors, as the issue width increase, additional register file ports are required to cater to the large number of function units. Further, a large number of registers are required for exploiting many aggressive ILP compilation techniques, as well as for reducing the memory traffic due to spilling. Unfortunately, the area, access delay, and power dissipation of the register file grows rapidly with the number of ALUs [1] . Clearly, huge monolithic register files with large number of ports are either impractical to build or limit the cycle time of the processor. The partitioned and replicated register file structures used in clustered ILP processors [1] [2] [3] [4] [5] [6] [7] [8] are two promising approaches to effectively address the issues related to large number of ports, area and power of register files. A typical cluster consists of a set of function units and a local register file, as shown in Fig. 1 . A local register file can provide faster access times than its monolithic counter-part, because each local register file has to cater only to a subset of all the function units in the processor. If the register file is replicated (each local register file shares the entire architected name space), then copy operations are required to maintain the coherency among the local register files. On the other hand, if the register file is partitioned (each local register file has a non-intersecting subset of the architected name space), then copy operations are needed to access registers from remote clusters. The main advantage of a partitioned register file over a replicated register file is that it can reduce the number of write ports as well as the size of the local register file, thereby providing shorter access delays, smaller area and lower power requirements. In this paper, we shall concentrate on partitioned register file architectures for statically scheduled processors. In general, the partitioned register file schemes used in statically scheduled clustered ILP processors belong to one of the following two categories:
-Registers are copied to remote clusters ahead of their use to hide the intercluster communication delay (eg. Multiflow Trace [9] , M-machine [10] , and Limited Connectivity VLIW [11] ). Often several copies of a register are maintained in the local register files of several clusters, increasing the register pressure due to inefficient use of architected name space in this scheme. -Remote registers are requested either on demand (eg. TI's clustered DSPs 1 [4] ) or using send-receive operations (eg. Lx [2] ), each time they are referenced. The potential drawback of this approach is that it demands more intercluster communication bandwidth, and hence may possibly delay the earliest time a remote register can be used when the interconnect is saturated.
In this paper, we propose a new partitioned register file with a Caching Register Buffer (CRB) structure, which tries to overcome the above drawbacks. The CRB is explicitly managed by the compiler using a single new primitive instruction called sendb.
The rest of this paper is organized as follows. Section 2 describes our CRBbased partitioned register file scheme, and discuss some related programming and hardware complexity issues. In section 3, we briefly discuss the code generation framework used and present an on-the-fly scheduling algorithm to explicitly manage the CRB using the new sendb operations. The experimental evaluation of the proposed CRB-based partitioned register scheme is discussed in section 4. We present related work in section 5, followed by conclusion in section 6.
Partitioned register file with CRB
The basic idea behind our scheme is to use compile-time techniques to exploit the temporal locality of references to remote registers in a cluster. Instead of allocating and copying a remote register value to a local register (and thereby wasting an architected register name), the remote register contents are stored in a Caching Register Buffer (CRB), which is a fast local buffer for caching the frequently accessed remote registers ahead of their use. The CRB is therefore functionally similar to a small fully-associative cache memory, and it may be implemented using an array of registers. There is a tag field (reg-id) and a one bit flag (valid-bit) associated with each CRB data entry as shown in Fig. 2 . The tag (reg-id) contains the unique address (architected name) of the cached remote register, and the valid-bit is used for validating the data and tag. An operation trying to access a remote register first performs an associative search in the reg-id array. If a valid copy of the remote register being accessed is available in the CRB, then the data value is returned from the appropriate CRB data register (akin to a cache hit). A hardware-initiated send-stall bubble is injected into the pipeline automatically if a valid copy of the requested remote register is not found in the CRB. During the send-stall cycle, which is similar to a cache miss, the data is loaded from the remote register file and the execution of the pipe is resumed.
Any local register can be sent to the CRBs of remote clusters via an explicit send-broadcast (sendb) operation using a communication function unit (CFU). The CRBs and the CFUs of all clusters are interconnected via one or more register data buses as shown in Fig. 3 . All the resources involved in inter-cluster communication -CRBs, CFUs, and register data buses -are reserved on a cycle-by-cycle basis at compile-time. The sendb operation (OP) is scheduled on the CFU by the compiler such that the remote register is cached in the CRB of a remote cluster ahead of its use. The sendb OP has the following format:
sendb SrcRegID ClusterIDbitvector where SrcRegID is the ID of the register that has to be sent to other clusters, and ClusterIDbitvector is a bit vector with bit positions indicating the remote cluster IDs to which the register value is to be sent. The width of the ClusterIDbitvector is therefore equal to the number of clusters in the processor. The CRB's bus interface unit snoops the register data bus; if the bit position in ClusterIDbitvector matches its cluster ID then it updates the appropriate CRB data register, concurrently initializing its tag reg-id entry and setting its valid-bit to 1.
Precise exceptions and calling conventions
An important issue that needs to be addressed is how to support precise exceptions in the presence of a CRB. Only the contents of the architected registers (the local partitioned register file) need to be saved on an interrupt by the interrupt handler code. In order to make the CRB structure transparent to the user program, the interrupt return (rfi) OP will invalidate the contents of all CRBs by resetting all valid-bits to 0. Therefore, after returning from the interrupt handler, if a valid copy of the requested remote register is not found in the CRB, it is loaded directly from the register file during send-stall cycles. An alternative approach is to save and restore the entire CRB image on an interrupt. However, this approach may result in large interrupt latencies and slower context switches.
In addition to physical partitioning, there is also a logical partitioning of register name space due to calling conventions. The logical partitioning due to calling conventions should co-exist with the physical partitioning in the new partitioned register file with CRB as well. We use a partitioning scheme in which both caller-save and callee-save registers are partitioned and distributed equally among all clusters [8] . However, the logical partitioning of register name space and the calling convention provide only a protocol to preserve the callee-save registers across function calls. This brings up the important issue of dealing with the changes in the cached callee-save registers in the CRBs across function calls. Note that, for the correct execution of the program, the data in any valid CRB register and the architected register specified by its reg-id tag should be the same. A simple solution to this problem is to invalidate the CRB contents upon return from a function call and load a copy of the register directly into the CRB and function unit. Such send-stall cycles can be completely eliminated by scheduling a sendb OP after the call OP for each one of the registers that are live across the function call. Our preliminary experiments showed that such schemes can either increase the number of send-stall cycles or increase the code size as well as schedule length, especially for function-call-intensive benchmark programs. This led us to the following hardware-based solution. The basic idea is to use "windowed" CRBs, which allow saving different versions of the CRB contents in different register windows. Like any register windowing scheme, such as SUN PicoJava processor's "dribbling" mechanism, each call nesting level selects a different CRB window automatically. When the nesting level of calls exceeds the maximum number of hardware CRB windows in the processor, the contents of the oldest CRB window are flushed to cache.
The number of ports in a typical CRB structure is less than the number of ports in a partitioned register file [8] . Because we need only a 4-way associative search logic for good performance (based on our simulation results presented in section 4), one can argue that it is feasible to build a windowed CRB structure that has the same access time of the partitioned register file attached to it. However, CRB may need a larger area than a traditional register file.
Compilation scheme

Overview of code generation framework
We used the cars code generation framework [12, 8] which combines the cluster assignment, register allocation, and instruction scheduling phases to generate efficient code for clustered ILP processors. The input to the code generator is a dependence flow graph (DFG) [13] with nodes representing OPs and directed edges representing data/control flow. The basic scheduling unit can be a set of OPs created using any "region" formation or basic block grouping heuristics.
During code generation, cars dynamically partitions the OPs in the scheduling region into a set of mutually exclusive aggregates (groups) -unscheduled, ready list, and vliws. The data ready nodes in the unscheduled aggregate are identified and moved into the ready list. The nodes in the ready list are selected based on some heuristics for combined cluster assignment, register allocation and instruction scheduling (henceforth referred to as carscheduling). cars also performs on-the-fly global register allocation using usage counts during carscheduling. This process is repeated until all the nodes of the DFG are carscheduled and moved to the appropriate vliws.
The cars algorithm extends the list-scheduling algorithm [14] . In order to find the best cluster to schedule an OP selected from the ready list, the algorithm first computes the resource-constrained schedule cycle in which the OP can be scheduled in each cluster, and then the minimum value of schedule cycle as earliest cycle. Based on the earliest cycle computed, the OP is either scheduled in the current cycle on one of the clusters corresponding to earliest cycle or pushed back into the ready list. Often inter-cluster copy OPs have to be inserted in the DFG and retroactively scheduled in order to access operands residing in remote clusters. An operation-driven version of the cars algorithm is used for this purpose. In the list-scheduling mode, only the most recently scheduled vliws aggregate's resource availability is searched to make a cluster-scheduling decision. On the other hand, in the operation driven scheduling mode 2 , resource availability in a set of vliws aggregates are considered to make the cluster-scheduling decision. In order to find the best VLIW to schedule the inter-cluster copy OP, the algorithm searches all the vliws aggregates starting from the Def cycle of the operand (or from the cycle in which the join node of the current region is scheduled, if operand is not defined in the current region) to the current cycle. In the next section, we describe how we have extended the cars algorithm for scheduling sendb operations on-the-fly.
3.2 On-the-fly scheduling of sendb operations Our basic approach is to schedule a sendb OP whenever it is beneficial to schedule a node on a cluster in which one or both of its remote source registers are not cached in the CRB and the inter-cluster communication bus is available for broadcasting the register value. We also want to combine multiple inter-cluster copy operations into a single sendb OP on-the-fly while carscheduling nodes in the DFG in topological order, without back-tracking. In order to combine multiple inter-cluster copy OPs, we keep track of sendb OPs scheduled for accessing each live register. The information is maintained by the code generator as a list of sendb OPs inside each live physical register's resource structure.
Our scheme may be explained using an example of scheduling nodes in a basic block. Assume that a sendb OP has already been scheduled for accessing set the bit corresponding to ClusterID in ClusterIDbitvector of S 12: end for a remote register, say r10, from a cluster C 1 , and that we want to use the same register r10 in a different cluster C 2 at some time later while scheduling a different Use-node of the Def mapped to r10. Our algorithm first checks if there are any sendb OPs scheduled for register r10. As there is one such sendb OP scheduled already for r10, we simply set the bit corresponding to the new destination cluster C 2 in the ClusterIDbitvector operand of the sendb OP (instead of scheduling a new inter-cluster operation as in traditional schemes). The process is repeated whenever a Use-node of the Def mapped to r10 needs the register in a different cluster. Therefore, by the end of scheduling all the OPs in DFG, the algorithm would have combined all the inter-cluster copy OPs into the single sendb OP by incrementally updating the ClusterIDbitvector operand of the sendb OP.
The scheme discussed above for scheduling sendb OPs in a basic block can be extended, as shown in Algorithm 1, to the general case of scheduling sendb OPs for accessing a remote register assigned to a web of Def-Use chains (corresponding to a life that spans multiple basic blocks). The nodes that use such registers that are allocated to webs of Def-Use chains may have multiple reaching definitions. Clearly, there are two options: either schedule a sendb OP for each one of the Def nodes that can be reached from the Use node, or schedule one sendb OP before the Use node in the current basic block. Our preliminary experiments showed that former is attractive only when there are empty issue slots in already scheduled VLIWs (low ILP applications), because if there are no issue slots then we have to open a new VLIW just for scheduling a sendb OP. Moreover, if there are not many remote uses for the register, then the latter option is better than scheduling multiple sendb OPs. In view of these observations, our algorithm uses the following scheme to reduce the number of sendb OPs. If sendb OPs have already been scheduled for all the reaching definitions corresponding to a Use, then we update the ClusterIDbitvector of all those sendb OPs (lines 9-12). Otherwise, we schedule a sendb OP in the current basic block using the operation-driven version of cars algorithm (line 5) [8] .
Experimental results
We used the chameleon VLIW research compiler [15] (gcc version) with a new back-end based on the cars code generation framework [12] for the experimental evaluation of our new partitioned register file scheme. We have developed a cycleaccurate simulator to model the clustered ILP processor with CRB and intercluster buses. We use the compiled simulation approach: each VLIW instruction is instrumented and translated into PowerPC assembly code that calls the CRB simulator. For each VLIW instruction, the simulator processes all the local and remote register file access requests, updates the CRB contents and keeps track of the number of CRB hits and misses based on whether the requested register was found in the CRB or not. The send-stall cycles are computed based on the assumption that multiple CRB misses are serviced serially subject to the availability of inter-cluster buses. FIFO replacement policy is used in the CRB.
We compared the CRB-sendb scheme with the traditional partitioned register file scheme using inter-cluster copy (rcopy) OPs. We generated executable compiled simulation binaries for the 14 benchmark programs used (8 from SPEC CINT95, 2 from SPEC CINT2000, and 4 from MediaBench). The binaries are run to completion with the respective input data sets of the benchmark programs. We used the execution time of benchmark programs in cycles as a metric to compare the performance of 4 different clustered ILP processors listed in Table 1. Infinite cache models were used for all configurations and the penalty for flushing/restoring a CRB window is assumed to be 5 cycles. The rcopy OPs are semantically equivalent to register copy OPs; however their source and destination registers reside in different clusters. For each remote register access, an rcopy OP is scheduled on the destination cluster to copy a register from a remote register file to an infinitely large non-architected name space in the local register file. The rationale is to compare the performance of CRB-sendb scheme with the (unrealistic) upper-bound on performance that may be achieved via traditional partitioned register file scheme using rcopy OPs. We studied three partitioned register file configurations with CRB sizes 4, 8 and 16.
The clustered machines are configured such that the issue width and resources of the base machine are evenly divided and assigned to each cluster as shown in Table 1 . Condition code registers (CCRs) and buses are treated as shared resources. The function units, which are fully pipelined, have the following latencies: Fix, Branch, and Communication: 1 cycle; Load and FP: 2 cycles; FP divide and sqrt: 9 cycles. Figures 4, 5, 6 and 7 show the speed up (ratio of execution times) with respect to the base single cluster processor for the four different 8 ALU clustered processors for all the benchmarks studied. In 2-cluster configurations, CRB-sendb scheme does not provide any significant performance improvement for most of the benchmarks. This is mainly because, in a 2-cluster machine, the sendb OP degenerates into an rcopy OP because there is only one target cluster. However, in the 4-cluster configurations, the CRB-sendb scheme outperforms the traditional rcopy scheme. In 4-cluster configurations, on an average an additional speedup of 2%, 4.9%, and 5.7% are observed over rcopy scheme when 4, 8 and 16-entry CRB is used with partitioned register file respectively. Speedup higher than one was observed on clustered machines for some benchmarks (124.m88ksim). This is primarily due to the non-linearity of the cluster-scheduling process and also due to the aggressive peephole optimizations done after carscheduling; similar observations have been reported for the code generated using other algorithms [12] . For some benchmarks, especially for 126.gcc and 134.perl, the rcopy scheme is observed to perform better than sendb scheme. This is because of the combined effect of factors such as finite size of CRB, and redundant sendb OPs scheduled for some long live ranges with multiple reaching Defs. It may be possible to eliminate these sendb OPs (or replace them with a single local sendb OP) by making an additional pass on the scheduled code performing a data flow analysis. A partitioned register file with only a 4-entry CRB is therefore sufficient to achieve a performance identical to a clustered machine with an infinitely large local register space for keeping remote registers. Note that the rcopy experiments provide a conservative comparison: we have not modeled the effects of increase in register pressure, the potential possibility of spills due to the increased register pressure, and the impact of increase in code size in our rcopy experiments. These effects would have further lowered the performance of the rcopy scheme reported herein.
Related work
The partitioned register file structures used in prior clustered ILP processor architectures [9, 11, 4, 10, 2] are different from our CRB-based scheme as explained in section 1. Caching of processor registers in general is not a new concept and has been used in different contexts such as the Named-State Register file [16] and Register-Use Cache [17] for multi-threaded processors, the pipelined register cache [18] , and Register Scoreboard and Cache [19] . Our scheme is different from these as none of them are aimed at keeping a local copy of a remote register.
Fernandes et al. [20] proposed a partitioned register file scheme in which individually addressable registers are replaced by queues. In contrast, our scheme permits random access to the cached remote registers in the CRB, and also provides more freedom to the cluster scheduler for scheduling the Use-nodes of remote registers in any arbitrary order limited only by true data dependencies. Llosa et al. proposed a dual register file scheme which consists of two replicated, yet not fully consistent register files [21] . Cruz et al. proposed a multiple-bank register file [22] for dynamically scheduled processors in which the architected registers are assigned at run time to multiple register banks. Zalamea et al. have proposed a two-level hierarchical register file [23] explicitly managed using two new spill instructions. In contrast, CRB is transparent to the register allocator; it is not assigned any architected name-space, which in turn allows it to cache any architected register. Due to space limitations, readers are referred to [8] for a more comprehensive list of related work and comparison of those with our scheme.
Conclusion
We presented a new partitioned register file architecture that uses a caching register buffer (CRB) structure to cache the frequently accessed remote registers without using any architected registers in the local register file. The scheme reduces register pressure without increasing the architected name space (which would have necessitated increasing the number of bits used for specifying register operands in instruction encoding). The CRB structure requires only one write port per inter-cluster communication bus, and the partitioned register file with CRB can be realized without affecting register file access time of a clustered ILP processor. The CRB is managed explicitly by the compiler using a new sendb operation. The sendb OP can update multiple CRBs concurrently via broadcasting the register value over inter-cluster communication bus/network. The sendb OP can thus eliminate multiple send-receive OP pairs used in prior schemes. It can also combine several inter-cluster copy OPs used by the rcopy-based prior partitioned register file schemes. Experimental results indicate that in a clustered ILP processor with non-trivial number of clusters, a 4-entry windowed CRB structure can provide the same performance as that of a partitioned register file with infinite non-architected register space for keeping remote registers. Because clustered ILP processors are an attractive complexity-effective alternative to contemporary monolithic ILP processors, these results are important in the design of future processors.
