This paper evaluates two register sharing techniques for reducing register usage. The first technique dynamically combines physical registers having the same value. The second technique combines the demand of several instructions updating the same logical register and shares physical register storage among them. While similar techniques have been proposed previously, an important contribution of this paper is to exploit only special cases that provide most of the benefits of more general solutions but at a very low hardware complexity. Despite the simplicity, our design reduces the required number of physical registers by more than 10% on some applications, and provides almost half of the total benefits of an aggressive (complex) scheme. More importantly, we show that the simpler design to reduce register pressure has significant performance effects in a simultaneous multithreaded (SMT) architecture where register availability can be a bottleneck. Our results show an average of 25.7% performance improvement for an SMT architecture with 160 registers or, equivalently, similar performance as an SMT with 200 registers (25% more) but no register sharing.
Introduction
In pursuit of higher performance through higher clock rates and greater instruction level parallelism (ILP), modern microarchitectures are buffering an ever greater number of instructions in the pipeline. The larger window of in-flight instructions offers the microarchitecture hardware more opportunities to discover independent instructions to issue simultaneously. However, maintaining more instructions requires a corresponding increase in the buffering structures; in particular, a larger physical register file with which to hold the generated results.
On the other hand, the forces countering arbitrarily large buffers are the effects on cycle time due to non-scalable wire delays [1] and limits on power consumption [7]. Also, smaller buffers can reduce complexity in other regions of the chip: e.g., the number of wires in the issue logic of the Alpha 21264 is directly proportional to the number of physical registers [5]. Thus, despite large transistor budgets from shrinking technology dimensions, efficient use of register resources will always be an important design consideration.
A method for improving the use of physical registers to decrease the overall average demand is register sharing based on value [8]. In this type of sharing, logical registers containing the same value can be mapped to the same physical register, with the other physical register being released early.
In [8], a general scheme to detect shared values is outlined for the Intel IA-32 architecture, an instruction set that exhibits considerable register pressure due to the small set of logical registers.
Another method to reduce register pressure is to aggressively reclaim physical registers when their values are no longer needed. Typical register renaming schemes conservatively allocate and release registers, with the result that physical register lifetimes are unnecessarily long. Techniques have been proposed to alleviate this issue, such as delaying the allocation until it is necessary [6] and releasing early the dead registers indicated by compiler analysis [11].
An important contribution of this paper is to look at the practical design issues of these different ways of reducing register pressure, with a special focus on lowering the hardware complexity. We show that there are important special cases which provide many of the benefits but with much less complexity than that required for the general cases.
In particular, we show that restricting shared value detection to the special values zero and one generates almost half the benefits of the more general technique, which shares arbitrary values. Restricting sharing to these two values enables a number of optimizations that greatly simplify the implementation. Additionally, we propose a very simple mechanism that allows multiple versions of the same logical register to share the same physical register. We focus on a special type of instruction that we call single-use self-overwriting instructions. These instructions are quite numerous (about a quarter of all value-producing instructions) and their data dependence guarantees that they will execute in program order; thus, such instructions do not need multiple physical registers to avoid anti- and output-dependences.
Part of our contribution is the detailed design of our simple register sharing scheme that exploits these special cases. We further show that the benefits of register sharing can be significant in a simultaneous multithreaded architecture where registers are more likely to be a limiting resource. In our simulations, we demonstrate a 25.7% performance improvement on average across a set of integer benchmark mixes.
0-7803-8385-0/04/$20.00 ©2004 IEEE
The rest of this paper is organized as follows: Section 2 presents an overview of the methods to dynamically share physical registers. Our evaluation environment is discussed in Section 3. Results are presented in Section 4. We present a practical implementation of our design in Section 5. Related work is presented in Section 6. We conclude and discuss future work in Section 7.
Dynamically Sharing Physical Registers
Register renaming schemes dynamically transform a program into single-assignment form to remove false dependences and thus expose more instruction-level parallelism. However, allocating a physical register for every value instance can lead to register waste. For example, if multiple physical registers contain the same value, then the mapping can be adjusted to use a single copy of the value and the redundant registers can be freed.
Sharing physical registers increases the effective number of physical registers, which can improve performance when registers are a scarce resource, such as in simultaneous multithreaded (SMT) processors [18].
In this section, we describe a general method for detecting shared values in the integer register file and reducing the number of physical registers in use. We also describe how, in certain cases, multiple instructions can share the physical register storage.
We show in Section 4 that specializing these techniques provides most of the opportunities for register sharing and greatly simplifies the design. In particular, limiting value-based sharing to just two values, zero and one, is a good design trade-off. In Section 5 we present details of a design that is both simple and effective.
The common value buffer (CVB)
The common value buffer (CVB) is a mechanism for detecting arbitrary shared values between registers. The CVB is a fully associative buffer of the last N generated values (LRU replacement). After instruction execution, results are compared to values in the CVB. Matches (hits) are considered to be common values.
Associated with each value in the buffer is the ID of an active physical register having that value. The mapping of the logical register associated with the just-executed instruction is modified to point to the physical register from the CVB. This update requires modifying the logical-to-physical register alias table (RAT) and also updating the source fields of the instructions waiting for issue. The implementation requires that a counter be associated with each physical register [8]; the counter is incremented each time another register is redirected to use the value. The physical register cannot be released until its count decrements to zero. Since the precise timing of these actions requires details of the pipeline, we defer discussing the specifics until Section 5.
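The CVB lookup and the per-register use counter described above can be sketched as follows. This is a minimal illustrative model, not the hardware design: the class and method names (CommonValueBuffer, record_result, try_release) are hypothetical, and the release policy (each superseded mapping decrements the counter, and storage is freed once the counter is already zero) is an assumption about how the counter is used.

```python
from collections import OrderedDict


class CommonValueBuffer:
    """Fully associative buffer of the last `size` values, LRU replacement."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()  # value -> physical register holding it
        self.ref_count = {}           # physical register -> number of extra sharers

    def record_result(self, value, preg):
        """Return the physical register that should hold `value`.

        On a hit the caller redirects its mapping to the returned register
        and may free `preg`; on a miss the value is inserted and `preg` kept.
        """
        if value in self.entries:
            shared = self.entries[value]
            self.entries.move_to_end(value)       # mark as most recently used
            self.ref_count[shared] = self.ref_count.get(shared, 0) + 1
            return shared
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)      # evict least recently used
        self.entries[value] = preg
        self.ref_count.setdefault(preg, 0)
        return preg

    def try_release(self, preg):
        """One mapping of `preg` is superseded; True if storage is freed."""
        if self.ref_count.get(preg, 0) > 0:
            self.ref_count[preg] -= 1             # other sharers remain
            return False
        self.ref_count.pop(preg, None)
        for value, holder in list(self.entries.items()):
            if holder == preg:
                del self.entries[value]           # value no longer available
        return True
```

In this model a register shared by two logical mappings needs two release events before its storage returns to the free pool, which is the bookkeeping that the zero/one special case avoids.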
Trivial computations
Trivial computations [21] are computations in which the result can be known from the operand values without performing the calculation itself, e.g., a logical bit-wise AND with zero. We detect two types of trivial computations. The first, called Trivial 0, detects results that can be determined a priori to always be zero. The second class, called Trivial X, detects computations in which the result can be determined a priori to match the value of either operand. The list of trivial computations we detect and their input conditions is given in Table 1. These computations are detected during decoding and renaming, when the physical register mappings of the source operands are read. Upon detection of a trivial computation, the register rename logic maps the destination register to the zero register (for Trivial 0 computations) or to the same register as the source operand (for Trivial X computations).
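A rename-time check for the two classes might look like the sketch below. Table 1 is not reproduced in this text, so the opcode list here is an assumption chosen for illustration, as are the function name detect_trivial and the register ID ZERO_REG. An operand is treated as zero only when it is already mapped to the zero register, so no register value has to be read.

```python
ZERO_REG = 0  # hypothetical ID of the hardwired zero register


def detect_trivial(op, src_a, src_b):
    """Return the physical register the destination can be mapped to,
    or None when the computation is not trivial.

    `src_a` and `src_b` are the physical registers of the source operands.
    The opcode sets are illustrative assumptions, not the paper's Table 1.
    """
    a_zero = src_a == ZERO_REG
    b_zero = src_b == ZERO_REG

    # Trivial 0: the result is known to be zero.
    if op in ("xor", "sub") and src_a == src_b:
        return ZERO_REG
    if op in ("and", "mul") and (a_zero or b_zero):
        return ZERO_REG

    # Trivial X: the result equals one of the operands.
    if op in ("add", "or", "xor") and b_zero:
        return src_a
    if op in ("add", "or", "xor") and a_zero:
        return src_b
    if op == "sub" and b_zero:
        return src_a
    if op in ("and", "or") and src_a == src_b:
        return src_a
    return None
```

Only the Trivial 0 branches redirect to a dedicated register; the Trivial X branches return an ordinary physical register and therefore rely on general sharing support.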
Register lifetime reduction
In an R10000-like register renaming scheme, a physical register's lifetime spans from allocation to release, whereas the actual useful time is between the definition of the register and its last use, as shown in Figure 1. For simplicity, the register is allocated at the decode/register-renaming stage. Depending on the waiting time and the number of cycles of execution, this can be many cycles earlier than the write-back stage, where the register storage is truly necessary. Furthermore, the allocated register is only released conservatively, at the commit time of the next instruction that updates the logical register. This standard strategy can lead to a much longer lifetime than is necessary. Our analyses show that a register is only needed for an interval of about 10-20% of its total lifetime. This suggests that there is potential to increase the effective register file size by reducing lifetime. In this paper, we propose a simple technique that exploits self-overwriting instructions to reduce effective register lifetime. We refer to instructions that update one of the source registers as self-overwriting (SO) instructions, e.g., r1 + r2 → r1. As shown in Figure 2, if the self-overwriting instruction is also single-use (i.e., the instruction is the only consumer of the value in the destination register), then the lifetimes of the versions (in physical registers) of the (logical) destination register do not overlap. This is enforced by standard data dependence checking hardware, which serializes this set of instructions due to their read-after-write dependences. Therefore, instead of allocating multiple physical registers to hold the different versions of the logical register, these versions can conveniently share the same physical storage. Notice that although the instructions can share storage space, distinction among these versions is still necessary to allow correct dependence tracking and value communication.
We defer these discussions and other implementation details to Section 5.
Register sharing designs
There are a number of options in how value-based (register) sharing, trivial computation detection, and lifetime-based sharing can be implemented. The main design features are the following:
Common values. The CVB permits register sharing with arbitrary values. A subset of the common value space is the set of highly used values of zero and one. While the CVB can provide more opportunities for sharing, focusing only on the zero-one subset allows only dedicated (hardwired) registers to be shared and eliminates the register use counters.
Trivial computations. Implementing Trivial X detection requires general support for register sharing. In contrast, implementing Trivial 0 requires only redirection to a hardwired register that is not actually part of the pool of physical registers, making detection extremely simple.
Early release stage. For zero/one detection, common values are detected during the execution stage. For general value detection using a CVB, detection occurs in the writeback stage. The earliest that registers can be released back to the free pool is one cycle after detection. In Section 5 we discuss why delaying register release until the instruction commits reduces implementation complexity.
Bandwidth. Permitting multiple register redirections per cycle minimizes the delay in freeing register resources. However, it requires duplication of logic. On the other hand, buffering sharing requests and limiting redirection to one request per cycle eliminates the need for duplicate logic at the cost of delaying early register release.
Instruction type. When a single-use self-overwriting (SUSO) instruction shares a register with the previous definition and overwrites it upon execution, we need the ability to retrieve the overwritten value when handling exceptions. One way to retrieve the value is to reverse the instruction. To be able to do this, we can only allow SUSO instructions with reversible opcodes to share a register (see Section 5.4.2). Alternatively, we can rely on a more elaborate checkpointing scheme and allow all types of SUSO instructions.
[Figure caption: A, D, U, and R stand for Allocation, Definition, Usage, and Release. Physical registers 1 to 3 can be substituted with a single register 1' with an extended lifetime.]
Chain scope and length. SUSO instructions can form a chain of arbitrary length that can span multiple basic blocks. Limiting the chain to a single basic block greatly simplifies branch misprediction handling. Also, the length of the chain that is allowed to share a register dictates the number of bits needed in the wakeup system to differentiate between the instructions that write to the same register. These design considerations will be discussed in more detail in Section 5.4.
From the above features, we construct two design points to compare against a base design without register sharing. In all, there are three cases:
• base: No register sharing.
• complex: The most aggressive design to maximize register sharing and early release of registers. The design includes a CVB, Trivial X, immediate release, and a redirection bandwidth matching the issue bandwidth. Every chain of self-overwriting instructions shares a single physical register, regardless of the length and scope of the chain. We note that complex requires complicated micro-architectural support, especially for handling branch mispredictions and exceptions.
• simple: A modest design that trades some reduction in register sharing opportunities for much less implementation complexity. For value-based register sharing: values are restricted to zero/one (there is no CVB); only Trivial 0 is exploited; early release is delayed until the instruction commits; and register sharing updates are restricted to one per cycle. For register sharing based on lifetime, sharing is limited to only those SUSO instructions within the same basic block as the initial assignment, and only up to three self-overwriting instructions are allowed to share the allocated register. This is the scheme we detail in Section 5.
Methodology
We explore the effects of the various register sharing schemes on individual applications to provide insight into behavior with sharing. From this data, we justify the simple scheme as providing the best advantage for the cost, and then demonstrate this scheme's effects on performance in the much more register-intensive environment of an SMT architecture.
For the exploratory, per-application results we use the SimpleScalar simulator version 4.0 (MASE) [10]. The processor is modeled after the MIPS R10000 [20] and has a general pool of 64 physical integer registers (no dedicated architectural registers). For the SMT simulations (the primary results) we use SMT-SIM [18]. The SMT processor configuration parameters are given in Table 2.
We have modified both simulators to perform common value detection and register redirection as described in the text. Our focus is on common value reuse in integer benchmarks. The complete list of SPECint 2000 benchmarks is given in Table 3.
[Table fragment: Threads: 4; Fetch/Decode width: 16 instructions]
Results
We limit the primary study to comparing three register sharing configurations: base, simple, and complex (defined in Section 2). Shown in Figure 3 is the per-application performance improvement for the three configurations. More important is the reduction in register pressure using the three schemes, shown in Figure 4. In this figure lower bars are better. While the average performance improvement across all the individual applications is only 1.3% for simple and 4.5% for complex, the average number of physical registers in use decreases by 4.4% and 10.6%, respectively. In other words, the simple scheme garners almost half the reduction in register pressure of the more complex scheme.
Special care was taken to eliminate NOP-type instructions from being counted. As the buffer size is doubled, the hit rate increases linearly. Shown in Figure 6 is the percentage of values that are either zero or one. These two values are invariably the top two most frequently occurring values. On average, 9.6% of the instructions generate one of these special values. The occurrence rate of almost 10% for values of zero and one nearly matches the hit rate of the larger CVB; thus, simply detecting these two values provides much of the benefit. Moreover, for all the applications, the frequency of the third most frequent value is negligible and the value itself is application-dependent.
Another method would be to arbitrarily ignore all but one sharing request each cycle. We implement the former.
The simplicity of the simple scheme and its demonstrated effectiveness in exploiting register sharing suggest that the simple design is preferable to the more complex design for actual implementation.
[Table fragment: Number of values: 0, 1, 2, 3+; row label: Fetch]
Simultaneous multithreading. Figure 4, which shows the reduction in register pressure, is more interesting than the performance data, since the effect on performance of reducing register pressure is highly dependent on whether the registers are a performance-limiting resource. In an SMT processor, however, the availability of physical registers is often a limiting factor, much more so than in a single-threaded architecture. We explore this effect in Figure 7, which shows various aspects of the performance improvement for the SMT processor with the simple scheme.
In Figure 7-(a) we show the reduction in register conflict using the simple scheme. Register conflict is measured by the number of cycles the decode stage is stalled due to a lack of available registers. With 160 physical registers, register conflicts are reduced by 28% to 40%, with an average of 35%. The simple scheme has the same or better effect as adding 40 physical registers (25% more). The decrease in register pressure from sharing improves performance by an average of 25.7% (Figure 7-(b)). Even with 200 registers, the scheme still improves performance by an average of 8.8%, and by up to 13.2%. In Figure 7-(c), we show the performance improvements for each individual application in the mixes, using the simple scheme with 160 registers. We measure the execution of the SMT processor for a fixed number of cycles and calculate the number of instructions finished per thread with and without the simple scheme. Not surprisingly, as shown in the figure, the increase in resources benefits all threads relatively evenly. Finally, in Figure 7-(d) we show the effect of value-based and lifetime-based register sharing in isolation and combined. We can see that while value-based sharing is more effective, the improvements from both components are additive.
Overall, we have shown that the simple scheme is indeed very effective and delivers significant performance improvement in an SMT processor.
A Design for Dynamic Register Sharing

The baseline system
While dynamic out-of-order microprocessors have various possible implementations, we focus on a straightforward baseline processor core that is largely based on the MIPS R10000 [20]. The pipeline of the processor is shown in Figure 8.
In this rename process, source logical registers are renamed into physical registers by reading their corresponding RAT entries. Each instruction with a destination register allocates a free physical register in FIFO manner from the free list. This newly allocated physical register is written into the RAT, in the entry for the destination (logical) register. The previous value of that entry (the "old" physical register ID) is copied into the instruction's entry in the reorder buffer (ROB). When this instruction commits, the "old" physical register is appended to the free list. To handle branch misprediction, the RAT, together with the read pointer of the free list, is checkpointed upon decoding of a branch [20]. The checkpoint is restored when the branch is detected as mispredicted. Since the read pointer of the free list is also restored, physical registers allocated to wrong-path instructions are freed instantly.
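The rename, commit, and checkpoint steps above can be modeled with a circular FIFO free list. The sketch below is a simplified single-instruction-per-call model; the class name Renamer, its method names, and the initial identity mapping of logical to physical registers are illustrative assumptions, not details from the baseline design.

```python
class Renamer:
    """R10000-style renaming with a circular FIFO free list (illustrative)."""

    def __init__(self, num_logical, num_physical):
        self.rat = list(range(num_logical))        # logical -> physical (assumed initial map)
        self.free = list(range(num_logical, num_physical))
        self.slots = len(self.free)                # capacity of the circular FIFO
        self.read_ptr = 0                          # next register to allocate
        self.write_ptr = self.slots                # next slot for a released register

    def free_count(self):
        return self.write_ptr - self.read_ptr

    def rename(self, dest, srcs):
        """Rename one instruction; returns (source pregs, new preg, old preg)."""
        src_pregs = [self.rat[s] for s in srcs]    # sources read the RAT first
        if self.free_count() == 0:
            raise RuntimeError("no free physical register: decode must stall")
        new_preg = self.free[self.read_ptr % self.slots]
        self.read_ptr += 1
        old_preg = self.rat[dest]                  # kept in the ROB entry
        self.rat[dest] = new_preg
        return src_pregs, new_preg, old_preg

    def commit(self, old_preg):
        """At commit, append the superseded register to the free list."""
        self.free[self.write_ptr % self.slots] = old_preg
        self.write_ptr += 1

    def checkpoint(self):
        """Taken when a branch is decoded: RAT copy plus free-list read pointer."""
        return list(self.rat), self.read_ptr

    def restore(self, ckpt):
        """On a misprediction, wrong-path allocations are undone in one step."""
        rat, read_ptr = ckpt
        self.rat = list(rat)
        self.read_ptr = read_ptr
```

Because restoring the read pointer implicitly frees every wrong-path register, any scheme that releases a register before commit has to make sure the same register is not freed a second time by this mechanism.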
Instruction wakeup. Each physical register has a dedicated busy bit that is set during allocation to indicate that the producer has not finished execution and that dependent instructions therefore need to wait in the instruction queue (or issue queue). When an instruction is issued, its destination register is broadcast to wake up dependent instructions and mark the associated operands as ready. Instructions with all operands marked ready can be issued in the following cycle. To speed up back-to-back data-dependent instructions, any operands that use the result of a currently executing instruction will read the value off the bypass path as it is written to the register file.
Value-based register sharing
Overview
When two physical registers Pa and Pb contain the same value, one of the registers (say Pb) can be early-released and re-allocated for other instructions. To ensure that future instructions intending to read from Pb will read from Pa, the following needs to be done:
1. Change the RAT entry pointing to Pb to Pa.
2. Change any Pb in the source operand field of an instruction in the issue queue to Pa.
Seemingly, these steps would correctly modify the processor state and allow the release of Pb. However, there are three primary complications, listed below as C-1 to C-3. For discussion purposes, let us call the corresponding logical registers La and Lb, respectively:
C-1 Physical register Pa cannot be released as usual, namely when the next producer of La is committed. We have to wait until both La and Lb are defined again, and the producer instructions have committed. Reference counting has been proposed to keep track of when a physical register can be freed [8], but the design is complex, especially if it permits sharing along speculative paths, as reference counters need to be fixed upon a branch misprediction.
C-2 If the RAT entry of Lb has been overwritten, then one of the in-flight instructions (Ia) will release Pb at its commit time. Only one release, either the early release or the normal release by Ia, can be allowed. If we allow the early release, we have to search the ROB, perhaps associatively, and modify the entry of Ia. As the ROB continues to grow in size, and would otherwise need only indexing-based access, this search functionality would have a significant impact on the scalability of the ROB.
C-3 Branch mispredictions also present complications. First, if we allow an instruction to early-release its allocated physical register before the instruction is committed, then if the instruction is squashed because of misprediction recovery we cannot free the register again. In our R10000-like register renaming scheme, this presents a major design challenge, since freeing all the registers allocated on the wrong path is done in a single action of restoring the read pointer of the free list [20]. Second, when we early-release a register (say Pb), we need to change not only the current RAT (if it is still mapped) but also any checkpoints where Pb appears. Otherwise, when a branch misprediction happens, we may restore a checkpoint (made before the early release) containing Pb, resulting in an error. The ability to search and selectively change checkpoint entries would introduce significant overhead.
To implement generic dynamic register sharing, handling these complications would require complicated hardware support and/or a very conservative sharing scheme. We now describe an implementation for the special case of register sharing that is simple and straightforward. Limiting sharing to the values of zero and one allows the following simplifications:
1. Because their values are fixed, special dedicated registers (P0 and P1) can be provided without the need of freeing (C-1).
2. Detecting these two common values is almost trivial.
3. Changing the content inside the RAT or the source register field in the issue queue is greatly simplified: only one bit needs to be set (to 0 or 1), while the others can be cleared.
Additionally, in the interest of hardware simplicity, we only early-release a physical register when (1) the common-value-producing instruction commits, and (2) the register is still in the RAT. Restriction (1) ensures that a branch misprediction rollback will not free any early-released register (C-3). Restriction (2) avoids complication C-2. Combining (1) and (2), we know that if the register is still in the RAT it will be present in the same entry in any checkpoint as well. Thus, we only need the ability to check the primary RAT, not any of the checkpoints. Given that we only exploit values zero and one, changing the checkpoints is quite easy.
1. If an ALU operation results in 0 or 1, the instruction is marked in the ROB as a common-value-producing instruction. The specific value is also recorded. This requires only two extra bits in the ROB.
2. When an instruction is committed, the superseded physical register OldPReg is released as usual. If the instruction is marked as a common-value-producing instruction, its allocated physical register NewPReg becomes a candidate for early-release. It is broadcast through a special CAM port (Section 5.2.2) to detect its presence in the RAT. When a match occurs:
(a) The physical register is released to the free list.
(c) The physical register number is entered into a 1-cycle delay buffer and used to rename instructions inside the issue queue (Section 5.2.3).
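The commit-time procedure can be sketched as a single function over simplified data structures. This is an illustrative model under stated assumptions: the dictionary field names, the IDs chosen for P0 and P1, and the function name commit are hypothetical. Redirecting the RAT and checkpoint entries to P0/P1 follows the description in the text, while skipping the release of P0/P1 when they appear as the superseded register is an assumption made so that dedicated registers never enter the free list.

```python
P0, P1 = 0, 1  # hypothetical IDs of the dedicated zero and one registers


def commit(entry, rat, checkpoints, free_list, delay_buffer):
    """Commit one ROB entry; return True if its register was early-released.

    `entry` has the fields old_preg, new_preg and common_value (None, 0 or 1).
    `rat` and each element of `checkpoints` are lists indexed by logical register.
    """
    old = entry["old_preg"]
    if old is not None and old not in (P0, P1):
        free_list.append(old)                  # normal release of OldPReg

    value = entry["common_value"]
    if value is None:
        return False                           # not a common-value producer

    new = entry["new_preg"]
    if new not in rat:                         # models the CAM search of the RAT
        return False                           # already remapped: no early release

    dedicated = P1 if value == 1 else P0
    for table in [rat] + checkpoints:          # checkpoints hold the same mapping
        for logical, preg in enumerate(table):
            if preg == new:
                table[logical] = dedicated
    free_list.append(new)                      # early release of NewPReg
    delay_buffer.append((new, value))          # broadcast to issue queue next cycle
    return True
```

Since the producer has committed and the register is still mapped, every live checkpoint was taken after the mapping was created, which is why the same rewrite can be applied to all copies without a per-checkpoint search.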
Modified RAT
The register alias table needs to be slightly modified to add the functionality mentioned above. Figure 10 shows the diagram for the modified rename table. The base cell design is shaded for a p-ported table. Figure 10-(a) shows the comparators and clear transistor for all bits other than the least significant bit (LSB) of the physical register ID. Figure 10-(b) has additional logic (shown in bold lines) that sets the LSB to the common value (V) produced by the instruction. The match line is precharged and sense-amplified to perform a CAM-style parallel search. If the logical register number is readily available, only its corresponding match line needs to be precharged. To avoid race conditions, the CAM port should be accessed in the opposite clock phase from the RAM ports. Since early release is not time-critical, we assume this clock phase is after that of the RAM port access phase.
The added five or six transistors represent an insignificant increase: in a four-way issue pipeline, the map table requires 12 read ports and 4 write ports (p = 16) for a total of 36 transistors in the base cell design [16]. This circuit allows a maximum of one early release per cycle. As we have seen in Section 4, only 4% of the cycles have more than one early-release candidate. A simple, small buffer can easily accommodate the occasional bursts.
To ensure the early-released physical register ID does not reappear erroneously by way of RAT checkpoint restore (Section 5.2.1), the corresponding mappings in all checkpoint copies are likewise set to the same dedicated ID (P0/P1) simultaneously.
The reason we can change all copies indiscriminately is that the common-value-producing instruction is being committed and any valid checkpoint at the moment should also point to that register.
Modified issue queue
Broadcasting the to-be-released register ID into the issue queue is necessary to ensure all dependent instructions read the correct source register when issued. This broadcast is very similar to the instruction wake-up broadcast. The difference is that a normal wake-up marks the operand as ready, and the instruction is ready to issue if all source operands are ready. The broadcast for register early release, however, marks the matching source register as a common value, indicating that during issue, rather than reading from the register file, zero or one should be used.
Figure 10. Diagram of modified RAT cell: (a) non-least-significant bits (Bi, i = 1..n-1); (b) least significant bit (B0).
This logic can be implemented in two ways. In one method, a dedicated broadcast port can be built into the issue queue. When a source operand register ID field matches the content on this special broadcast port, the operand is marked, and the common value recorded. Alternatively, an existing free wake-up broadcast port can be used. In this case, each such port is augmented with two special bits. One bit indicates that the port is used for early-release broadcast, and the other carries the specific value.
The reason for the delay buffer in Figure 9 is that, when a candidate early-release physical register is checked in the RAT in cycle n, there may be instructions decoded and mapped in the same cycle that reference the candidate register. Recall that these instructions read the RAT earlier than the potential RAT update due to early release, and thus will not see any change. They will enter the issue queue in cycle n + 1. If the broadcast is done in cycle n, these instructions will not be notified. Notice that a 1-cycle delay is sufficient, since broadcasts typically happen in a later clock phase than dispatch; this ordering exists to ensure proper wake-up. Finally, we note that there is no race condition between the register's release and subsequent reuse, even though the register is released one cycle before the broadcast. This is because multiple cycles elapse between the release of a register and the earliest time it can be written to again.
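The need for the 1-cycle delay can be illustrated with a toy model of the issue queue. The data layout and the names broadcast_early_release, operand, and simulate are hypothetical; the scenario assumes physical register 6 holds the value 0 and is checked in the RAT in cycle n while instruction B, which also reads register 6, is renamed in that same cycle.

```python
def broadcast_early_release(issue_queue, preg, value):
    """Mark every source operand naming `preg` as the constant `value`."""
    marked = 0
    for inst in issue_queue:
        for op in inst["srcs"]:
            if op["const"] is None and op["preg"] == preg:
                op["const"] = value            # use 0/1 instead of a register read
                marked += 1
    return marked


def operand(preg):
    return {"preg": preg, "const": None}


def simulate(delay_cycles):
    """Return the `const` marks of all operands after the broadcast."""
    inst_a = {"name": "A", "srcs": [operand(6), operand(7)]}  # queued before cycle n
    inst_b = {"name": "B", "srcs": [operand(6)]}              # renamed in cycle n
    issue_queue = [inst_a]
    if delay_cycles >= 1:
        issue_queue.append(inst_b)     # B is in the queue when the delayed broadcast fires
    broadcast_early_release(issue_queue, preg=6, value=0)
    if delay_cycles == 0:
        issue_queue.append(inst_b)     # B enters after the broadcast and is missed
    return [op["const"] for inst in issue_queue for op in inst["srcs"]]
```

With no delay, B keeps a stale reference to a register that has already been returned to the free list; with the 1-cycle delay both A and B are marked.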
Discussion
By freeing the register at commit time and only if the register is still mapped, our design is much simplified. Compared to immediate early-release, this does miss a few opportunities to free more registers containing the frequent values zero or one. This is indeed a good design tradeoff as exemplified in Figure I
Zero value trivial computations
Minimal logic is required to detect zero value trivial computations. Some calculations are zero regardless of the input operands (e.g., X XOR X = 0). For instances where one or both of the operands must be known to be zero for the computation to be trivial, we limit the detection to cases where the operand registers have already been mapped to the zero register, so an explicit read of the register value is unnecessary.
Lifetime-based register sharing
Given an SO (self-overwriting) instruction, we call the previous dynamic instruction that writes to the same logical register its assignment instruction. Notice that this assignment instruction can be an SO instruction itself. As explained in Section 2, an SUSO (single-use self-overwriting) instruction can share the physical register allocated to the assignment instruction. An SO instruction is an SUSO instruction if no other instruction between the SO and the assignment instruction sources the destination register of the assignment instruction.
Detection, sharing, wakeup, and release
Detection: Detecting SO instructions dynamically is straightforward. Detecting SUSO instructions requires cross-comparing sources and destinations of simultaneously renamed instructions and an extra reference bit per logical register in the RAT. The reference bit is cleared when the logical register is written to and set when it is read from. To limit the detection of SUSO instructions to the same basic block, we simply set all the reference bits after decoding a conditional branch and making a checkpoint for the RAT (Section 5.4.2). When restoring a checkpoint, we also set all the bits.
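The reference-bit rule can be sketched for the simplified case of renaming one instruction at a time, which sidesteps the cross-comparison needed within a rename group. The class name SusoDetector and its methods are hypothetical, and starting with all bits set (so that nothing is treated as SUSO until an assignment has been seen) is an assumption.

```python
class SusoDetector:
    """One reference bit per logical register: cleared on write, set on read."""

    def __init__(self, num_logical):
        self.ref = [True] * num_logical        # conservative initial state (assumed)

    def rename(self, dest, srcs):
        """Return True if the instruction is single-use self-overwriting."""
        is_so = dest is not None and dest in srcs
        # SUSO: self-overwriting, and nobody has read dest since its assignment.
        is_suso = is_so and not self.ref[dest]
        for s in srcs:
            self.ref[s] = True                 # a read sets the bit
        if dest is not None:
            self.ref[dest] = False             # a write clears the bit
        return is_suso

    def branch(self):
        """After a conditional branch (or checkpoint restore), set all bits
        so that sharing never crosses a basic-block boundary."""
        self.ref = [True] * len(self.ref)
```

In this model the SUSO test is made before the instruction's own source reads update the bits, so the instruction's read of its destination register does not disqualify it.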
Sharing and wakeup: When an SUSO instruction is detected, we can simply reuse the currently mapped physical register (for the destination register), without allocating a new one. However, in our baseline system, the physical register ID also serves as a tag for instruction wakeup and value communication. Therefore, we cannot allow two instructions to have the same destination physical register number. In a system with virtual physical registers [6], this can be solved by using two virtual physical register IDs pointing to the same physical register. In our design, we choose a much simpler scheme: we extend the physical register ID and use the most significant bits to differentiate different versions. In particular, if we extend the ID by n bits, we can allow 2^n instructions to share the same physical register storage.
For example, in a system with 160 physical registers (requiring 8-bit addresses), if we add two most-significant bits, then the IDs 10, 266, 522, and 778 are the four tags associated with physical register 10. A non-SUSO instruction that produces a value will be allocated a tag with the two most-significant bits set to 0. Subsequent SUSO instructions writing to the same destination will increment these two bits until they reach 3. The next SUSO instruction (writing to the same destination) will be treated as a non-SUSO instruction and assigned a new register.
When decoding an SUSO instruction, if the destination register is mapped to P0 or P1, the special dedicated registers, the SUSO instruction will also be treated as a normal instruction and obtain a new physical register.
Release: When an SUSO instruction shares the physical register with its corresponding assignment instruction, the OldPReg field of the SUSO instruction is set to an invalid value, in the same way as for a non-value-producing instruction. Thus, when the SUSO instruction is committed, no register is released. The shared register will eventually be released at commit time of the next instruction that updates the same logical register and allocates a new register.
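The tag arithmetic in the example above can be checked with a short sketch. The helper names are hypothetical, and the sketch assumes the dedicated registers P0 and P1 have IDs 0 and 1, which the text does not state.

```python
PREG_BITS = 8        # 160 physical registers need 8-bit IDs
VERSION_BITS = 2     # two extra most-significant bits
MAX_VERSION = (1 << VERSION_BITS) - 1
DEDICATED = {0, 1}   # assumed IDs of the dedicated zero and one registers


def make_tag(preg, version):
    """Combine a version number and a physical register ID into a wakeup tag."""
    return (version << PREG_BITS) | preg


def tag_preg(tag):
    """Physical register storage addressed by a tag."""
    return tag & ((1 << PREG_BITS) - 1)


def tag_version(tag):
    return tag >> PREG_BITS


def next_shared_tag(tag):
    """Tag for the next SUSO sharer, or None if a new register must be allocated."""
    if tag_preg(tag) in DEDICATED or tag_version(tag) == MAX_VERSION:
        return None
    return make_tag(tag_preg(tag), tag_version(tag) + 1)
```

With these definitions, `[make_tag(10, v) for v in range(4)]` gives the four tags 10, 266, 522, and 778 from the text, and `next_shared_tag(778)` returns `None`, which corresponds to treating the next SUSO instruction as a normal one.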
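The release rule can be illustrated with a toy renamer. This is a sketch under simplifying assumptions: version bits are omitted, the initial mapping is the identity, and the names `Renamer`, `rename_dest`, and `commit` are hypothetical.

```python
INVALID = None  # stands in for the invalid OldPReg value


class Renamer:
    def __init__(self, num_logical, num_physical):
        # Assumed initial state: logical register i maps to physical register i.
        self.rat = {r: r for r in range(num_logical)}
        self.free = list(range(num_logical, num_physical))

    def rename_dest(self, dest, share):
        """Rename a destination; return the OldPReg recorded for the instruction."""
        if share:
            # SUSO: reuse the current mapping, so there is nothing to release later.
            return INVALID
        old = self.rat[dest]
        self.rat[dest] = self.free.pop(0)
        return old

    def commit(self, old_preg):
        """At commit, release the previous mapping unless OldPReg is invalid."""
        if old_preg is not INVALID:
            self.free.append(old_preg)
```

In this model, the register shared by an assignment instruction and its SUSO instruction returns to the free list only when a later instruction that allocated a new register for the same logical register commits.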
Misprediction and exception handling
Allowing multiple instructions to write to the same physical register presents a challenge to branch misprediction and exception handling. Consider this sequence of events: (1) the original assignment instruction occurs in a different (earlier) basic block than the associated SUSO instruction, (2) a branch between the SUSO instruction and its assignment instruction is mispredicted, and (3) the assignment instruction has already been committed. In this case, we cannot recover the original assignment value that has been speculatively overwritten. For this reason, we only allow an SUSO instruction to share a register with its assignment instruction if they belong to the same basic block. (If the SUSO instruction falls into the next basic block, it will be treated as a normal instruction and will allocate a new register.) This way, if the SUSO instruction is on the wrong path, so is the assignment instruction.
It is possible that an exception occurs for an instruction between an SUSO instruction and its assignment instruction. If, by the time the exception is handled, the SUSO instruction has already finished execution, then the physical register shared by the SUSO instruction and its assignment instruction no longer contains the assignment instruction's result. After handling the exception, the SUSO instruction will be re-executed, leading to an erroneous result.
To solve this problem, we need to reverse the effect of any already-executed SUSO instructions. In a typical exception handling mechanism, to reconstruct the RAT, the oldest valid RAT checkpoint is restored, and the ROB is "walked" in reverse order to unmap the instructions of the oldest basic block [16, 20]. During this process, the only additional effort for us is to reverse any already-executed SUSO instruction: we compute the overwritten operand using the result and the remaining operands, if there are any. (For example, there is no remaining operand for r1 + r1 → r1, and performing a right-shift on r1's current value recovers the overwritten operand.)
To be able to do this, we need the remaining operand to be unchanged and the opcode of the instruction to be reversible. Fortunately, the remaining operand is guaranteed to be in a normal (not shared) register and to stay unchanged: since it is sourced by the SUSO instruction, it cannot be the destination of another SUSO instruction (which would violate the single-use rule).
To guarantee a reversible operation, we simply do not perform register sharing for a non-reversible SUSO instruction (e.g., load) in the first place. Most ALU instructions and address manipulations are reversible. The reverse operation depends on the exact format of the SUSO instruction, and the details can be found in [17].
An alternative design is to roll back to the beginning of the basic block and re-execute the basic block without sharing registers. To do so, we cannot commit any instruction in a basic block until all instructions in the basic block finish execution without exception. To handle the pathological case where a basic block is larger than the ROB, we have to artificially divide a large basic block into smaller ones by dynamically inserting a not-taken branch instruction. This design is not only complicated, but also suboptimal in that instruction commit (and thus resource recycling) can be delayed unnecessarily.
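As an illustration of the reversal step, the sketch below recovers the overwritten operand for a few reversible opcodes of the form dst = dst op other. It is not the procedure from [17]: the opcode names, the 64-bit register width, and the function name `reverse_suso` are assumptions made for the example.

```python
MASK = (1 << 64) - 1  # assumed 64-bit registers with wrap-around arithmetic


def reverse_suso(opcode, result, other=None):
    """Recover the overwritten operand of 'dst = dst <op> other' from its result.

    'other' is the remaining operand; None means both sources were dst itself.
    """
    if opcode == "add":
        if other is None:
            # dst = dst + dst: right-shift, as in the text's example
            # (exact only if the doubling did not overflow).
            return result >> 1
        return (result - other) & MASK
    if opcode == "sub":
        return (result + other) & MASK
    if opcode == "xor":
        return result ^ other
    # Non-reversible opcodes (e.g., load) should never have shared a register.
    raise ValueError("non-reversible opcode: " + opcode)
```

During the reverse ROB walk, each already-executed SUSO instruction would apply such an inverse to the shared register, youngest first, so that the assignment instruction's result is restored before re-execution.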
SUSO instructions and adding 2 bits to extend the register ID is a good design point. On average, out of all value-producing instructions, 24.3% are SUSO instructions. Our design captures 80% of these SUSO instructions, or 19.2% of all value-producing instructions.
Implementing both schemes
When implementing both value-based and lifetime-based register sharing, we have to make sure they work together. In particular, we have to ensure that a shared register is not erroneously released early. In our design, this is not a problem.
Consider a pair of instructions sharing a register: an SUSO instruction and its corresponding assignment instruction. (1) The assignment instruction will not early-release the shared register because of the restriction that the register ID must still be mapped in the RAT. Recall that although an SUSO instruction shares the physical register, it still updates the RAT with a different register ID (by incrementing the two most-significant bits). (2) The SUSO instruction can safely early-release the shared register, because the assignment instruction commits before the SUSO instruction and thus does not need the register anymore.
Related Work
One of the closest works to ours is the register renaming scheme by Jourdan et al. [8]. As previously discussed, the authors use register sharing to exploit value locality and reduce register pressure in the Intel IA-32 architecture. In this paper, we restrict the range of values and greatly simplify the design. In a concurrent work, Balakrishnan and Sohi also propose to use dedicated registers for the values zero and one to reduce register waste on storing these frequent values [2]. However, in [2], implementation and support for branch misprediction handling are not discussed in detail. We analyze tradeoffs and present a simple and very effective design that only reclaims registers at the commit stage. Moreover, we also propose a simplified scheme that reduces the lifetime of physical registers and show how both schemes work together.
In Cherry [12], registers are released early but state must be recovered on exceptions. The design relies on checkpointing and the ROB to roll back to a correct architectural state on exceptions and to replay instructions up to the exception in order to recover state from resources released early. In our register sharing, when processing an exception, we only need to reverse the effect of already-executed SUSO instructions in the oldest basic block.
In [6, 19], register allocation is delayed until the value is actually ready to be written. In particular, Gonzalez et al. [6] describe a virtual physical register design. The virtual register scheme assigns a virtual register ID during decode as a placemarker. An actual physical register is not allocated until the result is ready in the writeback stage. This technique reduces register pressure by not requiring physical registers for many of the in-flight instructions. Our work is complementary to the virtual physical register work. In fact, combining our work with virtual registers would further decrease register pressure and simplify our design. The simplification occurs because assignment to a physical register will not occur until writeback, when the value of the result is known, so common values can be assigned to the zero/one registers directly, eliminating the need for redirection later.
Martin et al. [11] detect registers having dead values (values no longer needed) and return them to the free pool. The technique is used to reduce the number of saves/restores at procedure calls, but also reduces register pressure by reducing register lifetime. Our lifetime-based register sharing is simpler in that it is purely hardware-based.
Yi and Lilja [21] evaluate the most extensive set of trivial computations in the literature to improve performance. In addition to our Trivial 0 and Trivial X policies, the authors also perform strength reduction (e.g., converting a multiply by 2 to a shift). In contrast, we explore the effect of squashing trivial computations on register pressure.
The frequent value cache (FVC) of Zhang et al. [23] attempts to improve the performance of a direct-mapped cache by supplementing it with a small additional buffer. Because only frequently used values are stored in the FVC, the data can be encoded to keep the buffer small. The FVC acts as a specialized victim cache optimized for storing only data lines with known frequent values. The CVB is similar, but leverages these common values for a different purpose: sharing physical registers to reduce register pressure.
Finally, there is a body of work that tries to build large register files while limiting the adverse effect on access timing [3, 4, 15, 22, 14]. Our approach is complementary to these approaches in that we reduce the demand for physical registers through sharing.
Conclusion and Future Work
As high performance processors attempt to exploit ever greater parallelism, more instructions are in-flight and, consequently, more physical registers are required. With the register file being a complex multi-ported structure, its capacity has both power and performance implications. In this paper, we present a method to reduce the demand for physical registers by sharing registers that have the same value or non-overlapping useful lifetimes. It is important in any implementation that the benefits justify the cost. We propose a simple register sharing scheme and show that it provides almost half the benefits of the most aggressive scheme, but with significantly less implementation complexity.
In an SMT architecture, reducing the demand for physical registers using our simple scheme results in a 25.7% performance improvement, which is equivalent to having 25% more registers and no register sharing. In future work, we plan to explore additional opportunities to collapse registers and to combine this scheme with a virtual physical register design, where we expect the late binding of values to physical registers to increase the opportunities for sharing and also further simplify the register sharing implementation.
