Abstract-Modern superscalar processors implement register renaming using either random access memory (RAM) or content-addressable memory (CAM) tables. The design of these structures should address both access time and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents them from scaling with the number of physical registers and pipeline width, negatively impacting performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM-CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme, while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid RAM-CAM scheme does not exceed the area required by conventional renaming mechanisms.
I. INTRODUCTION

MODERN superscalar microprocessors implement out-of-order and speculative execution to increase performance. Many mechanisms have been devised to increase the number of instructions executing concurrently. These mechanisms require register renaming techniques to resolve write-after-read and write-after-write data hazards.
Register renaming distinguishes two kinds of registers: logical and physical registers. Logical registers are those used by the compiler, whereas physical registers are those actually implemented in the machine. Typically, the number of physical registers is considerably larger than the number of logical registers. When an instruction that produces a result is decoded, the renaming logic allocates a free physical register. The logical destination register is said to be mapped to that physical register. Subsequent data-dependent instructions rename their source registers to access this physical register. Renaming structures are accessed every cycle after the instructions are decoded. The register renaming circuitry also deals with the register mapping table recovery on misspeculation. Because they are accessed so frequently, renaming structures are critical in terms of power density [1], and new solutions must be devised to deal with this problem. Random access memories (RAMs) and content-addressable memories (CAMs) are used for register renaming. Both of them present advantages and shortcomings, and the industry does not show a predominant trend. Some processors [2], [3] use the RAM approach, whereas others [4]-[6] include a CAM with a large number of checkpoints.
A logical source register is renamed using its identifier to obtain the current mapping. This is performed faster and more efficiently in terms of energy with a RAM structure. The RAM is directly indexed by a source register, whereas this register is compared against all current mappings in the CAM. This associative search is a major concern not only because of its long access time, but also because it hinders scalability with the number of registers [7].
Regardless of the approach used, checkpoints allow for quick recovery of the correct mappings after misspeculation in both RAM-based [2] and CAM-based [4]-[6] processors. When the number of checkpoints surpasses a certain limit, however, CAM checkpointing becomes faster and more energy efficient than RAM checkpointing [7]. Table I shows the recovery penalty (in processor cycles) of a RAM-based processor (with the baseline processor configuration described in Section IV). As observed, even when triggering the recovery at the writeback stage, the number of cycles is not negligible.
In this paper, we propose a new scheme that tries to take the best of each implementation, that is, fast register renaming, fast register allocation, and fast recovery. We propose a hybrid approach that uses both a RAM and a CAM. During correct path execution, the RAM provides most of the mappings, acting as a CAM cache. The CAM is checkpointed whenever a branch is decoded, enabling quick misspeculation recovery, which is followed by an invalidation of the RAM contents. After recovery, the RAM is progressively refilled with correct mappings while new instructions enter the pipeline.
The advantages of the hybrid design stem from two main sources. On the one hand, the processors work in a nonspeculative mode in the common case, hence RAM invalidations are unusual. On the other hand, frequently executed instructions (e.g., loops) only use a small subset of the architected register file, hence only a few RAM updates suffice to recover the steady state. Thus, a reduction of the CAM complexity does not hurt performance, but lowers its power consumption, area, and access time.
The rest of this paper is organized as follows. Section II discusses typical renaming mechanisms. Section III describes the hybrid RAM-CAM approach. Section IV presents the experimental results and Section V discusses some related works.
II. BACKGROUND
A. RAM Approach
The following example illustrates how a typical RAM-based approach works. A code snippet consisting of nine instructions, whose four destination registers (r5-r8) are renamed into six physical registers (p11-p16) is shown in Fig. 1(a) . The RAM contents at the time instruction I reaches the rename stage are shown in Fig. 1(b) . At this point, its source registers are renamed to p11 and p12. In addition, a free physical register (p16) is allocated to r7. To allocate a free physical register, RAM approaches use a free register queue (FRQ). After I is renamed, subsequent instructions having a data dependence on r7 will be renamed to p16. The direct-mapped memory allows the mappings to be rapidly performed while taking up a small area. Later, at the commit stage, physical registers are released by placing their identifiers back into the FRQ.
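The renaming step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluated hardware: the class name `RamRenamer` and its methods are hypothetical, the FRQ is modeled as a simple FIFO, and the initial table contents are taken from the running example only as far as the text states them (the mapping of r8 to p15 and the free register p17 follow the CAM discussion in Section II-B).

```python
from collections import deque


class RamRenamer:
    """Direct-mapped rename table plus a free register queue (FRQ)."""

    def __init__(self, mapping, free_regs):
        self.ram = dict(mapping)     # logical register -> physical register
        self.frq = deque(free_regs)  # free physical registers, FIFO order (assumed)

    def rename(self, dest, srcs):
        # Sources read the current mappings before the destination is remapped.
        phys_srcs = [self.ram[s] for s in srcs]
        prev = self.ram.get(dest)    # previous mapping, needed later for release/recovery
        new = self.frq.popleft()     # allocate a free physical register
        self.ram[dest] = new
        return new, phys_srcs, prev

    def release(self, phys):
        # At commit, the released identifier is placed back into the FRQ.
        self.frq.append(phys)


# Instruction I of the example: destination r7, sources r5 and r6.
renamer = RamRenamer(
    {"r5": "p11", "r6": "p12", "r7": "p13", "r8": "p15"},
    ["p16", "p17"],
)
new_phys, phys_srcs, prev_phys = renamer.rename("r7", ["r5", "r6"])
```

After the call, later readers of r7 obtain p16 from the table, and the previous mapping (p13) is what a ROB entry would record for recovery.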
Because the RAM table is updated at the rename stage, it is modified by either nonspeculative or speculative instructions. On misspeculation, the changes must be canceled, restoring the RAM to its previous state at the time the offending instruction enters the rename stage.
The simplest strategy for misspeculation recovery consists in waiting until the mispredicted branch reaches the reorder buffer (ROB) head [recover at commit, see Fig. 2(a) ]. As ROB entries contain the previous mapping for the destination register, the correct RAM state can be restored by scanning the ROB once the offending instruction reaches the ROB head.
Recover at commit incurs a penalty with two main components: 1) the time elapsed from when the misprediction is known until the mispredicted instruction reaches the commit stage and 2) the time required to restore the correct mappings. The second component can be reduced using two RAMs: a frontend RAM (FRAM) and a retirement RAM (RRAM) [3].
To reduce the first component of the penalty, recovery should be triggered as soon as the misprediction is known [recover at writeback, see Fig. 2(b)]. This approach can rely on a single FRAM table, which is restored by walking the ROB from its tail toward the first misspeculated instruction. If an RRAM is available, however, the FRAM can also be restored by first copying the RRAM contents into the FRAM, and then walking the ROB from its head toward the first misspeculated instruction [see Fig. 2(c)].
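The tail-to-branch walk can be illustrated with a short Python sketch. The ROB entry layout `(name, dest_logical, previous_physical)` and the function name are assumptions made for illustration; in particular, the instruction labeled G that writes r8 is hypothetical, chosen so that the example matches the mappings undone in the CAM discussion of Section II-B.

```python
def recover_at_writeback(ram, rob, branch_pos):
    """Restore `ram` by walking `rob` from its tail down to the entry after `branch_pos`.

    Each ROB entry is (name, dest_logical, previous_physical); entries without
    a destination register (e.g., branches) carry None in the last two fields.
    Returns the number of ROB entries walked, a rough proxy for the penalty.
    """
    walked = 0
    while len(rob) > branch_pos + 1:
        _name, dest, prev_phys = rob.pop()   # youngest entry first
        if dest is not None:
            ram[dest] = prev_phys            # undo the speculative mapping
        walked += 1
    return walked


# Speculative state after instruction I has been renamed.
ram = {"r5": "p11", "r6": "p12", "r7": "p16", "r8": "p15"}
rob = [
    ("F", None, None),   # mispredicted branch
    ("G", "r8", "p14"),  # hypothetical writer of r8
    ("H", None, None),   # younger branch
    ("I", "r7", "p13"),
]
walked = recover_at_writeback(ram, rob, branch_pos=0)
```

The number of walked entries grows with the distance between the mispredicted branch and the ROB tail, which is why the recovery penalty of this strategy depends on ROB occupancy.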
B. CAM Approach
CAM structures have as many rows as the number of available physical registers. Each row maintains information for renaming, recovery, and register allocation, as shown in Fig. 3 . The first column shows the mapped logical register, whereas the second column specifies whether this mapping is currently active or not.
Let us assume a simple design where the register mappings are checkpointed each time a branch instruction is decoded. An excerpt of a CAM table containing the current mappings and a set of branch checkpoints is shown in Fig. 3 . The table contains the renaming information corresponding to the code shown in Fig. 1(a) at the time instruction I reaches the rename stage. Source registers r5 and r6 of instruction I are renamed to p11 and p12, respectively. Destination register r7, previously mapped to p13, is remapped to p16, which is obtained by means of a priority encoder (PE) connected to the free column. Then, this mapping is updated in the corresponding entries (logical register and current mapping) of the CAM. Simultaneously, the current mapping entry of p13 is reset. Finally, branch checkpoint columns cp H , cp F , and cp A keep a copy of the current mapping column at the time the corresponding branch (i.e., H, F, and A) is decoded. Depending on the implementation, checkpoints can be performed indiscriminately [6] or selectively [1] , [8] . Creating a checkpoint merely involves copying the current mapping column.
Regarding register allocation, a physical register is assumed to be free when its current mapping bit is clear and it is not present in any checkpoint. In the example, p11, p12, p15, and p16 are currently mapped, hence they cannot be released. Although p13 is not currently mapped, it cannot be released until both branches F and H are resolved and known to be nonspeculative, as p13 appears in checkpoints cp F and cp H. This is also the case of p14 in checkpoint cp F. Finally, p17 is free because it is neither currently mapped nor present in any checkpoint. Free registers can be obtained by simply NOR-ing the current mapping and the branch checkpoint bits.
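As a small illustration of the NOR-based free list, the sketch below represents each column as an integer bitmask with one bit per physical register. The helper names are hypothetical, and the checkpoint columns contain only the bits that the text explicitly mentions; the real columns in Fig. 3 may hold additional bits.

```python
REGS = ["p11", "p12", "p13", "p14", "p15", "p16", "p17"]


def mask(names):
    """Build a bitmask with one bit set per named physical register."""
    return sum(1 << REGS.index(n) for n in names)


def free_mask(current, checkpoints, num_regs):
    """NOR of the current-mapping column and all checkpoint columns."""
    used = current
    for cp in checkpoints:
        used |= cp
    return ~used & ((1 << num_regs) - 1)


current = mask(["p11", "p12", "p15", "p16"])
cp_f = mask(["p13", "p14"])  # only the bits mentioned in the text
cp_h = mask(["p13"])         # only the bits mentioned in the text

free = free_mask(current, [cp_f, cp_h], len(REGS))
free_names = [r for i, r in enumerate(REGS) if (free >> i) & 1]
```

With these columns, p17 is the only register whose bit survives the NOR, in agreement with the example.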
When dealing with multiple checkpoints, the current mapping and branch checkpoints columns can be organized as a circular queue following program order, where the current mapping is located at the tail, and the rest of the entries contain the branch checkpoints. This implementation allows for a reduction of both the temporal penalty and power consumption of misprediction recovery.
CAM recovery is triggered when branch F is resolved as mispredicted, as shown in Fig. 4. Simply by updating the tail pointer, the checkpoint cp F column becomes the current mapping, and the mapping of r7 from p13 to p16 is undone. This is also the case of r8, which is mapped to p14 again. In addition, p15 and p16 are released, because they are no longer held by the current mapping or any other checkpoint. Notice that the youngest checkpoints (i.e., cp F and cp H) are discarded and only older checkpoints are kept (i.e., cp A).
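The pointer-based recovery can be mimicked with a list kept in program order whose last element plays the role of the current mapping. The sketch below is only a software analogy under that assumption: the class name is hypothetical, columns are plain integers, and the bit patterns in the usage lines are arbitrary placeholders rather than the contents of Fig. 3.

```python
class CheckpointQueue:
    """Mapping-bit columns in program order; the last column is the current mapping."""

    def __init__(self, current):
        self.branches = []        # checkpointed branches, oldest first
        self.columns = [current]  # one column per checkpoint, then the current one

    def checkpoint(self, branch):
        # Copy the current column; the copy becomes the checkpoint for `branch`.
        self.branches.append(branch)
        self.columns.insert(-1, self.columns[-1])

    def set_current(self, column):
        self.columns[-1] = column

    def recover(self, branch):
        # Moving the tail back to the checkpoint of `branch` makes that column the
        # current mapping and discards it and every younger checkpoint.
        idx = self.branches.index(branch)
        del self.branches[idx:]
        del self.columns[idx + 1:]
        return self.columns[-1]


queue = CheckpointQueue(0b0001)
queue.checkpoint("A")
queue.set_current(0b0011)
queue.checkpoint("F")
queue.set_current(0b0101)
queue.checkpoint("H")
queue.set_current(0b1001)
restored = queue.recover("F")
```

After `recover("F")`, only the checkpoint of branch A remains, and the column saved when F was decoded is the current mapping again.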
III. HYBRID RAM-CAM
The hybrid scheme uses two tables: 1) a CAM containing all register mappings up to date and 2) a RAM acting as a cache of the CAM, containing a subset of its renaming information. The CAM table can be indexed both directly by a physical register or associatively by a logical register, while the RAM is indexed by a logical register.
A RAM entry in the hybrid scheme may or may not contain a valid copy of the current mapping, as indicated by an additional valid bit/entry. Register renaming is performed by just accessing the RAM, while valid entries are hit. If an invalid entry is accessed, a RAM miss is said to occur and the CAM is used to retrieve the current mapping. Therefore, the CAM is not looked up in all renaming cycles, but only upon RAM misses, allowing for a lower number of CAM read/write ports compared with a typical CAM implementation. On misspeculation, the entire RAM contents are invalidated.
After recovery, the current mappings are only available in the CAM, because all RAM entries are invalid. Subsequent renaming of source registers will cause RAM misses, and the CAM will be looked up to obtain the mappings. Both CAM lookups and new register allocations will cause the RAM to be progressively updated, quickly reducing the RAM miss rate.
Let us analyze how the previous working example behaves on the hybrid RAM-CAM. As instructions in Fig. 1(a) reach the rename stage, destination registers are mapped to the new physical registers, and the new mapping is recorded both in the RAM and the CAM. After renaming instruction I, the contents of RAM and CAM tables are shown in Figs. 1(b) and 3, respectively.
Let us assume that the register lookups are resolved in the RAM, but the branch instruction F is mispredicted, triggering the following recovery process. First, the RAM is invalidated by resetting all valid bits. Second, the CAM is recovered by restoring the current mapping from the branch checkpoint taken when instruction F was decoded (cp F), returning to the state in Fig. 4.
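The interplay between the two tables can be summarized with a behavioral Python sketch. It is a functional model only, with hypothetical names: the CAM is abstracted as a dictionary that always holds the up-to-date mappings, a RAM entry is valid exactly when its key is present, and the `update_sources` flag models the optional refill of the RAM with mappings retrieved from the CAM. Port limits and pipeline stages are not modeled.

```python
class HybridRenamer:
    """Behavioral model: a RAM caching the mappings held by a CAM."""

    def __init__(self, cam):
        self.cam = dict(cam)   # logical -> physical, always up to date
        self.ram = {}          # logical -> physical, valid entries only
        self.cam_searches = 0  # associative searches caused by RAM misses

    def rename_source(self, logical, update_sources=True):
        if logical in self.ram:
            return self.ram[logical]   # RAM hit: no CAM access needed
        self.cam_searches += 1         # RAM miss: associative CAM search
        phys = self.cam[logical]
        if update_sources:
            self.ram[logical] = phys   # refill the RAM with the retrieved mapping
        return phys

    def rename_dest(self, logical, new_phys):
        # New mappings are recorded in both tables.
        self.cam[logical] = new_phys
        self.ram[logical] = new_phys

    def recover(self, checkpoint):
        self.ram.clear()               # reset all valid bits
        self.cam = dict(checkpoint)    # restore the checkpointed mappings


cp_f = {"r5": "p11", "r6": "p12", "r7": "p13", "r8": "p14"}
hybrid = HybridRenamer(cp_f)
hybrid.rename_dest("r7", "p16")
hit = hybrid.rename_source("r7")         # served by the RAM
hybrid.recover(cp_f)                     # branch F mispredicted
first = hybrid.rename_source("r7")       # RAM miss, CAM search
second = hybrid.rename_source("r7")      # RAM hit after the refill
```

Only the first lookup after recovery reaches the CAM; once the entry is refilled, the same register hits in the RAM again.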
A pipelined implementation of the hybrid scheme, where each box represents a table lookup, is shown in Fig. 5. Lookups in the RAM are always direct mapped, whereas CAM lookups can be either direct mapped (d.m. in the figure) or associative searches. Only one table lookup is allowed in a single stage. This causes our proposal to be pipelined in three stages, though only two of them are in the critical path toward the instruction queues and the ROB.
The block diagram shown in Fig. 5 is also horizontally divided in three sections detailed next.
A. Clearing Previous Destination Mappings
CAM entries corresponding to previous destination mappings are cleared. In the first stage, all previous destination 
B. Destination Register Renaming
Free physical registers are mapped to the destination registers. The PE provides physical register identifiers from a set of free entries in the CAM. These identifiers are used to directly index the CAM and set the new mappings. New mappings are also updated in the RAM, which is indexed with the identifiers of the destination logical registers.
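The role of the PE can be illustrated with a short sketch that selects free CAM rows from a free-bit mask. The function name is hypothetical, and picking the lowest-numbered free rows first is an assumption; the text only states that the PE provides identifiers from the set of free entries.

```python
def priority_encode(free_mask, count):
    """Return up to `count` indices of set bits in `free_mask`, lowest index first."""
    picked = []
    index = 0
    while free_mask and len(picked) < count:
        if free_mask & 1:
            picked.append(index)   # this CAM row is free and gets allocated
        free_mask >>= 1
        index += 1
    return picked


# Rows 2, 3, and 5 are free; a two-wide rename group takes rows 2 and 3.
rows = priority_encode(0b101100, 2)
```

The returned row indices would then be used to index the CAM directly and, together with the destination logical registers, to update the RAM.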
C. Source Register Renaming
For each source register, the associated mapped physical register identifier is obtained. In the first stage, the RAM is accessed. On a hit, mappings are available right away. Otherwise, an associative CAM search is performed in the second stage. Finally, those mappings retrieved from the CAM are updated in the RAM on the third stage. This last stage is optional and increases the RAM complexity. Nevertheless, it may provide performance and energy benefits if these updates avoid enough RAM misses.
Notice that the previous mappings are cleared in the CAM in the second stage, whereas new mappings are set in the first stage. Thus, a hazard arises when an associative CAM lookup in the second stage for a given instruction accesses a mapping allocated by the same instruction in the first stage. This hazard can be avoided by flagging the new mapping entries at the end of the first stage in an additional single-bit column in the CAM. The flags are reset at the end of the second stage.
IV. EXPERIMENTAL EVALUATION
A performance evaluation is carried out on top of SimpleScalar, which is modified to model the renaming approaches. Processor parameters are summarized in Table II .
Results are obtained from the execution of the entire Standard Performance Evaluation Corporation (SPEC) CPU2000 benchmark suite. Statistics are gathered using the ref input sets and single simulation points [9]. The SimpleScalar toolset is configured for the Alpha ISA. Five schemes are analyzed, referred to as follows: 1) commit, RAM-based approach that triggers recovery at commit; 2) writeback, RAM-based approach that triggers recovery at writeback, walking the ROB from head to tail; 3) writeback-fwalk, RAM-based approach that triggers recovery at writeback, walking the ROB from tail to head; 4) ideal CAM, pure CAM-based approach; and 5) hybrid, RAM-CAM proposed approach. In addition, we apply the suffix -Iw to the ideal CAM and hybrid schemes that reduce CAM complexity by limiting to I the number of instructions that the CAM can rename each cycle. As the baseline pipeline width is four instructions, values of I equal to or lower than four are evaluated.
The number of cycles incurred during misprediction recovery is accurately modeled considering the position in the ROB of the mispredicted instruction and the number of pipeline stages. In addition, the latency of the pipeline front end to fetch the correct path is overlapped with the recovery penalty.
We assume that a checkpoint is stored for each dispatched instruction group, as done in the IBM Power7 [10] .
The baseline pipeline length resembles the ARM Cortex-A9 [11] , with 10 stages (five of them before rename). For the hybrid register renaming schemes, we assume a 12-stage pipeline.
A. Performance
1) Analysis on Short Pipelines: Unlike the CAM-based approach, the hybrid RAM-CAM approach performs associative searches on the CAM only upon RAM misses. Thus, a deliberate reduction of CAM complexity can have harmless consequences. This section explores the impact on performance of reducing the CAM complexity in the hybrid approach.
The instructions executed per cycle (IPC) values for each benchmark under the ideal CAM-4w renaming scheme are shown in Fig. 6 . Ideal CAM-4w imposes an upper performance bound for the remaining models, because it takes just one processor cycle for both register renaming and misprediction recovery without negatively affecting the pipeline bandwidth.
The performance slowdown for the analyzed schemes with respect to the ideal CAM-4w, calculated as $1 - \mathrm{IPC}_{\text{renaming scheme}} / \mathrm{IPC}_{\text{ideal CAM-4w}}$, is shown in Fig. 7. Each bar in this figure is the slowdown of the sequential execution of a benchmark set. Two variants of the hybrid scheme are evaluated: update sources (US) and non-update sources (NUS). The NUS variant does not update the RAM for the source register mappings retrieved from the CAM. It requires one pipeline stage less than the US variant, as the third stage shown in Fig. 5 is no longer needed. However, it incurs more RAM misses when looking for valid mappings in the RAM.
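The slowdown metric can be written down as a short Python helper. The aggregation over a benchmark set is an assumption about what "sequential execution of a benchmark set" means here: total instructions divided by total cycles across the set. The function names and the numbers in the usage lines are made up for illustration and are not results from the paper.

```python
def set_ipc(runs):
    """IPC of running a benchmark set back to back.

    `runs` is a list of (instructions, cycles) pairs, one per benchmark.
    """
    instructions = sum(i for i, _ in runs)
    cycles = sum(c for _, c in runs)
    return instructions / cycles


def slowdown(ipc_scheme, ipc_ideal):
    """Slowdown of a renaming scheme relative to the ideal CAM-4w."""
    return 1.0 - ipc_scheme / ipc_ideal


# Hypothetical instruction and cycle counts for two benchmarks.
ideal_runs = [(1000, 500), (2000, 1500)]
scheme_runs = [(1000, 520), (2000, 1580)]
result = slowdown(set_ipc(scheme_runs), set_ipc(ideal_runs))
```

Because both schemes execute the same instructions, the slowdown reduces to one minus the ratio of total cycles (ideal over scheme).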
Writeback and writeback-fwalk behave differently for integer and floating-point benchmarks. The reason is that the location of the mispredicted branch within the ROB is usually farther away from the ROB head in floating-point benchmarks than in integer ones. This is shown in Table III , which presents the average (arithmetic mean) number of instructions that must be scanned during recovery for each writeback variant. The reduction in the number of scanned instructions is correlated with the results shown in Fig. 7 for the writeback approaches.
The commit approach performs worse because its recovery penalty is usually higher. In addition, a two-way CAM-based configuration (ideal CAM-2w) is included in the figure to show the impact on performance of reducing the CAM complexity by blindly halving the renaming bandwidth. Its slowdown for the whole SpecCPU (around 15%) is the second worst of the studied approaches. In contrast, hybrid-2w-NUS presents a slowdown always smaller than 2.2%, suggesting that the additional RAM used in the hybrid approach suffices to avoid the performance loss incurred by limited CAM ports.
Regardless of the hybrid variant, lower CAM bandwidths damage the performance. The reason is that the rename stage stalls more often because of a lack of CAM ports. For SpecCPU, both variants of the hybrid approach outperform the conventional approaches, with the only exception of hybrid-1w-NUS. In addition, the two-way hybrid variants always provide better results than the writeback ones. Compared with the ideal CAM-4w, the performance drops in the NUS variant by 1.4%, 1.8%, and 6.2% for hybrid-4w-NUS, hybrid-2w-NUS, and hybrid-1w-NUS, respectively. These slowdowns are reduced by the US variant to 1.5% and 4.1% for hybrid-2w-US and hybrid-1w-US, respectively. The reason behind this effect is that the US variant reduces the number of RAM misses, which in turn results in a lower number of searches on the CAM. This enhancement does not affect the performance of hybrid-4w schemes because they have enough CAM ports to avoid stalling because of RAM misses. In fact, the slowdown observed in these cases is only due to the higher number of pipeline stages.
Table IV shows the percentage of CAM searches performed by the hybrid approaches with respect to ideal CAM-4w. The US variant roughly halves the number of searches performed by NUS, which is the reason for its better performance. For SpecCPU, the percentage in the US variant lies around 5%, while this value is particularly low (below 3%) for floating-point benchmarks.
Performance of hybrid-2w approaches falls very close not only to the hybrid-4w ones, but also to the ideal CAM-4w, for both integer and floating-point benchmarks. The reason can be inferred from Fig. 8 , which shows the cumulative execution time for all benchmarks. As observed, the stalled time because of the CAM ports constraints (black portion of each bar) is higher for integer benchmarks, which explains the higher slowdown exhibited by the hybrid approach. On average, the total bar heights for hybrid-2w and hybrid-4w are very similar, which means that a two-way CAM is enough to avoid performance loss because of the renaming constraints.
Finally, let us compare performance across individual benchmarks. The results for commit, writeback, and hybrid-US variants are shown in Fig. 9. The US variant is selected as a representative for the hybrid approaches because it offers better performance with a lower number of CAM accesses than the NUS variant. Although the decision on which scheme performs best is benchmark dependent, hybrid-2w-US and hybrid-4w-US perform closest to the baseline for most applications. In some cases (e.g., wupwise, galgel, facerec, ammp, vpr, and gap), the performance is especially affected by the misprediction penalties, and both writeback and writeback-fwalk incur a slowdown higher than 5%. In contrast, the slowdown of hybrid-2w-US is lower than this mark for all benchmarks except eon.
2) Impact of Long Pipelines: Long pipelines enable higher clock frequencies by simplifying the amount of work to be done in each pipeline stage. For example, [3] , [10] , and [12] had a pipeline depth of around 20 stages. As a side effect, branch misprediction penalties become more significant in terms of number of cycles. To analyze the performance of the hybrid approach with long pipelines, we assume in this section a 20-stage pipeline. Similarly to the Pentium 4 architecture, six of these stages are assumed to lie before the register renaming stage. Two extra stages (22 stages in total) are again assumed for the hybrid approach.
The results as a slowdown (in processor cycles) with respect to the baseline ideal CAM-4w are shown in Fig. 10. Comparing these results with the ones obtained for short pipelines, we can see that hybrid approaches exhibit insignificant variations in slowdown. In general, the slowdowns slightly grow for floating-point benchmarks and shrink, also subtly, for integer benchmarks. Writeback shows the opposite trend: the number of occupied ROB entries is much higher for floating-point benchmarks, and thus misprediction penalty slightly increases with the pipeline depth. Overall, the most negatively affected renaming scheme is writeback-fwalk, mainly because a late misspeculation detection increases the recovery penalties when walking the ROB from the tail. This effect is especially significant in floating-point applications.
B. Hardware Complexity
Table V lists the implemented memory structures, as well as their number of read (r) and write (w) ports. The last column of the table summarizes the total area occupied by each renaming scheme. Results are obtained with the cache access and cycle time model (CACTI) 6 toolset for a 45-nm technology node.
Let us analyze the hybrid design (see Fig. 5) from the point of view of complexity. In the first stage, the RAM table is queried for previous destination mappings, and the PE allocates new mappings at the same entries (4r and 4w RAM ports required). Source operands are renamed by also accessing the RAM (additional 8r RAM ports). The US hybrid variant additionally updates the RAM with the source registers involved in previous RAM misses on the third pipeline stage (additional 2w, 4w, and 8w RAM ports for hybrid-1w-US, hybrid-2w-US, and hybrid-4w-US, respectively).
Associative CAM ports in the hybrid designs are used in the second stage to rename sources (2r, 4r, or 8r CAM ports) and to clear previous destination mappings (1w, 2w, or 4w CAM ports) that missed in the RAM. If physical registers are correctly provided by the RAM, however, previous destinations can be cleared with a direct-mapped (d.m.) access. Notice that the direct-mapped ports are simpler than associative ports, hence different hybrid configurations keep a constant number of them (4w d.m. CAM ports). The remaining 4w direct-mapped ports (8w d.m. CAM ports in total) are used in the first stage to allocate the new mappings provided by the PE.
Regarding the ideal CAM-4w, free physical registers are also provided by the PE. Therefore, as in the hybrid schemes, the CAM is accessed to allocate new destinations (4w d.m. CAM ports). On the other hand, the CAM is associatively searched to rename the source registers and clear previous destination mappings (in total, 8r + 4w associative CAM ports).
RAM-based designs (i.e., commit, writeback, and writeback-fwalk) use the RAM to rename the source registers (8r RAM ports), as well as to look up previous destination mappings and update them with new values (additional 4r and 4w RAM ports). In addition, commit and writeback use a RRAM (4r and 4w ports) where destination mappings are updated when the instructions commit.
All RAM-based designs use the ROB to store (4w ROB ports) and retrieve (4r ROB ports) renaming information. In addition, the RAM-based designs use an FRQ to allocate (4r FRQ ports) and release (4w FRQ ports) physical registers.
Table VI shows some technological features, including area, access time, energy per access, and leakage per nanosecond for each discussed hardware component. For comparison purposes, the area and access time of the PE required by our proposal are assumed to be the same as the FRQ structure. This assumption is conservative, because the consumption on the PE is a negligible fraction of the total CAM consumption [13]. In addition, for the CAM components, we include the features corresponding to the direct-mapped ports (labeled as d.m. in the table). Results also consider the contribution of the direct-mapped ports to the CAM area.
Table V shows the area occupied by each renaming scheme, calculated as the sum of each component. Hybrid-1w schemes have the smallest area occupancy because of the reduction of the CAM complexity, as well as the lack of ROB area devoted to renaming. On the other hand, the most area-hungry designs are the hybrid-4w schemes, as they require both complex CAM and RAM structures. The hybrid-2w-US scheme has an area (0.061 mm^2) close to writeback and commit.
Finally, the reported access time is the elapsed time since a table lookup starts until the operation completes. All the components exhibit an access time lower than 0.25 ns. Therefore, 4 GHz is the maximum frequency at which the implementation proposed in Section III can work.
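The frequency bound follows directly from the access time, assuming that one table lookup must complete within a single clock cycle and taking 0.25 ns as the bound on the slowest lookup (the symbols $f_{\max}$ and $t_{\text{access}}$ are introduced here for illustration):

```latex
f_{\max} = \frac{1}{t_{\text{access}}} = \frac{1}{0.25\,\text{ns}} = 4\,\text{GHz}
```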
C. Energy Consumption
We measure the dynamic energy as the total number of accesses to each component multiplied by the energy per access. Leakage (or static) energy is calculated as the total number of execution cycles times the total leakage energy per cycle, assuming a 1-GHz clock frequency.
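The energy accounting described above can be sketched as follows. The function and parameter names are hypothetical, and the access counts, per-access energies, and leakage figures in the usage lines are placeholders chosen for easy arithmetic, not values from Table VI.

```python
def dynamic_energy(accesses, energy_per_access):
    """Sum over components of (number of accesses) x (energy per access)."""
    return sum(accesses[c] * energy_per_access[c] for c in accesses)


def leakage_energy(cycles, leakage_per_ns, clock_ghz=1.0):
    """Execution cycles x total leakage energy per cycle.

    `leakage_per_ns` holds the leakage of each component per nanosecond;
    at 1 GHz one cycle lasts exactly 1 ns.
    """
    cycle_ns = 1.0 / clock_ghz
    return cycles * sum(leakage_per_ns.values()) * cycle_ns


# Placeholder numbers in arbitrary energy units.
accesses = {"RAM": 100, "CAM": 10}
per_access = {"RAM": 2.0, "CAM": 20.0}
leakage = {"RAM": 0.25, "CAM": 0.5}

total = dynamic_energy(accesses, per_access) + leakage_energy(50, leakage)
```

In this toy example the few associative CAM accesses cost as much dynamic energy as the far more numerous RAM accesses, which mirrors the motivation for keeping CAM searches rare.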
The energy budget used for register renaming for a 10/12-stage pipeline is shown in Fig. 11. Dissipation of leakage energy lies between 20% and 35% of the total energy for all renaming schemes except the ideal CAM-4w. Leakage energy is lower for the ideal CAM-4w and hybrid schemes because their execution is shorter and they do not require additional ROB storage for renaming purposes.
Dynamic energy is distributed in the figure by component (FRQ, ROB, RAM, RRAM, direct-mapped CAM lookups, and associative CAM searches). The energy spent to recover a correct RAM state is estimated by accounting for the copy of the RRAM renaming data to the RAM (commit and writeback) and the subsequent walk through the ROB (writeback and writeback-fwalk). The ROB, FRQ, and CAM structures are organized as circular queues, and recovered with the negligible cost of a pointer update. The cost of invalidating all RAM entries in the hybrid designs is also negligible, because this operation only entails resetting a single-bit column.
The ideal CAM-4w design consumes about one order of magnitude more power than the rest of the schemes, because the CAM is associatively accessed every cycle for all mappings.
In contrast, writeback-fwalk is the best RAM-based scheme in this regard. Writeback and commit suffer mainly from RRAM costs. In the latter design, there are additional energy costs because of the late misspeculation detection, which causes more mispredicted instructions to be renamed before triggering recovery.
Besides providing performance close to the ideal CAM-4w, hybrid designs drastically alleviate energy dissipation. For SpecFP, they consume less than the commit scheme. Indeed, all US variants except hybrid-4w-US consume less energy than the writeback-fwalk scheme. Regarding SpecInt, hybrid-2w-US and hybrid-1w-US present less energy consumption than both writeback and writeback-fwalk.
An increase of the CAM width in hybrid designs results in better performance but a higher energy cost. The reason is that the number of stalls because of limited CAM bandwidth decreases as the number of ports increases, but this also implies a higher number of useless accesses to the CAM when speculative execution takes place, as well as a higher cost per access in more complex RAM and CAM structures. On the other hand, the US variants show lower energy costs and better performance than NUS variants, in spite of requiring more complex RAM structures than NUS variants. The reason is that updating the RAM more often reduces the amount of associative CAM accesses.
The energy consumption results for a 20/22-stage pipeline are shown in Fig. 12. The US variants again provide better consumption results than the NUS ones. The latter are excluded from the figure for the sake of clarity. In general, the energy dissipation for all presented designs is higher than in the short pipeline, because misspeculation is detected later.
Table VII shows the efficiency of the studied renaming approaches for SpecInt, SpecFP, and SpecCPU. Efficiency values are quantified as the consumed energy multiplied by the square of execution time (energy-delay-square product), as suggested in [14]. In general, the highest efficiency (i.e., the lowest product value) is provided by hybrid-2w-US and hybrid-1w-US across all sets of benchmarks except for SpecInt in the 20/22-stage pipeline.
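The energy-delay-square metric is simple enough to state as code. The sketch below uses hypothetical function names and two made-up schemes with arbitrary units; it only shows how a design that is slightly slower can still rank as more efficient when its energy is sufficiently lower.

```python
def energy_delay_square(energy, exec_time):
    """Energy-delay-square product; lower values mean higher efficiency."""
    return energy * exec_time ** 2


def most_efficient(schemes):
    """Return the name of the scheme with the lowest energy-delay-square product.

    `schemes` maps a name to an (energy, execution_time) pair.
    """
    return min(schemes, key=lambda name: energy_delay_square(*schemes[name]))


# Made-up values: scheme_b is 10% slower but uses half the energy.
schemes = {"scheme_a": (1.0, 1.0), "scheme_b": (0.5, 1.1)}
best = most_efficient(schemes)
```

Because the delay is squared, the metric penalizes slowdowns more heavily than a plain energy-delay product would.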
V. RELATED WORK
Regarding RAM-based register renaming approaches, Moshovos [15] proposed to reduce the number of ports in the FRAM by detecting those instructions that do not use the maximum number of source and destination register operands. With the same aim, Kucuk et al. [16] further reduced the number of accesses to the FRAM by forwarding results of previous accesses performed by instructions nearby.
Concerning the recovery penalty incurred by RAM-based schemes, Moshovos [1] proposed an out-of-order release mechanism, which reduces the number of RAM checkpoints to about one-third. Akl et al. [17] proposed a ROB-like structure to accelerate the checkpoint recovery, which allows misspeculation recovery from specific branches. Similarly, a selective checkpoint mechanism to recover mispredictions and support large instruction windows was proposed in [18] .
Zhou et al. [19] proposed a mechanism to allow the processor to continue executing instructions after a misspeculation while the processor state is being restored, effectively hiding the recovery latency. This technique allows RAM-based approaches to dispatch instructions without waiting for the misspeculated branch instruction to reach the commit stage or scanning the ROB. It requires additional logic to correctly manage the issue stage and the instruction queues. In this sense, it is orthogonal to alternative mapping implementations like the one proposed in this paper.
The RAM approach is used in successful commercial processors [2] , [3] . The CAM approach similarly succeeds in aggressive designs [4] , requiring only a single memory structure for fast recovery and multiple checkpoints [7] . It becomes then a major research concern to reduce the access time incurred by CAMs. Buti et al. [4] detailed how this problem is addressed for the IBM Power4 processor. In addition, Liu and Lu [20] explored the effect of circuit-level speculation to speed up the response of a CAM renaming table.
Safi et al. [7] compared the energy and latency of RAM and CAM approaches. They concluded that, when the number of checkpoints exceeds a given threshold, CAM approaches become more efficient and faster. They also proposed to selectively disable CAM entries to optimize CAM energy consumption.
Finally, Wallace and Bagherzadeh [21] also proposed a hybrid RAM-CAM design. Their goal was to reduce the RAM complexity and access time in RAM-based renaming schemes. The authors implemented a small ROB-like first-input, first-output queue located before the RAM, which is also associatively addressable by a logical register identifier. This table reduces the number of required RAM ports (much like our design reduces the number of required CAM ports, which are more costly), and allows recovery of the correct mappings in one cycle only when mispredicted instructions have not updated the RAM. Register release, pipelining, and other complexity issues are, however, not tackled.
VI. CONCLUSION
In this paper, we presented a renaming mechanism consisting of a RAM table and a low-complexity CAM table, as a hybrid design that took the best of both approaches. Experimental results showed that a two-way hybrid approach achieved small performance slowdowns (about 2% and 1% for integer and floating-point benchmarks, respectively) with respect to a four-way CAM-based renaming mechanism that was able to recover in one clock cycle. These small slowdowns were accompanied by a drastic reduction of the original associative searches carried out in the CAM-based approach to only 8% and 3%. Hybrid designs also reduced the dynamic energy by 16% and 12% with respect to the original CAM consumption, closing the dynamic energy consumption gap between CAM and RAM approaches. Besides general performance improvements, hybrid designs were proved to be more efficient than the simplest noncheckpointed RAM approaches in terms of both area and energy. Finally, the experiments showed that the performance benefits span different processor configurations, whether with short or long pipelines.
