Abstract-Power is a first-order design constraint for most processors today. Benefits of low-power designs include lower manufacturing and operating costs and longer battery life. In this work we propose an out-of-order processor architecture called the Block-precise processor (B-Processor) that is designed for low power consumption. The B-Processor consumes less power than typical processor designs by eliding the write of results of many instructions to the reorder buffer and to the register file, which are power-hungry structures. The B-Processor reduces power consumption even further by omitting the broadcast of certain results over multiple levels of the bypass network. Experimental results show that on average the B-Processor spends 15.1 percent less power on register file and reorder buffer accesses and 14.5 percent less power on broadcasting results. In combination with register file caching, on average the B-Processor saves 28.7 percent power for accessing the register file and the reorder buffer.

INTRODUCTION
A processor spends a significant fraction of its total power reading instruction operand values and writing instruction results to the register file and to other structures, such as the reorder buffer (ROB) or the rename buffer [1], that may be used for storing register values temporarily. These structures, which hold either speculative or committed register values, can be collectively referred to as operand stores. Several prior works estimate that the ROB consumes more than 10 percent [1] of the total core power. The register file is also a well-known thermal hot spot in processors. Thus, reducing the power consumed by the operand stores can not only reduce the power consumed by the processor itself but can also improve the processor lifetime.
To reduce operand store power consumption in out-of-order processors, we propose a new processor architecture, the Block-Precise Processor (B-Processor). The B-Processor saves power by reducing writes to (and reads from) the operand stores and by reducing result broadcasts. In pipelines with data-capture schedulers (Section 3), which are common in modern microarchitectures such as Nehalem [2] and Core [3], the outputs of the functional units are stored in the ROB, and on instruction commit, results are copied from the ROB to the architecture register file (ARF) to update the architecture state. The focus of this work is on such pipelines.
There are prior mechanisms based on the observation that many instruction results are short-lived [4] , [5] , [6] , [7] . Our mechanism is based on a similar observation, but we enhance power saving opportunities by not writing certain results to any temporary buffer and by supporting precise state only at basic block granularity. We consider only those short-lived instructions whose results are not visible outside the basic block containing the instruction, while some prior mechanisms consider short-lived variables without any basic block restrictions. At runtime, we identify instructions whose destinations have been renamed by instructions from within the same basic block and elide the writing of their results to the ROB and to the ARF if we can guarantee that all instructions dependent on those results can read the results off the bypass network. These conditions are sufficient for the B-Processor to guarantee correct program execution even when certain results are not stored in any location.
Since the B-Processor does not write all results to the ARF (and the ROB), supporting precise exceptions is not straightforward. To support precise exceptions, the B-Processor uses a light-weight mechanism to checkpoint the state of registers at the end of each basic block. This light-weight mechanism requires only one additional copy of the ARF and some bit-masks to checkpoint the registers. Also, store instructions update memory only at the end of the basic block containing the store instruction. Thus, the state of the B-Processor matches the state of a conventional processor at the end of each basic block, hence the name Block-precise processor. Using the saved processor register state, precise exceptions can be supported.
The B-Processor can be combined with mechanisms such as register file caches (RFC), and so we evaluate the power saving achieved by using the B-Processor mechanisms in conjunction with a register file cache. In summary, this work proposes a new low-power processor architecture, the Block-precise Processor, which includes the following new mechanisms while supporting precise exceptions: 1) A mechanism to reduce power consumption in processors by dynamically identifying and eliding writes to the operand stores. 2) A mechanism to reduce power consumption by eliminating unnecessary broadcast of results and result tags over the bypass network. The contribution of this work and the distinction between this work and prior work is further elaborated in Section 2.
RELATED WORK
Several kinds of solutions have been proposed to reduce the energy consumption of register files. These solutions include reducing the number of registers and ports [8], banking the register file [8], [9], [10], using hierarchical register files [8], inhibiting the writing to the register file of variables that are read off the bypass network [5], [7], inhibiting the reading of the register file when variables are read off the bypass network [10], [11], register file caches [12], [13], [14], and dynamic instruction scheduling to ensure that variables are read off the bypass network rather than from the register file [4]. Below we discuss some of the relevant work in more detail.
Register File Caches
Several RFC [12], [13], [14] mechanisms have been proposed to reduce power consumption by reducing the number of reads/writes from/to the register file (and also to reduce the register file access latency); since the RFC is smaller than the register file, accessing it is more power efficient than accessing the register file. Note that these mechanisms were mostly proposed for machines that use a physical register file for both speculative and committed results. In our evaluations, we adapt the RFC for a pipeline that uses the ROB for speculative results and an ARF for committed results. Results are first written to the RFC when an instruction completes execution. Dispatched instructions read the RFC instead of the ROB for their source operands (note that some source operands are read from the ARF). When an instruction misses in the RFC, only the missing instruction is delayed until the operand is read from the ROB. When the RFC is full, values are evicted from the RFC in LRU fashion and inserted into the ROB. On retirement of an instruction, its results are copied from the RFC/ROB to the ARF. Thus, the RFC only reduces the number of reads/writes from/to the ROB and does not affect the reads and writes from/to the ARF. In the case of the B-Processor, if instruction results are not written to the ROB on instruction completion, then at the time of retirement, reading the ROB for the result and updating the ARF with the result are also eliminated. Additionally, there could be results that were written to the ROB but need not be written to the ARF. Eliminating the update of the ARF with such results reduces the number of reads from the ROB as well. Thus, the B-Processor reduces reads/writes from/to the ROB and reduces the number of writes to the ARF as well.
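The adapted RFC behavior described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class name, method names, and counters are hypothetical, the cache is modeled as fully associative with true LRU (the evaluated configurations use 16 sets), and entries are keyed by ROB index.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy model of an RFC placed in front of a ROB that holds speculative results.

    Assumptions (not from the paper): fully associative, true LRU, keyed by ROB index.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.rfc = OrderedDict()   # ROB index -> value, least recently used first
        self.rob = {}              # ROB data array, written only on RFC eviction
        self.rob_writes = 0
        self.rob_reads = 0

    def complete(self, rob_idx, value):
        """An instruction finishes execution: its result goes to the RFC first."""
        if len(self.rfc) >= self.capacity:
            victim, victim_value = self.rfc.popitem(last=False)  # evict LRU entry
            self.rob[victim] = victim_value
            self.rob_writes += 1
        self.rfc[rob_idx] = value

    def read_operand(self, rob_idx):
        """A dispatched instruction reads a source operand."""
        if rob_idx in self.rfc:
            self.rfc.move_to_end(rob_idx)   # mark as most recently used
            return self.rfc[rob_idx]
        self.rob_reads += 1                 # RFC miss: fall back to the ROB
        return self.rob[rob_idx]

    def retire(self, rob_idx):
        """Return the value to copy into the ARF, from the RFC if present, else the ROB."""
        if rob_idx in self.rfc:
            return self.rfc.pop(rob_idx)
        self.rob_reads += 1
        return self.rob.pop(rob_idx)


cache = RegisterFileCache(capacity=2)
cache.complete(0, 10)
cache.complete(1, 11)
cache.read_operand(0)      # entry 0 becomes most recently used
cache.complete(2, 12)      # entry 1 is evicted to the ROB
```

In this model the ROB is touched only on evictions and misses, while every retiring result still reaches the ARF, which mirrors the observation that the RFC leaves ARF traffic unchanged.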
Reducing Writes to ROB and ARF
There is a significant amount of prior work on power reduction by reducing the number of writes to register files. While most of this work targets VLIW processors and statically decides whether to write the results of an instruction to the register file or not, our work targets out-of-order processors and dynamically determines whether results should be written to the register file.
Butts and Sohi [15] characterize the degree of use of values in a program and build a predictor to predict the degree of use. One of the uses of the predictor is to reduce the number of registers and the register file write ports required and thereby save power. However, they do not provide any implementation of this mechanism.
Lozano and Gao [16] proposed a compiler-assisted mechanism that skips writing dead values residing in the ROB to the ARF on instruction commit. This mechanism still requires all instruction results to be written back to the ROB on instruction completion. Savransky et al. [17] proposed a similar mechanism that was based entirely in hardware.
Operand Isolation (OI)
The work closest to ours is the operand isolation mechanism proposed by Ponomarev et al. [6], which also eliminates writes to the ROB and to the ARF. OI elides writes to the ROB and to the ARF for instructions whose destination registers will be overwritten by a subsequent instruction that has been dispatched before the instruction completes execution. However, instructions whose results have been identified for ROB and ARF write elision must still write results to a small register file (SRF) to support precise exceptions. Also, not all instructions whose destinations have been renamed by subsequent instructions can elide writes, because of lack of space in the SRF or because of a conflicting entry with the same ROB index in the SRF. An associative search has to be done over the SRF whenever a new entry is to be inserted into it and whenever an entry has to be cleared from it. The B-Processor elides writes for only those instructions whose destination registers will be overwritten by instructions from within the same basic block; thus it potentially has fewer opportunities than OI for power saving, but for each ROB write that is elided, it saves more power than OI. Balkan et al. [18] propose a similar write elision mechanism, but for machines that use non-data-capture schedulers.
Advantage of B-Processor over RFC and OI
Though both the RFC and OI mechanisms skip writing results to the ROB (OI skips writing results to ARF as well, but it does not skip writing to ROB for all results like RFC does), they still write the results that are not written to the ROB to an alternate structure (RFC and SRF) which consumes significant dynamic power to support precise exceptions. The advantage of the B-Processor is that results that are not written to the ROB and ARF are not written to any other structure. This is possible because Block-Precise processor has precise state at basic block granularity (see Section 3.2). Though the B-Processor uses an additional copy of the ARF for supporting exceptions, the additional copy of the ARF does not increase dynamic power since the register updates on instruction commit are shared between the two register file copies. Thus, the additional copy only increases leakage power but not dynamic power and allows us to reduce the number of reads/writes from/to ROB and the number of writes to ARF.
Other Related Work
Martin et al. [19] use dead-value information to introduce optimizations that improve processor performance and reduce the size requirements of the physical register file (PRF). The processor tracks dead registers and dynamically eliminates unnecessary register save and restore instructions at context switches and procedure calls. These optimizations are orthogonal to the B-Processor mechanisms and could be applied in conjunction with the B-Processor.
Several novel processor architectures that exploit the short lifetime of instruction results have been proposed. Block Structured ISA [20] , [21] , [22] , statically defined tag ISA [23] , TRIPS [24] and the Braid architecture [25] use separate internal and external registers to distinguish between values that are consumed entirely within a block and those that are required to be visible outside a block. 1 This separation allows the design of simplified hardware that can achieve high performance. Similarly, the B-Processor does not make the results of all instructions visible. However, those architectures were intended for high-performance, low-complexity designs, and require complete re-design of ISA, compiler and hardware to support atomic execution at an instruction block granularity. The B-Processor requires only minor hardware changes and no ISA changes to reduce the power consumption of microprocessors and does not use separate internal and external registers.
Checkpointing of logical to physical register mappings is often used for recovery from branch mispredictions. Checkpointing of register mappings is also used by several mechanisms to support large instruction windows [26] , [27] . These checkpoints are used to recover the correct state in the event of branch mispredictions or exceptions. Similarly, the B-Processor also uses checkpoints for recovery from branch mispredictions and exceptions. However, in case of the B-Processor, instead of register mappings, the contents of registers are checkpointed and this is achieved by copying bit-masks rather than copying entire register files. Fig. 1 shows possible pipeline designs with a data-capture scheduler.
In pipelines with a data-capture scheduler, source operands can be read either from (1) (Fig. 1a) a unified physical register file that includes both committed and uncommitted instruction results (i.e., intermediate/speculative states) or (2) (Fig. 1b) from multiple structures: an ARF that contains the committed state and a rename buffer or a ROB that has been extended to hold speculative state. The baseline in this work uses an ARF together with a ROB that stores speculative instruction results, as shown in Fig. 1b. This design is popular and is used by Intel's Core [3] and Nehalem [2] microarchitectures and others. In such pipelines instruction results are copied twice: first to the ROB on instruction completion, and then read from the ROB and copied to the ARF on instruction retirement. Many of these values are dead by the time they are written to the ROB and many others are dead by the time the ARF is updated. We focus on saving power by reducing the number of these unnecessary writes to the ROB and to the ARF.
BB-Internal Values
As explained, our mechanisms target results that are not visible outside a basic block. We call such results bb-internal values. If we can guarantee that all instructions that consume a bb-internal value read the value off the bypass network, then we do not have to write the value to the ROB and to the ARF. Instruction results that must be visible outside the basic block are called bb-external. Fig. 2 classifies instructions that produce results as either bb-internal or bb-external for SPEC CPU2006 benchmarks. On average, about 33.9 percent of instructions produce bb-internal values. In general, floating point benchmarks (bwaves and benchmarks to its right) have a larger fraction of instructions producing bb-internal values than integer benchmarks and they also have larger basic blocks. Since bb-internal values are a significant fraction of the results produced, there is significant potential for saving power.
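To make the bb-internal/bb-external distinction concrete, the following sketch classifies the results in a short instruction trace. It is an illustration only: the trace format and the function name are hypothetical, every branch is assumed to end a basic block, and a result is labeled bb-internal when a later instruction in the same basic block overwrites its destination register. The sketch ignores the separate requirement that all consumers read the value off the bypass network.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical trace format: (destination register or None, is_branch)
Instr = Tuple[Optional[str], bool]


def classify_results(trace: List[Instr]) -> List[Optional[str]]:
    """Label each result-producing instruction as 'bb-internal' or 'bb-external'.

    Instructions without a destination register get the label None.
    """
    labels: List[Optional[str]] = [None] * len(trace)
    last_writer: Dict[str, int] = {}   # register -> index of its latest writer in this block
    for i, (dest, is_branch) in enumerate(trace):
        if dest is not None:
            prev = last_writer.get(dest)
            if prev is not None:
                # The earlier value is dead before the block ends.
                labels[prev] = "bb-internal"
            labels[i] = "bb-external"   # provisional: stays external unless overwritten
            last_writer[dest] = i
        if is_branch:
            last_writer.clear()         # a branch closes the current basic block
    return labels


trace = [("R1", False), ("R2", False), ("R1", False), (None, True), ("R1", False)]
labels = classify_results(trace)
```

Here the first write to R1 is bb-internal because the third instruction overwrites R1 before the branch, while the write to R1 after the branch belongs to a new basic block and leaves the third instruction's result bb-external.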
B-PROCESSOR
In this section, we explain the mechanisms implemented by the B-Processor, and discuss possible issues introduced by the mechanisms and solutions to those issues. An overview of the B-Processor is illustrated in Fig. 3, and Table 1 shows the additional hardware required. The B-Processor requires additional hardware for two mechanisms: eliminating writes to the operand stores and supporting precise exceptions.
1. In case of Block structured ISA, the term block refers to an atomic block, while for others it refers to basic block.
2. In this work, issue means sending an instruction from a scheduler to a functional unit, and dispatch means placing an instruction inside the scheduler.
We focus on eliminating the writes made by instructions that produce bb-internal values. To eliminate writing bb-internal values, we first have to determine whether a result value is bb-internal, and if it is, we also have to guarantee that all instructions dependent on the bb-internal value can read it off the bypass network.
Write Elision
Detecting BB-Internal Values
We detect a bb-internal value by comparing the branch tag of an instruction with the branch tag of the instruction (if any) that overwrites its destination. A bit-mask called the Renamed Vector (RV) and an array called the Renamer Branch-Tag Array (RBA) are used. Both RV and RBA are equal in length to the ROB, and the width of RBA is equal to the width of a branch tag (4 in our simulations). An additional array called the Renamer ROB Index Array (RRA) is also used. The size of the RRA is equal to the size of the RAT and its width is equal to the number of bits needed to address a ROB entry. Whenever an instruction is renamed, if the instruction's destination register has a pending write, the RRA entry for the destination register is read to obtain the ROB index i of the pending writer. Also, the RRA entry for the destination register is updated with the instruction's ROB index. Then the bit at index i in RV is set and the branch tag of the renamed instruction is copied to index i of RBA (footnote 3). In Fig. 4, the register mapping table shows that the latest values of R1 and R2 will be produced by the instructions at indices 2 and 1 (Ic and Ib) in the ROB. Next, when instruction Id, whose destination is R2, is renamed, it is observed that a previous write to R2 by Ib is pending. Due to this, the bit for Ib in RV is set and the branch tag of Id is copied to the entry for Ib in RBA. Instead of using the RRA, we could use the RAT to obtain the pending writer of a register by adding read ports to the RAT. However, that would increase power consumption significantly, so we choose to add the RRA.
In the last cycle of its execution, an instruction checks the bit corresponding to its ROB index in the RV. If the bit is set, the branch tag of the instruction is compared with the branch tag saved at its index in the RBA. If the tags match it means the result is a bb-internal value.
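The interplay of the RV, RBA and RRA can be sketched in software as follows. This is a simplified illustration, not the hardware design: the class and method names are hypothetical, each ROB entry is assumed to keep its own branch tag alongside it, and the ROB index chosen for the overwriting instruction in the usage example is an assumption.

```python
class BBInternalDetector:
    """Toy model of the Renamed Vector (RV), Renamer Branch-Tag Array (RBA)
    and Renamer ROB Index Array (RRA)."""

    def __init__(self, rob_size, num_regs):
        self.rv = [False] * rob_size    # RV: set when a later instruction renames the destination
        self.rba = [None] * rob_size    # RBA: branch tag of that renaming instruction
        self.rra = [None] * num_regs    # RRA: register -> ROB index of its pending writer
        self.tag = [None] * rob_size    # assumed: each ROB entry records its own branch tag

    def rename(self, rob_idx, dest_reg, branch_tag):
        """Called when an instruction with a destination register is renamed."""
        self.rv[rob_idx] = False
        self.rba[rob_idx] = None
        self.tag[rob_idx] = branch_tag
        pending = self.rra[dest_reg]
        if pending is not None:
            # Mark the pending writer as renamed and record who renamed it.
            self.rv[pending] = True
            self.rba[pending] = branch_tag
        self.rra[dest_reg] = rob_idx

    def is_bb_internal(self, rob_idx):
        """Check performed in the last execution cycle of the instruction."""
        return bool(self.rv[rob_idx] and self.rba[rob_idx] == self.tag[rob_idx])

    def retire(self, rob_idx, dest_reg):
        """Clear the RRA entry if this instruction is still the pending writer."""
        if self.rra[dest_reg] == rob_idx:
            self.rra[dest_reg] = None


# Mirrors the Fig. 4 discussion: Ib (ROB index 1) writes R2, Ic (ROB index 2) writes R1,
# then Id (assumed ROB index 3) writes R2 within the same basic block (branch tag 0).
det = BBInternalDetector(rob_size=8, num_regs=4)
det.rename(rob_idx=1, dest_reg=2, branch_tag=0)   # Ib
det.rename(rob_idx=2, dest_reg=1, branch_tag=0)   # Ic
det.rename(rob_idx=3, dest_reg=2, branch_tag=0)   # Id
```

After these three renames, the entry for Ib is flagged as bb-internal, while Ic is not, since nothing has overwritten R1 yet.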
Write Elision of BB-Internal Values
In the above mechanism, if the tag comparison succeeds for an instruction, it means all instructions dependent on the result have already been dispatched and are waiting inside the scheduler for its result value to become available on the bypass network. Thus, the instruction result is not written to the ROB and the bit for the instruction in the Read-ROB mask (Table 1) is cleared; otherwise, the result is written to the ROB and the Read-ROB bit for the instruction is set. At retirement, the Read-ROB bit is checked; if it is set, the RV bit and the RBA entry of the instruction are checked again to determine if the copy of the instruction's result from the ROB to the ARF can be eliminated. Since this mechanism ensures that the last writer to a register in a basic block always writes to the ROB and the ARF, at the end of each basic block, the state of the registers matches the state of registers in a machine that writes all results to the ROB and the ARF.
3. The operation of the Renamed Vector is the same as the renamed bit-mask in the operand isolation mechanism [6].
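The two decision points, at completion and at retirement, can be summarized in a short sketch. The data structure and function names below are hypothetical and serve only to restate the rules in the text; the per-entry fields stand in for the RV bit, the RBA entry and the Read-ROB bit.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    branch_tag: int            # branch tag of this instruction
    renamed: bool = False      # RV bit
    renamer_tag: int = -1      # RBA entry (branch tag of the renaming instruction)
    read_rob: bool = False     # Read-ROB mask bit


def overwritten_in_same_block(entry: RobEntry) -> bool:
    """Tag comparison: the destination was renamed by an instruction of the same basic block."""
    return entry.renamed and entry.renamer_tag == entry.branch_tag


def on_complete(entry: RobEntry) -> bool:
    """Return True if the result must be written to the ROB."""
    if overwritten_in_same_block(entry):
        entry.read_rob = False    # write elided: nothing to read from the ROB later
        return False
    entry.read_rob = True
    return True


def on_retire(entry: RobEntry) -> bool:
    """Return True if the result must be copied from the ROB to the ARF."""
    if not entry.read_rob:
        return False              # the result never reached the ROB
    return not overwritten_in_same_block(entry)


# Renamed within the same block before completion: neither write happens.
early = RobEntry(branch_tag=3, renamed=True, renamer_tag=3)
# Renamed within the same block only after completion: ROB write happens, ARF copy is skipped.
late = RobEntry(branch_tag=3)
late_rob_write = on_complete(late)
late.renamed, late.renamer_tag = True, 3
```

An entry whose destination is renamed only by an instruction of a different basic block fails the tag comparison at both points, so its result is written to the ROB and later copied to the ARF.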
Precise Exceptions
Since not all instruction results are written to the ROB and ARF, the B-Processor cannot support precise exceptions without an additional mechanism. However, the register state of the B-Processor is precise at the end of each basic block. We take advantage of this to support precise exceptions. The state of the registers at the end of each basic block is saved using a low-overhead mechanism. If an exception occurs when a new basic block is being executed, execution is resumed from the first instruction of the new basic block after a rollback, and all results are written to the operand stores (write elision is disabled temporarily) until the end of the basic block. After the completion of the basic block containing the excepting instruction, execution is resumed with write elision enabled.
Checkpointing Register State
For saving register values we use a low-cost mechanism which requires two copies of the architecture register file, ARF0 and ARF1, and several bit-masks whose lengths are equal to the number of architecture registers. One of the bit-masks is called the dirty mask, while another is called the state mask. The rest of the bit-masks are used for saving a copy of the state mask at the start of each basic block. Each bit in a bit-mask is associated with the register that has the same index as the bit. If a bit in the dirty mask is set, it means at least one instruction that writes to the corresponding register has been dispatched in the current basic block. Each bit in the state mask points to the ARF that holds or will hold the latest value of the corresponding register (if the bit is 0, ARF0 holds the register value; otherwise, ARF1 holds the value).
Let us understand the working of the mechanism using the code example shown in Fig. 5. When the first instruction, I0, of basic block B1 is renamed, a copy, S0, of the current state mask is made and all bits in the dirty mask are cleared (Figs. 6a and 6b). Also, the address of the first instruction is saved into Restart PCs for resuming execution after rollback when any exception occurs. S0 contains the mapping of the logical registers to the ARF copies at the end of B0. Then the dirty bit for the destination register of I0, R0, is set and the state bit for R0 is flipped (Fig. 6c). Because of the bit flip, the old value of R0 pointed to by S0 is unmodified and the new value of R0 is written to the other register file. Any subsequent instruction within B1 that writes to R0 will see that the dirty bit for R0 is set and will write the new value of R0 to the ARF pointed to by the state mask (Fig. 6d). When basic block B1 completes, S0 is freed and returned to the list of free bit-masks. In this fashion, the state of the registers at the end of each basic block is preserved.
Handling Exceptions
If there are exceptions/interrupts before the completion of B1, a rollback is performed: execution resumes with the first instruction of B1, i.e., I0, the state mask is restored to S0, and the dirty mask is cleared (Fig. 6f). When execution resumes after rollback, all instructions update the ROB and the ARF until the end of the basic block. After the return from the exception handler and the completion of the basic block containing the excepting instruction, write elision is enabled again.
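The checkpoint and rollback steps can be modeled with two register arrays and two masks. The sketch below is a deliberately simplified illustration: the class and method names are hypothetical, only one saved state mask is kept, and the rename-time mask update and the retirement-time register write are collapsed into a single `write` call.

```python
class CheckpointedARF:
    """Toy model of two ARF copies with a state mask and a dirty mask."""

    def __init__(self, num_regs):
        self.arf = [[0] * num_regs, [0] * num_regs]   # the two ARF copies
        self.state = [0] * num_regs    # state mask: which copy holds the latest value
        self.dirty = [False] * num_regs  # dirty mask: written in the current basic block?
        self.saved_state = None        # saved copy of the state mask
        self.restart_pc = None         # address to resume from after a rollback

    def begin_block(self, first_pc):
        """First instruction of a basic block: save the state mask, clear the dirty mask."""
        self.saved_state = list(self.state)
        self.dirty = [False] * len(self.dirty)
        self.restart_pc = first_pc

    def write(self, reg, value):
        """Register write by an instruction of the current basic block."""
        if not self.dirty[reg]:
            self.dirty[reg] = True
            self.state[reg] ^= 1       # first write in the block: switch to the other copy
        self.arf[self.state[reg]][reg] = value

    def read(self, reg):
        return self.arf[self.state[reg]][reg]

    def rollback(self):
        """Exception inside the block: restore the state mask and clear the dirty mask."""
        self.state = list(self.saved_state)
        self.dirty = [False] * len(self.dirty)
        return self.restart_pc


regs = CheckpointedARF(num_regs=4)
regs.begin_block(first_pc=0x100)
regs.write(0, 5)
regs.write(0, 7)                   # second write in the block goes to the same copy
regs.begin_block(first_pc=0x200)
regs.write(0, 9)                   # first write in the new block goes to the other copy
value_before_rollback = regs.read(0)
resume_pc = regs.rollback()
```

After the rollback, reading register 0 returns the value it held at the end of the previous basic block, because that value was never overwritten in the copy the saved mask points to.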
Retiring Store Instructions
Since the B-Processor rolls back to the start of the basic block when there is an exception, stores are not allowed to update memory when they retire. Allowing stores to update memory on their retirement could result in incorrect execution when there is an exception. Instead, memory is updated at the end of each basic block, when it is guaranteed that the basic block will complete without any exceptions. Thus, store instructions retire without updating the memory, and the store queue entries allocated to stores are not released until the basic block containing the stores completes. When a basic block completes, the stores from the basic block which are held in the store queue are issued to memory, and each store queue entry is released only when the corresponding store has completed updating the memory. Most basic blocks contain few store instructions, and so our mechanism should not present any issues for them. However, some applications contain basic blocks with many store instructions. When executing such basic blocks, the store queue fills with entries that can be released only when the basic block completes, while new store instructions (and the following instructions) from the basic block cannot be dispatched because of lack of space in the store queue. This causes the processor to deadlock. To avoid such deadlocks, large basic blocks are broken down into multiple smaller pseudo basic blocks. An upper limit is placed on the number of stores that can be outstanding for a basic block. If this limit is n, then every nth store instruction in a basic block marks the end of a pseudo basic block, and the instruction following the store instruction is treated as the start of a new (pseudo) basic block. Now, stores update memory and release store queue entries whenever the (i × n)th (i = 1, 2, 3, . . .) store in a basic block retires. A counter is used to track the number of stores in a basic block.
This way, for basic blocks with a large number of stores (>n), the state of the B-Processor matches the architecture state of a conventional processor at the retirement of every nth store instruction.
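The counter-based splitting into pseudo basic blocks can be illustrated with a small model of the store queue. The names are hypothetical and memory is modeled as a dictionary; the sketch assumes stores drain instantly once they are released.

```python
class DeferredStoreQueue:
    """Toy model: retired stores are held until the (pseudo) basic block ends."""

    def __init__(self, max_outstanding):
        self.n = max_outstanding   # upper limit n on outstanding stores per block
        self.pending = []          # retired stores that have not yet updated memory
        self.count = 0             # counter of stores in the current (pseudo) block
        self.memory = {}

    def retire_store(self, addr, value):
        """Retire a store; return True if it closes a pseudo basic block."""
        self.pending.append((addr, value))
        self.count += 1
        if self.count == self.n:   # every nth store triggers the memory updates
            self._drain()
            return True
        return False

    def end_basic_block(self):
        """A real basic block boundary also releases all held stores."""
        self._drain()

    def _drain(self):
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()
        self.count = 0


sq = DeferredStoreQueue(max_outstanding=2)
first_closes = sq.retire_store(0x0, 1)
memory_after_first = dict(sq.memory)
second_closes = sq.retire_store(0x4, 2)
```

With n = 2, the first store retires without touching memory, and the second store both updates memory for the pair and marks the start of a new pseudo basic block.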
When there is an exception during the execution of a basic block with a large number of stores, the B-Processor rolls back execution to the first instruction after the last store instruction that triggered memory updates (and release of store queue entries). It then resumes execution and all instructions until and including the next store that would have triggered memory updates (or until the end of the basic block if it comes first) update both the ROB and the ARF. After the memory update triggering store (or the end of the basic block), write elision is enabled again.
Speculating across Multiple Branches
Like a conventional processor, the B-Processor can speculate across multiple branches. To do this the B-Processor requires multiple copies of the state bit mask and the Restart PC register (Table 1) ; no additional copies of the ARF besides the two ARF copies required for saving register state (Section 4.2.1) are required. Instructions from multiple basic blocks may be flowing through the pipeline, but they will update the ARF only when they are ready to retire and they do so in order. Since instructions in successive basic blocks that write to the same architecture register update different copies of the ARF, we can have the state at the end of the last completed basic block saved by simply saving the state mask mapping at the end of the last basic block and this is used to recover from branch mispredictions.
Eliminating Broadcasts over Bypass Network
When the B-Processor determines it is safe to elide writing a bb-internal value to the ROB and to the ARF, it means all instructions dependent on the bb-internal value will read the value off the bypass network. In such a case, if the pipeline includes multiple levels/stages of bypass, the bb-internal value need not be broadcast beyond the first level/stage of bypass, because all dependent instructions will read the bypassed result value off the first stage itself. Broadcasts can be eliminated by generating a signal that disables the circuits that drive results on the broadcast bus.
METHODOLOGY
We use MacSim [28], a cycle-level heterogeneous architecture simulator, for our simulations. We sample 200 M instructions from SPEC CPU2006 benchmarks using SimPoint [29] and use the sampled instructions in our simulations. We use two core configurations, Config-128 ST and Config-96 ST, shown in Table 2. Power is measured using Energy Introspector (EI) [30], a framework for modeling power, energy, temperature, reliability and other physical phenomena. EI provides a wrapper API for McPAT [31], Cacti [32] and other models for modeling power. To reduce the power consumed by the ROB, we model the ROB in each core configuration as two arrays: ROB-Data (ROB-D), which is 64 bits wide and holds data values, and ROB-Meta (ROB-M), which is also 64 bits wide and holds status, control and other information regarding instructions (footnote 4).
RESULTS
We compare the average power consumption for the ROB and ARF of the B-Processor against a conventional out-of-order scheduling pipeline as well as against the Register File Cache (RFC) and Operand Isolation (OI) mechanisms. The behavior of the evaluated RFC and OI mechanisms is described in Sections 2.1 and 2.2.1, respectively. Table 5 summarizes pros and cons for each mechanism. We also evaluate the benefits provided by a combination of the B-Processor and register file caching (B-RFC). We simulate RFC and OI mechanisms with different configurations, but present results for only the best performing configurations (best in terms of power). RFC configurations with 16, 32 and 64 entries were tried (16 sets in all configurations), while the size of the small register file in OI was varied as 8, 16, 32 and 48 (fully associative in all configurations). The numbers of ports provided for the ROB, RFC, SRF and the ARFs are shown in Table 3. Table 4 shows the energy per access and leakage energy for different structures in the best performing configurations for the different mechanisms.
5 Since (dynamic) power consumption depends on the application we show energy per unit access which does not change with the application. In these tables we show numbers only for the data part of the ROB (ROB-D), since the control part of the ROB (ROB-M) is much simpler than ROB-D and consumes the same amount of power for the different mechanisms. The access energy for structures such as the restart PC list and renamer branch tag array are not shown, but they are included in the power comparisons. For B-Processor, the maximum number of outstanding stores is kept as 4 and 8 for Config-96 ST and Config-128 ST, respectively.
Unless specified otherwise, in our average power consumption comparisons, for each mechanism we include the power consumed by the ROB (both ROB-D and ROB-M), the two ARFs (INT and FP) and the power consumed by any additional structures used by the mechanism. For the B-Processor mechanisms, the power consumed by the additional copy of the register files, the renamer branch tag array and the renamer ROB index array is also included. For OI, the power consumed by the renamer ROB index array is also included. Results for Config-128 ST are presented by default. In the result graphs, xalancbmk and the benchmarks to its left are integer (INT) benchmarks, while the remaining benchmarks are all floating point (FP) benchmarks.
Power Consumption by Operand Stores
Figs. 7, 8 and 9 show the power consumed for the ROB (includes both ROB-D and ROB-M), the ARFs (INT and FP) and any additional structures used, by the baseline, the B-Processor and the best power-saving RFC and OI configurations. All values are shown relative to the power consumed by the baseline, with the baseline power consumption represented as 100. The total power consumption is broken down into leakage and dynamic power. While Fig. 7 shows the
4. Otherwise, the reduction in power consumption due to omitting a write to ROB would be exaggerated.
The B-Processor mechanisms are more effective for FP benchmarks than they are for INT benchmarks, while the reverse could be said to be true for RFC-64 and OI-16. Since FP benchmarks have much larger basic blocks than INT benchmarks, it is more likely in FP benchmarks that the destination register of an instruction is overwritten by another instruction from the same basic block.
For 4 benchmarks (mcf, libquantum, omnetpp and xalancbmk), the B-Processor consumes more power than the baseline. These benchmarks have a very small fraction of values that become dead due to being overwritten by instructions from the same basic block; thus the B-Processor cannot save significant power for ROB and ARF accesses. Also, all of the 4 benchmarks except xalancbmk have low average power consumption; thus the power consumed by additional structures such as the renamer branch tag array makes the power consumption of the B-Processor higher than that of the baseline. By using a sampling mechanism which uses the fraction of instructions for which writes can be elided to determine when to enable the B-Processor mechanisms, the overhead due to the additional structures can be reduced. This would reduce the power consumption for these benchmarks while keeping the power saving for other benchmarks intact. Also, any compiler transformation that makes basic blocks larger can be adopted to improve the power savings provided by the B-Processor.
Performance of B-Processor
Figs. 7, 8, and 9 also show the execution time for the SPEC2006 benchmarks on the B-Processor expressed as a percent of the execution time on the baseline. On average, execution on the B-Processor takes about 0.14 percent longer than on the baseline for SPEC2006 benchmarks, with cactusADM showing the greatest slowdown among all benchmarks, at about 1.6 percent. This degradation in performance is due to store buffer entries being released only on the completion of a basic block. The performance impact can be reduced either by using a separate postretirement store buffer [33], instead of the store queue, for holding stores before they are issued to the memory hierarchy, or by using a simple counter-based mechanism that varies the maximum number of stores that can be outstanding depending on the power saving achieved and the slowdown experienced (see footnote 6).

Technique: Register File Cache (RFC) [13]
  Benefits: Reduces reads and writes to ROB
  Drawbacks: (i) RFC itself consumes significant power; (ii) Live values evicted from RFC have to be written to ROB
Technique: Operand Isolation [6]
  Benefits: (i) Reduces (reads and) writes to ROB and writes to ARF
  Drawbacks: (i) The small register file itself consumes significant power
Technique: Block-Precise Processor
  Benefits: (i) Reduces (reads and) writes to ROB and writes to ARF; (ii) No additional structures with high dynamic power
  Drawbacks: (i) Potential for reduction in writes to ROB (and ARF) is smaller than OI for some benchmarks
Combining with Register File Caching
It is easy to combine the write elision mechanisms of the B-Processor with the access reduction obtained with register file caching. Values identified for write elision by the B-Processor mechanisms are not written to the RFC, since they will be read by all dependent instructions off the bypass network. Reducing the writes to the RFC provides a two-fold benefit: (i) lower power consumed by the RFC due to reduced writes (Fig. 10), and (ii) reduced pressure on the RFC, resulting in fewer evictions of live results from the RFC to the ROB (Fig. 11), which lowers power consumption further. As seen in Figs. 7, 8 and 9 and in the results that follow, this yields significant power savings.
Effectiveness of B-Processor Mechanisms
Fig. 12 shows the percent of result values (including temporary values generated by uops) for which writes to both the ROB and the ARF are elided. While the average value for INT benchmarks is 12.3 percent, for FP benchmarks the average value is 41.6 percent. The percent of result values for which only writes to the ARF are elided (not shown) is 4 percent. Recall that when an instruction that has written its result to the ROB retires, the check for write elision is performed again, and if it succeeds, the copy of the result from the ROB to the ARF is not performed. Performing the elision check at retirement is thus beneficial, especially for FP benchmarks, which have long basic blocks. On average, 6 percent of results for FP benchmarks are not written to the ARF by the B-Processor because the elision check is also done at retirement.

6. A single mechanism can be devised to enable/disable B-Processor mechanisms (Section 6.1) and control the limit on the number of outstanding stores (Section 6.2).
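The two checks described in this subsection can be illustrated with a minimal Python sketch. It assumes that a result is elidable when its destination register has been renamed by an instruction carrying the same basic-block tag; the `Result` type, its field names and the returned strings are hypothetical, and other conditions the hardware may impose (for example, that all consumers obtained the value from the bypass network) are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    block_tag: int                           # basic block that produced the value
    renamer_block_tag: Optional[int] = None  # block of the instruction that later
                                             # renamed the destination, if known yet
    in_rob: bool = False                     # True once the value is written to the ROB


def elidable(r: Result) -> bool:
    """Assumed check: the destination was overwritten within the same basic block."""
    return r.renamer_block_tag is not None and r.renamer_block_tag == r.block_tag


def writeback(r: Result) -> None:
    """First check: skip the ROB write if the value is already known to be dead."""
    if not elidable(r):
        r.in_rob = True


def retire(r: Result) -> str:
    """Second check: a value that reached the ROB may still skip the ARF copy."""
    if not r.in_rob:
        return "rob_and_arf_elided"
    if elidable(r):
        return "arf_elided"
    return "copied_to_arf"
```

The second outcome corresponds to the case where the renaming instruction appears only after the producer has written back, which is more likely in long basic blocks.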
Analysis of Power Results
Fig. 10 shows the breakdown of the average power consumption in Fig. 7 for the baseline, RFC-64, OI-16, B-Processor and B-RFC-64. "Others" in Fig. 10 refers to structures such as the restart PC list, the renamer branch tag array and so on that are used by B-Processor and OI. For several FP benchmarks, B-Processor saves a significant fraction of writes to ROB-D, often much higher than OI-16 and closer to RFC-64. There are two reasons for this. First, benchmarks such as milc, zeusmp and lbm have large basic blocks with a significant fraction of instructions whose results are overwritten within a few instructions. For these benchmarks, B-Processor naturally reduces ROB writes significantly. Second, for bwaves, milc, zeusmp, gromacs and other FP benchmarks, OI-16 is unable to insert many renamed instructions into the SRF because the SRF is full, and thus has to write those results to the ROB. For RFC-64, when there is insufficient space in the RFC for incoming values, values are evicted from the RFC and inserted into the ROB. For the same set of benchmarks, the number of values evicted from the RFC and written to the ROB is high for RFC-64 (Fig. 11).
Power Consumed by ARFs
Power Consumed by Other Structures
While RFC-64 and OI-16 eliminate more writes to the ROB than the B-Processor, the power overhead of their additional structures is significant and reduces the power saving these mechanisms provide. The RFC in RFC-64 and the SRF in OI-16 consume about 38.9 and 12.5 percent, respectively, of the total power consumed by the baseline. The RFC is accessed by all instructions to read their sources and to write their results. Additionally, on instruction retirement, the RFC is read for the result of the instruction and the ARF is updated with the result. Though the SRF is not accessed as frequently as the RFC, it incurs significant cost due to the associative searches performed on it. Whenever a value is to be inserted into the SRF, a search for a conflicting entry must be done, and whenever an instruction that overwrites the destination of an earlier instruction retires, the SRF has to be searched to clear the entry occupied by the earlier instruction. In the case of the B-Processor, the additional copies of the ARFs contribute only to an increase in leakage energy, since the data writes are shared between the two copies, and the other additional structures consume about 5.6 percent of the power consumed by the baseline. For OI, the power consumed by additional structures other than the SRF is 2.3 percent.
B-Processor with Merging of Basic Blocks
Power savings with the B-Processor can be enhanced by merging two basic blocks into one large logical block at runtime. A basic block ending with a high confidence branch can be merged with the following basic block. If a branch in the merged block is mispredicted, the processor has to reexecute instructions starting from the first basic block. We evaluate a B-Processor design that merges basic blocks using a JRS confidence estimator [34] with 4-bit confidence counters. To merge basic blocks, branch tags are allocated in order and the lower order bits of the branch tag are ignored when comparing the tag of an instruction and its renamer. This way an instruction can skip writing to the ROB and ARF even if its destination has been renamed by an instruction from the next basic block. Fig. 15 shows the power consumption of the B-Processor with JRS confidence estimators of size 1 KB (2,048 entries) and 2 KB (4,096 entries) relative to the baseline. In Fig. 15, B-Processor-2K-W (which uses the estimator with 2,048 entries) and B-Processor-4K-W (which uses the estimator with 4,096 entries) include the power consumed by the estimator itself in the comparison. If a predictor that itself provides confidence estimation (such as the TAGE predictor [35]) is used, then the cost of the estimator need not be considered; this case is shown by B-Processor-2K and B-Processor-4K. The use of a free confidence estimator improves the power saving of the B-Processor to about 20.5 percent with both the 1 and 2 KB confidence estimators. If the cost of the confidence estimator is considered, the average power saving drops to 12.5 and 7.7 percent when using the 1 and 2 KB estimators, respectively.
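The two ingredients of block merging, a confidence table and a tag comparison that ignores low-order bits, can be sketched roughly as follows in Python. The sketch assumes the resetting-counter variant of a JRS estimator (increment on a correct prediction, reset to zero on a misprediction), a PC-only table index, and a confidence threshold equal to the saturated counter value; none of these choices are stated in the text, and all names are hypothetical. It also assumes that tags 2k and 2k+1 form a merged pair when one low-order bit is ignored; how tags are allocated so that only blocks joined by a high-confidence branch share upper bits is not modeled.

```python
class ResettingConfidenceEstimator:
    """Table of saturating counters that reset on a misprediction (assumed variant)."""

    def __init__(self, entries: int = 2048, counter_bits: int = 4, threshold: int = 15):
        self.entries = entries
        self.counter_bits = counter_bits
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold          # assumed: only saturated counters are confident
        self.table = [0] * entries

    def storage_bytes(self) -> int:
        return self.entries * self.counter_bits // 8

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries     # assumed PC-only indexing

    def high_confidence(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, predicted_correctly: bool) -> None:
        i = self._index(pc)
        if predicted_correctly:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0


def same_logical_block(inst_tag: int, renamer_tag: int, ignored_low_bits: int = 1) -> bool:
    """Compare branch tags while ignoring the low-order bits, so merged blocks match."""
    return (inst_tag >> ignored_low_bits) == (renamer_tag >> ignored_low_bits)
```

With 4-bit counters, 2,048 entries occupy 1 KB and 4,096 entries occupy 2 KB, which matches the estimator sizes quoted above.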
Sensitivity to ROB Size
We vary the size of the ROB to evaluate whether the B-Processor mechanisms would provide benefit for machines with smaller ROBs. Since the ROB in Config-96 ST is smaller than the ROB in Config-128 ST, the contribution of the ROB to the total power consumed by the operand stores is lower in Config-96 ST than in Config-128 ST. Thus, the impact of reducing ROB writes is smaller; the smaller ROB size also results in fewer writes being saved. Note that the impact of RFC-64 is reduced more drastically than that of the B-Processor mechanisms. The difference in the energy per access of the ROB and the RFC is significantly smaller in Config-96 ST than in Config-128 ST, resulting in a larger reduction in the power saving provided by RFC-64 relative to the baseline.
Power Consumption of Bypass Network
Fig. 17 shows the reduction in the power consumed by the bypass network in the B-Processor relative to the power consumed by a baseline processor with two levels of bypass. Multiple levels of bypass would be required when the writeback latency to the ROB is longer than one cycle, or if the pipeline is designed to bypass values both in the cycle in which an instruction completes execution and in the cycle in which the instruction does writeback. On average, the bypass network in the B-Processor consumes 14.5 percent less power than that of the baseline.
CONCLUSION
In this work we proposed a new low-power architecture called the Block-Precise Processor, which reduces the power consumption of the ROB and the integer and floating point register files by 15.1 percent and the power consumption of the bypass network by 14.5 percent while supporting precise exceptions. The B-Processor mechanisms can be combined with register file caching to obtain power savings of 28.7 percent over the baseline for the ROB and the ARFs. The B-Processor saves power by eliminating writes to the ROB and the ARF. The results not written to the ROB and ARF are not stored in an auxiliary structure; instead, the state of the registers at the end of the last correct basic block is saved using a low cost mechanism. This state is used for supporting precise exceptions. Future work could include code transformations to increase the sizes of basic blocks and instruction scheduling policies to increase the number of instructions that can elide writing to the register file. We will also focus on evaluation on multi-threaded architectures and on developing a mechanism that dynamically determines whether the B-Processor mechanisms should be enabled and, if enabled, what the limit on the number of outstanding stores should be. Such a mechanism will help improve the impact of the B-Processor mechanisms.
