Continual Flow Pipelines (CFPs) allow a processor core to process hundreds of in-flight instructions without increasing cycle-critical pipeline resources. When a load misses the data cache, CFP checkpoints the processor register state and then moves all miss-dependent instructions into a low-complexity waiting buffer (WB) to unblock the pipeline. Meanwhile, miss-independent instructions execute normally and update the processor state. When the miss data returns, CFP replays the miss-dependent instructions from the WB and then merges the miss-dependent and miss-independent execution results.
INTRODUCTION
The current era of multicore processors presents microprocessor architects a difficult design challenge: how to design a processor core that would give good single thread performance on applications with limited thread-level parallelism, yet consume less power, therefore making it possible to integrate more cores on a single die for high throughput performance.
To improve superscalar processor performance on difficult-to-parallelize applications, architects have been increasing the capacity of reorder buffers (ROBs), Reservation Stations (RS), physical register files, and Load and Store Queues (LSQs) [Smith and Sohi 1995] with every new out-of-order processor core. For two decades, enlarging the instruction buffers has provided good performance improvement. However, this approach is no longer effective [Agarwal et al. 2000]. First, there is little instruction-level parallelism left to justify a further increase in instruction buffer size. Second, trying to reduce execution stalls from cache misses with larger conventional cores is impractical. Current buffer sizes are sufficient for code that hits the L1 data cache and far too small for code that misses the last-level cache. Load latency to the last-level cache on current processors is more than 20 clock cycles, and latency to DRAM, even with an on-chip DRAM controller, exceeds a hundred cycles. It is impractical to increase the sizes of power-hungry and cycle-critical buffers, like ROBs, RS, and physical register files, to the capacity necessary to handle long load latencies to the last-level cache or DRAM.
A different design strategy is to size the instruction buffers to the minimum capacity necessary to handle the common case of L1 data cache hit and to use new scalable out-of-order execution algorithms to handle code that misses the L1 data cache.
Continual Flow Pipeline (CFP) architecture was proposed as an energy-efficient large instruction window architecture for reducing the impact of data cache misses on performance, without having to increase instruction buffer and physical register file sizes. CFP breaks away from conventional ROB mechanisms for managing speculative out-of-order execution. Instead of using a ROB for tracking, buffering, and sequentially committing one by one the current set of in-flight instructions, also called the instruction window, CFP uses bulk commit of execution results and register state checkpoints to recover from branch mispredictions and exceptions [Hwu and Patt 1987; Akkary et al. 2003]. Instead of residing completely in the hardware instruction buffers, the instruction window becomes virtual and composed physically of a partial, discontiguous subset of the conventional instruction window. As such, instructions that do not encounter cache misses enter the instruction window, execute, and commit quickly, freeing all of their hardware resources. On the other hand, instructions that depend on cache miss data enter the physical instruction window and wait as long as necessary for their input operands without blocking the execution pipeline.
CFP handles data cache misses as follows. When a load misses the data cache, a poison bit is set in the destination register of the load. Load-dependent instructions in the RS are then woken up, as if the load completed.
Poison bits propagate through instruction dependences and identify all instructions that depend on the load miss and their descendants. The miss load and its dependents, identified by the poison bits in the ROB, pseudocommit in program order and move from the ROB into a waiting buffer (WB) outside the pipeline. Since dependent instructions do not tie up pipeline resources, the core can execute far ahead into the program without stalling due to the cache miss.
When the miss data is fetched, the dependent instructions wake up and replay from the WB into the pipeline to complete their execution. When the WB is emptied and all miss-dependent instructions complete, independent and dependent instruction results are merged using a flash copy operation in the Retirement Register File (RRF). Execution then resumes normally.
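The poison-bit propagation described above can be illustrated with a minimal software sketch. It is not the hardware mechanism itself: the `Instr` record, the register names, and the rule that an independent producer clears the poison on the register it overwrites are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    name: str
    srcs: Tuple[str, ...]      # source register names
    dst: Optional[str]         # destination register, or None (e.g., a branch)
    l1_miss: bool = False      # True for a load that misses the data cache


def mark_miss_dependents(program):
    """Return the names of poisoned instructions, in program order."""
    poisoned_regs = set()
    poisoned_instrs = []
    for ins in program:
        is_poisoned = ins.l1_miss or any(s in poisoned_regs for s in ins.srcs)
        if is_poisoned:
            poisoned_instrs.append(ins.name)
        if ins.dst is not None:
            if is_poisoned:
                poisoned_regs.add(ins.dst)
            else:
                # Assumption: an independent producer overwrites the register
                # with a valid value, so later readers are not poisoned.
                poisoned_regs.discard(ins.dst)
    return poisoned_instrs


program = [
    Instr("A", ("r1",), "r2", l1_miss=True),  # load that misses
    Instr("B", ("r2", "r3"), "r4"),           # depends on A
    Instr("C", ("r5",), "r6"),                # independent
    Instr("D", ("r6",), "r2"),                # independent, overwrites r2
    Instr("E", ("r2",), "r7"),                # reads D's r2, so independent
    Instr("F", ("r4",), None),                # branch that depends on B
]
miss_dependents = mark_miss_dependents(program)
print(miss_dependents)
```

In this example only A, B, and F would be candidates to move into the WB, while C, D, and E execute and commit normally.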
CFP was originally proposed to tolerate long latencies of L2 cache misses that go to DRAM. In that work, the miss-independent and miss-dependent instructions execute at different times, based on the timing of the load miss event and the data arrival event. Switching between the two executions is costly because it involves a pipeline flush, making this proposal unsuitable for L1 misses that hit the on-chip cache. In a more recent work, Simultaneous Continual Flow Pipeline (S-CFP) [Jothi et al. 2011] executes the independent and dependent instructions simultaneously to avoid the costly pipeline flush, thus making S-CFP more suitable for first-level data cache misses. However, in S-CFP, many applications or execution phases of applications incur an excessive amount of replay and/or rollbacks to checkpoints because of branch mispredictions during CFP execution. This is because CFP recovers from replayed mispredicted branches by rolling back execution to checkpoints taken at the load misses. Excessive replay increases the chance of replaying mispredicted branches and consequently of rolling back to the checkpoints. This frequently cancels any desired improvement resulting from S-CFP handling of L1 data cache misses and can even cause performance degradation.
Article Contributions
In this article, we focus on the excessive replays and checkpoint rollbacks to improve CFP performance and reduce its power consumption. For this, we use a novel Virtual Register Renaming (VRR) substrate [Sharafeddine et al. 2013; Jothi and Akkary 2013b] and fine-tune the replay policies to mitigate excessive replays and rollbacks to the checkpoint.
The key contributions of this work over prior CFP work are:
- On previous CFP architectures, all load miss-dependent instructions have to be moved into the WB and then replayed once the load miss is moved into the WB. This is necessary because the miss load releases its renamed destination register when it pseudocommits and moves to the WB. This breaks the dependence links between the miss load and its dependents, requiring the full dependent thread to be replayed and renamed again to reestablish the dependence relation. This work uses virtual register names, which persist for the full lifetime of the instructions, to specify the dependences between instructions. This allows partial replay of only those dependents moved into the WB while the miss is outstanding, significantly reducing the number of replayed instructions and the total execution time.
- This work introduces an improved CFP policy that keeps miss-dependent instructions in the RS as long as they do not block the pipeline. However, when the instruction buffers become full with miss-dependent instructions, thus stalling the pipeline, CFP moves the miss-dependent instructions into the WB. Moving instructions into the WB on a resource need basis significantly reduces the amount of replay and costly rollbacks to the checkpoints resulting from branch mispredictions that depend on load misses.
- This work introduces a hardware predictor to predict miss-dependent branches likely to mispredict. We use this predictor as a branch confidence mechanism to reduce the runahead distance of CFP and the corresponding increase in checkpoint rollback risk. This prediction mechanism improves performance slightly but significantly reduces power resulting from checkpoint rollbacks on benchmarks that have many mispredicted branches that depend on load misses.
- The optimized CFP architecture in this work removes the ROB from the replay loop, which reduces the replay loop latency of miss-dependent instructions evicted into the WB and thus reduces execution time and dynamic power.
- Using a microarchitecture performance simulator and architectural-level power model, the article shows that the optimized CFP architecture improves execution time and energy consumption by 10% and 8%, respectively, over S-CFP architecture. The article also shows that as the instruction buffer sizes are scaled over a wide range, the Tuned-CFP architecture consistently provides better performance return on Energy Per Instruction (EPI) compared to the conventional ROB superscalar architecture, but that the nonoptimized previous S-CFP architecture does not.
The rest of this article is organized as follows. Section 2 presents an overview of the S-CFP architecture. Section 3 follows with a description of VRR and our Tuned-CFP architecture and its optimizations. We outline our simulation methodology in Section 4. Section 5 evaluates performance and energy of the new CFP design and compares with previous CFP architectures as well as conventional baseline cores of equivalent instruction buffers capacity. We discuss related work in Section 6 and conclude in Section 7.
CFP ARCHITECTURE
This section gives an overview of S-CFP [Jothi et al. 2011] and illustrates, with execution examples, some of its drawbacks. It also explains the new mechanisms that we use to tune S-CFP and overcome its drawbacks.
S-CFP Architecture Overview
The S-CFP performs register renaming using a ROB, as in the Intel P6 architecture [Papworth 1996]. Figure 1 shows a block diagram of the S-CFP microarchitecture [Jothi et al. 2011]. Unlike previous latency-tolerant out-of-order architectures, the S-CFP core executes cache miss-dependent and miss-independent instructions concurrently using two different hardware thread contexts. The S-CFP hardware is similar to Simultaneous Multithreading (SMT) architectures [Tullsen et al. 1995], except that in S-CFP, two simultaneous threads are formed of the miss-dependent and miss-independent instructions constructed dynamically from the same program, instead of being two different programs that run simultaneously in the same core. In order to support two hardware threads, S-CFP has two Register Alias Tables (RATs) for renaming the independent and the dependent thread instructions. S-CFP also has two Retirement Register File (RRF) contexts, one for retiring miss-independent instruction results and the other for retiring miss-dependent instruction results. The two threads share the ROB, load queue (LQ), store queue (SQ), RS, and data cache.
The independent hardware thread is the main execution thread. It is responsible for instruction fetch and decode of all instructions, branch prediction, memory dependence prediction, identifying miss-dependent instructions, and moving them into the WB. The dependent thread execution starts when the load miss data is brought into the cache, waking up the load instruction in the WB, and continues until the WB empties. At the end of dependent execution, when all of the instructions from the WB have retired (i.e., committed) without any mispredictions or exceptions, the independent and dependent execution results are merged together with a flash copy of the dependent and independent register contexts within the RRF. To maintain proper memory ordering of loads and stores from the independent and dependent threads execution, S-CFP uses LSQs, a Store Redo Log (SRL) [Gandhi et al. 2005], and a store-set memory dependence predictor [Chrysos and Emer 1998].
S-CFP renames instructions using the ROB [Smith and Sohi 1995]. When a load miss reaches the head of the ROB, it is pseudoretired and moved into the WB immediately. The load miss instruction releases all of its pipeline resources, including its ROB ID, before entering the WB, breaking its links with its readers that are still in the pipeline. For this reason, the entire dependence chain of the load miss and its dependents needs to be renamed and replayed from the WB until all miss-independent and miss-dependent instructions complete execution and their results are merged. This can last for a long distance forward in the program, causing excessive replays and rollbacks. We show next, in more detail, examples that illustrate the drawbacks of S-CFP targeted by our optimizations.
S-CFP Execution Examples
Figures 2(a) and 2(b) show snapshots of the ROB and WB states in S-CFP at different execution times. All shaded instructions correspond to either a load miss or a load miss dependent, both of which are potential candidates to move into the WB.
In Figure 2(a), the WB has a load miss X at the head waiting for its wakeup. Instruction A misses the first-level cache and is marked as a potential candidate to be moved into the WB. When A reaches the head of the ROB, there are still free entries available in the ROB. Nevertheless, S-CFP eagerly pseudoretires and moves A into the WB. In order to execute when the miss data is fetched, A has to be replayed from the WB back into the pipeline to be renamed again and allocated resources for execution. Figure 2(b) shows that the load miss hits the L2 data cache and A is woken up from the L1 data cache shortly after it enters the WB. However, it is stuck in the WB behind instruction X that has missed to DRAM. For a long time afterward, and until the miss data of load X is fetched from DRAM, many of the dependents of A will be poisoned and moved into the WB. Therefore, even though A has hit the L2 cache, it is replayed with its dependents from the WB as if it has encountered a miss in the L2 data cache and has needed to go all the way to DRAM for the data.
Figures 3(a)-(c) show an execution sequence to illustrate a situation in S-CFP that leads to a rollback to the checkpoint. Similar to the earlier example, the WB has a load instruction X that has missed. Instruction A misses the first-level cache. In this example, F is a branch instruction dependent on A. A is moved into the WB from the head of the ROB, as shown in Figure 3(a). F also follows A into the WB, even though the wakeup for A arrives while F is still in the ROB/RS, as shown in Figure 3(b). Both A and F are replayed behind instruction X, as shown in Figure 3(c). On replay, branch F is found to be mispredicted, and branch misprediction recovery has to be performed by rolling back execution to the checkpoint, since by then, the sequential state in the register file has been corrupted by the out-of-order pseudoretirement of instructions during the cache miss processing.
Figures 4(a)-(d) show another execution sequence to illustrate why S-CFP needs to replay a load and all of its dependents once the load enters the WB. In this example, A is a load miss, and B is dependent on A. The two instructions are separated by miss independents shown as dotted lines. The figures show only the miss dependents in the pipeline for clarity. The renamed physical source and destination register tags (src1, src2 -> dest) are shown alongside the instruction. Also shown below their ROB entries are the ROB IDs or physical destination registers of A and B. The WB, initially empty, is also shown in the figures.
In Figure 4(a), A reaches the head of the ROB. It pseudoretires and moves into the WB, releasing all of its pipeline resources, including its ROB ID #3, as shown in Figure 4(b). When the wakeup for A arrives and it replays, it is allocated a new entry at the tail of the ROB, as shown in Figure 4(c). Notice that A gets a new ROB ID #24 when it is reintroduced into the pipeline. Because of this new ID, even though B is still in the RS and the ROB while A is being replayed, A's data writeback cannot wake up B, because B still has the physical register destination ID #3 as its source operand. B reaches the ROB head, pseudoretires, and moves into the WB. When B is replayed and reintroduced into the pipeline, it goes through the rename stage, gets a new ROB ID #28, and receives the correct physical source register ID #24, reestablishing its link with A from the dependent RAT, as shown in Figure 4(d). Notice that an 'x' is shown for the other source IDs to indicate that the IDs of these sources are "don't care" for illustrating this example.
Tuned-CFP Execution Examples
As illustrated in the previous examples, S-CFP causes excessive replays and rollbacks, negatively impacting performance and energy consumption. Our Tuned-CFP architecture in this work handles these cases differently, as explained next.
Figures 2(c) and 2(d) show the ROB and WB states in the Tuned-CFP architecture. Instruction A misses the first-level cache and is marked as poisoned. However, unlike in S-CFP, it does not release its RS entry until it becomes a blocking instruction, just in case the load hits the L2 cache, which would provide the miss data to the CFP core shortly. A may reach the head of the ROB before the L2 data cache loads the data, but it will still be kept in the RS and ROB by stalling pseudoretirement as long as there are free entries in the ROB and other instruction buffers for the pipeline to continue execution of other instructions without blocking. If the miss data arrives before the pipeline blocks, A is woken up from the RS and ROB by clearing its poison bits, as shown in Figure 2(d). A and its dependents do not need to go through the replay loop at all in this example, saving significant time delay and energy.
Figures 3(d) and 3(e) show how the rollback situation in S-CFP is avoided with the Tuned-CFP architecture. Similar to the previous example, instruction A stays in the ROB, even if it reaches the head, as long as it is not blocking execution. A gets its wakeup before it moves into the WB, as shown in Figure 3(e). Even though F is a miss-dependent and mispredicted branch, it executes before it pseudoretires. When it reaches the head of the ROB, the ROB flushes the pipeline to clear all of the wrong path instructions that have been fetched after the branch and signals to the fetch unit to restart fetch and execution from the corrected target. The costly S-CFP branch recovery from the checkpoint has been avoided.
Figures 4(e)-4(g) illustrate a partial replay in the Tuned-CFP architecture, representing the same scenario discussed earlier in Figure 4(a). In Tuned-CFP, virtual register IDs (VIDs) that are not associated with any physical locations are used for register renaming and in the RS wakeup and scheduling logic. The VIDs of instructions A and B are shown under their ROB entries in addition to the renamed source and destination VIDs. As before, when A reaches the head of the ROB, it pseudoretires and moves into the WB, as shown in Figure 4(e). However, unlike in S-CFP, A releases its RS entry but carries its VID #3 along with it into the WB, as shown in Figure 4(f). Later on, when it wakes up and replays, A still carries its original VID #3, keeping its link with its dependent instruction B intact. This allows the RS to schedule B without replaying and renaming it again, as shown in Figure 4(g).
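The difference between the two renaming schemes in Figure 4 can be sketched as a tag-matching exercise. The `ReservationStation` class below is a hypothetical, heavily simplified model of wakeup by tag broadcast; only the ID values #3 and #24 come from the example above.

```python
class ReservationStation:
    """Toy wakeup model: each waiting instruction watches one producer tag."""

    def __init__(self):
        self.waiting = {}  # instruction name -> source tag it is waiting for

    def add(self, name, src_tag):
        self.waiting[name] = src_tag

    def writeback(self, tag):
        """Broadcast a result tag and return the instructions it wakes up."""
        woken = sorted(n for n, t in self.waiting.items() if t == tag)
        for n in woken:
            del self.waiting[n]
        return woken


# S-CFP style: the tag is a ROB ID, which A gives up when it enters the WB.
rs = ReservationStation()
rs.add("B", src_tag=3)        # B waits on A's original ROB ID #3
a_tag_after_replay = 24       # A is renamed to ROB ID #24 on replay
scfp_woken = rs.writeback(a_tag_after_replay)

# Tuned-CFP style: the tag is a VID that A keeps while it sits in the WB.
rs = ReservationStation()
rs.add("B", src_tag=3)        # B waits on A's VID #3
a_vid_after_replay = 3        # unchanged by the trip through the WB
tuned_woken = rs.writeback(a_vid_after_replay)

print(scfp_woken, tuned_woken)
```

In the first case B is not woken and must itself be replayed and renamed; in the second case A's writeback wakes B directly in the RS.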
Avoiding excessive replays and rollbacks not only saves significant execution time for miss dependents but also reduces power consumption considerably.
CFP WITH VRR
In this section, we describe the core architecture of CFP with virtual register renaming (Tuned-CFP), its key mechanisms, and its improvements over previous CFP architectures. Figure 5 shows a block diagram of the Tuned-CFP core. The Tuned-CFP microarchitecture uses Tomasulo's algorithm and RS to perform data-driven, out-of-order execution [Tomasulo 1967].
Tuned-CFP Architecture Overview
Like other superscalar architectures, Tuned-CFP uses a ROB to commit instructions and update register and memory state in program order. However, it does not use the ROB for register renaming. Instead, it performs register renaming using VIDs generated by a special counter. These VIDs are not mapped to any fixed storage locations in the core, thus they can be large in number and allocated to instructions throughout their lifetime, including miss-dependent instructions evicted to the WB. Since the VIDs are plentiful, Tuned-CFP does not run the risk of pipeline stalls resulting from miss-dependent instructions holding on to their renamed registers for a long time while waiting for the long-latency load miss. The VID counter is finite in size and cannot be allowed to overflow, in order to prevent allocating the same VID to multiple instructions in the pipeline. Tuned-CFP opportunistically resets the VID counter whenever it can, for example, when the pipeline is flushed to recover from a mispredicted branch. Otherwise, a pipeline stall and drain is forced to reset the counter when it overflows. With a 10-bit VID counter, our simulations show that the impact on performance of forced pipeline stalls to reset the counter is negligible.
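The VID allocation policy can be sketched as follows. This is an illustrative model under simplifying assumptions: the class and method names are hypothetical, a flush is treated as always making a reset safe, and a forced drain is modeled only as a counter of events.

```python
class VidCounter:
    """Allocates virtual register IDs from a finite counter (sketch)."""

    def __init__(self, bits=10):
        self.limit = 1 << bits    # number of distinct VIDs
        self.next_vid = 0
        self.forced_drains = 0

    def allocate(self):
        if self.next_vid == self.limit:
            # Out of unique VIDs: model a forced pipeline stall and drain,
            # after which the counter restarts from zero.
            self.forced_drains += 1
            self.next_vid = 0
        vid = self.next_vid
        self.next_vid += 1
        return vid

    def on_pipeline_flush(self):
        # Opportunistic reset (assumed safe on every flush in this sketch).
        self.next_vid = 0


counter = VidCounter(bits=2)                   # tiny counter to show overflow
first = [counter.allocate() for _ in range(4)]
counter.on_pipeline_flush()                    # free reset, no drain needed
after_flush = counter.allocate()
more = [counter.allocate() for _ in range(4)]  # the last one forces a drain
print(first, after_flush, more, counter.forced_drains)
```

With a 10-bit counter, the forced-drain branch is taken only after 1,024 allocations without an intervening reset.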
VRR gives Tuned-CFP a significant advantage over previous CFP architectures by allowing Tuned-CFP to replay only a part of the load miss dependence chain.
3.1.1. Miss Independents Execution. Like previous CFP cores, the Tuned-CFP core is capable of executing cache miss-dependent and miss-independent instructions concurrently in the pipeline, supported by two RRF contexts, one for retiring miss-independent instructions and the other for retiring miss-dependent instructions.
In Tuned-CFP, execution initially starts using an RRF context that we call the independent RRF. When an L1 data cache load miss occurs, a poison bit is set in the destination ROB entry of the load. Load-dependent instructions in the RS capture the poison bit from the common writeback data bus. They are then woken up, as if the load completed, and are scheduled by the RS control logic for pseudoexecution. Pseudoexecution of poisoned instructions does not actually use any execution units. However, pseudoexecution consumes RS dispatch ports and writeback bus cycles to propagate poison bits through instruction dependences and to identify all instructions in the RS that depend on the load miss data. After pseudoexecution, miss-dependent instructions stay in their RS until they wake up for real execution when the load miss data arrives, or until they are moved into the WB.
3.1.2. Replay Loop and Miss Dependents Execution. Figure 5 shows the reduced replay loop in Tuned-CFP consisting of two stages: the RS and the WB. The WB basically acts as a second-level storage for the RS. With VRR, entries can be freely evicted from the RS to the WB and then loaded back again to the RS to be scheduled for execution at a later time.
In Tuned-CFP, miss-dependent instructions are evicted from the RS to the WB only when their buffer resources are needed to unblock the execution pipeline. Therefore, when a miss is processed and its data fetched to the L1 data cache, the miss-dependent instructions may still be in the RS. In this case, when the data is written back, the miss data is captured by all of the RS entries that hold instructions that depend on the miss. The captured writeback data sets the source operand "ready" state bits and clears the poison bits of the dependent RS entries, making the instructions in these entries ready for scheduling and dispatch to execution.
Evicting miss-dependent instructions to the WB on a resource need basis significantly reduces the number of replayed instructions, especially in the case of medium-latency load misses, which are those that miss the L1 data cache but hit the on-chip L2 cache.
In case of a load miss to DRAM, it is often the case that the long miss latency causes the instruction buffers to fill up and some or all of the in-flight miss-dependent instructions to evict to the WB. When the load miss is serviced, the miss load and its dependents are reinserted from the WB back to the RS where they are scheduled for execution.
Notice that even though replayed miss-dependent instructions do not need to be renamed again, they sometimes, as in Tomasulo's algorithm, need to read source operands that have already been computed and retired from the ROB to the register file (RRF). State bits that track whether the last instructions to write logical registers have been retired are stored in a special storage structure. These state bits are checked during replay to determine if the operands are ready in the RRF and to read them and move them into the RS with the replayed instructions.
3.1.3. Tuned-CFP Reservation Stations. Tuned-CFP uses a centralized array of conventional data-capture RS [Papworth 1996]. Each RS entry is extended with a poison bit per source operand and an L1-DCache-miss bit. The L1-DCache-miss bit is set to 1 if the entry contains a load instruction that has missed the L1 data cache. We say that an instruction is poisoned if one of its source poison bits or its L1-DCache-miss bit is set to 1. A source operand of an instruction is poisoned if and only if it is the destination of another poisoned instruction. In other words, the poison bits propagate the dependences from L1 data cache misses to later instructions in the program to identify instructions that may encounter long data cache miss delays. These instructions are candidates to move to the WB to avoid pipeline stalls that could occur if any of the RS, ROB, LQ, or SQ arrays becomes full.
The RS array is augmented with a free list and an order list. Tuned-CFP uses the free list to track the occupancy of the RS. The order list tracks the program order of the RS. The RS order list could be implemented as part of the ROB by adding the RS ID of each instruction to its allocated ROB entry. It also could be implemented as a special array separate from the ROB.
Four conditions are checked to determine if an instruction should be moved to the WB: (1) the instruction is at the head of the RS order list; (2) the instruction is poisoned; (3) one of the RS, ROB, LQ, or SQ arrays is full; and (4) every source operand of the instruction is either poisoned or ready. The last condition ensures that the miss-dependent instructions carry their nonpoisoned input values with them when they are replayed, since there is no guarantee that these values would not be overwritten in the register file by replay time. In addition, each RS has a state bit that indicates if the instruction in the entry has been replayed once before, that is, it has been moved earlier in time to the WB and then back to the RS array. Each RS also contains a load miss identifier in case it has a load instruction that misses the data cache. An implementation could use for this purpose the ID of the L1 data cache fill buffer used to handle the load miss.
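The four eviction conditions can be written as a single predicate. The sketch below is a software illustration only; the `Source` and `RsEntry` records and the function name are hypothetical, and the replayed-once bit and load miss identifier are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    ready: bool = False      # operand value already captured
    poisoned: bool = False   # operand depends on an outstanding miss


@dataclass
class RsEntry:
    sources: List[Source]
    l1_dcache_miss: bool = False  # entry holds a load that missed the L1

    @property
    def poisoned(self):
        return self.l1_dcache_miss or any(s.poisoned for s in self.sources)


def should_move_to_wb(entry, at_head_of_order_list, any_buffer_full):
    """Conditions (1)-(4): head of order list, poisoned, a buffer is full,
    and every source operand is either poisoned or ready."""
    every_source_settled = all(s.poisoned or s.ready for s in entry.sources)
    return (at_head_of_order_list
            and entry.poisoned
            and any_buffer_full
            and every_source_settled)


dependent = RsEntry([Source(poisoned=True), Source(ready=True)])
pending = RsEntry([Source(poisoned=True), Source()])  # second operand in flight
print(should_move_to_wb(dependent, True, True))   # all four conditions hold
print(should_move_to_wb(dependent, True, False))  # no buffer is full
print(should_move_to_wb(pending, True, True))     # condition (4) fails
```

The `pending` case shows why condition (4) matters: the entry must wait in the RS until its nonpoisoned operand has been captured, so that the value travels with it into the WB.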
3.1.4. Waiting Buffer. The WB is a wide single-ported SRAM array managed as a circular buffer using head and tail pointers. Miss-dependent RS entries at the head of the RS array move to the tail of the WB when any of the instruction buffers fills up due to data cache misses. When a data cache miss is completed, Tuned-CFP replays the miss-dependent entries by loading them back from the head of the WB to the tail of the RS. Ideally, the width of the two buses connecting the RS and the WB would match the pipeline width. A narrower interconnect can also be used, trading some performance for simpler hardware.
A key to the efficiency of Tuned-CFP large window design is the fact that the WB has no CAM ports, connections to writeback buses for capturing data operands, or conventional ready/schedule logic. All of these functions are handled in the RS array after the data cache miss completes and the miss-dependent instructions are replayed. Therefore, the WB array can be designed using nontagged SRAM and made significantly larger than the RS array at much lower area and power cost than if the RS array is large enough to hold the full instruction window.
In order to wake up miss dependents from the WB and replay them, the L1 data cache fill buffer handling a load miss has to receive and save the WB ID of its load miss. When the miss completes, Tuned-CFP replays the load miss and its dependents in program order, as described earlier, from the head of the WB back into the RS allocate/write stage of the execution pipeline.
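The head and tail pointer management of the WB can be sketched in software as a plain circular buffer. The class below is a hypothetical illustration; in particular, returning the slot index from `push` as the "WB ID" that a fill buffer could save is an assumption about how the identifier might be formed.

```python
class WaitingBuffer:
    """Circular FIFO with head and tail pointers, with no tag matching."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # next entry to replay
        self.tail = 0    # next free slot
        self.count = 0

    def push(self, entry):
        """Evict an RS entry into the WB; return its slot index (WB ID)."""
        if self.count == len(self.slots):
            raise OverflowError("waiting buffer is full")
        wb_id = self.tail
        self.slots[wb_id] = entry
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return wb_id

    def pop(self):
        """Replay the oldest entry, preserving program order."""
        if self.count == 0:
            raise IndexError("waiting buffer is empty")
        entry = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return entry


wb = WaitingBuffer(capacity=4)
ids = [wb.push(name) for name in ("X", "A", "F")]
replayed = [wb.pop() for _ in range(3)]
wrapped_ids = [wb.push("G"), wb.push("H")]   # second push wraps to slot 0
print(ids, replayed, wrapped_ids)
```

The sketch has no lookup by tag or operand capture, which mirrors the point made above that those functions stay in the RS array.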
3.1.5. Register File and Results Integration. Tuned-CFP uses the ROB to handle branch mispredictions and exceptions incurred by miss-independent instructions. On the other hand, it uses checkpoints to handle branch mispredictions or exceptions encountered by miss-dependent instructions.
Like the prior S-CFP architecture [Jothi et al. 2011], Tuned-CFP has a specialized register file for checkpointing register state at the load miss, for later use to handle miss-dependent branch mispredictions and exceptions. The register file also has special logic for integrating the results of independent and dependent instructions and for restoring precise register state after all miss-dependent instructions execute. Figure 6 shows a Tuned-CFP RRF cell with checkpoint flash copy support. Tuned-CFP uses a flash copy of the RRF for creating checkpoints. In one cycle, every independent RRF state bit (left-most latch) is shifted into a checkpoint latch within the register cell (center latch). The register file can be restored from the checkpoint in one cycle by asserting RSTR_CLK.
In addition to the checkpoint bit and the independent RRF context bit, Tuned-CFP register file cell contains one context bit for the dependent RRF state (right-most latch). The integration of the independent and dependent instruction results is done in one restore cycle. At the end of dependent execution, after all instructions in the WB have replayed and retired, the RRF has all the live-out registers, some of which are computed by the independent instructions and some by the dependent instructions. This is determined by the poison bits in the RRF. To integrate these results back into one context, a restore cycle is performed from the dependent context into the independent context. However, not all registers are copied. Figure 6 shows that only poisoned registers are copied by using the poison bits to enable the clock of the copy operation. A 2-to-1 multiplexer in the cell restores either the checkpoint bit or the dependent bit during an RSTR_CLK cycle.
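The checkpoint, restore, and integration operations can be mimicked at the register level with dictionaries. This is a functional sketch only, with hypothetical function names and register values; the hardware performs each operation as a one-cycle flash copy across all register cells.

```python
def take_checkpoint(independent):
    """Flash copy of the independent context into the checkpoint latches."""
    return dict(independent)


def restore_from_checkpoint(checkpoint):
    """Roll back: every register is restored from the checkpoint."""
    return dict(checkpoint)


def integrate(independent, dependent, poison):
    """Copy only poisoned registers from the dependent context."""
    return {reg: (dependent[reg] if poison[reg] else value)
            for reg, value in independent.items()}


# State at the load miss.
checkpoint = take_checkpoint({"r1": 1, "r2": 2, "r3": 3})

# State after independent and dependent execution have both finished.
independent = {"r1": 10, "r2": 2, "r3": 7}    # r1 and r3 written independently
dependent = {"r1": 0, "r2": 42, "r3": 0}      # r2 written by a miss dependent
poison = {"r1": False, "r2": True, "r3": False}

merged = integrate(independent, dependent, poison)
rolled_back = restore_from_checkpoint(checkpoint)
print(merged, rolled_back)
```

The poison bits act as the per-register copy enable: only r2 is taken from the dependent context, while r1 and r3 keep their independent values.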
3.1.6. Load and Store Execution in Tuned-CFP. To maintain proper memory ordering of loads and stores from the independent and dependent instructions execution, Tuned-CFP uses LSQs, an SRL [Gandhi et al. 2005], and a store-set memory dependence predictor [Chrysos and Emer 1998]. A detailed description of Tuned-CFP SRL and the speculative L1 data cache has been presented in previous works [Jothi et al. 2011; Sharafeddine et al. 2012].
3.1.7. Miss-Dependent Branch Predictor. A key reason for CFP performance degradation is dependent branches that go into the WB and are later found to be mispredicted. Factors contributing to this performance degradation include a large window of wrong path instructions, re-execution of instructions between the load miss and the mispredicted branch, and delayed resolution of the miss-dependent branch while it waits for its turn to come out of the WB.
To address this problem, we use an approach similar to pipeline gating in Manne et al. [1998] , except that we apply it only to costly dependent branches. We identify branches that are likely to mispredict and take necessary action when they move into the WB. We have observed that in multiple benchmarks, there is a strong correlation between the dependent mispredicted branch and its PC value. We use a small hardware predictor of 32 entries that contains the addresses of previous dependent mispredicted branches to estimate branch confidence. The processor front end is stalled when a branch with low confidence is moved into the WB. The front end of the pipeline is unblocked only after the load miss data is delivered to the cache and the branch is resolved. Our results in Section 5.2.2 show that this mechanism reduces excessive replay and rollback execution.
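The confidence mechanism can be sketched as a small table of branch addresses. The sketch below assumes a fully associative 32-entry table that evicts the least recently recorded address; the organization and replacement policy are not specified above, so both are assumptions, as are the class and function names.

```python
from collections import OrderedDict


class MissDependentBranchPredictor:
    """Remembers PCs of miss-dependent branches that mispredicted."""

    def __init__(self, entries=32):
        self.entries = entries
        self.table = OrderedDict()   # PC -> True, oldest first

    def record_misprediction(self, pc):
        if pc in self.table:
            self.table.move_to_end(pc)       # refresh an existing entry
            return
        if len(self.table) == self.entries:
            self.table.popitem(last=False)   # assumed policy: evict oldest
        self.table[pc] = True

    def is_low_confidence(self, pc):
        return pc in self.table


def should_stall_front_end(predictor, branch_pc, moving_to_wb):
    """Stall fetch only when a low-confidence branch enters the WB."""
    return moving_to_wb and predictor.is_low_confidence(branch_pc)


predictor = MissDependentBranchPredictor(entries=2)   # tiny table for the demo
for pc in (0x100, 0x200, 0x300):                      # 0x100 gets evicted
    predictor.record_misprediction(pc)

print(should_stall_front_end(predictor, 0x300, moving_to_wb=True))
print(should_stall_front_end(predictor, 0x300, moving_to_wb=False))
print(should_stall_front_end(predictor, 0x100, moving_to_wb=True))
```

Gating on `moving_to_wb` reflects the restriction to costly dependent branches: a branch that resolves from the RS never stalls the front end in this sketch.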
EVALUATION METHODOLOGY
We built our Tuned-CFP architecture model on the SimpleScalar ARM ISA simulation infrastructure (www.simplescalar.com). CFP benefits applications that suffer high data cache miss rates. However, its effectiveness in handling data cache misses is limited by mispredicted branches that depend on data cache misses. For this reason, we used all 14 "C" benchmarks from SPEC 2000 and SPEC 2006 that we succeeded in compiling using the SimpleScalar cross compiler tool.
These benchmarks do not suffer much on average from data cache misses but have high branch misprediction rates that make them useful in exposing the performance limitations and glass jaws of CFP, thus making them appropriate for evaluating the effectiveness of our proposed optimizations. Our choice of ARM ISA is arbitrary and does not change the conclusions in the article, since data cache miss rates and branch mispredictions depend mainly on the application characteristics and not the ISA.
After skipping the initialization code and warming up the caches and predictors for 40 million instructions, we simulated 200 million instructions from each benchmark, consisting of four different samples manually selected from representative execution phases to display wide variation in the cache miss rate as well as the branch misprediction rate, both of which significantly impact CFP execution behavior. Results in Section 5 show the average performance of these selected samples for each benchmark.
For energy analysis, we obtained measurements from SPICE circuit simulations of all baseline superscalar and CFP core functional blocks, using Cadence tools and a 45nm process technology. We combined these with logic switching activity from our SimpleScalar simulator, creating an Architectural-Level Power Simulator (ALPS) [Brooks et al. 2000]. All energy results reported in this article account for total energy, which is the sum of dynamic and static (or leakage) energy. Our model assumes 30% of the energy consumed per instruction to be due to leakage, which is typical of current high-performance integrated circuit technology. The energy model accounts for the area of additional structures, including 8K bytes of SRAM in the SRL and SDB, two thread contexts, and checkpoint and flash copy logic in the RRF. Our estimate for the area overhead of CFP adds up to less than 5% of the non-CFP core configuration shown in Table I. We used a 6-transistor SRAM cell design with differential sense amplifiers for the SRL and WB circuits and a single-sided register file cell, augmented with checkpoint and results integration circuits. Unless specified otherwise, the average consumed EPI is reported for each evaluated core relative to the non-CFP baseline configuration that we describe next.
Table I shows the simulated machine configuration. Since our simulations are done with a single-core model, we use a two-level cache hierarchy with an L2 cache size that is representative of the L2 capacity per core of current multicore processors. We selected optimum instruction buffers and L1 data cache sizes for maximum EPI efficiency, as described in Section 5.3.
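The energy accounting used in this section (total energy as the sum of dynamic and leakage energy, with leakage assumed to be 30% of the energy per instruction) can be illustrated with a few lines of arithmetic. The sketch below is one possible reading of that assumption, in which leakage is a fixed fraction of total EPI; the function names are hypothetical and are not part of the simulator.

```python
LEAKAGE_FRACTION = 0.30  # share of energy per instruction attributed to leakage


def total_epi(dynamic_epi: float, leakage_fraction: float = LEAKAGE_FRACTION) -> float:
    """Total EPI when leakage is a fixed fraction of the total.

    total = dynamic + static and static = f * total, so total = dynamic / (1 - f).
    """
    return dynamic_epi / (1.0 - leakage_fraction)


def relative_epi_increase(core_epi: float, baseline_epi: float) -> float:
    """Percent increase in EPI of a core relative to the non-CFP baseline."""
    return 100.0 * (core_epi - baseline_epi) / baseline_epi
```

For instance, under this reading a core whose dynamic energy is 0.7 units per instruction has a total EPI of 1.0 unit, of which 0.3 units are leakage.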
EXPERIMENTAL RESULTS

Limit Studies to Quantify the Disadvantages of S-CFP
To quantify the disadvantages of S-CFP described in Section 2.2, we used ideal studies to eliminate each of these disadvantages and to compare the performance against a nonoptimized S-CFP architecture.
5.1.1. Buffer Full Condition. As we pointed out earlier in Section 2.2, S-CFP does not wait for the buffer full (bf) condition before moving a load miss or its dependent instructions into the WB. To measure the impact of this issue, we evaluate a model in which a load miss instruction is prevented from moving into the WB until it blocks the core execution pipeline. Figure 7 shows the performance improvement due to this optimization.
Note that benchmarks, such as gcc and perl, benefit the most from keeping dependent instructions in the ROB/RS until they block the execution pipeline. This is because these benchmarks have a significant number of L1 cache misses that actually hit the L2 and do not encounter the very long DRAM access latency. Therefore, leaving a load miss in the RS for a few cycles longer significantly increases the probability that the load miss will receive a wakeup signal before it moves to the WB, therefore saving the entire miss-dependence slice from being replayed.
From Figure 7 , the average contribution of bf optimization over all benchmarks is 2.7%. This speedup is not very high, mainly because the window of opportunity to save unnecessary replay is limited to the time between the load miss reaching the ROB head and the time the instruction buffer fills up. If the miss data does not come back within this time window, the entire dependence chain must be replayed.
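The window of opportunity described above can be captured by a deliberately simplified model. It assumes a single outstanding load miss and measures both quantities in cycles from the point the load reaches the ROB head; `slice_is_replayed` and its parameters are hypothetical names, not part of the simulator.

```python
def slice_is_replayed(miss_latency: int,
                      cycles_until_buffer_full: int,
                      wait_for_buffer_full: bool) -> bool:
    """Return True if the miss-dependence slice must be replayed from the WB.

    Without the bf optimization, the load miss and its dependents move to the
    WB right away, so they are always replayed. With it, they stay in the
    ROB/RS until the buffers fill; if the miss data returns first, the wakeup
    reaches them in place and no replay is needed.
    """
    if not wait_for_buffer_full:
        return True
    return miss_latency > cycles_until_buffer_full


# Usage: a short L2 hit returns before the buffers fill, a DRAM access does not.
l2_hit_replayed = slice_is_replayed(20, 30, wait_for_buffer_full=True)
dram_replayed = slice_is_replayed(200, 30, wait_for_buffer_full=True)
```

The model makes the limitation in the text explicit: the optimization helps only when the miss latency fits inside the time it takes the instruction buffers to fill.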
Miss-Dependent Branch Mispredictions.
Miss-dependent branches that are later found to be mispredicted cause performance degradation in CFP architectures because of the following reasons. First, execution needs to be rolled back to the checkpoint taken at the load miss to recover from the incorrect speculative updates made to the architectural state. Second, the miss-dependent branch itself may be resolved much later in time depending on when it replays from the in-order WB. Until a miss-dependent mispredicted branch is resolved, S-CFP continues to fetch instructions from the wrong path. This not only wastes precious pipeline resources, including cycles in the powerhungry fetch and decode stages, but also exerts additional pressure on the execution pipeline, thus increasing the number of instructions that are moved into the WB and subsequently replayed.
In order to quantify the impact of miss-dependent mispredicted branches on performance, we evaluate an S-CFP model with an Oracle predictor that stalls the processor front end perfectly on a miss-dependent mispredicted branch. This is the dp model in Figure 7 . As can be seen in Figure 7 from the speedup of the dp model, the average impact of miss-dependent mispredicted branches on performance is ∼5%. This is a moderate performance loss compared to that caused by the large number of miss-dependent instructions replayed from the WB, as we present in the next section.
5.1.3. Unnecessary Replay Performance Overhead. In order to measure the negative impact of instruction replays on S-CFP performance, we evaluate an ideal model that eliminates unnecessary instruction replay. As described earlier in Section 2.2 and Figure 4 , unnecessary replays are those caused by having to replay instructions that are still in the L1 RS when the miss load has been replayed in order to restore the dependence links. This model is called ur in Figure 7 .
From Figure 7, one can see that a 12% reduction in S-CFP performance comes from unnecessary replays. This makes VRR the most important optimization to S-CFP. Figure 7 also shows two models (ur + bf and ur + bf + dp) that combine individual limit studies. As can be seen from these two ideal models, the potential performance benefit from multiple optimizations is cumulative. In particular, the ur + bf + dp model, which combines the three optimizations, establishes the performance upper bound that can be achieved with a realistic Tuned-CFP machine exhibiting the minimum possible replay and rollback.
Combining Limit Studies.
In summary, on memory-intensive benchmarks, the average contribution from each optimization toward improving performance is 4% for bf, 9% for dp, 18% for ur, and 22% when combined. None of the optimizations require power-hungry structures. Moreover, they provide sizeable speedup, which justifies their implementation cost. Since these optimizations complement each other very well, combining them to get the best possible performance would be the recommended option. We next present simulated performance results for a realistic Tuned-CFP machine.
Comparing Realistic Tuned-CFP to S-CFP
This section compares the performance and EPI of Tuned-CFP to S-CFP. It also isolates the contribution of each optimization scheme, when applied by itself, toward improving the overall performance of Tuned-CFP. Figure 8 shows the speedup of Tuned-CFP over S-CFP when targeting data cache misses at all levels. Tuned-CFP, with its VRR, short replay loop, and reduced replay/rollback, outperforms S-CFP by an average of 10% when the CFP algorithm is applied to L1 data cache misses. The maximum improvement occurs on gcc (39% speedup), which displays a high rate of L1 cache misses as well as a high rate of branch mispredictions. The reduced replay loop of Tuned-CFP is very favorable to benchmarks like gcc and perl, since reducing the amount of replay also significantly reduces the number of costly miss-dependent branch mispredictions. The variation in the improvement between benchmarks is mainly due to the variation in the cache miss rates and branch misprediction rates of our simulation samples. The long replay loop of S-CFP is a glass jaw that is exposed on benchmarks like gcc that display high cache miss and branch misprediction rates. A carefully designed replay loop is necessary when designing CFP for general-purpose processors that target many applications with widely different execution characteristics.
Tuned-CFP Performance.
Notice that the results in this section are reported with a Tuned-CFP core having all of the optimizations discussed in Section 3. To highlight the benefits of the most architecturally significant optimization (i.e., VRR), Tuned-CFP performance is compared to an S-CFP that has all of the optimizations except VRR; that is, the buffer full optimization for load misses and stalling the front end based on the miss-dependent branch predictor. For this reason, the Tuned-CFP speedup over S-CFP in Figure 8 is not as high as the speedup of the ur + bf model over unoptimized S-CFP shown in Figure 7.
Tuned-CFP Optimizations.
There are three optimizations featured in Tuned-CFP: (1) partial replay of the load miss dependence chain, (2) moving miss-dependent instructions into the WB only when a resource or buffer becomes full, and (3) dependent branch confidence predictor. We call these optimizations PR, BF, and DP, respectively.
The contribution of the PR and BF optimizations toward the performance of Tuned-CFP is exactly the same as the benefit from the ideal models ur and bf shown in Figure 7. Figure 9 shows the percent speedup contributed by the simple history-based miss-dependent branch predictor (DP) discussed in Section 3.1.7. The speedup coming from the Oracle miss-dependent branch predictor (DP_Orc) is also shown for comparison. Notice that the PR + BF + DP model and the Oracle PR + BF + DP_Orc model contribute average speedups of 3.7% and 4.2%, respectively, showing that despite the small size of our 32-entry dependent branch confidence predictor, it achieves a speedup within 0.5% of the perfect upper bound performance of the Oracle model.
We observed that some benchmarks, like gcc and twlf, have a small number of static branches that contribute a large percentage of dependent branch mispredictions, making our small 32-entry hardware predictor very effective. Other benchmarks, such as gobk and gzip, display more complex control flow patterns with a larger set of static branches contributing to mispredictions. For these benchmarks, our dependent branch predictor is less effective. Using global correlated branch prediction methods might help these benchmarks even further than our simple confidence prediction scheme. Figure 10 shows the percentage increase in EPI of S-CFP and Tuned-CFP over the non-CFP baseline for each of our simulated benchmarks. Table II shows replayed instructions, rollback instructions, and wrong path instructions as a percentage of total instructions for both S-CFP and Tuned-CFP.
EPI Comparison of S-CFP and Tuned-CFP.
The effectiveness of Tuned-CFP optimizations in reducing the excessive replay, rollback, and speculative execution is evident and accounts for the performance improvement ( Figure 8) and energy advantage of Tuned-CFP over S-CFP. Table III shows average EPI across all simulated benchmarks of various functional blocks as well as the total non-CFP baseline, S-CFP, and Tuned-CFP cores. All numbers in the table are shown relative to the total EPI of the non-CFP baseline core. Tuned-CFP shows ∼8% less EPI compared to S-CFP due to reduction in replayed instructions execution, rollback, and wrong path instructions. Finally, notice that Tuned-CFP and the non-CFP baseline consume about the same EPI, with Tuned-CFP measuring on average just about 2% additional EPI over the non-CFP ROB baseline core.
Analysis of Energy Characteristics of Non-CFP Superscalar Cores
This section presents an analysis of the energy efficiency of a conventional 4-wide superscalar core that does not implement CFP, as the core instruction buffers and L1 cache sizes increase. We use an intuitive definition of energy efficiency. First, we define the return on energy (ROE) from an added hardware feature to be the percent increase in performance divided by the percent increase in EPI resulting from the added feature. Using this definition, we say that core A is more energy efficient than core B if the ROE of core A relative to core B is larger than 1. In other words, core A is more efficient than core B if A improves performance by a larger percentage than the percentage of additional energy it consumes relative to core B.
5.3.1. Energy Characteristics of 4-Wide Non-CFP Superscalar Core with Ideal Data Cache. The sizes of instruction buffers and the L1 data cache significantly impact the performance and EPI of superscalar cores. In this section, we investigate the performance and ROE of non-CFP superscalar cores as the instruction buffer sizes increase. We simulate an ideal data cache to isolate the contribution coming from instruction buffers alone. We vary the sizes of the instruction buffers, namely ROB, RS, LQ, and SQ, from a 32_16_16_12 configuration to a 192_96_96_72 configuration (ROB_RS_LQ_SQ). All other parameters, like machine width, pipeline length, branch predictors, and instruction cache size, remain unchanged and are shown in Table I.
Figure 11 shows the percent speedup of the non-CFP superscalar core with ideal L1 data cache for various buffer sizes. All speedups are reported as percentage increase in IPC relative to the minimum 32_16_16_12 configuration. Only a representative set of benchmarks is shown in the figure for clarity. It is clear from Figure 11 that for all benchmarks, speedup initially increases linearly and then saturates as the buffer sizes increase.
Figure 12 shows a plot of the ROE for various configurations relative to the minimum 32_16_16_12 machine and across all benchmarks. We identify two configurations of interest in Figure 12 . The maximum ROE occurs at the 48_24_24_18 configuration for all benchmarks. This peak return on energy configuration is the most energy-efficient configuration by our definition. A second configuration of interest is the configuration for which most benchmarks have a value of ROE around 1. From Figure 12 , this is configuration 128_64_64_48. We call this configuration the point of diminishing return on energy. If the instruction buffers are increased beyond this point, the resulting speedup will be smaller than the accompanying increase in energy. We next use this configuration to determine an optimal L1 data cache configuration.
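The ROE metric and the two configurations of interest can be computed as in the following sketch. The IPC and EPI values are HYPOTHETICAL placeholders chosen only to exercise the code; they are not measurements from this article. The function names, and the choice to pick the point of diminishing return as the configuration whose ROE is closest to 1, are likewise assumptions made for illustration.

```python
from typing import Dict, Tuple


def percent_change(new: float, base: float) -> float:
    """Percent change of `new` relative to `base`."""
    return 100.0 * (new - base) / base


def roe(ipc: float, epi: float, base_ipc: float, base_epi: float) -> float:
    """Return on energy: percent speedup divided by percent EPI increase."""
    epi_increase = percent_change(epi, base_epi)
    if epi_increase == 0:
        raise ValueError("ROE is undefined when EPI does not change")
    return percent_change(ipc, base_ipc) / epi_increase


# HYPOTHETICAL (IPC, EPI) pairs, normalized to the 32_16_16_12 baseline.
BASELINE: Tuple[float, float] = (1.00, 1.00)
CONFIGS: Dict[str, Tuple[float, float]] = {
    "48_24_24_18": (1.20, 1.08),
    "80_40_40_30": (1.35, 1.25),
    "128_64_64_48": (1.45, 1.45),
    "192_96_96_72": (1.50, 2.00),
}


def roe_table(configs: Dict[str, Tuple[float, float]],
              baseline: Tuple[float, float]) -> Dict[str, float]:
    """ROE of every configuration relative to the baseline configuration."""
    return {name: roe(ipc, epi, *baseline) for name, (ipc, epi) in configs.items()}


def peak_roe_config(table: Dict[str, float]) -> str:
    """Configuration with the maximum ROE (most energy efficient)."""
    return max(table, key=table.get)


def diminishing_return_config(table: Dict[str, float]) -> str:
    """Configuration whose ROE is closest to 1 (assumed selection rule)."""
    return min(table, key=lambda name: abs(table[name] - 1.0))
```

With these placeholder inputs, `peak_roe_config(roe_table(CONFIGS, BASELINE))` selects the 48_24_24_18 entry and `diminishing_return_config` selects the 128_64_64_48 entry, mirroring the shape of the analysis in the text rather than reproducing its data.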
5.3.2. Selecting the L1 Data Cache Size. The L1 data cache configuration is critical toward deciding the core power consumption. The data cache is typically optimized for lowest possible load hit latency with high associativity and aggressive circuit design that uses concurrent read of the tags and data from all ways in the indexed set. To find the optimal L1 cache configuration experimentally, we use the point of diminishing return buffers configuration and vary the L1 data cache size from 4KB to 32KB and associativity from 1 to 8. Our simulations show that the peak ROE occurs at 2-way, 8KB L1 D-cache, whereas the point of diminishing ROE occurs at 4-way 16KB L1 D-cache size. We do not show the L1 data cache ROE plots due to shortage of space.
5.3.3. Varying Instruction Buffer Sizes with Optimal L1 Data Cache. To see how the ROE curves of the non-CFP core behave when simulated with a practical cache configuration, we repeat the experiment shown in Figure 12 with the peak ROE 8KB L1 data cache instead of an ideal cache. Figure 13 shows the results of this experiment: whereas the peak ROE point stays at the 48_24_24_18 configuration, the point of diminishing return moves backward from 128_64_64_48 for an ideal data cache to the 80_40_40_30 configuration. Notice that by the time the 96_48_48_36 configuration is reached, the ROE is well below unity for almost all benchmarks. This indicates that increasing the buffer sizes from the 80_40_40_30 to the 128_64_64_48 configuration benefits more from the increased ILP due to a larger instruction window than from increased tolerance to L1 data cache misses. This observation validates our hypothesis that a better design strategy for energy-efficient cores is to size the instruction buffers appropriately for code that hits the L1 data cache and use CFP to handle data cache misses.
Comparing Tuned-CFP to Non-CFP Core Architectures
The experiments presented point to three machine configurations suitable for different design targets:
(1) A peak return on energy machine with 48_24_24_18 buffers and 2-way 8KB L1 data cache. This is the most energy-efficient configuration, so we call it EFF.
(2) A unity gain ROE machine that represents the point of diminishing return with 80_40_40_30 buffers and 4-way 16KB L1 data cache (DIM).
(3) A large machine that compromises on ROE for high performance (HP). This configuration uses 192_96_96_72 buffers and 4-way 16KB L1 data cache. This is what a designer might choose for best single-thread performance.
Figure 14 shows the speedup of Tuned-CFP over a similar-sized conventional non-CFP core for the EFF, DIM, and HP machine configurations when CFP is applied to L1 data cache misses. Table IV shows various relevant execution statistics of Tuned-CFP. The first observation to note is that Tuned-CFP benefits performance mostly on benchmarks that frequently miss the data cache, as expected. Second, Tuned-CFP, with its latency tolerance to first-level cache misses, outperforms the non-CFP baseline by an average of 3% to 4% on all of the configurations. Even though the average speedup of all benchmarks over the non-CFP baseline is modest, Tuned-CFP shows considerable performance improvement on benchmarks that frequently miss the data cache, for example, gcc and eqke. Earlier CFP work has similarly shown modest performance improvement on applications like SPEC 2006 that do not frequently miss the cache, but significant performance benefit on applications with high cache miss rates, such as server and workstation applications.
Figure 15 shows the total core EPI increase of Tuned-CFP over the non-CFP baseline for all three machine configurations of interest. We observe that benchmarks that miss the cache incur an additional increase in EPI that depends on the number of instructions replayed from the WB. Compare the increase in EPI to the speedup observed for the same benchmarks.
Some benchmarks like gcc, eqke, and hmm miss the cache but benefit from CFP speculative execution while the cache miss data is being fetched. This is supported by the large speedup observed in these benchmarks at the cost of a small increase in power. Some benchmarks like twlf and sjng also miss the cache but encounter frequent miss-dependent branch mispredictions that offset the performance gain while burning power in the process. Finally, benchmark traces like libq and milc that do not show any benefit from CFP do not consume additional power. This is because the CFP structures (i.e., the WB and SRL) observe major switching activity only in CFP execution mode.
We summarize this section by noting that Figures 14 and 15 show that, on many benchmarks, the speedup is greater than the EPI increase, indicating that the Tuned-CFP architecture gives a good performance return on the energy invested. We conclude Section 5 by comparing the ROE, or energy efficiency, of Tuned-CFP and S-CFP to non-CFP superscalar cores for different buffer configurations.
ROE Comparison of Tuned-CFP, S-CFP, and Non-CFP Cores
Previous works have argued for CFP as an energy-efficient, scalable, large instruction window architecture. However, no prior work has quantified the energy efficiency of CFP and compared it with that of a conventional non-CFP superscalar architecture.
In this section, we compare the ROE of Tuned-CFP to that of S-CFP and non-CFP core by sweeping across a range of buffer configurations starting from the best-efficiency EFF configuration to the best-performance HP configuration. The baseline relative to which the ROE is computed is the 32_16_16_12 non-CFP configuration. Figure 16 shows the average ROE across all of the simulated benchmarks for Tuned-CFP, S-CFP, and non-CFP cores. Tuned-CFP shows better ROE than non-CFP and S-CFP machines on all configurations. This clearly demonstrates the importance of VRR and other optimizations featured in the Tuned-CFP architecture.
Notice that the gap between Tuned-CFP ROE and non-CFP ROE is widest for maximum energy efficiency configuration (EFF) and tends to reduce with increasing buffer sizes as the non-CFP core suffers fewer stalls. With small buffer configurations like 48_24_24_18 and 64_32_32_24, the non-CFP core does not have enough resources to hide even short-latency misses that hit the on-chip L2 cache. Hence, the majority of cache misses end up stalling the non-CFP core. In this case, Tuned-CFP benefits more over the non-CFP core on many benchmarks, particularly when the speculative execution does not have to be discarded. When the buffer sizes are increased moderately (e.g., 80_40_40_30 configuration), Tuned-CFP still maintains a healthy ROE gap between itself and the non-CFP core.
On the other hand, when we move over to a large machine like 192_96_96_72, the gap between non-CFP and Tuned-CFP reduces because there are enough instruction buffers to handle all short-latency misses that hit the on-chip cache. The slight improvement in Tuned-CFP ROE comes from the inevitable misses that go all the way to DRAM. In this case, even a large machine with the 192_96_96_72 configuration is not able to keep the pipeline units busy for the entire time while the miss remains outstanding. It is important to notice the very poor ROE of a large machine for both non-CFP and Tuned-CFP, which is at a disappointing value of 0.4. This shows that increasing the instruction buffer sizes to this extent hurts EPI considerably, supporting the widely established opinion that traditional methods of increasing buffer sizes to get performance are already on a downward ROE slide.
Figure 16 also shows that the ROE of S-CFP is worse than that of both Tuned-CFP and non-CFP cores. The excessive replay and rollbacks in S-CFP not only bring down its overall performance but also dissipate excessive EPI in the process.
Finally, Table V shows the percent improvement in ROE of Tuned-CFP over the non-CFP core. Tuned-CFP shows up to 11% improvement in ROE for small-sized cores. The ROE gap starts decreasing for medium-sized cores and reaches a minimum for the largest buffer configuration. It is interesting to note that even on the large machine configuration, Tuned-CFP still manages to improve the ROE by 4.8 percentage points.
Tuned-CFP Performance with Data Prefetch
A core architect might ask whether CFP would still be useful to performance on cores that implement data prefetching hardware, and whether the performance of CFP justifies its hardware and increased energy cost.
The answer to the first question is not complicated. CFP should indeed benefit cores with data prefetch hardware. This is because CFP benefits any cache misses whenever they occur, whereas data prefetch, being a predictive mechanism, benefits only the cache misses that are predictable. Any cache misses that are not anticipated by the data prefetch hardware will be handled more effectively by a CFP core.
The question of whether the hardware and energy overhead due to CFP can be justified on cores with data prefetch hardware is more complex to answer and requires empirical evaluation. In this work, we have evaluated CFP with an aggressive stream-based data prefetcher with 16 stream buffers. Our empirical results show negligible speedup (<0.01%) from data prefetch on our benchmark suite, mainly because the memory access patterns of our benchmark traces do not exhibit frequent streams that can be exploited with the stream buffer hardware. We can therefore conclude from our results that CFP has clear performance and energy efficiency benefits, at least on our benchmark traces, with or without data prefetch stream buffer hardware. Previous work that was done on an industrial core architecture model with data prefetch hardware using an extensive set of application traces, including SPEC CPU benchmarks and other server, workstation, and productivity traces, reported significant performance benefit from CFP beyond the data prefetch hardware. Our current study reconfirms the previous results on a smaller set of benchmarks.
RELATED WORK
Proposals for latency-tolerant out-of-order cores include the Waiting Instruction Buffer (WIB) [Lebeck et al. 2002], Virtual ROB [Cristal et al. 2002], Cherry [Martinez et al. 2002], Checkpoint Processing and Recovery [Akkary et al. 2003], CFPs, and Out-of-Order Commit Processors [Cristal et al. 2004a]. None of these, however, deal with L1 data cache misses or execute miss-dependent and miss-independent instructions simultaneously.
In order to support a large instruction window, WIB [Lebeck et al. 2002] physically buffers the entire window with a multilevel register file and large instruction buffers while releasing only the issue queue entries belonging to long-latency instructions. In Tuned-CFP, the entire instruction window is virtual, with only the miss-dependents, which are far fewer in number, occupying physical locations. In WIB, each instruction dependent on a load miss sets a bit vector dedicated for each outstanding miss, which allows miss-dependents to be reissued post wakeup, without needing the complex broadcast logic of the issue queue. Their WB is organized as a multibanked structure that allows miss-dependents to be reissued in any order as and when the wakeup arrives. Physically buffering the entire instruction window also allows instructions dependent on multiple load misses or miss-dependent misses to be moved in and out of the issue queue multiple times, although at the cost of excessive re-execution energy. In comparison, Tuned-CFP uses an energy-efficient single-ported WB that reissues only when the load miss at the head wakes up. This idea is simple and energy efficient, yet it does not compromise on performance, because miss-independents continue to be processed by the core while the load at the WB head is waiting for the wakeup to arrive.
Runahead execution increases memory-level parallelism on in-order cores [Dundas and Mudge 1997] and on out-of-order cores [Mutlu et al. 2003] without having to build large ROBs. In runahead execution, the processor state is checkpointed at a load miss to DRAM. Execution continues speculatively past the miss for data prefetch benefits. When the miss data return, runahead execution terminates, the execution pipeline is flushed, and execution rolls back to the checkpoint. Except for the prefetch benefit, all work performed during runahead is discarded. In Mutlu et al. [2005a], the runahead overhead is reduced with optimizations targeting short, overlapping, and useless runahead periods; yet, the runahead execution results are still discarded. In comparison to these proposals, Tuned-CFP executes ahead of L1 data cache misses and does not waste energy by discarding a large number of instructions. In other works, Mutlu et al. [2005b] and Wolff and Barnes [2011] evaluated the benefit of reusing the execution results from runahead execution.
Flea-flicker [Barnes et al. 2003; Barnes et al. 2005] executes a program on two in-order back-end pipelines coupled by a queue. An advance pipeline executes independent instructions without stalling on long-latency cache misses while deferring dependent instructions. A backup pipeline executes the instructions deferred in the advance pipeline and merges them with results stored in the queue from the advance pipeline. Flea-flicker and Tuned-CFP differ in their execution and result integration methods and their instruction deferral queues. Flea-flicker executes instructions in an in-order pipeline, saves advanced instructions and results in its queue, and merges results sequentially during backup pipeline execution.
CFP on in-order cores was first proposed in Nekkalapu et al. [2008] . This approach is suitable for highly energy constrained computing devices but less suitable for the performance needs of conventional single-thread applications targeted by the Tuned-CFP multicore architecture.
iCFP [Hilton et al. 2009] tolerates cache misses at all levels in the cache hierarchy but uses an in-order pipeline, which is less suitable for the performance needs of conventional single-thread applications. BOLT [Hilton et al. 2010] utilizes additional map tables in an SMT architecture to re-rename the L2 miss-dependent slice, combined with a program order slice and a unified physical register file that supports aggressive register reclamation. BOLT reuses SMT hardware to improve energy efficiency but does not extend the use of SMT to simultaneous execution of the dependent and independent slices. Nor does BOLT use VRR to improve energy efficiency.
Sun Microsystems Rock is a single-die multicore processor for high-throughput computing. Rock uses Simultaneous Speculative Threading [Chaudhry et al. 2009] to defer dependent instructions into a buffer and executes the deferred instructions from the checkpoint after the miss data returns. The deferred instructions execution uses a simultaneous hardware thread and merges the results into the scout thread future file. Rock uses an in-order pipeline, whereas Tuned-CFP core is out of order and thus provides better performance than Rock on single-thread applications.
Gonzalez et al. [1998] proposed using virtual registers to shorten the lifetime of physical registers. Kilo instruction processors [Cristal et al. 2004b] also used virtual renaming and ephemeral registers to do late allocation of physical registers. In contrast to virtual physical registers and ephemeral registers, VRR [Sharafeddine et al. 2013] and Tuned-CFP Akkary 2013a, 2013b] do not require physical registers for any allocation of execution results, and they accomplish renaming with virtual IDs and RS.
CONCLUSIONS
This article presents a tuned CFP architecture that uses VRR and optimized replay policies to improve performance and reduce replay loop circuit activity and checkpoint rollback execution compared to previous CFP designs. It achieves this by reducing the replay latency associated with the processing of miss-dependent instructions after the miss data arrives in the L1 data cache and wakes up the miss load. The architecture achieves this reduction in the execution latency of miss-dependent instructions by (1) keeping these instructions as long as possible in the RS and moving them to the WB only when instruction buffer resources are needed and (2) using VRR, which allows partial replay of the miss-dependent chain of instructions, requiring only the subset of these instructions that have already moved to the WB to be replayed. Relative to previous CFP architectures, our Tuned-CFP architecture improves performance by ∼10% and reduces EPI by ∼8%.
