A Case for Resource-conscious Out-of-order Processors
Adrián Cristal, José F. Martínez†, Josep Llosa, and Mateo Valero
Abstract: Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is achieved in part through proper sizing of critical resources, such as register files or instruction queues. In light of the gap between processor speed and memory latency, tolerating upcoming latencies in this way would require impractical sizes of such critical resources.
To tackle this scalability problem, we make a case for resource-conscious out-of-order processors. We present quantitative evidence that critical resources are increasingly underutilized in current processors. We advocate that better use of such existing resources should be a priority in future research in processor architectures.
Index Terms: Out-of-order processor, memory latency, instruction-level parallelism, resource utilization.
I. INTRODUCTION
The gap between processor speed and memory latency is continuously widening. In order to tolerate long-latency memory operations, modern out-of-order processors maintain a large number of in-flight instructions that effectively hide such latencies. At current trends, however, processors cannot keep up with this growing disparity, and as a result, long-latency operations are increasingly taxing on performance.
Figure 1 shows average IPCs attained by SpecFP and SpecInt applications in simulations using variations of the processor and memory models summarized in Table I. The three leftmost bars show the effect of increasing memory latency on a processor with 64 entries in the reorder buffer (ROB), instruction queues, and register files. As memory latency increases, the limited processor resources are unable to keep enough instructions in flight and, as a result, the IPC drops dramatically.
By increasing the size of such resources, the processor can conceivably sustain enough in-flight instructions to tolerate long-latency operations. In Figure 1, the remaining bars show the effect that such an idealized size increase has on IPC when memory is set to a futuristic latency of 500 processor cycles. Limited ROB, Regs, and Queues represent configurations with a limited number of entries (as indicated on the X axis) in the ROB, register files, or instruction queues, respectively, and an unlimited size of the other resources in each case.
In the case of SpecFP applications, performance suffers if any one of these three resources is too small. However, as the size of the limited resource increases, so does the IPC. Only when all three resources are scaled up adequately can we maintain enough in-flight instructions to bring the IPC closer to that of a memory-insensitive configuration (represented by Perfect L2 in each group).
In the case of integer applications, larger resources are still insufficient to overcome long-latency operations. This is mainly because many in-flight instructions are not profitable, as relatively frequent branch mispredictions squash a large number of them. We notice, however, that once hard-to-predict branches are overcome (PerfectBr), increasingly higher IPCs are obtained as resources scale up in size.
In general, however, it is impractical to design processors with thousands of entries in resources such as the ROB, the register file, and the various queues, since this could adversely affect the clock cycle, the pipeline depth, or both [8].
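A first-order estimate illustrates why tolerating such latencies points toward thousands of entries. The sketch below applies Little's law (instructions in flight ≈ throughput × latency); it is a back-of-the-envelope illustration, not the simulation methodology used for Figure 1. The function name and the target IPC of 4 are assumptions chosen for illustration and are not parameters taken from Table I; the 500-cycle latency is the futuristic latency discussed above.

```python
import math


def window_needed(target_ipc: float, memory_latency_cycles: int) -> int:
    """First-order estimate (Little's law) of the in-flight instructions
    needed to keep retiring at target_ipc while one memory access is
    outstanding for memory_latency_cycles.

    Hypothetical helper: it ignores branch mispredictions, dependences,
    and overlapping misses, so it is only a rough guide.
    """
    return math.ceil(target_ipc * memory_latency_cycles)


# Assumed target IPC of 4 (illustrative only) and a 500-cycle memory latency.
entries = window_needed(4, 500)
```

Under these assumptions the estimate comes to roughly two thousand in-flight instructions, the same order of magnitude as the idealized 2,048-entry ROB used in the analysis that follows.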
In this paper, we analyze resource utilization in modern out-of-order processors, and quantitatively show that, behind the apparent need to grow critical resources to support many in-flight instructions, a significant fraction of such resources is being wasted by blocked or executed instructions. We identify the different causes of resource wasting, and motivate research on resource-conscious processors to overcome upcoming memory latencies. We provide some examples of existing work in this direction.
II. QUANTITATIVE ANALYSIS OF RESOURCE UTILIZATION

A. Distribution of in-flight instructions
Figure 2 shows the average cumulative distribution of in-flight instructions for SpecFP and SpecInt applications, obtained through simulations using the parameters of Table I. An idealistic 2,048-entry ROB and enough entries in all other resources are assumed. Floating-point applications have, on average, more than 1,500 instructions in flight 75% of the time, while integer applications exhibit a relatively moderate number of instructions in flight (fewer than 500 for 50% of the time).
Since each in-flight instruction is assigned an entry in the ROB until it is retired, the ROB should be rather large in order to support high memory latencies. In order to reduce the number of entries in the ROB and still allow a large number of in-flight instructions, the use of checkpointing and early release of uncommitted instructions has been proposed recently [2].
We now discuss the allocation of various other critical resources, namely the register file, instruction queues, and load and store queues. For each resource, we plot the average number of allocated entries against the number of in-flight instructions. The X axis is adjusted according to the distribution function shown in Figure 2. For illustrative purposes we comment, in each case, on the amount of wasted resources in SpecFP applications, for a number of in-flight instructions equal to the median (around 1,600).
B. Register files
Figure 3 shows the average number of allocated registers against the distribution of in-flight instructions. Registers are classified into four categories as follows:
Live registers contain values currently in use. Notice that this class constitutes only ten to twenty percent of the total number of allocated registers.
Blocked-Short registers have been allocated during rename, but their producer instructions are blocked in the instruction queue waiting for the execution of predecessor instructions that will issue shortly. This class is short-lived by definition, and represents a relatively small fraction of the allocated registers; therefore, targeting this type of register would have limited impact.
On the other hand, Blocked-Long registers are still empty because the producer instruction is blocked waiting for the execution of some long-latency predecessor instruction (e.g., a load miss). Techniques for late physical register allocation [4] may improve utilization of registers currently in this category.
Finally, Dead registers are no longer in use, but they are still allocated because the producer instructions are waiting to retire. Techniques for early register recycling [2] [6] [7] can make these registers available to other instructions.
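The four register categories above can be expressed as a small classification routine. The following is a minimal sketch under stated assumptions, not the instrumentation used to produce Figure 3: the `AllocatedReg` fields, the `classify_register` and `wasted_fraction` names, and the idea that a simulator exposes a `still_needed` flag are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class RegClass(Enum):
    LIVE = "Live"
    BLOCKED_SHORT = "Blocked-Short"
    BLOCKED_LONG = "Blocked-Long"
    DEAD = "Dead"


@dataclass
class AllocatedReg:
    # All three fields are hypothetical simulator state.
    value_written: bool                   # producer has executed
    producer_waits_on_long_latency: bool  # e.g. producer depends on a load miss
    still_needed: bool                    # value may still be read


def classify_register(reg: AllocatedReg) -> RegClass:
    """Map one allocated physical register to a category from the text."""
    if not reg.value_written:
        # Register is still empty: the producer has not executed yet.
        if reg.producer_waits_on_long_latency:
            return RegClass.BLOCKED_LONG
        return RegClass.BLOCKED_SHORT
    # Value is present: either still in use, or only held until retirement.
    return RegClass.LIVE if reg.still_needed else RegClass.DEAD


def wasted_fraction(regs: list[AllocatedReg]) -> float:
    """Fraction of allocated registers that are Blocked-Long or Dead."""
    counts = Counter(classify_register(r) for r in regs)
    wasted = counts[RegClass.BLOCKED_LONG] + counts[RegClass.DEAD]
    return wasted / len(regs) if regs else 0.0
```

Sampling such a classification every cycle and averaging per in-flight-instruction count would give a breakdown in the style of Figure 3, assuming the simulator can supply the three flags.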
In all, Blocked-Long and Dead registers constitute the largest fraction of allocated registers, and should be the target of register management techniques. In Figure 3, results for SpecFP applications show that, at about 1,600 in-flight instructions, as many as 80 percent of the approximately 1,000 allocated floating-point registers fall in one of these two categories.
C. Instruction queues
Figure 4 shows the average number of allocated entries in the floating-point queue (for SpecFP applications) and in the integer queue (for SpecInt applications). Unless explicitly noted, our comments are applicable to both.
Ready entries correspond to instructions that have all their operands ready. Barring structural dependencies, instructions are generally issued as soon as their operands become available; thus, few entries fall under this category at any given time.
Blocked-Short entries pertain to instructions that are awaiting results from short-latency operations. This group represents a relatively small fraction of the allocated entries.
Blocked-Long entries, however, correspond to instructions that are blocked waiting for some long-latency instruction to complete. This group represents by far the largest fraction of entries allocated in the instruction queue: Figure 4 shows that, in SpecFP applications, at about 1,600 in-flight instructions, only 15 percent of the approximately 400 floating-point instruction queue entries are not in this category. Multilevel queues can be used to track this type of instruction, delegating their handling to slower, but larger and less complex structures [1] [5].
D. Load queue
Figure 5 shows the average number of allocated load queue entries. They are broken down into the following categories:
Live entries correspond to loads that are being executed. This type abounds in floating-point applications, in which cache miss rates are higher than in their integer counterparts.
Replayable entries represent loads that have executed out of program order with respect to some store whose address remains unresolved. These loads and their dependent operations must be replayed if the store is later found to overlap the load in memory. This is determined by comparing the addresses and data ranges recorded in the appropriate store and load queue entries.
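The address and data-range comparison that decides whether a Replayable load must be replayed can be sketched as a byte-range overlap test. This is an illustrative sketch only: the `MemAccess` record, its `seq` field for program order, and the function names are assumptions, and a real load queue would perform this check with associative hardware rather than a list scan.

```python
from dataclasses import dataclass


@dataclass
class MemAccess:
    seq: int   # position in program order, smaller is older (hypothetical field)
    addr: int  # starting byte address
    size: int  # number of bytes accessed


def overlaps(a: MemAccess, b: MemAccess) -> bool:
    """True if the byte ranges [addr, addr + size) of a and b intersect."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size


def loads_to_replay(resolved_store: MemAccess,
                    replayable_loads: list[MemAccess]) -> list[MemAccess]:
    """Younger loads that already executed and overlap a store whose
    address has just been resolved; these (and their dependents) replay."""
    return [ld for ld in replayable_loads
            if ld.seq > resolved_store.seq and overlaps(ld, resolved_store)]


# Usage: an 8-byte store at 0x100 resolves after three loads have executed.
store = MemAccess(seq=10, addr=0x100, size=8)
loads = [
    MemAccess(seq=12, addr=0x104, size=4),  # younger, overlaps the store
    MemAccess(seq=13, addr=0x108, size=4),  # younger, starts just past the store
    MemAccess(seq=5, addr=0x100, size=4),   # older than the store, unaffected
]
victims = loads_to_replay(store, loads)
```

In this usage example only the load at 0x104 is selected, since it is the only one that is both younger than the store and touches bytes the store writes.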
Blocked-Short entries pertain to loads waiting for their addresses to be produced by short-latency operations.
Blocked-Long entries belong to load instructions whose address depends on a long-latency operation. In the case of integer applications, this represents a significant fraction of the allocated load queue entries. This is because pointer chains are quite common in this type of application.
Finally, Dead entries correspond to load instructions that have been executed and are not subject to replay traps, i.e., the addresses of all previous stores in program order have been resolved. For SpecFP applications, Figure 5 shows that, at about 1,600 in-flight instructions, as many as 75 percent of the approximately 400 allocated load queue entries are Dead. Traditional out-of-order processors keep these entries until their load retires, but aggressive implementations to recycle them have been proposed [6].
E. Store queue
Figure 6 shows the average number of allocated store queue entries. They are classified as follows:
Ready entries represent store instructions whose address and source operand are available, and that are only waiting to reach the ROB head to execute. Under the right conditions, these entries could be recycled before their store executes [6]. In general, however, exception handling and other issues still mandate in-order execution of stores.
Address Ready entries correspond to stores whose address is ready, but that are still waiting for their data. These represent a significant part of all in-flight stores. In general, these entries could also be recycled if disambiguation with earlier memory operations is no longer necessary [6].
Blocked-Long entries relate to store instructions whose address depends on a long-latency operation. Notice that this category is nearly nonexistent in floating-point applications, since it is mostly data and not addresses that depend on long-latency operations.
Finally, Blocked-Short entries correspond to stores waiting for their addresses to be produced by short-latency operations. At about 1,600 in-flight instructions, Figure 6 shows that, in SpecFP applications, only 25 percent of approximately 200 allocated store queue entries fall under this category.
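The store queue breakdown differs from the register one in that it hinges on whether the address, the data, or both are available. A minimal sketch of that decision follows; the flag names and the `classify_store` function are hypothetical and stand in for whatever state a simulator actually tracks.

```python
from enum import Enum


class StoreClass(Enum):
    READY = "Ready"
    ADDRESS_READY = "Address Ready"
    BLOCKED_LONG = "Blocked-Long"
    BLOCKED_SHORT = "Blocked-Short"


def classify_store(addr_ready: bool, data_ready: bool,
                   addr_waits_on_long_latency: bool) -> StoreClass:
    """Classify one store queue entry from three hypothetical flags."""
    if addr_ready:
        # Address known: the entry differs only in whether data has arrived.
        return StoreClass.READY if data_ready else StoreClass.ADDRESS_READY
    # Address unknown: split by the latency of the operation producing it.
    if addr_waits_on_long_latency:
        return StoreClass.BLOCKED_LONG
    return StoreClass.BLOCKED_SHORT
```

Note that in this sketch a store whose data arrives before its address is still counted as blocked, which matches the address-centric definitions of the two Blocked categories given above.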
III. CONCLUSION
In this paper we have provided quantitative evidence that, in the pursuit of large numbers of in-flight instructions to tolerate long memory latencies in current out-of-order processors, resources identified as critical to attaining high performance are in fact increasingly underutilized. Specifically, we have looked at register files, instruction queues, and load and store queues. We have broken down allocated resources into several categories according to how they are used, and found that a significant fraction of allocated resources actually does not contribute to execution.
Given the impact that the size of such critical resources has on the clock cycle time, we conclude from our study that a research focus is needed on resource-conscious processor architectures, in which smaller, faster resources are managed more efficiently.
