Most processors today incorporate Tomasulo's algorithm [1] to improve CPI by exploiting parallelism and out-of-order execution in the execution stage of the datapath. This algorithm generally requires that instructions be buffered in reservation stations until they are ready to execute. The manner in which these reservation stations are allocated can greatly affect performance. Since applications place different demands on reservation station allocation, pooling or reallocating these resources could improve performance by satisfying the varying demands of different applications. This paper presents an initial look at one technique for using "pooling" of resources to improve CPI. It also presents motivation for a more aggressive pooling technique.
Introduction
Tomasulo's algorithm [1] has become an integral part of today's RISC processing cores, since it provides a relatively simple way to resolve data dependencies while minimizing the number of stalls in the datapath. Through the implementation of reservation stations and register renaming, the processor's viewing window is increased to exploit more parallelism in the execution units. While this approach to execution can be seen in most of today's processors (PowerPC, Alpha, and Pentium), the implementations differ greatly in many internal aspects, such as:
• Number of Reservation Stations
• Number and Types of Execution Units
• Back Fill During Issuing (Always Attempt to Issue Maximum Instructions)
• Period of Execution when Commission Occurs
• Arrangement or Grouping of Reservation Stations
The differing opinions are not surprising when you consider that each of the aspects above represents a design tradeoff. Therefore, the optimal solution always depends upon the nature of the code that will be executed on the machine. Obviously, it would be far too costly to create specialized machines for every existing task.
However, if we can achieve a complete understanding of the tradeoffs that hardware offers, it should be possible to create hardware that executes most software at faster speeds. In this paper, I examine a few of the tradeoffs mentioned above, with the hope that I can further unravel the question of the optimal arrangement behind this type of execution.
Let Us Examine

Both Units Can Execute
The opposing consideration is that organizing reservation stations into groups introduces hardware cost and complexity that are completely unnecessary.
Question 2:
Is there any motivation for arranging all of the reservation stations into one large pool?
As far as hardware vs. performance tradeoffs are concerned, this question explores an idea that lies heavily on the side of hardware complexity. While I lacked the foresight to design the simulator to accommodate this simulation, I found compelling evidence that should at least stimulate interest in this type of implementation. Here we consider the possibility of organizing reservation stations into a single group, such that all execution units share reservation stations. It is easy to see that there are definite problems that will have to be worked around, such as how to handle floating point numbers. While I offer no definite solution, these are the areas that make this issue interesting. If the opportunity to continue this project ever arises, this is an area I would like to consider further.
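The difference between dedicated stations and a single shared group can be illustrated with a minimal sketch. It is not taken from the simulator; the class names, the instruction type labels (`"int"`, `"fp_add"`), and the station counts are hypothetical, and the model only counts how many instructions can be accepted before issue would stall.

```python
class DedicatedStations:
    """Each instruction type owns a fixed number of reservation stations."""

    def __init__(self, per_type):
        self.free = dict(per_type)  # hypothetical counts, e.g. {"int": 3, "fp_add": 3}

    def try_allocate(self, op_type):
        if self.free.get(op_type, 0) == 0:
            return False  # this type's stations are full: issue must stall
        self.free[op_type] -= 1
        return True


class SharedPool:
    """All execution units draw from one group of reservation stations."""

    def __init__(self, total):
        self.free = total

    def try_allocate(self, op_type):
        if self.free == 0:
            return False  # the whole pool is full
        self.free -= 1
        return True


def accepted_before_stall(stations, ops):
    """Count the instructions that obtain a station before the first stall."""
    count = 0
    for op in ops:
        if not stations.try_allocate(op):
            break
        count += 1
    return count


# A floating point heavy burst (hypothetical), same total of six stations.
burst = ["fp_add"] * 4 + ["int"] * 2
print(accepted_before_stall(DedicatedStations({"int": 3, "fp_add": 3}), burst))  # 3
print(accepted_before_stall(SharedPool(6), burst))                               # 6
```

With the same total number of stations, the shared group absorbs the burst that stalls the dedicated arrangement. The sketch ignores the problems noted above, such as operand width for floating point values.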
Infrastructure and Simulation Environment
Description
In order to get accurate comparisons of the tradeoffs mentioned above, I have written a versatile simulator that allows the simulation of many aspects of Tomasulo's algorithm. I will not offer a detailed explanation of the simulator in this paper, since only the assumptions are relevant to the underlying architecture. If you are interested in the simulator, you can find it and a graphical user interface version at http://www.ece.gatech.edu/users/wnorton/research/.
Assumptions
To simplify the design of the simulator, the following assumptions were made about the underlying architecture:
• When issuing, the simulator always attempts to issue the maximum possible number (superscalar factor) of instructions on each cycle. This means that the issue stage back fills the instructions that did not issue, so that new instructions can also attempt to issue on the next cycle.
• An instruction occupies a reservation station until it has been committed (i.e. register has been written).
• All branches are predicted correctly, so the datapath never has to flush out instructions and reorder. Thus, the reorder buffer has not been implemented in the simulator.
• Data memory accesses will hit in the cache 95% of the time and instruction memory accesses will hit in the cache 100% of the time.
• Execution Times are: 
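The issue assumption in the list above (attempt to issue up to the superscalar factor each cycle, then back fill) can be sketched as follows. This is a hypothetical illustration rather than the simulator's code: the function name, the `can_issue` predicate, and the in-order stop at the first blocked instruction are all assumptions.

```python
from collections import deque


def issue_cycle(window, program, width, can_issue):
    """One issue cycle: issue in order from the window, then back fill it.

    window    -- deque of instructions waiting to issue (at most `width`)
    program   -- deque of instructions not yet fetched
    can_issue -- hypothetical predicate, True if a reservation station is free
    """
    issued = []
    # In-order issue (an assumption of this sketch): stop at the first
    # instruction that cannot obtain a reservation station.
    while window and len(issued) < width and can_issue(window[0]):
        issued.append(window.popleft())
    # Back fill: instructions that did not issue keep their place at the
    # front, and new instructions join behind them for the next cycle.
    while len(window) < width and program:
        window.append(program.popleft())
    return issued


program = deque(["add", "fadd", "sub", "fmul"])
window = deque()
issue_cycle(window, program, 2, lambda op: True)  # first call only fills the window
first = issue_cycle(window, program, 2, lambda op: op != "fadd")
print(first, list(window))  # ['add'] ['fadd', 'sub']
```

In the example, `fadd` is blocked, so it stays at the head of the window while `sub` is brought in behind it to attempt issue on the next cycle.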
Results
Question 1
In order to analyze the effects of "pooled" reservation stations, simulations were run with the architectures shown below.
[Figure: the two simulated architectures, Pooled Reservation Stations (Int, Int) and Non-Pooled Reservation Stations]
On each iteration, the number of reservation stations was increased by one. another floating point add instruction, the issue stage is locked until a floating point add reservation station is freed. Therefore, we could argue that more floating point add reservation stations are required for that particular algorithm. The graphs below identify the execution type that is responsible for stalling the issuing of instructions each time such a stall occurs. By examining these graphs, we should be able to identify the execution units that require more reservation stations. The important thing we will be examining is the degree to which this varies from code to code.
This illustrates the need for a dynamic solution to reservation station allocation, such as a single, general reservation station pool.
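The stall attribution described in this section can be sketched with a small counting model. It is a simplified illustration and not the simulator itself: the function name, the type labels, the station counts, and the latencies are hypothetical, data dependencies are ignored, and every type that appears in the stream is assumed to have at least one station.

```python
from collections import Counter


def stall_breakdown(stream, stations, latency, width=1):
    """Count, per instruction type, the cycles in which issue stalled
    because all reservation stations of that type were occupied."""
    busy = {t: [] for t in stations}  # release cycles of occupied stations
    stalls = Counter()
    cycle, i = 0, 0
    while i < len(stream):
        # Free the stations whose instruction has committed by this cycle.
        for t in busy:
            busy[t] = [r for r in busy[t] if r > cycle]
        issued = 0
        while i < len(stream) and issued < width:
            op = stream[i]
            if len(busy[op]) >= stations[op]:
                stalls[op] += 1  # blame the type that blocked issue
                break
            busy[op].append(cycle + latency[op])  # held until commit
            i += 1
            issued += 1
        cycle += 1
    return stalls, cycle


# Hypothetical example: four floating point adds with a single station.
stalls, cycles = stall_breakdown(
    ["fadd"] * 4, {"fadd": 1, "int": 1}, {"fadd": 3, "int": 1}
)
print(dict(stalls), cycles)  # {'fadd': 6} 10
```

Running the same function on instruction streams with different mixes gives a per-type breakdown of the kind the graphs report, which is the quantity of interest when comparing codes.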
Analysis and Conclusions
Considering the data presented above, certain important conclusions can be made about the two reservation station pooling issues.
Question 1: When we consider pooling reservation stations for each type of instruction, the results aren't as exciting as one would hope. In theory, the pooled reservation station approach should always be equal to or faster than the non-pooled approach. The CPI data obtained from the simulations also reflects this; however, CPI is about the same unless the number of reservation stations is low (less than three). This means that when more reservation stations are used, even in the non-pooled configuration, the execution units are able to receive an instruction not constrained by dependencies. The conclusion is that perhaps we should just avoid the hardware complexity, and allocate dedicated reservation stations to each execution unit. The one useful application of the data, however, is that using the pooled reservation stations we could achieve approximately the same CPI as three reservation stations in the non-pooled configuration. Unfortunately, we must not forget the added complexity to manage contention between the two execution units attempting to access the pool of reservation stations.
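The mechanism behind the pooled advantage can be shown with a toy model of two identical execution units. This is a hypothetical sketch and not the simulator used for the results: it assumes one instruction issues per cycle, ignores data dependencies, holds a station until its instruction finishes executing, and, in the non-pooled case, issues each instruction to the unit with the fewest stations in use.

```python
def cycles_to_finish(latencies, stations_per_unit, pooled, n_units=2):
    """Cycles needed to run `latencies` on identical units (toy model)."""
    queues = [[] for _ in range(1 if pooled else n_units)]  # waiting instructions
    cap = stations_per_unit * n_units if pooled else stations_per_unit
    remaining = [0] * n_units   # cycles left on each unit (0 means idle)
    owner = [None] * n_units    # queue whose station the running instruction holds
    held = [0] * len(queues)    # stations in use per queue
    pending = list(latencies)
    cycle = 0
    while pending or any(remaining) or any(queues):
        # 1. Execute: advance busy units and free the station on completion.
        for u in range(n_units):
            if remaining[u]:
                remaining[u] -= 1
                if remaining[u] == 0:
                    held[owner[u]] -= 1
        # 2. Dispatch: an idle unit takes the oldest instruction it may run.
        for u in range(n_units):
            q = 0 if pooled else u
            if remaining[u] == 0 and queues[q]:
                remaining[u] = queues[q].pop(0)
                owner[u] = q
        # 3. Issue: one instruction per cycle, to the least occupied queue.
        if pending:
            q = min(range(len(queues)), key=lambda k: held[k])
            if held[q] < cap:
                queues[q].append(pending.pop(0))
                held[q] += 1
        cycle += 1
    return cycle


work = [5, 1, 1, 1]  # hypothetical latencies: one long operation, three short
for pooled in (False, True):
    cycles = cycles_to_finish(work, stations_per_unit=2, pooled=pooled)
    print("pooled" if pooled else "non-pooled", cycles, cycles / len(work))
# non-pooled 8 2.0
# pooled 7 1.75
```

In the non-pooled run, a short instruction waits in a station behind the long operation while the other unit sits idle; in the pooled run, either unit can take it. The sketch does not model the contention logic between the two units, which is the added complexity noted above.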
Question 2: Simulations here are intended to illustrate a potential direction for future enhancements through another idea which pools resources. Examining the pie charts shown in the Results section of this paper, we notice only the vast differences in requirements between the two test codes. The matrix multiply code is limited by the number of floating point reservation stations, since almost all issuing stalls occur because either the floating point add or the floating point multiply reservation stations are full.
The string length code is extremely different, and quite strange. This algorithm includes a branch at about every third instruction; most of the time the branches must also wait for values from the integer units. This means that the limiting factor is the execution of branches. Obviously the fact that branches are the limiting factor in this case greatly restricts our solution, but the point here is that the two algorithms have completely different demands as far as reservation stations are concerned. Therefore, the only way to achieve an optimally low CPI while running both of these codes would be to allocate reservation stations dynamically. I would like to take a closer look at this potential solution to resource allocation.
