In this paper, we look at the interaction of pipelining and multiple functional units in single processor machines. When implementing a high performance machine, a number of hardware techniques may be used to improve the performance of the final system. Our goal is to gain an understanding of how each of these techniques contributes to performance improvement. As a basis for our studies we use a CRAY-like processor model and the issue rate (instructions per clock cycle) as the performance measure. We then systematically augment this base, non-pipelined machine with more and more hardware features and evaluate the performance impact of each feature. We find, for example, that in non-vector machines, pipelining multiple functional units does not provide significant performance improvements. Dataflow limits are then derived for our benchmark programs to determine the performance potential of each benchmark. In addition, other limits are computed which apply more realistic constraints on a computation. Based on these more realistic limits, we determine it is worthwhile to investigate the performance improvements that can be achieved from issuing multiple instructions each clock cycle. Several hardware approaches are evaluated for issuing multiple instructions each clock cycle.
Introduction
To improve the performance of a single processor, designers look to approaches that permit parallelism, or overlap, in instruction execution. Historically, pipelining has been one of the most popular of these approaches. Another technique, which can be used independently or to complement pipelining, is the use of multiple functional units. In either case, the application of such approaches can lead to substantial improvement in a machine's maximum performance because the number of resources that are simultaneously available to a running program is increased. In this paper, we attempt to gain an understanding of the performance impact obtained from these and other hardware techniques. Starting with a base, non-pipelined machine, hardware features are systematically added and the resulting performance improvement evaluated.
A well-known performance measure used to evaluate high performance machines is the instruction issue rate, the significant difference between it and actual issue rates indicates that an investigation of alternate machine organizations may be worthwhile. Historically, performance improvements have been sought by concentrating on methods that either increase the number of functional units (or their availability through pipelining) or increase the amount of buffer storage (registers). To approach the dataflow limit, however, one must also investigate issuing multiple instructions each clock cycle. One way to achieve multiple instruction issue is to have several tightly-coupled p e s - the multiple instruction issue methods are clearly too difficult to implement, however, they are presented so as to fully explore the possible design space.
In this paper, we first describe our basic machine model. Then, we present performance improvements gained through pipelining for a single issue unit. Next, we calculate dataflow limits for our benchmark programs to determine the potential for improving performance. Finally, we discuss several hardware schemes that can issue multiple instructions per clock cycle in an attempt to improve performance.
The Basic Machine Model
To form a basis for comparing different instruction issue methods, we have selected a base architecture and organization to drive our studies. Each variant of an instruction issue method uses the same set of hardware functional units and memory. The instruction set of this base architecture is bits) and 2-parcel (32 bits) instructions. Unlike the CRAY-1S architecture, our base architecture can issue all instructions in 1 clock cycle if issue conditions are favorable. The hardware functional units in the base architecture have the same performance characteristics as the CRAY-1 functional units; the time taken for a scalar add is 2 clock cycles, etc. The register files are also the same as in the CRAY-1. All operations are measured in clock units and the clock speed is the same irrespective of the hardware organization.
In addition to varying the issue method, we also vary the memory access time and branch execution time since these 2 parameters can have a dramatic impact on performance in some cases. Using the CRAY-1 model, a memory access requires 11 clock cycles from the time the load instruction is issued until the time the destination register is available for use. Our experiments were performed for an 11 cycle memory access time (slow memory) and a 5 cycle memory access time (fast memory). A fast memory results if some form of fast intermediate storage, i.e., some form of cache, is provided. The CRAY-1S has no cache; however, it has 8 64-element vector registers that can be used as a software-controlled cache in some cases. For instance, if a piece of scalar code accesses arrays in a regular fashion (for example, a linear recurrence), elements of an array can be vector loaded from the memory into one of the vector registers. These elements can then be moved one by one into the scalar registers as needed. In such a case, the effective "memory access" time is simply the time taken to move an element from a vector register to a scalar register, i.e., 5 clock cycles.
The branch execution time was also varied since control dependencies are a significant source of instruction blockage, especially in some of the more sophisticated issue methods. In our basic machine model, we have not included any type of guessing or branch prediction to get an early start on the execution of a likely branch target path. Execution of the branch target is not started until the branch outcome is known. Since the base architecture uses a CRAY-like model, each branch that is encountered blocks at the issue stage for a period of 4 clock cycles, even if the contents of the A0 (the register upon which the branch decision is made) are available and, therefore, requires 5 clock cycles to execute. We call the 5 cycle branch a slow branch. The block time associated with a slow branch is due to 2 factors. The first is that a branch is a 2-parcel instruction and requires an extra clock to get the 2nd parcel from the instruction buffer. The other delay associated with a branch is the time it takes to fetch a new target instruction stream from the instruction buffers. These delays are partly artifacts of the CRAY-1S implementation and could possibly be eliminated. To evaluate the impact of eliminating these delays, simulations were performed in which a branch instruction took only 2 clock cycles to execute (it would still block on the availability of the A0 register). This type of branch is termed a fast branch.
For our studies we used a modified CRAY-1 simulator developed at the University of Wisconsin [SI. The benchmark programs were the original 14 Lawrence Livermore Loops [a]. Instruction traces were generated for each of the benchmark programs and then used to drive the simulations.
The programs were divided into the 5 scalar loops (loops 5, 6, 11, 13 and 14) and the 9 vectorizable loops (loops 1, 2, 3, 4, 7, 8, 9, 10 and 12). Separate results are presented for the scalar loops and the vector loops. We make this separation because we expect the vectorizable loops to exhibit a reasonably high degree of parallelism while we expect the scalar loops to exhibit a comparatively low degree of parallelism. Finally, all the functional units in the execution unit can be segmented, simultaneously handling several independent requests at a maximum rate of one request per clock cycle. Since this organization corresponds to the organization found in the CRAY machines, we shall call
The results of executing the benchmark programs on the 4 machines described above are shown in Table 1. As seen in the top half of Table 1, scalar code achieves the greatest benefit if we allow the overlap of instructions that use distinct functional units. If the memory is slow, interleaving the memory is helpful. Pipelining the functional units, however, does not have a significant impact on performance. This is seen by comparing the results for the NonSegmented and CRAY-like machines. In all the machine organizations, a relatively larger performance gain is made by interleaving the memory alone than by pipelining the functional units. This result is not counter-intuitive. For scalar code, it is rare that several instructions need to use the same functional unit in a time window that is determined by the latency of the functional unit. An exception is the memory, since a large number of instructions reference the memory. If the latency of the memory is large, the performance can be improved significantly by interleaving the memory alone (about 2345% over a serial memory for scalar code). If the latency of the memory is smaller, the performance improvement is not so significant (about 4 4 % for scalar code). Vectorizable code benefits not only from overlap amongst instructions that use distinct functional units; it also benefits from the interleaving of memory and from the overlap of instructions that use the same functional unit.
For vector operations executed in the vector unit, clearly the functional units should be highly pipelined to allow for maximum overlap in the processing of successive elements of a vector. If the same functional units are to be used both by scalar and vector operations (as in the CRAY machines), pipelining the functional units also makes sense. If, however, there are independent functional units for the scalar unit and for the vector unit, the computer designer may choose to pipeline the vector functional units only, thereby increasing the overlap of operations, whereas the scalar functional units may not be pipelined at all, thereby reducing the skew factors in the clock [9].
Other Issue Schemes with a Single Issue Unit
Given the functional units of a CRAY-like machine, the instruction issue rate can be further improved by making the issue unit more elaborate.
[Table column headings: M11BR5, M11BR2, M5BR5, M5BR2]
To minimize the number of cases (and to get an upper-bound on performance), we restrict further experiments to machines with fully segmented functional units and an interleaved memory system, i.e., a machine with CRAY-like functional units.
Limits on Performance
With a single issue unit one can, at best, achieve an issue rate of 1 instruction per clock cycle. As was seen in the previous section, with simple issue strategies this maximum performance is not easily reached. When multiple issue units are used, the maximum issue rate is related to the number of instructions that can be simultaneously issued. With N issue units, the maximum issue rate is N instructions per clock cycle. However, as the number of issue units is increased, at some point, this number may exceed the width of the dataflow graph associated with a program. In such cases, the resulting issue rate may look modest compared to the maximum allowed by the hardware. In order to determine the effectiveness of multiple instruction issue, we found it necessary to compute an upper-bound on the issue rate. This upper-bound is related to the dataflow limit of the program and is generally unachievable for most machine organizations. The dataflow limit is calculated in two steps: first, a pseudo-dataflow limit is calculated and then a resource limit is calculated. The actual dataflow limit for each program is the greater of these 2 limits when both are expressed as best-case execution times.
The pseudo-dataflow limit is found by taking the number of instructions executed by a program and dividing by the amount of time taken to traverse a dataflow graph of the program. This computation assumes that the program resides in memory in the form of a dataflow graph and that an instruction can execute as soon as its operands are available. There are no resource limits (such as limits on the bus structure, number of functional units, etc.); however, different portions of the dynamic program graph, i.e., different loop iterations, cannot start until the appropriate branch conditions have been resolved. In such an idealized machine, the best-case time taken to execute a program, i.e., the pseudo-dataflow limit, is the length of the critical path in the program. Since this limit is different for different encodings of an algorithm, this limit is a property of the encoding of the benchmark program rather than a property of the algorithm itself. Note that the pseudo-dataflow limit is also dependent on compiler optimizations. For example, loop unrolling will in some cases shorten the critical path because some of the program's branches are removed.
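The calculation described above can be illustrated with a minimal sketch. It assumes the dynamic program graph is given as a dictionary of per-instruction latencies plus a dictionary of predecessors; the function name, the instruction names, and the 7 cycle multiply latency are hypothetical choices made for illustration, not values taken from our simulator.

```python
def pseudo_dataflow_limit(latency, deps):
    """Return (critical_path_cycles, issue_rate) for a dynamic dataflow graph.

    latency: dict mapping instruction name -> execution latency in cycles.
    deps:    dict mapping instruction name -> names it must wait for
             (data dependencies and resolved branch conditions alike).
    No resource limits are modeled: an instruction starts as soon as
    every instruction it depends on has finished.
    """
    finish = {}  # memoized earliest finish time of each instruction

    def finish_time(instr):
        if instr not in finish:
            ready = max((finish_time(p) for p in deps.get(instr, ())), default=0)
            finish[instr] = ready + latency[instr]
        return finish[instr]

    critical_path = max(finish_time(i) for i in latency)
    return critical_path, len(latency) / critical_path


# Hypothetical 4-instruction trace: two independent loads (fast memory,
# 5 cycles) feed a multiply (assumed 7 cycles), which feeds a 2 cycle add.
latency = {"load_a": 5, "load_b": 5, "mul": 7, "add": 2}
deps = {"mul": ["load_a", "load_b"], "add": ["mul"]}
cycles, rate = pseudo_dataflow_limit(latency, deps)
print(cycles, rate)
```

The two loads overlap completely, so only one of them contributes to the critical path; the issue rate is the instruction count divided by that path length.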
The pseudo-dataflow limit can be an overly optimistic upper-bound because it assumes an unlimited number of resources. If there are a dozen independent floating point multiplies in the basic block of a loop, under the assumptions of the pseudo-dataflow limit, these can all proceed concurrently. For the basic machine we are studying, resources are limited. For example, there is only 1 floating point multiply unit and this unit can only accept 1 new floating point operation every clock cycle. From such realistic hardware limitations we calculated a resource limit for a program.
When calculating the resource limit, we assumed that the hardware functional units were limited to those of our base machine. Therefore, for a given program, the best-case execution time is bounded by the maximum number of instructions that use the same functional unit. For example, the best-case execution time for a basic block with a dozen independent floating point multiply operations is equal to 12 clock cycles plus the latency of the multiply unit on our hardware, as opposed to simply the latency of the multiply unit in a pure dataflow machine with unlimited resources.
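The dozen-multiplies example can be worked through in a short sketch. It assumes each functional unit accepts one new operation per clock cycle and takes the most heavily used unit as the bound; the function name, unit names, and the 7 and 6 cycle latencies are hypothetical values used only for illustration.

```python
from collections import Counter


def resource_limit_cycles(unit_of_instr, unit_latency):
    """Best-case cycles for a block when each unit accepts 1 operation per cycle.

    unit_of_instr: list naming the functional unit used by each instruction.
    unit_latency:  dict mapping unit name -> latency in cycles (assumed values).
    Each unit needs one issue cycle per instruction plus its latency to
    drain the last result; the busiest unit sets the bound.
    """
    counts = Counter(unit_of_instr)
    return max(n + unit_latency[unit] for unit, n in counts.items())


# A dozen independent floating point multiplies on a single multiply unit,
# with an assumed multiply latency of 7 cycles.
block = ["fmul"] * 12
cycles = resource_limit_cycles(block, {"fmul": 7})
print(cycles)
```

With unlimited resources the same block would need only the multiply latency itself, which is the gap between the pseudo-dataflow and resource limits for this block.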
The performance limits for various machine organizations are given in Table 2. The top half of Table 2 (those entries with a "Pure" prefix) shows the issue rates for the pseudo-dataflow and the resource limits for vectorizable and scalar loops for each machine type. The actual limit is found by taking the smaller of the 2 limits for each program and calculating the harmonic mean. Because the actual limit for individual loops is sometimes the pseudo-dataflow limit and other times the resource limit, the actual limit for a set of loops is not simply the smallest of either the overall pseudo-dataflow or resource limits. The results in Table 2 indicate that with realistic limitations on the amount of available hardware and inherent program dependencies, very large issue rates cannot be achieved. However, it is possible to achieve an issue rate of greater than 1 instruction per cycle and, therefore, the investigation of multiple issue units may be worthwhile.
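The reason the combined limit differs from the smaller of the two overall limits can be seen in a small sketch. The issue rates below are invented purely for illustration and are not entries from Table 2; the function names are likewise hypothetical.

```python
def harmonic_mean(rates):
    """Harmonic mean of a list of positive issue rates."""
    return len(rates) / sum(1.0 / r for r in rates)


def actual_limit(per_loop_limits):
    """per_loop_limits: list of (pseudo_dataflow_rate, resource_rate) pairs.

    For each loop the binding limit is the smaller issue rate; the set of
    loops is then summarized by the harmonic mean of those per-loop values.
    """
    return harmonic_mean([min(pd, res) for pd, res in per_loop_limits])


# Two invented loops: the first is resource bound, the second dataflow bound.
loops = [(2.0, 1.0), (0.5, 4.0)]
combined = actual_limit(loops)
overall_pseudo = harmonic_mean([pd for pd, _ in loops])
overall_resource = harmonic_mean([res for _, res in loops])
print(combined, overall_pseudo, overall_resource)
```

Here the combined value falls below both overall means, because each loop is constrained by a different limit.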
The lower half of Implicit in the calculation of this limit is the assumption that an unlimited amount of buffer storage is available to store temporary or intermediate results. This assumption is significant when a WAW hazard is encountered. Consider 3 instructions, Ii, Ij, and Ik. Assume that both instructions Ii and Ij write to register X and that instruction Ik uses register X as a source operand. Furthermore, assume that instruction Ij finishes execution before instruction Ii. The "Pure" limits assume that instruction Ik can start execution as soon as the result of instruction Ij is available. In a machine that has no mechanism to buffer the result of instruction Ij, issue of instruction Ij must be blocked, forcing it to finish, at best, at the same time as Ii. Forcing Ij to be delayed also has the unfortunate effect of delaying the issue of Ik. The "Serial" results in Table 2 calculate the issue limit if WAW hazards are handled by forcing instructions that write into the same register to finish execution in order. This "Serial" limit also gives us a handle on the best-case issue rate that could be achieved for a CRAY-like machine assuming that the memory is infinitely fast. Of course, these limits would change for different encodings of the programs.
Forcing instructions that write to the same register to finish in order has a dramatic effect on the actual performance limit of a program. As is seen in Table 2, except for 2 machine organizations executing vectorizable loops (M5BR5 and M5BR2), it is not possible to achieve an issue rate greater than 1 instruction per cycle. In the light of these limits, the issue rates found in Table 1 are not as low as they may seem. Nonetheless, there is still room for improvement, especially for the "Pure" case. Since several instructions could possibly issue in a single cycle, several results could also be produced in a single cycle. Since these results must be stored into the register file, the register file must have an ample number of ports. Therefore, another parameter in our machine organization is the maximum number of results that can be produced in any given cycle, i.e., the number of result busses that form the interconnection between the outputs of the functional units and the register file.
Several designs for the interconnection between the functional units and the register file are possible. First, we could have N result busses where N is the number of issue units. These busses could be organized in a crossbar fashion, i.e., the result of an instruction issued from any issue unit to a functional unit could appear on any result bus. Thus, an issue unit selects any available result bus to route a result to the register file. The only blockage due to result bus conflicts occurs when all N busses have already been scheduled for use in the same clock cycle. We call this the full crossbar or X-Bar organization. Clearly the logic needed to implement a scheduling algorithm for such a bus is extremely complex; furthermore, the register file needs N write ports.
A simpler way of using N result busses restricts the result of an instruction issued by issue unit i to result bus i. This approach simplifies the result bus scheduling algorithm since each issue unit only needs to look at the reservations for a single bus. However, the register file still needs N write ports. We call this the N-Bus interconnection. Finally, we could choose to have fewer than N busses. In the limit, we consider the use of a single result bus, i.e., a 1-Bus interconnection. In the 1-Bus interconnection, all results appear on a single result bus and the register file only needs a single write port. The simulation results for a machine with the N-Bus and 1-Bus organizations are found in Table 3. Since the results for the X-Bar case are essentially the same as those for the N-Bus case, we only present the results for the N-Bus case. Table 3 shows some interesting performance results for the scalar loops. First, having the capability of issuing up to 8 instructions per cycle is almost equivalent to having the capability of issuing 3 or 4 instructions per cycle. This is found by reading down the columns of the table. This result simply highlights the number of dependencies that exist in the code. It is rare that 2 consecutive instructions are independent and can issue simultaneously without blocking. Furthermore, remember that for these simulations, instructions are still forced to issue in order. Comparing the N-Bus and 1-Bus columns, we see that restricting the size or use of a result bus does not significantly impact performance. The issue rates for these machines simply are not high enough in any of the cases to cause a large amount of contention for the result bus.
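The difference between the X-Bar, N-Bus and 1-Bus organizations described above comes down to which result busses an issue unit may reserve for a given cycle. The following is a minimal sketch of that reservation check, assuming each bus keeps a set of already-booked cycles; the function name and data layout are hypothetical and do not describe the simulator used for Table 3.

```python
def reserve_result_bus(reserved, unit, cycle, organization):
    """Try to book a result bus for `cycle`; return the bus index or None.

    reserved:     list of sets, one per bus, holding cycles already booked.
    unit:         index of the issue unit making the request.
    organization: "x-bar" (any bus), "n-bus" (bus i for unit i only),
                  or "1-bus" (a single shared bus, index 0).
    A return value of None means the instruction must block.
    """
    if organization == "x-bar":
        candidates = range(len(reserved))
    elif organization == "n-bus":
        candidates = [unit]
    elif organization == "1-bus":
        candidates = [0]
    else:
        raise ValueError("unknown organization: " + organization)

    for bus in candidates:
        if cycle not in reserved[bus]:
            reserved[bus].add(cycle)
            return bus
    return None


# Two issue units whose results both arrive in cycle 5.
xbar = [set(), set()]
print(reserve_result_bus(xbar, 0, 5, "x-bar"), reserve_result_bus(xbar, 0, 5, "x-bar"))
single = [set()]
print(reserve_result_bus(single, 0, 5, "1-bus"), reserve_result_bus(single, 1, 5, "1-bus"))
```

In the X-Bar case every bus must be examined, whereas the N-Bus and 1-Bus cases each consult a single reservation set, which is the scheduling simplification noted above.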
A second interesting point is that the performance improvements made from multiple issue and the associated hardware complexity can be equally achieved by reducing the memory access time or the branch block time. Looking at columns M11BR5 and M11BR2, the issue rate for a machine with a slow branch and 5 issue units is identical to that of a machine with a single issue unit and a fast branch. A similar result is seen when comparing M5BR5 with M5BR2. For a given branch execution time, a faster memory access time also has a significant performance impact (consider the last row of the M11BR5 column and the first row of the M5BR5 column).
The results for simulations using the vectorizable loops are shown in Table 4. As might be expected, the overall issue rates for the vectorizable loops are higher than those for the scalar loops. There is a slight difference between the results for the N-Bus and 1-Bus organizations, but clearly the single result bus still is not saturated. The tradeoffs between multiple issue units and a faster memory or branch execution time are much the same as for the scalar code.
Multiple Issue Units with Out-of-Order Instruction Issue
In the multiple issue machine of the previous section, a blocked instruction prevents the issue of all succeeding instructions. An elaboration of this simple multiple issue scheme permits out-of-order issue of instructions that are currently in the instruction buffer. An instruction still cannot issue if it encounters a RAW or WAW hazard due to an instruction that precedes it in the instruction buffer. This out-of-order issue of multiple instructions clearly requires a significant amount of hardware to carry out the register interlock operations that must be performed (unless each instruction is expanded as in [14]). The results of simulations for this issue mechanism are presented in Table 5 and Table 6.
As one looks down the columns of these tables, the issue rate increases, but in a non-monotone manner. This non-monotone increase is due to the fact that only the instructions currently in the instruction buffer are candidates for issue. As with the in-order issue case, the instruction buffer is not filled until all the instructions currently in the issue buffer have been issued. As the number of issue stations increases, meaning the size of the instruction buffer increases, more instructions become available for issue. However, the way in which branch instructions fall into the buffer changes. So, there are cases where previously a branch instruction was the last instruction in the buffer and now it resides alone in the instruction buffer. This leads to the "sawtooth" pattern of the issue rates.
In comparing pairs of N-Bus and 1-Bus columns, again we see little contention for the result bus. If we look at 
Multiple Issue Units with Dependency Resolution
In the issue methods presented above, instruction issue is blocked when a hazard occurs. Hazards exist due to: (i) data dependencies inherent in the algorithm and (ii) dependencies caused by limited resources such as registers. Several dependency resolution schemes were mentioned in Section 3.3. In this section, we see how dependency resolution schemes would perform with multiple issue units. Rather than discuss the performance of multiple issue units with all known dependency resolution schemes, we choose one scheme, namely the RUU scheme presented in [12] and [15]. The RUU scheme was selected because it guarantees precise interrupts, a feature that is desirable in a machine with multiple issue units, and because we had access to detailed simulation tools for the RUU dependency resolution scheme. Before proceeding, we give a brief description of the scheme; details can be found in [12, 15]. In our studies, we simulated a restricted N-Bus organization, i.e., each issue unit had a preassigned set of RUU slots and a preassigned set of busses throughout the machine, and a 1-Bus organization. In the 1-Bus organization, there is a single bus from the RUU to the functional units, a single bus from the functional units back to the RUU and a single bus from the RUU to the register file. Since we are interested in potential performance improvement, we allowed bypass logic in the RUU even though such logic might be quite expensive to implement.
The results of simulations for scalar code are presented in Table 7 and for vectorizable code in Table 8. We present the results for up to 4 issue units since having more than 4 issue units did not make a significant difference. From Table 7 one can see that, for scalar code, the biggest improvement from a simple CRAY-like organization comes from using dependency resolution with a single issue unit, i.e., by allowing instructions to proceed from the issue stage without waiting for their operands to become available. For example, for the M11BR5 machine, the issue rate can be increased from 0.44 to 0.73 by using a single issue unit with this dependency resolution scheme. For this scheme we could, at best, achieve an issue rate of 0.83 with 4 issue units, a 4-Bus organization and an RUU of 40 entries. We could come quite close to this maximum, achieving an issue rate of 0.76, by using only 2 issue units, 2 busses and an RUU of 20 entries. As is expected, an issuing scheme that uses dependency resolution can tolerate slower memory by increasing the amount of buffer storage available (the size of the RUU). Furthermore, by using 3 issue units, we can achieve a performance of about 64849% of the theoretical maximum performance of 1.29 instructions per cycle.
For vectorizable code, where there is more parallelism, using multiple issue units can improve the issue rate significantly. In most cases, 3-4 issue units can be used before performance starts leveling off. Beyond 4 issue units, the competition for the limited number of functional units becomes a bottleneck. When sufficient parallelism exists in the code, the use of a single result bus can be a bottleneck. Note that we can achieve an issue rate of slightly greater than 1 instruction per cycle with a single result bus since some instructions (for example, branch instructions) do not use the result bus.
With 4 issue units, it is possible to achieve a performance of about 63% of the theoretical maximum performance for vectorizable code.
An interesting point to note is that, for a machine with N issue units and an N-Bus organization, the register file has N write ports and 2N read ports. Indeed, many of the read ports are wasted since the dependency resolution scheme allows several instructions to fetch their operands from the reservation stations in the RUU rather than from the register file itself. Likewise, the number of write ports itself could be reduced since the value of a register need not be updated if there is a succeeding instruction that will update the value of the register. To restrict the number of experiments, we did not study such cases.
Discussion and Conclusions
From the studies reported in this paper, we can make several observations about the design of the scalar units of multiple functional unit processors that have functional unit capabilities similar to our basic architecture.
For code that is inherently scalar, i.e., does not have a large degree of parallelism, a simple, serial machine with a single issue unit achieves a performance of about 18%-26% of the theoretical maximum performance, depending upon the memory latency and branch time.
