In a pipelined single-chip processor (SCP) with pipelined functional units of varying length, the processor issue logic must deal with scheduling of the result bus. In order to prevent serious performance degradation due to result bus conflicts, some pipeline scheduling techniques developed in the 1970s may need to be incorporated into the issue logic. Since this is a nontrivial complication of the issue logic, a set of simulations was performed in order to evaluate the effectiveness of the combination of multiple length functional units and scheduling techniques. Analysis of the simulation results indicates that providing relatively short multiple length functional units is not worthwhile. Multiple length functional unit configurations employing result bus scheduling do perform slightly better than uniform length configurations, but the difference is often less than 1%.
Introduction
When designing a single-chip processor (SCP) to support a particular instruction set architecture, the design engineer must carefully choose among the many different options available. The types and size of on-chip caches (if any), the type of instruction fetching strategy to pursue, the amount and degree of pipelining to incorporate, and whether or not to include on-chip floating point units are just a few of the design decisions that must be made. To aid in the option selection process, it is important to provide the designer with as much information as possible.
If the processor is to be pipelined, the designer must decide the structure and organization of the pipeline, including the extent to which the functional units will be pipelined. The first instinct of a high-performance processor designer is to make each individual functional unit as fast as possible, in order to avoid wasted clock cycles. For example, the more complex floating point units may be designed to take several clock cycles to produce a result, while the simpler logical unit may be able to produce its results in a single clock cycle. While on the surface this approach seems logical, closer inspection reveals certain problems with supporting multiple length functional units.
0194-1895/90/0000/0209/$01.00 © IEEE
When different length functional units are implemented with a CRAY-like issue strategy [Russ78], the issue logic must ensure that an instruction waiting to issue will not require the use of the result bus during a clock cycle when the result bus will be used by an already issued instruction. Being able to detect and deal with this result bus busy condition can significantly complicate the issue logic, and it is not clear that having multiple length functional units in a single-chip processor provides a justifiable performance improvement.
In this paper, we will investigate the performance improvement resulting from using variable length functional units. This work is done in the context of SCPs where moderate amounts of pipelining are supported (most of today's SCPs). We first describe, by example, how pipeline scheduling for multiple variable length functional units leads to improved performance. Then, through simulation, we evaluate the performance of machines with different combinations of functional unit lengths.
Result Bus Scheduling
In a machine with multiple length functional units that are relatively short in length, conflicts over the use of the result bus frequently occur. Consider an instruction (Instruction A) that is prevented from beginning execution because the result bus will be in use during the clock cycle in which Instruction A wants to write its result back to the register file. By preventing Instruction A from issuing, the instructions behind it are also prevented from beginning. Overall throughput may be reduced if some of these instructions do not have the same result bus conflict as Instruction A. Special pipeline scheduling techniques [PaDa76] exist that can be used to reduce the impact of these result bus conflicts and that are much less complicated than supporting out-of-order issue and execution [Toma67].
Figures 1 and 2 demonstrate how the technique of result bus scheduling improves the performance of a simple sequence of code. (In this example, we assume that arithmetic operations require two full clock cycles to complete, while logical and move operations require only one.) Without using pipeline scheduling, the code sequence shown in Figure 1 takes six clock cycles to complete. As can be seen in the figure, the OR instruction following the ADD cannot issue until the ADD completes, because at time 2, when the issue logic is first presented with the single cycle OR instruction, the issue logic knows that the result bus will be busy at the time the OR instruction will generate its result. This blocking of the OR instruction also delays the issue of the subsequent SUB instruction, which faces no such result bus conflict.
[Figure 1. The example code sequence executed without result bus scheduling.]
When result bus scheduling is employed, as shown in Figure 2, delays are inserted into the datapath, forcing selected instructions to take longer to complete. The OR is issued at time 2, and although it can complete in the same clock cycle, the issue logic has it wait an extra clock cycle before writing a result. Since the OR has been issued at time 2, the issue logic can issue the SUB one cycle sooner. Thus, the blocked instruction and the ones behind it are able to proceed normally (assuming the issue logic detects no other hazards).
As can be seen by comparing Figure 1 to Figure 2, the use of pipeline scheduling improves the performance of this code sequence by over 16%.
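The comparison above can be sketched with a small Python model. It is only an illustration under stated assumptions, not the issue logic of any real processor: the sequence is taken to be ADD, OR, SUB; one instruction may issue per cycle, in program order; and an instruction issued in cycle t with latency n uses the result bus in cycle t + n. The function and variable names are hypothetical.

```python
# Assumed latencies from the example: arithmetic takes two cycles, logical one.
LATENCY = {"ADD": 2, "SUB": 2, "OR": 1}


def last_write_cycle(program, latency, schedule_bus):
    """Return the cycle in which the final result is written.

    Assumed model: in-order issue, at most one instruction per cycle, and an
    instruction issued in cycle t with latency n drives the result bus in
    cycle t + n (plus any delay inserted by result bus scheduling).
    """
    busy = set()    # cycles in which the result bus is already reserved
    cycle = 1       # earliest cycle in which the next instruction may issue
    last_write = 0
    for op in program:
        n = latency[op]
        if schedule_bus:
            # Issue immediately and insert delay cycles until a bus slot is free.
            write = cycle + n
            while write in busy:
                write += 1
        else:
            # Hold the instruction (and everything behind it) until its
            # natural bus slot is free.
            while cycle + n in busy:
                cycle += 1
            write = cycle + n
        busy.add(write)
        last_write = max(last_write, write)
        cycle += 1
    return last_write


program = ["ADD", "OR", "SUB"]
unscheduled = last_write_cycle(program, LATENCY, schedule_bus=False)
scheduled = last_write_cycle(program, LATENCY, schedule_bus=True)
improvement = (unscheduled - scheduled) / unscheduled * 100
print(unscheduled, scheduled, round(improvement, 1))
```

Under these assumptions the model finishes in six cycles without scheduling and five with it, a saving of one cycle in six (about 16.7%), which is in line with the figure of over 16% quoted in the text.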
The Simulation Model
The memory system is modeled as a single large memory that services both instruction and data requests, connected to the processor chip by unidirectional input and output busses. The external floating point unit is memory mapped, so that a pair of data stores to the appropriate locations will cause a multiply to occur. The simulation model gives bus precedence to instruction fetches, followed by data loads and stores, with multiply results getting the bus whenever it is idle.
Giving instruction fetches top priority like this helps keep the instruction fetch unit ahead of the decode unit, and since the processor has been designed to tolerate slow memory, the extra clock cycles required by a data fetch that has been preempted by an instruction fetch are effectively hidden. This is an example of the less obvious advantages of using I/O queues.
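The bus precedence described above amounts to a fixed-priority arbiter. The following Python sketch illustrates that ordering only; the request names and the `grant` function are hypothetical and are not taken from the simulator.

```python
# Highest priority first, following the ordering described in the text:
# instruction fetches, then data loads and stores, then multiply results.
PRIORITY = ("instruction_fetch", "data_access", "multiply_result")


def grant(requests):
    """Return the request kind that wins the bus this cycle, or None if idle."""
    for kind in PRIORITY:
        if kind in requests:
            return kind
    return None


print(grant({"data_access", "instruction_fetch"}))
print(grant({"multiply_result"}))
```

In this sketch a multiply result is granted the bus only in a cycle with no competing fetch or data request, which is one reading of "whenever it is idle".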
The Benchmark Program
The benchmark programs selected were the first 14 Lawrence Livermore loops as defined in [McMa84]. The loops were first compiled by the SUN4 optimizing compiler, in order to get a feel for the kinds of optimizations a compiler could perform. The loops were then completely hand-written using the output of the SUN compiler as a guide. A serious effort was made not to hand-optimize the loops, however. The loops are not "tuned" to increase performance, as this might limit the information that could be gained from the interpretation of the results.
In an effort to make the results obtained from the study applicable to the general case of processors that do not use I/O queues, an additional set of Lawrence Livermore loops was created that does not take advantage of the I/O queues. In these loops, a data load instruction is followed immediately by the instruction that requires the data item. Since the entire PIPE architecture and instruction set is designed to take advantage of these I/O queues, the simulation results will only approximate the general case. The PIPE processor can only behave similarly to a machine without queues. It can behave similarly enough, however, to allow some general conclusions to be drawn. The 14 loops were assembled as one large program, so that each loop would run until finished and then fall through to the next loop. The variant of each loop was modified so that each loop executes approximately 10,000 instructions. This was done to balance the impact of the different loops on the results. A total of 139,608 instructions are executed in a single run through the original benchmark program, and 140,267 are executed in a run through the modified benchmark that does not take advantage of the I/O queues.
Discussion of Simulation Results
In this section, we evaluate the performance of various functional unit lengths and the impact of result bus scheduling techniques. The simulation results necessary to perform this evaluation were generated by executing the two suites of benchmark programs (those with and without queues) on the PIPE simulator, varying the following parameters:
1. The number of clock cycles required for the different functional units.
2. The ability of the issue logic to do dynamic bus scheduling.
3. The time necessary to perform a floating point multiply.
The remaining simulation parameters were set so that the simulation results would highlight the effects of result bus scheduling and minimize the number of clock cycles lost due to conditions not related to functional unit length (such as the Instruction Register being invalid). In order to accomplish this, the instruction cache size was set large enough to contain the entire program and it was started warm. In addition, the memory delay was set to a single clock cycle to guarantee there would be no processor blocking due to waiting for a data item from memory.
The results of these simulations are presented in Tables 1 and 2. Table 1 contains the results of the simulations with a four cycle external multiply unit and the processor queues fully enabled, while Table 2 contains the simulation results for a four cycle external multiply unit and the processor queues not fully utilized.
The simulation results are presented in tabular form because, while looking at a graph is generally preferred to reading numbers out of a table, this data does not lend itself to graphical representation. A number of different graphical formats were created, but in each case the information of interest was easier to extract from the table itself than from the graph. Therefore, the reader should be prepared for frequent references to entries in Tables 1 and 2 throughout the following discussion.
The best place to begin is to look at where the implemented PIPE processor falls in Table 1. PIPE employs a two cycle arithmetic unit, single cycle shift and logical units, fully utilizes its I/O queues, and does no bus scheduling. Looking under the appropriate entry in column 1 of Table 1, we see that the PIPE processor takes a total of 172,776 clock cycles to execute the benchmark program. Going across the table we see that providing the ability to do result bus scheduling would provide a 10% performance increase.
The impetus behind this study was initially to try to determine if replacing PIPE's restricted single cycle shift unit with a fully functional two cycle shift unit would have a serious negative effect on performance. (The PIPE processor was implemented in single layer metal nMOS, and the amount of polysilicon necessary to build the shifter prevented a fully functional design.) Interestingly, Table 1 shows that the as-implemented PIPE processor would actually perform slightly better (1.7% better) with a two cycle shifter than it does with a one cycle shifter. This is separate from the implementational difficulties of producing a single cycle shifter; these simulation results show that the total number of clock cycles used decreased with an increase in the time required to do a shift. This result is not what one would first expect, and has to do with the result bus scheduling problem. When result bus scheduling is not being used, the more uniform the lengths of the functional units, the less time is lost to result bus busy conditions.
Since making the functional units more uniform in length seems to provide a slight increase in performance, the next logical step is to look at how the processor performs when all functional unit lengths are the same. Looking at the results in Table 1 we see that the processor actually performs best under these circumstances. This says that the PIPE processor would perform 9.3% faster if the shift and logical units were slowed down. This is perhaps the most significant and initially counterintuitive result to come out of this study, and implies that an SCP designer may be able to significantly improve the performance of a processor by slowing down the different functional units until they all function at the same speed. Doing this has the added advantage of removing the need for result bus scheduling.
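The percentages quoted for Table 1 can be cross-checked against one another. The Python sketch below assumes that "N% faster" means the baseline cycle count is N% larger than the improved one; the derived cycle counts are estimates computed from the rounded percentages in the text, not entries from Table 1, and the function name is hypothetical.

```python
BASELINE_CYCLES = 172_776  # as-implemented PIPE, no bus scheduling (from the text)


def cycles_after_speedup(cycles, percent_faster):
    """Estimate the cycle count of a configuration that is percent_faster quicker.

    Assumption: performance is inversely proportional to total clock cycles.
    """
    return cycles / (1 + percent_faster / 100)


# Estimated from the 10% (result bus scheduling) and 9.3% (uniform length) figures.
scheduled_cycles = cycles_after_speedup(BASELINE_CYCLES, 10.0)
uniform_cycles = cycles_after_speedup(BASELINE_CYCLES, 9.3)

# How much faster the scheduled configuration is than the uniform one.
gap_percent = (uniform_cycles / scheduled_cycles - 1) * 100
print(round(scheduled_cycles), round(uniform_cycles), round(gap_percent, 2))
```

Under this assumption the configuration with result bus scheduling comes out only a fraction of a percent ahead of the uniform length configuration, in line with the difference of just over 0.6% that the paper reports for this comparison.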
A processor using result bus scheduling in conjunction with minimal length functional units is somewhat faster than a processor using uniform (slower) functional units, as one would expect. Comparing the implemented PIPE processor with result bus scheduling enabled to a processor configuration with all functional units of length two shows that the PIPE setup is in fact slightly faster. However, the difference is just over 0.6%. It is highly [...]
[...] Table 2, some similar results are evident. Table 2 contains the results of simulations in which the I/O queues were not fully utilized, and in these results we see again the impact of I/O queues on performance.
As was the case in Table 1, the performance of the PIPE processor configuration does increase as the functional unit lengths become more uniform, but the performance increase is not as appreciable.
This difference in performance is due once again to the inherent properties of I/O queues. Programs for processors using I/O queues try to maximize the distance between a request for a data item and its consumption, based on the assumption that some unspecified amount of time after the request has been issued it will be serviced. What is causing the delay is immaterial; whether the delay is due to slow memory or multi-staged functional units, it is hidden by the proper use of these queues.
Summary and Conclusions
The goal of this paper was to determine if a processor should employ functional units of varying lengths. The natural instinct of many designers is to maximize the performance of each individual functional unit, which leads to multiple length functional units. If a processor does use multiple length functional units, it must be able to deal with the added issue condition of a busy result bus. Some pipeline scheduling techniques developed in the 1970s may also need to be incorporated into the issue logic to prevent serious performance degradation due to the result bus conflicts. Since this is a non-trivial complication of the issue logic, a set of simulations was performed in order to evaluate the effectiveness of the combination of multiple length functional units and this scheduling technique.
Analysis of the simulation results indicates that having relatively short multiple length functional units is not worthwhile. Multiple length configurations employing result bus scheduling do perform slightly better than uniform length configurations, but the difference is less than 1%. Since configurations with all functional units the same length consistently outperformed configurations with functional units of different lengths when result bus scheduling was not enabled, the added complexity of supporting this technique is not justified.
These results indicate that an SCP designer should not waste valuable time squeezing performance out of the various functional units, but rather should produce a good design of the most complicated unit and design all other units to match it. However, it should be pointed out that these results are for a single set of hand-written benchmark programs, and processors without I/O queues could only be emulated, so the results may or may not be directly transferable to designs without queues. Nevertheless, the results themselves are very interesting, and indicate that there is much more work that can be done in this area.
