In the early 1980s, the PIPE (Parallel Instruction with Pipelined Execution) research project was initiated to investigate high-performance computer architectures for VLSI implementation. One of the primary project goals was to devise architectural methods to minimize the impact of off-chip memory accesses on processor performance. Crossing the boundary between the processor chip and external memory is one of the main impediments to achieving high performance in VLSI processors, and the problem is becoming more severe. Because the on-chip clock frequency of a VLSI processor chip is increasing at a much higher rate than external memory speeds, it will soon take 10 to 100 processor clock cycles to perform a single external memory reference.
The project members defined a processor architecture with several unique features that specifically deal with this problem. For example, the architecture supports a decoupled access/execute mode. (A decoupled architecture is one in which a program is divided into two or more instruction streams, and a number of processors cooperate in the execution of the task.) The architecture, which features I/O queues (visible to the programmer), buffers external memory accesses, uniquely handles branches, and supports subroutine calls. Extensive analysis and simulation studies of the architecture¹ have shown that a single PIPE processor, without running in a decoupled mode, can still significantly outperform more conventional processors.
The project members were encouraged by the results of these simulation studies. However, we felt that unless the merit of our innovations could be demonstrated, the simulation results would not be taken seriously. We concluded that the best way to demonstrate the worth of our PIPE architecture ideas was to build a working PIPE processor chip. Because of the unique features in PIPE and the important problems it addresses, the PIPE architecture implementation should interest not only those exploring decoupled access/execute architectures, but also conventional processor designers. In the following sections, we present an outline of the machine, a description of the processor, and an evaluation of the valuable lessons learned via the implementation.
The PIPE machine
It is important to differentiate between the PIPE machine and the PIPE processor. The PIPE machine includes an intelligent memory and two identical processors, which support a decoupled access and execute mode of execution²,³ (see Figure 1).
In the PIPE machine, the two processors are called the access and the execute processors. The access processor controls operand address computation, generating data requests for both itself and the execute processor. The execute processor functions like a highly intelligent math coprocessor, consuming data from the access processor and performing what programmers view as the heart of the required computations. These processors communicate with each other and with memory via hardware queues. The intent of separating a program this way is to allow the access processor to get ahead of the execute processor, reducing or eliminating the delays due to accessing external memory. A more detailed description of the PIPE project is contained in Farrens⁴ and Goodman et al.⁵ In the PIPE machine, the access and execute processors are identical. Therefore, throughout the rest of this article, we will refer to these processors as PIPE processors.
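The division of labor between the two processors can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: the function names, the list standing in for memory, and the single `deque` standing in for the hardware queue are all hypothetical, and the two streams run one after the other here rather than concurrently.

```python
from collections import deque


def access_stream(memory, base, count, load_queue):
    """Access side: compute operand addresses and request the data.

    In this toy model the requested data lands on the queue immediately.
    """
    for i in range(count):
        address = base + i                  # operand address computation
        load_queue.append(memory[address])  # data becomes available to the consumer


def execute_stream(count, load_queue):
    """Execute side: consume operands from the queue and do the arithmetic."""
    total = 0
    for _ in range(count):
        total += load_queue.popleft()
    return total


memory = list(range(10, 20))   # hypothetical memory contents: 10, 11, ..., 19
queue = deque()
access_stream(memory, base=2, count=4, load_queue=queue)  # access side runs ahead
result = execute_stream(4, queue)                         # 12 + 13 + 14 + 15
```

The execute stream contains no address arithmetic at all; it only pops operands. That separation is what lets the access side run ahead and hide memory latency.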
The PIPE processor
A PIPE processor has much in common with other reduced instruction-set computing (RISC) processors, with a register-to-register type architecture similar to Cray and CDC architectures. PIPE is a 32-bit processor with a 32-bit wide internal bus, is 16-bit word-addressable, and has separate input and output buses. The PIPE processor uses a barrel shifter for shifts and a two-stage arithmetic logic unit (ALU) for adds and subtracts as well as logic functions. The PIPE processor employs an elemental instruction set, one whose resource requirements can be easily determined at instruction issue time. This instruction set supports a basic repertoire of three-operand instructions: addition, subtraction, logical operations, and shifts in their various forms. Multiply and divide instructions are also specified for future expansion.
Like many other VLSI processors, PIPE is pipelined to increase performance. PIPE has a five-stage pipeline, consisting of instruction fetch, instruction decode, instruction issue, ALU1/logical, and ALU2.
Unlike most other RISC processors, however, PIPE instructions come in two forms (see Figure 2). These instructions can be either one or two 16-bit parcels long. The impact of having two different instruction sizes will become clear later in this article.
Unique features. While a PIPE processor has much in common with most RISC processors, it also differs significantly in a number of respects. The following sections briefly describe some of PIPE's unique high-performance features.
Architectural queues. The PIPE processor provides both input and output queues that act as a buffer between the external memory and the chip's internal processing elements. These queues appear to the programmer as register R7. This is the tail of the load address queue (LAQ), store address queue (SAQ), and store data queue (SDQ); and the head of the load data queue (LDQ). This arrangement allows the on-chip clock rate to be determined solely by the timing delays through the chip's processing elements. It also prevents the external memory speed from affecting the processor's internal clock rate.
The external memory takes items off these queues and performs the requested operations. When a requested data item is returned by memory, it is put in the LDQ, which buffers the data. By making this queue explicit in the architectural definition, a program can have multiple outstanding memory requests without forcing the issue logic to reserve a path into the register file for each request. The compiler, employing well-known optimization techniques, should place the load instructions as far ahead as possible of the instruction requiring the data.
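The benefit of placing loads well ahead of their uses can be sketched with a small cycle-counting model in Python. Everything here is an assumption made for illustration: the class and function names are hypothetical, memory has a fixed latency, every instruction costs one cycle, and only the LAQ and LDQ are modeled.

```python
from collections import deque


class LoadQueues:
    """Toy model of a load address queue (LAQ) feeding a load data queue (LDQ)."""

    def __init__(self, memory, latency):
        self.memory = memory      # hypothetical backing store: address -> value
        self.latency = latency    # assumed fixed memory latency, in processor cycles
        self.laq = deque()        # (cycle_when_ready, address), in request order
        self.ldq = deque()        # values returned by memory, in request order
        self.cycle = 0

    def tick(self):
        """Advance one cycle; memory moves finished requests from LAQ to LDQ."""
        self.cycle += 1
        while self.laq and self.laq[0][0] <= self.cycle:
            _, address = self.laq.popleft()
            self.ldq.append(self.memory[address])

    def issue_load(self, address):
        """Push an address on the LAQ; the processor does not wait for the data."""
        self.laq.append((self.cycle + self.latency, address))
        self.tick()               # the load instruction itself costs one cycle

    def use_operand(self):
        """Read the LDQ head, stalling only while the queue is empty."""
        stalls = 0
        while not self.ldq:
            self.tick()
            stalls += 1
        value = self.ldq.popleft()
        self.tick()               # the consuming instruction costs one cycle
        return value, stalls


def run(values, latency, distance):
    """Sum `values`, issuing each load `distance` iterations before its use."""
    queues = LoadQueues(dict(enumerate(values)), latency)
    total = stalls = 0
    for address in range(min(distance, len(values))):   # hoisted loads
        queues.issue_load(address)
    for i in range(len(values)):
        if i + distance < len(values):
            queues.issue_load(i + distance)
        value, waited = queues.use_operand()
        total += value
        stalls += waited
    return total, stalls
```

With a latency of 3 cycles, `run([1, 2, 3, 4], 3, 0)` returns `(10, 8)`: each use immediately follows its load and waits two cycles. `run([1, 2, 3, 4], 3, 3)` returns `(10, 0)`: with three loads outstanding, the data is already in the LDQ when it is needed. In this model the only condition the consuming instruction tests is whether the LDQ is empty.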
Prepare to branch. Branch instructions are notorious for causing performance degradation in heavily pipelined machines. This is due to the difficulty of keeping the pipeline full of useful instructions while the branch condition is being evaluated. The method used in the PIPE architecture is a generalized form of the delayed branch.
In the delayed branch scheme, there are a fixed number of delay slots following a branch that are guaranteed to execute. Ideally, the number of delay slots should be as large as possible to guarantee that the branch condition will have been evaluated when the instructions complete, thus keeping the pipeline full. However, for many benchmark programs, it is difficult to fill more than two delay slots with instructions that perform useful work. This forces the compiler to place null operations into the slots it is unable to fill.
The PIPE processor uses an instruction called prepare to branch (PBR) that allows the compiler to specify the number of delay slots after a branch instruction (between 0 and 7). The compiler is never forced to place null operations after a branch. If the compiler is able to fill the delay slots with three or four instructions, it guarantees that control flow will not break. The PBR instruction allows the PIPE architecture to always perform at least as well as the more restrictive delayed branch scheme and outperform it as pipeline depth increases.
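The difference between the two schemes in terms of padding can be shown with a short Python sketch. The function names and return conventions are hypothetical; the model counts only null operations and does not model pipeline bubbles or branch resolution timing.

```python
PBR_MAX_SLOTS = 7  # the text gives a PBR delay-slot count between 0 and 7


def fixed_delayed_branch(useful, fixed_slots):
    """Return (useful instructions placed, null operations padded).

    `useful` is the number of instructions the compiler could move after
    the branch; a fixed scheme must fill every slot regardless.
    """
    placed = min(useful, fixed_slots)
    return placed, fixed_slots - placed


def prepare_to_branch(useful):
    """Return (slot count encoded in the PBR instruction, null operations).

    The compiler asks only for as many slots as it can fill, so it pads nothing.
    """
    return min(useful, PBR_MAX_SLOTS), 0
```

For a branch where only two useful instructions are available, `fixed_delayed_branch(2, 4)` gives `(2, 2)`, meaning two null operations, while `prepare_to_branch(2)` gives `(2, 0)`.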
Subroutine call support. The PIPE processor supports data passing between subroutines by logically partitioning the 16 registers in the register file into two sets of eight registers. This approach attempts to balance the chip area devoted to registers with rapid subroutine call support. These two partitions are referred to as the foreground (FG) and the background (BG) register files. The operand fields in a typical instruction always reference the current FG register file. Instructions exist that permit data movement between the foreground and the background register sets to facilitate passing parameters between the calling and called routines. There is a switch register file instruction that swaps the pointers to the FG and BG register files, effectively swapping their contents.
To perform a subroutine call, a switch register file instruction is executed, followed by an unconditional PBR instruction. By providing a BG register file, no saving away of the calling routine's active register file is needed if the called procedure is a leaf procedure (that is, it does not make any procedure calls). Should the called routine make a subroutine call, it can schedule the saving away of the registers in the BG register file (formerly the FG register file) at its convenience. The only constraint is that the saving away of the registers occur before the called procedure performs a procedure call itself. This scheduling, coupled with the store data queue, makes this aspect of register saves non-time-critical.
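The foreground/background arrangement can be sketched in Python as two banks of eight registers plus a pointer that selects the foreground bank. The class and method names are hypothetical, and the sketch models only the register banks, not the PBR instruction or the store data queue.

```python
class RegisterFile:
    """Two banks of eight registers; `fg` selects which one is the foreground."""

    def __init__(self):
        self.banks = [[0] * 8, [0] * 8]
        self.fg = 0                       # index of the current foreground bank

    @property
    def bg(self):
        return 1 - self.fg                # the other bank is the background

    def read(self, reg):
        return self.banks[self.fg][reg]   # operands always name FG registers

    def write(self, reg, value):
        self.banks[self.fg][reg] = value

    def move_to_background(self, fg_reg, bg_reg):
        """Copy an FG register into a BG register, e.g. to pass a parameter."""
        self.banks[self.bg][bg_reg] = self.banks[self.fg][fg_reg]

    def switch(self):
        """Swap the FG/BG pointers; no register contents are copied."""
        self.fg = self.bg


regs = RegisterFile()
regs.write(1, 42)                 # caller's live value in FG register 1
regs.move_to_background(1, 0)     # pass it as a parameter in BG register 0
regs.switch()                     # call: the callee's view is now the foreground
param_seen_by_callee = regs.read(0)
regs.write(1, 7)                  # a leaf callee may freely use its own bank
regs.switch()                     # return: the caller's registers are back
caller_value_after_return = regs.read(1)
```

Because the switch changes only a pointer, the caller's registers survive a call to a leaf procedure without any saves to memory.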
Figure 2. PIPE instruction format.
The instruction cache. The direct-mapped PIPE instruction cache is composed of 16 four-word lines for a total of 64 words or 128 bytes. (This is between 32 and 64 instructions, depending on the distribution of one- and two-parcel instructions.) An 8-byte instruction queue (IQ) and an 8-byte instruction queue buffer (IQB) lie between the instruction cache and the decode logic. The goal of the PIPE instruction cache control logic is to keep the IQB and IQ full of valid instructions. If the cache control logic cannot decide the correct values to move into these registers, it guesses that the next sequential line will be needed. The effect of this guessing is limited to on-chip operations, however. Whenever a guess may force an off-chip operation, the control logic waits until it can ensure that some portion of the requested instructions will be executed. This restriction is enforced because of the limited bandwidth of single-chip processors and our desire to limit memory traffic. This instruction cache, while relatively small, proves sufficient for our purposes. It allows us to verify the control logic design and to demonstrate that such a sophisticated I-fetch strategy need not adversely affect the clock rate. In addition, our simulation results indicate that if the IQ and IQB are used properly, larger instruction caches may not significantly improve performance. (Interested readers can find instruction fetch strategy details in earlier papers.)
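The capacity figures quoted for the instruction cache follow from simple arithmetic, worked through in the Python sketch below. The constant and function names are hypothetical, and the formula for a mixed instruction stream is an averaging assumption rather than something stated in the text.

```python
LINES = 16            # direct-mapped cache lines
WORDS_PER_LINE = 4    # each line holds four 16-bit words (parcels)
BYTES_PER_WORD = 2    # PIPE words are 16 bits

cache_words = LINES * WORDS_PER_LINE          # 64 parcels
cache_bytes = cache_words * BYTES_PER_WORD    # 128 bytes


def instructions_held(two_parcel_fraction):
    """Average number of instructions that fit in the cache.

    `two_parcel_fraction` is the assumed share of two-parcel instructions,
    so an average instruction occupies 1 + fraction parcels.
    """
    return cache_words / (1 + two_parcel_fraction)
```

`instructions_held(0.0)` gives 64 (all one-parcel instructions) and `instructions_held(1.0)` gives 32 (all two-parcel instructions), matching the range quoted above.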
The PIPE implementation
Implementing the processor was a critical test of the PIPE architecture. Our goal was to demonstrate that our architectural features could be combined into a functioning whole. We wanted to show that the queues and the queuing discipline, interacting with the pipeline control logic, could operate with a reasonable clock frequency and within minimal chip area. We also had to ensure that the hardware interlocks in the pipelined instruction unit were not prohibitively difficult to implement. The requirement of an on-chip instruction cache also severely restricted the chip area available for control logic, registers, ALU, queues, etc. Finally, the input and output queues made interrupting the PIPE processor (which was not specified in the original PIPE architecture) a real design challenge.
Because of these constraints, we decided to implement and test pieces of the PIPE processor separately before implementing the entire processor. We chose this approach because no one at the University of Wisconsin had previously attempted a VLSI project of this magnitude.
We believed (and our experience has confirmed) that it was critical to determine early in the implementation phase how accurate our tools were. It was also crucial that we fully understood the iterative process of designing, submitting, and testing a VLSI chip.
Three pieces of the processor were submitted to the MOSIS fabrication facility. The first chip contained the register file, the two-stage ALU, the branch test logic, and the I/O bypass. The second chip contained a single input queue. The final piece of the processor to be implemented separately was the cache and instruction fetch logic. This chip contained more than 15,000 transistors and was by far the most complicated of the chips, prior to implementation of the final processor chip. Interestingly, it also came the closest to working exactly as specified, with only one small design error detected after extensive testing. This was very encouraging and gave us confidence that we could implement the entire processor chip on a single die.
The PIPE chip. After successfully fabricating and testing most of the PIPE processor parts, we were ready to incorporate the entire design onto a single piece of silicon. Technology had made significant advances since the original specification of the PIPE architecture, however. Circuit densities increased so much (even in NMOS) that we decided to make several modifications to the original architectural specifications before implementation. These modifications were implemented to make the processor as "real" as possible within our design constraints.
When we began the implementation, neither our CAD tools nor the MOSIS fabrication facility could adeptly deal with complementary metal-oxide semiconductor (CMOS) designs. At implementation time, however, both had progressed to the point where CMOS was the technology of choice. Unfortunately, time and manpower constraints prevented total conversion from N-channel MOS (NMOS) to CMOS. However, we felt that an NMOS implementation (while not ideal) would still accomplish our primary goal of demonstrating the validity of our architectural choices.
After the design was completed and the handshakes among the functional units were defined and individually tested, the chip as a whole was extensively simulated to catch as many logic errors as possible. While the processor chips were being fabricated, several test programs were written to exercise various chip portions. Upon receipt of the fabricated processor chips, testing began, and several new errors were detected. (None of them prevented further testing, however.) The errors were primarily found in the cache-issue logic interface and required the addition of No Operation instructions in certain places to prevent incorrect instruction execution. The fabricated chips correctly executed a bubble sort program, a hash sort program, and a Booth's multiply program. The average maximum 5.5-megahertz chip clock speed corresponded to a 5.5 MIPS peak execution rate. This was within 20 percent of the rate predicted by our simulations.
While this number is not as impressive as the performance numbers quoted by several other current processors, it is important to remember that PIPE was fabricated in a very restrictive environment. Therefore, PIPE should be compared to processors designed in the same environment. PIPE was fabricated in 3-micrometer NMOS with a single metal level. A more accurate comparison for PIPE would be the 3-μm NMOS versions of the RISC-II or MIPSX chips. PIPE's clock rate is two to three times higher than either of these machines.
Furthermore, SPICE simulations of the PIPE processor using 2-μm NMOS parameters with low-resist polysilicon interconnects indicate that if this less-restrictive NMOS fabrication process were available to us, the PIPE performance rate would rise to more than 18 MIPS. The availability of second-level metallization instead of low-resist polysilicon would improve the performance even more. Unfortunately, MOSIS does not support these fabrication processes in NMOS.
Lessons learned from the implementation
Actually implementing the processor provided a unique opportunity to test the wisdom and effectiveness of a number of design choices. We found that many architectural choices made in the design process were good, and some were not.
Architectural queues. The most important result that emerged from an analysis of the implementation is that architectural queues are not difficult to implement. The potential of architectural queues to improve performance has been repeatedly shown. Building the PIPE processor demonstrated that, from an implementation perspective, there is no reason not to equip a single-chip processor with I/O queues. By adding an interrupt to the processor, we were also able to overcome the "machines with queues are hard to interrupt" objection.
Issue logic. Implementing the PIPE processor also demonstrated that the amount of logic necessary to resolve interlocks quickly and easily at issue time is not prohibitive. The benefits of using architectural queues show up again here. Their use simplifies the issue logic design somewhat, since there is only a single bit that the issue logic must test to determine when memory has responded with an item. The issue logic does not need to know anything about the external memory's speed or structure. This allows the processor design to move gracefully into different kinds of implementations and technologies, since no redesign is necessary to handle varying relative memory speeds.
Barrel shifter. Because of the technology used to implement PIPE, the barrel shifter could not function as originally specified and still fit within the desired clock frequency. If it had been pipelined, it could have been made to function correctly, even in the technology used. However, since we decided to stick to the original specifications of a single-cycle shifter, some of the functionality had to be left out.
The question of pipelining the barrel shifter led us to an unexpected discovery. Simulation studies⁴ of pipelining's effect on processor performance led us to the conclusion that when a processor employs I/O queues, the difference between having all functional units of length 1 or having all of length 2 causes less than a 2 percent performance degradation in the benchmark programs studied. These results imply that there is no reason to spend an inordinate amount of time designing the ALU to perform an add as fast as the rest of the machine cycle. Making the ALU take two pipeline stages and making all other functional units match it in length can have a minimal impact on performance.
The cache control logic. The cache control circuitry is undoubtedly the most complex logic on the chip. It provides the PIPE processor with a sophisticated method of making the small on-chip instruction cache appear many times larger than it actually is. It was not clear in the beginning that we would be able to implement the complicated cache control logic and integrate it on-chip with the rest of the processor circuitry. However, the actual implementation proved that the scheme is not prohibitively complicated and that the scheme or a subset of it can be a powerful tool for a designer.
Instruction-set format. It is clear from the implementation that the decisions to support an instruction set with two instruction sizes and to allow consecutive two-parcel instruction issues profoundly affected the instruction fetch logic design. This combination requires the PC to increment by either one or two and also means that a path from the IQB into the instruction register (IR) must be provided and supported. Providing these functions accounts for a major portion of the instruction fetch logic size.
By removing the requirement for a variable-length PC change, the instruction fetch logic can be simplified. There are two ways to accomplish this: (1) make all instructions the same size or (2) only allow a single parcel to move into the IR on each clock. The second approach is the one used in the Cray-1, which has an instruction set very similar to PIPE's. By restricting the IR loading, the PC only needs to increment once per clock. The desired complexity reduction of the instruction fetch logic is then achieved. Studies of this restriction's effect on processor performance indicate that little performance improvement would occur if the Cray instruction fetch logic was modified to allow two parcels to move into the IR each clock.¹⁰
To analyze this restriction's effect on the PIPE processor, several simulations were performed on the PIPE simulator in which the loading of the IR was restricted in this manner. The simulation results indicate that the ability to issue consecutive two-parcel instructions is virtually mandatory in a single-chip processor with more than one instruction size. The reason for this disparity between the Cray and PIPE simulation results is that single-chip processors have a much higher rate of instruction issue. This issue rate increases the demand on the instruction fetch logic, exposing the delays due to waiting for the second instruction parcel.
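The cost of restricting IR loading to one parcel per clock can be illustrated with a back-of-the-envelope Python count. This is a deliberately crude sketch, not the PIPE simulator: the function name is hypothetical, at most one instruction enters the IR per clock, and cache misses, branches, and every other pipeline effect are ignored.

```python
def ir_load_clocks(parcel_lengths, parcels_per_clock):
    """Clocks needed to move a sequence of instructions into the IR.

    `parcel_lengths` lists each instruction's size in 16-bit parcels (1 or 2).
    An instruction needs ceil(length / parcels_per_clock) clocks to load.
    """
    clocks = 0
    for length in parcel_lengths:
        clocks += -(-length // parcels_per_clock)   # ceiling division
    return clocks


mix = [1, 2, 2, 1]                       # hypothetical instruction sequence
restricted = ir_load_clocks(mix, 1)      # one parcel per clock
unrestricted = ir_load_clocks(mix, 2)    # a whole two-parcel instruction per clock
```

For this mix the restricted scheme needs 6 clocks and the unrestricted one needs 4. The extra clocks matter only if the rest of the machine could otherwise issue an instruction every clock, which is consistent with the explanation given above for why the restriction hurts a single-chip processor more.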
Removing the requirement for variable-length PC incrementation by making all instructions the same size is the approach most current single-chip processors take. The fixed instruction size is generally 32 bits. While this may make the instruction fetch logic simpler, there is a down side. Having two instruction sizes effectively increases the instruction bus width and improves the performance of the instruction fetch logic. This is especially true when the on-chip instruction cache size is limited. With a fixed 32-bit instruction size, the PIPE instruction cache could only hold 32 instructions. With two different sizes, it can hold up to 64.
The process of creating an initial instruction fetch logic that could support an instruction set with two instruction sizes and move more than one parcel into the IR on each clock was extremely difficult. The design was also very challenging to debug and test. However, once the design is completed and verified, it can be removed from the worst-case implementation path. Because of this and the potential performance improvement mentioned above, it is not clear whether it is better to have a fixed 32-bit instruction size or an instruction set with two different sizes.
Implementing the processor architecture enabled us to evaluate several of our design decisions in much greater detail. We demonstrated that supporting architectural queues was not difficult and did not unduly complicate the instruction issue logic. Moreover, it freed the processor's internal clock rate from external memory speed influences. In addition, we were able to implement a sophisticated instruction fetch strategy that allowed us to make highly efficient use of a relatively small on-chip instruction cache. Portions of this strategy could effectively be used by any conventional single-chip processor.
We uncovered a couple of problems with our architecture. The first relates to the PBR instruction. The PBR instruction can have a variable number of unconditionally executed instructions associated with it, including zero. Supporting the zero case made the implementation significantly more difficult. It would have been wiser to unconditionally execute between 1 and 8 instruction parcels instead of between 0 and 7. Second, the implications of an instruction format that supported two instruction lengths did not become apparent until the instruction fetch logic was built. The question of the "correct" instruction format remains unresolved, however. Providing two instruction lengths allows better use of the available off-chip bus bandwidth and improves the effectiveness of a small on-chip instruction cache.
Our research taught us many things about our architecture that no amount of simulation could have revealed. We unearthed a number of interesting questions that we continue to pursue today. We feel the benefits of going through the implementation process far outweighed the drawbacks.
