MIPS-X is a 32-bit RISC microprocessor implemented in a conservative 2-µm, two-level-metal, n-well CMOS technology. High performance is achieved by using a nonoverlapping two-phase 20-MHz clock and executing one instruction every cycle. To reduce its memory bandwidth requirements, MIPS-X includes a 2-kbyte on-chip instruction cache.
This cache satisfies 90 percent of all instruction fetches, and reduces the memory bandwidth of the processor by a factor of 2.5. MIPS-X has a peak operating rate of 20 MIPS, and provides an effective throughput of 12 MIPS when the effects of the on-chip cache, external cache, and pipeline stalls are included. MIPS-X contains 150K devices in an 8 × 8.5-mm² die.
To produce a high-speed computer system, MIPS-X uses a simple compute engine, a simple and fast clocking scheme, and a high-performance memory system. The simplicity of the basic processor allowed us to use a significant fraction of the design time and silicon area to integrate a part of the memory system on the processor. This paper provides an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.
I. INTRODUCTION
The MIPS-X project began in the summer of 1984 with the goal of designing a second-generation RISC microprocessor that could be used as the processing nodes of a shared-memory multiprocessor.
With the knowledge gained from early RISC designs [1]-[3] and the improved performance available from a 2-µm two-level-metal CMOS process, we have designed a processor with a peak instruction rate of 20 MIPS. MIPS-X borrows from the original MIPS machine [1] the ideas of a simplified instruction set, pipelining, and a software code reorganizer to handle pipeline interlocks. However, to improve performance, MIPS-X uses a simpler instruction format, a deeper pipeline, an on-chip instruction cache, and a faster clock rate.
There are several areas that are important to consider when designing a high-speed processor, particularly one that is to be implemented in VLSI. These include the memory system design, the clocking methodology and the complexity of the resulting hardware. We feel that the most important factor is simplicity. For a high-speed processor, additional functionality should only be added when it significantly improves the overall performance of the machine. The design team has a certain amount of time and silicon area it can use to complete its task. Resources spent implementing a feature are resources that cannot be spent on other aspects of the design. In MIPS-X, the execution portion of the processor occupies a small fraction of the die area, allowing us to use the extra area to improve the performance of another critical element of the processor, the memory system.
As instruction rates increase, the bandwidth and latency of the memory system become important issues. This is evidenced by the greater use of on-chip caches and instruction prefetch queues to decrease the average time required to access instructions [4]-[10]. Crossing chip boundaries has become a limiting factor in high-speed processor systems; this makes it difficult to access instructions and data quickly if they have to be kept off-chip. MIPS-X uses both a large 2-kbyte on-chip instruction cache and an external interface optimized for high-speed cache access to provide the required memory bandwidth for the processor.
Increased performance also implies faster clock rates, and this makes the problem of clock distribution more difficult.
Multiphase clocks exacerbate the situation because the time per phase is smaller and there are more phases to distribute.
MIPS-X uses a simple two-phase clocking scheme, and locally generates additional clocks when necessary. Circuits using local clocks are often called self-timed because they derive the timing information from the delay of the circuit being controlled.
The use of self-timed clocks makes the global clocking in MIPS-X simple, but does add some circuit complexity in the parts of the chip that require additional clocks.
The next section gives an overview of the MIPS-X architecture and the supporting memory structure. This is followed by a description of the pipeline in Section III.
Sections IV and V present the hardware required to implement this machine. Section VI follows with a description of the design methodology used to keep the hardware
To improve the floating-point interface, two special memory instructions were added to MIPS-X that directly transfer data between one specific coprocessor and memory. With this minor addition we were able to provide a simple interface that supports a high-performance coprocessor. One advantage of this interface is that coprocessor instructions look just like memory instructions and thus can be implemented easily.
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-22, NO. 5, OCTOBER 1987
MIPS-X provides separate system and user addresses. Programs running in user mode are prevented from accessing system addresses, while programs running in system mode can access either address space. The processor can enter system mode only by taking an interrupt or by executing a trap instruction.
To support a dynamic paged virtual memory system, all instructions are restartable. The processor supports both maskable and nonmaskable interrupts. An interrupt causes the machine to flush the instructions in the execution pipeline, enter system mode, and jump to location zero. This simple support for exceptions provides the essential features needed to build an operating system for the processor.
To help the software system use the two slots associated with a branch, MIPS-X can optionally squash (turn into no-ops) the instructions in the slots if the branch is not taken. This allows the reorganizer to predict that the branch will be taken and put the first two instructions of the branch destination after the branch. In this case the machine effectively starts executing the code at the branch destination right after the branch instruction.
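The squashing behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the MIPS-X control logic: the function names, the `"nop"` marker, and the `squash_if_not_taken` flag are hypothetical, and the only fact taken from the text is that a branch has two slots.

```python
BRANCH_SLOTS = 2  # the text describes two slots associated with a branch


def issue_after_branch(slot_instrs, branch_taken, squash_if_not_taken):
    """Return what effectively executes in the two slots after a branch.

    slot_instrs         -- the two instructions the reorganizer placed after
                           the branch (hypothetical string labels)
    branch_taken        -- outcome of the branch
    squash_if_not_taken -- models the optional squashing form of the branch
    """
    if len(slot_instrs) != BRANCH_SLOTS:
        raise ValueError("a branch has exactly two slots in this model")
    if squash_if_not_taken and not branch_taken:
        # Wrong prediction: both slot instructions are turned into no-ops.
        return ["nop"] * BRANCH_SLOTS
    # Either the branch went as predicted, or squashing was not requested.
    return list(slot_instrs)


def wasted_cycles(executed):
    """Count slot cycles that did no useful work."""
    return sum(1 for instr in executed if instr == "nop")


if __name__ == "__main__":
    target_code = ["target+0", "target+1"]  # first two target instructions
    print(issue_after_branch(target_code, True, True))
    print(wasted_cycles(issue_after_branch(target_code, False, True)))
```

In this model the two cycles are lost only when a squashing branch falls through, which is the case the reorganizer predicted against.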
III. PIPELINE
Only if the branch is not taken are these instructions turned into no-ops and the resulting cycles wasted. To avoid having additional pipeline constraints, MIPS-X has two levels of internal forwarding or bypassing. The bypassing allows the result of one instruction to be used as input for the next instruction and is needed because the actual WRITE into the register file occurs late in the instruction, too late to be directly used in the next two instructions. The bypass logic slightly complicates the design of the register file, but greatly reduces the number of no-ops needed to eliminate interlocks.
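As a rough behavioral sketch of two-level bypassing, the Python model below keeps the two most recent results outside the register file and checks them before reading the file itself. The class name, method names, and the 32-register size are assumptions made for illustration; the actual data path uses latches, bus drivers, and comparators rather than a list.

```python
class BypassedRegisterFile:
    """Behavioral model of a register file with two levels of bypassing."""

    def __init__(self, num_regs=32):  # register count is an assumption
        self.regs = [0] * num_regs
        # Results of the two most recent instructions, newest first.
        # Each entry is (destination register, value).
        self.pending = []

    def read(self, reg):
        """Read a source operand, preferring the newest bypassed result."""
        for dest, value in self.pending:  # one comparison per bypass level
            if dest == reg:
                return value
        return self.regs[reg]

    def produce(self, dest, value):
        """Record a new result; the oldest pending result reaches the file."""
        self.pending.insert(0, (dest, value))
        if len(self.pending) > 2:
            old_dest, old_value = self.pending.pop()
            self.regs[old_dest] = old_value  # the late WRITE


if __name__ == "__main__":
    rf = BypassedRegisterFile()
    rf.produce(1, 10)
    # The next instruction sees 10 even though the file still holds 0.
    print(rf.read(1), rf.regs[1])
```

Each source operand is compared against two earlier destinations, so an instruction with two sources needs four comparisons in this model.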
IV. HARDWARE RESOURCES
A microphotograph of the processor with the major functional blocks outlined is shown in Fig. 3.
The cache system has a full cycle for its access, but needs to determine whether the instruction will hit in the cache in a single phase. The early hit detect is needed to be able to use the next cycle to fetch the missed instruction from the external cache, as shown in Fig. 5. The root of the problem is that external memory accesses really take one and a half cycles; the processor must drive the address pads on φ2 of the cycle before the memory access. To fetch the missed instruction by the end of the first cache-miss cycle, the processor must drive the instruction address off chip during φ2 of the IF that misses, and thus we need the hit signal by the end of φ1.
Using the early hit detect, internal cache misses stall the machine for two cycles. The first cycle is used to fetch the missed instruction from the external cache, and the second cycle is used to write this value into the instruction cache. Since we assumed that the data from an external cache fetch are valid just before the end of the cycle, to reduce the miss delay to a single cycle we would need to extend the cycle time to provide sufficient time for a cache write to complete after the data become valid. Instead, MIPS-X uses the second cache-miss cycle to fetch from the external cache the next instruction that will be executed. Therefore, ICache misses have a penalty of two cycles, but fetch back two words. This fetch of two words halves the miss rate of the cache and provides roughly the same system performance as a cache with a single-cycle miss penalty, but accomplishes this performance without influencing the cycle time of the processor.
The tags are stored in a content-addressable memory using a standard ten-transistor CAM cell so they can be quickly compared against the current instruction address.
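The claim that a two-cycle, two-word miss performs about as well as a one-cycle, one-word miss follows from simple arithmetic, sketched below. The 20 percent single-word miss rate is an illustrative assumption chosen so that halving it gives the 10 percent miss rate implied by the 90 percent hit rate quoted earlier; the function name is hypothetical.

```python
def stall_cycles_per_fetch(miss_rate, penalty_cycles):
    """Average stall cycles added to each instruction fetch."""
    return miss_rate * penalty_cycles


# Illustrative assumption: miss rate if only one word were fetched per miss.
single_word_miss_rate = 0.20

# Hypothetical cache with a one-cycle penalty that fetches one word per miss.
one_cycle_design = stall_cycles_per_fetch(single_word_miss_rate, 1)

# Two-cycle penalty, but fetching two words halves the miss rate.
two_word_design = stall_cycles_per_fetch(single_word_miss_rate / 2, 2)

if __name__ == "__main__":
    print(one_cycle_design, two_word_design)
```

Both designs add the same average stall per fetch in this model, while only the two-word design avoids stretching the cycle time to fit a cache write.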
Hit detection requires first comparing the current instruction address against the values stored in the CAM array, and then fetching the correct valid bit for the block that matches. To generate the hit information in one phase, the tag compare and valid bit fetch are performed simultaneously.
The Vstore is logically organized as 64 words of eight bits. During the tag compare the low-order bits of the instruction address are used to index into the Vstore to fetch the eight possible valid bits, a bit for each tag that could match. Next, these output lines are ANDed with the output of the tag comparison, and then ORed together to generate the cache hit signal. Since the tag compare and the Vstore access both require roughly 15 ns, it is easy to generate the hit signal in a single phase.
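The hit logic just described can be modeled as an AND of match lines with valid bits followed by an OR. In the sketch below the 64-word by 8-bit Vstore shape comes from the text; everything else is an assumption for illustration, including the function names, the representation of the eight candidate tags as a Python list, and passing the Vstore index in directly rather than deriving it from address bits.

```python
VSTORE_WORDS = 64  # from the text
VALID_BITS = 8     # from the text: one bit per tag that could match


def cam_compare(stored_tags, address_tag):
    """Model the CAM: one match line per stored tag."""
    return [tag == address_tag for tag in stored_tags]


def cache_hit(stored_tags, vstore, address_tag, index):
    """AND each match line with its valid bit, then OR the results.

    In hardware the tag compare and the Vstore read proceed in parallel;
    here they are simply evaluated one after the other.
    """
    match_lines = cam_compare(stored_tags, address_tag)
    valid_bits = vstore[index]  # eight valid bits packed into an int
    return any(
        match and bool((valid_bits >> way) & 1)
        for way, match in enumerate(match_lines)
    )


if __name__ == "__main__":
    vstore = [0] * VSTORE_WORDS
    tags = [0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17]
    vstore[5] = 0b00000100  # word 5 is valid only for the tag in position 2
    print(cache_hit(tags, vstore, 0x12, 5))  # tag and valid bit both match
    print(cache_hit(tags, vstore, 0x13, 5))  # tag matches, word not valid
```

The second call corresponds to a matching tag with a cleared valid bit, and a tag that matches nothing gives a miss regardless of the valid bits.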
There are two types of internal cache misses: block miss and word miss, depending on whether the block for the
Fig. 7 shows how the tag write line also serves as a virtual ground for the valid bits associated with that tag. When the write line is pulled high it forces all the cells to reset, clearing the valid bits for that tag.
MIPS-X uses a simple ring counter algorithm for selecting the tag to be replaced during a block miss. The ring counter is located above the Vstore, and is incremented after each block miss. The fetch of two instructions during a cache miss means that the ring counter must also increment when there is a block hit and word miss, and the ring counter points to the block where the hit occurred. This prevents a block miss during the fetch of the second instruction from clobbering a block that only had a word miss during the fetch of the first instruction.
This transistor discharges the precharged node Done, causing Done to rise, and forcing the write drivers to recover the bit lines for the following READ. Transistor MO.C. is needed to prevent the circuit from oscillating. If it is deleted, then the recovery of the= line will cause Done to rise, and the write will restart. The write and recovery is quite fast, requiring less than 20 ns to complete.
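The ring-counter replacement rule described above, including the extra increment on a word miss, can be sketched as follows. The class and method names are hypothetical, and the number of tags is left as a parameter because this section does not state it.

```python
class RingCounter:
    """Behavioral model of ring-counter tag replacement."""

    def __init__(self, num_tags):
        self.num_tags = num_tags  # tag count is a parameter (assumption)
        self.pointer = 0          # tag that the next block miss will replace

    def _advance(self):
        self.pointer = (self.pointer + 1) % self.num_tags

    def on_miss(self, hit_block=None):
        """Handle one internal cache miss; return the block being filled.

        hit_block is None -- block miss: replace the tag under the pointer,
                             then advance.
        hit_block is set  -- block hit with word miss: advance only if the
                             pointer rests on that block, so the fetch of
                             the second word cannot replace it.
        """
        if hit_block is None:
            victim = self.pointer
            self._advance()
            return victim
        if self.pointer == hit_block:
            self._advance()
        return hit_block


if __name__ == "__main__":
    rc = RingCounter(num_tags=4)
    rc.on_miss()              # block miss fills block 0, pointer moves to 1
    rc.on_miss(hit_block=1)   # word miss in block 1, pointer moves to 2
    print(rc.pointer)
```

Without the second rule, a block miss on the second word of a two-word fetch could pick the very block that had just been partially filled.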
To remove many potential pipeline interlocks, the register file is double bypassed. This requires adding bus drivers to two latches in the data path, and adding four comparators in the control as shown in Fig. 11. The comparators check the destination of the previous two
The ψ clocks can only be used as an input to a latch; the clocking of functional units is always done on the true clocks φ1 and φ2. This allows the ψ1 clocks to be slightly shorter than φ1, or said a different way, it means that the external cache-miss signal can arrive a little late. As long as the external cache-miss signal monotonically falls, it can actually arrive at the processor after the end of the MEM cycle, during φ1 of WB. The external miss signal can arrive up to 10 ns late and still provide a valid ψ clock. This gives the external cache about 10 ns to generate the miss signal after the data fetch, and prevents the cache tag comparison from being on the critical path for memory accesses.
VI. DESIGN METHODOLOGY AND TESTING
The basic MIPS-X architecture and pipeline structure were developed during the first six to nine months of the 
