The overall performance of supercomputers is slow compared to the speed of their underlying logic technology.
INTRODUCTION
Traditional computer architectures use resources inefficiently, resulting in machines whose performance is disappointing when compared to the raw speed of their components. The main technique for running processors near the limits of a technology is pipelining. An operation in a pipelined machine may take several cycles to complete, but a new operation can be started on each cycle, so the throughput remains high. The benefits of pipelining, however, have been limited by the difficulty of keeping the pipeline full. The difficulty can be traced to two sources: data dependencies and the slowness of memory.
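The throughput argument above can be made concrete with a small cycle-count sketch. The function names and the `stall_cycles` parameter are illustrative assumptions, not part of the original design; `stall_cycles` simply stands in for cycles lost to dependencies or slow memory.

```python
def sequential_cycles(num_ops: int, latency: int) -> int:
    """Cycles needed when each operation must finish before the next starts."""
    return num_ops * latency


def pipelined_cycles(num_ops: int, latency: int, stall_cycles: int = 0) -> int:
    """Cycles needed when a new operation can start on every cycle.

    latency      -- cycles a single operation takes to complete
    stall_cycles -- hypothetical total of cycles in which no operation
                    could be started (an empty pipeline slot)
    """
    if num_ops == 0:
        return 0
    # The first result appears after `latency` cycles; each later operation
    # adds one cycle, plus whatever cycles were lost to stalls.
    return latency + (num_ops - 1) + stall_cycles


if __name__ == "__main__":
    print(sequential_cycles(100, 4))
    print(pipelined_cycles(100, 4))
    print(pipelined_cycles(100, 4, stall_cycles=50))
```

Under these assumptions, every cycle in which the pipeline cannot accept a new operation adds directly to the total, which is why keeping the pipeline full matters so much.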
A data dependency is a relationship between two instructions that use a common register or memory location. The second operation cannot begin until the first operation is finished using the register or memory. In many supercomputer architectures, complex scheduling hardware is used to keep the almost independent processing units from violating the data dependencies implicit in the code. Although scheduling hardware allows some overlapping of normally sequential operations, the final machine is only about twice as fast as a strictly sequential machine. Even with far more sophisticated scheduling hardware than that in current machines, only another factor of one and a half is obtained [18]. The scheduling mechanism is not only expensive to build, but it also slows down the basic cycle time, since it must operate faster than the processing units.
Large memories are slow compared with modern processing elements, limiting the performance of a machine in two ways. First, instructions to be executed must be fetched from memory. Second, data operations that need to read from or write to memory take a long time to complete, delaying other instructions.
For straight-line code, conventional pre-fetch units and instruction caches remove most of the instruction-fetch delay, but substantial penalties are incurred for conditional jumps and cache misses. The smallness of basic blocks [17], [12] and the corresponding frequency of jumps have usually limited the size of pipelines to two or three stages [16].
RISC (Reduced Instruction Set Computer) designs try to gain performance without scheduling hardware by making all instructions take the same time. The more complex operations, such as floating-point arithmetic, have been broken into smaller operations that can be executed quickly. The approach works well for small machines [8], [15], but is unsuitable for high-performance [...]

This work is supported in part by NSF grant DCR-8502884 and the Cornell NSF Supercomputing Center.
Data Path
The data path is a conventional design, in that it combines RISC and array processor designs.
It has the following features:
• Besides the conventional integer arithmetic unit, pipelined floating-point hardware is provided. The main requirements on the arithmetic units are that an operation may be started on every cycle, and that each operation take a fixed time to complete. Pipelined floating-point multipliers and adders are standard on machines intended for scientific computing, since the speed at which floating-point operations can be done is the main performance limitation for scientific calculation.
• As in RISC architectures, addressing modes are not provided, and all memory operations are explicit loads and stores. The instruction and data memories are entirely separate, so no conflicts can occur between data and instruction fetching. Ideally, one load or store operation can be issued each cycle, although an operation may take several cycles to complete.
The data and instruction memories are each banked and can accept one request per cycle, as long as no bank has to process two requests simultaneously.
When multiple requests are made to the same bank, the processor freezes until the first request is completed. For instruction fetches, the pre-fetch mechanism described in the next section can be used to guarantee that an instruction is always ready to execute. Various techniques are suggested in the Data Memory section below to make freezing adequately rare.
• The system is fully synchronous, in that all operations finish after some known multiple of the basic clock period. Although asynchronous systems can be built, such systems are difficult to schedule efficiently, and schedules are difficult to debug. Scheduling hardware is expensive, and would slow the basic operations of the machine. Our design requires no scheduling hardware. It is the responsibility of the compiler to create good fixed schedules for instruction execution, given the execution times for the various instructions as parameters.
• No interrupt handling or fast context-switching is provided in our processor. Rather than slow down our main processor to handle these rare tasks, we will use cheaper, slower processors to handle instruction traps, page faults, and I/O. Sometimes, albeit rarely, we may [...]

[...] their memories, or ready to become active. After an instruction is passed to the decoder, control passes from the active unit to a different pre-fetch unit. For normal straight-line code, control passes to the right from unit i to unit (i + 1) mod 2^n. For jumps, control can pass to any pre-fetch unit that is ready.

For the machine to work without delays, instruction pre-fetches must be started well before the instruction is executed. Pre-fetches are started automatically for normal straight-line code. Since straight-line code proceeds from left to right across the pre-fetch units, it suffices for each unit that starts a pre-fetch to tell its right-hand neighbor to start fetching the next address on the next cycle. Since multiple target addresses must be ready at jump instructions, some additional mechanism is needed for starting instruction fetches. Jump targets are started with explicit pre-fetch instructions.

[...] around the ring, and the signals on the top communicate with the instruction buses. Note that in a ROPE with 2^n pre-fetch units, the bottom n bits of the address select the pre-fetch unit, and do not have to be passed around the ring.
The high-order part of the address that is passed around the ring does not need to be changed between units, except when passed from unit 2^n - 1 to unit 0, where it must be incremented.
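The address split described above can be sketched as follows. This is a minimal illustration assuming a ring of 2^n units and plain integer word addresses; the function names are hypothetical and do not come from the original design.

```python
from typing import Tuple


def unit_for(address: int, n: int) -> int:
    """Index of the pre-fetch unit that holds `address` (the bottom n bits)."""
    return address & ((1 << n) - 1)


def high_part(address: int, n: int) -> int:
    """High-order part of the address, the only part passed around the ring."""
    return address >> n


def pass_right(unit: int, high: int, n: int) -> Tuple[int, int]:
    """Right-hand neighbor and the high-order part it receives.

    The high-order part is unchanged between units, except when the signal
    wraps from the last unit back to unit 0, where it is incremented.
    """
    next_unit = (unit + 1) % (1 << n)
    next_high = high + 1 if next_unit == 0 else high
    return next_unit, next_high


if __name__ == "__main__":
    n = 3  # a ring of 8 units, chosen only for illustration
    for address in (5, 6, 7, 8):
        print(address, unit_for(address, n), high_part(address, n))
```

Passing the signal to the right is then equivalent to moving from address a to address a + 1, which is the property straight-line pre-fetching relies on.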
A pre-fetch unit has two state bits that determine its behavior: busy and target. A pre-fetch unit is busy if it is in the middle of a fetch, and ready when it has data to be put on the instruction bus. A pre-fetch unit is a target if the word fetched was requested as a result of an explicit PRE-FETCH instruction, and a non-target if the word was requested from the unit to its left.
The pre-fetch units can best be understood by examining what they do with the signals passed to them from the top or left.
A start-fetch.left signal is ignored by target units, but starts a fetch of address.left on non-target units. Whenever a fetch is started, the unit sends a start-fetch.right signal on the next cycle, passing the address it is fetching to address.right.
The unit becomes busy until the fetch is completed.
Note that a start-fetch.left signal could be received by a non-target unit that is busy with a previous fetch, in which case the previous fetch is aborted.
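The signal handling described above can be sketched as a small state machine for a single unit. This is an illustrative model under stated assumptions, not the hardware design: the class and method names are hypothetical, `address` stands for the value passed around the ring, and the text does not say when a unit stops being a target, so that transition is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefetchUnit:
    fetch_time: int                       # cycles from request to ready (at least 1)
    target: bool = False                  # fetch came from an explicit PRE-FETCH
    busy: bool = False                    # in the middle of a fetch
    ready: bool = False                   # has a word for the instruction bus
    address: Optional[int] = None         # address currently being fetched or held
    remaining: int = 0                    # cycles left in the current fetch
    pending_right: Optional[int] = None   # address to send right on the next cycle

    def _start(self, address: int, target: bool) -> None:
        # Starting a fetch discards any fetch already in progress.
        self.address = address
        self.target = target
        self.busy, self.ready = True, False
        self.remaining = self.fetch_time
        self.pending_right = address      # start-fetch.right goes out next cycle

    def prefetch(self, address: int) -> None:
        """Explicit PRE-FETCH arriving from the top: the unit becomes a target."""
        self._start(address, target=True)

    def start_fetch_left(self, address: int) -> bool:
        """start-fetch.left: ignored by targets, (re)starts a fetch otherwise."""
        if self.target:
            return False
        self._start(address, target=False)
        return True

    def tick(self) -> Optional[int]:
        """Advance one cycle; return the address sent as start-fetch.right, if any."""
        outgoing, self.pending_right = self.pending_right, None
        if self.busy:
            self.remaining -= 1
            if self.remaining == 0:
                self.busy, self.ready = False, True
        return outgoing
```

A ring simulation would feed each unit's `tick()` result into its right-hand neighbor's `start_fetch_left`, so that several start-fetch tokens can be in flight at once.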
There are usually several start-fetch tokens being passed around the ring at once. A pre-fetch unit becomes ready F cycles after a fetch is requested for it. On the cycle after a jump, each of the targets will need to be ready, and the F - 1 units to the right of each target must have started pre-fetching. Thus a k-way jump requires at least kF pre-fetch units. A four-way jump with a six-cycle fetch time requires 24 pre-fetch units. If jumps are close together, each target branch need only pre-fetch up to the next jump instruction, and fewer units are needed. A ROPE machine with 32 or 64 pre-fetch units should achieve almost all the possible speedup of this architecture.
If not enough pre-fetch units are available, programs will be slowed down, but only by the number of pre-fetches that do not fit (not by the time required for each fetch), since pre-fetches have no data dependencies, just resource availability constraints.
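The unit-count rule above is simple arithmetic, sketched here. The function names are hypothetical, and the handling of short branches (capping each branch at its distance to the next jump) and of the shortfall is one reading of the text rather than a stated formula.

```python
from typing import Sequence


def min_prefetch_units(ways: int, fetch_time: int) -> int:
    """Units needed for a k-way jump when every target needs a full window of F units."""
    return ways * fetch_time


def units_for_branches(branch_lengths: Sequence[int], fetch_time: int) -> int:
    """Units needed when some target branches reach their next jump in fewer
    than F instructions (assumed: each branch needs min(F, length) units)."""
    return sum(min(fetch_time, length) for length in branch_lengths)


def prefetches_that_do_not_fit(needed: int, available: int) -> int:
    """Number of pre-fetches left over when the ring is too small. The text
    says the slowdown grows with this count, not with the fetch time."""
    return max(0, needed - available)


if __name__ == "__main__":
    print(min_prefetch_units(ways=4, fetch_time=6))
    print(units_for_branches([2, 10, 3, 6], fetch_time=6))
    print(prefetches_that_do_not_fit(needed=24, available=16))
```

The first call reproduces the four-way, six-cycle example from the text.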
The compiler must schedule the pre-fetches and assign code addresses to minimize the waiting for instruction fetches. For tree-structured control flow with infrequent branching, scheduling pre-fetches and assigning code addresses is easy. If the branching is frequent and not enough pre-fetch units are available, delays are unavoidable with any schedule. Assigning addresses is difficult for a code block that has multiple predecessors, such as the entrance to a loop or the statement after an if-then-else.
Such a code block may need to be duplicated to avoid conflicting requirements on its placement.
We believe that multi-way jumps will prove to be a valuable part of ROPE. Combining basic blocks and using multi-way jumps should allow us longer sections of straight-line code than compilers for conventional machines, which examine only basic blocks, since basic blocks are usually less than five instructions [12]. The main cost of a jump instruction on conventional machines is the fetch time for a non-sequential instruction.
With our architecture, the pre-fetches, tests, and jumps can be scheduled independently, and therefore do not slow the machine. Separating pre-fetches, tests, and jumps may improve performance significantly, even without multi-way jumps.

[...] placed. All the targets of a conditional jump need to be sufficiently separated so the pre-fetching mechanism can have all the targets ready simultaneously. Each code fragment is placed so that the jumps into or from already placed fragments are all satisfied. If no such placement can be found, part of the code fragment may need to be duplicated.
CURRENT WORK
We have constructed five simulated architectural models: two correspond to conventional machine architectures, one is the ROPE architecture, one is a VLIW architecture, and the last combines ROPE and VLIW ideas. We are implementing a percolation-scheduling compiler for all five architectures, but for the comparisons in this section percolation scheduling is used only for the ROPE machines.
The simplest architectural model executes each instruction in the object program sequentially. The machine contains no cache, so all jumps require memory accesses. Standard optimizations of the code (such as dead code removal) are assumed, but code rearrangement, pre-fetching, and hardware scheduling are all irrelevant for this machine.
The second model is representative of several existing supercomputers (for example, the Cyber 205, Cray-1, and CDC 7600). It has a fully pipelined data path, identical to the ROPE data path. This machine also has a hardware scheduling mechanism that guarantees executing dependent operations in the same order as they were issued, but allows independent instructions to be executed in any order. The machine contains a program cache, and for fair comparison with our architecture, we assume 100% hit ratios. That is, the machine can fetch and decode any instruction in one cycle, but instructions cannot be issued beyond a conditional jump until the condition has been resolved. The data memory is banked, and we assume that the data layout permits memory accesses to start every cycle. This is unrealistic unless sophisticated tools for memory disambiguation and layout [2], [11] are used. Although most existing compilers do not provide such support, we include the assumption so our comparison is not biased towards the percolation scheduling (PS) and ROPE approach. We also allow this architecture an optimizer that can perform code reorganization within basic blocks. For example, jumps can be moved upwards inside basic blocks when not hindered by dependencies, as in MIPS [8].

The third simulated machine is our ROPE architecture. It has the structure described in section 2, and uses PS and the mapping techniques described in section 3. No cache or runtime scheduling hardware is provided in this machine; the compiler is completely responsible for the correct execution and the efficiency of the machine, including proper data/program bank accesses.

Both of the architectures with pipelined data paths assume that the registers read by an instruction are not needed again by the instruction after they have been read, and that they are read in a fixed cycle of the instruction.
This assumption removes most read-write dependencies and is realistic for a pipelined architecture.
The fourth model is an idealized VLIW machine [3]. While the instruction timings are realistic, we allow as many resources (such as functional units, buses, memory ports, register ports) as required for peak performance.
The functional units are pipelined to accept one operation per cycle, and trace scheduling is used to schedule the input program. We assume sufficient instruction pre-fetching on the on-trace path and on unconditional jumps so that instructions in that path can be issued every cycle. Off-trace conditional jumps require a memory access and are therefore slower.
The last model combines VLIW functional-unit parallelism with ROPE instruction pre-fetch and uses a percolation scheduling compiler. The combination of ROPE and PS makes the VLIW architecture much less sensitive to branch probabilities. For the following example we use the timings shown in Figure 7. These timings are consistent with current off-the-shelf components.
We have also considered other timings (for example, those of the Cray-1). Choosing other times will not affect the ROPE architecture or the percolation scheduling compiler, but may of course change the speed of programs. Despite the greater hardware complexity of the conventional pipelined architecture and of the VLIW model, our preliminary results show significant speedups for PS and the ROPE architecture over all other models, even on small problems (binary search, bubble sort, Livermore Loop 24, and matrix multiplication). A further speedup is expected in a hardware implementation of ROPE, since the simple, uniform architecture should allow a shorter cycle time than a machine with hardware scheduling.
An Example
The code in Figure 8 (Livermore loop 24) will be used to illustrate our approach.
Loop 24 finds the location of the minimum of an array of floating-point numbers. For all the architectures discussed, the loop has been unwound three times to increase potential pipelining, and traditional optimizations have been done.
Executing the loop sequentially requires between 70 and 73 cycles, depending on which branches of the conditionals are taken, for an average of 71.5 cycles.
The loop body requires between 38 and 41 cycles to execute on the machine with hardware scheduling, for an average of 39.5 cycles. This architecture is about 1.8 times as fast as the sequential machine on this example, which is consistent with the speedups reported in [18]. The actual performance of the Cray-1 (cycle time 12.5 nanoseconds) for this loop is about 2.3 Mflops according to the Livermore benchmarks.

For the VLIW machine, the intermediate (NADDR) code and the trace-scheduled code are shown in Figure 9. On-trace jumps and unconditional jumps are assumed to be pre-fetched. Off-trace jumps are shown explicitly by arrows and take longer than on-trace jumps.
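The speedup of the hardware-scheduled machine over the sequential machine follows directly from the cycle counts quoted above. The short check below takes the average as the midpoint of each quoted range, which matches the averages given in the text; the variable and function names are illustrative only.

```python
def midpoint(low: float, high: float) -> float:
    """Average of a cycle-count range, taken as its midpoint."""
    return (low + high) / 2


# Cycle counts for the unwound Loop 24 body, as quoted in the text.
sequential_avg = midpoint(70, 73)   # sequential machine
scheduled_avg = midpoint(38, 41)    # machine with hardware scheduling

# Ratio of the two averages: how many times faster the scheduled machine is.
speedup = sequential_avg / scheduled_avg

if __name__ == "__main__":
    print(f"sequential: {sequential_avg} cycles")
    print(f"hardware scheduling: {scheduled_avg} cycles")
    print(f"speedup: {speedup:.2f}")
```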
CONCLUSIONS
Our preliminary results are encouraging and we believe that our approach has significant advantages for the development of a cheap, high-performance machine. ROPE can be used by itself, or combined with VLIW architectures.
The ability to handle complex and unpredictable flow of control could significantly enlarge the class of applications for which VLIWs are attractive.
