Pipeline interlocks are used in a pipelined architecture to prevent the execution of a machine instruction before its operands are available. An alternative to this complex piece of hardware is to rearrange the instructions at compile-time to avoid pipeline interlocks. This problem, called code reorganization, is studied.
Introduction
Recent research in computer architecture centers around two major trends: the development of architectures that attempt to support high level language systems through more sophisticated instruction sets, and the design of simpler architectures that are inherently faster but may rely on more powerful compiler technology. The latter trend has several properties that make it an attractive host for high level languages and their compilers and optimizers:
1. Because the instruction set is simpler, individual instructions execute faster.
2. ... instructions [12].
3. Although these architectures may require more sophisticated compiler technology, the potential performance improvements to be obtained from faster machines and better compilers are substantial.
Recently, several articles have discussed the relationship between compilers, architectures and performance [16, 5]. The concept of simplified instruction sets and their benefits, both for compilers and hardware implementations, is presented in [11, 12].
The unique property of some of these experimental architectures is that they will not perform efficiently without more sophisticated software technology. This paper investigates a major problem that arises when generating code for a pipelined architecture that does not have hardware pipeline interlocks.
Without hardware interlocks, naively generated code sequences will not run correctly. These interlocks must be provided in software by arranging the instructions and inserting no-ops (when necessary) to prevent undefined execution sequences. There are currently several architectures that require software imposition of certain types of interlocks [10, 6]. In a pipelined architecture the design of the interlock mechanism is complex and adds significantly to the hardware overhead of a high-performance processor [9, 15]. The presence of interlock checks also imposes an overhead on all instructions whether or not they are affected by the interlocks. The elimination of this hardware will allow a simpler design. If the implementation of a processor on a single VLSI chip is attempted, simplicity and regularity can be crucial for the success of the project. Also, a simpler design allows for more regularity and eventually a smaller and faster chip.
Another reason for the interest in such an architecture is the potential speed-up. If the code can be reordered so that legal instructions are always being executed, then no time is lost due to the resolution of interlocks, and the code will run faster than the unordered code would run on a pipelined machine with interlocks. A penalty is only paid when the code sequences have to be padded with no-ops; even in this case there may not be a time penalty involved. The elimination of the occurrence of dynamic interlocks will also speed up conventional pipelined architectures.
The problem
The problem faced by a code generator for such an architecture is to guarantee the correct execution of the original program, i.e. to ensure that the input-output function of the program is identical when the program is executed in a single step fashion and when it is executed without interlocks by pipelined hardware.
Interlocks
Due to the overlap of the execution of instructions and the pipeline structure, the results of instruction i are not available until instruction i+k. An attempt by instruction i+k' with k' < k to reference data written by instruction i will result in a delay. A source-destination pipeline interlock is the same as a destination-source interlock except that R is initially a source and later used as a destination.
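The destination-source rule can be illustrated with a minimal Python sketch, assuming a simplified model in which every instruction has the same delay k and at most one destination register. The `Instr` record, the `pad_with_noops` helper, and the register names are hypothetical and are not taken from the system described in this paper.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()     # registers read


NOP = Instr("nop", None, ())


def pad_with_noops(code, k):
    """Insert no-ops so that a register written by the instruction in
    slot i is not read before slot i + k (assumed uniform delay k)."""
    out = []
    ready_at = {}                  # register -> first slot where it may be read
    for ins in code:
        earliest = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        while len(out) < earliest:
            out.append(NOP)        # software-imposed interlock
        if ins.dest is not None:
            ready_at[ins.dest] = len(out) + k
        out.append(ins)
    return out


code = [Instr("ld", "r1"), Instr("add", "r2", ("r1",))]
padded = pad_with_noops(code, k=2)
print([ins.name for ins in padded])
```

Padding alone only restores correctness; it never reorders instructions, so every unresolved interlock costs a slot.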
Destination-source interlocks are a natural result of a pipeline structure. Source-destination conflicts result from the possibility of interrupts or page faults. In Figure 1-1 the execution of (2,psa) will normally precede the execution of (1,psc). But if there is a page fault when instruction 2 is fetched, instruction 1 will complete. Then the page fault will be handled. In this case, stage (2,psa) will not be executed before (1,psc). Therefore, (1,psc) should not depend on the results of (2,psa).
of the first (logical) operation will never be used, otherwise this would be a source-destination conflict.) But the state of the register will be undetermined, so this has to be avoided either by dead code removal or by the reorganization outlined below.
Possible solutions
There are two possible approaches to solving the code generation problem in the presence of noninterlocked hardware. First, the problem can be stated as an extension to the standard dag-based code generation problem [2]. In this form it is clear that the code generation problem for the dags with interlocks is at least as hard as the optimal code generation problem for a register-based machine (known to be NP-complete). Furthermore, a heuristic code generation algorithm for dags would probably be extremely complex. Some of these problems can be solved by using a tree representation and ignoring the possibility of common subexpressions and the existence of multiple trees in a basic block.
However, this simplified approach may result in unacceptable code quality, particularly since machine instructions belonging to more than one statement cannot be intermixed to avoid interlocks.
An alternative approach is to reorganize the code to meet the interlock requirements in a postpass following code generation.
The most important advantage of this approach is that it can be applied both to code output from a compiler and to hand-written assembly language code. The absence of hardware interlocks makes it extraordinarily difficult to generate correct programs in assembly language, and a software reorganization tool is needed to support such programming. The other advantage is the decomposition of the problem into two simpler problems, although this means that the final result may not be optimal. This is probably not serious since the goal of optimal code is not obtainable with a practical algorithm independent of the two-step approach. Very slow near-optimal algorithms may be unsuitable because they must be used during each compilation, not just optimizing runs.
A primary characteristic of the horizontal microcode optimization problem is the assumption that resource utilization by an operation is for a fixed period independent of the operation's context. Microcode optimization also assumes that the order of memory accesses cannot be altered from the original version of the microcode.
In contrast, the pipeline reorganization problem concerns interlocks whose effect is a dynamic property. The context of a particular instruction determines whether or not that instruction is legal in its current position. Also, reorganization utilizes interchanges of loads and, to a lesser extent, stores, as a primary optimization technique.
Problem representation
The problem of code reorganization to meet pipeline constraints requires a representation of the machine level code that is generated without the effects of the interlocks, and a represen-
A heuristic solution
As this reorganization process will be part of each compilation and as generating optimal code is very expensive, we will concentrate on "good" solutions rather than on optimal ones. The basic strategy is as follows:
1. Read in a basic block and create a machine-level dag.
2. At any point determine sets of instructions that can be generated.
3. Eliminate any sets that cannot be started immediately.
4. Choose among the sets left.
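The four steps above can be sketched as a small greedy scheduler in Python. This is an illustrative simplification, not the reorganizer described in this paper: it assumes a uniform delay k, derives dependence edges from the original instruction order, picks single instructions rather than sets, and uses "lowest original index" as a stand-in heuristic. It performs no look-ahead for blocking. `Instr`, `depends`, and `reorganize` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()     # registers read


NOP = Instr("nop", None, ())


def depends(earlier, later):
    """True if `later` must stay after `earlier` (read-after-write,
    write-after-read, or write-after-write on the same register)."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = earlier.dest is not None and earlier.dest == later.dest
    return raw or war or waw


def reorganize(block, k):
    n = len(block)
    # Step 1: machine-level dag, stored as predecessor sets.
    preds = [{i for i in range(j) if depends(block[i], block[j])}
             for j in range(n)]
    issued = {}                    # block index -> slot in the output
    out = []
    while len(issued) < n:
        slot = len(out)
        # Step 2: instructions whose predecessors have all been emitted.
        ready = [j for j in range(n)
                 if j not in issued and all(i in issued for i in preds[j])]
        # Step 3: drop those that would read a result too early.
        startable = [j for j in ready
                     if all(slot - issued[i] >= k
                            for i in preds[j]
                            if block[i].dest in block[j].srcs)]
        if startable:
            j = startable[0]       # Step 4: placeholder heuristic choice
            issued[j] = slot
            out.append(block[j])
        else:
            out.append(NOP)        # nothing legal: pad with a no-op
    return out


block = [Instr("ld_a", "r1"), Instr("add1", "r2", ("r1",)),
         Instr("ld_b", "r3"), Instr("add2", "r4", ("r3",))]
print([ins.name for ins in reorganize(block, k=2)])
```

In this example the two loads are interleaved so that each add falls outside the delay of its load, and no no-ops are needed.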
The same register can be used in different parallel dags in a basic block (or in parallel subdags within a single dag). Because of this it is possible to emit code for the nodes in a dag such that two parallel parts of one or more dags will block each other.
Definition 4-1: Two machine instructions conflict if each has a destination that is a source for the other instruction.
Definition 4-2: A code reorganizer is blocked if it reaches a point where the only remaining choices of instructions conflict.
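Definitions 4-1 and 4-2 can be written as two small predicates. The Python below is a sketch under assumptions: each instruction has at most one destination, and "the only remaining choices conflict" is read as "at least two choices remain and each one conflicts with some other remaining choice". The `Instr` record and the function names are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()     # registers read


def conflict(a, b):
    """Definition 4-1: each instruction's destination is a source of the other."""
    return (a.dest is not None and a.dest in b.srcs
            and b.dest is not None and b.dest in a.srcs)


def blocked(choices):
    """Definition 4-2 under the reading stated above (an assumption)."""
    return len(choices) >= 2 and all(
        any(conflict(a, b) for b in choices if b is not a)
        for a in choices)


n1 = Instr("n1", "d1", ("d2",))    # writes d1, reads d2
n2 = Instr("n2", "d2", ("d1",))    # writes d2, reads d1
n3 = Instr("n3", "d3")             # independent of both
print(conflict(n1, n2), blocked([n1, n2]), blocked([n1, n2, n3]))
```

Whichever of n1 and n2 is emitted first overwrites a register the other still needs to read, which is why a reorganizer left with only such choices cannot continue.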
Because of the potential for blocking, when selecting the next instruction it is not sufficient for the reorganizer to look only at the instructions which are ready to be scheduled. Instead, the reorganizer must look ahead to determine that the nodes being selected will not lead to a deadlock situation.
In Figure 4-1 the transformation will end in a deadlock if the schedule starts with instruction 1 followed by instruction 3. To capture this problem, we introduce the notion of a safe set of nodes in a dag. Safe paths are important because the algorithm we propose will always follow a safe path (or several noninterfering safe paths) in the process of code generation.
The Reorganization Algorithm
The reorganization algorithm is a constraint algorithm; since it is nonoptimal, some heuristic choices are incorporated. The general structure of the algorithm is to find the acceptable set of next instructions and then to heuristically choose from among them.
Figure 4-1). Thus, all we need demonstrate is that the reorganizer will never reach such a state. This is proved by contradiction of the induction hypothesis.
Assume the reorganizer reached a blocking state for nodes n1 and n2 with destinations d1 and d2 and children s1 and s2 respectively. (The destinations of s1 and s2 are d2 and d1 respectively.) Assume that s1 was chosen before node s2, and consider the algorithm at the point s2 is chosen.
To choose s2 over n1, s2 must be the start of a safe path. The only safe position once s1 and s2 are covered is the entire dag. But there is no safe path starting with s2 that covers the entire dag (since safe paths must involve only parents of s2). Since the algorithm cannot block, if WillInterlock is false for some amount of instruction separation, then the algorithm must complete the code sequence. Furthermore, since the code is correct, we know that a complete and correct code sequence must be emitted.
Implementation
We have implemented a compiling system and reorganizer for MIPS (Microprocessor without Interlocked Pipe Stages) [6], an ongoing, experimental VLSI processor project. Currently, compilers for Pascal, Fortran, and C exist. These compilers generate machine-language level instructions that ignore the possibility of interlocks. The reorganizer is an implementation of the techniques described above; it also provides several other functions, such as limited instruction collapsing and instruction assembly. Although the pipeline interlocks in MIPS are straightforward, they significantly affect code generated from a standard compiler.
MIPS interlocks
MIPS has a six-stage pipeline with three active instructions occupying every other pipestage. For the purposes of this paper it is sufficient to consider only the destination-source interlocks that occur when registers are written and then used as sources on the next instruction.
Results from arithmetic operations can be written in pipestages 3 and 6, and registers can be loaded from memory during pipestage 5. Registers are used as sources for address calculations and arithmetic operations during pipestages 3 and 6, and as sources to be stored during pipestage 4. The interlocks that arise from this pipeline structure can be summarized as follows:
The reorganizer has given us an opportunity to evaluate the effectiveness of removing pipeline interlocks by compile-time analysis. In Figure 5-2 we show a typical sample instruction sequence and the legally padded sequence generated by adding no-ops. This figure also shows the resulting machine-level dag and the reorganized code sequence produced by the reorganizer.
The reorganized code has two fewer instructions than the code using no-ops. Since all MIPS instructions take the same amount of time to execute, and occupy the same amount of instruction space, this is a 30% improvement in execution time and instruction space. Consider the instructions which are generated for the statement:
Our first adoption of the portable C compiler [8] produced the instruction sequence shown in 
Global register allocation
Global register allocation also affects the reorganization process.
When registers are globally allocated, a register may be active at the first instruction of a basic block. This occurs when the register is a destination near the end of a previous basic block. In the general case of long interlock periods, the reorganizer will have to know or find the predecessor blocks to determine what registers are affected by interlocks at the beginning of the basic block.
In most practical instances this will not be necessary. When a basic block ends in a jump (or other nonsequential control transfer), the time to process the change of the PC will nearly always exceed the length of the longest register interlocks. Thus, the reorganizer will only need to consider interlocks that arise from sequential control flow into a basic block. These interlocks are easily computed when the previous basic block is processed.
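One way to compute the interlocks carried into a fall-through successor is sketched below in Python. It assumes a uniform delay k (a register written in slot i may first be read in slot i + k) and at most one destination per instruction. The `Instr` record and the `carried_interlocks` helper are hypothetical names, not part of the MIPS reorganizer.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()     # registers read


def carried_interlocks(block, k):
    """Map each register still 'active' at the end of `block` to the
    number of leading slots of the sequential successor block during
    which that register must not be read."""
    n = len(block)
    pending = {}
    for pos, ins in enumerate(block):
        if ins.dest is None:
            continue
        # First legal read slot, counted from the start of the next block.
        remaining = pos + k - n
        if remaining > 0:
            pending[ins.dest] = remaining
    return pending


prev_block = [Instr("add", "r1", ("r4", "r5")), Instr("ld", "r2")]
print(carried_interlocks(prev_block, k=3))
```

The resulting map can be used to seed the scheduler's state when the next basic block is reorganized, so that the first instructions of that block avoid reading the listed registers too early.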
Debugging
The code reorganization process is like an optimization in that the intermediate stages of the computation are altered. Thus, the same problems that arise when trying to debug optimized code occur, and similar solutions are appropriate.
The major effect of code reorganization on the debugging process is to move stores with respect to other stores and computations that may fail (for example, from an overflow).
These situations correspond to the type of reordering that can occur when generating code from a dag. This problem and potential solutions to the problem are discussed in depth in [7].
Conclusion
Modern advances in processor architecture and the constraints of
We give an alternative heuristic algorithm and some empirical data on its performance. Lastly, we discuss the integration of the postpass reorganizer into a compiler system.
