Optimal Code Scheduling for Multiple Pipeline Processors by Nisar, Ashar & Dietz, Hank
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
5-1-1990






Nisar, Ashar and Dietz, Hank, "Optimal Code Scheduling for Multiple Pipeline Processors" (1990). Department of Electrical and
Computer Engineering Technical Reports. Paper 725.
https://docs.lib.purdue.edu/ecetr/725






School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907
OPTIMAL CODE SCHEDULING FOR 
MULTIPLE PIPELINE PROCESSORS
A Thesis





In Partial Fulfillment of the 
Requirements for the Degree
of
Master of Science in Electrical Engineering 
August 1990
this is dedicated 
to my mom and dad
ACKNOWLEDGMENTS
This page of a thesis is usually reserved for us degree candidates to eulogize,
cajole or beg (not necessarily in that order) the members of our advisory
committees, in a glimmer of hope that they will not pronounce our theses and
claims to be baseless or “It has already been done!” So here goes.
Hank Dietz. It is an honor to know this person. A genius with a profound 
research insight, brilliant ideas, an amiable personality and, most importantly, a 
sparkling sense of humor. I have learned a lot from him — everything from 
Chimpminkpunk parodies to the secrets of preparing palatable fruit-punch. 
What can I possibly say in return except, perhaps, “Live long and prosper.”
I also offer my thanks to Professor Robert Fujii and Professor Shaheen 
Ahmad for serving on my advisory committee. My special thanks to Imran and 
Carol for their help and support.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 - INTRODUCTION
1.1. Introduction
1.2. Pipeline Characteristics
1.2.1. Compiler’s View
1.2.2. Architecture’s View
1.3. NOPs and Delay Slots
1.4. An Overview of This Document
CHAPTER 2 - BACKGROUND AND SURVEY OF RELATED LITERATURE
2.1. Introduction
2.2. An Example of Code Scheduling
2.3. The Complexity of Finding An Optimal Schedule
2.4. PostPass Code Optimization
2.4.1. Proof of NP-Completeness
2.4.2. The Algorithm
2.4.2.1. Reordering Constraints
2.4.2.2. Heuristics
2.4.2.3. Results
2.5. Improved Approximation Algorithm by David Bernstein
2.5.1. Background
2.5.2. Algorithm
2.6. Reorganizer for a Variable-Length Pipelined Microprocessor
2.6.1. Introduction
2.6.2. The Algorithm
2.7. Micro-Optimization of Floating-Point Operations
2.8. Scheduling Trees in Pipelined Environments
2.9. Summary
CHAPTER 3 - REFINED EXHAUSTIVE SEARCH
3.1. Introduction
3.2. Refined Exhaustive Search
3.2.1. Refinements
3.2.1.1. Preclusion
3.2.1.2. Pruning Based on Cost
3.2.1.3. Tree Rearrangement
3.2.1.4. Equivalence Test
3.2.2. Combining All Refinements
3.3. Problem Statement And Definitions
3.3.1. The Scheduling Model
3.3.1.1. An Example of the Pipeline Resource Model
3.3.1.2. An Example of the Task System
3.3.2. The Cost Criteria in Scheduling
3.3.3. Delay Slots and Optimal Schedule
3.4. Compiler Model
3.5. Scheduling Algorithm
3.5.1. Definitions
3.5.2. Algorithm D Compute
3.5.3. Algorithm A - Find Optimal Schedule
3.6. Proof of Optimality
3.6.1. Non-truncated Algorithm NT
3.6.1.1. Algorithm ALL
3.6.1.2. Algorithm LG
3.6.1.3. Algorithm B
3.6.1.4. Algorithm G
3.6.2. Truncating Algorithm
3.7. Summary














4.2. Structure of the Scheduler
4.2.1. Optimized Tuple Generation
4.2.2. List Scheduler
4.2.3. Pipeline Scheduler
4.2.4. Register Allocation and Code Generation
4.3. Pipeline Configuration Information
4.4. Summary
CHAPTER 5 - PERFORMANCE ANALYSIS
5.1. Introduction
5.2. Performance Metrics and Parameters
5.3. Construction of Synthetic Benchmarks
5.4. Simulation of the General Behavior
5.4.1. Procedure
5.4.2. Pipeline Constraints for Simulations
5.4.3. Results
5.4.4. Discussion
5.5. Variations in the Curtail Point
5.5.1. Procedure
5.5.2. Results
5.5.3. Discussion
5.6. Varying the Number of Variables
5.6.1. Procedure
5.6.2. Results
5.6.3. Discussion
5.7. Variation in the Pipeline Structure
5.7.1. Procedure
5.7.2. Results
5.7.3. Discussion
5.8. Sub-Optimal Solutions
5.8.1. Procedure
5.8.2. Results
5.8.3. Discussion
5.9. Summary
CHAPTER 6 - CONCLUSIONS
















LIST OF TABLES
Search Space for Exhaustive Search
Search Space for Representative Examples
Percent of NOPs Required for Different Heuristics
Sample Pipeline Description Table
Sample Operation-to-Pipeline Mapping
Sample Pipeline Description Table
Sample Operation-to-Pipeline Mapping










Pipeline Description for Simulations
Operation-to-Pipeline Mapping for Simulations
Statistics for Scheduling 16,000 Blocks
Variations in the Pipeline Structures
Final NOPs in Suboptimal Solutions




LIST OF FIGURES
2.1. Incorrect Code Sequence
2.2. Correct Code Sequence
2.3. Instruction Sequence without any Delay Slots
2.4. Schedules Searched Vs. Block Size Vs. Distribution of Inputs
2.5. Best Code Sequence for a Given Register Assignment
2.6. Improved Code Sequence
3.1. A search tree for exhaustive search
3.2. Code Sequence for the Preclusion Example
3.3. An Example of the Task System
3.4. Compiler Model with respect to the Scheduling Algorithm
4.1. Organization of Prototype Scheduling Compiler
4.2. Sample of Intermediate Form
5.1. Distribution of Sample Block Sizes
5.2. Initial and Final NOPs Vs. Block Size
5.3. Runtime (log scale) Vs. Block Size
5.4. Percentage Run To Completion Vs. Block Size
5.5. Average Search Calls Vs. Maximum Block Size
5.6. Average Percentage Run To Completion Vs. Curtail Point
5.7. Average Runtime Vs. Curtail Point
5.8. Average Percentage Run To Completion Vs. Variables
5.9. Average Runtime Vs. Variables
5.10. Average Percentage Run To Completion Vs. Pipeline Structure
5.11. Fraction of Initial NOPs Vs. Runtime
ABSTRACT
Ashar Nisar. M.S.E.E., Purdue University. August 1990. Optimal Code Scheduling
for Multiple Pipeline Processors. Major Professor: Dr. Henry Dietz.
Pipelining the functional units and memory interface of processors can result
in shorter cycle times and dramatic increases in performance, but only if the
pipeline delays can be hidden by other useful operations. The portion of pipeline
delays which is not hidden results in an extension of the total execution time,
either implemented by hardware interlocks or by compile-time insertion of NOPs
(Null Operations). By rearranging instructions, it is possible to minimize the
total pipelined execution time, but the problem of finding this optimal code
schedule is well known to be NP-complete.
In this thesis, we describe a code scheduler for multiple pipeline processors
where each pipeline may have a different latency and enqueue time. Previous
approaches simplify the search for a good schedule by arbitrarily imposing
constraints which sacrifice optimality; the technique given in this paper uses a
new set of pruning criteria which preserves optimality. Although, in the interest
of reducing compile time, the new technique permits the search to be truncated, this





CHAPTER 1
INTRODUCTION
1.1. Introduction
Most modern processors, especially RISC designs like Motorola’s 88000
[Mel88], MIPS R3000 [Rio88], SPARC [Muc88], etc., attempt to achieve a peak
performance of one instruction completing execution with every clock tick.
However, this does not imply that execution of a single instruction always
happens within a single clock tick; rather, pipelined hardware is used to overlap
execution of multiple instructions to achieve this throughput.
For example, if each instruction requires 5 clock ticks to execute, throughput 
of one instruction per clock tick can be obtained by allowing 5 instructions to 
overlap execution within a 5-stage pipeline. In order to obtain one instruction per 
clock tick throughput, one simply needs to have one instruction ready to enter 
the pipeline at every clock tick. The problem is that if code is generated from a 
high-level language in the most obvious way, many instruction sequences will 
require that a delay be introduced before the next instruction can be issued.
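The throughput argument above can be checked with a little arithmetic. The following Python sketch is only an illustration under simplifying assumptions: one instruction is issued per clock tick, every stage takes one tick, and all delays are lumped into a single `stall_ticks` count. The function names are ours and do not come from the thesis.

```python
def pipelined_ticks(n_instructions: int, stages: int, stall_ticks: int = 0) -> int:
    """Clock ticks needed to finish n instructions on an idealized pipeline.

    The first instruction needs `stages` ticks; every later instruction
    completes one tick after its predecessor, plus any delay ticks that
    the code sequence forces (assumed to be given as a total).
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + stall_ticks


def unpipelined_ticks(n_instructions: int, stages: int) -> int:
    """Clock ticks needed when instructions do not overlap at all."""
    return n_instructions * stages


# 100 instructions on a 5-stage pipeline: close to one instruction per tick.
ideal = pipelined_ticks(100, 5)
# The same code with 30 delay ticks introduced between instructions.
delayed = pipelined_ticks(100, 5, stall_ticks=30)
# No overlap at all.
serial = unpipelined_ticks(100, 5)
```

With these numbers the ideal pipelined run takes 104 ticks, the delayed run 134, and the non-overlapped run 500, which is why reducing the introduced delay matters.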
The problem of compiling code so as to minimize the total delay which must 
be introduced is nearly as old as the concept of pipelining hardware, and appears 
to have been considered as early as the 1950s. In the 1960s, as circuitry became 
inexpensive enough to make the hardware cost-effective, machines with multiple 
functional units became common: typically, independent adders and multipliers 
which could operate in pipelined overlap with other instructions. Most of the 
compiler research centered on the development of heuristics which could be used 
to “generate” code so that total delay would be reduced for such machines; a 
reasonable overview appears in [CoS70].
Although the compiler techniques used to generate low-delay code were
reasonably effective, they generally assumed that the code-generation process was
relatively straightforward; in other words, these techniques become awkward
when other compiler optimizations are also being performed. For this reason, the
emphasis has shifted from heuristics for generating code to heuristics for
reorganizing, or scheduling, code after it has been generated using whatever other
optimizations were appropriate.
Probably the best known work in instruction scheduling for pipelined
processors is by Gross, detailed in [Gro83]. Gross proposed a heuristic algorithm
for reordering instructions and showed that, although his heuristic typically does
not result in the minimum delay (optimal schedule), the algorithm executes
quickly and generally yields good results. By applying his algorithm to the
optimized assembly language output of a compiler, he also avoids the complexity
of integrating scheduling with the other optimizations within the compiler. It
appears that this is a reasonable approach, except that the compiler has
already performed register allocation. Hence, the register assignment can impose
unnecessary restrictions on the schedule, resulting in unnecessary execution
delays.
Bernstein presented an improved scheduling algorithm, but his work
considers only pipelines having a fixed delay [Ber88]. Abraham et al. [AbP88]
permitted variable-delay pipelines, but resorted to a greedy heuristic algorithm
instead of searching for the optimal schedules.
The algorithm we propose differs from previous work in several ways:
[1] We apply our algorithm to an intermediate form of code which does not
have specific registers assigned; hence register allocation happens after
scheduling and the scheduler is not unnecessarily constrained.
[2] Although our algorithm is also heuristic, none of the heuristics applied
sacrifices optimality. In other words, the search space is pruned
dramatically, but the optimal solution will never be pruned. In cases where
the pruned search space is still too large, the search may be terminated after
an arbitrary number of cases have been examined, but this happens only
rarely and still generally results in very good schedules.
[3] The target pipeline architecture model supported is significantly more
general than that typically used, permitting multiple pipelines, each with its
own latency and enqueue time, to be specified. In particular, we believe our
proposal is the first to consider the pipeline enqueue time as a key pipeline
parameter (relating to conflict-induced delays, described in section 1.2.1).
Using reasonable compile-time time limits, the algorithm we propose was found to 
generate provably optimal schedules for 15,812 of the 16,000 synthetic benchmark 
programs examined (over 98%).
1.2. Pipeline Characteristics
In describing the basic characteristics of pipelined computer systems, it is
useful to consider the compiler and architecture aspects separately. Naturally,
this work is more concerned with the compiler’s view; however, the discussion of
the architectural structures clarifies how the proposed scheduling model applies to
various real machines.
1.2.1. Compiler’s View
As a compiler views a pipelined machine, the main concern is simply that the 
order in which instructions are executed must be sensitive to various pipeline- 
related timing constraints. It is convenient to think in terms of the incremental 
task of trying to generate code for the next in a sequence of instructions.
There are two primary reasons for which execution of an instruction might 
need to be delayed:
• Dependence. A dependence occurs when this instruction uses a result 
computed by an earlier instruction, but the earlier instruction has not yet 
completed pipelined execution. Violating a dependence generally results in 
incorrect results being computed.
•  Conflict. A conflict occurs when this instruction requires access to a 
hardware structure which is still being used by the pipelined execution of an 
earlier instruction. An unresolved conflict results in a pipeline hazard and 
unpredictable behavior.
Dependence is the most common reason for requiring delays. For example,
loading a datum from memory into a register might be an instruction which takes
4 clock ticks to execute, but the very next instruction might depend on the value
being loaded. Consider typical code implementing the addition of X to register
R0:
Load R1,X    ;make register R1 = memory[X]
Add  R0,R1   ;make register R0 = R0 + R1
If the hardware were simply to enqueue the load in the pipeline and, in the very
next cycle, attempt to use the register, the wrong value would be obtained; hence,
some technique must be used to prevent the second instruction from executing
until after the first has completed. This would introduce a delay of 3 clock ticks
between the Load and Add instructions.
Notice that traditional compiler code generation techniques tend to load 
values on demand, resulting in code sequences which have many such 
dependences.
Modifying the above example, a conflict would arise instead of a dependence
if the second instruction is another Load instruction and, for example, the
hardware required the memory address register (MAR) to hold the memory
address being accessed for the first 2 clock ticks of the Load operation.
Consider:
Load R1,X    ;make register R1 = memory[X]
Load R2,Y    ;make register R2 = memory[Y]
In this case, the second Load would have to be delayed until the first Load
had finished using the MAR; that is, a delay of 1 clock tick would have to be
placed between the two Load operations.
Hence, there is a significant difference between dependence-induced and 
conflict-induced delays: beside the semantic differences, they generally do not 
imply the same amount of delay. For each pipeline, the compiler needs to be 
aware of two separate parameters corresponding to the delay times seen for 
dependence and conflict resolution, respectively:
• Latency. The pipeline latency is the number of clock ticks which must
occur between enqueuing an operation in a pipeline and the result of that
operation becoming available. In other words, it is the minimum time
between issuing an instruction and issuing a second instruction which has a
dependence on the first; the “depth” of the pipeline measured in units of
time.
• Enqueue time. The pipeline enqueue time is the minimum number of clock
ticks which must occur between enqueuing one operation in a particular
pipeline and enqueuing a second operation in that pipeline. In other words,
it is the minimum time between items in a pipeline.
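The two parameters can be turned into a small delay calculation. The sketch below is a minimal illustration, not the scheduling model of Chapter 3: the `Pipeline` class, the `delay_between` function, and the sample values (the 4-tick Load with a 2-tick MAR hold used in the examples above) are names and assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    name: str
    latency: int   # ticks from enqueue until the result is available
    enqueue: int   # minimum ticks between two enqueues in this pipeline


def delay_between(first: Pipeline, ticks_apart: int, *,
                  dependent: bool, same_pipeline: bool) -> int:
    """Delay ticks needed before a second instruction that would otherwise
    issue `ticks_apart` ticks after an operation enqueued in `first`."""
    need = 0
    if dependent:        # dependence: wait for the result (latency)
        need = max(need, first.latency - ticks_apart)
    if same_pipeline:    # conflict: wait for the pipeline to accept (enqueue time)
        need = max(need, first.enqueue - ticks_apart)
    return need


# Assumed values matching the Load examples in the text.
LOAD = Pipeline("load", latency=4, enqueue=2)
# A functional unit that is not internally pipelined: enqueue time = latency.
DIVIDER = Pipeline("divide", latency=10, enqueue=10)

dependence_delay = delay_between(LOAD, 1, dependent=True, same_pipeline=False)
conflict_delay = delay_between(LOAD, 1, dependent=False, same_pipeline=True)
```

Under these assumptions the dependent Add needs 3 delay ticks and the second Load needs 1, matching the two examples; an instruction that is neither dependent nor in the same pipeline needs none.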
For a classical pipeline, the latency is a few clock ticks and the enqueue time
is 1 clock tick (since each stage of the pipeline uses functional units independent
from those of other stages). However, it is not uncommon to find hardware being
shared by a few pipeline stages (or, equivalently, to find each stage taking a few
cycles). Further, machines which have functional units that can operate in
parallel with other functional units but are not internally pipelined are easily
modeled by making each functional unit appear as a pipeline where the enqueue
time = latency.
The fact that some architectures have multiple pipelines raises yet another 
issue in the compiler’s management of pipelined systems: the compiler may have 
to decide which of several viable pipelines to use for each operation. For 
example, in a machine with two pipelined multipliers, which multiplier should be 
used for each operation?
1.2.2. Architecture’s View
In the compiler’s view we identified the causes of execution delays, but we
did not define their architectural implementation. When a dependence or conflict
would otherwise cause improper execution, the architecture must have some
mechanism for introducing the appropriate delay. In discussions of pipelined
hardware, these delays are sometimes referred to as “pipeline bubbles” [Pat85].
There are three basic approaches to forcing a delay:
• Implicit interlock. In this technique, the hardware checks each instruction
just before execution to make sure that it does not depend on the results of
any operations which are currently in the pipeline. If there is such a conflict,
the hardware simply delays issuing the instruction until the conflicting
operation in the pipeline has completed.
The implicit interlock approach has long been the standard approach.
It continues to be used in most modern processors, including RISC-style
architectures such as the IBM 801 [Rad83], RISC II, and SPARC [Gar88]
architectures.
• Explicit interlock (explicit waiting). In this technique, the compiler
marks each instruction with a tag indicating whether it must wait for a
particular pipelined operation to complete before this instruction can begin
executing. This technique is very similar to an implicit interlock; however,
the hardware is simpler since it does not need to detect which operations
interfere.
The machine being developed by Tera [Smi88] uses an explicit interlock
based on the compiler tagging instructions with a count field which gives the
number of instructions since the last instruction that this instruction depends
on or conflicts with. Another example of explicit interlock is the proposed
CARP machine [DiS89]; CARP uses a bit mask in each instruction to
indicate which variable-latency resources (e.g., global memory accesses using
an interconnection network) each instruction must wait for.
• NOP insertion (padding). In this technique, the compiler takes full
responsibility for the management of the pipeline by simply placing NOPs
(Null OPerations, instructions known to be non-interfering with any type
of pipeline activity) between instructions which would otherwise result in
pipeline conflicts. The hardware is the simplest of the three techniques, but
the compiler must perform analysis of the pipeline activity implied by the
code.
The best known example of NOP padding for introducing delays is
probably the MIPS processor [Hen81], although this seems to be becoming
more popular as a general approach. For example, much of the work toward
GaAs processors uses NOP padding. Further, pipelines with fixed latency
are handled in this way in the CARP machine [DiS89].
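To make the compiler's side of NOP insertion concrete, the following Python sketch pads a fixed instruction sequence with NOPs. It is a simplified illustration, not the thesis's scheduler: it does not reorder code, it tracks only value (producer-to-consumer) dependences and per-pipeline enqueue conflicts, and the `Instr` record, the `PIPES` table, and its latency and enqueue values are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instr:
    text: str                     # assembly text to emit
    pipe: str                     # pipeline the operation is enqueued in
    dest: str = ""                # register written ("" if none)
    srcs: Tuple[str, ...] = ()    # registers read


# pipeline name -> (latency, enqueue time); illustrative values only
PIPES = {"load": (4, 2), "alu": (1, 1)}


def pad_with_nops(code: List[Instr], pipes=PIPES) -> List[str]:
    """Emit `code` in its given order, inserting NOPs wherever a
    dependence or a pipeline conflict would otherwise be violated."""
    ready: Dict[str, int] = {}   # register -> tick its value is available
    free: Dict[str, int] = {}    # pipeline -> tick it accepts a new operation
    out: List[str] = []
    tick = 0                     # next issue slot
    for ins in code:
        earliest = max([tick, free.get(ins.pipe, 0)]
                       + [ready.get(r, 0) for r in ins.srcs])
        out.extend(["NOP"] * (earliest - tick))
        out.append(ins.text)
        latency, enqueue = pipes[ins.pipe]
        if ins.dest:
            ready[ins.dest] = earliest + latency
        free[ins.pipe] = earliest + enqueue
        tick = earliest + 1
    return out


dependence_example = pad_with_nops([
    Instr("Load R1,X", "load", dest="R1"),
    Instr("Add R0,R1", "alu", dest="R0", srcs=("R0", "R1")),
])
conflict_example = pad_with_nops([
    Instr("Load R1,X", "load", dest="R1"),
    Instr("Load R2,Y", "load", dest="R2"),
])
```

With the assumed table, the Load/Add pair is padded with three NOPs and the Load/Load pair with one. An interlocked machine would insert the same delays in hardware, so the count of NOPs serves as a mechanism-independent measure of delay.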
1.3. NOPs and Delay Slots
In code scheduling for a pipeline processor, the best solution is to never have
the next instruction interfere with the instructions currently in the pipeline. By
pipeline analysis and rearrangement (scheduling) of the code, a compiler can
effectively eliminate the need for inserting delays. But when no instruction can be
found to fill a delay slot, it becomes necessary to “execute” these delays.
It should be pointed out here that we can implement these delay slots in a variety
of ways, such as implicit interlock, explicit interlock, or NOP padding, as
described above.
The current popularity of the NOP insertion technique is, probably to a 
great extent, the result of the realization that this scheduling is important enough 
that every compiler should do it, in which case the compiler technology for NOP 
insertion is free, whereas the hardware implementing an interlock is not.
In this thesis, for convenience, we shall consistently refer to delays in terms
of inserting NOPs. However, the approach is not sensitive to which hardware
mechanism is being employed. This is a key reason for discussing the
architecture’s view: to show that it is in fact orthogonal to the compiler’s view.
Hence, the scheduling techniques discussed in this thesis apply equally well to any
architectural implementation of delays. In fact, our algorithm, presented in
Chapter 3, is based on the general notion of delay slots and is not specific to NOP
padding. The choice of the method used to make those delay slots visible to the
processor is up to the person implementing this algorithm. Our implementation
uses the NOP insertion technique, for reasons noted above.
1.4. An Overview of This Document
Chapter 1 provides an introduction to the problem of code scheduling for
multiple pipeline processors. We have presented the problem from the compiler’s
perspective and from the architecture’s point of view and concluded that these
issues are orthogonal and that code scheduling should be incorporated in
every compiler even if other methods are used to resolve pipeline conflict and
dependence problems.
The background material and a survey of related research in open literature 
is presented in Chapter 2. This chapter begins with an overview of the complexity 
of the code scheduling problem viewed as an exhaustive search problem. This is 
followed by a compendium of various algorithms proposed by other researchers 
and how they differ from our work.
Chapter 3 presents a concrete illustration of the concepts and rationale 
behind our proposed algorithm. Later in the chapter, a detailed problem 
statement is defined and is followed by a description of our algorithm. The 
chapter ends with a formal proof about the optimality of the solutions obtained 
by our algorithm.
Chapter 4 addresses various issues pertaining to the implementation of our 
algorithm and its integration with existing compilers. The structure of our 
prototype compiler and implementation of the algorithm are discussed and the 
basic characteristics of pipelined systems are reviewed with examples.
Performance analysis of our algorithm (through its implementation) is 
carried out in Chapter 5. Interaction between various system parameters is 
explored and compared with the expected behavior.
Finally, Chapter 6 presents conclusions and directions for further research.
CHAPTER 2
BACKGROUND AND SURVEY OF RELATED LITERATURE
2.1. Introduction
In this chapter we discuss some of the work done by other researchers in this
area. Probably the best known example is the work of Thomas Gross, which is
discussed in Section 2.4. In Section 2.5, we investigate the work done by David
Bernstein. The contributions of Abraham and Padmanabhan are reviewed in
Section 2.6. Differences between their work and our approach, and other
general conclusions, can be found in Section 2.9.
2.2. An Example of Code Scheduling
In the previous chapter, we reviewed the concepts of pipeline conflicts and
instruction dependence issues. Figure 2.1 gives a concrete example of code
scheduling to resolve these problems. Suppose that this code is to be run on a
processor that has a memory load delay of one machine instruction. In other
words, the result of a memory load operation becomes valid two machine
instructions after its initiation. Assume that all other instructions take one
machine instruction. Then clearly, this piece of code will produce an incorrect
result. This is because the value of R2 used by the Add instruction will not be
what is intended in the program.
We need a proper delay before the execution of the Add instruction. One easy
way to implement this is by placing NOP instructions to fill the delay slots. Now
the code sequence will produce a correct result on this processor. This sequence is
shown in Figure 2.2.
Ld  R1, #5
Ld  R2, [Z]
Add R3, R1, R2
St  [X], R3      ; X = Z + 5
Figure 2.1. Incorrect Code Sequence
Ld  R1, #5
Ld  R2, [Z]
NOP              ; fill delay slot
Add R3, R1, R2
St  [X], R3      ; X = Z + 5
Figure 2.2. Correct Code Sequence
The example code sequence requires one delay slot for proper execution. The
time wasted by this delay slot can be utilized in executing some other instruction
at that spot. So effectively, we can “fill” the delay slots in the code with other
instructions in the code sequence. When no instruction is found that can move to
the position of the delay slot without violating the legal order of execution or
pipeline usage (conflict), we simply place a NOP there. Note that the legal order
of execution implies an ordering of instructions such that no consumer of a value
comes before the producer of that value. A code schedule that eliminates the
delay slot before the Add instruction is shown in Figure 2.3.
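The two notions used here, a legal order and the number of delay slots an order requires, can be sketched in a few lines of Python. This is an illustration only: the tuple encoding, the function names, and the assumption that the immediate load `Ld R1, #5` has no load delay (only the memory load does) are ours, and pipeline conflicts are ignored because this example has none.

```python
# Each instruction: (text, destination register or None, source registers,
# number of instruction slots until the result is valid).
LD_IMM = ("Ld R1, #5", "R1", (), 1)                # assumed: no load delay
LD_MEM = ("Ld R2, [Z]", "R2", (), 2)               # memory load delay of one
ADD = ("Add R3, R1, R2", "R3", ("R1", "R2"), 1)
ST = ("St [X], R3", None, ("R3",), 1)


def is_legal(order):
    """True if no consumer of a value comes before its producer."""
    produced_in_block = {dest for _, dest, _, _ in order if dest}
    seen = set()
    for _, dest, srcs, _ in order:
        if any(r in produced_in_block and r not in seen for r in srcs):
            return False
        if dest:
            seen.add(dest)
    return True


def delay_slots(order):
    """Number of NOPs a legal order needs to satisfy the load delay."""
    ready = {}            # register -> slot in which its value becomes valid
    slot = nops = 0
    for _, dest, srcs, latency in order:
        issue = max([slot] + [ready.get(r, 0) for r in srcs])
        nops += issue - slot
        if dest:
            ready[dest] = issue + latency
        slot = issue + 1
    return nops


original_order = [LD_IMM, LD_MEM, ADD, ST]      # order of Figure 2.1
reordered = [LD_MEM, LD_IMM, ADD, ST]           # the two loads exchanged
```

Under these assumptions the original order needs one NOP, while exchanging the two loads needs none; an order that places the Add before either load is rejected by `is_legal`.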
An optimal code schedule would be the one with the minimum possible 
number of delay slots in it. To efficiently find such schedules is our goal in this 
thesis. In the next section, we throw some light on the complexity of finding an 




Ld  R2, [Z]
Ld  R1, #5       ; replace the delay slot
Add R3, R1, R2
St  [X], R3      ; X = Z + 5
Figure 2.3. Instruction Sequence without any Delay Slots
2.3. The Complexity of Finding An Optimal Schedule
The problem of finding an optimal code schedule for pipeline machines is 
well known to be NP-complete [Gro83a]. The problem of instruction scheduling 
for a program, given a set of pipeline constraints, is typically handled by compiling 
the program into assembly language instructions. These instructions are then 
grouped into basic blocks [AhS86] and each basic block is independently 
scheduled¹ for the given pipeline constraints.
Without any pruning, it is intuitively clear that finding the optimal 
schedule for a block of n instructions requires an exhaustive search of all n! 
possible schedules. It is convenient to think of this as requiring n! invocations of 
an O(n) procedure, which we call Ω, that generates a schedule of the n 
instructions and computes the number of NOPs required by that schedule.
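The exhaustive search just described can be sketched as follows. This is only an illustration in Python, not the C implementation measured in this chapter. The block encoding, the latency values (2 for a memory load, 1 for everything else), and the names `count_nops` and `exhaustive_search` are assumptions made for the example; `count_nops` stands in for the schedule-evaluation procedure, and it is linear in the block size only when each instruction has a bounded number of operands.

```python
from itertools import permutations

# Hypothetical encoding of the example block: name -> (latency, producers).
# A memory load has latency 2 (one delay slot); all other instructions have latency 1.
BLOCK = {
    "Ld R1,#5":     (1, []),
    "Ld R2,[Z]":    (2, []),
    "Add R3,R1,R2": (1, ["Ld R1,#5", "Ld R2,[Z]"]),
    "St [X],R3":    (1, ["Add R3,R1,R2"]),
}

def count_nops(order, block):
    """NOPs required by one ordering, or None if a consumer precedes its producer."""
    issue = {}   # instruction -> cycle in which it is issued
    cycle = 0
    nops = 0
    for name in order:
        _, producers = block[name]
        if any(p not in issue for p in producers):
            return None
        # Earliest cycle at which every operand is valid.
        ready = max((issue[p] + block[p][0] for p in producers), default=0)
        if ready > cycle:
            nops += ready - cycle   # fill the gap with NOPs
            cycle = ready
        issue[name] = cycle
        cycle += 1
    return nops

def exhaustive_search(block):
    """Evaluate all n! orderings; return ((nops, order), number_of_calls)."""
    calls = 0
    best = None
    for order in permutations(block):
        calls += 1
        nops = count_nops(order, block)
        if nops is not None and (best is None or nops < best[0]):
            best = (nops, order)
    return best, calls

best, calls = exhaustive_search(BLOCK)
print(calls, best)
```

For this four-instruction block the search makes 4! = 24 calls, most of which are rejected as illegal orderings; the best schedule it finds places the memory load first, so that the immediate load fills its delay slot.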
As discouraging as these complexity measures sound, we went on to 
determine the approximate time one might expect a compiler to take to schedule a 
typical block containing about 15 instructions. A reasonably efficient C 
implementation of the procedure Ω was created and its approximate runtime 
determined on a variety of machines. The average time for one application of Ω, 
including the call overhead, was 0.12 milliseconds on a heavily-loaded Gould NP1. 
For a Sun 3/50 workstation the average time was about 0.3 milliseconds. Given a 
block containing 15 instructions, Ω would be applied 15!, or 1,307,674,368,000, 
times. Hence, our typical 15-instruction block could be scheduled on an NP1 in a 
mere 156,920,924 seconds, or just under 5 years! Worse still, most programs 
contain many such blocks. An extrapolation of the average runtimes for basic 
blocks of different sizes is shown in Table 2.1. Column one in this table shows the 
size of the basic block in terms of the number of instructions (after other classical 
optimizations and dead code removal have been done). The second column gives the 
number of search calls for an exhaustive search algorithm. Obviously, this 
number is the factorial of the size of the basic block. The third column shows the 
approximate time required to execute this many calls to find an optimal 
solution.
¹ Interactions between adjacent blocks can be managed without major 
modification of the basic block schedules, essentially by modifying the 
initial conditions in the analysis for each block.
Table 2.1. Search Space for Exhaustive Search
Instructions    Search Calls    Approximate Time
 7              5,040           0.6 seconds
 8              40,320          4.8 seconds
 9              362,880         43.5 seconds
10              3,628,800       7.2 minutes
11              39,916,800      79.8 minutes
12              4.8×10^8        15.9 hours
13              6.2×10^9        8.6 days
14              8.7×10^10       121.1 days
15              1.3×10^12       5.0 years
16              2.1×10^13       79.6 years
17              3.6×10^14       1353.5 years
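The arithmetic behind these figures is simple enough to reproduce. The sketch below assumes the 0.12 millisecond per-call time quoted for the Gould NP1 and a 365-day year; the helper names are ours. Its output should agree with Table 2.1 up to rounding in the last digit.

```python
import math

SECONDS_PER_CALL = 0.12e-3   # average time per call on the Gould NP1, from the text

def exhaustive_time_seconds(n):
    """Time to evaluate all n! orderings of an n-instruction block."""
    return math.factorial(n) * SECONDS_PER_CALL

def humanize(seconds):
    """Express a duration in the largest unit that gives a value of at least 1."""
    for unit, size in (("years", 365 * 86400), ("days", 86400),
                       ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

for n in range(7, 18):
    print(n, math.factorial(n), humanize(exhaustive_time_seconds(n)))
```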
No doubt, it is this type of analysis which led researchers to sacrifice 
optimality and investigate heuristic scheduling techniques. However, all is not as 
bleak as it seems because many of the schedules can be pruned from the search. 
Our approach was simply to prune the search as much as possible without 
sacrificing optimality. The most obvious pruning of the schedule search space is to 
avoid consideration of any orderings which would result in incorrect execution
due to violating a dependence (i.e., making the consumer of a value execute 
before the producer of that value).
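This first level of pruning amounts to enumerating only the topological orderings of the dependence graph. A minimal sketch is given below; it assumes the block is described by a hypothetical `producers` table mapping each instruction index to the indices it depends on, and it shows only dependence pruning, not the additional pruning techniques developed later in this thesis.

```python
def legal_orders(producers):
    """Yield every ordering in which each instruction follows all of its producers."""
    n = len(producers)
    order = []
    placed = set()

    def extend():
        if len(order) == n:
            yield tuple(order)
            return
        for i in range(n):
            # Only extend with instructions whose producers are already placed.
            if i not in placed and all(p in placed for p in producers[i]):
                order.append(i)
                placed.add(i)
                yield from extend()
                order.pop()
                placed.remove(i)

    yield from extend()

# Hypothetical block: 0: Ld R1,#5   1: Ld R2,[Z]   2: Add R3,R1,R2   3: St [X],R3
PRODUCERS = {0: [], 1: [], 2: [0, 1], 3: [2]}
schedules = list(legal_orders(PRODUCERS))
print(len(schedules), schedules)
```

For this block only 2 of the 24 orderings are legal, so only 2 schedules ever need to be evaluated.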
A question arises at this point: why did other researchers not use this 
approach to find optimal solutions? Probably they realized that the worst-case 
time complexity of this approach is still exponential, and therefore deemed the 
idea useless for practical² compilers. An excerpt from [HeG83] 
summarizes this: “Since we have shown that the reorganization problem is 
NP-complete even for the case where [pipeline] interlocks are only one or two 
instructions long, we need to consider heuristic solutions [foregoing the optimal 
solution].” We, on the other hand, investigated the average runtime of a refined 
exhaustive search algorithm and also studied how frequently its worst-case 
performance occurs. Moreover, we formulated and implemented a 
number of other heuristics which prune the search space significantly without 
sacrificing optimality. From this empirical study we found that, for typical inputs 
(similar to those that occur in real programs), nearly all of the inputs resulted in 
optimal schedules within very reasonable runtimes.
Table 2.2 presents a sample of how well we were able to prune the search 
space for schedules for typical blocks. All these examples are representative of 
the original test samples. The nature of these sample inputs is described in 
Section 5.3.
Note that for the same block size there can be great variations in the number 
of calls required to perform optimal scheduling. This is because the size of the 
search space depends primarily on the nature of the inter-dependencies within a 
basic block rather than on the basic block size itself. In general, however, the 
search space does increase with the size of basic blocks, because the range 
in which instructions can move and still have a legal evaluation order depends on 
both the inter-dependencies and the size of the basic block. In Table 2.2, some 
basic block sizes appear more than once to illustrate the variations in the runtime 
for the same block size. Note that this table is presented here only to highlight the 
remarkable difference between the numbers of calls (to procedure Ω) required to 
schedule various basic blocks using our pruning techniques. An extensive set of 
results is given in Chapter 5. From those results it follows that the same typical 
15-instruction block that would have taken 5 years to schedule optimally can be 
scheduled optimally in an average of about 0.01 seconds using the proposed 
pruning techniques.
² Although they did use similar exhaustive search methods to compare 
the results of their heuristics with the optimal solutions.
Table 2.2. Search Space for Representative Examples
Instructions    Exhaustive Search    Pruning Illegal    Proposed Pruning
in Block        Ω Calls              Ω Calls            Ω Calls
 8              40,320               163                76
11              39,916,800           9,039              12
13              6.2×10^9             65,105             394
13              6.2×10^9             40,240             21
14              8.7×10^10            175,384            1,676
15              1.3×10^12            27,487             317
16              2.1×10^13            5,800,000          66,890
16              2.1×10^13            228,324            443
20              2.4×10^18            12,872             334
21              5.1×10^19            58,581             202
22              1.1×10^21            >9,999,000         119
Of course, despite the fact that our pruning works very well on average, it 
has an exponential worst-case performance. To limit the worst-case runtime of 
our algorithm, the concept of a curtail point λ is used. This is a user-supplied 
parameter specifying the maximum number of schedules to be considered. The 
proposed scheduling algorithm terminates when either:
[1] All possibly-optimal schedules have been examined³. In this case, the best 
    schedule found is an optimal schedule.
[2] A total of λ schedules have been examined (i.e., λ calls have been made to 
    Ω). Because some possibly-optimal schedules have not been examined, the 
    best schedule found might or might not be an optimal schedule.
³ Our search algorithm will sometimes prune optimal schedules from the 
search, but only if they are provably equivalent to a schedule which was 
not pruned.
Fortunately, our results show that the vast majority of all blocks will 
terminate on case [1] if λ is on the order of 1,000. In fact, for most blocks of 
fewer than 20 instructions, a λ value of about 50 would suffice. Using the 
algorithms and synthetic benchmarks described in detail later in this thesis, the 
search for 15,812 of the 16,000 blocks terminated on condition [1]; the number of 
schedules searched for each of these trials is plotted in Figure 2.4.
In the case that a reasonable λ is exceeded and the search is truncated by 
rule [2], a sub-optimal solution might result. We were generally unable to 
determine how often the schedule resulting from a truncated search is actually 
optimal despite the fact that some schedules were not considered. This is because, 
when a reasonable value of λ was exceeded, the search space tended 
to be very large, so that even increasing the λ value by a factor of fifty did not 
cause the search to run to completion; however, neither did the best schedule 
change. For this reason, we suspect that many of the truncated searches also 
found optimal or nearly optimal solutions, but we cannot yet prove this.
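The two termination rules can be illustrated with a small sketch. It assumes a block encoded as hypothetical `latency` and `producers` tables, prunes only illegal orderings (none of the further pruning developed in this thesis is modeled), and takes the curtail point as the parameter `curtail`, assumed to be at least 1. The third value returned reports whether the search ended by rule [1], in which case the result is known to be optimal.

```python
def curtailed_search(latency, producers, curtail):
    """Return (best_nops, best_order, proven_optimal) after at most `curtail` schedules."""
    n = len(latency)
    state = {"examined": 0, "best": None, "truncated": False}

    def search(order, issue, cycle, nops):
        if state["truncated"]:
            return
        if len(order) == n:
            if state["examined"] == curtail:
                state["truncated"] = True   # rule [2]: budget exhausted, schedules remain
                return
            state["examined"] += 1
            if state["best"] is None or nops < state["best"][0]:
                state["best"] = (nops, tuple(order))
            return
        for i in range(n):
            if i in issue or any(p not in issue for p in producers[i]):
                continue                    # already placed, or a producer is missing
            ready = max((issue[p] + latency[p] for p in producers[i]), default=0)
            start = max(cycle, ready)       # any gap is filled with NOPs
            issue[i] = start
            order.append(i)
            search(order, issue, start + 1, nops + (start - cycle))
            order.pop()
            del issue[i]

    search([], {}, 0, 0)
    best_nops, best_order = state["best"]
    return best_nops, best_order, not state["truncated"]

# Hypothetical block: 0: Ld R1,#5   1: Ld R2,[Z]   2: Add R3,R1,R2   3: St [X],R3
LATENCY = [1, 2, 1, 1]
PRODUCERS = [[], [], [0, 1], [2]]
print(curtailed_search(LATENCY, PRODUCERS, 1))
print(curtailed_search(LATENCY, PRODUCERS, 1000))
```

With a curtail point of 1 the search stops after the first legal schedule and cannot claim optimality; with a generous curtail point it runs to completion and the best schedule found is optimal.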
Note that the total number of legal schedules which must be searched derives 
primarily from the dependence and conflict properties of instructions within the 
block rather than from the block size.
Having presented the basics of our work, we compare our research with the 
relevant work done previously in the literature.
2.4. PostPass Code Optimization
The Stanford University Microprocessor without Interlocked Pipeline Stages 
(SU-MIPS) was one of the first projects to integrate VLSI computer design and 
compiler design. Migration from hardware to software (e.g., compilers) was sought 
whenever possible without any performance degradation. Pipeline 




Figure 2.4. Schedules Searched vs. Block Size vs. Distribution of Inputs
Many issues were explored in that project, such as instruction packaging, 
delayed branches, and instruction set design ([GrH82], [GiG83], [Gro83] and 
[GrH88]). Here, however, we discuss only the code scheduling algorithm [Gro83a] 
for pipeline interlocks (the terms code scheduling and reorganizing are used 
interchangeably in this section). This algorithm works on the assembly-level 
instructions that have been generated by an earlier phase of the compiler. First 
we take a brief look at how the problem of code scheduling for a pipeline processor 
is proved to be NP-complete; then we summarize the algorithm along with its 
various heuristics and how it differs from our work.
2.4.1. Proof of NP-Completeness
In his Ph.D. dissertation [Gro83a], Thomas Gross showed that the 
problem of optimal reorganization of machine-level instructions at compile time is 
NP-complete. This is done by first showing that the problem is NP-complete 
when an unbounded pipeline interlock length is a parameter to the problem. 
Then, to show the strong NP-completeness of the problem, it is proved to be 
NP-complete even when the interlock length is limited to one or two and only one 
register is used in the original schedule. Finally, it is stated that the problem 
is in NP, since an optimal solution can be found non-deterministically by trying 
all possible solutions.
The problem of pipeline scheduling with unbounded interlock length is 
equivalent to a precedence-constrained multiprocessor scheduling problem. A 
sequencing problem computes an optimal single-processor execution sequence for a 
series of tasks under certain constraints. A sequencing problem is NP-complete 
only under a well-defined set of restrictions. Scheduling problems for 
multiprocessors are NP-complete even with fairly simple restrictions. A 
multiprocessor scheduling problem deals with finding a schedule of a set of tasks 
using more than one processor. Gross has shown that the pipeline interlock 
restriction effectively makes the reorganization problem equivalent to a 
multiprocessor scheduling problem.
For a real processor, there is always a bound on the interlock length, and the 
resulting reorganization problem would not necessarily be NP-complete. The 
strong NP-completeness of the problem is claimed by deriving an equivalence 
between the reorganization problem with interlock lengths of one and two and at 
least one register, and a resource scheduling problem that is already known to be 
NP-complete. The reorganization problem can be constructed from the 
resource scheduling problem in polynomial time; therefore the reorganization 
problem is NP-complete. And since the optimal solution can be obtained by 
guessing at every possible sequence, each of which can be evaluated for legality 
and cost in polynomial time, the problem is in NP.
Although the algorithm presented in this thesis works on an intermediate 
form of code, instead of the assembly level code for which the scheduling 
problem was shown to be NP-complete, the same proof applies. This is because 
the only difference that is visible in problem formulation is that of machine-level 
registers. In the intermediate form memory-variables can be thought of as 
registers without any loss of generality. In fact, this increases the runtime of the 
reorganization algorithm because it removes precedence constraints that are 
present when a limited set of registers is allocated to a code sequence. Hence, 
the problem of code scheduling (reorganization) for intermediate level code is also 
NP-complete.
2.4.2. The Algorithm
Thomas Gross implemented a postpass code reorganizer that resolves the 
problem of pipeline interlocking by inserting NOPs (Null OPerations) in the code 
to fill load and branch delay slots, reordering the resulting code to eliminate 
pipeline dependencies, removing as many NOP instructions as possible, and 
packing. Reordering is done on code within a basic block. The branch instruction 
at the end of the block can not be moved; it has to remain the last instruction of 
the reordered block. In a subsequent phase, which we will not discuss here, 
instructions around the branch instruction may be moved to fill the branch 
delays.
He proposed a heuristic algorithm for code scheduling:
[1] Read in a basic block and create a machine-level DAG.
[2] At any point, determine the set of instructions that can be generated.
[3] Eliminate any set that cannot be started immediately.
[4] Choose among the sets remaining.
The first step shows that this algorithm works on machine-level basic blocks, 
one at a time. Steps [2] and [3] determine valid instructions that can be scheduled 
next according to the constraints given in Section 2.4.2.1. In step [4], heuristics to 
select a set of instructions and partial scheduling of that set of instructions are 
considered. This is described in Section 2.4.2.2.
2.4.2.1. Reordering Constraints
The following reordering constraints are applied to compute the set of legal 
instructions:
[1] All children⁴ are evaluated before their parents.
[2] All uses of a register or memory value are completed before that value is 
    altered.
[3] Loads and stores to memory are maintained in their original order 
    whenever they could refer to the same address. This information can easily 
    be determined, or can be provided by the preceding phase of the 
    code generator.
[4] If a node stores into a register and that value is used in another basic 
    block, then that store must be the last store to the register. This may 
    be alternatively stated as: if a register value is live at the end of the 
    original basic block, all legal evaluation orders must leave it live.
One observation is immediately obvious: postpass scheduling is 
limited by the existing register assignments, which are fixed before scheduling 
starts. The scope of reorganization done at this level is limited because the 
assembly code (in general) reflects the assignment of values to a limited number 
of registers based on the initial ordering of the instructions in the source program.
Hence, of the constraints on instruction reordering given above, constraints 
2 through 4 are trivially met if reordering is done on the intermediate code 
before the register allocation phase. Also, any aliased memory references are not 
seen by the reorganizer, and therefore cannot be exploited. In intermediate code, 
the compiler can be made to use analysis and renaming so that these complications 
need not hinder scheduling [Die87].
⁴ The sense of direction for DAGs in our work is opposite to this: a 
parent node comes before its children nodes in our representation of DAGs. 
The difference, though, is in terminology only.
The constraints given above greatly reduce the freedom with which the 
individual instructions in a code stream can move, and thus the reorganized code, 
in general, is not as good (in terms of the number of NOPs inserted) as it could 
be. This is illustrated in the example code sequence given in Figure 2.5, which is 
the best schedule given this register allocation. However, the same sequence with 
a new register assignment eliminates all NOPs, as shown in Figure 2.6. Hence, 
by using an intermediate code which allows any register assignment, better code 
sequences can often be found.
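The effect of the register assignment can be checked mechanically. The sketch below encodes the two register assignments used in Figures 2.5 and 2.6 as tuples of (text, latency, resources read, resources written), derives the ordering constraints from those resource sets, and searches all legal orderings for the minimum number of NOPs. The encoding, the latencies (2 for a memory load, 1 otherwise), and the function names are assumptions made for this example; the dependence test is deliberately conservative, and the search is plain exhaustive enumeration rather than the algorithm of this thesis.

```python
from itertools import permutations

def dependences(block):
    """Return (true_deps, order_deps): pairs (i, j) meaning i must precede j."""
    true_deps, order_deps = [], []
    for j, (_, _, reads_j, writes_j) in enumerate(block):
        for i in range(j):
            _, _, reads_i, writes_i = block[i]
            if writes_i & reads_j:
                true_deps.append((i, j))     # j consumes a value produced by i
            elif (reads_i & writes_j) or (writes_i & writes_j):
                order_deps.append((i, j))    # j overwrites something i reads or writes
    return true_deps, order_deps

def min_nops(block):
    """Minimum NOP count over all legal orderings of the block."""
    true_deps, order_deps = dependences(block)
    best = None
    for order in permutations(range(len(block))):
        pos = {ins: k for k, ins in enumerate(order)}
        if any(pos[i] > pos[j] for i, j in true_deps + order_deps):
            continue                         # illegal evaluation order
        issue, cycle = {}, 0
        for ins in order:
            ready = max((issue[i] + block[i][1] for i, j in true_deps if j == ins),
                        default=0)
            issue[ins] = max(cycle, ready)
            cycle = issue[ins] + 1
        nops = cycle - len(block)            # cycles not occupied by an instruction
        best = nops if best is None else min(best, nops)
    return best

FIXED = [   # register assignment of Figure 2.5
    ("Ld R0,[B]",    2, {"B"},        {"R0"}),
    ("Ld R1,#5",     1, set(),        {"R1"}),
    ("Add R1,R0,R1", 1, {"R0", "R1"}, {"R1"}),
    ("St [A],R1",    1, {"R1"},       {"A"}),
    ("Ld R1,[D]",    2, {"D"},        {"R1"}),
    ("Add R1,R0,R1", 1, {"R0", "R1"}, {"R1"}),
    ("St [C],R1",    1, {"R1"},       {"C"}),
]
RENAMED = FIXED[:4] + [   # register assignment of Figure 2.6
    ("Ld R2,[D]",    2, {"D"},        {"R2"}),
    ("Add R1,R0,R2", 1, {"R0", "R2"}, {"R1"}),
    ("St [C],R1",    1, {"R1"},       {"C"}),
]
print(min_nops(FIXED), min_nops(RENAMED))
```

Under these assumptions, reusing R1 for the second load forces at least one NOP, because the load cannot move above the store that still reads R1. With the load renamed to R2, it is free to move up, and a schedule with no NOPs exists.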
Ld  R0, [B]
Ld  R1, #5
Add R1, R0, R1
St  [A], R1        ; A = B + 5
Ld  R1, [D]
NOP                ; delay slot
Add R1, R0, R1
St  [C], R1        ; C = B + D
Figure 2.5. Best Code Sequence for a Given Register Assignment
Ld  R0, [B]
Ld  R1, #5
Add R1, R0, R1
St  [A], R1        ; A = B + 5
Ld  R2, [D]
Add R1, R0, R2
St  [C], R1        ; C = B + D
Figure 2.6. Improved Code Sequence
The code scheduler suggested and implemented by Gross works on the 
assembly-level instructions that are already generated by other phases of the 
compiler. Instruction scheduling at code-generation time (before the register 
allocation phase) was considered inappropriate for the following reasons:
1. Instruction scheduling tends to increase register lifetimes, making it 
   more difficult to obtain a “good” register allocation. The cost of spilling 
   a register may easily exceed the cost of an interlock or inserted NOP.
2. The scheduling may be difficult to perform prior to register allocation 
   and final instruction selection. In machines with multiple addressing 
   modes and instruction formats, the exact instruction to be used to 
   implement a particular function and the interlock properties of that 
   instruction may not be determined until after the register allocation is 
   known (thus after scheduling).
3. The code generator cannot be readily applied to assembly-language 
   programs.
In our view, the choice of register allocation before scheduling is not a good 
one, and the reasons summarized above are not necessarily true in all cases. For 
example, with reference to statement 1 above, there is no reason why 
register allocation and pipeline scheduling cannot be mixed in a single scheduler 
that can perform cost comparisons between register spills and pipeline delays. 
Similarly, the second statement is not applicable to modern RISC-style machines.
2.4.2.2. Heuristics
In this section, we take a closer look at the different heuristics Gross devised 
for the scheduling problem and how they compare with the optimal solution.
The algorithm is based on the idea of safe paths. A safe path for a resource r 
with a starting node t in a given DAG D with a set of generated (covered) nodes 
(instructions) d is either empty or a minimum set of unscheduled instructions such 
that the set contains all unscheduled descendants of t and we obtain a safe position 
with d. A safe position for resource r is a set of instructions S in the DAG such 
that, once code has been generated for all the nodes in S, the nodes in S do not 
affect the generation of code with respect to resource r for the remaining 
instructions in the DAG that are not in S. A detailed definition and explanation of 
safe paths may be found in [Gro83a].
The proposed reorganization algorithm is a constraint algorithm with 
heuristics added to choose between conflicting safe paths. These heuristics do 
not in general return an optimal solution. Recall that, in step [4] of the algorithm 
described earlier, there is a choice to be made between different candidates for 
scheduling. Once a choice has been made there is no backtracking, and thus the 
quality of the solution relies heavily on the criteria for choosing among the 
candidate sets. Three different heuristics were proposed:
[1] Choose the largest safe path. The assumption is that the number of 
    pipeline conflicts is proportional to the number of nodes in the safe 
    path.
[2] Choose the safe path that has the highest interlock penalty. This cost 
    can be evaluated by counting the number of instructions that can 
    interlock.
[3] Choose the safe path that starts with the node farthest from the root. 
    This strategy uses a simple criterion that is known to work well for 
    other scheduling problems.
When there are several safe paths with the same weight (as defined by one of 
the three heuristic strategies described above), the safe path whose start node 
appears first is chosen.
2.4.2.3. Results
Clearly, none of the heuristics attempts to find an optimal solution. The 
empirical results reported in [Gro83a] are reproduced in Table 2.3. These results 
were obtained by applying the postpass reorganizer to a set of different 
programs. Only the average values are shown in Table 2.3. Strategy 3 was 
chosen for the final version of the reorganizer. The reorganizer produces “good” 
results based on these heuristics. We conclude this section with a note that the 
work of Thomas Gross is fairly good in terms of the integration of compiler and 
architecture concepts, and the implementation is quite reasonable.
Table 2.3. Percent of NOPs Required for Different Heuristics
Original Code    Strategy 1    Strategy 2    Strategy 3    Optimal
7.2              4.3           3.9           4.0           3.4
2.5. Improved Approximation Algorithm by David Bernstein
David Bernstein proposed an improved approximation algorithm for 
scheduling instructions for pipeline machines [Ber88].
2.5.1. Background
A class of scheduling algorithms, called leveling algorithms, is defined and 
analyzed. The basic leveling algorithm has been improved so that the worst-case 
ratio of the length of a schedule generated by the algorithm to the length of an 
optimal schedule is better than that achieved by general list scheduling 
algorithms. The upper bound for this ratio is $2 - 1/(d+1)$ for list schedules and is 
refined to $2 - 2/(d+1)$. In these expressions, $d$ is the amount of delay, for an 
instruction using a pipeline, after which the result becomes valid.
The time complexity of the refined leveling algorithm is $O(n\,\alpha(n) + e \log n)$, 
where $n$ is the number of instructions, $e$ is the number of dependencies among 
the instructions, and $\alpha(n)$ is a very slow-growing function.
In his research, approximation heuristics are used and the worst case 
behavior of the algorithm is analyzed.
2.5.2. Algorithm
The schedule model considered in this research work consists of a single 
processor P and a job system $T = (J, D, G)$. $T$ consists of a set of unit-time 
execution tasks (or instructions) $J = \{J_1, \ldots, J_n\}$, a set of delays (which 
model the pipeline structure) $D = \{D_1, \ldots, D_n\}$, where 
$D_i \in \{0, \ldots, d\}$ for some fixed integer $d$, and a directed graph 
$G = (J, E)$ of precedence constraints. Let $IS_i$ be the set of immediate 
successors of $J_i$.
Define the level $l(J_i)$ of a task $J_i$ as:
$$l(J_i) = \begin{cases} 0 & \text{if } J_i \text{ has no immediate successors} \\ D_i + \max_{J_j \in IS_i} l(J_j) & \text{otherwise} \end{cases}$$
A priority list L of the tasks (instructions) is constructed in a non-increasing 
order of their levels. A schedule S corresponding to such an L is called a leveled 
schedule. A refined leveling algorithm that improves on the upper bound of list 
schedules is then introduced.
Let the refined level of task $J_i$ be denoted by $rl(J_i)$ and let 
$M_i = (rl(J_{i_1}), \ldots, rl(J_{i_{|IS_i|}}))$ be a sequence of non-negative 
integers constructed from the refined levels of the immediate successors of $J_i$ 
in such a way that $rl(J_{i_1}) \ge rl(J_{i_2}) \ge \cdots \ge rl(J_{i_{|IS_i|}})$. 
Then $rl(J_i)$ is defined recursively as follows:
$$rl(J_i) = \begin{cases} 0 & \text{if } J_i \text{ has no immediate successors} \\ D_i + Q_i & \text{otherwise} \end{cases}$$
where $Q_i = \max\bigl(rl(J_{i_1}),\; rl(J_{i_2}) + 1,\; \ldots,\; rl(J_{i_{|IS_i|}}) + |IS_i| - 1\bigr)$.
The refined level schedule is generated according to the following algorithm: 
[1] Compute the levels $l(J_i)$ for all $i$.
[2] Compute the refined levels $rl(J_i)$ for all $i$.
[3] Create a priority list L by first ordering the $J_i$ in a non-increasing order of 
    $l$, and then ordering the jobs with the same value of $l$ in a non-increasing 
    order of $rl$. The order among the jobs with the same values of $l$ and $rl$ is 
    arbitrary.
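The three steps above can be sketched as follows. The job system used here is a made-up six-task example, not one taken from [Ber88], and the sketch assumes that both the level and the refined level of a task without immediate successors are 0. Ties in step [3] are arbitrary; here they are broken by the order in which the tasks are listed, because Python's sort is stable.

```python
from functools import lru_cache

# Hypothetical job system: task -> (delay D_i, immediate successors IS_i).
JOBS = {
    "a": (1, ["c", "d"]),
    "b": (0, ["d"]),
    "c": (2, ["e"]),
    "d": (0, ["e", "f"]),
    "e": (0, []),
    "f": (1, []),
}

@lru_cache(maxsize=None)
def level(task):
    """l(J_i): 0 without successors, else D_i plus the largest successor level."""
    delay, succs = JOBS[task]
    if not succs:
        return 0
    return delay + max(level(s) for s in succs)

@lru_cache(maxsize=None)
def refined_level(task):
    """rl(J_i): 0 without successors, else D_i + Q_i."""
    delay, succs = JOBS[task]
    if not succs:
        return 0
    # M_i: refined levels of the immediate successors in non-increasing order.
    m = sorted((refined_level(s) for s in succs), reverse=True)
    # Q_i: the k-th largest refined level (k counted from 0) is increased by k.
    q = max(value + k for k, value in enumerate(m))
    return delay + q

def priority_list():
    """Order tasks by non-increasing level, then by non-increasing refined level."""
    return sorted(JOBS, key=lambda t: (-level(t), -refined_level(t)))

print({t: (level(t), refined_level(t)) for t in JOBS})
print(priority_list())
```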
Again, we note that this is an approximation algorithm designed to obtain 
solutions with an upper bound of $2 - 2/(d+1)$ on the worst-case ratio of the length 
of a schedule generated by the algorithm to the length of an optimal schedule. 
As mentioned earlier, $d$ is the amount of delay, for an instruction using a pipeline, 
after which the result becomes valid.
One restriction that this research imposes is that the instruction delay $d$ 
must be the same for all instructions. Therefore the algorithm and results presented 
do not apply to multiple pipeline machines. Our approach, on the other hand, 
also takes into account different pipeline delays for different instructions. In the 
next section, we discuss another approach that considers multiple pipeline 
machines with variable delays.
2.6. Reorganizer for a Variable-Length Pipelined Microprocessor
The implementation of an instruction reorganizer for a floating-point 
microprocessor with variable-length pipelines is described in [AbP88]. This work 
was done by Seth Abraham and Krishnan Padmanabhan. Some benchmark 
results are presented by these authors using BLAS and Livermore Loops. The 
presence of variable-length pipelines is described as a key feature in this work.
2.6.1. Introduction
The reorganizer is designed to work with compiler generated or hand written 
assembly language code. A greedy heuristic algorithm is used to reorder 
instructions inside basic blocks.
The reorganizer, which works at the assembly language level, can accept 
guidelines or directives from either the compiler or the assembly language 
programmer about data dependencies and memory aliasing.
2.6.2. The Algorithm
The input to the reorganizer is a sequence of assembly language instructions, 
which is broken down into a set of basic blocks. In the first phase, the reorganizer 
schedules the instructions within each basic block. In the second phase, the 
dependencies between the blocks are resolved. Although the reorganizer resolves 
data dependencies between basic blocks, this information is used only to add 
NOPs to the end of the ancestor basic block or to the beginning of the descendant 
basic block. The instruction ordering is driven essentially by instruction 
dependencies within a basic block, and that order is not changed by the 
inter-basic-block analysis. That phase exists only to prevent any pipeline conflict 
when a stream of basic blocks is run in succession.
A set of four lists is maintained for performing this algorithm. These lists 
are:
AIL  Active Instruction List. At any point in time, the active instruction 
     list contains a window of instructions that have been reorganized 
     and sequenced, along with the maximum and minimum completion 
     times of the instructions. This list corresponds to a window of 
     instructions that could exist inside the pipeline at run time.
RAL  This is the Resource Allocation List associated with the AIL. 
     Determining whether one instruction can safely follow another at a 
     certain distance requires a test for dependencies; thus each 
     instruction in the AIL or DIL has a list of resources that it will use 
     as a source or destination. These are the RAL and RRL for lists AIL 
     and DIL respectively.
DIL  Deferred Instruction List. This contains an ordered list of 
     instructions that cannot be scheduled safely (at some point in 
     time).
RRL  The Resource Requirement List that is associated with the DIL.
A greedy algorithm is used to order the instructions in each basic block. 
Again, the goal is not to find an optimal solution; instead the algorithm uses 
greedy heuristics to find an “approximate” solution in a reasonable run time. 
The starting point is an empty AIL and a DIL containing the entire block. Then 
steps 1 and 2 are applied until the DIL is empty.
[1] Sequentially go down up to k instructions in the DIL and get the first 
    instruction which may be safely scheduled at this point. If an instruction is 
    found, insert it into the AIL and also into the reorganized sequence for the 
    basic block. If no such instruction is found and the DIL is not empty, then 
    insert a NOP. If the DIL is empty, exit the algorithm. k is the lookahead 
    distance.
[2] Now cycle the AIL. This involves removing a completed instruction, if any, 
    from the AIL. Every time an instruction is scheduled, it is necessary to cycle 
    the AIL in order to free resources and prevent detection of outdated 
    dependencies. This is done as follows:
    a. For all instructions in the AIL, decrement both minimum and maximum 
       completion times. For all destination resources in the RAL, decrement 
       both maximum and minimum use times.
    b. All items with maximum completion times decremented to zero can be 
       removed from the list. At most one memory store instruction will have 
       this condition true at this point.
    c. If no memory store instruction was found in the last step, then from the 
       set of all items with minimum completion time less than or equal to zero 
       (if any), remove the one with the smallest value of this time.
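The role of the lookahead distance k can be illustrated with a much simplified sketch of step [1]. It is not the reorganizer of [AbP88]: the AIL and its resource lists are replaced by a table of issue cycles with a fixed latency per instruction, memory aliasing is ignored, and the block, the latencies, and the function name are assumptions made for the example. The lookahead k is assumed to be at least 1, and the input block is assumed to be in a legal order.

```python
def greedy_reorganize(block, k):
    """block: list of (name, latency, producer_indices) in original program order."""
    dil = list(range(len(block)))   # deferred instruction list, kept in original order
    issue = {}                      # index -> issue cycle (stands in for AIL bookkeeping)
    sequence = []
    cycle = 0
    while dil:
        chosen = None
        for idx in dil[:k]:         # examine at most k deferred instructions
            producers = block[idx][2]
            # Safe only if every producer was issued and its result is valid by now.
            if all(p in issue and issue[p] + block[p][1] <= cycle for p in producers):
                chosen = idx
                break
        if chosen is None:
            sequence.append("NOP")  # nothing within the window can be issued safely
        else:
            dil.remove(chosen)
            issue[chosen] = cycle
            sequence.append(block[chosen][0])
        cycle += 1
    return sequence

# Hypothetical block; a memory load has latency 2, everything else latency 1.
BLOCK = [
    ("Ld R2,[Z]",    2, []),
    ("Add R3,R2,R2", 1, [0]),
    ("Ld R1,#5",     1, []),
    ("St [X],R3",    1, [1]),
    ("St [Y],R1",    1, [2]),
]
print(greedy_reorganize(BLOCK, 1))
print(greedy_reorganize(BLOCK, 2))
```

With k = 1 the scheduler can only wait for the head of the DIL and must insert a NOP after the memory load; with k = 2 it finds the independent immediate load and uses it to fill the delay slot.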
The branch delays are handled in a fashion similar to [Gro83a]. Inter-block 
dependencies are resolved by the addition of NOPs without altering the 
instruction ordering in the basic block achieved by the algorithm.
Although satisfactory results are presented, this work also suffers from the 
artificial constraints introduced by the assembly level code generation that we 
discussed in Section 2.4.2.1.
Two more examples of pipeline code scheduling implementations are given 
in the next sections, without going into the details of the respective algorithms.
2.7. Micro-Optimization of Floating-Point Operations
William Dally described a technique in [Dal89] for reducing the operations 
count and time required to perform floating-point calculations on pipeline 
floating-point function units. This work is an effort to integrate floating-point 
arithmetic into RISC computer architecture. Micro Floating-Point function units 
are proposed that break down the floating-point tasks. These are pipelined and 
hence the original task can be divided into micro-operations which can be 
scheduled allowing for overlapping between instructions depending upon the 
interdependencies.
A greedy heuristic algorithm is presented that schedules instructions and tries 
to fill pipeline delay slots with other instructions. In addition, 
redundant re-normalizations are eliminated by the scheduler.
2.8. Scheduling Trees in Pipelined Environments
Scheduling of task trees to be executed in parallel and/or pipelined processing 
systems is examined under individual situations in [LI! 177]. Simple optimal 
algorithms are presented for special cases of task tree structures. Some simple 
techniques for binary trees for parallel pipeline models are also discussed.
The author has shown that for simple precedence structures in the form of a 
tree, scheduling for pipeline and/or parallel systems is an NP-complete problem. 
Heuristic approaches are favored, and the search for optimal solutions is 
discouraged by presenting counterexamples of exponential complexity for optimal 
solutions.
2.9. Summary
As discussed earlier, the few pipeline scheduling algorithms presented in the 
literature act as postpass reorganizers, and work on the assembly level produced 
by the compiler. Doing so imposes unnecessary constraints that sacrifice 
optimality of the solution. Moreover, all techniques rely on heuristics to obtain 
solutions and do not attempt to find solutions which are optimal (even given a 
fixed register allocation). In contrast, our approach, which is discussed in the next 
chapter, works at an intermediate code level and uses a new set of pruning criteria 
which preserve optimality.
The important point is that our algorithm finds optimal solutions for typical 
inputs. For a very small percentage of the inputs, our algorithm does not 
guarantee the optimality of the solution, but we have found those solutions to be 
close to optimal. Hence, although we did not 







As discussed in the previous chapter, most of the algorithms in the open 
literature perform code scheduling for pipelined machines using heuristics that do 
not preserve the optimality of the solution. Our approach, which is described in this 
chapter, differs from other related work in several ways:
[1] We apply our algorithm to an intermediate form of code instead of the final 
assembly code.
[2] Our algorithm employs pruning techniques which preserve the optimality of 
the solution. Unless the search is truncated (which happens rarely), the solutions 
are guaranteed to be optimal.
[3] We allow for a pipeline architecture model that is significantly more general 
than that of other algorithms.
The problem of finding an optimal code schedule for pipeline machines is 
well known to be NP-complete. We studied the complexity of the problem to 
investigate whether it is possible to find an optimal solution (for most cases) in a 
reasonable time. The results of this investigation, which are given in the next 
sub-section, encouraged us to further explore the search for an optimal solution. 
Finally, we arrived at an algorithm that finds optimal schedules (for most cases) 
in a very reasonable time. Moreover, the solutions it finds that are not guaranteed 
to be optimal are very good solutions (comparable to the optimal solutions 
themselves) and are certainly at least as good as the solutions obtained using other 
heuristic algorithms in the literature. In Chapter 5, we see that our algorithm 
succeeds in finding optimal code schedules over 98% of the time.
The basics of the exhaustive search are described in the next section, which is 
then augmented by a discussion of various refinement techniques and how they 
affect the search space for a scheduling problem. Although our strategy in this 
thesis is to schedule an intermediate form of code, for the purpose of illustration 
most of the examples in this chapter are restricted to machine level instructions. 
The intermediate form code in terms of instruction tuples is introduced in the 
next chapter.
In Section 3.3 we present a formal description of the problem studied in this 
thesis. Various definitions involving the development of our algorithm are 
formulated. A brief description of the compiler model, which is described in detail in 
the next chapter, appears in Section 3.4. Finally, our proposed algorithm is 
presented in Section 3.5. We conclude this chapter with a set of proofs on the 
quality of the solutions obtained through our algorithm.
3.2. Refined Exhaustive Search
An optimal schedule for the problem of code scheduling for pipeline 
constraints can be found by examining all possible orderings. Clearly, the obvious 
implementation of this approach would be impractical because of the factorial 
nature of the complexity of the problem. However, we have found that this 
problem is amenable to solution by a refined backtracking search over a tree of all 
possibilities. The basic idea in refined backtracking is to reduce the size of the 
search tree as much as possible such that the resulting minimal tree is guaranteed 
to contain at least one optimal solution. This is substantiated by the results 
obtained by our proposed refined backtracking algorithm, which are given in 
Chapter 5.
General backtracking works by continually trying to extend a partial 
solution. At each stage of the search, if an extension of the current partial solution is 
not possible, we "backtrack" to a shorter partial solution and try again. Since we 
are only interested in code sequences with orderings that result in a legal 
schedule, as mentioned in the last chapter, we need not consider the solutions 
comprised of illegal schedules. Therefore, we can say that a backtracking search 
is equivalent to an exhaustive search of all orderings that result in a possible solution.
Consider a search tree for exhaustive search of all possible orderings for n 
instructions. The nodes of the tree can be thought of as sets of configurations, 
and the children of a node n each represent a subset of the configurations that n 
represents. Finally, the leaves each represent single configurations, or solutions to 
the problem. We may evaluate each such configuration to see if it is the best (or 
optimal) solution. Figure 3.1 depicts a search tree for a code sequence consisting 
of three instructions labeled {1, 2, 3}. We want to find a solution {a1, a2, a3} that 
minimizes the cost of pipeline induced delays. Note that there are a total of six 
possibilities:
{1,2,3}
{1,3,2}
{2,1,3}
{2,3,1}
{3,1,2}
{3,2,1}
[Figure: search tree; the second level is labeled "Choices of a2, given a1"]
Figure 3.1. A search tree for exhaustive search
These solutions are obtained from the search tree by traversing all paths from the 
root to each leaf, picking up the labels of the nodes that are encountered. The 
time complexity of such an approach is O((n+1)!) on an n instruction code
sequence, since we must consider n! different leaves and each traversal takes O(n) 
time. In the next section we consider a series of refinements to the general 
backtracking technique to obtain an algorithm that is no better than the above in the worst 
case, but on average produces optimal results very rapidly.
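
As a minimal sketch of the unrefined exhaustive search described above, the following Python fragment enumerates every ordering and keeps the cheapest. The function name `exhaustive_search` and the cost function `toy_cost` are hypothetical stand-ins chosen for illustration; the real cost is the pipeline delay defined later in this chapter.

```python
from itertools import permutations

def exhaustive_search(instructions, cost):
    """Evaluate every ordering (every leaf of the search tree) and keep the cheapest."""
    best_order, best_cost = None, None
    for order in permutations(instructions):
        c = cost(order)
        if best_cost is None or c < best_cost:
            best_order, best_cost = order, c
    return best_order, best_cost

def toy_cost(order):
    # Hypothetical cost: one delay slot whenever instruction 3 directly follows 2.
    return sum(1 for a, b in zip(order, order[1:]) if (a, b) == (2, 3))

orders = list(permutations([1, 2, 3]))   # the six leaves of Figure 3.1
best_order, best_cost = exhaustive_search([1, 2, 3], toy_cost)
```

Every leaf is visited, so the work grows factorially with the number of instructions, which is what the refinements in the next section are meant to avoid.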
3.2.1. Refinements
Now we will examine techniques to greatly reduce the number of possibilities 
tried in an exhaustive search. All these techniques involve adding tests to a simple 
backtracking algorithm to discover that subtrees should not be made for certain 
nodes. This corresponds to pruning the exhaustive search tree — cutting certain 
branches and deleting all subtrees beneath.
3.2.1.1. Preclusion
One important pruning technique is to cut off the search as soon as it is 
determined that it cannot lead to a possible solution. Remember, a 
possible solution is a legal schedule in which precedence constraints between the 
instructions are maintained. For example, consider the code sequence of Figure 3.2.
1: Ld  R0, #5
2: Ld  R1, [Mem1]
3: Add R2, R0, R1
4: St  [Mem2], R2
Figure 3.2. Code Sequence for the Preclusion Example
While trying all different orderings for this code sequence, it becomes 
apparent that the choice of instruction Add R2, R0, R1 as the first 
instruction, or as a node on the first level of the corresponding search tree, precludes the 
placement of all other instructions on any of its descendant nodes because none of 
these instructions can be executed after the Add R2, R0, R1 instruction. 
Therefore this node along with its subtrees is excluded from the search tree.
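
The preclusion test can be sketched as a check made before a node is expanded. In the Python fragment below, the predecessor sets in `PREDS` are an assumption read off the register usage in Figure 3.2, and the names `can_extend` and `legal_orderings` are hypothetical.

```python
# Immediate predecessors for Figure 3.2 (assumed from the register usage):
# the Add needs both loads, and the store needs the Add.
PREDS = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}

def can_extend(partial, candidate, preds=PREDS):
    """A candidate may be appended only if all its predecessors are already placed."""
    return candidate not in partial and preds[candidate] <= set(partial)

def legal_orderings(partial=(), preds=PREDS):
    """Depth-first search that never expands a precluded node."""
    if len(partial) == len(preds):
        yield partial
        return
    for cand in sorted(preds):
        if can_extend(partial, cand, preds):
            yield from legal_orderings(partial + (cand,), preds)

schedules = list(legal_orderings())
```

With these assumed dependences, only two of the 24 orderings survive, and the subtree rooted at the Add on the first level is never entered.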
3.2.1.2. Pruning Based on Cost
This pruning rule is applied to cut off branches in a search tree whenever we 
can prove that pursuing the subtrees of a node will not result in a better solution 
than the one that can be obtained without examining the descendants of that 
node. We are interested in a minimum cost (in terms of pipeline delays) path in 
the search tree.
The basic technique is applicable to the pipeline scheduling problem because 
of the existence of partial solutions and also because adding more instructions 
(nodes) to the search path will never decrease the total cost associated with the 
partial solution. See Section 3.3.2 for a more formal description of this property.
In the exhaustive search without pruning, if we find that the cost of some 
solution is less than the cost of the minimum cost solution found so far, then we 
save the new solution as the best solution so far, and record its cost as the 
minimum cost so far for any solution. We make use of this minimum cost to 
exclude nodes and their descendants from the search tree. This technique can be 
implemented by making no search to the descendants of a node if the cost of the 
current partial path is greater than or equal to the best full path found so far.
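
A minimal sketch of this cost-based cutoff is given below. It assumes a non-negative, hypothetical `step_cost(partial, cand)` that depends only on the instructions already placed, so that the cost of a partial path never decreases as it is extended. The names and the toy cost are illustrative only.

```python
def branch_and_bound(n, step_cost):
    """Search orderings of 0..n-1, pruning partial paths that cannot beat the best."""
    best = {"cost": float("inf"), "order": None, "visited": 0}

    def extend(partial, cost_so_far):
        best["visited"] += 1
        if cost_so_far >= best["cost"]:
            return  # prune: this subtree cannot contain a strictly better schedule
        if len(partial) == n:
            best["cost"], best["order"] = cost_so_far, tuple(partial)
            return
        for cand in range(n):
            if cand not in partial:
                extend(partial + [cand], cost_so_far + step_cost(partial, cand))

    extend([], 0)
    return best

def toy_step_cost(partial, cand):
    # Hypothetical cost: one delay slot when cand directly follows cand - 1.
    return 1 if partial and partial[-1] == cand - 1 else 0

result = branch_and_bound(3, toy_step_cost)
```

The unpruned tree for three instructions has 16 nodes; once a zero-cost schedule has been found, the remaining first-level nodes are cut off immediately.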
3.2.1.3. Tree Rearrangement
The refinement technique discussed in the previous section is more effective if 
a low-cost path is found early in the search. Since the search tree arrangement 
depends on the initial order of the code sequence, this implies that we can 
rearrange the search tree by changing the order of instructions in the input code 
sequence before starting the exhaustive search.
For example, if the paths from root to leaves are examined from left to right 
in the search tree, then having low-cost solutions towards the left will increase 
the effectiveness of the pruning technique described in the previous section. In 
other words, if near optimal solutions are found early in the search, then more 
subtrees can be cut off from the search tree based on the minimum cost function.
In the next chapter, we discuss how we obtain a good lower bound on the 
cost of the solutions by applying pre-scheduling that effectively rearranges the 
search tree.
3.2.1.4. Equivalence Test
If two or more schedules can be shown to be equivalent, then we can 
arbitrarily choose to consider just one of them without sacrificing the optimality of 
the final solution. An example of this is the equivalence between two schedules 
that differ only in the value of constants in some instruction. For example, the 
following two schedules are equivalent in terms of the pipeline delays:
1:Load(c)
0:Const(3)
4:Const(4)
2:Add(0,1)
5:Add(4,2)
3 : Stor e (et ̂ 2 I 
6:Store(d,5)

1:Load(c)
0:Const(4)





The equivalence between these two schedules stems from the fact that 
instructions (such as loading constants) which do not require any pipeline resource 
and are not dependent on any other instructions can swap positions with other 
such instructions without having any effect on pipeline conflicts and delays 
associated with other instructions in the schedule.
3.2.2. Combining All Refinements
Each time that we cut off the search tree at a node, we avoid searching the 
entire subtree below that node. For very large trees, this is a very substantial 
savings. In fact, the savings is so significant that it is worthwhile to do as much 
as possible when examining a node to avoid examining its children. As 
mentioned above, a cutoff early in the tree can lead to truly significant savings, and 
missing an obvious cutoff can lead to very significant waste.
In the next section we define the problem statement and other definitions 
that are used later to describe our algorithm, and to prove its optimality.
3.3. Problem Statement And Definitions
The scheduling model that we consider is an extension of the models used in 
[Ber88], [BrJ80], or [Gro83]. The key differences between their models and ours 
are:
• Our model is general enough to allow for both the latency and the enqueue 
time of multiple pipeline resources. These parameters can vary from one 
pipeline to another, and are not limited to a single value as in [Ber88].
• This model is not specific to NOP insertion, and deals with delay slots that 
can be filled either by software (for example, a NOP padding compiler), by 
hardware instruction-waiting, or by any other technique depending upon the 
architecture of the system.
3.3.1. The Scheduling Model
Consider a task scheduling system T on a processor P having a pipeline 
model given by $\Sigma$. The scheduling problem consists of a finite set of tasks (or 
instructions) $Z = \{\zeta_1, \zeta_2, \ldots, \zeta_n\}$, where $\zeta_1, \ldots, \zeta_n$ are tasks (or instructions) 
that are to be executed successively by the processor P, and there is some 
precedence constraint given by a partial ordering $<$ on the elements of Z, and a cost 
function $c_\pi(i)$: the cost associated with completing task $\zeta_i$ as the $\pi_i$-th task in a 
schedule $\pi$. We want to find an (optimal) schedule $\pi$, representing a complete
ordering $\{\zeta_i : i \in \pi\} = \{\zeta_{\pi_1}, \ldots, \zeta_{\pi_n}\}$, that minimizes $\sum_{k=1}^{n} c_\pi(k)$ such
that no instruction is scheduled before its immediate predecessor as given by the 
partial ordering $<$.
Our model assumes that each instruction that does not use any pipeline 
resource takes unit time for execution. Pipelined instructions can take any length 
of time for execution; this is incorporated in the Latency and Enqueue time of 
the pipeline employed, as described later. It should be pointed out, however, 
that it is trivial to introduce variable time for the non-pipelined instructions. But 
since most modern machines are designed to execute one or more instructions per 
cycle, we shall assume one instruction per cycle.
The partial ordering $<$ can be defined by a Directed Acyclic Graph (DAG) 
$G = (Z, E)$ with vertex set Z and $<$ corresponding to the edges E in the graph due
to chronological inter-dependencies. The tuple $T = (Z, <, P)$ is referred to as a 
task system.
At this point it is befitting to elaborate on the pipeline resource model of the 
target architecture with respect to its instruction set. This determines the cost 
function mentioned above. Evaluation of this cost function is explained in 
Section 3.3.2.
The pipeline model is described by $\Sigma = (PI, LT, EN)$, where PI is a set of 
integers from zero to m, $\{0, 1, \ldots, m\}$, corresponding to m pipeline resources in 
processor P. LT and EN are the Latency and Enqueue delay times of the pipelines 
and are discussed later. The cardinality of set PI is equal to the total number of 
unique pipeline functional units that may be employed for the execution of any 
instruction in the complete instruction set of processor P.
Definition 1: $\sigma(\zeta)$
$\sigma(\zeta)$ is a set of pipeline resources such that each member of this set, denoted 
by an integer 0...m, represents a pipeline that may be used for the execution 
of instruction $\zeta$. Let $U_\zeta$ be the universe of all unique instructions (in other 
words, the instruction set of processor P). Then the function $\sigma : U_\zeta \rightarrow PI$ is 
an into mapping, i.e., more than one instruction may use a single pipeline 
resource (one at a time), or a single instruction may be executed on any one 
of several pipeline resources.
Example: The Add and Sub instructions in one processor may use the same 
pipeline resource, and in another processor there might be more than one pipeline 
functional unit just for the Add instruction (implying that more than one such 
instruction may overlap execution).
Now we can relate the total number of pipelines PI (specific to different 
instructions) in a processor to the instruction set as $PI = \{ x : x \in \sigma(\zeta),\ \zeta \in U_\zeta \}$.
Finally, we describe the Latency and Enqueue Time characteristics of the 
pipeline resource model. LT is a set of Latency delays $LT = \{LT_1, \ldots, LT_n\}$ where 
$LT_i \in \{0 \ldots lt\}$ such that $lt$ is the maximum latency delay in any pipeline resource. 
Similarly, EN is a set of Enqueue time delays $EN = \{EN_1, \ldots, EN_n\}$ where $EN_i 
\in \{0 \ldots en\}$, and $en$ is the maximum enqueue time for any pipeline in the system. 
As a convention, an instruction that does not make use of any pipeline will 
return a $\sigma()$ value of zero. Therefore, we set $LT_0 = EN_0 = 0$ for this "pipeline".
This completes the definition of our scheduling model. The concrete 
examples that follow in the next sub-section will elucidate these models.
3.3.1.1. An Example of the Pipeline Resource Model
For each hardware pipeline, the function, latency, and enqueue time must be 
specified. Further, so that the compiler can know which pipelines, if any, may be 
used to execute each type of operation, each hardware pipeline is given a unique 
identifier and operation types are associated with sets of pipelines. This is done 
using two tables.
Consider a processor with the following pipelined resources: two memory 
access pipelines (loaders), two adders, and one multiplier. These hardware 
resources are described in Table 3.1.
Table 3.1. Sample Pipeline Description Table

Pipeline     Pipeline     Latency   Enqueue
Function     Identifier             Time
loader       1            2         1
loader       2            2         1
adder        3            4         3
adder        4            4         3
multiplier   5            4         2
The second table used to describe the scheduling problem for our compiler is 
Table 3.2, the operation-to-pipeline mapping table. Given these tables, for 
example, the Add instruction has two independent pipelines available to it (namely, 
numbers 3 and 4), and thus can be scheduled for either pipeline. In this example, 
Add and Sub operations share two independent pipelines; likewise, Mul and 
Div share a single pipeline.








Notice that changing the pipeline structure changes only the entries in these 
tables, not the structure of the scheduling algorithm.
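The pipeline structure sketched above will be described as a tuple $\Sigma = (PI, 
LT, EN)$, where $PI = \{0,1,2,3,4,5\}$, $LT = \{0,2,2,4,4,4\}$, and $EN = 
\{0,1,1,3,3,2\}$. For this pipeline structure the formulation for $\sigma()$ is:
$$\sigma(\zeta) = \begin{cases}
\{1,2\} & \text{if } \mathrm{OPCODE}(\zeta) = \mathrm{Load} \\
\{3,4\} & \text{if } \mathrm{OPCODE}(\zeta) = \mathrm{Add} \mid \mathrm{Sub} \\
\{5\} & \text{if } \mathrm{OPCODE}(\zeta) = \mathrm{Mul} \mid \mathrm{Div}
\end{cases}$$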
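
The two tables can be held as plain data, which makes it easy to see that changing the pipeline structure changes only table entries. The sketch below is one possible Python encoding. The names `PIPELINES`, `OP_TO_PIPES` and `sigma` are hypothetical, and the entry for identifier 0 follows the "no pipeline" convention stated earlier.

```python
# Table 3.1 as data: pipeline id -> (function, latency, enqueue time).
# Id 0 is the convention for "no pipeline", with latency and enqueue time 0.
PIPELINES = {
    0: ("none", 0, 0),
    1: ("loader", 2, 1),
    2: ("loader", 2, 1),
    3: ("adder", 4, 3),
    4: ("adder", 4, 3),
    5: ("multiplier", 4, 2),
}

# Operation-to-pipeline mapping, taken from the formulation of sigma above.
OP_TO_PIPES = {
    "Load": {1, 2},
    "Add": {3, 4}, "Sub": {3, 4},
    "Mul": {5}, "Div": {5},
}

def sigma(opcode):
    """Set of pipelines usable by an opcode; {0} when no pipeline is needed."""
    return OP_TO_PIPES.get(opcode, {0})

PI = sorted(PIPELINES)
LT = [PIPELINES[q][1] for q in PI]
EN = [PIPELINES[q][2] for q in PI]
```

Derived this way, `PI`, `LT` and `EN` reproduce the three sets of the tuple given above.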
3.3.1.2. An Example of the Task System
The following sequence of instruction tuples is to be scheduled for 
minimum pipeline delay on a processor with the pipeline structure given in the 
previous section.
1:Const(#5)
2:Ld(M)
3:Add(1,2)
4:St(X,3)
At this stage the reader is expected to derive a meaning of this code sequence 
intuitively. Any discussion about the definition of these tuples is deferred until 
next chapter.
The corresponding task system is $T = (Z, <, P)$, where $Z = 
\{$1:Const(#5), 2:Ld(a), 3:Add(1,2), 4:St(X,3)$\}$. The partial 
ordering $<$ due to the precedence constraints is (1<3, 1<4, 2<3, 2<4, 3<4). 
We illustrated the pipeline structure for the target processor in the previous 
section. A graphical representation of the precedence constraints for our example 
is given in Figure 3.3.




Figure 3.3. An Example of the Task System
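
The partial ordering listed for this example is the transitive closure of the immediate dependences. A small sketch of that derivation follows. It assumes the immediate predecessors are read directly from the tuple operands, and uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The name `transitive_order` is hypothetical.

```python
from graphlib import TopologicalSorter

# Immediate predecessors, assumed from the tuple operands: Add(1,2) and St(X,3).
preds = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}

def transitive_order(preds):
    """Return all pairs (a, b) with a < b in the partial ordering."""
    closure = {}
    # static_order() yields each node only after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        closure[node] = set(preds[node])
        for p in preds[node]:
            closure[node] |= closure[p]
    return sorted((a, b) for b, above in closure.items() for a in above)

pairs = transitive_order(preds)
```

`TopologicalSorter` raises `CycleError` if the dependences are not acyclic, which doubles as a check that the input really is a DAG.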
3.3.2. The Cost Criteria in Scheduling
We mentioned a cost function in our task system model, and said that our 
goal is to find an optimal schedule that minimizes the total accumulated cost. In 
this section we describe this cost function in detail, particularly its interaction 
with the linear ordering of instructions and pipeline resources.
Recall that $c_\pi(i)$ is the cost associated with completing task $\zeta_i$ as the $\pi_i$-th 
instruction in the schedule. This is actually the time (or delay) an instruction 
must wait until all its source operands have become valid. In other words, an 
instruction cannot be executed until all its immediate predecessors have finished 
their execution.
Definition 2: $\rho(\zeta)$
$\rho(\zeta)$ is the set of all instructions $\delta \in \pi$ such that $\zeta$ has an immediate 
dependence on $\delta$. Equivalently, $\rho(\zeta)$ is the set of all immediate predecessors of $\zeta$ in 
the DAG G(Z, E).
Definition 3: $\hat{\sigma}(\zeta)$
$\hat{\sigma}(\zeta)$ is the pipeline resource that is utilized for the execution of instruction $\zeta$. 
This function associates an instruction to a pipeline resource that was 
actually chosen among all the available alternatives. Obviously, $\hat{\sigma}(\zeta) \in \sigma(\zeta)$.
The cost function is expressed in terms of instructions and pipeline resources 
in the following equations. This has been split into two parts, namely D1 and D2, 
for the purpose of clarity. In Equations 1 and 2, D1 and D2 are defined for some
instruction $\zeta_i$:
$$D1 = \max(EN_q) : q = \hat{\sigma}(\zeta_i),\ q \in \{\pi_{i-r} : 1 \le r < EN_q\} \qquad (1)$$
This represents the number of delay slots required to resolve the conflict between 
instruction $\zeta_i$ and those that use the same pipeline resource (if any).
$$D2 = \max_{j \in \rho(\zeta_i)} \big( \max(\pi_i - \pi_j - 1,\ LT_q) - (\pi_i - \pi_j - 1) \big) : q = \hat{\sigma}(\zeta_j) \qquad (2)$$
D2 is the number of delay slots required such that all operands in instruction $\zeta_i$ 
become available. The expression in Equation 2 examines all parents of $\zeta_i$ and, 
depending upon the relative position of a parent instruction, finds the number of 
delay slots to fill the latency delay of the parent's pipeline (if any). D2 becomes the 
maximum number of delay slots computed for any parent. As pipeline conflicts 
and latency delays are resolved simultaneously, the cost function picks either D1 
or D2, whichever is greater.
$$c_\pi(i) = \max(D1, D2) \qquad (3)$$
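
To make the latency term concrete, the sketch below evaluates only D2, under our reading of Equation 2: for each parent, the delay is the parent's pipeline latency minus the number of instructions already sitting between parent and child, and never less than zero. The conflict term D1 is deliberately left out. The task system is the one from Section 3.3.1.2 with latencies from Table 3.1, and the pipeline assignment in `pipe` (the load on pipeline 1, the Add on pipeline 3) is an assumption made for illustration.

```python
LT = {0: 0, 1: 2, 2: 2, 3: 4, 4: 4, 5: 4}   # latency per pipeline id (Table 3.1)

def d2(schedule, i, pipe, preds):
    """Latency delay slots for the instruction at 0-based position i."""
    pos = {z: k for k, z in enumerate(schedule)}
    zi = schedule[i]
    worst = 0
    for zj in preds[zi]:
        gap = pos[zi] - pos[zj] - 1            # instructions between parent and child
        worst = max(worst, max(gap, LT[pipe[zj]]) - gap)
    return worst

preds = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}   # assumed immediate predecessors
pipe = {1: 0, 2: 1, 3: 3, 4: 0}                   # assumed pipeline choice per instruction

add_after_load = d2([1, 2, 3, 4], 2, pipe, preds)   # Add directly follows the load
add_one_later = d2([2, 1, 3, 4], 2, pipe, preds)    # Const sits between load and Add
store_delay = d2([1, 2, 3, 4], 3, pipe, preds)      # store directly follows the Add
```

Moving the Const between the load and the Add hides one of the two latency slots, which is exactly the kind of reordering the scheduler searches for. Only instructions at earlier positions are consulted, so the function applies equally to partial schedules.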
It is important to note that this cost function for an instruction at some 
position within a schedule is computed by looking only at the values associated with
the instructions that occur before that instruction, and does not use any 
information about instructions that follow it. Two important properties follow from this 
observation.
[1] This function is applicable to partial schedules.
[2] $c_\pi(i)$ can be applied incrementally to instructions in a schedule.
These properties are instrumental in the development of our algorithm for 
finding an optimal schedule. But before discussing that, we need to say a few 
words about the accumulated cost function, $AC_\pi(i)$.
Definition 4: $AC_\pi(i)$ Accumulated Cost
The accumulated cost for a (possibly partial) schedule $\pi$ for the first i 
instructions is:
$$AC_\pi(i) = \sum_{k=1}^{i} c_\pi(k)$$
This is the total number of delay slots required in schedule $\pi$ up to position i.
3.3.3. Delay Slots and Optimal Schedule 
Definition 5: Legal Schedule
A legal schedule (or feasible schedule) is defined as a one-to-one and onto 
mapping $\Omega$ from the elements of Z into the set N of positive integers, 1 to 
$|\pi|$ (that is, relative positions within a stream of instructions), such that 
$\Omega(\zeta_i) > \Omega(\zeta_j) \mid \zeta_i \zeta_j \in E,\ j \in \rho(i)$ for all i, j.
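
Definition 5 translates into a short check: every instruction appears exactly once, and each instruction is placed after all of its immediate predecessors. The following is a minimal sketch, with `is_legal` a hypothetical name and `preds` an assumed encoding of the example task system from Section 3.3.1.2.

```python
def is_legal(schedule, preds):
    """True if schedule is a one-to-one, onto placement that respects the DAG."""
    if sorted(schedule) != sorted(preds):      # each instruction exactly once
        return False
    pos = {z: k for k, z in enumerate(schedule, start=1)}
    return all(pos[z] > pos[p] for z in preds for p in preds[z])

preds = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}   # assumed immediate predecessors
```

For example, `is_legal([2, 1, 3, 4], preds)` holds, while `is_legal([3, 1, 2, 4], preds)` does not, because the Add would precede both of its operands.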
Hence a more precise statement of the scheduling problem is that we want to 
find a legal schedule (or a mapping $\Omega$) such that the accumulated cost function 
$AC_\pi(|\pi|)$ is minimum. Recall that the accumulated cost function is comprised 
of the summation of cost functions $c_\pi(\cdot)$ for all instructions in a scheduled 
stream. As was pointed out earlier, $c_\pi(i)$ is the amount of delay, in multiples of 
unit time, that should be added after instruction i. We can view these wait 
periods as delay slots inserted within an instruction stream. Each delay slot takes 
unit time to elapse. Therefore our scheduling problem can be described in terms 
of minimizing the number of delay slots.
Definition 6: Optimal Schedule
An optimal schedule $\pi_{opt}$ is a legal schedule for which the number of delay
slots is minimum. Equivalently, an optimal schedule $\pi_{opt}$ is a legal 
schedule $\pi$ for which the accumulated cost function $AC_\pi(|\pi|)$ is minimum.
A few remarks are appropriate at this point. First, the accumulated cost 
function $AC_\pi(|\pi|)$ represents the total delay time that must be expended on a 
pipelined processor to ensure correct results. This is equivalent to the difference 
between the time taken to execute on the pipelined processor and the time taken 
to execute on a similar processor for which every instruction executes in unit 
time. The delay slots after instruction $\zeta_i$, given by $c_\pi(i)$, can be implemented in 
various ways, and our model is not specific to an implementation. Some common 
methods, as discussed in Section 1.2.2, are insertion of NOPs and hardware 
interlocks. Occasionally, we will refer to unit delays interchangeably with NOPs, but 
this does not imply that NOP insertion must be used instead of interlocks.
We present our algorithm in Section 3.5. Since this algorithm can be 
implemented using a variety of approaches, Section 3.4 gives an overall picture of the 
various trade-offs that should be considered when implementing the algorithm. In 
Section 3.6, we show that in the absence of a curtail point our algorithm indeed 
finds optimal solutions for all cases. A curtail point is a user-specified limit on 
the search space.
3.4. Compiler Model
Our scheduling algorithm works as a compiler back end. The different phases of 
the compiler are discussed briefly here. Details can be found in the next chapter.
As discussed earlier, the few pipeline scheduling algorithms presented in the 
literature act as postpass reorganizers, and work on the assembly level code 
produced by the compiler. The scope of reorganization done at this level is limited, 
because the assembly code reflects the assignment of values to a limited number 
of registers based on the initial ordering of the instructions in the source program.
Our algorithm works on an intermediate form that is expounded in the next 
chapter. Traditional code optimizations are performed before scheduling the code 
for the pipeline machines. This is necessary because, if the optimization is 
performed after scheduling, then some pipeline constraints might be violated, and the 
















Figure 3.4. Compiler Model with respect to the Scheduling Algorithm
It is assumed that the compiler front end has done appropriate analysis for 
memory reference aliasing, and has done renaming so that all references to 
variables in the tuple code are unambiguous and mutually exclusive.
At this stage, it might be necessary to do some initial register allocation 
analysis of live values to estimate register spill code. This is followed by the 
pipeline scheduler itself. An optional heuristic scheduling might be performed before 
the optimal pipeline scheduler to increase the pruning of the search space. A type 
of initial (list) scheduling is discussed in Chapter 4.
The approach presented here is not constrained by “artificial” conflicts 
resulting from coincidental reuse of a register name. Only at this stage, after 
scheduling has completed, are values assigned to specific registers. Further, it is 
at this time that the tuple form is converted into the notation for the target 
machine instruction set. It is assumed that the tuple operations are defined so 
that each tuple corresponds directly to one target machine instruction, hence this 
transformation is easily accomplished.
3.5. Scheduling Algorithm
The input to the pipeline scheduling algorithm is an initial (list) schedule 
and the DAG (Directed Acyclic Graph) [AhS86] it embeds. From this, all needed 
dependence information is derived. The pipeline scheduling algorithm is a 
heavily-pruned search algorithm that works on one basic block at a time and 
finds a schedule for which the number of delay slots required is minimum.
Section 3.5.1 defines a few terms and functions that are used in describing 
the algorithm. The algorithm itself is presented in two parts: the algorithm to 
determine the Pipeline Delay Cost for different instructions in a schedule appears 
in Section 3.5.2 and the complete search procedure in Section 3.5.3.
3.5.1. Definitions
In addition to the terms defined earlier in this chapter, the following terms 
and functions are used in the algorithms which follow:
Definition 7: $\pi$
$\pi$ is the current complete ordering of all instructions within this basic block. 
The i-th instruction in $\pi$ will be denoted as $\pi(i)$; likewise, $\pi^{-1}(\delta)$ returns the
position of instruction $\delta$ within $\pi$. Instructions within $\pi$ are labeled 1, 2, 3, 
..., $|\pi|$.
Definition 8: $earliest(\zeta)$
$earliest(\zeta)$ is the minimum number of instructions in Z which must be 
executed before $\zeta$ in order to preserve the dependence structure given by the 
DAG. In other words, it is the number of instructions in a slice rooted at $\zeta$.
Definition 9: $latest(\zeta)$
$latest(\zeta)$ is the maximum number of instructions in Z which could be
executed before $\zeta$ in order to preserve the dependence structure given by the 
DAG. In other words, it is $|Z|$ minus the number of instructions which 
transitively or directly depend on $\zeta$.
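
Both bounds can be computed from the DAG by counting transitive ancestors and dependents. The sketch below follows the definitions literally, under two assumptions: $earliest$ counts the transitive predecessors of an instruction without counting the instruction itself, and $latest$ is $|Z|$ minus the number of transitive dependents. The function names and the `preds` encoding are hypothetical.

```python
def ancestors(node, preds):
    """All instructions that must execute before node (transitive predecessors)."""
    seen, stack = set(), list(preds[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(preds[p])
    return seen

def earliest(node, preds):
    # Assumption: the instruction itself is not counted.
    return len(ancestors(node, preds))

def latest(node, preds):
    # |Z| minus the number of instructions that directly or transitively depend on node.
    dependents = {z for z in preds if node in ancestors(z, preds)}
    return len(preds) - len(dependents)

preds = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}   # example task system, assumed
```

For the example task system, the Add has `earliest` 2 and `latest` 3, so it can only occupy the third position, which is the kind of quick legality check the search relies on.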
3.5.2. Algorithm D: Compute $c_\pi(i)$
The following algorithm is used to determine the amount of delay that 
would need to be inserted in the schedule $\pi$ immediately before the i-th 
instruction, $\zeta$. It is assumed that for each instruction scheduled in a position $j < i$, $c_\pi(j)$ 
has previously been computed. Recall that, in Section 3.3.2, $c_\pi(i)$ is defined as
$$c_\pi(i) = \max(D1, D2)$$
where D1 and D2 are,
$$D1 = \max(EN_q) : q = \hat{\sigma}(\zeta_i),\ q \in \{\pi_{i-r} : 1 \le r < EN_q\}$$
$$D2 = \max_{j \in \rho(\zeta_i)} \big( \max(\pi_i - \pi_j - 1,\ LT_q) - (\pi_i - \pi_j - 1) \big) : q = \hat{\sigma}(\zeta_j),\ \sigma(\zeta_j) \ne 0$$
The algorithm to compute $c_\pi(i)$ is therefore comprised of the following steps:
[1] $c_\pi(i) = 0$. If $i = 1$, then done. Otherwise, go to step [2].
[2] If $\sigma(\zeta) = 0$, go to step [4].
[3] (Check for conflict.) Let $\tau(j) = c_\pi(i) + \sum_{k=j+1}^{i-1} c_\pi(k) + 1$, the execution time
between the start of the j-th instruction and the i-th instruction. Search 
backward from the $j = i-1$ th instruction until $\tau(j) \ge EN_{\hat{\sigma}(i)} \lor \hat{\sigma}(j) = \hat{\sigma}(i) \lor j = 1$. If 
$\hat{\sigma}(j) = \hat{\sigma}(i) \land \tau(j) < EN_{\hat{\sigma}(i)}$, then $c_\pi(i) = EN_{\hat{\sigma}(i)} - \tau(j)$.
[4] If $\rho(\zeta) = \emptyset$, then done.
[5] (Check for dependence.) Perform step [6] for each instruction $\delta \in \rho(\zeta)$, then 
done.
[6] Let $x = LT_{\hat{\sigma}(\delta)} - \tau(\pi^{-1}(\delta))$. If $x > 0$, then $c_\pi(i) = c_\pi(i) + x$. Note that
$LT_{\hat{\sigma}(\delta)}$ is the latency of the pipeline used (if any) by a parent of instruction $\zeta$.
3.5.3. Algorithm A: Find Optimal Schedule
The following is the schedule search algorithm which forms the core of our 
approach. This finds an optimal schedule (unless the search is truncated) for a 
basic block of instructions in the set Z. The initial schedule to this algorithm is 
passed in $\pi$. Algorithm D from Section 3.5.2 is used repeatedly to evaluate 
schedules being considered. Algorithm A consists of these steps:
[1] For $i = 1$ to $|\pi|$, invoke the above algorithm to insert the correct number of
delay slots before instruction $\pi(i)$. Call the resulting schedule $\pi_{best}$, the best
schedule found thus far. Then $AC_{\pi_{best}}(|\pi_{best}|) = \sum_{k=1}^{|\pi|} c_\pi(k)$.
[2] Partition $\pi$ into $\Phi$ and $\Psi$, where $\Phi$ represents the partial schedule being 
considered and $\Psi$ represents the list of instructions to be added to schedule $\Phi$. 
Initially, $\Phi = \emptyset$ and $\Psi = \pi$. Let $i = 1$. Let $\Lambda = 0$.
[3] If $\Psi \ne \emptyset$ then the schedule is not yet complete and search continues with
step [4]. If $AC_\pi(|\pi|) < AC_{\pi_{best}}(|\pi_{best}|)$ then $\pi_{best} = \pi$. Goto step [8]. 
Otherwise consider swapping instruction $\kappa = \pi(i) \mid \kappa \in \Phi$ with an instruction 
$\xi \in \Psi$. Let $\hat{\sigma}(\xi) = x$, $x \in \sigma(\xi)$. Repeat for all available choices one by one.
[4] (Get next schedule pruned by legality.) The swap should be performed only 
if both of [4a] and [4b] are true:
[4a] (Quick approximate check for legality.)
$latest(\kappa) > \pi^{-1}(\xi) \land earliest(\xi) < i$
[4b] (Real test for legality.) $\rho(\xi) \subseteq \Phi$
If no legal swap was found, goto step [9].
[5] (Check for equivalence.) Goto step [9] if the following condition is not true, 
else proceed to the next step.
$\sigma(\kappa) \ne 0 \lor \sigma(\xi) \ne 0$
[6] (Apply Excessive Cost Pruning.) Let $\Phi'$ be a partial schedule formed by 
substituting instruction $\xi$ ($\xi \in \Psi$) for $\kappa$ ($\kappa \in \Phi$).
If $AC_{\Phi'}(|\Phi'|) < AC_{\pi_{best}}(|\pi_{best}|)$ then go to the next step, otherwise, 
continue with step [9].
[7] (Apply curtail point search truncation.) Let $\Lambda = \Lambda + 1$. If $\Lambda > \lambda$ then 
abort, with a possibly suboptimal best schedule $\pi_{best}$. Otherwise, continue
with step [9].
[8] Now actually perform that swap which was considered in step [3]. 
Interchanging $\xi$ with $\kappa$ alters $\pi$, $\Phi$, and $\Psi$. Now move the partition between $\Phi$ 
and $\Psi$ to reduce $\Psi$ by one instruction and goto step [3].
[9] Restore the previous values of $\pi$, $\Phi$, and $\Psi$. This is done by "undoing" the 
most recent changes made in these sets. For example, the set $\pi$ is restored to 
its previous contents by swapping the most recently swapped instruction 
back to its original position.
[10] If $i < |\pi|$ then $i = i+1$ and goto step [3]. Otherwise, done, with an optimal 
solution in $\pi_{best}$. Then $\pi_{opt} = \pi_{best}$.
The pruning techniques in the algorithm cut the search time by $(|\pi| - k)!$ 
when pruning occurs at position k. Note that, because condition [5] filters out 
equivalent schedules, the algorithm presented finds an optimal schedule, but 
might not examine all optimal schedules when the optimal schedule is not unique.
3.6. Proof of Optimality
In this section, we prove that our algorithm produces results which are 
guaranteed to be optimal if the search is not truncated. This is done by first 
proving the optimality of a non-truncating algorithm, described in the next 
section, and then by showing its equivalence to our algorithm for the inputs for 
which the search is not aborted.
Recall the problem statement from Section 3.3.1 that for a given task system 
$T = (Z, <, P)$, we want to find an optimal schedule $\pi$, for processor P, 
representing a complete ordering $\{\zeta_i : i \in \pi\} = \{\zeta_{\pi_1}, \ldots, \zeta_{\pi_n}\}$, that minimizes
$\sum_{k=1}^{n} c_\pi(k)$ such that no instruction is scheduled before its immediate predecessor as 
given by the partial ordering $<$. $Z = \{\zeta_1, \zeta_2, \ldots, \zeta_n\}$, where $\zeta_1, \ldots, \zeta_n$ are 
tasks (instructions) that are to be executed successively by the processor P. $c_\pi(i)$ 
is the cost associated with completing task $\zeta_i$ as the $\pi_i$-th task in a schedule $\pi$.
3.6.1. Non-truncated Algorithm NT
The non-truncating algorithm NT is obtained from A by deleting step [7]. 
We derive a series of algorithms that preserve the optimality of the solution to 
the code scheduling problem for pipeline constraints, and show their equivalence 
(except for algorithm ALL) in terms of the quality of the solutions obtained from 
them. The objective is to prove that NT finds an optimal solution.
3.6.1.1. Algorithm ALL
Construct an algorithm ALL from algorithm A by deleting steps [4] to [7]
from it. Let $T_{ALL}$ be a search tree of all possible orderings of instructions in $Z$.
Lemma 1: The search tree generated by algorithm ALL is $T_{ALL}$.
Proof: Because of steps [3], [8] and [9] it is clear that algorithm ALL generates
all possible permutations of the set of instructions in $Z$. Since we have defined
$T_{ALL}$ as the search tree of all possible orderings of instructions in $Z$, it is
the search tree that is traversed by ALL. It should be noted that we are not
interested in the solution found by this algorithm, because it might violate the
DAG precedence constraints. Rather, we use it to construct the algorithm of
interest by adding back steps to restrict the search.
3.6.1.2. Algorithm LG
Let LG be an algorithm constructed from ALL by adding step [4] of A. Define
$S_{legal}$ as the set of all possible legal schedules and $T_{legal}$ as the search tree
corresponding to $S_{legal}$.
Lemma 2: Algorithm LG examines all solutions in the set $S_{legal}$.
Proof: Step [4] adds the preclusion refinement, described in Section 3.2.1.1, to the
exhaustive search algorithm ALL. Let $S_1$ be the search tree examined by LG.
Then $S_1 \subseteq T_{ALL}$ ($\subseteq$ means subtree here). Let $k$ be a node in $T_{ALL}$ that is
not present in $S_1$, and let $j$ be any node in a path from the root to some leaf in
$T_{ALL}$ such that it violates the precedence constraints in the order corresponding
to such a path. Then from step [4] of A it follows that $k$ must be a descendant of
$j$. Since each $k$ corresponds to one or more illegal solutions, all paths from root to
leaves in $T_{ALL}$ that are not present in $S_1$ are exactly the illegal solutions.
Therefore, $S_1 = T_{legal}$.
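The relationship between ALL and LG can be illustrated on a tiny block. The sketch below is not the thesis implementation; it enumerates every permutation (the leaves of the exhaustive tree) and then keeps only those that respect the precedence constraints. The four-instruction DAG and the names `is_legal`, `all_orders` and `legal_orders` are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence, Set


def is_legal(order: Sequence[str], preds: Dict[str, Set[str]]) -> bool:
    """True if every instruction appears after all of its predecessors."""
    seen: Set[str] = set()
    for ins in order:
        if not preds[ins] <= seen:
            return False  # this prefix already violates precedence
        seen.add(ins)
    return True


# Hypothetical DAG: b and c depend on a; d depends on both b and c.
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

all_orders = list(permutations(preds))                          # what ALL visits
legal_orders = [o for o in all_orders if is_legal(o, preds)]    # what LG keeps

print(len(all_orders), len(legal_orders))
```

Because `is_legal` rejects an order at the first offending prefix, everything below that prefix is discarded together, which mirrors the argument that the pruned nodes are descendants of a violating node.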
Theorem  3: An optimal solution always exists.
Proof: Since the code as initially generated is correct, there will always be at 
least one legal schedule that can satisfy the precedence constraints. Since the 
pipeline interlock length and conflict parameters are finite, the total delay for 
each schedule can be computed. Hence, an optimal schedule, which is the 
minimum cost taken over the set of all possible legal schedules, exists.
Lemma 4: Algorithm LG always finds an optimal solution $\pi_{opt}$.
Proof: From Theorem 3, $\pi_{opt}$ exists. Algorithm D computes the minimum
amount of delay for a given order that resolves the pipeline constraints. Since
$\pi_{opt} \in S_{legal}$, and from Lemma 2 LG computes the delay cost for each solution in
$S_{legal}$, step [3] ensures that the final solution $\pi_{best} = \pi_{opt}$.
3.6.1.3. Algorithm B
Construct algorithm B from LG by adding step [6] of algorithm A. Note 
that this corresponds to the minimum cost pruning in Section 3.2.1.3. Now we 
prove that B always finds an optimal solution. To prove this, we first need to 
introduce the following:
Lemma 5: An optimal schedule, $\pi_{opt}$, is not unique (in general).
Proof: We construct an example that has more than one optimal schedule.
Consider an optimal schedule $\pi_{opt}$ for a task
system similar to the one described in Section 3.3.1. Suppose that there was only one
delay slot to be filled after instruction f and it is replaced by an instruction 8. 
Hence A Cnopt( | TTopt | )= 0 . Let TT2 be another schedule with an instruction £ fol­
lowing instruction f and £ s4 & Then AC„2( | tt2 | )=*0, only if o(8)=£d($)t 6 (fc p{$). 
This condition depends only on the precedence constraints in the code block and
the pipeline structure. Therefore, for an arbitrary input this may be true, and hence
$\pi_2$ is also an optimal solution. But $\pi_2 \neq \pi_{opt}$. Therefore the optimal solution is not
unique for an arbitrary task system.
Lemma 6: $AC_\pi(\cdot)$ is a monotonic increasing function.
Proof: From equation (3) we note that $c_\pi(i) \ge 0$ for all $i$ and $\pi$. $AC_\pi(\cdot)$ is a
monotonic increasing function if and only if, for any $k_1$ and $k_2$, the condition
$k_1 > k_2$ implies $AC_\pi(k_1) \ge AC_\pi(k_2)$. Now, for $k_1 > k_2$,

$$AC_\pi(k_1) = \sum_{j=1}^{k_1} c_\pi(j) = \sum_{j=1}^{k_2} c_\pi(j) + \sum_{j=k_2+1}^{k_1} c_\pi(j) = AC_\pi(k_2) + \sum_{j=k_2+1}^{k_1} c_\pi(j)$$

Since $c_\pi(i) \ge 0$ for all $i$ and $\pi$, $AC_\pi(k_1) \ge AC_\pi(k_2)$. Proof of the converse is
obvious and is not given here.
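The monotonicity argument is easy to check numerically. The following sketch assumes, as in the proof, that the per-position delay costs are non-negative; the cost list itself is made up for illustration.

```python
from itertools import accumulate
from typing import List


def accumulated_costs(costs: List[int]) -> List[int]:
    """Running sums: entry k-1 holds the accumulated cost of the first k positions."""
    return list(accumulate(costs))


# Hypothetical per-position delay costs (number of NOPs before each instruction).
costs = [0, 2, 0, 1, 0, 3]
ac = accumulated_costs(costs)

# Non-negative terms can never make the running sum decrease.
is_monotone = all(ac[k] <= ac[k + 1] for k in range(len(ac) - 1))
print(ac, is_monotone)
```

This is the property that lets a partial schedule be discarded early: extending it can only keep or raise its accumulated cost.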
Lemma 7: The cost $AC_\pi(k)$ of a partial solution ($k < |\pi|$) exists.
Proof: Since $AC_\pi(k) = \sum_{j=1}^{k} c_\pi(j)$, no information about the schedule after
position $k$ is required. Therefore, this cost is computable for a partial solution.
Theorem 8: Algorithm B always finds an optimal solution $\pi_{opt}$.
Proof: Lemma 7 allows us to compute the cost of partial schedules. Step [6] of B
states that no descendant of a node $\xi$ at level $k$ in the search tree is examined
if:

$$AC_\pi(k) \ge AC_{\pi_{best}}(|\pi_{best}|) \qquad (5)$$

Let $\delta$ be a descendant of $\xi$ at level $j = |\pi|$. Since $j > k$, then from Lemma 6,

$$AC_\pi(j) \ge AC_\pi(k) \qquad (6)$$

From (5) and (6),

$$AC_\pi(j) \ge AC_{\pi_{best}}(|\pi_{best}|)$$

This means that the cost of any complete solution that contains node £ will have 
delay cost greater or equal to the cost of the best solution found so far. Obvi­
ously, if this cost is greater, then we can ignore these nodes, or prune them from 
the search tree. What if the two costs are equal? From Lemma 5, we know that 
an optimal schedule is not unique, therefore if two or more schedules have the 
same minimum cost we can arbitrarily choose one of them and that will be our 
optimal schedule. This proves our theorem.
3.6.1.4. Algorithm C
Derive algorithm C from B by adding step [5] of algorithm A. This
corresponds to the equivalence check pruning of Section 3.2.1.5. We develop some
definitions and lemmas to prove the optimality of this algorithm.
Define orthogonal instructions as those that do not use any pipeline resource, do 
not have any parent instructions and their execution does not have any side- 
effects (e.g. no I/O). An example of such instructions is loading of a constant 
value in a register (on virtually all machines).
Step [5] mentioned above states that no two orthogonal instructions should 
be swapped. We show that this is equivalent to arbitrarily picking one of the 
schedules that are guaranteed to be equivalent in terms of pipeline delays.
Lemma 9: Any ordering (schedule) within a set of orthogonal instructions is an
optimal schedule for that set of instructions.
Proof: By definition, orthogonal instructions do not use any pipeline resource. 
Therefore, no delay slots are required between instructions because all instruc­
tions in the schedule are orthogonal. Moreover, any schedule is legal due to the 
absence of precedence constraints among these orthogonal instructions. All 
schedules are legal and require zero delay slots — they are optimal schedules.
Lemma 10: The accumulated cost function associated with a basic block that
contains one or more clusters of orthogonal instructions remains unchanged if the
order of orthogonal instructions within clusters is changed. (A cluster of
orthogonal instructions denotes a contiguous sub-block of these instructions.)
Proof: Any change in the order of orthogonal instructions within a cluster does
not affect ordering constraints for any of their dependent instructions because
these instructions remain within the original cluster range. The order of these
instructions within the cluster boundaries does not affect pipeline delays and
conflicts. In other words, all such permutations of orderings are equivalent as far
as the pipeline delays are concerned. Hence, the total number of delay slots
required for the basic block remains unaffected. An example of this type of
equivalence is given in Section 3.2.1.4.
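Lemma 10 can be checked on a small example. The sketch below uses a toy single-issue cost model that is an assumption of this example, not the delay computation of the thesis: an orthogonal instruction is modelled as having no predecessors and a latency of one slot, so it never stalls a dependent. The block and all names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set


def delay_cost(order: List[str], preds: Dict[str, Set[str]],
               latency: Dict[str, int]) -> int:
    """NOPs needed by a legal order under a toy single-issue model."""
    issue: Dict[str, int] = {}
    clock = 0
    nops = 0
    for ins in order:
        ready = max((issue[p] + latency[p] for p in preds[ins]), default=0)
        start = max(clock, ready)
        nops += start - clock
        issue[ins] = start
        clock = start + 1
    return nops


# C, H and K form a cluster of orthogonal instructions; b uses the result of H.
preds = {"a": set(), "C": set(), "H": set(), "K": set(),
         "b": {"a", "H"}, "c": {"b"}}
latency = {"a": 4, "C": 1, "H": 1, "K": 1, "b": 3, "c": 1}
cluster = ["C", "H", "K"]

# Cost of the block for every ordering of the cluster, all else held fixed.
costs = {p: delay_cost(["a", *p, "b", "c"], preds, latency)
         for p in permutations(cluster)}
print(sorted(set(costs.values())))
```

All six orderings of the cluster yield the same accumulated cost, so examining one of them is enough.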
Theorem  11: Algorithm C always finds an optimal solution.
Proof: Algorithm B examines all possible legal schedules except for those that 
lead to non-optimal schedules. Adding step [5] in this algorithm gives us a new 
algorithm, Algorithm C, which prunes the search space by discarding equivalent 
solutions. If all instructions in a basic block are orthogonal then by Lemma 9 all 
schedules are equivalent and optimal. Therefore this algorithm returns the first 
such schedule and ignores all other possibilities. On the other hand, if there are 
other instructions present which are not orthogonal then, by Lemma 10, we do 
not need to examine all possibilities within clusters of orthogonal instructions.
Some points are not intuitively clear from the above discussion. We give an
example to illustrate how Algorithm C still manages to examine all schedules
that are not equivalent. For example, what about the instructions that are
dependent on orthogonal instructions when (say) two of the orthogonal instructions
are separated by many other instructions? An example of this sequence is
(a, b, C, d, e, f, H, h, i, j), where C and H are two orthogonal instructions and
i and j are dependents of H. Apparently, if we do not swap C and H then i and j
will not be able to appear before position 7 in this example code sequence. But a
careful analysis of step [5] shows that this is not the case, because there is no
restriction on swapping orthogonal instructions with other instructions which are
not orthogonal. Hence H can go anywhere except at position 3, and hence its
dependents can also move freely within the constraints of legal ordering. From
Lemma 10, instruction H at position 2 or 4 with C at position 3 is equivalent to
H being at position 3. Therefore step [5] does not prevent us from considering
any possibility that might lead to an optimal schedule.
Theorem 12: Algorithm NT always finds an optimal solution.
Proof: Algorithm NT is the same as algorithm C because both are derived from 
A by adding steps [4] to [6]. In Theorem 11, we showed that C finds the optimal 
solution, therefore NT will also find the optimal solution.
3.6.2. Truncating Algorithm
We extend NT by adding step [7] given in Section 3.5.3. This gives back our
original algorithm A. Step [7] adds a curtail point X to NT. We shall prove
here that, for the inputs for which the search is not truncated, this algorithm
returns an optimal solution.
Theorem 13: Any schedule found by Algorithm A is guaranteed to be optimal
if the search is not truncated.
Proof: Let $S_1$ be the solution found by A for an input $I$, and $S_2$ be the solution
obtained from NT. The two algorithms differ only in step [7]. Since the
search was not curtailed in A for input $I$, running A on $I$ is equivalent to
running A on $I$ without step [7]. But algorithm A without step [7] is the
algorithm NT. Hence $S_1 = S_2$ for $I$. From Theorem 12 we know that $S_2$ is an
optimal solution. Therefore $S_1$ is also an optimal solution. Hence proved.
This concludes our proof for the optimality of our algorithm (when the 
search is not truncated).
3.7. Summary
In this chapter we have presented our research work in a more formal 
manner. In the next chapter we describe issues relevant to the implementation of 




4.1. Introduction
In Chapter 3, we discussed the basic algorithm for scheduling code for pipe­
line processors. In this chapter we present an implementation of the optimal code 
scheduler.
First, in Section 4.2, we describe the complete picture of our compiler system
that incorporates the pipeline scheduler. The implementation consists of a
compiler system that accepts a small subset of C, allowing input of basic blocks,
and generates intermediate code in the form of instruction tuples. This is
detailed in Section 4.2.1. Classical optimizations are done incrementally; a
backward pass removes any dead or redundant code found. The resulting tuple code
is then reorganized using a heuristic initial scheduling algorithm followed by the
main pipeline scheduling algorithm, as discussed in Sections 4.2.2 and 4.2.3
respectively. Finally, the tuple code is converted to the target machine's
instructions and register allocation is performed. Register allocation is covered
in Section 4.2.4. We conclude this chapter by presenting a summary in Section 4.4.
4.2. Structure of the Scheduler
In this section, we outline the general structure of a prototype implementa­
tion of the proposed optimal pipeline scheduling technique. The construction of 
the compiler front end does not impact the scheduling technique, hence only the 
back end of the compiler is discussed. Figure 4.1 shows the organization of the 
compiler back end in the prototype implementation. The phases of the compiler 













Figure 4.1. Organization of Prototype Scheduling Compiler
4.2.1. Optimized Tuple Generation
The compiler front end is responsible for parsing the source program, per­
forming traditional optimizations, and emitting an appropriate intermediate form 
representation of the program.
Optimization of the code is not strictly necessary in order to perform pipeline
scheduling; in fact, if traditional optimizations are applied, the general effect
is that finding good schedules becomes more difficult. Hence, in the interest of
obtaining accurate results, the prototype compiler performs most traditional
optimizations. These include constant folding with value propagation, common
subexpression elimination, dead code elimination, and various peephole
optimizations. The resulting code, which is usually substantially smaller than the
unoptimized code, is then represented as a DAG (directed acyclic graph) [AhS86]
embedded in a linear notation.
The notation we use for each instruction is that of a tuple of the form
$\Gamma_1\; i, O, \alpha, \beta$, where $i$ is the reference number of the tuple, $O$ is the
operation type, and $\alpha$ and $\beta$ are two operands. Each operand can be a variable,
the result of another tuple (the reference number of another tuple), or $\emptyset$.
An example of tuple code, corresponding to a very simple basic block, is given in
Figure 4.2.
{
    b = 15;
    a = b * a;
}

1: Const 15
2: Store #b, 1
3: Load #a
4: Mul 1, 3
5: Store #a, 4

Γ1 1,Const,"15"
Γ1 2,Store,"b",1
Γ1 3,Load,"a"
Γ1 4,Mul,1,3
Γ1 5,Store,"a",4
Figure 4.2. Sample of Intermediate Form
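One way to see how the tuple code embeds a DAG is to read each integer operand as an edge to the tuple that produces the value. The sketch below encodes the block of Figure 4.2 under that reading; the `TupleCode` record, the use of `None` for an empty operand, and `value_dependences` are hypothetical names chosen for this illustration, not the data structures of the prototype compiler.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Union

# int = reference to another tuple, str = variable or constant, None = empty
Operand = Optional[Union[int, str]]


class TupleCode(NamedTuple):
    ref: int            # reference number of the tuple
    op: str             # operation type
    a: Operand = None   # first operand
    b: Operand = None   # second operand


# The basic block of Figure 4.2: b = 15; a = b * a;
block: List[TupleCode] = [
    TupleCode(1, "Const", "15"),
    TupleCode(2, "Store", "#b", 1),
    TupleCode(3, "Load", "#a"),
    TupleCode(4, "Mul", 1, 3),
    TupleCode(5, "Store", "#a", 4),
]


def value_dependences(code: List[TupleCode]) -> Dict[int, Set[int]]:
    """Map each tuple to the tuples whose results it consumes."""
    return {t.ref: {x for x in (t.a, t.b) if isinstance(x, int)} for t in code}


deps = value_dependences(block)
print(deps)
```

Only value dependences are captured here. The ordering between `Load #a` and the later `Store #a` happens to follow transitively through the `Mul` tuple in this block; a fuller treatment would also record such memory orderings explicitly.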
At the level of the tuple code, all references to variables are assumed to be
unambiguous and mutually exclusive, i.e., no two variable names refer to the
same object. Since this is not true of some high-level language program
references to array elements or objects accessed through indirection on pointers,
it is assumed that the compiler front end has done appropriate analysis and
renaming so that these ambiguities need not be seen in the tuple code [Die87].
Since the prototype compiler was used solely for synthetic benchmarks whose
properties could be controlled directly, the prototype compiler simply assumes
that all variable names appearing in tuples are unambiguous and mutually exclusive.
At this stage, it is also important that a portion of the register allocation
analysis be performed: the creation of register spill code. Since values are not
allocated to particular registers, the concept is simply that if there are more live
values than registers in the target machine, then all values beyond the number of
registers will be explicitly re-loaded. In other words, we ensure that when
registers are actually allocated later, there will be no need to introduce new spill
instructions, since these could invalidate the optimality of the schedule. Note
that inserting spill instructions after scheduling would usually result in a valid
schedule, since Store instructions typically do not interfere with any pipelined
operations.
In the simulations presented here, the prototype implementation simply 
assumed that there were always enough registers so that spilling would be 
unnecessary.
4.2.2. List Scheduler
As tuple code is emitted by the front end, the code is grouped into basic
blocks [AhS86] and each block is processed independently. The purpose of the
initial scheduling phase is to apply heuristics to generate a reasonable schedule of
the current block. This is important because the search is pruned, in part, by a
branch and bound technique, which makes the total number of schedules searched
sensitive to the quality of schedules searched early in the process.
The heuristic used is described in depth in [ZaD90], where it was applied to
generate an order for incrementally scheduling tuples across multiple processors in
barrier MIMD machines. In essence, the heuristic arranges the tuples into a
sequential order (schedule) so that the distance between each instruction and the
instructions that depend on it is as large as possible. Because of the branch and
bound pruning, the time taken in applying the initial scheduling heuristic is more
than recovered by the fact that the search for an optimal pipeline schedule will
converge more quickly.
Alternatively, any other scheduling technique proposed in the literature, e.g.,
Gross [Gro83], could be applied to find this initial schedule. It is unclear
whether the extra complexity of those techniques would be justifiable for use in
place of our list scheduling heuristic.
4.2.3. Pipeline Scheduler
Having obtained a “reasonable” initial schedule, the pipeline schedule search 
algorithm is applied to find the optimal schedule. This algorithm, discussed in 
Section 3.5, represents the prime contribution of this research. The output is sim­
ply a schedule of the tuples within each block.
4.2.4. Register Allocation and Code Generation
As discussed in Chapter 2, the few pipeline scheduling algorithms presented 
in the literature act as postpass reorganizers, and work on the assembly level pro­
duced by the compiler. The scope of reorganization done at this level is limited, 
because the assembly code (in general) reflects the assignment of values to a lim­
ited number of registers based on the initial ordering of the instructions in the 
source program.
The approach presented here is not constrained by “artificial” conflicts 
resulting from coincidental reuse of a register name. Only at this stage, after 
scheduling has completed, are values assigned to specific registers. Further, it is 
at this time that the tuple form is converted into the notation for the target 
machine instruction set. It is assumed that the tuple operations are defined so 
that each tuple corresponds directly to one target machine instruction, hence this 
transformation is easily accomplished.
4.3. Pipeline Configuration Information
In this section we describe the pipeline structure model that is an input to 
the pipeline scheduler. For each hardware pipeline, the function, latency, and 
enqueue time must be specified. Further, so that the compiler can know which 
pipelines, if any, may be used to execute each type of operation, each hardware 
pipeline is given a unique identifier and operation types are associated with sets of 
pipelines. This is done using two tables.
Consider a processor with the following pipelined resources: two memory 
access pipelines (loaders), two adders, and one multiplier. These hardware 
resources are described in Table 4.1.







Function     Identifier   Latency   Enqueue Time
loader       1            2         1
loader       2            2         1
adder        3            4         3
adder        4            4         3
multiplier   5            ?         2
The second table used to describe the scheduling problem for our compiler is
Table 4.2, the operation-to-pipeline mapping table. Given these tables, for
example, the Add instruction has two independent pipelines available to it (namely,
numbers 3 and 4), and thus can be scheduled for either pipeline¹. In this
example, Add and Sub operations share two independent pipelines; likewise, Mul
and Div share a single pipeline.
The results presented in this paper were obtained using a more conservative, 
single pipeline unit per function, the tables for which appear in Section 5.2. 
Notice that changing the pipeline structure changes only the entries in these 
tables, not the structure of the scheduling algorithm. Further, note that the list 
scheduler does not examine these tables, hence, the initial schedule is independent 
of the target pipeline structure.
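The two tables can be pictured as two small lookup structures, which makes it clear why a change of pipeline structure touches only data. The sketch below is an illustration under stated assumptions: the field names are invented, the `Load` row of the mapping is inferred from the description of the loaders, and the multiplier latency is a placeholder value, since it is not legible in the table above.

```python
from typing import Dict, List, Set

# Hardware pipelines: identifier -> description (in the spirit of Table 4.1).
PIPELINES: Dict[int, Dict[str, object]] = {
    1: {"function": "loader", "latency": 2, "enqueue": 1},
    2: {"function": "loader", "latency": 2, "enqueue": 1},
    3: {"function": "adder", "latency": 4, "enqueue": 3},
    4: {"function": "adder", "latency": 4, "enqueue": 3},
    # Latency below is an assumed placeholder, not a value from the source.
    5: {"function": "multiplier", "latency": 4, "enqueue": 2},
}

# Operation type -> set of pipelines that may execute it (in the spirit of Table 4.2).
OP_TO_PIPES: Dict[str, Set[int]] = {
    "Load": {1, 2},   # assumed mapping for the two loaders
    "Add": {3, 4},
    "Sub": {3, 4},
    "Mul": {5},
    "Div": {5},
}


def candidate_pipelines(op: str) -> List[int]:
    """Pipelines available to an operation; empty if it uses no pipeline."""
    return sorted(OP_TO_PIPES.get(op, set()))


print(candidate_pipelines("Add"), candidate_pipelines("Mul"))
```

An operation that has no entry, such as a constant load in this sketch, simply gets an empty list, which matches the idea of an instruction that uses no pipeline resource.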
4.4. Summary
The search for an optimal code schedule for a given pipeline target model is
done by making calls to an O(n) routine that incorporates the pruning techniques
discussed in the previous chapter. The algorithm is very easy to implement and can
be readily modified to include other solution-cost evaluation criteria when
performing pruning. One example would be to consider both pipeline delays and
delays resulting from register spills and minimize both simultaneously. Other
criteria for optimality can also be taken into account.

1 The current implementation does not support this feature.

Although the upper bound for the worst case is still exponential, our tests 
indicate that our algorithm is usually able to find optimal solutions in a very rea­




In the previous chapters we have made claims that our algorithm finds
optimal solutions for “most” blocks. In this chapter we justify these claims and
explain how we evaluated the performance and merits of our technique. In
Chapter 2 we noted that the problem of optimal code scheduling for
pipeline constraints is NP-complete. This implies that there is no known method,
with a worst-case polynomial time complexity, for finding an optimal schedule.
Although we can devise heuristics to reduce the upper bound time complexity, the
resulting solutions will not be optimal for some blocks. Since our proposed
algorithm focuses on the optimality of solutions, its worst case time complexity is
exponential. We have guarded against such cases by halting the search algorithm
after some specified number of search calls and forcing it to return the best
solution found so far in the search.
Having stated the worst case time complexity of our algorithm, we now
demonstrate its superiority (relatively speaking) and usefulness for real
applications. It should be kept in mind that the runtime for finding an optimal
schedule for input basic blocks varies significantly due to the type of
instructions present, interdependencies, pipeline structure used, and the number of
instructions. Therefore, it is not feasible to formulate a closed-form expression
for the performance of this algorithm. Rather, an empirical study was carried out
and the results presented here are indicative of what to expect when this
algorithm is used for real application programs on machines with typical pipeline
structures.
A prototype compiler implementing the algorithms given in Chapter 4 was
tested with carefully generated benchmark programs. These programs were
synthesized according to statistics obtained from “real” programs. The
construction of the synthetic benchmark programs is discussed in Section 5.3.
Section 5.4 describes a general simulation procedure and results. The effects of
variations in curtail point, the number of memory references, and pipeline
parameters on the performance are studied in Sections 5.5, 5.6 and 5.7
respectively. Suboptimal solutions are reviewed in Section 5.8 and finally all the
analysis is summarized in Section 5.9.
5.2. Performance Metrics and Parameters
We are mainly concerned with the percentage of cases for which our algorithm
is successful in finding optimal solutions for a given set of input basic
blocks, and the average runtime¹ associated with it. Typically there are many
basic blocks in real programs and the average runtime will dictate the time
overhead in compiling those programs. The number of NOPs removed is of secondary
importance to us. This is because an optimal solution itself implies the
minimum possible number of NOPs in the schedule. Moreover, in Chapter 2, we
demonstrated that our algorithm will eliminate more NOPs than previous
algorithms simply because it operates prior to register assignment.
An important parameter in this study is the curtail point X. Obviously,
given a large enough X, our algorithm will always find an optimal solution.
However, we must choose a value of X that will result in an acceptable average
runtime (compile time overhead).
For a given value of X, our algorithm will terminate with a possibly suboptimal
solution if the number of search calls exceeds X. The number of search calls
required to find optimal solutions varies with the type of input basic block (i.e.,
its size, dependencies, etc.) and the pipeline configuration. Therefore, we should
expect the percentage of optimal solutions for a given value of curtail point to
vary with basic block size and pipeline structure. This is indeed the case, as
shown by the various result graphs in the following sections.
1 In this chapter we denote runtime in terms of the number of calls to
search procedure $\Omega$ that was explained in Chapter 2.
5.3. Construction of Synthetic Benchmarks
A C program was developed to randomly generate basic blocks according to
the statistics described below. This program requires as input the number of
statements, variables, and constants desired in the generated code. It then
generates a random sequence of assignment statements satisfying the desired
conditions. The frequency of the types of assignment statements corresponds loosely
to the instruction frequency distributions found in [AlW75]. These frequency
distributions reflect the statistics obtained from real programs. The frequency
distributions are shown in Table 5.1.
We preferred synthetic basic blocks over real programs for testing our algo­
rithm for the following reasons:
• The performance of our algorithm depends on the nature of inputs. For real
programs (applications), the structure of basic blocks is not under our control.
This makes it impossible to vary them in order to study performance as
block structure changes.
• Typical block size for real programs is very small (fewer than ten
instructions). We have found that our algorithm works extremely well for basic
block sizes of up to twenty instructions. We did not want to be overly
optimistic and wanted to study performance on large basic blocks that might
occur using techniques such as trace scheduling [Ell85]. Such large blocks
are readily attained using synthetic generation of basic blocks.
For very large basic blocks, it might be useful to split the basic blocks into 
smaller sections (containing, say, twenty instructions or less each) and find solu­
tions which are locally optimal. A good heuristic for the split might be to simply 
partition the list schedule, however, we have not yet examined such techniques.
Note that Table 5.1 does not give the frequencies for Load and Store
instructions. These instructions are provided as necessary during code generation
and optimization: the first reference to a variable causes a load for that variable
to be generated, and a store is generated when a variable is assigned a value. In
Section 5.6 we vary the frequencies of Load and Store instructions and study
the outcome.
Table 5.1. Synthetic Benchmark Instruction Frequencies
Instruction Execution Freq.
Load -








5.4. Simulation of the General Behavior
Results obtained in this section are based on realistic pipeline parameters 
and input basic blocks. Therefore, the performance of our algorithm gives a good 
measure of what can be expected in real benchmarks.
5.4.1. Procedure
A set of 3200 basic blocks was generated with varying numbers of constants,
variables, and instructions. The frequency distribution of these basic blocks with
respect to their sizes is shown in Figure 5.1. These inputs were compiled and
scheduled for the pipeline constraints given in the next section. The curtail point
for these runs was set to 10000, 20000, 50000, 100000 and 200000 successively.
Hence, we obtain a total of 16,000 run samples.
5.4.2. Pipeline Constraints for Simulations
The results shown in this and some subsequent sections were obtained using 





Figure 5.1. Distribution of Sample Block Sizes
be expected in a “real” machine. However, this pipeline structure is still more 
complex than SU-MIPS [GrH88] and RCA-MIPS. In Section 5.7, we examine per­
formance on more varied and complex pipeline structures.
5.4.3. Results
The results presented in this section reflect a total of 16,000 runs with basic 
blocks containing various numbers of statements, variables, and constants. The 
curtail point was also varied, but was always large relative to the number of 
search calls made for an optimal search on average. A very brief summary of the 
results appears in Table 5.4.
Figure 5.2 shows the final number of NOPs after optimization versus the
initial number of NOPs. Figure 5.3 shows the average runtime over all 16,000
sample blocks. Figure 5.4 shows the percentage of all runs which found optimal
schedules, i.e., which were not truncated by X.









multiplier 2 4 2







Notice that the average number of instructions per block for all these inputs
was 20.6, which implies that the typical search, without pruning, would have
required searching on the order of $10^{19}$ schedules, whereas only about $10^3$ were
searched for the average block in our sample.
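The order of magnitude quoted above can be reproduced with a short calculation. This is a rough check only: it counts all orderings of a block and ignores precedence constraints, so it is an upper bound on the number of legal schedules, and extending the factorial to the non-integer average of 20.6 through the gamma function is an assumption of this sketch.

```python
import math

# Number of orderings of a 20-instruction block.
orderings_20 = math.factorial(20)

# Factorial extended to the average block size of 20.6 via gamma(n + 1).
orderings_avg = math.gamma(20.6 + 1)

# Reduction factor relative to the roughly 10^3 schedules actually searched.
reduction = orderings_avg / 1e3

print(f"{orderings_20:.2e} {orderings_avg:.2e} {reduction:.1e}")
```

Both figures land within a factor of ten of the quoted $10^{19}$, so the pruning removes roughly sixteen orders of magnitude of work for the average block.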
Figure 5.1 shows the frequency distribution of the number of instructions per
basic block for our sample. Studies have shown that on average a basic block in
real programs has fewer than ten instructions; however, our average sample block
had 20.6. This yields overly conservative results, since for basic blocks with
fewer than 20 instructions the algorithm nearly always produces optimal solutions.
Though programs with basic blocks that have more than forty instructions are
very rare, we have included even such blocks in our study to show the worst-case
effectiveness of our algorithm.








                           Completed   Truncated   Total
Number of Runs             15,812      188         16,000
Percentage of Runs         98.83%      1.17%       100%
Avg. Instructions/Block    20.50       32.28       20.6
Avg. Initial NOPs          9.50        14.34       9.6
Avg. Final NOPs            0.67        4.03        0.7
Avg. $\Omega$ Calls        427.4       54,150      1,060
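The last column of the summary can be cross-checked against the first two, since each overall average is the run-weighted mean of the two groups. In the sketch below the numbers are those of the table; reading the first two columns as runs that completed and runs that were truncated is an assumption, as are the dictionary keys.

```python
from typing import Dict

completed: Dict[str, float] = {"runs": 15812, "instr": 20.50,
                               "init_nops": 9.50, "final_nops": 0.67,
                               "calls": 427.4}
truncated: Dict[str, float] = {"runs": 188, "instr": 32.28,
                               "init_nops": 14.34, "final_nops": 4.03,
                               "calls": 54150.0}


def pooled(key: str) -> float:
    """Run-weighted average of one metric over both groups."""
    total_runs = completed["runs"] + truncated["runs"]
    weighted = (completed[key] * completed["runs"]
                + truncated[key] * truncated["runs"])
    return weighted / total_runs


for key in ("instr", "init_nops", "final_nops", "calls"):
    print(key, round(pooled(key), 2))

share_completed = 100.0 * completed["runs"] / (completed["runs"] + truncated["runs"])
print(round(share_completed, 3))
```

The pooled values agree with the reported totals after rounding, which also shows how strongly the 188 truncated runs pull up the overall average number of search calls.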
Figure 5.2. Initial and Final NOPs Vs. Block Size
Figure 5.3. Runtime (log scale) Vs. Block Size
Figure 5.4. Percentage Run To Completion Vs. Block Size
The percentage of optimal solutions decreases as the size of basic blocks
increases. This is in accordance with what we expected. The number of possible
(legal) schedules that are to be searched without any pruning increases as a
factorial function of the block size, but our pruning technique works exceptionally
well and even for large basic blocks we are able to find optimal solutions for
most of the cases within a small value of the curtail point.
We selected a set of high curtail points for these runs, i.e., 10000, 20000,
50000, 100000 and 200000. From the results we obtained it is clear that these
seemingly high curtail points result in an average of just 1000 search calls. This
is due to the introduction of artificially produced large basic blocks; with
typical blocks, there would be even fewer calls. The average number of search calls
for all basic blocks with fewer than twenty instructions was about 75. In Figure
5.5, we have plotted the average number of search calls versus the maximum
basic block size in the sample.
Figure 5.5. Average Search Calls Vs. Maximum Block Size
Figure 5.2 shows the final number of NOPs after optimization versus the
initial number of NOPs. Note that the initial number of NOPs grows linearly with
the number of instructions, but the final number of NOPs remains nearly
constant. Obviously, for larger basic blocks there are more instructions that use
pipeline function units, and hence more initial NOPs. The bottom curve, the
final number of NOPs, indicates that the number of removable NOPs increases
with the number of instructions in a basic block. This is quite understandable,
since more instructions are available to fill the delay slots in a larger basic block.
Our results show that for a very small percentage of the inputs (less than 
1.2% overall) the outputs were possibly not optimal. Further study of these 
inputs revealed that the optimal solutions for most of these inputs were not found 
even by increasing the runtime curtail point fifty fold. Moreover, the number of 
final NOPs found (in general) after that was not much different from what was 
found in the runtime allowed in the sample runs. This suggests that the 
algorithm quickly converges to a near-optimal solution. This is further explored in 
Section 5.8.
5.5. Variations in the Curtail Point
In this section we investigate how the performance of our algorithm is 
affected by varying the curtail point X.
5.5.1. Procedure
We separate the results obtained in the previous section for various curtail 
points and plot these results against the values of X. For each X, there was a 
sample of 3200 basic blocks.
5.5.2. Results
In Figure 5.6 we have plotted the average percentage run to completion 
versus the value of the curtail point. The average runtime for each value of X is 








Figure 5.7. Average Runtime Vs. Curtail Point
5.5.3. Discussion
The above results bring forth an important feature of our algorithm. For 
X=10000 the percentage of optimal solutions obtained is 97.87%, and for 
X=200000 this increases to 99.31%. The fact that a twenty-fold increase in 
the curtail point improves the performance by only 1.47% (1.44 percentage 
points) shows that nearly all optimal solutions are obtained quickly by our 
algorithm. And, as we discussed earlier, a value of X of about 100 will be 
sufficient for most blocks.
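The arithmetic behind this comparison can be checked directly. The short sketch below uses only the two percentages quoted above; the variable names are ours.

```python
# Percentages of optimal solutions quoted in the text for two curtail points.
pct_at_10000 = 97.87
pct_at_200000 = 99.31

fold_increase = 200000 / 10000                            # growth of the curtail point
absolute_gain = pct_at_200000 - pct_at_10000              # in percentage points
relative_gain = 100 * (pct_at_200000 / pct_at_10000 - 1)  # in percent

print(f"curtail point grows {fold_increase:.0f}-fold")
print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}%")
```

The 1.47% figure corresponds to the relative gain; the absolute difference between the two percentages is 1.44 points.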
5.6. Varying the Number of Variables
In our discussion about the synthetic benchmark instruction frequencies, we 
mentioned that the Load and Store instructions are provided as necessary 
during code generation and optimization: the first reference to a variable causes a 
load for that variable to be generated, and a store is generated when a variable is 
assigned a value. One parameter for the synthetic basic block generation program 
is the maximum number of variables allowed in a basic block. Program 
statements are then generated randomly using different variables from this set. 
Although code optimizations like dead-code removal and value propagation 
eliminate some of the instructions referencing these variables, we indirectly vary 
the number of memory references in a basic block by specifying the maximum 
number of variables. In this section, we vary the number of variables in basic 
blocks and study the corresponding results.
5.6.1. Procedure
We generated a sample of basic blocks for different numbers of variables, from 
2 to 15. There are about 360 basic block inputs for each value of the maximum 
number of variables. Thus the total number of samples for this experiment is 
3600 basic blocks. These inputs are run with the same pipeline constraints and 
curtail point parameters as in Section 5.4.
5.6.2. Results
The percentage run to completion versus the maximum number of variables 
in any basic block is shown in Figure 5.8. The average runtime for the 
various values of the maximum number of variables in any basic block is 
shown in Figure 5.9.
Figure 5.8. Average Percentage Run To Completion Vs. Variables
Figure 5.9. Average Runtime Vs. Variables
5.6.3. Discussion
When the synthetic program generator has a small number of variables to 
choose from, it tends to reuse the same variable names in a basic block more 
often. When that basic block is compiled and optimized, most of the 
instructions are removed as dead code, resulting in a low average runtime. The 
runtime increases as the number of variables increases, and the average runtime 
peaks at about five variables. After that the runtime begins to decrease as the 
number of variables increases further. This is because when the number of 
variables is comparable to the number of statements, more rigid dependencies 
between instructions begin to appear, which limit the freedom with which the 
instructions can move within the basic block. In any case, these variations are not 
very significant.
5.7. Variation in the Pipeline Structure
All the results that we have discussed up to now were obtained by using the 
pipeline structure of Section 5.4.2. Here we show the effect of scheduling the same 
code for different pipeline hardware.
5.7.1. Procedure
Recall that the pipeline configuration, i.e., the latency of the pipelines, their 
enqueue times and the association of instructions with different pipelines, is an 
input to our scheduling algorithm. We collect a sample of inputs with various block 
sizes and variables similar to the sample taken in Section 5.4.1. These inputs are 
run with various settings of curtail points for the six pipeline structures shown in 
Table 5.5. Here d and en denote pipeline delay and enqueue time respectively.
5.7.2. Results
Figure 5.10 summarizes the effect of variation in the pipeline structure on the 
percentage of optimal solutions.
5.7.3. Discussion
We note that as we progressively make the pipeline structure more complex 
the percentage of optimal solutions, for the same curtail points, decreases. This is 
because the number of delay slots associated with different instructions increases.




Table 5.5. Pipeline structures #1 through #6: delay (d) and enqueue time (en) for Load, Mul, Div, Add and Sub
Therefore, for more complex pipeline structures, our algorithm has to go 
deeper in the search tree before the pruning based on the minimum cost (see 
Section 3.2.1.2) can be done. This in turn increases the number of search calls 
required to find optimal solutions, and for a fixed set of curtail points the 
percentage of optimal solutions decreases.
Perhaps a different tree rearrangement criterion (of the kind discussed in 
Section 3.2.1.3), based on pipeline structure, would be helpful in reducing the 
runtime.
5.8. Sub-Optimal Solutions
Throughout this chapter, we have talked about the percentage of solutions 
that were guaranteed to be optimal by our scheduling algorithm. We also have 
shown that this percentage is very high for all the cases that we have tested. A 
natural question that arises is “what about the cases which are not optimal?” 
How bad are they compared to the optimal solutions? In this section, we attempt 
to answer this question and show some interesting properties of suboptimal 
solutions found in our experiments.
Figure 5.10. Average Percentage Run To Completion Vs. Pipeline Structure
5.8.1. Procedure
The suboptimal solutions were isolated and studied using the following 
procedure.
[1] We started with a large sample of basic blocks, similar to the one described 
in Section 5.4.1.
[2] These basic blocks were run with a curtail point of 1000.
[3] All those basic blocks that our algorithm was able to schedule optimally were
discarded from the sample.
[4] The value of the curtail point was doubled and the remaining basic blocks 
were run with the new value of the curtail point.
[5] Steps [3] and [4] were repeated until the curtail point was as high as 512,000.
The set of basic blocks obtained by applying these steps represents a sample 
that produces 0% optimal solutions for a curtail point of 512,000. The number of
final NOPs for each curtail point for these inputs was recorded.
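The procedure above can be sketched as a short driver loop. This is an illustration only: `isolate_hard_blocks` and the `run_scheduler(block, curtail)` callback, assumed to return a pair (final NOPs, ran to completion), are hypothetical names, and the stand-in scheduler below exists only to make the sketch runnable.

```python
def isolate_hard_blocks(blocks, run_scheduler, start=1000, limit=512000):
    """Return {block index: {curtail point: final NOPs}} for blocks never solved.

    run_scheduler(block, curtail) is a hypothetical callback returning
    (final_nops, completed); completed means the search ran to completion.
    """
    remaining = list(range(len(blocks)))      # indices of blocks still unsolved
    history = {i: {} for i in remaining}      # final NOPs seen at each curtail point
    curtail = start                           # step [2]: first curtail point
    while curtail <= limit and remaining:
        survivors = []
        for i in remaining:
            final_nops, completed = run_scheduler(blocks[i], curtail)
            history[i][curtail] = final_nops
            if not completed:                 # step [3]: discard solved blocks
                survivors.append(i)
        remaining = survivors
        curtail *= 2                          # step [4]: double the curtail point
    return {i: history[i] for i in remaining}


# Stand-in scheduler: a block is "solved" once the curtail point reaches the
# (made-up) number of calls it needs; unsolved blocks report one extra NOP.
def fake_scheduler(block, curtail):
    done = curtail >= block["calls_needed"]
    return (block["optimal_nops"] + (0 if done else 1), done)


blocks = [
    {"calls_needed": 500, "optimal_nops": 0},
    {"calls_needed": 3000, "optimal_nops": 1},
    {"calls_needed": 10**9, "optimal_nops": 5},
]
hard = isolate_hard_blocks(blocks, fake_scheduler)
print(sorted(hard))        # [2]
print(sorted(hard[2]))     # the ten curtail points 1000, 2000, ..., 512000
```

Only the third block survives every doubling, so it is the only one whose NOP counts are reported for all ten curtail points.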
5.8.2. Results
Figure 5.11 shows the number of final NOPs as a fraction of the initial 
(without any code scheduling) NOPs versus the number of search calls made.
Figure 5.11. Fraction of Initial NOPs Vs. Runtime (x-axis: Number of Calls in thousands, log scale, 1 to 512)
The average number of initial and final NOPs obtained after each 
curtail point truncation, for this sample of basic blocks, is given in Table 5.6.
5.8.3. Discussion
The most remarkable feature of our algorithm depicted by Figure 5.11 is 
how quickly it converges to a near-optimal solution. We call it near-optimal 
because by increasing the number of search calls over five hundred fold we could 
improve the suboptimal solution by only a small fraction. Also note that while 
the average final number of NOPs for optimal solutions is about one, the same 
average for suboptimal solutions is around five. Therefore we conclude that 
those basic blocks which have a high count of NOPs in their optimal solutions 
will generally result in suboptimal solutions when scheduled by our algorithm 
with reasonable values of the curtail point. This also follows from the pruning 
based on the minimum cost that our algorithm uses.
Table 5.6. Final NOPs in Suboptimal Solutions
             Initial   Fraction of     Final
   Calls     NOPs      Initial NOPs    NOPs
    1000     13        0.417           5.92
    2000     13        0.417           5.92
    4000     13        0.402           5.69
    8000     13        0.392           5.54
   16000     13        0.386           5.46
   32000     13        0.382           5.38
   64000     13        0.367           5.15
  128000     13        0.362           5.08
  256000     13        0.352           4.92
  512000     13        0.332           4.62
From this discussion we can assume that if we cannot find an optimal 
schedule with a low enough value of the curtail point then it is not worthwhile 
to continue the search, because the quality of the solution is going to improve 
only marginally (if at all).
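The size of that marginal improvement can be read off Table 5.6. The following sketch simply recomputes it from the Calls and Final NOPs columns of the table; the variable names are ours.

```python
# (search calls, average final NOPs) as listed in Table 5.6
table_5_6 = [
    (1000, 5.92), (2000, 5.92), (4000, 5.69), (8000, 5.54), (16000, 5.46),
    (32000, 5.38), (64000, 5.15), (128000, 5.08), (256000, 4.92), (512000, 4.62),
]

first_calls, first_nops = table_5_6[0]
last_calls, last_nops = table_5_6[-1]

call_factor = last_calls / first_calls            # growth in search effort
nops_saved = first_nops - last_nops               # average NOPs removed per block
relative_saving = nops_saved / first_nops         # fraction of the final NOPs removed
per_doubling = nops_saved / (len(table_5_6) - 1)  # average gain per doubling

print(f"{call_factor:.0f}x more calls removes {nops_saved:.2f} NOPs "
      f"({100 * relative_saving:.1f}%), about {per_doubling:.2f} per doubling")
```

In other words, each doubling of the search effort removes on average well under a fifth of a NOP per block for this sample.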
5.9. Summary
The huge search space for optimal (minimal NOP) code schedules has long 
discouraged researchers from attempting to find optimal code schedules. 
However, we have presented a search algorithm which has demonstrated that for over
98% of our realistic synthetic benchmark blocks it is possible to dramatically 
reduce the size of this search space without sacrificing optimality. For the fewer 
than 2% in which the search space cannot be completely searched, good results 
were obtained by simply truncating the search, although this may result in 
suboptimal schedules. A prototype compiler using our algorithm, running on 
workstation-class machines, schedules about 100 typical blocks per second (>10K 
source LPM).
For very large basic blocks, it might be useful to split the basic blocks into 
smaller sections (containing, say, twenty instructions or less each) and find solu­
tions which are locally optimal. A good heuristic for the split might be to simply 




We have presented an algorithm that searches for an optimal schedule for 
multiple pipeline processors and that dramatically reduces the size of the 
search space without sacrificing optimality.
Previous approaches simplify the search for a good schedule by arbitrarily 
imposing constraints, which sacrifices optimality. Our algorithm uses techniques 
that ensure that optimality is preserved. For the fewer than 2% of the cases 
(in our test runs) in which the search space cannot be completely searched, 
near-optimal results were obtained by simply truncating the search, although this 
may result in suboptimal schedules.
In addition to demonstrating the feasibility of optimal code scheduling, we 
have defined our algorithm to use a more general model of pipeline structure than 
previous work. Our model allows multiple pipelines, each with its own latency 
and enqueue time, to be specified. Further, the set of pipelines which may be 
used for each type of instruction can be independently specified.
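As an illustration of what such a machine description might look like as data, here is a minimal sketch. The class names, field names, and numeric values are all hypothetical and are not taken from the thesis or its tables; the comments give our reading of latency (delay before a result is usable) and enqueue time (delay before a pipeline accepts its next instruction).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Pipeline:
    name: str
    latency: int   # assumed meaning: cycles before a result can be used ("d")
    enqueue: int   # assumed meaning: cycles before the next instruction may enter ("en")


@dataclass
class MachineModel:
    pipelines: Dict[str, Pipeline]
    usable_by: Dict[str, List[str]]   # instruction type -> names of usable pipelines

    def candidates(self, op: str) -> List[Pipeline]:
        """Pipelines that may execute `op`; an empty list means `op` uses none."""
        return [self.pipelines[name] for name in self.usable_by.get(op, [])]


# Illustrative machine: a loader and a multiplier shared by Mul and Div.
model = MachineModel(
    pipelines={
        "loader": Pipeline("loader", latency=3, enqueue=1),
        "multiplier": Pipeline("multiplier", latency=5, enqueue=2),
    },
    usable_by={"Load": ["loader"], "Mul": ["multiplier"], "Div": ["multiplier"]},
)

print([p.name for p in model.candidates("Div")])   # ['multiplier']
print(model.candidates("Add"))                     # []
```

Keeping the instruction-to-pipeline mapping separate from the pipeline parameters lets several instruction types share one pipeline, or one instruction type choose among several, without duplicating latency and enqueue values.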
Ongoing work examines performance using various pipeline structures more 
complex than those presented here. Future work will extend the proposed 
pipeline scheduling algorithm to more general code structures including very large 
blocks (as might be generated by trace scheduling [Ell85]) and arbitrary control 
flow. As presented here, the algorithm applies best to scheduling individual basic 












A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, 
Techniques, and Tools, Addison-Wesley, Reading, MA, 1986.
S. Abraham, K. Padmanabhan, “Instruction Reorganization for a 
Variable-Length Pipelined Microprocessor,” IEEE International 
Conference on Computer Design, 1989, pp. 96-101.
W.C. Alexander and D.B. Wortman, “Static and Dynamic 
Characteristics of XPL Programs,” IEEE Computer, November 
1975, pp. 41-46.
D. Bernstein, “An Improved Approximation Algorithm for 
Scheduling Pipelined Machines,” International Conference on 
Parallel Processing 1988, pp. 430-433.
G. Berman, K. D. Fryer, Introduction to Combinatorics, Academic 
Press, New York, NY, 1972.
J. Bruno, J. W. Jones, and K. So, “Deterministic Scheduling with 
Pipelined Processors,” IEEE Transactions on Computers, Vol. C-29, 
No. 4, April 1980.
J. Cocke and J.T. Schwartz, Programming Languages and Their 
Compilers, Preliminary Notes, New York University Courant 
Institute of Mathematical Sciences, Second Revised Version, April 
1970.
H. G. Dietz, The Refined-Language Approach to Compiling for 
Parallel Supercomputers, Ph.D. Dissertation, Polytechnic University, 
June 1987.
H. G. Dietz, H. J. Siegel, W. E. Cohen, M. T. O’Keefe, et al., “A 
Compiler-Oriented Architecture: The CARP Machine,” Fourth 














[Mel88]
[Muc88]
J. R. Ellis, Bulldog: A Compiler for VLIW Architectures, 
Cambridge, MA: MIT Press, 1985.
W. J. Dally, “Micro-Optimization of Floating-Point Operations,” 
ACM Transactions on Programming Languages, Vol. 26, No. 2, 
1989.
J. Gill, T. Gross, N. Jouppi, S. Przybylski, and C. Rowen, 
“Summary of MIPS Instructions,” Stanford University Technical 
Note No. 83-237, November 1983.
Garner et al., “The Scalable Processor Architecture (SPARC),” 
IEEE CompCon, Spring 1988, pp. 278-283.
T. Gross, J. Hennessy, “Optimizing Delayed Branches,” Proceedings 
of MICRO-15, October 1982.
T. Gross, J. Hennessy, S. Przybylski, and C. Rowen, “Measurement 
and Evaluation of the MIPS Architecture and Processor,” ACM 
Transactions on Computer Systems, Vol. 6, No. 3, pp 229-257, 
August 1988.
T. Gross, “Code Optimization Techniques for Pipelined 
Architectures,” COMPCON ’83, Spring 1983.
T. Gross, “Code Optimization of Pipeline Constraints,” Ph.D. 
Thesis, School of Electrical Engineering and Computer Sciences, 
Stanford University, Dec. 1983.
J. Hennessy, and T. Gross, “Postpass Code Optimization of Pipeline 
Constraints,” ACM Transactions on Programming Languages and 
Systems, Vol. 5, No. 3, pp 422-448, July 1983.
J. Hennessy, et al., Conference on VLSI Systems and Computations, 
Carnegie-Mellon University, October 19-21, 1981.
H. F. Li, “Scheduling Trees in Parallel/Pipelined Processing 
Environments,” IEEE Transactions on Computers, Vol. 26, No. 11, 
pp 1101-1112, November 1977.
C. Melear, “RISC Architecture of the M88000,” IEEE International 
Conference on Computer Design, 1989, pp. 370-373.
Muchnick et al., “Optimizing Compiler for the SPARC 








[NiD90] A. Nisar, H. Dietz, Optimal Code Scheduling for Multiple Pipeline 
Processors, Technical Report TR-EE 90-11, School of Electrical 
Engineering, Purdue University, January 1990.
D. Patterson, “Reduced Instruction Set Computers,” Communications 
of the ACM, Volume 29, No. 1, Jan. 1985, pp 8-21.
G. Radin, “The 801 Minicomputer,” IBM Journal of Research and 
Development, May 1983, pp. 237-246.
E. M. Reingold, J. Nievergelt, N. Deo, Combinatorial Algorithms, 
Prentice-Hall, Inc., Englewood Cliffs, NJ, 1977.
T. Riordan et al., “The MIPS M2000 System,” IEEE International 
Conference on Computer Design, 1989, pp. 366-369.
B. Smith, from numerous personal communications. B. Smith is 
currently at Tera Computer Company, Seattle, WA 98103.
A. Zaafrani, H. Dietz, and M. O’Keefe, Static Scheduling for Barrier 
MIMD Architectures, Technical Report TR-EE 90-10, School of 
Electrical Engineering, Purdue University, January 1990.
