Optimal Code Scheduling for Multiple-Pipeline Processors by Nisar, Ashar & Dietz, Henry G.
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
1-1-1990






Nisar, Ashar and Dietz, Henry G., "Optimal Code Scheduling for Multiple-Pipeline Processors" (1990). Department of Electrical and












School of Electrical Engineering 
Purdue University 
West Lafayette, Indiana 47907
Optimal Code Scheduling for 
Multiple-Pipeline Processors
Ashar Nisar and Henry Dietz
School of Electrical Engineering 
Purdue University 
West Lafayette, IN 47906
ABSTRACT
Pipelining the functional units and memory interface of processors can 
result in shorter cycle times and dramatic increases in performance, but only if 
the pipeline latency can be hidden by other useful operations. The portion of 
pipeline latency which is not hidden results in an extension of the total execution 
time, either implemented by hardware interlocks or by compile-time insertion of 
NOPs (Null Operations). By rearranging instructions, it is possible to minimize 
the total pipelined execution time, but the problem of finding this optimal code 
schedule is well known to be NP-complete.
In this paper, we describe a code scheduler for multiple pipeline processors 
where each pipeline may have a different latency and enqueue time. Previous 
approaches simplify the search for a good schedule by arbitrarily imposing con­
straints which sacrifice optimality; the technique given in this paper uses a new 
set of pruning criteria which preserves optimality. Although, in the interest of 
reducing compile time, the new technique permits the search to be truncated, this 
truncation only rarely (in less than 2% of the cases examined) sacrifices optimal- 
ity.
Keywords: pipelines, code/instruction scheduling, optimizing compilers, pipe­
line latency, pipeline enqueue time.
Optimal Scheduling
I. Introduction
Most modern processors, especially RISC designs like Motorola’s 88000
[Mel88], MIPS R3000 [Rio88], SPARC [Muc88], etc., attempt to achieve a peak
performance of one instruction completing execution with every clock tick. 
However, this does not imply that execution of a single instruction always hap­
pens within a single clock tick; rather, pipelined hardware is used to overlap exe­
cution of multiple instructions to achieve this throughput.
For example, if each instruction requires 5 clock ticks to execute, 
throughput of one instruction per clock tick can be obtained by allowing 5 
instructions to overlap execution within a 5-stage pipeline. In order to obtain one 
instruction per clock tick throughput, one simply needs to have one instruction 
ready to enter the pipeline at every clock tick. The problem is that if code is gen­
erated from a high-level language in the most obvious way, many instruction 
sequences will require that a delay be introduced before the next instruction can 
be issued.
The problem of compiling code so as to minimize the total delay which 
must be introduced is nearly as old as the concept of pipelining hardware, and 
appears to have been considered as early as the 1950s. In the 1960s, as circuitry 
became inexpensive enough to make the hardware cost-effective, machines with 
multiple functional units became common: typically, independent adders and 
multipliers which could operate in pipelined overlap with other instructions. 
Most of the compiler research centered on the development of heuristics which 
could be used to generate code so that total delay would be reduced for such 
machines; a reasonable overview appears in [CoS70].
Although the compiler techniques used to generate low-delay code were 
reasonably effective, they generally assumed that the code-generation process 
was relatively straightforward; in other words, these techniques become awkward 
when other compiler optimizations are also being performed. For this reason, the 
emphasis has shifted from heuristics for generating code to heuristics for re­
organizing, or scheduling, code after it has been generated using whatever other 
optimizations were appropriate.
Probably the best known work in instruction scheduling for pipelined pro­
cessors is by Gross, detailed in [Gro83]. Gross proposed a heuristic algorithm for 
reordering instructions and showed that, although his heuristic typically does not 
result in the minimum delay (optimal schedule), the algorithm executes quickly 
and generally yields good results. By applying his algorithm to the optimized 
assembly language output of a compiler, he also avoids the complexity of 
integrating scheduling with the other optimizations within the compiler. It
appears that this is a reasonable approach, except in that the compiler has per­
formed register allocation. Hence, the register assignment can impose
unnecessary restrictions on the schedule, resulting in unnecessary execution 
delays.
Bernstein presented an improved scheduling algorithm, but his work considers
only pipelines having a fixed delay [Ber88]. Abraham et al. [AbP88] permitted
variable delay pipelines, but resorted to a greedy heuristic algorithm,
instead of searching for the optimal schedules.
The algorithm we propose differs from previous work in several ways:
[1] We apply our algorithm to an intermediate form of code which does not
have specific registers assigned, hence register allocation happens after
scheduling and the scheduler is not unnecessarily constrained.
[2] Although our algorithm is also heuristic, none of the heuristics applied
sacrifices optimality. In other words, the search space is pruned dramatically,
but the optimal solution will never be pruned. In cases where the
pruned search space is still too large, the search may be terminated after an
arbitrary number of cases have been examined, but this happens only rarely
and still generally results in very good schedules.
[3] The target pipeline architecture model supported is significantly more gen­
eral than that typically used, permitting multiple pipelines, each with its 
own latency and enqueue time, to be specified. In particular, we believe 
our proposal is the first to consider the pipeline enqueue time as a key pipeline
parameter (relating to conflict-induced delays, described in section 2.1).
Using reasonable compile-time time limits, the algorithm we propose was found
to generate provably optimal schedules for 15,812 of the 16,000 synthetic bench­
mark programs examined (over 98%).
The basic characteristics of pipelined systems are reviewed and the termi­
nology to be used in the remainder of this paper is given in section 2. Section 3 
presents an overview of the complexity of the code scheduling problem viewed 
as an exhaustive search problem. The structure of our prototype compiler and 
algorithm are discussed in section 4; performance of our approach is summarized 




In describing the basic characteristics of pipelined computer systems, it is 
useful to consider the compiler and architecture aspects separately. Naturally, 
this paper is more concerned with the compiler’s view, however, the discussion 
of the architectural structures clarifies how the proposed scheduling model 




As a compiler views a pipelined machine, the main concern is simply that 
the order in which instructions are executed must be sensitive to various 
pipeline-related timing constraints. It is convenient to think in terms of the incre­
mental task of trying to generate code for the next in a sequence of instructions.
There are two primary reasons for which execution of an instruction might 
need to be delayed:
•  Dependence. A dependence occurs when this instruction uses a result com­
puted by an earlier instruction, but the earlier instruction has not yet com­
pleted pipelined execution. Violating a dependence generally results in 
incorrect results being computed.
•  Conflict. A conflict occurs when this instruction requires access to a
hardware structure which is still being used by the pipelined execution of 
an earlier instruction. An unresolved conflict results in a pipeline hazard 
and unpredictable behavior.
Dependence is the most common reason for requiring delays. For example, 
loading a datum from memory into a register might be an instruction which takes 
4 clock ticks to execute, but the very next instruction might depend on the value 
being loaded. Consider typical code implementing the addition of X to register
R0:
Load R1,X ;make register R1 = memory[X]
Add R0,R1 ;make register R0 = R0 + R1
If the hardware were simply to enqueue the load in the pipeline and, in the very 
next cycle, attempt to use the register, the wrong value would be obtained; hence, 
some technique must be used to prevent the second instruction from executing 
until after the first has completed. This would introduce a delay of 3 clock ticks 
between the Load and Add instructions.
Notice that traditional compiler code generation techniques tend to load 
values on demand, resulting in code sequences which have many such depen­
dences.
Modifying the above example, a conflict would arise instead of a depen­
dence if the second instruction is another Load instruction and, for example, the 
hardware required the memory address register (MAR) to hold the memory 
address being accessed for the first 2 clock ticks of the Load operation. Con­
sider:
Load R1,X ;make register R1 = memory[X]
Load R2,Y ;make register R2 = memory[Y]
In this case, the second Load would have to be delayed until the first Load
had finished using the MAR; a delay of 1 clock tick would have to be placed
between the two Load operations.
Hence, there is a significant difference between dependence-induced and 
conflict-induced delays: beside the semantic differences, they generally do not 
imply the same amount of delay. For each pipeline, the compiler needs to be
aware of two separate parameters corresponding to the delay times seen for 
dependence and conflict resolution, respectively:
•  Latency. The pipeline latency is the number of clock ticks which must 
occur between enqueuing an operation in a pipeline and the result of that 
operation becoming available. In other words, it is the minimum time
between issuing an instruction and issuing a second instruction which has a
dependence on the first; the “depth” of the pipeline measured in units of
time.
•  Enqueue time. The pipeline enqueue time is the minimum number of 
clock ticks which must occur between enqueuing one operation in a partic­
ular pipeline and enqueuing a second operation in that pipeline. In other 
words, it is the minimum time between items in a pipeline.
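The relation between these two parameters and the delays in the Load examples above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `nops_between` and the constant names are hypothetical, and the values 4 and 2 are simply the ones assumed in the two examples.

```python
def nops_between(required_spacing: int) -> int:
    """NOPs needed between two back-to-back instructions.

    Adjacent instructions already issue 1 clock tick apart, so a required
    issue-to-issue spacing of `required_spacing` ticks costs that many
    ticks minus one (never less than zero).
    """
    return max(0, required_spacing - 1)


LOAD_LATENCY = 4       # ticks; value assumed in the Load/Add example
LOAD_ENQUEUE_TIME = 2  # ticks the MAR stays busy in the Load/Load example

# Load R1,X ; Add R0,R1  -> dependence, governed by the latency
dependence_delay = nops_between(LOAD_LATENCY)

# Load R1,X ; Load R2,Y  -> conflict, governed by the enqueue time
conflict_delay = nops_between(LOAD_ENQUEUE_TIME)

print(dependence_delay, conflict_delay)
```

The two results match the delays of 3 and 1 clock ticks quoted for the dependence and conflict cases, which is why the compiler has to track latency and enqueue time separately.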
For a classical pipeline, the latency is a few clock ticks and the enqueue
time is 1 clock tick (since each stage of the pipeline uses functional units
independent from those of other stages). However, it is not uncommon to find
hardware being shared by a few pipeline stages (or, equivalently, to find each
stage taking a few cycles). Further, machines which have functional units that
can operate in parallel with other functional units but are not internally pipelined
are easily modeled by making each functional unit appear as a pipeline where the
enqueue time = latency.
The fact that some architectures have multiple pipelines raises yet another 
issue in the compiler’s management of pipelined systems; the compiler may 
have to decide which of several viable pipelines to use for each operation. For 
example, in a machine with two pipelined multipliers, which multiplier should be 
used for each operation?
2.2. Architecture’s View
In the compiler’s view we identified the causes of execution delays, but we
did not define their architectural implementation. When a dependence or conflict
would otherwise cause improper execution, the architecture must have some
mechanism for introducing the appropriate delay. In discussions of pipelined
hardware, these delays are sometimes referred to as “pipeline bubbles” [Pat85].




• Implicit interlock. In this technique, the hardware checks each instruction 
just before execution to make sure that it does not depend on the results of 
any operations which are currently in the pipeline. If there is such a 
conflict, the hardware simply delays issuing the instruction until the 
conflicting operation in the pipeline has completed.
The implicit interlock approach has long been the standard approach. 
It continues to be used in most modern processors, including RISC-style
architectures such as the IBM 801 [Rad83], RISC II, and SPARC [Gar88] 
architectures.
•  Explicit interlock (explicit waiting). In this technique, the compiler 
marks each instruction with a tag indicating whether it must wait for a par­
ticular pipelined operation to complete before this instruction can begin 
executing. This technique is very similar to an implicit interlock, however, 
the hardware is simpler since it does not need to detect which operations 
interfere.
The machine being developed by Tera [Smi88] uses an explicit inter­
lock based on the compiler tagging instructions with a count field which 
gives the number of instructions since the last instruction that this instruc­
tion depends on or conflicts with. Another example of explicit interlock is 
the proposed CARP machine [DiS89]; CARP uses a bit mask in each 
instruction to indicate which variable-latency resources (e.g., global 
memory accesses using an interconnection network) each instruction must 
wait for.
•  NOP insertion (padding). In this technique, the compiler takes full respon­
sibility for the management of the pipeline by simply placing NOP (Null 
Operations — instructions known to be non-interfering with any type of 
pipeline activity) between instructions which would otherwise result in 
pipeline conflicts. The hardware is the simplest of the three techniques, but 
the compiler must perform analysis of the pipeline activity implied by the 
code.
The best known example of NOP padding for introducing delays is 
probably the MIPS processor [Hen81], although this seems to be becoming 
more popular as a general approach. For example, much of the work 
toward GaAs processors uses NOP padding. Further, pipelines with fixed 
latency are handled in this way in the CARP machine [DiS89].
Of course, the best solution is to never have the next instruction interfere 
with the instructions currently in the pipeline. By pipeline analysis and rear­
rangement — scheduling — of the code, a compiler can effectively eliminate the 
need for inserting delays. The current popularity of the NOP insertion technique
is, to a great extent, the result of the realization that this scheduling is important
enough that every compiler should do it, in which case the compiler technology 
for NOP insertion is free, whereas the hardware implementing an interlock is 
not.
In this paper, for convenience, we shall consistently refer to delays in terms 
of inserting NOPs. However, the approach is not sensitive to which hardware 
mechanism is being employed. This is a key reason for discussing the 
architecture’s view — to show that it is in fact orthogonal to the compiler’s view. 
Hence, the scheduling techniques discussed in this paper apply equally well to 
any architectural implementation of delays.
2.3. The Complexity Of The Problem
The problem of instruction scheduling for a program, given a set of pipeline
constraints, is typically handled by compiling the program into assembly 
language instructions. These instructions are then grouped into basic blocks 
[AhS86] and each basic block is independently scheduled¹ for the given pipeline
constraints.
Without employing any pruning, finding the optimal schedule for a block of
n instructions requires an exhaustive search of all n! possible schedules. It is
convenient to think of this as requiring n! invocations of an O(n) procedure,
called Q, which generates a schedule of the n instructions and computes the
number of NOPs required by that schedule.
As discouraging as these complexity measures sound, we continued to 
determine the approximate time one might expect for a compiler to schedule a 
typical block containing about 15 instructions. A reasonably efficient C implementation
of the procedure Q was created and its approximate runtime determined
on a variety of machines. The average time for one application of Q,
including the call overhead, was 0.12 milliseconds on a heavily-loaded Gould
NP1. For a Sun 3/50 workstation the average time was about 0.3 milliseconds.
Given a block containing 15 instructions, Q would be applied 15!, or
1,307,674,368,000, times. Hence, our typical 15-instruction block could be
scheduled on an NP1 in a mere 156,920,924 seconds, or just under 5 years!
Worse still, most programs contain many such blocks.
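The arithmetic behind this estimate is easy to reproduce. The sketch below assumes the per-call time on the Gould NP1 reads 0.12 milliseconds, which is the value that reproduces the quoted total of 156,920,924 seconds; the variable names are illustrative only.

```python
import math

# Number of orderings of a 15-instruction block (exhaustive search).
schedules = math.factorial(15)

# Assumed cost of one evaluation of a complete schedule on the Gould NP1.
seconds_per_call = 0.12e-3

total_seconds = schedules * seconds_per_call
years = total_seconds / (365.25 * 24 * 3600)

print(schedules, round(total_seconds), round(years, 2))
```

At the 0.3 milliseconds measured on the Sun 3/50 the same search would take 2.5 times as long, so the conclusion does not depend on which machine is used.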
No doubt, it is this type of analysis which led researchers to sacrifice 
optimality and investigate heuristic scheduling techniques. However, all is not as 
bleak as it seems because many of the schedules can be pruned from the search. 
Our approach was simply to prune the search as much as possible without
sacrificing optimality.
¹ Interactions between adjacent blocks can be managed without major modification of the basic block
schedules, essentially by modifying the initial conditions in the analysis for each block. However, detailed
discussion of block interactions is beyond the scope of this paper.
The most obvious pruning of the schedule search space is to avoid con­
sideration of any orderings which would result in incorrect execution due to 
violating a dependence (i.e., making the consumer of a value execute before the 
producer of that value). This was implemented; however, we also formulated
and implemented a number of other heuristics which pruned the search space
significantly without sacrificing optimality. Table 1 presents a sample of how
well we were able to prune the search space for schedules for typical blocks. The 
same typical 15-instruction block that would have taken 5 years to schedule 
optimally can be scheduled optimally in an average of about 0.01 seconds using 
the proposed pruning.
Instructions   Exhaustive    Pruning        Proposed
In             Search        Illegal        Pruning
Block          Q Calls       Q Calls        Q Calls
8              40,320        163            76
11             39,916,800    9,039          12
13             6.2×10^9      65,105         394
13             6.2×10^9      40,240         21
14             8.7×10^10     175,384        1,676
16             2.1×10^13     27,487         17
16             2.1×10^13     5,800,000      66,890
16             2.1×10^13     92,228,324     5,434
20             2.4×10^18     12,872         334
21             5.1×10^19     58,581         202
22             1.1×10^21     >9,999,000     119
Table 1: Search Space for Representative Examples
Of course, despite the fact that our pruning works very well on average, it 
has terrible worst-case performance. To limit the worst-case runtime for our 
algorithm, the concept of a curtail point X is used. This is a user-supplied parameter
specifying the maximum number of schedules to be considered. The proposed
scheduling algorithm terminates when either:
[1] All possibly-optimal schedules have been examined². In this case, the best
schedule found is an optimal schedule.
[2] A total of X schedules have been examined (i.e., X calls have been made to
Q). Because some possibly-optimal schedules have not been examined, the
best schedule found might or might not be an optimal schedule.
² Our search algorithm will sometimes prune optimal schedules from the search, but only if they are
provably equivalent to a schedule which was not pruned.
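The two termination conditions can be illustrated with a small depth-first search. The sketch below is not the pruning scheme of this paper: it is a generic branch-and-bound that prunes only illegal orderings and partial schedules that already need at least as many NOPs as the best schedule found. The names `total_nops`, `search`, and `curtail`, the data layout, and the choice to count each candidate extension against the curtail point are all assumptions made for the example.

```python
def total_nops(order, pipe_of, preds, pipelines):
    """NOPs needed by a legal (possibly partial) schedule.

    pipelines maps a pipeline id to (latency, enqueue_time); instructions
    that use no pipeline are assumed to deliver results after 1 tick.
    """
    issue, last, tick, total = {}, {}, 0, 0
    for ins in order:
        t = tick
        pipe = pipe_of.get(ins)
        if pipe is not None and pipe in last:          # conflict
            t = max(t, last[pipe] + pipelines[pipe][1])
        for p in preds.get(ins, ()):                   # dependence
            lat = pipelines[pipe_of[p]][0] if pipe_of.get(p) is not None else 1
            t = max(t, issue[p] + lat)
        total += t - tick
        issue[ins] = t
        if pipe is not None:
            last[pipe] = t
        tick = t + 1
    return total


def search(initial, pipe_of, preds, pipelines, curtail):
    """Return (best_order, best_nops, complete)."""
    best = list(initial)
    best_cost = total_nops(best, pipe_of, preds, pipelines)
    examined = 0
    complete = True

    def extend(partial, remaining):
        nonlocal best, best_cost, examined, complete
        cost = total_nops(partial, pipe_of, preds, pipelines)
        if cost >= best_cost:
            return                    # NOPs never decrease as we extend
        if not remaining:
            best, best_cost = list(partial), cost
            return
        for ins in sorted(remaining):
            if not complete:
                return
            if preds.get(ins, set()) <= set(partial):  # legality check
                if examined >= curtail:
                    complete = False  # condition [2]: search truncated
                    return
                examined += 1
                extend(partial + [ins], remaining - {ins})

    extend([], set(initial))
    return best, best_cost, complete


# Two loads (1, 3) sharing one memory pipeline, each feeding an add (2, 4).
PIPES = {"mem": (4, 2)}
PIPE_OF = {1: "mem", 3: "mem"}
PREDS = {2: {1}, 4: {3}}
best, cost, complete = search([1, 2, 3, 4], PIPE_OF, PREDS, PIPES, curtail=1000)
print(best, cost, complete)
```

With a generous curtail value the search runs to completion and the result is known to be optimal; with a very small value the `complete` flag comes back false and the returned schedule is only the best one seen so far.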
Fortunately, our results show that the vast majority of all blocks will terminate
on case [1] if X is on the order of 1,000. In fact, for most blocks of fewer
than 20 instructions, an X value of about 50 would suffice. Using the algorithms
and synthetic benchmarks described in detail later in this paper, the search for
15,812 of the 16,000 blocks terminated on condition [1]: the number of






Figure 1: Schedules Searched Vs. Block Size for 15,812 Complete Runs
Unfortunately, in the case that a reasonable X is exceeded and the search is 
truncated by rule [2], we were generally unable to determine how often the 
schedule is actually optimal despite the fact that some schedules were not con­
sidered. This is due to the fact that when a reasonable value of X was exceeded, 
the search space tended to be very large, so that even increasing the X value by a 
factor of fifty did not cause the search to run to completion... however, neither 
did the best schedule change. For this reason, we suspect that many of the trun­
cated searches also found optimal or nearly optimal solutions, but we cannot yet 
prove this.
Note that the total number of legal schedules which must be searched 
derives primarily from the dependence and conflict properties of instructions 




In this section, we outline the general structure of a prototype implementa­
tion of the proposed optimal pipeline scheduling technique. The construction of 
the compiler front end does not impact the scheduling technique, hence only the 
back end of the compiler is discussed. Figure 2 shows the organization of the 













Figure 2: Organization of Prototype Scheduling Compiler
Each phase is discussed briefly below. Section 3.1 discusses optimized 
tuple generation. The main contributions of this paper are discussed in sections 
3.2 and 3.3: respectively, the list scheduler and the pipeline scheduler. Finally, 
register allocation and code generation are reviewed in section 3.4.
3.1. Optimized Tuple Generation
The compiler front end is responsible for parsing the source program, per­
forming traditional optimizations, and emitting an appropriate intermediate form 
representation of the program.
Optimization of the code is not strictly necessary in order to perform
pipeline scheduling; in fact, if traditional optimizations are applied, the general
effect is that finding good schedules becomes more difficult. Hence, in the
interest of obtaining accurate results, the prototype compiler performs most tradi­
tional optimizations. These include constant folding with value propagation, 
common subexpression elimination, dead code elimination, and various peephole 
optimizations. The resulting code, which is usually substantially smaller than the 
unoptimized code, is then represented as a DAG (directed acyclic graph) 
[AhS86] embedded in a linear notation.
The notation we use for each instruction is that of a tuple of the form i: O α, β
where i is the reference number of the tuple, O is the operation type, and α and β
are two operands. Each operand can be a variable, the result of another tuple (the
reference number of another tuple), or ∅. An example of tuple code, corresponding
to a very simple basic block is given in Figure 3.
1: Const 15          1,Const,"15"
2: Store #b, 1       2,Store,"b",1
3: Load #a           3,Load,"a"
4: Mul 1, 3          4,Mul,1,3
5: Store #a, 4       5,Store,"a",4
Figure 3: Sample of Intermediate Form
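One way to hold such tuples in memory, and to read the value dependences straight off the operands, is sketched below. The class name `IRTuple`, the field names, and the helper `value_predecessors` are hypothetical, and the block is transcribed from Figure 3 as best it can be read. Integer operands are taken to be tuple references and string operands to be variable names or constants, which is an assumption of this sketch.

```python
from typing import NamedTuple, Union

# An operand is a tuple reference (int), a variable or constant (str), or empty.
Operand = Union[int, str, None]


class IRTuple(NamedTuple):
    ref: int            # reference number of the tuple
    op: str             # operation type
    a: Operand = None   # first operand
    b: Operand = None   # second operand


# The basic block of Figure 3.
BLOCK = [
    IRTuple(1, "Const", "15"),
    IRTuple(2, "Store", "b", 1),
    IRTuple(3, "Load", "a"),
    IRTuple(4, "Mul", 1, 3),
    IRTuple(5, "Store", "a", 4),
]


def value_predecessors(block):
    """Map each tuple to the tuples whose results it uses as operands.

    Only operand references are followed; ordering constraints between
    memory operations on the same variable are not modeled here.
    """
    return {t.ref: {x for x in (t.a, t.b) if isinstance(x, int)} for t in block}


PREDS = value_predecessors(BLOCK)
print(PREDS)
```

Because operands name other tuples rather than registers, the only ordering constraints visible at this level are true value dependences.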
At the level of the tuple code, all references to variables are assumed to be 
unambiguous and mutually exclusive, i.e., no two variable names refer to the 
same object. Since this is not true of some high-level language program references
to array elements or objects accessed through indirection on pointers, it is
assumed that the compiler front end has done appropriate analysis and renaming 
so that these ambiguities need not be seen in the tuple code [Die87]. Since the
prototype compiler was used solely for synthetic benchmarks whose properties
could be controlled directly, the prototype compiler simply assumes that all variable
names appearing in tuples are unambiguous and mutually exclusive.
At this stage, it is also important that a portion of the register allocation 
analysis be performed — the creation of register spill code. Since values are not 
allocated to particular registers, the concept is simply that if there are more live 
values than registers in the target machine, then all values beyond the number of 
registers will be explicitly re-loaded. In other words, we ensure that when registers
are actually allocated later, there will be no need to introduce new spill
instructions, since these could invalidate the optimality of the schedule. Note
that inserting spill instructions after scheduling would usually result in a valid
schedule, since Store instructions typically do not interfere with any pipelined
operations.
In the simulations presented here, the prototype implementation simply 
assumed that there were always enough registers so that spilling would be 
unnecessary.
3.2. List Scheduler
As tuple code is emitted by the front end, the code is grouped into basic 
blocks [AhS86] and each block is processed independently. The purpose of the 
list scheduling phase is to apply heuristics to generate a reasonable schedule of 
the current block. This is important because the search is pruned, in part, by an 
α-β technique which makes the total number of schedules searched sensitive to
the quality of schedules searched early in the process.
The heuristic used is described in depth in [ZaD90], where it was applied to 
generate an order for incrementally scheduling tuples across multiple processors 
in barrier MIMD machines. In essence, the heuristic arranges the tuples into a 
sequential order (schedule) so that the distance between each instruction and the 
instructions that depend on it is as large as possible. Because of the α-β pruning,
the time taken in applying the list scheduling heuristic is more than recovered by 
the fact that the search for an optimal pipeline schedule will converge more 
quickly.
Alternatively, any other scheduling technique proposed in the literature, 
e.g., Gross [Gro83], could be applied to find this initial schedule. It is
unclear whether the extra complexity of those techniques would be justifiable for 
use in place of our list scheduling heuristic.
3.3. Pipeline Scheduler
Having obtained a “reasonable” initial schedule, the pipeline schedule
search algorithm is applied to find the optimal schedule. This algorithm, given in 
section 4.2, represents the prime contribution of this paper. The output is simply 
a schedule of the tuples within each block.
3.4. Register Allocation and Code Generation
As discussed earlier, the few pipeline scheduling algorithms presented in 
the literature act as postpass reorganizers, and work on the assembly level pro­
duced by the compiler. The scope of reorganization done at this level is limited, 
because the assembly code (in general) reflects the assignment of values to a lim­





The approach presented here is not constrained by “artificial” conflicts
resulting from coincidental reuse of a register name. Only at this stage, after 
scheduling has completed, are values assigned to specific registers. Further, it is 
at this time that the tuple form is converted into the notation for the target 
machine instruction set. It is assumed that the tuple operations are defined so 
that each tuple corresponds directly to one target machine instruction, hence this 
transformation is easily accomplished.
4. Scheduling Algorithm
Before presenting the scheduling algorithm, it is useful to define the infor­
mation which will be used as input to the pipeline scheduler. In the previous sec­
tion, an overview was given of the tuple form representing each basic block to be 
scheduled. Section 4.1 presents a similar overview of the pipeline configuration 
information the search procedure needs in order to determine the optimality of a 
schedule. The following section, section 4.2, presents the scheduling algorithm 
itself.
4.1. Pipeline Configuration Information
For each hardware pipeline, the function, latency, and enqueue time must 
be specified. Further, so that the compiler can know which pipelines, if any, may 
be used to execute each type of operation, each hardware pipeline is given a 
unique identifier and operation types are associated with sets of pipelines. This is 
done using two tables.
Consider a processor with the following pipelined resources: two memory 
access pipelines (loaders), two adders, and one multiplier. These hardware







Function      Pipeline   Latency   Enqueue Time
loader        1          2         1
loader        2          2         1
adder         3          4         3
adder         4          4         3
multiplier    5          4         2
Table 2: Sample Pipeline Description Table
The second table used to describe the scheduling problem for our compiler
is Table 3, the operation-to-pipeline mapping table. Given these tables, for
example, the add instruction has two independent pipelines available to it
(namely, numbers 3 and 4), and thus can be scheduled for either pipeline³. In
this example, Add and Sub operations share two independent pipelines; like­






Sub {3,4}
Mul {5}
Div {5}
Table 3: Sample Operation-to-Pipeline Mapping
The results presented in this paper were obtained using a more conserva­
tive, single pipeline unit per function, the tables for which appear in section 5.1. 
Notice that changing the pipeline structure changes only the entries in these 
tables, not the structure of the scheduling algorithm. Further, note that the list 
scheduler does not examine these tables, hence, the initial schedule is indepen­
dent of the target pipeline structure.
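Because the pipeline structure lives entirely in these two tables, they map naturally onto two dictionaries. The sketch below is one hypothetical encoding: the class `Pipeline`, the names `PIPELINES`, `OP_TO_PIPES`, and `candidate_pipelines` are invented for illustration, the order of the latency and enqueue-time columns of Table 2 is assumed, and only the operation rows that are legible in Table 3 or stated in the text are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    function: str
    latency: int        # ticks until the result is available
    enqueue_time: int   # minimum ticks between operations in this pipeline


# Table 2, keyed by the unique pipeline identifier.
PIPELINES = {
    1: Pipeline("loader", 2, 1),
    2: Pipeline("loader", 2, 1),
    3: Pipeline("adder", 4, 3),
    4: Pipeline("adder", 4, 3),
    5: Pipeline("multiplier", 4, 2),
}

# Table 3: operation type -> set of pipelines able to execute it.
OP_TO_PIPES = {
    "Add": {3, 4},
    "Sub": {3, 4},
    "Mul": {5},
    "Div": {5},
}


def candidate_pipelines(op: str) -> set:
    """Pipelines that may execute `op`; empty if it uses no pipeline."""
    return OP_TO_PIPES.get(op, set())


print(candidate_pipelines("Add"), candidate_pipelines("Store"))
```

Retargeting the scheduler to a different machine then amounts to replacing the contents of the two dictionaries while the search code stays unchanged.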
4.2. Pipeline Scheduling Algorithm
The input to the pipeline scheduling algorithm is an initial (list) schedule 
and the DAG (Directed Acyclic Graph) [AhS86] it embeds. From this, all needed
dependence information is derived. The pipeline scheduling algorithm is a 
heavily-pruned search algorithm in which the minimum valid number of NOPs 
are inserted before each instruction is added to each partial schedule. The 
schedule with the fewest NOPs inserted is the best schedule.
Section 4.2.1 defines a few terms and functions used in the algorithm. The 
algorithm itself is presented in two parts: the NOP insertion algorithm in section 
4.2.2 and the complete search procedure in section 4.2.3.
4.2.1. Definitions
The following terms and functions are used in the algorithms which follow:
Definition 1: $\Pi$
$\Pi$ is the current complete ordering of all instructions within this basic
block. The $i$th instruction in $\Pi$ will be denoted as $\Pi(i)$; likewise,
$\Pi^{-1}(\delta)$ returns the position of instruction $\delta$ within $\Pi$.
Instructions within $\Pi$ are labeled $1, 2, 3, \ldots, |\Pi|$.
Definition 2: $\rho(\xi)$
$\rho(\xi)$ is the set of all instructions $\delta \in \Pi$ such that $\xi$ has an
immediate dependence on $\delta$. Equivalently, $\rho(\xi)$ is the set of all
immediate predecessors of $\xi$ in the DAG described above.
Definition 3: $\sigma(\xi)$
$\sigma(\xi)$ is the pipeline resource used by instruction $\xi$.
Definition 4: $\eta(i)$
$\eta(i)$ is the number of NOPs inserted immediately before the $i$th instruction
within $\Pi$.
Definition 5: $\rho(\Pi)$
$\rho(\Pi) = \sum_{j=1}^{|\Pi|} \eta(j)$, the total number of NOPs required by the
schedule $\Pi$.
Definition 6: earliest($\xi$)
earliest($\xi$) is the minimum number of instructions in $\Pi$ which must be executed
before $\xi$ in order to preserve the dependence structure given by the
DAG. In other words, it is the number of instructions in a slice rooted at $\xi$.
Definition 7: latest($\xi$)
latest($\xi$) is the maximum number of instructions in $\Pi$ which could be executed
before $\xi$ in order to preserve the dependence structure given by the
DAG. In other words, it is $|\Pi|$ minus the number of instructions which transitively
or directly depend on $\xi$.
³ The algorithm presented in section 4.2 does not support this feature.
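The last two definitions can be computed directly from the immediate-predecessor sets. The following sketch is one possible reading: it treats the slice rooted at an instruction as including that instruction itself, so that both quantities behave as 1-based position bounds, and that interpretation is an assumption. The function names and the sample predecessor sets (taken from the five-tuple block of Figure 3) are illustrative.

```python
def transitive(graph, node):
    """All nodes reachable from `node` by following edges in `graph`."""
    seen, stack = set(), list(graph.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, ()))
    return seen


def earliest_latest(preds):
    """Return (earliest, latest) for a DAG given immediate predecessors.

    preds must contain every instruction as a key.
    """
    succs = {n: set() for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].add(n)
    total = len(preds)
    # Size of the slice rooted at n: its transitive predecessors plus n itself.
    earliest = {n: len(transitive(preds, n)) + 1 for n in preds}
    # Block size minus the instructions that directly or transitively depend on n.
    latest = {n: total - len(transitive(succs, n)) for n in preds}
    return earliest, latest


PREDS = {1: set(), 2: {1}, 3: set(), 4: {1, 3}, 5: {4}}
EARLIEST, LATEST = earliest_latest(PREDS)
print(EARLIEST, LATEST)
```

An instruction whose two bounds coincide has only one legal position, so it never needs to be considered for a swap during the search.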
4.2.2. NOP Insertion Algorithm
The following algorithm is used to determine the number of NOPs which
would need to be inserted in the schedule $\Pi$ immediately before the $i$th
instruction, $\xi$. It is assumed that for each instruction scheduled in a position
$j < i$, $\eta(j)$ has previously been set to the number of NOPs which must be
inserted immediately before that instruction. The algorithm is:
[1] $\eta(i) = 0$. If $i = 1$, then done. Otherwise, go to step [2].
[2] If $\sigma(\xi) = \emptyset$, goto step [4].
[3] (Check for conflict.) Let $\tau(j) = \eta(i) + \sum_{k=j+1}^{i-1} \eta(k) + 1$, the
execution time between the start of the $j$th instruction and the $i$th instruction.
Search backward from the $(i-1)$th instruction until
$\tau(j) > \text{enqueue time of } \sigma(i) \lor \sigma(j) = \sigma(i) \lor j = 1$. If
$\sigma(j) = \sigma(i) \land \tau(j) < \text{enqueue time of } \sigma(i)$, then




[4] If $\rho(\xi) = \emptyset$, then done.
[5] (Check for dependence.) Perform step [6] for each instruction
$\delta \in \rho(\xi)$, then done.
[6] Let $x$ = latency of pipeline $\sigma(\Pi^{-1}(\delta))$ $-$ $\tau(\Pi^{-1}(\delta))$.
If $x > 0$, then $\eta(i) = \eta(i) + x$.
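The conflict and dependence checks can be sketched in Python as follows. This is a minimal illustration rather than the prototype's implementation, and it rests on several assumptions: each instruction and each NOP occupies one issue cycle; the action taken on an enqueue conflict (not legible in step [3] above) is to delay the instruction until the enqueue time has elapsed; an instruction that uses no pipeline makes its result available one cycle later; and the `Pipe` type, the function name, and the pipeline parameters are hypothetical. The sketch tracks issue cycles directly instead of evaluating $\tau(j)$, and it uses 0-based positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipe:
    latency: int  # cycles from issue until a dependent instruction may issue
    enqueue: int  # minimum cycles between two issues to the same pipeline


def insert_nops(order, pipe_of, preds, pipes):
    """Return eta, where eta[i] is the NOP count placed before order[i]."""
    issue = {}  # instruction -> cycle at which it issues
    eta = []
    for i, ins in enumerate(order):
        # Cycle at which `ins` would issue with no NOPs in front of it.
        ready = issue[order[i - 1]] + 1 if i > 0 else 0
        need = ready
        pipe = pipe_of[ins]
        if pipe is not None:
            # Conflict check: nearest earlier user of the same pipeline.
            for prev in reversed(order[:i]):
                if pipe_of[prev] == pipe:
                    need = max(need, issue[prev] + pipes[pipe].enqueue)
                    break
        for dep in preds[ins]:
            # Dependence check: wait out the producer's latency.
            dep_pipe = pipe_of[dep]
            # Assumption: instructions using no pipeline have latency 1.
            latency = pipes[dep_pipe].latency if dep_pipe is not None else 1
            need = max(need, issue[dep] + latency)
        eta.append(need - ready)
        issue[ins] = need
    return eta


# Illustrative pipeline parameters and block, not taken from the paper's tables.
PIPES = {"loader": Pipe(latency=2, enqueue=1),
         "multiplier": Pipe(latency=4, enqueue=2)}
PIPE_OF = {"ld_a": "loader", "ld_b": "loader", "mul": "multiplier", "add": None}
PREDS = {"ld_a": [], "ld_b": [], "mul": ["ld_a", "ld_b"], "add": ["mul"]}
ETA = insert_nops(["ld_a", "ld_b", "mul", "add"], PIPE_OF, PREDS, PIPES)
print(ETA, sum(ETA))
```

Summing the returned list gives the total NOP count of Definition 5 for the given ordering.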
4.2.3. The Search Procedure
The following is the schedule search algorithm which forms the core of our
approach. It uses the NOP insertion algorithm given above and takes the initial
list schedule as $\Pi$, the current block to schedule.
[1] For $i = 1$ to $|\Pi|$, invoke the above algorithm to insert the correct
number of NOPs before instruction $\Pi(i)$. Call the resulting schedule $\pi$,
the best schedule found thus far.
[2] Partition $\Pi$ into $\Phi$ and $\Psi$, where $\Phi$ represents the partial
schedule being considered and $\Psi$ represents the list of instructions to be
added to schedule $\Phi$. Initially, $\Phi = \emptyset$ and $\Psi = \Pi$. Let
$i = 1$. Let $\Lambda = 0$.
[3] If $\Psi \ne \emptyset$ then the schedule is not yet complete and search
continues with step [4]. Otherwise, if $\mu(\Pi) < \mu(\pi)$, then $\pi = \Pi$.
Go to step [7].
[4] (Apply curtail point search truncation.) Let $\Lambda = \Lambda + 1$. If
$\Lambda > \lambda$ then done, with a possibly suboptimal best schedule $\pi$.
Otherwise, continue with step [5].
[5] (Get next schedule pruned by legality and equivalence checks.) Consider
swapping instruction $\kappa = \Pi(i) \mid \kappa \in \Phi$ with an instruction
$\xi \in \Psi$. The swap should be performed only if all of [5a], [5b], and [5c]
are true:
[5a] (Quick approximate check for legality.)
latest($\kappa$) $\ge \Pi^{-1}(\xi)$ $\land$ earliest($\xi$) $\le i$
[5b] (Real test for legality.) $\rho(\xi) \subseteq \Phi$
[5c] (Check for equivalence.)
$\sigma(\xi) \ne \emptyset \lor \rho(\xi) \ne \emptyset \lor \sigma(\kappa) \ne \emptyset \lor \rho(\kappa) \ne \emptyset$
If no legal swap was found, go to step [7]. Otherwise, interchange $\xi$ with
$\kappa$ (which alters $\Pi$, $\Phi$, and $\Psi$) and invoke the above algorithm
to insert NOPs for this last instruction.
[6] (Apply $\alpha$-$\beta$ pruning.) If $\mu(\Phi) < \mu(\pi)$, then move the
partition between $\Phi$ and $\Psi$ to reduce $\Psi$ by one instruction and go to
step [3]. Otherwise, continue with step [7].
[7] Restore the previous values of $\Pi$, $\Phi$ and $\Psi$. This is done by
“undoing” the most recent changes made in these sets. For example, the set
$\Pi$ is restored to its previous contents by swapping the most recently swapped
instruction back to its original position.
[8] If $i < |\Psi|$ then $i = i + 1$ and go to step [3]. Otherwise, done, with an
optimal solution in $\pi$.
Each time the $\alpha$-$\beta$ or other pruning occurs at position $k$, it cuts
the $(|\Pi| - k)!$ orderings of the remaining instructions from the search. Note
that, because condition [5c] filters out equivalent schedules, the algorithm
presented finds an optimal schedule, but might not examine all optimal schedules
when the optimal schedule is not unique.
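The overall shape of the search (an initial bound from the list schedule, a legality test, pruning against the best schedule so far, and a curtail point) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the algorithm exactly as specified: it extends the partial schedule recursively instead of swapping and undoing in place, it omits the quick legality check [5a] and the equivalence check [5c], and it reuses an assumed cost model in which every instruction and NOP takes one cycle and an instruction with no pipeline has latency 1. All names and pipeline parameters are hypothetical.

```python
def search(order, pipe_of, preds, pipes, curtail=100_000):
    """Return (best_order, best_nops, complete) for a non-empty block.

    pipes maps a pipeline name to a (latency, enqueue) pair.
    """
    def issue_cycle(ins, placed, issue):
        # Earliest cycle at which `ins` can issue after the partial schedule.
        cycle = issue[placed[-1]] + 1 if placed else 0
        pipe = pipe_of[ins]
        if pipe is not None:
            for prev in reversed(placed):
                if pipe_of[prev] == pipe:
                    cycle = max(cycle, issue[prev] + pipes[pipe][1])
                    break
        for dep in preds[ins]:
            latency = pipes[pipe_of[dep]][0] if pipe_of[dep] is not None else 1
            cycle = max(cycle, issue[dep] + latency)
        return cycle

    # Step [1]: the cost of the initial list schedule is the first bound.
    placed, issue = [], {}
    for ins in order:
        issue[ins] = issue_cycle(ins, placed, issue)
        placed.append(ins)
    best = [list(order), issue[order[-1]] - (len(order) - 1)]
    state = {"calls": 0, "complete": True}

    def extend(placed, issue, nops):
        if len(placed) == len(order):      # step [3]: schedule is complete
            if nops < best[1]:
                best[0], best[1] = list(placed), nops
            return
        done = set(placed)
        for ins in order:
            if ins in done or not set(preds[ins]) <= done:
                continue                   # step [5b]: predecessors not placed
            state["calls"] += 1
            if state["calls"] > curtail:   # step [4]: curtail point reached
                state["complete"] = False
            if not state["complete"]:
                return
            cycle = issue_cycle(ins, placed, issue)
            new_nops = cycle - len(placed)  # NOPs in the extended partial schedule
            if new_nops < best[1]:         # step [6]: prune against the bound
                placed.append(ins)
                issue[ins] = cycle
                extend(placed, issue, new_nops)
                placed.pop()               # step [7]: undo the extension
                del issue[ins]

    extend([], {}, 0)
    return best[0], best[1], state["complete"]


# Illustrative block and pipeline parameters, not taken from the paper's tables.
PIPES = {"loader": (2, 1), "multiplier": (4, 2)}
ORDER = ["ld_a", "ld_b", "mul", "ld_c", "ld_d", "add", "sub"]
PIPE_OF = {"ld_a": "loader", "ld_b": "loader", "ld_c": "loader",
           "ld_d": "loader", "mul": "multiplier", "add": None, "sub": None}
PREDS = {"ld_a": [], "ld_b": [], "ld_c": [], "ld_d": [],
         "mul": ["ld_a", "ld_b"], "add": ["ld_c", "ld_d"], "sub": ["mul", "add"]}
print(search(ORDER, PIPE_OF, PREDS, PIPES))
```

Pruning is safe under this cost model because the NOP count of a partial schedule can only grow as instructions are appended. With a very small `curtail` value the function returns the list schedule and reports that the search was not completed.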
5. Results
A prototype compiler implementing the algorithms given in section 4.2 was
tested with carefully generated benchmark programs. These programs were
synthesized according to statistics obtained from “real” programs.
Section 5.1 gives the pipeline descriptions used. The construction of the
synthetic benchmark programs is given in 5.2. Finally, the results are
summarized in section 5.3.
5.1. Pipeline Constraints for Simulations
All the results shown in this paper were obtained using a very
straightforward pipeline design. These pipeline constraints appear in tables 4
and 5. Later studies will examine performance on more varied and complex
pipeline structures; the purpose of this paper is to demonstrate that optimal
code scheduling is








loader 1 2 1
multiplier 2 4 2








Table 5: Operation-to-Pipeline Mapping for Simulations
5.2. Construction of Synthetic Benchmarks
A C program was developed to randomly generate basic blocks according
to the statistics described below. This program requires as input the number of
statements, variables, and constants desired in the generated code. It then
generates a random sequence of assignment statements satisfying the desired
conditions. The frequency of the types of assignment statements corresponds
loosely to the instruction frequency distributions found in [AlW75].
Note that Table 6 does not give the frequencies for Load and Store
instructions. These instructions are provided as necessary during code
generation and optimization: the first reference to a variable causes a load for
that variable to be















The results presented in this paper reflect a total of 16,000 runs with basic
blocks containing various numbers of statements, variables, and constants. The
curtail point was also varied, but was always large relative to the number of
items examined in an optimal search of an “average” block of that size. A very








                               Search completed   Search truncated   All runs
                               (optimal)
Number of Runs                 15,812             188                16,000
Percentage of Runs             98.83%             1.17%              100%
Avg. Instructions/Block        20.50              32.28              20.6
Avg. Initial NOPs              9.50               14.34              9.6
Avg. Final NOPs                0.67               4.03               0.7
Avg. Q Calls                   427.4              54,150             1,060
Avg. Search Time (Sun 3/50)    ~0.1s              ~15s               ~0.3s
Table 7: Statistics for Scheduling 16,000 Blocks
Notice that the average number of instructions per block was 20.6, which
implies that the typical search, without pruning, would have required searching
on the order of $10^{19}$ schedules, whereas only about $10^3$ were searched for
the average block in our sample.
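The unpruned estimate can be checked with a short calculation, assuming it counts every permutation of a block's instructions with dependences ignored (the source does not state how the figure was derived). The helper name below is hypothetical.

```python
import math


def unpruned_orderings(n):
    """Number of orderings of n instructions when dependences are ignored."""
    return math.factorial(n)


# Blocks of 20 and 21 instructions bracket the reported average of 20.6.
for n in (20, 21):
    print(n, f"{unpruned_orderings(n):.2e}")
```

The two values fall on either side of $10^{19}$, which agrees with the order of magnitude quoted above.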
Figure 4 shows the final number of NOPs after optimization versus the
initial number of NOPs. Note that the initial number of NOPs grows linearly with
the number of instructions, but the final number of NOPs remains nearly
constant.
Figure 5 shows the frequency distribution of the number of instructions per
basic block for our sample. Studies have shown that on average a basic block in
real programs has fewer than ten instructions; however, our average sample block
had 20.6. This yields overly conservative results, since for basic blocks with
fewer than 20 instructions the algorithm nearly always produces optimal
solutions. Though programs with basic blocks that have more than forty
instructions are very rare, we have even included such blocks in our study to
show the worst-













Figure 5: Distribution of Sample Block Sizes (horizontal axis: Number of
Instructions per Block)
Figure 6 shows the average runtime over all 16,000 sample blocks. Figure
7 shows the percentage of all runs which found optimal schedules, i.e., which
were not truncated by the curtail point $\lambda$. From these two graphs, it can
be seen that common block sizes are easily scheduled within a reasonable compile
time, and usually










Figure 6: Runtime Vs. Block Size
Our results show that for a very small percentage of the inputs (less than
1.2% overall) the outputs were possibly not optimal. Further study of these
inputs revealed that, for most of them, the optimal solution was not found even
when the runtime curtail point was increased fiftyfold. Moreover, the number of
final NOPs found after that increase was, in general, not much different from
the number found within the runtime allowed in the sample runs. This indicates
that the algorithm quickly converges to a near-optimal solution.
For very large basic blocks, it might be useful to split the basic blocks into
smaller sections (containing, say, twenty instructions or fewer each) and find
solutions which are locally optimal. A good heuristic for the split might be to
simply partition the list schedule; however, we have not yet examined such
techniques.
6. Conclusions
The huge search space for optimal (minimal NOP) code schedules has long
discouraged researchers from attempting to find optimal code schedules.
However, we have presented a search algorithm which has demonstrated that for
over 98% of our realistic synthetic benchmark blocks it is possible to
dramatically reduce the size of this search space without sacrificing
optimality. For the fewer than 2% in which the search space cannot be completely
searched, good results were obtained by simply truncating the search, although
this may result in suboptimal schedules. A prototype compiler using our
algorithm, running on workstation-class machines, schedules about 100 typical
blocks per second







Figure 7: Percentage Run To Completion Vs. Block Size 
In addition to demonstrating the feasibility of optimal code scheduling, we 
have defined our algorithm to use a more general model of pipeline structure than 
previous work. Our model allows multiple pipelines, each with its own latency 
and enqueue time, to be specified. Further, the set of pipelines which may be 
used for each type of instruction can be independently specified.
Ongoing work examines performance using various (more complex) pipeline
structures than the work presented here. Future work will extend the proposed
pipeline scheduling algorithm to more general code structures including very
large blocks (as might be generated by trace scheduling [Ell85]) and arbitrary
control flow. As presented here, the algorithm applies best to scheduling
individual basic blocks averaging about 20 or fewer instructions each.
7. References
[AhS86] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles,
Techniques, and Tools, Addison-Wesley, Reading, MA, 1986.
[AbP88] S. Abraham and K. Padmanabhan, “Instruction Reorganization for a
Variable-Length Pipelined Microprocessor,” IEEE International
Conference on Computer Design, 1989, pp. 96-101.
[AlW75] W. G. Alexander and D. B. Wortman, “Static and Dynamic
Characteristics of XPL Programs,” IEEE Computer, November
1975, pp. 41-46.
[Ber88] D. Bernstein, “An Improved Approximation Algorithm for

















Parallel Processing 1988, pp. 430-433.
J. Cocke and J. T. Schwartz, Programming Languages and Their
Compilers, Preliminary Notes, New York University Courant
Institute of Mathematical Sciences, Second Revised Version,
April 1970.
H. G. Dietz, The Refined-Language Approach to Compiling for
Parallel Supercomputers, Ph.D. Dissertation, Polytechnic
University, June 1987.
H. G. Dietz, H. J. Siegel, W. E. Cohen, M. T. O’Keefe, et al., “A
Compiler-Oriented Architecture: The CARP Machine,” Fourth
SIAM Conference on Parallel Processing for Scientific
Computing, December 1989.
[Ell85] J. R. Ellis, Bulldog: A Compiler for VLIW Architectures,
Cambridge, MA: MIT Press, 1985.
Garner et al., “The Scalable Processor Architecture (SPARC),”
IEEE CompCon, Spring 1988, pp. 278-283.
T. Gross, “Code Optimization Techniques for Pipelined
Architectures,” COMPCON ’83, Spring 1983.
J. Hennessy, et al., Conference on VLSI Systems and
Computations, Carnegie-Mellon University, October 19-21, 1981.
C. Melear, “RISC Architecture of the M88000,” IEEE
International Conference on Computer Design, 1989, pp. 370-373.
Muchnick et al., “Optimizing Compiler for the SPARC
Architecture, An Overview,” IEEE CompCon, Spring 1988, pp. 284-288.
D. Patterson, “Reduced Instruction Set Computers,”
Communications of the ACM, Volume 29, No. 1, Jan. 1985, pp. 8-21.
G. Radin, “The 801 Minicomputer,” IBM Journal of Research
and Development, May 1983, pp. 237-246.
T. Riordan et al., “The MIPS M2000 System,” IEEE
International Conference on Computer Design, 1989, pp. 366-369.
B. Smith, from numerous personal communications. B. Smith is
currently at Tera Computer Company, Seattle, WA 98103.
A. Zaafrani, H. Dietz, and M. O’Keefe, Static Scheduling for
Barrier MIMD Architectures, Technical Report TR-EE 90-10, School
of Electrical Engineering, Purdue University, January 1990.
