Hierarchical parallelism exploitation by Nicolau, Alexandru
UC Irvine
ICS Technical Reports
Title
Hierarchical parallelism exploitation
Permalink
https://escholarship.org/uc/item/19v6w8s3
Author
Nicolau, Alexandru
Publication Date
1989
 
Peer reviewed
HIERARCHICAL PARALLELISM EXPLOITATION 
Alexandru Nicolau 
Department of Information and Computer Science 
University of California, Irvine 
Irvine, California 92717 
Technical Report No.89-32 
1 Introduction 
The generation of hand-crafted code for efficient execution on parallel machines is a tedious 
task. For some important problems, new algorithms carefully designed for parallel execution 
are being developed, often tailored to a particular architecture. However, these algorithms 
are difficult to develop and implement-the problem must be of sufficient generality, interest 
and regularity to compensate for the considerable effort. Even when the core algorithms are 
hand-parallelized, complex application codes will not run at large speedups if the rest of the 
code is not speeded up as well. Furthermore, even the carefully crafted parallel algorithms are 
likely to contain parallelism that is too low-level and too irregular to be explicitly exploited 
by the human designer. The remaining parallelism has a multiplicative effect on the overall 
performance of the code. Thus the ability to exploit parallelism at all levels is critical for 
execution speed. 
1.1 How Should Parallelism Be Exploited? 
Automatic fine-grain (instruction level) parallelism holds the promise of exploiting substan-
tially all the parallelism available in a given program, including highly irregular forms of 
parallelism not visible at coarser levels. Since the speedups obtained at the different levels of 
parallelism exploitation combine multiplicatively in overall performance, substantially all parallelism should 
be exploited in order to achieve good performance, an obvious consequence of Amdahl's law. 
The importance of fine-grain parallelism exploitation has already been recognized to a small 
extent, and is reflected in the use of pipelining and (relatively narrow) horizontal microcode 
in virtually all high-performance (numerical) processors. However, its wider application has 
been limited by several factors, to be discussed shortly. In this paper we will describe some 
new results on the exploitation of fine-grain parallelism and will discuss their implications for 
the design of massively parallel machines. 
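The Amdahl's law argument above can be made concrete with a small calculation. The sketch below is illustrative only: the function name `overall_speedup` and the 90%/10% split are assumptions chosen for the example, not figures from this report.

```python
def overall_speedup(parts):
    """Amdahl-style overall speedup.

    `parts` is a list of (fraction_of_original_time, local_speedup) pairs
    whose fractions sum to 1. The 90/10 split used below is hypothetical.
    """
    total = sum(fraction for fraction, _ in parts)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return 1.0 / sum(fraction / speedup for fraction, speedup in parts)


# 90% of the time is in hand-parallelized kernels (100x faster);
# the remaining 10% is left sequential.
kernels_only = overall_speedup([(0.9, 100.0), (0.1, 1.0)])

# Same kernels, but the remaining code is also sped up by a factor of 10.
all_levels = overall_speedup([(0.9, 100.0), (0.1, 10.0)])

print(round(kernels_only, 2), round(all_levels, 2))
```

Under these assumed numbers the overall speedup rises from roughly 9 to roughly 53 once the residual 10% is also parallelized, which is the sense in which the remaining parallelism acts multiplicatively.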
Ideally, fine-grain parallelism would be exploited at runtime, when all data-dependencies 
are strict (i.e., there is no ambiguity between indirect references) and a unique execution path 
through the code is followed. This is essentially the approach taken in the data-flow model 
of computation. In practice however, the runtime overhead involved in dynamic (hardware) 
scheduling of operations and interlocking to ensure dependency preservation is often several 
times larger than the theoretical performance speedup. The alternative approach is compile-
time parallelization of the code. The obvious advantage of this approach lies in the elimination 
of runtime overheads (by doing the scheduling work at compile-time). This yields simpler, 
and thus cheaper and faster, machines. Furthermore, this approach can potentially exploit 
parallelism that is not readily available at coarser levels of granularity, and is far too tedious 
to be expressed at the user level. 
1.2 Difficulties in Static Fine-grain Parallelism Extraction 
Unfortunately, several difficulties have limited compile-time fine-grain parallelism exploita-
tion. These are: 
• Very tight coupling of processors. To achieve maximal benefits, the hardware behaviour 
should be highly predictable. For example, if processors are synchronous, operations 
could be executed in parallel in this model, utilizing the "free" implicit synchronization, 
while the same operations would not be worth executing in parallel if explicit synchronization 
were required. 1 While this requires high memory bandwidth and constrains 
the scalability of the architecture, it is technologically feasible, and typically easier to 
build (and thus less expensive) than complex dynamic interlocking/ scheduling hard-
ware. The main drawbacks of the approach are in the ability of the compiler to expose 
enough fine-grain parallelism to efficiently utilize such hardware. 
• Amount of parallelism exploitable by fine-grain techniques. Due to a misunderstanding 
of some early experiments this was widely (and erroneously) believed to be too small 
or too expensive to bother with. Later evidence [21] has conclusively established the 
availability of rather large amounts of fine-grain parallelism (factors from 10 to 100) in 
ordinary code. 
• Conditional jumps. Since branches occur very often in ordinary programs (once ev-
ery 3-8 instructions on average), they make the static scheduling of large numbers of 
operations difficult. Previous techniques have either been limited to branch-free code 
(basic blocks), thus drastically limiting the potential parallelism, or strongly relied on 
heuristics to statically predict the direction of runtime branches, with potentially heavy 
penalties in cases where such prediction is unsuccessful. 
1 This assumes that keeping the processors synchronous is done with negligible cost, and/ or does not affect 
the cycle time significantly. 

we can show that the effect of many previous (coarser grain) techniques (e.g., vectorization, 
wavefront/hyperplane, loop-interchange, doacross) can be obtained as restricted combinations 
of our transformation. This provides us with a means of comparing transformations across 
several computation models. In that context it becomes obvious that the power of the 
transformations to extract parallelism increases when the target architecture is tightly coupled 
and synchronous. 
We will use our results above to argue that statically scheduled, tightly coupled synchronous 
architectures are both critical and practical for the efficient exploitation of massive 
parallelism. On the other hand, due to hardware issues and other pragmatic considerations 
(e.g., compilation time, space considerations) it is unlikely that the fully static approach 
will directly scale up to massive ( · 1000) parallelism exploitation. Fortunately, since good 
programming techniques tend to yield structured (hierarchical) code with relative locality, 
tight coupling and static scheduling at the higher levels of the hierarchy (e.g., across 
procedures/modules) become less important (the ratio of synchronization/communication across 
processors decreases relative to the code size). This leads to the notion of a general 
interconnection network with each node consisting of a set of (possibly dynamically partitionable) 
tightly coupled synchronous processors. 
2 Compile-time Fine-grain Parallelism Extraction 
In this section we discuss the tools necessary for exposing parallelism in ordinary programs 
from the (machine) instruction level up to the procedure level. For the purposes of the section 
we assume that the hardware on which the code will ultimately run efficiently supports this 
granularity of parallelism. As we have argued in the introduction, such support is critical, 
since all levels of parallelism need to be addressed to "beat" Amdahl's law. In the next 
section we will discuss the practicality of such an architecture. 
2.1 Analysis Tools: Disambiguation 
A large fraction of the parallelism available in programs involves indirect references. Thus, 
it is imperative for a parallelizing compiler to be able to effectively disambiguate as many 
indirect references as possible. Indeed, too liberal an approach to disambiguation could result 
in incorrect code being generated, while too conservative an approach will sharply decrease 

loop induction variable, say l1. 3 Since j involves an input variable no further reduction is 
possible, and j is expressed as 2 * r by the disambiguator. Then the diophantine equation: 
2 * l1 + 1 = 2 * r 
is solved, using techniques derived from standard number-theory. Since in this particular case 
no integer solutions exist, the compiler can safely assume that no conflict can occur between 
the two statements. Thus they can be executed in any order, and in particular in parallel. 
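The solvability check on such an equation can be sketched with the standard number-theoretic fact that a linear diophantine equation a*x + b*y = c has integer solutions exactly when gcd(a, b) divides c. The function name `may_conflict` is hypothetical, and this sketch ignores loop bounds, so a True result only means a conflict cannot be ruled out by this test alone.

```python
from math import gcd


def may_conflict(a, b, c):
    """Return True if a*x + b*y = c has an integer solution (x, y).

    A False result proves the two references can never touch the same
    location. A True result is conservative: bounds on x and y are ignored.
    """
    g = gcd(a, b)
    if g == 0:            # both coefficients are zero
        return c == 0
    return c % g == 0


# 2*l1 + 1 = 2*r, rewritten as 2*l1 - 2*r = -1.
print(may_conflict(2, -2, -1))

# A contrasting, made-up pair of subscripts: 2*l1 + 1 = 3*r.
print(may_conflict(2, -3, -1))
```

For the equation in the text the test reports no solution (gcd 2 does not divide 1), so the references are independent; for the made-up second pair it cannot rule out a conflict.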
2.3 Effectiveness of Static Disambiguation 
Indirect references in inner loops of scientific code are mostly array references, and such code 
usually offers the greatest potential for parallelism. Thus the very accurate disambiguation 
of indirect references is crucial to the success of fine-grain parallelizing compilers. 
Evidence supporting both the effectiveness of disambiguation and its importance for a 
fine-grain compiler is provided by our experiments with the BULLDOG compiler. Table 1 
compares the results obtained by the BULLDOG compiler with and without its (fully-static) 
disambiguator system, for several programs and various unwindings. Even with the limited 
unwinding used for some of these tests 4 the importance of disambiguation becomes obvious. 
The significance of disambiguation for the performance of the compiler increases dramatically 
with larger unwindings. 
The programs (a fast Fourier transform, solving a system of linear equations, tridiagonalization, 
matrix multiplication, finding prime numbers and transitive closure) are all dramatically 
improved by the use of the disambiguation system; the speedup is essentially doubled in 
several cases by the disambiguation. As expected, the improvement is particularly large when 
the traces are long and the potential speedups obtainable by trace scheduling are relatively 
large. This happens when the important (innermost) loops are unwound. When unwinding 
is not done, or traces are still small, the length of the compacted schedule is dominated by 
simple arithmetic dependency-chains (e.g., index calculations may determine the length of the 
trace schedule) and no large speedups will be achievable in any case. Under these conditions 
the effect of disambiguation decreases. 
3 In general this further improves the accuracy of the disambiguation process by eliminating multiple in-
duction variables in a loop. Such variables would otherwise become free variables in the diophantine equation. 
4 The number at the end of the program names indicates the amount of unwinding. 

(a) 
(b) 
for i = lb, ub do 
    j := read();  /* read() reads an integer from the standard input */ 
    A[2i+1] := exprt; 
    B := A[j]; 
od; 
Figure 2: Ambiguity not Handled by Fully-Static Disambiguation 
the dependency analysis tool. 
2.5 What Runtime Disambiguation Has to Offer 
What we have proposed is the shifting of part of the burden of disambiguation from compile-time 
to runtime. While the scheduling decisions will still be made statically, they may occasionally 
rely on runtime tests to guarantee correctness. The advantage is that the 
disambiguation information, rather than having to be always right (i.e., verifiable) statically, 
would only need to be usually (or often) right. This relaxation allows, in principle, the 
disambiguator to handle all of the above cases, and in fact could be used not only as a 
complement for a fully static disambiguator, but even-within limits-could make up for 
the lack of sophisticated-and slow-fully static disambiguation. Thus the use of runtime 
disambiguation could dramatically improve not only the running time of the code generated 
but also the running time of the compiler itself. 5 
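A minimal sketch of the idea, using the loop body of Figure 2: the schedule is fixed at compile time, but a cheap runtime test selects between a reordered version (load hoisted above the store) and the original order. The function names and the use of a Python list for the array are assumptions made for illustration; this is not the code generated by any compiler described here.

```python
def body_original(A, i, j, expr):
    """Loop body of Figure 2 in its original order: store, then load."""
    A[2 * i + 1] = expr
    return A[j]


def body_rtd(A, i, j, expr):
    """Same body guarded by a runtime disambiguation test (hypothetical)."""
    if j != 2 * i + 1:
        # No alias this iteration: the load and the store are independent,
        # so the load may be issued first (or in the same cycle).
        b = A[j]
        A[2 * i + 1] = expr
    else:
        # Alias detected: fall back to a copy of the original ordering.
        A[2 * i + 1] = expr
        b = A[j]
    return b


# Compare both versions over every (i, j) pair of a small array.
mismatches = 0
for i in range(4):
    for j in range(9):
        A1 = list(range(9))
        A2 = list(range(9))
        r1 = body_original(A1, i, j, expr=100 + i)
        r2 = body_rtd(A2, i, j, expr=100 + i)
        if r1 != r2 or A1 != A2:
            mismatches += 1
print(mismatches)
```

The duplicated fallback branch is the code-size cost of the approach; the guard itself is a single comparison.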
This approach is new for parallelizing compilers. Previous techniques relied exclusively on 
fully-static information to estimate data-dependencies. This conservative approach, discussed 
in the previous section, has unduly restricted the effectiveness of parallelizing compilers. 
In fact, some of the chief critiques voiced against parallelizing compilers (e.g., [14]) center 
precisely on the perceived intrinsic need of such compilers to rely solely on fully-static analysis, 
and their resulting inability to exploit the "real" parallelism limited only by actual runtime 
dependencies. Runtime Disambiguation (RTD) comes to remedy this problem of parallelizing 
5 Between 1/3 and 1/2 of the running time of the Bulldog compiler [12] is spent in preparing accurate fully-
static dependency information. Even with all this effort, the compiler still missed some relatively simple-and 
important-disambiguations. An assertion facility was added to the system precisely to allow the user to 
overcome such problems. 


Table 2: RTD Net Speedups. 
Program (unwinding)    RTD vs NoRTD    AU vs NoRTD    TR vs NoRTD 
dotprod(4/8)           0/0             0/0            0/0 
ln(4/8)                0/0             0/0            0/0 
matmul(4/8)            0/0             0/0            0/0 
sqrt(4/8)              0/0             0/0            0/0 
Conduc(4/8)            .15/.22         (< 0)/.10      .15/.20 
FFT(4/8)               .26/.42         .10/.34        .26/.42 
Trid(4/8)              .18/.23         (< 0)/.20      .18/.23 
Quanc(4/8)             .15/.28         .08/.22        .15/.26 
SVD(4/8)               .26/.44         .15/.40        .24/.41 
Solve(4/8)             .16/.27         .10/.27        .16/[illegible] 
Invert(4/8)            2.2/3.4         .5/.9          2.2/3.4 
BinSort(4/8)           3.7/7.1         2.3/5.1        2.5/5.3 
BubbleSort(4/8)        [illegible]     .4/.5          .4/.6 
ShellSort(4/8)         1.5/2.4         1.2/2.3        1.3/2.1 
RadixSort(4/8)         3.2/6.2         2.3/5.2        2.2/5.0 
prime(4/8)             .5/.9           .3/.7          [illegible] 
trcl(4/8)              1.3/2.1         .8/1.7         .8/[illegible] 
Unions(4/8)            1.6/3.1         .7/.6          1.3/2.4 
Inserts(4/8)           2.9/4.6         1.7/1.9        2.4/4.1 
ShortestPaths(4/8)     1.2/2.1         .9/1.9         .8/1.6 
Figure 5: Core Transformations 
Guided by the higher level rules and transformations, the core transformations operate 
uniformly on an entire program graph. They can also be applied to partially parallelized 
code. This allows modification of code produced by other types of compilers. In addition, 
these transformations are themselves highly parallel and could be run on a parallel machine, 
significantly reducing compilation time. 
The following is an outline of the layers of the PS system and their function: 
Core Level This level contains a set of four core transformations that define semantically 
correct motions of operations between adjacent nodes in a program flow-graph. By "percolating" 
operations that can execute in parallel to the same node of the graph, the core 
transformations expose parallelism implicit in the code. These transformations apply 
directly to loop bodies and non-loop code. They serve as the main parallelization tool in our 
system. The core transformations are illustrated in figure 5. 
Support Level At this level we have analysis methods (e.g., Memory Disambiguation 
[19]) and standard optimizations (e.g., Dead-Code Removal). They provide accurate 
data-dependency information and thus enhance the applicability of the core transformations. 
Guidance Level This level consists of rules that direct the application of the core 
transformations to achieve effective optimization of the code in acceptable time and space. This 
contrasts with Trace Scheduling [11], where a single rule (for trace picking) is inseparable from 
the actual transformation mechanism. This limits Trace Scheduling and makes it too rigid 
for our goals. 
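The flavor of the Core Level can be sketched for the simplest possible case: straight-line code, where each flow-graph node holds operations that execute in parallel (all reads of a node happen before its writes), and an operation may move into the preceding node when no dependency forbids it. This is a deliberately reduced illustration, not the four core transformations themselves; branches, multiple predecessors, and resource limits are ignored, and the names `Op`, `can_move_up`, and `percolate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    dest: str      # variable written by the operation
    srcs: tuple    # variables read by the operation


def can_move_up(op, pred_ops, node_ops):
    """Check whether `op` may move from its node into the preceding node."""
    for p in pred_ops:
        # True dependency (op reads what p writes) or output dependency.
        if p.dest in op.srcs or p.dest == op.dest:
            return False
    for q in node_ops:
        # A sibling that reads op.dest must still see the old value.
        if q is not op and op.dest in q.srcs:
            return False
    return True


def percolate(nodes):
    """Repeatedly move operations upward until nothing more can move."""
    nodes = [list(n) for n in nodes]
    changed = True
    while changed:
        changed = False
        for k in range(1, len(nodes)):
            for op in list(nodes[k]):
                if can_move_up(op, nodes[k - 1], nodes[k]):
                    nodes[k].remove(op)
                    nodes[k - 1].append(op)
                    changed = True
    return [n for n in nodes if n]   # drop nodes that became empty


# a = x + y; b = a * 2; c = x - y; d = c + b  (one operation per node)
program = [
    [Op("a", ("x", "y"))],
    [Op("b", ("a",))],
    [Op("c", ("x", "y"))],
    [Op("d", ("c", "b"))],
]
schedule = percolate(program)
print([[op.dest for op in node] for node in schedule])
```

In this example the independent operation `c` rises to join `a` in the first node, shortening the four-node sequence to three nodes while `b` and `d` stay ordered behind the values they read.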

equal or better running time for the final loop. In other words, OPT not only yields the best 
running time for the loop with respect to unwinding and the particular parallelizing transfor-
mations used, but true time optimality with respect to any possible dependency-preserving 
transformations. 6 OPT relies on the fact that only a finite (and in practice, small) number of 
iterations ever need to be examined to determine a pattern which yields an optimal running-time 
schedule for the loop. These results hold in the presence of multicycle operations. The 
justification of these claims and the details of the algorithm are given in [3]. For the purpose 
of this paper we only need to understand how OPT works. OPT incrementally unwinds the 
loop, allowing operations to be scheduled as early as possible in the schedule, subject only 
to data-dependencies and latencies. 7 Thus operations are scheduled at the earliest possible 
time they could be issued at runtime, if a synchronous multiprocessor were available. A 
repeating, fixed size pattern is guaranteed to emerge after a relatively small amount of such 
unwinding and compaction (parallelization), if the original data-dependencies of the loop are 
not allowed to drastically change throughout the process. Further unwinding and compaction 
beyond this point cannot improve parallelism, and thus replacing the loop body with this 
pattern will yield an optimal execution schedule for the given loop. Of course, a prolog and 
postlog including some start-up and wind-down code may be required; this code consists of 
partial iterations (loop bodies) at the beginning and end of the loop. There are several ways 
for handling these and other details such as the loop overhead, with either software or hard-
ware support. Some hardware mechanisms which would be relevant have been implemented 
and are discussed in [8], [10]. 
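The unwind-and-schedule step described above can be sketched as an as-early-as-possible placement of every operation instance, subject only to data dependencies and latencies. The sketch covers that greedy step alone: it does not detect the repeating pattern, generate prolog or postlog code, or model resource limits, and it is not the OPT algorithm of [3]. The example loop, the function names, and the dependency encoding `(source, destination, iteration distance)` are assumptions for illustration.

```python
def asap_schedule(ops, deps, iterations, latency=1):
    """Earliest issue cycle of each (op, iteration) instance.

    ops  : operation names in loop-body order
    deps : (src, dst, distance) triples; distance 0 is a dependency within
           one iteration (src must precede dst in `ops`), distance d > 0 is
           carried from d iterations earlier.
    """
    time = {}
    for it in range(iterations):
        for op in ops:
            start = 0
            for src, dst, dist in deps:
                if dst == op and it - dist >= 0:
                    start = max(start, time[(src, it - dist)] + latency)
            time[(op, it)] = start
    return time


def by_cycle(time):
    """Group instances by issue cycle, one list per cycle."""
    rows = {}
    for (op, it), cycle in time.items():
        rows.setdefault(cycle, []).append(f"{op}[{it}]")
    return [sorted(rows[c]) for c in sorted(rows)]


# Hypothetical loop body: a uses the previous iteration's a; b uses a; c uses b.
ops = ["a", "b", "c"]
deps = [("a", "a", 1), ("a", "b", 0), ("b", "c", 0)]
time = asap_schedule(ops, deps, iterations=4)
rows = by_cycle(time)
for cycle, row in enumerate(rows):
    print(cycle, row)
```

From cycle 2 onward every cycle issues one `a`, one `b`, and one `c` from three consecutive iterations; that steady-state row is the kind of repeating, fixed-size pattern that would replace the loop body.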
An illustration of the effects of OPT and the optimal schedule produced for the given loop 
is found in Figure 6. For simplicity, latencies of operations in this example are assumed to 
be just one cycle. As we mentioned earlier, OPT can deal with realistic operation latencies. 
When taking into account true operation latencies the notion of optimality derived from 
OPT/PP is realistic, in the sense that a schedule produced by OPT or PP could be run "as 
is" on a synchronous parallel machine (e.g., Multiflow's [17]) with enough resources. Still, 
6 Dependency changes (e.g., due to renaming) can be allowed in this context, even if done dynamically as 
part of the parallelization process. 
7 We are essentially performing a topological sort, creating a partial ordering of the operations; operations 
that are placed at the same level in the schedule are therefore independent of each other and can be executed 
in parallel. Given a synchronous parallel processor with enough resources, such a schedule could run "as is", 
with each level or slice of the schedule issuing each cycle. 

unwinding of multiple loops may violate correctness, and thus appropriate checks are needed 
to ensure that the transformation preserves the semantics of the original code. For example, 
in figure 7, a 3 by 3 unwinding on each loop would yield an incorrect program. 
Loop Quantization is a technique that we have developed to overcome this problem by 
allowing correct multiple-loop unwinding for arbitrary nested loops. In the case above, for 
example, the 3 by 3 unwound loop body (the "quantum box") can be slanted to become 
parallel with the dependencies in the code, thereby restoring correctness. Of course the loop 
bounds need to be modified accordingly to allow for such slanted quantization. In [2] we have 
shown how the decision on the bounds of Quantization, and the ensuing transformation of 
the loop, can be automated. 
Loop Quantization rearranges the order of execution of the loop iterations less than some 
other global transformations. For example, quantization will succeed even when straight loop 
interchange would not apply. By exposing even irregular fine-grain parallelism, quantization 
may help achieve significant speedups in ordinary code. The main loop of weather code, for 
example, is naturally amenable to quantization, as are the Livermore loops[16] in their nested 
context. LQ combines with PP to achieve optimal parallel schedules (for a given number of 
processors) for nested loops. 
An example of loop quantization is given in figure 7; further details and an algorithm for 
computing maximal loop quantizations are given in [2]. 
3 Architectural Considerations 
The above compiler techniques combine to effectively expose virtually all fine-grain paral-
lelism obtainable at compile-time. The techniques are resilient in the presence of unpre-
dictable conditional-jumps, and indirect references. To take full advantage of the potential 
of these compiler techniques, synchronous multiprocessors are required. While on a small to 
medium scale (up to a few tens of processors) such machines are relatively easy to build and 
can be very cost effective-as demonstrated by commercial machines such as [17], [8]-on a 
larger scale they may involve a number of disadvantages. 
3.1 Disadvantages of Statically Scheduled Multiprocessors 
The main disadvantage of static architectures is that they can't scale up arbitrarily due to: 

• Conservative scheduling assumptions. As the number of processing elements working 
in parallel increases, we have to find more opportunities of exploiting parallelism in 
the code. We have demonstrated that at the nested loop level (and below) enough 
parallelism exists and can be extracted effectively; a tightly coupled machine of medium 
size (somewhere between ten and one hundred processing units is typical for ordinary code 
in our experience) can make the most of this parallelism. While conservative decisions 
may sometimes be made at this level to ensure correctness of execution, there is usually 
no alternative: dynamic mechanisms (e.g., for dependency testing) that are general 
enough to allow the exploitation of significant amounts of parallelism are usually too 
expensive at this level. However, as we go beyond such numbers of processors, and 
examine coarser levels of parallelism (e.g., at the procedure level), we may well be 
slowed down by the conservative decisions implicit in the fully static approach more 
than by the use of dynamic synchronization. 
Out of these obstacles to scalability, the last one is probably the most critical. For ex-
ample, consider the task of inserting a sequence of elements into a binary search-tree. Two 
successive calls to Insert could clearly execute in parallel as soon as it is determined (at 
runtime) that the subtrees they need to insert into are disjoint. A purely static approach, 
however, would need to schedule them for sequential execution, since a conflict could sometimes 
exist. Of course, RTD could be used to overcome this problem, but that would involve 
some overhead plus some code duplication for the case where the conflict does indeed arise 
at runtime. To the extent that the overhead involved in explicit synchronization between the 
two procedure calls compares favorably with that of RTD, it would be preferable for the user 
to insert some synchronization code between the calls, (e.g., through the future mechanism 
proposed in Multilisp ), and allow for fully dynamic synchronization. The point is, that when 
the execution time of the tasks (insert) is large relative to the cost of synchronization, the 
overhead on an asynchronous machine becomes tolerable, and possibly preferable to static 
scheduling. 
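The binary search-tree example can be sketched with futures from Python's standard `concurrent.futures` module standing in for the Multilisp future mechanism. The disjointness test used here (the two keys fall on opposite sides of the root) is a simplifying assumption, as are the names `Node`, `insert`, and `insert_pair`; the sketch shows the control structure only and makes no claim about achieved speedup.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(node, key):
    """Iterative insert into the subtree rooted at `node`."""
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return
        node = child


def insert_pair(root, k1, k2, pool):
    """Insert two keys; run the inserts concurrently when provably disjoint."""
    if (k1 < root.key) != (k2 < root.key):
        # Runtime check passed: the keys go to different subtrees of the
        # root, so the two inserts touch disjoint parts of the tree.
        f1 = pool.submit(insert, root, k1)
        f2 = pool.submit(insert, root, k2)
        f1.result()
        f2.result()
        return True
    # Possible conflict: keep the original sequential order.
    insert(root, k1)
    insert(root, k2)
    return False


def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)


root = Node(50)
with ThreadPoolExecutor(max_workers=2) as pool:
    ran_concurrently = insert_pair(root, 20, 70, pool)   # opposite sides
    ran_sequentially = not insert_pair(root, 10, 30, pool)  # same side
print(in_order(root))
```

The synchronization cost here is the submit/result pair; as the text argues, that cost is tolerable only when each insert does enough work to amortize it.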
More importantly, the parallel execution of multiple procedure calls with their individual 
threads of control flow implies, in itself, a combinatorial explosion in code size-if encoded in 
the statically scheduled model. Such an explosion results from the need to encode all possible 

of dedicated and fast communication processors, or by latency avoidance schemes such as 
that used in Burton Smith's Horizon. 
Such a machine could use heuristic high-level scheduling algorithms as in [23], to map the 
parallelism exposed at the language (and algorithm) level onto clusters. The effectiveness 
of this approach is illustrated in [23], where the automatic mapping was shown to be better 
than that derived by human experts. 
3.3 Cluster Architecture 
Each cluster consists of synchronous processing elements, each able to accept (any) one operation 
per cycle, with operation execution pipelined over multiple cycles. The number of cycles required to 
execute an operation is fixed for each operation type. This presents no particular problem for 
a register-to-register instruction set, with explicit load/stores. The fixed execution time for 
loads can be enforced-as far as the processors are concerned-by freezing (all) the processors 
in the cluster if loads do not complete in the expected time. Alternatively, latency may be 
masked by trading off parallelism as in the HEP or the Horizon. 
Such synchronous processors are obviously buildable on a small to moderate scale (2 to 30 
processors), as illustrated by the products of Multiflow, Cydrome, Chopp, FPS. Thus the only 
other difficulty is in providing a "clean" machine, i.e., free of structural hazards, so that any 
operation can be accepted by each processing element every cycle. The availability of clean 
pipes is not crucial to our approach. However, while structural hazards (i.e., irregularities in 
the machine design that optimize hardware utilization) can reduce the cost of the hardware, 
the trend in architectural design is to avoid structural hazards as much as possible: clearly 
any machine with too many structural bottlenecks cannot perform at or near its peak 
regardless of the compiler technology used. We are arguing that the added (hardware) cost 
of avoiding structural hazards is now even more justified by the existence of software 
techniques capable of generating optimal code for clean machines for large classes of loops, and 
provably good code for the cases where optimality is infeasible (see below). If structural 
hazards are not completely avoided, then simple techniques such as further unwinding of the 
OPT schedule and compaction (parallelization) coupled with reasonable mapping algorithms 
can minimize the impact of the hazards on the quality of the code. This, coupled with the 
relative simplicity and uniformity of application of OPT/PP, makes it a good candidate even 
for existing pipelined and synchronous parallel machines. 

                 Original Code   Limited Processors               Ideal Schedule 
Loop             Mflops          1 proc Mflops   2 procs Mflops   procs   [header illegible]   Mflops 
LL1              9               31-50           57-100           13      82                   400 
LL2              8               20-[illegible]  40-60            16      105                  320 
LL3              1               16-20           20-23            5       8                    27-40 
LL4              6               16              30               5       8                    27 
LL5              6               12-15           15-16            [illegible]  5               16 
LL6              8               6-16            6-20             [illegible]  8               [illegible] 
LL7              [illegible]     36-51           71-99            36      243                  1280 
LL8              [illegible]     40-55           80-110           60      363                  2400 
LL9              [illegible]     35-49           68-97            39      264                  1360 
LL10             [illegible]     18-25           36-[illegible]   40      210                  720 
LL11             [illegible]     4.9             [illegible]      4       4                    [illegible] 
LL12             [illegible]     13-20           [illegible]      6       31                   80 
LL13             [illegible]     11-12           [illegible]      50      315                  560 
LL14 (avg)       [illegible]     14-18           [illegible]      28      151                  270 
Average          [illegible]     19-28           [illegible]      -       -                    534-[illegible] 
Harmonic Mean    [illegible]     13-20           18-33            -       -                    25-53 
Table 3: Cluster Sample Performance on Livermore Loops 
of iterations in the loop-is not obtainable in general, so good heuristic performance is all 
we may expect (and do in fact achieve) in practice. Some sample measurements based on 
the Livermore Loops [16] are shown in Table 3. The timings of the operations are assumed 
to be those of the Cray-1. It is interesting to note that the single (pipelined) processor mean 
performance on these loops is by itself slightly better than that of the Cray-1, while for the 
two processor version, the performance improves even further. The ultimate performance of 
a statically scheduled cluster will depend of course on the number of processors, as well as 
on the actual hardware implementation. More details are to be found in [1]. 
References 
[1] K. Pingali, A. Nicolau, and A. Aiken. Fine-grain compilation for pipelined machines. 
Accepted for publication in the Journal of Supercomputing, to appear August 1988. 
[2] A. Aiken and A. Nicolau. Loop Quantization: an analysis and algorithm. Technical 
Report 87-821, Cornell University, 1987. 
[3] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the 1988 
ACM SIGPLAN Conference on Programming Language Design and Implementation, 
June 1988. 

[15] D. Kuck, R. Kuhn, B. Leasure, and M. Wolfe. The structure of an advanced vectorizer 
for pipelined processors. In Proceedings of the 4th Int'l Computer Software and 
Applications Conference, pages 709-715, October 1980. 
[16] F. H. McMahon. Lawrence Livermore National Laboratory FORTRAN kernels: 
MFLOPS. Livermore, CA., 1983. 
[17] Multiflow Computer Inc., Branford, Connecticut. Technical Summary, 1987. 
[18] A. Nicolau. Runtime disambiguation: Coping with statically unpredictable dependencies. 
Accepted for publication in IEEE Transactions on Computers, to appear Fall 1988. 
[19] A. Nicolau. Parallelism, Memory Anti-Aliasing and Correctness for Trace Scheduling 
Compilers. PhD thesis, Yale University, 1984. 
[20] A. Nicolau. Percolation Scheduling: A parallel compilation technique. Technical Report 
85-678, Cornell University, 1984. 
[21] A. Nicolau and J. Fisher. Measuring the parallelism available for Very Long Instruction 
Word architectures. IEEE Transactions on Computers, C-33:968-76, November 1984. 
[22] M. J. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of 
Illinois at Urbana-Champaign, October 1982. 
[23] M. Y. Wu and D. D. Gajski. A programming aid for hypercube architectures. Accepted 
for publication in Journal of Supercomputing, to appear August 1988. 
