The exploitation of pipeline parallelism by compile time dataflow analysis by Lombardo, Joseph Michael
UNLV Retrospective Theses & Dissertations 
1-1-1991 
The exploitation of pipeline parallelism by compile time dataflow 
analysis 
Joseph Michael Lombardo 
University of Nevada, Las Vegas 
Follow this and additional works at: https://digitalscholarship.unlv.edu/rtds 
Repository Citation 
Lombardo, Joseph Michael, "The exploitation of pipeline parallelism by compile time dataflow analysis" 
(1991). UNLV Retrospective Theses & Dissertations. 149. 
http://dx.doi.org/10.25669/0t3t-9lna 

Order Number 1345926
The exploitation of pipeline parallelism by compile time 
dataflow analysis
Lombardo, Joseph Michael, M.S.
University of Nevada, Las Vegas, 1991

THE EXPLOITATION OF PIPELINE
PARALLELISM BY COMPILE
TIME DATAFLOW 
ANALYSIS
by
Joseph Michael Lombardo
A thesis submitted in partial fulfillment 
of the requirements for the degree of
Master of Science  
in
Computer Science
Department of Computer Science
University of Nevada, Las Vegas  
April, 1991
The thesis of Joseph Lombardo for the degree of Master of Science in
Computer Science is approved.
Chairperson, Evangelos Yfantis, Ph.D.
Examining Committee Member, John Minor, Ph.D.
Examining Committee Member, Ajoy Datta, Ph.D.
Graduate Faculty Representative, Rohan Dalpatadu, Ph.D.
Graduate Dean, Ronald Smith, Ph.D.
University of Nevada, Las Vegas 
April, 1991
Abstract
The automatic and implicit transformation of sequential instruction streams
into streams that execute efficiently on pipelined architectures is the
subject of this paper. This paper proposes a method which maximizes the
parallel performance of an instruction pipeline by detecting and eliminating
specific pipeline hazards known as resource conflicts. The detection of
resource conflicts is accomplished with data dependence analysis, while the
elimination of resource conflicts is accomplished by instruction stream code
transformation. The transformation of instruction streams is guided by data
dependence analysis and dependence graphs. This thesis is based on the
premise that the elimination of resource conflicts is synonymous with the
elimination of specific arcs in the dependence graph. Examples will be given
showing how detection and elimination of resource conflicts is possible
through compiler optimization.
Contents
1 Introduction
2 Overview
  2.1 Instruction Timing and Stalls
    2.1.1 The MIPS R2000
    2.1.2 The Stanford MIPS
    2.1.3 The IBM 801
    2.1.4 The GMU Microcoded RISC-MIRIS
    2.1.5 The ACORN RISC Machine
  2.2 Data Reference and Stalls
  2.3 Branching and Stalls
  2.4 Previous Work
  2.5 Survey of Dataflow Analysis
  2.6 Graphs
    2.6.1 Data Dependence Graphs, DDG
    2.6.2 Iteration Space Dependence, ISDG
  2.7 Notation and Definitions
3 Instruction Pipeline Optimization via Dataflow Analysis
  3.1 Pipeline Dependence
    3.1.1 Flow Pipeline Dependence
    3.1.2 Anti Pipeline Dependence
    3.1.3 Output Pipeline Dependence
  3.2 Pipeline Iteration Space Dependence
    3.2.1 Dependence Free Loops
    3.2.2 Loop Interchange
    3.2.3 Loop Code Motion
  3.3 Pipeline Dependence Graph, PDG
    3.3.1 Data Dependence Graph
    3.3.2 Delay Branching Dependence
    3.3.3 Instruction Timing Dependence
  3.4 Resource Conflict Removal Techniques
    3.4.1 Code reordering
    3.4.2 Delay branching
    3.4.3 NOPing
  3.5 Transformation Process
  3.6 An Example of the Transformation Process
    3.6.1 Transformation Example
  3.7 Order of Optimizations
    3.7.1 Folding
    3.7.2 Dead code elimination
4 Conclusions and Further Research
  4.1 Conclusion
  4.2 Further Research
List of Figures
2.1 An example of a pipeline.
2.2 A pipeline hazard and stall at a cost of one machine cycle.
2.3 A pipeline hazard and stall at a cost of 19 machine cycles.
2.4 Resource conflict on R1
2.5 A pipeline stall due to branching.
2.6 Data Dependent Graph, DDG
3.1 Flow Pipeline Dependence
3.2 Anti Pipeline Dependence
3.3 Output Pipeline Dependence
3.4 PDG Scalar Dependence
3.5 Conflict removal by code reordering
3.6 Delay branching
3.7 Conflict elimination by delay branching
3.8 Conflict elimination by NOPing
3.9 Example code before transformation
3.10 Example code after basic block transformation
3.11 Example code after instruction timing transformation
3.12 Example code after branch delay transformation
3.13 Conflict elimination by folding and dead code removal
Acknowledgments
I would like to thank my advisor and chairperson Evangelos Yfantis for
his support and encouragement throughout my education and research at
UNLV. Without his support this thesis would not have been possible. I
would also like to thank John Minor, Ajoy Datta and Rohan Dalpatadu
for serving on my committee and for their helpful comments.
Chapter 1
Introduction
The goal of this thesis is to maximize the parallel performance of an
instruction pipeline by utilizing compile time optimization. Pipelines are
used to increase the throughput of a system by decreasing the memory
latency. Memory latency is the elapsed time between the request for data by
a processor and the receipt of that data by the processor. One cause of
memory latency is the difference between memory cycle time and processor
cycle time. Usually memory cycle time is greater than processor cycle time.
Throughput is the number of tasks completely processed by the pipeline per
unit of time [Dasgupta, 1989b].
This thesis proposes a series of compile time optimizations, based on
dataflow analysis, that will detect and remove possible resource conflicts
within an instruction stream. In its most general form, dataflow analysis is
a method for finitely describing how a program utilizes its data
[Muchnick and Jones, 1981]. For our purposes, dataflow analysis becomes a
useful diagnostic tool that discovers certain properties of a program before
the program is executed. The information collected during dataflow analysis
on an instruction stream can be utilized during the compiler optimization
phase.
Chapter 2
Overview
2.1 Instruction Timing and Stalls
Instruction execution is accomplished through a series of steps called the
instruction cycle. The instruction cycle consists of subcycles, each of
which takes one or more clock cycles. Thus an instruction goes through a
series of steps called cycles before actual execution. For our first
examples an instruction cycle with 5 steps will be assumed, with the
following steps or cycles:
1. IF: The Instruction Fetch cycle fetches the next instruction from memory
by loading the instruction register with the correct address of that
instruction.
2. ID: The Instruction Decode cycle decodes the instruction and accesses
the register file for a register fetch.
3. EX: The Execution cycle performs ALU operations or calculates an
effective address.
4. MEM: The MEMory cycle is the only cycle which accesses memory.
5. SR: The Store Result cycle stores results back into the proper register.
In a pipelined system different instructions exist at different levels of
the instruction cycle at the same time. That is, at time a, instruction
$I_i$ is executing while instruction $I_j$ is fetching operands. Each stage
of the pipeline completes a different part of the instruction cycle. An
instruction enters at one end of the pipeline, proceeds through the
different stages and exits at the other end. In a pipeline the machine cycle
is the amount of time needed to move an instruction one stage down the
pipeline. The slowest cycle in the instruction cycle usually represents the
length of the machine cycle.
                 Machine Cycles
      1    2    3    4    5    6    7    8    9
i     IF   ID   EX   MEM  SR
i+1        IF   ID   EX   MEM  SR
i+2             IF   ID   EX   MEM  SR
i+3                  IF   ID   EX   MEM  SR
i+4                       IF   ID   EX   MEM  SR
Figure 2.1: An example of a pipeline.
When a pipeline is free of hazards an instruction is executed every machine
cycle as shown in figure 2.1. Pipeline hazards are the conditions within a
pipelined system that disrupt, delay, or prevent the smooth flow of tasks
through the pipeline. One type of pipeline hazard is referred to as a
resource conflict. A resource conflict is created when two instructions try
to access the same resource at the same time. Normally when a resource
conflict is detected, the pipeline will stall, allowing one instruction to
complete before the next instruction is allowed access to the resource. A
stall may take one or more machine cycles to complete, thus reducing the
efficiency of a pipeline.
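The hazard-free timing of figure 2.1 can be sketched in a few lines of
Python. This is only an illustration under the five-step instruction cycle
assumed above; the names STAGES, ideal_timing and total_cycles are
hypothetical and do not come from the thesis.

```python
# Stage names follow the five-step instruction cycle assumed in the text.
STAGES = ["IF", "ID", "EX", "MEM", "SR"]


def ideal_timing(n_instructions, stages=STAGES):
    """Return, for each instruction, the machine cycle in which it occupies
    each stage, assuming a pipeline that is free of hazards."""
    return [
        {stage: k + s + 1 for s, stage in enumerate(stages)}
        for k in range(n_instructions)
    ]


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Machine cycles needed to drain n instructions from a hazard-free pipeline."""
    return n_instructions + n_stages - 1


if __name__ == "__main__":
    for k, row in enumerate(ideal_timing(5)):
        print(f"i+{k}", row)
    print("total machine cycles:", total_cycles(5))
```

Once the pipeline is full, one instruction completes per machine cycle, so
five instructions need 5 + 5 - 1 = 9 machine cycles, as in the figure.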
An example of such a hazard can be seen in figure 2.2. The pipeline shown
in figure 2.2 shares a single data-memory reference port. That is, when an
instruction is using that data-memory port it has complete access control
over the port until it is finished with the port. In our instruction cycle
the MEM and IF steps both use the data-memory port when fetching data from
memory. In figure 2.2 at machine cycle time 4 a stall occurs because the
load instruction needs an extra cycle to complete the data transfer during
the MEM stage before instruction i+3 can be fetched.
                   Machine Cycles
      1    2    3    4      5    6    7    8    9
LOAD  IF   ID   EX   MEM    SR
i+1        IF   ID   EX     MEM  SR
i+2             IF   ID     EX   MEM  SR
i+3                  stall  IF   ID   EX   MEM  SR
i+4                              IF   ID   EX   MEM
Figure 2.2: A pipeline hazard and stall at a cost of one machine cycle.
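The stall in figure 2.2 can be reproduced with a small model. The sketch
below assumes, as a simplification, that only LOAD and STORE instructions
occupy the shared port during their MEM step (three cycles after their
fetch), and that a fetch simply waits for the next free cycle. The function
name fetch_cycles is hypothetical.

```python
def fetch_cycles(is_memory_op):
    """Return the machine cycle in which each instruction is fetched.

    is_memory_op[k] is True when instruction k is a LOAD or STORE.
    Assumption: a memory instruction fetched in cycle c owns the single
    data-memory port in cycle c + 3 (its MEM step), so no fetch may start
    in that cycle.
    """
    port_busy = set()
    fetch = []
    cycle = 0
    for uses_memory in is_memory_op:
        cycle += 1
        while cycle in port_busy:  # stall until the port is free
            cycle += 1
        fetch.append(cycle)
        if uses_memory:
            port_busy.add(cycle + 3)
    return fetch


if __name__ == "__main__":
    # LOAD followed by four instructions that do not reference memory.
    print(fetch_cycles([True, False, False, False, False]))
```

For the instruction stream of figure 2.2 the fetch cycles come out as
1, 2, 3, 5, 6: instruction i+3 loses one cycle to the LOAD.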
Some instruction delays can cost as much as 19 cycles and are usually
found in the floating-point instruction set. In figure 2.3 a 19-cycle delay
for two consecutive floating-point divide instructions is shown. In figure
2.3 there is only one floating-point functional unit, and the second
instruction must wait 19 cycles before it can access the floating-point
unit.
                    Machine Cycles
      1    2    3    ...  22   23   ...  41   42
FDIV  IF   ID   EX        MEM  SR
FDIV       IF   ID        EX             MEM  SR
Figure 2.3: A pipeline hazard and stall at a cost of 19 machine cycles.
Each of the following subsections describes a particular pipelined machine
and the instruction delays found in that machine.
2.1.1 The MIPS R2000
The MIPS R2000 has a 5-stage pipeline, with 5 active instructions in the 
pipeline at any time. One instruction is started down the pipeline every 
machine cycle. The MIPS R2000 instructions are divided into four groups:
1. Computational: All register to register with two or three operands.
2. Load or Store: The only instructions that are allowed access to memory.
Load and Store instructions have a 1-cycle delay before data being
transferred is available to another instruction.
3. Jump and Branch: Relative jumps, straight jumps and compares. Jump
and Branch instructions have a 1-cycle delay while they fetch the
instruction and the target address.
4. Special Instructions: These instructions support procedure and
interrupt linkage.
The MIPS R2010 Floating-Point Accelerator (FPA) operates as the coprocessor
for the R2000 and has a 6-stage pipeline. The MIPS R2010 instructions are
divided into four groups:
1. Computational: All register to register with two or three operands.
   • ADD and SUB: 2-cycle delay.
   • MUL.S: 4-cycle delay.
   • MUL.D: 5-cycle delay.
   • DIV.S: 12-cycle delay.
   • DIV.D: 19-cycle delay.
2. Load, Store and Move: The only instructions that are allowed access
to memory. Load and Store instructions have a 2-cycle delay before
data being transferred is available to another instruction.
3. Conversion: Performs conversion operations between the various data
formats. Delays of 5 cycles are possible.
4. Compare: Performs comparisons between registers and sets condition
bits. Delays of 5 cycles are possible.
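The computational delays listed above can be tabulated and used to estimate
when consecutive floating-point instructions reach execution. The sketch
below is an illustration only. It assumes a single floating-point unit that
stays occupied for the listed delay, counted from the cycle in which an
instruction starts its EX step, and it reuses the generic five-step timing
of figure 2.3 (first EX in cycle 3) rather than the R2010's own 6-stage
pipeline. The names R2010_DELAY and ex_start_cycles are hypothetical.

```python
# Delays taken from the list above, in machine cycles.
R2010_DELAY = {
    "ADD": 2, "SUB": 2,
    "MUL.S": 4, "MUL.D": 5,
    "DIV.S": 12, "DIV.D": 19,
}


def ex_start_cycles(ops, first_ex=3):
    """Return the cycle in which each operation begins its EX step.

    Assumption: one floating-point unit, busy for R2010_DELAY[op] cycles
    from the start of EX; instructions issue in order.
    """
    starts = []
    unit_free = 0
    for op in ops:
        # Without a stall, an instruction reaches EX one cycle after
        # its predecessor.
        earliest = starts[-1] + 1 if starts else first_ex
        start = max(earliest, unit_free)
        starts.append(start)
        unit_free = start + R2010_DELAY[op]
    return starts


if __name__ == "__main__":
    print(ex_start_cycles(["DIV.D", "DIV.D"]))
```

Under these assumptions two consecutive double-precision divides start
execution in cycles 3 and 22, which agrees with the 19-cycle stall drawn in
figure 2.3.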
2.1.2 The Stanford MIPS
The Stanford MIPS has a 5-stage pipeline, with three active instructions in
the pipeline at any time. One instruction is started down the pipeline every
two clock cycles. All instructions execute in one machine cycle. The MIPS
instructions are divided into four groups:
1. ALU: All register to register with two or three operands. A total of
13 instructions are found in this group.
2. Load or Store: The only instructions that are allowed access to memory.
A total of 10 instructions are found in this group.
3. Control Flow: Relative jumps, straight jumps and compares. A total
of 6 instructions are found in this group.
4. Special Instructions: These instructions support procedure and
interrupt linkage. A total of 2 instructions are found in this group.
2.1.3 The IBM 801
Memory on the IBM 801 is accessed by the Load and Store instructions.
Multiplication is supported by a MULTIPLY STEP instruction which uses
16 clock cycles, and division is supported by a DIVIDE STEP instruction
which uses 32 clock cycles. All other instructions execute in one machine
cycle.
2.1.4 The GMU Microcoded RISC-MIRIS
The MIRIS has a set of 64 primitive instructions and each instruction
executes in a single machine cycle.
2.1.5 The ACORN RISC Machine
There are 44 basic instruction codes which are subdivided into 5 main groups.
All instructions except the multiple register load and store execute in one
cycle. The groups are:
1. Load or Store: Single register.
2. Load or Store: Multiple registers.
3. ALU: All register to register.
4. Branch
5. Software interrupt
2.2 Data Reference and Stalls
Another version of a resource conflict is the data reference conflict. A data
reference conflict occurs when the order in which operands are accessed is
changed by the pipeline. A data reference conflict can be seen in figure 2.4.
Instruction i must store the new value of R1 before instruction i+1 is
allowed a register fetch of R1. If the order is changed, instruction i+1 will
have the old value of R1, not the value that was placed into it by
instruction i.
i     MOV R4,R1       (R1 := R4)
i+1   ADD R1,R2,R2    (R2 := R1 + R2)
                 Machine Cycles
      1    2    3    4    5    6    7    8    9
i     IF   ID   EX   MEM  SR
i+1        IF   ID   EX   MEM  SR
Figure 2.4: Resource conflict on R1
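A compiler can detect the conflict of figure 2.4 by comparing the register
written by one instruction with the registers read by the next. The sketch
below is a minimal illustration. It assumes the textual format used in the
figure, where the last operand is the destination and all other operands are
sources, so it covers only register-to-register forms such as MOV and ADD.
The helper names are hypothetical.

```python
def reads_writes(instruction):
    """Split 'OP src,...,dst' into (set of source registers, set holding dst).

    Assumption: the last operand is the destination, as in MOV R4,R1 and
    ADD R1,R2,R2 in figure 2.4.
    """
    _, operands = instruction.split(None, 1)
    registers = [r.strip() for r in operands.split(",")]
    return set(registers[:-1]), {registers[-1]}


def has_data_reference_conflict(first, second):
    """True when `second` reads a register that `first` writes."""
    _, written = reads_writes(first)
    read, _ = reads_writes(second)
    return bool(written & read)


if __name__ == "__main__":
    print(has_data_reference_conflict("MOV R4,R1", "ADD R1,R2,R2"))
```

For the two instructions of figure 2.4 the check reports a conflict on R1;
replacing the second instruction with one that does not read R1 removes it.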
2.3 Branching and Stalls
A natural characteristic of a pipeline is the ability to prefetch one
instruction while a previous instruction is being executed. When the
executed instruction is a successful branch or an unconditional one, the
prefetched instructions must be flushed from the pipeline. When flushing
occurs in a pipeline the flushed instructions add to the wasted memory
access time, and the computing time for a task increases.
Figure 2.5 is an example of a branching stall in a pipeline which has the
ability to hold four instructions.
i+20    JMP 106
i+21    ADD R4,R5,R3
i+22    ADD R4,R6,R2
i+23    STORE R3,[R5+R7]
                   Machine Cycles
        40   41   42   43   44   45   46   47
i+20    IF   ID   EX   MEM  SR
i+21         IF   ID   EX   MEM  SR
i+22              IF   ID   EX   MEM  SR
i+23                   IF   ID   EX   MEM  SR
i+24                        IF   ID   EX   MEM  SR

                   Machine Cycles
        40   41   42   43   44   ***  50   51   52   53   54
i+20    IF   ID   EX   MEM  SR
i+106                                 IF   ID   EX   MEM  SR
i+107                                      IF   ID   EX   MEM
i+108                                           IF   ID   EX
i+109                                                IF   ID
Figure 2.5: A pipeline stall due to branching.
Notice in figure 2.5 that the pipeline was filled with five instructions
where the first instruction, i+20, is an unconditional branch to instruction
i+106. The next three instructions, i+21, i+22 and i+23, must be flushed from
the pipeline, and instruction i+106 is fetched at a cost of 6 machine cycles.
Although this may not seem like a large problem, we know that 65% of control
instructions change the value of the PC [Hennessy and Patterson, 1990].
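The two figures quoted above can be combined into a rough average cost per
control instruction. The calculation below is only a back-of-the-envelope
sketch. It assumes that every control instruction that changes the PC pays
the full 6-cycle flush of figure 2.5 and that all other control instructions
cost nothing extra; the thesis does not state this average itself.

```python
def expected_branch_penalty(taken_fraction, flush_cost):
    """Average wasted machine cycles per control instruction.

    Assumption: a control instruction that changes the PC always pays
    `flush_cost` cycles; one that falls through pays none.
    """
    return taken_fraction * flush_cost


if __name__ == "__main__":
    # 65% of control instructions change the PC; a flush costs 6 cycles.
    print(round(expected_branch_penalty(0.65, 6), 2))
```

Under these assumptions each control instruction wastes about 3.9 machine
cycles on average.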
2.4 Previous Work
The following is an overview of pipeline scheduling solutions. Most of the
solutions are based on the concept of a Directed Acyclic Graph (DAG).
• NP-Complete: In 1983, J. Hennessy and T. Gross proved that code
reorganization for an optimal pipeline is NP-Complete
[J.Hennesey and Gross, 1983]. Along with the proof, an algorithm is given
which proposes a solution to the code reorganization problem. The algorithm
works on a DAG which incorporates a look-ahead scheme for node scheduling.
A major problem with this concept is that the algorithm itself can deadlock
during scheduling.
• Instruction Scheduling for Vector Processors: Described in his
paper [Aray, 1985], S. Aray's algorithm uses a weighted DAG to solve
the code scheduling problem of a vector processor. Since the time
complexity of this solution is exponential, the use of this algorithm in
a compiler is not feasible.
• Gibbons Method: A reorganizational scheme known as the Gibbons Method
is described in a paper by P.B. Gibbons and S. Muchnick
[Gibbons and Muchnick, 1986]. This method is based on a DAG
representation of instructions and a Candidate Set. The Candidate
Set represents instructions that are ready for scheduling, that is, the
instructions with no predecessors in the DAG.
• Critical Path: An algorithm that works on an unweighted DAG,
where the nodes represent instructions and the arcs represent
dependencies, is presented by D. Bernstein in 1988 [Berstien, 1988]. This
algorithm follows the critical path approach where each instruction is
assigned to a level in the DAG. The nodes in the graph are then
arranged in descending order, where instructions with the highest level
are scheduled first. In 1989 Bernstein [D.Bernstien and Gertner, 1989a]
[D.Bernstien and Gertner, 1989b] extends the algorithm so that it works
on a weighted DAG, where the weights represent delay slots of zero or
one. Thus the largest delay possible with this algorithm is one cycle.
• The GNU Instruction Scheduler: The GNU compiler incorporates the
critical path concept where each instruction is given a priority
[Tiemann, 1989] based on path length and the execution time of the
instruction. An instruction that takes longer to execute will be given a
higher priority, and the instruction with the highest priority is
scheduled first.
• The MIPS Reorganizer: The MIP-X compiler uses a three stage
DAG reorganizing process: (1) local reorganization, (2) interblock
reorganization, and (3) branch scheduling [Chow, 1989]. The local
reorganizer detects the branch instructions, thus defining the basic
blocks, whereas the interblock reorganizer determines when an instruction
can be moved up beyond a basic block boundary to fill delay slots, avoiding
a possible NOP instruction fill.
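Several of the schemes above share one skeleton: keep a candidate set of
instructions whose predecessors in the DAG have all been scheduled, and pick
the candidate that lies highest on the critical path. The sketch below
illustrates that skeleton only. It is not the algorithm of any of the cited
papers, it assumes an unweighted and acyclic dependence graph in which every
instruction appears as a key, and the function name and the alphabetical
tie-break are hypothetical choices.

```python
def critical_path_order(deps):
    """Order instructions from a dependence DAG.

    deps maps each instruction name to the set of instructions that must
    be scheduled before it. The graph is assumed to be acyclic.
    """
    successors = {node: set() for node in deps}
    for node, predecessors in deps.items():
        for p in predecessors:
            successors[p].add(node)

    level = {}

    def height(node):
        # Longest path, counted in nodes, from `node` to a leaf of the DAG.
        if node not in level:
            level[node] = 1 + max(
                (height(s) for s in successors[node]), default=0
            )
        return level[node]

    waiting = {node: set(p) for node, p in deps.items()}
    order = []
    while waiting:
        # Candidate set: instructions with no unscheduled predecessors.
        candidates = [n for n, p in waiting.items() if not p]
        # Highest level first; ties are broken by name for determinism.
        pick = min(candidates, key=lambda n: (-height(n), n))
        order.append(pick)
        del waiting[pick]
        for p in waiting.values():
            p.discard(pick)
    return order


if __name__ == "__main__":
    dag = {"a": set(), "b": {"a"}, "c": {"b"}, "d": set()}
    print(critical_path_order(dag))
```

In the example, instruction a heads the longest chain (a, b, c) and is
scheduled before the independent instruction d.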
2.5 Survey of Dataflow Analysis
• Reaching Definitions: Dataflow analysis was first used by Vyssotsky
in 1961 as a compile time diagnostic tool for the Bell Laboratories IBM
7090 FORTRAN II compiler [Hecht, 1977]. Vyssotsky used dataflow
analysis to solve the reaching definitions problem. If there exist two
blocks, $B_i$ and $B_j$, then a definition d, defined in $B_i$, is said to
reach $B_j$ if d is not redefined between $B_i$ and $B_j$.
• Variable Folding: In 1969 E.S. Lowry and C.W. Medlock
[Lowry and Medlock, 1969] implemented a variable folding dataflow analysis
algorithm. When an instruction has the form of X := Y, we substitute the
value of Y for any future undefined uses of X.
• Instruction Scheduling: In 1970 [Sethi et al., 1970] and 1974
[Beatty, 1972] dataflow analysis was utilized for the optimization of
arithmetic expressions. Various target architectures often follow different
scheduling criteria. Instruction scheduling is a method of mapping the
program code to a specific architecture. This optimization often leads to
performance gains [Muchnick and Jones, 1981].
• Register Allocation: J.C. Beatty [Beatty, 1974] in 1974 proposed a
dataflow analysis algorithm that attempts to eliminate useless temporary
variables by assigning program variables to CPU registers.
• Dead Code Elimination: The instruction X := Y, given as an
example in the variable folding section, would be detected as dead
code and eliminated after constructing use-definition chains during
dataflow analysis. Use-definition chains or ud-chains are lists of each
use of a variable and all the definitions that reach that variable. This
technique was first used by K. Kennedy at Rice University in 1975
[K.Kennedy, 1975].
• Detection of Parallelism: In 1975 P.B. Schneck [Schneck, 1975],
with the use of dataflow analysis, detected and coded implicit parallel
vector expressions.
• Formal Data Dependence: By 1976 U. Banerjee [Banerjee, 1976]
[Banerjee, 1979] had discovered three of the most popular dependence
tests: gcd, bound, and inequality. Banerjee's work has become a
foundation in data dependence analysis [Wolfe, 1982] which has been
modified for many purposes by several authors [Kennedy, 1984]
[Allen and Kennedy, 1982] [Ellis, 1985] [Wolfe, 1982].
• Recursive Data Structure Analysis: In 1977, Jones and Muchnick
proposed a general framework for dataflow analysis on programs with
recursive data structures [N.D.Jones and S.Muchnick, 1982].
• Dependence Direction: In his 1982 Ph.D. thesis M. Wolfe [Wolfe, 1982]
discovered the direction vector. The discovery of the direction vector
led Wolfe to several important tests which recognize parallelism within
a loop.
1. The vectorization test recognizes whether the statements in a loop
can be vectorized. After building the DDG (an example is given in
section 2.6) the compiler attempts to find cycles and backward direction
vectors within the loop. The loop is vectorizable if it is void of any
cycles and backward direction vectors. If the loop is represented by a
backward direction vector, the elimination of all upward arcs by code
reordering is attempted. The topological sorting sometimes reverses the
direction vector, from backwards to forwards, allowing vectorization
of the loop.
2. The loop fusion test recognizes a situation where two or more loops
in a program can be transformed into a single vectorized loop.
3. The loop interchanging test recognizes the possibility of
interchanging nested loop levels. In certain architectures, an increase in
performance can be seen when using this technique [Kennedy, 1984].
• Subscripted Variable Analysis: By modifying Banerjee's work in
1983 J. Allen [Allen, 1983] showed how data dependence analysis could
exploit parallelism at the loop level by the dataflow analysis of array
subscripts.
• Loop Interchange: An early example of a loop transformation, based
on Banerjee's data dependence analysis technique, is referred to as loop
interchange and was discovered by J. Allen and K. Kennedy [Kennedy, 1984]
in 1984.
• Minimization of Communication: C.D. Polychronopoulos utilized
dependence analysis in 1987 for the reduction of interprocess
communication in a message passing system [Polychronopoulos, 1987a].
Through analysis of the data dependence graph (see section 2.6),
transformations are selected which reduce the total number of messages
needed in a parallel program.
• Pointer Analysis: In 1989, S. Horwitz, P. Pfeiffer, and T. Reps
discovered a method for analyzing data dependence for pointer variables.
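The reaching definitions problem named at the start of this survey is
commonly solved by iterating over the blocks of a flow graph until the sets
stop changing. The sketch below shows that standard iterative scheme, not
Vyssotsky's original implementation. The block names, definition labels and
the function name are hypothetical; each block is given as a list of
(definition label, variable) pairs, and preds lists the predecessors of each
block.

```python
def reaching_definitions(blocks, preds):
    """Return, for every block, the set of definitions that reach its entry."""
    # All definitions of each variable anywhere in the program.
    defs_of = {}
    for statements in blocks.values():
        for d, var in statements:
            defs_of.setdefault(var, set()).add(d)

    # gen: definitions that survive to the end of the block.
    # kill: definitions of the same variables made elsewhere or overwritten.
    gen, kill = {}, {}
    for name, statements in blocks.items():
        g, k = set(), set()
        for d, var in statements:
            g = (g - defs_of[var]) | {d}
            k |= defs_of[var]
        gen[name], kill[name] = g, k - g

    def reach_in(name, out):
        return set().union(*(out[p] for p in preds.get(name, [])))

    out = {name: set(gen[name]) for name in blocks}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for name in blocks:
            new_out = gen[name] | (reach_in(name, out) - kill[name])
            if new_out != out[name]:
                out[name], changed = new_out, True
    return {name: reach_in(name, out) for name in blocks}


if __name__ == "__main__":
    blocks = {
        "B1": [("d1", "x"), ("d2", "y")],
        "B2": [("d3", "x")],
        "B3": [],
    }
    preds = {"B2": ["B1"], "B3": ["B1", "B2"]}
    for block, defs in reaching_definitions(blocks, preds).items():
        print(block, sorted(defs))
```

In the example, definition d1 of x reaches B3 along the edge from B1, while
along the path through B2 it is replaced by d3, so both reach B3.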
2.6 Graphs
2.6.1 Data Dependence Graphs, DDG
As the compiler computes data dependence information, it creates a data
dependence graph, or DDG. A DDG is a directed graph $G = (V, E)$, where
the nodes, $V = \{S_1, S_2, \ldots, S_n\}$, represent the statements in a
program, and the directed arcs,
$E = \{e_{ij} = (S_i, S_j) \mid S_i, S_j \in V\}$, represent the dependence
relationships. Parallelism is extracted after the DDG has been created and
the code transformation phase begins. The four classifications of dependence
are shown in figure 2.6.
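[Figure: legend of arc styles for Flow, Anti, Output, and Control
dependence, and the dependence graph over the statements below; the graphic
is not recoverable from the source.]
S1: A := B + C
S2: B := D
S3: A := B * f
S4: If ( E = B ) then
S5: A := G * H
Figure 2.6: Data Dependent Graph, DDG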
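The flow, anti, and output arcs for the five statements of figure 2.6 can be
derived from the variables each statement reads and writes. The sketch below
is a conservative illustration: it compares every ordered pair of statements
and does not remove arcs that an intervening redefinition would kill, and it
ignores control dependence. The read and write sets are transcribed by hand
from the statements above, and the function name is hypothetical.

```python
from itertools import combinations

# (statement, variables written, variables read), in program order.
STATEMENTS = [
    ("S1", {"A"}, {"B", "C"}),
    ("S2", {"B"}, {"D"}),
    ("S3", {"A"}, {"B", "f"}),
    ("S4", set(), {"E", "B"}),
    ("S5", {"A"}, {"G", "H"}),
]


def dependence_arcs(statements):
    """Return the set of (source, sink, kind) data dependence arcs."""
    arcs = set()
    for (a, writes_a, reads_a), (b, writes_b, reads_b) in combinations(statements, 2):
        if writes_a & reads_b:
            arcs.add((a, b, "flow"))    # a writes what b later reads
        if reads_a & writes_b:
            arcs.add((a, b, "anti"))    # a reads what b later overwrites
        if writes_a & writes_b:
            arcs.add((a, b, "output"))  # both write the same variable
    return arcs


if __name__ == "__main__":
    for arc in sorted(dependence_arcs(STATEMENTS)):
        print(arc)
```

For these statements the sketch reports an anti arc from S1 to S2 on B, flow
arcs from S2 to S3 and S4 on B, and output arcs among S1, S3, and S5 on A.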
2.6.2 Iteration Space Dependence, ISDG
The analysis of data dependence for the recognition of parallelism at a
statement or block level is very useful. Although dependence tests within
loops do not necessarily find true dependence [Polychronopoulos, 1987b], we
can solve diophantine equations to find any dependence within loop
iterations. The analysis at a subscript level will determine data dependence
within a loop construct. Different iterations of a loop can be run in
parallel if, and only if, the loop carries no dependence between the
iterations. This type of dependence is referred to as loop-carried.
In loop constructs the compiler writer needs to form and solve dependence
equations that involve array subscripts. Solving this dependence is
equivalent to solving a diophantine equation. Various methods used in
solving these equations are given in [Banerjee, 1979] [Wolfe, 1982]
[Allen and Kennedy, 1982] [Griffin, 1954] [Kirch, 1974]. The only values
valid for our analysis are integer values in the range of the loop counter.
The dependence distance gives the number of loop iterations between
the corresponding dependent array elements. Distance is represented by an
integer value. With distance and direction computed, an ISDG, or iteration
space dependence graph, is created.
The dependence direction shows the relationship between instances
of each loop iteration. Direction is represented by the >, <, and =
operators. The following is a summary:
• = : Dependence holds within the same loop iteration.
• > : Dependence holds from a particular loop iteration back to a previous
loop iteration when the computed value of the distance vector is
larger than 0. This is denoted in the dependence graph with a downward
arc.
• < : Dependence holds from a particular loop iteration to a future loop
iteration when the computed value of the distance vector is less than
0. This is denoted in the dependence graph with an upward arc.
• * : Dependence is unknown or all three, <, >, =, apply.
Consider the next few examples:
Example 1                          I1 δ(>) I2
   DO i = 1 to N                   equation: i = j-1
   I1:   A(i) =                    possible solution: i = 1, j = 2
   I2:        = A(i-1)             distance: j-i = 1
   ENDDO                           direction: j-i > 0
Example 2
   DO i = 1 to N                   equation: 2i = 2j-1
   I1:   A(2i) =                   solutions: none (no dependence)
   I2:         = A(2i-1)
   ENDDO
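The two examples above can be checked mechanically. The following is a minimal sketch, not the method of any of the cited works: it applies the classical GCD divisibility test to a dependence equation of the form a*i - b*j = c and then searches the loop range for an actual solution. The function name `may_depend` and its parameters are hypothetical, and the coefficient b is assumed to be nonzero.

```python
from math import gcd


def may_depend(a, b, c, lo, hi):
    """Return True if a*i - b*j = c has an integer solution with lo <= i, j <= hi.

    Assumes b != 0. The names and the equation form are illustrative only.
    """
    # GCD test: an integer solution can only exist if gcd(a, b) divides c.
    if c % gcd(a, b) != 0:
        return False
    # Only values inside the loop bounds are valid, so search that range.
    for i in range(lo, hi + 1):
        if (a * i - c) % b == 0 and lo <= (a * i - c) // b <= hi:
            return True
    return False


# Example 1: i = j - 1, written as 1*i - 1*j = -1
example_1 = may_depend(1, 1, -1, 1, 10)
# Example 2: 2i = 2j - 1, written as 2*i - 2*j = -1
example_2 = may_depend(2, 2, -1, 1, 10)
```

Under these assumptions `example_1` is true (i = 1, j = 2 is one solution) and `example_2` is false, because 2 does not divide -1.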
Examples 1 and 2 deal with single dimensional arrays. The same techniques
are used in multidimensional arrays. The difference is that each
dimension is solved separately. In the case of two dimensions there are two
dependence equations, two distances, and two directions.
Example 3 Multidimensional subscripts
   DO 10 i = 1 to N
      DO 20 j = 1 to M
      I1:   A(i,j) =
      I2:          = A(i-1,j+1)
   20
   10
   equation 1: i = k-1     distance 1:  1     direction 1: >
   equation 2: j = l+1     distance 2: -1     direction 2: <
Usually, in multiple subscripts, distance and direction are combined
into a vector form. In example 3 the distance vector is represented by
(1, -1), and the direction vector is represented by (>, <).
Informally, a loop can be described by an iteration space [Wolfe, 1982].
Thus, a d-dimensional loop is described by a d-dimensional iteration space.
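As an illustration of the vector form, the sketch below derives the distance and direction vectors of example 3 from the constant subscript offsets of the written and the read reference. It is a minimal sketch that assumes every subscript has the form index + constant; the function names are hypothetical, and the sign-to-symbol mapping follows the convention used in this section (a positive distance gives >).

```python
def distance_vector(write_offsets, read_offsets):
    """Distance per dimension for subscripts of the form index + constant.

    An element written with offset w is read with offset r a total of
    (w - r) iterations later in that dimension.
    """
    return tuple(w - r for w, r in zip(write_offsets, read_offsets))


def direction_vector(distance):
    """Map each distance to >, < or =, following this section's convention."""
    def symbol(d):
        if d > 0:
            return ">"
        if d < 0:
            return "<"
        return "="
    return tuple(symbol(d) for d in distance)


# Example 3: A(i,j) is written, A(i-1,j+1) is read.
dist = distance_vector((0, 0), (-1, +1))
direc = direction_vector(dist)
```

For example 3 this yields the distance vector (1, -1) and the direction vector (>, <), matching the values given above.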
2.7 Notation and Definitions
• $IB_n$: ... a basic block of n instructions $IB_n = \{I_1, I_2, \ldots, I_n\}$.
• $I_i < I_j$: ... when instruction $I_i$ lexically precedes instruction $I_j$ in the
instruction stream.
• Instruction Formats ... memory is accessed only by the LOAD and
STORE operations.
   ADD src1,src2,dst      =  dst := src1 + src2
   SUB src1,src2,dst      =  dst := src1 - src2
   LOAD [src1+src2],dst   =  dst := memory[src1+src2]
   MOV src1,dst           =  dst := src1
   STORE src1,dst         =  memory[dst] := src1
   JMP dst                =  PC := dst
   INC src1               =  src1 := src1 + 1
• $I_i\,\delta^*\,I_j$ denotes any dependence.
• $I_i\,\Delta\,I_j$ denotes indirect dependence, if ...
then $I_i\,\Delta\,I_j$
• $I_i\,\Delta\,I_i$ denotes a cyclic dependence usually found in loops.
• $\mathrm{IN}(I_i)$ ... the set of variables read by instruction $I_i$. Consider the
following instruction:
   I_i  MOV R1,R2
   IN(I_i) = { R1 }
• $\mathrm{OUT}(I_i)$ ... the set of variables written to by instruction $I_i$. Consider
the following instruction:
   I_i  MOV R1,R2
   OUT(I_i) = { R2 }
• $P_{s,a}$ ... a pipeline where s = the number of stages and a = the number
of active instructions allowed in the pipeline at any one time. Thus $P_{5,3}$
denotes a pipeline with 5 stages which has the ability to hold 3 active
instructions at any one time.
• Pipeline Formats ... We will use three different types of pipeline
structures in our examples. The first type is a 5-stage pipeline, which
has the ability to hold five active instructions. The stages are listed
below:
1. IF: The Instruction Fetch cycle fetches the next instruction from
memory by loading the instruction register with the correct address
of that instruction.
2. ID: The Instruction Decode cycle decodes the instruction and
accesses the register file for a register fetch.
3. EX: The Execution cycle performs ALU operations or calculates
an effective address.
4. MEM: The MEMory cycle is the only cycle which accesses memory.
5. SR: The Store Result cycle stores results back into the proper
register.
The second pipeline type is a 5-stage pipeline, which has the ability to
hold five active instructions. The stages are listed below:
1. IF: The Instruction Fetch cycle fetches the next instruction from
memory by loading the instruction register with the correct address
of that instruction.
2. ID: The Instruction Decode cycle decodes the instruction.
3. OD: The Operand Decode cycle calculates an effective address
and fetches operands.
4. SX: The Store/Execute cycle sends the operand to memory, or uses
the ALU if the instruction executes an operation.
5. OF: Operand Fetch if the instruction is a load.
The third pipeline type is an n-stage pipeline, which has the ability to
hold a active instructions. The stages are listed below:
- S1:
- S2:
- Sn:
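The IN and OUT sets defined in this section follow directly from the instruction formats. The sketch below shows one possible way to compute them from a textual instruction. The function name `in_out`, the string encoding of memory cells such as `mem[R11]`, and the decision to give JMP empty sets are assumptions made for illustration and are not taken from the thesis.

```python
import re


def in_out(instr):
    """Return (IN, OUT) for one instruction such as 'ADD R1,R2,R3'.

    Hypothetical helper: memory cells are encoded as the string 'mem[...]'.
    """
    op, _, rest = instr.partition(" ")
    args = [a.strip() for a in rest.split(",")] if rest else []
    if op in ("ADD", "SUB"):            # dst := src1 op src2
        return {args[0], args[1]}, {args[2]}
    if op == "MOV":                     # dst := src1
        return {args[0]}, {args[1]}
    if op == "INC":                     # src1 := src1 + 1
        return {args[0]}, {args[0]}
    if op == "LOAD":                    # dst := memory[src1+src2]
        regs = set(re.findall(r"R\d+", args[0]))
        return regs | {"mem" + args[0]}, {args[1]}
    if op == "STORE":                   # memory[dst] := src1
        regs = set(re.findall(r"R\d+", args[1]))
        return {args[0]} | regs, {"mem" + args[1]}
    if op == "JMP":                     # assumption: PC is not tracked
        return set(), set()
    raise ValueError("unknown instruction: " + instr)


mov_sets = in_out("MOV R1,R2")
load_sets = in_out("LOAD [R9+R0],R7")
store_sets = in_out("STORE R7,[R11]")
```

With these assumptions `MOV R1,R2` gives IN = {R1} and OUT = {R2}, as in the definitions above.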
Chapter 3
Instruction Pipeline
Optimization via Dataflow
Analysis
This chapter will discuss a technique used in optimizing instruction pipelines
with the use of dataflow analysis. The technique analyzes the structure of a
basic block of instructions¹ and detects data dependence that might create
a resource conflict within the pipeline.
When two operands reference the same location in memory a dependence
relation must be recognized [Allen, 1986]. To determine whether two operations
have the ability to execute in parallel requires data dependence analysis
[Padua and Wolfe, 1986]. Data dependence analysis at the statement or loop
level reveals fine grain parallelism. Analysis at the subprogram or block
level reveals coarse grain parallelism. Architectures tend to perform better
in either fine grain or coarse grain environments. Compiler writers,
guided by specific architectures, choose the type of granularity they need.
The different classifications of data dependence are flow, anti, output, and
control.
¹ A basic block of instructions is a straight line sequence of instructions within which
the existence of a branch instruction may appear only as the last instruction in the block
[Sethi and Ullman, 1986].
3.1 Pipeline Dependence
3.1.1 Flow Pipeline Dependence
Definition 3.1.1 $I_i\,\delta\,I_j$: $I_j$ is flow dependent on $I_i$. $I_i$ must store its
results before $I_j$ is allowed to execute. Flow dependence exists in pipeline $P_{s,a}$
iff
$(|i-j| < a) \land (I_i < I_j) \land (\mathrm{OUT}(I_i) \cap \mathrm{IN}(I_j) \neq \emptyset)$
   i      MOV R4,R1       (R1 := R4)
   i+1    ADD R1,R2,R2    (R2 := R1 + R2)
   DDG: arc from i to i+1
   Machine Cycles
          1    2    3    4    5    6
   i      IF   ID   EX   MEM  SR
   i+1         IF   ID   EX   MEM  SR
Figure 3.1: Flow Pipeline Dependence
A flow dependence can be seen in figure 3.1. At time 3, instruction i+1
will access R1 for a read, while instruction i will not write that value until
time 5. Notice the formal condition holds since
$(|i-j| < a) \land (i < i+1) \land (\mathrm{OUT}(i) \cap \mathrm{IN}(i+1) = \{R1\} \cap \{R1, R2\} = \{R1\} \neq \emptyset)$
3.1.2 Anti Pipeline Dependence
Definition 3.1.1 $I_i\,\bar{\delta}\,I_j$: $I_j$ is anti dependent on $I_i$. $I_i$ must fetch its data
before $I_j$ is allowed to change that value. Anti dependence exists in pipeline
$P_{s,a}$ iff
$(|i-j| < a) \land (I_i < I_j) \land (\mathrm{IN}(I_i) \cap \mathrm{OUT}(I_j) \neq \emptyset)$
   i      LOAD mem[R1],R4     (R4 := mem[R1])
   i+1    STORE R6,mem[R1]    (mem[R1] := R6)
   DDG: arc from i to i+1
   Machine Cycles
          1    2    3    4    5    6
   i      IF   ID   OD   SX   OF
   i+1         IF   ID   OD   SX   OF
Figure 3.2: Anti Pipeline Dependence
An anti pipeline dependence can occur in a pipeline that has the ability
to write in an earlier stage than a read. In figure 3.2 instruction i attempts
to read mem[R1], while instruction i+1 attempts a write of mem[R1]. If the
original order is not preserved, instruction i will have the wrong value in R4.
Notice the formal condition holds since
$(|i-j| < a) \land (i < i+1) \land (\mathrm{IN}(i) \cap \mathrm{OUT}(i+1) = \{R1\} \neq \emptyset)$
3.1.3 Output Pipeline Dependence
Definition 3.1.1 $I_i\,\delta^{o}\,I_j$: $I_j$ is output dependent on $I_i$. $I_i$ must store its
results before $I_j$ is allowed to store its results.
Output dependence exists in pipeline $P_{s,a}$ iff
$(|i-j| < a) \land (I_i < I_j) \land (\mathrm{OUT}(I_i) \cap \mathrm{OUT}(I_j) \neq \emptyset)$
   i      MOV R4,R1    (R1 := R4)
   i+1    INC R1       (R1 := R1 + 1)
   DDG: arc from i to i+1
   Machine Cycles
          1    2    3    4    5    6
   i      S1   S2   S3   S4   S5
   i+1         S1   S2   S3   S4   S5
Figure 3.3: Output Pipeline Dependence
An output pipeline dependence can occur in a pipeline that has the
ability to write in two or more stages. In figure 3.3 instruction i attempts
a write of a variable, while instruction i+1 attempts a write of the same
variable. If the original order is not preserved, the variable will be left
with the wrong value. Notice the formal condition holds since
$(|i-j| < a) \land (i < i+1) \land (\mathrm{OUT}(i) \cap \mathrm{OUT}(i+1) = \{var\} \cap \{var\} = \{var\} \neq \emptyset)$
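The three formal conditions of this section share the same shape, so they can be evaluated together. The following is a minimal sketch under stated assumptions: the IN and OUT sets are supplied as Python sets in lexical order, instructions are indexed from 0, and the function name `pipeline_dependences` is hypothetical.

```python
def pipeline_dependences(in_sets, out_sets, a):
    """List (kind, i, j) for every pair of instructions that satisfies one
    of the three pipeline dependence conditions in a pipeline that holds
    `a` active instructions. Positions i < j are 0-based lexical indices.
    """
    found = []
    n = len(in_sets)
    for i in range(n):
        for j in range(i + 1, n):       # I_i lexically precedes I_j
            if j - i >= a:              # |i - j| < a must hold
                continue
            if out_sets[i] & in_sets[j]:
                found.append(("flow", i, j))
            if in_sets[i] & out_sets[j]:
                found.append(("anti", i, j))
            if out_sets[i] & out_sets[j]:
                found.append(("output", i, j))
    return found


# Figure 3.1: MOV R4,R1 followed by ADD R1,R2,R2 in a 5-instruction pipeline
fig_3_1 = pipeline_dependences([{"R4"}, {"R1", "R2"}], [{"R1"}, {"R2"}], 5)
# Figure 3.3: MOV R4,R1 followed by INC R1
fig_3_3 = pipeline_dependences([{"R4"}, {"R1"}], [{"R1"}, {"R1"}], 5)
```

For figure 3.1 the sketch reports a single flow dependence. For figure 3.3 it reports the output dependence and, because INC also reads R1, a flow dependence as well.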
3.2 Pipeline Iteration Space Dependence
The exploitation of parallelism is strongly influenced by the characteristics of
a language paradigm. A rich source of parallelism in the imperative languages
can be found in the loop construct. Imperative languages are notorious for
hiding data dependencies because of variable aliasing, side effects, and loop
constructs [Ackerman, 1981] [D’Hollander and Opsommer, 1987].
Our method will first concentrate on breaking data dependence within
looping structures. We recognize the fact that the majority of computing
time is spent in loop computation. With our scheme, some types of architectural
bottlenecks created by loops can be detected and eliminated from the
instruction stream. Data dependence direction vectors and distance vectors
are used to compute dependence within a loop structured iteration space.
We look at both single and multiple subscripts.
The following sections will illustrate how dependence information can aid
in the transformation of sequential loops into parallel code.
3.2.1 Dependence Free Loops
For our purposes there are two cases in which we will not need to make any
loop transformations. The first occurs when the distance vectors and the
direction vectors are computed and the outcome is such that dependence
does not exist. A loop void of data dependence between its iterations needs
no further transformations.
The second occurs when the distance vectors and the direction
vectors are computed and the distance vector is larger than the number
of active instructions in the pipeline. Any two statements that are related
by this dependence can never be in the pipeline at the same time.
3.2.2 Loop Interchange
Loop interchange is the process of interchanging the nested depth of a nested
loop. In certain architectures, an increase in performance can be seen when
using this technique [Kennedy, 1984]. Loop interchanging reorders the original
sequence of statements. As discussed previously, we must know the
dependence information before we can perform any transformations. If the
dependence directions are (<, >), this technique is impossible [Wolfe, 1982].
Requirements for loop interchanging:
1. The loops must be tightly nested.
2. The loop limits of inner and outer loops must be invariant.
3. There is no dependence relation
Example 3.2 shows a nested loop before and after loop interchange.
Example 3.2
Before Transformation
   DO 10 I = 2 to N
      DO 20 J = 2 to M
         A(I,J) = A(I,J-1)
   20
   10
   Dependence Equation 1: (=, >)
   A(2,2) = A(2,1)
   A(2,3) = A(2,2)     Flow Dependence
   A(2,4) = A(2,3)
   etc.
After Transformation
   DO 10 J = 2 to M
      DO 20 I = 2 to N
         A(I,J) = A(I,J-1)
   20
   10
   Dependence Equation 1: (>, =)
   A(2,2) = A(2,1)
   A(3,2) = A(3,1)     dependence is broken
   A(4,2) = A(4,1)
   etc.
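The effect of the interchange in example 3.2 can be checked by running both loop orders on the same data. The sketch below is only an illustration: the function name `run`, the array size, and the initial values are assumptions, and the array is padded by one row and one column so that the 1-based indices of the example can be used unchanged.

```python
def run(order, n, m):
    """Execute the loop nest of example 3.2 in the given loop order.

    order "IJ" is the loop before transformation, "JI" the loop after it.
    """
    # Row 0 and column 0 are padding; A[i][j] starts as 10*i + j.
    A = [[10 * i + j for j in range(m + 1)] for i in range(n + 1)]
    if order == "IJ":
        for i in range(2, n + 1):
            for j in range(2, m + 1):
                A[i][j] = A[i][j - 1]
    else:
        for j in range(2, m + 1):
            for i in range(2, n + 1):
                A[i][j] = A[i][j - 1]
    return A


before = run("IJ", 4, 5)
after = run("JI", 4, 5)
same_result = before == after
```

Both orders compute the same array. In the interchanged version the consecutive iterations of the inner loop touch different rows, so the flow dependence no longer falls between adjacent iterations.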
3.2.3 Loop Code Motion
Loop statement reordering is the process of interchanging certain statements
within a single loop. This technique reorders the original sequence of
statements. As discussed previously, we must know the dependence information
before we can perform any transformations. If the dependence direction is
(=) the technique is impossible [Wolfe, 1982].
Requirements for loop code motion:
1. The loops must be a single loop or the most inner nested loop.
2. There is no dependence relation
Example 3.3 shows a loop before and after code motion.
Example 3.3
Before Transformation              S1 δ(<) S2
   DO 10 I = 1 to N                A(1)=C(1)
      A(I)=C(I)                    D(1)=A(2)
      D(I)=A(I+1)                  A(2)=C(2)
   10                              D(2)=A(3)
                                   A(3)=C(3)
                                   D(3)=A(4)
                                   etc.
After Transformation
   DO 10 I = 1 to N                D(1)=A(2)
      D(I)=A(I+1)                  A(1)=C(1)
      A(I)=C(I)                    D(2)=A(3)
   10                              A(2)=C(2)
                                   D(3)=A(4)
                                   A(3)=C(3)
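That the reordering in example 3.3 preserves the result can be confirmed by executing both statement orders. The following is a minimal sketch with hypothetical function names and 0-based indices, assuming that A holds one more element than C so that A(I+1) is always defined.

```python
def loop_before(A, C):
    """Statement order before code motion: A(I)=C(I), then D(I)=A(I+1)."""
    A = list(A)
    D = [None] * len(C)
    for i in range(len(C)):
        A[i] = C[i]
        D[i] = A[i + 1]
    return A, D


def loop_after(A, C):
    """Statement order after code motion: D(I)=A(I+1), then A(I)=C(I)."""
    A = list(A)
    D = [None] * len(C)
    for i in range(len(C)):
        D[i] = A[i + 1]
        A[i] = C[i]
    return A, D


A0 = [0, 1, 2, 3]          # illustrative data, one element longer than C0
C0 = [10, 20, 30]
result_before = loop_before(A0, C0)
result_after = loop_after(A0, C0)
```

Both versions read A(I+1) before iteration I+1 overwrites it, so the two results are identical.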
3.3 Pipeline Dependence Graph, PDG
Since pipelines are machine dependent, a certain amount of information must
be known by the compiler for an effective analysis phase. This section will
show how the PDG is constructed. Parallelism is extracted after the creation
of the PDG by the code transformation phase.
As the compiler computes data dependence information, it creates a pipeline
dependence graph, or PDG. A PDG is a directed graph $G = (V, E)$, where
the nodes, $V = \{S_1, S_2, \ldots, S_n\}$, represent the statements in a program, and
the directed arcs, $E = \{e_{ij} = (S_i, S_j) \mid S_i, S_j \in V\}$, represent the dependence
relationships, iteration space dependence, delay branching dependence,
and instruction timing dependence.
3.3.1 Data Dependence Graph
The approach begins when the compiler computes the IN() and OUT() sets for
each statement in the basic block [Polychronopoulos, 1987a]. Here $IB_n$ is
a basic block of n instructions, $IB_n = \{I_1, I_2, \ldots, I_n\}$, and $I_i < I_j$ when
instruction $I_i$ lexically precedes instruction $I_j$ in the instruction stream. The
compiler extracts the dependence information by using the following definitions:
Flow Dependence
$(I_i < I_j) \land (\mathrm{OUT}(I_i) \cap \mathrm{IN}(I_j) \neq \emptyset)$
Anti-Flow Dependence
$(I_i < I_j) \land (\mathrm{IN}(I_i) \cap \mathrm{OUT}(I_j) \neq \emptyset)$
Output Dependence
$(I_i < I_j) \land (\mathrm{OUT}(I_i) \cap \mathrm{OUT}(I_j) \neq \emptyset)$
The segment of code found in figure 3.4 will help us illustrate scalar data
dependence.
   I1  MOV R3,R1          IN(1) = {R3}                  OUT(1) = {R1}
   I2  ADD R1,R2,R2       IN(2) = {R1,R2}               OUT(2) = {R2}
   I3  FDIV R9,R2,R4      IN(3) = {R9,R2}               OUT(3) = {R4}
   I4  FDIV R9,R8,R6      IN(4) = {R9,R8}               OUT(4) = {R6}
   I5  ADD R9,R2,R0       IN(5) = {R9,R2}               OUT(5) = {R0}
   I6  LOAD [R9+R0],R7    IN(6) = {R9,R0,mem[R9+R0]}    OUT(6) = {R7}
   I7  STORE R7,[R11]     IN(7) = {R7,R11}              OUT(7) = {mem[R11]}
   I8  JMP 137
Figure 3.4: PDG Scalar Dependence
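The arcs of the data dependence graph for figure 3.4 follow from the flow, anti-flow, and output definitions applied to every lexically ordered pair. The sketch below is an illustration under stated assumptions: the sets are copied from figure 3.4, memory cells are encoded as plain strings, the JMP is given empty sets, and the names `FIG_3_4` and `build_ddg` are hypothetical.

```python
# (IN, OUT) sets of I1..I8 from figure 3.4; I8 (JMP) is assumed to have none.
FIG_3_4 = {
    1: ({"R3"}, {"R1"}),
    2: ({"R1", "R2"}, {"R2"}),
    3: ({"R9", "R2"}, {"R4"}),
    4: ({"R9", "R8"}, {"R6"}),
    5: ({"R9", "R2"}, {"R0"}),
    6: ({"R9", "R0", "mem[R9+R0]"}, {"R7"}),
    7: ({"R7", "R11"}, {"mem[R11]"}),
    8: (set(), set()),
}


def build_ddg(sets):
    """Return the dependence arcs (kind, i, j) with I_i lexically before I_j."""
    arcs = []
    for i in sorted(sets):
        for j in sorted(sets):
            if i >= j:
                continue
            in_i, out_i = sets[i]
            in_j, out_j = sets[j]
            if out_i & in_j:
                arcs.append(("flow", i, j))
            if in_i & out_j:
                arcs.append(("anti", i, j))
            if out_i & out_j:
                arcs.append(("output", i, j))
    return arcs


arcs = build_ddg(FIG_3_4)
# Instructions that appear in no arc at all
isolated = sorted(set(FIG_3_4) - {n for _, i, j in arcs for n in (i, j)})
```

Under these assumptions the graph contains only flow arcs, and I4 and the branch I8 are left without any arc.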
3.3.2 Delay Branching Dependence
Definition 3.1 A delay branch exists if $\exists\, I_i$ such that $(I_i = \text{branch}) \land \nexists\, I_j\, (I_j\,\delta\,I_i)$
Previously we defined a basic block of instructions as a straight line sequence
of instructions within which the existence of a branch instruction may appear
only as the last instruction in the block. If this branch instruction is void of
any dependence within the basic block it can be used in resource conflict
elimination. Usually this type of branch is an unconditional branch or a call
statement.
At this point the compiler marks the delay branch node of the PDG. The
following example shows the delay branching instruction with a double circle
using the code given in figure 3.4.
[Figure: PDG with scalar dependence and delay branching]
3.3.3 Instruction Timing Dependence
In this particular case the analysis focuses on the type of architectural
bottlenecks that are created by a contention for some hardware resource,
such as a single functional unit.
Each arc in the PDG will be assigned a weight and a resource identification
number. The weight will represent the delay of the instruction found in that
statement, and is paired with the resource id number.
Using the code given in figure 3.4, and assuming the floating point unit has
id = 11 and a delay of 4 cycles, our PDG undergoes another transformation.
[Figure: PDG with scalar dependence, delay branching, and instruction delays; arc label (11, 4)]
3.4 Resource Conflict Removal Techniques
This section will explain optimization techniques that eliminate resource
conflicts within the pipeline. For each instruction the IN, OUT sets, DDG, and
an illustration of a 4-stage pipeline will be included. Dotted lines within each
pipeline will reference some data dependence that creates a resource conflict.
3.4.1 Code reordering
Definition 3.1 Code reordering exists if $\exists\, I_i$ such that $\neg(I_j\,\delta^*\,I_i) \land \neg(I_i\,\delta^*\,I_j)$ for every $I_j$ in the block.
If an instruction exists which has no dependencies within the basic block,
that instruction may be placed lexically anywhere within the basic block.
Changing the lexical sequence of a basic instruction block may break the
dependence and remove a resource conflict. Nodes in a DDG which have
no incoming or outgoing arcs are prime candidates for code reordering. The
absence of all arcs defines total independence of an instruction, and allows
the compiler the option of reordering the instruction stream within the basic
block.
Part A
   I1  MOV R4,R5
   I2  ADD R1,R7,R1
   I3  ADD R1,R4,R2
Part B
   I1  ADD R1,R7,R1
   I2  MOV R4,R5
   I3  ADD R1,R4,R2
[DDG and IF/ID/OD/EX pipeline diagrams for cycles t0 to t5]
Figure 3.5: Conflict removal by code reordering
In part A of figure 3.5, instruction 1 is void of any dependence whereas
instructions 2 and 3 form an adjacent dependence. Instruction 2 is attempting
a write of R1, while instruction 3 is attempting a read of R1. In some
pipelines, such as our example, an adjacent dependence may cause a resource
conflict.
In figure 3.5 part B, the elimination of the resource conflict was accomplished
by separating instructions 2 and 3 with instruction 1. Notice, this
optimization does not eliminate the original dependence but does eliminate
the resource conflict.
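The reordering of figure 3.5 can be sketched as follows. This is a simplified illustration rather than the full transformation process: it assumes that the dependence arcs are already known, that the block contains no branch instruction, and that conflicts arise only between adjacent dependent instructions. The function name `reorder` and the 0-based arc encoding are hypothetical.

```python
def reorder(block, arcs):
    """Separate adjacent dependent instructions with independent ones.

    block: list of instruction strings in lexical order.
    arcs:  set of (i, j) index pairs meaning block[j] depends on block[i].
    """
    involved = {k for arc in arcs for k in arc}
    # Instructions with no incoming or outgoing arcs may be placed anywhere.
    free = [k for k in range(len(block)) if k not in involved]
    order = []
    for k in range(len(block)):
        if k not in involved:
            continue
        # Adjacent dependence: put an independent instruction in between.
        if order and (order[-1], k) in arcs and free:
            order.append(free.pop(0))
        order.append(k)
    order.extend(free)      # leftover independent instructions go last
    return [block[k] for k in order]


part_a = ["MOV R4,R5", "ADD R1,R7,R1", "ADD R1,R4,R2"]
part_b = reorder(part_a, {(1, 2)})
```

Applied to part A of figure 3.5 the sketch produces the sequence of part B. The relative order of the dependent instructions is never changed, so the original dependence is preserved.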
3.4.2 Delay branching
Definition 3.1 A delay branch exists if $\exists\, I_i = \text{branch} \land \nexists\, I_j\, (I_j\,\delta\,I_i)$.
Previously we defined a basic block of instructions as a straight line
sequence of instructions within which the existence of a branch instruction
may appear only as the last instruction in the block. If this branch instruction
is void of any dependence within the basic block it can be used in resource
conflict elimination. Usually this type of branch is an unconditional branch
or a call statement.
A natural characteristic of a pipeline is the ability to prefetch one
instruction, while a previous instruction is being executed. When the executed
instruction is a successful branch or an unconditional one, the prefetched
instructions must be flushed from the pipeline. When flushing occurs in a
pipeline the flushed instructions add to the wasted memory access time, and
the computing time for a task increases. The most common method of dealing
with this problem is called delay branching.
Figure 3.6 is an example of delay branching in a pipeline which has the
ability to hold four instructions.
         Part A                  Part B
   I1    LOAD [R2+R4],R1         JMP 105
   I2    ADD R4,R5,R3            LOAD [R2+R4],R1
   I3    JMP 105                 ADD R4,R5,R3
   I4    STORE R3,[R5+R7]        STORE R3,[R5+R7]
   I5    SUB R5,R2,R9            SUB R5,R2,R9
   I6    MOV R2,R4               MOV R2,R4
Figure 3.6: Delay branching
Notice, in part A of figure 3.6, that the execution of I3, the JMP instruction,
will demand that the next instruction to execute is I105. At the same
time, instructions I4, I5 and I6 are being processed by the pipeline. Since
the next instruction to execute after the jump will be I105 the pipeline would
halt, I4, I5 and I6 would be flushed, and I105 would enter the pipeline. To
eliminate this problem delay branching is implemented. Notice in part B of
figure 3.6, the code is reordered and the JMP instruction becomes I1. When
I1, the JMP, is executed it will demand that the next instruction to enter
the pipeline is I105, and with delay branching that will be the case.
Delay branching can be utilized in the elimination of resource conflicts
by analyzing the DDG, and performing code reordering.
Part A
   I1  MOV R5,R9
   I2  ADD R9,R5,R3
   I3  MOV R1,R8
   I4  ADD R4,R5,R7
   I5  JMP 105
Part B
   I1  MOV R5,R9
   I2  JMP 105
   I3  ADD R9,R5,R3
   I4  MOV R1,R8
   I5  ADD R4,R5,R7
[DDG and IF/ID/OD/EX pipeline diagrams for cycles t0 to t7]
Figure 3.7: Conflict elimination by delay branching
3.4.3 NOPing
The NOPing optimization technique will be the last effort in the resource
conflict elimination. The insertion of a NOP instruction is a well known
technique implemented in RISC compilers [Tabak, 1987]. Having the
compiler insert a NOP instruction eliminates the need for a pipeline halt by
hardware. Although this adds an extra instruction to the stream, the need
for a hardware solution, which is more costly, is eliminated.
Again we will incorporate this technique in the elimination of resource
conflicts.
Before
   I1  ADD R9,R6,R5
   I2  ADD R5,R7,R1
   I3  ADD R1,R4,R2
After
   I1  ADD R9,R6,R5
   I2  NOP
   I3  ADD R5,R7,R1
   I4  NOP
   I5  ADD R1,R4,R2
[DDG and IF/ID/OD/EX pipeline diagrams for cycles t0 to t7]
Figure 3.8: Conflict elimination by NOPing
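The NOP insertion of figure 3.8 can be sketched in a few lines. The sketch assumes, as in the example pipeline, that a conflict arises only when an instruction reads a value written by the instruction immediately before it, and that the IN and OUT sets are supplied by the caller. The function name `insert_nops` is hypothetical.

```python
def insert_nops(block, in_sets, out_sets):
    """Insert a NOP between every adjacent pair with a flow dependence."""
    result = []
    for k, instr in enumerate(block):
        # The previous instruction writes something this instruction reads.
        if k > 0 and out_sets[k - 1] & in_sets[k]:
            result.append("NOP")
        result.append(instr)
    return result


fig_3_8 = ["ADD R9,R6,R5", "ADD R5,R7,R1", "ADD R1,R4,R2"]
fig_3_8_in = [{"R9", "R6"}, {"R5", "R7"}, {"R1", "R4"}]
fig_3_8_out = [{"R5"}, {"R1"}, {"R2"}]
padded = insert_nops(fig_3_8, fig_3_8_in, fig_3_8_out)
```

For the three chained ADD instructions of figure 3.8 this yields the five-instruction sequence shown in the figure.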
3.5 Transformation Process
The overall approach can be seen in the following steps:
1: Loop Transformations
For all loops do
begin
   • compute direction and distance vectors
   • create ISDG.
   if ( direction vectors = undefined ) then
      • do nothing;
   else
      if ( loops are tightly nested ) and
         ( loop limits of inner and outer loops are invariant ) and
         ( there is no dependence relation ) then
         • perform loop interchange
      if ( loop is a single loop or the most inner nested loop ) and
         ( cycles exist in ISDG ) and
         ( there is no dependence relation ) then
         • perform loop code motion;
end
2: Basic Block Transformation
For all basic blocks do
begin
   • Compute IN and OUT sets.
   • Create DDG.
   • Transform DDG into PDG by adding delay branching dependence and
     instruction timing dependence.
   • Pick the first node which has an outgoing data dependence arc
     but no incoming data dependence arc.
     Set WINDOW = first node and let the WINDOW
     size be N, where N is the number of active
     instructions allowed in the pipeline at any one time.
     if that node does not exist then
        • goto 3;
   • Give each node in the PDG a weight equal to the
     number of children it has.
   • Initially, let CANDIDATE-SET represent the instructions that have no
     incoming arcs; do not include the delay branch instruction.
   while WINDOW ≠ empty do
   begin
      while WINDOW has dependence do
      begin
         if ( CANDIDATE-SET = empty )
            • Insert NOP before the lower node of
              the dependence.
         else
            • Remove a node from the CANDIDATE-SET
              not in this dependence with the largest weight.
              Add any of its successors to the CANDIDATE-SET
              which now have no incoming arcs.
            • Insert that node before the lower
              node of the dependence.
         • Update WINDOW with the new
           sequence of N instructions.
      end
      • Mark top node of WINDOW USED and if it
        is a member of the CANDIDATE-SET remove it.
      • Move WINDOW down one level in PDG.
   end
end
3: Instruction Timing Transformation
For possible instruction timing dependence do
begin
   • Update PDG with instruction dependence.
   while ( instruction dependence arc > 0 ) do
   begin
      if ( CANDIDATE-SET = empty )
         • Insert NOP before the lower node of
           the dependence
           and decrement instruction dependence arc by 1.
      else
         • Remove a node from the CANDIDATE-SET
           which now has no incoming arcs or outgoing arcs.
         • Insert that node before the lower
           node of the dependence
           and decrement instruction dependence arc by 1.
   end
end
4: Delay Branch Transformation
   • Find the (N + 1)th instruction from the bottom of the PDG, where N
     is the number of stages in the pipeline. Label that node D_B.
   if ( D_B = NOP ) then
      • replace the D_B with the delay branch instruction.
   else
      • Insert delay branch instruction after D_B.
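Step 4 can be sketched as a small list manipulation. The sketch rests on one interpretation that is an assumption here: the count from the bottom includes the branch itself, which is still the last instruction of the block. This reading reproduces part B of figure 3.7 for a 4-stage pipeline. The function name `place_delay_branch` is hypothetical.

```python
def place_delay_branch(block, n_stages):
    """Move the branch at the end of `block` into its delayed position.

    Assumes block[-1] is the delay branch and that the (N + 1)th
    instruction from the bottom is counted with the branch included.
    """
    idx = len(block) - (n_stages + 1)     # position of D_B, 0-based
    if idx < 0:
        raise ValueError("block is shorter than N + 1 instructions")
    branch = block[-1]
    body = block[:-1]
    if body[idx] == "NOP":
        body[idx] = branch                # replace D_B with the branch
    else:
        body.insert(idx + 1, branch)      # insert the branch after D_B
    return body


fig_3_7_a = ["MOV R5,R9", "ADD R9,R5,R3", "MOV R1,R8", "ADD R4,R5,R7", "JMP 105"]
fig_3_7_b = place_delay_branch(fig_3_7_a, 4)
with_nop = place_delay_branch(["A", "NOP", "B", "C", "D", "JMP 105"], 4)
```

For figure 3.7 the branch lands directly after the first MOV, and in the second call it takes the place of the NOP.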
3.6 An Example of the Transformation Process
The segment of code found in figure 3.9 will help us illustrate a code
transformation sequence during optimization. The reordering is targeted at
pipeline $P_{4,4}$ where all hazards occur in adjacent slots of the pipeline, that is,
between $I_1$ and $I_2$, then $I_2$ and $I_3$, ..., $I_{n-1}$ and $I_n$. Also, in this example the
only instruction with a timing delay will be the FDIV, where the floating point
unit has id = 11 and a stall of 4 cycles. Stalls will appear as blank lines in the
pipeline.
3.6.1 Transformation Example
   I1  FDIV R3,R14,R1     IN(1) = {R3,R14}               OUT(1) = {R1}
   I2  ADD R1,R2,R12      IN(2) = {R1,R2}                OUT(2) = {R12}
   I3  FDIV R9,R5,R4      IN(3) = {R9,R5}                OUT(3) = {R4}
   I4  ADD R9,R4,R6       IN(4) = {R9,R4}                OUT(4) = {R6}
   I5  ADD R9,R2,R0       IN(5) = {R9,R2}                OUT(5) = {R0}
   I6  LOAD [R9+R0],R7    IN(6) = {R9,R0,mem[R9+R0]}     OUT(6) = {R7}
   I7  STORE R10,[R11]    IN(7) = {R10,R11}              OUT(7) = {mem[R11]}
   I8  JMP 137
[PDG with arc label (11, 4) and IF/ID/OD/EX pipeline diagram for cycles t0 to t13]
Figure 3.9: Example code before transformation.
[PDG and IF/ID/OD/EX pipeline diagram for cycles t0 to t12, with stall]
Figure 3.10: Example code after basic block transformation.
[PDG and IF/ID/OD/EX pipeline diagram for cycles t0 to t11]
Figure 3.11: Example code after instruction timing transformation.
[PDG and IF/ID/OD/EX pipeline diagram for cycles t0 to t11]
Figure 3.12: Example code after branch delay transformation.
The rescheduled code shown in figure 3.12 is now ready for that specific
pipeline. All possible conflicts and stalls have been removed.
3.7 Order of Optimizations ... ?
At this time the order in which optimizations should be performed is an
open problem. This section has been added to show that some well-known
optimization techniques lend themselves to our technique. This section will
explain techniques that eliminate resource conflicts within the pipeline. For
each instruction the IN, OUT sets, DDG, and an illustration of a 4-stage
pipeline will be included. Dotted lines within each pipeline will reference
some data dependence that creates a resource conflict.
3.7.1 Folding
Definition 3.1 Folding exists if $\exists\, I_i$: var := con $\land\ \exists\, I_j \ni (I_i\,\delta\,I_j)$.
Thus, if an instruction assigns a constant to a variable, and there exists a
flow dependence from that variable to a future instruction, a substitution of
that constant value for any future undefined use of that variable is allowed.
To illustrate, if an instruction has the form of $R_i := R_j$, we substitute the
value of $R_j$ for any future undefined uses of $R_i$.
In part A of figure 3.13, the definition for folding holds since
$I_1$: R3 := var $\land\ (I_1\,\delta\,I_2)$. When folding is applied, as shown in part B of
figure 3.13, the dependence is broken and parallelism is created between $I_1$
and $I_2$. The transformation substituted the value of R3 from instruction 1
for R1 in instruction 2.
3.7.2 Dead code elimination
Definition 3.1 Dead code exists if
$\exists\, I_j$ such that $(\exists\, I_i \ni (I_i\,\delta\,I_j) \lor (I_j = I_1)) \land \nexists\, I_k \ni (I_j\,\delta\,I_k)$.
Thus, if an instruction is dependent on a previous instruction or it is the
first instruction of the basic block, and there is no flow to a future instruction
before it is redefined, it is labeled useless and can be eliminated.
As before, when an instruction has the form of $R_i := R_j$, we substitute
the value of $R_j$ for any future undefined uses of $R_i$. The instruction $R_i := R_j$
is then identified as dead code and is removed from the instruction stream.
In part B of figure 3.13, the definition for dead code elimination holds
since $(I_j = I_1) \land \nexists\, I_k \ni (I_1\,\delta\,I_k)$.
Notice that this transformation does not affect the pipeline, although it
will improve the overall performance of the computation and is traditionally
implemented after a folding optimization [Sethi and Ullman, 1986].
Part A
   I1  MOV R3,R1
   I2  ADD R1,R4,R2
   I3  MOV R4,R5
Part B
   I1  MOV R3,R1
   I2  ADD R3,R4,R2
   I3  MOV R4,R5
Part C
   I1  ADD R3,R4,R2
   I2  MOV R4,R5
[DDG and IF/ID/OD/EX pipeline diagrams for cycles t0 to t5]
Figure 3.13: Conflict elimination by folding and dead code removal
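The step from part A to part C of figure 3.13 can be sketched as a single pass over the block. This is a simplified illustration and not the full optimization: instructions are encoded as hypothetical (opcode, sources, destination) tuples, only register copies made by MOV are folded, and a folded MOV is removed only when nothing later in the block still reads its destination. The sketch also assumes that the destination of a removed MOV is not needed after the block.

```python
def fold_copies(block):
    """Fold MOV copies into later uses and drop the MOVs made useless."""
    block = [(op, list(srcs), dst) for op, srcs, dst in block]
    keep = []
    for k, (op, srcs, dst) in enumerate(block):
        folded = False
        if op == "MOV":
            src = srcs[0]
            for _, later_srcs, later_dst in block[k + 1:]:
                if dst in later_srcs:
                    # Substitute the copied register for the copy.
                    later_srcs[:] = [src if s == dst else s for s in later_srcs]
                    folded = True
                if later_dst in (src, dst):
                    break           # src or dst is redefined: stop folding
        still_read = any(dst in s for _, s, _ in block[k + 1:])
        # Drop the instruction only if it was folded and is no longer read.
        if not (folded and not still_read):
            keep.append(k)
    return [(block[k][0], tuple(block[k][1]), block[k][2]) for k in keep]


part_a = [
    ("MOV", ("R3",), "R1"),
    ("ADD", ("R1", "R4"), "R2"),
    ("MOV", ("R4",), "R5"),
]
part_c = fold_copies(part_a)
```

The first MOV is folded into the ADD and then removed, while the second MOV has no later use to fold into and is kept, which matches part C.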
Chapter 4
Conclusions and Further
Research
4.1 Conclusion
This work has presented a software approach to code reorganization which
maximizes the parallel performance of an instruction pipeline. We accomplished
this by focusing the approach on the elimination of specific arcs in
the dependence graph where an arc can represent a type of pipeline hazard
known as the resource conflict.
Summarizing the approach, the compiler computes direction and distance
vectors for each loop structure. It then creates an ISDG and examines the
loop for possible optimizations which break loop iteration dependence. After
loop dependence is broken the IN() and OUT() sets for each basic statement
block are created. By analyzing the IN() and OUT() sets a PDG is created.
When inspecting the dependence graph the compiler is able to detect the
dependencies that cause resource conflicts. Optimizations are then carried
out on the instruction block resulting in a revised version of the original code
void of resource conflicts. A complete example with transformations within
an instruction block is shown with the corresponding PDG and IN() and OUT()
sets.
The optimizations include: loop interchange, loop statement motion,
folding, dead code elimination, code reordering, NOPing, and delay branching.
4.2 Further Research
• Evaluate the general effectiveness of this approach by simulation.
• Evaluate the performance by implementing this approach on different
pipelined systems.
• Investigate the interactions between the optimizations to determine
whether the order in which the optimizations are applied affects the
overall performance of the pipeline.
B ib liography
[Ackerman, 1981] Ackerman, W. B. (1981). D ata flow languages. Tutorial 
on Parallel Processing, pages 335-343.
[Allen and Cocke, 1972] Allen, F. and Cocke, J. (1972). A Catalogue o f Op­
timizing Transformations. Prentice-Hall, Englewood Cliffs, N.J.
[Allen, 1986] Allen, F. E. (1986). Compiling for parallelism. Proceedings o f 
the 1986 IB M  Europe Institute Seminar on Parallel Computing.
[Allen, 1983] Allen, J. (1983). Dependence Analysis for Subscripted Vari­
ables and I t ’s Application to Program Transformation. PhD thesis, Rice 
University (UMI 83-14916).
[Allen and Kennedy, 1982] Allen, J. and Kennedy, K. (1982). Pfc: A pro­
gram to convert fort.ran to parallel form. Technical report, Rice University. 
Technical Report MASC-TR82-6.
[Almasi and Gottlieb, 1989] Almasi, G. S. and Gottlieb, A. (1989). Highly 
Parallel Computing. The Benjamin/Cumm ings Publishing Company, Inc, 
Redwood City, California 94065.
63
[Aray, 1985] Aray, S. (1985). An optimal instruction scheduling model for a 
class of vector processors. IEEE Transactions on Computers, 34(11).
[Banerjee, 1976] Banerjee, U. (1976). Data Dependence in Ordinary Pro­
grams. PhD thesis, University of Illinois, Urbana-Champaign.
[Banerjee, 1979] Banerjee, U. (1979). Speedup o f Ordinary Programs. PhD 
thesis, University of Illinois, Urbana-Champaign.
[Beatty, 1972] Beatty, J. (1972). An axiomatic approach to code optim iza­
tion for expressions. ACM , 19(4):613-640.
[Beatty, 1974] Beatty, J. (1974). Register assignment algorithm for genera­
tion of highly optimized object code. IB M  J. res. Dev., 18(1):20—39.
[Berstien, 1988] Berstien, D. (1988). An improved approximation algorithm 
for scheduling pipelined machines. Proceedings o f the International Con­
ference on Parallel Processing.
[Broy, 1986] Broy, M. (1986). Control Flow and Data Flow : Concepts o f 
Distributed Programming. Springer-Verlag New York.
[Charles and Ferrante, 1987] Charles, F. M. B. P. and Ferrante, J. (1987). 
An overview of the ptran analysis system for multiprocessing. Lecture 
Notes in Computer Science, 297:195-211.
64
[Chow and Rudmik, 1982] Chow, A. and Rudmik, A. (1982). The design of 
a data  flow analyzer. AC M  Proceedings o f the SIG P LA N  82 Symposium  
on Compiler Construction, pages 106-113.
[Chow, 1989] Chow, P. (1989). The M IP S -X  R ISC  Microprocessor. Kluwer 
Academic Publishers.
[Cooper, 1983] Cooper, K. (1983). Interprocedural Data Flow Analysis in a 
Programming Environment. PhD thesis, Rice University.
[Cooper and Kennedy, 1984] Cooper, K. D. and Kennedy, K. (1984). Ef­
ficient com putation of flow insensitive interprocedural summary informa­
tion. A C M  Proceedings o f the SIG P LA N  84 Symposium on Compiler Con­
struction , 19(6):247-258.
[Cytron, 1986] Cytron, R. (1986). On the application of dependence analysis 
and restructuring techniques to parallel and functional languages. Proceed­
ings o f the 1986 IB M  Europe Institute Sem inar on Parallel Computing.
[Cytron and Ferrante, 1987] Cytron, R. and Ferrante, J. (1987). What's in
a name? The value of renaming for parallelism detection and storage
allocation. Proceedings of the 1987 International Conference on Parallel
Processing, pages 19-27.
[D. Gellernter and Padua, 1990] D. Gellernter, A. N. and Padua, D. (1990).
Languages and Compilers for Parallel Computing. MIT Press, Cambridge,
Massachusetts.
[D. Kuck, 1980] D. Kuck (1980). High-Speed Machines and Their Compilers,
Proceedings of the CREST Parallel Processing Systems Course, Cambridge.
Cambridge University Press.
[Dasgupta, 1989a] Dasgupta, S. (1989a). Computer Architecture volume 1. 
John Wiley and Sons, New York.
[Dasgupta, 1989b] Dasgupta, S. (1989b). Computer Architecture volume 2. 
John Wiley and Sons, New York.
[D.Bernstien and Gertner, 1989a] D.Bernstien and Gertner, I. (1989a).
Scheduling expressions on a pipelined processor with a maximal delay of
one cycle. ACM Transactions on Programming Languages and Systems, 11(1).
[D.Bernstien and Gertner, 1989b] D.Bernstien, M. R. and Gertner, I.
(1989b). Approximation algorithms for scheduling arithmetic expressions
on pipelined machines. Journal of Algorithms, 10(1).
[de Bakker et al., 1986] de Bakker, J., Kok, J., Meyer, J., Olderog, E., and
Zucker, J. (1986). Contrasting themes in the semantics of imperative
concurrency. Lecture Notes in Computer Science, 224:51-121.
[D’Hollander and Opsommer, 1987] D’Hollander, E. H. and Opsommer, J.
(1987). Implementation of an automatic program partitioner on a
homogeneous multiprocessor. Proceedings of the 1987 International
Conference on Parallel Processing, pages 517-519.
[Dinning and Schonberg, 1990] Dinning, A. and Schonberg, E. (1990). An
empirical comparison of monitoring algorithms for access anomaly
detection. Second ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPOPP), 25(3):1-10.
[Ellis, 1985] Ellis, J. (1985). Bulldog: A Compiler for VLIW Architectures.
MIT Press.
[Fisher, 1987] Fisher, J. (1987). Wide Instruction Word Architectures:
Solving the Supercomputer Software Problem. Elsevier Science Publishers B.V.,
52 Vanderbilt Ave., New York, N.Y. 10017, U.S.A.
[Fisher et al., 1984] Fisher, J. A., Ellis, J. R., Ruttenberg, J. C., and
Nicolau, A. (1984). Parallel processing: A smart compiler and a dumb machine.
ACM Proceedings of the SIGPLAN 84 Symposium on Compiler Construction,
Montreal, Canada, 19(6):37-47.
[Gibbons and Muchnick, 1986] Gibbons, P. and Muchnick, S. (1986).
Efficient instruction scheduling for pipelined architectures. ACM Proceedings
of the SIGPLAN 86 Symposium on Compiler Construction, Palo Alto.
[Griffin, 1954] Griffin, H. (1954). Elementary Theory of Numbers.
McGraw-Hill.
[Gurd and Bohm, 1986] Gurd, J. and Bohm, W. (1986). Implicit parallel
processing: SISAL on the Manchester dataflow computer. Proceedings of the
1986 IBM Europe Institute Seminar on Parallel Computing, pages 179-204.
[Hecht, 1977] Hecht, M. S. (1977). Flow Analysis of Computer Programs.
Prentice-Hall, Inc.
[Hendren and Nicolau, 1990] Hendren, I. and Nicolau, A. (1990).
Parallelizing programs with recursive data structures. Parallel and
Distributed Systems, 1(1):35-47.
[Hennessy and Patterson, 1990] Hennessy, J. and Patterson, D. (1990).
Computer Architecture: A Quantitative Approach. Morgan Kaufmann
Publishers, Inc., San Mateo, California.
[J.Banning, 1978] J.Banning (1978). A Method for Determining the Side
Effects of Procedure Calls. PhD thesis, Stanford University.
[Jeane Ferrante and Warren, 1987] Jeane Ferrante, K. O. and Warren, J.
(1987). The program dependence graph and its use in optimization. ACM
Transactions on Programming Languages and Systems, 9(3):319-349.
[Jesshope, 1986] Jesshope, C. (1986). Building and binding systems with
transputers. Proceedings of the 1986 IBM Europe Institute Seminar on
Parallel Computing.
[J.Hennesey and Gross, 1983] J.Hennesey and Gross, T. (1983). Postpass
code optimization of pipeline constraints. ACM Transactions on
Programming Languages and Systems, 5(3):422-448.
[Kennedy, 1984] Kennedy, J. A. K. (1984). Automatic loop interchange.
ACM Proceedings of the SIGPLAN 84 Symposium on Compiler
Construction, 19(6):247-258.
[Kennedy and Subhlok, 1988] Kennedy, V. B. D. B. D. C. K. and Subhlok, J.
(1988). Ptool: A system for static analysis of parallel programs. Technical 
report, Rice University. Technical Report COMP-TR88-71.
[Kirch, 1974] Kirch, A. (1974). Elementary Number Theory. Intext.
[K.Kennedy, 1975] K.Kennedy (1975). Use-definition chains with applications.
Technical report, Rice University. Technical Report 476-093-9.
[Krishnamurthy, 1989] Krishnamurthy, E. (1989). Parallel Processing:
Principles and Practice. Addison-Wesley.
[Kuck et al., 1981] Kuck, D., Budnik, P., Chen, S., Lawrie, D., Towle, R.,
and Strebent, R. (1981). Parallelism in ordinary Fortran programs. Tutorial
on Parallel Processing, pages 346-362.
[Lowry and Medlock, 1969] Lowry, E. and Medlock, C. (1969). Object code
optimization. ACM, 12(1):13-22.
[Mickunas and Schell, 1978] Mickunas, M. and Schell, M. (1978). Parallel
compilation in a multiprocessor environment. ACM Proceedings 1978
Annual Conference, pages 241-246.
[Moor, 1982] Moor, I. (1982). An applicative compiler for a parallel
machine. ACM Proceedings of the SIGPLAN 82 Symposium on Compiler
Construction, Boston, Massachusetts, 17(6):284-293.
[Muchnick and Jones, 1981] Muchnick, S. and Jones, N. (1981). Program
Flow Analysis: Theory and Applications. Elsevier Scientific Publishing
Company.
[N.D.Jones and S.Muchnick, 1982] N.D.Jones and S.Muchnick (1982). A
flexible approach to interprocedural data flow analysis and programs with
recursive data structures. 9th ACM Symposium on the Principles of
Programming Languages, pages 66-74.
[Padua and Wolfe, 1986] Padua, D. and Wolfe, M. (1986). Advanced
compiler optimizations for supercomputers. Communications of the ACM,
29(12):1184-1201.
[Pfeiffer and Reps, 1989] Pfeiffer, S. H. P. and Reps, T. (1989). Dependence
analysis for pointer variables. Proc. SIGPLAN 89 Conference on
Programming Language Design and Implementation, pages 28-40.
[Polychronopoulos, 1987a] Polychronopoulos, C. (1987a). On advanced
compiler optimizations for parallel computers. Proceedings of the
International Conference on Supercomputing.
[Polychronopoulos, 1987b] Polychronopoulos, C. D. (1987b). Automatic
restructuring of Fortran programs for parallel execution. Lecture Notes in
Computer Science, 295:107-130.
[Polychronopoulos, 1987c] Polychronopoulos, C. D. (1987c). Loop
coalescing: A compiler transformation for parallel machines. Proceedings of
the 1987 International Conference on Parallel Processing, pages 235-242.
[Remi Triolet, 1986] Remi Triolet, Francois, P. F. (1986). Direct
parallelization of call statements. Proceedings of the SIGPLAN 86 Symposium
on Compiler Construction, Palo Alto, 21(7):176-185.
[Sarkar and Hennessy, 1986] Sarkar, V. and Hennessy, J. (1986).
Compile-time partitioning and scheduling of parallel programs. ACM
Proceedings of the SIGPLAN 86 Symposium on Compiler Construction,
Palo Alto, pages 17-26.
[Schneck, 1975] Schneck, P. (1975). Movement of implicit parallel and vector
expressions out of program loops. SIGPLAN Notices, 10(3):103-106.
[Sethi et al., 1970] Sethi, Ravi, and Ullman, J. (1970). The generation of
optimal code for arithmetic expressions. ACM, 17(4):715-728.
[Sethi and Ullman, 1986] Sethi, A. A. R. and Ullman, J. (1986).
Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading,
Massachusetts.
[Sharp, 1985] Sharp, J. A. (1985). Data Flow Computing. Ellis Horwood 
Limited.
[Tabak, 1987] Tabak, D. (1987). RISC Architecture. Research Studies Press
LTD., England.
[Thomasset and Eisenbeis, 1986] Thomasset, A. L. F. and Eisenbeis, C.
(1986). Automatic detection of parallelism in scientific programs with
application to array-processors. Proceedings of the 1986 IBM Europe
Institute Seminar on Parallel Computing.
[Tiemann, 1989] Tiemann, M. (1989). Pfc: The GNU instruction scheduler.
Technical report, Stanford University. CS343 course report.
[Wolfe, 1982] Wolfe, M. (1982). Optimizing Supercompilers for
Supercomputers. PhD thesis, University of Illinois, Urbana-Champaign.
[Yew and Zhu, 1990] Yew, Z. L. P.-C. and Zhu, C.-Q. (1990). An efficient
data dependence analysis for parallelizing compilers. Parallel and
Distributed Systems, 1(1):26-34.
