Loop pipelining with resource and timing constraints by Sánchez Carracedo, Fermín
UNIVERSITAT POLITÈCNICA DE CATALUNYA 
 
 
 
 
 
 
 
 
 
LOOP PIPELINING WITH 
RESOURCE AND TIMING 
CONSTRAINTS 
 
 
 
 
 
 
 
 
 
 
 
Author: Fermín Sánchez
 
October, 1995 
ACKNOWLEDGMENTS
I would like to thank the members of the Department of Computer Architecture for their support
throughout the development of this work. In particular, I would like to thank Jordi Cortadella for
his guidance and support throughout my graduate career. His enthusiasm for my early steps in
the field gave me the confidence to pursue my own ideas. Besides being my advisor, he has been
my best collaborator all these years. I would also like to give special thanks to Rosa M. Badia
for her suggestions and comments, which have contributed to the improvement of this work. My
gratitude goes also to the rest of the CAD-VLSI group. They have also contributed by their
constant encouragement for me to finish this work.
I would like to thank my colleagues in the DAC, especially Anna del Corral, Josep Llosa, Angel
Toribio, Mildred Sarmiento, Enric Pastor, Agustín Fernández and Montse Peiron. They have made
my years in the University much more pleasant. From among all of them, my deepest gratitude
goes to Josep Llosa for the great quantity of discussions that we have maintained in recent years,
which have doubtlessly contributed to the enrichment of this work.
I thank Marc Noi for his help with the Farey's series, and Tricia for being my English advisor all
these years. I also thank Tomás Lang, David Padua and Mateo Valero for giving me part of their
valuable time, listening to my ideas and giving me their suggestions. I am equally grateful to Q.
Ning, R. Govindarajan, Eric R. Altman and Guang G. Gao for supplying me with the data dependence
graphs used for comparisons in superscalar and VLIW processors.
Finally, I am greatly indebted to Ivette, who has always been understanding about my work. I
would like especially to thank my brother David and my parents Herminio and Francisca. Their
love and support have given me the courage to finish this work. I am privileged to belong to such
a wonderful family. This work is dedicated to them.
To my family, the best in the world
CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ALGORITHMS
PREFACE
1 INTRODUCTION
1.1 Motivation of this work
1.2 High-level synthesis and parallel architectures
1.2.1 High-level synthesis
1.2.2 Superscalar processors
1.2.3 VLIW processors
1.3 Internal representation of loops
1.3.1 Program dependences
1.3.2 Data dependence graph
1.4 Coarse-grained parallelization
1.5 Fine-grained parallelization
1.6 Representation of algorithms in this work
1.7 Summary
2 SOFTWARE PIPELINING
2.1 Introduction
2.2 State of the art
2.2.1 Notation and classification
2.2.2 Approaches which do not calculate MII
2.2.3 Approaches which estimate the MII
2.2.4 Approaches which analytically calculate MII
2.2.5 Linear programming approaches
2.2.6 Comparisons among the approaches
2.3 Techniques proposed in this work
2.4 Summary
3 BASIC DEFINITIONS AND LOOP TRANSFORMATIONS
3.1 Introduction
3.2 Representation of a loop
3.3 Representation of the architecture
3.3.1 Representation of resources
3.3.2 Representation of instructions
3.3.3 Example of representation of instructions
3.3.4 Example of architecture
3.4 Bounds on loop execution
3.4.1 Resource-constrained MII
3.4.2 Recurrence-constrained MII
3.4.3 Minimum initiation interval and throughput
3.5 Dependence retiming
3.5.1 Dependence retiming transformation
3.6 Loop unrolling
3.6.1 Loop unrolling transformation
3.6.2 Π-graphs with integer MII
3.7 Summary and conclusions
4 ANALYSIS OF DATA DEPENDENCES
4.1 Introduction
4.2 Schedule of a π-graph
4.3 Scheduling dependences
4.4 Positive depth and height
4.4.1 Positive path
4.4.2 Maximal positive path, positive depth and height
4.4.3 Example of computing positive depth
4.5 ASAP and ALAP time
4.6 Negative depth
4.6.1 Negative restrictive dependences
4.6.2 Assigning negative depth to nodes
4.6.3 Example
4.7 Summary and conclusions
5 SCHEDULING A Π-GRAPH
5.1 Introduction
5.2 Scheduling graph
5.3 Overlapped schedule
5.4 List scheduling overview
5.5 Scheduling priority functions
5.5.1 The 0-mobility of a node
5.5.2 The positive depth of a node
5.5.3 The negative depth of a node
5.5.4 The number of successors (not yet scheduled) of a node in the scheduling graph
5.5.5 The use of resources performed by an instruction
5.5.6 Complexity of selecting a node for scheduling
5.6 Scheduling algorithm
5.7 Summary and conclusions
6 UNRET: LOOP PIPELINING WITH RESOURCE CONSTRAINTS
6.1 Introduction
6.2 Exploring the solution space
6.2.1 Throughput diagram
6.2.2 Farey's series
6.2.3 Exploring Farey's series in decreasing order of magnitude
6.2.4 Reducing the solution space
6.2.5 Figures of merit
6.3 Retiming dependences
6.3.1 Range for retiming dependences
6.3.2 Retiming dependences not belonging to recurrences
6.3.3 Retiming dependences belonging to recurrences
6.4 Finding a schedule with maximum throughput
6.4.1 Quality of a π-graph
6.4.2 Finding a schedule in II cycles
6.4.3 General algorithm
6.5 Examples
6.5.1 Example 1
6.5.2 Example 2
6.5.3 Example 3
6.6 Experimental results
6.6.1 High-level synthesis
6.6.2 Superscalar and VLIW processors
6.7 Summary and conclusions
7 RESIS: REGISTER OPTIMIZATION
7.1 Introduction
7.1.1 Strategy overview
7.2 Previous work
7.3 Lower bounds on register pressure and RESIS strategy
7.3.1 Variable lifetime
7.3.2 Registers required for a dependence
7.3.3 Register pressure
7.3.4 Lower bounds on registers
7.4 SPAN reduction
7.4.1 Introduction
7.4.2 Heuristics to select a node to reduce the SPAN
7.4.3 Reduce index transformation
7.4.4 Reducing the number of scheduling dependences
7.4.5 Reducing local maxima
7.4.6 Scheduling
7.4.7 SPAN Reduction. Final algorithm
7.5 Incremental scheduling
7.5.1 Overview
7.5.2 Selecting an instruction to move
7.5.3 Moving an instruction
7.5.4 Re-scheduling
7.5.5 Swapping
7.5.6 Computational complexity of incremental scheduling
7.6 Experimental Results
7.6.1 High-level synthesis
7.6.2 Superscalar and VLIW processors
7.7 Summary and conclusions
8 TCLP: LOOP PIPELINING WITH TIMING CONSTRAINTS
8.1 Introduction
8.1.1 Strategy overview
8.2 TCLP Approach
8.2.1 Minimum initiation interval
8.2.2 Absolute lower bound on the set of resources
8.2.3 Increasing the number of resources
8.2.4 Reducing the set of resources
8.2.5 Increasing throughput
8.2.6 Reducing register pressure
8.2.7 TCLP. Execution time
8.3 Example
8.4 Experimental Results
8.5 Summary and conclusions
9 CONCLUSIONS AND FUTURE WORK
9.1 Contributions
9.1.1 Software pipelining: retiming and scheduling are separated into independent tasks
9.1.2 Analysis of data dependences and scheduling
9.1.3 Exploration of the solution space
9.1.4 Register reduction
9.1.5 Time-constrained loop pipelining
9.2 Future research
9.2.1 Decreasing the execution time
9.2.2 Span reduction and incremental scheduling at a time
9.2.3 Integer Linear Programming
9.2.4 Extension towards conditional sentences, while-like loops and multiple-nested loops
A BENCHMARK LOOPS
A.1 High-level synthesis
A.1.1 Cytron example
A.1.2 Differential equation
A.1.3 16-Point Digital FIR Filter
A.1.4 Fifth-Order Elliptic Filter
A.1.5 Fast Discrete Cosine Transform Kernel
A.2 Superscalar and VLIW processors
B EXPERIMENTAL RESULTS FOR UNRET
B.1 High-level synthesis
B.1.1 Cytron example
B.1.2 Differential equation
B.1.3 16-Point Digital FIR Filter
B.1.4 Fifth-Order Elliptic Filter
B.1.5 Fast Discrete Cosine Transform Kernel
B.2 Superscalar and VLIW processors
C EXPERIMENTAL RESULTS FOR RESIS
C.1 High-level synthesis
C.2 Superscalar and VLIW processors
REFERENCES
LIST OF FIGURES
Chapter 1
1.1 High-level synthesis system
1.2 Execution of instructions in different processors
1.3 Execution of instructions in a VLIW processor
1.4 Source code for the Livermore Fortran Kernel 3
1.5 Inner product compiled into a pseudo-assembly language
1.6 DDG of the inner product
1.7 Model for doacross scheduling
Chapter 2
2.1 Software pipelining a loop
2.2 Example of DG and schedules for 4 adders
Chapter 3
3.1 Representation of a loop by means of a π-graph
3.2 Equivalent π-graphs
3.3 Description of an architecture
3.4 Description of Cydra 5 Computer
3.5 Execution of compiled inner product
3.6 Recurrence in a loop
3.7 Equivalent π-graphs and their schedules
3.8 Unrolling a π-graph
3.9 Scheduling a π-graph and a multiple-instanced π-graph
Chapter 4
4.1 Types of dependences in a schedule
4.2 Types of scheduling dependences according to δ(u, v)
4.3 Time frame for scheduling of a PSD and an NSD
4.4 Length of a positive path
4.5 Positive Depth of the nodes in a π-graph
4.6 NSD that does not constrain the scheduling process
4.7 Negative recurrence in a π-graph
4.8 π-graph with negative recurrences chained
4.9 Compute of negative depth
Chapter 5
5.1 Reservation table example
5.2 List scheduling when the priority function is the negative depth
5.3 List scheduling by using dynamic negative depth
5.4 List scheduling by using the number of successors
5.5 List scheduling by using resource utilization
5.6 Scheduling algorithm
Chapter 6
6.1 Different schedules of a loop
6.2 General overview of UNRET
6.3 Representing throughput in a diagram
6.4 Solution space for UNRET
6.5 Triangles delimited by MaxII = 9 and MaxII = 15
6.6 Representing Farey's series F5 in a diagram
6.7 First element of Farey's series to be considered
6.8 Reducing solution space
6.9 Comparing number of points in FC and FMaxII
6.10 MPP(π) < II does not guarantee a schedule exists
6.11 Effect of increase-distance in a π-graph
6.12 Dependence retiming performed in a recurrence
6.13 Quality and scheduling in equivalent π-graphs
6.14 Flow diagram of UNRET
6.15 Schedule for the inner product in 1 cycle
6.16 Overlapped execution of inner product
6.17 Throughput diagram for example 1
6.18 Unrolled π-graph for example 2
6.19 Schedules for example 1 and 2
6.20 Example 3. Π-graph and schedule
6.21 Example 3. Points to explore
Chapter 7
7.1 Flow diagram of RESIS
7.2 Register assignment in a superscalar architecture
7.3 Variable lifetime for different architectures
7.4 Overlapping of variable lifetimes
7.5 Register requirements for a dependence
7.6 Register assignment and lower bound
7.7 Lower bounds on registers
7.8 Example of SPAN reduction
7.9 Flow diagram of SPAN reduction
7.10 Example of incremental scheduling
7.11 Flow diagram of incremental scheduling.
Chapter 8
8.1 Flow Diagram of TCLP
8.2 Resource responsible for not finding the schedule
8.3 Time-frame for scheduling
8.4 Exploration of the throughput diagram
8.5 Throughput exploration for FDCT
Appendix A
A.1 Cytron's example and Differential Equation
A.2 Algorithmic description of the differential equation
A.3 16-Point Digital FIR Filter
A.4 Fifth-Order Elliptic Filter
A.5 Fast Discrete Cosine Transform Kernel
A.6 Some examples of DDGs
Appendix B
B.1 Schedule for the differential equation
Appendix C
C.1 Comparing loop schedules for Spec Spice 10 benchmark
LIST OF TABLES
Chapter 2
2.1 Comparison among different software pipelining approaches
Chapter 6
6.1 Cytron's example
6.2 Differential Equation
6.3 Fast Discrete Cosine Transform
6.4 Comparison for an architecture with 3 FP adders, 2 FP multipliers, 1 FP divisor
and 2 load/store units
Chapter 7
7.1 Register reduction in a modulo scheduling algorithm for the Cytron example
7.2 Register reduction in a modulo scheduling algorithm for the 16-Point Digital FIR
Filter
7.3 Register reduction in a modulo scheduling algorithm for the Fast Discrete Cosine
Transform
7.4 Incremental scheduling after modulo scheduling for the Cytron example
7.5 Incremental scheduling after modulo scheduling for the 16-Point Digital FIR
Filter
7.6 Incremental scheduling after modulo scheduling for the Fast Discrete Cosine
Transform
7.7 Register reduction in a modulo scheduling algorithm by assuming a VLIW pro-
cessor with 3 FP adders, 2 FP multipliers, 1 FP divisor and 2 load/store units
Chapter 8
8.1 Cytron's example
8.2 Differential Equation
8.3 Fifth-Order Elliptic Filter with Non-Pipelined Multipliers
8.4 Fifth-Order Elliptic Filter with Pipelined Multipliers
8.5 Fast Discrete Cosine Transform
Appendix A
A.1 Benchmark loops
Appendix B
B.1 Cytron's example
B.2 Differential Equation
B.3 16-Point Digital FIR Filter
B.4 Fifth-Order Elliptic Filter with Non-Pipelined Multipliers
B.5 Fifth-Order Elliptic Filter with Pipelined Multipliers
B.6 Fast Discrete Cosine Transform
B.7 Results obtained by other approaches for superscalar processors by using an
architecture with 1 FU of each type
B.8 Results obtained by UNRET for superscalar processors by using an architecture
with 1 FU of each type. MaxII = 15 for all cases except for (*), in which
MaxII = 50.
B.9 Comparison for an architecture with 3 FP adders, 2 FP multipliers, 1 FP divisor
and 2 load/store units
Appendix C
C.1 Lower bounds for the Cytron's example
C.2 Lower bounds for the Differential Equation
C.3 Lower bounds for the 16-Point Digital FIR Filter
C.4 Lower bounds for the Fifth-Order Elliptic Filter with Non-Pipelined Multipliers
C.5 Lower bounds for the Fifth-Order Elliptic Filter with Pipelined Multipliers
C.6 Lower bounds for the Fast Discrete Cosine Transform
C.7 Register requirements for the Cytron's example
C.8 Register requirements for the differential Equation
C.9 Register requirements for the 16-Point Digital FIR Filter
C.10 Register requirements for the Fifth-Order Elliptic Filter with Non-Pipelined Multipliers
C.11 Register requirements for the Fifth-Order Elliptic Filter with Pipelined Multipliers
C.12 Register requirements for the Fast Discrete Cosine Transform
C.13 Register reduction in a modulo scheduling algorithm for the Cytron example
C.14 Register reduction in a modulo scheduling algorithm for the differential equation
C.15 Register reduction in a modulo scheduling algorithm for the 16-Point Digital FIR
Filter
C.16 Register reduction in a modulo scheduling algorithm for the Fifth-Order Elliptic
Filter with Non-Pipelined Multipliers
C.17 Register reduction in a modulo scheduling algorithm for the Fifth-Order Elliptic
Filter with Pipelined Multipliers
C.18 Register reduction in a modulo scheduling algorithm for the Fast Discrete Cosine
Transform
C.19 Incremental scheduling after modulo scheduling for the Cytron example
C.20 Incremental scheduling after modulo scheduling for the differential equation
C.21 Incremental scheduling after modulo scheduling for the 16-Point Digital FIR
Filter
C.22 Incremental scheduling after modulo scheduling for the Fifth-Order Elliptic Filter
with Non-Pipelined Multipliers
C.23 Incremental scheduling after modulo scheduling for the Fifth-Order Elliptic Filter
with Pipelined Multipliers
C.24 Incremental scheduling after modulo scheduling for the Fast Discrete Cosine
Transform
C.25 Register requirements in UNRET for superscalar processors by using an archi-
tecture with 1 FU of each type
C.26 Register requirements in UNRET for VLIW processors by using an architecture
with 1 FU of each type
C.27 Comparison of register requirements for superscalar processors by using 1 FU of
each type
C.28 Comparison of register requirements for VLIW processors by using 1 FU of each
type
C.29 Register requirements in UNRET for superscalar processors by using an archi-
tecture with 3 FP adders, 2 FP multipliers, 1 FP divisor and 2 load/store units
C.30 Register requirements in UNRET for VLIW processors by using an architecture
with 3 FP adders, 2 FP multipliers, 1 FP divisor and 2 load/store units
C.31 Comparison of register requirements for superscalar processors by using 3 FP
adders, 2 FP multipliers, 1 FP divisor and 2 load/store units
C.32 Comparison of register requirements for VLIW processors by using 3 FP adders,
2 FP multipliers, 1 FP divisor and 2 load/store units
C.33 Register reduction in a modulo scheduling algorithm by assuming a superscalar
processor with 1 FU of each type
C.34 Register reduction in a modulo scheduling algorithm by assuming a VLIW pro-
cessor with 1 FU of each type
C.35 Register reduction in a modulo scheduling algorithm by assuming a superscalar
processor with 3 FP adders, 2 FP multipliers, 1 FP divisor and 2 load/store units
C.36 Register reduction in a modulo scheduling algorithm by assuming a VLIW pro-
cessor with 3 FP adders, 2 FP multipliers, 1 FP divisor and 2 load/store units
C.37 Incremental scheduling in a modulo scheduling algorithm by assuming a super-
scalar processor with 1 FU of each type
C.38 Incremental scheduling in a modulo scheduling algorithm by assuming a VLIW
processor with 1 FU of each type
C.39 Incremental scheduling in a modulo scheduling algorithm by assuming a super-
scalar processor with 3 FP adders, 2 FP multipliers, 1 FP divisor and 2 load/store
units
C.40 Incremental scheduling in a modulo scheduling algorithm by assuming a VLIW
processor with 3 FP adders, 2 FP multipliers, 1 FP divisor and 2 load/store units
LIST OF ALGORITHMS
Chapter 3
3.1 Algorithm to unroll a 7r-graph m times
Chapter 5
5.1 List Scheduling Algorithm
Chapter 6
6.1 Algorithm to compute a Farey fraction by using the next one
6.2 Retiming dependences not belonging to recurrences
6.3 Algorithm to find a schedule in a given number of cycles
6.4 Optimized retiming_and_scheduling algorithm
6.5 UNRET Algorithm
Chapter 7
7.1 Function reduce_scheduling_dependences
7.2 Function reduce_local_maxima
7.3 Function reduce_span
7.4 Function incremental_scheduling
Chapter 8
8.1 Algorithm to increase the architecture
8.2 Algorithm to reduce area cost
8.3 Algorithm to find the maximum-throughput schedule
PREFACE
This work presents three algorithms to solve three different problems:
• UNRET is proposed to solve loop pipelining with resource constraints.
• TCLP is proposed to solve loop pipelining with timing constraints.
• RESIS is proposed to reduce the number of registers required by a schedule.
Loop pipelining with resource constraints can be defined as follows: "given a set of resources,
finding a pipelined schedule of a loop in the minimum number of cycles". Loop pipelining with
timing constraints can be defined as follows: "given a maximum time to execute an iteration of
a loop, finding a schedule which requires the minimum set of resources (or the minimum area)".
Whilst loop pipelining with timing constraints is a typical problem in the high-level synthesis of
VLSI circuits, loop pipelining with resource constraints is present in both the high-level synthesis
of VLSI circuits and compilers for parallel architectures.
In parallel architectures, the number of registers available to store partial results (during loop
execution) is limited, and it is defined by the architecture. In a VLSI circuit, a register consumes
space in the chip. Therefore, it is interesting in both areas to obtain a schedule which requires as
few registers as possible.
UNRET and TCLP are related to the extraction of the parallelism of a loop. Chapter 1 introduces
different ways to exploit such parallelism, as well as the subjects on which this work
is focused: high-level synthesis of VLSI circuits and compilation techniques for superscalar and
VLIW processors.
UNRET and TCLP belong to a family of techniques known as software pipelining. An overview
and classification of such techniques is presented in Chapter 2.
In Chapter 3 we define two transformations to exploit the parallelism in a loop: dependence
retiming and loop unrolling. Both transformations will be used by UNRET and TCLP. The
maximum parallelism available for exploitation in a loop is limited. Chapter 3 also shows how this
maximum parallelism can be calculated.
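As a rough illustration of how such a limit can be computed, the sketch below takes the larger of a resource-imposed and a recurrence-imposed number of cycles per iteration (the table of contents calls these the resource-constrained and recurrence-constrained MII). The function names, the input encoding and the use of exact fractions are assumptions made for this example; they are not the definitions given in Chapter 3.

```python
from fractions import Fraction

def resource_bound(uses, units):
    """Cycles per iteration forced by the busiest resource type.

    uses[r]  = busy cycles that one iteration needs on resource type r
    units[r] = number of available units of type r
    """
    return max(Fraction(uses[r], units[r]) for r in uses)

def recurrence_bound(recurrences):
    """Cycles per iteration forced by the slowest dependence cycle.

    Each recurrence is (total latency, total iteration distance).
    """
    return max((Fraction(lat, dist) for lat, dist in recurrences),
               default=Fraction(0))

def minimum_initiation_interval(uses, units, recurrences):
    """Larger of the two bounds, kept as an exact (possibly fractional) value."""
    return max(resource_bound(uses, units), recurrence_bound(recurrences))

# Hypothetical loop: 3 additions on 2 adders, 2 multiplications on 1 multiplier,
# and one recurrence with latency 5 that spans 2 iterations.
mii = minimum_initiation_interval(
    uses={"add": 3, "mul": 2},
    units={"add": 2, "mul": 1},
    recurrences=[(5, 2)],
)
max_throughput = 1 / mii   # iterations per cycle
```

In this hypothetical case the recurrence dominates: the bound is 5/2 cycles per iteration, so at most 2 iterations can complete every 5 cycles. A fractional bound of this kind is one reason why unrolling may be needed to reach the maximum parallelism.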
A data dependence exists between two instructions when the result produced by the first one is
consumed by the second one. Data dependences impose a partial execution order on the instructions
of a loop. Depending on whether a data dependence does not influence the scheduling, influences
it within an iteration, or influences it across consecutive iterations, we classify data
dependences into three categories: free scheduling dependences, positive scheduling dependences
and negative scheduling dependences. Chapter 4 presents the theory behind this classification.
Chapter 5 describes the scheduling algorithm used by UNRET and TCLP. The algorithm takes resources
into account, as well as multiple-cycle (possibly pipelined) functional units and instructions
that have complex execution patterns (they use several functional units during several cycles).
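One common way to handle such execution patterns in a pipelined schedule is a reservation table indexed modulo the initiation interval. The sketch below is a minimal version of that idea, not the algorithm of Chapter 5: the pattern encoding as (resource, cycle offset) pairs and the names try_place and slot_demand are assumptions introduced for illustration.

```python
from collections import Counter

def slot_demand(pattern, start, ii):
    """Units of each resource that the pattern needs in each slot modulo ii."""
    return Counter(((start + offset) % ii, res) for res, offset in pattern)

def try_place(free, pattern, start, ii):
    """Reserve the pattern at cycle `start` if every slot has enough free units.

    free[slot][res] holds the units of `res` still free in cycle `slot` mod ii.
    Returns True and updates `free` on success, False (no change) otherwise.
    """
    demand = slot_demand(pattern, start, ii)
    if any(free[slot][res] < n for (slot, res), n in demand.items()):
        return False
    for (slot, res), n in demand.items():
        free[slot][res] -= n
    return True

ii = 2
units = {"alu": 1, "mul": 1}
free = [dict(units) for _ in range(ii)]     # one row per cycle modulo II

# Hypothetical non-pipelined multiply: holds the multiplier for two cycles,
# then needs the ALU in the third cycle.
mul_pattern = [("mul", 0), ("mul", 1), ("alu", 2)]
add_pattern = [("alu", 0)]

placed_mul = try_place(free, mul_pattern, 0, ii)
placed_add_0 = try_place(free, add_pattern, 0, ii)   # ALU slot 0 already taken
placed_add_1 = try_place(free, add_pattern, 1, ii)
```

Because cycle 2 of the multiply wraps around to slot 0, the addition cannot start at cycle 0 and is accepted at cycle 1 instead.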
Chapter 6 presents UNRET. UNRET uses dependence retiming and loop unrolling to find a
pipelined schedule of the loop. Loop unrolling is in general required to extract the maximum
parallelism. Dependence retiming enables us to obtain different (but equivalent) configurations
for the same (possibly unrolled) loop. Each configuration is scheduled by using the algorithm
presented in Chapter 5, attempting to find a schedule which executes the loop with the maximum
parallelism. When no schedule exists for any configuration of the loop, UNRET decides a new
target parallelism (and unrolling degree) and explores new configurations.
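The table of contents indicates that Chapter 6 explores Farey's series in decreasing order of magnitude. The sketch below shows one way such an ordered list of target parallelisms could be produced, under the assumption (ours, for this example) that a target is written as a fraction K/II, meaning K iterations every II cycles, with II bounded by a chosen maximum. The generator uses the standard next-term recurrence for Farey sequences; the names farey_descending and candidate_throughputs are hypothetical.

```python
from fractions import Fraction

def farey_descending(n):
    """Yield the Farey series of order n from 1/1 down to 0/1."""
    a, b, c, d = 1, 1, n - 1, n          # current term a/b, next term c/d
    yield Fraction(a, b)
    while a > 0:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
        yield Fraction(a, b)

def candidate_throughputs(max_ii, mii):
    """Throughputs K/II with II <= max_ii that do not exceed 1/mii, best first."""
    limit = 1 / Fraction(mii)
    return [t for t in farey_descending(max_ii) if 0 < t <= limit]

# Hypothetical loop whose minimum initiation interval is 5/2 cycles.
candidates = candidate_throughputs(max_ii=5, mii=Fraction(5, 2))
```

Each candidate would then be tried in turn: the numerator gives the number of iterations to schedule together and the denominator the number of cycles allowed for them.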
Once a schedule has been found, the number of required registers is reduced while maintaining
the parallelism. In Chapter 7 we propose RESIS, an algorithm designed for this purpose. RESIS
works in two phases. First, several configurations of the loop are explored, attempting to reduce
the number of different iterations involved in the pipelined schedule. Each configuration is independently
scheduled by using the algorithm from Chapter 5. Following this, some instructions are
individually rescheduled in order to reduce the register requirements.
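To make the notion of register requirements concrete, the sketch below counts how many values are alive at the same time in the steady state of a pipelined schedule. It is only an illustration: the lifetime encoding (definition cycle, end cycle exclusive) and the name max_live are assumptions, and the measures actually used by RESIS are defined in Chapter 7.

```python
def max_live(lifetimes, ii):
    """Largest number of values alive in any cycle of the steady state.

    Each lifetime is (definition cycle, end cycle), with the end exclusive.
    A new iteration starts every `ii` cycles, so a lifetime longer than `ii`
    overlaps with copies of itself from neighbouring iterations.
    """
    live = [0] * ii
    for start, end in lifetimes:
        for cycle in range(start, end):
            live[cycle % ii] += 1
    return max(live)

# Hypothetical schedule with II = 2 and three values produced per iteration.
pressure = max_live([(0, 3), (1, 2), (2, 5)], ii=2)
```

Shortening a lifetime, for instance by moving a consumer closer to its producer, lowers the counts in the affected slots, which is the kind of effect individual rescheduling aims for.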
Chapter 8 presents TCLP, an algorithm for loop pipelining with timing constraints. TCLP is
based on ideas similar to those of UNRET. The timing constraint is given in the form of a maximum
number of cycles to execute each iteration of the loop. TCLP analytically calculates a minimum
set of resources (theoretically) required to execute the loop with the given timing constraint.
Several configurations of the loop are explored in order to find a schedule by using the calculated
set of resources. If no schedule is found, the set of resources is successively increased and new
configurations are explored until a schedule is found. Once a schedule fulfilling the given timing
constraint has been found, TCLP attempts to optimize several characteristics of the schedule.
First of all, TCLP attempts to reduce the set of resources while maintaining the length of the
schedule. Following this, it attempts to increase the parallelism of the schedule by exploring
different unrolling degrees. Finally, RESIS is used to reduce the number of required registers.
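The first two steps of this strategy can be sketched as follows. The lower bound shown here is a simple counting argument (busy cycles divided by available cycles, rounded up) and the growth loop relies on a caller-supplied scheduler that reports which resource blocked it; both are assumptions made for illustration and may differ from the formulation in Chapter 8. The names minimum_units, grow_until_scheduled and toy_scheduler are hypothetical.

```python
def minimum_units(uses, ii):
    """Fewest units of each type that can fit `uses` busy cycles into ii cycles."""
    return {res: -(-n // ii) for res, n in uses.items()}   # integer ceiling

def grow_until_scheduled(uses, ii, try_schedule, max_rounds=16):
    """Start from the lower bound and add units until try_schedule succeeds.

    try_schedule(units, ii) must return None on success, or the name of the
    resource type that prevented a schedule from being found.
    """
    units = minimum_units(uses, ii)
    for _ in range(max_rounds):
        blocker = try_schedule(units, ii)
        if blocker is None:
            return units
        units[blocker] += 1
    return None                                            # gave up

def toy_scheduler(units, ii):
    """Stand-in scheduler: pretends dependences require three multipliers."""
    return None if units["mul"] >= 3 else "mul"

lower_bound = minimum_units({"add": 6, "mul": 8}, ii=4)
result = grow_until_scheduled({"add": 6, "mul": 8}, ii=4,
                              try_schedule=toy_scheduler)
```

With the stand-in scheduler, the search starts at 2 adders and 2 multipliers and stops after adding one multiplier. A real scheduler would be the resource-constrained algorithm of Chapter 5 applied to several loop configurations.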
Chapter 9 presents the conclusions of this work, summarizes its main contributions
and indicates areas of future work.
