A mathematical formulation of the loop pipelining problem
Jordi Cortadella, Rosa M. Badia and Fermín Sánchez
Department of Computer Architecture, Universitat Politècnica de Catalunya
08071 Barcelona, Spain
e-mail: {jordic,rosab,fermin}@ac.upc.es
Abstract
This paper presents a mathematical model for the loop
pipelining problem that considers several parameters
for optimization and supports any combination of re-
source and timing constraints.
The unrolling degree of the loop is one of the vari-
ables explored by the model. By using Farey’s series,
an optimal exploration of the unrolling degree is per-
formed and optimal solutions not considered by other
methods are obtained.
The problem of finding an optimal schedule that minimizes
resource and register requirements is solved by using
an integer linear programming (ILP) model. A novel
paradigm called branch and prune is proposed to
efficiently converge towards the optimal schedule and
prune the search tree for integer solutions, thus
drastically reducing the running time.
This is the first formulation that combines the un-
rolling degree of the loop with timing and resource
constraints in a mathematical model that guarantees
optimal solutions.
1 Introduction
Loops monopolize most execution time in programs.
In many applications a few loops, if not only one, de-
termine the throughput achievable by the implemen-
tation of a behavioral description. For example, DSP
filters often consist of an infinite loop that repeatedly
executes for every sample of the input stream.
In architectural synthesis, the problem of optimiz-
ing loop execution under timing and area constraints is
crucial to obtain high quality architectures. The tech-
niques that address this problem attempt to overlap
the execution of different loop iterations to reduce the
cycle count (initiation interval or II) per iteration. Dif-
ferent methods have been proposed with such a goal:
loop folding [1], functional pipelining [2], loop wind-
ing [3] and rotation scheduling [4] among others. The
area of fixed-rate DSP has also drawn the attention of
other authors to propose techniques for loop pipelining
with timing constraints [5, 6].
Loops are usually represented by means of a data
dependence graph (DG). Figure 1 shows an example.
Vertices represent operations. Unlabeled edges represent
intra-loop dependences (ILDs), e.g. $B_i$ (reads X[i])
depends on $A_i$ (produces X[i]). Labeled edges
represent loop-carried dependences (LCDs), e.g. $B_{i+1}$
(reads Y[i+1]) depends on $B_i$ (produces Y[i+1]).
Labels on edges indicate the number of iterations
traversed by the dependence. Thus, ILDs can also be
represented as 0-labeled edges.
If successive iterations are not allowed to overlap,
the total execution time of the loop is
3I cycles (assuming 1-cycle operations). Within each
iteration, the execution of $A_i$, $B_i$ and $C_i$ must be
sequential due to ILDs. An overlapped execution (loop
pipelining) takes I+2 cycles to complete, as shown in
Figure 2. The problem of loop pipelining basically
reduces to finding a schedule (a folded loop body) that
executes at the maximum rate allowed by the dependences.
In the schedule, instructions from different
iterations (folds) are executed ($A_{i+f}$ denotes that the
execution of $A_i$ belongs to fold f).

for i = 0 to I - 1 do
  A_i: X[i] = R[i] + S[i];
  B_i: Y[i+1] = Y[i] + 2 * X[i];
  C_i: Z[i] = 4 * Y[i+1] + T[i];
endfor

[Dependence graph: A_i -> B_i -> C_i, plus an edge labeled 1 from B_i to itself]
Figure 1: Loop and data dependence graph
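The cycle counts quoted above can be reproduced with a minimal Python sketch. It assumes 1-cycle operations, one new iteration issued per cycle and no resource limit; the function name and its parameters are illustrative and not part of the paper.

```python
def overlapped_execution(num_iterations, ops=("A", "B", "C")):
    """Return the cycle-by-cycle rows of a fully overlapped execution.

    Assumes 1-cycle operations, a new iteration started every cycle
    and unlimited resources (the situation of Figure 2(a)).
    """
    depth = len(ops)
    total_cycles = num_iterations + depth - 1
    rows = []
    for cycle in range(total_cycles):
        row = []
        for k, name in enumerate(ops):
            iteration = cycle - k  # operation k of iteration i runs at cycle i + k
            if 0 <= iteration < num_iterations:
                row.append(f"{name}{iteration}")
        # List the oldest iteration first, e.g. C0 B1 A2
        rows.append(list(reversed(row)))
    return rows


if __name__ == "__main__":
    for row in overlapped_execution(4):
        print(" ".join(row))
```

For I iterations the sketch produces I + 2 rows, against the 3I cycles of the non-overlapped execution.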
In general, unrolling the loop is crucial to obtain
optimal solutions. If two adders are available for the
schedule in Figure 2(b), the loop requires two cycles
(II = 2) to be executed, as shown in Figure 3(a). However,
if the loop is unrolled twice (Figure 3(b)), every
iteration executes in 1.5 cycles on average (II = 3/2).
Another important aspect to be considered is the
span (number of folds required to obtain the schedule).
As further commented in Section 4.5, smaller spans
result in shorter variable lifetimes, reducing in general
the schedule’s register pressure.
The loop optimization problem addressed in this
paper comprises a large variety of formulations with
different timing and resource constraints. The two
extreme cases are described next:
• Resource-constrained loop pipelining (RCLP).
Given a set of resource constraints, find a schedule
that minimizes the execution time.
• Time-constrained loop pipelining (TCLP). Given
an upper bound on the execution time, find a schedule
that minimizes the cost of the
resources required to execute the loop.
There is a wide range of problems between RCLP
and TCLP, e.g. finding a time-constrained schedule
with constraints on a subset of resources.
(a)
  A0
  B0 A1
  C0 B1 A2
  C1 B2 A3
  C2 B3
  C3
(b)
  Prologue:  A0
             B0 A1
  Schedule:  C_i B_{i+1} A_{i+2}
  Epilogue:  C_{I-2} B_{I-1}
             C_{I-1}
Figure 2: (a) Overlapped loop execution. (b) Schedule

(a)
  A0
  B0 A1
  C_i B_{i+1}
  A_{i+2}
  C_{I-2} B_{I-1}
  C_{I-1}
(b)
  A0
  B_i A_{i+1}
  C_i B_{i+1}        (i = i + 2)
  A_{i+2} C_{i+1}
  B_{I-1}
  C_{I-1}
Figure 3: Schedule with resource constraints: (a) without
unrolling (b) by unrolling the loop twice.
This paper presents UNRET (unrolling and retim-
ing), a formal approach to solve RCLP. TCLP is ad-
dressed in [7]. Since the delay decision (minimization)
problem of loop pipelining with resource constraints is
NP-hard [8], several heuristics have been proposed to
solve it in moderate computation time. Several au-
thors have used linear programming to obtain optimal
or quasi-optimal solutions for the problems of schedul-
ing and allocation in architectural synthesis [9, 10].
The closest approaches related to the work presented
here have been proposed in [11, 12].
The main contributions of UNRET in relation to
existing formal methods are the following:
• UNRET performs an exhaustive analysis of the
unrolling degrees of the loop that can derive optimal
solutions for the available resources. Unlike other
methods that perform loop unrolling [13, 10], this
paper presents a new approach that guarantees an
optimal unrolling degree.
• Similarly to [11, 12], the number of folds for the
schedule is automatically obtained by solving an ILP
model for loop pipelining. The number of registers
required to execute the loop is reduced by minimizing
the maximum number of live variables at any cycle.
• A new approach, called branch and prune, is
proposed to solve the ILP model. The heuristics
devised to explore the space of solutions allow
a rapid convergence to the optimal solution. ILP
models with more than 1000 variables and 700
constraints have been solved in a reasonable running
time.
The paper is organized as follows. Section 2
presents some preliminary definitions. Section 3 pro-
poses a loop unrolling strategy to find time-optimal
schedules for the RCLP problem. The ILP model to
find an optimal schedule is presented in Section 4. The
branch and prune strategy to efficiently solve the ILP
model is described in Section 5. Experimental results
and conclusions are presented in Sections 6 and 7.
2 Basic definitions
For the sake of simplicity, we will first assume that
all operations can be executed in any of the functional
units (FUs) of the architecture in one cycle. Exten-
sions to multiple-cycle, pipelined functional units and
several types of resources can be found in [14].
2.1 Representation of a loop
A loop is represented by a labeled dependence graph,
DG(V, E). Vertices and edges represent operations
and data dependences respectively. Labels of the DG
are defined by two mappings, $\lambda$ (fold) and $\delta$
(dependence distance), in the following way:
• $\lambda(u)$, defined on vertices, denotes the fold to which
u belongs in the schedule ($\lambda(u) \geq 0$). $\lambda(u) = i$ will
be denoted by $u_i$ in the DG.
• $\delta(u, v)$ denotes the dependence distance (number
of iterations traversed by the dependence) between
operations u and v. An ILD between u and v is
represented by $u_i \xrightarrow{0} v_i$ or simply $u_i \rightarrow v_i$.
Similarly, an LCD between u and v with distance d is
represented as $u_i \xrightarrow{d} v_i$.

(a) DG: $A_i \rightarrow B_i$. Schedule, one row per cycle:
  A_i
  B_i
  A_{i+1}
  B_{i+1}
  ...
(b) DG: $A_{i+1} \xrightarrow{1} B_i$. Schedule, one row per cycle:
  A_{i+1} B_i
  A_{i+2} B_{i+1}
  ...
Figure 4: (a) Schedule of a loop with one ILD. (b)
Schedule of the same loop with an equivalent labeling
function and one LCD.

Initially, a loop is represented by a DG with only one
fold, i.e. $\forall u \in V: \lambda(u) = 0$. After finding a schedule,
each operation is labeled with a fold representing the
relative execution skew (in iterations) with regard to
the other operations of the loop.
Equivalent labeling functions can be obtained by
simple transformations. Dependence $A_{i+1} \xrightarrow{1} B_i$ (or
in general $A_{i+d} \xrightarrow{d} B_i$) is equivalent to $A_i \rightarrow B_i$.
This transformation can be used to pipeline the loop,
as shown in Figure 4. Note that only ILDs constrain
the scheduling process, and therefore the DG in Figure
4(b) is more parallel than the one shown in Figure
4(a). ILDs can be transformed into LCDs by changing
the fold assignment, so that shorter schedules for the loop
body can be found. This transformation is analogous
to the retiming technique proposed to minimize the
clock period in synchronous systems [15].
Definition 1 (Equivalent labeling functions). Let
$(\lambda, \delta)$ and $(\lambda', \delta')$ be two labeling functions for DG =
(V, E). They are equivalent if, $\forall (u, v) \in E$:
$$\lambda(v) - \lambda(u) + \delta(u, v) = \lambda'(v) - \lambda'(u) + \delta'(u, v) \qquad (1)$$
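As a quick check of Definition 1 on the transformation of Figure 4, write $\lambda$ for the fold and $\delta$ for the dependence distance (the symbol names are a notational assumption of this worked example). Both labelings of the edge (A, B) give the same value for the expression in equation (1):

```latex
\begin{align*}
\text{Figure 4(a), } A_i \rightarrow B_i: \quad
  & \lambda(B) - \lambda(A) + \delta(A, B) = 0 - 0 + 0 = 0 \\
\text{Figure 4(b), } A_{i+1} \xrightarrow{1} B_i: \quad
  & \lambda'(B) - \lambda'(A) + \delta'(A, B) = 0 - 1 + 1 = 0 \\
\text{In general, } A_{i+d} \xrightarrow{d} B_i: \quad
  & 0 - d + d = 0
\end{align*}
```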
2.2 Initiation Interval
As proposed by other authors [16], UNRET first calculates
a lower bound on the II of the loop: the minimum
initiation interval (MII). Two lower bounds
must be taken into account:
• The minimum initiation interval imposed by resource
constraints (ResMII). If each iteration of the loop
requires using an FU during C cycles, and the architecture
has N FUs of such a type, then $II \geq \lceil C/N \rceil$.
Therefore, the FU type with the maximum such ratio
determines a lower bound on II.
• The minimum initiation interval imposed by the
recurrences of the loop (RecMII), where a recurrence R
is a set of edges that form a cycle. Let us consider
a recurrence R. A feasible schedule must fulfill
$II \geq \lceil ET/D \rceil$, where ET is the sum of the execution
times of the operations in R and D is the sum of the
distances of its dependences [16]. The recurrence
with the maximum such ratio determines another
lower bound on II.

Let us define $\delta_R$ as the sum of the distances of the
dependences in a recurrence R. In the example of
Figure 5(a), there are three recurrences with the same
value for RecMII. Let us take $A_0 \rightarrow B_0 \rightarrow E_0 \xrightarrow{3} A_0$,
with $\delta_R = 3$ and $|R| = 3$. This indicates that
$A_3$ must be executed at least 3 cycles after $A_0$, thus
resulting in $\mathrm{RecMII}_R = 1$. The other two recurrences
are isomorphic to this one.

Figure 5: (a) DG example with vertices A, B, C, D and E (all in fold 0)
and an edge labeled 3. (b) ($II_K$, K) pairs for 4 FUs; the plot has axes
$II_K$ and K, marks $II_{max} = 8$ and the line II = MII, and labels the
points A, B, C and D.
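The two bounds can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the input encoding are hypothetical, and the ratios are kept as exact fractions instead of applying the ceiling, because Section 3 treats MII as a rational number once unrolling is allowed.

```python
from fractions import Fraction


def res_mii(cycles_per_type, units_per_type):
    """Resource bound: maximum over FU types of C / N, as an exact fraction."""
    return max(Fraction(cycles_per_type[t], units_per_type[t])
               for t in cycles_per_type)


def rec_mii(recurrences):
    """Recurrence bound: maximum over recurrences of ET / D.

    Each recurrence is a pair (ET, D): the total execution time of its
    operations and the total distance of its dependences.
    """
    return max((Fraction(et, d) for et, d in recurrences),
               default=Fraction(0))


def mii(cycles_per_type, units_per_type, recurrences):
    """MII is the larger of the resource and recurrence bounds."""
    return max(res_mii(cycles_per_type, units_per_type),
               rec_mii(recurrences))


if __name__ == "__main__":
    # Five 1-cycle operations on 4 FUs, three recurrences with ET = D = 3
    print(mii({"fu": 5}, {"fu": 4}, [(3, 3)] * 3))
```

With five 1-cycle operations, 4 FUs and three recurrences with ET = D = 3, the sketch gives RecMII = 1 and MII = ResMII = 5/4.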
3 Loop unrolling for RCLP
The length of a schedule is an integer number of cycles,
whereas the MII of a loop may be a rational number.
Let $II_K$ be the initiation interval of a schedule
comprising K instances of the loop body ($II = II_K / K$). The
goal of UNRET is to find a schedule that minimizes II.
This is done by exploring pairs ($II_K$, K) in increasing
order of II, starting from MII. In order to bound the
search space for ($II_K$, K), a maximum value for $II_K$ is
defined: the maximum length of the schedule ($II_{max}$).
Figure 5(a) depicts the DG of a 5-instruction loop.
If 4 resources are used for execution, then RecMII =
1 and MII = ResMII = 5/4. The diagram in Figure
5(b) represents the pairs ($II_K$, K) that can be explored
to find a feasible schedule for the loop. These pairs
correspond to the points with integer values for $II_K$ and
K enclosed in the triangle limited by the lines K = 0,
$II_K = II_{max}$ and $II_K / K$ = MII. Distinct points with the
same II lie on the same line. Among all points lying
on the same line, those with smaller values for $II_K$ (or
K) are preferred (they produce shorter schedules).
In the example, the first point to be explored is
C = (5, 4), meaning that 4 iterations must be executed
in 5 cycles (II = MII = 1.25). However, no feasible
schedule with such characteristics exists. In [16], $II_K$
is incremented by 1 when no schedule is found and,
thus, the point B = (6, 4) is explored next. This results
in a suboptimal solution. A time-optimal solution (for
$II_{max}$ = 8) is found if the point A = (4, 3) is explored
after C = (5, 4). But is there an efficient strategy to
explore all points in increasing order of II?
3.1 Farey’s series
For a fixed integer D > 0, the sequence of all reduced
fractions between 0 and 1 with positive denominator $\leq D$,
arranged in increasing order of magnitude, is defined
as the Farey's series of order D ($F_D$).
Let $x_i / y_i$ be the i-th element of the series. $F_D$ can be
generated by using the following equations [17]:
• The first two elements are $x_0 / y_0 = 0/1$ and $x_1 / y_1 = 1/D$.
• The generic term $x_i / y_i$ can be calculated as:
$$x_i = \left\lfloor \frac{y_{i-2} + D}{y_{i-1}} \right\rfloor x_{i-1} - x_{i-2}, \qquad y_i = \left\lfloor \frac{y_{i-2} + D}{y_{i-1}} \right\rfloor y_{i-1} - y_{i-2}$$
For a more detailed explanation of how Farey's series
are explored see [18].
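The recurrence above translates directly into a short generator. The following Python sketch is an illustration only (the function name is hypothetical); it starts from 0/1 and 1/D and stops when the term 1/1 is reached.

```python
from fractions import Fraction


def farey(order):
    """Yield the Farey series of the given order in increasing order."""
    if order < 1:
        raise ValueError("order must be a positive integer")
    x_prev, y_prev = 0, 1      # x_0 / y_0
    x_cur, y_cur = 1, order    # x_1 / y_1
    yield Fraction(x_prev, y_prev)
    yield Fraction(x_cur, y_cur)
    while (x_cur, y_cur) != (1, 1):
        # k = floor((y_{i-2} + D) / y_{i-1})
        k = (y_prev + order) // y_cur
        x_prev, y_prev, x_cur, y_cur = (x_cur, y_cur,
                                        k * x_cur - x_prev,
                                        k * y_cur - y_prev)
        yield Fraction(x_cur, y_cur)


if __name__ == "__main__":
    print(", ".join(str(f) for f in farey(5)))
```

Each new term needs only the two previous ones, so the series can be walked lazily without storing or sorting all fractions.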
3.2 Loop unrolling
Every pair ($II_K$, K) obtained from Farey's series
denotes a different unrolling degree for the loop (K) and
a target initiation interval ($II_K$). Unrolling a DG K
times generates another DG in which each vertex v
is instantiated K times ($v_0, v_1, \ldots, v_{K-1}$). Besides,
data dependence distances must be changed according
to the unrolling degree. A dependence $u \xrightarrow{d} v$
in the original DG is represented by K dependences
$$u_i \xrightarrow{\lfloor (i+d)/K \rfloor} v_{(i+d) \bmod K}, \qquad i = 0, \ldots, K-1$$
in the unrolled DG. A complete
description can be found in [18].
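The edge transformation can be sketched as follows. The graph encoding (a list of vertex names and a list of (u, v, d) triples) and the function name are assumptions made for this illustration.

```python
def unroll(vertices, edges, k):
    """Unroll a dependence graph k times.

    vertices: iterable of operation names.
    edges: iterable of (u, v, d), where d is the dependence distance.
    Returns (new_vertices, new_edges); instance i of u is named (u, i).
    """
    new_vertices = [(v, i) for v in vertices for i in range(k)]
    new_edges = []
    for u, v, d in edges:
        for i in range(k):
            # u_i --floor((i+d)/k)--> v_{(i+d) mod k}
            new_edges.append(((u, i), (v, (i + d) % k), (i + d) // k))
    return new_vertices, new_edges


if __name__ == "__main__":
    # Loop of Figure 1: A -> B -> C, plus B -> B with distance 1
    vs, es = unroll(["A", "B", "C"],
                    [("A", "B", 0), ("B", "C", 0), ("B", "B", 1)], 2)
    for edge in es:
        print(edge)
```

For the loop of Figure 1 unrolled twice, the self-dependence of B with distance 1 becomes an edge from B_0 to B_1 with distance 0 and an edge from B_1 to B_0 with distance 1.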
3.3 UNRET algorithm for RCLP
The strategy followed by UNRET is the following:
1. Calculate MII.
2. Find the first unrolling degree (K) and expected
initiation interval ($II_K$) that minimize II such that
$II \geq MII$.
3. Unroll the loop K times.
4. Find a schedule with length $II_K$ (ILP approach
explained in Section 4).
5. If no schedule with $II_K$ cycles is found, generate new
values for K and $II_K$ that minimize II (II is explored
in increasing order by using Farey's series) and go
to step 3.
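The steps above can be sketched as a small driver loop. This is not the authors' implementation: the candidate pairs are produced by brute-force enumeration and sorting, standing in for the Farey-based exploration, and `find_schedule` is a hypothetical callback that represents unrolling plus the ILP solver of Section 4 (it returns a schedule or None).

```python
from fractions import Fraction


def candidate_pairs(mii, ii_max):
    """Return reduced pairs (II_K, K) with II_K / K >= mii and II_K <= ii_max,
    sorted by increasing II = II_K / K."""
    mii = Fraction(mii)
    k_max = int(ii_max / mii)  # K <= II_K / MII <= ii_max / MII
    pairs = set()
    for ii_k in range(1, ii_max + 1):
        for k in range(1, k_max + 1):
            ii = Fraction(ii_k, k)
            if ii >= mii:
                # Fraction reduces, so (6, 4) and (3, 2) collapse to (3, 2)
                pairs.add((ii.numerator, ii.denominator))
    return sorted(pairs, key=lambda p: Fraction(p[0], p[1]))


def unret(mii, ii_max, find_schedule):
    """Try candidate pairs in increasing II until find_schedule succeeds."""
    for ii_k, k in candidate_pairs(mii, ii_max):
        schedule = find_schedule(ii_k, k)
        if schedule is not None:
            return ii_k, k, schedule
    return None


if __name__ == "__main__":
    print(candidate_pairs(Fraction(5, 4), 8)[:4])
```

For MII = 5/4 and $II_{max}$ = 8 the first candidates are (5, 4) and (4, 3), the points C and A of the example in Section 3.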
4 Loop pipelining: An ILP approach
The problem of loop pipelining can be reduced to two
interrelated subproblems that can be simultaneously
solved by using an ILP model: folding the DG (finding
the functions $\lambda$ and $\delta$) and finding a schedule for
the folded DG subject to the data dependences and
resource constraints.
4.1 Preliminaries
Initially, a DG obtained by unrolling the original DG
K times will be given. An objective II ($II_K$) will fix
the number of cycles of the schedule. Hereafter, and
for the sake of brevity, we will use II instead of $II_K$.
$C = \{0, \ldots, II - 1\}$ will denote the set of cycles of
the schedule.
All variables used in the model are nonnegative
integers. The least nonnegative residue system
$(0, 1, \ldots, k - 1)$ will be used when modulo k (mod k)
operations are performed on constants.
4.2 Loop folding constraints
The following variables are defined for the labeling
functions:
$\lambda_u, \; \forall u \in V$ (fundamental variables)
$\delta_{u,v}, \; \forall (u, v) \in E$ (auxiliary variables)
Folding the loop means finding an equivalent labeling
function for the DG. According to equation (1), the
auxiliary variables $\delta_{u,v}$ are defined as follows:
$$\delta_{u,v} = \lambda_u - \lambda_v + \lambda(v) - \lambda(u) + \delta(u, v), \quad \forall (u, v) \in E \qquad (2)$$
where $\lambda(u)$, $\lambda(v)$ and $\delta(u, v)$ denote the initial labeling
of the DG (usually $\lambda(u) = \lambda(v) = 0$).
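A small sketch of equation (2), assuming the fold is written $\lambda$ and the dependence distance $\delta$. The dictionary-based encoding and the function name are hypothetical choices for this illustration.

```python
def retimed_distances(edges, new_fold, old_fold=None):
    """Apply equation (2) to every edge of the DG.

    edges: dict mapping (u, v) to the initial distance delta(u, v).
    new_fold: dict mapping each vertex to its chosen fold (lambda_u).
    old_fold: initial folds lambda(u); defaults to 0 for every vertex.
    """
    old_fold = old_fold or {}
    result = {}
    for (u, v), dist in edges.items():
        result[(u, v)] = (new_fold[u] - new_fold[v]
                          + old_fold.get(v, 0) - old_fold.get(u, 0)
                          + dist)
    return result


if __name__ == "__main__":
    # Loop of Figure 1 folded as in Figure 2(b): C_i, B_{i+1}, A_{i+2}
    edges = {("A", "B"): 0, ("B", "C"): 0, ("B", "B"): 1}
    print(retimed_distances(edges, {"A": 2, "B": 1, "C": 0}))
```

With the folds of the schedule in Figure 2(b) (A in fold 2, B in fold 1, C in fold 0), every edge of the loop in Figure 1 ends up with distance 1, so no ILD remains. Since all model variables are nonnegative, a fold assignment that produces a negative distance is not a valid folding.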
4.3 Scheduling and data dependence constraints
The following variables are defined $\forall u \in V, i \in C$:
$$s_{u,i} = \begin{cases} 1 & \text{if } u \text{ starts executing at cycle } i \\ 0 & \text{otherwise} \end{cases}$$
The following constraint guarantees that an instruction
is scheduled at exactly one cycle of the schedule:
$$\sum_{i \in C} s_{u,i} = 1, \quad \forall u \in V \qquad (3)$$
For simplicity, the auxiliary variable $c_u$ will be used to
denote the cycle at which u is scheduled. Hence,
$$c_u = \sum_{i \in C} i \cdot s_{u,i}, \quad \forall u \in V \qquad (4)$$
Data dependences are honored by the constraint:
$$c_v \geq c_u + T(u) - II \cdot \delta_{u,v}, \quad \forall (u, v) \in E \qquad (5)$$
where T(u) is the execution time of u.
4.4 Resource constraints
The following constraints guarantee that no more than
$F_t$ functional units of type t are used at each cycle:
$$\sum_{u \in A(t)} \; \sum_{j = i - L(u) + 1}^{i} s_{u,(j \bmod II)} \leq F_t, \quad \forall t \in M, i \in C \qquad (6)$$
where M is the set of FU types, A(t) is the set of operations
executed by FUs of type t, $F_t$ is the number of available FUs
of type t and L(u) is the latency of u. An instruction
u uses an FU during the cycles $c_u, \ldots, c_u + L(u) - 1$
(mod II). In case the latency of the operation is longer
than II, the execution of several instances of the same
operation may overlap and, therefore, more than one
FU may be required.
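Constraints (5) and (6) can be checked on a candidate schedule with a few lines of Python. This sketch only verifies a given assignment, it does not solve the ILP; the function name and the dictionary-based inputs are assumptions made for illustration.

```python
def check_schedule(ii, cycle, dist, exec_time, latency, fu_type, fu_count):
    """Check dependence constraint (5) and resource constraint (6).

    cycle[u]: start cycle c_u in 0..ii-1; dist[(u, v)]: retimed distance;
    exec_time[u] = T(u); latency[u] = L(u); fu_type[u]: FU type of u;
    fu_count[t] = F_t.
    """
    if any(not 0 <= c < ii for c in cycle.values()):
        return False
    # (5): c_v >= c_u + T(u) - II * delta_{u,v}
    for (u, v), d in dist.items():
        if cycle[v] < cycle[u] + exec_time[u] - ii * d:
            return False
    # (6): per FU type and cycle, count the operations occupying a unit
    for t, available in fu_count.items():
        for i in range(ii):
            used = 0
            for u, c in cycle.items():
                if fu_type[u] != t:
                    continue
                # u occupies a unit at cycle i if it started at some
                # j in [i - L(u) + 1, i], taken modulo II
                used += sum(1 for j in range(i - latency[u] + 1, i + 1)
                            if j % ii == c)
            if used > available:
                return False
    return True


if __name__ == "__main__":
    ops = ["A", "B", "C"]
    one = {u: 1 for u in ops}
    dist = {("A", "B"): 1, ("B", "C"): 1, ("B", "B"): 1}
    print(check_schedule(1, {u: 0 for u in ops}, dist, one, one,
                         {u: "fu" for u in ops}, {"fu": 3}))
```

For the folded loop of Figure 2(b) with II = 1, the check succeeds with three FUs and fails with two, which is the situation that motivates Figure 3.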
4.5 Register requirements
The maximum number of variables whose lifetimes
overlap at any cycle, MAXLIVE, is the minimum number
of registers required for a schedule [19]. For an
edge $u \rightarrow v$, the variable lifetime spans from the
completion of u (cycle $c_u + T(u)$) to the cycle in
which the FU executing v does not require the input
data anymore (cycle $c_v + L(v) - 1$). The following
auxiliary variables are defined to specify whether an
operation reads or writes a result before a given cycle:
$$RB_{u,c} = \sum_{i=0}^{c-1} s_{u,(i - L(u) + 1) \bmod II} \qquad WB_{u,c} = \sum_{i=0}^{c-1} s_{u,(i - T(u) + 1) \bmod II}$$
The auxiliary variable $r_{u,v,c}$ defines the number of
registers required to store a result in cycle c produced
by operation u and consumed by operation v:
$$r_{u,v,c} = \delta_{u,v} - \left[ \left\lfloor \frac{T(u) - 1}{II} \right\rfloor + WB_{u,(T(u)-1) \bmod II} \right] + \left[ \left\lfloor \frac{L(v) - 1}{II} \right\rfloor + RB_{v,(L(v)-1) \bmod II} \right] + WB_{u,c} - RB_{v,c} \qquad (7)$$
The expressions in brackets determine a correction
on $\delta_{u,v}$ produced by the execution time of u and the
latency of v (L(v)), i.e. no register is required to
store the value while u executes, whereas a register is
required during the first L(v) cycles of v's execution.
A variable can be the input of more than one instruction.
However, there is no need to allocate different
registers for the same variable. The number of registers
required is the maximum over all edges with the
same source. Therefore,
$$r_{u,v,c} \leq r_{u,c}, \quad \forall u \in V, c \in C \qquad (8)$$
It can be easily proved that if operation v is the last
use of u's result then $r_{u,v,c} = r_{u,c}$ for any cycle c [12]:
$$\sum_{u \in V} r_{u,c} \leq ML, \quad \forall c \in C \qquad (9)$$
4.6 Objective function
Assume we have a cost vector $A = (A_1, \ldots, A_m)$ for
the FUs of the architecture and a cost $A_r$ for each
register (see footnote 2). We formulate the objective function as:
$$\min \; Area = \sum_{t \in M} A_t \cdot F_t + A_r \cdot ML \qquad (10)$$
$F_t$ and ML can be variables or constants, according
to the initial constraints. If all of them are constants,
the model reduces to finding a feasible solution. Here we have
considered ML as the number of registers required by the
schedule. Although this might not be true, it is a
realistic assumption for many practical cases [19].
4.7 Complexity of the model
Table 1 describes the fundamental variables and
constraints of the model. V and E denote the number of
nodes and edges of the DG respectively, whereas m is
the number of different FU types of the architecture.
Some of the variables, e.g. $F_t$ and ML, may become
constants if the number of resources of the architecture
is defined in advance.

| variable | number | constraint | number |
|---|---|---|---|
| $\lambda_u$ | V | (3) | V |
| $s_{u,i}$ | V · II | (5) | E |
| $F_t$ | m | (6) | m · II |
| $r_{u,c}$ | V · II | (7, 8) | V · II |
| ML | 1 | (9) | II |
| total | V(2 II + 1) + m + 1 | total | V(II + 1) + E + II(m + 1) |
Table 1: Variables and constraints of the ILP model

Footnote 2: Piecewise linear cost functions for registers can also be
incorporated, as proposed in [20].
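The totals of Table 1 are easy to evaluate for a given problem size. The helper below is a hypothetical convenience function, not part of the paper's tooling.

```python
def model_size(num_vertices, num_edges, ii, num_fu_types):
    """Return (variables, constraints) following the totals of Table 1."""
    variables = num_vertices * (2 * ii + 1) + num_fu_types + 1
    constraints = (num_vertices * (ii + 1) + num_edges
                   + ii * (num_fu_types + 1))
    return variables, constraints


if __name__ == "__main__":
    print(model_size(15, 21, 5, 1))
```

For instance, V = 15, E = 21 and II = 5 with a single FU type (the single type is an assumption) give 167 variables and 121 constraints, which are the figures listed in the first row of Table 2.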
5 Branch and Prune
The most popular method to solve ILP models is the
combination of branch-and-bound techniques with a
linear-programming solver such as simplex [21]. Having
integer variables in the model increases complexity
from polynomial to exponential. The running time
for ILP is highly influenced by the exploration of the
branch-and-bound tree. For an ILP solver unaware of
the structure of the problem, the tree of solutions is "blindly" explored
and finding valid integer solutions may require solving
an excessive number of LP problems.
We have implemented branch and prune, an ad hoc
solver for loop pipelining that takes advantage of the
information known a priori about the problem. The
solver uses a branch-and-bound paradigm to explore
the space of integer solutions, which allows rapid
pruning as soon as enough information has been captured
to evaluate the objective function. The order in which
integer variables are explored is also crucial to reduce
the running time.
Loop pipelining is decomposed into two different
problems: retiming (finding the loop fold $\lambda$ for each
operation) and scheduling (assigning operations to cycles).
After solving retiming, loop pipelining is reduced
to the scheduling of a basic block in which not
all dependences must be taken into account. According
to this idea, variables corresponding to retiming are
explored before variables corresponding to scheduling.
5.1 Retiming
Recurrences are the most stringent constraints for re-
timing, since the sum of their edges is a constant [14].
The order of the variables is selected as follows:
first explore the nodes belonging to the most stringent
recurrences. Inside each recurrence, first explore those
nodes that also belong to other recurrences.
For nodes not belonging to any recurrence, the
exploration order is defined according to their proximity
to nodes in recurrences. In these cases the
number of branches to be explored may be unbounded,
since no recurrence limits the value of $\lambda$. For this reason,
an early calculation of a lower bound for register
requirements is done at each node of the tree.
5.2 Scheduling
After having defined the folds of the operations, a
scheduling problem is posed for each of the leaves
of the search tree. At this moment, some dependences
of the DG can be eliminated from the model
(if $T(u) - II \cdot \delta_{u,v} \leq 0$) and the critical path of the
retimed DG can be calculated. In case the critical path
is longer than II, no feasible schedule can be found and
the exploration is pruned.
The exploration order of the scheduling variables
(s
u;i
) is also crucial. We have chosen one of the
well-known algorithms for scheduling (force-directed
scheduling (FDS) [22]) to assist the solver in defining
an efficient strategy to generate the search tree. FDS
performs a stepwise assignment of operations to cy-
cles according to criteria that attempt to balance the
utilization of resources over all cycles of the schedule.
This strategy leads the solver to a near-optimal so-
lution very soon, thus allowing an efficient pruning of
the tree for the other branches. Similarly to branch-
and-bound, the LP solver is invoked at each branch of
the tree to prune those solutions with a cost greater
than the best integer solution found.
6 Experimental results
The techniques presented in the previous sections have
been implemented by using the package lp_solve [23].
In this section, several results are reported to show the
main features of the method. The types of resources
used in the schedules are adders (1 cycle) and multi-
pliers (2 cycles).
6.1 Optimal unrolling degree
Table 2 presents the results obtained by exploring two
different unrolling degrees for the example depicted in
Figure 5. MAXLIVE is also minimized for the optimal II
found by the model. The columns V and E indicate the
number of nodes and edges of the DG after unrolling
the loop. #Vars and #Const indicate the number of
variables and constraints of the ILP model. SPAN is
calculated as $\lambda_{max} - \lambda_{min} + 1$ (the number of folds of
the schedule). MII is 1.25 for all three cases.
The first row presents the result for point C = (5, 4) in
Figure 5, which is an infeasible problem. The second
row presents the result for point A = (4, 3). This case
obtains better results in MAXLIVE and II than
the schedule found for the third point (B = (6, 4)),
which is the one explored by other techniques that
perform sub-optimal loop unrolling.

| II | K/II_K | V | E | #Vars | #Const | CPU | SPAN | MAXLIVE |
|---|---|---|---|---|---|---|---|---|
| 1.25 | 4/5 | 15 | 21 | 167 | 121 | 225 | - | Non feasible |
| 1.33 | 3/4 | 10 | 14 | 82 | 72 | 199 | 4 | 5 |
| 1.5 | 4/6 | 15 | 21 | 192 | 138 | 22 | 4 | 6 |
Table 2: Results for the example of Figure 5(a).
6.2 ILP model for RCLP
A significant part of the computational cost for solving
the ILP model is spent in guaranteeing that the final
solution is optimal. For this reason, we also report
the best solution obtained after 1 minute of CPU time (see footnote 3).
MAXLIVE is represented as ML.
The results presented in Tables 3 and 4 show that
benchmarks with a large number of operations can be
solved with moderate computational cost. The method
can also be used for heuristic search by limiting the
maximum CPU time (e.g. to one minute) and providing
the best solution found at that moment. The results
demonstrate that the optimal solution can often be
obtained by using only a small fraction of the total
computational cost, although a proof of optimality cannot
be given in this case. Optimal solutions have been
found for models with more than 1000 variables and
700 constraints.
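| × | + | II | #Vars | #Const | CPU (secs) | ML | SPAN | ML (60 secs) |
|---|---|---|---|---|---|---|---|---|
| 8 | 15 | 2 | 118 | 97 | 7 | 16 | 5 | - |
| 4 | 8 | 4 | 210 | 149 | 10 | 8 | 3 | - |
| 2 | 4 | 8 | 394 | 253 | 38 | 4 | 2 | - |
| 2 | 3 | 10 | 486 | 305 | 77 | 4 | 2 | 4 |
| 1 | 2 | 16 | 762 | 461 | 1252 | 3 | 2 | 3 |
Table 3: Results for the 16-point FIR filter (V = 34 and E = 22).
The first two columns are the resources (multipliers and adders).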
Footnote 3: We only report the value of MAXLIVE after 1 min, which is the
critical variable in the optimization of the objective function.
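| ×_p | × | + | II | #Vars | #Const | CPU (secs) | ML | SPAN | ML (60 secs) |
|---|---|---|---|---|---|---|---|---|---|
| 2 | - | 4 | 16 | 1125 | 684 | 201 | 9 | 2 | 9 |
| 2 | - | 3 | 16 | 1125 | 684 | 204 | 9 | 2 | 9 |
| 1 | - | 3 | 16 | 1125 | 684 | 188 | 10 | 2 | 10 |
| 1 | - | 2 | 17 | 1193 | 721 | 433 | 9 | 2 | 9 |
| - | 3 | 3 | 16 | 1125 | 684 | 203 | 9 | 2 | 9 |
| - | 2 | 3 | 16 | 1125 | 684 | 223 | 10 | 2 | 10 |
| - | 2 | 2 | 17 | 1193 | 721 | 472 | 9 | 2 | 9 |
| - | 1 | 2 | 19 | 1329 | 795 | 1924 | 9 | 2 | 9 |
Table 4: Results for the 5th-order elliptic wave filter
(×_p and × stand for pipelined and non-pipelined
multipliers; the first three columns are the resources). V = 34 and E = 58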
6.3 Solutions for large examples
Although the branch-and-prune strategy has proved to
be efficient for models with $10^3$ variables/constraints,
the exponential nature of the method becomes critical
in some cases. Rather than discarding this method, we
claim that it can still be used to obtain near-optimal
solutions by limiting the CPU time of the search.
Tables 5 and 6 present results on Cytron's and
FDCT loops obtained by limiting the CPU time to 10
minutes. Given the status of the search at the time the
solution was delivered, we suspect that the results are
optimal (although we could not prove it).
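| #R | MII | II | K/II_K | V | E | #Vars | #Const | SPAN | ML |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 5.66 | 5.66 | 3/17 | 51 | 63 | 1787 | 1015 | 3 | 12 |
| 4 | 4.25 | 4.25 | 4/17 | 68 | 84 | 2382 | 1342 | 4 | 18 |
| 5 | 3.4 | 3.4 | 5/17 | 85 | 105 | 2977 | 1669 | 5 | 19 |
Table 5: Results for the Cytron example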
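| + |  |  | MII | II | K/II_K | #Vars | #Const | SPAN | ML |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 4 | 4 | 4 | 4 | 1/4 | 382 | 279 | 3 | 12 |
| 2 | 2 | 2 | 8 | 8 | 1/8 | 718 | 463 | 2 | 16 |
| 3 | 3 | 3 | 5.33 | 5.33 | 3/16 | 4162 | 2365 | 2 | 38 |
Table 6: Results for the FDCT (the first three columns are the resources).
First and second cases: V = 42, E = 53. Third case: V = 126, E = 159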
7 Conclusions
In this paper we have presented a mathematical model
that can be solved by using ILP. However, efficient
techniques to wisely explore the space of integer solu-
tions are required to avoid a blind navigation through
the branch-and-bound tree.
We have presented a new strategy called branch and
prune that takes advantage of the information known a
priori about the problem to seek near-optimal solutions
as fast as possible. Decomposing loop pipelining into
two sub-problems (retiming and scheduling) and using
force-directed scheduling to assist the search have been
essential heuristics to guarantee optimal solutions in
moderate CPU times.
We have also demonstrated that exploring the un-
rolling degree is necessary to find optimal solutions for
loop pipelining. A mathematical formulation based on
Farey’s series has been proposed for such a goal.
References
[1] T.-F. Lee, A.C.-H. Wu, Y.-L. Lin, and D.D. Gajski. A
transformation-based method for loop folding. IEEE
Trans. Computer-Aided Design, 13(4):439–450, April
1994.
[2] C.-T. Hwang, Y.-C. Hsu, and Y.-L. Lin. Scheduling for
functional pipelining and loop winding. In Proc. of the
28th Design Automation Conf., pages 764–769, June
1991.
[3] E.F. Girczyc. Loop winding: A data flow approach to
functional pipelining. In Proc. Int. Symp. Circuits and
Systems, pages 382–385, May 1987.
[4] L.-F. Chao, A. LaPaugh, and E.H.-M. Sha. Rotation
scheduling: a loop pipelining algorithm. In Proc. of the
30th Design Automation Conf., pages 566–572, June
1993.
[5] S.M. Heemstra de Groot, S.H. Gerez, and O.E. Herrmann.
Range-chart-guided iterative data-flow graph
scheduling. IEEE Transactions on Circuits and
Systems–I, 39(5):351–364, May 1992.
[6] F. Sánchez and J. Cortadella. Time constrained loop
pipelining. In Proc. Int. Conf. Computer-Aided Design,
pages 592–596, November 1995.
[7] J. Cortadella, R.M. Badia, and F. Sánchez. A mathematical
formulation of the loop pipelining problem.
Technical Report UPC-DAC-95-36, Dept. of Computer
Architecture, Univ. Politècnica de Catalunya, November
1995.
[8] R. Cytron. Compiler-time scheduling and optimization
for asynchronous machines. PhD thesis, University of
Illinois at Urbana-Champaign, 1984.
[9] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. A formal ap-
proach to the scheduling problem in high level synthe-
sis. IEEE Trans. Computer-Aided Design, 10(4):464–
475, April 1991.
[10] M. Rim and R. Jain. Lower-bound performance esti-
mation for the high-level synthesis scheduling problem.
IEEE Trans. Computer-Aided Design, 13(4):451–458,
April 1994.
[11] R. Govindarajan, E.R. Altman, and G.R. Gao.
Minimizing register requirements under resource-
constrained rate-optimal software pipelining. In Proc.
of the 27th Annual International Symposium on Mi-
croarchitecture, pages 85–94, November 1994.
[12] A.E. Eichenberger, E.S. Davidson, and S.G. Abraham.
Optimum modulo schedules for minimum register
requirements. In Proc. of the 9th ACM International
Symposium on Supercomputing, pages 31–40, June 1995.
[13] K.K. Parhi and D.G. Messerschmitt. Static rate-optimal
scheduling of iterative data-flow programs via optimum
unfolding. IEEE Trans. Computers, 40(2):178–195,
February 1991.
[14] J. Cortadella, R. M. Badia, and F. Sánchez. A mathematical
formulation of the loop pipelining problem.
Technical Report UPC-DAC-1995-36, Department of
Computer Architecture (UPC), October 1995.
[15] C.E. Leiserson and J.B. Saxe. Retiming synchronous
circuitry. Algorithmica, 6:5–35, 1991.
[16] B.R. Rau and C.D. Glaeser. Some scheduling tech-
niques and an easily schedulable horizontal architecture
for high performance scientific computing. In Proc.
of the 14th Annual Workshop on Microprogramming,
pages 183–198, October 1981.
[17] M.R. Schroeder. Number theory in science and com-
munication. Springer-Verlag, 1990.
[18] F. Sánchez. Loop Pipelining with Resource and Timing
Constraints. PhD thesis, Universitat Politècnica de
Catalunya (Spain), 1995.
[19] B.R. Rau. Iterative modulo scheduling: An algorithm
for software pipelining loops. In Proc. of the 27th An-
nual International Symposium on Microarchitecture,
pages 63–74, November 1994.
[20] C.H. Gebotys and M.I. Elmasry. Global optimization
approach for architectural synthesis. IEEE Trans.
Computer-Aided Design, 12(9):1266–1278, September
1993.
[21] C.H. Papadimitriou and K. Steiglitz. Combinatorial
optimization: algorithms and complexity. Prentice-
Hall, 1982.
[22] P.G. Paulin and J.P. Knight. Force-directed scheduling
for the behavioral synthesis of ASIC’s. IEEE Trans.
Computer-Aided Design, 8(6):661–679, June 1989.
[23] M.R.C.M. Berkelaar. lp_solve version 2.0: a public
domain ILP solver, 1995. Available at ftp.es.ele.tue.nl.
