Time-constrained loop pipelining by Sánchez Carracedo, Fermín & Cortadella, Jordi
Time-Constrained Loop Pipelining * 
Fermín Sánchez    Jordi Cortadella
Dept. of Computer Architecture, Univ. Politècnica de Catalunya
08071 Barcelona, (Spain). 
Abstract 
This paper addresses the problem of Time-Constrained Loop 
Pipelining, i.e. given a fixed throughput, finding a schedule of 
a loop which minimizes resource requirements. We propose a 
methodology, called TCLP, based on dividing the problem into 
two simpler and independent tasks: retiming and scheduling. 
TCLP explores different sets of resources, searching for a maximum
resource utilization. This reduces area requirements. After
a minimum set of resources has been found, the execution throughput
is increased and the number of registers required by the loop
schedule is reduced. TCLP attempts to generate a schedule which
minimizes cost in time and area (resources and registers). The
results show that TCLP obtains optimal schedules in most cases.
1 Introduction 
This paper presents TCLP, a methodology to solve Time-Constrained
Loop Pipelining, a problem which is NP-complete [3].
Two types of timing constraints (TCs) have been considered
in the literature: local TCs, which specify minimum and/or maximum
TCs between operation pairs [11], and global TCs, which specify a
maximum delay time to process a set of data.
The term TCs has previously been used to refer to both local
and global TCs, even though they are completely different.
Approaches to solve scheduling with local TCs can be found in
[7, 10, 11]. On the other hand, some Integer Linear Programming
(ILP) approaches have been proposed to solve scheduling
(not loop pipelining) with global TCs [1, 5]. Force Directed
Scheduling [12] handles both local and global TCs. This paper
addresses loop pipelining with global TCs. Henceforth, we will
use the terms global TCs and TCs interchangeably.
1.1 New contributions 
Henceforth, Tmax will denote the maximum number of cycles to
execute each loop iteration. The main contributions of TCLP with
regard to the previous time-constrained scheduling approaches
[1, 5, 12] are the following:
• Loop pipelining is supported. It is reduced to two simpler
and independent tasks: retiming and scheduling.
• Absolute lower bounds are computed for each type of
resource. When these bounds are met, the solution is optimal.
• Once a set of resources has been computed for a given Tmax,
the execution throughput is increased without varying the set
of resources.
• The number of required registers is finally reduced, producing
a schedule with lower cost in time and area.
*This research was supported by the Ministry of Education of Spain (CICYT)









TCLP works as follows (Figure 1 shows the flow diagram):
1. Compute the minimum initiation interval (MII) of the loop¹.
There is no solution when Tmax < MII.
2. Calculate the absolute lower bound on the required number
of resources of each type.
3. Find a schedule in Tmax cycles by using the initial set of
resources calculated at step 2. The loop is successively
retimed and scheduled until a schedule is found or no further
retiming can be performed. If a schedule is found, go to step
5. Otherwise, go to step 4.
4. Increase the set of resources. Heuristics are used to select
the type of resource to be increased. One instance of the
selected resource is added and step 3 is executed again.
5. Reduce the current set of resources while maintaining the
throughput of the schedule. This step corrects overestimations
of resources introduced at step 4.
6. Increase the execution throughput without varying the set of
resources. Throughput is explored in increasing order by
using different unrolling degrees.
7. Reduce the number of registers required by the schedule.






Figure 1: Flow Diagram of TCLP 
¹The initiation interval (II) is defined as the average number of cycles elapsed
between the issuing of two consecutive iterations of the loop.
1063-6757/95 $04.00 © 1995 IEEE
1.3 Representation of a loop 
A loop is represented by a labelled directed dependence graph,
DG(V, E). Vertices represent operations of the loop body, and
edges represent data dependences. Two labellings are defined:
• $\lambda(u)$, index defined on vertices, denotes the iteration to which
the execution of $u$ corresponds in the schedule. $\lambda(u) = i$
will be denoted by $u_i$ in the DG.
• $\delta(u, v)$, distance defined on edges, is the number of iterations
traversed by dependence $(u, v)$. $\delta(u, v) = 0$ corresponds
to an intra-loop dependence (ILD), and $\delta(u, v) > 0$
corresponds to a loop-carried dependence (LCD). An ILD
between $u$ and $v$ is represented as $u_i \rightarrow v_j$. An LCD of
distance $d$ between $u$ and $v$ is represented as $u_i \xrightarrow{d} v_j$.
The operations considered by TCLP can take several cycles
and use several (possibly pipelined) functional units (FUs). The
execution of an operation is statically described by an execution pattern.
Figure 2 shows an example. The value in each cell denotes the
number of resources of each type required at a given cycle. In
order to execute axpy, the set of resources must contain at least 1
multiplier, 1 adder and 2 input/output register ports.
Figure 2: Execution pattern of operation axpy ($z = a \cdot x + y$)
2 Loop pipelining 
2.1 Lower bounds on resources and initiation interval 
Let $R_i$ be the number of times a resource of type $i$ is used by
an iteration of the loop. $LB_i = \lceil R_i / T_{max} \rceil$ is a lower bound on
the number of resources of type $i$ required to execute the loop.
Sometimes the execution pattern of a single operation may require
$EP_i$ ($EP_i > LB_i$) resources of a given type $i$ (for
example, operation axpy from Figure 2 requires $EP_{reg\text{-}port} = 2$
at cycle 1). Therefore, the absolute lower bound on resources of
type $i$ is $N_i = \max(LB_i, EP_i)$. TCLP starts with $N_i$ resources
for each type of resource $i$.
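The bound can be illustrated with a minimal sketch. It assumes that the per-type lower bound is the ceiling of the number of uses per iteration divided by Tmax; the function name, the dictionaries and the usage counts below are hypothetical and are not taken from the paper.

```python
from math import ceil

def absolute_lower_bounds(uses_per_iteration, pattern_peak, t_max):
    """Return N_i = max(LB_i, EP_i) for every resource type i."""
    bounds = {}
    for res, uses in uses_per_iteration.items():
        lb = ceil(uses / t_max)        # LB_i, assumed to be ceil(R_i / Tmax)
        ep = pattern_peak.get(res, 0)  # EP_i, peak demand of a single operation
        bounds[res] = max(lb, ep)
    return bounds

# Hypothetical loop: uses of each resource type per iteration (R_i)
# and the peak demand of any single execution pattern (EP_i).
example = absolute_lower_bounds(
    {"multiplier": 10, "adder": 7, "reg_port": 6},
    {"multiplier": 1, "adder": 1, "reg_port": 2},
    t_max=4,
)
print(example)
```

With a looser constraint (for instance `t_max=18`) the register ports would still need two instances, because the execution pattern alone requires them.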
Recurrences in a loop impose a lower bound on the II of any
schedule. Let $T_u$ be the total execution time (delay) of instruction
$u$. In general, the MII imposed by a recurrence $R$ is [14]:
$$MII_R = \frac{T_R}{\delta_R}, \quad \text{where } T_R = \sum_{u \in R} T_u \text{ and } \delta_R = \sum_{(u,v) \in R} \delta(u, v)$$
In a loop with several recurrences, the one which produces the
maximum such ratio is the one which determines the MII of the
loop. The MII of a loop without recurrences is 0. MII can be
calculated in polynomial time by using Karp's algorithm [6] to
find the minimum mean-weight cycle of a graph.
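For small graphs the recurrence bound can be checked by brute force. The sketch below enumerates simple cycles and keeps the largest ratio of total delay to total distance; it is not Karp's algorithm and is exponential in the worst case. The function name, the data layout and the example graph are assumptions made for illustration.

```python
from fractions import Fraction

def minimum_initiation_interval(delay, edges):
    """delay: {op: cycles}; edges: {(u, v): distance}.
    Returns the maximum of T_R / delta_R over all simple recurrences R."""
    succ = {}
    for (u, v), dist in edges.items():
        succ.setdefault(u, []).append((v, dist))
    order = {op: k for k, op in enumerate(delay)}
    best = Fraction(0)  # a loop without recurrences has MII = 0

    def walk(start, node, time, dist, on_path):
        nonlocal best
        for nxt, d in succ.get(node, []):
            if nxt == start:
                total = dist + d
                if total > 0:  # a legal recurrence carries at least one iteration
                    best = max(best, Fraction(time, total))
            elif order[nxt] > order[start] and nxt not in on_path:
                # only visit later operations so each cycle is found once
                walk(start, nxt, time + delay[nxt], dist + d, on_path | {nxt})

    for op in delay:
        walk(op, op, delay[op], 0, {op})
    return best

# Hypothetical DG with two recurrences: A-B-C-A (delay 4, distance 2)
# and B-C-B (delay 3, distance 1).
mii = minimum_initiation_interval(
    {"A": 1, "B": 2, "C": 1},
    {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 2, ("C", "B"): 1},
)
print(mii)
```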
2.2 Dependence retiming 
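$A_{i+d} \xrightarrow{d} B_i$ and $A_i \rightarrow B_i$ represent the same dependence [15] in
a DG(V, E). Therefore, two different labellings $(\lambda, \delta)$ and $(\lambda', \delta')$
are equivalent (they represent the same loop) if, $\forall (u, v) \in E$, the
following equation holds:
$$\lambda(v) - \lambda(u) + \delta(u, v) = \lambda'(v) - \lambda'(u) + \delta'(u, v) \qquad (1)$$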
By using Equation (1) we have derived a DG transformation, 
called dependence retiming, which produces the same effect as 
retiming [8, 2, 131. Dependence retiming increases the distance 
of a dependence (U, w) as follows: 
• $\lambda'(u) = \lambda(u) + 1$
• $\forall (u, z) \in E$, $\delta'(u, z) = \delta(u, z) + 1$
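The transformation can be sketched in a few lines, together with a check that the quantity in Equation (1) is preserved on every edge. Two points are assumptions rather than statements from the text: dependences entering the retimed operation have their distance decreased by one, and a self-dependence is left unchanged (both follow from requiring Equation (1) to hold). The distance of 1 on the LCD (A, A) and all names are hypothetical.

```python
def dependence_retiming(index, distance, u):
    """Retime operation u: lambda'(u) = lambda(u) + 1, shifting edge distances."""
    new_index = dict(index)
    new_distance = dict(distance)
    new_index[u] += 1
    for (a, b) in distance:
        if a == u and b != u:
            new_distance[(a, b)] += 1  # outgoing dependences: delta' = delta + 1
        elif b == u and a != u:
            new_distance[(a, b)] -= 1  # incoming dependences (assumption, from Eq. (1))
    return new_index, new_distance

def invariant(index, distance):
    """Per-edge value lambda(v) - lambda(u) + delta(u, v) from Equation (1)."""
    return {(a, b): index[b] - index[a] + d for (a, b), d in distance.items()}

# DG in the style of Figure 3(a): ILDs (A, B) and (A, C), LCD (A, A).
index = {"A": 0, "B": 0, "C": 0}
distance = {("A", "B"): 0, ("A", "C"): 0, ("A", "A"): 1}
new_index, new_distance = dependence_retiming(index, distance, "A")
print(new_index, new_distance)
```

After retiming A, both former ILDs carry distance 1, so no intra-loop dependence remains in this small graph.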




Figure 3: Reducing the II by means of dependence retiming 
Dependence retiming implicitly pipelines the loop. The example
shown in Figure 3 depicts two equivalent DGs and their
schedules, assuming all operations are additions that can be
executed in one cycle and three adders are available. The execution of
each iteration of the loop in Figure 3(a) requires two cycles, due to
the existence of the ILDs (A, B) and (A, C). The LCD (A, A) is
always honored by the sequential execution of the iterations of the
loop. However, the loop body in Figure 3(b) may be scheduled in
only one cycle because no ILD exists after retiming dependence
(A, B) (note that dependence (A, C) is also transformed into an LCD
as a side effect).
2.3 Retiming and scheduling 
This section presents a loop pipelining algorithm to find a schedule 
in a previously known number of cycles. The DG to schedule may 
contain operations belonging to different iterations. Therefore, 
the length of the pipelined schedule may be different from the 
iteration time. For example, the II of the schedule in Figure 3(b) 
is 1, but the iteration time is 2 (two cycles are required to execute 
each iteration from the original loop). 
We reduce loop pipelining to two simpler and independent
tasks: retiming and scheduling of DGs. First, the DG is transformed
into another equivalent one by means of dependence
retiming. Next, we try to find a schedule of the retimed DG in the
expected number of cycles. This process is iteratively repeated
until a schedule is found or no further dependence retiming can
be done. The scheduling features, such as multicycle operations,
chaining, pipelined functional units, functional pipelining, local
timing constraints, etc. are hidden inside the scheduling algorithm.
Since retiming and scheduling are independent tasks in TCLP, 
any scheduling algorithm for basic blocks can be used. In other 
loop pipelining approaches, such as modulo scheduling [14] or 
rotation scheduling [2], both tasks are highly interdependent. 
The scheduler may be called many times by TCLP. Thus,
we are interested in a scheduler with the lowest run-time
complexity. For this reason, we use list scheduling, which executes in
linear time. Details about the scheduling algorithm are out of the
scope of this paper.
function retiming_and_scheduling(G1, II);
  G2 := G1;
  Repeat
    S := scheduling(G2);
    if (schedule_length = II) then return true endif;
    e := select_edge(G2); {selects an edge for retiming}
    if (edge_selected) then
      G2 := dependence_retiming(G2, e);
      if better(G2, G1) then G1 := G2 endif;
    endif;
  Until no edge can be selected
  return false; {schedule not found}
endfunction
The loop pipelining algorithm (retiming_and_scheduling) is
described in the lines above. Heuristics are provided to select an edge
for retiming (function select_edge) and to determine when no further
retiming can be done (function better). Function select_edge
selects for retiming the head or the tail of a critical path. An edge
cannot be selected twice without finding a better DG. Function
better tries to guess whether a DG is better for scheduling than
another one before doing scheduling. Function better selects DGs
with the shortest critical path and the lowest number of ILDs.
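The retime-then-schedule loop can be sketched on a toy model. This is a deliberately simplified illustration, not the authors' implementation: all operations take one cycle, there is a single resource type with `units` instances, the loop accepts any schedule of length at most II, the `better` heuristic is omitted, and the selection rule simply picks an untried operation that heads an ILD and has no intra-loop predecessor. All names are hypothetical.

```python
def list_schedule(ops, distance, units):
    """Unit-latency list scheduling over ILDs only; returns {op: cycle}."""
    preds = {op: [a for (a, b), d in distance.items() if b == op and d == 0]
             for op in ops}
    cycle_of, cycle = {}, 1
    while len(cycle_of) < len(ops):
        ready = [op for op in ops if op not in cycle_of
                 and all(cycle_of.get(p, cycle) < cycle for p in preds[op])]
        for op in ready[:units]:  # at most `units` operations per cycle
            cycle_of[op] = cycle
        cycle += 1
    return cycle_of

def retime(distance, u):
    """Move operation u one iteration ahead (self-dependences are unchanged)."""
    new = dict(distance)
    for (a, b) in distance:
        if a == u and b != u:
            new[(a, b)] += 1
        elif b == u and a != u:
            new[(a, b)] -= 1
    return new

def select_operation(ops, distance, tried):
    """Simplified stand-in for select_edge: head of an ILD that can legally move."""
    for op in ops:
        incoming = [d for (a, b), d in distance.items() if b == op and a != op]
        heads_ild = any(a == op and d == 0 for (a, b), d in distance.items())
        if op not in tried and heads_ild and all(d > 0 for d in incoming):
            return op
    return None

def retiming_and_scheduling(ops, distance, units, ii):
    tried = set()
    while True:
        schedule = list_schedule(ops, distance, units)
        if max(schedule.values()) <= ii:
            return schedule, distance
        op = select_operation(ops, distance, tried)
        if op is None:
            return None, distance  # schedule not found
        tried.add(op)
        distance = retime(distance, op)

ops = ["A", "B", "C"]
dg = {("A", "B"): 0, ("A", "C"): 0, ("A", "A"): 1}
schedule, retimed = retiming_and_scheduling(ops, dg, units=3, ii=1)
print(schedule, retimed)
```

With three units the loop body fits in one cycle after retiming A, mirroring the situation described for Figure 3; with a single unit no one-cycle schedule exists and the function reports failure.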
2.4 Which type of resource must be increased?
The current set of resources is increased when
retiming_and_scheduling does not find a schedule in the expected number
of cycles. Heuristics are used to determine which type of
resource must be added to the set. After adding the resource,
retiming_and_scheduling is executed again, and so on. Two different
reasons can prevent a schedule from being found:
• Some operation cannot be scheduled because not enough
resources are available. When an operation u cannot be
scheduled at cycle ASAP(u) because of the lack of
resources, it is deferred to the next cycle. Deferring u may
cause some successors of u to be deferred, and so on.
As the number of resources is limited, some of these
successors may not be scheduled within their time frame for
scheduling (ALAP - ASAP). When this happens, the
resource which causes the deferring of u is increased by one
unit.
• There is no time frame to schedule some operation v. Figure
4 illustrates this fact with an example. Let us assume that the
execution time of the operations in the DG from Figure 4(a)
is 2 for u and v, and 1 for w. Figure 4(b) shows a possible
schedule in which u and w have already been scheduled at
cycles 1 and 4 respectively, and v has not yet been scheduled.
When the scheduler attempts to schedule v, it finds that v
should be scheduled after (or at) cycle 3 because of ILD
(u, v) (time frame TF2), and before (or at) cycle 2 because
of ILD (v, w) (time frame TF1). Since both time frames
are disjoint, the scheduler fails. When this occurs, TCLP
increments the resource most used by the loop.
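The empty time frame in the second case can be reproduced with a short calculation. The sketch assumes a chain u → v → w with execution times 2, 2 and 1, with u placed at cycle 1 and w at cycle 4; the helper name and its arguments are hypothetical.

```python
def time_frame(delay, start, preds, succs, op):
    """Return (earliest, latest) start cycle of op given its scheduled neighbours."""
    # op cannot start before every scheduled predecessor has finished
    earliest = max((start[p] + delay[p] for p in preds if p in start), default=1)
    # op must finish before every scheduled successor starts
    latest = min((start[s] - delay[op] for s in succs if s in start), default=None)
    return earliest, latest

delay = {"u": 2, "v": 2, "w": 1}
start = {"u": 1, "w": 4}  # v is not scheduled yet
earliest, latest = time_frame(delay, start, preds=["u"], succs=["w"], op="v")
print(earliest, latest)  # the frame is empty when earliest > latest
```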
3 Optimizing area, throughput and registers 
3.1 Reducing area cost 
The heuristics used to increase the set of resources may
overestimate the resources required to find a schedule. To correct
this, TCLP attempts to reduce the number of resources
after a schedule is found. To do so, resources are explored in
decreasing order of area, looking for a schedule with a lower area
cost. This step is able to correct errors introduced by the heuristics
described in Section 2.4. The combination of both steps produces
optimal results in almost all cases, as the results in Section 5 show.
Figure 4: (a) DG (b) v has no time frame to be scheduled
²ASAP(u) and ALAP(u) are respectively the first and the last cycle at which
u may be scheduled. ASAP(u) and ALAP(u) dynamically change depending …
The algorithm used to optimize the area cost of the schedule is
shown in the lines below. In the algorithm, $N_i$ is the absolute lower
bound on the number of resources of type $i$ required to execute
the loop, and $R_i$ is the current number of resources of type $i$.
function optimize_area(G, II);
  foreach type of resource (i) do {explored in decreasing order of area}
    reducible := true;
    while R_i > N_i and reducible do
      reducible := false;
      remove a resource of type i;
      found := retiming_and_scheduling(G, II);
      if found then reducible := true;
      else add a resource of type i;
      endif;
    endwhile;
  endforeach;
endfunction;
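The same greedy reduction can be written as a short sketch in which the retiming-and-scheduling step is replaced by a caller-supplied feasibility predicate. The predicate `demo_feasible`, the area figures and the resource counts are made up for illustration only.

```python
def optimize_area(resources, lower_bound, area, can_schedule):
    """Greedily remove resources, largest area first, while a schedule still exists."""
    current = dict(resources)
    for res in sorted(current, key=lambda r: area[r], reverse=True):
        while current[res] > lower_bound[res]:
            current[res] -= 1      # tentatively remove one unit
            if not can_schedule(current):
                current[res] += 1  # restore it and move on to the next type
                break
    return current

# Hypothetical stand-in for retiming_and_scheduling(G, II).
def demo_feasible(r):
    return r["multiplier"] >= 2 and r["multiplier"] + r["adder"] >= 4

reduced = optimize_area(
    {"multiplier": 3, "adder": 3},
    {"multiplier": 1, "adder": 1},
    {"multiplier": 10, "adder": 2},
    demo_feasible,
)
print(reduced)
```

Exploring the most expensive type first means that a unit of a large resource is given up before any cheaper unit is considered.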
3.2 Increasing throughput 
Given a loop and a set of resources, the throughput of a schedule
can be represented in a diagram, as shown in Figure 5(a). The
y axis represents the unrolling degree of the loop (K), and the
x axis represents the number of cycles of the schedule (II). A
point (II, K) in the diagram represents a possible schedule of K
iterations of the loop in II cycles. The throughput (Th) of such
a schedule is K/II iterations per cycle. All points representing
schedules with the same throughput fall on a line (see points A
and C). Point B is above the line which includes points A and C
because the throughput of B is greater than the throughput of A
and C. Point D is below this line because it represents a schedule
with lower throughput.
Figure 5: (a) Throughput diagram (K versus II, with the line Th = MaxTh)
(b) Representation of Farey's series $F_5$ in a diagram (Num versus Den)
The maximum throughput achievable by a schedule (MaxTh)
is bounded by the recurrences of the loop and the set of available
resources. Any feasible schedule of the loop is represented by a
point below the line Th = MaxTh. Note that non-integer IIs can
be achieved by unrolling the loop (for example, the average II of
a single iteration of the schedule represented by point A is 4/3).
We are interested in exploring these points in increasing order of
throughput, starting at point (Tmax, 1) and finishing at any point
on the line Th = MaxTh. Since the number of points between lines
Th = MaxTh and Th = 1/Tmax is infinite, we limit such number by
limiting the maximum number of cycles of any schedule. This
bound is denoted by MaxII. MaxII may be greater than Tmax
because it may represent the length of a schedule of several loop
iterations.
For a fixed $n > 0$, the sequence of all the reduced fractions
with nonnegative denominator $\le n$ arranged in increasing
order of magnitude is called the Farey's series of order $n$, and
denoted by $F_n$ [4]. For example, $F_5$ is the series of fractions:
$\{\frac{0}{1}, \frac{1}{5}, \frac{1}{4}, \frac{1}{3}, \frac{2}{5}, \frac{1}{2}, \frac{3}{5}, \frac{2}{3}, \frac{3}{4}, \frac{4}{5}, \ldots\}$. Figure 5(b) shows a diagram
representing such a sequence. Numbers in the diagram state the
order of the fractions in the series $F_5$. Point (4,2) is not in the
sequence, since it represents a fraction with the same value as
point (2,1), and therefore it is not reduced.
The throughput of a schedule is a fraction with a denominator
lower than or equal to MaxII. Therefore, Farey's series of order
MaxII forms the sequence of points to explore in the throughput
diagram. The ith element of the series $F_{MaxII}$ is represented by
the fraction $X_i / Y_i$, and can be recurrently computed as:
$$X_{i+1} = x + X_i \cdot \left\lfloor \frac{MaxII - y}{Y_i} \right\rfloor ; \quad Y_{i+1} = y + Y_i \cdot \left\lfloor \frac{MaxII - y}{Y_i} \right\rfloor$$
where $x$ and $y$ are two integers satisfying the relation
$\gcd(Y_i, -X_i) = Y_i \cdot x + (-X_i) \cdot y$. The coefficients $x$ and
$y$ can be easily computed by using the extended gcd [4]. The
algorithm to increase the schedule throughput is shown below.
Function unroll(G, X) returns the graph G unrolled X times.
function increase_throughput(initial_loop);
  X := 1; Y := Tmax;
  found := true;
  while found and X/Y < MaxTh do
    G := unroll(initial_loop, X);
    found := retiming_and_scheduling(G, Y);
    X/Y := next element from F_MaxII;
  endwhile;
endfunction;
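The step "next element from the Farey series" can be sketched with the extended gcd. The code assumes the successor is obtained by solving Y_i·x − X_i·y = 1 and then taking the solution with the largest denominator not exceeding the order; this is one standard way to obtain the next Farey fraction and may differ in detail from the authors' formulation. Function names are hypothetical.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) = a*s + b*t."""
    if b == 0:
        return a, 1, 0
    g, s, t = extended_gcd(b, a % b)
    return g, t, s - (a // b) * t

def next_farey(x_i, y_i, order):
    """Successor of the reduced fraction x_i / y_i among fractions with denominator <= order."""
    _, s, t = extended_gcd(y_i, x_i)  # y_i*s + x_i*t = 1 for a reduced fraction
    x, y = s, -t                      # so that y_i*x - x_i*y = 1
    k = (order - y) // y_i            # largest shift keeping the denominator <= order
    return x + k * x_i, y + k * y_i

def farey_between(first, last, order):
    """List the fractions of the series from `first` up to `last`, both included."""
    seq = [first]
    while seq[-1] != last:
        seq.append(next_farey(*seq[-1], order))
    return seq

f5 = farey_between((0, 1), (4, 5), 5)
explored = farey_between((1, 18), (1, 16), 50)
print(f5)
print(explored)
```

The second call lists the throughputs between 1/18 and 1/16 for a maximum schedule length of 50 cycles, in the order in which they would be explored.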
Increasing the unrolling degree of the loop also increases the
register pressure. Therefore, this step may be skipped when the
number of registers is limited or the size of the registers has great
influence on the final area of the chip. Moreover, in a schedule of a
loop unrolled X times taking Y cycles, each iteration is executed
in Y/X cycles on average, with Y/X < Tmax. However, a single
iteration may be longer than Tmax. In some applications, this
fact must be verified before considering the schedule as a valid
schedule.
3.3 Reducing register pressure
An absolute lower bound on the number of registers required for
a schedule is the maximum number of variables whose lifetimes
overlap at any cycle. This number (R) can be reduced by reducing
variable lifetimes. This can be done in two different ways:
(1) by moving operations across schedules of consecutive
iterations (SPAN reduction) and (2) by moving operations within the
schedule of an iteration (incremental scheduling).
3.3.1 SPAN reduction
The SPAN of a DG is defined as $\lambda_{max} - \lambda_{min} + 1$, where $\lambda_{max}$ and
$\lambda_{min}$ are the maximum and minimum values for $\lambda$ respectively.
The SPAN of a DG can be reduced by a transformation similar to
dependence retiming. Reducing the SPAN of a DG reduces the
distance of some dependences, and thus the variable lifetimes.
Figure 6 shows an example, in which variable lifetimes are
represented as lines. A point in a line crossing two consecutive cycles
represents a register. The schedule in Figure 6(b) requires 3 registers,
whilst the schedule in Figure 6(d) requires only 2 registers. The SPAN
of the DG has been reduced by reducing the index of operations
A and D (see Figure 6(c)).
Figure 6: Example of SPAN reduction (a) DG example before
SPAN reduction (b) Scheduling of (a) requiring 3 registers (c) DG
after SPAN reduction (d) Scheduling of (c) requiring 2 registers
3.3.2 Incremental scheduling 
Unlike SPAN reduction, incremental scheduling does not change
the iteration index of any operation. Two movements are
considered: (1) rescheduling, which moves an operation from
its current cycle to another cycle in which sufficient resources are
available, and (2) swapping two operations when both operations
have the same execution pattern.
Figure 7: Reducing R by incremental scheduling.
Figure 7 shows an example of incremental scheduling. Note 
that 2 registers are required to store the variables which are alive 
between cycles two and three in Figure 7(a), and therefore R = 2. 
Figure 7(b) shows the schedule after swapping operations B and 
C. Variable lifetimes have been reduced, and now R = 1. 
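The register count R used in this section can be computed from variable lifetimes with a short sketch. It assumes each lifetime is given as a pair (cycle in which the value is produced, cycle of its last use) and that a value occupies a register across every cycle boundary it crosses; lifetimes that wrap around into the next iteration of a pipelined schedule are ignored. The lifetimes below are hypothetical and are not read from Figure 7.

```python
def registers_required(lifetimes):
    """Maximum number of values alive across any boundary between consecutive cycles."""
    if not lifetimes:
        return 0
    last = max(end for _, end in lifetimes)
    return max(
        (sum(1 for start, end in lifetimes if start <= c < end)
         for c in range(1, last)),
        default=0,
    )

before_swap = [(1, 3), (2, 3)]  # two values alive between cycles 2 and 3
after_swap = [(1, 2), (2, 3)]   # the first value now dies before the second is live
print(registers_required(before_swap), registers_required(after_swap))
```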
4 Example of TCLP 
We have chosen the Fast Discrete Cosine Transform Kernel 
(FDCT) from [9] to illustrate how TCLP works. The DG is 
shown in Figure 8(a). The throughput requirement is Tmax = 18. 
As in [9], we will assume each operation is executed in a single 
cycle in the appropriate FU (multiplier, adder or subtracter). 
The lower bound on the number of required resources is 1
resource of each type. The retiming_and_scheduling function finds a
schedule in 18 cycles in less than 0.8 seconds. The number of resources
cannot be reduced, since it is minimal.
Now, TCLP attempts to reduce the length of the schedule. The
maximum number of cycles, MaxII, has been set to 50 cycles.
Therefore, Farey's series $F_{50}$ is explored, starting at fraction 1/18.
Since the MII computed by using one FU of each type is MII = 16,
the last fraction to be considered is 1/16. The fractions explored
are 1/18, 2/35, 1/17, 3/50, 2/33, 3/49 and 1/16. These fractions are
depicted in Figure 8(b) between the lines Th = 1/18 and Th = MaxTh.
A fraction is explored only when a schedule has been found for
the previous one. TCLP stops when a schedule for 1 iteration
in 16 cycles is found. The time used to explore all the fractions
has been 44.2 seconds. This is the most time consuming step in
TCLP.
Figure 8: (a) DG of FDCT (b) Throughput exploration
After reducing the length of the schedule, TCLP attempts to 
reduce the number of registers. The schedule found after the 
exploration of Farey’s series uses 18 registers. After reducing the 
SPAN, the schedule requires 15 registers. The final schedule (after
incremental scheduling) requires only 12 registers. The time used
to reduce the number of registers was 2.55 seconds. 
5 Results 
We present here some well-known examples: Cytron's DG, the
resolution of the differential equation and the Fifth-Order Elliptic
Filter. More results can be found in [17].
Optimal time-constrained scheduling has been studied in [1, 5],
and some results³ can be found in [1]. We will compare the
results with the MII and with the lower bounds on the number of
resources.
Tables 1 to 4 show the results. The first columns show Tmax
(T) and the lower bounds (LB) on FUs computed for each Tmax.
The next columns specify the number and type of resources required
to achieve the given Tmax (FUs). The following columns show
the MII calculated for each set of FUs, the II of the schedule
(of a single iteration) found by TCLP, the number of registers
required for each schedule (R) and the fraction of the Farey's
series which is associated to the schedule (K/II). Finally, the last
two columns show the time used (on a SPARC-10 workstation) to
find an optimal schedule in area cost (Tf) and the time required
to optimize the schedule throughput and reduce register pressure
(Tr). We have considered MaxII = 50 for all the examples.
Note that an optimal solution (by taking resource requirements
into account) is achieved in almost all cases.
Table 1: Cytron's example
Columns: T | LB FUs (*, A) | FUs (*, A) | Results (MII, II, R, K/II) | CPU (secs) (Tf, Tr)
Table 2: Differential Equation (FU A is an ALU)
³When comparing TCLP to [1], TCLP obtains schedules requiring less area,
because [1] is an ILP approach which does not perform loop pipelining.
28 | 1 1 | 1 1 | 26 | 28 | 12 | 1/28 | 0.70 | 1.70
Table 3: Fifth-Order Elliptic Filter with Non-Pipelined Multipliers
Table 4: Fifth-Order Elliptic Filter with Pipelined Multipliers
6 Conclusions 
This paper has presented TCLP, a new approach for loop pipelin- 
ing with timing constraints. TCLP is divided into three main 
phases. First, a schedule with minimum resource requirements is 
found for a given throughput. Next, the throughput is increased 
by exploring different unrolling degrees of the loop. Finally, the 
number of registers is reduced while maintaining the throughput. 
TCLP achieves optimal results in almost all cases. We have shown
several examples to illustrate its efficacy. 
References 
[1] H. Achatz. Extended 0/1 LP formulation for the scheduling problem in
high-level synthesis. In Proc. European Design Automation Conf., pages 226-231,
1993.
[2] L-F. Chao, A. LaPaugh, and E. H-M. Sha. Rotation scheduling: a loop
pipelining algorithm. In Proc. of the 30th Design Automation Conf., pages
566-572, June 1993.
[3] M.R. Garey and D.S. Johnson. A Guide to the Theory of NP-Completeness.
W. H. Freeman and Company, 1979.
[4] G.H. Hardy and E.M. Wright. An Introduction to the Theory of Numbers.
Oxford University Press, 1979.
[5] C-T. Hwang, J-H. Lee, and Y-C. Hsu. A formal approach to the scheduling
problem in high level synthesis. IEEE Trans. on CAD, 10(4):464-475, April
1991.
[6] R. Karp. A characterization of the minimum cycle mean in a digraph. Discrete
Mathematics, 23:309-311, 1978.
[7] D.C. Ku and G. De Micheli. Relative scheduling under timing constraints. In
Proc. of the 27th Design Automation Conf., pages 59-64, June 1990.
[8] C.E. Leiserson, F. Rose, and J. Saxe. Optimizing synchronous circuitry by
retiming. In Proc. Third Caltech Conf. on VLSI, pages 87-116, March 1987.
[9] D.J. Mallon and P.B. Denyer. A new approach to pipeline optimization. In
Proc. European Conf. on Design Automation, pages 83-88, 1990.
[10] J. Nestor and G. Krishnamoorthy. SALSA: A new approach to scheduling
with timing constraints. In Proc. Int. Conf. Computer-Aided Design, pages
262-265, November 1990.
[11] J. Nestor and D.E. Thomas. Behavioral synthesis with interfaces. In Proc. Int.
Conf. Computer-Aided Design, pages 112-115, November 1986.
[12] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral
synthesis of ASICs. IEEE Trans. on CAD, 8(6):661-679, June 1989.
[13] M. Potkonjak and J. Rabaey. Optimizing resource utilization using
transformations. IEEE Trans. on CAD, 13(3):277-292, March 1994.
[14] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily
schedulable horizontal architecture for high performance scientific computing. In
Proc. of the 14th Annual Workshop on Microprogramming, pages 183-198,
October 1981.
[15] F. Sánchez and J. Cortadella. Resource-constrained pipelining based on loop
transformations. Microprocessing and Microprogramming, 38(1-5):429-436,
September 1993.
[16] F. Sánchez and J. Cortadella. Resource-constrained software pipelining for
high-level synthesis of DSP systems. In Marc Moonen and Francky Catthoor,
editors, Algorithms and Parallel VLSI Architectures III, pages 377-388, 1995.
[17] F. Sánchez and J. Cortadella. Time-constrained loop pipelining. Technical
Report RR-1995/11, UPC-DAC, April 1995.
