Scheduling and Allocation of Non-Manifest Loops on Hardware Graph-Models by Mansour, O. et al.
O. Mansour S. Etalle T. Krol
University of Twente, Department of Computer Science
P.O. Box 217, 7500 AE Enschede, the Netherlands
Phone: +31 (0)53 4894178 Fax: +31 (0)53 4894590
E-mail: {o.mansour, etalle, krol}@cs.utwente.nl
Abstract— In this paper we address the problem of
scheduling non-manifest, data-dependent periodic loops for
high-throughput DSP applications based on a streaming
data model. In contrast to manifest loops, non-manifest
data-dependent loops are loops in which the number of
iterations needed to perform a calculation depends on the
data and is hence not known at compile time. For
the case of manifest loops, static scheduling techniques
have been devised which produce near optimal schedules
[1]. Due to the lack of exact run-time execution knowl-
edge of non-manifest loops, these static scheduling tech-
niques are not suitable for tackling scheduling problems
of DSP-algorithms with non-manifest loops embedded in
them. We consider the case where (a) a priori knowledge
of the data distribution and (b) the worst-case execution
time of the non-manifest loop are known, and a constraint
on the total execution time is given. Under these
conditions, dynamic scheduling of the non-manifest
data-dependent loops within the DSP algorithm is possible.
We show how to construct hardware that dynamically
schedules these non-manifest loops. The sliding window
execution of the constructed hardware, that is, the
execution of a non-manifest loop while the data streams
through it, guarantees real-time performance in the
worst-case situation, in which each non-manifest loop
requires its maximum number of iterations.
Keywords— Non-manifest loop scheduling, periodic
loops, dynamic hardware scheduling.
I. Introduction
HIGH-LEVEL synthesis translates behavioral descriptions
written in a high-level language (HLL) such as C or C++
into hardware network structures written in VHDL or
Verilog. This translation starts by converting the
behavioral description to a control data flow graph
(CDFG) [5] and then
performing a number of optimizations such as dead code
removal, constant propagation, common subexpression
elimination, tree height reduction, code motion, loop
unrolling, inlining and finally scheduling and allocation
onto hardware resources. In digital signal processing
(DSP) and video signal processing (VSP) applications,
many algorithms have a repetitive and periodic nature
[1]: the same computations must be executed on the
arrival of each new data block. Some loops within a
computation require a constant number of clock cycles
in their loop body and have a fixed number of
iterations; they thus have a fixed total execution
length. Such loops are called manifest loops.
Non-manifest data-dependent loops, on the other hand,
are those in which the number of iterations required
to perform a computation is data dependent; they hence
have a variable total execution length.
When performing a functional unit computation,
one can distinguish two types of functional units:
(1) analytic and (2) non-analytic, or soft, functions.
Analytic functions are those that have one exact answer;
calculating that answer takes a variable number of clock
cycles, which depends on the input data. Non-analytic
functional units, on the other hand, converge to the
required result over time: the quality of the result
improves with each iteration. Examples of such units
are the Taylor series expansion and the
MPEG decoding algorithm. When scheduling loops
of non-analytic functions one can statically schedule
the loop and set its execution to a fixed number of
iterations. The quality of the result in the case of too
few iterations would be sacrificed and in the case of
too many iterations we lose the remaining valuable
clock cycles. This shows us that static scheduling of
non-manifest non-analytic functional operations is not
without cost.
Formal techniques for solving the scheduling prob-
lem of manifest loops using Integer Linear Program-
ming (ILP) have been devised in [1]. By modeling
the periodicity of operations using a bounded period
vector denoting the periods between two consecutive
iterations, determining a start time and processing
unit on which the operation is to be executed, a feasi-
ble scheduling solution under area-, processing unit-,
and timing-constraints can be found. Because the ILP
technique requires prior knowledge of each operation's
duration, it is not suitable for scheduling algorithms
with non-manifest loop behavior. In this paper we focus
on scheduling non-manifest loops with an analytic result.
Section II describes the formal scheduling problem and
its constraints, section III describes related work on
variable latency functional units, and sections IV, V
and VI describe the dynamic scheduling solution we
present, the experimental results and the conclusions.
II. The Problem
Consider the greatest common divisor (gcd()) func-
tion as an example:
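#include <assert.h>

/* Greatest common divisor; both arguments must be positive. */
int gcd(int x, int y) {
    int g;
    assert((x > 0) && (y > 0));
    g = y;
    while (x > 0) {   /* number of iterations depends on the data */
        g = x;
        x = y % x;
        y = g;
    }
    return g;
}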
Fig. 1. gcd algorithm
The gcd() function is a typical example of an analytic
non-manifest loop. The function is analytic because it has
only one correct answer, and non-manifest because the
number of loop iterations depends on the values of x and
y, which are not known at compile time. If the gcd() loop
has to operate in a manifest environment on a continuous
stream with a sliding window of n 16-bit input pairs
(x1, y1), (x2, y2), ..., (xn, yn), each input pair (xi, yi)
has a computation load in the range [1..23] clock cycles
(see Fig. 2), and hence the total execution time of the
sliding window lies in the range [n..(23 × n)] clock cycles.
In a high-throughput real-time system this is not
acceptable, and a fixed execution delay is preferred.
The question is: if the execution load of the sliding
window is known in advance to have a predefined
maximum workload WLmax, can we find a dynamic
scheduler with real-time constraints and a fixed output
delay?
In this paper we show that a real-time hardware dynamic
scheduler is indeed feasible. Before we proceed to the
details of the scheduler, we provide some formal
definitions and generalize the scheduling problem.
x y g
28657 , 46368 , 46368
17711 , 28657 , 28657
10946 , 17711 , 17711
6765 , 10946 , 10946
4181 , 6765 , 6765
2584 , 4181 , 4181
1597 , 2584 , 2584
987 , 1597 , 1597
610 , 987 , 987
377 , 610 , 610
233 , 377 , 377
144 , 233 , 233
89 , 144 , 144
55 , 89 , 89
34 , 55 , 55
21 , 34 , 34
13 , 21 , 21
8 , 13 , 13
5 , 8 , 8
3 , 5 , 5
2 , 3 , 3
1 , 2 , 2
0 , 1 , 1
gcd(46368, 28657) == 1,
number of iterations == 23
x y g
0 , 4 , 4
gcd(4, 32768) == 4,
number of iterations == 1
Fig. 2. (a) Worst case execution versus (b) Best case
execution of a gcd algorithm for 16 bit integer values
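The iteration counts of Fig. 2 can be reproduced with a small instrumented variant of the gcd() function of Fig. 1. The following is a minimal sketch only; the name gcd_count and the out-parameter iterations are illustrative assumptions and do not appear in the original algorithm.

```c
#include <assert.h>

/* Hypothetical instrumented variant of gcd() from Fig. 1: returns the gcd
 * and stores the number of executed loop iterations in *iterations. */
int gcd_count(int x, int y, int *iterations)
{
    int g;
    assert((x > 0) && (y > 0));
    g = y;
    *iterations = 0;
    while (x > 0) {
        g = x;
        x = y % x;
        y = g;
        (*iterations)++;   /* one loop body execution = one unit of load */
    }
    return g;
}
```

With this sketch, the input pair (46368, 28657) yields a gcd of 1 after 23 iterations and the pair (4, 32768) yields a gcd of 4 after 1 iteration, which corresponds to the two traces of Fig. 2.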
A. Mathematical model and definitions
Definition 1 (computation load) A data dependent
non-manifest loop algorithm A generates a computa-
tion load of CLA(v) cycles depending on the input
value v.
Note: For a non-manifest algorithm, computation
load is a property of the data and it is the data which
causes the algorithm to consume a variable number of
clock cycles during its execution.
Definition 2 (computation capacity of a resource) The
hardware implementation of a data dependent non-
manifest loop algorithm is called a hardware resource
or resource in short. A resource R provides Cres com-
putation cycles (or iterations) per unit time.
For the rest of this paper we assume that all re-
sources have the same computation capacity, and that
one time unit is equivalent to one computation cycle
hence Cres = 1. We also assume that an input is provided
to the system at each time unit; if not, we consider
the input to be NULL, with CLA(NULL) = 0.
Furthermore, the inputs are restricted to the domain
154 PROGRESS 2001
(type) I, and the number of cycles generated by the
non-manifest algorithm A is considered to be at least
CLminA and at most CLmaxA, where in most cases
CLminA = 0. Hence
$$ \forall v \in I: \quad CL_{\min}^{A} \le CL_A(v) \le CL_{\max}^{A} \qquad (1) $$
Definition 3 (Sliding window workload) The work-
load generated over a window of length m time units
starting at time t is denoted by WL(t ,m) cycles.
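$$ WL(t, m) = \sum_{j=t}^{t+m-1} CL_A(v_j) \qquad \forall t, m \in \mathbb{N} \qquad (2) $$
where vj is the input value at time j.
Since each term of the sum is at most CLmaxA by (1), the following holds:
$$ WL(t, m) \le m \times CL_{\max}^{A} \qquad (3) $$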
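The sliding window workload of Definition 3 can be sketched in C as follows. This is an illustration under stated assumptions: the array cl is assumed to hold the computation loads CLA(vj) of a recorded stream, and the function names window_workload and workload_bounded are hypothetical helpers that are not part of the original work.

```c
#include <assert.h>
#include <stddef.h>

/* Sums the computation loads cl[t] .. cl[t+m-1], i.e. WL(t, m) of (2).
 * cl[j] is assumed to hold CL_A(v_j); len is the number of known samples. */
long window_workload(const int *cl, size_t len, size_t t, size_t m)
{
    long wl = 0;
    assert(t + m <= len);
    for (size_t j = t; j < t + m; j++)
        wl += cl[j];
    return wl;
}

/* Returns 1 if every complete window of length m in cl[0..len-1]
 * satisfies WL(t, m) <= bound, and 0 otherwise. */
int workload_bounded(const int *cl, size_t len, size_t m, long bound)
{
    for (size_t t = 0; t + m <= len; t++)
        if (window_workload(cl, len, t, m) > bound)
            return 0;
    return 1;
}
```

A check such as workload_bounded could be used to confirm that a recorded input stream respects a given workload bound before it is fed to a scheduler model.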
In this paper we assume that we are given a sliding
window m ∈ N and an upper bound B to WL(t ,m)
such that, CLmax ≤ B ≤ m × CLmax . If the upper
bound B = m × CLmax, the system would require
at least m resources in order to produce the output
within a time frame of m time units.
Definition 4 (Computation capacity) If the available
number of resources is Nres, then the computation
capacity available in the window [t, t+m−1] is
$$ C_{cap} = m \times N_{res} \times C_{res} \qquad \forall m \in \mathbb{N} \qquad (4) $$
cycles.
In a real-time system the latency of a computational
unit plays a big role, and it is important to know the
latency of the system in advance. If we are processing
a stream of data, the worst-case latency of a single
computation is CLmax; hence the minimum bound on the
latency is CLmax. The maximum allowable latency, on the
other hand, is the latency obtained when we have only one
resource R, which in this case is WL(t, m) cycles. In a
real-time system, a sliding window of length m is usually
specified. If the latency of the system exceeds the period
of m × Cres cycles, the output of the system will lag
behind the input and delays can accumulate; if delays
accumulate, the system will not meet its real-time
constraints. Hence in this paper we assume that the
latency does not exceed m × Cres; in fact, the latency is
bounded by CLmax ≤ Lat ≤ min(WL(t, m), m × Cres). In
section IV we provide a scheduling system with the
minimum allowable latency (CLmax).
B. Problem formulation
The scheduling and allocation problem to be solved
can now be formulated as follows:
Given an input data stream with a known maximum
workload bound B on a stream window of size m,
hence WL(t, m) ≤ B, and a data-dependent non-manifest
loop algorithm A with known bounds CLminA
and CLmaxA, devise a real-time hardware scheduler
that meets the workload WL(t, m) of the system and
produces an output that is synchronous with the input
within a time frame of at most m time units, and
determine the resources of the system and their
allocation.
III. Related Work
Variable latency components are in a way similar to
non-manifest loops. Both types of components have a
data-dependent latency. In the case of variable la-
tency components, the latency of the loop-body is
data dependent and the number of iterations is a con-
stant. In the case of non-manifest loops, the latency
of the loop-body is a constant and the number of loop
iterations is variable based on the input data.
Silvia M. Mueller [2] addresses the problem of dynamically
scheduling the body part of a variable latency functional
unit. She observes that the scheduling of multiple
functional units can be split into two parts.
(a) Global scheduling governs the interaction between the
functional units. Many scheduling algorithms cannot cope
with variable latency units, as they require prior
knowledge of the latency and the required resources.
However, schedulers based on the Tomasulo algorithm make
no prior assumption about the latency of the functional
units and hence are suited to the global scheduling problem.
(b) Local scheduling is the scheduling of resources within
a variable latency unit. Because of multiple paths within
the body part of a variable latency unit, a situation can
exist where instructions compete for resources. It is the
task of the local scheduler to ensure that, within a
functional unit, there is no contention on the buses and
that no data is lost. The local scheduler she devises
schedules instructions competing for a resource based on
their age: the oldest instruction in the stream gets the
resource. This ensures that the latency of the functional
units is never increased, as it is in the case of schedulers
based on simple FIFO queues.
Vijay Raghunathan, Srivaths Ravi, and Ganesh
Lakshminarayana [3] address the problem of integrat-
Workshop on Embedded Systems 155
ing variable latency components into high-level syn-
thesis. In their work they show that with variable
latency components throughput could be gained if
the components were properly placed on the critical
paths. Improper placement of the components could
lead to decreased throughput. Since extra overhead
imposed by variable latency components usually leads
to increased chip area, they present a technique to fur-
ther reduce the chip area overhead. This technique is
based on the concept of reduced latency units.
L. Benini, E. Macii, M. Poncino, and G. De Micheli
[4] address the performance issues related to the
optimization of VLSI designs. In their work they
introduce the concept of telescopic units, which is in
essence another term for variable latency units. They
show that every constant latency unit can be trans-
formed into a variable latency unit by adding a signal
for detecting completion. In this case the output of
the circuit is available as soon as it is ready as op-
posed to being constrained by the worst case delay of
the circuit.
IV. Scheduling of Non-Manifest Loops
In this section we describe how to solve the schedul-
ing problem and implement a real time hardware sys-
tem which can meet the WL(t ,m) demands.
The number of cycles needed to calculate the output
cannot be determined from the input value in advance:
CLA(vj), which is bounded by 0 ≤ CLA(vj) ≤ CLmax,
is unknown until the value vj has been evaluated by
the algorithm. The strategy we use is to allocate the
earliest value vj that has not yet been calculated to
the first free resource. If no resources are free, the
value vj has to wait in a wait queue until a resource
becomes free. Since the wait queue handles the input
values in a FIFO manner, starvation is not possible.
Formally, we say that the input specification
〈m, B, CLmax〉, consisting of a window of size m, a
maximum workload bound B and a non-manifest loop
algorithm with maximum number of iterations CLmax,
is schedulable with Nres resources if, for every stream
of inputs [v1 . . . vj . . .] complying with the input
specification 〈m, B, CLmax〉, each input value vj is
processed within m time units (i.e. it is ready by time
unit j + m).
Theorem 1 (system schedulability) If
$$ m \ge CL_{\max} + \left\lfloor \frac{B - CL_{\max} + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} \right\rfloor - N_{res} \qquad (5) $$
then the input specification consisting of a window m,
a maximum workload bound B and a non-manifest algorithm
with maximum number of iterations CLmax is
schedulable with Nres resources.
Proof: We have to show that, given the input
values v1 . . . vj, vj is processed before time unit j + m.
This implies that a resource must become free no later
than time j + m − (CL(vj)/Cres). In the rest of
this proof we assume that Cres = 1; this
simplifies the proof without affecting its validity.
It is not restrictive to assume that
(1) CL(vj) ≠ 0 ∀j, and
(2)
$$ j \ge N_{res} + 1 \qquad (6) $$
If (6) does not hold, then at time unit j there
is at least one resource free and the thesis follows from
the fact that CL(vj) ≤ CLmax ≤ m.
The situation is the following:
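[Figure 3: timing diagram over time t showing the computations CL(vj-4), CL(vj-3), CL(vj-2), CL(vj-1) and CL(vj) on the Nres resources, with time markers tk1, tk2, j-1 and lj.]
Fig. 3. Example with Nres = 4 in which one computation
has ended while the rest are still busy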
At the start of the schedule all resources are free,
and only one input value can be allocated to a single
resource per time unit until all resources have been
allocated. Thus the wasted number of cycles, which is the
area on the left of figure 3, measures Nres × (Nres − 1)/2.
We have to prove that
$$ l - j \le m \qquad (7) $$
where l is the time at which the computation of vj completes,
while by hypothesis we know that
$$ \sum_{i=1}^{j-1} CL(v_i) + CL(v_j) \le B \qquad (8) $$
Now we can assume that the computations
CL(v1) . . . CL(vj−1) end at the same time step
(+/− 1 step). In fact, the resources entering the
shaded rectangular area are wasted in (8) and do not
modify (7). Intuitively, for a given workload bound
B, the time taken before the computation of vj can
start is at a maximum if all the previous computations
CL(v1) . . . CL(vj−1) end at the same time.
The situation is then as shown in figure 4: after
processing CL(v1) . . . CL(vj−1), the first free resource
is found at time
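[Figure 4: timing diagram over time t showing the computations CL(vj-4), CL(vj-3), CL(vj-2), CL(vj-1) and CL(vj) on the Nres resources, with time markers tk1, j-1 and lj.]
Fig. 4. Example with Nres = 4 where all computations
end at the same time instant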
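$$ \left\lfloor \frac{\sum_{i=1}^{j-1} CL(v_i) + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} \right\rfloor + 1 \qquad (9) $$
and (7) becomes:
$$ \left\lfloor \frac{\sum_{i=1}^{j-1} CL(v_i) + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} \right\rfloor + 1 + CL(v_j) \le j + m \qquad (10) $$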
We now try to maximize the l.h.s. of (10). First,
we take CL(vj) = CLmax. If CL(v1) + . . . + CL(vj−1) +
CLmax ≤ B we can simply assume it. If not, it is
better to choose v̄1 . . . v̄j−1 such that
$$ \sum_{k=1}^{j-1} CL(\bar{v}_k) + CL_{\max} = B . $$
In fact, one can then choose CL(v̄j) = CLmax, and it is
easily seen that
$$ \frac{\sum_{i=1}^{j-1} CL(v_i) + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} + CL(v_j) \le \frac{\sum_{i=1}^{j-1} CL(\bar{v}_i) + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} + CL_{\max} $$
(equality holds for Nres = 1).
Secondly, since CL(v1) + . . . + CL(vj) ≤ B and CL(vj) = CLmax,
$$ \sum_{k=1}^{j-1} CL(v_k) \le B - CL_{\max} $$
and (10) becomes:
$$ \left\lfloor \frac{B - CL_{\max} + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} \right\rfloor + 1 + CL_{\max} \le j + m \qquad (11) $$
By (6), substituting j = Nres + 1 in (11), we
have:
$$ m \ge CL_{\max} + \left\lfloor \frac{B - CL_{\max} + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} \right\rfloor - N_{res} \qquad (12) $$
which concludes the proof.
This result can be shown to be optimal for small
workload bounds B, and it can be shown that if
$\lfloor (B - CL_{\max}) / CL_{\max} \rfloor \ge N_{res}$, (5) might yield a sub-optimal
evaluation. In practice, experiments have shown that
in the majority of cases the criterion provided by
Theorem 1 is very close to the optimum.
A. Design constraints
In a design process there is usually a tradeoff be-
tween the number of resources and the latency of the
system. Therefore in order to implement the sched-
uler, we need to calculate the required number of re-
sources that are sufficient for processing a workload
bound B within a window of length m and the actual
latency of the system. This subsection focuses on how
to achieve the design constraints by either minimizing
the latency of the system, which implies that we max-
imize the number of resources, or minimizing the re-
sources, which implies maximizing the latency of the
system.
Remark 1 (Number of resources) Theorem 1 can
be used to find a safe (possibly sub-optimal) number
of resources. This number
can be found via the following simple iteration:
for n := 1 to INFINITY
do
    if (5) holds for Nres = n then break;
done
end
Fig. 5. A loop for finding the number of resources
The loop simply checks whether (5) holds for the
current value Nres = n. Since we iterate starting from
1, the first value of n that satisfies the inequality is
the minimum number of resources we can use. Since
the r.h.s. of (5) tends to −∞ as Nres → ∞, this loop
terminates.
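The iteration of Fig. 5 can be written out in C as follows. This is a minimal sketch, assuming B ≥ CLmax ≥ 0 and m ≥ 1; the function names schedulability_rhs and min_resources are illustrative and do not come from the original work.

```c
#include <assert.h>

/* Right-hand side of (5) for n resources. Integer division implements the
 * floor because B - CLmax + n(n-1)/2 is non-negative when B >= CLmax,
 * and n(n-1) is always even, so n(n-1)/2 is exact. */
long schedulability_rhs(long b, long cl_max, long n)
{
    assert(n >= 1 && b >= cl_max && cl_max >= 0);
    return cl_max + (b - cl_max + n * (n - 1) / 2) / n - n;
}

/* Smallest n for which (5) holds, following the loop of Fig. 5.
 * The loop ends because the right-hand side decreases without bound. */
long min_resources(long m, long b, long cl_max)
{
    long n = 1;
    assert(m >= 1);
    while (m < schedulability_rhs(b, cl_max, n))
        n++;
    return n;
}
```

For m = 14, B = 30 and CLmax = 10, the right-hand side of (5) evaluates to 29, 18 and 14 for n = 1, 2 and 3 respectively, so the sketch returns 3 resources.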
Since a resource can be busy with a computation
for at most CLmax cycles, after which it is free for
reuse, the maximum number of resources ever needed is
Nresmax = CLmax. In that case the system is able to
process each value immediately and
WL(t, m) can be equal to m × CLmax; thus at each
time instant we could process an input requiring the
maximum processing power. In order to meet a workload
WL(t, m) ≤ B within a window of length m time
units, at least Nresmin resources are needed. Hence
the actual number of resources needed for a system is
bounded by Nresmin ≤ Nres ≤ CLmax.
If WL(t, m) is not restricted in any way other
than by CLmax, then
$$ N_{res_{min}} = \frac{m \times CL_{\max}}{m \times C_{res}} = \frac{CL_{\max}}{C_{res}} \qquad (13) $$
Knowing the minimum number of resources in the
system, we can now calculate the actual latency of the
system in a similar way. The system latency Lat can
be calculated by finding the minimum value of m that
satisfies (5).
Theorem 2 (Unconstrained system latency) If
$N_{res} \ge CL_{\max} / C_{res}$
then the worst-case latency in time units is:
$$ Lat = \frac{CL_{\max}}{C_{res}} \qquad (14) $$
Proof: For each new input value vt there is a
free resource available for the computation, and since
each computation can consume at most CLmax/Cres
time units, the output is produced at the time
instance t + δ where δ ≤ CLmax/Cres. In the worst-case
situation δ = CLmax/Cres, which concludes the proof.
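In the case Nres ≤ CLmax/Cres, Theorem 1 can be
used to establish an upper bound for the worst-case
latency. In fact, it suffices to choose
$$ Lat \ge CL_{\max} + \left\lfloor \frac{B - CL_{\max} + \frac{N_{res} \times (N_{res}-1)}{2}}{N_{res}} \right\rfloor - N_{res} \qquad (15) $$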
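The latency bound obtained from Theorem 1 can be evaluated for a given number of resources as sketched below. The function name latency_bound is a hypothetical helper, Cres = 1 is assumed as in the rest of the paper, and clamping the result to CLmax is our own assumption, based on the lower bound CLmax ≤ Lat of section II.

```c
#include <assert.h>

/* Smallest latency satisfying the bound derived from Theorem 1 for n_res
 * resources, assuming Cres = 1. The result is never below cl_max, the
 * lower bound on the latency (an assumption of this sketch). */
long latency_bound(long b, long cl_max, long n_res)
{
    long rhs;
    assert(n_res >= 1 && b >= cl_max && cl_max >= 0);
    rhs = cl_max + (b - cl_max + n_res * (n_res - 1) / 2) / n_res - n_res;
    return rhs > cl_max ? rhs : cl_max;
}
```

For B = 30 and CLmax = 10, this gives a latency of 14 time units with three resources and 29 time units with a single resource.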
Finally, the computations can vary in their execution
length, so in order for the scheduling system to produce
an output that is synchronous with its input, we require
that each computation has the latency of the worst-case
computation.
B. A scheduling example
We demonstrate the scheduling process by means
of the following example:
/*
 * algorithm example
 */
forever:
    out = A(v);
Fig. 6. System black box
A non-manifest algorithm A is to be implemented
in a manifest environment. Algorithm A has to process
an input sample on each clock cycle. We know
from the specification of the input data stream that
WL(t, 14) ≤ 30 clock cycles, and from the
specification of algorithm A that CLmaxA is
10 clock cycles (see table I).
TABLE I
The scheduling constraints and input data stream specifications

Load type      Range    Units
CL(v)          1..10    CLKs
WL(t, 14)      ≤ 30     CLKs
Window size    14       CLKs

From this information we find, using (5), that the minimum
satisfactory number of resources Nres is 3.
The scheduler works as follows (see figure 7).
An input data sample is allocated to a free resource if
one is available and no data are waiting in the wait queue;
otherwise it is placed in the wait queue, and its time of
arrival is placed in the time queue.
We have three resources in the system, each resource can
take up to 10 clock cycles to execute its computation, and
the scheduler allocates the waiting data and/or the input
data to the free resources in no specific order. As a
result, the output stream produced by the three resources
does not always have the same order as the input data stream.
This is called out-of-order data production. In order
to produce the output stream in the same order as the
input data stream, we reorder the output data using a
reorder shift buffer (see figure 8).
The reorder buffer is responsible for rearranging the
output such that it will be in order with the input
stream. This can be done as follows: The worst case
latency of the system is 14 clock cycles, hence in or-
der to synchronize the input stream with the output
stream the produced output must be delayed by the
remaining clock cycles not used while executing its
computation. This makes the delay time plus the ex-
ecution time an invariant with a constant of 14 clock
cycles.
The output data is written into the reorder buffer
at the address Ad, which is calculated by the following
equation:
$$ Ad = Lat - |T_{current} - T_{start}| \qquad (16) $$
where Lat is the calculated latency of the system,
Tcurrent is the time at which the output is produced and
Tstart is the time at which its input data arrived. Since
only one input data sample arrives at each clock cycle,
no more than one output data sample can be
written to the same address. Once all outputs have
been written to the reorder buffer, all positions of the
buffer shift one place, Ad[lat − 1] → Ad[lat −
2] . . . → Ad[1] → Ad[0], hence the data
coming out of the least significant address Ad[0] is the
reordered output data stream.
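The reorder buffer can be modelled in C as a small shift register. The sketch below is illustrative only: the latency LAT = 5, the EMPTY marker and the names rb_init, rb_write and rb_shift are assumptions made for the example, and the exact cycle in which position 0 is sampled (here: after the writes of a cycle and before the shift) is our own modelling choice.

```c
#include <assert.h>

#define LAT   5     /* system latency of this sketch (assumption) */
#define EMPTY (-1)  /* marks an unused buffer position */

typedef struct { int slot[LAT]; } reorder_buffer;

void rb_init(reorder_buffer *rb)
{
    for (int i = 0; i < LAT; i++)
        rb->slot[i] = EMPTY;
}

/* Writes a result produced at time t_current, for an input that arrived
 * at t_start, to address Ad = LAT - |t_current - t_start| as in (16). */
void rb_write(reorder_buffer *rb, int value, int t_start, int t_current)
{
    int elapsed = t_current - t_start;
    if (elapsed < 0)
        elapsed = -elapsed;
    assert(elapsed >= 1 && elapsed <= LAT);
    assert(rb->slot[LAT - elapsed] == EMPTY);
    rb->slot[LAT - elapsed] = value;
}

/* Emits position 0 and shifts all positions one place towards address 0. */
int rb_shift(reorder_buffer *rb)
{
    int out = rb->slot[0];
    for (int i = 0; i + 1 < LAT; i++)
        rb->slot[i] = rb->slot[i + 1];
    rb->slot[LAT - 1] = EMPTY;
    return out;
}
```

If rb_write is called for all results of a clock cycle and rb_shift is then called once per cycle, a value that arrived at time Tstart leaves the buffer at time Tstart + LAT, regardless of when its computation finished, so the output order follows the input order.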
C. The scheduling algorithm
The scheduling algorithm allocates the input data
to the first free resource. If no resources are available,
the input data has to wait in a wait queue until a
resource becomes free. The maximum size of the wait
queue can be calculated by subtracting the worst-case
computation load CLmax from the calculated latency of
the system:
$$ WQ_{size} = Lat - CL_{\max} \qquad (17) $$
Figure 7 is a simplified version of the C sample used
during simulation of the algorithm.
void schedule(int input_data, int time) {
    int avail, waiting, ready;

    ready = get_ready_resources();
    avail = get_free_resources();
    waiting = get_wait_queue_occupation();

    while (ready > 0) {
        /*
         * write the output of ready resources
         * to the reorder buffer at address
         * (latency - abs(start_time - current_time))
         */
        ready--;
    }

    if (avail > 0) {
        while (waiting > 0 && avail > 0) {
            allocate(wait_queue->get(), time_queue->get());
            waiting--;
            avail--;
        }
        if (avail > 0) {
            allocate(input_data, time);
        } else {
            wait_queue->put(input_data);
            time_queue->put(time);
        }
    } else {
        wait_queue->put(input_data);
        time_queue->put(time);
    }
}
Fig. 7. The scheduling algorithm
The function allocate(input , time) basically allo-
cates the input data to the first free resource. It starts
checking the availability by iterating through the re-
sources starting from the resource with the lowest in-
dex number.
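The allocation policy of Fig. 7 can be exercised with a small self-contained simulation. The sketch below is not the simulator used for the results in this paper: the function name simulate_max_latency, the queue capacity QCAP and the timing model (a value allocated at cycle t with load c completes at cycle t + c) are assumptions. For brevity every arrival passes through the FIFO queue, which yields the same allocation order as serving waiting data before the new input.

```c
#include <assert.h>

#define NRES 3   /* number of resources, as in the example of section IV-B */
#define QCAP 64  /* wait-queue capacity of this sketch (assumption) */

/* Feeds loads cl[0..n-1], one per clock cycle, to NRES resources using a
 * FIFO wait queue, and returns the largest observed latency (completion
 * time minus arrival time). A load of 0 models a NULL input. */
int simulate_max_latency(const int *cl, int n)
{
    int busy_until[NRES] = {0};      /* cycle at which each resource frees */
    int q_load[QCAP], q_time[QCAP];  /* wait queue and time queue */
    int head = 0, tail = 0, max_lat = 0;

    for (int t = 0; t < n || head < tail; t++) {
        if (t < n && cl[t] > 0) {    /* new arrival joins the FIFO */
            assert(tail - head < QCAP);
            q_load[tail % QCAP] = cl[t];
            q_time[tail % QCAP] = t;
            tail++;
        }
        /* oldest waiting value goes to the free resource with lowest index */
        for (int r = 0; r < NRES && head < tail; r++) {
            if (busy_until[r] <= t) {
                int lat;
                busy_until[r] = t + q_load[head % QCAP];
                lat = busy_until[r] - q_time[head % QCAP];
                if (lat > max_lat)
                    max_lat = lat;
                head++;
            }
        }
    }
    return max_lat;
}
```

For the load sequence 7, 7, 6, 10, which has a workload of 30 within a window of 14 and a maximum load of 10, the last value waits four cycles for a free resource and the sketch reports a maximum latency of 14 cycles, the value allowed by Theorem 1 for three resources.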
D. The scheduler’s hardware model
In figure 8 we present a block model of the scheduler and
the resources. The model consists mainly of the
following parts:
• data queue
• time queue
• time counter register (Time)
• start time resource registers (start1 . . . start3 )
• computation resources (R1 . . . R3)
• address calculation unit
• reorder buffer
• data routing multiplexers
On each clock cycle the time register is incremented
by the controller, and the input data is allocated
either to the wait queue or to a free resource, if one is
available. If the input data is allocated to the wait queue,
the accompanying value of the Time register is stored
in the time queue; this preserves the time of arrival
(start time) of the data. Depending on
the number of free resources, one, two or three data
allocations, with their accompanying start times,
take place. When resources have produced their outputs,
the address calculation takes place and
one, two or three output data values are written to
the reorder buffer at different addresses. The reorder
buffer positions are then shifted and the value of reorder
buffer[0] becomes the output of the system.
[Figure 8: block diagram with blocks labelled Input stream, Time, Data Queue, Time Queue, Mux, R1 to R3, Start 1 to Start 3, Address selection (Ad1 to Ad3, D1 to D3), Reorder Buffer and Output stream.]
Fig. 8. The hardware model of the scheduler; the
controller and all control signals are omitted from the
figure for simplicity
V. Simulation results
In figure 9 we present simulation results of the
scheduler. For the simulation we generated a ran-
dom stream of data with a WL(t , 14 ) ≤ 30 . Column
1 of figure 9 presents the time clock, 2 presents the
input data stream, 3 is output data produced by the
resources, 4 is the amount of elements in the wait
queue, 5 status of the wait buffer, 6 presents the re-
order buffer status, 7 is the output of the system and
finally 8 presents the actual workload value over a
window size of 14.
The value of the data represents the computation
load CL of a resource for a single computation and
simulation settings ==> M=14 , CL_MAX=10 , WL-max=30 , Latency=14
-1-|-2-| -3- | -4- |-5-| -6- |-7- |-8-
CLK|dat| | |num| 13 12 11 10 9 8 7 6 5 4 3 2 1 |out |
===|===|==========|======================|===|========================================|====|===
0 | 1 | .. .. .. | R1( 0) ...... ...... | 0 | -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 | -1 | 30
1 | 4 | 0 .. .. | R1( 1) ...... ...... | 0 | -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 | -1 | 27
2 | 2 | .. .. .. | r1 R2( 2) ...... | 0 | -1 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 | -1 | 26
3 | 2 | .. .. .. | r1 r2 R3( 3) | 0 | -1 -1 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 | -1 | 29
4 | 1 | .. 2 .. | r1 R2( 4) r3 | 0 | -1 -1 2 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 | -1 | 30
5 | 1 | 1 4 3 | R1( 5) ...... ...... | 0 | -1 4 3 2 1 0 -1 -1 -1 -1 -1 -1 -1 | -1 | 30
6 | 2 | 5 .. .. | R1( 6) ...... ...... | 0 | -1 5 4 3 2 1 0 -1 -1 -1 -1 -1 -1 | -1 | 29
7 | 1 | .. .. .. | r1 R2( 7) ...... | 0 | -1 -1 5 4 3 2 1 0 -1 -1 -1 -1 -1 | -1 | 29
8 | 3 | 6 7 .. | R1( 8) ...... ...... | 0 | -1 7 6 5 4 3 2 1 0 -1 -1 -1 -1 | -1 | 29
9 | 5 | .. .. .. | r1 R2( 9) ...... | 0 | -1 -1 7 6 5 4 3 2 1 0 -1 -1 -1 | -1 | 27
10 | 3 | .. .. .. | r1 r2 R3(10) | 0 | -1 -1 -1 7 6 5 4 3 2 1 0 -1 -1 | -1 | 29
11 | 1 | 8 .. .. | R1(11) r2 r3 | 0 | -1 -1 -1 8 7 6 5 4 3 2 1 0 -1 | -1 | 30
12 | 1 | 11 .. .. | R1(12) r2 r3 | 0 | -1 11 -1 -1 8 7 6 5 4 3 2 1 0 | -1 | 30
13 | 2 | 12 .. 10 | R1(13) r2 ...... | 0 | -1 12 11 10 -1 8 7 6 5 4 3 2 1 | 0 | 30
14 | 2 | .. 9 .. | r1 R2(14) ...... | 0 | -1 -1 12 11 10 9 8 7 6 5 4 3 2 | 1 | 30
15 | 1 | 13 .. .. | R1(15) r2 ...... | 0 | -1 -1 13 12 11 10 9 8 7 6 5 4 3 | 2 | 30
16 | 1 | 15 14 .. | R1(16) ...... ...... | 0 | -1 15 14 13 12 11 10 9 8 7 6 5 4 | 3 | 30
17 | 5 | 16 .. .. | R1(17) ...... ...... | 0 | -1 16 15 14 13 12 11 10 9 8 7 6 5 | 4 | 29
18 | 2 | .. .. .. | r1 R2(18) ...... | 0 | -1 -1 16 15 14 13 12 11 10 9 8 7 6 | 5 | 29
19 | 1 | .. .. .. | r1 r2 R3(19) | 0 | -1 -1 -1 16 15 14 13 12 11 10 9 8 7 | 6 | 30
20 | 1 | .. 18 19 | r1 R2(20) ...... | 0 | -1 19 18 -1 16 15 14 13 12 11 10 9 8 | 7 | 30
21 | 1 | .. 20 .. | r1 R2(21) ...... | 0 | -1 20 19 18 -1 16 15 14 13 12 11 10 9 | 8 | 30
22 | 3 | 17 21 .. | R1(22) ...... ...... | 0 | -1 21 20 19 18 17 16 15 14 13 12 11 10 | 9 | 30
23 | 3 | .. .. .. | r1 R2(23) ...... | 0 | -1 -1 21 20 19 18 17 16 15 14 13 12 11 | 10 | 29
24 | 5 | .. .. .. | r1 r2 R3(24) | 0 | -1 -1 -1 21 20 19 18 17 16 15 14 13 12 | 11 | 29
25 | 2 | 22 .. .. | R1(25) r2 r3 | 0 | -1 -1 -1 22 21 20 19 18 17 16 15 14 13 | 12 | 30
26 | 1 | .. 23 .. | r1 R2(26) r3 | 0 | -1 -1 -1 23 22 21 20 19 18 17 16 15 14 | 13 | 30
27 | 2 | 25 26 .. | R1(27) ...... r3 | 0 | -1 26 25 -1 23 22 21 20 19 18 17 16 15 | 14 | 30
28 | 2 | .. .. .. | r1 R2(28) r3 | 0 | -1 -1 26 25 -1 23 22 21 20 19 18 17 16 | 15 | 30
29 | 1 | 27 .. 24 | R1(29) r2 ...... | 0 | -1 -1 27 26 25 24 23 22 21 20 19 18 17 | 16 | 30
30 | 1 | 29 28 .. | R1(30) ...... ...... | 0 | -1 29 28 27 26 25 24 23 22 21 20 19 18 | 17 | 30
31 | 4 | 30 .. .. | R1(31) ...... ...... | 0 | -1 30 29 28 27 26 25 24 23 22 21 20 19 | 18 | 30
32 | 2 | .. .. .. | r1 R2(32) ...... | 0 | -1 -1 30 29 28 27 26 25 24 23 22 21 20 | 19 | 29
33 | 2 | .. .. .. | r1 r2 R3(33) | 0 | -1 -1 -1 30 29 28 27 26 25 24 23 22 21 | 20 | 30
34 | 1 | .. 32 .. | r1 R2(34) r3 | 0 | -1 -1 32 -1 30 29 28 27 26 25 24 23 22 | 21 | 30
35 | 1 | 31 34 33 | R1(35) ...... ...... | 0 | -1 34 33 32 31 30 29 28 27 26 25 24 23 | 22 | 30
36 | 3 | 35 .. .. | R1(36) ...... ...... | 0 | -1 35 34 33 32 31 30 29 28 27 26 25 24 | 23 | 29
37 | 2 | .. .. .. | r1 R2(37) ...... | 0 | -1 -1 35 34 33 32 31 30 29 28 27 26 25 | 24 | 28
38 | 5 | .. .. .. | r1 r2 R3(38) | 0 | -1 -1 -1 35 34 33 32 31 30 29 28 27 26 | 25 | 29
39 | 3 | 36 37 .. | R1(39) ...... r3 | 0 | -1 -1 37 36 35 34 33 32 31 30 29 28 27 | 26 | 30
40 | 1 | .. .. .. | r1 R2(40) r3 | 0 | -1 -1 -1 37 36 35 34 33 32 31 30 29 28 | 27 | 30
41 | 2 | .. 40 .. | r1 R2(41) r3 | 0 | -1 40 -1 -1 37 36 35 34 33 32 31 30 29 | 28 | 30
42 | 2 | 39 .. .. | R1(42) r2 r3 | 0 | -1 -1 40 39 -1 37 36 35 34 33 32 31 30 | 29 | 29
43 | 1 | .. 41 38 | r1 R2(43) ...... | 0 | -1 -1 41 40 39 38 37 36 35 34 33 32 31 | 30 | 29
44 | 1 | 42 43 .. | R1(44) ...... ...... | 0 | -1 43 42 41 40 39 38 37 36 35 34 33 32 | 31 | 29
45 | 4 | 44 .. .. | R1(45) ...... ...... | 0 | -1 44 43 42 41 40 39 38 37 36 35 34 33 | 32 | 29
46 | 1 | .. .. .. | r1 R2(46) ...... | 0 | -1 -1 44 43 42 41 40 39 38 37 36 35 34 | 33 | 29
47 | 3 | .. 46 .. | r1 R2(47) ...... | 0 | -1 46 -1 44 43 42 41 40 39 38 37 36 35 | 34 | 28
48 | 1 | .. .. .. | r1 r2 R3(48) | 0 | -1 -1 46 -1 44 43 42 41 40 39 38 37 36 | 35 | 28
49 | 1 | 45 .. 48 | R1(49) r2 ...... | 0 | -1 48 -1 46 45 44 43 42 41 40 39 38 37 | 36 | 30
50 | 2 | 49 47 .. | R1(50) ...... ...... | 0 | -1 49 48 47 46 45 44 43 42 41 40 39 38 | 37 | 30
51 | 1 | .. .. .. | r1 R2(51) ...... | 0 | -1 -1 49 48 47 46 45 44 43 42 41 40 39 | 38 | 30
52 | 6 | 50 51 .. | R1(52) ...... ...... | 0 | -1 51 50 49 48 47 46 45 44 43 42 41 40 | 39 | 26
53 | 4 | .. .. .. | r1 R2(53) ...... | 0 | -1 -1 51 50 49 48 47 46 45 44 43 42 41 | 40 | 25
54 | 1 | .. .. .. | r1 r2 R3(54) | 0 | -1 -1 -1 51 50 49 48 47 46 45 44 43 42 | 41 | 25
55 | 2 | .. .. 54 | r1 r2 R3(55) | 0 | -1 54 -1 -1 51 50 49 48 47 46 45 44 43 | 42 | 25
56 | 1 | .. .. .. | r1 r2 r3 | 0 | -1 -1 54 -1 -1 51 50 49 48 47 46 45 44 | 43 | 29
57 | 1 | .. 53 55 | r1 R2(56) R3(57) | 0 | -1 -1 55 54 53 -1 51 50 49 48 47 46 45 | 44 | 30
58 | 1 | 52 56 57 | R1(58) ...... ...... | 0 | -1 57 56 55 54 53 52 51 50 49 48 47 46 | 45 | 30
59 | 4 | 58 .. .. | R1(59) ...... ...... | 0 | -1 58 57 56 55 54 53 52 51 50 49 48 47 | 46 | 29
60 | 1 | .. .. .. | r1 R2(60) ...... | 0 | -1 -1 58 57 56 55 54 53 52 51 50 49 48 | 47 | 30
Fig. 9. Scheduling simulation results
not the actual data value fed to the algorithm, as this
is of no relevance for the simulation. In column 3, a
resource will have a capital letter if a new data sample
has been allocated to it. The number between the
brackets indicates the start clock value.
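The notation of the resource column can be decoded mechanically. The following Python sketch is only an illustration of the convention described above; the function name `parse_resources`, the state labels, and the reading of a dotted entry as an idle resource are assumptions and are not part of the simulator.

```python
import re

# One entry of the resource column in Fig. 9:
#   "R2(20)" : resource 2 received a new data sample, start clock 20
#   "r1"     : resource 1 is still working on an earlier sample
#   "......" : assumed here to mean that the resource is idle
_ENTRY = re.compile(r"(?P<kind>[Rr])(?P<idx>\d+)(?:\((?P<start>\d+)\))?|\.+")


def parse_resources(column):
    """Decode a resource column such as 'r1 R2(20) ......'."""
    entries = []
    for position, token in enumerate(column.split(), start=1):
        match = _ENTRY.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised resource entry: {token!r}")
        if match.group("kind") is None:
            # Dotted entry: the resource number is taken from its position.
            entries.append({"resource": position, "state": "idle", "start": None})
            continue
        start = match.group("start")
        entries.append({
            "resource": int(match.group("idx")),
            # Capital letter: a new data sample was allocated at this clock.
            "state": "allocated" if match.group("kind") == "R" else "busy",
            "start": int(start) if start is not None else None,
        })
    return entries


if __name__ == "__main__":
    for entry in parse_resources("r1 R2(20) ......"):
        print(entry)
```

Applied to the resource column of the row for clock 20, the sketch reports resource 1 as busy, resource 2 as newly allocated with start clock 20, and resource 3 as idle.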
VI. Conclusions
In this paper we have considered the problem of
scheduling a non-manifest loop in a manifest
environment, given a priori knowledge of the maximum
computation load CLmax of the loop algorithm and of the
WL bound of the input data stream. By using the
workload to calculate the number of resources needed
and the latency of the system, we were able to design a
dynamic scheduling system with Ccap ≥ WL for the
input data stream. The sliding window execution of
the system has a constant latency Lat ≤ m, which
allows us to produce an output stream that is
synchronized with the input stream. This makes the system
suitable for a synchronous real-time environment.
VII. Future Work
Much work remains on the detailed hardware
implementation of the system. The produced hardware uses
a lot of memory, which can be improved upon. Writing
more than one output value to the reorder buffer, or
allocating more than one waiting data element to the
resources, requires dedicated hardware. It is preferable
to synchronize the writing and data allocation with the
actual resource execution and let them all run on the
same clock. For a non-manifest loop with a very long
worst-case execution time, the presented scheduler will
work, but such long delays might not always be
acceptable in the system. A plausible solution would be
to sacrifice the quality of the output in order to meet
real-time constraints, which is acceptable when
scheduling non-analytic loop algorithms. For analytic
loop algorithms we can devise hybrid systems in which
lookup tables are used for computational loads above a
certain threshold and normal resources for
computational loads below that threshold.
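The hybrid idea can be illustrated with a small software sketch. Everything in it is an assumption made for illustration: the toy loop `ilog2_loop`, the load estimate `predicted_load`, the threshold value and the table layout are hypothetical and do not describe an implemented system. The sketch only shows the dispatch rule, in which samples with a load above the threshold are served from a precomputed table, so that the loop never runs for more than `threshold` iterations.

```python
def ilog2_loop(x):
    """Toy data-dependent loop: floor(log2(x)) by repeated halving."""
    count = 0
    while x > 1:
        x >>= 1
        count += 1
    return count


def predicted_load(x):
    """Hypothetical load estimate: iterations ilog2_loop needs for x."""
    return x.bit_length() - 1


def build_lookup_table(domain, threshold):
    """Precompute results only for samples whose load exceeds the threshold."""
    return {x: ilog2_loop(x) for x in domain if predicted_load(x) > threshold}


def hybrid_compute(x, threshold, lookup_table):
    """Use the table for heavy samples and the normal loop otherwise."""
    if predicted_load(x) > threshold:
        return lookup_table[x], "lut"
    return ilog2_loop(x), "loop"


if __name__ == "__main__":
    threshold = 4
    table = build_lookup_table(range(1, 256), threshold)
    for sample in (5, 200):
        print(sample, hybrid_compute(sample, threshold, table))
```

In this toy setting the table trades memory for a bounded worst-case execution time, which is the trade-off such a hybrid system would have to balance.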
Another area of concern is the scheduling of
non-manifest loops with resource constraints and data
dependencies.
References
[1] W. Verhaegh, "Multidimensional Periodic Scheduling",
Ph.D. thesis, 1995.
[2] Silvia M. Muller, "On the Scheduling of Variable Latency
Functional Units", 11th ACM Symposium on Parallel
Algorithms and Architectures, SPAA'99.
[3] Vijay Raghunathan, Srivaths Ravi, Ganesh
Lakshminarayana, "Integrating Variable-Latency Components
into High-Level Synthesis", IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems,
October 2000.
[4] L. Benini, E. Macii, M. Poncino, and G. De Micheli,
"Telescopic units: A new Paradigm for performance
optimization of VLSI designs", IEEE Trans. Computer-Aided
Design of Integrated Circuits and Systems, vol. 19,
no. 10, October 2000.
[5] R. Camposano, D. Knapp, D. MacMillen, "A Review of
Hardware Synthesis Techniques", NATO Advanced Study
Institute, Tremezzo (I), June 1995.
[6] D. Gajski, N. Dutt, A. Wu, S. Lin, "High-level synthesis:
Introduction to chip and system design", Kluwer, ISBN
0-7923-9194-2, 1992.
160 PROGRESS 2001
