Parameterized Looped Schedules by Ko, Ming-Yung et al.
Parameterized Looped Schedules
Ming-Yung Ko, †Claudiu Zissulescu, Sebastian Puthenpurayil, Rami Nasr,
Shuvra S. Bhattacharyya, †Bart Kienhuis, †Ed Deprettere
Department of Electrical and Computer Engineering
University of Maryland at College Park, USA
{myko,purayil,rmn,ssb}@eng.umd.edu
†Leiden Institute of Advanced Computer Science
Leiden University, The Netherlands
{zissules,kienhous,edd}@liacs.nl
Abstract
This paper is concerned with the compact representation of execution sequences in terms of effi-
cient looping constructs. Here, by a looping construct we mean a compact way of specifying a
finite repetition of a set of execution primitives (“instructions”). Such compaction, which can be
viewed as a form of hierarchical run-length encoding, has application in many embedded software
contexts, including efficient control generation for Kahn processes [6], and software synthesis for
static dataflow models of computation, such as synchronous dataflow [1] and cyclo-static data-
flow [3]. In this paper, we significantly generalize previous models for loop-based code compac-
tion of DSP programs to yield a configurable code compression methodology that exhibits a
broad range of achievable trade-offs. Specifically, we formally develop and apply to DSP hard-
ware and software implementation a parameterizable loop scheduling approach with compact for-
mat, dynamic reconfigurability, and low overhead decompression.
1 Introduction
Due to tight resource constraints and the increasing complexity of applications, efficient pro-
gram compression techniques are critical in the implementation of embedded DSP systems. Hard-
ware and software subsystems for DSP often present periodic and deterministic execution
sequences that facilitate compile- or synthesis-time compression. In this paper, we develop a
methodology that exploits this characteristic of DSP implementation subsystems through compact
representation of execution sequences in terms of efficient looping constructs. The looping con-
structs provide a concise, parameterized way of specifying sequences of execution primitives that
may exhibit repetitive patterns of arbitrary forms both at the primitive- and subsequence-levels.
Such compaction provides a form of hierarchical run-length encoding as well as reconfigurability
during DSP system implementation. Moreover, exploitation of low-cost hardware features is
considered to further improve the efficiency of the proposed methods. The power of our compres-
sion approach is demonstrated concretely through its application to control generation for Kahn
processes [6], and to software synthesis for static dataflow models of computation [1]. 
2 Related Work
Sequence compression techniques have been developed for many years in the context of file
compression to save disk space, reduce network traffic, etc. One basic approach in this and other
sequence compression domains is to express repeating strings of symbols in more compact forms.
A typical example is run-length encoding, which replaces $n$ repeated instances of a symbol by
just a single instance of the symbol along with the repetition count $n$. Several bitmap file formats,
e.g., TIFF, BMP and PCX, adopt variants of run-length encoding. More elaborate compression
strategies employ “dictionary” look-up mechanisms. Here, multiple instances of a symbol
sequence are replaced by smaller-sized pointers that reference a single “master copy” of the
repeated sequence. The collection of master copies can therefore be viewed as a dictionary for
purposes of compression. A typical example is the LZ77 algorithm [14], variations of which are
used in many data compression tools.
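As a concrete illustration of the run-length encoding idea described above, the following minimal Python sketch collapses runs of identical symbols into (symbol, count) pairs and expands them again. The function names `rle_encode` and `rle_decode` are hypothetical and are used only for this illustration; actual file formats use their own byte-level variants.

```python
from itertools import groupby


def rle_encode(symbols):
    """Collapse each run of equal symbols into a (symbol, count) pair."""
    return [(sym, len(list(run))) for sym, run in groupby(symbols)]


def rle_decode(pairs):
    """Expand (symbol, count) pairs back into the original symbol list."""
    return [sym for sym, count in pairs for _ in range(count)]


encoded = rle_encode("AAAABBBCCD")
decoded = "".join(rle_decode(encoded))
```

Here `encoded` holds one pair per run, and `decoded` reproduces the input string, so the encoding is lossless.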
Code compression in embedded systems presents some unique characteristics and chal-
lenges compared to compression in other domains. First, code sequences depend heavily on the
underlying control flow structures of the associated programs. Furthermore, the control flow
structures of the associated programs can often be changed subject to certain restrictions, giving
rise in general to a family of alternative code sequences for the same program behavior. Second,
memory resources in embedded systems are particularly limited, and the temporary “scratch
space” for decompression is usually very small. Third, decompression of embedded code must be
fast enough to meet real-time demands.
There are several research works discussing reduction of code size through classical com-
piler optimizations such as strength reduction, dead code elimination, and common sub-expres-
sion elimination [5]. A particularly effective strategy is procedural abstraction [10], where
procedures are created to take the place of duplicated code sequences. The work of [4] further
reveals that procedural abstraction combined with classical compiler optimizations results in more
compact code than either approach can achieve alone.
For embedded DSP design, many applications are based on dataflow models, which exhibit
certain advantages compared to traditional sequential programming. Dataflow-based DSP design
usually operates at a high level of program abstraction, e.g., in terms of basic blocks, nested loop
behaviors, and coarse-grain subroutines. Furthermore, the control flow at this abstraction level is
often highly predictable. To reduce code size cost, repetitive execution patterns generated by this
form of predictable control flow can be mapped to low-overhead looping constructs that are com-
mon in programmable digital signal processors, and are similarly easy to emulate in programma-
ble- or custom-hardware designs. The work of [1] adopts a dynamic programming approach to
reformat repeated dataflow executions in a hierarchical run-length encoding style. However, the
computational complexity of this strategy is relatively large. More importantly, the constraint of
static and fixed iteration counts in the targeted class of looping structures significantly restricts
compression results.
In this paper, we propose a flexible and parameterizable looping construct with a compact
format, dynamic reconfigurability, and fast decompression. The format embeds functions that
describe variable run lengths in a configurable run-length encoding to achieve better compression
results. With reconfigurability, appropriate execution subsequences can be derived by adjusting
parameter values at run time without modifying the hardware implementation. Our proposed
methodology applies looping constructs that provide flexibility in adapting execution sequences,
as well as efficiency in managing the associated iteration control. In summary, we propose an
approach for compact representation of execution sequences that is effective across the
dimensions of conciseness, decompression performance, cost, and configurability.
3 Background: Static Looped Schedules
We denote the set of all integers by $Z$, and the set of non-negative integers by $Z^+$. Suppose
$S = (s_1, s_2, \ldots, s_m)$ is a sequence of arbitrary elements and $c$ is a non-negative
integer. Then we define the product $S \times c$ to be the sequence that results from
concatenating $c$ copies of $S$. Thus, for example, $S \times 0$ is the empty sequence;
$S \times 1 = S$; $S \times 2 = (s_1, s_2, \ldots, s_m, s_1, s_2, \ldots, s_m)$; and so on.
Furthermore, if $T = (t_1, t_2, \ldots, t_n)$ is another sequence, then we define the sum $S + T$
to be the concatenation of $T$ to $S$: $S + T = (s_1, s_2, \ldots, s_m, t_1, t_2, \ldots, t_n)$.
Note that in general $(S + T)$ does not equal $(T + S)$. We occasionally abuse notation by
overloading the definition of a function depending on the type of argument that is applied. For
example, as explained fully in subsequent sections, if $X$ is an instruction, then $c(X)$ defines
the cost of that instruction, whereas if $X$ is a schedule, then $c(X)$ denotes the total cost of
that schedule (including the sum of instruction and loop costs). We abuse notation in this way to
highlight relationships across closely-related functions, and to contain the total number of
distinct symbols that are defined.
Suppose we are given a finite sequence of symbols $P = (p_1, p_2, \ldots, p_n)$ from a finite
alphabet set $A = \{a_1, a_2, \ldots, a_m\}$. Thus, each $p_i \in A$. We refer to each $p_i$ as an
instruction, and we refer to the sequence $P$ as the program that we wish to optimize. We define a
class 0 (static) schedule loop over $A$ to be a parenthesized term of the form
$(c I_1 I_2 \ldots I_k)$, where $c \in Z^+$, and each $I_i$ is either an element of $A$ (i.e., an
instruction) or a ("nested") class 0 schedule loop. The number $c$ is called the iteration count of
the schedule loop, and each $I_i$ is called an iterand of the schedule loop. The concatenation
$I_1 I_2 \ldots I_k$ of iterands is called the body of the schedule loop. Such a schedule loop is
called static because the iteration count is constant.
A class 0 (static) looped schedule over $A$ is a sequence $S = (x_1, x_2, \ldots, x_n)$, where
each $x_i$ is either an element of $A$ or a class 0 schedule loop over $A$. Note that by
definition, if $L = (c I_1 I_2 \ldots I_k)$ is a class 0 schedule loop, then
$S_L = (I_1, I_2, \ldots, I_k)$ and $(L)$ are both class 0 looped schedules. We call $S_L$ the
body schedule of $L$.
Given a class 0 looped schedule $S$, a schedule loop $L$ is contained in $S$ if for some $i$,
$x_i$ is a schedule loop and either $x_i = L$ or $L$ is a schedule loop that is nested within
$x_i$. For example, consider $S = ((3A(2B(3CD))), E, (3(2B)))$. This schedule contains the
following schedule loops: $(3A(2B(3CD)))$, $(2B(3CD))$, $(3CD)$, $(3(2B))$, and $(2B)$. Note that
in listing the set of schedule loops that are contained in a schedule, we may need to distinguish
between multiple schedule loops that have identical iteration counts and bodies, as in the first
and second appearances of $(3AB)$ in the looped schedule $((2A(3AB)), (5CD), (3AB))$. If $L_1$
and $L_2$ are schedule loops that are contained in the schedule $S = (x_1, x_2, \ldots, x_n)$, we
say that $L_1$ is contained earlier than $L_2$ in $S$ if there exist $x_i$ and $x_j$ such that
$i < j$, $x_i$ contains $L_1$, and $x_j$ contains $L_2$. We say that $L_1$ lexically precedes
$L_2$ in $S$ if (a) $L_1$ is contained earlier than $L_2$ in $S$; (b) $L_2$ is nested within
$L_1$; or (c) $S$ contains a schedule loop $L_3$ so that $L_1$ is contained earlier than $L_2$ in
the body schedule of $L_3$.
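As an illustration of the containment definition, the sketch below encodes class 0 schedules in Python and lists the schedule loops they contain. The `Loop` class and the `contained_loops` function are hypothetical names introduced only for this example. The traversal visits outer loops before nested ones and earlier elements before later ones, which is one order consistent with lexical precedence.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Loop:
    """A class 0 schedule loop (c I1 I2 ... Ik); instructions are strings."""
    count: int
    body: List[Union[str, "Loop"]]

    def __str__(self):
        return "(" + str(self.count) + "".join(str(i) for i in self.body) + ")"


def contained_loops(schedule):
    """Return every loop contained in a schedule, outer loops first."""
    found = []
    for x in schedule:
        if isinstance(x, Loop):
            found.append(x)
            found.extend(contained_loops(x.body))
    return found


# S = ((3A(2B(3CD))), E, (3(2B)))
S = [Loop(3, ["A", Loop(2, ["B", Loop(3, ["C", "D"])])]), "E", Loop(3, [Loop(2, ["B"])])]
names = [str(loop) for loop in contained_loops(S)]

# Two appearances of (3AB): equal in content, but distinct objects.
first, second = Loop(3, ["A", "B"]), Loop(3, ["A", "B"])
T = [Loop(2, ["A", first]), Loop(5, ["C", "D"]), second]
appearances = [loop for loop in contained_loops(T) if loop == first]
```

Comparing with `==` treats the two appearances of $(3AB)$ as equal, whereas object identity (`is`) keeps them apart, which mirrors the need to distinguish appearances noted above.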
Example 1: Consider the looped schedule $((2A(3B))CD, (3A(2(3B)(2C))))$, let $L_1$ denote
the first appearance of $(3B)$, let $L_2$ denote the second appearance of $(3B)$, let $L_3$
denote the schedule loop $(2A(3B))$, and let $L_4$ denote the schedule loop $(2C)$. Then $L_1$
lexically precedes $L_2$ due to condition (a); $L_3$ lexically precedes $L_1$ due to condition
(b); and $L_2$ lexically precedes $L_4$ due to condition (c).
Consider an iterand $I$ of a class 0 schedule loop. If $I$ is an instruction, then we say that the
program generated by $I$, denoted $P(I)$, is simply the one-element sequence $(I)$. Otherwise, if
$I$ is a schedule loop (that is, $I$ is of the form $I = (c_I X_1 X_2 \ldots X_p)$), then $P(I)$
is defined recursively by
$$P(I) = (P(X_1) + P(X_2) + \cdots + P(X_p)) \times c_I.$$
Similarly, given a class 0 schedule $S = (x_1, x_2, \ldots, x_n)$, the program generated by $S$ is
(with a minor abuse of notation) denoted $P(S)$, and is given by
$$P(S) = P(x_1) + P(x_2) + \cdots + P(x_n).$$
Example 2: Suppose that the set of instructions $A$ is given by $A = \{a, b, c, d\}$, and suppose
we are given a looped schedule $S = (a, (2c(2ad)d), b, (3c), d)$. Then we have
$$P(S) = (a, c, a, d, a, d, d, c, a, d, a, d, d, b, c, c, c, d).$$
Static looped schedules have been studied extensively in the context of software synthesis
from synchronous dataflow representations of DSP applications (e.g., see [1]).
If costs are associated with individual actors and with loop construction in general, then we
can express the degree of compactness associated with specific looped schedules. Suppose that in
the context of looped schedule implementation, $\alpha_{loop}$ represents the overhead (cost) of a
loop, and $\alpha(x)$ represents the cost of an instruction $x$. For example, for software
implementation $\alpha_{loop}$ represents the code size cost associated with a loop in the target
code. This value will normally depend on the processor on which the schedule is being implemented,
and will include the code size of the instructions required to initialize the loop and update the
loop counter at the beginning or end of each iteration. If the software is being implemented for a
dataflow graph specification, then the "instructions" in the looped schedule correspond to actors
in the dataflow graph, and the instruction code size values $\{\alpha(x)\}$ give the code size
requirements of the different actors on the associated target processor.
The cost of a looped schedule $S$ can be expressed as
$$\alpha(S) = n_{loop}(S)\,\alpha_{loop} + \sum_{x \in A} n_{app}(x, S)\,\alpha(x),$$
where $n_{loop}(S)$ denotes the number of schedule loops in $S$ (including nested loops), and
$n_{app}(x, S)$ denotes the number of times that instruction $x$ appears in schedule $S$.
For example, if $S = (a, (2c(2ad)d), b, (3c), d)$, the schedule illustrated above, then
$$\alpha(S) = 3\alpha_{loop} + 2\alpha(a) + \alpha(b) + 2\alpha(c) + 3\alpha(d).$$
To construct a static looped schedule from a sequence of instructions, the dynamic programming
algorithm CDPPO [2] is effective. The CDPPO algorithm adopts a bottom-up approach to fuse
repetitive instruction sequences into hierarchical looping constructs. The objective of CDPPO is
to minimize overall code size, including the costs for instructions and looping constructs.
CDPPO has computational complexity that is polynomial in the number of instructions in the
(uncompressed) input sequence.
4 Class 1 Looped Schedules
Static looped schedules provide a simple form of nested iteration where all iteration counts
in the loops are static values, and loop counts implicitly progress from $1$ to the corresponding
iteration count limits in uniform steps of $1$. However, static looped schedules do not always
allow for the most compact representation of a static execution sequence. This motivates the
definition of more flexible schedules in which more general updating of loop counters is
integrated into the schedule. The class 1 schedules, which we define next, represent one such form
of more general schedules. In class 1 schedules, the loop counter dimension is made explicit, and
loop counters are allowed to have initial values, and update expressions specified for them.
Because update expressions are processed frequently (once per loop iteration), their form is
restricted in class 1 schedules to ensure efficient hardware and software implementation.
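For contrast with the class 1 schedules defined next, the class 0 notions of generated program $P(S)$ and cost $\alpha(S)$ from Section 3 can be sketched in a few lines of Python. This is only an illustrative sketch: loops are encoded as `(count, body)` tuples and instructions as strings, the function names `generate` and `cost` are hypothetical, and the numeric cost values are made up for the example.

```python
def generate(schedule):
    """Program P(S): concatenate the programs of the schedule elements in order."""
    program = []
    for x in schedule:
        if isinstance(x, str):            # an instruction yields a one-element sequence
            program.append(x)
        else:                             # a loop (c, body) yields P(body) repeated c times
            count, body = x
            program.extend(generate(body) * count)
    return program


def cost(schedule, alpha_loop, alpha):
    """Cost alpha(S): one loop overhead per loop plus one cost per instruction appearance."""
    total = 0
    for x in schedule:
        if isinstance(x, str):
            total += alpha[x]
        else:
            _, body = x
            total += alpha_loop + cost(body, alpha_loop, alpha)
    return total


# S = (a, (2c(2ad)d), b, (3c), d), the schedule of Example 2
S = ["a", (2, ["c", (2, ["a", "d"]), "d"]), "b", (3, ["c"]), "d"]
program = "".join(generate(S))

# Hypothetical cost values, chosen only to make each coefficient visible.
alpha = {"a": 10, "b": 20, "c": 30, "d": 40}
total = cost(S, alpha_loop=5, alpha=alpha)
```

With these values, `total` equals $3 \cdot 5 + 2 \cdot 10 + 20 + 2 \cdot 30 + 3 \cdot 40$, matching the coefficient pattern of the cost expression for this schedule, and `program` spells out the 18-instruction sequence of Example 2.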
Formally, a class 1 schedule loop $L$ has five attributes: a body, an index, an iteration count
function, an initial index value, and an index update constant. The body of $L$ is defined in a
manner analogous to the body of a class 0 schedule loop. Thus, the body of $L$ is of the form
$I_1 I_2 \ldots I_n$, where each $I_i$, called an iterand of $L$, is either an instruction or a
class 1 schedule loop. The index of a class 1 schedule loop $L$ is a symbol that represents a loop
index variable that is associated with $L$ in an implementation of the loop. The iteration count
function of $L$ is an integer-valued function $f(y_1, y_2, \ldots, y_m)$ defined on $Z^m$, where
each $y_i$ is the index of some other class 1 schedule loop or is a parameter of a looped schedule
that contains $L$. The value of $f$ just before executing an invocation of $L$ gives the minimum
value of the index required for the loop to stop executing. In other words, $L$ will continue
executing as long as the index value is less than $f$. It is admissible to have $m = 0$, so that
$f$ represents a constant value $v$. In this case, we write $f() = v$. The initial index value of
$L$ is an integer to which the loop index variable associated with $L$ is initialized. This
initialization takes place before each invocation of $L$, just prior to the computation of $f$.
The index update constant is a positive integer that is added to the index of $L$ at the end of
each iteration of $L$.
A class 1 schedule loop $L$ is represented by the parenthesized term $([x_L, f_L, u_L] B_L)$,
where $x_L$, $f_L$, $B_L$, and $u_L$ are, respectively, the index, iteration count function, body,
and index update constant of $L$. For brevity we omit the initial index value from this
representation. The initial index value of $L$ is denoted by $x_L(0)$; this value is specified
separately when needed. Furthermore, when $u_L = 1$, we may suppress $u_L$ from the schedule loop
notation, and simply write $L = ([x_L, f_L] B_L)$. If $x_L$ is not an argument of any relevant
iteration count function, we may suppress $x_L$, and write $L = ([f_L] B_L)$ or
$L = ([f_L, u_L] B_L)$; if, additionally, $f_L$ is constant-valued (i.e., $f_L = c$), and
$u_L = 1$, then we have a class 0 schedule loop, and we may drop the brackets and write
$L = (c B_L)$, which is just the usual notation for class 0 schedule loops. We represent the
arguments of the iteration count function by $args(f_L) = \{y_1, y_2, \ldots, y_m\}$.
Fact 1: The number of iterations executed by an invocation $I_L$ of the class 1 schedule loop $L$
is given by
$$\mathrm{iterations} = \max\left(0, \left\lceil \frac{f_L(y_1^*, y_2^*, \ldots, y_m^*) - x_L(0)}{u_L} \right\rceil\right), \qquad (1)$$
where $y_i^*$ denotes the value of index $y_i$ just prior to initiation of $I_L$.
A class 1 looped schedule over $A$ is an ordered pair $S = (params(S), body(S))$. The first
member $params(S) = \{p_1, p_2, \ldots, p_r\}$ of this ordered pair is a finite set of elements
called parameters of $S$, and the second member $body(S) = (x_1, x_2, \ldots, x_n)$ is a finite
sequence where each $x_i$ is either an element of $A$ or a class 1 schedule loop over $A$.
We say that a class 1 looped schedule $S$ is syntactically correct if the following three
conditions all hold.
• Every schedule loop $L = ([x_L, f_L, u_L] B_L)$ that is contained in $S$ has a unique index
$x_L$.
• $\{x_L \mid S \text{ contains } L\} \cap params(S) = \emptyset$; that is, the parameters of $S$
are distinct from the loop indices.
• For each schedule loop $L$ that is contained in $S$, the iteration count function $f_L$ is
either constant-valued, or depends only on parameters of $S$ and indices of schedule loops that
lexically precede $L$; that is,
$args(f_L) \subseteq (params(S) \cup \{x_{L'} \mid L' \text{ lexically precedes } L \text{ in } S\})$.
Containment of a schedule loop earlier than another schedule loop, as well as lexical
precedence between schedule loops, are defined for class 1 looped schedules in a manner analogous
to that for class 0 looped schedules.
Syntactic correctness is a necessary but not sufficient condition for validity of a class 1
looped schedule. Overall validity in general depends also on the context of the looped schedule.
For example, a syntactically correct looped schedule for a synchronous dataflow graph may be
invalid because the schedule is deadlocked (attempts to execute an actor before sufficient data has
been produced for it).
Intuitively, the semantics of executing a class 1 schedule loop $([x_L, f_L, u_L] B_L)$, where
$f_L = f_L(y_1, y_2, \ldots, y_m)$ and $u_L \in Z^+$, can be described as outlined in Fig. 1. Using
this semantics, we can define the program generated by an iterand of a class 1 schedule loop, and
the program generated by a class 1 looped schedule in a fashion analogous to the corresponding
definitions for class 0 looped schedules. However, when determining these generated programs
for class 1 looped schedules, we must specify an assignment of values to the schedule parameters.
Thus, if $v: params(S) \to Z$ is an assignment of values to parameters of a class 1 looped
schedule $S$, then we write $P(S, v)$ to represent the corresponding program generated by $S$.
Example 3: Suppose $\alpha = (A, B, C, D, E, F)$, and consider the class 1 looped schedule $S$
specified by $params(S) = \{p_1\}$ and
$$body(S) = (F, ([x_1, f_1] AB ([f_2, 2] CD)), E),$$
where $f_1 = f_1(p_1) = p_1 - 3$ and $f_2 = f_2(x_1) = 5 - x_1$. Notice that this schedule
contains a pair of nested schedule loops. If the initial index values in these loops are
identically zero, and if $v(p_1) = 6$ (i.e., we assign the value of $6$ to the schedule parameter
$p_1$), then we have
$$P(S, v) = (F, A, B, C, D, C, D, C, D, A, B, C, D, C, D, A, B, C, D, C, D, E).$$
This simple example illustrates some of the ways in which more irregular programs can be
generated by class 1 looped schedules as compared to their class 0 counterparts. In particular, in
this example, we see that the number of iterations of the inner loop can vary across different
invocations of the loop, and furthermore, the amount of this variation need not be uniform.
Because of their potential for parameterization, in terms of schedule parameters and loop
indices, and because of their restriction that loop indices be updated by constant additions, we
also refer to class 1 looped schedules as parameterized, constant-update looped schedules
(PCLSs).
    x_L = x_L(0)
    limit_L = f_L(y_1, y_2, ..., y_m)
    while (x_L < limit_L)
        execute B_L
        x_L = x_L + u_L
    end while
Fig 1.  A sketch of the execution of a class 1 schedule loop.
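The loop semantics of Fig. 1 and the schedule of Example 3 can be checked with a small interpreter. The sketch below is illustrative only: the `Loop1` class, the `run` and `iterations` functions, and the use of a dictionary as the environment of parameter and index values are assumptions made for this example. The `iterations` helper rounds the quotient of Fact 1 up to an integer, which is the count that the while-loop of Fig. 1 produces when $u_L > 1$.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Loop1:
    """A class 1 schedule loop ([x_L, f_L, u_L] B_L) with initial index value x_L(0)."""
    f: Callable[[Dict[str, int]], int]     # iteration count function f_L
    body: List[Union[str, "Loop1"]]        # body B_L
    index: Optional[str] = None            # index name x_L (None when suppressed)
    update: int = 1                        # index update constant u_L (positive)
    init: int = 0                          # initial index value x_L(0)


def run(body, env, out):
    """Append the program generated by `body` to `out`, following Fig. 1."""
    for item in body:
        if isinstance(item, str):
            out.append(item)
            continue
        x = item.init                      # initialize the index first ...
        if item.index is not None:
            env[item.index] = x
        limit = item.f(env)                # ... then evaluate f_L once per invocation
        while x < limit:
            run(item.body, env, out)
            x += item.update
            if item.index is not None:
                env[item.index] = x
    return out


def iterations(limit, init, update):
    """Iteration count of one invocation, using integer ceiling division."""
    return max(0, -((init - limit) // update))


# Example 3: body(S) = (F, ([x1, f1] A B ([f2, 2] C D)), E), with v(p1) = 6
inner = Loop1(f=lambda env: 5 - env["x1"], body=["C", "D"], update=2)
outer = Loop1(f=lambda env: env["p1"] - 3, body=["A", "B", inner], index="x1")
program = "".join(run(["F", outer, "E"], {"p1": 6}, []))
inner_counts = [iterations(5 - x1, 0, 2) for x1 in range(3)]
```

The inner loop runs three, two, and two times across the three outer iterations, which reproduces the non-uniform variation noted for Example 3.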
Theorem 1: Given a syntactically correct PCLS $S$, and an assignment $v: params(S) \to Z$ of
parameter values, the generated program $P(S, v)$ is finite.
Proof. Suppose that $L$ is a schedule loop contained in $S$. Then there is a unique sequence
$L_1, L_2, \ldots, L_n$, $n \ge 1$, of schedule loops contained in $S$ such that $L_1$ is an
iterand of $S$, $L_n = L$, and each $L_{i+1}$ is an iterand of $L_i$. That is,
$L_1, L_2, \ldots, L_{n-1}$ are the outer loops encapsulating $L$. Then the total number of
invocations of $L$ in an execution of $S$ can be expressed as
$$\prod_{i=1}^{n} \mathrm{iterations}(L_i).$$
This follows from (1), which specifies the number of iterations executed by a given schedule loop
invocation $I_L$. Since $u_{L_i} > 0$ and $f_{L_i}$ is integer-valued, this number of iterations
will always be finite. QED.
5 Affine Parameterized Looped Schedules
One useful special case of class 1 looped schedules arises when $f_L$ is a linear function of
$args(f_L)$. We call the special case affine parameterized looped schedules (APLSs).
The ability to parameterize iteration counts of schedule loops in PCLSs is useful in expressing
related groups of static schedule loops. In many useful design contexts, families of static
schedule loops arise, such that within a given family, all schedule loops are equivalent in a
certain structural sense. We refer to this form of equivalence between schedule loops as schedule
loop isomorphism. Specifically, two static schedule loops $L_1$ and $L_2$ are isomorphic if there
is a bijection $f$ between the set of schedule loops contained in $(L_1)$ and the set of schedule
loops contained in $(L_2)$ such that for each $L$ in the domain of $f$,
$L = (c I_1 I_2 \ldots I_m)$ and $f(L) = (d J_1 J_2 \ldots J_n)$ satisfy the following three
conditions: (a) $L$ and $f(L)$ have the same number of iterands (that is, $m = n$); (b) for each
$i$ such that $I_i$ is not a schedule loop (i.e., it is a "primitive" iterand), we have
$J_i = I_i$; and (c) for each $i$ such that $I_i$ is a schedule loop, we have that $J_i$ is also a
schedule loop, and furthermore, $I_i$ and $J_i$ are isomorphic.
For each schedule loop $L$ contained in $(L_1)$, the mapping $f(L)$ of $L$ is called the image of
$L$ under the isomorphism. Furthermore, two static looped schedules $S_1$ and $S_2$ are said to be
isomorphic if the schedule loops $(1 S_1)$ and $(1 S_2)$ are isomorphic.
We can extend the definition of isomorphic looped schedules to a finite set of static looped
schedules $S_1, S_2, \ldots, S_k$. In this case, we extract the schedule loops from $(1 S_i)$ for
some arbitrary $i$. Then for all $j \ne i$ and for each schedule loop $L$ contained in
$(1 S_i)$, we define $f_j(L)$ to be the corresponding, structurally equivalent schedule loop in
$(1 S_j)$.
Using the concept of looped schedule isomorphism, we derive useful formulations in the
remainder of this section for the special case of APLSs where $args(f_L) = params(S)$ for every
$L$ contained in $S$.
For clarity in this discussion, we start with $p$ as the only schedule parameter (i.e.,
$params(S) = \{p\}$). Under the APLS assumption, this means that the iteration count expression
for each schedule loop will be of the form $ap + b$, where $a$ and $b$ are constants. Therefore, we
need two instances of a given static schedule loop to fit the unknowns $a$ and $b$. We simply need
that these instances be for distinct values of $p$, say $p_1$ and $p_2$, and that these values of
$p$ be such that they reach beyond any transient effects (leading to negative, zero, or
one-iteration loops when viewed from the final parameterized schedule). Note that functionally, a
negative-iteration loop is just equivalent to a zero-iteration loop.
Let $S_1$ be the looped schedule instance corresponding to $p_1$ and let $S_2$ be the looped
schedule instance corresponding to $p_2$. If $S_1$ and $S_2$ are not isomorphic, we need to
increase $\min(p_1, p_2)$, and try again.
Suppose now that we have an isomorphic schedule pair $S_1$ and $S_2$. We then take each loop
$L$ in $S_1$ and its image $f(L)$ in $S_2$. Let $z_1$ be the iteration count of $L$ and $z_2$ be
that of $f(L)$. We then set up the equations
$$z_1 = a p_1 + b, \quad \text{and} \quad z_2 = a p_2 + b,$$
and solve these equations for $a$ and $b$. We repeat this procedure for all schedule loops $L$
that are contained in $S_1$.
Generalizing this to multiple schedule parameters, we start with a hypothesized APLS
$S(p_1, p_2, \ldots, p_N)$ in $N \ge 1$ parameters. The iteration count expression for each
schedule loop $L$ is of the form
$$a_1 p_1 + a_2 p_2 + \cdots + a_N p_N + b. \qquad (2)$$
We need $N + 1$ instances of $L$ to fit the $N + 1$ unknowns in the iteration count expression for
$L$. For $i = 1, 2, \ldots, N + 1$, let $S_i$ be the $i$th element in our set of compacted looped
schedule instances. Let $q_{i,1}, q_{i,2}, \ldots, q_{i,N}$ be the corresponding parameter values
for $p_1, p_2, \ldots, p_N$, respectively. Furthermore, let $L$ be a schedule loop in $S_i$, and
for each $i = 1, 2, \ldots, N + 1$, let $z_i$ denote the iteration count of $f_i(L)$. We set up the
following equations:
$$z_1 = a_1 q_{1,1} + a_2 q_{1,2} + \cdots + a_N q_{1,N} + b$$
$$z_2 = a_1 q_{2,1} + a_2 q_{2,2} + \cdots + a_N q_{2,N} + b$$
$$\vdots$$
$$z_{N+1} = a_1 q_{N+1,1} + a_2 q_{N+1,2} + \cdots + a_N q_{N+1,N} + b$$
This can be expressed in matrix form as
$$z = QA + \mathbf{b}, \qquad (3)$$
where $z$ is an $(N + 1) \times 1$ constant column vector, $A$ is an $N \times 1$ column vector
composed of the unknown $a_i$'s, $Q$ is an $(N + 1) \times N$ constant matrix composed of the
parameter settings used in the selected schedule instances, and
$\mathbf{b} = [b \; b \; \ldots \; b]^T$ is an $(N + 1) \times 1$ column vector obtained by
replicating the unknown offset term $b$.
By solving (3), we obtain $a_1, a_2, \ldots, a_N$ and $b$ to formulate the APLS parameterized
schedule loop implementation of $L$, as represented in (2).
If a solution cannot be obtained, we can increase the selected
$\{q_{i,1}, q_{i,2}, \ldots, q_{i,N}\}$ (to more completely bypass transient effects, as described
earlier), or we may change the hypothesized number of parameters in the looped schedule.
6 Application: Kahn Process Networks
A. Background on synthesis from Kahn process networks
The computation model of the Kahn Process Network (KPN) expresses applications in terms
of distributed control and memory. The KPN model of computation [6] assumes a network of
concurrent autonomous processes that communicate in a point-to-point fashion over unbounded
FIFO channels, using a blocking-read synchronization primitive. Each process in the network is
specified as a sequential program that executes concurrently with other processes.
To facilitate the migration from an imperative application specification, which is preferred
by many programmers, to a KPN specification, a set of tools, Compaan and Laura [12], is being
developed. This approach allows parts of an application written in a subset of Matlab to be con-
verted automatically to KPNs. The conversion is fast and correct-by-construction. The obtained
KPN processes can subsequently be mapped either to software or hardware.
B. Control Generation
In the synthesis flow of Laura, a VHDL description of an architecture is generated from a
KPN. Laura converts a process specification together with an IP core into an abstract architectural
model, called a virtual processor [6]. Every virtual processor is composed of four units (illus-
trated in Fig. 2): Execution, Read, Write, and Controller. Execution units contain the computa-
tional parts of virtual processors. To communicate data on FIFO channels, ports are devised,
which connect FIFO channels and virtual processors. Read/write units are in charge of multiplex-
ing/de-multiplexing port accesses for execution units. Controller units provide valid port access
sequences, or traces, to facilitate computation. The determination of port access sequences, also
called interface control generation, in a systematic way and compact form is our focus in this sec-
tion.
Fig 2.  Hardware realization of a virtual processor.
A simple approach to implementing the distributed control is to use ROM tables to store the
traces. However, this strategy is impractical because of large hardware area costs. To reduce the
complexity of the problem, several compile-time techniques have been proposed to compress these
tables while keeping the flexibility offered by the parametric approach [6]. In this paper, PCLSs
are applied to interface control generation to demonstrate the effectiveness of PCLSs for
hardware implementation.
To reduce the hardware area costs of the ROM table approach, class 0 looped schedules can be
constructed. Moreover, applications specified in the KPN model may have parameters
that can be configured at run time. The constructed class 0 looped schedules depend heavily on
the parameter values set dynamically. With the isomorphism formulation stated in Section 5 for
APLSs, groups of isomorphic class 0 looped schedules can be summarized by single APLSs whenever
such a formulation exists. This is how we generate the parameterizable and compact schedules,
which result in significantly better performance than the ROM table approach.
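The summarization step described above can be sketched for the single-parameter case of Section 5: first check that two class 0 instances are isomorphic, then fit each loop's iteration count to the form $ap + b$. This is only an illustrative sketch. The `(count, body)` tuple encoding, the function names `isomorphic` and `fit`, and the two schedule instances are hypothetical; they are not taken from the paper's experiments or from the Compaan/Laura tools.

```python
from fractions import Fraction


def isomorphic(l1, l2):
    """Structural equivalence of two class 0 loops (count, body); counts may differ."""
    body1, body2 = l1[1], l2[1]
    if len(body1) != len(body2):                        # condition (a)
        return False
    for i1, i2 in zip(body1, body2):
        if isinstance(i1, str) or isinstance(i2, str):
            if i1 != i2:                                # condition (b)
                return False
        elif not isomorphic(i1, i2):                    # condition (c)
            return False
    return True


def fit(l1, p1, l2, p2):
    """Replace each count by (a, b) with count = a*p + b, using instances at p1 != p2."""
    a = Fraction(l2[0] - l1[0], p2 - p1)
    b = l1[0] - a * p1
    body = [x if isinstance(x, str) else fit(x, p1, y, p2)
            for x, y in zip(l1[1], l2[1])]
    return ((a, b), body)


# Hypothetical class 0 instances for p = 5 and p = 8, each wrapped as (1 S).
w1 = (1, [(2, ["A", (5, ["B"])]), "C"])
w2 = (1, [(5, ["A", (8, ["B"])]), "C"])

fitted = fit(w1, 5, w2, 8) if isomorphic(w1, w2) else None
```

For these instances the fitted counts are $0 \cdot p + 1$ for the wrapper loop, $p - 3$ for the outer loop, and $p$ for the inner loop. Using `Fraction` keeps non-integer slopes visible, which would signal that the affine hypothesis or the chosen parameter values need to be revisited.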
C. FPGA Setup for Controller Units
To reduce the memory cost, we employ a micro-engine architecture that computes the KPN
control using PCLSs. Under the requirements of a virtual processor controller, the micro-engine
has to perform a for-loop operation and generate a KPN control symbol in one cycle. As shown in
Fig. 3, a PCLS controller consists of two parts: a ROM/RAM memory and a sequencer. The
ROM/RAM memory stores a compiled version of a PCLS, which describes a trace using
micro-instructions. The sequencer uncompresses the PCLS trace and generates the desired KPN
port by fetching and decoding micro-instructions from the control memory. The memory
address of the next micro-instruction must also be evaluated by the sequencer.
Fig 3.  Microprogrammed control unit organization.
To realize PCLSs in an FPGA hardware implementation, the following two steps can be employed:
• Symbolic program compilation: The first step involves the compilation of the input PCLS
using the micro-engine instruction primitives. This is done at the symbolic level.
• Hardware program generation: The second step takes the symbolic program and transforms it
into a bit-stream suitable for an FPGA platform. This step takes into account the bitwidths of
the loop count and the symbols used in the PCLS trace.
In hardware, encoding methods, such as one-hot or binary encoding, can be used for the
program symbols. The choice of encoding scheme depends on the size of the
implementation and/or speed constraints.
D. Experiments
Our experiments are based on implementation costs of the controller units on an FPGA. In
Table I, we show the FPGA area costs (the number of FPGA slices) for a number of applications.
Here, QR is a matrix decomposition algorithm [13], and Optical is a generic image restoration
algorithm [11]. For each application, particular processes are selected for our experiments. We
compare the size requirements under ROM table and PCLS implementations. In addition, the FPGA
area and the maximum frequency in generating port accesses are shown. For example, virtual
processor 3 (VP3) of the KPN representing Optical requires a ROM table size of 944460 bytes with
parameter values set to $W = 320$ and $H = 200$. The size reduces to only 160 bytes if the PCLS
scheme is employed. All the experiments are set up on a Virtex-II 2000 equipped device.
The obtained results are promising in terms of area and frequency. For example, the largest
PCLS trace occupies only 1% of the FPGA area. On the other hand, the ROM table approach uses
approximately 9% of the same FPGA chip area. For the QR algorithm, we derived the ROM table
for a set of typical parameter values ($N = 7$ and $K = 21$). The results with the compression
technique applied show a considerable compression rate for this kind of application.
Table I. Experimental results on FPGA.
Virtual Processor   ROM size (bytes)   PCLS size (bytes)   PCLS freq. (MHz)   PCLS area (slices)
QR VP2              35                 6                   207                35
QR VP3              176                16                  209                37
QR VP4              616                20                  205                40
Optical VP3         944460             160                 150                132
In this section, we have shown that our proposed PCLS methodology is effective for interface
control generation. It offers the flexibility of a parametric controller with small hardware
resource requirements. However, the performance of the PCLS-based controller depends on the
size of the trace. It is possible that the PCLS algorithm cannot optimally compress some execution
sequences, and this can in turn affect controller performance. We can see this trend in Table I,
where the size difference in traces affects the frequency of the entire design. For the first three
traces, the frequency is approximately 205 MHz, while for the last one the frequency drops by
about 27%, to 150 MHz.
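The compression rates and the frequency drop discussed above follow directly from the entries of Table I. The short Python sketch below recomputes them. The row data are copied from Table I; the helper names are introduced here for illustration only, and taking 205 MHz as the reference frequency is an assumption based on the text.

```python
# Rows of Table I: (virtual processor, ROM bytes, PCLS bytes, PCLS freq in MHz)
TABLE_I = [
    ("QR VP2", 35, 6, 207),
    ("QR VP3", 176, 16, 209),
    ("QR VP4", 616, 20, 205),
    ("Optical VP3", 944460, 160, 150),
]


def compression_ratio(rom_bytes: int, pcls_bytes: int) -> float:
    """How many times smaller the PCLS program is than the ROM table."""
    return rom_bytes / pcls_bytes


def relative_drop(reference: float, value: float) -> float:
    """Fractional decrease of `value` with respect to `reference`."""
    return (reference - value) / reference


if __name__ == "__main__":
    for name, rom, pcls, freq in TABLE_I:
        print("{:12s} ratio = {:9.1f}x  freq = {} MHz".format(
            name, compression_ratio(rom, pcls), freq))
    # Frequency of the Optical VP3 controller relative to 205 MHz.
    print("frequency drop: {:.1%}".format(relative_drop(205, 150)))
```

The ratios range from about 6x for the smallest QR process to several thousand for Optical VP3, whose trace is also the one with the lowest achievable frequency.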
7 Application: Synchronous Dataflow
A. Background on Synchronous Dataflow
The model of synchronous dataflow (SDF) [9] is an important common denominator across
a wide variety of DSP design tools. An SDF program specification is a directed graph where
vertices represent functional blocks (actors) and edges represent data dependencies. Actors are
activated when sufficient inputs are available, and FIFO queues (or buffers) are usually allocated to
buffer data transferred between actors. In addition, for each edge $e$, the numbers of data values
produced $prd(e)$ and consumed $cns(e)$ are fixed at compile time for each invocation of the
source actor $src(e)$ and sink actor $snk(e)$, respectively. To save memory in storing actor
execution sequences, previous studies have incorporated looping constructs to form static looped
schedules. The most compact form for such schedules, called single appearance schedules (SAS), is
that in which exactly one inlined version of code is allowed for each actor [1]. A two-actor SDF
graph and a corresponding SAS are shown in Fig. 4(a).
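Fig 4.  Examples of two-actor SDF graphs, schedules, and buffer costs.
(a) Graph A -> B, where A produces 3 tokens per firing and B consumes 2 tokens per firing.
    SAS: ((2A), (3B)), buffer cost = 6
    Non-SAS: (A, B, A, B, B), buffer cost = 4
(b) Graph A -> B, where A produces m tokens per firing and B consumes n tokens per firing.
    Buffer cost lower bound: m + n - gcd(m,n)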
B. Minimizing Code and Data Size via PCLS
SASs in the form of class 0 looped schedules, however, limit the potential for buffer minimi-
zation. Consider, for example, the SDF graph in Fig. 4. The SAS of Fig. 4 has a higher buffer cost
than the non-SAS does. The fixed iteration counts of class 0 schedule loops lack the flexibility in
expressing more irregular patterns, such as that of Fig. 4(b). In contrast, the more flexible iteration
control associated with the PCLS approach naturally accommodates the non-SAS in Fig. 4(a).
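The buffer costs quoted for Fig. 4(a) can be checked by simulating the token count on the edge between A and B. The Python sketch below is only an illustration: the function name and the string representation of a schedule are choices made here, not notation from the paper.

```python
from math import gcd


def buffer_cost(schedule: str, m: int, n: int) -> int:
    """Peak number of tokens on edge A -> B for a firing sequence.

    `schedule` is a string over {"A", "B"}; A produces m tokens per firing
    and B consumes n tokens per firing.
    """
    tokens = 0
    peak = 0
    for actor in schedule:
        if actor == "A":
            tokens += m
        elif actor == "B":
            tokens -= n
            if tokens < 0:
                raise ValueError("buffer underflow: B fired without enough tokens")
        else:
            raise ValueError("unknown actor: " + actor)
        peak = max(peak, tokens)
    if tokens != 0:
        raise ValueError("schedule does not return the buffer to its initial state")
    return peak


if __name__ == "__main__":
    m, n = 3, 2
    print("SAS     (2A)(3B):", buffer_cost("AABBB", m, n))
    print("non-SAS ABABB   :", buffer_cost("ABABB", m, n))
    print("lower bound     :", m + n - gcd(m, n))
```

For the rates of Fig. 4(a), the SAS needs 6 buffer locations while the interleaved non-SAS needs 4, which equals the lower bound m + n - gcd(m, n).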
We start by considering two-actor SDF graphs to minimize buffer costs through PCLS. A
useful lower bound on the buffer memory requirement of a two-actor SDF graph, as shown in Fig.
4(b), is $m + n - \gcd(m, n)$, and an algorithm is given in [1] to compute schedules that achieve
this bound. Intuitively, this algorithm executes the source actor just enough times to trigger
execution of the sink actor, and the sink actor executes as many times as possible (based on the
available input data) before control is transferred back to the source actor. This form of operation
can be described more precisely in terms of the following PCLS formulation.
Theorem 2: Given a two-actor SDF graph as in Fig. 4(b), depending on the values of $m$ and $n$,
the buffer memory lower bound $m + n - \gcd(m, n)$ can be reached through either of the
following PCLSs:
• If $m \ge n$, $PCLS = (([x_1, f_1]\, A\, ([x_2, f_2]\, B)))$, where
$f_1 = \frac{n}{\gcd(m, n)}$ and $f_2 = \left\lfloor \frac{(x_1 + 1)m}{n} \right\rfloor - \left\lfloor \frac{x_1 m}{n} \right\rfloor$.
• If $m \le n$, $PCLS = (([x_1, f_1]\, ([x_2, f_2]\, A)\, B))$, where
$f_1 = \frac{m}{\gcd(m, n)}$ and $f_2 = \left\lceil \frac{(x_1 + 1)n}{m} \right\rceil - \left\lceil \frac{x_1 n}{m} \right\rceil$.
Proof. This proof generally follows the reasoning behind the algorithm in [1] that provably
reaches the minimum buffer bound for two-actor SDF graphs as in Fig. 4(b). Let us start with the
case $m \ge n$. Every execution of $A$ produces more tokens than are consumed by $B$. To make the
smallest buffer size feasible, $B$ must be executed repeatedly in a way that the token consumption
catches up to the production as closely as possible. This behavior is carried out by the inner loop
$A\, ([x_2, f_2]\, B)$: $B$ is consecutively executed to digest the live tokens to the full extent (i.e., any
further execution of $B$ at this point would lead to buffer underflow). The number of iterations of the
outer loop is identical to the number of firings of $A$ because $A$ is executed only once in the inner
loop. A valid SDF schedule requires that the total numbers of tokens produced and consumed
be equal on an edge (this condition is referred to as the balance equation for the edge).
Therefore,
$$ f_1 = \frac{\text{total tokens exchanged}}{m} = \frac{mn}{\gcd(m, n)} \cdot \frac{1}{m} = \frac{n}{\gcd(m, n)} $$
is necessary for the minimum schedule period. To determine $f_2$, we have to examine the total token
consumption and production for an iteration. Up to the end of the $(x_1 + 1)$th iteration, there is a total
of $(x_1 + 1)m$ tokens produced and $\lfloor (x_1 + 1)m/n \rfloor$ executions of $B$ are required to maximally
consume the tokens. Therefore, at the $(x_1 + 1)$th iteration, $B$ needs $\lfloor (x_1 + 1)m/n \rfloor$ executions minus
the $\lfloor x_1 m/n \rfloor$ executions that have occurred in the previous iterations. Therefore, we obtain the equation
of $f_2$, as shown above.
With a similar argument, we can derive the PCLS for the case $m \le n$. For the case of
$m = n$, it does not matter which formulation ($m \ge n$ or $m \le n$) is used because both result in the
same PCLS, $(A, B)$. QED.
Example 4: A PCLS can be derived for the SDF graph in Fig. 4(a) by applying Theorem 2:
$$ \left( \left( [x_1, 2]\, A\, \left( \left[ x_2, \left\lfloor \frac{3(x_1 + 1)}{2} \right\rfloor - \left\lfloor \frac{3x_1}{2} \right\rfloor \right] B \right) \right) \right). $$
By expanding the loop hierarchy, we obtain the execution sequence $(A, B, A, B, B)$, which results in
the minimum buffer bound as shown in Fig. 4(b).
To extend this two-actor PCLS formulation to arbitrary acyclic graphs, we can apply the
recursive graph decomposition approach in [8]. The work of [8] focuses on systematic
implementation based on nested procedure calls, where both data and program memory space are
considered in the optimization process. The work of [8] starts by decomposing an SDF graph into a
hierarchy of two-actor SDF graphs. An example of CD to DAT sample rate conversion is given in
Fig. 5 to demonstrate this decomposition into a hierarchy of two-actor graphs. To adapt the
approach to PCLS implementation, we can map the resulting graph hierarchy into a corresponding
hierarchy of PCLS-based schedule loops to derive an efficient PCLS implementation for the
original SDF graph.
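The two PCLS forms of Theorem 2 can be expanded mechanically, which gives a quick check of Example 4 and of the buffer bound. The following Python sketch makes two assumptions that should be read as such: the bracketed terms of $f_2$ are taken as floors in the case $m \ge n$ and as ceilings in the case $m \le n$, and the outer index $x_1$ runs from 0 to $f_1 - 1$. Both were chosen because they reproduce the sequence (A, B, A, B, B) of Example 4. All function names are introduced here for illustration.

```python
from math import gcd


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def pcls_two_actor(m: int, n: int) -> list[str]:
    """Expand the Theorem 2 PCLS for A --(m, n)--> B into a firing sequence."""
    if m < 1 or n < 1:
        raise ValueError("rates must be positive")
    g = gcd(m, n)
    seq = []
    if m >= n:
        for x1 in range(n // g):                      # f1 = n / gcd(m, n)
            seq.append("A")
            f2 = (x1 + 1) * m // n - x1 * m // n      # floor difference
            seq.extend("B" * f2)
    else:
        for x1 in range(m // g):                      # f1 = m / gcd(m, n)
            f2 = ceil_div((x1 + 1) * n, m) - ceil_div(x1 * n, m)
            seq.extend("A" * f2)
            seq.append("B")
    return seq


def peak_tokens(seq: list[str], m: int, n: int) -> int:
    """Peak buffer occupancy on the edge; raises on underflow."""
    tokens = peak = 0
    for actor in seq:
        tokens += m if actor == "A" else -n
        if tokens < 0:
            raise ValueError("buffer underflow")
        peak = max(peak, tokens)
    return peak


if __name__ == "__main__":
    for m, n in [(3, 2), (2, 3), (4, 4)]:
        seq = pcls_two_actor(m, n)
        print(m, n, "".join(seq), peak_tokens(seq, m, n), m + n - gcd(m, n))
```

For $m = 3$, $n = 2$ the expansion is A B A B B with a peak of 4 tokens, and for the mirrored rates $m = 2$, $n = 3$ it is A A B A B with the same peak, in agreement with the bound $m + n - \gcd(m, n)$.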
C. Experiments
Experiments are set up to compare the results of PCLS and nested procedure call (abbreviated
as NPC) synthesis in terms of execution time and code size. The benchmarks, CD2DAT and
DAT2CD, in our experiments are obtained from the released source library of the Ptolemy project
[7]. In the PCLS-based synthesis, the iteration counts in Theorem 2 are pre-computed and saved
in arrays, so that iteration counts are retrieved by indexing at run time. The hardware platforms
in our experiments are TMS320C6xxx processor simulators from the Texas Instruments
Code Composer Studio.
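The table-driven evaluation of iteration counts can be sketched as follows. This is a Python illustration under stated assumptions, not the synthesized code used in the experiments: it covers only the $m \ge n$ case of Theorem 2, reads the bracketed terms of $f_2$ as floors with $x_1$ starting at 0, and uses hypothetical callback names for the actor invocations.

```python
from math import gcd


def precompute_counts(m: int, n: int) -> list[int]:
    """Inner-loop counts f2(x1), one entry per outer-loop iteration."""
    if not (m >= n >= 1):
        raise ValueError("this sketch covers only m >= n >= 1")
    f1 = n // gcd(m, n)
    return [(x1 + 1) * m // n - x1 * m // n for x1 in range(f1)]


def run_schedule(counts, fire_a, fire_b):
    """Execute the looped schedule, reading each inner count from the table."""
    for x1 in range(len(counts)):
        fire_a()
        for _ in range(counts[x1]):   # table lookup instead of run-time division
            fire_b()


if __name__ == "__main__":
    trace = []
    counts = precompute_counts(3, 2)
    run_schedule(counts, lambda: trace.append("A"), lambda: trace.append("B"))
    print(counts, "".join(trace))
```

Storing the counts trades a small array per loop for the divisions that evaluating $f_2$ at run time would otherwise require.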
The experimental results are summarized in Tables II and III. We measure the percentage
improvement of PCLS over NPC, $(NPC - PCLS)/NPC$, for both execution time and code size
(note that buffer memory costs are identical under both approaches since the same intermediate
graph hierarchy is used). A positive percentage indicates that PCLS performs better, and
conversely, a negative percentage indicates that NPC performs better. Moreover, C compiler
optimization options are turned on selectively to stress the criticality of execution speed or code size.
In both tables, PCLS always requires less code than NPC, especially when the
“size critical” compiler configuration is selected. On the other hand, neither synthesis strategy
emerges as a “clear winner” for execution time minimization. However, we also observe that the
execution time overhead of PCLS-based implementation is consistently low (at most 0.86% in
Tables II and III).
Fig 5.  A CD to DAT sample rate conversion example, with panels (a), (b), and (c).
[Figure: SDF graph over actors A, B, C, D, E, F (CD input, DAT output) and its decomposition into two-actor graphs W1-W2, W'1-W'2, and W"1-W"2.]
SAS = ((7(7(3AB)(2C))(4D)), (32E(5F)))
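To make the looped-schedule notation of Fig. 5 concrete, the SAS given there can be expanded and its firings counted. The nested-tuple representation and the function names in this Python sketch are choices made here for illustration; only the schedule itself is taken from Fig. 5.

```python
from collections import Counter

# A loop is written (count, [items]); a plain string is a single actor firing.
# This encodes SAS = ((7(7(3AB)(2C))(4D)), (32E(5F))) from Fig. 5.
CD2DAT_SAS = [
    (7, [(7, [(3, ["A", "B"]), (2, ["C"])]), (4, ["D"])]),
    (32, ["E", (5, ["F"])]),
]


def expand(items):
    """Yield the flat firing sequence of a class 0 looped schedule."""
    for item in items:
        if isinstance(item, str):
            yield item
        else:
            count, body = item
            for _ in range(count):
                yield from expand(body)


def count_firings(items) -> Counter:
    """Number of firings of each actor in one schedule period."""
    return Counter(expand(items))


if __name__ == "__main__":
    firings = count_firings(CD2DAT_SAS)
    for actor in "ABCDEF":
        print(actor, firings[actor])
    print("total firings:", sum(firings.values()))
```

Each actor appears once in the schedule text, yet one period expands to several hundred firings. The ratio of A firings to F firings is 147:160, which equals the ratio of the 44.1 kHz and 48 kHz sample rates.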
8 Summary and Future Directions
This paper has focused on the motivation for examining broader classes of looped schedules,
and on the definition and application of parameterized, constant-update looped schedules
(PCLSs) for generating static execution sequences (programs). PCLSs go beyond traditional static
looped schedules by making the management of loop counters more explicit. This greatly
enlarges the space of execution sequences that can be compactly represented, while requiring low
overhead in most implementation contexts. As the terminology in this paper suggests, there are
possibilities for further enriching the classes of looped schedules under investigation. For exam-
ple, one might consider a more general class of schedules in which output values computed by
“instructions” can be captured and used in the initialization or updating of loop counts, or in
which the index update function can be more complex, involving possibly the indices of other
loops.
Table II. Experiments on TMS320C670x processors.
            Execution Time Improvement (%)     Code Size Improvement (%)
            speed critical   size critical     speed critical   size critical
CD2DAT      -0.86            -0.68             3.52             7.08
DAT2CD      -0.02            0.47              4.48             7.8
Table III. Experiments on TMS320C620x processors.
            Execution Time Improvement (%)     Code Size Improvement (%)
            speed critical   size critical     speed critical   size critical
CD2DAT      0.11             4.87              2.49             5.8
DAT2CD      -0.4             -0.14             2.25             6.24
9 References
[1] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, “Software synthesis from dataflow
graphs,” Kluwer Academic Publishers, 1996.
[2] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, “Optimal parenthesization of lexical order-
ings for DSP block diagrams,” Proc. of the International Workshop on VLSI Signal Process-
ing, Sakai, Osaka, Japan, 1995.
[3] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete, “Cyclo-static dataflow,” IEEE
Trans. Signal Processing, February 1996.
[4] K. D. Cooper and N. McIntosh, “Enhanced code compression for embedded RISC proces-
sors,” Proc. Conf. Programming Language Design and Implementation, May 1999.
[5] S. Debray, W. Evans, R. Muth, B. de Sutter, “Compiler techniques for code compression,”
ACM Trans. Programming Languages and Systems, 2000.
[6] S. Derrien, A. Turjan, C. Zissulescu, B. Kienhuis, “Deriving efficient control in Kahn pro-
cess networks,” Proc. Intl. Workshop on Systems, Architectures, Modeling, and Simulation,
2003.
[7] J. Eker et al., “Taming heterogeneity — the Ptolemy approach,” Proceedings of the IEEE,
Jan. 2003.
[8] M. Ko, P. K. Murthy, and S. S. Bhattacharyya, “Compact procedural implementation in DSP
software synthesis through recursive graph decomposition,” Proc. Intl. Workshop on Soft-
ware and Compilers for Embedded Processors, pages 47-61, September 2004.
[9] E. A. Lee and T. M. Parks, “Dataflow process networks,” Proc. of the IEEE, 83(5):773-799,
May 1995.
[10] S. Liao, “Code generation and optimization for embedded digital signal processors,” PhD
thesis, MIT, 1996.
[11] E. Memin and T. Risset, “On the study of VLSI derivation for optical flow estimation,” Intl.
Journal of Pattern Recognition and Artificial Intelligence, 2000.
[12] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, and E. Deprettere, “System design using
Kahn process networks: the Compaan/Laura approach,” Proc. Design, Automation and Test
in Europe Conference and Exhibition, February 2004.
[13] R. Walke, R. Smith, and G. Lightbody, “20 GFLOPS QR processor on a Xilinx Virtex-E
FPGA,” Proc. Advanced Signal Processing Algorithms, Architectures, and Implementations
X, pages 300-310, 2000.
[14] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans.
Information Theory, 23:337-342, 1977.
