Design space exploration strategies for FPGA implementation of signal processing systems using CAL dataflow program by Rahman, Ab et al.
DESIGN SPACE EXPLORATION STRATEGIES FOR FPGA IMPLEMENTATION OF SIGNAL
PROCESSING SYSTEMS USING CAL DATAFLOW PROGRAM
Ab Al-Hadi Ab Rahman, Richard Thavot, Simone Casale Brunet, Endri Bezati, Marco Mattavelli
SCI-STI-MM, École Polytechnique Fédérale de Lausanne
Station 11, CH-1015, Lausanne
ABSTRACT
This paper presents some strategies for design space
exploration of FPGA-based signal processing systems that
are specified using the CAL dataflow language. The actor-
oriented, high-level of abstraction provided by CAL allows
flexible exploration and consequently results in a wide range
of feasible design implementations. We have applied and ex-
tended the existing techniques for refactoring and pipelining
actors and actions by means of critical path analysis, and in-
troduced some new buffering techniques based on heuristics.
The combinations of these techniques have been applied on
the CAL specification of the MPEG-4 video decoder, and
synthesized to HDL for evaluation in the design implementa-
tion space. Results show that using our configuration for the
exploration of 48 design points, a throughput range of roughly
8x has been achieved, for slice, block RAM, frequency, and
latency range of 1.3x, 2.5x, 2.5x, and 2.9x respectively.
Index Terms— Dataflow, CAL, exploration, MPEG,
FPGA
1. INTRODUCTION
Design exploration is one of the most important aspects in
the implementation of signal processing systems. Essentially,
an implementation of a system should always meet or exceed
the target performance, which for hardware implementation,
is typically that of system throughput, power consumption,
and/or implementation area. In some cases, multiple objec-
tives are required, for example in mobile applications where a
design should exhibit a minimum throughput using the lowest
possible power. For cost minimization, it is also crucial to use
as small silicon area as possible for a given throughput and/or
power requirement. The different feasible implementations
of a system are often defined as design points in the multidi-
mensional space; the exploration of these points spanned by
the objectives is called the design space exploration.
In the context of hardware-based design of signal processing
systems, there is now a growing interest in specifying designs
at a high level of abstraction rather than at the low-level
Register Transfer Level (RTL), mainly because this gives a
faster design cycle, is less tedious and error prone, allows a
higher degree of code reuse, and makes it easier to incorporate
advanced algorithms. There exists a variety of high-level
languages and models whose specifications can be synthesized
to hardware. For example, the work of Lahiri et al. [1] uses
pre-configured IP blocks in a dataflow environment. Although
the blocks are generally in optimized form, this restricts the
designer from exploring different architectures of the
instantiated blocks. Synthesizing hardware from imperative
languages such as C/C++ has also been a topic of intensive
research, for example the GAUT tool [2] of Lab-STICC. However,
imperative programs are designed to run sequentially and lack
the concept of time, and are therefore difficult to analyze for
potential parallelism [3]. SystemC extends C++ with a subset
of synthesizable constructs, but is mainly used for rapid
cycle-based simulations. In this work we use the CAL dataflow
language [4], which was specified as part of the Ptolemy
project [5] and has been proven to generate efficient hardware
implementations, as in the works in [6] and [7].
CAL is a domain-specific language for the design of
dataflow actors. The core of the language is based on data
token consumption and production by the actors, which lends
itself naturally to signal processing systems. It is based on the
Kahn Process Network (KPN) [8], where actors communicate
via unbounded FIFO channels, but transformed to bounded
sizes for implementation. Individual actors in a network are
designed to execute in parallel, therefore allowing designers
to explicitly specify the desired parallelism. Each actor in
the network should contain at least one action, with the con-
straint that at any given time, an actor selects only a single
action to fire. The selection of how and which action to fire
is at the core of the Model of Computation (MoC), and is
explained in section 2.3. CAL is also designed to be platform
independent and retargetable to a rich variety of platforms for
software (using orcc [9]), hardware (using openForge [6]),
and co-design environments [10].
In this work, we contribute to the application and ex-
tension of existing techniques, as well as introducing new
techniques for improving the throughput of complex CAL
dataflow programs for hardware implementation. All the
techniques are then combined to obtain a design space explo-
ration that explores the trade-off between design throughput
and resources, operating frequency, and system latency for
the FPGA implementation of MPEG4 decoder. Three design
space exploration strategies have been applied and imple-
mented: refactoring and buffering to reduce system latency,
and pipelining to obtain higher operating frequency.
2. DESIGN SPACE EXPLORATION STRATEGIES
2.1. Trace Critical Path & Refactoring
Refactoring of an actor/action essentially means splitting,
replicating, or modifying its computational elements such
that an increase in parallelism is obtained. In a complex
dataflow network, the challenge is to find the critical ac-
tors/actions, such that when the refactoring is performed on
them, overall system throughput is improved. The critical
actors/actions reside in the execution trace critical path (CP),
which is the longest weighted path from source to sink node
of the system.
The first tool that was developed to find the CP for CAL
programs is known as the CAL Design Suite (CDS), which
has been successfully used for optimizing software-based
MPEG4 SP decoder [11] and hardware-based MPEG4 SP
intra decoder [12]. The tool has been superseded by a new
one called TURNUS that improves the technique for finding
the critical actors/actions by performing dynamic analysis of
dataflow programs directly at the dataflow level, instead of
at software executable level as done in CDS. This results in
faster convergence and higher details on the analysis. The
following describes the methodology for trace critical path
analysis using TURNUS.
The analysis is based on the execution trace $MDAG(V, E)$,
which is generated by a platform-independent co-simulation of
the dataflow design. It is defined as a multi directed acyclic
graph such that every node $v_i \in V$ is a single firing of an
action, and every edge $e_{i,j} \in E$ is a dependency of node
$v_j$ on node $v_i$ (i.e. $v_i$ must be executed before $v_j$
[13]). The dependencies are fundamental for defining
constraints on the execution order between any pair of fired
actions, and describe a scheduler-independent design behaviour
because they impose an implicit execution order between the
two connected actions (i.e. if there exists a dependency
$e_{i,j}$ from node $v_i$ to node $v_j$, then the firing of
$v_j$ can occur only after the firing of $v_i$). Moreover, more
than one dependency may exist from node $v_i$ to node $v_j$:
$\{e_{i,j}, \forall i, j : e_{i,j} \in \delta^+_i \cap \delta^-_j\}$.
Once the $MDAG(V, E)$ has been generated, it is weighted by
assigning a weight $w_i$ to each node $v_i$ and a weight
$w_{i,j}$ to each edge $e_{i,j}$. All the weights are
implementation specific: for example, in a hardware
implementation a node weight $w_i$ is defined as the latency
(i.e. number of clock cycles) to execute the node $v_i$, and an
edge weight $w_{i,j}$ is the communication latency from the end
of the firing of node $v_i$ to the start of the firing of node
$v_j$. Essentially, the trace represents a platform-independent
behaviour of the design, and the CP can be determined only
after the weights have been assigned.
In order to reduce the algorithmic overhead of the analysis,
an augmented graph $MDAG(\tilde{V}, \tilde{E}) \supset MDAG(V, E)$
is defined, in which two new fictitious nodes are added: the
source node $v_S$ and the sink node $v_T$ (both with weight
$w_S = w_T = 0$). All the nodes $v_i$ that have no incoming
edges ($|\delta^-_i| = 0$) are connected to $v_S$ with a
fictitious connection $e_{S,i}$ with weight $w_{S,i} = 0$. The
same is done for all the nodes $v_i$ that have no outgoing
edges ($|\delta^+_i| = 0$), which are connected to $v_T$ with a
fictitious connection $e_{i,T}$ with weight $w_{i,T} = 0$. The
topological order of the nodes remains the same,
$\{v_i < v_{i+1}, \forall i : v_i, v_{i+1} \in \tilde{V}\}$, by
assigning $v_S$ and $v_T$ the lowest and the highest
topological index of $\tilde{V}$ respectively. After all the
nodes, edges, and weights are annotated on the graph, the trace
critical path can be evaluated. There exist several techniques
for evaluating the trace, such as those in [14] and [15].
TURNUS implements the algorithm proposed in [15] because it
provides reduced complexity and more detailed profiling
information; moreover, all of the required operations can be
done in $O(|V| + |E|)$ by following the topological order of
the $MDAG(\tilde{V}, \tilde{E})$.
In this direction, for each node of the trace, four param-
eters are defined: 1) the Early Start time (ES) defines the
earliest possible time that a node can start executing; 2) The
Latest Start time (LS) defines the latest possible time that a
node can start executing such that overall system latency does
not change; 3) the Early Finish time (EF) defines the earliest
possible time that a node can finish executing; 4) The Latest
Finish time (LF) defines the latest possible time that a node
can finish executing such that overall system latency does not
change.
Furthermore, for each node $v_i$ and edge $e_{i,j}$, the slacks
$S_{v_i}$ and $S_{e_{i,j}}$ are defined, representing the
maximum delay that a node or edge could tolerate without
increasing the system latency (i.e. if $S = 0$ then any
increase in latency for the node or edge results in an increase
in latency of the entire system). Nodes
$v^*_i = \{v_i \in \tilde{V} : S_{v_i} = 0\}$ and edges
$e^*_{i,j} = \{e_{i,j} \in \tilde{E} : S_{e_{i,j}} = 0\}$ are
defined as critical. The sets of the critical executed actions
and the critical dependencies are defined respectively as
$CA = \{v^*_i, \forall i : v^*_i \in V\}$ and
$CD = \{e^*_{i,j}, \forall i, j : e^*_{i,j} \in E\}$.
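The four start/finish times and the slacks can be illustrated with a short sketch. It is not the TURNUS implementation: it assumes the usual critical-path-method conventions, namely node slack = LS - ES and edge slack = LS(target) - EF(source) - edge weight, which the text above does not spell out. The function name `compute_slack` and its argument layout are hypothetical.

```python
from collections import defaultdict


def compute_slack(nodes, node_w, edges):
    """Forward/backward pass over a weighted DAG.

    nodes  : node names, already in topological order
    node_w : dict node -> weight (e.g. clock cycles of the firing)
    edges  : list of (src, dst, weight); parallel edges are allowed
    Assumed conventions (not taken from the paper):
      node slack = LS - ES, edge slack = LS[dst] - EF[src] - weight
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v, w in edges:
        preds[v].append((u, w))
        succs[u].append((v, w))

    # Forward pass: earliest start (ES) and earliest finish (EF).
    es, ef = {}, {}
    for v in nodes:
        es[v] = max((ef[u] + w for u, w in preds[v]), default=0)
        ef[v] = es[v] + node_w[v]
    latency = max(ef.values())

    # Backward pass: latest finish (LF) and latest start (LS).
    lf, ls = {}, {}
    for v in reversed(nodes):
        lf[v] = min((ls[s] - w for s, w in succs[v]), default=latency)
        ls[v] = lf[v] - node_w[v]

    node_slack = {v: ls[v] - es[v] for v in nodes}
    edge_slack = [ls[v] - ef[u] - w for u, v, w in edges]
    return node_slack, edge_slack, latency
```

Both passes visit every node and edge once, which matches the O(|V| + |E|) bound quoted for the analysis.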
The CP can be defined as a path from $v_T$ to $v_S$ where all
the nodes and edges are critical. The algorithm for finding
such a path is depicted in Fig. 1: the graph is walked back
starting from $v_T$, and at each iteration $i$ a critical node
$v^*_j$ is reached from the critical edge $e^*_{i,j}$. The CP
is completely determined when $v_S$ is reached, and its weight
is defined as
$CP = LF_{v_T} \le \sum\{w_i, \forall i : v_i \in CA\} + \sum\{w_{i,j}, \forall i, j : e_{i,j} \in CD\}$;
one such path always exists [16].
The application of TURNUS to find the execution trace CP and
the list of critical actors for our case study of the MPEG4
SP decoder is described in Section 3.
v(i) = v(T)
// while source node has not been reached
while (v(i) != v(S))
    v(j) = walkBackCriticalPath(v(i))
    v(i) = v(j)
end

procedure walkBackCriticalPath(v(i))
    for each node j with v(i) dependency
        if (Sv(j) == 0 and Se(j, i) == 0)
            return v(j)
        end
    end
end
Fig. 1. Algorithm for evaluating the trace critical path.
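A minimal runnable sketch of the walk-back in Fig. 1 is given here, assuming the slacks have already been computed. The function name `walk_back_critical_path` and the dictionary-based graph layout are illustrative choices, not part of TURNUS; for brevity the edge slack is keyed by (source, target), so parallel edges are not distinguished.

```python
def walk_back_critical_path(sink, source, preds, node_slack, edge_slack):
    """Return the critical path as a list of nodes from source to sink.

    preds      : dict node -> list of predecessor nodes
    node_slack : dict node -> slack of the node
    edge_slack : dict (src, dst) -> slack of the edge (hypothetical layout)
    """
    path = [sink]
    current = sink
    while current != source:            # until the source node is reached
        for prev in preds[current]:
            # step to the first predecessor that is critical and is
            # connected through a critical edge
            if node_slack[prev] == 0 and edge_slack[(prev, current)] == 0:
                current = prev
                break
        else:
            raise ValueError(f"no critical predecessor for {current!r}")
        path.append(current)
    path.reverse()
    return path
```

The explicit error is a defensive addition: on a correctly weighted trace a critical predecessor always exists, so it is only reached when the slack values are inconsistent.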
2.2. Action Critical Path & Pipelining
Pipelining could also be considered as refactoring an action,
but performed specifically to increase pipeline parallelism. In
pipelining, the output of a computational element is the in-
put to the next, separated by some memory buffers. The par-
allelism occurs when the computational elements execute in
parallel. For CAL designs, the computational elements are
considered to be the actors that should all fire at the same
time for a given pipeline depth. This requires that every actor
in the pipeline depth is a single-action SDF 1 actor.
A semi-automated tool that performs pipeline synthesis
and optimization for CAL programs is given in [17]. The ba-
sic steps are given in the flow chart of Fig. 2. Starting from a
CAL program, it is first synthesized to Hardware Description
Language (HDL) using openForge, and then to RTL imple-
mentation using tools such as XST or Synplify. From this,
we obtain the action critical path that defines the action in
the network with the longest combinatorial path. If the action
is not part of a single-action SDF actor, then it is converted
to one. The SDF actor is then sent to the automatic pipeline
synthesis and optimization tool which takes in the throughput
requirement and generates the pipelined CAL actors using the
global minimum pipeline resources.
The pipeline synthesis and optimization tool attempts to
partition a single-action SDF actor into k parts that are as
equal as possible in terms of the length of the combinatorial
path, using the minimum number of pipeline registers. It first
generates the ASAP and ALAP schedules for the action based on
the operator-input, operator-output, and operator-precedence
relations. From this, operator mobility is determined and
operators are arranged in order of mobility. This is then used
in the coloring algorithm that generates all possible (and
valid) pipeline schedules based on the operator conflict and
nonconflict relations. For each pipeline schedule, the total
register width is estimated, and the schedule with the least
width is taken as the optimal solution, which is finally used
to generate pipelined
CAL actors. In [18], the tool has been used to pipeline the
MPEG4 SP intra decoder with an overall throughput improvement
of more than 3x using minor additional resources.
1 Static dataflow, where actors consume and produce a constant
number of tokens on their input and output ports at every firing
[Figure: flow chart. CAL programs -> OpenForge -> HDL programs ->
XST/Synplify -> critical action -> single-action SDF? (no: convert
to single-action SDF; yes: proceed) -> pipeline synthesis and
optimization, with the throughput requirement as input ->
pipelined CAL actors.]
Fig. 2. Methodology for pipelining CAL dataflow programs.
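The ASAP/ALAP scheduling and mobility ordering used by the pipeline tool can be sketched as follows. This is a simplified illustration and not the tool of [17]: it assumes every operator takes one unit step, whereas the actual tool balances combinatorial path length and register width. The name `asap_alap_mobility` and the `deps` layout are hypothetical.

```python
def asap_alap_mobility(ops, deps):
    """Compute ASAP and ALAP steps and operator mobility.

    ops  : operator names in topological order
    deps : dict op -> list of operators whose outputs it consumes
    Assumption: unit-delay operators (one step each).
    """
    # ASAP: one step after the latest producer.
    asap = {}
    for op in ops:
        asap[op] = max((asap[p] + 1 for p in deps.get(op, [])), default=0)
    last_step = max(asap.values())

    # Invert the dependency relation to find the consumers of each operator.
    users = {op: [] for op in ops}
    for op in ops:
        for p in deps.get(op, []):
            users[p].append(op)

    # ALAP: one step before the earliest consumer.
    alap = {}
    for op in reversed(ops):
        alap[op] = min((alap[u] - 1 for u in users[op]), default=last_step)

    mobility = {op: alap[op] - asap[op] for op in ops}
    order = sorted(ops, key=lambda op: mobility[op])  # least mobile first
    return asap, alap, mobility, order
```

Operators with zero mobility lie on the longest operator chain and have only one legal step, which is why they are placed first when candidate pipeline schedules are enumerated.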
2.3. FIFO Interconnections & Buffering
In a CAL dataflow network, actors are interconnected using
FIFO buffers. The selection of the size of each FIFO buffer in
the network is crucial, as it impacts not only the
functionality (deadlock or deadlock-free) but also the
performance.
Since actors can execute in parallel, a high-throughput
system is obtained if as many actors as possible are executing
at any given time. An action in an actor fires if enabled by:
1) availability of input tokens, 2) value of input tokens,
controlled by guard conditions, 3) the actor scheduler, 4) the
action priority, and/or 5) the availability of free space to
store output tokens. In order to ensure that actions are
enabled and fired as quickly as possible (hence resulting in
higher throughput), conditions (1) and/or (5) have to be met as
fast as possible. Systems with large buffer sizes between
actors would always satisfy these conditions (for both actors),
since input tokens are rapidly available from the buffers and
output tokens can always be generated due to the large output
buffers. However, setting all buffers to large values may not
result in an area-efficient implementation. On the other hand,
buffer sizes that are too small between actors may introduce
system deadlock. This is a condition in which one or more
actors stall while waiting for input tokens that will never
arrive, or actions cannot fire because an output buffer is
always full.
In the following, we present two automatic buffering
techniques based on heuristics that find 1) the
close-to-minimum buffer sizes required for deadlock-free
execution, with lower throughput, using an extended version of
[19], and 2) larger buffers for deadlock-free execution, with
higher throughput, using a modified version of [20].
2.3.1. Close-to-minimum
This evaluation can be performed on a KPN model-of-computation
graph to extract the minimum buffer sizes that guarantee
deadlock-free execution at the Transaction Level Modeling
(TLM) level. The analysis is performed on an untimed
simulation and focuses on the communication channels. The
approach assumes a demand-driven scheduling strategy [19],
known to minimize the buffer size requirement by trying to
mimic a perfect scheduler. The algorithm is shown in Fig. 3.
In order to build a demand-driven scheduler, actors are
ordered using a topological sorting algorithm, which in turn
uses the strongly connected components algorithm to resolve
cycles in the graph. The demand-driven scheduler tries to run
actor by actor from the sink to the source until an actor is
executed. Once an actor has been executed, the scheduler
restarts from the sink. During the simulation analysis, the
model of computation uses the communication mechanisms of the
TLM FIFO. Those FIFOs are modelled as abstract channels, and
they reallocate their size according to the demand-driven
computation. Consequently, the maximum size reached through
re-allocation gives the minimal size needed for a
deadlock-free implementation.
eos := false // end of simulation flag
instances := toposort(V)
index := instances.last.index
while (!eos)
    instance := instances[index];
    // input tokens and state variables are available
    if (instance.isSchedulable)
        for each input edges of instance
            store edge.buffer.maxSize
        end
        for each edges of instance
            edge.buffer.reallocSize
        end
        instance.schedule // run the instance
        index := instances.last.index // restart graph scheduler
    elsif (instances[index] != instances.first)
        index := index − 1 // select the previous instance
    else
        eos := true // nothing is schedulable
    end
end
Fig. 3. Algorithm for the close-to-minimum buffering tech-
nique.
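The demand-driven scheduler of Fig. 3 can be illustrated on a toy model. The sketch below makes strong simplifying assumptions that are not in the paper: actors have fixed token rates, the graph is acyclic (so no strongly connected component handling is needed), and source actors fire a fixed number of times. All names (`demand_driven_buffer_sizes`, `source_firings`, the channel tuple layout) are hypothetical.

```python
def demand_driven_buffer_sizes(actors, channels, source_firings):
    """Estimate close-to-minimum FIFO sizes with a demand-driven schedule.

    actors         : actor names in topological order (source first, sink last)
    channels       : dict (producer, consumer) -> (produced, consumed) per firing
    source_firings : dict source actor -> number of firings allowed
    Returns a dict channel -> maximum occupancy observed.
    """
    inputs_of = {a: [ch for ch in channels if ch[1] == a] for a in actors}
    outputs_of = {a: [ch for ch in channels if ch[0] == a] for a in actors}
    occupancy = {ch: 0 for ch in channels}
    max_size = {ch: 0 for ch in channels}
    budget = dict(source_firings)

    def is_schedulable(actor):
        if not inputs_of[actor]:                      # source actor
            return budget.get(actor, 0) > 0
        return all(occupancy[ch] >= channels[ch][1] for ch in inputs_of[actor])

    def fire(actor):
        if not inputs_of[actor]:
            budget[actor] -= 1
        for ch in inputs_of[actor]:
            occupancy[ch] -= channels[ch][1]
        for ch in outputs_of[actor]:
            occupancy[ch] += channels[ch][0]
            max_size[ch] = max(max_size[ch], occupancy[ch])

    index = len(actors) - 1                           # start from the sink
    while True:
        if is_schedulable(actors[index]):
            fire(actors[index])
            index = len(actors) - 1                   # restart from the sink
        elif index > 0:
            index -= 1                                # try the previous actor
        else:
            break                                     # nothing is schedulable
    return max_size
```

Because an actor closer to the sink is always preferred, tokens are consumed as soon as they can be, which keeps the recorded maximum occupancy low.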
2.3.2. Deadlock-increment
The deadlock-increment technique iteratively updates all
buffers that cause the system to deadlock at any given time.
The algorithm is shown in Fig. 4. Initially, all buffers are set
to the smallest value, i.e. 1. While the system is deadlocked,
a hardware simulation is run to find the list of buffers that
are full (i.e. that cause the system to deadlock). Each of
these buffers is then doubled, and the while loop is repeated
until the system is deadlock free.
for each channel i from 0 to m
    // initialize each channel capacity to 1
    buffer_size(i) = 1
end
deadlock_free = false
// while system is deadlock
while (!deadlock_free)
    run hardware simulation
    // check for deadlock
    deadlock = check_for_deadlock
    if (deadlock)
        for each full buffer j from 0 to n
            buffer_size(j) = buffer_size(j) * 2
        end
    else // deadlock free
        deadlock_free = true
        get_clock_cycles_latency
    end
end
Fig. 4. Algorithm for the deadlock-increment buffering tech-
nique.
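The loop of Fig. 4 can be sketched with the hardware simulation abstracted behind a callback. The `simulate` callback and its return tuple are assumptions made for illustration; in the paper this step is a Modelsim run driven by a TCL script.

```python
def deadlock_increment(num_channels, simulate):
    """Double every full buffer until the simulated system is deadlock free.

    simulate(sizes) is a hypothetical callback standing in for the hardware
    simulation. It must return (deadlocked, full_channels, latency), where
    full_channels lists the indices of the buffers that are full.
    """
    sizes = [1] * num_channels          # smallest capacity for every channel
    while True:
        deadlocked, full_channels, latency = simulate(sizes)
        if not deadlocked:
            return sizes, latency       # deadlock free: report clock cycles
        if not full_channels:
            raise RuntimeError("deadlock reported without any full buffer")
        for j in full_channels:
            sizes[j] *= 2               # sizes stay powers of 2
```

Starting from 1 and only doubling keeps every size a power of 2, which is the constraint the text gives for the dataflow RTL architecture.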
Note that there are infinitely many ways in which the buffers
can be incremented in every iteration; for example, they can be
incremented by any value either by addition, multiplication,
etc. There are also several ways to select which buffers
should be updated, either by the smallest or largest width, all
or partial buffers, etc. In this work we choose to double the
buffers because the dataflow RTL architecture accepts only
buffer sizes that are powers of 2, and to update all buffers
that cause the system to deadlock in order to obtain a fairly
large total buffer size that would give a reasonably high
throughput.
The algorithm in Fig. 4 runs a hardware simulation to
check for deadlocks each time that buffer sizes are updated.
This implies that the algorithm performs a dynamic analysis
where the results are specific to a particular input stimulus.
This is due to the assumption that there is at least one
dynamic actor (DDF 2) in the network, which cannot be analyzed
statically. The algorithm has been implemented as a TCL script
for the Modelsim hardware simulator, which also calls a Java
program to automatically update the buffers in the HDL file
and log the results to a text file.
3. CASE STUDY: MPEG4 SIMPLE PROFILE
DECODER
The design space exploration strategies discussed in section
2 have been applied on the Reconfigurable Video Coding
(RVC) MPEG4 Simple Profile (SP) decoder [21]. The de-
coder is specified in CAL and based on the new RVC standard
in the ISO/MPEG, which proposes a new paradigm for spec-
ifying and designing complex signal processing systems.
Essentially, the standard enables specifying new codecs by
assembling blocks from a Video Tool Library (VTL), which
results in higher flexibility, reusability, and modularity.
2 Dynamic dataflow, execution condition depends on the input data
The top-level network is shown in Fig. 5. The encoded stream
of bytes is first serialized into a bitstream for the parser,
which then provides data, control, and motion vectors to the
texture and motion decoder actors. The texture decoding and
motion compensation are divided into three separate parts, one
for luminance (Y) and two for chrominance (U and V), to allow
parallel processing. At the end of processing (texture and/or
motion), the merger combines the decoded bitstream from each
of the three parts to form a complete YUV 4:2:0 video output.
The decoder has been designed using 60 actors, which include
both static and dynamic types.
Fig. 5. CAL dataflow network of the RVC MPEG4 SP.
Among the three design exploration strategies, pipelining
and buffering are not design specific, i.e. the techniques can
be used and generalized for any designs. As for the refac-
toring technique, TURNUS only provides the list of critical
actions/actors that are in the trace critical path of the design.
The technique on how to refactor the critical actions/actors
for system improvement is design specific and based on the
discretion of the designer. The following presents the results
from TURNUS for the analysis of the RVC MPEG4 SP and
techniques on refactoring the critical actors.
The trace critical path is found to cross the y-branch texture
inverse scan (texture IS y), y-branch texture inverse AC
prediction (texture IAP y), and y-branch motion addressing
(motion addr y). These three actors are therefore the critical
actors. Increasing the level of parallelism of these actors
should result in a reduction in design latency.
The refactoring techniques for texture IS y, texture IAP y,
and also texture IDP y (inverse DC prediction) have been
reported in [12], and the basic mechanism is as follows: the
y-branch texture decoder processes each y-macroblock of the
video frames as four blocks, i.e. blocks 0, 1, 2, and 3. In
the original implementation, the blocks are processed in
sequence: block 0, followed by block 1, then block 2, and
finally block 3. In the improved parallel implementation,
block 0 and block 3 are processed in parallel, followed by
block 1 and block 2 in parallel. In theory, this would result
in a latency reduction of 2x, i.e. from 4 blocks to 2 blocks
processed in sequence. However, since the video decoder input
is serial, the gain achieved in practice has been found to be
somewhat less, but it is nonetheless significant. Note that
all four blocks cannot be processed together in parallel,
since there are dependencies among the blocks as specified in
the MPEG standard 3. Note also that it is possible to refactor
texture IDP y using this technique; we will show that this
does not result in any throughput gain, since the actor is
found to be outside the trace critical path.
The refactoring technique for motion addr y is as follows:
In the original implementation of the motion compensation,
one decoded pixel is dedicated to one memory location. This
is inefficient in terms of latency since each memory access
(i.e. reading or writing each pixel) requires two clock cycles.
The addressing part is improved by first packing four pixels
into four bytes, and then storing the packed pixels as 32-bit
word in memory. This reduces memory access by 4x, which
translates to a significant reduction in system latency. Note
that it is also possible to apply this technique to the uv-branch
of motion compensation (motion addr u and motion addr v),
but again will be proven to result in no gain since they are not
in the trace critical path.
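The pixel packing used for motion addr y can be illustrated in a few lines. The byte order inside the 32-bit word (first pixel in the least significant byte) is an assumption, since the text does not specify it, and the function names are hypothetical.

```python
def pack_pixels(pixels):
    """Pack 8-bit pixels into 32-bit words, four pixels per word.

    Assumption: the first pixel of each group goes into the least
    significant byte; the paper does not state the byte order.
    """
    if len(pixels) % 4 != 0:
        raise ValueError("pixel count must be a multiple of 4")
    words = []
    for i in range(0, len(pixels), 4):
        p0, p1, p2, p3 = pixels[i:i + 4]
        words.append(p0 | (p1 << 8) | (p2 << 16) | (p3 << 24))
    return words


def unpack_word(word):
    """Recover the four 8-bit pixels stored in one 32-bit word."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]
```

Sixteen pixels then occupy four memory words instead of sixteen locations, which is the 4x reduction in memory accesses described above.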
4. EXPERIMENTAL RESULTS
This section presents experimental results of the design space
exploration strategies (section 2) applied on the MPEG4 SP
decoder (section 3). The CAL specification of the RVC de-
coder has been taken and synthesized to HDL using open-
Forge. The actor for storing inter frames for motion com-
pensation has been replaced by a memory controller that in-
terfaces to a Cypress Semiconductor CY7C1354C SRAM.
The synthesized decoder has been analyzed for clock-cycle
latency using Modelsim for Foreman QCIF video frames (res-
olution 176x144), and verified for Xilinx Virtex-5 FPGA im-
plementation using the XST synthesis tool.
The results are represented as follows. Each design point is
assigned a prefix of either r for refactoring or p for
pipelining, followed by a two-digit subscript. For refactoring
design points rxy with x ∈ {0, 1} and y ∈ {0, 1, ..., 9}, x
represents the buffering technique used for that design point,
where x = 0 for the close-to-minimum and x = 1 for the
deadlock-increment buffering technique, and y represents which
actors are refactored, as shown in Table 1. For example, the
design point r01 represents the close-to-minimum buffering
technique with a refactored motion addr y. In total, 20
refactoring design points have been explored. Note that
refactoring is performed for all combinations of the critical
actors: motion addr y, texture IS y, and texture IAP y. The
refactoring of the actors motion addr u, motion addr v, and
texture IDP y is not analyzed in combination, since they are
found to be outside the list of critical actors (this is also
confirmed by the following graphs).
3 http://mpeg.chiariglione.org
For pipelining design points pki with k ∈ {0, 1, 2, 3} and
i ∈ {1, 2, ..., 8}, k = 0 refers to pipelining the point r00,
k = 1 the point r09, k = 2 the point r10, and k = 3 the point
r19. The subscript i represents the pipeline iteration, as
shown in Table 2. For pipelining starting from point p01, 6
iterations are possible before the combinatorial path is no
longer dominated by the action, but by the routing delay and
scheduler registers. Similarly, 7, 7, and 8 iterations are
possible for pipelining from the points p11, p21, and p31
respectively. In total, 28 design points have been analyzed
for pipelining space exploration.
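As a quick check of the naming scheme, the design point labels can be enumerated from the counts given in the text (2 buffering techniques times 10 refactoring variants, and 6, 7, 7, and 8 pipeline iterations). The helper name `design_points` is illustrative only.

```python
def design_points():
    """Enumerate the design point labels described in the text."""
    # r<x><y>: x = buffering technique, y = refactoring variant
    refactoring = [f"r{x}{y}" for x in (0, 1) for y in range(10)]
    # p<k><i>: k = starting point, i = pipeline iteration
    iterations = {0: 6, 1: 7, 2: 7, 3: 8}
    pipelining = [f"p{k}{i}"
                  for k, count in iterations.items()
                  for i in range(1, count + 1)]
    return refactoring, pipelining
```

The two lists contain 20 and 28 labels, which together give the 48 design points reported in the abstract.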
Table 1. Refactoring design points rx0 to rx9 with refactored
action(s). x = 0 for close-to-minimum and x = 1 for
deadlock-increment buffering techniques.
Design point   Refactored actor(s)
rx0            -
rx1            motion addr y
rx2            motion addr y, motion addr u, motion addr v
rx3            texture IS y
rx4            texture IAP y
rx5            texture IDP y
rx6            texture IS y, texture IAP y
rx7            motion addr y, texture IS y
rx8            motion addr y, texture IAP y
rx9            motion addr y, texture IS y, texture IAP y
The results have been analyzed for throughput against four
parameters: FPGA slices, block RAM, maximum operating
frequency, and clock-cycle latency. Fig. 6 shows the graph of
FPGA slices versus throughput for all design points. The
refactoring points (rxy) for both buffering techniques and all
refactoring strategies show a similar pattern, where rx5
(texture IDP y refactoring) and rx2 (motion addr y,
motion addr u, and motion addr v refactoring) are inferior
points with a higher slice requirement for the same throughput
compared to some other points, and rx9 are the points with the
highest throughput and slice count. Pipelining of all four
points (rx0) and (rx9), for both x = 0 and x = 1, shows a
roughly linear increase in both slices and throughput. The
best throughput also shows the highest slice requirement
(point p38), with roughly 31000 slices and a throughput of 1700
QCIF frames/s. Compared to the original point r00, this is
about 8x higher throughput with 30% more slices.
[Figure: FPGA slice (15000 to 33000) versus throughput (0 to 1800
QCIF frames/s), one labeled marker per design point r00 to r19
and p01 to p38.]
Fig. 6. FPGA slice versus throughput for all design points.
The graph in Fig. 7 shows the required FPGA block RAM (BRAM)
versus throughput for all design points. The distribution of
BRAM for the refactoring techniques (rxy) is almost the same
for both buffering techniques x = 0 and x = 1. Design points
rx4 and rx6 require similar BRAM, but between points rx8 and
rx9, r08 requires less BRAM than r09, while r18 and r19 require
the same number of BRAM. Furthermore, we can see that using the
close-to-minimum buffering technique (x = 0) results in about
2x less BRAM compared to the deadlock-increment technique. As
for pipelining, it can be observed that BRAM does not increase
with every iteration. This implies that pipelining only takes
slices as an additional resource. For the overall design space
exploration with a throughput range of 8x, the range for BRAM
is about 2.5x between the original point r00 and the
highest-throughput point p38.
[Figure: FPGA block RAM (0 to 30) versus throughput (0 to 1800
QCIF frames/s), one labeled marker per design point.]
Fig. 7. FPGA block RAM versus throughput for all design
points.
The frequency versus throughput plot is shown in Fig. 8. As
expected, refactoring techniques do not increase the operating
frequency, which is almost constant at around 40 MHz.
Pipelining the original point r00 results in a frequency range
of 76 MHz to 94 MHz, point r09 a range of 75 MHz to 96 MHz,
point r10 a range of 75 MHz to 95 MHz, and point r19 a range of
75 MHz to 105 MHz. This implies that pipelining increases
system throughput by increasing the frequency at every
iteration. For low-power applications where the frequency
should be kept as low as possible, pipelining strategies may
not be feasible. On the other hand, buffering and refactoring
strategies provide throughput improvement at a constant
frequency.
Table 2. Pipelining design points p0i, p1i, p2i, and p3i with pipelined action at every iteration i.
Iteration (i)  p0i                     p1i                     p2i                     p3i
1              texture IDP readintra   texture IDP readintra   texture IDP readintra   texture IDP readintra
2              texture IDP readintra   motion addr readaddr    texture IDP readintra   motion addr readaddr
3              motion addr readaddr    texture IDP readintra   motion addr readaddr    texture IDP readintra
4              texture IDP readintra   texture IDP readintra   texture IDP readintra   texture IDP readintra
5              texture dequant ac      texture dequant ac      texture dequant ac      texture dequant ac
6              texture dequant ac      texture dequant ac      texture dequant ac      texture dequant ac
7              -                       texture IDP readintra   texture IDP getdcinter  texture IDCT rowcalc
8              -                       -                       -                       texture IDP readintra
[Figure: maximum frequency (0 to 120 MHz) versus throughput (0 to
1800 QCIF frames/s), one labeled marker per design point.]
Fig. 8. Maximum frequency versus throughput for all design
points.
Fig. 9 shows clock-cycles-per-frame (c.c./f) latency versus
throughput for all design points. Whereas refactoring does not
increase the operating frequency, it does reduce latency for most
design points. Using the close-to-minimum buffering technique, the
original point r00 requires 183000 c.c./f, while the best point r09
requires only 94000 c.c./f. Using the deadlock-increment buffering
technique, point r10 requires 140000 c.c./f, while the best point
r19 requires 63000 c.c./f. Comparing r00 with r19 shows a latency
reduction of roughly 2.9x. This is particularly important for
systems that must run at the lowest possible operating frequency
to minimize power: for a given throughput requirement, a
low-latency system requires a lower operating frequency than a
high-latency one. Another interesting observation is that the
latency of rx5 = rx0 and rx2 = rx1, which implies that the
refactoring techniques of rx5 and rx2 do not result in any
performance increase. As for pipelining, latency is unchanged at
all points, as expected.
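The relation between latency, frequency, and throughput used in this argument can be made concrete with a small sketch. It assumes, as a simplification not stated as a formula in the text, that a design needing a given number of clock cycles per frame delivers frequency divided by latency frames per second. The function names and the 200 frames/s target are hypothetical and chosen only for illustration; the latency values are those quoted above.

```python
# Assumption (a simplification, not a formula from the paper): a design that
# needs `latency_cc` clock cycles per frame and runs at `freq_hz` delivers
# freq_hz / latency_cc frames per second.

def throughput_fps(freq_hz: float, latency_cc: float) -> float:
    """Frames per second for a given clock frequency and c.c./f latency."""
    return freq_hz / latency_cc


def required_freq_hz(target_fps: float, latency_cc: float) -> float:
    """Lowest clock frequency that sustains `target_fps` at this latency."""
    return target_fps * latency_cc


# Latencies quoted in the text, in clock cycles per QCIF frame.
LATENCY_CC = {"r00": 183_000, "r09": 94_000, "r10": 140_000, "r19": 63_000}

# Hypothetical throughput requirement, for illustration only.
TARGET_FPS = 200.0

# Required frequency in MHz for each refactored design point.
required_mhz = {
    point: required_freq_hz(TARGET_FPS, cc) / 1e6
    for point, cc in LATENCY_CC.items()
}

for point, mhz in required_mhz.items():
    print(f"{point}: {mhz:.1f} MHz needed for {TARGET_FPS:.0f} frames/s")
```

Under this assumption, the hypothetical 200 frames/s target needs about 36.6 MHz at the latency of r00 but only about 12.6 MHz at the latency of r19, which is the same roughly 2.9x ratio as the latency reduction.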
Fig. 9. Latency (c.c./frame) versus throughput (QCIF frames/s) for all design points.
5. CONCLUSION & FUTURE WORK
In this paper, we have presented several strategies for exploring
the design space of signal processing systems that are designed
using the CAL dataflow language. The strategies have been applied
to the CAL specification of the RVC MPEG-4 SP decoder, which was
synthesized to HDL for exploration in the design space. The first
strategy examines the trace critical path to find the list of
critical actors where refactoring would result in feasible
implementations. In this context, we have improved the technique
for finding the trace critical path using a new tool called
TURNUS, and introduced a technique to refactor a critical actor in
the MPEG motion compensation. The second strategy examines the
action critical path and applies the semi-automatic pipelining
methodology for pipeline parallelism. The third and final strategy
explores design trade-offs in the exploration space by introducing
two heuristic techniques for assigning buffer sizes for
deadlock-free execution. All these strategies are combined so as
to obtain feasible design points for evaluation in the design
space.
The design points in the exploration space for the case study
given in this paper were obtained using heuristics; other
refactoring, pipelining, and buffering techniques exist that could
yield superior implementation points. These will be further
explored in the immediate future.
6. REFERENCES
[1] K. Lahiri, A. Raghunathan, and S. Dey, “System-level
performance analysis for designing on-chip communi-
cation architectures,” Computer-Aided Design of Inte-
grated Circuits and Systems, IEEE Transactions on, vol.
20, pp. 768–783, 2001.
[2] E. Martin, O. Sentieys, H. Dubois, and J. L. Philippe,
“Gaut: An architectural synthesis tool for dedicated sig-
nal processors,” in European Design Automation Con-
ference - Proceedings, 1993, pp. 14–19.
[3] G. De Micheli, “Hardware synthesis from c/c++ mod-
els,” in Design, Automation and Test in Europe Confer-
ence and Exhibition 1999, 1999, pp. 382–383.
[4] J. Eker and J. Janneck, CAL Language Report: Spec-
ification of the CAL Actor Language, University of
California-Berkeley, December 2003.
[5] J. Eker, J.W. Janneck, E.A. Lee, Jie Liu, Xiaojun
Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Yuhong
Xiong, “Taming heterogeneity - the ptolemy approach,”
Proceedings of the IEEE, vol. 91, no. 1, pp. 127 – 144,
jan 2003.
[6] J. W. Janneck, I. D. Miller, D. B. Parlour, G. Roquier,
M. Wipliez, and M. Raulet, “Synthesizing hardware
from dataflow programs,” Journal of Signal Processing
Systems, vol. 63, pp. 241–249, 2011.
[7] A.A.-H. Ab Rahman, R. Thavot, M. Mattavelli, and
P. Faure, “Hardware and software synthesis of image
filters from cal dataflow specification,” in Ph.D. Re-
search in Microelectronics and Electronics (PRIME),
2010 Conference on, july 2010, pp. 1 –4.
[8] G. Kahn, “The semantics of a simple language for par-
allel programming,” Information Processing, pp. 471–
475, 1974.
[9] Matthieu Wipliez, Ghislain Roquier, and Jean-Franc¸ois
Nezan, “Software code generation for the rvc-cal lan-
guage,” J. Signal Process. Syst., vol. 63, no. 2, pp. 203–
213, May 2011.
[10] G. Roquier, R. Thavot, and M. Mattavelli, “Method-
ology for the hardware/software co-design of dataflow
programs,” in Signal Processing Systems (SiPS), 2011
IEEE Workshop on, oct. 2011, pp. 174 –179.
[11] C. Lucarz, G. Roquier, and M. Mattavelli, “High level
design space exploration of rvc codec specifications for
multi-core heterogeneous platforms,” in Proceedings of
the 2010 Conference on Design and Architectures for
Signal and Image processing (DASIP), October 2010.
[12] H. Amer, A.A.-H. Ab Rahman, I. Amer, C. Lucarz, and
M. Mattavelli, “Methodology and technique to improve
throughput of fpga-based cal dataflow programs: Case
study of the rvc mpeg-4 sp intra decoder,” in Signal Pro-
cessing Systems (SiPS), 2011 IEEE Workshop on, oct.
2011, pp. 186 –191.
[13] J.W. Janneck, I.D. Miller, and D.B. Parlour, “Profiling
dataflow programs,” in Proceedings of the IEEE Inter-
national Conference on Multimedia and Expo, 2008, pp.
1065–1068.
[14] Cui-Qing Yang and Barton P. Miller, “Critical path anal-
ysis for the execution of parallel and distributed pro-
grams,” in ICDCS, 1988, pp. 366–373.
[15] Cedell Alexander, Donna Reese, and James C. Harden,
“Near-critical path analysis of program activity graphs,”
in Proceedings of the Second International Workshop on
Modeling, Analysis, and Simulation On Computer and
Telecommunication Systems, Washington, DC, USA,
1994, MASCOTS ’94, pp. 308–317, IEEE Computer
Society.
[16] Reinhard Diestel, Graph Theory, vol. 173 of Gradu-
ate Texts in Mathematics, Springer-Verlag, Heidelberg,
third edition, 2005.
[17] A.A.-H Ab Rahman, A. Prihozhy, and M. Mattavelli,
“Pipeline synthesis and optimization of fpga-based
video processing applications with cal,” EURASIP Jour-
nal on Image and Video Processing, vol. 2011, pp. 1–28,
2011.
[18] A.A.-H. Ab Rahman, H. Amer, A. Prihozhy, C. Lu-
carz, and M. Mattavelli, “Optimization methodologies
for complex fpga-based signal processing systems with
cal,” in Design and Architectures for Signal and Image
Processing (DASIP), 2011 Conference on, nov. 2011,
pp. 1 –8.
[19] Nan Guan, Zonghua Gu, Wang Yi, and Ge Yu, “Improv-
ing scalability of model-checking for minimizing buffer
requirements of synchronous dataflow graphs,” in Pro-
ceedings of Regular paper accepted by the 14th Asia and
South Pacific Design Automation Conference, Jan. 19-
22 2009. Yokohama, Japan. 2009, IEEE computer soci-
ety.
[20] T. M. Parks, Bounded Scheduling of Process Networks,
PhD Thesis-University of California-Berkeley, Decem-
ber 1995.
[21] M. Mattavelli, I. Amer, and M. Raulet, “The reconfig-
urable video coding standard,” IEEE Signal Processing
Magazine, vol. 27, no. 3, pp. 159–164+167, 2010.
