A methodology to implement real-time applications on reconfigurable circuits by Kaouane, Linda et al.
HAL Id: hal-00826258
https://hal-upec-upem.archives-ouvertes.fr/hal-00826258
Submitted on 27 May 2018
Linda Kaouane, Mohamed Akil, Thierry Grandpierre, Yves Sorel
To cite this version:
Linda Kaouane, Mohamed Akil, Thierry Grandpierre, Yves Sorel. A methodology to implement real-time applications on reconfigurable circuits. Journal of Supercomputing, Springer Verlag, 2004, 30 (3), pp.283-301. 10.1023/B:SUPE.0000045213.82276.8e. hal-00826258
A methodology to implement real-time applications on reconfigurable circuits
Linda Kaouane, Mohamed AKIL, Thierry Grandpierre
Groupe ESIEE–Laboratoire A2SI,
BP 99 - 93162 Noisy-le-Grand, France
E-mails: {kaouanel,akilm,grandpit}@esiee.fr
Yves SOREL
INRIA Rocquencourt–OSTRE,
BP 105 - 78153 Le Chesnay Cedex, France
E-mail : yves.sorel@inria.fr
Abstract
This paper presents an extension of our AAA rapid prototyping methodology for the optimized implementation of real-time applications onto reconfigurable circuits. This extension is based on a unified model of factorized data dependence graphs, used both to specify the application algorithm and to deduce the possible implementations onto reconfigurable hardware in terms of graph transformations. This transformation flow has been implemented in SynDEx (see footnote 1), a system-level CAD software tool.
1. Introduction
The increasing complexity of signal, image and control processing in embedded real-time applications requires high computational power to meet real-time constraints. This power can be achieved by high-performance mixed hardware architectures, called "multicomponent" architectures, built from different types of programmable components (RISC or CISC processors, DSPs, etc.) that perform high-level tasks, and/or specific non-programmable components (dedicated boards, ASICs, FPGAs, etc.) that efficiently perform low-level tasks such as signal and image processing and device control. Implementing these complex algorithms onto such distributed and heterogeneous architectures while satisfying severe real-time constraints is generally a difficult task. This explains the real need for dedicated high-level graphical design environments, based on efficient system-level design methodologies, that help the real-time application designer solve the specification, validation and synthesis problems [1].
1 http://www-rocq.inria.fr/syndex
In order to cope with these increasing needs, on the one hand we have developed the AAA (Algorithm-Architecture Adequation) rapid prototyping methodology and the associated software tool SynDEx, which helps the real-time application designer rapidly obtain an efficient implementation (i.e. one which meets the real-time constraints and minimizes the architecture size) of the application algorithm on a heterogeneous multiprocessor architecture, and automatically generate the corresponding distributed executive [2]. This methodology is based on graph models of the application algorithm and of the available multiprocessor architecture; the implementation is formalized in terms of transformations applied to these graphs.
On the other hand, we aim to extend our AAA methodology to the hardware implementation of real-time applications onto specific integrated circuits, in order to ultimately provide a methodology that automates the implementation of complex applications onto multicomponent architectures. This extension uses a single factorized graph model, from the algorithm specification down to the architecture implementation, through optimizations expressed in terms of defactorization transformations [3]. This optimization aims to satisfy the real-time constraints while minimizing the required hardware resources. In the longer term, this extension is expected to allow the AAA methodology to be used for optimized hardware/software codesign, and consequently to generate either executives for the programmable parts of the architecture (network of processors) or structural synthesizable VHDL for the non-programmable parts (network of application-specific circuits and/or FPGAs).
This paper presents our extended methodology and is organized as follows. Section 2 reviews related work. In Section 3, we briefly present the transformation flow used by our methodology to automate the hardware implementation of an application algorithm on reconfigurable circuits. Section 4 presents the factorized data dependence graph model proposed to specify the application algorithm, and Section 5 the neighborhood graph deduced from it. Section 6 describes a motivating example, the matrix-vector product, used to illustrate the methodology. We then present in Section 7 the principles that allow the synthesis of both data and control paths to be automated from the algorithm specification. The principles of optimization by defactorization are shown in Section 8. Section 9 shows the results of the implementation of the matrix-vector product algorithm onto a Xilinx FPGA following these transformations. Finally, Section 10 concludes and discusses future work.
2. Related Work
In the field of embedded real-time applications, several system-level design methodologies have addressed the issues of design space exploration, performance analysis, and the mapping and optimization of applications onto different types of hardware architecture.
For example, the SPADE methodology [4] enables modeling and exploration of heterogeneous signal processing systems onto coarse-grain data-flow architectures. Applications can be structured starting from available C code using API functions based on the Kahn Process Networks model, which is used to specify the application. The SPADE design flow uses trace-driven simulation to co-simulate an application model with an architecture model.
SPARK [5] is a high-level synthesis framework that provides a number of code transformation techniques. SPARK takes behavioral ANSI-C code as input and generates synthesizable RTL VHDL. This VHDL can then be synthesized into an ASIC or mapped onto an FPGA (the synthesized control is a finite state machine controller).
GRAPE-II [6] is a system-level development environment for specifying, compiling, debugging, simulating and emulating digital signal processing applications on heterogeneous target platforms consisting of DSPs and FPGAs. In the specification phase, the application is described using a cyclo-static data flow model. The application is represented as a directed graph, where nodes represent computation tasks and edges the communication of the results (tokens). The functionality of the nodes is specified in a conventional high-level language (C, VHDL). The target architecture is specified as a connectivity graph. After the specification, resource requirement estimation and mapping phases, the last phase generates C or VHDL code for each of the processing devices.
The POLIS system (see footnote 2) implements HW/SW codesign using the CFSM (Codesign Finite State Machine) formal model. The related work in [7] describes the use of a statechart-based tool for seamless specification and co-simulation of the entire CFSM network. A complete codesign environment based on the POLIS system, which combines automatic partitioning and reuse of a module database, is presented in [8]. Working on a database of reusable software (C, assembler) and hardware (VHDL) modules, the partitioning process passes the allocation information back into POLIS, where a first verification can be performed by Ptolemy-based simulation (see footnote 3). Finally, the partitioning choice is verified using an emulator environment (CPU core coupled with FPGA boards).
2 http://www-cad.eecs.berkeley.edu/~polis/
Each methodology has its own features (for example, several models can be used for application and architecture specification), and some of them have been improved. For example, by introducing statechart models into POLIS [7], the resulting CFSMs are smaller than those obtained via Esterel with POLIS. However, none of them takes multicomponent architectures into account while using a unified model both to specify the application algorithm and to deduce the possible implementations onto the multicomponent architecture, and then to automatically generate the distributed executive corresponding to an efficient implementation (software and/or hardware). Based on this unified model, we can generate both the data and control paths corresponding to a hardware implementation.
3. AAA methodology for integrated circuits
Given an algorithm graph specifying the application, we transform it into an implementation graph by following a set of graph transformations, as described in Figure 1. This transformation flow is composed of the generation of the data-path graph and of the control-path graph. Data-path transformations are quite simple, but control-path transformations are not trivial and require a neighborhood graph to be built first. Finally, the implementation graph, containing both the data and control graphs, is characterized in order to estimate the time and area performance of the implementation. If the deduced implementation does not satisfy the user-specified constraints, we apply a defactorization process in order to reduce the latency at the cost of increased hardware resources. Since there is a large but finite number of possible defactorized implementations, among which we need to select the most efficient one, we use heuristics guided by a cost function. The resulting optimized implementation is then used to automatically generate the corresponding VHDL code.
3 http://ptolemy.eecs.berkeley.edu/
[Figure 1: flow diagram "AAA design flow for circuits: circuit synthesis based upon graph transformation". Boxes: designer, graphical user interface, algorithm specification (FDDG), neighborhood graph, data path graph, control path graph, implementation graph, architecture characterization, estimation (area, latency), "constraints satisfied?" (no: optimisation; yes: VHDL code generation, then Leonardo Spectrum synthesis).]
Figure 1. The AAA methodology for circuits
4. Algorithm model
The algorithm specification is the starting point of the hardware implementation of an application algorithm onto an architecture. According to the AAA methodology, the algorithm model is an extension of the directed data dependence graph, where each node models an operation (more or less complex, e.g. an addition or a filter) and each oriented hyperedge models a data item, produced as output of a node and used as input of one or several other nodes (data diffusion). Although the pure data dependence model is adequate for expressing the parallelism of a computation, which is very attractive for real-time embedded applications, it is rarely sufficient for expressing the iteration and repetition inherent in such applications. A more general data dependence model is thus needed. That is why we extend the typical data dependence model to allow the specification of loops through factorization nodes, leading to an algorithm model called the Factorized Data Dependence Graph (FDDG). In this FDDG model, each dependence is a data dependence and each node is either a computation operation, an input-output operation, or a repetitive operation. This algorithm graph may be specified directly by the user through the graphical or textual interface of the SynDEx software, or it may be generated from high-level specification languages such as the synchronous languages Esterel, Lustre and Signal, which perform formal verifications in terms of event ordering in order to reject specifications that include deadlocks [9].
4.1. Factorized Data Dependence Graphs Model
In order to specify an algorithm, the designer frequently has to describe repetitions of operation patterns (identical operations that operate on different data), which define a "potential data parallelism". To reduce the size of the specification and to highlight these regular parts, we use in practice a graph factorization process, which consists in replacing a repeated pattern, i.e. a subgraph (SG), by a single instance of the pattern, and in marking each edge crossing the pattern frontier with a special "factorization" node, and the factorization frontier (FF) itself by a dashed line crossing these nodes. The type of a factorization node depends on the way the data are managed when crossing the factorization frontier.
A factorization node may then be:
- a Fork node (F): factorizes the partition of an array (by X) into as many subarrays as there are repetitions of the pattern (subgraph SG);
- a Join node (J): factorizes the composition of an array (by M) from the results of each repetition of the pattern;
- a Diffusion node (D): factorizes the diffusion of a data item to all repetitions of the pattern;
- an Iterate node (I): factorizes an inter-pattern data dependence between iterations of the pattern. The first iteration takes its value from the 'init' input, and the last one gives its value to the 'end' output.
Note that both graphs in Figure 2 specify the same scalar product S of two integer vectors M and V of dimension 3: the one in Figure 2.a is a non-factorized data dependence graph and the one in Figure 2.b is the equivalent (from the specification point of view) factorized data dependence graph. In Figure 2.a, each node X is an array-decomposition operation which separates its input array V (respectively M) into its elements. Although Figure 2.a and Figure 2.b are apparently not the same graph (different nodes and edges), they have the same semantics: apply the product operation mul as many times (3) as there are elements in the vectors to multiply, and accumulate the sum. Thus, from the algorithm specification point of view, factorization only reduces the size of the specification, without any modification of its semantics. However, from the implementation point of view, the factorized form leaves open all the possible implementations, from the fully parallel one to the fully sequential one, with all the intermediate cases mixing sequential and parallel execution. The factorized graph of Figure 2.b may thus be implemented in any of these ways: the three multiply operators may be executed sequentially through an iteration, or all in parallel as in Figure 2.a, or two of them may be executed in parallel, in sequence with the third one, etc. Obviously, each of these implementations will have different characteristics in terms of area and response time.
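The range of implementations described above can be illustrated with a small software model. This is only a sketch, not the hardware produced by the methodology: the function name `scalar_product` and the parameter `k` (the number of multipliers assumed to work in parallel) are hypothetical names introduced here.

```python
def scalar_product(m, v, k):
    """Scalar product of m and v using k multipliers in parallel.

    k = 1 models the fully sequential implementation, k = len(m) the fully
    parallel one, and intermediate values mix both.
    Returns (result, number of sequential steps, number of multipliers).
    """
    n = len(m)
    if len(v) != n or not 1 <= k <= n:
        raise ValueError("vectors must have equal length and 1 <= k <= n")
    acc = 0       # value carried from one step to the next
    steps = 0
    for start in range(0, n, k):
        # The products of one step are independent of each other:
        # in hardware they would be computed by separate multipliers.
        products = [m[j] * v[j] for j in range(start, min(start + k, n))]
        acc += sum(products)
        steps += 1
    return acc, steps, k


sequential = scalar_product([1, 2, 3], [4, 5, 6], 1)
mixed = scalar_product([1, 2, 3], [4, 5, 6], 2)
parallel = scalar_product([1, 2, 3], [4, 5, 6], 3)
```

All three calls return the same result; only the number of steps (a stand-in for latency) and the number of multipliers (a stand-in for area) differ.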
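[Figure 2: a) non-factorized data dependence graph of the scalar product, with array-decomposition nodes X, three mul and three add operations; b) equivalent factorized graph, with factorization nodes F and I around a single mul and a single add, inside a factorization frontier FF repeated 3 times.]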
Figure 2. The factorization of a scalar product
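As an informal reading of the four factorization node types, each of them can be modelled by an ordinary list operation. The sketch below is an assumption-laden software analogy (the function names `fork`, `join`, `diffuse` and `iterate` are hypothetical and are not part of SynDEx), applied to the scalar product of Figure 2.

```python
from functools import reduce


def fork(array, n):
    """Fork: split `array` into n equal sub-arrays, one per repetition."""
    size = len(array) // n
    return [array[i * size:(i + 1) * size] for i in range(n)]


def join(results):
    """Join: compose the results of the repetitions into one array."""
    return list(results)


def diffuse(value, n):
    """Diffusion: hand the same value to each of the n repetitions."""
    return [value] * n


def iterate(step, init, inputs):
    """Iterate: repetition k consumes the value produced by repetition k-1.

    The first repetition starts from `init`; the value produced by the
    last repetition is returned.
    """
    return reduce(step, inputs, init)


m = [1, 2, 3]
v = [4, 5, 6]

# Scalar product: two Fork nodes feed one mul, one Iterate node accumulates.
pairs = list(zip(fork(m, 3), fork(v, 3)))
s = iterate(lambda acc, p: acc + p[0][0] * p[1][0], 0, pairs)

# Fork, Diffusion and Join together: scale every element by the same value.
scaled = join(x[0] * c for x, c in zip(fork(m, 3), diffuse(10, 3)))
```

In this analogy the repeated pattern is the body of the lambda or generator expression, and the factorization frontier is the boundary between the list-level operations and that body.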
5. Neighborhood graph
According to the data dependences relating the factorization frontiers, every factorization frontier may be a consumer (located downstream) and/or a producer (located upstream) relative to another frontier. Two frontiers are neighbors if there is at least one direct dependence between them that does not cross a third frontier.
Based on these neighborhood relations between the factorization frontiers in the algorithm graph, we build a neighborhood graph. The nodes of this graph represent the factorization frontiers and the oriented edges represent the data flow between factorization frontiers. The edge orientation describes the production/consumption relation: an edge starts at a producer and ends at a consumer.
In the case of a sequential implementation of factorization nodes, every factorization frontier FF separates two regions, the first one, called "fast", being repeated relative to the second one, called "slow". These slow and fast sides of a frontier are due to the difference in data transfer rate on each side of the factorization frontier. Every node of the neighborhood graph is then subdivided into four parts (see Figure 3):
- slow-downstream: "slow" side of a consumer FF;
- fast-upstream: "fast" side of a producer FF;
- fast-downstream: "fast" side of a consumer FF;
- slow-upstream: "slow" side of a producer FF.
[Figure 3: a neighborhood-graph node divided into consumer and producer halves, each with a slow and a fast side (upstream/downstream).]
Figure 3. Node of neighborhood graph representing a factorization frontier FF
This neighborhood graph, deduced automatically from
the FDDG, is then used during the implementation in order
to establish the control relationships between frontiers.
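The construction of the neighborhood graph can be sketched as follows, under a simplifying assumption: each data dependence is described by the ordered list of frontiers it meets, from producer to consumer. The function name `neighborhood_graph` and the frontier labels `FF1`, `FF2`, `FF3` are hypothetical and introduced only for this illustration.

```python
def neighborhood_graph(paths):
    """Build the oriented edges of a neighborhood graph.

    `paths` is a list of dependence paths; each path lists the frontiers
    met by one data dependence, in producer-to-consumer order. Two
    frontiers are neighbors when they are consecutive on a path, i.e. when
    no third frontier is crossed between them.
    """
    edges = set()
    for path in paths:
        for producer, consumer in zip(path, path[1:]):
            edges.add((producer, consumer))
    return edges


# Three nested frontiers: data flows inwards, is accumulated inside the
# innermost frontier, and the result flows back outwards.
paths = [
    ["FF1", "FF2", "FF3"],   # input data going towards the inner pattern
    ["FF3", "FF3"],          # dependence between iterations of the pattern
    ["FF3", "FF2", "FF1"],   # results going back to the environment
]
edges = neighborhood_graph(paths)
```

Note that `FF1` and `FF3` never become neighbors here, because every dependence between them crosses `FF2`.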
6. Example: Specification of the Matrix-Vector Product (MVP)
We now use a Matrix-Vector Product (MVP) example to illustrate the algorithm specification model and its use for building the neighborhood graph. This example was chosen, on the one hand, because it presents regular computations on different array data, which highlights the use of the factorization process, and on the other hand because it concentrates its computation in nested loops that manipulate multidimensional array data structures; such computations are of interest in signal and image processing applications. The MVP of a matrix M (m lines, n columns) by a vector V (n elements) gives a vector of m elements, and can be written in a factorized form as follows:

$$ M \cdot V = \Big[ \sum_{j=1}^{n} m_{ij} \, v_j \Big]_{i=1}^{m} \qquad (1) $$

where
m: number of lines of the matrix M,
n: number of columns of M, size of the vector V,
m_{ij}: (i, j)-th element of the matrix M,
v_j: j-th element of the vector V.
Equation 1 allows us to obtain the graph corresponding to the algorithm specification of the factorized MVP (Figure 4). The interface with the physical environment is delimited by the inputs (M and V) and by the output (the result vector). It corresponds to the factorization frontier of the infinitely repeated pattern of the graph (the first frontier), due to the reactive aspect of embedded real-time applications. Indeed, these applications interact indefinitely with the physical environment by consuming data provided by sensors and producing data through actuators. The output data are the result of operations applied to the input data. The square brackets in Equation 1 correspond to a second frontier, delimited by the factorization nodes of a finitely repeated pattern. This frontier selects the m lines of the matrix M (Fork node), diffuses the vector V (Diffusion node) and collects the result vector (Join node). The summation functor corresponds to a third frontier, also delimited by factorization nodes, of a second finitely repeated pattern corresponding to the calculation of the scalar product of one line of M with V. This frontier selects the elements m_{ij} of the i-th line of the matrix M (Fork node) and the elements v_j of the vector V (Fork node), and it supplies the result of the sum of products between m_{ij} and v_j for every line of the matrix M (Iterate node). The "slow" and "fast" sides of each frontier are labeled "s" and "f", respectively.
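The nesting of the two finite frontiers can be mirrored by two nested loops. The following is a minimal sketch in software, assuming the usual row-by-row definition of the matrix-vector product; the function name `mvp` is hypothetical, and the comments relate each construct to the factorization nodes named in the text.

```python
def mvp(M, V):
    """Matrix-vector product written as two nested repeated patterns."""
    m, n = len(M), len(V)
    result = []
    for i in range(m):            # second frontier: pattern repeated m times
        line = M[i]               # Fork: select one line of the matrix
        acc = 0                   # 'init' input of the Iterate node
        for j in range(n):        # third frontier: pattern repeated n times
            acc = acc + line[j] * V[j]   # mul followed by add
        result.append(acc)        # Join: collect one element of the result
    return result


M = [[1, 2], [3, 4], [5, 6]]
V = [10, 1]
R = mvp(M, V)
```

The vector `V` is read unchanged by every repetition of the outer loop, which is the role played by the Diffusion node; the outermost, infinitely repeated frontier would correspond to calling `mvp` once per set of input samples.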
The neighborhood graph between factorization frontiers, obtained from the factorized data dependence graph specifying the MVP algorithm, is shown in Figure 5. Because the first factorization frontier is infinite, it has no neighbor on its "slow" side, which corresponds to the physical environment. The first frontier is at the same time a producer (edges M and V) and a consumer (edge carrying the result vector) relative to the second frontier. The second frontier is also a producer (edges carrying a line of M and the vector V) and a consumer (edge carrying the scalar product result) relative to the third frontier. The third frontier is a producer and a consumer relative to itself, through the arithmetic operations mul and add.
[Figure 4: factorized data dependence graph of the MVP, showing the three nested factorization frontiers with their slow (s) and fast (f) sides.]
Figure 4. Factorized data dependence graph of MVP
[Figure 5: neighborhood graph with one node per factorization frontier and oriented producer-to-consumer edges.]
Figure 5. Neighborhood graph of MVP: relations between frontiers
7. Circuits synthesis
To implement the application algorithm on the corresponding circuit, we need to generate the data path, responsible for the core of the computation, as well as the control structure that generates the appropriate control signals. This translation from a high-level behavioral representation into a register-transfer-level (RTL) structural description containing both the data and control paths is known as high-level circuit synthesis. Automating this synthesis significantly reduces the development cycle of the circuit and allows the exploration of different hardware implementations, in search of a good compromise between the area and the response time of the circuit. In the following, we present the principles that allow the data path and the control path of the circuit to be generated automatically from the factorized data dependence graph and the neighborhood graph.
7.1. Data path synthesis
The hardware implementation of the factorized data dependence graph consists in providing a matching operator for every operation node and every factorization node. The matching operator is a logic function in the case of an operation node, or it is composed of a multiplexer and/or registers in the case of a factorization node, as depicted in Figure 6. The hardware implementation of the data dependences between operations then consists in providing, for each edge of the graph, a matching connection between operators. The resulting graph of operators and their interconnections composes the data path of the circuit.
[Figure 6: three columns (algorithm graph, implementation graph, hardware implementation) showing the operator associated with each type of node.]
Figure 6. A node graph transformation: from algorithm graph to hardware implementation
7.2. Control path synthesis
The control path corresponds to the logic functions that must be added to the data path in order to control the multiplexers and the transitions of the registers composing the operators. It is obtained by synchronizing the data transfers between registers. Two conditions must be satisfied to allow a register to change state: the new data upstream of the register must be stable, and all downstream consumers of the register must have finished using the previous data. Moreover, if the upstream data come from various producers with different propagation times, a synchronized data transfer process is necessary. This synchronization is possible through the use of a request/acknowledge communication protocol [10]. Consequently, the synchronization of the circuit implementing the whole algorithm reduces to the synchronization of the request/acknowledge signals of the set of factorization operators. Given that these operators are gathered in factorization frontiers, and that their data consumptions and productions are done synchronously at the level of the frontier, the generated control must be local to each frontier. We therefore propose a local control system where each factorization frontier has its own control unit. Unlike a centralized control approach, this delocalized control approach allows the CAD tools used for the synthesis to place the control units close to the operators they control.
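The two conditions under which a register may change state can be written as a simple predicate. The sketch below is a cycle-level software model under stated assumptions, not the protocol of [10]: the names `can_load` and `first_load_cycle` are hypothetical, a request is assumed to be raised once a producer's data is stable, and an acknowledge once a consumer has finished with the previous data.

```python
def can_load(upstream_requests, downstream_acks):
    """A register may load new data only when every upstream producer
    requests a transfer (its data is stable) and every downstream consumer
    has acknowledged (it no longer needs the previous data)."""
    return all(upstream_requests) and all(downstream_acks)


def first_load_cycle(producer_delays, consumer_done_cycles, horizon=100):
    """Return the first clock cycle at which the register can load.

    producer_delays[i] is the cycle from which producer i holds stable
    data; consumer_done_cycles[j] is the cycle from which consumer j has
    finished using the previous data.
    """
    for cycle in range(horizon):
        requests = [cycle >= delay for delay in producer_delays]
        acks = [cycle >= done for done in consumer_done_cycles]
        if can_load(requests, acks):
            return cycle
    return None


# Two producers with different propagation times and one consumer.
slow_producer_case = first_load_cycle([2, 5], [3])
slow_consumer_case = first_load_cycle([2, 5], [7])
```

In both cases the transfer waits for the slowest party, which is why producers with different propagation times need an explicit handshake rather than a fixed schedule.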
7.2.1 Control units and their interconnections
As mentioned above, each factorization frontier has upstream and downstream relations on both its "slow" and "fast" sides. The relations between the upstream/downstream and request/acknowledge signals on both sides of a frontier are implemented by the "control unit" of the factorization frontier (Figure 7). This control unit contains a counter with d states (corresponding to the d factorized repetitions) and an additional logic function that generates, on the one hand, the communication protocol between frontiers (the slow/fast, request/acknowledge signals on the upstream and downstream sides) and, on the other hand, the counter value (cpt) and the enable signal (en) that control the frontier operators. The counter value (cpt) controls the multiplexers of the frontier operators F, J and I. The enable signal (en) determines the clock cycles in which the registers of the frontier operators change state. Note that a dedicated signal resets the counter, while another signal indicates that the counter is in its last state (d - 1). All the other signals are the request (r) and acknowledge (a) signals generated by the frontier(s) located upstream or diffused to the frontier(s) located downstream. They are separated into two groups: those which relate to the frontier(s) located on the "slow" side and those which relate to the frontier(s) located on the "fast" side, corresponding to the four parts of the control unit: slow-upstream (s_u), slow-downstream (s_d), fast-upstream (f_u) and fast-downstream (f_d).
[Figure 7: control unit block with a modulo-d counter, the request/acknowledge signals on its slow and fast sides, and the cpt and en outputs.]
Figure 7. Control Unit
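The counter at the heart of a control unit can be modelled at cycle level as follows. This is a simplified sketch: the class name `ControlUnit`, its method names and the rule "enable when both a request and an acknowledge are present" are assumptions made for illustration, and the slow/fast distinction of the real control unit is not modelled.

```python
class ControlUnit:
    """Cycle-level model of a frontier control unit with a d-state counter."""

    def __init__(self, d):
        self.d = d        # number of factorized repetitions
        self.cpt = 0      # counter value, selects the multiplexer inputs

    def reset(self):
        """Bring the counter back to its first state."""
        self.cpt = 0

    @property
    def last(self):
        """True when the counter is in its last state (d - 1)."""
        return self.cpt == self.d - 1

    def step(self, request, ack):
        """Simulate one clock cycle.

        Returns (cpt, en): the counter value seen by the multiplexers
        during this cycle and whether the frontier registers are enabled.
        The counter advances, modulo d, only on enabled cycles.
        """
        en = bool(request and ack)
        cpt = self.cpt
        if en:
            self.cpt = (self.cpt + 1) % self.d
        return cpt, en


cu = ControlUnit(3)
handshakes = [(True, True), (True, True), (True, False), (True, True), (True, True)]
trace = [cu.step(r, a) for r, a in handshakes]
```

In the trace, the third cycle is stalled because no acknowledge is present, so the counter stays in state 2 until the handshake completes and then wraps around to 0.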
As mentioned above, the control path is mainly composed of the set of control units associated with the factorization frontiers of the application algorithm graph. These control units can be interconnected automatically, based on the relationships between the factorization frontiers deduced from the neighborhood graph. In the resulting control graph, the nodes correspond to the control units and the arcs correspond to the request signals, transmitted between the control units in the same direction as the production and consumption of data between the corresponding factorization frontiers. The acknowledge signals are transmitted between the same control units, in the opposite direction to the associated request signals. When several signals arrive at the same input of a control unit, they are combined by a logical AND. In Section 9, we will see two examples of synthesis of the data and control paths.
8. Implementation optimization
If the implementation of the factorized specification onto an application-specific integrated circuit or an FPGA does not meet the real-time constraints, we need to defactorize the implementation graph corresponding to the specification. Defactorization is the reverse transformation of factorization and therefore does not change the operational semantics of the data dependence graph. The goal is to obtain a more parallel implementation in order to reduce the latency and improve the temporal performance, at the cost of increased hardware resources.
Thus the optimized implementation of a factorized algorithm graph onto the target architecture is formalized in terms of graph defactorization transformations. The implementation space which must be explored in order to find the best solution is then composed of all the possible defactorizations of the factorized graph specifying the algorithm. For instance, for a given algorithm graph with n frontiers, we have at least 2^n defactorized implementations, since each frontier can be either kept factorized or totally defactorized. Moreover, each frontier can be partially defactorized: a factorization frontier of r repetitions can be decomposed into k factorization frontiers of r/k repetitions.
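To give a feeling for the size of this implementation space, the sketch below counts the combinations of defactorization factors. It assumes that a partial defactorization factor must divide the repetition count of its frontier, and it ignores the fact that copies of a duplicated inner frontier could be defactorized independently, so the count is only a lower bound. The function names `divisors` and `count_implementations` are hypothetical.

```python
from math import prod


def divisors(r):
    """Admissible defactorization factors of a frontier repeated r times.

    1 keeps the frontier fully factorized, r defactorizes it totally.
    """
    return [k for k in range(1, r + 1) if r % k == 0]


def count_implementations(repetitions):
    """Number of ways to pick one factor per finitely repeated frontier."""
    return prod(len(divisors(r)) for r in repetitions)


# Two finite frontiers, repeated 6 and 4 times respectively.
factors_of_6 = divisors(6)
total = count_implementations([6, 4])
```

With only two frontiers this already gives 12 combinations, more than the 4 obtained when each frontier is either kept factorized or totally defactorized.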
Consequently, for a given algorithm graph, there is a large but finite number of possible implementations which are more or less defactorized, among which we need to select the most efficient one, i.e. the one which satisfies the real-time constraints (upper bound on latency) and which uses as few hardware resources as possible (logic gates for an ASIC, number of Configurable Logic Blocks, CLBs, for an FPGA). This optimization problem is known to be NP-hard, and its size is usually huge for realistic applications. This is why we use a heuristic guided by a cost function to compare the performance of different defactorizations of the specification. This heuristic allows us to explore only a small subset of all the possible defactorizations in the implementation space.
Since we aim at rapid prototyping, our heuristic is based on a fast but efficient greedy algorithm, with a cost function f based on the critical path length metric of the implementation graph: it takes into account both the latency and the area of the implementation, which are obtained by a preliminary characterization step.
8.1. Optimization heuristic
Here is a brief description of the proposed greedy heuristic, given as Algorithm 1. At each step, a list of candidate factorization frontiers is built from the set of factorization frontiers of the deduced implementation graph. These candidates are the frontiers which belong to the critical path. Defactorizing one of these frontiers reduces the critical path length, in order to meet the real-time constraint. Thus, for each candidate frontier, we determine its optimal defactorization factor df as the smallest defactorization factor leading to a latency lower than the time constraint. When this factor reaches the factorization factor (total defactorization) without the latency becoming lower than the time constraint, the fully defactorized frontier is no longer crossed by the critical path.
We then compute, for each couple (factorization frontier, corresponding optimal factor df), the cost function f, called defactorization pressure, as follows:

$$ f = \frac{T - \max(T', T_c)}{\Delta A} $$

where ΔA represents the loss in area, T the latency before defactorization, T' the latency after defactorization, and T_c the user-specified time constraint. At the end of each iteration, the factorization frontier with the highest cost value is defactorized by its corresponding factor df.
9. Example: Synthesis of the MVP Implementation on an FPGA Circuit
Figure 8 represents the hardware implementation of the factorized MVP corresponding to the algorithm specification given in Figure 4, for a given value of m. The data path (Figure 8.a) is composed of the factorization frontier operators (F, D, J and I) and of the combinatorial operators mul and add. The control path (Figure 8.b) is composed of one control unit per factorization frontier and of the control signals r (request), a (acknowledge), cpt and en. The interconnection of the request and acknowledge signals is based on the relationships between the factorization frontiers, namely the neighborhood graph (Figure 5) built from the algorithm graph.
Algorithm 1 Greedy optimization algorithm
Inputs: the FDD graph of the algorithm, the time constraint
Output: the optimized implementation graph
1: begin
2: If the latency of the corresponding implementation graph
   meets the time constraint, then go to end;
3: Determine the list of candidate frontiers by computing the
   critical path;
4: For each candidate frontier, determine its optimal
   defactorization factor;
5: For each candidate frontier, compute its cost function;
6: Defactorize the frontier having the highest cost by its
   corresponding optimal defactorization factor;
7: Repeat from step 2 as long as the latency is greater than the
   time constraint;
8: end
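As a rough illustration of Algorithm 1, the following Python sketch runs the greedy loop on a deliberately simplified model. Everything in it is an assumption made for illustration: the `Frontier` class, the hypothesis that all frontiers lie one after another on the critical path (so the latency is a plain sum), the area model (area proportional to the number of copies), and above all the pressure formula (latency gain divided by area loss), which merely stands in for the cost function of the paper.

```python
from dataclasses import dataclass


@dataclass
class Frontier:
    """Toy model of a factorization frontier (hypothetical, for illustration)."""
    name: str
    repetitions: int     # factorization factor of the frontier
    step_latency: float  # latency of one repetition, in ns
    unit_area: float     # area of one copy of the repeated operators, in CLBs
    copies: int = 1      # current defactorization factor (1 = fully factorized)

    def latency(self, copies=None):
        c = self.copies if copies is None else copies
        return (self.repetitions // c) * self.step_latency

    def area(self, copies=None):
        c = self.copies if copies is None else copies
        return c * self.unit_area


def total_latency(frontiers):
    # Assumption: every frontier is on the critical path, one after another.
    return sum(f.latency() for f in frontiers)


def optimal_factor(frontier, frontiers, constraint):
    """Smallest larger divisor of the factorization factor that meets the
    constraint; total defactorization when no divisor is sufficient."""
    others = total_latency(frontiers) - frontier.latency()
    n = frontier.repetitions
    for d in range(frontier.copies + 1, n + 1):
        if n % d == 0 and others + frontier.latency(d) <= constraint:
            return d
    return n


def greedy_defactorize(frontiers, constraint):
    while total_latency(frontiers) > constraint:             # steps 2 and 7
        candidates = [f for f in frontiers if f.copies < f.repetitions]  # step 3
        if not candidates:
            break  # everything is defactorized: the constraint cannot be met
        best = None
        for f in candidates:
            d = optimal_factor(f, frontiers, constraint)     # step 4
            gain = f.latency() - f.latency(d)
            loss = f.area(d) - f.area()
            pressure = gain / loss                           # step 5 (assumed formula)
            if best is None or pressure > best[0]:
                best = (pressure, f, d)
        _, chosen, factor = best
        chosen.copies = factor                               # step 6
    return frontiers


# Made-up numbers: two frontiers of 6 repetitions each, 100 ns constraint.
demo = [Frontier("A", 6, 10.0, 5.0), Frontier("B", 6, 20.0, 4.0)]
greedy_defactorize(demo, constraint=100.0)
print({f.name: f.copies for f in demo}, total_latency(demo))
```

With these made-up numbers, the loop defactorizes frontier B by a factor of 3 and leaves frontier A untouched, because B offers the larger latency gain per unit of added area.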
Figure 8. Implementation graph of MVP: a) data path, b) control path.
Figure 9.a shows the hardware implementation of a defactorized
solution, obtained by partially defactorizing one factorization
frontier: this frontier has been replaced by two frontiers, each
being repeated 3 times. Another factorization frontier remains
unchanged, but it has been duplicated because of this partial
defactorization. The data path is then composed of the
factorization frontier operators, the combinatorial operators
(mul, add), and the array-decomposition and array-composition
operators. The control path, deduced automatically from the
neighborhood graph (Figure 9.b), is composed of five control
units. The synchronization of the two new frontiers is ensured by
AND gates on the upstream request and downstream acknowledge
signals.
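To make the effect of partial defactorization concrete, here is a small software analogy in Python. It assumes that the defactorized frontier is the repetition over the matrix rows, and it uses a 6 x 6 matrix only because that size is consistent with two frontiers of 3 repetitions each; the function names (`explode`, `implode`, `mvp_factorized`, `mvp_partially_defactorized`) are hypothetical. The two blocks are evaluated one after the other here, whereas on the FPGA each block would have its own operators and run in parallel.

```python
def mvp_factorized(matrix, vector):
    """Fully factorized form: one multiply-accumulate reused over time."""
    result = []
    for row in matrix:                   # repetition over the rows
        acc = 0
        for a, x in zip(row, vector):    # repetition over the columns
            acc += a * x                 # the mul and add operators
        result.append(acc)
    return result


def explode(items, parts):
    """Array decomposition: split a list into `parts` equal sub-arrays."""
    if len(items) % parts != 0:
        raise ValueError("the factor must divide the number of repetitions")
    size = len(items) // parts
    return [items[i * size:(i + 1) * size] for i in range(parts)]


def implode(chunks):
    """Array composition: concatenate the sub-arrays back into one array."""
    return [x for chunk in chunks for x in chunk]


def mvp_partially_defactorized(matrix, vector, factor=2):
    """Row repetition split into `factor` shorter repetitions.

    Each block of rows stands for one of the new frontiers; in hardware
    the blocks would be processed simultaneously by separate operators.
    """
    blocks = explode(matrix, factor)
    partial = [mvp_factorized(block, vector) for block in blocks]
    return implode(partial)


# 6 x 6 matrix with 3-bit entries (values 0..7) and a 6-element vector.
matrix = [[(3 * i + j) % 8 for j in range(6)] for i in range(6)]
vector = [1, 2, 3, 4, 5, 6]
assert mvp_partially_defactorized(matrix, vector) == mvp_factorized(matrix, vector)
```

The result is unchanged by the transformation; only the schedule differs, which is why defactorization trades area (duplicated operators) for latency (fewer sequential repetitions).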
Figure 9. A defactorized implementation graph of MVP: a) hardware
implementation, b) neighborhood graph.
Table 1 shows the results of the hardware implementation of MVP
(matrix and vector elements coded on 3 bits) on a Xilinx
XC4000XL-3 FPGA, using the CAD tool Leonardo Spectrum, developed
by Exemplar Logic Inc. The results are given in terms of the area
(hardware resources: number of CLBs), the number of clock cycles
required to execute the algorithm, the maximum frequency of the
operators in MHz, and the data latency in ns (nanoseconds).
These results represent some of the possible implementations
explored by the optimization heuristic through partial
defactorization (as described in [3]) of the initial factorized
implementation. Note that these defactorized solutions reduce the
latency of the implementation, but they increase the number of
required hardware resources (CLBs).
Table 1. Optimization results for the implementation of MVP onto
FPGA
Implementation                     Area    Nb.     Freq.   Lat.
                                   (CLB)   cycl.   (MHz)   (ns)
Factorized spec.                     76     36     12.4    2916
Part. defac. of one frontier         99     18     13.5    1332
Full defac. of that frontier        168      6     14.3     420
Part. defac. of another frontier     92     30     10.8    2790
Full defac. of that frontier         79      6      9.0     660
Fully defactorized                  234      1     11.4      87
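The figures in Table 1 can be cross-checked with a few lines of Python. The sketch below assumes that the reported latency is simply the number of clock cycles divided by the maximum frequency, which the table does not state explicitly, and it uses placeholder row labels ("frontier X", "frontier Y") that are not the names used in the paper.

```python
# (label, area in CLBs, clock cycles, max frequency in MHz, reported latency in ns)
TABLE_1 = [
    ("factorized",                  76, 36, 12.4, 2916),
    ("partial defac., frontier X",  99, 18, 13.5, 1332),
    ("full defac., frontier X",    168,  6, 14.3,  420),
    ("partial defac., frontier Y",  92, 30, 10.8, 2790),
    ("full defac., frontier Y",     79,  6,  9.0,  660),
    ("fully defactorized",         234,  1, 11.4,   87),
]


def estimated_latency_ns(cycles, freq_mhz):
    """Latency in ns, assuming latency = cycles / frequency."""
    return cycles * 1000.0 / freq_mhz


def relative_error(estimate, reported):
    return abs(estimate - reported) / reported


base_area, base_latency = TABLE_1[0][1], TABLE_1[0][4]
for label, area, cycles, freq, reported in TABLE_1:
    est = estimated_latency_ns(cycles, freq)
    print(f"{label:28s} est. {est:7.1f} ns, reported {reported:5d} ns, "
          f"area x{area / base_area:.2f}, speed-up x{base_latency / reported:.1f}")
```

Under this assumption, every estimate agrees with the reported latency to within 2%, and the last two columns printed make the area versus latency trade-off of each solution explicit relative to the factorized implementation.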
10. Conclusion and future work
We have presented a flow of transformations that leads to the
generation of a complete VHDL design corresponding to the
implementation of an application specified with the Factorized
Data Dependence Graph model. We validated the proposed methodology
on several examples representative of low-level image processing,
such as mean filtering [3] and edge detection operators (Deriche,
Sobel, ...).
This work is part of the extension of the AAA methodology,
implemented in the SynDEx software, to support implementation on
reconfigurable circuits. For multiprocessors, AAA/SynDEx
automatically generates a deadlock-free executive for the
optimized implementation of a given algorithm onto architectures
based on DSPs (TMS320C40, ADSP21060), microcontrollers (MPC555),
and general-purpose processors (Linux PCs and Unix workstations)
[11].
The principles described in this paper allowed us to extend
AAA/SynDEx to reconfigurable circuits (FPGAs). An automatic
generator of structural synthesizable VHDL for single-FPGA
architectures has been added to SynDEx [12]. The generated VHDL
code, which corresponds to the optimized FPGA implementation
obtained by successive defactorizations of the factorized
algorithm graph, is then used by a CAD tool (e.g. Leonardo
Spectrum) to generate the netlist needed to configure the FPGA.
We are currently working on the control induced by conditioning in
the algorithm specification, in addition to the control induced by
the repetition of operations. We also intend to extend the
proposed methodology to multi-FPGA architectures. To support such
architectures, the optimization heuristic will address both
defactorization and partitioning issues.
Thanks to this extension, the AAA methodology will be usable for
optimized hardware/software codesign, leading to the generation of
either executives for the programmable parts of the architecture
(network of processors), or structural synthesizable VHDL for the
non-programmable parts (network of application-specific circuits
and/or FPGAs).
References
[1] S. Edwards, L. Lavagno, E.A. Lee, A. Sangiovanni-Vincentelli.
Design of embedded systems: formal models, validation, and
synthesis. Proceedings of the IEEE, v.85, n.3, March 1997.
[2] T. Grandpierre, C. Lavarenne, Y. Sorel. Optimized rapid
prototyping for real-time embedded heterogeneous multiprocessors.
CODES'99 7th Intl. Workshop on Hardware/Software Co-Design, Rome,
May 1999.
[3] A. F. Dias, C. Lavarenne, M. Akil, Y. Sorel. Optimized
implementation of real-time image processing algorithms on field
programmable gate arrays. Proc. of the 4th Intl. Conference on
Signal Processing, Beijing, Oct. 1998.
[4] P. Lieverse, P. van der Wolf, E. Deprettere, K. Vissers. A
Methodology for architecture exploration of heterogeneous signal
processing systems. Proc. 1999 IEEE Workshop on Signal Processing
Systems (SiP'99).
[5] S. Gupta, N. Dutt, R. Gupta, A. Nicolau. SPARK, High-Level
Synthesis Framework For Applying Parallelizing Compiler
Transformations. 7th Intl. Conference on VLSI design, January 5-9,
2004, Mumbai, India.
[6] R. Lauwereins, M. Engels, M. Adé, J. Peperstraete. Grape-II: A
system-level Prototyping Environment For DSP applications. IEEE
Computer, Vol. 28, No 2, pp. 35-43, Feb. 1995.
[7] I.D. Bates, E.G. Chester, D.J. Kinniment. A statechart based
HW/SW Codesign system. Proceedings of the 7th Intl. Workshop on
Hardware/Software Codesign (CODES/CASHE), Rome, Italy, 3-5 May
1999.
[8] M. Meerwein, C. Baumgartner, W. Glauert. Linking Codesign and
Reuse in Embedded Systems Design. Proceedings of the 8th Intl.
Workshop on Hardware/Software Codesign (CODES/CASHE), San Diego,
California, USA, 3-5 May 2000.
[9] N. Halbwachs. Synchronous programming of reactive systems.
Kluwer Academic Publishers, Dordrecht, Boston, 1993.
[10] C. A. Mead, L. A. Conway. Introduction to VLSI systems.
Addison-Wesley, 1980.
[11] T. Grandpierre, Y. Sorel. From algorithm and architecture
specifications to automatic generation of distributed real-time
executives: a seamless flow of graphs transformations. First ACM &
IEEE Intl. Conference on Formal Methods and Models for Codesign,
MEMOCODE'03, Mont Saint-Michel, France, June 2003.
[12] R. Vodisek, M. Akil, S. Gailhard, A. Zemva. Automatic
Generation of VHDL code for SynDEx v6 software. Electrotechnical
and Computer Science Conference, Portoroz, Slovenia, September
2001.
