A partition methodology to develop data flow dominated embedded systems by Esteves, António & Proença, Alberto José
A Partition Methodology to Develop Data Flow
Dominated Embedded Systems
António J. Esteves and Alberto J. Proença
Department of Informatics, University of Minho
Braga, Portugal
{esteves,aproenca}@di.uminho.pt
Abstract
This paper proposes an automatic partition methodology oriented to de-
velop data flow dominated embedded systems. The target architecture is
CPU-based with reconfigurable devices on attached board(s), which closely
matches the PSM meta-model applied to system modelling. A PSM flow
graph was developed to represent the system during the partitioning process.
The partitioning task applies known optimization algorithms - tabu search
and cluster growth algorithms - which were enriched with new elements to
reduce computation time and to achieve higher quality partition solutions.
These include the closeness function that guides the cluster growth algorithm,
which dynamically adapts to the type of object and partition under analysis.
The methodology was applied to two case studies, and some evaluation
results are presented.
Keywords: partitioning, hardware/software co-design, PSM meta-model,
tabu search, cluster growth
1 Introduction
This paper describes an automatic partition methodology for developing data
flow dominated, medium complexity, real time embedded systems, where a
processing element coupled to FPGA/CPLD board(s) [1] forms a reconfigurable
architecture [2]. Since this architecture includes hardware and software components,
the present work applies the hardware/software codesign paradigm.
Partitioning is an NP-complete optimization problem that assigns system
objects to the target architecture components and defines their start times (scheduling),
to achieve the designer objectives quantified by a cost function [3]. The partition
process converts a unified and uncommitted system representation into a multi-part
representation committed to the target architecture components. The present
approach performs a functional, inter-component and automatic partition.
The partition task is part of a development methodology that covers all phases
of systems development [4]. It is based on an operational approach [5], it runs
at a high abstraction level and it takes advantage of the object oriented modelling
paradigm to reduce complexity and design time. Common to object oriented ap-
proaches, it uses multi-view modelling to describe the objects, the dynamic and
the functional perspectives of systems. Following an operational approach, an
executable specification is developed, which runs through a set of refinements
and transformations to achieve a system implementation. When compared to the
methodology followed by the MOOSE approach [6], the proposed methodology
has some advantages: (i) the state transition diagrams (STD) are replaced by PSM1
models [7], which allow adequate handling of the system objects concurrency,
(ii) implementations follow an iterative approach, replacing the traditional cascade
design flow and (iii) the partition is automatically performed, without requiring
additional expertise from a codesign professional.
To evaluate this partition methodology, a prototype tool, parTiTool, was
implemented, and its capabilities were compared to other approaches, following the
structure introduced in [8]. Here, two sets of features are grouped for comparison
purposes: the modelling support and the implementation support. The first identifies
three axes: the application domain (control, data or data+control), the type of validation
(simulation or co-verification) and the modelling style (homogeneous or hetero-
geneous). Figure 1 shows where parTiTool fits in the graph and how it relates to
other approaches. Most approaches adopt homogeneous modelling style, where
the only allowed validation method is simulation. Systems are described with a
software oriented language (C, a C variant, Occam or C++) or a hardware
oriented language (VHDL, Verilog or HardwareC). The proposed approach is part of
a development methodology that uses heterogeneous modelling. It can be applied
to data and control systems, but it is oriented to data flow dominated systems. At
its present stage of development, it does not allow co-verification.
To compare the support available to implement systems with multiple
components, figure 2 also uses three axes: the support to synthesize the interface between
components, the supported target architecture and the degree of automation of the
partition process. A reasonable number of approaches (Chinook [9], Cosmos [10],
CoWare [11] or Polis [12]) do not perform the partition automatically. None of the
approaches completely supports both the automatic partition and the synthesis of
interfaces. In the proposed approach the partition process is automatic and the
information required to synthesize the interface between components can be extracted
from the detailed model used to estimate communication metrics. The current parTiTool
prototype implementation does not yet support more than one microprocessor,
due to the target evaluation architecture. However, an R&D track is being prepared
to merge this project with current work on adaptive load and data scheduling in parallel and
distributed systems [13].
The paper is organized in 4 sections. Section 2 describes the proposed partition
methodology, namely the formal description of the partition process, the approach
followed to model the system and its internal representation, the construction and
1Program-State Machine.
[Figure 1: Categorization of approaches by modelling support. Axes: application
domain (control, data, data+control), type of validation (simulation,
co-verification) and modelling style (homogeneous, heterogeneous). Approaches
shown: Chinook, Mickey, Tosca, Polis, Ptolemy, COOL, Cosmos, Cosyma, Lycos,
Vulcan, SpecSyn, Castle, CoWare and parTiTool.]
improvement of partition solutions with cluster growth and tabu search algorithms
and the metrics estimation required by the evaluation functions. Section 3 presents
the prototype system used to validate the partition methodology (target
architecture and applied tool) and summarizes the case studies and the obtained
results. Section 4 closes with concluding remarks and directions for future work.
[Figure 2: Categorization of approaches by implementation support. Axes:
interface synthesis support (low, high), target architecture (1 uP, 1 ASIC, no
concurrency; 1 uP, 1 ASIC, concurrency; 1 uP, multiple ASIC; multiple uP,
multiple ASIC) and partitioning (manual, automatic). Approaches shown: Lycos,
Chinook, Polis, Cosmos, Vulcan, Cosyma, Mickey, Tosca, SpecSyn, COOL, CoWare
and parTiTool.]
2 Partition methodology
The presentation of the partition methodology starts with the formal description of
the partition process. Given a unified representation for the system (see below, in
system modelling), the partition process generates a description for each component
of the target architecture to be used in the implementation of the system. To reach
this goal, the set of objects in the system description must be divided into a series
of disjoint subsets that will be assigned to the different components of the target
architecture. The task that divides the set of objects into subsets is guided by the
target architecture constraints and the design requirements. In the present work,
the objects represent program-states or variables from the system PSM model. A
formal definition of the partition process follows below.
Given the set of objects $O = \{o_1, o_2, \ldots, o_n\}$ that models the system
functionality, the set of constraints $Cons = \{c_1, c_2, \ldots, c_m\}$ and the set of
requirements $Req = \{r_1, r_2, \ldots, r_p\}$ that define the feasibility and the quality of the
partition alternatives to be generated, the partition process generates several sub-sets
(or partitions) $H_1, \ldots, H_{nh}, S_1, \ldots, S_{ns}$, where $H_i \subseteq O$, $S_i \subseteq O$,
$\{H_i\}_{i=1..nh} \cup \{S_j\}_{j=1..ns} = O$, $H_i \cap S_j = \emptyset$,
$H_i \cap H_k = \emptyset$ (with $k = 1..nh$ and $k \neq i$) and
$S_j \cap S_l = \emptyset$ (with $l = 1..ns$ and $l \neq j$).
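The conditions of this definition can be checked mechanically. The following minimal Python sketch is not part of the original methodology; the function name is_valid_partition and the use of Python sets to hold objects are assumptions made only for illustration. It tests whether a candidate collection of hardware and software sub-sets is pairwise disjoint and covers O.

```python
from itertools import combinations


def is_valid_partition(objects, hw_parts, sw_parts):
    """Return True if the sub-sets are pairwise disjoint and cover `objects`.

    `objects` is the set O; `hw_parts` and `sw_parts` are iterables of
    sets playing the roles of H1..Hnh and S1..Sns (hypothetical encoding).
    """
    parts = [set(p) for p in list(hw_parts) + list(sw_parts)]
    # Pairwise disjointness: Hi ∩ Sj, Hi ∩ Hk and Sj ∩ Sl must all be empty.
    if any(a & b for a, b in combinations(parts, 2)):
        return False
    # Coverage: the union of all sub-sets must equal O.
    return set().union(*parts) == set(objects)
```

A partition that leaves an object unassigned, or assigns it twice, is rejected by this check.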
The selection of a partition solution, among all that were analyzed by the par-
tition algorithm, implies a cost function Fcost. This function uses the sub-sets
of objects assigned to hardware H = {H1, ...,Hnh}, the sub-sets of objects as-
signed to software S = {S1, ..., Sns}, the set of constraints Cons and the set of
requirements Req to return a value that measures the solution quality. The iterative
partition algorithm is defined by the function

$$PartAlg(H, S, Cons, Req, F_{cost}()) \qquad (1)$$

returning $H'$ and $S'$ that verify

$$F_{cost}(H', S', Cons, Req) \le F_{cost}(H, S, Cons, Req) \qquad (2)$$

when the cost function is defined so that it returns its minimum value for the best
partition solution.
The value generated by the cost function is obtained from estimated metrics,
related to the system constraints and requirements.
To execute the partial tasks needed by the partition process, the modules
identified in figure 3 were used. Beyond the module that performs the conversion
between the models used externally and internally by the partition process, the
developed partition methodology includes partition algorithms (constructive and
iterative), evaluation functions (closeness and cost) and metrics estimators. The
following sections describe the modelling that is relevant for partition and the
partition itself, with emphasis on the applied algorithms and evaluation functions, and
briefly present the metrics estimation.
2.1 System modelling
In related approaches, the uncommitted systems are commonly modelled with
meta-models such as CDFG [14] [15], DFG [16], FSM [17], Petri net [18], CSP
[19], an extended version of a previous meta-model [20], or a combination of these
meta-models [6]. In spite of the meta-model diversity, most approaches transform
the uncommitted system model into a flow diagram representation. The type of
objects handled during the partition process is constrained by the selected meta-
model, and the several approaches may present quite a different granularity, as a
consequence of using distinct meta-models.
[Figure 3: The modules of the partition methodology. Modules shown: user
interaction, translators, system internal representation (PSMfg graph), partition
algorithms (constructive and iterative), evaluation functions (closeness and cost)
and metrics estimators. Models shown: uncommitted system model (PSM model),
target architecture model and committed model for each component (PSM models).
Arrows distinguish data flow from control flow.]
The PSM meta-model was selected to describe the systems at the partition
process interface, which combines an HLL/HDL meta-model with HCFSM2 [7].
PSM adequately supports complex embedded systems modelling, since it includes
the best features from both meta-models: behavioural hierarchy, concurrency, state
definition, support to handle algorithmic and data complexity, behaviour comple-
tion, possibility of including exception handling and a graphical representation.
Besides, modelling with PSM is a very intuitive task. The strongest limitations of
PSM are the lack of structural hierarchy and automatic support to formally validate
the models. In the present approach, the VHDL language was selected to describe
variables and leaf program-states. VHDL allows an explicit and elegant modelling
of communication and synchronization among concurrent activities.
A PSM model is described by a hierarchical set of program-states, where a
program-state represents a computation unit that at a given time can be active or
inactive. A PSM model may include composite or leaf program-states. A composite
program-state is defined by a set of concurrent or sequential program-substates,
and a leaf program-state is defined by a block of code in the chosen programming
language. If the program-substates are concurrent they are all active at the same
time; if they are sequential, just one program-substate can be active at a certain
time.
In a composite program-state, the order in which sequential program-substates
become active is determined by the directed arcs connecting them. There are two types
of directed arcs: arcs that represent a transition when the substate activity is
terminated and, simultaneously, the condition associated with the arc is true, and
arcs that represent a transition immediately after the condition associated with the
arc becomes true. A transition on a directed arc means that the target substate will
2Hierarchical Concurrent Finite State Machine.
become active.
Both textual and graphic notations can be used to represent a PSM model.
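As an illustration of the hierarchy just described, a PSM model could be held in memory roughly as follows. This is a minimal sketch under stated assumptions: the class names ProgramState and Arc, their fields, and the example state names are hypothetical and do not come from the paper or from parTiTool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Arc:
    source: str
    target: str
    condition: str
    # True: the transition fires only once the source substate has terminated.
    # False: the transition fires as soon as the condition becomes true.
    on_completion: bool


@dataclass
class ProgramState:
    name: str
    code: str = ""                     # leaf behaviour (VHDL text in the paper)
    substates: List["ProgramState"] = field(default_factory=list)
    concurrent: bool = False           # True: all substates are active together
    arcs: List[Arc] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.substates

    def leaves(self) -> List["ProgramState"]:
        """Collect the leaf program-states in depth-first order."""
        if self.is_leaf():
            return [self]
        return [leaf for sub in self.substates for leaf in sub.leaves()]


# Hypothetical example: a sequential composite running next to a logger.
acquire = ProgramState("acquire", code="-- VHDL body omitted")
process = ProgramState("process", code="-- VHDL body omitted")
log = ProgramState("log", code="-- VHDL body omitted")
pipeline = ProgramState(
    "pipeline",
    substates=[acquire, process],
    arcs=[Arc("acquire", "process", "sample_ready", on_completion=True)],
)
top = ProgramState("top", substates=[pipeline, log], concurrent=True)
```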
Internal representation
To describe systems during the partition process, a CFG type meta-model was
developed: the PSM flow graph, or simply PSMfg. The most relevant requirement
of the internal representation, not included in the PSM meta-model requirement
list, is the possibility of associating the information generated during the partition
process with the system model objects.
The motivations that led to the development of a new meta-model were the
need to automate the partition process and the availability of a library with graphic
and computational support to edit graphs - LEDA3 [21]. By means of a set of
adaptations applied to the editor of generic directed graphs and the associated data
structure, it was possible to obtain the computational support to operate on PSMfg
graphs. The goals of the adaptations were: (i) to customize
the graphic characteristics of the nodes, generating the set of node types that will
be presented ahead; (ii) to increase the functionality of nodes and edges, in agreement
with their type; and (iii) to introduce constraints on the interconnection between the
different types of node.
A PSMfg model is an acyclic, directed and polar graph, represented by a G =
{V,E} data structure that includes the list of nodes V and the list of edges E. The
graph is acyclic when no paths on the graph are closed, it is directed because each
edge has a single direction and it is polar because it includes two nodes, one to enter
and the other to exit from the graph, from which all other nodes are successors and
predecessors, respectively [22].
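The three properties named here (directed, acyclic, polar) can be verified on an edge list. The sketch below is only an illustration and is independent of LEDA; the function name is_polar_dag and the encoding of edges as (source, target) pairs are assumptions. It relies on the fact that, in a finite acyclic graph with exactly one node without predecessors and exactly one node without successors, every other node is reachable from the former and reaches the latter.

```python
from collections import defaultdict, deque


def is_polar_dag(nodes, edges):
    """Check that (nodes, edges) is acyclic with one entry and one exit node."""
    succ = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        outdeg[u] += 1

    sources = [n for n in nodes if indeg[n] == 0]   # candidate entry nodes
    sinks = [n for n in nodes if outdeg[n] == 0]    # candidate exit nodes
    if len(sources) != 1 or len(sinks) != 1:
        return False

    # Kahn's algorithm: the graph is acyclic iff every node can be removed
    # in topological order.
    remaining = dict(indeg)
    queue = deque(sources)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    return visited == len(nodes)
```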
The meta-model of the PSMfg represents the semantics of the PSM meta-model
and all the information needed by the partition process, such as the metrics
estimates and the assignment of objects to partitions. To control the granularity of
the objects handled during the partition process, the PSMfg graph must be able
to represent the program-states structure. Since the program-states functionality
is described with VHDL, the PSMfg graph supports the following constructs of
the VHDL language: the parallelism associated with processes, the conditional
constructs (if ... elsif and case), the cycles (while and for) and the
constructs that suspend processes (wait).
The nodes of a PSMfg graph represent the variables and the program-states of
a PSM model, with the same or a finer granularity, and they have associated with
them information that is relevant to the partition process, namely:
⋄ the partition to which the graph node was assigned;
⋄ the partitions that the designer established as forbidden for this node;
3Library of Efficient Data types and Algorithms.
⋄ the information required to estimate the area occupied by the hardware and
the system performance, which refers to metrics like the area of the functional units, the
storage elements or the interconnection elements, the variables read/written
by the program-state associated with the node, the computation time, the
time spent on communication with other program-states or the execution
frequency.
The different node types the PSMfg meta-model uses are those that: (i) define
the entry/exit point of the system graph; (ii) indicate where the (parallel) processes
begin/end; (iii) define the begin/end of a conditional construct; (iv) represent the
control part of a cycle; (v) force a waiting cycle; (vi) assert one or more signals
necessary to a waiting cycle; (vii) represent a variable; and (viii) do not fit in any
of the previous types.
The edges, representing the control flow between nodes, have an associated
branch probability (relative to the source node) and a label.
2.2 System Partitioning
In the present work, partitioning is a two-step process: (i) compute an initial parti-
tion solution with a constructive algorithm and (ii) successively improve it with an
iterative partition algorithm.
A constructive algorithm
The analysis of several constructive partition algorithms revealed that: (i) the
application of an exhaustive algorithm is not feasible since it demands an unacceptable
computation time; (ii) the cluster growth and hierarchical clustering algorithms
create the partition solutions in distinct ways, but produce identical results; (iii) the
ILP4 methods generate optimal solutions and do not require the application of an
iterative optimization algorithm, but they demand a very high computation time and their
formulation is hard to achieve; and (iv) the PACE [23] and GCLP [3] algorithms,
being strongly specific, are not attractive to adapt to the present work. Since the
solutions generated by the constructive algorithm feed the iterative improvement
process, their quality does not need to be high. Thus, a constructive
algorithm with a light implementation was selected: the cluster growth (CG) algorithm.
Although the optimization heuristic of the CG algorithm is quite simple, its capacity
to generate quality solutions is determined by the selected closeness function.
The process of creating a solution begins with the selection of the seed object
for each partition. To select the seed objects, four methods were
implemented: (i) random selection, (ii) manual assignment, (iii) combination of random
selection with manual assignment and (iv) selection based on the communication
among partitions. Manual assignment can be used to prevent objects from being
assigned to an implementation for which they are clearly bad candidates. Having
4Integer Linear Programming.
selected the seed object for each partition, the cluster growth algorithm assigns the
remaining objects to the best possible partition. The best partition is chosen by the
closeness function defined in equation 5 of section 2.3.
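The constructive step can be sketched generically as follows. This is a minimal illustration, not the parTiTool implementation: the function name cluster_growth, the dictionary encoding of seeds and the closeness callback signature are assumptions, and the toy closeness used in the example (pairwise communication counts) merely stands in for the closeness function of equation 5.

```python
def cluster_growth(objects, seeds, closeness):
    """Greedy cluster growth.

    objects:   iterable of all objects, in the order they will be assigned
    seeds:     dict mapping each partition name to its seed object
    closeness: callable (obj, partition, members) -> float, higher is closer
    Returns a dict mapping each partition name to its list of objects.
    """
    assignment = {p: [seed] for p, seed in seeds.items()}
    placed = set(seeds.values())
    for obj in objects:
        if obj in placed:
            continue
        # Assign the object to the partition it is closest to.
        best = max(assignment, key=lambda p: closeness(obj, p, assignment[p]))
        assignment[best].append(obj)
        placed.add(obj)
    return assignment


# Toy closeness (assumption): total communication with current members.
comm = {frozenset(("a", "b")): 5, frozenset(("b", "c")): 1,
        frozenset(("c", "d")): 3}


def toy_closeness(obj, partition, members):
    return sum(comm.get(frozenset((obj, m)), 0) for m in members)


result = cluster_growth(["a", "b", "c", "d"], {"hw": "a", "sw": "c"},
                        toy_closeness)
```

In this toy run, b joins the seed a and d joins the seed c, because those pairs communicate most.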
An iterative algorithm
Simulated annealing [20] [18] [24] [25] is among the most commonly used iterative
partition algorithms, but genetic evolution [8], implementations
of the Kernighan/Lin algorithm [26] [27], tabu search [24] [25] and specific
algorithms are also frequently used. The evaluation of these algorithms has shown that the
Kernighan/Lin algorithm has a limited capacity to avoid local minima of the cost function;
the simulated annealing algorithm has a stronger potential than the greedy and
Kernighan/Lin algorithms to achieve optimal solutions, but its computation time is
very high; and the tabu search algorithm decreases the computation time by bounding
the search for partition solutions to the neighbourhood of these solutions. The
genetic evolution algorithms reduce the design space more efficiently, but their
capacity to converge to the optimum partition solution is inferior. Bearing in mind
that the primary goal of partitioning is to find quality partition solutions, tabu
search and simulated annealing were selected for the iterative process. A thorough
study was carried out with the tabu search algorithm, and the results are presented in
this paper.
The tabu search method (TS) can be seen as an extension of the local search
strategies, where a new solution is found in the neighbourhood of the present
solution, applying a well defined set of rules [28] [29]. When the iteration n of the
search process tries to minimize the cost function Fcost(Pn), the new solution Pn+1
is selected from the neighbourhood V(Pn) of the present solution, applying an
optimization criterion. In general, the criterion expresses the objective of selecting
the best solution present in the neighbourhood. The neighbourhood of solution Pn
can be defined by the set of all the alternatives that result from the application of a
rule that modifies the characteristics or attributes of Pn. In the hardware/software
partition problem, the transition from the present solution to a solution in its
neighbourhood occurs when at least one object is moved from its current partition to a
target partition, ending in a new solution. A hardware/software partition
problem frequently has to evaluate a high number of partition alternatives, which means that
finding a quality solution requires an equally high computation time. To
avoid evaluating all the alternatives present in the neighbourhood of the current solution,
a list of candidate solutions is implemented; this way, only a partial
neighbourhood of the current solution is evaluated.
Although tabu search is a local search strategy that tries not to stop in
local minima of Fcost, its policy embodies other features. This strategy was named
tabu search since in every iteration parts of the design space are forbidden, i.e.,
some solutions are considered tabu. To reach this goal, tabu search implements
a flexible memory structure that supports several search strategies, like avoiding
local minima. The flexible memory includes short term (STM), long term (LTM)
and medium term (MTM) components. The short term components are based on
the history of the most recently visited solutions, the long term components are based
on the most frequent solutions and the medium term components are oriented to
quality solutions and influential solutions. Using this information it is possible
(i) to diversify the search, in order to escape local minima, (ii) to intensify the
search, to reinforce the convergence to the absolute minimum and (iii) to avoid
cycles during the search.
To avoid a cycle during the search, the last L visited solutions are saved in
the tabu list. While a solution Pn is in the tabu list, it is forbidden. This way, the
search will not return, at least during L iterations, to a visited solution. The size of
the tabu list, or the tabu tenure, is decisive for the evolution of a search, since it
influences the restrictions applied to the design space that can be searched. For this
reason, the more restrictive a tabu is, the shorter its tenure must be. The performed
experiments resulted in the following recommendation: the tabu tenure of objects and
moves must be 5 to 10% of the number of objects in the system description.
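A fixed-length tabu list and the tenure recommendation can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the names tabu_tenure and TabuList are hypothetical, the rounding rule is a choice made here, and a bounded deque is only one simple way to forget entries after L insertions.

```python
from collections import deque


def tabu_tenure(n_objects, fraction=0.05):
    """Tenure as a fraction (5% to 10%) of the number of system objects."""
    if not 0.05 <= fraction <= 0.10:
        raise ValueError("fraction outside the recommended 5-10% range")
    return max(1, round(n_objects * fraction))


class TabuList:
    """Remembers the last `tenure` entries; older entries stop being tabu."""

    def __init__(self, tenure):
        self._items = deque(maxlen=tenure)

    def add(self, item):
        self._items.append(item)   # the oldest entry is dropped when full

    def __contains__(self, item):
        return item in self._items
```

For a system description with 100 objects, tabu_tenure(100, 0.05) gives a tenure of 5 iterations.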
The temporary exclusion of solutions does not result solely in advantages for
the tabu search method. The disadvantages arise when high quality solutions, the
goal of the search, are excluded from it. To overcome the inconvenience
caused by the exclusion of high quality solutions, TS methods have a mechanism
that allows the tabu classification of a solution to be withdrawn, assuming it may be
a quality solution. This mechanism is called the aspiration criterion. Aspiration
criteria can be defined by objective, by direction of search and by influence
[28].
The tabu search algorithm iteratively tries to improve the provided partition
solution, assembles all the components that participate in the search and controls
its evolution. The implemented algorithm [2] is a modified version of the one
discussed in [29]. Namely, it only searches a partial neighbourhood of the present
solution, it has a richer set of evolution strategies to apply when there are no
eligible quality solutions, and it applies a more efficient improvement when none
of the moves improves the cost of the present solution. Partial neighbourhood
searching decreases the computation time per iteration by a factor close to the
number of partitions, but the design space exploitation is less complete. Since the
partial neighbourhood contains the best solutions, an intensification
element is introduced in the search.
The tabu search algorithm runs until a predefined number of iterations is reached,
and each search runs while a predefined number of iterations without improving
the best solution is not exceeded. The number of iterations that can be executed
without improving the partition solution should be neither too high - to avoid
wasting iterations around a local minimum - nor too low - to increase the possibility
of converging to a local (or absolute) minimum of the cost function.
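The two stopping rules can be pictured as two nested loops. The skeleton below is a sketch only: the names tabu_search, step, restart, max_iters and max_stall are assumptions, and the move selection, tabu bookkeeping and restart policy are reduced to opaque callbacks so that only the control structure is shown.

```python
def tabu_search(initial, cost, step, restart, max_iters, max_stall):
    """Skeleton of the outer (total budget) and inner (stall limit) loops.

    cost(solution) -> number; step(solution) -> next solution;
    restart(best) -> initial solution for the next search.
    """
    current = best = initial
    best_cost = cost(best)
    iters = 0
    while iters < max_iters:                  # global iteration budget
        stall = 0
        # One search: continue while the stall limit is not exceeded.
        while iters < max_iters and stall <= max_stall:
            current = step(current)
            iters += 1
            c = cost(current)
            if c < best_cost:
                best, best_cost, stall = current, c, 0
            else:
                stall += 1
        current = restart(best)               # new initial solution
    return best, best_cost
```

With a toy problem such as minimizing abs(x - 7) from x = 0 with step x + 1, each search walks past the minimum for a few iterations and then restarts from the best value found.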
On every iteration of the search process, the partial neighbourhood of the
present solution is analysed and the move to be executed can be selected from
one of the following ordered alternatives:
(1) The move that generates the largest improvement in the partition solution
cost and that obeys one of the following conditions: it is not tabu, or it is tabu
but can be executed due to an aspiration criterion;
(2) The move that is not tabu and leads to the smallest increase in the partition
solution cost; the cost of the solutions that result from the moves is decreased
by the application of a negative improvement;
(3) The “least” tabu move, the least frequent move, the move that in the past
resulted in the best cost variation or the move of the object that has stayed longest
in the same partition.
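The ordered alternatives above can be sketched as a single selection function. This is a hedged illustration, not the implemented algorithm: the name select_move and all callback parameters (is_tabu, aspires, improvement, fallback) are assumptions, and the third alternative is delegated entirely to a caller-supplied fallback because the text leaves the choice among its options open.

```python
def select_move(candidates, is_tabu, aspires, improvement, fallback):
    """Pick the next move from a list of (move, delta_cost) candidates.

    delta_cost < 0 means the move improves the solution cost.
    improvement(move) returns a non-positive correction added to the cost
    increase of non-improving moves (second alternative).
    """
    # (1) Best improving move that is not tabu, or is tabu but aspired.
    improving = [(delta, move) for move, delta in candidates
                 if delta < 0 and (not is_tabu(move) or aspires(move, delta))]
    if improving:
        return min(improving, key=lambda item: item[0])[1]

    # (2) Non-tabu move with the smallest corrected cost increase.
    allowed = [(delta + improvement(move), move) for move, delta in candidates
               if not is_tabu(move)]
    if allowed:
        return min(allowed, key=lambda item: item[0])[1]

    # (3) Everything is tabu: defer to the caller's policy, e.g. the
    # "least" tabu move or the least frequent move.
    return fallback(candidates)
```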
At the end of every iteration, the performed move, the inverse move and the
moved object are classified as tabu; the history is updated with the
information about the move and the moved object; and the tabu tenure of moves and
objects, the best solution found and the number of iterations are updated. At the end of a
search, a new initial solution is generated.
Experimental results show that, at the beginning of each search, the information
saved in the history of moves and moved objects must be reset. Otherwise, the
capacity to converge to the optimum solution is reduced, since the improvement
used in the second move alternative, proportional to the number of iterations an
object is not moved, would regularly select every object.
Applying all types of tabu classification can be very restrictive to the search. A
subset of tabu classifications was selected, which decreases the dimension of the
neighbourhood to be searched, helps to avoid cycles and does not place excessive
restrictions on the search. The following tabu classifications were selected: (i) the move of
a given object from a source partition to a target partition; (ii) all moves of a given
object; and (iii) the inverse of the move that originated the present solution.
The implemented TS method includes two types of aspiration criterion: (i) by
objective, when the first alternative selects a quality move that is classified as
tabu; and (ii) a default criterion, when the third alternative selects the “least”
tabu move.
The implemented memory structure registers the history of performed moves
and the history of moved objects. For each performed move, the history of moves
saves the source and target partitions, the tabu tenure (STM), the execution fre-
quency (LTM) and the achieved cost variation (MTM); for each moved object, it
saves the tabu tenure (STM), the frequency of move (LTM), the number of itera-
tions an object remains on the same partition (LTM) and the achieved cost variation
(MTM).
A neighbourhood with a simple structure was implemented, since a
neighbourhood with a complex structure would greatly increase the computation time.
While in a partition problem with nObj objects and nPart partitions the size of
the simple neighbourhood is nObj ∗ (nPart − 1), in a generic complex
neighbourhood, where each iteration executes a series of nMoves moves, the number
of alternatives that make up the neighbourhood is defined by equation 3. The
value defined by this equation is much higher than the number of alternatives in a simple
neighbourhood. A complex neighbourhood favours diversification in the
search, which means an increased capacity to avoid local minima but also an
increased difficulty in converging to the optimum partition solution.
size(V ) = (nPart− 1)nMoves ∗ CnObj
nMoves
= (nPart− 1)nMoves ∗
nObj!
nMoves! ∗ (nObj − nMoves)!
(3)
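To give a feeling for the difference between the two neighbourhood sizes, the short Python sketch below evaluates both expressions. The function names and the example values (20 objects, 3 partitions, 3 moves per iteration) are chosen here only for illustration and are not taken from the case studies.

```python
from math import comb


def simple_neighbourhood_size(n_obj, n_part):
    """One move of one object: nObj * (nPart - 1) alternatives."""
    return n_obj * (n_part - 1)


def complex_neighbourhood_size(n_obj, n_part, n_moves):
    """nMoves distinct objects moved per iteration (equation 3)."""
    return (n_part - 1) ** n_moves * comb(n_obj, n_moves)


simple = simple_neighbourhood_size(20, 3)          # 20 * 2
complex_ = complex_neighbourhood_size(20, 3, 3)    # 2**3 * C(20, 3)
```

For these values the simple neighbourhood has 40 alternatives and the complex one 9120; with nMoves = 1 the two expressions coincide.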
The partial neighbourhood, or the list of candidate solutions, considered in
every iteration of the TS algorithm is a subset of the present solution neighbourhood,
with a size that remains fixed throughout the search process and equals the number
of system objects. The (nPart − 1) moves per object that define the
neighbourhood were decreased to only one move per object in the partial neighbourhood,
decreasing the computation time by a factor close to the number of partitions. The
subset of moves that define the partial neighbourhood is made up of the best move
for each object of the system description. The best moves are computed by a
function identical to the closeness function of the cluster growth algorithm (Fprox in
equation 5).
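The construction of the candidate list can be sketched as follows. This is an illustration under stated assumptions: the name partial_neighbourhood, the dictionary encoding of a solution and the score callback (a stand-in for the closeness-like function mentioned in the text) are all hypothetical.

```python
def partial_neighbourhood(assignment, partitions, score):
    """Keep one candidate move per object: its best-scoring target partition.

    assignment: dict object -> current partition
    partitions: list of all partition names
    score:      callable (obj, target_partition) -> float, higher is better
    Returns a list of (obj, source, target) moves, one per object.
    """
    moves = []
    for obj, src in assignment.items():
        targets = [p for p in partitions if p != src]
        if not targets:            # a single partition offers no moves
            continue
        dst = max(targets, key=lambda p: score(obj, p))
        moves.append((obj, src, dst))
    return moves
```

With nObj objects this yields nObj candidates instead of the nObj ∗ (nPart − 1) moves of the full simple neighbourhood.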
Part of the potential of the TS algorithm is a consequence of executing several searches,
each one with a different initial solution. The method used to generate the initial
solution of the searches combines two strategies: intensification - the new initial
solution results from the best evaluated partition solution - and diversification -
the assignment of a significant percentage of objects is modified, according to the
long term memory. The rule is to execute the least frequent moves, but after a
number of searches without improving the best solution, the choice can be to move
the least frequently moved objects to a randomly selected partition. The random
selection reinforces the diversification on the search. Given that the percentage of
moved objects is a parameter of the algorithm, it is possible to control the relation
between the intensification and the diversification applied on the generation of a
new initial solution. The suggested value for the percentage of objects to be moved
is 20%.
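One of the two restart variants described above, moving the least frequently moved objects to a randomly selected partition, can be sketched as shown here. The function name new_initial_solution, its parameters and the rounding of the 20% share are assumptions made for illustration; the variant that executes the least frequent moves is not covered by this sketch.

```python
import random


def new_initial_solution(best, move_freq, partitions, fraction=0.20, rng=None):
    """Start from the best solution and relocate a share of its objects.

    best:      dict object -> partition (best solution found so far)
    move_freq: dict object -> how often the object was moved (long term memory)
    fraction:  share of objects to relocate (20% is the suggested value)
    """
    rng = rng or random.Random(0)
    n_move = max(1, round(len(best) * fraction))
    # Diversification: pick the least frequently moved objects first.
    chosen = sorted(best, key=lambda o: move_freq.get(o, 0))[:n_move]
    solution = dict(best)          # intensification: keep the rest of `best`
    for obj in chosen:
        others = [p for p in partitions if p != best[obj]]
        if others:
            solution[obj] = rng.choice(others)
    return solution
```

Raising the fraction shifts the balance towards diversification, lowering it towards intensification.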
The implemented algorithm includes the following intensification elements:
⋄ to create the list of candidate solutions with the highest quality solutions
present on the current solution neighbourhood;
⋄ to create the initial solution for a new search based on the best evaluated
solution;
⋄ to select, for the third evolution alternative of the TS algorithm, the move that in
the past resulted in the best cost variation;
and the following diversification elements:
⋄ to apply, on the second evolution alternative of the TS algorithm, a cost
improvement based on the number of iterations the objects remained in the
last partition Pk they were assigned to (NIMPk); this improvement, described
by equation 4, strongly favours the move of the objects that are not moved
regularly, since they have a high NIMP; thus, the search is directed to less
explored zones and a diversification component is introduced in the search;

$$improvement(P_k) = -\frac{NIMP_k}{nObj} \qquad (4)$$

⋄ to create the initial solution of a new search moving a percentage of the least
frequently moved objects or a percentage of objects selected randomly (after
a number of searches);
⋄ to select, as the third evolution alternative of the TS algorithm, the least
frequent move or move the object that remains longer on the same partition.
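The effect of equation 4 on the second evolution alternative can be shown with a small numeric sketch. The function names and the example figures are assumptions chosen for illustration; in particular, adding the improvement directly to the cost increase is one plausible reading of how the correction is applied.

```python
def improvement(nimp_k, n_obj):
    """Equation 4: negative correction that grows with the object's idle time."""
    return -nimp_k / n_obj


def corrected_increase(cost_increase, nimp_k, n_obj):
    """Cost increase of a non-improving move after the correction is applied."""
    return cost_increase + improvement(nimp_k, n_obj)


# Two moves with the same raw cost increase in a 20-object system:
recent = corrected_increase(1.0, nimp_k=1, n_obj=20)    # object moved recently
idle = corrected_increase(1.0, nimp_k=15, n_obj=20)     # object idle for long
```

The move of the long-idle object ends up with the lower corrected cost and is therefore preferred.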
2.3 Evaluation functions
This section describes the evaluation functions (closeness and cost) that guide the
partition algorithms (constructive and iterative) on the creation and improvement
of partition solutions.
Closeness function
The best partition used to assign the objects, on every iteration of the constructive
partition process, is chosen by the closeness function defined in equation 5.

$$F_{prox} = f\left[\begin{array}{l} F_{var}(M_{com1}) \quad F_{psHwSw}(M_{cmp1}, M_{cmp2}, M_{com2}) \\ F_{psHw}(M_{area}, M_{com2}) \end{array}\right] \qquad (5)$$

where Mcom1 (Mcom2) represents the communication intensity between a
variable (program-state) and the program-states (variables) assigned to the partition,
Mcmp1 (Mcmp2) is the software (hardware) computation time of a program-state
and Marea is the area occupied by all the variables and program-states assigned to
the partition. The Fvar function is used in the assignment of variables and the FpsHwSw
and FpsHw functions are used in the assignment of program-states. At every moment
of the constructive process, the Fprox function measures the closeness between the
object to be assigned and the objects previously assigned to each partition.
If the object to be assigned is a variable that is a bad candidate for hardware,
meaning that the area it occupies in hardware exceeds a defined limit, the Fvar
function suggests an assignment to software. If the variable is not a bad candidate
for hardware, it is assigned to the partition that presents the highest communication
intensity with this variable, i.e., to the partition p that presents the best Mcom1[p]
value.
When a program-state is being assigned, if the FpsHwSw function indicates that
software is the best partition to assign it, the program-state is immediately assigned
to software. Otherwise the best hardware partition is selected by FpsHw, a function
that is more appropriate to distinguish the assignment to the different hardware
partitions.
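The decision rules described above can be sketched as two small functions. This is only an illustration under stated assumptions: the internal form of Fvar, FpsHwSw and FpsHw is not given in this section, so the sketch takes their outcomes as inputs, and the names `assign_variable`, `assign_program_state`, `area_limit` and `hw_closeness` are hypothetical. It is also assumed that a higher FpsHw value means a closer partition.

```python
SOFTWARE = "SW"


def assign_variable(hw_area, area_limit, mcom1):
    """Pick a partition for a variable.

    hw_area:    estimated hardware area of the variable
    area_limit: limit above which the variable is a bad hardware candidate
    mcom1:      dict partition -> communication intensity with the variable
    """
    if hw_area > area_limit:
        return SOFTWARE                      # bad hardware candidate
    return max(mcom1, key=mcom1.get)         # highest communication intensity


def assign_program_state(prefers_software, hw_closeness):
    """Two-step choice for a program-state.

    prefers_software: outcome of FpsHwSw (True when software is the best choice)
    hw_closeness:     dict hardware partition -> FpsHw value
                      (assumption: higher means closer)
    """
    if prefers_software:
        return SOFTWARE
    return max(hw_closeness, key=hw_closeness.get)


# Hypothetical values for illustration only.
var_partition = assign_variable(40, 100, {"SW": 3.0, "HW1": 14.0, "HW2": 0.0})
big_var_partition = assign_variable(400, 100, {"SW": 3.0, "HW1": 14.0})
ps_partition = assign_program_state(False, {"HW1": 0.2, "HW2": 0.7})
```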
For example, the metric Mcom1, used to select the best partition p to assign a
variable v, is computed with equation 6. The communication intensity Mcom1[p]
simply measures the number of times the variable v is read/written by the program-
states assigned to the partition p.
\[ M_{com1}[p] = \sum_{o \in (rdO(v) \cap p)} rdV(o).nRd(v) * rdV(o).pRd(v) * FN(o) \; + \sum_{o \in (wrO(v) \cap p)} wrV(o).nWr(v) * wrV(o).pWr(v) * FN(o) \qquad (6) \]
where
⋄ rdO(v) (wrO(v)) is the set of program-states that read (write) the variable
v;
⋄ rdV (o) (wrV (o)) represents the set of variables read (written) by the program-
state o;
⋄ rdV (o).nRd(v) (wrV (o).nWr(v)) is the number of times the variable v is
read (written) by o on every execution;
⋄ rdV (o).pRd(v) (wrV (o).pWr(v)) is the probability of variable v to be read
(written) by o;
⋄ FN(o) is the execution frequency of o.
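A minimal sketch of equation 6 follows. The `Access` and `ProgramState` classes are hypothetical containers introduced only for this example; they are one possible way to hold the quantities nRd, pRd, nWr, pWr and FN defined above.

```python
from dataclasses import dataclass, field


@dataclass
class Access:
    count: int      # nRd(v) or nWr(v): accesses to v per execution of o
    prob: float     # pRd(v) or pWr(v): probability that the access happens


@dataclass
class ProgramState:
    name: str
    freq: float                                   # FN(o): execution frequency
    reads: dict = field(default_factory=dict)     # rdV(o): variable -> Access
    writes: dict = field(default_factory=dict)    # wrV(o): variable -> Access


def mcom1(v, partition):
    """Equation 6: communication intensity between variable v and the
    program-states currently assigned to `partition`."""
    total = 0.0
    for o in partition:
        if v in o.reads:            # o belongs to rdO(v) ∩ p
            a = o.reads[v]
            total += a.count * a.prob * o.freq
        if v in o.writes:           # o belongs to wrO(v) ∩ p
            a = o.writes[v]
            total += a.count * a.prob * o.freq
    return total


# Hypothetical program-states for illustration.
s1 = ProgramState("s1", freq=10, reads={"v": Access(2, 0.5)})
s2 = ProgramState("s2", freq=4, writes={"v": Access(1, 1.0)})
s3 = ProgramState("s3", freq=3, reads={"w": Access(5, 1.0)})

hw1_intensity = mcom1("v", [s1, s2])   # 2*0.5*10 + 1*1.0*4
sw_intensity = mcom1("v", [s3])        # s3 never touches v
```

Here the variable v would be drawn towards the partition holding s1 and s2, since that is where its readers and writers are.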
Cost function
The cost function applied in the iterative partition process treats as optimum
any partition solution that respects the target architecture constraints and
achieves the design requirements, rather than the solution that uses the least
hardware area and/or achieves the best performance. To reach this goal the
function includes one term per constraint or requirement, whose value is
proportional to the degree to which this constraint or requirement is violated
by the partition alternative (equation 7).
\[ F_{cost}(H, S, Cons, Req) = \sum_{i=1}^{3} K_i * f_i(M_i, C_i) \qquad (7) \]
where
⋄ H (S) is the set of hardware (software) partitions;
⋄ Cons = {C1, C2} is the set of design constraints, with C1 being the con-
straint applied to the area of the hardware partitions data path (M1) and C2
the constraint applied to the area of the respective control unit (M2);
⋄ Req = C3 is the performance required from the system (M3);
⋄ M is the set of metrics Mi to which the constraints Cons and the requirement
Req apply;
⋄ Ki is the coefficient applied to the metric Mi;
⋄ fi(Mi, Ci) represents the contribution of the metric Mi to the cost function
and it is defined by equation 8.
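\[ f_i(M_i, C_i) = \begin{cases} \sum_{P_j \in H} \mathrm{MAX}\left[\mathrm{excess}(M_i[P_j], C_i),\, 0\right], & i = 1, 2 \\ \mathrm{MAX}\left[\mathrm{excess}(M_3, C_3),\, 0\right], & i = 3 \end{cases} \qquad (8) \]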
where
⋄ Mi[Pj ] is the value of metric Mi for the hardware partition Pj ;
⋄ Ci[Pj ], the value of the design constraint Ci applied to the hardware partition
Pj , was replaced by Ci on equation 8; on the considered target architecture,
the pairs (FPGA,CPLD) that implement the pair (DP,CU) of the hardware
partitions include the same devices;
⋄ the term excess(m, c) is given by
\[ \mathrm{excess}(m, c) = \frac{m - c}{c} \qquad (9) \]
The estimates for the area (M1 and M2) are computed per partition, while the
estimate for performance (M3) refers to the whole system.
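Equations 7 to 9 can be put together as in the sketch below. The coefficient values, the argument names and the sample numbers are placeholders chosen for the example; it is also assumed that M3 is expressed as an execution time, so that a value above C3 means the requirement is missed.

```python
def excess(m, c):
    """Equation 9: relative amount by which metric m exceeds limit c."""
    return (m - c) / c


def f_cost(dp_area, cu_area, exec_time, c1, c2, c3, k=(1.0, 1.0, 1.0)):
    """Equations 7 and 8.

    dp_area, cu_area: values of M1 and M2, one entry per hardware partition
    exec_time:        system-level value of M3 (assumed to be a time)
    c1, c2, c3:       the limits C1, C2 and C3
    k:                the coefficients K1..K3 (placeholder values)
    """
    f1 = sum(max(excess(m, c1), 0.0) for m in dp_area)   # data path area
    f2 = sum(max(excess(m, c2), 0.0) for m in cu_area)   # control unit area
    f3 = max(excess(exec_time, c3), 0.0)                 # performance
    return k[0] * f1 + k[1] * f2 + k[2] * f3


# One data path 20% over its limit and an execution time 20% over the
# requirement: both violations contribute, the compliant terms add nothing.
violating = f_cost([90, 120], [50, 40], 12, c1=100, c2=64, c3=10)

# Every metric within its limit: the cost is zero, whatever the margins are.
feasible = f_cost([90, 80], [50, 40], 9, c1=100, c2=64, c3=10)
```

Because each term is clipped at zero, any feasible solution has cost 0; the function does not reward using less area or running faster than required, which matches the stated goal.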
2.4 Metrics estimation
Metrics estimation aims to compare partition alternatives, which requires a high
degree of fidelity rather than a high accuracy. However, it is expected that a high
accuracy corresponds to an equally high degree of fidelity.
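The difference between accuracy and fidelity can be made concrete with a small sketch. The definitions used here are assumptions, since this section does not spell them out: accuracy is taken as one minus the relative error, and fidelity as the percentage of pairs of design alternatives that the estimates rank in the same order as the measured values.

```python
from itertools import combinations


def accuracy(estimated, measured):
    """Accuracy (%) of one estimate, as 1 - relative error (assumed definition)."""
    return 100.0 * (1.0 - abs(estimated - measured) / measured)


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fidelity(estimated, measured):
    """Percentage of pairs of alternatives that the estimates order in the
    same way as the measurements (assumed definition)."""
    pairs = list(combinations(range(len(estimated)), 2))
    agree = sum(
        sign(estimated[i] - estimated[j]) == sign(measured[i] - measured[j])
        for i, j in pairs
    )
    return 100.0 * agree / len(pairs)


measured = [10, 20, 30]          # hypothetical measured values
biased = [14, 26, 37]            # inaccurate, but the ranking is preserved
misleading = [14, 12, 37]        # alternatives 0 and 1 are ranked wrongly

fid_biased = fidelity(biased, measured)
fid_misleading = fidelity(misleading, measured)
```

The `biased` estimates are up to 40% off and still reach 100% fidelity, which is why a partitioner that only compares alternatives can tolerate systematic estimation errors.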
The estimation operates on the system graph, modelled with PSMfg, and considers
a hardware model (with data path and control unit), a software model (with a
predefined set of instruction types) and a communication model (for inter-partition
communications). The code optimization performed by the compiler - related to
pipelining, superscalarity and memory hierarchy - is accounted for as a factor
obtained by simulation. This procedure is acceptable for most partition problems
applied to embedded systems. One difference from a significant part of the existing
approaches is the emphasis given to the estimation of metrics related to
inter-partition communications.
To obtain accurate estimates, detailed models for the used resources were
developed, especially the hardware and communication models, and the estimation
runs at two abstraction levels: program-state and system. The incremental update
of the estimates and the estimation at two levels both help to decrease the
computation time.
Low level estimates, which are used by the system level estimates, are computed
at the program-state abstraction level; these computations are performed once
per partition session and the resulting estimates are more accurate. Estimates for
metrics relative to the system objects are computed at the program-state level.
Examples of these metrics are the software and the hardware computation times,
the area occupied by functional units, multiplexers and variables, the read/written
variables and the program-states that read/write variables. To obtain these
estimates, low level metrics are required: these include the execution time of the
arithmetic/logic operators and the area occupied by multiplexers, arithmetic/logic
operators and memory elements.
At the system level, metrics are estimated at a higher abstraction level and the
computations are repeated on every iteration of the partition process. The estimates
are less accurate and, whenever possible, they are simply updated. The metrics
estimated at the system level are the system performance and the area occupied by
the data path and the control unit of the hardware partitions. System performance
is computed through explicit scheduling at the program-state level. By ignoring the
scheduling at the system level, the computation time is decreased and the obtained
performance tends to be overestimated.
The computation of the execution times follows a simple software model, that
estimates computation time by type of executed instruction (the built prototype
follows the IA-32 architecture model) and considers the optimizations performed
by the compilers as a factor obtained by simulation.
The developed hardware model focuses on the area of a partition, which includes
the area of the data path - the functional units, the storage elements, the
interconnection resources and the resources of the interface with other partitions -
and the area of the control unit - the area of the state machine associated with the
partition data path, which includes the state register, the output logic and the next
state logic. Experimental results confirmed the state register as the dominant term
in the area of the state machine, accounting for 60 to 80% of it.
The developed communication model defines the timings and the resources
associated with the communication between partitions. The model supports the
register access communication mechanism, by polling and by interrupt. At the
beginning of every search of the iterative partition process, estimates for
communication times are computed. These estimates are updated whenever an object
is moved from one partition to another, but only for the moved object and/or
those objects that communicate with it.
3 Validation of the partition methodology
The proposed methodology was validated on a CPU-based architecture coupled to
a reconfigurable board (briefly described below), through two case studies that
represent data flow dominated embedded systems: one clearly suggesting a software
implementation, while the other is oriented to a hardware implementation.
A quantitative evaluation compared automatically generated solutions with
manually optimised hardware/software implementations, looking into two main results:
the quality of the partition solutions (measured by feasibility and performance) and
the quality of the estimates (measured by accuracy/fidelity). The methodology can
be further evaluated by its performance, e.g., by the computation time needed
to generate the partition solutions, and by the support it gives to implement these
solutions, namely to synthesize the interface between partitions.
3.1 Prototype system
The prototype system applied on the partition methodology validation includes a
target architecture and a partitioning tool.
The considered target architecture contains a reconfigurable platform (EDgAR-2)
and its host system. The EDgAR-2 board is an FPGA/CPLD based system,
with a PCI interface and fully in-system programmable (ISP) [1] [30]. The board
structure, shown in figure 4, contains an array of 4 pairs (control unit, data path),
called processor modules (PMs), which are 2-way interconnected with dedicated
buses, forming a PM pipeline; each PM is also connected to a different set of 8 lines
of the 32-bit PCI data bus. FPGAs implement the data paths, while CPLDs are
better suited to implement the control units.
The EDgAR-2 architecture was designed to directly accommodate a finite state
machine with data path (FSMD) model. Since the architecture implements several
concurrent FSMDs, it is suitable to map descriptions modelled with concurrent
FSMDs (CFSMD) [31], hierarchical concurrent FSMD (HCFSMD) or program
state machine (PSM) meta-models [32].
Although EDgAR-2 may not be considered a typical reconfigurable board - it
is composed of both FPGAs and CPLDs, and it lacks on-board RAM - it is
particularly adequate to validate a general purpose hardware/software partition
methodology due to these extra challenges.
The applied tool was parTiTool, a framework based on the LEDA library [21]
that allows the visualisation, editing and partitioning of PSMfg graphs. This
framework also includes support to detect errors in the graph structure and to
visualise the output of the partitioning process. Most of the operations needed
for graph visualisation and editing are supported by the classes GRAPH
and GraphWin of the LEDA library.
[Figure 4 block diagram: the PCI bus and the PCI controller, the four processor
modules PM1 to PM4 (each pairing a CPLD with an FPGA), the up and down connectors
(control and data) and a general purpose connector.]
Figure 4: The EDgAR-2 platform architecture.
3.2 Case studies
The partition methodology was validated with a detailed analysis of two case
studies: the application of a Sobel filter to an image (convolution) and the DES⁵
cryptography algorithm [33]; the first one is oriented to a software implementation
and the latter suggests a hardware implementation.
The application of a Sobel filter F (with X by Y pixels) to an image I runs
through two steps: (i) for every pixel (j, i) of the original image I, whose colour is
I(j, i), an area with the filter size and centered on pixel (j, i) is convolved with the
filter F, generating a new value If(j, i) for pixel (j, i) (equation 10); (ii) with the
minimum and maximum of the filtered image If, m(If) and M(If) respectively,
the filtered image is normalized to the colour range of the original image (r(I)),
generating the filtered and normalized image In (equation 11).
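\[ I_f(j, i) = \sum_{k=0}^{Y-1} \sum_{l=0}^{X-1} I\left(j - \left\lfloor \tfrac{X}{2} \right\rfloor + l,\; i - \left\lfloor \tfrac{Y}{2} \right\rfloor + k\right) * F(l, k) \qquad (10) \]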
\[ I_n(j, i) = \frac{r(I)}{M(I_f) - m(I_f)} * \left[ I_f(j, i) - m(I_f) \right] \qquad (11) \]
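The two steps can be sketched in plain Python as follows. The treatment of pixels that fall outside the image (taken as 0), the horizontal Sobel kernel, the colour range of 255 and all function names are assumptions made for this example; the paper does not state them in this section.

```python
def apply_filter(img, flt):
    """Equation 10. img[i][j] holds I(j, i) and flt[k][l] holds F(l, k).
    Pixels outside the image are taken as 0 (an assumption)."""
    h, w = len(img), len(img[0])
    y, x = len(flt), len(flt[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for k in range(y):
                for l in range(x):
                    jj, ii = j - x // 2 + l, i - y // 2 + k
                    if 0 <= jj < w and 0 <= ii < h:
                        acc += img[ii][jj] * flt[k][l]
            out[i][j] = acc
    return out


def normalize(filtered, colour_range):
    """Equation 11: rescale the filtered image to [0, colour_range]."""
    lo = min(min(row) for row in filtered)
    hi = max(max(row) for row in filtered)
    if hi == lo:                       # flat image: avoid dividing by zero
        return [[0.0 for _ in row] for row in filtered]
    scale = colour_range / (hi - lo)
    return [[scale * (p - lo) for p in row] for row in filtered]


SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# A 3x3 image with a vertical edge on its right side.
image = [[0, 0, 255],
         [0, 0, 255],
         [0, 0, 255]]

filtered = apply_filter(image, SOBEL_X)
normalized = normalize(filtered, 255)
```

The four nested loops make the data flow dominated nature of this case study visible: the work is a fixed pattern of multiply-accumulate operations with no data-dependent control.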
The implemented DES algorithm applies a set of transformations to the input
data (sample), which depend on these data and on the secret key. This key is
also altered during the different iterations of the encryption process. Every sample
to encrypt goes through an initial permutation IP, a set of transformations that
depend on the secret key and a final permutation FP, the inverse of IP (figure 5).
The set of transformations that depend on the secret key is defined by an
encryption function f and a key scheduling function KS.
⁵ Data Encryption Standard.
The function f includes the expansion E, the substitution tables S-box and the
permutation P. The information generated by the initial permutation IP is split
into two 32-bit halves: the least significant part (R) feeds function f and the most
significant part (L) is the input for an exclusive-OR operator. At the end of a round,
the two halves of the sample to encrypt are swapped and the round is repeated. The
algorithm evolves in 16 rounds, in order to “circulate” the sample to be encrypted.
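The round structure described above is a Feistel network, and its skeleton can be sketched as below. This is not DES: the permutations IP and FP are omitted, `toy_f` is a made-up stand-in for the real function f (expansion E, S-boxes and permutation P), and the round keys are arbitrary numbers instead of the output of the key scheduling KS. The sketch only shows how the halves are combined and swapped, and why the same routine decrypts when the round keys are applied in reverse order.

```python
MASK32 = 0xFFFFFFFF


def toy_f(r, key):
    """Stand-in for the DES function f. NOT the real E / S-box / P pipeline."""
    return ((r * 0x9E3779B1) ^ key) & MASK32


def feistel(block64, round_keys, f=toy_f):
    """Apply one round per key to a 64-bit block (IP and FP omitted)."""
    left, right = (block64 >> 32) & MASK32, block64 & MASK32
    for key in round_keys:
        # R feeds f, the result is XORed with L, then the halves are swapped.
        left, right = right, left ^ f(right, key)
    # The final swap is undone, as in standard DES, so that running the
    # routine again with the keys reversed restores the input.
    return (right << 32) | left


keys = list(range(1, 17))              # 16 hypothetical round keys
plain = 0x0123456789ABCDEF
cipher = feistel(plain, keys)
recovered = feistel(cipher, keys[::-1])
```

Note that invertibility does not depend on f itself being invertible, which is what lets the real f use non-invertible substitution tables.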
[Figure 5 block diagram: the 64-bit sample to (de/en)crypt goes through permutation
IP, the secret key dependent transformations (encrypt function f with expansion E,
substitution tables S-box and permutation P) and permutation FP, producing the
64-bit (de/en)crypted sample; the key scheduling KS derives the round key from the
56-bit secret key through permutation PC1, the C/D register with left/right shift
and permutation PC2.]
Figure 5: Block diagram of the DES algorithm.
The key scheduling KS generates a 48-bit key for each of the 16 rounds of
the DES algorithm, through a linear combination of the 56-bit secret key. The KS
module includes a permutation PC1, a register, a permutation PC2 and a shift left
(right) operator, applied on the encrypt (decrypt) process.
The dimension of the partition problem associated with both examples and the
parameters used to solve it with the tabu search algorithm are summarised in
table 1. The high number of objects indicated for both examples is a consequence
of using explicit parallelism in the system description.
3.3 Experimental results
The best partition solution generated by tabu search for the DES example
assigns program-states and variables to partitions (SW or HW1 to HW4) as
illustrated in figure 6. The objects in the upper part of the figure represent PSM
variables and the remaining objects are the equivalents of the PSM program-states.
When the automatic partition solutions are compared with manually optimized
hardware/software implementations, the measured performance of the best
automatic partition solution reached 72 to 92% of the manual implementation
performance, being higher on the cryptography example. These results can be improved
by detailing the estimation models and by tuning the granularity of the system model
objects, at the cost of a significant increase in computation time. The experiments
done with the mentioned examples always ended in feasible partition solutions,
i.e., solutions that respect the target architecture constraints, which indicates that
the applied closeness and cost functions correctly control the partition process.

                      Convolution   Cryptography
Example dimension
  No. partitions      5             5
  No. objects         217           372
Parameter
  No. iterations      43400         74400
  nBest               300           400
  pMoves              20%           20%
  nRand               4             4
  TTmove              20            25
  TTiMove             18            22
  TTobj               15            20

nBest - Number of iterations since the best partition solution was found.
pMoves - Percentage of objects to be moved when the initial solution of
a new search is created.
nRand - Number of searches without improving the best partition solution
in order to execute “random” moves when creating the initial solution
of the next search.
TTmove - Moves tabu tenure.
TTiMove - Inverse moves tabu tenure.
TTobj - Objects tabu tenure.
Table 1: Parameters used on the partition process with the tabu search algorithm.
The accuracy and fidelity of the estimates for the performance and for the area
occupied in hardware were also evaluated. The accuracy of the system performance
estimates ranged from 82 to 98%, being higher on the cryptography example due to
its lower complexity. A fidelity ranging from 83 to 100%, almost coincident with
the accuracy range, suggests that the computed estimates are reliable. The accuracy
of the estimates for the area occupied by the data path of the hardware partitions
was 92 to 99%, being similar on both examples. The accuracy of the estimates for
the area occupied by the control unit of the hardware partitions ranged from 89 to
96%, with very close results on both examples. The obtained results show that the
control unit area depends mainly on the state register area, which in turn is
proportional to the number of states. For the whole set of metrics and examples, the
accuracy and fidelity of the estimates were always at least 82%, an encouraging
result. The results obtained with the partition process are summarised in table 2.
When it was decided to compute accurate estimates, the performance of the
partition tool was penalized. One way of improving the tool performance is to
optimize the estimation of the system execution time. The time
complexity O(nObj), expected for the tabu search algorithm, was confirmed
experimentally. Since the computation time varies linearly with the number of
objects in the system description, the time required to find the best partition
solution is high on large systems. However, in the majority of cases, the first
searches of the partition process already generate a good-quality solution.
[Figure 6 diagram: the PSM variables (doutR1 ... doutR16, keyOutR1 ... keyOutR16,
key1 ... key16, dataSbox1 ... dataSbox16, din, dout, keyIn, ...) in the upper part
and the program-states (desIp, desPC1, desEp1 ... desEp16, desSbox1 ... desSbox16,
desPbox1 ... desPbox16, desFP, ...) below, with the partition labels SW, HW1, HW2,
HW3 and HW4.]
Figure 6: PSMfg model illustrating the best partition solution from the TS
algorithm.
The support given by the partition methodology to the implementation of the
systems was also evaluated. The automatic synthesis of the interface between par-
titions is a straightforward implementation that uses the data from the estimation
of the area occupied by the resources of the interface between partitions and the
communication time between partitions.
Metric                                      Convolution (%)   Cryptography (%)
automatic vs manual solution performance    72                80-92
accuracy of performance estimates           82-83             97-98
fidelity of performance estimates           83                100
accuracy of areaHW(DP) estimates            98                92-99
accuracy of areaHW(CU) estimates            91                89-96
Table 2: Results obtained with the partition process.
4 Conclusions
The cluster growth constructive algorithm follows a straight optimization heuristic,
which proved to be able to generate solutions with quality, when guided by an
adequate closeness function.
The results from the performed experiments with the tabu search (TS) algorithm
recommend that the objects and moves tabu tenures should be 5 to 10% of the
number of objects in the system description. To decrease the computation time while
keeping the capacity to generate solutions with quality, the implemented TS
algorithm only searches a partial neighbourhood, has a richer set of evolution
strategies, applies a more efficient improvement and includes a richer set of
diversification/intensification elements. To avoid reducing the capacity of converging
to the optimum solution, the history of moves and moved objects must be reset by
TS at the beginning of each search. A subset of tabu classifications was selected,
which decreases the computation time, helps to avoid cycles and does not place
excessive restrictions on the search. A neighbourhood with a simple structure also
helps to decrease the computation time. The goal of the cost function applied by
TS is to achieve the best partition solution with the available resources.
To generate accurate estimates, while keeping the computation time as low as
possible, the implemented estimation methodology (i) uses detailed models for the
hardware resources, (ii) runs in two abstraction levels and (iii) uses incremental
updating.
The obtained results show that the best automatic solution from the TS
algorithm achieves 72 to 92% of the manual partition solution performance. This is an
interesting result, limited by (i) the optimizations introduced on the manual solution
implementation, (ii) the simple software estimation model and (iii) the fine
granularity used with the objects. The experiments always ended in feasible
partition solutions, which indicates that the partition process is adequately controlled
by the evaluation functions.
The accuracy of the estimates for the performance, the data path area and the
control unit area was, respectively, 82 to 98%, 92 to 99% and 89 to 96%. The
accuracy of the estimates obtained with both examples, DES and convolution,
was very close. This consistency in the accuracy suggests a reliable estimation.
For all metrics and examples, the accuracy and fidelity of the estimates were
always at least 82%, an interesting result that in many cases surpasses the published
results.
The time complexity O(n), foreseen for the implemented TS algorithm, was
confirmed in the experiments performed with parTiTool. The time necessary to
compute the best partition solution is high, but in most cases 10% of this time is
sufficient to find a solution that achieves a performance close to 90% of the best
solution.
The estimated data for the interface resources and the communication time
simplify the automatic synthesis of interfaces.
Some directions are being considered for future work: (i) evaluation of the
methodology with more and differentiated case studies, namely more complex and
control dominated systems; (ii) integration of the methodology into a
broader one, used to develop concurrent systems that are implemented on
a parallel, distributed and heterogeneous architecture; (iii) implementation of other
iterative algorithms – beyond TS and SA – where different optimization strategies
may lead to better results with some examples, to increase the partition success;
(iv) optimization of the system performance estimation, to improve the
performance of the partition methodology, which is strongly dependent on the time
needed to estimate this metric.
References
[1] António Esteves. EDgAR-2: Highly Re-configurable Digital Emulator. Technical
Report UMDITR9805, Dep. Informática, Universidade do Minho, Braga, Portugal,
December 1998.
[2] António Esteves. A Partition Methodology for Digital Embedded Systems Codesign
(in portuguese). PhD thesis, Dep. Informática, Universidade do Minho, Braga, Por-
tugal, July 2001.
[3] Asawaree Kalavade and Edward Lee. A Global Criticality/Local Phase Driven Al-
gorithm for the Hardware/Software Partitioning Problem. In Proceedings of the 3rd
International Workshop on Hardware/Software Codesign, pages 42–48. IEEE Com-
puter Society Press, September 1994. Grenoble, France.
[4] J.M. Fernandes, R.J. Machado, and H.D. Santos. Modeling Industrial Embedded
Systems with UML. In Proceedings of the 8th ACM/IEEE/IFIP Int. Workshop on
Hardware/Software Codesign (CODES’2000), pages 18–22. ACM Press, May 2000.
[5] P. Zave. The Operational versus the Conventional Approach to Software Develop-
ment. Communications of the ACM, 27(2):104–118, February 1984.
[6] Derrick Morris, Gareth Evans, Peter Green, and Colin Theaker. Object Oriented
Computer Systems Engineering. Springer-Verlag, Applied Computing Series, 1996.
[7] Daniel Gajski, Frank Vahid, and Sanjiv Narayan. A System-Design Methodology:
Executable-Specification Refinement. In Proceedings of the European Conference
on Design Automation, 1994.
[8] Ralf Niemann. Hardware/Software Co-design for Data Flow Dominated Embedded
Systems. Kluwer Academic Publishers. Boston, USA, 1998.
[9] G. Borriello, P. Chou, and R. Ortega. Embedded System Co-design: Towards
Portability and Rapid Integration, pages 243–264. Hardware/Software Codesign, Ed. M.
Sami and G. De Micheli. Kluwer Academic Publishers, Boston, USA, 1995.
[10] T. B. Ismail and A. A. Jerraya. Synthesis Steps and Design Models for Codesign.
IEEE Computer, pages 44–52, 1995.
[11] K. Van Rompaey, D. Verkest, I. Bolsens, and H. De Man. CoWare - A Design En-
vironment for Heterogeneous Hardware/Software Systems. In Proceedings of the
European Design Automation Conference (EURO-DAC), 1996.
[12] M. Chiodo, D. Engels, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, K. Suzuki,
and A. Sangiovanni-Vincentelli. A Case Study in Computer-Aided Co-design of
Embedded Controllers. Design Automation for Embedded Systems, 1(1-2):51–67,
1996.
[13] Luís P. Santos and Alberto Proença. A Systematic Approach to Effective Scheduling
in Distributed Systems. In Proceedings of the 5th Int. Meeting on High Performance
Computing for Computational Science (VECPAR’02), Porto, Portugal, June 2002.
[14] Anton V. Chichkov and Carlos B. Almeida. An Hardware/Software Partitioning Al-
gorithm for Custom Computing Machines. In Field Programmable Logic and Ap-
plications - Proceedings of the 7th International Workshop FPL’97, pages 274–283,
September 1997.
[15] P. Middelhoek, G. Mekenkamp, E. Molenkamp, and Th. Krol. VHDL and CDFG
Based Transformational Design: a Case Study. In Proceedings of the ProRISC/IEEE
Workshop on CSSP, pages 203–212, March 1995.
[16] Rajesh K. Gupta and Giovanni De Micheli. Constrained Software Generation for
Hardware-Software Systems. In Proceedings of the 3rd International Workshop on
Hardware/Software Codesign, pages 56–63. IEEE Computer Society Press, Septem-
ber 1994.
[17] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, and A. Sangiovanni-
Vincentelli. A Formal Methodology for Hardware/Software Co-design of Embedded
Systems. IEEE Micro, August 1994.
[18] Petru Eles, Zebo Peng, and Alexa Doboli. VHDL System-Level Specification and
Partitioning in a Hardware/Software Co-Synthesis Environment. In Proceedings of
the 3rd International Workshop on Hardware/Software Codesign, pages 49–55. IEEE
Computer Society Press, September 1994.
[19] Edna Barros and Augusto Sampaio. Towards Provably Correct Hardware/Software
Partitioning using Occam. In Proceedings of the 3rd International Workshop
on Hardware/Software Codesign, pages 210–217. IEEE Computer Society Press,
September 1994. Grenoble, France.
[20] Rolf Ernst, Jörg Henkel, and Thomas Benner. Hardware-Software Cosynthesis for
Microcontrollers. IEEE Design & Test of Computers, 10(4):64–75, December 1993.
[21] Kurt Mehlhorn and Stefan Näher. The LEDA Platform of Combinatorial and Geo-
metric Computing. Cambridge University Press, 1999.
[22] Petru Eles, Krzysztof Kuchcinski, Zebo Peng, Alexa Doboli, and Paul Pop.
Process Scheduling for Performance Estimation and Synthesis of Hardware/Software
Systems. In Proceedings of the 24th EUROMICRO Conference, 1998.
[23] Peter Knudsen and Jan Madsen. PACE: A Dynamic Programming Algorithm for
Hardware/Software Partitioning. In Proceedings of the 4th International Workshop
on Hardware/Software Codesign, March 1996.
[24] L. Ferrandi, D. Sciuto, and M. Vincenzi. TOSCA User’s Manual, Version 2.0,
September 1997. http://www.cefriel.it/eda/projects/seed/um/-
mainmenuum.htm.
[25] Petru Eles, Zebo Peng, Krzysztof Kuchcinski, and Alexa Doboli. System Level
Hardware/Software Partitioning Based on Simulated Annealing and Tabu Search.
Design Automation for Embedded Systems, 2(1):5–32, 1997.
[26] Frank Vahid. Modifying Min-Cut for Hardware and Software Functional Partitioning.
In Proceedings of the 5th International Workshop on Hardware/Software Codesign,
pages 43–48, March 1997.
[27] Frank Vahid and Thuy Dm Le. Extending the Kernighan/Lin Heuristic for Hardware
and Software Functional Partitioning. Design Automation for Embedded Systems,
2(2):237–261, 1997.
[28] Fred Glover and Manuel Laguna. Tabu Search, pages 70–150. Modern Heuristic
Techniques for Combinatorial Problems, Ed. Colin Reeves. McGraw-Hill Inc., 1995.
[29] Petru Eles, Krzysztof Kuchcinski, and Zebo Peng. System Synthesis with VHDL.
Kluwer Academic Publishers, 1998.
[30] Ricardo Machado, João Fernandes, António Esteves, and Henrique Santos. Ch.11
An Evolutionary Approach to the Use of Petri Net based Models: from Parallel
Controllers to HW/SW Co-Design, pages 205–222. Hardware Design and Petri Nets,
Ed. Alex Yakovlev, Luis Gomes and Luciano Lavagno. Kluwer Academic Publishers,
Boston, USA, 2000.
[31] Daniel Gajski, Frank Vahid, Sanjiv Narayan, and Jie Gong. Specification and Design
of Embedded Systems. Prentice-Hall, 1994.
[32] Daniel Gajski, G. Marchioro, and J. Zhu. Essential Issues in Codesign, pages 1–45.
Kluwer Academic Publishers, 1997.
[33] António Esteves and Alberto Proença. A hardware/software partition methodology
targeted to an FPGA/CPLD architecture. Submitted to International Conference
on Field-Programmable Logic and its Applications (FPL 2004), Antwerp, Belgium,
March 2004.
