Graph based communication analysis for hardware/software codesign by Knudsen, Peter Voigt & Madsen, Jan
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Graph based communication analysis for hardware/software codesign
Knudsen, Peter Voigt; Madsen, Jan
Published in:
Proceedings of the Seventh International Workshop on Hardware/Software Codesign, 1999. (CODES '99)
Link to article, DOI:
10.1109/HSC.1999.777407
Publication date:
1999
Document Version
Publisher's PDF, also known as Version of record
Citation (APA):
Knudsen, P. V., & Madsen, J. (1999). Graph based communication analysis for hardware/software codesign. In
Proceedings of the Seventh International Workshop on Hardware/Software Codesign, 1999. (CODES '99) (pp.
131-135). New York: IEEE. DOI: 10.1109/HSC.1999.777407
Graph Based Communication Analysis for Hardware/Software Codesign
Peter Voigt Knudsen and Jan Madsen
Department of Information Technology, Technical University of Denmark
pvk@it.dtu.dk, jan@it.dtu.dk
Abstract 
In this paper we present a coarse grain CDFG (Control/Data Flow Graph)
model suitable for hardware/software partitioning of single processes
and demonstrate that various transformations must be performed on the
graph structure before partitioning in order to achieve a structure
that allows for accurate estimation of communication overhead between
nodes mapped to different processors. In particular, we demonstrate
how various transformations of control structures can lead to a more
accurate communication analysis and more efficient implementations.
The purpose of the transformations is to obtain a CDFG structure that
is sufficiently fine grained to support a correct communication
analysis but not more fine grained than necessary, as this would
increase partitioning and analysis time.
1 Introduction 
In this paper we focus on communication analysis for hardware/software
partitioning of control-intensive applications that are specified
using hierarchy, functions, conditionals and loops. In particular, we
focus on the structures that implement control, i.e. conditionals and
loops. These structures are used to direct the flow of data between
functional elements according to the values of test variables. As
communication overhead is an important factor to consider during
hardware/software partitioning [4][5], the mapping of these structures
is thus important to analyze and optimize. The presented CDFG model
supports the exploration of various implementation alternatives for
these structures through conditional and loop transformations which
will be demonstrated in the following. Furthermore, it supports
communication analysis for cross hierarchy communication through
hierarchical expansion and for function calls through virtual function
expansion. Virtual function expansion is only described briefly in
this paper. The purpose of the transformations is to obtain a CDFG
structure that is sufficiently fine grained to support a correct
communication analysis but not more fine grained than necessary, as
this would increase partitioning and analysis time.
Name           Alias  Comment
PURE_DFG
FULL_LOOP*     FL     A whole loop node
LOOP_BODY*     LB     Loop body node
LOOP_ENTRY     LE     Loop entry node
LOOP_EXIT      LX     Loop exit node
FULL_BRANCH*   FB     A full branch node
BRANCH_BODY1*  B1     First branch body
BRANCH_BODY2*  B2     Second branch body
BRANCH_SPLIT   BS     Branch variable split node
BRANCH_MERGE   BM     Branch variable merge node
REPEATER       R      Repeater node
HIER_IN        Hi     Hierarchy input interface node
HIER_OUT       Ho     Hierarchy output interface node
FU_CALL*       F      Function call node
FU_IN          Fi     Function input interface node
FU_OUT         Fo     Function output interface node
NOP            N      NOP (variable duplicator) node
VOID                  Void node (variable sink)
Table 1: Elements of NodeType. Hierarchical nodes are marked with an
asterisk (*).
2 CDFG model 
This section defines the CDFG model which is used to describe the
functionality of a single process. It includes structures for basic
arithmetic and logical operations, hierarchy, conditionals, loops and
functions and is as such sufficiently expressive to represent
universal computation power [2].
The CDFG can be denoted a high level CDFG because nodes represent high
level functions rather than simple operations, either in the form of
function calls or in the form of data flow graphs (DFGs) containing
simple arithmetic and logical operations and no control, and because
edges represent variable sets that are communicated between the high
level functions rather than single variables.
Nodes can have different types, as defined in table 1. The alias
column defines short forms of the type names that will be used in
figures. Nodes are records that contain a number of parameters, as
defined in table 2.
Edges are also records and contain the parameters defined in table 3.
Edges can be either data or control edges, as distinguished by the
EdgeType parameter.
The usage and meaning of the various node/edge types and fields will
be defined as they are used in the following sections¹.
¹ Only the parameters that are relevant for this paper are shown in
the table. The type (Variable → Variable) denotes a map (sometimes
called dictionary) that maps variables to variables.
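The node and edge records described above can be sketched as plain data classes. This is only an illustrative sketch, not the authors' data structure: the field names tvar, tpol, bmap1, bmap2, rmap and snk appear in the paper, whereas name, rset, wset, src and variables are names assumed here for readability.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set


class EdgeType(Enum):
    DATA = "DATA"
    CONTROL = "CONTROL"


@dataclass
class Node:
    name: str                       # assumed identifier, e.g. "BS"
    type: str                       # a NodeType name from table 1
    rset: Set[str] = field(default_factory=set)   # variables read (name assumed)
    wset: Set[str] = field(default_factory=set)   # variables written (name assumed)
    tvar: Optional[str] = None      # test variable of branch, loop and repeater nodes
    tpol: Optional[bool] = None     # test variable polarity
    bmap1: Dict[str, str] = field(default_factory=dict)  # mapping for branch body 1
    bmap2: Dict[str, str] = field(default_factory=dict)  # mapping for branch body 2
    rmap: Dict[str, str] = field(default_factory=dict)   # repeater mapping


@dataclass
class Edge:
    src: Node                       # node that feeds the edge (name assumed)
    snk: Node                       # node that is fed by the edge
    variables: Set[str] = field(default_factory=set)  # variable set on a data edge
    type: EdgeType = EdgeType.DATA


# A branch split node feeding the first branch body with the variable x1.
bs = Node("BS", "BRANCH_SPLIT", rset={"x", "t"}, wset={"x1", "x2"},
          tvar="t", tpol=True, bmap1={"x": "x1"}, bmap2={"x": "x2"})
b1 = Node("B1", "BRANCH_BODY1", rset={"x1"})
edge = Edge(src=bs, snk=b1, variables={"x1"})
```

Keeping a whole variable set on each edge, rather than one edge per variable, mirrors the coarse grain nature of the model.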
Name         Type                 Comment
[illegible]  NodeType             The type of a node.
[illegible]  Variable-set         The set of variables read by a node.
wset         Variable-set         The set of variables written by a node.
[illegible]  [illegible]          The CDFG this node is a part of.
[illegible]  [illegible]          The sub-CDFG of a hierarchical node.
subdfg       DFG                  The dataflow graph of a DFG node.
tvar         Variable             The branch, loop and repeater node test variable.
tpol         Bool                 The test variable polarity.
bmap1        Variable → Variable  Branch variable mapping for branch body 1.
bmap2        Variable → Variable  Branch variable mapping for branch body 2.
[illegible]  Variable → Variable  Loop entry node variable mapping.
[illegible]  Variable → Variable  Loop exit node variable mapping.
rmap         Variable → Variable  Repeater node variable mapping.
Table 2: Node parameters.
[...] in [1] uses one control node for each variable in the system,
leading to a very large number of control nodes to consider for
partitioning. While this fine grain graph format allows for maximum
flexibility with respect to partitioning control structures, it also
complicates the graph and therefore increases analysis and
partitioning time. Our graph format allows for exploring the whole
range from using just two large control nodes for each control
construct to using control nodes for each variable. In the following
sections we demonstrate how graphs with large control nodes can be
transformed into sufficiently fine grained structures that allow for
better optimization of communication. These transformations improve
both efficiency of the final implementation and accuracy and
efficiency of analysis. It is important to note that, while the
transformations allow for exploring different implementation
alternatives for loops and conditionals, they should only be performed
to the extent that the synthesis tools are able to produce similar
implementations. If, for instance, the hardware synthesis tool can
only produce a coarse grain loop control implementation (i.e. using
one controller and single big multiplexers/demultiplexers), the loop
control nodes should not be transformed in the graph prior to doing
partitioning. The graph structure must reflect what is done in
synthesis, even if what is done is not efficient. For a further
discussion of the relation between the model domain and the
implementation domain, please refer to [4].
In the following, we first introduce a basic transformation called
hierarchical expansion which eases the analysis of cross hierarchy
communication and which is a prerequisite for performing the
subsequently presented conditional and loop transformations correctly.
3.1 Hierarchical expansion
Hierarchy is introduced by letting hierarchical nodes (those marked
with an asterisk in table 1) reference a CDFG. The node H in figure 1A
is such a hierarchical node. We use double circles in figures to
denote hierarchical nodes.
Name         Type          Comment
[illegible]  EdgeType      The type of an edge (DATA or CONTROL).
[illegible]  Node          The node that feeds an edge.
snk          Node          The node that is fed by an edge.
[illegible]  Variable-set  The set of variables transferred on a data edge.
Table 3: Edge parameters.
Figure 1: Structure of a CDFG hierarchy. A) Atomic (node) view.
B) Expanded view.
All subgraphs of hierarchical nodes are polar graphs as shown in
figure 1B. D4 and D5 are DFGs and FB is a full branch node whose
sub-CDFG is not shown. We see that the hierarchical node H in figure
1A is fed by three nodes and feeds two nodes itself. The hierarchy
CDFG of a hierarchical node always contains a hierarchy input node Hi
and a hierarchy output node Ho. These nodes act as an interface to the
hierarchy and as placeholders for the variables that go in and out of
the hierarchy². The write set of the Hi node equals the set of
variables that are read from outer hierarchies. The read set of the Ho
node equals the set of variables that are written to outer
hierarchies. We assume that every variable that is produced in a CDFG
is unique with respect to its name throughout the whole CDFG, i.e.
throughout all hierarchy levels of the CDFG.
As mentioned in [7], one of the first steps in the codesign process is
to determine the granularity of the functional specification that
partitioning operates on. This can be done in a number of ways
[3][6][7], the simplest being hierarchical granularity selection [6]
where we for each hierarchical node determine whether it should be
regarded as a granule (i.e. an atomic function which is not split
across processors) or whether we should replace the hierarchical node
with the contents of the hierarchy and thus make the input
specification more fine grained. Our graph structure supports
communication analysis for both cases. If the hierarchical node H is
to be regarded as a granule itself, we simply use the input and output
hyper edges shown in figure 1A for communication analysis for a
particular processor mapping of the node H. If the contents of the
hierarchy are to be regarded as granules, we perform hierarchical
expansion in order to be able to perform a correct data dependency
analysis for a particular mapping of the nodes inside and outside of
the hierarchy to different processors. This is shown in figure 2 where
the hierarchical node H, corresponding to the hierarchy in figure 1B,
is expanded into its surrounding CDFG. The expansion is a one-level
expansion as the full branch within the CDFG of node H is not
expanded, but of course expansion can be multi level. Note that when
performing hierarchical expansion, the Hi and Ho nodes are eliminated
and hyper edges are regenerated so that we can analyze the true
dependencies between the nodes inside the hierarchy and the nodes
outside the hierarchy.
Figure 2: One-level expansion of a hierarchical node. A) CDFG prior to
expansion of H. B) CDFG after expansion.
In the example, D5 and D7 are placed in hardware while 
the rest of the nodes are placed in software. This expansion, 
for example, allows us to see that even though D7  reads three 
variables, {f,i,j}, it only needs to have two variables {f,i} 
transferred across the hardware/software boundary. 
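The analysis that expansion makes possible can be sketched as follows. Because variable names are unique, edges can be regenerated from read and write sets alone, and the variables crossing the boundary then follow from the processor mapping. The sketch is only illustrative: the figure is not reproduced here, so the producer of i (called D6) and the assumption that D5 produces j are hypothetical; only D4 producing f and the placement of D5 and D7 in hardware are taken from the text.

```python
def regenerate_edges(rset, wset):
    """Build one data edge per (producer, consumer) pair from read/write sets."""
    producer = {v: node for node, written in wset.items() for v in written}
    edges = {}
    for snk, reads in rset.items():
        for v in reads:
            src = producer.get(v)
            if src is not None and src != snk:
                edges.setdefault((src, snk), set()).add(v)
    return edges


def crossing_vars(edges, mapping, snk):
    """Variables that must cross a processor boundary to reach node snk."""
    crossing = set()
    for (src, dst), variables in edges.items():
        if dst == snk and mapping[src] != mapping[dst]:
            crossing |= variables
    return crossing


# Hypothetical fragment of the expanded graph (D6 and the producer of j are assumed).
wset = {"D4": {"f"}, "D5": {"j"}, "D6": {"i"}}
rset = {"D5": {"f"}, "D7": {"f", "i", "j"}}
mapping = {"D4": "SW", "D5": "HW", "D6": "SW", "D7": "HW"}

edges = regenerate_edges(rset, wset)
print(sorted(crossing_vars(edges, mapping, "D7")))  # ['f', 'i']
```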
Note that it is legal for the same variable to be present on several
edges when more than one node reads the variable, as is the case for
the variables b and f in the figure. When several nodes that read such
a shared variable are mapped to another processor than the producing
node is mapped to, there are several possibilities for scheduling the
corresponding edges. If dynamic memory storage on the receiving
processor allows it, the variable needs only be transferred once, for
the first scheduled node (D5, for the variable f). For subsequent
edges that contain the variable (the one from D4 to D7 for f), such an
already transferred variable can be removed from the variable set of
each edge, which decreases the communication time of the edges and
possibly allows subsequent nodes (D7 for f) to be scheduled earlier.
If memory storage on the receiving processor is limited and memory
storage on the transmitting processor allows it, the variable can be
stored temporarily on the transmitting processor, retransmitted each
time it is needed by a receiving node and freed when the last
receiving node has been scheduled. Determining the optimal time/space
mapping of shared variables can be done by introducing variable
duplicator nodes whose mapping and scheduling in effect determine in
which time slots the variables are stored on which processors. This is
left to future work.
² This makes the hierarchy graphs polar and corresponds to the
implementation of hierarchy in the flow graph model defined in [2].
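The first strategy (transfer a shared variable once and keep it on the receiving processor) might be sketched as below. This is an illustrative sketch under the assumption that edges are already given in schedule order and that the receiving processor has unlimited storage; the function name and the tuple layout are not taken from the paper.

```python
def transfers_once(scheduled_edges, mapping):
    """For each edge, in schedule order, return the variables actually sent.

    scheduled_edges: list of (src, snk, variables) tuples.
    mapping: node name -> processor name.
    """
    stored = {}   # processor -> variables already received and kept there
    result = []
    for src, snk, variables in scheduled_edges:
        dst = mapping[snk]
        if mapping[src] == dst:
            result.append(set())      # same processor, nothing crosses the boundary
            continue
        have = stored.setdefault(dst, set())
        send = set(variables) - have  # drop variables that were already transferred
        have |= send
        result.append(send)
    return result


mapping = {"D4": "SW", "D5": "HW", "D7": "HW"}
edges = [("D4", "D5", {"f"}), ("D4", "D7", {"f"})]
print(transfers_once(edges, mapping))  # [{'f'}, set()]
```

The second edge ends up with an empty transfer set, which is what allows D7 to be scheduled earlier in the scenario described above.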
3.2 Branches 
Branches or conditional structures are introduced by using full
branch, branch body 1, branch body 2, branch split and branch merge
nodes. A full branch hierarchical node is used to encapsulate the
whole branch. The basic structure of a conditional is shown in
figure 3.
Figure 3: Basic structure of a full branch. A) Node view. B) Expanded
view.
The BS node is a branch split node that duplicates its input variables
and sends them to either B1 or B2, depending on the value of the test
variable (t in the figure). The BM node is a branch merge node that
selects the output variables from either B1 or B2, also depending on
the value of the test variable, and outputs the corresponding branch
output variables. The test variable is identified by the tvar field of
the BS and BM nodes. A test polarity parameter (tpol) of the BS and BM
nodes specifies which of the branches is taken if the test variable is
true. If the test polarity is true, B1 is taken, otherwise B2. In
order to keep track of how input variables map to output variables of
the BS and BM nodes, we use the bmap1 and bmap2 variable maps which
define the mappings for B1 and B2, respectively. In the examples we
have used the intuitive mapping that a variable named x outside of a
branch maps to the variable x1 in B1 and to x2 in B2.
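The routing performed by the BS and BM nodes can be illustrated with a small sketch. It assumes that B1 is taken exactly when the value of the test variable equals tpol, and that each map is keyed by the variable name on the input side of the node (so the BM maps go from y1 or y2 to y); the function name select and the branch body computation are hypothetical.

```python
def select(values, t, tpol, bmap1, bmap2):
    """Route variables through a BS or BM node.

    Returns the taken branch and the renamed variables for that branch.
    """
    taken = "B1" if t == tpol else "B2"
    bmap = bmap1 if taken == "B1" else bmap2
    return taken, {bmap[v]: val for v, val in values.items() if v in bmap}


# Branch split: x outside the branch maps to x1 in B1 and x2 in B2.
taken, body_in = select({"x": 5}, t=True, tpol=True,
                        bmap1={"x": "x1"}, bmap2={"x": "x2"})

# Hypothetical branch bodies: B1 computes y1 = x1 + 1, B2 computes y2 = x2 - 1.
if taken == "B1":
    body_out = {"y1": body_in["x1"] + 1}
else:
    body_out = {"y2": body_in["x2"] - 1}

# Branch merge: y1 or y2 inside the branch maps back to y outside.
_, merged = select(body_out, t=True, tpol=True,
                   bmap1={"y1": "y"}, bmap2={"y2": "y"})
print(taken, body_in, merged)  # B1 {'x1': 5} {'y': 6}
```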
3.2.1 Transformation for unshared variables 
In figure 4A we see that the (copies of the) variables {a,b} are used
solely by B1 and {d,e} solely by B2³. If the branch is implemented
using only a single BS node, such variables must be led through the BS
node which may be very inefficient, depending on the mapping of the BS
node. Figure 4B shows a transformation that allows such variables to
be communicated directly from the producing node to the branch they
are used in.
Figure 4: Transformation for unshared variables. A) Original branch
structure. B) Transformed branch structure.
Here we have expanded the branch into a surrounding hierarchy where D1
supplies the {a,b} variables, D2 the {c} variable and D3 the {d,e}
variables. The branch test variable is disregarded in the rest of this
section. In figure 4A, we have assumed that the branch has been
constructed in such a way that all variables read within the branch
are led through the branch split node. In figure 4B, a repeater node
is added for each of the source nodes of the branch split node that
produces variables that are only read by one of the branches. These
repeater nodes are called R1 and R2 in the figure. A repeater node
copies its input variables to its output variables (according to the
rmap variable map) if the value of the repeater test variable (tvar)
is equal to the value of its polarity field (tpol). Otherwise it
absorbs its input variables. Repeater nodes for B1 must have the same
polarity as the branch split node and repeater nodes for B2 must have
opposite polarity⁴.
Assume that we know that the left branch B1 is taken so that the BS
node does not communicate variables to B2. In the un-transformed case
in figure 4A, communication analysis shows that six variables cross
the hardware/software boundary because it is not recognized that {a,b}
can be communicated directly from D1 to B1. In the transformed case in
figure 4B, only two variables cross the hardware/software boundary.
We find that a similar transformation is not needed for 
the branch merge node because the two branches produce 
equivalent sets of output variables. 
Note that the B1 and B2 nodes are regarded as granules in this
example. If granularity selection has determined that they should be
expanded, this expansion must be performed before the branch
optimization so that repeater nodes are generated with respect to the
nodes inside the branch hierarchies. In general, hierarchical
expansion must be performed before transformation.
³ The unused variables {d1,e1,a2,b2} are assumed to be absorbed within
the BS node. Indeed, assuring this for all unused variables is another
transformation that we perform but which is not shown here.
⁴ Note that the token flow semantics of the CDFG mean that we cannot
use a simple hyper edge instead of a repeater node. We should only
direct variables (tokens) to the active branch, and, for this, a
repeater node is needed.
3.2.2 Transformation for shared variables 
This section describes a transformation for those variables 
that are read by (and produced by) both branches, like c in 
figure 4. 
Figure 5: Transformation for shared variables. A) Original branch
structure. B) Transformed branch structure.
Consider the branch structure in figure 5. Here the variables
{a,b,c,d} are read by both branches. With the given structure, it is
not recognized that the (copies of the) variables {a,b} can be led
directly from D1 to B1 and that the (copies of the) variables {c,d}
can be led directly from D2 to B2. If we assume again that the left
branch B1 is taken, we see for the structure in figure 5A that 9
variables must be moved across the hardware/software boundary. In
figure 5B, the BS and BM nodes have been split and communication
analysis now shows that only three variables {c1,d1,g1} have to be
moved across the boundary. Notice how the f and g output variables are
now led directly to D3 and D4.
Splitting of the BS node must be performed for each of 
its source nodes that produces at least one variable that is 
read by both branches. Such a source node may also produce 
variables that are only read by one of the branches. Such 
variables are still transferred to the original branch split node 
or to a repeater node, as described in section 3.2.1. 
Splitting of the BM node is currently performed for each of its sink
nodes. If, however, several sink nodes share variables in their read
sets, this leads to several branch merge nodes that produce the same
variable. Either one of these branch merge nodes must be selected as
the sole producer of such a variable, or the produced variables must
be renamed, as we do not support two nodes producing the same
variable. We use the latter strategy.
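Splitting a BM node per sink with renaming of shared outputs can be sketched as follows. The renaming scheme (appending the sink name) and the function name are assumptions made for illustration; the paper does not state how renamed variables are formed.

```python
def split_merge_node(bm_outputs, sink_reads):
    """Return, per sink node, the output map of the BM node generated for it.

    Variables read by several sinks are renamed per sink so that no two
    generated BM nodes produce the same variable.
    """
    readers = {}
    for reads in sink_reads.values():
        for v in reads & bm_outputs:
            readers[v] = readers.get(v, 0) + 1

    split = {}
    for sink, reads in sink_reads.items():
        split[sink] = {
            v: (f"{v}_{sink}" if readers[v] > 1 else v)  # assumed naming scheme
            for v in sorted(reads & bm_outputs)
        }
    return split


# Unshared case: f goes to D3 and g to D4, so no renaming is needed.
print(split_merge_node({"f", "g"}, {"D3": {"f"}, "D4": {"g"}}))
# {'D3': {'f': 'f'}, 'D4': {'g': 'g'}}

# Shared case: both sinks read g, so each generated BM node gets its own copy.
shared = split_merge_node({"f", "g"}, {"D3": {"f", "g"}, "D4": {"g"}})
print(shared)
# {'D3': {'f': 'f', 'g': 'g_D3'}, 'D4': {'g': 'g_D4'}}
```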
3.3 Loops 
We use the structure shown in figure 6 to represent a full loop. LB is
the loop body that also produces the loop test variable t. The loop is
a REPEAT UNTIL loop that executes LB until the value of the test
variable t is false. LE is a multiplexer that initially, when t is
false, directs the input variables of the full loop, {a0,b0,c0}, to
LB. When t becomes true, it directs the output variables from the LX
node, {a3,b3,c3}, back into LB. A false token is assumed to have been
placed on the t edge of all LE nodes before execution of the graph so
as to ensure that the loops start when they receive their first input
variables. LX is also a multiplexer that directs its input variables
{a2,b2,c2} back to LE as long as t is true and out of the loop (to Ho
in the figure) when t becomes false.
Figure 6: Basic structure of a full loop. A) Node view. B) Expanded
view.
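The token flow through LE, LB and LX can be mimicked with a small simulation. This is only a sketch of the semantics described above: it ignores the renaming of variables between nodes (a0, a1, a2, a3), and the loop body used in the example is hypothetical.

```python
def run_full_loop(inputs, body):
    """Simulate LE -> LB -> LX until the test variable produced by LB is false.

    body(values) must return (new_values, t).
    """
    t = False            # the initial false token on the t edge of LE
    fed_back = None
    iterations = 0
    while True:
        lb_in = fed_back if t else inputs   # LE: outside inputs first, feedback later
        lb_out, t = body(lb_in)             # LB: compute values and the test variable
        iterations += 1
        if t:
            fed_back = lb_out               # LX: route values back to LE
        else:
            return lb_out, iterations       # LX: route values out of the loop


def count_to_three(values):
    """Hypothetical loop body: increment a and continue while a < 3."""
    a = values["a"] + 1
    return {"a": a}, a < 3


print(run_full_loop({"a": 0}, count_to_three))  # ({'a': 3}, 3)
```

As in a REPEAT UNTIL loop, the body always executes at least once because the initial token on the t edge is false.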
We perform the single LE/LX node split transformation shown in
figure 7 in order to obtain a loop structure that allows us to analyze
communication between nodes within the loop more accurately. This
transformation is performed with respect to the nodes within the loop
as these nodes may communicate a large number of times with the LE/LX
nodes while nodes outside of the loop only communicate one time with
the LE/LX nodes. The splitting is performed by producing one LE node
for each of the sink nodes of the original LE node and one LX node for
each of the source nodes of the original LX node. It may be the case
that several nodes within the loop read the same variable from the
original LE node, thus causing several LE nodes that produce the same
variable to be generated. This is currently handled the same way as
described in section 3.2.2, i.e. by variable renaming.
Figure 7: LE/LX node split transformation. A) Initial loop structure.
B) Resulting loop structure.
Figure 7B shows the resulting loop structure in which it is apparent
that only t needs to be transferred across the hardware/software
boundary for the given mapping. In figure 7A, five variables must be
transferred between hardware and software for each loop iteration.
3.4 Transformation of the full graph 
In order to obtain the full CDFG structure on which par- 
titioning and analysis is to be performed, we first perform 
a recursive hierarchical expansion of all hierarchical nodes 
that should be expanded according to granularity selection. 
This expansion includes a CDFG wide regeneration of hyper 
edges. Thereafter, the branch and loop transformations de- 
scribed in the previous sections are performed for each loop 
and branch structure. Furthermore, we perform so-called 
virtual expansion of functions where each function call is
fully expanded, i.e. (recursively) replaced with a copy of the 
function implementation CDFG. During this expansion, for- 
mal parameters of the function are recursively replaced with 
actual parameters (yielding new names for variables on in- 
put and output edges of the function graph) and internal edge 
names of the CDFG made unique (as to avoid collision with 
other virtually expanded instances of the same function), so 
that a correct data dependency analysis can be performed 
with respect to nodes that feed the function call and nodes 
within the function. Function expansion is denoted virtual 
as it is only performed in order to analyze communication 
correctly, not for mapping nodes of functions to processors 
(i.e. we do not assume inlining of functions). Mapping of the 
nodes of a function graph is performed only once, and this 
mapping is retained for each of the nodes of each virtually 
expanded instance of the function. 
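A single level of virtual function expansion might look like the sketch below. It is an illustration under several assumptions: the function graph is given as a list of (source, sink, variables) edges, the "@instance" suffix used to make internal names unique is invented here, and recursion into nested calls is omitted.

```python
def expand_call(fn_edges, actuals, instance):
    """Copy a function graph for one call site.

    Formal parameters are replaced with actual parameters; all other
    variables get an instance suffix so that copies do not collide.
    Nodes are tagged with the instance but keep their original name.
    """
    def rename(v):
        return actuals.get(v, f"{v}@{instance}")

    return [((src, instance), (snk, instance), {rename(v) for v in variables})
            for src, snk, variables in fn_edges]


# Hypothetical function graph: formal input p, internal variable tmp, formal output r.
fn_edges = [("Fi", "D1", {"p"}), ("D1", "D2", {"tmp"}), ("D2", "Fo", {"r"})]
call1 = expand_call(fn_edges, {"p": "x", "r": "y"}, instance=1)
call2 = expand_call(fn_edges, {"p": "y", "r": "z"}, instance=2)

# The function graph is mapped once; every instance reuses that mapping.
fn_mapping = {"D1": "HW", "D2": "SW"}


def processor(node):
    name, _instance = node
    return fn_mapping[name]


print(call1[1][2], call2[1][2])                        # {'tmp@1'} {'tmp@2'}
print(processor(call1[1][0]), processor(call2[1][0]))  # HW HW
```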
4 Conclusion 
We have presented a coarse grain CDFG format that is useful for
performing hardware/software partitioning of control intensive
processes. We have shown that loop and conditional structures can be
specified at different levels of granularity and that it is important
to choose the right granularity in order to be able to perform a
correct communication analysis and an efficient exploration of
implementation alternatives for these structures. We have developed a
tool that can translate a VHDL process into this CDFG format and which
can perform the transformations described above. Future work includes
integrating this with hardware/software partitioning and communication
estimation in the LYCOS [6] co-synthesis system.
Acknowledgements 
This work is supported by the Danish National Center for IT Research
under grant no. CIT 149.
References 
[1] G. G. de Jong. Data flow graphs: system specification with the
most unrestricted semantics. In Proc. European DAC, pages 401 - 405,
1991.
[2] R. K. Gupta. Co-Synthesis of Hardware and Software for Digital
Embedded Systems. Kluwer Academic Publishers, 1995.
[3] J. Henkel and R. Ernst. A Hardware/Software Partitioner Using a
Dynamically Determined Granularity. In Proc. 34th DAC, pages 691 -
696, 1997.
[4] P. V. Knudsen and J. Madsen. Aspects of System Modelling in
Hardware/Software Partitioning. In Proc. 7th RSP Workshop, pages 18 -
23, 1996.
[5] P. V. Knudsen and J. Madsen. Integrating Communication Protocol
Selection with Partitioning in Hardware/Software Codesign. In Proc.
11th ISSS, pages 111 - 116, 1998.
[6] J. Madsen, J. Grode, P. V. Knudsen, M. E. Petersen, and A.
Haxthausen. LYCOS: the Lyngby Co-Synthesis System. Design Automation
for Embedded Systems, 2(2):195 - 235, 1997.
[7] F. Vahid. A Three-Step Approach to the Functional Partitioning of
Large Behavioral Processes. In Proc. 11th ISSS, pages 152 - 157,
1998.
