Template Generation - A Graph Profiling Algorithm by Guo, Y. & Smit, G.J.M.
PROCEEDINGS OF THE 4TH PROGRESS SYMPOSIUM ON EMBEDDED SYSTEMS
© PROGRESS/STW 2003, ISBN 90-73461-37-5 OCTOBER 22, 2003, NBC NIEUWEGEIN, NL
Template Generation – A Graph Profiling Algorithm
Yuanqing Guo, Gerard J.M. Smit
University of Twente, Department of Computer Science
P.O. Box 217, 7500AE Enschede, The Netherlands
Phone: +31 (0)53 4894178 Fax: +31 (0)53 4894590
E-mail: {yguo, smit}@cs.utwente.nl
Abstract—The availability of high-level design entry tooling is crucial for
the viability of any reconfigurable SoC architecture. This paper presents a
template generation algorithm. The objective of the template generation step
is to extract functionally equivalent structures, i.e. templates, from
a control data flow graph. By profiling the graph, the al-
gorithm generates all the possible templates and the corre-
sponding matches. Using unique serial numbers and circle
numbers, the algorithm can find all distinct templates with
multiple outputs. A new type of graph (hydragraph) that
can cope with multiple outputs is introduced. The generated
templates represented by the hydragraph are not limited in
shapes, i.e., we can find templates with multiple outputs or
multiple sinks.
Keywords—MONTIUM, Hydragraph.
I. INTRODUCTION
In the CHAMELEON/GECKO¹ project we are designing
a heterogeneous reconfigurable System-On-Chip (SoC)
[14]. This SoC contains a general-purpose processor
(ARM core), a bit-level reconfigurable part (FPGA) and
several word-level reconfigurable parts (MONTIUM tiles;
see Section II). We believe that in future 3G/4G terminals
heterogeneous reconfigurable architectures are needed.
The main reason is that the efficiency (in terms of perfor-
mance or energy) of the system can be improved signifi-
cantly by mapping application tasks (or kernels) onto the
most suitable processing entity.
Fig. 1. CHAMELEON heterogeneous SoC architecture
In [13], we put forward a 4-phase decomposition that
can be used to map processes, written in a high-level
language, to a MONTIUM tile. The suggested second phase,
clustering, is implemented by the template generation and
selection algorithms. In this paper, the focus is the tem-
plate generation algorithm.
II. TARGET ARCHITECTURE: MONTIUM
Fig. 2. MONTIUM processor tile
In this section we give a brief overview of the
MONTIUM architecture, because this architecture led to
the research questions and the algorithms presented in this
paper. Details of the MONTIUM architecture can be found
in [14]. Figure 2 depicts a single MONTIUM processor
tile. The hardware organisation within a tile is very reg-
ular and resembles a very long instruction word (VLIW)
architecture. The five identical arithmetic and logic units
(ALU1· · ·ALU5) in a tile can exploit spatial concurrency
to enhance performance. This parallelism demands a very
high memory bandwidth, which is obtained by having 10
local memories (M01· · ·M10) in parallel. The small local
memories are also motivated by the locality of reference
principle. The ALU input registers provide an even more
local level of storage. Locality of reference is one of the
guiding principles applied to obtain energy-efficiency in
the MONTIUM. A vertical segment that contains one ALU
together with its associated input register files, a part of
the interconnect and two local memories is called a pro-
cessing part (PP). The five processing parts together are
called the processing part array (PPA). A relatively sim-
ple sequencer controls the entire PPA. The communica-
tion and configuration unit (CCU) implements the inter-
face with the world outside the tile. The MONTIUM has
a datapath width of 16 bits and supports both integer and
fixed-point arithmetic. Each local SRAM is 16 bits wide
¹ This research is supported by the PROGram for Research on Embedded
Systems & Software (PROGRESS) of the Netherlands Organization for
Scientific Research NWO, the Dutch Ministry of Economic Affairs and
the technology foundation STW.
and has a depth of 512 positions, which adds up to a stor-
age capacity of 8 Kbit per local memory. A memory has
only a single address port that is used for both reading and
writing. A reconfigurable address generation unit (AGU)
accompanies each memory. The AGU contains an address
register that can be modified using base and modify regis-
ters.
It is also possible to use the memory as a lookup table
for complicated functions that cannot be calculated using
an ALU, such as sine or division (with one constant).
A memory can be used for both integer and fixed-point
lookups. The interconnect provides flexible routing within
a tile. The configuration of the interconnect can change
every clock cycle. There are ten global busses that are used for
inter-PPA communication. Note that the span of these
busses is only the PPA within a single tile. The CCU is
also connected to the global busses. The CCU uses the
global busses to access the local memories and to handle
data in streaming algorithms. Communication within a PP
uses the more energy-efficient local busses. A single ALU
has four 16-bit inputs. Each input has a private input regis-
ter file that can store up to four operands. The input regis-
ter file cannot be bypassed, i.e., an operand is always read
from an input register. Input registers can be written by
various sources via a flexible interconnect. An ALU has
two 16-bit outputs, which are connected to the interconnect.
The ALU is entirely combinatorial and consequently
there are no pipeline registers within the ALU. The
diagram of the MONTIUM ALU in Figure 3 identifies two
different levels in the ALU. Level 1 contains four function
units. A function unit implements the general arithmetic
and logic operations that are available in languages like C
(except multiplication and division). Level 2 contains the
MAC unit and is optimised for algorithms such as FFT and
FIR. Levels can be bypassed (in software) when they are
not needed.
Neighboring ALUs can also communicate directly on
level 2. The West-output of an ALU connects to the East-
input of the ALU neighboring on the left (the West-output
of the leftmost ALU is not connected and the East-input
of the rightmost ALU is always zero). The 32-bit wide
East-West connection makes it possible to accumulate the
MAC result of the right neighbor to the multiplier result
(note that this is also a MAC operation). This is partic-
ularly useful when performing a complex multiplication,
or when adding up many numbers (up to 20
in one clock cycle). The East-West connection does not
introduce a delay or pipeline, as it is not registered.
Fig. 3. MONTIUM ALU
III. RELATED WORK
Many related research efforts have been published
in the areas of high-level synthesis and FPGA logic synthesis.
In [3][5], a template library is assumed to be available
and the template matching is the focus of their work. How-
ever, this assumption is not always valid, and hence an au-
tomatic compiler must determine the possible templates by
itself before coming up with suitable matchings.
[1][11][12] give some methods to generate templates.
These approaches choose one node as an initial template
and subsequently add more operators to the template.
There is no restriction on the shape of the templates. The
drawback is that the generated templates are highly depen-
dent on the choice of the initial template. The heuristic
algorithm in [9] generates and maps templates simultane-
ously, but cannot avoid ill-fated decisions.
The algorithms in [2][4] provide all templates of a
CDFG. The complete set of tree templates and single-PO
(single principal output) templates are generated in [4] and
all the single-sink templates (possibly multiple outputs)
are found by the configuration profiling tool in [2]. The
central problem for template generation algorithms is how
to generate and enumerate all the (connected) subgraphs
of a CDFG. The methods employed in [4] and [2] can only
enumerate the subgraphs of specific shapes (tree shape,
single output or single sink) and as a result, templates with
multiple outputs or multiple sinks cannot be generated. In
the MONTIUM architecture, each ALU has two outputs (see Section II),
so the existing algorithms cannot be used.
As far as we know, no algorithm has been designed to
generate the complete set of templates without limitations
to the shapes.
IV. DEFINITIONS OF CDFG
For the purpose of formulating our problem in a mathematical
context, it is convenient to introduce a new type of
graph, called hydragraphs², to model our directed acyclic
CDFGs (CDFGs for short in this paper). This concept
should capture and represent the operations, the inputs and
outputs, as well as which inputs are used and which out-
puts are produced by the operations (and which outputs of
a certain operation serve as inputs for one or more further
operations).
A hydragraph G = (NG, PG, AG) consists of two finite
non-empty sets of nodes NG and ports PG and a set AG
of so-called hydra-arcs; a hydra-arc a = (ta,Ha) has one
tail ta ∈ NG ∪ PG and a non-empty set of heads Ha ⊂
NG∪PG. In our applications, NG represents the operations
of a CDFG, PG represents the inputs and outputs of the
CDFG, while the hydra-arc (ta,Ha) either reflects that an
input is used by an operation (if ta ∈ PG), or that an output
of the operation represented by ta ∈ NG is input of the
operations represented by Ha, or that this output is just an
output of the CDFG (if Ha contains a port of PG).
See the example in Fig. 4(a): The operation of each
node is a basic computation such as addition (in this case),
multiplication, or subtraction. Hydra-arcs are directed
from their tail to their heads. Because an operand might be
input for more than one operation, a hydra-arc is allowed
to have multiple heads although it always has only one tail.
The hydra-arc e7 in Fig. 4(a), for instance, has two heads,
w and v. The CDFG communicates with external systems
through its ports represented by small grey circles in Fig.
4(a).
A node subset S ⊂ NG generates a hydragraph in the
following natural way: For every v ∈ S consider the following
two types of hydra-arcs of G related to v:
- (tv,Hv), so hydra-arcs with tail v: if Hv ⊄ S, i.e., at
least one head lies outside S, we introduce a new port pv
and replace (tv,Hv) by (tv, (Hv ∩ S) ∪ {pv}); otherwise,
we keep (tv,Hv) as it is.
- (tu,Hu) with v ∈ Hu, so hydra-arcs for which
v is one of the heads: if tu ∉ S, we introduce a
new port t′u and replace (tu,Hu) by (t′u, Hu ∩ S);
otherwise we keep (tu,Hu) as it is.
Doing so for all hydra-arcs, e.g. starting from the sources
in S, we obtain a unique hydragraph which we will refer
to as the template generated by S in G. We denote it by
TG[S] and say that S is a match of the template TG[S].
In the sequel we will only consider connected templates
without always stating this explicitly. For convenience let
² These graphs are named after Hydra, a water-snake from Greek
mythology with many heads that grew again if cut off.
us call a template an i-template if the number of its nodes
is i. Similarly i-match and i-node subset are defined.
Fig. 4. An example. (a) A small CDFG with nodes x, y, u, v, w and hydra-arcs e1 to e11. (b) Two templates of the CDFG from Fig. 4(a): TG[{x}] and TG[{w,v}].
For example, in Fig. 4(b) we see two templates of the
CDFG from Fig. 4(a): the left one is generated by the set
{x}, the right one by {v,w}. Compared with the original
CDFG from Fig. 4(a), in the left one, the newly added port
is a head for hydra-arc e5, while in the right one the newly
added port is a tail for hydra-arc e7.
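The construction of TG[S] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the arc list below is one plausible reading of Fig. 4(a) (the figure itself is not reproduced here), the port names p1..p6, o1, o2 and the helper name template are hypothetical, and every head that lies outside S, whether node or port, is collapsed into a single new port as the definition prescribes.

```python
# Assumed reading of Fig. 4(a): arc name -> (tail, set of heads).
# Port names p1..p6 (inputs) and o1, o2 (outputs) are hypothetical.
ARCS = {
    "e1": ("p1", {"x"}), "e2": ("p2", {"x"}),
    "e3": ("p3", {"y"}), "e4": ("p4", {"y"}),
    "e5": ("x", {"u"}), "e6": ("y", {"u"}),
    "e7": ("u", {"w", "v"}),
    "e8": ("p5", {"w"}), "e9": ("p6", {"v"}),
    "e10": ("w", {"o1"}), "e11": ("v", {"o2"}),
}


def template(arcs, subset):
    """Return (nodes, new ports, arcs) of the template generated by subset."""
    S = frozenset(subset)
    new_ports, t_arcs = set(), {}
    for name, (tail, heads) in arcs.items():
        kept = frozenset(heads) & S          # heads that stay inside S
        if tail in S:
            if kept != frozenset(heads):     # some head lies outside S
                port = "out_" + name         # new port replaces those heads
                new_ports.add(port)
                kept = kept | {port}
            t_arcs[name] = (tail, kept)
        elif kept:                           # tail outside S, some head inside
            port = "in_" + name              # new port replaces the tail
            new_ports.add(port)
            t_arcs[name] = (port, kept)
        # arcs that do not touch S are not part of the template
    return S, new_ports, t_arcs


nodes_x, ports_x, arcs_x = template(ARCS, {"x"})
nodes_wv, ports_wv, arcs_wv = template(ARCS, {"w", "v"})
print(arcs_x["e5"])    # the new port appears as a head of e5
print(arcs_wv["e7"])   # the new port appears as the tail of e7
```

With this reading, the template of {x} keeps e1, e2 and e5, and the template of {w,v} keeps e7 to e11, which agrees with the two templates described for Fig. 4(b).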
Two hydragraphs G and F are said to be isomorphic if
there is a bijection φ : NG ∪ PG → NF ∪ PF such that:
φ(NG) = NF , φ(PG) = PF , and (tv,Hv) ∈ AG if and
only if (φ(tv), φ(Hv)) ∈ AF .
We use G ≅ F to denote that G and F are isomorphic.
We say that S′ ⊂ NG is a match for the template
TG[S] if TG[S′] ≅ TG[S]. A hydragraph H is a template
of the hydragraph G if, for some S ⊂ NG, TG[S] ≅ H.
Of course, the same template could have different matches
in G.
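For very small hydragraphs the isomorphism definition can be checked directly by trying every bijection. The sketch below is a brute-force illustration, not the method used by the authors; the function name isomorphic, the tuple representation (nodes, ports, arcs) and the three example graphs are assumptions made for this example. Like the definition above, it compares structure only and ignores the operation type of a node.

```python
from itertools import permutations


def isomorphic(G, F):
    """Brute-force test; G and F are (nodes, ports, arcs) with arcs a set of
    (tail, frozenset_of_heads). Only practical for tiny hydragraphs."""
    (ng, pg, ag), (nf, pf, af) = G, F
    if (len(ng), len(pg), len(ag)) != (len(nf), len(pf), len(af)):
        return False
    ng, pg = sorted(ng), sorted(pg)
    target = set(af)
    # phi maps nodes to nodes and ports to ports, as the definition requires
    for node_perm in permutations(sorted(nf)):
        for port_perm in permutations(sorted(pf)):
            phi = dict(zip(ng + pg, node_perm + port_perm))
            image = {(phi[t], frozenset(phi[h] for h in hs)) for t, hs in ag}
            if image == target:
                return True
    return False


fs = frozenset
# Hypothetical templates: one node with two inputs and one output ...
T_X = ({"x"}, {"a", "b", "c"}, {("a", fs({"x"})), ("b", fs({"x"})), ("x", fs({"c"}))})
T_Y = ({"y"}, {"p", "q", "r"}, {("p", fs({"y"})), ("q", fs({"y"})), ("y", fs({"r"}))})
# ... and one node with one input and two separate outputs.
T_FORK = ({"n"}, {"a", "b", "c"}, {("a", fs({"n"})), ("n", fs({"b"})), ("n", fs({"c"}))})

print(isomorphic(T_X, T_Y))
print(isomorphic(T_X, T_FORK))
```

The number of bijections grows factorially, so a practical implementation would need a canonical form or a dedicated matching procedure; the sketch only makes the definition concrete.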
Note that, in general, a template is not a subhydragraph
of a hydragraph, because some nodes may have been re-
placed by ports. The important property of templates of a
CDFG is that they are themselves CDFGs that model part
of the algorithm modelled by the whole CDFG: the tem-
plate TG[S] models the part of the algorithm characterized
by the operations represented by the nodes of S, together
with the inputs and outputs of that part. Because of this
property, templates are the natural objects to consider if
one wants to break up a large algorithm represented by
a CDFG into smaller parts that have to be executed on
ALUs. In this paper, we only consider connected tem-
plates.
V. A FOUR-PHASE DECOMPOSITION
The overall aim of our research is to execute DSP programs
written in a high-level language, such as C, on one
MONTIUM tile in as few clock cycles as possible. There
are many related aspects: the limitation of resources; the
size of total configuration space; the ALU structure etc.
We propose to decompose this problem into a number of
phases: translation, clustering, scheduling and resource al-
location:
1 Translating the source code to a CDFG: The input C
program is first translated into a CDFG; and then some
transformations and simplifications are done on the CDFG.
The focus of this phase is the input program and is largely
independent of the target architecture.
2 Task clustering and ALU data-path mapping, clus-
tering for short: The CDFG is partitioned into clusters
and mapped to an unbounded number of fully connected
ALUs. The ALU structure is the main concern of this
phase and we do not take the inter-ALU communication
into consideration;
3 Scheduling: The graph obtained from the clustering
phase is scheduled taking the maximum number of ALUs
(it is 5 in our case) into account. The algorithm tries to
minimize the number of distinct ALU configurations
of a tile;
4 Resource allocation, allocation for short: The sched-
uled graph is mapped to the resources where locality of
reference is exploited, which is important for performance
and energy reasons. The main challenge in this phase is the
limitation of the size of register banks and memories, the
number of buses of the crossbar and the number of reading
and writing ports of memories and register banks.
Note that when one phase does not give a solution, we have
to fall back to a previous phase and select another solution.
The input for clustering and data-path mapping is a
CDFG. In the clustering phase the CDFG is partitioned
and mapped to an unbounded number of fully connected
ALUs, i.e., the inter-ALU communication is not considered.
A cluster corresponds to a possible configuration
of an ALU data-path, which is called a one-ALU configuration.
Each one-ALU configuration has fixed input and
output ports, fixed function blocks and fixed control signals.
A partition with one or more clusters that cannot be
mapped to our MONTIUM ALU data-path is a failed partition.
For this reason the procedure of clustering should be
combined with ALU data-path mapping. Goals of cluster-
ing are 1) minimization of the number of ALUs required;
2) minimization of the number of distinct ALU configura-
tions; and 3) minimization of the length of the critical path
of the dataflow graph.
We say that a collection (T1, . . . , Tk) of hydragraphs
is a k-tiling of the hydragraph G if there exists a partition
of NG into mutually disjoint sets S1, . . . , Sk such that
TG[Si] ≅ Ti for all i ∈ {1, . . . , k}. In that case we call
S1, . . . , Sk a k-cover of G. A (k, ℓ)-tiling is a k-tiling in
which at most ℓ nonisomorphic hydragraphs appear. Similarly,
we define a (k, ℓ)-cover. The clustering problem is
a graph covering problem:
Problem 1: Hydragraph Covering Problem
Given a CDFG G, find an optimal (k, ℓ)-cover S1, S2, . . . , Sk of G.
It is clear that we cannot expect to solve this complex
optimization problem easily. We would be quite happy
with a solution concept that gives approximate solutions
of a reasonable quality, and that is flexible enough to allow
for several solutions to choose from. We propose to start
the search for a good solution by first generating all dif-
ferent matches (up to a certain number of nodes because
of the restrictions set by the ALU-architecture) of noniso-
morphic templates for the CDFG. The second step tries to
find an efficient cover for an application graph with a min-
imal number of distinct templates and minimal number of
matches.
Problem A: Template Generation Problem
Given a CDFG, generate the complete set of nonisomor-
phic templates (that satisfy certain properties, e.g., which
can be executed on the ALU-architecture in one clock cy-
cle), and find all their corresponding matches.
Problem B: Template Selection Problem
Given a CDFG G and a set of (matches of) templates, find
an ‘optimal’ (k, ℓ)-cover of G.
In this paper, we concentrate on Problem A: find all the
possible templates and matches from a CDFG.
VI. TEMPLATE GENERATION
A straightforward approach to the generation procedure is:
1 Generate a set of connected i-node subsets by adding a
neighbor node to the (i− 1)-node subsets.
2 For all i-node subsets, consider their generated i-
templates. Choose the set of nonisomorphic i-templates
and list all matches of each of them.
3 Starting with the 1-node subsets, repeat the above steps
until all templates and matches up to maxsize nodes have
been generated.
In step 1, an i-node subset can be obtained from different
(i − 1)-node subsets, which results in unnecessary
computation. To avoid this, we use a clever labelling
of the nodes during the generation process:
• Each hydragraph node is given a unique serial number.
• A leading node is defined within each node subset S,
which is the one with the smallest serial number.
• Within a subset S, each graph node n ∈ S is given
a circle number, denoted by Cir(n|S), which is the dis-
tance between the leading node and n within S, i.e.,
Cir(n|S)=Dis(S.LeadingNode, n|S).
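The leading node and the circle numbers can be computed with a breadth-first search that only walks through nodes of S. The sketch below is an illustration under assumptions: the serial numbers (w = 1, v = 2, u = 3, x = 4, y = 5) and the neighbor relation are inferred from Fig. 5 and Table I rather than stated in the text, and the names SERIAL, ADJ, leading_node and circle_numbers are hypothetical.

```python
from collections import deque

# Assumed serial numbers and neighbor relation for the CDFG of Fig. 5.
SERIAL = {"w": 1, "v": 2, "u": 3, "x": 4, "y": 5}
ADJ = {
    "w": {"u", "v"},
    "v": {"u", "w"},
    "u": {"w", "v", "x", "y"},
    "x": {"u"},
    "y": {"u"},
}


def leading_node(subset):
    """The node of the subset with the smallest serial number."""
    return min(subset, key=SERIAL.get)


def circle_numbers(subset):
    """Cir(n|S) for every n in S: BFS distance from the leading node,
    using only nodes that belong to S."""
    S = set(subset)
    lead = leading_node(S)
    cir = {lead: 0}
    queue = deque([lead])
    while queue:
        n = queue.popleft()
        for m in ADJ[n] & S:
            if m not in cir:
                cir[m] = cir[n] + 1
                queue.append(m)
    return cir


print(leading_node({"u", "x", "y"}))
print(circle_numbers({"w", "u", "x"}))
```

For the subset {w, u, x} the leading node is w, u lies on circle 1 and x on circle 2; the distance is always measured inside S, so the same node can have different circle numbers in different subsets.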
If an (i − 1)-node subset S and one of its neighbor nodes
Nei satisfy the following conditions, S′ = S ∪ {Nei} will
be considered an i-node subset; otherwise S′ is thrown
away.
Fig. 5. Give each node a unique serial number
1 S.LeadingNode.Serial < Nei.Serial;
2 Dis(S.LeadingNode, Nei | S ∪ {Nei}) is not smaller than Cir(n|S) for any n ∈ S;
3 For each n which satisfies n ∈ S and Cir(n|S) = Dis(S.LeadingNode, Nei | S ∪ {Nei}), n.Serial < Nei.Serial.
For each i-node subset S′, these conditions choose a unique
pair (S, Nei) such that S′ = S ∪ {Nei}. Thus multiple
copies of S′ are discarded.
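The three conditions and the level-by-level generation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the serial numbers and the neighbor relation are assumptions inferred from Fig. 5 and Table I, and the names SERIAL, ADJ, distances, violated_condition (a stand-in for the role of “CanMatchTakeInNode”) and generate are hypothetical.

```python
from collections import deque

# Assumed serial numbers and neighbor relation for the CDFG of Fig. 5.
SERIAL = {"w": 1, "v": 2, "u": 3, "x": 4, "y": 5}
ADJ = {"w": {"u", "v"}, "v": {"u", "w"}, "u": {"w", "v", "x", "y"},
       "x": {"u"}, "y": {"u"}}


def distances(lead, S):
    """BFS distances from lead, walking only through nodes of S."""
    dist = {lead: 0}
    queue = deque([lead])
    while queue:
        n = queue.popleft()
        for m in ADJ[n] & S:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist


def violated_condition(S, nei):
    """Return 0 if S may take in nei, otherwise the number (1 to 3)
    of the first condition that is violated."""
    lead = min(S, key=SERIAL.get)
    if not SERIAL[lead] < SERIAL[nei]:
        return 1
    cir = distances(lead, S)                 # Cir(n|S)
    d = distances(lead, S | {nei})[nei]      # Dis(lead, Nei | S ∪ {Nei})
    if any(d < c for c in cir.values()):
        return 2
    if any(cir[n] == d and not SERIAL[n] < SERIAL[nei] for n in S):
        return 3
    return 0


def generate(maxsize):
    """levels[i-1] lists the accepted connected i-node subsets."""
    levels = [[frozenset({n}) for n in sorted(SERIAL, key=SERIAL.get)]]
    for _ in range(maxsize - 1):
        nxt = []
        for S in levels[-1]:
            neighbours = set().union(*(ADJ[n] for n in S)) - S
            for nei in sorted(neighbours, key=SERIAL.get):
                if violated_condition(S, nei) == 0:
                    nxt.append(S | {nei})
        levels.append(nxt)
    return levels


levels = generate(4)
print([len(level) for level in levels])
```

Under these assumptions no subset is produced twice, and each level contains exactly the connected node subsets of that size. Isomorphism classes of the generated templates (step 2 of the procedure) are not computed here.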
In Table I, the procedure of finding all the i-matches
from (i − 1)-matches of Fig. 5 is given. The symbols
in bold are the names of the leading nodes and the newly
added nodes are underlined. In each row, the match in the
left column is the predecessor match of the match in the
right column. The matches that do not satisfy the con-
ditions of the function “CanMatchTakeInNode” are dis-
carded. The numbers next to the discarded matches in-
dicate which of the above three conditions is violated.
TABLE I
MULTIPLE COPIES OF A MATCH ARE FILTERED OUT BY THE
FUNCTION “CANMATCHTAKEINNODE(oldMatch, newNode)”.

1-matches   2-matches   3-matches     4-matches
{x}         {x,u} 1
{y}         {y,u} 1
{u}         {u,w} 1
            {u,v} 1
            {u,x}       {u,x,y}       {u,x,y,w} 1
                                      {u,x,y,v} 1
                        {u,x,w} 1
                        {u,x,v} 1
            {u,y}       {u,y,x} 3
                        {u,y,w} 1
                        {u,y,v} 1
{v}         {v,u}       {v,u,x}       {v,u,x,y}
                                      {v,u,x,w} 1
                        {v,u,y}       {v,u,y,x} 3
                                      {v,u,y,w} 1
                        {v,u,w} 1
            {v,w} 1
{w}         {w,u}       {w,u,x}       {w,u,x,y}
                                      {w,u,x,v} 2
                        {w,u,y}       {w,u,y,x} 3
                                      {w,u,y,v} 2
                        {w,u,v} 3
            {w,v}       {w,v,u}       {w,v,u,x}
                                      {w,v,u,y}

REFERENCES
[1] Srinivasa R. Arikati, Ravi Varadarajan, “A Signature Based Approach
to Regularity Extraction”, Proc. of International Conference
on Computer-Aided Design (ICCAD), 1997, pp.542-545.
[2] Srihari Cadambi and Seth Copen Goldstein, “CPR: A Configuration
Profiling Tool”, IEEE Symposium on FPGAs for Custom
Computing Machines, 1999.
[3] Timothy J. Callahan, Philip Chong, Andre DeHon, and John
Wawrzynek, “Fast Module Mapping and Placement for Datapaths
in FPGAs”, Proc. of International Symp. of Field Programmable
Gate Arrays, 1998.
[4] Amit Chowdhary, Sudhakar Kale, Phani Saripella, Naresh Sehgal,
Rajesh Gupta, “A General Approach for Regularity Extraction
in Datapath Circuits”, Proc. of International Conference on
Computer-Aided Design (ICCAD), San Jose, CA, 1998, pp.332-339.
[5] Miguel R. Corazao, Marwan A. Khalaf, Lisa M. Guerra, Miodrag
Potkonjak and Jan M. Rabaey, “Performance Optimization Using
Template Mapping for Datapath-Intensive High-Level Synthesis”,
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol.15, No.8, August 1996, pp.877-888.
[6] M. R. Garey and D. S. Johnson, Computers and Intractability:
A Guide to the Theory of NP-Completeness, W. H. Freeman and
Company, New York, 1979.
[7] Yuanqing Guo, Gerard J.M. Smit, Paul M. Heysters, “Template
Generation and Selection Algorithms for High Level Synthesis”,
submitted for publication.
[8] Magnús M. Halldórsson, Jaikumar Radhakrishnan, “Greed is
good: Approximating independent sets in sparse and bounded-degree
graphs”, ACM Symposium on the Theory of Computing,
1994.
[9] Ryan Kastner, Seda Ogrenci-Memik, Elaheh Bozorgzadeh and
Majid Sarrafzadeh, “Instruction Generation for Hybrid Reconfig-
urable Systems”, Proc. of International Conference on Computer-
Aided Design (ICCAD), San Jose, CA, November, 2001.
[10] “Instruction Generation for Hybrid Reconfigurable Systems”,
http://citeseer.nj.nec.com/446997.html.
[11] Thomas Kutzschebauch, “Efficient Logic Optimization Using
Regularity Extraction”, Proc. of the 1999 International Workshop
on Logic Synthesis, 1999.
[12] D. Sreenivasa Rao and Fadi J. Kurdahi, “On Clustering For Maximal
Regularity Extraction”, IEEE Transactions on Computer-Aided
Design, vol.12, No.8, August 1993, pp.1198-1208.
[13] Michel A.J. Rosien, Yuanqing Guo, Gerard J.M. Smit, Thijs Krol,
“Mapping Applications to an FPFA Tile”, accepted for publica-
tion in Proc. of Date03, Munich, March, 2003
[14] Gerard J.M. Smit, Paul J.M. Havinga, Lodewijk T. Smit, Paul M.
Heysters, Michel A.J. Rosien, “Dynamic Reconfiguration in Mo-
bile Systems”, Proc. of FPL2002, Montpellier France, pp 171-
181, September 2002.
