SUNMAP: A Tool for Automatic Topology Selection and Generation for NoCs by Murali, Srinivasan & De Micheli, Giovanni
Srinivasan Murali
Computer Systems Lab
Stanford University
Stanford, CA-94305, USA
smurali@stanford.edu
Giovanni De Micheli
Computer Systems Lab
Stanford University
Stanford, CA-94305, USA
nanni@stanford.edu
ABSTRACT
Increasing communication demands of processor and mem-
ory cores in Systems on Chips (SoCs) necessitate the use
of Networks on Chip (NoC) to interconnect the cores. An
important phase in the design of NoCs is the mapping of
cores onto the most suitable topology for a given application.
In this paper, we present SUNMAP, a tool for automatically
selecting the best topology for a given application and
producing a mapping of cores onto that topology. SUNMAP
explores various design objectives, such as minimizing average
communication delay, area, and power dissipation, subject to
bandwidth and area constraints. The tool supports different
routing functions (dimension ordered, minimum-path,
traﬃc splitting) and uses ﬂoorplanning information early in
the topology selection process to provide feasible mappings.
The network components of the chosen NoC are automatically
generated using cycle-accurate SystemC soft macros
from the ×pipes architecture. SUNMAP automates NoC selection
and generation, bridging an important design gap in build-
ing NoCs. Several experimental case studies are presented
in the paper, which show the rich design space exploration
capabilities of SUNMAP.
Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Anal-
ysis and Design Aids; B.4.3 [Input/Output and Data
Communications]: Interconnections—Topology
General Terms
Design, Algorithms, Performance
Keywords
Systems On Chip, Networks on Chip, Topology, Mapping,
SystemC.
1. INTRODUCTION
The heavy communication demands of future Systems on
Chips (SoCs) require scalable communication architectures
to interconnect the cores. The interconnect scalability for
bus based systems that are widely used in current SoCs is
limited. This motivates a shift in the communication design
paradigm from non-scalable bus based architectures to scalable
Networks on Chip (NoC) based architectures [1, 2, 3, 4].
NoCs are compatible with core design and re-use and have
several advantages including better structure, performance
and modularity.
DAC 2004, June 7–11, 2004, San Diego, California, USA.
Copyright 2004 ACM 1-58113-828-8/04/0006 ...$5.00.
Figure 1: Direct NoC Topologies: (a) Mesh (nodes 0 to 8), (b) Torus
(nodes 0 to 8), (c) Hypercube (nodes 0 to 7)
Figure 2: Indirect NoC Topologies: (a) 3-Stage Clos, (b) Butterfly
(each with three switch stages connecting the cores)
An important phase in the design of NoCs is choosing the
most suitable NoC topology for a particular application and
mapping of the application on to that topology. There are
several standard topologies onto which an application can
be mapped. They are broadly classiﬁed as direct topologies
where each switch is connected to a single core (Figure 1)
and indirect topologies where a set of cores are connected
to a switch (Figure 2). Choosing the right topology involves
exploring various design objectives such as minimizing aver-
age communication delay (i.e. latency), power consumption,
area, etc. Moreover, in the resulting NoC the links should
support the desired traﬃc between various cores. Thus the
mapping of cores onto the NoCs must satisfy the bandwidth
constraints of the application.
As a motivating example, a Video Object Plane Decoder
(VOPD) [13] mapped onto two different topologies (mesh
and torus) is considered (Figure 3). The costs of the mappings
in terms of area, power and communication delay are
evaluated (as explained in Section 5) and are presented in
Figure 3(d). The average communication delay for the torus
network is slightly lower (10%) as more network resources
are utilized (more links and larger switches). The area and
power analyses show large savings for the mesh network (20%
power savings and 5% area savings), and it is more suitable
than a torus for this application. This illustrates the need
for exploring various design objectives for choosing the most
suitable network topology for a particular application.
Figure 3: Example Mappings of VOPD: (a) VOPD graph, (b) Mesh,
(c) Torus, (d) Design Parameters
(d) Design Parameters
Perf Met          Mesh     Torus    Ratio tor/mesh
avg hops          2.25     2.03     0.9
des area (mm2)    54.59    57.91    1.06
des pow (mW)      372.1    454.9    1.22
The purpose of this research is to describe a method and
a tool, SUNMAP, for high-level mapping of cores onto various
network architectures and to choose the architecture that is
most eﬃcient for the application. SUNMAP maps cores onto
several standard NoC topologies (mesh, torus, hypercube,
butterﬂy, clos) with various objective functions such as mini-
mizing average communication delay, design area and power
dissipation, satisfying the bandwidth and area constraints.
The tool also supports various routing functions (dimension
ordered, minimum-path, traﬃc splitting across minimum-
paths, traﬃc splitting across all paths) and chooses the best
topology from the library of available topologies for the given
application. Note that the approach presented here is gen-
eral and other topologies (such as octagon network or star
network [6, 10]) can be easily added to the topology library.
SUNMAP has an interface to the ×pipesCompiler [18], which
is a tool for instantiating network components (switches,
links, network interfaces) using a library of composable Sys-
temC soft macros. After topology selection and mapping,
the SystemC description of network components and their
interconnection with the cores are automatically generated.
The resulting SystemC code for the whole design can be
simulated at the cycle-accurate and signal accurate level.
For the mapping process, we use traﬃc models as the ab-
straction of communication between cores. Obviously, the
use of high-level models hides a lot of complex technolog-
ical aspects, but facilitates fast exploration of the design
space. Moreover as SUNMAP generates SystemC models of
the mapped NoC, the design can be accurately simulated,
providing a reliable path for system design and veriﬁcation.
2. PREVIOUS WORK
The need to replace busses with on-chip networks has been
presented in [1, 2]. Several NoC architectures and design
methodologies have been presented recently, such as mesh
based Nostrum [4] and aSoC [8], fat-tree based SPIN [3],
ring based Proteo [9], etc. In [6], the use of an octagon
communication topology for network processors is presented. In
[5], the design of NoCs with Quality-of-Service (QoS) guarantees
is explored. A hierarchical approach for designing
on-chip networks is presented in [7]. Design methodologies
for building irregular networks are presented in
[14], [15]. We refer the reader to [11] for surveys on several
aspects of NoC design.
Figure 4: Design Flow of SUNMAP (Phase 1: mapping onto topologies;
Phase 2: topology selection; Phase 3: SystemC generation with the
×pipesCompiler)
The problem of mapping cores onto NoC architectures
is presented in [16],[19]. In [16], a branch-and-bound algo-
rithm is used to map cores onto a mesh-based architecture
with the objective of minimizing energy, satisfying the band-
width constraints of the NoC. A simple dimension-ordered
routing is assumed in the work. In [19], fast algorithms for
mapping cores onto mesh NoC architectures under different
routing functions, minimizing the average communication delay
and satisfying bandwidth constraints, are presented.
To the best of our knowledge, this is the ﬁrst work target-
ing the problem of application-speciﬁc NoC topology selec-
tion and generation. We consider several diﬀerent topolo-
gies, routing functions and design objectives and automate
the mapping and topology selection processes. We use ﬂoor-
plan information for area-power estimates early in the map-
ping process and consider area and bandwidth constraints
of the NoC to evaluate the feasibility of mappings. The net-
work components of the chosen topology are automatically
built using SystemC soft macros from ×pipes library. Thus,
SUNMAP bridges an important design gap in building NoCs.
3. DESIGN METHODOLOGY
The design ﬂow of the SUNMAP tool is presented in Figure 4.
We assume that the application is mapped onto cores using
existing tools such as [12]. By static analysis or simulation,
the amount of data transfer between the cores is obtained.
The resulting cores and the communication demands between
them are the inputs to our tool. SUNMAP has three phases of
operation.
In the ﬁrst phase, for a chosen routing function and de-
sign objective, mappings onto various network topologies in
the topology library are obtained. For each mapping, the
area and bandwidth constraints are evaluated to produce a
feasible mapping. The area-power libraries and ﬂoorplan-
ner are built into SUNMAP, so that area-power estimates can
be incorporated early in the mapping process. The details
of the area-power models and ﬂoorplanner are presented in
Section 5.
In the second phase, the various topologies (with map-
pings produced from the ﬁrst phase) are evaluated for sev-
eral design objectives and the best topology is chosen.
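As a minimal sketch of the second phase, assuming the designer picks a single objective and that infeasible mappings are simply discarded (the paper does not spell out the selection rule), the choice can be written as a filter followed by a minimum. The function name `select_topology` and the dictionary layout are illustrative assumptions; the mesh and torus numbers are the VOPD values reported in Figure 3(d).

```python
def select_topology(results, objective):
    """Pick the feasible topology with the lowest value of `objective`.

    results: {topology: metrics dict, or None when no feasible mapping exists}.
    The layout of `results` is an assumption made for this sketch.
    """
    feasible = {t: m for t, m in results.items() if m is not None}
    if not feasible:
        return None
    return min(feasible, key=lambda t: feasible[t][objective])


# VOPD values from Figure 3(d); the infeasible entry is hypothetical.
results = {
    "mesh": {"avg_hops": 2.25, "area_mm2": 54.59, "power_mw": 372.1},
    "torus": {"avg_hops": 2.03, "area_mm2": 57.91, "power_mw": 454.9},
    "hypothetical_infeasible": None,
}
best_for_power = select_topology(results, "power_mw")
best_for_delay = select_topology(results, "avg_hops")
```

With these numbers the mesh wins on power and area while the torus wins on hop delay, which is the trade-off discussed in the introduction.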
In the third phase, the tool generates SystemC descrip-
tion of the network components using the ×pipesCompiler
and ×pipes architecture. The three design phases are delin-
eated in Figure 4. In the following sections we elaborate on
the ﬁrst two phases of the design methodology and refer the
reader to [17], [18] for a detailed description of the architec-
ture of the network elements and their automatic generation
for a chosen topology.
4. MAPPING OF CORES ONTO
TOPOLOGIES
In [19], mapping of cores onto a mesh topology is pre-
sented along with the performance gains of the algorithm
compared to previous mapping approaches. In this section
we extend the mapping algorithms in [19], for several stan-
dard topologies (torus, hypercube, butterﬂy and clos) for
minimum-path routing. Mapping algorithms for other rout-
ing functions are similarly extended for various topologies
and are incorporated in the tool, but due to lack of space,
here we present only the minimum-path mapping algorithm.
Before presenting the algorithm, we formulate the map-
ping problem mathematically. The communication between
the cores of the SoC is represented by the core graph:
Definition 1. The core graph is a directed graph, $G(V,E)$,
with each vertex $v_i \in V$ representing a core and the directed
edge $(v_i, v_j)$, denoted as $e_{i,j} \in E$, representing the
communication between the cores $v_i$ and $v_j$. The weight of the edge
$e_{i,j}$, denoted by $comm_{i,j}$, represents the bandwidth of the
communication from $v_i$ to $v_j$.
The connectivity and link bandwidth of the NoC is repre-
sented by the NoC topology graph:
Definition 2. The NoC topology graph is a directed graph
$P(U,F)$ with each vertex $u_i \in U$ representing a node in the
topology and the directed edge $(u_i, u_j)$, denoted as $f_{i,j} \in F$,
representing a direct communication between the vertices $u_i$
and $u_j$. The weight of the edge $f_{i,j}$, denoted by $bw_{i,j}$,
represents the bandwidth available across the edge $f_{i,j}$.
The mapping of the core graph $G(V,E)$ onto the topology
graph $P(U,F)$ is defined by the one-to-one mapping function
$map$:
$$map : V \rightarrow U, \ \text{s.t.}\ map(v_i) = u_j,\ \forall v_i \in V,\ \exists u_j \in U \qquad (1)$$
The mapping is defined when $|V| \le |U|$. An example mapping
of the VOPD core graph onto mesh and torus topology
graphs is shown in Figure 3. The communication between
each pair of cores (i.e. each edge $e_{i,j} \in E$) is treated as a flow
of a single commodity, represented as $d_k$, $k = 1, 2, \cdots, |E|$.
The value of $d_k$ represents the bandwidth of communication
across the edge and is denoted by $vl(d_k)$. The set of all
commodities is represented by $D$ and is defined as:
$$D = \left\{ d_k : vl(d_k) = comm_{i,j},\ k = 1, 2, \cdots, |E|,\ \forall e_{i,j} \in E,\ \text{with } source(d_k) = map(v_i),\ dest(d_k) = map(v_j) \right\} \qquad (2)$$
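To make the definitions concrete, the following sketch builds the commodity set D of equation (2) from a core graph and a mapping. It is only an illustration: the class name `Commodity`, the dictionary representation of the core graph, and the three-core example with its bandwidths are assumptions, not part of SUNMAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commodity:
    source: int    # NoC node hosting the sending core, map(v_i)
    dest: int      # NoC node hosting the receiving core, map(v_j)
    value: float   # vl(d_k) = comm_{i,j}


def build_commodities(core_edges, mapping):
    """core_edges: {(core_i, core_j): comm_ij}; mapping: {core: NoC node}."""
    # The mapping function must be one-to-one: no two cores share a node.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("map must be one-to-one")
    return [Commodity(mapping[i], mapping[j], bw)
            for (i, j), bw in core_edges.items()]


# Hypothetical 3-core example (names and bandwidths are made up).
core_edges = {("cpu", "mem"): 300.0, ("mem", "dsp"): 120.0}
mapping = {"cpu": 0, "mem": 1, "dsp": 4}
D = build_commodities(core_edges, mapping)
```

Each directed edge of the core graph yields exactly one commodity, so `len(D)` equals |E|.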
4.1 General Mapping Algorithm
In this sub-section, we present the generalized minimum-
path mapping algorithm and in the next sub-sections we
show how the algorithm is modiﬁed for each topology.
The mapping algorithm is presented in Figure 5. As the
mapping problem is intractable [19], we use a heuristic ap-
proach with three phases: in the ﬁrst phase an initial map-
ping is obtained using a greedy algorithm. In the initial
mapping procedure (step 1 in Figure 5), ﬁrst the core that
has maximum communication is placed on to the NoC node
with maximum neighbors. Then the core that communi-
cates the most with placed cores is chosen. This core is
placed onto the NoC node that minimizes the cost function
and this procedure is repeated until all the cores are placed.
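The greedy initial placement described above can be sketched as follows. This is a hedged illustration rather than SUNMAP's implementation: the cost function used when choosing a node (traffic multiplied by hop distance to the cores already placed) is an assumption, as are the function and parameter names.

```python
def greedy_initial_mapping(comm, neighbors, dist):
    """comm: {(core_i, core_j): bandwidth}; neighbors: {node: set of nodes};
    dist(u, v): hop distance between NoC nodes (assumed to be available)."""
    cores = {c for edge in comm for c in edge}

    def traffic(a, b):
        return comm.get((a, b), 0) + comm.get((b, a), 0)

    total = {c: sum(traffic(c, o) for o in cores if o != c) for c in cores}
    # The core with maximum communication goes on the node with most neighbors.
    first = max(sorted(cores), key=total.get)
    hub = max(sorted(neighbors), key=lambda u: len(neighbors[u]))
    placed = {first: hub}
    free = set(neighbors) - {hub}
    while len(placed) < len(cores):
        rest = sorted(cores - set(placed))
        # Next core: the one that communicates most with the placed cores.
        nxt = max(rest, key=lambda c: sum(traffic(c, p) for p in placed))
        # Node choice: minimize the assumed cost, traffic times hop distance.
        best = min(sorted(free),
                   key=lambda u: sum(traffic(nxt, p) * dist(u, placed[p])
                                     for p in placed))
        placed[nxt] = best
        free.remove(best)
    return placed


# 3x3 mesh with nodes numbered 0..8 row by row, as in Figure 1(a).
def mesh_dist(u, v):
    return abs(u // 3 - v // 3) + abs(u % 3 - v % 3)


neighbors = {u: {v for v in range(9) if mesh_dist(u, v) == 1} for u in range(9)}
comm = {("a", "b"): 100, ("b", "c"): 50}   # hypothetical core graph
placement = greedy_initial_mapping(comm, neighbors, mesh_dist)
```

In this example core b, which has the most traffic, lands on the center node 4, and its two partners are placed on adjacent nodes.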
Once an initial mapping is obtained, in the second phase
(steps 2 to 8), the commodities are sorted in decreasing order
of their values. Then, for each commodity in order, a quad-
rant graph between the source and destination of the com-
modity is formed, as the shortest path between the source
and destination lies within the quadrant between them. The
shaded regions in Figure 3 are examples of quadrant graphs
for the communication between the cores smem and iquant.
Mapping(G(V,E), P(U,F))
{
 1:  obtain an initial greedy mapping of G onto P;
 2:  sort commodities in D with decreasing comm costs;
 3:  for each d_k in D in order of decreasing cost
     {
 4:      make quadrant graph Q(d_k) with source(d_k)
         and dest(d_k) as end vertices;
 5:      Path(source(d_k), dest(d_k)) = minpath(Q(d_k));
 6:      increase edge weights in Path by vl(d_k);
     }
 7:  floorplan_area_power_estimates(P);
 8:  if bandwidth and area constraints are satisfied,
     find the mapping cost;
 9:  repeat steps 2 to 8 for each pair-wise swap of
     vertices in P;
10:  return the mapping with lowest cost of all evaluated
     mappings;
}
Figure 5: Minimum-path Mapping Algorithm
Then, Dijkstra’s shortest path algorithm is applied (step 5)
to the quadrant graph and the minimum path is obtained.
The edge weights are incremented suitably and the proce-
dure is repeated for each commodity in order. After routing
all commodities, if the bandwidth and area constraints are
satisfied, the cost of communication is calculated. Bandwidth
constraints are satisfied if, in the resulting mapping,
the traffic across every link is smaller than or equal to the
capacity of the link (see footnote 1). The area constraints are satisfied
when the mapped design area is lower than the maximum
allowed area and aspect ratios of the design and soft core
blocks (blocks that have ﬂexible sizes) are within permissi-
ble ranges. For these area estimates (and power calculations
that are needed when the mapping objective is power min-
imization), ﬂoorplanner and area-power libraries are incor-
porated into SUNMAP as explained in Section 5.
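The routing part of the second phase (steps 2 to 6) and the bandwidth check can be sketched for a mesh as below. This is an illustration under stated assumptions: the accumulated link load is used as the edge weight for Dijkstra's algorithm, and the quadrant restriction is enforced by only allowing moves that reduce the distance to the destination. The function names and the two-commodity example are hypothetical, and the floorplanning and area checks are left out.

```python
import heapq


def mesh_links(rows, cols):
    """Directed links of a rows x cols mesh, each starting with zero load."""
    load = {}
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                if 0 <= r + dr < rows and 0 <= c + dc < cols:
                    load[((r, c), (r + dr, c + dc))] = 0.0
    return load


def route_min_path(load, src, dst):
    """Dijkstra restricted to the quadrant: least-loaded minimum-hop path."""
    def remaining(n):
        return abs(n[0] - dst[0]) + abs(n[1] - dst[1])

    best, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == dst:
            break
        if cost > best[u]:
            continue
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            v = (u[0] + dr, u[1] + dc)
            # Moves that get closer to dst stay inside the bounding box
            # and keep the hop count minimal.
            if (u, v) in load and remaining(v) < remaining(u):
                new_cost = cost + load[(u, v)]
                if new_cost < best.get(v, float("inf")):
                    best[v], prev[v] = new_cost, u
                    heapq.heappush(heap, (new_cost, v))
    path, node = [], dst
    while node != src:
        path.append((prev[node], node))
        node = prev[node]
    return path[::-1]


def route_all(rows, cols, commodities, capacity):
    """commodities: (source, dest, value) triples on mesh coordinates."""
    load = mesh_links(rows, cols)
    # Route commodities in decreasing order of their values.
    for src, dst, value in sorted(commodities, key=lambda d: -d[2]):
        for link in route_min_path(load, src, dst):
            load[link] += value
    # Bandwidth constraint: no link carries more than its capacity.
    return load, all(t <= capacity for t in load.values())


demands = [((0, 0), (1, 1), 300.0), ((0, 0), (1, 1), 400.0)]  # hypothetical
load, feasible = route_all(2, 2, demands, capacity=500.0)
```

Because the second commodity sees the load left by the first, the two flows take different paths and the 500 MB/s capacity is respected.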
The mapping algorithm can have many different objectives,
such as minimizing communication delay, area or power
dissipation, and the objective is an input parameter to SUNMAP.
Depending on the objective function, the cost function calculation
(done as part of step 8) varies.
In the last phase of the algorithm (steps 9 - 10), for each
pair-wise swapping of vertices, phase-2 is repeated. Finally,
the best mapping from all evaluated mappings is returned
by the procedure.
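The last phase can be sketched as an exhaustive loop over pair-wise swaps. The sketch assumes that every swap is applied to the initial mapping and evaluated independently, which the text does not state explicitly; `evaluate` stands in for steps 2 to 8 and is assumed to return a cost, or `None` when the bandwidth or area constraints are violated.

```python
from itertools import combinations


def improve_by_swaps(initial, nodes, evaluate):
    """initial: {core: node}; nodes: all NoC nodes;
    evaluate(mapping) -> cost, or None if the mapping is infeasible."""
    candidates = [dict(initial)]
    for a, b in combinations(nodes, 2):
        # Swap the contents of nodes a and b (either may be empty).
        swapped = {core: (b if n == a else a if n == b else n)
                   for core, n in initial.items()}
        if swapped != initial:
            candidates.append(swapped)
    best_map, best_cost = None, None
    for mapping in candidates:
        cost = evaluate(mapping)
        if cost is not None and (best_cost is None or cost < best_cost):
            best_map, best_cost = mapping, cost
    return best_map, best_cost


# Hypothetical example: three nodes in a row, two cores exchanging 10 units.
initial = {"x": 0, "y": 2}
best_map, best_cost = improve_by_swaps(
    initial, [0, 1, 2], lambda m: 10 * abs(m["x"] - m["y"]))
```

Here the initial mapping costs 20 and a single swap brings the two cores next to each other at a cost of 10.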
As the minimum-path computations are performed on the
quadrant graph instead of the entire NoC graph, large
computational time savings are achieved, as the number of nodes
in a quadrant graph is much smaller than the total number of NoC
nodes. The above algorithm was validated for a mesh topology
in [19], where the mappings produced by this algorithm
were shown to be superior to those of previous
approaches.
In the mapping algorithm presented above, all but two
of the steps are common to all topologies: the ﬁrst step is
the formation of NoC topology graph, which is obviously
speciﬁc to a particular topology and the second step is the
procedure used to form quadrant graphs, which varies with
the topology used. These two steps are explained in detail
in the following sub-sections.
Footnote 1: Capacity of a link in a NoC is technology and
implementation dependent and is assumed as an input to SUNMAP.
4.2 NoC Topology Graph Definition
The formal definition of the NoC topology graph was presented
in the beginning of this section. The edges in the NoC
graph represent connections between adjacent NoC nodes.
Thus for defining a topology graph, we simply need to define
the nodes that are adjacent to a particular node in that
topology.
For a mesh, each interior node has four neighbors (such as
node 4 in Figure 1(a)), nodes in the
four corners (e.g. node 0) have two neighbors, and the other
nodes on the edges (e.g. node 1) have three neighbors. A
torus has a structure similar to that of the mesh, but has additional
wrap-around channels between the edge nodes (e.g. node 0
in Figure 1(b) is connected to nodes 2 and 6 on the opposite
edges).
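The adjacency rules for mesh and torus can be captured in one small routine. The sketch below assumes row-major node numbering as in Figure 1; the function name `grid_neighbors` and the `wrap` flag are illustrative choices.

```python
def grid_neighbors(rows, cols, wrap=False):
    """Adjacency of a mesh (wrap=False) or torus (wrap=True).

    Nodes are numbered row by row, so node id = row * cols + col.
    """
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = set()
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if wrap:
                    # Wrap-around channels connect opposite edges.
                    rr, cc = rr % rows, cc % cols
                if 0 <= rr < rows and 0 <= cc < cols:
                    nbrs.add(rr * cols + cc)
            nbrs.discard(r * cols + c)
            adj[r * cols + c] = nbrs
    return adj


mesh = grid_neighbors(3, 3)
torus = grid_neighbors(3, 3, wrap=True)
```

For the 3x3 examples of Figure 1, `mesh[4]` has four neighbors, `mesh[0]` has two, and `torus[0]` additionally contains nodes 2 and 6.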
For a hypercube with N nodes (also called a 2-ary n-cube),
each node has n neighbors, where n = log2 N.
A node ui in such a network is represented by the n-tuple
(h1, h2, ..., hn), which is the binary representation of the
decimal i. Intuitively, each hj represents a single dimension
and thus an n-tuple uniquely identifies every node of a 2-ary
n-cube. As an example, node 2 in Figure 1(c) is represented
by (0, 1, 0). All nodes whose n-tuples are at Hamming distance 1
from the n-tuple for ui are neighbors of ui (e.g. node 6, whose
3-tuple is (1, 1, 0), is adjacent to node 2).
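Since neighbors differ in exactly one position of the n-tuple, hypercube adjacency reduces to flipping one bit of the node index. A short sketch (the function names are illustrative):

```python
def hypercube_tuple(i, n):
    """n-tuple (h1, ..., hn): binary representation of i, most significant first."""
    return tuple((i >> (n - 1 - b)) & 1 for b in range(n))


def hypercube_neighbors(i, n):
    """Nodes whose n-tuple differs from that of node i in exactly one position."""
    return sorted(i ^ (1 << b) for b in range(n))


node2 = hypercube_tuple(2, 3)
nbrs2 = hypercube_neighbors(2, 3)
```

For the 8-node hypercube of Figure 1(c), node 2 is (0, 1, 0) and its neighbors are nodes 0, 3 and 6.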
In this work, we consider 3-stage clos networks, where each
switch in a stage is connected to every switch in the next
stage (e.g. switch 0 of stage 1 in Figure 2(a) is connected
to switches 0, 1, 2, 3 of stage 2). Thus adjacency calculations
are trivial for this topology. Butterfly networks (with N core
nodes) are also known as k-ary n-fly networks, where k is
the radix of the switches in the network and n is the number of
stages in the network (n = logk N). A 2-ary 3-fly network is
presented in Figure 2(b). As seen, each switch in a stage
is connected to 2 switches in the next stage. The maximum
distance between connected switches halves with
each stage (e.g. switch 0 of stage 1 is connected to switches
0 and 2 of stage 2, resulting in a maximum distance of 2;
switch 0 of stage 2 is connected to switches 0 and 1 of
stage 3, resulting in a maximum distance of 1).
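The butterfly connectivity can be sketched with base-k arithmetic on the switch index. The sketch assumes the standard k-ary n-fly wiring, in which a stage holds k^(n-1) switches and a switch in stage s reaches the switches of stage s+1 that differ from it only in the s-th most significant base-k digit; this reproduces the examples given for Figure 2(b), but the function name and parameters are illustrative.

```python
def butterfly_next_stage(switch, stage, k, n):
    """Switches of stage+1 connected to `switch` of `stage` (stages 1..n-1)."""
    digits = n - 1                     # a switch index has n-1 base-k digits
    weight = k ** (digits - stage)     # place value of the digit that may change
    base = switch - ((switch // weight) % k) * weight
    return [base + d * weight for d in range(k)]


stage1 = butterfly_next_stage(0, 1, k=2, n=3)
stage2 = butterfly_next_stage(0, 2, k=2, n=3)
```

For the 2-ary 3-fly, `weight` is 2 in the first stage and 1 in the second, which is the halving of the maximum distance described above.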
4.3 Quadrant Graph Formation
The procedure for forming quadrant graphs is specific to a
topology, as the set of nodes that lie on the shortest paths of a
commodity is topology specific. Example quadrant graphs for
mesh and torus networks were presented in Figure 3 (shaded
areas in the figure). For a mesh network, the nodes that
are within the bounding box formed by the row and col-
umn boundaries of the source and destination nodes of a
commodity form the elements of the quadrant graph of that
commodity (Figure 3(b)). For a torus network, the wrap-
around channels need to be considered for computing the
smallest bounding box between the source and destination
nodes (Figure 3(c)).
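A possible way to compute these bounding boxes is sketched below, assuming row-major node numbering and (row, column) coordinates. On a torus each axis is traversed in whichever direction is shorter; when both directions are equally long the sketch arbitrarily picks the forward one, which is an assumption.

```python
def axis_span(a, b, size, wrap):
    """Coordinates covered going from a to b along one axis by the shortest way."""
    if not wrap:
        return list(range(min(a, b), max(a, b) + 1))
    forward = (b - a) % size
    backward = (a - b) % size
    if forward <= backward:
        return [(a + s) % size for s in range(forward + 1)]
    return [(a - s) % size for s in range(backward + 1)]


def quadrant_nodes(src, dst, rows, cols, wrap=False):
    """Node ids inside the (smallest) bounding box between src and dst."""
    rs = axis_span(src[0], dst[0], rows, wrap)
    cs = axis_span(src[1], dst[1], cols, wrap)
    return sorted(r * cols + c for r in rs for c in cs)


mesh_quadrant = quadrant_nodes((0, 0), (2, 2), 3, 3)
torus_quadrant = quadrant_nodes((0, 0), (2, 2), 3, 3, wrap=True)
```

On the 3x3 mesh the quadrant between opposite corners is the whole network, whereas on the torus the wrap-around channels shrink it to the four corner nodes.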
For hypercubes, all nodes whose hj values (of the n-tuple)
match those of the source and destination nodes of a commodity,
in every dimension where the source and destination agree,
are included in the quadrant graph. As an
example, for source node 0 (represented by (0,0,0)) and
destination node 3 (represented by (0,1,1)), all nodes with n-tuples
of the form (0,*,*) (* represents a don't care value) form the
quadrant graph (i.e. nodes 0, 1, 2, 3 are the elements of the
quadrant graph). Intuitively, the quadrant graph is the sub-cube
spanned by the dimensions in which the source and destination
differ.
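With node indices treated as bit vectors, the hypercube quadrant is a bit-mask test. A short sketch (the function name is illustrative):

```python
def hypercube_quadrant(src, dst, n):
    """Nodes that agree with src (and dst) on every dimension where the two match."""
    # Bits set in `fixed` are the dimensions on which src and dst agree.
    fixed = ~(src ^ dst) & ((1 << n) - 1)
    return [u for u in range(1 << n) if (u & fixed) == (src & fixed)]


quadrant = hypercube_quadrant(0, 3, 3)
```

For source 0 and destination 3 only the most significant dimension is fixed, which gives the nodes of the form (0,*,*).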
As clos networks have full interconnection pattern be-
tween switches of adjacent stages and butterﬂy networks
have no path diversity (a single path from any source to
any destination), the quadrant graph formation for these
networks is trivial.
5. AREA-POWER MODELS AND
FLOORPLANNING
We developed analytical models for estimating the area
of switches. The architecture of the switches is assumed to
be based on the ×pipes architecture presented in [17]. The
area calculations include the crossbar area, buffer area and logic
(including control) area. The models take into account the
nuances of individual switch configurations and include
fine-grained details (such as pipeline registers and
cross points). We used ORION [22], a power modeling
tool, for developing bit energy models for the switches.
The area-power models are used to generate area-power libraries
for various switch configurations and different technology
parameters. In the rest of this paper, we assume 0.1 µm
technology and use the area-power libraries generated by the
models for this technology. We use wiring parameters from
[23] to estimate link power dissipation. We assume that the
area-power values of the cores are an input to our tool.
The general solution to the ﬂoorplanning problem has two
basic steps: ﬁrst is ﬁnding the relative positions of modules
and the second is ﬁnding the exact positions, area and size
of the modules [20]. For a particular mapping that needs to
be evaluated for area-power-latency, the relative positions of
the cores and switches are known. Thus the ﬂoorplanning
problem is reduced to the one of ﬁnding the exact positions
and sizes (for soft blocks) of the cores and switches. We
use a simple Linear Program (LP) based ﬂoorplanner ex-
isting in literature [21] for this purpose. Note that a more
sophisticated ﬂoorplanner such as the one presented in [20]
can be used in place of the simple ﬂoorplanner. As the ma-
jor focus of this work is not ﬂoorplanning, we use a simple
ﬂoorplanner in the tool. In our subsequent work, we plan to
enhance the ﬂoorplanner to take speciﬁc features of NoCs
into account.
The area and aspect ratio constraints (for feasibility of
mapping) are evaluated and link lengths in the NoC are
obtained from the floorplanner. Using the built-in power
libraries, power dissipation for the switches and links is
calculated based on the average traffic (shown as edge
annotations in Figure 3(a)) through them. The computed area and
power values are returned (step 7 in Figure 5) to the mapping
algorithm.
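One plausible reading of the power calculation, given bit energy models for the switches and link lengths from the floorplanner, is that each component dissipates its bit energy times the traffic passing through it. The sketch below follows that reading. The formula, the function name and all numeric values are assumptions made for illustration, not values from the SUNMAP libraries; 1 MB is taken as 8e6 bits.

```python
def noc_power_mw(switch_traffic, link_traffic, switch_pj_per_bit,
                 link_pj_per_bit_mm):
    """Estimate NoC power in mW from average traffic.

    switch_traffic: {switch: traffic in MB/s}
    link_traffic:   {link: (traffic in MB/s, length in mm)}
    Bit energies are in pJ per bit (switches) and pJ per bit per mm (links).
    """
    def bits_per_s(mb_per_s):
        return mb_per_s * 8e6

    watts = sum(bits_per_s(t) * switch_pj_per_bit * 1e-12
                for t in switch_traffic.values())
    watts += sum(bits_per_s(t) * length * link_pj_per_bit_mm * 1e-12
                 for t, length in link_traffic.values())
    return watts * 1e3


# Hypothetical numbers: two switches and one 2 mm link, each carrying 500 MB/s.
power = noc_power_mw({"s0": 500.0, "s1": 500.0},
                     {("s0", "s1"): (500.0, 2.0)},
                     switch_pj_per_bit=1.0, link_pj_per_bit_mm=0.25)
```

With these made-up energies the switches contribute 8 mW and the link 2 mW, in line with the observation that switch power dominates link power.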
6. EXPERIMENTS AND CASE STUDIES
6.1 Experiments on Video Applications
We applied SUNMAP to two different video processing applications:
the Video Object Plane Decoder (VOPD, mapped
onto 12 cores) and the MPEG4 decoder (14 cores) [13]. The
core graphs of the VOPD and MPEG4 are presented in
Figures 3(a) and 7(a), respectively. The maximum link bandwidth
for the NoCs is conservatively assumed to be 500 MB/s.
The results of mapping VOPD onto various topologies are
presented in Figure 6. As seen from Figure 6(a), the butter-
ﬂy topology (4-ary 2-ﬂy) has the least communication delay
out of all topologies. The lower communication delay is due
to the fact that a 4-ary 2-ﬂy has 2 stages of switches, which
means an average delay of 2 hops for all communication.
Mesh, torus and hypercube networks have a higher average
hop delay as the least possible hop delay (that of adjacent
nodes) itself is two and it was not possible to place all com-
municating nodes adjacent to each other. As the clos net-
work has three stages, the average hop delay is three. The
area, power estimates for the topologies are presented in
Figures 6(c), 6(d). As seen from Figure 6(b), the butter-
ﬂy topology has the least number of switches, but has more
links when compared to mesh, torus or hypercube.
The large power savings achieved by the butterfly network
(Figure 6(d)) are attributed to the fact that there are
fewer switches and a smaller number of hops for communication.
Moreover, all the switches are 4x4, while the direct
topologies have 5x5 switches (for communicating with
four neighbors and the core). The average link length in
the butterfly network (obtained from the floorplanner) was
observed to be longer (around 1.5×) than the link lengths of
direct networks. However, as the link power dissipation is
much lower than the switch power dissipation, we get large
power savings for the butterfly network. The smaller number
of switches and smaller switch sizes also account for the
large area savings achieved by the butterfly network. Thus,
butterfly is the best topology for VOPD. The performance
gains for the butterfly over other topologies may be surprising,
but careful inspection reveals the reason: the butterfly
network trades off path diversity for fewer network switches
and a lower average hop delay. As the VOPD example has lower
bandwidth demands compared to MPEG4, we are able to satisfy
the bandwidth demands of the application using a butterfly
network.
Figure 6: Mapping Characteristics of VOPD for the mesh, torus,
hypercube, clos and butterfly topologies: (a) Avg hop delay (average
number of hops), (b) Resource Util (numbers of switches and links),
(c) Design Area (mm2), (d) Design Power (mW)
Figure 7: MPEG4 Mappings: (a) MPEG4 core graph, (b) Design Param
(b) Design Param, with columns Topo, av hops, des area (mm2) and
des pow (mW) and rows Mesh, Hyper, Torus, Bfly, Clos. Values as
printed, by column: av hops 2.49, 2.47, 2.48, 3.0; des area 62.51,
66.03, 67.05, 64.38; des pow 504.1, 546.7, 541.4, 445.4; the
butterfly row reads "No Feasible Mapping".
The results of mapping MPEG4 are presented in Figure 7(b).
As seen from the core graph of the MPEG4 decoder
(Figure 7(a)), the amount of communication between the
cores (such as to/from the shared SDRAM) is much higher
than what can be supported by minimum-path routing. All
topologies violate the bandwidth constraints for minimum-
path routing. So we apply multi-path routing, splitting the
traﬃc across many paths. As the butterﬂy network has no
path diversity, it is unable to support the communication
between the cores, and thus doesn’t produce any feasible
mapping for MPEG4. All other topologies produce feasible
mappings with split-traﬃc routing. As seen from the ﬁg-
ure, the torus network has lower communication hop delay
than the mesh as it utilizes more network resources. How-
ever, the mesh network has large savings in area and power
which overshadow the slightly higher communication delay
cost. Thus a mesh topology is more suitable for the MPEG4
than other topologies.
6.2 Experiments on Network Applications
We consider a network processor with 16 nodes, each node
having the architecture shown in Figure 8(a), obtained from
[6]. The objective of the communication architecture is to
provide low contention for the data transfer between the
nodes. As clos networks have maximum path diversity, they
have the least congestion for large data flows and are more
suitable for network applications. We validated the
need for clos networks by producing mappings onto various
topologies with relaxed bandwidth constraints and simulating
the resulting SystemC designs. We use traffic generators
to generate an adversarial traffic pattern for each topology.
As seen from Figure 8(b), where the average packet
latency is plotted against increasing traffic injection rate
(which in turn means that the network processor is processing
larger amounts of data), the clos clearly outperforms the other
topologies. Moreover, the area and average power dissipation of
a clos network (Figures 8(c), 8(d)) are only slightly higher
than those of the butterfly topology, justifying its use for
network processing applications.
Figure 8: Network Processing Application: (a) Node Arch (request
generator, scheduler, processor, memory, arbiter and switch),
(b) Latency of Communication (Avg Pack Lat (Cy) vs. Injection Rate
(flits/cycle) for Clos, Tor, Msh and Bfly), (c) Design Area (mm2),
(d) Design Power (mW)
6.3 Exploring Design Space of Topology
In this sub-section, we explore the MPEG4 mappings onto
a mesh topology. There are two ways in which a chosen
topology can be explored: ﬁrst is to evaluate the eﬀects of
various routing functions and second is to obtain a set of
Pareto points for the mappings from which the optimum
design point can be chosen, thereby performing area-power-
performance tradeoﬀs.
The minimum bandwidth required for different routing functions
(DO - Dimension Ordered, MP - Minimum-path, SM - Split-traffic
across Minimum-paths, SA - Split-traffic across All
paths) is shown in Figure 9(a). When the maximum available
link bandwidth is 500 MB/s, only split-traffic routing can
be used for mapping MPEG4. Figure 9(b) shows the Pareto
points (for area-power trade-offs) in the design space of the
mapping from which the optimum point can be chosen.
Figure 9: Design Space Exploration of a Topology: (a) Effect of
Routing (BW (MB/s) for DO, MP, SM and SA), (b) Area-Power Exploration
Figure 10: Mapping of DSP Application: (a) Filter Appln (cores ARM,
Memory, Display, FFT, IFFT and Filter), (b) Bfly Floorplan
(S = 3x3 switch), (c) SystemC Plots (Avg Pack Lat (Cy) for each
topology)
Figure 11: SystemC Snapshot
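The Pareto points used for the area-power exploration of Figure 9(b) are the mappings that no other mapping beats in both area and power. A minimal sketch of that filter follows; the (area, power) pairs are hypothetical and are not read from the figure.

```python
def pareto_front(points):
    """Keep the (area, power) points not dominated by any other point.

    A point q dominates p when q is no worse in both area and power
    and differs from p.
    """
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1]
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (area in mm2, power in mW) pairs for candidate mappings.
candidates = [(60.0, 520.0), (62.0, 500.0), (65.0, 505.0), (70.0, 480.0)]
front = pareto_front(candidates)
```

The point (65.0, 505.0) is dropped because (62.0, 500.0) is better in both dimensions; the remaining three points are the trade-off curve from which a designer would choose.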
6.4 DSP Application and SystemC Simulations
We applied the SUNMAP algorithm to a DSP Filter design
with six cores (see Figure 10(a)). The cores are modeled in
SystemC and the design is simulated at the transaction level.
The resulting core graph is used by SUNMAP, which produces
mappings onto the butterfly topology (Figure 10(b)).
Then the network components for the butterfly topology are
automatically generated and the resulting NoC design of
the DSP is simulated at the cycle-accurate and signal-accurate
level in SystemC. A snapshot of the SystemC simulation is
shown in Figure 11. We also generated the best mappings
onto other topologies for comparison purposes. The SystemC
simulation of all topologies is carried out, and the observed
average packet latency for the topologies is plotted in
Figure 10(c). As seen from the figure, the butterfly topology
indeed has the minimum latency.
For all these applications, NoC selection and generation
were completed in a few minutes on a 1 GHz Sun workstation.
The SystemC simulations were also checked for functional
and timing correctness, validating the output of SUNMAP.
7. CONCLUSIONS AND FUTURE WORK
Future Systems on Chips need scalable Networks on Chip
architectures to interconnect the cores. Selecting the most
suitable topology for an application, mapping of cores onto
that topology and generating the resulting network are important
phases in designing NoCs. In this paper we have presented
SUNMAP, a tool that automates all these steps, bridging
an important design gap in building NoCs. In the future,
we plan to enhance the tool with automatic heterogeneous
topology modeling and Quality-of-Service guarantees for
applications.
8. ACKNOWLEDGEMENTS
This research is supported by MARCO Gigascale Systems
Research Center (GSRC) and NSF (under contract CCR-
0305718).
9. REFERENCES
[1] M. Sgroi et al., "Addressing the System-on-a-Chip Interconnect
Woes Through Communication-Based Design", in Proc. Design
Automation Conference, 2001.
[2] L. Benini and G. De Micheli, "Networks on Chips: A New SoC
Paradigm", IEEE Computer, pp. 70-78, Jan. 2002.
[3] P. Guerrier, A. Greiner, "A generic architecture for on-chip packet
switched interconnections", DATE 2000, pp. 250-256, March
2000.
[4] S. Kumar et al., "A network on chip architecture and design
methodology", ISVLSI 2002, pp. 105-112, 2002.
[5] E. Rijpkema et al., "Trade-offs in the design of a router with
both guaranteed and best-effort services for networks on
chip", DATE 2003, pp. 350-355, Mar. 2003.
[6] F. Karim et al., "On-chip communication architecture for OC-768
network processors", Design Automation Conference, June 2001.
[7] X. Zhu, S. Malik, "A Hierarchical Modeling Framework for
On-Chip Communication Architectures", ICCD 2002, pp.
663-671, Nov. 2002.
[8] L. Jian et al., "aSOC: A Scalable, Single-Chip communications
Architecture", PACT 2000, pp. 37-46, Oct. 2000.
[9] D. Siguenza-Tortosa, J. Nurmi, "Proteo: A New Approach to
Network-on-Chip", in CSN 02, Sep. 2002.
[10] S.J. Lee et al., "An 800MHz Star-Connected On-Chip Network
for Application to Systems on a Chip", ISSCC 2003, Feb. 2003.
[11] A. Jantsch, H. Tenhunen, "Networks on Chip", Kluwer Academic
Publishers, 2003.
[12] S.J. Krolikoski et al., "Methodology and Technology for Virtual
Component Driven Hardware/Software Co-Design on the
System-Level", ISCAS 99, pp. 456-459, June 1999.
[13] E.B. Van der Tol, E.G.T. Jaspers, "Mapping of MPEG-4
Decoding on a Flexible Architecture Platform", SPIE 2002, pp.
1-13, Jan. 2002.
[14] A. Pinto et al., "Efficient Synthesis of Networks on Chip",
ICCD 2003, pp. 146-150, Oct. 2003.
[15] W.H. Ho, T.M. Pinkston, "A Methodology for Designing
Efficient On-Chip Interconnects on Well-Behaved
Communication Patterns", HPCA 2003, pp. 377-388, Feb. 2003.
[16] J. Hu, R. Marculescu, "Energy-Aware Mapping for Tile-based
NOC Architectures Under Performance Constraints", ASP-DAC
2003, Jan. 2003.
[17] "×pipes: a Latency Insensitive Parameterized Network-on-chip
Architecture For Multi-Processor SoCs", pp. 536-539, ICCD,
2003.
[18] "×pipesCompiler: A Tool For Instantiating Application Specific
Networks on Chips", Vol. 2, pp. 20884, DATE 2004.
[19] "Bandwidth Constrained Mapping of Cores onto NoC
Architectures", Vol. 2, pp. 20896, DATE 2004.
[20] J.G. Kim, Y.D. Kim, "A linear programming-based algorithm for
floorplanning in VLSI design", IEEE Transactions on CAD, Vol. 22,
Issue 5, pp. 584-592, May 2003.
[21] N. Sherwani, "Algorithms for VLSI Physical Design
Automation", Kluwer Academic Publishers, 1995.
[22] H.S. Wang et al., "Orion: A Power-Performance Simulator for
Interconnection Networks", MICRO, Nov. 2002.
[23] R. Ho, K. Mai, and M. Horowitz, "The Future of Wires",
Proceedings of the IEEE, pp. 490-504, April 2001.
