Bandwidth-Constrained Mapping of Cores onto NoC Architectures
Srinivasan Murali, Giovanni De Micheli
Computer Systems Lab
Stanford University
Stanford, California 94305
{smurali, nanni}@stanford.edu
Abstract
We address the design of complex monolithic systems, where processing
cores generate and consume a varying and large amount of data, thus
bringing the communication links to the edge of congestion. Typical
applications are in the area of multi-media processing. We consider a
mesh-based Network on Chip (NoC) architecture, and we explore the
assignment of cores to mesh cross-points so that the traffic on links
satisfies bandwidth constraints. Single-path deterministic routing
between the cores places high bandwidth demands on the links. The
bandwidth requirements can be significantly reduced by splitting the
traffic between the cores across multiple paths. In this paper, we
present NMAP, a fast algorithm that maps the cores onto a mesh NoC
architecture under bandwidth constraints, minimizing the average
communication delay. The NMAP algorithm is presented for both single
minimum-path routing and split-traffic routing. The algorithm is applied
to a benchmark DSP design, and the resulting NoC is built and simulated
at the cycle-accurate level in SystemC using macros from the ×pipes
library. Experiments with six video processing applications also show
significant savings in bandwidth and communication cost for the NMAP
algorithm when compared to existing algorithms.
Keywords: Systems on Chips, Networks on Chips,
cores, mapping, bandwidth, routing.
1 Introduction
Present and future Systems on Chip (SoC) are designed using preexisting
components such as processors, DSPs, and memory arrays [1], which we
call cores. The use of standard hardwired busses to interconnect these
cores is not scalable. To overcome this problem, Networks on Chips
(NoCs) have been proposed and used for interconnecting the cores
[2, 3, 4] and replacing dumb physical routing. The use of an on-chip
interconnection network has several advantages, including better
structure, performance and modularity.
In several application domains, such as multi-media processing, the
bandwidth requirement between the cores in SoCs is increasing. The
aggregate communication bandwidth between the cores is in the GBytes/s
range for many video applications. In the future, with the integration
of many applications onto a single device and with increased processing
speed of cores, the bandwidth demands will scale up to much larger
values [3]. As an example of a media processing application, a Video
Object Plane decoder [7] is shown in Figure 1. Each block in the figure
corresponds to a core and the edges connecting the cores are labeled
with bandwidth demands of the communication between them. As seen from
the figure, the bandwidth demands are in the order of hundreds of
MBytes/s.
Figure 1. Block diagram of Video Object Plane Decoder, with
communication BW (in MB/s).
Networks on Chip can be designed in different ways, according to the
network architecture and protocol choice. In this paper, we limit our
considerations to mesh/torus networks and to packet-switched data
transmission. Note that our techniques are not limited to mesh
topologies, but we keep this restriction to be more specific in this
paper. Packet switching leads to better link utilization. Nevertheless,
packet switching with single-path deterministic routing places high
bandwidth demands on the network links. Allocating higher bandwidth
across the links of the NoC dissipates more energy. Thus, it is
important to balance the bandwidth needs across the different links. In
this paper we apply packet switching with both single-path and
multi-path routing; in the latter, the traffic between two end-nodes is
split across many paths.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’04) 
1530-1591/04 $20.00 © 2004 IEEE 
The overall objective of this research is to show how to automatically
map cores to a network architecture. In particular, we consider
mesh/torus topologies, and the mapping of cores to their cross-points.
We describe a mapping algorithm called NMAP that satisfies the bandwidth
constraints of the NoC and minimizes the average communication delay.
The algorithm supports both single-minimum-path routing and
split-traffic routing. The mapping of cores is done at a high level of
abstraction (based on average traffic between the cores), so that fast
exploration of the design space can be performed. The SystemC code
generated by the tool can then be simulated to accurately evaluate the
chosen mapping. To validate our method, we apply the NMAP algorithm to a
DSP system designed in SystemC and we construct the network around the
cores using macros from the ×pipes library [9] that has parameterizable
SystemC components for network elements. A cycle accurate simulation of
the resulting NoC architecture validates our design approach. We also
apply the NMAP algorithm to six other video processing applications,
which show significant reduction in bandwidth requirements and
communication costs when compared to existing realizations.
2 Previous Works
We refer the reader to several recent surveys [2, 3, 4, 5] on NoCs for
pointers to recent research and development. This paper deals with a
specific graph embedding problem, which is intractable [10]. The mapping
of clusters onto the physical topology of processors has been studied in
the field of parallel processing [11, 12]. In [12], PMAP, a two-phase
mapping algorithm for placing clusters onto processors, is presented.
The mappings produced by the PMAP algorithm are shown to have lower
communication costs than mappings with previous algorithms.
The mapping of cores onto a NoC architecture presents new challenges
when compared to the mapping in parallel processing. A major difference
is that the traffic requirements on the links of a NoC are known for a
particular application, thus the bandwidth constraints in the NoC
architecture need to be satisfied by the mapping.
In [8], a branch and bound algorithm is proposed that maps cores onto a
tile-based NoC architecture satisfying the bandwidth constraints and
minimizing the total energy consumption. In our approach, we consider
the mapping problem together with the possibility of splitting traffic
among various paths, thus easing the satisfaction of bandwidth
constraints and providing a more efficient solution.
3 Methodology
As a starting point, we assume an application that needs to be mapped
onto a SoC populated by cores. Next, we assume that the application has
parallel kernels, and that the kernels have been associated with
processing cores. By means of static analysis or simulation, it is
possible to determine the average size of the messages exchanged among
cores and their frequency. Our problem is to map the cores onto a mesh
NoC so that the links support the desired message transfer. This paper
addresses only this problem, which is formalized in the next section.
The parallelization of the application and the assignment of kernels to
processors can be done with known methods (e.g. [6]) and are not
described in this paper.
4 Mathematical Formulation of the Mapping Problem
The communication between the cores of the SoC is represented by the
core graph:
Definition 1 The core graph is a directed graph $G(V, E)$, with each
vertex $v_i \in V$ representing a core and the directed edge
$(v_i, v_j)$, denoted as $e_{i,j} \in E$, representing the communication
between the cores $v_i$ and $v_j$. The weight of the edge $e_{i,j}$,
denoted by $comm_{i,j}$, represents the bandwidth of the communication
from $v_i$ to $v_j$.
The connectivity and link bandwidth of the NoC are represented by the
NoC topology graph:
Definition 2 The NoC topology graph is a directed graph $P(U, F)$, with
each vertex $u_i \in U$ representing a node in the topology and the
directed edge $(u_i, u_j)$, denoted as $f_{i,j} \in F$, representing a
direct communication between the vertices $u_i$ and $u_j$. The weight of
the edge $f_{i,j}$, denoted by $bw_{i,j}$, represents the bandwidth
available across the edge $f_{i,j}$.
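As an illustration of Definitions 1 and 2, the two graphs can be held in
plain dictionaries keyed by directed edge. The sketch below is not part
of the NMAP tool: the helper name make_mesh, the row-major node
numbering starting at 0 (the paper numbers the nodes u1 to u16), the
uniform link bandwidth of 500 MB/s and the three core-graph edges are
all assumptions made for the example.

```python
from typing import Dict, Tuple

Edge = Tuple[int, int]


def make_mesh(rows: int, cols: int, link_bw: float) -> Dict[Edge, float]:
    """Directed edges f_ij of a rows x cols mesh, each with weight bw_ij."""
    bw: Dict[Edge, float] = {}
    for r in range(rows):
        for c in range(cols):
            u = r * cols + c               # row-major node index (assumption)
            if c + 1 < cols:               # link to the right-hand neighbour
                bw[(u, u + 1)] = link_bw
                bw[(u + 1, u)] = link_bw
            if r + 1 < rows:               # link to the neighbour below
                bw[(u, u + cols)] = link_bw
                bw[(u + cols, u)] = link_bw
    return bw


# Core graph G(V, E): edge (v_i, v_j) -> comm_ij in MB/s (illustrative values).
core_graph: Dict[Edge, float] = {(0, 1): 200.0, (1, 2): 600.0, (2, 0): 200.0}

# NoC topology graph P(U, F) for a 16-node mesh.
mesh = make_mesh(4, 4, link_bw=500.0)
print(len(mesh), "directed links")
```

A 4x4 mesh has 24 physical links, so the dictionary holds 48 directed
edges, one per direction.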
The core graph of the decoder in Figure 1 is shown in Figure 2(a) and
the NoC graph for a 16-node mesh is shown in Figure 2(b). The mapping of
the core graph $G(V, E)$ onto the NoC topology graph $P(U, F)$ is
defined by the one-to-one mapping function $map$:
$$
map : V \to U, \quad \text{s.t.}\ map(v_i) = u_j, \quad \forall v_i \in V,\ \exists u_j \in U \tag{1}
$$
The mapping is defined when $|V| \le |U|$. An example mapping of the
decoder is shown in Figure 2(c). The communication between each pair of
cores (i.e. each edge $e_{i,j} \in E$) is treated as the flow of a
single commodity, represented as $d_k$, $k = 1, 2, \dots, |E|$. The
value of $d_k$ represents the bandwidth of communication across the edge
and is denoted by $vl(d_k)$. The set of all commodities is represented
by $D$ and is defined as:
$$
D = \left\{ d_k : vl(d_k) = comm_{i,j},\ k = 1, 2, \dots, |E|,\ \forall e_{i,j} \in E,\ \text{with}\ source(d_k) = map(v_i),\ dest(d_k) = map(v_j) \right\} \tag{2}
$$
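Equation 2 amounts to relabeling every core-graph edge with the mesh
nodes chosen by map. A minimal Python sketch follows, assuming a
dictionary representation of the core graph and of the mapping; the
names Commodity and build_commodities, the core names and the numeric
values are illustrative and do not come from the paper.

```python
from typing import Dict, List, NamedTuple, Tuple


class Commodity(NamedTuple):
    source: int    # mesh node map(v_i)
    dest: int      # mesh node map(v_j)
    value: float   # vl(d_k) = comm_ij


def build_commodities(core_graph: Dict[Tuple[str, str], float],
                      mapping: Dict[str, int]) -> List[Commodity]:
    """One commodity d_k per core-graph edge e_ij, as in Equation 2."""
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("map must be one-to-one")
    return [Commodity(mapping[vi], mapping[vj], comm)
            for (vi, vj), comm in core_graph.items()]


# Hypothetical three-edge core graph and a mapping onto mesh nodes 0..3.
core_graph = {("v1", "v2"): 600.0, ("v2", "v3"): 600.0, ("v4", "v1"): 200.0}
mapping = {"v1": 0, "v2": 1, "v3": 2, "v4": 3}
D = build_commodities(core_graph, mapping)
for d in D:
    print(d)
```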
Figure 2. Mapping of Core graph onto NoC graph: (a) Core graph, (b) NoC
graph, (c) Mapping.
The bandwidth constraints are represented by the inequality:
$$
\sum_{k=1}^{|E|} x^{k}_{i,j} \le bw_{i,j}, \quad \forall i, j \in \{1, 2, \dots, |U|\} \tag{3}
$$
where $x^{k}_{i,j}$ denotes the traffic of commodity $d_k$ routed across
the link $f_{i,j}$. For single minimum-path routing, the $x^{k}_{i,j}$
are obtained by the following equation:
$$
x^{k}_{i,j} =
\begin{cases}
vl(d_k), & \text{if } f_{i,j} \in Path(source(d_k), dest(d_k)) \\
0, & \text{otherwise}
\end{cases} \tag{4}
$$
where the set $Path(a, b)$ represents the set of links that form the
shortest path between the mesh nodes $a$ and $b$. For multi-path routing
with traffic splitting, the values of $x^{k}_{i,j}$ are obtained from
the following set of equations:
$$
\forall i: \quad \sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{i,Adj_i(j)} - \sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{Adj_i(j),i} = \sum_{k=1}^{|E|} flow_k \tag{5}
$$
where
$$
flow_k =
\begin{cases}
vl(d_k), & \text{if } source(d_k) = u_i \\
-vl(d_k), & \text{if } dest(d_k) = u_i \\
0, & \text{otherwise}
\end{cases} \tag{6}
$$
and $Adj_i$ is the set of adjacent mesh nodes of $u_i$. Equation 5
represents the conservation of flow, i.e. the sum of the flow coming
into a node and sourced by the node equals the sum of the flow going out
of the node and sunk at that node.
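To make Inequality 3 and Equations 4 to 6 concrete, the following Python
sketch routes two commodities on a 2x2 mesh along one fixed minimum path
each, then checks the bandwidth constraint and the net outflow of
Equation 5 at a node. Dimension-ordered routing (x first, then y) is
used here only as a simple stand-in for Path; the mesh size, the
commodity values and a single bandwidth shared by all links are
assumptions of the example.

```python
from collections import defaultdict

COLS = 2  # 2x2 mesh, nodes numbered row-major 0..3 (illustrative)


def xy_path(src, dst, cols=COLS):
    """One minimum path as a list of links: move along x first, then y."""
    links, cur = [], src
    while cur % cols != dst % cols:
        nxt = cur + (1 if dst % cols > cur % cols else -1)
        links.append((cur, nxt))
        cur = nxt
    while cur // cols != dst // cols:
        nxt = cur + (cols if dst // cols > cur // cols else -cols)
        links.append((cur, nxt))
        cur = nxt
    return links


def link_loads(commodities):
    """Sum over k of x^k_ij (Equation 4) for every used link (i, j)."""
    load = defaultdict(float)
    for src, dst, value in commodities:
        for link in xy_path(src, dst):
            load[link] += value
    return dict(load)


def satisfies_bandwidth(load, bw):
    """Inequality 3 with the same bandwidth bw on every link."""
    return all(flow <= bw for flow in load.values())


def net_outflow(load, node):
    """Left-hand side of Equation 5: flow leaving minus flow entering."""
    out = sum(f for (i, _), f in load.items() if i == node)
    inc = sum(f for (_, j), f in load.items() if j == node)
    return out - inc


commodities = [(0, 3, 100.0), (1, 3, 50.0)]   # (source, dest, vl(d_k))
load = link_loads(commodities)
print(load)
print(satisfies_bandwidth(load, 120.0), satisfies_bandwidth(load, 150.0))
print(net_outflow(load, 0), net_outflow(load, 3))
```

Both commodities share the link from node 1 to node 3, so that link
carries 150 and a 120 MB/s limit is violated. Node 3 sinks both
commodities, which gives a net outflow of -150, in agreement with
Equation 6.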
5 Mapping with Minimum-Path Routing
In this section, we present the mapping algorithm that uses minimum-path
routing between the cores in the mesh architecture. As the problem is
intractable, we use a heuristic approach that has three phases: an
initialization phase that computes an initial mapping, followed by the
second phase, where minimum path computations are performed. In the last
phase, the initial solution is iteratively improved by invoking the
second phase for each pair-wise swapping of vertices. In the
initialize() routine, the core that has maximum communication demand is
placed onto one of the mesh nodes with maximum number of neighbors. Then
for each core yet to be mapped, the core that communicates most with the
already mapped cores is selected. The core is placed onto the mesh node
that minimizes the communication cost with mapped cores. This best mesh
node is obtained by examining every available node in the mesh. The
procedure is repeated until all the cores are mapped.
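The initialization described above can be sketched in Python as follows.
This is an illustrative reading of the text rather than the authors'
implementation: the function signature, the tie-breaking (the first
candidate in row-major order wins) and the example traffic values are
assumptions.

```python
def initialize(comm, rows, cols):
    """Greedy initial placement.

    comm: dict {(core_a, core_b): bandwidth}, treated as undirected.
    Returns a dict core -> (row, col).
    """
    cores = sorted({c for edge in comm for c in edge})
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    if len(cores) > len(nodes):
        raise ValueError("more cores than mesh nodes")

    def traffic(a, b):
        return comm.get((a, b), 0.0) + comm.get((b, a), 0.0)

    def degree(node):
        r, c = node
        return sum(0 <= r + dr < rows and 0 <= c + dc < cols
                   for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))

    # Seed: core with the largest total demand, on a node with most neighbours.
    first = max(cores, key=lambda a: sum(traffic(a, b) for b in cores))
    mapping = {first: max(nodes, key=degree)}
    free = [n for n in nodes if n != mapping[first]]
    unmapped = [c for c in cores if c != first]

    while unmapped:
        # Core that communicates most with the already mapped cores.
        nxt = max(unmapped, key=lambda a: sum(traffic(a, m) for m in mapping))

        # Free node minimising bandwidth-weighted Manhattan distance.
        def cost(node):
            return sum(traffic(nxt, m) *
                       (abs(node[0] - p[0]) + abs(node[1] - p[1]))
                       for m, p in mapping.items())

        best = min(free, key=cost)
        mapping[nxt] = best
        free.remove(best)
        unmapped.remove(nxt)
    return mapping


comm = {("a", "b"): 100.0, ("b", "c"): 50.0, ("c", "d"): 10.0}
mapping = initialize(comm, rows=3, cols=3)
print(mapping)
```

In this example core b has the highest total demand and takes the centre
of the 3x3 mesh; every later core lands one hop from the core it talks
to most.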
The shortestpath() routine performs the minimum-path routing. The
commodities are sorted in decreasing order of the value of their flows.
For each commodity, a quadrant graph is formed between the source and
destination of the commodity, as the shortest path between the source
and destination lies within the quadrant between them. The shaded region
in Figure 2(c) is an example quadrant graph for the commodity with
source $v_{14}$ and destination $v_9$.
initialize(G(V,E), P(U,F)) {
  initialize Placed(W,H) to ∅;
  assign the vertex with max communication
    requirements in G(V,E) to max_s;
  assign the vertex with maximum neighbors in
    P(U,F) to max_t;
  map(max_s) = max_t;
  remove max_t from P, max_s from G and add
    max_t to Placed(W,H);
  while (|V| > 0) {
    assign the vertex in G with maximum comm
      with ∀ w_i ∈ W to next_s;
    for ∀ u_j ∈ P(U,F) and w_i ∈ Placed(W,H)
      commcost(u_j) += comm_{next_s, map^{-1}(w_i)}
                       × [xdist(u_j, w_i) + ydist(u_j, w_i)];
    assign u_j with minimum cost to next_t;
    map(next_s) = next_t;
    remove next_t from P, next_s from G and
      add next_t to Placed(W,H);
  }
  return (map, Placed(W,H));
}
Then, Dijkstra's shortest path algorithm is applied to the quadrant
graph and the minimum path is obtained. The edge weights are incremented
suitably and the procedure is repeated for each commodity in order.
After routing all commodities, if the bandwidth constraints in
Inequality 3 are satisfied, the cost of communication is calculated. The
communication cost is given by:
$$
commcost = \sum_{k=1}^{|E|} vl(d_k) \times dist(source(d_k), dest(d_k)) \tag{7}
$$
where $dist(a, b)$ is the minimum number of hops between nodes $a$ and
$b$. This is a heuristic procedure to find the minimum paths in the NoC.
Finding the shortest paths can also be formulated as an Integer Linear
Program (ILP), but the time taken by the ILP is of the order of minutes,
whereas the above procedure completes in a few seconds and the solution
obtained is experimentally observed to be within 10% of the solution
from the ILP.
shortestpath(Placed(W,H)) {
  initialize edge weights of Placed with total comm
    BW for adj nodes and maxvalue for others;
  sort commodities in D with decreasing comm costs;
  for each d_k ∈ D do {
    make quadrant graph Q(d_k) with source(d_k)
      and dest(d_k) as end vertices;
    Path(source(d_k), dest(d_k)) = minpath(Q(d_k));
    increase edge weights for edges in Path by vl(d_k);
  }
  if BW constraints are satisfied, find the
    comm cost and store it in cost;
  else assign maxvalue to cost;
  return (cost);
}
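A compact Python sketch of this routine is given below. It is a
simplified interpretation, not the authors' code: to guarantee
minimum-hop routes it only expands moves that approach the destination
(a directed version of the quadrant graph), it uses the bandwidth
already routed on a link as the Dijkstra edge weight, and it assumes one
bandwidth limit for all links. The function name and the example values
are hypothetical.

```python
import heapq


def shortest_path_cost(commodities, link_bw):
    """Route commodities one by one on least-loaded minimum paths.

    commodities: list of (src, dst, value), nodes given as (row, col).
    Returns (cost, load); cost follows Equation 7 and is None when some
    link exceeds link_bw.
    """
    load = {}                                   # (node, node) -> routed BW
    total = 0.0
    for src, dst, value in sorted(commodities, key=lambda d: d[2], reverse=True):
        step_r = (dst[0] > src[0]) - (dst[0] < src[0])
        step_c = (dst[1] > src[1]) - (dst[1] < src[1])
        dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                break
            if d > dist[u]:
                continue                        # stale heap entry
            moves = []
            if u[0] != dst[0]:
                moves.append((u[0] + step_r, u[1]))
            if u[1] != dst[1]:
                moves.append((u[0], u[1] + step_c))
            for v in moves:                     # only moves inside the quadrant
                nd = d + load.get((u, v), 0.0)
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        # Walk back from dst and add this commodity to every link used.
        hops, node = 0, dst
        while node != src:
            link = (prev[node], node)
            load[link] = load.get(link, 0.0) + value
            node, hops = prev[node], hops + 1
        total += value * hops
    feasible = all(x <= link_bw for x in load.values())
    return (total if feasible else None), load


commodities = [((0, 0), (0, 1), 120.0), ((0, 0), (1, 1), 100.0)]
cost, load = shortest_path_cost(commodities, link_bw=150.0)
print(cost, load)
```

The larger commodity is routed first and occupies the link from (0, 0)
to (0, 1); the second commodity then avoids that link and goes through
(1, 0), so no link carries more than 120.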
The routine mappingwithsinglepath() finds the best mapping obtained by
pair-wise swapping of vertices, invoking the shortestpath() routine
$O(|U|^2)$ times. The worst-case computational complexity of the entire
algorithm is $O(|U|^3 |E| \log |F|)$.
mappingwithsinglepath(G(V,E), P(U,F)) {
  S(A,B) = makeundirected(G(V,E));
  initialize(S(A,B), P(U,F));
  bestcommcost = shortestpath(Placed(W,H));
  assign Placed to Bestmapping;
  for i = 1 to |U| do {
    for j = i+1 to |U| do {
      assign Placed(W,H) to Ptemp(tW,tH),
        swapping vertices w_i and w_j;
      find commcost = shortestpath(Ptemp(tW,tH));
      if (commcost < bestcommcost) assign Ptemp to
        Bestmapping and commcost to bestcommcost;
    }
    assign Bestmapping to Placed;
  }
  return (bestcommcost, Bestmapping);
}
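The swap loop of the pseudocode can be sketched in Python as below. To
keep the example self-contained, the evaluation step uses the hop-count
cost of Equation 7 with Manhattan distance instead of a full
shortestpath() call; that substitution, the list-of-slots representation
(one entry per mesh node, None for an empty node) and all names are
assumptions of this sketch.

```python
def hop_cost(comm, slots, cols):
    """Equation 7 with dist taken as Manhattan distance on the mesh."""
    pos = {core: divmod(node, cols)
           for node, core in enumerate(slots) if core is not None}
    return sum(bw * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
               for (a, b), bw in comm.items())


def improve_by_swapping(comm, slots, cols, evaluate=hop_cost):
    """Pairwise-swap improvement over all mesh nodes, empty ones included."""
    placed = list(slots)
    best, best_cost = list(placed), evaluate(comm, placed, cols)
    for i in range(len(placed)):
        for j in range(i + 1, len(placed)):
            trial = list(placed)
            trial[i], trial[j] = trial[j], trial[i]
            cost = evaluate(comm, trial, cols)
            if cost < best_cost:
                best, best_cost = trial, cost
        placed = list(best)        # "assign Bestmapping to Placed"
    return best_cost, best


comm = {("a", "b"): 100, ("b", "c"): 10}
start = ["a", "c", "b", None]      # 1x4 mesh, one empty node
best_cost, best = improve_by_swapping(comm, start, cols=4)
print(best_cost, best)
```

Starting from a cost of 210, the loop ends with cores a and b adjacent
and a cost of 110, which is the minimum for this small graph.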
6 Mapping With Traffic Splitting
In this section, we present the NMAP algorithm with the traffic split
across multiple paths between the source and destination of each
commodity. In the first phase of the algorithm, an initial mapping is
obtained using the initialize() routine presented in Section 5. In the
next phase, mappings obtained by pairwise swapping of vertices are
evaluated to get a mapping that satisfies the bandwidth constraints.
Once such a mapping is obtained, mappings with pairwise swapping of
vertices are evaluated to find the mapping with the best cost.
In order to obtain a mapping that satisfies bandwidth constraints, the
following set of Multi-Commodity Flow (MCF) equations is evaluated. MCF1
is used to obtain a feasible mapping from the set of mappings obtained
by pairwise swapping of vertices.
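MCF1:
$$
\begin{aligned}
\min: \quad & \sum_{i=1}^{|U|} \sum_{j=1}^{|U|} s_{i,j} \\
\text{s.t.} \quad & \sum_{k=1}^{|E|} x^{k}_{i,j} - s_{i,j} \le bw_{i,j}, \quad \forall i, j \in \{1, 2, \dots, |U|\} \\
& \sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{i,Adj_i(j)} - \sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{Adj_i(j),i} = \sum_{k=1}^{|E|} flow_k, \quad \forall i \\
& x^{k}_{i,j} \ge 0, \quad s_{i,j} \ge 0, \quad \forall i, j, k
\end{aligned} \tag{8}
$$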
The $s_{i,j}$ are slack variables and signify the amount by which
bandwidth constraints are violated. By reducing the sum of slack
variables, mappings that reduce the amount by which bandwidth
constraints are exceeded are obtained. Once a mapping that satisfies
bandwidth constraints is obtained, a second set of MCF equations (MCF2)
is solved in order to obtain a mapping with the best cost:
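MCF2:
$$
\begin{aligned}
\min: \quad & \sum_{i=1}^{|U|} \sum_{j=1}^{|U|} \sum_{k=1}^{|E|} x^{k}_{i,j} \\
\text{s.t.} \quad & \sum_{k=1}^{|E|} x^{k}_{i,j} \le bw_{i,j}, \quad \forall i, j \in \{1, 2, \dots, |U|\} \\
& \sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{i,Adj_i(j)} - \sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{Adj_i(j),i} = \sum_{k=1}^{|E|} flow_k, \quad \forall i \\
& x^{k}_{i,j} \ge 0, \quad \forall i, j, k
\end{aligned} \tag{9}
$$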
where the objective is to minimize the total communication cost, which
is given by the sum of the flow of commodities through all the edges of
the NoC graph. To solve these multi-commodity flow equations, we use
lp_solve [14], a linear programming solver. The mappingwithsplitting()
routine implements these three phases.
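The quantities that MCF1 and MCF2 optimize can be illustrated on a
hand-made example without a solver. The Python sketch below evaluates
the two objectives for two fixed routings of a single 100 MB/s commodity
between opposite corners of a 2x2 mesh with 60 MB/s links. The routings
are written by hand, so this only evaluates the objectives for given
flows and does not perform the optimization done by the linear program;
all names and values are assumptions.

```python
def total_slack(flows, link_bw):
    """MCF1 objective for fixed flows: sum of s_ij = max(0, load_ij - bw_ij)."""
    load = {}
    for per_link in flows:                  # one dict per commodity: link -> x^k_ij
        for link, x in per_link.items():
            load[link] = load.get(link, 0.0) + x
    return sum(max(0.0, l - link_bw) for l in load.values())


def total_flow(flows):
    """MCF2 objective: sum of x^k_ij over all links and commodities."""
    return sum(x for per_link in flows for x in per_link.values())


# Nodes 0..3 in row-major order; the commodity goes from node 0 to node 3.
single = [{(0, 1): 100.0, (1, 3): 100.0}]
split = [{(0, 1): 50.0, (1, 3): 50.0, (0, 2): 50.0, (2, 3): 50.0}]

print(total_slack(single, 60.0), total_slack(split, 60.0))
print(total_flow(single), total_flow(split))
```

The single path overloads two links by 40 MB/s each, while the even
split meets the limit with zero slack. Both routings use only minimum
paths, so the MCF2 objective is the same in the two cases.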
For SoC applications that require low jitter (the time between the
delivery of adjacent packets), the traffic between the cores can be
split across multiple minimum paths, instead of all paths, so that the
packets traveling in the different paths have the same hop delay. To
achieve this, we can restrict $i$, $j$ in Equation 5 for each commodity
$d_k$ to lie within the quadrant formed by the source and destination of
$d_k$, i.e. Equation 5 can be rewritten as:
$$
\sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{i,Adj_i(j)} - \sum_{j=1}^{|Adj_i|} \sum_{k=1}^{|E|} x^{k}_{Adj_i(j),i} = \sum_{k=1}^{|E|} flow_k, \quad \forall i,\ Adj_i(j) \in Q(d_k) \tag{10}
$$
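The reason the quadrant restriction gives equal hop delay can be checked
by enumerating the minimum paths of one commodity. The short Python
sketch below lists every path that only moves toward the destination;
the function name and the example end points are hypothetical.

```python
def minimal_paths(src, dst):
    """All paths from src to dst that stay in their quadrant and never
    move away from dst; nodes are (row, col) tuples."""
    if src == dst:
        return [[src]]
    paths = []
    if src[0] != dst[0]:
        step = 1 if dst[0] > src[0] else -1
        paths += [[src] + p
                  for p in minimal_paths((src[0] + step, src[1]), dst)]
    if src[1] != dst[1]:
        step = 1 if dst[1] > src[1] else -1
        paths += [[src] + p
                  for p in minimal_paths((src[0], src[1] + step), dst)]
    return paths


paths = minimal_paths((0, 0), (2, 1))
for p in paths:
    print(len(p) - 1, "hops:", p)
```

For a source and destination that are two rows and one column apart
there are three such paths, and each one has exactly three hops.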
Splitting the traffic increases the size of the routing tables at each
node of the NoC. As each core communicates with only a few other cores,
even with split-traffic routing, the number of bits occupied by the
routing tables is less than 10% of the total number of bits for the
network buffers, a small overhead for the bandwidth savings obtained.
The results of the algorithms are given in the next section.
mappingwithsplitting(G(V,E), P(U,F)) {
  S(A,B) = makeundirected(G(V,E));
  initialize(S(A,B), P(U,F));
  bestslackcost = MCF1(Placed(W,H));
  assign maxvalue to bestcommcost;
  assign false to bwconstsatisfied;
  assign Placed to Bestmapping;
  if (bestslackcost == 0) {
    assign true to bwconstsatisfied;
    bestcommcost = MCF2(Placed(W,H));
  }
  for i = 1 to |U| do {
    for j = i+1 to |U| do {
      assign Placed(W,H) to Ptemp(tW,tH),
        swapping vertices w_i and w_j;
      if (bwconstsatisfied == false) {
        find slackcost = MCF1(Ptemp(tW,tH));
        if (slackcost == 0) assign Ptemp to Placed and
          Bestmapping, bwconstsatisfied to true;
        else if (slackcost < bestslackcost) assign slackcost
          to bestslackcost, Ptemp to Bestmapping;
      }
      else {
        find commcost = MCF2(Ptemp(tW,tH));
        if (commcost < bestcommcost) assign Ptemp to
          Bestmapping, commcost to bestcommcost;
      }
    }
    assign Bestmapping to Placed;
  }
  return (bestcommcost, Bestmapping);
}
7 Simulation Results
7.1 Experiments with Video Applications
We simulated the NMAP algorithms for the core graphs of six video
processing applications: MPEG4 decoder (mapped onto 14 cores), Video
Object Plane Decoder (OPD, 16 cores), Picture-In-Picture application
(PIP, 8 cores), Multi-Window Application (MWA, 14 cores), MWA with
Graphics (MWAG, 16 cores) and Dual Screen Display (DSD, 16 cores), the
last four being high-end video applications [15]. We also implemented
the PMAP algorithm [12], the greedy mapping algorithm (GMAP, the
algorithm for UBC calculation in [8]) and the partial branch-and-bound
algorithm (PBB) presented in [8]¹ for comparison.
Figure 3 shows the minimum communication cost for the applications with
the same bandwidth constraints for all algorithms. As seen from the
figure, NMAP and PBB perform well for all applications when compared to
the other algorithms. Figure 4 shows the minimum bandwidth needed to
satisfy the communication bandwidth demands of the applications with
dimension ordered routing for PMAP and GMAP (referred to as DPMAP and
DGMAP), single minimum-path routing for PMAP, GMAP and NMAP, routing
with traffic splitting across minimum paths for NMAP (NMAPTM) and
routing with traffic splitting across all paths for NMAP (NMAPTA)².
The graph shows a significant reduction in bandwidth needs with traffic
splitting. The ratio of the average cost and bandwidth requirements of
PMAP, GMAP and PBB to the cost and bandwidth requirements of NMAP (with
split-traffic routing) is given in Table 1. The NMAP algorithm results
in an average of 53% savings in bandwidth needs. For the same bandwidth
constraints, there is a 32% reduction in cost for the example
applications.
¹ We monitored the queue length as explained in [8] so that the PBB
algorithm ran for a few minutes.
Figure 3. Communication costs for six video applications for the
mapping algorithms. (Bar chart; y-axis: Comm Cost (hops * BW), 0 to
8000; x-axis: MPEG-4, OPD, PIP, MWA, MWAG, DSD; series: PMAP, GMAP, PBB,
NMAP.)
Table 1. Cost and BW Ratio
App      Cost ratio (cstr)   BW ratio (bwr)
mpeg4    1.61                2.35
opd      1.35                2.41
pip      1.10                2.00
mwa      1.52                2.07
mwag     1.51                1.80
dsd      1.73                2.13
Avg      1.47                2.13
Table 2. Communication Cost Ratio
Cores (no)   PBB     NMAP    Ratio (rat.)
25           7540    4892    1.54
35           11204   6959    1.61
45           21820   11820   1.85
55           28741   16987   1.69
65           41667   23649   1.76
For a small number of cores, PBB gives good performance, comparable to
NMAP, as a large part of the solution space is searched. As the number
of cores scales up, NMAP produces mappings that give a significant
reduction in communication cost when compared to PBB. Random graphs with
a large number of cores (the number of cores varied from 25 to 65) were
generated using the graph package LEDA [16]. Table 2 shows the savings
obtained by the NMAP algorithm when compared to PBB.
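The figures quoted in this subsection can be re-derived from the table
entries. The Python sketch below recomputes the ratio column of Table 2
as PBB cost divided by NMAP cost, and converts the average ratios of
Table 1 into savings using 1 - 1/r for a ratio r. Reading the ratios
this way is an interpretation that reproduces the 53% and 32% stated in
the text; the variable and function names are illustrative.

```python
# Table 2: number of cores -> (PBB cost, NMAP cost)
table2 = {25: (7540, 4892), 35: (11204, 6959), 45: (21820, 11820),
          55: (28741, 16987), 65: (41667, 23649)}
ratios = {n: round(pbb / nmap, 2) for n, (pbb, nmap) in table2.items()}


def saving(ratio):
    """Fractional saving of NMAP when the other algorithms need ratio times as much."""
    return 1.0 - 1.0 / ratio


bw_saving = saving(2.13)     # average BW ratio from Table 1
cost_saving = saving(1.47)   # average cost ratio from Table 1

print(ratios)
print(f"bandwidth saving: {bw_saving:.0%}, cost saving: {cost_saving:.0%}")
```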
7.2 DSP Filter Design
We applied the NMAP algorithm to a DSP Filter design with six cores
(refer to Figure 5(a)). The cores are modeled in SystemC and the design
is simulated at the transaction level. The resulting core graph is used
by the NMAP algorithm to produce a mapping onto the mesh NoC
architecture. Once the mapping is obtained, the network components
(routers, links, etc.) are added to the design. For this, we use the
×pipes library [9] that has parameterizable SystemC components for the
network elements. The NMAP algorithm has an interface to ×pipesCompiler
[13], so that the appropriate switches, links and network interfaces are
chosen and added to the cores. The resulting NoC design (refer to
Figure 5(b)) of the DSP is simulated at cycle-accurate and
signal-accurate level in SystemC.
² As the bandwidth for NMAP and PBB is the same for dimension ordered
and minimum-path routing for these examples, we only present it for
min path NMAP.
Figure 4. Bandwidth requirements for the algorithms. (Bar chart; y-axis:
BW (in MB/s), 0 to 1400; x-axis: MPEG4, OPD, PIP, MWA, MWAG, DSD;
series: DPMAP, DGMAP, PMAP, GMAP, NMAP, NMAPTM, NMAPTA.)
Figure 5. NoC Implementation of DSP: (a) DSP core graph with the cores
ARM, Memory, FFT, Filter, IFFT and Display and edge bandwidths of 200
and 600; (b) NoC Impl, with routers r1 to r6 and network interfaces NI1
to NI6; (c) Packet Latency Vs BW (y-axis: Avg Pack Lat (Cy), 30 to 55;
x-axis: BW of Links (GB/s), 1.1 to 1.8; series: Split, Minp).
Table 3. DSP NoC Design Results
Parameter                  Value
NI area                    0.6 mm^2
SW area                    1.08 mm^2
SW del                     7 cy
Pack. size                 64 B
minp BW                    600 MB/s
split BW                   200 MB/s
The average packet latency with single path and split-traffic routing
with varying link bandwidths obtained from the simulation is presented
in Figure 5(c). As the traffic is bursty in nature, we have contention
even when bandwidth constraints are satisfied. As seen from the figure,
the average latency is higher and also increases more sharply with
decrease in bandwidth for single path routing. This is because, in
single minimum-path routing the congestion on links is higher when
compared to the case where the traffic is split across many paths.
Moreover, use of wormhole flow control results in a non-linear increase
in latency (due to blocking of paths in case of contention, creating a
domino-effect) with decreasing link bandwidth. The NoC design parameters
are presented in Table 3.
8 Conclusions and Future Work
For efficient design of future SoCs, an automatic mapping of cores onto
NoCs is highly desirable. Towards this end, we have presented a fast
mapping algorithm satisfying the bandwidth constraints of a mesh NoC,
minimizing the average delay. By splitting the traffic in the NoC, we
obtain significant bandwidth and cost savings compared to existing
realizations. Our approach is validated by cycle-accurate simulation of
a DSP design modeled in SystemC to which NoC components are added from
the ×pipes library. The approach can be extended to map cores onto
various NoC topologies for fast and efficient design space exploration
for NoC topology selection.
9 Acknowledgements
This research is supported by MARCO Gigascale Systems Research Center
(GSRC) and NSF (under contract CCR-0305718).
References
[1] W. Cesario et al., "Component-Based Design Approach for Multicore
SoCs", DAC 2002, pp. 789-794, June 2002.
[2] L. Benini and G. De Micheli, "Networks on Chips: A New SoC
Paradigm", IEEE Computer, pp. 70-78, Jan. 2002.
[3] P. Guerrier and A. Greiner, "A generic architecture for on-chip
packet switched interconnections", DATE 2000, pp. 250-256, March 2000.
[4] S. Kumar et al., "A network on chip architecture and design
methodology", ISVLSI 2002, pp. 105-112, 2002.
[5] E. Rijpkema et al., "Trade-offs in the design of a router with both
guaranteed and best-effort services for networks on chip", DATE 2003,
pp. 350-355, Mar. 2003.
[6] The Cadence Virtual Component Co-design (VCC),
http://www.cadence.com/company/pr/09 25 00vcc.html
[7] E. B. Van der Tol and E. G. T. Jaspers, "Mapping of MPEG-4 Decoding
on a Flexible Architecture Platform", SPIE 2002, pp. 1-13, Jan. 2002.
[8] J. Hu and R. Marculescu, "Energy-Aware Mapping for Tile-based NOC
Architectures Under Performance Constraints", ASP-DAC 2003, Jan. 2003.
[9] M. Dall'Osso et al., "×pipes: A Latency Insensitive Parameterized
Network-on-chip Architecture For MPSoCs", ICCD 2003, pp. 536-539.
[10] M. Garey and D. Johnson, "Computers and Intractability",
W. H. Freeman, 1979.
[11] V. Lo et al., "OREGAMI: Tools for Mapping Parallel Computations to
Parallel Architectures", Intl. Journal of Parallel Programming, vol. 20,
no. 3, pp. 237-270, 1991.
[12] N. Koziris et al., "An Efficient Algorithm for the Physical Mapping
of Clustered Task Graphs onto Multiprocessor Architectures", Proc. of
8th EuroPDP, pp. 406-413, Jan. 2000.
[13] A. Jalabert et al., "×pipesCompiler: A tool for Instantiating
Application Specific Networks on Chip", Proc. DATE, 2004.
[14] ftp://ftp.es.ele.tue.nl/pub/lp_solve/
[15] E. G. T. Jaspers et al., "Chip-set for Video Display of Multimedia
Information", IEEE Trans. on Consumer Electronics, vol. 45, no. 3,
pp. 707-716, Aug. 1999.
[16] http://www.algorithmic-solutions.com/
