xpipesCompiler: A Tool for Instantiating Application Specific Networks on Chip by Jalabert, Antoine et al.
×pipesCompiler:
A tool for instantiating application speciﬁc Networks on Chip
Antoine Jalabert, CEA LETI-DSIS, antoine.jalabert@cea.fr
Srinivasan Murali, CSL, Stanford University, smurali@stanford.edu
Luca Benini, DEIS, Univ of Bologna, lbenini@deis.unibo.it
Giovanni De Micheli, CSL, Stanford University, nanni@stanford.edu
Abstract
Future Systems on Chips (SoCs) will integrate a large
number of processor and storage cores onto a single chip
and require Networks on Chip (NoC) to support the heavy
communication demands of the system. The individual com-
ponents of the SoCs will be heterogeneous in nature with
widely varying functionality and communication require-
ments. The communication infrastructure should optimally
match communication patterns among these components
accounting for the individual component needs. In this pa-
per we present ×pipesCompiler, a tool for automati-
cally instantiating an application-speciﬁc NoC for hetero-
geneous Multi-Processor SoCs. The ×pipesCompiler
instantiates a network of building blocks from a library of
composable soft macros (switches, network interfaces and
links) described in SystemC at the cycle-accurate level. The
network components are optimized for that particular net-
work and support reliable, latency-insensitive operation.
Example systems with application-speciﬁc NoCs built using
the ×pipesCompiler show large savings in area (factor
of 6.5), power (factor of 2.4) and latency (factor of 1.42)
when compared to a general-purpose mesh-based NoC ar-
chitecture.
Keywords: Systems on Chips, Networks on Chips,
latency-insensitive design, application-speciﬁc, SystemC.
1 Introduction
With increasing transistor density, the number of cores
on a chip and the communication demands between them are
rapidly increasing. System interconnect scalability is limited
for state-of-the-art SoC communication architectures
based on shared communication resources. Networks on
chip (NoC) architectures have been proposed to address the
scalability challenge [1, 2, 3, 8]. NoCs are scalable and
compatible with design and reuse of cores, which is a criti-
cal feature required by SoC designers to meet tight time-to-
market constraints [4].
An important design decision for NoCs is the choice
of topology. Several researchers [10, 11, 12, 13] envision
NoCs as regular topologies (such as mesh networks and
fat trees), which are suitable for interconnecting homogeneous
cores in a chip multiprocessor (Figure 1(a)). However,
many SoCs involve heterogeneous cores having varied
functionality, size and communication requirements. If a
regular interconnect is designed to match the requirements
of a few communication-hungry components, it is bound to
be largely over-designed with respect to the needs of the
remaining components. This is the main reason why most
current SoCs use irregular topologies like bridged buses
and/or dedicated point-to-point links [14].
Figure 1. Homogeneous CMPs and heterogeneous SoC applications: (a) homogeneous chip multiprocessor; (b) MPEG4 SoC
As an example, consider the implementation of an
MPEG4 decoder [5], depicted in Figure 1(b), where blocks
are drawn roughly to scale and links represent inter-block
communication. First, the embedded memory (SDRAM)
is much larger than all other cores and it is a critical com-
munication bottleneck. Block sizes are highly non-uniform
and the ﬂoorplan does not match the regular, tile-based
ﬂoorplan shown in Figure 1(a). Second, the total com-
munication bandwidth to/from the embedded SDRAM is
much larger than that required for communication among
the other cores. Third, many neighboring blocks do not
need to communicate. Even though it may be possible to
implement MPEG4 onto a homogeneous fabric, there is
a signiﬁcant risk of either under-utilizing many tiles and
links, or, at the opposite extreme, of achieving poor perfor-
mance because of local congestion. These factors motivate
the use of an application-speciﬁc on-chip network [15].
With an application-specific network the designer is
faced with the additional task of designing network components
(e.g., switches) with different configurations (e.g.,
different I/Os, virtual channels, buffers) and interconnecting
them with links of uneven length. These steps require
significant design time and the need to verify network components
and their communications for every design. In this
paper we describe a tool that bridges the design gap for
heterogeneous application-specific NoCs.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’04)
1530-1591/04 $20.00 © 2004 IEEE
Figure 2. NoC architecture block diagram (S: switch, NI: network interface)
The ×pipesCompiler automatically instantiates network
components (routers, links, network interfaces) for a
specific NoC topology, using the ×pipes library of SystemC
soft macros defined at the cycle-accurate and signal-accurate
level [6]. The ×pipes library is aggressively designed
for high performance: links can be pipelined to an arbitrary
degree to decouple clock cycle time from worst-case
link delay (this mode of operation has been called latency
insensitive in recent literature [9]). Latency insensitive
link-level error control is fully supported, ensuring robustness
against communication errors.
Even though ×pipes can be instantiated as a high-
performance regular NoC, its most innovative feature is that
all its components are highly parameterized, and they can be
tailored to the communication needs of a speciﬁc architec-
ture. Thus, the ×pipesCompiler can instantiate opti-
mized NoCs. Signiﬁcant improvements in area, power and
latency are achieved with respect to regular NoC architec-
tures, as demonstrated by several case studies detailed in the
paper.
2 The ×pipes Architecture
In this section we present a brief description of the archi-
tecture of switches, links and network interfaces that form
the ×pipes library. We refer the reader to [6] for a de-
tailed description of these components. A conceptual pic-
ture of the network architecture is shown in Figure 2. It sup-
ports packet-switched communication, with source routing
and wormhole flow-control. Source-based routing results
in lightweight switch implementations when compared to
dynamic routing, and wormhole flow-control results in
reduced buffering at each switch. Cores can be plugged into
the network provided they are OCP compliant [18], and the
network communication protocols are completely hidden from
the cores. The individual components are detailed below.
2.1 Network Interface
The Network Interface (NI) connects the core to the
NoC. It converts the end-to-end OCP transactions into packets
that are to be transmitted through the network. The NI
builds the packet header using the routing information for
the destination stored in a look-up table. The packet is broken
down into header and payload flits and the flits are injected
into the network at the rate of one flit every clock
cycle. The NI synchronizes the network requests with the
consuming rate of the core. To keep the interface complexity
low, the NI supports only a single outstanding read operation,
but an arbitrary number of write transactions can be
carried out after an outstanding read.
Figure 3. Pipelined architecture of a switch
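The packetization step of Section 2.1 can be illustrated with a minimal Python sketch. It is not the ×pipes implementation: the flit layout is defined in [6] and not reproduced here, so the `Flit` type, the single header flit, the `packetize` function and the example route are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Flit:
    kind: str     # "header" or "payload" (hypothetical encoding)
    data: object  # route for a header flit, a chunk of bytes for a payload flit


def packetize(dest: str, payload: bytes,
              lut: Dict[str, Tuple[int, ...]],
              flit_bytes: int = 4) -> List[Flit]:
    """Split one transaction into a header flit followed by payload flits."""
    # Source routing: the whole path (one output port per hop) comes from the
    # NI look-up table and travels in the header.
    route = lut[dest]
    flits = [Flit("header", route)]
    # Cut the payload into flit-sized chunks; one flit would be injected per cycle.
    for i in range(0, len(payload), flit_bytes):
        flits.append(Flit("payload", payload[i:i + flit_bytes]))
    return flits


# Hypothetical routing table entry: three hops to reach the "sdram" core.
lut = {"sdram": (2, 0, 1)}
flits = packetize("sdram", bytes(range(8)), lut)
for f in flits:
    print(f.kind, f.data)
```

With 32-bit flits (`flit_bytes = 4`), an 8-byte payload yields one header flit and two payload flits in this simplified model.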
2.2 Switch Architecture
Switches in the ×pipes library are deeply pipelined to
maximize the operating frequency. The pipeline structure
for a single output module is shown in Figure 3, and the
structure is repeated for each output. Forward ﬂow control
is used and a ﬂit is transmitted to the next switch only when
adequate storage is available in that switch. Switches sup-
port multiple virtual channels and a physical link is assigned
to different virtual channels on a ﬂit-by-ﬂit basis, thereby
improving network throughput. The CRC decoders for error
detection work in parallel with the switch operation, thereby
hiding their impact on switch latency.
For latency insensitive operation, the switch has virtual
channel registers to store 2N+M ﬂits, where N is the num-
ber of link pipeline stages and M is an architecture depen-
dent parameter (12 cycles in our design). The reason is that
each transmitted ﬂit has to be acknowledged before being
discarded from the buffer. Before an ACK is received, the
ﬂit has to travel across the link (N cycles), an ACK/NACK
decision has to be taken at the destination switch (a por-
tion of M cycles), the ACK/NACK signal has to be propa-
gated back (N cycles) and recognized by the source switch
(remaining portion of M cycles). During this time, other
2N + M ﬂits are transmitted but not yet ACKed.
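The buffer sizing rule above is simple arithmetic and can be written down directly. The sketch below assumes only what the text states (2N + M flits, with M = 12 in the authors' design); the function and constant names are illustrative.

```python
M_CYCLES = 12  # architecture-dependent term reported in the text


def vc_buffer_flits(link_stages: int, m_cycles: int = M_CYCLES) -> int:
    """Flits that can be in flight before the first ACK returns: 2N + M."""
    if link_stages < 0:
        raise ValueError("number of link pipeline stages must be non-negative")
    # N cycles forward, N cycles back for the ACK/NACK, M cycles of processing
    # split between the destination and source switches.
    return 2 * link_stages + m_cycles


for n in (2, 3, 5):
    print(f"N = {n}: {vc_buffer_flits(n)} flits")
```

For links with 2, 3 and 5 pipeline stages this gives 16, 18 and 22 flits respectively.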
2.3 Link Architecture
The links of an irregular NoC are of varying lengths,
and it takes a varying number of clock cycles to traverse
the links. In order to maximize throughput, the links are
subdivided into basic segments that require a single clock
cycle for traversal, making the links latency insensitive by
pipelining ﬂits through them. This pipelining is applied to
both data and control lines. As previously discussed, the
switches and network interfaces have enough buffering resources
and their functional correctness depends only on the
flit arriving order and not on timing. This ensures a correct
latency insensitive operation of the whole system.
Figure 4. NoC synthesis flow
2.4 Instantiation-time Parameters
All network components can be specialized through
instantiation-time parameters. Some parameters are global
and they affect all component instances. Global parameters
are: ﬂit size, address space of cores, degree of redundancy
(minimum distance) of the CRC error-detection code to be
used on the links, number of bits used for packet sequence
count (for end-to-end ﬂow control), maximum number of
hops between any two network nodes (used for header siz-
ing), number of ﬂit types (to support special ﬂits, for in-
stance control ﬂits with no payload), number of packet types
(which can be used by higher level protocols).
As for the local parameters affecting single network
component instances, we have the following. For the net-
work interface: number of data and address lines and max-
imum burst length in the OCP connection between NI and
the core, type of interface (master/initiator, slave or both),
ﬂit buffer size in the output port (which enables the network
interface to continue packetization even when the network
link is congested) and content of the routing table. For the
switch: number of ports, number of virtual channels, link
buffer size for each port (which relates to the number of
pipeline stages in the corresponding link). For each link,
the number of stages can obviously be speciﬁed.
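One way to picture the split between global and local parameters is as a set of plain records. The Python sketch below is only an organizational aid: the field names, types and example values are assumptions, and the actual library exposes these parameters through its SystemC components.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class GlobalParams:
    """Parameters shared by every component instance (names are hypothetical)."""
    flit_bits: int
    address_bits: int       # stands in for the address space of the cores
    crc_min_distance: int   # redundancy of the link error-detection code
    seq_count_bits: int     # packet sequence count, end-to-end flow control
    max_hops: int           # used for header sizing
    n_flit_types: int
    n_packet_types: int


@dataclass
class NIParams:
    data_lines: int
    addr_lines: int
    max_burst: int
    role: str               # "master", "slave" or "both"
    out_buffer_flits: int
    routing_table: Dict[str, Tuple[int, ...]] = field(default_factory=dict)


@dataclass
class SwitchParams:
    n_ports: int
    n_vcs: int
    link_buffer_flits: Tuple[int, ...]  # one entry per port


@dataclass
class LinkParams:
    n_stages: int


# Illustrative values only; none of them are taken from the paper's experiments.
g = GlobalParams(flit_bits=32, address_bits=32, crc_min_distance=2,
                 seq_count_bits=4, max_hops=8, n_flit_types=4, n_packet_types=4)
sw = SwitchParams(n_ports=3, n_vcs=4, link_buffer_flits=(18, 16, 22))
print(g.flit_bits, sw.n_ports, sw.link_buffer_flits)
```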
3 The ×pipesCompiler
The complete NoC design flow is depicted in Figure 4.
From the specification of an application, the designer (or
a high-level analysis and exploration tool) creates a high-level
view of the SoC floorplan, including nodes (with their
network interfaces), links and switches. Based on clock
speed target and link routing, the number of pipeline stages
for each link is also specified. The information on the
network architecture is specified in an input file for the
×pipesCompiler. Routing tables for the network interfaces
are also specified. The tool takes as additional input
the SystemC library of soft components described in the
previous section. The output is a SystemC hierarchical description,
which includes all switches, links, network nodes
and interfaces and specifies their topological connectivity.
The final description can then be compiled and simulated at
the cycle-accurate and signal-accurate level. At this point,
the description can be fed to back-end RTL synthesis tools
for silicon implementation (the details of the back-end synthesis
process are not covered in this paper).
.module_sw -NAME=SW_0 -NPORT=3 -NVC=4
.module_sw -NAME=SW_1 -NPORT=2 -NVC=2
.module_sw -NAME=SW_2 -NPORT=3 -NVC=2
.module_sw -NAME=SW_3 -NPORT=5 -NVC=3
.module_sw -NAME=SW_4 -NPORT=4 -NVC=2
.module_lnk -NAME=lnk_0_1 -MAP=SW_0,0;SW_1,2 -NREP=3
.module_lnk -NAME=lnk_1_0 -MAP=SW_1,2;SW_0,0 -NREP=3
.module_lnk -NAME=lnk_2_0 -MAP=SW_2,3;SW_0,1 -NREP=2
.module_lnk -NAME=lnk_0_3 -MAP=SW_0,2;SW_3,0 -NREP=5
.module_lnk -NAME=lnk_0_4 -MAP=SW_0,3;SW_4,1 -NREP=2
.module_lnk -NAME=lnk_4_0 -MAP=SW_4,1;SW_0,3 -NREP=2
Figure 5. Example input specification
In a nutshell, the ×pipesCompiler generates a set of
network component instances which are custom-tailored to
the specification contained in its input network description
file. Network instantiation follows a two-step procedure:
first, the input file is parsed into an internal data structure,
then the structure is traversed and the output is generated. In
the following subsections we describe these steps in more
detail.
3.1 Input Speciﬁcation and Parsing
The input ﬁle describes the cores, switches, links and
the relationships between them. From the designer’s view-
point, the implementation of the NI that connects a core to
the NoC is transparent. The ×pipesCompiler will in-
stantiate the needed NIs according to the type of the core
(Master, Slave, Master/Slave).
Example 1 Part of a simple input specification is shown in Figure 5.
Attributes of a switch (.module_sw) are: name (-NAME), number of I/O
ports (-NPORT) and the number of virtual channels for each output
port (-NVC). The attributes of a link (.module_lnk) are:
name (-NAME), the name and port number of the source and destination
switch for the link (-MAP), and the number of repeaters it
includes (-NREP).
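To make the format of Example 1 concrete, the following Python sketch reads the two directive types shown in Figure 5 into dictionaries. It covers only the attributes listed above; the rest of the input grammar (cores, global parameters, routing tables) is not shown in the paper and is not modelled, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def parse_directive(line: str) -> Tuple[str, Dict[str, str]]:
    """Split '.module_x -KEY=VALUE ...' into its kind and an attribute dict."""
    kind, *fields = line.split()
    attrs = {}
    for f in fields:
        key, _, value = f.lstrip("-").partition("=")
        attrs[key] = value
    return kind, attrs


def parse_spec(text: str):
    """Collect switches and links from a specification in the Figure 5 format."""
    switches, links = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, a = parse_directive(line)
        if kind == ".module_sw":
            switches[a["NAME"]] = {"nport": int(a["NPORT"]), "nvc": int(a["NVC"])}
        elif kind == ".module_lnk":
            # -MAP=<src switch>,<src port>;<dst switch>,<dst port>
            src, dst = (end.split(",") for end in a["MAP"].split(";"))
            links[a["NAME"]] = {"src": (src[0], int(src[1])),
                                "dst": (dst[0], int(dst[1])),
                                "nrep": int(a["NREP"])}
        else:
            raise ValueError(f"unknown directive: {kind}")
    return switches, links


SPEC = """\
.module_sw -NAME=SW_0 -NPORT=3 -NVC=4
.module_sw -NAME=SW_1 -NPORT=2 -NVC=2
.module_lnk -NAME=lnk_0_1 -MAP=SW_0,0;SW_1,2 -NREP=3
"""
switches, links = parse_spec(SPEC)
print(switches)
print(links)
```

Flat dictionaries are used here for brevity; the tool itself is described as building a tree data structure.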
While parsing the design description ﬁle, the
×pipesCompiler dynamically allocates a tree data
structure to store the necessary information for each object
of the design. Once the ﬁrst step has been executed and
no errors have been detected (like invalid parameters,
parameters missing, global parameters not deﬁned), the
×pipesCompiler processes the collected information,
as described in the following subsection.
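The kinds of errors mentioned above (invalid or missing parameters) could be detected with a check along the following lines. This is a sketch under the assumption that each directive accepts exactly the attributes named in Example 1; the real tool may accept more, and the message wording is invented.

```python
from typing import Dict, List

# Attributes per directive, taken from Example 1 only (assumed to be complete).
REQUIRED = {
    ".module_sw": ("NAME", "NPORT", "NVC"),
    ".module_lnk": ("NAME", "MAP", "NREP"),
}


def check_directive(kind: str, attrs: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems found in one directive."""
    if kind not in REQUIRED:
        return [f"invalid directive {kind}"]
    errors = []
    for key in REQUIRED[kind]:
        if key not in attrs:
            errors.append(f"{kind}: missing -{key}")
    for key in attrs:
        if key not in REQUIRED[kind]:
            errors.append(f"{kind}: invalid parameter -{key}")
    return errors


print(check_directive(".module_sw", {"NAME": "SW_0", "NPORT": "3"}))
print(check_directive(".module_lnk", {"NAME": "l", "MAP": "A,0;B,1", "NREP": "2"}))
```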
3.2 Network instantiation
The ×pipesCompiler identifies the different types
of switches, links and network interfaces that are required in
the design. Because all components are written in SystemC,
a single class is created for each type, and multiple objects
are instantiated whenever needed.
Figure 6. Irregular n/w optimizations (switch types shown: sw2x2 2vc, sw3x3 2vc, sw3x3 4vc, sw4x4 2vc, sw5x5 3vc; annotated buffer sizes: 16, 22 and 18 flits)
Figure 7. Communication pattern of example designs: (a) VOPD; (b) MPEG4; (c) MWD
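The idea of one class per component type with several instances can be sketched as a grouping of switches by configuration. In the Python fragment below the first five switches are those of Figure 5; `SW_5` is a hypothetical extra switch added only to show two instances sharing a type, and the type-name format is an assumption modelled on the labels in Figure 6.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (number of ports, number of virtual channels) per switch.
switches: Dict[str, Tuple[int, int]] = {
    "SW_0": (3, 4), "SW_1": (2, 2), "SW_2": (3, 2),
    "SW_3": (5, 3), "SW_4": (4, 2),
    "SW_5": (2, 2),  # hypothetical, not part of Figure 5
}


def type_name(nport: int, nvc: int) -> str:
    return f"sw{nport}x{nport}_{nvc}vc"


def group_by_type(sw: Dict[str, Tuple[int, int]]) -> Dict[str, List[str]]:
    """Map each distinct switch configuration to the instances that use it."""
    groups = defaultdict(list)
    for name, (nport, nvc) in sorted(sw.items()):
        groups[type_name(nport, nvc)].append(name)
    return dict(groups)


groups = group_by_type(switches)
for tname, instances in groups.items():
    print(tname, instances)
```

Each key would correspond to one generated class and each list entry to one object of that class.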
During network instantiation, the ×pipesCompiler
performs several optimizations to remove redundant logic
from the generic library components. For example, if a
switch has only an input link connected to a port, the logic
and buffers of the missing output port will not be gen-
erated. In the case of a custom-made irregular design,
this is a very valuable optimization that drastically reduces
hardware complexity. The optimized network component
classes are dynamically stored into arrays of structures.
For each object type, a recursive function processes the
tree of ﬁles from the leaves to the main object ﬁle (the root),
parsing each ﬁle and customizing it according to its type. It
is also during this step that the routing tables for each NI are
processed and converted into ﬁles that are to be included.
The top level (main.cc) of the design is then generated.
This ﬁle instantiates all objects of the design at run-time. It
also deﬁnes the signals needed to connect the objects ac-
cording to the design description ﬁle. Here again, if pos-
sible, the ×pipesCompiler will share the signals that
are common to all objects. In order to automate tracing of
signals, the debug command has been implemented, which
enables monitoring of any signal in the design.
Example 2 Referring to Example 1, Figure 6 shows how the network
is optimized by removal of redundant logic during instantiation.
Because port[1] of SW_0 has only an input connection,
the ×pipesCompiler will not instantiate a PortOUT Block
that would be connected to this output. Similar optimization is performed
for an output-only port like port[3]: in each PortOUT
Block generated, the signals related to this non-existing input are
removed from the generic template files. The ×pipesCompiler
will also optimize the size of each output buffer according to the
number of repeaters on each output link.
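The pruning and buffer sizing of Example 2 can be reproduced on the links listed in Figure 5 with a short Python sketch. It assumes that the -NREP value plays the role of N in the 2N + M rule of Section 2.2 (with M = 12), and it sees only the partial specification of Figure 5; the function name is illustrative.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[str, Tuple[str, int], Tuple[str, int], int]

# (name, (source switch, port), (destination switch, port), NREP), from Figure 5.
links: List[Link] = [
    ("lnk_0_1", ("SW_0", 0), ("SW_1", 2), 3),
    ("lnk_1_0", ("SW_1", 2), ("SW_0", 0), 3),
    ("lnk_2_0", ("SW_2", 3), ("SW_0", 1), 2),
    ("lnk_0_3", ("SW_0", 2), ("SW_3", 0), 5),
    ("lnk_0_4", ("SW_0", 3), ("SW_4", 1), 2),
    ("lnk_4_0", ("SW_4", 1), ("SW_0", 3), 2),
]
M_CYCLES = 12  # architecture-dependent term from Section 2.2


def port_usage(switch: str, all_links: List[Link]) -> Tuple[Dict[int, int], Set[int]]:
    """Return {output port: buffer size in flits} and the set of used input ports."""
    outputs: Dict[int, int] = {}
    inputs: Set[int] = set()
    for _, (src, sport), (dst, dport), nrep in all_links:
        if src == switch:
            outputs[sport] = 2 * nrep + M_CYCLES  # output buffer sized per link
        if dst == switch:
            inputs.add(dport)
    return outputs, inputs


outputs, inputs = port_usage("SW_0", links)
print("output modules:", outputs)
print("input ports:", sorted(inputs))
```

For SW_0 no output module is generated for port 1, in line with Example 2, and the three output buffers come out at 18, 22 and 16 flits, the same values that appear as buffer size annotations in Figure 6.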
The execution time of the ×pipesCompiler depends
on the design characteristics. For example, a regular
topology such as a 16x16 mesh can be generated faster
than an application-specific topology with only a few cores
and switches. In absolute terms, however, execution time is not
a major concern. For instance, for a custom design with 4
switches and 2 cores the execution time is about 2 s on a Pentium
running at 1.8 GHz. For all our experiments, network
generation was completed within a few minutes.
4 Experimental Validation and Case Studies
We consider three video processing applications:
Video Object Plane Decoder (VOPD), MPEG4 Decoder
(MPEG4) and Multi-Window Displayer (MWD), presented
in [5, 7], that are mapped onto cores. The communication
characteristics of these applications are shown in Figure 7,
with the edges annotated with the amount of data transferred
between the cores in MB/s. We manually developed cus-
tomized application-speciﬁc topologies that closely match
the applications' communication characteristics. For comparison,
we also developed a regular mesh NoC for each application.
The application designs with different NoC configurations
are shown in Figures 8-10 (see footnote 1).
4.1 Application-Speciﬁc NoC Characterization
In the VOPD, about half the cores communicate with more
than one other core. This motivates the configuration of
the custom NoC in Figure 8(b), which has fewer than half as
many switches as the regular mesh NoC. These
switches are also much smaller than the mesh switches. In the
MPEG4 design considered, many of the cores communicate
with each other through the shared SDRAM, so a large
switch is used for connecting the SDRAM with other cores
(Figure 9(b)), with smaller switches for the other cores. We
also consider an alternate custom NoC for MPEG4 (Figure 9(c)),
which is an optimized mesh network with superfluous
switches and switch I/Os removed. In the communication
pattern of MWD, all but two cores communicate with
only one other core, so the custom NoC for MWD (Figure 10(b))
has only two switches.
4.2 Area-Power Analysis
We developed analytical models for estimating the
switch areas. The area calculations include the crossbar
area, buffer area and logic (including control) area. The models
take into account the nuances of individual switch configurations
and include fine-grained details (such as pipeline
registers and crosspoints).
Footnote 1: For clarity, we show link pipelining only in Figure 8 and we assume
NI as part of core in the experiments.
Figure 8. Video Object Plane Decoder: (a) mesh NoC (switch legend: s1 3x3, s2 4x4, s3 5x5); (b) application-specific NoC (s1 3x3, s2 3x2, s3 2x3)
Figure 9. MPEG4 Decoder: (a) mesh NoC (s1 3x3, s2 4x4, s3 5x5); (b) application-specific NoC1 (s3 3x3, s8 8x8); (c) application-specific NoC2 (s1 5x5, s2 3x3, s3 4x4)
Figure 10. Multi-Window Displayer: (a) mesh NoC (s1 3x3, s2 4x4, s3 5x5); (b) application-specific NoC (s1 3x3, s2 3x2)
The area estimates for the various NoC configurations
in 0.1 µm technology are shown in Table 1. As the custom VOPD
and MWD NoCs (generated by the ×pipesCompiler)
have a relatively small number of switches, we obtain a
significant area improvement for the custom NoC. For
the MPEG4, however, each core communicates with many other
cores, so we have many switches and obtain only a 1.69× area
improvement with the custom NoCs (the mean of the two custom
designs in Table 1). Averaged over the three applications, we
obtain 6.54× area savings for the custom NoCs when compared
to the mesh NoC.
We used ORION [16], a power modeling tool, to develop
bit energy models for the switches. We use wiring
parameters from [17] to estimate link power dissipation. We
calculate the power dissipation for each NoC design based
on the average traffic (shown as edge annotations in Figure 7)
through each network component for a supply voltage
of 1.8 V. The results of the power analysis are summarized in
Table 2. For VOPD and MWD we obtain power savings of
roughly a factor of three, whereas for the MPEG4 the power
savings are smaller. The reason for lower savings in MPEG4
is that most of the traffic traverses the bigger switches connected
to the memories. As power dissipation in a switch
increases non-linearly with switch size, there is
more power dissipation in the switches of custom NoC1 of
MPEG4 (which has an 8x8 switch) than in the mesh
NoC. However, most of the traffic traverses short links in this
custom NoC (Figure 9(b)), thereby giving marginal power
savings for the whole design.

Table 1. Area estimates
appl      type   area (mm2)   ratio (mesh/cust)
vopd      mesh   1.26
          cust   0.22         5.73
mpeg4-1   mesh   1.31
          cust   0.86         1.52
mpeg4-2   mesh   1.31
          cust   0.71         1.85
mwd       mesh   1.22
          cust   0.10         12.2

Table 2. Power estimates
appln     topol  power (mW)   ratio (mesh/cust)
vopd      mesh   108.74
          cust   40.08        2.71
mpeg4-1   mesh   114.36
          cust   110.66       1.03
mpeg4-2   mesh   114.36
          cust   93.66        1.22
mwd       mesh   25.9
          cust   7.72         3.35
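The summary figures quoted in the text and the abstract can be recovered from the ratio columns of Tables 1 and 2. The Python sketch below shows one averaging scheme that reproduces them: the two MPEG4 designs are averaged first, then the three applications. That this is how the authors computed the averages is an inference from the numbers, not something the paper states.

```python
from statistics import mean
from typing import Dict, List

# Mesh/custom ratios copied from Table 1 (area) and Table 2 (power).
area: Dict[str, List[float]] = {"vopd": [5.73], "mpeg4": [1.52, 1.85], "mwd": [12.2]}
power: Dict[str, List[float]] = {"vopd": [2.71], "mpeg4": [1.03, 1.22], "mwd": [3.35]}


def average_ratio(ratios: Dict[str, List[float]]) -> float:
    """Average per application first, then across applications."""
    return mean(mean(values) for values in ratios.values())


print(f"area:  {average_ratio(area):.2f}x")
print(f"power: {average_ratio(power):.1f}x")
```

Under this scheme the area average rounds to 6.54 and the power average to 2.4, matching the factors of 6.5 and 2.4 given in the abstract.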
4.3 SystemC Simulation Results
We performed cycle-accurate simulation of the SystemC
models of the NoCs generated by the ×pipesCompiler.
We used traffic generators to model the bursty nature of
the application traffic, with average communication bandwidth
matching the applications' average communication
bandwidth. Snapshots of SystemC simulations of mesh and
custom NoCs for some of the cores of VOPD are shown
in Figure 11(a). The time between transmission of a flit
and its reception, which includes the switch delay, link delay
and contention delay, is marked in the figure. The
variation of average packet latency (for 64B packets, 32-bit
flits and 7-cycle switch delay) with link bandwidth is
shown in Figures 11(b)-11(d). Application-specific NoCs
have lower packet latency as the average number of switch
and link traversals is lower. Moreover, the latency increases
more rapidly for the mesh NoCs with decrease in bandwidth.
With the custom NoCs we achieve an average of
30% savings in latency (measured at the minimum plotted
BW value). Also, custom NoCs have better link utilization
as seen in Figure 11(e).
Figure 11. Simulation results: (a) SystemC output; (b) VOPD average packet latency (cycles) vs. BW (GB/s), custom and mesh; (c) MPEG4 average packet latency (cycles) vs. BW (GB/s), Cust1, Cust2 and mesh; (d) MWD average packet latency (cycles) vs. BW (MB/s), custom and mesh; (e) link utilization ratio (custom/mesh) for VOPD, MPG1, MPG2 and MWD
5 Conclusions and Future Work
In this paper we have presented ×pipesCompiler,
a tool that automatically instantiates application-speciﬁc
NoCs. The tool bridges the design gap in building
application-speciﬁc NoCs that optimally match the commu-
nication requirements of the system. The network compo-
nents built are highly optimized for the particular NoC de-
sign, providing large savings in area, power and latency for
example designs. In the future we plan to enhance the tool
with automatic selection of network topology, thus provid-
ing a complete design ﬂow for heterogeneous NoCs.
6 Acknowledgements
This research is supported by MARCO Gigascale Sys-
tems Research Center (GSRC) and NSF (under contract
CCR-0305718).
References
[1] L. Benini, G. De Micheli, “Networks on Chips: A New SoC Paradigm”, IEEE Computer, pp. 70-78, Jan. 2002.
[2] A. Jantsch, H. Tenhunen, “Networks on Chip”, Kluwer Academic Publishers, 2003.
[3] E. Rijpkema et al., “Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip”, DATE 2003, pp. 350-355, Mar. 2003.
[4] W. Cesario et al., “Component-Based Design Approach for Multi-Core SoCs”, DAC 2002, pp. 789-794, June 2002.
[5] E. B. Van der Tol, E. G. T. Jaspers, “Mapping of MPEG-4 Decoding on a Flexible Architecture Platform”, SPIE 2002, pp. 1-13, Jan. 2002.
[6] M. Dallosso et al., “×pipes: a Latency Insensitive Parameterized Network-on-chip Architecture For Multi-Processor SoCs”, ICCD 2003, pp. 536-539.
[7] E. G. T. Jaspers et al., “Chip-set for Video Display of Multimedia Information”, IEEE Trans. on Consumer Electronics, Vol. 45, No. 3, pp. 707-716, Aug. 1999.
[8] F. Karim et al., “An Interconnect Architecture for Network Systems on Chips”, IEEE Micro, Vol. 22, No. 5, pp. 36-45, Sep. 2002.
[9] L. P. Carloni, K. L. McMillan, A. L. Sangiovanni-Vincentelli, “Theory of latency-insensitive design”, IEEE Trans. on CAD of ICs and Systems, Vol. 20, No. 9, pp. 1059-1076, Sept. 2001.
[10] P. Guerrier, A. Greiner, “A generic architecture for on-chip packet switched interconnections”, Proc. DATE, pp. 250-256, March 2000.
[11] S. Kumar et al., “A network on chip architecture and design methodology”, ISVLSI 2002, pp. 105-112, 2002.
[12] S. J. Lee et al., “An 800MHz Star-Connected On-Chip Network for Application to Systems on a Chip”, ISSCC 2003, Feb. 2003.
[13] J. Hu, R. Marculescu, “Energy-Aware Mapping for Tile-based NOC Architectures Under Performance Constraints”, ASP-DAC 2003.
[14] H. Yamauchi et al., “A 0.8 W HDTV video processor with simultaneous decoding of two MPEG2 MP@HL streams and capable of 30 frames/s reverse playback”, ISSCC, Vol. 1, pp. 473-474, Feb. 2002.
[15] H. Zhang et al., “A 1V Heterogeneous Reconfigurable DSP IC for Wireless Baseband Digital Signal Processing”, IEEE Journal of SSC, Vol. 35, No. 11, pp. 1697-1704, Nov. 2000.
[16] H. S. Wang et al., “Orion: A Power-Performance Simulator for Interconnection Networks”, MICRO, Nov. 2002.
[17] R. Ho, K. Mai, M. Horowitz, “The Future of Wires”, Proceedings of the IEEE, pp. 490-504, April 2001.
[18] OCP specification, http://www.ocpip.org/home
