Design Technologies for Networks-on-Chip by De Micheli, Giovanni
Design Technologies for
Networks-on-Chip
Giovanni De Micheli
EPFL
Federico Angiolini, Srinivasan Murali,
Luca Benini, David Atienza, Antonio Pullini
Intel’s 80-tile NoC
Technology: 65 nm CMOS process
Interconnect: 1 poly, 8 metal (Cu)
Transistors: 100 million
Die area: 275 mm²
Tile area: 3 mm²
Package: 1248-pin LGA, 14 layers, 343 signal pins
Vangal et al., ISSCC 2007
Application domains
• Multiprocessors on chip
  - Homogeneous fabric
  - Designed for performance
  - General purpose
• Application-specific SoCs
  - Heterogeneous structure
  - QoS and power constraints
  - Domain-specific software
Wireless Networks: Mesh nodes, Picocells
Picochip PC205 (Apr ’06)
• 260 MHz, 31 GMAC/s, 160 GIP/s
• 64 KB I- and D-cache, 128 KB SRAM
• Less than 5 W, less than $1/GMAC
[Figure: PC205 block diagram with 249 16-bit PEs, a GP core, and I/Os]
Embedded SoC Trend
Heterogeneous clusters
Multi-hop interconnect
• Roadmap continues: 90 → 65 → 45 nm
• “Traditional” bus-based SoCs fit in one tile!
Architecture Evolution
• Communication demand is staggering, but unevenly distributed because of architectural heterogeneity
[Figure: architecture evolution, showing PEs with SRAM, DRAM and I/O; a PE with a local memory hierarchy, CPU and I/O; peripherals; 3D stacked main memory]
Interconnect Bottleneck
[Figure: power consumption of a unidirectional link (38 bits + flow control)]
• 65 nm low-power library
• Low-Vt library, high VDD: power/performance tradeoff
  - Very high frequencies or very long links are infeasible
  - But even some feasible links burn up to 30 mW!
  - Heavy buffer insertion
• High-Vt library, low VDD: absolute minimum power
  - Even at 250 MHz, links longer than 2 mm are infeasible
Addressing Interconnect Issues
• High-end industrial solutions:
  - Evolutionary path from shared buses
  - Protocol evolutions: AMBA AHB to AMBA AXI
  - Topology evolutions: AMBA AHB to AMBA AHB ML
• Challenges
  - Complexity: how to analyze and verify “spaghetti interconnects”?
  - Scalability: a bus is bandwidth-limited, a crossbar is size-limited
  - Predictability: how to tie interconnects to floorplanning?
[Figure: several interconnected AHB buses]
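The scalability challenge can be made concrete with a first-order model. The sketch below is only illustrative: it assumes a shared bus whose total bandwidth is split evenly among the attached cores, and a full crossbar whose size is counted in crosspoints. The 3200 MB/s bus bandwidth is a hypothetical figure, not taken from the slides.

```python
def bus_bandwidth_per_core(total_bus_bw_mb_s: float, n_cores: int) -> float:
    """Fair share of a single shared bus: the bus is bandwidth-limited."""
    return total_bus_bw_mb_s / n_cores


def crossbar_crosspoints(n_initiators: int, n_targets: int) -> int:
    """Crosspoints of a full crossbar: the crossbar is size-limited."""
    return n_initiators * n_targets


if __name__ == "__main__":
    TOTAL_BUS_BW = 3200.0  # MB/s, hypothetical value for illustration only
    for n in (4, 8, 16, 32):
        share = bus_bandwidth_per_core(TOTAL_BUS_BW, n)
        xpoints = crossbar_crosspoints(n, n)
        print(f"{n:2d} cores: {share:7.1f} MB/s per core on a bus, "
              f"{xpoints:4d} crosspoints in a crossbar")
```

Per-core bus bandwidth falls as 1/N while crossbar size grows as N², which is the tension that motivates a multi-hop network.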
The Network-on-Chip Paradigm
[Figure: a NoC of switches connecting DSP, DRAM, DMA, CPU, Accel and MPEG cores, each attached through a network interface (NI)]
The “power of NoCs”:
• Clean separation at session layer
  - Cores issue end-to-end transactions
  - Network deals with transport, network, link, physical
• Modularity at HW level: only 2 building blocks
  - Network interface
  - Switch (router)
• Physical design aware (floorplan, global routing)
Scalability is supported from the ground up!
SoC and NoC Characteristics
• Typical applications targeted by SoCs
  - Complex
  - Highly heterogeneous (component specialization)
  - Communication intensive
  - Tailor-made interconnects for applications
• NoCs are resource constrained:
  - Power, area constraints: low buffering available
  - Large available wire bandwidth
  - But tapping it with modular, structured design is key
New design challenges
• From multiprocessor field
  - Assigning tasks to processors
  - Synchronization, consistency, coherency
• Networking
  - Network topology, routing, flow control
  - Quality of Service (QoS) needs
• VLSI
  - Floorplan in 2D, wire lengths
  - Power, area, performance
An integrated design approach is crucial ...
[Figure: CPU, DSP and memory cores attached through network interfaces to switches joined by links]
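The Big Picture
[Figure: applications (tasks T1, T2, T3) go through HW/SW co-design to an abstract architecture of PEs, memories (M) and IO; NoC design then yields a NoC architecture in which the PEs, memories and IO are connected by a NoC]
Orthogonalize computation from communication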
Why Design Automation?
• Large design space, several steps
1. Capturing application traffic
   • How to capture?
   • How to account for burst, jitter?
   • What about multiple applications?
2. What topology?
3. Mapping?
4. Routes to use?
- Resource constrained: power, area
- Large wire bandwidth: tapping it with modular design is key
More Steps!
5. Tuning communication architecture parameters (link width, buffer sizes)
6. Verification for correctness, performance
7. Build simulation, synthesis, emulation models
8. Reliable operation under unreliable conditions
Automating and integrating the steps is essential!
Should ensure design closure (fast time-to-market)
Layered Design Flow
Each level lists its design phases, models/effects, and key issues.
• High-level specification
  - Design phases: topology design, mapping, routing, refining architecture parameters
  - Models/effects: analytical models, static effects, large solution space
  - Key issues: accurate traffic modeling, performance and power modeling
• Stochastic packet-level simulation
  - Design phases: buffer sizing, arbitration policy, dynamic routing
  - Models/effects: dynamic, fast C++ simulations, stochastic traffic
  - Key issues: traffic generator models, accurate network models
• Transaction simulation
  - Design phases: further refinement of architecture parameters, key topology changes
  - Models/effects: dependencies in communication
  - Key issues: reflecting cycle accuracy, speed
• Cycle-accurate simulation
  - Design phases: performance test, very few architecture or topology changes
  - Models/effects: completely accurate
  - Key issues: speed, FPGA emulation
Research Teams
[Figure: research teams: Tampere University of Technology; Stanford; Princeton; KTH, Sweden; University of Bologna; Technion; Brazil; University of Cagliari]
All omissions are purely accidental ...
SunFloor Design Flow
[Figure: SunFloor design flow. Inputs shown: communication characteristics, user objectives, user constraints, NoC area models, NoC power models. Blocks shown: SUNFLOOR topology synthesis (includes floorplanner and NoC router), system specs, platform generation (xpipesCompiler) with the xpipes NoC component library, SystemC code, RTL simulation, FPGA emulation, synthesis, placement & routing, to fab. Also shown: application, codesign simulation, IP core models.]
Front-End Design
• Design application-specific custom topologies
Synthesize best topology for application
• Objectives: power, performance (hop delay)
• Constraints: performance, power, bandwidth
• Tune NoC frequency to match needs
• Design deadlock-free network
• Consider timing constraints early in design cycle
• Use accurate floorplan information
Achieves design closure, bridging design gaps across different steps
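One common way to check the "deadlock-free network" requirement for deterministic wormhole routing is to build a channel dependency graph from the chosen routes and verify that it has no cycle. The sketch below illustrates that general technique; it is not taken from the SunFloor tool, and the route and link representations (a route is a list of directed links, a link is a pair of switch ids) are assumptions made for the example.

```python
from collections import defaultdict


def channel_dependency_graph(routes):
    """Add an edge a -> b whenever some route uses link b right after link a."""
    cdg = defaultdict(set)
    for route in routes:
        for a, b in zip(route, route[1:]):
            cdg[a].add(b)
    return cdg


def has_cycle(graph):
    """Depth-first search with three colours; a grey successor means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every node starts WHITE

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))


# Hypothetical example: four switches in a ring, every route goes two hops clockwise.
ring_routes = [
    [(0, 1), (1, 2)],
    [(1, 2), (2, 3)],
    [(2, 3), (3, 0)],
    [(3, 0), (0, 1)],
]

if __name__ == "__main__":
    print("ring routes can deadlock:", has_cycle(channel_dependency_graph(ring_routes)))
    print("without the last route:", has_cycle(channel_dependency_graph(ring_routes[:3])))
```

An acyclic dependency graph is a sufficient condition, so a synthesis tool could reject or reroute any candidate route that would close a cycle.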
Input Models
• Traffic Models
[Figure: core graph with nodes vld, rld, isn, iqn, idct, arm, ups, vpr, pad, vpm, smm, a-d; edges are labelled with bandwidth (MB/s): 16, 27, 49, 70, 94, 300, 313, 313, 353, 357, 362, 362, 500]
• Consider bursty traffic, criticality of streams
• Obtained from initial simulations, application knowledge
• Hardware monitors to obtain traffic characteristics
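A core graph of this kind can be held as a dictionary of directed edges annotated with bandwidth. The sketch below is a minimal illustration: the core names and edges are hypothetical placeholders (the actual edges of the slide's graph are not recoverable here), and the frequency estimate assumes one flit of `link_width_bits` per cycle with no protocol overhead and 1 MB = 10^6 bytes.

```python
from collections import defaultdict

# Hypothetical core graph: (source, destination) -> bandwidth in MB/s.
core_graph = {
    ("core_a", "core_b"): 70,
    ("core_b", "core_c"): 362,
    ("core_c", "mem"): 500,
    ("mem", "core_a"): 16,
}


def per_core_bandwidth(graph):
    """Sum outgoing and incoming bandwidth for every core."""
    totals = defaultdict(lambda: {"out": 0, "in": 0})
    for (src, dst), bw in graph.items():
        totals[src]["out"] += bw
        totals[dst]["in"] += bw
    return dict(totals)


def min_link_frequency_mhz(bandwidth_mb_s, link_width_bits):
    """Lowest link clock (MHz) that carries the given bandwidth, one flit per cycle."""
    return bandwidth_mb_s * 8 / link_width_bits


if __name__ == "__main__":
    for core, bw in per_core_bandwidth(core_graph).items():
        print(core, bw)
    heaviest = max(core_graph.values())
    print("heaviest flow needs at least",
          min_link_frequency_mhz(heaviest, 32), "MHz on a 32-bit link")
```

Such a lower bound gives a starting point for tuning the NoC frequency before burstiness and stream criticality are taken into account.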
Back-End Flow
[Figure: xpipes back-end flow. Stages and tools shown: fabric instantiation (xpipesCompiler, using the xpipes library and topology specs, giving the topology in SystemC); architectural simulation (cycle-accurate simulation platform, traffic generators); HDL translation (RTL SystemC Converter, giving the topology HDL); verification and power modeling (Mentor ModelSim, Synopsys PrimePower); fabric synthesis (Synopsys Physical Compiler, tech library, giving the topology netlist); place & route (Cadence SoC Encounter, giving the topology floorplan). Outputs shown: traffic logs, architectural statistics, performance figures, power figures, area figures.]
Æthereal Design Flow
Architecture Specification
[Figure: architecture.xml, showing the blocks input, memory, output, filter1 and filter2 with their ports (p1, p2, c1)]
[Kees Goossens, NXP]
Application specification
[Figure: communication.xml, showing the same blocks with initiator ports, target ports, and the direction of data flow]
[Kees Goossens, NXP]
NOC design flow
• Split large optimisation problem into smaller pieces
  - UMARS: mapping, path allocation, time slot allocation
• May fail (feedback)
• Back annotation
[Figure: NOC design flow. Inputs: application, IP blocks, constraints. Steps: topology synthesis/selection, mapping, path allocation, time slot allocation. Outputs: NOC hardware (RTL synthesis &c) and NOC software (configuration, binaries, compilation &c). Verification, simulation and validation produce results that feed back to improve the design.]
[Kees Goossens, NXP]
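The time slot allocation step can be illustrated with a small greedy sketch. It assumes a TDM scheme in which every link has a slot table of the same size and a flit that uses slot s on one link uses slot s+1 (modulo the table size) on the next link of its path. This is a simplified illustration of slot-table allocation in general, not the UMARS algorithm; the function name and data layout are hypothetical.

```python
def allocate_slots(tables, path, slots_needed, table_size):
    """Greedily reserve `slots_needed` starting slots along `path`.

    tables maps a link name to the set of slot indices already reserved on it.
    Returns the chosen starting slots, or None if the request cannot be met.
    """
    chosen = []
    for start in range(table_size):
        free_everywhere = all(
            (start + hop) % table_size not in tables.setdefault(link, set())
            for hop, link in enumerate(path)
        )
        if free_everywhere:
            chosen.append(start)
            if len(chosen) == slots_needed:
                break
    if len(chosen) < slots_needed:
        return None  # failure: the flow has to go back to an earlier step
    for start in chosen:
        for hop, link in enumerate(path):
            tables[link].add((start + hop) % table_size)
    return chosen


if __name__ == "__main__":
    tables = {}
    print(allocate_slots(tables, ["l0", "l1"], 2, 4))  # first connection
    print(allocate_slots(tables, ["l1"], 2, 4))        # shares link l1
    print(allocate_slots(tables, ["l1"], 1, 4))        # l1 is now full
```

The `None` result corresponds to the "may fail (feedback)" case: when no slots are left, an earlier step such as path allocation or mapping has to be revisited.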
UMARS: Multiple applications
• SoCs typically support multiple applications
• Applications can run in parallel: compound modes
• UMARS supports multiple applications
  - Supports NoC reconfiguration across compound modes
[Figure: applications over time: phone, MP3 download, take pictures, view pictures, MP3 play, standby & roam, ringtone download]
[Kees Goossens, NXP]
Several NoC CAD efforts
Nostrum simulation environment
NoC buffering with queueing theory [Hu]
OEDIPUS design system [Ahonen]
Case Study 1: Custom vs Regular NoCs
SUNFLOOR vs Manual Design
On the 30-core multimedia benchmark
[Figure: hand-mapped topology and SUNFLOOR custom topology. P: processors, M: private memories, T: traffic generators, S: shared slaves. A processor-memory cluster is marked; the legend distinguishes bi-directional and uni-directional links.]
Design Layouts
[Figure: layouts from Cadence SoC Encounter: hand design (custom mesh) and SUNFLOOR design]
SUNFLOOR vs Hand-Mapped
Hand-mapped design (0.13 µm technology):
• Topology: 5x3 mesh (15 switches)
• Operating frequency: 885 MHz (post-layout)
• Power consumption: 368 mW
• Floorplan area: 35.4 mm²
• Design time: weeks
SunFloor design (0.13 µm technology):
• Topology: custom (8 switches)
• Operating frequency: 885 MHz (post-layout)
• Power consumption: 277 mW (-25%)
• Cell area: 37 mm² (+4%)
• Design time: 4 hours, design to layout
Benchmark execution times comply with application requirements and are even 10% better on the SunFloor topology.
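The quoted percentages can be checked directly from the figures on this slide. The short sketch below only redoes that arithmetic; note that the slide compares a floorplan area for the hand-mapped design with a cell area for SunFloor, so the area comparison is taken as stated rather than as a like-for-like measure.

```python
def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


power_change = percent_change(277, 368)  # SunFloor vs hand-mapped power, mW
area_change = percent_change(37, 35.4)   # SunFloor vs hand-mapped area, mm^2

if __name__ == "__main__":
    print(f"power: {power_change:+.1f}%")
    print(f"area:  {area_change:+.1f}%")
```

The results are consistent with the rounded -25% and +4% shown on the slide.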
Application        Topology   Power (mW)   Avg. nr. hops
MWD (12 cores)     Custom     20.53        1.15
                   Mesh       90.17        2.00
                   Opt-mesh   38.60        2.00
VOPD (12 cores)    Custom     30.00        1.33
                   Mesh       95.94        2.00
                   Opt-mesh   46.48        2.00
MPEG4 (12 cores)   Custom     27.24        1.50
                   Mesh       96.82        2.17
                   Opt-mesh   60.97        2.17
VPROC (42 cores)   Custom     79.64        1.67
                   Mesh       301.8        2.58
                   Opt-mesh   136.1        2.58
Custom vs Regular Topologies
• On average, SunFloor custom topologies have:
  - 2.75x less power consumption
  - 1.55x less hop delay
• Despite the large design space, the maximum run time is a few hours, for VPROC
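The averages can be approximately reproduced from the per-application results of this case study. The sketch below assumes the values pair up as (power in mW, average hops) for the Custom, Mesh and Opt-mesh topologies of each application, and it averages the regular-to-custom ratio over both regular topologies. How the slide computed its averages is not stated, so this is one plausible reading.

```python
# application -> topology -> (power in mW, average number of hops)
results = {
    "MWD":   {"Custom": (20.53, 1.15), "Mesh": (90.17, 2.00), "Opt-mesh": (38.60, 2.00)},
    "VOPD":  {"Custom": (30.00, 1.33), "Mesh": (95.94, 2.00), "Opt-mesh": (46.48, 2.00)},
    "MPEG4": {"Custom": (27.24, 1.50), "Mesh": (96.82, 2.17), "Opt-mesh": (60.97, 2.17)},
    "VPROC": {"Custom": (79.64, 1.67), "Mesh": (301.8, 2.58), "Opt-mesh": (136.1, 2.58)},
}


def mean_ratio(metric_index):
    """Mean of regular/custom ratios over all applications and both regular topologies."""
    ratios = [
        topologies[regular][metric_index] / topologies["Custom"][metric_index]
        for topologies in results.values()
        for regular in ("Mesh", "Opt-mesh")
    ]
    return sum(ratios) / len(ratios)


power_ratio = mean_ratio(0)
hop_ratio = mean_ratio(1)

if __name__ == "__main__":
    print(f"mean power ratio: {power_ratio:.2f}x")
    print(f"mean hop ratio:   {hop_ratio:.2f}x")
```

Under this reading the ratios come out close to the 2.75x and 1.55x quoted on the slide.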
Case Study 2: Technology Scaling Effects
Effect of Technology Scaling
[Figure: VOPD core graph (VLD, RLD, INV SCAN, STRIPE MEM, IQUANT, ACDC PRED, VOP MEM, PAD, VOP REC, UP SAMP, IDCT, ARM), replicated once per decoded stream and connected to shared memories MEM 1 and MEM 2; edge labels 126 and 540]
• Decoding 2 streams: dVOPD
• Decoding 3 streams: tVOPD2
• x2 bandwidth scaling: dVOPD2
Network Synthesis Results
Library   Switch count   Largest switch   Total NoC power   Avg. head flit latency   Frequency
65nm HP   10             7x7              196.40 mW         4.35 cycles [3,9]        800 MHz
65nm HP   6              7x6              129.36 mW         4.24 cycles [3,7]        800 MHz
65nm LP   -              -                -                 -                        -
65nm HP   4              10x9             81.96 mW          3.42 cycles [3,5]        400 MHz
90nm LP   -              -                -                 -                        -
90nm HP   4              10x9             175.88 mW         3.42 cycles [3,5]        400 MHz
Designs labelled on the slide: dVOPD, dVOPD2, tVOPD2
Observations:
• Lower power in 65 nm for the same design
• 65 nm supports 2x bandwidth, at lower power!
• The NoC for a big design (38 cores) operates at 800 MHz
• With increasing application bandwidth or number of cores, more switches are needed (due to the frequency limit of the switches)
Case Study 3: NoCs for low power applications?
Low Bandwidth & Power Application
Parallel Encryption Engine
• 18 cores
[Figure: ARM 0 to ARM 7 with Private Memory 0 to Private Memory 7, plus a shared memory, a semaphore device and an interrupt device; link labels: 180 MB/s; All Links: 1.8 MB/s]
Library   Switch count   Largest switch   Total NoC power   Avg. head flit latency   Frequency
65nm LP   5              9x9              3.1 mW            4.38 cycles [3,7]        50 MHz
65nm HP   2              11x11            4.72 mW           3.94 cycles [3,5]        50 MHz
90nm LP   2              11x11            4.1 mW            3.94 cycles [3,5]        50 MHz
90nm HP   2              11x11            10.4 mW           3.94 cycles [3,5]        50 MHz
Energy efficiency: 2.2 Gb/s per mW → 2.5x better than the high-performance NoC
Custom Topology Layout
Conclusions
• Design flows and CAD tools are critical for NoCs
• Layered design flow
  - Tackle problems from several levels
• Several key steps
  - Traffic analysis, mapping, topology design, routing, ...
• Integrated approach is critical
  - Interact with existing back-end tools
• Fertile ground for more R&D work:
  - Run-time configurability
  - Robustness w.r.t. static/dynamic variations, errors
  - Tackle floorplan and layout issues
