A global wire planning scheme for Network-on-Chip. by Liu, J. et al.
A Global Wire Planning Scheme for Network-on-Chip 
J. Liu, L-R. Zheng, D. Pamunuwa, H. Tenhunen
Laboratory of Electronics & Computer Systems (LECS) 
Royal Institute of Technology (KTH) 
Electrum 229, SE-164 40 Kista, Sweden 
{jianliu, lrzheng, dinesh, hannu}@imit.kth.se
Abstract 
As technology scales down, the interconnect for on-chip global 
communication becomes the delay bottleneck. In order to 
provide well-controlled global wire delay and efficient global 
communication, a packet switched Network-on-Chip (NoC) 
architecture was proposed by different authors [1][2]. In this 
paper, the NoC system parameters constrained by the 
interconnections are studied. Predictions on scaled system 
parameters such as clock frequency, resource size, global 
communication bandwidth and inter-resource delay are made for 
future technologies. Based on these parameters, a global wire 
planning scheme is proposed. 
1. Introduction 
Interconnect has been the major design constraint in deep sub-micron
circuits. The downscaled wire size and increased aspect ratio,
combined with higher signal speed, cause many signal integrity
challenges and timing closure problems. Traditionally, these issues
are tackled mainly from an electrical design point of view.
Recent studies show that the problem can also be addressed with
interconnect-centric system architectures [1][2]. One such
emerging architecture is the Network-on-Chip (NoC). The NoC 
architecture is a packet switched network on a single chip [1][2]. 
It scales from a few dozen to several hundred or even
thousands of resources. A resource may be a processor core, a 
DSP core, an FPGA block, a dedicated HW block, or a memory 
block. Any kind of inter-resource information is sent in packets 
over the network. The structured network wiring gives well- 
controlled electrical parameters and enables reusing of building 
blocks. Clearly, any topology that fully connects the resources 
can be used for the network. However, a two-dimensional mesh 
topology turns out to be simple and effective [2][3]. Thus, the 
following study will be based on this specific topology. 
The NoC uses a backbone to provide a reliable and efficient 
communication platform for user-specified resources. The NoC 
backbone consists of resources and switches organized in a two- 
dimensional mesh, as shown in Figure 1. A data packet from one 
resource is first passed to the switch attached to the resource. The 
switch then routes the packet onto the appropriate link. 
As the NoC is targeted to future DSM and nanometer 
technologies, the following questions are interesting: what is the 
appropriate size of each synchronous resource; how many 
resources can be integrated in one chip in future technologies; 
how fast can signals travel from one resource to another through 
the on-chip communication network and how to plan the wires to 
get an optimal data bandwidth with limited wire resource. In this 
paper, we study the NoC system parameters constrained by the 
interconnections and answer the above questions. In section 2, 
we use empirical rules to derive the gate delays for future DSM 
technologies, which is followed by an estimation of the 
maximum clock frequency and the corresponding resource size. 
In section 3, the inter-resource delay is studied and a global wire 
planning scheme providing maximum bandwidth is proposed. 
The NoC is a typical interconnect-centric architecture, which 
means that the wire planning is the first design step. In this early 
planning stage, detailed system parameters for the wires are often 
unknown, making it impractical to consider layout-related
properties such as 3D multilayer interconnections. Therefore, a
simpler wire model is used below. When the planning is done 
and various requirements on the wires, such as delay and noise 
level, are determined, a dynamic interconnect model can be used 
to generate a wire structure meeting these requirements in later 
design phases. One dynamic interconnect model using 3D 
capacitance, resistance and inductance is described in [4]. 
Similar CAD tools like Magma’s FixedTiming [www.magma-da.com]
are also emerging commercially.
Figure 1. The 2D-mesh backbone of the NoC, with 
switches (S) and resources (R). 
2. NoC Interconnect Fabric Optimization
The performance of interconnections is a major concern in scaled 
technologies. Under scaling, the gate delay decreases. However, 
the global wires do not scale in length since they communicate 
signals across the chip. For these wires, the delay per unit length 
can be kept constant if optimal repeaters are used [5]. In the NoC,
we assume that all global wires are reserved for global 
communications and semi-global/local wires are used within a 
resource. 
2.1. Technology Scaling and Gate Delay 
Since four is the typical average gate connectivity, the “fan-out-of-four
inverter delay”, or simply FO4, is a reasonable parameter to
0-7803-7761-3/03/$17.00 02003 IEEE IV-892 
Authorized licensed use limited to: Lancaster University Library. Downloaded on May 07,2010 at 14:10:44 UTC from IEEE Xplore.  Restrictions apply. 
be used for measuring gate delays. As the name suggests, an FO4
is the delay through an inverter driving four identical copies of
itself. Ron Ho [5] pointed out that, historically, gate delays have
scaled linearly with technology, and an accurate model of recent FO4
delays has been $360 \cdot L_{gate}$ ps under typical and
$500 \cdot L_{gate}$ ps under worst-case environmental conditions,
where $L_{gate}$ is the gate length in µm. After studying today's
existing nanometer scale devices, he also predicts that this trend will
continue for future generations of transistors, which means
$500 \cdot L_{gate}$ ps is a lower limit for future FO4 delays. This model
of gate delay will be used later when estimating clock cycle time
and comparing with wiring delays.
2.2 Clock Cycle Analysis 
A resource in a NoC can run at different speeds. To study how the
clock cycle within a NoC resource scales with the gate delay, we
first examine the relationship between clock cycle and FO4
delay. The recent Pentium 4 microarchitecture and the aggressive
Compaq/DEC Alpha chips have 14 to 16 FO4s per clock cycle.
Older processors, for example the Pentium Pro/II, run at 20 to 40
FO4s per clock cycle. This shows that the number of FO4s
in a clock cycle decreases as the technology scales down.
Extrapolating historical data would lead to 6-8 FO4s per clock
cycle within a few generations [5]. However, such fast-cycling
machines pose many difficulties. With 6-8 FO4s per clock cycle,
a clock skew of a few FO4s would be extremely hard to manage.
Furthermore, generating a clock of 8 FO4s per clock cycle is a
difficult task since the rise and fall of a clock wave each take
more than 2 FO4s to fully transition. With these difficulties in
consideration, a clock cycle of 20 FO4s is projected for a
cost-performance NoC resource and 10 FO4s for a high-performance
one. Thus, with 0.05-µm technology, the clock cycle becomes
$20 \cdot 500 \cdot 0.05 = 500$ ps for a cost-performance NoC resource,
giving a clock frequency of 2 GHz. Table 1 shows projected
clock frequencies for different technologies.
Technology         0.18-µm  0.13-µm  0.10-µm  0.07-µm  0.05-µm
Cost Perf. (GHz)   0.56     0.77     1.0      1.4      2.0
High Perf. (GHz)   1.1      1.5      2.0      2.9      4.0
Table 1. Projected clock frequencies for NoC resources
under worst-case FO4 delays.
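The projection in Table 1 follows directly from the worst-case FO4 model and the assumed number of FO4s per cycle. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative and not part of the original work.

```python
# Worst-case FO4 model quoted in the text: FO4 = 500 * L_gate ps,
# with the gate length L_gate given in micrometres.
FO4_WORST_PS_PER_UM = 500.0


def clock_frequency_ghz(l_gate_um, fo4_per_cycle):
    """Clock frequency in GHz for a cycle of `fo4_per_cycle` worst-case FO4 delays."""
    cycle_ps = fo4_per_cycle * FO4_WORST_PS_PER_UM * l_gate_um
    return 1000.0 / cycle_ps  # 1000 ps per ns, so 1000 / ps gives GHz


if __name__ == "__main__":
    for l_gate in (0.18, 0.13, 0.10, 0.07, 0.05):
        cost = clock_frequency_ghz(l_gate, 20)   # cost-performance: 20 FO4s
        high = clock_frequency_ghz(l_gate, 10)   # high-performance: 10 FO4s
        print(f"{l_gate:.2f} um: cost {cost:.2f} GHz, high {high:.2f} GHz")
```

Rounded to two significant figures, the printed values agree with the entries of Table 1.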
2.3 Synchronous NoC Resource Size Estimation 
Knowing the projected clock cycle, the maximum size of a 
synchronous NoC resource is limited by the wiring delays since 
the clock signal must be able to traverse 2 resource edges within 
a clock cycle (assuming the resource is quadratic) in the worst 




Parameter    0.18-µm  0.13-µm  0.10-µm  0.07-µm  0.05-µm
r (ohm/mm)   107      185      317      611      1196
c (fF/mm)    331      268      208      170      155

The wiring delay of a distributed RC line can be modeled as:
$$T_{wire} = 0.4 \, r c \, l^2$$
Here $T_{wire}$ is the wiring delay, $l$ is the wire length, $r$ is the
resistance per unit length and $c$ is the capacitance per unit length.
This is a very good approximation and is reported to be accurate
to within 4% for a very wide range of $r$ and $c$ [6]. Knowing the
clock cycle time and RC delay model, the maximum resource size
satisfies:
$$\text{max wiring delay} < \text{clock cycle}$$
$$0.4 \, r c \, (2L)^2 < \text{clock cycle}$$
Here, L is the maximum resource edge length. The clock cycle
estimation is described in the previous section, and qualified
predictions on wire resistance and capacitance for future
technologies are available in a number of different papers.
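As an illustration of the unrepeated bound, the inequality can be solved for L. The sketch below assumes the semi-global wire values listed for 0.05-µm technology (r = 1196 ohm/mm, c = 155 fF/mm) and the 10-FO4 high-performance cycle of 250 ps; the function name is hypothetical.

```python
import math


def max_edge_unrepeated_mm(r_ohm_per_mm, c_ff_per_mm, clock_cycle_ps):
    """Largest edge L (mm) such that 0.4 * r * c * (2L)^2 < clock cycle."""
    # ohm * fF = 1e-15 s = 1e-3 ps, so r*c is converted to ps/mm^2 here.
    rc_ps_per_mm2 = r_ohm_per_mm * c_ff_per_mm * 1e-3
    max_wire_length_mm = math.sqrt(clock_cycle_ps / (0.4 * rc_ps_per_mm2))
    return max_wire_length_mm / 2.0  # the wire spans two resource edges


if __name__ == "__main__":
    # 0.05-um technology, high-performance cycle: 10 * 500 * 0.05 = 250 ps.
    edge = max_edge_unrepeated_mm(1196, 155, 250.0)
    print(f"unrepeated maximum edge: {edge:.2f} mm")
```

Under these assumptions the unrepeated wire limits the edge to a little under 1 mm, which motivates the repeater analysis that follows.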
The RC model given above shows that the wiring delay grows
quadratically with wire length. To reduce the delay for semi-global
and global wires, a long line can be broken into shorter
sections, with a repeater (an inverter) driving each section, see
Figure 3. This makes the total wire delay equal to the number of
repeated sections multiplied by the individual section delay:
$$T_{total} = k \left( T_{drv} + 0.4 \, r c \, (l/k)^2 \right)$$
Now, a first-order model of the driver (repeater), with lumped
output resistance and input capacitance, gives the driver delay as:
$$T_{drv} = 0.7 \frac{R}{h} \left( h C_d + h C_g + c \frac{l}{k} \right) + 0.7 \, r \frac{l}{k} \, h C_g$$
Here, $R$ is the minimum sized inverter resistance, $C_d$ and $C_g$
are the diffusion and gate capacitances of a minimum sized inverter,
and $r$ and $c$ are the wire resistance and capacitance per unit length.
Figure 3. A long wire with k repeaters, each with a size 
of h times the minimum sized inverter. 
The expression above for the total delay can be minimized and
the minimum delay per unit length can be shown to be
$2.13 \sqrt{rc \cdot FO1}$ ps/mm [5][7]. Here, FO1 stands for fan-out-of-one
delay and 1 FO4 = 3 FO1. The time for a signal to traverse 2
resource edge lengths should be less than a clock cycle,
suggesting the inequality $4.26 \cdot L \cdot \sqrt{rc \cdot FO1} < 1$ clock cycle.
Using the predicted future semi-global wire parameters provided 
in [7], as shown in Table 2, the maximum synchronous resource 
size and the number of resources on a single chip are calculated 
and listed in Table 3. 
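The calculation can be sketched as follows, using the semi-global wire parameters (r in ohm/mm, c in fF/mm) and the worst-case FO4 model from Section 2.1. The helper names are illustrative. Counting resources as the chip-to-resource area ratio rounded down is an assumption of this sketch, not a rule stated in the paper.

```python
import math

# Semi-global wire parameters per technology (um): (r in ohm/mm, c in fF/mm).
WIRE_PARAMS = {
    0.18: (107, 331),
    0.13: (185, 268),
    0.10: (317, 208),
    0.07: (611, 170),
    0.05: (1196, 155),
}


def repeated_delay_ps_per_mm(tech_um):
    """Minimum repeated-wire delay per mm: 2.13 * sqrt(r * c * FO1)."""
    r, c = WIRE_PARAMS[tech_um]
    rc_ps_per_mm2 = r * c * 1e-3          # ohm * fF = 1e-3 ps
    fo1_ps = 500.0 * tech_um / 3.0        # worst-case FO4 = 500 * L_gate, FO4 = 3 FO1
    return 2.13 * math.sqrt(rc_ps_per_mm2 * fo1_ps)


def max_resource_edge_mm(tech_um, fo4_per_cycle):
    """Largest edge L such that a signal crosses 2L within one clock cycle."""
    cycle_ps = fo4_per_cycle * 500.0 * tech_um
    return cycle_ps / (2.0 * repeated_delay_ps_per_mm(tech_um))


def number_of_resources(chip_edge_mm, resource_edge_mm):
    """Assumed counting rule: chip area divided by resource area, rounded down."""
    return math.floor((chip_edge_mm / resource_edge_mm) ** 2)


if __name__ == "__main__":
    for tech in WIRE_PARAMS:
        high = max_resource_edge_mm(tech, 10)
        cost = max_resource_edge_mm(tech, 20)
        print(f"{tech:.2f} um: high-perf {high:.1f} mm, cost-perf {cost:.1f} mm")
```

Because the cycle time enters linearly, doubling the number of FO4s per cycle doubles the allowed edge length, which is why the cost-performance sizes are twice the high-performance ones.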
Figure 2. The worst-case delay in a NoC resource.
Table 2. Wire parameters for different technologies.
Technology                          0.18-µm  0.13-µm  0.10-µm  0.07-µm  0.05-µm
Chip Size (mm)                      20       21       23       ...      28
High Perf.: Max Resource Size (mm)  6.5      4.7      3.5      ...      1.5
High Perf.: Nr of Resources         9        20       42       ...      ...
Cost Perf.: Max Resource Size (mm)  13       9.3      7.1      4.7      3.0





Table 3. Maximum resource size and number of 
resources on a single chip, with different technologies. 
The resistance and capacitance used to calculate Table 3 are for 
semi-global wire, since the semi-global wire is normally used 
within a resource. Routing with global wires within a resource 
would allow larger resource size, since global wires, in general, 
have lower resistance and therefore also smaller delay per unit 
length than semi-global wires. From the table, the
maximum size of a synchronous high-performance resource is 1.5
mm using 0.05-µm technology. For a cost-performance resource
with a cycle time of 20 FO4s, twice as long as the
high-performance resource cycle time, the maximum resource size is
also twice as large.
It should be noted that the analysis above is valid for
single wires. Crosstalk effects are not taken into consideration. If
many wires are in parallel and switch simultaneously, the delay 
will be higher for unfavorable switch patterns, requiring smaller 
resource size. Therefore, the derived maximum resource size 
above should be seen as an upper bound. 
3. Inter-Resource Delay and Bandwidth 
3.1 Inter-Resource Delay 
The inter-resource communication link will most likely consist of
a large number of parallel wires, with uniform coupling over
most of the wire length. For such closely coupled parallel wire
structures, the crosstalk effects are considerable and cannot be
neglected. Hence, the single wire model used in the previous section
is not valid here. Instead, the model shown in Figure 4 is used.
Each wire is modeled as a distributed RC line with total
resistance $R$, total self-capacitance $C_s$, and total coupling
capacitance $C_c$ uniformly distributed over the whole line.
Figure 4. Distributed RC lines with uniform coupling.

Technology             0.18-µm  0.13-µm  0.10-µm  0.07-µm  0.05-µm
Inv. Resistance (ohm)  9020     10560    11370    13710    15080
Inv. Capacitance (fF)  1.795    1.267    0.996    0.709    0.532
The effect of crosstalk on the delay depends on the switching
pattern of the aggressor (adjacent) lines. Most often, static timing
models that take crosstalk into account are based on a switch
factor. To model the crosstalk effects, the coupling capacitance is
multiplied by this switch factor, which takes a value between 0
and 2 for the best and worst case, respectively. In Figure 4,
if the victim line in the middle switches up from zero
to one, the worst-case delay on the victim line arises when the
two aggressor lines switch down from one to zero (almost)
simultaneously [6]. The worst-case delay is then given by:
$$t_{0.5} = 0.7 R_{drv} (C_s + 4.4 C_c + C_{drv}) + R (0.4 C_s + 1.5 C_c + 0.7 C_{drv})$$
Here, $t_{0.5}$ is the delay for the step response to reach the 50%
point, $R_{drv}$ is the driver (minimum sized inverter) output
resistance and $C_{drv}$ is the driver capacitance. Similar to the
single wire case, the second term in this expression grows
quadratically with the wire length. Inserting repeaters reduces the
total wire delay. As shown in Figure 5, a long wire is broken into
k sections, with an h-sized repeater driving each section. For each
section, the driver has a lumped resistance of $R_{drv}/h$ and a
capacitance of $h \cdot C_{drv}$, the wire has a distributed
resistance of $R/k$ and a self-capacitance of $C_s/k$, and the mutual
capacitance between two adjacent lines becomes $C_c/k$.
Figure 5. Insertion of repeaters in a long uniformly 
coupled RC line. 
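Applying the formula for worst-case delay to each section, the
total delay becomes:
$$t_{0.5} = k \left[ 0.7 \frac{R_{drv}}{h} \left( \frac{C_s}{k} + 4.4 \frac{C_c}{k} + h C_{drv} \right) + \frac{R}{k} \left( 0.4 \frac{C_s}{k} + 1.5 \frac{C_c}{k} + 0.7 h C_{drv} \right) \right]$$
To obtain the optimal k and h values, the partial derivatives are
set to zero, giving:
$$\frac{\partial t_{0.5}}{\partial k} = 0 \Rightarrow k_{opt} = \sqrt{\frac{0.4 R C_s + 1.5 R C_c}{0.7 R_{drv} C_{drv}}}$$
$$\frac{\partial t_{0.5}}{\partial h} = 0 \Rightarrow h_{opt} = \sqrt{\frac{0.7 R_{drv} C_s + 3.1 R_{drv} C_c}{0.7 R C_{drv}}}$$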
Now, the optimal value of k must be a positive integer. Using the
minimum sized inverter resistance and capacitance from [8], as
shown in Table 4, the optimal k and h values are calculated and
listed in Table 5. If the optimal k is not an integer, both of the
two closest integers are used and the corresponding delays are
compared to find the smallest delay.
Table 4. Resistance and capacitance of minimum sized 
inverter for different technologies. 
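The optimization and the integer rounding of k can be sketched as below. The delay function applies the worst-case coupled-line formula to each of the k sections. The expression for the real-valued optimal k is obtained here by setting the k-derivative of that same delay to zero. The inverter values are the 0.05-µm entries for the minimum sized inverter (15080 ohm, 0.532 fF). The wire values (R, C_s, C_c for a 1 mm wire) are purely hypothetical placeholders, since the global wire parameters are not listed in the text, and all function names are illustrative.

```python
import math


def repeated_delay(k, h, R, Cs, Cc, Rdrv, Cdrv):
    """Worst-case 50% delay (s) of a coupled wire split into k sections,
    each driven by a repeater h times the minimum inverter size."""
    driver_term = 0.7 * (Rdrv / h) * (Cs / k + 4.4 * Cc / k + h * Cdrv)
    wire_term = (R / k) * (0.4 * Cs / k + 1.5 * Cc / k + 0.7 * h * Cdrv)
    return k * (driver_term + wire_term)


def optimal_h(R, Cs, Cc, Rdrv, Cdrv):
    # 3.1 is the rounded product 0.7 * 4.4 used in the text.
    return math.sqrt((0.7 * Rdrv * Cs + 3.1 * Rdrv * Cc) / (0.7 * R * Cdrv))


def optimal_k(R, Cs, Cc, Rdrv, Cdrv):
    # From d(delay)/dk = 0 applied to repeated_delay().
    return math.sqrt((0.4 * R * Cs + 1.5 * R * Cc) / (0.7 * Rdrv * Cdrv))


def best_integer_k(R, Cs, Cc, Rdrv, Cdrv):
    """Compare the two integers closest to the real-valued optimum."""
    h = optimal_h(R, Cs, Cc, Rdrv, Cdrv)
    k_real = optimal_k(R, Cs, Cc, Rdrv, Cdrv)
    candidates = {max(1, math.floor(k_real)), max(1, math.ceil(k_real))}
    return min(candidates, key=lambda k: repeated_delay(k, h, R, Cs, Cc, Rdrv, Cdrv))


if __name__ == "__main__":
    # Hypothetical 1 mm global wire (assumed values, not from the paper).
    wire = dict(R=100.0, Cs=50e-15, Cc=80e-15)
    # Minimum sized inverter at 0.05 um (Table 4 values).
    inverter = dict(Rdrv=15080.0, Cdrv=0.532e-15)
    k = best_integer_k(**wire, **inverter)
    h = optimal_h(**wire, **inverter)
    delay_ps = repeated_delay(k, h, **wire, **inverter) * 1e12
    print(f"h = {h:.0f}, k = {k}, delay = {delay_ps:.1f} ps")
```

Evaluating the delay at both neighbouring integers, rather than simply rounding, matters because the delay curve is asymmetric around the real-valued optimum.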
From Table 5, we see that the optimal size of the repeaters is 
large and the number of sections does not seem to be very 
significant for the delay. The increased number of repeaters only 
gives marginal improvement in delay. This means that the trade- 
off between the number of repeaters and the delay should be 
considered. 
4. Summary and Future Works 
In this paper, we study the NoC system parameters constrained 
by the interconnections. Predictions on future technology feature 
size, clock speed in a synchronous resource, maximum NoC 
resource size, optimal global communication bandwidth and 
inter-resource distance, are made. These quantities are closely 
related to each other. The technology determines the gate delay, 
which in turn determines the maximum clock frequency. The 
maximum resource size can then be derived from the obtained 
clock frequency and the semi-global wire delay. Finally, the
global communication bandwidth is limited by the distance 
between resources and the global wire delay. Based on these 
estimated quantities, this paper provides a global wire planning 
scheme for NoC and can be used as a guideline for NoC system 
architecture definition. This can be demonstrated in a numerical 
example: for a NoC in 50-nm technology, the clock frequency is 
estimated to be 4 GHz for a high-performance synchronous 
resource with an edge length of 1.5 mm. With an inter-resource 
distance of 1.5 mm, there is room for about 350 such resources 
on a single chip of 28x28 mm. The bandwidth between two 
adjacent resources is estimated to be 0.6 Gbps per global wire 
without using repeaters. 
Future work involves global communication bandwidth 
optimization strategies under different constraints such as area, 
power consumption, etc. In addition, the roles of multilayer
interconnection and real-world application integration in NoC
are important and should be studied more closely.
5. References 
[1] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M.
Millberg, and D. Lindqvist. “Network on Chip: An 
Architecture for Billion Transistor Era”, Proceeding of the 
IEEE NorChip Conference, November 2000. 
[2] W. J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip
Interconnection Networks”, Design Automation
Conference, Proceedings, 684-689, 2001. 
[3] E. Nilsson, “Design and Implementation of a Hot-potato 
Switch in Network on Chip”, Master of Science thesis, 
Laboratory of Electronics and Computer Systems, Royal 
Institute of Technology (KTH), Sweden, June 2002. 
[4] L-R Zheng, H. Tenhunen, “Design and Analysis of Power 
Integrity in Deep Submicron System-on-Chip Circuits”, 
Analog Integrated Circuits and Signal Processing, 30, 15- 
29, 2002. 
[5] R. Ho, K. W. Mai and M. Horowitz, “The Future of Wires”,
Proceedings of the IEEE, vol. 89, no. 4, April 2001.
[6] D. Pamunuwa, L-R. Zheng and H. Tenhunen, “Maximizing 
Throughput over Parallel Wire Structures in the Deep Sub- 
micro Regime”, in manuscript, Laboratory of Electronics 
and Computer Systems, Royal Institute of Technology 
(KTH), Sweden. 
[7] H. Tenhunen, workshop “Systems on Chip, Systems in 
Package”, ESSCIRC 2001, Villach Austria, Sep 2001. 
[8] A. Maheshwari, S. Srinivasaraghavan and W. Burleson, 
“Quantifying the Impact of Current-Sensing on Interconnect
Delay Trends”, ASIC/SOC Conference, 15th Annual IEEE 
International, 461-465, 2002. 
Technology         0.18-µm  0.13-µm  0.10-µm  0.07-µm  0.05-µm
Optimal h          322      296      226      187      154
Optimal k (1/mm)   0.99     1.30     1.66     2.28     3.33
Integer k (1/mm)   ...
Integer k (1/mm)   ...
Table 5. Optimal size of the repeaters, h, optimal number 
of sections, k,  closest integer values to k and 
corresponding delay per unit length. 
3.2 Inter-Resource Bandwidth Estimation 
We have seen that repeater insertion can reduce the wire delay. 
However, the repeaters tend to be area- and power-hungry, and
repeaters for global wires require many via cuts from the upper- 
layer wires all the way down to the substrate, introducing 
considerable via-resistances. Therefore, it is preferable to avoid 
repeaters in inter-resource communication. 
The wire delay constrains the inter-resource bandwidth
and distance. To see how these quantities are related, we first
assume that a good signal has a duration of at least $3 t_r$, where
$t_r$ is the time for a rising signal to rise from 10% to 90% of its
final value. Usually, for RC delays, the 0-50% time is
$t_{0.5} = 0.69 \tau$ and $t_r = 2.2 \tau$ [5], where $\tau$ is the RC
time constant. Thus, the bandwidth of a single wire is limited by
$1/(3 t_r)$. Figure 6 shows
the allowed maximum length of a global wire at different
bandwidths, with and without repeaters. Clearly, for the same
technology and wire length, wires with repeaters can have higher
bandwidth due to their lower propagation delay. For an
inter-resource distance of 1.5 mm with 0.05-µm technology (assuming
that the resources are close to each other and the inter-resource
distance is therefore equal to the resource size), the bandwidth
between two adjacent resources is estimated to be 0.6 Gbps per
global wire without using repeaters.





Figure 6. Maximum length of a global wire for different 
bandwidths and technologies, with and without repeaters. 
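The relation between the 50% delay and the per-wire bandwidth used in this section can be sketched as follows, assuming a minimum signal duration of three rise times and the RC relations quoted above. The function names are illustrative.

```python
def max_bandwidth_gbps(t50_ps):
    """Per-wire bandwidth limit for a given 0-50% delay, using
    t_0.5 = 0.69 * tau, t_r = 2.2 * tau and a signal duration of 3 * t_r."""
    tau_ps = t50_ps / 0.69
    rise_time_ps = 2.2 * tau_ps
    return 1000.0 / (3.0 * rise_time_ps)  # 1 / ps scaled to Gb/s


def delay_budget_ps(bandwidth_gbps):
    """Largest 0-50% delay that still supports the requested bandwidth."""
    rise_time_ps = 1000.0 / (3.0 * bandwidth_gbps)
    return 0.69 * rise_time_ps / 2.2


if __name__ == "__main__":
    budget = delay_budget_ps(0.6)
    print(f"0.6 Gb/s per wire allows a 50% delay of about {budget:.0f} ps")
```

Read this way, the 0.6 Gbps estimate corresponds to a wire whose worst-case 50% delay over the inter-resource distance is on the order of 170 ps.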
