A 3D bus interconnect for network line cards by Engel, J & Kocak, T
                          Engel, J., & Kocak, T. (2004). A 3D bus interconnect for network line cards.
In 2nd Annual IEEE Northeast Workshop on Circuits and Systems, Montreal,
Canada. (pp. 257 - 260). Institute of Electrical and Electronics Engineers
(IEEE). 10.1109/NEWCAS.2004.1359080
Regular Session F: Parallel Systems, NoC & SoC
A 3D bus interconnect for network line cards 
Jacob Engel and Taskin Kocak 
Department of Electrical and Computer Engineering 
University of Central Florida, Orlando, FL 32816
e-mail: {jengel, tkocak}@cs.ucf.edu
Abstract-In this paper, we propose a 3D bus architecture
as a processor-memory interconnection system to increase the
throughput of the memory system currently used on line cards.
The 3D bus architecture allows multiple processing elements on a
line card to access a shared memory. The main advantage of the
proposed architecture is that it increases the network processor
off-chip memory bandwidth while reducing the latency otherwise
caused by competition for a single bus.
I. INTRODUCTION 
As network line rates are constantly increasing, the time
available for each memory access keeps decreasing. For example,
a 40-byte packet arrives every 32 ns, which corresponds to a line
rate of 10 Gbps. Current network processors or network
processing units (NPUs) use multithreading to hide memory
latency. However, it is not clear whether this technique will
scale well at 1 Tbps and beyond. In this paper, we report our
initial work on processor-memory interconnections to increase
the throughput of the memory system currently used on line
cards. This could be done easily by increasing the memory
bandwidth; however, current chip manufacturing techniques
limit the number of I/O pins, and only so many of them can
be memory I/Os.
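The arithmetic behind the packet arrival figure can be sketched as follows. This is a minimal illustration, not part of the original analysis: the function name and the list of line rates are assumptions chosen for the example, and framing overhead is ignored.

```python
def packet_interarrival_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time between back-to-back packets of the given size, in nanoseconds."""
    bits = packet_bytes * 8
    # 1 Gbit/s equals 1 bit/ns, so bits / Gbps is already in nanoseconds.
    return bits / line_rate_gbps


if __name__ == "__main__":
    # Illustrative line rates (assumed for this example): 10 Gbps, 40 Gbps, 1 Tbps.
    for rate_gbps in (10, 40, 1000):
        gap = packet_interarrival_ns(40, rate_gbps)
        print(f"{rate_gbps:>5} Gbps: {gap:.2f} ns per 40-byte packet")
```

At 10 Gbps this gives 32 ns for a 40-byte packet; at 1 Tbps the same packet leaves only 0.32 ns, which illustrates why latency hiding alone may not scale.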
Our interconnect bus structure is categorized as a mezzanine
interconnect, which generally employs an address/data
read/write data model with memory-like semantics and is
targeted at simple translation between processor bus memory
operations and mezzanine interconnect transactions [1].
Examples of mezzanine-type interconnect architectures are SPI-4.2 [3],
which is a point-to-point communication architecture
between MACs and NPUs or switch fabrics; CSIX [4], a common
switch interface layer between NPU and switch fabric;
and HyperTransport [5], which is a point-to-point, chip-to-chip
interconnect technology that uses a packet-based protocol and
variable link width.
The main advantage of the proposed architecture is that it
increases the NPU off-chip memory bandwidth while reducing
the latency otherwise caused by bus competition.
Compared to parallel bus architectures, our 3D bus can carry
more data on the wire at the same time. This is because
the bus from processor to memory is broken
into shorter wires or links (see Fig. 1). Each link carries a
different data set or memory packet. Furthermore, the 3D
bus architecture can handle network traffic bursts by
load-balancing write operations to different memory banks.
II. THE PROPOSED ARCHITECTURE
Fig. 1. Channels used in parallel and 3D bus
Our proposed 3D interconnection architecture is shown
in Fig. 2. The 3D bus structure is a packet-based multiple
path forwarding mechanism that allows network packets to
be shared by different processor and memory modules on
the network line card. In Fig. 2, the line card processing
and communication components which have access or require
access to the memory banks are shown on the left. The
components in the figure are given as examples, and other
functional components can also be interfaced to this bus. The
memory banks are located on the other side of the 3D bus
structure. Each component that requires memory access
sends its data encapsulated in packets.
The default route is in the x-axis direction (shortest path). If
there is a congested area (hot spot), packets in transit will take
a different route in the y-axis or z-axis direction using the traffic
controller (TC) in each corner (node). The protocol of the proposed
architecture uses an efficient message-passing structure to
transfer data. If a link goes down, not only should the fault be
limited to that link, but the additional links from the intermediate
nodes should also ensure that connectivity continues.
A. Routing Mechanism
The packets are routed to the destination memory module
by following the header using wormhole routing. The message
to be sent to or from a memory module is segmented
into smaller packets (flits). First, the packet's header is sent
to set the direction on each node in its path. The rest of the
packets comprising the message do not wait, but are transmitted in
a pipelined manner following the message header, as illustrated
in Fig. 3. The packet size is determined by the channel width.
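The segmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `segment_into_flits`, the restriction to byte-aligned channel widths, and the 40-byte example message are hypothetical and not taken from the paper.

```python
from typing import List


def segment_into_flits(message: bytes, channel_width_bits: int) -> List[bytes]:
    """Split a message into flits, each as wide as the channel.

    Assumption for this sketch: the channel width is a whole number of bytes.
    The last flit may be shorter if the message length is not a multiple of it.
    """
    if channel_width_bits <= 0 or channel_width_bits % 8 != 0:
        raise ValueError("channel width must be a positive multiple of 8 bits")
    flit_bytes = channel_width_bits // 8
    return [message[i:i + flit_bytes] for i in range(0, len(message), flit_bytes)]


if __name__ == "__main__":
    message = bytes(range(40))  # a 40-byte example message
    flits = segment_into_flits(message, channel_width_bits=32)
    header, body = flits[0], flits[1:]  # the header goes first, the body follows it
    print(f"{len(flits)} flits, header = {header.hex()}, body flits = {len(body)}")
```

In wormhole routing only the first flit carries routing information; the remaining flits simply follow the path it reserves, which is why they can be pipelined without waiting.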
The main advantage of using wormhole routing in this 3D
bus structure is that it diminishes the latency as the size of
the message increases while increasing its throughput. The
major part of the latency is hidden in the transfer of the first
packet. The rest of the packets follow it and introduce
only wire transfer delay. As the message size increases, the
ratio of consecutive latencies decreases. From a throughput
viewpoint, the packets can enter the 3D bus structure from
all four inputs every T seconds, following their header. T is
the propagation delay of one bit over one unit length, which is
equal to 62.5 ps per 1 cm using the current manufacturing
0-7803-8322-2/04/$20.00 ©2004 IEEE.
Fig. 2. 3D bus structure on the network line card
technology. This becomes a great advantage in achieving high
throughput, whereas a parallel bus can only send those packets
in the manner of a store-and-forward architecture (see Fig. 6).
Fig. 3. Wormhole routing
III. PERFORMANCE ANALYSIS
A. Cube Notation 
Cubes can be connected in series in order to increase the
number of bus routing paths. Each vertical cube face is marked i, since
it is incremented in the x-axis direction. Within each i plane
there are four corners, denoted by the j notation, numbered in a clockwise
direction.
The j notation has a two-digit value. Its first digit is the i
plane to which j belongs and the other digit is the j location
within the plane. The k notation is used to distinguish the
horizontal buses connecting vertical i planes. k is a four-digit
number. The first two digits are the i and j on the left edge
(source node) and the other two digits are the i and j values
to the right of the bus (destination node).
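The labeling scheme can be sketched as follows. This is a minimal illustration under stated assumptions: corners are numbered 1 to 4 and planes 1 to 9 so that every index fits in one digit, and the helper names `j_label` and `k_label` are hypothetical.

```python
from typing import Tuple


def j_label(plane: int, corner: int) -> str:
    """Two-digit corner label: plane index i followed by corner position j."""
    # Assumed ranges: single-digit planes, corners numbered 1..4 clockwise.
    if not 1 <= plane <= 9:
        raise ValueError("plane index must fit in one digit (1..9)")
    if not 1 <= corner <= 4:
        raise ValueError("corner position must be 1..4")
    return f"{plane}{corner}"


def k_label(source: Tuple[int, int], destination: Tuple[int, int]) -> str:
    """Four-digit bus label: (i, j) of the source node, then of the destination."""
    return j_label(*source) + j_label(*destination)


if __name__ == "__main__":
    # Horizontal bus from corner 2 of plane 1 to corner 2 of plane 2.
    print(j_label(1, 2), k_label((1, 2), (2, 2)))
```

With these assumptions, the horizontal bus between corner 2 of plane 1 and corner 2 of plane 2 is labeled 1222.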
Each node is located at the origin of the axes of a
three-dimensional coordinate system. The movement of data words
is translated to movement along the axes. A word moving from
a bottom node up is moving in the positive z-direction, and
down in the negative z-direction. A packet moving from a top
node up, to its adjacent parallel node, is moving along the
positive y-direction. X is the movement to the right (toward
the output) or left (toward the input). A node transfers a data
word based on a default order of departure directions. First,
it will try to send it in the positive x-direction. If that node
is busy, then it will send it in the positive y-direction (up),
and if this node is busy too, it will send it in the negative/positive
z-direction (down for a top node and up for a bottom node). In
case all adjacent nodes are busy, the data word will not be sent
until one of the routes becomes available.
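The default order of departure directions can be sketched as follows. This is a hedged illustration, not the authors' traffic controller: the global coordinate layout (x from 0 to the number of cubes, y and z each 0 or 1), the function name `next_hop`, and the `is_busy` callback are assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

Node = Tuple[int, int, int]  # (x, y, z); assumed layout with y, z in {0, 1}


def next_hop(node: Node, num_cubes: int,
             is_busy: Callable[[Node], bool]) -> Optional[Node]:
    """Pick the next node using the priority order +x, then +y, then +/-z.

    Returns None when every candidate is busy or outside the structure,
    meaning the data word waits at the current node.
    """
    x, y, z = node
    z_step = -1 if z == 1 else 1  # top node goes down, bottom node goes up
    candidates = [(x + 1, y, z), (x, y + 1, z), (x, y, z + z_step)]
    for nx, ny, nz in candidates:
        in_range = 0 <= nx <= num_cubes and ny in (0, 1) and nz in (0, 1)
        if in_range and not is_busy((nx, ny, nz)):
            return (nx, ny, nz)
    return None


if __name__ == "__main__":
    busy = {(1, 0, 0)}  # hypothetical hot spot on the default x route
    print(next_hop((0, 0, 0), num_cubes=3, is_busy=lambda n: n in busy))
```

When the x-direction neighbor is marked busy, the sketch deflects the word in the y-direction instead; if all three candidates are busy it returns `None`, which corresponds to holding the word until a route becomes available.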
B. Shortest Path
The shortest path and longest path from a given input node
to the destination were calculated by assigning labels to each node
and bus segment as explained earlier (Fig. 4). The result, as
shown in Fig. 5, was a hierarchical tree. Figure 5 represents
only a single cube.
Fig. 4. 3D cube interconnection notation
Fig. 5. Hierarchical tree
We calculated the shortest path for multiple cubes (up
to 3) by constructing an incremental view of the trees, taking
the Manhattan distance from each corner. Each cube arc
corresponded to a new child node on the tree. We drew the
complete tree for up to three interconnected cubes. The results
show that the shortest path from input to output can
be traversed in three hops, whereas the longest path
takes up to twelve hops. Moreover, there is a linear
relationship between the two, with a factor of four. These results
assist in performance evaluations of the best vs. worst latency
and throughput results.
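The shortest-path figure can be cross-checked with a breadth-first search over a chain of cubes. This is a sketch under assumptions, not the tree construction used in the paper: nodes are modeled as grid points (x, y, z) with x from 0 to the number of cubes and y, z in {0, 1}, every cube edge is a bidirectional link, and the function name `shortest_hops` is hypothetical. The longest path is not modeled here.

```python
from collections import deque
from typing import Optional, Tuple

Node = Tuple[int, int, int]


def shortest_hops(num_cubes: int, src: Node = (0, 0, 0),
                  dst: Optional[Node] = None) -> int:
    """Breadth-first search for the minimum hop count between two corners."""
    if dst is None:
        dst = (num_cubes, 0, 0)  # same corner position on the output plane
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        x, y, z = node
        # Neighbors: one step along x, or across the face in y or z.
        for nxt in ((x + 1, y, z), (x - 1, y, z), (x, 1 - y, z), (x, y, 1 - z)):
            if 0 <= nxt[0] <= num_cubes and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination is not reachable")


if __name__ == "__main__":
    for cubes in (1, 2, 3):
        print(cubes, "cubes:", shortest_hops(cubes), "hops (shortest)")
```

For three cubes the search returns three hops, in line with the shortest path reported above; the twelve-hop longest path then follows from the stated factor of four.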
C. Measures 
We use standard performance metrics such as latency and
throughput for evaluation. Latency is defined as the time it
takes for a complete message to reach its destination. We adopted
Dally's basic equation for k-ary n-cube interconnects [2] and
modified it for our design. The resulting latency (L) equation
is

$$L_{3D} = \left(\frac{M}{W} - 1 + D\right) T + D \, T(n) \qquad (1)$$

where M is the message size, W is the channel width, D is
the Manhattan distance, T represents the propagation delay
of one bit over one unit length, which is equal to 62.5 ps (per 1
cm), and T(n) is the node processing (switching) time. For
the best case, D = c, and for the worst case, D = 4c, where c is
the number of cubes. The header latency is larger than that of
the rest of the message (M/W - 1 packets),
since it sets the nodes' direction in time T(n). The rest of the
message only has to propagate through the channels in T
time per unit length.
Fig. 6. Message timing (3D bus vs. parallel bus)
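The 3D bus latency, read as the header paying T + T(n) on each of D hops while the remaining M/W - 1 packets add only T each, can be evaluated with a short sketch. The node switching time T(n) is not given numerically in the text, so the value used below is a hypothetical placeholder; the function names are also assumptions, and each hop is taken to be one unit length (1 cm).

```python
T_PER_CM = 62.5e-12  # propagation delay of one bit over 1 cm, in seconds (from the text)


def latency_3d(message_bits: float, channel_bits: float, distance: int,
               t_node: float, t_link: float = T_PER_CM) -> float:
    """Latency in seconds: (M/W - 1 + D) * T + D * T(n)."""
    packets = message_bits / channel_bits
    return (packets - 1 + distance) * t_link + distance * t_node


def best_and_worst(message_bits: float, channel_bits: float, cubes: int,
                   t_node: float) -> tuple:
    """Best case uses D = c, worst case uses D = 4c, with c the number of cubes."""
    best = latency_3d(message_bits, channel_bits, cubes, t_node)
    worst = latency_3d(message_bits, channel_bits, 4 * cubes, t_node)
    return best, worst


if __name__ == "__main__":
    T_NODE = 100e-12  # HYPOTHETICAL switching time; not a value from the paper
    best, worst = best_and_worst(1024 * 8, 32, cubes=8, t_node=T_NODE)
    print(f"best = {best * 1e9:.3f} ns, worst = {worst * 1e9:.3f} ns")
```

Because the header term grows with D while the body term grows with M/W, long messages amortize the path setup cost, which is the behavior the wormhole scheme is meant to exploit.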
Throughput is defined as the rate at which the packets
exit the bus for a certain message size per second. The
resulting throughput (TP) equation is
We compare the performance of the 3D bus with that of
four parallel buses. This is mainly because the 3D bus resembles
four parallel buses with some extra exits/links in every corner.
The latency of this parallel bus can be given as
which describes the propagation delay for the number of
packets in a message divided by 4 because we are using 4
parallel buses. Here it is assumed that each cube link is 1 cm
long and the four parallel buses have the same length as the
3D bus formed by c cubes.
The throughput for the four parallel buses is obtained by
dividing the message size by the latency.
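The parallel-bus comparison is described here only in words, so the sketch below encodes one possible reading rather than the paper's own formula. It ASSUMES that every packet crosses the full bus length of c links at T per link and that the packets are shared evenly over four buses; the function names, the best-case choice D = c for the 3D bus, and the default T(n) = 0 are likewise assumptions for illustration.

```python
T_PER_CM = 62.5e-12  # propagation delay of one bit over 1 cm, in seconds (from the text)


def latency_3d(message_bits: float, channel_bits: float, distance: int,
               t_node: float = 0.0) -> float:
    """3D bus latency: (M/W - 1 + D) * T + D * T(n)."""
    return (message_bits / channel_bits - 1 + distance) * T_PER_CM + distance * t_node


def latency_parallel_assumed(message_bits: float, channel_bits: float,
                             cubes: int, buses: int = 4) -> float:
    """ASSUMED reading: (M/W) packets, each crossing `cubes` links, split over 4 buses."""
    packets = message_bits / channel_bits
    return packets * cubes * T_PER_CM / buses


def throughput(message_bits: float, latency_s: float) -> float:
    """Message size divided by latency, in bits per second."""
    return message_bits / latency_s


def latency_ratio(message_bits: float, channel_bits: float, cubes: int,
                  t_node: float = 0.0) -> float:
    """Parallel-bus latency over best-case (D = c) 3D bus latency."""
    return (latency_parallel_assumed(message_bits, channel_bits, cubes)
            / latency_3d(message_bits, channel_bits, cubes, t_node))


if __name__ == "__main__":
    M, W = 1024 * 8, 32  # 1024-byte message, 32-bit channel
    for c in (1, 2, 4, 5, 8, 16):
        print(f"{c:>2} cubes: latency ratio = {latency_ratio(M, W, c):.3f}")
```

Under these assumptions, and with T(n) set to zero, the ratio for a 1024-byte message on a 32-bit channel passes 1 between four and five cubes. A nonzero switching time or the worst-case distance would shift that point, so the sketch should be read as a plausibility check only.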
D. Results
We use the equations developed above to calculate the
latency and throughput measures. We are also interested in
determining the optimal number of cubes needed to
form the 3D bus, as well as the channel widths. Figure 7 shows
the latency ratio (parallel bus to 3D bus) against the number of
cubes for several channel widths when the message size is kept at
1024 B. The figure implies that a 3D bus with 5 cubes or more
has lower latency than the four parallel buses. It also reveals
that the ratio improves with lower channel width.
Fig. 7. Latency ratio (parallel bus/3D bus) versus number of cubes (message size 1024 B)
The latency increases with message size, as shown in Fig. 8,
when the bus length is 8 cubes. For messages larger than 256 B, the
3D bus has lower latency than 4 parallel buses with
the same channel width.
Figure 9 shows the throughput ratio (3D bus to parallel
bus) against the number of cubes for several channel widths
when the message size is kept at 1024 B. The ratio improves
with increasing bus length. This is due to the pipeline-like
mechanism used by the 3D bus, in contrast to the store-and-forward
type architecture used by the parallel bus. In addition, as the
Fig. 9. Throughput ratio (worst case; message size 1024 B)
channel width increases there is some reduction in the throughput
ratio; however, in all scenarios the 3D bus still
achieves better performance. The throughput of the 3D bus
improves with increasing message size, as shown in Fig. 10.
This is a result of filling the links of the 3D
bus with packets. In contrast, the throughput stays the same for the parallel
bus due to its store-and-forward type architecture. Our 3D bus
interconnect reached almost 400 Gbps with a 32-bit channel width
and a 2048 B message size. Currently there are no memories
that operate at a throughput as high as 400 Gbps. Significantly,
our approach proposes to use existing high-bandwidth memories
(such as Rambus) [6] while we design the memory interface
to efficiently employ their internal banks. By taking this
cost-effective approach, we ensure that the interconnection bus can
be readily integrated into existing network line card boards
and other on-board processor-memory architectures.
Fig. 11. Throughput comparison with current interconnect technologies
IV. CONCLUSIONS AND FUTURE WORK
An interconnection architecture for network line cards is
presented. The 3D bus architecture allows the multiple
processing elements on the line card to access multiple memory
modules. Besides the line card, the 3D bus structure can also
be used within network processors as an on-chip
communication mechanism between processing elements, memory and
peripherals. This can improve the performance of current
implementations such as the Intel IXP2800, where 8 microengines
compete for one bus. The results show that the throughput
is significantly improved when compared with a parallel bus or
other competing structures currently on the market. Future
work includes memory and PE interfaces that are able to
supply the resulting bandwidth to memory and processing
modules.
REFERENCES 
[1] D. Halliday, "The evolution of mezzanine modules for next-generation
telecom architectures", CompactPCI Systems, June 2003.
[2] W. J. Dally, "Performance analysis of k-ary n-cube interconnection
networks", IEEE Trans. on Computers, vol. 39, no. 6, pp. 775-785,
1990.
[3] K. Marquardt, "Hitting the 10-Gbit Mark with SPI-4.2"
(web: www.commsdesign.com).
[4] CSIX Interface, White Paper (web: www.altera.com/products/ip/
communications/csix/ipm-index.jsp).
[5] HyperTransport Consortium, "HyperTransport technology: Simplifying
system design", Oct. 2002 (web: http://www.hypertransport.org).
[6] "Rambus DRAM for OC192 Data Rate Line Card Applications",
Rambus Inc., 2000.
[7] "IXP2800 Intel Network Processor IP Forwarding Benchmark Full
Disclosure Report for OC192-POS", White Paper, Intel Corp., Oct.
30, 2003.
[8] PCI Special Interest Group, "PCI local bus specification, revision 2.2",
Dec. 1998.
