Low cost, adaptive, fault tolerant routing in low dimension direct interconnection networks. by Swarbrick, Ian Andrew.
Published by ProQuest LLC (2017). Copyright of the Dissertation is held by the Author.
Low Cost, Adaptive, Fault Tolerant
Routing in Low Dimension Direct
Interconnection Networks
Ian Andrew Swarbrick
Submitted for the Degree of Doctor of Philosophy from the University of Surrey.
Distributed Systems Research Group,
School of Electronic Engineering, Information Technology and Mathematics,
University of Surrey,
Guildford,
Surrey,
GU2 5XH,
England.
Abstract
Throughput and latency are critical parameters in multiprocessor interconnection networks.
These parameters are governed by the combination of routing node and interconnect
performance. Recent years have seen several new interconnect technologies
reach the stage of maturity where they may be applied in practical systems. One such 
technology is free-space optical interconnect. The problems of wiring density, low 
data-rates and limited integrated circuit (IC) pin-out are neatly side-stepped by the use 
of optical interconnect.
In order to make practical use of optical interconnects, packet routing node data 
rates must increase by an order of magnitude or more. At the same time, latency cannot
be sacrificed as it is critical to the performance of multi-processor systems.
One possible avenue to meeting the required performance is to re-examine the hardware
cost of packet router architectures and attempt to improve them. The research
presented in this thesis attempts to do exactly that.
The result of this research is a packet router architecture known as the Cellular 
Router. The router allows massive throughput, while maintaining low latency. The 
architecture is designed to minimise silicon area and maximise achievable clock rate in 
any given fabrication process. The router is scalable, in the sense that area requirements 
increase linearly along one axis in proportion to increased throughput.
This thesis describes a novel packet router architecture. It is a compact, power 
efficient, scalable design, that is capable of exceptionally high throughput. The Cellular 
Router allows the benefits of free-space optical interconnects to be effectively utilised 
in multi-processor systems.
Acknowledgements
I wish to express my sincere gratitude to Dr Alex Shafarenko and Professor Chris 
Jesshope for their supervision, guidance and support throughout the course of this
research.
I am deeply indebted to Dr Vladimir Vasekin and Mr Alex Bolychevsky for their 
advice and comments during our many discussions on hardware issues. I would like
to thank Dr Ivailo Nedelchev, whose enthusiasm and insight initially encouraged me to 
pursue this research.
Thanks go to Dr Roger Peel for some interesting technical discussions, and to Slava 
Muchnick for his advice on the issue of deadlock. I wish to thank past and present 
members of the Distributed Systems Research Group, especially Mircea, Nick, Andy 
and Nayef, who have contributed to creating a stimulating research environment.
Thanks go to my parents, who have supported and encouraged me throughout my 
studies. The Cook family - thanks for looking after me, and making me feel so at home.
Special thanks go to Stina, who supported me through the difficult times, and made 
it all seem worthwhile.
Contents
Abstract
Acknowledgements
Table of Contents
List of Figures
1 Introduction
2 Interconnection Networks
2.1 Introduction
2.1.1 Scalability
2.1.2 Latency
2.1.3 Grain Size
2.1.4 Dynamic Switching Networks
2.1.5 Circuit Switching versus Packet Switching
2.2 Network Properties
2.2.1 Node Degree and Network Diameter
2.2.2 Static Interconnection Network Topologies
2.3 Interconnection Network Topologies
2.3.1 k-ary n-cube Networks
2.3.2 Tree Networks
2.4 Message Routing
2.4.1 Store-and-Forward Routing
2.4.2 Wormhole Routing
2.4.3 Hardware Support for Additional Communication Patterns
2.5 Adaptive Routing, Oblivious Routing and Network Dimensionality
2.6 Virtual Channel Flow Control
2.7 Deadlock, Livelock and Starvation
2.7.1 Deadlock Avoidance
2.7.2 Deadlock in Store-and-Forward Networks
2.7.3 Deadlock in Wormhole Routed Networks
2.8 Network Analysis
2.9 Maximum Throughput and Injection Rate
2.10 Summary of Trade-offs
3 Packet Routers
3.1 Existing Routers
3.1.1 Store and Forward Routers
3.1.2 Virtual Cut-Through Routers
3.1.3 Wormhole Routers
3.1.4 Alternative Switching Techniques
3.2 Generic Router
3.3 Straight-through Routing
3.4 Summary of Router Trends
4 Implementation and Design of Asynchronous Packet Routers
4.1 Introduction to Asynchronous Design
4.1.1 Timing Models
4.1.2 Signalling Protocols
4.1.3 Data Signalling
4.2 Benefits of Asynchronous Design
4.3 Asynchronous Design Kit
4.4 Router design
4.4.1 Router Building Blocks
4.5 Possible Improvements
4.6 Conclusions
5 High Performance Interconnects
6 Ring-Based Router Architecture
6.1 Introduction
6.2 Proposed Architecture: A Cellular Router
6.2.1 Routing Stages
6.2.2 Ring of Buffers
6.2.3 Logical Structure of the Cellular Router
6.3 Packet Format
6.4 Virtual Networks and Deadlock Prevention
6.4.1 Virtual Networks
6.4.2 Deadlock
6.4.3 Deadlock in Torus Cycles
6.4.4 Deadlock in Routing Rings
6.4.5 Adaptive Routing
6.4.6 Fault Tolerance
7 Computer Simulations
7.1 Uniform Random Traffic
7.1.1 Virtual Networks
7.1.2 Calculation of Applied Load
7.1.3 Load Measurement
7.2 Adaptive Routing in the Cellular Router
7.3 Simulations with Network Hot-spot traffic
7.3.1 Multiple Hot-spots
7.3.2 Simulation Conclusions
8 Router Hardware Cost
8.1 CMOS Integrated Circuits
8.1.1 Power Dissipation
8.2 Comparison Between Ring of Buffers and Crossbar-based switching
8.3 Cellular Router Implementation Cost
8.3.1 Router Synthesis
8.3.2 Limits of the Design Kit
8.3.3 Estimated Area of the Router
8.4 Power Consumption
8.5 Achievable Clock Rate
9 Conclusions
List of Figures
2.1 Interconnection schemes (a) Chordal ring (b) Fully connected
2.2 A 2d mesh
2.3 A 2-dimensional torus
2.4 A binary hypercube (2-ary 4-cube)
2.5 Interconnection schemes (a) Binary tree (b) Fat tree
2.6 Oblivious, minimal adaptive and non-minimal adaptive routing
2.7 Virtual channels [22]. Routing (a) without and (b) with virtual channels
2.8 The cause of deadlock: (a) Deadlock occurs (b) channel dependency graph
2.9 Effect of transit traffic on available network bandwidth
3.1 Generic packet router
3.2 An n x n crossbar switch
3.3 Crossbar implemented using multiplexors
4.1 Two Phase Signalling
4.2 Four Phase Signalling
4.3 Dual Rail Signalling
4.4 Simple Handshaking with Bundled Data
4.5 Silicon layout of C-element standard cell
4.6 Transistor level implementation of C-element
4.7 Implementation of toggle element
4.8 Merge element, (a) static implementation and (b) using pass-transistor logic
4.9 Routing node interconnection scheme
4.10 Modified bundle data protocol
4.11 Sequence of events in packet exchanges
4.12 MUX element
4.13 MUX element
4.14 DEC element construction
4.15 ROUTER element construction
4.16 A block diagram of the router switch
4.17 Final chip layout
4.18 Fast circuit to check when hop counter is zero
5.1 I/O placement (a) for wire bonding (b) for flip chip bonding
5.2 Tree of alternative optical interconnection architectures [42]
6.1 Requirements of the Cellular Router: data is de-multiplexed, routed and multiplexed
6.2 Components of Cellular Router (a) A single routing stage (b) Router cell and I/O connections (c) Standard cell based silicon layout methodology
6.3 Ring based interconnection scheme
6.4 Logical structure of Cellular Router
6.5 Packet format
6.6 Class-climbing using virtual planes
6.7 Circular waits in torus cycles, (a) Deadlocked cycle (b) Deadlock free cycle
6.8 Buffer restriction in deadlock prevention scheme
6.9 Restricting circular waits in the router
7.1 Routing latency versus load with varied number of virtual networks
7.2 Effect of adaptive routing with 5 hot-spots
7.3 Average packets per node: TIME_OUT=5, applied load=55%, achieved load=50.5%
7.4 Average packets per node: TIME_OUT=5000, applied load=55%, achieved load=51.7%
7.5 Average packet per node: TIME_OUT=5, applied load=60%, achieved load=55.3%
7.6 Average packet per node: TIME_OUT=5000, applied load=60%, achieved load=36.0%
8.1 Charging and dis-charging load capacitance in a 2 input CMOS NAND gate
8.2 Crosspoint switch with m I/O channels, each w bits wide
8.3 Basic wire crosspoint for 0.7µm process [3]
8.4 Area requirements compared
8.5 Schematic of one routing stage
8.6 Controller for the output buffer (OUT), together with the associated state diagram
Chapter 1
Introduction
The computational performance of multicomputer and multiprocessor systems is typically a
fraction of the sum of the computational power of the individual processors. The reason for this is
the overhead involved in communication between processors and memories in such
systems. The cost of communication is inescapable. Even in fully connected systems,
where every processing node has a direct connection to every other, there is an overhead
involved in arbitrating between communication requests. The best that can be done is
to try to minimise the communication cost.
It is generally understood that direct interconnection networks offer the best communication
performance for a given cost. The fundamental component of such networks
is the packet routing node. With modern integrated circuit (IC) technologies,
this is usually a single chip which is most commonly implemented in CMOS (complementary
metal-oxide-semiconductor), due to its low cost and low power consumption.
The performance of any direct interconnection network is governed by that of the
router and the interconnect. Developments in the area of optical interconnects now allow
thousands of I/O devices to be connected to a single chip [92]. The achievable data
rates are already in the Gbit/sec range, and this is only set to increase. The performance
of the packet router is now likely to form a bottleneck in the interconnection network.
To address this issue, a novel packet router architecture has been developed, known as
the Cellular Router. The ideas grew from a realisation that in order to manage the data
switching rates sustained by optical interconnects, something drastic was required. The
usual approach to packet router design is to examine the functionality required, and
then find some way to map this into hardware. The design approach taken in the Cellular
Router is to try to efficiently map the desired functionality into the planar structure
of an IC. Only then are other issues such as routing algorithm and deadlock prevention
considered.
Part of the inspiration for the design came from systolic array architectures [41].
These are regular structures that are made up of identical cells. A small amount of
design effort in optimising the basic cell may dramatically improve the overall performance.
The Cellular Router has a number of useful features. The architecture is relatively 
low in power consumption and area requirements. More importantly, the device scales 
linearly with increasing throughput. If the width of the data channels is increased, the 
router area and power consumption grow in proportion to the increase. The same linear 
increase occurs if the number of channels is increased. This linear scalability allows the 
router design to track improvements in optical interconnect technology. Some features 
of the router architecture can easily be modified. Increasing the number of I/O channels, 
for example, requires a simple replication of resources, and a very small change to 
the logic (fewer than ten connections must be changed). Increasing the width of the 
channels can be done by simply changing one parameter in the source HDL (hardware 
description language) code that is used to synthesise the router.
There has been a recognition in recent years that the hardware cost of packet-router 
features must be taken into account, along with the benefits those features provide. One 
area where this is particularly important is in the implementation of adaptive routing 
[14]. However, the cost in hardware is usually significant. The Cellular Router has
the novel feature that the inclusion of adaptive routing does not have any effect on the 
hardware cost of the router. A small number of additional gates are added, but these 
do not lie on any critical paths. Adaptive routing allows packets to follow multiple 
routes in the network, increasing throughput and potential fault tolerance. It is usually 
implemented in such a manner that packets are sent along minimum-length paths in the 
network. Allowing non-minimal routes may waste network bandwidth and introduces 
the possibility of livelock. The Cellular Router has a novel fault tolerance feature that 
allows deterministic, non-minimal routing in the presence of faulty network links. It 
does not introduce any livelock or deadlock conditions. Again, this mechanism is off 
the router critical path and does not increase the complexity of the routing decisions.
In Chapter 2 of this thesis, various terms and features of interconnection networks 
are explained. In Chapter 3, the hardware implementation of various packet routers is 
examined. Chapter 4 describes the author’s implementation of an asynchronous packet 
router. A brief discussion of interconnect technologies is given in Chapter 5. The Cellular
Router architecture is described in Chapter 6, and Chapter 7 shows the simulated
performance of the router. In Chapter 8, the hardware implementation of the router is 
examined. The conclusions drawn from this research are discussed in Chapter 9.
Chapter 2
Interconnection Networks
This chapter begins by explaining the need to connect microprocessors together and the 
ways in which this can be achieved. It then looks at various methods of interconnecting 
processors. The benefits of direct interconnection networks are described. This chapter 
goes on to describe the various issues involved in constructing such networks, and 
concludes with a discussion of the trade-offs involved.
2.1 Introduction
2.1.1 Scalability
Scalability is the ultimate goal in designing a parallel computer network. If true scalability
were achievable, we would see a linear increase in performance with the number
of processing elements added to the system. In practice this is not achievable, since 
there is always an increasing overhead in interconnecting more and more processing 
elements. Perhaps a more realistic aim is scalability within a given range.
In a traditional single processor computer, a bus is an adequate method for connecting
the various system resources. As more processors are connected to the same
bus, it approaches saturation [13], and the delay in accessing the bus increases for each 
processor.
2.1.2 Latency
In the context of packet routing networks, latency can be defined as “the time elapsed 
since the message transmission is initiated until the message is received at the destination
node” [55]. We are generally concerned with average and worst-case latency
rather than the latency in any single communication transaction. In a packet routing 
chip, through-node latency (also known as routing latency) refers to the average time 
for the head of a packet to travel from the input to the necessary output on the chip.
2.1.3 Grain Size
Grain size or granularity refers to the average size of code segments that are
executed between communication events. The distribution of computational tasks in
a parallel computer can be referred to as coarse, medium or fine grained. Fine grain
tasks obviously generate the highest number of communication requests, and place the 
greatest demands on the interconnection network.
2.1.4 Dynamic Switching Networks
In a dynamic switching network, connections between two nodes (processors, processor/memory)
are made by changing the settings of some (or all) of the electronic
switches in the interconnection network. Sometimes there are restrictions on the combinations
of connections that can be made in the network at any one time. These networks
are described as blocking. If simultaneous connections can be made between all inputs
and outputs at the same time, the network is described as non-blocking. Non-blocking
networks can be sub-divided into strictly non-blocking, where input/output connections
can be made arbitrarily, and wide-sense non-blocking, where switches must be
toggled according to some pre-determined algorithm in order not to disturb existing 
connections.
Dynamic switching networks are only suitable where small numbers of processing 
nodes need to be connected. For more than a few tens of nodes, the hardware requirements
become too great.
2.1.5 Circuit Switching versus Packet Switching
In circuit switched networks, a fixed connection path is made for every communicating
source and destination node pair. Data is transmitted between the nodes, and the link
is broken only when the transmission ends. This is very useful when a large volume
of data is being sent in each transmission. The transmission latency in this case is
very small because once the path is made there is nothing to block the movement of
data. Circuit switching can be wasteful if only small amounts of data are being sent, since network
resources are claimed and are not freed until the transmission is complete.
In packet switched networks, data is gathered into small bundles, known as packets. 
A packet contains data and the address in the network of the destination node. It may 
also contain other information, such as the address of the source node from which the 
data was sent and some error correction data. Each packet traverses the network from 
source node to destination node in discrete steps. The packet enters a routing node, 
where the destination address is examined. The outgoing link is determined from this 
and the packet is sent along that link to the next node. This process is repeated until the
packet reaches the desired node. There it is removed from the network, and the data is 
extracted from the packet. This has the advantage that each packet only claims a small 
amount of network resource at any given time (typically just one buffer in a node). The
disadvantages are the overhead of organising the data into packets and the
increase in latency caused by examining the packet destination at
every node in the network.
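The hop-by-hop process described above can be sketched in a few lines of Python. This is a minimal illustration only: the 2D-mesh coordinates, the packet dictionary, the function names and the dimension-order (X first, then Y) rule are all assumptions chosen for the example, since this section does not prescribe a routing algorithm.

```python
# Hypothetical sketch: a packet moving hop by hop across a 2D mesh.
# Dimension-order routing is an assumption made purely for illustration.

def next_hop(current, destination):
    """Examine the destination address and choose the next node."""
    cx, cy = current
    dx, dy = destination
    if cx != dx:                       # correct the X coordinate first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                       # then correct the Y coordinate
        return (cx, cy + (1 if dy > cy else -1))
    return current                     # already at the destination


def route(packet):
    """Forward the packet one node at a time until it is delivered."""
    node = packet["source"]
    path = [node]
    while node != packet["destination"]:
        node = next_hop(node, packet["destination"])
        path.append(node)
    return path                        # the data is extracted at the last node


packet = {"source": (0, 0), "destination": (2, 1), "data": b"payload"}
path = route(packet)
print(path)
```

Each iteration of the loop corresponds to one routing node examining the destination address, so the number of address examinations grows with the hop count; this is the latency overhead referred to above.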
Switch based networks are also called indirect networks, since the communicating 
processors must reserve several sets of switches. Direct networks use routers, rather 
than switches. Each router connects to one processing element, and to a number of 
neighbouring routers. Transmission of data between processing elements takes place 
by sending the data through a number of routers between the source and destination 
nodes.
This thesis is concerned with direct interconnection networks, since these can be 
made scalable. Direct networks offer ways to obtain high performance without excessive
amounts of hardware resources being used. The number of switches and the cost
of arbitration grow rapidly when indirect networks are increased in size, making them 
impractical for large systems.
2.2 Network Properties
When we refer to a node, this will mean a processing node containing a processor, 
memory and a packet router with an interface to the network. In designing direct interconnection
networks, a node often simply refers to the packet routing chip. Generally
we represent computer interconnection networks using graphs. The nodes are represented
by vertices and the edges represent links between nodes. We can represent the
networks using either directed or undirected graphs.
Dally [20] states that an interconnection network is described by its topology, routing
and flow control. The topology is the connectivity graph itself. Routing specifies the
manner in which packets choose a path through this graph. A routing relation is used 
to determine this. Flow control defines the way in which packets are assigned resources 
inside the routing nodes. These terms will be explained further in the remainder of this 
chapter.
2.2.1 Node Degree and Network Diameter
The number of edges incident to a node of a graph, and also to the network node it 
represents, is called the node degree, d. If the channels are uni-directional we can also 
refer to the number of channels feeding into the node as the in-degree, and the number 
leading out as the out-degree. In this case the node degree is the sum of the two. The
node degree is a reflection of the cost of implementing the node. On VLSI devices, 
the number of I/O pads is limited, so increasing the number of channels will decrease 
the bandwidth available to each channel. It is desirable to have the same degree for 
all nodes in the network so that the same nodes can be used and the network can be 
constructed in a modular manner. The diameter D of a network refers to the longest 
minimal path that exists between any two nodes in the network. It is the number of 
hops between nodes that a message will have to make when taking the shortest possible 
route between two nodes. From a communication point of view, it is desirable that the 
network diameter is as small as possible. Direct network topologies may be classified 
according to whether they are regular and symmetrical. A network is regular if all 
nodes have the same degree. A network is symmetric if the network looks the same 
from every node.
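As an illustration of these definitions, the sketch below builds two small networks as adjacency lists and measures their diameter and regularity directly. The helper names are hypothetical, and breadth-first search is simply one convenient way to find minimal hop counts; it is not a method taken from this thesis.

```python
from collections import deque

def ring(n):
    """Adjacency lists for an n-node ring (n >= 3)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def mesh_2d(r):
    """Adjacency lists for an r x r mesh without wrap-around links."""
    adj = {}
    for x in range(r):
        for y in range(r):
            candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            adj[(x, y)] = [(a, b) for a, b in candidates
                           if 0 <= a < r and 0 <= b < r]
    return adj

def hops_from(adj, start):
    """Breadth-first search: minimal hop count from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def diameter(adj):
    """Longest minimal path between any two nodes."""
    return max(max(hops_from(adj, node).values()) for node in adj)

def is_regular(adj):
    """True if every node has the same degree."""
    return len({len(neighbours) for neighbours in adj.values()}) == 1

print(diameter(ring(8)), is_regular(ring(8)))
print(diameter(mesh_2d(4)), is_regular(mesh_2d(4)))
```

The ring is regular because every node has degree 2, whereas the mesh is not, since its corner and edge nodes have fewer neighbours than its interior nodes.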
The channel bisection width, b, of the network is the minimum number of channels that must
be severed to divide the network into two equal halves. If the channels in the network are w
bits wide then we refer to the wire bisection width, B = bw. The total data rate across
the bisection of the network is referred to as the bisection bandwidth.
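A short worked example may help to fix these quantities. The figures below are illustrative assumptions rather than measurements: a bisection of 16 channels, 16-bit channels, and a hypothetical rate of 1 Gbit/s on every wire.

```python
# Sketch of the bisection quantities defined above. The per-wire data rate
# and the example figures are illustrative assumptions.

def wire_bisection_width(b, w):
    """B = b * w: channels cut by the bisection times bits per channel."""
    return b * w


def bisection_bandwidth(b, w, rate_per_wire):
    """Total data rate across the bisection in bits per second, assuming
    every wire in the cut carries the same rate."""
    return wire_bisection_width(b, w) * rate_per_wire


# Example: b = 16 channels, each w = 16 bits wide, 1 Gbit/s per wire (assumed).
B = wire_bisection_width(16, 16)
bandwidth = bisection_bandwidth(16, 16, 1e9)
print(B, bandwidth)
```

Under these assumptions the cut contains 256 wires, so doubling either the channel width or the per-wire rate doubles the bisection bandwidth.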
2.2.2 Static Interconnection Network Topologies
There is a wide variety of interconnection topologies in use in static interconnection 
networks. There is a huge number of trade-offs to be considered when choosing an 
interconnection scheme, and the demands of each system and application vary. In this
section we describe some of the possible network topologies and the relative merits 
of each of these. The primary factor in constructing interconnection networks is cost. 
The network must accommodate the communication demands of the system, while 
minimising the cost. One fairly reliable cost metric is the number of wires used. The 
cost of construction of the network is roughly proportional to it, and so is the power 
consumption [20].
Table 2.1 summarises the features of various network topologies. It should be noted 
that the connection to/from the processor is not included. In some interconnection networks,
each node contains a processor, memory, interface logic and a router. In others
the processing element (PE) is separate, and adds one extra bi-directional channel to 
the node degree.
2.3 Interconnection Network Topologies
Possibly the simplest static network structure is the linear array. This is a one-dimensional
network, where N nodes are connected in a line. The network diameter is N - 1, making
it unsuitable for large numbers of nodes. The bisection width is b = 1. If we connect
the end nodes of a linear array, we have a ring structure. This reduces the network
diameter to N/2 (assuming the ring is bi-directional). The ring is symmetrical and has
a constant node degree of 2. If we increase the node degree and add some extra links
between nodes, we obtain a structure known as a chordal ring (Figure 2.1(a)). This
reduces the network diameter. We can take this to its extreme, connecting every node
with every other to obtain a completely connected network (Figure 2.1(b)). This is only
suitable for a small network, since the number of bi-directional links grows very rapidly
(l = N(N - 1)/2) with increasing network size.

Network type          Node      Network          No. of     Bisection  Symmetry  Remarks on network
                      degree d  diameter D       links l    width B
Linear array          2         N - 1            N - 1      1          No        N nodes
Ring                  2         ⌊N/2⌋            N          2          Yes       N nodes
Completely connected  N - 1     1                N(N-1)/2   (N/2)^2    Yes       N nodes
Binary tree           3         2(h - 1)         N - 1      1          No        Tree height h = ⌈log2 N⌉
Star                  N - 1     2                N - 1      ⌊N/2⌋      No        N nodes
2D-Mesh               4         2(r - 1)         2N - 2r    r          No        r x r mesh where r = √N
2D-Torus              4         2⌊r/2⌋           2N         2r         Yes       r x r torus where r = √N
Hypercube             n         n                nN/2       N/2        Yes       N nodes, n = log2 N
CCC                   3         2k - 1 + ⌊k/2⌋   3N/2       N/(2k)     Yes       N = k x 2^k nodes with a cycle length k >= 3
k-ary n-cube          2n        n⌊k/2⌋           nN         2k^(n-1)   Yes       N = k^n nodes

Table 2.1: Parameters of various interconnection networks
2.3.1 k-ary n-cube Networks
Rings, meshes, tori and hypercubes belong to the k-ary n-cube class of networks. The
parameter n represents the dimensionality of the network and the parameter k is the
radix of the network, such that:
$N = k^{n}, \quad (k = \sqrt[n]{N}, \quad n = \log_{k} N)$ (2.1)
Networks in the k-ary n-cube class are described here:
Figure 2.1: Interconnection schemes (a) Chordal ring (b) Fully connected
Mesh Figure 2.2 shows a 2-dimensional mesh-connected network. The nodes are
simply arranged in a grid. This is a simple topology to construct, but the network
is not symmetrical. This may result in uneven traffic distribution.
Figure 2.2: A 2d mesh
Torus Figure 2.3 shows a 2-dimensional torus. The torus is the same as the mesh, but 
with added wrap-around links. These additional links serve to reduce the network 
diameter and to make the network symmetrical.
Hypercube Hypercubes are high dimensional networks. For n > 3, the network can 
no longer be mapped directly onto physical space. This results in either uniformly 
long wires, or in variable wire lengths. However, the diameter of the network is 
reduced. Figure 2.4 shows a 3d binary hypercube.
Figure 2.3: A 2-dimensional torus
Figure 2.4: A binary hypercube (2-ary 4-cube)
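The wrap-around structure of a k-ary n-cube can be made concrete with a short sketch. It assumes that each node is addressed by a tuple of n digits, each in the range 0 to k - 1, and that every dimension has wrap-around links; the function names are hypothetical and chosen only for this illustration.

```python
from itertools import product


def torus_neighbours(node, k):
    """Neighbours of `node` in a k-ary n-cube with wrap-around links.

    `node` is a tuple of n digits, each in range(k) (an assumed addressing scheme).
    """
    nbrs = set()
    for dim in range(len(node)):
        for step in (-1, 1):
            nxt = list(node)
            nxt[dim] = (nxt[dim] + step) % k
            nbrs.add(tuple(nxt))
    nbrs.discard(node)  # only relevant for the degenerate case k = 1
    return nbrs


def torus_distance(a, b, k):
    """Minimal hop count: in each dimension take the shorter way round the ring."""
    return sum(min((x - y) % k, (y - x) % k) for x, y in zip(a, b))


def torus_diameter(k, n):
    """The network is symmetric, so measuring from the origin is sufficient."""
    origin = (0,) * n
    return max(torus_distance(origin, v, k) for v in product(range(k), repeat=n))


print(len(torus_neighbours((0, 0), 4)), torus_diameter(4, 2))
print(len(torus_neighbours((0, 0, 0), 2)), torus_diameter(2, 3))
```

For a 4-ary 2-cube the sketch gives a node degree of 4 and a diameter of 4, in agreement with the 2n and n⌊k/2⌋ entries of Table 2.1. For k = 2 the two directions in each dimension reach the same neighbour, so the binary 3-cube has degree 3 and diameter 3, matching the hypercube row.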
2.3.2 Tree Networks
Figure 2.5(a) shows a binary tree. In general, a completely balanced, k-level binary
tree has N = 2^k - 1 nodes. The maximum node degree is 3 and the diameter
is 2(k - 1). The binary tree is a scalable architecture, but the diameter quickly grows
with the number of nodes. Tree topologies are not symmetrical.
Figure 2.5: Interconnection schemes (a) Binary tree (b) Fat tree
In comparing different interconnection schemes, one concern is the degree of connectivity
between nodes. In symmetric networks, the bisection width is a direct measure
of this. Two extremes of this are the linear array and the completely connected network. 
The connectivity of other types of networks lies between these two extremes. It may 
seem sensible to simply use fully connected networks, since these offer the greatest 
connectivity. They are impractical for large systems, since the number of links required
grows as N(N - 1)/2. Networks with less connectivity are used in order
to keep the required resources to a minimum, and to achieve a reasonable performance/cost
ratio. A reasonable compromise is to use a hypercube topology. One useful
feature of hypercubes is that the node degree grows logarithmically with network size, 
which means that the bisection bandwidth remains proportional to the number of nodes. 
The downside of this is that the network size is fixed by the degree of the routing node. 
Hypercubes of dimension higher than 3 cannot map directly into physical space. Logically
adjacent nodes may be physically placed on opposite sides of the system. This
results in a longer network cycle time. One solution to this is to use pipelined channels [91].
In a network with pipelined channels, multiple bits may be in flight on each
wire at any given time. This serves to decouple network cycle time from wire lengths, 
making higher dimension networks more attractive. Although this method improves 
network throughput in high dimension networks, it also increases the latency incurred 
by each packet.
2.4 Message Routing
2.4.1 Store-and-Forward Routing
Store-and-Forward routing was used in the first generation multicomputers [65]. Here 
each packet is buffered entirely in a routing node. A packet can only move from one 
node to the next if there is enough buffer space free to store the entire packet in the
receiving node. The communication latency in store-and-forward networks is directly
proportional to the distance (number of hops) between the source and destination nodes.
The base network latency (latency in an uncongested network) for a store-and-forward
router is given by:
$T_b = c \times d \times D \times L$ (2.2)
The variable c represents the network cycle time (the clock period), and d is the routing
delay per node (in cycles). The average number of hops made by each packet is D, and
the length of the packet (in flits) is given by L.
2.4.2 Wormhole Routing
The most popular routing method in modern multiprocessor interconnection networks
is wormhole routing. Packets are divided up into separate units, each the width of 
the router data-path, known as flits (flow-control digits). The packet is transmitted as 
a train of flits through the network. The front flit or header flit(s) contains routing 
information that is used to direct the packet through the network. In order for a node 
to start receiving a packet, it only needs to be able to buffer a single flit. The other flits 
form a tail to the header flit and can be distributed over many nodes. The header flit 
is eventually received at the destination node, followed in strict order by the remaining 
flits in the packet.
One benefit of this method is that communication latency is drastically reduced 
over store-and-forward routing, assuming that the network is relatively uncongested. 
The flits effectively form a pipeline between the source and destination node. On every 
cycle in which the packet header is not blocked, the packet progresses closer to the 
destination node. The base network latency of a wormhole router is given by:
$T_b = c\,[(D + 1)\,d + L - 1]$ (2.3)
The term c represents the network cycle time (the clock period), and d is the routing delay
per node (in cycles). The average number of hops made by each packet is D, and the
length of the packet (in flits) is given by L. Because L is added to the distance term here,
rather than multiplied by it as in equation 2.2, wormhole routing is considerably faster
than store-and-forward routing for large L.
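The difference between equations 2.2 and 2.3 is easiest to see with numbers. The sketch below simply evaluates both expressions; the function names and the parameter values in the example (c = 1, d = 2, D = 4, L = 32) are arbitrary choices for illustration and are not taken from any measured system.

```python
def store_and_forward_latency(c, d, D, L):
    """Base latency of equation 2.2: every hop handles the whole packet."""
    return c * d * D * L


def wormhole_latency(c, d, D, L):
    """Base latency of equation 2.3: the header pays the routing delay,
    and the remaining L - 1 flits follow in a pipeline."""
    return c * ((D + 1) * d + L - 1)


# Illustrative values only: unit cycle time, 2 cycles of routing delay per node,
# 4 hops on average, 32-flit packets.
c, d, D, L = 1, 2, 4, 32
print(store_and_forward_latency(c, d, D, L))  # 256
print(wormhole_latency(c, d, D, L))           # 41
```

With these values the store-and-forward latency is 256 cycles against 41 cycles for wormhole routing, and the gap widens as either L or D grows.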
Another benefit of this technique is that fewer buffers are required in each node, 
since only individual flits need to be buffered, rather than entire packets. One disadvantage
of wormhole routing is that the flits in a packet can be distributed over a number
of network nodes, resulting in increased blocking in the network. Kermani and Kleinrock [80]
devised a technique to overcome this, known as virtual cut-through. As long
as packets can progress in the normal manner, they do so. When a packet header is 
blocked, it is diverted into a cut-through buffer, where the trailing flits are also received 
until the entire packet is stored. The tail of the packet no longer causes unnecessary 
blockages in the network. As soon as the desired output channel is free, the packet is 
read out of the cut-through buffer and continues on its route. The expression for the 
base network latency is the same for wormhole and virtual cut-through routing. Virtual
cut-through routers require many more buffers than wormhole routers, since they
must be able to store entire packets when the header flit is blocked. Although virtual 
cut-through appears to be an extension of wormhole routing, it actually predates it.
2.4.3 Hardware Support for Additional Communication Patterns
It may be useful to support additional, frequently used communication patterns directly 
in hardware. The types of patterns that are sometimes provided in hardware are described
here.
unicast The unicast pattern is the basic point-to-point, one-to-one routing function 
supported by all packet routers.
multicast This is a one-to-many communication pattern where the source node sends 
the same message to multiple destination nodes.
broadcast Broadcast involves one-to-all communication. This is useful in supporting 
system functions, such as synchronisation.
Simple point-to-point (unicast) routing is a standard feature of packet routers. All other
communication patterns may be implemented by combining many point-to-point routes. Including
dedicated mechanisms in hardware is more efficient, provided that the additional
logic required is not too complex.
2.5 Adaptive Routing, Oblivious Routing and Network Dimensionality
Once the choice has been made to use a direct interconnection network to route packets
between processors, decisions must be made about the network topology and the
routing strategy.
Adaptive routing algorithms can be classified as minimal or non-minimal. In either 
case, the packets do not follow a fixed path from source to destination, but are allowed 
to make turns. A turn involves a switch from one dimension to another. Figure 2.6 
illustrates three methods of routing a packet from a source to a destination node. Route 
A uses oblivious dimension-order routing, where the packet follows a fixed, minimal 
path. Route B uses minimal adaptive routing. Each packet can switch between the X 
and Y dimensions as long as it moves closer to the destination node on each hop. Route
C shows an example of non-minimal adaptive routing. The packet can mis-route (take a 
turn which causes the packet to follow a non-minimal path) at any node if it is blocked. 
This is done to allow the packet to avoid congested areas of the network or faulty nodes.
The choice of routing scheme depends on a number of factors. Deadlock avoidance 
or recovery must be taken into account. Oblivious routers are simpler to implement 
and hence faster, but adaptive routing may perform better in a congested network by 
utilising unused channel bandwidth. Non-minimal routing is of limited benefit and 
must be restricted in order to avoid livelock, which is the condition where packets are 
moving in the network, but never reach their intended destinations.
Figure 2.6: Oblivious, minimal adaptive and non-minimal adaptive routing
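The difference between the oblivious and minimal adaptive routes can be sketched for a 2D mesh. The code below is an illustration under simplifying assumptions: nodes are (x, y) tuples, congestion is modelled by a hypothetical set of `busy` nodes, and a packet with no free productive hop simply takes the first one (standing in for waiting). None of these names come from the text.

```python
def productive_hops(cur, dst):
    """Neighbouring nodes that bring the packet one hop closer (X listed first)."""
    (x, y), (dx, dy) = cur, dst
    hops = []
    if x != dx:
        hops.append((x + (1 if dx > x else -1), y))
    if y != dy:
        hops.append((x, y + (1 if dy > y else -1)))
    return hops


def dimension_order_route(src, dst):
    """Oblivious routing: finish the X dimension before starting on Y."""
    path = [src]
    while path[-1] != dst:
        path.append(productive_hops(path[-1], dst)[0])
    return path


def minimal_adaptive_route(src, dst, busy=frozenset()):
    """Minimal adaptive routing: any productive hop may be taken, so the
    packet prefers one that does not lead into a busy node."""
    path = [src]
    while path[-1] != dst:
        options = productive_hops(path[-1], dst)
        free = [h for h in options if h not in busy]
        path.append((free or options)[0])
    return path


print(dimension_order_route((0, 0), (2, 2)))
print(minimal_adaptive_route((0, 0), (2, 2), busy={(1, 0)}))
```

Both routes from (0, 0) to (2, 2) take four hops. The oblivious route always passes through (1, 0), whereas the adaptive route turns into the Y dimension first to avoid that node.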
2.6 Virtual Channel Flow Control
Dally originally proposed the idea of virtual channel flow control [22]. The central 
idea is to decouple a router's buffer resources from its transmission resources. A simple
router consists of a switch and a set of I/O buffers for each channel. The I/O buffers 
can be split into separate classes corresponding to each virtual channel, with two or
more virtual channels for each physical channel. The virtual channels take turns to use
the physical link according to some arbitration scheme. This gives the effect of having
a number of links, rather than just one. Packets generally remain in the same virtual
channel for the entire course of their transmission, but this does not have to be so.
Virtual channels provide a greater utilisation of physical bandwidth. This is achieved 
because packets which would normally be blocked by another packet in their path have 
an opportunity to overtake such packets by routing on a separate virtual channel. This 
is illustrated in figure 2.7.
Figure 2.7: Virtual channels [22]. Routing (a) without and (b) with virtual channels
In the network without virtual channels shown in figure 2.7(a), packet B is blocked 
making a turn at node 4. As a result, packet A is queued up behind packet B until 
the blockage clears and the packets can progress again. In figure 2.7(b) two virtual 
channels are employed. When packet B becomes blocked, packet A is able to divert to 
another virtual channel and continues moving towards its destination.
Virtual channels were originally designed to improve channel utilisation
by allowing packets to progress past others that are blocking them. They have also been
used extensively in deadlock prevention. Different routing functions can be applied to
each set of virtual channels, to produce a scheme that is overall deadlock free, but may
allow cyclic dependencies to exist in some of the virtual channels. This is discussed
further in section 2.7.
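The way in which virtual channels share a physical link can be sketched in a few lines. The arbitration scheme is not specified in the text, so round-robin is assumed here purely for illustration; the class `PhysicalLink` and its attributes are likewise hypothetical.

```python
from collections import deque


class PhysicalLink:
    """One physical link shared by several virtual channels (VCs).

    Each VC has its own flit buffer. Round-robin arbitration is an
    assumption made for this sketch, not a scheme taken from the text.
    """

    def __init__(self, num_vcs):
        self.vcs = [deque() for _ in range(num_vcs)]
        self.blocked = [False] * num_vcs  # True if the VC cannot advance downstream
        self._next = 0                    # VC that has priority on the next cycle

    def send_one_flit(self):
        """Send one flit from the next non-empty, unblocked VC, or None if idle."""
        n = len(self.vcs)
        for offset in range(n):
            vc = (self._next + offset) % n
            if self.vcs[vc] and not self.blocked[vc]:
                self._next = (vc + 1) % n
                return vc, self.vcs[vc].popleft()
        return None


# Packet B is blocked on VC 0; packet A on VC 1 still makes progress.
link = PhysicalLink(num_vcs=2)
link.vcs[0].extend(["B0", "B1"])
link.vcs[1].extend(["A0", "A1"])
link.blocked[0] = True
print(link.send_one_flit(), link.send_one_flit())
```

With a single virtual channel, packet A would sit in the same buffer behind packet B and the link would stay idle, which is the situation shown in figure 2.7(a).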
2.7 Deadlock, Livelock and Starvation
As mentioned earlier, the issues of deadlock, livelock and starvation must be dealt with 
in the design of direct interconnection networks. Livelock is the condition where packets
keep moving, but never reach their destination node. A simple solution to this is to
only allow packets to make hops which bring them closer to the destination node. In
other words, if minimal routing is used, livelock will not occur. Starvation is the condition
where a packet is unable to progress because the resources it requires are always
granted to other packets. This can be avoided by having a correct resource assignment 
scheme [55], in which all packets requesting a resource are eventually guaranteed use 
of that resource. The condition of deadlock is dealt with in the following sections.
2.7.1 Deadlock Avoidance
The packet buffers inside the routing nodes in an interconnection network are a finite 
resource. Due to this fact a condition known as deadlock may arise, if steps are not 
taken to prevent it. Deadlock is the condition where each packet within a given cycle of 
nodes is unable to proceed because the buffer it intends to move to is already occupied. 
Since every packet is stationary and cannot proceed, none of the packets can ever move
and we have deadlock. This situation is shown in figure 2.8.
Figure 2.8(a) shows one situation where deadlock can occur, and figure 2.8(b) shows
the corresponding channel dependency graph. The vertices of the graph represent the channels in the
network and the edges represent the connections between those channels. Deadlock is
characterised by cyclic dependencies, as shown in figure 2.8(b). One method of ensuring
deadlock freedom is to break the cyclic channel dependencies. This can be done by restricting
the routing algorithm in some way. This is the basis of dimension-order routing.
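Checking a channel dependency graph for cycles is a standard graph search. The sketch below assumes the graph is stored as a dictionary mapping each channel to the channels it may wait on; the channel names c0 to c3 and the particular dependency that is removed are invented for the example.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a directed dependency graph.

    `graph` maps each channel to the channels it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {v: WHITE for v in graph}

    def visit(v):
        colour[v] = GREY
        for u in graph.get(v, ()):
            state = colour.get(u, WHITE)
            if state == GREY:
                return True  # reached a channel already on the current path
            if state == WHITE and visit(u):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in graph)


# Four channels around a ring of four nodes: each waits on the next one.
ring_cdg = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
# A restricted routing function that never lets a packet go from c3 to c0.
restricted_cdg = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": []}

print(has_cycle(ring_cdg), has_cycle(restricted_cdg))
```

The unrestricted ring contains a cycle and can therefore deadlock, while removing a single dependency leaves an acyclic graph.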
2.7.2 Deadlock in Store-and-Forward Networks
In store-and-forward networks, the task of ensuring deadlock freedom is quite straightforward.
Many methods involve dividing the buffer pool in each node into various
classes and restricting the set of packets which may access each buffer class [40]. One 
simple approach is to restrict access to buffers according to the number of hops which a
packet has made. The drawback with this approach is that each node must have at least
H + 1 buffer classes, where H is the maximum number of hops a packet must make.
Figure 2.8: The cause of deadlock: (a) Deadlock occurs (b) channel dependency graph
A structured approach to creating deadlock free store-and-forward networks is to 
use directed buffer graphs [81].
Much of the research in the area of Store-and-Forward deadlock has concentrated 
on minimising the number of buffers required to guarantee deadlock freedom [45].
Cypher and Gravano [84, 83] have produced deadlock free routing algorithms for
store-and-forward and virtual cut-through networks, with minimal buffering requirements.
One algorithm allows full minimal adaptive routing in torus networks, with
only three packet buffers in each node, regardless of the size and dimensionality of the
torus. The other algorithm does not use a central queue and requires only two buffers
per edge.
In modern CMOS routers, silicon area is not a critical resource, so the number of
buffers used is of less concern. Routing performance is more important, and in any 
case, most modern routers use wormhole routing.
2.7.3 Deadlock in Wormhole Routed Networks
The task of ensuring deadlock freedom becomes more complex when wormhole routing 
is used. Packets can vary in length in wormhole routers. This means that it is impossible 
for a node that receives a packet to be able to guarantee that the entire packet can be
buffered.
Various methods of preventing deadlock in wormhole router networks are described 
in this section. The ultimate goal of such methods is maximum routing freedom with 
minimum resources and overhead. The use of adaptive routing complicates the deadlock
prevention scheme and increases the hardware complexity of the router. This
brings a trade-off between latency at low network loads (oblivious routing) and throughput
when the network is heavily loaded (adaptive routing). Ni and McKinley have
published a survey of wormhole routing techniques [69].
Dally and Seitz [24] showed that removal of cyclic dependencies is a sufficient 
condition for deadlock freedom in mesh connected networks. They proposed using 
dimension order routing. Packets traverse each dimension in the network in a fixed 
order. This scheme is simple to implement and does not require the use of virtual 
channels. Later work by Dally [22] suggested the use of virtual channels to increase 
channel utilisation, which was described in section 2.6. Each virtual channel has its 
own set of input and output buffers in a routing node. A routing function defines the set 
of usable output links for each packet, based on the current and destination nodes. The 
use of virtual channels allows different routing functions to be applied to packets using 
different sets of virtual channels.
The equivalent of dimension-order routing for a hypercube network is the e-cube routing algorithm [43].
In this method, when a packet enters a routing node its destination address is compared
with the address of the current node. The most significant bit in which the two addresses
differ is determined, and the packet may then only leave the node on the channel corresponding
to that bit in the address. This is repeated at every node that the packet
traverses. Thus, the addressing of the packet is resolved in strict order.
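The e-cube rule described above reduces to a few bit operations. The following sketch assumes that node addresses are non-negative integers whose bits correspond to hypercube dimensions; the function names are illustrative.

```python
def e_cube_next_hop(current, destination):
    """Next node under e-cube routing, or None if the packet has arrived.

    The packet leaves on the channel for the most significant bit in which
    the current and destination addresses differ.
    """
    diff = current ^ destination
    if diff == 0:
        return None
    bit = diff.bit_length() - 1  # index of the most significant differing bit
    return current ^ (1 << bit)


def e_cube_route(source, destination):
    """Full path taken by a packet, including the source and destination."""
    path = [source]
    while path[-1] != destination:
        path.append(e_cube_next_hop(path[-1], destination))
    return path


print([format(v, "03b") for v in e_cube_route(0b110, 0b011)])
```

A packet travelling from node 110 to node 011 corrects the most significant differing bit first, visiting 110, 010 and then 011.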
Yantchev and Jesshope [60] put forward a scheme based on partitioning the network 
into 2^n virtual networks (VNs). Each virtual network is acyclic in nature. Packets may
move between networks in a fixed order, but not back. For example, a 2-dimensional
mesh network would be partitioned into four virtual networks: (+X, +Y), (+X, -Y),
(-X, +Y) and (-X, -Y). The number of VNs required makes this technique less
applicable for higher dimension networks. However, in certain cases it is very useful.
The scheme was proposed in conjunction with Mad Postman routing. This routing 
method is useful for routers with narrow channels. The packet is wormhole routed in 
one dimension. Flits making up the packet propagate in the normal manner. When 
enough flits are received to determine that the packet should make a turn, the dead 
header flit(s) continue to propagate, until blocked, when they are killed. The tail of 
the packet makes a turn and the routing process continues. This scheme has very low 
latency, since the packet progresses on every clock cycle and the router logic is very
simple.
The *-channels routing algorithm [77] provides deadlock-free full minimal adaptive
routing in wormhole torus networks. Six sets of virtual channels are used. Four of
these, the star-channels, have a restricted routing function. Dimension-order routing
is used on them. Two virtual channels in each routing direction are required, due to
the dependencies introduced by the wraparound links. The routing algorithm does not
require the use of all virtual channel buffers in each node, so only five virtual channel
buffers are needed for each bi-directional link. The highest dimension requires only
three such buffers, since the non-star channels are not used. Packets making hops not
normally permitted by dimension order routing do so in the non-star channels. Packets
may switch between the two classes of channels, so that all minimal paths are allowed.
This routing algorithm requires 10(n - 1) + 6 buffers per node.
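The buffer count can be rebuilt from the per-link figures. The working below assumes that each node has two bi-directional links in every dimension (one in each direction), which the text does not state explicitly:

```latex
\begin{align*}
\text{buffers per node}
  &= \underbrace{2 \times 5 \times (n - 1)}_{\text{lower } n - 1 \text{ dimensions}}
   + \underbrace{2 \times 3}_{\text{highest dimension}} \\
  &= 10(n - 1) + 6
\end{align*}
```

Under that assumption a 2-dimensional torus (n = 2) needs 16 buffers per node and a 3-dimensional torus (n = 3) needs 26.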
Dally and Aoki [23] have produced a deadlock free wormhole routing algorithm 
which allows packets to use all available paths. The scheme is based on keeping a count 
of dimension reversals. A dimension reversal (DR) is a hop which does not conform 
to the dimension-order routing function. A class of virtual channels is assigned to 
each DR number, so the number of dimension reversals allowed is equal to the number 
of virtual channels. When a packet has reached the limit of dimension reversals, it 
is restricted to dimension-order routing. This algorithm is static in nature. A dynamic 
algorithm is also proposed which eliminates dependencies in the packet wait-for graph.
The wait-for graph is used to indicate the resources which blocked packets are waiting 
to use. The dynamic algorithm restricts packets so that they are not allowed to wait for 
the use of a virtual channel held by a packet with a lower dimension reversal number. 
The static algorithm is wasteful in terms of virtual channel resources. The dynamic
algorithm performs better, but becomes unstable at high network loads. Throttling the 
injection channels was shown to make the algorithm stable.
Chien and Kim’s Planar Adaptive Routing (PAR) [15] requires only 3 virtual channels
per physical channel, and allows partially adaptive routing in k-ary n-cube networks
where n > 2. Routing is restricted so that adaptation is allowed in two network
dimensions at a time. One of the main benefits of this approach is its low hardware
overhead. The number of virtual channels remains constant with increasing n. The
restrictions on routing adaptivity allow the crossbars used for routing to be partitioned
into several smaller ones. PAR does provide fault-tolerant routing, but the faulty nodes
must be detected and marked statically.
The turn model [11] is perhaps the least restrictive deadlock prevention scheme to 
date. The model is based on analysing the turns that packets may take and the cycles 
that those turns can form. Prohibiting just enough turns will break the cyclic dependencies
that characterise deadlock. The resulting routing function provides packets with
a selection of minimal routes between source and destination. This scheme does not
require any additional physical or virtual channels, but these can be used to provide 
greater routing adaptivity.
Duato developed a number of design methodologies for creating fully adaptive
deadlock free routing algorithms [53, 52, 54]. These are combined into one methodology [55],
which supplies deadlock free routing algorithms for virtual cut-through
and store-and-forward networks. It also supplies algorithms for wormhole routing networks,
but an additional verification step is required. The methodology starts with a
deadlock free routing algorithm, which can be adaptive or oblivious. Virtual channels
are added in such a manner that fully adaptive or minimal adaptive routing functions are
produced. The verification step (if wormhole routing is to be used) involves ensuring 
that the channel dependency graph for the new routing function is acyclic. This is found 
to be true for many minimal adaptive algorithms, but few non-minimal ones. When the 
algorithms are restricted to minimal paths, they are referred to as Duato’s Protocol [78] 
(DP).
The Cray T3E multiprocessor [89] uses a 3D torus interconnection network. Deadlock
prevention is implemented using a technique known as direction order routing.
Packets traverse the network in direction order. In the T3E the order is +X, +Y, +Z, -X,
-Y, -Z. Since the torus structure has wraparound links, packets may travel in either direction
in any given dimension and they will still reach the correct node (although they
may have taken a longer route). Take the example of a packet routed in the following
order: +X, +Y then +Z. If the X route is changed to -X, then the packet will now be
routed in the order +Y, +Z then -X. This results in the packet taking a different route
and also turning on a different corner. Packets routed using dimension order always
turn on the same corner. The result of using direction order routing is greater fault tolerance,
since faulty nodes can be avoided, while maintaining freedom from deadlock.
The T3E router uses five sets of virtual channels. Two of these are used to eliminate the
possibility of deadlock introduced by the wraparound torus links.
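The reordering in the example above follows mechanically from the direction order. The sketch below assumes a packet's route is given as signed hop counts in X, Y and Z; the function name and this representation are illustrative and are not a description of the T3E hardware.

```python
# Direction order quoted in the text for the T3E.
DIRECTION_ORDER = ["+X", "+Y", "+Z", "-X", "-Y", "-Z"]


def direction_order_legs(dx, dy, dz):
    """Order the legs of a route by direction.

    dx, dy and dz are signed hop counts (an assumed representation); the
    result lists (direction, hops) pairs in the order they are traversed.
    """
    offsets = {"X": dx, "Y": dy, "Z": dz}
    legs = {
        ("+" if hops > 0 else "-") + axis: abs(hops)
        for axis, hops in offsets.items()
        if hops != 0
    }
    return [(d, legs[d]) for d in DIRECTION_ORDER if d in legs]


print(direction_order_legs(2, 1, 3))
print(direction_order_legs(-2, 1, 3))
```

Reversing the X leg moves it from first to last in the traversal, so the packet turns at a different corner, as described in the text.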
Multi-processor interconnection networks must have some scheme for handling 
deadlock. Although it has been shown to occur infrequently [93], it cannot be ignored.
2.8 Network Analysis
The performance of direct interconnection networks is dependent on the interaction 
between a large number of system parameters. In order to analyse the relationships 
between network parameters, simplifying assumptions must be made, and the network 
must be constrained in some way. Previous work in this area is described here.
Dally [20] analysed k-ary n-cube interconnection networks, asserting that the network
is fundamentally limited by the cost of wiring. Various network topologies were
examined under constant bisection width conditions. This means that low dimension
networks have wider channels than high dimension networks. The study assumed that
wormhole routing was used. The results showed that low dimension networks were
preferable, due to their lower latency and higher hot-spot throughput. In Dally’s analysis,
only wire delay was considered, while switching delay was neglected. Dally [21]
later proposed augmenting low-dimension networks with additional express channels.
These channels connect selected nodes with others that would ordinarily be several
hops away in the same dimension. The cost of this is that the degree of some nodes
increases. The benefit is a reduction in latency for non-local communications.
Agarwal’s analysis [4] took both switch and wire delays into account. The relative
effects of these delays were examined under various constraints, such as constant bisection
width, fixed channel widths and constant node sizes. A simple contention model
was derived and used to examine network performance under various conditions. The
increased wire lengths introduced when mapping high dimension networks into physical
space were taken into account. The wire delays were based on mapping the network
under examination into two dimensions.
Agarwal’s research generated the following conclusions:
Packet Size Agarwal found that smaller message sizes allowed the network to be operated
closer to its theoretical peak bandwidth, without significant increases in
latency. The converse of this is that for a given channel utilisation, latency will
be lower.
Constant Bisection Width and Wire Density Under conditions of constant wire density
and constant bisection width, two-dimensional meshes yield the lowest latency.
The reason for this is that higher dimension networks cannot map directly
into physical space. This means that nodes which are logical neighbours
in the network may not be physically adjacent, leading to increased wire delays.
When node delays were taken into account, “moderately high” dimension networks
were favoured [4]. This is due to the fact that the proportionate influence
of switch delay over wire delay increases, lessening the influence of longer wires
in high dimension networks.
Message Length Message lengths influence the optimal network dimensionality. When
message lengths are long the relative influence of distance between nodes is lessened.
Short message lengths increase the relative influence of distance, favouring
networks with more dimensions.
Constraints The results obtained were highly dependent on the chosen constraints. 
When bisection bandwidth was constrained, low dimension networks were favoured,
due to the influence of wire delays in higher dimension networks. When only 
node size was constrained, the reverse was true. This is due to the fact that higher 
dimension networks have a greater bisection bandwidth.
Communication Locality Communication locality is dependent on the application being
run and the manner in which its processes are distributed. Under conditions
of constant bisection width, low dimension networks perform best when communication
locality exists. This is due to the wider channels, and hence shorter
message lengths.
Agarwal’s analysis provides the reader with some insight into the relationship between
various network parameters. There are some limitations to this analysis, however.
Firstly, networks of varying dimensions were compared assuming that they were physically
mapped into two dimensions. Three dimensions could have been used, and this
slightly favours higher dimension networks. Secondly, the complexity of the routing
nodes is not taken into account. Routing a packet in an n dimensional network requires
the packet to make up to n - 1 turns if oblivious routing is used, and more turns if
adaptive routing is used. The latency for a packet making a turn is higher, since the
packet must pass through a crossbar switch to reach the desired output channel. This
usually adds several clock cycles of delay for each turn.
2.9 Maximum Throughput and Injection Rate
The maximum throughput of a direct interconnection network is equal to the maximum 
throughput of all the wires in the network. In practice, maximum throughput is unachievable,
since it assumes complete communication locality (each packet makes only
one hop), a uniform traffic distribution, and that there is no contention.¹
Exploiting communication locality is a useful feature of direct interconnection networks.
Communication locality in applications serves to increase the throughput of
the network. This increase in throughput is due to the decrease in transit traffic. An
explanation of the role of transit traffic in network performance is given in [5]. The
connection between the processor and the network can be viewed as a single channel. 
Figure 2.9 shows this arrangement. A uni-directional channel is shown but the analysis 
is equally applicable to bi-directional channels.
Figure 2.9: Effect of transit traffic on available network bandwidth (local processor
traffic: 1/D; transit traffic: (D-1)/D; average message distance: D hops)
1 This refers to internal contention in the router, rather than contention for network
channels. Contention for channels is not possible if there is complete communication
locality and the traffic pattern is uniform.
When the network is lightly loaded, applied throughput will equal achieved throughput.
The benefits of communication locality are seen when the network is heavily
loaded. A certain amount of traffic passing through each node is transit traffic, i.e. it is
not destined for that node. If D is the average number of hops that a packet makes, then
there is a 1/D chance that any given packet will be consumed at its current node. In 
other words, of all the traffic in any given node, a fraction 1/D of that traffic is destined 
for that node. The remaining 1 - (1/D) of the traffic is in transit. If D, the average
number of hops, is decreased, the bandwidth available to each processor increases.
Conversely, an increase in D results in a reduction in available bandwidth.
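The dependence of available bandwidth on D can be illustrated numerically. The following is a minimal sketch, assuming uniform traffic in which every packet makes D hops on average; the function names are illustrative and are not taken from [5].

```python
def local_fraction(avg_hops: float) -> float:
    """Fraction of the traffic at a node that is destined for that node (1/D)."""
    if avg_hops < 1:
        raise ValueError("a packet makes at least one hop")
    return 1.0 / avg_hops


def transit_fraction(avg_hops: float) -> float:
    """Fraction of the traffic at a node that is only passing through (1 - 1/D)."""
    return 1.0 - local_fraction(avg_hops)


if __name__ == "__main__":
    # Halving D doubles the share of bandwidth left for local traffic.
    for d in (1, 2, 4, 8):
        print(f"D={d}: local {local_fraction(d):.3f}, transit {transit_fraction(d):.3f}")
```

With complete communication locality (D = 1) there is no transit traffic at all, which is the idealised case assumed when quoting maximum throughput.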
2.10 Summary of Trade-offs
It is difficult to quantify the relationship between the various trade-offs involved in
designing direct interconnection networks. This is partly due to the fact that the
relationships between system parameters are application and technology dependent. We
can however draw some general conclusions regarding direct interconnection networks.
Communication Locality Communication locality in applications improves network
throughput and decreases average latency. The increase in throughput is partly
due to the reduction in contention in the network and partly due to the reduction
in transit traffic.
Packet Length Wormhole routing introduces packets of variable length. The packet
length will affect the optimal network dimensionality. Long packet lengths favour
low dimension networks. The reduced number of hops that packets must take
in high dimension networks is offset by the time taken for the trailing flits in
the packet to be received. If bisection width is constrained, then low dimension
networks have wider channels. The number of flits in the packet will be less,
so the packets will cause less contention, and the trailing flits can be received in
fewer cycles.
Router Complexity The complexity of the routing switch should be taken into ac­
count. There is a disparity in the time taken for straight-through routes and routes 
involving turns. This lessens the benefits of high dimension networks, and also 
reduces the effectiveness of adaptive routing strategies.
Constraints When comparing various network topologies, constraints must be applied.
The choice of what to constrain is important. One possible constraint is node size.
This has been used in the past [20], reflecting the fact that ICs had limited pin-out.
Modern packaging technologies allowing thousands of high speed I/O pins on a
single package have eliminated this constraint. The most realistic constraint to 
apply is constant bisection width. This is a direct reflection of the cost of building 
a direct network. It also reflects the overall power consumption of the network, 
since driving wires consumes most of the power.
Latency and Throughput Base network latency is a feature of the routers and wiring 
that make up the network. Minimising latency is important. In applications where 
communication is frequent, processors may sit idle waiting for data from across 
the network. Achieving low latency serves to minimise this idle time. As load on 
a network increases, latency increases also. Increasing the maximum throughput 
of the network means that for a given volume of data transmission, the network 
will be operating at a smaller fraction of its maximum load, and hence latency 
will be kept to a minimum. The theoretical maximum throughput of the network 
can be increased by adding more links, or by making each channel wider or 
faster. The utilisation of the available bandwidth can be improved by exploiting 
communication locality.
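The interaction between packet length, channel width and turn cost summarised above can be explored with a simple zero-load latency estimate. This is a sketch under stated assumptions rather than a model used in this thesis: the header is assumed to pay a fixed number of cycles per hop plus a fixed penalty per turn, and the remaining flits are assumed to stream behind it at one flit per cycle. All names and default values are hypothetical.

```python
import math


def zero_load_latency(hops, turns, packet_bits, channel_width_bits,
                      hop_cycles=1, turn_penalty_cycles=2):
    """Estimate wormhole latency in cycles when there is no contention.

    hop_cycles and turn_penalty_cycles are illustrative assumptions.
    """
    # Wider channels mean fewer flits for the same packet.
    flits = math.ceil(packet_bits / channel_width_bits)
    header_cycles = hops * hop_cycles + turns * turn_penalty_cycles
    # The trailing flits arrive one per cycle behind the header.
    return header_cycles + (flits - 1)


if __name__ == "__main__":
    # Illustrative comparison for a 256-bit packet.
    low_dim = zero_load_latency(hops=16, turns=1, packet_bits=256, channel_width_bits=32)
    high_dim = zero_load_latency(hops=6, turns=5, packet_bits=256, channel_width_bits=8)
    print(low_dim, high_dim)
```

In this illustrative case the low dimension network delivers the long packet sooner, despite the larger hop count, because the packet occupies fewer flits and makes fewer turns. For a packet short enough to fit in a single flit the comparison reverses.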
Chapter 3 
Packet Routers
The packet router device is responsible for the efficient delivery of packets in direct 
interconnection networks. Assuming that the network topology and size have been set, 
two of the router parameters are of most concern. The throughput of each router chan­
nel will determine the overall throughput of the network. The latency of the individual 
routers determines the time taken to deliver each packet from an input link to an output 
link. Messages must make multiple hops in the network, so reducing the router latency 
will reduce the delay at every hop. The throughput and latency are related. Latency 
increases at a greater-than-linear rate as throughput is increased, due to increasing con­
tention in the network. The aims of the work presented in this thesis are low latency 
and high throughput message delivery.
The latency of a given packet router can be split into two components, flow control 
delay and router delay. Flow control is concerned with the time taken to transmit a flit 
from one router to the next on its path. Router delay is the time taken for messages 
(or at least the header flit) to pass from a router input to a desired output channel. A 
flit is the basic unit of flow control data in the router. Another commonly used term is
phit (physical digit), which refers to the units in which data is transported across the 
physical links between routers. In this chapter, several past and current router designs 
are discussed and their benefits and drawbacks are explained. The typical construction 
of routers for parallel architectures is then described, and the purpose of each feature 
is explained. The example used is a wormhole router, since these are most commonly 
used. This chapter concludes by examining some aspects of the hardware complexity 
involved in constructing packet routers.
3.1 Existing Routers
In this section a number of commercial and research routers are described. They are 
classified here according to the switching technique used.
3.1.1 Store and Forward Routers
Packet switching (also known as store-and-forward) was used in several early multi­
processor systems. The Cosmic Cube [17] is a 64 node system using a hypercube 
topology. The Intel iPSC/2 and iPSC/860 systems were among the first to employ
heterogeneous network nodes. Previously I/O functions had been controlled by a host
machine. Other machines using packet switching are the Denelcor HEP, the MIT Tagged
Token Dataflow Machine and the Manchester Dynamic Dataflow Machine [58].
The main disadvantage of the use of packet switching is the large latency, due to 
storing the entire message at each node and then re-transmitting it.
3.1.2 Virtual Cut-Through Routers
The Chaos Router [10, 63, 64] attempts to offer improved routing performance in
congested networks through algorithmic methods, while matching the performance of
simple oblivious routers when there is no congestion. The Chaos Router is a randomising,
non-minimal adaptive wormhole router for use in 2-d mesh and torus networks. In the
absence of contention for channels, packets simply cut through [80]. Packets may be 
routed on any output channel which brings them closer to their intended destinations. 
The main feature of the router is the addition of a central multiqueue. This is used under 
two conditions:
1. Packet exchanges for deadlock prevention. If routers on either side of a shared 
link both have packets to send to one another, both packets must be sent. If either 
of the two input buffers is in use, the packets they contain are moved into their 
respective multiqueues.
2. Packets unable to cut-through. When an entire packet is buffered in an input 
buffer, unable to progress, it is fed into the multiqueue.
If the multiqueue is full and is required to accept a new packet then one of its packets 
is selected to be de-routed. When this happens, the packet is simply routed on the first 
available output channel. This is the only mechanism in the Chaos Router for routing 
along non-minimal paths.
The multiqueue is always given priority over the input buffers when requesting 
output buffers.
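The overflow behaviour of the multiqueue described above can be sketched as follows. This is a toy model and not the Chaos Router implementation: the class and function names are hypothetical, and the victim is assumed to be chosen uniformly at random, which is consistent with the description of the router as randomising but is not stated explicitly in the text.

```python
import random


class Multiqueue:
    """Toy central multiqueue that de-routes one packet when it overflows."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.packets = []
        self._rng = random.Random(seed)

    def accept(self, packet):
        """Store a packet; return the de-routed victim if the queue was full."""
        victim = None
        if len(self.packets) >= self.capacity:
            # Assumption: the packet to de-route is picked at random.
            victim = self.packets.pop(self._rng.randrange(len(self.packets)))
        self.packets.append(packet)
        return victim


def first_available_output(output_free):
    """Return the first free output channel, or None if all are busy."""
    for channel, free in output_free.items():
        if free:
            return channel
    return None
```

A de-routed packet is sent on whichever channel `first_available_output` returns, regardless of whether that channel brings it closer to its destination.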
In an uncongested network, the multiqueue will not be used very often. When 
the network becomes congested the router aims to improve performance through fully 
adaptive routing, utilising otherwise unused network bandwidth.
The Chaos Router has a hardware complexity that is greater than that of a simple 
oblivious router. The improvements in throughput in congested networks must be offset
against the increased average latency when the network is lightly loaded.
3.1.3 Wormhole Routers
The majority of past routers for direct networks have employed wormhole routing, since 
it has low buffering requirements, allowing simple, fast routers to be constructed. The 
MIT J-Machine [34] is a multicomputer system made up of message driven proces­
sors. Each node consists of a processor, 4096 word by 36 bit memory, and a network 
port. The nodes are connected in a 3-d grid topology. Routing is done using the de­
terministic e-cube routing algorithm [43], in which dimensions are traversed in a fixed 
order. The routers in the J-machine have a 62.5 ns routing latency, which is one pro­
cessor clock cycle. The I/O links are bi-directional and are 15 bits wide. Nine bits are 
used for data and the others are control bits. The router has an aggregate data rate of 
864 Mbits/sec.
The Mosaic Multicomputer uses single nodes which combine processor, mem­
ory and routing [18]. The project spawned a number of routers, used in research and 
commercial systems, known as Mosaic Routing Chips (MRCs) [19]. The routers use 
wormhole switching, with oblivious dimension order routing to prevent deadlock. A 
self-timed implementation of the MRC achieves throughput of 70 MBytes/sec, which 
is throttled by the I/O pads. The routing latency of the self-timed MRC is around 30 ns.
The Intel Paragon uses a mesh network constructed from Caltech mesh routing 
chips (MRCs). Each node has two Intel i860 XP processors, one of which is used to 
handle message passing. The router has 16 bit wide channels. The minimum latency 
(for fall-through routing) is 40 ns. Dimension order routing is used, to prevent the 
possibility of deadlock.
The MIT Alewife [25] is a 512 node mesh connected network. Again, the Caltech 
MRC chip is used for communication. The Sparcle processor used is an adaptation 
of Sun’s SPARC v7, designed for fast context switching in order to support multi­
threading. The designers of the Alewife recognised the importance of the network 
interface chip (NIC), and considerable effort went into the design of the CMMU [31] 
(communications and memory management unit).
The Stanford DASH (Directory Architecture for Shared Memory) [28], is a dis­
tributed memory multiprocessor system with up to 64 processors. The DASH uses a 
directory based caching scheme in order to break the scalability bottleneck of using 
snoopy bus schemes. The processing nodes are clustered, and each hold four proces­
sors. The interconnection network consists of two meshes: one for request messages, 
the other for replies. The routers are a variant of the MRC, with the data path width 
increased from 8 to 16 bits. The routers are self-timed, with a routing latency of around 
50ns per hop. The router links can stream flits at approximately 30 MHz, so the link
bandwidth is around 60 MBytes/sec in each direction. Since there are two networks, 
the data rate in and out of each cluster is roughly 120 MBytes/sec.
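The DASH figures quoted above follow directly from the data path width and the flit rate. The short calculation below reproduces them; the helper name is illustrative.

```python
def link_bandwidth_mbytes(width_bits, flit_rate_mhz):
    """Peak bandwidth of one link direction in MBytes/sec."""
    return width_bits * flit_rate_mhz / 8


# 16-bit data path streaming flits at approximately 30 MHz.
per_link = link_bandwidth_mbytes(16, 30)
# Two separate meshes (request and reply) serve each cluster.
per_cluster = 2 * per_link

if __name__ == "__main__":
    print(per_link, per_cluster)
```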
The MRC formed the basis of the Atomic LAN router [85], which was the prede­
cessor of Myrinet [74]. Switches for LAN environments have different requirements 
than those in parallel processor systems. Myrinet switches are designed to operate over 
25m long cables, and have a typical latency of 550 ns for a 32 port switch. Every port 
has a data rate of 640 Mbits/sec in each direction.
The Reliable Router (RR) [35, 36, 66] is a wormhole router for 2-d meshes which 
is able to tolerate a single link failure anywhere in the network without disruption of 
service. The router was implemented in a 1 µm feature size process with three metal
layers. The router is clocked at 100 MHz, with the links operating at twice this rate. 
The RR overcomes VLSI pin-out limitations by using transmission line techniques to 
implement simultaneous bi-directional signalling [66, 68]. The router also introduces
a novel technique for synchronising the transmission of data across the I/O links [67]. 
The RR has built-in hardware fault-tolerance to ensure the delivery of packets in the 
presence of a link failure. Each link has a usable bandwidth of 3.2 Gbits/sec.
The Cray T3D [90] uses a 3-d bi-directional torus network with 24 bit wide chan­
nels (16 data bits, 8 control bits). The router is implemented in 3 separate ECL gate- 
array devices. Each device contains the resources for one dimension of the network. 
The router clock rate is 150 MHz, matched with the two DEC Alpha 21064 processors 
attached to each node.
The router uses four sets of virtual channels: two classes are to prevent cyclic
dependencies due to request/response traffic, and there are two VCs in each class to prevent
deadlock due to the wrap-around torus links. In order to identify packets that must 
be sent from one VC to a lower one, a dateline node must be specified. The dateline 
node simply acts as a point of reference. Considerable effort was put into placing the 
dateline node in order to maximise the traffic balance between virtual channels. Table- 
based routing is used in the T3D, which determines the dateline node. There is no 
other hardware mechanism for this, so it is possible to create deadlock by incorrectly 
programming the routing tables.
The Cray T3E multiprocessor [89] also uses a bi-directional 3-d torus interconnection
network to connect up to 2048 processors. Absolute, static table-based wormhole
routing is employed. Virtual channel (VC) assignments are used for deadlock preven­
tion and to improve utilisation of physical link bandwidth. The use of a multi-chip 
ECL logic router, as in the T3D, was abandoned in favour of a single-chip CMOS de­
sign in the T3E. The chip core has a clock rate of 75 MHz. Packet flits are 70 bits 
wide on-chip. The (uni-directional) I/O pads operate at 375 Mbits/sec, allowing one 
flit to be transmitted in five 14-bit wide transmissions on every (core) clock cycle. This
data rate is achieved by using custom low-voltage swing differential transmitters and 
receivers. This leads to a raw data rate in excess of 600 MBytes/sec per direction. The 
T3E router uses direction order routing to eliminate deadlock, while providing more 
than one minimal route between each pair of nodes. Five sets of VCs are provided 
for each routing direction. One VC is fully adaptive, allowing cyclic dependencies to 
arise. Packets which are likely to deadlock are simply switched back into the acyclic 
VCs. The remaining four VCs can be split into two pairs, to eliminate request/response 
cyclic dependencies. This means that request packets will use one set of VCs, while 
packets generated in response to a request use the other set of VCs. Two VCs are used 
in each set to eliminate deadlocks due to torus cycles. One node in each unidirectional 
ring is designated the dateline. Packets start out in VC0. If packets in VC0 enter the
dateline node, they are switched into VC1. As in the T3D, the T3E incorporates an
algorithmic method of balancing VC utilisation. This reduces the probability of blockage
occurring in any given VC. A separate spanning tree network is provided for synchro­
nisation. The use of five virtual channels is costly in terms of hardware complexity, but 
provides a considerable degree of fault tolerance. Routing tables hold the deterministic 
routes that packets may follow in the acyclic virtual channels. When a faulty node is 
detected, the routing tables can be updated to avoid the node.
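The dateline rule described above can be illustrated for a single unidirectional ring. The sketch below is a simplified model, not the T3E hardware: nodes are numbered 0 to ring_size - 1, packets always travel in the increasing direction, and the function name is hypothetical.

```python
def ring_vc_trace(src, dst, ring_size, dateline):
    """Return the (node, vc) pairs visited by a packet on a unidirectional ring.

    The packet starts in VC0 and switches to VC1 when it enters the
    dateline node, so no chain of channel dependencies can close the ring.
    """
    node, vc = src, 0
    trace = [(node, vc)]
    while node != dst:
        node = (node + 1) % ring_size
        if node == dateline:
            vc = 1
        trace.append((node, vc))
    return trace


if __name__ == "__main__":
    # A packet using the wrap-around link crosses the dateline at node 0.
    print(ring_vc_trace(6, 1, ring_size=8, dateline=0))
```

Packets that never enter the dateline node remain in VC0 for their whole journey, which is why the placement of the dateline affects how evenly the two virtual channels are used.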
The SGI SPIDER [38] (Scalable, Pipelined Interconnect for Distributed Endpoint 
Routing) is a flexible router that is used in the SGI Origin shared-memory multiprocessor
product line, and in the Stanford FLASH (FLexible Architecture for SHared
memory) multiprocessor project [32] (which uses a mesh topology). The router core
operates at 100 MHz, with six 20 bit wide data links operating at 400 Mbits/sec, and 
can be used to construct arbitrary topologies. Four virtual channels are used. The raw 
throughput of the device is 4.8 GBytes/sec. The SPIDER chip has many features more 
commonly associated with LAN routers, such as cyclic redundancy checking at every 
link, and a go-back-n sliding window protocol, in order to ensure reliable data delivery. 
The SPIDER chip uses static table based routing. Source vector routing is also pro­
vided for system maintenance. To prevent table lookup introducing routing latency, the 
routing table at each node determines the packet output port for the next node. Since 
table-look up is pipelined in this way, and is done in parallel with other operations, it 
does not affect the routing latency.
Packets are made up of one or more 128-bit micropackets. Each micropacket requires
32 bits of additional control data. A counter is carried which contains the age of
the packet in the network. This prevents starvation and ensures fairness. High priority 
packets can be injected into the network with an increased age setting. The routing 
scheme looks similar to wormhole routing, but packets can be split into multiple mi­
cropacket blocks to make way for other packets with a higher routing priority. Each
micropacket contains complete routing information, so it can still be delivered when 
separated from the rest of the data stream.
The pin-to-pin latency of the router is around 40 ns. The SPIDER router is fabricated
in 0.5 µm technology, uses 850,000 gates and occupies a 160 mm² die. The device
is packaged in a 624 pin, 18-layer ceramic column grid array (CCGA) using flip-chip
bonding. The core runs from a 3.3 V supply and has a power consumption of 29 W when
all ports are in use.
The primary feature of the WARRP router [95] is the use of deadlock recovery, 
rather than deadlock avoidance, to allow true fully adaptive routing. This work devel­
oped from the realisation that deadlock hazards occur fairly infrequently [93], espe­
cially when fully adaptive routing together with virtual channels was used. The dead­
lock handling scheme used is known as Disha [61]. The WARRP identifies packets with 
the potential to cause deadlock, based on two criteria: westward turns [11] and dateline 
crossings [89]. Such packets are given access to a deadlock recovery channel, which 
has priority over others for network bandwidth. Use of the deadlock recovery channel 
is arbitrated by the use of a circulating token. This ensures freedom from deadlock on 
that channel, since only one packet at a time may use it. The router uses three sets 
of virtual channels, each with its own crossbar to reduce contention in each node to a 
minimum. The router is for 2-d torus networks and uses wormhole routing. Data-paths
(and hence flits) are 8 bits wide. The device was targeted at a RAM based FPGA. The 
device operates at 25 MHz and has an average pin-to-pin latency of 120 ns.
The benefits of adaptive routing are clear: better performance under congestion. 
However, this is based on the assumption that each adaptive route has the same cost 
as do other flow-control strategies. This is not the case. The WARRP router has a 
fairly complex architecture. The router uses a minimum of two clock cycles for routing 
and internal data flow, respectively. The clock period in the FPGA implementation is 
relatively long (40 ns), and this masks the complexity of the routing decision.
The WARRP II [95, 94] router demonstrates the use of optoelectronic interconnect.
The router is a 1-D torus router with 4-bit wide data-paths. The interconnect uses
AlGaAs based multiple quantum well self-electro-optic-effect devices (SEEDs). Thirty
six SEED devices are used to produce 18 optical signals. The authors claim that devices
of this type are capable of data rates of 2 Gbits/sec, and that 5000 devices can fit into
1 cm².
The MP1 [56] is an implemented router which employs mad postman routing [60], 
which is an extension of wormhole routing. The routing algorithm is designed to min­
imise routing latency by allowing the packet header to propagate while the routing deci­
sion is being made. If the address decode determines that the packet has finished routing 
in the current direction, the (unnecessary) header is allowed to continue to propagate,
while the tail of the packet makes a turn. The dead header flit is destroyed whenever 
it is blocked by a full buffer, or an edge of the network for asymmetric topologies. 
This scheme minimises the header routing latency to just one clock cycle. The MP1 
router was fabricated in a 1.2 µm standard cell based process, operates at 25 MHz and
consumes less than 0.7 W of power. The router is for 2-D mesh and torus networks,
and implements four physical acyclic networks. Adaptive routing is achieved by allow­
ing blocked packets to move from one acyclic network to the next, but not back, in a 
fixed order. The MP1 has ten 3-bit wide I/O channels, two for the X and Y channels 
on each acyclic network, a channel to the processor/NIC, and a piggyback channel, 
which allows multiple MP1 networks to be connected without intervention from any
of the processors. This type of router is especially suited to networks with narrow I/O 
channels.
Wormhole routing has been applied in a variety of routers, due to its simplicity and
low hardware cost. The designs differ in many other respects, such as the deadlock 
handling mechanisms, addressing schemes and the number of virtual channels, which 
are tailored to the end application. Wormhole routers allow variable-length packets, so 
they are applicable to a wide variety of systems.
3.1.4 Alternative Switching Techniques
Wormhole switching is the most popular technique in use due to its low buffering
requirements and low latency. Several researchers have suggested other switching
techniques, in recognition of the need for fault-tolerance in multi-processor and
multi-computer networks. These mostly involved some form of hybrid switching technique.
In Pipelined Circuit Switching (PCS) [79], the route that the packet takes through 
the network is established before the packet is transmitted. The main difference
between PCS and circuit switching is that virtual channels are reserved, rather than
physical channels. The header which establishes the route is able to backtrack in the presence
of failed links, to establish a reliable link before packet transmission begins. The main 
drawback with PCS is that the transmission of the packet does not begin until a path 
through the network has been established. An improvement on PCS is the use of scout­
ing switching. With scouting switching, the packet follows the scouting header, but 
remains K links behind it in the network. The variable K is known as the scouting 
distance, and corresponds to the number of network link failures that this protocol is 
able to tolerate. A recent improvement on this technique is to dynamically vary the
scouting distance according to local conditions in the network [9]. The links in the 
network are marked as faulty, unsafe or safe. Packets are routed using Duato’s protocol 
(DP) [53], which allows deadlock free full-minimal adaptive routing. When a packet 
enters a region where it might encounter a faulty link, scouting switching comes into
effect with the scouting distance being set so that the packet can always manoeuvre 
out of the faulty region. This two phase (TP) routing protocol allows packets to route 
with the low latency of wormhole switching in the absence of faults, and to avoid faulty 
links at the expense of increased latency. The TP scheme does increase the hardware 
complexity, especially in the flow control protocols for the physical links.
There are numerous approaches to dealing with faults in multiprocessor systems. 
The decision to include fault tolerance features in hardware must take into account the 
additional hardware complexity that this implies.
3.2 Generic Router
Figure 3.1 shows a typical packet router. As explained previously, most modern packet
routers in direct networks use wormhole routing. Wormhole routing requires only a sin­
gle flit of each packet to be buffered at any given time, so the buffering requirements are 
low. Packets can be of variable length. The head of the packet contains the information 
used to direct the packet through the network.
The various components of the router are described here:
Flow control Flow control is concerned with the external interface of the router with 
other routers. Flow control logic manages the physical transfer of data across the 
links between routers. The flow controller is often quite complex for a wormhole 
router, especially when high speed links are in use. If simple handshaking is
used, the link will be slow. One method of overcoming this limitation is to allow 
the sender to transmit multiple consecutive flits [7]. The receiver buffers these 
flits. When a pre-determined number of the receiver buffers are filled, a nearly 
full signal is sent back to the sender. This signal is generated at the point where 
there are still enough free flit buffers in the receiver to store all flits that are in 
flight before the nearly full signal reaches the sender. This scheme overcomes the 
round-trip delay inherent in simple handshaking protocols.
Virtual channel control The virtual channel control logic may be integrated with the 
flow control logic as their functions are related. The virtual channel logic de­
termines which virtual channel has access to the physical channel. It performs 
arbitration, as well as buffering. The complexity of the virtual channel logic
grows with the number of virtual channels.
Address decode When the head of a packet is received at a routing node, the possible 
outgoing channel(s) must be determined. Addressing may be:
1. absolute With absolute addressing an address comparison is made between 
the packet destination address and the address of the current router node.
Figure 3.1: Generic packet router (AD/HU: address decode / header update; FC: flow control; VCC: virtual channel control; XBAR: crossbar switch; ARB: arbitration logic)
2. relative With relative addressing (also known as source vector routing), the 
header contains the number of hops the packet must make in each dimension 
in order to reach the destination node. At each node, the hop count (in the 
packet header) relating to the dimension on which the packet progresses is 
decremented. When the count reaches zero, the router can determine that 
the packet has finished routing in the corresponding dimension.
Another approach is to use table-based routing: each router keeps a programmable 
table which is consulted for every packet. The table provides one (for oblivious 
routing) or more (for adaptive routing) outgoing channels that the packet may be 
sent to in order to reach its destination. This approach is more flexible than the 
others, and allows the router to be used for arbitrary network topologies (provided 
the node degree is sufficient).
Crossbar switch The crossbar switch performs the actual switching of packets. Figure
3.2 shows a general purpose crossbar switch with n input and n output channels.
Figure 3.2: An n x n crossbar switch
The device consists of n input ports and n output ports. Each port consists of 
w wires. At the crosspoint of each vertical and horizontal port, there is a set of 
w cross-point switches. To implement a crossbar switch requires n²w switches.
The area required grows as (nw)² (mostly due to wiring). A slight area saving
can be made by taking into account the fact that input port i does not need to 
connect to output port i.
It is a common practice in modern routers to partition a large crossbar into several
smaller crossbars, as these are faster. This may also involve restricting the rout­
ing algorithm so that not all output ports are available to packets in each virtual 
channel [60]. This is done to optimise the critical path latency of the router. An 
alternative approach, shown in figure 3.3, is to use multiplexors (as in [38]).
For a crossbar with n inputs and n outputs, each of which is w bits wide, w log(n - 1)
multiplexors are required. As w increases, wiring may begin to dominate the
area of the device.
Figure 3.3: Crossbar implemented using multiplexors
Arbitration logic Several input channels to the crossbar may require the use of the
same output port. The arbitration logic matches input ports with output ports.
The complexity of the arbitration logic (and the latency incurred) grows rapidly
with the number of arbitration candidates. The design of the arbitration unit can
impact significantly on the throughput of the crossbar [96].
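The nearly-full signalling described under Flow control above depends on reserving enough receiver buffers to absorb the flits that are still in flight. The following sketch shows one way the threshold might be chosen, to a first approximation. It assumes that the sender keeps transmitting at a constant rate for one full round-trip time after the signal is generated; the function and parameter names are hypothetical and are not taken from [7].

```python
def nearly_full_threshold(buffer_depth, round_trip_cycles, flits_per_cycle=1):
    """Return the buffer occupancy at which to raise the nearly-full signal.

    The reserve covers flits already on the link plus those the sender
    transmits before the signal reaches it.
    """
    reserve = round_trip_cycles * flits_per_cycle
    if reserve >= buffer_depth:
        raise ValueError("buffer too shallow to cover the round-trip delay")
    return buffer_depth - reserve


if __name__ == "__main__":
    # A 16-flit buffer on a link with a 6-cycle round trip.
    print(nearly_full_threshold(16, 6))
```

A longer or faster link increases the reserve, so the buffer depth must grow with the round-trip delay if the sender is to stream flits without stalling.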
3.3 Straight-through Routing
The routers used in the Cray T3D and T3E super-computers include a mechanism
known as fall-through routing. This is where an incoming packet which does not need
to change dimension is simply directed straight to the corresponding output channel.
For example, a packet entering on the +X direction input is forwarded straight to the
+X output. The complexity of the routing decision is absolutely minimal in this case
[...]le comparison. There are many modern interconnects available where I/O pins
can be placed anywhere on the IC (such as solder bump techniques). These I/O devices
could be used very effectively when combined with fall-through routing. Input and
output ports could be grouped together for each dimension. This would produce a router
with very low latency. If dimension order routing were used, then the router would
be very fast in the frequently occurring in-dimension routing case. The infrequent di­
mension changes are less critical and a router could implement these using more costly 
mechanisms, without degrading average latency in the router.
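The claim that dimension changes are infrequent under dimension order routing can be illustrated with relative addressing. The sketch below is a minimal model with hypothetical names: the header holds one signed hop count per dimension, the dimensions are resolved in a fixed order, and the count for the current dimension is moved one step towards zero at every hop.

```python
DIMENSIONS = "XYZ"


def next_port(offsets):
    """Choose the output port for the given signed per-dimension hop counts."""
    for dim, remaining in zip(DIMENSIONS, offsets):
        if remaining != 0:
            return ("+" if remaining > 0 else "-") + dim
    return "LOCAL"  # every count is zero: deliver to the processor


def route(offsets):
    """Return the list of ports taken and the number of turns made."""
    offsets = list(offsets)
    ports, turns = [], 0
    while True:
        port = next_port(offsets)
        if port == "LOCAL":
            return ports, turns
        if ports and ports[-1] != port:
            turns += 1  # the packet changes dimension at this node
        ports.append(port)
        index = DIMENSIONS.index(port[1])
        # Header update: move the hop count one step towards zero.
        offsets[index] -= 1 if offsets[index] > 0 else -1


if __name__ == "__main__":
    print(route([3, -2]))
```

For a packet that must travel three hops in +X and two hops in -Y, only one of the five hops involves a turn. The remaining hops continue in the same direction as the previous one and are candidates for the fast straight-through path.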
Adaptive routing, whether fully or partially adaptive is a useful feature, as it may 
increase the network throughput and provides some degree of fault-tolerance by allow­
ing messages to route around network faults. Implementing adaptive routing does not 
come without cost, however. Chien and Aoyama [62, 14] examined the implementation 
of various adaptive routers and compared them with simple oblivious routing. One of 
their conclusions was that adaptive routers were considerably more complex. Additionally,
if source based routing is used, header update is an expensive operation, and Chien
suggests using absolute addressing. The analysis also showed that virtual channels add 
considerable cost, and should be added with caution. Increases in crossbar size and arbi­
tration overheads were found to have a relatively small impact on the router complexity, 
although the research concentrated only on routers for low-dimension networks.
3.4 Summary of Router Trends
This chapter has presented an overview of packet routing in MPP systems. A number of 
conclusions can be drawn from recent work in this area. The throughput of early packet 
routers was governed by limited pin-out of IC packages [20]. This has been overcome 
by using transmission line techniques, and specialised driver and receiver circuitry to 
push up the achievable data rates per pin. Developments in IC packaging, such as flip- 
chip style packaging have increased the number of available I/O pins to around one 
thousand per package. There is still some difficulty in routing large numbers of high 
speed signals at the PCB level.
Fault tolerance issues must be addressed. In systems with hundreds of processors, 
hardware failure is to be expected at some stage. In order that this does not cause a 
system crash, the interconnection network should be able to tolerate network faults, ei­
ther in the network links or in the routing nodes themselves. Early multicomputers left 
message routing to the processor. Later attempts used a processor, message router and 
an interface device, all integrated in a single node. Modern systems use high performance
commodity processors. In order to support the communication requirements of 
such systems, routing chips, and network interface chips are implemented separately. 
The importance of the network interface chip should be emphasised. The benefits of 
a low latency interconnection network may be diminished by the lack of an efficient 
NIC. Many modern systems have placed special emphasis on the design of this component.
Some router designs have attempted to introduce flexibility into the architecture 
by using table-based routing. Although the routing algorithm is static, it can be updated
when necessary.
Wormhole routing predominates in modern packet routers, due to its inherent low 
latency. Although network references can now be served in a fraction of a micro­
second, even in systems with hundreds of processors, this is still a long way from 
a system-wide uniform memory access time. Future router designs will aim to reduce 
latency further, while increasing throughput in order to meet the demands of ever-larger 
systems and faster processors.
Chapter 4 
Implementation and Design of 
Asynchronous Packet Routers
Asynchronous circuits remove the reliance on a global clock signal to control the flow 
of data in a digital system. The most obvious benefit in packet router chips is that 
packets can progress as soon as the routing decisions are complete, without waiting for 
the next clock edge. Also, the power dissipation is roughly proportional to the volume 
of data, so savings can be made when the network is lightly loaded. The Caltech MRC 
was implemented asynchronously [19].
This chapter describes the implementation of an asynchronous packet router, using 
the building blocks described by Nedelchev and Jesshope [47]. The aim of this work 
was to use these blocks to construct a router with very low latency, and to examine the 
design trade-offs in a simple packet router. The latter part of this chapter describes the 
packet router design. The router uses dimension-order [24] routing. Virtual channels 
[22] are not supported, in order to keep the implementation complexity to a minimum. 
The blocks used [47] are designed to support wormhole routing, so this is used.
The synchronous design methodology encapsulates a wide variety of design tech­
niques, with one common feature: the use of a clock signal to control the sequence of 
events. The use of one or more clock signals simplifies the design process, since circuit 
timing calculations can all have the same point of reference. This also eases the test 
process: a set of test vectors can be specified, together with the expected results on a 
cycle-by-cycle basis.
The asynchronous approach to circuit design removes the reliance on any global 
clock signal. Circuits which operate asynchronously are often referred to as being 
self-timed. Self-timed operation complicates timing calculations and testing, but offers 
some benefits. This chapter first introduces some terminology, then describes some 
asynchronous (also known as self-timed) design methods. The remainder of the chap­
ter is devoted to the asynchronous packet router, and concludes with some suggested 
improvements to the design.
4.1 Introduction to Asynchronous Design
This section provides some background for asynchronous design. More information is 
provided in reference [50].
4.1.1 Timing Models
In order to abstract the asynchronous design process some form of timing model must 
be employed. Delay insensitive designs are those where correct logical operation is 
independent of arbitrary time periods added to both gate and wire propagation delays. 
In practice this is a very strict condition and the designer may instead aim for speed
independence. This is where correct logical operation is independent of an arbitrary time 
period added to the propagation delay of any logic gate. In practice delay insensitive 
and speed independent timing models are often mixed. Circuits may operate internally 
in a speed independent manner, but the external interface may be delay insensitive, for 
example.
4.1.2 Signalling Protocols
For synchronous circuits, the designer is generally concerned only with the logic lev­
els of signals on a particular edge of the clock signal. When designing asynchronous 
circuits, events are considered. An event is simply a logic transition, which may be 
from logic low to high or vice versa. Asynchronous circuits respond to events rather 
than logic levels. In order to make use of asynchronous design methods, some form of 
signalling protocol between communication circuits must be employed. This generally 
involves some form of handshaking.
Two-phase signalling is the simplest approach. The exchange of data involves a 
single request/acknowledge handshake. Figure 4.1 shows an example of the two-phase 
scheme.
Two-phase signalling usually requires more logic than four-phase signalling. Four-phase
signalling requires two request and two acknowledge transitions per communication,
as both wires return to their initial level. Figure 4.2 shows the four-phase signalling method.
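As a rough illustration of the difference between the two schemes, the following Python sketch lists the wire events needed for a number of transfers. It is a behavioural model only; the function names and the convention that both wires start at logic low are assumptions made for illustration.

```python
def two_phase(n_transfers):
    """Transition signalling: one event on req and one on ack per transfer."""
    req = ack = 0
    events = []
    for _ in range(n_transfers):
        req ^= 1                      # any transition on req announces new data
        events.append(("req", req))
        ack ^= 1                      # any transition on ack confirms receipt
        events.append(("ack", ack))
    return events


def four_phase(n_transfers):
    """Return-to-zero signalling: req and ack each rise and fall per transfer."""
    events = []
    for _ in range(n_transfers):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events
```

In this model a two-phase transfer costs two events and a four-phase transfer costs four, although the two-phase receiver must respond to both rising and falling edges.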
4.1.3 Data Signalling
Dual rail encoding (figure 4.3) involves using two separate wires to represent each bit 
of data. An event on one wire represents a logic zero, an event on the other wire 
represents a logic one. The receipt of each data bit is acknowledged on the ack wire. 
This design method takes advantage of data dependent delays. For example, a self­
timed multiplier may have a range of processing delays dependent on the input data.
Figure 4.1: Two Phase Signalling
Figure 4.2: Four Phase Signalling
The primary disadvantage of dual rail encoding is the amount of wiring generated. In 
addition, the number of logic gates used is relatively large.
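The dual rail idea can be sketched in a few lines of Python. This is a minimal model, assuming events are represented simply by the name of the wire on which they occur, and assuming a single shared ack wire when counting wires; neither detail is taken from the original design.

```python
def dual_rail_encode(bits):
    """One event per bit: on the 'zero' wire for a 0, the 'one' wire for a 1."""
    return ["one" if bit else "zero" for bit in bits]


def dual_rail_decode(events):
    """Recover the bit values from the sequence of wire events."""
    return [1 if wire == "one" else 0 for wire in events]


def wire_count(n_bits):
    """Wires for n bits in parallel: two per bit plus one assumed ack wire."""
    return 2 * n_bits + 1
```

For an 8 bit data path this model needs 17 wires, which illustrates the wiring overhead noted above.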
Bundled data protocol methods (first proposed in [44]) are less restrictive than other 
asynchronous design styles. In this design method, the control and data paths are separated.
The control path consists of circuitry designed to respond to the asynchronous 
arrival of control signals. The circuits then trigger actions into the data path, and then 
possibly pass control to subsequent circuits. Actions on the data-path may involve such 
things as toggling latches to store data values, or directing the flow of data. The logic 
on the data-path looks similar to that used in synchronous designs. Communication 
in bundled-data designs is performed using handshaking, which can be 2-phase or 4- 
phase. This is illustrated in figure 4.4.
This type of circuit relies on the insertion of delays, to ensure that signals have 
time to propagate along the data path, before the corresponding control signals arrive.
Figure 4.3: Dual Rail Signalling (logic 0, logic 1 and ack wires between sender and receiver)
Figure 4.4: Simple Handshaking with Bundled Data (request and acknowledge wires)
Internally, these circuits operate in a speed independent manner. The timing of the 
circuit must be carefully checked to ensure correct operation. These circuits operate 
externally in a delay insensitive manner.
4.2 Benefits of Asynchronous Design
The most obvious feature of any asynchronous circuit is the absence of any clock signal. 
This negates the issue of clock skew. Clock skew is the variation in the delays of various 
paths of the clock signal, potentially leading to circuit failure. In high performance 
synchronous devices, such as microprocessors, the problem can be tackled by adding 
additional de-skewing circuits [88], which actively vary delays in the clock lines to 
minimise skew.
Driving a clock signal accounts for a significant proportion of the power dissipation 
in any high speed chip [37]. This overhead is removed in asynchronous circuits, although
additional control circuitry is required to control the flow of data, which offsets 
the benefits to some extent. The most useful feature of the self-timed approach with regard 
to power consumption is the fact that power dissipation is roughly proportional to cir­
cuit activity. In idle circuits power dissipation is only due to small leakage currents in
the silicon substrate. For applications where circuit activity is infrequent, significant 
reductions in power consumption can be made. This feature has been exploited in a 
number of consumer oriented applications (see [39], p151).
Self-timed circuits can easily be interfaced with one another. Once the internal 
timing of the circuit has been verified as correct, only the external interface needs to be 
considered. This allows the circuit to be used as a building block, allowing the designer 
some level of abstraction. It also aids design re-use.
Certain classes of asynchronous circuits (such as those where all data is dual-rail 
encoded) are particularly robust, in the sense that variations in processing and the en­
vironment in which the device operates do not cause the device to fail. Instead, the 
speed at which the device operates varies according to processing and the ambient con­
ditions. This feature makes asynchronous circuits potentially more robust than other 
design styles.
One set of situations where asynchronous circuits show a clear advantage over other 
logic styles is when meta-stable behaviour is likely to occur, such as in synchronis­
ers and arbiters (see [49], Chapter 9). The probability of synchronisation failure de­
creases exponentially with time. Synchronous designs deal with this by connecting 
multiple synchronisers in series, which reduces the probability of failure exponentially. 
In an asynchronous circuit computation will not begin on any data until synchronisa­
tion/arbitration is complete.
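The exponential dependence mentioned above is commonly written using the standard metastability model, given here as background rather than as a result of this work. The symbols are assumptions of this note: t_r is the time allowed for resolution, tau the resolving time constant of the synchroniser, T_w its metastability window, and f_c and f_d the clock and data event rates.

```latex
\mathrm{MTBF} = \frac{e^{\,t_r/\tau}}{T_w \, f_c \, f_d}
```

Under this model, a second synchroniser stage in series extends t_r by roughly one clock period T_c, which multiplies the mean time between failures by a factor of about e^{T_c/tau}.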
Asynchronous designs that are internally and externally delay insensitive have the 
advantage of portability. If a circuit is designed using a library of cells from one fab­
rication process, it can easily be ported to any other fabrication process, provided that 
the same cell library is made available.
Perhaps the greatest drawbacks of asynchronous circuits are the lack of EDA (electronic
design automation) tools, and the difficulties in testing asynchronous designs. 
These problems are being overcome, and asynchronous techniques and tools are begin­
ning to enter the commercial arena [1,2].
4.3 Asynchronous Design Kit
In order to begin construction of the asynchronous packet router, a library of general- 
purpose asynchronous design elements was constructed as standard cells. Standard 
cells are tiles of silicon layout, with a fixed height and variable width. The power and 
ground rails run horizontally along the top and bottom of the cell, and are positioned 
so that they can be aligned with other cells in uniform rows. Gaps are left between the 
rows for routing wires. The silicon layout of the Muller C-element (described later) is 
shown in figure 4.5.
Figure 4.5: Silicon layout of C-element standard cell
The standard cell implementations were taken from a number of sources [75, 46]. 
Each cell was created as a schematic, and then standard cell layouts were created using 
a layout synthesis tool. Each cell was simulated with HSPICE (a commercial version 
of SPICE), which was also fed with parasitic capacitance and resistance information 
obtained from the layout of the cells. The simulations in HSPICE were carried out 
for three reasons: 1) to verify that the circuits were logically correct, 2) to measure the 
rise and fall times of the circuits, and 3) to ensure that the circuits did not produce any 
glitching. For cells used on an asynchronous control path, glitching can cause incorrect 
operation. If an unexpected transient signal of sufficient duration occurs, it may be 
wrongly interpreted as one or more asynchronous events by the subsequent circuits.
There is a direct trade-off between simulation time and accuracy. In order to verify 
the structure of the design, Verilog models of each standard cell were created, so that 
large designs could be simulated in a reasonable period of time. The standard cells and 
the internal circuits in each router building block are speed-independent, so they must 
be individually simulated using a SPICE simulator, since the accuracy of the simulation 
timing is critical to the successful operation of these circuits.
The most commonly used asynchronous cells are described here.
Muller C-element The Muller C-element is used to join processes together. An event 
must be received on both input wires in order to produce a single event on the 
output wire. Two successive events on the same input wire, without an intermediate
event on the other input wire, will cancel each other out. When an output 
transition occurs, the circuit returns to the original state, waiting for an odd num­
ber of events on both wires before producing an output event. The transistor level 
implementation of the Muller C-element is shown in figure 4.6.
Figure 4.6: Transistor level implementation of C-element
Toggle The first event received and all subsequent odd-numbered events at the input to 
the toggle element produce an event on the toggle output marked dot. All even 
numbered events produce an output event on the blank output. The implementa­
tion of the toggle circuit is shown in figure 4.7.
Merge The merge element accepts events on its input channels and outputs these 
events on the output wire. The merge element can be implemented using a stan­
dard XOR gate (they are logically identical). Difficulties arise when two or more 
events arrive simultaneously, and this must be prevented, as indeterminate be­
haviour will result. The merge element is constructed from eight transistors, 
which need the complements of each of their inputs as well as the inputs them­
selves. A six transistor solution exists, made using transmission gates, but has 
a possible charge sharing problem. This should not be a concern as long as the 
device is strongly driven. Both implementations are shown in figure 4.8.
Mutex (Mutual exclusion) Element The mutex element is used to perform arbitra­
tion. It has two input channels and two output channels. When transitions are
Figure 4.7: Implementation of toggle element (panels include (b) Half Toggle and (c) Complete Toggle)
Figure 4.8: Merge element, (a) static implementation and (b) using pass-transistor logic
received on both inputs simultaneously, the mutex only allows one of the inputs 
to propagate the event onto its respective output. When a second transition is 
received on that channel, this is again allowed to propagate to its output. This 
second transition also frees the mutex to begin arbitration again. The mutex can 
be combined with additional circuitry to provide a robust arbiter.
Decision Wait (DW) Element The decision wait element is a generalisation of the 
Muller-C element. It is an array with n column inputs and m row inputs, with 
corresponding outputs On*m. The DW element is used throughout the router to 
ensure that the correct acknowledge signals are generated in response to request 
signals.
The merge, toggle, Muller C-element and mutex, together with latches and boolean 
logic gates are adequate for constructing most types of digital circuits. Other compo­
nents, such as the decision-wait element are provided for convenience. All cells used 
on the asynchronous control path must be glitch- and hazard-free. The requirements on 
data-path elements are less stringent.
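The behaviour of three of these cells can be summarised with a small Python model. This is a sketch at the level of logic values only, assuming all wires start at logic low; it says nothing about the transistor-level implementations, and the class and method names are invented for illustration.

```python
class CElement:
    """Output copies the inputs when they agree, otherwise holds its value."""

    def __init__(self):
        self.a = self.b = self.out = 0

    def drive(self, a=None, b=None):
        """Apply new input levels and return the resulting output level."""
        if a is not None:
            self.a = a
        if b is not None:
            self.b = b
        if self.a == self.b:
            self.out = self.a
        return self.out


class Toggle:
    """Steers odd-numbered input events to 'dot', even-numbered to 'blank'."""

    def __init__(self):
        self.count = 0

    def event(self):
        self.count += 1
        return "dot" if self.count % 2 else "blank"


def merge(a, b):
    """A level change on either input changes the output, i.e. XOR."""
    return a ^ b
```

In this model two successive events on input a of the C-element leave the output unchanged, which matches the cancelling behaviour described for the cell.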
4.4 Router design
The routing switch uses dimension-order routing (first X then Y) in order to avoid 
deadlock [24]. There are five input channels (+X, -X, +Y, -Y and inject) and five output 
channels (+X, -X, +Y, -Y and consume). Each routing node (RN) is connected to a 
processing element (PE) via the inject and consume channels, and to other nodes in the 
network using X and Y channels. The connection configuration is shown in figure 4.9.
The routing information for each packet is contained in one or more header bytes 
at the leading end of the packet. Each header byte consists of three fields, an address 
field (5 bits), a broadcast bit (1 bit) and a control field (2 bits), which determines the 
next dimension along which the packet will be routed (the address dimension field). 
On each hop the address field in the header byte is decremented by one. When the 
address reaches zero, two of the control bits determine the next dimension along which 
the message will be routed. The head of the packet is stripped off and the same process 
is repeated. If the address field reaches zero and the next dimension indicated in the 
address dimension field is opposite¹ to that which the packet is currently travelling in, it 
will be sent to the consume channel, and is removed from the network by the processor.
¹There are five router channels, +X,-X,+Y,-Y and the processor inject/consume channel. In order 
to represent these five channels with only 2 address-field bits, the next routing direction specified in 
the packet header is set so that it represents a reversal of the current routing direction. Packets do 
not ordinarily backtrack, so this setting is used to indicate that the packet should be consumed by the 
processor.
Figure 4.9: Routing node interconnection scheme
The address field is five bits wide, which allows a maximum of 32 hops. Further 
moves can be made along the same dimension by setting the control bits in the current 
header byte to continue in the same direction and by specifying the required number 
of hops in the address field of the next byte in the packet, although this is achieved at 
the loss of some efficiency in payload. An additional control bit, the broadcast bit, is 
used to specify whether message propagation is point-to-point or point-to-point with 
broadcast in the rectangular closure of source and destination points. Thus if the packet 
is travelling along one of the x-dimensions and this bit is set, then the broadcast part of 
the router will strip off all of the packet headers containing x-dimensions in the control 
field, until the control field contains the bit pattern representing a route in the +Y or -Y 
direction. The message is then sent along one of the y-dimensions. At the same time 
the message is sent along the current direction of travel (in this case X) as it would be 
if broadcast mode were off.
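The header handling described above can be sketched in Python. Several details here are assumptions rather than facts from the design: the position of each field within the byte, the 2-bit codes for the four directions, and the choice to test the hop count before decrementing it. Broadcast handling is left out.

```python
DIRS = ["+X", "-X", "+Y", "-Y"]            # hypothetical 2-bit direction codes
OPPOSITE = {"+X": "-X", "-X": "+X", "+Y": "-Y", "-Y": "+Y"}


def pack_header(hops, broadcast, next_dir):
    """Assumed layout: bits 0-4 address, bit 5 broadcast, bits 6-7 control."""
    assert 0 <= hops < 32
    return (DIRS.index(next_dir) << 6) | (int(broadcast) << 5) | hops


def unpack_header(byte):
    """Return (hops, broadcast, next_dir) from one header byte."""
    return byte & 0x1F, bool((byte >> 5) & 1), DIRS[(byte >> 6) & 3]


def route_step(packet, travelling):
    """Handle one packet at one node; returns (output channel, packet sent on)."""
    hops, bcast, next_dir = unpack_header(packet[0])
    if hops > 0:
        # Still counting down: decrement the address and continue straight on.
        return travelling, [pack_header(hops - 1, bcast, next_dir)] + packet[1:]
    if next_dir == OPPOSITE[travelling]:
        # Packets never backtrack, so a reversal is read as "consume".
        return "consume", packet[1:]
    # Counter expired: strip this header byte and turn into the new direction.
    return next_dir, packet[1:]
```

A packet built as `[pack_header(1, False, "+Y"), pack_header(0, False, "-Y"), 0xAA]` and travelling in +X continues in +X at the first node, turns to +Y at the second, and is consumed at the third, leaving only the payload byte.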
In accordance with Nedelchev’s design [47], the movement of data through the 
router is directed by a variant of the bundled data protocol. The difference is that two 
sets of control signals are used, as this simplifies the design. Figure 4.10 shows the con­
nections between each sender/receiver circuit pair. The data path throughout the router 
is 8 bits wide, with four control signals. The e and req lines are both used to signal that 
valid data is being presented on the data bus, and the acke and ackreq lines are used by 
the receiving circuit to signal that the data has been received, and can be removed from 
the data lines. The difference between the two sets of request/acknowledge wires is that 
the e wires are used to mark the head and tail of a packet, and the req wires are used to 
mark data exchanges other than the head or tail byte. This simplifies the router design,
since the operations required on the head and tail bytes of the packet are different from 
operations on other data bytes.
Figure 4.10: Modified bundle data protocol
Figure 4.11 shows the protocol in practice. To ensure correct timing in the circuits, 
the data being transmitted must settle to the correct state before the relevant e or req 
signal arrives at the receiving circuit. If this condition is violated, the receiver may 
latch spurious data values. In practice there is some slack in the setup timing, as there 
is always a delay of at least one logic gate (in the control path) before the receiver’s 
internal latches are switched.
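The division of work between the e and req lines, and the setup condition, can be illustrated with a short Python sketch. It models only which control line accompanies each byte of a packet; the function names and the use of nanosecond arguments are assumptions made for illustration.

```python
def control_events(packet):
    """Yield (control line, byte): 'e' for the head and tail, 'req' otherwise."""
    last = len(packet) - 1
    for i, byte in enumerate(packet):
        line = "e" if i in (0, last) else "req"
        yield line, byte


def setup_ok(data_settle_ns, control_arrival_ns):
    """Data must be stable before the matching e or req event arrives."""
    return data_settle_ns < control_arrival_ns
```

For a four byte packet this gives the sequence e, req, req, e, so the receiver can distinguish the head and tail bytes without inspecting the data.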
4.4.1 Router Building Blocks
The router was constructed from three basic building blocks, which were first proposed 
in [47]. The internal timing of each of these blocks was carefully examined and delays 
were inserted into the circuit control paths where necessary. It should be noted that, as 
the data path is quite simple in a packet router, the complexity in the control circuitry 
often allows the timing constraints to be met without the insertion of explicit delays. 
Once this condition is met, the building blocks need only be wired together in a 
speed independent way, ensuring that the data wires are given the shortest routing 
paths so that the timing constraints are not inadvertently violated.
MUX The MUX element (figure 4.12) serves to multiplex two data channels, to allow 
two competing modules to have mutually exclusive access to a shared resource,
Figure 4.11: Sequence of events in packet exchanges
whilst adhering to the bundled data protocol described previously. The operation 
of the MUX element is as follows: A module wishing to access the resource sends 
an initial e event to request exclusive access. If both modules make a request at 
the same time, the MUTEX element inside the MUX will block one e event, while 
allowing the other to pass. The data switches inside the MUX will be set so that 
the data path from the winning module is connected to the output channel of the 
MUX. The datamerge component consists of combinational logic, which simply 
copies the data from the selected input bus to the MUX output bus. The MUX 
element copies the request signals to its outputs and returns acknowledgement 
signals from the receiving resource back to the sending module. When the e 
line of the winning module is eventually brought low, the other module is given 
immediate access to the MUX, and is allowed to send its data. The MUTEX 
element inside the MUX ensures fair access to the resource, as the loser in a 
previous request is kept waiting, and has access as soon as the requested resource 
becomes free.
If more than two modules need to share access to a single resource, then MUX 
elements can be cascaded. Figure 4.13 shows the internal structure of the MUX 
element.
DEC The DEC element acts to buffer the data, and also decrements the hop counter 
inside the header flit. The construction of the DEC element is shown in figure 
4.14.
The operation of the DEC element is as follows. The internal storage element
Figure 4.12: MUX element
Figure 4.13: MUX element
Figure 4.14: DEC element construction
consists of an 8 bit register made up of latches. Initially the latches are in a 
transparent state. Receipt of either an e event or a req event closes the latches, 
storing the data. When an ack signal arrives, indicating receipt of the data by the 
next block, the latches revert to their transparent state. The decel block shown 
in the diagram copies data from its inputs to its outputs, except when the control 
signal is low, in which case the data is decremented by one before passing to the 
output wires. The e signal is connected to the control on the decel block. At the 
beginning of each packet, the latches close before the decel control signal has 
time to switch, so only the packet header byte has its address field decremented.
ROUTER The ROUTER (shown in figure 4.15) element is the most complex of the 
building blocks. The ROUTER block determines which outgoing channel each 
packet is directed to. To do this it examines the packet header byte. If the hop 
counter in the header is non-zero, the packet is simply routed straight through, 
and continues in its current routing direction. If the hop counter has expired 
(reached zero) the byte is stripped off and the addressing information in the next 
byte is examined. Then the ROUTER element directs the packet to the next 
outgoing link, specified in the header byte. This may either be to an outgoing 
link, or to the processor consumption channel.
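Two of the mechanisms described for the building blocks can be sketched in Python: the arbitration inside the MUX and the latch behaviour inside the DEC. This is a behavioural sketch with invented class names. One simplification is an assumption of the model: truly simultaneous requests are resolved by call order, whereas the real MUTEX resolves them in hardware.

```python
class Mutex:
    """Grants a shared resource to one requesting module at a time."""

    def __init__(self):
        self.owner = None
        self.waiting = None

    def request(self, who):
        """Opening e event from a module; returns True if access is granted."""
        if self.owner is None:
            self.owner = who
        elif self.owner != who:
            self.waiting = who          # blocked until the owner releases
        return self.owner == who

    def release(self):
        """Closing e event from the owner; the blocked module takes over."""
        self.owner, self.waiting = self.waiting, None
        return self.owner


class DecLatch:
    """8 bit latch: transparent until an e or req event, reopened by ack."""

    def __init__(self):
        self.transparent = True
        self.stored = 0

    def output(self, data_in):
        return data_in & 0xFF if self.transparent else self.stored

    def on_request(self, data_in):
        """An e or req event closes the latches, holding the current data."""
        self.stored, self.transparent = data_in & 0xFF, False

    def on_ack(self):
        """An ack from the next block returns the latches to transparent."""
        self.transparent = True
```

The handover in `release` reflects the fairness property described for the MUX: the loser of an earlier request gains access as soon as the resource becomes free.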
The router block is based on that originally described by Jesshope and Nedelchev 
[47], but has been modified to support broadcast by redesigning the router building 
blocks and providing the additional connections required internally between X and Y
Figure 4.15: ROUTER element construction
routers. This connection scheme is shown in figure 4.16. The wires on the diagram 
represent either one or two entire bundles, comprising both data and control wires. The 
reader may notice that there are a greater number of channels leading into the multiplexers
in the Y dimensions. These have two extra channels in each direction which 
are used when point-to-point multicast is being carried out. Since the multiplexers are 
larger for the Y channels than for the others, the through node latency will be increased 
by around 10ns for these channels. The components shown in this diagram have the 
following behaviour:
Simulation of the router proved to be a challenging task, as the tools used have 
been created for synchronous design methods. The lack of synthesis tools meant that 
the design was created from the bottom-up rather than top-down. HSPICE simulations 
including circuit parasitics were carried out at every level in the design hierarchy up 
to the MUX, DEC and ROUTER building blocks. The operation of these blocks was 
verified to be externally delay insensitive, so that the blocks could be connected together 
without risk of timing violations. The insertion of delays in the design assumed worst- 
case situations, in order to avoid violating the internal timing constraints of the main 
blocks.
The entire router design was too large to simulate entirely with HSPICE. Several 
simulations were carried out with only the circuits on the routing path of the simulated 
packets, to assess the detailed timing. In order to make the simulation of the entire 
design fast enough to be manageable, Verilog models of each library cell were created, 
and the entire design was simulated at the switch level. This allowed the structure of 
the circuit to be verified.
The routing latency was variable, due to the differing number of MUX elements at 
each output. The latency varied between 60 ns for straight through routing on either 
the +X or -X channel, to 80 ns for a straight through route with the broadcast feature 
enabled.
A simple method of determining the overall speed of the device was used. The 
slowest part of the circuit is the ’router’ resource (see fig. 4). The operation of this 
sub-circuit is slowest when evaluating the packet header. Since the rest of the packet is 
queued up behind the header, this is a bottleneck in the system. Thus, by measuring the 
time for the packet header to be resolved to be sent onwards we could estimate the speed 
(and hence throughput) of the circuit. Handling the packet header took 18ns normally 
and 20ns in broadcast mode. This corresponds to a data rate of 55 Mbytes/sec (on each 
output) and 50 Mbytes/sec respectively. Of course, the header bytes do not carry any 
useful information, so transmission is more efficient for larger packets. The through-node latency 
of the router varies, due to the self-timed nature of the device, but was found to remain 
below 80ns in all simulations.
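The data rates quoted above follow from assuming that one byte crosses the 8 bit data path per header-handling period. A short Python sketch of that arithmetic, with hypothetical function names, also shows how the header overhead shrinks as packets grow:

```python
def data_rate_mbytes(header_time_ns):
    """Bytes per nanosecond scaled to Mbytes/sec, one byte per period."""
    return 1e3 / header_time_ns


def payload_fraction(payload_bytes, header_bytes):
    """Share of the transmitted bytes that carry useful data."""
    return payload_bytes / (payload_bytes + header_bytes)
```

With 18 ns per byte this gives roughly 55.6 Mbytes/sec, and with 20 ns exactly 50 Mbytes/sec. A packet with two header bytes and six payload bytes would have a payload fraction of 0.75.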
Figure 4.16: A block diagram of the router switch
After the entire schematic had been created, the next step was to lay out the circuits 
into actual silicon structures. In the final version of the design, all of the devices were 
placed within one block, as standard cell designs are more compact when placed in this 
way. The design tools used were able to then optimise the placement of each standard 
cell in the design, to produce a globally near-optimum placement. Constraints were 
placed on the wiring process, so that wires carrying data were given priority over those 
carrying control signals. In effect this meant that the data wires were connected first and 
allocated the shortest paths to their destinations. The early arrival of data has no effect 
on the correct functioning of the circuit, but the early arrival of a control signal without 
the arrival of the corresponding data would be catastrophic. The data and control wires 
in each bundle were examined manually in the final layout view to ensure that the data 
lines were relatively short and that there was no danger of violating timing constraints.
Figure 4.17: Final chip layout
The final chip layout was 5781 µm × 5642 µm and a chip plot is shown in figure 
4.17. The design was pad-limited, meaning that the size of the chip was determined by 
the number of input and output pads on the device.
4.5 Possible Improvements
The implementation of the asynchronous packet router described in this chapter was 
the author’s first attempt at a sizeable IC design. Much of the implementation process 
was a learning experience for the author. There are a great many improvements and 
optimisations which could be made. With the full knowledge of hindsight, it is easy to 
be critical.
Asynchronous circuits tend to be modular in nature. This allows circuits on the 
critical path to be improved without altering the rest of the design. The slowest parts 
can be improved to achieve an overall speedup.
Critical Path Optimisation The design presented here is an oblivious router. Packets 
are routed first completely in the X dimension, and then in the Y dimension. 
The majority of packets pass straight through the routing node, without making a 
turn. There are several optimisations that could be made to this critical path. The 
most obvious improvement is to split the ROUTER component into two parts. 
The first part, which shall be called HOP_CHECK should simply assess whether 
the hop counter in the packet header flit has reached zero. If this is not the case, 
the packet should be sent straight to the output. This channel is on the critical 
path, so the channel should be connected through only one MUX element to the 
output.
If the hop counter in the packet header flit has reached zero, the packet should be diverted to the new 
ROUTER element, which has the functionality of the original device, less that of 
the HOP_CHECK component.
Broadcast The broadcast facility gave additional functionality to the APR, at the ex­
pense of latency on the critical path. The broadcast logic could have been incor­
porated inside the ROUTER element.
Bus Connections Power consumption could have been reduced a great deal by sepa­
rating out the buses from each ROUTER to a number of MUX elements. Every 
time data is transmitted on the bus, the driving gates must switch wires connected 
to a number of MUX elements. Separate connections to each MUX would reduce 
power consumption, at the expense of extra wiring. The design was pad-limited, 
so the extra wiring could easily have been accommodated.
Mixed Logic Style Asynchronous circuits allow logic styles to be mixed, providing 
communication protocols are adhered to. Part of the HOP_CHECK element could 
be implemented using dynamic logic. One possible implementation is shown in 
figure 4.18. The function of this circuit is to check when the routing hop counter
has reached zero. Before a packet arrives, in is at logic low. Transistor p1 is 
conducting, so the output is a logic high. When a packet arrives, any non-zero 
address bit will cause one of the five parallel n-type transistors to conduct. When 
the in signal is raised high to signal the arrival of a packet, the bottom n-transistor 
(n6) is turned on, and pi is turned off. If all five address bits are zero, then the 
charge stored on the output node (out) will be preserved. Otherwise the output 
node will be discharged. The result is that a logic high signal at the output means 
that the address is zero, a logic low signals that the address is non-zero. Dynamic 
logic relies on charge storage on a node. The output result must be latched as 
soon as it is generated. If the delay is too great, the charge on the output node 
may leak away, causing an incorrect result.
This circuit was simulated using SPICE and found to produce a result in under 
2 ns. Using such a circuit on the critical path of the router would mean that the 
propagation delay would be greatly reduced.
Figure 4.18: Fast circuit to check when hop counter is zero
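The logical behaviour of the circuit described above can be captured in a few lines. The sketch below is a purely functional model under the assumption of ideal switches: it ignores charge leakage and timing, and the function name `zero_detect` is hypothetical.

```python
from typing import Sequence


def zero_detect(in_signal: int, address_bits: Sequence[int]) -> int:
    """Behavioural model of the dynamic hop-counter check (illustrative only).

    While in_signal is low the output node is precharged high through p1.
    When in_signal goes high, n6 enables the pull-down network and any
    non-zero address bit discharges the output node.
    """
    if len(address_bits) != 5:
        raise ValueError("the circuit described has five address bits")
    if in_signal == 0:
        return 1  # precharge phase: output held high
    return 0 if any(address_bits) else 1  # evaluate phase


precharge = zero_detect(0, [1, 0, 0, 0, 0])
all_zero = zero_detect(1, [0, 0, 0, 0, 0])
non_zero = zero_detect(1, [0, 0, 1, 0, 0])
```

In the evaluate phase the function is a five-input NOR, which is why a single dynamic gate is sufficient.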
Virtual Channel Support The router chip was I/O bound, and could have accommod­
ated considerably more logic. A large proportion of the available silicon area was 
left unused. This could have been used to provide support for virtual channels. 
Without VCs the router is unable to fully utilise the available physical bandwidth 
in the network. The added logic, buffers and arbitration circuitry would increase 
latency on all paths, but careful design could minimise this.
Re-arranging I/O Resources The delay any packet on the critical path experiences 
consists of two components. The first is the processing delay, where the packet
header is being examined and the hop counter modified. The second is the 
propagation delay in transporting the physical signals across the chip. This 
delay could be reduced by grouping the input and output resources for each rout­
ing direction together. This would complicate matters at the PCB level, but off- 
chip signals could be driven with a greater current.
4.6 Conclusions
The design of the asynchronous packet router demonstrates the feasibility of Nedelchev 
and Jesshope’s packet router building blocks [47]. The implementation gave the author 
an understanding of the hardware complexity issues in packet routing. Simulations 
with HSPICE provide an insight into the detailed timing of a packet router, and into the 
hardware costs involved.
The work also provided an understanding of the benefits and drawbacks involved 
in using asynchronous techniques. The features of asynchronous design which are 
most useful to packet router design are low power requirements, control over detailed 
timing and the modular nature of asynchronous circuits. The greatest difficulty in using 
asynchronous methods is the lack of commercial EDA tools, although this situation is 
changing [1,2].
Chapter 5 
High Performance Interconnects
Packet Router performance is ultimately governed by the properties of the available 
interconnect. For many years now, such devices have been restricted by the limited 
number of I/O pads which can be connected from the periphery of a single chip (the so- 
called pin-out problem). There have been several developments in recent years which 
are set to change this. Research in the area of I/O transceivers [12, 35, 36, 66, 68, 
67, 71] has greatly increased the achievable data rates of a single I/O link. The pin 
limitations can be alleviated by using flip-chip style bonding, where I/O pads can be 
placed anywhere on the IC. These devices can be arrayed across the chip surface, to 
provide thousands of connections [92].
A small and uniform quantity of solder is placed on each pad, and the silicon sub­
strate is then turned over and attached directly to the PCB. Specialised multi-layer ce­
ramic PCBs are used. There are several benefits to this. Firstly, even though the indi­
vidual pads may be larger than in existing wire bonded designs, the overall chip area 
may be reduced in I/O bound designs. This is due to the removal of the restriction on 
placing the pads around the chip periphery. This is illustrated in figure 5.1. The diagram 
shows the placement of 100 I/O pads of the same size, using wire bonding and flip-chip 
bonding. The overall die area is considerably smaller for the flip-chip bonded device. 
Flip-chip bonding leaves less silicon area for placing logic gates, but this is ideal for 
designs which would otherwise be I/O bound. This is generally the case for packet 
routers. The throughput of each device depends critically on the throughput of the I/O 
pads. If large numbers of I/O devices (> 500) are to be used, then flip-chip interconnect 
offers a considerable reduction in die size. For any chip in production quantities, the 
cost of each device is almost directly proportional to the silicon area used. Smaller die 
sizes also produce higher process yields.
Another benefit is the increased number of I/O pads available in flip-chip packaging 
processes. The IBM C4 flip-chip bonding technology [86] allows almost five thousand 
I/O pads to be distributed over a 16 mm x 16 mm chip area. Each of the bonding pads 
is around 225 µm x 225 µm in the current state-of-the-art process [86].
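The die-size argument can be illustrated with a rough geometric estimate. The sketch below assumes a square die, pads laid out on a uniform pitch, a single ring of pads for wire bonding, and no allowance for corners or core logic. The pitch value is an assumption (the text quotes a pad size, not a pitch), and both function names are hypothetical.

```python
import math


def peripheral_die_edge_mm(n_pads: int, pitch_mm: float) -> float:
    """Die edge needed to fit n_pads in a single ring around the periphery."""
    pads_per_side = math.ceil(n_pads / 4)
    return pads_per_side * pitch_mm


def area_array_die_edge_mm(n_pads: int, pitch_mm: float) -> float:
    """Die edge needed to fit n_pads in a square grid over the die surface."""
    pads_per_side = math.ceil(math.sqrt(n_pads))
    return pads_per_side * pitch_mm


PITCH_MM = 0.225  # assumption: pitch taken equal to the quoted 225 um pad size

wire_edge = peripheral_die_edge_mm(100, PITCH_MM)  # 25 pads per side
flip_edge = area_array_die_edge_mm(100, PITCH_MM)  # 10 x 10 grid
```

Under these assumptions the peripheral edge grows linearly with the number of pads, while the area-array edge grows only with its square root, so the advantage of flip-chip bonding widens as the pad count increases.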
Figure 5.1: I/O placement (a) for wire bonding (b) for flip chip bonding
Flip-chip bonding allows much larger numbers of I/O pads on a single chip than 
bonding methods where the pads are placed around the chip periphery. However, a new 
problem arises as a result of congestion in the PCB, due to its planar nature.
One promising area of research which overcomes these difficulties is optical in­
terconnect. Streams of photons do not readily interact with one another, which is an 
advantage for carrying data. The density of connections can therefore be extremely 
high, and with free-space optics, signals can be routed in three dimensions.
The range of technologies being developed is vast. Optical interconnect technolo­
gies have long been recognised as a solution to the VLSI pin-out problem [57]. Large 
numbers of optical “pins” can be placed on a silicon substrate and integrated with 
the underlying CMOS devices. Such devices are capable of data rates in the Gbits/sec 
range [33]. These technologies have been applied to a number of areas, including packet 
routers [95], crossbars [27], multi-chip modules [30], novel parallel computer architec­
tures [29], etc. Optical interconnects are also recognised to be of use in areas where 
electronic interconnects would have been applied.
Ozaktas’ [42] classification of the range of opto-electronic architectures is shown 
in figure 5.2.
[Figure 5.2 labels: arbitrary connection pattern; regular connection pattern]
Figure 5.2: Tree of alternative optical interconnection architectures [42]
Free-space interconnect appears to be the most applicable for the purpose of direct 
interconnection networks. Optical transmitters and receivers are attached to the chip 
surface using flip-chip bonding. The two main transmitter devices used are VCSELs 
(vertical cavity surface emitting lasers) and SEEDs (self-electro-optic-effect-devices). 
The so-called smart-pixel technology involves arraying large numbers of these devices 
onto a single IC substrate. The smartness of the devices refers to the fact that they 
integrate logic functions with the optical transceivers.
In VCSEL devices, the light resonates between mirrors at the top and bottom of a 
vertical cavity. Due to the small size of the cavity the VCSELs have a small round- 
trip gain in comparison with horizontal edge-emitting lasers, and so highly reflective 
mirrors (reflectance ≥ 0.9) are required in order to sustain oscillations [59]. VCSELs 
can be constructed in arrays on the same substrate. The substrate is transparent to the 
generated beams, so the devices can be attached to the surface of the IC, which is then 
turned over. The beams will pass through the back of the substrate, rising perpendicular 
to the IC surface. The devices are kept constantly on and the beam is modulated, as this 
avoids a start-up delay. This means that the devices dissipate power even when no data 
is transmitted. VCSELs have a low threshold current, low voltage and high data rate (up 
to 10 GHz) [26]. In conjunction with microlenses integrated onto the device surface, 
around 90% of the VCSEL output may be coupled into optical fibres [59].
The optical multi-mesh hypercube (OMMH) network [26] is a proposed intercon­
nection network using a hybrid topology. VCSEL devices are proposed for the free 
space optical links. P-I-N photodiode (PIN-PD) receivers were proposed for use in
the OMMH. These devices do not provide any internal amplification. Transimpedance 
amplifiers are generally used for this purpose, due to their large dynamic range.
SEEDs are passive diode devices in which the light absorption varies when an electric 
field is applied across the device terminals. These devices also generate a small 
leakage current when light is incident upon them, so they can be configured as receivers.
SEED devices require an external laser source. The devices act by modulating 
the laser source. Desmulliez et al. [70] have used 32 x 32 arrays of SEED devices 
to implement an optical sorting demonstrator. A number of types of SEED devices 
were explored. For CMOS-SEED devices Desmulliez et al. [70] put the theoretical 
communication rate at up to 10^12 pin-Hz, given a 1 cm² chip and a 10 W/cm² power 
dissipation limit. A 16 x 32 sorting node has been implemented by Dines et al. [51]. 
The system operates at 100 MHz. Each basic node contains a transmitter, receiver and 
some control logic. The area of each cell is 180 µm x 180 µm. This corresponds to an 
active area of 2.9 mm x 5.8 mm for a sorter with 1024 optical channels. The total on/off 
chip data rate for the system is around 200 Gbits/s. A simple opto-electronic router 
has been constructed by Pinkston et al. The WARRP (wormhole adaptive recovery- 
based routing via pre-emption) router [95] was implemented as a 4-bit wide 1-d torus 
(WARRP II) router. The device uses 36 SEED devices to provide 18 dual-rail optical 
links. The aggregate bandwidth of the device is 25 MBytes/sec, which is well below 
the capabilities of the interconnect.
The receiver devices are simpler and generally use some form of photodiode.
The advantages of optical interconnects are:
•  Large number of devices (more than 3400 [92]) per IC.
•  High data rates - up to 10 Gbits/sec.
•  High off-chip routing density.
Optical interconnects are capable of operating at much higher data rates than CMOS 
integrated circuits. However, CMOS is probably the technology of choice when imple­
menting a packet router, due to its low power consumption, high yield and low cost. 
Two approaches appear feasible in constructing a packet router:
1. Use high data rate optical interconnect with a lower data rate on silicon. Optical 
data links may operate in the GHz range, while CMOS silicon processes gener­
ally operate in the sub-GHz range. The data must be de-multiplexed, routed in 
silicon and then multiplexed again. The de-multiplexing is the most demanding 
part. The problem has been addressed [33]. A system has been demonstrated in 
which the data is transmitted using VCSEL devices operating at 2.48 Gbits/sec. 
The data stream is de-multiplexed to eight 311 Mbit/sec data streams. This is
made possible by the use of a clocked optical receiver. Each receiver is fed data at 
the full transport rate of 2.48 Gbits/sec. A delay line is included in each receiver, 
so that in each receiver a different slice of the signal is latched. The disadvantage 
with this approach is an increase in latency on the routing critical path, due to 
the multiplexing and de-multiplexing overhead. The primary advantage of this 
approach is that the CMOS based router will scale with the interconnect.
2. Match the optical pin rate with the router clock rate and use large numbers of op­
tical pins. With this approach, the optical devices operate at fairly low data rates 
for this technology. The benefit is obtained from the parallelism of thousands 
of connections. This requires a greater silicon area, but simplifies the interface 
between the router logic and the interconnect.
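The first approach relies on time-division de-multiplexing, which can be sketched in software as follows. This is an illustrative model only: the function names are hypothetical, and the list slicing stands in for the delayed clocks that let each receiver latch a different slice of the full-rate signal.

```python
def demultiplex(bits, ways=8):
    """Split a serial bit stream into `ways` slower parallel streams.

    Receiver k latches bits k, k + ways, k + 2*ways, ... which models each
    receiver sampling a different slice of the full-rate signal.
    """
    return [bits[k::ways] for k in range(ways)]


def multiplex(streams):
    """Inverse operation: interleave equal-length streams back into one."""
    return [bit for group in zip(*streams) for bit in group]


serial = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
lanes = demultiplex(serial)

line_rate_gbps = 2.48
lane_rate_mbps = line_rate_gbps * 1000 / 8  # per-lane rate after 1:8 demux
```

With the quoted line rate this gives roughly 310 Mbit/sec per lane, in line with the 311 Mbit/sec figure above. The round trip through `demultiplex` and `multiplex` restores the original stream, at the cost of the extra latency noted in the text.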
The use of optical interconnections places two limitations on the router architecture. 
The silicon area is dominated by the optical devices and their underlying drivers. This 
implies that the router architecture must be area efficient. The other main restriction is 
the power consumption. Below 10 W, conventional cooling methods, such as a heatsink 
and fan may be used. The power budget effectively defines the throughput of the router. 
Ideally, the router logic should consume as little of the power budget as is possible.
Chapter 6 
Ring-Based Router Architecture
6.1 Introduction
The research presented in this thesis was initially inspired by developments in optical 
interconnects. These offer the possibility of thousands of devices [92] on a single IC 
substrate, limited only by available area. In addition to this, the achievable data rates 
are likely to be in the Gbits/sec per pin range [33]. PCB routing difficulties are avoided, 
since optical signals may be routed either in free space, or through a holographic plate 
which directs the course of each beam. This type of interconnect seems ideally suited 
to packet routers for direct networks, since the performance of such devices is limited, 
if not defined, by the capabilities of the interconnect.
The challenge in constructing a packet router using free-space optical interconnects 
is to use silicon area efficiently, particularly wiring. It is also important to consider 
the hardware cost of each feature that is implemented in the router. The use of optical 
interconnects will increase packet router throughput considerably, but this must not 
be at the expense of increased latency. These considerations were the basis of this 
research, which inspired the design of the Cellular Router described in this chapter. 
The name is derived from the fact that the router is constructed from a small number 
of repeated logic cells. The most obvious feature of the Cellular Router is the use of a 
ring of buffers, rather than a crossbar, to perform switching. This feature enhances the 
planar mapping of the router, and means that the silicon area and wiring requirements 
of the router scale linearly with the maximum throughput. The router is designed to 
accommodate very wide data paths (> 100 bits). Store-and-forward routing is used, 
with a fixed packet size.
The router implementation described in this chapter concentrates on 2-dimensional 
torus networks. This topology is chosen due to its constant node degree, making it 
easily extensible, and its low diameter, due to the use of wrap-around links. The router 
architecture can be applied to many other direct networks with some modification.
This chapter describes the method of constructing a Cellular Router for 2-d torus
networks. The structure of the router is described in section 6.2. The packet format is 
described in section 6.3. The use of virtual channels and a deadlock prevention strategy 
are presented in section 6.4.
Chapter 7 shows simulation results for a network of Cellular Routers. In Chapter 
8, a cost model for constructing the router is described, which includes estimates of 
silicon area and power consumption. The Chapter concludes with a summary of the 
features of the Cellular Router.
This chapter describes the core logic of the Cellular Router. The circuitry which 
controls the physical links is not described, as this depends on the exact nature of the 
optical devices used. The main requirements of the router are that the data can be 
received and transmitted on and off-chip respectively, in wide parallel channels which 
are placed wherever they are required on the chip.
The optical devices may be clocked at a higher frequency than the router core. In 
this case, the optical clock rate should be some multiple of the router clock rate, and 
the data should be multiplexed and de-multiplexed when leaving and entering the router 
respectively.
The consequence of a higher off-chip data rate is that the optical signals can be 
de-multiplexed to a slower, wider data stream on-chip, and multiplexed back to a faster 
narrower data-path off-chip, as shown in figure 6.1. This will mean that fewer optical 
devices are required, but will complicate the design of the flow control unit.
In the remainder of this chapter it is assumed that the optical clock rate matches the 
router clock rate. This is done for ease of explanation, but the reader should note that 
higher optical data rates can be accommodated, requiring fewer optical devices to be 
used.
6.2 Proposed Architecture: A Cellular Router
This section describes the router architecture, and is followed by details of the routing 
algorithm and the deadlock prevention scheme.
The Cellular Router partitions the hardware resources into routing stages, each of 
which represents one routing direction. These stages are connected using a group of 
buffers, which are connected in a cycle (described later in this section). Packets are sent 
from one routing stage to the next by progressively moving around the ring of buffers 
until they reach the destination routing stage.
6.2.1 Routing Stages
Each routing stage consists of four packet-wide registers. The purpose of these registers 
is as follows:
Figure 6.1: Requirements of the Cellular Router: data is de-multiplexed, routed and multiplexed
IN register The IN register stores the packet when it enters the router. If the packet has 
more hops to make in the current routing direction (an in-dimension route) it will 
wait until the output (OUT) register is free before being transferred there. Oth­
erwise, the packet waits until the RING register is free before being transferred 
there.
RING register Each routing stage has a ring register, and the ring registers are con­
nected to one another in a cycle. This ring of buffers is used to perform switching 
in the router. The packets in the ring registers move one step around the switching 
ring on every clock cycle. When the ring buffer in any routing stage is occupied 
by a packet that can be routed in the corresponding direction, it will be removed 
from the ring and sent in that direction (subject to an available OUT_BUF).
OUT_BUF The OUT_BUF is an intermediate register, where packets leaving the rout­
ing ring can be stored, while waiting to move to an empty output (OUT) buffer.
OUT buffer Packets waiting to move one hop in the routing direction represented by 
the stage are stored in its output (OUT) buffer until the physical link is free.
Each routing stage contains all of the hardware resources for one routing direction. 
The arrangement of the registers in the routing stages is shown in figure 6.2(a). The 
sequence of events for packets entering a node is as follows: The packet arrives in 
the input register in one core clock cycle. On the next core clock cycle the packet is 
forwarded to the output buffer. In parallel with the action of transferring the packet, the 
destination node address for the current dimension is examined. If the packet needs to 
make more hops in the current dimension, it is forwarded straight to the output register, 
and will leave the node on the next clock cycle in which the destination register is 
empty. The connection between the IN (input) and OUT (output) registers is shown by 
the large arrows in figure 6.2(a). If the packet has completed all hops in the current 
dimension it is placed in the adjacent RING register shown in figure 6.2(a). This buffer 
forms part of the structure used to transport packets around the chip and is described 
in section 6.2.2. Packets entering the routing stage do so through this buffer. In order 
to leave the RING register, the packets enter the OUT_BUF register, where they wait 
to enter the OUT buffer. The scheme described so far gives a router delay of 3 clock 
cycles for in-dimension routes.
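The decision taken at the IN register can be sketched as a small behavioural model. The class and field names below are hypothetical, a packet is represented as a dictionary with a single `more_hops` flag, and only the IN register's choice between OUT and RING is modelled (not the OUT_BUF path or the clocking).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingStage:
    """The four packet-wide registers of one routing stage (None = empty)."""
    in_reg: Optional[dict] = None
    ring: Optional[dict] = None
    out_buf: Optional[dict] = None
    out: Optional[dict] = None

    def step_in_register(self) -> str:
        """Try to move the packet held in IN and report what happened."""
        pkt = self.in_reg
        if pkt is None:
            return "idle"
        if pkt["more_hops"]:  # in-dimension route: go straight to OUT
            if self.out is None:
                self.out, self.in_reg = pkt, None
                return "to OUT"
            return "wait for OUT"
        if self.ring is None:  # hops in this dimension complete: use the ring
            self.ring, self.in_reg = pkt, None
            return "to RING"
        return "wait for RING"


stage = RoutingStage(in_reg={"more_hops": True})
first = stage.step_in_register()   # OUT is empty, so the packet moves
stage.in_reg = {"more_hops": True}
second = stage.step_in_register()  # OUT is now occupied
stage.in_reg = {"more_hops": False}
third = stage.step_in_register()   # packet leaves the dimension via RING
```

Each stage only inspects its own registers and one flag in the packet, which reflects the local nature of the routing decision.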
Figure 6.2(b) shows a more detailed view of the connections between the router 
logic cells (registers) and the I/O devices. A 3-bit wide segment of the router data-path 
is shown.
The router is designed to use the standard cell IC design methodology (see figure 
6.2(c)). Standard cells can vary in length, but must all have the same height. The
Figure 6.2: Components of Cellular Router (a) A single routing stage (b) Router cell 
and I/O connections (c) Standard cell based silicon layout methodology
power (VDD) and ground (GND) rails are abutted together at the top and bottom of 
each cell respectively. The standard cells are arranged in rows, with a gap left between 
rows, known as a routing channel. Standard cells are connected by wires routed along 
this channel. If wires need to cross rows, feed-through cells are inserted. The only 
function of these cells is to route wires across rows. The router design keeps the length 
of wires in the routing channel to a minimum, since communicating cells are always 
close together. The aim of this is that wire lengths are kept short, so that the loading 
effect of the wires does not require the clock cycle time to be extended to accommodate 
the additional delay.
6.2.2 Ring of Buffers
Figure 6.3: Ring based interconnection scheme
Each routing stage has one RING register, and packets move from one RING register 
to the next on every clock cycle. In order 
that data can be transported quickly and efficiently between these registers, they are 
connected as shown in figure 6.3. In this way a complete ring is formed, so packets 
can be routed around the ring, until they reach the required I/O stage. By interleaving
the stages and connecting the ring buffers as shown, the wire lengths are kept short. 
This keeps wire delays between communicating stages to a minimum. The diagram 
also shows the manner in which this interleaving translates into connections between 
standard cells.
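The movement of packets around the ring can be modelled as a simple rotation. The sketch below assumes a uni-directional ring in which stage i feeds stage i + 1 (modulo the ring size), and it does not model packets entering or leaving the ring. The function names are hypothetical.

```python
def rotate_ring(ring):
    """Advance every RING register by one stage (one clock cycle)."""
    return [ring[-1]] + ring[:-1]


def cycles_to_reach(src_stage, dst_stage, n_stages):
    """Clock cycles for a packet to travel from src to dst around the ring."""
    return (dst_stage - src_stage) % n_stages


ring = ["A", None, None, None]  # packet "A" starts in stage 0 of a 4-stage ring
for _ in range(3):
    ring = rotate_ring(ring)
```

After three cycles the packet has reached stage 3, which matches `cycles_to_reach(0, 3, 4)`. The worst case is one step short of a full revolution, so the time spent in the ring grows with the number of stages.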
6.2.3 Logical Structure of the Cellular Router
Figure 6.4 shows the top-level construction of the router. The physical layout differs 
considerably from this, but the diagram clearly shows the way in which the routing 
stages are connected.
The diagram also shows the processor stage, used for the injection and consumption 
of packets. Two additional stages are shown. These are used to allow packets to move 
between virtual networks. The use of virtual networks is described in section 6.4.2.
6.3 Packet Format
The packet format is shown in figure 6.5. Inside the router, the packet is split into two 
halves, TD (top data) and BD (bottom data), on either side of the control circuitry. The 
control circuitry at the centre of the chip will examine and modify certain bits in the 
packet. For this reason the lowest bits in the top and bottom portions of the packet are 
used for packet address and control information. The direction and dimension bits for 
each dimension are set before the packet is injected into the router. The direction bit 
for each dimension determines which direction around the torus cycle the packet will 
be sent. The dimension bit is set to logic one when the packet still has hops to make in 
the corresponding dimension, and is set to zero when the packet reaches the destination 
node in that dimension.
The packet format used ensures that the logic complexity for each local decision is 
kept to a minimum, since each routing stage only deals with decisions related to its own 
routing direction.
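The address and control fields can be packed and unpacked with ordinary bit operations. The sketch below follows the bit positions given in figure 6.5 (X address in TD bits 3:0, Y address in TD bits 7:4, direction and dimension bits in BD bits 3:0). The function names and the choice to handle only the low bits of each half are assumptions made for illustration.

```python
def pack_header(x_addr, y_addr, x_dir, x_dim, y_dir, y_dim):
    """Return (td_low, bd_low): the low 8 bits of TD and low 4 bits of BD."""
    if not (0 <= x_addr < 16 and 0 <= y_addr < 16):
        raise ValueError("addresses are 4-bit fields")
    if any(bit not in (0, 1) for bit in (x_dir, x_dim, y_dir, y_dim)):
        raise ValueError("direction and dimension fields are single bits")
    td_low = (y_addr << 4) | x_addr
    bd_low = (y_dim << 3) | (y_dir << 2) | (x_dim << 1) | x_dir
    return td_low, bd_low


def unpack_header(td_low, bd_low):
    """Recover the individual fields from the low bits of TD and BD."""
    return {
        "x_addr": td_low & 0xF,
        "y_addr": (td_low >> 4) & 0xF,
        "x_dir": bd_low & 1,
        "x_dim": (bd_low >> 1) & 1,
        "y_dir": (bd_low >> 2) & 1,
        "y_dim": (bd_low >> 3) & 1,
    }


td, bd = pack_header(x_addr=3, y_addr=5, x_dir=1, x_dim=1, y_dir=0, y_dim=1)
fields = unpack_header(td, bd)
```

A routing stage for the X dimension would need only `x_addr`, `x_dir` and `x_dim`, which illustrates why each local decision stays simple.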
6.4 Virtual Networks and Deadlock Prevention
6.4.1 Virtual Networks
Virtual channels are used to increase channel utilisation and as part of deadlock preven­
tion schemes [22]. The Cellular Router uses virtual networks (VNs), where all router 
resources are replicated for each virtual channel.
In store-and-forward packet routers that use a central crossbar for switching, virtual 
channels can be implemented without additional crossbars, although for performance
Figure 6.4: Logical structure of Cellular Router
remaining bits in TD
TD bits 7:4 - Y address
TD bits 3:0 - X address
(control circuitry)
BD bit 0 - direction bit (X)
BD bit 1 - dimension bit (X)
BD bit 2 - direction bit (Y)
BD bit 3 - dimension bit (Y)
remaining bits in BD
Figure 6.5: Packet format.
reasons, each set of virtual channels often has its own crossbar [14]. This is because 
the crossbar controller can allow an input resource to access an output resource, with a 
guarantee that the packet can be transmitted. In other words, packets do not claim the 
use of the crossbar without freeing it again within a bounded period of time. Wormhole 
routers must use separate crossbars for each set of virtual channels, so instead they 
really have virtual networks: all of the routing resources (input buffers, output buffers 
and crossbars) are duplicated, sharing only the physical links. Even though the Cellular 
Router uses store-and-forward routing, it requires virtual networks to be used. This is 
due to the use of a ring structure for switching and arbitration. When packets enter the 
ring, it cannot be guaranteed that the desired output buffer (OUT_BUF) will be free. If 
virtual channels are implemented with a shared ring, congestion in one virtual network 
will cause packets from other VNs to be blocked. This diminishes the capability of 
virtual channels to counteract localised congestion.
The throughput of a single ring is limited. If a router is implemented with n chan­
nels, and each sends packets to all others with equal probability, then the throughput 
of a single ring is limited to J packets per cycle. This does not take account of the 
extra ring stages used for packet management. These increase the size of the ring, and 
hence the average time each packet spends in the ring, further decreasing throughput. 
Fortunately there is a simple solution to this limited throughput: use several rings. This 
is one of the reasons for employing virtual networks.
Virtual Networks also allow greater routing freedom. Deadlock prevention requires 
the restriction of the paths that packets may take. The constraints imposed may be 
relaxed by using VNs. The scheme is based on that described in reference [82], and is 
known as class-climbing (see figure 6.6). The virtual networks are arranged in some 
fixed order. Packets may be removed from one VN and injected into a lower one, but 
may not move to a higher VN.
The application of virtual networks for deadlock prevention in the Cellular Router 
is described further in section 6.4.2.
The term virtual networks has been used previously to describe separating the net­
work into separate virtual routing planes, each with a restricted routing algorithm. The 
order in which these planes may be accessed is restricted in order to break cyclic de­
pendencies in the network, guaranteeing deadlock freedom. The Cellular Router really 
uses virtual routers. All routing resources are replicated and multiplexed onto the phys­
ical channels by the flow control unit. Packets may move from one VN to another, in a 
fixed order (from one VN to the next lowest, but not back). The aim of this is to spread 
the packets in the router across several virtual routers, thus reducing the congestion in 
each one. The main cost in direct interconnection networks is wiring (or whichever in­
terconnect is used). The use of VNs is aimed at ensuring that the throughput is limited 
only by the capacity of the physical links and not by that of the buffering space.
Figure 6.6: Class-climbing using virtual planes
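The ordering restriction between virtual networks can be expressed as a one-line rule. The sketch below assumes that VNs are numbered so that a smaller index means a lower VN, and that a packet may only step to the next VN down. Both function names are hypothetical.

```python
def may_transfer(current_vn: int, target_vn: int) -> bool:
    """True if a packet may move from current_vn to target_vn.

    Assumes a smaller index means a lower VN and that only a single
    downward step is permitted at a time.
    """
    return target_vn == current_vn - 1


def descending_path(start_vn: int):
    """All VNs a packet can visit, in order, starting from start_vn."""
    return list(range(start_vn, -1, -1))


down_ok = may_transfer(2, 1)
up_blocked = may_transfer(1, 2)
path = descending_path(2)
```

Because every permitted transfer strictly decreases the VN index, no packet can return to a VN it has left, so transfers between VNs cannot themselves form a cycle.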
The physical implementation of VNs involves interleaving the VNs in such a manner 
that the input and output registers can be de-multiplexed and multiplexed respectively 
onto the physical channels. This introduces a small latency penalty. However the 
use of VNs (or virtual channels in other architectures) increases the network throughput 
[22]. Without VNs, the physical interconnect is under-utilised (see Chapter 7).
6.4.2 Deadlock
The possibility of deadlock cannot be ignored in direct networks, since its effect is to 
permanently block the packets involved. Many deadlock prevention schemes are based 
on identifying cyclic dependencies between network channels. Virtual channels are 
then employed to restructure the routing function in order to break cyclic dependencies. 
Another approach is to restrict turns [11]. This does not require any virtual channels to 
be used. The Cellular Router uses store-and-forward routing, so the deadlock scheme 
is simpler than for wormhole routing. It uses the technique proposed by Roscoe and 
Dathi [8]. Buffer allocation is restricted to ensure that (locally) there is always a free 
buffer.
The problem of deadlock is not limited to routers, and must be addressed in most 
concurrent systems [6]. Coffman et al. [16] showed that four conditions must hold for 
a deadlock to arise.
1. Mutual exclusion condition. Each resource is either currently assigned to exactly 
one process or is available.
2. Hold and wait condition. Processes currently holding resources granted earlier 
can request new resources.
3. No pre-emption condition. Resources previously granted cannot be forcibly taken 
away from a process. They must be explicitly released by the process holding 
them.
4. Circular wait condition. There must be a circular chain of two or more processes, 
each of which is waiting for a resource held by the next member of the chain.
An instance of these rules applies to the Cellular Router, by replacing the word 
process by packet, and the word resource by buffer.
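The circular wait condition can be checked mechanically on a wait-for graph. The sketch below assumes that each packet waits for at most one buffer, so the graph can be stored as a dictionary mapping each packet to the packet that holds the buffer it wants. The function name and the packet labels are hypothetical.

```python
def has_circular_wait(waits_for):
    """Detect a cycle in a wait-for graph.

    `waits_for` maps each packet to the packet holding the buffer it wants,
    or to None if that buffer is free. Since every packet waits for at most
    one buffer, following the chain from each packet is sufficient.
    """
    for start in waits_for:
        seen = set()
        node = start
        while node is not None and node not in seen:
            seen.add(node)
            node = waits_for.get(node)
        if node is not None:  # the chain came back to a packet already seen
            return True
    return False


deadlocked = {"P0": "P1", "P1": "P2", "P2": "P3", "P3": "P0"}
one_free = {"P0": "P1", "P1": "P2", "P2": None}
```

The first graph contains a closed chain of four packets and is reported as deadlocked. In the second, the chain ends at a packet whose target buffer is free, so progress is possible.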
To simplify the deadlock prevention strategy, a divide-and-conquer approach is 
used. The input and output buffers in a number of nodes combine to form the 
uni-directional torus cycles that make up the network topology. Inside each router, routing 
rings perform the switching functions for the corresponding virtual networks. The overall 
deadlock prevention strategy consists of ensuring that both the routing ring and the 
torus cycle used by a message are free from deadlock, and that dependencies between 
the two do not lead to deadlock.
6.4.3 Deadlock in Torus Cycles
Recall that the Cellular Router divides the routing resources for each direction into 
stages. The input and output buffers in each routing stage are linked to one another and 
to other routers to form a uni-directional torus cycle. Deadlock is prevented in these 
cycles by avoiding circular waits. Figure 6.7 shows an example of a torus cycle with 
four nodes, each containing one buffer. Each vertex of the directed graph represents a 
buffer belonging to a different routing node. Packets are denoted by the letter P, and the 
subscript denotes the destination buffer of each packet. Figure 6.7 (a) shows the case 
where a circular wait has arisen. None of the packets in the cycle is able to progress, 
so the cycle is deadlocked. One method of preventing deadlock in a cycle would be to 
restrict the routes packets can take in each VN, in order to prevent cyclic dependencies 
between channels. An alternative, which is used here, is to ensure that there is always 
one free buffer in the cycle. This is shown in figure 6.7 (b). With one buffer free, all 
packets may in turn advance by one node. This will be repeated until the leading packet 
reaches its destination. As long as the packet can be removed from the torus cycle at 
this point within a bounded period of time, deadlock cannot arise.
Figure 6.7: Circular waits in torus cycles. (a) Deadlocked cycle. (b) Deadlock-free cycle.
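The effect of the free buffer can be illustrated with a small Python sketch. It is not taken from the Cellular Router simulator: the function name `advance_cycle` and the list representation of a cycle are assumptions made purely for illustration. Each list element stands for one buffer, holding a packet label or `None` when empty, and a packet may advance only if the next buffer was empty at the start of the step.

```python
def advance_cycle(buffers):
    """Advance packets one step around a uni-directional cycle (illustrative model).

    buffers[i] holds a packet label or None; buffer i feeds buffer (i + 1) % len.
    A packet moves only if the next buffer was empty at the start of the step.
    Returns the number of packets that moved.
    """
    n = len(buffers)
    snapshot = list(buffers)          # decisions use the state at the start of the step
    moved = 0
    for i in range(n):
        nxt = (i + 1) % n
        if snapshot[i] is not None and snapshot[nxt] is None:
            buffers[nxt] = snapshot[i]
            buffers[i] = None
            moved += 1
    return moved


# Case (a): every buffer is occupied, so no packet can move.
full = ["P0", "P1", "P2", "P3"]
print(advance_cycle(full))            # 0

# Case (b): one buffer is free, so the packets advance in turn.
ring = ["A", "B", "C", None]
for _ in range(4):
    advance_cycle(ring)
print(ring)                           # ['C', 'A', 'B', None]
```

With all four buffers full no packet can move. With one buffer free, the empty slot travels backwards around the cycle, and after four steps every packet has advanced by one node.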
The method of ensuring that there is at least one free buffer is as follows (see figure 
6.8): Packets leaving the routing ring move to an intermediate buffer, labelled out_buf. 
Once there, the packet waits for the output buffer to become empty, before transferring 
to it. In order to ensure that the packet does not occupy the last free buffer in the torus 
cycle, the router waits until both the input and output buffers are empty, and only then 
allows the packet to enter the output buffer. This restriction ensures that no packet may 
ever occupy the last free buffer in any torus cycle. This method of ensuring that one 
buffer remains free in each torus cycle forms a sufficient (but not necessary) condition, 
as it only operates on knowledge of the registers in each routing stage, rather than the 
entire cycle. This does not make optimal use of the available buffers.
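The local admission rule can be sketched in Python as follows. This is a minimal model under stated assumptions, not the router's actual control logic: the names `may_enter_output` and `try_inject`, and the flat list `[in_0, out_0, in_1, out_1, ...]` used to represent a torus cycle, are hypothetical.

```python
def may_enter_output(in_buf, out_buf):
    """Local rule: a packet waiting in the intermediate buffer may enter the
    output buffer only when the stage's input and output buffers are both empty.
    A buffer is modelled as a packet label, or None when empty."""
    return in_buf is None and out_buf is None


def try_inject(cycle, stage, packet):
    """Try to move a packet from the routing ring into the torus cycle.

    cycle is a flat list [in_0, out_0, in_1, out_1, ...] (an assumed layout).
    Returns True if the packet was admitted.
    """
    i_in, i_out = 2 * stage, 2 * stage + 1
    if may_enter_output(cycle[i_in], cycle[i_out]):
        cycle[i_out] = packet
        return True
    return False


# Three stages, all buffers empty; every stage tries to inject twice.
cycle = [None] * 6
for attempt in range(2):
    for stage in range(3):
        try_inject(cycle, stage, f"P{stage}{attempt}")

# The input buffer of each admitting stage is still empty, so the cycle
# can never be filled completely by injection alone.
print(None in cycle)
```

Because an admitted packet always leaves its own stage's input buffer empty, the rule uses only local information. The price is that it may refuse an entry even when free buffers exist elsewhere in the cycle.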
Figure 6.8: Buffer restriction in deadlock prevention scheme 
6.4.4 Deadlock in Routing Rings
The routing ring is a dynamic structure, in which packets advance by one slot around 
the ring on each clock cycle. Again, circular waits are avoided to prevent deadlock.
The routing ring allows packets from all routing directions to freely mix, which 
may lead to deadlock. An example of this is shown in figure 6.9. The upper cycle 
represents the torus cycle formed by the input and output buffers. The packet labelled 
Pa has reached its destination node. The ring is filled with packets addressed to nodes 
in the same torus cycle. None of the packets can leave the ring since the torus cycle has 
only one free buffer which cannot be filled owing to the rules for preventing deadlock 
in torus cycles described previously. The packet waiting to enter the routing ring is 
blocking the other packets in the torus cycle. All the packets in the routing ring are 
waiting to route along the blocked torus cycle.
The solution to this problem involves placing restrictions on the virtual networks. 
For an n-dimensional torus network, at least n VNs must be used, although there is no 
maximum number of VNs. Packets are restricted to completing all hops in one routing 
direction before traversing the next, in the lowest n VNs. Packets may only be injected 
by the processor on VN n-1 or higher. Packets may be consumed in any VN in which 
they have completed routing.
The following restrictions apply to the lowest n - 1 VNs: Packets entering VN x, 
either from VN x + 1 or from the processor injection channel, must have at most x + 1 
dimensions to route in. For example, in a 2-dimensional network, packets entering VN 
1 must have at most two dimensions to route, and packets entering VN 0 must have 
at most one remaining dimension to route in. Packets entering the routing ring from a 
higher dimension or from the processor injection channel may not do so until at least 
two free buffers are available in the routing ring.

Figure 6.9: Restricting circular waits in the router
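The two entry rules can be expressed as simple predicates. The Python sketch below is illustrative only: the function names and the use of coordinate tuples for node addresses are assumptions, and only the rules stated in the text (at most x + 1 dimensions left when entering VN x, and at least two free ring buffers before entry) are modelled.

```python
def dims_remaining(current, dest):
    """Number of dimensions in which the packet still has hops to make."""
    return sum(1 for c, d in zip(current, dest) if c != d)


def may_enter_restricted_vn(x, current, dest):
    """A packet entering restricted VN x must have at most x + 1 dimensions left."""
    return dims_remaining(current, dest) <= x + 1


def may_enter_ring(free_ring_buffers):
    """Packets arriving from a higher VN or from the processor injection channel
    need at least two free buffers in the routing ring."""
    return free_ring_buffers >= 2


# 2-dimensional example: a packet at (0, 0).
print(may_enter_restricted_vn(0, (0, 0), (3, 0)))   # True: one dimension left
print(may_enter_restricted_vn(0, (0, 0), (3, 4)))   # False: two dimensions left
print(may_enter_restricted_vn(1, (0, 0), (3, 4)))   # True: VN 1 allows two
```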
The consequences of these restrictions are as follows. There will always be one 
free buffer in the routing ring in any router, unless it is occupied by an incoming packet 
from one of the torus cycles. If the last buffer is filled, it will be freed again within a 
bounded period of time, since the packet will either be consumed by the processor (if 
it has completed routing) or move to a lower VN because it has completed all hops in 
another dimension (because it has come from a torus cycle).
The deadlock prevention scheme is complete. It has been shown that deadlock 
cannot arise within torus cycles, provided the routing rings in each router are able to 
accept packets that have completed all hops in the torus cycle. This condition has been 
satisfied, so deadlock cannot arise.
The deadlock prevention scheme described in section 6.4.2 provides considerable 
routing freedom. Packets in the lowest n VNs are restricted so that they must complete 
all hops in one dimension followed by all hops in the remaining dimension. This allows 
a small degree of routing freedom, even if only n VNs are used, since the order in which 
dimensions are traversed is not restricted. Adaptive routing can be used if additional 
VNs are added, as the deadlock scheme allows this.
One drawback with the deadlock prevention scheme is that it is costly in terms 
of buffering requirements. Reducing these requirements with minimal loss of routing 
freedom and minimal impact on latency is an area for further research.
6.4.5 Adaptive Routing
The lowest two VNs in the router must be restricted, but higher VNs allow adaptive 
routing. Adaptive routing is known to add complexity to the packet router logic. Match­
ing input packets with free output channels involves arbitration, and adds delay on 
the router critical path. The Cellular Router implements adaptive routing without this 
penalty. Decisions are made in a decentralised manner. When packets are routing in 
a given dimension, they enter the IN buffer in a routing node. If the packet has not 
completed all hops in the current dimension it should be sent to the OUT buffer. If con­
gestion exists, the packet will be kept waiting to access this buffer. Other packets may 
be queued up behind it in other nodes. Adaptive routing is implemented using counters. 
When a packet enters the IN buffer, a counter starts to increment. If the counter exceeds 
a certain number, the packet is injected into the central ring in the current routing node. 
The packet may or may not have hops to complete in a dimension other than the one in 
which it was routing. The packet will pass around the central ring, until it can move to 
an empty OUT_BUF. This means that packets that are able to adapt will tend to do so. 
Even if the packet cannot route along another link, there is still some benefit in 
temporarily freeing the IN buffer, since packets in the rest of the torus cycle may be able 
to make progress. Determining the threshold at which this mechanism is employed is a 
complicated matter. This is explored through simulation in chapter 7. The scheme adds 
a small amount of additional logic to the controller of each IN buffer, but this logic is 
not on the chip critical path, so it will not affect the clock rate.
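The counter mechanism can be sketched as a small state machine for one IN buffer. This Python model is an assumption-laden illustration rather than the router's logic: the class name `InBufferController`, its methods and the returned strings are hypothetical, and only a packet that still has hops to make in the current dimension is modelled.

```python
class InBufferController:
    """Illustrative model of the counter attached to one IN buffer."""

    def __init__(self, threshold):
        self.threshold = threshold    # the "certain number" the counter must exceed
        self.packet = None
        self.counter = 0

    def load(self, packet):
        """A packet enters the IN buffer and the counter starts from zero."""
        self.packet = packet
        self.counter = 0

    def tick(self, out_free, ring_slot_free):
        """Advance one clock cycle and report what happened to the packet."""
        if self.packet is None:
            return "idle"
        if out_free:                  # normal case: continue straight through
            self.packet = None
            return "out"
        self.counter += 1             # blocked: keep counting
        if self.counter > self.threshold and ring_slot_free:
            self.packet = None        # adapt: drop into the central ring
            return "ring"
        return "wait"


ctrl = InBufferController(threshold=2)
ctrl.load("P")
# The OUT buffer stays busy, so the packet waits and then enters the ring.
print([ctrl.tick(out_free=False, ring_slot_free=True) for _ in range(3)])
# ['wait', 'wait', 'ring']
```

The decision uses only the state of the local OUT buffer and ring slot, which is why no arbitration between input and output channels is needed.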
6.4.6 Fault Tolerance
The deadlock prevention scheme requires that the packets complete all hops in one 
dimension, then another, without restriction on the order in which dimensions are tra­
versed. Another consequence of this scheme is that the direction of routing in each 
dimension does not matter. A torus network has wrap-around links, so one path will be 
shorter than the other. When packets are injected into the network, direction bits inside 
the packet are set, so that the packet follows the shortest path in each dimension.
This property is useful when links in the network fail. It is a simple matter to update 
a counter for each packet to keep track of the number of times it has passed around the 
central routing ring. This can be done in the processor routing stage. When a given 
threshold is exceeded, the direction bits in the packet are reversed. The packet will now 
be routed along a longer path, but is able to avoid faults. With this mechanism in place, 
up to half of the links in the network may fail and packets will still be delivered. Only 
when failures occur in two links in opposite directions in the same torus cycle will the 
network fail.
This mechanism of fault tolerance is very light in terms of hardware overhead. All 
that is required is that a count of the number of passes around the ring is kept for each 
ring slot, and that a comparator checks this value, and alters two bits within the packet.
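A minimal Python sketch of the direction bits and the pass counter is given below. The packet is modelled as a dictionary and the names `shortest_direction` and `note_ring_pass` are hypothetical. Resetting the counter after a reversal is an assumption of this sketch and is not stated in the text.

```python
def shortest_direction(src, dst, k):
    """Direction (+1 or -1) of the shorter path around a ring of k nodes."""
    forward = (dst - src) % k
    return +1 if forward <= k - forward else -1


def note_ring_pass(packet, threshold):
    """Count one pass around the central routing ring.

    When the count exceeds the threshold, the direction bits are reversed so
    that the packet takes the longer way round in each dimension.
    """
    packet["passes"] += 1
    if packet["passes"] > threshold:
        packet["direction"] = [-d for d in packet["direction"]]
        packet["passes"] = 0          # assumption: the count restarts after a reversal


# Direction bits set at injection for a packet from (0, 0) to (3, 12) with k = 15.
packet = {
    "direction": [shortest_direction(0, 3, 15), shortest_direction(0, 12, 15)],
    "passes": 0,
}
print(packet["direction"])            # [1, -1]

for _ in range(3):                    # the packet circulates three times
    note_ring_pass(packet, threshold=2)
print(packet["direction"])            # [-1, 1]
```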
Chapter 7 
Computer Simulations
This chapter describes the performance of a number of Cellular Routers connected 
together to form a network, in this case a 2-d torus. The simulator operates at the 
register transfer level (RTL), modelling the movement of packets on a per-cycle basis.
The objectives of the simulations are to assess the optimal number of virtual net­
works, examine the routing latency at various loads and to look qualitatively at the 
effects of adaptive routing on network performance.
Synthetic loads are used in the network simulations. Although these do not repre­
sent realistic traffic patterns, they are useful for assessing the overall network perfor­
mance. A realistic application may place varying demands on the network throughout 
the course of program execution. It is difficult to assess the network performance under 
these conditions. The synthetic loads chosen are uniform random traffic (URT) and 
hot-spot traffic.
The Cellular Router is designed to combine high throughput and low latency in 
a single-chip router. Throughput can be achieved through replication of resources. 
The difficulty lies in achieving high throughput at low latency. One purpose of the 
simulations in this chapter is to show that the router achieves the desired aim of low 
latency.
It should be noted that latency is used to mean the time taken (in network cycles) 
to deliver a message from the source to the destination node once injection has been 
achieved. This does not include the time spent by packets in queues waiting to be 
injected into the network¹.
¹ The time spent queueing is insignificant, except when the network is saturated. The queues are 
provided for the purposes of simulation, so that packets are not discarded if they cannot immediately be 
injected into the network.
7.1 Uniform Random Traffic
The first traffic model used is uniform random traffic (URT). URT is one of the most 
general traffic patterns. Packets may be sent to and from each node with equal prob­
ability. URT is useful for assessing the maximum throughput that can be achieved in 
the network. Non-uniform traffic patterns place uneven loads on the channels in the 
network, which tends to reduce throughput. The purpose of the simulations using URT 
is to determine the optimal number of virtual networks (VNs). The simulations also 
provide insight into the increase in routing latency with network throughput.
7.1.1 Virtual Networks
One of the aims of the simulations using URT is to determine the optimal number of 
virtual networks to use in the router. As described earlier, the Cellular Router relies 
on the central ring structure to perform switching. Whenever a packet makes a turn, 
either to route adaptively or to switch dimensions, it passes through a routing ring. The 
use of VNs spreads the traffic load across a number of virtual routing planes. This 
increases the likelihood that a packet will reach the desired output channel on the first 
pass around the ring. The benefit of this is two-fold. Firstly packets will leave the ring 
sooner, leaving an empty slot in the ring for other packets to use. Secondly, the latency 
of the packets is reduced substantially, since each complete pass around the routing ring 
takes 2n +  3 clock cycles.
In the simulations in this chapter it is assumed that packets can be injected into 
any of the unrestricted VNs (VN2 to VNmax). All VNs contain a buffer to consume 
messages. In the physical hardware, this means that the injection and consumption 
registers in each VN must be multiplexed onto the injection and consumption channels 
(respectively) of the processor.
7.1.2 Calculation of Applied Load
Under URT, any node may send a message to any other network node with equal prob­
ability. The number of packets injected on each cycle is normalised with the maximum 
network throughput. The maximum throughput of a network, assuming completely uni­
form traffic distribution, depends on the number of links in the network and the average 
distance travelled by messages. For a torus network, the number of uni-directional 
links, $l$, is given by:
$l = 2nN$ (7.1)
Every link can transmit one packet per cycle in each direction, so the number of packets 
able to make one hop in any given cycle is given by:
$T_{max} = 2nN/D$, (7.2)
where $D$ is the average number of hops each packet will make. Maximum throughput 
can only be achieved if all packets make only one hop, and the communication pattern 
is strongly correlated, so that all channels are used on every cycle. In this case, $D = 1$, 
so:
$T_{max} = 2nN$ (7.3)
In the case of URT, packets can have any source and destination node with equal 
probability, so the average number of hops is equal to $nk/4 = k/2$ (taking $n = 2$). In this case the maximum 
number of packets that can be injected into the network is:
$I_{max} = 2nN/D$ (7.4)
$= 4nN/k$ (7.5)
but since $N = k^n$
$I_{max} = 4nk^{n-1}$ (7.6)
In the simulations the maximum injection rate is calculated as $I_{max} = 8k$ ($n = 2$). 
The loads are all expressed as a percentage of the maximum injection rate. For the pur­
pose of the simulations, message queues with 1000 slots are provided for each injection 
channel. The queue slots are provided purely for the purposes of simulation, to ensure 
that messages that cannot be injected on a given cycle are not simply discarded. The 
small increase in latency due to queueing is ignored in the simulations. In a practical 
implementation, such large queues would not be provided. Instead, it is likely that the 
processor channel would be throttled.
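The calculation of the maximum injection rate can be checked numerically. In the Python sketch below the function names are hypothetical. The average hop count is taken as k/2, as in the text, and the conversion of a percentage load into a per-node injection probability is an assumption about how such a load might be applied, not a description of the simulator.

```python
def max_injection_rate(n, k):
    """Maximum packets injected per cycle: 2nN / D with N = k**n and D = k/2."""
    N = k ** n                  # number of nodes
    links = 2 * n * N           # uni-directional links
    D = k / 2                   # average hops per packet used in the text
    return links / D            # equals 4 * n * k**(n - 1)


def per_node_injection_probability(load_percent, n, k):
    """Assumed mapping from applied load (%) to packets per node per cycle."""
    return (load_percent / 100.0) * max_injection_rate(n, k) / (k ** n)


# The 15 by 15 torus used in the simulations (n = 2, k = 15).
print(max_injection_rate(2, 15))                      # 120.0, i.e. 8k
print(per_node_injection_probability(50, 2, 15))      # about 0.267
```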
7.1.3 Load Measurement
Even with virtual networks, the actual network throughput may be less than that ex­
pected from the applied load. This is particularly so when the network is heavily 
loaded. Packets are injected into the routing rings, which are also used to perform 
switching. When the routing rings are congested, the injection of packets is held back. 
Some measure of the achieved network throughput must be used. In the Cellular Router 
simulator, average channel utilisation is used. The number of channels in the network 
that transmit a packet on each cycle is recorded, and the average is calculated. This is 
then expressed as a percentage of the total number of channels in the network (4N for 
a 2-d torus).
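The measurement described above amounts to the following calculation, shown here as a Python sketch. The function name and the list of per-cycle counts are assumptions made for illustration. The total number of channels is taken as 2nN, which gives 4N for a 2-d torus.

```python
def channel_utilisation(busy_per_cycle, n_nodes, n_dims=2):
    """Average channel utilisation as a percentage.

    busy_per_cycle[i] is the number of channels that transmitted a packet on
    cycle i; the network has 2 * n_dims * n_nodes uni-directional channels.
    """
    total_channels = 2 * n_dims * n_nodes
    mean_busy = sum(busy_per_cycle) / len(busy_per_cycle)
    return 100.0 * mean_busy / total_channels


# 15 by 15 torus: 225 nodes, 900 channels; 450 busy channels on each cycle.
print(channel_utilisation([450, 450, 450], n_nodes=225))   # 50.0
```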
A 15 by 15 torus network of Cellular Routers has been simulated with uniform random 
traffic. The load is expressed as a percentage of the maximum load that the network 
can sustain. Each load and virtual network combination is simulated for 500,000 
clock cycles. Packets may be injected into any virtual network except for the lowest 
one. Injecting packets into the lowest VN may cause a deadlock. Packets may be 
consumed (removed from the network) in any VN.
Packets which cannot be injected into the network on any given cycle are stored in 
queues. Each node is provided with 1000 queue slots, which should be enough to store 
all packets waiting to enter the network.
Figure 7.1: Routing latency versus load with varied number of virtual networks (routing latency versus accepted load in %)
The average distance travelled by packets in the network should be nk/4 = 2 × 
15/4 = 7.5. The latency of the simulations with light network loading agrees with this 
figure, as there are 7.5 straight-through routes, with a latency of 2 cycles each. The 
routing ring contains 7 buffers, so this adds 7 cycles of delay, plus one cycle for each 
packet entering the ring, and one cycle when leaving. This gives an expected latency of 
((7.5 × 2) + 7 + 2) = 24 clock cycles, which agrees with the graph shown in figure 7.1.
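The light-load estimate can be reproduced with a few lines of Python. The function below is a sketch with a hypothetical name. It assumes, as the estimate in the text does, that a packet makes a single pass around one routing ring, and it takes the ring length as 2n + 3 slots, which gives the 7 buffers quoted for n = 2.

```python
def light_load_latency(n, k, hop_cycles=2):
    """Expected latency in cycles under light load for an n-dimensional k-ary torus."""
    hops = n * k / 4            # average distance travelled
    ring_slots = 2 * n + 3      # one complete pass around the routing ring
    ring_entry_exit = 2         # one cycle to enter the ring, one to leave
    return hops * hop_cycles + ring_slots + ring_entry_exit


print(light_load_latency(n=2, k=15))   # 24.0
```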
The applied load in each case is varied from 5% to 100% in 5% intervals. It can be 
seen from figure 7.1 that the achieved channel utilisation falls short of the applied load, 
especially as the load on the network increases.
This is due to the fact that congestion in the network throttles the injection channel 
in each router. In some sense, this property is useful. If the router continued to 
accept packets when it is very congested, the amount of contention in the router would 
increase, and the average routing latency would be greater. This throttling property of 
the router prevents saturation, at least under URT.
Six sets of virtual networks give the lowest latency as the network load increases, 
and allows the highest accepted load. However, adding virtual networks is not without 
cost. Each virtual network requires additional silicon area. The area increase caused 
by adding virtual networks is mainly along the x-axis. This implies that the wires 
connecting the routing rings are lengthened, which will increase the power consumption 
of each ring. The virtual channels at the inputs and outputs must be multiplexed onto 
the physical channels, which is not taken into account in the simulation. For this reason, 
the optimal number of virtual networks appears to be four. This gives a reasonably high 
achieved load (> 80%) and the latency is not significantly higher than with six virtual 
networks. An additional benefit of using four virtual networks is that this provides a 
balanced tree of multiplexors between the virtual and physical channels. For x virtual 
networks, $\log_2 x$ logic stages are required for the virtual channel multiplexors and 
de-multiplexors. Four virtual networks require two logic stages. If more virtual networks 
are added, at least three logic stages must be used.
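The number of multiplexor stages can be computed as below. This Python sketch assumes a tree of two-input multiplexors, so that a whole number of stages equal to the ceiling of log2(x) is needed. The function name is hypothetical.

```python
def mux_stages(x):
    """Logic stages in a tree of two-input multiplexors serving x virtual networks."""
    if x < 1:
        raise ValueError("at least one virtual network is required")
    return (x - 1).bit_length()     # integer form of ceil(log2(x))


for x in (2, 4, 6, 8):
    print(x, mux_stages(x))         # 2 -> 1, 4 -> 2, 6 -> 3, 8 -> 3
```

Under this assumption four virtual networks fill a two-stage tree exactly, while six would need a third stage that is only partly used.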
The remaining simulations in this chapter will use four sets of virtual networks.
7.2 Adaptive Routing in the Cellular Router
One novel feature of the Cellular Router is the manner in which adaptive routing is 
implemented, which results in a low hardware overhead. Much research has been done 
relating to the use of adaptive routing. It is well established that adaptive routing offers 
potential benefits, such as reduced latency and increased network throughput, but also 
introduces additional hardware complexity, which may increase the average routing 
latency. There is also a latency penalty for individual packets which route adaptively. 
Contemporary packet routers partition the switching between the input and output links 
in such a way that straight-through routes have a lower latency. When packets make a 
turn, the latency is greater than it is when the packet routes straight-through. Adaptive 
routing causes packets to make an increased number of turns. If this is done to avoid 
network congestion, then the overall effect is to reduce latency. Spurious turns will 
simply increase routing latency.
The Cellular Router is able to implement minimal adaptive routing, with almost 
no increase in the router logic or its complexity. Adaptive routing is useful in con­
gested networks, since it may allow packets to make use of an idle channel instead of 
being queued up to use a busy channel in the network. One novel feature of the Cel­
lular Router is that the routing algorithm varies with local congestion conditions in the 
network. When the network is relatively uncongested, packets will complete all hops 
in one dimension before travelling along the other routing direction. In the presence 
of congestion, blocked packets will be dropped back into the routing ring, allowing 
them to follow any minimal path in the network. The deadlock prevention scheme also 
permits some non-minimal routes to be used. This is described in section 6.4.6.
Section 6.4.5 outlined the use of timers to implement full minimal adaptive routing. 
In the remainder of this chapter, this will be referred to as adaptive routing. When this 
feature is not applied, packets complete all hops in one dimension before traversing the 
next dimension. The order in which the dimensions are traversed is not fixed. This will 
be referred to as partial adaptive routing.
In most packet router architectures, increased routing freedom implies increased 
hardware complexity [15]. In looking at the effects of adaptive routing, it would seem 
sensible to compare the Cellular Router using full minimal adaptive routing with one 
using fixed dimension order routing. However, the Cellular Router architecture has a 
peculiarity, in that partial adaptive routing is less complex to implement than dimension 
order routing. For this reason partial adaptive and full minimal adaptive routing are 
compared. The contrast between the two sets of results is diminished by this selection.
Packets which are allowed to route adaptively make more turns and spend a greater 
proportion of their journey using the central dynamic ring inside the router. Packets 
making turns in order to route adaptively incur a latency penalty. This is undesirable 
when the network is lightly loaded, but is useful when the network is heavily loaded.
The deadlock prevention scheme used restricts the Cellular Router so that adaptive 
routing is prohibited in the lowest n VNs. For the two dimensional case considered in 
this chapter, this involves restricting VN0 and VN1. The simulations described in the 
remainder of this chapter use 4 virtual networks, so VN2 and VN3 permit full minimal 
adaptive routing.
7.3 Simulations with Network Hot-spot traffic
The Cellular Router has the capability to introduce full minimal adaptive routing on a 
per-packet basis, whenever a packet is blocked for more than x clock cycles, where x is 
a programmable parameter of the router. In the absence of contention, packets complete 
all hops in one dimension before traversing the next. The order in which dimensions 
are crossed is not fixed.
7.3.1 Multiple Hot-spots
Networks with multiple hot-spots are examined in this section. Assessing the particular 
traffic load to apply in this case is not straightforward. There are many parameters: the 
number of hot-spots, their distribution and the probability with which packets are sent to 
and from hot-spot nodes. The aim of the simulation is to determine the qualitative effect 
of hot-spots on network performance and to assess the potential benefits of minimal 
adaptive routing. To do this, many sets of simulations are carried out, with different 
numbers of randomly-chosen hot-spots. In each simulation, the load is represented as 
a percentage of the maximum throughput of the network. The applied load is varied 
at 5 percent intervals. The achieved channel utilisation is measured. At low loads, 
the network should be able to carry all of the applied load, so the two figures should 
match. At greater loads, contention in the network will reduce the achieved channel 
utilisation, as well as delaying the injection of some packets. Queues are included for 
messages waiting to be injected in the network. A queue length of 1000 is used. When 
the network is operating below saturation, only a small number of queue slots are in 
use. At saturation, the injection rate is unsustainable, and the queues are quickly filled.
The results from simulation to simulation are markedly different. The following ex­
amples are taken from networks with five randomly chosen hot-spots. The probability 
of a packet being sent to or from a hot-spot is 0.09. If the source or destination is not a 
hot-spot node, it is chosen randomly, with all nodes in the network equally favoured.
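One possible reading of this traffic model is sketched below in Python. The function names are hypothetical, and two details are assumptions of the sketch rather than statements from the text: each endpoint is drawn independently, and a packet is never addressed to its own source node.

```python
import random


def pick_endpoint(nodes, hot_spots, p_hot, rng):
    """With probability p_hot choose a hot-spot node, otherwise any node uniformly."""
    if rng.random() < p_hot:
        return rng.choice(hot_spots)
    return rng.choice(nodes)


def make_packet(nodes, hot_spots, p_hot=0.09, rng=random):
    """Return a (source, destination) pair for one synthetic packet."""
    src = pick_endpoint(nodes, hot_spots, p_hot, rng)
    dst = pick_endpoint(nodes, hot_spots, p_hot, rng)
    while dst == src:               # assumption: no self-addressed packets
        dst = pick_endpoint(nodes, hot_spots, p_hot, rng)
    return src, dst


rng = random.Random(0)              # fixed seed so that runs are repeatable
nodes = [(x, y) for x in range(15) for y in range(15)]
hot_spots = rng.sample(nodes, 5)    # five randomly chosen hot-spots
print(make_packet(nodes, hot_spots, rng=rng))
```

Because the uniform choice can also land on a hot-spot node, the overall fraction of packets addressed to hot-spots in this sketch is slightly higher than 0.09.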
The threshold at which packets are allowed to re-enter the routing ring and start 
looking again for an outgoing link is controlled by the TIME_OUT parameter. In the 
following simulations, two settings of this parameter are compared. When a packet sits 
in the input buffer in any given routing stage, it waits to move to the output buffer. If 
the packet has been waiting for TIME_OUT clock cycles or more, the packet may also 
enter the routing ring, if there is an empty slot. This means that the packet has the 
option to use either the current output link, or to re-enter the routing ring and become 
a candidate to use other output links. This means that packets may potentially follow 
any minimal path in the network. One set of simulations use a TIME_OUT setting of 
5, which causes the adaptive routing mechanism to be used fairly frequently. The other 
TIME_OUT setting used is 5000. At this value, the routing is effectively restricted to 
being minimal and partially adaptive (first X then Y or vice versa).
Two sets of simulation results are selected, and are shown in figure 7.2. It can 
be seen that adaptive routing offers some benefit over partial adaptive routing. In set 
one, the difference in latency grows with the load on the network. In addition to this, 
the achieved channel utilisation is slightly greater for the full minimal adaptive case 
(TIME_OUT=5). The situation is completely different in set two. For achieved loads 
up to 50%, the latency is the same for both cases. At around 50%, there is a sudden 
increase in latency for the partial adaptive case. As the applied load 
increases beyond this point, the latency increases even further, and the achieved load 
drops sharply². This is due to the network saturating. With full minimal adaptive routing, 
the latency increases sharply with achieved load, but this does not cause saturation 
to occur.
In order to explain this effect, a graphical representation of the network is used. 
The average number of packets in each node in the network was recorded throughout 
the simulations. These figures are plotted as 3-d graphs, in which the base represents 
the position of each network node, and the heights represent the average number of 
messages in the node at that position. This provides a graphical view of the traffic 
pattern in the network.
Figure 7.3 shows a simulation where the TIME_OUT parameter of the router is set 
to 5. The peaks where network hot-spots are located can clearly be seen.
In figure 7.4, the TIME_OUT parameter is set at 5000. The congestion pattern in 
the network looks similar to that in the adaptive case. In both simulations, the row 
corresponding to Y = 0 in the network is heavily congested, but not saturated. In the 
partial adaptive simulation, the congestion is greater, since the peaks are higher.
Figure 7.5 shows the adaptive simulation (TIME_OUT=5) when the applied load on the network 
is increased to 60%. The congestion pattern looks relatively unchanged from the 55% 
case. The achieved load is close to the applied load.
The situation is entirely different for the partial adaptive simulation shown in figure 
7.6. The entire network appears to be highly congested. The reason for this is that 
the channels in the area of the network where Y = 0 are overloaded. There are more 
packets which require the use of these channels than the channels can accommodate. 
Nodes surrounding the congested ones are also heavily congested, due to packets waiting 
to cross the congested region. The congestion quickly spreads throughout the entire 
network. There is still some movement of packets, but the channel utilisation drops 
sharply. The network settles into an equilibrium state, where the injection channels 
are throttled so that the number of packets entering the network is equal to the number 
leaving. In this case the achieved load has dropped to just 36%.
² It is important to note that latency is dependent on both achieved load (channel utilisation) and 
applied load.
Figure 7.2: Effect of adaptive routing with 5 hot-spots (routing latency in cycles versus channel utilisation in %)
Figure 7.3: Average packets per node: TIME_OUT=5, applied load=55%, achieved load=50.5%
Figure 7.4: Average packets per node: TIME_OUT=5000, applied load=55%, achieved load=51.7%
Figure 7.5: Average packets per node: TIME_OUT=5, applied load=60%, achieved load=55.3%
Figure 7.6: Average packets per node: TIME_OUT=5000, applied load=60%, achieved load=36.0%
7.3.2 Simulation Conclusions
The simulations in this chapter demonstrate the feasibility of constructing a direct 
network using the Cellular Router as a building block. The uniform random traffic 
simulations, taken together with hardware implementation considerations, suggest that four 
virtual networks should be used in the router.
The hot-spot simulations demonstrate that the router mechanism for full minimal 
adaptive routing does work, and produces some benefit. Since the hardware overhead 
is minimal, this feature should be used.
The simulations also provide some insight into the nature of direct interconnection 
networks. Exceeding the maximum load of a single channel in the network for 
a sustained period of time is enough to cause congestion to spread through the entire 
network. Increasing routing freedom reduces the likelihood of this situation arising.
Future simulation work should involve other synthetic loads and, if possible, some 
realistic application loads. The optimum value of the TIME_OUT parameter has not 
been determined. This requires realistic application loads. One option is to make this 
parameter programmable.
The issue of virtual channel balance should be addressed in future work. In the 
Cray T3D [90] and T3E networks [89], the designers went to great lengths to ensure 
that the traffic load was spread evenly across the virtual channels, as this was shown to 
improve throughput. The balance of traffic across the virtual networks in the Cellular 
Router has yet to be examined.
Chapter 8 
Router Hardware Cost
In Chapter 6 the implementation of the Cellular Router was described. Chapter 7 looked 
at the simulated performance of the router in a network. The simulations operate on a 
per-packet basis in which the actual packet size is not specified. This chapter aims to 
develop a cost model for constructing the Cellular Router, for a given packet size. Two 
metrics are important in the construction of a packet routing integrated circuit: power 
consumption and silicon area. Power consumption must be kept within certain bounds 
for a given cooling technology, and may restrict the speed of the router and the size of 
the data path. The silicon area is limited by the achievable yield. Larger die result in a 
greater number of faulty devices per wafer, which increases the manufacturing cost of 
the device.
This chapter begins with a brief description of power consumption issues in CMOS 
circuits. The Cellular Router architecture is designed to allow routers with exceptionally 
wide data-paths to be constructed, for use in low dimension networks. In the 
past, packet routers have been restricted by IC pin-out limitations. This restriction 
has now eased, as devices which are flip-chip bonded to a PCB can have several thousand 
pins [86], and low voltage swing high-speed signalling techniques are available 
[89, 38, 66, 68, 67, 12]. While some restrictions have been relaxed in implementing 
a high performance router, other issues have arisen. If large numbers of I/O pins are 
in use, PCB routing becomes very difficult. Optical interconnects can overcome this, 
as thousands of signals can be routed, either through a holographic plate, or in a fibre 
bundle. Technological advances allowing I/O bandwidth to increase greatly make the 
requirements on the router logic ever more stringent. Large numbers of I/O pins or 
optical devices leave less silicon area for logic cells. This makes the area efficiency of 
a router of greater importance. Power consumption can never be ignored in large ICs, 
as it affects the amount of logic that can fit on a single IC, and the cooling technology 
used. Cooling a chip adds cost to the finished product. This is one of the many reasons 
why power consumption should be kept to a minimum.
The Cellular Router architecture differs from other approaches. The only router
using optical interconnect to date (to the author’s knowledge) is the WARRP router
[95, 94]. There is little point in comparing it and the Cellular Router, since only 36
optical devices were used in the implemented WARRP router.
This chapter begins by explaining some IC design considerations. Then a comparison
is made between the ring of buffers used as the switching element in the
Cellular Router, and a crossbar switch, which is used in most other packet routers
[48, 74, 10, 63, 15, 36, 66, 95, 94, 19]. This comparison is highly relevant, since the
switching element is likely to be the single biggest component in a router (as in [63]).
The rate of growth of the switching elements is compared as the router data-path width
is increased. Following this comparison, the power consumption of the Cellular Router
is estimated, and is shown to be relatively low. The simulations in Chapter 7 model the
network at the register transfer level (RTL). The achievable clock cycle for the router is
not taken into account. This issue is discussed at the end of this chapter.
8.1 CMOS Integrated Circuits
For the analysis in this chapter, only static CMOS logic is considered, in which the
output node of any gate is always attached to the positive power supply (VDD) through
one or more PMOS transistors, or attached to ground (GND) through one or more
NMOS transistors. One exception to this would be tri-state gates.
8.1.1 Power Dissipation
There are three sources of power dissipation in static CMOS circuits.
Static dissipation: This relates to power dissipation due to currents drawn continuously
from the supply, which is mainly caused by leakage currents in the IC substrate.
Switching transient current: Due to the threshold of the NMOS and PMOS devices,
there is a brief period during output signal transitions when both devices are
conducting. This causes a transient current to flow from VDD to GND, causing
power dissipation.
Charging and discharging of load capacitances: This is the main source of power
dissipation. The output load of a gate can be thought of as a load capacitance
CL. This load may be caused by the parasitic capacitance of the gate itself, plus
those of any gates it is connected to, plus any capacitance in the wires connecting
the gates. The various sources of capacitive parasitics attached to a gate may be
lumped together and considered as a single load, denoted CL in figure 8.1.
It is important to note that the power dissipation depends on the sequence of
switching events. When the output of the gate is logic high, the load capacitance
is charged up. When the gate output switches to logic low, the charge on CL is
removed to ground. The power dissipation is dependent on the number of output
transitions going from high to low.
Figure 8.1: Charging and discharging load capacitance in a 2-input CMOS NAND gate.
Charging and discharging of load capacitances is the dominant source of power
dissipation. The threshold voltage of the transistors and the process parameters are
carefully chosen to minimise the other two sources of power dissipation.
The dynamic power consumption can be expressed as [49]:
$P_{dyn} = C_L V_{DD}^2 f_{1 \to 0}$    (8.1)
where $C_L$ is the load capacitance at the output of the logic gate, $V_{DD}$ is the power
supply voltage and $f_{1 \to 0}$ is the frequency with which transitions from logic one to logic
zero occur (note: $f_{1 \to 0} < 0.5 f_{clock}$).
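As a quick numerical illustration of equation 8.1, the following Python sketch evaluates the dynamic power of a single node. The function name and the example values (a 1 pF load, a 5 V supply, a 75 MHz clock and a node that switches on 40% of cycles) are assumptions chosen for illustration; they are not figures for any specific gate in the design.

```python
def dynamic_power(c_load_farads, vdd_volts, f_1_to_0_hz):
    """Equation 8.1: P_dyn = C_L * VDD^2 * f_(1->0), result in watts."""
    return c_load_farads * vdd_volts ** 2 * f_1_to_0_hz


# Hypothetical example values (not taken from the design kit).
f_clock = 75e6              # clock frequency in Hz
f_1_to_0 = 0.2 * f_clock    # node switches on 40% of cycles; half of those are 1->0
p = dynamic_power(1e-12, 5.0, f_1_to_0)
print(f"{p * 1e6:.1f} uW")

# Halving the supply voltage cuts the dynamic power to a quarter.
print(dynamic_power(1e-12, 2.5, f_1_to_0) / p)
```

The second print makes the quadratic dependence on the supply voltage explicit, which is the property exploited when the supply is lowered.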
The most obvious step in reducing power consumption is to lower the supply voltage,
since there is a quadratic dependence on this term in the equation. This is done at
the expense of reduced noise margins, and a reduction in the gate switching speed. In
some cases, architectural changes can be made to reduce $f_{1 \to 0}$. The most direct impact
that the digital designer can make on the power consumption is to try to reduce the
load capacitance. The load capacitance is made up from a number of sources: (1) the
parasitic capacitances of the gate itself (referred to as intrinsic load), (2) interconnect
capacitance, and (3) the input capacitance of the gates being driven. The designer is able
to influence capacitances associated with the driving gates and those that are driven,
by transistor sizing and careful cell layout. The wiring capacitance can be reduced by
minimising wire length, which may be due to architectural changes, or to optimisation
at the stage of silicon layout.
It is not only of interest to estimate power consumption and silicon area for the
Cellular Router, but also to try to estimate the maximum clock speed of any realistic
implementation. In order to determine this, the critical path in the design must be
identified, which is the path through a circuit which has the longest signal propagation
delay.
The area and power consumption calculations in this chapter are based on relatively
outdated silicon processes. The reason for this is that data on such processes is readily
available in textbooks ([49] [72] [76]). In any case, the estimates produced will be
pessimistic. As feature sizes are reduced, gate capacitances fall. The parasitics associated
with wiring also reduce, but at a slower rate than those for devices. This leads to the
situation where wiring begins to dominate in power consumption estimates and signal
propagation delay calculations.
8.2 Comparison Between Ring of Buffers and Crossbar-based Switching
The area and power consumption of the core switching element in a packet router for
direct networks with wide I/O channels is likely to be dominated by wiring [73]. While
developments in free-space optical interconnects will allow ever-larger routers to be
constructed, the implementation of such devices will be restricted by the core switching
element.
In both a crossbar switch and in the Cellular Router, the load capacitance is dominated
by wiring, since both involve long wires.
In the previous chapters where the implementation and simulation of the Cellular
Router were described, there were seven routing stages in total: four for the routing
directions, one for the processor injection/consumption channel, and two to support
virtual networks. One of the virtual network stages is used to receive packets from a
higher virtual network, and the other is used to send eligible packets to a lower virtual
network. It should be noted that the two virtual network stages can be combined to form
one routing stage, with an input buffer, output buffer and a ring buffer. This means that
for a Cellular Router with m I/O channels (including the processor link), there will be
m + 1 routing stages.
Figure 8.2: Crosspoint switch with m I/O channels, each w bits wide. (Labels: m × m crossbar, each link w bits wide; area proportional to $(mw)^2$.)
The ring of buffers in the Cellular Router is compared here with a crosspoint switch.
In each case, we consider a configuration with m input and m output channels. For a
router for 2-d networks, m = 5, since there is an I/O channel for every routing direction,
and one for the processor inject/consume channel. Each channel is w bits wide.
Consider the crosspoint switch shown in figure 8.2. The crossbar has m inputs and
m outputs (an m × m crossbar), each of which is w bits wide. The area of the crosspoint
switch has two components, the area of the switches and the area of the wiring. The
wiring area can be calculated as follows: the basic component is a wiring crosspoint,
as shown in figure 8.3.
It can be seen that for an m × m crossbar with w-bit wide channels, there are $m^2 w$
crosspoint switches. In order to estimate the total wiring cost of the crossbar, consider
a wire crosspoint like that shown in figure 8.3. The vertical and horizontal wires cross.
The wire width is 1.2 µm [3], with a minimum spacing of 1.2 µm between
wires on the same routing layer. This leads to a crosspoint cell size of 2.4 µm × 2.4 µm. It
can be seen that the entire crossbar wiring structure consists of a grid made of such wire
crosspoints. The total number of these wire crosspoints required is $(mw)^2$. The total
area required to implement such a crossbar switch, for various values of w, is shown in
table 8.1. In order to simplify the analysis, the control wires for the crosspoint switches
are omitted, and the total area of the switches and wires are simply added together (the
placement of the switches will lengthen some of the wires).

Figure 8.3: Basic wire crosspoint for 0.7 µm process [3]. (Labels: metal-1 and metal-2 wires, 1.2 µm wire width, 2.4 µm crosspoint pitch.)

m   w     A. Number of    B. Area of        C. Wire          D. Wire area    E. Total area
          switches        switches          crosspoints      (sq. µm)        (B+D)
          (m^2 w)         (sq. µm)          ((mw)^2)                         (sq. µm)
5   64    1600            160000            102400           589824          749824
5   128   3200            320000            409600           2359296         2679296
5   256   6400            640000            1638400          9437184         10077184
5   384   9600            960000            3686400          21233664        22193664
5   512   12800           1280000           6553600          37748736        39028736

Column B assumes 100 sq. µm per switch; column D assumes 2.4 µm × 2.4 µm per wire crosspoint.
Table 8.1: Area cost of an m × m crosspoint switch with w-bit wide channels.
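The arithmetic behind table 8.1 can be reproduced with a few lines of Python. This is only a sketch of the simplified model described above (switch area plus wire crosspoint area, control wires omitted); the function and constant names are illustrative and do not come from the design kit.

```python
WIRE_PITCH_UM = 2.4       # 1.2 um wire width + 1.2 um spacing
SWITCH_AREA_UM2 = 100     # area assumed per crosspoint switch


def crossbar_area(m, w):
    """Return (switch_area, wire_area, total_area) in square micrometres."""
    n_switches = m * m * w               # column A of table 8.1
    n_wire_crosspoints = (m * w) ** 2    # column C of table 8.1
    switch_area = n_switches * SWITCH_AREA_UM2
    wire_area = n_wire_crosspoints * WIRE_PITCH_UM ** 2
    return switch_area, wire_area, switch_area + wire_area


for w in (64, 128, 256, 384, 512):
    switch_area, wire_area, total = crossbar_area(5, w)
    print(w, round(switch_area), round(wire_area), round(total))
```

Because the wire term contains $(mw)^2$, doubling w roughly quadruples the wiring area, which is why the wiring dominates the total for wide channels.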
The Cellular Router uses a ring of buffers to perform switching. This requires fewer
wires than the crossbar, but each wire is longer. For a ring of buffers with m inputs and
m outputs, each of which is w bits wide, there will be (m + 1)w data wires in total
(the +1 term is due to the virtual network routing stage). The length of the wires is
equal to the width of two routing stages. The layout of the router should be such that
the routing stages have a greater height than width. The routing stages are likely to be
around 500 µm in width, so the wires in the routing ring would each be 1000 µm long.
The switches in the Cellular Router are more complex than those in a crossbar, since
they also serve to buffer the data. For the sake of the analysis, the buffers are assumed
to be 30 µm × 30 µm in size. Table 8.2 shows the estimated area of the ring of buffers
for various data-path widths.
Figure 8.4: Area requirements compared. (Plot: area cost of crosspoint switch compared with ring of buffers, number of I/O channels m = 5; x-axis: data-path width (w); y-axis: area required (sq. µm).)
m   w     A. Number of    B. Area of        C. Number        D. Wire area    E. Total area
          switches        switches          of wires         (sq. µm)        (B+D)
          ((m+1)w)        (sq. µm)                                           (sq. µm)
5   64    384             345600            384              921600          1267200
5   128   768             691200            768              1843200         2534400
5   256   1536            1382400           1536             3686400         5068800
5   384   2304            2073600           2304             5529600         7603200
5   512   3072            2764800           3072             7372800         10137600

Column B assumes 900 sq. µm per switch; column D assumes 1000 µm × 2.4 µm per wire.
Table 8.2: Area cost of a ring of buffers, with m w-bit wide I/O channels.
It can be seen from figure 8.4 that the area requirements of the Cellular Router grow
linearly with the data-path width, w. This is in contrast to the crossbar switch, where
the area requirements rise quadratically with w.
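The two growth rates can be checked side by side with a short Python sketch of the area models used in tables 8.1 and 8.2. The function names and default parameters are illustrative; the numbers (100 sq. µm crosspoint switches, 900 sq. µm buffers, 1000 µm ring wires at a 2.4 µm pitch) are the assumptions stated in the text.

```python
def crossbar_total_area(m, w, switch_um2=100, pitch_um=2.4):
    """Crosspoint switch: m^2*w switches plus (m*w)^2 wire crosspoints."""
    return m * m * w * switch_um2 + (m * w) ** 2 * pitch_um ** 2


def ring_total_area(m, w, buffer_um2=900, wire_len_um=1000, pitch_um=2.4):
    """Ring of buffers: (m+1)*w buffer cells plus (m+1)*w ring wires."""
    n = (m + 1) * w
    return n * buffer_um2 + n * wire_len_um * pitch_um


for w in (64, 128, 256, 384, 512):
    print(w, round(crossbar_total_area(5, w)), round(ring_total_area(5, w)))
```

In this model the ring area is a fixed cost per bit of data-path width, so doubling w doubles the area, whereas the crossbar wiring term grows with the square of w.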
It can be seen from figure 8.4 that the switching element in the Cellular Router
requires considerably less silicon area than a crosspoint switch when the data path width
is large (more than 150 bits). In both cases, the area cost is dominated by wiring.
Although silicon feature sizes are falling steadily, the effects on wiring and gates differ.
Interconnect capacitance is not decreasing at the same rate as gate capacitance, so
device power and delay characteristics are increasingly dominated by the interconnect
[72]. The predominance of wiring costs will only grow in future IC designs, and there
are many problems in scaling wires that, as yet, have no known solutions [87].
8.3 Cellular Router Implementation Cost
It is important to know the likely silicon area requirements and power consumption of
the Cellular Router. The cost of manufacture is proportional to the silicon area. The
power consumption will determine what kind of cooling is required for the device. This
is significant, as a heat-sink and fan can cost as much as the device itself. Conversely,
given a particular power budget, the maximum data-path width and clock rate can be
determined.
8.3.1 Router Synthesis
The packet size in the Cellular Router depends on the end application. For the
implementation, it was assumed that the packet size was 128 bits. The actual packet size
will depend on the number of I/O pads and the ratio of the I/O data rate to the on-chip
clock rate. Commercial packet routers, such as the SGI SPIDER [38] and the Cray T3E
[89] router, use I/O links clocked at several times the chip core rate. For
the Cellular Router to achieve reasonable latency, the entire packet must be transmitted
within one core clock cycle. This is dependent on the flow control (FC) unit, which is
not considered in this thesis, as it is a complex design issue on its own. The design of
the FC unit will be simpler for the Cellular Router than for a wormhole router, as
transmitted data always has a buffer to go to. In a wormhole router with high-speed links,
a signal must be sent back to the sender to stop data being transmitted [7]. During this
time several flits may be in flight. The sender has to be able to rewind by several flits
and this complicates the design of the flow controller [14].
The implementation of the router was limited by the tools that were available. The
author did not have access to design kits containing flip-chip style I/O pads. The core
logic of the router was implemented however. Each part of the design was synthesised
from a Verilog register-transfer-level (RTL) description. The synthesis tool produced
schematics using cells from a proprietary 0.7 µm standard cell library [3]. The clock
frequency was set at a conservative value of 75 MHz, corresponding to a 13.3 ns clock
period. The router was constructed for 2 dimensional mesh and torus networks with a
maximum radix, d, of 16. The design was synthesised and then simulated, including the
processor injection and consumption channels. Figure 8.5 shows the circuit schematic
of a single routing stage.
The partitioning of the router design was fairly simple. It consisted of the routing
stage, repeated once for each routing direction. The only differences between stages
were in the inputs to the address comparators, and the dimension bit comparators. The
parts of the Verilog code used to synthesise these circuits used parameter statements.
These parameters could be overridden in the top-level code to make the alterations
needed for each routing stage. Each register in each stage has a separate controller. The
controllers communicate by handshaking. This is illustrated in figure 8.6.
The ring controller uses negative edge triggered devices, while actions are triggered
by the positive clock edge in all other controllers. This is so that the ring buffer can
be examined and written to or read from in the same clock cycle. The design could
have been improved if multi-phase clocks were permitted, but the synthesis tool was
restricted to using one single phase clock.
An eager handshake protocol is used between the OUT buffer in one routing node
and the IN buffer in the next. At initialisation, the receiver ACK signal is raised to
indicate that the receiving buffer is unoccupied. As soon as the sender has data to
transmit, it can issue a request and send the data. The receiver lowers the ACK signal to
acknowledge receipt of the data. The ACK signal is not raised again until the buffer
in the receiver has emptied. This handshaking protocol ensures that latency is
minimised. When a packet arrives in any buffer, it is propagated immediately. The action
of returning the buffer controller to its original state takes place after the data has been
propagated onward.
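The behaviour of this handshake can be sketched as a small software model. The Python below is an illustrative abstraction only: the class and method names, and the rule that the receiver drains every second step, are assumptions made for the example and are not taken from the Verilog design.

```python
class Receiver:
    """Model of an IN buffer: ACK high means the buffer is free."""

    def __init__(self):
        self.ack = True       # raised at initialisation: buffer unoccupied
        self.buffer = None

    def accept(self, packet):
        assert self.ack, "sender must not transmit while ACK is low"
        self.buffer = packet
        self.ack = False      # lowering ACK acknowledges receipt

    def drain(self):
        """Pass the packet onward, then restore the controller state."""
        packet, self.buffer = self.buffer, None
        self.ack = True       # ACK is raised again only once the buffer is empty
        return packet


class Sender:
    """Model of an OUT buffer: transmits as soon as data and ACK are present."""

    def __init__(self, packets):
        self.queue = list(packets)

    def try_send(self, receiver):
        if self.queue and receiver.ack:
            receiver.accept(self.queue.pop(0))   # request and data go together
            return True
        return False


def simulate(packets, drain_every=2):
    """Step the sender/receiver pair until every packet has been delivered."""
    sender, receiver, delivered = Sender(packets), Receiver(), []
    step = 0
    while sender.queue or receiver.buffer is not None:
        sender.try_send(receiver)
        step += 1
        if step % drain_every == 0 and receiver.buffer is not None:
            delivered.append(receiver.drain())
    return delivered


print(simulate(["p0", "p1", "p2"]))
```

The point of the model is that the sender never waits for a round trip before transmitting: the only condition is that ACK is already high, which is what keeps the latency low.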
Figure 8.5: Schematic of one routing stage. (Label: handshake signals, req/ack pair.)
Each of the buffer controllers was implemented as a Moore-type state machine,
where the outputs are a function of the current state only. An example is shown in
figure 8.6.
Figure 8.6: Controller for the output buffer (OUT), together with the associated state
diagram. (Signals shown: reqin0, ackout0, reqin1, ackout1, reqout, ackin.)
The controller shown produces two control signals. The output labelled latch_ip
selects which of two inputs will be latched. The other control output, latch_ctl, opens
and closes the latches to store the data in the OUT buffer.
Area (10^6 sq. µm)            2 VNs                4 VNs                8 VNs
Pads                          22.68                22.68                22.68
Standard cells and wiring     12                   24                   48
Total area                    34.68 mm^2           46.68 mm^2           70.68 mm^2
Chip dimensions               5.89 mm × 5.89 mm    6.83 mm × 6.83 mm    8.41 mm × 8.41 mm
Table 8.3: Growth in silicon area when adding virtual networks
8.3.2 Limits of the Design Kit
The Cadence Synergy synthesis tool was used for the router implementation. The only
target design kit that was available for this was a relatively old 0.7 µm kit, supporting
two layers of metal. The router design could have been faster and consumed less area
if a 0.25 µm process (commonly in use at the time of writing) with three metal layers
had been used, for example. The design used a single phase clock. Again, the design
could have been faster if multi-phase clocks had been supported. Despite these
limitations, the pin-to-pin latency for in-dimension routes was comparable with the fastest
contemporary packet routers, such as the SGI SPIDER ([38]).
8.3.3 Estimated Area of the Router
The synthesised Cellular Router design could not be converted to silicon layout, since
a flip-chip process flow was unavailable. The existing tools would try to route all I/O
wires to the periphery of the chip, causing an explosion of area, and producing a
completely unrepresentative layout.
An estimate of the silicon area can be made, based on the area of all of the standard
cells used in one virtual network and the size of the optical devices described in [51]. A
cell containing some control logic, an optical transmitter and an optical receiver occupies
180 µm × 180 µm in area. For a router with five 128-bit wide channels, 640 data inputs
and outputs are required. This figure is rounded up to 700 to include control signals.
The total area required by the optical devices would be 22.68 × 10^6 µm^2.
The estimated area of standard cells comes from the initial area allocation made
by the Cadence Place and Route software. This initial estimate is based on the typical
wiring area overhead for a given number of standard cells, for the 0.7 µm design kit
used.
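The figures in table 8.3 follow from simple arithmetic, which the Python sketch below reproduces. The per-virtual-network logic area of 6 × 10^6 sq. µm is an inference from the table (12 × 10^6 sq. µm for two virtual networks), and the square-die assumption and all names in the code are illustrative.

```python
import math

OPTICAL_CELL_SIDE_UM = 180      # transmitter + receiver + control logic cell
N_OPTICAL_CELLS = 700           # 640 data signals, rounded up for control signals
LOGIC_AREA_PER_VN_UM2 = 6e6     # inferred from table 8.3 (12e6 sq. um for 2 VNs)


def chip_estimate(n_vns):
    """Return (total area in mm^2, side of a square die in mm)."""
    pad_area_um2 = N_OPTICAL_CELLS * OPTICAL_CELL_SIDE_UM ** 2
    total_um2 = pad_area_um2 + n_vns * LOGIC_AREA_PER_VN_UM2
    total_mm2 = total_um2 / 1e6
    return total_mm2, math.sqrt(total_mm2)


for n_vns in (2, 4, 8):
    area_mm2, side_mm = chip_estimate(n_vns)
    print(f"{n_vns} VNs: {area_mm2:.2f} mm^2, {side_mm:.2f} mm per side")
```

The pad area is independent of the number of virtual networks in this model, so each additional virtual network adds only its logic area to the die.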
Table 8.3 shows that a large number of virtual networks can be accommodated
without the IC becoming unmanageably large. These estimates are for the two
dimensional torus network topology. Any implementation of the router is likely to use three or
four virtual networks. In this case the optical I/O devices and the router logic occupy
roughly equal amounts of area. Electronic interconnect could be considered for the
router, if the data pins were operated at a clock rate that is some multiple of the core
clock rate. This would reduce the number of I/O pads, so that routing at the PCB level
is more manageable.
Flip-chip packaging is less mature than wire bonding based methods. The area
occupied by pads is likely to fall as the technology matures [86]. This will reduce the
cost of implementing the Cellular Router, or allow more logic to be used.
8.4 Power Consumption
A relatively old 0.7 µm design kit was used for the router synthesis. The estimated
power consumption per virtual network is calculated, followed by some discussion of
the effects of moving to a newer fabrication process.
As described in section 8.1, the main source of power dissipation is the charging and
discharging of parasitic capacitance in the router. A single virtual network of the router
was synthesised. This contains 4598 standard cells. The power consumption of each
virtual network can be estimated using information in the design kit data-book [3]. The
power consumption for each cell can be calculated as follows. A power consumption
figure is given for each cell, measured in µW/MHz. The
total for all of the cells will be referred to as $P_{cells}$. The frequency used to calculate
this is not that of the clock, but the actual output frequency of the cell. For the analysis
here, each cell is assumed to have a 40% probability of switching on any clock cycle,
hence the rate of 1 → 0 transitions is $0.2 \times f_{clock}$. The other component of the
power dissipation is due to the capacitive load of the gate(s) that each cell is driving.
This is calculated using equation 8.1, again assuming that each input is toggled 40% of
the time. The total capacitive load of all data-driven cells is given by $C_{data}$. All inputs
that are driven by the clock are added separately, as these are toggled on every cycle,
and are represented by $C_{clocked}$.
Capacitive loads due to wiring are negligible in most cases, since the connections
are fairly local. There are two exceptions: the wires used as control signals to each
of the packet-wide registers, and those used to connect ring buffers together. Wiring
capacitance is pessimistically assumed to be 0.03 fF/µm^2 (taken from a 1 µm process
described in [49]). The wire lengths depend on the detailed layout of the router, but a
fair estimate is that each control wire is 5000 µm long (this is a pessimistic assumption)
and the interconnection wires between rings are 1000 µm in length. This gives the total
estimated power consumption (per VN, at clock rate $f_c$) to be:
Power dissipation due to intrinsic capacitance of cells:
$P_{intrinsic} = 0.2 \times \frac{f_c}{10^6} \times P_{cells}$    (8.2)
$P_{cells} = 35112.1$ µW/MHz    (8.3)
Power dissipated driving gates:
$P_{gate\text{-}load} = f_c \times (0.2\,C_{data} + C_{clocked}) \times V_{DD}^2$    (8.4)
$C_{data} = 402.742$ pF, $C_{clocked} = 21.984$ pF    (8.5)
Power dissipated driving wires:
$P_{wire\text{-}load} = 0.2 f_c \times C_{wiring} \times V_{DD}^2$    (8.6)
$C_{wiring} = 9.24$ pF    (8.7)
Total power consumption:
$P_{total} = P_{intrinsic} + P_{gate\text{-}load} + P_{wire\text{-}load}$    (8.8)
$P_{total} = f_c (7.02 \times 10^{-9}) + f_c (2.56 \times 10^{-9}) + f_c (4.62 \times 10^{-11}) = (9.62 \times 10^{-9}) f_c$ W    (8.9)
The total power dissipation is less than 1 W/VN at 100 MHz. For a Cellular Router
with 4 VNs that is clocked at 75 MHz the power dissipation is less than 3 W.
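The estimate in equations 8.2 to 8.9 can be re-evaluated numerically. The Python sketch below uses the capacitance and data-book totals quoted above with a 5 V supply; the unit conversion for the intrinsic term (µW/MHz to W/Hz) and all identifier names are assumptions made for this illustration.

```python
VDD = 5.0                        # supply voltage in volts
P_CELLS_UW_PER_MHZ = 35112.1     # summed data-book figure for all cells
C_DATA = 402.742e-12             # data-driven gate load in farads
C_CLOCKED = 21.984e-12           # clock-driven gate load in farads
C_WIRING = 9.24e-12              # long-wire load in farads
ACTIVITY = 0.2                   # 1->0 transitions per clock cycle for data nodes


def power_per_vn(f_clock_hz):
    """Estimated power in watts for one virtual network at the given clock."""
    # Intrinsic term: convert uW/MHz to W/Hz via two factors of 1e-6.
    p_intrinsic = ACTIVITY * (f_clock_hz / 1e6) * P_CELLS_UW_PER_MHZ * 1e-6
    # Clocked inputs toggle every cycle, data inputs at the activity factor.
    p_gate_load = f_clock_hz * (ACTIVITY * C_DATA + C_CLOCKED) * VDD ** 2
    p_wire_load = ACTIVITY * f_clock_hz * C_WIRING * VDD ** 2
    return p_intrinsic + p_gate_load + p_wire_load


print(f"{power_per_vn(100e6):.2f} W per VN at 100 MHz")
print(f"{4 * power_per_vn(75e6):.2f} W for 4 VNs at 75 MHz")
```

The intrinsic cell term contributes roughly three quarters of the total in this model, with the long wires contributing well under one percent.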
The analysis is somewhat pessimistic for a number of reasons:
1. The supply voltage is set at 5 V. The power dissipation is proportional to $V_{DD}^2$,
so simply reducing $V_{DD}$ to 3.3 V will reduce the power dissipation to 0.44 of the
calculated value. Some processes currently allow 1.8 V operation, which would
reduce power consumption to 0.13 of the dissipation when 5 V supplies are used.
2. A 0.7 µm process was used. The power dissipation is mostly due to parasitic
capacitance effects, and these would be much lower with a 0.25 µm process (which
is in common use at the time of writing).
3. A standard cell design flow has been used. Standard cells are designed for general
use. The fan-out of the library gates is typically high, to suit a number of cases.
Customised cells would reduce parasitics considerably. Without a great deal of
effort, it would be possible to design full-custom cells for the Cellular Router,
since a small number of repeated cells could be used.
4. The synthesis tool used performed only a simple global optimisation to meet
the target clock rate. Using a more sophisticated synthesis tool and performing
some optimisations for low power could produce a substantial reduction in power
consumption.
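The supply-voltage factors quoted in the first item follow directly from the quadratic dependence in equation 8.1. A minimal Python check, assuming that only the supply voltage changes and that capacitance and switching frequency stay fixed (the function name is illustrative):

```python
def supply_scaling(v_new, v_ref=5.0):
    """Ratio of dynamic power at v_new to that at v_ref, with P proportional to VDD^2."""
    return (v_new / v_ref) ** 2


for v in (3.3, 1.8):
    print(f"{v} V: {supply_scaling(v):.2f} of the 5 V figure")
```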
The purpose of the previous analysis is to estimate the upper bound on power
consumption. It has been shown that the power consumption of the device is relatively
low, and that the architecture can be used to build packet routers with large data-paths.
Whichever cooling technology is applied to the router chip, the majority of the power
budget will be used up by the optical interconnect - as was the intention from the outset
of designing the router.
The other aim was linear scalability. With advances in optical device technologies,
the router performance can be increased to match them. The area and power
consumption of the router grow slightly less than linearly with any increase in the packet size
(which equals the data-path width).
8.5 Achievable Clock Rate
In the router configuration used in Chapter 6, the optical devices are assumed to be
clocked at a fairly modest data rate, matching that of the router core. The optical
device can easily operate at much higher frequencies at the cost of increased power
consumption. The router architecture has a number of features that should allow the
clock rate to be set close to the limits of any fabrication process.
The implementation work has shown that the architecture is feasible and does
indeed produce low latency routing. The latency can be improved further however. When
packets are routed in-dimension the only operation that needs to be carried out is to
compare the destination node address in the current dimension with that of the current
node. The registers on the critical path are all transparent when waiting for data.
Careful design of the controller circuitry would allow the pin-to-pin latency to be limited
only by the time taken for address comparison plus the physical propagation delay of
the data from the input to the output register, followed by a small delay due to the
control signals.
Chapter 9 
Conclusions
The research presented in this thesis has addressed the problem of utilising high
performance interconnects in packet routers for multi-processor systems. This work began
with an examination of asynchronous building blocks for constructing packet routers.
This work provided many insights into the hardware cost and timing issues in creating
a packet router. This experience proved useful later on, when the Cellular Router was
being developed.
The use of free-space optical interconnects implies throughput far greater than is
possible using electronic means [73]. The router logic must be able to carry the data,
without a significant increase in latency. Developments in optical interconnects will
allow achievable throughput to continually increase. For this reason a scalable, extensible
router architecture is desirable. Such a router must have low area, use on-chip wiring
efficiently and have low power consumption. In addition, deadlock must be handled,
either by prevention or recovery. Adaptive routing and fault tolerance are also
desirable features. The Cellular Router architecture was created in response to the above
concerns.
Throughout the design of the router, the hardware overheads were considered. It
had previously been shown that the benefits of implementing one routing algorithm in
preference to another may be outweighed by the hardware overheads involved [14, 15].
The Cellular Router allows an oblivious router to be constructed with similar latency
to the fastest routers available today [38, 89]. Additionally, the Cellular Router (in a
2-d torus network) implements full minimal adaptive routing, and allows some
non-minimal routes to be used, without any cost in terms of reduced latency or clock rate.
Full minimal adaptive routing was shown to significantly improve the throughput in
congested networks, when compared with partial adaptive routing. The use of
non-minimal routing allows some degree of fault tolerance. Both of these features come at
little or no cost in terms of implementation speed and hardware cost.
Area requirements, and particularly the cost of wiring, have been identified as a
major problem in routers using (free space) optical interconnects. In a recent study on
the WARRP II Router [73], a number of difficulties in applying free-space optical
interconnects were identified. The irregularity of the I/O connection patterns generated by
CAD tools does not match well with the regular array of optical devices (SEEDs in this
case). This problem diminishes the achievable transistor density considerably. These
problems are overcome in the Cellular Router. The regular structure of the routing cells
makes efficient use of the available wiring layers. The required I/O connection pattern
matches that of the optical devices - a regular array.
The router has some useful scaling properties: the silicon area of the router logic
scales linearly with throughput (assuming the clock rate is fixed). This contrasts with
routers which use crossbars or multiplexers for switching, in which the switching
element grows quadratically with throughput. The low area requirements of the Cellular
Router permit massive throughput. The limiting factor will undoubtedly be power
dissipation. Again, this scales roughly linearly with throughput, and in Chapter 8, the
power consumption was shown to be low. Both the power consumption calculations
and the gate-level hardware implementation assumed a 0.7 µm fabrication process and
5 V operation. At the time of writing, 0.25 µm and 0.18 µm feature sizes are commonly
in use, with a supply voltage of 3.3 V or 1.8 V [87]. Using smaller feature sizes and
a lower supply voltage will significantly reduce the power consumption of the router.
Conversely, much larger routers could be built within the same power budget.
The router control logic consists mostly of simple state machines. The complexity
of routing decisions is kept low, and the router does not require a central arbiter to
control switching. Since routing control decisions are distributed so that each routing stage
makes only local decisions, the logic complexity in the routing stages does not grow if
the node degree is increased. The bulk of the router logic consists of storage elements.
There are a small number of cells which are replicated many times. Optimising the
design of these cells would allow reductions in power dissipation and area, and possibly
increased clock rates. The time spent optimising the cells would produce significant
improvements.
The initial design of the Cellular Router did not take fault tolerance into account.
However, a simple mechanism was found in the implementation for low dimensional
torus networks. Up to half of the network links may fail without disrupting the delivery
of packets. It is only when links in two opposing torus cycles fail that the router is no
longer able to deliver some packets. The hardware overhead involved in implementing
this feature is negligible, and the additional logic does not lengthen any of the critical
paths in the router.
The Cellular Router has been designed for multiprocessor systems. The question 
o f how much throughput is required from the router has not yet been addressed. This 
depends on the system features, such as the number and type o f  processors and the
109
memory bandwidth. If the router throughput is greater than that required, multiple
processors and memories can be connected to each router, reducing the network diameter
for a fixed number of processors. The wide channels of the Cellular Router are suitable
for carrying a whole cache line, or some fraction of a cache line. The router uses a fixed
packet size. Large transmissions can easily be broken down into a number of packets,
with some cost in terms of reduced payload. This is partially offset by the reduction
in network contention which results from parts of the same packet following different
routes through the network, and by a reduction in contention compared with networks
using wormhole routing [4]. If the ordering of the data reception is important, there will
be some overhead in re-ordering the packets for a large block of data, and this must be
taken into account.
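The splitting and re-ordering described above can be sketched as follows. This is a minimal illustration and not the Cellular Router's packet format: the payload size, the sequence-number field and the function names `packetise` and `reassemble` are all assumptions made for the example.

```python
import random

PAYLOAD_BYTES = 32  # hypothetical fixed payload size per packet


def packetise(block: bytes, payload_bytes: int = PAYLOAD_BYTES):
    """Split a block into fixed-size packets, each tagged with a sequence number."""
    packets = []
    for seq, start in enumerate(range(0, len(block), payload_bytes)):
        chunk = block[start:start + payload_bytes]
        # Pad the final packet so that every packet has the same size.
        packets.append((seq, chunk.ljust(payload_bytes, b"\x00")))
    return packets


def reassemble(packets, length: int) -> bytes:
    """Restore the original block from packets received in any order."""
    ordered = sorted(packets, key=lambda p: p[0])
    # Drop the padding that was added to the final packet.
    return b"".join(chunk for _, chunk in ordered)[:length]


block = bytes(range(100))
packets = packetise(block)
random.shuffle(packets)  # adaptive routing may deliver packets out of order
restored = reassemble(packets, len(block))
```

The sequence number carried in each packet is the "reduced payload" cost, and the sort at the receiver stands in for the re-ordering overhead. A real receiver would more likely write each packet directly to an offset derived from its sequence number.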
For a Cellular Router with fixed throughput, the number of channels can be
increased instead of increasing the width of the channels. This allows the construction of
networks with more dimensions, or with additional links.
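The trade-off between channel width and dimensionality can be illustrated with a short calculation. It is a sketch under stated assumptions: the network is a k-ary n-cube torus with bidirectional links, "fixed throughput" is modelled as a fixed per-node wire budget shared equally by 2n channels, and the budget of 768 wires is a hypothetical figure chosen only because it divides evenly.

```python
def torus_diameter(k: int, n: int) -> int:
    """Diameter in hops of a k-ary n-cube torus with bidirectional links."""
    return n * (k // 2)


def channel_width(total_wires: int, n: int) -> int:
    """Width of each channel when a fixed wire budget is shared by 2n channels."""
    return total_wires // (2 * n)


WIRES = 768  # hypothetical per-node wire budget

# Three ways of arranging 4096 nodes: 64x64, 16x16x16 and 8x8x8x8.
results = []
for k, n in [(64, 2), (16, 3), (8, 4)]:
    results.append((n, k ** n, torus_diameter(k, n), channel_width(WIRES, n)))

for n, nodes, diameter, width in results:
    print(f"{n}-D: {nodes} nodes, diameter {diameter} hops, {width} wires per channel")
```

Under these assumptions the 2-D, 3-D and 4-D networks have diameters of 64, 24 and 16 hops, with channels of 192, 128 and 96 wires respectively. Narrower channels lengthen the time taken to transfer each packet, so the lower hop count has to be weighed against that cost.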
One drawback of the Cellular Router is the large number of buffers used,
particularly if several virtual networks are used in order to implement minimal adaptive
routing. It should be possible to devise a more efficient deadlock handling scheme which
requires fewer buffers and/or fewer virtual networks. Virtual networks have been shown
in Chapter 7 to improve throughput, so the costs and benefits should be examined
carefully.
The Cellular Router has been shown to be capable of massive throughput and low
latency. The next logical step is to apply the router in a practical system. The design
of the flow control unit, which includes the drivers for the optical devices, has not been
considered in this research. There are some technical issues and decisions to be made in
implementing the optical part. Further simulations using real application traffic patterns
may be useful in order to fine-tune the router parameters. In this research, only
low-dimensional torus networks have been considered. The Cellular Router architecture
could be applied to other network topologies. This may require many modifications to
the router, and is a potential area for future research.
Bibliography
[1] Cogency technology inc. Website, http://www.cogency.com/.
[2] Theseus logic inc. Website, http://www.theseus.com/.
[3] European Silicon Structure ECPD07 Library Databook, 1996.
[4] A.Agarwal. Limits on interconnection network performance. IEEE Trans. on
Parallel and Dist. Systems, 2(4):398-412, October 1991.
[5] A.Bolychevsky, C. Jesshope, and V.Muchnik. Dynamic scheduling in RISC
architectures. IEE Proc.-Comput. Digit. Tech., 143(5):309-317, September 1996.
[6] A.S.Tanenbaum. Modern Operating Systems, chapter 6. Prentice Hall, 1992.
[7] A.S.Tanenbaum. Computer Networks. Prentice Hall, 1996.
[8] A.W.Roscoe and N.Dathi. Pursuit of deadlock freedom. Information and
Computation, 75(3):289-327, December 1987.
[9] B.Dao, J.Duato, and S.Yalamanchili. Dynamically configurable message flow
control for fault-tolerant routing. IEEE Transactions on Parallel and Distributed
Systems, 10(1):7-22, January 1999.
[10] K. Bolding, S. Cheung, S. Choi, C. Eberling, S. Hassoun, T. Ngo, and R. Wille.
The Chaos router: Design and implementation of an adaptive router. In
Proceedings of VLSI ’93, IFIP, pages 311-320, 1993.
[11] C.Glass and L.Ni. The turn model for adaptive routing. Journal of the ACM,
41(5):875-902, September 1994.
[12] K.K.Y. Chang, W.Ellersick, S. Chuang, S.Sidiropoulos, M. Horowitz, and
N. McKeown. An asymmetric serial link architecture for high-bandwidth packet
switches. In Proceedings of Hot Interconnects V, 1996.
[13] A. Charlesworth. Starfire: Extending the SMP envelope. IEEE Micro, pages 
39-49, January/February 1998.
[14] A. Chien. A cost and speed model for k-ary n-cube wormhole routers. IEEE
Parallel and Distributed Systems, 9(2):150-162, February 1998.
[15] A. Chien and J.H.Kim. Planar-adaptive routing: Low-cost adaptive networks for 
multiprocessors. Proc. International Symposium on Computer Architecture, pages 
268-277, May 1992.
[16] E.G. Coffman, M.J. Elphick, and A. Shoshani. System deadlocks. Computing 
Surveys, 3:67-78, June 1971.
[17] C.Seitz. The cosmic cube. Communications of the ACM, 28(1):22-33, January
1985.
[18] C.Seitz, N.Boden, J.Seizovic, and W.Su. The design of the Caltech
Mosaic C multicomputer. Research on Integrated Systems:
Proceedings of the 1993 Symposium, pages 1-22, 1993. Obtained from:
http://www.myri.com/research/publications/index.html.
[19] C.Seitz and W.Su. A family of routing and communications chips based on the
mosaic. Proceedings of the Washington Symposium on Integrated Systems, 1993.
[20] W. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE
Transactions on Computers, pages 775-785, 1990.
[21] W. Dally. Express cubes: Improving the performance of k-ary n-cube
interconnection networks. IEEE Transactions on Computers, 40(9):1016-1023, 1991.
[22] W. Dally. Virtual-channel flow control. IEEE Transactions on Parallel and Dis­
tributed Systems, pages 194-205, 1992.
[23] W. Dally and H.Aoki. Deadlock-free adaptive routing in multicomputer networks 
using virtual channels. IEEE Transactions on Parallel and Distributed Systems, 
4(4):466-475, April 1993.
[24] W. Dally and C. Seitz. Deadlock-free message routing in multi-processor
interconnection networks. IEEE Transactions on Computers, (36):547-553, 1987.
[25] A.Agarwal et al. The MIT Alewife Machine: Architecture and performance. 
ISCA, 1995.
[26] A.Louri et al. Feasibility study of a scalable optical interconnection network for
massively parallel processing systems. Applied Optics, 35(8):1296-1308, March
1996.
[27] C.P.Barrett et al. Components for the implementation of free-space optical
crossbars. Applied Optics, 35(35):6934-6944, December 1996.
[28] D.Lenoski et al. The Stanford DASH multiprocessor. IEEE Computer, pages 
63-79, March 1992.
[29] D.S.Wills et al. A three-dimensional high-throughput architecture using
through-wafer optical interconnect. Journal of Lightwave Technology, 13(6):1085-1091,
June 1995.
[30] J.Fan et al. Design considerations and algorithms for partitioning optoelectronic 
multichip modules. Applied Optics, 34(17):3116-3127, June 1995.
[31] J.Kubiatowicz et al. The Alewife CMMU: Addressing the multiprocessor
communications gap. Proceedings of Hot Chips ’94, August 1994.
[32] J.Kuskin et al. The Stanford FLASH multiprocessor. Proceedings of the 21st
International Symposium on Computer Architecture, pages 302-313, April 1994.
[33] T.Woodward et al. Demultiplexing 2.48-gb/s optical signals with a cmos receiver
array based on clocked-sense-amplifiers. IEEE Photonics Technology Letters,
9(8):1146-1148, August 1997.
[34] W. Dally et al. Architecture of a message driven processor. Proc. Int. Symp.
Computer Arch., pages 189-205, June 1987.
[35] W. Dally et al. Architecture and implementation of the reliable router.
Proceedings of Hot Interconnects II, August 1994.
[36] W. Dally et al. The reliable router: A reliable high-performance communications
substrate for parallel computers. Proceedings of the Workshop on Parallel
Computer Routing and Communication, pages 241-255, May 1994.
[37] W.Bowhill et al. Circuit implementation of a 300-MHz 64-bit second-generation
CMOS Alpha CPU. Digital Technical Journal, pages 100-118, 1995.
[38] M. Galles. Spider: A high-speed network interconnect. IEEE Micro, pages 34-39,
1997.
[39] G.Birtwhistle and A. Davis (eds). Asynchronous digital circuit design.
Springer-Verlag, 1995.
[40] D. Gelernter. A DAG-based algorithm for prevention of store-and-forward
deadlock in packet networks. IEEE Transactions on Computers, C-30(10):709-715,
October 1981.
[41] H.Amin, K.M.Curtis, and B.R.Hayesgill. Efficient two-dimensional systolic
array architecture for multilayer neural network. Electronics Letters,
33(24):2055-2056, 1997.
[42] H.M.Ozaktas. Toward an optimal foundation architecture for optoelectronic 
computing. Part I. regularly interconnected device planes. Applied Optics, 
36(23):5682-5696, August 1997.
[43] H.Sullivan and T.R.Bashkow. A large scale, homogeneous, fully distributed par­
allel machine. 5:105-124, March 1977.
[44] I.E.Sutherland. Micropipelines. Communications of the ACM, 32(6):720-738, 
June 1989.
[45] I.Gopal. Prevention of store-and-forward deadlock in computer networks. IEEE
Trans. on Communications, COM-33:1258-1264, December 1985.
[46] I.Nedelchev. Asynchronous VLSI Design. PhD thesis, University of Surrey, 1995.
[47] I.Nedelchev and C.Jesshope. Basic building blocks for asynchronous packet 
routers. In IEEE Great Lakes Symposium. IEEE Computer Society Press, 1994.
[48] J.Allen, P.Gaughan, D.Schimmel, and S.Yalamanchili. Ariadne - an adaptive
router for fault-tolerant multi-computers. Technical Report TR-GIT/CSRL-93/10,
Georgia Institute of Technology, 1993.
[49] Jan.M.Rabaey. Digital Integrated Circuits: A Design Perspective. Prentice Hall, 
1996.
[50] J.Brzozowski and C.Seger. Asynchronous Circuits. Monographs in Computer 
Science. Springer-Verlag, 1994.
[51] J.Dines, J.Snowdon, M.Desmulliez, D.Barsky, A.Shafarenko, and C.Jesshope.
Optical interconnectivity in a scalable data-parallel system. Journal of Parallel
and Distributed Computing, 41:120-130, November 1997.
[52] J.Duato. On the design of deadlock-free adaptive routing algorithms for
multicomputers: Design methodologies. Proceedings of Parallel Architectures and
Languages Europe, pages 390-405, June 1991.
[53] J.Duato. A new theory of deadlock-free adaptive routing in wormhole networks.
IEEE Transactions on Parallel and Distributed Systems, 4(12):1320-1331,
December 1993.
[54] J.Duato. A necessary and sufficient condition for deadlock-free routing in
cut-through and store-and-forward networks. IEEE Transactions on Parallel and
Distributed Systems, 7(8):841-854, August 1996.
[55] J.Duato, S.Yalamanchili, and L.Ni. Interconnection Networks: an Engineering 
Approach. IEEE Computer Society, 1997. ISBN 0-8186-7800-3.
[56] C. Jesshope and I. Nedelchev. Asynchronous Packet Routers, pages 211-227. 
DIMACS series in Discrete maths and Theoretical C.S. 21. 1995.
[57] J.Goodman, F.Leonberger, S.Kung, and R.Athale. Optical interconnections for 
VLSI systems. Proceedings of the IEEE, 72(7):850-865, July 1984.
[58] J.Gurd, C.Kirkham, and I. Watson. The Manchester prototype dataflow computer.
Communications of the ACM, 28(1):34-52, January 1985.
[59] J.Wilson and J.Hawkes. Optoelectronics: an introduction, chapter 5, pages
220-222. Prentice Hall, third edition, 1998. ISBN 0-8186-7800-3.
[60] J.Yantchev and C.Jesshope. Adaptive, low latency, deadlock-free packet routing
for networks of processors. IEE Proceedings, 136(3):178-186, May 1989.
[61] K.Anjan and T.Pinkston. An efficient, fully adaptive deadlock recovery scheme: 
Disha. Proceedings of the 22nd Annual International Symposium on Computer 
Architecture, June 1995.
[62] K.Aoyama and A.Chien. The cost of adaptivity and virtual lanes. Journal of VLSI
Design, 2(4):315-333, 1995.
[63] K.Bolding. Chaotic Routing - Design and Implementation of an Adaptive
Multicomputer Network Router. PhD thesis, University of Washington, 1993.
[64] K.Bolding, M.Fulgham, and L. Snyder. The case for chaotic adaptive routing.
IEEE Transactions on Computers, 46(12):1281-1291, December 1997.
[65] K.Hwang. Advanced Computer Architecture. McGraw-Hill, 1993.
[66] L.Dennison. The Reliable Router: An Architecture for Fault Tolerant
Interconnect. PhD thesis, Massachusetts Institute of Technology, June 1996.
[67] L.Dennison, W.Dally, and D.Xanthopoulos. Low-latency plesiochronous data re­
timing. Proceedings of the 1995 Advanced Research in VLSI Conference, March
1995.
[68] L.Dennison, W.Lee, and W.Dally. High-performance bidirectional signalling in 
vlsi systems. Proceedings of the 1993 Symposium on Research on Integrated 
Systems, January 1993.
[69] L.Ni and P.McKinley. A survey of wormhole routing techniques in direct
networks. IEEE Computer, pages 62-76, February 1993.
[70] M.Desmulliez, B.Wherrett, A.Waddie, J.Snowdon, and J.Dines. Performance
analysis of self-electro-optic-effect-device-based (seed-based) smart-pixel arrays
used in data sorting. Applied Optics, 35(32):6397-6416, November 1996.
[71] M.Haycock and R.Mooney. A 2.5 Gb/s bidirectional signalling technology. In
Proceedings of Hot Interconnects V. Intel Corporation, 1996.
[72] M.J.S.Smith. Application Specific Integrated Circuits. Addison Wesley, 1997.
[73] M.Raksapatcharawong, T.M.Pinkston, and Y.Choi. Evaluation and design issues
for optoelectronic cores: a case study of WARRP II router. J. Opt. A: Pure Appl.
Opt., 1:249-254, 1999.
[74] N.Boden, D.Cohen, R.Felderman, A.Kulawik, C.Seitz, J.Seizovic, and W.Su.
Myrinet: A gigabit-per-second local-area network. IEEE-Micro, 15(1):29-36,
February 1995.
[75] N.Paver. The Design and Implementation of an Asynchronous Microprocessor.
PhD thesis, University of Manchester, 1994.
[76] N.Weste and K.Eshraghian. Principles of CMOS VLSI Design: A systems
Perspective, chapter 3. Addison-Wesley, 2nd edition, 1993.
[77] P.Berman, L.Gravano, G.Pifarre, and J.Sanz. Adaptive deadlock and livelock free 
routing with all minimal paths in torus networks for multiprocessors. Proc. Inter­
national Symposium on Computer Architecture, 1992.
[78] P.Gaughan and S.Yalamanchili. Adaptive routing protocols for hypercube
interconnection networks. IEEE Computer, 26(5):12-23, May 1993.
[79] P.Gaughan and S.Yalamanchili. A family of fault-tolerant routing protocols for
direct multiprocessor networks. IEEE Transactions on Parallel and Distributed
Systems, 6(5):482-497, May 1995.
[80] P.Kermani and L.Kleinrock. Virtual cut-through: a new computer communication 
technique. Computer Networks, 3(4):267-286, 1979.
[81] P.Merlin and P. Schweitzer. Deadlock avoidance in store-and-forward networks -
i: Store-and-forward deadlock. IEEE Transactions on Communications,
COM-28(3):345-354, March 1980.
[82] P.Miller. Efficient Communication for Fine-Grain Distributed Computers. PhD 
thesis, Southampton University, 1991.
[83] R.Cypher and L.Gravano. Requirements for deadlock-free adaptive packet rout­
ing. SIAM Journal on Computing, 23(6), December 1994.
[84] R.Cypher and L.Gravano. Storage-efficient, deadlock-free packet routing algo­
rithms for torus networks. IEEE Transactions on Computers, 43(12), December 
1994.
[85] R.Felderman, A.DeSchon, D.Cohen, and G.Finn. Atomic: A high-speed local
communication architecture. Journal of High Speed Networks, 3(1):1-30, 1994.
[86] R.Kuracina. Flip Chip Packaging for the Year 2000. IBM Microelectronics. Ob­
tained from IBM Microelectronics web site.
[87] Semiconductor Industry Association. The National Technology Roadmap for 
Semiconductors, 1997.
[88] J. Silberman, N. Aoki, D. Boerstler, J. Burns, S. Dhong, A. Essbaum, U. Ghoshal,
D. Heidel, P. Hofstee, K. Lee, D. Meltzer, H. Ngo, K. Nowka, S. Posluszny,
O. Takahashi, I. Vo, and B. Zone. A 1.0 GHz single-issue 64b PowerPC integer
processor. In Proceedings of IEEE International Solid-state Circuits conference,
February 1998.
[89] S.L.Scott and G.M.Thorson. The Cray T3E network: Adaptive routing in a high
performance 3d torus. In Proceedings of Hot Interconnects IV, 1996.
[90] S.L.Scott and G.Thorson. Optimized routing in the Cray T3D. PCRCW, pages 
1-12, 1994.
[91] S.L.Scott and J.R.Goodman. The impact of pipelined channels on k-ary n-cube
networks. IEEE Transactions on Parallel and Distributed Systems, 5(1):2-16,
January 1993.
[92] S.Tang, R.Chen, L.Gamett, D.Gerold, and M.Li. Design limitations of highly
parallel free-space optical interconnects based on arrays of vertical cavity
surface-emitting laser diodes, microlenses and photodetectors. Journal of Lightwave
Technology, 12(11):1971-1975, November 1994.
[93] S.Warnakulasuriya and T.M.Pinkston. Characterization of deadlocks in
interconnection networks. To appear in: Proceedings of the 1999 International
Conference on Parallel Processing, September 1999.
[94] T.M.Pinkston, M.Raksapatcharawong, and Y.Choi. WARRP core: optoelectronic
implementation of network-router deadlock-handling mechanisms. Applied
Optics, 37(2):276-283, January 1998.
[95] T.M.Pinkston, Y.Choi, and M.Raksapatcharawong. Architecture and
optoelectronic implementation of the warrp router. In Proceedings of Hot Interconnects V,
1996.
[96] Y.Tamir and H.Chi. Symmetric crossbar arbiters for VLSI communication
switches. IEEE Transactions on Parallel and Distributed Systems, 4(2):13-27,
1993.
