High-Speed Message Routing Mechanisms for Massively Parallel Computers by Flavell, Andrew Colin
Form 6
List of Papers
Name: Andrew Colin Flavell
Dissertation title: High-Speed Message Routing Mechanisms for
Massively Parallel Computers
Contents of the dissertation
Chapter 1: Introduction
Chapter 2: Scalable Multicomputer Systems
Chapter 3: Tokkyu: A High-Performance, Randomizing, Adaptive Message
Router with Packet Expressway
Chapter 4: Restricted Length Hardware Multicasting in Multicomputer Networks
Chapter 5: Conclusions
Reference papers
Main papers
1. Flavell, A. C. and Takahashi, Y., "Tokkyu: A High-Performance,
Randomizing, Adaptive Message Router with Packet Expressway", IEICE
Trans. on Information and Systems, vol. E78-D, no. 10, pp. 1248-1260,
October 1995.
2. Flavell, A. C. and Takahashi, Y., "Restricted Length Hardware
Multicasting in Multicomputer Networks", Transactions of the IPSJ,
vol. 36, no. 5, pp. 1228-1238, May 1995.
Secondary papers
1. Flavell, A. C. and Takahashi, Y., "The Tokkyu Router: A Randomizing
Router for k-ary n-cubes", Proc. of the International Symposium on
Parallel and Distributed Supercomputing, pp. 127-134, September 1995.
2. Flavell, A. C. and Takahashi, Y., "Continuum: A Hybrid Time/Space
Communications Paradigm for k-ary n-cubes", Proc. of the International
Conference on Parallel Processing 1994, vol. 1, pp. 138-141, August 1994.
3. Flavell, A. C. and Takahashi, Y., "Mandala: An Interconnection Network
for a Scalable Massively Parallel Computer", in Proceedings of the 33rd
IPSJ Programming Symposium, pp. 79-90, January 1992.
4. Flavell, A. C. et al., "Mandala: An Interconnection Network for a
Scalable Massively Parallel Computer", Technical Report of the IPSJ,
vol. 91, no. 100, pp. 91.101-91.109, November 1991.
Form 7
Summary of the Dissertation
Report number: No. 68
Name: Andrew Colin Flavell
Dissertation title: High-Speed Message Routing Mechanisms for
Massively Parallel Computers
Summary
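Massively parallel processing (MPP) systems are now advancing into many
fields that were once the stronghold of traditional vector processors and
SIMD machines. These systems take advantage of the rapid progress of
readily available high-performance CPUs, connecting hundreds to thousands
of them to form homogeneous multiprocessor systems. However, the
performance of these systems on real problems is not necessarily good and
always falls far short of the nominal peak performance. Because all
communication between processors in these systems is carried by the
interconnection network, the decisive factor in determining the maximum
achievable performance is the interconnection network and the
communication mechanisms it uses.
This dissertation proposes two methods for realizing efficient
communication mechanisms for MPP interconnection networks. The first is
the "Tokkyu router", whose suitability for use in interconnection networks
is examined. The Tokkyu router uses multiple unidirectional
register-insertion buses to realize a hybrid time/space division network.
Accurate models were developed to predict the performance of the switch
and buffer circuits of the Tokkyu router for different radices and numbers
of dimensions. As a result, it was confirmed that the Tokkyu router
satisfies all of the conditions for efficient communication. More
importantly, the Tokkyu router can route messages without compromising low
latency and high throughput even when the network contains faults or is
congested. The performance of the Tokkyu router, evaluated by simulation,
was confirmed to be superior to that of previously published oblivious
routers and equal to or better than that of other adaptive routers.
The second is restricted-length multicast communication. Multicast is a
communication method that contributes to speedup in many parallel
processing problems. The deadlock problem that arises for multicast
communication under wormhole routing was therefore studied.
Restricted-length multicast was proposed as a solution to this problem,
and its communication performance was evaluated by simulation and compared
with that of multicast implemented by unicast and by multipath schemes. The
results show that the proposed restricted-length multicast is a
particularly good solution for multicasting to clusters for barrier
synchronization, multicasting to nearest-neighbor nodes, and broadcasting
to all nodes.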
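Form 9
Summary of the Results of the Dissertation Examination
Report number: No. 68
Name: Andrew Colin Flavell
Examination committee:
Chief examiner: Yoshizo Takahashi
Co-examiner: Ryosaku Shimada
Co-examiner: Norio Akamatsu
Dissertation title: High-Speed Message Routing Mechanisms for
Massively Parallel Computers
Summary of examination results
A massively parallel computer connects hundreds to thousands of processor
elements and operates them in parallel in order to achieve very high speed
processing. Because all communication between processors is carried by the
interconnection network, the decisive factors in the overall performance of
such a system are the communication mechanism and the communication control
scheme of the interconnection network, yet fully satisfactory solutions
have not been obtained so far.
This dissertation studies the communication mechanism and communication
control scheme of interconnection networks, and proposes a new router
mechanism and a multicast communication scheme with a distinctive form of
control. The new router mechanism, called the "Tokkyu router", uses
multiple unidirectional register-insertion buses to realize a hybrid
time-division/space-division network. Its distinguishing feature is that it
can route messages without compromising low latency and high throughput
even when the network contains faults or the traffic load is extremely
heavy. A detailed performance evaluation by simulation confirmed that it
outperforms conventional oblivious routers and compares favorably with
other adaptive routers.
Next, packet-length-restricted multicast communication is proposed as a
new communication scheme. Multicast is a function required in many parallel
processing problems and must be performed as quickly as possible. With
wormhole routing, however, multicast communication may cause deadlock. As
a result of studying this problem, it was proven that deadlock can be
avoided without loss of performance if multicast is performed with the
packet length automatically restricted. Evaluation of the communication
performance of this scheme by simulation also confirmed that it is
particularly effective for multicasting to clusters for barrier
synchronization, multicasting to nearest-neighbor nodes, and broadcasting
to all nodes.
In summary, this research makes new proposals concerning the communication
mechanism and communication control scheme of the interconnection network,
an important element in constructing high-performance massively parallel
computers, and demonstrates their effectiveness. The dissertation is
judged worthy of the degree of Doctor of Engineering.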

High-Speed Message Routing
Mechanisms for Massively
Parallel Computers
March 1996
Andrew Colin Flavell
High-Speed Message Routing
Mechanisms for Massively
Parallel Computers
A dissertation submitted to the Department of Information Science and
Intelligent Systems and the Graduate School of the University of
Tokushima in partial fulfillment of the requirements for the degree of
Doctor of Engineering
by
Andrew Colin Flavell
March 1996
Approved as to the style and content by
Professor Yoshizo Takahashi
Professor Ryosaku Shimada
Dept. of Information Science and Intelligent Systems
Professor Norio Akamatsu
Dept. of Information Science and Intelligent Systems
Acknowledgments
I wish to express my sincere gratitude to Professor Yoshizo Takahashi, for
enabling me to study for a doctoral degree in Japan. His guidance has
served me well and has helped to keep me focused on the task at hand. I also
wish to thank Professors Ryosaku Shimada and Norio Akamatsu, for their
contributions as the members of my defense committee. Thanks must also
go to the Japanese Ministry of Education, Science and Culture, for granting
me the scholarship which has made studying in Japan a reality.
To Masahiko Sano and Tomio Inoue, many thanks for helping to make my
university life, and adjustment to life in Japan, simpler and more enjoyable.
Thanks to Dr. Tim Gleeson for his useful and constructive criticism of my
written work, especially the comments relating to this dissertation.
Finally, special thanks must go to my wife, Figen Ulgen. Her belief in
my ability has been, and continues to be, an inspiration. I couldn't wish for
anything more...
Figen, bu senin için...
Abstract 
Massively parallel processing systems (MPPs) are currently making inroads
into many areas that are traditionally a stronghold for vector or SIMD
processors. These systems leverage the rapid advances being made in readily
available high performance CPUs by connecting hundreds or thousands of
them together to form homogeneous multiprocessor systems. Unfortunately,
the performance of these systems when solving real-world problems has been
somewhat disappointing and always falls far short of the theoretical peak
performance quoted by system vendors. As all of the communications
between processors in these systems rely on the interconnection network, a
critical component in determining the maximum achievable performance is
the interconnection network and the communications structures supported
by it.
This dissertation introduces two solutions to providing effective
communications structures for MPP systems. The Tokkyu router is presented
and its suitability for use in MPP interconnection networks is
demonstrated. The Tokkyu router utilizes multiple, unidirectional,
register-insertion buses to provide a hybrid time/space division network.
Accurate models are developed to predict the switch and buffer performance
of Tokkyu routers for varying radix and dimension. The Tokkyu router meets
all of the requirements necessary to be considered effective. Importantly,
the support for routing in the presence of faults or network congestion
does not compromise the low latency and high throughput of the router. The
simulated performance of the Tokkyu router exceeds that of published
results for oblivious routers and is equal to or exceeds those reported for
other adaptive routers.
The multicast deadlock problem is investigated, as multicast has been
identified as an area which can provide significant speedup to a number of
parallel applications. Restricted-length multicast is introduced as a solution
to multicast deadlock in wormhole routed networks and the implementation
of this multicast scheme is examined. Restricted-length multicast is then
compared to unicast and multi-path based multicasts by simulation. The
results of the simulations indicate that restricted-length multicast provides
a good solution to multicast problems such as multicasting to clusters of
nodes found in barrier synchronization, multicasting to nearest neighbors
and broadcasting to all of the nodes in the network.
List of Publications 
Papers Accepted for Journal Publication 
• Flavell, A.C. and Takahashi, Y., "Tokkyu: A High-Performance,
Randomizing, Adaptive Message Router with Packet Expressway", IEICE
Trans. on Information and Systems, vol. E78-D, no. 10, pp. 1248-1260,
October 1995.
• Flavell, A.C. and Takahashi, Y., "Restricted Length Hardware
Multicasting in Multicomputer Networks", Transactions of the IPSJ,
vol. 36, no. 5, pp. 1228-1238, May 1995.
Papers Accepted to International Conferences
• Flavell, A.C. and Takahashi, Y., "The Tokkyu Router: A Randomizing
Router for k-ary n-cubes", Proc. of the International Symposium on
Parallel and Distributed Supercomputing, pp. 127-134, September 1995.
• Flavell, A.C. and Takahashi, Y., "Continuum: A Hybrid Time/Space
Communications Paradigm for k-ary n-cubes", Proc. of the International
Conference on Parallel Processing 1994, vol. 1, pp. 138-141,
August 1994.
Other Related Papers
• Flavell, A.C. and Takahashi, Y., "Mandala: An Interconnection Network
for a Scalable Massively Parallel Computer", in Proceedings of
the 33rd IPSJ Programming Symposium, pp. 79-90, January 1992.
• Flavell, A.C. et al., "Mandala: An Interconnection Network for a
Scalable Massively Parallel Computer", Technical Report of the IPSJ,
vol. 91, no. 100, pp. 91.101-91.109, November 1991.
Contents
Abstract
List of Publications
1 Introduction
2 Scalable Multicomputer Systems
2.1 Node Structure
2.2 Interconnection Network Topologies
2.3 Message Switching
2.4 Message Routing
2.4.1 Deterministic Routing
2.4.2 Adaptive Routing
2.5 Deadlock
2.6 Multicast Messages
2.6.1 Multicast Deadlock
3 Tokkyu: A High-Performance, Randomizing, Adaptive Message Router
with Packet Expressway
3.1 The Register-insertion Bus
3.1.1 Register-insertion Bus Operation
3.2 Architecture of the Tokkyu Router
3.2.1 Router Operation
3.3 Switch and Buffer Design
3.3.1 Switch Evaluation
3.3.2 Buffer Evaluation
3.4 Performance
3.4.1 Simulation of Uniform Random Traffic
3.4.2 Simulation of Hot-spot Traffic
3.4.3 Simulation of Traffic in the Presence of Faults
3.4.4 Discussion of Results
4 Restricted-length Hardware Multicasting in Multicomputer Networks
4.1 Preliminaries
4.1.1 Definition of Multicast Deadlock Problem
4.2 Restricted-Length Multicasting
4.2.1 Gate-array Implementation
4.3 Simulation
4.3.1 Multicast Latency
4.3.2 Simulation Results
4.3.3 Discussion of Results
5 Conclusions
List of Figures
1.1 Generic multiprocessor architecture
2.1 Generic node architecture
2.2 (a) Simple ring network and (b) corresponding spanning subgraph
2.3 (a) Strongly connected digraph and (b) corresponding directed
tree, which is also a rooted tree
2.4 Contemporary static network topologies: (a) 2D torus (b) 2D
mesh (c) 3D mesh (d) 4D Hypercube (e) Fat-Tree (f) Mandala
interconnection network
2.5 Communications channels for a 2-dimensional router
2.6 Latency of various switching techniques
2.7 Division of information units
2.8 Three virtual channels sharing a unidirectional physical channel
2.9 e-cube routing on a hypercube
2.10 Dimension order routing on a 2D mesh
2.11 (a) Dimension order ...
2.12 Physical communication channels divided into routing planes
2.13 Two-dimensional chaos router
2.14 (a) Network and (b) its channel dependency graph without
virtual channels. (c) Network and (d) its channel dependency
graph with extra virtual channels
2.15 (a) Multicast by unicast (b) Tree based multicast (c) Path
based multicast
2.16 Multicast deadlock in binary tree
2.17 Multipath multicast
3.1 Register-insertion bus interface
3.2 N-dimensional register-insertion bus port
3.3 Architecture of a two-dimensional Tokkyu router
3.4 Global arbiter inputs and outputs
3.5 State diagram for determining the distance distribution in an
8-ary 2-cube
3.6 Probability of misrouting versus applied load for 16-ary 2-cube.
Solid lines are predicted values, points are measurements taken by
simulation
3.7 The discrete-time Markov chain state transition diagram for
the output queue size
3.8 Performance of output queues. Solid lines are predicted values,
points are measurements taken by simulation
3.9 Dialog for setting simulation variables
3.10 (a) Simulation display showing test mode (b) Simulation
display key
3.11 (a) Simulation display showing random simulation (b) Simulation
display key
3.12 (a) Simulation display showing hot-spot simulation (b) Simulation
display key
3.13 (a) Simulation display showing fault simulation (b) Simulation
display key
3.14 Performance of queue switches for 256 node 16-ary 2-cube.
Solid lines are predicted values, points are measurements taken
by simulator
3.15 Performance of output queues for 256 node 16-ary 2-cube.
Solid lines are predicted values, points are measurements taken
by simulator
3.16 Latency versus offered traffic for a 256 node 16-ary 2-cube
under uniform random traffic
3.17 Throughput versus offered traffic for a 256 node 16-ary 2-cube
under uniform random traffic
3.18 Latency and reduction in latency versus applied load under
uniform random traffic with packet expressway enabled and
disabled
3.19 Latency versus offered traffic for a 256 node 16-ary 2-cube
under bit reversal traffic
3.20 Throughput versus offered traffic for a 256 node 16-ary 2-cube
under bit reversal traffic
3.21 Faulty node is bypassed
3.22 Average latency versus percent faulty channels at 50% applied
load (m=2, L=2). Mean latency averaged over ten random
fault sets
3.23 Throughput versus percent faulty channels at 50% applied
load (m=2, L=2). Mean throughput averaged over ten random
fault sets
4.1 (a) Multicast by node (2,1) and (b) the resulting concurrent
resource trees
4.2 Organization of a single MEGA router input
4.3 Send latency for Lm = 16 bytes
4.4 Send latency for Pr(b) = 0.5
List of Tables
2.1 Routing steps from s = 0000 to d = 1101
3.1 2-tuples defining total distance to travel and W dg for packets
in an 8-ary 2-cube
3.2 Probability of j dimensions remaining to be traversed
4.1 Resource usage for various buffer structures
Chapter 1 
Introduction
The peak performance levels of Massively Parallel Processing (MPP) systems
have recently begun to rival those which are obtained using traditional
vector and SIMD supercomputers. Many therefore believe that MPP systems,
constructed by the interconnection of thousands of homogeneous
computational nodes, are a promising technology for the construction of
computers with teraflops performance. However, the efficiency of
multicomputer based MPP systems when solving real world problems tends to
be disappointing, especially when compared to vector supercomputers [11, 20].
Although there are many ways in which the nodes of an MPP system can
be connected, by far the most popular is the static or direct network. Each
node in a direct network has a point-to-point, or direct, connection to its
'neighboring' nodes and these connections form the interconnection network
as is illustrated in Fig. 1.1. Direct networks are popular as they are said to
scale well, i.e. as the number of nodes in the system is increased, the total
processing power, communication bandwidth and memory bandwidth of the
system also increases.
Inter-process communication, data-sharing and synchronization in an MPP
system are all achieved by the passing of messages via the interconnection
network (IN), and therefore a critical component in determining the maximum
achievable performance of MPP systems is the IN and the communications
structures supported by it. A considerable amount of research has therefore
been conducted in both the design and evaluation of interconnection
networks [1, 2, 46, 5, 42, 47, 22, 23, 24, 25, 26, 27, 28, 29], and this
continues to be an active avenue of research.

Figure 1.1: Generic multiprocessor architecture
The interconnection networks of massively parallel systems must provide
effective, dynamic and arbitrary connectivity between all of the processors
in the system. In order to be considered effective it is desirable that the
interconnection network satisfies the following requirements:
• the packet routing algorithm must be free from deadlock
• the network must be free from livelock, i.e. packets must not be
infinitely delayed in the network
• network latency should be as low as possible
• network throughput should be as high as possible
• the path taken by a packet should adapt dynamically to traffic
conditions
• network performance should degrade gracefully in the presence of faults
Freedom from deadlock and livelock are both essential for the correct
operation of the network. Guaranteed freedom from deadlock is essential to
ensure that there is no potential for the network being brought to a
complete halt because of dependencies in the allocation of network
resources, and freedom from livelock is essential to ensure that packets do
not endlessly cycle in the network, never reaching their destinations. Low
latency and high throughput are necessary to allow a good balance of the
computation/communication ratio of the system. Adaptive packet routing and
graceful degradation of network performance in the presence of faults are
both desirable features, provided they do not compromise the latency and
throughput of the network [44]. Adaptive routing allows better utilization
of communication resources, especially at high network loads or in the
presence of hot-spot traffic [31, 9, 32, 35, 41, 15], and networks which are
fault tolerant are becoming increasingly important as the size and
complexity of massively parallel systems grows. In addition to these
requirements, multicast communication, in which a source node transmits a
single message to a number of destination nodes in the system, has been
identified as being crucial to achieving acceptable performance in a number
of application areas [37, 51].
Organization of this Dissertation 
This dissertation focuses on simple and effective solutions to meeting the 
requirements for an IN to be considered effective and is divided into two 
distinct areas. An introduction to scalable multicomputer systems is given in 
Chapter 2 and this is followed in Chapter 3 with an examination of adaptive 
routing in multicomputer networks and the introduction and investigation 
of the Tokkyu interconnection network. In Chapter 4 an examination of 
multicast deadlock in wormhole routed networks is given and the concept 
of restricted-length multicasting is introduced and investigated. Finally, a
summary and conclusions are given in Chapter 5.
Chapter 2
Scalable Multicomputer Systems
2.1 Node Structure 
Each node in most current MPP systems contains an off-the-shelf RISC
processor, local memory, a number of support units, an interface to a
communications network and a message router, as illustrated in Fig. 2.1.
Off-the-shelf processors are often chosen for MPP system construction as
they are inexpensive and can help to reduce the design time of the system.
For example, the Connection Machine CM-5 uses 32-MHz SPARC processors, the
NEC Cenju-3 uses 75-MHz NEC VR4400SC processors and the Intel Paragon
XP/S uses 50-MHz i860 processors. Support units may include vector
processing units, a graphics controller and HIPPI, SCSI, Ethernet or some
other I/O interface. The role of the network interface unit is to perform
message assembly/disassembly and provide flow control for messages entering
and leaving the network, while the router provides routing and flow control
for messages within the communication network. By removing the functions of
message assembly/disassembly, routing and flow control from the CPU,
communication and computation can occur concurrently, significantly
increasing the performance of the system.

Figure 2.1: Generic node architecture
2.2 Interconnection Network Topologies 
The topology of a network defines how the nodes are connected and can
usually be represented using graph notation. Therefore, a brief introduction
to the relevant graph theory notation is presented before the discussion of
static interconnection networks.
Definition 1 A static interconnection network may be represented by the
strongly connected directed graph, or digraph, $I = G(N, C)$, where the
vertex set $N(I)$ and the arc set $C(I)$ represent the nodes and channels of
the network respectively. The degree of a vertex $n$ in $I$, denoted $d(n)$,
is the number of edges incident with $n$. The graph $H = G(N, C)$ is a
subgraph of $I$ if $N(H) \subseteq N(I)$ and $C(H) \subseteq C(I)$, and $H$
is a spanning subgraph of $I$ if $N(H) = N(I)$.
Figure 2.2: (a) Simple ring network and (b) corresponding spanning subgraph
Figures 2.2(a) and (b) illustrate Definition 1. Figure 2.2(a) presents a
simple ring network, which is a strongly connected digraph, and Figure 2.2(b)
represents a spanning subgraph of (a), as it contains the same set of nodes.
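Definition 1 can be checked mechanically on a small example. The following is a minimal sketch, assuming a hypothetical four-node bidirectional ring (not necessarily the ring drawn in Figure 2.2) stored as a set of nodes and a set of directed arcs; the function names are illustrative only.

```python
def is_subgraph(h_nodes, h_arcs, g_nodes, g_arcs):
    """H is a subgraph of G if its nodes and arcs are subsets of G's."""
    return h_nodes <= g_nodes and h_arcs <= g_arcs


def is_spanning_subgraph(h_nodes, h_arcs, g_nodes, g_arcs):
    """A spanning subgraph additionally keeps every node of G."""
    return is_subgraph(h_nodes, h_arcs, g_nodes, g_arcs) and h_nodes == g_nodes


def degree(node, arcs):
    """Number of arcs (in either direction) incident with the node."""
    return sum(1 for (u, v) in arcs if node in (u, v))


# Hypothetical 4-node ring with one channel in each direction per link.
ring_nodes = {0, 1, 2, 3}
ring_arcs = {(i, (i + 1) % 4) for i in range(4)} | {((i + 1) % 4, i) for i in range(4)}

# Removing both channels between nodes 3 and 0 keeps all nodes,
# so the result is a spanning subgraph of the ring.
sub_arcs = ring_arcs - {(3, 0), (0, 3)}
```

Here `is_spanning_subgraph(ring_nodes, sub_arcs, ring_nodes, ring_arcs)` holds because only arcs were removed, while the degree of node 0 drops from 4 to 2.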
Definition 2 A tree is a connected graph which contains no cycles, and it
follows that a subgraph which is a tree is called a subtree, and a spanning
subgraph which is a tree is called a spanning tree. A directed tree is a digraph
which becomes a tree when the directions of the edges are ignored and a rooted
tree is a directed tree with one vertex of in-degree 0, and all other vertices of
in-degree 1.
Figures 2.3(a) and (b) illustrate Definition 2. Figure 2.3(a) presents a
strongly connected digraph and Figure 2.3(b) represents the corresponding
directed tree. This tree is a binary tree and therefore it is also a rooted tree.
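The in-degree condition of Definition 2 translates directly into a small test. The sketch below assumes a hypothetical seven-node binary tree with arcs directed away from the root (the node labels are illustrative and are not taken from Figure 2.3); it checks the in-degrees and then confirms that every node is reachable from the root.

```python
def is_rooted_tree(nodes, arcs):
    """One vertex of in-degree 0, all others in-degree 1, all reachable from the root."""
    in_degree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in arcs:
        in_degree[v] += 1
        children[u].append(v)

    roots = [n for n, d in in_degree.items() if d == 0]
    if len(roots) != 1 or any(d > 1 for d in in_degree.values()):
        return False

    # Depth-first walk from the root; a rooted tree must reach every node.
    seen, stack = set(), [roots[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen == set(nodes)


# Hypothetical binary tree: node 0 is the root, nodes 3..6 are leaves.
tree_nodes = set(range(7))
tree_arcs = {(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)}
```

Adding an arc from a leaf back to the root, for example `(3, 0)`, leaves no vertex of in-degree 0, so the check fails as the definition requires.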
Figure 2.3: (a) Strongly connected digraph and (b) corresponding directed
tree, which is also a rooted tree.

Some of the more important static evaluative measures of an interconnection
network are its degree, diameter, average distance [2], channel bisection
width [12], maximum message density, and its ability to be scaled. The degree
is defined as the number of channels incident on a node, the diameter as the
maximum of the shortest distances between any two nodes in a system, and
the average distance as the average number of channels that a message must
traverse when traveling from a source node to a destination node. As the
degree of a node and the average distance for a given network are often
inter-related, the normalized average distance, given by
average distance × degree, may provide a more useful measure for static
evaluation. The channel bisection width, B, is defined as the minimum number
of channels that, when cut, separate the network into two equal parts, and
the maximum message density is the maximum of the total number of
communications paths passing through each node in the system. Scalability is
defined as the relative ease with which the number of processing elements in
a system can be increased. A system which requires major hardware changes
and/or rewiring to increase the number of processors is therefore not
considered scalable when compared to a system in which an additional
processor can be plugged in.
Feng [21] classified the topologies of static networks according to the
dimensions required for layout, i.e. one-dimensional, two-dimensional,
three-dimensional, and hypercube. Multicomputer networks are typically
constructed from arrays of nodes in at least two dimensions. Two-dimensional
topologies include the ring, 2D mesh, torus and tree, while
three-dimensional topologies include the 3D mesh and 3D torus. Presented in
Figure 2.4 are a number of contemporary static network topologies.
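The static measures defined above can be computed for small networks by breadth-first search. The following sketch is illustrative only: it assumes a hypothetical k x k grid with or without wrap-around links, counts neighbouring nodes rather than individual unidirectional channels as the degree, and averages the hop count over all ordered pairs of distinct nodes.

```python
from collections import deque
from itertools import product


def grid_2d(k, wrap):
    """Adjacency sets for a k x k mesh (wrap=False) or torus (wrap=True)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        nbrs = set()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                nbrs.add((nx % k, ny % k))
            elif 0 <= nx < k and 0 <= ny < k:
                nbrs.add((nx, ny))
        nbrs.discard((x, y))
        adj[(x, y)] = nbrs
    return adj


def hop_counts(adj, src):
    """Shortest hop count from src to every node, by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def static_measures(adj):
    """Return (maximum degree, diameter, average distance) of the network."""
    degree = max(len(nbrs) for nbrs in adj.values())
    dists = [d for s in adj for t, d in hop_counts(adj, s).items() if t != s]
    return degree, max(dists), sum(dists) / len(dists)
```

For a 4 x 4 torus, `static_measures(grid_2d(4, wrap=True))` gives a degree of 4 and a diameter of 4, so the normalized average distance is simply four times the average distance returned as the third value.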
The networks under consideration here are bi-directional, as these
networks allow locality of communication to be employed in the programming
model of the parallel machine. Therefore, each arc in Fig. 2.4 is divided
into two communications channels, one in each direction. A router in a
2-dimensional network will have communications channels in the +x, -x, +y
and -y directions, along with a connection to the local processor, as shown
in Fig. 2.5.
Torus 
The torus of Fig. 2.4(a) is a member of the general k-ary n-cube family. For
the example torus of Fig. 2.4(a), $k = 4$ and $n = 2$.
Definition 3 A k-ary n-cube is an n-dimensional cube of radix k, and a
node within a k-ary n-cube can be identified by the n-digit radix k address,
$(a_0, a_1, \ldots, a_{n-1})$. Each node in a k-ary n-cube is connected to
every other node whose address differs in exactly one digit by $\pm 1$
modulo k.
The number of nodes in the network, N, is related to n and k by:
$$N = k^n, \qquad (k = \sqrt[n]{N},\; n = \log_k N)$$
~ 
2.2 Interconnection Network Topologies 10 
f一、 〆F司、 r、 f町、
• a・‘ a・‘ ‘' 司， 司， 司，
a・‘ a・‘ 。4・司， 司，
〆
a・h a・‘ -4・〉司， 司，
-4・a・h a・‘ 〕司， 司，、J ~ 、-'
(a) 
(c) 
、?，
?
? ?? ，， 、、
(d) 
(e) 
? ?
Figure 2.4: Contemporary static network topol叩es:(a)2D torus (b) 2D mesh 
(c) 3D mesh (d) 4D Hyperc山e(e) Fat-Tree (f) Mandala in terconnection 
network. 
~ 
2.2 Interconnection Network Topologies 11 
+Yconnection 
-Xconnection +X connection 
Node Connection 
Figure 2.5: Communications channels for a 2-dimensional router 
Although there are many possible topologies for the direct networks
employed in MPP systems, by far the most popular in the current generation
of MPP systems are k-ary n-cubes and those networks which are isomorphic
to them.¹ Parallel systems based on 2 and 3-dimensional k-ary n-cubes
have been intensely investigated in the past, due to their ease of
construction within the confines of 3-dimensional space and the natural
mapping of a number of algorithms to them. Usually, low dimensional k-ary
n-cubes are referred to as tori, while higher dimensional binary n-cubes are
called hypercubes. The diameter of a torus is $2\lfloor n/2 \rfloor$.
Although the wrap-around links of the torus reduce the diameter of the
system, they can complicate message routing in the system and may make it
difficult to connect peripherals to the network. However, several parallel
machines have been constructed using tori. The 2D torus is used in the
iWarp [6] and the K2 parallel processor [3], and more recently, the 3D torus
has been used in the construction of the Cray Research T3D [43].
¹ One notable exception to this is the CM-5, which is based on a fat-tree IN [36].
2D and 3D Mesh 
2D and 3D meshes are presented in Figs. 2.4(b) and (c) respectively. The
mesh topology is an aperiodic variant of the k-ary n-cube family, and looks
like a torus with the end around connections removed. The 2D mesh of Fig.
2.4(b) has ($n = 2$, $k = 4$) and the 3D mesh of Fig. 2.4(c) has
($n = 3$, $k = 3$). In general a k-dimensional mesh with $N = n^k$ nodes has
a node degree of $2k$ and a network diameter of $k(n - 1)$. Several simple
routing algorithms have been presented for the mesh, including fault
tolerant algorithms, and the unused connections around the edge of the mesh
provide an abundance of connections for peripheral devices. A number of
commercial parallel computers have been constructed based on the 2D mesh,
including the CM-2 and the Intel Paragon [53], and a 3D mesh has been
utilized in the J-machine [16] and the Wavetracer Inc. Data Transport
Computer [53].
Binary Hypercube 
The 4-dimensional binary hypercube of Fig. 2.4(d) is a member of the k-ary
n-cube family, with k fixed at two. The hypercube, as it is often referred
to, has a network diameter of n and one of the lowest known average
communications distances of any IN. Many numerical algorithms are suited
to this topology, and it is simple to embed other topologies in the hypercube.
The main disadvantage of the hypercube is that the number of nodes in the
system is increased by increasing the dimension of the network. Thus a large
number of connections are required for each node if a large system is to be
built. In spite of this, the hypercube topology has been used for a number
of commercial and research machines including the Cosmic Cube, CM-2 and
nCube Corporation's nCube2.
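The hypercube properties stated above are easy to confirm for small n. The sketch below assumes nodes are numbered with n-bit integers, so that neighbours differ in exactly one bit and the hop count between two nodes is the Hamming distance of their addresses; the function names are illustrative.

```python
def hypercube_neighbours(node, n):
    """Neighbours of a node in a binary n-cube: flip each of the n address bits."""
    return [node ^ (1 << bit) for bit in range(n)]


def hop_distance(a, b):
    """Minimum number of hops between two nodes (Hamming distance of the addresses)."""
    return bin(a ^ b).count("1")


def diameter(n):
    """Largest hop distance in the n-cube; by symmetry it is enough to start from node 0."""
    return max(hop_distance(0, b) for b in range(2 ** n))
```

With n = 4, each node has four neighbours and `diameter(4)` evaluates to 4, which illustrates both the diameter of n and the growth of the per-node connection count with the dimension.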
Fat-Tree 
The fat-tree takes a somewhat different approach to implementing a static
IN. A typical binary tree has a bisection width of only 1, which results in
severe message-traffic congestion at the root node of the tree. The number
of communications channels, and therefore the communications bandwidth
in a fat-tree, increases as you move up the tree hierarchy, thus alleviating
the communications bottleneck experienced by a standard binary tree
interconnection network. One disadvantage of this scheme is that it requires
several different types of routing nodes and the number of communications
channels in the hierarchy can become very large. However, the network is
quite practical, as the Thinking Machines Corporation CM-5 is constructed
using a 4-ary fat-tree [36]. The 4-ary fat-tree of Fig. 2.4(e) has clusters of
four processors located at the leaves of a tree, each of which is connected to
two router chips.
Mandala 
The Mandala network， presented in Fig. 2.4(f)， isa hierarchical network 
proposed by Takahashi and Flavell [22， 23， 24]. It can be described by the 
size of its clusters， C and number of levelsぅ L.For example the network in 
Fig. 2.4( f) is a (4三)Mandala network. The number of nodes in this sy批 m
is given by N = C L. Each of the nodes in a network of cluster size C， has 
C -1 communications channels forming a complete connection at level 1， 
wi th 1 channel per node reserved for connection to higher levels. The degree 
_. 
司~
2.3 Message Switching 14 
no.de 
???
????
packet 
E凋 dala
" header 図
図
I~1 
(a) Store-and-forward switehing 
time 
node 
nSO21口圏什f図e 悶l l S図1ilEコZコ工Z :11 11 1 I nO I 11 I I 1 
n11111111] 
悶 n2m I I I I I 11I 
time t泊予e(b) Circuit switching (c) Cut-through switching 
Figure 2.6: Latency of various switching techniques 
of each node is given by C and the average distance is given by V万.
2.3 Message Switching
The message switching technique, i.e. the method by which data is passed
from the input of a router to the output, can have a significant effect on
the latency of the network. There are a number of possible switching
techniques, including circuit switching, packet switching, virtual
cut-through routing and wormhole routing. Circuit switching was originally
used in telephone networks and involves the formation of a physical channel
between the source and destination nodes. In packet switching, or
store-and-forward, networks, complete packets are buffered at each node
between the source and destination, and the header of a packet may not
leave a node until the tail has been received.
Both virtual cut-through [34] and wormhole routing [12] use cut-through
to reduce the network latency by allowing a packet to be forwarded as soon
as the routing decision has been made.
Figures 2.6(a)-(c) present a comparison of the latency of packet
switching, circuit switching and cut-through routing techniques
respectively. In each case a single packet is sent from the source node S
via the intermediate nodes n0, n1 and n2. Given a packet length of L bits,
a channel bandwidth of W bits per second and a distance of D hops between
the source and destination nodes, the latency for circuit switching is
given by

$$T_{cs} = T_{setup} + \frac{L}{W} + D \qquad (2.1)$$

the latency for cut-through routing is given by

$$T_{ct} = \frac{L}{W} + D \qquad (2.2)$$

and the latency for store-and-forward switching is given by

$$T_{sf} = \frac{L}{W}(D + 1) \qquad (2.3)$$

If $L \gg D$ then $T_{ct}$ becomes $L/W$ and thus the distance has
negligible effect on latency. Clearly the latency of store-and-forward
routing is considerably higher than that of both circuit and cut-through
routing. Also, in the absence of contention, the network latency of
cut-through based switching is similar to that of circuit switching.
However, if there is a large amount of contention in the network, the time
taken to establish a complete circuit between the source and destination
nodes can add a considerable amount to the delay of a circuit switched
message.
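As a rough numerical check of Eqs. (2.1) to (2.3), the following Python
sketch evaluates the three latency expressions. The function names and the
sample values are illustrative assumptions and do not come from the thesis;
the bandwidth is taken in bits per cycle so that L/W and D are both
measured in cycles.

```python
def circuit_latency(packet_bits, bandwidth, hops, t_setup):
    """Eq. (2.1): circuit setup time plus transmission time plus hop count."""
    return t_setup + packet_bits / bandwidth + hops


def cut_through_latency(packet_bits, bandwidth, hops):
    """Eq. (2.2): the packet is pipelined, so distance is only added once."""
    return packet_bits / bandwidth + hops


def store_and_forward_latency(packet_bits, bandwidth, hops):
    """Eq. (2.3): the whole packet is retransmitted on each of D + 1 links."""
    return (packet_bits / bandwidth) * (hops + 1)


if __name__ == "__main__":
    # Hypothetical example: 1024-bit packet, 8 bits per cycle, 4 hops.
    L, W, D = 1024, 8, 4
    print("circuit          :", circuit_latency(L, W, D, t_setup=10))
    print("cut-through      :", cut_through_latency(L, W, D))
    print("store-and-forward:", store_and_forward_latency(L, W, D))
```

With these sample values the store-and-forward latency grows with every
additional hop, while the cut-through latency is dominated by L/W.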
When channels become blocked, networks using wormhole routing buffer
only small units of data called flow control digits, or flits, which are
illustrated in Fig. 2.7, whereas networks employing virtual cut-through
routing buffer entire packets and therefore require considerably more
buffer resources than wormhole routing.

Figure 2.7: Division of information units

Wormhole routing and virtual cut-through routing provide low latency
message delivery and often make use of virtual channels, which can
significantly improve the throughput of an interconnection network [13].
Moreover, deadlock free routing algorithms for many multicomputer
topologies which utilize these switching mechanisms have been proposed
[17, 30]. Virtual channels provide excellent channel utilization and allow
multiple disjoint logical networks to coexist on a single physical network,
which is very useful for adaptive routing. Figure 2.8 presents a physical
channel which is being shared by three virtual channels. Even though two
of the destination buffers are full, the physical channel can still be
utilized as the third destination buffer is free. Thus, the data in the
free channel can pass the data in the blocked channels.

Figure 2.8: Three virtual channels sharing a unidirectional physical channel
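The situation of Fig. 2.8 can be mimicked with a small Python model. This
is only a sketch under stated assumptions: the class name PhysicalChannel,
the one-flit-per-cycle transfer and the round-robin choice between virtual
channels are hypothetical and are not taken from any particular router.

```python
from collections import deque


class PhysicalChannel:
    """One physical link time-shared by several virtual channels (VCs)."""

    def __init__(self, num_vcs, dest_capacity):
        self.source = [deque() for _ in range(num_vcs)]  # source buffers
        self.dest = [deque() for _ in range(num_vcs)]    # destination buffers
        self.dest_capacity = dest_capacity
        self._next = 0  # round-robin pointer (an assumption of this sketch)

    def cycle(self):
        """Move at most one flit; return the VC index used, or None."""
        n = len(self.source)
        for k in range(n):
            vc = (self._next + k) % n
            has_flit = bool(self.source[vc])
            has_room = len(self.dest[vc]) < self.dest_capacity
            if has_flit and has_room:
                self.dest[vc].append(self.source[vc].popleft())
                self._next = (vc + 1) % n
                return vc
        return None  # every VC is either empty or blocked


if __name__ == "__main__":
    channel = PhysicalChannel(num_vcs=3, dest_capacity=1)
    channel.dest[0].append("old flit")   # VC 0 destination buffer is full
    channel.dest[1].append("old flit")   # VC 1 destination buffer is full
    for vc in range(3):
        channel.source[vc].append(f"flit on VC {vc}")
    print("VC served this cycle:", channel.cycle())
```

In the example only the third virtual channel has a free destination
buffer, so its flit passes the flits waiting on the two blocked channels.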
2.4 Message Routing 
The routing of a message in a direct IN involves the selection of an
appropriate path from the source node to the destination node. Routing can
be classified in several ways. In source routing, as the name implies, the
source node determines the entire path of a packet prior to injecting it
into the network. While this method may reduce the complexity of the
message router hardware, it requires that each packet carry the path
information in its header, increasing the packet size. Also, the path of
the packet is fixed and cannot be changed once it has left the source node.
Most current state-of-the-art direct INs employ distributed routing. In
this case a routing decision is made at each intermediate router which lies
on the path between the source and the destination nodes. The decision
process determines whether the packet should be delivered to the local
processor or forwarded to a neighboring router. If the message is to be
forwarded, the routing algorithm decides which of the adjacent routers the
message should be passed to. This routing decision should be as simple as
possible to allow it to be easily implemented in hardware and provide
minimal routing latency.
Routing can also be classified as oblivious or adaptive. In oblivious, or
deterministic, routing, the path of a packet is completely defined by its
source and destination addresses. The path taken by a packet in a network
employing adaptive routing depends not only upon the source and destination
addresses, but also on dynamic network conditions such as network load or
the presence of faulty channels.
2.4.1 Deterministic Routing 
Most current state-of-the-art interconnection networks employ deterministic
routing. Although deterministic routers are not fault tolerant and have
poor performance in networks experiencing high traffic loads or hot-spots,
they are extremely simple and therefore fast. This makes them suitable for
the practical implementation of interconnection network hardware [44]. Many
multicomputer systems, such as the Cosmic Cube, NCUBE, J-machine, iWarp and
Intel Paragon, therefore utilize deterministic routers. The most widely
used routing algorithms for these machines are the e-cube routing algorithm
[49], which is used for routing on hypercubes, and dimension order routing,
which is used on n-dimensional meshes.
e-cube Routing 
In an n-cube with $N = 2^n$ nodes, each node's address is binary coded as
$a = (a_0, a_1, \ldots, a_{n-1})$. Given a source address
$s = (s_0, s_1, \ldots, s_{n-1})$ and a destination address
$d = (d_0, d_1, \ldots, d_{n-1})$, the routing function should determine
a route from s to d with a minimum number of steps. Denoting the n
dimensions as i = 1, 2, ..., n, where the ith dimension corresponds to the
(i - 1)st bit in the node address, and letting
$v = v_{n-1} \ldots v_1 v_0$ be any node along the packet route, the route
is determined as follows:

1. Compute the direction bit $r_i = s_{i-1} \oplus d_{i-1}$ for all n
   dimensions (i = 1, 2, ..., n).
2. Start with dimension i = 1 and v = s.
3. If $r_i = 1$, route from the current node v to the next node
   $v \oplus 2^{i-1}$, else skip this step.
4. Move to dimension i + 1 (i.e., i ← i + 1). If i ≤ n, go to step 3, else
   quit.

An example of e-cube routing on a 16 node hypercube is presented in
Fig. 2.9. In the example n = 4, s = 0000 and d = 1101. Thus
$r = r_4 r_3 r_2 r_1 = 1101$. The routing steps are summarized in
Table 2.1. As can be seen in the example, the packet is routed from
dimension 1 to dimension 4. If s and d agree in the bit for dimension i,
no routing is needed along dimension i. Thus in the example, no routing is
required for dimension 2. If s and d differ in the bit for dimension i,
then the packet is routed from the current node along dimension i. This
process is repeated until the destination is reached.
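The four steps above translate almost directly into code. The following
Python sketch is illustrative only: the function name ecube_route and the
representation of node addresses as integers are assumptions made for this
example.

```python
def ecube_route(s, d, n):
    """Return the nodes visited after s when routing from s to d.

    s and d are node addresses given as integers; bit (i - 1) of an
    address corresponds to dimension i, as in the description above.
    """
    r = s ^ d            # direction bits r_n ... r_1
    v = s                # current node
    path = []
    for i in range(1, n + 1):          # dimensions 1 .. n in order
        if (r >> (i - 1)) & 1:         # r_i = 1: traverse dimension i
            v ^= 1 << (i - 1)          # next node is v XOR 2^(i-1)
            path.append(v)
    return path


if __name__ == "__main__":
    # The example of Table 2.1: n = 4, s = 0000, d = 1101.
    for node in ecube_route(0b0000, 0b1101, 4):
        print(format(node, "04b"))
```

Because the dimensions are always visited in the same order, two packets
with the same source and destination always follow the same path.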
Figure 2.9: e-cube routing on a hypercube (s = 0000, d = 1101, r = 1101)

Table 2.1: Routing steps from s = 0000 to d = 1101

Step     r_i   Operation      Next node
i = 1    1     0000 ⊕ 2^0     0001
i = 2    0     skip           -
i = 3    1     0001 ⊕ 2^2     0101
i = 4    1     0101 ⊕ 2^3     1101

Dimension order Routing
Dimension order routing is somewhat similar to e-cube routing. As was
discussed previously, a k-ary n-cube is an n-dimensional cube of radix k,
and a node within a k-ary n-cube can be identified by the n-digit radix-k
address $(a_0, a_1, \ldots, a_{n-1})$. Given a source address
$s = (s_0, s_1, \ldots, s_{n-1})$ and a destination address
$d = (d_0, d_1, \ldots, d_{n-1})$, a packet is routed along each dimension
i = 1, 2, ..., n, where the ith dimension corresponds to the (i - 1)st
digit in the node address, until the packet's (i - 1)st address digit,
initially $s_{i-1}$, is equal to $d_{i-1}$.

Figure 2.10: Dimension order routing on a 2D mesh
This is illustrated in Fig. 2.10, which shows routing between four (source,
destination) pairs on a two-dimensional mesh. A packet from any source
node $s = (x_1, y_1)$ to any destination node $d = (x_2, y_2)$ will first
route along the X-axis until it reaches the column in which d is located
($x = x_2$). It will then route along the Y-axis until d is reached. A
west-north route is taken from node (1,0) to (0,4). An east-north route is
traversed from node (1,1) to (3,3). A west-south route is needed from node
(4,4) to node (1,3) and an east-south route is required from node (リ) to
node (6,1).
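For the two-dimensional mesh, the rule "X first, then Y" can be sketched
as follows. The function name and the tuple representation of node
coordinates are assumptions of this example, not part of the thesis.

```python
def dimension_order_route(src, dst):
    """Return the nodes visited after src on an X-then-Y route to dst."""
    x, y = src
    path = []
    while x != dst[0]:                 # correct the X coordinate first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    # Two of the (source, destination) pairs mentioned in the text.
    print(dimension_order_route((1, 1), (3, 3)))   # east, then north
    print(dimension_order_route((4, 4), (1, 3)))   # west, then south
```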
Dimension order routing alone is sufficient to ensure that deadlock does
not occur in mesh connected networks, as it prevents a circular wait for
channel resources. However, the same dimension ordering scheme will not
prevent a deadlock from occurring in a torus network. This is discussed in
further detail in Section 2.5.

Figure 2.11: (a) Dimension order routing (b) Adaptive routing
2.4.2 Adaptive Routing 
Although deterministic routers are simple to implement and therefore fast,
they suffer from poor performance in the presence of hot-spot traffic and
are not fault tolerant. Figure 2.11(a) presents a simple example in which
dimension order routing may result in poor use of channel resources. Node
(0,4) is sending a packet to (4,4), while at the same time node (1,4) has a
packet to send to (4,1), node (2,4) has a packet to send to node (4,2) and
node (3,4) has a packet to send to node (4,3). As dimension order routing
in a two-dimensional mesh requires that the message be sent along the
X-axis first, nodes (1,4), (2,4) and (3,4) are unable to send their
packets, even though a plethora of available channels exist. In
Fig. 2.11(b) the routing rules have been relaxed to allow adaptive routing
so that the packets from nodes (1,4), (2,4) and (3,4) can be transmitted
concurrently with the packet from node (0,4). This allows better channel
utilization and lower packet latency.

Figure 2.12: Physical communication channels divided into routing planes
A number of different approaches have been proposed for the construction
of adaptive and fault tolerant routers. Many of these proposals have
advocated the use of virtual channels to supply multiple virtual paths
between a given (source, destination) pair and thus provide varying degrees
of adaptivity and fault tolerance. These include Planar-Adaptive Routing
[9], Virtual Networks [32], Adaptive Routing with Virtual Channels [15] and
The Turn Model for Adaptive Routing [30].
A general technique for providing adaptive routing is to partition the
physical network into a number of disjoint subsets, where each subset
constitutes a corresponding subnetwork. Packets are routed through
different subnetworks depending upon the location of the source and
destination nodes. Figure 2.12 illustrates an application of this method to
a 2D mesh. The network is partitioned into four subnetworks or planes: the
+X+Y plane, the -X+Y plane, the +X-Y plane and the -X-Y plane. If, for
example, the destination node is to the left and above the source node,
that is, if $d_x < s_x$ and $d_y > s_y$, then the packet will be routed
along the -X+Y plane. If in this example $d_x$ was equal to $s_x$, then the
packet can be routed in either of the +X+Y or the -X+Y planes. This
adaptive routing algorithm is said to be minimal and fully adaptive, that
is, a packet can be delivered through any of the shortest paths between the
source and destination. In addition to this, for the 2D mesh it can be
proven to be deadlock free. However, providing minimal fully adaptive and
deadlock free routing algorithms using this method for the general class of
k-ary n-cubes may require additional channels. Linden and Harden [38] have
demonstrated that a k-ary n-cube will require $2^{n-1}$ subnetworks or
routing planes and thus the number of channels required increases rapidly
with n. The use of virtual channels is also expensive in terms of latency
and cycle time [8] and requires that flow control information be sent in
the reverse direction to signal the availability of buffering on the
receiving node. This flow control information either requires extra wires,
or will consume communications bandwidth from the reverse communications
channel.
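The plane selection rule described above for the 2D mesh can be sketched in
Python. The function name candidate_planes and the string labels are
assumptions made for this illustration; the sketch only selects the
plane(s) and does not model routing within a plane.

```python
def candidate_planes(src, dst):
    """Return the routing planes a packet from src to dst may use."""
    sx, sy = src
    dx, dy = dst
    if dx > sx:
        x_dirs = ["+X"]
    elif dx < sx:
        x_dirs = ["-X"]
    else:
        x_dirs = ["+X", "-X"]   # no X movement needed: either sign will do
    if dy > sy:
        y_dirs = ["+Y"]
    elif dy < sy:
        y_dirs = ["-Y"]
    else:
        y_dirs = ["+Y", "-Y"]   # no Y movement needed: either sign will do
    return [x + y for x in x_dirs for y in y_dirs]


if __name__ == "__main__":
    print(candidate_planes((3, 1), (1, 4)))   # destination left and above
    print(candidate_planes((2, 1), (2, 4)))   # same column, destination above
```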
Ngai and Seitz also proposed a non-minimal adaptive mesh router which
allows complete freedom of path selection between any (source, destination)
pair, by using misrouting to prevent deadlock [41]. However, this approach
requires the use of time stamps and prioritization to prevent livelock,
requiring that extra state information be stored for each packet, and
results in a complex router design.

Figure 2.13: Two-dimensional Chaos router
Another non-minimal adaptive router which utilizes misrouting to avoid
deadlock is the Chaos router proposed by Konstantinidou and Snyder [35]. A
block diagram of a two-dimensional router is presented in Figure 2.13. The
Chaos router utilizes randomization to provide probabilistic freedom from
livelock and therefore does not require any extra state information to make
routing decisions. The central queue of Fig. 2.13 is used to store packets
which arrive at an input frame and are unable to be routed to an output
frame before the entire packet is received. Once the central queue becomes
full and a message is specified to be sent to the queue, one of the packets
in the queue will be randomly selected and sent to the first available
output frame.
Konstantinidou and Snyder have shown that no packet in a router is ever
misrouted with certainty or, in other words, every message has a non-zero
chance to avoid misrouting. They also demonstrated that the probability
that a packet will not have been routed after i routing steps, where
i → ∞, is:

$$\lim_{i \to \infty} Q(i) = (1 - \epsilon)^{i} = 0 \qquad (2.4)$$

Therefore, the longer that a message remains in the network, the more
probable it is that it will be delivered to its destination. The major
disadvantages of this router are that it requires a central misrouting
queue, queues at both inputs and outputs, and extra state information to
make the misrouting decision. These factors may result in a large and slow
implementation.
2.5 Deadlock 
Deadlock occurs in an IN of a parallel computer when no packet can advance
towards its destination because the queues or channels of the message
system are full and no packet can release the queue space that it currently
holds. This phenomenon has been studied extensively for wormhole routed
networks and a general solution for deadlock avoidance in any wormhole
routed network, based on the concept of virtual channels, has been proposed
[18]. Deadlock in wormhole routed networks is normally described in terms
of a network's routing function and channel dependency graph.
Definition 4 A routing function, $R : C \times N \to C$, maps the current
channel, $c_c$, and the destination node, $n_d$, to the next channel,
$c_n$, on the route from the source node to the destination node. A channel
is not allowed to route to itself, $c_c \neq c_n$.

Figure 2.14: (a) Network and (b) its channel dependency graph without
virtual channels. (c) Network and (d) its channel dependency graph with
extra virtual channels.
Definition 5 A channel dependency graph, D, for an interconnection
network, I, and routing function, R, is the directed graph, D = G(C, M).
The vertices, D(C), are the channels of I and the edges, D(M), are the
pairs of channels mapped by the routing function, R.
The routing function, R, for a network is deadlock free if there are no
cycles in its channel dependency graph. Deadlock can occur in the network
of Fig. 2.14(a), due to a circular wait for channels, as there is a cycle
in its channel dependency graph, shown in Fig. 2.14(b). A circular wait for
channels can occur if, for example, a flit from $n_0$ that is destined for
$n_2$ is holding $c_0$, a flit from $n_3$ that is destined for $n_1$ is
holding $c_3$, a flit from $n_2$ that is destined for $n_0$ is holding
$c_2$ and a flit from $n_1$ that is destined for $n_3$ is holding $c_1$. By
adding a set of virtual channels to the network, as shown in Fig. 2.14(c),
and modifying the routing function appropriately, the cycles in the channel
dependency graph are removed, as shown in Fig. 2.14(d). In the figure,
packets at nodes numbered less than their destination are routed on high
channels and packets at nodes numbered greater than their destination are
routed on low channels. Channel $c_{00}$ is not used. There is now an
ordering of virtual channels according to their subscripts:
$c_{13} > c_{12} > c_{11} > c_{10} > c_{03} > c_{02} > c_{01}$, and the
routing function is now deadlock free.
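The acyclicity argument can be checked mechanically for the four-node ring.
The Python sketch below builds both channel dependency graphs and searches
them for cycles. It rests on an assumption that is not stated in the text:
channel $c_i$ is taken to lead from node $n_i$ to node $n_{(i+1) \bmod 4}$.
The function names are likewise hypothetical.

```python
N = 4  # nodes n0 .. n3 of the unidirectional ring


def next_node(i):
    # Assumed ring direction: channel c_i leads from n_i to n_(i+1 mod N).
    return (i + 1) % N


def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph {v: [successors]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}

    def visit(v):
        colour[v] = GREY
        for w in graph.get(v, []):
            if colour.get(w, WHITE) == GREY:
                return True            # back edge: cycle found
            if colour.get(w, WHITE) == WHITE and visit(w):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in graph)


def cdg_single_channels():
    # A packet holding c_i that must travel further requests c_(i+1).
    return {i: [next_node(i)] for i in range(N)}


def cdg_virtual_channels():
    # High channel if the current node is numbered less than the
    # destination, low channel if it is numbered greater.
    graph = {}
    for src in range(N):
        for dst in range(N):
            node, prev = src, None
            while node != dst:
                ch = ("high" if node < dst else "low", node)
                graph.setdefault(ch, [])
                if prev is not None and ch not in graph[prev]:
                    graph[prev].append(ch)
                prev, node = ch, next_node(node)
    return graph


if __name__ == "__main__":
    print("cycle without virtual channels:", has_cycle(cdg_single_channels()))
    print("cycle with virtual channels   :", has_cycle(cdg_virtual_channels()))
```

Under the assumed ring direction the low channel at node 0 is never
requested, which agrees with the remark that $c_{00}$ is not used.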
2.6 Multicast Messages 
Point to point, or unicast, communication, in which a source node sends a
message to a single destination node, is the basic structure supported by
present multicomputers. Broadcast and multicast communications are the
transmission of a message from a source node to all other nodes in the
system, and from a source node to a subset of the nodes in a system,
respectively. Broadcast communication can be viewed as a special case of a
multicast communication, in which the same message is delivered to all of
the nodes in the system [40].
Two parameters commonly used to measure the efficiency of multicast
schemes are channel traffic and communication latency. Channel traffic is
defined as the number of channels used to deliver the message under
consideration and latency is defined as the longest packet transmission
time involved. These two parameters are somewhat interrelated, as is
illustrated in Fig. 2.15. The unicast based multicast generates
traffic = 14 and has distance = 3, the tree based multicast has traffic = 9
and distance = 3, and the path based multicast has traffic = 7 and
distance = 4.
Multicast communications can be implemented using multiple unicasts,
software multicast trees, or by hardware multicast facilities. Multiple
unicasts, while simple to implement, generate large amounts of unnecessary
traffic which can cause blocking and contention in the network [37].
Software multicast trees, in which a worker node will forward the multicast
message to its neighbors upon reception of the message, exhibit
considerable speedup when compared to multiple unicasts [51], but are still
inferior to hardware based multicast schemes. Although hardware based
multicast schemes offer the best potential performance for the
implementation of multicasting, it has been shown that these schemes may
result in deadlock in those networks which employ wormhole routing [37].

Figure 2.15: (a) Multicast by unicast (b) Tree based multicast (c) Path
based multicast

Figure 2.16: Multicast deadlock in binary tree (legend: channels held by
message, channels required by message, output buffer)
2.6.1 Multicast Deadlock 
One of the properties of wormhole routed, tree based multicast schemes is
that, due to the small amount of buffer space at each node, a potentially
large number of network resources must be concurrently held by a single
multicast message. The resources that the messages are competing for in the
network are the communication channels and message buffers of each node.
Each physical communication channel has a dedicated message buffer and
typically the message buffers are partitioned into separate virtual channel
buffers.
While a number of routing algorithms, such as e-cube routing in hypercubes
and dimension order routing in meshes, guarantee deadlock free routing of
unicast messages, multicast trees based on these algorithms are prone to
deadlock. In fact, networks which are inherently free of deadlock, such as
the n-ary tree and fat tree [36], may also deadlock if more than one tree
based multicast occurs concurrently. In the simple example presented in
Fig. 2.16 a deadlock has occurred as the channels (N3,N6), (N3,N7) that are
held by N3 are required by N2, and the channels (N2,N4), (N2,N5) that are
held by N2 are required by N3.
Although the unicast routing algorithm of this network is deadlock free,
a deadlock has occurred because of cyclic dependency in the concurrent
allocation of multiple resources between the two multicasts. Thus,
multicast deadlock differs significantly from traditional unicast deadlock,
as in multicast deadlock the resources contributing to the deadlock
situation are distributed over a number of nodes. Traditional methods of
deadlock avoidance, such as releasing all of the deadlocked resources once
deadlock is detected or requesting all of the required resources prior to
initiating an operation which might result in deadlock, are not suitable
for prevention of multicast deadlock. Releasing the distributed deadlocked
resources results in considerable waste of communications bandwidth and may
be difficult to implement due to the large number of distributed resources
which may need to be released, while requesting all of the necessary
channels prior to initiating a multicast would significantly increase the
multicast latency. New methods of deadlock avoidance for multicast must
therefore be found.
Multicast deadlock avoidance has typically been achieved by limiting the
growth of the multicast tree, and Lin, McKinley, and Ni have extensively
studied the use of multi-path multicasting algorithms utilizing Hamiltonian
paths to ensure that deadlock does not occur [40, 37, 51]. In addition to
deadlock avoidance, multi-path multicast allows arbitrary multicast
destinations and they have demonstrated that this technique has the added
advantage of reducing the amount of traffic in the network. Figure 2.17
illustrates a multi-path broadcast in a 6 x 6 mesh network. As can be seen
in Fig. 2.17, a multi-path message is broadcast by sending four copies of
the message on individual multicast paths. Similarly, Byrd et al. have
investigated the restricted branch multicast approach to multicasting [7].
This approach requires that a multicast message can only be split into two
paths at any given node, and that one of these paths must be connected to
the local processing element.

Figure 2.17: Multipath multicast
Multi-path and restricted branch multicasting have a number of
disadvantages. For example, both restricted branch and multi-path
multicasting require that the packet header store multiple destination
addresses, as all of the destinations for a broadcast or multicast must be
stored in the header, which increases the length of a packet and
complicates router design. In addition to this, restricted branch
multicasting requires an extra port resource to guarantee deadlock freedom,
and the algorithm used in multi-path multicasting to determine the
multicast paths is complex.

Chapter 3
Tokkyu: A High-Performance, Randomizing, Adaptive Message Router with
Packet Expressway
The Tokkyu router is a new high-performance message router for k-ary n-cube
multicomputer systems [26, 29, 28]. The k-ary rings that make up the
interconnection network are constructed using uni-directional
register-insertion buses. Tokkyu utilizes misrouting to prevent deadlock
and randomization to prevent livelock in a fully adaptive routing
environment. Any packet arriving at an input to a Tokkyu router that cannot
be profitably routed is immediately misrouted. This is significantly
different from both the Ngai/Seitz router and the Chaos router, which defer
the misrouting of a packet that is waiting for an output until it is to be
overwritten by a newly arriving packet. The misrouting rate is minimized by
utilizing a small number of queues, placed at the outputs of the
communication ports. As blocking or buffering flow control is not used, all
of the available communications bandwidth can be utilized for sending
messages between processors in the system. Finally, uncongested network
performance is improved by the inclusion of the packet expressway, which
provides a low latency bypass path for packets which need not pass through
the core of the router.
3.1 The Register-insertion Bus
High performance ring buses have become a favorable alternative in the
implementation of local area networks [45]. However, LAN/WAN structures are
not directly applicable to INs due to differences in the node structure and
communications patterns [15]. The use of the unidirectional
register-insertion bus in the construction of INs does, however, have a
number of advantages. These advantages include:
• A packet may propagate through a large number of bus interfaces without
being buffered.
• Processors are free to inject packets at any time, subject to available
space in the transmit queue. Thus there is no global arbitration, as each
processor can decide whether to inject a packet according to information
local to its bus interface.
• Active repeaters can be used at the output of each message router,
instead of the pulldown structure required for a bi-directional bus, thus
making the network more scalable.
3.1.1 Register-insertion Bus Operation 
With reference to Fig. 3.1, the operation of a register-insertion bus is as
follows. Assume that the input and output data are synchronized at the same
transmission rate, so that for each word received, another can be
transmitted. The transmit (tx.) buffer is used to temporarily store a
packet from the local processor while it is waiting for injection onto the
bus. These packets are of variable length, so only a portion of the tx.
buffer may be used for a particular packet; however, the packet length must
not exceed the length of the tx. buffer. The function of the delay buffer
can be explained by first considering the area currently being used. The
used, or active, portion of the delay buffer operates as a FIFO queue that
delays the incoming packets. Assuming that the entire delay buffer has a
capacity of n words and that i words are currently used, 1 ≤ i ≤ n, then
n - i words remain for the unused, or inactive, portion. Thus locations
$w_0, w_1, \ldots, w_{i-1}$ of the delay buffer are active and locations
$w_i, w_{i+1}, \ldots, w_{n-1}$ are inactive. If, in each time step t, a
new word can be received, and a new word is to arrive at time t + i, then
the active portion of the delay buffer represents a FIFO queue containing
the words which arrived at times t, t + 1, ..., t + (i - 1). At time t + i
the word stored in $w_0$ is removed from the queue and sent to the output.
Simultaneously, the incoming word is added to the queue such that locations
$w_0, w_1, \ldots, w_{i-1}$ now contain data which arrived at times
t + 1, t + 2, ..., t + i, and the queue length remains unchanged.

It is desirable that in each time step, if i ≥ 1, the queue size be
reduced. A reduction can take place if the data received at the input is
not part of any packet destined for the output. In this case, the previous
discussion should be modified so that the incoming word is not stored in
location $w_{i-1}$ and also so that i is reduced to i' = i - 1.
Furthermore, if i = 0, then any incoming word need not be stored at all and
can pass directly to the output. In this case the incoming word is not
stored in location $w_{i-1}$ and i is constant at i = 0.

Figure 3.1: Register-insertion bus interface
The inactive portion of the buffer is essential for the injection of
packets into the network from the tx. buffer. Assuming that the tx. buffer
contains a packet of length l, 1 ≤ l < (n - i), and that at time t + i the
first word of this packet is to be sent to the output, then the previous
FIFO discussion should be modified as follows. In this case, at time t + i
the incoming word is stored in location $w_i$ and i is increased to
i' = i + 1. At time t + (i + l) - 1, after the last word of the transmitted
packet has been sent, the locations $w_0, w_1, \ldots, w_{(i+l)-1}$ of the
delay buffer contain words t, t + 1, ..., t + (i + l) - 1. In addition, the
requirements for queue reduction must be modified such that queue reduction
can only occur if the data received at the input is not part of any packet
destined for the output and no packet is currently being sent from the
local tx. buffer.

From the preceding discussion we can observe that if i = 0, the delay
experienced by a packet is only due to the propagation delay through the
output selector. Also, if no packet is being sent from the transmit buffer
and i is less than the length of the incoming packet, then the packet will
cut through the FIFO. Finally, if i is greater than the length of the
incoming packet, or a packet is being transmitted and l is greater than the
length of the incoming packet, then the incoming packet will be completely
buffered in the FIFO, in a store-and-forward manner.
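The behaviour described above can be modelled in a few lines of Python.
This is a simplified sketch under stated assumptions, not the thesis
hardware: the class name DelayBuffer and its step method are hypothetical,
exactly one word moves per time step, an idle input is represented by None,
and words are not grouped into packets.

```python
from collections import deque


class DelayBuffer:
    """Word-level model of the delay buffer of a register-insertion bus."""

    def __init__(self, capacity):
        self.capacity = capacity   # n words in total
        self.fifo = deque()        # active portion; w_0 is the left end

    def step(self, incoming=None, inject=None):
        """Advance one time step and return the word sent to the output.

        incoming: word arriving from the bus that must be forwarded, or None
        inject:   word from the local tx. buffer to be sent now, or None
        """
        if inject is not None:
            # The local word takes the output, so any incoming word is
            # stored in w_i and the active portion grows by one.
            if incoming is not None:
                if len(self.fifo) >= self.capacity:
                    raise OverflowError("delay buffer overflow")
                self.fifo.append(incoming)
            return inject
        if self.fifo:
            # w_0 goes to the output. With an incoming word the queue
            # length is unchanged; with an idle input it shrinks by one.
            word = self.fifo.popleft()
            if incoming is not None:
                self.fifo.append(incoming)
            return word
        # i = 0: the incoming word passes directly to the output.
        return incoming


if __name__ == "__main__":
    buf = DelayBuffer(capacity=4)
    print(buf.step(incoming="a0"))                 # passes straight through
    print(buf.step(incoming="a1", inject="t0"))    # injection delays a1
    print(buf.step(incoming="a2"), len(buf.fifo))  # a1 leaves, a2 is queued
```

The injecting node is responsible for checking that enough inactive space
remains before it starts to inject; the sketch only raises an error if
that rule is violated.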
The concept of the register-insertion bus can easily be extended to the
k-ary n-cube as is shown in Fig. 3.2, which illustrates the structure of a
single port of an n-dimensional register-insertion bus router. The delay
buffer of Fig. 3.1 is replaced by a group of output buffers. These buffers
store packets that are changing dimensions, in addition to those which must
be delayed while the local processor injects new packets into the network.
Also, the control is now distributed between the input and output control
sections to improve performance.

Figure 3.2: N-dimensional register-insertion bus port.
3.2 Architecture of the Tokkyu Router 
The architecture of a two-dimensional Tokkyu router is presented in
Fig. 3.3. The input queues of a typical oblivious router have been replaced
by m queues per output, and n : m switches connect the inputs to the
queues, where n = 4 for a two-dimensional router. A small input frame is
also provided in each input controller to temporarily store several words
of an incoming packet while a routing decision is made. Each of the output
queues is capable of holding multiple, variable length packets and all of
the queues support cut-through routing. As the router may buffer complete
packets when output contention occurs, it requires the use of comparatively
short packets, i.e. less than 32 bytes. An output controller schedules the
output of packets from the output queues in a FIFO manner and also controls
the injection of packets into the network via the output switch. Under the
assumption of uniform traffic distribution, each packet in a k-ary n-cube
traverses $\sigma = k/4$ channels in each dimension before a routing
decision must be made. Therefore we have provided the packet expressway
which, in the absence of blocking, allows packets to pass directly to an
output. Thus, a single unidirectional channel in any dimension can be
viewed as a high speed register-insertion ring [26]. The header of each
packet is updated prior to entering the output register, when passing
through the inc or dec modules, to reflect the progress of the packet
through the network.
As misrouting is used to prevent deadlock and randomization is used to
prevent livelock, correct operation of the router can be guaranteed
provided no packet, or part of a packet, is lost due to buffer overflow.
The aggregate data rate into any router must therefore never exceed the
aggregate data rate out of the router. A simple way for the data rates
within the network to remain tightly matched is through the use of a
globally distributed clock. Then, by restricting packet injection to only
occur when sufficient space exists to completely store any packet which may
arrive while injection is taking place, buffer overflow is guaranteed not
to occur.
Figure 3.3: Architecture of a two-dimensional Tokkyu router
3.2.1 Router Operation
The operation of the router can be understood by examining the control
algorithms of its major components. These components are the input and
the output controllers of each port, the queue controller associated with each
output queue and the arbiter which controls access to the output queues
via the queue switches. Throughout this section the following notation is
adopted for convenience:
Drem    : Distance remaining in this dimension
Plen    : Length of current packet
Ilen    : Input count register
Jlen    : Injection count register
Llen    : Queue load count register
Qlen    : No. of words stored in queue
Qmax    : Max. contiguous queue space
Olen    : Output count register
Output  : Queue output selected
Passive : Packet expressway selected
Bypass  : Packet expressway in use
Inject  : Packet injection selected
Input Controller Algorithm
Algorithm 3.1 Input Controller Algorithm
1. If no packet, wait;
2. Decode header;
3. If Drem = 0 or Passive not asserted,
4.    Request new output(s);
5. Else, assert Bypass;
6. Ilen = Plen - 1;
7. While Ilen > 0 do
8.    Ilen = Ilen - 1;
9. Enddo;
10. Reset Bypass;
11. Goto 1;
With reference to Algorithm 3.1 the input controller operation is as follows:
The received data is sampled by the input controller on each clock cycle to
test for a valid packet header. Upon the detection of the first word of a packet,
the header is decoded to generate the output request(s). A packet which is
j dimensions from its destination will generate j valid output requests. If
the packet has finished traversing the current dimension (Drem = 0) or the
output switch is not in the Passive state, then the output request(s) will be
passed to the global arbiter. Bypass is asserted if Drem ≥ 1 and the output
switch is Passive, to signal that the packet is passing to the output via the
packet expressway. The packet length is loaded into the input count register
and on each subsequent clock cycle Ilen is decremented as each new word of
the packet is received. Once Ilen has decremented to zero, indicating that
the entire packet has been received, Bypass is reset and the input controller
begins to sample the input for a valid header once again.
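The decision taken in steps 3 to 5 of Algorithm 3.1, and the word count of
steps 6 to 9, can be sketched in a few lines of Python. This is an illustrative
model only and not the router's control logic: the names Header,
input_decision and cycles_to_receive_rest are assumptions introduced here,
and the clocked hardware is reduced to plain function calls.

```python
from dataclasses import dataclass


@dataclass
class Header:
    d_rem: int  # Drem: distance remaining in the current dimension
    p_len: int  # Plen: length of the packet in words


def input_decision(header: Header, passive: bool) -> str:
    """Steps 3-5 of Algorithm 3.1 (hypothetical helper).

    Returns "request" when new output(s) must be requested from the global
    arbiter, or "bypass" when the packet may use the packet expressway.
    """
    if header.d_rem == 0 or not passive:
        return "request"
    return "bypass"


def cycles_to_receive_rest(header: Header) -> int:
    """Steps 6-9: count the cycles spent decrementing Ilen from Plen - 1."""
    i_len = header.p_len - 1
    cycles = 0
    while i_len > 0:
        i_len -= 1
        cycles += 1
    return cycles
```

For a 16-word packet that still has three hops to travel in its dimension,
input_decision(Header(3, 16), passive=True) selects the expressway, while
the same packet arriving when the output switch is not Passive must request
an output queue.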
Output Controller Algorithm
Algorithm 3.2 Output Controller Algorithm
1. Assert Passive;
2. If no output requests, wait;
3. If Bypass is asserted, wait;
4. Reset Passive;
5. While output requests do
6.    If injection request,
7.       If Qmax ≥ Jlen,
8.          Assert Inject;
9.    If Inject not asserted and output request,
10.      Assert Output;
11.      Get first output request;
12.   Olen = Plen;
13.   While Olen > 0 do
14.      Output word;
15.      Olen = Olen - 1;
16.      If Output asserted, Qlen = Qlen - 1;
17.      Else, Jlen = Jlen - 1;
18.   Enddo;
19. Enddo;
20. Goto 1;
With reference to Algorithm 3.2 the output controller operation is as follows:
Operation of the output controller begins with setting the output switch to
the Passive state, allowing any packet on the packet expressway to pass
directly to the output register. Once an output request is detected and no
packet is currently bypassing the output, the request is processed and the
output switch is set accordingly. If an injection request is being made and
there exists sufficient space for any incident packet to be temporarily stored
while the new packet is being injected (Qmax ≥ Jlen), then the switch is set
to the injection input. This ensures that there always exists sufficient space
to buffer an arriving packet within the node while a new packet is injected
so that no information, i.e. no part of a packet, is lost. The packet length
is loaded into the output count register and a new word of the packet being
output is placed in the output register during each clock cycle. Olen and
either of Qlen or Jlen are decremented until the entire packet has been sent.
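The source selection performed by the output controller, including the
injection guard Qmax ≥ Jlen, can be sketched as a single Python function.
The function name and its string return values are assumptions made for
illustration. The final branch, where a blocked injection leaves the switch
Passive, is also an assumption, since Algorithm 3.2 does not spell out that
case.

```python
def select_output_source(bypass: bool, queue_request: bool,
                         inject_request: bool, q_max: int, j_len: int) -> str:
    """Choose what drives the output register in the next packet slot.

    q_max is the largest contiguous free space in the output queues (Qmax)
    and j_len is the number of words to be injected (Jlen).
    """
    if not (queue_request or inject_request):
        return "passive"   # step 2: nothing to send, expressway stays open
    if bypass:
        return "passive"   # step 3: wait for the bypassing packet to finish
    if inject_request and q_max >= j_len:
        return "inject"    # steps 6-8: room exists to absorb an arriving packet
    if queue_request:
        return "output"    # steps 9-11: serve the first queued request
    return "passive"       # assumption: a blocked injection keeps the switch passive
```

With a 16-word packet waiting for injection and only 8 free words in the
queues, the function falls through to the queued request (or stays Passive),
which is the behaviour that prevents buffer overflow.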
Global Arbiter Algorithm 
Algorithm 3.3 Global arbiter algorithm
1. If no requests, wait;
2. While requests do
3.    Get first request;
4.    If requested output(s) free,
5.       Assign available queue;
6.    Else, Assign random queue;
7. Enddo;
8. Goto 1;
With reference to Algorithm 3.3 the global arbiter operation is as follows:
The global arbiter processes each output request sequentially, beginning with
the request at the head of the request queue. The arbiter examines the output
request and the current state of the queue switches and the output queues in
an attempt to profitably route the requesting packet. If it is not possible to
profitably route the packet, it will be randomly misrouted to any available
output queue. Although it may appear that this approach of immediately
misrouting blocked packets will result in excessive misrouting of packets, the
discussion in Sect. 3.3 and the simulation results of Sect. 3.4 demonstrate
that the careful selection of the switch and output queue sizes prevents this
from occurring.
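A minimal Python sketch of one arbitration step is given below. It assumes
that the state of the queue switches and output queues can be summarised as
a count of free queues per output, and that a packet with several profitable
outputs is assigned to one of the free ones at random. The function name, the
dictionary representation and the None result when nothing is free are
assumptions made for illustration.

```python
import random


def arbitrate(profitable, free_slots, rng=random):
    """Assign one routing request to an output (hypothetical helper).

    profitable: output names that move the packet towards its destination.
    free_slots: dict mapping output name -> number of queues that can still
                accept a packet; the chosen entry is decremented in place.
    Returns (output, misrouted). output is None if no queue is free at all.
    """
    usable = [o for o in profitable if free_slots.get(o, 0) > 0]
    misrouted = not usable
    if misrouted:
        # No profitable output is free: misroute to any output with a free queue.
        usable = sorted(o for o, free in free_slots.items() if free > 0)
    if not usable:
        return None, True
    choice = rng.choice(usable)
    free_slots[choice] -= 1
    return choice, misrouted
```

Processing requests one at a time with this function mirrors the sequential
arbiter: each assignment updates free_slots before the next request is
examined.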
The arbiter algorithm presented here processes each input sequentially.
At first glance it might appear that it would be beneficial to process all
of the inputs simultaneously using a large combinatorial logic circuit, as this
may result in a shorter average time to make routing decisions. However, with
reference to Fig. 3.4, which presents the inputs and outputs for the arbiter
section of a two-dimensional router with only two queues per output port, it
can be seen that this would require the solution to a boolean equation with
31 inputs. The resulting circuit would therefore be cumbersome and slow,
and so a sequential design was used in the simulations of Sect. 3.4.
Figure 3.4: Global arbiter inputs and outputs
Queue Controller Algorithm
Algorithm 3.4 Queue controller algorithm
1. If no packet assigned, wait;
2. Request output;
3. Select assigned port;
4. Llen = Plen;
5. While Llen > 0 do
6.    Load word from input;
7.    Llen = Llen - 1;
8.    Qlen = Qlen + 1;
9. Enddo;
10. Goto 1;
With reference to Algorithm 3.4 the queue controller operation is as follows:
When the queue controller detects that a received packet has been assigned
to it, an output request is immediately made to the output controller and
the length of the packet from the assigned port is loaded into the queue load
count register (Llen). A new word of the packet is loaded into the queue in
each clock cycle, Llen is decremented and the count of the number of words
currently stored in the queue (Qlen) is incremented, until the entire packet
has been received (Llen = 0).
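The bookkeeping of Algorithm 3.4 can be modelled with a small Python class
that tracks Llen and Qlen one word at a time. The class and method names are
assumptions, and Qmax is simplified here to the total free space, treating
the free words as a single contiguous region.

```python
class OutputQueue:
    """Word-level model of one output queue (illustrative only)."""

    def __init__(self, capacity_words: int):
        self.capacity = capacity_words
        self.q_len = 0  # Qlen: words currently stored
        self.l_len = 0  # Llen: words of the current packet still to load

    @property
    def q_max(self) -> int:
        # Assumption: all free space counts as one contiguous region.
        return self.capacity - self.q_len

    def assign(self, p_len: int) -> None:
        """Steps 1-4: a packet of p_len words has been assigned to this queue."""
        if p_len > self.q_max:
            raise ValueError("packet does not fit in the queue")
        self.l_len = p_len

    def load_word(self) -> bool:
        """Steps 5-9 for one clock cycle. Returns True while words remain."""
        if self.l_len == 0:
            return False
        self.l_len -= 1
        self.q_len += 1
        return self.l_len > 0

    def output_word(self) -> None:
        """Step 16 of Algorithm 3.2: one word leaves the queue."""
        if self.q_len == 0:
            raise ValueError("queue is empty")
        self.q_len -= 1
```

Because load_word and output_word act on single words, the two can be
interleaved cycle by cycle, which is the behaviour cut-through routing relies
on.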
3.3 Switch and Buffer Design
The misrouting of packets provides a simple solution to the problem of
deadlock. However, any packets which are misrouted will remain in the network,
requiring channel and buffer resources. This may exacerbate any existing
congestion and result in further misrouting. It is therefore desirable that the
output switch and buffer sizes be selected so that under normal operation
there is a minimal amount of misrouting occurring. Karol et al. [3] and
Yeh et al. [52] have studied in detail the design and performance of systems
employing output queues. However, their analyses have focused on those
systems in which an arriving packet can only select one possible output from
those available, and where the number of inputs, n, approaches infinity. We
extend their work here by examining the switch and buffer requirements for
those cases in which an arriving packet may select from a number of outputs,
and we focus on small values of n, typically 4 or 6. To simplify the following
discussion we assume that all packets are of fixed size.
3.3.1 Switch Evaluation 
Assume that fixed size packets arrive at the n inputs to the k-ary n-cube
router. In each time slot, packet arrival is governed by independent and
identical Bernoulli processes and packets arrive independently at each input
with probability p. Under the assumption of uniform random traffic in a
k-ary n-cube, on average, each packet must traverse σ = k/4 channels in
each dimension and the average distance of a packet, d_ave, is (n × σ). Of the
arriving packets, 1/d_ave are destined for the local processor and therefore the
probability that an arriving packet is destined for one of the queue switches
associated with an output, which we define as α, is equal to p − (p/d_ave).
The probability of i packets arriving at the router inputs, all destined for a
single output queue switch, a_i, has the binomial probabilities
$$a_i = \binom{n}{i} \left[\frac{\alpha}{n}\right]^{i} \left[1 - \frac{\alpha}{n}\right]^{n-i}, \qquad i = 0, 1, 2, \ldots, n \tag{3.1}$$
If the probability of misrouting is very low then most arriving packets will
be profitably routed, i.e. routed towards their destinations. Arriving packets
are therefore equally likely to be destined for only n − 1 of the available
outputs, as the nth output will send the packets in the opposite direction to
which they have just travelled, and thus Eq. 3.1 becomes
$$a_i = \binom{n-1}{i} \left[\frac{\alpha}{n-1}\right]^{i} \left[1 - \frac{\alpha}{n-1}\right]^{n-1-i}, \qquad i = 0, 1, 2, \ldots, n-1 \tag{3.2}$$
Packets arriving at the n router inputs to the k-ary n-cube must compete
for access to the m queues associated with each output, via the queue
switches. If i packets arrive at the inputs at the same time, all destined for
the same output, and i ≤ m, then all requests can be satisfied by the switch.
If i > m, then i − m requests will be rejected and these packets will have to
be misrouted. It follows then that the probability of an output request being
unsuccessful, for the case where a packet can be successfully routed via only
one output, is given by the sum of the probabilities of i > m
$$\Pr(M_{j=1}) = \sum_{i=m+1}^{n-1} \binom{n-1}{i} \left[\frac{\alpha}{n-1}\right]^{i} \left[1 - \frac{\alpha}{n-1}\right]^{n-1-i} \tag{3.3}$$
Extending Eq. 3.3 to the case where a packet can be profitably routed via
more than one output: If i packets arrive at the router inputs at the same
time, each of which can be profitably routed via j outputs, and i ≤ jm, then
all of the requests can be satisfied by the switches. If i > jm, then i − jm
requests will be rejected and these packets will have to be misrouted. The
probability that i > jm is
$$\Pr(i > jm) = \sum_{i=jm+1}^{n-1} \binom{n-1}{i} \left[\frac{\alpha}{n-1}\right]^{i} \left[1 - \frac{\alpha}{n-1}\right]^{n-1-i} \tag{3.4}$$
In order to evaluate the effect of allowing packets to request more than
one output, we need to determine s_j, the fraction of arriving packets with
j dimensions still to traverse, where 0 ≤ j ≤ n. To calculate s_j, we need
to determine the distance distribution for newly generated packets. This
is given by the number of ways in which the n-tuple describing the total
distance to travel in each dimension, (c_0, c_1, c_2, ..., c_{n-1}), can be arranged so
that the sum c_0 + c_1 + c_2 + ... + c_{n-1} is equal to the distance to travel, d_g, where
0 ≤ c_l ≤ k/2 for all l = 0, 1, 2, ..., n − 1. The number of solutions for the
equation c_0 + c_1 + c_2 + ... + c_{n-1} = d_g, which we define as W_dg, is given by the
coefficient of x^(d_g) in the generating function f(x) = (1 + x + x^2 + ... + x^(k/2))^n,
0 < d_g ≤ nk/2. Table 3.1 shows the distance distribution of newly generated
packets, their corresponding 2-tuples and W_dg for an 8-ary 2-cube.
Table 3.1: 2-tuples defining total distance to travel and W_dg for packets in an
8-ary 2-cube
d_g   j = 1          j = 2                    W_dg
1     (0,1)(1,0)     -                        2
2     (0,2)(2,0)     (1,1)                    3
3     (0,3)(3,0)     (1,2)(2,1)               4
4     (0,4)(4,0)     (1,3)(3,1)(2,2)          5
5     -              (1,4)(4,1)(2,3)(3,2)     4
6     -              (2,4)(4,2)(3,3)          3
7     -              (3,4)(4,3)               2
8     -              (4,4)                    1
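The coefficients W_dg can be obtained by multiplying out the generating
function numerically. The short Python sketch below does this by repeated
polynomial multiplication. The function name distance_counts is an assumption
introduced for illustration, and k is assumed to be even.

```python
def distance_counts(k: int, n: int) -> list:
    """Coefficients of (1 + x + ... + x^(k/2))^n.

    Entry d of the result is W_d, the number of n-tuples (c_0, ..., c_{n-1})
    with 0 <= c_l <= k/2 whose components sum to d.
    """
    base = [1] * (k // 2 + 1)  # 1 + x + ... + x^(k/2)
    poly = [1]                 # the constant polynomial 1
    for _ in range(n):
        product = [0] * (len(poly) + len(base) - 1)
        for i, a in enumerate(poly):
            for j, b in enumerate(base):
                product[i + j] += a * b
        poly = product
    return poly
```

For an 8-ary 2-cube, distance_counts(8, 2) returns
[1, 2, 3, 4, 5, 4, 3, 2, 1]; entries 1 to 8 are the W_dg column of Table 3.1
and entry 0 is the single zero-distance tuple (0,0).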
If P_dr is the probability that a packet is at a distance, d_r, from its
destination when it arrives at the input to a router and P(d_g, d_r, j) is the probability
of that packet having j dimensions still to traverse, given that it started with
a distance to travel of d_g, then s_j is given by
$$s_j = \sum_{d_g} \sum_{d_r} P_{d_r}\, P_{(d_g, d_r, j)} \tag{3.5}$$
where 
Pri _ Wdr ー一
凶
2ごWdg
(3.6) 
and P(d_g, d_r, j) can be determined by considering the state transition diagram
for the distance distribution of a given k-ary n-cube. Fig. 3.5 shows the state
diagram used for determining the distance distribution, P_dr, in an 8-ary 2-cube.
The vertices in the figure are the 2-tuples representing the (x, y) distances to
travel and the arcs are the probabilities of a transition from distance (x, y)
to distance (x', y').
Figure 3.5: State diagram for determining the distance distribution in an
8-ary 2-cube
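One way to obtain a distribution of remaining distance at router inputs from
the counts W_dg is sketched below in Python. It rests on assumptions that
are ours rather than stated in the text: every hop reduces the remaining
distance by exactly one (minimal routing, no misrouting), all source and
destination pairs are equally likely, and arrival at the destination router
counts as an arrival at distance 0. The function name is hypothetical.

```python
from fractions import Fraction


def arrival_distance_distribution(w):
    """w[d] is the number of tuples at generated distance d (w[0] is distance 0).

    Returns the probabilities of remaining distance d_r = 0 .. len(w) - 2
    at a router input, as exact fractions.
    """
    max_distance = len(w) - 1
    # A packet generated at distance d_g > d_r passes through remaining
    # distance d_r exactly once on a minimal route.
    visits = [sum(w[d_r + 1:]) for d_r in range(max_distance)]
    total = sum(visits)
    return [Fraction(v, total) for v in visits]
```

With the 8-ary 2-cube counts [1, 2, 3, 4, 5, 4, 3, 2, 1], the function gives
24/100, 22/100, 19/100, 15/100 and 10/100 for remaining distances 0 to 4,
followed by 6/100, 3/100 and 1/100 for distances 5 to 7.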
Using Eq. 3.5 we can predict the probability that a packet arriving at
an input has j dimensions still to traverse. This is important since the
probability that a packet will be misrouted due to contention for a queue or
switch decreases if j is greater than 1, and thus the size of the switches can
be reduced if s_j is large for values of j greater than 1. Table 3.2 presents
the values of s_j, 0 ≤ j ≤ 3, for a 64 node 8-ary 2-cube, a 256 node
16-ary 2-cube and a 512 node 8-ary 3-cube. As can be seen in the table,
the probability that a packet can request 2 or more outputs is approximately
29% for the 8-ary 2-cube, 46% for the 16-ary 2-cube and 53% for the 8-ary
3-cube. We can therefore conclude that, under the assumption of uniform
traffic, as the radix or dimension of a network is increased, the probability
of misrouting due to queue or switch contention decreases.
Table 3.2: Probability of j dimensions remaining to be traversed
        8-ary 2-cube   16-ary 2-cube   8-ary 3-cube
s_0     0.25           0.125           0.167
s_1     0.46           0.413           0.303
s_2     0.29           0.462           0.351
s_3     -              -               0.179
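Finally, the probability of misrouting, for the case where arriving packets
can be profitably routed via j outputs, is given by the sum of the probabilities
of i > jm multiplied by s_j, for all of 0 ≤ j ≤ n
$$\Pr(M_{j \ge 1}) = \sum_{j=1}^{n} s_j \sum_{i=jm+1}^{n-1} \binom{n-1}{i} \left[\frac{\alpha}{n-1}\right]^{i} \left[1 - \frac{\alpha}{n-1}\right]^{n-1-i} \tag{3.7}$$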
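The switch misrouting estimate described above can be evaluated numerically
with the Python sketch below. It is written under explicit assumptions: the
number of packets contending for one output is taken to be binomial over
n − 1 inputs with per-input probability α/(n − 1), the weighting by s_j starts
at j = 1 (packets with j = 0 leave the network), and the function names and
argument layout are hypothetical. The s_j values in the usage note are those
of Table 3.2 for the 16-ary 2-cube.

```python
from math import comb


def arrivals_pmf(alpha: float, n_inputs: int) -> list:
    """Binomial probabilities a_i of i packets contending for one output."""
    trials = n_inputs - 1
    q = alpha / trials
    return [comb(trials, i) * q**i * (1 - q)**(trials - i)
            for i in range(trials + 1)]


def misroute_probability(alpha: float, n_inputs: int, m: int, s: list) -> float:
    """Sum over j >= 1 of s[j] * Pr(i > j*m), with s[j] indexed by j."""
    a = arrivals_pmf(alpha, n_inputs)
    total = 0.0
    for j, s_j in enumerate(s):
        if j == 0:
            continue  # packets with no dimensions left are ejected, not routed
        total += s_j * sum(a[j * m + 1:])  # empty slice when j*m + 1 > n - 1
    return total
```

Calling misroute_probability(0.5, 4, m, [0.125, 0.413, 0.462]) for m = 1 and
m = 2 shows the drop in the estimate obtained by adding a second queue per
output to a two-dimensional router with four inputs.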
Applying Eq. 3.7 we can evaluate how the rate of misrouting increases as
the load applied to a router increases. Figure 3.6 illustrates how the predicted
probability of misrouting varies as a function of the applied load for a single
node in a 16-ary 2-cube, along with results obtained by simulation. As can be
seen in the figure, the predicted results and simulated results remain in close
agreement, indicating that our model is suitable for predicting the switch
performance in networks where multiple outputs are available for routing.
Although the results of Fig. 3.6 are useful in quantifying the amount of
misrouting at a given applied load, any messages which are misrouted will
remain in the network and will require channel and buffer resources which
may result in further misrouting.
Figure 3.6: Probability of misrouting versus applied load for 16-ary 2-cube.
Solid lines are predicted values, points are measurements taken by simulation
3.3.2 Buffer Evaluation 
The output of each port has a set of m queues for temporarily storing packets.
These FIFO queues operate as a single shared buffer for the associated port,
while the output controller ensures that a first-in first-out queuing discipline
is maintained for packets arriving at that output. If no packets are lost in
the queue switches, then in order to select an appropriate buffer size for
each dimension of the router we need to determine the probability that there
exists insufficient space in a queue to satisfy an output request. Assume
again that fixed size packets arrive at the n inputs to the router governed by
independent and identical Bernoulli processes and that the probability of i
packets arriving at a single shared buffer has the binomial probabilities given
in Eq. 3.2. Given the discrete-time Markov chain state transition diagram of
Fig. 3.7, the steady state queue size probabilities can be determined directly
from the Markov chain balance equations [33]
$$\begin{aligned}
q_0 &= \Pr(Q = 0) = \frac{1 - \alpha}{a_0} \\
q_1 &= \Pr(Q = 1) = \frac{(1 - a_0 - a_1)}{a_0}\, q_0 \\
q_n &= \Pr(Q = n) = \frac{(1 - a_1)}{a_0}\, q_{n-1} - \sum_{i=2}^{n} \frac{a_i}{a_0}\, q_{n-i}, \qquad n \ge 2
\end{aligned} \tag{3.8}$$
and it follows that the probability that a queue size is greater than or equal
to some value, L, is the sum of the probabilities of queue lengths greater
than or equal to L
$$\Pr(Q \ge L) = \sum_{i=L}^{\infty} q_i \tag{3.9}$$
Figure 3.7: The discrete-time Markov chain state transition diagram for the
output queue size
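The steady state queue size probabilities can be computed by forward
recursion on the balance equations of a discrete-time queue that receives
i packets per slot with probability a_i and sends at most one. The Python
sketch below assumes exactly that queue model, takes the mean number of
arrivals per slot as the load, and uses hypothetical function names. The tail
probability is computed as one minus the head of the distribution instead of
as an infinite sum.

```python
def queue_size_probs(a: list, n_states: int) -> list:
    """Return [q_0, ..., q_{n_states-1}] for arrival probabilities a[i].

    Model assumption: Q_next = max(0, Q + A - 1), with A = i having
    probability a[i] and a[i] = 0 beyond the end of the list.
    """
    def a_at(i: int) -> float:
        return a[i] if i < len(a) else 0.0

    load = sum(i * a_i for i, a_i in enumerate(a))  # mean arrivals per slot
    q = [(1.0 - load) / a[0]]
    if n_states > 1:
        q.append((1.0 - a[0] - a_at(1)) / a[0] * q[0])
    for n in range(2, n_states):
        value = (1.0 - a_at(1)) / a[0] * q[n - 1]
        value -= sum(a_at(i) / a[0] * q[n - i] for i in range(2, n + 1))
        q.append(value)
    return q


def tail_probability(q: list, limit: int) -> float:
    """Pr(Q >= limit), computed as 1 - sum of q_0 .. q_{limit-1}."""
    return 1.0 - sum(q[:limit])
```

For three contending inputs, each sending to the queue with probability 0.2,
a = [0.512, 0.384, 0.096, 0.008] and the recursion starts from
q_0 = 0.4/0.512 = 0.78125.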
As packets are permitted to request more than one output, the probability
of misrouting is given by the sum of the probabilities that a packet has j
dimensions still to traverse, 0 ≤ j ≤ n, multiplied by the probability that
the queue sizes of the requested queues exceed L, raised to the jth power
$$\Pr(M) = \sum_{j=1}^{n} s_j \left[\Pr(Q \ge L)\right]^{j} \tag{3.10}$$
Applying Eq. 3.10, we can evaluate the probability of misrouting if the
queue sizes are fixed at L packets. Figure 3.8 illustrates how the probability
of misrouting due to buffer overflow varies as a function of the applied load
for queue sizes of 2, 4 and 8 packets in a 16-ary 2-cube, along with results
obtained by simulation of a single router. As can be seen in the figure,
the results predicted by the Markov chain model remain in close agreement
with the simulation results, except at high applied loads where the Markov
approximation overestimates the overflow rate.
Figure 3.8: Performance of output queues. Solid lines are predicted values,
points are measurements taken by simulation
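The weighting of the queue overflow probability by s_j, as described in the
text on buffer evaluation, can be written as a short Python function. This is
a sketch under assumptions: the overflow probability Pr(Q ≥ L) is supplied by
the caller, the j requested queues are treated as independent (hence the jth
power), and packets with j = 0 are excluded from the sum. The function name
is hypothetical.

```python
def buffer_misroute_probability(s: list, overflow: float) -> float:
    """Sum over j >= 1 of s[j] * overflow**j, with s[j] indexed by j.

    overflow is the probability that a single requested queue already holds
    L or more packets.
    """
    if not 0.0 <= overflow <= 1.0:
        raise ValueError("overflow must be a probability")
    return sum(s_j * overflow**j for j, s_j in enumerate(s) if j >= 1)
```

With the 16-ary 2-cube values s = [0.125, 0.413, 0.462] from Table 3.2 and
an overflow probability of 0.1, the packets that can choose between two
queues contribute only 0.462 × 0.01 to the total, against 0.413 × 0.1 for
those with a single choice.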
3.4 Performance 
In this section, we evaluate the performance of the Tokkyu router under a
variety of traffic conditions by simulation. The simulator is a C++ program
with a graphical user interface and includes a dynamic display of the
simulation progress. The simulator supports programmable network size, buffer
size, routing algorithm, traffic pattern and packet length, as shown in the
dialog for setting the simulation variables of Fig. 3.9. In addition to this
there is a test mode which can be used to verify the routing algorithms,
buffer assignments and the correct operation of the simulator. This is
illustrated in Fig. 3.10, where node (0,0) is sending a single, 16-word packet to
node (15,15) in a 2D mesh. As can be seen in the figure, the route taken by
the packet is minimal and fully adaptive. All of the nodes of the simulator
operate synchronously and a word is transferred between nodes in a single
clock cycle. Figures 3.11, 3.12 and 3.13 illustrate snapshots of the simulation
display for random, hot-spot and fault simulations respectively. Each square
in the display windows +X Load, -X Load, +Y Load and -Y Load represents
the buffer load for the given dimension of the corresponding router, while
the display window Ave. Load shows the average load of the buffers of the
corresponding router. The display has proved invaluable in the development
of the simulator, as well as providing insight into the results obtained.
Figure 3.9: Dialog for setting simulation variables
Figure 3.10: (a) Simulation display showing test mode (b) Simulation display
key
Figure 3.11: (a) Simulation display showing random simulation (b) Simulation
display key
Figure 3.12: (a) Simulation display showing hot-spot simulation (b) Simulation
display key
Figure 3.13: (a) Simulation display showing fault simulation (b) Simulation
display key
Network performance under uniform random traffic, hot-spot traffic and
traffic in the presence of router faults has been simulated. Simulations were
all performed with two-dimensional tori (16-ary 2-cubes) and a packet size of
16 words. In order to accurately model the performance of a practical router
we have fixed the uncongested routing latency of each router at 4 cycles. The
assumed cycle-by-cycle operation of the router is as follows: the header of a
packet entering the router will be decoded and a routing request made in the
first clock cycle. The routing decision will be made and an output assigned
in the second and third cycles, and the header will be updated and sent to the
output in the fourth cycle. This is typical of current generation routers [8].
Packets using the packet expressway only require that the header be checked
for a value of zero, indicating that the packet has completed routing in the
current dimension. Therefore the packet expressway has a latency of only
one cycle. In all instances, collection of results was not initiated until the
latency and throughput measurements of the network under test had reached
a steady state. In the presentation of the results, the applied network load
of the networks has been normalized such that full load corresponds to all of
the network channels transmitting simultaneously.
3.4.1 Simulation of Uniform Random Traffic 
In order to evaluate the performance of the network under uniform random
traffic, a constant rate source with exponential interarrival times was applied
to each input, and the time from the creation of the first word of the packet
until the last word of the packet is accepted at the destination was measured.
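A source of this kind can be sketched in Python with the standard library
random module. The function name, the use of packets per clock cycle as the
rate unit and the explicit time horizon are assumptions made for
illustration. The simulator described in the text is a C++ program whose
source is not reproduced here.

```python
import random


def packet_creation_times(rate: float, horizon: float, rng: random.Random) -> list:
    """Creation times of packets from one source, up to the time horizon.

    rate is the mean number of packets created per clock cycle and must be
    positive. Gaps between packets are drawn from an exponential distribution
    with mean 1 / rate.
    """
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return times
        times.append(t)
```

Latency for each packet would then be measured from its creation time to the
cycle in which its last word is accepted at the destination, as described
above.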
Figures 3.14 and 3.15 present the predicted and simulated misrouting
rates in a 16-ary 2-cube for varying switch and queue sizes respectively. In
these simulations a packet requesting more than one output was randomly
assigned to one of those outputs available to it. The simulation result for a
queue switch size of 4:1 in Fig. 3.14 is initially higher than the predicted
result, due to the higher traffic present in the network as a result of misrouting.
At 30% applied load the measured network load is 45% and the misrouting
rate is 13.4%. At approximately 35% applied load the extra traffic produced
by misrouting causes network operation to become unstable and results in a
misrouting rate of 50%. A switch size of 4:2 is sufficient to maintain stable
network operation and the simulation and predicted results remain in close
agreement.
Figure 3.14: Performance of queue switches for 256 node 16-ary 2-cube. Solid
lines are predicted values, points are measurements taken by simulator
The predicted misrouting due to buffer contention in Fig. 3.15 overestimates
the measured rate for buffer sizes of 2, 4 and 8 packets. All of the
simulations remained stable, with the misrouting rate rising steadily as the
applied load was increased. A minimum buffer size of only 2 packets per
port is sufficient to guarantee stable network operation. Figures 3.16 and
3.17 show the average packet latency and network throughput as a function
of applied network load respectively, for a 256 node 16-ary 2-cube and a
number of different switch and buffer configurations. With a single switch
output and buffering for one packet per port, (m=1,L=1), the misrouted
traffic causes the network to saturate at 35% applied load, and the throughput
is reduced to just 3%. Increasing the number of switch outputs to two and
the buffer size to two packets, (m=2,L=2), gives a significant improvement
in performance with a saturation throughput of 80%. Increasing the switch
and buffer sizes to three outputs and three packets respectively, (m=3,L=3),
further increases the saturation throughput to 90%, while further increases
in buffer size give diminishing returns. This is highlighted by the plot for a
switch size of three outputs and a buffer size of 16 packets, which saturates
at 95% throughput.
Figure 3.15: Performance of output queues for 256 node 16-ary 2-cube. Solid
lines are predicted values, points are measurements taken by simulator
Figure 3.16: Latency versus offered traffic for a 256 node 16-ary 2-cube under
uniform random traffic
Figure 3.18 illustrates the effectiveness of the packet expressway by
comparing a network in which packets make use of the packet expressway with a
network in which all packets are forced to pass through the core of the router.
The average latency of packets in the network which utilizes the packet
expressway is reduced significantly when compared to the network in which the
packet expressway is disabled. This decrease in latency occurs at all applied
loads and varies from a maximum of 43%, which occurs at 10% applied load,
to 23% at an applied load of 95%. The maximum throughput of the network
utilizing the packet expressway is also slightly higher, 95% versus 92%, due
to packets in the network spanning a greater number of channels at any given
time.
Figure 3.17: Throughput versus offered traffic for a 256 node 16-ary 2-cube
under uniform random traffic
Figure 3.18: Latency and reduction in latency versus applied load under
uniform random traffic with packet expressway enabled and disabled
Figure 3.19: Latency versus offered traffic for a 256 node 16-ary 2-cube under
bit reversal traffic
3.4.2 Simulation of Hot-spot Traffic
Adaptive routing allows better utilization of communication resources,
especially at high network loads or in the presence of hot-spot traffic. One
method of generating large imbalances in the channel loads within a network
is to apply bit-reversal traffic. Under bit-reversal traffic, each node, p, sends
packets to node q, where the address of node q is the bit reversal of the
address of node p. For example, node 27 (hexadecimal) in our 16-ary 2-cube
sends messages to node E4 (hexadecimal). Figures 3.19 and 3.20 present the
average packet latency and network throughput as a function of applied network
load respectively, for a 256 node 16-ary 2-cube under bit-reversal traffic. The
maximum throughputs for (m=2,L=2) and (m=3,L=3) are 63% and 65% respectively.
Increasing the applied traffic rate past these points results in a decrease in
the throughput to 56% and 58%. Increasing the buffer size to 16 packets results
in an increase in latency prior to saturation, due to packets queueing in the
larger buffers, and an increase in the maximum throughput to 70%.
Figure 3.20: Throughput versus offered traffic for a 256 node 16-ary 2-cube
under bit reversal traffic
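The bit-reversal mapping from source to destination address can be written
directly in Python. The function name and the explicit address width argument
are assumptions. For the 256 node network, node addresses are taken to be 8
bits wide.

```python
def bit_reverse(address: int, bits: int) -> int:
    """Return the address whose bits are those of `address` in reverse order."""
    reversed_address = 0
    for _ in range(bits):
        # Shift the result left and append the lowest remaining source bit.
        reversed_address = (reversed_address << 1) | (address & 1)
        address >>= 1
    return reversed_address
```

For the example in the text, bit_reverse(0x27, 8) maps 0010 0111 to
1110 0100, which is E4 in hexadecimal. Applying the function twice returns
the original address, so the traffic pattern pairs nodes with each other.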
3.4.3 Simulation of Traffic in the Presence of Faults 
The correct operation of the router requires that the aggregate input and
output data rates remain balanced. Failure of a single channel of a router
will require that the in-degree of the router be reduced by one to maintain
the balance in data rates. In Fig. 3.21, the +X channel of router (4,5) has
failed and so it is bypassed, creating a connection between nodes (3,5) and
(5,5). Depending upon the nature of the fault it may be possible to use the
packet expressway of node (4,5) to provide the bypass path.
Figure 3.21: Faulty node is bypassed
Figures 3.22 and 3.23 present the performance of a Tokkyu network in
the presence of faults. The network on which the faults were simulated had
switch and buffer sizes of two outputs and two packets respectively, a
constant applied load of 50% and uniform random traffic. Ten fault simulations
were carried out, each with randomly generated fault sets, and the results
were averaged to produce Fig. 3.22 and Fig. 3.23. The network performance
degraded only slightly, even with 10% of the available channels faulty, as can
be seen in the figures. There was only a 26% increase in the packet latency
from a fault-free network to a network with 10% faulty channels and the
throughput remained fixed at approximately 50%.
[Figure 3.22 plot: series m=2, L=2; x-axis: Percent faulty channels, 0% to 10%]
Figure 3.22: Average latency versus percent faulty channels at 50% applied
load (m=2, L=2). Mean latency averaged over ten random fault sets
[Figure 3.23 plot: series m=2, L=2]
Figure 3.23: Throughput versus percent faulty channels at 50% applied load
(m=2, L=2). Mean throughput averaged over ten random fault sets
3.4.4 Discussion of Results 
A. A. Chien has illustrated the hazards of making comparisons between
different router implementations based on channel utilization and latency
without considering the important effects of implementation complexity [8].
The effect of these factors is difficult to quantify without simulation at
the gate level or actual implementation of the router. We can, however,
highlight a number of features of the Tokkyu router when compared to other
similar implementations. The predicted low load throughput and latency of
Tokkyu are as good as or better than the published performance of virtual
channel based oblivious routers [50, 48, 16], due to the low latency path
provided by the packet expressway. In networks experiencing high load,
hot-spots or fault conditions, small Tokkyu routers, (m=2,L=2) or (m=3,L=3),
have a clear throughput and latency advantage over oblivious routers. The
predicted latency and throughput performance of the Tokkyu router with a
small number of buffers, (m=2,L=2) and (m=3,L=3), also closely matches or
exceeds the throughput and latency performance reported for the adaptive
Dally/Aoki router, with 16 virtual channels per physical channel and a
similar amount of total buffer space. These results are encouraging, as many
of the routers which make use of virtual channels to implement adaptivity
require large crossbars and complex arbitration, which contribute to their
size and complexity. The use of virtual channels is also expensive in terms
of latency and cycle time [8]. However, as the Tokkyu router must buffer
complete packets when output contention occurs, it requires the use of
comparatively short packets, i.e. less than 32 bytes. The cost of message
disassembly for transmission and reassembly at the destination, along with
the cost of potentially larger packet headers, would have to be included in
the latency and throughput measurements to make a direct comparison with
virtual channel routers.
Both the Chaos router and the Ngai/Seitz router have similar architectures
to the Tokkyu router and thus a more accurate comparison can be
made between them. The simulated performance characteristics of these
two routers are again similar to the results reported here. The low latency
register-insertion ring formed by the packet expressway allows the Tokkyu
router to achieve lower packet latency than the Chaos and Ngai/Seitz routers,
especially at low network load. The packet expressway achieves lower latency
in a manner similar to the Express Cubes proposed by Dally [14]. However,
unlike Express Cubes, the packet expressway does not require additional
interchanges and wiring, thereby simplifying the network design and
implementation. The simple routing decisions made by the Tokkyu router, which
are made using only the message header and the current buffer and switch
information of the router, will allow for simpler arbiter implementation and
therefore faster operation. The simulation results demonstrate that, for a
16-ary 2-cube, two or three queue switch outputs, each with sufficient buffer
space for a single packet, are sufficient for a low probability of misrouting,
low latency and high throughput.
Chapter 4
Restricted-length Hardware
Multicasting in Multicomputer
Networks
We begin this chapter by carrying out an in-depth investigation into multicast
deadlock in wormhole routed communication networks. This is followed by
a presentation of a hybrid virtual cut-through/wormhole routing method
for the effective distribution of broadcast and multicast messages in MPP
system networks, called restricted-length multicasting [27]. This method uses
a single enlarged flit buffer per physical communications channel to provide
virtual cut-through routing for multicasts at the nodes where the message is
replicated, thus preventing deadlock.
4.1 Preliminaries 
4.1.1 Definition of Multicast Deadlock Problem 
Multicast deadlock will now be examined in detail using a graph theoretical
approach. Any graph theoretical terminology not defined here may be found
in [10, 39]. In the following discussion we make the following assumptions:
1. There are no cycles in the channel dependency graph of the unicast
routing algorithm, i.e. unicast is deadlock free¹.
2. There are no cycles in the channel dependency graph of the multicast
routing algorithm.
3. A destination node will eventually consume a message.
Let the set of nodes, M = {n_0, n_1, ..., n_{k-1}} ⊆ N, be referred to as the
multicast set, M, with k - 1 destinations. Let n_0 be the source node and
D = {n_1, ..., n_{k-1}} be the destination nodes of the multicast set, and let P
be the number of nodes in the set N(G). A unicast is therefore a multicast
with k = 2, and a broadcast is a multicast with k = P.
Definition 6 The multicast routing function R_m: N x N → C maps the
current node, n_c, and the destination nodes, n_d ∈ D, to the next channel(s),
c_n, for the routes from n_c to n_d ∈ D.
Definition 7 The resource tree of the multicast set M is the rooted subtree,
RT(N, C) of G(N, C), which has n_0 as the root, and where N(RT) ⊂ N(G)
and C(RT) ⊂ C(G). The vertices N(RT) and the arcs C(RT) are the nodes
and channels of the interconnection network respectively, and are defined by
R_m for the multicast set M. The resource tree of a unicast is therefore a
rooted tree RT(N, C) with only one branch.
¹ For a complete discussion of channel dependency and deadlock avoidance for
unicast messages in wormhole routed networks refer to [18].
Let L be the length of the multicast packet P_m in flits, and B_d be the
depth of a flit buffer in node n_c. If node n_c ∈ RT contains the tail of the
multicast packet P_m in one of its flit buffers, then the concurrent resource
tree of RT at time t_a is the rooted subtree, CT(N, C) of RT(N, C), whose
root is n_c. The nodes N(CT) ⊂ N(RT) and arcs C(CT) ⊂ C(RT) are the
set of resources which are required concurrently, before the tail of P_m can
leave n_c. The path length of a vertex in CT is defined as the number of edges
from n_c to the vertex. The height of the tree CT, denoted H(CT), is the
maximum of the path lengths in CT, and the number of nodes in the path
of maximum length is equal to H(CT) + 1. If {L/B_d ≥ H(CT) + 1} then
CT = RT, and if {H(CT) = 1} then either L ≤ B_d, or n_c is adjacent to the
destination nodes n_d ∈ D. Let CT(N, C) and CT'(N, C) be the concurrent
resource trees of RT(N, C) and RT'(N, C) respectively. The intersection of
two concurrent resource trees, I = CT ∩ CT', is given by N(CT) ∩ N(CT') and
C(CT) ∩ C(CT'). The number of components of I, denoted ω(I), is defined
as the number of connected subgraphs of I that are not contained in any
other connected subgraph of I, and let ω(I_c) be the number of components
of I whose degree is ≥ 1. If a component in I has a degree equal to zero, then
the component consists of a single vertex, with no incident arcs.
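The height of a concurrent resource tree can be computed directly from its
definition. The following Python sketch is illustrative only: the dictionary
representation of the tree and the function names height and
packet_spans_tree are assumptions made here, and the length condition is
read as L/B_d >= H + 1.

```python
from typing import Dict, Hashable, List

# A rooted tree stored as: node -> list of child nodes (assumed representation).
Tree = Dict[Hashable, List[Hashable]]


def height(tree: Tree, root: Hashable) -> int:
    """Maximum number of edges on any path from `root` to a vertex."""
    children = tree.get(root, [])
    if not children:
        return 0
    return 1 + max(height(tree, child) for child in children)


def packet_spans_tree(length_flits: int, buffer_depth: int,
                      tree: Tree, root: Hashable) -> bool:
    """True when L / B_d >= H + 1 for the tree rooted at `root`."""
    return length_flits >= buffer_depth * (height(tree, root) + 1)


# Hypothetical tree: a -> b -> d and a -> c, so the height is 2.
example = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(height(example, "a"))
```

With a buffer depth of 4 flits, a 12 flit packet would occupy a buffer at
each of the three levels of this example tree, whereas an 8 flit packet
would not.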
Theorem 1 Let R_s = {CT_0, CT_1, ..., CT_{ℓ-1}} be the set of concurrent
resource trees for ℓ concurrent multicasts at time t_a. Deadlock due to the
concurrent allocation of resources may occur if and only if the following
conditions apply:
$$\forall (RT_i, RT_j) \in R_s,\ \exists \{ I \mid (I = RT_i \cap RT_j \neq \emptyset),\ (0 \le i \le \ell,\ 0 \le j \le \ell,\ i \neq j) \} \qquad (4.1)$$
$$\forall (I \neq \emptyset),\ \exists \{ \omega(I_c) \mid \omega(I_c) \ge 2 \} \qquad (4.2)$$
Proof: ⇐
Let CT and CT' be two concurrent resource trees in R_s, where I = CT ∩ CT'.
1.1 If I = ∅, no concurrent resources are shared by the concurrent multicasts
in R_s. Therefore assumptions 1, 2 and 3 are sufficient to guarantee
deadlock avoidance.
1.2 If I ≠ ∅ and ω(I_c) = 0, it follows that N(CT) ∩ N(CT') ≠ ∅ and
C(CT) ∩ C(CT') = ∅. Therefore only node resources are shared by
concurrent multicasts, and assumptions 1, 2 and 3, together with a fair
local arbitration scheme, are sufficient to guarantee deadlock avoidance.
1.3 If I ≠ ∅ and ω(I_c) = 1 then there is a single rooted subtree in I,
which we denote S_u. Let n_u be the root node of S_u, with output
ports P_u, and c_u ∈ P_u be the output channels of n_u defined by R_m for
the packets associated with CT and CT'. If at time t_a, n_u allocates
all of the output channels c_u to the packet associated with CT, then
CT' will remain blocked until the packet associated with CT releases
its resources. Thus, only one multicast is given access to the resources
below n_u, and assumptions 1, 2 and 3, together with a fair local
arbitration scheme, are sufficient to guarantee deadlock avoidance.
⇒
2.1 If I ≠ ∅ and ω(I_c) = 2 then there are two rooted subtrees in I that
are required concurrently by CT and CT', which we denote S_u and S_v.
Let n_u and n_v be the root nodes of S_u and S_v, and c_u ∈ P_u and c_v ∈ P_v
be the output channels of n_u and n_v defined by R_m for the packets
associated with CT and CT' respectively. If at time t_a, n_u allocates
c_u ∈ P_u to the packet associated with CT and n_v allocates c_v ∈ P_v to
the packet associated with CT', a concurrent allocation of dependent
resources has occurred, and a deadlock situation has been reached.
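The quantity ω(I_c) used in the proof can be computed mechanically for a pair
of concurrent resource trees. The Python sketch below is illustrative only:
it assumes each tree is given as a set of nodes and a set of directed
channels (from_node, to_node), and the function names are hypothetical. It
counts the connected components of the intersection that contain at least
one channel and reports a possible deadlock when that count is two or more.

```python
from typing import Hashable, Set, Tuple

Node = Hashable
Channel = Tuple[Node, Node]  # directed arc (from_node, to_node)


def components_with_arcs(nodes_a: Set[Node], chans_a: Set[Channel],
                         nodes_b: Set[Node], chans_b: Set[Channel]) -> int:
    """Number of components of the intersection that contain a channel."""
    shared_chans = chans_a & chans_b
    shared_nodes = nodes_a & nodes_b
    # Make sure both end points of every shared channel are present.
    for u, v in shared_chans:
        shared_nodes = shared_nodes | {u, v}

    # Union-find over the shared nodes, joined by the shared channels.
    parent = {n: n for n in shared_nodes}

    def find(n: Node) -> Node:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in shared_chans:
        parent[find(u)] = find(v)

    # Components consisting of a single vertex with no arcs are not counted.
    return len({find(u) for u, _ in shared_chans})


def deadlock_possible(nodes_a, chans_a, nodes_b, chans_b) -> bool:
    """True when two or more shared components contain channels."""
    return components_with_arcs(nodes_a, chans_a, nodes_b, chans_b) >= 2


# Hypothetical trees rooted at "a" and "x" sharing channels (b,c) and (d,e).
ct_nodes = {"a", "b", "c", "d", "e"}
ct_chans = {("a", "b"), ("b", "c"), ("a", "d"), ("d", "e")}
ct2_nodes = {"x", "b", "c", "d", "e"}
ct2_chans = {("x", "b"), ("b", "c"), ("x", "d"), ("d", "e")}
print(deadlock_possible(ct_nodes, ct_chans, ct2_nodes, ct2_chans))
```

In this example the two trees share two disjoint subtrees, which is the
situation described in case 2.1.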
Corollary 1 Deadlock due to the concurrent allocation of resources cannot
occur in a network employing virtual cut-through routing.
Proof: ⇐
3.1 By definition, the length of a packet in a network employing virtual
cut-through is L ≤ B_d. H(CT) is therefore equal to 1, and ∀(I ≠ ∅),
ω(I_c) ≤ 1. Thus by proofs 1.1 and 1.2, multicast is free of deadlock
due to the concurrent allocation of resources.
Figures 4.1(a) and (b) illustrate a multicast and its associated concurrent
resource trees respectively for virtual cut-through routing. In Figure 4.1(b),
∀(I ≠ ∅), ω(I_c) ≤ 1 and therefore deadlock cannot occur due to the
concurrent allocation of resources. If a single channel of a branch in the
restricted-length multicast tree becomes blocked, it will not result in the
rest of the tree holding channel resources, as is the case in conventional
tree-based multicasting.
Figure 4.1: (a) Multicast by node (2，1) and (b) the resulting concurrent 
resource trees 
4.2 Restricted-Length Multicasting 
Rather than restricting the branching of a multicast, we propose
restricted-length multicasting, in which the packets of a multicast message
are restricted in length so that they are routed in a virtual cut-through
manner in a network which usually supports wormhole routing. Messages are
usually divided into one or more packets at the source, prior to injection
into the network. Thus, in order to implement restricted-length multicasting
in a network which normally supports wormhole routing, the source node must
divide a multicast message into packets of length L ≤ B_d. By ensuring that
a flit buffer is sufficiently large to hold a complete multicast packet, or
that a packet is sufficiently small to fit in a single buffer, it is
therefore possible to implement deadlock free multicasting utilizing existing
routing algorithms such as dimension order, or e-cube, routing. As has been
previously stated, each router must also implement a fair local arbitration
scheme to prevent multicast packets from indefinitely holding output port
resources while waiting for others to become free. However, as this requires
only local information, a simple timeout and resource release scheme will be
sufficient to avoid deadlock.
The buffers of most current generation routers, which employ wormhole
routing, can only store one or two flits each. As the header information
for a single packet is typically also one or two flits in length, it would be
impractical to implement restricted-length multicasting on these systems. A
simple solution would be to increase the size of the buffers so that a complete
packet could be stored in each buffer, thus implementing virtual cut-through
routing. However, this would significantly increase the size of the message
router, which would complicate its design and result in lower performance.
Another approach would be to increase the size of a single buffer so that it
can hold an entire packet. While this approach is preferable to increasing the
size of all of the buffers, the size of a buffer capable of storing the maximum
length packet employed in the system may still be prohibitively large. Our
proposed approach is therefore to increase the size of a single buffer, while
restricting the length of multicast packets. When a packet appears at the
input to a router, a single bit in the header indicates whether the message is
a multicast or a unicast. If the multicast bit is set, then the message must
request the enlarged buffer, while unicast messages are free to be placed in
any available buffer.
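The buffer selection rule described above can be sketched in Python as
follows. This is a behavioural illustration only, not the router logic
itself: the function name select_buffer, the use of index 0 for the enlarged
buffer, and the choice to try the ordinary buffers first for unicast packets
are all assumptions made here.

```python
from typing import List, Optional

ENLARGED_BUFFER = 0  # assumed index of the single enlarged buffer


def select_buffer(is_multicast: bool, buffer_free: List[bool]) -> Optional[int]:
    """Return the index of the buffer to load, or None if the packet must wait."""
    if is_multicast:
        # A multicast packet may only use the enlarged buffer.
        return ENLARGED_BUFFER if buffer_free[ENLARGED_BUFFER] else None
    # A unicast packet may use any free buffer. Ordinary buffers are tried
    # first (an assumption) so the enlarged one stays free for multicasts.
    for index, free in enumerate(buffer_free):
        if free and index != ENLARGED_BUFFER:
            return index
    return ENLARGED_BUFFER if buffer_free[ENLARGED_BUFFER] else None


print(select_buffer(True, [True, False, False, False]))
```

Trying the ordinary buffers first is one possible policy; the text only
requires that multicast packets use the enlarged buffer.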
[Figure 4.2 diagram. Labels: Request Registers, Multicast Request, To
Multicast Controller, Channel Arbiters. Legend: L0-L3: virtual channel
buffers; LD0-LD3: virtual channel load; RD0-RD3: virtual channel read]
Figure 4.2: Organization of a single MEGA router input 
4.2.1 Gate-array Implementation 
A number of researchers and commercial enterprises have developed hardware
routers for use in multicomputer networks in recent years [17, 19, 50]. These
have typically been implemented using full custom VLSI techniques, which
have enabled them to achieve high throughput and low switching latency.
However, a number of advantages exist in taking a semi-custom approach to
the design. These include a shorter design time, lower production costs for
small volumes of devices, and well established design and simulation tools
[36].
We are therefore undertaking the design of a MEssage passing Gate-Array
(MEGA) router [25], using a 1.2 μm CMOS gate array. The design
tools available include schematic capture, design rule checking, functional
simulation and critical path analysis. Our second prototype router design has
four virtual channel buffers per port, which are 16 bits wide and typically 4
words deep, and the router contains 10 uni-directional ports which are formed
into 5 bi-directional pairs. The minimum requirement for the implementation
of restricted-length multicasting is that a single packet of a multicast can be
accommodated in a virtual channel buffer. As the header of each packet in our
system requires 4 bytes, this would result in only 4 message bytes per packet
for multicasting. To increase the ratio of message information to header
information we have enlarged a single flit buffer per physical communications
channel, labeled L0 in Fig. 4.2. The buffer load lines (LD0-LD3) are operated
by the input control to load a virtual channel buffer in response to a request
on the input control lines. The input control section also controls the request
register associated with each virtual channel, placing a new request in a
register whenever a new packet is received. These requests are passed to
the appropriate arbiters via the request and the select lines. Once a packet
has been passed to an output, the output control section (not shown) will
assert the reset lane line to indicate that the lane is now free. The output
controllers are also responsible for asserting the virtual channel read lines
(RD0-RD3), once for each word which is read.
The basic unit for the implementation of digital logic within a gate array
is the Basic Cell (BC or cell). Each BC is typically implemented as two pairs
of P-channel and N-channel transistors, and the logical function performed
by each basic cell is determined by the metalization pattern assigned to it. A
user creates a design using Unit Cells (UCs), such as NAND gates, flip-flops
and shift registers, by interconnecting them using wiring networks (nets), and
this design is then mapped to the gate array by the design software. Most
UCs are made up of a number of BCs and thus these also require
interconnection by nets. Terminals are used to provide the connections between
BCs and nets, and also between nets on different metalization layers within the
device. Although current gate array devices offer BC counts of more than
100,000 cells, the number of nets and terminals can significantly reduce the
maximum utilization of these cells. In order to evaluate the effect on the gate
array implementation of our router of enlarging a single virtual channel
buffer in each input port, we examined the increase in the cell, terminal
and net counts for varying sizes of virtual channel buffer. These results are
presented in Table 4.1, which gives the cell, terminal and net counts for the
input section of a single port. In each case the number of virtual channels
is fixed at four, and the size of one lane is increased from 4 to 16 words in
depth. As can be seen in the table, increasing the size of a single buffer per
port from 4 to 16 words results in a considerable increase in the number of
cells, nets and terminals. However, this increase is significantly less than that
which would occur if the size of all of the buffers was to be increased.

Buffer Size                            Cell Count   Terminals   Nets
4 lanes x 4 words                      2119         3709        734
3 lanes x 4 words, 1 lane x 8 words    2512         3351        847
3 lanes x 4 words, 1 lane x 16 words   3440         6408        1322
Table 4.1: Resource usage for various buffer structures
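The relative cost of enlarging one lane from 4 to 16 words can be read off
Table 4.1 with a short calculation. The Python sketch below simply copies
the first and last rows of the table; the variable and function names are
illustrative.

```python
# (cells, terminals, nets) for the input section of one port, from Table 4.1.
baseline = {"cells": 2119, "terminals": 3709, "nets": 734}   # 4 lanes x 4 words
enlarged = {"cells": 3440, "terminals": 6408, "nets": 1322}  # 1 lane x 16 words


def percent_increase(before: int, after: int) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


increase = {key: percent_increase(baseline[key], enlarged[key]) for key in baseline}
for name, value in increase.items():
    print(f"{name}: +{value:.1f}%")
```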
4.3 Simulation
4.3.1 Multicast Latency
In order to evaluate the potential benefits of utilizing restricted-length
multicasting we have implemented a simulator, which determines the latency of
sending a multicast from a single node, based upon the design parameters of
our message router. In our simulations we therefore assume a 2D mesh topology
with 16 bit data paths, a header length (L_h) of a unicast message of 4
bytes, and that the standard size of a flit buffer is 4 double-byte words. Two
bytes of the header contain the destination address, while the remaining two
bytes contain the packet length, sequence information etc. Given a multicast
message of length L_m bytes, the latency of sending a multiple-unicast
based multicast to N destinations is given by:
$$D_u = \sum_{i=1}^{N} \frac{(L_m + L_h)\, D_{flit}(i)}{L_{flit}} \qquad (4.3)$$
where D_flit is the average delay in sending each flit to destination i and
L_flit is the size of a flit buffer. Figure 2.17 illustrated that, in a 2D mesh,
a multi-path multicast message is broadcast by sending four copies of the
message on individual multicast paths. The header appended to each copy of
the message must contain a list of all of the destination addresses. Assuming
that, as in the case of a unicast, each destination address requires 2 bytes, and
that 2 additional bytes of status information are appended to the header, the
average number of bytes per header for a multi-path message being broadcast
to N destinations in a 2D mesh is given by
$$\bar{L}_h = \left( \frac{N}{2} + 2 \right) \qquad (4.4)$$
and the send latency for a multi-path based multicast with four paths is
therefore
$$D_{mp} = \sum_{i=1}^{4} \frac{(L_m + \bar{L}_h)\, D_{flit}(i)}{L_{flit}} = \sum_{i=1}^{4} \frac{\left(L_m + \left(\frac{N}{2} + 2\right)\right) D_{flit}(i)}{L_{flit}} \qquad (4.5)$$
A restricted-length multicast will divide the L_m bytes of the multicast
message into a number of flit sized packets. The data content of each
packet is P_d = (L_flit - L_h), and the total amount of header information
required to broadcast a message of L_m bytes, assuming each header requires
4 bytes, is 4N_f, where N_f is given by N_f = L_m/P_d and is rounded up to the
nearest whole number. The send latency of restricted-length based multicast
is therefore given by
[Equation (4.6): not legible in the source]
Note that the send latency of restricted-length multicast is independent of 
the number of destinations of the multicast. 
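The packet count and header overhead of a restricted-length multicast follow
directly from the definitions of P_d and N_f. The Python sketch below is
illustrative only; the function name and argument names are assumptions, and
the 4 byte header is the value assumed in the text.

```python
import math
from typing import Tuple


def restricted_length_packets(message_bytes: int, flit_buffer_bytes: int,
                              header_bytes: int = 4) -> Tuple[int, int]:
    """Return (N_f, total header bytes) for one restricted-length multicast."""
    payload_per_packet = flit_buffer_bytes - header_bytes  # P_d = L_flit - L_h
    if payload_per_packet <= 0:
        raise ValueError("the flit buffer must be larger than the header")
    packets = math.ceil(message_bytes / payload_per_packet)  # N_f, rounded up
    return packets, header_bytes * packets


# A 16 byte message with the standard buffer of 4 double-byte words (8 bytes).
print(restricted_length_packets(16, 8))
```

With the standard 8 byte buffer each packet carries only 4 message bytes,
whereas a buffer of 16 double-byte words (32 bytes) would carry the same
16 byte message in a single packet.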
4.3.2 Simulation Results 
We have assumed that the multicast set is an 8 x 8 mesh, and the load in the
network is simulated by varying the probability of blocking at a single port,
Pr(b), from 0 to 0.7. If multiple outputs are required concurrently, as is
the case in restricted-length multicast, then the total probability of blocking
(P_t) is given by P_t = 1 - (1 - P_b)^n, where P_b = Pr(b) and n is the number
of output ports requested. A node is chosen at random to initiate the
multicast, and the time taken from the initialization of the broadcast until
the tail of the last flit arrives at the last node is measured. The results of
each multicast method were then averaged over 100 simulations.
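The total blocking probability grows quickly with the number of output ports
requested, which a few lines of Python make concrete. This is a direct
transcription of the expression for P_t and assumes, as that expression
does, that the ports block independently; the function name is illustrative.

```python
def total_blocking_probability(p_block: float, ports_requested: int) -> float:
    """P_t = 1 - (1 - P_b)^n for n independently blocking output ports."""
    return 1.0 - (1.0 - p_block) ** ports_requested


# Blocking probability seen by a packet requesting 1 to 4 outputs at P_b = 0.5.
for ports in range(1, 5):
    print(ports, total_blocking_probability(0.5, ports))
```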
Figure 4.3 shows the latency of sending a multicast (in cycles) with L_m
fixed at 16 bytes, while varying the probability of blocking from 0 to 0.7.
Results for multiple unicast, multi-path, and restricted-length multicasting
with buffer sizes of B=1, 2 and 4 flits are given. All instances of the
restricted-length multicast provided a reduction in latency for Pr(b) ≤ 0.43.
By increasing the size of one flit buffer so that it can accommodate 2 flits,
the probability of blocking must exceed 0.65 before the blocking, due to the
requesting of multiple outputs, degrades the performance of restricted-length
multicasting to below that of multi-path multicasting. Figure 4.4 illustrates
the effect of varying the message length from 4 bytes to 2048 bytes for a
fixed probability of blocking. The header overhead of multi-path multicast
is evident in its poor performance for small messages, while unicast performs
poorly regardless of message size.
[Figure 4.3 plot. Series: B=1, B=2, B=4, Multipath, Unicast; x-axis:
Probability of Blocking P(b); y-axis: logarithmic scale]
Figure 4.3: Send latency for Lm = 16 bytes
[Figure 4.4 plot. x-axis: Message Length (bytes), logarithmic scale]
Figure 4.4: Send latency for Pr(b) = 0.5
4.3.3 Discussion of Results
As expected, the performance of unicast based multicast is much lower than
the other methods of multicast investigated here, as each of the unicast
messages must wait until the preceding message has left the sending node before
it can be sent. This result could be improved upon by adding additional
input ports to each node in the system and allowing multiple unicasts to be
sent concurrently from a single node. However, as the unicast based multicast
generates the most traffic of those methods presented here, this would
probably result in an increase in network congestion which would adversely
affect the network performance of the entire system.
Both multi-path and restricted-length multicasts exhibited significant
speedup when compared to unicast based multicast. As was the case with
unicast based multicast, the performance of multi-path based multicast could
also be improved by allowing multiple messages to be concurrently injected
into the network from a single node. The performance of restricted-length
multicast with only two enlarged flit buffers was superior to that of
multi-path based multicast except when the probability of blocking exceeded
0.65. As restricted-length multicast makes use of well known routing
algorithms, little modification would be required to existing router designs
to allow them to support it.
Chapter 5
Conclusions
Effective communication structures are essential if the full potential of MPP
systems is to be realized. The requirements for an interconnection network
and its communications structures to be considered effective include freedom
from deadlock and livelock, low latency and high throughput, adaptive
routing, fault tolerance and support for multicast communication. This
dissertation has focused on two solutions to meeting these requirements.
The Tokkyu router was presented and its suitability for use in MPP
interconnection networks was demonstrated. Accurate models were developed
to predict the switch and buffer performance of Tokkyu routers for varying
radix and dimension, and these models can be used in the design of routers
for networks other than those investigated here. The Tokkyu router meets
all of the requirements necessary to be considered effective, as defined in the
introduction. Importantly, the support for routing in the presence of faults or
network congestion does not compromise the low latency and high throughput
of the router. The simulated performance of the Tokkyu router exceeds
that of published results for oblivious routers and is equal to or exceeds those
reported for other adaptive routers. These performance predictions are
especially encouraging when the simplicity of the control structures required
to implement the Tokkyu router is taken into consideration.
The multicast deadlock problem was stated explicitly using a graph
theoretical approach, which enabled the conditions necessary to avoid deadlock
to be defined. Restricted-length multicast was introduced and the
implementation of this multicast scheme was examined. Restricted-length
multicast was then compared to unicast and multi-path based multicasts. The
simulation model allowed the relative merits of restricted-length multicast to
be evaluated, and under all but very high simulated congestion conditions
restricted-length multicast provided lower latency than unicast or multi-path
multicasting. The results therefore indicate that restricted-length multicast
provides a good solution to multicast problems such as multicasting to
clusters of nodes found in barrier synchronization, multicasting to nearest
neighbors and broadcasting to all of the nodes in the network.
References 
[1] Agarwal, A., "Limits on Interconnection Network Performance", IEEE
Trans. on Parallel and Distributed Computing, vol. 2, no. 4, pp. 398-412,
October 1991.
[2] Agrawal, D. P., Virenda, J. K., "Evaluating the Performance of
Multicomputer Configurations", IEEE Computer, vol. 19, no. 5, pp. 23-3??,
May 1986.
[3] Annaratone, M., et al., "The K2 Parallel Processor: Architecture and
Hardware Implementation", Proc. of the 17th Ann. Int. Symp. on Computer
Architecture, pp. 92-101, May 1990.
[4] Athas, W. C. and Seitz, C. L., "Multicomputers: Message-Passing
Concurrent Computers", IEEE Computer, vol. 21, no. 8, pp. 9-24, August
1988.
[5] Bhuyan, L. N., Yang, Q., Agrawal, D. P., "Performance of Multiprocessor
Interconnection Networks", IEEE Computer, vol. 22, no. 2, pp.
25-37, February 1989.
[6] Borkar, S., et al., "Supporting Systolic Memory Communication in
iWarp", Proc. of the 17th Ann. Int. Symp. on Computer Architecture,
pp. 70-81, May 1990.
[7] Byrd, G. T. et al., "Multicast Communication in Multiprocessor
Systems", in Proceedings of the 1989 Conference on Parallel Processing, pp.
1196-1200, 1989.
[8] Chien, A. A., "A Cost and Speed Model for k-ary n-cube Wormhole
Routers", in Proc. of Hot Interconnects 93, August 1993.
[9] Chien, A. A. and Kim, J. H., "Planar-Adaptive Routing: Low-Cost
Adaptive Networks for Multiprocessors", Proc. of the 19th Ann. Int.
Symp. on Computer Architecture, pp. 268-277, May 1992.
[10] Clark, J. and Holton, D. A., A First Look at Graph Theory, Singapore,
World Scientific, 1991.
[11] Cybenko, G. and Kuck, D. J., "Supercomputers: Reinventing the
Machine-Revolution or evolution?", IEEE Potentials, vol. 29, no. 9, pp.
39-41, Sep. 1992.
[12] Dally, W. J., "Network and Processor Architecture for Message-Driven
Computing", in VLSI and Parallel Processing, R. Suya and G. Birtwistle
eds., Morgan Kaufmann, pp. 140-222, 1989.
[13] Dally, W. J., "Virtual Channel Flow Control", IEEE Trans. on Parallel
and Distributed Systems, vol. 3, no. 2, pp. 194-205, March 1992.
[14] Dally, W. J., "Express Cubes: Improving the Performance of k-ary
n-cube Interconnection Networks", IEEE Trans. on Computers, vol. 40,
no. 9, pp. 1016-1023, September 1991.
[15] Dally, W. J. and Aoki, H., "Deadlock-Free Adaptive Routing in
Multicomputer Networks using Virtual Channels", IEEE Trans. on Parallel
and Distributed Systems, vol. 4, no. 4, pp. 466-475, April 1993.
[16] Dally, W. J., et al., "The Message-Driven Processor: A Multicomputer
Processing Node with Efficient Mechanisms", IEEE Micro, pp. 23-39,
April 1992.
[17] Dally, W. J. and Seitz, C. L., "The torus routing chip", Distributed
Computing, vol. 1, pp. 187-196, 1986.
[18] Dally, W. J. and Seitz, C. L., "Deadlock Free Message Routing in
Multiprocessor Interconnection Networks", IEEE Transactions on Computers,
vol. C-36, no. 5, pp. 547-553.
[19] Dally, W. J. and Song, P., "Design of a self-timed VLSI multicomputer
communication controller", in Proceedings of the International Conference
on Computer Design, IEEE Computer Society Press, pp. 230-234,
October 1987.
[20] Dongarra, J. J., "Performance of Various Computers Using Standard
Linear Equations Software", ACM Computer Architecture News, vol.
20, no. 3, pp. 22-44, June 1992.
[21] Feng, T., "A Survey of Interconnection Networks", IEEE Computer, vol.
14, no. 12, pp. 12-27, December 1981.
[22] Flavell, A. C., Kanoh, T. and Takahashi, Y., "Mandala: An
Interconnection Network for a Scalable Massively Parallel Computer", in Proc.
of the 43rd Annual Convention of the IPSJ, vol. 6, pp. 91-92, October
1991.
[23] Flavell, A. C. et al., "Mandala: An Interconnection Network for a
Scalable Massively Parallel Computer", Technical Report of the IPSJ, vol.
91, no. 100, pp. 91.101-91.109, November 1991.
[24] Flavell, A. C. and Takahashi, Y., "Mandala: An Interconnection
Network for a Scalable Massively Parallel Computer", in Proceedings of the
33rd IPSJ Programming Symposium, pp. 79-90, January 1992.
[25] Flavell, A. C. and Takahashi, Y., "The MEGA Router: A Hardware
Message-Passing Gate Array Router", in Proceedings of the 45th All
Japan Symposium on Information Science, vol. 6, pp. 183-184, October
1992.
[26] Flavell, A. C. and Takahashi, Y., "Continuum: A Hybrid Time/Space
Communications Paradigm for k-ary n-cubes", Proc. of the International
Conference on Parallel Processing 1994, vol. 1, pp. 138-141, August 1994.
[27] Flavell, A. C. and Takahashi, Y., "Restricted Length Hardware
Multicasting in Multicomputer Networks", Transactions of the IPSJ, vol. 36,
no. 5, pp. 1228-1238, May 1995.
[28] Flavell， A.C. and TakahashiヲY.，"The Tokkyu Router: A Ra叩n吋domi均Zl凶n
Router for kι-a訂ryn任lト-cubes♂"， Proc. of the Jnternαtionαl Symposium on 
PαTαlelαnd Distributed Supercomputingヲpp.127-134， September 1995. 
[29] FlavellヲA.C. and TakahashiヲY.，"Tokky忌:A High-Performance， Ran-
domizing， Adaptive Message Router with Packet Expressway 
Trαηs. 0ηlrη1formηlαtμlorηlαηd SysteεmηlS，ヲ vol.E汁78-D，no. 10， pp.1248-1260， 
October 1995. 
[30] Glass， C.J. and Ni，“Adaptive Routi時 inMesh-Connected Networksぺ
in Proceedings of the 12th Internαtionα1 Conference on Distributed Com-
puting Systems， pp. 12-13， June 1992. 
[31] Hwang, K., Advanced Computer Architecture, McGraw Hill, New York,
1993.
[32] Jesshope, C. R. and Yantchev, J. T., "High Performance Communi-
cations in Processor Networks", Proc. of the 16th Ann. Int. Symp. on
Computer Architecture, pp. 150-157, 1989.
[33] Karol, M. J., et al., "Input Versus Output Queuing on a Space-Division
Packet Switch", IEEE Trans. on Communications, vol. COM-35, no. 12,
pp. 1347-1356, December 1987.
[34] Kermani, P. and Kleinrock, L., "Virtual Cut-through: A New Commu-
nications Switching Technique", Computer Networks, vol. 3, no. 4, pp.
267-286, 1979.
[35] Konstantinidou, S. and Snyder, L., "Chaos Router: Architecture and
Performance", SIGARCH, vol. 19, no. 1, pp. 212-221, March 1991.
[36] Leiserson, C. E., et al., "The Network Architecture of the Connection
Machine CM-5", Proc. of the 4th Ann. ACM Symp. on Parallel Algo-
rithms and Architectures, ACM, pp. 272-285, June 1992.
[37] Lin, X. and Ni, L. M., "Deadlock-Free Multicast Wormhole Routing in
Multicomputer Networks", Proceedings of the 18th Annual International
Symposium on Computer Architecture, pp. 116-125, May 1991.
[38] Linder, D. and Harden, J., "An Adaptive and Fault-tolerant Wormhole
Routing Strategy for k-ary n-cubes", IEEE Trans. on Computers, vol.
C-40, no. 1, pp. 2-12, January 1991.
[39] Liu, C. L., Elements of Discrete Mathematics, New York, McGraw Hill,
1977.
[40] McKinley, P. K., Xu, H., Esfahanian, A. H. and Ni, L. M., "Unicast-
Based Multicast Communication in Wormhole-Routed Networks",
Tech. Rep. MSU-CPS-ACS-57, Department of Computer Science, Michi-
gan State University, East Lansing, MI, January 1992.
[41] Ngai, J. Y. and Seitz, C. L., "A Framework for Adaptive Routing in
Multicomputer Networks", SIGARCH, vol. 19, no. 1, pp. 6-14, March
1991.
[42] Ni, L. M. and McKinley, P. K., "A Survey of Routing Techniques in Worm-
hole Networks", Tech. Rep. MSU-CPS-ACS-46, Department of Com-
puter Science, Michigan State University, East Lansing, MI, October
1991.
[43] Oed, W. and Walker, M., "An Overview of Cray Research Computers
including the Y-MP/C90 and the new MPP T3D", Proc. of the 5th Ann.
ACM Symp. on Parallel Algorithms and Architectures, pp. 271-272, June
1991.
[44] Panda, D. K., "A Report of the ICPP 94 Panel on - Sea of Interconnec-
tion Networks: What's Your Choice?", Department of Computer and
Information Science, Ohio State University, Columbus, OH, November
1994.
[45] Reames, C. C. and Liu, M. T., "A Loop Network for Simultaneous
Transmission of Variable-Length Messages", Proc. of the 2nd Ann. Int.
Symp. on Computer Architecture, pp. 7-12, January 1975.
[46] Reed, D. A. and Fujimoto, R. M., Multicomputer Networks: Message-Based
Parallel Processing, MIT Press, Cambridge MA, 1987.
[47] Reed, D. A. and Grunwald, D. C., "The Performance of Multicomputer
Interconnection Networks", IEEE Computer, vol. 20, no. 6, pp. 65-73,
June 1987.
[48] Seitz, C. L., "Concurrent Architectures", in VLSI and Parallel Com-
putation, R. Suaya and G. Birtwistle eds., Morgan Kaufmann, pp. 1-84,
1990.
[49] Sullivan, H. and Bashkow, T. R., "A Large Scale, Homogeneous, Fully
Distributed Parallel Machine", Proc. of the 4th Symp. on Computer
Architecture, vol. 5, pp. 105-124, Mar 1977.
[50] Tamir, Y. and Frazier, G. L., "High Performance Multi-Queue Buffers
for VLSI Communication Switches", in Proc. 19th Annual Symposium
on Computer Architecture, IEEE Computer Society Press, pp. 343-354,
June 1988.
[51] Xu, H., McKinley, P. K. and Ni, L. M., "Efficient Implementation of
Barrier Synchronization in Wormhole-Routed Hypercube Multicomput-
ers", in Proceedings of the 12th International Conference on Distributed
Computing Systems, pp. 118-127, June 1992.
[52] Yeh, Y., et al., "The Knockout Switch: A Simple, Modular Architecture
for High-Performance Packet Switching", IEEE Journal on Selected
Areas in Communications, vol. SAC-5, no. 5, pp. 1274-1283, October
1987.
[53] Zorpette, G., "Supercomputers/Reinventing the Machine - The Power
of Parallelism", IEEE Spectrum, vol. 29, no. 9, pp. 28-33, September
1992.


