A new-generation class of parallel architectures and their performance evaluation by Wang, Qian
New Jersey Institute of Technology 
Digital Commons @ NJIT 
Dissertations Electronic Theses and Dissertations 
Spring 5-31-1999 
Recommended Citation 
Wang, Qian, "A new-generation class of parallel architectures and their performance evaluation" (1999). 
Dissertations. 992. 
https://digitalcommons.njit.edu/dissertations/992 
Copyright Warning & Restrictions 
The copyright law of the United States (Title 17, United
States Code) governs the making of photocopies or other
reproductions of copyrighted material.
Under certain conditions specified in the law, libraries and
archives are authorized to furnish a photocopy or other
reproduction. One of these specified conditions is that the
photocopy or reproduction is not to be "used for any
purpose other than private study, scholarship, or research."
If a user makes a request for, or later uses, a photocopy or
reproduction for purposes in excess of "fair use," that user
may be liable for copyright infringement.
This institution reserves the right to refuse to accept a 
copying order if, in its judgment, fulfillment of the order 
would involve violation of copyright law. 
Please Note: The author retains the copyright while the 
New Jersey Institute of Technology reserves the right to 
distribute this thesis or dissertation 
The Van Houten library has removed some of 
the personal information and all signatures from 
the approval page and biographical sketches of 
theses and dissertations in order to protect the 
identity of NJIT graduates and faculty. 
ABSTRACT 
A NEW-GENERATION CLASS OF PARALLEL ARCHITECTURES 
AND THEIR PERFORMANCE EVALUATION 
by 
Qian Wang 
The development of computers with hundreds or thousands of processors and
capability for very high performance is absolutely essential for many computation
problems, such as weather modeling, fluid dynamics, and aerodynamics.
Several interconnection networks have been proposed for parallel computers. Nevertheless,
the majority of them are plagued by rather poor topological properties that
result in large memory latencies for DSM (Distributed Shared-Memory) computers.
On the other hand, scalable networks with very good topological properties are
often impossible to build because of their prohibitively high VLSI (e.g., wiring)
complexity. Such a network is the generalized hypercube (GH). The GH supports
full-connectivity of its nodes in each dimension and is characterized by outstanding
topological properties. In addition, low-dimensional GHs have very large bisection
widths. We propose in this dissertation a new class of processor interconnections,
namely HOWs (Highly Overlapping Windows), that are more generic than the GH,
are highly scalable, and have comparable performance. We analyze the communications
capabilities of 2-D HOW systems and demonstrate that in practical cases
HOW systems perform much better than binary hypercubes for important communications
patterns. These properties are in addition to the good scalability and
low hardware complexity of HOW systems. We present algorithms for one-to-one,
one-to-all broadcasting, all-to-all broadcasting, one-to-all personalized, and all-to-all
personalized communications on HOW systems. These algorithms are developed
and evaluated for several communication models. In addition, we develop techniques
for the efficient embedding of popular topologies, such as the ring, the torus, and
the hypercube, into 1-D and 2-D HOW systems. The objective is to show that 2-D
HOW systems are not only scalable and easy to implement, but they also result in
good embedding of several classical topologies.
A NEW-GENERATION CLASS OF PARALLEL ARCHITECTURES
AND THEIR PERFORMANCE EVALUATION
by
Qian Wang
Submitted to the Faculty of 
New Jersey Institute of Technology 
in Partial Fulfillment of the Requirements for the Degree of 
Doctor of Philosophy 
Department of Computer and Information Science 
May 1999 
Copyright © 1999 by Qian Wang
ALL RIGHTS RESERVED 
APPROVAL PAGE 
A NEW-GENERATION CLASS OF PARALLEL ARCHITECTURES 
AND THEIR PERFORMANCE EVALUATION 
Qian Wang 
Dr. Sotirios G. Ziavras, Dissertation Advisor Date
Associate Professor of Electrical and Computer Engineering,
and Computer and Information Science, NJIT
Dr. David Nassimi, Committee Member Date
Associate Professor of Computer and Information Science, NJIT
Dr. James McHugh, Committee Member Date
Professor of Computer and Information Science, NJIT
Dr. Mengchu Zhou, Committee Member Date
Associate Professor of Electrical and Computer Engineering, NJIT 
Dr. Alex Gerbessiotis, Committee Member Date 
Assistant Professor of Computer and Information Science, NJIT 
BIOGRAPHICAL SKETCH 
Author: Qian Wang
Degree: Doctor of Philosophy
Date: May 1999
Undergraduate and Graduate Education:
• Doctor of Philosophy in Computer and Information Science,
New Jersey Institute of Technology, Newark, NJ, 1999
• Master of Science in Electrical Engineering,
New Jersey Institute of Technology, Newark, NJ, 1993
• Bachelor of Science in Electrical Engineering,
Huazhong University of Science and Technology, Wuhan, China, 1988
Major: Computer Science
Presentations and Publications:
Q. Wang and S.G. Ziavras, "Powerful and Feasible Processor Interconnections
With an Evaluation of Their Communications Capabilities," International
Symposium on Parallel Architectures, Algorithms, and Networks, Fremantle,
Australia, June 23-25, 1999.
Q. Wang and S.G. Ziavras, "Network Embedding Techniques for a New Class of
Feasible Parallel Architectures Capable of Very High Performance," International
Conference on Applied Informatics, Innsbruck, Austria, February 15-18,
1999.
S.G. Ziavras and Q. Wang, "Robust Interprocessor Connections for Very-High
Performance," in: Robust Communication Networks: Interconnection and
Survivability, N. Dean, F. Hsu and R. Ravi (Eds.), American Mathematical
Society, Rhode Island, 1999.
Q. Wang, "Optical Flow Determination and Motion Analysis," Master's Thesis, New
Jersey Institute of Technology, New Jersey, 1993.
To my husband Dong Liu 
ACKNOWLEDGMENT
I would like to express my deepest appreciation to Dr. Sotirios G. Ziavras,
who not only served as my research supervisor, providing valuable and countless
resources, insight and intuition, but also constantly gave me support, encouragement,
and reassurance.
Special thanks are given to Dr. David Nassimi, Dr. James McHugh, Dr.
Mengchu Zhou, and Dr. Alex Gerbessiotis for actively participating in my committee.
Many of my fellow graduate students in the Computer and Information Science
Department deserve recognition for their support. I also wish to thank Leon
Jololian and Karen Hare for their help over the years.
I thank my family members, Dong Liu, Guoqi Wang, Hui Yi, Xuanshi Wang,
Ying Liang, and Luzhong Wang, for their affectionate support, patience, and encouragement
throughout the duration of this project.
This research was supported in part by the NSF/DARPA (also cosponsored by
NASA) New Millennium Computing Point Design Grant ASC-9634775.
TABLE OF CONTENTS
Chapter
1 INTRODUCTION
1.1 The Class of HOW Architectures
1.1.1 Their Structure
1.1.2 Further Implementation Issues
1.2 The Class of Wrap-Around HOW Architectures
2 COST ANALYSIS
2.1 Cost Analysis for the Regular HOW(p, w, 1)
2.2 Cost Analysis for the Wrap-Around HOW(p, w, 1)
3 1-D HOW SYSTEM EMBEDDINGS
3.1 Embedding a Ring into a 1-D HOW System
3.2 Embedding a 2-D Mesh into a 1-D HOW System
3.2.1 2-D Regular Mesh
3.2.2 2-D Wraparound Mesh or Torus
3.3 Embedding a Binary Tree into a 1-D HOW System
3.4 Embedding a Hypercube into a 1-D HOW System
4 2-D HOW SYSTEM EMBEDDINGS
4.1 Embedding a Ring into a 2-D HOW System
4.2 Embedding a 2-D Mesh/Torus into a 2-D HOW System
4.3 Embedding a Binary Tree into a 2-D HOW System
4.4 Embedding a Hypercube into a 2-D HOW System
5 COMMUNICATION OPERATIONS ON 1-D HOW SYSTEMS
5.1 One-to-One Communication
5.2 One-to-All Broadcasting
5.2.1 Model-1
5.2.2 Model-2 and Model-3
5.3 All-to-All Broadcasting
5.3.1 Model-1
5.3.2 Model-2
5.3.3 Model-3
5.4 One-to-All Personalized Communication
5.4.1 Model-1 and Model-2
5.4.2 Model-3
5.5 All-to-All Personalized Communication
5.5.1 Model-1 and Model-2
5.5.2 Model-3
6 COMMUNICATION OPERATIONS ON 2-D HOW SYSTEMS
6.1 One-to-One Communication
6.2 One-to-All Broadcasting
6.2.1 Model-1
6.2.2 Model-2 and Model-3
6.3 All-to-All Broadcasting
6.3.1 Model-1
6.3.2 Model-2
6.3.3 Model-3
6.4 One-to-All Personalized Communication
6.4.1 Model-1 and Model-2
6.4.2 Model-3
6.5 All-to-All Personalized Communication
6.5.1 Model-1 and Model-2
6.5.2 Model-3
7 COMMUNICATION OPERATIONS ON BINARY HYPERCUBES
7.1 One-to-One Communication
7.2 One-to-All Broadcasting
7.3 All-to-All Broadcasting
7.4 One-to-All Personalized Communication
7.5 All-to-All Personalized Communication
8 PERFORMANCE COMPARISONS BETWEEN HOW AND BINARY HYPERCUBE SYSTEMS
9 PERFORMANCE COMPARISONS BETWEEN HOW AND GENERALIZED HYPERCUBE SYSTEMS
10 CONVERSION OF COMMUNICATIONS ALGORITHMS FOR GENERALIZED HYPERCUBES
11 CONCLUSIONS AND FUTURE WORK
APPENDIX A SIMULATION FOR ALL-TO-ALL PERSONALIZED COMMUNICATION ON 1-D HOWS
REFERENCES
LIST OF FIGURES
Figure
1.1 The 2-D generalized hypercube GH(7,2).
1.2 The neighbors of the node with address k in the 1-D HOW(p,w,1) system.
1.3 1-D HOW system with 15 processors and window size of 3, 4, and 5, respectively.
1.4 Examples of 2-D HOW systems with w=3. (a) HOW(4,3,2). (b) HOW(5,3,2). (c) HOW(6,3,2). (d) HOW(7,3,2).
1.5 1-D wrap-around HOW systems with 15 processors and window size of 3, 4, and 5, respectively.
1.6 The 2-D wrap-around HOW(7,3,2).
2.1 Colinear layout of the 1-D HOW system with 12 PEs and window size of 4, and its brute-force decomposition into printed-circuit layers.
2.2 Colinear layout of the 1-D HOW system with 12 PEs and window size of 5, and its brute-force decomposition into printed-circuit layers.
2.3 Colinear layout of the 1-D HOW system with 12 PEs and window size of 4, and its decomposition into printed-circuit layers using vertical and horizontal lines.
2.4 Colinear layout of the 1-D HOW system with 12 PEs and window size of 5, and its decomposition into printed-circuit layers using vertical and horizontal lines.
2.5 Colinear layout of generalized hypercube with 12 PEs, and its decomposition into printed-circuit layers using vertical and horizontal lines.
2.6 Decomposition of the 1-D wrap-around HOW(12,4,1).
2.7 Decomposition of the 1-D wrap-around HOW(12,5,1).
3.1 The definition of dilation, congestion and expansion.
3.2 (a) A 16-processor ring and (b) its embedding into the 1-D HOW(p,w,1) system.
3.3 Embedding a p-processor ring into the 1-D HOW(p,w,1) system with another technique.
3.4 (a) Source 3x5 mesh and (b) its optimal embedding into the 1-D HOW(15,3,1) system.
3.5 (a) Source 5x3 mesh and (b) its optimal embedding into the 1-D HOW(15,3,1) system.
3.6 Mapping the 7 x 7 mesh in two different ways.
3.7 4x4 wraparound mesh and its optimal mapping onto the 1-D HOW(16,8,1) system.
3.8 (a) A 31-processor full binary tree with depth d = 5 and the numbering of its nodes and (b) its optimal embedding into the 1-D HOW(31,8,1) system.
3.9 (a) A 16-processor hypercube with binary addresses for its nodes and (b) its optimal embedding into a 1-D HOW(16,8,1) system.
4.1 Embeddings of rings into 2-D HOW systems.
4.2 Embeddings of rings into 2-D HOW systems when the numbers of nodes in the rings are smaller than those in the HOW systems.
4.3 Mapping the 6 x 6 torus onto a 2-D HOW system with window size of 3. Consecutive bold segments in a row/column implement wraparound connections in the torus.
4.4 Optimal mapping of the 3-level binary tree onto the 2-D HOW(3,2,2) system.
4.5 Optimal mapping of the 4-level binary tree onto the 2-D HOW(4,2,2) system. The two distinct building blocks for the mapping of 3-level binary trees are enclosed in dotted lines.
4.6 Mapping the 5-level binary tree onto the 2-D HOW(6,2,2) system.
4.7 Optimal mapping of the 6-level binary tree onto the 2-D HOW(8,2,2) system.
4.8 Optimal 3-D and 4-D hypercube embeddings into the HOW(4,2) system (method one).
4.9 5-D hypercube and its embedding into a 2-D HOW system (method-one). Optimal mapping is derived if w ≥ 4.
4.10 5-D hypercube embedding with the second method. This figure shows the original hypercube, the embedding of the 3-D hypercube into the building block HOW(3,2,2), and the final embedding into the 2-D HOW(6,3,2) system.
4.11 6-D hypercube embedding in a 2-D system using method two. (Actually method-two and method-one are the same for even number dimension hypercube.) This figure shows the building block and the embedding in 2-D system.
5.1 Different output port models.
5.2 One-to-all broadcast.
5.3 One-to-all broadcasting under model-1 with 12 processors and window size of 3. A number in parentheses is the label of the source processor ... shown.
5.4 One-to-all broadcasting under model-2 and model-3 with 12 processors and window size of 3. A number in parentheses is the label of the source processor from which data has been broadcast. All communication steps are shown.
5.5 All-to-all broadcast.
5.6 All-to-all broadcasting under model-1 with 12 processors and window size of 3. The numbers in parentheses for each processor are the labels of source processors from which data was received prior to the current communication step.
5.7 All-to-all broadcasting under model-3 with 12 processors and window size of 3. Addresses of processors from which values have been received at the end of each step are shown.
5.8 One-to-all personalized communication.
5.9 All-to-all personalized communication.
5.10 Chosen linear arrays in the HOW(10,3,1) for all-to-all personalized communication.
6.1 Processor addresses in the HOW(5,3,2).
6.2 One-to-all broadcasting under model-2 and model-3 with two different methods, both of which have the same number of communication steps. A filled circle means that the current processor has already received the message broadcast by the source. All communication steps are shown here. We assume that w=3. For the worst case, we assume $P_{00}$ to be the source.
6.3 One-to-all personalized communication under model-3, for w = 3. The Cartesian coordinates of destination processors are shown as pairs of numbers. A shaded circle means that the corresponding processor has already received the personalized message sent by the source.
7.1 One-to-all communication procedure with 16 processors, for a hypercube system.
7.2 All-to-all communication procedure with 16 processors, for a hypercube system.
8.1 Comparisons between HOW and binary hypercube systems for one-to-all broadcasting with message size m = 2 words.
8.2 Comparisons between HOW and binary hypercube systems for one-to-all broadcasting with message size m = 5 words.
8.3 Comparisons between HOW and binary hypercube systems for one-to-all broadcasting with message size m = 10 words.
8.4 Comparisons between HOW and binary hypercube systems for one-to-all broadcasting with message size m = 20 words.
8.5 Comparisons between HOW and binary hypercube systems for all-to-all broadcasting with message size m = 2 words.
8.6 Comparisons between HOW and binary hypercube systems for all-to-all broadcasting with message size m = 5 words.
8.7 Comparisons between HOW and binary hypercube systems for all-to-all broadcasting with message size m = 10 words.
8.8 Comparisons between HOW and binary hypercube systems for all-to-all broadcasting with message size m = 20 words.
8.9 Comparisons between HOW and binary hypercube systems for one-to-all personalized communication with message size m = 2 words.
8.10 Comparisons between HOW and binary hypercube systems for one-to-all personalized communication with message size m = 5 words.
8.11 Comparisons between HOW and binary hypercube systems for one-to-all personalized communication with message size m = 10 words.
8.12 Comparisons between HOW and binary hypercube systems for one-to-all personalized communication with message size m = 20 words.
8.13 Comparisons between HOW and binary hypercube systems for all-to-all personalized communication with message size m = 2 words.
8.14 Comparisons between HOW and binary hypercube systems for all-to-all personalized communication with message size m = 5 words.
8.15 Comparisons between HOW and binary hypercube systems for all-to-all personalized communication with message size m = 10 words.
8.16 Comparisons between HOW and binary hypercube systems for all-to-all personalized communication with message size m = 20 words.
9.1 Comparisons between HOW and generalized hypercube systems for one-to-all broadcasting with message size m = 2 words.
9.2 Comparisons between HOW and generalized hypercube systems for one-to-all broadcasting with message size m = 5 words.
9.3 Comparisons between HOW and generalized hypercube systems for one-to-all broadcasting with message size m = 10 words.
9.4 Comparisons between HOW and generalized hypercube systems for one-to-all broadcasting with message size m = 20 words.
9.5 Comparisons between HOW and generalized hypercube systems for all-to-all broadcasting with message size m = 2 words.
9.6 Comparisons between HOW and generalized hypercube systems for all-to-all broadcasting with message size m = 5 words.
9.7 Comparisons between HOW and generalized hypercube systems for all-to-all broadcasting with message size m = 10 words.
9.8 Comparisons between HOW and generalized hypercube systems for all-to-all broadcasting with message size m = 20 words.
9.9 Comparisons between HOW and generalized hypercube systems for one-to-all personalized communication with message size m = 2 words.
9.10 Comparisons between HOW and generalized hypercube systems for one-to-all personalized communication with message size m = 5 words.
9.11 Comparisons between HOW and generalized hypercube systems for one-to-all personalized communication with message size m = 10 words.
9.12 Comparisons between HOW and generalized hypercube systems for one-to-all personalized communication with message size m = 20 words.
9.13 Comparisons between HOW and generalized hypercube systems for all-to-all personalized communication with message size m = 2 words.
9.14 Comparisons between HOW and generalized hypercube systems for all-to-all personalized communication with message size m = 5 words.
9.15 Comparisons between HOW and generalized hypercube systems for all-to-all personalized communication with message size m = 10 words.
9.16 Comparisons between HOW and generalized hypercube systems for all-to-all personalized communication with message size m = 20 words.
10.1 The spanning tree BSTo2 of the GHS,2.
10.2 The spanning tree BSTo2 of the HOW(5,3,2).
10.3 The spanning tree BSTo2 of the GHS,2.
10.4 The spanning tree BSTo2 of the HOW(8,3,2). Shaded nodes show the procedure for the GH.
10.5 The spanning tree BSTo2 of the HOW(8,4,2). Shaded nodes show the procedure for the GH.
10.6 The spanning tree BSTo2 of the HOW(8,5,2). Shaded nodes show the procedure for the GH.
LIST OF TABLES
Table
1.1 Comparison of existing interconnection networks. All networks have $N = p^n = 2^m$ nodes.
1.2 Comparison of interconnection networks. All networks have $N = p^n = 2^m$ nodes.
5.1 The propagation rules of one-to-all broadcasting under model-1.
5.2 The detailed steps for all-to-all broadcasting under model-2.
5.3 The detailed steps for all-to-all broadcasting under model-2 using a wrap-around system with 16 processors.
5.4 The detailed steps for all-to-all broadcasting under model-2 using a wrap-around system with 17 processors.
5.5 The detailed steps for all-to-all broadcasting under model-3 using a wrap-around system with 16 processors.
5.6 The detailed steps for all-to-all broadcasting under model-3 using a wrap-around system with 17 processors.
5.7 The detailed steps for one-to-all personalized communication under model-3.
5.8 The detailed steps for all-to-all personalized communication in 1-D HOW under model-3.
5.9 The detailed steps for all-to-all personalized communication in 1-D HOW under model-3. (continue-1)
5.10 The detailed steps for all-to-all personalized communication in 1-D HOW under model-3. (continue-2)
5.11 The detailed steps for all-to-all personalized communication in 1-D HOW under model-3. (continue-3)
5.12 The detailed steps for all-to-all personalized communication in 1-D HOW under model-3. (continue-4)
6.1 The initial and final state of HOW(5,3,2).
6.2 Messages received in the first two detailed steps for all-to-all broadcasting within the rows of the HOW(5,3,2) system.
6.3 The initial and final states for one-to-all personalized communication in the HOW(p,w,2).
6.4 The initial state for all-to-all personalized communication in 2-D HOW system.
6.5 The final result for all-to-all personalized communication in a 2-D HOW system.
7.1 Detailed information for all-to-all broadcasting on the hypercube.
7.2 Detailed information for one-to-all personalized communication on the hypercube.
7.3 Detailed information for all-to-all personalized communication on the hypercube.
7.4 Detailed information for all-to-all personalized communication on the hypercube (continued).
7.5 Detailed information for all-to-all personalized communication on the hypercube (continued).




The demand for ever greater performance by many computation problems has been
the driving force for the development of computers with thousands of processors.
Two important aspects are expected to dominate the massively-parallel processing
field: high-level parallel languages supporting a shared address space (for DSM
computers) and point-to-point interconnection networks with workstation-like
nodes. Near-PetaFLOPS (i.e., $10^{15}$ floating-point operations per second) or
higher performance is required by many applications, such as weather modeling,
simulation of physical phenomena, fluid dynamics, aerodynamics, simulation of
neural networks, simulation of chips, structural analysis, real-time image processing
and robotics, artificial intelligence, seismology, animation, and real-time processing of
large databases. Dongarra pointed out in 1995 that the world's top ten technical
computing sites had a peak capacity of only about 850 GigaFLOPS, with each site
containing hundreds of computers. The goal of 1 TeraFLOPS (i.e., $10^{12}$ floating-point
operations per second) peak performance was reached in late 1996 with the
installation of an Intel supercomputer at Sandia Laboratories.
The PetaFLOPS performance objective seems to be a distant dream primarily
because of the difficulty, currently viewed as insurmountable, of developing low-complexity,
high-bisection-bandwidth, and low-latency networks to interconnect
thousands of processors (and remote memories in DSM systems). To quote Dally,
"wires are a limiting factor because of power and delay as well as density" [6].
Several interconnection networks have been proposed for the design of massively-parallel
computers, including, among others, regular meshes and tori [7], enhanced
meshes [17], fat trees, (direct binary) hypercubes [9], and hypercube variations
[1] [11] [12]. The hypercube dominated the high-performance computing field in
the 1980s because it has good topological properties and rather rich interconnectivity
that permits efficient emulation of many topologies frequently employed
in the development of algorithms [9] [14]. Nevertheless, these properties come at
the cost of often prohibitively high VLSI (primarily wiring) complexity due to a
dramatic increase in the number of communication channels with any increase in the
number of PEs (processing elements). This high VLSI complexity is undoubtedly the
hypercube's dominant drawback: it limits scalability [14] and does not permit the construction
of powerful, massively-parallel systems. Two nodes in the m-cube, or m-D hypercube,
with $2^m$ nodes are neighbors if and only if their unique m-bit addresses differ in a
single bit. The versatility of the hypercube in emulating efficiently other important
topologies constitutes an incentive for the introduction of hypercube-like interconnection
networks of lower complexity that, nevertheless, preserve to a large extent
the former topological properties [1] [12]. Indirect implementations of hypercubes
have also been proposed [8].
To support scalability, current approaches to massively-parallel processing
use bounded-degree networks, such as meshes or k-ary n-cubes (i.e., tori), with
low node degree (e.g., the FLASH, Cray Research MPP, Intel Paragon, and Tera
computers). However, low-degree networks result in large diameter, large average
internode distance, and small bisection width. Relevant approaches that employ 
reconfiguration to enhance the capabilities of the basic mesh architecture (e.g., 
reconfigurable mesh, mesh with multiple broadcasting, and mesh with separable 
broadcast buses) will not become feasible for massively-parallel processing in the 
foreseeable future because of the requirements for long clock cycles and precharged 
switches to facilitate the transmission of messages over long distances [17].
The high VLSI complexity problem is unbearable for generalized hypercubes 
(GHs). Contrary to nearest-neighbor k-ary n-cubes that form rings with k nodes 
in each dimension, GHs implement fully-connected systems with k nodes in each
dimension [16]. The n-D (symmetric) generalized hypercube GH(k,n) contains $k^n$
nodes. The address of a node is $x_{n-1}x_{n-2} \cdots x_1 x_0$, where $x_i$ is a radix-k digit with
$0 \le x_i \le k-1$. This node is a neighbor to the nodes with addresses
$x_{n-1}x_{n-2} \cdots x'_i \cdots x_1 x_0$, for all $0 \le i \le n-1$ and $x'_i \ne x_i$. Therefore, two nodes
are neighbors if and only if their n-digit addresses differ in a single digit. For the sake
of simplicity, we restrict our discussion to symmetric generalized hypercubes where the
nodes have the same number of neighbors in all dimensions. Therefore, each node has k - 1
neighbors in each dimension, for a total of $n(k-1)$ neighbors per node. The n-D GH(k, n)
has diameter equal to only n. Figure 1.1 shows the GH(7, 2) with 2 dimensions (i.e.,
n = 2) and k = 7. For n = 2 and k an even number, the diameter of the GH is only 2
and its bisection width is the immense $k^3/4$. The increased VLSI/wiring cost of GHs
results in outstanding performance that permits optimal emulation of hypercubes
and k-ary n-cubes, and efficient implementation of complex communication patterns
[4] [3].
Figure 1.1 The 2-D generalized hypercube GH(7,2).
In order to reduce the number of communication channels in systems similar 
to the generalized hypercube, the spanning bus hypercube uses a shared bus for the 
implementation of each fully-connected subsystem in a given dimension. However,
shared buses result in significant performance degradation because of the overhead 
imposed by the protocol that determines each time ownership of the bus. Similarly, 
hypergraph architectures implement all possible permutations of their nodes in 
each dimension by employing crossbar switches [10]. Reconfigurable generalized 
hypercubes interconnect all nodes in each dimension dynamically via a scalable 
mesh of very simple, low-cost programmable switches [15]. However, all these
proposed reductions in hardware complexity may not be sufficient for very high 
performance computing. 
To summarize, low-dimensional massively-parallel computers with full connec-
tivity for their nodes in each dimension, such as generalized hypercubes, are very 
desirable because of their outstanding topological properties (e.g., extremely small 
diameter and average internode distance, and immense bisection width), but their 
electronic implementation is a Herculean task because of packaging (and primarily 
wiring) constraints. We propose in this dissertation a new class of interprocessor
connection architectures, namely HOWs (standing for architectures with Highly
Overlapping Windows), which employ the generalized hypercube [3] [16] [26] [19]
with outstanding topological properties (e.g., extremely small diameter and average
internode distance, and immense bisection width) as the basic building block. HOWs
are also obtained from generalized hypercubes by removing some of their processor
interconnections in order to reduce the wiring complexity and render them viable
structures for very high-performance computing. Large generalized hypercubes have
outstanding topological properties; however, they are characterized by very high
wiring complexity that prohibits their implementation [11] [18] [19]. In contrast,
HOWs can be viable while having simultaneously topological properties comparable 
to those of generalized hypercubes. 
This dissertation is organized as follows. Chapter 1 introduces HOWs, a
new class of parallel architectures. Chapter 2 introduces cost analysis for HOWs.
Chapters 3 and 4 present the embedding of various interconnection networks into
1-D and 2-D HOW systems. Chapters 5 and 6 present and analyze communication
operations for 1-D and 2-D HOW systems, respectively. Chapter 7 briefly analyzes
communication operations for (direct binary) hypercubes. Finally, Chapters 8 and 9
present performance comparisons involving hypercubes (binary and generalized) and
2-D HOW systems.
1.1 The Class of HOW Architectures 
The definition of the generalized hypercube network, which is the building block of
HOWs, is first in order. We shall show later in this section that HOWs can also be
derived from generalized hypercubes by selectively removing some of their interprocessor
connections. The terms node and processor are used interchangeably. The n-D
(symmetric or balanced) generalized hypercube GH(p, n) with p nodes per dimension
contains a total of $p^n$ nodes [16]. The address of a node is $x_{n-1}x_{n-2} \cdots x_1 x_0$, where
the radix-p digit $x_i$ satisfies $0 \le x_i \le p-1$ for $i = 0, 1, \ldots, n-1$. Two nodes are neighbors
if and only if their n-digit addresses differ in a single radix-p digit. This generalized
hypercube can be obtained from the n-D mesh by replacing the linear arrays in each
dimension with fully-connected systems. Therefore, each node in the GH(p, n) has
$n(p-1)$ neighbors and its diameter is equal to just n.
Low-dimensional generalized hypercubes have very impressive bisection widths.
When a network is cut into two equal halves, its bisection width is the number of
edges that run between these two halves; dense/heavy communications operations
can benefit from a large bisection width. For n = 2 and p an even number, the
bisection width of the GH(p,n) is the immense $p^3/4$. Also, generalized hypercubes
implement efficiently very demanding communications operations, such as broadcasting
and multicasting [26] [4]. Their outstanding topological properties are the
result of their high node degree (that is, the large number of connections per node)
which, however, has a negative effect on the wiring complexity.
Figure 1.2 The neighbors of the node with address k in the 1-D HOW(p,w,1) system.
1.1.1 Their Structure 
We first introduce the class of 1-D HOW processor interconnections [29] [30].
HOW(p, w, 1) denotes a 1-D HOW system with p nodes and window size w. Each
node with unique address k, where $0 \le k \le p-1$, is connected directly to all nodes
within the windows of size w immediately to its left and right. More specifically, its
neighbors have addresses $0 \le k \pm i \le p-1$, for all $i = 1, 2, 3, \ldots, w$. Therefore, all
connections are local in this 1-D system and span up to w nodes to the left and w
nodes to the right of the referenced node. Figure 1.2 shows the neighbors of a node
in a 1-D HOW system.
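The neighbor rule just stated can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `how_1d_neighbors` and its argument order are assumptions introduced here and do not come from the dissertation.

```python
def how_1d_neighbors(k, p, w):
    """Return the sorted neighbor addresses of node k in a 1-D HOW(p, w, 1).

    Hypothetical helper: neighbors are the addresses k - i and k + i for
    i = 1..w that still fall inside the range 0 .. p - 1.
    """
    if not 0 <= k < p:
        raise ValueError("node address must satisfy 0 <= k <= p - 1")
    # Candidates are k - w .. k + w; drop k itself and anything off either end.
    return [j for j in range(k - w, k + w + 1) if j != k and 0 <= j < p]


if __name__ == "__main__":
    # Nodes near the two ends have fewer than 2w neighbors.
    for k in (0, 7, 14):
        print(k, how_1d_neighbors(k, p=15, w=3))
```

Interior nodes obtain the full 2w neighbors, while nodes within w positions of either end obtain fewer because the regular (non-wrap-around) system has no connections past its boundaries.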
Each processor k belongs to as many as w + 1 maximal-sized 1-D generalized
hypercubes GH(w + 1, 1) (i.e., fully-connected subsystems); they can be derived
by starting with the subsystem spanning node k and all its left neighbors in the
colinear representation of the HOW(p, w, 1), and shifting each time the window
by one position to the right until the last subsystem spans node k and all its right
neighbors. Therefore, each such pair of successively-derived GH(w + 1, 1)s has a very
large overlap that forms a GH(w, 1). The HOW(p, w, 1) can also be derived from the
GH(p, 1) by removing for each node, in the colinear representation of the GH(p, 1),
those edges that connect it to nodes outside of the left and right windows defined by
w. Therefore, existing algorithms for generalized hypercubes can be modified easily
to run on HOWs for the following reasons:
• HOWs are derived from generalized hypercubes by removing some edges.
• HOWs contain many smaller, highly-overlapping generalized hypercubes.
Not only do HOWs have lower wiring complexity than GHs of similar size,
but the locality of processor interconnections in HOWs also makes them a viable solution
for very high-performance computing [29] [30] [31]:
• Moore's law predicts the doubling of transistor density for chips every 18
months. Multiprocessor chips have already appeared in the market and this
design concept is expected to have in the near future a very significant market
share in the high performance computing field. Local intrachip processor
connections, such as those required predominantly in HOWs, will then be very
effective.
• Intrachip and/or local interchip connections could be implemented efficiently 
with current and expected electronic technologies for reasonable values of the 
window size w; in contrast, the global interconnections required in generalized 
hypercubes are much more difficult to realize. Improvements in intrachip 
and/or interchip interconnection technologies can increase the value of w. 
• Free-space optical interconnects are expected to become viable and commonplace
in the near future for the local interconnection of chips [19]. Very substantial
work is carried out in research laboratories, quite often with federal support,
for the efficient realization of free-space interconnects within computer systems;
WDM (wavelength-division multiplexing) will be employed for the transmission
of multiple bits in parallel [19]. Because chromatic dispersion
becomes a major problem in WDM for distances larger than about a meter,
the global interconnections required in generalized hypercubes will still be very
difficult to implement. Therefore, HOWs will further increase their advantage
over GHs with respect to interconnection complexity.
All of the above demonstrate that HOWs are more amenable than GHs to scaling
with technological advancements.
The (symmetric) n-D HOW(p, w, n) with p nodes per dimension is constructed
recursively, so that each node has up to 2wn neighbors. A node has address
$x_{n-1}x_{n-2} \cdots x_i \cdots x_1 x_0$, where $x_i$ is a radix-p digit with $0 \le x_i \le p-1$ for
all $i = 0, 1, \ldots, n-1$. The neighbors of this node have addresses that differ
from its own address only in a single radix-p digit, that is, they have addresses
$x_{n-1}x_{n-2} \cdots x'_i \cdots x_1 x_0$, where $1 \le |x_i - x'_i| \le w$ for $0 \le i \le n-1$. This HOW
system contains $p^n$ nodes. It is important to note that such a system contains many
highly-overlapping generalized hypercubes GH(w + 1, n). The HOW(p, w, n) can
also be derived from the GH(p, n) by removing in each dimension all connections
for each node that do not fall into its left and right neighborhood windows defined
by w. Figure 1.4 shows 2-D HOW systems containing 16, 25, 36, and 49 processors,
respectively, and having window size w = 3. The HOW(4, 3, 2) in Figure 1.4.a
is identical to the GH(4, 2). In general, the HOW(p, p - 1, n) is identical to the
GH(p, n). Also, the HOW(p, 1, n) is identical to the n-D mesh.
Figure 1.3 shows 1-D HOW systems containing 15 processors and having
window size of 3, 4, and 5, respectively. Figure 1.4 shows the 2-D HOW(4, 3, 2),
HOW(5, 3, 2), HOW(6, 3, 2), and HOW(7, 3, 2) systems containing 16, 25, 36, and
49 processors, respectively, and having window size w = 3.
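The n-D neighbor rule and the two limiting cases (window size p - 1 gives the generalized hypercube, window size 1 gives the mesh) can be checked by brute force on small systems. The sketch below assumes node addresses are represented as Python tuples of radix-p digits; all function names are hypothetical and introduced only for illustration.

```python
from itertools import product


def how_neighbors(addr, p, w):
    """Neighbors of `addr` (a tuple of n radix-p digits) in HOW(p, w, n)."""
    found = set()
    for i, x in enumerate(addr):
        # Change digit i only, by at most w, staying inside 0 .. p - 1.
        for y in range(max(0, x - w), min(p - 1, x + w) + 1):
            if y != x:
                found.add(addr[:i] + (y,) + addr[i + 1:])
    return found


def gh_neighbors(addr, p):
    """Neighbors in GH(p, n): addresses differing in exactly one digit."""
    return {addr[:i] + (y,) + addr[i + 1:]
            for i, x in enumerate(addr) for y in range(p) if y != x}


def mesh_neighbors(addr, p):
    """Neighbors in the n-D mesh: one digit changes by exactly 1."""
    return {addr[:i] + (y,) + addr[i + 1:]
            for i, x in enumerate(addr) for y in (x - 1, x + 1) if 0 <= y < p}


def same_topology(p, n, w, reference):
    """True if every node of HOW(p, w, n) has the neighbor set given by `reference`."""
    return all(how_neighbors(a, p, w) == reference(a, p)
               for a in product(range(p), repeat=n))
```

For example, `same_topology(4, 2, 3, gh_neighbors)` compares the HOW(4, 3, 2) with the GH(4, 2) node by node, and `same_topology(4, 2, 1, mesh_neighbors)` does the same against the 4 x 4 mesh.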
(a) 1-D system with 15 processors and window size 3.
(b) 1-D system with 15 processors and window size 4.
(c) 1-D system with 15 processors and window size 5.







Figure 1.4 Examples of 2-D HOW systems with w = 3. (a) HOW(4, 3, 2). (b)
HOW(5, 3, 2). (c) HOW(6, 3, 2). (d) HOW(7, 3, 2).
The next two theorems are pertinent:
Theorem 1.1.1 The diameter of the HOW(p, w, n) is $n\lceil \frac{p-1}{w} \rceil$.
PROOF. In the worst case, a message may have to traverse all n dimensions to
reach its destination. The diameter of the 1-D HOW(p, w, 1) is $\lceil \frac{p-1}{w} \rceil$. It becomes
$n\lceil \frac{p-1}{w} \rceil$ for the n-D HOW(p, w, n). $\blacksquare$
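As a sanity check on Theorem 1.1.1, the diameter of a small HOW system can be measured by breadth-first search and compared with the closed form. The following is a minimal sketch under the assumption that addresses are tuples of digits; the names `how_diameter_bfs` and `how_diameter_formula` are illustrative only, and the exhaustive search is practical only for small p and n.

```python
from collections import deque
from itertools import product


def how_diameter_bfs(p, w, n):
    """Diameter of HOW(p, w, n) found by breadth-first search from every node."""
    def neighbors(addr):
        for i, x in enumerate(addr):
            for y in range(max(0, x - w), min(p - 1, x + w) + 1):
                if y != x:
                    yield addr[:i] + (y,) + addr[i + 1:]

    diameter = 0
    for source in product(range(p), repeat=n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors(node):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        # The largest distance from this source is its eccentricity.
        diameter = max(diameter, max(dist.values()))
    return diameter


def how_diameter_formula(p, w, n):
    """Closed form of Theorem 1.1.1: n * ceil((p - 1) / w), in integer arithmetic."""
    return n * ((p - 1 + w - 1) // w)
```

The search starts from every node because the regular HOW is not vertex-symmetric: corner nodes have larger eccentricity than central ones.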
Theorem 1.1.2 The number of channels in the n-D HOW(p, w, n) is $n p^{n-1} C_1$,
where $C_1 = \frac{w}{2}(2p - w - 1)$ is the number of channels in the 1-D HOW(p, w, 1).
PROOF. The number of channels $C_1$ in the 1-D HOW(p, w, 1) is $(p-w)w + \sum_{i=0}^{w-1} i$,
or $(p-w)w + \frac{(w-1)w}{2}$, or $\frac{w}{2}(2p-w-1)$. This is because starting from the leftmost
node and proceeding sequentially to the rightmost node in the colinear representation
of the 1-D system, each node contributes w new channels except for the rightmost
w nodes. The i-th node from the right, where $0 \le i \le w-1$, contributes i new
channels. The proof for the n-D HOW(p, w, n) is based on mathematical induction.
The number of channels in the 2-D HOW(p, w, 2) is $2pC_1$ because it contains p
rows and p columns of 1-D HOW(p, w, 1)s. Let the number of channels $C_{n-1}$ in the
(n - 1)-D HOW(p, w, n - 1) be $(n-1)p^{n-2}C_1$. The n-D HOW(p, w, n) is formed by
connecting together in HOW(p, w, 1) structures all nodes with the same address in
p independent HOW(p, w, n - 1)s with $p^{n-1}$ nodes each. Therefore, the number of
channels $C_n$ in the HOW(p, w, n) is $pC_{n-1} + p^{n-1}C_1$, or $n p^{n-1} C_1$. $\blacksquare$
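The channel count of Theorem 1.1.2 can be cross-checked by enumeration: summing the degree of every node and halving the result counts each bidirectional channel once. The sketch below uses hypothetical function names and is meant only as a numerical check for small systems, not as part of the proof.

```python
from itertools import product


def how_channels_enumerated(p, w, n):
    """Count channels of HOW(p, w, n) by summing node degrees and halving."""
    total_degree = 0
    for addr in product(range(p), repeat=n):
        for x in addr:
            # Neighbors along one dimension: digits within w of x, excluding x.
            total_degree += min(p - 1, x + w) - max(0, x - w)
    return total_degree // 2


def how_channels_formula(p, w, n):
    """Theorem 1.1.2: n * p^(n-1) * C1 with C1 = w(2p - w - 1)/2."""
    c1 = w * (2 * p - w - 1) // 2   # w(2p - w - 1) is always even
    return n * p ** (n - 1) * c1
```

For instance, the 1-D HOW(4, 2, 1) has the five channels (0,1), (0,2), (1,2), (1,3), (2,3), and both functions return 5 for `p=4, w=2, n=1`.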
1.1.2 Further Implementation Issues 
We will analyze the following systems and derive the equations for calculating their
numbers of channels.
• the binary hypercube (i.e., the m-cube);
• the k-ary n-cube;
• the generalized hypercube GH(k, n);
• the 2-D HOW($2^r$, w, 2);
• the n-D HOW(k, w, n).
Assume all systems have the same number N of processors, where $N = k^n = 2^m$.
The following are the derivations for these systems.
• In the m-cube each node connects to m other nodes. Nodes share channels in
pairs, so the total number of channels is $\frac{1}{2}m2^m$.
• In the k-ary n-cube each node has 2n neighbors and there are $k^n$ nodes. The
total number of channels is $\frac{1}{2} \cdot 2n \cdot k^n = nk^n$.
• In the generalized hypercube GH(k, n) each of the k nodes connects to the
remaining k - 1 nodes along one dimension. There are n dimensions and $k^n$
nodes. The total number of channels is $\frac{n(k-1)k^n}{2}$.
• For the 2-D HOW($2^r$, w, 2), since the 1-D HOW(k, w, 1) is the building block
we first calculate the number of channels in the 1-D HOW(k, w, 1) system. For
the first (starting from the left side) k - w nodes, each node has w channels
because the window size is w and it connects to the w nodes to its right. Following
this rule, no wire will be counted twice. For the rightmost w nodes, the number
of channels will be $0 + 1 + 2 + \cdots + (w-1) = \frac{(w-1)w}{2}$. The total number of channels
in the 1-D HOW(k, w, 1) is $(k-w)w + \frac{(w-1)w}{2} = w\left((k-w) + \frac{w-1}{2}\right) = w\left(k - \frac{w+1}{2}\right)$.
For the 2-D HOW(k, w, 2), there are $k^2$ nodes, and each node has up to 4w
neighbors. It can be viewed as k rows and k columns of HOW(k, w, 1) systems,
and therefore the total number of channels is $2kw\left(k - \frac{w+1}{2}\right)$.
• The n-D HOW(k, w, n) contains $k^n$ nodes and $k^{n-1}$ 1-D HOW(k, w, 1)
building blocks in each of its n dimensions. Applying mathematical induction, we find
that the total number of channels in the n-D HOW(k, w, n) is $nk^{n-1}w\left(k - \frac{w+1}{2}\right)$.
Table 1.1 compares the numbers of channels in the binary hypercube (i.e., m-cube),
the k-ary n-cube (i.e., n-D torus), the generalized hypercube GH(k, n), the
2-D HOW($2^r$, w, 2), and the n-D HOW(k, w, n), all with the same number N of
processors.
This dissertation focuses on 2-D HOW systems because of their simplicity, high
bisection width, and ease of implementation. For a comparison, assume bidirectional
data channels for full-duplex communications (i.e., simultaneous data transfers in
both directions) and that $N = k^n = 2^m$ (therefore, $k = N^{1/n} = 2^{m/n}$). For an
example, assume systems with N = 16,384 processors (i.e., m = 14) and 64-bit data
channels; the numbers of wires in these systems are:
• $\frac{1}{2} \times 14 \times 2^{14} \times 64 = 7 \times 16384 \times 64$ = 7,340,032 channels (also means 14,680,064
full-duplex bidirectional wires) for the 14-cube with diameter 14;
• $2 \times 128^2 \times 64$ = 2,097,152 channels (also means 4,194,304 full-duplex bidirectional
wires) for the 128-ary 2-cube with diameter 128;
• $2 \times 128^{2-1} \times \frac{127 \times 128}{2} \times 64$ = 133,169,152 channels (also means 266,338,304
full-duplex bidirectional wires) for the GH(128, 2) with diameter 2;
• $128 \times 32 \times (2 \times 128 - 32 - 1) \times 64$ = 58,458,112 channels (also means 116,916,224
full-duplex bidirectional wires) for the HOW(128, 32, 2) with diameter 8;
• $128 \times 16 \times (2 \times 128 - 16 - 1) \times 64$ = 31,326,208 channels (also means 62,652,416
full-duplex bidirectional wires) for the HOW(128, 16, 2) with diameter 16; and
• $128 \times 8 \times (2 \times 128 - 8 - 1) \times 64$ = 16,187,392 channels (also means 32,374,784 full-duplex
bidirectional wires) for the HOW(128, 8, 2) with diameter 32.
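The arithmetic in the list above can be reproduced with a short script. This is a sketch that assumes the channel-count and diameter formulas derived earlier in this section; the function name `wire_counts`, its parameters, and the dictionary keys are illustrative choices, not notation from the dissertation. The first value in each pair is the channel count multiplied by the 64-bit channel width, i.e., the first figure quoted in each bullet.

```python
def wire_counts(m=14, n=2, windows=(32, 16, 8), channel_bits=64):
    """Return {network: (channels * channel_bits, diameter)} for N = 2**m processors."""
    N = 2 ** m
    k = round(N ** (1 / n))          # nodes per dimension, k = N^(1/n)
    if k ** n != N:
        raise ValueError("N must be a perfect n-th power")
    table = {
        # m-cube: (1/2) m 2^m channels, diameter m
        f"{m}-cube": (m * N // 2 * channel_bits, m),
        # k-ary n-cube: n k^n channels, diameter n * floor(k / 2)
        f"{k}-ary {n}-cube": (n * N * channel_bits, n * (k // 2)),
        # GH(k, n): n (k - 1) k^n / 2 channels, diameter n
        f"GH({k},{n})": (n * (k - 1) * N // 2 * channel_bits, n),
    }
    for w in windows:
        # HOW(k, w, n): n k^(n-1) * w (2k - w - 1) / 2 channels
        channels = n * k ** (n - 1) * w * (2 * k - w - 1) // 2
        diameter = n * -(-(k - 1) // w)   # n * ceil((k - 1) / w)
        table[f"HOW({k},{w},{n})"] = (channels * channel_bits, diameter)
    return table


if __name__ == "__main__":
    for name, (wires, diameter) in wire_counts().items():
        print(f"{name:16s} {wires:>12,d}  diameter {diameter}")
```

Halving the window size roughly halves the wire count while doubling the diameter, which is the trade-off the example is meant to expose.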
For the comparative analysis of these results, we emphasize again that HOW
systems with reasonable window size w are scalable, and could be implemented with
current and expected electronic and/or optical technologies because of the locality
of their interconnects. In contrast, binary hypercubes are not scalable because the
node degree increases with increases in the number of processors and, therefore, are
difficult to build. Also, large generalized hypercubes are impossible to build because
of their very large wiring complexity.
Table 1.1 Comparison of existing interconnection networks. All networks have
$N = p^n = 2^m$ nodes.
Network                  Number of channels                               Diameter
m-cube                   $\frac{N}{2}\log_2 N$                            $m = \log_2 N = n\log_2 p$
$N^{1/n}$-ary n-cube     $nN$                                             $n\lfloor \frac{N^{1/n}}{2} \rfloor = n\lfloor \frac{p}{2} \rfloor$
GH($N^{1/n}$, n)         $(N^{1/n} - 1)\, n\, \frac{N}{2}$                $\log_p N = n$
HOW($\sqrt{N}$, w, 2)    $\sqrt{N}\, w\, (2\sqrt{N} - w - 1)$             $2\lceil \frac{\sqrt{N}-1}{w} \rceil = 2\lceil \frac{p-1}{w} \rceil$
HOW($N^{1/n}$, w, n)     $\frac{n}{2} N^{1-\frac{1}{n}} w\, (2N^{1/n} - w - 1)$   $n\lceil \frac{N^{1/n}-1}{w} \rceil = n\lceil \frac{p-1}{w} \rceil$
1.2 The Class of Wrap-Around HOW Architectures 
Similarly to the wrap-around mesh, we introduce here wrap-around HOW architectures.
For the wrap-around HOW(k, w, 1) system, each node will have 2w
neighbors, that is, w neighbors to its left and w neighbors to its right. Figure 1.5 shows
the 1-D wrap-around HOW(15, 3, 1), HOW(15, 4, 1), and HOW(15, 5, 1) systems.
Figure 1.6 shows the 2-D wrap-around HOW(7, 3, 2) system.
Each processor in the n-D wrap-around HOW(p, w, n) has 2wn neighbors.
The derivation of the total number of channels in the wrap-around HOW(k, w, n)
system is then as follows. Because each node has 2nw neighbors and processors
share channels in pairs, each processor contributes nw channels to the whole system.
Therefore, the total number of channels is $k^{n-1} \cdot k \cdot nw = k^n nw$. Its diameter is half
of that for the regular HOW(k, w, n). Table 1.2 shows the comparison of different
networks.
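The wrap-around variant can be explored with the same brute-force approach used for small regular systems. The sketch below assumes that 2w is smaller than k, so that the left and right windows of a node do not overlap, and it measures the diameter by breadth-first search instead of assuming a formula. All names are hypothetical.

```python
from collections import deque
from itertools import product


def wrap_neighbors(addr, k, w):
    """Neighbors of `addr` in the wrap-around HOW(k, w, n); assumes 2w < k."""
    found = set()
    for i, x in enumerate(addr):
        for d in range(1, w + 1):
            for y in ((x + d) % k, (x - d) % k):   # modular right/left windows
                found.add(addr[:i] + (y,) + addr[i + 1:])
    return found


def wrap_channels(k, w, n):
    """Channels counted as (sum of node degrees) / 2."""
    nodes = product(range(k), repeat=n)
    return sum(len(wrap_neighbors(a, k, w)) for a in nodes) // 2


def wrap_diameter(k, w, n):
    """Diameter by BFS from node 0...0; every node looks the same with wrap-around."""
    source = (0,) * n
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in wrap_neighbors(node, k, w):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())
```

For the wrap-around HOW(7, 3, 2), `wrap_channels(7, 3, 2)` should agree with $k^n nw = 49 \cdot 2 \cdot 3$, and `wrap_diameter(7, 3, 2)` can be compared with the diameter of the regular HOW(7, 3, 2) given by Theorem 1.1.1.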
(a) 1-D wrap-around HOW(15, 3, 1).
(b) 1-D wrap-around HOW(15, 4, 1).
(c) 1-D wrap-around HOW(15, 5, 1).
Figure 1.5 1-D wrap-around HOW systems with 15 processors and window size of
3, 4, and 5, respectively.
Figure 1.6 The 2-D wrap-around HOW(7, 3, 2).
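Table 1.2 Comparison of interconnection networks. All networks have $N = p^n = 2^m$
nodes.
Network                          Number of channels                               Diameter
m-cube                           $\frac{N}{2}\log_2 N$                            $m = \log_2 N = n\log_2 p$
$N^{1/n}$-ary n-cube             $nN$                                             $n\lfloor \frac{N^{1/n}}{2} \rfloor = n\lfloor \frac{p}{2} \rfloor$
GH($N^{1/n}$, n)                 $(N^{1/n} - 1)\, n\, \frac{N}{2}$                $\log_p N = n$
HOW($\sqrt{N}$, w, 2)            $\sqrt{N}\, w\, (2\sqrt{N} - w - 1)$             $2\lceil \frac{\sqrt{N}-1}{w} \rceil = 2\lceil \frac{p-1}{w} \rceil$
HOW$_{wrap}$($\sqrt{N}$, w, 2)   $2wN$                                            $2\lceil \frac{\sqrt{N}-1}{2w} \rceil = 2\lceil \frac{p-1}{2w} \rceil$
HOW($N^{1/n}$, w, n)             $\frac{n}{2} N^{1-\frac{1}{n}} w\, (2N^{1/n} - w - 1)$   $n\lceil \frac{N^{1/n}-1}{w} \rceil = n\lceil \frac{p-1}{w} \rceil$
HOW$_{wrap}$($N^{1/n}$, w, n)    $nwN$                                            $n\lceil \frac{N^{1/n}-1}{2w} \rceil = n\lceil \frac{p-1}{2w} \rceil$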
CHAPTER 2 
COST ANALYSIS 
In this chapter, a VLSI cost comparison between 1-D HOW systems and generalized
hypercubes is presented. To determine the VLSI cost, we measure the number of 
wires and the complexity of the system based on the number of layers in the colinear 
layout of the circuit. 
2.1 Cost Analysis for the Regular HOW(p, w, 1)
A VLSI cost comparison between 1-D HOWs and generalized hypercubes is
presented. Since the focus of our attention in this dissertation is on 2-D systems
with p nodes in each dimension, this 1-D comparison is assumed to be carried out
for each of the p rows and p columns in the 2-D systems (i.e., for their building 
blocks). The next definition is pertinent. 
DEFINITION 2.1. The crossing number of a graph is the minimum number of
edge crossings needed to draw the graph in the plane [27].
This number is related to the area needed to lay out the graph for VLSI implementation.
To eliminate all edge crossings, several printed-circuit layers may have to
be implemented. Not only does the number of layers affect the VLSI cost, but the
thickness of each layer also contributes to the cost measure.
To determine the VLSI/wire cost, we measure the complexity of each system 
based on the minimum number of layers required in the colinear layout of the circuit 
for zero edge crossings and/or the width of each layer. In the colinear layout, all 
nodes in the 1-D system lie on the same straight line. The chosen rules of routing 
the wires for 1-D systems are: 
• We consecutively number the processors 0, 1, 2, ..., p - 1, from left to right.
• Going from left to right, for even-numbered processors the wires go to the top
half of the printed-circuit board.
• For odd-numbered processors, the wires go to the bottom half of the printed-circuit
board.
These basic rules of routing the wires minimize their maximum collective width,
MCW (expressed in number of wires), in the x dimension. Figure 2.1 shows the
colinear layout of the 1-D HOW(12, 4, 1) and its brute-force decomposition for its
implementation with two layers that eliminate all edge/wire crossings. However, the
number of layers that eliminate all wire crossings depends on the value of w, and
thus it increases with increases in the window size. For example, Figure 2.2 shows
that the HOW(12, 5, 1) requires three layers for the elimination of all wire crossings.
The following theorems are pertinent. 
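Theorem 2.1.1 The MCW in the colinear layout of the 1-D HOW(p, w, 1) with a
single layer is
$$MCW = \begin{cases} \frac{w}{2}\left(\frac{w}{2} + 1\right) & \text{for even } w \\ \left(\frac{w+1}{2}\right)^2 & \text{for odd } w \end{cases}$$
for practical cases with $w < \frac{p}{2}$. For the 1-D generalized hypercube GH(p, 1), the
value of MCW is $(p-3)\phi + p - 1 - 2\phi^2$ with $\phi = \lfloor \frac{p-1}{4} \rfloor$.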
PROOF. MCW can be determined by finding the maximum number of those
wires that are located in either the upper or lower half of the layer between $PE_{w-1}$
and $PE_w$. If w is even, then this maximum number corresponds to the lower half
of the layer because $PE_{w-1}$, which is the rightmost PE in the leftmost window, is
the last PE that contributes to MCW and contributes to the lower half (because
it has an odd address). Therefore, we have $PE_1$ contributing two wires because it
is connected to $PE_w$ and $PE_{w+1}$ outside of this leftmost window. $PE_3$ contributes
four wires because it is connected to $PE_w$, $PE_{w+1}$, $PE_{w+2}$ and $PE_{w+3}$, and so on.
(b) The decomposition: the first layer.
(c) The decomposition: the second layer.
Figure 2.1 Colinear layout of the 1-D HOW system with 12 PEs and window size
of 4, and its brute-force decomposition into printed-circuit layers.
(a) Colinear layout of the one-dimensional 1-D HOW system with 12 PEs and window size of 5.
(b) The decomposition: the first layer.
(c) The decomposition: the second layer.
(d) The decomposition: the third layer.
Figure 2.2 Colinear layout of the 1-D HOW system with 12 PEs and window size
of 5, and its brute-force decomposition into printed-circuit layers.
Therefore, we have $MCW = 2 + 4 + 6 + 8 + \cdots + w$, or $\sum_{i=1}^{w/2} 2i$ where w/2 is an
integer or, finally, $\frac{w}{2}\left(\frac{w}{2} + 1\right)$. For odd w, however, MCW corresponds to the upper
half of the layer because $PE_{w-1}$, which is the rightmost PE in the leftmost window,
is the last PE that contributes to MCW and contributes to the upper half (because
it has an even address). Therefore, we have $PE_0$ contributing one wire because it
is connected to $PE_w$. $PE_2$ contributes three wires because it is connected to $PE_w$,
$PE_{w+1}$ and $PE_{w+2}$, and so on. Therefore, we have $MCW = 1 + 3 + 5 + 7 + \cdots + w$, or
$\sum_{i=0}^{(w-1)/2} (2i+1)$ where $\frac{w-1}{2}$ is an integer, or $2\sum_{i=1}^{(w-1)/2} i + \left(\frac{w-1}{2} + 1\right)$ or, finally,
$\left(\frac{w+1}{2}\right)^2$. To
obtain these results, we assumed that all w wires leaving $PE_{w-1}$ exist, and therefore
$w - 1 + w < p$ or $w < \frac{p}{2}$. This should be expected to be the practical case for
HOWs. However, the results do not cover generalized hypercubes because for them
we have w = p - 1. Therefore, generalized hypercubes must be treated separately.
Because of the symmetry in 1-D generalized hypercubes, without loss of generality
we can find the MCW by focusing on the upper half of the printed circuit. In fact,
we can count the contribution of each PE in a left-to-right order. Let $\alpha$ be equal to
p - 1. $PE_0$ contributes $\alpha$ wires because it is connected to $\alpha$ neighbors to its right.
$PE_2$ contributes $\alpha - 4$ wires to MCW because it is connected to $\alpha - 2$ neighbors
to its right and two levels of wires emanating from $PE_0$ can be reused (therefore,
$PE_2$ also can use the same wire levels). Similarly, $PE_4$ contributes $\alpha - 8$ wires to
MCW because it is connected to $\alpha - 4$ neighbors to its right and four levels of
wires emanating from $PE_0$ can be reused. Similarly, $PE_6$ contributes $\alpha - 12$ wires
to MCW because it is connected to $\alpha - 6$ neighbors to its right and six levels of
wires emanating from $PE_0$ can be reused. In general, $PE_i$, where i = 2j, contributes
$\alpha - 2i$ wires to MCW because it has $\alpha - i$ neighbors to its right and it can reuse
i levels of wires emanating from $PE_0$. However, even-numbered PEs i for which
$\alpha - 2i$ is negative or zero do not contribute to MCW. Therefore, contributing PEs
have addresses 2i, with $\alpha - 4i \ge 0$ or $i \le \lfloor \frac{\alpha}{4} \rfloor$. The value of MCW is then given by
$\sum_{i=0}^{\phi}(\alpha - 4i)$, where $\phi = \lfloor \frac{\alpha}{4} \rfloor$. This sum is also equal to $(p-3)\phi + p - 1 - 2\phi^2$. $\blacksquare$
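The closed forms in Theorem 2.1.1 can be checked numerically against the sums that appear in the proof. The sketch below only verifies the algebra (sum versus closed form); it does not simulate the wire layout itself, and the function names are assumptions made for illustration.

```python
def mcw_how_sum(w):
    """MCW of HOW(p, w, 1) as the series from the proof (valid for w < p/2)."""
    if w % 2 == 0:
        return sum(2 * i for i in range(1, w // 2 + 1))        # 2 + 4 + ... + w
    return sum(2 * i + 1 for i in range((w - 1) // 2 + 1))     # 1 + 3 + ... + w


def mcw_how_closed(w):
    """Closed form: (w/2)(w/2 + 1) for even w, ((w + 1)/2)^2 for odd w."""
    if w % 2 == 0:
        return (w // 2) * (w // 2 + 1)
    return ((w + 1) // 2) ** 2


def mcw_gh_sum(p):
    """MCW of GH(p, 1) as the sum of (alpha - 4i) for i = 0..phi, alpha = p - 1."""
    alpha = p - 1
    phi = alpha // 4
    return sum(alpha - 4 * i for i in range(phi + 1))


def mcw_gh_closed(p):
    """Closed form: (p - 3) phi + p - 1 - 2 phi^2 with phi = floor((p - 1)/4)."""
    phi = (p - 1) // 4
    return (p - 3) * phi + p - 1 - 2 * phi * phi
```

As a worked instance, the 12-node systems used in the figures give an MCW of 6 for w = 4, 9 for w = 5, and 11 + 7 + 3 = 21 for the GH(12, 1).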
This theorem shows that HOWs have much smaller wire width (MCW) than
generalized hypercubes for practical cases because this width is $O(w^2)$ and $O(p^2)$,
respectively. The next theorem shows the number of printed-circuit boards (i.e.,
layers) required to eliminate all wire crossings when the brute-force decomposition
of the type shown in Figure 2.1 is applied.
Theorem 2.1.2 The number of layers that eliminate all wire crossings with brute-force
decomposition of the HOW(p, w, 1) is $\lceil \frac{w}{2} \rceil$. It becomes $1 + \lceil \frac{p-4}{2} \rceil$ for the
generalized hypercube.
PROOF. Assuming the wire routing rules defined earlier and the brute-force
decomposition to produce zero wire crossings, we focus for the proof on a single
window. Each layer deals with a pair of consecutive nodes within the window and
there are $\lceil w/2 \rceil$ pairs. Thus, we need a total of $\lceil \frac{w}{2} \rceil$ layers for the HOW(p, w, 1). For
the generalized hypercube, going from left to right in the colinear representation of
the system, each layer contains two successive nodes that connect to all other nodes
to their right. However, up to four rightmost nodes can be combined in the last
layer with zero wire crossings, and thus the total number of layers for the generalized
hypercube is $1 + \lceil \frac{p-4}{2} \rceil$. $\blacksquare$
We observe that the numbers of layers in HOWs and generalized hypercubes
of similar size are O(w) and O(p), respectively. This is another advantage of HOWs
that renders them more viable for implementation than generalized hypercubes.
It is worth also mentioning here another wire routing technique, namely
restricted routing [28], that requires only two layers for the implementation of
any system represented in the 2-D space. As a result, both HOWs and generalized
hypercubes require two printed-circuit layers regardless of their size. In the
case of restricted routing, horizontal and vertical wire segments are laid on two
different wiring layers. Figures 2.3, 2.4 and 2.5 demonstrate this technique for
the HOW(12, 4, 1), HOW(12, 5, 1) and GH(12, 1) systems, respectively. Horizontal
and vertical wires can then cross over each other without any electrical connection.
If a connection is needed, a contact is placed at the respective intersection; these
contacts contribute to the VLSI cost. Therefore, the total wiring cost with restricted
routing has four components:
• The total number of wires. This number is $O(wp^2)$ and $O(p^3)$ for 2-D HOWs
and GHs, respectively.
• The maximum collective width of wires, MCW (it affects the cost of the larger
layer that contains the horizontal wires). This number is $O(w^2)$ and $O(p^2)$ for
HOWs and GHs, respectively.
• The length of the wires. The maximum length is O(w) and O(p) for HOWs
and GHs, respectively.
• The total number of electrical connections (contacts) between the two layers.
This number is twice the total number of wires. Therefore, it is $O(wp^2)$ and
$O(p^3)$ for HOWs and GHs, respectively.
Therefore, HOWs are superior to GHs even with restricted routing. We can conclude
that HOWs are more amenable to implementation than GHs for reasonable values of w.
The following sections also show that HOWs can deliver very high performance.
(a) Colinear layout of the one-dimensional 1-D HOW system with 12 PEs and window size of 4.
Figure 2.3 Colinear layout of the 1-D HOW system with 12 PEs and window size
of 4, and its decomposition into printed-circuit layers using vertical and horizontal
lines.
(a) Colinear layout of the one-dimensional 1-D HOW system with 12 PEs and window size of 5.
(b) Decomposition with vertical lines.
Figure 2.4 Colinear layout of the 1-D HOW system with 12 PEs and window size
of 5, and its decomposition into printed-circuit layers using vertical and horizontal
lines.
(a) Colinear layout of the one-dimensional generalized hypercube with minimized number of wires.
(b) Decomposition of vertical lines.
(c) Decomposition of horizontal lines.
Figure 2.5 Colinear layout of generalized hypercube with 12 PEs, and its decomposition
into printed-circuit layers using vertical and horizontal lines.
2.2 Cost Analysis for the Wrap-Around HOW(p, w, 1)
Let us now further investigate the VLSI wire cost of HOWs with wrap-around
connections. From Figures 2.6 and 2.7, it is very clear that the maximum collective
width (MCW) increases with increases in the window size w. This is because, in
order to connect pairs of nodes belonging to the leftmost and rightmost windows,
respectively, in the colinear layout, the wires will cross the entire printed-circuit
plane. The number of wires needed to connect all nodes in the two opposite ends is
$w + (w-1) + (w-2) + \cdots + 1 = \frac{w(w+1)}{2}$. Of course, we could split the wires equally
between the upper and lower halves of the layer.
The following theorem determines the value of MCW.
Theorem 2.2.1 The MCWin the colineaT layout of the wmp-amund I-D HOW(p,w, 1) 
with a single layeT is 
{ 
¥fUi + 1) + rW(~+l)l faT even w 
.Mel'll = (W;1)2 + rW(~+l)l faT odd w 
PROOF. Refer to the proof for the regular 1-D HO\iV(p,w,l). The total number 
of extra wires for the wrap-around system is W(~+l). \Ne could split the wires equally 
between the upper and lower halves of the layer. So, \ve need to add rW(~+l)l to the 




Figure 2.6 Decomposition of the 1-D wrap-around HOW(12,4,1).
Figure 2.7 Decomposition of the 1-D wrap-around HOW(12,5,1).
CHAPTER 3
1-D HOW SYSTEM EMBEDDINGS
In this chapter, we discuss embeddings of various widely used interconnection
networks into 1-D HOW systems. Such embeddings could prove very beneficial
as HOW and related systems demonstrate significant promise in scalable parallel
processing [18] [19] [29] [30] [31].
Some definitions are pertinent for the analysis of results. Given two graphs
G(V, E) and G'(V', E'), embedding the graph G into the graph G' results in the
mapping of each vertex in the set V onto a vertex in the set V' and of each edge in the
set E onto an edge, or a set of edges, in E'. There are three important parameters
that determine the quality of mapping a graph G(V, E) onto a graph G'(V', E').
• Dilation of a source edge in E: the number of edges in E' that the edge in E 
is mapped onto. 
• Congestion of a target edge in E': the number of source edges mapped onto 
the edge in E'. 
• Expansion: the ratio of the number of nodes in the set V' to that in the set 
V. 
Example: referring to Figure 3.1, there are two graphs: the source graph G(4, 2)
and the target graph G'(9, 8). The parameters are as follows:
• The dilation of (A,B) is 5.
• The dilation of (C,D) is 4.
• The congestion of (K,L) is 2.






Source Graph with N nodes Target Graph with N' nodes 
Figure 3.1 The definition of dilation, congestion and expansion. 
In this dissertation, we try, if possible, to limit the scope of the discussion to
cases where the expansion is one, for the sake of cost effectiveness.
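The three quality measures defined above can be computed mechanically once every source edge has been assigned a route in the target graph. The sketch below uses a hypothetical toy instance (a 3-node path a-b-c placed on a 4-node path 0-1-2-3 with a -> 0, b -> 2, c -> 1); it is not the example of Figure 3.1, and all names are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def embedding_metrics(paths, n_source, n_target):
    """paths maps each source edge to the list of target edges that carry it."""
    # Dilation of a source edge: number of target edges it is mapped onto.
    dilation = {edge: len(route) for edge, route in paths.items()}
    # Congestion of a target edge: number of source edges routed over it.
    congestion = Counter(frozenset(t) for route in paths.values() for t in route)
    # Expansion: ratio of target nodes to source nodes.
    expansion = Fraction(n_target, n_source)
    return dilation, congestion, expansion


paths = {
    ("a", "b"): [(0, 1), (1, 2)],  # a -> 0 and b -> 2, routed through node 1
    ("b", "c"): [(2, 1)],          # b -> 2 and c -> 1, a single target edge
}
dilation, congestion, expansion = embedding_metrics(paths, n_source=3, n_target=4)
print(dilation, dict(congestion), expansion)
```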
3.1 Embedding a Ring into a 1-D HOW System 
A ring of p nodes with addresses 0 to p - 1 can be embedded into a 1-D HOW system
with p nodes by mapping the ring processor with address i, where i = 0, 1, 2, ..., p - 1,
onto the distinct processor x, where x = 0, 1, 2, ..., p - 1. Our embedding procedure
distinguishes between even and odd addresses x, with x = 2k and x = 2k + 1,
respectively, in the 1-D system and uses the function i = C(k) to get the address i
of the corresponding processor in the ring. The function C(k) is defined as follows:
$$C(k) = \begin{cases} k & \text{if } x = 2k, \text{ for } k = 0, 1, 2, \ldots, \lfloor (p-1)/2 \rfloor \\ (p - 1) - k & \text{if } x = 2k + 1, \text{ for } k = 0, 1, 2, \ldots, \lceil (p-1)/2 \rceil - 1 \end{cases}$$
It is easy to see that this mapping technique requires a window size of at least
w = 2 for optimal mapping (i.e., the dilation is one). Figure 3.2 illustrates the
embedding of a sixteen-processor ring into a 1-D HOW system, also with sixteen
processors.
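As a quick check of the even/odd rule, the sketch below builds the mapping for a given p and measures the largest distance, in the 1-D system, between processors that are neighbors in the ring. The function names are hypothetical; the rule itself (even positions ascend, odd positions descend from p - 1) is the one described above.

```python
def ring_index(x, p):
    """Ring address i = C(k) of the processor at position x of the 1-D system."""
    k, is_odd = divmod(x, 2)
    return (p - 1) - k if is_odd else k


def max_ring_edge_span(p):
    """Largest 1-D distance between two processors that are ring neighbors."""
    position = {ring_index(x, p): x for x in range(p)}  # ring address -> position
    assert len(position) == p                           # the mapping is one-to-one
    return max(abs(position[i] - position[(i + 1) % p]) for i in range(p))


print(max_ring_edge_span(16))  # every ring edge fits in a window of size 2
```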
For w > 2, we can also use several other embedding functions for optimal
mapping, including the function C'(k) that follows:
$$C'(k) = \begin{cases} (w - 1)k + (w - j) & \text{if } x = (k + 1)w - j, \text{ for } j = 2, 3, 4, \ldots, w \text{ and } k = 0, 1, 2, \ldots, \lfloor p/w \rfloor \\ (p - 1) - k & \text{if } x = (k + 1)w - 1, \text{ for } k = 0, 1, 2, \ldots, \lceil p/w \rceil - 1 \end{cases}$$
Figure 3.3 Embedding a p-processor ring into the 1-D HOW(p,w,1) system with
another technique.
Figure 3.3 illustrates the general embedding of a p-processor ring into the 1-D
HOW(p,w,1) system, using the function C'. All proposed embeddings have dilation
one, congestion one, and expansion one.
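The window-based embedding can be exercised in the same way. This is a minimal sketch that assumes p is a multiple of w (so that every window is complete); the last processor of each window is placed on the return path of the ring and the remaining w - 1 processors on the forward path. Function names are hypothetical.

```python
def ring_index_windowed(x, p, w):
    """Ring address of position x under the window-based rule (p a multiple of w)."""
    k, r = divmod(x, w)
    if r == w - 1:             # x = (k + 1)w - 1: last processor of window k
        return (p - 1) - k
    return (w - 1) * k + r     # x = (k + 1)w - j with j = w - r


def max_ring_edge_span_windowed(p, w):
    position = {ring_index_windowed(x, p, w): x for x in range(p)}
    assert len(position) == p  # the mapping is one-to-one
    return max(abs(position[i] - position[(i + 1) % p]) for i in range(p))


print(max_ring_edge_span_windowed(12, 4))  # no ring edge is longer than w
```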
3.2 Embedding a 2-D Mesh into a 1-D HOW System 
We present here embedding techniques for the 2-D mesh and torus topologies. The
target architectures are 1-D HOW systems. In a subsequent section, we will show
that much better embeddings can be derived if the target HOW systems are 2-D.
3.2.1 2-D Regular Mesh 
Considering a p x n mesh with p rows and n columns, we can embed this mesh
into a (p x n)-processor 1-D HOW system by mapping the processor (i, j), where
i = 0, 1, 2, ..., p - 1 and j = 0, 1, 2, ..., n - 1, of the mesh onto the processor
x = H(i, j), where x = 0, 1, 2, ..., (pn - 1), of the 1-D system. The function H(i, j)
is defined as follows:
$$H(i, j) = \begin{cases} i + jp & \text{if } p \le n \\ in + j & \text{if } p > n \end{cases} \quad \text{for } i = 0, 1, 2, \ldots, p - 1 \text{ and } j = 0, 1, 2, \ldots, n - 1$$
This mapping of a mesh onto a 1-D system has the following properties:
• If p < n, with column-wise mapping of mesh nodes and window size of at least
p the mapping is optimal (with dilation one).
• If p > n, with row-wise mapping of mesh nodes and window size of at least n 
the mapping is optimal. 
• If p = n, the row-wise mapping is the same as the column-wise mapping. 
• The window size must be at least p = n = min{p, n} for an optimal mapping.
Figure 3.4 (a) Source 3x5 mesh and (b) its optimal embedding into the 1-D
HOW(15,3,1) system.
Figure 3.5 (a) Source 5x3 mesh and (b) its optimal embedding into the 1-D
HOW(15,3,1) system.
Figure 3.6 Mapping the 7 x 7 mesh in two different ways.
Figures 3.4 and 3.5 illustrate optimal embeddings of the 3x5 and 5x3 meshes,
respectively, into the 1-D HOW(15,3,1) system. In order to get an optimal mapping,
the window size w should be at least equal to min{p, n}. If
$\min\{p, n\} > w \ge \lfloor \min\{p, n\}/2 \rfloor$, then the mapping is suboptimal with maximum
dilation two; if $\lfloor \min\{p, n\}/2 \rfloor > w \ge \lfloor \min\{p, n\}/3 \rfloor$, then the mapping is
suboptimal with maximum dilation three; etc. In the general case, if
$\lfloor \min\{p, n\}/m \rfloor > w \ge \lfloor \min\{p, n\}/(m+1) \rfloor$, where m is a positive
integer, then the embedding has maximum dilation m + 1. The expansion and
congestion are both one.
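The window requirement of the mesh mapping can be verified directly. The sketch below assumes the column-wise rule x = i + jp for p <= n and the row-wise rule x = in + j for p > n, and reports the longest 1-D distance spanned by any mesh edge; the function names are hypothetical.

```python
def mesh_to_linear(i, j, p, n):
    """Position of mesh node (i, j): column-wise if p <= n, row-wise otherwise."""
    return i + j * p if p <= n else i * n + j


def required_window(p, n):
    """Longest 1-D distance between two nodes that are neighbors in the p x n mesh."""
    span = 0
    for i in range(p):
        for j in range(n):
            x = mesh_to_linear(i, j, p, n)
            if i + 1 < p:  # neighbor in the next row
                span = max(span, abs(mesh_to_linear(i + 1, j, p, n) - x))
            if j + 1 < n:  # neighbor in the next column
                span = max(span, abs(mesh_to_linear(i, j + 1, p, n) - x))
    return span


print(required_window(3, 5), required_window(5, 3), required_window(7, 7))
```

For the 3x5 and 5x3 meshes the result equals min{p, n} = 3, which matches the HOW(15,3,1) targets of Figures 3.4 and 3.5.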
The best mapping is not unique. For example, we can use another way to get
a best mapping for the same window size. Figure 3.6 shows an example of mapping the
7 x 7 mesh onto a 1-D system in two different ways; the first mapping applies
row-major order while the second mapping is along the diagonals (i.e., along the
dashed lines).
Figure 3.7 4x4 wraparound mesh and its optimal mapping onto the 1-D
HOW(16,8,1) system: (a) 4x4 wraparound mesh (source); (b) row-wise intermediate
step; (c) column-wise intermediate step with processor number in the 1-D system;
(d) 1-D system (target). In panel (c), each mesh address is followed by its
processor number in the 1-D system:
(0,0) 0    (0,3) 1    (0,1) 2    (0,2) 3
(3,0) 4    (3,3) 5    (3,1) 6    (3,2) 7
(1,0) 8    (1,3) 9    (1,1) 10   (1,2) 11
(2,0) 12   (2,3) 13   (2,1) 14   (2,2) 15
3.2.2 2-D Wraparound Mesh or Torus 
Embedding a p x n wraparound mesh into a 1-D system is a natural combination
and extension of embedding (p + n) rings and a 2-D mesh into a 1-D
system. We can embed a p x n wraparound mesh into a (p x n)-processor 1-D
HOW system by mapping the processor (i, j) of the torus onto the processor
(C(i,j); where j is fixed) || (C(i,j); where i is fixed) || H(i,j). The symbol || denotes
concatenation of two different mappings onto the 1-D HOW system. The functions
"C" and "H" were defined earlier for the ring and mesh embeddings.
Figure 3.7 is a step-by-step example for mapping a 4x4 wraparound mesh. This
mapping of a wraparound mesh onto a 1-D system is a natural combination/extension
of ring and mesh mappings, and therefore it inherits all the properties associated
with the latter mappings. For example, the window size should be at least equal to
2 x min{p, n} for optimal mapping.
Figure 3.8 (a) A 31-processor full binary tree with depth d = 5 and the numbering
of its nodes and (b) its optimal embedding into the 1-D HOW(31,8,1) system. The
required minimum window size for optimal mapping (dilation = 1) at level d is
$2^{d-2}$: 8 for d = 5, 4 for d = 4, and 2 for d = 3.
3.3 Embedding a Binary Tree into a 1-D HOW System
Binary trees can be embedded into a 1-D system in several ways. Consider a full
binary tree of level d containing $2^d - 1$ processors. A good embedding can be derived
by numbering the nodes of the full tree in the manner shown in the example of
Figure 3.8. The number assigned to a tree node then denotes the address of the
corresponding node in the HOW system.
This mapping has the following properties:
• It is a recursive mapping.
• The required minimum window size for optimal mapping is $2^{d-2}$, where d is
the level of the full tree.
If $w < 2^{d-2}$, the mapping is suboptimal. To find the dilation for suboptimal
mapping, the following proposition is pertinent.
Proposition 1. For the embedding of a full binary tree with depth d and
$2^d - 1$ nodes into a 1-D HOW system with the same number of nodes, we need $2^{d-i-1}$
connections at distance $2^i$ in the 1-D linear-array configuration of the HOW system,
for i = 0, 1, 2, ..., d - 2.
Corollary 1. The maximum dilation for a suboptimal embedding is $\lceil 2^{d-2}/w \rceil$
and corresponds to two source edges.
Corollary 2. If w is not a power of two, the maximum congestion for
suboptimal embedding is one. Otherwise, for $w = 2^v$, with v < d - 2, the maximum
congestion is d - 2 and corresponds to two target edges.
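Proposition 1 and Corollary 1 can be checked numerically. The sketch below assumes that the numbering of Figure 3.8 is the in-order numbering 1, ..., 2^d - 1 of the full tree (root 16 for d = 5), which is consistent with the window sizes quoted for that figure; the function names are hypothetical.

```python
from collections import Counter


def tree_edge_distances(d):
    """Histogram of 1-D distances spanned by the edges of a d-level full binary tree."""
    distances = Counter()

    def visit(lo, hi):
        # The subtree occupies addresses lo..hi; its root sits in the middle.
        root = (lo + hi) // 2
        if lo < hi:
            for child in (visit(lo, root - 1), visit(root + 1, hi)):
                distances[abs(root - child)] += 1
        return root

    visit(1, 2 ** d - 1)
    return dict(distances)


def max_dilation(d, w):
    """Corollary 1: the two longest edges (length 2^(d-2)) need ceil(2^(d-2) / w) hops."""
    return (2 ** (d - 2) + w - 1) // w


print(tree_edge_distances(5))  # 2^(d-i-1) edges at distance 2^i
print(max_dilation(5, 8), max_dilation(5, 3))
```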
3.4 Embedding a Hypercube into a 1-D HOW System 
A d-dimensional (direct binary) hypercube consists of $p = 2^d$ processors, and a
(d+1)-dimensional hypercube is constructed by connecting together pairs of processors with
the same addresses in two d-dimensional hypercubes [9] [14]. Two nodes are neighbors
in the d-dimensional hypercube if and only if their unique d-bit addresses differ in a
single bit [1] [9] [11].
We can embed the d-dimensional hypercube into the 1-D HOW system with
$2^d$ nodes by mapping each hypercube node to the node with the same address in
the HOW system. The window size should be at least equal to $2^{d-1}$ for optimal
mapping. For a large value of d, the mapping should normally produce large dilation
because of the large difference in the dimensionalities of the two systems. A
higher-dimensional HOW system could produce very good results. For this reason, we avoid
further analysis of this mapping for 1-D HOW systems.
Figure 3.9 shows the embedding of a 16-processor hypercube into a 16-processor
1-D HOW system.
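The window requirement follows from the address arithmetic: flipping bit b changes an address by exactly 2^b. A minimal sketch of this check, with a hypothetical function name, is:

```python
def hypercube_max_span(d):
    """Largest address difference between neighbors of a d-cube (identity mapping)."""
    return max(abs(x - (x ^ (1 << b))) for x in range(2 ** d) for b in range(d))


# For the 16-processor hypercube of Figure 3.9 the longest link spans 8 positions.
print(hypercube_max_span(4))
```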
Figure 3.9 (a) A 16-processor hypercube with binary addresses for its nodes and
(b) its optimal embedding into a 1-D HOW(16,8,1) system.
CHAPTER 4 
2-D HOW SYSTEM EMBEDDINGS 
In the following sections, we propose embeddings of various interconnection networks
into 2-D HOW systems. We limit the scope of the discussion to cases in which the
number of rows and the number of columns are the same, and equal to n, in the 2-D
HOW system. 2-D HOW system embeddings are extensions of 1-D HOW system
embeddings. Therefore, everything we discuss here is based on the preceding section.
4.1 Embedding a Ring into a 2-D HOW System 
An optimal ring mapping is always possible with expansion one if the window size is
at least 2. We visit the nodes in a serpentine-like, column-wise way where the first
column is scanned sequentially for an even number of rows. In this case, even with
w = 1 we produce an optimal mapping. For an odd number of rows, the nodes in
the first column cannot be visited sequentially, but an optimal mapping still exists
for w ≥ 2, as shown in Figure 4.1.
If the number of processors in the source graph is $(n - 1)^2 < PEs < n^2$, where
n is a positive integer, we just use one or more links connecting nodes at distance 2
to bypass several processors in the 2-D HOW system for optimal mapping, as shown
in Figure 4.2.
4.2 Embedding a 2-D Mesh/Torus into a 2-D HOW System 
It is straightforward to get an optimal mapping for the regular mesh. We apply the
ring mapping for 1-D systems in individual rows and columns. An optimal mapping
for wraparound edges of the torus does not exist if the target graph does not contain








Figure 4.1 Embeddings of rings into 2-D HOW systems.
Figure 4.2 Embeddings of rings into 2-D HOW systems when the numbers of nodes
in the rings are smaller than those in the HOW systems.
Figure 4.3 Mapping the 6 x 6 torus onto a 2-D HOW system with window size of
3: (a) 6x6 torus; (b) mapping onto the 2-D system. Consecutive bold segments in a
row/column implement wraparound connections in the torus.
Otherwise, for the wraparound mesh (i.e., torus) we split the wraparound
connections into a minimal number of segments based on the window size provided
by the 2-D HOW system. Some target processors are then used not only to process
data but also to forward data destined for processors that are otherwise neighbors
in the torus. The target system should still be expected to perform very well for
algorithms employing the torus.
The dilation of wraparound connections is then $\lceil (n-1)/w \rceil$. The mapping of torus
wraparound links can always be chosen so that the maximum congestion is one,
assuming that w ≥ 2. Figure 4.3 shows an example of mapping the 6 x 6 torus onto
the 2-D HOW(6,3,2) system, using a window size of 3.
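The splitting of a wraparound connection into segments can be illustrated for a single row or column. This is a sketch under the assumption that a wraparound link joins positions 0 and n - 1 and that each segment may span at most w positions, which gives ceil((n - 1)/w) segments; the function name is hypothetical.

```python
def wraparound_segments(n, w):
    """Split the link between positions 0 and n - 1 into hops of length at most w."""
    hops, x = [], 0
    while x < n - 1:
        nxt = min(x + w, n - 1)
        hops.append((x, nxt))
        x = nxt
    return hops


# 6 x 6 torus on a system with window size 3: two segments per wraparound link.
print(wraparound_segments(6, 3))
```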
4.3 Embedding a Binary Tree into a 2-D HOW System 
Binary trees can be embedded into 2-D HOW systems in several ways. Such an
embedding could be used for the implementation of data reduction operations [13].
Consider a full binary tree of level l containing $2^l - 1$ processors and the 2-D
HOW($\lceil \sqrt{2^l - 1} \rceil$, w, 2) system for the smallest expansion. We assume that w ≥ 2.
The two basic building blocks used in our binary tree mapping are for the 3-level
tree, and are shown in Figures 4.4 and 4.5. These two building blocks and their
mirror images are employed for the mapping of larger trees. For example, Figure
4.6 shows a mapping where the building block #1 at the upper-left corner of Figure
4.5 and its three mirror images are used for the mapping of the four distinct 3-level
trees containing leaves of the original 5-level tree. The mirror images are employed
to minimize the distances between the roots of these trees for connections at the
next level. The largest dilation of edges is 2 in this case (the reason for this is that
there is no way to directly connect processor-1 and processor-4, or processor-2 and
processor-6; we use two edges to connect them together as shown with the bold lines
in Figure 4.6).
Figure 4.4 Optimal mapping of the 3-level binary tree onto the 2-D HOW(3,2,2)
system: (a) 3-level binary tree; (b) the mapping onto the 2-D system.
In general, a large binary tree of level l is viewed as four appropriately connected
subtrees of level l - 2 for which embeddings into a 2-D HOW system are easily
obtained recursively; the interconnection of their roots after the embeddings is then
easily derived. An example is shown in Figure 4.7. The maximum dilation is two for
binary trees with an odd number of levels. Otherwise, we have optimal embeddings.
Figure 4.5 Optimal mapping of the 4-level binary tree onto the 2-D HOW(4,2,2)
system: (a) 4-level binary tree; (b) the mapping onto the 2-D system. The two
distinct building blocks for the mapping of 3-level binary trees are enclosed in
dotted lines.
4.4 Embedding a Hypercube into a 2-D HOW System 
We can embed a (direct binary) hypercube into a 2-D HOW system with two different
methods, based on the desired expansion.
First, we consider the embedding of the d-D hypercube into the 2-D HOW
system with $2^{\lceil d/2 \rceil} \times 2^{\lceil d/2 \rceil}$ nodes, corresponding to minimal expansion. We can embed
this hypercube recursively as shown in Figures 4.8 and 4.9, where optimal mapping
is achieved because of the large windows. This mapping is derived from the classical
2-D representation of hypercubes; optimal mapping results if $w \ge 2^{\lceil d/2 \rceil - 1}$. In the
general case, the largest dilation of edges for this mapping technique is
$\lceil 2^{\lceil d/2 \rceil - 1}/w \rceil$. The advantage of this method is that it is very simple and easy to
implement, but its disadvantage is that half of the processors are wasted when d is an
odd number. The maximum congestion is one if w is not a power of two. If $w = 2^v$,
with $v < \lceil d/2 \rceil - 1$, the maximum congestion for the mapping that minimizes the
maximum dilation is $(\lceil d/2 \rceil - 1) - v + 1$ or $\lceil d/2 \rceil - v$. The expansion is
$2^{2\lceil d/2 \rceil}/2^d$.
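The window bound and the expansion of the first method can be reproduced with a simple bit split. The sketch below assumes one concrete form of the classical 2-D representation, in which the low ceil(d/2) address bits select the column and the remaining bits select the row; this particular split and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def to_grid(address, d):
    """Split a d-bit address into (row, column); the column gets ceil(d/2) bits."""
    half = (d + 1) // 2
    return address >> half, address & ((1 << half) - 1)


def longest_link(d):
    """Longest row or column distance between two hypercube neighbors."""
    longest = 0
    for a in range(2 ** d):
        for b in range(d):
            (r1, c1), (r2, c2) = to_grid(a, d), to_grid(a ^ (1 << b), d)
            assert r1 == r2 or c1 == c2  # neighbors share a row or a column
            longest = max(longest, abs(r1 - r2) + abs(c1 - c2))
    return longest


def expansion_method_one(d):
    half = (d + 1) // 2
    return Fraction(2 ** (2 * half), 2 ** d)


print(longest_link(5), expansion_method_one(5))  # window of 4 suffices; half the nodes unused
```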
[Figure 4.6: (a) 5-level binary tree; (b) the mapping onto the 2-D system.]
Figure 4.7 Optimal mapping of the 6-level binary tree onto the 2-D HOW(8,2,2)
system. This mapping is based on the 4-level mapping, including the 4-level mapping
building block and its three mirror images, as shown by the bold dashed line.
Figure 4.8 Optimal 3-D and 4-D hypercube embeddings into the HOW(4,2) system
(method one): (a) 3-D hypercube; (b) mapping onto the 2-D system; (c) 4-D
hypercube; (d) mapping onto the 2-D system.
Figure 4.9 5-D hypercube and its embedding into a 2-D HOW system (method one).
Optimal mapping is derived if w ≥ 4.
Second, in order to minimize the number of unused processors in the 2-D HOW
system if d is odd, we can use another method to embed the d-D hypercube into the
2-D HOW(k,w,2) system, where $k = 3 \cdot 2^{(d-3)/2}$ if d is odd, or
$k = 2 \cdot 2^{(d-2)/2} = 2^{d/2}$ if d is even. This recursive mapping method is based on
the fact that the d-D hypercube is formed from four (d-2)-D hypercubes. When d is
even, we use the mapping of the $2^2$ hypercube as the fundamental building block; when
d is odd, we use the mapping of the $2^3$ hypercube as the fundamental building block.
The advantages of this method are that we can save a lot of otherwise unused
processors and that the 2-D mapping looks neater when d is an odd number. The
embedding of the 3-D hypercube into the "building block" HOW(3,2,2) is used
recursively. As shown in Figure 4.10, the embedding into the building block results
in only one unused node. The source edges (000,100), (100,101) and (110,111) have
dilation two in the building block, and the congestion is two for the target edge
(110,100); that is, there are 3 edges with dilation two and 1 edge with congestion
two. Figure 4.10 shows the embedding of a 5-D hypercube using four 3-D hypercube
building blocks in HOW(3,2,2)s. This example shows that there are only four unused
nodes.
In the general case, with the second method for an odd d the chosen target
system is the HOW($3 \times 2^{(d-3)/2}$, w, 2) for the best mapping with minimum expansion.
The expansion is actually equal to $(9 \times 2^{d-3})/2^d$ or 9/8.
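The arithmetic behind the 9/8 expansion is easy to reproduce. A minimal sketch with hypothetical function names, valid for odd d >= 3 and even d >= 2:

```python
from fractions import Fraction


def side_length(d):
    """Side k of the k x k target system used by the second method."""
    if d % 2:                          # odd d: k = 3 * 2^((d - 3) / 2)
        return 3 * 2 ** ((d - 3) // 2)
    return 2 ** (d // 2)               # even d: k = 2^(d / 2)


def expansion_method_two(d):
    return Fraction(side_length(d) ** 2, 2 ** d)


def unused_nodes(d):
    return side_length(d) ** 2 - 2 ** d


print(side_length(5), expansion_method_two(5), unused_nodes(5))
```

For d = 5 this gives a 6 x 6 system with four unused nodes, in line with the example of Figure 4.10.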
Proposition 1: Given the d-D binary hypercube, if d is even, the largest
dilation of edges is $\lceil 2^{\lceil d/2 \rceil - 1}/w \rceil$ and the congestion is 1.
Proposition 2: If d is odd with method-two, there are $2^{d-3}$ edges with
congestion 2, the largest dilation of edges is $\max\{2, \lceil 3(\sqrt{2^{d-3}} - 1)/w \rceil\}$, and there are at
least $3 \times 2^{d-3}$ edges with dilation 2, and $2^{d-3}$ unused nodes.
Lemma 1: In the building block (3-D binary hypercube), there are $(3 \times 2^3/2)$ or
12 edges. Among them, there is only one edge (node 110 to node 100) with congestion
2, and there are three edges with dilation 2.
This can be easily proven from Figure 4.10. Only four unused nodes result
with this mapping.
Figure 4.10 5-D hypercube embedding with the second method. This figure shows
the original hypercube, the embedding of the 3-D hypercube into the building block
HOW(3,2,2), and the final embedding into the 2-D HOW(6,3,2) system.
Figure 4.11 6-D hypercube embedding in a 2-D system using method two. (Actually,
method-two and method-one are the same for hypercubes with an even number of
dimensions.) This figure shows the building block and the embedding in the 2-D system.
CHAPTER 5 
COMMUNICATION OPERATIONS ON 1-D HOW SYSTEMS
Our focus in this dissertation is 2-D HOW systems. However, 1-D HOW systems are
their building blocks (BBs), and therefore we first develop communication routines
for 1-D HOW systems. Before we propose algorithms for implementing various
communication patterns on 1-D HOW systems, introductory material is needed to
facilitate evaluation of the algorithms. The communication latency, that is, the time
consumed to communicate a message between two processors in the system, depends
on the following parameters [5] [20]:
• Startup time ($t_s$): the time consumed by the sending processor. It comprises the
time to prepare the message (producing the header, trailer, and error correction
information), the time for the routing algorithm at the source, and the time to
send the first part of the message to the appropriate communication port.
• Per-word transfer time ($t_w$): the time taken by a word to traverse a channel.
If the channel bandwidth is b words per second, then each word takes time
$t_w = 1/b$.
• Combining time ($t_c$): the time consumed by an intermediate node to switch a
message from an input to an output port; it also includes the time to combine
incoming messages, if needed, and send them to the appropriate output port.
We calculate only the time taken by a message to reach the input port of the
destination. Additional time may be needed to get the data from that port. In
store-and-forward (SF) routing, with a message traversing a path with multiple links,
each intermediate processor forwards the message to the next processor in the path
after it has received the entire message. To increase the utilization of communication
resources and reduce communication time, wormhole routing divides a message
into flits (flow-control digits). As the header flit advances along the chosen path,
the remaining flits follow in the same path in pipelined fashion. If the header flit
encounters a channel already in use, this flit is blocked until the channel becomes
available [5]. Normally, the flit size coincides with the channel width. The combining
time $t_c$ is ignored in wormhole routing.
Figure 5.1 Different output port models: (a) model-1 with one output port;
(b) model-2 with multiple ports and the same output value; (c) model-3 with
multiple ports and different output values.
We develop algorithms under three different communication models. For all of 
the models, each processor can receive more than one message at a time in different 
input ports. These models differ in how they can use their output ports. 
• Model-1: Each processor can use only one output port at a time. 
• Model-2: Each processor can use multiple output ports simultaneously, as 
long as all output ports contain the same value. 
• Model-3: Each processor can use multiple output ports simultaneously, and 
different output ports can have different values. 
The architecture considered here is a 1-D system. There are three different
models in a 1-D system which are used for communications, as shown in Figure 5.1.
In the following subsections we develop algorithms for various communication
operations on 1-D HOW systems and derive corresponding execution times for the
aforementioned models. The analysis is done each time for SF and wormhole routing,
in this order. These operations are very frequently used in parallel processing [13]
[4].
5.1 One-to-One Communication 
This basic operation sends a message from one processor to another.
With SF routing, sending a single message containing m words takes
$t_s + m t_w l + t_c(l - 1)$ time, where l is the number of links traversed by the message.
For a 1-D HOW system with p processors and window size w, l is at most
$\lceil (p-1)/w \rceil$, and therefore the time for a single message transfer has the upper bound of
$$T(SF)_{one\_to\_one} = t_s + m t_w \left\lceil \frac{p-1}{w} \right\rceil + t_c\left(\left\lceil \frac{p-1}{w} \right\rceil - 1\right) = O\left(\frac{mp}{w}\right)$$
assuming no contention with other messages at intermediate processors.
With wormhole routing, assume that the flit is one word, and therefore the flit
transfer time is $t_w$. If the message traverses l links, then the header of the message
takes $t_s + l t_w$ time to reach the destination. If the message is m words long, then the
remaining flits will reach the destination in time $(m - 1)t_w$ after the arrival of the
header. Therefore, the upper bound is
$$T(WR)_{one\_to\_one} = t_s + t_w \left\lceil \frac{p-1}{w} \right\rceil + (m - 1)t_w = O\left(m + \frac{p}{w}\right)$$
For the wrap-around HOW(p, w, 1), l is at most $\lceil (p-1)/(2w) \rceil$, and therefore it takes
half of the time for the regular HOW(p, w, 1) system.
Therefore, the time with SF routing for a single message transfer has the upper
bound of
$$T^{wrap}(SF)_{one\_to\_one} = t_s + m t_w \left\lceil \frac{p-1}{2w} \right\rceil + t_c\left(\left\lceil \frac{p-1}{2w} \right\rceil - 1\right) = O\left(\frac{mp}{w}\right)$$
The time with wormhole routing for a single message transfer has the upper
bound of
$$T^{wrap}(WR)_{one\_to\_one} = t_s + t_w \left\lceil \frac{p-1}{2w} \right\rceil + (m - 1)t_w = O\left(m + \frac{p}{w}\right)$$
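The two latency models can be compared numerically. The sketch below encodes the SF expression t_s + m t_w l + t_c (l - 1) and the wormhole expression t_s + l t_w + (m - 1) t_w stated in this section, together with the worst-case number of links; the function names and the sample parameter values are hypothetical.

```python
def max_links(p, w, wraparound=False):
    """Worst-case number of links: ceil((p-1)/w), or ceil((p-1)/(2w)) with wrap-around."""
    reach = 2 * w if wraparound else w
    return (p - 1 + reach - 1) // reach


def sf_time(m, links, ts, tw, tc):
    """Store-and-forward: the whole m-word message crosses every link in turn."""
    return ts + m * tw * links + tc * (links - 1)


def wormhole_time(m, links, ts, tw):
    """Wormhole: the header needs links * tw, the other m - 1 flits follow in a pipeline."""
    return ts + links * tw + (m - 1) * tw


l = max_links(12, 3)
print(l, sf_time(10, l, ts=5, tw=1, tc=2), wormhole_time(10, l, ts=5, tw=1))
```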
5.2 One-to-All Broadcasting 
One-to-all broadcasting is an operation where a single processor sends the same data
of m words to all other processors. Initially, only the source processor has the data of
size m that needs to be broadcast. At the termination of the procedure, there are p - 1
copies of the initial data, one copy residing in each of the other processors. The naive
way to perform one-to-all broadcasting is to sequentially send p - 1 messages from
the source to the other p - 1 processors. For the sake of efficiency, every processor
could keep a copy of the message it receives from a neighbor, and then could forward
this message to one or more of its other neighbors.
[Figure: One-to-all broadcast of the message M.]
5.2.1 Model-1




Since there is only one output port "available" for each processor at each transfer
step, we consider two different stages. We assume that the leftmost processor is the
source, for worst case timing. In the first stage, we copy the data to all processors
(PEs) in the source's window of size w. In the second stage, the data in the leftmost
window is propagated to the right, one window size at a time.
We introduce two parameters here: $S_1$ represents the number of transfer steps
needed to fill the first window, and $S_2$ represents the number of transfer steps needed
in the second stage to copy the values in the first window into the remaining windows.
In the first stage, the propagation doubles each time the number of PEs that receive
the message, and therefore the processors within the window are assumed to form a
binary tree. We have the following relations among $S_1$, $S_2$, and w:
$$S_1 = \lceil \log(w + 1) \rceil$$
$$S_2 = \lceil (p - 2^{S_1})/w \rceil$$
All logarithms in this dissertation are in the base 2.
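The two step counts can be evaluated for concrete systems. This is a sketch of the formulas above for the case p - 1 > w; it uses the integer identity ceil(log2(w + 1)) = w.bit_length() to avoid floating-point logarithms, and the function name is hypothetical.

```python
def broadcast_steps_model1(p, w):
    """Return (S1, S2) for one-to-all broadcasting under model-1, assuming p - 1 > w."""
    s1 = w.bit_length()                        # ceil(log2(w + 1)) for w >= 1
    s2 = max(0, (p - 2 ** s1 + w - 1) // w)    # ceil((p - 2^S1) / w)
    return s1, s2


# 12 processors with window size 3: two steps in stage 1, three in stage 2.
print(broadcast_steps_model1(12, 3))
```

The total of five steps for p = 12 and w = 3 agrees with the five communication steps listed for Figure 5.3.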
56 
Table 5.1 The propagation rules of one-to-all broadcasting under model-I. 
S PEmax P Etotal 
1 2S 1 = 21 1 1 ),::; ?i 1 = 2s-1 1 - ,,-,i-l - -
2 25 1 22 1 2 2::::' ?i 1 = 2s _ 1 3 - - t-1 - -
3 25 1 23 1 =4 2:::5 ?i 1 = 25 - 1 =7 - i-l ~ 
Sl 2S 1 251 1 tS .)i-1 = 2S - 1 = 251 1 - t-1 - -
We can also use Table 5.1 to illustrate the propagation rules to follow in the
first stage (S is the number of the transfer step in the first stage, PE_max is the
maximum number of PEs that can receive a copy at each transfer step, and PE_total
is the total number of PEs that have received a copy at each transfer step).
Figure 5.3 shows an example. The communication time for one-to-all
broadcasting under model-1 and SF routing has the upper bound of
$$\begin{cases} O(m \log p) & \text{if } (p - 1) \le w \\ O(m \log w + mp/w) & \text{if } (p - 1) > w \end{cases}$$
This asymptotic time is optimal.
With wormhole routing, the upper bound is
$$\begin{cases} O(m + \log p) & \text{if } (p - 1) \le w \\ O(m + \log w + p/w) & \text{if } (p - 1) > w \end{cases}$$
assuming that incoming data can simultaneously be stored locally and also be
transferred to the next PE in the path.
Figure 5.3 One-to-all broadcasting under model-1 with 12 processors and window
size of 3. A number in parentheses is the label of the source processor from which
data has been broadcast. All communication steps are shown: (a) HOW(12,3,1) with
initial information; (b) first communication step (Stage 1); (c) second communication
step (Stage 1); (d) third communication step (Stage 2); (e) fourth communication
step (Stage 2); (f) fifth communication step (Stage 2).
For the wrap-around HOW(p, w, 1), we need $S'_1$ steps to fill the leftmost and
rightmost windows, which is $2 \cdot S_1$. We also need $S'_2$ steps, which is only half
of $S_2$, to copy the values in the leftmost and rightmost windows into the remaining
windows. Therefore,
$$S'_1 = 2\lceil \log(w + 1) \rceil$$
$$S'_2 = \left\lceil \frac{p - 1 - 2^{S'_1}}{2w} \right\rceil$$
Therefore, the communication time of the wrap-around HOW(p, w, 1) for
one-to-all broadcasting under model-1 and SF routing has the upper bound of
$$T^{wrap}_{one\_to\_all,1} = \begin{cases} t_s + m t_w \lceil \log p \rceil + t_c(\lceil \log p \rceil - 1) = O(m \log p) & \text{if } (p - 1) \le w \\ t_s + m t_w (S'_1 + S'_2) + t_c(S'_1 + S'_2 - 1) = O(m \log w + mp/w) & \text{if } (p - 1) > w \end{cases}$$
With wormhole routing, the upper bound on the communication time is
$$\begin{cases} O(m + \log p) & \text{if } (p - 1) \le w \\ O(m + \log w + p/w) & \text{if } (p - 1) > w \end{cases}$$
Special-case: Fully connected 1-D subsystems. For a fully connected
subsystem, the procedure is similar to that for stage-1 under our model-1. Therefore,
the communication time with SF routing is O(m log p), as in the case (p - 1) ≤ w above.
With wormhole routing, the communication time is O(m + log p).
5.2.2 Model-2 and Model-3 
For one-to-all broadcasting, there is only one value to be sent, and therefore the
procedures for this operation are identical under model-2 and model-3. Assume the
leftmost PE as the source. Model-2 is not inferior to model-3 because up to w output
ports are "available" to the right of each processor at each transfer step as long as
these ports transfer the same value, which is the case here. The first stage now
consumes one transfer step and the total number of transfer steps is $\lceil (p - 1)/w \rceil$.
Figure 5.4 shows an example. The communication time has the upper bound of
$$T(SF)_{one\_to\_all,2} = t_s + m t_w \left\lceil \frac{p-1}{w} \right\rceil + t_c\left(\left\lceil \frac{p-1}{w} \right\rceil - 1\right) = O\left(\frac{mp}{w}\right)$$
This asymptotic time is optimal.
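The step count under model-2 and model-3 can be confirmed with a small simulation. The sketch assumes that the source is the leftmost PE and that, in every transfer step, each PE that already holds the data sends it to all PEs within distance w to its right; the function name is hypothetical.

```python
def broadcast_steps_model2(p, w):
    """Simulate one-to-all broadcasting from PE 0 when all w output ports can be used."""
    informed = {0}
    steps = 0
    while len(informed) < p:
        # Every informed PE forwards the same value through all of its right-hand ports.
        reached = {x + k for x in informed for k in range(1, w + 1) if x + k < p}
        informed |= reached
        steps += 1
    return steps


# 12 processors with window size 3, as in Figure 5.4.
print(broadcast_steps_model2(12, 3))
```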
[Figure 5.4, panels recoverable from the source: (b) First communication step (Stage 1); (c) Second communication step (Stage 2); (d) Third communication step (Stage 2); (e) Fourth communication step (Stage 2).]
Figure 5.4 One-to-all broadcasting under model-2 and model-3 with 12 processors
and window size of 3. A number in parentheses is the label of the source processor
from which data has been broadcast. All communication steps are shown.
With wormhole routing, the upper bound is
\[
T(WR)_{one\_to\_all,2} = t_s + t_w \left\lceil \frac{p-1}{w} \right\rceil + (m-1) t_w = O\left(m + \frac{p}{w}\right)
\]
For the wrap-around HOW(p, w, 1). Every node can be treated similarly, and
the communication time is exactly half of that for the regular HOW(p, w, 1) system.
Therefore, the communication time of the wrap-around HOW(p, w, 1) for one-to-all
broadcasting under model-2 and model-3 and SF routing has the upper bound
of
With wormhole routing, the upper bound on the communication time is
Special-case: Fully connected 1-D subsystems. It is easy to get the result
for the fully connected subsystem; the one-to-all broadcasting just needs one transfer
step. Therefore,
With wormhole routing, the communication time is
5.3 All-to-All Broadcasting 
In all-to-all broadcasting, which is a generalization of one-to-all broadcasting, all p 
processors simultaneously initiate a broadcast. A processor sends the same m-word 
message to every other processor, but different processors may broadcast different
messages.
[Figure 5.5: before the operation, processor i holds only its own message M_i; after the all-to-all broadcast, every processor holds M_0, M_1, ..., M_{p-1}.]
Figure 5.5 All-to-all broadcast.
5.3.1 Model-1 
For model-1, there is only one output port of each processor we can use at a time.
In order to let every processor pass information to a neighbor in each step, we 
deliberately choose those channels that form a ring, as shown in Figure 5.6. If 
communication is performed circularly in a single direction, then each processor 
receives all (p - 1) pieces of information from all other processors in (p - 1) steps. 
The time taken by the entire operation is 
This asymptotic time is optimal because each processor can use only one output port 
at a time, and therefore each message must make p - 1 = O(p) hops. 
With wormhole routing, the communication time is 
because the header of each message is blocked at each intermediate node until the
previous message has completely departed.
For the wrap-around HOW(p, w, 1). Since only one cycle has to be formed in
order to pass the information around, the communication time is exactly the same
as that for the regular HOW(p, w, 1).
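The ring procedure for model-1 can be sketched as a short simulation. The ring construction used here (even-numbered PEs in ascending order followed by odd-numbered PEs in descending order) is an assumption inferred from the labels in Figure 5.6; it only needs links of length at most 2, so it presumes w >= 2. The function names are hypothetical.

```python
def ring_order(p: int) -> list[int]:
    """Ring assumed from Figure 5.6: even PEs ascending, then odd PEs descending.

    Consecutive PEs on this ring are at most 2 positions apart, so the
    construction needs a window size w >= 2.
    """
    evens = list(range(0, p, 2))
    odds = list(range(1, p, 2))[::-1]
    return evens + odds


def all_to_all_ring(p: int) -> list[list[int]]:
    """Simulate p - 1 steps in which every PE forwards one label to its successor.

    received[i] lists the source labels held by PE i in order of arrival,
    starting with its own label.
    """
    ring = ring_order(p)
    successor = {ring[i]: ring[(i + 1) % p] for i in range(p)}
    received = [[i] for i in range(p)]
    for _ in range(p - 1):
        # Every PE forwards the label it obtained most recently (one port per step).
        outgoing = {i: received[i][-1] for i in range(p)}
        for i in range(p):
            received[successor[i]].append(outgoing[i])
    return received


if __name__ == "__main__":
    result = all_to_all_ring(12)
    print(result[0])  # arrival order of the labels at PE 0
```

With 12 processors the arrival order at PE 0 reproduces the label sequence shown for that processor in the last panel of Figure 5.6, and every PE holds all 12 labels after p - 1 = 11 steps.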
[Figure 5.6, panels: (a) HOW(12,3,1) with initial information; (b) First communication step; (c) Second communication step; (d) Third communication step; (e) Eleventh communication step.]
Figure 5.6 All-to-all broadcasting under model-1 with 12 processors and window
size of 3. The numbers in parentheses for each processor are the labels of source
processors from which data was received prior to the current communication step.
Special-case: Fully connected 1-D subsystems. For a fully connected
1-D subsystem, no intermediate node will be involved in the broadcasting procedure.
The time taken by the entire broadcasting procedure is
With wormhole routing, the communication time is
5.3.2 Model-2 
The broadcasting procedure is as follows:
• First stage: Each PE sends its message to all of its neighbors.
• Remaining stages: Consider stage i, where $i = 1, 2, \ldots, \lceil (p-1)/w \rceil - 1$. In one
direction, beginning from position iw and also involving all its successors, send
the messages from the PEs 0, 1, ..., (p - 1 - iw - 1) through all possible channels.
In the other direction, beginning from position (p - 1 - iw) and also involving
all its predecessors, send the messages from the PEs p - 1, p - 2, ..., (iw + 1). If
there is an overlap between these two directions, then split this stage into two
steps in order to make sure that every PE sends just one value at a time. From
all the messages it contains, each time a PE sends out the message received
earlier from its most distant PE.
Table 5.2 shows the detailed steps involved in the broadcasting procedure for
12 PEs and a window of size 3. It consumes five steps. Refer to Figure 5.7 for an
example. The example in Figure 5.7 is for model-3, and therefore "step" in the table
stands for "stage" under model-2. However, the only difference between the two
models is in the second transfer step, because there is an overlap between the two
directions; therefore, we need to split this "transfer step" into two steps for model-2.
The whole procedure consumes five steps under model-2.
Table 5.2 The detailed steps for all-to-all broadcasting under model-2. Each entry
lists the addresses of the PEs whose messages are received in that stage; where a
stage has two lines, the first line holds the addresses received from the left and the
second line those received from the right.
Stage   P0      P1      P2      P3      P4      P5      P6      P7      P8      P9      P10     P11
init    (0)     (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)    (11)
1       -       0       0,1     0,1,2   1,2,3   2,3,4   3,4,5   4,5,6   5,6,7   6,7,8   7,8,9   8,9,10
        1,2,3   2,3,4   3,4,5   4,5,6   5,6,7   6,7,8   7,8,9   8,9,10  9,10,11 10,11   11      -
2       -       -       -       -       0       0,1     0,1,2   1,2,3   2,3,4   3,4,5   4,5,6   5,6,7
        4,5,6   5,6,7   6,7,8   7,8,9   8,9,10  9,10,11 10,11   11      -       -       -       -
3       7,8,9   8,9,10  9,10,11 10,11   11      -       -       0       0,1     0,1,2   1,2,3   2,3,4
4       10,11   11      -       -       -       -       -       -       -       -       0       0,1
The total time taken by this operation is
\[
T_{all\_to\_all,2} = t_s + m t_w \left( \left\lceil \frac{p-1}{w} \right\rceil + x \right) + t_c \left( \left\lceil \frac{p-1}{w} \right\rceil + x - 1 \right)
\]
where x is the number of stages that need to be split into two steps, and x should satisfy
the condition $xw < p - 1 - xw$. So x is the largest integer less than $\frac{p-1}{2w}$. Therefore,
$T_{all\_to\_all,2} = O(m\frac{p}{w})$.
This asymptotic time is optimal because the diameter of the system is $O(\frac{p}{w})$.
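A short numerical check of the model-2 step count may help here. The sketch below assumes the reading "number of stages $\lceil (p-1)/w \rceil$ plus x split stages", with x the largest integer satisfying xw < p - 1 - xw; the function names and the sample cost parameters are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def split_stages(p: int, w: int) -> int:
    """Largest integer x with x*w < p - 1 - x*w, i.e. x < (p - 1) / (2w)."""
    return (p - 2) // (2 * w)


def steps_model2(p: int, w: int) -> int:
    """Transfer steps for all-to-all broadcasting under model-2 (regular system)."""
    return ceil_div(p - 1, w) + split_stages(p, w)


def time_sf_model2(p: int, w: int, m: int, ts: float, tw: float, tc: float) -> float:
    """SF cost: ts + m*tw*steps + tc*(steps - 1), with steps as defined above."""
    s = steps_model2(p, w)
    return ts + m * tw * s + tc * (s - 1)


if __name__ == "__main__":
    print(steps_model2(12, 3))                      # 4 stages + 1 split stage
    print(time_sf_model2(12, 3, m=8, ts=10.0, tw=1.0, tc=2.0))
```

For 12 PEs and a window of size 3 this gives x = 1 and five steps in total, matching the count stated for Table 5.2.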
With wormhole routing, the communication time is
\[
T(WR)_{all\_to\_all,2} = t_s + m t_w \left( \left\lceil \frac{p-1}{w} \right\rceil + x \right) = O\left(m\frac{p}{w}\right)
\]
because of message blocking on reused channels.
For the wrap-around HOW(p, w, 1). Every node could be treated similarly,
so the number of transfer steps is $\lceil (p-1)/w \rceil$. Although each node has 2w neighbors, we
divide by w because output ports must transfer the same message. Tables 5.3 and 5.4
show detailed information for this process.
Table 5.3 The detailed steps for all-to-all broadcasting under model-2 using a
wrap-around system with 16 processors.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15
(0) (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) 
15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 
13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 
2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1
3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 
12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 
11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 
10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 
4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 
5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 
6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 
9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 
8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7
7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6
Therefore, the communication time of the wrap-around HOW(p, w, 1) for all-to-all
broadcasting under model-2 and SF routing is
\[
T^{wrap}_{all\_to\_all,2} = t_s + m t_w \left\lceil \frac{p-1}{w} \right\rceil + t_c \left( \left\lceil \frac{p-1}{w} \right\rceil - 1 \right) = O\left(m\frac{p}{w}\right)
\]
With wormhole routing, the communication time is
\[
T(WR)^{wrap}_{all\_to\_all,2} = t_s + m t_w \left\lceil \frac{p-1}{w} \right\rceil = O\left(m\frac{p}{w}\right)
\]
Special-case: Fully connected 1-D subsystems. For a fully connected
subsystem, only one transfer step is needed to accomplish the broadcasting 
procedure. 
With wormhole routing, the communication time is 
Table 5.4 The detailed steps for all-to-all broadcasting under model-2 using a
wrap-around system with 17 processors.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16
(0) (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16)
16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
15 16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
14 15 16 0 1 2 3 4 5 6 7 8 9 10 11 12 13
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 1
3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 1 2
13 14 15 16 0 1 2 3 4 5 6 7 8 9 10 11 12
12 13 14 15 16 0 1 2 3 4 5 6 7 8 9 10 11
11 12 13 14 15 16 0 1 2 3 4 5 6 7 8 9 10
4 5 6 7 8 9 10 11 12 13 14 15 16 0 1 2 3
5 6 7 8 9 10 11 12 13 14 15 16 0 1 2 3 4
6 7 8 9 10 11 12 13 14 15 16 0 1 2 3 4 5
10 11 12 13 14 15 16 0 1 2 3 4 5 6 7 8 9
9 10 11 12 13 14 15 16 0 1 2 3 4 5 6 7 8
8 9 10 11 12 13 14 15 16 0 1 2 3 4 5 6 7
7 8 9 10 11 12 13 14 15 16 0 1 2 3 4 5 6
Table 5.5 The detailed steps for all-to-all broadcasting under model-3 using a
wrap-around system with 16 processors.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15
(0) (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15)
15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13
13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0
2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1
3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2
12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11
11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10
10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9
4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3
5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4
6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5
9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8
8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7
7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6
5.3.3 Model-3 
This procedure is very similar to that for model-2. Since each individual processor 
can send different messages at the same time, we do not need to split any step, 
as shown in the example of Figure 5.7. The total time taken by this operation is 
optimal and given by 
\[
T_{all\_to\_all,3} = t_s + m t_w \left\lceil \frac{p-1}{w} \right\rceil + t_c \left( \left\lceil \frac{p-1}{w} \right\rceil - 1 \right) = O\left(m\frac{p}{w}\right)
\]
With wormhole routing, the communication time is 
[Figure 5.7, panels: (0) 1-D system (PEs=12, window_size=3) with initial information; (1) First communication step; (2) Second communication step; (3) Third communication step; (4) Fourth communication step. The figure also tabulates the addresses received at each step; its entries are the same as those of Table 5.2.]
Figure 5.7 All-to-all broadcasting under model-3 with 12 processors and window
size of 3. Addresses of processors from which values have been received at the end
of each step are shown.
Table 5.6 The detailed steps for all-to-all broadcasting under model-3 using a
wrap-around system with 17 processors.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16
(0) (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16)
16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
15 16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
14 15 16 0 1 2 3 4 5 6 7 8 9 10 11 12 13
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 1
3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 1 2
13 14 15 16 0 1 2 3 4 5 6 7 8 9 10 11 12
12 13 14 15 16 0 1 2 3 4 5 6 7 8 9 10 11
11 12 13 14 15 16 0 1 2 3 4 5 6 7 8 9 10
4 5 6 7 8 9 10 11 12 13 14 15 16 0 1 2 3
5 6 7 8 9 10 11 12 13 14 15 16 0 1 2 3 4
6 7 8 9 10 11 12 13 14 15 16 0 1 2 3 4 5
10 11 12 13 14 15 16 0 1 2 3 4 5 6 7 8 9
9 10 11 12 13 14 15 16 0 1 2 3 4 5 6 7 8
8 9 10 11 12 13 14 15 16 0 1 2 3 4 5 6 7
7 8 9 10 11 12 13 14 15 16 0 1 2 3 4 5 6
For the wrap-around HOW(p, w, 1). The number of transfer steps is $\lceil (p-1)/(2w) \rceil$.
Tables 5.5 and 5.6 show detailed information for this process.
Therefore, the communication time of the wrap-around HOW(p, w, 1) for all-to-all
broadcasting under model-3 and SF routing is
With wormhole routing, the communication time is
\[
T(WR)^{wrap}_{all\_to\_all,3} = t_s + m t_w \left\lceil \frac{p-1}{2w} \right\rceil = O\left(m\frac{p}{w}\right)
\]
Special-case: Fully connected 1-D subsystems. For a fully connected 1-D
subsystem, the whole broadcasting procedure just needs a single transfer step.
With wormhole routing, the communication time is
\[
T(WR)^{full}_{all\_to\_all,3} = t_s + m t_w = O(m)
\]
5.4 One-to-All Personalized Communication 
One-to-all personalized communication is an operation where the source processor 
sends (p - 1) unique messages, each one destined for a different processor in the 
system. Unlike one-to-all broadcasting, one-to-all personalized communication does 
not involve any duplication of data. However, the communication patterns for one-
to-all broadcasting and one-to-all personalized communication are identical; only the 
sizes and contents of messages are different. 
Figure 5.8 One-to-all personalized communication.
5.4.1 Model-1 and Model-2
Even though under model-2 each processor has multiple outports available in each
step, all the outports are supposed to transport the same message. But for one-to-all
personalized communication, the source processor has different messages to be
transmitted. In this case, the communication procedures are exactly the same for
both model-1 and model-2. For these two models, no matter what the window size is,
it will take (p - 1) transfer steps for this communication operation. A ring structure
is used to communicate values, as shown in Figure 5.6. Messages going farther have
higher priority of transmission. The total time taken by this operation is
This is similar to the asymptotic time consumed by the source, and therefore it is
optimal. The shortest paths in the ring are chosen to reach respective destinations.
For the sake of simplicity, assume that the source is P0. To reach the PE Px, where
1 ≤ x ≤ (p - 1), the message makes $\lceil x/2 \rceil$ hops. Assume that the source first sends out
the messages destined for the odd-numbered PEs. It then transmits messages to the
even-numbered PEs. Assume for the second case the PE Px with x = 2y. This PE
will receive its message with delay $t_c(y-1) + m t_w(y-1)$ after it was transmitted by
the source. The time left for the source to complete the entire operation is $m t_w(y-1)$,
because (y - 1) is the number of messages still to be transmitted. Therefore, the
"combining time" term used in the equation is for the worst case, where y takes its
largest value.
With wormhole routing, the total number of flits to be transferred by the
source is (p - 1)m. Messages going farther have higher priority of transmission. The
communication time is
\[
T(WR)_{one\_to\_all\_pers,1} = t_s + m t_w (p-1) = O(mp)
\]
This also represents the time consumed by the source because of the pipelining of
messages and the chosen priority for message transmission.
Special-case: Fully connected 1-D subsystems. Referring to the previous
case, we know that even under a fully connected 1-D subsystem, we still need (p - 1)
transfer steps. The total time taken by this operation is
With wormhole routing, the communication time is
5.4.2 Model-3 
Under model-3, the one-to-all personalized communication operation can be done as 
follows. For the worst case, we assume P0 to be the source:
• First, the processor P0 passes the w most distant messages to its w neighbors,
so that a destination processor with higher address gets a message for a higher-
addressed processor.
• Second, the processor P0 similarly passes the next w most distant messages to
its window, while all processors that received an intermediate message earlier
pass that message to their neighbor at distance w in the next window (i.e., the
window to their right).
• The second step repeats until all processors receive their own message.
Table 5.7 shows a complete example for 12 processors and window size of 3. 
The total time taken by this operation is
\[
T_{one\_to\_all\_pers,3} = t_s + m t_w \left\lceil \frac{p-1}{w} \right\rceil + t_c \left( \left\lceil \frac{p-1}{w} \right\rceil - 1 \right) = O\left(m\frac{p}{w}\right)
\]
which has the same asymptotic complexity as the time consumed by the source,
and therefore it is optimal.
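The model-3 procedure can be traced with a small simulation. The sketch below is one possible reading of the steps, not a definitive implementation: P0 is the source, in each step it hands its (up to) w most distant remaining messages to P1, ..., Pw in ascending order, and every message already in flight moves w positions to the right, or fewer when that would overshoot its destination (this last rule is an assumption made so that the final step lands every message exactly). All names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def personalized_schedule(p: int, w: int) -> list[dict[int, int]]:
    """Trace one-to-all personalized communication from P0 under model-3.

    Returns one dict per transfer step, mapping destination address ->
    address of the PE holding that message after the step (only messages
    that have already left P0 are listed).
    """
    pending = list(range(1, p))      # destinations whose message is still at P0
    position: dict[int, int] = {}    # destination -> current holder
    history: list[dict[int, int]] = []
    while pending or any(pos != dest for dest, pos in position.items()):
        # In-flight messages advance by w, but never past their destination.
        for dest in position:
            position[dest] += min(w, dest - position[dest])
        # The source sends its w most distant remaining messages to P1..Pw,
        # higher-addressed neighbors receiving higher-addressed destinations.
        batch = pending[-w:]
        del pending[-w:]
        for k, dest in enumerate(batch, start=1):
            position[dest] = k
        history.append(dict(position))
    return history


if __name__ == "__main__":
    for step, state in enumerate(personalized_schedule(12, 3), start=1):
        print(step, sorted(state.items()))
```

For 12 processors and a window of size 3 the trace takes four steps, and the intermediate holders agree with the rows of Table 5.7 (for example, after the first step the messages for P9, P10, and P11 sit at P1, P2, and P3).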
Table 5.7 The detailed steps for one-to-all personalized communication under model-3.
Each row lists the messages held by each processor after the given step.
Step     P0       P1   P2   P3   P4   P5   P6   P7   P8   P9   P10  P11
initial  m0-m11
1        m0-m8    m9   m10  m11
2        m0-m5    m6   m7   m8   m9   m10  m11
3        m0-m2    m3   m4   m5   m6   m7   m8   m9   m10  m11
4        m0       m1   m2   m3   m4   m5   m6   m7   m8   m9   m10  m11
With wormhole routing, all processors receive their messages simultaneously
in time $t_s + m t_w \lceil (p-1)/w \rceil$, because of message pipelining and message blocking resulting
from the m-flit messages. Therefore, the total communication time is
\[
T(WR)_{one\_to\_all\_pers,3} = t_s + m t_w \left\lceil \frac{p-1}{w} \right\rceil = O\left(m\frac{p}{w}\right)
\]
which is again optimal because it is identical to the time consumed by the source
with peak utilization of its communication ports and no data duplication.
Special-case: Fully connected 1-D subsystems. For a fully connected 1-D
subsystem, the entire communication operation needs just a single transfer step.
Therefore,
\[
T^{full}_{one\_to\_all\_pers,3} = t_s + m t_w = O(m)
\]
With wormhole routing, the communication time is
5.5 All-to-All Personalized Communication 
In all-to-all personalized communication, also known as total exchange, each 
processor sends a distinct message of size m to every other processor. Unlike
all-to-all broadcasting, all-to-all personalized communication does not involve any
duplication of data.
[Figure 5.9: before the operation, processor i holds the messages M_{i,0}, M_{i,1}, ..., M_{i,p-1}; after the operation, processor j holds M_{0,j}, M_{1,j}, ..., M_{p-1,j}.]
Figure 5.9 All-to-all personalized communication.
5.5.1 Model-1 and Model-2
For all-to-all personalized communication, the source processor has different messages
to be transmitted. Although model-2 has multiple outports available, all the
outports are supposed to transport the same message. Therefore, the communication
procedures are exactly the same for both model-1 and model-2.
We form a ring here, as in Figure 5.6. In each transfer step every processor
transfers the m-word message destined for its farthest remaining processor. If only
one direction in the ring is used for all transfers, then the total number of transfer
steps is equal to $\sum_{i=1}^{p-1}(p-i) = \sum_{i=1}^{p-1} i = \frac{(p-1)p}{2}$. The total time taken by this operation
is
\[
t_s + \sum_{i=1}^{p-1} m t_w (p-i) + \sum_{i=1}^{p-1} t_c (p-i-1)
= t_s + m t_w \frac{(p-1)p}{2} + t_c \frac{(p-1)(p-2)}{2}
\]
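The closed form for the single-direction ring can be checked numerically against the two sums it comes from. The following sketch uses hypothetical function names and arbitrary integer cost parameters chosen only for the check.

```python
def ring_time_by_summation(p: int, m: int, ts: int, tw: int, tc: int) -> int:
    """ts + sum of m*tw*(p - i) + sum of tc*(p - i - 1), for i = 1 .. p-1."""
    transfer = sum(m * tw * (p - i) for i in range(1, p))
    combining = sum(tc * (p - i - 1) for i in range(1, p))
    return ts + transfer + combining


def ring_time_closed_form(p: int, m: int, ts: int, tw: int, tc: int) -> int:
    """ts + m*tw*(p-1)*p/2 + tc*(p-1)*(p-2)/2, using exact integer division."""
    return ts + m * tw * ((p - 1) * p // 2) + tc * ((p - 1) * (p - 2) // 2)


if __name__ == "__main__":
    args = dict(m=4, ts=10, tw=1, tc=2)
    print(ring_time_by_summation(12, **args), ring_time_closed_form(12, **args))
```

Both products (p-1)p and (p-1)(p-2) are always even, so the integer division loses nothing.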
However, for the shortest paths, and therefore for smaller communication time,
both directions in the ring should be used. In this case, there are $\lceil (p-1)/2 \rceil$ "large"
communication stages. In the i-th "large" stage, where $i = 1, 2, \ldots, \lceil (p-1)/2 \rceil$, each
processor transmits the respective messages to the processors at the same distance
i to its left and to its right, exclusively in this order. If p is even, then the $\lceil (p-1)/2 \rceil$-th
"large" stage implements transmissions in only one of the two directions in the ring.
Therefore, the total number of transfer steps to neighbors is equal to
\[
2 \cdot \frac{1}{2} \left\lceil \frac{p-1}{2} \right\rceil \left( \left\lceil \frac{p-1}{2} \right\rceil + 1 \right)
- \left( \left\lceil \frac{p-1}{2} \right\rceil - \left\lfloor \frac{p-1}{2} \right\rfloor \right)
= \left\lceil \frac{p-1}{2} \right\rceil^2 + \left\lfloor \frac{p-1}{2} \right\rfloor
\]
The total time is
which is asymptotically optimal because each processor sends out O(p) messages of
m words each, and the average distance traveled is O(p).
With wormhole routing, the communication time is 




Special-case: Fully connected 1-D subsystems. For a fully connected 1-D
subsystem, all the processors use one port at a time to send a single message, and
therefore the entire communication operation needs (p - 1) steps.





Figure 5.10 Chosen linear arrays in the HOW(10, 3, 1) for all-to-all personalized
communication.
5.5.2 Model-3 
The all-to-all personalized communication operation involves a lot of message
transfers. We will not necessarily derive the most efficient procedure here, because
such a procedure can be of a very complex nature. We present a simple procedure that
comprises two stages. The basic idea is to use the largest possible number of linear
arrays for pipelined message transfers, with the smallest possible number of nodes
per such array. Figure 5.10 shows the chosen linear arrays in the HOW(10, 3, 1).
• First stage: this is the initialization stage where local transfers are employed to
move messages to processors that belong to the aforementioned linear arrays.
Every processor passes all relevant messages to neighbors in its window(s). For
a given destination message, it passes that message to its neighbor that belongs
to a linear array containing that destination; if two such neighbors exist, the
one closer to the destination is chosen. It takes up to $s_1 = \lceil (p-1)/w \rceil$ cycles to
finish the initialization, which is the same as the maximum number of values
to be sent from a processor to another one.
• Second stage: the linear arrays are used to transfer the values. There are
w linear arrays to be used. We need up to $s_2 = \lceil (p-1)/w \rceil - 1$ cycles to finish
the broadcasting along the linear arrays, which is the same as the maximum
number of values a processor has to send in a single dimension; messages going
farther have higher priority.
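As a quick arithmetic check of the two stage lengths, the sketch below evaluates the cycle counts, reading them as $s_1 = \lceil (p-1)/w \rceil$ and $s_2 = s_1 - 1$; that reading and the function names are assumptions made for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def stage_cycles(p: int, w: int) -> tuple[int, int]:
    """Upper bounds on the cycles of the two stages (assumed reading).

    s1: initialization stage, ceil((p - 1) / w) cycles.
    s2: transfers along the linear arrays, ceil((p - 1) / w) - 1 cycles.
    """
    s1 = ceil_div(p - 1, w)
    s2 = s1 - 1
    return s1, s2


if __name__ == "__main__":
    s1, s2 = stage_cycles(10, 3)
    print(s1, s2, s1 + s2)
```

For the HOW(10, 3, 1) example this yields three initialization cycles and two further cycles, five in total, which matches the five steps (Step-1 to Step-5) listed in Tables 5.8 to 5.12.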
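The total time taken by this operation is
\[
T_{all\_to\_all\_pers,3} = t_s + 2 m t_w \left\lceil \frac{p-1}{w} \right\rceil + m t_c \left( 2 \left\lceil \frac{p-1}{w} \right\rceil - 1 \right) = O\left(m\frac{p}{w}\right)
\]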
An example with 10 processors and window size equal to 3 is shown in
Tables 5.8 to 5.12.
With wormhole routing, the communication time is
\[
T(WR)_{all\_to\_all\_pers,3} = t_s + 2 m t_w \left\lceil \frac{p-1}{w} \right\rceil = O\left(m\frac{p}{w}\right)
\]
Special-case: Fully connected 1-D subsystems. For a fully connected 1-D
subsystem, all the processors use all output ports to send the different messages
to their destinations in one single step. The total time taken by the operation is
With wormhole routing, the communication time is
Table 5.8 The detailed steps for all-to-all personalized communication in 1-D HOW
under model-3.
Initial state with all information
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
{0,0} {1,0} {2,0} {3,0} {4,0} {5,0} {6,0} {7,0} {8,0} {9,0}
{0,1} {1,1} {2,1} {3,1} {4,1} {5,1} {6,1} {7,1} {8,1} {9,1}
{0,2} {1,2} {2,2} {3,2} {4,2} {5,2} {6,2} {7,2} {8,2} {9,2}
{0,3} {1,3} {2,3} {3,3} {4,3} {5,3} {6,3} {7,3} {8,3} {9,3}
{0,4} {1,4} {2,4} {3,4} {4,4} {5,4} {6,4} {7,4} {8,4} {9,4}
{0,5} {1,5} {2,5} {3,5} {4,5} {5,5} {6,5} {7,5} {8,5} {9,5}
{0,6} {1,6} {2,6} {3,6} {4,6} {5,6} {6,6} {7,6} {8,6} {9,6}
{0,7} {1,7} {2,7} {3,7} {4,7} {5,7} {6,7} {7,7} {8,7} {9,7}
{0,8} {1,8} {2,8} {3,8} {4,8} {5,8} {6,8} {7,8} {8,8} {9,8}
{0,9} {1,9} {2,9} {3,9} {4,9} {5,9} {6,9} {7,9} {8,9} {9,9}
Step-1: exchanging information with all connected neighbors.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
{0,0} {1,1} {2,2} {3,3} {4,4} {5,5} {6,6} {7,7} {8,8} {9,9}
right {0,1} {0,2} {0,3}
{1,2} {1,3} {1,4}
{2,3} {2,4} {2,5}
{3,4} {3,5} {3,6}
{4,5} {4,6} {4,7}
{5,6} {5,7} {5,8}
{6,7} {6,8} {6,9}
{7,8} {7,9}
{8,9}
left {9,0} {9,1} {9,2}
{8,0} {8,1} {8,2}
{7,0} {7,1} {7,2}
{6,0} {6,1} {6,2}
{5,0} {5,1} {5,2}
{4,0} {4,1} {4,2}
{3,0} {3,1} {3,2}
{2,0} {2,1}
{1,0}
{0,4} {1,5} {2,6} {3,7} {4,0} {5,0} {6,0} {7,0} {8,0} {9,0}
{0,5} {1,6} {2,7} {3,8} {4,8} {5,1} {6,1} {7,1} {8,1} {9,1}
{0,6} {1,7} {2,8} {3,9} {4,9} {5,9} {6,2} {7,2} {8,2} {9,2}
{0,7} {1,8} {2,9} {7,3} {8,3} {9,3}
{0,8} {1,9} {8,4} {9,4}
{0,9} {9,5}
Table 5.9 The detailed steps for all-to-all personalized communication in 1-D HOW
under model-3. (continue-1)
Step-2: transferring the farthest messages through all connected neighbors.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
{0,0} {1,1} {2,2} {3,3} {4,4} {5,5} {6,6} {7,7} {8,8} {9,9}
{1,0} {0,1} {0,2} {0,3} {1,4} {2,5} {3,6} {4,7} {5,8} {6,9}
{2,0} {2,1} {1,2} {1,3} {2,4} {3,5} {4,6} {5,7} {6,8} {7,9}
{3,0} {3,1} {3,2} {2,3} {3,4} {4,5} {5,6} {6,7} {7,8} {8,9}
{4,1} {4,2} {4,3} {5,4} {6,5} {7,6} {8,7} {9,8}
{5,2} {5,3} {6,4} {7,5} {8,6} {9,7}
right {0,7} {0,8} {0,9}
{1,7} {1,8} {1,9}
{2,7} {2,8} {2,9}
{3,7} {3,8} {3,9}
{4,8} {4,9}
{5,9}
left {9,0} {9,1} {9,2}
{8,0} {8,1} {8,2}
{7,0} {7,1} {7,2}
{6,0} {6,1} {6,2}
{5,0} {5,1}
{4,0}
{0,4} {1,5} {2,6} {7,3} {8,3} {9,3}
{0,5} {1,6} {8,4} {9,4}
{0,6} {9,5}
Table 5.10 The detailed steps for all-to-all personalized communication in 1-D HOW
under model-3. (continue-2)
Step-3: intermediate step to transfer information.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
{0,0} {1,1} {2,2} {3,3} {4,4} {5,5} {6,6} {7,7} {8,8} {9,9}
{1,0} {0,1} {0,2} {0,3} {1,4} {2,5} {3,6} {4,7} {5,8} {6,9}
{2,0} {2,1} {1,2} {1,3} {2,4} {3,5} {4,6} {5,7} {6,8} {7,9}
{3,0} {3,1} {3,2} {2,3} {3,4} {4,5} {5,6} {6,7} {7,8} {8,9}
{4,1} {4,2} {4,3} {5,4} {6,5} {7,6} {8,7} {9,8}
{5,2} {5,3} {6,4} {7,5} {8,6} {9,7}
{6,3} {7,4} {8,5} {9,6}
right {0,4} {0,5} {0,6}
{1,5} {1,6} {0,7}
{2,6} {1,7} {0,8}
{2,7} {1,8} {0,9}
left {9,3} {9,4} {9,5}
{9,2} {8,3} {8,4}
{9,1} {8,2} {7,3}
{9,0} {8,1} {7,2}
{8,0} {7,1} {6,2}
{7,0} {6,1}
Table 5.11 The detailed steps for all-to-all personalized communication in 1-D HOW
under model-3. (continue-3)
Step-4: intermediate step to transfer information.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
{0,0} {1,1} {2,2} {3,3} {4,4} {5,5} {6,6} {7,7} {8,8} {9,9}
{1,0} {0,1} {0,2} {0,3} {1,4} {2,5} {3,6} {4,7} {5,8} {6,9}
{2,0} {2,1} {1,2} {1,3} {2,4} {3,5} {4,6} {5,7} {6,8} {7,9}
{3,0} {3,1} {3,2} {2,3} {3,4} {4,5} {5,6} {6,7} {7,8} {8,9}
{4,1} {4,2} {4,3} {5,4} {6,5} {7,6} {8,7} {9,8}
{5,2} {5,3} {6,4} {7,5} {8,6} {9,7}
{6,3} {7,4} {8,5} {9,6}
{4,0} {5,1} {4,8} {5,9}
{5,0} {4,9}
{6,0} {3,9}
right {} {} {0,4}
{} {1,5} {0,5}
{2,6} {1,6} {0,6}
{2,7} {1,7} {0,7}
{1,8} {3,7} {0,8}*
{} {2,8} {0,9}
{3,8} {1,9}
{2,9}
left {9,5} {} {}
{9,4} {8,4} {}
{9,3} {8,3} {7,3}
{9,2} {8,2} {7,2}
{9,1} {6,2} {8,1}*
{9,0} {7,1} {}
{8,0} {6,1}
{7,0}
Table 5.12 The detailed steps for all-to-all personalized communication in 1-D HOW
under model-3. (continue-4)
Step-5: intermediate step to transfer information.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
{0,0} {1,1} {2,2} {3,3} {4,4} {5,5} {6,6} {7,7} {8,8} {9,9}
{1,0} {0,1} {0,2} {0,3} {1,4} {2,5} {3,6} {4,7} {5,8} {6,9}
{2,0} {2,1} {1,2} {1,3} {2,4} {3,5} {4,6} {5,7} {6,8} {7,9}
{3,0} {3,1} {3,2} {2,3} {3,4} {4,5} {5,6} {6,7} {7,8} {8,9}
{4,1} {4,2} {4,3} {5,4} {6,5} {7,6} {8,7} {9,8}
{5,2} {5,3} {6,4} {7,5} {8,6} {9,7}
{6,3} {7,4} {8,5} {9,6}
{4,0} {5,1} {4,8} {5,9}
{5,0} {4,9}
{6,0} {3,9}
{7,0} {6,1} {6,2} {7,3} {0,4} {0,5} {0,6} {0,7} {0,8} {0,9}
{8,0} {7,1} {7,2} {8,4} {9,5} {3,7} {2,8} {1,9}
{9,0} {8,1} {3,8} {2,9}
right {1,5} {2,6}
{1,6} {2,7}
{1,7} {1,8}
left {7,3} {8,4}
{7,2} {8,3}
{8,1} {8,2}
CHAPTER 6
COMMUNICATION OPERATIONS ON 2-D HOW SYSTEMS
Assume symmetric 2-D HOW systems with p processors. The numbers for rows
and columns are then $0, 1, \ldots, \sqrt{p} - 1$. For example, Figure 6.1 shows the processor
addresses in the 2-D system HOW(5, 3, 2).
P00 P01 P02 P03 P04
P10 P11 P12 P13 P14
P20 P21 P22 P23 P24
P30 P31 P32 P33 P34
P40 P41 P42 P43 P44
Figure 6.1 Processor addresses in the HOW(5, 3, 2).
6.1 One-to-One Communication 
We assume, without loss of generality, that P00 is the source processor and that the
destination is at distance l.
With SF routing, sending a single message containing m words takes $t_s + m t_w l + t_c(l-1)$
time, where l is the number of links traversed by the message. For a 2-D
HOW system with a total of p processors (having $\sqrt{p}$ rows and $\sqrt{p}$ columns) and
window size w, l is at most $2\lceil (\sqrt{p}-1)/w \rceil$, and therefore the time for a single message
transfer has the upper bound of
\[
T_{one\_to\_one} = t_s + 2 m t_w \left\lceil \frac{\sqrt{p}-1}{w} \right\rceil + t_c \left( 2 \left\lceil \frac{\sqrt{p}-1}{w} \right\rceil - 1 \right) = O\left(m\frac{\sqrt{p}}{w}\right)
\]
assuming no contention with other messages at intermediate processors.
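The bound can be evaluated directly for a concrete system. The sketch below uses hypothetical function names and arbitrary cost parameters; it assumes p is a perfect square, as in the symmetric systems considered here.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def max_links_2d(p: int, w: int) -> int:
    """Maximum number of links l on a route: 2 * ceil((sqrt(p) - 1) / w)."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square for a symmetric 2-D system")
    return 2 * ceil_div(side - 1, w)


def one_to_one_sf_bound(p: int, w: int, m: int, ts: float, tw: float, tc: float) -> float:
    """SF upper bound ts + m*tw*l + tc*(l - 1), with l = max_links_2d(p, w)."""
    links = max_links_2d(p, w)
    return ts + m * tw * links + tc * (links - 1)


if __name__ == "__main__":
    # 25 processors (5 x 5) with window size 3, as in HOW(5, 3, 2).
    print(max_links_2d(25, 3), one_to_one_sf_bound(25, 3, m=2, ts=10.0, tw=1.0, tc=3.0))
```

For the HOW(5, 3, 2) example a message crosses at most four links: two along a row and two along a column.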
With wormhole routing, for a single message transfer on the 2-D HOW system
the upper bound is
6.2 One-to-All Broadcasting 
6.2.1 Model-1 
For the best possible performance, we first have to determine which of the row or
column window the source belongs to is closer to the center of that row or column,
respectively. If it is the row window, then the source broadcasts within that row, and
this is followed by broadcasting from those row PEs into all columns. Otherwise, we
begin with column broadcasting. However, here we assume the worst case, where
the source PE is in the first window of the corresponding 1-D HOW row and column
subsystems. Using the same notations as for the 1-D HOW system, $s_1$ represents the
number of transfer steps needed to fill the first window in this row and $s_2$ represents
the number of transfer steps needed in the second stage to copy the values from the
first window into the remaining windows of this row. We already know the following
relations among $s_1$, $s_2$, and w:
\[
s_1 = \lceil \log(w+1) \rceil, \qquad
s_2 = \left\lceil \frac{\sqrt{p} - 2^{s_1}}{w} \right\rceil
\]
This operation is done by first broadcasting within the aforementioned row and
then from that row within all the columns. The communication time under model-1
with SF routing has the upper bound
\[
T_{one\_to\_all,1} =
\begin{cases}
t_s + 2 m t_w \lceil \log \sqrt{p} \rceil + t_c (2\lceil \log \sqrt{p} \rceil - 1) = O(m \log \sqrt{p}) & \text{if } (\sqrt{p}-1) \le w \\
t_s + 2 m t_w (s_1 + s_2) + t_c (2(s_1 + s_2) - 1) = O(m \log w + m\frac{\sqrt{p}}{w}) & \text{if } (\sqrt{p}-1) > w
\end{cases}
\]
With wormhole routing, the upper bound is
\[
\begin{cases}
O(m + \log \sqrt{p}) & \text{if } (\sqrt{p}-1) \le w \\
O(m + \log w + \frac{\sqrt{p}}{w}) & \text{if } (\sqrt{p}-1) > w
\end{cases}
\]
assuming that incoming data can be stored locally and can simultaneously be
transferred to the next PE in the path.
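The step counts behind the SF bound can be computed as follows. This is a sketch under stated assumptions: logarithms are taken to base 2 (recursive doubling), p is a perfect square, and the helper names are hypothetical.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def ceil_log2(n: int) -> int:
    """Exact ceiling of log2(n) for n >= 1 (base-2 logarithm is an assumption)."""
    return (n - 1).bit_length()


def row_steps(side: int, w: int) -> int:
    """Transfer steps to broadcast along one row of `side` PEs under model-1."""
    if side - 1 <= w:
        return ceil_log2(side)
    s1 = ceil_log2(w + 1)                      # fill the first window
    s2 = max(0, ceil_div(side - 2 ** s1, w))   # copy into the remaining windows
    return s1 + s2


def broadcast_steps_2d(p: int, w: int) -> int:
    """Row broadcast followed by column broadcasts: twice the 1-D step count."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    return 2 * row_steps(side, w)


if __name__ == "__main__":
    print(broadcast_steps_2d(25, 3))   # 5 x 5 system with window size 3
    print(broadcast_steps_2d(16, 3))   # 4 x 4 system: sqrt(p) - 1 <= w
```

For the 5 x 5 system with w = 3 this gives $s_1 = 2$ and $s_2 = 1$, hence six transfer steps in total.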
Special-case: Fully connected 1-D subsystems. For fully connected
subsystems that form a 2-D generalized hypercube, the procedure is similar to
that for $(\sqrt{p} - 1) = w$ under model-1.
With wormhole routing, the communication time is
\[
T(WR)^{full}_{one\_to\_all,1} = t_s + 2 t_w \lceil \log \sqrt{p} \rceil + (m-1) t_w = O(m + \log \sqrt{p})
\]
6.2.2 Model-2 and Model-3 
For the one-to-all broadcasting operation, there is only one value to be sent, and
therefore the whole procedure for model-3 is exactly the same as that for model-2.
Figure 6.2 shows two different methods used for one-to-all broadcasting. The
numbers of communication steps for the two methods are the same. However, method
(b) is easier to program, because it is an extension of the respective method for the
1-D HOW system. This method first broadcasts within the row and then within all
columns. The upper bound on the total time taken by this operation is
With wormhole routing, the upper- bo'u,nd is 
, . yIP - 1 vP 




[Figure 6.2: panels (a) and (b), steps 1 through 4]
Figure 6.2 One-to-all broadcasting under model-2 and model-3 with two different
methods, both of which have the same number of communication steps. A filled
circle means that the current processor has already received the message broadcast
by the source. All communication steps are shown here. We assume that w = 3. For
the worst case, we assume $P_{00}$ to be the source.
assuming that the dimension to be traversed is changed just after the first flit is
received.
Special-case: Fully connected 1-D subsystems. It is easy to see that for 
fully connected 1-D subsystems, one-to-all broadcasting needs just two transfer steps. 
Therefore, 
With wormhole routing, the communication time is
6.3 All-to-All Broadcasting 
Table 6.1 shows the initial message state and the required final state
for all-to-all broadcasting in a 5 x 5 system.
Table 6.1 The initial and final state of HOW(5,3,2).
Initial state of HOW(5,3,2)        Required final state
m0,0  m0,1  m0,2  m0,3  m0,4       MM  MM  MM  MM  MM
m1,0  m1,1  m1,2  m1,3  m1,4       MM  MM  MM  MM  MM
m2,0  m2,1  m2,2  m2,3  m2,4       MM  MM  MM  MM  MM
m3,0  m3,1  m3,2  m3,3  m3,4       MM  MM  MM  MM  MM
m4,0  m4,1  m4,2  m4,3  m4,4       MM  MM  MM  MM  MM
where each processor receives messages from all other processors, and therefore
       m0,0  m0,1  m0,2  m0,3  m0,4
       m1,0  m1,1  m1,2  m1,3  m1,4
MM =   m2,0  m2,1  m2,2  m2,3  m2,4
       m3,0  m3,1  m3,2  m3,3  m3,4
       m4,0  m4,1  m4,2  m4,3  m4,4
The procedure applies the corresponding procedure for the 1-D HOW system
repeatedly. That is, processors first exchange messages along rows, so that
each processor has $\sqrt{p}$ messages at the end for the processors on its own column.
Then, processors exchange their $\sqrt{p}$ messages along columns by repeating the same
procedure $\sqrt{p}$ times within the columns.
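The two-phase structure (rows first, then columns) can be illustrated with a minimal Python sketch. It tracks only which messages each processor holds after each phase; it does not model windows, ports, or timing, and the function name `all_to_all_broadcast_2d` is a hypothetical helper introduced here for illustration.

```python
def all_to_all_broadcast_2d(side):
    """Two-phase all-to-all broadcast on a side x side grid (rows, then columns).

    Only message ownership is tracked; link structure and timing are ignored.
    """
    # Each processor (r, c) starts with its own message, labelled by its coordinates.
    held = {(r, c): {(r, c)} for r in range(side) for c in range(side)}

    # Phase 1: all-to-all broadcast inside every row.
    after_rows = {}
    for r in range(side):
        row_union = set().union(*(held[(r, c)] for c in range(side)))
        for c in range(side):
            after_rows[(r, c)] = set(row_union)

    # Phase 2: all-to-all broadcast inside every column, applied to the row results.
    after_cols = {}
    for c in range(side):
        col_union = set().union(*(after_rows[(r, c)] for r in range(side)))
        for r in range(side):
            after_cols[(r, c)] = set(col_union)
    return after_rows, after_cols


rows, final = all_to_all_broadcast_2d(5)
print(len(rows[(0, 0)]), len(final[(0, 0)]))  # 5 25
```

For the 5 x 5 example, every processor holds 5 messages after the row phase and all 25 after the column phase, which matches the required final state MM.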
6.3.1 Model-1 
For model-1, there is only one output port of each processor we can use at a time.
In order to let every processor pass some information to a neighbor, we deliberately
choose some channels to form a ring on each row/column. We assume pipelining of
messages along rows and columns.
We start with all-to-all row broadcasting that takes time $t_s + T = t_s + (\sqrt{p} - 1)mt_w + t_c(\sqrt{p} - 2)$, as derived for the 1-D HOW system in Subsection 3.1.1. The $\sqrt{p}$
column broadcasts then take time $\sqrt{p}\,T$, because all-to-all 1-D HOW broadcasting is
repeated $\sqrt{p}$ times. The time taken by the entire operation is
\[
\begin{aligned}
T_{all\_to\_all,1} &= t_s + (1 + \sqrt{p})\,mt_w(\sqrt{p} - 1) + (1 + \sqrt{p})\,t_c(\sqrt{p} - 2) + t_c \\
&= t_s + (p - 1)mt_w + (p - \sqrt{p} - 1)t_c = O(mp)
\end{aligned}
\]
The last $t_c$ term is for switching from row broadcasting into column broadcasting.
This asymptotic time is optimal because each processor can use only one output
port at a time, and therefore each message will make $O(p)$ hops to visit all $O(p)$
processors.
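As a quick numerical check of the algebra above, the following Python sketch evaluates both the expanded and the closed form of the model-1 store-and-forward time. The function name and the sample parameter values are illustrative assumptions, not values from the dissertation.

```python
import math


def t_all_to_all_model1_sf(p, m, ts, tw, tc):
    """Return (expanded, closed) forms of the model-1 SF all-to-all broadcast time."""
    r = math.isqrt(p)
    assert r * r == p, "p must be a perfect square for a 2-D system"
    # T: one 1-D all-to-all broadcast on a ring of sqrt(p) processors (without ts).
    one_d = (r - 1) * m * tw + tc * (r - 2)
    # One row broadcast plus sqrt(p) column broadcasts, plus one tc for the switch.
    expanded = ts + (1 + r) * one_d + tc
    closed = ts + (p - 1) * m * tw + (p - r - 1) * tc
    return expanded, closed


# Hypothetical sample values: p = 25 processors, m = 4 words, ts = 10, tw = 2, tc = 1.
print(t_all_to_all_model1_sf(25, 4, 10, 2, 1))  # (221, 221)
```

Integer parameters are used so that the two forms can be compared exactly.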
With wormhole routing, within each row the entire time is $t_s + m(\sqrt{p} - 1)t_w$,
assuming the formation of a ring. This is because each processor starts receiving
flits with the first data transfer, pipelining of messages is applied, and the total
number of flits each processor receives is $m(\sqrt{p} - 1)$. Similarly, for columns the time
is $m\sqrt{p}(\sqrt{p} - 1)t_w$. The total time is
\[ T(WR)_{all\_to\_all,1} = t_s + m(1 + \sqrt{p})(\sqrt{p} - 1)t_w = t_s + m(p - 1)t_w = O(mp) \]
Special-case: Fully connected 1-D subsystems. As for the 1-D subsystem,
there is one $t_c$ that will be involved in the broadcasting procedure within the row
and the column. The time taken by the entire broadcasting procedure is
assuming again two steps (row-wise and column-wise steps) in the implementation.
With wormhole routing, the communication time is still
6.3.2 Model-2 
Based on the algorithm proposed for the 1-D HOW system, the total time taken by
this operation is
where $x$ is the largest integer less than [...]$/w$. The algorithm for the 1-D HOW system
is used $(1 + \sqrt{p})$ times, once for the rows and $\sqrt{p}$ times for the columns.
With wormhole routing, the communication time is
\[ T(WR)_{all\_to\_all,2} = t_s + 2mt_w\left(\left\lceil\frac{\sqrt{p} - 1}{w}\right\rceil + x\right)(1 + \sqrt{p}) = O\!\left(m\frac{p}{w}\right) \]
Special-case: Fully connected 1-D subsystems. For the 1-D subsystem, only
two transfer steps are needed to accomplish broadcasting. Therefore,
With wormhole routing, the communication time is
Table 6.2 Messages received in the first two detailed steps for all-to-all broadcasting
within the rows of the HOW(5, 3, 2) system.
Processor   Initial state   Received in step 1        Received in step 2
P00         m00             m01, m02, m03             m04
P01         m01             m00, m02, m03, m04
P02         m02             m00, m01, m03, m04
P03         m03             m00, m01, m02, m04
P04         m04             m01, m02, m03             m00
P10         m10             m11, m12, m13             m14
P11         m11             m10, m12, m13, m14
P12         m12             m10, m11, m13, m14
P13         m13             m10, m11, m12, m14
P14         m14             m11, m12, m13             m10
P20         m20             m21, m22, m23             m24
P21         m21             m20, m22, m23, m24
P22         m22             m20, m21, m23, m24
P23         m23             m20, m21, m22, m24
P24         m24             m21, m22, m23             m20
P30         m30             m31, m32, m33             m34
P31         m31             m30, m32, m33, m34
P32         m32             m30, m31, m33, m34
P33         m33             m30, m31, m32, m34
P34         m34             m31, m32, m33             m30
P40         m40             m41, m42, m43             m44
P41         m41             m40, m42, m43, m44
P42         m42             m40, m41, m43, m44
P43         m43             m40, m41, m42, m44
P44         m44             m41, m42, m43             m40
6.3.3 Model-3 
Table 6.2 shows the first two steps involving all-to-all broadcasting under model-3.
It is very similar to the procedure for model-2. Since each individual processor can
send different messages at the same time, we do not need to split any stage. The
total time taken by this operation is
\[ T_{all\_to\_all,3} = t_s + (1 + \sqrt{p})\,mt_w\left\lceil\frac{\sqrt{p} - 1}{w}\right\rceil + t_c(1 + \sqrt{p})\left(\left\lceil\frac{\sqrt{p} - 1}{w}\right\rceil - 1\right) = O\!\left(m\frac{p}{w}\right) \]
With wormhole routing, the communication time is
Table 6.3 The initial and final states for one-to-all personalized communication in
the HOW(p,w,2).
Initial state                      Required final state
m0,0  m0,1  m0,2  m0,3  m0,4       m0,0  m0,1  m0,2  m0,3  m0,4
m1,0  m1,1  m1,2  m1,3  m1,4       m1,0  m1,1  m1,2  m1,3  m1,4
m2,0  m2,1  m2,2  m2,3  m2,4       m2,0  m2,1  m2,2  m2,3  m2,4
m3,0  m3,1  m3,2  m3,3  m3,4       m3,0  m3,1  m3,2  m3,3  m3,4
m4,0  m4,1  m4,2  m4,3  m4,4       m4,0  m4,1  m4,2  m4,3  m4,4
Special-case: Fully connected 2-D subsystems. For the 2-D generalized
hypercube, the whole broadcasting procedure needs just two transfer steps.
Therefore,
With wormhole routing, the communication time is
6.4 One-to-All Personalized Communication 
Table 6.3 shows the initial state and the required final state for one-to-all personalized
communication in the 5 x 5 2-D HOW(5, 3, 2) system. We assume, without
loss of generality, that $P_{00}$ is the source processor.
6.4.1 Model-1 and Model-2 
Because of personalized data, the same procedure is applied for model-1 and model-2.
Restricted by the availability of only one output port at a time for each processor,
independently of the window size it will take $(\sqrt{p} - 1)$ transfer steps along a row
or a column for a processor to send personalized data to all other processors. In
the first phase, the source processor, assumed to be $P_{00}$, passes messages within its row for
all processors in the corresponding columns. Messages going farther have higher
priority of transmission. This process is implemented as $\sqrt{p}$ one-to-all personalized
communications within the row (i.e., 1-D HOW system). At the end of the first phase,
each of the first-row processors will have $\sqrt{p}$ messages. All $\sqrt{p}$ messages of each
first-row processor will be transferred in the second phase along the corresponding column,
applying again one-to-all personalized communication. The total time taken by this
operation is
\[
\begin{aligned}
T_{one\_to\_all\_pers,1} &= t_s + (\sqrt{p} + 1)\,mt_w(\sqrt{p} - 1) + t_c(1 + \sqrt{p})\left(\left\lceil\frac{\sqrt{p} - 1}{2}\right\rceil - 1\right) \\
&= t_s + (p - 1)mt_w + t_c(1 + \sqrt{p})\left(\left\lceil\frac{\sqrt{p} - 1}{2}\right\rceil - 1\right) = O(mp)
\end{aligned}
\]
With wormhole routing, the communication time is 
Special-case: Fully connected 1-D subsystems. Referring to the previous
case, we know that even under a fully connected 1-D subsystem, we still need $(\sqrt{p} - 1)$
transfer steps along each row and each column. The total time taken by this operation
is
With wormhole routing, the communication time is 
6.4.2 Model-3 
We first send the messages that must travel the longest distance, simultaneously using
all column and row connections. (Note: this is a different method from that used for
model-1.) Figure 6.3 shows the exact steps needed for the HOW(5, 3, 2) system,
with $P_{00}$ being the source. The number of message transfer steps is $2\lceil(\sqrt{p} - 1)/w\rceil$, the
same as the diameter of the system. The upper bound on the total time is
which is optimal.
With wormhole routing, the upper bound is
Special-case: Fully connected 1-D subsystems. For the fully connected
1-D subsystem, the whole communication operation needs just two transfer steps.
Therefore, 
With wormhole routing, the communication time is 
6.5 All-to-All Personalized Communication 
Tables 6.4 and 6.5 show the initial state and the required final result for all-to-all
personalized communication in a 5 x 5 2-D system. Two phases are implemented
again.
6.5.1 Model-1 and Model-2
We form rings on rows and columns. In each transfer step the message size is $m$
words and every processor tries to transfer the message(s) destined for its farthest
processor. We start with row transfers and continue with $\sqrt{p}$ all-to-all personalized
[Figure 6.3: panels (a) first step, (b) second step, (c) third step, (d) fourth step]
Figure 6.3 One-to-all personalized communication under model-3, for w = 3. The
Cartesian coordinates of destination processors are shown as pairs of numbers. A
shaded circle means that the corresponding processor has already received the personalized
message sent by the source.
Table 6.4 The initial state for all-to-all personalized communication in a 2-D HOW
system.
Initial state of HOW(5,3,2)
{(O,O),(O,O) },{ (0,0),(0,1) },{ (0,0),(0,2) },{ (0,0),(0,3) },{ (0,0),(0,4)}, {(1,O),(O,O)} , 
{(0,0),(1,0) },{ (0,0),(1,1) },{ (0,0),(1,2) },{ (0,0),(1,3) },{ (0,0) ,(1,4)}, 
{(0,0),(2,0) },{ (0,0)'(2,1) },{ (0,0),(2,2) },{ (0,0),(2,3) },{ (0,0) ,(2,4)}, ...... 
{(O,O) ,(3,0)},{ (0,0),(3,1) },{ (0,0) ,(3,2) },{ (0,0),(3,3) },{ (0,0),(3,4)}, 
{(O,O),( 4,O)},{ (0,0),( 4,1)}, {(O,O),( 4,2)},{ (0,0), (4,3)}, {(O,O),( 4,4)} 
{(I ,0) ,(O,O)}, {(I ,0) ,(O,l)}, {(1,0) ,(0,2)}, {(I ,0), (0,3)}, {(l,O), (O,4)}, {(l,l),(O,O)}, 
{(1,0),(1,0) },{ (1,0),(1, 1) },{ (l,O),(l,2)}, {(l,O),(1,3)}, {(l,O),(l,4)}, 
{(I ,0) ,(2,O)},{ (1,0) ,(2,1)}, {(l,O), (2,2)},{ (1,0) ,(2,3)}, {(l,O), (2,4)}, . . . . . ~ 
{(1,0),(3,0)),{ (1,0),(3,1) },{ (1,0),(3,2) },{ (1,0) ,(3,3)}, {(1,0),(3,4)}, 
{(l,O),(4,O) },{ (1,0),( 4,1) },{ (1,0),( 4,2)},{ (1,0),( 4,3) },{ (1,0),(4,4)} 
{(2,0),(0,0) },{ (2,0),(0,1) },{ (2,0),(0,2) },{ (2,0),(0,3) },{ (2,0),(0,4)}, {(2,1),(O,O)}, 
{(2,O),(l,O) },{ (2,0),(1,1) },{ (2,0),(1,2) },{ (2,0) ,(1,3) },{ (2,0),(1,4)}, 
{(2,0),(2,O) },{(2,0),(2,1) },{ (2,0)'(2,2) },{ (2,0),(2,3) },{ (2,0),(2,4)}, .. , ... 
{(2,0),(3,0)),{ (2,0)'(3,1) },{ (2,0),(3,2) },{ (2,0),(3,3) },{ (2,0),(3,4)}, 
{(2,0),( 4,0) },{ (2,0),( 4,1) },{ (2,0),(4,2) },{ (2,0),( 4,3) },{ (2,0),(4,4)} 
{(3,0),(O,O) },{ (3,0) ,(0,1) },{ (3,0),(0,2) },{ (3,0),(0,3) },{ (3,0),(0,4)}, {(3,1),(O,O)}, 
{(3,0),(1,0) },{ (3,0) ,(1,1) },{ (3,0),(1,2) },{ (3,0),(1,3) },{ (3,0),(1,4)}, 
{(3,0),(2,0)}, {(3,O),(2,1)}, {(3,O),(2,2) },{ (3,0),(2,3) },{ (3,O),(2,4)}, ...... 
{(3,0),(3,O) },{ (3,0) ,(3,1) },{ (3,0),(3,2) },{ (3,0),(3,3) },{ (3,0),(3,4)}, 
{(3,O), (4,0)}, {(3,O),( 4,1)}, {(3,O), (4,2)},{ (3,0),( 4,3)}, {(3,O), (4,4)} 
{( 4,O),(O,O)},{ (4,0) ,(O,l)}, {(4,O),(0,2) },{ (4,0) ,(0,3) },{ (4,0),(0,4)}, {( 4,l),(0,0)}, 
{( 4,0),(1,0) },{ (4,0),(1,1)},{ (4,0),(1,2)},{ (4,O),(1,3)},{ (4,O),(l,4)}, 
{ (4, 0), (2,0)}, {( 4,0), (2,1) }, { (4,0), (2,2)}, { (4,0), (2,3)}, { (4,0), (2,4)}, . . . . . ~ 
{ (4,0), (3,O)}, {( 4,0), (3,1) }, { (4,0), (3,2) }, {( 4,0), (3,3) }, { (4,0), (3,4) }, 
{( 4,0),( 4,O)}, {( 4,0) ,( 4,1)},{ (4,0), (4,2)}, {( 4,0),( 4,3)}, {( 4,0),( 4,4)} 
Table 6.5 The final result for all-to-all personalized communication in a 2-D HOW
system.
Required final state of HOW(5,3,2)
{(O,O) ,(O,O)},{ (0,1),(0,0) },{ (0,2) ,(0,0) },{ (0,3),(0,0) },{ (O,4),(O,O)}, {(O,O),(O,1)}, 
{(1,0) ,(O,O)}, {(1,1) ,(O,O)}, {(1,2), (O,O)}, {(1 ,3),(0,0)}, {(1,4), (O,O)}, 
{(2,O) ,(O,O)},{ (2,1) ,(0,0) },{ (2,2),(0,0)},{ (2,3),(0,0) },{ (2,4),(O,O)}, ...... 
{(3,O),(O,O)},{ (3,1) ,(0,0) },{ (3,2),(O,O)},{ (3,3) ,(0,0) },{ (3,4),(O,O)}, 
{( 4,0), (0,0) },{ (4,1 ),(O,O)},{ (4,2) ,(O,O)},{ (4,3) ,(O,O)}, {( 4,4),(0,0)} 
{(O,O),(l,O)},{ (0,1),(1,0) },{ (0,2),(I,O)},{ (0,3) ,(1 ,0) },{ (O,4),(I,O)}, {(O,O),(l,l)}, 
{(I,O),(I,O) },{ (1,1),(1,0) },{ (1,2),(1,0) },{ (1,3),(1,0) },{ (1,4),(l,O)}, 
{(2,0),(l,O) },{(2,1),(1,0) },{ (2,2),(1,0) },{ (2,3),(1,0)},{ (2,4),(l,O)}, . ~ . . . . 
{(3,0),(1,0)},{(3,1),(1,0)},{(3,2),(1,0)},{(3,3),(1,0)},{(3,4),(1,0)}, 
{( 4,0),(1,0) },{ (4,1)'(1,0) },{ (4,2),(1,0) },{ (4,3),(1,0)},{ (4,4),(1,0)} 
{(0,0),(2,0)},{ (0,1) ,(2,0) },{ (0,2),(2,0) },{ (0,3),(2,0)}, {(0,4),(2,0)}, {(0,0),(2,1) }, 
{(1,0),(2,0)},{ (1,1) ,(2,0) },{ (1,2),(2,0) },{ (1,3) ,(2,0)},{ (1,4),(2,0)}, 
{(2,0),(2,O)} ,{ (2,1) ,(2,O)},{ (2,2)'(2,0) },{ (2,3),(2,0) },{ (2,4),(2,0)}, ., .... 
{(3,O),(2,O)},{ (3,1) ,(2,O)},{ (3,2),(2,O)},{ (3,3) ,(2,O)},{ (3,4),(2,0)}, 
{( 4,0), (2,0)}, {( 4,1) ,(2,0)}, {( 4,2), (2,0)},{ (4,3) ,(2,0)}, {( 4,4), (2,O)} 
{(O,O),(3,0) },{ (0,1),(3,0)},{ (0,2),(3,0) },{ (0,3) ,(3,0) },{ (0,4),(3,0)}, {(0,0),(3,1)}, 
{(1,0),(3,O) },{ (1,1),(3,0) },{ (1,2),(3,0) },{ (1,3),(3,O)},{ (1,4),(3,O)}, 
{(2,0),(3,O) },{ (2,1),(3,0) },{ (2,2),(3,O)}, {(2,3) ,(3,0) },{ (2,4),(3,0)}, ...... 
{(3,O),(3,O)},{ (3,1),(3,0) },{ (3,2),(3,0)},{ (3,3),(3,O)},{ (3,4),(3,0)}, 
{( 4,0),(3,0) },{ (4,1),(3,0) },{ (4,2),(3,0) },{ (4,3),(3,0)}, {( 4,4),(3,O)} 
{(O,O),( 4,0) },{ (0,1)'(4,0) },{ (0,2),( 4,0) },{ (0,3),( 4,0) },{ (0,4),( 4,O)}, {(0,0),(4,l)}, 
{(I,O),( 4,0) },{ (1,1),( 4,0) },{ (1,2),( 4,0) },{ (1,3),( 4,0) },{ (1,4),( 4,O)}, 
{(2,0),(4,O) },{ (2,1),( 4,0) },{ (2,2),( 4,0) },{ (2,3),( 4,0) },{ (2,4),( 4,0)}, ...... 
{(3,0),( 4,0) },{ (3,1),( 4,0) },{ (3,2),( 4,0) },{ (3,3),( 4,0) },{ (3,4),( 4,0)}, 






communications within columns. Based on the implementation of $(\sqrt{p} + 1)$ all-to-all
personalized 1-D HOW operations, we get
With wormhole routing, the communication time is
\[ T(WR)_{all\_to\_all\_pers,1} = t_s + (\sqrt{p} + 1)\,mt_w\left(\left\lceil\frac{\sqrt{p} - 1}{2}\right\rceil^2 + \left\lfloor\frac{\sqrt{p} - 1}{2}\right\rfloor\right) = O(mp^{3/2}) \]
Special-case: Fully connected 1-D subsystems. For a fully connected 1-D
system, because all the processors use one port at a time to send a single message,
the total time taken is the same as that for the regular case.
The total time taken by this operation is
With wormhole routing, the communication time is
6.5.2 Model-3 
The implementation of this operation requires the following steps:
• Each processor transmits $\sqrt{p}$ values to each of the other $\sqrt{p} - 1$ processors on
its row, to be later distributed on the corresponding columns. At the end of
this step, each processor has received $(\sqrt{p} - 1)\sqrt{p}$ messages. This operation
is equivalent to $\sqrt{p}$ all-to-all personalized communications on a 1-D HOW
(row).
• In this step, each processor transmits the values it received earlier and its own
$\sqrt{p} - 1$ values to the other processors on its column. Since $\sqrt{p} - 1$ of the
messages received in the first step were destined for this particular processor,
the number of messages to be transmitted is $(\sqrt{p} - 1)\sqrt{p} - (\sqrt{p} - 1) + (\sqrt{p} - 1) =
(\sqrt{p} - 1)\sqrt{p}$.
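The message counts in the two steps above can be checked with a small Python sketch. It only routes (source, destination) labels through the row phase and counts what each processor must forward in the column phase; it is an illustration under the assumption that a message for destination (dr, dc) is first moved to column dc within the sender's row. The function name `two_phase_personalized` is hypothetical.

```python
def two_phase_personalized(side):
    """Count messages per processor in a row-then-column personalized exchange."""
    procs = [(r, c) for r in range(side) for c in range(side)]
    # Every processor starts with one message for every other processor.
    start = {q: [(q, d) for d in procs if d != q] for q in procs}

    # Phase 1 (rows): a message for (dr, dc) moves to column dc in the sender's row.
    received = {q: [] for q in procs}
    kept = {q: [] for q in procs}
    for (r, c), msgs in start.items():
        for src, (dr, dc) in msgs:
            if dc == c:
                kept[(r, c)].append((src, (dr, dc)))      # already in the right column
            else:
                received[(r, dc)].append((src, (dr, dc)))

    # Phase 2 (columns): forward everything that has not yet reached its destination.
    to_send = {q: [msg for msg in received[q] + kept[q] if msg[1] != q] for q in procs}
    return received, to_send


received, to_send = two_phase_personalized(5)
print(len(received[(0, 0)]), len(to_send[(0, 0)]))  # 20 20
```

For a 5 x 5 system both counts equal (sqrt(p) - 1) * sqrt(p) = 20, in agreement with the two steps described above.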
So the total number of all-to-all personalized 1-D HOW communications is
$\sqrt{p}(\sqrt{p} - 1) + \sqrt{p} = p$. Therefore, the total amount of time is
vP-1 y'P-1 





With wormhole routing, the time is 
Special-case: Fully connected 1-D subsystems. For a fully connected 1-D
subsystem, all the processors use all output ports, sending differently destined messages
to all accessible processors. The total time taken by the operation is
With wormhole routing, the communication time is
CHAPTER 7 
COMMUNICATION OPERATIONS ON BINARY HYPERCUBES 
We compare here the performance of 2-D HOW systems with that of binary
hypercubes for the studied set of communication operations. The (binary) hypercube
is an interconnection network that has been widely used in parallel processing,
primarily in the 1980s. A tremendous number of algorithms have been developed
for this system. The d-D binary hypercube or d-cube contains $2^d$ nodes. Two nodes
are neighbors if and only if their unique d-bit addresses differ in a single bit. A
hypercube with $p$ nodes has $(p/2)\log p$ edges.
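The neighbor rule and the edge count can be illustrated with a short Python sketch. Node addresses are represented as integers, and flipping bit k with XOR gives the neighbor in dimension k; both function names are hypothetical helpers used only for this example.

```python
def hypercube_neighbors(node, d):
    """Neighbors of `node` in a d-cube: flip each of the d address bits in turn."""
    return [node ^ (1 << k) for k in range(d)]


def hypercube_edge_count(d):
    """Count undirected edges by enumerating all neighbor pairs."""
    p = 1 << d
    edges = {frozenset((u, v)) for u in range(p) for v in hypercube_neighbors(u, d)}
    return len(edges)


print(hypercube_neighbors(0, 4))  # [1, 2, 4, 8]
print(hypercube_edge_count(4))    # 32, i.e. (p/2) * log p with p = 16
```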
No matter what communication model we are using (such as model-1, model-2,
or model-3), the number of transfer steps is the same and depends on $d = \log p$.
The examples shown in this section are for the 16-processor hypercube or 4-cube.
Of course, the one-to-all communication procedure is different from the all-to-all
communication procedure. For one-to-all communication, only
one message is sent along each direction in each step. The number of channels
used in each step, and which channels they are, are shown in Figure 7.1.
For all-to-all communication, in each step there are $2^{\log p - 1} = 2^{4-1} = 8$ channels
to be used and the pairs of processors exchange their information. Of course, different
channels will be used in different steps. Figure 7.2 shows the channels involved in
the 4-cube for all-to-all communication.
For the sake of simplicity, we restrict our comparisons to model-3, the most
powerful communication model, by also assuming the store-and-forward routing
technique. In fact, the equations we derive for the hypercube are also valid under
model-1 and model-2. First, we briefly evaluate communication operations for
hypercubes [5]. Then, comparisons with HOW systems follow in Section 6.
[Figure 7.1: panels (a) step one, (b) step two, (c) step three, (d) step four]
Figure 7.1 One-to-all communication procedure with 16 processors, for a hypercube
system.
[Figure 7.2: panels (a) step one, (b) step two, (c) step three, (d) step four]
Figure 7.2 All-to-all communication procedure with 16 processors, for a hypercube
system.
7.1 One-to-One Communication 
Routing in the hypercube is carried out by first producing the XOR (exclusive-OR)
result between the d-bit source and destination addresses and then routing the
message in those dimensions where the bit in the XOR result is equal to 1. Two
addresses may differ in up to $d$ bits, and therefore the maximum distance is equal to
$d = \log p$.
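The XOR-based routing rule can be sketched in Python as follows. The text does not fix the order in which the differing dimensions are traversed; this sketch assumes lowest dimension first, and the function name `hypercube_route` is a hypothetical helper.

```python
def hypercube_route(src, dst, d):
    """Return the node sequence from src to dst in a d-cube.

    The path corrects the differing address bits one at a time,
    lowest dimension first (an assumption made for this sketch).
    """
    diff = src ^ dst          # bits set to 1 mark the dimensions to traverse
    path = [src]
    cur = src
    for k in range(d):
        if diff & (1 << k):
            cur ^= 1 << k     # move to the neighbor in dimension k
            path.append(cur)
    return path


print(hypercube_route(0b0000, 0b1011, 4))  # [0, 1, 3, 11]
```

The number of hops equals the number of 1 bits in the XOR result, so it never exceeds d.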
Therefore, the upper bound on the communication time is
7.2 One-to-All Broadcasting 
The implementation of this communication operation requires the traversal of all
$d$ dimensions. Although the order chosen for the traversal of the $d$
dimensions does not matter, the description here assumes that this traversal starts
with the highest dimension. In the first phase, the source processor sends the message
to its neighbor in the $(d - 1)$-th dimension. In the second phase, the source and the
processor that previously received the message send a copy to their neighbors in the
$(d - 2)$-th dimension. In general, in the $s$-th phase, the $2^{s-1}$ processors that have a
copy of the message send a copy to their neighbors in the $(d - s)$-th dimension, for
$1 \le s \le d$.
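The phase structure described above can be traced with a minimal Python sketch. It records how many processors send in each phase and which processors hold the message at the end; the function name `hypercube_broadcast` is a hypothetical helper, and dimensions are numbered from 0 (lowest bit) to d - 1.

```python
def hypercube_broadcast(source, d):
    """Simulate one-to-all broadcasting in a d-cube, highest dimension first."""
    have = {source}           # processors that currently hold the message
    senders_per_phase = []
    for s in range(1, d + 1):
        dim = d - s           # phase s uses dimension d - s
        senders_per_phase.append(len(have))
        # Every holder sends a copy to its neighbor in the current dimension.
        have |= {node ^ (1 << dim) for node in have}
    return have, senders_per_phase


have, senders = hypercube_broadcast(0, 4)
print(senders, len(have))  # [1, 2, 4, 8] 16
```

The sender counts 1, 2, 4, 8 are the values $2^{s-1}$ for s = 1, ..., 4.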
The communication time required here is the same as the worst-case communication
time required for one-to-one communication, the only difference being that
for one-to-all broadcasting the message is stored in the intermediate nodes while
for one-to-one communication the message is not stored in the intermediate nodes.
Therefore,
\[ T_{one\_to\_all} = t_s + mt_w\log p + (\log p - 1)t_c = O(m\log p) \]
7.3 All-to-All Broadcasting 
This operation is carried out in $d = \log p$ steps. Pairs of processors exchange information
in each step. Each step doubles the size of the data to be exchanged between
processors in the next step because processors concatenate their current data with
the data they receive. Each step $i$, for $i = 1, 2, \ldots, d$, implements communications in
a different dimension $i$, and the size of all messages in step $i$ is $2^{i-1}m$ words. The
communication time is
\[
\begin{aligned}
T_{all\_to\_all} &= t_s + \Big(\sum_{i=1}^{\log p} 2^{i-1}m\Big)t_w + (\log p - 1)t_c \\
&= t_s + m(2^{\log p} - 1)t_w + (\log p - 1)t_c \\
&= t_s + m(p - 1)t_w + (\log p - 1)t_c = O(mp)
\end{aligned}
\]
Table 7.1 shows the entire procedure of all-to-all broadcasting in the 4-cube. 
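The doubling of message sizes can be reproduced with a short Python sketch of the exchange pattern. It assumes that step i uses address bit i - 1 and tracks only message identities; the function name `hypercube_all_to_all` is a hypothetical helper.

```python
def hypercube_all_to_all(d, m=1):
    """Simulate all-to-all broadcasting in a d-cube by pairwise exchanges."""
    p = 1 << d
    data = {node: {node} for node in range(p)}   # message ids held by each node
    words_per_step = []
    for i in range(1, d + 1):
        dim = i - 1
        # Every node currently holds 2^(i-1) messages of m words each.
        words_per_step.append(len(data[0]) * m)
        # Partners in dimension `dim` exchange and concatenate their data.
        data = {node: data[node] | data[node ^ (1 << dim)] for node in range(p)}
    return data, words_per_step


data, words = hypercube_all_to_all(4)
print(words, sum(words))  # [1, 2, 4, 8] 15
```

The per-step sizes sum to (p - 1) m, which is the coefficient of $t_w$ in the communication time above.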
7.4 One-to-All Personalized Communication 
The communication patterns are similar to those for one-to-all broadcasting.
However, the amounts of information to be exchanged in different steps differ
dramatically. In step $i$, for $i = 1, 2, \ldots, d$, a processor that has received earlier data
(or the source processor for $i = 1$) sends half of its data to its neighbor in dimension
$i$; the set of $2^{d-i}$ values sent to that neighbor is for the $2^{d-i}$ processors with the
higher addresses if the neighbor has a higher address (otherwise, the values are for
the $2^{d-i}$ processors with the lower addresses). The communication time is
\[
\begin{aligned}
T_{one\_to\_all\_pers} &= t_s + \Big(\sum_{i=1}^{\log p} 2^{\log p - i}m\Big)t_w + (\log p - 1)t_c \\
&= t_s + \Big(\sum_{i=0}^{\log p - 1} 2^{i}m\Big)t_w + (\log p - 1)t_c \\
&= t_s + m(2^{\log p} - 1)t_w + (\log p - 1)t_c \\
&= t_s + m(p - 1)t_w + (\log p - 1)t_c = O(mp)
\end{aligned}
\]
Table 7.2 shows the details involved in this communication. 
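The halving pattern can be illustrated with a small Python sketch. It assumes that step i splits the data on address bit d - i, so that the half sent to a higher-addressed neighbor is exactly the block for the higher addresses; this bit numbering is an assumption of the sketch, and the order of dimensions used in Table 7.2 may differ. The function name `hypercube_scatter` is hypothetical.

```python
def hypercube_scatter(d, source=0):
    """Simulate one-to-all personalized communication in a d-cube."""
    p = 1 << d
    held = {source: list(range(p))}   # destination ids of the messages each node holds
    sizes = []
    for i in range(1, d + 1):
        bit = 1 << (d - i)
        sizes.append(len(held[source]) // 2)   # 2^(d-i) messages are sent per sender
        nxt = {}
        for node, dests in held.items():
            partner = node ^ bit
            # Keep the half whose destinations share this bit with the node,
            # send the other half to the partner.
            nxt[node] = [x for x in dests if (x & bit) == (node & bit)]
            nxt[partner] = [x for x in dests if (x & bit) == (partner & bit)]
        held = nxt
    return held, sizes


held, sizes = hypercube_scatter(4)
print(sizes, sum(sizes))  # [8, 4, 2, 1] 15
```

After d steps every processor holds exactly its own message, and the sizes again sum to p - 1.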
Table 7.1 Detailed information for all-to-all broadcasting on the hypercube.
Initial state
P0 with message m0     P1 with message m1     P2 with message m2     P3 with message m3
P4 with message m4     P5 with message m5     P6 with message m6     P7 with message m7
P8 with message m8     P9 with message m9     P10 with message m10   P11 with message m11
P12 with message m12   P13 with message m13   P14 with message m14   P15 with message m15
First step (between two processors that differ in the first bit, such as P0 and P1)
m0,m1     m0,m1     m2,m3     m2,m3
m4,m5     m4,m5     m6,m7     m6,m7
m8,m9     m8,m9     m10,m11   m10,m11
m12,m13   m12,m13   m14,m15   m14,m15
Second step (between two processors that differ in the second bit, such as P0 and P2)
m0,m1,m2,m3       m0,m1,m2,m3       m0,m1,m2,m3       m0,m1,m2,m3
m4,m5,m6,m7       m4,m5,m6,m7       m4,m5,m6,m7       m4,m5,m6,m7
m8,m9,m10,m11     m8,m9,m10,m11     m8,m9,m10,m11     m8,m9,m10,m11
m12,m13,m14,m15   m12,m13,m14,m15   m12,m13,m14,m15   m12,m13,m14,m15
Third step (between two processors that differ in the third bit, such as P0 and P4)
P0 through P7 each hold:    m0,m1,m2,m3,m4,m5,m6,m7
P8 through P15 each hold:   m8,m9,m10,m11,m12,m13,m14,m15
Fourth step (between two processors that differ in the fourth bit, such as P0 and P8)
P0 through P15 each hold:   m0,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13,m14,m15
Table 7.2 Detailed information for one-to-all personalized communication on the
hypercube.
Initial state
P0 with messages m0,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13,m14,m15
P1 through P15 with no message
First step: Message transfer from P0 to P8.
P0: m0,m1,m2,m3,m4,m5,m6,m7
P8: m8,m9,m10,m11,m12,m13,m14,m15
Second step: Message transfer from P0 to P1 and from P8 to P9.
P0: m0,m2,m4,m6       P1: m1,m3,m5,m7
P8: m8,m10,m12,m14    P9: m9,m11,m13,m15
Third step: Message transfer from P0 to P2, from P1 to P3,
from P8 to P10, and from P9 to P11.
P0: m0,m4    P1: m1,m5    P2: m2,m6      P3: m3,m7
P8: m8,m12   P9: m9,m13   P10: m10,m14   P11: m11,m15
Fourth step: Message transfer from P0 to P4, from P1 to P5,
from P2 to P6, and from P3 to P7;
from P8 to P12, from P9 to P13, from P10 to P14, and from P11 to P15.
m0    m1    m2    m3
m4    m5    m6    m7
m8    m9    m10   m11
m12   m13   m14   m15
7.5 All-to-All Personalized Communication
This operation also requires $\log p$ communication steps. Each processor contains $p$
values in each step. In step $i$, for $i = 1, 2, \ldots, d$, each processor sends half of its data to
its neighbor in the $i$-th dimension; these data are destined for processors whose
$i$-th address bit is the same as that of the chosen neighbor. The communication
time is
Tables 7.3, 7.4, and 7.5 show the details involved in this communication 
procedure. 
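The exchange rule for this operation can be sketched in Python as follows. Messages are represented as (source, destination) pairs, and step i is assumed to use address bit i - 1; the function name `hypercube_all_to_all_personalized` is a hypothetical helper, and only message placement is simulated, not timing.

```python
def hypercube_all_to_all_personalized(d):
    """Simulate all-to-all personalized communication in a d-cube."""
    p = 1 << d
    # Each node starts with one message for every node (including itself).
    held = {node: [(node, dst) for dst in range(p)] for node in range(p)}
    sent_per_step = []
    for i in range(1, d + 1):
        bit = 1 << (i - 1)
        # A message leaves a node when its destination differs from the node in this bit.
        outgoing = {
            node: [msg for msg in held[node] if (msg[1] & bit) != (node & bit)]
            for node in range(p)
        }
        sent_per_step.append(len(outgoing[0]))
        held = {
            node: [msg for msg in held[node] if (msg[1] & bit) == (node & bit)]
            + outgoing[node ^ bit]
            for node in range(p)
        }
    return held, sent_per_step


held, sent = hypercube_all_to_all_personalized(4)
print(sent)  # [8, 8, 8, 8]
```

In every step each processor holds p messages and sends p/2 of them, and after log p steps every processor holds exactly the p messages addressed to it.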
Table 7.3 Detailed information for all-to-all personalized communication on the
hypercube.
Initial state
Po with messa.ge PI with message P2 with l1lC'ssage Ps with Hl_ossage 
mOQ,mOl,m02. m 03 
m04 ,rnOS ,mOB ,m07 n1-24 , 111,25,17126. n1-27 
rn08 ,mOS ,1110.10 ,1710,11 17tlS ,11119,111] .10 ,m I,ll m 38 ,fi 39 ,InS.lD ,rag,ll 
mO,12 ,mO, 13 ,mO.14 ,m,O, 15 Tn1 ,12 ,nt}, 13.m 1,14,1n 1, 15 11'13,12,»13,13 ,In:i,14 ,ln3.IS 
1>4 ,-,,,ith :,nessagc P5 ,.",ith l'nessagc P6 .. vith Hlossage P7 .... ,ith messago 
11164 ,1n05. '11166,111-67 111-7_1,'1'175. 1'n 76,n117 
n158. n1.59 ,111.5 ,10 ,nt-S.ll 1n68 ,11169 ,711,D, 10 ,1116, 11 
nl5-,12 ,1'n.5,13. ,fnS.l·1. m S,] 5 171-6,12 ,n16, 13 ;rno,l.t ,111.6,15 TTL 7 ,12 , n1 7 .13 •In 7, 1,1 ,n1-7 .15 
P8 .,. ... i th message pg with message PIC with message PI! with message 
1nl0,O ,11110, 1 ,rn 10.'2 ,nl-l0,3 rnl1,O ,11l11 ,1 ,n"t! 1,2 ,Httl.3 
ffiS.1, m·S5, 11'1-86, rn87 11l10, -1 ,lliIO,!) ,nqO, G ,rn 10, 7 Tn 11.4 ,Jll.ll.5,11'1l1,6 ,111} 1.7 
nt-ss ,lnS9.n18, 10,1)"1.8.11 111.98 ,111g9 ,m_g, 10 ,Tng,Il 11110,8 , rn l0,9 ,Tn 10, 10 ,,-n 10, 11 1111 1 ,8 ,n1-11,9 ,ntll ,10,mll,11 
r11s, 12 ,ma.IS ,rng, 14 ,rnS,IS 1'1'1.0,12,1119,13, lng, 1.-1, f1!9, 15 TI1IQ, 12 ,11110,13 ,Otl0, H ,J1qO, 15 U111, 12 , 711 11,13 , Jl1 11.I4,i1'1 11 ,15 
Pl2 with message PI3 with message Pl-t with Hlcssage PIS \vith message 
l1l12,0 ,Tn 12,1 ,m 12.2 ,rn12,3 11113,0,111.13.1, rH13,2 ,rn13,S 71"1.14,0, Ut1·t.t, 1111-1, 2 ,ln14.3 111-15,0,17115,1 ,rn15. 2. n1 15,3 
111.12,4,17112.5 ,in 12,6 ,in 12. 7 11113,4 ,m 13.5 ,1'1113,6 ,TTl-IS. 7 T11-14 ,4 ,1'11 1-1,5 ,nlI4,6,Trt14, 7 1n 15,4 ,11115.5,1'n 15,0 ,n!} 5, 7 
Tn12,S ,111-12,9 ,m"12,1 0 ,Tn 12.11 nl13,8 ,Tn I3.9,m 13,10 ,rn 13, 11 rn14,S ,11114,9 ,171 14,lO, rn l_1, 11 r11.15,8 ,1n 15,0 ,171-15.10 ,r11 15, 11 
ffi12.12 ,7n12,13 ,11'1.12,14 , rn 12, 15 mI3, 12,m13.13 ,m13, 14;m 13,15 n114, 12 ,f11.14, 13 ,Tl1.1'1.14 ,111.14, 15 
First step (among two processors with first bit difference, such as Po and Pl.) 
rnoo ,11'1.10, 1H02, ntI2 
111.04 ,11114 ,Ot06,11116 
1nos ,71:1.18,111.0,10 ,nIl, 10 n109 ,m19,n'O,11 ,111.1, II ;11-28 ,n1-38 ,1112, 10 ,11"l3,10 
n10,12 ,nt 1,12,1'110,1"1 ,Tn 1,14 1110,13 ,rn l,13 ,ntO, 15 ,m l, 15 
nl.'IO ,11150 ,TI'l-42, rn 52 
rn-1,9 ,1"nS9 ,nt4" 11 ,mS, 11 H169,n179,1n6,11 ,11'1.7.11 
111-0;,12,1715,12 ,111.·1, 14 ,1115,}-:1 1'n4, 13 ,1"Us, 13 ,nL1,15,1715, 15 Til a , 12,])"1-7,12 , rJ1 0.1-1 ,1n7, 14 
Tl110.0 ,111.11,0 ,11110,2 ,nq 1,2 1"J!10,1 ,m 11,1,111-10,3 ,nq 1,3 
ml0,4,U1.11,4. 111 10,6,»1-11,6 ffil0,S.IHll,5,171.10,7 , 711 11, 7 
rn 88, -,--/1,08,11'18,10 ,lng ,10 171.S9.rn 99,Ji1-S.11,mg,ll Tn 10.8 ,mIl.S , n1 1Q,10,nq 1, 10 mlO,g,1Tl.}I,D, rn lO,111 7Tl I1,11 
1'1'18,12 ,17"1-9,12,11"18.11 ,11t9,14 mS,13 ,'171.9, 13,1718,15 ,rn9, 15 mlO, 12 ,nt} 1 ,12 ,n1.10,14 ,n1.11,14 
nl12.0 ,m 13,0 ,1'1112.2 ,1n 13,2 m 14,0 , ln 15.0 ,1T1.14 ,2 ,nl15.2 n1.1.1,1 ,11"1-15,1.1n]4.3 ,771:15,3 
m12,4 ,111.13,4 ,mI2.6 ,ltl13,G Tn12.5 ,1nI3,5 ,nt 12.1.n q 3, 7 1'n 14,-1 ,n115"1,riL14,6 ,Tll 15,6 11l14,5 ,1n15,5 ,irqq. 7,111.15.7 
1n12,8,m13,8.m 12,10 , rn 13, 10 nq 2,9 ,r11 13,9 ,ffi12, 11 , n1 13. 11 n114,8 ,1TI 15,8 ,71l14, 10 ,77l15, 10 'lll-14.9 ,rnI5,9.f1t 1-1.11 ,In 15, 11 
rnl2, 12.m 13, 12 ,nl12,14 ,nl}3, 1-1 flt12.13 ,T11-13, 13 ,m 12.15,1n13, 15 7'11 14,13 ,'llt}5, 13 ,rJ1.14, 15 ,1nU5-,16 
Table 7.4 Detailed information for all-to-all personalized communication on the
hypercube (continued).
Second step (between two processors that differ in the second bit, such as P0 and P2)
moo ,mlO ,m:W ,n1.30 
InO.l! ,,-rq,ll ,111 2,11.lnS, 11 
111.0,12 ,111}, 12,1112,12 ,rH:3,12 mO.lS ,rn 1, 13 ,n12, 13 , tIl 3. 13 1710,14 ~Tn 1,14 ,m2.14 ,171 3, 14 
17141 ,rJlSl ,rHO} ,11171 
nL:1S .n1S8 ,171.68,171 78 rn-1, 10 ,1n.5,10 ,T1'16 , 10 ,1117.10 J114,11 ,1115,11 ,lTt6,ll ,r1l7 ,II 
171 4,12, 171 5,12,1116,12,1117,12 n:l-.1. 14,1115,1-1 ,nI6, 14 ,1117.14 711.1,15 ,rnS, 15 ,lTI6-,15, rn";. 15 
1'n81 ,1"1191.11110,1 ,Tnl1.1 mS2 ,H192 ,71110.2 ,171 11,2 T1183 ,n-LOS ,nt 10.3 ,»111.3 
1n84 ,17104,71110.4 ,Ttl-II ,4 11185 ,n195 ,mID.S ,11'1} 1,5 mS7 , n197 ,nQO,7 ,11111, 7 
m·ss .17l98,mlO,8,ml1,8 m89 ,1"1109 ,171.10,9,171.11,9 tnS,lO ,tng,IO ,Ul I0, 10 ,11111,10 7118.11 ,rng,I1 ,1n-l0.11.1)111,11 
rna, 12 ,lng, 12 ,m 10, 12 ,n1 11,12 rn8,13 ,7ng,13,mlO,13 ,m 11, 13 mS.l1 ,rng, 11.17l-10.14 ,11'1-11.14 'InS.1S ,-nt9, 15 ,1nIO, 15 .111.11, 15 
17l12,O ,11113.0 ,111.14,0 ,;nlS,O 17112.1,11LI3.1,TnI·i,l.nI15.1 l1q 2,2 ,In 13,2 '1n 14,2 ,rn 15,2 17112.3 ,Tn 13,3 ,11111,3 ,lnI5,3 
Tl'l12,4 ,UL13,-1 ,ffiI4,4 ,1'11 15,4 7n 12,5 ,nLlS,"; ,'I1Q.1,5 , 711 15.5 Tn 12,G , 111. 13.6,:n 1·1 ,6 ,11115,6 11112,7 ,1n13, 7 ,11't 14,7 ,nL 15, 7 
Tn 12,8 ,1nI3,8 ,m14,S ,tn 15,8 t11-12,9 ,r1113,9 ,111 14,9 ,n1-15,9 m 12, 10 ,711.13, 10,1111.1.10 ,r11 15, 10 rn12, 11 ,mI3, 11 ,m-14, 11 ,mlS,ll 
111-12.1:2 ,111.13.12 ,11114, 12 ,m 15, 12 "tnl:?, 13 ,rn 13,13,11114,1.3- , 171 15,13 n112, 14 ,7"11 13,1-1, rn14, 14 ,1ll-15, 14 rn-12, 15,71113,15,11114, 15,1H 15, 15 
Third step (among two processors with third bi t difference, such as Po and P4.) 
moo ,rn 10 ,1U::W ,ra30 1HOI ,Tn 11 ,l1t21 ,17131 rH03 ,Tn13 ,11123 ,Ht33 
rnO!} ,nll9 ,11129 ,0139 T110, 10 , nt 1, 10 ,1n 2, 10 , n13. 10 111.0,11,1711,11 ,1n2, 11 ,J'n 3.11 
'1'114,10 ,rn'5, 10 ,1716, 10 ,111.7, 10 
rn45 ,:mS5 ,17165 ,11175 171.16,11'156, 7n66, 11) 76 
rno, 12 ,n1}, 12 , 1n2, 12 ,111-3, 12 TnO,13 ,m 1,] 3 ,11"1.2.13 ,Tn~i, 13 nl-O, l.1,m 1,1-1 , nt2.14 ,1113, 14 n10, 15 ,lHl,15. rn Z.Hi ,fu 3, 15 
1114,13,'1'11 5, 13 ,7110,13 ,1Y"7, 13 1111,14,1115,14,111.0,14 ,rrl7, 14 1")14,15 ,H1S,15 ,rna, 15 ,111'7,15 
InSO, rngo,rn 10,0 ,rn 11 ,0 mSl ,lHgl.HqO, 1 ,inlI, 1 1"n 82 ,11192 ,r11. 10,2 ,1nl1, 2 
nt12,Q ,m 13.0 ,1n 1-1,0 ,ra 15,0 ffi12, 1 ,111.13,1 ,nll-l,1 ,ffi 15,1 rn 12:2 ,1rL 13,2 ,1n 14,2 ,m15.2 1n 12,3 ,n't}3,3 ,711.1·1 ,3 ,17115,3 
msa ,ntV8 ,11'110,8 ,m 11.8 rnS9 ,nt9l),rn10.9 ,lnl1 ,9 r118.10 ,r11g, 10 , 171 10, 10 ,111-11,10 TnS,ll ,111.9,11 ,n110, II ,m 11,11 
rn 12,8 ,71t 13,8 ,m 14,8 , nt lS.8 PI.} 2, 9 ,Dl 13,9, n11-1,D ,17115, 9 71112,10,1'1113,10 ,Tn 14,10 ,111.15,10 1n12,11 , Tn 13,11 ,ml-1,11 ,111-15,11 
mS·1 ,11194 ,H110,1 , rH l1,4 mS5 , 171 05 ,Tn 10,5,lH 11.5 1")1.86 ,n-t9o ,1n-lO,6 ,Tn 11,6 rnS7 ,Hl!)7 ,mIG, 7 "nIl, 7 
rn12,·1. Tn 13, 4, 1n 1'1, 4,111 15,4 H1.12.5, 11113,5, m'1<1,5.lTl15,5 11112,6 ,m 13.6 ,Tn 14,6 ,m 15,6 711 12,7 ,Hl I3, 7 ,tT.! 14,7 ,ntIS. 7 
r118, 12 ,1TI9,12 ,n110, 12 ,1n 11,12 1nS.13 ,rng, 13 ,lnl0.13 ,ill} 1.13 111S,14 ,lng.14 , fl1. l0,14 ,1ll-tl, 14 rns, 15 ,rng,15 ,111.10,1.5 ,n}, 11,15 
ffi12, 12 ,,11.13, 12 ,1TL14, 12 ,lH IS, 12 Tn 12, 13 ,11113.13 ,ffi14, 13 ,rn15.13 m12,14 , 711 13.14 ,rnI4, 1." , 17t I5, 14 1TI12, IS ,nqs, 15. 171-1·1. 15,nt15. 15 
Table 7.5 Detailed information for all-to-all personalized communication on the
hypercube (continued).
Fourth step (between two processors that differ in the fourth bit, such as $P_0$ and $P_8$).
t1100 ,111-10 ,Tf1ZO.17130 '(JI01 ,11111 ,11121 ,lH31 
111.12,711 52,11162, n1 72 
r.nSl.l1l91, 1l1 10,1,mll,} 711.82 ,1n02 ,r11.10,2 ,11111.2 
171 12,0 ,lTLla,O ,r)"1 14,0 ,11'L15.0 nt12,2.ffi13.2, Tn 1-1,2 ,111 15.2 tn 12.3 ,111 13,3,1111-1,3,11"115,3 
7n05. Tn 15 ,1'11.'25, Tll35 Tfl06 ,Hl 16,n}26 ,71136 
111,84,1110.1 ,Tn 10,-1,11111,4 H185 ,11L95 ,71l10,5 ,1'n 11.5 TJ186,J1tg6 .H"llO,Q ,n1-11 ,6 111-8.7 ,rH97 lm'IO, 7 , n1 11, 7 
Ttl-I '2,4 ,1n13,_1"n114,4 ,11"1-15.4 1n12,5 ,mI3,S ,1711_1,5 ,Hl 15,5 "In]2,7 ,n113, 7 ,lH14, 7.17115, 7 
lUOS ,THIS ,rH28 , ru 3S m·o, 10 , ra 1, 10 ,111-2.10 ,TTl-3.10 rnO,11 ,Tnl ,11,7'112.11 ,H13, 11 
111-48,17158 I 1"n68, 111.78 nl-·i, 10 ,71l 5, 10 , rn6,10 ,m7, 10 111.1.,11,1115.11, 711 6.11,111-7,11 
111-88,1"1108, HlIO,8, fTt11 ,8 l'f189 ,111-99 ,r7l. 10, 9 ,rn 11, 9 THS, 1 Q ,111.9. 10 ,711 10. 10 ,m'11, 10 Tns, 11,1119,11 ,IThIO, 11,TH 11.11 
11112,8 ,n113,8,111.14,8 ,ntlS,S 77112,9 ,lll 13, 9 , n1. 14, 9 ,7l'Q 5,9 J1112, 11 ,m·13.11 ,mH.ll ,mIS,Il 
1110,12 ,nq,1 2,m2, 12 ,711-3, 12 rI10,}3 ,lTII.I:t ,7"11-2,13 ,1'n 3.13 
n1-1.12,111-5, 12 ,rTIu, 12 ,m?, 12 m4,14 ,1115,14 ,l115, 1,1 , ln 7 .14 n1"1,15,r(1.5, 15 , lne, 15 , lTI 7, 15 
lng, 12 ,7119.12 ,n-qQ, 12 ,Jfl-l1.12 JTIS, 13 ,rnO, 13 ,r71,10, 13 ,r-n 11,13 nlB, 14 ,lng, 14 ,11110, 1~1 ,nt 11,14 rn·8.15 ,mo. IS ,11110, 15 ,Jl1 11,15 
m12,12 ,mI3, 12 ,mH, 12,71'115,12 l'rq 2, 13,m 13,13 ,Tn 14,13 ,ln lS, 13 J1L}2,14 ,ln 13, 1;( ,71114,14 ,111-15,1_1 HlJ2.1S , m 13, 15 ,m14,15 ,lTI 15, 15 
CHAPTER 8 
PERFORMANCE COMPARISONS BETWEEN HOW AND BINARY 
HYPERCUBE SYSTEMS 
In this section we compare the communications capabilities of 2-D HOW systems and
hypercubes. We consider communications under model-3, which permits a processor
to send out different values simultaneously using different channels, because this is
often the case with real systems. We assume that $t_w$ is one unit of time and
that $t_s = t_c = 0$ in order to simplify the calculations.
The equations derived in the previous sections for 2-D HOW systems follow:
The equations for hypercube systems are:
$T_{all\_to\_all\_pers} = m(p - 1)t_w = O(mp)$
It becomes obvious that HOW systems perform asymptotically better than
hypercubes in one-to-all personalized communication and all-to-all broadcasting. In
the other two types of communications, the result of the comparison depends on the
value of w. The remaining figures show comparative results for practical cases, where






[Figures 8.1 to 8.16: plots against the number of processors (horizontal axis "processors", 0 to 10000), with curves for the binary hypercube system and for 2-D systems with w=4, w=8 and w=16.]
Figure 8.1 Comparisons between HOW and binary hypercube systems for one-to-all ...
Figure 8.2 Comparisons between HOW and binary hypercube systems for one-to-all ...
Figure 8.3 Comparisons between HOW and binary hypercube systems for one-to-all ...
Figure 8.4 Comparisons between HOW and binary hypercube systems for one-to-all ...
Figure 8.5 Comparisons between HOW and binary hypercube systems for all-to-all ...
Figure 8.6 Comparisons between HOW and binary hypercube systems for all-to-all ...
Figure 8.7 Comparisons between HOW and binary hypercube systems for all-to-all ...
Figure 8.8 Comparisons between HOW and binary hypercube systems for all-to-all
broadcasting with message size m = 20 words.
Figure 8.9 Comparisons between HOW and binary hypercube systems for one-to-all ...
Figure 8.10 Comparisons between HOW and binary hypercube systems for one-to-...
Figure 8.11 Comparisons between HOW and binary hypercube systems for one-to-...
Figure 8.12 Comparisons between HOW and binary hypercube systems for one-to-
all personalized communication with message size m = 20 words.
Figure 8.13 Comparisons between HOW and binary hypercube systems for all-to-all ...
Figure 8.14 Comparisons between HOW and binary hypercube systems for all-to-all ...
Figure 8.15 Comparisons between HOW and binary hypercube systems for all-to-all ...
Figure 8.16 Comparisons between HOW and binary hypercube systems for all-to-all
personalized communication with message size m = 20 words.
CHAPTER 9 
PERFORMANCE COMPARISONS BETWEEN HOW AND
GENERALIZED HYPERCUBE SYSTEMS
In this section we compare the communications capabilities of 2-D HOW systems and
generalized hypercubes. We consider communications under model-3, which permits a
processor to send out different values simultaneously using different channels, because
this is often the case with real systems. We assume that $t_w$ is one unit of
time and that $t_s = t_c = 0$ in order to simplify the calculations.
The equations derived in the previous sections for 2-D HOW systems follow:
$T_{all\_to\_all,3} = m t_w (1 + \sqrt{p}) \left\lceil \frac{\sqrt{p} - 1}{w} \right\rceil = O\left(m \frac{p}{w}\right)$
$T_{all\_to\_all\_pers,3} = 2pm \left\lceil \frac{\sqrt{p} - 1}{w} \right\rceil t_w = O\left(m \frac{p^{3/2}}{w}\right)$
The generalized hypercube is a special case of our HOW system. The equations
for generalized hypercube systems (or 1-D fully connected HOW subsystems) are:
$T^{full}_{all\_to\_all\_pers,3} = m(p - 1)t_w = O(mp)$
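As a rough numerical illustration of the two all-to-all personalized expressions above, the following C sketch evaluates them for a given side length. It assumes $p = k^2$ with $k = \sqrt{p}$ an integer and takes the bracket around $(\sqrt{p} - 1)/w$ to be a ceiling; the function names and parameters are hypothetical and are not part of the simulation code in the appendix.

```c
/* Sketch only: evaluates the closed-form times quoted in the text.
 * k is the side length (k = sqrt(p)), w the window size, m the message
 * size in words, tw the per-word transfer time. All names are
 * illustrative assumptions. */

/* ceil(a / b) for positive integers */
static unsigned long ceil_div(unsigned long a, unsigned long b)
{
    return (a + b - 1) / b;
}

/* 2-D HOW: T = 2 p m ceil((sqrt(p) - 1) / w) t_w, with p = k * k */
unsigned long how_all_to_all_pers(unsigned long k, unsigned long w,
                                  unsigned long m, unsigned long tw)
{
    return 2 * k * k * m * ceil_div(k - 1, w) * tw;
}

/* Generalized hypercube: T = m (p - 1) t_w, with p = k * k */
unsigned long gh_all_to_all_pers(unsigned long k, unsigned long m,
                                 unsigned long tw)
{
    return m * (k * k - 1) * tw;
}
```

For example, with k = 10 (p = 100), w = 4, m = 1 and $t_w = 1$, the HOW expression gives 600 time units and the generalized hypercube expression gives 99.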
Table 9.1 Cost comparison between the HOW($\sqrt{p}$, w, 2) and GH($\sqrt{p}$, 2) systems.
Cost Comparison
System               one-to-all        all-to-all         one-to-all-pers.   all-to-all-pers.
                     broadcasting      broadcasting       communication      communication
HOW($\sqrt{p}$,w,2)  $O(mpw)$          $O(mp^{3/2}w)$     $O(mp^{3/2}w)$     $O(mp^2w)$
GH($\sqrt{p}$,2)     $O(mp^{3/2})$     $O(mp^2)$          $O(mp^2)$          $O(mp^{5/2})$
The remaining figures show comparisons between generalized hypercubes and
HOW systems. It becomes obvious that generalized hypercube systems perform
better than HOW systems from the communication time point of view. But the
generalized hypercube has a fundamental design disadvantage. It has very large
wiring complexity, as demonstrated by its bisection width. The bisection width is
defined as the minimum number of wires that must be cut to separate the network
into two equal halves [23]. A very large bisection width makes the network impossible
to build. The bisection width of the GH(k, n) is $O(k^{n+1})$.
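It is derived as follows. The bisection width of the GH($\sqrt{p}$, 1) is $\lceil \sqrt{p}/2 \rceil \cdot \lfloor \sqrt{p}/2 \rfloor$,
because when cutting the graph into two halves the edges which connect the left
$\lceil \sqrt{p}/2 \rceil$ nodes with the right $\lfloor \sqrt{p}/2 \rfloor$ nodes must be removed. For the GH($\sqrt{p}$, 2) the
bisection width is $\sqrt{p} \cdot \lceil \sqrt{p}/2 \rceil \cdot \lfloor \sqrt{p}/2 \rfloor = O(\sqrt{p}\, p) = O(p^{3/2})$ and for the GH(k, n) the
bisection width is $k^{n-1} \cdot \lceil k/2 \rceil \cdot \lfloor k/2 \rfloor = O(k^{n+1})$.
For the 1-D HOW($\sqrt{p}$, w, 1) the bisection width is $1 + 2 + 3 + \cdots + w = \frac{w(w+1)}{2} = O(w^2)$.
For the 2-D HOW($\sqrt{p}$, w, 2) the bisection width is $\frac{w(w+1)}{2} \cdot \sqrt{p} = O(\sqrt{p}\, w^2)$.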
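The bisection-width expressions above can be evaluated directly. The C sketch below is only an illustration of those closed forms: the function names are hypothetical, k stands for the radix ($\sqrt{p}$ in the 2-D case), and n is assumed to be at least 1.

```c
/* Sketch only: closed-form bisection widths as stated in the text.
 * Function names are illustrative assumptions. */

/* Integer power k^e for small arguments. */
static unsigned long ipow(unsigned long k, unsigned e)
{
    unsigned long r = 1;
    while (e-- > 0)
        r *= k;
    return r;
}

/* GH(k, n): k^(n-1) * ceil(k/2) * floor(k/2), assuming n >= 1. */
unsigned long gh_bisection(unsigned long k, unsigned n)
{
    return ipow(k, n - 1) * ((k + 1) / 2) * (k / 2);
}

/* 1-D HOW with window size w: 1 + 2 + ... + w = w(w+1)/2. */
unsigned long how1d_bisection(unsigned long w)
{
    return w * (w + 1) / 2;
}

/* 2-D HOW(k, w, 2): k one-dimensional cuts of w(w+1)/2 wires each. */
unsigned long how2d_bisection(unsigned long k, unsigned long w)
{
    return how1d_bisection(w) * k;
}
```

With k = 8, for instance, the 2-D generalized hypercube expression yields 128 wires, while the 2-D HOW expression with w = 3 yields 48.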
Let us define the cost of an interconnection network as the product of the
"communication time" and the "bisection width". This is a reasonable cost measure
because we should like to achieve small communication time with a small system
complexity. Table 9.1 shows the costs of the HOW($\sqrt{p}$, w, 2) and the GH($\sqrt{p}$, 2)
for $\sqrt{p} \geq w$. This table also shows that reductions in the cost are proportional to
reductions in the value of w and this leads to predictability. The HOW($\sqrt{p}$, w, 2)





[Figures 9.1 to 9.16: plots against the number of processors (horizontal axis "processors", 0 to 10000), with curves for the generalized hypercube system and for 2-D systems with w=4, w=8 and w=16.]
Figure 9.1 Comparisons between HOW and generalized hypercube systems for one-...
Figure 9.2 Comparisons between HOW and generalized hypercube systems for one-...
Figure 9.3 Comparisons between HOW and generalized hypercube systems for one-...
Figure 9.4 Comparisons between HOW and generalized hypercube systems for one-...
Figure 9.5 Comparisons between HOW and generalized hypercube systems for all-...
Figure 9.6 Comparisons between HOW and generalized hypercube systems for all-...
Figure 9.7 Comparisons between HOW and generalized hypercube systems for all-
to-all broadcasting with message size m = 10 words.
Figure 9.8 Comparisons between HOW and generalized hypercube systems for all-...
Figure 9.9 Comparisons between HOW and generalized hypercube systems for one-...
Figure 9.10 Comparisons between HOW and generalized hypercube systems for ...
Figure 9.11 Comparisons between HOW and generalized hypercube systems for ...
Figure 9.12 Comparisons between HOW and generalized hypercube systems for ...
Figure 9.13 Comparisons between HOW and generalized hypercube systems for ...
Figure 9.14 Comparisons between HOW and generalized hypercube systems for ...
Figure 9.15 Comparisons between HOW and generalized hypercube systems for ...
Figure 9.16 Comparisons between HOW and generalized hypercube systems for
all-to-all personalized communication with message size m = 20 words.
CHAPTER 10 
CONVERSION OF COMMUNICATIONS ALGORITHMS FOR
GENERALIZED HYPERCUBES
Because the $GH_{k,n}$ is the building block of our HOW systems, it is worth trying
to modify existing communications methods used for the $GH_{k,n}$. The following
terms are used for constructing BST (Balanced Spanning Tree) and BSG (Balanced
Spanning Subgraph) graphs [4].
DEFINITION 10.1. $GH_{k,n}$, an n-dimensional k-ary generalized hypercube, is
an undirected graph of $N = k^n$ nodes, each one labeled by an n-digit number in
radix k arithmetic. Each node v is connected to $n(k - 1)$ other nodes with which it
differs in only one digit; i.e., node $v = v_{n-1} \ldots v_{i+1} v_i v_{i-1} \ldots v_0$ is connected to nodes
DEFINITION 10.2. The translation of a node v with respect to node s, denoted
by $T_s(v)$, is defined to be the node $t = T_s(v)$, so that $t_i = (v_i + s_i) \bmod k$, for
$0 \leq i \leq n - 1$. The inverse translation of a node v with respect to node s, denoted
by $T_s^{-1}(v)$, is defined to be the node $t = T_s^{-1}(v)$, so that $t_i = (v_i - s_i) \bmod k$, for
$0 \leq i \leq n - 1$.
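A minimal C sketch of Definition 10.2 follows. It assumes a node is stored as an array of n digits with v[i] holding digit $v_i$; the function names are hypothetical. The inverse adds k before the final reduction so that the result stays in the range 0 to k - 1, since the C `%` operator can return a negative value for a negative operand.

```c
#include <stddef.h>

/* Sketch only: digit-wise translation of a GH(k, n) node label.
 * v, s and t are arrays of n digits, index i holding digit i.
 * Function names are illustrative assumptions. */

/* t = T_s(v): t_i = (v_i + s_i) mod k */
void translate(const int *v, const int *s, int *t, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        t[i] = (v[i] + s[i]) % k;
}

/* t = T_s^{-1}(v): t_i = (v_i - s_i) mod k, kept in 0..k-1 */
void inverse_translate(const int *v, const int *s, int *t, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        t[i] = ((v[i] - s[i]) % k + k) % k;
}
```

Applying `inverse_translate` to the output of `translate` with the same s returns the original node, which is what allows a tree rooted at the all-zero node to be moved to any root s and back.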
DEFINITION 10.3. Consider the function r from the set $\{0, 1, \ldots, k - 1\}$ to
itself as follows:
$r(i) = \begin{cases} 0 & \text{if } i = 0 \\ (i \bmod (k - 1)) + 1 & \text{otherwise} \end{cases}$
(Notice that r maps digit 0 to itself and the remaining digits as follows: 1 -> 2 ->
3 -> ... -> k - 1 -> 1.) The rotation of a node $v = v_{n-1} \ldots v_{i+1} v_i v_{i-1} \ldots v_0$,
denoted by $R(v)$, is defined to be the node $v_{n-2} \ldots v_{i+1} v_i v_{i-1} \ldots v_0\, r(v_{n-1})$.
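The rotation of Definition 10.3 can be sketched in C as follows, again assuming that a node is an array of n digits with v[i] holding digit $v_i$ and that k is at least 2. The function names are hypothetical.

```c
/* Sketch only: the digit map r and the rotation R of Definition 10.3.
 * Function names are illustrative assumptions; k >= 2 is assumed. */

/* r(i): 0 stays 0; nonzero digits cycle 1 -> 2 -> ... -> k-1 -> 1. */
int rot_digit(int i, int k)
{
    return i == 0 ? 0 : (i % (k - 1)) + 1;
}

/* In-place rotation R(v): every digit moves one position to the left,
 * and the old leading digit v_{n-1} re-enters at position 0 as r(v_{n-1}). */
void rotate(int *v, int n, int k)
{
    int top = v[n - 1];
    for (int i = n - 1; i > 0; i--)
        v[i] = v[i - 1];
    v[0] = rot_digit(top, k);
}
```

In the $GH_{5,2}$, repeated rotation of node 11 gives 12, 22, 23, 33, 34, 44, 41 and then 11 again, which matches the necklace sequence printed with Figure 10.1.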
DEFINITION 10.4. An ordered group of nodes, each one derived from its
subsequent one cyclically by the application of a rotation, is called a necklace.
DEFINITION 10.5. The binary correspondent of a node v of $GH_{k,n}$ is the
binary number obtained if we substitute each nonzero digit in v with the digit 1.
The generator node of a necklace is defined to be the largest among the nodes of the
necklace that have the largest binary correspondent.
DEFINITION 10.6. The displacement of a node v, denoted by $D(v)$, is defined
to be the minimum number of rotations that we have to apply on v in order to derive
the generator of its necklace.
DEFINITION 10.7. The period of a node v, denoted by $P(v)$, is defined to be
the number of nodes contained in the necklace to which it belongs.
DEFINITION 10.8. An unfolded necklace is an ordered group of exactly $n(k-1)$
nodes, not necessarily distinct, each one obtained from its subsequent one cyclically
by the application of a rotation.
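Definitions 10.5 to 10.7 can be explored with a small C sketch that walks a necklace by repeated rotation. It is an illustration under stated assumptions rather than the construction used in [4]: nodes are arrays of n digits (v[i] holds digit $v_i$), "largest" is read as the largest numeric value in radix k, MAX_DIGITS is an arbitrary limit on n, k is at least 2, and all function names are hypothetical.

```c
#include <string.h>

#define MAX_DIGITS 16 /* illustrative limit on n, not from the text */

static int rot_digit(int i, int k)
{
    return i == 0 ? 0 : (i % (k - 1)) + 1;
}

/* Rotation R(v) of Definition 10.3, in place. */
static void rotate(int *v, int n, int k)
{
    int top = v[n - 1];
    for (int i = n - 1; i > 0; i--)
        v[i] = v[i - 1];
    v[0] = rot_digit(top, k);
}

/* Numeric value of the label in radix k. */
static long node_value(const int *v, int n, int k)
{
    long x = 0;
    for (int i = n - 1; i >= 0; i--)
        x = x * k + v[i];
    return x;
}

/* Binary correspondent: every nonzero digit replaced by 1. */
static long binary_correspondent(const int *v, int n)
{
    long x = 0;
    for (int i = n - 1; i >= 0; i--)
        x = x * 2 + (v[i] != 0);
    return x;
}

/* P(v): number of rotations until v reappears (size of its necklace). */
int period(const int *v, int n, int k)
{
    int cur[MAX_DIGITS];
    int count = 0;
    memcpy(cur, v, (size_t)n * sizeof(int));
    do {
        rotate(cur, n, k);
        count++;
    } while (memcmp(cur, v, (size_t)n * sizeof(int)) != 0);
    return count;
}

/* D(v): rotations needed to reach the generator, i.e. the node with the
 * largest binary correspondent and, among those, the largest value. */
int displacement(const int *v, int n, int k)
{
    int cur[MAX_DIGITS];
    int p = period(v, n, k);
    long best_b = -1, best_x = -1;
    int best_step = 0;
    memcpy(cur, v, (size_t)n * sizeof(int));
    for (int step = 0; step < p; step++) {
        long b = binary_correspondent(cur, n);
        long x = node_value(cur, n, k);
        if (b > best_b || (b == best_b && x > best_x)) {
            best_b = b;
            best_x = x;
            best_step = step;
        }
        rotate(cur, n, k);
    }
    return best_step;
}
```

Under these assumptions, node 11 of the $GH_{5,2}$ has period 8 and displacement 6, because six rotations lead from 11 to 44, the largest node of its necklace.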
DEFINITION 10.9. A shortest path balanced spanning tree, rooted at node $0^n$
(it represents n zeros) of the $GH_{k,n}$ and denoted by $BST_{0^n}$, is defined through the
following parent function. For node v, with $D(v) = i$, let p be the position of its first
nonzero digit cyclically to the left of position $n - 1 - i$. Then the parent of this node
in the $BST_{0^n}$ is
$parent_{BST_{0^n}}(v) = \begin{cases} \emptyset & \text{if } v = 0^n \\ v_{n-1} \ldots v_{p+1}\, 0\, v_{p-1} \ldots v_0 & \text{if } v \neq 0^n \end{cases}$
DEFINITION 10.10. A shortest path spanning subgraph, rooted at node $0^n$ of
the $GH_{k,n}$ and denoted by $BSG_{0^n}$, is defined through the following parent function.
By $parent_{BSG_{0^n}}(v, i)$ we denote the parent of node v in the ith spanning tree of
$BSG_{0^n}$, where $0 \leq i \leq n(k-1)$. For node v with $D(v) = i \bmod P(v)$, $0 \leq i \leq n(k - 1)$,
let $p_i$ be the position of its first nonzero digit cyclically to the left of position $n-1-i$:
$parent_{BSG_{0^n}}(v, i) = \begin{cases} \emptyset & \text{if } v = 0^n \\ v_{n-1} \ldots v_{p_i+1}\, 0\, v_{p_i-1} \ldots v_0 & \text{if } v \neq 0^n \end{cases}$
Figures 10.1 and 10.3 show the $BST_{0^2}$ of the $GH_{5,2}$ and the $GH_{8,2}$, respectively.
The translation operation with respect to node s is applied to all the nodes of the
$BST_{0^n}$ to obtain the $BST_s$ rooted at any node s.
Using a similar method, we can create the $BST_{0^2}$ for the HOW(p, w, 2) based
on the $BST_{0^2}$ for the $GH_{p,2}$, where k = p in the $GH_{k,n}$. It is based on the fact that
HOWs can be obtained from GHs by removing some edges. These steps are:
• Create the $BST_{0^2}$ of the $GH_{p,2}$.
• Break the edges of the $GH_{p,2}$ that are not connected in the HOW(p, w, 2),
using instead the path which consists of all possible edges of window size w.
• If there is a conflict between intermediate nodes and leaf nodes (with the same
parent), then the intermediate nodes stay where they are and the leaf nodes
move to the next level.
Figures 10.2 and 10.4 show the $BST_{0^2}$ of the HOW(5, 3, 2) and the
HOW(8, 3, 2), respectively. Similarly, Figures 10.5 and 10.6 show the $BST_{0^2}$ of
the HOW(8, 4, 2) and the HOW(8, 5, 2), respectively. Shaded nodes in these figures
show the procedure for the GHS,2' According to [4], the one-to-all personalized 
communication consumes time 0 C::((:~:n on the GI-h,n' For the HOVV(p, w, n), 
the modification of this communication procedure results in time 0 (~). This is 
similar to what we also derived with our procedure in Chapter 6. Therefore, we do 




[Figure: nodes 00 to 44 of the $GH_{5,2}$ (k = 5) and the necklaces of the $GH_{5,2}$:
d=2: 11 -> 12 -> 22 -> 23 -> 33 -> 34 -> 44 -> 41 -> 11
d=1: {40, 04, 30, 03, 20, 02, 10, 01}
{42, 14, 31, 43, 24, 32, 13, 21}]
Figure 10.1 The spanning tree $BST_{0^2}$ of the $GH_{5,2}$.
CONCLUSIONS AND FUTURE WORK 
We introduced in this dissertation a new class of scalable architectures capable of very
high performance. We also proposed algorithms for the implementation of various
important communication operations, under frequently used communication models.
We finally compared the performance of this class of architectures with that of the
hypercube for the aforementioned communication operations. Our results show that
not only are our architectures scalable and feasible with current technology, but
they also perform better than the hypercube for several highly demanding communication
operations. Of course, HOW systems perform outstandingly better than the
currently popular torus systems, because of their much better topological properties.
Further work is needed on HOW systems with wrap-around connections, and
on embeddings and communications operations on n-D HOW systems. Also, data
reduction operations should be studied on 2-D and n-D HOW systems.
APPENDIX A
SIMULATION FOR ALL-TO-ALL PERSONALIZED
COMMUNICATION ON 1-D HOWS
In all-to-all personalized communication, also known as total exchange, each
processor sends a distinct message of size m to every other processor. It involves a
lot of message transfers. We will not necessarily derive the most efficient procedure
here, because such a procedure can be of a very complex nature. We present a
simple procedure that comprises two stages. The basic idea here is that the first
stage is initialization, in which every processor exchanges related messages with its
connected neighbors. The second stage is for sending related messages using the
longest channel, when it is available.
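To give a feel for how much work each stage has to do, the following C sketch counts the messages that the first stage can deliver directly. It is an illustration only, assuming that in a 1-D HOW with p processors and window size w, processor i has a direct channel to processors i - w through i + w; the function names are hypothetical and do not appear in the listing below.

```c
#include <stdlib.h>

/* Sketch only: message counts for the two-stage procedure, assuming
 * processor i is directly connected to i +/- d for 1 <= d <= w.
 * Function names are illustrative assumptions. */

/* Messages (src, dest), src != dest, whose destination is a direct
 * neighbor of the source and can therefore be exchanged in stage one. */
int stage1_delivered(int p, int w)
{
    int delivered = 0;
    for (int src = 0; src < p; src++)
        for (int dest = 0; dest < p; dest++)
            if (src != dest && abs(src - dest) <= w)
                delivered++;
    return delivered;
}

/* Messages still in transit when the second stage begins. */
int stage2_pending(int p, int w)
{
    return p * (p - 1) - stage1_delivered(p, w);
}
```

For p = 10 and w = 3 this gives 48 messages delivered in the first stage and 42 left for the second stage, which agrees with the Step 1 state printed at the end of this appendix (for example, processor 0 still holds its six messages for processors 4 to 9).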
















static node *all_nodes_1;
static node *all_nodes_2;
/* node number */
static node *current_state, *next_state;
static int step;
static msg *new_msg(int src, int dest);
static void init_node(node *p, int n);
static void sort_node(node *p);
static void sort_all_node(void);
static void copy_all_node();
static void add_msg(node *pNode, msg *pM);
static msg *get_msg(node *pNode, int i);
static void del_msg(node *pNode, msg *pM);
static void init_all(void);
static void print_all(void);
static void exchange_direct_node(node *p1, node *p2);
static int get_rightmost_msg(node *p,
                             msg *msg_vector[window_size]);
static int get_leftmost_msg(node *p,
                            msg *msg_vector[window_size]);
static 
static int *node_used; 







    if (argc >= 2)
        num_of_nodes = atoi(argv[1]);
    if (argc >= 3)
        window_size = atoi(argv[2]);
    step = 0;
    print_all();
    /* first step, exchange all nodes within window_size */
    for (i = 0; i < num_of_nodes; i++) {
        pNode1 = current_state + i;




            if (i + w < num_of_nodes) {
} 





    msg_array = (msg **)malloc(sizeof(msg *) * window_size);
    node_used = (int *)malloc(sizeof(int) * window_size);
    while (1) {
        int dest;
        done = 1;
        copy_all_node(current_state, next_state);
        /* send msg to right */
        for (i = 0; i < num_of_nodes; i++) {
            if (get_rightmost_msg(current_state + i, msg_array)) {
                memset(node_used, 0, sizeof(int) * window_size);
                /* first try destination already within window */
                for (w = 0; w < window_size; w++) {
                    if (!msg_array[w])
                        continue;
                    if (msg_array[w]->dest <= i + window_size) {
                        /* already within window size */
                        if (!node_used[msg_array[w]->dest - i - 1]) {
                            del_msg(next_state + i, msg_array[w]);
                            add_msg(next_state + msg_array[w]->dest,
                                    msg_array[w]);
                            node_used[msg_array[w]->dest - i - 1] = 1;
                        } else {
                            int ww = w;
                            while (ww < window_size) {




                                dest = i + window_size - ww;












                                if (dest < num_of_nodes) {
                                    del_msg(next_state + i,
                                            msg_array[w]);
                                    add_msg(next_state + dest,
                                            msg_array[w]);
                                    node_used[window_size - ww - 1] = 1;
                                }
                                break;
                /*
                 * then try the algorithm: longest destination using
                 * longest w
                 */





                    if (msg_array[w]->dest > i + window_size) {
                        int ww = w;
                        while (ww < window_size) {

                            dest = i + window_size - ww;
                            if (dest < num_of_nodes) {
                                del_msg(next_state + i, msg_array[w]);
                                add_msg(next_state + dest, msg_array[w]);
                                node_used[window_size - ww - 1] = 1;
                            }
                            break;
                        }
                    }
done = 0; 
        /* send msg to left */
        for (i = num_of_nodes - 1; i >= 0; i--) {
            if (get_leftmost_msg(current_state + i, msg_array)) {
                memset(node_used, 0, sizeof(int) * window_size);




                    if (msg_array[w]->dest >= i - window_size) {
                        /* already within window size */
                        if (!node_used[i - msg_array[w]->dest - 1]) {
                            del_msg(next_state + i, msg_array[w]);
                            add_msg(next_state + msg_array[w]->dest,
                                    msg_array[w]);
                            node_used[i - msg_array[w]->dest - 1] = 1;
                        } else {
                            int ww = w;
                            while (ww < window_size) {

                                dest = i - (window_size - ww);

                                if (dest >= 0) {
                                    del_msg(next_state + i, msg_array[w]);
                                    add_msg(next_state + dest, msg_array[w]);
                                    node_used[window_size - ww - 1] = 1;
                                }
                                break;
                            }
                        }
                    }
                for (w = 0; w < window_size; w++) {
                    if (!msg_array[w])
                        break;
                    if (msg_array[w]->dest < i - window_size) {
                        int ww = w;
                        while (ww < window_size) {










                            dest = i - (window_size - ww);
                            if (dest >= 0) {
                                del_msg(next_state + i, msg_array[w]);
                                add_msg(next_state + dest, msg_array[w]);
                                node_used[window_size - ww - 1] = 1;
                            }
                            break;





        /* swap the two state buffers for the next step */
        pNode1 = next_state;
        next_state = current_state;
        current_state = pNode1;
        sort_all_node();
        step++;
        print_all();
static void 




    /* send msg from p1 to p2 */
    for (i = 0; i < p1->index; i++) {
        if (p1->table[i]->dest == p2->number) {
            pM = get_msg(p1, i);
            add_msg(p2, pM); /* send to p2 */
        }
    }
    /* send msg from p2 to p1 */
    for (i = 0; i < p2->index; i++) {
        if (p2->table[i]->dest == p1->number) {
            pM = get_msg(p2, i);




static int cmp_msg(const void *p1, const void *p2)
{
    msg **m1 = (msg **)p1;
    msg **m2 = (msg **)p2;

}












for (i =0 ; i < num_of_nodes ; i++) 
sort_node(current_state+i); 
static void 





    /* copy every node's header fields and its message table from p1 to p2 */
    for (i = 0; i < num_of_nodes; i++) {
        p2[i].number = p1[i].number;
        p2[i].tbl_size = p1[i].tbl_size;
        p2[i].index = p1[i].index;
        for (j = 0; j < p1[i].index; j++)
            p2[i].table[j] = p1[i].table[j];
    }
static int 




    int my_num = pNode->number;
    int ret;
    int i;
    int j = 0;
    /* note! the messages in node->table are sorted */
    for (i = pNode->index - 1; i >= 0; i--) {
        msg *pMsg = pNode->table[i];
        int distance = pMsg->dest - my_num;
        if (distance > 0) { /* this msg should be sent to the right */
            msg_array[j++] = pMsg;
            if (j >= window_size)
                break;
        }
    }
    ret = j;
    /* pad the unused slots so callers can test for NULL */
    while (j < window_size)
        msg_array[j++] = NULL;
    /* remove msg from node */






get_leftmost_msg(node *pNode, msg *msg_array[window_size]) 
{ 
int my_num = pNode -> number; 
int ret; 
int i; 
int j = 0; 
/* note! the messages in node->table are sorted */
for (i = 0; i < pNode -> index - 1; i++) { 
msg *pMsg = pNode->table[i]; 
int distance = my_num - pMsg->dest; 




ret = j; 
msg_array[j++] = pMsg; 
if (j >= window_size) 
break; 
while (j < window_size) 
msg_array[j++] = NULL; 
/* remove msg from node */ 








static msg *
new_msg(int src, int dest)
{
    /* allocate a message travelling from src to dest */
    msg *ret = malloc(sizeof(msg));
    ret->src = src;
    ret->dest = dest;
    return ret;
}
static void
add_msg(node *pNode, msg *pMsg)
{
    /* append the message at the end of the node's table */
    pNode->table[pNode->index] = pMsg;
    pNode->index++;
}
static msg * 
get_msg(node *pNode, int i) 
{ 
msg *ret; 
if (i >= pNode -> index) 
return 0; 




pNode -> table[i] = pNode -> table[pNode->index - 1];
pNode -> index--; 
return ret; 
static void 




    /* remove pMsg by overwriting it with the last table entry */
    for (i = 0; i < pNode->index; i++) {
        if (pMsg == pNode->table[i]) {
            pNode->table[i] = pNode->table[pNode->index - 1];
            pNode->index--;
            return;
        }
    }
static void 





pNode -> number = num; 
pNode -> tbl_size = num_of_nodes*num_of_nodes; 
pNode -> table = (msg **)malloc( 
sizeof(msg*)*pNode->tbl_size); 
pNode -> index = 0; 
for (i =0 ; i< num_of_nodes; i++) { 













current_state = all_nodes_1; 
next_state = all_nodes_2; 
for (i = 0; i < num_of_nodes; i++) {
init_node(current_state + i, i);









    j = 0;
    printf("Step %d\n", step);
    while (1) {
        msg *pM;
        printed = 0;
        /* print row j: the j-th message held by each node */
        for (i = 0; i < num_of_nodes; i++) {
            if (j < current_state[i].index) {
                pM = current_state[i].table[j];
                printf("%2d,%-2d ", pM->src, pM->dest);
            }
        }





if (! printed) 
return; 
The running results for HOW(10, 3, 1) and HOW(11, 4, 1) are:
For HOW(10,3,1):
Step 0
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0 8,0 9,0 
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1 8,1 9,1 
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2 8,2 9,2 
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3 8,3 9,3 
0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4 8,4 9,4 
0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5 8,5 9,5 
0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6 8,6 9,6 
0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7 8,7 9,7 
0,8 1,8 2,8 3,8 4,8 5,8 6,8 7,8 8,8 9,8 
0,9 1,9 2,9 3,9 4,9 5,9 6,9 7,9 8,9 9,9 
Step 1 
0,0 0,1 0,2 0,3 4,0 5,0 6,0 7,0 8,0 9,0 
3,0 1,1 1,2 1,3 1,4 5,1 6,1 7,1 8,1 9,1 
1,0 4,1 2,2 2,3 2,4 2,5 6,2 7,2 8,2 9,2 
2,0 2,1 4,2 3,3 3,4 3,5 3,6 7,3 8,3 9,3 
0,4 3,1 3,2 4,3 4,4 4,5 4,6 4,7 8,4 9,4 
0,5 1,5 5,2 6,3 7,4 5,5 5,6 5,7 5,8 9,5 
0,6 1,6 2,6 5,3 5,4 7,5 6,6 6,7 6,8 6,9 
0,7 1,7 2,7 3,7 6,4 6,5 8,6 7,7 7,8 7,9 
0,8 1,8 2,8 3,8 4,8 8,5 7,6 8,7 8,8 8,9 
0,9 1,9 2,9 3,9 4,9 5,9 9,6 9,7 9,8 9,9 
Step 2 
0,0 4,0 5,0 6,0 7,0 8,0 9,0 9,1 9,2 9,3 
3,0 1,1 1,2 5,1 6,1 7,1 8,1 8,2 8,3 9,4 
1,0 3,1 4,2 1,3 2,4 6,2 7,2 7,3 8,4 9,5 
2,0 2,1 2,2 2,3 3,4 3,5 3,6 5,7 9,8 8,9 
0,4 4,1 3,2 3,3 4,4 4,5 4,6 4,7 6,8 7,9 
0,5 0,1 5,2 6,3 7,4 5,5 5,6 7,7 5,8 9,9 
0,6 1,5 0,2 4,3 1,4 2,5 6,6 6,7 7,8 6,9 
1,6 2,6 5,3 6,4 6,5 8,6 8,7 8,8 
0,7 1,7 0,3 5,4 8,5 7,6 9,7 5,9 
0,8 2,7 3,7 7,5 9,6 4,9 
1,8 2,8 3,8 4,8 
0,9 1,9 2,9 3,9 
Step 3 
0,0 5,1 7,0 8,0 8,1 8,2 7,3 8,4 9,5 5,9 
3,0 1,1 1,2 9,0 7,1 9,2 8,3 9,4 3,8 4,9 
1,0 3,1 4,2 1,3 9,1 7,2 9,3 3,7 4,8 3,9 
2,0 2,1 2,2 2,3 6,4 3,5 3,6 5,7 9,8 8,9 
6,0 4,1 3,2 3,3 3,4 4,5 4,6 4,7 6,8 7,9 
5,0 0,1 5,2 6,3 4,4 5,5 5,6 7,7 5,8 9,9 
4,0 6,1 0,2 4,3 1,4 2,5 6,6 6,7 7,8 6,9 
0,4 6,2 5,3 7,4 6,5 8,6 8,7 8,8 
1,5 0,3 5,4 8,5 7,6 9,7 
0,5 0,6 2,4 7,5 9,6 2,9 
1,6 0,7 2,8 0,9 
2,6 1,7 1,8 1,9 
2,7 0,8 
Step 4 
0,0 9,0 7,1 9,1 7,2 9,3 2,6 2,7 6,8 5,9 
3,0 1,1 1,2 9,2 8,3 9,4 5,6 4,7 3,8 4,9 
1,0 3,1 4,2 1,3 8,4 9,5 6,6 3,7 4,8 3,9 
2,0 2,1 2,2 2,3 6,4 3,5 3,6 5,7 9,8 8,9 
6,0 4,1 3,2 3,3 3,4 4,5 4,6 7,7 5,8 7,9 
5,0 0,1 5,2 6,3 4,4 5,5 8,6 8,7 8,8 9,9 
4,0 6,1 0,2 4,3 1,4 2,5 7,6 6,7 7,8 6,9 
8,0 8,1 6,2 5,3 7,4 6,5 9,6 9,7 0,8 1,9 
7,0 5,1 8,2 0,3 5,4 8,5 1,7 1,8 0,9 2,9 





0,0 7,1 9,2 8,3 9,4 1,5 2,6 2,7 6,8 5,9 
3,0 1,1 1,2 9,3 0,4 0,5 5,6 4,7 3,8 4,9 
1,0 3,1 4,2 1,3 8,4 9,5 6,6 3,7 4,8 3,9 
2,0 2,1 2,2 2,3 6,4 3,5 3,6 5,7 9,8 8,9 
6,0 4,1 3,2 3,3 3,4 4,5 4,6 7,7 5,8 7,9 
5,0 0,1 5,2 6,3 4,4 5,5 8,6 8,7 8,8 9,9 
4,0 6,1 0,2 4,3 1,4 2,5 7,6 6,7 7,8 6,9 
8,0 8,1 6,2 5,3 7,4 6,5 9,6 9,7 0,8 1,9 
7,0 5,1 8,2 0,3 5,4 8,5 0,6 1,7 1,8 2,9 
9,0 9,1 7,2 7,3 2,4 7,5 1,6 0,7 2,8 0,9 
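In the Step 0 table above, column i lists the pairs (i, j) for every j, so each node appears to start with one message for every node. A minimal sketch of building such an initial state is given below. The function name `make_initial_messages` and the flat array layout are assumptions for illustration and are not taken from the original program.

```c
#include <assert.h>
#include <stdlib.h>

/* One message of the exchange: source and destination node numbers. */
typedef struct { int src, dest; } msg;

/* Allocate the n*n messages of the initial state: entry i*n + j is the
   message that node i holds for node j. The caller frees the result. */
static msg *make_initial_messages(int n)
{
    int i, j;
    msg *all;
    assert(n > 0);
    all = malloc(sizeof(msg) * (size_t)n * (size_t)n);
    if (!all)
        return NULL;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            all[i * n + j].src = i;
            all[i * n + j].dest = j;
        }
    }
    return all;
}
```

With n = 10, entries 0 through 9 of the array correspond to the first column of the Step 0 table.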
For HOW(11,4,1):
Step 0
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0 8,0 9,0 10,0 
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1 8,1 9,1 10,1 
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2 8,2 9,2 10,2 
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3 8,3 9,3 10,3 
0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4 8,4 9,4 10,4 
0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5 8,5 9,5 10,5 
0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6 8,6 9,6 10,6
0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7 8,7 9,7 10,7
0,8 1,8 2,8 3,8 4,8 5,8 6,8 7,8 8,8 9,8 10,8
0,9 1,9 2,9 3,9 4,9 5,9 6,9 7,9 8,9 9,9 10,9
0,10 1,10 2,10 3,10 4,10 5,10 6,10 7,10 8,10 9,10 10,10
Step 1 
0,0 0,1 0,2 0,3 0,4 5,0 6,0 7,0 8,0 9,0 10,0
1,0 1,1 1,2 1,3 1,4 1,5 6,1 7,1 8,1 9,1 10,1
2,0 2,1 2,2 2,3 2,4 2,5 2,6 7,2 8,2 9,2 10,2
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7 8,3 9,3 10,3
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7 4,8 9,4 10,4
0,5 5,1 5,2 5,3 5,4 5,5 5,6 5,7 5,8 5,9 10,5
0,6 1,6 6,2 6,3 6,4 6,5 6,6 6,7 6,8 6,9 6,10
0,7 1,7 2,7 7,3 7,4 7,5 9,6 7,7 7,8 7,9 7,10 
0,8 1,8 2,8 3,8 8,4 8,5 7,6 9,7 8,8 8,9 8,10 
0,9 1,9 2,9 3,9 4,9 9,5 8,6 8,7 9,8 9,9 9,10 
0,10 1,10 2,10 3,10 4,10 5,10 10,6 10,7 10,8 10,9 10,10 
Step 2 
0,0 5,0 6,0 7,0 8,0 9,0 10,0 10,1 10,2 10,3 10,4 
1,0 0,1 0,2 6,1 7,1 8,1 9,1 9,2 9,3 9,4 10,5 
2,0 1,1 1,2 0,3 0,4 7,2 8,2 8,3 10,8 10,9 10,10 
3,0 2,1 2,2 1,3 1,4 1,5 2,6 3,7 4,8 9,9 9,10 
4,0 3,1 3,2 2,3 2,4 2,5 3,6 4,7 5,8 5,9 8,10 
0,5 4,1 4,2 3,3 3,4 3,5 4,6 5,7 6,8 6,9 7,10 
0,6 5,1 5,2 4,3 4,4 4,5 5,6 6,7 7,8 7,9 6,10 
1,6 6,2 5,3 5,4 5,5 6,6 7,7 8,8 8,9 
0,7 1,7 6,3 6,4 6,5 9,6 9,7 9,8 5,10 
0,8 7,3 7,4 7,5 7,6 8,7 4,10 
2,7 8,4 8,5 8,6 10,7 
1,8 2,8 9,5 10,6 4,9 
0,9 1,9 3,8 3,9 3,10 
0,10 2,9 2,10 
1,10 
Step 3 
0,0 6,1 7,2 9,0 9,1 9,2 10,4 10,5 2,8 3,9 5,10 
1,0 0,1 0,2 10,0 10,1 10,3 9,4 2,7 3,8 2,9 4,10 
2,0 1,1 1,2 0,3 10,2 9,3 2,6 3,7 10,8 10,9 10,10 
3,0 2,1 2,2 1,3 0,4 1,5 3,6 4,7 4,8 9,9 9,10 
4,0 3,1 3,2 2,3 1,4 2,5 4,6 5,7 5,8 5,9 8,10 
8,0 4,1 4,2 3,3 2,4 3,5 5,6 6,7 6,8 6,9 7,10 
7,0 5,1 5,2 4,3 3,4 4,5 6,6 7,7 7,8 7,9 6,10 
6,0 8,1 6,2 5,3 4,4 5,5 9,6 9,7 8,8 8,9 2,10 
5,0 7,1 8,2 6,3 5,4 6,5 7,6 8,7 9,8 4,9 3,10 
7,3 6,4 7,5 8,6 10,7 
8,3 7,4 8,5 10,6 1,10 
0,5 8,4 9,5 0,8 0,10 
1,6 1,7 0,9 
0,6 0,7 1,9 
1,8 
Step 4 
0,0 10,0 10,1 10,2 9,3 9,4 0,6 2,7 2,8 3,9 5,10 
1,0 6,1 7,2 10,3 10,4 10,5 2,6 3,7 3,8 2,9 4,10 
2,0 0,1 0,2 0,3 8,4 0,5 3,6 4,7 10,8 10,9 10,10 
3,0 1,1 1,2 1,3 0,4 1,5 4,6 5,7 4,8 9,9 9,10 
4,0 2,1 2,2 2,3 1,4 2,5 5,6 6,7 5,8 5,9 8,10 
8,0 3,1 3,2 3,3 2,4 3,5 6,6 7,7 6,8 6,9 7,10 
7,0 4,1 4,2 4,3 3,4 4,5 9,6 9,7 7,8 7,9 6,10 
6,0 5,1 5,2 5,3 4,4 5,5 7,6 8,7 8,8 8,9 2,10 
5,0 8,1 6,2 6,3 5,4 6,5 8,6 10,7 9,8 4,9 3,10 
9,0 7,1 8,2 7,3 6,4 7,5 10,6 0,7 1,8 1,9 0,10 




0,0 10,1 10,2 9,3 9,4 9,5 0,6 2,7 2,8 3,9 5,10 
1,0 6,1 7,2 10,3 10,4 10,5 2,6 3,7 3,8 2,9 4,10 
2,0 0,1 0,2 0,3 8,4 0,5 3,6 4,7 10,8 10,9 10,10 
3,0 1,1 1,2 1,3 0,4 1,5 4,6 5,7 4,8 9,9 9,10 
4,0 2,1 2,2 2,3 1,4 2,5 5,6 6,7 5,8 5,9 8,10 
8,0 3,1 3,2 3,3 2,4 3,5 6,6 7,7 6,8 6,9 7,10 
7,0 4,1 4,2 4,3 3,4 4,5 9,6 9,7 7,8 7,9 6,10 
6,0 5,1 5,2 5,3 4,4 5,5 7,6 8,7 8,8 8,9 2,10 
5,0 8,1 6,2 6,3 5,4 6,5 8,6 10,7 9,8 4,9 3,10 
9,0 7,1 8,2 7,3 6,4 7,5 10,6 0,7 1,8 1,9 0,10 
10,0 9,1 9,2 8,3 7,4 8,5 1,6 1,7 0,8 0,9 1,10 
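In the last table printed for each network above, every pair in column i has i as its second number, and each source appears once per column. A minimal sketch of a completion check based on that observation follows. The function name `exchange_done_at` and its array-based interface are hypothetical and are not part of the original program.

```c
#include <assert.h>
#include <stdlib.h>

/* One message of the exchange: source and destination node numbers. */
typedef struct { int src, dest; } msg;

/* Return 1 if node `self` holds exactly n messages, all addressed to it,
   with exactly one message from each source 0..n-1; otherwise return 0. */
static int exchange_done_at(const msg *held, int count, int self, int n)
{
    int k, ok = 1;
    unsigned char *seen;
    if (count != n)
        return 0;
    seen = calloc((size_t)n, 1);
    if (!seen)
        return 0;
    for (k = 0; k < count && ok; k++) {
        int s = held[k].src;
        if (held[k].dest != self || s < 0 || s >= n || seen[s])
            ok = 0;        /* wrong destination, bad source, or duplicate */
        else
            seen[s] = 1;
    }
    free(seen);
    return ok;
}
```

Applied to the first column of the final HOW(10,3,1) table (sources 0, 3, 1, 2, 6, 5, 4, 8, 7, 9, all with destination 0), the check succeeds for `self = 0` and `n = 10`.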
REFERENCES 
1. S.G. Ziavras, "RH: A Versatile Family of Reduced Hypercube Interconnection
Networks," IEEE Transactions on Parallel and Distributed Systems, Vol.
5, No. 11, Nov. 1994, pp. 1210-1220.
2. C. Qiao and R. Melhem, "Reducing Communication Latency with Path
Multiplexing in Optically Interconnected Multiprocessor Systems," IEEE
Transactions on Parallel and Distributed Systems, Vol. 8, No. 2, Feb.
1997, pp. 97-108.
3. J.K. Antonio, L. Lin, and R.C. Metzger, "Complexity of Intensive
Communications on Balanced Generalized Hypercubes," International Parallel
Processing Symposium, 1993, pp. 387-394.
4. P. Fragopoulou, S.G. Akl, and H. Meijer, "Optimal Communication Primitives
on the Generalized Hypercube Network," Journal of Parallel and
Distributed Computing, Vol. 32, 1996, pp. 173-187.
5. V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel
Computing: Design and Analysis of Algorithms, Benjamin/Cummings,
California, 1994.
6. W. Dally, "Network and Processor Architecture for Message-Driven
Computers," in: VLSI and Parallel Computation, R. Suaya and G.
Birtwistle (Eds.), Morgan Kaufmann, California, 1990, pp. 140-222.
7. W.J. Dally and C.L. Seitz, "The Torus Routing Chip," Journal of Distributed
Computing, Vol. 1, No. 3, 1986, pp. 187-196.
8. M.C. Pease, III, "The Indirect Binary n-Cube Microprocessor Array," IEEE
Transactions on Computers, C-26(5), 1977, pp. 458-473.
9. C.L. Seitz, "Concurrent VLSI Architectures," IEEE Transactions on
Computers, C-33(12), 1984, pp. 1247-1265.
10. T. Szymanski, ""Hypermeshes": Optical Interconnection Networks for Parallel
Computing," Journal of Parallel and Distributed Computing, Vol. 26,
1995, pp. 1-23.
11. L.D. Wittie, "Communication Structures for Large Networks of
Multicomputers," IEEE Transactions on Computers, C-30(4), 1981.
12. S.G. Ziavras, "Generalized Reduced Hypercube Interconnection Networks for
Massively Parallel Computers," in: Networks for Parallel Computations,
D.F. Hsu, A. Rosenberg, and D. Sotteau (Eds.), American Mathematical
Society, Rhode Island, 1995, pp. 307-325.
13. S.G. Ziavras and A. Mukherjee, "Data Broadcasting and Reduction, Prefix 
Computation, and Sorting on Reduced Hypercube Parallel Computers," 
Parallel Computing 22, 1996, pp. 595-606. 
14. S.G. Ziavras, "On the Problem of Expanding Hypercube-Based Systems,"
Journal of Parallel and Distributed Computing, 16(1), 1992, pp. 41-53.
15. S.G. Ziavras, "Scalable Multifolded Hypercubes for Versatile Parallel
Computers," Parallel Processing Letters, 5(2), 1995, pp. 241-250.
16. L.N. Bhuyan and D.P. Agrawal, "Generalized Hypercube and Hyperbus
Structures for a Computer Network," IEEE Transactions on Computers,
33(4), 1984, pp. 323-333.
17. S.G. Ziavras, "Investigation of Various Mesh Architectures with Broadcast
Buses for High-Performance Computing," VLSI Design; Special Issue
High Performance Bus-Based Architectures, pp. 29-53, 1999.
18. S.G. Ziavras, H. Grebel, and A.T. Chronopoulos, "A Low-Complexity Parallel
System for Gracious, Scalable Performance. Case Study for Near
PetaFLOPS Computing," 6th Symposium on Frontiers Massively Parallel
Computing, Special Session New Millennium Computing Point Designs,
1996, pp. 363-370.
19. S.G. Ziavras, H. Grebel, and A.T. Chronopoulos, "A Scalable/Feasible Parallel
Computer Implementing Electronic and Optical Interconnections for 156
TeraOPS Minimum Performance," PetaFLOPS Architecture Workshop,
1996, pp. 179-209.
20. P.T. Gaughan and S. Yalamanchili, "Adaptive Routing Protocols for Hypercube
Interconnection Networks," IEEE Computer, May 1993, pp. 12-23.
21. P.W. Dowd, "High Performance Interprocessor Communication Through
Optical Wavelength Division Multiple Access Channels," Proceedings of
International Symposium on Computer Architecture, 1991, pp. 96-105.
22. A. Abraham, K. Padmanabhan, "Performance of Multicomputer Networks
under Pin-out Constraints," Journal of Parallel and Distributed
Computing, Vol. 12, 1991, pp. 237-248.
23. A. Agarwal, "Limits on Interconnection Network Performance," IEEE
Transactions on Parallel and Distributed Systems, Vol. 2, 1991, pp. 398-412.
24. C.D. Thompson, "Area-Time Complexity for VLSI," Proceedings of 11th Annual 
ACM Symposium on Theory of Computing, May 1979, pp. 81-88. 
25. W.J. Dally, "Wire-Efficient VLSI Multiprocessor Communication Networks,"
Proceedings of 1987 Stanford Conference on Advanced Research in VLSI,
MIT Press, Cambridge, MA, 1987, pp. 391-415.
26. S.G. Ziavras and S. Krishnamurthy, "Evaluating the Communications
Capabilities of the Generalized Hypercube Interconnection Network,"
Concurrency: Practice and Experience, (accepted for publication).
27. J.D. Ullman, Computational Aspects of VLSI, Computer Science Press,
Maryland, 1984.
28. P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design,
Prentice-Hall, New Jersey, 1994.
29. Q. Wang and S.G. Ziavras, "Powerful and Feasible Processor Interconnections
with an Evaluation of Their Communications Capabilities," International
Symposium on Parallel Architectures, Algorithms, and Networks,
Fremantle, Australia, June 23-25 1999.
30. Q. Wang and S.G. Ziavras, "Network Embedding Techniques for a New Class of
Feasible Parallel Architectures Capable of Very High Performance,"
International Conference on Applied Informatics, Innsbruck, Austria,
Feb. 23-25 1999.
31. S.G. Ziavras and Q. Wang, "Robust Interprocessor Connections for Very-High
Performance," in: Robust Communication Networks: Interconnection and
Survivability, N. Dean, F. Hsu and R. Ravi (Eds.), American
Mathematical Society, Rhode Island, 1999.
