Design and Analysis of Optical Interconnection Networks for Parallel Computation. by Li, Yueming
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1997
Design and Analysis of Optical Interconnection
Networks for Parallel Computation.
Yueming Li
Louisiana State University and Agricultural & Mechanical College
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in
LSU Historical Dissertations and Theses by an authorized administrator of LSU Digital Commons. For more information, please contact
gradetd@lsu.edu.
Recommended Citation
Li, Yueming, "Design and Analysis of Optical Interconnection Networks for Parallel Computation." (1997). LSU Historical
Dissertations and Theses. 6578.
https://digitalcommons.lsu.edu/gradschool_disstheses/6578
INFORMATION TO USERS
This manuscript has been reproduced from the microfilm master. UMI 
films the text directly from the original or copy submitted. Thus, some 
thesis and dissertation copies are in typewriter face, while others may be 
from any type of computer printer.
The quality of this reproduction is dependent upon the quality of the 
copy submitted. Broken or indistinct print, colored or poor quality 
illustrations and photographs, print bleedthrough, substandard margins, 
and improper alignment can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete 
manuscript and there are missing pages, these will be noted. Also, if 
unauthorized copyright material had to be removed, a note will indicate 
the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by 
sectioning the original, beginning at the upper left-hand corner and 
continuing from left to right in equal sections with small overlaps. Each 
original is also photographed in one exposure and is included in reduced 
form at the back of the book.
Photographs included in the original manuscript have been reproduced 
xerographically in this copy. Higher quality 6” x 9” black and white 
photographic prints are available for any photographs or illustrations 
appearing in this copy for an additional charge. Contact UMI directly to 
order.
UMI
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor MI 48106-1346 USA 
313/761-4700 800/521-0600
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
DESIGN AND ANALYSIS OF 
OPTICAL INTERCONNECTION NETWORKS 
FOR PARALLEL COMPUTATION
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and 
Agricultural and Mechanical College 
in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy
in
The Department of Computer Science
By
Yueming Li
B.S., University of Science & Technology Beijing, 1982 
M.S., University of Science & Technology Beijing, 1984 
M.S., South Dakota State University, 1994 
December 1997
UMI Number: 9820732
UMI Microform 9820732 
Copyright 1998, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized 
copying under Title 17, United States Code.
UMI
300 North Zeeb Road 
Ann Arbor, MI 48103
Acknowledgments
First and foremost, I am especially grateful to my major professor, Dr. S. Q. Zheng. 
This work would have been impossible without his guidance, patience, encouragement and 
assistance.
I would like to thank the members of my doctoral committee, Dr. Jerry Trahan, Dr. 
Doris L. Carver, Dr. John Drilling, Dr. S. Sitharama Iyengar, and Dr. Xianhe Sun, for 
their comments, suggestions and instructions, which helped improve the quality of this work.
I would like to thank the other staff and faculty in the Department of Computer Science 
for their help and assistance during my studies at LSU.
I would like to thank my friends: Guodong Qin, for his great help and numerous 
errands while I was away from campus, and Xin Liangyang and his wife, for the lodging 
and delicious food when I defended this work. Thanks to all my other friends for their 
friendship.
To my parents, Tianhuan Li and Zhongdi Li, thanks for your persistent encouragement 
and support.
To my parents-in-law, Yiqian Liu and Shaping Chen, thanks for your enormous help 
and for taking on the responsibility of caring for my son.
To my wife, Xiaojiang Liu, thanks for your love, understanding and never-ending push. 
Thanks also for taking on all the duties that would have been mine had I not been so busy 
with this work.
To my son, Danxiang Li, thanks for understanding why your Dad was not with 
you for years.
Contents
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
2 Optical Buses
2.1 Non-pipelined Optical bus
2.2 Pipelined Optical Bus
2.2.1 FA-TDM Method
2.2.2 DA-TDM Method
3 Pipelined Optical Bus with Conditional Delays
3.1 A New Optical Bus Architecture
3.2 Algorithm Design Examples
3.3 Summary and Discussions
4 Asynchronous Optical TDM Buses
4.1 ATDM with Linear Priority
4.2 ATDM with Round-Robin Priority
4.3 Simulation
4.4 Summary and Discussions
5 Processor Arrays Connected by Segmented Buses
5.1 Segmented Buses
5.2 Versatility of Parallel Architectures Based on Segmented Buses
5.2.1 Simulation of Linear Array
5.2.2 Simulation of Binary Tree
5.2.3 Simulation of X-tree
5.2.4 Simulation of One-dimensional Multigrid
5.2.5 Simulation of Mesh-of-tree
5.2.6 Simulation of Pyramid
5.2.7 Simulation of High-dimensional Multigrid
5.3 Parallel Prefix Computation
5.3.1 Prefix on 1-D array
5.4 k-D mesh with segmented buses
5.5 Summary and Discussions
6 Hypernetworks
6.1 Background
6.2 Hypernetwork Design Issues
6.3 Dual Hypernetworks and Q* Hypernetworks
6.3.1 Dual Graph
6.3.2 The Hypernetwork Q*
6.3.3 The Properties of Q*
6.4 Data Communication Algorithms for Q*
6.4.1 One-to-One Communication
6.4.2 One-to-Many Communication
6.4.3 Many-to-One Communication
6.4.4 Many-to-Many Communication
6.5 Summary and Discussions
7 Conclusions
Bibliography
Vita
List of Tables
6.1 Comparison between GMSH(Z,d) and point-to-point hypercubes
List of Figures
2.1 Reflection and Refraction of Light.
2.2 Transmission of light in a fiber.
2.3 A pipelined optical bus system.
2.4 A train of packet slots.
2.5 (a) An address frame. (b) A packet slot.
3.1 An optical bus with conditional delays.
3.2 An address frame of the bus with conditional delays, assuming that all
switches are in the cross state.
3.3 Address frames for broadcasting. (a) All switches are in the cross state. (b)
All switches are in the straight state.
3.4 Address frames for multicasting, assuming that all switches are in the cross 
state. Three source processors send messages to three subsets (a), (b) and
(c), of destination processors.
4.1 The configuration of a DA-TDM bus.
4.2 A packet with a flag.
4.3 The implementation of the flag.
4.4 The ring structures constructed from a DA-TDM bus. (a) The ring
corresponding to the select waveguide. (b) The ring corresponding to the 
reference and message waveguides.
4.5 DA-TDM bus with hardware round-robin priority scheme. (a) The reference
and message waveguides. (b) The select waveguide.
4.6 Switch implementation.
4.7 DA-TDM bus configurations: (a) B_0 at time t_0, (b) B_1 at t_1 and (c) B_2 at t_2.
4.8 DA-TDM bus with double-input processors. Configurations: (a) B_0, (b)
B_1 and (c) B_2.
4.9 Relation between the AMRT and the message size for a DA-TDM bus with
linear priority.
4.10 Comparison of the AMRTs of a FA-TDM bus and DA-TDM bus with linear 
priority.
4.11 Relation between the AMRT and the message size of an FA-TDM bus.
4.12 The unfairness of the DA-TDM with linear priority scheme.
4.13 The unfairness of the DA-TDM bus with the round-robin priority scheme.
4.14 The AMRT of the DA-TDM bus with the round-robin priority scheme.
4.15 Implementing a switch using a DA-TDM bus. (a) An 8 x 8 switch. (b) A 
reconfigurable DA-TDM bus with 8 pairs of I/O devices.
4.16 (a) A two-dimensional processor array. (b) A physical arrangement of the 
array.
5.1 A 4x5 mesh with multiple broadcasting.
5.2 The reconfigurable mesh architecture.
5.3 A 4x4 mesh with hyperbuses.
5.4 1-spacing, 2-spacing and 4-spacing segmented buses of size 16.
5.5 Folded bus configuration.
5.6 2-spacing segmented folded bus.
5.7 A 2-D MCSB Mg(8,2).
5.8 Simulation of a complete binary tree by a 2-spacing segmented bus.
5.9 Recursive constructions of complete binary tree and 2-spacing segmented
bus.
5.10 Simulation of an X-tree by a 2-spacing segmented bus.
5.11 Simulation of a 1-D multigrid by a 2-spacing segmented bus. (a) 1-D
multigrid of 31 processors. (b) Processor mapping to S(16).
5.12 A 4 x 4 mesh-of-trees.
5.13 A 4 x 4 multigrid.
5.14 A 4 x 4 pyramid.
5.15 Parallel prefix computation using recursive doubling.
5.16 A 3-D mesh with segmented buses.
5.17 The bus notations for the 3-D mesh shown in Figure 5.16.
5.18 The collecting phase of 3-D prefix computation.
5.19 The broadcasting phase of 3-D prefix computation.
5.20 A k-D mesh is imagined to consist of n (k-1)-dimensional meshes.
5.21 A (k+1)-dimensional mesh is imagined to consist of n k-D meshes.
6.1 Bus implementation of GMSH(3,2).
6.2 Bus implementation of
6.3 Hypercube Q_3 corresponding to
6.4 Bus implementation of Q*.
6.5 Data communication pattern for broadcasting from (0,1) in Q4.
6.6 Data communication pattern for reduction in Q*.
Abstract
In this doctoral research, we propose several novel protocols and topologies for the 
interconnection of massively parallel processors. These new technologies achieve 
considerable improvements in system performance and structural simplicity.
Currently, synchronous protocols are used in optical TDM buses. The major 
disadvantage of a synchronous protocol is the waste of packet slots. To offset this inherent 
drawback of synchronous TDM, a pipelined asynchronous TDM optical bus is proposed. 
The simulation results show that the performance of the proposed bus is significantly 
better than that of known pipelined synchronous TDM optical buses.
In practice, the computation power of the plain TDM protocol is limited, and various 
extensions must be added to the system. In this research, a new pipelined optical TDM bus 
for implementing a linear array parallel computer architecture is proposed. The switches 
on the receiving segment of the bus can be dynamically controlled, which makes the system 
highly reconfigurable.
To build large and scalable systems, we need new network architectures that are 
suitable for optical interconnections. A new kind of reconfigurable bus, called the segmented 
bus, is introduced to achieve reduced structural complexity and increased concurrency. We 
show that parallel architectures based on segmented buses are versatile by showing that 
they can simulate parallel communication patterns supported by a wide variety of networks 
with small slowdown factors.
New kinds of interconnection networks, the hypernetworks, have been proposed 
recently. Compared with point-to-point networks, they allow for increased resource sharing 
and communication bandwidth utilization, and they are especially suitable for optical 
interconnects. One way to derive a hypernetwork is by finding the dual of a point-to-point 
network. The hypercube Q_n, where n is the dimension, is a very popular point-to-point 
network. It is interesting to construct hypernetworks from the dual Q* of the hypercube 
Q_n. In this research, the properties of Q* are investigated and a set of fundamental data 
communication algorithms for Q* are presented. The results indicate that the 
hypernetwork is a useful and promising interconnection structure for high-performance parallel 
and distributed computing systems.
Chapter 1 
Introduction
Many real-life problems such as weather forecast modeling, molecular modeling, 
computer-aided design of VLSI circuits, large-scale database management, artificial 
intelligence, and strategic defense initiatives are computation intensive. The circuit density 
of a single chip is approaching its limit, which means that the processing speed of a single 
processor is also reaching its limit. To increase speed further, we must turn to parallel 
computing. The importance of parallel computing for real-life applications is therefore obvious.
In parallel computing, the time complexities of many computation problems are 
constrained by data movement among processors. Designing efficient interconnection 
networks has long been a focus of parallel computing research. To resolve this communication 
bottleneck, researchers have been trying to invent better network topologies and to find 
new transmission media that have higher bandwidth.
Various network architectures have been proposed. These networks can be classified 
as static connection networks and dynamic connection networks. Typical static 
connection networks include linear array, ring, binary (fat) tree, star, mesh, torus, hypercube, 
cube-connected cycles (CCC) and k-ary n-cube [34]. Expected features of those static 
networks include small and constant node degree, small network diameter, symmetry, and 
scalability. With regard to dimensionality, low-dimensional networks reduce contention 
because having a few high-bandwidth channels results in more wire sharing and thus 
better queuing performance than having many low-bandwidth channels. In addition, 
low-dimensional networks have a higher maximum throughput and lower average block latency 
than high-dimensional networks. It has been argued that ring, mesh, torus, k-ary n-cube, 
and CCC all have some desirable features for building future MPP systems.
Dynamic connection networks include bus systems, multistage interconnection 
networks (MINs) and crossbar switch networks. Characteristics of dynamic networks include 
minimum latency, bandwidth per processor, wiring complexity, switching complexity, and 
connectivity and routing capability. The high bandwidth and routing capability are 
usually at the cost of high wiring and switching complexity.
Designing networks that connect a large number of processors and support massively 
parallel computing is a challenge. As discussed above, in designing a network for 
interprocessor communication, one has three major choices: point-to-point networks, MINs and 
multiple-bus systems (see also [18, 40, 71] for surveys). MINs suffer from limited flexibility 
and a relatively large lower bound on the propagation delay through the stages of the 
network. Most previously investigated multiple-bus schemes are either restricted to 
processor-memory interconnections (shared-memory architectures) or proposed to augment 
point-to-point interprocessor networks for improved broadcast and multicast performance in 
the electrical domain. Point-to-point networks have poor resource sharing, especially when 
dimensionality is high. These observations show the trade-offs in designing an efficient 
network. The trend is toward low-dimensional networks, which is consistent with the above 
discussion. This trend is evidenced by several recently implemented or proposed 
network architectures, which include the multiple buses in the Wisconsin Multicube [23], 
the orthogonal buses in the OMP [31, 33], the sparse torus in the Tera computer [4], and 
the fat tree used in the CM-5 machine [34].
The most common transmission medium has been copper wire for electronic 
interconnection. However, due to the bandwidth limit of electronic interconnections, researchers 
are turning their attention to optical interconnections. Several approaches to constructing 
optical interconnections for multiprocessor systems, such as free space optical devices, 
wavelength-division multiplexing (WDM) and time-division multiplexing (TDM), have 
been studied, and the advances in optical devices and fiber optic communications have 
made optical interconnections feasible. Prototypes of optical interconnections have 
been constructed and shown to be promising. New generation massively parallel computers 
using optical interconnections may become a reality in the near future.
Notice the interaction between the network topology and the transmission medium. A 
new transmission medium usually requires a new network topology for efficient 
communication. Traditional point-to-point networks may be suitable for electronic interconnection, 
but they may not be suitable for optical interconnection. The purpose of this research 
is to find better transmission protocols and network topologies that will result in more 
efficient parallel computation systems.
Optical waveguides can be used to implement a bus. Signal propagation in an optical 
bus is unidirectional and has predictable delay per unit length. Furthermore, an optical 
bus can connect more processors than an electrical bus. Processors connected by an optical 
bus are linearly ordered. Such a bus system is considered as a one-dimensional parallel 
computer architecture: a linear processor array. It is important to investigate linear 
arrays because they can be used as building blocks to construct parallel architectures of 
higher dimensions to achieve improved scalability and performance.
One class of promising parallel architectures is the distributed-memory SIMD (Single 
Instruction Stream Multiple Data Stream) computer systems equipped with pipelined 
optical buses using TDM access protocols. In such a system, the transmission latency 
between the farthest processors is the end-to-end propagation delay of light over a waveguide. 
Since messages are transmitted concurrently in a pipelined fashion on a bus, this latency 
is hidden. Due to their remarkable advantages, such optical bus systems have received 
much attention (e.g. [26, 27, 41, 49, 55, 56, 57, 65]). More powerful multidimensional 
processor arrays connected by such optical bus systems have also been proposed recently 
(e.g. [59, 65]).
The TDM multiaccess methods can be divided into two categories: fixed assignment 
and demand assignment. All previously proposed pipelined buses use fixed assignment, 
for example, the time-division source-oriented multiplexing (TDSM) multiaccess method 
and the time-division destination-oriented multiplexing (TDDM) multiaccess method. With 
TDSM, each packet slot is assigned to a source processor, while with TDDM, each packet 
slot is assigned to a destination processor. The major disadvantage of fixed-assignment 
TDM (FA-TDM) is that the packet slots for each processor are fixed 
regardless of whether or not it has a packet to transmit. A demand-assignment TDM method 
(DA-TDM) allocates packet slots to processors dynamically according to their demands 
and the traffic situation.
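The difference between the two categories can be illustrated with a small software model. The sketch below is only an illustration under simplifying assumptions that are not taken from this work: a bus cycle is modeled as a sequence of packet slots, `pending[i]` is the number of packets processor i is waiting to send, the fixed assignment reserves slot s for processor s mod n in the TDSM style, and the demand assignment gives each slot to the lowest-numbered processor that has a packet. The function names are hypothetical.

```python
def fixed_assignment(pending, num_slots):
    """FA-TDM sketch (TDSM style): slot s is reserved for processor s % n."""
    pending = list(pending)            # copy so the caller's list is untouched
    n = len(pending)
    sent = [0] * n
    wasted = 0
    for s in range(num_slots):
        owner = s % n
        if pending[owner] > 0:
            pending[owner] -= 1
            sent[owner] += 1
        else:
            wasted += 1                # the reserved slot passes by empty
    return sent, wasted


def demand_assignment_linear(pending, num_slots):
    """DA-TDM sketch: each slot goes to the first processor with a packet."""
    pending = list(pending)
    sent = [0] * len(pending)
    wasted = 0
    for _ in range(num_slots):
        for i, count in enumerate(pending):
            if count > 0:
                pending[i] -= 1
                sent[i] += 1
                break
        else:
            wasted += 1                # no processor had anything to send
    return sent, wasted


if __name__ == "__main__":
    demand = [3, 0, 0, 1]              # packets waiting at four processors
    print(fixed_assignment(demand, 4))
    print(demand_assignment_linear(demand, 4))
```

With the demand `[3, 0, 0, 1]` and four slots, the fixed model leaves the two slots owned by idle processors unused, while the demand model fills all four.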
In this research, we introduce the idea of a flagged packet. We then use this idea 
to modify the known pipelined optical bus structure so that a demand-assignment TDM 
multiaccess method using a linear priority scheme can be implemented by hardware. To 
improve the fairness of the DA-TDM multiaccess method, we incorporate reconfigurability 
into our pipelined optical bus to implement the round-robin priority scheme. The 
scheduling of the DA-TDM multiaccesses with the round-robin priority scheme is implemented 
by reconfiguring the bus in hardware. We compare the performance of our pipelined 
DA-TDM optical bus with the pipelined FA-TDM optical bus by simulations, in terms of 
average message response time and fairness. We also explore the potential of using our 
buses to construct multichannel switches and multidimensional processor arrays.
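The fairness issue behind the two priority schemes can be seen in a short software model. This is a sketch under assumptions of our own, not the hardware mechanism of this work (which reconfigures the bus itself): slots are granted one at a time, `pending[i]` counts the packets waiting at processor i, linear priority always starts the search at processor 0, and round-robin priority restarts the search just after the processor that sent last. The function name `assign_slots` is hypothetical.

```python
def assign_slots(pending, num_slots, round_robin=False):
    """Grant num_slots packet slots and return how many each processor sent."""
    pending = list(pending)                   # work on a copy
    n = len(pending)
    sent = [0] * n
    head = 0                                  # processor with highest priority
    for _ in range(num_slots):
        for k in range(n):
            i = (head + k) % n
            if pending[i] > 0:
                pending[i] -= 1
                sent[i] += 1
                if round_robin:
                    head = (i + 1) % n        # priority moves past the sender
                break
    return sent


if __name__ == "__main__":
    backlog = [5, 5, 5, 5]                    # every processor is overloaded
    print(assign_slots(backlog, 4))                    # linear priority
    print(assign_slots(backlog, 4, round_robin=True))  # round-robin priority
```

When every processor has a backlog, the linear scheme in this model lets processor 0 take all four slots, whereas the round-robin scheme gives each processor one slot.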
For parallel applications that are suitable for synchronous TDM, the computation 
power of the system in terms of parallelism and concurrency is often limited. It can be 
improved by introducing reconfigurability. For example, it is proposed in [56] that 
programmable switches be added to the transmitting segments of the optical bus. An 
average-case O(log N) and worst-case O(N) time complexity for parallel selection has 
been achieved on such a system. The idea of programmable delays using 
electrooptically switched fiber loops has been used in the designs of TDM time slot interchangers 
(e.g. [30, 37, 66, 78]). Also, two optical bus structures with conditional delays have been 
proposed in [26] and [57]. In the bus structure of [26], switches are on the transmitting 
segment of the bus, and because of this, parallel computation on subarrays is difficult, if 
not impossible. To solve this problem, an optical bus with considerably more switches 
was proposed in [57]. Such a bus can be physically partitioned into several buses for 
parallel subarray computation. In this research, we introduce a technique for binary 
prefix summation and processor reordering. Structurally, our system is very simple and 
cost-effective. We demonstrate that our system is very powerful by showing 
that parallel selection can be done in optimal time on it. We also show that, 
using this linear array architecture, several fundamental parallel communication and 
computation operations, which include broadcasting, multicasting, compaction, partition and 
concurrent subarray computation, can be carried out efficiently.
The realization of general-purpose, massively parallel computers hinges largely on being 
able to build scalable interprocessor networks. To build large and scalable systems, 
we need new network architectures that are suitable for optical interconnections. 
High-dimensional networks are not good candidates because of their poor wire sharing property, 
which has been well studied during the last few years. On the other hand, low-dimensional 
networks show a better wire sharing property. To compensate for the concurrency and 
parallelism lost in the switch from high-dimensional networks to low-dimensional 
networks, it has been proposed that the low-dimensional networks be enhanced in some 
way. For example, meshes can be enhanced by multiple buses to improve broadcasting 
performance [74, 60, 81]. A number of multiprocessors connected by multiple reconfigurable 
buses have also been proposed. Examples include, among many others, the bus automaton 
[69], the reconfigurable mesh [51] and the polymorphic torus [42]. The common feature of 
these machine models is that the bus configurations can change under program control. 
Some of these models have been shown to be surprisingly powerful because of the dynamically 
reconfigurable communication paths achieved by enhancement. For example, by using bus 
control techniques to reconfigure communication paths as an integral part of computation, 
it is shown in [36, 46, 53, 54] that n numbers can be sorted in constant time on an n × n 
reconfigurable mesh. One side effect of enhancement is that the resulting systems may 
be too complicated to implement. In this research, we propose a class of 
reconfigurable buses, called segmented buses. A segmented bus connecting p processors, 
denoted by B(p), is a bus that can be dynamically partitioned by switches into several 
segments, each connecting a subset of processors. We also generalize the concept of the 
segmented bus to obtain parallel architectures of higher dimensions, called k-dimensional 
meshes connected by segmented buses (k-D MCSB). We show that the segmented bus and 
the k-D MCSB are versatile parallel computing architectures by showing that they can 
simulate a wide variety of useful network structures. In particular, we show that B(p) can 
simulate any linear array or ring of no more than p processors with a constant slowdown 
factor, and B(p) can simulate a (2p - 1)-processor complete binary tree, X-tree and 
one-dimensional multigrid with an O(log p) slowdown factor. Then, we use these results to 
show that a k-D MCSB can simulate a k-D mesh or torus with a constant slowdown factor, 
and that an N × N MCSB can simulate an N × N mesh-of-trees, an N × N multigrid network and 
an N × N pyramid network with an O(log N) slowdown factor. This study would not be complete 
without considering the algorithmic aspect of the segmented-bus-based architecture. We 
demonstrate the advantages of parallel architectures based on segmented buses by giving 
a parallel algorithm for the prefix computation problem.
Traditionally, interconnection networks are characterized by graphs. Network topolo­
gies under graph models have been extensively investigated. Many network structures 
have been proposed, and some have been implemented. Observing the improved electrical 
bus and switch technologies, and maturing optical interconnection technologies, Zheng 
pointed out that the conventional graph structure is no longer adequate for the design 
and analysis of the new generation of interconnection structures and proposed a new class 
of interconnection networks, the hypernetworks [84].
The class of hypernetworks is a generalization of point-to-point networks, and it 
contains point-to-point networks as a subclass. In a hypernetwork, the physical 
communication medium (a hyperlink) is accessible to multiple (usually, more than two) processors. 
The relaxation on the number of processors that can be connected by a link provides 
more design alternatives, so that greater flexibility in trading off conflicting design 
goals is possible. The underlying graph theoretic tool for investigating hypernetworks is 
hypergraph theory [8]. Hypergraphs are used to model hypernetworks. Existing results 
in hypergraph theory and combinatorial block design theory, which is closely related to 
hypergraph theory, can be used to design hypernetworks. For example, in [87], Zheng 
introduced several low-diameter hypernetworks based on the concept of the Steiner Triple 
System. In [88], Zheng and Wu proposed a scheme for constructing a new hypernetwork 
from an existing one using the concept of dual graph in hypergraph theory. They showed 
that the dual H* of any given hypergraph H is a hypergraph that has some properties 
related to the properties of H, so that one can investigate the properties of H* based on 
the properties of H. Since the structure of H and its dual H* can be drastically different, 
finding hypergraph duals can be considered a general approach to the design of new 
hypernetworks. They investigated the structure of the dual K_n* of an n-vertex complete 
point-to-point network K_n.
The hypercube is a popular point-to-point network that has many desirable features 
such as small diameter, symmetry, and support of a large class of efficient parallel 
algorithms. In this research, we propose a class of hypernetworks, the Q* (read as star) 
hypernetworks. The Q* hypernetwork is the dual of the n-dimensional hypercube. We 
discuss the topological properties and fault tolerance aspects of Q*, and present a set of 
parallel data communication algorithms for Q*.
Chapter 2
Optical Buses
Recently, optical technologies for interconnection of massively parallel computers have 
received considerable attention. Optics can utilize free-space as well as guided wave tech­
nologies. There are many desirable characteristics of optical interconnections such as 
high speed, high bandwidth, increased fanout, longer interconnection lengths, low power 
requirements, and reduced crosstalk. These characteristics have significant system config­
uration and complexity implications.
2.1 Non-pipelined Optical Bus
Light can be guided along a fiber because of the way it is reflected and refracted at the 
boundary between two substances. When a light ray is sent from one substance to another, 
some of it is reflected and some passes into the new substance (see Figure 2.1). The ray 
of light entering the new substance is usually bent from its original angle. The extent to 
which the ray bends depends on the index of refraction of each of the two substances and 
the wavelength of the light. When light traveling in the substance with the higher index 
of refraction strikes the boundary at an angle greater than a certain threshold, called the 
critical angle, the light is completely reflected (none passes through). Now consider a 
strand of glass, called the core, wrapped by another layer of slightly different glass, 
called the cladding, with the index of refraction of the core being higher than that of the 
cladding. Light sent into the core at a suitable angle will stay in the core because any 
light trying to escape the core through the cladding is reflected back into the core, as 
shown in Figure 2.2. By transmitting light through the core, it is possible to transmit 
bits in the form of pulses of light.
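As a numerical illustration of the critical angle mentioned above, the following Python sketch applies Snell's law to a core and cladding pair. The function names and the index values used in the usage note are illustrative assumptions and do not come from the text; angles are measured from the normal to the core/cladding boundary.

```python
import math


def critical_angle_deg(n_core, n_cladding):
    """Critical angle in degrees, measured from the normal to the boundary.

    Follows from Snell's law, n_core * sin(theta_1) = n_cladding * sin(theta_2),
    by setting the refracted angle theta_2 to 90 degrees.
    """
    if n_cladding >= n_core:
        raise ValueError("total internal reflection needs n_core > n_cladding")
    return math.degrees(math.asin(n_cladding / n_core))


def is_totally_reflected(incidence_deg, n_core, n_cladding):
    """True if a ray hitting the boundary at this angle stays inside the core."""
    return incidence_deg > critical_angle_deg(n_core, n_cladding)
```

For instance, with assumed indices of 1.48 for the core and 1.46 for the cladding, `is_totally_reflected(85, 1.48, 1.46)` holds while `is_totally_reflected(45, 1.48, 1.46)` does not, which is why only rays that travel nearly parallel to the fiber axis remain guided.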
Figure 2.1: Reflection and Refraction of Light.
(Figure labels: incident ray and reflected ray in substance 1; refracted ray in substance 2; index of substance 1 > index of substance 2.)
Figure 2.2: Transmission of light in a fiber.
(Figure labels: light, cladding; index of core > index of cladding.)
In their simplest use, optical fibers can be used like electrical wires. One example is 
the Synchronous Optical Network (SONET), which is part of a large suite of telephone 
standards known as the Synchronous Digital Hierarchy (SDH), standardized by CCITT, 
the worldwide telephony standards body. Compared to an electrical wire, an optical fiber 
has much larger bandwidth, much smaller physical size, longer interconnection lengths, 
lower error rate, and reduced crosstalk. For example, a normal electrical cable has a 
channel bandwidth in the range of several Mbps to 150 Mbps depending on the transmission 
media that are used [77], while a single fiber can have a bandwidth as high as 50 to 75 
terabits per second [58]. Optical fibers have been successfully used in telecommunication 
networks to replace copper wires. It is expected that fiber optics will soon be used in 
tightly connected systems.
Due to the huge bandwidth of fiber optics compared to electrical signaling devices, 
fiber sharing will be an important feature of the new generation of interconnection networks, 
which connect electronic processors with optical fibers. From past experience, 
high-dimensional networks show poor wire sharing [34]. It is expected that low-dimensional 
networks like meshes and tori are better suited to fiber optics. Observing this new 
trend in interconnection networks, some researchers advocate hypernetworks [84], which 
have excellent wire sharing ability.
2.2 Pipelined Optical Bus
Based on two important optical transmission properties, namely unidirectional propagation 
and predictable propagation delay of optical signals, pipelined optical buses using 
time division multiplexing (TDM) multiaccess methods have been proposed [26, 27, 41, 
49, 57, 62, 65]. Such an optical bus transmits packets in a pipelined fashion, achieving 
bandwidths higher than that of non-pipelined bus communications. Multidimensional 
processor arrays using pipelined optical buses as building blocks have been proposed to 
achieve improved system scalability [59, 65].
In the following, we briefly review the pipelined optical bus proposed in [27, 49, 62]. 
This is necessary because our newly proposed bus will be compared with those existing 
buses. Unless otherwise specified, we assume that the architectures discussed operate in 
the SIMD fashion.
A pipelined optical bus consists of three folded waveguides, the message waveguide, 
the reference waveguide and the select waveguide, connecting n processors, P_0, P_1, ..., P_{n-1} 
(refer to Figure 2.3). The processor indices are linearly ordered. We call processors P_{n-1} 
and P_0 the head processor and tail processor of the bus, respectively. The message 
waveguide is used for carrying messages. The reference and the select waveguides are used 
together for carrying address information encoded using the coincident pulse technique 
[12, 41]. The bus is divided into three segments: the transmitting segment, which is the 
upper half of the bus with taps from processors; the receiving segment, which is the lower 
half of the bus with taps to processors; and the U segment, which is the folded part that 
connects the transmitting and receiving segments. Let w be the pulse duration in seconds, 
and v_l be the velocity of light in these waveguides. Define a pulse time unit (or simply, 
time unit) as w × v_l. The spatial separation of any two adjacent taps on the waveguides is 
D time units. Loops are added on the receiving segments of the reference and the message 
waveguides. Each loop delays the light in a waveguide by one time unit.
Figure 2.3: A pipelined optical bus system.
(Figure labels: message, reference and select waveguides; tap spacing D; packet slots PS.)
Referring to Figures 2.3 and 2.4, we explain the coincident pulse technique. When a 
source processor sends a packet (which will be defined shortly), it sends a reference pulse 
and a select pulse. The select pulse is transmitted later than the reference pulse, with an 
appropriate time delay so that the two pulses arrive at the destination processor at 
the same time. That is, the coincidence of the two pulses occurs at the desired destination. 
Whenever a processor detects a coincidence of a reference pulse and a select pulse, it starts 
to read from the message waveguide. More specifically, suppose that a processor is going 
to send a packet to processor j. We use t_r to denote the time when the processor transmits 
its reference pulse and t_s to denote the time when it transmits its select pulse. The two 
light pulses will coincide at processor j if and only if t_s = t_r + j.
Figure 2.4: A train of packet slots.
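The coincidence condition can be checked with a small simulation. The sketch below is only an idealised model, not the bus hardware: it assumes that, before reaching processor k on the receiving segment, the reference pulse has accumulated k more unit delays than the select pulse, and that all other propagation delays are common to both waveguides and can be left out. The function names are hypothetical.

```python
def coincidences(t_r, t_s, n):
    """Return the processors (0..n-1) at which the two pulses arrive together.

    Assumed model: the reference pulse is delayed by k extra time units
    relative to the select pulse by the time it reaches processor k.
    """
    hits = []
    for k in range(n):
        reference_arrival = t_r + k  # k unit delays on the reference waveguide
        select_arrival = t_s         # common propagation delay omitted on both sides
        if reference_arrival == select_arrival:
            hits.append(k)
    return hits


def select_time_for(t_r, j):
    """Transmission time of the select pulse that addresses processor j."""
    return t_r + j
```

Under these assumptions, `coincidences(4, select_time_for(4, 3), 8)` reports processor 3 only, and a select pulse sent too late for any tap, such as `coincidences(0, 8, 8)`, addresses no processor.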
Figure 2.5: (a) An address frame. (b) A packet slot.
(Figure labels: pulse slots 0, 1, ..., j, ..., n-1 on the select and reference waveguides; address frame and data.)
Let n denote the number of processors attached to the bus. We call the duration of 
each light pulse a pulse slot. Then a sequence of n pulse slots is called an address frame, as 
shown in Figure 2.5(a). Note that the existence of a select pulse at pulse slot j means that 
a message is to be sent to processor j. According to the definition, the address frame 
for broadcasting can be easily implemented by setting a select pulse at each pulse slot 
of an address frame. Define a packet as a collection of information including an address 
frame and a data frame (see Figure 2.5(b)). Let L be the packet length in terms of time 
units. Clearly, we must have L > n and D > L in order to provide correct addressing and 
prevent packet overlaps.
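An address frame can be pictured as a list of n bits, one per pulse slot, where a 1 at position j stands for a select pulse addressed to processor j. The following Python sketch builds unicast, multicast and broadcast frames in that representation and checks the two constraints stated above. The bit-list encoding and the function names are illustrative assumptions.

```python
def address_frame(n, destinations):
    """Build an n-slot address frame with a select pulse for each destination."""
    frame = [0] * n
    for j in destinations:
        if not 0 <= j < n:
            raise ValueError(f"processor index {j} out of range")
        frame[j] = 1
    return frame


def broadcast_frame(n):
    """A broadcast frame carries a select pulse in every pulse slot."""
    return [1] * n


def parameters_ok(n, L, D):
    """Constraints from the text: L > n and D > L (all in time units)."""
    return L > n and D > L
```

For example, `address_frame(6, [2])` gives `[0, 0, 1, 0, 0, 0]`, and `parameters_ok(8, 8, 20)` fails because a packet of length 8 leaves no room for a data frame after an 8-slot address frame.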
The TDM multiaccess methods can be divided into two categories: fixed assignment 
and demand assignment as discussed below.
2.2.1 FA-TDM Method
The fixed-assignment method is the simplest multiaccess protocol for the bus discussed 
above. In fixed-assignment TDM (FA-TDM), the n packet slots are numbered from 
n - 1 down to 0, as shown in Figures 2.3 and 2.4. The packet slots are assigned to 
processors in a fixed manner. The two most often used techniques are the time-division 
source-oriented multiplexing (TDSM) multiaccess method and the time-division 
destination-oriented multiplexing (TDDM) multiaccess method. In the TDSM method, packet slot 
(PS for short; it has L pulse slots) PS_i is permanently assigned to a source (sending) 
processor i. Imagine that a train of n packet slots originates on a bus. If processor i has 
a packet to send, it loads its packet into PS_i. In the TDDM method, a packet slot is 
permanently assigned to a destination (receiving) processor. If some processor i wants to 
send a message to a processor j, processor i must load its message into the packet slot 
assigned to processor j.
The major disadvantage of FA-TDM is the requirement that the packet slot for each 
processor is fixed regardless of whether or not it has a packet to transmit. For example, 
assume that packet slot i is assigned to processor i. If processor i does not have a packet 
to send in a particular bus cycle, packet slot i will be wasted. This situation arises in 
both the TDSM and TDDM methods. The TDDM method has another problem. When two source 
processors want to send messages to the same destination processor, a packet slot 
collision or contention happens since the two source processors want to load their packets 
into the same packet slot. Therefore, a reservation scheme must 
be provided when the TDDM method is used.
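The two drawbacks can be made concrete with a small bookkeeping sketch in Python. It does not model the optical bus itself; it only counts wasted slots in one TDSM bus cycle and lists the destinations whose slot is contended in one TDDM bus cycle. The request formats and function names are assumptions made for illustration.

```python
from collections import Counter


def tdsm_cycle(n, requests):
    """One TDSM bus cycle.

    requests maps a source processor to its destination. Slot i belongs to
    source i, so every source without a request wastes its slot.
    Returns (sorted list of used slots, number of wasted slots).
    """
    used = sorted(src for src in requests if 0 <= src < n)
    return used, n - len(used)


def tddm_collisions(requests):
    """One TDDM bus cycle.

    requests is a list of (source, destination) pairs. Slot j belongs to
    destination j, so two sources with the same destination collide.
    Returns the sorted list of contended destinations.
    """
    counts = Counter(dst for _, dst in requests)
    return sorted(dst for dst, c in counts.items() if c > 1)
```

With four processors and only processors 0 and 2 sending, `tdsm_cycle(4, {0: 3, 2: 1})` reports two wasted slots; `tddm_collisions([(0, 3), (1, 3), (2, 0)])` reports contention for the slot of processor 3.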
2.2.2 DA-TDM Method
To avoid the disadvantages of FA-TDM, a demand-assignment TDM method can be used. 
A demand-assignment TDM method (DA-TDM) allocates packet slots to processors 
dynamically according to their demands and the traffic situation. It belongs to the class of 
asynchronous TDM (ATDM) multiaccess methods.
In spite of the disadvantages of FA-TDM, all previously proposed pipelined buses use 
it because of its simplicity. In this work, we introduce the idea of a flagged 
packet. We then use this idea to modify the known pipelined optical bus structure so 
that a demand-assignment TDM multiaccess method using a linear priority scheme can be 
implemented in hardware. The details of the flagged packet and its application to optical 
pipelined TDM will be discussed in a separate chapter.
Chapter 3
Pipelined Optical Bus with Conditional Delays
In the previous chapter, we discussed pipelined optical buses using the TDM access 
protocol. With this technique, one class of promising parallel architectures, the 
distributed-memory SIMD (Single Instruction Stream Multiple Data Stream) computer systems, has 
been proposed and studied. In such a system, the transmission latency between the furthest 
processors is the end-to-end propagation delay of light over a waveguide. Since messages 
are transmitted concurrently in a pipelined fashion on a bus, this latency is hidden. Due 
to their remarkable advantages, such optical bus systems have received much attention 
(e.g. [26, 27, 41, 49, 55, 56, 57, 65]). If enhanced with reconfigurability, the computation 
power of such systems can be further increased, as shown in [59, 65].
In this work, we propose a modified pipelined TDM optical bus for implementing a 
linear array parallel computer architecture. In this bus system, switches are introduced 
on the receiving segment of the bus to control the signal delays on the waveguide. The 
states of the switches are dynamically programmable under the control of processors 
according to computation needs. In conjunction with the coincident pulse processor addressing 
technique, the reconfigurability of signal delays becomes an integral part of parallel 
computation. We show that, using this linear array architecture, several fundamental parallel 
communication and computation operations, which include broadcasting, multicasting, 
binary prefix sums, processor reordering, compaction, partition and concurrent subarray 
computation, can be carried out efficiently. We present parallel algorithms for the 
selection problem and the sorting problem to demonstrate that these operations constitute a 
set of powerful tools for designing parallel algorithms.
3.1 A New Optical Bus Architecture
Our new bus architecture is shown in Figure 3.1. On the receiving segment of the reference 
and message waveguides, we introduce p 2 × 2 optical switches S_i, 1 ≤ i ≤ p (for simplicity, 
the switches on the message waveguide are not shown). Switch S_i can be set to one of 
two states, straight and cross, by processor i. When a switch is set to cross, a signal 
passing the switch is delayed by an additional time w, compared with the case when the 
switch is set to straight, because of the extra distance traversed. The Ti:LiNbO3 switches 
of [7] can be used for this purpose. We use the new offset message transmission scheme 
proposed in the previous section to avoid incomplete message reads. The setting of the 
address frame for any processor to send a packet to processor j is shown in Figure 3.2, 
assuming that all switches are in the cross state. Note that the address frame for this bus 
system requires one more pulse slot than that of the bus system discussed in the previous 
section.
Figure 3.1: An optical bus with conditional delays.
(Figure labels: message, select and reference waveguides; switches up to S_{p-1} and S_p. Legend: processor; conditional delay switch in straight state; conditional delay switch in cross state.)
The adjustable delays on the receiving segments of the reference and message 
waveguides make our bus system more powerful. Previously, two pipelined optical buses with 
conditional delays were proposed in [26] and [57]; both buses require more hardware. Our 
bus requires p switches.¹ Compared with the bus of [26], our bus, which has half the 
number of switches, is more powerful. The bus proposed in [57] has 2(p - 1) 2 × 2 switches, 
3(p - 1) 1 × 2 switches, and 3(p - 1) 2 × 1 switches, that is, 8(p - 1) switches in total. It 
is believed that the additional reconfigurability of this bus, which has about eight times 
the number of switches, does not provide additional power compared with our bus design. In 
what follows, we show that our bus architecture supports several useful fundamental 
operations. In the next section, we show how to use these operations to construct efficient 
parallel algorithms. Since the length of a bus grows linearly with respect to the number of 
processors, the time taken by a bus cycle in turn grows linearly with respect to the system 
size. In the complexity analysis of operations and algorithms that involve bus 
communications, we separate communication time from computation time. The communication 
performance is measured in terms of the number of bus cycles required.
Figure 3.2: An address frame of the bus with conditional delays, assuming that all switches 
are in the cross state.
¹ Actually, p - 1 switches are sufficient if we slightly modify all operations and algorithms discussed in 
this and subsequent sections. We choose to use p switches for the simplicity of our discussions.
Broadcast and Multicast. In a multicast operation, each processor may send a packet 
to a group of processors, and each processor receives at most one packet. In a broadcast 
operation, which is a special case of multicast, a packet is sent from one processor to 
all other processors. There are different ways to carry out a broadcast operation. One 
way is to set all switches to the cross state and let the source processor use the address 
frame shown in Figure 3.3(a). Another way is to set all switches to the straight state and 
let the source processor use the address frame shown in Figure 3.3(b). Similarly, there 
is more than one way to perform a multicast operation. Assuming that all switches are 
in the cross state, Figure 3.4 shows the address frames for a multicast operation with 
three source processors, each sending a packet to a group of three destination processors. 
Clearly, one bus cycle is sufficient for either a multicast or a broadcast operation.
Figure 3.3: Address frames for broadcasting. (a) All switches are in the cross state. (b) 
All switches are in the straight state.
Figure 3.4: Address frames for multicasting, assuming that all switches are in the cross 
state. Three source processors send messages to three subsets, (a), (b) and (c), of 
destination processors.
Binary Prefix Sums and Processor Reordering. Given p one-bit values b_j, $1 \le j \le p$, 
the binary prefix sums (BPS) problem requires the computation of $s_j = b_1 + b_2 + \cdots + b_j$ 
for all $1 \le j \le p$, where "+" is the addition operation. This operation can be done as 
follows. Processor j sets its switch S_j to the cross state if b_j = 1; otherwise S_j is set to 
the straight state by processor j. An empty message is broadcast from processor p. Assume 
that the bus cycle in which the broadcasting occurs starts at t_0. It takes $(p + j - 1)\tau$ time 
for the reference light pulse to arrive at processor j if no delays caused by switches are 
involved, where $\tau$ is the time delay corresponding to distance D. If some switch delays 
occur, the reference light pulse will arrive at processor j at a later time, say t_j, to cause 
coincidence with a broadcast select pulse. The number of unit delays can be computed by
$$ i = \frac{t_j - t_0 - (p + j - 1)\tau}{w} \qquad (3.1) $$
We call the value computed by Equation (3.1) the i-value of processor j. Clearly, the 
i-value of processor j is s_j.
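The way Equation (3.1) recovers the prefix sums can be checked with a short simulation. The sketch below is an idealised timing model and not a description of the hardware: it assumes that the reference pulse reaching processor j has passed exactly the switches S_1, ..., S_j, each contributing a delay of w when set to cross, and it uses arbitrary numeric defaults for t_0, tau and w. The function names are hypothetical.

```python
def arrival_times(bits, t0=0.0, tau=10.0, w=1.0):
    """Arrival time t_j of the reference pulse at each processor j = 1..p.

    bits[j - 1] is b_j; switch S_j is in the cross state exactly when b_j = 1.
    """
    p = len(bits)
    times = {}
    unit_delays = 0
    for j in range(1, p + 1):
        unit_delays += bits[j - 1]  # a cross switch adds one delay of w
        times[j] = t0 + (p + j - 1) * tau + unit_delays * w
    return times


def i_value(t_j, j, p, t0=0.0, tau=10.0, w=1.0):
    """Equation (3.1): number of unit delays seen by processor j."""
    return round((t_j - t0 - (p + j - 1) * tau) / w)


def binary_prefix_sums(bits):
    """Each processor derives its prefix sum from its own arrival time."""
    p = len(bits)
    times = arrival_times(bits)
    return [i_value(times[j], j, p) for j in range(1, p + 1)]
```

For the bit vector `[1, 0, 1, 1, 0]` the simulated i-values are `[1, 1, 2, 3, 3]`, which are the prefix sums. Every processor evaluates the formula locally, which is why the text counts a single bus cycle plus constant computation time.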
We say that a processor is active if it will participate in the next computation step; 
otherwise, we say that it is inactive. Processor reordering assigns indices to active 
processors in such a way that the i-th active processor is assigned the new index i. The 
operation for processor reordering is almost the same as that for BPS. If a processor is 
active, it sets its switch to the cross state; otherwise, it sets its switch to the straight 
state. After a broadcast operation, the i-value obtained for an active processor j is its 
new index. Thus, both the BPS problem and the processor reordering problem can be solved 
using one bus cycle and O(1) computation time.
Compaction and Partition. Suppose that each processor has one data item. Using the 
processor reordering method, in one bus cycle the active processors can be assigned new 
ordered indices starting from 1, and the total number of active processors can be 
determined from the i-value of processor p, regardless of whether or not it is active. Suppose 
that the number of active processors is s. Then, each active processor j with new index i 
sends its data item to processor i. In one additional bus cycle, the s data items of the 
active processors can thus be packed into the first s processors, each having one item; 
furthermore, the compacted data items preserve their original order. Such an operation is 
called an ordered 
compaction operation.
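The two bus cycles of ordered compaction can be mirrored step by step in software. In the following Python sketch the prefix counts stand in for the i-values obtained by processor reordering, and a plain list assignment stands in for the second bus cycle; the list-based representation, the use of `None` for processors that receive nothing, and the function name are assumptions for illustration.

```python
from itertools import accumulate


def ordered_compaction(items, active):
    """Pack the items of active processors into the first s positions.

    items[j] is the data item of processor j + 1; active[j] tells whether
    that processor is active. Returns (compacted list, s).
    """
    p = len(items)
    # Bus cycle 1 (processor reordering): i-values are inclusive prefix counts.
    i_values = list(accumulate(int(a) for a in active))
    s = i_values[-1] if p else 0
    # Bus cycle 2: active processor j sends its item to processor i_values[j].
    result = [None] * p
    for j in range(p):
        if active[j]:
            result[i_values[j] - 1] = items[j]
    return result, s
```

Calling `ordered_compaction(list("abcde"), [0, 1, 1, 0, 1])` yields `(['b', 'c', 'e', None, None], 3)`: the three active items end up in the first three positions in their original order.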
A partition operation partitions p data items, one per processor, into two subsets, 
one containing s items and the other containing p - s items. The items in the first subset 
are moved to the first s processors, whereas the items in the second subset are moved to 
the remaining processors. After the partition operation, each processor contains one 
original item. This operation can be carried out in two phases. The first phase employs a 
processor reordering operation followed by a compaction operation to move the s items in 
the first subset to the first s processors. Then, in the second phase, active processors 
and inactive processors switch their roles, and another pair of processor reordering and 
compaction operations is performed on the p - s items in the second subset. Therefore, a 
partition operation can 
be performed in four bus cycles and O(1) computation time.
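The two-phase structure, and the count of four bus cycles, can be traced with a small Python sketch. As before, prefix counts stand in for the i-values of a reordering cycle and list assignments stand in for the data movement of a compaction cycle; the representation and the names are illustrative assumptions.

```python
from itertools import accumulate


def _reorder(flags):
    """One bus cycle: inclusive prefix counts, i.e. the i-values."""
    return list(accumulate(int(f) for f in flags))


def partition(items, in_first_subset):
    """Move first-subset items to the front and the rest behind them.

    Returns (rearranged items, s, number of bus cycles used).
    """
    p = len(items)
    result = [None] * p
    bus_cycles = 0

    # Phase 1: reorder and compact the processors holding first-subset items.
    first = _reorder(in_first_subset)
    bus_cycles += 1
    s = first[-1] if p else 0
    for j in range(p):
        if in_first_subset[j]:
            result[first[j] - 1] = items[j]
    bus_cycles += 1

    # Phase 2: roles are swapped; the other items go to positions s+1..p.
    complement = [not f for f in in_first_subset]
    second = _reorder(complement)
    bus_cycles += 1
    for j in range(p):
        if complement[j]:
            result[s + second[j] - 1] = items[j]
    bus_cycles += 1

    return result, s, bus_cycles
```

For `partition([7, 2, 9, 4, 1], [False, True, False, True, True])` the result is `([2, 4, 1, 7, 9], 3, 4)`: both subsets keep their internal order, and four bus cycles are counted.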
Parallel Operations on Subarrays. For parallel computation, it is often convenient to 
consider a linear array connected by a pipelined bus as several subarrays, each consisting 
of a sequence of consecutive processors. If the processors are controlled properly, the 
operations on subarrays can be performed in parallel. Parallel computation on subarrays 
is an important feature for supporting parallel algorithms based on the divide-and-conquer 
design paradigm. In a divide-and-conquer algorithm, the initial problem is partitioned into 
several subproblems, and each subproblem, which can be further recursively partitioned, 
is solved independently. The combination of subproblem solutions yields a solution to the 
initial problem. An example divide-and-conquer algorithm is given in the next section.
Let us refer to the processors with the smallest and the largest indices in a subarray 
as the head and tail processor, respectively, of the subarray. In a divide-and-conquer 
algorithm, processors in a subarray are notified of the identities of the head and tail 
processors of the subarray in the problem partition step. Knowing the head and tail 
processors, a processor in a subarray can broadcast a packet to all processors in the 
subarray in one bus cycle using a sequence of select pulses targeted at processors in the 
subarray (see Figure 3.4). Obviously, concurrent broadcasting on all subarrays is a special 
case of multicast.
Other operations can also be performed in parallel on subarrays. Consider the BPS 
problem on subarrays. We set switch S_j to the cross state if b_j = 1, and set it to the 
straight state if b_j = 0. After one broadcast operation, each processor computes its i-value. 
For the head processor of a subarray, we compute its i'-value as follows: i' = i - 1 if this 
processor is active (that is, its switch is in the cross state); otherwise, i' = i. Then, we 
use a multicast operation to send the i'-value of the head processor of a subarray to all 
remaining processors in the subarray, for all subarrays. The binary prefix sum of a 
processor in a subarray is obtained by subtracting i' from the i-value of the processor. 
Therefore, the BPS problem on subarrays can be solved using two bus cycles and O(1) 
computation time. Consequently, the processor 
reordering problem for all subarrays can be solved in the same amount of time.
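The subtraction of the head processor's i'-value can be illustrated with a short Python sketch. It uses 0-based list positions instead of the 1-based processor indices of the text, takes the subarrays as a sorted list of head positions, and replaces the two bus cycles by ordinary list operations; these choices and the function name are assumptions for illustration.

```python
from itertools import accumulate


def subarray_prefix_sums(bits, heads):
    """Binary prefix sums computed independently inside each subarray.

    bits[j] is the bit of the processor at position j (0-based here);
    heads is the sorted list of positions where subarrays start, heads[0] == 0.
    """
    # Bus cycle 1: one broadcast gives every processor its global i-value.
    i_values = list(accumulate(bits))
    result = []
    bounds = list(heads) + [len(bits)]
    for start, end in zip(bounds, bounds[1:]):
        # i' of the head: its i-value minus its own bit.
        i_prime = i_values[start] - bits[start]
        # Bus cycle 2: i' is multicast within the subarray and subtracted.
        result.extend(i_values[j] - i_prime for j in range(start, end))
    return result
```

With `bits = [1, 0, 1, 1, 0, 1]` and subarrays starting at positions 0 and 3, the call returns `[1, 1, 2, 1, 1, 2]`: each subarray obtains prefix sums that restart at its own head.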
It is not difficult to see that, if the head and tail processors are 
known to all processors in a subarray, for all subarrays, then concurrent compaction and 
partition operations on all subarrays can be carried out using O(1) bus cycles and O(1) 
computation time. For brevity, we omit detailed discussions.
3.2 Algorithm Design Examples
In this section, we present examples to show how to use the basic operations described in 
the previous section as tools to design more complex parallel algorithms. We consider the 
selection problem and the sorting problem. Both our selection and sorting algorithms 
have improved performance compared with previous algorithms for the buses of [26] and 
[57].
The selection problem is defined as finding the k-th smallest (or largest) element in a 
given set of p elements. This problem has many applications such as merging and sorting, 
convex hull computation, image analysis, statistical analysis, etc. Our parallel selection 
algorithm SELECT is an adaptation of the sequential algorithm of [9]. Assume that each of 
the p processors contains one element. We want to find the k-th smallest of these elements. 
The algorithm is recursive. Initially, SELECT(k, p) is called, and all p processors are active.
p ro c e d u re  SELECT{k,p')
I. B as ic  case. If p' < 5, then send all p' elements to processor I, and find the fc-th 
smallest one.
R ep ro d u ced  with p erm issio n  o f  th e  copyrigh t ow n er. Further reproduction  prohibited w ithout p erm ission .
22
2. Find median of medians. The p' active processors are partitioned into groups
with each group having exactly five active processors except for the last group which
may have fewer than five active processors. Let the first processor, which has the
smallest index, in each group be the group head processor. Within each group, every
non-head processor sends its element to its group head processor; that is, processor
j sends its element to the $(\lfloor (j-1)/5 \rfloor \cdot 5 + 1)$-th active processor. After receiving all the
elements from each member in its group, the head processor in each group finds the
median of the five elements in constant time. By a compaction operation, all group
medians are moved to the first $\lceil p'/5 \rceil$ processors. We set these processors temporarily
active and all other processors temporarily inactive, and call SELECT($\lceil \lceil p'/5 \rceil / 2 \rceil$, $\lceil p'/5 \rceil$) on
the first $\lceil p'/5 \rceil$ processors. Let m be the element that we have obtained. The
median-of-medians m, which is also referred to as the pivot element, is put in processor
1.
3. Partition. Processor 1 broadcasts the pivot element m to all the p' active processors.
Each of these processors compares its element with m. If the element in processor
j is less than or equal to m, the processor sets its local Boolean variable $b_j$ to 1;
otherwise, it sets $b_j$ to 0. By a BPS operation, the total number s of elements that
are less than or equal to m can be computed (note: s is the z-value of processor
p' computed by the BPS operation). If k < s, then by a partition operation, all
elements that are less than or equal to m are moved to the first s processors and the
remaining elements are moved to the rest of the p' processors, one element per processor.
If k > s, then by a partition operation, all elements that are greater than m are
moved to the first p' - s processors and the remaining elements are moved to the
rest of the p' processors, one element per processor.
4. Recursion. If k = s, processor s returns the pivot element. If k < s, set the first
s processors active and the remaining processors inactive, then call SELECT(k, s);
else let k := k - s, set the first p' - s processors active and the remaining processors
inactive, then call SELECT(k, p' - s).
end of SELECT
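The four steps of SELECT can be traced with a small sequential sketch. It is only an illustration under stated assumptions: a Python list stands in for the active processors, list comprehensions stand in for the compaction, BPS and partition operations, elements are assumed distinct, and the function name `select` and the choice of the lower median for a short last group are ours, not part of the original procedure.

```python
def select(elements, k):
    """Return the k-th smallest (1-based) of a list of distinct elements.

    Sequential stand-in for SELECT(k, p'); the list plays the role of the
    p' active processors, one element per processor.
    """
    p = len(elements)
    # Step 1 (basic case): fewer than five elements are handled directly.
    if p < 5:
        return sorted(elements)[k - 1]
    # Step 2: groups of five; each group head finds its group median
    # (lower median for a short last group, an assumption of this sketch).
    groups = [elements[i:i + 5] for i in range(0, p, 5)]
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
    # Recursive call on the ceil(p'/5) medians to get the pivot.
    pivot = select(medians, (len(medians) + 1) // 2)
    # Step 3: b_j = 1 if element j <= pivot; s is the number of such elements.
    flags = [1 if x <= pivot else 0 for x in elements]
    s = sum(flags)
    # Step 4: recurse on the side that contains the k-th smallest element.
    if k == s:
        return pivot
    if k < s:
        return select([x for x, b in zip(elements, flags) if b], k)
    return select([x for x, b in zip(elements, flags) if not b], k - s)
```

For example, `select([9, 4, 7, 1, 8, 3], 2)` picks the second smallest of six elements. The sketch mirrors only the control flow; it says nothing about bus cycles.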
Now let us analyze the time complexity of this algorithm. Let B(p) and T(p) denote
the total number of bus cycles and the computation time for a problem of size p. The
worst-case complexity B(p) of our algorithm is derived as follows. Each of Step 1 and Step 3
takes O(1) bus cycles and O(1) computation time. Step 2 takes B(p/5) bus cycles. By
adapting the worst-case time complexity analysis given in [29] for the sequential algorithm of
[9], Step 4 takes no more than B(3p/4) bus cycles for p > 24. Then, we have the following
recurrence relation:
$$B(p) \le B(p/5) + B(3p/4) + O(1) \quad \text{if } p > 24.$$
Solving this recurrence relation yields B(p) = O(log p). Similarly, we obtain T(p) =
O(log p). Hence, the k-th smallest element of p elements can be computed using a
pipelined optical bus of size p in O(log p) bus cycles and O(log p) computation time.
In [56], a parallel selection algorithm based on an optical bus of [26] was presented. This
algorithm requires O(log p) bus cycles and O(log p) computation time on average. But, in
the worst case, O(p) bus cycles and O(p) computation time may be required. Hence, our
algorithm has an improved performance.
The second problem we consider is sorting p elements using p processors, one element
per processor. If the elements are integers, then the BPS and partition operations supported
by the conditional-delay feature of our optical bus can be used to implement the
radix sort algorithm. For w-bit elements, w iterations are needed, each performing a pair
of BPS and partition operations on one bit position of all elements. Note that this implementation
of radix sort does not involve any comparison operation between elements, and
the total number of bus cycles and computation time required are independent of bus size
p.
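The radix sort just described can be sketched sequentially as follows. This is a minimal illustration, assuming non-negative integers below 2^w and assuming that the partition operation keeps the relative order of elements on each side (a stable split); the names `partition_by_bit` and `radix_sort` are hypothetical, and `itertools.accumulate` stands in for the binary prefix sums (BPS) computed on the bus.

```python
from itertools import accumulate


def partition_by_bit(values, bit):
    """One BPS + partition round: stable split on a single bit position."""
    flags = [1 - ((v >> bit) & 1) for v in values]  # 1 where the bit is 0
    prefix = list(accumulate(flags))                # binary prefix sums
    zeros = prefix[-1] if prefix else 0             # elements whose bit is 0
    out = [None] * len(values)
    for j, v in enumerate(values):
        if flags[j]:
            out[prefix[j] - 1] = v                  # rank among bit-0 elements
        else:
            ones_so_far = (j + 1) - prefix[j]       # rank among bit-1 elements
            out[zeros + ones_so_far - 1] = v
    return out


def radix_sort(values, w):
    """Sort w-bit non-negative integers with w partition rounds, no comparisons."""
    for bit in range(w):                            # least significant bit first
        values = partition_by_bit(values, bit)
    return values
```

Each destination index is computed from a prefix sum alone, so no two elements are ever compared, which matches the remark above. The number of rounds depends on w only.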
For sorting arbitrary elements, the bus of [59] was used to implement the sequential
quicksort algorithm. It is shown in [59] that such an implementation has average performance
of O(log p) bus cycles and O(log p) computation time. In the worst case, this
algorithm requires O(p) bus cycles and O(p) computation time. With procedure SELECT
at our disposal, this worst-case performance can be improved by the following divide-and-conquer
algorithm SORT. Without loss of generality, assume that $p = 2^q$, and all elements
are distinct.
Algorithm SORT
begin
    for i = 0 to q - 1 do
        for all $2^i$ subarrays of size $2^{q-i}$ do in parallel
            Use SELECT to find the $2^{q-i-1}$-th element of the subarray;
            Use the partition operation to partition the elements into two subsets
            of equal size such that the first $2^{q-i-1}$ processors of the subarray
            contain the elements that are not greater than the median of the
            subarray;
        endfor
    endfor
end
The algorithm consists of log p iterations. In the first iteration, procedure SELECT
is invoked to find the $\lceil p/2 \rceil$-th element (i.e. the median). Using it as the pivot element to
partition the elements, the linear array of size p is conceptually divided into two subarrays
of equal size. In the k-th iteration, the medians of $2^{k-1}$ subarrays, each of size $2^{q-k+1}$, are
computed in parallel. At the end of the iteration, each subarray is divided into two subarrays
of equal size. This process is continued until each subarray has two processors. The
complexity of each iteration is upper bounded by the operations of SELECT. Since each
iteration requires no more than O(log p) bus cycles and O(log p) computation time, this
sorting algorithm requires $O(\log^2 p)$ bus cycles and $O(\log^2 p)$ computation time.
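The structure of SORT can be illustrated with a short sequential sketch. It assumes p is a power of two and all elements are distinct, as in the text; the inner loop over subarrays runs one after another here although the algorithm handles them in parallel, and `heapq.nsmallest` is only a stand-in for procedure SELECT. The function name is hypothetical.

```python
import heapq


def sort_by_median_partition(elements):
    """Sort p = 2**q distinct elements by q rounds of median partitioning."""
    a = list(elements)
    p = len(a)
    if p == 0:
        return a
    q = p.bit_length() - 1
    assert p == 1 << q, "this sketch assumes p is a power of two"
    for i in range(q):
        size = p >> i                    # subarray size 2**(q - i)
        half = size // 2                 # 2**(q - i - 1)
        for start in range(0, p, size):  # sequential stand-in for "in parallel"
            sub = a[start:start + size]
            # Stand-in for SELECT: the half-th smallest element of the subarray.
            median = heapq.nsmallest(half, sub)[-1]
            low = [x for x in sub if x <= median]   # goes to the first half
            high = [x for x in sub if x > median]   # goes to the second half
            a[start:start + size] = low + high
    return a
```

After iteration i every subarray of size 2^(q-i-1) holds exactly the elements whose ranks fall in its index range, so the last iteration, on subarrays of size two, leaves the whole array sorted.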
3.3 Summary and Discussions
Optical buses offer much larger bandwidth than electronic ones. Among available optical
interconnection technologies for parallel computing, optical buses are probably the easiest
to implement. Designing a bus-based parallel computer architecture requires taking both
computation and communication aspects into consideration. We proposed a linear array
architecture based on a pipelined TDM optical bus. We showed that using the conditional-delay
and coincidence pulse techniques, several fundamental operations can be carried out
very efficiently on our bus system. We also demonstrated how to design parallel algorithms
on our linear architecture. Our bus structure is simpler and more powerful than the one
proposed in [26], and much simpler than and as powerful as the bus of [57].
Our linear array can be used as a building block to construct processor arrays of multiple
dimensions to achieve better scalability and performance. For example, Pavel and
Akl proposed to use pipelined TDM optical buses of [26] to implement a two-dimensional
reconfigurable array [51]. They call this architecture an array with reconfigurable optical
buses (AROB). They showed that an AROB is a powerful parallel computing architecture.
In order to achieve their claimed reconfigurability, they require each processor to
be equipped with several high-speed counters to determine the processor ordering after
the buses are reconfigured by switches. Such a counter is used to count the number of
optical pulses in a sequence. It would be impractical for an electronic processor to match the
bit rate of an optical waveguide, and using optical counters would increase the system cost
significantly. Our optical bus provides a simple optical solution to the reconfigurability of
AROB.
We would like to note that the idea of programmable delays using electro-optically
switched fiber loops has been used in the designs of TDM time slot interchangers (e.g.
[30, 37, 66, 78]). Also, two optical bus structures with conditional delays have been
proposed in [26] and [57]; both are more complicated than the one presented in this work.
In the bus structure of [26], switches are on the transmitting segment of the bus, and
because of this, parallel computation on subarrays is difficult, if not impossible. To solve
this problem, an optical bus with considerably more switches was proposed in [57]. Such
a bus can be physically partitioned into several buses for parallel subarray computation.
Our bus structure is more powerful than the bus of [26], and more cost-effective than the
bus of [57].
The time for an optical signal to propagate distance D is $\tau$. Communication
among processors is performed in terms of bus cycles. A bus cycle consists of end-to-end
transmission of p consecutive packet slots. Since the end-to-end latency of an optical pulse
is no more than $(2p - 1)\tau + (p - 1)\omega$ time, a bus cycle takes no more than $3p\tau + (p - 1)\omega$
time. Let $T_c$ denote the processor execution time of an arithmetic (say, multiplication)
instruction. Then, a bus cycle takes $O\left(\frac{3p\tau + (p - 1)\omega}{T_c}\right)$ units of computation time. Pavel
and Akl [59] have argued that for reasonable size buses (say up to 1000 processors), the
duration of a bus cycle may be assumed constant and comparable to the time for a CPU
operation. The data transmission performance can be further improved if packets of a
new bus cycle are transmitted before all the packets of the previous bus cycle reach their
destinations. Suppose that each processor has n packets to be transmitted continuously.
Then, with overlapped bus cycles, the total time for transmitting the packets is
$(2p + np - 1)\tau + (p - 1)\omega$, and, if n is sufficiently large, on average each packet takes only about
$\tau$ time.
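The timing bounds in this paragraph are easy to evaluate numerically. The sketch below assumes the bus-cycle bound reads 3pτ + (p - 1)ω and the overlapped total reads (2p + np - 1)τ + (p - 1)ω, with τ and ω in the same time unit; the function names and the sample values are hypothetical and carry no measured data.

```python
def bus_cycle_bound(p, tau, omega):
    """Upper bound on the duration of one bus cycle: 3*p*tau + (p - 1)*omega."""
    return 3 * p * tau + (p - 1) * omega


def overlapped_total_time(p, n, tau, omega):
    """Total time when each of p processors sends n packets back to back."""
    return (2 * p + n * p - 1) * tau + (p - 1) * omega


def time_per_packet(p, n, tau, omega):
    """Average time per packet over all n*p packets with overlapped bus cycles."""
    return overlapped_total_time(p, n, tau, omega) / (n * p)
```

With p fixed, `time_per_packet` approaches `tau` as n grows, because the fixed latency terms are amortized over n*p packets. This is the sense in which each packet takes "only about τ time".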
Despite the remarkable effect of pipelined data transmission, we still need to be careful
when we assess the performance of such an optical bus. To avoid misleading time complexity
claims of a linear array of processors connected by such a pipelined optical bus, it
is necessary to separate the communication time from computation time. As in sequential
algorithm analysis, we assume that each parallel computation step takes constant time.
The communication time is measured in terms of bus cycles. When p is small, a bus cycle
can be assumed to take constant time. However, as p becomes larger, the time for a bus
cycle increases proportionally. Of course, for large p, optical amplifiers are needed. The
communication power of such a bus also depends on the capabilities of optical transmitters
and receivers associated with processors. We conservatively assume that in each bus cycle,
each processor can transmit at most one packet and receive at most one packet.
Chapter 4
Asynchronous Optical TDM Buses
As we discussed before, the TDM multiaccess methods can be divided into two categories:
fixed assignment and demand assignment. The major disadvantage of the fixed-assignment
TDM (FA-TDM) is the requirement that the packet slots for each processor are fixed
regardless of whether or not it has a packet to transmit. If a processor does not have a packet
to send or receive, the packet slot assigned to this processor will be wasted. A demand-assignment
TDM (DA-TDM) allocates packet slots to processors dynamically according
to their demands and the traffic situation. It belongs to the class of asynchronous TDM
(ATDM) multiaccess methods. If the packet generating rate of a processor is uniform, the
FA-TDM is very efficient; otherwise, DA-TDM should be considered.
In this chapter, we introduce the idea of a flagged packet. We then use this idea
to modify the known pipelined optical bus structure so that a demand-assignment TDM
multiaccess method using a linear priority scheme can be implemented by hardware. To
improve the fairness of the DA-TDM multiaccess method, we incorporate reconfigurability
into our pipelined optical bus to implement the round-robin priority scheme. The
scheduling of the DA-TDM multiaccesses with the round-robin priority scheme is implemented
by reconfiguring the bus in hardware. We compare the performance of our
pipelined DA-TDM optical bus with the pipelined FA-TDM optical bus by simulations,
in terms of average message response time and fairness. Our experiments show that the
performance of our pipelined DA-TDM optical buses is significantly better than that of
pipelined FA-TDM optical buses. We also discuss the possibilities of using our buses to 
construct multichannel switches and multidimensional processor arrays.
4.1 ATDM with Linear Priority
In this and the next two sections, we introduce two possible implementations of a DA-TDM
bus and evaluate their performance against that of an FA-TDM bus.
We modify the bus structure described in Chapter 2 as follows. First, we add one
more loop on the receiving segments of the reference and message waveguides. These
two loops are in the positions aligned with processor $P_{n-1}$. Then, we introduce n loops
of unit delay on the transmitting segment of each of the reference, select and message
waveguides. These loops are in the positions aligned with processors 0 to n - 1. For each
of the reference and message waveguides, we introduce n additional taps to processors on
its transmitting segment. For the relative positions of the processors, loops and taps, refer
to Figure 4.1.
Figure 4.1: The configuration of a DA-TDM bus (waveguides shown: message, reference and select).
The operation of our pipelined DA-TDM optical bus is simple. If a packet slot carries
a packet, we call it a full packet slot; otherwise, we call it an empty packet slot. When a
processor wishes to transmit a packet, it captures the first empty packet slot that passes
by and loads the packet. Once a packet slot contains a packet, it becomes a full packet
slot, and it will not be loaded again.
A packet consists of two parts: header and data frame. For a packet of the bus
described in Chapter 2, the header is an address frame. We add one pulse slot, called a
flag, to the header, as shown in Figure 4.2 for the new bus. There is always a flag pulse
at the flag position of a packet slot on the reference waveguide. These pulses are injected
from the head processor, processor $P_{n-1}$. The state of a flag for an empty packet slot is
represented by a flag pulse on the reference waveguide only. The state of a flag for a full
packet slot is represented by a coincidence of the light pulses on the reference waveguide
and the message waveguide. If the rightmost processor wishes to send a packet, it sends
the flag pulse on each of the reference and message waveguides in the flag pulse slot, and
then loads the packet slot with a packet. For all other processors, their operations are
slightly different. If a processor wishes to send a packet but detects a coincidence of the
light pulses on the reference and the message waveguides in a flag position, it has to wait
because a full packet slot is passing by. Otherwise, it sends a light pulse on the message
waveguide at the flag position, sets the unary address in the address frame and loads
the data frame. The coincidence of flag pulses on the reference and message waveguides
ensures that the processors down the stream will not load this packet slot. The transition
from one flag state to the other is shown in Figure 4.3, in which the transmissions of the
address frame and data of a packet are omitted.
Figure 4.2: A packet with a flag (fields: flag, address frame, data).
Assume that reading a reference flag pulse and making a decision take unit time.
Without delaying the flag pulse on the reference waveguide, it would not be possible to
send a pulse at the same position on the message waveguide. The purpose of introducing
loops to transmitting segments of all waveguides is to allow sufficient time for sending a flag
pulse on the message waveguide after detecting an empty packet slot, while maintaining 
the correctness of the unary addressing scheme described in the previous section.
Figure 4.3: The implementation of the flag (reference and message waveguides). (a) An empty packet is coming. (b) A full packet is leaving.
We have extended the coincidence pulse technique to a solution to the problem of
detecting the state of a packet slot. This demand-driven packet slot assignment method
is equivalent to a linear priority reservation scheme. Processor $P_i$ has a priority higher than
the priority of $P_{i-1}$. Thus, the head processor and the tail processor have the highest and
lowest priority, respectively. When competing for packet slots, the processor that has
the highest priority among all competing processors will be the one to succeed. As long
as there are processors, regardless of where they are, demanding data transmissions, the
packet slots are fully utilized. However, the processors with lower priorities may suffer
from starvation, a situation in which they are not able to access the bus because of the
existence of demands from processors of higher priorities.
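The slot-capture rule and the starvation effect can be illustrated with a toy model. This is a sketch under strong simplifying assumptions: processors are listed by priority rank (index 0 is the head processor), no new packets arrive during the run, and propagation and loop delays are ignored; the function name and the Boolean `full` standing for the flag state are hypothetical.

```python
def assign_slots_linear(queues, num_slots):
    """Count packets sent per processor under linear priority.

    queues[i] is the number of packets waiting at the processor with
    priority rank i (0 = head processor, highest priority).
    """
    pending = list(queues)
    sent = [0] * len(pending)
    for _ in range(num_slots):
        full = False                   # flag state: the slot starts empty
        for i in range(len(pending)):  # the slot passes processors in priority order
            if not full and pending[i] > 0:
                pending[i] -= 1        # processor i loads its packet
                sent[i] += 1
                full = True            # coincidence of flag pulses marks the slot full
    return sent
```

With three processors holding three packets each and only four slots, the head processor sends all three of its packets, the next sends one, and the last sends none. Every slot is used while the lowest-priority processor starves.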
We would like to note that the existence of loops in the positions aligned with the
rightmost processor, $P_{n-1}$, on the transmitting segments of the three waveguides, and the
receiving segments of the reference and message waveguides has no effect on the operations
described above. They can be deleted. They are included for the purpose of simplifying
our presentation of a reconfigurable bus in the next section.
4.2 ATDM with Round-Robin Priority
For improved performance, a more sophisticated priority scheme should be used. In this 
section, we propose an asynchronous TDM optical bus that uses the round-robin priority
scheme. In such a scheme, processors are arranged as a circular queue. The priorities 
of processors are reassigned in a rotating manner. More specifically, the current head 
processor, which has the highest priority, will become the tail processor with the lowest 
priority after a specified period of time, and at the same time all remaining processors’ 
priorities are incremented. This priority reassignment will be done periodically.
Consider the structure obtained from the bus structure discussed in the previous section
as follows: for each waveguide, remove its U-segment and close the two open ends
of the transmitting segment and receiving segment to form two rings. Then, for each
waveguide, we have two rings as shown in Figure 4.4. The inner ring corresponds to the
receiving segment, and the outer one corresponds to the transmitting segment. The loop
delays on the transmitting waveguides in the position aligned with processor $P_{n-1}$ that
appeared to be redundant in the bus structure described in the previous section now play
the role of making these ring structures symmetric.
Figure 4.4: The ring structures constructed from a DA-TDM bus. (a) The ring corresponding
to the select waveguide. (b) The ring corresponding to the reference and
message waveguides.
In order for this ring structure to operate as a bus, we incorporate reconfigurability
into the ring. The new structure is abstracted in Figure 4.5, where a box includes a switch
(represented by a dashed smaller box), taps, and loops. The switch has two states. In one
state, the straight state, the switch lets the two waveguide segments pass through it.
In the other state, which we call the U state, the switch cuts the segments and connects
one pair of open ends, as shown in Figure 4.6. Such a switch can be implemented by two
2 × 2 switches that have the straight and cross states. The Ti:LiNbO3 switches of [7] can
be used for this purpose. Note that a waveguide of length D that forms a U turn in the
U state is included between the two 2 × 2 switches (see Figure 4.6).
If one switch is set to the U state but all remaining switches are set to the straight state,
the ring is broken and a folded bus is formed. We use $B_k$ to denote the bus configuration
in which $P_k$ is the head processor, $P_{k'}$, where $k' = (k + n - 1) \bmod n$, is the tail processor,
and all processors are assigned new linearly ordered indices from 0 to n - 1 such that the
new indices of $P_k$ and $P_{k'}$ are 0 and n - 1, respectively. Setting any switch to the U state
corresponds to a unique linear priority scheme. If the switches are selected to be set to
the U states in a circular way, the round-robin priority scheme can be implemented. This
is exactly what we are going to do. We assume that the initial bus configuration is $B_0$.
Suppose that the current configuration is $B_j$; then the next configuration is $B_{(j+1) \bmod n}$.
For n = 8, the system will be reconfigured as $B_0, B_1, B_2, \ldots, B_7, B_0, B_1, B_2, \ldots$ in sequence.
From $B_j$ to $B_{(j+1) \bmod n}$, all we need to do is to set the switch associated with $P_j$ to the
straight state and the switch associated with $P_{(j+1) \bmod n}$ to the U state simultaneously.
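The index arithmetic behind the configurations can be written down directly. In the sketch below the helper names are hypothetical, and the formula (i - k) mod n for the new index of a processor in configuration B_k is our reading of "linearly ordered indices from 0 to n - 1" along the ring, not a formula quoted from the text.

```python
def tail_of(k, n):
    """Tail processor of configuration B_k: k' = (k + n - 1) mod n."""
    return (k + n - 1) % n


def new_index(i, k, n):
    """New linear index of processor P_i in B_k (assumed: ring order from the head)."""
    return (i - k) % n


def reconfigure(j, n):
    """Step from B_j to the next configuration.

    Returns (next configuration, processor whose switch goes straight,
    processor whose switch goes to the U state).
    """
    nxt = (j + 1) % n
    return nxt, j, nxt
```

Starting from B_0 with n = 8 and applying `reconfigure` repeatedly visits configurations 0, 1, ..., 7 and then wraps around to 0, as in the sequence given above.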
The operations of this bus are divided into overlapped phases, counting from the 0-th
phase with bus configuration $B_0$. Assume that a switch takes $\delta$ time to switch from
one state to another, and let $D > L + \delta + 1$. Each phase, which consists of m, m > n,
consecutive packet slots, is divided into two subphases. The first subphase consists of the
first m - n packet slots, and the second subphase consists of the remaining n packet slots.
The first slot and the last slot in the second subphase are not used. These two packet
slots are called unused slots. Let $t_i$ be the time when the i-th phase starts. Assume that
$t_0 = 0$, and define $t_i = t_{i-1} + mD$, i > 0. For simplicity, we assume D = 1 and then we
have $t_i = t_{i-1} + m$. At time $t_0 = 0$ the system is in configuration $B_0$. The system changes
Figure 4.5: DA-TDM bus with hardware round-robin priority scheme. (a) The reference and message waveguides. (b) The select waveguide.
Figure 4.6: Switch implementation (states: straight and U).
its configuration at times $t_i - 1$, for i > 0; equivalently, the system starts a new configuration at
times $t_i$. We now explain how the system works.
Consider the 0-th phase. In its first subphase, the address frames of packet slots are
set assuming configuration $B_0$ (refer to Figure 4.7(a)). In the second subphase of the
0-th phase, although the system remains in configuration $B_0$, packet slots are loaded with
address frames set assuming configuration $B_1$. At time $t_1 - 1 = m - 1$, all packets of
the first subphase of phase 0 have passed processor $P_{n-1}$ on the receiving segment of $B_0$, the
first unused slot of the second subphase is on the U segment of $B_0$, and all loaded packets
of the second subphase are still on the transmitting segment of $B_0$. At this time, the
switches associated with $P_0$ and $P_1$ change their states simultaneously, and consequently,
the system becomes $B_1$. Since the first and the last slot of the second subphase of phase
0 are unused, this state change will not cause any packet loss. Since the address frames
of packets of the second subphase of the 0-th phase are set according to $B_1$, these packets
will use $B_1$ to reach their destinations. There is a problem with this bus reconfiguration.
Right before the reconfiguration at time $t_1$, the last n - 1 packets of the first subphase of
phase 0 are on the receiving segment of $B_0$. At time $t_1 - 1 + \delta$, the bus is reconfigured, and
there is no way for these packets to reach $P_0$. Thus, we insist that the last n - 1 packet
slots of the first subphase of phase 0 will not carry any packet with $P_0$ as destination.
This can be done by programming processors to send packets to any processor other than $P_0$
using the last n - 1 packet slots of the first subphase.
At time $t_1$, the bus system starts to act as $B_1$, and phase 1 is initiated (refer to Figure
4.7(b)). Now, the last packet slot of phase 0 becomes the first slot of phase 1. In the first
subphase of phase 1, the address frames of packet slots are set according to configuration
$B_1$. In the second subphase of phase 1, although the system remains in configuration
$B_1$, packet slots are loaded with address frames set assuming configuration $B_2$. At time
$t_2 - 1$, all packets of the first subphase of phase 1 have passed processor $P_0$ on the receiving
segment of $B_1$, the first unused slot of the second subphase is on the U segment of $B_1$,
and all loaded packets of the second subphase are still on the transmitting segment of $B_1$.
At this time, the switches associated with $P_1$ and $P_2$ change their states simultaneously,
and consequently, the system becomes $B_2$. Since the first and the last slot of the second
subphase of phase 1 are unused, this state change will not cause any packet loss. The last
n - 1 packet slots of the first subphase of phase 1 will not carry any packet with $P_1$ as
destination. In the next phase, the configuration $B_2$ is used, as shown in Figure 4.7(c).
Figure 4.7: DA-TDM bus configurations: (a) $B_0$ at time $t_0$, (b) $B_1$ at $t_1$ and (c) $B_2$ at $t_2$.
In general, this system operates as follows. If the system at time $t_j$ has a configuration
$B_j$, then the system is reconfigured at time $t_{j+1} - 1$, and at time $t_{j+1}$, the system's
configuration becomes $B_{(j+1) \bmod n}$. The last packet slot, which is unused, of phase j
becomes the first slot of phase j + 1. In the first subphase of phase j + 1, the address
frames of packet slots are set assuming configuration $B_{(j+1) \bmod n}$. In the second subphase
of phase j + 1, packet slots are loaded with address frames set according to configuration
$B_{(j+2) \bmod n}$. The last n - 1 packet slots of the first subphase of phase j + 1 will not carry
any packet with $P_{(j+1) \bmod n}$ as destination. By a simple induction, it is easy to verify
that, operating in this way, all packets will reach their destination processors. Consider
the case that n = 8. The bus configurations at time $t_0$, $t_1$ and $t_2$ are shown in Figure 4.7.
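The per-phase bookkeeping in the general rule can be tabulated with a small helper. This is a sketch assuming unit slot time (D = 1), so that phase i starts at time i·m; the function name and the dictionary keys are hypothetical labels for the quantities named in the text.

```python
def phase_plan(i, m, n):
    """Summarize phase i of the round-robin bus, assuming D = 1 and t_0 = 0."""
    return {
        "start": i * m,                             # t_i = t_{i-1} + m
        "configuration": i % n,                     # the bus acts as B_{i mod n}
        "second_subphase_addresses": (i + 1) % n,   # address frames assume the next configuration
        "excluded_destination": i % n,              # barred in the last n - 1 slots of subphase one
        "reconfigure_at": (i + 1) * m - 1,          # switches change at t_{i+1} - 1
    }
```

For n = 8 and m = 16, `phase_plan(7, 16, 8)` reports configuration 7, with second-subphase address frames already set for configuration 0, which is the wrap-around of the round-robin order.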
Using this DA-TDM with round-robin priority, m - 1 out of every m packet slots can
carry packets. If m is too large, then the system may behave like one with a linear priority
scheme. But if m is too small, the packet slots may not be effectively utilized. We can
Figure 4.8: DA-TDM bus with double-input processors. Configurations: (a) $B_0$, (b) $B_1$
and (c) $B_2$.
select the m value that achieves the best performance. To further improve the performance
of this bus architecture, an additional input from each waveguide can be introduced for
each processor as shown in Figure 4.8. With double-input processors, the last n - 1 packet
slots of the first subphase of each phase can carry packets with arbitrary processors as
their destinations.
4.3 Simulation
In this section, we evaluate the performance improvement obtained by our DA-TDM
optical bus systems over the FA-TDM optical bus systems. The performance is measured
in terms of average message response time. Let $\tau$ be the time between the starting times
of two consecutive packet slots. Message response time is the elapsed time from the
time a message is generated for transmission to the time that the message transmission is
completed. We use $n\tau$ as the unit to measure the response time, where n is the number
of processors connected by the bus. The average message response time (AMRT) is found
by dividing the total response time of all the messages by the number of messages. Of
course, the smaller the AMRT, the better the performance.
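The AMRT definition translates into a one-line computation. The sketch below assumes each message is recorded as a pair of generation time and completion time in the same unit as τ; the function name and the record layout are hypothetical, and the sample numbers in the usage note are made up for illustration only.

```python
def amrt(messages, n, tau):
    """Average message response time, expressed in units of n * tau.

    messages: iterable of (generated_at, completed_at) time pairs.
    """
    messages = list(messages)
    total = sum(done - born for born, done in messages)  # total response time
    return total / len(messages) / (n * tau)
```

For instance, two messages with response times 200τ and 400τ on a bus with n = 100 give an AMRT of 3 in units of nτ.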
In our simulations, m = 2n and n = 100. A negative exponential probability distribution
is assumed for the message transmission demands. We also assume that each processor
has the same message generating rate per $n\tau$ time. A uniform probability distribution is
used for message sizes. The size of a message is the number of packets in the message.
We consider three cases of randomly generated messages: the average message sizes 10, 50
and 100. Let $r_i$ be the average packet generation rate of processor $P_i$; we define the bus
load $\alpha = \sum_{i=0}^{n-1} r_i$; clearly $0 < \alpha < 1$. With a fixed bus load, the AMRT may vary when
different bus systems are used.
Figure 4.10 shows the comparison of the AMRTs of the FA-TDM bus and the DA-TDM
bus with linear priority. The average message size is 10. The figure shows that the
AMRT of our DA-TDM bus with linear priority is insensitive to the bus load, and it is
significantly smaller than the AMRT of the FA-TDM bus. For example, when bus load
$\alpha = 0.5$, the AMRT of the DA-TDM bus with linear priority is $0.16n\tau$ while the AMRT
of the FA-TDM bus is $15n\tau$. When bus load $\alpha = 0.9$, the AMRT of the DA-TDM bus with
linear priority and the FA-TDM bus are $0.60n\tau$ and $56.88n\tau$, respectively. For this case,
the AMRT of the DA-TDM bus is about 85 times shorter than that of the FA-TDM bus.
Figure 4.9: Relation between the AMRT and the message size for a DA-TDM bus with
linear priority (curves: message size = 10, 50 and 100; horizontal axis: bus load).
Figures 4.9 and 4.11 show the relation between the AMRT and the average message
size for a DA-TDM bus and an FA-TDM bus, respectively. It is easy to see that for the
FA-TDM bus, the AMRT is lower-bounded by the average message size. This is exactly
the problem of the FA-TDM bus. However, for the DA-TDM bus, when the bus load is
low, the average response time is very small. In our experiments, the performance of the
DA-TDM bus is two orders of magnitude better than that of the FA-TDM bus for the assumed traffic
pattern.
Figure 4.10: Comparison of the AMRTs of an FA-TDM bus and a DA-TDM bus with linear
priority (curves: DA-TDM and FA-TDM; horizontal axis: bus load).
The unfairness of the DA-TDM bus with linear priority scheme is shown in Figure 4.12.
Unfairness becomes severe when the bus load is larger. However, we observed that even
when the bus load reaches 0.9, the average message response time for the processor with
the lowest priority is still less than that of the same processor for the FA-TDM bus system.
The simulation results of our DA-TDM optical buses using the round-robin priority 
are consistent with our expectation. Figure 4.13 plots the average response time of each 
processor. The curve is the result of simulating a bus running for lO^nr time. If the simulation
time were sufficiently longer, we would expect a smooth, flat curve. Figure 4.14 shows the
AMRT with respect to different bus loads, which is about the same as the AMRT of the
DA-TDM bus using linear priority. In conclusion, our round-robin priority scheme does
solve the unfairness problem, while maintaining the same AMRT as the DA-TDM bus with
the linear priority scheme.
Figure 4.11: Relation between the AMRT and the message size of an FA-TDM bus.
(Curves for message sizes 10, 50, and 100.)
Figure 4.12: The unfairness of the DA-TDM with linear priority scheme. (Plotted against
processor index, 0 to 100, for bus loads 0.1, 0.3, 0.5, 0.7, and 0.9.)
Figure 4.13: The unfairness of the DA-TDM bus with the round-robin priority scheme.
(Plotted against processor index, 0 to 100.)
Figure 4.14: The AMRT of the DA-TDM bus with the round-robin priority scheme.
(AMRT plotted against bus load for message size 10.)
4.4 Summary and Discussions
We have introduced a pipelined asynchronous TDM optical bus based on the coincidence pulse
technique. We also proposed a reconfigurable version of this bus to solve the unfairness
problem. Our simulation results indicate that the performance of our TDM bus is much
better than the performance of its FA-TDM counterpart. In this section, we conclude this
chapter by mentioning two possible generalizations of our reconfigurable DA-TDM bus.
Our bus architecture can be used to implement the $n \times n$ switch shown in Figure 4.15(a).
There are $n$ input channels $I_k$ and $n$ output channels $O_k$, $0 \le k \le n - 1$. For simplicity,
the buffering queues of these channels are omitted in the figure. We can implement such
a switch by a DA-TDM bus with the round-robin priority scheme in the way shown in Figure
4.15(b). For each input channel, there is a device responsible for injecting packets into the
waveguides, and for each output channel, there is a device responsible for detecting the
coincidence pulses and picking up packets. The performance of such a switch is expected
to be good, as indicated by our simulation results given in the previous section.
Figure 4.15: Implementing a switch using a DA-TDM bus. (a) An $8 \times 8$ switch. (b) A
reconfigurable DA-TDM bus with 8 pairs of I/O devices.
There are several factors that constrain the size of a bus-connected system: the bus
fan-out, the power distribution problem, the length of unary addresses, and the increased
latency when the number of processors is large. To improve the scalability, we can use our
DA-TDM bus as a building block to construct processor arrays. We construct an $n \times n$
two-dimensional processor array as shown in Figure 4.16(a), where our DA-TDM buses
with the round-robin priority scheme are used to connect rows and columns. This structure is
symmetric in the sense that the rows and columns of the array are connected by identical
buses. The row communications and column communications are separated.
Figure 4.16: (a) A two-dimensional processor array. (b) A physical arrangement of the
array.
This processor array has advantages in scalability over the ASOS architecture proposed
in [65]. The structure of our processor array can be easily extended to higher
dimensions. For example, a three-dimensional array can be constructed by introducing
the same DA-TDM buses along the third dimension. It is hard, if not impossible, to extend
the ASOS processor array with all-optical spanning buses to higher dimensions. For an
$n \times n$ array using DA-TDM buses, the address frame contains $n$ pulse slots. But for the
ASOS of the same size, the address frame has $2n - 1$ pulse slots. Even for the $n \times n \times n$
3-D array with DA-TDM buses, the size of the address frame remains $n$. Since the row
buses and column buses in our 2-D array are totally separated, the power distribution
for our array is simpler. Furthermore, a separate clock can be used to control each bus.
Simplified clock distribution allows for a larger system to be built. We can even drop the
SIMD assumption by letting each processor have its own clock. In such a situation, the
flag signal on the reference waveguide has an additional role of acting as a synchronization
signal, and the loops on the transmitting segment of each waveguide are lengthened
to allow sufficient time for clock synchronization. The structure of our 2-D array using
reconfigurable DA-TDM buses is essentially a torus. The adjacent processors in a torus
can be laid out with equal physical wiring separations as shown in Figure 4.16(b) to avoid
potential wiring problems.
The major disadvantage of our 2-D array is that when two processors on different
rows and/or columns communicate, intermediate O/E and E/O conversions are required.
For very sparse communications, the conversion overheads may contribute to a slowdown
factor, compared with all-optical communications. Nevertheless, we believe that the actual
performance of our 2-D processor array can be much better than that of the ASOS
architecture of [65]. There are several reasons: (1) Our buses use DA-TDM, rather than
FA-TDM, so that the bandwidths of waveguides are fully utilized. (2) The pipelined row
communications and column communications can be chained (for pipeline chaining, refer
to [32]) so that the extra latency caused by conversions can be hidden (for latency hiding,
refer to [34]). (3) Using the round-robin priority scheme, fairness is enforced without
losing the gain in average message response time over the FA-TDM. (4) The structure of
ASOS is not symmetric, and it only allows X-Y dimension ordering routing paths. Due to
the symmetry of row buses and column buses of our array, the routing paths can be more
flexible. Both X-Y and Y-X dimension ordering routing paths can be used simultaneously
for different source-destination pairs to avoid traffic congestion. Adaptive routing
algorithms that route packets according to network conditions can be incorporated.
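
As a hedged illustration of reason (4), the sketch below lists the two dimension-ordered
paths between a source and a destination in the 2-D array. It assumes that processors are
addressed as (row, column) pairs, that one bus transfer can reach any processor on the same
row or column bus, and that "X-Y" means "row bus first". The function names `xy_path` and
`yx_path` are hypothetical and do not come from the original design.

```python
def xy_path(src, dst):
    """Row bus first, then column bus (assumed meaning of X-Y ordering)."""
    (r1, c1), (r2, c2) = src, dst
    path = [src]
    if c1 != c2:
        path.append((r1, c2))  # one transfer on the row bus of row r1
    if r1 != r2:
        path.append((r2, c2))  # one transfer on the column bus of column c2
    return path


def yx_path(src, dst):
    """Column bus first, then row bus (assumed meaning of Y-X ordering)."""
    (r1, c1), (r2, c2) = src, dst
    path = [src]
    if r1 != r2:
        path.append((r2, c1))  # one transfer on the column bus of column c1
    if c1 != c2:
        path.append((r2, c2))  # one transfer on the row bus of row r2
    return path


if __name__ == "__main__":
    print(xy_path((0, 0), (2, 3)))
    print(yx_path((0, 0), (2, 3)))
```

When source and destination differ in both coordinates, the two paths use different
intermediate processors, so the O/E and E/O conversion load can be spread over two
processors instead of one.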
Chapter 5
Processor Arrays Connected by
Segmented Buses
The realization of general-purpose, massively parallel computers hinges largely on being 
able to build scalable interprocessor networks. The full advantage of parallelism can be 
realized if processors are fully connected (as in a complete graph). In reality, a  fully- 
connected system is either too costly or impossible to build. The communication cycle 
may far exceed the processor cycle, due to many hardware limitations. Interprocessor 
communication is the bottleneck in the overall system performance.
A multiple-bus system as an interconnection network for shared memory multiprocessors
has been extensively studied in the literature. However, due to bandwidth limitations,
shared buses are not suitable for exploiting large parallelism in applications. It has been
proposed to augment low-dimensional point-to-point networks, such as meshes, by multiple
buses to improve broadcasting performance [60, 74, 81], as shown in Figure 5.1. A
number of multiprocessors connected by multiple reconfigurable buses have also been
proposed. Examples include, among many, the reconfigurable multiple bus machine [79], the
bus automaton [69], the reconfigurable mesh [51] (Figure 5.2), the mesh with hyperbus [28]
(Figure 5.3) and the polymorphic torus [42]. The common feature of these machine models
is that the bus configurations can change under program control. Some of these models
have been shown to be surprisingly powerful because dynamically reconfigurable communication
paths are used to perform tasks that can be done only by processors in other parallel
architectures. For example, using bus control techniques to reconfigure communication 
paths as an integral part of computation, it is shown in [36, 46, 53, 54] that n  numbers 
can be sorted in constant time on an n x n reconfigurable mesh.
Figure 5.1: A 4x5 mesh with multiple broadcasting. (Legend: local link, row bus, column bus.)
Figure 5.2: The reconfigurable mesh architecture. (Legend: processor, switch, reconfigurable bus.)
Recent advances in optical interconnect technologies have drastically changed the landscape
of interconnection schemes. Recently, massively parallel computing using optical
interconnections has received considerable attention. Photons are non-charged particles,
and do not naturally interact. Consequently, there are many desirable characteristics
of optical interconnects: high speed (speed of light), increased fanout, high bandwidth,
high reliability, longer interconnection lengths, low power requirements, and immunity to
EMI with reduced crosstalk. The characteristics of optical interconnects have significant
Figure 5.3: A 4x4 mesh with hyperbuses.
system configuration and complexity implications. Multiple-bus configurations with increased
scalability are possible because of relaxed fanout and distance constraints. The
optical fanout (which is the maximum number of processors that can be attached to an
optical connecting device) is not bound by capacitance but by the power that must be
delivered to each receiver to maintain a specified bit-error-rate. Processors can be arranged
at increased physical distances. Several optical bus structures have been proposed
to utilize the advantages of fiber optics (e.g. [27, 41, 49, 57, 59, 62, 80]).
In this work, we propose a class of reconfigurable buses, called segmented buses. A
segmented bus connecting $p$ processors, denoted by $B(p)$, is a bus that can be dynamically
partitioned by switches into several segments, each connecting a subset of processors. We
also generalize the concept of segmented bus to obtain parallel architectures of higher
dimensions, called $k$-dimensional meshes connected by segmented buses ($k$-D MCSB). We
show that the segmented bus and the $k$-D MCSB are versatile parallel computing architectures
by showing that they can simulate a wide variety of useful network structures.
In particular, we show that $B(p)$ can simulate any linear array or ring of no more than $p$
processors with a constant slowdown factor, and $B(p)$ can simulate a $(2p - 1)$-processor
complete binary tree, X-tree and one-dimensional multigrid with an $O(\log p)$ slowdown
factor. Then, we use these results to show that a $k$-D MCSB can simulate a $k$-D mesh or
torus with a constant slowdown factor, and that an $N \times N$ MCSB can simulate an $N \times N$ mesh-
of-trees, an $N \times N$ multigrid network and an $N \times N$ pyramid network with an $O(\log N)$
slowdown factor. The discussion would not be complete without considering the algorithmic
aspect of the segmented bus based architecture. We demonstrate the advantages of parallel
architectures based on segmented buses by giving a parallel algorithm for the prefix
computation problem.
5.1 Segmented Buses
A segmented bus is a bus that can be dynamically partitioned into several segments by
switches. A segmented bus $B(p)$ is obtained as follows. Given $p$ processors $P_i$, $0 \le i < p$,
we connect them by a bus in the linear order of their indices. We use $p$ to indicate the size
of the bus. Considering a bus as a line, the $p$ connection points on the line (bus) divide the
line (bus) into $p + 1$ intervals. Then, we select a subset of $m$ intervals, and add a switch in
each of these intervals. These switches divide the bus into $m + 1$ segments, which are called
basic bus segments. By dynamically setting the switches, the bus is partitioned into a set
of disjoint segments, which are called compound bus segments. A compound bus segment
consists of a series of basic bus segments. For simplicity, we also refer to a compound bus
segment as a sub-bus. At any time instant, only the processors that are connected by a
common sub-bus can communicate with each other. Parallel data communication among
processors on a segmented bus is achieved by a sort of space division multiplexing.
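
The partition of a segmented bus into sub-buses can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: a switch is identified by the gap it
occupies (gap $g$ lies between $P_{g-1}$ and $P_g$), only interior gaps carry switches, and
the function name `compound_segments` is hypothetical.

```python
def compound_segments(p, switch_gaps, on):
    """Return the sub-buses (compound bus segments) as lists of processor indices.

    p           -- number of processors P_0 .. P_{p-1}
    switch_gaps -- sorted gap positions; gap g lies between P_{g-1} and P_g
    on          -- on[t] is True if the switch in switch_gaps[t] connects its
                   two neighbouring basic segments, False if it separates them
    """
    segments = []
    start = 0
    for gap, is_on in zip(switch_gaps, on):
        if not is_on:  # an "off" switch closes the current sub-bus
            segments.append(list(range(start, gap)))
            start = gap
    segments.append(list(range(start, p)))
    return segments


if __name__ == "__main__":
    # Eight processors with switches after every second processor.
    print(compound_segments(8, [2, 4, 6], [True, False, True]))
```

With all switches off the sub-buses coincide with the basic bus segments, and with all
switches on the whole bus forms a single sub-bus.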
We propose a class of segmented bus architectures called $2^i$-spacing segmented buses
that connect $p = 2^n$ processors, where $0 \le i < n$. The $2^i$-spacing segmented bus has $p/2^i - 1$
switches. The switches are inserted between every $2^i$ consecutive processors attached to
the bus; that is, the $2^i$-spacing bus has $p/2^i = 2^{n-i}$ basic segments of equal size. For $n = 4$,
we illustrate 1-spacing, 2-spacing and 4-spacing segmented buses in Figure 5.4. For $i < j$
and the same $n$, the $2^i$-spacing segmented bus is more costly than the $2^j$-spacing one, but the
former is more powerful than the latter. The class of $2^i$-spacing segmented buses provides
a wide range of cost/performance tradeoffs: linearly ordered processors may be connected
by one or more segmented buses, and/or the number of switches on a segmented bus may
be selected from several alternatives.
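
The switch counts for the three buses of size 16 can be reproduced with a small sketch.
It assumes the same gap encoding as before (gap $g$ lies between $P_{g-1}$ and $P_g$); the
function name `spacing_switch_gaps` is hypothetical.

```python
def spacing_switch_gaps(n, i):
    """Gaps that hold a switch on a 2**i-spacing segmented bus of size p = 2**n."""
    p, step = 2 ** n, 2 ** i
    return list(range(step, p, step))  # one switch after every 2**i processors


if __name__ == "__main__":
    for i in range(3):
        gaps = spacing_switch_gaps(4, i)
        # number of switches is p / 2**i - 1, number of basic segments is 2**(n - i)
        print(2 ** i, "spacing:", len(gaps), "switches,", len(gaps) + 1, "basic segments")
```

For $n = 4$ this gives 15, 7 and 3 switches for the 1-spacing, 2-spacing and 4-spacing
buses, which is the cost side of the tradeoff described above.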
Figure 5.4: 1-spacing, 2-spacing and 4-spacing segmented buses of size 16. (Legend: switch, processor.)
The hardware implementations of a segmented bus can be in either the electronic
domain or the optical domain. Segmented buses are more justifiable in the optical domain.
An optical bus system is usually implemented as a folded bus. In such a configuration,
each processor is attached to the bus twice, one attachment for reading (using a photo
diode) and the other for writing (using a laser diode), as shown in Figure 5.5. In Figure
5.6, we show how to implement a 2-spacing segmented folded bus. Each switch is a $2 \times 2$
electronically controlled optical device [3, 6, 73], which can be in one of two states, straight
and cross. Switches are grouped into pairs, and both switches in a pair are set to the same
state, straight or cross, at the same time. These switches allow for all-optical paths
without intermediate O/E and E/O conversions.
Figure 5.5: Folded bus configuration.
Figure 5.6: 2-spacing segmented folded bus. (The figure also shows the two switch states.)
Bus communications can be either synchronous or asynchronous. In asynchronous
mode, arbiters are needed to allocate the bus to processors in an on-line fashion. Since
there are $m = 2^{n-i} - 1$ switches on the bus, the total number of possible sub-bus partitions
is $2^m = 2^{2^{n-i} - 1}$. While dynamic bus partition, which requires one bit per switch, is easy
to achieve, the complexity of bus access control poses major difficulties. One approach
is to make the arbiters also partitionable. Segmented buses are more suitable for synchronous
communication. We can equip each processor with off-line circuitry so that
both segment partitions and sub-bus allocations, although operated dynamically, are
predetermined by an off-line scheduling algorithm. With the off-line bus allocation assumption,
the bus partitions can be “compiled” in advance. The advantage of this method is that
the complexity of system design can be reduced because a handshaking mechanism, which
is necessary in an on-line environment, and arbiters can be omitted.
Because of physical limitations, the number of processors and the number of switching
elements attached to a segmented bus cannot be very large. To achieve higher scalability, we
generalize the notion of segmented bus to obtain parallel architectures of higher dimensions.
We define a $2^{nk}$-processor $k$-dimensional mesh connected by $2^i$-spacing segmented
buses ($k$-D MCSB), denoted by $M_k(2^n, 2^i)$, as follows: the processors, which are denoted
by $P_{i_0, i_1, \ldots, i_{k-1}}$, $0 \le i_j < 2^n$, are connected by $2^i$-spacing segmented buses of size $2^n$, each
connecting processors with the same $k - 1$ processor indices. An $M_2(8, 2)$, an $8 \times 8$ mesh
connected by 2-spacing segmented buses, is shown in Figure 5.7.
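
The definition can be illustrated by enumerating the buses of a small MCSB. The sketch
below is an assumption-laden illustration: processors are modeled as index tuples, the
switches on each bus are ignored, and the function name `mcsb_buses` is hypothetical.

```python
from itertools import product


def mcsb_buses(k, size):
    """List the buses of a k-D MCSB with `size` processors along each dimension.

    Each bus is the list of processors (index tuples) that agree on all
    indices except the one at position `dim`.
    """
    buses = []
    for dim in range(k):
        for fixed in product(range(size), repeat=k - 1):
            bus = [fixed[:dim] + (v,) + fixed[dim:] for v in range(size)]
            buses.append(bus)
    return buses


if __name__ == "__main__":
    buses = mcsb_buses(2, 8)  # the 8 x 8 example of Figure 5.7
    print(len(buses), "buses of size", len(buses[0]))
```

In this model every processor lies on exactly $k$ buses, one per dimension, and there are
$k \cdot 2^{n(k-1)}$ buses in total.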
Figure 5.7: A 2-D MCSB $M_2(8, 2)$. (Legend: switch, processor.)
5.2 Versatility of Parallel Architectures Based on Segmented
Buses
Parallel architectures based on segmented buses, especially those of low dimensions, are
feasible for implementation. They have low wire densities and small diameters, and a processor
in such a system has a small number of I/O ports. The control of switches in such a system
is less complex than in most other parallel models based on reconfigurable buses, such as
reconfigurable meshes. To justify that this class of architectures is suitable for general-purpose
parallel processing, we need to show that they perform well for a large range of
applications. A parallel model is considered versatile if it can efficiently simulate many
other useful parallel machine models. If machine $M_1$ can simulate machine $M_2$ efficiently,
then any algorithm developed on machine $M_2$ is portable to machine $M_1$. In this section,
we compare segmented-bus based architectures with several point-to-point network based
architectures. We demonstrate that architectures based on segmented buses are versatile
by showing that they can efficiently simulate many useful machine models.
For simplicity, we assume synchronous computation and communication modes. That
is, a parallel computation process is partitioned into computation and communication
steps. In a computation step, a subset of processors execute an instruction. In a communication
step, each processor communicates with a subset of processors that are directly
connected to it. Let $M$ be a parallel machine whose underlying interprocessor connection
structure is a point-to-point network (a conventional graph), i.e., pairs of processors are
connected by dedicated links. We define a maximum communication step of machine $M$ as
a communication step with all links of its interconnection network engaged in the
communication. When we consider simulating a communication step of $M$, we always consider its
maximum communication steps. This is a rather conservative approach, since not all
realizations of point-to-point networks have facilities that support all-port communications.
Similar techniques were used in [16].
Our discussions will be concentrated on 2-spacing segmented buses, because any claim
on such buses can be easily generalized to other segmented buses. Let us label the $p/2 - 1$
switches by numbers from 0 to $p/2 - 2$ in order from left to right. The $i$-th switch
is denoted as $S_i$.
5.2.1 Simulation of Linear Array
Consider the problem of simulating a $p$-processor linear array, where $p = 2^n$, by a 2-spacing
segmented bus $B(p)$. In the first step, all switches are set off (by which we mean that
the switch disconnects the two basic bus segments adjacent at the switch). All processors
$P_i$ such that $i$ is even can communicate with $P_{i+1}$ in parallel. In the second step, we
set a switch $S_i$ on (by which we mean that the two basic bus segments adjacent at $S_i$ are
connected) if and only if the rightmost bit of the binary representation of $i$ is 0. Then,
all processors $P_i$ such that $i \bmod 4 = 1$ can communicate with $P_{i+1}$ in parallel. In the
third step, we set a switch $S_i$ on if and only if the rightmost bit of its binary label is 1.
Then, all processors $P_i$ such that $i \bmod 4 = 3$ can communicate with $P_{i+1}$ in parallel.
Therefore, a 2-spacing segmented bus $B(p)$ can simulate any parallel communication step
of a $p$-processor linear array in at most 3 parallel communication steps. To simulate a
ring, we can add one more step to let $P_0$ and $P_{p-1}$ communicate by setting all switches
on. Therefore, we have
Theorem 1 A 2-spacing segmented bus $B(p)$, $p = 2^n$, can simulate any parallel communication
step of a $p$-processor linear array (resp. ring) in 3 (resp. 4) parallel communication
steps. ■
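
The three-step schedule behind Theorem 1 can be checked mechanically. The sketch below
is an illustration, not the author's simulator: it assumes that switch $S_j$ sits between
$P_{2j+1}$ and $P_{2j+2}$ and that each sub-bus carries one communicating pair per step.
The function names are hypothetical.

```python
def two_spacing_sub_buses(p, on):
    """Sub-buses of a 2-spacing bus of size p; on[j] is True if switch S_j is on."""
    buses, start = [], 0
    for j, is_on in enumerate(on):
        if not is_on:  # S_j off: the bus is cut between P_{2j+1} and P_{2j+2}
            buses.append(range(start, 2 * j + 2))
            start = 2 * j + 2
    buses.append(range(start, p))
    return buses


def linear_array_steps(p):
    """The three (switch setting, communicating pairs) steps described in the text."""
    m = p // 2 - 1  # number of switches S_0 .. S_{m-1}
    return [
        ([False] * m, [(i, i + 1) for i in range(0, p - 1, 2)]),
        ([j % 2 == 0 for j in range(m)], [(i, i + 1) for i in range(1, p - 1, 4)]),
        ([j % 2 == 1 for j in range(m)], [(i, i + 1) for i in range(3, p - 1, 4)]),
    ]


def covers_all_links(p):
    """True if the three steps realise every link (P_i, P_{i+1}) without conflicts."""
    covered = set()
    for on, pairs in linear_array_steps(p):
        buses = two_spacing_sub_buses(p, on)
        busy = set()
        for a, b in pairs:
            k = next(idx for idx, bus in enumerate(buses) if a in bus)
            if b not in buses[k] or k in busy:
                return False  # pair split by an off switch, or sub-bus used twice
            busy.add(k)
        covered.update(pairs)
    return covered == {(i, i + 1) for i in range(p - 1)}


if __name__ == "__main__":
    print(all(covers_all_links(2 ** n) for n in range(1, 8)))
```

For a ring, a fourth step with every switch on places $P_0$ and $P_{p-1}$ on one sub-bus
spanning the whole bus.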
Let $L' = (P_{i_0}, P_{i_1}, \ldots, P_{i_m})$, where $i_j < i_{j+1}$ for $0 \le j < m$, be a linearly ordered subset of the
linearly ordered processor set $L = (P_0, P_1, \ldots, P_{p-1})$. A linear array corresponding to $L'$
is a point-to-point network such that $P_{i_j}$ is connected to $P_{i_{j+1}}$, $0 \le j < m \le p - 1$. A
ring corresponding to $L'$ is a linear array corresponding to $L'$ with an additional link that
connects $P_{i_0}$ and $P_{i_m}$. We show that the 2-spacing segmented bus can simulate a linear
array (resp. ring) corresponding to any $L'$ efficiently. Let $S' = \{ S_{\lfloor i_j/2 \rfloor} \mid j < m$ and $j$ is
even $\}$. If we let all switches that are not in $S'$ be on, and ignore the processors that are
not in $L'$, then $L'$ and $S'$ define a 2-spacing segmented bus (note: the last segment may have
only one processor of $L'$). By controlling the switches in $S'$ in the way that is described
for the case of simulating a linear array of $p - 1$ processors, the linear array (resp. ring)
corresponding to $L'$ can also be simulated efficiently. This generalization of Theorem 1 is
stated in the following Corollary.
Corollary 1 A 2-spacing segmented bus $B(p)$, $p = 2^n$, can simulate any parallel communication
step of any linear array (resp. ring) corresponding to a linearly ordered processor
subset of $B(p)$ in at most 3 (resp. 4) parallel communication steps.
5.2.2 Simulation of Binary Tree
In addition to supporting efficient broadcasting and multicasting, and various linear array
and ring communication patterns, a segmented bus can also simulate tree interconnection
structures efficiently. Consider a complete tree $T$ of $2p - 1$ processors $P'_i$, $0 \le i < 2p - 1$.
We map $P'_i$ to $P_{f(i)}$ of a segmented bus $B(p)$ using the function $f(i) = \lfloor i/2 \rfloor$. Clearly, by $f$,
two processors of $T$ are mapped to one processor of the segmented bus (except that one
processor of $T$ is mapped to the last processor of the bus). For $p = 16$, this mapping is
shown in Figure 5.8. In this figure, processors bounded by a dashed box are mapped to
the same processor of the segmented bus that appears in the same column. In simulating
$T$ by $B(p)$, some links of $T$ can be ignored. Such situations occur when two processors,
e.g., $P'_0$ and $P'_1$, connected by a link in $T$ are mapped to the same processor of $B(p)$. We
mark each of the remaining links of $T$ by an integer pair $(x, y)$ in a recursive manner as
shown in Figure 5.8. Here, $x$ indicates the switching pattern of $B(p)$ used for simulating
the communication along the link, and $y$ indicates the step number using pattern $x$. More
specifically, we use the following switching scheme:
• All links in the tree with the same link label $(x, y)$ are simulated by the same step.
• If $x = 1$, then all switches of the 2-spacing segmented bus $B(p)$ are off.
• For $x > 1$, a switch $S_j$ is turned off if and only if the rightmost $x - 1$ bits of the
binary value of $j$ are all 1's.
For example, to simulate all links of $T$ that are marked (2,1) and (2,2), we need to use
the following switching pattern of $B(p)$: turn off switch $S_j$ if and only if the rightmost
bit of the binary value of $j$ is 1. After setting this pattern, two communication steps are
carried out. In the first step, $B(p)$ simulates all links marked (2,1), and in the second step
$B(p)$ simulates all links marked (2,2).
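
The switching patterns of this scheme are easy to generate. The following sketch is an
illustration only: it encodes a pattern as a list of booleans (True meaning the switch is
on) and derives the resulting sub-bus sizes. It does not reproduce the tree-to-bus link
labeling of Figure 5.8, and the function names are hypothetical.

```python
def tree_switch_pattern(n, x):
    """States of switches S_0 .. S_{m-1}, m = 2**(n-1) - 1, for pattern x (True = on).

    S_j is off exactly when the rightmost x - 1 bits of j are all 1's; for
    x = 1 the condition is vacuous, so every switch is off.
    """
    m = 2 ** (n - 1) - 1
    mask = (1 << (x - 1)) - 1
    return [(j & mask) != mask for j in range(m)]


def sub_bus_sizes(n, x):
    """Number of processors on each sub-bus of B(2**n) under pattern x."""
    sizes, current = [], 2  # every basic segment of a 2-spacing bus has 2 processors
    for is_on in tree_switch_pattern(n, x):
        if is_on:
            current += 2
        else:
            sizes.append(current)
            current = 2
    sizes.append(current)
    return sizes


if __name__ == "__main__":
    for x in range(1, 5):
        print(x, sub_bus_sizes(4, x))
```

Under this reading of the rule, pattern $x$ splits $B(2^n)$ into sub-buses of $2^x$
processors each, so pattern $n$ turns every switch on.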
Theorem 2 A 2-spacing segmented bus $B(p)$, $p = 2^n$, can simulate any parallel computation
step and communication step of a complete binary tree $T$ of $2p - 1$ processors in at
most two parallel computation steps and at most $2 \log_2 p - 1$ parallel communication steps,
respectively.
Proof: Two parallel computation steps are needed to simulate one computation step of
$T$ because two processors in the tree are mapped to one bus processor. We prove the
theorem by showing that $2n - 1$ steps are sufficient for simulating any communication step
of $T$ using the switching scheme given above.
For $n = 2$, our claim is obviously true. Suppose that our claim is true for $n = k$ and
consider the case that $n = k + 1$. Now we use two segmented buses of size $p = 2^k$ to
construct a larger segmented bus of size $q = 2^{k+1}$. Similarly, we use two complete binary
trees of size $2p - 1$ to construct a larger tree of size $2q - 1$, as shown in Figure 5.9.
In the first $2k - 1$ steps, switch $S_{2^{k-1}-1}$ is set off because the rightmost $k - 1$ bits of
the binary value of $2^{k-1} - 1$ are all 1's (see Figure 5.9). By the inductive hypothesis, at the
end of $2k - 1$ simulation steps, the left and the right subtrees of the root have been simulated.
In step $2k$, the switch $S_{2^{k-1}-1}$ is still off and the link labeled $(k, 2)$
is simulated. In step $2k + 1$, the link labeled $(k + 1, 1)$ is simulated. Notice that the largest
label of the switches is $2^k - 2$, whose binary value has only $k - 1$ 1's, and that for $x = k + 1$
a switch $S_j$ is set off if and only if the rightmost $x - 1 = k$ bits of the binary value of $j$ are
all 1's. Thus, in simulating link $(k + 1, 1)$, all switches are set on. This is exactly what is
needed by our mapping. Therefore, $B(2^{k+1})$ can simulate a maximum parallel communication
step of a complete binary tree of $2^{k+2} - 1$ processors in $2k + 1 = 2(k + 1) - 1$ steps. This
completes the induction and the proof of the theorem. ■
Figure 5.8: Simulation of a complete binary tree by a 2-spacing segmented bus. (Tree links
carry labels $(x, y)$; the switches are labeled 000 through 110.)
Figure 5.9: Recursive constructions of complete binary tree and 2-spacing segmented bus.
(The figure shows the left subtree, the right subtree, and the switch labels.)
5.2.3 Simulation of X-tree
A useful variation of the tree structure is the X-tree. An X-tree is a supergraph of a 
complete binary tree with links added to connect consecutive processors on the same level 
of the tree. For example, a 16-leaf X-tree is shown in Figure 5.10. Using Corollary 1 and 
Theorem 2, it is easy to derive the following claim.
Corollary 2 A 2-spacing segmented bus $B(p)$, $p = 2^n$, can simulate any parallel computation
step and communication step of an X-tree of $2p - 1$ processors in at most two parallel
computation steps and at most $5(\log_2 p - 1) + 1$ parallel communication steps, respectively.
Proof: Again, the simulation of the parallel computation steps is obvious. We only
consider the simulation of the parallel communication steps. By Theorem 2, we need
$2n - 1$ parallel communication steps to simulate the tree links. Consider the simulation of
horizontal links. From Corollary 1, it takes three parallel communication steps to simulate
the horizontal links in each level of the X-tree except for the highest three levels. The
highest level has one processor, so no step is needed for this level. The second highest level
has two processors, and one step is required for simulating the link connecting them.
The third highest level has four processors, and two steps are sufficient for simulating
the three horizontal links. Therefore, simulating all horizontal links takes $3(n - 2) + 3$
steps. In summary, to simulate a maximum parallel communication step of the whole X-tree,
$2n - 1 + 3(n - 2) + 3 = 5(n - 1) + 1$
parallel communication steps are sufficient.
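
As a quick check of the count, both sides can be expanded with $n = \log_2 p$. The numeric
instance uses $p = 16$, the bus size of the figures in this section.

```latex
\begin{align*}
(2n - 1) + 3(n - 2) + 3 &= 2n - 1 + 3n - 6 + 3 = 5n - 4, \\
5(n - 1) + 1 &= 5n - 5 + 1 = 5n - 4, \\
n = 4:\quad (2 \cdot 4 - 1) + 3(4 - 2) + 3 &= 7 + 9 = 16 = 5(4 - 1) + 1 .
\end{align*}
```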
Figure 5.10: Simulation of an X-tree by a 2-spacing segmented bus. (The switches are labeled
000 through 110.)
5.2.4 Simulation of One-dimensional Multigrid
A subgraph of an X-tree, called a one-dimensional multigrid, has been proved useful for
implementing efficient parallel matrix algorithms [40]. It has $2^{n+1} - 1$ processors, divided
into $n + 1$ levels. Level $m$, $0 \le m < n + 1$, is a linear array of $2^m$ processors. The $j$-th
processor on level $m$ is connected to the $2j$-th processor on level $m + 1$. A $(2^5 - 1)$-processor
one-dimensional multigrid is shown in Figure 5.11(a). We map the processors
of a one-dimensional multigrid to the processors of $B(p)$ in the way shown in Figure 5.11(b).
Compare Figure 5.11(b) with Figure 5.8. If we delete all horizontal links of Figure 5.11(b),
we obtain a subgraph of Figure 5.8. The missing non-horizontal links are those that
are either not labeled or labeled $(x, 2)$. Using the same switching scheme as for simulating
the tree, but without simulating the missing links, $\log_2 p$ steps are sufficient for $B(p)$ to
simulate all non-horizontal links of the one-dimensional multigrid of $2p - 1$ processors. By
the proof of Corollary 2, it takes $3(\log_2 p - 2) + 3$ steps to simulate all the horizontal links.
Therefore the total number of steps for simulating a maximum communication step of a
one-dimensional multigrid is $\log_2 p + 3(\log_2 p - 2) + 3 = 4 \log_2 p - 3$. In summary, we have
the following claim:
Theorem 3 A 2-spacing segmented bus $B(p)$, $p = 2^n$, can simulate any parallel computation
step and communication step of a one-dimensional multigrid of $2p - 1$ processors in at
most two parallel computation steps and at most $4 \log_2 p - 3$ parallel communication steps,
respectively.
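
To make the definition of the one-dimensional multigrid and the bound of Theorem 3
concrete, the sketch below builds the graph from the definition and evaluates the step
count. It is an illustration under stated assumptions: processors are modeled as
(level, position) pairs, the function names are hypothetical, and the bound is only
meaningful for $n \ge 2$, as in the proof of Corollary 2.

```python
def one_d_multigrid(n):
    """Return (processors, horizontal_links, vertical_links) for levels 0 .. n.

    Processor (m, j) is the j-th processor on level m; level m is a linear
    array of 2**m processors, and (m, j) is linked to (m + 1, 2 * j).
    """
    procs = [(m, j) for m in range(n + 1) for j in range(2 ** m)]
    horizontal = [((m, j), (m, j + 1)) for m in range(n + 1) for j in range(2 ** m - 1)]
    vertical = [((m, j), (m + 1, 2 * j)) for m in range(n) for j in range(2 ** m)]
    return procs, horizontal, vertical


def multigrid_step_bound(n):
    """Communication steps for p = 2**n: log p + 3 (log p - 2) + 3."""
    return n + 3 * (n - 2) + 3


if __name__ == "__main__":
    procs, horizontal, vertical = one_d_multigrid(4)
    print(len(procs), len(horizontal), len(vertical), multigrid_step_bound(4))
```

For $n = 4$ the graph has 31 processors, matching Figure 5.11(a), and the bound evaluates
to $4 \cdot 4 - 3 = 13$ steps.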
Figure 5.11: Simulation of a 1-D multigrid by a 2-spacing segmented bus. (a) 1-D multigrid
of 31 processors. (b) Processor mapping to $B(16)$.
Now, consider the $k$-D MCSB. By Theorem 1, we know that an $M_k(2^n, 2)$ can simulate
a communication step of a $k$-dimensional $2^n$-ary mesh and torus (which is also called a
$2^n$-ary $k$-cube) efficiently.
Corollary 3 An MCSB $M_k(2^n, 2)$ can simulate any parallel computation step and any
parallel communication step of a $k$-dimensional $2^n$-ary mesh (resp. torus) in one parallel
computation step and at most three (resp. four) parallel communication steps, respectively.
Proof: Directly from Theorem 2. ■
5.2.5 Simulation of Mesh-of-trees
By adding a bus to each row and each column of a conventional two-dimensional mesh,
one can obtain a mesh with improved data broadcasting performance. Such a model has
been considered for many applications (e.g. [60, 74, 81]). We refer to this interconnection
structure as the mesh-connected computer with multiple broadcasting (MCCMB). Clearly,
a k-D MCSB is more powerful than its corresponding k-D MCCMB. We demonstrate this
by showing that a k-D MCSB can efficiently simulate several useful machine models.
An important parallel computing model is the mesh-of-trees (MOT). A $2^n \times 2^n$ two-
dimensional MOT is constructed from a $2^n \times 2^n$ two-dimensional grid of processors by
adding processors and links to form a complete tree in each row and each column of the
grid. The leaves of these trees are the original processors in the grid. Similarly, a three-
dimensional MOT is constructed from a three-dimensional grid of processors by adding
processors and links to form a complete tree whose leaves are the grid processors with the
same two indices. There are $3 \cdot 2^{2n} - 2^{n+1}$ processors in the $2^n \times 2^n$ MOT, and there are
$4 \cdot 2^{3n} - 3 \cdot 2^{2n}$ processors in the $2^n \times 2^n \times 2^n$ MOT. A 4 x 4 MOT is shown in Figure 5.12.
In [40], it is shown that the mesh-of-trees is a versatile machine model. Many problems
can be solved using the MOT in polylogarithmic time. The following claim is derived by
using the processor mapping of the proof of Theorem 2 for all the trees of the MOT.
Corollary 4 An MCSB $M_2(2^n, 2)$ can simulate any parallel computation step and any
parallel communication step of a $2^n \times 2^n$ 2-D MOT in at most 3 parallel computation
steps and at most $2n - 1$ parallel communication steps, respectively. An MCSB $M_3(2^n, 2)$
can simulate any parallel computation step and any parallel communication step of a
$2^n \times 2^n \times 2^n$ 3-D MOT in at most 4 parallel computation steps and at most $2n - 1$ parallel
communication steps, respectively. ■
Figure 5.12: A 4 x 4 mesh-of-trees.
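The processor counts stated for the mesh-of-trees can be checked by adding the grid processors to the internal nodes of every tree. The Python sketch below does this under the assumption that each tree is a complete binary tree with $2^n$ leaves; the function name `mot_processors` and its parameters are illustrative and do not come from the text.

```python
def mot_processors(n: int, dims: int) -> int:
    """Count the processors of a dims-dimensional mesh-of-trees with side 2**n.

    Assumes every axis-parallel line of the grid carries one complete
    binary tree whose 2**n leaves are the grid processors on that line.
    """
    side = 2 ** n
    grid = side ** dims                  # original grid processors (the leaves)
    lines = dims * side ** (dims - 1)    # axis-parallel lines, one tree on each
    internal_per_tree = side - 1         # internal nodes of a complete binary tree
    return grid + lines * internal_per_tree


# Compare with the closed forms for the 2-D and 3-D MOT.
for n in range(1, 6):
    assert mot_processors(n, 2) == 3 * 2 ** (2 * n) - 2 ** (n + 1)
    assert mot_processors(n, 3) == 4 * 2 ** (3 * n) - 3 * 2 ** (2 * n)

print(mot_processors(2, 2))  # 40 processors in the 4 x 4 MOT
```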
5.2.6 Simulation of Pyramid
Two related classes of point-to-point networks, multigrids and pyramids [40], have been
proven useful for many applications. For example, a common approach to solving a system
of partial differential equations is the use of finite difference methods. A variety of
algorithms for finite difference problems have been devised for multigrid structures. Pyramids
are also useful for parallel image processing. MCSBs can be used to simulate these two
classes of interconnection networks efficiently. The $2^n \times 2^n$ multigrid network consists of
n + 1 levels of 2-D processor arrays; the array at level m, $0 \le m \le n$, is of size $2^m \times 2^m$.
The processor with indices (i, j) on the $2^m \times 2^m$ array is connected to the processor with
indices (2i, 2j) on the $2^{m+1} \times 2^{m+1}$ array. The $2^n \times 2^n$ pyramid network consists of n + 1
levels of 2-D processor arrays; the array at level m, $0 \le m \le n$, is of size $2^m \times 2^m$. The
processor with indices (i, j) on the $2^m \times 2^m$ array is connected to the processors with
indices (2i - 1, 2j - 1), (2i - 1, 2j), (2i, 2j - 1), and (2i, 2j) on the $2^{m+1} \times 2^{m+1}$ array. A
4 x 4 multigrid network and a 4 x 4 pyramid network are shown in Figures 5.13 and 5.14,
respectively. The multigrid network and pyramid network are the natural two-dimensional
generalizations of the one-dimensional multigrid and X-tree, respectively; they are closely
related. It is known that a $2^n \times 2^n$ multigrid network can simulate a $2^n \times 2^n$ pyramid
network with a communication slowdown factor of three. Thus, if an
MCSB $M_2(2^n, 2)$ can simulate a $2^n \times 2^n$ multigrid network with a slowdown factor c, then
$M_2(2^n, 2)$ can simulate a $2^n \times 2^n$ pyramid network with a slowdown factor 3c.
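The two inter-level connection rules can be compared directly. The following Python sketch enumerates only the links between consecutive levels (the mesh links inside each level are not modeled) and assumes that indices run from 1 to $2^m$ on level m, so that $2i - 1 \ge 1$. The function names are hypothetical helpers.

```python
def multigrid_children(i, j):
    """Level-(m+1) processor linked to processor (i, j) of level m in the multigrid."""
    return [(2 * i, 2 * j)]


def pyramid_children(i, j):
    """The four level-(m+1) processors linked to processor (i, j) of level m in the pyramid."""
    return [(2 * i - 1, 2 * j - 1), (2 * i - 1, 2 * j),
            (2 * i, 2 * j - 1), (2 * i, 2 * j)]


def interlevel_links(m, children):
    """Set of links between level m (a 2**m x 2**m array) and level m + 1.

    Assumption: indices run from 1 to 2**m on level m.
    """
    side = 2 ** m
    return {((i, j), child)
            for i in range(1, side + 1)
            for j in range(1, side + 1)
            for child in children(i, j)}


m = 1
multigrid = interlevel_links(m, multigrid_children)
pyramid = interlevel_links(m, pyramid_children)

# Every multigrid inter-level link is also a pyramid link.
assert multigrid <= pyramid
# In the pyramid each level-(m+1) processor has exactly one parent.
assert sorted(child for _, child in pyramid) == [(a, b) for a in range(1, 5) for b in range(1, 5)]
print(len(multigrid), len(pyramid))  # 4 16
```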
Figure 5.13: A 4 x 4 multigrid, with level-0, level-1 and level-2 processors.
Figure 5.14: A 4 x 4 pyramid, with level-0, level-1 and level-2 processors.
5.2.7 Simulation of High-dimensional Multigrid
Consider simulating a $2^n \times 2^n$ multigrid network by an MCSB $M_2(2^n, 2)$. Imagine that
we cut a $2^n \times 2^n$ multigrid network into slices; we then obtain a set of subgraphs of
one-dimensional multigrids. For each slice, we use the processor mapping method for
simulating a one-dimensional multigrid by a 2-spacing segmented bus. Then, it is not
difficult to derive the following claim.
Theorem 4 An MCSB $M_2(2^n, 2)$ can simulate any parallel computation step and any
parallel communication step of a $2^n \times 2^n$ multigrid network and a $2^n \times 2^n$ pyramid network
in at most two parallel computation steps and $O(n)$ parallel communication steps,
respectively.
Proof:
From the above discussion, we only need the proof for the case of a multigrid. We use
(x, y, z) to represent the indices (coordinates) of a processor in a $2^n \times 2^n$ multigrid (refer
to Figure 5.13). We say that a link of the multigrid is a dimension x (resp. y and z)
link if it connects two processors with different x-coordinates (resp. y-coordinates and
z-coordinates). Let $L_x$, $L_y$ and $L_z$ denote the links of dimension x, y and z, respectively.
We partition the links of a $2^n \times 2^n$ multigrid into two subsets: $L_{xz} = L_x \cup L_z$, which contains
all links of dimensions x and z, and the set $L_y$ of links of dimension y. We cut the $2^n \times 2^n$
multigrid by xz planes to obtain a set of partial grids, each being a subgraph of a
one-dimensional multigrid. Clearly, all links on these planes are in $L_{xz}$, and the remaining
links are in $L_y$. We use the processor mapping method shown in Figure 5.11 to map the
processors in each partial grid to a row bus of $M_2(2^n, 2)$ (note: at most two processors of
the multigrid are mapped to one processor of $M_2(2^n, 2)$). Furthermore, a communication
step on all links in $L_{xz}$ can be simulated by row buses in $4n - 3$ steps (Theorem 3). The
links in $L_y$ are mapped to column buses of $M_2(2^n, 2)$. Since in the multigrid there are n - 1
levels that contain links of $L_y$, a step of communications on links of $L_y$ requires $3(n - 2) + 3$
steps (by Corollary 2). Since the simulations of links in $L_{xz}$ and $L_y$ are independent of
each other, a maximum communication step on the $2^n \times 2^n$ multigrid can be simulated by
$M_2(2^n, 2)$ in at most $\max\{4n - 3, 3n - 3\} = 4n - 3$ steps. The processor mapping used
ensures that a computation step of the multigrid can be simulated in at most two steps
by $M_2(2^n, 2)$.
■
All the above discussions are based on 2-spacing segmented buses. If $2^i$-spacing segmented
buses, i > 1, are used, all the claimed simulation performances degrade as i increases.
For example, a $2^i$-spacing segmented bus can simulate any parallel communication of a
p-processor linear array (resp. ring) in $2^i + 1$ (resp. $2^i + 2$) parallel communication steps.
Similar performance degradations occur when using segmented buses with fewer
switches to simulate other interconnection structures. We would like to point out that
the MCSB is much sparser than the point-to-point networks we considered. For example, a
$2^n \times 2^n$ torus has $O(2^{2n})$ processors and $O(2^{2n})$ links, whereas a $2^n \times 2^n$ MCSB has $2^{2n}$
processors and 2n segmented buses. All the simulation results discussed in this section
are asymptotically optimal.
5.3 Parallel Prefix Computation
Given a sequence $S = (a_0, a_1, \ldots, a_{N-1})$ of N elements in a domain D, and an associative
operation $\otimes$ on D, the prefix problem is to compute $y_i = a_0 \otimes a_1 \otimes \cdots \otimes a_i$ for $0 \le i < N$.
The prefix computation is a fundamental problem in parallel computing. It has a wide
range of applications such as processor allocation, data distribution and alignment, data
compaction, job scheduling, sorting, packet routing, matrix computation, linear recurrence,
polynomial evaluation, graph algorithms, general Horner expressions and general
arithmetic formulae. Refer to [28, 40] for references of these applications.
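For reference, the prefix problem itself can be stated in a few lines of sequential code. The Python sketch below uses the standard-library function `itertools.accumulate`; the wrapper name `prefix` is only illustrative. It requires nothing of the operation beyond associativity, which the string example (a non-commutative operation) makes visible.

```python
from itertools import accumulate
import operator


def prefix(seq, op):
    """Return [a0, a0 op a1, ..., a0 op ... op a_{N-1}] for an associative op."""
    return list(accumulate(seq, op))


print(prefix([3, 1, 4, 1, 5], operator.add))   # [3, 4, 8, 9, 14]
print(prefix([3, 1, 4, 1, 5], max))            # [3, 3, 4, 4, 5]
print(prefix(["a", "b", "c"], operator.add))   # ['a', 'ab', 'abc']
```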
There has been much research on parallel prefix computation (PPC), and efficient PPC
algorithms have been proposed for various parallel computing models such as PRAM,
tree-like machines, hypercube, mesh-like machines, and the shuffle-exchange machine. For a
good survey of previous PPC results, refer to [28]. In this section, we want to show how
PPC can be efficiently realized on architectures based on segmented buses.
5.3.1 Prefix on 1-D array
Recursive doubling
Horng [28] introduced a concept called recursive doubling for the parallel computation
of the prefix problem. The idea is to break the calculation of one term into two complex
subterms, as shown in Fig. 5.15. He gave a proof that the prefix can be correctly calculated
by recursive doubling; we give a more concise and direct proof in the following.
Step    000   001    010      011      100      101      110      111
s = 0   a0    a1     a2       a3       a4       a5       a6       a7
s = 1   a0    a0a1   a2       a2a3     a4       a4a5     a6       a6a7
s = 2   a0    a0a1   a0..a2   a0..a3   a4       a4a5     a4..a6   a4..a7
s = 3   a0    a0a1   a0..a2   a0..a3   a0..a4   a0..a5   a0..a6   a0..a7
Figure 5.15: Parallel prefix computation using recursive doubling.
The following notation will be used in this paper.
(i) Let $i_j$ denote the j-th bit of the binary representation $i_{d-1} i_{d-2} \cdots i_j \cdots i_1 i_0$ of i.
(ii) Let $P_i^s$ be a variable located in processor i that contains a segment of the prefix computed
at step s.
(iii) Let s denote the parallel computation step.
(iv) Let p denote the number of processors.
(v) The following definition and properties are from Horng [28].
$\mathrm{sub}(i, j, 0) = i_{d-1} i_{d-2} \cdots i_j 0_{j-1} 0_{j-2} \cdots 0_0$
$\mathrm{sub}(i, j, 0) = \mathrm{sub}(i, j + 1, 0)$, if $i_j = 0$
$\mathrm{sub}(\mathrm{sub}(i, j, 0) - 1, j, 0) = \mathrm{sub}(i, j + 1, 0)$, if $i_j = 1$
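The function sub and its two properties are easy to check exhaustively on small inputs. In the Python sketch below, `sub(i, j)` is a hypothetical helper that drops the constant third argument 0 and clears the j low-order bits of i with shifts; `bit(i, j)` returns $i_j$.

```python
def sub(i: int, j: int) -> int:
    """sub(i, j, 0): i with its j low-order bits i_{j-1} ... i_0 set to 0."""
    return (i >> j) << j


def bit(i: int, j: int) -> int:
    """The j-th bit i_j of the binary representation of i."""
    return (i >> j) & 1


# Exhaustive check of both properties for all 6-bit indices.
for i in range(64):
    for j in range(6):
        if bit(i, j) == 0:
            assert sub(i, j) == sub(i, j + 1)
        else:
            assert sub(sub(i, j) - 1, j) == sub(i, j + 1)

print(sub(0b1011, 2))  # 8, i.e. binary 1000
```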
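Lemma 1 $P_i^s = P_i^{s-1}$ if $i_{s-1} = 0$.
Lemma 1 can be justified directly from the observation of Figure 5.15.
Lemma 2 $P_i^s = x_{\mathrm{sub}(i,s,0)} \otimes x_{\mathrm{sub}(i,s,0)+1} \otimes \cdots \otimes x_{i-1} \otimes x_i$
Proof: The proof is done by induction on s.
Basis step: When s = 0, $i = \mathrm{sub}(i, s, 0)$. Therefore, $P_i^0 = x_i$. That is, at the
beginning, $x_i$ is assigned to processor i.
Induction step: Assume that when s = m, the lemma is true. That is,
$P_i^m = x_{\mathrm{sub}(i,m,0)} \otimes x_{\mathrm{sub}(i,m,0)+1} \otimes \cdots \otimes x_{i-1} \otimes x_i$.
We need to prove that when s = m + 1, it is also true. Lemma 1 and the properties of
sub will be used in the following steps.
If $i_m = 0$, then
$P_i^{m+1} = P_i^m$
$= x_{\mathrm{sub}(i,m,0)} \otimes x_{\mathrm{sub}(i,m,0)+1} \otimes \cdots \otimes x_{i-1} \otimes x_i$
$= x_{\mathrm{sub}(i,m+1,0)} \otimes x_{\mathrm{sub}(i,m+1,0)+1} \otimes \cdots \otimes x_{i-1} \otimes x_i$.
If $i_m = 1$, then
$P_i^{m+1} = P_{\mathrm{sub}(i,m,0)-1}^m \otimes P_i^m$
$= x_{\mathrm{sub}(\mathrm{sub}(i,m,0)-1,m,0)} \otimes x_{\mathrm{sub}(\mathrm{sub}(i,m,0)-1,m,0)+1} \otimes \cdots \otimes x_{\mathrm{sub}(i,m,0)-1}$
$\otimes\; x_{\mathrm{sub}(i,m,0)} \otimes x_{\mathrm{sub}(i,m,0)+1} \otimes \cdots \otimes x_{i-1} \otimes x_i$
$= x_{\mathrm{sub}(i,m+1,0)} \otimes x_{\mathrm{sub}(i,m+1,0)+1} \otimes \cdots \otimes x_{i-1} \otimes x_i$.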
Theorem 5 $y_i = P_i^{\lceil \log(i+1) \rceil}$
Proof: In Lemma 2, let $\mathrm{sub}(i, s, 0) = 0$, which can be guaranteed by $s = \lceil \log(i + 1) \rceil$.
■
The algorithm
Using recursive doubling and the above architecture, we obtain the following algorithm
for parallel prefix computation.
procedure parallel_prefix;
begin
  parfor $0 \le i < N$ do
  begin
    $y_i = a_i$   /* initialization */
    for s = 1 to log N do
    begin
      if s = 1 close all the switches
      else close switch w, $w_0 = 0$ and $w_{s-2} = 1$
      if $i_{s-1} = 1$ then
      begin
        $x_i = y_{\mathrm{sub}(i, s-1, 0) - 1}$
        $y_i = x_i \otimes y_i$
      end
    end
  end
end
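The data movement of this algorithm can be simulated sequentially. The Python sketch below is such a simulation under stated assumptions: the bus switches are not modeled, N is a power of two, and a snapshot of all $y_i$ stands in for the synchronous reads of one parallel step. The names `recursive_doubling_prefix` and `sub` are illustrative.

```python
import operator


def sub(i, j):
    """sub(i, j, 0): clear the j low-order bits of i."""
    return (i >> j) << j


def recursive_doubling_prefix(a, op):
    """Sequential simulation of the parfor loop; len(a) must be a power of two."""
    n = len(a)
    steps = n.bit_length() - 1
    assert n == 1 << steps, "N is assumed to be a power of two"
    y = list(a)
    for s in range(1, steps + 1):
        old = y[:]                       # all processors read before any of them writes
        for i in range(n):
            if (i >> (s - 1)) & 1:       # bit i_{s-1} = 1: processor i receives in step s
                x = old[sub(i, s - 1) - 1]
                y[i] = op(x, old[i])     # earlier elements go on the left
    return y


# String concatenation is associative but not commutative, so it exposes ordering errors.
print(recursive_doubling_prefix(list("abcdefgh"), operator.add))
```

With eight elements the intermediate states after s = 1, 2, 3 match the rows of Figure 5.15.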
5.4 k-D mesh with segmented buses
The segmented buses can be easily integrated into a higher dimensional mesh. Figure 5.16
shows a 3-D mesh enhanced with segmented buses. The corresponding bus enumerations
are shown in Figure 5.17 and the notations will be explained later on. Note that in the
figure we only show the segmented buses necessary for prefix computation. We do not
show the buses in detail either, since each bus is the same as a 1-D bus.
Figure 5.16: A 3-D mesh with segmented buses. Processors are labeled with their coordinates,
from (0,0,0) to (3,3,3), along the dimension axes k=0, k=1 and k=2.
Figure 5.17: The bus notations for the 3-D mesh shown in Figure 5.16.
For the convenience of description, we introduce some more definitions and notations.
Let $M_k$ be a k-dimensional mesh with each dimension having n processors. There are
a total of $p = n^k$ processors. Each processor is identified by $P(i_0, i_1, \ldots, i_{k-1})$. Given
m < k, an m-D submesh can be specified by $M_k(n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$ where
$0 \le i_m, i_{m+1}, \ldots, i_{k-1} \le n - 1$.
In this section, a subscript is also used for denoting the dimension to which each index
belongs. For example, in $P(i_0, i_1, \ldots, i_j, \ldots, i_{k-1})$, $i_j$ is an index in dimension j. For the
sake of simplicity, a subscript will be omitted if it can be inferred from the context.
Consider a prefix $a_0 \otimes a_1 \otimes \cdots \otimes a_j$, $0 \le j \le n^k - 1$. Initially, we have the following
mapping relationship between the elements $a_j$ and the processors:
$a_{i_{k-1} n^{k-1} + \cdots + i_2 n^2 + i_1 n + i_0} \rightarrow P(i_0, i_1, \ldots, i_{k-1})$
That is, we fill the processors using row major order.
We will also denote an element by the coordinate of the processor to which this element
is initially assigned. For example, $a_0 = a_{0,0,\ldots,0}$, $a_1 = a_{1,0,0,\ldots,0}$.
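The mapping between element indices and processor coordinates amounts to writing j in base n with $i_0$ as the least significant digit. The following Python sketch shows both directions; `coords_of` and `index_of` are hypothetical helper names.

```python
def coords_of(j, n, k):
    """Coordinates (i_0, ..., i_{k-1}) of the processor that initially holds a_j."""
    coords = []
    for _ in range(k):
        j, digit = divmod(j, n)      # peel off the base-n digits, i_0 first
        coords.append(digit)
    return tuple(coords)


def index_of(coords, n):
    """Inverse mapping: j = i_{k-1} n^{k-1} + ... + i_1 n + i_0."""
    j = 0
    for digit in reversed(coords):   # Horner evaluation from i_{k-1} down to i_0
        j = j * n + digit
    return j


print(coords_of(1, 4, 3))        # (1, 0, 0): a_1 is held by P(1,0,0)
print(index_of((3, 3, 0), 4))    # 15: P(3,3,0) holds a_15 in the 4 x 4 x 4 mesh
```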
A submesh head processor (SHP) is a processor with the maximum coordinates in a
given submesh. For example, in Figure 5.16, P(3,1,0) is a 1-D SHP, P(3,3,0) is a 2-D
SHP and P(3,3,3) is a 3-D SHP. Generally, $P(n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$,
$0 \le i_m, i_{m+1}, \ldots, i_{k-1} \le n - 1$, is an m-D SHP. During the k-D parallel prefix computing,
the communication is mainly done through these SHPs. A processor may serve as an SHP
for different submeshes. For example, P(3,3,3) is an SHP of submeshes of dimension from
1 to 3. SHPs that are on the same dimension axis are siblings of each other. An SHP P
is the parent of those SHPs, called child SHPs, which are within the submesh headed by
P and are one dimension lower. Specifically, the head processor
$P(n-1, n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$ of an m-D submesh $M(m)$ is the parent processor of those SHPs
$P(n-1, n-1, \ldots, n-1, t, \ldots, i_{k-1})$, $0 \le t < n$, which are in $M(m)$ and on the
dimension m - 1 axis. For example, P(3,3,0) is the parent of P(3,0,0), P(3,1,0) and
P(3,2,0), and P(3,3,3) is the parent of P(3,3,0), P(3,3,1) and P(3,3,2).
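The parent and child relations between SHPs can be expressed purely in terms of coordinates. The Python sketch below is one possible reading, with hypothetical helper names: it treats a processor as an SHP of the largest dimension it heads (the number of leading coordinates equal to n - 1), and, following the examples above, it lists as children only the siblings with t < n - 1, leaving out the parent itself.

```python
def shp_dimension(coords, n):
    """Largest m for which coords is an m-D SHP: the number of leading n-1 coordinates."""
    m = 0
    for c in coords:
        if c != n - 1:
            break
        m += 1
    return m


def child_shps(coords, n):
    """Child SHPs of coords, taken on the dimension m-1 axis, excluding coords itself."""
    m = shp_dimension(coords, n)
    if m == 0:
        return []
    return [coords[:m - 1] + (t,) + coords[m:] for t in range(n - 1)]


def parent_shp(coords, n):
    """Parent SHP of coords, or None for the head of the whole mesh."""
    m = shp_dimension(coords, n)
    if m == len(coords):
        return None
    return coords[:m] + (n - 1,) + coords[m + 1:]


print(child_shps((3, 3, 0), 4))   # [(3, 0, 0), (3, 1, 0), (3, 2, 0)]
print(parent_shp((3, 3, 0), 4))   # (3, 3, 3)
```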
$L_k(m)$, $0 \le m < n$, denotes a set of segmented buses connecting:
$P(n-1, \ldots, n-1, 0, i_{m+1}, \ldots, i_{k-1})$
$P(n-1, \ldots, n-1, 1, i_{m+1}, i_{m+2}, \ldots, i_{k-1})$
$\vdots$
$P(n-1, \ldots, n-1, n-1, i_{m+1}, i_{m+2}, \ldots, i_{k-1})$
Each bus in $L_k(m)$ represents a permutation of $\{i_{m+1}, i_{m+2}, \ldots, i_{k-1}\}$. A single bus will
be denoted by $L_k^{i_{m+1}, \ldots, i_{k-1}}(m)$. Refer to Figure 5.17 for some example bus notations.
Each bus in $L_k(m)$ connects a set of m-D SHPs on a dimension m axis. Each of the
buses $L_k(m)$ has a head processor that has the biggest coordinate in dimension m and a
tail processor that has the smallest coordinate.
We define a collection step on a bus $L_k(m)$ as all the operations needed for a 1-D
prefix computation on that bus. A broadcasting step on a bus $L_k(m)$ means that the bus
head processor broadcasts a message or a subprefix to each of the processors on the bus.
$SD_m$ denotes a Synthesized-from-Descendent subprefix which contains all the elements
in an m-D submesh. $SD_m(n-1, n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$, where $0 \le m < n$,
denotes a subprefix for an m-D submesh located on the SHP
$P(n-1, n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$. $P(n-1, n-1, \ldots, n-1, i_m, \ldots, i_{k-1})$ collects its $SD_m$
subprefix from $P(n-1, n-1, \ldots, n-1, t, i_m, i_{m+1}, \ldots, i_{k-1})$, $0 \le t < n$, using the bus
$L_k^{i_m, \ldots, i_{k-1}}(m-1)$. For example, P(3,3,0) constructs its $SD_2$ from P(3,0,0), P(3,1,0)
and P(3,2,0) using the $L_3^0(1)$ bus as shown in Figure 5.16.
$SS_m$ denotes a Synthesized-from-Sibling subprefix that an m-D SHP receives from its
younger siblings on the dimension m axis. That is, $P(n-1, n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$
collects $SS_m$ messages from processors $P(n-1, n-1, \ldots, n-1, t, i_{m+1}, \ldots, i_{k-1})$, $0 \le t < i_m$,
through the bus $L_k^{i_{m+1}, \ldots, i_{k-1}}(m)$. Likewise, we use $SS_m(n-1, n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$
to denote the subprefix located on the processor $P(n-1, n-1, \ldots, n-1, i_m, i_{m+1}, \ldots, i_{k-1})$.
$IP_m$ denotes an Inherited-from-Parent subprefix that an m-D submesh head processor
receives from its parent which broadcasts it. That is,
$P(n-1, n-1, \ldots, n-1, t, i_{m+1}, i_{m+2}, \ldots, i_{k-1})$, $0 \le t < n$, receives an $IP_m$ subprefix from
$P(n-1, n-1, \ldots, n-1, n-1, i_{m+1}, i_{m+2}, \ldots, i_{k-1})$, which broadcasts $IP_m$ on the bus
$L_k^{i_{m+1}, \ldots, i_{k-1}}(m)$.
Now we use an example to clarify the above definitions and to help understand the
following proof. We consider a 3-D prefix computation. Initially all the elements are
assigned to their corresponding processors as described before. We divide the computation into
two phases, the collecting phase and the broadcasting phase. Figure 5.18 shows the intermediate
states of the collecting phase. The collecting phase is further divided into three collection
steps, one for each dimension. Referring to Figure 5.18, at the end of the collection step on
dimension 0, that is, on the buses $L_3(0)$, each processor on a bus $L_3(0)$ has a proper subprefix
computed in the same way as in the one-dimensional case; each 1-D SHP has an
$SD_1$ subprefix. Next, the computation is done in dimension 1, that is, on the buses $L_3(1)$. At
the end of the computation in this dimension, each processor except the tail processor on an
$L_3(1)$ bus has an $SS_1$ subprefix. Each bus head processor, which is a 2-D submesh head
processor, has an $SD_2$ subprefix. The same collection operations are done in dimension 2
and the collecting phase ends.
The broadcasting phase is done in a top-down fashion as shown in Figure 5.19. First,
each 2-D SHP broadcasts its $SS_2$ subprefix along its corresponding $L_3(1)$. The receiving
processors, which are 1-D submesh head processors, receive the subprefix and save it
in $IP_1$. Next each 1-D submesh head processor combines its $IP_1$ and $SS_1$ subprefixes
and broadcasts the resulting subprefix on its corresponding $L_3(0)$ bus. Each receiving
processor combines the received subprefix and its previous subprefix to get a complete
prefix. The 3-D prefix computation is finished. Note that for 3-D prefix computation, we
need two broadcasting steps.
Figure 5.18: The collecting phase of 3-D prefix computation. (a) After collection on $L_3(0)$,
(b) After collection on $L_3(1)$, (c) After collection on $L_3(2)$.
Figure 5.19: The broadcasting phase of 3-D prefix computation. (a) After broadcasting $SS_2$
on $L_3(1)$, (b) After broadcasting $IP_1 + SS_1$ on $L_3(0)$.
With the concept used in the above example, it is easy to prove the following theorem.
Theorem 6 A prefix of length $p = n^k$ can be computed in time $O(\log p)$.
Proof: The theorem is proved using two-phase induction on dimension k.
Inductive hypothesis
We imagine that a k-D mesh consists of n submeshes of dimension k - 1, referring to
Figure 5.20. Note the simplified notations used in the figure. $SD_m^j$ is used
for $SD_m(n-1, n-1, \ldots)$. We do the same with $SS_m$ and $M_m$.
First let us consider the collecting phase. In the collecting phase, each SHP sends its
SD subprefix to its older sibling. The receiving sibling receives the subprefix as an SS
subprefix, combines it with its own SD subprefix and sends the combined subprefix to
its older siblings. The process is exactly the same as in the one-dimensional case. The
collecting phase starts with dimension 0 and proceeds towards higher dimensions. At the end of the
collecting phase, assume that each SHP of a (k - 1)-dimensional submesh contains an $SD_{k-1}$
subprefix and, except for the first SHP, also contains an $SS_{k-1}$ subprefix. The time needed
for the collecting phase is denoted by $T_c(k)$. Next we consider the broadcasting phase.
In the broadcasting phase, which starts with dimension k - 2 and proceeds towards dimension 0,
each (k - 1)-dimensional SHP broadcasts its SS subprefix to its children. Each child
combines the received subprefix, or IP, with its SS subprefix and broadcasts the resulting
subprefix to its children. This process is done recursively until broadcasting is done on the
dimension 0 buses. We assume that, at the end of the broadcasting phase, each processor
contains a correct prefix. The time needed for broadcasting is denoted by $T_b(k)$. In brief,
we have the following three inductive hypotheses:
1. At the end of the collecting phase, $SD_{k-1}^j$ contains a subprefix of all the elements
in the submesh $M_{k-1}^j$.
2. SS l_ i  — 0 < j  < n. Note SS^Zi  contains a prefix of all the elements 
in the k — D  mesh.
3. After the SHP of $M_{k-1}^j$ broadcasts its $SS_{k-1}^j$ to its children, eventually each processor
in $M_{k-1}^j$ will contain a correct prefix.
Inductive proof
We need to show that the three inductive hypotheses still hold for a (k + 1)-dimensional
mesh. Again, we construct a (k + 1)-dimensional mesh from n k-D meshes, referring to
Figure 5.21. Notice the notational differences between Figure 5.20 and Figure 5.21. The
notations in Figure 5.20 are based on a k-D mesh while the notations in Figure 5.21 are
based on a (k + 1)-dimensional mesh.
Figure 5.20: A k-D mesh is imagined to consist of n (k - 1)-dimensional meshes.
Figure 5.21: A (k + 1)-dimensional mesh is imagined to consist of n k-D meshes.
For the hypothesis (1), we simply let SD], =  SS^Zl '^■ For hypothesis (2), we do a 1 —D 
prefix com putation on the bus Lk+i{k). I t  is obvious that SS^  = 0 <  i <  n.
For hypothesis (3), we let the SHP of each k-D mesh broadcast along the bus $L_{k+1}(k-1)$. Each
of the receiving processors combines the received subprefix with its own SS subprefix and
broadcasts the combined subprefix to its own children. According to the hypothesis (3),
at the end of the broadcasting phase, each processor will contain a correct prefix.
For a (k + 1)-D mesh, we use $T_c(k+1)$ to denote the time needed for the collecting
phase and $T_b(k+1)$ for the broadcasting phase. From the above analysis, we have the
following recurrences:
$T_c(k+1) = T_c(k) + O(\log n)$ with $T_c(1) = \log n$ and
$T_b(k+1) = T_b(k) + 1$ with $T_b(1) = 1$.
Solving the recurrences yields $T_c(k) = O(k \log n)$ and $T_b(k) = k$. The total time
complexity is:
$T(k) = T_c(k) + T_b(k) = O(k \log n) + k = O(k \log n) = O(\log p)$ ■
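Unrolling the two relations makes the closed forms explicit. In the worked derivation below, c is an assumed name for the constant hidden in the $O(\log n)$ term, and $n \ge 2$ is assumed so that $k \le k \log n$.

```latex
\begin{align*}
T_c(k) &\le T_c(1) + \sum_{d=1}^{k-1} c \log n
        = \log n + (k-1)\, c \log n = O(k \log n), \\
T_b(k) &= T_b(1) + (k-1) = k, \\
T(k)   &= T_c(k) + T_b(k) = O(k \log n) + k = O(k \log n)
        = O(\log n^k) = O(\log p).
\end{align*}
```

The last step uses $p = n^k$, so that $k \log n = \log p$.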
Using the above concept, we can come out with the following parallel algorithm  for
a. k  — D  mesh enhanced with global segmented buses. In the following algorithm , x  and 
y  are used as temporary variables to hold a substring. S D q is introduced to store the 
original elements ju st for the expressing purpose; that is, it can be eliminated, 
procedure &D .paralleLprefix;
1. begin /*  collection phase */
2. parfor 0 <  iq, i i , .... ik - i  < n  do
3. SZ?o(i’o , n , —,ifc-i) =  a ( io ,n , ...,îfc_i)/*initialization*/
4. for y =  0 to A: — 1 do
5. begin
6. parfor 0 <  i j+i ,i j+2 , ^  /*for each of the buses Lk{j)  */
7. begin
8. parfor 0 < ij  < n /*  1-D parallel prefix on each bus * /
R ep ro d u ced  with p erm issio n  o f  th e  copyrigh t ow n er. Further reproduction  prohibited w ithout p erm ission .
76
9. begin
10- yir^ —  T ^  — i) —  ^  ^ j  i , . . . , T i  — 1 )
11. SS j{n  — 1,.... a  — I. ij, ij+i, . . . ,ik-i) = 0 /*  initialized to empty */
12. Turn off all the switches
13. for s =  1 to logn do
14. begin
15. if s >  1, tu rn  on w, Ws- 2  =  1
16. x{n — 1....U — l,Zj,Zj+i,...Zfc_i) =  SD j{n  -  l , . . . ,n  — l , sub{ i j , s  — 1,0) —
1? ^J+l I "M I }) )s—I ~  1
17. SS_j(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1}) = x(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1}) ⊗ SS_j(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1})
18. y(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1}) = x(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1}) ⊗ y(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1})
19. end
20. if (i_j = n-1) SD_{j+1}(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1}) = y(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1})
21. end
22. end
23. end
24. end
25. begin /*  broadcasting phase * /
26. parfor 0 ≤ i_{k-1} < n
27. do IP(n-1, ..., n-1, i_{k-1}) = ∅ /* initialized to empty */
28. for j = k - 2 downto 0 do
29. begin
30. parfor 0 ≤ i_{j+1}, i_{j+2}, ..., i_{k-1} < n do
31. begin
32. parfor 0 ≤ i_j < n do
33. IP(n-1, ..., n-1, i_j, i_{j+1}, ..., i_{k-1}) = IP(n-1, ..., n-1, i_{j+1}, ..., i_{k-1}) ⊗ SS_j(n-1, ..., n-1, i_{j+1}, ..., i_{k-1}) /* P(n-1, ..., n-1, i_{j+1}, ..., i_{k-1}) broadcasts IP(n-1, ..., n-1, i_{j+1}, ..., i_{k-1}) ⊗ SS_j(n-1, ..., n-1, i_{j+1}, ..., i_{k-1}) on the buses L_k(j) */
34. end
35. end
36. parfor 0 ≤ i_0, i_1, ..., i_{k-1} < n do
37. y(i_0, i_1, ..., i_{k-1}) = IP(i_0, i_1, ..., i_{k-1}) ⊗ a(i_0, i_1, ..., i_{k-1}) ⊗ SS_0(i_0, i_1, ..., i_{k-1})
38. end
5.5 Summary and Discussions
We proposed a class of reconfigurable buses, the segmented buses, and constructed
multidimensional interconnection structures using such buses. We showed that these
interconnection structures can be used to build versatile general-purpose parallel machines.
To improve bandwidth, reliability, and versatility, one may connect linearly arranged
processors by several segmented buses instead of one. Interconnection patterns that are more
complex than multidimensional meshes are possible. Segmented buses may appear in a
hierarchical system design, providing chip-to-chip, module-to-module, board-to-board or
node-to-node communication. We believe that segmented buses are more promising for
hardware implementation than most other reconfigurable buses, such as the ones used in
reconfigurable meshes, because of their relatively simple control schemes.
We would like to mention that an optical bus system, called the linear array with a
reconfigurable pipelined bus system (LARPBS), has been introduced recently. The paper
by Pan and Li [57] gives an excellent survey of this subject. By allowing packets to be
transmitted in a pipelined fashion using the time-division multiplexing (TDM) method,
an LARPBS can simulate a complete point-to-point network (complete graph) with a
constant-factor slowdown. In addition, it can perform several operations with performance
better than a PRAM of the same size. Since a segmented bus does not assume the pipelined
TDM transmission mode, it can be implemented in either the electronic or the optical domain.
Though less powerful than an LARPBS, a segmented bus is easier to implement.
Chapter 6
Hypernetworks
Designing high-bandwidth, low-latency, and scalable interconnection networks is a great
challenge in the construction of high-performance parallel computer systems. Traditionally,
interconnection networks are characterized by graphs. Network topologies under
graph models have been extensively investigated. Many network structures have been
proposed, and some have been implemented. Observing the improving electrical bus and
switching technologies and the maturing optical interconnection technologies, Zheng pointed
out that the conventional graph structure is no longer adequate for the design and analysis of
new-generation interconnection structures, and proposed a new class of interconnection
networks, the hypernetworks [84].
The class of hypernetworks is a generalization of point-to-point networks, and it
contains point-to-point networks as a subclass. In a hypernetwork, the physical communication
medium (a hyperlink) is accessible to multiple (usually, more than two) processors.
The relaxation on the number of processors that can be connected by a link provides
more design alternatives, so that greater flexibility in trading off conflicting design
goals is possible. The underlying graph-theoretic tool for investigating hypernetworks
is hypergraph theory [8]. Hypergraphs are used to model hypernetworks. Hypernetwork
design has been formulated as an optimization problem of constructing constrained
hypergraphs. Interested readers may refer to [84, 85, 87, 88] for more justifications, design
issues and implementation aspects of hypernetworks.
Existing results in hypergraph theory and combinatorial block design theory, which is
closely related to hypergraph theory, can be used to design hypernetworks. For example,
in [87], Zheng introduced several low-diameter hypernetworks, referred to as GMSH,
based on the concept of the Steiner triple system, as shown in Figure 6.1. The comparison
between GMSH and the point-to-point hypercube is shown in Table 6.1. From the table,
one can see clearly the above claimed advantages of hypernetworks over point-to-point
networks. In [88], Zheng and Wu proposed a scheme for constructing a new hypernetwork
from an existing one using the concept of the dual graph in hypergraph theory. They showed
that the dual $H^*$ of any given hypergraph $H$ is a hypergraph that has some properties
related to the properties of $H$, so that one can investigate the properties of $H^*$ based on
the properties of $H$. Since the structures of $H$ and its dual $H^*$ can be drastically different,
finding hypergraph duals can be considered a general approach to the design of new
hypernetworks. They investigated the structure of the dual $K_n^*$ of an $n$-vertex complete
point-to-point network $K_n$.
Figure 6.1: Bus implementation of GMSH(3,2).
The hypercube is a popular point-to-point network that has many desirable features, such
as small diameter, symmetry, and support for a large class of efficient parallel algorithms.
In this paper, we propose a class of hypernetworks, the $Q_n^*$ (read as $Q_n$ star)
hypernetworks. The $Q_n^*$ hypernetwork is the dual of the $n$-dimensional hypercube $Q_n$. We
discuss the topological and fault tolerance aspects of $Q_n^*$, and present a set of parallel data
Table 6.1: Comparison between GMSH(3, d) and point-to-point hypercubes

Network      n, number        m, number of       Δ, degree   δ, diameter   m/n ratio
             of vertices      (hyper)links
GMSH(3,1)    3                1                  1           1             0.3
Q_2          4                4                  2           2             1
GMSH(3,2)    21               14                 2           3             0.6
Q_4          16               32                 4           4             2
GMSH(3,3)    903              903                3           7             1
Q_10         1024             5120               10          10            5
GMSH(3,4)    1,631,721        2,175,628          4           15            1.3
Q_21         2,097,152        22,020,096         21          21            10.5
GMSH(3,5)    5.3 × 10^12      8.9 × 10^12        5           31            1.6
Q_42         4.4 × 10^12      92.4 × 10^12       42          42            21
communication algorithms for $Q_n^*$. Our results indicate that the $Q_n^*$ hypernetwork is a
useful and promising interconnection network for high-performance parallel and distributed
computing systems.
6.1 Background
Hypergraphs are used as underlying graph models of hypernetworks. A hypergraph [8]
$H = (V, E)$ consists of a set $V = \{v_1, v_2, \ldots, v_n\}$ of vertices, and a set $E = \{e_1, e_2, \ldots, e_m\}$
of hyperedges such that each $e_i$ is a non-empty subset of $V$ and $\{v \mid v \in e_i, 1 \le i \le m\} = V$.
An edge $e$ contains a vertex $v$ if $v \in e$. If $e_i \subseteq e_j$ implies that $i = j$, then $H$ is a simple
hypergraph. In this article, we only consider simple hypergraphs. When the cardinality
of an edge $e$, denoted as $|e|$, is 1, it corresponds to a self-loop edge. If all the edges
have cardinality 2, then $H$ is a graph that corresponds to a point-to-point network. A
hypergraph of $n$ vertices and $m$ hyperedges can also be defined by its $n \times m$ incidence
matrix $A$, with columns representing edges and rows representing vertices, such that $a_{ij} = 0$
if $v_i \notin e_j$ and $a_{ij} = 1$ if $v_i \in e_j$.
For a subset $E'$ of $E$, we call the hypergraph $H'(V', E')$ such that
$V' = \{v \mid v \in e, e \in E'\}$ the partial hypergraph of $H$ generated by the set $E'$. For a subset $U$ of $V$,
we call the hypergraph $H''(V'', E'')$ such that $E'' = \{e_i \cap U \mid e_i \cap U \neq \emptyset, 1 \le i \le m\}$
and $V'' = \{v \mid v \in e, e \in E''\}$ the sub-hypergraph induced by the set $U$. Note that such an
induced sub-hypergraph may or may not be a simple hypergraph.
The degree $d_H(v_i)$ of $v_i$ in $H$ is the number of edges in $E$ that contain $v_i$. A hypergraph
in which all the vertices have the same degree is said to be regular. The degree of hypergraph
$H$, denoted by $\Delta(H)$, is defined as $\Delta(H) = \max_{v_i \in V} d_H(v_i)$. A regular hypergraph of
degree $k$ is called a $k$-regular hypergraph. The rank $r(H)$ and antirank $s(H)$ of a hypergraph
$H$ are defined as $r(H) = \max_{1 \le j \le m} |e_j|$ and $s(H) = \min_{1 \le j \le m} |e_j|$, respectively. We say
that $H$ is a uniform hypergraph if $r(H) = s(H)$. The hypercube is a good example of a
regular and uniform hypergraph. A uniform hypergraph of rank $k$ is called a $k$-uniform hypergraph. A
hypergraph is vertex (resp. hyperedge) symmetric if for any two vertices (resp. hyperedges)
$v_i$ and $v_j$ (resp. $e_i$ and $e_j$) there is an automorphism of the hypergraph that maps $v_i$ to
$v_j$ (resp. $e_i$ to $e_j$).
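The definitions of degree, rank, and antirank can be checked mechanically on a small instance. The following is a minimal sketch, assuming hyperedges are stored as Python frozensets of integer vertex labels; the helper names (`degrees`, `rank`, `antirank`, `hypercube`) are illustrative and do not come from the text.

```python
from typing import Dict, FrozenSet, List, Tuple

Hyperedge = FrozenSet[int]


def degrees(vertices: List[int], edges: List[Hyperedge]) -> Dict[int, int]:
    """d_H(v): the number of hyperedges that contain v."""
    return {v: sum(1 for e in edges if v in e) for v in vertices}


def rank(edges: List[Hyperedge]) -> int:
    """r(H): the largest hyperedge cardinality."""
    return max(len(e) for e in edges)


def antirank(edges: List[Hyperedge]) -> int:
    """s(H): the smallest hyperedge cardinality."""
    return min(len(e) for e in edges)


def is_regular(vertices: List[int], edges: List[Hyperedge]) -> bool:
    """True if every vertex has the same degree."""
    return len(set(degrees(vertices, edges).values())) == 1


def is_uniform(edges: List[Hyperedge]) -> bool:
    """True if r(H) = s(H)."""
    return rank(edges) == antirank(edges)


def hypercube(n: int) -> Tuple[List[int], List[Hyperedge]]:
    """Q_n viewed as a hypergraph whose hyperedges all have cardinality 2."""
    vertices = list(range(1 << n))
    edges = [frozenset({v, v | (1 << i)})
             for v in vertices for i in range(n) if not v & (1 << i)]
    return vertices, edges
```

For `hypercube(3)` the sketch reports a regular hypergraph of degree 3 that is uniform of rank 2, in line with the remark that the hypercube is both regular and uniform.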
In a hypergraph $H$, a path of length $q$ is defined as a sequence
$(v_{i_1}, e_{j_1}, v_{i_2}, e_{j_2}, \ldots, e_{j_q}, v_{i_{q+1}})$ such that (1) $v_{i_1}, v_{i_2}, \ldots, v_{i_{q+1}}$ are all distinct vertices of $H$;
(2) $e_{j_1}, e_{j_2}, \ldots, e_{j_q}$ are all distinct edges of $H$; and (3) $v_{i_k}, v_{i_{k+1}} \in e_{j_k}$ for $k = 1, 2, \ldots, q$. A path from $v_i$ to $v_j$,
$i \neq j$, is a path in $H$ with its end vertices being $v_i$ and $v_j$. A hypergraph is connected if
there is a path connecting any two vertices. We only consider connected hypergraphs. A
hypergraph is linear if $|e_i \cap e_j| \le 1$ for $i \neq j$, i.e., two distinct buses share at most one
common vertex. For any two distinct vertices $v_i$ and $v_j$ in a hypergraph $H$, the distance
between them, denoted by $dis(v_i, v_j)$, is the length of the shortest path connecting them
in $H$. Note that $dis(v_i, v_i) = 0$. The diameter of a hypergraph $H = (V, E)$, denoted by
$\delta(H)$, is defined by $\delta(H) = \max_{v_i, v_j \in V} dis(v_i, v_j)$. More concepts in hypergraph theory
can be found in [8].
A hypernetwork $M$ is a network whose underlying structure is a hypergraph $H$, in
which each vertex $v_i$ corresponds to a unique processor $P_i$ of $M$, and each hyperedge $e_j$
corresponds to a connector that connects the processors represented by the vertices in $e_j$.
A connector is loosely defined as an electronic or a photonic component through which
messages are transmitted between connected processors, not necessarily simultaneously.
We call a connector a hyperlink.
Unlike a point-to-point network, in which a link is dedicated to a pair of processors,
a hyperlink in a hypernetwork is shared by a set of processors. A hyperlink can be
implemented by a bus or a crossbar switch. Current optical technologies allow a hyperlink
to be implemented by optical waveguides in a folded-bus form operating in pipelined
fashion using time-division multiplexing. Free-space optical or optoelectronic switching
devices such as bulk lenses, microlens arrays, spatial light modulators (SLMs), and smart pixel
arrays can also be used to implement hyperlinks. A star coupler, which uses
wavelength-division multiplexing and can be considered either as a generalized bus structure or as a
photonic switch, is another implementation of a hyperlink. Similarly, an ATM switch,
which uses TDM, is also a hyperlink. In the rest of this chapter, the following pairs of
terms are used interchangeably: (hyper)edges and (hyper)links, vertices and processors,
point-to-point networks and graphs, and hypernetworks and hypergraphs.
6.2 Hypernetwork Design Issues
The problem of designing efficient interconnection networks can be considered as a
constrained optimization problem. For example, the goal of designing point-to-point networks
is to find well-structured graphs (whose ranks are fixed at the constant 2) with small
degrees and diameters. In hypernetwork design, the relaxation on the number of processors
that can be connected by a hyperlink (i.e., the rank of the hyperlink) provides more
design alternatives, so that greater flexibility in trading off conflicting design goals is
possible.
6.3 Dual Hypernetworks and $Q_n^*$ Hypernetworks
6.3.1 Dual Graph
The dual of a hypergraph $H = (V, E)$ with vertex set $V = \{v_1, v_2, \ldots, v_n\}$ and
hyperedge set $E = \{e_1, e_2, \ldots, e_m\}$ is a hypergraph $H^* = (V^*, E^*)$ with vertex set $V^* =$
$\{v_1^*, v_2^*, \ldots, v_m^*\}$ and hyperedge set $E^* = \{e_1^*, e_2^*, \ldots, e_n^*\}$ such that $v_j^*$ corresponds to $e_j$,
with hyperedges $e_i^* = \{v_j^* \mid v_i \in e_j \text{ and } e_j \in E\}$. In other words, $H^*$ is obtained from $H$ by
interchanging the vertices and hyperedges of $H$. The incidence matrix of $H^*$ is the
transpose of the incidence matrix of $H$. Thus, $(H^*)^* = H$. The following relations between a
hypergraph and its dual are apparent [88].
Proposition 1 $H$ is r-uniform if and only if $H^*$ is r-regular.
Proposition 2 The dual of a linear hypergraph is also linear.
Proposition 3 A hypergraph $H$ is vertex symmetric if and only if $H^*$ is hyperedge
symmetric.
Proposition 4 The dual of a sub-hypergraph of $H$ is a partial hypergraph of the dual
hypergraph $H^*$.
Since $(H^*)^* = H$, all the above propositions still hold after interchanging $H$ with $H^*$.
Proposition 5 $\delta(H) - 1 \le \delta(H^*) \le \delta(H) + 1$.
Propositions 1 - 5 show that some properties of the dual hypergraph $H^*$ of a given
hypergraph $H$ can be derived from properties of $H$. For example, if $H$ is a ring, then
$H^*$ is isomorphic to $H$. However, in general, the structures of $H$ and its dual $H^*$ can be
drastically different. Finding hypergraph duals can be considered a general approach
to the design of new hypernetworks.
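The duality construction is easy to experiment with. The sketch below assumes vertices are labeled 0, 1, ..., n-1 and hyperedges are frozensets of those labels; `incidence_matrix`, `transpose`, and `dual` are hypothetical helper names introduced only for illustration. It builds the dual directly from the definition and lets one compare it with the transpose of the incidence matrix.

```python
from typing import FrozenSet, List, Tuple

Hyperedge = FrozenSet[int]


def incidence_matrix(vertices: List[int], edges: List[Hyperedge]) -> List[List[int]]:
    """Rows are vertices, columns are hyperedges; entry is 1 if v_i is in e_j."""
    return [[1 if v in e else 0 for e in edges] for v in vertices]


def transpose(matrix: List[List[int]]) -> List[List[int]]:
    return [list(row) for row in zip(*matrix)]


def dual(vertices: List[int], edges: List[Hyperedge]) -> Tuple[List[int], List[Hyperedge]]:
    """Dual hypergraph: vertex j of the dual stands for hyperedge e_j,
    and hyperedge i of the dual is {j : v_i in e_j}."""
    dual_vertices = list(range(len(edges)))
    dual_edges = [frozenset(j for j, e in enumerate(edges) if v in e)
                  for v in vertices]
    return dual_vertices, dual_edges


# A ring on four vertices, the example mentioned in the text.
ring_vertices = [0, 1, 2, 3]
ring_edges = [frozenset({0, 1}), frozenset({1, 2}),
              frozenset({2, 3}), frozenset({3, 0})]
ring_dual_vertices, ring_dual_edges = dual(ring_vertices, ring_edges)
```

On the four-vertex ring, the dual again has four vertices and four hyperedges of cardinality 2, each vertex lying on two of them, which is consistent with the remark that the dual of a ring is isomorphic to the ring.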
6.3.2 The Hypernetwork $Q_n^*$
We consider using the dual $Q_n^*$ of the hypercube $Q_n$ as a hypernetwork. An $n$-dimensional
hypercube consists of $2^n$ vertices, each labeled by a unique $n$-bit binary number.
Two vertices are connected by an edge if and only if their binary labels differ in exactly one
bit position. Properly labeling the vertices and hyperedges in $Q_n^*$ can greatly simplify its
use as a communication network. Vertex labels are used as processor addresses. Similarly,
hyperedge labels are used as the unique names of hyperlinks. There are many ways to
label the vertices and hyperedges of $Q_n^*$.
Let $I_n$ be the set of non-negative integers that can be represented by $n$-bit binary
numbers. For $l, u \in I_n$, we use $d(l,u)$ to denote the number of different bits in the
binary representations of $l$ and $u$, i.e., $d(l,u)$ is the Hamming distance between the binary
representations of $l$ and $u$.
Definition 1 Let $N_n = n2^{n-1}$ for $n \ge 2$. The hypernetwork $Q_n^*$ is a hypergraph with
vertex set $\{(l,u) \mid l, u \in I_n, l < u, \text{ and } d(l,u) = 1\}$ of $N_n$ vertices and $2^n$ hyperlinks,
$e_0, e_1, \ldots, e_{2^n-1}$. Each vertex $(l,u)$ is connected to exactly two hyperedges $e_l$ and $e_u$.
Example 1 The incidence matrix $A$ of $Q_3^*$ is

A =
           e_0  e_1  e_2  e_3  e_4  e_5  e_6  e_7
  (0,1)     1    1    0    0    0    0    0    0
  (0,2)     1    0    1    0    0    0    0    0
  (1,3)     0    1    0    1    0    0    0    0
  (2,3)     0    0    1    1    0    0    0    0
  (0,4)     1    0    0    0    1    0    0    0
  (1,5)     0    1    0    0    0    1    0    0
  (2,6)     0    0    1    0    0    0    1    0
  (3,7)     0    0    0    1    0    0    0    1
  (4,5)     0    0    0    0    1    1    0    0
  (4,6)     0    0    0    0    1    0    1    0
  (5,7)     0    0    0    0    0    1    0    1
  (6,7)     0    0    0    0    0    0    1    1

The transpose of $A$ is

A^T =
         e_{0,1} e_{0,2} e_{1,3} e_{2,3} e_{0,4} e_{1,5} e_{2,6} e_{3,7} e_{4,5} e_{4,6} e_{5,7} e_{6,7}
  v_0       1       1       0       0       1       0       0       0       0       0       0       0
  v_1       1       0       1       0       0       1       0       0       0       0       0       0
  v_2       0       1       0       1       0       0       1       0       0       0       0       0
  v_3       0       0       1       1       0       0       0       1       0       0       0       0
  v_4       0       0       0       0       1       0       0       0       1       1       0       0
  v_5       0       0       0       0       0       1       0       0       1       0       1       0
  v_6       0       0       0       0       0       0       1       0       0       1       0       1
  v_7       0       0       0       0       0       0       0       1       0       0       1       1

Clearly, $A^T$ is the incidence matrix of the hypercube $Q_3$. Figure 6.2 shows the bus
implementation of the $Q_3^*$ hypernetwork, whose incidence matrix $A$ is given above. Its
corresponding hypercube, whose incidence matrix is $A^T$, is shown in Figure 6.3, where
each edge is labeled by its two end vertices.
□
Figure 6.2: Bus implementation of $Q_3^*$.
Figure 6.3: Hypercube $Q_3$ corresponding to $Q_3^*$.
The $Q_n^*$ hypernetwork can also be defined in a recursive way. One can easily
observe that $Q_3^*$ can be constructed with two copies of $Q_2^*$ (see Figure 6.2), and $Q_4^*$ can be
constructed with two copies of $Q_3^*$ (see Figure 6.4). For brevity, we omit the recursive
definition of $Q_n^*$.
Figure 6.4: Bus implementation of $Q_4^*$.
6.3.3 The Properties of $Q_n^*$
Based on the properties of the hypercube $Q_n$ and Propositions 1 to 4, we have the following
fact:
Fact 1: $Q_n^*$ is 2-regular, n-uniform, linear, and vertex and hyperedge symmetric.
Let $dis((l,u),(l',u'))$ be the distance between vertices $(l,u)$ and $(l',u')$ in $Q_n^*$. We
want to derive the diameter of $Q_n^*$.
Lemma 3 For any two distinct vertices $(l,u)$ and $(l',u')$ in $Q_n^*$,
$dis((l,u),(l',u')) = \min\{d(l,l'), d(l,u'), d(u,l'), d(u,u')\} + 1$.
Proof. We view $(l,u)$ and $(l',u')$ as two edges in $Q_n$. The minimal path connecting these
two edges is one from one end node of $(l,u)$ to one end node of $(l',u')$. Therefore, the
distance is the length of a minimal path between two end nodes plus one. □
Lemma 4 For any two vertices $(l,u)$ and $(l',u')$ in $Q_n^*$, $dis((l,u),(l',u')) \le n$.
Proof. For two vertices $(l,u)$ and $(l',u')$ in $Q_n^*$, if $d(l,l') = n$ then $d(l,u') < n$. Therefore,
$dis((l,u),(l',u')) = \min\{d(l,l'), d(l,u'), d(u,l'), d(u,u')\} + 1 \le \min\{d(l,l'), d(l,u')\} + 1$
$\le (n-1) + 1 = n$. □
Theorem 7 The diameter of $Q_n^*$ is $n$.
Proof. From Lemma 4, we know that the diameter is less than or equal to $n$. All we need
to do is to prove that there are two vertices $(l,u)$ and $(l',u')$ such that $dis((l,u),(l',u')) = n$.
Actually, it is easy to see that vertices (000...0, 100...0) and (011...1, 111...1) meet
the above condition: the four Hamming distances between their end labels are $n-1$, $n$, $n$, and $n-1$. □
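Lemma 3 and Theorem 7 can be checked by brute force on small instances. The following sketch assumes processors are stored as pairs (l, u) of integers with l < u, and that two processors are adjacent when they share a hyperlink label; the function names are illustrative only. It compares breadth-first-search distances in $Q_n^*$ with the closed form of Lemma 3.

```python
from collections import deque
from typing import Dict, List, Tuple

Processor = Tuple[int, int]


def qstar_vertices(n: int) -> List[Processor]:
    """Processors of Q*_n: pairs (l, u), l < u, that differ in exactly one bit."""
    return [(l, l | (1 << i)) for l in range(1 << n)
            for i in range(n) if not l & (1 << i)]


def hamming(x: int, y: int) -> int:
    return bin(x ^ y).count("1")


def lemma3_distance(p: Processor, q: Processor) -> int:
    """Closed form of Lemma 3 (0 when the two processors coincide)."""
    if p == q:
        return 0
    return min(hamming(a, b) for a in p for b in q) + 1


def bfs_distances(n: int, source: Processor) -> Dict[Processor, int]:
    """Hop counts from source, where one hop uses one shared hyperlink."""
    on_link: Dict[int, List[Processor]] = {}
    for p in qstar_vertices(n):
        for label in p:
            on_link.setdefault(label, []).append(p)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        for label in p:
            for q in on_link[label]:
                if q not in dist:
                    dist[q] = dist[p] + 1
                    queue.append(q)
    return dist


def diameter(n: int) -> int:
    return max(max(bfs_distances(n, s).values()) for s in qstar_vertices(n))
```

For n = 2, 3, 4 the search agrees with the formula on every pair, and the largest distance found equals n.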
In $Q_n^*$, the number of processors is $n/2$ times the number of hyperlinks. Each
processor is attached to exactly two hyperlinks, and this simplifies the processor interface
circuit design. Each hyperlink connects $n$ processors. If a 10 × 10 crossbar
switch is implementable and cost effective, then $Q_{10}^*$, with 5,120 processors, can be
implemented using 1,024 such switches as hyperlinks.
Consider the fault tolerance aspect of the $Q_n^*$ hypernetwork. We say that a
hypernetwork $H$ is x-processor fault-tolerant (resp. y-hyperlink fault-tolerant) if it remains
connected when any set of no more than x processors (resp. y hyperlinks) is removed. We
have the following claim.
Theorem 8 $Q_n^*$ is $(2n-3)$-processor fault-tolerant and 1-hyperlink fault-tolerant.
Proof. Consider the hypercube $Q_n$. Take an edge $e$ and delete all edges that share a
vertex with $e$; then the graph becomes disconnected. This implies that it is possible to
disconnect the $Q_n^*$ hypernetwork by removing $2(n-1)$ processors. However, removing
fewer than $2(n-1)$ processors from $Q_n^*$ will not make the remaining part of $Q_n^*$ disconnected.
Hence, $Q_n^*$ is $(2n-3)$-processor fault-tolerant. Since $Q_n^*$ is 2-regular, removing any two
hyperlinks that share a vertex $v$ will disconnect processor $v$. □
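The processor fault-tolerance bound can be confirmed exhaustively for very small n. The sketch below is only a brute-force check, not part of the proof; it assumes that two surviving processors can communicate directly whenever they share a hyperlink label, and the helper names are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

Processor = Tuple[int, int]


def qstar_vertices(n: int) -> List[Processor]:
    """Processors of Q*_n: pairs (l, u), l < u, that differ in exactly one bit."""
    return [(l, l | (1 << i)) for l in range(1 << n)
            for i in range(n) if not l & (1 << i)]


def is_connected(procs: Set[Processor]) -> bool:
    """Depth-first search over the surviving processors."""
    if not procs:
        return True
    start = next(iter(procs))
    seen = {start}
    stack = [start]
    while stack:
        l, u = stack.pop()
        for q in procs:
            if q not in seen and (l in q or u in q):
                seen.add(q)
                stack.append(q)
    return seen == procs


def tolerates_processor_faults(n: int, x: int) -> bool:
    """True if Q*_n stays connected after removing any set of at most x processors."""
    verts = qstar_vertices(n)
    return all(is_connected(set(verts) - set(removed))
               for size in range(1, x + 1)
               for removed in combinations(verts, size))
```

For n = 3 the check passes with x = 2n - 3 = 3 and fails with x = 2(n - 1) = 4, matching the two halves of the argument. The enumeration grows quickly, so this is practical only for small n.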
6.4 Data Communication Algorithms for $Q_n^*$
In this section, we use the vertex and hyperedge labels to design data communication
algorithms for the $Q_n^*$ hypernetwork. For simplicity, we assume a bus implementation of
hyperlinks. In the electronic domain, the bus load, i.e., the number of processors that
can be connected by a bus, is limited. Using optical fibers to implement a bus, the bus
load can be increased significantly. Since a bus is shared by all its connected processors,
the performance of a bus depends on the way it is accessed by processors. For example,
one way for processors to share a bus is to use time-division multiplexing (TDM), which
allocates time slots to processors so that they can only access the bus during their slots.
Another way is to let processors compete for bus tenure, and use an arbiter to grant the
bus tenure in an on-line fashion.
We assume a synchronous mode of communication. Bus allocations, although operated
dynamically, are predetermined by an off-line scheduling algorithm. This bus operational
mode has been used in [16] for analyzing a multiple-bus interprocessor connection
structure. Assume that all messages are of the same length. The communication performance
is measured in terms of parallel message steps. We adopt these assumptions for two
reasons. First, under these assumptions, it is easier to assess the capability and limitation
of the proposed hypernetwork structure. Second, the performance results obtained can
easily be used to evaluate other bus communication methods, by either adding the additional
overheads that may be incurred in TDM transmission and asynchronous bus allocation,
or deducting the transmission latency saved by the pipelining effect of a pipelined optical
bus.
We discuss four types of communication operations: one-to-one communications,
one-to-many communications, many-to-one communications and many-to-many
communications. For each type, we present an algorithm for a representative communication
operation. These communication algorithms constitute a useful set of tools for designing
parallel algorithms on the $Q_n^*$ hypernetwork.
6.4.1 One-to-One Communication
We consider shortest path routing between two processors, where $(l,u)$ and $(l',u')$
represent the source processor and the destination processor, respectively. The idea of the
shortest path routing is as follows. If the two processors share one hyperlink, the message
is transmitted through that hyperlink. Otherwise, the transmission is done from
hyperlink $e_a$ toward hyperlink $e_b$, such that $a \in \{l,u\}$, $b \in \{l',u'\}$, and $d(a,b)$ is minimal.
The source processor sends the message to the processor $(a,c)$ through hyperlink $e_a$ such
that $d(c,b) = d(a,b) - 1$. This process is recursive; that is, the processor $(a,c)$ will then
relay the message toward the processor $(l',u')$. The following is the shortest path routing
algorithm.
procedure ROUTE((l, u), (l', u'))
begin
  Let a ∈ {l, u}, b ∈ {l', u'} such that d(a, b) = dis((l, u), (l', u')) - 1;
  if a = b then
    Processor (l, u) sends the message to (l', u') using e_a
  else begin
    Select c such that d(a, c) = 1 and d(c, b) = d(a, b) - 1;
    Processor (l, u) sends the message to (a, c) using e_a;
    ROUTE((a, c), (l', u'))
  end
end
Theorem 9 For any given pair of processors $(l,u)$ and $(l',u')$ in the $Q_n^*$ hypernetwork,
algorithm ROUTE routes a message from $(l,u)$ to $(l',u')$ along a minimal path in
$dis((l,u),(l',u'))$ message steps.
Proof. The theorem directly follows from Lemmas 3 and 4. □
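The routing procedure can be sketched as an iterative loop. The version below is a sketch under stated assumptions rather than the author's implementation: processors are pairs (l, u) of integers with l < u, the intermediate label c is chosen by flipping the lowest bit in which a and b differ (any differing bit would do), and each hop is recorded as a pair (hyperlink label, receiving processor). All function names are illustrative.

```python
from typing import List, Tuple

Processor = Tuple[int, int]


def qstar_vertices(n: int) -> List[Processor]:
    """Processors of Q*_n: pairs (l, u), l < u, that differ in exactly one bit."""
    return [(l, l | (1 << i)) for l in range(1 << n)
            for i in range(n) if not l & (1 << i)]


def hamming(x: int, y: int) -> int:
    return bin(x ^ y).count("1")


def distance(p: Processor, q: Processor) -> int:
    """Distance in Q*_n according to Lemma 3 (0 for identical processors)."""
    if p == q:
        return 0
    return min(hamming(a, b) for a in p for b in q) + 1


def route(src: Processor, dst: Processor) -> List[Tuple[int, Processor]]:
    """Hops (hyperlink used, receiving processor) taken from src to dst."""
    hops: List[Tuple[int, Processor]] = []
    cur = src
    while cur != dst:
        # Pick end labels a of cur and b of dst with minimal Hamming distance.
        a, b = min(((a, b) for a in cur for b in dst),
                   key=lambda ab: hamming(ab[0], ab[1]))
        if a == b:
            nxt = dst                       # shared hyperlink: deliver directly
        else:
            diff = a ^ b
            c = a ^ (diff & -diff)          # flip the lowest differing bit
            nxt = (min(a, c), max(a, c))
        hops.append((a, nxt))
        cur = nxt
    return hops
```

For example, `route((0, 1), (6, 7))` in $Q_3^*$ takes three hops, which equals the distance given by Lemma 3 for that pair.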
6.4.2 One-to-Many Communication
We consider broadcasting a message from any processor $(l,u)$ to all other processors in
$Q_n^*$. Given $(l,u)$, procedure TRANSFORM is used to transform $(l,u)$ to (0,1) and all the
other $(a,b)$ into $(a',b')$ in $O(n)$ time.
procedure TRANSFORM((l, u))
begin
  for all (a, b) do in parallel
    a' = a;
    b' = b;
    for (k = 0; k < n; k++)
      if l(k) ≠ u(k) then break;
    Shift bit k of l, u, a', b' to 0th bit
    for (k = 0; k < n; k++)
      if l(k) ≠ 0 then
        complement l(k), u(k), a'(k), b'(k)
end
A more complicated program can do the same transform in constant time. We call
this program TRANSFORM1, as shown below.
procedure TRANSFORM1((l, u))
begin
  for all (a, b) do in parallel
    if a = 0 and b = 1 then (a', b') = (l, u);
    else if a = l and b = u then (a', b') = (0, 1);
    else (a', b') = (TRANSLINK(a, l, u), TRANSLINK(b, l, u))
end
procedure TRANSLINK(e, l, u)
begin
1 band u —)• x l 
(1 bxor u) I 1 x2 
e band x2 x3 
e band {bcorn x2) —> x4 
if x 3 = l or (x3 mod 2) =0 th e n  
x3 bxoT F —> x5 
xl bxor x4 -*■ x6 
x5 bar x6 —> e ’ 
return e' 
en d
By the symmetry of the $Q_n^*$ hypernetwork, we know that the new identities $(a',b')$
assigned to processors of $Q_n^*$ satisfy the connectivities of $Q_n^*$. We only need to describe an
algorithm that broadcasts a message from (0,1) to all processors in $Q_n^*$.
We use $0^k$ to represent $k$ consecutive 0's, e.g., $0^3 = 000$. Let $Q_k = 0^{n-k-1}0*^k$ and
$Q'_k = 0^{n-k-1}1*^k$ denote the $k$-dimensional subcubes of $Q_n$ induced by all vertices whose
left $n-k$ bits are $0^{n-k-1}0$ and $0^{n-k-1}1$, respectively. Here, an * in a bit position stands
for "don't care", and $*^k$ represents $k$ consecutive *'s. We use $Q_{k+1} = Q_k + Q'_k$ to denote
the $(k+1)$-dimensional subcube of $Q_n$ induced by vertices in $Q_k$ and $Q'_k$. We explain the
broadcasting algorithm in $Q_n^*$ using an $n$-dimensional hypercube ($Q_n$) by interchanging
the roles of vertices and edges. The idea behind our broadcasting algorithm is as follows.
Assume that initially the edge connecting vertices $0^{n-1}0$ and $0^{n-1}1$ in $Q_n$ is colored,
and all other edges in $Q_n$ are not colored. We want an edge-traversing algorithm $A$ that
systematically traverses all edges in $Q_n$ in $n$ steps. In the $k$-th step, algorithm $A$ selects
a subset $E_k$ of edges in $Q_n$ that satisfies the following conditions: (1) all edges in $E_k$ are
not previously traversed, (2) each edge in $E_k$ has at least one end vertex that is an end
vertex of a previously colored edge, and (3) each edge $e$ in $E_k$ is assigned a direction in which it
is traversed: let $u$ and $v$ be the two end vertices of $e$, and suppose that $u$ is an end vertex
of a previously traversed edge; then $e$ is traversed from $u$ to $v$. In the following, we provide
a selection procedure for $E_k$ so that all edges of $Q_n$ are guaranteed to be traversed in $n$
parallel steps. Obviously, such an algorithm $A$ corresponds to a broadcasting algorithm
$A^*$ for $Q_n^*$.
Now, let us describe our algorithm $A$. Starting from $Q_1$ (which corresponds to the
edge connecting vertices $0^{n-1}0$ and $0^{n-1}1$ in $Q_n$, and the source vertex $(0^{n-1}0, 0^{n-1}1)$ in
$Q_n^*$), we increase the dimension of the cube by one in each step. In the first step, we consider
$Q_2 = Q_1 + Q'_1$; the two edges connecting vertices in $Q_1$ and $Q'_1$ are colored. In the second
step, the edge connecting $0^{n-2}10$ and $0^{n-2}11$ is traversed in the direction from $0^{n-2}10$
to $0^{n-2}11$, and the edges connecting $Q_2$ and $Q'_2$ are traversed in the direction from $Q_2$
to $Q'_2$. Assume that after $k$, $1 \le k \le n-2$, steps, all the edges in $Q_k = 0^{n-k}*^k$
and the ones connecting $Q_k$ and $Q'_k = 0^{n-k-1}1*^k$ have been traversed, but the edges
in $Q'_k$ have not been traversed. In step $k+1$, we traverse all the edges in $Q'_k$ and the
edges connecting $Q_{k+1}$ and $Q'_{k+1}$, where $Q_{k+1} = 0^{n-k-1}*^{k+1}$ and $Q'_{k+1} = 0^{n-k-2}1*^{k+1}$,
in the direction from $Q_{k+1}$ to $Q'_{k+1}$. To traverse all the edges in $Q'_k$, we pick any two
connected vertices $v$ and $u$; if $v < u$ then the link $(v,u)$ is traversed from $v$ to $u$. In step
$n$, we only need to traverse all the edges in $Q'_{n-1}$, in the direction from $Q'_{n-1,0}$ to $Q'_{n-1,1}$.
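The step-by-step edge selection can be simulated to confirm that n steps suffice. The sketch below is one reading of the description above, with assumptions stated explicitly: vertices of $Q_n$ are integers, $Q_t$ is taken to be the set of labels below $2^t$, $Q'_t$ the labels from $2^t$ to $2^{t+1} - 1$, and step t traverses the edges inside $Q'_{t-1}$ together with the edges joining $Q_t$ and $Q'_t$ (the latter only while t < n). The function names are illustrative.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def hypercube_edges(n: int) -> Set[Edge]:
    """All edges (l, u) of Q_n with l < u."""
    return {(v, v | (1 << i)) for v in range(1 << n)
            for i in range(n) if not v & (1 << i)}


def broadcast_schedule(n: int) -> List[Set[Edge]]:
    """E_1, ..., E_n: the edges of Q_n traversed in each step."""
    steps: List[Set[Edge]] = []
    for t in range(1, n + 1):
        lo, hi = 1 << (t - 1), 1 << t       # Q'_{t-1} holds the labels lo .. hi-1
        inside = {(a, a | (1 << i)) for a in range(lo, hi)
                  for i in range(t - 1) if not a & (1 << i)}
        across = {(a, a + hi) for a in range(hi)} if t < n else set()
        steps.append(inside | across)
    return steps


def is_valid_schedule(n: int) -> bool:
    """Check conditions (1) and (2) and that every edge ends up traversed."""
    covered: Set[Edge] = {(0, 1)}           # the initially colored edge
    for step in broadcast_schedule(n):
        touched = {v for e in covered for v in e}
        if step & covered:
            return False                    # violates condition (1)
        if any(a not in touched and b not in touched for a, b in step):
            return False                    # violates condition (2)
        covered |= step
    return covered == hypercube_edges(n)
```

In $Q_n^*$ terms, each traversed edge is a processor that receives the message in that step, so a valid schedule of length n corresponds to a broadcast that completes in n parallel message steps on bus hyperlinks.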
Let $b_{n-1}b_{n-2} \cdots b_0$ be the binary representation of $b$. We use $b^{(i)}$ to represent the binary
number (and its corresponding decimal value) obtained by complementing the $i$th bit, $b_i$,
of the binary representation of $b$. Translating the above hypercube edge traversal algorithm
into an algorithm for traversing vertices in $Q_n^*$, we obtain the following algorithm.
procedure BROADCAST((0, 1))
begin
  for k = 1 to n - 1 do
    for all (a, b) where a ∈ Q_{k-1} and b ∈ Q'_{k-1} do in parallel
      Processor (a, b) sends the message to (a, a^{(k)}) using e_a;
      Processor (a, b) sends the message to (b, b^{(k)}) using e_b;
      Processor (a, b) sends the message to (b, b^{(i)}) using e_b,
        if b < b^{(i)} for i ∈ {0, 1, ..., k - 1}
    endfor
  endfor
  for all (a, b) do in parallel
    if b < b^{(i)} for i ∈ {0, 1, ..., n - 1}
      then Processor (a, b) sends the message to (b, b^{(i)}) using e_b
  endfor
end
In Figure 6.5, we show the broadcasting tree for $Q_4^*$, whose bus implementation is
shown in Figure 6.4. In Figure 6.5, a circle labeled by a pair of integers $a$ and $b$ represents
a processor $(a,b)$. A directed edge labeled by an integer $c$ from $(a,b)$ to $(a',b')$ indicates
that the message is transmitted from $(a,b)$ to $(a',b')$ using hyperlink $e_c$.
Theorem 10 Assuming bus hyperlinks of $Q_n^*$, algorithm BROADCAST broadcasts a
message from any processor to all other processors in $n$ parallel message steps.
Figure 6.5: Data communication pattern for broadcasting from (0,1) in $Q_4^*$.
Proof. The theorem follows directly from a simple induction based on the discussion
preceding BROADCAST. □
A general-case broadcast can also be done with the same complexity. This algorithm,
called GBROADCAST, is shown as follows.
procedure GBROADCAST((l, u))
begin
  (l, u) broadcasts on l and u
  for k = (u - l)/2 to k ≥ 1 step k = k/2 do
    for all (a, b) such that b - a = k and l ≤ a, b ≤ u do in parallel
      Processor (a, b) broadcasts on a and b
    endfor
  endfor
  for k = 2 * (u - l) to k ≤ 2^{n-1} step k = k * 2 do
    for all (a, b) such that b - a = k do in parallel
      Processor (a, b) broadcasts on a and b
    endfor
  endfor
end
6.4.3 Many-to-One Communication
A reduction (or census, or fan-in) function is defined as a commutative and associative
operation on a set of values, such as finding the maximum, addition, logical OR, etc. It can be
carried out using a many-to-one communication operation. We consider only the case in which
the specified reduction operation is addition. The same algorithm can be slightly modified
to perform other reduction operations.
We present an algorithm that performs a summation on a set of
values stored in the A registers of the processors, one per processor, and puts the final
result in processor (l, u). That is, the algorithm computes $\sum_{(a,b) \in Q_n^*} A_{(a,b)}$ and puts
the final result in $A_{(l,u)}$ of processor (l, u). We assume that each processor (a, b) has a
working register $B_{(a,b)}$. Given any processor (l, u), procedure TRANSFORM discussed in
the previous section can be used to transform (l, u) to $(0^{n-1}0, 0^{n-1}1)$ (or (0, 1)) and all
other (a, b) in $Q_n^*$ to (a', b'). Then, we only need to consider the summation algorithm
that stores the final result in processor (0, 1).
Summation is done in two phases. In the first phase, a set of $2^{n-1}$ partial sums is
obtained and stored in processors $(z, 2^{n-1} + z)$, $0 \le z \le 2^{n-1} - 1$. The second phase
computes the sum of these partial sums and stores the final result in (0, 1).
procedure REDUCTION((0, 1))
begin
  /* phase 1 */
  for i = 1 to n - 1 do
    for all (a, b) do in parallel
      if $a_i a_{i-1} = 00$ and $b_i b_{i-1} = 01$ then
        Processor (a, b) sends $A_{(a,b)}$ from (a, b) to (a, …) using $e_a$;
      if $a_i a_{i-1} = 10$ and $b_i b_{i-1} = 11$ then
        Processor (a, b) sends $A_{(a,b)}$ from (a, b) to (…, b) using $e_b$;
      if (a, b) received a value then
        store this value in $B_{(a,b)}$ and perform $A_{(a,b)} := A_{(a,b)} + B_{(a,b)}$
    endfor
  endfor
  /* phase 2 */
  for i = n - 1 down to 1 do
    for all (a, b) do in parallel
      if $a = 0^{n-i-1}00*^{i-1}$ and $b = 0^{n-i-1}10*^{i-1}$ then
        Processor (a, b) sends $A_{(a,b)}$ from (a, b) to (a, …) using $e_a$;
      if $a = 0^{n-i-1}01*^{i-1}$ and $b = 0^{n-i-1}11*^{i-1}$ then
        Processor (a, b) sends $A_{(a,b)}$ from (a, b) to … using $e_a$;
      if (a, b) received two values then
        store one value in $A_{(a,b)}$ and the other in $B_{(a,b)}$, and perform
        $A_{(a,b)} := A_{(a,b)} + B_{(a,b)}$
    endfor
  endfor
end
Theorem 11 Assuming bus hyperlinks of $Q_n^*$, algorithm REDUCTION carries out a
reduction operation in 2(n - 1) parallel message transmission steps.
Figure 6.6: Data communication pattern for reduction in the $Q_n^*$ instance of Figure 6.4.
Proof. First, we claim that at the end of the first phase the sum of the partial sums
stored in processors $(z, 2^{n-1} + z)$, $0 \le z \le 2^{n-1} - 1$, is the final sum. It is easy to
verify that the claim is true for n = 2 and n = 3. Suppose that the claim is true for
n = k, and consider the case n = k + 1. By the algorithm, any processor (a, b) such
that $a_k a_{k-1} = 00$ and $b_k b_{k-1} = 01$ has not sent and received any value before the k-th
step (iteration). Furthermore, by the hypothesis, the sum of the partial sums stored in
processors $(z, 2^{k-1} + z)$, $0 \le z \le 2^{k-1} - 1$, is the sum of the values originally stored in the
sub-hypernetwork induced by all vertices (a, b) in $Q_{k+1}^*$ such that $a < 2^k$ and $b < 2^k$.
Consider one more step (i.e., the k-th iteration) of the first phase. In this step the partial
sum stored in $(z, 2^{k-1} + z)$, $0 \le z \le 2^{k-1} - 1$, is sent to processor $(z, 2^k + z)$, $0 \le z \le 2^{k-1} - 1$,
using hyperlink $e_z$. By symmetry, the sum of the partial sums stored in processors
$(2^k + z, 2^k + 2^{k-1} + z)$, $0 \le z \le 2^{k-1} - 1$, is the sum of the values originally stored in the
sub-hypernetwork induced by all vertices (a, b) in $Q_{k+1}^*$ such that $a \ge 2^k$ and $b \ge 2^k$,
and after the k-th step, the partial sum stored in $(2^k + z, 2^k + 2^{k-1} + z)$, $0 \le z \le 2^{k-1} - 1$,
is sent to processor $(2^{k-1} + z, 2^k + 2^{k-1} + z)$, $0 \le z \le 2^{k-1} - 1$, using hyperlink $e_{2^k + 2^{k-1} + z}$.
Therefore, after performing parallel additions in processors $(z, 2^k + z)$, $0 \le z \le 2^k - 1$, we
obtain the claimed $2^k$ partial sums for $Q_{k+1}^*$. This completes the induction.
Now, we claim that the second phase computes the sum of the partial sums stored
in processors $(z, 2^{n-1} + z)$, $0 \le z \le 2^{n-1} - 1$, and stores the final result in (0, 1) of $Q_n^*$.
This claim is true for n = 2 and n = 3. Suppose that the claim is true for n = k, and
consider the case n = k + 1. In the first iteration (i = k), each processor $(z, 2^{k-1} + z)$,
$0 \le z \le 2^{k-1} - 1$, receives two partial sums, one from $(z, 2^k + z)$ via $e_z$, and the other from
$(2^{k-1} + z, 2^k + 2^{k-1} + z)$ via $e_{2^{k-1} + z}$. After additions, $2^{k-1}$ partial sums are obtained and
stored in processors $(z, 2^{k-1} + z)$, where $0 \le z \le 2^{k-1} - 1$. Then, the induction hypothesis
guarantees that after k - 1 more steps the final result will be stored in processor (0, 1).
This completes the proof of the claim for phase 2, and the proof of the theorem. □
In Figure 6.6, we show the communication pattern used by REDUCTION on the $Q_n^*$ instance
whose bus implementation is shown in Figure 6.4. As in Figure 6.5, a circle
labeled by a pair of integers a and b represents a processor (a, b). A directed edge labeled
by an integer c from (a, b) to (a', b') indicates that the message is transmitted from (a, b)
to (a', b') using hyperlink $e_c$.
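The two phases of REDUCTION can be traced with a short simulation. The sketch below follows the send patterns given in the proof of Theorem 11. The destination formulas in the code (for example `a | (1 << i)`) are my reading of that proof rather than the author's notation, and the helper names `processors`, `bit_pair`, and `reduction` are hypothetical.

```python
def processors(n):
    """Processors of Q*_n: one per hypercube edge (a, b), a < b."""
    return [(a, a | (1 << k)) for a in range(1 << n)
            for k in range(n) if not a & (1 << k)]


def bit_pair(x, i):
    """Bits x_i x_{i-1} of x, as a tuple."""
    return (x >> i) & 1, (x >> (i - 1)) & 1


def reduction(n, values):
    """Sum values (dict: processor -> number) into processor (0, 1).

    Returns (result, number_of_message_steps).
    """
    A = dict(values)
    steps = 0
    # Phase 1: dimension i-1 processors forward their sums to dimension i.
    for i in range(1, n):
        inbox = {}
        for (a, b), val in A.items():
            if bit_pair(a, i) == (0, 0) and bit_pair(b, i) == (0, 1):
                dest = (a, a | (1 << i))        # shares hyperlink e_a
            elif bit_pair(a, i) == (1, 0) and bit_pair(b, i) == (1, 1):
                dest = (b & ~(1 << i), b)       # shares hyperlink e_b
            else:
                continue
            inbox[dest] = val
        for dest, val in inbox.items():
            A[dest] += val                      # A := A + B
        steps += 1
    # Phase 2: fold the 2^(n-1) partial sums down to processor (0, 1).
    for i in range(n - 1, 0, -1):
        half = 1 << (i - 1)
        inbox = {}
        for (a, b), val in A.items():
            if b != a + (1 << i) or a >= (1 << i):
                continue                        # not a sender in this step
            dest = (a, a + half) if a < half else (a - half, a)
            inbox.setdefault(dest, []).append(val)
        for dest, received in inbox.items():
            A[dest] = sum(received)             # two received values, added
        steps += 1
    return A[(0, 1)], steps
```

For example, `reduction(3, {p: 1 for p in processors(3)})` counts the twelve processors of the three-dimensional instance in 2(3 - 1) = 4 steps.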
6.4.4 Many-to-Many Communication
We consider a general case, the all-to-all communication. In an all-to-all communication,
each processor sends a message to all the other processors. It is also called the total
exchange operation.
We can obtain a total exchange communication algorithm by modifying algorithm
REDUCTION. The operator used is set union instead of addition. After applying
REDUCTION, all messages are collected at processor (0, 1). Then, by applying
BROADCAST, processor (0, 1) broadcasts the messages to all remaining processors of $Q_n^*$.
By Theorem 5, the second phase alone takes $nN_n$ parallel message steps. Note that the
lower bound for the time of a total exchange operation on $Q_n^*$ is …, and clearly, this
algorithm is not efficient.
In what follows, we present an algorithm TOTAL-EXCHANGE which takes $O(N_n)$
message steps to perform the total exchange operation on $Q_n^*$. Algorithm
TOTAL-EXCHANGE is an all-port algorithm, i.e., the two I/O ports of each processor may
participate in a message transmission step. However, each port performs either a send operation
or a receive operation, but not both. This algorithm can be easily converted to a single-port
algorithm with the same communication complexity.
For convenience, we say that a processor (i, j) is of dimension k, $0 \le k \le n - 1$, if
$j - i = 2^k$. It is easy to verify the following facts: (i) there are exactly $2^{n-1}$ processors
of dimension k, $0 \le k \le n - 1$, in $Q_n^*$; (ii) there is exactly one processor of dimension k,
$0 \le k \le n - 1$, attached to each hyperlink in $Q_n^*$; and (iii) no two processors of the same
dimension are attached to the same hyperlink in $Q_n^*$.
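Facts (i) to (iii) are easy to confirm by enumeration for small n. The following sketch assumes that processors are the hypercube edges (a, b) with a < b and that processor (a, b) is attached to hyperlinks $e_a$ and $e_b$. The function names are hypothetical.

```python
from collections import Counter


def processors(n):
    """Processors of Q*_n: one per hypercube edge (a, b), a < b."""
    return [(a, a | (1 << k)) for a in range(1 << n)
            for k in range(n) if not a & (1 << k)]


def dimension(proc):
    """Dimension k of processor (i, j), defined by j - i = 2^k."""
    i, j = proc
    return (j - i).bit_length() - 1


def check_dimension_facts(n):
    """Return the truth values of facts (i), (ii), (iii) for Q*_n."""
    procs = processors(n)
    per_dim = Counter(dimension(p) for p in procs)
    fact_i = all(per_dim[k] == 2 ** (n - 1) for k in range(n))
    # Count processors of each dimension attached to each hyperlink e_v.
    on_link = Counter((v, dimension(p)) for p in procs for v in p)
    fact_ii = all(on_link[(v, k)] == 1
                  for v in range(2 ** n) for k in range(n))
    fact_iii = all(count <= 1 for count in on_link.values())
    return fact_i, fact_ii, fact_iii
```

Fact (iii) is implied by fact (ii); the check keeps it separate only to mirror the list in the text.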
procedure TOTAL-EXCHANGE
begin
  for all hyperlinks e do in parallel
    All processors attached to e send their messages to the processor of dimension 0 that
    is attached to e using e;
  endfor
  Let the set of messages received by each processor (i, j) be denoted by …
  for k = 0 to n - 1 do
    for all processors (i, j) of dimension k do in parallel
      Broadcast … to all processors attached to hyperlinks $e_i$ and $e_j$
    endfor
    Let the set of messages received by each processor (i, j) be denoted by …
  endfor
end
The correctness of this algorithm can be verified by induction. It is easy to see
the correctness for the base case. Suppose that the algorithm is correct for n = m, and consider the
case of n = m + 1. After m iterations of the for loop, the total exchange operations are
performed with respect to the subhypergraph of $Q_{m+1}^*$ induced by vertices (i, j) such that
$i < 2^m$ and $j < 2^m$, and the subhypergraph of $Q_{m+1}^*$ induced by vertices (i', j') such that
$i' \ge 2^m$ and $j' \ge 2^m$. In addition, dimension m processors have received all messages in
$Q_{m+1}^*$. In one additional iteration, each processor (a, b) of dimension m broadcasts all its
received messages to processors attached to hyperlinks $e_a$ and $e_b$. Then, by (i), (ii) and
(iii), the total exchange operation is performed with respect to $Q_{m+1}^*$.
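The inductive argument can be checked on small instances by tracking, for every processor, the set of messages it has seen. The sketch below is an illustration under stated assumptions: each message is identified with the processor that owns it, processors keep the union of everything received, and message-step counts are not modelled. The names `processors` and `total_exchange` are hypothetical.

```python
def processors(n):
    """Processors of Q*_n: one per hypercube edge (a, b), a < b."""
    return [(a, a | (1 << k)) for a in range(1 << n)
            for k in range(n) if not a & (1 << k)]


def total_exchange(n):
    """Return, for every processor, the set of messages it ends up holding."""
    procs = processors(n)
    # links[v] lists the processors attached to hyperlink e_v.
    links = {v: [p for p in procs if v in p] for v in range(1 << n)}
    dim = {p: (p[1] - p[0]).bit_length() - 1 for p in procs}
    known = {p: {p} for p in procs}

    # Collection: on every hyperlink, all attached processors send their
    # own message to the dimension 0 processor attached to that hyperlink.
    for attached in links.values():
        collector = next(p for p in attached if dim[p] == 0)
        known[collector].update(attached)

    # Iteration k: every dimension k processor broadcasts what it holds
    # on both of its hyperlinks (one writer per hyperlink, by fact (ii)).
    for k in range(n):
        snapshot = {p: frozenset(known[p]) for p in procs if dim[p] == k}
        for sender, messages in snapshot.items():
            for v in sender:
                for receiver in links[v]:
                    known[receiver] |= messages
    return known
```

After the last iteration every processor should hold all $n2^{n-1}$ messages, which is the property the induction establishes.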
Now, let us analyze the performance of TOTAL-EXCHANGE. In our algorithm, we
assume that when a processor broadcasts a set of messages, it broadcasts all messages it
received in the previous step. As a consequence, duplicated messages are broadcast. We
show that even with duplicated messages, the performance of TOTAL-EXCHANGE is
within a constant factor of the optimal. In the first for statement, 2(n - 1) messages are
collected by each dimension 0 processor (a, b) using two hyperlinks $e_a$ and $e_b$, and this takes
(n - 1) message steps. Consider the for loop. It has n iterations. In the first iteration, 2n
messages are broadcast from each dimension 0 processor (a, b) to all processors attached
to $e_a$ and $e_b$. In the second iteration, 4n messages are broadcast from each dimension 1
processor (a, b) to all processors attached to $e_a$ and $e_b$. In general, in the iteration with
k = m, $2^{m+1}n$ messages need to be broadcast by each dimension m processor (a, b) to all
processors attached to $e_a$ and $e_b$. By (ii) and (iii) above, the iteration with k = m takes no
more than $2^{m+1}n$ message steps. Therefore, TOTAL-EXCHANGE requires no more than
$(n - 1) + (2 + 4 + 8 + \cdots + 2^n)n = (2^{n+1} - 1)n - 1$ parallel message steps. We summarize
this analysis by the following claim.
Theorem 12 Assuming bus hyperlinks of $Q_n^*$, algorithm TOTAL-EXCHANGE carries
out a total exchange operation in $4N_n - n - 1$ parallel message steps.
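The step count in Theorem 12 follows from the bound derived in the analysis once the number of processors is written out. The derivation below assumes $N_n = n2^{n-1}$, which is what fact (i) gives when summed over the n dimensions; the symbol $T(n)$ for the total number of message steps is introduced here only for illustration.

```latex
\begin{align*}
T(n) &\le (n-1) + n\sum_{m=0}^{n-1} 2^{m+1} \\
     &= (n-1) + n\,(2^{n+1}-2) \\
     &= (2^{n+1}-1)\,n - 1 \\
     &= 4N_n - n - 1, \qquad \text{with } N_n = n\,2^{n-1}.
\end{align*}
```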
6.5 Summary and Discussions
We proposed a new class of hypernetworks based on the duals of hypercubes. The structures
of $Q_n^*$ and $Q_n$ are quite different, but as we showed, many properties of $Q_n^*$ can be
directly derived from the properties of $Q_n$. The $Q_n^*$ hypernetwork is suitable for exploiting
the high bandwidths provided by new interconnection technologies such as optical fiber or
devices. We presented a set of basic data communication algorithms for $Q_n^*$ based on bus
implementation of hyperlinks. Algorithms ROUTE and BROADCAST are optimal. The
algorithms REDUCTION and TOTAL-EXCHANGE are optimal within a constant factor.
A trivial lower bound for the communication complexity of the permutation operation
is n. We believe that the performance of algorithm PERMUTATION can be improved.
However, an improved algorithm can be much more complicated.
Except for PERMUTATION, our algorithms are closely related to the ideas behind their
corresponding algorithms on the hypercube network. This leads us to pose an open
problem: is there a simulation scheme that can be used to simulate $Q_n$ by $Q_n^*$ efficiently?
If such a scheme can be found, then all previously known hypercube algorithms can be
automatically translated to algorithms for a machine using $Q_n^*$ as the interconnection
network.
Using the hypergraph dual concept, one can obtain another class of hypernetworks that
contains the duals of the star graphs. The n-star graph $S_n$ (refer to [1] for its definition)
is a point-to-point network that has n! vertices and n!(n - 1)/2 edges, and its degree and
diameter are n - 1 and $\lfloor 3(n-1)/2 \rfloor$, respectively. $S_n$ is vertex and edge symmetric.
Therefore, $S_n^*$ has n!(n - 1)/2 vertices and n! hyperedges; $S_n^*$ is 2-regular, (n - 1)-uniform,
linear, and vertex and hyperedge symmetric; and the diameter of $S_n^*$ is no greater than
$\lfloor (3n-1)/2 \rfloor$. Both the diameter and the degree of $S_n^*$ are sub-logarithmic functions
of the number of processors and the number of hyperlinks in $S_n^*$. Compared with $Q_n^*$, $S_n^*$
has some advantages. The topological and communication aspects of the $S_n^*$ hypernetworks
deserve further investigation.
Chapter 7
Conclusions
In this doctoral research, we first proposed a linear array architecture based on a pipelined
TDM optical bus. We showed that using the conditional-delay and coincident pulse
techniques, several fundamental operations can be carried out very efficiently on our bus
system, and we demonstrated how to design parallel algorithms on our linear architecture.
For example, parallel selection can be done in optimal O(log n) time and parallel sorting
can be done in O(log n) time, where n is the number of data elements. Our bus structure
is simpler and more powerful than the one proposed in [26], and much simpler than and as
powerful as the bus of [57]. Our linear array can be used as a building block to construct
processor arrays of multiple dimensions to achieve better scalability and performance.
Second, we introduced a pipelined asynchronous TDM optical bus based on the coincident
pulse technique. We also proposed a reconfigurable version of this bus to solve the
unfairness problem. Our simulation results indicate that the performance of our TDM bus
is much better than the performance of its FA-TDM counterpart. It is interesting that
our bus architecture can be used to implement an n × n switch. Again, our bus structure
can be used as a building block to construct processor arrays of multiple dimensions.
We then proposed a class of reconfigurable buses, the segmented buses, and constructed
multidimensional interconnection structures using such buses. We showed that these
interconnection structures can be used to build versatile general-purpose parallel machines.
To improve bandwidth, reliability, and versatility, one may connect linearly arranged
processors by several segmented buses instead of one. Interconnection patterns that are more
complex than multidimensional meshes are possible. Segmented buses may appear in a
hierarchical system design, providing chip-to-chip, module-to-module, board-to-board, or
node-to-node communication. We believe that segmented buses are more promising in
hardware implementation than most other reconfigurable buses, such as the ones used in
reconfigurable meshes, because of their relatively simpler control schemes. Our segmented
bus can be implemented in either the electronic or the optical domain, and the implementation
is simple.
Finally, we proposed a new class of hypernetworks based on the duals of hypercubes.
Although the structures of $Q_n^*$ and $Q_n$ are quite different, as we showed, many properties
of $Q_n^*$ can be directly derived from the properties of $Q_n$. The $Q_n^*$ hypernetwork is suitable
for exploiting the high bandwidths provided by new interconnection technologies such as
optical fiber or devices. We presented a set of basic data communication algorithms for
$Q_n^*$ based on bus implementation of hyperlinks. Algorithms ROUTE and BROADCAST
are optimal. The algorithms REDUCTION and TOTAL-EXCHANGE are optimal within
a constant factor. A trivial lower bound for the communication complexity of the
permutation operation is n. We think that the performance of algorithm PERMUTATION can
be improved. However, an improved algorithm can be much more complex.
Except for PERMUTATION, our algorithms are closely related to the ideas behind their
corresponding algorithms on the hypercube network. This leads us to pose an open
problem: is there a simulation scheme that can be used to simulate $Q_n$ by $Q_n^*$ efficiently?
If such a scheme can be found, then all previously known hypercube algorithms can be
automatically translated to algorithms for a machine using $Q_n^*$ as the interconnection
network.
Using the hypergraph dual concept, one can obtain another class of hypernetworks
that contains the duals of the star graphs. The n-star graph $S_n$ (refer
to [1] for its definition) is a point-to-point network that has n! vertices and n!(n - 1)/2 edges,
and its degree and diameter are n - 1 and $\lfloor 3(n-1)/2 \rfloor$, respectively. $S_n$ is vertex
and edge symmetric. Therefore, $S_n^*$ has n!(n - 1)/2 vertices and n! hyperedges; $S_n^*$ is
2-regular, (n - 1)-uniform, linear, and vertex and hyperedge symmetric; and the diameter
of $S_n^*$ is no greater than $\lfloor (3n-1)/2 \rfloor$. Both the diameter and the degree of $S_n^*$
are sub-logarithmic functions of the number of processors and the number of hyperlinks
in $S_n^*$. Compared with $Q_n^*$, $S_n^*$ has some advantages. The topological and communication
aspects of the $S_n^*$ hypernetworks deserve further investigation.
Bibliography
[1] S. Akers, D. Harel, and B. Krishnamurthy. The star graph: an attractive alternative
to the n-cube. In Proceedings of 1987 International Conference on Parallel Processing,
pages 393-400, 1987.
[2] Selim G. Akl. The Design and Analysis of Parallel Algorithms, pages 39-58.
Prentice-Hall, 1989.
[3] R. Alferness, L. Buhl, S. Korotky, and R. Tucker. High-speed Δβ-reversal directional
coupler switch. Top. Meeting Photon. Switching, Tech. Dig. Series, 13:77-78, 1987.
[4] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith.
The tera computer system. In Proc. ACM Int. Conf. Supercomputing, 1990.
[5] S. Araki, M. Kajita, K. Kasahara, K. Kubota, K. Kurihara, I. Redmond, E. Schenfeld,
and T. Suzaki. Experimental free-space optical network for massively parallel
computers. Applied Optics, 35(8):1269-1281, 1996.
[6] A. Benner, H. Jordan, and V. Heuring. Optically switched lithium niobate directional
couplers for digital optical computing. SPIE Proc. Digital Optical Computing II,
1215:343-352, 1990.
[7] A. Benner, H. Jordan, and V. Heuring. Digital optical computing with optically
switched directional couplers. Optical Engineering, 30(12):1936-1941, 1991.
[8] C. Berge. Hypergraphs. North-Holland, 1989.
[9] M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds
for selection. Journal of Computer and System Sciences, 7:448-461, 1973.
[10] D.A. Carlson. Performing tree and prefix computations on modified mesh-connected
parallel computers. In Proceedings of the 1985 IEEE International Conference on
Parallel Processing, 1985.
[11] S. Chandran and A. Rosenfeld. Order statistics on a hypercube. Center for Automation
Research, University of Maryland, College Park, Md., 1986.
[12] D.M. Chiarulli, R.G. Melhem, and S.P. Levitan. Using coincident optical pulses for
parallel memory addressing. IEEE Computer, 20(12):48-58, 1987.
[13] K.L. Chung. Generalized mesh-connected computers with multiple buses. Proc. Int.
Conf. on Parallel and Distributed System, 1993.
[14] W.J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE
Trans. on Computers, 39:775-785, 1990.
[15] D. Bhagavathi, P.H. Looges, S. Olariu, J.L. Schwing, and J. Zhang. A fast selection
algorithm for meshes with multiple broadcasting. In Proceedings of the International
Conference on Parallel Processing, III-10, 1992.
[16] O.M. Dighe, R. Vaidyanathan, and S.Q. Zheng. The bus-connected ringed tree: a
versatile interconnection network. Journal of Parallel and Distributed Computing, to
appear, 1997.
[17] Patrick W. Dowd. Wavelength division multiple access channel hypercube processor
interconnection. IEEE Trans. on Computers, 41(10):1223-1241, 1992.
[18] R. Duncan. A survey of parallel computer architectures. IEEE Trans. on Computers,
pages 5-16, 1990.
[19] Hossam ElGindy and Paulina Wegrowicz. Selection on the reconfigurable mesh. In
Proceedings of the International Conference on Parallel Processing, III-26, 1991.
[20] R. Floren et al. Optical interconnects in the touchstone supercomputer program.
SPIE Proc. Integrated Optoelectronics for Communication and Processing, 1582:46-54,
1989.
[21] Kanad Ghose, R. Kym Horsell, and Nitin K. Singhvi. Hybrid multiprocessing using
wdm optical fiber interconnections. In mppoi, 1994.
[22] J. W. Goodman et al. Optical interconnections for vlsi systems. Proc. IEEE,
72(7):850-866, 1984.
[23] J.R. Goodman and P. Woest. The Wisconsin multicube: A new large-scale
cache-coherent multiprocessor. In Proc. 15th Int. Symp. Computer Arch., 1988.
[24] Mathew S. Goodman et al. The lambdanet multiwavelength network: Architecture,
applications, and demonstrations. IEEE J. on Selected Areas in Communications,
8(6), 1990.
[25] A. Guha, J. Bristow, C. Sullivan, and A. Husain. Optical interconnections for
massively parallel architectures. Applied Optics, 29(8), 1980.
[26] Z. Guo. Sorting on array processors with pipelined buses. In Proceedings of 1992
International Conference on Parallel Processing, pages 289-292, 1992.
[27] Z. Guo et al. Array processors with pipelined optical buses. J. Parallel Distributed
Comput., 12(3):269-282, 1991.
[28] S.H. Horng. Performing prefix computation and its applications on modified
mesh-connected computers with hyperbus broadcasting. ICCI, 1994.
[29] E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. Computer Science
Press, 1978.
[30] D. K. Hunter and D. G. Smith. New architectures for optical tdm switching. Journal
Lightwave Technology, 11:495-511, 1993.
[31] K. Hwang. Omp: A RISC-based multiprocessor using orthogonal access memories and
multiple spanning buses. In Proc. ACM Int. Conf. Supercomputing, 1990.
[32] K. Hwang and F.A. Briggs. Computer Architecture and Parallel Processing, pages
87-88. McGraw-Hill, 1984.
[33] K. Hwang, P.S. Tseng, and D. Kim. An orthogonal multiprocessor for parallel
scientific computations. IEEE Trans. Computers, 38, 1989.
[34] Kai Hwang. Advanced Computer Architecture: Parallelism, Scalability,
Programmability, pages 87-88. McGraw-Hill, 1993.
[35] J. Jahns and S. H. Lee. Optical Computing Hardware. Academic Press, 1994.
[36] J. Jang and V.K. Prasanna. An optimal sorting algorithm on reconfigurable mesh.
In Proc. 6th Int. Parallel Processing Symposium, pages 130-137, 1992.
[37] Harry F. Jordan, Daeshik Lee, Kyungsook Y. Lee, and Srinivasan V. Ramanan.
Serial array time slot interchangers and optical implementations. IEEE Trans. on
Computers, 43(11), 1994.
[38] B. Kahle, E. Parish, T. Lane, and J. Quam. Optical interconnects for interprocessor
communications in the connection machine. In IEEE Conference on Computer
Design, 1989.
[39] P. Lalwaney, L. Zenou, A. Ganz, and I. Koren. Optical interconnects for multiprocessors:
cost performance analysis. In Proc. of the Symposium on Frontiers of Massively
Parallel Computation, 1992.
[40] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays · Trees
· Hypercubes, pages 78-82, 239-244. Morgan Kaufmann Publishers, Inc., 1992.
[41] S.P. Levitan, D.M. Chiarulli, and R.G. Melhem. Coincident pulse techniques for
multiprocessor interconnection structures. Applied Optics, 29(14):2024-2039, 1990.
[42] H. Li and M. Maresca. Polymorphic-torus network. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 11(3):233-243, 1989.
[43] Y. Li, A.W. Lohmann, Z.G. Pan, S.B. Rao, I. Redmond, and T. Wang. Optical
multiple-access mesh-connected bus interconnects. Proc. of IEEE, 82(11):1690-1700,
1994.
[44] Y. Li and S.Q. Zheng. Prefix computation using a segmented bus. In Proc. of the
28th IEEE Southeastern Symposium on System Theory, 1996.
[45] Y. Li and S.Q. Zheng. Asynchronous optical tdm communication for parallel
computers. Proceedings of SPIE’s Photonics West ’97, 1997.
[46] R. Lin, S. Olariu, J. Schwing, and J. Zhang. Sorting in O(1) time on a reconfigurable
mesh of size n x n. In Proceedings of EWPC’92, Plenary Address, IOS Press, pages
16-27, 1992.
[47] A. Louri and L. Sung. An optical multiple-mesh hypercube - a scalable optical
interconnection network for massively-parallel computing. Journal of Lightwave
Technology, 12(4):704-716, 1994.
[48] C. Mead and L. Conway. Introduction to VLSI Systems. Addison-Wesley, 1980.
[49] R. Melhem, D. Chiarulli, and S. Levitan. Space multiplexing of waveguides in optically
interconnected multiprocessor systems. Computer Journal, 32(4):362-369, 1989.
[50] R. Miller, V.K. Prasanna-Kumar, D.I. Reisis, and Q.F. Stout. Data movement operations
and applications on reconfigurable vlsi arrays. In Proceedings of the International
Conference on Parallel Processing, 1, 205-208, 1988.
[51] R. Miller, V.K. Prasanna-Kumar, D.I. Reisis, and Q.F. Stout. Meshes with reconfigurable
buses. In Proc. of the International Conf. on Parallel Processing, volume 1,
pages 205-208, 1988.
[52] John A. Neff. Optical interconnects based on two-dimensional vcsel arrays. In mppoi,
1994.
[53] M. Nigam and S. Sahni. Sorting n numbers on n x n mesh with buses. In Technical
Report #92-5, University of Florida, Gainesville, 1992.
[54] S. Olariu and J. Schwing. A new deterministic sampling scheme with applications
to broadcast efficient sorting on the reconfigurable mesh. Journal of Parallel and
Distributed Computing, 1996.
[55] Y. Pan. Hough transform on arrays with an optical bus. In Proceedings of the 5th
International Conference on Parallel and Distributed Computing and Systems, pages
161-166, 1992.
[56] Y. Pan. Order statistics on optically interconnected multiprocessor systems. In mppoi,
1994.
[57] Y. Pan and K. Li. Linear array with a reconfigurable pipelined bus system - concepts
and applications. In Proceedings of 1996 International Conference on Parallel and
Distributed Processing Techniques and Applications, pages 1431-1442, 1996.
[58] Craig Partridge. Gigabit Networking, pages 20-21. Addison-Wesley Publishing
Company, 1993.
[59] S. Pavel and S.G. Akl. On the power of arrays with reconfigurable optical buses.
In Proceedings of International Conference on Parallel and Distributed Processing
Techniques and Applications, pages 1443-1454, 1996.
[60] V.K. Prasanna-Kumar and C.S. Raghavendra. Array processor with multiple
broadcasting. Journal of Parallel and Distributed Computing, 2(4):173-190, 1987.
[61] P.R. Prucnal, I. Glesk, and J.P. Sokoloff. Demonstration of all-optical self-clocked
demultiplexing of tdm data at 250 gb/s. In mppoi, 1995.
[62] C. Qiao, R. Melhem, D. Chiarulli, and S. Levitan. Optical multicasting in linear
arrays. International Journal of Optical Computing, 2(1), 1991.
[63] Chunming Qiao. On designing communication-intensive algorithms for a spanning
optical bus based array. Parallel Processing Letters (to appear), 1995.
[64] Chunming Qiao and Rami Melhem. Reconfiguration with time division multiplexed
min’s for multiprocessor communications. IEEE Trans. on Parallel and Distributed
Systems, 5(4), 1995.
[65] Chunming Qiao and Rami C. Melhem. Time-division optical communications in
multiprocessor arrays. IEEE Trans. on Computers, 42(5):557-590, 1995.
[66] S. V. Ramanan, H. F. Jordan, and J. R. Sauer. A new time domain, multi-stage
permutation algorithm. IEEE Trans. Inform. Theory, 36:171-173, 1990.
[67] I. Redmond and E. Schenfeld. Experimental results of a 64-channel, free-space
optical interconnection network for massively parallel processing. Institute of Physics
Conference Series, 1995.
[68] John H. Reif and Akitoshi Yoshida. Free space optical message routing for high
performance parallel computers. In mppoi, 1994.
[69] J. Rothstein. Bus automation, brains, and mental models. IEEE Trans. on Systems
Man Cybernetics, 18, 1988.
[70] C.L. Seitz. Concurrent vlsi architectures. IEEE Trans. on Computers, 33:1247-1265,
1984.
[71] H.J. Siegel. Interconnection Networks for Large Scale Parallel Processing. McGraw-Hill
Publishing Co., 1990.
[72] D.M. Spirit, A.D. Ellis, and P.E. Barnsley. Optical-time division multiplexing -
systems and networks. IEEE Communications Magazine, 32(12):56-62, 1995.
[73] F. Stone, J. Watson, D. Moser, and W. Minford. Performance and yield of pilot-line
quantities of lithium niobate switches. In SPIE Conf. Proc. OE/Fibers, 1989.
[74] Q.F. Stout. Meshes with multiple buses. In 27th IEEE Symp. Found. Comput. Sci.,
1986.
[75] S. Suzuki and K. Nagashima. Optical broadband communications network architecture
utilizing wavelength-division switching technologies. In Technical Digest, Topical
Meeting on Photonic Switching (Optical Society of America, Washington, DC), 1987.
[76] T. Szymanski. Hypermeshes - optical interconnection networks for parallel computing.
Journal of Parallel and Distributed Computing, 26(1):1-23, 1995.
[77] Andrew S. Tanenbaum. Computer Networks, pages 57-59. Prentice-Hall, 1988.
[78] R. A. Thompson and P. P. Giordano. An experimental photonic time slot interchanger
using optical fibers as reentrant delay-line memories. Journal Lightwave Technology,
5:154-162, 1987.
[79] J.L. Trahan, R. Vaidyanathan, and C.P. Subbaraman. Constant time graph algorithms
on the reconfigurable multiple bus machine. Parallel and Distributed Computing,
1996.
[80] Y. Pan, Y. Li, and S.Q. Zheng. A pipelined tdm optical bus with conditional delays.
In Optical Engineering, to appear, 1997.
[81] G.H. Chen, Y.C. Chen, W.T. Chen, and J.P. Sheu. Designing efficient parallel algorithms
on mesh-connected computers with multiple broadcasting. IEEE Trans. on Parallel
and Distributed Systems, 1(2):241-246, 1990.
[82] J.G. Zhang and G. Picchi. Self-synchronized all-optical time-division multiple-access
broadcast network. Electronics Letter, 29(21):1871-1873, 1993.
[83] Yi-Mo Zhang, Xiao-Qing He, Ge Zhou, Wen-Yao Liu, Yong Wang, and Zhan-Ping
Yin. Optical fiber interconnection system for massively parallel processor arrays. In
mppoi, 1995.
[84] S.Q. Zheng. Hypernetworks - a class of interconnection networks with increased wire
sharing. In Part I - Part IV, Technical Reports, Department of Computer Science,
Louisiana State University, 1994.
[85] S.Q. Zheng. Hypercube hypernetworks: Implementations of hypercube with increased
wire sharing. In Proc. of the 8th International Conf. on Parallel Processing,
pages 452-457, 1995.
[86] S.Q. Zheng. Hypernetworks: A class of interconnection networks for new generation
parallel computers. In Proc. of International Conf. on Parallel Processing, 1995.
[87] S.Q. Zheng. Sparse hypernetworks based on steiner triple systems. In Proc. of
International Conf. on Parallel Processing, 1995.
[88] S.Q. Zheng and J. Wu. Dual of a complete graph as an interconnection network. In
Proc. of IEEE SPDP, 1996.
Vita
Yueming Li was born in Shi Village, Xia County, Shanxi Province, People’s Republic
of China, in 1961. He attended the village’s elementary school and then Miaoqian High
School in a neighboring town. After graduating from high school, he worked as a farmer.
In the spring of 1978, he entered the Beijing Institute of Iron and Steel, now named the
University of Science and Technology Beijing. He studied Mining Machinery in the
Department of Mining and Mineral Engineering. He obtained his Bachelor of Engineering
and Master of Engineering in 1982 and 1984, respectively. After graduating from college,
he worked as a mechanical engineer at Beijing Metallurgical Research Institute.
In the fall of 1991, he came to the United States to study western technology. He first
enrolled at South Dakota State University to study in the Department of Agricultural
Engineering and the Department of Computer Science. In the fall of 1993, after graduating
with dual master's degrees, he transferred to Louisiana State University to study for a
doctoral degree in the Department of Computer Science. He expects to receive the degree
in December 1997.
During his doctoral studies, he has published numerous journal and conference
papers. His major research interests include Parallel and Distributed Computing,
Image Processing, Neural Networks and High Performance Software.
DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: Yueming Li
Major Field: Computer Science
Title of Dissertation: Design and Analysis of Optical Interconnection
Networks for Parallel Computation
Approved:
Graduate School
EXAMINING COMMITTEE:
Date of Examination: August 27, 1997
