Multiple Bus Networks for Binary-Tree Algorithms, by Dharmasena, Hettihewage Prasanna
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
2000
Multiple Bus Networks for Binary-Tree Algorithms.
Hettihewage Prasanna Dharmasena
Louisiana State University and Agricultural & Mechanical College
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
MULTIPLE BUS NETWORKS FOR 
BINARY-TREE ALGORITHMS
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and 
Agricultural and Mechanical College 
in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering
by
H. P. Dharmasena 
B.S., University of Moratuwa, 1983 
M.S., Louisiana State University, 1987 
May 2000
UMI Number 9979254
UMI Microform 9979254
Copyright 2000 by Bell & Howell Information and Learning Company. 
All rights reserved. This microform edition is protected against 
unauthorized copying under Title 17, United States Code.
Bell & Howell Information and Learning Company 
300 North Zeeb Road 
P.O. Box 1346 
Ann Arbor, MI 48106-1346
Acknowledgments
I would like to express my sincere gratitude to Dr. R. Vaidyanathan for his guidance, 
wisdom and especially his patience during the course of this research. I would also 
like to thank members of my committee Dr. S. Kundu, Dr. A. El-Amawy, Dr. J. L. 
Trahan, Dr. K. Zhou and Dr. L. J. Smolinsky.
This work would have been impossible without the support of various individuals
at my work. I wish to express my sincere appreciation to the members of the
instrument development group for their understanding and support.
Table of Contents
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter
1 Introduction
  1.1 MBNs and Binary-Tree Algorithms
  1.2 Scope of the Dissertation
  1.3 Contribution of this Work
  1.4 Organization of the Dissertation
2 Preliminaries
  2.1 Binary Tree Algorithms
  2.2 Multiple Bus Networks
  2.3 Running Binary Tree Algorithms on MBNs
    2.3.1 Direct and Indirect Mapping
  2.4 Prefix Computations on Binary-Tree MBNs
3 Degree, Loading, Time Trade-Offs
  3.1 Preliminaries
  3.2 Lower Bound for Direct Mapping
  3.3 An $\Omega(\sqrt{n})$ Lower Bound
    3.3.1 Strategy and Definitions
    3.3.2 Basic Results
    3.3.3 The Accounting Scheme
    3.3.4 Non-Uniform Bus Usage
    3.3.5 The Lower Bound
  3.4 An $\Omega(n^{2/3})$ Lower Bound
    3.4.1 Additional Results
    3.4.2 Tighter Lower Bound
  3.5 An $\Omega(n/\log n)$ Lower Bound
    3.5.1 Initial Condition
    3.5.2 The New Accounting Scheme
    3.5.3 Tighter Lower Bound
  3.6 The Tree MBN
  3.7 Loading-Speed Tradeoff
    3.7.1 Lower Bound
    3.7.2 Upper Bound
  3.8 Extension to k-ary Tree Algorithms
  3.9 Concluding Remarks
4 Multiple-Bus Enhanced Meshes
  4.1 Preliminaries
    4.1.1 MBN Measures
    4.1.2 Multiple-Bus Enhanced Meshes
  4.2 Binary-Tree MBN Extensions
  4.3 Meshes with Tree MBNs
  4.4 MBEMs with Segment Switches
    4.4.1 Binary-Tree MBNs with Segment Switches
    4.4.2 Meshes with Tree MBNs and Segment Switches
  4.5 Results and Discussion
5 Fault Tolerance
  5.1 Fault Model
  5.2 Replication
    5.2.1 Adding Redundant Connections
    5.2.2 Definition of $\mathcal{R}^*$
    5.2.3 The Designated Set
    5.2.4 Fault Tolerance Properties of $\mathcal{R}^*$
    5.2.5 Processor Faults
    5.2.6 Fault Tolerant Binary-Tree MBNs
  5.3 Recursive Scheduling
    5.3.1 An MBN, with 2m - 1 Buses
    5.3.2 Recursive Scheduling with 23 Bus Faults
    5.3.3 Putting it All Together
  5.4 Comparison of Results
  5.5 Concluding Remarks
6 VLSI Layout Lower Bound
  6.1 Preliminaries
    6.1.1 VLSI Model
    6.1.2 Definitions and Figure Conventions
  6.2 Towards the Lower Bound
    6.2.1 Minimum Communication Structure
    6.2.2 Labeling Links
    6.2.3 Collapsing Links
7 Summary and Future Work
Bibliography
Vita
List of Tables
4.1 Some results for meshes with Tree MBN
4.2 Some results for meshes with Tree MBN and Segment Switches
5.1 Summary of results
List of Figures
2.1 Running Bin(3) on an 8 × 4 MBN
2.2 A 16 × 8 MBN and its matrix
2.3 Steps of running prefix computation on binary-tree MBNs
3.1 T(n) with a direct mapping
3.2 Step 1 of X i(n)
3.3 Processors and buses in the neighborhood of bus $b_0$
3.4 Subintervals of [L, n]
3.5 The connections on bus $b_0$
3.6 Running Bin(4) on T(4)
3.7 Path with t delays
3.8 Running Bin(3) on 8 × 4 MBN, 2?(3)
3.9 MBN for ternary tree algorithms
4.1 A 32 × 8 Tree MBN
4.2 Structure of a mesh with binary-tree MBN
4.3 Steps of method 1
4.4 Steps of the method 2
4.5 Construction 1
4.6 Construction 2
5.1 The MBN of Figure 2.2 augmented to handle 3 bus faults
5.2 Graph £73jg
5.3 An illustration of the proof of Lemma 5.1
5.4 Node disjoint correspondences for an example
5.5 Regions of F(n)
5.6 Recursive decomposition of F(n)
5.7 Running the first step of Region 3
5.8 An example of an 8 × 4 MBN
5.9 Connections of processors and buses with one bus fault
5.10 Connection of processors and buses in Region 3
5.11 Regions of F(n) for k bus faults
6.1 H-Tree layout of a 31-processor binary tree
6.2 7-node binary tree layout
6.3 8-processor MBN layout
6.4 8-processor MBN running Bin(3)
6.5 A perimeter layout
6.6 Links between processors
6.7 View from processor 1
6.8 Subset view I
6.9 Subset view II
6.10 Subset view III
6.11 Subset view from processor 1
6.12 Subset view from processor 4
6.13 Equivalent views
6.14 Symbolic representation of a subset view
6.15 Communication Structure for 3)
6.16 Subcase 1(a)
6.17 Subcase 1(b): P2-P1-P0 link
6.18 Subcase 1(b): P2-P3-P0 link
6.19 Case 2
6.20 Case 2: P2-P1-P0 link
6.21 Case 2: P2-P3-P0 link
6.22 Subcase 3(a)
6.23 Subcase 3(b): P2-P1-P0 link
6.24 Subcase 3(b): P2-P3-P0 link
6.25 T(n)
6.26 View from final result processor
6.27 A maximal collapse of a structure with 4 levels
6.28 A different collapse
Abstract
Multiple bus networks (MBN) connect processors via buses. This dissertation addresses
issues related to running binary-tree algorithms on MBNs. These algorithms
are of a fundamental nature, and reduce inputs at leaves of a binary tree to a result
at the root. We study the relationships between running time, degree (maximum
number of connections per processor) and loading (maximum number of connections
per bus). We also investigate fault-tolerance, meshes enhanced with MBNs, and VLSI
layouts for binary-tree MBNs.
We prove that the loading of optimal-time, degree-2, binary-tree MBNs is non-constant.
In establishing this result, we derive three loading lower bounds, $\Omega(\sqrt{n})$,
$\Omega(n^{2/3})$ and $\Omega(n/\log n)$, each tighter than the previous one. We also show that if the
degree is increased to 3, then the loading can be a constant. A constant-loading
degree-2 MBN exists if the algorithm is allowed to run slower than the optimal.
We introduce a new enhanced mesh architecture (employing binary-tree MBNs)
that captures features of all existing enhanced meshes. This architecture is more flexible,
allowing all existing enhanced mesh results to be ported to a more implementable
platform.
We present two methods for imparting tolerance to bus and processor faults in
binary-tree MBNs. One of the methods is general, and can be used with any MBN and
for both processor and bus faults. A key feature of this method is that it permits the
network designer to designate a set of buses as “unimportant” and consider all faulty
buses as unimportant. This minimizes the impact of faulty elements on the MBN.
The second method is specific to bus faults in binary-tree MBNs, whose features it
exploits to produce faster solutions.
We also derive a series of results that distill the lower bound on the perimeter
layout area of optimal-time, binary-tree MBNs to a single conjecture. Based on this
we believe that optimal-time, binary-tree MBNs require no less area than a balanced
tree topology even though such MBNs can reuse buses over various steps of the
algorithm.
Chapter 1
Introduction
In a parallel processing system, the interprocessor communication network plays a 
very important role. In this dissertation, we deal with one class of such networks 
called multiple bus networks (MBNs). An MBN consists of a set of processors and 
a set of buses, with each processor connected to at least one bus. Any processor
connected to a bus can access the bus. However, the bus can convey only one piece 
of information at a time.
MBNs have several advantages over traditional point-to-point networks (such as 
the ring, mesh, torus and hypercube). In a point-to-point network, each communication
link is dedicated to a pair of processors. In an MBN, on the other hand, the
communication medium (bus) is shared among several processors and could, therefore,
be used more efficiently. This sharing of the communication medium also allows
for a graceful degradation of performance in the presence of faults. Because the
communication medium is shared, MBNs lend themselves to easy broadcasting. An
MBN can be used to emulate several point-to-point topologies or sets of interconnection
functions [26, 27, 33, 47, 48] as each bus could serve as a communication link
between different processor pairs at different times. An MBN is representative of
any hypergraph based system [6], and a bus can be viewed as an abstraction of any
shared resource, for example a memory module in shared memory systems, or a transmission
frequency in systems with frequency division multiplexing (such as wireless
[18, 22, 35, 52, 61] and optical [5, 20, 65]). Therefore, this work may find applicability
in other settings as well.
Traditionally, MBNs have been used in an asynchronous environment with relatively
few processors. Most of this work has analyzed the data throughput of
multiprocessor systems under various traffic models, arbitration schemes, and relationships
between the numbers of processors and buses [12, 14, 21, 28, 32, 36, 55, 56, 66, 94].
Work also exists on variations of the basic MBN model [9, 16, 37, 39, 43, 53, 90] and
on the pattern of connections between processors and buses [12, 31, 39, 42, 54, 81].
MBNs have traditionally been used in asynchronous systems with a small number of
processors partly because physical loading due to capacitive coupling
limits the number of connections to a bus. In an optical bus, loading is caused by a
receiver on the bus drawing some of the available power, thus limiting the number of
receivers that can be connected to the bus. The asynchronous bus model also requires
a complex arbitration scheme to resolve bus contention.
In this dissertation we primarily consider a synchronous bus model, though most 
results apply to asynchronous MBNs as well. Technological advances have made it 
feasible to connect more loads on a bus. This, in turn, makes fine-grained synchronous 
MBNs (with a large number of processors) possible. The synchronous environment 
also removes the need for arbitration. Feldman et al. [29] recently proposed an optical 
slab waveguide bus capable of connecting a large number of processors at very high
data rates. Qiao and Melhem [70] proposed a communication scheme called time-division
source-oriented multiplexing (TDSM) for synchronous optical buses that can
be used for large systems. Their method takes advantage of unidirectional propagation
and predictable delay of optical fibers to achieve reliable communication among
a large number of processors. Lin et al. [51] have proposed “precharged” buses to 
facilitate concurrent broadcasts.
Much work on synchronous MBNs has centered around topologies (primarily the 
two-dimensional mesh) enhanced with buses (for example [1, 4, 10, 17, 19, 64, 69, 71, 
75]). There has also been some work on running algorithm classes and implementing 
interconnection functions [2, 23, 24, 25, 26, 40, 46, 47, 50, 63, 83, 85]. Another 
class of MBNs that uses synchronous buses is reconfigurable models, which allow the
connection pattern between processors to change dynamically (Nakano [59] provides
an extensive bibliography). Under current technological constraints, however, fixed
connection MBNs, such as those considered in this work, are easier to implement
than reconfigurable networks. Commercially available field programmable gate arrays
(FPGAs) have also been proposed as reconfigurable computational platforms [34, 41,
58, 74, 88, 91, 92]. The programmable interconnections between “configurable logic
blocks” in FPGAs show some features of MBNs in that they are often implemented
as wires with taps (buses) [89, 93].
In this dissertation we address various issues related to running a well-known 
class of algorithms called binary-tree algorithms on MBNs. (Other researchers have 
also studied algorithm classes on MBNs and other networks [2, 40, 47, 62, 63, 73, 
82].) A binary-tree algorithm reduces N inputs to a single result. The computation
performed by such an algorithm can be represented as a balanced binary tree with
the inputs at the leaves and the result at the root. Several fundamental algorithms
involving semigroup operations and prefix computations such as maximum/minimum,
parity, polynomial evaluation and barrier synchronization can be implemented as
binary-tree algorithms. Binary-tree algorithms require a rich communication pattern,
so a network suitable for running binary-tree algorithms is likely to be suitable for
many other applications as well. Because of the fundamental nature of binary-tree
algorithms, a dedicated hardware module to run these algorithms could aid the solution of
a large number of problems. Any insights gained by studying binary-tree algorithms
will be useful in designing such modules. In the past, binary-tree algorithms and
MBNs for them have been studied in the setting of enhanced meshes [1, 4, 10, 19, 64, 69, 75, 76].
Other work on binary-tree MBNs addresses issues such as design, fault-tolerance and
VLSI layouts [2, 25, 26, 27, 57, 63, 83, 84, 85].
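As an illustration of the computation pattern just described, the following is a minimal sequential sketch of a binary-tree reduction. It is not taken from the dissertation: the function name `binary_tree_reduce` and its return convention are assumptions made for this example, and the loop merely simulates, one tree level at a time, what processors on an MBN would do in parallel.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def binary_tree_reduce(inputs: List[T], op: Callable[[T, T], T]) -> Tuple[T, int]:
    """Reduce `inputs` pairwise, level by level, as in a balanced binary tree.

    Returns (result, levels), where `levels` is the number of tree levels
    processed. Hypothetical helper for illustration only; `op` is assumed
    to be an associative (semigroup) operation such as max or XOR.
    """
    if not inputs:
        raise ValueError("need at least one input")
    level = list(inputs)
    levels = 0
    while len(level) > 1:
        # Combine neighbors (0,1), (2,3), ...; on a parallel machine all of
        # these pairwise operations would happen in the same step.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            # An unpaired last value is carried up unchanged.
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


# Maximum of 8 inputs: the result sits at the root after 3 levels.
print(binary_tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], max))
# Parity of 4 bits using XOR as the semigroup operation.
print(binary_tree_reduce([1, 0, 1, 1], lambda a, b: a ^ b))
```

For N a power of two, the loop runs log N times, which matches the number of levels of the balanced tree described above.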
1.1 MBNs and Binary-Tree Algorithms
In this section we present a broad picture of the issues related to MBNs running 
binary-tree algorithms (or “binary-tree MBNs”). An N × M Multiple Bus Network
(MBN) has N processors and M buses. Each processor is connected to a subset of
the set of buses. Two processors may communicate in one unit of time, provided
they are connected to a common bus. However, the bus may carry only one piece of
information at any given point in time. Two important parameters of an MBN
are its degree (maximum number of buses connected to a processor) and loading
(maximum number of processors connected to a bus). These parameters determine
the cost and implementability of the MBN. A large degree requires a processor to
have a large number of input/output ports, while a large loading can reduce the data
rate of the system.
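The definitions of degree and loading can be made concrete with a small sketch. The representation below (one set of bus indices per processor) and the names `degree_and_loading` and `can_communicate` are assumptions chosen for this example rather than notation from the dissertation, and the 4 × 2 network shown is hypothetical.

```python
from typing import List, Set, Tuple


def degree_and_loading(connections: List[Set[int]], num_buses: int) -> Tuple[int, int]:
    """Return (degree, loading) of an MBN.

    connections[p] is the set of bus indices (0 .. num_buses - 1) that
    processor p is connected to. Hypothetical representation for illustration.
    """
    # Degree: maximum number of buses connected to any one processor.
    degree = max(len(buses) for buses in connections)
    # Loading: maximum number of processors connected to any one bus.
    load = [0] * num_buses
    for buses in connections:
        for b in buses:
            load[b] += 1
    return degree, max(load)


def can_communicate(connections: List[Set[int]], p: int, q: int) -> bool:
    """Two processors can communicate in one step only if they share a bus."""
    return bool(connections[p] & connections[q])


# A hypothetical 4 x 2 MBN: processor 1 bridges bus 0 and bus 1.
mbn = [{0}, {0, 1}, {1}, {1}]
print(degree_and_loading(mbn, 2))   # degree 2, loading 3
print(can_communicate(mbn, 0, 1))   # share bus 0
print(can_communicate(mbn, 0, 2))   # no common bus
```

In this sketch the sharing constraint (one piece of information per bus per time unit) is not modeled; only the static connection pattern is examined.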
One direction of research on MBNs considers a given pattern of interconnections 
between processors and buses and investigates the capabilities of the resulting MBN 
architecture. Often this takes the form of emulating other architectures (for example 
[26, 27, 40]) or developing algorithms on models ([52, 60, 61]); the enhanced mesh
results cited earlier also fall in this category. The second direction considers the
problem of designing an MBN suited to a particular interconnection requirement. 
This dissertation and others [48, 57, 63] represent work in this direction.
As mentioned earlier, degree and loading are important considerations for MBNs.
Clearly, the MBN should also be evaluated on how well it provides the interconnection
requirements in question; this would consider issues such as number of hops and
congestion on buses. Constructing an “optimal” MBN to run a given set of interconnection
functions is a non-trivial task. Kulasinghe and El-Amawy [46] showed that
the general problem of designing an optimal interconnecting network for a given set
of interconnection functions is NP-Hard. The criteria they used for measuring
cost are the numbers of buses and interfaces (connections between processors and
buses). They showed [47] that this problem can be solved in polynomial time for
certain “symmetric” interconnections, and presented a methodology for such implementations.
Though such symmetries exist in interconnection topologies, this is not
the case for many algorithm classes. Moreover, their analysis does not address the
interplay between speed, degree and loading of the MBN.
With a single bus, the solution is simple as the only possibility is to connect all
processors to the bus; this approach is used in most enhanced meshes and traditional 
multiprocessor systems. This method has the disadvantage of high loading and bus 
contention, limiting the size of the network. At the other extreme, all the processors 
could be connected to all buses. The Broadcast Communication Model (BCM) [60, 61] 
adopts this approach. This increases the loading and degree, and consequently, the 
cost of the MBN.
Thus an intermediate solution (that connects each processor to a subset of buses) is
important. Optimal MBNs for binary-tree algorithms are particularly challenging to
design. On one hand, for an N-input algorithm the MBN needs sufficient bandwidth
to sustain $O(N)$ simultaneous communications; the lower (near leaf) levels of the
tree involve a large number of simultaneous communications. On the other hand,
because of its similarity to a binary tree the MBN should be fairly sparse. Thus, a
small number of connections needs to be distributed over a large number of buses,
lowering the acceptable values for both degree and loading. Most previous results
have completely ignored degree and loading, or have reduced one at the expense of the
other. For example, Vaidyanathan and Padmanabhan [85] have proposed an MBN on which
an N-input binary-tree algorithm runs optimally in log N steps. Though the degree of this
MBN is 2, its loading is $O(\log N)$. On the other hand, Ragavendra [71] proposed
a mesh with a hierarchy of broadcast buses in each row and column. For a given
parameter $k$, this MBN has a loading of $k$, but a degree of O ( ^ y ). Thus if the
degree is small, the loading is large and vice versa. In this dissertation we construct
an MBN that runs binary-tree algorithms optimally and which has both constant
degree and loading. We now describe results of this dissertation in more detail.
1.2 Scope of the Dissertation
Because of their fundamental nature, binary-tree algorithms have been studied in almost
all facets of computing. As mentioned earlier, most previous work on binary-tree
MBNs has focused on enhanced mesh architectures. Very little work has been done
on identifying fundamental properties of binary-tree MBNs and on establishing
relationships between running time, loading and the degree. In Chapter 3, we study these
relationships, and establish lower bounds on degree-2, binary-tree MBNs. We identify
two important mappings and establish that it is essential to have a mapping
called indirect mapping to achieve low loading. We do this by establishing a series of
lower bounds on loading, each one tighter than the previous bound. Specifically, for
a $2^n$-input, optimal-time, binary-tree MBN we first prove the loading to be $\Omega(\sqrt{n})$.
We then improve this bound to $\Omega(n^{2/3})$ by deriving some additional results. Finally
the lower bound is further tightened to $\Omega(n/\log n)$ by refining the method used to count
connections on buses. (The lower bound restriction requires the MBN to have at
least $2^{n-1}$ buses.) These lower bound results (and indeed most other results of this
work) are general and apply to any binary-tree MBN satisfying the conditions of the
problem, rather than a given MBN instance. Although a degree of 2 necessitates
non-constant loading, this is not the case for degree 3. We construct a binary-tree MBN
called the Tree MBN that has degree 3 and loading 3, which is the best possible.
Also in Chapter 3 we investigate trade-offs between the loading and running time. 
We show that if the running time is allowed to increase by a factor of 2, then a 
degree-2, binary-tree MBN with constant loading exists. We establish that if the
additional time (beyond the optimal) used by the MBN is t, and if the largest problem 
that can be solved on a degree-2, loading- L, optimal-time MBN has size 2t(-L), then 
We present an example of a degree-2, loading-4, (2n—3)-step binary-tree 
MBN that matches this bound for constant L.
Chapter 4 explores the idea of using binary-tree MBNs to enhance meshes. Here
we show that an architecture using multiple buses has significant advantages over
traditional enhanced meshes that employ single-bus networks to connect processor
sets. We also study buses with segment switches (each of which can break a bus into
two) and use them to reduce the loading. Other parameters of the proposed architecture
can be selected for various trade-offs between the cost and performance. (Performance
measures include running time, degree, loading, VLSI area and the aspect ratio; many
existing architectures require highly elongated rectangular layouts that are difficult to
implement on a chip.) The architecture we propose improves on all previous results 
in at least one of the measures. It provides more choices to the network designer 
than any other architecture in the literature. Tables 4.1 and 4.2 (pages 73 and 82) 
summarize the results of this chapter.
In Chapter 5 we study methods for imparting fault tolerance to binary-tree MBNs. 
This complements the use of binary-tree MBNs as building blocks for general-purpose 
computing platforms (described in Chapter 4). Redundant connections can also be 
used to increase the yield for chips with binary-tree MBNs. In Chapter 5, we present 
two methods for constructing fault tolerant MBNs from any given binary-tree MBN. 
One of these methods (replication) is a general method that can be applied to
processor and bus faults on any type of MBN. The other method (recursive scheduling)
exploits features particular to binary-tree MBNs to produce better results, but
handles only bus faults. The general results of this chapter are too involved to state here;
we state results for some particular cases instead. Replication constructs a binary- 
tree MBN that requires at most 5 extra steps, even if half the buses fail. Also, even 
if half the processors fail, the number of the extra steps required is at most 2.
In Chapter 6 we investigate the VLSI area required for optimal-time, binary-tree 
MBNs. The corresponding problem for the balanced tree topology is well studied [80]. 
The binary-tree algorithm is different from a balanced tree topology in that only one 
level of the tree is active (or used) in any step of a binary-tree algorithm. Therefore, 
binary-tree MBNs can reuse the same buses or wires at different levels. This is not 
possible in a balanced tree topology, where all edges could be active simultaneously. 
This raises the possibility that the VLSI area for a binary-tree MBN is less than that 
required for a balanced tree topology. We specifically consider the “perimeter layout” 
case where all the processors of the MBN are laid out on the periphery of the layout;
allowing the processors to be placed in the interior can trivially use a solution for 
the balanced tree topology. Our work on this topic leads us to conjecture that the
perimeter area required for optimal-time $N$-input binary-tree MBNs is $\Omega(N \log N)$.
Simulations seem to indicate that this conjecture is true.
1.3 Contribution of this Work
This dissertation studies various facets of running binary-tree algorithms on MBNs,
providing a better understanding of the abilities and limitations of binary-tree MBNs.
Most of our results are general in nature, applicable to any binary-tree MBN rather
than particular cases. Many of these results extend to $k$-ary tree algorithms (for
$k > 2$) as well.
Chapter 3 establishes important relationships between key parameters, namely
running time, loading and degree. We develop a novel accounting scheme to keep track
of the connections on a bus. It is possible that this method of counting connections
may be useful in other algorithms as well. We also identify two mappings (called direct
and indirect) of binary-tree algorithms on MBNs that impact the loading and degree of
binary-tree MBNs. We show that indirect mapping is essential to achieving constant
loading. Considering that indirect mapping increases the amount of communication,
this result is rather counter-intuitive. Equally surprising is the result of Section 3.7
that shows that by increasing the running time by a constant factor, loading can be
reduced by a non-constant factor.
In Chapter 4 we provide a general framework for connecting processors in a 2- 
dimensional mesh that, among other things, captures all the features of previous 
enhanced-mesh architectures but with a more realistic loading. Thus, our work pro­
vides the means to automatically translate all existing algorithms on enhanced meshes
to a more implementable platform. In addition, our approach affords much more
flexibility to the network designer than traditional methods.
The contribution of Chapter 5 is in providing a framework that adds redundancy 
in a controlled manner to convert any binary-tree MBN to one that is resilient to 
processor and bus faults. In particular, one of the methods, replication, works for 
any MBN (not just binary-tree MBNs) and uses an approach to rename elements and 
convert faulty components into ones that have the least impact on performance.
Although Chapter 6 does not derive a lower bound on the area, it distills the 
argument to a single conjecture. It also develops some satellite results (such as an
8-processor, optimal-time MBN with “one layer” of buses) that may have independent
significance.
1.4 Organization of the Dissertation
In the next chapter we discuss some preliminary ideas and introduce some definitions. 
Chapter 3 deals with loading, running time and degree trade-offs. In Chapter 4 we 
describe meshes enhanced with binary-tree MBNs. Chapter 5 deals with fault-tolerant
MBNs. In Chapter 6 we describe the basis of our conjecture on the lower bound of
the area required for a “perimeter layout” of optimal-time binary-tree MBNs. Finally,
in Chapter 7 we summarize this work, and identify areas for future research.
Chapter 2
Preliminaries
In this chapter we discuss some basic ideas used in the rest of the dissertation. We 
define binary-tree algorithms in Section 2.1 and multiple bus networks (MBNs) in 
Section 2.2. In Section 2.3 we discuss issues related to running binary-tree algorithms
on MBNs. In particular, Section 2.3.1 identifies two types of mappings of MBN
processors to “nodes” of binary-tree algorithms. These mappings are important factors,
determining the loading of MBNs that run binary-tree algorithms. Finally, in
Section 2.4, we prove that an MBN running a binary-tree algorithm can also perform
prefix computations in the same order of time.
2.1 Binary Tree Algorithms
A binary-tree algorithm, Bin(n), reduces $2^n$ inputs to a single result. The computation
performed by a binary-tree algorithm can be represented as a complete binary tree.
For integer $n > 1$, and any associative binary operation $\circ$, a binary-tree algorithm,
Bin(n), accepts $N = 2^n$ inputs $a_0, a_1, \ldots, a_{N-1}$ at the leaves of a complete binary
tree (denoted by $\mathcal{T}(n)$) and produces one output, $a_0 \circ a_1 \circ \cdots \circ a_{N-1}$, at the root of
$\mathcal{T}(n)$. The algorithm proceeds level by level from the leaves to the root, applying the













Figure 2.1: Running Bin (3) on an 8 x 4 MBN
operation $\circ$ at each internal node to the partial results at its children. Figure 2.1(b)
shows $\mathcal{T}(3)$; the numbers associated with nodes and edges are explained later.
The tree $\mathcal{T}(n)$ has $n$ levels, and at level $i$ (where $0 < i \le n$), there are $2^{n-i}$ nodes.
Clearly, Bin(n) can be used to apply a semigroup operation on a set of $2^n$ inputs.
Any network that runs Bin(n) in $T(n)$ steps can also be used to perform a prefix
computation on $2^n$ inputs in $O(T(n))$ steps (see Section 2.4). It must be noted that
Bin(n) is a description of a class of algorithms, rather than the solution to a particular
problem (such as a reduction operation) that can be implemented as a binary-tree
algorithm. Thus Bin(n) requires at least $n$ steps as the height of $\mathcal{T}(n)$ is $n$; on the
other hand, particular reduction problems such as finding the OR of $N$ bits can be
solved on some models in $O(1)$ time [51, 77]. To run Bin(n) on a network with $2^n$
processors, each of the $2^{n+1} - 1$ nodes of $\mathcal{T}(n)$ is mapped to one of the $2^n$ processors of
the network (Figure 2.1(b)). We elaborate further on running binary-tree algorithms 
on MBNs in Section 2.3.
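The level-by-level reduction described above can be sketched in a few lines of Python. This is only an illustration of the tree computation itself, not of any MBN; the function name `bin_reduce` and the choice to return the number of levels alongside the result are assumptions made for this sketch.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def bin_reduce(inputs: List[T], op: Callable[[T, T], T]) -> Tuple[T, int]:
    """Reduce 2**n inputs level by level; return (result, number of levels)."""
    size = len(inputs)
    if size == 0 or (size & (size - 1)) != 0:
        raise ValueError("the number of inputs must be a power of two")
    level = list(inputs)
    steps = 0
    while len(level) > 1:
        # One level of the tree: each internal node combines its two children,
        # keeping the left operand on the left (the operation need not commute).
        level = [op(level[2 * j], level[2 * j + 1]) for j in range(len(level) // 2)]
        steps += 1
    return level[0], steps
```

With 8 inputs the loop runs 3 times, which matches the height of the tree for Bin(3). String concatenation is a convenient test operation because it is associative but not commutative.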
2.2 Multiple Bus Networks
An $N \times M$ Multiple Bus Network (MBN) has $N$ processors and $M$ buses. Each
processor is connected to a subset of the set of buses. Figure 2.2(a) shows a $16 \times 8$
MBN. Two processors connected to the same bus can communicate with each other
in one unit of time. A bus can carry only one piece of information at any given point
in time.
The number of buses to which a processor is connected is called the degree of the
processor. The largest of the degrees of all processors is called the degree of the MBN.
The number of processors connected to a bus is called the loading of the bus. The
largest of the loadings of all the buses is called the loading of the MBN. The MBN
of Figure 2.1(a) has a degree of 2 and a loading of 4, while that of Figure 2.2(a) has
a degree of 2 and a loading of 5. The degree and loading are important parameters
that determine the cost, speed of operation and implementability of an MBN. The
degree of an MBN is analogous to the degree of a graph representing a point-to-point
network and is indicative of the number of input/output ports needed per processor.
A large loading can introduce a significant delay or attenuation of the signal. High
loading in electrical buses introduces capacitive coupling that limits the rate at which
data can be transmitted. In an optical bus with high loading, the signal is excessively
attenuated by power drawn by photodetectors connected to the bus [29]. Therefore,
an MBN implementation should attempt to minimize both degree and loading.
An $N \times M$ MBN can be represented as an $N \times M$ Boolean matrix that has a 1
in row $p$ and column $b$ iff processor $p$ is connected to bus $b$. Figure 2.2(b) shows the
matrix representation of the $16 \times 8$ MBN of Figure 2.2(a). Observe that the rows and
columns of the matrix can be permuted without affecting the connectivity properties
Figure 2.2: A $16 \times 8$ MBN and its matrix; blank entries in the matrix represent 0's.
of the MBN. This is because permuting amounts to just renumbering processors and
buses. We use this fact later, when doing so is advantageous.
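As a small illustration of these definitions, the degree and loading can be read directly off the Boolean matrix as the largest row sum and the largest column sum. The sketch below uses a made-up $4 \times 2$ MBN (it is not one of the MBNs in the figures), and the names `mbn_degree`, `mbn_loading` and `EXAMPLE` are hypothetical.

```python
from typing import List

Matrix = List[List[int]]  # matrix[p][b] == 1 iff processor p is connected to bus b


def mbn_degree(matrix: Matrix) -> int:
    """Largest number of buses attached to any one processor (largest row sum)."""
    return max(sum(row) for row in matrix)


def mbn_loading(matrix: Matrix) -> int:
    """Largest number of processors attached to any one bus (largest column sum)."""
    return max(sum(column) for column in zip(*matrix))


# Hypothetical 4 x 2 MBN: processors 0 and 1 share bus 0; processors 0, 2 and 3 share bus 1.
EXAMPLE: Matrix = [
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 1],
]
```

Reversing the row order or swapping the two columns of `EXAMPLE` leaves both values unchanged, which is the renumbering observation made above.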
2.3 Running Binary Tree Algorithms on MBNs
We assume that Bin(n) is run on a $2^n \times M$ MBN. Using more than $2^n$ processors
has no advantage. If the number of processors is $2^{n'} < 2^n$, then the $2^n$ inputs can
be divided among the available $2^{n'}$ processors, so that there are $2^{n-n'}$ inputs per
processor. Each processor then sequentially reduces the $2^{n-n'}$ inputs to one result in
$2^{n-n'} - 1$ steps. The remainder of the algorithm is run as a Bin(n') on a $2^{n'} \times M$ MBN.
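The two-stage scheme for fewer processors can be sketched as follows. The code only simulates the arithmetic and counts steps; it assumes inputs are dealt to processors in contiguous blocks, and the function name `reduce_with_fewer_processors` is hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def reduce_with_fewer_processors(
    inputs: List[T], n_prime: int, op: Callable[[T, T], T]
) -> Tuple[T, int, int]:
    """Return (result, local sequential steps, tree steps) for 2**n_prime processors."""
    size = len(inputs)
    processors = 2 ** n_prime
    if size == 0 or (size & (size - 1)) != 0 or processors > size:
        raise ValueError("need 2**n inputs and at most 2**n processors")
    block = size // processors  # 2**(n - n') inputs per processor
    partial = []
    for p in range(processors):
        chunk = inputs[p * block:(p + 1) * block]
        acc = chunk[0]
        for value in chunk[1:]:  # block - 1 sequential operations
            acc = op(acc, value)
        partial.append(acc)
    local_steps = block - 1
    # The remaining work is an ordinary Bin(n') on the partial results.
    tree_steps = 0
    level = partial
    while len(level) > 1:
        level = [op(level[2 * j], level[2 * j + 1]) for j in range(len(level) // 2)]
        tree_steps += 1
    return level[0], local_steps, tree_steps
```

For 16 inputs on 4 processors this gives 3 local steps followed by 2 tree steps.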
To run a binary-tree algorithm, Bin(n), on a $2^n \times M$ MBN, each node of $\mathcal{T}(n)$ is
mapped to a processor. Each edge of $\mathcal{T}(n)$ that connects nodes mapped to distinct
processors represents a communication; such edges are called non-trivial edges [27].
Consider the example in Figure 2.1(b). Here the nodes of $\mathcal{T}(3)$ are labeled with
(mapped to) processor indices $0, 1, \ldots, 7$. This indicates the processor responsible for
the action (if any) at a node. Consider the node labeled 0 at level 1 (call it node $v$
for this discussion). Its two children are labeled 0 and 1. The edge from node $v$ to its
left child has end vertices, both of which are labeled by the same processor index (0
in this case). Therefore, this edge does not represent a communication and is called
a trivial edge (shown dotted in the figure). On the other hand, the edge from node
$v$ to its right child is non-trivial as its end points have different labels (0 and 1 in
this case); hence, the edge represents a communication between processors 0 and 1.
Figure 2.1(b) shows non-trivial edges as solid lines; the remaining trivial edges are
shown dotted. Each non-trivial edge of $\mathcal{T}(n)$ is mapped to a bus of the MBN.
Conversely, an MBN to run Bin(n) can be specified by mapping nodes and non-trivial
edges of $\mathcal{T}(n)$ to processors and buses, respectively, of the MBN. Thus $\mathcal{T}(n)$
(with nodes and non-trivial edges appropriately labeled) completely specifies a
$2^n \times M$ MBN and the method used to run Bin(n) on it. Figure 2.1(b) shows $\mathcal{T}(3)$
corresponding to the MBN in Figure 2.1(a). We will loosely use the term “binary-tree
MBNs” to refer to MBNs suitable for running binary-tree algorithms.
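To make the notion of trivial and non-trivial edges concrete, the sketch below stores a node-to-processor labeling level by level and lists the edges whose end points carry different labels. The labeling built by `direct_mapping` is an assumed textbook-style labeling (node $j$ of level $i$ goes to processor $j \cdot 2^i$); it is not the labeling of Figure 2.1(b), and all names here are hypothetical.

```python
from typing import List, Tuple

# mapping[i][j] is the processor assigned to node j of level i (level 0 = leaves).
Mapping = List[List[int]]


def direct_mapping(n: int) -> Mapping:
    """Assumed labeling: node j at level i is placed on processor j * 2**i."""
    return [[j * 2 ** i for j in range(2 ** (n - i))] for i in range(n + 1)]


def non_trivial_edges(mapping: Mapping) -> List[Tuple[int, int, int]]:
    """List (level of parent, parent processor, child processor) for each non-trivial edge."""
    edges = []
    for i in range(1, len(mapping)):
        for j, parent in enumerate(mapping[i]):
            for child in (mapping[i - 1][2 * j], mapping[i - 1][2 * j + 1]):
                if child != parent:  # distinct processors: a communication is needed
                    edges.append((i, parent, child))
    return edges
```

For this labeling of the 8-leaf tree, every internal node has exactly one non-trivial edge, so 7 edges are reported, 4 of them at level 1.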
In running a binary-tree algorithm on an MBN, we assume that in one “step” a 
processor can read from or write on each bus it is connected to and perform an internal 
operation using operands from its local memory or input ports. This assumption
is reasonable when the number of ports in a processor is small; all of the MBNs
considered in this work have a (small) constant degree. Since the focus of this work is
on the network connecting processors, there is no advantage in separately considering
the time required for internal operations. The following restrictions apply, however:
(i) each value sent or received by a processor during a step uses a different bus,
and (ii) the pair of processors sending and receiving a value must be connected to a
common bus. Under these assumptions, a processor is permitted to (a) send a partial
result of the binary-tree algorithm, (b) receive two partial results, and (c) perform
the operation $\circ$ (associated with the binary-tree algorithm) on the partial results
received, all in one step. This is not very different from the usual assumption that
a processor can access operands from its local memory and perform an operation on
them, all in one step.
2.3.1 Direct and Indirect Mapping
As noted earlier, running Bin(n) on an MBN requires mapping nodes to processors. 
In this section we identify two types of mappings, direct and indirect, that greatly
impact the degree and loading of binary-tree MBNs.
For any node $u$ of $\mathcal{T}(n)$, let $\mu(u)$ denote the processor to which $u$ is mapped. Let
$u$ be an internal node of $\mathcal{T}(n)$ with children $v$ and $w$. Node $u$ is said to be a direct
node iff $\mu(u) = \mu(v)$ or $\mu(u) = \mu(w)$; otherwise, node $u$ is said to be indirect. A direct
node is mapped to the same processor as one of its children while an indirect node is
mapped to a processor which is different from both of its children. This implies that
an indirect node is connected to each of its children by non-trivial edges, whereas a
direct node is connected to one of its children by a trivial edge. A processor mapped
to a direct node is called a direct processor of the step in question; otherwise, it is
called an indirect processor of the step. Since a processor may be mapped to more
than one node of $\mathcal{T}(n)$, it is possible for the same processor to be a direct processor
at one step and an indirect processor at another. Any mapping that has an indirect
node is called an indirect mapping; otherwise it is called a direct mapping.
As an example, in Figure 2.1(b) all nodes except the root and its right child are
direct. Therefore the entire Bin(3) or $\mathcal{T}(3)$ uses an indirect mapping (as there is an
indirect node). On the other hand, the $\mathcal{T}(2)$ consisting of the left subtree of the root
represents a direct mapping.
Observe that an indirect node involves two communications (one from each child),
whereas a direct node requires only one. Thus a direct mapping minimizes the number
of communications. Notwithstanding the fact that an indirect mapping entails more
communications, we show in Section 3.2 that this mapping is necessary for constant
loading.
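The definitions of direct and indirect nodes, and the resulting communication counts, can be checked mechanically on small labelings. The sketch below is illustrative only: the level-by-level list representation and the function name `classify_nodes` are assumptions, and the two 4-leaf labelings in the usage note are invented examples rather than the ones in the figures.

```python
from typing import List, Tuple

# mapping[i][j] is the processor assigned to node j of level i (level 0 = leaves).
Mapping = List[List[int]]


def classify_nodes(mapping: Mapping) -> Tuple[int, int, int]:
    """Return (direct nodes, indirect nodes, communications) over all internal nodes."""
    direct = indirect = communications = 0
    for i in range(1, len(mapping)):
        for j, parent in enumerate(mapping[i]):
            children = (mapping[i - 1][2 * j], mapping[i - 1][2 * j + 1])
            if parent in children:
                direct += 1      # shares a processor with one of its children
            else:
                indirect += 1    # both children sit on other processors
            # One communication per child held by a different processor.
            communications += sum(1 for child in children if child != parent)
    return direct, indirect, communications
```

For the labeling `[[0, 1, 2, 3], [0, 2], [0]]` all three internal nodes are direct and 3 communications are needed. Moving the root to processor 1, as in `[[0, 1, 2, 3], [0, 2], [1]]`, makes the root indirect and raises the count to 4.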
2.4 Prefix Computations on Binary-Tree MBNs
Given $N$ inputs, $a_1, a_2, \ldots, a_N$, and an associative operation $\circ$, the $i$th prefix (where
$1 \le i \le N$) is the quantity $a_1 \circ a_2 \circ \cdots \circ a_i$. A prefix computation for the above inputs
and operation computes the prefix for each $1 \le i \le N$. The relationship between
reduction algorithms and prefix computations is well known in the context of a PRAM
[38, pp. 44-49] and a fixed-degree topology [49, pp. 37-43]. This relationship has not
been studied for binary-tree MBNs, however. We prove here that a binary-tree MBN
is suitable for prefix computations as well.
Theorem 2.1 If $X(n)$ is an MBN that runs Bin(n) in $T(n)$ steps, then $X(n)$ can
run a prefix algorithm in $2T(n)$ steps.








Figure 2.3: Steps of running prefix computation on binary-tree MBNs
Proof: When an MBN runs a binary-tree algorithm, the nodes of Bin(n) are
executed in a manner that respects the precedence relationship described by the tree
$\mathcal{T}(n)$. Figure 2.3(a) shows three nodes of $\mathcal{T}(n)$ (corresponding to two levels) where
nodes $u$ and $v$ are children of node $w$. Let the two partial results (or inputs) held by
nodes $u$ and $v$ be $a$ and $b$ respectively, and let the associative operation performed by
the binary-tree algorithm be $\circ$.
The prefix computation runs on the binary-tree MBN in two phases. The first
phase runs the binary-tree algorithm from the leaves to the root. The only difference
here (from running a regular binary-tree algorithm) is that a node saves the value
it receives from the left child, unlike the usual form of the algorithm that simply
computes the partial result. For example, node $w$ receives $a$ and $b$ from nodes $u$ and
$v$ and computes $a \circ b$. In addition to computing this quantity, node $w$ also saves the
value “$a$” (shown in a box in Figure 2.3). Node $w$ sends partial result $a \circ b$ to the next
higher level in the next step (Figure 2.3(a)). The time to run this phase is clearly the
same as that of the binary-tree algorithm, namely $T(n)$.
The second phase of the prefix computation also proceeds level by level, starting
from the topmost level (level $n$) down to the leaves. This can be viewed as reversing
the binary-tree algorithm, where values from the top of the tree propagate to the leaves.
Figure 2.3(b) describes the action at each node during this phase. The root of $\mathcal{T}(n)$
sends the value it stored (the one received from its left child in the first phase) to the right
child. It sends the identity¹ of operation “$\circ$” to the left child. A processor with stored
value $a$ that receives value $c$ from its parent (i) sends $c$ unaltered to its left child
and (ii) sends $c \circ a$ to the right child (with $c$ as the left operand, since $c$ combines
inputs that precede those combined in $a$). This phase mimics the binary-tree algorithm
(phase 1) in reverse, so its time is $T(n)$ as well. Therefore, the time required to run
a prefix computation on a binary-tree MBN is twice as much as the time required for
running a binary-tree algorithm. (The correctness of this method follows from the
results in [38, 49].) ■
¹ The identity $i$ has the property that for any value $x$ from the domain of $\circ$, $i \circ x = x \circ i = x$. If
$\circ$ does not have an identity, then the root could simply send a special signal indicating to its left
child that it need not apply $\circ$ to the value received.
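The two phases of the proof can be simulated on an array-based tree, as in the sketch below. It is an illustration under stated assumptions, not the MBN schedule itself: communication is not modeled, the root is treated as if it had received the identity from a parent (which reproduces the root's behaviour described above), each leaf applies the operation once more to its own input at the end, and the value passed to a right child is formed with the parent's value as the left operand so that non-commutative operations are handled. The name `prefix_two_phase` is hypothetical.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def prefix_two_phase(inputs: List[T], op: Callable[[T, T], T], identity: T) -> List[T]:
    """Compute all prefixes of 2**n inputs with an up-sweep and a down-sweep."""
    size = len(inputs)
    if size == 0 or (size & (size - 1)) != 0:
        raise ValueError("the number of inputs must be a power of two")
    # Phase 1 (leaves to root): each internal node saves its left child's value.
    saved: List[List[T]] = []
    level = list(inputs)
    while len(level) > 1:
        saved.append([level[2 * j] for j in range(len(level) // 2)])
        level = [op(level[2 * j], level[2 * j + 1]) for j in range(len(level) // 2)]
    # Phase 2 (root to leaves): a node holding saved value a that receives c
    # passes c to its left child and (c op a) to its right child.
    incoming: List[T] = [identity]
    for stored in reversed(saved):
        next_incoming: List[T] = []
        for c, a in zip(incoming, stored):
            next_incoming.append(c)
            next_incoming.append(op(c, a))
        incoming = next_incoming
    # Each leaf combines the value it received with its own input.
    return [op(c, x) for c, x in zip(incoming, inputs)]
```

Each phase visits the $n$ levels once, which mirrors the $2T(n)$ bound of Theorem 2.1 when one level takes one step.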
Chapter 3 
Degree, Loading, Time Trade-Offs
This chapter establishes non-trivial relationships between the degree, loading and
running time of binary-tree MBNs. We first show by a trivial connectivity argument
that any binary-tree MBN has a degree of at least 2 and a loading of at least 2; the
loading is at least 3 if no more than $2^{n-1}$ buses are used for $2^n$ processors. Next we
show that for a direct mapping, constant degree can never yield constant loading, and
vice versa. We then establish a series of results that successively bound the loading
of degree-2, optimal-time binary-tree MBNs for Bin(n) to first $\Omega(\sqrt{n})$, then to $\Omega(n^{2/3})$
and finally to $\Omega(n/\log n)$, where $2^n$ is the size of the problem. These results make no
assumptions about the type of mapping (direct or indirect) and the number of buses,
although the optimal-time restriction indirectly requires the MBN to have at least
$2^{n-1}$ buses. Considering that increasing the degree by just 1 can yield a constant
loading MBN (see Section 3.6), these lower bound results are quite surprising.
Further, if we relax the optimal-time requirement, then we show the existence of
a degree-2, loading-4 MBN that runs Bin(n) in $2n - 3$ steps. However, the extra time
needed (beyond $n$) still bounds the loading. We show that if a degree-2 MBN runs
Bin(n) in $n + t$ steps, for some $0 < t < n$, then the loading is Q ( tiog(^)) '
In Section 3.1 we introduce some preliminary ideas used in this chapter and
Section 3.2 bounds the loading for binary-tree MBNs with a direct mapping. In
Section 3.3 we derive the first of the general lower bounds and lay most of the ground
work necessary for the tighter lower bounds of Sections 3.4 and 3.5. We explore
loading-time tradeoffs in Section 3.7. We extend the lower bound results of
Sections 3.3, 3.4 and 3.5 to $k$-ary tree algorithms in Section 3.8.
3.1 Preliminaries
As mentioned in Section 2.3, we will consider a $2^n$-processor MBN to run Bin(n).
An optimal-time MBN requires at least $2^{n-1}$ buses. If the number of buses is less
than $2^{n-1}$, at least the first level of $\mathcal{T}(n)$ requires more than one step to schedule, so
optimal time is not possible. Therefore, we consider optimal-time $2^n \times M$ binary-tree
MBNs for Bin(n) with $M \ge 2^{n-1}$ buses. If such an MBN has degree 2, then it has
at most $2^{n+1}$ connections (at most 2 per processor). If these connections are evenly
distributed among the buses, then the loading would be $\lceil 2^{n+1}/M \rceil \le 4$. In this chapter
we show that such a uniform distribution of connections is not possible and that a
large number of connections is concentrated on a small number of buses resulting in
a large non-constant loading.
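The arithmetic behind the even-distribution figure can be checked with a short sketch. It assumes, as in the paragraph above, a degree-2 MBN with $2^n$ processors (so at most $2 \cdot 2^n$ connections) and at least $2^{n-1}$ buses; the function name `even_loading` is hypothetical.

```python
def even_loading(n: int, num_buses: int) -> int:
    """Loading if 2 * 2**n connections were spread evenly over num_buses buses."""
    if num_buses < 2 ** (n - 1):
        raise ValueError("an optimal-time MBN needs at least 2**(n-1) buses")
    connections = 2 * 2 ** n          # at most 2 connections per processor
    return -(-connections // num_buses)  # ceiling division
```

With exactly $2^{n-1}$ buses the value is 4, and it can only shrink as buses are added; the lower bounds of this chapter show that no optimal-time degree-2 MBN actually achieves such an even spread.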
An MBN is said to be connected iff there is a path (possibly via several buses and 
processors) between any pair of processors. We now derive trivial lower bounds on 
the degree and loading of a connected binary-tree MBN.
Lemma 3.1 For $n > 1$ and $M > 2$, any connected $2^n \times M$ binary-tree MBN has a
degree of at least 2 and a loading of at least $\max\left(2, \lceil (2^n + 1)/M \rceil\right)$.
Proof: If the degree is 1 and if the MBN has connections to each of its $M > 2$
buses, then the MBN cannot be connected; each processor connected to a bus $b$
can only communicate with other processors connected to bus $b$. Thus at least one
processor must be connected to 2 or more buses. This implies that the total number of
connections in the MBN is at least $2^n + 1$. These connections are distributed over $M$
buses, so the loading is at least $\lceil (2^n + 1)/M \rceil$. Since each bus must have at least 2 connections
(otherwise it cannot be used for a communication), the loading is at least $\max\left(2, \lceil (2^n + 1)/M \rceil\right)$. ■
Remark: If $M = 2^m$ for some $1 < m < n$, then the minimum loading is $2^{n-m} + 1$.
The $2^n \times 2^m$ Tree MBN of Section 4.2 has an optimal loading of 3.
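The bound of Lemma 3.1 and the closed form in the remark can be cross-checked numerically. The sketch below evaluates $\max(2, \lceil (2^n + 1)/M \rceil)$, the expression used in the proof above; the function name `trivial_loading_bound` is hypothetical.

```python
def trivial_loading_bound(n: int, num_buses: int) -> int:
    """Connectivity lower bound on loading for a 2**n-processor MBN with num_buses buses."""
    if n < 1 or num_buses < 2:
        raise ValueError("expects n >= 1 and at least 2 buses")
    total_connections = 2 ** n + 1  # at least one processor has degree 2 or more
    return max(2, -(-total_connections // num_buses))  # ceiling division
```

When the number of buses is a power of two, $M = 2^m$ with $m \le n$, the ceiling evaluates to $2^{n-m} + 1$; for example, 8 processors on 4 buses give a bound of 3.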
3.2 Lower Bound for Direct Mapping
To run a binary-tree algorithm on an MBN, the nodes of the tree $\mathcal{T}(n)$ are mapped
to processors, and non-trivial edges are mapped to buses. Recall the definitions of
direct and indirect mapping (see Section 2.3.1, page 16). In a direct mapping, each
internal node of $\mathcal{T}(n)$ is mapped to the same processor as one of its children; that is,
a processor applying the operation $\circ$ (associated with a binary-tree algorithm) holds
one of the operands as a partial result from the previous step. On the other hand,
in an indirect mapping, two processors with partial results may send them to a third
processor that applies $\circ$ on these. The direct mapping may appear to be a better
choice as it reduces communication requirements by maximizing the number of trivial
edges. This is not true for the loading of the MBN, as we show below. Indeed, the
MBN proposed by Vaidyanathan and Padmanabhan [85] uses a direct mapping and
has a non-constant loading.
Lemma 3.2 For any $n > 1$, an MBN with degree $d$ that runs Bin(n) optimally in $n$
steps using a direct mapping has a loading of at least $\lceil n/d \rceil + 1$.




Figure 3.1: $\mathcal{T}(n)$ with a direct mapping
Proof: Consider $\mathcal{T}(n)$ with nodes labeled by a direct mapping. Observe first
that for any node $u$ that is mapped to some processor $\mu(u)$, there exists a path from
$u$ to a leaf, such that all nodes on the path are mapped to $\mu(u)$. (This follows from
the definition of direct mapping.) Let the root of $\mathcal{T}(n)$ be mapped to processor $\pi$. From
the above observation, there is a path from the root to a leaf such that all nodes on
this path are mapped to $\pi$ (see Figure 3.1). Clearly, there are $n$ internal nodes on
this path, each of which has one of its two children also on the path. Let the children
not included in the above path be mapped to processors $\pi_\ell$ (where $0 \le \ell < n$) as
shown in Figure 3.1. Each leaf of $\mathcal{T}(n)$ is mapped to a different processor (as each
input of Bin(n) is in a different processor). This, coupled with the observation at
the beginning of this proof, establishes that all of $\pi_0, \pi_1, \ldots, \pi_{n-1}$ are distinct. Thus
processor $\pi$ is required to communicate with $n$ different processors.
Let processor $\pi$ be connected to buses $b_i$ (where $0 \le i < d' \le d$). Since the MBN
runs Bin(n) in $n$ steps, each of the processors $\pi_0, \pi_1, \ldots, \pi_{n-1}$ must also be connected
to at least one bus $b_i$ (where $0 \le i < d'$). Thus the total number of connections to
all buses $b_i$ is at least $n + d'$. This implies that the loading of the MBN is at least
$\lceil (n + d')/d' \rceil = \lceil n/d' \rceil + 1 \ge \lceil n/d \rceil + 1$. ■
Remark: Lemma 3.2 implies that an indirect mapping is essential for any optimal-time
binary-tree MBN with constant degree and loading.
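The counting argument in the proof of Lemma 3.2 can be observed on a concrete direct mapping. The sketch below uses an assumed labeling (node $j$ of level $i$ on processor $j \cdot 2^i$), walks the path of nodes held by the root's processor, and collects the processors holding the off-path children; the names `direct_mapping`, `root_partners` and `direct_loading_bound` are hypothetical.

```python
from typing import List

Mapping = List[List[int]]  # mapping[i][j]: processor of node j at level i (level 0 = leaves)


def direct_mapping(n: int) -> Mapping:
    """Assumed direct labeling: node j at level i is placed on processor j * 2**i."""
    return [[j * 2 ** i for j in range(2 ** (n - i))] for i in range(n + 1)]


def root_partners(mapping: Mapping) -> List[int]:
    """Processors that the root's processor must receive from, top level first."""
    n = len(mapping) - 1
    root_processor = mapping[n][0]
    partners = []
    j = 0
    for i in range(n, 0, -1):
        left, right = mapping[i - 1][2 * j], mapping[i - 1][2 * j + 1]
        if left == root_processor:
            partners.append(right)
            j = 2 * j
        elif right == root_processor:
            partners.append(left)
            j = 2 * j + 1
        else:
            raise ValueError("the root's processor does not follow a direct path")
    return partners


def direct_loading_bound(n: int, d: int) -> int:
    """Lower bound ceil(n / d) + 1 on loading from Lemma 3.2."""
    return -(-n // d) + 1
```

For $n = 4$ the root's processor (processor 0) has the 4 distinct partners 8, 4, 2 and 1, and with degree 2 the bound evaluates to 3.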
3.3 An $\Omega(\sqrt{n})$ Lower Bound
In this section, we develop the first of a series of non-trivial lower bounds on the
loading of degree-2, optimal-time, binary-tree MBNs. Here we will prove that if an
MBN runs Bin(n) in $n$ steps and if its degree is 2, then its loading is $\Omega(\sqrt{n})$.
3.3.1 Strategy and Definitions
We prove this lower bound result by showing that the connections in any degree-2,
optimal-time binary-tree MBN are distributed unevenly over the buses. Our strategy
here (and to a large extent in Sections 3.4 and 3.5 as well) is to identify (or prove
the existence of) a small number, $\beta$, of buses that collectively have a large number,
$\gamma$, of connections. This will establish that the loading is at least $\lceil \gamma/\beta \rceil$.
For $M \ge 2^{n-1}$, consider a $2^n \times M$ MBN, $X(n)$, that runs Bin(n) in $n$ steps
(numbered $1, 2, \ldots, n$) and whose degree and loading are 2 and $L$, respectively. We
will use the terms “end of step $s$” and “beginning of step $s + 1$” synonymously. For
any $1 \le s \le n$, let $X_s(n)$ denote a $2^n \times M$ MBN that includes only those connections
of $X(n)$ that are used in at least one of steps $1, 2, \ldots, s$. Then $X_0(n)$ is a $2^n \times M$
MBN with no connections and $X_n(n) = X(n)$.
Figure 3.2: Step 1 of $X_1(n)$
At each step $s$ we will consider a $2^n \times 2^{n-1}$ “sub-MBN”, $Y_s(n)$, of $X_s(n)$; i.e.,
connections of $Y_s(n)$ are also connections of $X_s(n)$. Sub-MBN $Y_s(n)$ consists of those
connections of $X_s(n)$ whose existence has been established. We say that a connection
is added to mean that a previously unaccounted for connection has been detected.
Therefore, the degree and the loading of the MBN change from step to step. Running
Bin(n) on an MBN can be viewed as a step by step construction of the MBN with
the counted connections added at each step.
An intermediate result of $Bin(n)$ (the value at any non-root, non-leaf node of tree $\mathcal{F}(n)$) is called a partial result. A processor $p$ holding a partial result or an input at the end of step $s$ (where $0 \le s \le n$) is called a result processor of step $s$. Otherwise $p$ is a non-result processor of step $s$. If the degree of the processor $p$ at the end of step $s$ is 2, then it is called a full processor of step $s$; otherwise, $p$ is a non-full processor of step $s$.
Clearly, all $2^n$ processors are non-full, result processors of step 0; i.e., at the start of the algorithm. Step 1 (the first step) requires at least $2^{n-1}$ communications (exactly $2^{n-1}$, if all $2^{n-1}$ partial results generated at the end of step 1 are obtained by a direct mapping). Therefore $X_1(n)$ is isomorphic to the MBN shown in Figure 3.2. Thus for $1 \le s \le n$, the terms “full” and “non-full” are synonymous with “degree-2” and “degree-1,” respectively.
3.3.2 Basic Results
We now list four simple consequences of $X(n)$ being a degree-2, optimal-time, binary-tree MBN; these facts are used, often without explicit mention, in subsequent discussion.
1. All partial results received in a step are used in the same step, and a partial result generated in a step is used up in the next step. This is because the algorithm runs optimally, so partial results cannot idle.
2. A direct (resp., indirect) processor of a step receives one (resp., two) partial results in that step; this follows from 1 above.
3. A processor receiving two partial results at a step must do so from different processors; otherwise the step will not be executed in unit time.
4. A processor sending a partial result cannot receive one at the same step. Because a processor can hold only one partial result, such a processor would have to receive two partial results and send one in the same step. This is not possible on a degree-2, optimal-time MBN.
Lemma 3.3 For any $1 \le s \le n$, if $p$ is a non-full, result processor of step $s$, then $p$ is a non-full, result processor of steps $1, 2, \ldots, s$.
Proof: It suffices to prove that if $p$ is a non-full, result processor of step $s$, then it is a result processor of step $s - 1$. If $p$ holds a result at the end of step $s$, but not at the end of step $s - 1$, then it must have obtained two partial results during step $s$. This requires $p$ to have two connections and be a full processor of step $s$. ■
Corollary 3.4 For any $L \le s \le n$, each result processor of step $s$ is a full processor of step $s$.
Proof: Let $p$ be a non-full, result processor of step $s$. Then by Lemma 3.3, it is also a result processor of steps $1, 2, \ldots, s$. Therefore to prove the corollary, it is sufficient to prove that $s < L$. Let $p$ receive a partial result for step $t$ (where $1 \le t \le s$) from processor $p_t$, via the only bus $b$ (say) to which $p$ is connected. Therefore, bus $b$ is connected to processors in the set $\{p\} \cup \{p_t : 1 \le t \le s\}$.
Consider processor $p_t$ that sends a partial result to $p$ during step $t$. If $p_t$ is a result processor of step $t$, then it must receive two partial results from processors different from $p$ (in addition to sending a partial result to $p$). This is not possible as one of the (at most 2) buses to which $p_t$ is connected is used by $p$. Since this bus (bus $b$) is used by $p$ during steps $1, 2, \ldots, s$, processor $p_t$ cannot be a result processor of steps $t, t + 1, \ldots, s$. Therefore, $p_t \notin \{p_x : t < x \le s\}$ and so $\{p\} \cup \{p_t : 1 \le t \le s\}$ has $s + 1$ processors, all of which are connected to bus $b$. Since the loading of $X(n)$ is $L$, we have $s + 1 \le L$ (or $s < L$). ■
From this point on, we will only consider steps $s \ge L$. Since our aim is to prove that $L = \Omega(\sqrt{n})$ ($\Omega(n / \log n)$ in Section 3.5), we may assume that $L < n$. By Corollary 3.4, all result processors (of any step) can be assumed to be connected to 2 buses.
3.3.3 The Accounting Scheme
In determining a lower bound on the loading $L$ of $X(n)$, we will count connections between processors and buses of $X(n)$. Let $\langle p, b \rangle$ denote a connection between processor $p$ and bus $b$. In our analysis, we will consider only those connections $\langle p, b \rangle$ for which $p$ is a full processor, and which participate in some step $s \ge L$. Since a lower bound on the loading is sought, some connections can be ignored.
To account for the connections considered, we now associate each such connection with a processor. For each processor $p$ and step $s \ge L$, define a set, $\Gamma_s(p)$, of connections owned by processor $p$ in step $s$. (We will show later that if $p_1 \ne p_2$, then $\Gamma_s(p_1)$ and $\Gamma_s(p_2)$ are disjoint.) We now define $\Gamma_s(p)$.
1. If $p$ is a result processor of step $L$, then it is also a full processor of step $L$ (by Corollary 3.4). Let $p$ be connected to buses $b_1$ and $b_2$. For each such $p$, define $\Gamma_L(p) = \{\langle p, b_1 \rangle, \langle p, b_2 \rangle\}$. If $p$ is not a result processor of step $L$, then define $\Gamma_L(p)$ to be empty.
2. For $s > L$, let $p$ be a result processor of step $s$ that receives partial result(s) from (not necessarily distinct) processor(s) $p'$ and $p''$ via bus(es) $b'$ and $b''$, respectively. Define $\Gamma_s(p)$, $\Gamma_s(p')$ and $\Gamma_s(p'')$ as follows.
$$\Gamma_s(p) = \Gamma_{s-1}(p) \cup \{\langle p_1, b' \rangle, \langle p_2, b'' \rangle\}$$
$$\Gamma_s(p') = \Gamma_{s-1}(p') - \{\langle p_1, b' \rangle\}$$
$$\Gamma_s(p'') = \Gamma_{s-1}(p'') - \{\langle p_2, b'' \rangle\}$$
where $\langle p_1, b' \rangle \in \Gamma_{s-1}(p')$ and $\langle p_2, b'' \rangle \in \Gamma_{s-1}(p'')$; since we are interested primarily in the cardinality, $|\Gamma_s(p)|$, of $\Gamma_s(p)$, $\langle p_1, b' \rangle$ (resp., $\langle p_2, b'' \rangle$) can be any element of $\Gamma_{s-1}(p')$ (resp., $\Gamma_{s-1}(p'')$). Note that if $p$ receives only one value in step $s$, then $p' = p''$, $b' = b''$ and $p_1 = p_2$.
In summary, for each partial result received by processor $p$ from processor $p'$ via bus $b$, processor $p'$ transfers ownership of a connection on bus $b$ to processor $p$. If processor $p$ does not send or receive any partial result in step $s$, then $\Gamma_s(p) = \Gamma_{s-1}(p)$.
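The ownership bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `initial_ownership`, `receive`, the processor labels `a`, `c`, `q` and the bus labels `x`, `y`, `u`, `v` are hypothetical and do not come from the text. A connection is modelled as a `(processor, bus)` pair, and each received partial result moves one connection on the carrying bus from the sender's set to the receiver's set.

```python
def initial_ownership(result_procs, buses_of):
    """Part 1 of the definition: at step L every result processor owns
    its own connections to the two buses it is attached to."""
    return {p: {(p, b) for b in buses_of[p]} for p in result_procs}


def receive(owned, receiver, transfers):
    """Part 2 of the definition: `transfers` lists (sender, bus) pairs, one
    per partial result received. The sender gives up one connection it owns
    on that bus and the receiver acquires it."""
    for sender, bus in transfers:
        # Any owned connection on this bus will do; sort for determinism.
        conn = next(c for c in sorted(owned[sender]) if c[1] == bus)
        owned[sender].remove(conn)
        owned.setdefault(receiver, set()).add(conn)


# Hypothetical neighborhood: q is an indirect processor fed by a and c.
buses_of = {"a": ("x", "u"), "c": ("y", "v"), "q": ("x", "y")}
owned = initial_ownership(["a", "c"], buses_of)
receive(owned, "q", [("a", "x"), ("c", "y")])
print({p: sorted(s) for p, s in owned.items()})
```

In this example the receiver gains two connections and each sender loses one, and the three sets stay pairwise disjoint.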
Lemma 3.5 For any $s \ge L$,
(i) For distinct processors $p_1, p_2$, $\Gamma_s(p_1)$ and $\Gamma_s(p_2)$ are disjoint.
(ii) For any processor $p$, if $\langle p', b \rangle \in \Gamma_s(p)$, then processor $p$ is connected to bus $b$.
(iii) If $p$ is a result processor of step $s$, then $\Gamma_s(p)$ has a connection of the form $\langle p', b \rangle$, for each bus $b$ to which $p$ is connected.
Proof: At step $L$, by definition of $\Gamma_L$, all result processors own their connections to the two buses to which they are connected. Therefore, Lemma 3.5 holds at step $L$. Observe that in part 2 of the definition of $\Gamma_s(p)$, the sets $\Gamma_s(p)$ and $(\Gamma_s(p'), \Gamma_s(p''))$ are disjoint, and the connections added to $\Gamma_s(p)$ are $\langle p_1, b' \rangle$ and $\langle p_2, b'' \rangle$, where $b'$ and $b''$ are buses to which $p$ is connected. These observations, coupled with the fact that Lemma 3.5 holds for step $L$, complete the proof. ■
Remarks: If the sets $\Gamma_s(p)$ are used to count the number of connections in $X_s(n)$, then part (i) of Lemma 3.5 ensures that no connection is counted more than once. However, some connections may not be counted at all. Part (ii) is used later in Theorem 3.9. Part (iii) ensures that the transfer of ownership in part 2 of the definition of $\Gamma_s(p)$ is always possible.
Lemma 3.6 For any step $s > L$ and any processor $p$, $|\Gamma_s(p)| = |\Gamma_{s-1}(p)| + \delta$, where
$$\delta = \begin{cases} 1, & \text{if } p \text{ is a result processor of steps } s \text{ and } s - 1, \\ 2, & \text{if } p \text{ is a result processor of step } s \text{ and a non-result processor of step } s - 1, \\ -1, & \text{if } p \text{ is a non-result processor of step } s \text{ and a result processor of step } s - 1, \\ 0, & \text{if } p \text{ is a non-result processor of steps } s \text{ and } s - 1. \end{cases}$$
Proof: Observe that $\delta$ is the number of partial results received by processor $p$ in step $s$; $\delta = -1$ indicates that $p$ sends a partial result. The lemma now follows from this observation and part 2 of the definition of $\Gamma_s(p)$. ■
Corollary 3.7 For any step $s \ge L$, if $p$ is a result processor of $\alpha$ of the steps $L, L + 1, \ldots, s$, then $|\Gamma_s(p)| \ge \alpha$.
Proof: The first time $p$ becomes a result processor, at step $s_0$ say (even if $s_0 = L$), $|\Gamma_{s_0}(p)| = 2$. For each of the remaining $\alpha - 1$ times it is a result processor in some step $s' > s_0$, we consider two cases:
Case 1: Suppose $p$ is a result processor of steps $s' - 1$ and $s'$. Here $|\Gamma_{s'}(p)| = |\Gamma_{s'-1}(p)| + 1$ (Lemma 3.6).
Case 2: Suppose $p$ is a result processor of step $s'$ and a non-result processor of step $s' - 1$. Since $s' > s_0$, there is a step $s''$ ($s_0 < s'' < s'$) such that $p$ is a result processor of step $s'' - 1$ and a non-result processor of steps $s'', s'' + 1, \ldots, s' - 1$. Here $|\Gamma_{s'}(p)| = |\Gamma_{s'-1}(p)| + 2$, $|\Gamma_{s'-1}(p)| = |\Gamma_{s''}(p)|$, and $|\Gamma_{s''}(p)| = |\Gamma_{s''-1}(p)| - 1$ (again by Lemma 3.6). Therefore, $|\Gamma_{s'}(p)| = |\Gamma_{s''-1}(p)| + 1$.
In any case, for each step $s' > s_0$ of which $p$ is a result processor, $|\Gamma_{s'}(p)|$ increases by one. Thus at the last step $t \le s$ of which $p$ is a result processor, $|\Gamma_t(p)| = 2 + (\alpha - 1) = \alpha + 1$. If $t < s$, then $|\Gamma_{t+1}(p)| = |\Gamma_s(p)| = \alpha$. ■
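The counting in Lemma 3.6 and Corollary 3.7 can be replayed mechanically. The sketch below is illustrative only: the function name `owned_count` and the encoding of a processor's history as a list of booleans (entry $k$ is true when the processor is a result processor of step $L + k$) are assumptions made for the example. It applies the four cases of Lemma 3.6 step by step, starting from the sizes fixed at step $L$.

```python
def owned_count(status):
    """Return |Gamma_s(p)| for a processor whose result/non-result history
    over steps L, L+1, ..., s is given by the boolean list `status`."""
    # Step L: a result processor owns its two connections, others own none.
    size = 2 if status[0] else 0
    for prev, cur in zip(status, status[1:]):
        if cur and prev:
            size += 1      # direct processor: receives one partial result
        elif cur and not prev:
            size += 2      # indirect processor: receives two partial results
        elif prev and not cur:
            size -= 1      # sends its partial result away
        # non-result in both steps: unchanged
    return size


history = [True, True, False, False, True, False]
print(owned_count(history), sum(history))
```

For every history the returned size is at least the number of steps in which the processor was a result processor, which is the statement of Corollary 3.7.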
3.3.4 Non-Uniform Bus Usage
In this section we show that as the binary-tree algorithm proceeds towards the root of $\mathcal{F}(n)$, most of the activity in the MBN centers around a few buses that ultimately incur a high loading.
For any step $s \ge L$, a bus $b$ of $X(n)$ is said to be active in step $s$ iff it is connected to at least one result processor of step $s$. If a bus is used to carry a partial result in step $s$, then it must be active in step $s$. However, a bus that is active in step $s$ need not be used in step $s$. In the following lemma, we prove that the pool of buses that could be active at a step shrinks with each step, thereby forcing a few buses to have a large number of connections.
Lemma 3.8 For any step $s > L$, if bus $b$ is active in step $s$, then it is also active in step $s - 1$.
Proof: Let bus $b$ not be active in step $s - 1$. Then by definition of a non-active bus, all the processors connected to $b$ are non-result processors of step $s - 1$. Suppose at step $s$, processor $p$ connected to $b$ becomes a result processor of step $s$, thereby making $b$ active in step $s$. Since $p$ cannot be a result processor of step $s - 1$ (otherwise $b$ would be an active bus in step $s - 1$), $p$ must receive two results in step $s$ from distinct processors $p'$ and $p''$. One of these partial results must be via bus $b$. Clearly the two sending processors $p'$ and $p''$ are full (Corollary 3.4), result processors of step $s - 1$. Thus, one of them must be connected to $b$ in step $s - 1$, which contradicts the assumption that $b$ is not active in step $s - 1$. ■
3.3.5 The Lower Bound
We are now in a position to prove the main result of this section.
Theorem 3.9 For any $n \ge 2$, if a $2^n$-processor MBN with degree 2 and loading $L$ runs $Bin(n)$ optimally in $n$ steps, then $L = \Omega(\sqrt{n})$.
Proof: From Lemma 3.8, there exists a bus, $b_0$, that is active in steps $L, L + 1, \ldots, n$. Let $b_0$ be connected to $\ell \le L$ full processors, $p_1, p_2, \ldots, p_\ell$. For $1 \le i \le \ell$, let the two buses to which processor $p_i$ is connected be $b_0$ and $b_i$. Also let processor $p_i$ be a result processor $\alpha_i$ times from step $L$ to step $n$.
From Lemma 3.5(ii), each element of $\Gamma_n(p_i)$ is a connection to either $b_0$ or $b_i$. Since the loading of the MBN is $L$, $\sum_{i=1}^{\ell} |\Gamma_n(p_i)| \le \ell + \ell L \le L^2 + L$. From Corollary 3.7 we also have $\sum_{i=1}^{\ell} |\Gamma_n(p_i)| \ge \sum_{i=1}^{\ell} \alpha_i$. Since $b_0$ is an active bus of steps $L, L + 1, \ldots, n$, $\sum_{i=1}^{\ell} \alpha_i \ge n - L + 1$. Thus,
$$n - L + 1 \le \sum_{i=1}^{\ell} \alpha_i \le \sum_{i=1}^{\ell} |\Gamma_n(p_i)| \le L^2 + L,$$
which implies that $n \le L^2 + 2L - 1$ or $L = \Omega(\sqrt{n})$. ■
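The final inequality $n \le L^2 + 2L - 1$ can be rewritten as $(L + 1)^2 \ge n + 2$, that is, $L \ge \sqrt{n + 2} - 1$. As a small numerical illustration (the function name `min_loading_sqrt_bound` is an assumption made for this sketch, not notation from the text), the following code finds the smallest integer $L$ that the inequality permits for a given $n$.

```python
def min_loading_sqrt_bound(n: int) -> int:
    """Smallest positive integer L satisfying n <= L^2 + 2L - 1."""
    L = 1
    while L * L + 2 * L - 1 < n:
        L += 1
    return L


for n in (2, 14, 15, 100):
    print(n, min_loading_sqrt_bound(n))
```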
Remark: Theorem 3.9 proves that for large problem sizes, the product of the degree and loading of any MBN that runs a binary-tree algorithm in optimal time is at least 9, thereby establishing that the MBN, $\mathcal{T}(n)$, proposed in Section 3.6 has the best possible “degree-loading” product.
3.4 An $\Omega(n^{2/3})$ Lower Bound
In the lower bound of Section 3.3, we selected a bus $b_0$ and proved that its neighborhood (consisting of processors on $b_0$ and buses connected to these processors) had a large number of connections. In restricting our consideration to the neighborhood of bus $b_0$, the technique used undercounted the number of connections in the neighborhood. Here we develop additional results that provide a more accurate count of connections, even though the consideration is expanded to a larger neighborhood.
3.4.1 Additional Results
Recall the definitions of direct and indirect nodes and processors (Section 2.3.1, 
page 16).
Lemma 3.10 For any $s \ge L$, let $p$ be a result processor of step $s + 1$.
(i) If $p$ is a result processor of step $s$, then it is a direct processor of step $s + 1$.
(ii) If $p$ is a non-result processor of step $s$, then the following assertions hold:
(a) Processor $p$ is an indirect processor of step $s + 1$.
(b) The two buses to which processor $p$ is connected are active in step $s$.
(c) For each bus $b$ to which processor $p$ is connected, a result processor of step $s$ (that is also connected to $b$) becomes a non-result processor of step $s + 1$.
Proof: If $p$ is a result processor of both steps $s$ and $s + 1$, then the result it holds from step $s$ must be used to obtain the result of step $s + 1$. (Otherwise, the processor will have to receive two new values, while sending the result of step $s$ to another processor; this is not possible on a degree-2 MBN.) Also, since $X(n)$ runs $Bin(n)$ in $n$ steps, partial results cannot be saved to be used at a later step.
If, on the other hand, $p$ does not hold a result at step $s$, it must receive two partial results during step $s + 1$, and is, therefore, an indirect processor of step $s + 1$. Since these results (of step $s$) arrive through the two buses $b_1$ and $b_2$ (say) to which $p$ is connected, there must be result processors $p_1$ and $p_2$ of step $s$ that are connected to buses $b_1$ and $b_2$, respectively; that is, the buses $b_1$ and $b_2$ are active in step $s$. The result processors $p_1$ and $p_2$ of step $s$ cannot be result processors of step $s + 1$ as they send their results to processor $p$. (A result processor sending its value to another processor must receive two values to remain a result processor of the next step; this is not possible on a degree-2 MBN.) ■
The following corollary is a generalization of Lemma 3.8.
Corollary 3.11 For any $s \ge L$, if $\rho$ result processors of step $s$ are connected to bus $b$, then for any $s \ge s' \ge L$, at least $\rho$ result processors of step $s'$ must be connected to bus $b$.
Proof: It is sufficient to prove that the number of result processors connected to bus $b$ cannot increase after step $L$. Let processors $p_1, p_2, \ldots, p_x$ be result processors of step $s'$ that are connected to bus $b$. Let processor $q$ connected to bus $b$ not be a result processor of step $s'$, and let it become a result processor of step $s' + 1$. Since processor $q$ is connected to bus $b$, one of the partial results must come from one of the processors $p_1, p_2, \ldots, p_x$ via bus $b$. The processor that is sending the partial result becomes a non-result processor of step $s' + 1$, and the total number of result processors connected to bus $b$ does not increase. Therefore, at least a total of $\rho$ result processors must be connected to bus $b$ in all the steps $s \ge s' \ge L$. ■
Remark: It is important to note that the processors holding the results may change from one step to another, while the number of result processors is non-increasing.
Two processors are said to be neighbors iff they are connected to a common bus. For integers $a, b$ with $a \le b$, let interval $[a, b]$ denote the set $\{a, a + 1, \ldots, b - 1, b\}$.
Lemma 3.12 For $s > L$, if $p$ is not a result processor of step $s - 1$, and is a result processor of steps $s, s + 1, \ldots, s + x - 1$ (for some $x > 0$), then the following assertions hold:
(i) For each step $s'$ in the interval $[L, s - 1]$, at least $x + 1$ neighbors of $p$ are result processors of step $s'$.
(ii) At the end of step $s + x - 1$, the neighbors of $p$ collectively own at least $\left( (x + 1)(s - L) + \frac{x(x-1)}{2} \right)$ connections.
Proof: When processor $p$ becomes a result processor of step $s$, it must be an indirect processor of step $s$ (Lemma 3.10), and it consumes two results from its neighbors. Processor $p$ also consumes a result from one of its neighbors in each of the steps $s + 1, s + 2, \ldots, s + x - 1$ (during which it is a direct, result processor). By Corollary 3.11, for the $2 + (x - 1) = x + 1$ results consumed by $p$ in steps $s, s + 1, \ldots, s + x - 1$, there must be $x + 1$ results among the neighbors of $p$ in each of steps $L, L + 1, \ldots, s - 1$.
Of the $x + 1$ results consumed by processor $p$, two are consumed in step $s$. Therefore for the remaining $x - 1$, there are $x - 1$ result processors among the neighbors of $p$ in each of steps $L, L + 1, \ldots, s - 1, s$. In general, for any $1 \le i \le x$, there are $x - i$ result processors among the neighbors of $p$ in each of the steps $L, L + 1, \ldots, s + i - 1$. If $N$ is the set of neighbors of $p$, then by Corollary 3.7 we have
$$\sum_{p' \in N} |\Gamma_{s+x-1}(p')| \ge (s - L)(x + 1) + \sum_{i=1}^{x} (x - i) = (s - L)(x + 1) + \frac{x(x-1)}{2}. \quad ■$$
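As a quick check of the count in part (ii) of Lemma 3.12, the sketch below (with the hypothetical name `neighbor_connections_bound`) adds up the two contributions separately: $x + 1$ results held by neighbors in each of the $s - L$ steps before step $s$, and $x - i$ results for $i = 1, \ldots, x$ afterwards. It also confirms that the second sum equals $x(x-1)/2$.

```python
def neighbor_connections_bound(s: int, L: int, x: int) -> int:
    """Lower bound on connections owned by the neighbors of p at the end of
    step s + x - 1, following the two-part count in Lemma 3.12(ii)."""
    before = (s - L) * (x + 1)                   # steps L .. s-1
    after = sum(x - i for i in range(1, x + 1))  # steps s .. s+x-1
    assert after == x * (x - 1) // 2             # closed form of the tail sum
    return before + after


print(neighbor_connections_bound(s=10, L=4, x=3))
```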
3.4.2 Tighter Lower Bound
Now we are in a position to prove the main result of this section.
Theorem 3.13 For any $n \ge 2$, if a $2^n$-processor MBN with degree 2 and loading $L$ runs $Bin(n)$ optimally in $n$ steps, then $L = \Omega(n^{2/3})$.
Proof: Let the loading of the MBN be $L < n$. By Lemma 3.8, there is a bus that is active during each step in the interval $[L, n]$; let this bus be $b_0$. Let full processors $p_1, p_2, \ldots, p_\ell$ (for some $\ell \le L$) be connected to bus $b_0$. For each $i$ (where $1 \le i \le \ell$), let processor $p_i$ be connected to bus $b_i$ (in addition to bus $b_0$). Besides processor $p_i$, let bus $b_i$ be connected to $m_i < L$ processors (see Figure 3.3).
For any given step $s \in [L, n]$, at least one of $p_1, p_2, \ldots, p_\ell$ is a result processor (as $b_0$ is active during each step of $[L, n]$). Therefore, the interval $[L, n]$ can be partitioned into $k$ subintervals, $I_1, I_2, \ldots, I_k$, as follows (see Figure 3.4):
(i) In each step of subinterval $I_j$, processor $\pi_j \in \{p_i : 1 \le i \le \ell\}$ is a result processor.
(ii) For $j > 1$, $\pi_j \ne \pi_{j-1}$.
[Figure 3.3: Processors and buses in the neighborhood of bus $b_0$]
Let interval $I_j$ be of length $x_j$ (where $1 \le x_j \le n - L + 1$). Clearly, $\sum_{j=1}^{k} x_j = n - L + 1$. Also $I_1 = [L, L + x_1 - 1]$ and for $j > 1$,
$$I_j = \left[ L + \sum_{r=1}^{j-1} x_r,\; L - 1 + \sum_{r=1}^{j} x_r \right] = [s_j, s_j + x_j - 1] \text{ (say)}.$$
Applying Lemma 3.12 to $I_j$, the number of connections owned by neighbors of $\pi_j$ (at the end of step $s_j + x_j - 1$) is at least $(s_j - L)(x_j + 1) + \frac{x_j(x_j - 1)}{2}$. Summing this for all $k$ intervals, we can assert that the number of connections collectively owned by processors $p_1, p_2, \ldots, p_\ell$ and their neighbors is at least
$$\sum_{j=1}^{k} \left( (s_j - L)(x_j + 1) + \frac{x_j(x_j - 1)}{2} \right) = \frac{x_1(x_1 - 1)}{2} + x_1(x_2 + 1) + \frac{x_2(x_2 - 1)}{2} + \cdots + (x_1 + \cdots + x_{k-1})(x_k + 1) + \frac{x_k(x_k - 1)}{2}$$
$$\ge \frac{1}{2} \sum_{j=1}^{k} x_j(x_j - 1) + (x_1 x_2 + x_1 x_3 + \cdots + x_1 x_k + x_2 x_3 + x_2 x_4 + \cdots + x_2 x_k + \cdots + x_{k-1} x_k) = \frac{(n - L + 1)(n - L)}{2} = \Omega(n^2).$$
These connections are distributed among the $1 + \ell + \sum_{i=1}^{\ell} m_i \le 1 + L + L(L - 1) = L^2 + 1$ buses connected to the above processors (see Figure 3.3). Since each bus can have at most $L$ connections we have $L(1 + L^2) = L^3 + L = \Omega(n^2)$, which implies that $L = \Omega(n^{2/3})$. ■
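The summation over the subintervals can be checked numerically. In the sketch below the function name `owned_connections` and the sample partitions are assumptions made for illustration. It evaluates $\sum_j \left( (s_j - L)(x_j + 1) + x_j(x_j - 1)/2 \right)$ for a given list of interval lengths and compares it with $S(S-1)/2$, where $S = \sum_j x_j = n - L + 1$. Expanding $S^2$ shows that the sum equals $S(S-1)/2 + \sum_j (s_j - L)$, so it is never smaller than $S(S-1)/2$.

```python
def owned_connections(xs, L):
    """Sum of the Lemma 3.12(ii) bounds over consecutive subintervals of
    lengths xs, where the first subinterval starts at step L."""
    total, s = 0, L
    for x in xs:
        total += (s - L) * (x + 1) + x * (x - 1) // 2
        s += x  # the next subinterval starts right after this one
    return total


xs, L = [1, 2, 3], 4          # hypothetical partition of [L, n]
S = sum(xs)                   # S = n - L + 1
print(owned_connections(xs, L), S * (S - 1) // 2)
```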







[Figure 3.4: Subintervals of $[L, n]$. Interval $I_j$ consists of the $x_j$ steps $s_j, \ldots, s_j + x_j - 1 = s_{j+1} - 1$, during which $\pi_j \ne \pi_{j-1}$ is a result processor; $I_1$ starts at step $L$ and $I_k$ ends at step $n$.]
3.5 An $\Omega(n / \log n)$ Lower Bound
In the last two lower bound derivations (Sections 3.3 and 3.4), we used an accounting scheme to count the number of connections in the neighborhood of a bus $b_0$ that was active at steps $L, L + 1, \ldots, n$. This accounting scheme transferred ownership of one connection for each partial result sent/received by a processor. This scheme assumes the existence of only one transferable connection with each result processor, even though the result processors could possibly own many more connections. In this section we develop a modification to this accounting scheme that allows many more connections to be counted. The new accounting scheme permits the ownership of more than one connection to be transferred between processors, whenever possible. We do this by breaking the interval $[L, n]$ into smaller segments and establishing the existence of more than one transferable connection with result processors in each segment. This is the key to tightening the lower bound on the loading to $\Omega(n / \log n)$.
In Section 3.5.1 we derive the basic results needed for the new accounting scheme. 
In Section 3.5.2 we describe the new accounting scheme. Finally in Section 3.5.3 we 
derive the new lower bound.
3.5.1 Initial Condition
For some integer $d \ge 1$, partition the interval $[L, n]$ into $y + 1$ segments $[L, 3L + 1]$, $[3L + 2, 3L + 1 + d]$, $[3L + 2 + d, 3L + 1 + 2d]$, $\ldots$, $[3L + 2 + (y - 1)d, n]$. Denote these segments by $I_0, I_1, \ldots, I_y$. Segment $I_0$ contains $2L + 2$ steps, and segments $I_1, I_2, \ldots, I_{y-1}$ each contain $d$ steps. The last segment, $I_y$, contains $(n - 3L - 1) - d(y - 1) \le d$ steps. In this section we develop a relationship between the number of result processors and the total number of connections on an active bus at the end of step $3L + 1$ (end of interval $I_0$). Without loss of generality, assume L <
Lemma 3.14 Let $b$ be any active bus of steps $L, L + 1, \ldots, 3L + 1$. Let there be $\rho$ result processors connected to bus $b$ at the end of step $3L + 1$. Let there be $\ell \le L$ full processors (including the $\rho$ result processors) connected to bus $b$ at step $3L + 1$. Then $\ell \ge 2\rho$.
Proof: At step $3L + 1$ there are $\rho$ result processors connected to bus $b$. Therefore, there must be at least $\rho$ (not necessarily the same) result processors connected to bus $b$ all the way from step $L$ (Corollary 3.11). In this interval, ownership of only one connection is transferred with each partial result. Therefore, during this interval of $2L + 2$ steps, the total number of connections owned by processors connected to bus $b$ is increased by at least $(2L + 2)\rho$ (Corollary 3.7, page 30). There is a total of $\ell$ full processors connected to bus $b$ at step $3L + 1$. Each of these $\ell$ processors is connected to a bus different from $b$. Collectively, the number of connections owned by all $\ell$ processors on bus $b$ is at most $\ell L + \ell$, where $\ell L$ bounds the connections to the buses in the neighborhood of $b$, and $\ell$ is the number of connections to $b$ itself. (Recall that a processor can own connections only to the buses it is connected to and all detected connections are owned.) Since the loading of any of these $\ell$ buses in the “neighborhood of bus $b$” cannot exceed $L$, we have $\ell(1 + L) \ge 2\rho(1 + L)$, which implies that $\ell \ge 2\rho$. ■
3.5.2 The New Accounting Scheme
Notice that the result of Lemma 3.14 shows that at the end of step $3L + 1$ there are twice as many connections as result processors to any active bus. However, some result processors may still own only one connection on each of their buses, while other processors may own many more connections. At the end of step $3L + 1$, if all the known connections are redistributed among the $\alpha$ result processors, then each processor will own at least 2 connections to each (active) bus $b$ to which it is connected. Therefore, beyond this point ownership of 2 connections could be transacted for each partial result sent/received. This change along with the previous method of counting the connections will raise the lower bound on $L$ by a factor of about 2. However, if we proceed for another $d$ steps, we can show that the number of connections owned by each result processor is greater than 2. This can be used to evenly redistribute connections so that each result processor now owns more than 2 connections. This in turn allows more connections to be transacted with each partial result. In fact, such a redistribution of the known connections can be carried out at the end of each of the intervals $I_0, I_1, I_2, \ldots, I_{y-1}$. The last steps of these intervals (at which known connections are redistributed) are called transition points. In general, if there are $\alpha$ result processors connected to bus $b$, and a total of $\gamma$ connections to an active bus $b$ at a transition point, then after redistributing connections, each result processor is guaranteed to own $\lfloor \gamma / \alpha \rfloor$ connections to bus $b$. The accounting scheme used in Sections 3.3 and 3.4 can still be used with suitable modifications. We now outline these modifications.
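The redistribution at a transition point can be pictured with a small sketch. The function name `redistribute`, the processor labels, and the representation of a connection as a `(processor, bus)` pair are assumptions made for this example. The known connections on one bus are split evenly among the result processors on that bus, so that each receives the floor of the average and no connection is handed to two owners; any remainder is simply left uncounted.

```python
def redistribute(connections, result_procs):
    """Evenly split the known connections on a bus among its result
    processors. Returns the per-processor share and the ownership map."""
    share = len(connections) // len(result_procs)
    ordered = sorted(connections)  # fixed order keeps the split deterministic
    owned = {
        p: set(ordered[k * share:(k + 1) * share])
        for k, p in enumerate(result_procs)
    }
    return share, owned


# Hypothetical bus "b" with 7 known connections and 3 result processors.
conns = {(f"q{j}", "b") for j in range(7)}
share, owned = redistribute(conns, ["p1", "p2", "p3"])
print(share, {p: len(s) for p, s in owned.items()})
```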
1. For $1 \le i \le y$, consider interval $I_i = [s_i, s_{i+1} - 1]$, where $s_i = 3L + 2 + (i - 1)d$ is the first step of $I_i$. Clearly $s_i - 1$ is a transition point, so at the end of step $s_i - 1$ the accounting scheme redistributes known connections of each active bus evenly among its result processors. Let each result processor own $w$ connections at step $s_i$. More precisely, for result processor $p$ with connections to buses $b'$ and $b''$, let its set of owned connections be $\Gamma_{s_i}(p) = \{\langle p'_1, b' \rangle, \langle p'_2, b' \rangle, \ldots, \langle p'_w, b' \rangle, \langle p''_1, b'' \rangle, \langle p''_2, b'' \rangle, \ldots, \langle p''_w, b'' \rangle\}$ where $p'_j$ and $p''_j$ ($1 \le j \le w$) are some processors connected to bus $b'$ and $b''$, respectively. If $p$ is not a result processor of step $s_i$, then $\Gamma_{s_i}(p)$ is empty. Since the set of known connections is partitioned among the result processors, no connection is owned by more than one processor.
2. For a step $s$ (where $s_i < s \le s_{i+1} - 1$), let $p$ be a result processor of step $s$ that receives partial result(s) from (not necessarily distinct) processor(s) $p'$ and $p''$ via bus(es) $b'$ and $b''$, respectively. The sets $\Gamma_s(p)$, $\Gamma_s(p')$ and $\Gamma_s(p'')$ change as follows.
$$\Gamma_s(p) = \Gamma_{s-1}(p) \cup \{\langle p'_1, b' \rangle, \ldots, \langle p'_w, b' \rangle, \langle p''_1, b'' \rangle, \ldots, \langle p''_w, b'' \rangle\}$$
$$\Gamma_s(p') = \Gamma_{s-1}(p') - \{\langle p'_1, b' \rangle, \langle p'_2, b' \rangle, \ldots, \langle p'_w, b' \rangle\}$$
$$\Gamma_s(p'') = \Gamma_{s-1}(p'') - \{\langle p''_1, b'' \rangle, \langle p''_2, b'' \rangle, \ldots, \langle p''_w, b'' \rangle\}$$
For $1 \le j \le w$, $\langle p'_j, b' \rangle \in \Gamma_{s-1}(p')$ and $\langle p''_j, b'' \rangle \in \Gamma_{s-1}(p'')$; since we are interested primarily in the cardinality, $|\Gamma_s(p)|$, of $\Gamma_s(p)$, the $w$ connections transferred ($\langle p'_j, b' \rangle$ and $\langle p''_j, b'' \rangle$, for $1 \le j \le w$) can be any $w$ elements of $\Gamma_{s-1}(p')$ and $\Gamma_{s-1}(p'')$. Note that if $p$ is a direct processor that receives only one partial result in step $s$, then $p' = p''$, $b' = b''$ and $p'_j = p''_j$.
In summary, for each partial result received by processor $p$ from processor $p'$ via bus $b$, processor $p'$ transfers ownership of $w$ connections on bus $b$ to processor $p$. We call $w$ the transaction weight of interval $I_i$. If processor $p$ does not send or receive any partial result in step $s$, then $\Gamma_s(p) = \Gamma_{s-1}(p)$. The facts stated in Lemma 3.5 also hold when the transaction weight is more than one. We restate Lemma 3.5 modified to accommodate the idea of transaction weight; its proof is the same as that of Lemma 3.5.
Lemma 3.15 Let $I$ be an interval with transaction weight $w$. For any step $s \in I$, the following statements hold.
(i) For distinct processors $p_1, p_2$, $\Gamma_s(p_1)$ and $\Gamma_s(p_2)$ are disjoint.
(ii) For any processor $p$, if $\langle p', b \rangle \in \Gamma_s(p)$, then processor $p$ is connected to bus $b$.
(iii) If $p$ is a result processor of step $s$, then $\Gamma_s(p)$ has $w$ connections of the form $\langle p'_1, b \rangle, \langle p'_2, b \rangle, \ldots, \langle p'_w, b \rangle$ for each bus $b$ to which $p$ is connected.
Lemma 3.6 tracks the size of $\Gamma_s(p)$. We now restate this lemma with appropriate modifications for the new accounting scheme. Again its proof parallels that of Lemma 3.6.
Lemma 3.16 Let $I$ be an interval with transaction weight $w$. For any step $s \in I$, $|\Gamma_s(p)| = |\Gamma_{s-1}(p)| + \delta$, where
$$\delta = \begin{cases} w, & \text{if } p \text{ is a result processor of steps } s \text{ and } s - 1, \\ 2w, & \text{if } p \text{ is a result processor of step } s \text{ and a non-result processor of step } s - 1, \\ -w, & \text{if } p \text{ is a non-result processor of step } s \text{ and a result processor of step } s - 1, \\ 0, & \text{if } p \text{ is a non-result processor of steps } s \text{ and } s - 1. \end{cases}$$
Corollary 3.7 (page 30) relates the number of connections owned by a processor and 
the number of steps for which it holds a partial result. We now restate Corollary 3.7 
with suitable modifications to accommodate a transaction weight w > 1.
Corollary 3.17 Let $I$ be an interval with transaction weight $w$. For any step $s \in I = [s_1, s_2]$, if $p$ is a result processor of $\alpha$ of the steps of subinterval $[s_1, s]$ of $I$, then $|\Gamma_s(p)| \ge w\alpha$.
Proof: The proof follows along the same lines as the proof of Corollary 3.7 (page 30) with each connection transferred replaced by a group of $w$ connections. With each partial result, $w$ connections are transferred. The remaining steps of the proof are the same as in Corollary 3.7. ■
3.5.3 Tighter Lower Bound
We now use the results developed so far to derive a tighter lower bound on the loading of degree-2, optimal-time, binary-tree MBNs. Recall that the initial segment is $I_0 = [L, 3L + 1]$, the last segment is $I_y = [3L + 2 + (y - 1)d, n]$ and for $1 \le i < y$, $I_i = [3L + 2 + (i - 1)d, 3L + 1 + id]$.
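To make the segment boundaries concrete, the following sketch lists the segments for given values of $L$, $n$ and $d$. The function name `segments` and the sample values are illustrative assumptions; the code simply follows the interval endpoints recalled above, with the last segment truncated at step $n$, and it assumes $n \ge 3L + 2$ so that at least one segment follows the initial one.

```python
def segments(L: int, n: int, d: int):
    """Return the segments of [L, n] as (first_step, last_step) pairs:
    one initial segment of 2L + 2 steps, then blocks of d steps each,
    with the final block cut off at step n."""
    segs = [(L, 3 * L + 1)]
    start = 3 * L + 2
    while start <= n:
        segs.append((start, min(start + d - 1, n)))
        start += d
    return segs


# Hypothetical parameters: L = 2, n = 20, d = 4.
print(segments(2, 20, 4))
```

With these sample values the list has five entries, so the index of the last segment is 4.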
Lemma 3.18 Let $b_0$ be an active bus of steps $L, L + 1, \ldots, n$. For $0 \le i \le y$, let $w_i$ be the transaction weight of segment $I_i$. Then, $w_0 = 2$ and $w_{i+1} = \left\lfloor \frac{d\, w_i}{1 + L} \right\rfloor$.
Proof: Lemma 3.14 proves that $w_0 = 2$. Let there be $\rho_{i+1}$ result processors connected to bus $b_0$ at the beginning of interval $I_{i+1}$. Let $\ell_{i+1}$ be the total number of processors (including the $\rho_{i+1}$ result processors) connected to bus $b_0$ at the beginning of the interval $I_{i+1}$. Since there are $\rho_{i+1}$ result processors connected to bus $b_0$ at the beginning of the interval $I_{i+1}$, there must be at least $\rho_{i+1}$ (possibly different) result processors connected to bus $b_0$ at each of the steps of interval $I_i$ (Corollary 3.11). The number of connections collectively owned by these processors at the beginning of the interval $I_{i+1}$ is at least $d\, w_i\, \rho_{i+1}$ (Corollary 3.17). These connections must be on bus $b_0$ or buses in its neighborhood (that are connected to a processor with a connection to bus $b_0$) as shown in Figure 3.5. The number of connections on bus $b_0$ is $\ell_{i+1}$. Each of the $\ell_{i+1}$ neighboring buses has at most $L$ connections. Therefore, the number of connections on $b_0$ and buses in its neighborhood is at most $\ell_{i+1} + L\ell_{i+1} = (1 + L)\ell_{i+1}$. Since the number of connections owned by the $\rho_{i+1}$ processors on $b_0$ cannot exceed this quantity, we have $w_i\, \rho_{i+1}\, d \le \ell_{i+1} L + \ell_{i+1} = (1 + L)\ell_{i+1}$. By definition, $w_{i+1} = \left\lfloor \frac{\ell_{i+1}}{\rho_{i+1}} \right\rfloor \ge \left\lfloor \frac{d\, w_i}{1 + L} \right\rfloor$, so a transaction weight of $\left\lfloor \frac{d\, w_i}{1 + L} \right\rfloor$ can be used for segment $I_{i+1}$. ■
Lemma 3.19 For $0 \le i < y$, let $w_i$ be the transaction weight for segment $I_i$. If $\frac{d}{1+L}$
is an integer, then $w_{i+1} = 2\left(\frac{d}{1+L}\right)^{i+1}$.
Figure 3.5: The connections on bus $b_0$, showing its $\ell$ full processors and $p$ result processors
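Proof: If $\frac{d}{1+L}$ is an integer, then $\left\lfloor \frac{d\,w_i}{1+L} \right\rfloor = \frac{d\,w_i}{1+L}$. Then from Lemma 3.18, $w_{i+1} = \frac{d}{1+L}\,w_i$
and $w_0 = 2$. Solving this recurrence yields $w_{i+1} = 2\left(\frac{d}{1+L}\right)^{i+1}$. ■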
Now we are in a position to prove the main theorem of this section.
Theorem 3.20 For $n \ge 2$, if a degree-2, loading-$L$, $2^n$-processor MBN runs Bin(n)
in n steps, then $L = \Omega\left(\frac{n}{\log n}\right)$.
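Proof: Let $d = 2(1+L)$. Using the same notation as in the proof of Lemma 3.18,
this gives $w_y = 2^{y+1} = \left\lfloor \frac{\ell_y}{p_y} \right\rfloor$. Since $p_y \ge 1$, we have $\ell_y \ge 2^{y+1}$. That is, there are
at least $2^{y+1}$ connections to bus $b_0$, which is active in each of the steps of $I_y$. Therefore
$2^{y+1} \le L$ and hence $y + 1 \le \log L$, where $y$ is fixed by the partition of $[L, n]$ into
segments of length $d = 2(1+L)$. Thus $L \log L = \Omega(n)$, which implies
that $L = \Omega\left(\frac{n}{\log n}\right)$. ■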
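The recurrence of Lemma 3.18 is easy to tabulate. The following Python sketch iterates $w_{i+1} = \lfloor d\,w_i/(1+L) \rfloor$ from $w_0 = 2$ and shows that the choice $d = 2(1+L)$ made in the proof of Theorem 3.20 doubles the weight in every segment. The function name `transaction_weights` and the sample values of `L`, `d` and `y` are illustrative assumptions, not part of the original text.

```python
def transaction_weights(L, d, y):
    """Return [w_0, ..., w_y] for the recurrence of Lemma 3.18.

    L is the loading, d the segment length and y the index of the last
    segment; all three are supplied by the caller (hypothetical inputs).
    """
    weights = [2]  # w_0 = 2, as stated in Lemma 3.18
    for _ in range(y):
        # w_{i+1} = floor(d * w_i / (1 + L))
        weights.append((d * weights[-1]) // (1 + L))
    return weights


if __name__ == "__main__":
    L = 5
    d = 2 * (1 + L)  # the choice made in the proof of Theorem 3.20
    print(transaction_weights(L, d, 4))
```

With any other segment length the floor can stall the growth; for instance `transaction_weights(4, 7, 2)` stays at 2, which is why the proof picks $d$ as a multiple of $1 + L$.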
3.6 The Tree MBN
We showed in the previous sections that for $n \ge 1$, any MBN with at least $2^{n-1}$
buses running Bin(n) has a degree of at least 2 and a loading of at least 3. Next we
proved that if an MBN runs Bin(n) in n steps and if its degree is 2, then its loading is
$\Omega\left(\frac{n}{\log n}\right)$. This raises the question, what if the degree is 3? In this section we answer
this question positively by constructing a binary-tree MBN with the least possible
degree-loading product. For $n \ge 1$, we present a $2^n \times 2^{n-1}$ MBN $\mathcal{T}(n)$, called the
Tree MBN, that runs Bin(n) in n steps. The degree and loading of $\mathcal{T}(n)$ are each 3.
Lemma 3.1 and Theorem 3.20 prove that $\mathcal{T}(n)$ has the best possible degree-loading
product of 9. Figure 3.6(a) shows a $\mathcal{T}(4)$ and Figure 3.6(b) shows how it runs Bin(4).
We now formally describe $\mathcal{T}(n)$. For $n = 1$, each of the two processors of $\mathcal{T}(1)$ is
connected to the only bus. For the remaining description, we assume that $n \ge 2$.
Let the processors and buses of $\mathcal{T}(n)$ be indexed $0, 1, \ldots, 2^n - 1$ and $0, 1, \ldots, 2^{n-1} - 1$,
respectively. Group processors and buses into $2^{n-1}$ clusters, $C_i$, where $0 \le i < 2^{n-1}$.
Cluster $C_i$ consists of bus $i$ and processors $2i$ and $2i+1$, both of which are connected
to bus $i$. Arrange the $2^{n-1}$ clusters into n levels (see Figure 3.6(a)). Level 0 contains
only cluster $C_0$. For $0 < \ell < n$, level $\ell$ contains clusters $C_x$, for $2^{\ell-1} \le x < 2^{\ell}$. In
addition to connections from processors $2i$ and $2i+1$ to bus $i$ (where $0 \le i < 2^{n-1}$),
$\mathcal{T}(n)$ has the following connections between clusters. Processor 1 is connected to
bus 1; for $1 \le i < 2^{n-2}$, processor $2i+1$ of cluster $C_i$ is connected to buses $2i$ and
$2i+1$. It is easy to see that for $0 \le i < 2^{n-1}$, processor $2i$ is connected only to bus $i$,
and processor $2i+1$ is connected only to buses $i$, $2i$ and $2i+1$ (if they exist). Similarly
for $0 \le i < 2^{n-1}$, bus $i$ is connected only to processors $2i$, $2i+1$ and $2\lfloor i/2 \rfloor + 1$.
Thus the degree and loading of $\mathcal{T}(n)$ are each 3.
Figure 3.6(b), with nodes and non-trivial edges labeled with processor and bus
indices, respectively, shows how $\mathcal{T}(4)$ runs Bin(4). The general case is a straightforward
extension of this, so we will keep this description brief. In running Bin(n) on
$\mathcal{T}(n)$, each processor initially holds an input. The first step consists of a communication
within each cluster, with processor $2i+1$ receiving an input from processor $2i$
Figure 3.6: Running Bin(4) on $\mathcal{T}(4)$
(via bus $i$) and performing the operation $\circ$. Subsequent steps involve communication
between clusters. In step $s$ (where $2 \le s \le n$), processor $2i+1$ (where $0 \le i < 2^{n-s}$)
sends a partial result to processor $2\lfloor i/2 \rfloor + 1$, receives partial results from processors
$4i+1$ and $4i+3$, and applies the operation $\circ$ on the partial results received. At the
end of step n, processor 1 holds the result of Bin(n).
Theorem 3.21 For any $n \ge 1$, the $2^n \times 2^{n-1}$ MBN, $\mathcal{T}(n)$, runs Bin(n) optimally in
n steps. The degree and loading of $\mathcal{T}(n)$ are each 3. ■
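The construction and the schedule described above can be checked mechanically. The following Python sketch builds the processor-to-bus connections of the Tree MBN for $n \ge 2$ exactly as listed in the text, measures its degree and loading, and replays the $n$ communication steps. All function names are ours, and the treatment of $i = 0$ (processor 1 simply keeps its own partial result, since $4i+1 = 1$) is our reading of the description rather than something the text states explicitly.

```python
def build_tree_mbn(n):
    """Return {processor: set of buses} for the Tree MBN with n >= 2."""
    buses_of = {p: set() for p in range(2 ** n)}
    for i in range(2 ** (n - 1)):        # cluster C_i: bus i, processors 2i and 2i+1
        buses_of[2 * i].add(i)
        buses_of[2 * i + 1].add(i)
    buses_of[1].add(1)                   # processor 1 is also connected to bus 1
    for i in range(1, 2 ** (n - 2)):     # connections between clusters
        buses_of[2 * i + 1].update((2 * i, 2 * i + 1))
    return buses_of


def degree_and_loading(buses_of):
    """Degree = most buses on a processor; loading = most processors on a bus."""
    load = {}
    for buses in buses_of.values():
        for b in buses:
            load[b] = load.get(b, 0) + 1
    return max(len(b) for b in buses_of.values()), max(load.values())


def run_bin(n, inputs, op):
    """Replay Bin(n) on the Tree MBN; return (value at processor 1, steps used)."""
    buses_of = build_tree_mbn(n)
    # Step 1: processor 2i+1 combines the two inputs of cluster C_i via bus i.
    held = {2 * i + 1: op(inputs[2 * i], inputs[2 * i + 1])
            for i in range(2 ** (n - 1))}
    steps = 1
    for s in range(2, n + 1):
        nxt = {}
        for i in range(2 ** (n - s)):
            parent = 2 * i + 1
            for child, bus in ((4 * i + 1, 2 * i), (4 * i + 3, 2 * i + 1)):
                if child != parent:      # processor 1 keeps its own value
                    # sender and receiver must share the bus used for the transfer
                    assert bus in buses_of[child] and bus in buses_of[parent]
            nxt[parent] = op(held[4 * i + 1], held[4 * i + 3])
        held = nxt
        steps += 1
    return held[1], steps


if __name__ == "__main__":
    print(degree_and_loading(build_tree_mbn(4)))
    print(run_bin(4, list("abcdefghijklmnop"), lambda a, b: a + b))
```

String concatenation is used in the demonstration because it is associative but not commutative, so the output also confirms that the inputs are combined in their original left-to-right order.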
3.7 Loading-Speed Tradeoff
In the preceding sections we proved that an optimal-time, degree-2 binary-tree MBN
cannot have constant loading. In Section 3.6 we showed that this lower bound on
loading does not hold if the degree is permitted to increase to 3. This section deals
with the “optimal-time” restriction on these results. We prove that if the algorithm
is permitted to run a little more slowly, then a constant loading is possible, even
with a degree of 2. Specifically we show the existence of a degree-2, loading-4 MBN
that runs Bin(n) in $n + t$ time when $t$ is $\Theta(n)$. When $t$ is constant, however, the
loading cannot be constant. We derive a lower bound that relates the loading with
$t$, the amount of time beyond the optimal that the MBN is permitted to execute the
binary-tree algorithm.
3.7.1 Lower Bound
Let $2^{\tau(L)}$ be the size of the largest instance of a binary-tree algorithm that can be run
optimally in $\tau(L)$ steps on a degree-2, loading-$L$ MBN. From Theorem 3.20 we know
that $\tau(L) = O(L \log L)$. The existence of a degree-2, $O(n)$-loading, optimal-time
MBN for Bin(n) [85] gives the bound $\tau(L) = \Omega(L)$.
Theorem 3.22 For any degree-2, loading-$L$, $2^n$-processor MBN that runs Bin(n) in
$n + t$ steps, $t \ge \left\lfloor \frac{n}{\tau(L)+1} \right\rfloor$.
Proof: Since the MBN runs Bin(n) suboptimally, there are some nodes of $\mathcal{F}(n)$ with
“delays” in them. A node with delay $\delta$ passes its value (input/partial result) to its
parent $\delta$ steps after this value is available to it. This delay may be used to transfer
the value to a different processor with less demand on its buses.
Figure 3.7: Path with $t$ delays (each part spans $\tau(L) + 1$ levels)
The definition of $\tau(L)$ ensures that for $x > \tau(L)$, the tree $\mathcal{F}(x)$ has at least one
delayed node in it; therefore Bin(x) requires at least $x + 1$ steps on such an MBN.
Divide $\mathcal{F}(n)$ into $\left\lceil \frac{n}{\tau(L)+1} \right\rceil$ parts (see Figure 3.7), $P_i$, where $1 \le i \le \left\lceil \frac{n}{\tau(L)+1} \right\rceil$. Each
part, $P_i$, (except perhaps the last one) consists of $\tau(L) + 1$ contiguous levels of $\mathcal{F}(n)$.
That is, these parts contain trees isomorphic to $\mathcal{F}(\tau(L) + 1)$ that must contain at
least one delayed node (with delay $\delta \ge 1$). For part $P_1$ that starts from the leaves,
the roots of the $\mathcal{F}(\tau(L) + 1)$s in this part obtain their values no earlier than at step
$\tau(L) + 2$. As a result, the roots of the $\mathcal{F}(\tau(L) + 1)$s in the next part $P_2$ obtain
their values no earlier than at step $2(\tau(L) + 2)$. In general for $1 \le i \le \left\lfloor \frac{n}{\tau(L)+1} \right\rfloor$,
the roots of the $\mathcal{F}(\tau(L) + 1)$s of part $P_i$ obtain their values no earlier than at step
$i(\tau(L) + 2)$. Without the delayed nodes, these roots would have obtained their values
at step $i(\tau(L) + 1)$, so the additional time taken is at least $i$. Therefore, the additional
time (due to delayed nodes) taken before the root of $\mathcal{F}(n)$ obtains its value is at least
$\left\lfloor \frac{n}{\tau(L)+1} \right\rfloor$. ■
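The counting in this proof can be traced with a few lines of Python. The sketch below treats $\tau(L)$ as a plain integer input `tau`, which is an assumption made for illustration only, since the text bounds $\tau(L)$ but does not give its exact value. It lists the earliest step at which the roots of each complete part can obtain their values, together with the resulting bound on the extra time.

```python
def delay_lower_bound(n, tau):
    """Bound of Theorem 3.22 on the extra steps t, for a given tau = tau(L)."""
    return n // (tau + 1)


def earliest_root_steps(n, tau):
    """Earliest step at which the roots of part P_i obtain their values,
    for each complete part i = 1, ..., floor(n / (tau + 1))."""
    return [i * (tau + 2) for i in range(1, n // (tau + 1) + 1)]


if __name__ == "__main__":
    n, tau = 20, 3  # hypothetical values
    for i, step in enumerate(earliest_root_steps(n, tau), start=1):
        undelayed = i * (tau + 1)  # step reached if no node were delayed
        print(f"part {i}: step >= {step}, extra time >= {step - undelayed}")
    print("t >=", delay_lower_bound(n, tau))
```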
3.7.2 Upper Bound
When the loading $L$ is a constant, Theorem 3.22 requires the additional time (beyond
the minimum required n steps) to run Bin(n) to be $\Omega(n)$. This is because $\tau(L) =
O(L \log L) = O(1)$, for constant $L$. We now show that this lower bound on the
additional time is tight by presenting a degree-2, loading-4 MBN that runs Bin(n) in
$2n - 3$ steps (that is, with $n - 3$ additional steps).
Consider a degree-2, constant loading-$\ell$, $2^n \times 2^{n-1}$ delayed root MBN, $\mathcal{D}(n)$, that
has the following properties:
•  The processor, $f(n)$, that holds the final result of Bin(n) has a degree of 1; that
is, $f(n)$ is connected to only one bus.
•  There is a bus $b(n)$ with loading $\ell - 2$; that is, two more processors could be
connected to bus $b(n)$ without increasing the loading of $\mathcal{D}(n)$.
•  One of the processors, $p(n)$, connected to bus $b(n)$ has degree 1; that is, processor
$p(n)$ is not connected to any bus other than $b(n)$.
Processors $f(n)$ and $p(n)$ and bus $b(n)$ will be called the special elements of $\mathcal{D}(n)$.
An example of such an MBN is the $8 \times 4$ MBN shown in Figure 3.8(a). For this MBN
the loading $\ell$ is 4, and $f(3)$, $p(3)$ are processors 0 and 6, respectively, while bus 3 is
$b(3)$. It is easy to verify that $\mathcal{D}(3)$ possesses the above properties. Also note that
$\mathcal{D}(3)$ runs Bin(3) in 3 steps.
We now show how two copies of $\mathcal{D}(n)$ can be used to construct the $2^{n+1} \times 2^n$ MBN,
$\mathcal{D}(n+1)$. To distinguish these copies, we name them $\mathcal{D}'(n)$ and $\mathcal{D}''(n)$ and refer to
Figure 3.8: Running Bin(3) on the $8 \times 4$ MBN, $\mathcal{D}(3)$ (panels (a) and (b); labels $f(3)$, $p(3)$ and bus 3 = $b(3)$)
their special elements as $f'(n)$, $p'(n)$, $b'(n)$, $f''(n)$, $p''(n)$ and $b''(n)$. To construct
$\mathcal{D}(n+1)$ from $\mathcal{D}'(n)$ and $\mathcal{D}''(n)$, all that needs to be done is to connect processors
$f'(n)$ and $f''(n)$ to bus $b'(n)$. Designate $p'(n)$ to be $f(n+1)$, $p''(n)$ to be $p(n+1)$ and
$b''(n)$ to be $b(n+1)$.
MBN $\mathcal{D}(n+1)$ has a loading of $\ell$ as the only two added connections are to bus $b'(n)$
that has only $\ell - 2$ connections to start with. Its degree is 2 as the added connections
are one each from processors $f'(n)$ and $f''(n)$, that had only one connection each to
begin with. It is easy to verify that each of the processors $f(n+1) = p'(n)$ and
$p(n+1) = p''(n)$ has degree 1 and that bus $b(n+1) = b''(n)$ has loading $\ell - 2$. If we
can now establish that $f(n+1)$ holds the final result, then $\mathcal{D}(n+1)$ will satisfy the
three properties stated above for $\mathcal{D}(n)$.
When Bin(n + 1) is run on $\mathcal{D}(n+1)$, the first n levels (starting from the leaves) are
run simultaneously on $\mathcal{D}'(n)$ and $\mathcal{D}''(n)$; the results of these steps are in processors
$f'(n)$ and $f''(n)$. These processors use bus $b'(n)$ to send their partial results to
$f(n+1) = p'(n)$. (Recall that processors $f'(n)$ and $f''(n)$ have been connected to bus
$b'(n)$, while $f(n+1) = p'(n)$ is already connected to this bus.) Processor $f(n+1)$
now computes the final value of Bin(n + 1). From this discussion it should also be
evident that if $\mathcal{D}(n)$ runs Bin(n) in $T(n)$ steps, then $\mathcal{D}(n+1)$ runs Bin(n + 1) in
$T(n+1) = T(n) + 2$ steps. This is because both processors $f'(n)$ and $f''(n)$ use the
same bus, $b'(n)$, to send their partial results to processor $f(n+1)$; this introduces a
delay in computing the root. Coupled with the fact that $T(3) = 3$ (see Figure 3.8),
this gives $T(n) = 2n - 3$, for $n \ge 3$.
Thus we have the following result.
Lemma 3.23 For any $n \ge 3$, the degree-2, loading-4, $2^n \times 2^{n-1}$ delayed root MBN,
$\mathcal{D}(n)$, runs Bin(n) in $2n - 3$ steps. ■
The generalization of the above result to non-constant loading $L \ge 4$ is straightforward.
First construct a $2^{L} \times 2^{L-1}$ degree-2, loading-$L$ MBN, by the method proposed
by Vaidyanathan and Padmanabhan [85]. Denote this MBN by $\mathcal{D}_L$. An important
feature of the MBN is that the result processor has only one connection, and the
bus that is connected to the final result processor has fewer than $L$ connections. We
can now use two copies of $\mathcal{D}_L(n-1)$ and follow the construction described in the
upper bound section to obtain $\mathcal{D}_L(n)$. It is easy to see that the time it takes to run
Bin(n) on $\mathcal{D}_L(n)$ is $L - 1 + 2(n - L + 1) = 2n - L + 1$ steps.
Theorem 3.24 For any $n \ge 3$, the degree-2, loading-$L$, $2^n \times 2^{n-1}$ MBN, $\mathcal{D}_L(n)$, runs
Bin(n) in $2n - L + 1$ steps. ■
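The step counts of Lemma 3.23 and Theorem 3.24 both follow from unrolling the recurrence $T(m+1) = T(m) + 2$. The Python sketch below does this literally and compares the result with the closed forms. The function names are ours, and splitting $L - 1 + 2(n - L + 1)$ into "$L - 1$ base steps followed by $n - L + 1$ doublings" is our interpretation of the expression given in the text.

```python
def delayed_root_steps(base_steps, doublings):
    """Steps after `doublings` applications of T(m + 1) = T(m) + 2."""
    steps = base_steps
    for _ in range(doublings):
        steps += 2  # the two result processors share one bus, costing an extra step
    return steps


def steps_constant_loading(n):
    """Lemma 3.23: start from T(3) = 3 and double n - 3 times."""
    return delayed_root_steps(3, n - 3)


def steps_loading_L(n, L):
    """Theorem 3.24, read as L - 1 base steps plus n - L + 1 doublings."""
    return delayed_root_steps(L - 1, n - L + 1)


if __name__ == "__main__":
    for n in range(3, 9):
        print(n, steps_constant_loading(n), 2 * n - 3)
```

For $L = 4$ the two functions agree, which matches the text's statement that the loading-4 network is the special case of the general construction.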
The results of this section show that a degree-2, constant loading MBN exists for
Bin(n) if and only if it is permitted to take $\Theta(n)$ more steps than the optimal.
3.8 Extension to k-ary Tree Algorithms
In this section, we extend the lower bound results of this chapter to k-ary tree algorithms.
In a k-ary algorithm, a node (or a processor) receives k inputs/partial results
and reduces them to one result by performing an associative k-ary operation on them.
Binary-tree algorithms form a special case of k-ary tree algorithms, with $k = 2$. All
the results so far have been established for binary-tree algorithms.
In general, for $k \ge 2$, a k-ary tree algorithm reduces $k^n$ inputs to one result,
and can be represented as a balanced (n-level) k-ary tree. Clearly, the optimal time
for this algorithm is n steps. In this section we extend the lower bound results of
Sections 3.3, 3.4 and 3.5 to k-ary tree algorithms. Most of the basic results needed
to establish the lower bounds for k-ary tree algorithms are very similar to the binary
tree case. We will therefore keep our discussion brief and only point out places where
the k-ary case differs from the binary case.
Let $\mathcal{X}(n)$ be an optimal-time, loading-$L$, k-ary tree MBN with $k^n$ processors
and $M \ge (k-1)k^{n-1}$ buses. We start by stating the basic results of Section 3.3.2
(page 26) extended to k-ary tree MBNs. We list five simple consequences of $\mathcal{X}(n)$
being a degree-k, optimal-time, k-ary tree MBN that are used often without explicit
mention, in subsequent discussion.
1. A full processor has k connections, while a non-full processor has fewer than k
connections.
2. All partial results received in a step are used in the same step, and a partial
result generated in a step is used up in the next step. This is because the
algorithm runs optimally, so partial results cannot idle.
3. A processor using a direct (resp., indirect) mapping receives $k - 1$ (resp., $k$)
partial results; this follows from 2 above.
4. A processor receiving k partial results at a step must do so from different processors.
Otherwise the step will not be executed in unit time.
5. A processor sending a partial result cannot receive one at the same step. This is
because it will have to receive k and send one partial result. This is not possible
on a degree-k, optimal-time MBN.
It is easy to show that Lemma 3.3 and Corollary 3.4 hold for the k-ary case. Define
ownership as in Section 3.3.3 with one connection being transacted for each partial
result sent/received. Clearly Lemma 3.5 still holds. Lemma 3.6 clearly extends to
the following:
Lemma 3.25 For any step $s > L$ and any processor $p$, $|\Gamma_s(p)| = |\Gamma_{s-1}(p)| + \delta$, where
$$\delta = \begin{cases}
k - 1, & \text{if } p \text{ is a result processor of steps } s \text{ and } s - 1 \\
k, & \text{if } p \text{ is a result processor of step } s \text{ and a non-result processor of step } s - 1 \\
-1, & \text{if } p \text{ is a non-result processor of step } s \text{ and a result processor of step } s - 1 \\
0, & \text{if } p \text{ is a non-result processor of steps } s \text{ and } s - 1
\end{cases}$$
As a result, we have the following corollary using the reasoning of Corollary 3.7.
Corollary 3.26 For any step $s \ge L$, if $p$ is a result processor of $\alpha$ of the steps
$L, L+1, \ldots, s$, then $|\Gamma_s(p)| \ge \alpha(k-1)$. ■
Observe that Lemma 3.8 is independent of the k-ary case.
Theorem 3.27 For any $n \ge 2$, if a $k^n$-processor MBN of degree $k$ and loading
$L$ runs a $k^n$-input, k-ary tree algorithm optimally in n steps, then $L = \Omega(\sqrt{n})$.
Proof: From Lemma 3.8, there exists a bus, $b_0$, that is active in steps $L, L+1, \ldots, n$.
Let $b_0$ be connected to $\ell \le L$ processors, $p_1, p_2, \ldots, p_\ell$, all of which are full processors
of step n. For $1 \le i \le \ell$, let the $k$ buses to which processor $p_i$ is connected be $b_0$ and
$b_{i,1}, b_{i,2}, \ldots, b_{i,k-1}$. Also let processor $p_i$ be a result processor $\alpha_i$ times from step $L$ to
step n.
From Lemma 3.5(ii), each element of $\Gamma_n(p_i)$ is a connection to either $b_0$ or one of the $b_{i,j}$. Since
the loading of the MBN is $L$, $\sum_{i=1}^{\ell} |\Gamma_n(p_i)| \le \ell + \ell(k-1)L \le L^2(k-1) + L$. From
Corollary 3.26 we also have $\sum_{i=1}^{\ell} |\Gamma_n(p_i)| \ge (k-1)\sum_{i=1}^{\ell} \alpha_i$. Since $b_0$ is an active bus of
steps $L, L+1, \ldots, n$, $\sum_{i=1}^{\ell} \alpha_i \ge n - L + 1$. Thus, $(k-1)(n - L + 1) \le L^2(k-1) + L$,
which implies that $n \le L^2 + \frac{kL}{k-1} - 1$ or $L = \Omega(\sqrt{n})$. ■
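To see how the inequality at the end of this proof forces $L$ to grow like $\sqrt{n}$, one can search for the smallest loading that satisfies it. The Python sketch below does so by brute force; the function name and the sample values of `n` and `k` are illustrative assumptions.

```python
def min_loading_sqrt_bound(n, k):
    """Smallest integer L >= 1 with (k - 1)(n - L + 1) <= L^2 (k - 1) + L."""
    L = 1
    while (k - 1) * (n - L + 1) > L * L * (k - 1) + L:
        L += 1
    return L


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, min_loading_sqrt_bound(n, 2), min_loading_sqrt_bound(n, 3))
```

For $k = 2$ the condition reduces to $(L+1)^2 \ge n + 2$, so the smallest admissible $L$ is within one of $\sqrt{n}$.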
To obtain the second lower bound from the first, observe that Lemma 3.10 and 
Corollary 3.11 hold for the k-ary case as stated. Lemma 3.12 changes slightly as 
shown below.
Lemma 3.28 For $s > L$, if $p$ is not a result processor of step $s - 1$, and is a result
processor of steps $s, s+1, \ldots, s+x-1$ (for some $x > 0$), then the following assertions
hold:
(i) For each step $s'$ in the interval $[L, s-1]$, at least $x + 1$ neighbors of $p$ are
result processors of step $s'$.
(ii) At the end of step $s + x - 1$, the neighbors of $p$ collectively own at least
$(k-1)\left((x+1)(s-L) + \frac{x(x-1)}{2}\right)$ connections.
Proof: The proof is the same as in Lemma 3.12, except that the number of connections owned by
a processor increases by a factor of $k - 1$ for every result-holding step (Corollary 3.26).
■
Theorem 3.29 For any $n \ge 2$, if a $k^n$-processor MBN with degree $k$ and loading
$L$ runs a k-ary tree algorithm optimally in n steps, then $L = \Omega\left(\left(\frac{n^2}{k-1}\right)^{1/3}\right)$.
Proof: This proof is the same as in Lemma 3.13 with some modifications as indicated
below. The number of connections collectively owned by processors $p_{1,1}, p_{1,2}, \ldots,
p_{1,k-1}, \ldots, p_{\ell,1}, p_{\ell,2}, \ldots, p_{\ell,k-1}$ and their neighbors is at least
$$(k-1)\sum_{j}\left((s_j - L)(x_j + 1) + \frac{x_j(x_j - 1)}{2}\right) = \Theta\left((k-1)n^2\right).$$
These connections are distributed among $1 + (k-1)\ell + (k-1)^2(L-1)\ell \le 1 + L(k-1) +
L(L-1)(k-1)^2$ buses connected to the above processors. Since each bus can have
at most $L$ connections we have
$$L\left[1 + L(k-1) + L(L-1)(k-1)^2\right] = \Theta\left((k-1)^2 L^3\right) = \Omega\left((k-1)n^2\right),$$
which implies that $L = \Omega\left(\left(\frac{n^2}{k-1}\right)^{1/3}\right)$. ■
Remark: For constant k, L is still $\Omega(n^{2/3})$.
The new accounting scheme developed for proving the lower bound of
Section 3.5 is also valid for k-ary tree algorithms. We state results from Section 3.5
that require some changes.
Lemma 3.30 For $k \ge 2$, let $b$ be any active bus of steps $L, L+1, \ldots, 3L+2$. Let
there be $p$ result processors connected to bus $b$ at the end of step $3L+2$. Let there
be $\ell \le L$ full processors (including the $p$ result processors) connected to bus $b$ at step
$3L+2$. Then $\ell \ge 2p$.
Proof: At step $3L+2$ there are $p$ result processors connected to bus $b_0$. Therefore,
there must be at least $p$ (not necessarily the same) result processors connected to bus
$b_0$ all the way from step $L$ (Corollary 3.11). In this interval, ownership of only one
connection is transferred with each partial result. Therefore, during this interval of
$2L+2$ steps, the total number of connections owned by processors connected to bus
$b_0$ is increased by at least $(2L+2)(k-1)p$ (Corollary 3.26). There is a total of $\ell$ full
processors connected to bus $b_0$ at step $3L+2$. Each of these $\ell$ processors is connected
to $k - 1$ buses different from $b_0$. Collectively, the number of connections owned by
all $\ell$ processors on bus $b_0$ is at most $\ell + \ell(k-1)L$, where $\ell(k-1)L$ bounds the connections on the
buses in the neighborhood of $b_0$ and $\ell$ is the number of connections on $b_0$ itself. Since
the loading of any of the buses in the “neighborhood of bus $b_0$” cannot exceed $L$,
$\ell(1 + (k-1)L) \ge p(k-1)(2L+2)$, which implies that $\ell \ge 2p$, or $w_0 \ge 2$. ■
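The last implication can be checked with exact arithmetic. The inequality gives $\ell/p \ge (k-1)(2L+2)/(1+(k-1)L)$, and the Python sketch below evaluates this ratio as a fraction for a range of loadings and arities. The function name and the ranges tested are our own choices for illustration.

```python
from fractions import Fraction


def min_full_to_result_ratio(L, k):
    """Lower bound on l/p implied by l(1 + (k-1)L) >= p(k-1)(2L+2)."""
    return Fraction((k - 1) * (2 * L + 2), 1 + (k - 1) * L)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, [str(min_full_to_result_ratio(L, k)) for L in (1, 5, 20)])
```

For $k = 2$ the ratio equals 2 for every $L$, and for larger $k$ it exceeds 2, which is consistent with the conclusion $w_0 \ge 2$.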
For the same partitioning of the interval $[L, n]$ in Section 3.5.1, we have the following.
Lemma 3.31 Let $b_0$ be an active bus of steps $L, L+1, \ldots, n$. For $0 \le i < y$, let $w_i$
be the transaction weight for segment $I_i$. Then, $w_0 = 2$ and $w_{i+1} = \left\lfloor \frac{d\,w_i}{1 + L(k-1)} \right\rfloor$.
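Proof: Lemma 3.30 proves that $w_0 = 2$. The rest of the proof is the same as in
Lemma 3.18. However, notice that each processor in the neighborhood of the bus
is now connected to $k - 1$ other buses. Therefore,
$w_i\,p_{i+1}\,d \le \ell_{i+1}(k-1)L + \ell_{i+1} = (1 + L(k-1))\,\ell_{i+1}$, so $\frac{\ell_{i+1}}{p_{i+1}} \ge \frac{d\,w_i}{1 + L(k-1)}$.
By definition, $w_{i+1} = \left\lfloor \frac{\ell_{i+1}}{p_{i+1}} \right\rfloor = \left\lfloor \frac{d\,w_i}{1 + L(k-1)} \right\rfloor$. ■
Lemma 3.32 Let $b_0$ be an active bus of steps $L, L+1, \ldots, n$. For $0 \le i < y$, let
$w_i$ be the transaction weight for segment $I_i$. If $\frac{d}{1 + L(k-1)}$ is an integer, then $w_{i+1} =
2\left(\frac{d}{(k-1)L + 1}\right)^{i+1}$.
Proof: If $\frac{d}{1 + L(k-1)}$ is an integer, then $\left\lfloor \frac{d\,w_i}{1 + L(k-1)} \right\rfloor = \frac{d\,w_i}{1 + L(k-1)}$. Then from Lemma 3.31,
$w_{i+1} = \frac{d}{1 + L(k-1)}\,w_i$ and $w_0 = 2$. Solving this recurrence yields $w_{i+1} = 2\left(\frac{d}{(k-1)L + 1}\right)^{i+1}$.
■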
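Theorem 3.33 For $n \ge 2$, if a degree-k, loading-$L$, $k^n$-processor MBN runs a $k^n$-input
k-ary tree algorithm in n steps, then $L = \Omega\left(\frac{n}{k(\log n - \log k)}\right)$.
Proof: Let $d = 2(1 + L(k-1))$. This gives $w_y = 2^{y+1} = \left\lfloor \frac{\ell_y}{p_y} \right\rfloor$. Since $p_y \ge 1$, $\ell_y \ge 2^{y+1}$.
That is, there are at least $2^{y+1}$ connections to a bus. Therefore, $2^{y+1} \le L$, and $y + 1 \le \log L$.
Thus $kL \log L = \Omega(n)$, which implies that $L = \Omega\left(\frac{n}{k(\log n - \log k)}\right)$. ■
Remark: For constant k, L is still $\Omega\left(\frac{n}{\log n}\right)$.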
We now outline the construction of a Tree MBN for k-ary tree algorithms. Figure 3.9
shows an example when $k = 3$. The degree of this MBN is 4 and its loading is
3. In general, the k-ary tree MBN has $k^n$ processors, $(k-1)k^{n-1}$ buses, and a degree
of $k + 1$. The loading is always 3.




Figure 3.9: MBN for ternary tree algorithms
3.9 Concluding Remarks
We have proved that for any degree-2 MBN that runs Bin(n) optimally, its loading is
$\Omega\left(\frac{n}{\log n}\right)$. If the MBN uses a direct mapping, then we have proved that the bound on
the loading is $\Omega(n)$. This is a tight bound as there exists an MBN with such a loading
[85]. We have also shown a tradeoff between the speed and loading of degree-2 MBNs.
In particular, we have proved that a degree-2, constant loading MBN can run Bin(n)
if and only if it is permitted to take $\Theta(n)$ steps more than the optimal. We conjecture
that an optimal-time, degree-2, binary-tree MBN has an $\Omega(n)$ loading lower bound.
If this is the case, then the $O(n)$-loading MBN of Vaidyanathan and Padmanabhan [85] is
optimal for degree-2 MBNs.
Chapter 4
Multiple-Bus Enhanced Meshes
Over the last decade several topologies have been proposed for connecting processors
in parallel systems. Of these, the two dimensional mesh has emerged as one
of the most widely studied, due, in part, to its regular structure and simple layout
in two dimensions. A disadvantage of the mesh is its large diameter, which is often
the bottleneck of many fundamental algorithms. To circumvent this problem, while
building on the advantages of the mesh, researchers have proposed meshes enhanced
with buses (for example [1, 7, 8, 11, 13, 19, 27, 30, 33, 51, 71, 72, 75]). Most such
enhancements carefully select the set of processors to connect, but employ an overly
simple method to connect elements of that set. In this chapter we demonstrate the
advantage of connecting these sets using MBNs in general, and binary-tree MBNs in
particular.
A general idea in enhanced meshes is to identify (not necessarily disjoint) sets
of processors of the mesh, and then connect processors in each set by a single bus.
We will refer to these sets as connect-sets. Typically, a connect-set consists of processors
in a row (or column) of the mesh [1, 7, 10, 30, 51], or variants of this idea
such as every $x$-th processor of a row (or column), for some integer $x$ [4, 17, 19, 75].
Hierarchical approaches [71, 64] have also been suggested for selecting connect-sets.
Regardless of the method used, each connect-set is connected by a single bus. (A
single “segmentable bus” [72, 75] can also be used in this context.) We will refer to
such architectures as Single-Bus Enhanced Meshes (SBEMs). With an increase in the
network size, and consequently the connect-set sizes, SBEMs require buses with high
loading. Another problem of connecting a large number of processors to a single bus
is an increase in the demand for the bus, resulting in communication bottlenecks.
One advantage, however, is the ease of broadcasting (provided the loading is not
excessive).
In this chapter, we propose Multiple-Bus Enhanced Meshes (MBEMs) that allow 
the use of multiple buses to connect processors in connect-sets. In particular, binary- 
tree MBNs are very well suited for this purpose as they are designed to facilitate 
the two most widely studied applications of enhanced meshes, namely, semigroup 
operations and broadcasting. An MBEM with binary-tree MBNs along rows/columns 
(or their subsets) can be viewed as being similar to the mesh of trees, a very versatile 
topology [49]. A mesh of trees has mesh processors arranged in a  grid, and additional 
processors and links that connect each row and column as a complete binary tree. 
This network has the desirable features of both the tree and the mesh architecture 
such as a small diameter and a large bisection width. Variations of this idea, such as 
mesh with trees along diagonals, have also been proposed [49, page 295].
MBEMs have three important advantages over SBEMs. First, the loading of 
MBEMs can be limited (often to a constant), regardless of the network size. Second, 
the network for connecting elements of a  connect-set can be tailored by the network 
designer to obtain various trade-offs between network cost and performance. Third, 
an MBEM is a generalization of the SBEM; therefore MBEMs can capture the features 
of SBEMs.
Almost all previous enhanced mesh architectures are SBEMs, focusing on identifying
connect-sets rather than the method for connecting elements within them. Consequently,
they have the disadvantage of high loading associated with SBEMs. For example,
Stout [76] and Bokhari [10] added a single global bus to the basic 2-dimensional
mesh to facilitate broadcasting. Since all the processors are connected to this single
bus, the loading is very high ($\Theta(N)$). However, Aggarwal [1] added $k$ global buses to
a d-dimensional array. This architecture has a degree $k$ and also a high $\Theta(N)$ loading,
as each processor is connected to all $k$ global buses. Another model that has drawn
considerable interest is the mesh with multiple broadcasting buses [7, 8, 30, 51, 76].
In this model, each row and each column of the mesh is connected to a single bus.
Consequently, the loading is $\Omega(\sqrt{N})$. A partial solution to the high loading problem
has been to connect only a selected subset of row/column processors to a bus. For example,
Chen et al. [17], Bar-Noy and Peleg [4], and Serrano and Parhami [75] connect
every N s  processor on each row and column to a bus. These methods still have a high 
loading (Q (Na), for a  > ! ) •  Raghavendra [71] has proposed the HMESH, a hierar­
chical architecture that reduces loading, but only at the cost of a large non-constant 
degree. Pan et al. [64] proposed the IMMB architecture, which uses a multi-level
mesh hierarchy. Even though its degree is small, it has a high $\Theta(\sqrt{N})$ loading. Serrano
and Parhami [75] used buses with segment switches in each row/column of the
mesh. Although this reduces the loading somewhat, the model still has an $\Omega(\sqrt{N})$
loading.
A large portion of the results on enhanced meshes has dealt with performing semigroup
operations (reductions). Stout [76] showed that finding the
maximum/minimum, finding the median, and sorting can be done in $O(\sqrt{N})$, $O(\sqrt{N} \log N)$,
and $O(N)$ time with a single global bus. Bokhari [10] improved the time required
for finding the maximum/minimum to 0 ( N 3) steps. Aggarwal [1] added k  global 
buses to a d-dimensional array, and showed th a t finding maximum/ minimum re- 
^ - 1  time. Prasanna-Kumar and Raghavendra [69] derived an 0 (N « )  
tim e algorithm for running semigroup operations on a  y /N  x y /N  square mesh with 
broadcast buses. Chen et al. [17] and Bar-Noy and Peleg [4] have shown that the 
running tim e can be reduced to 0 (N » ) ,  if a rectangular N»  x N # mesh augmented 
with broadcast buses is used. Serrano and Parhami [75] added segment switches to 
the buses and achieved the same running time, while reducing the loading. Chung 
[19] reduced the time to O(N^o) while Pan et al. [64] achieved better time on an 
enhanced mesh of low aspect ratio.
In addition to semigroup operations, enhanced meshes have also been used to solve
other classes of problems. For example, Bhagavathi et al. [7] have shown that
many visibility problems, such as convex hull, can be computed in $O(\log N)$ time on
a $\sqrt{N} \times \sqrt{N}$ mesh with multiple broadcast buses. In a different paper, Bhagavathi
et al. [8] established that selecting the k th smallest element in a  rectangular N» x
5 1 3N  a enhanced mesh can be done in 0(N™  (log N)<) time. The batched searching 
and ranking problem (which is fundamental to many algorithms including database 
querying, pattern  recognition, robotics and VLSI), where m  values stored in a y /N  x 
y /N  mesh with multiple broadcast buses, has been be solved in 0 ( \o g N  + y / in )  time
[ H i -
Some architectures using multiple buses for connect-sets have also been proposed.
Although the GMCCMB [19] allows several buses to connect elements of a connect-set, it
represents a particular situation that can be viewed as a further refinement of the
connect-set itself, rather than employing a specialized MBN. Indeed, our results improve
on those of the GMCCMB. The TBN [25], BBT [26] and BRT [27] are MBEMs
that also use binary-tree MBNs to enhance the mesh. However, these results do not
provide a general treatment of the topic or relate cost and performance issues as is
done in this work.
Given any iV-processor, — bus, binary-tree MBN, we first develop a framework for 
deriving other binary-tree MBNs that provide different cost/performance trade-offs. 
The primary parameters considered are time for reduction and broadcasting, loading,
degree, and layout area. We also study binary-tree MBNs with “segment switches” 
on buses. (A segment switch allows a bus to be cut into several parts that can be 
used simultaneously as independent bus segments.)
We use the binary-tree MBN derivatives mentioned above to construct MBEMs 
and show how the network parameters can be adjusted to obtain various trade-offs. 
Tables 4.1 and 4.2 (pages 73 and 82) show some parameter choices, with interesting
possibilities. All MBEMs in the tables have optimal area and constant degree. Although
our discussion on semigroup operations focuses on reduction, the results can
also be extended to prefix computations (see Section 2.4, page 17).
In the next section we briefly discuss the parameters used to evaluate MBEMs. In 
Section 4.2 we use a given binary-tree MBN to derive other binary-tree MBNs. We 
put these results together in Section 4.3 to construct enhanced meshes with a wide 
range of cost/performance trade-offs. Section 4.4 deals with similar ideas for MBEMs 
with segment switches. Finally in Section 4.5 we summarize our results and make 
some concluding remarks.
4.1 Preliminaries
In this section we discuss some preliminary ideas and define some terms used in the
rest of the chapter.
4.1.1 MBN Measures
We will quantify the notion of cost and performance of an MBN (and an MBEM 
based on it) using five measures: reduction time, broadcast time, degree, loading, 
and layout area. These quantities are not entirely independent of each other, but 
collectively serve to measure the network’s cost and performance. The number of 
buses has also been used in previous results to reflect the sparseness of the network. 
We ascribe less importance to this quantity in this work, as the layout area captures 
the notion of network sparseness. The degree and loading have been discussed in 
Section 2.2 (page 13). We now describe the remaining parameters.
Reduction time: This is the time required for the MBN to reduce inputs at its
processors to a single result, using a binary-tree algorithm.
Broadcast time: This is the time required for the MBN to broadcast a piece of
information from one processor to all other processors. A broadcast (from a fixed
source) can be viewed as a traversal of a binary tree from root to leaves, so the
broadcast time is upper bounded by the reduction time for the MBN. If the broadcast can
originate from any processor, then the broadcast time is at most twice the reduction
time (corresponding to a traversal to the root of a tree, and a broadcast down to the
leaves).
Layout: An $X \times Y$ layout of an MBEM or MBN is a placement of its processors
and buses in two layers within an $X \times Y$ rectangle. A “word-model” is assumed in
which buses and connections between processors and buses are of unit width.
Processors themselves are assumed to be of constant area; this is reasonable for constant
degree MBNs, such as those considered in this chapter. The layout is assumed to be
rectilinear; that is, all buses and connections consist of horizontal and/or vertical line
segments (wires). The two-layer layout places horizontal and vertical wires on two
different layers with “vias” connecting layers when needed. The area of an $X \times Y$
layout is $XY$ and its aspect ratio is $\frac{\max\{X,Y\}}{\min\{X,Y\}}$. Clearly, a low area is desirable, while a
low aspect ratio facilitates easier implementation.
As mentioned earlier, we will use MBNs to construct MBEMs. Since these MBNs
will be used with connect-sets, each of which spans a single row or column of the mesh,
we will only consider MBN layouts in which all processors are placed in a line along
one side of the rectangle enclosing the layout. Such a layout is called a “perimeter”
layout. Since a perimeter layout typically has one side of size $\Theta(P)$, where $P$ is
the number of processors, we will specify a high aspect ratio, perimeter layout by its
layout height $H$; this represents a $P \times H$ layout (see Figure 6.5, page 119). The layout
of the MBEMs, however, will be dense and place processors throughout the enclosing
rectangle to obtain a constant aspect ratio. Here the layout must be specified by both
dimensions of the enclosing rectangle; i.e., as an $X \times Y$ layout.
4.1.2 Multiple-Bus Enhanced Meshes
A Multiple-Bus Enhanced Mesh (MBEM) has a set of processors connected by an
underlying mesh topology. The processors are grouped into (not necessarily disjoint)
sets $C_0, C_1, \ldots, C_z$, called connect-sets. The selection of these connect-sets is an
architectural choice, made with an eye on the application domain, cost and desired
performance. A typical connect-set includes an entire row/column [1, 10, 51] or
subsets of a row or column [4, 17, 19, 75]. Other more complex methods have also
been proposed [33, 64, 71].
The processors of each connect-set are connected via one or more buses. Up
to this point, there is no difference between MBEMs and traditional Single-Bus
Enhanced Meshes (SBEMs). The difference between them is in the manner in which
processors of a connect-set are connected. In an SBEM all elements of a connect-set
are connected to a single bus, dedicated to that connect-set. In an MBEM, however,
any MBN could be used to connect elements of a connect-set. As a particular case, if
MBNs, each with one bus, are used, then the MBEM becomes an SBEM. Therefore,
MBEMs can be viewed as generalizations of SBEMs. In general, a different type of
MBN may be used for different connect-sets. They could even be different for the
same connect-set in two different MBEMs. Thus the idea of decomposing the mesh
into connect-sets is independent of the method used to connect each connect-set. It is
the method of connecting them that distinguishes SBEMs (that use single buses) from
MBEMs (that use multiple buses). We show in this chapter that there are significant
advantages to connecting elements of connect-sets by multiple buses.
4.2 Binary-Tree MBN Extensions
In this section, we present a method to construct a $2^n \times 2^m$ MBN from any given
$2^n \times 2^{n-1}$ binary-tree MBN (where $0 \le m < n$). We apply this general result to the
Tree MBN (Section 3.6, page 44) and then use the resulting $2^n \times 2^m$ binary-tree MBN
to enhance 2-dimensional meshes (Sections 4.3 and 4.4).
Lemma 4.1 Let $X(n)$ be a $2^n \times 2^{n-1}$ MBN with degree and loading of $d_n$ and $\ell_n$,
that runs $Bin(n)$ in $t_n$ steps and that performs broadcast in $q_n$ steps. Let $X(n)$ have a
layout height of $h_n$. Then for any $0 \le m < n$, there exists a $2^n \times 2^m$ MBN, $X'(n, m)$,
with degree and loading $d_{m+1}$ and $\ell_{m+1} + 2^{n-m} - 2$, that runs $Bin(n)$ in $2^{n-m} + t_{m+1} - 2$
steps. The MBN $X'(n, m)$ has an $h_{m+1}$ layout height and a broadcast time of $q_{m+1} + 2$
steps.
Proof: Consider running $Bin(m+1)$ on $X(m+1)$. Clearly the first step divides the
$2^{m+1}$ processors into $2^m$ pairs, each of which uses a distinct bus. For any $0 \le i < 2^{m+1}$,
let processor pair $(i, \pi(i))$ use bus $b(i)$ in the above step. Construct the $2^n \times 2^m$ MBN,
$X'(n, m)$, as follows. Divide its $2^n$ processors into $2^{m+1}$ sets, $S_i$ (where $0 \le i < 2^{m+1}$);
each set consists of $2^{n-m-1}$ contiguous processors of $X'(n, m)$. Designate one of the
processors of each set as its “leader.” Connect each non-leader processor of $S_i \cup S_{\pi(i)}$
to bus $b(i)$ and connect the $2^{m+1}$ leaders to the $2^m$ buses as in $X(m+1)$. Note that
the leader of $S_i$ is also connected to bus $b(i)$.
The MBN $X'(n, m)$ runs $Bin(n)$ by first combining the inputs in $S_i$ and $S_{\pi(i)}$ (via
bus $b(i)$) into the leaders of $S_i$ and $S_{\pi(i)}$ in $2(2^{n-m-1} - 1) = 2^{n-m} - 2$ steps. This
effectively reduces $Bin(n)$ into $Bin(m+1)$, which is next run as on $X(m+1)$ in $t_{m+1}$
steps. The processors of $X(m+1)$ have at most $d_{m+1}$ connections. All the other
processors have only one connection each. Therefore, the degree of $X'(n, m)$ is $d_{m+1}$.
Each bus of $X'(n, m)$ is connected to $2^{n-m}$ processors. Two of these $2^{n-m}$ processors
are part of $X(m+1)$. Therefore, the loading of $X'(n, m)$ is $2^{n-m} - 2 + \ell_{m+1}$. Since
the layout height of $X(n)$ is $h_n$, the layout height of $X(m+1)$ is $h_{m+1}$. Any processor
in $X'(n, m)$ can be reached from another processor by traversing in MBN $X(m+1)$
and along two additional buses (within a group). Since the broadcast time of $X(n)$ is
$q_n$, the broadcast time of $X'(n, m)$ is $q_{m+1} + 2$ steps. ■
We now discuss the implication of Lemma 4.1. Observe that if $t_n = n$ (which is
optimal), then the time for $X'(n, m)$ to run $Bin(n)$ is $2^{n-m} + m - 1$, which has been
shown to be optimal for any $2^n \times 2^m$ MBN [2]. If $d_n$ is a constant, then so is $d_{m+1}$,
the degree of $X'(n, m)$. If $\ell_n = 3$, then the loading of $X'(n, m)$ is $2^{n-m} + 1$, the best




Figure 4.1: A 32 x 8 Tree MBN
possible for any connected $2^n \times 2^m$ MBN (see Lemma 3.1, page 21). Thus, $X'(n, m)$
inherits optimal features of $X(n)$. In particular, the $2^n \times 2^{n-1}$ Tree MBN, $T(n)$ (see
Section 3.6, page 44), runs $Bin(n)$ optimally and has degree and loading of 3 each.
Lemma 4.1 allows $T(n)$ to be extended to a $2^n \times 2^m$ MBN (for any $0 \le m < n$) that
runs $Bin(n)$ in optimal time, whose degree is 3 (same as that of $T(n)$), and whose
loading is $2^{n-m} + 1$ (which is optimal). Figure 4.1 shows the structure of a $32 \times 8$ Tree
MBN. This MBN has a loading of 5, and a degree of 3. Its time for running $Bin(5)$
is 6 steps.
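The parameter bookkeeping of Lemma 4.1 can be checked with a short calculation. The following is a minimal sketch, not part of the original development: the names `MBNParams`, `derive_mbn` and `tree_mbn` are hypothetical, and the Tree MBN is modelled only by the values quoted in the text (degree 3, loading 3, and $t_k = k$).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MBNParams:
    processors: int
    buses: int
    degree: int
    loading: int
    reduction_steps: int


def derive_mbn(n: int, m: int,
               degree: Callable[[int], int],
               loading: Callable[[int], int],
               steps: Callable[[int], int]) -> MBNParams:
    """Parameters of X'(n, m) as stated in Lemma 4.1.

    degree(k), loading(k) and steps(k) describe the given 2^k x 2^(k-1) MBN X(k).
    """
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    per_bus = 2 ** (n - m)  # processors attached to each bus of X'(n, m)
    return MBNParams(
        processors=2 ** n,
        buses=2 ** m,
        degree=degree(m + 1),
        loading=loading(m + 1) + per_bus - 2,
        reduction_steps=per_bus + steps(m + 1) - 2,
    )


def tree_mbn(n: int, m: int) -> MBNParams:
    # Assumption: degree 3, loading 3 and t_k = k, the values quoted for T(k).
    return derive_mbn(n, m,
                      degree=lambda k: 3,
                      loading=lambda k: 3,
                      steps=lambda k: k)


if __name__ == "__main__":
    print(tree_mbn(5, 3))
```

For $n = 5$ and $m = 3$ this gives the 32 x 8 Tree MBN of Figure 4.1, with loading 5, degree 3 and 6 reduction steps.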
The selection of an appropriate MBN to connect the processors in the
connect-set is crucial as it determines all the important network parameters (running time,
degree, loading, etc.). For this work we will use the Tree MBN. (Other MBNs such
as the delayed root MBN (Section 3.7.2, page 49) or those of [27, 24] could also be
used.) Though the Tree MBN is defined for $2^n$ inputs, it is easily modified to handle
$N$ (not necessarily a power of 2) inputs. It can also be verified (see Figure 3.6(a))
that a $2^n \times 2^{n-1}$ Tree MBN has $O(n)$ layout height. For brevity, we will express
the performance measures (time, loading, degree etc.) of the Tree MBN in terms of
orders, rather than exact values. We now summarize the properties of the above Tree
MBN in the following theorem using the order notation.
Theorem 4.2 For $1 \le M \le \frac{N}{2}$, there exists an $N \times M$ Tree MBN with constant
degree and $O(\frac{N}{M})$ loading. It runs an $N$-input, binary-tree algorithm in $O(\frac{N}{M} + \log M)$
steps. This MBN has an $O(\log M)$ layout height and $O(\log M)$ broadcast time.
4.3 Meshes with Tree MBNs
In this section we construct an MBEM called the Mesh with Tree MBNs that uses
the Tree MBN to connect processors of connect-sets. This structure has several
advantages over other enhanced meshes proposed in the literature. We derive some
results to highlight these advantages. Our description here uses the least number of
parameters to describe the idea. This idea can be generalized to include different sized
and dimensioned meshes, MBNs other than the Tree MBN, and different sizes/types
of MBNs for different connect-sets.
For integers $N, A, B \ge 1$ define the mesh with Tree MBNs, $\mathcal{M}_T(N, A, B)$, as
follows. Arrange $N$ processors as a $\sqrt{N} \times \sqrt{N}$ mesh. Divide this mesh into $A \times A$
submeshes (Figure 4.2(a)), and designate one processor (the top, left processor
say) of each submesh as the “leader.” The leaders form a $\frac{\sqrt{N}}{A} \times \frac{\sqrt{N}}{A}$ array. For
$0 \le i < \frac{\sqrt{N}}{A}$, processors $\{p_{i,j} : 0 \le j < \frac{\sqrt{N}}{A}\}$ form the horizontal connect-sets and
processors $\{p_{j,i} : 0 \le j < \frac{\sqrt{N}}{A}\}$ form the vertical connect-sets. In other words, rows
and columns of leaders form connect-sets. Each connect-set is connected via $B$ buses
to form a $\frac{\sqrt{N}}{A} \times B$ instance of the Tree MBN (Figure 4.2(b)). If $B = 1$, then there will
be a single bus in each Tree MBN connecting the horizontal and vertical connect-sets
and the resulting SBEM structure will be the model of Chen et al. [17], if A =  N*
and that of Bhagavathi et al. [10], if $A = 1$. We now outline the three major phases
involved in running an $N$-input binary-tree algorithm on this structure.
1. Local reduction: Reduce the $A \times A$ inputs in each submesh to a single partial
result in its leader using the local mesh links. This requires $2(A - 1)$ steps and
the problem size reduces to $\frac{N}{A^2}$.
2. Horizontal reduction: Use all $\frac{\sqrt{N}}{A} \times B$ horizontal Tree MBNs (shown as $H$
in Figure 4.2(b)), in parallel to reduce $\frac{\sqrt{N}}{A}$ partial results in leaders of each
horizontal connect-set to a single partial result. The time and loading for this
phase are $O(\frac{\sqrt{N}}{AB} + \log B)$ and $O(\frac{\sqrt{N}}{AB})$, respectively (Theorem 4.2). The problem
size is now reduced to $\frac{\sqrt{N}}{A}$.
3. Vertical reduction: All partial results of the horizontal reduction step are in one
connect-set. This phase is performed on the vertical $\frac{\sqrt{N}}{A} \times B$ Tree MBN of this
connect-set. The time and loading for this phase are again $O(\frac{\sqrt{N}}{AB} + \log B)$ and
$O(\frac{\sqrt{N}}{AB})$. The inputs are now reduced to one result.
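The data flow of the three phases can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the function name `three_phase_reduce` is hypothetical, the operation `op` is assumed to be associative and commutative, and the code models which values are combined in each phase, not the buses or the step counts.

```python
from functools import reduce
from typing import Callable, List


def three_phase_reduce(grid: List[List[int]], a: int,
                       op: Callable[[int, int], int]) -> int:
    """Reduce a square grid in the order used by the mesh with Tree MBNs."""
    side = len(grid)
    if side == 0 or side % a or any(len(row) != side for row in grid):
        raise ValueError("grid must be square with side divisible by a")
    k = side // a  # number of leaders per row and per column

    # Phase 1: each a x a submesh is reduced into its leader.
    leaders = [[reduce(op, (grid[r][c]
                            for r in range(i * a, (i + 1) * a)
                            for c in range(j * a, (j + 1) * a)))
                for j in range(k)]
               for i in range(k)]

    # Phase 2: every row of leaders (a horizontal connect-set) is reduced.
    row_results = [reduce(op, row) for row in leaders]

    # Phase 3: the row results lie in one vertical connect-set and are reduced.
    return reduce(op, row_results)


if __name__ == "__main__":
    mesh = [[4 * r + c for c in range(4)] for r in range(4)]
    print(three_phase_reduce(mesh, 2, lambda x, y: x + y))
    print(three_phase_reduce(mesh, 2, max))
```

With the 4 x 4 example and $A = 2$, the problem size shrinks from 16 inputs to 4 leaders, then to 2 row results, then to one value.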
The overall time for running an $N$-input binary-tree algorithm on $\mathcal{M}_T(N, A, B)$
is the sum of time required for each of the three phases; that is, the reduction time is
$O(A + \frac{\sqrt{N}}{AB} + \log B)$. The broadcast time is the time it takes for a piece of information
to travel from one processor to any other processor. In this MBEM, it is equal to the
broadcast time of two $\frac{\sqrt{N}}{A} \times B$ Tree MBNs and time required to reach any processor
in an $A \times A$ submesh. Therefore, the broadcast time is $O(A + \log B)$. The loading is
$O(\frac{\sqrt{N}}{AB})$ and the degree (including the local mesh connections) is $4 + 2 \times 3 = O(1)$. The
total number of buses (not including local mesh connections) in the MBEM is $O(\frac{B\sqrt{N}}{A})$. The
(a): Leaders shown as large circles
(b): Binary tree MBNs connecting the leaders
Figure 4.2: Structure of a mesh with binary-tree MBN
VLSI area required for the $\sqrt{N} \times \sqrt{N}$ processor array alone is $N$. Therefore, the VLSI
area of $\mathcal{M}_T(N, A, B)$ is $\Omega(N)$. Each of the horizontal and vertical MBNs has $\frac{\sqrt{N}}{A}$
processors and, as shown in Figure 4.2(a), these processors are separated by a distance of
$A + \log B$ units. Therefore, each of these MBNs has an $O\left(\frac{\sqrt{N}}{A}(A + \log B) \times \log B\right)$
VLSI layout. Since the $\frac{\sqrt{N}}{A}$ vertical and horizontal MBNs have symmetric layouts,
$\mathcal{M}_T(N, A, B)$ has a constant aspect ratio $O\left(\sqrt{N}(1 + \frac{\log B}{A}) \times \sqrt{N}(1 + \frac{\log B}{A})\right)$
layout. Therefore, its layout area is $O(N + \frac{N \log^2 B}{A^2})$. In summary, $\mathcal{M}_T(N, A, B)$ has the
following:
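Degree = 10
Loading = $O(\frac{\sqrt{N}}{AB})$
Broadcast time = $\Theta(A + \log B)$
Reduction time = $\Theta(A + \frac{\sqrt{N}}{AB} + \log B)$
Number of buses = $\Theta(\frac{B\sqrt{N}}{A})$
Area = $\Theta(N + \frac{N \log^2 B}{A^2})$
Aspect ratio = $O(1)$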
Since the reduction time is $\Omega(A)$ and $\Omega(\log N)$ ($N$ inputs cannot be reduced in
less than $\log N$ steps), we choose $A = \Omega(\log N)$. Since $B = O(N)$, $\log B = O(\log N)$.
Therefore, $A = \Omega(\log B)$. Then $\mathcal{M}_T(N, A, B)$ has $O(N)$ area, which is optimal.
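To see how the choice of $A$ and $B$ moves these quantities, the order expressions of this section can be evaluated for concrete values. The sketch below is illustrative only: the function name `mesh_tree_mbn_estimates` is hypothetical, all hidden constants are taken to be 1 (except the factor 2 counting horizontal and vertical MBNs), and the returned numbers are order-of-magnitude estimates rather than exact step counts.

```python
import math


def mesh_tree_mbn_estimates(n: int, a: int, b: int) -> dict:
    """Evaluate the order expressions for the mesh with Tree MBNs (constants dropped)."""
    side = math.isqrt(n)
    if side * side != n or a < 1 or b < 1 or side % a:
        raise ValueError("need N a perfect square, A dividing sqrt(N), B >= 1")
    leaders = side // a              # leaders per connect-set, sqrt(N)/A
    log_b = math.log2(b)
    loading = leaders / b            # sqrt(N)/(A*B)
    return {
        "loading": loading,
        "reduction_time": a + loading + log_b,
        "broadcast_time": a + log_b,
        "buses": 2 * leaders * b,    # 2*sqrt(N)/A MBNs with B buses each
        "area": n * (1 + log_b / a) ** 2,
    }


if __name__ == "__main__":
    for a, b in [(32, 4), (32, 16), (64, 4)]:
        print(a, b, mesh_tree_mbn_estimates(2 ** 20, a, b))
```

When $A$ dominates $\log B$, the area estimate stays within a constant factor of $N$, which matches the choice $A = \Omega(\log B)$ made above.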





Degree = 10
Loading = $L$
Broadcast time = $\Theta(A)$
Reduction time = $\Theta(A + L)$
Number of buses = $O(B^2 L)$
Area = $O(N)$
Aspect Ratio = $O(1)$
Notice that it is possible to make trade-offs between loading and the number of
buses, while keeping the time and VLSI area unchanged. Table 4.1 shows results for
various choices of $A$ and $L$. The last two entries of the table also show reduction
results on other (SBEM) architectures in the literature. Notice that our method
matches that of the GMCCMB [19], while having a better aspect ratio and providing
additional possibilities for the same time, or for the same loading. Compared to
IMMB [64] our method has a better loading for z < \  and provides more possibilities
for the loading (given a fixed time).
Table 4.1: Some results for meshes with Tree MBN
Architectures Time Loading No of Buses Aspect Ratio
A  = N i N s constant N s constant
A  = Ns N s Ns Ns constant•12II N to constant NTS constant
A  =  Nio N ts NTo NTS constant
A  =  N t* N ts constant N  18 constant
A  =  N ts NTS N T S • rN  18 constant
for z > 0, A  =  log* N log2 AT L JV
L  \ o f \ N
constant
for z > 0, A = N z N z 3 constant
for z > 0, A  =  N z N z N z N l~3z constant
GMCCMB [19] NTS NTS NTS N s
IMMB [64], for z >  0 N z N* N l~3z constant
In all of the cases VLSI area is $O(N)$ and the degree is constant.
All the entries show orders rather than absolute values.
4.4 MBEMs with Segment Switches
A segment switch placed on a bus can break (when opened) the bus into two
independent buses. When the switch is closed, the two segments are fused together and
they function as a single bus. Segment switches have been well studied in the context of
dynamically reconfigurable architectures [59, 77] and other bus-based networks [72, 75]. Since
segment switches allow a bus to be configured to suit a particular step of an algorithm,
they could be used to reduce the loading as shown in this section.
4.4.1 Binary-Tree MBNs with Segment Switches
In this section we derive results similar to Theorem 4.2 for binary-tree MBNs with 
segment switches. We then use these derivative binary-tree MBNs to enhance the 
mesh.
Addition of segment switches requires further development of some of the ideas 
used. These ideas are in the setting of a given computation (such as reduction or 
broadcasting). We assume that a segment switch changes state at least once during 
the course of the computation. A bus may be defined as a maximally connected 
segment resulting from all segment switches being in the closed state.
The state of segment switches in an MBN may change during the execution of
an algorithm. It is therefore possible to define two types of loadings for MBNs with
segment switches, absolute and relative. The absolute loading is the largest possible
loading of a bus (when all segment switches on it are closed); this matches the
conventional idea of loading. The relative loading of a bus segment at some step $s$ of
a computation is the number of connections and (closed) segment switches on that
segment during step $s$. The relative loading of a step $s$ is the largest of the relative
loadings of all the bus segments of step $s$. The relative loading of a computation is
the largest of the relative loadings of all steps of the computation. Notice that the
effect of bus loading (whether optical or electrical) is restricted to within a segment.
Therefore, relative loading is indicative of the demands on the system for that
computation. On the other hand, absolute loading is useful for the worst case scenario
in a general purpose environment. Naturally, the absolute loading is no smaller than
the relative loading of any computation.
We now describe two methods for running binary-tree algorithms on MBNs with 
segment switches.
Consider a bus with $K$ segment switches numbered $0, 1, \ldots, K-1$. When all the
switches are open, the bus is broken into $K$ independent segments¹. Let these atomic
segments be $S_0, S_1, \ldots, S_{K-1}$. In the context of a binary-tree algorithm, let there be
one result processor holding a partial result per segment. (There may be many
non-result processors connected to each bus segment.) The aim is to reduce the $K$ partial
results to one final result.
Method 1: Without loss of generality, let $K = 2^k$ for integer $k$. This method
performs the reduction in $\log K$ steps (optimal time) as follows (see also Figure 4.3).
First, close the segment switches $0, 2, 4, \ldots, K-2$. This will fuse segments $\{S_0, S_1\}$,
$\{S_2, S_3\}, \ldots, \{S_{K-2}, S_{K-1}\}$. Reduce the two partial results of each fused pair of segments
to one result. Next, close the segment switches $1, 5, 9, \ldots, K-3$ and reduce the
two partial results of the fused segments to one result. At this step, four original
bus segments $\{S_0, S_1, S_2, S_3\}, \{S_4, S_5, S_6, S_7\}, \ldots, \{S_{K-4}, S_{K-3}, S_{K-2}, S_{K-1}\}$ are fused
together. This process is carried out doubling the number of segments fused, until all
the segments are fused together and the last two partial results are reduced to one
¹ Actually there are $K + 1$ segments, but we will keep the discussion simple by leaving one unutilized.





Figure 4.3: Steps of method 1
final result. The time for this reduction is clearly $\log K$ steps, and loading is equal to
the number of processors connected to the entire bus (or at least half this number).
Since all the processors are connected via the bus when all the segment switches are
closed, the broadcast time of this method is only one step.
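The switch schedule of Method 1 can be traced with a small simulation. This is a sketch under stated assumptions: the name `method1_reduce` is hypothetical, switch $j$ is assumed to sit between segments $S_j$ and $S_{j+1}$ (which is consistent with the switch numbers given above), and each fused block is assumed to keep its partial result in its leftmost segment.

```python
from typing import Callable, List, Tuple


def method1_reduce(partials: List[int],
                   op: Callable[[int, int], int]) -> Tuple[int, List[List[int]]]:
    """Simulate Method 1; return the result and the switches newly closed per step."""
    k = len(partials)
    if k < 1 or k & (k - 1):
        raise ValueError("K must be a power of two")
    results = list(partials)   # results[i] is the partial result held in S_i
    schedule: List[List[int]] = []
    span = 1                   # atomic segments per fused block before this step
    while span < k:
        # Closing switch j fuses the block ending at S_j with the block starting at S_(j+1).
        closed = list(range(span - 1, k - 1, 2 * span))
        for j in closed:
            left = j - span + 1            # leftmost segment of the left block
            results[left] = op(results[left], results[j + 1])
        schedule.append(closed)            # earlier switches stay closed
        span *= 2
    return results[0], schedule


if __name__ == "__main__":
    total, steps = method1_reduce(list(range(1, 9)), lambda x, y: x + y)
    print(total, steps)
```

For $K = 8$ the schedule closes switches 0, 2, 4, 6, then 1, 5, then 3, so the reduction finishes in $\log K = 3$ steps with the whole bus fused in the last step.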
Method 2: This method performs the reduction in $K - 1$ steps as illustrated in
Figure 4.4. First close segment switch $K - 2$, and reduce the inputs in segments $S_{K-1}$
and $S_{K-2}$ to a result in segment $S_{K-2}$. Next open segment switch $K - 2$ and close the
segment switch $K - 3$. This will fuse segments $S_{K-2}$ and $S_{K-3}$. Reduce the partial
results in segments $S_{K-2}$ and $S_{K-3}$ into a result in segment $S_{K-3}$. This reduction
process is carried out, until the last two segments $S_1$ and $S_0$ are fused together and
their results/inputs are reduced to one result in segment $S_0$. The time for this
reduction is clearly $K - 1$ steps. Since at most two segments are fused at any given
time, the loading of this method is equal to twice the number of processors in each
segment. For this loading, the broadcast time is the same as the reduction time.
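Method 2 can be traced in the same style. Again this is only a sketch: the name `method2_reduce` is hypothetical, switch $j$ is assumed to sit between $S_j$ and $S_{j+1}$, and exactly one switch is taken to be closed in each step.

```python
from typing import Callable, List, Tuple


def method2_reduce(partials: List[int],
                   op: Callable[[int, int], int]) -> Tuple[int, List[int]]:
    """Simulate Method 2; return the result and the single switch closed at each step."""
    k = len(partials)
    if k < 1:
        raise ValueError("need at least one segment")
    results = list(partials)   # results[i] is the partial result held in S_i
    schedule: List[int] = []
    for j in range(k - 2, -1, -1):
        # Only switch j is closed, so just S_j and S_(j+1) share a bus segment.
        results[j] = op(results[j], results[j + 1])
        schedule.append(j)
    return results[0], schedule


if __name__ == "__main__":
    total, steps = method2_reduce(list(range(1, 9)), lambda x, y: x + y)
    print(total, steps)
```

For $K = 8$ the switches are closed in the order 6, 5, ..., 0, which takes $K - 1 = 7$ steps while never fusing more than two segments.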
We now present two constructions of an $S$-segment switch, $N \times M$ binary-tree
MBN from any given $N \times \frac{N}{2}$ binary-tree MBN. Each construction uses both
of the methods discussed above.
Figure 4.4: Steps of method 2
Construction 1: Let $X(N)$ be an $N \times \frac{N}{2}$ MBN. The aim is to use $X(N)$ to
construct an $N \times M$ MBN $X_1(N, M, S, K)$ with $S$ segment switches and a parameter
$K$ that controls its running time and relative loading. Let these parameters satisfy
$1 \le MK \le S \le N$. For brevity and without loss of generality, assume that quantities
such as $\frac{N}{M}$, $\frac{S}{MK}$ and $\frac{N}{S}$ are integers.
The idea is to first use $\frac{M}{2}$ buses with segment switches to reduce the $N$ inputs to
$M$ partial results. Next use the remaining buses (without segment switches) as an
$M \times \frac{M}{2}$ instance $X(M)$ of the given MBN. We now describe the reduction of $N$ inputs
to $M$ partial results. Each of the $\frac{M}{2}$ buses used for this part is used to obtain 2 partial
results. Since all buses proceed identically, we describe the activity of only one.
Of the $N$ processors and $S$ segment switches, $\frac{2N}{M}$ processors and $\frac{2S}{M}$ switches are
assigned to each bus. These processors and switches are arranged on the bus as shown
in Figure 4.5. The processors are divided into $\frac{2S}{M}$ segments, with a segment switch
between adjacent segments. There are $\frac{2S}{M}$ segments, one per switch. Cluster
$K$ contiguous segments into a group. Each group spans $K$ segments and there are
$\frac{2S}{MK}$ groups. The reduction proceeds as follows.




Figure 4.5: Construction 1
1. Open all the segment switches, and sequentially reduce the $\frac{N}{S}$ inputs in each
segment. This takes $\frac{N}{S} - 1$ steps, and the relative loading of each bus segment
is $\frac{N}{S}$.
2. Use Method 1 to reduce each group of $K$ segments in $O(\log K)$ time with a
relative loading of $O(\frac{KN}{S})$.
3. Use Method 2 to reduce group partial results to the two partial results for the
bus. This takes $O(\frac{S}{MK})$ steps with relative loading $O(\frac{KN}{S})$.
4. Reduce the $M$ results on an $M \times \frac{M}{2}$ instance $X(M)$ of the given MBN.
If $X(N)$ runs in $t_N$ steps with relative loading $\ell_N$, this step runs in $t_M$ steps with
$\ell_M$ relative loading. If the given MBN $X(N)$ has degree $d_N$, broadcast time $q_N$ and
layout height $h_N$, then the total time required to reduce $N$ inputs on $X_1(N, M, S, K)$
is $O(\frac{N}{S} + \log K + \frac{S}{MK} + t_M)$. The relative loading and degree are $O(\frac{KN}{S} + \ell_M)$ and
$d_M$ respectively. The layout height and the absolute loading of $X_1(N, M, S, K)$ are
$O(h_M)$ and $O(\frac{N}{M} + \ell_M)$. It is easy to see that the broadcast time of $X_1(N, M, S, K)$
is determined by step 3 and the broadcast time on $X(M)$. Therefore, the broadcast
time of $X_1(N, M, S, K)$ is $O(q_M + \frac{S}{MK})$ steps.
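The four contributions to the running time of Construction 1 can be tabulated for concrete parameters. The sketch below is illustrative only: the name `construction1_cost` is hypothetical, the given MBN is assumed to be the Tree MBN with $t_M = \log M$ and constant loading, and every hidden constant is taken to be 1, so the numbers are order estimates and not exact step counts.

```python
def construction1_cost(n: int, m: int, s: int, k: int) -> dict:
    """Order-level costs of X_1(N, M, S, K) when the given MBN is the Tree MBN."""
    for name, value in (("M", m), ("K", k)):
        if value < 1 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two")
    if not (m * k <= s <= n) or n % s or s % (m * k):
        raise ValueError("need M*K <= S <= N with N/S and S/(M*K) integral")
    per_segment = n // s        # step 1: sequential reduction inside a segment
    groups = s // (m * k)       # step 3: groups handled one after another (Method 2)
    log_k = k.bit_length() - 1  # step 2: Method 1 inside each group
    log_m = m.bit_length() - 1  # step 4: assumed t_M = log M for the Tree MBN
    return {
        "step1": per_segment,
        "step2": log_k,
        "step3": groups,
        "step4": log_m,
        "time": per_segment + log_k + groups + log_m,
        "relative_loading": k * per_segment,   # K*N/S
        "absolute_loading": n // m,            # N/M
    }


if __name__ == "__main__":
    print(construction1_cost(4096, 8, 64, 4))
```

Raising $K$ shortens step 3 at the price of a higher relative loading, which is the trade-off the parameter $K$ is meant to control.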
Figure 4.6: Construction 2
Lemma 4.3 Let $X(N)$ be an $N \times \frac{N}{2}$ MBN with degree and loading $d_N$ and $\ell_N$. Let
$X(N)$ run an $N$-input binary-tree algorithm in $t_N$ steps and have a layout height of
$h_N$. Let the broadcast time of this MBN be $q_N$. For $1 \le MK \le S \le N$, there exists an
$S$-segment switch, $N \times M$ MBN $X_1(N, M, S, K)$ with degree $d_M$, $O(\frac{KN}{S} + \ell_M)$ relative
loading, and $O(\frac{N}{M} + \ell_M)$ absolute loading that runs an $N$-input binary-tree algorithm
in $O(\frac{N}{S} + \frac{S}{MK} + \log K + t_M)$ steps. The VLSI height of $X_1(N, M, S, K)$ is $O(h_M)$ and
the broadcast time is $O(q_M + \frac{S}{MK})$ steps. ■
If the given MBN $X(N)$ is the Tree MBN, then we have the following result.
Theorem 4.4 For $1 \le MK \le S \le N$, there exists an MBN with constant degree,
$\Theta(\frac{KN}{S})$ relative loading and $O(\frac{N}{M})$ absolute loading that runs an $N$-input binary-tree
algorithm in $\Theta(\frac{N}{S} + \frac{S}{MK} + \log K + \log M)$ steps. This MBN has a layout height of
$O(\log M)$ and a broadcast time of $O(\log M + \frac{S}{MK})$ steps.
Construction 2: Again let $X(N)$ be the given MBN. This construction uses a
small number of segment switches $S \le M$. Use Lemma 4.1 to construct $X'(\frac{N}{S}, \frac{M}{S})$,
an $\frac{N}{S} \times \frac{M}{S}$ derivative of $X(N)$. Use $S$ such $X'(\frac{N}{S}, \frac{M}{S})$s to reduce the $N$ inputs to
$S$ partial results. Now the problem is that of reducing $S$ partial results on a bus
(see Figure 4.4). With a parameter $K$ (where $1 \le K \le S$), this can be done with
Method 1 first and then Method 2 in $\Theta(\log K + \frac{S}{K})$ steps with a relative loading of
$\Theta(K)$.
Construct $X_2(N, M, S, K)$ as follows. Distribute all $S$ segment switches along one
bus. Divide the $N$ processors and $M$ buses into $S$ equal parts. Construct $S$ copies of
MBN $X'(\frac{N}{S}, \frac{M}{S})$ and connect one MBN each to the $S$ bus segments.
With the same notation as in Construction 1, we have the following result.
Lemma 4.5 Let $X(N)$ be an $N \times \frac{N}{2}$ MBN with degree and loading $d_N$ and $\ell_N$. Let
$X(N)$ run an $N$-input binary-tree algorithm in $t_N$ steps and have a layout height of
$h_N$. Let the broadcast time of this MBN be $q_N$. For $1 \le K \le S \le M \le \frac{N}{2}$, there
exists an $S$-segment switch, $N \times (M + 1)$ MBN $X_2(N, M, S, K)$ with $d_{\frac{2M}{S}}$ degree,
$\Theta(2K + \ell_{\frac{2M}{S}} + \frac{N}{M})$ relative loading and $O(S + \ell_{\frac{2M}{S}} + \frac{N}{M})$ absolute loading that runs an
$N$-input binary-tree algorithm in $\Theta(\frac{N}{M} + \frac{S}{K} + \log K + t_{\frac{2M}{S}})$ steps. The layout height
of $X_2(N, M, S, K)$ is $O(h_{\frac{2M}{S}})$ and the broadcast time is $\Theta(q_{\frac{2M}{S}} + \frac{S}{K})$. ■
Remark: Note that $X_2(N, M, S, K)$ is an $N \times (M + 1)$ MBN.
If the given MBN $X(N)$ is the Tree MBN, then we have the following result.
Lemma 4.6 For $1 \le K \le S \le M \le \frac{N}{2}$, there exists an MBN with constant degree,
$\Theta(K + \frac{N}{M})$ relative loading and $\Theta(S + \frac{N}{M})$ absolute loading that runs an $N$-input
binary-tree algorithm in $O(\frac{N}{M} + \frac{S}{K} + \log K + \log(\frac{M}{S}))$ steps. The layout height of the
MBN is $O(\log(\frac{M}{S}))$ and the broadcast time is $O(\log(\frac{M}{S}) + \frac{S}{K})$ steps. ■
4.4.2 Meshes with Tree MBNs and Segment Switches
The idea of Section 4.3 readily extends to binary-tree MBNs with segment switches.
Here the MBN connecting row and column connect-sets is a $\frac{\sqrt{N}}{A} \times B$, $S$ segment-switch
extension of the Tree MBN, obtained from Construction 1 (Theorem 4.4). The steps
involved in running a binary-tree algorithm with segment switches are the same as in
Section 4.3. Therefore, we simply state the results.
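Time = $\Theta(A + \frac{\sqrt{N}}{AS} + \frac{S}{BK} + \log K + \log B)$
Relative loading = $\Theta(\frac{K\sqrt{N}}{AS})$
Absolute loading = $\Theta(\frac{\sqrt{N}}{AB})$
Number of buses = $O(\frac{B\sqrt{N}}{A})$
Number of segment switches = $\Theta(\frac{S\sqrt{N}}{A})$
Broadcast time = $O(\frac{S}{BK} + \log B)$
VLSI area = $\Theta(N + \frac{N \log^2 B}{A^2})$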
Let $A' = \frac{\sqrt{N}}{AS}$ and $A'' = \frac{S}{BK}$. If we let $1 \le BK \le S \le \frac{\sqrt{N}}{A} \le AS$, then the
conditions for Lemma 4.3 are satisfied by row and column MBNs. Also $1 \le BK \le
S \le A'S \le AS$, which implies that $A' \le A$. If we also set $S \le ABK$ then $A'' \le A$, so
we have the following:
Reduction time = $\Theta(A)$
Relative loading = $\Theta(A'K)$
Absolute loading = $\Theta(A'A''K)$
Number of segment switches = $\Theta(\frac{N}{A^2 A'})$
Number of buses = $\Theta(\frac{N}{A^2 A'A''K})$
Broadcast time = $O(A'' + \log B)$
VLSI area = $O(N)$
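The substitutions behind this list can be sanity-checked numerically. The sketch below is only an illustration under explicit assumptions: it takes $A' = \frac{\sqrt{N}}{AS}$ and $A'' = \frac{S}{BK}$ (the reading of the definitions that makes the identities of this section consistent), the function name `switch_mesh_parameters` is hypothetical, and all values are order expressions with constants dropped.

```python
import math
from fractions import Fraction


def switch_mesh_parameters(n: int, a: int, b: int, s: int, k: int) -> dict:
    """Evaluate the order expressions for a mesh with Tree MBNs and segment switches."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("N must be a perfect square")
    leaders = Fraction(side, a)              # sqrt(N)/A processors per connect-set
    if not (1 <= b * k <= s <= leaders <= a * s) or s > a * b * k:
        raise ValueError("need 1 <= BK <= S <= sqrt(N)/A <= AS and S <= ABK")
    a1 = leaders / s                         # A'  = sqrt(N)/(A*S), assumed definition
    a2 = Fraction(s, b * k)                  # A'' = S/(B*K), assumed definition
    per_area = Fraction(n, a * a)            # N/A^2, the number of leaders
    return {
        "A1": a1,
        "A2": a2,
        "relative_loading": a1 * k,
        "absolute_loading": a1 * a2 * k,
        "segment_switches": per_area / a1,
        "buses": per_area / (a1 * a2 * k),
    }


if __name__ == "__main__":
    print(switch_mesh_parameters(2 ** 24, 16, 8, 64, 2))
```

For $N = 2^{24}$, $A = 16$, $B = 8$, $S = 64$ and $K = 2$, the absolute loading $A'A''K$ equals $\frac{\sqrt{N}}{AB}$ and the switch count equals $\frac{S\sqrt{N}}{A}$, as the substitution predicts.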
Using the above set of equations, we can compute various network parameters for 
different values of running time, segment switches and K . We show some results 
in Table 4.2. This table clearly shows the effect of segment switches on loading. 
Compared to Table 4.1, for the same running time, a lower relative loading can be











A  = A ' = N s  
A" = K  — N s
JVi N s N s 0 ( 1 ) N s N s
A  = A ' — N s  
A" =  AT?, K  =  N s
N» N s N s 0 ( 1 ) N s N s
A  = N t s ,  A! =  1 
A" =  K  =  1
NTS 0(1) 0 ( 1 ) 0 ( 1 ) NTS NiS
A  = A'  =  N t s  
A" = K  = N t s
N ts N ts NTS 0 ( 1 ) N ts jV&
A = N z, A! =  1 
A" = K  =  1
N z 0 ( 1 ) 0 ( 1 ) 0 ( 1 ) iVl-2z N i-*2
A  =  log* N , A ' = L  
A" =  K  =  1
log* N L L l o g ' N 0 ( 1 ) /VL  log2* N
N ' 
tlog2* N
A = A , = N Z 
A" = N Z, K  =  1
N z N z N*z 0 ( 1 ) N 1-*2 N l-3z
A — A" = N z 
A! =  K  =  1
N z 0 ( 1 ) N z 0 ( 1 ) N l~3z
Segmented Bus 
Enhanced Mesh [75]
N* N s N s N s N s Ns
achieved. For example with time $O(N^z)$ and $N^{1-3z}$ buses, the loading is constant.
For the same case in Table 4.1 this loading is $N^z$. Moreover, the results here are
better than those achieved by Serrano and Parhami [75]. For a running time
of 0 ( N s ) ,  we have a much wider choice of parameters, and the resulting MBN is 
superior in all respects to that in [75]. Specifically, we can achieve a much lower 
loading, while improving on aspect ratio, absolute loading and number of buses. Also 
compared to the IMMB [64] we now have for $O(N^z)$ time a constant relative loading
with the same number of buses.
4.5 Results and Discussion
Traditionally, buses have been used in a variety of ways to enhance the basic mesh. 
In all these schemes, a single bus is used to connect processors. In this chapter we 
have explored the use of multiple buses to connect processors together to form meshes 
enhanced with MBNs. The traditional single bus approach is a special case of our 
framework.
The IMMB can also achieve $O(\log N)$ time, but to reduce its loading to constant,
a high-dimensional, sub-optimal area structure [86] is needed. The methods
presented in this chapter can be applied with any MBN. Therefore, these methods can
be adopted for use with other algorithms by suitably selecting MBNs in horizontal
and vertical dimensions. If each processor (or some of the processors) is connected
to several buses through segment switches, then it is possible to “dynamically
reconfigure” the MBNs. In that case, the same MBEM can be used to run different
algorithms or different steps of the same algorithm optimally by reconfiguring the
processor bus interconnection pattern.
Chapter 5
Fault Tolerance
In Chapter 4, we showed that binary-tree MBNs can be used for general purpose
enhanced meshes with properties similar to the mesh of trees [49]. This chapter deals
with the construction of fault-tolerant binary-tree MBNs. Such MBNs can be used
as fault-tolerant building blocks for enhanced meshes. We present two methods for
constructing such MBNs. One of the methods is more general in that it can also be
used for any MBN (not just binary-tree MBNs). It is particularly useful in situations
where the MBN uses resources (buses and processors) non-uniformly. In other words,
if a given algorithm uses some of the resources most of the time, and the rest not
that often, then this method can exploit this situation to produce better results;
binary-tree algorithms represent one such situation. The second method applies only
to binary-tree algorithms.
Specifically, we present two methods called replication and recursive scheduling
that add connections in a systematic and controlled manner to transform any given
binary-tree MBN into a fault-tolerant one. Given any $N \times M$ binary-tree MBN,
$\mathcal{M}$, and an integer $1 \le k \le \frac{M}{2}$, we derive an $N \times M$ MBN $\mathcal{M}'$ that can tolerate
the failure of any set of $k$ buses. The performance of the fault-tolerant MBN, $\mathcal{M}'$, is
measured in terms of (i) the time to run a set of computations designed for $\mathcal{M}$, (ii) its
degree, and (iii) its loading. These attributes of $\mathcal{M}'$ depend on the corresponding
attributes of $\mathcal{M}$. Replication can also be used for handling processor faults. The
methods we propose in this chapter accept any MBN as input, so the approaches for
bus and processor faults are independent; that is, tolerance to processor faults can
be imparted to an MBN that is already tolerant to bus faults and vice versa.
Most previous work [3, 12, 15, 31, 45, 67, 68] on fault-tolerance in MBNs has
focused primarily on issues related to connectivity/topology (number of failures to
disconnect the network, average distance between processors, etc.) and performance in
a general purpose setting (such as throughput under various traffic models). Nadella
and Vaidyanathan [57, 84] have considered the design of a specific fault-tolerant
binary-tree MBN. The methods we present here generalize that work
in that they can be applied to any binary-tree MBN.
In Section 5.1 we state the assumptions used in the chapter. In Sections 5.2 and 
5.3, we detail replication and recursive scheduling. The extension of replication to 
processor faults is discussed in Section 5.2.5. In Section 5.2.6 we tailor the replication 
results specifically to binary-tree MBNs. We compare the two methods in Section 5.4 
and make some concluding remarks in Section 5.5.
5.1 Fault Model
We assume here that a faulty bus or processor is entirely faulty and completely
unusable. We also assume that the faulty or fault-free status of each processor and
bus is known before the MBN begins its computation and does not change during
the computation. If a bus $b$ (or processor $p$) is faulty, then a fault-free bus $b'$ (or
processor $p'$) will be assigned to perform the functions of bus $b$ (or processor $p$). We
assume that bus $b'$ (or processor $p'$) has all the information necessary to perform
these functions. The above assignment of responsibilities to fault-free elements is
done before the computation commences.
The focus of this work is on designing an MBN that has the required redundant
connections available, while the actual fault handling procedure (including fault
identification and processing) is not considered. This may also be useful in improving the
yield of a chip with binary-tree MBNs in which faults during manufacture can be
bypassed by a one-time reconfiguration [44, 87].
5.2 Replication
In this section we develop the results for bus faults first, then extend them to include
processor faults in Section 5.2.5. Replication is a general method that can be applied
to any MBN. A key feature of replication is that it permits a set of $k$ buses to be
designated as "less important," and the failure of an arbitrary set of $k$ buses can be
treated as the failure of these less-important buses. In cases where not all resources
are used equally, replication constructs a fault-tolerant MBN that is better tuned to
the given computational setting. Binary-tree algorithms are a good example of uneven
resource use, where the number of processors and buses required decreases by a factor
of 2 with each level. We establish that for any $2^n \times 2^{n-1}$ binary-tree MBN, replication
gives a fault-tolerant MBN that requires at most 5 (resp., 2) extra steps if as many as
$2^{n-2}$ buses (resp., $2^{n-1}$ processors) fail. Such a result would not be possible without
considering the fact that some buses/processors are used for only a few steps. Even
with this consideration, one cannot guarantee that the faulty elements would be the
ones used lightly. Replication provides the effect of this guarantee by allowing the
failure of any set of buses/processors to be treated as the failure of a fixed set of less
important elements. This flexibility lends itself to designing a fault-tolerant MBN
that is better tuned to perform a given set of computations. Since we assume that
faults occur before the start of the computation, resilience to processor faults can
be obtained as a dual of the bus faults case. Therefore, replication can be used for
processor faults as well.
5.2.1 Adding Redundant Connections
Given any $N \times M$ MBN, $\mathcal{M}$, and an integer $1 \le k \le \frac{M}{2}$, replication constructs an
$N \times M$ MBN, $\mathcal{R}_k$, that can emulate $\mathcal{M}$, even if any set of (at most) $k$ buses fail.
The idea is to select $k$ replacements for each bus $b$ and copy the connections of $b$
to each of the $k$ replacement buses. We first define $\mathcal{R}_k$ and then establish that $\mathcal{R}_k$
can treat any set of faulty buses as a designated set of less important buses. This is
followed by the derivation of the fault-tolerance properties of $\mathcal{R}_k$. Finally, we discuss
tolerance to processor faults, as the dual of the bus fault case.
5.2.2 Definition of $\mathcal{R}_k$
Let the buses of the given $N \times M$ MBN, $\mathcal{M}$, (and the generated $N \times M$ fault-tolerant
MBN, $\mathcal{R}_k$) be $0, 1, \cdots, M-1$. For any bus $b$ of $\mathcal{M}$, let $Proc[b : \mathcal{M}]$ denote
the set of processors of $\mathcal{M}$ that are connected to bus $b$. For any $0 \le b < M$, let
$R(b) = \{(b - i) \bmod M : 0 \le i \le k\}$ be the replacement set for bus $b$. Now
define the fault-tolerant MBN, $\mathcal{R}_k$, as follows: For any $0 \le b < M$,
$$Proc[b : \mathcal{R}_k] = \bigcup_{b' \in R(b)} Proc[b' : \mathcal{M}].$$
Each bus $b$ of $\mathcal{R}_k$ has all the connections of bus $b$ of $\mathcal{M}$, and the additional
connections needed to replace buses $(b - i) \bmod M$ of $\mathcal{M}$, where $1 \le i \le k$.
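The definition above translates almost directly into code. The following is a minimal sketch, assuming the MBN is represented as a list `proc` in which `proc[b]` is the set of processors on bus `b`; the function name `replicate` and the small 6-processor, 4-bus example are hypothetical and are not taken from the text.

```python
def replicate(proc, k):
    """Return the connection sets of the replicated MBN.

    proc[b] is the set of processors connected to bus b of the original
    MBN (buses numbered 0..M-1).  Bus b of the result carries the union
    of proc[b'] over the replacement set
    R(b) = {(b - i) mod M : 0 <= i <= k}.
    """
    M = len(proc)
    return [
        set().union(*(proc[(b - i) % M] for i in range(k + 1)))
        for b in range(M)
    ]


# Hypothetical 6-processor, 4-bus MBN used only for illustration.
original = [{0, 1}, {2, 3}, {4, 5}, {0, 3}]
augmented = replicate(original, k=1)
```

With `k = 1`, bus 0 of the result carries the processors of buses 0 and 3 of the original, bus 1 those of buses 1 and 0, and so on, so every original connection is preserved.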
In the example of Figure 5.1, we have arranged (permuted) the buses so that the
connections of a bus replicated on the $k$ other buses overlap with existing connections.
This can reduce the degree from $(k+1)d$ to $kd$. Permuting the buses in this manner to
reduce degree may not be possible in all situations. With $k = 3$, Figure 5.1 shows the
MBN $\mathcal{R}_3$ corresponding to the $16 \times 8$ MBN, $\mathcal{M}$, of Figure 2.2 (page 14). A connection
between a processor and a bus of $\mathcal{M}$ is indicated by a "1," and a connection added
for fault-tolerance is indicated by a "o." Entries where an existing connection ("1")
and an added connection ("o") overlap are indicated by "•". In this example, nearly
half the buses are permitted to be faulty. Therefore, a dense MBN, $\mathcal{R}_3$, is to be
expected. The observations below show that this is not the case in general.

Figure 5.1: The MBN of Figure 2.2 augmented to handle 3 bus faults
The following observations about $\mathcal{R}_k$ are straightforward.
1. If none of the buses are faulty, then $\mathcal{R}_k$ can emulate $\mathcal{M}$ without any overhead.
This is because for each bus $b$, $Proc[b : \mathcal{M}] \subseteq Proc[b : \mathcal{R}_k]$. In other words,
the set of connections of $\mathcal{M}$ is a subset of those of $\mathcal{R}_k$.
2. If the degree of $\mathcal{M}$ is $d$, then the degree of $\mathcal{R}_k$ is at most $(k+1)d$. This is
because each connection to a bus $b$ of $\mathcal{M}$ is copied over to $k$ other buses of
$\mathcal{R}_k$. Thus a processor connected to $d' \le d$ buses of $\mathcal{M}$ is connected to at most
$(k+1)d' \le (k+1)d$ buses of $\mathcal{R}_k$.
By arranging buses to maximize connection overlaps (shown as "•" in Figure 5.1),
it may be possible to reduce the degree to $kd$. Notice that the degree
of the MBN of Figure 5.1 is 7, rather than $(3+1)2 = 8$.
3. If the loading of $\mathcal{M}$ is $\ell$, then the loading of $\mathcal{R}_k$ is at most $(k+1)\ell$. This is
because each bus $b$ of $\mathcal{R}_k$ is connected¹ to
$$|Proc[b : \mathcal{R}_k]| \le \sum_{b' \in R(b)} |Proc[b' : \mathcal{M}]| \le (k+1)\ell$$
processors.
Since not all buses have $\ell$ connections to start with, the loading of $\mathcal{R}_k$ is usually
much smaller than $(k+1)\ell$.
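The degree and loading bounds in observations 2 and 3 can be checked on a small instance. The sketch below is illustrative only: the list-of-sets representation, the helper names `replicate`, `loading` and `degree`, and the 6-processor, 4-bus MBN are all assumptions introduced here, not part of the original construction.

```python
def replicate(proc, k):
    """Bus b of the result carries proc[(b - i) mod M] for 0 <= i <= k."""
    M = len(proc)
    return [
        set().union(*(proc[(b - i) % M] for i in range(k + 1)))
        for b in range(M)
    ]


def loading(proc):
    """Largest number of processors connected to any one bus."""
    return max(len(bus) for bus in proc)


def degree(proc, n_procs):
    """Largest number of buses connected to any one processor."""
    return max(sum(p in bus for bus in proc) for p in range(n_procs))


# Hypothetical MBN with degree 2 and loading 2.
original = [{0, 1}, {2, 3}, {4, 5}, {0, 3}]
k = 1
augmented = replicate(original, k)

d, ell = degree(original, 6), loading(original)
d_new, ell_new = degree(augmented, 6), loading(augmented)
```

In this particular instance both bounds happen to be met with equality; in larger MBNs with unevenly loaded buses the loading of the replicated network would typically fall below the bound.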
5.2.3 The Designated Set
Often some buses of an MBN, $\mathcal{M}$, are more critical than others. This may be due to
connectivity and/or usage in a particular set of computations. Failure of these "critical
buses" impacts the network performance more severely. By the same measure,
failure of "non-critical buses" does not degrade the performance to the same extent.
In this section we first prove that there is no loss of generality in assuming that a
fixed set of $k$ buses is faulty (regardless of which $k$ buses of $\mathcal{R}_k$ are actually faulty).
That is, $\mathcal{R}_k$ can treat the failure of an arbitrary set of $k$ buses as the failure of a fixed
set of $k$ designated non-critical buses. Next we show how this fixed faulty set can be
emulated by the fault-free buses of $\mathcal{R}_k$. This has the benefit of allowing the network
designer to designate a suitable set of less important buses that can be treated as
faulty.

¹ We denote the cardinality of a set $S$ by $|S|$.

Figure 5.2: Graph $G_{3,8}$.
For any $1 \le k < M$, define a directed graph $G_{k,M}$ with nodes $\{0, 1, 2, \cdots, M-1\}$
as follows. There is a directed edge $(b, b')$ from node $b$ to node $b'$ iff $b' = (b + i) \bmod M$,
for some $1 \le i \le k$. Figure 5.2 shows $G_{3,8}$. In the context of the $N \times M$ MBN, $\mathcal{R}_k$,
each node of $G_{k,M}$ represents a bus of $\mathcal{R}_k$. Node $b$ has a directed edge to each node
that can replace it; that is, $(b, b')$ is an edge iff $b \in R(b')$.
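As a small illustration of this definition, the graph can be built as an adjacency list. This is a sketch under the assumption that buses are integers `0..M-1`; the function names are hypothetical.

```python
def successors(b, k, M):
    """Nodes b' with an edge (b, b') in G_{k,M}: b' = (b + i) mod M, 1 <= i <= k."""
    return [(b + i) % M for i in range(1, k + 1)]


def replacement_set(b, k, M):
    """R(b) = {(b - i) mod M : 0 <= i <= k}."""
    return {(b - i) % M for i in range(k + 1)}


# Adjacency list for k = 3 and M = 8.
k, M = 3, 8
graph = {b: successors(b, k, M) for b in range(M)}
```

For instance, `graph[6]` lists buses 7, 0 and 1, reflecting the wrap-around of the modulus, and every edge `(b, b')` produced this way satisfies `b in replacement_set(b', k, M)`.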
Let $G = (V, E)$ be any directed graph and let $U, W \subseteq V$ with $|U| \le |W|$. An
injective² function $\rho : U \to W$ is called a node disjoint correspondence iff
1. For each $u \in U$, there is a directed path in $G$ from $u$ to $\rho(u)$, and
2. For distinct $u_1, u_2 \in U$, the paths from $u_1$ to $\rho(u_1)$ and $u_2$ to $\rho(u_2)$ are node
disjoint (that is, the paths have no nodes in common).
By establishing that $G_{k,M}$ has a node disjoint correspondence from the set of faulty
buses to the designated set of less important buses, we will show that the faulty buses
can be treated as less important.

² A function $\rho : U \to W$ is injective iff $u_1 \ne u_2$ implies that $\rho(u_1) \ne \rho(u_2)$; that is, distinct
elements of $U$ are mapped to distinct elements of $W$.
In the following we assume that $\frac{M}{k}$ is an integer. This assumption greatly simplifies
the discussion while the extension to arbitrary values of $M$ (relative to $k$) is quite
straightforward.
Divide the vertex set, $\{0, 1, \cdots, M-1\}$, of $G_{k,M}$ into $\frac{M}{k}$ segments $S_0, S_1, \cdots, S_{\frac{M}{k}-1}$,
each consisting of $k$ contiguous buses. For $0 \le i < \frac{M}{k}$, let $S_i = \{ik + j : 0 \le j < k\}$;
that is, $S_0 = \{0, 1, 2, \cdots, k-1\}$, $S_1 = \{k, k+1, \cdots, 2k-1\}$, and so on.
Lemma 5.1 For any $0 \le i < \frac{M}{k} - 1$, let $U \subseteq S_i$ and $W \subseteq S_{i+1}$ so that $|U| \le
|S_{i+1} - W|$. Graph $G_{k,M}$ admits a node-disjoint correspondence $\rho : U \to (S_{i+1} - W)$.
Proof: For each $u \in U$ we will construct a node disjoint path in reverse. That is,
starting from $S_{i+1} - W$, we will trace the path back to $U$. Let $a_j = ik + j$ and
$b_j = (i+1)k + j$ (where $0 \le j < k$) so that $S_i = \{a_j : 0 \le j < k\}$ and $S_{i+1} = \{b_j : 0 \le
j < k\}$. (Observe that $a_j$ has edges to elements $a_{j+1}, a_{j+2}, \cdots, a_{k-1}, b_0, b_1, \cdots, b_j$.) From
each $b_j \in S_{i+1} - W$, trace edge $(a_j, b_j)$ back to $a_j$. If $a_j \in U$, then let $\rho(a_j) = b_j$, and
edge $(a_j, b_j)$ is the required node disjoint path from $a_j$ to $S_{i+1} - W$. We now consider
the remaining elements of $U$ and $S_{i+1} - W$ (that is, those not mapped as discussed
above). Let sets $U'$ and $U''$ be as follows:
$U' = \{a_j \in U : b_j \notin S_{i+1} - W\}$
= elements of $U$ with no path established to $S_{i+1} - W$ as yet;
$U'' = \{a_j \notin U : b_j \in S_{i+1} - W\}$
= elements of $S_i$ that are not in $U$, but have been arrived at
from $S_{i+1} - W$.
Since $|U| \le |S_{i+1} - W|$, it is easy to see that $|U'| \le |U''|$. Also note that $U'$ and $U''$
are disjoint. Let $U' = \{a'_0, a'_1, \cdots, a'_x\}$ and $U'' = \{a''_0, a''_1, \cdots, a''_y\}$, where $y \ge x \ge 1$.
Assume that element $a''_j \in U''$ was arrived at from element $b''_j \in S_{i+1} - W$. We will
now attempt to establish a node disjoint path from $a'_j$ to $a''_j$. If this can be done,
then we have established a node disjoint path from $a'_j$ to $b''_j$ (via $a''_j$). We consider two
cases.

Figure 5.3: An illustration of the proof of Lemma 5.1. (a) Case $a'_j < a''_j$. (b) Case $a'_j > a''_j$.

Case 1: ($a'_j < a''_j$) Since $a'_j, a''_j \in S_i$, there is an edge $(a'_j, a''_j)$ in $G_{k,M}$. (This is
because $G_{k,M}$ has an edge from an element of $S_i$ to all larger elements of $S_i$.)
Let $\rho(a'_j) = b''_j$ with the path being $(a'_j, a''_j, b''_j)$ (see Figure 5.3(a)).
Case 2: ($a'_j > a''_j$) Since $a''_j < a'_j < b''_j$ and $G_{k,M}$ has edge $(a''_j, b''_j)$, the edge $(a'_j, b''_j)$
also exists. Therefore let $\rho(a'_j) = b''_j$ with the path being edge $(a'_j, b''_j)$ (see
Figure 5.3(b)).
Since $U'$ and $U''$ are disjoint, the case $a'_j = a''_j$ is not possible. It is clear that
$\rho : U \to (S_{i+1} - W)$ is an injection, and that for each $u \in U$, there is a path from $u$
to $\rho(u)$. To see that these paths are node disjoint, observe that paths consisting of a
single edge pose no problem. The remaining 2-edge paths (due to Case 1) have the
form $(a'_j, a''_j, b''_j)$. The only danger is of $a''_j$ being in the path of an element of $U$ that
is different from $a'_j$. Since $a''_j \notin U$ and $a''_j$ is unique for a given $a'_j$, this is not possible
and hence, $\rho$ is a node-disjoint correspondence. ■
Lemma 5.2 Let $F \subseteq \{b : 0 \le b < M\}$ be the set of (at most $k$) faulty buses of $\mathcal{R}_k$.
Graph $G_{k,M}$ admits a node-disjoint correspondence $\rho : F \to S_{\frac{M}{k}-1}$.
Proof: For $0 \le i < \frac{M}{k}$, let $F_i$ be the set of faulty buses in $S_i$ and let $X_i = F_0 \cup F_1 \cup
\cdots \cup F_i$. We will now prove by induction on $i$ (where $0 < i < \frac{M}{k}$) that $G_{k,M}$ admits
a node-disjoint correspondence $\rho_i : X_{i-1} \to (S_i - F_i)$. Clearly this will prove the
lemma.
For $i = 1$, let $U = F_0 = X_0$ and $W = F_1$. Since $|F_0| + |F_1| \le |F| \le k = |S_1|$, we
have $|U| = |F_0| \le |S_1 - F_1| = |S_1 - W|$, so Lemma 5.1 guarantees a node-disjoint
correspondence $\rho_1 : X_0 \to (S_1 - F_1)$.
Let $\rho_i : X_{i-1} \to (S_i - F_i)$ be a node-disjoint correspondence. Let $Y_{i-1} \subseteq S_i - F_i$
be the set to which elements of $X_{i-1}$ have been mapped by $\rho_i$. Since $\rho_i$ is an injection,
$|X_{i-1}| = |Y_{i-1}|$. Notice that $|Y_{i-1}| + |F_i| + |F_{i+1}| = |X_{i-1}| + |F_i| + |F_{i+1}| = |F_0| + |F_1|
+ \cdots + |F_{i-1}| + |F_i| + |F_{i+1}| \le |F| \le k = |S_{i+1}|$.
Therefore $|F_i \cup Y_{i-1}| \le |S_{i+1} - F_{i+1}|$. With $U = F_i \cup Y_{i-1}$ and $W = F_{i+1}$, Lemma 5.1
gives us a node-disjoint correspondence $\rho : (F_i \cup Y_{i-1}) \to (S_{i+1} - F_{i+1})$. Now define
$\rho_{i+1} : X_i \to (S_{i+1} - F_{i+1})$ as follows.
$$\rho_{i+1}(b) = \begin{cases} \rho(\rho_i(b)), & \text{if } b \in X_{i-1} \\ \rho(b), & \text{if } b \in F_i \end{cases}$$
Since $\rho_i$ and $\rho$ are node-disjoint correspondences and since $X_{i-1}$ and $Y_{i-1}$ are
disjoint, $\rho_{i+1}$ is a node-disjoint correspondence. ■
We say that bus $b$ replaces bus $b'$ to mean that bus $b$ assumes the identity of bus
$b'$, in the process losing its own.
Theorem 5.3 MBN $\mathcal{R}_k$ can treat the failure of any set, $F$, of $k$ buses as the failure
of a designated set, $D$, of $k$ buses of $\mathcal{M}$.
Proof: Without loss of generality, let $D = S_{\frac{M}{k}-1} = \{M-k, M-k+1, \cdots, M-1\}$.
By Lemma 5.2, there is a node-disjoint correspondence $\rho : F \to D$. For any faulty
bus $b \in F$, let the path from $b$ to $\rho(b)$ be $(b = b_0, b_1, \cdots, b_x = \rho(b))$. Since $(b_i, b_{i+1})$
(where $0 \le i < x$) is an edge of $G_{k,M}$, bus $b_{i+1}$ of $\mathcal{R}_k$ can replace bus $b_i$ of $\mathcal{M}$. In
other words, the faulty bus $b = b_0$ can indirectly be replaced by bus $b_x \in D$, via buses
$b_{x-1}, b_{x-2}, \cdots, b_1$. The node disjoint correspondence guarantees that no bus is called
upon to replace more than one other bus. The only buses that replace other buses,
but are themselves not replaced, are those of $D$.
Thus the buses can assume new identities so that, regardless of the set, $F$, of
faulty buses, the MBN can treat the designated set, $D$, as faulty. We now show an
example.
Let the set of buses be $\{0, 1, \cdots, 7\}$ and let $k = 3$ with $D = \{5, 6, 7\}$ and
$F = \{1, 2, 4\}$. Then the node disjoint correspondence is given by the directed paths
$(1, 3, 6)$, $(2, 5)$, and $(4, 7)$. These paths are shown in bold in Figure 5.4. The new
identities assumed by buses are as follows:
Original bus   0  1  2  3  4  5  6  7
Replaced by    0  3  5  6  7  -  -  -
Notice that buses of the designated set $D = \{5, 6, 7\}$ are not replaced, and are
therefore considered faulty.
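The example can be verified mechanically. The sketch below checks that the three paths follow edges of $G_{3,8}$, start at the faulty buses, end in $D$ and share no nodes, and then derives the table of new identities. The function names are hypothetical, and `None` stands for the "-" entries of the table.

```python
def is_edge(b, b_next, k, M):
    """(b, b_next) is an edge of G_{k,M} iff b_next = (b + i) mod M, 1 <= i <= k."""
    return 1 <= (b_next - b) % M <= k


def is_node_disjoint_correspondence(paths, F, D, k, M):
    """Check the paths against the definition of a node disjoint correspondence."""
    edges_ok = all(
        is_edge(p[i], p[i + 1], k, M) for p in paths for i in range(len(p) - 1)
    )
    starts = {p[0] for p in paths}
    ends = [p[-1] for p in paths]
    nodes = [v for p in paths for v in p]
    return (
        edges_ok
        and starts == set(F)
        and set(ends) <= set(D)
        and len(nodes) == len(set(nodes))  # no node appears on two paths
    )


def new_identities(paths, M):
    """Map each original bus to the bus that replaces it (None if not replaced)."""
    replaced_by = {b: b for b in range(M)}
    for p in paths:
        for cur, nxt in zip(p, p[1:]):
            replaced_by[cur] = nxt
        replaced_by[p[-1]] = None  # end of a path: a designated bus, treated as faulty
    return replaced_by


paths = [(1, 3, 6), (2, 5), (4, 7)]
ok = is_node_disjoint_correspondence(paths, F={1, 2, 4}, D={5, 6, 7}, k=3, M=8)
table = new_identities(paths, M=8)
```

Reading `table` in bus order reproduces the "Replaced by" row above: 0, 3, 5, 6, 7 followed by three unreplaced designated buses.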
Figure 5.4: Node disjoint correspondences for an example
We now describe how the MBN copes with the loss of buses from the designated
set. We say that fault-free bus $b$ emulates faulty bus $b'$ to mean that $b$ assumes
the work of $b'$ in addition to its own. (This is different from the notion of bus $b$
"replacing" bus $b'$, where $b$ loses its identity to $b'$.) Each faulty bus $b \in D$ (the
designated set) is emulated by a fault-free bus $b' \in S_0$. Since there is an edge in $G_{k,M}$
from bus $M - k + b \in S_{\frac{M}{k}-1}$ to bus $b \in S_0$, for $0 \le b < k$, such an emulation is always
possible. We will refer to the set $S_0$ as the emulating set.
5.2.4 Fault Tolerance Properties of $\mathcal{R}_k$
Earlier in Section 5.2.2 we established that if the degree and loading of $\mathcal{M}$ are $d$ and $\ell$,
respectively, then the degree and loading of $\mathcal{R}_k$ are at most $(k+1)d$ and $(k+1)\ell$, respectively.
We now derive the time needed for $\mathcal{R}_k$ to emulate a computation on $\mathcal{M}$.
The time required for $\mathcal{R}_k$ to run a computation of $\mathcal{M}$ depends on the choice of
the designated set $D$ and the emulating set $S_0$ (in addition to the computation to
be performed). For example, if the buses of $S_0$ are never used concurrently with
those of $D$, then $\mathcal{R}_k$ emulates $\mathcal{M}$ without any loss of speed, even when $k$ buses fail.
At the other extreme, if buses of $S_0$ and $D$ are used simultaneously for $t$ steps in a
computation, then these $t$ steps now run in $2t$ steps (each bus of $S_0$ does the work
of two buses). In general, if a computation on $\mathcal{M}$ has $t$ steps in which a bus of $D$
and the bus of $S_0$ that emulates it are both used, then $\mathcal{R}_k$ requires $t$ extra steps to emulate $\mathcal{M}$. In
particular, if a $T$-step computation on $\mathcal{M}$ uses the buses of $D$ for at most $t$ steps,
then $\mathcal{R}_k$ performs this computation in at most $T + t$ steps. This view permits the
performance to be bounded by the set $D$ alone. If $D$ is the set of least used buses,
then we have the following result.
Theorem 5.4 For any $N \times M$ MBN, $\mathcal{M}$, and an integer $0 \le k \le M - 1$, the $N \times M$
MBN, $\mathcal{R}_k$, has the following properties:
(i) If no bus is faulty, then $\mathcal{R}_k$ can emulate $\mathcal{M}$ without overhead.
(ii) A $T$-step computation on $\mathcal{M}$ that uses a set of $k$ buses for at most $t \le T$
steps can be run on $\mathcal{R}_k$ in $T + t$ steps, even if any set of (at most) $k$ buses of
$\mathcal{R}_k$ fail.
(iii) If the degree of $\mathcal{M}$ is $d$, then the degree of $\mathcal{R}_k$ is at most $(k+1)d$.
(iv) If the loading of $\mathcal{M}$ is $\ell$, then the loading of $\mathcal{R}_k$ is at most $(k+1)\ell$.
■
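Property (ii) can be illustrated with a small step-counting sketch. It assumes that buses have already been renamed so that the designated set $D = \{M-k, \cdots, M-1\}$ is the faulty one, that bus $M-k+b$ is emulated by bus $b$ of $S_0$, and that a step costs one extra step exactly when some designated bus and its emulator are both in use. The schedule format (one set of active buses per step) and the function names are hypothetical.

```python
def emulated_steps(schedule, k, M):
    """Steps needed when each designated bus M-k+b is emulated by bus b of S_0.

    schedule is a list with one set of active buses per step of the
    original computation.
    """
    extra = 0
    for used in schedule:
        # A step must be split in two if an emulating bus has to do double duty.
        if any((M - k + b) in used and b in used for b in range(k)):
            extra += 1
    return len(schedule) + extra


def steps_using_designated(schedule, k, M):
    """Number of steps t in which any bus of D is used at all."""
    designated = set(range(M - k, M))
    return sum(1 for used in schedule if used & designated)


# Hypothetical 4-step schedule on M = 8 buses with k = 3.
schedule = [{0, 5}, {1, 2}, {5, 6}, {2, 7}]
T = len(schedule)
t = steps_using_designated(schedule, k=3, M=8)
total = emulated_steps(schedule, k=3, M=8)
```

Here only the first and last steps use a designated bus together with its emulator, so the count is 6 steps, within the bound $T + t = 7$.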
5.2.5 Processor Faults
Since the fault model assumes an off-line fault processing scheme, the ideas developed
so far for bus faults apply to processor faults as well. All that this requires is
transposing the $N \times M$ MBN matrix into an $M \times N$ matrix; this interchanges the roles of
processors and buses. Therefore, Theorem 5.4 can be restated as follows.
Theorem 5.5 For any $N \times M$ MBN, $\mathcal{M}$, and an integer $0 \le k \le \frac{N}{2}$, the fault-tolerant
$N \times M$ MBN, $\mathcal{R}_k$, has the following properties:
(i) If no processor is faulty, then $\mathcal{R}_k$ can emulate $\mathcal{M}$ without overhead.
(ii) A $T$-step computation on $\mathcal{M}$ that uses a set of $k$ processors for at most $t \le T$
steps can be run on $\mathcal{R}_k$ in $T + t$ steps, even if any set of (at most) $k$ processors
of $\mathcal{R}_k$ fail.
(iii) If the degree of $\mathcal{M}$ is $d$, then the degree of $\mathcal{R}_k$ is at most $(k+1)d$.
(iv) If the loading of $\mathcal{M}$ is $\ell$, then the loading of $\mathcal{R}_k$ is at most $(k+1)\ell$.
■
The MBN, $\mathcal{M}$, could already be one that is resilient to bus faults. In that case,
$\mathcal{R}_k$ is resilient to both processor and bus faults. We combine Theorems 5.4 and
5.5 and the fact that loading and degree cannot exceed $N$ and $M$ respectively, to
obtain the main result of this section.
Theorem 5.6 Given any $N \times M$ MBN $\mathcal{M}$, and integers $1 \le q \le \frac{N}{2}$ and $1 \le k \le
\frac{M}{2}$, the $N \times M$ MBN $\mathcal{R}_k^q$ has the following properties.
(i) If no processor or bus is faulty, then $\mathcal{R}_k^q$ can emulate $\mathcal{M}$ without overhead.
(ii) A $T$-step computation on $\mathcal{M}$ that uses a set of $k$ buses for $t_b \le T$ steps and
$q$ processors for $t_p \le T$ steps can be run on $\mathcal{R}_k^q$ in $T + t_b + t_p$ steps even if any
set of (at most) $k$ buses and (at most) $q$ processors of $\mathcal{R}_k^q$ fail.
(iii) If the degree of $\mathcal{M}$ is $d$, then the degree of $\mathcal{R}_k^q$ is at most $\min(M, (k+1)(q+1)d)$.
(iv) If the loading of $\mathcal{M}$ is $\ell$, then the loading of $\mathcal{R}_k^q$ is at most $\min(N, (q+1)(k+1)\ell)$.
■
5.2.6 Fault Tolerant Binary-Tree MBNs
In this section, we use replication to derive results specific to binary-tree MBNs. We
first derive bounds on processor and bus usages in binary-tree MBNs and then use
these bounds with Theorems 5.4 and 5.5 to derive results specific to fault-tolerant
binary-tree MBNs. Recall the following facts (Chapter 2, page 11) regarding binary-tree
algorithms.
1. With the root of $\mathcal{T}(n)$ at level $n$ and the leaves at level 0, for any $0 \le i \le n$,
there are $2^{n-i}$ nodes at level $i$.
2. Call a communication (non-trivial edge of $\mathcal{T}(n)$) that brings partial results to a
level-$\ell$ node a level-$\ell$ communication. For $0 < \ell \le n$, there are at least $2^{n-\ell}$,
and at most $2^{n-\ell+1}$, level-$\ell$ communications; this is because each internal node
has at least 1 and at most 2 non-trivial edges from its children.
Consider the problem of running $Bin(n)$ on a $2^n \times 2^m$ MBN, where $0 \le m \le n$.
For $1 \le \ell \le n - m$, there are at least $2^{n-\ell} \ge 2^m$ level-$\ell$ communications. For these
levels, the number of communications exceeds the number of available buses, so it is
reasonable to assume that the MBN minimizes the number of communications (and
hence the running time). Therefore for $1 \le \ell \le n - m$, there are exactly $2^{n-\ell}$ level-$\ell$
communications, which are performed on the $2^m$ buses in $2^{n-\ell-m}$ steps. The total
number of steps for levels $1, 2, \cdots, n - m$ is
$$\sum_{\ell=1}^{n-m} 2^{n-\ell-m} = 2^{n-m} - 1.$$
Consider the next step, which executes nodes at level $n - m + 1$. This level has at
most $2^m$ communications and potentially uses all the buses. Level $n - m + 2$ has at
most $2^{m-1}$ communications and so uses at most $2^{m-1}$ buses. Similarly, at most $2^{m-2}$
buses are used at level $n - m + 3$. As a result, at least $2^m - 2^{m-2}$ buses are not used
at level $n - m + 3$. In the same way, at most $2^{m-3}$ buses are used at level $n - m + 4$
and at least $2^m - (2^{m-2} + 2^{m-3}) = 2^{m-1} + 2^{m-3}$ buses are not used at levels $n - m + 3$
and $n - m + 4$. Similarly, at most $2^{m-4}$ buses are used at level $n - m + 5$ and at least
$2^m - (2^{m-2} + 2^{m-3} + 2^{m-4}) = 2^{m-1} + 2^{m-4}$ buses are not used at levels $n - m + 3$,
$n - m + 4$ and $n - m + 5$. In general, we have the following lemma, whose proof is
straightforward by induction on $\ell \ge n - m + 3$.
Lemma 5.7 For $n - m + 3 \le \ell \le n$, in any $2^n \times 2^m$ MBN running $Bin(n)$, at least
$2^{m-1} + 2^{n-\ell+1}$ buses are not used in communications at levels $n-m+3, n-m+4, \cdots, \ell$.
■
A direct consequence of this result is the following Theorem.
Theorem 5.8 For $0 < m \le n$, any $2^n \times 2^m$ MBN running $Bin(n)$ has at least $2^{m-1}$
buses, each of which is used for at most $2^{n-m} + 3$ steps.
Proof: Communications at levels $1, 2, \cdots, n - m$ require $2^{n-m} - 1$ steps, and use
all the buses. Since there are at most $2^m$ (resp., $2^{m-1}$) communications at levels
$n - m + 1$ (resp., $n - m + 2$), these levels can each take at most 2 steps (even if
both communications to a node use the same bus). Lemma 5.7 implies that at least
$2^{m-1} + 2 > 2^{m-1}$ buses are not used at levels $n - m + 3, n - m + 4, \cdots, n$. These $2^{m-1}$
buses are used for at most $2^{n-m} - 1 + 2 + 2 = 2^{n-m} + 3$ steps. ■
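The counts behind Lemma 5.7 and Theorem 5.8 can be checked numerically. The sketch below assumes, as in the discussion above, that level $i \ge n - m + 3$ uses at most $2^{n-i+1}$ buses and that these sets of buses may all be different; the function names are hypothetical.

```python
def unused_bus_bound(n, m, level):
    """Lower bound on buses idle throughout levels n-m+3 .. level."""
    used = sum(2 ** (n - i + 1) for i in range(n - m + 3, level + 1))
    return 2 ** m - used


def max_steps_for_lightly_used_bus(n, m):
    """Steps for levels 1..n-m, plus at most 2 each for levels n-m+1 and n-m+2."""
    return (2 ** (n - m) - 1) + 2 + 2


# Lemma 5.7 only applies when n - m + 3 <= n, i.e. when m >= 3.
cases = [
    (n, m, level)
    for n in range(3, 11)
    for m in range(3, n + 1)
    for level in range(n - m + 3, n + 1)
]
```

Taking `level = n` gives $2^{m-1} + 2$ idle buses, which is the figure used in the proof of Theorem 5.8.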
We now outline the derivation of similar results for processor usage. We assume
that a step is required for a processor to send/receive partial results and perform an
internal computation. Clearly there are $2^{n-\ell}$ active processors (nodes) at level $\ell$ of
$\mathcal{T}(n)$.
As explained earlier, assume that $Bin(n)$ is run on a $2^n$-processor MBN. Suppose
we use a $2^n \times 2^m$ MBN where $m \le n - 2$. Divide the input into $2^{m+1}$ groups, each with
$2^{n-m-1}$ inputs. Here it is reasonable to (sequentially and optimally) reduce a group
of $2^{n-m-1}$ inputs to one partial result; the $2^{n-m-1}$ processors of a group are connected to
a bus and take turns to send their input to a fixed processor (leader) of the group.
All $2^{n-m-1}$ processors of the group, except the leader, work for only one of the
$2^{n-m-1} - 1$ steps needed to reduce the group. Since there are $2^{m+1}$ groups, there are
at least $(2^{n-m-1} - 1) 2^{m+1} = 2^n - 2^{m+1} \ge 2^n - 2^{n-1} = 2^{n-1}$ processors that are used
for only one step.
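A direct simulation of the grouping makes the processor count concrete. This sketch assumes that the first processor of each group acts as its leader and that group members send in index order; both choices, and the function name, are illustrative assumptions.

```python
from collections import Counter


def processor_usage(n, m):
    """Count the steps in which each processor takes part in the group reductions."""
    group_size = 2 ** (n - m - 1)
    groups = 2 ** (m + 1)
    usage = Counter()
    for g in range(groups):
        leader = g * group_size
        for offset in range(1, group_size):
            usage[leader + offset] += 1  # the sender works in exactly one step
            usage[leader] += 1           # the leader receives in every step
    return usage


usage = processor_usage(n=5, m=2)
used_once = sum(1 for count in usage.values() if count == 1)
```

With $n = 5$ and $m = 2$ there are 8 groups of 4 processors, so 24 processors are used exactly once, which is at least $2^{n-1} = 16$.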
When $m \ge n - 1$, there are enough buses to accommodate all the communications
at each level. At level $\ell$, where $1 \le \ell \le n$, there are $2^{n-\ell}$ "active" processors.
Therefore the number of processors not used in any of the levels $2, 3, \cdots, \ell$ is at least
$$2^n - \sum_{i=2}^{\ell} 2^{n-i} = 2^{n-1} + 2^{n-\ell}.$$
Each of levels 0 and 1 can use a processor only once. This is because at most half
the processors are used in level 1.
Thus, we have the following result.
Lemma 5.9 For $0 < m \le n$, any $2^n \times 2^m$ MBN running $Bin(n)$ has at least $2^{n-1}$
processors, each of which is used for at most 2 steps. ■
Then, by Theorem 5.6 we have the following result.
Theorem 5.10 For $0 < m \le n$, and any given $2^n \times 2^m$ MBN, $\mathcal{M}$, there is a fault-tolerant
$2^n \times 2^m$ MBN, $\mathcal{M}'$, that runs $Bin(n)$ in at most $2^{n-m} + 5$ additional steps,
with $2^{n-1}$ faulty processors and $2^{m-1}$ faulty buses. ■
Remarks: There are at least $2^{n-\ell}$ communications at level $\ell$ of $\mathcal{T}(n)$. Therefore, until level
$n - m$, the number of communications per level exceeds the available buses. Thus,
the $2^{n-m}$ additional steps cannot be avoided. The remaining 5 additional steps are
only upper bounds. For existing networks [23, 24], the corresponding number is only
2. In particular, when $m = n - 1$, these MBNs require only 1 extra step to tolerate
the failure of half the buses (or processors).
5.3 Recursive Scheduling
In this section we present a second method for converting any given binary-tree MBN
into one that is resilient to bus faults. Replication (Section 5.2) works for any MBN,
not just binary-tree MBNs. Consequently, it does not exploit features particular to
binary-tree algorithms. For example, consider an MBN that, with no faulty buses,
executes each level of $\mathcal{T}(n)$ in one step. If this MBN now has one or a few faulty buses,
then each level with even one faulty bus now requires two steps under replication. In
other words, replication fails to exploit the possibility of executing nodes at higher
levels (closer to the root) before all lower level nodes have been executed. (Notice
that the only requirement in $\mathcal{T}(n)$ is for a node to be executed after all its descendants.
It is not required to wait for lower-level non-descendant nodes.) Recursive
scheduling exploits the features of binary-tree algorithms to construct fault-tolerant
MBNs that run faster than their replication counterparts. However, the loading of
the fault-tolerant MBN is somewhat higher, and the method itself is less general,
being applicable only to bus faults in binary-tree MBNs.
For $1 \le m \le n$, given any $2^n \times 2^m$ binary-tree MBN $\mathcal{M}$ and integer $k$ (where
$1 \le k = 2^s < 2^m$), recursive scheduling produces a $2^n \times 2^m$ MBN $\mathcal{S}_k$ that is resilient to
the failure of an arbitrary set of at most $k$ buses. The restriction that $k = 2^s$ admits
$k = 1$ and 2, the most probable fault situations. We now outline the major steps in
the construction of MBN $\mathcal{S}_k$.
1. Use the given binary-tree MBN $\mathcal{M}$ to construct a $2^n \times (2^m - k)$ MBN $\mathcal{M}'_k$.
MBN $\mathcal{M}'_k$ has $k$ fewer buses than $\mathcal{M}$ and is not tolerant to bus failures.
2. Add $k$ buses to $\mathcal{M}'_k$ to convert it into a $2^n \times 2^m$ binary-tree MBN $\mathcal{M}''_k$; the $k$
added buses have no connections at this point.
3. Use replication (Section 5.2) to transform $\mathcal{M}''_k$ into a $2^n \times 2^m$ binary-tree MBN
$\mathcal{M}'''_k$ that is resilient to $k$ arbitrary bus faults.
4. Finally, superimpose $\mathcal{M}$ on $\mathcal{M}'''_k$ to obtain $\mathcal{S}_k$. This last step ensures that $\mathcal{S}_k$
behaves exactly as $\mathcal{M}$ in the absence of bus faults.
All of the above steps, except the construction of $\mathcal{M}'_k$ (Step 1), are straightforward;
most of the remainder of this section is devoted to the construction of $\mathcal{M}'_k$. We first
consider the case where $k = 1$, as the construction of $\mathcal{M}'_k$ can be expressed in terms
of $\mathcal{M}'_1$.
5.3.1 An MBN with $2^m - 1$ Buses
In this section we consider the case where $k = 1$ and construct a $2^n \times (2^m - 1)$
binary-tree MBN. This MBN is used to define $\mathcal{M}'_k$ for $k = 2^s > 1$ (Section 5.3.2).
For $x > 0$, let $\mathcal{M}_x$ be the $2^x \times 2^{x-1}$ instance of the given MBN $\mathcal{M}$. We will use
instances $\mathcal{M}_{m-1}$ and $\mathcal{M}_m$ (among others) in the construction of $\mathcal{M}'_1$.
Recall that a binary-tree MBN can be defined by the manner in which it "schedules"
the tree $\mathcal{T}(n)$; i.e., by the labeling of the nodes and non-trivial edges by buses
(see Section 2.3, page 14). Here we will define how $\mathcal{T}(n)$ is scheduled by $\mathcal{M}'_1$.
Decompose $\mathcal{T}(n)$ into three regions as shown in Figure 5.5. The $2^n \times (2^m - 1)$ MBN $\mathcal{M}'_1$
schedules Regions 1, 2 and 3 in succession (in that order). For each region it uses all
$2^m - 1$ buses available to it. This approach is different from that of replication, which
would have caused each level of the entire tree $\mathcal{T}(n)$ to be executed in succession
using at most $2^m - 1$ buses. The $2^n$ leaves of $\mathcal{T}(n)$ are labeled with the $2^n$ processors
of $\mathcal{M}'_1$. In executing $\mathcal{T}(n)$, an internal node $u$ at level $\ell$ is labeled only with one of
the $2^{\ell}$ leaves of the subtree rooted at it. Thus, Regions 1 and 2 use disjoint sets of
processors of $\mathcal{M}'_1$. The roots of the trees at Regions 1 and 2 are leaves of the tree at
Region 3. Region 3 uses only processors at these leaves (level $n - m$ nodes of $\mathcal{T}(n)$).
We now describe the three regions in detail and the method used to schedule them
on $\mathcal{M}'_1$.

Figure 5.5: Regions of $\mathcal{T}(n)$. Region 3 is one $\mathcal{T}(m)$ above level $n - m$; Region 1 consists of
$2^m - 1$ copies of $\mathcal{T}(n - m)$ and Region 2 of one $\mathcal{T}(n - m)$, both between levels $n - m$ and 0.
Region 1: This region lies between levels $n - m$ and 0 of the tree $\mathcal{T}(n)$. It consists of $2^m - 1$ trees, each a $\mathcal{T}(n - m)$ rooted at a level $n - m$ node of $\mathcal{T}(n)$. (Of the $2^m$ such subtrees of $\mathcal{T}(n)$, any $2^m - 1$ may be selected for Region 1.) Each $\mathcal{T}(n - m)$ is scheduled with a single bus. That is, all $2^{n-m}$ processors at the leaves of the $\mathcal{T}(n - m)$ are connected to a single bus, and their values are sequentially reduced to one leader processor assigned to the root of the $\mathcal{T}(n - m)$. Clearly this requires $2^{n-m} - 1$ steps. Since there are $2^m - 1$ buses available, all the $\mathcal{T}(n - m)$s of Region 1 can be scheduled in this way simultaneously. (Notice that this is a very efficient use of the buses, as each of the $2^m - 1$ buses is utilized in all $2^{n-m} - 1$ steps.) Also observe that in running Region 1, each processor is connected to only one bus, and each bus is connected to $2^{n-m}$ processors.

Figure 5.6: An example showing 4 levels of recursive decomposition of $\mathcal{T}(n)$ with $m = 2$ and $k = 1$
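The Region 1 schedule can be made concrete with a small sketch. The processor numbering (subtree $j$ owning a contiguous block of $2^{n-m}$ processors) and the function names are assumptions made only for illustration; the text allows any $2^m - 1$ of the subtrees to be chosen.

```python
def region1_connections(n, m):
    """Give each of the 2**m - 1 Region 1 subtrees its own bus.

    Assumed numbering: processors are 0 .. 2**n - 1 and subtree j owns the
    contiguous block of 2**(n - m) processors starting at j * 2**(n - m).
    The last subtree is left out; it belongs to Region 2.
    """
    size = 2 ** (n - m)
    return {bus: list(range(bus * size, (bus + 1) * size))
            for bus in range(2 ** m - 1)}


def region1_steps(n, m):
    # Reducing 2**(n - m) values one at a time over a single bus.
    return 2 ** (n - m) - 1


buses = region1_connections(4, 2)

# Count how many buses each processor is connected to (its degree so far).
degree = {}
for processors in buses.values():
    for p in processors:
        degree[p] = degree.get(p, 0) + 1
```

For $n = 4$ and $m = 2$ this yields three buses with four processors each, every connected processor having degree 1, which matches the observations at the end of the Region 1 description.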
Region 2: This region consists of a single $\mathcal{T}(n - m)$ rooted at a level $n - m$ node of tree $\mathcal{T}(n)$. (Of the $2^m$ such subtrees of $\mathcal{T}(n)$, Region 2 has one, while Region 1 has $2^m - 1$ such subtrees.) If $n - m \le m$, then Region 2 is scheduled on an instance $\mathcal{M}_{n-m}$ of the given MBN $\mathcal{M}$. Notice that $\mathcal{M}_{n-m}$ uses $2^{n-m-1} \le 2^{m-1} \le 2^m - 1$ buses as $m > 0$, so sufficient buses are available.
On the other hand, if $n - m > m$, then the tree $\mathcal{T}(n - m)$ of Region 2 is scheduled recursively on a $2^{n-m} \times (2^m - 1)$ MBN. That is, this $\mathcal{T}(n - m)$ is divided into three regions, each scheduled in sequence (see Figure 5.6).
Region 3: This region consists of levels $n$ to $n - m$ of the tree $\mathcal{T}(n)$. Consequently, it comprises a single $\mathcal{T}(m)$. Notice that level $n - m$ of $\mathcal{T}(n)$ is shared between Regions 1, 2, and Region 3. The leaves of the $\mathcal{T}(m)$ of Region 3 are the roots of the $\mathcal{T}(n - m)$s of Regions 1 and 2. By virtue of the fact that the trees of Regions 1 and 2 use disjoint sets of processors, the leaves of the $\mathcal{T}(m)$ of Region 3 are labeled with distinct processors.

Figure 5.7: Scheduling the lowest level of communications of Region 3. Processors shown in dark hold partial results and move to higher levels of Region 3.

Figure 5.8: An example of an $8 \times 4$ MBN
The first step of Region 3 schedules the lowest level of communications of the $\mathcal{T}(m)$. This involves reducing $2^m$ inputs to $2^{m-1}$ partial results and can be done with $2^{m-1} \le 2^m - 1$ buses as shown in Figure 5.7.
If $m = 1$, then this completes the execution of Region 3. Otherwise the $2^{m-1}$ processors holding partial results, along with $2^{m-2} \le 2^m - 1$ buses of $\mathcal{M}'_1$, are used to schedule the remainder of Region 3, as a $2^{m-1} \times 2^{m-2}$ instance, $\mathcal{M}_{m-1}$, of the given MBN $\mathcal{M}$.
We now illustrate these ideas using an example, where $n = 4$, $m = 2$ and $k = 1$. Number the $2^4 = 16$ processors $0, 1, \ldots, 15$ and call the $2^m - k = 3$ available buses $\alpha, \beta, \gamma$. Let the given MBN use a direct mapping (see Figure 5.8). For replication, each of $2^2 = 4$ buses are used for the first 3 levels. That is, replication uses at least 3 extra steps, for a total running time of 8 steps. The optimal running time for this MBN is only 5 steps. Figure 5.9 illustrates recursive scheduling. In this example, Regions 1, 2 and 3 run in 3, 2 and 2 steps respectively, for a total of 7 steps. In contrast, replication requires 8 steps. This difference will be magnified for large problems.

Figure 5.9: Connections of processors and buses with one bus fault
Running Time: Let $T_1(n, m)$ denote the time to run $\mathit{Bin}(n)$ on the $2^n \times (2^m - 1)$ MBN $\mathcal{M}'_1$. For $x > 0$, let $t_x$ denote the time to run $\mathit{Bin}(x)$ on $\mathcal{M}_x$, a $2^x \times 2^{x-1}$ instance of the given MBN $\mathcal{M}$; let $t_0 = 0$.
Clearly, $T_1(n, m)$ is the sum of the times needed to run all three regions. Region 1 runs in $2^{n-m} - 1$ steps, and Region 3 in $t_{m-1} + 1$ steps. If $n - m \le m$, then Region 2 runs in $t_{n-m}$ steps. Otherwise Region 2 runs (recursively) in $T_1(n - m, m)$ steps. Thus we have the following recurrence.

$$T_1(n, m) = \begin{cases} 2^{n-m} + t_{m-1} + t_{n-m}, & \text{if } n - m \le m \\ 2^{n-m} + t_{m-1} + T_1(n - m, m), & \text{if } n - m > m \end{cases}$$

Let $i_1 = \lceil n/m \rceil - 1$. It can be verified that $n - i_1 m \le m < n - (i_1 - 1)m$. Therefore,

$$\begin{aligned}
T_1(n, m) &= 2^{n-m} + t_{m-1} + T_1(n - m, m) \\
&= 2^{n-m} + t_{m-1} + 2^{n-2m} + t_{m-1} + T_1(n - 2m, m) \\
&\;\;\vdots \\
&= \left(2^{n-m} + 2^{n-2m} + \cdots + 2^{n-i_1 m}\right) + i_1 t_{m-1} + t_{n-i_1 m} \\
&= 2^{n-m} \left(\frac{1 - 2^{-i_1 m}}{1 - 2^{-m}}\right) + i_1 t_{m-1} + t_{n-i_1 m}
\end{aligned}$$
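As a sanity check on the running-time derivation, the region-by-region recurrence can be compared numerically with the closed form. This is a minimal sketch, not part of the original construction: the function names are hypothetical, and the choice $t_x = x$ is an assumption (it corresponds to a given MBN that runs $\mathit{Bin}(x)$ optimally on its $2^x \times 2^{x-1}$ instance).

```python
def t(x):
    # Assumption: the given MBN is optimal, so t_x = x (and t_0 = 0).
    return x


def t1_recursive(n, m, t=t):
    """T_1(n, m) as the sum of the three regions, for 0 < m < n."""
    region1 = 2 ** (n - m) - 1
    region3 = t(m - 1) + 1
    if n - m <= m:
        region2 = t(n - m)                    # run on an instance of the given MBN
    else:
        region2 = t1_recursive(n - m, m, t)   # recursive decomposition
    return region1 + region2 + region3


def t1_closed(n, m, t=t):
    """Closed form, with i_1 = ceil(n / m) - 1."""
    i1 = -(-n // m) - 1                       # integer ceiling of n / m, minus 1
    # 2^(n-m) * (1 - 2^(-i1*m)) / (1 - 2^(-m)) written with exact integers
    geometric = (2 ** n - 2 ** (n - i1 * m)) // (2 ** m - 1)
    return geometric + i1 * t(m - 1) + t(n - i1 * m)
```

For the worked example above ($n = 4$, $m = 2$), both functions return 7, the sum of the 3, 2 and 2 steps of Regions 1, 2 and 3.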
Degree: The degree, $D_1(n, m)$, of $\mathcal{M}'_1$ depends on the manner in which processors are brought together on Region 3. For our initial discussion, if $d_x$ is the degree of a $2^x \times 2^{x-1}$ instance $\mathcal{M}_x$ of the given MBN $\mathcal{M}$, then assume that the degree of the root processor is at most $d_x - 1$; this is indeed the case for the Tree MBN of Section 3.6 (page 44). Under these assumptions, we will show that $D_1(n, m) \le d_m$, the degree of a $2^m \times 2^{m-1}$ instance of $\mathcal{M}$. In other words, $D_1(n, m)$ is independent of $n$. Subsequently, we will eliminate the assumption on the degree of the root processor of $\mathcal{M}_m$.
In Region 1, each processor is connected to one bus, and therefore has degree 1. For Region 2, if $n - m \le m$, then the region is run on an $\mathcal{M}_{n-m}$ that has degree $d_{n-m} \le d_m$. If $n - m > m$, then the degree due to Region 2 is $D_1(n - m, m)$, which by the induction hypothesis is at most $d_m$. In particular, the degree of the root processor $r$ (say) of the tree in Region 2 is at most $d_m - 1$.
In Region 3 we have $2^m - 1$ processors $p_i$ (for $1 \le i < 2^m$) from the roots of the trees in Region 1, together with the root processor $r$ of the tree in Region 2. Each processor $p_i$ ($1 \le i < 2^m$) has degree 1 and processor $r$ has degree at most $d_m - 1$. In the first step of Region 3, processors are paired and each pair is reduced to one processor holding a partial result. These processors proceed further in Region 3, while the remaining processors are not used any further. Let the processor pairs for the first step be $(\alpha_i, \beta_i)$, where $1 \le i \le 2^{m-1}$ and $\alpha_i, \beta_i \in \{p_1, p_2, \ldots, p_{2^m - 1}, r\}$. Of these, let $\alpha_i$ hold the partial results and proceed further and let all $\alpha_i$ have degree 1; this implies that the processor $r = \beta_i$ for some $i$ and it does not proceed beyond the first step of Region 3. Let processor $\alpha_i$ (whose degree is 1) be connected to some bus $b_i$. Since each subtree of Region 1 uses a different bus, we can ensure that each $\alpha_i$ is connected to a distinct bus $b_i$. The first step of Region 3 connects processors $\alpha_i$ and $\beta_i$ to bus $b_i$, allowing processor $\beta_i$ to send its value to $\alpha_i$ (Figure 5.10). This increases the degree of $\beta_i$ by 1 and the loading of $b_i$ by 1, as $\alpha_i$ is already connected to $b_i$. Thus, at the end of this first step each processor $\beta_i$ has a degree of 2 or at most $d_m \ge 2$ (from Lemma 4.1, page 66). These processors proceed no further in Region 3. The processors $\alpha_i$ that proceed in Region 3 each have a degree of 1, and each is connected to a different bus. Observe that every bus of a $2^n \times 2^{n-1}$ binary-tree MBN can be assigned to a distinct processor pair that is connected to it (namely those of the first step). Therefore, by permuting processors and buses of $\mathcal{M}_{m-1}$ appropriately (see Section 2.2, page 13), the remainder of Region 3 can proceed with processors $\alpha_i$ and MBN $\mathcal{M}_{m-1}$ as if these processors had no connections to begin with. After scheduling Region 3, processors $\alpha_i$ now have a degree of at most $d_{m-1} \le d_m$. Thus the degree of $\mathcal{M}'_1$ is (at most) $d_m$.

Figure 5.10: Connection of processors and buses in Region 3 (processor pairs $\alpha_1, \beta_1$; $\alpha_2, \beta_2$; ...; $\alpha_{2^{m-1}}, \beta_{2^{m-1}}$)
The above derivation assumed that the root processor of $\mathcal{M}_x$ had a degree of at most $d_x - 1$. As observed earlier, this is indeed the case for many binary-tree MBNs. If this is not the case, then assume the degree of $\mathcal{M}_x$ to be $d_x + 1$. This would allow the degree of the root processor to be incremented in the first step of Region 3 without increasing the MBN degree. Notice that this is possible as the root of the Region 2 subtree does not proceed beyond the first step of Region 3. In summary, the degree of $\mathcal{M}'_1$ is at most $d_m + 1$.
Loading: The loading $L_1(n, m)$ of $\mathcal{M}'_1$ is upper bounded by the sum of the loading due to the three regions. (Unlike processors, which are different for different subtrees of Regions 1 and 2, the same set of buses is used for all regions.) Let $\ell_x$ denote the loading of $\mathcal{M}_x$.
From the discussion of the degree, the loadings due to Regions 1 and 3 are $2^{n-m}$ and $1 + \ell_{m-1}$, respectively. If $n - m \le m$, the loading of Region 2 is $\ell_{n-m} \le \ell_m$; otherwise Region 2 has loading $L_1(n - m, m)$. Thus we have the following recurrence.

$$L_1(n, m) = \begin{cases} 2^{n-m} + 1 + \ell_{m-1} + \ell_{n-m}, & \text{if } n - m \le m \\ 2^{n-m} + 1 + \ell_{m-1} + L_1(n - m, m), & \text{if } n - m > m \end{cases}$$

which has the solution $L_1(n, m) = 2^{n-m} \left(\frac{1 - 2^{-i_1 m}}{1 - 2^{-m}}\right) + i_1(\ell_{m-1} + 1) + \ell_{n - i_1 m}$, where $i_1 = \lceil n/m \rceil - 1$.
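The loading recurrence can be unrolled in the same way as the running-time recurrence. The sketch below compares the recurrence with its stated solution for an arbitrary loading profile; the function names and the profile `example_ell` are placeholders chosen only to exercise the formulas, not values taken from any particular MBN.

```python
def l1_recursive(n, m, ell):
    """L_1(n, m) as the sum of the loadings of the three regions, 0 < m < n."""
    region1 = 2 ** (n - m)
    region3 = 1 + ell(m - 1)
    if n - m <= m:
        region2 = ell(n - m)
    else:
        region2 = l1_recursive(n - m, m, ell)
    return region1 + region3 + region2


def l1_closed(n, m, ell):
    """Stated solution, with i_1 = ceil(n / m) - 1."""
    i1 = -(-n // m) - 1
    geometric = (2 ** n - 2 ** (n - i1 * m)) // (2 ** m - 1)
    return geometric + i1 * (ell(m - 1) + 1) + ell(n - i1 * m)


def example_ell(x):
    # Placeholder loading profile (an assumption, for illustration only).
    return 0 if x == 0 else x + 1
```

Because the unrolling argument never uses any property of $\ell_x$, the two functions agree for any profile passed in as `ell`.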
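Theorem 5.11 For any $0 < m < n$, given a $2^m \times 2^{m-1}$ binary-tree MBN $\mathcal{M}_m$ of degree $d_m$, loading $\ell_m$, and running time $t_m$ to run $\mathit{Bin}(m)$, there exists a $2^n \times (2^m - 1)$ MBN $\mathcal{M}'_1$, with degree at most $d_m + 1$, loading of $2^{n-m} \left(\frac{1 - 2^{-i_1 m}}{1 - 2^{-m}}\right) + i_1(\ell_{m-1} + 1) + \ell_{n - i_1 m}$ and running time of $2^{n-m} \left(\frac{1 - 2^{-i_1 m}}{1 - 2^{-m}}\right) + i_1 t_{m-1} + t_{n - i_1 m}$, where $i_1 = \lceil n/m \rceil - 1$. ■
Remark: If the MBN $\mathcal{M}$ is the Tree MBN of Section 3.6, then the degree of $\mathcal{M}'_1$ is only 3.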






Figure 5.11: Regions of $\mathcal{T}(n)$ for $k$ bus faults
5.3.2 Recursive Scheduling with $2^s$ Bus Faults
Here we describe the construction of $\mathcal{M}'_k$, a $2^n \times (2^m - k)$ MBN for running $\mathcal{T}(n)$ where $2^{m-1} \ge k = 2^s \ge 1$. The approach is the same as that for $\mathcal{M}'_1$ (Section 5.3.1). Indeed we use $\mathcal{M}'_1$ to construct $\mathcal{M}'_k$. Divide $\mathcal{T}(n)$ into three regions as shown in Figure 5.11.
Region 1 now consists of $2^m - 2^s$ subtrees, while Region 2 has the remaining $2^s$ subtrees. Region 3 is the same as in the $k = 1$ case. Schedule Region 1 as before, with one bus for each subtree. Schedule Region 3 as before with $2^{m-1} \le 2^m - k$ buses (as $k \le 2^{m-1}$). The difference here is in the way Region 2 is scheduled. Region 2 consists of $2^s$ subtrees, each a $\mathcal{T}(n - m)$; also $s < m$. Divide the available $2^m - 2^s$ buses equally among the subtrees so that each subtree uses $2^{m-s} - 1$ buses. Thus each subtree runs on a $2^{n-m} \times (2^{m-s} - 1)$ MBN, which is an instance of $k = 1$ with $n$ replaced by $n - m$ and $m$ replaced by $m - s$. The running time for each of these subtrees of Region 2 is

$$T_1(n - m, m - s) = 2^{n-2m+s} \left(\frac{1 - 2^{-i_k(m-s)}}{1 - 2^{-(m-s)}}\right) + i_k t_{m-s-1} + t_{n-m-i_k(m-s)},$$

where $i_k = \left\lceil \frac{n-m}{m-s} \right\rceil - 1$.
The overall running time $T_k(n, m)$ for $\mathcal{M}'_k$ is $(2^{n-m} - 1) + (t_{m-1} + 1) + T_1(n - m, m - s)$. By a similar argument the degree and loading of $\mathcal{M}'_k$ are at most $d_m + 1$ and $2^{n-m} + \ell_{m-1} + L_1(n - m, m - s)$.
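Theorem 5.12 For any $0 < k = 2^s < m < n$, given a $2^m \times 2^{m-1}$ binary-tree MBN $\mathcal{M}_m$ of degree $d_m$, loading $\ell_m$, and running time $t_m$ to run $\mathit{Bin}(m)$, there exists a $2^n \times (2^m - k)$ MBN $\mathcal{M}'_k$, with degree at most $d_m + 1$, loading of $2^{n-m} + \ell_{m-1} + 2^{n-2m+s} \left(\frac{1 - 2^{-i_k(m-s)}}{1 - 2^{-(m-s)}}\right) + i_k(\ell_{m-s-1} + 1) + \ell_{n-m-i_k(m-s)}$ and running time of $2^{n-m} - 1 + t_{m-1} + 1 + 2^{n-2m+s} \left(\frac{1 - 2^{-i_k(m-s)}}{1 - 2^{-(m-s)}}\right) + i_k t_{m-s-1} + t_{n-m-i_k(m-s)}$, where $i_k = \left\lceil \frac{n-m}{m-s} \right\rceil - 1$. ■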
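The running time in Theorem 5.12 can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, $t_x = x$ is an assumed (optimal) running time for the given MBN, and the closed form for $T_1$ is taken to reduce to $t_n$ when its first argument does not exceed its second (so that $i = 0$), which is how the Region 2 base case behaves.

```python
def t(x):
    # Assumption: the given MBN is optimal, so t_x = x (and t_0 = 0).
    return x


def t1(n, m, t=t):
    """Closed form of T_1(n, m); for n <= m it reduces to t(n)."""
    i = -(-n // m) - 1                                   # ceil(n / m) - 1
    geometric = (2 ** n - 2 ** (n - i * m)) // (2 ** m - 1)
    return geometric + i * t(m - 1) + t(n - i * m)


def tk(n, m, s, t=t):
    """Running time on the 2**n x (2**m - 2**s) MBN, for 0 <= s < m < n."""
    region1 = 2 ** (n - m) - 1
    region3 = t(m - 1) + 1
    region2 = t1(n - m, m - s, t)   # each Region 2 subtree is a k = 1 instance
    return region1 + region3 + region2
```

Under the $t_x = x$ assumption, `tk(n, n - 1, 0)` evaluates to $n + 1$ and `tk(n, n // 2, 0)` to $2^{n/2} + n - 1$ for even $n$, which are the recursive-scheduling times listed for those cases in Table 5.1.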
5.3.3 Putting it All Together
Given a $2^n \times (2^m - k)$ MBN $\mathcal{M}'_k$, we construct the fault-tolerant $2^n \times 2^m$ MBN $\mathcal{S}_k$ by first adding $k$ dummy buses, then applying replication, and finally superimposing the given $2^n \times 2^m$ MBN $\mathcal{M}$ on it. Clearly, the $k$ designated buses for replication are the added dummy buses; they are not used in $\mathcal{M}'_k$. Therefore, $\mathcal{S}_k$ runs in the same time as $\mathcal{M}'_k$ when at most $k$ bus faults are present. The degree of $\mathcal{S}_k$ is at most $d_1 + (k + 1)d_2$, where $d_1$ and $d_2$ are the degrees of $\mathcal{M}$ and $\mathcal{M}'_k$. Its loading is at most $\ell_1 + (k + 1)\ell_2$, where $\ell_1$ and $\ell_2$ are the loadings of $\mathcal{M}$ and $\mathcal{M}'_k$. Thus we have the following result.
Theorem 5.13 Let $\mathcal{M}$ be a $2^n \times 2^{n-1}$ binary-tree MBN and let $0 < k = 2^s < m < n$ and $i_k = \left\lceil \frac{n-m}{m-s} \right\rceil - 1$. Then recursive scheduling constructs a $2^n \times 2^m$ binary-tree MBN $\mathcal{S}_k$ with the following properties.
(i) If no bus is faulty, then $\mathcal{S}_k$ can emulate $\mathcal{M}$ with no overhead.
(ii) If $\mathcal{M}$ takes $t_n$ steps to run $\mathit{Bin}(n)$ then, when there are at most $k$ faulty buses, $\mathcal{S}_k$ runs $\mathit{Bin}(n)$ in $2^{n-m} - 1 + t_{m-1} + 1 + 2^{n-2m+s} \left(\frac{1 - 2^{-i_k(m-s)}}{1 - 2^{-(m-s)}}\right) + i_k t_{m-s-1} + t_{n-m-i_k(m-s)}$ steps.
(iii) If the degree of $\mathcal{M}$ is $d_n$, then the degree of $\mathcal{S}_k$ is at most $\min(2^m, (k + 2)d_n)$.
(iv) If the loading of $\mathcal{M}$ is $\ell_{n,m}$, then the loading of $\mathcal{S}_k$ is at most $\left[2^{n-m} + \ell_{m-1} + 2^{n-2m+s} \left(\frac{1 - 2^{-i_k(m-s)}}{1 - 2^{-(m-s)}}\right) + i_k(\ell_{m-s-1} + 1) + \ell_{n-m-i_k(m-s)}\right](k + 1) + 2^{n-m} + \ell_{m+1} - 2$.
■
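The degree bound of $\mathcal{S}_k$ is simple arithmetic, sketched below. The function name and its parameters are hypothetical, and the sample degree of 3 is an assumption: it is the value the Remark after Theorem 5.11 gives for the fault-free-minus-one-bus MBN built from the Tree MBN, and it is assumed here for the given MBN as well.

```python
def sk_degree_bound(m, k, d_given, d_reduced):
    """Upper bound on the degree of S_k.

    d_given:   degree of the given MBN that is superimposed at the end
    d_reduced: degree of the MBN with 2**m - k buses
    A processor cannot be connected to more than the 2**m buses that exist,
    hence the min(), as in part (iii) of the theorem.
    """
    combined = d_given + (k + 1) * d_reduced
    return min(2 ** m, combined)
```

With both degrees equal to 3 and $k = 1$, the bound is $3 + 2 \cdot 3 = 9$ whenever $2^m \ge 9$; when the degrees are equal, $d + (k + 1)d = (k + 2)d$, which is the form used in part (iii).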
Since bus faults and processor faults are treated independently of each other, we can use the results derived in Section 5.2.5 to augment the MBNs that are tolerant to bus faults. Therefore, MBN $\mathcal{S}_k$ can be made tolerant to processor faults as well.
5.4 Comparison of Results
In this section, we compare the two methods. As explained earlier, we expect the running time of recursive scheduling to be no more than that of replication in all cases. In addition, we expect the loading of replication to be lower than that of recursive scheduling in all cases. This is because recursive scheduling uses buses more efficiently (and more often), incurring more connections in the process. Table 5.1 shows the running time, loading and degree of the two methods when the Tree MBN of Section 3.6 is used as the input MBN for the two methods. The running times of both recursive scheduling and replication are the same for cases $m = n - 1$ (regardless of the value of $k$). This is because here the failure of one bus has the same impact as the failure of half the buses [2]. Therefore, replication is optimal and recursive scheduling cannot improve on it. When the number of faults is large (for example, $k = 2^{m-1}$ in the table), both methods again have the same running time. This is because replication assumes that half the buses are faulty, regardless of the actual number of faults. The case where $m = n/2$, $k = 1$ shows the advantage of recursive scheduling; the running time is about half that of replication. In this case, recursive scheduling makes the maximum use of all the available $(2^{n/2} - 1)$ buses, while replication only uses $2^{n/2 - 1}$ of the available buses. When the number of faulty buses approaches the total number of buses, both methods give the same running time. This situation is not unusual because both methods have very few buses available and the inefficiency of replication becomes insignificant. As expected, the loading of replication is superior to that of recursive scheduling in all cases. The degree of recursive scheduling is marginally larger in all cases because we superimpose the original MBN on the fault-tolerant MBN to obtain $\mathcal{S}_k$.
5.5 Concluding Remarks
We have proposed two methods for converting any binary-tree MBN to one that is resilient to arbitrary bus faults. One of the methods presented can be used with both processor and bus faults. It also works with any type of MBN, while the other method works only with binary-tree MBNs. The fault-tolerant MBNs we have designed do not run optimally. However, they have much better degree and loading than those proposed by Ali and Vaidyanathan [2]. The problem of designing low-degree, fault-tolerant MBNs that run binary-tree algorithms in an optimal number of steps is open.
Replication specifies the additional connections needed in an MBN to map faulty elements to less important elements. An algorithm to perform the required reallocation of identities is important as well. Though our approach could accommodate handling of bus faults on the fly, it would incur larger overheads for processor faults, where entire contexts will have to be relocated. Another possible drawback of this work is that it does not address link faults that render the connection from a processor to a bus (rather than an entire bus or the processor) unusable.

Table 5.1: Summary of results

| Measure | Case | Recursive scheduling | | Replication |
|---|---|---|---|---|
| Time | $m = n - 1$, $k = 1$ | $n + 1$ | $=$ | $n + 1$ |
| Time | $m = n - 1$, $k = 2^{n-2}$ | $n + 1$ | $=$ | $n + 1$ |
| Time | $m = n/2$, $k = 1$ | $2^{n/2} + n - 1$ | $<$ | $\frac{n}{2} + 2^{n/2+1} + 1$ |
| Time | $m = n/2$, $k = 2^{n/2-1}$ | $\frac{n}{2} + 2^{n/2+1} - 1$ | $\approx$ | $\frac{n}{2} + 2^{n/2+1} + 1$ |
| Loading | $m = n - 1$, $k = 1$ | $17$ | $>$ | $6$ |
| Loading | $m = n - 1$, $k = 2^{n-2}$ | $7 \cdot 2^{n-2} + 10$ | $>$ | $3 \cdot 2^{n-2} + 3$ |
| Loading | $m = n/2$, $k = 1$ | $2 \cdot 2^{n/2} + 15$ | $>$ | $2 \cdot 2^{n/2} + 6$ |
| Loading | $m = n/2$, $k = 2^{n/2-1}$ | $(2^{n/2-1} + 1)(2^{n/2+1} + \frac{n}{2} + 3)$ | $>$ | $(2^{n/2-1} + 1)(2^{n/2} + 3)$ |
| Degree | $m = n - 1$, $k = 1$ | $9$ | $>$ | $6$ |
| Degree | $m = n - 1$, $k = 2^{n-2}$ | $(2^{n-2} + 2) \cdot 3$ | $>$ | $(2^{n-2} + 1) \cdot 3$ |
| Degree | $m = n/2$, $k = 1$ | $9$ | $>$ | $6$ |
| Degree | $m = n/2$, $k = 2^{n/2-1}$ | $(2^{n/2-1} + 2) \cdot 3$ | $>$ | $(2^{n/2-1} + 1) \cdot 3$ |

Number of processors $= 2^n$, number of buses $= 2^m$, number of faulty buses $k = 2^s$. We denote $T(n, n - 1)$, $\ell_{n,n-1}$ and $d_{n,n-1}$ by $T(n)$, $\ell_n$ and $d_n$ respectively.
Chapter 6
VLSI Layout Lower Bound
This chapter deals with VLSI layouts for optimal-time MBNs. In a related topic, VLSI layouts for the balanced tree point-to-point topology have been thoroughly studied [80]. The balanced tree represents a structure where all edges of a balanced binary tree could be used simultaneously. In contrast, an optimal-time binary-tree algorithm represents a situation in which one level of edges is used at a time. This implies that any layout for a balanced tree would also suffice for a binary-tree MBN. The converse is not true, however. This is because the MBN could reuse the communication resources (and VLSI real estate) over different steps, in a manner not possible on a balanced tree. The question we ask here is "is it possible for a binary-tree MBN to be laid out in a smaller area than a balanced tree?" For two of the three cases that we consider, the answer is easily provable to be "no." For the third case, we conjecture that the answer is again "no" and we outline the basis of this conjecture in this chapter.
An $X \times Y$ layout of a structure accommodates the structure in two layers within an $X \times Y$ rectangle. Clearly, the area of an $X \times Y$ layout is $XY$. In a perimeter layout, all processors are placed on the perimeter of the enclosing rectangle. On the other hand, a dense layout has no restriction on where processors may be placed. As the name indicates, a dense layout is usually more compact than a perimeter layout.
Figure 6.1: H-Tree layout of a 31-processor binary tree

Figure 6.2: 7-node binary tree layout
A perimeter layout, on the other hand, places processors more conveniently for use 
within a larger context such as meshes enhanced with MBNs (see Chapter 4), or for 
connecting to pins of a chip. The aspect ratio of an X  x Y  layout is
An $N$-leaf ($\Theta(N)$-node) balanced tree has an optimal $\Theta(N)$ area, constant aspect ratio layout [80] (see Figure 6.1). Therefore an $N$-processor binary-tree MBN also has such an optimal layout. On the other hand, if a constant aspect ratio, perimeter layout is required, then the perimeter must have $\Omega(N)$ length, as a result of which the area is $\Omega(N^2)$. The well known perimeter layout of a tree [80] can easily be bent around the perimeter of a $\Theta(N \times N)$ square to construct such a layout. Again, this is optimal for a binary-tree MBN.
It can be shown [80] that a high-aspect ratio layout for a balanced tree (with all the processors on one side of the layout) requires $\Omega(N \log N)$ area. Does the same bound apply for binary-tree MBNs as well? The answer is not simple, as a binary-tree MBN uses only one level of edges at a time, and therefore could reuse buses over several steps. For example, a 7-node binary tree (that has two degree-3 nodes) requires at least two levels of wires, as shown in Figure 6.2. On the other hand, an optimal-time 8-processor binary-tree MBN can accommodate its buses in a single level (Figure 6.3). Figure 6.4 shows how this MBN runs the 8-input binary-tree algorithm, $\mathit{Bin}(3)$. In the remainder of this chapter we describe several steps towards developing a lower bound on the perimeter layout area for optimal-time binary-tree MBNs. It forms the basis of our conjecture that binary-tree MBNs do not have a lower layout area than balanced binary trees. Our argument is arranged as a series of lemmas and one conjecture. If this conjecture can be proved to be true, then this work will establish that any perimeter layout of an optimal-time, binary-tree MBN with $N$ processors requires $\Omega(N \log N)$ area.

Figure 6.3: 8-processor MBN layout

Figure 6.4: 8-processor MBN running $\mathit{Bin}(3)$
In the next section we discuss some preliminary ideas. In Section 6.2, we describe 
our steps towards the lower bound derivation.
6.1 Preliminaries
In this section we state some assumptions and establish conventions used in subsequent discussion.
6.1.1 VLSI Model
We adopt the most widely used mathematical model for VLSI algorithms [78, 79]. In this model, a VLSI layout consists of horizontal and vertical wires of unit width. Horizontal and vertical wires are laid out on separate layers, and wires on the same layer are separated by unit distance. Whenever a horizontal wire is to be connected to a vertical wire, a contact hole or via is cut at the intersection of the two wires and a contact made through this hole. Processors are assumed to occupy unit area. The assumption usually requires a processor to be of constant degree; our lower bound argument does not rely on this assumption, however. Note that this is a "word model" that assumes unit area for processors and unit width for wires, regardless of the word size used. Since the number of layers in actual fabrications is limited to a few, the size of a VLSI layout is primarily measured by the area of the largest layer (enclosing rectangle). Practical considerations of VLSI fabrication, such as cost and yield, dictate that the area be kept as small as possible.
As explained earlier, we will only consider a high aspect ratio, perimeter layout for an optimal-time, binary-tree MBN. Such an MBN for $\mathit{Bin}(n)$ has $2^n$ processors, each of which occupies unit area. Therefore, one of the two dimensions of the layout is $\Omega(2^n)$ units long (see Figure 6.5). Without loss of generality, we assume that all $2^n$ processors are placed on one side of the layout. We will focus on finding a lower bound on the other dimension $h$ (height) of the layout. In deriving this lower bound we assume that vertical wires have no width and concentrate entirely on the horizontal wire segments. Initially, each processor holds an input. However, no assumption is made about which processor holds the final result of $\mathit{Bin}(n)$.

Figure 6.5: Processors are shown as circles
6.1.2 Definitions and Figure Conventions
Let the processor axis of a perimeter layout be the line (edge) of the layout on which processors are placed (in Figure 6.5, the bottom horizontal side of the layout is the processor axis). Assume that, in general, the layout orients the processor axis as the lower horizontal line of the layout. Our approach to the area lower bound first identifies the minimum communication requirements for an optimal-time binary-tree MBN. This communication requirement is represented as horizontal links. A link between processors $p_1$ and $p_2$ denotes a communication between these processors. The link is represented as a horizontal line, whose projection on to the processor axis is a line connecting processors $p_1$ and $p_2$. The link is not to be confused with a wire or a bus. It is simply a channel (not necessarily placed in a layout) dedicated for communications between processors $p_1$ and $p_2$. Our goal at this point is only to identify the existence of such links.
In general, we will view the links from a processor's perspective, and our interest will be restricted to questions such as "does the link cover other links?" (A link $\lambda_1$ is said to cover a link $\lambda_2$ if an infinite vertical line through any point in $\lambda_2$ intersects $\lambda_1$. A link is said to cover a processor iff vertical lines drawn immediately to the left and right of the processor intersect the link.) We now introduce some notation that will help in explaining ideas about the MBN's communication requirements.
Figure 6.6: Links between processors

Figure 6.7: View from processor 1
Consider the links shown in Figure 6.6. These links represent the communication requirements shown in Figure 6.4. Links labeled 1 are at the lowest level of the tree. Notice that these links are between processor pairs (1, 2), (3, 4), (5, 6) and (7, 8), which are involved in a communication in the first step. In step 2, processor pairs (1, 3) and (6, 7) communicate; these communications correspond to the links labeled 2 in Figure 6.6. The two links labeled 3 between processor pairs (3, 4) and (4, 6) represent the corresponding non-trivial edges to the roots of Figure 6.4.
In general, each link is labeled with the step at which it is used. For now we will use these link labels only to show the correspondence with Figure 6.6. Figure 6.7 shows the view of these links from processor 1. This view only captures the existence of links and the fact that some links cover others. The length of a link is not an important consideration, except that each link is at least one unit long and a covering link is at least as long as the covered link. In most cases we will only be interested in portions of a subset of the links (as viewed from a processor). For example, we may choose to consider the three subsets shown in Figures 6.8, 6.9 and 6.10 as the view from processor 1. Additionally, we may restrict the view to only portions of some links.
Figure 6.8: Subset view I

Since the links with labels 2 and 3 in Figure 6.8 do not cover any link other than the link labeled 1 directly below them, we can shorten these two links as shown (without the labels) in Figure 6.11. Since only the relative position of links is important for our consideration, Figure 6.11 also represents the views in Figures 6.9 and 6.10. Figure 6.11 is also representative of the view from processors 2 and 3 of Figure 6.6, but not processors 4, 5, 6, 7 and 8. The view from these processors contains the links shown in Figure 6.12.

Figure 6.10: Subset view III
Since we will not make a distinction based on the side of the processor that contains links, the views in Figure 6.13 are considered identical. We use a two-sided arrow to indicate this, as shown in Figure 6.13. Indeed, this represents the view from any of the processors of Figure 6.6.
Figure 6.11: Subset view from processor 1

Figure 6.12: Subset view from processor 4

Figure 6.13: Equivalent views
A set of links in the view of a processor will be symbolically represented by a letter enclosed in a box (for instance $\boxed{X}$ or $\boxed{Y}$). For example, if $\boxed{X}$ denotes a single link,




Figure 6.14: Symbolic representation of a subset view
be any set of links. The notation $\overline{\boxed{X}}$ denotes the links of $\boxed{X}$ along with another link that covers all links of $\boxed{X}$. If $\boxed{Y}$ denotes $\overline{\boxed{X}}$, then Figure 6.14(a) can be redrawn as Figure 6.14(b). Note that a subset view or the view from a processor containing a set of links denotes a subset of the true view from the processor. Since our lower bound argument counts the length of links, such a conservative "subset view" is acceptable.

Figure 6.15: Communication Structure for $\mathcal{T}(3)$
6.2 Towards the Lower Bound
In this section we derive the results necessary for establishing the lower bound on the height of a perimeter layout of an optimal-time, binary-tree MBN. Our approach first determines a set of links that the view from the final result processor must contain. This "minimum" communication pattern is used with the concept of "collapsing" (that captures the notion of bus reuse) to derive a lower bound on the wire length represented by links. This finally bounds the height of the layout.
6.2.1 Minimum Communication Structure
We start by establishing the minimum communication requirement for any MBN running $\mathit{Bin}(3)$ optimally in 3 steps. We then use this result to derive the communication requirement for $\mathit{Bin}(n)$.
Lemma 6.1 Let $\mathcal{M}$ be an 8-processor MBN in which each processor contains the subset view shown in Figure 6.15(a) for some set $\boxed{X}$ of links. (Assume this view to be unrelated to running $\mathit{Bin}(3)$ on $\mathcal{M}$.) If $\mathcal{M}$ can run $\mathit{Bin}(3)$ optimally, then the final result processor contains the subset view shown in Figure 6.15(b).
Proof: Since $\mathcal{M}$ executes $\mathit{Bin}(3)$ optimally in 3 steps, it must provide a path of at most 3 hops$^1$ from each processor to the final result processor $p_0$ (say). Also observe that regardless of where processor $p_0$ is placed on the processor axis, there must be at least four other processors on one of its sides. In summary, the final result processor $p_0$ has at least four processors on one side of it, with each of these processors connected to $p_0$ by a path of at most 3 hops. Without loss of generality, let $p_1, p_2, p_3$ and $p_4$ be these four processors to the right of processor $p_0$, with $p_1$ nearest to $p_0$ and $p_4$ furthest.
Since $\mathcal{M}$ executes $\mathit{Bin}(3)$ optimally, each communication in this execution must be a 1-hop path. Therefore, the subset view from $p_0$ must contain links to processors $p_1, p_2, p_3, p_4$ such that each of the processors can be reached from $p_0$ by traversing at most 3 links. We now consider some cases.
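The two facts used at the start of the proof can be checked on a small example. The sketch below is only illustrative: the bus assignment `example_buses` is hypothetical (it is not the MBN of Figure 6.3), and the function names are assumptions. Hop distance counts the number of buses traversed, following the definition of a $k$-hop path; being within 3 hops of $p_0$ is a necessary condition for running $\mathit{Bin}(3)$ in 3 steps, not a sufficient one.

```python
from collections import deque


def hop_distances(buses, source):
    """Fewest buses that must be traversed from `source` to each processor.

    `buses` maps a bus name to the set of processors connected to it.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        for members in buses.values():
            if p in members:
                for q in members:
                    if q not in dist:
                        dist[q] = dist[p] + 1
                        queue.append(q)
    return dist


def processors_on_larger_side(position, total=8):
    # Number of processors on the more populated side of `position`
    # when `total` processors sit at positions 0 .. total - 1.
    return max(position, total - 1 - position)


# Hypothetical 8-processor, 4-bus MBN used only to exercise the function.
example_buses = {"a": {0, 1, 4}, "b": {0, 2, 3}, "c": {4, 5, 6}, "d": {6, 7}}
dist = hop_distances(example_buses, 0)
```

In this example every processor is within 3 hops of processor 0, and for every possible position of the final result processor among 8 positions, at least four other processors lie on one of its sides.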
•  •  •  •  •
pO p1 p2 p3 p4
Figure 6.16: Subcase 1 (a)
Case 1: Suppose there is a link $\lambda$ (of length 4) between $p_0$ and $p_4$; this includes the
case where $\lambda$ covers $p_0$ and/or $p_4$. We now consider some subcases.
¹ A $k$-hop path between processors $p$ and $p'$ is a sequence $(p = p_0, b_1, p_1, b_2, p_2, \ldots, p_{k-1}, b_k, p_k)$,
where for $1 \le i \le k$ processors $p_i$ and $p_{i-1}$ are connected to bus $b_i$.
Subcase 1(a): Suppose there is a link $\lambda' \ne \lambda$ that covers any of processors $p_1$,
$p_2$ or $p_3$ (as shown for $p_1$ in Figure 6.16). Then the $\boxed{X}$ of the processor
($p_1$, $p_2$ or $p_3$) is covered by $\lambda'$ while the $\boxed{X}$ of a different one of $p_1$, $p_2$ or
$p_3$ is covered by $\lambda$. A subset of this situation is the view of Figure 6.15(b)
(indicated by the dashed boxes of Figure 6.16).
Subcase 1(b): Suppose $\lambda$ is the only link covering processors $p_1$, $p_2$ and $p_3$.
Then for $p_2$ to have a path to $p_0$, there must be links $\lambda', \lambda''$ on both sides
of either $p_1$ (Figure 6.17) or $p_3$ (Figure 6.18). As shown in these figures,
the view from $p_0$ contains the subset view of Figure 6.15(b).
Figure 6.17: Subcase 1(b): $p_2$-$p_1$-$p_0$ link
Figure 6.18: Subcase 1(b): $p_2$-$p_3$-$p_0$ link
Case 2: Suppose there is a link $\lambda$ of length 3. If there is a link $\lambda' \ne \lambda$ that covers any
of the processors, then the proof follows as in Figure 6.16. Assume therefore,
that there is no link other than $\lambda$ that covers any of the processors. Without
loss of generality, let $\lambda$ be between $p_0$ and $p_3$ (the case where $\lambda$ is between $p_1$
and $p_4$ is analogous). Clearly there must be a link $\lambda'$ from $p_3$ to $p_4$ (Figure 6.19).
Processor $p_2$ connects to $p_0$ using links $\lambda'', \lambda'''$ via processor $p_1$ (Figure 6.20)
or link $\lambda''$ via processor $p_3$ (Figure 6.21). These figures explain why the lemma
holds for these cases.
Figure 6.20: Case 2: $p_2$-$p_1$-$p_0$ link
Figure 6.21: Case 2: $p_2$-$p_3$-$p_0$ link
Case 3: Suppose there is a link $\lambda$ of length 2. Once again assume that there is no
link other than $\lambda$ that covers any of the processors; otherwise it represents the
situation in Figure 6.16. We now consider some subcases.
Subcase 3(a): Suppose $\lambda$ is from $p_0$ to $p_2$ (or analogously from $p_2$ to $p_4$). For
$p_4$ to get to $p_0$ there must be links $\lambda', \lambda''$ from $p_4$ to $p_3$ and $p_3$ to $p_2$. This
situation is handled as shown in Figure 6.22.
Figure 6.22: Subcase 3(a)
Subcase 3(b): Suppose $\lambda$ is in the middle between $p_1$ and $p_3$. Then the path
from $p_2$ to $p_0$ must include edge $\lambda'$ from $p_1$ to $p_0$ and $\lambda''$ from $p_2$ to either
$p_1$ (Figure 6.23) or $p_3$ (Figure 6.24). In addition, there is a link $\lambda'''$ from
$p_4$ to $p_3$. These figures show how these cases are handled.
Figure 6.23: Subcase 3(b): $p_2$-$p_1$-$p_0$ link
Figure 6.24: Subcase 3(b): $p_2$-$p_3$-$p_0$ link
Since the paths from $p_1, p_2, p_3, p_4$ to $p_0$ can have at most 3 hops, there must be
at least one link of length 2 or more, so all cases are covered. ■
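The hop-count condition used throughout this proof can be illustrated with a short breadth-first search. This is a sketch under stated assumptions: each link is modeled as a bus joining the processors it connects (following the footnote's definition of a k-hop path), processor `i` stands for $p_i$, and the bus names, including the extra $p_0$-$p_1$ link `mu`, are hypothetical choices in the spirit of Subcase 3(a) rather than a configuration taken from the figures.

```python
from collections import deque

def hop_counts(buses, source):
    """Breadth-first search over buses; returns {processor: fewest hops from source}."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        for members in buses.values():
            if p in members:
                for q in members:
                    if q not in hops:
                        hops[q] = hops[p] + 1
                        queue.append(q)
    return hops

# Hypothetical link set; every name here is an assumption for illustration.
buses = {
    "lambda": {0, 2},    # length-2 link between p0 and p2
    "lambda1": {3, 4},   # link from p4 to p3
    "lambda2": {2, 3},   # link from p3 to p2
    "mu": {0, 1},        # assumed link giving p1 a path to p0
}
hops = hop_counts(buses, source=0)
print(sorted(hops.items()))  # [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]
```

In this example every processor reaches $p_0$ within 3 hops, with $p_4$ using all three.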
We now use Lemma 6.1 to identify a minimum set of communication links for an
MBN running Bin(n) optimally.
Lemma 6.2 If an MBN runs Bin(n) optimally, then the final result processor
contains the subset view of Figure 6.26.
Proof: Without loss of generality, let $n/3$ be an integer. We proceed by induction on
$h = n/3 \ge 1$.
If $h = 1$, then we have $n = 3$. From Lemma 6.1 with $\boxed{X}$ being empty, we have the
desired result. Assume the assertion of the lemma to hold for $h \ge 1$ and consider an
MBN that runs Bin(3(h+1)) optimally. The tree $\mathcal{F}(3(h+1))$ can be decomposed into
8 $\mathcal{F}(3h)$s as shown in Figure 6.25. Let the processors at level $3h$ (roots of the $\mathcal{F}(3h)$s)
each contain subset view $\boxed{X}$. By the induction hypothesis $\boxed{X}$ is as shown in Figure 6.26.
Then by Lemma 6.1 the root of $\mathcal{F}(3(h+1))$ contains the subset view of Figure 6.26.
Expanding each $\boxed{X}$ as in Figure 6.26 completes the proof. ■
6.2.2 Labeling Links
Figure 6.26 shows the links that the subset view from the result processor must
contain for any MBN running Bin(n) optimally. Clearly, the links are drawn in levels
$0, 1, \ldots, \lceil n/3 \rceil - 1$ corresponding to 3-level chunks of $\mathcal{F}(n)$. If we label each link by
the (unique) step at which it is used, then no two levels of links have common labels,
and within a level, there are at most 3 distinct labels (as each level of Figure 6.26
represents a set of Bin(3)s that run in 3 steps).
Figure 6.25: $\mathcal{F}(n)$ (decomposed into subtrees $\mathcal{F}(3h)$)
Figure 6.26: View from final result processor (link levels $0, 1, 2, \ldots, \lceil n/3 \rceil - 1$)
6.2.3 Collapsing Links
To translate the minimum communication requirement of Figure 6.26 into the minimum
requirement of a perimeter layout, the possibility of differently labeled links using
the same physical wire (bus) must be accommodated, as this has the potential to
reduce the area. To capture this idea of bus reuse, we introduce the concept of collapsing,
which allows links at different levels (with different labels) to merge. As observed
earlier, each level of links represents a set of sub-problem Bin(3)s and has at most 3
different time labels. For $0 \le \ell < \lceil n/3 \rceil$ the length of a link at level $\ell$ is at least $2^{n-\ell-1}$.
Collapsing causes a link to have multiple labels that indicate the times at which
it is (re)used. That is, a link now has a set of labels (rather than a single label).
For $i = 1, 2$ let $\lambda_i$ be a level-$\ell_i$ link with label set $L_i$. If $\ell_1 > \ell_2$, then link $\lambda_1$ can
be collapsed into link $\lambda_2$ iff $\lambda_2$ covers $\lambda_1$ and $L_1 \cap L_2$ is empty. After the collapse,
the link $\lambda_1$ is removed from the communication requirement structure and the label
set of $\lambda_2$ is changed to $L_1 \cup L_2$. This collapsing captures the idea that link (bus) $\lambda_2$
can be used for all its original communications as well as those represented by link
(bus) $\lambda_1$. Since their labels are disjoint, the link will not be used simultaneously for
two communications. As $\lambda_2$ covers $\lambda_1$, link $\lambda_2$ also reaches all processors reached by
$\lambda_1$. Since the aim is to derive a lower bound on the area using the total length of
collapsed links, we will attempt a set of collapses that minimizes this total link length
in the communication structure. Indeed, because of the lower bound setting, we will
assume that three links from each level $\ell_1 > \ell_2$ can be collapsed into each level-$\ell_2$
link, regardless of whether or not the level-$\ell_2$ link covers the level-$\ell_1$ links.
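The collapse rule can be written as a small predicate. The sketch below is illustrative only: the `Link` record, the reading of "covers" as interval containment on the processor axis, and the sample positions and step labels are all assumptions of ours, not definitions from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    left: int          # leftmost processor position spanned by the link
    right: int         # rightmost processor position spanned by the link
    level: int         # level of the link (level 0 holds the longest links)
    labels: frozenset  # steps at which the link is used

def covers(outer: Link, inner: Link) -> bool:
    """True if `outer` spans every position that `inner` spans (assumed meaning)."""
    return outer.left <= inner.left and inner.right <= outer.right

def can_collapse(l1: Link, l2: Link) -> bool:
    """l1 may collapse into l2 iff l1 is at a higher level, l2 covers l1,
    and the two label sets are disjoint."""
    return l1.level > l2.level and covers(l2, l1) and not (l1.labels & l2.labels)

def collapse(l1: Link, l2: Link) -> Link:
    """Return l2 carrying the merged label set; the caller drops l1."""
    if not can_collapse(l1, l2):
        raise ValueError("collapse not permitted")
    return Link(l2.left, l2.right, l2.level, l2.labels | l1.labels)

long_link = Link(0, 8, level=0, labels=frozenset({7, 8, 9}))
short_link = Link(2, 4, level=1, labels=frozenset({4, 5, 6}))
merged = collapse(short_link, long_link)
print(sorted(merged.labels))  # [4, 5, 6, 7, 8, 9]
```

The merged link keeps the geometry of the longer link and is used at the union of the two sets of steps, which is why disjoint labels are required.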
Define a maximal collapse of the communication requirement of Figure 6.26 (or a
substructure of this structure) as the result of the following procedure.
for level $\ell \leftarrow 0$ to $\lceil n/3 \rceil - 1$ do
    for each remaining level-$\ell$ link $\lambda$
        (i) collapse two of the remaining level-$(\ell+1)$ links into $\lambda$
        (ii) from each of levels $\ell+2, \ell+3, \ldots, \lceil n/3 \rceil - 1$ collapse three of
             the remaining links from that level into $\lambda$
    end
end
Figure 6.27: At each level, collapsed links are shown dotted (panels: initial; collapsed to level 0; collapsed to level 2; final result)
Figure 6.27 shows an example of a maximal collapse for a 4-level structure. Note the
above procedure allows a link $\lambda_1$ to be collapsed into another link $\lambda_2$ even if $\lambda_2$ does
not cover $\lambda_1$; however, each link $\lambda_1$ is collapsed into at most one other link $\lambda_2$.
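Because the procedure ignores coverage, it can be simulated on link counts alone. The following is a rough sketch, not the author's simulation: the function name and the `root_links` parameter are ours, level $\ell$ is assumed to start with `root_links * 2**l` links, and link lengths are assumed to double from one level to the next lower level.

```python
def maximal_collapse(levels: int, root_links: int = 2) -> list:
    """Count the links left at each level after the maximal collapse.

    Only counts are tracked, since the procedure does not check whether
    one link covers another.
    """
    remaining = [root_links * 2**l for l in range(levels)]
    for l in range(levels):
        survivors = remaining[l]               # level-l links that were not collapsed
        for m in range(l + 1, levels):
            per_link = 2 if m == l + 1 else 3  # step (i) versus step (ii)
            absorbed = min(per_link * survivors, remaining[m])
            remaining[m] -= absorbed
    return remaining

print(maximal_collapse(4))  # [2, 0, 2, 6]

left = maximal_collapse(4, root_links=1)
lengths = [8, 4, 2, 1]      # unit length at level 3, doubling toward level 0
print(left, sum(c * d for c, d in zip(left, lengths)))  # [1, 0, 1, 3] 13
```

With two level-0 links the counts are 2, 0, 2 and 6. Starting instead from a single level-0 link (our guess at what one half of the 4-level example looks like) leaves a total length of 13 when level-3 links have unit length.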
We now derive a formula for the number of links left at each level after following
the above maximal collapse procedure. Without loss of generality assume $n/3$ to be
an integer. Before any collapse, level-$\ell$ (where $0 \le \ell < n/3$) of the communication
structure has $2^{\ell+1}$ links. Let $\eta(\ell)$ denote the number of links left at level-$\ell$ after a
maximal collapse.
Clearly, level-0 links cannot be collapsed, so $\eta(0) = 2$. The 4 level-1 links are all
collapsed into the level-0 links (two in each), so $\eta(1) = 0$. For the remaining levels
$\ell > 1$, there are $2\eta(\ell-1)$ collapses into level-$(\ell-1)$ links and 3 into each of the remaining
level-$(\ell-2)$, level-$(\ell-3)$, $\ldots$, level-0 links (assuming level-$\ell$ has sufficient links for
the collapse).
Thus, we have the following relationships. For $\ell \ge 2$,
$$\eta(\ell) = 2^{\ell+1} - 2\eta(\ell-1) - 3\sum_{j=0}^{\ell-2} \eta(j).$$
Therefore,
$$\eta(\ell-1) = 2^{\ell} - 2\eta(\ell-2) - 3\sum_{j=0}^{\ell-3} \eta(j).$$
Subtracting the second equation from the first we have $\eta(\ell) + \eta(\ell-1) + \eta(\ell-2) = 2^{\ell}$.
That is, the total number of links in three consecutive levels $\ell$, $\ell-1$ and $\ell-2$ is $2^{\ell}$.
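The step from the two recurrences to the three-level identity can be spelled out as follows. This is a worked restatement, assuming the sums in the recurrences for $\eta(\ell)$ and $\eta(\ell-1)$ run to $\ell-2$ and $\ell-3$ respectively.

```latex
\begin{align*}
\eta(\ell) - \eta(\ell-1)
  &= \bigl(2^{\ell+1} - 2^{\ell}\bigr) - 2\eta(\ell-1) + 2\eta(\ell-2)
     - 3\Bigl(\sum_{j=0}^{\ell-2}\eta(j) - \sum_{j=0}^{\ell-3}\eta(j)\Bigr) \\
  &= 2^{\ell} - 2\eta(\ell-1) + 2\eta(\ell-2) - 3\eta(\ell-2) \\
  &= 2^{\ell} - 2\eta(\ell-1) - \eta(\ell-2),
\end{align*}
so that $\eta(\ell) + \eta(\ell-1) + \eta(\ell-2) = 2^{\ell}$.
```

As a quick check, $\eta(2) = 8 - 2\cdot 0 - 3\cdot 2 = 2$ and $\eta(3) = 16 - 2\cdot 2 - 3\cdot(2+0) = 6$, so $\eta(3) + \eta(2) + \eta(1) = 6 + 2 + 0 = 2^3$.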
Lemma 6.3 Assuming sufficient higher level links remain for a collapse, the total
length of wires after a maximal collapse of the communication structure of Figure 6.26
is $\Omega(n 2^n)$.
Proof: Without loss of generality, let $n/3 = h$ be an integer. For $0 \le k < h$, our
earlier observations give $\eta(3k) + \eta(3k+1) + \eta(3k+2) = 2^{3k+2}$. Since the shortest
wire of levels $3k$, $3k+1$ and $3k+2$ has length $\Omega(2^{n-3k})$, the total length of wires in
levels $3k$, $3k+1$ and $3k+2$ is $L(k) = \Omega(2^{n-3k} 2^{3k+2}) = \Omega(2^n)$. Thus the total wire
length is $\sum_{k=0}^{h-1} L(k) = \Omega(h 2^n) = \Omega(n 2^n)$. ■
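The bound can be checked numerically for small $n$. The sketch below makes the same assumptions as the proof: levels are indexed $0, \ldots, n-1$ with $n$ a multiple of 3, and a level-$\ell$ link is given length exactly $2^{n-\ell-1}$ (the stated minimum). The function names are ours. Each group of three consecutive levels then contributes at least $2^{3k+2} \cdot 2^{n-3k-3} = 2^{n-1}$, so the total is at least $h\,2^{n-1}$.

```python
def eta(levels: int) -> list:
    """Links left per level, from eta(0) = 2, eta(1) = 0 and the three-level identity."""
    values = [2, 0]
    for l in range(2, levels):
        values.append(2**l - values[l - 1] - values[l - 2])
    return values[:levels]

def total_length(n: int) -> int:
    """Total wire length if a level-l link has length exactly 2**(n - l - 1)."""
    return sum(count * 2**(n - l - 1) for l, count in enumerate(eta(n)))

for n in (3, 6, 9, 12):
    h = n // 3
    # total length after the maximal collapse versus the bound h * 2**(n - 1)
    print(n, total_length(n), h * 2**(n - 1))
```

For $n = 3$ the total is 10 against a bound of 4, and for $n = 6$ it is 138 against 64, consistent with growth of order $n 2^n$.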
The maximal collapse procedure collapses into lower level (longer) wires before it
gets to shorter wires. This is not the only method possible. For example, if shorter
wires from some level $\ell > 1$ were collapsed into both level-0 and level-1 wires, then
some level-1 wires can no longer be collapsed into level-0 wires.
Figure 6.28 shows another collapsing method. Notice that only one level-2 link
can be collapsed to each level-1 link. This is because each level-2 link has two level-3
links collapsed into it. As a result, any set of two level-2 links must have 4 level-3
links collapsed into them, guaranteeing at least one duplicate label (as each level has
3 labels). Thus collapsing two level-2 links into a level-1 link would be equivalent to
collapsing 4 level-3 links into a level-1 link; this is not permitted. Assuming unit length
for the level-3 links, the collapse of Figure 6.28 leaves links whose total length is at
least 16. By the same token, the maximal collapse of Figure 6.27 has a length of at
least 13.
Figure 6.28: A different collapse (panels: collapsed to level 2; collapsed to level 1; collapsed to level 0)
Clearly many other approaches are possible. The question is “which one leads to
the best possible collapse with the shortest total wire length?” Computer simulations
seem to indicate that the maximal collapse produces the smallest length of links.
Therefore, we have the following conjecture.
Conjecture 1 No collapsing procedure reduces the total wire length more than the
maximal collapse.
We now state the main result of this chapter.
Theorem 6.4 If Conjecture 1 is true, then the height of a perimeter layout of any
$2^n$-processor optimal-time binary-tree MBN is $\Omega(n)$.
Proof: Without loss of generality, assume that the processors are placed a unit
distance apart. (They certainly cannot be placed closer, and if they are spread further
apart, then their wire length is proportionately larger.) Thus the “width” of the layout
can be assumed to be $O(2^n)$. From Lemma 6.3 and Conjecture 1 the layout height is
$\Omega\left(\frac{n 2^n}{2^n}\right) = \Omega(n)$. ■
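The final step can be written out as a short calculation. It is a sketch that assumes, as is usual in this kind of layout argument, that wires of unit thickness must fit inside the width-by-height rectangle, so that the total wire length is at most width times height.

```latex
\[
\text{height} \;\ge\; \frac{\text{total wire length}}{\text{width}}
\;=\; \frac{\Omega(n\,2^{n})}{O(2^{n})} \;=\; \Omega(n).
\]
% With N = 2^n processors, n = \log_2 N, so the same wire length reads
\[
\Omega(n\,2^{n}) \;=\; \Omega(N \log N).
\]
```

The second line is only a change of variable; it matches the form in which the area bound is quoted in terms of the number of inputs $N$.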
Chapter 7 
Summary and Future Work
In this research we have investigated various issues on running binary-tree algorithms
on MBNs. We have identified relationships among important MBN parameters and
established some non-trivial lower bounds. Most of the results are general and apply
to all (or a very large class of) binary-tree MBNs. We have developed some novel
techniques that may find use in solving problems in other related areas. Most of the
results also extend to k-ary trees for k > 2.
In Chapter 3 we investigated the relationships among loading, degree and running 
time of binary-tree MBNs. We developed an accounting scheme to count the number 
of connections on a bus. We established a series of lower bounds on the loading 
of optimal-time, degree-2 MBNs for running $2^n$-input binary-tree algorithms. The
tightest of these bounds established the loading to  be We also identified two
important mappings called direct and indirect and established that indirect mapping
is essential to achieving constant loading. This result is somewhat surprising, because
indirect mapping increases the number of communications. We also showed that if
the degree is increased to 3, then an optimal-time, constant-loading binary-tree MBN
exists. We constructed the degree-3, loading-3, Tree MBN with the best possible
degree-loading product.
In Chapter 3 we also investigated the possibility of making trade-offs between
the running time and loading. We showed that by increasing the running time by
a constant factor, loading can be reduced by a non-constant factor. Specifically, we
established that if the additional time (beyond the optimal) used by the MBN is t,
and if the largest problem size that can be solved in optimal time on a loading-L,
degree-2, binary-tree MBN is 2T̂ ,  then t  > [T̂ +l ]• We presented an example of 
a degree-2, loading-4, (2n − 3)-step binary-tree MBN that matches this bound (to
within a constant factor) when L is constant.
In Chapter 4 we used MBNs to enhance 2-dimensional meshes. We showed that
this method of connecting processors together by multiple buses has significant
advantages over the conventional single-bus approach. It
allows all existing algorithms on enhanced meshes to be automatically translated into
a more implementable platform with a realistic loading. As an MBN can employ
a single bus, our architecture captures all features of most existing enhanced mesh
architectures. We derived the running time, loading, degree, number of buses, VLSI
area and the aspect ratio of meshes enhanced with the Tree MBN, and showed that
our results are better than the best previous results. We also studied buses with
segment switches, and showed that segment switches help to reduce loading.
In Chapter 5 we introduced two methods of imparting fault tolerance to MBNs.
We accomplished this by adding connections in a controlled manner to MBNs that
are not tolerant to faults. The first method, called replication, is a general method
that can be used with any MBN (not only binary-tree MBNs) and for both processor
and bus faults. An important feature of replication is that it allows a designated set
of buses/processors to be treated as faulty, regardless of which buses/processors are
actually faulty. This allows the network designer to designate a set of less important
buses/processors to be faulty. The second method, called recursive scheduling, is
specific to bus faults in binary-tree MBNs. It uses the features of binary-tree MBNs
to achieve better speeds compared to replication. The methods for bus faults are
independent of those for processor faults. Therefore, tolerance to processor faults can
be imparted to an MBN that is already tolerant to bus faults and vice versa.
In Chapter 6 we investigated the VLSI area requirement for a perimeter layout
of optimal-time, binary-tree MBNs. The corresponding problem for the balanced binary
tree topology is well studied. Unlike in a complete binary tree, however, a binary-tree
algorithm uses only one level of the tree at a step. Therefore, binary-tree MBNs
could reuse the same buses at different steps of the algorithm. We developed a
technique to identify the minimum communication requirements for perimeter layouts of
optimal-time, binary-tree MBNs and then to “collapse” links to mimic bus reuse. We
conjectured that a particular collapsing scheme minimizes the total wire length.
(Several computer simulations seemed to indicate that this conjecture is true.) Assuming
this conjecture to be true, we established an $\Omega(N \log N)$ lower bound on the VLSI
area of a perimeter layout for optimal-time MBNs for N-input binary-tree algorithms.
Future Work: We believe that the lower bound on the loading established
in Chapter 3 is not tight. This is based on the existence of an optimal-time, degree-2,
loading-$O(n)$ binary-tree MBN [85] and the fact that degree-2 MBNs tend to introduce
a large number of direct nodes for binary-tree algorithms. This is because the ability
of a processor to get rid of a partial result while receiving two new partial results is
crucial for small loading, and this is not possible on a degree-2 MBN. Future work
in this area can focus on bridging the gap between the lower bound and the
$O(n)$ upper bound. A possible approach for this could be to combine the methods
used for establishing the Q(rir) and Q ( i ^ )  lower bounds.
The fault tolerance results of Chapter 5 can handle k faults within a given
binary-tree MBN. Extension of these methods to the enhanced mesh architecture of Chapter 4
is sensitive to the number of faults in an MBN building block, rather than the entire
network. That is, if there are k faults distributed in the entire enhanced mesh, then
the best way to address this problem is not known. Currently, the only fail-safe way
to handle k faults in the entire enhanced mesh is to assume that each of the MBNs
can tolerate k faults. This approach could be wasteful for large k.
In Chapter 6, we conjectured that the method used for collapsing the links in
the communication structure is optimal. Establishing that this is indeed the best is still
open. Also we only investigated the area requirements of optimal-time, $N \times \frac{N}{2}$ MBNs.
The area requirements for $N \times M$ (for $M < \frac{N}{2}$) binary-tree MBNs and sub-optimal
time MBNs are still open problems.
Bibliography
[1] A. Aggarwal, “Mesh Connected Computers with Fixed and Reconfigurable Buses: 
Packet Routing and Sorting,” IEEE Trans. Computers, 45, 1996, pp. 529-539.
[2] A. Ali and R. Vaidyanathan, “Exact Bounds on Running ASCEND/DESCEND and 
FAN-IN Algorithms on Synchronous Multiple Bus Networks,” IEEE Trans. Parallel & 
Distributed Systems, 7, 1996, pp. 783-790.
[3] B. E. Aupperle and J. F. Meyer, “Fault-Tolerant BIBD Networks,” Proc. International 
Symposium on Fault Tolerant Computing, 1988, pp. 306-311.
[4] A. Bar-Noy and D. Peleg, “Square Meshes are not always Optimal,” IEEE Trans. 
Computers, 40, 1991, pp. 196-203.
[5] P. Berthome, Th. Duboux, T. Hagerup, I. Newman, and A. Schuster, “Self-Simulation 
for the Passive Optical Star Model,” Proc. 3rd European Symposium on Algorithms, 
vol. 979, 1995, pp. 369-380.
[6] C. Berge, Hypergraphs, North Holland Mathematical Library, vol. 45, 1989.
[7] D. Bhagavathi, V. Bokka, H. Gurla, S. Olariu, and J. L. Schwing, “Time-Optimal 
Visibility-related Algorithms on Meshes with Multiple Broadcasting,” IEEE Trans. 
Parallel & Distributed Systems, 6, 1995, pp. 687-702.
[8] D. Bhagavathi, P. J. Looges, S. Olariu, J. L. Schwing, and J. Zhang, “A Fast Selection 
Algorithm for Meshes with Multiple Broadcasting,” IEEE Trans. Parallel & Distributed 
Systems, 5, 1994, pp. 772-777.
[9] L. N. Bhuyan and A. K. Nanda, “Multistage Bus Networks (MBN): An Interconnection 
Network for Cache Coherent Multiprocessors,” Proc. 3rd IEEE Symposium on Parallel 
and Distributed Processing, 1991, pp. 780-787.
[10] S. H. Bokhari, “Finding Maximum on an Array Processor with a Global Bus,” IEEE 
Trans. Computers, 33, 1984, pp. 133-139.
[11] V. Bokka, H. Gurla, S. Olariu, J. L. Schwing, and L. Wilson, “Time-Optimal 
Domain-Specific Querying on Enhanced Meshes,” IEEE Trans. Parallel & Distributed 
Systems, 8, 1997, pp. 13-23.
[12] D. Bulka and J. B. Dugan, “Design and Analysis of Multibus Systems Using Projective 
Geometry,” Proc. International Symposium on Fault Tolerant Computing, 1992, pp. 
122-129.
[13] D. A. Carlson, “Solving Linear Recurrence Systems on Mesh Connected Computers 
with Multiple Global Buses,” J. Parallel & Distributed Computing, 8, 1990, pp. 89-95.
[14] C.-H. Chen and F.-F. Lin, “An Easy to Use Approach for Practical Bus-Based System 
Design,” IEEE Trans. Computers, 48, 1999, pp. 780-793.
[15] T. Chen, T. Kang and R. Yao, “The Connectivity in Hypergraphs and the Design 
of Fault-Tolerant Multiple Bus Systems,” Proc. International Symposium on Fault 
Tolerant Computing, 1988, pp. 374-379.
[16] W. T. Chen and J. P. Sheu, “Performance Analysis of Multiple Bus Interconnection 
Networks with Hierarchical Requesting Models,” IEEE Trans. Computers, 40, 1991, 
pp. 834-842.
[17] Y. Chen, W. Chen, G. Chen, and J. Sheu, “Designing Efficient Parallel Algorithms 
on Mesh Connected Computers with Multiple Broadcasting,” IEEE Trans. Parallel & 
Distributed Systems, 1, 1990, pp. 241-246.
[18] I. Chlamtac and S. Kutten, “Tree-based Broadcasting in Multihop Radio Networks,” 
IEEE Trans. Computers, 36, 1987, pp. 1209-1223.
[19] K.-L. Chung, “Prefix Computations on a Generalized Mesh-Connected Computer with 
Multiple Buses,” IEEE Trans. Parallel & Distributed Systems, 6, 1995, pp. 196-199.
[20] D. Coudert, A. Ferreira and X. Munoz, “Multiprocessor Architectures Using Multi-OPS
Lightwave Networks and Distributed Control,” International Parallel Processing
Symposium, 12, 1998, pp. 151-155.
[21] C. Das and L. Bhuyan, “Bandwidth Availability of Multiple-Bus Multiprocessors,” 
IEEE Trans. Computers, 34, 1985, pp. 918-926.
[22] R. Decher and L. Kleinrock, “Broadcast Communication and Distributed Algorithms,” 
IEEE Trans. Computers, 35, 1986, pp. 210-219.
[23] H. P. Dharmasena and R. Vaidyanathan, “An-Optimal Multiple Bus Networks for 
Fan-in Algorithms,” Proc. International Conference on Parallel Processing, 1997, 
pp. 100-103.
[24] H. P. Dharmasena and R. Vaidyanathan, “Lower Bound on the Loading of Degree-2 
Multiple Bus Networks for Binary-Tree Algorithms,” Proc. International Parallel 
Processing Symposium, 1999, pp. 21-25.
[25] O. M. Dighe, R. Vaidyanathan and S. Q. Zheng, “The TBN: A Versatile Building
Block for VLSI Parallel Architectures,” Proc. 6th ISCA International Conference on
Computer Applications in Design, Simulation and Analysis, 1993, pp. 52-55.
[26] O. M. Dighe, R. Vaidyanathan and S. Q. Zheng, “Bus-Based Tree Structure for 
efficient Parallel Computation,” Proc. International Conference on Parallel Processing, 
1993, pp. 158-161.
[27] O. M. Dighe, R. Vaidyanathan and S. Q. Zheng, “The Bus-Connected Ringed Tree:
A Versatile Interconnection Network,” J. Parallel & Distributed Computing, 33, 1996, 
pp. 189-196.
[28] M. Dubois, “Throughput Analysis of Cache-Based Multiprocessors with Multiple
Buses,” IEEE Trans. Computers, 37, 1988, pp. 58-70.
[29] M. Feldman, R. Vaidyanathan and A. El-Amawy, “High Speed, High Capacity Bused 
Interconnects Using Optical Slab Waveguides,” Proc. 1999 Workshop on Optics in 
Computer Science, Springer Verlag Lecture Notes in Computer Science, vol. 1586, pp. 
924-937.
[30] S. Fujita and M. Yamashita, “Fast Gossiping on Mesh-Bus Computers,” IEEE Trans.
Computers, 45, 1996, pp. 1326-1330.
[31] A. Ghafoor, A. L. Goel, J. K. Chan and S. Sheikh, “Reliability Analysis of a 
Fault-Tolerant Multi-Bus Multiprocessor Systems,” Proc. 3rd IEEE Symposium on 
Parallel and Distributed Processing, 1991, pp. 436-443.
[32] R. Giorgi and C. Antonio, “PSCR: A Coherence Protocol for Eliminating Passive 
Sharing in Shared-Bus Shared-Memory Multiprocessors,” IEEE Trans. Parallel & 
Distributed Systems, 7, 1999, pp. 742-761.
[33] Z. Guo and R. G. Melhem, “Embedding Binary X-Trees and Pyramids in Processor
Arrays with Spanning Buses,” IEEE Trans. Parallel & Distributed Systems, 5, 1994, 
pp. 664-672.
[34] J. D. Hadley and B. L. Hutchings, “Design Methodologies for partially Reconfigured 
Systems,” Proc. Workshop on FPGAs for Custom Computing Machines, 1995, pp. 
78-84.
[35] T. Hayashi, K. Nakano and S. Olariu, “Randomized Initialization Protocols for 
Packet Radio Networks,” Proc. International Parallel Processing Symposium, 1999, pp. 
544-548.
[36] M. A. Holliday and M. K. Vernon, “Exact Performance Estimates for Multiprocessor 
Memory and Bus Interferences,” IEEE Trans. Computers, 36, 1987, pp. 76-85.
[37] K. Hwang, P. S. Tseng and D. Kim, “An Orthogonal Multiprocessor for Parallel 
Scientific Computations,” IEEE Trans. Computers, 36, 1989, pp. 47-60.
[38] J. JaJa, An Introduction to Parallel Algorithms, Addison-Wesley Publishing Co., 1992.
[39] H. Jiang and K. C. Smith, “PPMB: A Partial-Multiple-Bus Multiprocessor Architecture
with Improved Cost Effectiveness,” IEEE Trans. Computers, 41, 1992,
pp. 361-366.
[40] S. T. Kamath and R. Vaidyanathan, “Running Weak Hypercube Algorithms on Multiple
Bus Networks,” Proc. ISCA International Conference on Parallel and Distributed
Systems, 1997, pp. 217-222.
[41] M. A. S. Khalid and J. Rose, “Hardwired-Clusters Partial-Crossbar: A Hierarchical
Routing Architecture for Multi-FPGA Systems,” Proc. 6th Reconfigurable Architecture
Workshop, 1999, pp. 597-605.
[42] J. Kilian, S. Kipnis and C. E. Leiserson, “The Organization of Permutation Architectures
with Bussed Interconnections,” IEEE Trans. Computers, 39, 1990, pp. 1346-1358.
[43] J. Kim, “Segmented Multiple Bus Systems,” Ph.D. Thesis, Dept. of Electrical &
Computer Eng., Louisiana State University, 1997.
[44] J. H. Kim and P. K. Rhee, “The Rule-Based Approach to Reconfiguration of 2-D 
Processor Arrays,” IEEE Trans. Computers, 42, 1993, pp. 1403-1408.
[45] H.-K. Ku and J. P. Hayes, “Connective Fault Tolerance in Multiple-Bus Systems,”
IEEE Trans. Parallel & Distributed Systems, 8, 1997, pp. 574-586.
[46] P. Kulasinghe and A. El-Amawy, “On the Complexity of Bussed Interconnections,” 
IEEE Trans. Computers, 44, 1995, pp. 1248-1251.
[47] P. Kulasinghe and A. El-Amawy, “Optimal Realizations of Sets of Interconnection 
Functions on Synchronous Multiple Bus Systems,” IEEE Trans. Computers, 45, 1996, 
pp. 964-969.
[48] P. Kulasinghe, “Combinatorial Analysis and Design of Optimal Multiple Bus Systems
for Parallel Algorithms,” Ph.D. Thesis, Dept. of Electrical & Computer Eng., Louisiana
State University, 1995.
[49] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays · Trees
· Hypercubes, Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[50] Y. Li, J. Wu and S. Q. Zheng, “An Interconnection Network Based on the Dual
of a Hypercube,” Proc. ISCA International Conference on Parallel and Distributed
Systems, 1997, pp. 263-268.
[51] R. Lin, S. Olariu, J. L. Schwing and B. F. Wang, “The Mesh with Hybrid Buses: 
An Efficient Parallel Architecture for Digital Geometry,” IEEE Trans. Parallel & 
Distributed Systems, 10, 1999, pp. 266-280.
[52] J. M. Marberg and E. Gafni, “Sorting and Selection in Multi-Channel Broadcast
Networks,” Proc. International Conference on Parallel Processing, 1985, pp. 846-850.
[53] S. M. Mahmud, “Performance Analysis of Multilevel Bus Networks for Hierarchical 
Multiprocessors,” IEEE Trans. Computers, 7, 1994, pp. 789-799.
[54] M. D. Mickunas, “Using Projective Geometry to Design Bus-Connection Networks,” 
Proc. ACM/IEEE Workshop on Interconnection Networks for Parallel and Distributed 
Processing, 1980, pp. 47-55.
[55] T. N. Mudge and A. B. Al-Sadoun, “A Semi-Markov Model for the Performance of
Multiple-Bus Systems,” IEEE Trans. Computers, 34, 1985, pp. 934-942.
[56] T. N. Mudge, J. P. Hayes and D. C. Winsor, “Multiple Bus Architectures,” IEEE 
Computer, 1987, pp. 42-48.
[57] S. Nadella, “Fault Tolerant Multiple Bus Networks for Fan-in Algorithms,” Masters
Thesis, Dept. of Electrical & Computer Eng., Louisiana State University, 1993.
[58] H. Nagano, A. Matsura and A. Nagoya, “An Efficient Implementation Method of
Fractal Image Compression on Dynamically Reconfigurable Architecture,” Proc. 6th
Reconfigurable Architecture Workshop, 1999, pp. 670-678.
[59] K. Nakano, “A Bibliography of Published Papers on Dynamically Reconfigurable 
Architectures,” Parallel Processing Letters, 5, 1995, pp. 111-124.
[60] K. Nakano, S. Olariu, and J. L. Schwing, “Broadcast-Efficient Sorting in the Presence of 
Few Channels,” Proc. International Conference on Parallel Processing, 1997, pp. 12-15.
[61] K. Nakano, S. Olariu and J. L. Schwing, “Broadcast-Efficient Algorithms on the
Coarse-Grain Broadcast Communication Model with Few Channels,” Proc. International
Parallel Processing Symposium, 1998, pp. 1-6.
[62] D. Nassimi, “Parallel Algorithms for Classes (±26) DESCEND and ASCEND Computations
on a SIMD Hypercube,” IEEE Trans. Parallel & Distributed Systems, 4, 1993,
pp. 1372-1381.
[63] A. Padmanabhan, “Design of Multibus Networks for ASCEND/DESCEND and
FAN-IN Algorithms,” Masters Thesis, Dept. of Electrical & Computer Eng., Louisiana
State University, 1992.
[64] Y. Pan, S. Q. Zheng, K.- L., and Hong Shen, “Semigroup and Prefix Computations
on Improved Generalized Mesh-Connected Computers with Multiple Buses,”
To appear in Proc. International Symposium on Parallel & Distributed Processing, 2000.
[65] G. Panchapakesan and A. Sengupta, “On a Light Wave Network Topology Using
Kautz Digraphs,” IEEE Trans. Computers, 48, 1999, pp. 1131-1137.
[66] R. C. Pearce, J. A. Field and W. D. Little, “Asynchronous Arbiter Module,” IEEE 
Trans. Computers, 24, 1975, pp. 931-932.
[67] D. K. Pradhan, “Fault-Tolerant Multiprocessor Link and Bus Network Architectures,” 
IEEE Trans. Computers, 34, 1985, pp. 33-45.
[68] D. K. Pradhan, Z. Hanquan and M. L. Schlumberger, “Fault-Tolerant Multibus 
Architectures for Multiprocessors,” Proc. Symp. on Fault-Tolerant Computing, 1984, 
pp. 400-408.
[69] V. K. Prasanna Kumar and C. S. Raghavendra, “Array Processor with Multiple 
Broadcasting,” J. Parallel & Distributed Computing, 4, 1987, pp. 173-190.
[70] C. Qiao and R. G. Melhem, “Time-Division Optical Communications in Multiprocessor
Arrays,” IEEE Trans. Computers, 42, 1993, pp. 577-590.
[71] C. S. Raghavendra, “HMESH: A VLSI Architecture for Parallel Processing,” Proc. 
Conference on Algorithms and Hardware for Parallel Processing, Springer Verlag 
Lecture Notes in Computer Science, vol. 237, 1986, pp. 76-83.
[72] S. Rajasekaran, “Mesh Connected Computers with Fixed and Reconfigurable Buses: 
Packet Routing and Sorting,” IEEE Trans. Computers, 45, 1996, pp. 529-539.
[73] M. R. Samatham and D. K. Pradhan, “The De Bruijn Multiprocessor Network: A
Versatile Parallel Processing and Sorting Network for VLSI,” IEEE Trans. Computers, 
38, 1989, pp. 567-581; Corrections in IEEE Trans. Computers, 40, 1991, p. 122.
[74] S. M. Scalera, J. J. Murray and S. Lease, “A Mathematical Benefit Analysis of Context 
Switching Reconfigurable Computing,” Proc. Reconfigurable Architecture Workshop, 
1998, pp. 73-78.
[75] M. J. Serrano and B. Parhami, “Optimal Architectures and Algorithms for
Mesh-Connected Parallel Computers with Separable Row/Column Buses,” IEEE Trans.
Parallel & Distributed Systems, 4, 1993, pp. 1073-1080.
[76] Q. F. Stout, “Mesh-Connected Computer with Broadcasting,” IEEE Trans. Computers,
32, 1983, pp. 826-829.
[77] R. K. Thiruchelvan, J. L. Trahan and R. Vaidyanathan, “On the Power of Segmenting
and Fusing Buses,” J. Parallel & Distributed Computing, 34, 1996, pp. 82-94.
[78] C. D. Thompson, “Area-time Complexity for VLSI,” Proceedings of the 11th Annual
ACM Symposium on Theory of Computing, 5, 1979, pp. 81-88.
[79] C. D. Thompson, “A Complexity Theory for VLSI,” Ph.D. Thesis, Dept. of Computer
Science, Carnegie-Mellon University, 1980.
[80] J. Ullman, Computational Aspects of VLSI, Computer Science Press, Potomac, MD,
1984.
[81] A. Varma, “Combinatorial Design of Bus-based Interconnection Structures,” Research 
Report RC 12550, IBM Research Division, September, 1986.
[82] R. Vaidyanathan, C. R. P. Hartmann and P. K. Varshney, “Running ASCEND, DESCEND
and PIPELINE Algorithms in Parallel Using Small Processors,” Information
Processing Letters, 46, 1993, pp. 31-36.
[83] R. Vaidyanathan, “Design of Multiple Bus Interconnection Networks for Fan-in 
Computations,” Proc. 29th Annual Allerton Conf. on Communication, Control & 
Computing, 1991, pp. 1093-1102.
[84] R. Vaidyanathan and S. Nadella, “Fault-Tolerant Multiple Bus Networks for Fan-In 
Algorithms,” Proc. International Parallel Processing Symposium, 1996, pp. 674-681.
[85] R. Vaidyanathan and A. Padmanabhan, “Bus-Based Networks for Fan-in and Uniform 
Hypercube Algorithms,” Parallel Computing, 21, 1995, pp. 1807-1821.
[86] R. Vaidyanathan and J. L. Trahan, “Optimal Simulation of Multidimensional Reconfigurable
Meshes by Two Dimensional Reconfigurable Meshes,” Information Processing
Letters, 47, 1993, pp. 267-273.
[87] T. A. Varvarigou, V. P. Roychowdhury and T. Kailath, “Reconfiguring Processor 
Array Using Multiple Track Models: The 3-Track-1-Spare-Approach,” IEEE Trans. 
Computers, 42, 1993, pp. 1281-1293.
[88] J. Villasenor and W. H. Mangione-Smith, “Configurable Computing,” Scientific 
American, vol. 276, no. 6, 1997, pp. 66-71.
[89] J. F. Wakerly, Digital System Design, Principles & Practice, Prentice Hall Inc., Upper 
Saddle River, NJ, 1994.
[90] D. C. Winsor and T. N. Mudge, “Analysis of Bus Hierarchies for Multiprocessors,” 
Proc. International Symposium on Computer Architecture, 1988, pp. 100-107.
[91] M. J. Wirthlin and B. L. Hutchings, “DISC: The Dynamic Instruction Set Computer,” 
Proc. of the SPIE - Field Programmable Gate Arrays (FPGAs) for Fast Board and
Reconfigurable Computing, vol. 2607, 1995, pp. 92-103.
[92] M. Wojko and H. ElGindy, “Configuration Sequencing with Self-Configurable Binary 
Multipliers,” Proc. 6th Reconfigurable Architecture Workshop, 1999, pp. 643-651.
[93] “Xilinx Inc. XC4000E and XC4000X Series Field Programmable Gate Arrays,” product
specification, 1997.
[94] Q. Yang and L. N. Bhuyan, “Analysis of Packet-Switched Multiple-Bus Multiprocessor 
Systems,” IEEE Trans. Computers, 40, 1991, pp. 352-357.
Vita
Hettihewage Prasanna Dharmasena was born in Kurunegala, Sri Lanka. He received 
the bachelor of science degree in electronics and telecommunication engineering from 
the University of Moratuwa, in 1983. From 1983 to 1985 he worked as an assistant 
lecturer at the same university. He received the degree of master of science in electrical
and computer engineering from Louisiana State University in 1987. He is currently
employed as a computer analyst at Louisiana State University. He will receive the 
degree of Doctor of Philosophy in May, 2000.
DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: H. P. Dharmasena
Major Field: Electrical Engineering




Date of Examination:
March 23, 2000
