Combinatorial Design and Analysis of Optimal Multiple Bus Systems for Parallel Algorithms. by Kulasinghe, Priyalal D
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1995
Combinatorial Design and Analysis of Optimal
Multiple Bus Systems for Parallel Algorithms.
Priyalal D. Kulasinghe
Louisiana State University and Agricultural & Mechanical College
Recommended Citation
Kulasinghe, Priyalal D., "Combinatorial Design and Analysis of Optimal Multiple Bus Systems for Parallel Algorithms." (1995). LSU
Historical Dissertations and Theses. 6026.
https://digitalcommons.lsu.edu/gradschool_disstheses/6026
COMBINATORIAL DESIGN AND ANALYSIS OF 
OPTIMAL MULTIPLE BUS SYSTEMS 
FOR PARALLEL ALGORITHMS
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and 
Agricultural and Mechanical College 
in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering
by
Priyalal D. Kulasinghe 
B.Sc., University of Peradeniya, Sri Lanka, 1981 
M.S., Louisiana State University, 1990 
August 1995
UMI Number: 9609100
UMI Microform 9609100 
Copyright 1996, by UMI Company. All rights reserved.
ACKNOWLEDGEMENTS
I would like to express my thanks and gratitude to my advisor, Dr. Ahmed El-Amawy, 
for his constant support, guidance, and encouragement during this research work.
I also want to extend my thanks to Dr. Said Bettayeb, who has helped me in many 
ways in developing my research activities. The helpful suggestions and other support 
given by the other committee members, Dr. Subhash Kak, Dr. R. Vaidyanathan, Dr. J. 
Ramanujam, Dr. P. Bhattacharya, Dr. Gil S. Lee, and Dr. O. Hidalgo-Salvatierra, are also 
appreciated.
Special thanks to my wife Eshani who was always there whenever I needed help 
throughout the course of my studies at LSU. Without her encouragement and sacrifice, 
this work would not have been completed.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Classification of Interconnection Networks
1.1.1 Topology
1.1.2 Operation Mode
1.1.3 Interprocessor Communication Models
1.2 Classification of Parallel Algorithms
1.3 Bused Interconnection Networks
1.3.1 Single Bus Systems
1.3.2 Multiple Bus Systems
1.4 Design Issues of Parallel Architectures
1.5 Research Objectives
1.6 Model and Design Criteria
1.6.1 Computational Model
1.6.2 Architectural Model
1.6.3 Design Criteria
1.7 Outline of the Dissertation
CHAPTER 2 PROCESSOR ASSIGNMENT
2.1 Preliminaries
2.2 Optimal Processor Assignment
2.3 Computational Complexity of the Optimal Partition Problem
2.4 Exploiting Regularities in the Source Algorithm
2.5 Optimal Vertex Partitioning of a Locally Regular CFG
2.6 Processor Scheduling
CHAPTER 3 BUS ASSIGNMENT
3.1 Preliminaries
3.2 Construction of an MBG from a given IFG
3.3 Incorporating Broadcasting Operations
3.4 Optimal Color Partition of an IFG
3.5 Computational Complexity of the Optimal Color Partitioning Problem
3.6 Optimal Color Partition on an IFG with only Two Colors
3.7 Heuristic Algorithm for the General Case
3.8 Performance of the Algorithm
CHAPTER 4 OPTIMAL PROCESSOR ASSIGNMENT FOR VERTEX SYMMETRIC IFGs
4.1 Preliminaries
4.2 Optimal Color Partitioning of a Cayley Color Graph
4.3 Properties of MBSs Realizing Symmetric IFGs
4.3.1 Symmetry
4.3.2 Number of Ports per Processor
4.3.3 Number of Neighbors per Processor
4.3.4 Diameter
4.4 Optimal Color Partition of a Regular IFG
4.5 Optimal Color Partition of C_Δ(Γ) when Δ is Redundant
CHAPTER 5 FAULT TOLERANCE
5.1 Preliminaries
5.2 Failure of a Bus
5.3 Performance Degradation of M(Qn) due to a Bus Failure
5.4 Failure of an Interface
5.5 Performance Degradation of M(Qn) due to an Interface Failure
5.6 Failure of a Processor
5.7 Performance Degradation of M(Qn) due to a Processor Failure
5.8 Inclusion of Redundancy
CHAPTER 6 CONSTRUCTION WITH GIVEN NUMBER OF PROCESSORS AND BUSES
6.1 MBS with Given Number of Processors
6.1.1 Isomorphism
6.1.2 Uniform Merging
6.1.3 Regular Secondary Coloring
6.1.3.1 Δ is a Normal Subgroup of Γ
6.1.3.2 Δ is not a Normal Subgroup of Γ
6.1.4 Scheduling
6.2 MBS with Given Number of Buses
CHAPTER 7 SUMMARY AND CONCLUSIONS
REFERENCES
VITA
ABSTRACT
This dissertation develops a formal and systematic methodology for designing 
optimal, synchronous multiple bus systems (MBSs) realizing given (classes of) parallel 
algorithms. Our approach utilizes graph and group theoretic concepts to develop the 
necessary model and procedural tools. By partitioning the vertex set of the graphical 
representation CFG of the algorithm, we extract a set of interconnection functions that 
represents the interprocessor communication requirement of the algorithm. We prove that 
the optimal partitioning problem is NP-Hard. However, we show how to obtain 
polynomial time solutions by exploiting certain regularities present in many well-behaved 
parallel algorithms.
The extracted set of interconnection functions is represented by an edge colored, 
directed graph called interconnection function graph (IFG). We show that the problem of 
constructing an optimal MBS to realize an IFG is NP-Hard. We show important special 
cases where polynomial time solutions exist. In particular, we prove that polynomial time 
solutions exist when the IFG is vertex symmetric. This is the case of interest for the vast 
majority of important interconnection function sets, whether extracted from algorithms or 
corresponding to existing interconnection networks. We show that an IFG is vertex 
symmetric if and only if it is the Cayley color graph of a finite group Γ and its generating 
set Δ. Using this property, we present a particular scheme to construct a symmetric MBS 
M(Γ, Δ) with a minimum number of buses as well as a minimum number of interfaces 
realizing a vertex symmetric IFG.
We demonstrate several advantages of the optimal MBS M(Γ, Δ) in terms of its 
symmetry, number of ports per processor, number of neighbors per processor, and the
diameter. We also investigate the fault tolerant capabilities and performance degradation 
of M(Γ, Δ) in the case of a single bus failure, single driver failure, single receiver failure, 
and single processor failure. Further, we address the problem of designing an optimal 
MBS realizing a class of algorithms when the number of buses and/or processors in the 
target MBS are specified. The optimality criteria are maximizing the speed and 
minimizing the number of interfaces.
CHAPTER 1
INTRODUCTION
Many of today's scientific and industrial applications such as image processing, 
weather forecasting, fluid dynamics, plasma physics simulations, robot vision, molecular 
biology, neural computing, and seismology require enormous amounts of computing 
power. For example, in molecular biology, simulating a protein molecule with only a 
few thousand atoms over a time span of 1 ps would require 100 years on a Cray 2. Due to 
technological limitations, these higher computational demands cannot be effectively met 
by a single processor system. A logical solution would be to use parallel processing. In 
a parallel processing system, several processors collectively solve a given problem by 
having each processor work on a different part of the problem simultaneously and 
exchanging messages over a network of communication links. Today, parallel processors 
are commercially available, and they range from systems with a few processors to systems 
with thousands of processors. An example from the low end is the Encore Multimax 
[118], in which up to 20 processors are connected by a single bus. An example from the 
high end is the Connection Machine [142], in which a maximum of 64 thousand 
processors are connected by a boolean hypercube.
A parallel processing system, or a parallel architecture, mainly consists of a set 
of processors, a set of memory modules, and an interconnection network that provides 
communication paths among processors and memory modules. Each processor 
simultaneously executes a subtask of the problem at hand. Whenever one processor needs 
information produced by another processor, they communicate via the interconnection 
network. Therefore, a parallel processing system executing a given problem spends time
not only on computation but also on communication. The major concern in a parallel 
processing system, which is not present in a single processor system, is the cost of (time 
spent on) the communication among processors. In many systems, the larger the number 
of processors in the system, the higher the communication cost. When an 
algorithm runs on a massively parallel computer, more time may be spent on communication 
than on computation [37]. The communication among processors and memory modules 
is governed by the interconnection network. In fact, a major distinguishing feature of one 
multiprocessor system from another is its interconnection network. The focus of this 
dissertation is on the interconnection network of parallel processing systems. In the 
literature, a large number of interconnection networks have been proposed for interprocessor 
communication. They include hypercubes [119], ring [103], tree [40], [69], cube 
connected cycles [115], omega network [92], data manipulator network [47], generalized 
cube network [125], [128], De Bruijn network [121], single bus multiprocessor [25], and 
the multiple bus multiprocessor [42], [66], [109]. Each topology has its own merits, and 
the selection of a topology is rather application dependent. For example, mesh connected 
systems are better suited for applications like image processing and weather simulation, 
while hypercube systems are better suited for applications like FFT computation and 
bitonic sorting. For a survey and comparison of the relative merits of these topologies, the 
reader is referred to [20], [43], [46], [127].
1.1 Classification of Interconnection Networks
Since it is a very difficult task to perform a comprehensive study on every 
available parallel architecture, it is necessary to divide them into several general classes. 
Different classifications of interconnection networks can be made by viewing them from
different perspectives. The works reported in [20], [48], [50], [59], and [72] present 
different kinds of taxonomy for interconnection networks. Here, we do not attempt to 
exhaust all possible classifications of interconnection networks. For the purpose of this 
dissertation, parallel architectures will be classified depending on their topology, operation 
mode, and communication strategy. These classifications are not strict, and some 
overlap may exist among different classes.
1.1.1 Topology
The topology of an interconnection network tells us how processors are connected 
to each other. The topology also determines the diameter, the node degree, the bisection 
width, and the connectivity of the interconnection network. In this dissertation we divide 
interconnection networks into three broad topological classes, called direct link, switched, 
and bused.
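To make these topological parameters concrete, the following Python sketch computes the diameter and the node degree of two direct link topologies mentioned in this chapter, the ring and the boolean hypercube. The function names and the adjacency-list representation are our own illustrative choices, not notation from this dissertation. Bisection width and connectivity are omitted because they need more than a breadth-first search.

```python
from collections import deque


def ring(n):
    """Adjacency lists of a bidirectional ring on n processors."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}


def hypercube(dim):
    """Adjacency lists of a boolean hypercube with 2**dim processors."""
    return {v: [v ^ (1 << i) for i in range(dim)] for v in range(1 << dim)}


def eccentricity(adj, source):
    """Largest hop count from source to any processor (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def diameter(adj):
    """Maximum, over all processors, of the eccentricity."""
    return max(eccentricity(adj, v) for v in adj)


def max_degree(adj):
    """Largest number of distinct neighbors of any processor."""
    return max(len(set(neighbors)) for neighbors in adj.values())
```

Under these assumptions, a 16-processor ring has degree 2 and diameter 8, while a 16-processor hypercube has degree 4 and diameter 4, which illustrates the trade-off between node degree and diameter.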
An interconnection network can be built by establishing a direct link (either 
unidirectional or bidirectional) between certain pairs of processors. This type of 
interconnection network is called a direct link (or a dedicated path) interconnection 
network. Link connections do not change over time. Therefore, they are also called static 
interconnection networks. If the source processor and the target processor are connected 
by a common link, data will be sent along that link. If they do not have a common link, 
they will communicate using some intermediate processors. Some examples of such direct 
link interconnection networks are the hypercube [61], cube connected cycles [115], tree 
[69], and the ring [103].
In the second topological class, switches are used to establish connections among 
processors. No two processors are connected by a direct link. Switch settings can be
dynamically changed to implement various interconnection patterns. Usually, switches are 
arranged in stages, and the networks are referred to as single or multistage interconnection 
networks. Some examples of such networks are generalized cube network [128], banyan 
network [55], omega network [92], Benes network [15], data manipulator network [47], 
and STARAN flip network [12].
In the third topological class, processor interconnections are achieved by one or 
more buses and a set of interfaces. Depending on whether one bus or more than one bus 
is used, the multiprocessor system will be called a single bus system or a multiple bus 
system, respectively. In a single bus system, each processor, as well as each memory 
module, is connected to a common bus via interfaces. Some commercially available single 
bus systems are Encore Multimax [118], Sequent [111], SPUR [63], and Firefly [139]. In 
a multiple bus system, each processor, as well as each memory module, is connected to 
a (not necessarily proper) subset of buses via interfaces. By connecting different 
processors to different buses, a rich set of interconnection networks can be formed. Some 
examples are the conventional multiple bus system [109], partial multiple bus system [33], 
hierarchical multiple bus system [147], and orthogonal multiple bus system [71].
1.1.2 Operation Mode
Most parallel processing systems reported operate in one of the two modes of 
parallelism called SIMD and MIMD [50]. In the SIMD (single instruction stream - 
multiple data stream) mode, all processors execute the same instruction stream issued by 
a central control unit. However, each processor operates on a different data set. 
Communication capabilities of the interconnection network of an SIMD architecture are 
determined by the set of interconnection functions associated with it. Each interconnection
function is considered as a set of source-destination processor pairs which can perform 
direct communications concurrently. Some examples of SIMD parallel architectures are 
ILLIAC IV [11], GF11 [14], and the Connection Machine [62].
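As an illustration of an interconnection function viewed as a set of source-destination pairs, the Python sketch below builds the cube functions of a hypercube and the shift function of a unidirectional ring. These two families are standard textbook examples chosen by us. The helper names are hypothetical, and the permutation check reflects an assumption of ours (each processor sends once and receives once per function), not a requirement stated in the text.

```python
def cube_function(i, n_processors):
    """Pairs (x, x XOR 2**i): the i-th interconnection function of a hypercube."""
    return {(x, x ^ (1 << i)) for x in range(n_processors)}


def shift_function(n_processors):
    """Pairs (x, x + 1 mod N): the function of a unidirectional ring."""
    return {(x, (x + 1) % n_processors) for x in range(n_processors)}


def is_permutation(function, n_processors):
    """True if every processor appears exactly once as source and once as target."""
    sources = sorted(s for s, _ in function)
    targets = sorted(t for _, t in function)
    everyone = list(range(n_processors))
    return sources == everyone and targets == everyone
```

For eight processors, cube_function(2, 8) contains the pair (0, 4), and all pairs of one function can be served in the same communication step.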
In the MIMD (multiple instruction stream - multiple data stream) mode of 
operation, each processor executes its own set of instructions and uses its own data. There 
is no control unit to effect global synchronization among individual processors. 
Communication among processors is done by handshaking. Therefore, MIMD mode of 
operation is asynchronous. Some examples of MIMD parallel architectures are EMPRESS 
[26], Cm* [137], Cosmic Cube [122], and Cedar [82].
There are certain other parallel systems that can operate in the SIMD mode, in the 
MIMD mode, or in both. These are called SIMD/MIMD architectures. Usually these 
architectures consist of processors connected in a hierarchical fashion. Processors 
functioning within one hierarchical level are in the SIMD mode while those in another 
hierarchical level are in the MIMD mode. Some other architectures, while executing a 
parallel algorithm, may operate in the SIMD mode for a certain amount of time and then 
switch to the MIMD mode. Some examples of SIMD/MIMD architectures are SNAP-1 
[39], PASM [126], and DCA [75].
1.1.3 Interprocessor Communication Models
There are two basic communication strategies in multiprocessor systems. In the 
shared memory model, as the name implies, processors communicate via a shared global 
memory. A source processor needing to communicate places its output data in an area in 
the shared memory and the destination processor(s) read the data from that memory area. 
Usually, shared memory multiprocessors are tightly coupled and use a single address
space. Some examples of shared memory multiprocessors are the NYU Ultracomputer [57], 
Alliant FX/80 [152], HEP [80], KSR-1 [116], BBN Butterfly [36], and DASH [96]. The 
advantages of shared memory multiprocessors are flexibility and the ease of programming. 
A major disadvantage of a shared memory multiprocessor is the cost. In the context of 
special purpose architectures, shared memory models are not cost effective. They cannot 
efficiently exploit the communication structures of the applications under consideration. 
Thus, shared memory multiprocessors are more suitable for general purpose applications.
In the message passing model, each processor is provided with a local memory. 
The source processor sends data to the destination processor via the interconnection 
network. Usually, message passing multiprocessors are loosely coupled. Some examples 
of such machines are the Cosmic Cube [122], CHiP [130], J-Machine [38], and iPSC/2 
[9]. Since there is no requirement of concurrent access of data, implementation of 
memory is less expensive. The message passing model is more suitable for distributed 
environments. Also, the designer of a message passing (special purpose) multiprocessor 
can make use of the inherent communication structure of the application domain. The 
message passing model exploits the locality of reference present in many 
algorithms very well. This has led many researchers to favor the message passing model over the shared 
memory model for interprocessor communication [93], [94], [99]. A disadvantage of the 
message passing paradigm is that an algorithm whose communication structure does 
not match the topology of the machine will run poorly on it.
There are certain parallel architectures which use both the shared memory model 
and the message passing model for interprocessor communication [39], [51], [87]. It 
should be pointed out that, irrespective of the hardware model, any machine can simulate
either a message passing model or a shared memory model using software [16], [97], 
[98]. Good surveys comparing the two communication strategies can be found in 
[78], [93], [98], [105].
1.2 Classification of Parallel Algorithms
Parallel algorithms can be classified by the properties of algorithms such as 
communication pattern, data formats and structures, data size, types of operations, and 
control flow strategy [45], [110], [115]. For the purpose of designing an interconnection 
network, only the communication pattern of an algorithm faithfully reflects the 
fundamental characteristic of the problem solving method. Classification of parallel 
algorithms according to their communication structure was reported in [115]. Algorithms 
such as FFT computations, bitonic sort, and merge sort possess similar communication 
patterns and therefore belong to the Ascend/Descend class of parallel algorithms. 
Algorithms such as searching, summation, and min/max computation belong to the 
Fan-in/Fan-out class of parallel algorithms [144].
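The two classes can be contrasted with a small Python sketch. It assumes the usual description of these patterns in the literature rather than a definition given in this chapter: in an Ascend step, every processor exchanges data with the partner whose index differs in one bit, taking the bits from lowest to highest, while in a Fan-in computation partial results flow along a tree toward one processor. The function names are hypothetical.

```python
def ascend_steps(k):
    """Per-step exchange pairs for 2**k processors, lowest dimension first."""
    n = 1 << k
    return [[(x, x ^ (1 << i)) for x in range(n)] for i in range(k)]


def fan_in_sum(values):
    """Tree summation: partial sums flow toward processor 0."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of processors must be a power of two")
    data = list(values)
    steps = []
    stride = 1
    while stride < n:
        # Senders are the processors whose index is an odd multiple of stride.
        pairs = [(x, x - stride) for x in range(stride, n, 2 * stride)]
        for src, dst in pairs:
            data[dst] += data[src]
        steps.append(pairs)
        stride *= 2
    return data[0], steps
```

With eight processors, the Ascend pattern keeps all eight processors communicating in each of its three steps, whereas the Fan-in summation uses four, then two, then one transfer. The communication pattern, not the arithmetic, is what differs between the classes.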
1.3 Bused Interconnection Networks
Of the three topological classes of interconnection networks, the focus of this 
dissertation is on bus based interconnection networks. Even though they have not received 
as much attention as the other two topological classes, bus based architectures have been 
addressed on several occasions in the literature [8], [33], [68], [71], [109], [148]. These 
systems have several potential advantages, among which are ease of 
broadcasting, ease of incremental expansion, feasibility for efficient VLSI layout, fault 
tolerance, high memory bandwidth, and flexibility [33], [74], [90], [151]. Some of the 
limitations of bused interconnections, such as bandwidth, stray capacitance, and reflection
waves produced at interfaces, can be overcome by using the emerging optical technology 
[34], [58], [135]. Depending on the number of buses used for interprocessor communication, 
bused interconnection networks are divided into two categories.
1.3.1 Single Bus Systems
In a single bus system (also called shared bus system), communication among 
processors and communication between processors and memory modules are carried out over the 
common shared bus. More than one processor attempting to use the bus at the same time 
will result in a bus conflict. This will be resolved by an arbiter. There are several shared 
bus systems commercially available today. Encore Multimax [118], Sequent Symmetry 
[101], and SPUR [63] are some examples of such systems. A major reason for their 
popularity is the ease of implementation compared with multiprocessor systems with other 
types of interconnection networks. Their major disadvantage is that the number of 
processors is limited due to the limited bandwidth of the bus. Increasing the bandwidth 
by using a wider bus is not cost effective [68]. A cost effective solution is to use several 
buses instead of a single wide bus.
1.3.2 Multiple Bus Systems
Multiple bus systems have largely been left unexplored in the literature despite the 
fact that the single bus system has been the most prevalent interconnection method used 
for commercial releases of multiprocessor systems. With multiple buses, we can increase 
the number of processors in the system beyond what is possible in a single bus system, 
thereby potentially increasing the performance. To allow for multiple accesses to the 
memory simultaneously, the standard procedure is to divide the memory into several 
modules. In the conventional multiple bus system, each processor and each memory
module is connected to every bus [18], [42], [66], [109], [150]. This method becomes 
prohibitively costly as the numbers of processors and buses increase. When the 
number of buses is large, the number of ports per processor in a conventional multiple 
bus system may exceed physical limits. When the number of processors is large, the 
number of tappings in a bus becomes large. This creates problems such as bus loading, 
large stray capacitances, and wave reflections. Furthermore, with many bus connections, 
complex arbitration is needed.
Because of the above mentioned drawbacks of the conventional multiple bus 
system, several other different configurations have been suggested in the literature. These 
configurations reduce the number of bus connections without degrading the performance 
considerably. In [90], a method is introduced to reduce a substantial number of bus 
connections (compared to the conventional multiple bus system) while keeping the same 
memory bandwidth (provided that proper arbitration is present). In the partial multiple bus 
system, the set of buses and the set of memory modules are partitioned into several 
groups such that a memory module is only connected to the buses within the group, while 
each processor is connected to all buses [33], [89]. This is called memory oriented partial 
multiple bus system (MPMB). In another variation, called processor oriented partial 
multiple bus system (PPMB), the set of buses and the set of processors are divided into 
several groups such that each processor is connected to the buses within its own group 
only, while each memory module is connected to all the buses [74]. In the hierarchical 
multiple bus system, several levels of buses are used, where the lowest level buses are 
connected to processors and the highest level buses are connected to memory modules [147], 
[148]. In the orthogonal multiple bus system, one bus is dedicated to each processor and
buses are arranged in rows and columns [26], [71], [143]. All these multiple bus systems 
are aimed at general purpose applications. For special purpose multiple bus systems, 
communication patterns of the algorithms running on the system are usually known a 
priori. Therefore, the number of bus connections can be further reduced.
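The saving in bus connections offered by the partial multiple bus systems can be estimated with a short Python sketch. The counts follow directly from the descriptions above, under two simplifying assumptions of ours: every attachment of a processor or memory module to a bus counts as one interface, and the groups are of equal size. The function and parameter names are illustrative only.

```python
def conventional(n_proc, n_mem, n_bus):
    """Every processor and every memory module taps every bus."""
    return (n_proc + n_mem) * n_bus


def memory_oriented_partial(n_proc, n_mem, n_bus, groups):
    """MPMB: processors tap all buses; each memory module taps one bus group."""
    if n_bus % groups or n_mem % groups:
        raise ValueError("groups must divide both n_bus and n_mem")
    return n_proc * n_bus + n_mem * (n_bus // groups)


def processor_oriented_partial(n_proc, n_mem, n_bus, groups):
    """PPMB: memory modules tap all buses; each processor taps one bus group."""
    if n_bus % groups or n_proc % groups:
        raise ValueError("groups must divide both n_bus and n_proc")
    return n_proc * (n_bus // groups) + n_mem * n_bus
```

For example, with 16 processors, 16 memory modules, 8 buses, and 2 groups, the conventional system needs 256 connections, while either partial system needs 192.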
In the conventional multiple bus system [18], [109] and in many of the newer 
versions (such as MPMB [33], PPMB [74], and hierarchical multiple bus system [148]), 
the mode of operation is MIMD and the communication is via shared memory. Since in 
the MIMD mode of operation processors access memory modules asynchronously, bus 
conflicts and memory conflicts occur. These conflicts are usually resolved by a two-stage 
arbitration scheme [88], [104], [109], [112]. If a certain source processor wants to send 
data to a target processor, the source processor will get the privilege to access a certain 
memory location with the help of the first arbitration stage. Then it will obtain a bus by 
the second stage. Then the source processor will write the message into the memory 
location. Next, the target processor also will get access to the same memory location via 
a bus through the two-stage arbitration scheme. Finally, the target processor will read the 
message.
There is no reason why a multiple bus system cannot be used in the SIMD mode 
of operation with message passing for interprocessor communication. In the SIMD mode 
of operation, the control unit will select a bus for both the source processor and the target 
processor. Since bus conflicts do not occur in this mode of operation, no arbitration is 
required. If the (SIMD) algorithm is mapped optimally, buses will not be wasted and 
resources will be utilized efficiently. In the message passing model, each processor has 
its own local memory. The processors can be allowed to communicate directly without
the need for using a common shared memory. When two processors need to communicate, 
the control unit will select a free bus for that purpose, and the two processors will 
communicate through the selected bus. Usage of multiple bus systems in this context was 
not given much attention in the past. Recently, however, some researchers have explored 
the possibility of using multiple bus systems in SIMD and message passing environments. 
Works reported in [41], [44], [49], [77], [83], [84], [108], and [144] address multiple bus 
systems, directly or indirectly, in the above stated context.
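The bus selection performed by the control unit can be sketched in Python as a matching problem. This is only an illustrative model under assumptions of ours, not the construction developed in this dissertation: each processor has interfaces to a known set of buses, a transfer needs a bus attached to both its source and its target, and a bus carries at most one transfer per step. The function name select_buses and its data layout are hypothetical. The search is the standard augmenting-path method for bipartite matching.

```python
def select_buses(attached, transfers):
    """Assign a distinct bus to every (source, target) transfer, or return None.

    attached maps a processor to the set of buses it has an interface on.
    """
    candidates = [sorted(attached[s] & attached[t]) for s, t in transfers]
    owner = {}  # bus -> index of the transfer currently holding it

    def place(i, seen):
        for bus in candidates[i]:
            if bus in seen:
                continue
            seen.add(bus)
            # Take a free bus, or move the current holder to another bus.
            if bus not in owner or place(owner[bus], seen):
                owner[bus] = i
                return True
        return False

    for i in range(len(transfers)):
        if not place(i, set()):
            return None
    return {transfers[i]: bus for bus, i in owner.items()}
```

Because the assignment is computed in advance for each communication step, no run-time arbitration is modeled. A result of None indicates that the given interfaces cannot serve all transfers of the step concurrently.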
1.4 Design Issues of Parallel Architectures
In designing a parallel architecture, the architect is faced with many design 
parameters such as physical size, power consumption, processor characteristics, instruction 
sets, memory design, interconnection strategy, the technology to be used, and cost [2], 
[65], [124]. Taking all these parameters into consideration, it is extremely difficult (if not 
impossible) to give a theoretical treatment of the design of an optimal parallel architecture. At 
the functional level, the interconnection network truly represents the parallel architecture. 
In fact, what distinguishes a parallel processing system from a single processor 
system is often the interconnection network. Therefore, in this dissertation, we 
concentrate only on the design of the interconnection network.
Design issues of an interconnection network vary depending on whether the target 
is a general purpose machine or a special purpose machine. A general purpose machine 
is one which can solve almost any problem with acceptable speedup. Due to their wide 
applicability, general purpose machines generally have a better market potential. A special 
purpose machine is one which can solve problems in a given problem domain with 
greater speed. Due to the smaller market potential and the special purpose hardware
needed, those machines are often more expensive. However, due to the speed demands 
of many applications (such as weather forecasting, image analysis, and neural computing) 
and the advancement of technology, the trend is to build more and more special purpose 
machines [52], [133]. The focus of this dissertation is on designing special purpose 
parallel machines.
In designing a general purpose architecture, some of the desirable properties of an 
interconnection network are small diameter, small degree, large bisection width, fault 
tolerance, regularity, and expandability. Some of these requirements conflict with 
each other. For example, decreasing node degree tends to decrease the bisection width and 
to increase the diameter. For a special purpose architecture, the architect should consider 
some other requirements in addition to those required by a general purpose architecture. 
A special purpose architecture should solve problems belonging to a particular application 
domain much faster than a general purpose architecture does. A particular problem 
domain is characterized by a set of algorithms, each one having the same pattern of 
communication. Therefore, a special purpose architecture is also an algorithmically 
specialized architecture.
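The conflict between node degree and diameter can be seen on two standard topologies. The following is a minimal sketch, not part of the design method of this dissertation; the helper names (ring, hypercube, diameter, max_degree) are illustrative assumptions. It compares a 16-node ring with a 16-node hypercube and does not compute the bisection width.

```python
from collections import deque

def ring(n):
    """Adjacency lists of an n-node ring (every node has degree 2)."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def hypercube(d):
    """Adjacency lists of a d-dimensional hypercube (2**d nodes, degree d)."""
    return {v: [v ^ (1 << k) for k in range(d)] for v in range(2 ** d)}

def eccentricity(adj, src):
    """Largest breadth-first-search distance from src to any node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    """Largest shortest-path distance over all node pairs."""
    return max(eccentricity(adj, v) for v in adj)

def max_degree(adj):
    return max(len(neighbors) for neighbors in adj.values())

ring16 = ring(16)
cube16 = hypercube(4)
for name, adj in (("ring", ring16), ("hypercube", cube16)):
    print(name, "degree:", max_degree(adj), "diameter:", diameter(adj))
```

For the same number of nodes, the ring has the smaller degree (2 against 4) but the larger diameter (8 against 4).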
For the computer architect designing a special purpose architecture, knowledge 
of the problem domain at hand alone will not be adequate. The architect should 
also have some knowledge of the target architecture as to its topological class, mode of 
operation, and communication strategy. Depending on the choice of the target architecture, 
the architect usually comes up with a different interconnection network for the same 
problem domain. In this dissertation, we narrow down the target architecture's topology 
to multiple bus systems.
1.5 Research Objectives
In a real computing environment, parallel algorithms and parallel architectures are 
inseparable from one another. On the one hand, a good parallel algorithm may not be 
effective in solving a problem if the selected architecture (interconnection topology) does 
not efficiently support the communications required by the algorithm. On the other hand, 
a good architecture may not be effective if the algorithm does not efficiently utilize the 
facilities of the architecture. Therefore, designing a parallel architecture which can 
efficiently solve problems belonging to a particular problem domain is of paramount 
importance. The literature on the design of algorithmically specialized parallel 
architectures is mainly aimed at array processors [79], [100], [124]. The research 
community has not paid much attention to the possibility of using other topologies as 
algorithmically specialized architectures.
In the most general sense, the objective of this dissertation is to develop a formal 
methodology to design optimal multiple bus architectures favoring a given class of 
parallel algorithms. Our treatment is combinatorial in nature. We do not specifically address 
hardware issues. A closely related subject is the optimal mapping of parallel algorithms 
onto existing multiple bus systems. The problem of optimally mapping a given algorithm 
into a given architecture (direct link) has been well studied in the literature [23], [31], 
[120]. For the mapping problem, the target architecture is known a priori. But for 
designing an architecture which can run the source algorithm(s) with maximum speed, 
only the type of the architecture is known.
In [95], a method was discussed to design reconfigurable interconnection networks 
to realize given algorithms. In [144], the problem of designing an optimal multiple bus
architecture to realize a class of algorithms, called Fan-in computations, is addressed. 
Also in [145], a similar method was developed for the Ascend/Descend class of parallel 
algorithms. In [49], a multiple bus system emulating the SIMD hypercube was proposed. 
The work reported in [77] shows how to reduce the number of pins per chip for realizing 
n cyclic shifts on a multiple bus system.
Our approach is rather general; we do not restrict our attention only to a particular 
algorithmic class. The source class can be arbitrary. From a given algorithmic class, 
subject to some optimality constraints, we can extract a set of interconnection functions. 
This is the first stage in the design process. The set of interconnection functions so 
extracted dictates the potential communication capability of the target architecture which 
can run the source algorithm(s) with maximum speed. Once the set of interconnection 
functions has been determined, we will aim to construct an optimal multiple bus system 
which realizes the interconnection function set. This is the second stage. We will analyze 
the computational difficulties of both stages. We will also investigate how the design 
process can be methodically performed by exploiting certain regularities present in many 
parallel algorithms. We will also report on the merits of the multiple bus system over a 
direct link interconnection network realizing the same algorithmic class. We will further 
analyze the fault tolerance of the constructed multiple bus system.
Another objective of the dissertation is to study the trade-off between cost and 
speed. We will analyze how an optimal multiple bus system favoring a given algorithmic 
class can be constructed when certain components in the target multiple bus system are 
specified.
1.6 Model and Design Criteria
In this section, we first present the model to be used throughout the dissertation. 
The model serves several purposes. First, it states all the assumptions made in the 
dissertation. Second, it allows us to make research objectives more specific and more 
detailed. Third, it gives the foundation for the mathematical approach used to achieve the 
research objectives. The model presented consists of two parts. In the computational 
model, we regard a parallel algorithm as a collection of computational nodes. We 
represent a parallel algorithm by a graph whose vertices represent the computational 
nodes of the algorithm and whose edges represent the flow dependencies of the algorithm. 
In the architectural model, we specify all of the relevant features of the target multiple 
bus system. Later in this section, we present our design criteria, in which we specify our 
objective functions conforming to the presented model.
1.6.1 Computational Model
Parallel algorithms can be represented at different levels of abstraction. In the 
lowest level of abstraction, a parallel algorithm consists of a set of computations (also 
called computational nodes) and a set of data transfers among those computational nodes. 
In the highest level of abstraction, a parallel algorithm consists of a set of processes and a 
set of edges among processes, where each process is a coherent set of computations. 
Several methods have been used in the literature for representing parallel algorithms, such 
as the Problem Graph [23], Task Precedence Graph [76], Task Interaction Graph [120], 
and the Data Dependency Graph [149]. These graphs represent parallel algorithms at 
different abstract levels. For example, the Task Precedence Graph is a higher abstraction 
of the algorithm than the Data Dependency Graph. Vertices of the Task Precedence Graph
are processes while those of the Data Dependency Graph are computational nodes. For 
our purpose, we will represent the source algorithm in the lowest level of abstraction. 
Therefore, our representation will consist of a set of computational nodes and a set of 
edges among those nodes.
Depending on the source algorithm, a single computation could be a simple 
operation such as comparison of two numbers or a compound operation such as multiplication 
of two matrices. Irrespective of the actual operation of the computational node, 
however, we assume that computational nodes are indivisible. That is, a computational 
node must be executed by a single processor and that processor will do so without any 
interruption. We also assume that the source algorithm is homogeneous from the 
computational point of view; that is, each computational node takes the same amount of 
time on a single processor. This is a reasonable assumption since we can always break 
heterogeneous computational nodes into smaller homogeneous nodes.
In the Data Dependency Graph, two vertices u and v have a directed edge from 
u to v if and only if the computation at v depends on the computation at u. In such a 
graph, four kinds of dependencies can exist between two vertices u and v [149].
(a) Flow dependence : The value of a variable computed by u is used by v.
(b) Antidependence : The value of a variable used by u is later changed by v.
(c) Output dependence : The value of a variable computed by u is later changed by v. 
(d) Control dependence : Computation u is a conditional statement whose output
decides the computation v.
Only flow dependence requires that data be transferred from u to v. Our target 
multiple bus system would have a direct communication from one processor to another
if and only if the source algorithm requires a computation assigned to the first processor 
to transfer data to a computation assigned to the second processor. Therefore, in the 
graphical representation of the source algorithm, only flow dependencies must correspond 
to edges. We assume that the order of execution of the computational nodes in the given 
parallel algorithm is known a priori. In this dissertation, the graphical representation of 
a parallel algorithm will be called the Computation Flow Graph (CFG for short).
If A is a parallel algorithm, then the corresponding CFG $C_A$ is constructed as 
follows. Each vertex of $C_A$ is a computation in algorithm A. Each vertex of $C_A$ will be 
assigned an integer label such that two vertices have the same label if they are to be 
executed concurrently to achieve maximum speed. There will be a directed edge from 
vertex u to vertex v if and only if computation v is flow dependent on computation u. 
Other dependencies (anti, output, and control dependencies) in the algorithm will not 
correspond to edges in $C_A$. However, those dependencies will be reflected in the labeling 
of the vertices in $C_A$. Labeling of the vertices should satisfy the requirement that vertex 
v has a higher label than vertex u if there is a dependency from u to v in the source 
algorithm. We make the assumption that each edge in $C_A$ corresponds to the same amount 
of data transfer.
From our algorithm representation, it is implied that concurrent computations of 
the algorithm are globally known. For example, computational nodes with the same label 
must be executed concurrently in order to achieve maximum speed. Thus our model 
implies that the source algorithm operates in the SIMD environment.
Let there be n labels in the CFG $C_A$. There are no dependencies in algorithm A 
among computations with the same label. That is, there are no edges among vertices with
the same label. Thus $C_A$ is a directed n-partite graph. Vertices in the ith partite set, that 
is, vertices with label i, belong to concurrency level i. By definition, edges will be 
directed from a vertex at a lower level to one at a higher level.
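The construction of a CFG can be sketched as follows. This is only an illustration under stated assumptions: the function name build_cfg, the five-node example, and the labeling rule (each node receives the smallest label that exceeds the labels of all nodes it depends on) are not taken from the dissertation. The labeling rule is one simple way to satisfy the stated requirement that a dependent vertex has a higher label. The example is hypothetical and is not the CFG of Figure 1.1.

```python
def build_cfg(nodes, dependencies):
    """Build a CFG from (u, v, kind) triples meaning "v depends on u".

    kind is one of "flow", "anti", "output", "control".
    Returns (labels, edges): labels maps each node to its concurrency
    level (1-based); edges lists only the flow dependencies.
    """
    preds = {v: [] for v in nodes}
    for u, v, _kind in dependencies:
        preds[v].append(u)  # every kind of dependency constrains the labels

    labels = {}

    def level(v, active=()):
        if v in active:
            raise ValueError("cyclic dependency")
        if v not in labels:
            # Assumed rule: one more than the highest label among predecessors.
            labels[v] = 1 + max(
                (level(u, active + (v,)) for u in preds[v]), default=0
            )
        return labels[v]

    for v in nodes:
        level(v)

    # Only flow dependencies require a data transfer, so only they become edges.
    edges = [(u, v) for u, v, kind in dependencies if kind == "flow"]
    return labels, edges

# Hypothetical algorithm with five computational nodes.
nodes = ["a", "b", "c", "d", "e"]
deps = [
    ("a", "c", "flow"),
    ("b", "c", "flow"),
    ("a", "d", "anti"),   # orders d after a, but transfers no data
    ("c", "e", "flow"),
    ("d", "e", "flow"),
]
labels, edges = build_cfg(nodes, deps)
print(labels)
print(edges)
```

In this example the antidependence from a to d places d in level 2 without creating an edge, and every edge runs from a lower level to a higher one, so the graph is 3-partite.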
Figure 1.1 shows a CFG (denoted by $C_1$) with 7 vertices and three levels. It can 
be considered as the CFG corresponding to an algorithm which solves the following 
simple problem: given four numbers $a_1$, $a_2$, $a_3$, and $a_4$, find the maximum which is less 
than $a_4$. Vertex labels are written inside the circles representing the vertices of the CFG.
It should be emphasized that our graphical representation of parallel algorithms, 
with respect to the presented computation and communication models, agrees with the 
classification of algorithms given in [115]. Two algorithms represented by the same CFG 
belong to the same algorithmic class. For example, a single CFG represents all the 
algorithms in the Ascend/Descend class. In the rest of the dissertation, however, we use 
the word "algorithm" to mean either a single algorithm or a class of parallel algorithms. 
Since the CFG contains all the information of the algorithm needed for the construction
Figure 1.1: A CFG $C_1$.
of an optimal multiple bus system, we can use the CFG without referring to the algorithm 
from which it was constructed.
1.6.2 Architectural Model
In this dissertation, our attention is on SIMD multiple bus systems which use 
the message passing model for interprocessor communication. We use the abbreviation MBS 
to represent such a multiple bus system. An MBS consists of a set of processors, a set of 
buses and a set of interfaces. An interface is a device which connects a processor to a 
bus. In combinatorial analysis of MBSs in the literature, interfaces are frequently referred 
to as "pins" or "ports" [49], [77]. In this dissertation, however, we use the term "interface" 
consistently. An interface connecting a bus to a processor will be called a driver or a 
receiver depending on whether the data transfer is from the processor to the bus or from 
the bus to the processor. All processors are identical, and each is assumed to have its own local 
memory. Figure 1.2 shows an MBS with eight processors and four buses.
Figure 1.2: An MBS with eight processors and four buses.
In our model, buses are assumed to be half duplex in the sense that they can carry 
data only in one direction at a time; however, data can be transferred in opposite 
directions at different times. If two processors are involved in bidirectional communication, 
they should use two buses for data transfers.
Since our target MBS uses the message passing model for communication, it does not 
contain global memory. All communications are done between processors using buses. 
However, if one wishes to include global memory in an MBS, it can be easily done. For 
example, we can include global memory modules such that each memory module is 
connected to every bus. Obviously, there are other possibilities. Construction of an MBS 
which can use a message passing model as well as a shared memory model is a possible 
extension to this work.
Some researchers have investigated the potential advantages of an MBS as an 
interconnection network. In [19], [49], [138], an MBS has been conveniently represented 
by a hypergraph. The vertices of the hypergraph correspond to processors and 
hyperedges correspond to buses. But for the optimal design of MBSs, the hypergraph 
representation has a major drawback. When a bus is represented by a hyperedge, the only 
information it represents is whether a certain processor is connected to a certain bus. The 
direction of the connecting port (interface) is not shown. Therefore, unless all interfaces 
are assumed to be bidirectional, the hypergraph does not convey all the information of the 
MBS. In this dissertation, for optimality considerations, we need to distinguish an input 
port (receiver) from an output port (driver). Therefore, hypergraph representation cannot 
be used. We represent an MBS graphically by a directed bipartite graph called Multiple 
Bus Graph (MBG for short). If $G = (X, Y, E)$ is an MBG, then X, Y, and E correspond
to the set of processors, set of buses, and the set of interfaces, respectively, of the 
represented MBS. An edge in E from a vertex in X (Y) to a vertex in Y (X) represents a 
driver (receiver) in the MBS. Figure 1.3 represents the MBG corresponding to the MBS 
shown in Figure 1.2. Notice that vertex $x_i$ represents processor $P_i$, for $1 \le i \le 8$, and 
vertex $y_j$ represents bus $B_j$, for $1 \le j \le 4$. For our analysis, the MBG is only a convenient 
graphical representation of the MBS. On many occasions, we will use MBS and MBG 
synonymously.
Figure 1.3: MBG corresponding to the MBS shown in Figure 1.2.
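The difference between the MBG and the hypergraph representation can be illustrated with a small sketch. The class name MBG, its method names, and the three-processor example are assumptions made for illustration; the example is hypothetical and is not the system of Figure 1.2.

```python
class MBG:
    """Directed bipartite graph: processors (X), buses (Y), interfaces (E)."""

    def __init__(self, processors, buses):
        self.processors = set(processors)
        self.buses = set(buses)
        self.drivers = set()    # (processor, bus): edge from X to Y
        self.receivers = set()  # (bus, processor): edge from Y to X

    def _check(self, p, b):
        if p not in self.processors or b not in self.buses:
            raise ValueError("unknown processor or bus")

    def add_driver(self, p, b):
        self._check(p, b)
        self.drivers.add((p, b))

    def add_receiver(self, b, p):
        self._check(p, b)
        self.receivers.add((b, p))

    def buses_for(self, src, dst):
        """Buses over which src can send to dst in one step."""
        return sorted(
            b for b in self.buses
            if (src, b) in self.drivers and (b, dst) in self.receivers
        )

    def interface_count(self):
        return len(self.drivers) + len(self.receivers)

    def hyperedges(self):
        """Undirected hypergraph view: bus -> set of attached processors."""
        view = {b: set() for b in self.buses}
        for p, b in self.drivers:
            view[b].add(p)
        for b, p in self.receivers:
            view[b].add(p)
        return view

# Hypothetical MBS: P1 sends to P2 over B1, and P2 sends to P1 over B2.
mbg = MBG(["P1", "P2", "P3"], ["B1", "B2"])
mbg.add_driver("P1", "B1")
mbg.add_receiver("B1", "P2")
mbg.add_driver("P2", "B2")
mbg.add_receiver("B2", "P1")

print(mbg.buses_for("P1", "P2"))
print(mbg.buses_for("P2", "P1"))
print(mbg.hyperedges())
```

In the hypergraph view both buses appear as the same hyperedge {P1, P2}, although they carry data in opposite directions. The MBG keeps this distinction, and it also makes the interface count (four in this example) directly available as the number of edges.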
1.6.3 Design Criteria
As stated in Section 1.5, given a source algorithm, our objective is to design an 
optimal MBS which will run the source algorithm with maximum possible speed. One of 
the basic assumptions we make in this dissertation is that the given parallel algorithm A 
is not designed for any particular (underlying) architecture. Suppose, on the contrary, that 
algorithm A, which solves problem $\mathcal{X}$, is written for a particular architecture $\mathcal{Y}$. Let M be 
the optimal MBS designed to realize A. Then, other than solving problem $\mathcal{X}$ efficiently,
M will try to emulate architecture $\mathcal{Y}$, which is an unnecessary side effect. If the design 
of M were not restricted by architecture $\mathcal{Y}$, we would possibly have obtained a better MBS 
to solve problem $\mathcal{X}$. Thus, in this research we assume that the source algorithm A is not 
written for any particular architecture and is the best algorithm to solve the problem at 
hand.
Let M  be the target MBS. Then, computations in algorithm A  must be carried out 
by the processors in M. Similarly, data transfers (communications) in A must be carried 




Unfortunately, the above two represent conflicting requirements. Since we do not 
modify the given source algorithm for designing an optimal MBS, there is an upper bound 
on the speed (and a corresponding lower bound on the execution time) with which one 
can run the source algorithm on any MBS, and that bound is determined by algorithm A 
itself. This is the ideal speed of algorithm A. Similarly, there is a lower bound on the cost 
of an MBS, which corresponds to the cost of a single processor without buses and 
interfaces (excluding the buses and interfaces necessary for devices other than the 
processor). These are the two extremes in designing an MBS to realize a given algorithm. 
Our main reason for having a parallel processing system is to increase the speed of 
applications beyond what could be achieved using a single processor system. So, one 
primary objective of the dissertation is to construct a minimal cost MBS which can run 
the given source algorithm A at its ideal speed. One secondary objective is the design of 
an MBS with a specified number of components, rather than with the number of
components dictated by the ideal speed requirement. When practical issues are taken into 
consideration, the latter would be the best approach. An MBS which can run source 
algorithm A at its ideal speed may not be cost optimal.
1.7 Outline of the Dissertation
In this section, we briefly outline how the intended research objectives will be 
achieved subject to the presented model. As mentioned before, one primary objective is 
to construct an optimal MBS which can run a given algorithm (or a class of algorithms) 
at the ideal speed. An MBS hosting a given parallel algorithm will perform computations 
using processors and communications using buses and interfaces. Given an arbitrary CFG 
$C_A$, construction of the minimal cost MBS, which can run algorithm A at its ideal speed, 
is an extremely difficult computational problem. To somewhat alleviate the difficulty, we 
break the problem into two stages.
In the first stage, we assign a set of processors to the vertices of the given CFG. 
This is called processor assignment and is addressed in Chapter 2. The first criterion for 
processor assignment is that the total number of processors required to perform the 
computations of the CFG is minimum. From the CFG, we construct another graph called 
interconnection function graph (IFG for short). Vertices of the IFG are the processors 
allocated to computations of the CFG. Edges of the IFG correspond to data transfers 
among processors necessitated by the source algorithm. In the processor assignment stage, 
we do not completely disregard the number of buses and interfaces required for the target 
MBS. We use the number of edges in the IFG as an approximate measure of the number 
of buses and interfaces in the target MBS. As a consequence, the second criterion for the 
processor assignment is to minimize the number of edges in the IFG. We prove that
optimal processor assignment is an NP-Hard problem. However, we show that certain 
regularities inherent in every well-behaved parallel algorithm lend themselves to 
polynomial time solutions. For such algorithms, we provide an efficient algorithm to 
perform the processor assignment.
In the second stage, we pay attention to minimizing the number of buses and 
interfaces. We start the second stage with an IFG, which is obtained from a CFG by an 
optimal processor assignment. Our objective in this stage is to realize the communications 
dictated by the edges of the IFG using buses and interfaces. This is called bus assignment 
and is addressed in Chapter 3. The first criterion for the bus assignment is to minimize 
the number of buses and interfaces in the target MBS. The second criterion is to minimize 
the time the target MBS takes to perform the communication primitives dictated by the 
given CFG. We prove that the optimal bus assignment is an NP-Hard problem.
The construction of an optimal MBS realizing a given IFG automatically addresses 
two other important issues. First, it addresses the problem of designing an optimal MBS 
which emulates an existing SIMD architecture. Second, it addresses the problem of 
designing an optimal MBS which can perform a specified set of interconnection functions. 
For these reasons, we will address the bus assignment problem in more detail. We will 
analyze important cases where a polynomial time solution can be obtained. We show that 
the problem is solvable in polynomial time when the number of interconnection functions 
associated with the IFG is two. Based on that algorithm we provide an efficient heuristic 
algorithm to solve the general bus assignment problem.
In Chapter 4, we show how to perform bus assignment in polynomial time when 
the IFG is vertex symmetric. We show that an IFG is vertex symmetric if and only if it
is the Cayley color graph of a finite group and its generating set. We utilize this property 
to analyze vertex symmetric IFGs. Due to the importance of vertex symmetric IFGs, we 
will perform a detailed analysis of the bus assignment problem for this case. We will give 
a polynomial time algorithm which performs the bus assignment of a vertex symmetric 
or regular IFG. Furthermore, we will show the superiority of an MBS over a static 
interconnection network realizing vertex symmetric IFGs, in terms of the number of ports 
per processor, the number of neighbors per processor, and the diameter.
In Chapter 5, we address the fault tolerance capabilities of an MBS constructed 
from a vertex symmetric IFG. We study the behavior of the MBS in the case of a single 
bus failure, single interface failure, and a single processor failure. We obtain necessary 
and sufficient conditions for the MBS to be fault-tolerant in each of the above cases. We 
will also obtain a measure of performance degradation due to such component failures. 
Furthermore, we will show how to add redundancy to the MBS in order to improve its 
fault tolerance.
In Chapter 6, we address a secondary objective of the dissertation, namely the 
problem of constructing an MBS with a given number of processors and a given number 
of buses which can run a given source algorithm at maximum possible speed given the 
restrictions. In this case, the cost function for optimality includes only the number of 
interfaces. We first consider the case where only the number of processors is specified. 
In that case, we show that the number of buses can also be proportionally decreased 
without any performance penalty. We show how to construct a regular IFG with the 
specified number of processors. We then consider the case where the number of buses 
is specified. In that case, the number of processors needed for maximum speed is equal
to the optimal number of processors. Here we show how to combine buses of the optimal 
MBS to form a new MBS which has the specified number of buses.
In Chapter 7, we provide the summary of the dissertation along with concluding 
remarks. We point out how several concepts addressed in the dissertation can be extended 
for future research work.
CHAPTER 2 
PROCESSOR ASSIGNMENT
In this chapter, we consider the first stage of the primary objective of the 
dissertation. Given the CFG $C_A$ of a parallel algorithm A, we will assign the set of 
computational nodes in $C_A$ to a set of processors such that the target architecture has 
minimum cost and can execute algorithm A at its ideal speed. We will show that the 
processor assignment problem is equivalent to that of partitioning the vertex set of the 
CFG with certain constraints. By partitioning the vertex set of the CFG, we will obtain 
an IFG. After stating the optimization criteria, we will prove that the optimal processor 
assignment is an NP-Hard problem. We will then show how to exploit certain regularities 
present in every well-behaved parallel algorithm in order to perform the processor 
assignment in polynomial time. For the case when the CFG possesses such regularities, 
we will present an efficient algorithm which solves the processor assignment problem in 
polynomial time. Furthermore, we will show that, when the CFG is regular, the resulting 
IFG is also regular. Finally, in this chapter, we will present a proper scheduling for the 
assigned processors.
2.1 Preliminaries
The underlying undirected graph of a CFG is connected. This is because the 
algorithm which produced the CFG solves a single problem and therefore its computational 
nodes are interrelated. Given a CFG $C_A$, we need to find a set of processors such 
that each vertex in $C_A$ is assigned to exactly one processor in the set. One naive approach 
would be to assign one processor to each vertex in $C_A$. Another naive approach would 
be to assign a single processor to all the vertices of $C_A$. The former approach would have
the fastest computation time but would be the most expensive. On the other hand, the 
latter approach would be the least expensive but also the slowest. Our intention is to 
construct a least expensive MBS which runs the source algorithm at its ideal speed. Since 
each processor must be allocated to a disjoint subset of vertices of the CFG, the processor 
assignment is equivalent to a partition of the vertex set of $C_A$. Therefore, we will use 
the terms "vertex partition" and "processor assignment" interchangeably.
Computations of the algorithm are performed by processors. Since we do not alter 
the source algorithm, the amount of computation required by the algorithm, whether 
executed by one processor or more, is fixed. We are only allowed to allocate computations 
to different processors. Therefore, the computation time is only affected by the 
number of processors and the computation schedule. The source algorithm allows certain 
computations to be executed in parallel. For example, all the computations of the first 
concurrency level of the CFG can be executed in parallel. Similarly, all the computations 
in the second level can be executed in parallel. However, no computation in the second 
level can be executed before all the computations in the first level are done. This is due 
to possible dependencies (not necessarily flow dependencies) among computational nodes 
in different concurrency levels. The following lemma gives a straightforward result. 
Lemma 2.1: Let $n_A$ be the number of concurrency levels in the CFG $C_A$. Then the 
minimum computation time is $n_A t_c$, where $t_c$ is the computation time of a single node 
of the CFG on a single processor.
Therefore, the algorithm stipulates a lower bound on the computation time. The 
minimum computation time $n_A t_c$ required by the algorithm will be called its ideal 
computation time. To run algorithm A at its ideal speed, computations must be performed
in time $n_A t_c$. Therefore, the assignment of processors to CFG $C_A$ should be such that the 
processors can perform the computations of the algorithm A with computation time $n_A t_c$. 
Note that the term ideal communication time has little or no significance, 
because the minimum communication time is zero and it corresponds to the uniprocessor 
case.
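The effect of a processor assignment on the computation time can be sketched as follows. The function name computation_time, the five-node labeling, and the processor names are illustrative assumptions. The sketch also assumes that concurrency levels execute strictly one after another and that a processor executes the nodes assigned to it within a level sequentially, each taking time $t_c$.

```python
from collections import Counter

def computation_time(labels, assignment, t_c=1):
    """Computation time of a CFG under a processor assignment.

    labels:     node -> concurrency level
    assignment: node -> processor
    A level lasts as long as its busiest processor needs, and the
    levels are executed one after another.
    """
    load = Counter((labels[v], assignment[v]) for v in labels)
    busiest = {}
    for (lvl, _proc), count in load.items():
        busiest[lvl] = max(busiest.get(lvl, 0), count)
    return t_c * sum(busiest.values())

# Hypothetical CFG with three concurrency levels (n_A = 3).
labels = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}

one_per_node = {v: v for v in labels}             # five processors
single = {v: "P0" for v in labels}                # one processor
level_disjoint = {"a": "P0", "c": "P0", "e": "P0",
                  "b": "P1", "d": "P1"}           # two processors

print(computation_time(labels, one_per_node))
print(computation_time(labels, single))
print(computation_time(labels, level_disjoint))
```

With $t_c = 1$, the one-processor-per-node assignment and the two-processor assignment both reach the ideal computation time of 3, while the single processor needs 5. The two-processor assignment therefore attains the bound of Lemma 2.1 with fewer processors than the first naive approach.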
If two vertices in the same level of $C_A$ are assigned to the same processor, then 
those two computations cannot be performed at the same time. Therefore, unless each 
processor is assigned vertices belonging to distinct levels of $C_A$, we cannot attain the ideal 
computation time of the source algorithm. We provide the following definition in order 
to characterize a processor assignment which will attain ideal computation time. 
Definition 2.1: A partition $\zeta$ on the vertex set of a CFG $C_A$ will be called a level-disjoint 
vertex partition if each subset has vertices from distinct partite sets.
In this chapter, unless otherwise stated, a partition of the vertex set of a CFG is 
always assumed to be a level-disjoint partition. Let $C_A = (V, E)$ be a CFG. Let $\zeta$ be a 
(level-disjoint) vertex partition of V. Associated with $\zeta$, we will define an edge colored 
directed graph $G_A^\zeta$ called the Interconnection Function Graph (IFG for short). If $V_1, V_2, \ldots, V_k$ 
are the subsets of partition $\zeta$, then for each subset $V_i$, there exists a unique vertex 
$v_i$ in $G_A^\zeta$, and vice versa. There is an edge of color r from $v_i$ to $v_j$ in $G_A^\zeta$ if and only if 
there is an edge from $u_i$ to $u_j$ in $C_A$, where $u_i \in V_i$, $u_j \in V_j$, and $u_i$ is in concurrency level 
r. An IFG may have multiple edges from one vertex to another. Since the vertex
partitioning of $C_A$ uniquely determines $G_A^\zeta$, with slight abuse of notation, we may write 
$G_A^\zeta = \zeta(C_A)$. Since the CFG $C_A$ is connected, so is the IFG $\zeta(C_A)$. Every vertex of $\zeta(C_A)$
corresponds to a set of computations, at most one from each concurrency level, of 
algorithm A assigned to a single processor.
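The definitions above can be sketched in code. The function names is_level_disjoint and build_ifg and the example CFG are illustrative assumptions. The sketch also assumes that a CFG edge whose endpoints fall in the same subset produces no IFG edge, since such a transfer stays inside one processor; the definition above does not state this case explicitly.

```python
def is_level_disjoint(labels, partition):
    """True if no subset holds two vertices of the same concurrency level."""
    for subset in partition:
        levels = [labels[v] for v in subset]
        if len(levels) != len(set(levels)):
            return False
    return True

def build_ifg(labels, edges, partition):
    """Return the IFG as a list of colored edges (i, j, r).

    i and j index the subsets of the partition (that is, processors),
    and r is the concurrency level of the source vertex of the CFG edge.
    The list may contain several edges between the same pair (i, j).
    """
    if not is_level_disjoint(labels, partition):
        raise ValueError("partition is not level-disjoint")
    owner = {v: i for i, subset in enumerate(partition) for v in subset}
    # Assumption: transfers inside one subset need no interconnection edge.
    return [
        (owner[u], owner[v], labels[u])
        for u, v in edges
        if owner[u] != owner[v]
    ]

# Hypothetical CFG with three concurrency levels.
labels = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
edges = [("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")]

partition = [{"a", "c", "e"}, {"b", "d"}]
ifg = build_ifg(labels, edges, partition)
print(ifg)
```

Here the IFG has two vertices and two edges from processor 1 to processor 0, one of color 1 and one of color 2. Each color class is one interconnection function, and the vertex and edge counts of the IFG are the quantities used to compare processor assignments.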
Definition 2.2: For a given parallel algorithm A and a partition $\zeta$ of $C_A$, the collection of 
computational nodes corresponding to each subset of the partition is called a subtask of 
algorithm A.
Example 2.1: Figure 2.1 shows a possible vertex partition of the CFG $C_1$ shown in Figure 
1.1. Dotted lines enclose the vertices belonging to a subset. Figure 2.2 represents the 
corresponding IFG, which we denote by $G_1$. The three computational nodes in the subset 
$V_4$ form the subtask associated with vertex $v_4$.
Figure 2.1: A vertex partition of the CFG $C_1$.
Even though the IFG is an intermediate step of our design process, it is an
important entity in its own right. Vertices of the IFG correspond to processors, and edges
correspond to data transfers among processors. Therefore, an IFG can be considered as
a direct link interconnection network with each edge corresponding to a unidirectional
link. Colors of the edges represent concurrency of data transfers; that is, in the IFG, the
set of edges of a given color corresponds to an interconnection function (in
the most general sense) in the represented SIMD machine. Thus the IFG ξ(C_A) represents
a direct link SIMD interconnection network which can run algorithm A at its ideal speed.
Figure 2.2: The IFG G_1 corresponding to the vertex partition of Figure 2.1.
By definition, ξ(C_A) has enough processors to execute algorithm A in its ideal
computation time. Addition of more processors will not reduce the computation time. This
is true because, if two computations are to be executed sequentially, the computation time
is the same whether they are assigned to the same processor or to different processors.
Recall that we do not modify the source algorithm; specifically, we do not
distribute a single computation among several processors. Therefore, any interconnection
network with the number of processors equal to the number of vertices in ξ(C_A) can attain
the ideal computation time of algorithm A. Next we consider the criteria for partition ξ
to be optimal.
2.2 Optimal Processor Assignment
Any partition ξ of the vertex set of a CFG will produce an IFG which corresponds
to a system with enough processors to reach the ideal computation time. We are interested
in constructing a minimal cost MBS which has ideal computation time and minimum
possible communication time. Let M_ξ be an MBS which can perform each of the
interconnection functions associated with ξ(C_A) in one step. To evaluate the merits of the
partition, we must obtain the cost of M_ξ as well as the computation and communication
times of algorithm A running on M_ξ. Since we have not constructed the MBS M_ξ yet,
some of the information regarding M_ξ is unavailable at this stage.
The computation time and the number of processors in M_ξ are directly available
from partition ξ. Only an approximate measure of the communication time, the number
of buses, and the number of interfaces in M_ξ is available from partition ξ. Edges of the
IFG ξ(C_A) correspond to data transfers among processors of M_ξ dictated by algorithm
A. The larger the number of edges in ξ(C_A), the larger the amount of communication
required. Therefore, as an approximation, we take the number of edges in ξ(C_A) as
a measure of the communication time of algorithm A running on M_ξ. Since the
communication in M_ξ is effected by buses and interfaces, for the present stage we also
take the number of edges in ξ(C_A) as an approximate measure of the number of buses and
interfaces in M_ξ. Therefore, the goal of the processor assignment is to find a vertex
partition ξ such that |V(ξ(C_A))| and |E(ξ(C_A))| are minimum. As the following lemma
shows, obtaining the minimum value of |V(ξ(C_A))| is straightforward.
Definition 2.3: Let C_A be a CFG with n concurrency levels. Then the number of vertices
in level r is denoted by α_r(C_A), 1 ≤ r ≤ n. Also, max{α_r(C_A) : 1 ≤ r ≤ n} is denoted by
α(C_A).
Lemma 2.2: The minimum number of processors needed to run algorithm A at its ideal
speed is α(C_A).
Proof: To obtain minimum computation time, concurrent computations needed by the
source algorithm must be executed by distinct processors. □
For example, in Figure 1.1, α_1(C_1) = 4, α_2(C_1) = 2, and α_3(C_1) = 1. Therefore,
α(C_1) = 4 and an optimal MBS needs 4 processors. Any machine with fewer than α(C_A)
processors cannot execute the source algorithm A in ideal computation time. Since one
of our design criteria is that the target machine runs A in the ideal computation time,
the number of processors must be at least α(C_A).
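As a small illustration of Lemma 2.2, the bound on the number of processors is simply the size of the widest concurrency level. The sketch below assumes the CFG is given as a mapping from vertex to level; the vertex names are placeholders, and only the level sizes 4, 2 and 1 are taken from the example above.

```python
from collections import Counter

def min_processors(level):
    """Largest number of vertices found on any single concurrency level."""
    return max(Counter(level.values()).values())

# Level sizes 4, 2 and 1, as quoted for the CFG of Figure 1.1.
# The vertex names u1..u7 are hypothetical.
level = {"u1": 1, "u2": 1, "u3": 1, "u4": 1, "u5": 2, "u6": 2, "u7": 3}
alpha = min_processors(level)
```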
Unlike the case for |V(ξ(C_A))|, there is no straightforward method to find the
minimum value of |E(ξ(C_A))|. Furthermore, in general, |V(ξ(C_A))| and |E(ξ(C_A))|
cannot both be minimized at the same time. This fact is illustrated by the following
example.
Example 2.2: Figure 2.3 shows a four-level CFG, denoted by C_2, with twelve vertices
(directions of the edges are implied). Clearly, α_1(C_2) = α_2(C_2) = α_3(C_2) = α_4(C_2) = α(C_2)
= 3. Therefore, any level-disjoint partition has at least three subsets. There are
16 edges in C_2. A partition of C_2 with cardinality three† has four vertices in each
subset, one vertex from each level. Since the total number of edges in C_2 is fixed, the
number of edges in ξ(C_2) is minimum when the number of edges induced by the
subsets of partition ξ is maximum. It is clear from the figure that the maximum number
of edges which can be induced by a 4-element subset of vertices is four. An example is
the subset {u2 , , Wj4}. It is easy to check that there does not exist a
vertex partition with cardinality three such that each subset induces four edges. The best
possible cardinality-three partition induces four edges on two of the subsets and
three edges on the other subset. Figure 2.4 shows the induced subgraphs of such a
partition, denoted by ξ_1. The three subsets are {u 2 , u 2, u * }, { u / ,  w22, u2 , u34 },
{« 3 , u2 , u3 , u2 }. Since partition ξ_1 induces 11 edges, the IFG ξ_1(C_2) resulting from
partition ξ_1 has 16 − 11 = 5 edges. Now consider the induced subgraphs of the cardinality-four
partition ξ_2 = {{U3 > u f , u l }, { « /, u%, u * }, {u2* , u 3, w *}, {w32, w33, w24}} shown
in Figure 2.5. That partition induces 12 edges, three on each subset. Therefore, the IFG
ξ_2(C_2) corresponding to partition ξ_2 has only 16 − 12 = 4 edges.
† The cardinality of a partition is the number of subsets it generates.
Figure 2.4: Induced subgraphs of partition ξ_1 on V(C_2).
Figure 2.5: Induced subgraphs of partition ξ_2 of V(C_2).
Thus, in general, one cannot find a single partition ξ which minimizes both
|V(ξ(C_A))| and |E(ξ(C_A))| simultaneously. Another conclusion that can be drawn from the
above example is that by increasing the number of processors above α(C_A), we may be
able to decrease the communication time.
Since |V(ξ(C_A))| and |E(ξ(C_A))| cannot both be minimized simultaneously (in
general), the cost function for the processor assignment must include terms to reflect the
relative costs of vertices and edges of the target IFG. With each vertex of the IFG we
associate a cost κ_1, and with each edge we associate a cost κ_2. Therefore, the cost
of an IFG ξ(C_A) is κ_1|V(ξ(C_A))| + κ_2|E(ξ(C_A))|. Using this cost measure, we can define
the optimal processor assignment as follows.
Definition 2.4: A vertex partition ξ of a given CFG C_A is called an optimal vertex
partition (or simply an optimal partition) if κ_1|V(ξ(C_A))| + κ_2|E(ξ(C_A))| is minimum, taken
over all partitions ξ. The processor assignment corresponding to an optimal partition is
called an optimal processor assignment.
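To see how the two weights trade off, the cost measure can be evaluated for the two partitions of Example 2.2, whose IFGs have 3 vertices and 5 edges, and 4 vertices and 4 edges, respectively. The following is a minimal sketch; the function name `ifg_cost` and the particular weight values are assumptions chosen only for illustration.

```python
def ifg_cost(num_vertices, num_edges, kappa1, kappa2):
    """Cost of an IFG: kappa1 per vertex (processor) plus kappa2 per edge."""
    return kappa1 * num_vertices + kappa2 * num_edges

# (|V|, |E|) of the two IFGs described in Example 2.2.
XI_1 = (3, 5)
XI_2 = (4, 4)

# Hypothetical weights: edges twice as costly as vertices, then the reverse.
edge_heavy = (ifg_cost(*XI_1, 1, 2), ifg_cost(*XI_2, 1, 2))
vertex_heavy = (ifg_cost(*XI_1, 2, 1), ifg_cost(*XI_2, 2, 1))
```

The two costs differ by κ_1 − κ_2, so between these two partitions the four-subset one is cheaper exactly when an edge costs more than a vertex. This comparison says nothing about other partitions of the same CFG.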
Optimal processor assignment is very similar to the problem of program partitioning
covered in the literature [10], [17], [24], [60], [81], [123], [132]. Solving the
partitioning problem for only two processors with the aid of network flow algorithms is
addressed in [132]. In [17], a partition strategy was proposed for a rectangular grid with
unequal weights. The general partition problem has been shown to be NP-Hard in [53],
[81]. The processor assignment problem we consider here is a restricted version of the
general partition problem. Our model for the processor assignment requires that no two
vertices from the same level belong to the same subset. We next prove that the optimal
processor assignment problem, although a restricted version of the general partition
problem, is also NP-Hard.
2.3 Computational Complexity of the Optimal Partition Problem
According to the definition of the optimal partition, directions of the edges in a
CFG do not play any role. Therefore, in obtaining the computational complexity of the
problem, we ignore the directions of the edges of the CFG. To use known results
from computational complexity theory, we convert the above optimization problem into
a decision problem [53]. The decision problem, which we call the Level Partitioning (LP)
problem, is defined next.
Level Partitioning (LP) Problem:
INSTANCE: An undirected n-partite graph G and nonnegative integers κ_1, κ_2, and K.
QUESTION: Can we find a partition ξ on the vertex set of G such that each subset
contains vertices from distinct partite sets and κ_1·|V(ξ(G))| + κ_2·|E(ξ(G))|
≤ K?
Theorem 2.1: LP is NP-Complete.
Proof: It is straightforward to show that LP belongs to class NP. To show that it is
NP-Hard, we transform an instance of the Three Dimensional Matching (3DM) problem
[53] into an instance of the LP problem and show that the 3DM instance has a matching
if and only if the LP instance has a solution. For completeness we state the 3DM
problem [53].
INSTANCE: Set M ⊆ X × Y × Z, where X, Y and Z are disjoint sets each having p
elements.
QUESTION: Does M contain a matching, i.e., a subset M' ⊆ M of cardinality p such
that no two elements in M' agree in any coordinate?
For the 3DM instance, let |X| = |Y| = |Z| = p and |M| = q. From this, we
construct an instance of LP, that is, a CFG C and a constant K. For each element x_i ∈
X, there is a vertex x_i in C with label 1. Similarly, for each element y_i ∈ Y and z_i ∈ Z
there exist vertices y_i and z_i with labels 2 and 3, respectively. For each element m_j = (x^j,
y^j, z^j) in M there are 9 vertices a_j[1] through a_j[9] in C. Figure 2.6 shows how those
vertices are connected and labeled. We define C = (V, E) by
V = {x_i} ∪ {y_i} ∪ {z_i} ∪ ⋃_{j=1}^{q} {a_j[k] : 1 ≤ k ≤ 9}, and
E = ⋃_{j=1}^{q} E_j, where
E_j = {(x^j, a_j[1]), (a_j[1], a_j[2]), (a_j[2], a_j[3])} ∪ {(y^j, a_j[4]), (a_j[4], a_j[5]), (a_j[5],
a_j[6])} ∪ {(z^j, a_j[7]), (a_j[7], a_j[8]), (a_j[8], a_j[9])} ∪ {(a_j[3], a_j[6]), (a_j[6], a_j[9])}.
Figure 2.6: Component of C corresponding to m_j ∈ M.
According to the way C was constructed, |V(C)| = 3p + 9q, where each level contains p
+ 3q vertices. Therefore, α_1(C) = α_2(C) = α_3(C) = α(C) = p + 3q. The total number of
edges in C is 11q. Since there are no triangles in C, any 3-element subset of vertices in
C can induce at most two edges. We attempt to form a vertex partition of C such that
each subset has vertices with distinct labels and induces two edges. Therefore, we set the
value of K equal to κ_1(p + 3q) + κ_2(11q − 2(p + 3q)) = κ_1(p + 3q) + κ_2(5q − 2p).
Suppose that the 3DM instance has a matching M' ⊆ M. For each element
m_j ∈ M', form the 3-element subsets of vertices {x^j, a_j[1], a_j[2]}, {y^j, a_j[4], a_j[5]},
{z^j, a_j[7], a_j[8]}, and {a_j[3], a_j[6], a_j[9]} in C. These subsets are shown in Figure 2.7 by
enclosing the elements of each subset in a dotted curve. For each element m_j ∉ M', form
the 3-element subsets of vertices {a_j[1], a_j[2], a_j[3]}, {a_j[4], a_j[5], a_j[6]}, and {a_j[7],
a_j[8], a_j[9]} in C (see Figure 2.8). There are p elements in M' and (q − p) elements which
are in M but not in M'. Hence the above partition has 4p + 3(q − p) = p + 3q subsets.
Also, each subset induces two edges. Therefore, the total number of edges induced by the
partition is 2(p + 3q). It can be easily seen that the smallest cycle in C has 14 vertices.
Therefore, there can be at most one edge from one subset to another. Hence |E(ξ(C))| =
11q − 2(p + 3q) = 5q − 2p. Also, |V(ξ(C))| = p + 3q. Therefore, ξ is a solution to the LP
problem.
Figure 2.7: Partition of a component of C corresponding to m_j ∈ M'.
Now suppose that the LP instance C has a solution, that is, there exists a partition
ξ such that κ_1|V(ξ(C))| + κ_2|E(ξ(C))| ≤ K = κ_1(p + 3q) + κ_2(5q − 2p). We first claim that
ξ consists of p + 3q 3-element subsets and each subset induces two edges. Since there are
only three labels in C, one subset can have at most three elements. Let s_1, s_2, and s_3 be
the number of 1-element, 2-element, and 3-element subsets in ξ, respectively. Then,
s_1 + 2s_2 + 3s_3 = |V(C)| = 3(p + 3q). Therefore, 2s_2 + 3s_3 ≤ 3(p + 3q); equality holds only
when s_1 = 0. The inequality can also be written as (4/3)s_2 + 2s_3 ≤ 2(p + 3q). Therefore,
s_2 + 2s_3 ≤ 2(p + 3q); equality holds only if s_1 = s_2 = 0. (1)
Figure 2.8: Partition of a component of C corresponding to m_j ∉ M'.
A 1-element subset does not induce any edges. A 2-element subset can induce at most one
edge. Also, a 3-element subset can induce at most two edges. Therefore,
|E(ξ(C))| ≥ 11q − (2s_3 + s_2). (2)
Combining the two inequalities (1) and (2), we get |E(ξ(C))| ≥ 11q − 2(p + 3q), that is,
|E(ξ(C))| ≥ 5q − 2p; equality holds only if s_1 = s_2 = 0. From Lemma 2.2, |V(ξ(C))| ≥
p + 3q. Therefore, κ_1|V(ξ(C))| + κ_2|E(ξ(C))| ≤ K = κ_1(p + 3q) + κ_2(5q − 2p) can be true
only if s_1 = s_2 = 0, |V(ξ(C))| = p + 3q, and |E(ξ(C))| = 5q − 2p. This proves our claim
that ξ contains p + 3q subsets and each subset induces two edges. Consider element x_i
in X. Vertex x_i must be included with vertices a_j[1] and a_j[2] for some value of j to form
a 3-element subset. This is possible when x_i is in m_j, that is, x_i = x^j (see Figure 2.6).
Once this is done, a_j[3] must be included with a_j[6] and a_j[9] to form a 3-element subset.
Now a_j[5] and a_j[4] must be combined with some y_k such that y_k = y^j to form a
3-element subset. Similarly, a_j[8] and a_j[7] must be combined with some z_l such that z_l =
z^j. From the construction of C, (x_i, y_k, z_l) must be an element of M. Taking all the
elements in X, we thus obtain p tuples (x^j, y^j, z^j) which exhaust all the
elements in X, Y and Z. Thus we have a matching for the 3DM instance. □
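The forward direction of the reduction can be checked numerically on a small case. The sketch below builds the eleven edges of each component exactly as listed in the definition of E_j, forms the partition described in the proof from a given matching, and counts the edges that run between different subsets. The function names, the tuple encoding of vertices, and the particular 3DM instance (p = 2, q = 3) are assumptions made for illustration; the triples are assumed to be distinct.

```python
def a(j, k):
    """Auxiliary vertex k (1..9) of the component built for triple number j."""
    return ("a", j, k)

def build_lp_graph(triples):
    """Undirected edge set of the graph C for a list of distinct (x, y, z) triples."""
    edges = set()
    for j, (x, y, z) in enumerate(triples):
        component = [
            (("x", x), a(j, 1)), (a(j, 1), a(j, 2)), (a(j, 2), a(j, 3)),
            (("y", y), a(j, 4)), (a(j, 4), a(j, 5)), (a(j, 5), a(j, 6)),
            (("z", z), a(j, 7)), (a(j, 7), a(j, 8)), (a(j, 8), a(j, 9)),
            (a(j, 3), a(j, 6)), (a(j, 6), a(j, 9)),
        ]
        edges.update(frozenset(e) for e in component)
    return edges

def partition_from_matching(triples, matching):
    """Subsets used in the proof: four per matched triple, three per unmatched one."""
    chosen = set(matching)
    subsets = []
    for j, (x, y, z) in enumerate(triples):
        if (x, y, z) in chosen:
            subsets += [{("x", x), a(j, 1), a(j, 2)}, {("y", y), a(j, 4), a(j, 5)},
                        {("z", z), a(j, 7), a(j, 8)}, {a(j, 3), a(j, 6), a(j, 9)}]
        else:
            subsets += [{a(j, 1), a(j, 2), a(j, 3)}, {a(j, 4), a(j, 5), a(j, 6)},
                        {a(j, 7), a(j, 8), a(j, 9)}]
    return subsets

def cross_edges(edges, subsets):
    """Number of edges whose endpoints lie in two different subsets."""
    owner = {v: i for i, s in enumerate(subsets) for v in s}
    return sum(1 for e in edges if len({owner[v] for v in e}) == 2)

# Hypothetical instance with p = 2 and q = 3.
p, triples = 2, [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
q = len(triples)
edges = build_lp_graph(triples)
subsets = partition_from_matching(triples, [(0, 0, 0), (1, 1, 1)])
crossing = cross_edges(edges, subsets)
```

For this instance the counts agree with the formulas in the proof: 11q edges, p + 3q subsets, and 5q − 2p edges between subsets.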
Since the general problem is NP-Hard, we can utilize two approaches in finding
the solution to the processor assignment problem. One approach is to use some features 
which are inherently present in many of the well known parallel algorithms, so that the 
processor assignment problem can be solved in polynomial time. Another approach is to 
use a polynomial time algorithm (probably heuristic) such that the solution is not 
necessarily optimal but does not deviate from the optimal solution by more than a fixed 
percentage.
2.4 Exploiting Regularities in the Source Algorithm
For the optimal processor assignment problem, we use the first of the above 
approaches. The choice of the first approach can be justified as follows. Even though 
obtaining a solution to the general problem has some theoretical interest, it is of very little 
practical interest. The vast majority of practical algorithms show some degree of 
regularity in their structures. In fact, certain regularities are present in every well behaved 
algorithm since spaghetti code writing is completely obsolete. We also favor having 
regular features in the target MBS. An irregular MBS would be unattractive in many ways. 
Thus, by paying attention only to CFGs with some regular features (to be defined 
precisely next), we do not reduce the generality of the design procedure insofar as actual
algorithms and actual architectures are concerned. However, if one is interested in finding 
a partition of a general CFG, Algorithm 2.1 given in Section 2.5 will serve the purpose. 
Definition 2.5: A CFG C_A is said to be locally regular if the following three conditions
are satisfied.
(a) Every vertex in the same level has the same indegree (outdegree).
(b) If there is an edge from a vertex in partite set W^j to a vertex in partite set W^k,
then k = j + 1.
(c) If the relation |W^j| ≤ |W^{j+1}| (|W^j| ≥ |W^{j+1}|) holds for some j, then the same holds
for every j.
The above definition characterizes almost every well-behaved parallel algorithm. 
Statement (a) stipulates that all concurrent computations in the algorithm behave 
similarly. This is true for many algorithms such as binary search, bitonic sort, matrix 
multiplication, and prefix sum computations. Statement (b) stipulates that a certain 
computation can receive data only from computation(s) in the immediately preceding 
level. This is also generally true for most practical SIMD algorithms. Statement (c)
stipulates that the number of concurrent computations does not vary irregularly as we
move along the algorithm from the beginning to end. This is also true for many parallel
algorithms. For example, in the Ascend/Descend class of algorithms [115], the width of the
computation is constant, that is, |W^j| = |W^{j+1}| for all j. Also, CFGs belonging to many
algorithms are (binary) fan-in trees [3], [22], [29], [136], [144]. For such CFGs, the width
of the algorithm gradually decreases, that is, |W^j| ≥ |W^{j+1}| for all j. These algorithms
belong to the Fan-in/Fan-out algorithmic class [144].
2.5 Optimal Vertex Partitioning of a Locally Regular CFG
We will show that if a CFG is locally regular, its optimal partition can be found 
in polynomial time. In order to do that, we will use some graph matching techniques. For 
completeness we will state some fundamental definitions and results related to graph 
matching.
Definition 2.6: Two distinct edges in a graph G are independent if they are not adjacent
in G. A set of pairwise independent edges of G is called a matching in G. A matching
of maximum cardinality is called a maximum matching [30].
For a given graph G, a maximum matching can be found in polynomial time [30],
[56]. For the present purpose, we are only interested in matchings in bipartite graphs. A
maximum matching of a bipartite graph can be found in O(n^{5/2}) time, where n is the
number of vertices in the graph [67].
Definition 2.7: The set of all vertices adjacent to a vertex v in a graph is called the
neighborhood of v and is denoted by N(v). The set of all vertices adjacent to a set S of
vertices is similarly called the neighborhood of S and is denoted by N(S).
We state the following theorem due to Hall without proof [30].
Theorem 2.2: Let G = (X, Y, E) be a bipartite graph. Then X can be matched to a subset
of Y if and only if, for each subset S of X, |N(S)| ≥ |S|. □
Theorem 2.3: Let G = (X, Y, E) be a bipartite graph such that each vertex in X has
degree p and each vertex in Y has degree q. Then either X can be matched to a subset
of Y, or Y can be matched to a subset of X, depending on whether |X| ≤ |Y| or |Y| ≤ |X|.
Proof: For any subset B of vertices of G, let E_B be the set of edges incident with the
vertices of B. Suppose that p ≥ q. Then, since p|X| = q|Y|, it follows that |X| ≤ |Y|. Let S
be a subset of X. Then |E_S| = p|S| and |E_{N(S)}| = q|N(S)|. Since N(S) contains all the
neighbors of S, it follows that E_S ⊆ E_{N(S)}. Therefore |E_S| ≤ |E_{N(S)}|, and hence p|S| ≤ q|N(S)|.
Since p ≥ q, it follows that |S| ≤ |N(S)|, and from Theorem 2.2, X can be matched to a subset of Y.
Similarly, if p < q, we can show that Y can be matched to a subset of X. □
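Hall's condition can be tested directly on small graphs by enumerating subsets, which makes the asymmetry in Theorem 2.3 easy to see. The following brute-force sketch is exponential in |X| and is meant only for tiny examples; the function name and the sample graph (each X vertex of degree 2, each Y vertex of degree 1) are assumptions for illustration.

```python
from itertools import combinations

def hall_condition_holds(side, adj):
    """Check |N(S)| >= |S| for every nonempty subset S of `side` by enumeration."""
    for size in range(1, len(side) + 1):
        for subset in combinations(side, size):
            neighborhood = set().union(*(adj[v] for v in subset))
            if len(neighborhood) < size:
                return False
    return True

# Hypothetical biregular bipartite graph: p = 2 on the X side, q = 1 on the Y side.
adj_x = {"x0": {"y0", "y1"}, "x1": {"y2", "y3"}}
adj_y = {"y0": {"x0"}, "y1": {"x0"}, "y2": {"x1"}, "y3": {"x1"}}

x_into_y = hall_condition_holds(list(adj_x), adj_x)
y_into_x = hall_condition_holds(list(adj_y), adj_y)
```

Here p ≥ q and |X| ≤ |Y|, so the smaller side X satisfies Hall's condition while the larger side Y does not.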
Recall that a CFG is a directed n-partite graph with the property that vertices in
partite set W^j have incoming edges originating from partite set W^k, k < j. A CFG C_A is
locally regular if each vertex in partite set W^j has the same indegree/outdegree and all
the edges incident on vertices in W^j originate from partite set W^{j−1}.
Theorem 2.4: Let C_A be a locally regular CFG. Then, for any level-disjoint partition ξ,
|E(ξ(C_A))| ≥ |E(C_A)| − |V(C_A)| + α(C_A).
Proof: Let a be the cardinality of partition ξ. Let V_i be any subset of partition ξ. Then
V_i contains at most one vertex from a given partite set, say W^j. Furthermore, a vertex
in V_i belonging to W^j can be adjacent only with a vertex in either W^{j−1} or W^{j+1}. Therefore,
V_i cannot contain any undirected cycles. Therefore, if E_i is the set of edges induced by
V_i, then |E_i| ≤ |V_i| − 1. Therefore,
Σ_{i=1}^{a} |E_i| ≤ Σ_{i=1}^{a} (|V_i| − 1) = Σ_{i=1}^{a} |V_i| − a = |V(C_A)| − a.
According to Lemma 2.2, a ≥ α(C_A). Therefore, Σ_{i=1}^{a} |E_i| ≤ |V(C_A)| − α(C_A). There can be
at most one edge of a given color from a vertex in one subset V_1 to a vertex in another
subset V_2. Therefore, |E(ξ(C_A))| = |E(C_A)| − Σ_{i=1}^{a} |E_i|. Hence,
|E(ξ(C_A))| ≥ |E(C_A)| − |V(C_A)| + α(C_A). □
It should be noted that the above theorem is true not only for locally regular 
CFGs. Any CFG satisfying Condition (b) of Definition 2.5 will have the lower bound 
stated in the above theorem. In the next theorem we show that the lower bound can 
always be reached for a locally regular CFG.
Theorem 2.5: Let C_A be a locally regular CFG. Then there exists a vertex partition ξ
such that |E(ξ(C_A))| = |E(C_A)| − |V(C_A)| + α(C_A).
Proof: We provide a constructive proof. Suppose that |W^{j+1}| ≤ |W^j|, for 1 ≤ j ≤
n − 1. Then α(C_A) = a = |W^1|. Let L^j be the subgraph of C_A induced by the vertex set W^{j+1}
∪ W^j, for 1 ≤ j ≤ n − 1. Clearly, L^j is a bipartite graph. We construct the subsets of
vertices V_1 through V_a as follows. Initially, V_i contains exactly one vertex from W^1, for
1 ≤ i ≤ a. According to Theorem 2.3, since |W^2| ≤ |W^1| by the hypothesis, vertex set
W^2 can be matched to a subset of W^1 in L^1. Let M^1 be such a matching. Each vertex
in W^1 is contained in exactly one subset V_i, 1 ≤ i ≤ a. If u_1 ∈ V_i and (u_1, u_2) ∈ M^1,
then let V_i = V_i ∪ {u_2}. In other words, if a vertex in the subset V_i is included in the
matching M^1, insert its matched vertex also in V_i.
Since |W^3| ≤ |W^2|, W^3 can be matched to a subset of W^2 in L^2. Let M^2 be
such a matching. If u_2 ∈ V_i and (u_2, u_3) ∈ M^2, then let V_i = V_i ∪ {u_3}. Repeat this
procedure until all the vertices in C_A are inserted into the subsets V_1 through V_a. Every
time we update V_i, we insert a new vertex which is adjacent with exactly one vertex in
the existing V_i. Therefore, the number of edges induced by V_i is |V_i| − 1. Hence the total
number of edges induced by the partition is Σ_{i=1}^{a} (|V_i| − 1) = |V(C_A)| − α(C_A). Thus,
|E(ξ(C_A))| = |E(C_A)| − |V(C_A)| + α(C_A). The proof is similar if |W^{j+1}| ≥ |W^j|, 1 ≤ j ≤
n − 1. The only difference is that we start with each V_i, 1 ≤ i ≤ a, containing exactly one
vertex from W^n. □
Thus, when C_A is a locally regular CFG, there exists a partition ξ which minimizes
both |V(ξ(C_A))| and |E(ξ(C_A))| simultaneously. The following algorithm provides an
optimal partition of a locally regular CFG.
Algorithm 2.1
INPUT: Locally regular CFG C_A with n levels.
OUTPUT: Optimal partition {V_1, V_2, ..., V_a} of V(C_A).
begin
  if |W^1| ≥ |W^n| then
    begin a = |W^1|; j = 1; o = '+' end
  else
    begin a = |W^n|; j = n; o = '−' end;
  for i = 1 to a do
    V_i = {v_i^j}, where W^j = {v_1^j, v_2^j, ..., v_a^j};
  while (o = '+' and j ≠ n) or (o = '−' and j ≠ 1) do
  begin
    L^j := subgraph induced by W^{j o 1} ∪ W^j;
    M^j := a matching {(x, y) : x ∈ W^j, y ∈ W^{j o 1}} in L^j such that W^{j o 1}
           matches to a subset of W^j;
    for i = 1 to a do
      if ((x, y) ∈ M^j) and (x ∈ V_i) then V_i = V_i ∪ {y};
    j = j o 1;
  end;
end.
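The level-by-level matching procedure can be sketched in Python as follows. Two points are assumptions of this sketch rather than part of the text: it uses a simple augmenting-path (Kuhn) matching in place of the faster bipartite matching algorithm of [67], and it raises an error when a level cannot be fully matched instead of acting as a heuristic for general CFGs. The function names and the small fan-in tree used as input are hypothetical.

```python
def match_into(small, big, adjacent):
    """Match every vertex of `small` to a distinct vertex of `big` where possible.

    Simple augmenting-path matching; returns a dict small_vertex -> big_vertex.
    """
    partner = {}  # big vertex -> small vertex currently matched to it

    def augment(u, seen):
        for w in big:
            if adjacent(u, w) and w not in seen:
                seen.add(w)
                if w not in partner or augment(partner[w], seen):
                    partner[w] = u
                    return True
        return False

    for u in small:
        augment(u, set())
    return {u: w for w, u in partner.items()}

def partition_locally_regular(levels, edges):
    """levels: list of vertex lists, first to last level; edges: set of pairs (u, w)."""
    if len(levels[0]) < len(levels[-1]):
        levels = levels[::-1]  # start from the wider end
    undirected = {frozenset(e) for e in edges}

    def adjacent(u, w):
        return frozenset((u, w)) in undirected

    subset_of = {v: i for i, v in enumerate(levels[0])}
    for cur, nxt in zip(levels, levels[1:]):
        matched = match_into(nxt, cur, adjacent)
        if len(matched) != len(nxt):
            raise ValueError("level not fully matched; CFG is not locally regular")
        for u, w in matched.items():
            subset_of[u] = subset_of[w]  # u joins the subset of its matched vertex

    parts = [set() for _ in levels[0]]
    for v, i in subset_of.items():
        parts[i].add(v)
    return parts

# Hypothetical binary fan-in tree with level widths 4, 2, 1.
levels = [["a", "b", "c", "d"], ["e", "f"], ["g"]]
edges = {("a", "e"), ("b", "e"), ("c", "f"), ("d", "f"), ("e", "g"), ("f", "g")}
parts = partition_locally_regular(levels, edges)
induced = sum(1 for u, w in edges if any(u in s and w in s for s in parts))
```

For this tree the partition has four subsets and induces |V| − α = 7 − 4 = 3 edges, so the IFG has |E| − |V| + α = 3 edges, which is the lower bound of Theorem 2.4.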
We can analyze the computational complexity of Algorithm 2.1 as follows. The
first two compound statements take a constant amount of time. By using the algorithm
for finding a maximum matching in a bipartite graph given in [67], each iteration of the
while loop takes O(a^{5/2}) time. Furthermore, the while loop is executed at most n times.
Therefore, the total time it takes is O(n·a^{5/2}). Since |V(C_A)| is O(na), the time
complexity can be expressed as O(|V(C_A)|·a^{3/2}).
Optimality of the output of Algorithm 2.1 is guaranteed only if the CFG is locally 
regular. When the CFG is not locally regular, the algorithm produces a nearly optimal 
partition. Therefore, Algorithm 2.1 also provides a heuristic method for solving the 
general level-disjoint vertex partition problem.
Example 2.3: Consider the CFG, denoted by C_3, shown in Figure 2.9. Clearly, it is a
locally regular CFG. It has four levels, i.e., four partite sets. Figure 2.10 shows an optimal
partitioning of its vertex set using our algorithm. The subsets are named 1, 2, 3, 4, 5, and
6. Figure 2.11 shows the corresponding IFG, which we denote by G_2. In G_2, edges of
colors 1, 2, and 3 are represented by solid, broken, and dotted lines, respectively.
Figure 2.9: A locally regular CFG, C_3.
It is interesting to notice that the CFGs of many parallel algorithms such as bitonic 
sorting, FFT, and convolution, have stronger regularities. Specifically, each computational 
node (except in the first and the last concurrency levels) has two incoming edges and two 
outgoing edges. To emphasize the properties of such parallel algorithms, we provide the 
following definition.
Definition 2.8: A CFG is said to be regular if it is locally regular and every level has the 
same indegree (outdegree) 2, except that the indegree of vertices in the first level and the 
outdegree of the vertices in the last level are equal to zero.
Lemma 2.3: If a CFG is regular, then each partite set has the same number of vertices.
Proof: Let W^1, W^2, ..., W^n be the partite sets of the CFG. Then there are 2|W^1| edges
incident with the vertices in W^1. This is equal to the number of incoming edges to W^2.
Since each vertex of W^2 has indegree 2, the number of vertices in W^2 is equal to |W^1|.
Similarly, we can show that |W^2| = |W^3| = ... = |W^{n−1}| = |W^n|. □
Figure 2.10: Optimal partition of V(C_3) using Algorithm 2.1.
Therefore, an algorithm corresponding to a regular CFG has the same number of 
computations at each concurrency level. In other words, the algorithm has the same 
"width" throughout the computation. If we assign one processor per computational node 
at each level, then the same number of processors is needed at each level. This kind of
algorithm uses processors very efficiently; no processor is idle at any point during the
execution of the algorithm. When the CFG is regular, the subtasks (see Definition 2.2)
corresponding to any optimal partition consist of one node from each concurrency level.
Therefore, every processor performs the same amount of computation. Next, we present
two important properties of an IFG obtained by optimal partitioning of a regular CFG.
Figure 2.11: The IFG G2 corresponding to the optimal partition shown in Figure 2.10.
Definition 2.9: An IFG is said to be regular if every vertex has exactly one outgoing 
edge from each color.
Theorem 2.6: Let CA be a regular CFG. Then the IFG  obtained by optimal partitioning 
of CA is regular.
Proof: For any optimal partitioning of C_A, each subset V_i has n vertices, one from
each partite set, where n is the number of partite sets. In the subgraph induced by V_i,
there is exactly one edge from the vertex belonging to W^j to the vertex belonging to W^{j+1},
1 ≤ j ≤ n − 1. Since every vertex has two outgoing edges (except in the last level) and two
incoming edges (except in the first level), exactly one edge of each kind goes to another
subset. Therefore, v_i has one incoming edge of color 1, one outgoing edge of color
1, one incoming edge of color 2, one outgoing edge of color 2, and so on. Thus each
vertex of the resulting IFG has one incoming edge and one outgoing edge of each
color. Therefore, the resulting IFG is regular. □
Theorem 2.7: Let CA be a regular CFG with n concurrency levels. Then the IFG obtained 
by optimally partitioning CA has n -  1 colors.
Proof: According to the model presented in Section 1.6, edges of C_A originating at
concurrency level r are of color r, 1 ≤ r ≤ n − 1. Therefore, C_A has at most n − 1 colors.
By Theorem 2.6, every vertex of the resulting IFG has one outgoing edge and one
incoming edge of each color r, 1 ≤ r ≤ n − 1. Hence all n − 1 colors appear in the IFG. □
Figure 2.12: A CFG, C_4.
Example 2.4: Figure 2.12 shows the CFG, denoted by C_4, corresponding to the
Ascend/Descend class of parallel algorithms [115]. It is a 2-regular graph with 4 concurrency
levels and 8 computational nodes in each level. Figure 2.13 shows an optimal partition
of the vertex set of the CFG of Figure 2.12. Subsets are labeled with binary numbers
from 000 through 111. Figure 2.14 shows the corresponding IFG, which we denote by
G_3. It has three colors represented by solid, broken, and dotted lines. We will later see
that the IFG G_3 shown in Figure 2.14 is not only regular but also vertex symmetric. We
will pay special attention to vertex symmetric IFGs in Chapter 4. If we replace each pair
of unidirectional edges between adjacent vertices of G_3 by an undirected edge and ignore
colors, we obtain the 3-dimensional boolean hypercube.
Figure 2.13: An optimal vertex partition of C_4.
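The relationship between an Ascend/Descend CFG and the hypercube can be reproduced in a short sketch. Since the figure itself is not available here, the edge pattern below is an assumption: node i at level r sends to node i and to node i XOR 2^(r−1) at level r + 1, which is the usual butterfly pattern. The partition groups the nodes with the same index i, matching the binary subset labels 000 through 111.

```python
def ascend_cfg(dim):
    """Directed edges of a butterfly-style CFG with dim + 1 levels, 2**dim nodes each.

    The edge pattern is an assumed stand-in for the CFG of Figure 2.12.
    """
    n = 1 << dim
    edges = []
    for r in range(1, dim + 1):
        for i in range(n):
            edges.append(((r, i), (r + 1, i)))
            edges.append(((r, i), (r + 1, i ^ (1 << (r - 1)))))
    return edges

def ifg_by_rows(edges):
    """IFG edges (src, dst, color) when all nodes with the same index share a subset."""
    return {(i, k, r) for (r, i), (_, k) in edges if i != k}

cfg_edges = ascend_cfg(3)
ifg = ifg_by_rows(cfg_edges)

# Lower bound of Theorem 2.4: |E| - |V| + alpha, with 4 levels of 8 nodes.
lower_bound = len(cfg_edges) - 4 * 8 + 8

# Undirected, uncolored version of the IFG compared with the 3-cube.
undirected = {frozenset((i, k)) for i, k, _ in ifg}
cube = {frozenset((i, i ^ (1 << b))) for i in range(8) for b in range(3)}
```

Under these assumptions the row partition attains the lower bound, every IFG vertex has exactly one outgoing edge of each of the three colors, and the underlying undirected graph is the 3-dimensional hypercube.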
2.6 Processor Scheduling
An important issue closely related to processor assignment is that of processor 
scheduling. Once an IFG has been constructed from a CFG, each processor is assigned 
a single subtask. Therefore, the assignment of processors to individual computational 
nodes is known. Let computational node u_i be assigned to processor P_i. Let t_i be the time
at which processor P_i starts to compute u_i. For each node u_i, the knowledge of P_i and t_i
completely specifies the execution of the computation u_i. The set {(u_i, P_i, t_i) : u_i ∈ V(C_A)}
is called a processor schedule for algorithm A. When we have a vertex partition ξ of
V(C_A), all the pairs (u_i, P_i) are fixed. The execution time of the algorithm is decided by
the third parameter t_i. Therefore, for the processor schedule to be complete, a time t_i must
be assigned to every computational node u_i.
Figure 2.14: The IFG G_3 corresponding to the optimal partition shown in Figure 2.13.
As one might expect, the allocation of t_i to u_i is quite straightforward. Merely
set t_i = j·t_0, where j is the concurrency level to which u_i belongs and t_0 is the time a
computational node takes on a single processor. According to this scheduling, each
processor will first execute its computational node for level 1. Then each processor
executes its computational node for level 2, and so on. When a processor is executing its
r-th computational node, we say that the processor is executing its subtask at level r. When
one processor is outputting data at level r, every other processor should be outputting data
at level r. These transfers correspond to edges of color r in the IFG.
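The schedule described above follows mechanically from the partition: each node runs on the processor of its subset, at a start time equal to its concurrency level multiplied by the time of one computational node. A minimal sketch, in which the function name, the parameter `unit_time`, and the three-node example are assumptions made for illustration:

```python
def processor_schedule(level, subset_of, unit_time=1):
    """Return sorted triples (node, processor, start_time), start_time = level * unit_time."""
    return sorted((u, subset_of[u], level[u] * unit_time) for u in level)

# Hypothetical CFG nodes: level of each node and the processor (subset) it belongs to.
level = {"u1": 1, "u2": 1, "u3": 2}
subset_of = {"u1": 0, "u2": 1, "u3": 0}
schedule = processor_schedule(level, subset_of, unit_time=5)
```

Because all nodes of one level receive the same start time, processors operate in lockstep, which is what allows the transfers of one color to be treated as a single interconnection function.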
By the optimal processor assignment, each processor is allocated a single subtask. 
If the number of processors in the target MBS is limited, we may have to assign more 
than one subtask to a processor. In that case, in order to achieve the best speed, the order 
of execution of the subtasks by each processor must be correctly determined. This issue 
will be addressed in Chapter 6.
CHAPTER 3 
BUS ASSIGNMENT
In this chapter, we address the second stage of the optimal MBS construction 
process. From a given IFG  we construct an optimal MBS such that the MBS can perform 
each of the concurrent data transfers associated with the IFG in one time step. This will 
correspond to an optimal MBS which can run the source algorithm (from which the IFG 
was derived) at its ideal speed. We assume that the IFG has been constructed from a CFG
by finding an optimal (or nearly optimal) level-disjoint vertex partition.
Bus assignment has other related applications also. An interconnection network for 
an SIMD machine is associated with a set of interconnection functions, where an 
interconnection function corresponds to concurrent data transfers from a set of source 
processors to a set of target processors [125]. Clearly, an IFG contains all the information 
about an interconnection network for an SIMD machine. Vertices represent processors and 
edges of the same color represent an interconnection function. This is the reason why the 
IFG  (Interconnection Function Graph) has been given that name in this dissertation. The 
construction of an MBS from a given IFG addresses three different MBS design related 
issues.
(a) It is the second stage of the process of designing an optimal MBS realizing a 
given parallel algorithm.
(b) It provides a method to construct an optimal MBS emulating a given static 
interconnection network.




Therefore, in the remainder of this dissertation, we will regard an IFG as representing 
either a set of interconnection functions or the communication pattern of an algorithmic 
class. Due to the wide applicability of the IFG, the tools developed in this chapter will 
be useful for the designers of MBSs in many respects. In [49], an MBS was proposed 
which emulates the SIMD hypercube. Using the tools developed here, we can design 
many different configurations of MBSs that emulate hypercubes with superior properties.
In this chapter, we show that the bus assignment problem is equivalent to the 
problem of partitioning the edge set of the IFG with specific constraints. This will be 
called color partition. We will show how broadcasting information can be incorporated 
into the IFG. We prove that the optimal color partition problem is NP-Hard. We also 
prove that when the IFG has only two colors, an optimal color partition can be found in 
polynomial time. This corresponds to the construction of an optimal MBS realizing two 
interconnection functions such as shuffle and exchange. Furthermore, based on the two 
color case, we develop an efficient heuristic algorithm to solve the general color partition 
problem. In Chapter 4, we show that the optimal color partition problem can be solved 
in polynomial time when the IFG  is vertex symmetric.
3.1 Preliminaries
The vertices of the IFG directly represent the processors in the target MBS. The 
edges of the IFG represent the communication requirement of the source algorithm. From 
the edge set of the IFG, we need to determine the set of buses and the set of interfaces 
of the target MBS. We need to address the following question: How can we determine 
buses, drivers and receivers to realize the communication pattern dictated by the IFG? In 
other words, how can we assign buses to the IFG $\zeta(C_A)$ such that the resulting MBS will
run algorithm A at its ideal speed? In order to answer this question, we next define some 
terminology used throughout the remainder of the dissertation.
Definition 3.1: In an MBS, we say that processor $P_1$ can transfer data to processor $P_2$ in 
one step if and only if there exists a bus B which is connected to processors $P_1$ and $P_2$ 
via a driver and a receiver, respectively.
Definition 3.2: Let A be a parallel algorithm and M  be an MBS. Then M  is said to realize 
A if the computational nodes of A can be assigned to processors in M  such that concurrent 
computational nodes of A are assigned to distinct processors and concurrent data transfers 
of A can be carried out in a single step on M.
Therefore, for an MBS M to realize algorithm A, a necessary condition is: for 
every edge $(v_i, v_j)$ in the IFG $\zeta(C_A)$, the MBS must be capable of transferring data from 
processor $v_i$ to processor $v_j$ in a single step. Otherwise, one communication edge in the 
source algorithm may correspond to more than one communication step in the MBS. 
Therefore, associated with every edge of the IFG, there must be a bus in the target MBS. 
However, this condition is not sufficient to guarantee that M  realizes algorithm A. If two 
edges in the IFG correspond to concurrent data transfers (non broadcasting), there must 
be two distinct buses in the MBS to carry out those two data transfers.
One naive approach would be to assign a distinct bus for each edge in the IFG. 
But this may result in redundant buses and therefore may not be optimal. If two edges in 
the IFG have different colors, the source algorithm does not require the corresponding 
data transfers to be carried out concurrently. Hence, a single bus may be used for data 
transfers corresponding to both edges. If two edges in the IFG have the same color with 
different originating vertices, then we should assign a distinct bus for each edge for the
data transfers to proceed in parallel. If two edges in the IFG have the same color and 
originate from the same vertex, then whether we need only one bus or two buses for the 
data transfer is ambiguous. If the data (information) corresponding to the two edges are 
different, we still need two buses; because, otherwise, data collision will occur. On the 
other hand, if the two data items corresponding to the two edges are the same, then the 
situation corresponds to a broadcasting operation and only one bus is required. From the 
way an IFG is constructed, broadcasting information is not available in the IFG. In 
Section 3.3, we show how edge coloring rules of the IFG  can be modified in order to 
convey broadcast information. Until then we will assume that the given IFG  does not 
model broadcasting operations. In the following, we define the concept of an MBG 
realizing an IFG. It should be noted that MBG is merely a graphical representation of an 
MBS.
Definition 3.3: Let G be an IFG and G* be an MBG. Then G* is said to realize G if there 
exists a bijective mapping $\psi: V(G) \to X(G^*)$ satisfying the following conditions.
(1) For every directed edge $(v_i, v_j) \in E(G)$, there exists a vertex $y_{ij} \in Y(G^*)$ such that 
the vertices $\psi(v_i)$, $y_{ij}$, $\psi(v_j)$ represent a directed path in G*.
(2) If $(v_{i_1}, v_{j_1})$ and $(v_{i_2}, v_{j_2})$ are edges in E(G) of the same color, then $y_{i_1 j_1}$ and $y_{i_2 j_2}$ 
(as defined in (1)) are distinct vertices in $Y(G^*)$.
It is clear from Definitions 3.2 and 3.3 that the MBS M realizes algorithm A if and 
only if M realizes the IFG $\zeta(C_A)$. Therefore, we can refer to an IFG without referring to 
its origin. Next we show how graph theory can be utilized to construct an MBG  realizing 
a given IFG.
3.2 Construction of an MBG from a given IFG
Definition 3.4: A partition $\pi$ of the edge set of an IFG G will be called a color partition 
of G if no subset of $\pi$ has more than one edge of the same color.
Let $E_j\langle\pi\rangle$, $1 \le j \le b$, be the subsets of edges corresponding to the color partition 
$\pi$ of G. Denote by $H_j\langle\pi\rangle$ the subgraph of G induced by $E_j\langle\pi\rangle$, $1 \le j \le b$. (Subgraph 
$H_j\langle\pi\rangle$ itself can be considered as an IFG having all distinct color edges.) Construct an 
MBG G* as follows. For each vertex $v_i \in V(G)$, associate a unique vertex $x_i \in X(G^*)$. 
Also, for each subgraph $H_j\langle\pi\rangle$ associate a unique vertex $y_j\langle\pi\rangle \in Y(G^*)$. Establish an edge 
from $x_i$ to $y_j\langle\pi\rangle$ (from $y_j\langle\pi\rangle$ to $x_i$) if vertex $v_i$ is in $H_j\langle\pi\rangle$ and there is an edge in $H_j\langle\pi\rangle$ 
which is directed away from (towards) $v_i$. It is easy to verify that MBG G* realizes IFG 
G since they satisfy Conditions (1) and (2) of Definition 3.3. Thus, constructing an MBS 
realizing a given IFG is equivalent to finding a color partition $\pi$ of the given IFG.
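The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: edges are represented as (tail, head, color) tuples, and the names build_mbg and interface_count, as well as the four-vertex example, are hypothetical and do not come from the dissertation.

```python
def build_mbg(partition):
    """Build the bus side of an MBG from a color partition.

    `partition` is a list of subsets; each subset is a list of
    (tail, head, color) edges assigned to one bus (assumed encoding).
    Returns two lists: drivers[j] and receivers[j] for bus j.
    """
    drivers, receivers = [], []
    for subset in partition:
        colors = [c for (_, _, c) in subset]
        if len(colors) != len(set(colors)):
            # Definition 3.4: no subset may hold two edges of one color.
            raise ValueError("not a color partition")
        # An edge directed away from v needs a driver from v to the bus;
        # an edge directed towards v needs a receiver from the bus to v.
        drivers.append({u for (u, _, _) in subset})
        receivers.append({v for (_, v, _) in subset})
    return drivers, receivers


def interface_count(drivers, receivers):
    """Total number of interfaces, i.e. the number of edges of the MBG."""
    return sum(len(d) + len(r) for d, r in zip(drivers, receivers))


# Hypothetical 4-vertex IFG with two colors "a" and "b".
example_partition = [
    [(1, 2, "a"), (2, 3, "b")],
    [(3, 4, "a"), (4, 1, "b")],
]
example_drivers, example_receivers = build_mbg(example_partition)
```

In this example each bus ends up with two drivers and two receivers, so the resulting MBG has eight interface edges.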
Since $\pi$ uniquely determines the MBG, we may write (with slight abuse of 
notation) $G^* = \pi(G)$. We will often not distinguish between the MBG and the corresponding 
MBS. We may use $\pi(G)$ to represent either. In Section 2.1 we stated that an IFG is 
connected. Therefore, any graph can be considered as an IFG if it is connected, directed 
and edge colored. Notice that any subset $E_1 \subseteq E(G)$ of edges can be assigned to a single 
bus. The color partition merely imposes the condition that only non concurrent data 
transfers are assigned to the same bus.
Definition 3.5: Let $E_1$ be a subset of edges in the IFG G and $H_1$ be the subgraph induced 
by $E_1$. If $E_1$ is assigned to a single bus, then the set of interfaces attached to that bus will 
be denoted by either $J(E_1)$ or $J(H_1)$.
Therefore, for color partition $\pi$, $J(E_j\langle\pi\rangle)$ (or $J(H_j\langle\pi\rangle)$) represents the set of 
interfaces connected to bus $B_j\langle\pi\rangle$, $1 \le j \le b$. We can also consider $J(E_j\langle\pi\rangle)$ as the set 
of edges incident with $y_j\langle\pi\rangle$ in the MBG $\pi(G)$. The set of processors connected to bus 
$B_j\langle\pi\rangle$ is the set of all vertices in $H_j\langle\pi\rangle$. The following lemma gives a straightforward 
result.
Lemma 3.1: $E(\pi(G)) = \bigcup_{j=1}^{b} J(E_j\langle\pi\rangle)$, and $|E(\pi(G))| = \sum_{j=1}^{b} |J(E_j\langle\pi\rangle)|$.
Definition 3.6: Let G be an IFG with c colors. Then the number of edges of G belonging 
to color r will be denoted by $\beta_r(G)$, $1 \le r \le c$. Also, $\max\{\beta_r(G) : 1 \le r \le c\}$ will be 
denoted by $\beta(G)$.
Throughout the remainder of the dissertation, unless otherwise stated, we will 
adhere to the following notation. G represents an IFG. The number of edges of color r 
is denoted by $\beta_r(G)$ and $\beta(G) = \max\{\beta_r(G)\}$. The cardinality of color partition $\pi$ is b. 
Subsets of edges of E(G) corresponding to color partition $\pi$ are represented by $E_j\langle\pi\rangle$, 
$1 \le j \le b$. The subgraph of G induced by $E_j\langle\pi\rangle$ is $H_j\langle\pi\rangle$. To show a certain color 
partition graphically, we will merely draw the induced subgraphs $H_j$, $1 \le j \le b$. The vertex 
in $Y(\pi(G))$ corresponding to $H_j\langle\pi\rangle$ is $y_j\langle\pi\rangle$. The bus corresponding to $H_j\langle\pi\rangle$ (or $E_j\langle\pi\rangle$, 
or $y_j\langle\pi\rangle$) is denoted by $B_j\langle\pi\rangle$. The set of edges incident with vertex $y_j\langle\pi\rangle$ of $\pi(G)$ is 
denoted by $J(E_j\langle\pi\rangle)$ or $J(H_j\langle\pi\rangle)$, $1 \le j \le b$. When there is no ambiguity as to the color 
partition $\pi$, we will always omit $\langle\pi\rangle$ from the respective names.
Example 3.1: Consider the IFG (which we denote by $G_4$) shown in Figure 3.1. It contains 
6 vertices and 8 edges of four different colors. Colors of the edges are represented by 
integers 1, 2, 3, and 4. It is clear that $\beta_1(G_4) = \beta_2(G_4) = \beta_3(G_4) = \beta_4(G_4) = \beta(G_4) = 2$. 
Figure 3.2 shows the subgraphs $H_1$ and $H_2$ induced by the subsets of a certain color 
partition, denoted by $\pi_1$, of $G_4$. In other words, Figure 3.2 shows the color partition $\pi_1$. 
The MBS corresponding to $\pi_1$ has two buses: one associated with $H_1$ and the other 
associated with $H_2$. Figure 3.3 shows the corresponding MBG, $\pi_1(G_4)$. Notice that $y_1$ is 
incident with 6 edges, that is, $|J(H_1)| = 6$. Therefore, bus $B_1$ is connected to 6 interfaces, 
three of them being drivers and the other three being receivers. Similarly, bus $B_2$ is also 
connected to three receivers and three drivers.
Figure 3.1: IFG $G_4$ with 4 colors 1, 2, 3, and 4.
Figure 3.2: Cardinality two color partition $\pi_1$ of $G_4$.
Figure 3.3: MBG $\pi_1(G_4)$
Next we show how broadcasting information can be incorporated in the IFG  for 
the MBG  construction.
3.3 Incorporating Broadcasting Operations
So far we did not make any assumptions on the identity of the data transfers 
denoted by the edges in the IFG. If two edges are of the same color, we allocate two 
buses to them in order to carry out both data transfers concurrently. Suppose that the 
source algorithm requires a certain computational node to send the same datum 
concurrently to different computational nodes. Also suppose that the processor assignment 
is such that the source node and destination nodes are assigned to distinct processors. This 
corresponds to edges of the same color originating from the same vertex of the IFG. 
Since the same datum is to be transferred, distinct buses are not required. If the 
connectivity allows, a single bus can broadcast the datum to all destination processors. 
A single source processor sending one data item to a set of target processors concurrently
will be called an instance of broadcasting. There may be more than one instance of 
broadcasting occurring concurrently. In such a case, more than one source processor is 
involved.
To include broadcasting information in the IFG, edge coloring should be modified 
as follows. Instead of a single color r, a 3-tuple of colors $(r, \tau, \sigma)$ will be assigned to 
each edge of the IFG. Let $v_i$ be the source and $v_{j_\sigma}$, $1 \le \sigma \le h$, be the destinations of a 
certain broadcasting operation. Then edge $(v_i, v_{j_\sigma})$ will be assigned the color tuple $(r, 1, \sigma)$. 
Here, the first coordinate r represents the primary color that distinguishes the 
concurrent data transfer†. The second coordinate 1 represents the broadcasting instance 
number. If there is another broadcasting instance to be carried out concurrently, the edges 
belonging to that will have a first coordinate r and a second coordinate 2. The third 
coordinate $\sigma$ represents the index of the data transfer involved with that particular 
instance of broadcasting. For generality in this section, we assume single target data 
transfers also as broadcasting operations, where the number of target processors is one. 
With slight abuse of notation, we represent an edge e with color tuple $(r, \tau, \sigma)$ by 
$e(r, \tau, \sigma)$.
The notion of color partition was introduced to capture the idea that, to achieve 
minimum communication time, concurrent data transfers are to be assigned to different 
buses. When broadcasting operations are involved, this is not necessarily true. Therefore, 
the definition of the color partition must also be modified.
†In Chapter 6, we will use the notation of primary color in a different context. Since the two issues are 
treated separately, no ambiguity will occur.
Definition 3.4': A partition of the edge set of the IFG will be a color partition if and 
only if any two edges $e(r_1, \tau_1, \sigma_1)$ and $e(r_2, \tau_2, \sigma_2)$ belonging to the same subset of the 
partition satisfy one of the following conditions.
(1) $r_1 \ne r_2$ (data transfers are not concurrent); or
(2) $r_1 = r_2$ and $\tau_1 = \tau_2$ (they belong to the same instance of broadcasting).
When there are no broadcasting operations, $\beta_r(G)$ represents the number of edges 
with color r. This notation was introduced to count the number of buses needed to realize 
edges of color r. When broadcasting operations are involved, the number of edges with 
primary color r does not necessarily correspond to the number of buses needed for those 
data transfers. Therefore, we modify the definition of $\beta_r(G)$ as follows.
Definition 3.6': $\beta_r(G) = |\{e(r_i, \tau_i, \sigma_i) : r_i = r \ \forall i;\ i \ne j \Rightarrow \tau_i \ne \tau_j \ \forall i, j\}|$
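Definitions 3.4' and 3.6' can be illustrated with a small Python sketch. It assumes each edge is stored as (source, destination, (r, tau, sigma)); the function names and the five-processor example are hypothetical. Under this reading, the modified count for primary color r is the number of distinct broadcasting instances carrying that primary color.

```python
from itertools import combinations


def is_color_partition(partition):
    """Check Definition 3.4' for edges carrying (r, tau, sigma) tuples."""
    for subset in partition:
        for e1, e2 in combinations(subset, 2):
            r1, tau1, _ = e1[2]
            r2, tau2, _ = e2[2]
            # Two edges may share a bus if they are not concurrent
            # (r1 != r2) or belong to the same broadcast (tau1 == tau2).
            if r1 == r2 and tau1 != tau2:
                return False
    return True


def beta_r(edges, r):
    """Number of distinct broadcasting instances with primary color r."""
    return len({tau for (_, _, (rr, tau, _)) in edges if rr == r})


# Hypothetical example: v1 broadcasts one datum to v2 and v3 while
# v4 concurrently sends a different datum to v5 (same primary color 1).
e1 = ("v1", "v2", (1, 1, 1))
e2 = ("v1", "v3", (1, 1, 2))
e3 = ("v4", "v5", (1, 2, 1))
broadcast_edges = [e1, e2, e3]
```

Here the two edges of the broadcast may share one bus, while the transfer from v4 needs a second bus, so two buses suffice for primary color 1 even though three edges carry it.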
Therefore, by relabeling the colors of the edges and modifying the notion of the 
color partition and $\beta_r(G)$, we can incorporate broadcasting operations into the design 
process. It is to be understood that a color partition can always be found in a straightforward 
manner (whether broadcasting operations are involved or not) if we do not restrict 
the number of interfaces and buses involved. The difficult problem is to find an optimal 
color partition. As we show later, the optimal color partition problem with no broadcastings 
involved is a computationally difficult problem. Therefore, when we address the 
optimal color partition problem in the next section, we will not consider broadcasting 
operations. However, in Section 3.7, when we implement a heuristic algorithm for the 
general case, we will incorporate broadcasting operations.
3.4 Optimal Color Partition of an IFG
Not every color partition $\pi$ results in an optimal MBS. This section addresses the 
problem of finding an optimal color partition, that is, a color partition which results in a 
minimal cost MBS which can run the algorithm in minimum time. Our objective is to find 
a minimum cost MBS which realizes a given source algorithm in minimum time. The cost 
of an MBS is the sum of the costs of its individual components; namely, the set of 
processors, the set of buses, and the set of interfaces. Since processor assignment is 
already done, the set of processors is specified. Therefore, we will exclude the number 
of processors from the cost function. Let G be an IFG. Then for any color partition $\pi$, 
$Y(\pi(G))$ and $E(\pi(G))$ represent the set of buses and the set of interfaces, respectively, of 
an MBS realizing G. Therefore, our objective is to find a partition $\pi$ such that $|Y(\pi(G))|$ 
and $|E(\pi(G))|$ are minimum. The following lemma gives a lower bound on $|Y(\pi(G))|$. 
Lemma 3.2: Let $\pi$ be a color partition of IFG G. Then $|Y(\pi(G))| \ge \beta(G)$.
Proof: Since $\pi$ is a color partition, $\beta(G)$ edges of the same color must be spread in at 
least $\beta(G)$ subsets. Also there is a distinct vertex in $Y(\pi(G))$ for every subset of the 
partition. □
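The bound of Lemma 3.2 is attained by at least one color partition: placing the k-th edge of every color in subset k uses exactly as many subsets as the largest color class. The Python sketch below illustrates this for the case without broadcasting; the (tail, head, color) encoding, the function names, and the example edges are assumptions of the sketch, and the partition it returns makes no attempt to reduce the number of interfaces.

```python
from collections import Counter, defaultdict


def beta(edges):
    """Largest number of edges sharing one color (Definition 3.6)."""
    return max(Counter(c for (_, _, c) in edges).values())


def round_robin_partition(edges):
    """Color partition with exactly beta(G) subsets.

    The k-th edge of each color goes to subset k, so no subset ever
    receives two edges of the same color.
    """
    by_color = defaultdict(list)
    for e in edges:
        by_color[e[2]].append(e)
    partition = [[] for _ in range(beta(edges))]
    for group in by_color.values():
        for k, e in enumerate(group):
            partition[k].append(e)
    return partition


# Hypothetical IFG: three edges of color "a", two of color "b".
lb_edges = [(1, 2, "a"), (3, 4, "a"), (5, 6, "a"), (1, 3, "b"), (2, 4, "b")]
lb_partition = round_robin_partition(lb_edges)
```

For the example, the largest color class has three edges and the sketch returns three subsets, matching the lower bound.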
Unfortunately, we do not have such a straightforward method to find a lower 
bound on $|E(\pi(G))|$ for an arbitrary IFG G. Minimization of $|E(\pi(G))|$ is also called "pin 
minimization of buses" and was addressed on several occasions in the literature. In [141], 
a dynamic programming method was proposed for the pin minimization problem. In 
[106], the same problem was approached using switching theory. In [77], pin 
minimization was performed when the IFG (posed in a different form) has some specific 
features.
In this dissertation, we use a graph theoretic approach. Let $\pi$ be a color partition 
of G. If two edges in a subset of $\pi$ are directed towards (away from) the same vertex in 
G, data transfers corresponding to those two edges can share the same receiver (driver). 
In order to use this strategy in constructing a minimal cost MBS, we introduce the 
following definition.
Definition 3.7: Two edges in an IFG are said to be head (tail) compatible if they are 
directed towards (away from) the same vertex. Two edges are said to be 0-compatible, 
1-compatible, or 2-compatible depending on whether they are neither head nor tail 
compatible, head or tail compatible, or both head and tail compatible†, respectively.
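As a minimal sketch of Definition 3.7, the degree of compatibility of two edges can be computed by comparing their end vertices. The tuple layout (tail, head, ...) and the function name compatibility are assumptions made for illustration.

```python
def compatibility(e1, e2):
    """Return z such that e1 and e2 are z-compatible (z is 0, 1, or 2).

    Each edge is a tuple whose first two entries are (tail, head);
    any further entries (such as a color) are ignored.
    """
    tail_compatible = e1[0] == e2[0]   # both directed away from one vertex
    head_compatible = e1[1] == e2[1]   # both directed towards one vertex
    return int(tail_compatible) + int(head_compatible)
```

For instance, edges (1, 2) and (3, 2) are head compatible and hence 1-compatible, while (1, 2) and (2, 1) share both end vertices but are 0-compatible because neither the tails nor the heads coincide.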
In order to minimize the number of interfaces, our criterion is to include edges in 
subsets such that the number of edges in each subset which are head and/or tail 
compatible are maximized. In general, given an arbitrary IFG G, one cannot find a single 
color partition $\pi$ which minimizes both $|Y(\pi(G))|$ and $|E(\pi(G))|$ at the same time. This fact 
is illustrated in the following example.
Example 3.2: Consider the IFG $G_4$ shown in Figure 3.1. Clearly, $\beta(G_4) = 2$. The number 
of edges in the MBG $\pi_1(G_4)$ shown in Figure 3.3 is 12. One can easily be convinced that 
there does not exist any color partition of $G_4$ of cardinality 2 which results in an MBG 
with less than 12 edges. Figure 3.4 shows a color partition $\pi_2$ of $G_4$ whose cardinality is 
3. The corresponding MBG $\pi_2(G_4)$ has only 11 edges as shown in Figure 3.5. Therefore, 
there is no single color partition $\pi$ which minimizes both $|E(\pi(G_4))|$ and $|Y(\pi(G_4))|$ 
simultaneously.
†Two edges can be 2-compatible iff they are parallel edges in the IFG.
Figure 3.4: Cardinality three color partition $\pi_2$ of $G_4$.
Figure 3.5: MBG $\pi_2(G_4)$
For the special case, when the IFG has only two colors, there always exists a 
multiple bus system with minimum number of buses and minimum number of interfaces. 
We will address this issue in Section 3.6. Since, in general, one cannot find an MBS such
that both the number of buses and the number of interfaces are minimum (for an IFG 
with arbitrary number of colors), the optimization function to be used for the 
minimization should contain terms to represent relative costs of buses and interfaces. 
Therefore, we define $\kappa_3 \cdot |Y(\pi(G))| + \kappa_4 \cdot |E(\pi(G))|$ as our optimization function, where 
constants $\kappa_3$ and $\kappa_4$ are chosen to reflect the relative costs of buses and interfaces, 
respectively.
Definition 3.8: For the IFG G, a color partition $\pi$ with a minimum value of $\kappa_3 \cdot |Y(\pi(G))|$ 
$+ \kappa_4 \cdot |E(\pi(G))|$ will be called an optimal color partition. The corresponding bus 
assignment will be called an optimal bus assignment.
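To make Definition 3.8 concrete, the following Python sketch evaluates the cost function and finds an optimal color partition of a very small IFG by exhaustive enumeration. This brute-force search is only an illustration, not a method proposed in the dissertation, and it is usable only for a handful of edges because the number of set partitions grows very quickly. The (tail, head, color) encoding, all function names, and the three-edge example are assumptions of the sketch.

```python
def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part                    # `first` gets its own bus
        for i in range(len(part)):                # or joins existing block i
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def is_color_partition(partition):
    """Definition 3.4: no subset holds two edges of the same color."""
    return all(len({c for (_, _, c) in s}) == len(s) for s in partition)


def cost(partition, k3, k4):
    """k3 * (number of buses) + k4 * (number of interfaces)."""
    interfaces = sum(
        len({u for (u, _, _) in s}) + len({v for (_, v, _) in s})
        for s in partition
    )
    return k3 * len(partition) + k4 * interfaces


def optimal_color_partition(edges, k3, k4):
    """Exhaustively search all color partitions for the cheapest one."""
    candidates = (p for p in set_partitions(list(edges)) if is_color_partition(p))
    return min(candidates, key=lambda p: cost(p, k3, k4))


# Hypothetical IFG: two red edges and one blue edge sharing a tail with a red one.
tiny_edges = [(1, 2, "red"), (1, 3, "blue"), (4, 5, "red")]
tiny_best = optimal_color_partition(tiny_edges, k3=1, k4=1)
```

With unit costs, the search groups the two tail-compatible edges on one bus (one driver, two receivers) and leaves the remaining red edge on a second bus, for a total cost of 2 buses plus 5 interfaces.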
Now our design problem is to find a color partition $\pi$ of IFG G such that 
$\kappa_3 \cdot |Y(\pi(G))| + \kappa_4 \cdot |E(\pi(G))|$ is minimum. This problem, although posed in a different form, 
has been addressed in the literature [106], [141]. But none of these works have addressed 
the computational complexity of the problem. In the next section, we prove that the 
optimal color partitioning problem is NP-Hard.
3.5 Computational Complexity of the Optimal Color Partitioning Problem
We show that the problem of finding an optimal color partition $\pi$ of a given IFG 
is a computationally difficult problem. To show this, we convert the optimal partitioning 
problem into a decision problem which is called the Color Edge Partitioning (CEP) problem. 
We formally define the CEP problem as follows.
Color Edge Partitioning problem:
INSTANCE: An IFG G and three constants $\kappa_3$, $\kappa_4$ and K.
QUESTION: Can we find a color partition $\pi$ of G such that 
$\kappa_3 \cdot |Y(\pi(G))| + \kappa_4 \cdot |E(\pi(G))| \le K$?
Theorem  3.1: The CEP problem is NP-Complete.
Proof: It is straightforward to construct a nondeterministic algorithm to solve CEP in 
polynomial time. Thus CEP belongs to class NP. To show that the CEP problem is NP-Hard, 
we transform an instance of the Three Dimensional Matching problem (3DM) into 
an instance of CEP and show that the 3DM instance has a matching if and only if the 
corresponding CEP instance has a solution. The 3DM problem [53] was already stated in 
Section 2.3. Let X, Y, Z be the three sets and $M \subseteq X \times Y \times Z$ be the collection of three 
element subsets. Let the cardinality of each of X, Y, Z be p. Also let the cardinality of M 
be q.
Figure 3.6: Type 1 edges of G
From the 3DM instance we will construct an instance G of CEP. Let x, y, and z 
be the representative elements of the sets X, Y, and Z, respectively. G will have four types 
of edges. For each element $w_i$ in $X \cup Y \cup Z$, form a directed edge $(u_w[i], v_w[i])$ of type 
1 (along with its end vertices) as shown in Figure 3.6. Here, w can be x, y or z, and i can 
range from 1 through p. Type 1 edges corresponding to the elements in set X, set Y, and 
set Z will be of colors Red, Blue and Yellow, respectively. Letters R, B, and Y in the 
figure stand for the colors Red, Blue, and Yellow, respectively. For each element $m_j = 
(x_{j(1)}, y_{j(2)}, z_{j(3)})$ in M, form seven vertices a[j], b[j], c[j], d[j], e[j], f[j] and g[j] and nine 
edges as shown in Figure 3.7. The types and colors of the respective edges are also shown 
in Figure 3.7. Since the resulting graph G is directed, edge colored, and connected, it 
represents an IFG. After the construction, G will have, from each color, p edges of type 
1, q edges of type 2, q edges of type 3, and q edges of type 4. Therefore, the constructed 
CEP instance G has 6p + 7q vertices and 3(p + 3q) edges, p + 3q edges from each color. 
Also, $\beta_R(G) = \beta_B(G) = \beta_Y(G) = \beta(G) = p + 3q$. We set the value of K equal to 
$(\kappa_3 + 4\kappa_4)(p + 3q)$.
More notation is needed here. A set of three edges is said to form an edge triplet 
(or simply a triplet) if one edge in the set is head compatible with another edge and tail 








Figure 3.7: Types 2, 3, and 4 edges of G corresponding to an element $m_j \in M$.
First we show that G has the required partition if 3DM has a matching. Let $M' \subseteq M$ 
be a matching for the 3DM instance. Consider the following color partition. For each 
element $m \in M'$, construct four edge triplets as shown in Figure 3.8. This will cover all 
the edges in G of type 1 (that is, p edges from each color). This will also cover p edges 
from each color belonging to each of the types 2, 3 and 4. For each element $m \notin M'$, 
construct three triplets as shown in Figure 3.9. This will cover all the remaining edges in 
G of types 2, 3 and 4. The partition has 4p + 3(q − p) subsets (triplets), that is, $|Y(\pi(G))|$ 
= 4p + 3(q − p) = p + 3q. Each subset contributes 4 edges to $E(\pi(G))$. Therefore, 






Figure 3.8: Triplets corresponding to an element $m \in M'$.
Next we show that the 3DM instance has a matching if G has a solution. Let $\pi$ be a 
color partition of G such that $\kappa_3 \cdot |Y(\pi(G))| + \kappa_4 \cdot |E(\pi(G))| = K = (\kappa_3 + 4\kappa_4)(p + 3q)$. Let 
the cardinality of $\pi$ be b. Each subset of the color partition $\pi$ can have at most 3 edges. 
Since there are no triangles in the underlying undirected graph of G, the induced subgraph 
of each subset underlies a tree. Therefore, the number of vertices of the subgraph induced 
by the subset $E_i$ of edges is at least $|E_i| + 1$, $1 \le i \le b$. Thus $|J(E_i)| \ge |E_i| + 1$. Hence, 
by Lemma 3.1, $|E(\pi(G))| \ge \sum_{i=1}^{b} (|E_i| + 1) = \sum_{i=1}^{b} |E_i| + b = |E(G)| + b = 3(p + 3q) + b$. From 
Lemma 3.2, $|Y(\pi(G))| = b \ge \beta(G) = p + 3q$. Therefore, $|E(\pi(G))| \ge 4(p + 3q)$. Since 
$|Y(\pi(G))| \ge p + 3q$, we have $\kappa_3 \cdot |Y(\pi(G))| + \kappa_4 \cdot |E(\pi(G))| \ge \kappa_3(p + 3q) + 4\kappa_4(p + 3q) = (\kappa_3 
+ 4\kappa_4)(p + 3q)$. Since $\kappa_3 \cdot |Y(\pi(G))| + \kappa_4 \cdot |E(\pi(G))| = (\kappa_3 + 4\kappa_4)(p + 3q)$ by hypothesis, it 
follows that $|E(\pi(G))| = 4(p + 3q)$ and that $|Y(\pi(G))| = p + 3q$. Hence each subset is a 
triplet.
From the construction of G it is clear that the only way to include a type 1 edge 
in a triplet is to use it with the type 2 edge adjacent with it and the type 3 edge adjacent 
with the type 2 edge. Thus p  pairs of adjacent type 2 and type 3 edges from each color 
will be consumed to cover all the edges of type 1. The remaining (q -  p) pairs of 
adjacent type 2 and type 3 edges from each color must be combined with (q -  p) edges 
of type 4 from each color. After using all the edges of type 2 and type 3, p  edges of type 
4 from each color will remain. If the edge set E(G) has the required partition, we must 
be able to form p  triplets from those remaining type 4 edges. Each such triplet 





Figure 3.9: Triplets corresponding to an element $m \notin M'$
If we can solve the optimization problem in polynomial time, then we can also 
solve the decision problem in polynomial time. But, since the decision problem is 
NP-Complete, it is very unlikely that we find a polynomial algorithm to solve the decision 
problem and hence the optimization problem. For the above proof, we constructed an IFG 
with three colors. Therefore, the optimal bus assignment for an IFG with greater than or 
equal to three colors is an NP-Hard problem. We later (Section 3.6) show that the optimal 
bus assignment problem for a two color IFG can be solved in polynomial time.
The computational complexity of some related problems has been established in the 
literature. For example, the problem of designing a bounded degree graph to map 
maximum number of edges from a 'problem graph' is shown to be computationally 
difficult. But the computational complexity of the problem of mapping a 'problem graph' 
into a target graph (which is equivalent to the graph isomorphism problem) has not been 
established yet [23], [53].
As we have mentioned in Section 2.3, once a problem is proven to be NP-Hard, 
we usually have two different approaches to solving the problem. In the first approach we 
study how the problem can be polynomially solved in special cases. In the second 
approach, we use a polynomial time algorithm (probably heuristic) such that the solution 
is not necessarily optimal but does not deviate from the optimal solution by more than 
a certain amount. As for the first approach, it is practically impossible to generalize the 
properties an IFG  should possess in order to be partitionable in polynomial time. In this 
dissertation, we will consider two important properties of an IFG each of which 
guarantees polynomial time solution to the optimal color partition problem. The first 
property, which is addressed in the next section, is that the IFG  has only two colors. The 
second property, which will be addressed in Chapter 4, is that the IFG is vertex 
symmetric. As for the second approach, in this chapter, we devise a heuristic algorithm 
to solve the general problem. The heuristic algorithm is based on the solution to the two 
color case.
3.6 Optimal Color Partition on an IFG with only Two Colors
If the IFG  has only two colors, an optimal color partition can be found in 
polynomial time. We analyze the two color case for three major reasons. First, it answers 
the combinatorial question: what is the maximum number of colors an IFG can have so 
that the optimal color partition problem is solvable in polynomial time? Second, it 
provides a polynomial time algorithm to construct an optimal MBS realizing any two
interconnection functions such as the shuffle and exchange functions [134]. Third, it 
provides a guideline for a heuristic algorithm for solving the general color partition 
problem. We assume that the two colors of the IFG are denoted by 1 and 2. When the 
IFG has only two colors, as the following theorem shows, both $|Y(\pi(G))|$ and $|E(\pi(G))|$ 
can be minimized at the same time.
Theorem 3.2: Let G be an IFG with two colors. Then there always exists a color partition 
$\pi$ of G which minimizes both $|Y(\pi(G))|$ and $|E(\pi(G))|$ simultaneously.
Proof: Let $\pi_1$ be a color partition of G such that $|E(\pi_1(G))| \le |E(\pi(G))|$, for all color 
partitions $\pi$. If $|Y(\pi_1(G))| = \beta(G)$, then $\pi_1$ is the required partition. Otherwise (that is, if 
$|Y(\pi_1(G))| > \beta(G)$), there should exist a subset $E_1\langle\pi_1\rangle$ of partition $\pi_1$ which does not 
contain a color 1 edge. This is because, if every subset of $\pi_1$ contains a color 1 edge, then 
the cardinality of partition $\pi_1$, $|Y(\pi_1(G))| = \beta_1(G) \le \beta(G)$. Similarly, there should exist a 
subset $E_2\langle\pi_1\rangle$ of the partition $\pi_1$ which does not contain a color 2 edge. Let $\pi_2$ be the 
color partition of G obtained from $\pi_1$ by replacing the two subsets $E_1\langle\pi_1\rangle$ and $E_2\langle\pi_1\rangle$ 
by the single subset $E_1\langle\pi_2\rangle = E_1\langle\pi_1\rangle \cup E_2\langle\pi_1\rangle$. This is possible because $E_1\langle\pi_1\rangle \cup 
E_2\langle\pi_1\rangle$ contains edges of distinct colors. There are two edges in $E(\pi_1(G))$ associated with 
each of the subsets $E_1\langle\pi_1\rangle$ and $E_2\langle\pi_1\rangle$ (that is, $|J(E_1\langle\pi_1\rangle)| = |J(E_2\langle\pi_1\rangle)| = 2$). There are 
at most four edges in $J(E_1\langle\pi_2\rangle)$. Hence, from Lemma 3.1, $|E(\pi_2(G))| \le |E(\pi_1(G))|$. Since 
$|E(\pi_1(G))|$ is minimum by hypothesis, we have $|E(\pi_2(G))| = |E(\pi_1(G))|$. Furthermore, 
$|Y(\pi_2(G))| = |Y(\pi_1(G))| - 1$. If $|Y(\pi_2(G))| = \beta(G)$, then the required partition is $\pi_2$. 
Otherwise, repeat the above procedure to construct another color partition $\pi_3$ of G such 
that $|E(\pi_3(G))| = |E(\pi_2(G))|$, and $|Y(\pi_3(G))| = |Y(\pi_2(G))| - 1$. Therefore, by repeatedly 
applying the above procedure, we can find a color partition $\pi$ such that $|Y(\pi(G))| = \beta(G)$ 
and $|E(\pi(G))| = |E(\pi_1(G))|$. □
In order to perform an optimal color partition of a two color IFG, we provide the 
following notation.
Definition 3.9: Let G be an IFG with two colors. Then L(G) is a weighted bipartite graph 
defined as follows. For each edge $e_i^1$ of color 1 in G, there exists a unique vertex $v_i^1$ in 
the first partite set X(L(G)), and vice versa. Similarly, for each edge $e_j^2$ of color 2 in G, 
there exists a unique vertex $v_j^2$ in the second partite set Y(L(G)), and vice versa. There 
is an (undirected) edge between $v_i^1$ and $v_j^2$ in L(G) if and only if their corresponding 
edges $e_i^1$ and $e_j^2$ are either head or tail compatible (see Definition 3.7) in G. An edge 
$(v_i^1, v_j^2)$ of L(G) will be assigned a weight $w(v_i^1, v_j^2) = z$ if and only if $e_i^1$ and $e_j^2$ are 
z-compatible. If $v_i^1$ and $v_j^2$ are not adjacent in L(G), then $w(v_i^1, v_j^2) = 0$.
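A minimal Python sketch of Definition 3.9 follows. It assumes edges are (tail, head, color) tuples with colors 1 and 2, and it stores L(G) as a dictionary from index pairs (i, j) to weights, omitting pairs of weight 0. The function name build_L and the four-edge example are hypothetical.

```python
def build_L(edges):
    """Build the weighted bipartite graph L(G) of a two color IFG.

    Returns (X, Y, weights): X lists the color 1 edges, Y lists the
    color 2 edges, and weights[(i, j)] = z when X[i] and Y[j] are
    z-compatible with z > 0 (non-adjacent pairs are left out).
    """
    X = [e for e in edges if e[2] == 1]
    Y = [e for e in edges if e[2] == 2]
    weights = {}
    for i, (u1, v1, _) in enumerate(X):
        for j, (u2, v2, _) in enumerate(Y):
            z = int(u1 == u2) + int(v1 == v2)  # tail plus head compatibility
            if z > 0:
                weights[(i, j)] = z
    return X, Y, weights


# Hypothetical two color IFG on vertices 0..3.
two_color_edges = [(0, 1, 1), (2, 3, 1), (0, 1, 2), (2, 1, 2)]
X_edges, Y_edges, L_weights = build_L(two_color_edges)
```

In the example, the parallel edges (0, 1, 1) and (0, 1, 2) receive weight 2, while the pairs sharing only a head or only a tail receive weight 1.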
Clearly, a color partition of G is equivalent to a vertex partition of L(G) such that 
each subset of the partition contains at most one vertex from any partite set. In this dissertation, unless 
otherwise stated, a vertex partition of a bipartite graph is always meant to be done such 
that no subset contains more than one vertex from the same partite set.
Theorem 3.3: Let G be a two color IFG and $\pi$ be a color partition of G. Let $W_\pi$ be the 
sum of the weights of the edges induced by the vertex partition of L(G) corresponding 
to color partition $\pi$. Then $|E(\pi(G))| = 2|E(G)| - W_\pi$.
Proof: Let $E_j$, $1 \le j \le b$, be a subset of edges of G due to color partition $\pi$, where b is 
the cardinality of color partition $\pi$. Define $w_j$ as follows. If $E_j$ contains two edges $e_1$ and 
$e_2$, then $w_j = w(v_1, v_2)$, where $v_1$ and $v_2$ are the vertices of L(G) corresponding to edges $e_1$ 
and $e_2$, respectively. If $E_j$ contains only one edge, then $w_j = 0$. It is clear that $W_\pi = 
\sum_{j=1}^{b} w_j$. We claim that $|J(E_j)| = 2|E_j| - w_j$, for $1 \le j \le b$. We prove the claim by 
considering all possible cases.
Case 1: $E_j$ has only one edge:
A driver and a receiver must be connected to bus $B_j$ for the corresponding data 
transfer, that is, $|J(E_j)| = 2$. Since $w_j = 0$, it follows that $2|E_j| - w_j = 2 - 0 = 2$.
Case 2: $E_j$ has two edges $e_1$ and $e_2$ which are not compatible:
Two drivers and two receivers are needed for corresponding data transfers, that 
is, $|J(E_j)| = 4$. Since $w_j = 0$, it follows that $2|E_j| - w_j = 4 - 0 = 4$.
Case 3: Two edges $e_1$ and $e_2$ are head compatible:
Only one receiver is needed for both data transfers, that is, $|J(E_j)| = 3$. Since 
$w_j = 1$, it follows that $2|E_j| - w_j = 4 - 1 = 3$.
Case 4: Two edges $e_1$ and $e_2$ are tail compatible:
Only one driver is needed for corresponding data transfers, that is, $|J(E_j)| = 3$. 
Similar to Case 3, $2|E_j| - w_j = 3$.
Case 5: Two edges $e_1$ and $e_2$ are both head and tail compatible:
Only one receiver and one driver are needed for corresponding data transfers, that 
is, $|J(E_j)| = 2$. Since $w_j = 2$, it follows that $2|E_j| - w_j = 4 - 2 = 2$.
Hence, the claim is true and therefore, 
$\sum_{j=1}^{b} |J(E_j)| = 2\sum_{j=1}^{b} |E_j| - \sum_{j=1}^{b} w_j = 2|E(G)| - W_\pi$. 
Since $\sum_{j=1}^{b} |J(E_j)| = |E(\pi(G))|$ (by Lemma 3.1), the theorem follows. □
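The identity of Theorem 3.3 can be checked numerically on a small case. The following Python sketch is illustrative only: the partition is a hypothetical two color example, edges are written as (tail, head) pairs, and the helper `interfaces` assumes, as in the case analysis of the proof, that a bus needs one driver per distinct tail and one receiver per distinct head.

```python
def interfaces(edges):
    """Interfaces on one bus: a driver per distinct tail, a receiver per distinct head."""
    return {("drv", u) for u, _ in edges} | {("rcv", v) for _, v in edges}


def pair_weight(subset):
    """w_j of the proof: 0 for a single edge, else 1 per shared tail or head."""
    if len(subset) < 2:
        return 0
    (u1, v1), (u2, v2) = subset
    return int(u1 == u2) + int(v1 == v2)


# Hypothetical color partition: each subset holds at most one edge of each color.
partition = [
    [(0, 1), (0, 2)],  # tail compatible, w = 1
    [(1, 2), (1, 2)],  # parallel edges of different colors, w = 2
    [(2, 0), (3, 0)],  # head compatible, w = 1
    [(3, 1)],          # single edge, w = 0
]

num_edges = sum(len(s) for s in partition)
w_pi = sum(pair_weight(s) for s in partition)
total_interfaces = sum(len(interfaces(s)) for s in partition)

# Theorem 3.3: |E(pi(G))| = 2|E(G)| - W_pi
assert total_interfaces == 2 * num_edges - w_pi
```

Each subset of the example falls under one of the cases of the proof, so the final assertion exercises Cases 1, 3, 4, and 5 together.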
Definition 3.10: A vertex partition of L(G) is said to be an optimal vertex partition, if the
sum of the weights of the edges induced by the subsets of the partition is maximum.
According to Theorem 3.3, the problem of finding an optimal color partition of
a two color IFG G is equivalent to that of finding an optimal vertex partition of the
bipartite graph L(G). To find such a vertex partition, some terminology from graph theory 
is necessary.
For a given graph G, a maximum matching can be found in polynomial time [30], 
[56]. In this paper we consider matchings in bipartite graphs only. For our purpose, what 
we need to maximize is not the number of edges but the sum of the weights of the edges 
in a matching. To distinguish between these two concepts, we introduce the following 
definition.
Definition 3.11: If the sum of the weights of a matching in L(G) is maximum, that 
matching will be called a weighted maximum matching.
If a weighted maximum matching M  of L(G) is known, it is straightforward to 
construct an optimal vertex partition of L(G). Simply construct a vertex partition which 
contains the matching M. Thus the problem of finding an optimal color partition of a two 
color IFG G is reduced to that of finding a weighted maximum matching in L(G). We 
will next show how to find a weighted maximum matching in L(G). In the following, 
W(M) represents the sum of the weights of the edges in matching M.
Lemma 3.3: Let G be a two color IFG. Let $M_2$ be the set of edges in L(G) whose weights 
are 2. Then $M_2$ is a matching in L(G).
Proof: We need to prove that the edges in $M_2$ are pairwise independent. Suppose there 
are two edges in $M_2$ with a common end vertex $v_i$. Assume that $v_i$ is in the first partite 
set. Therefore edge $e_i$ (which is of color 1) of G must be parallel with two distinct edges 
of color 2. Then those two edges of color 2 must be parallel to one another. Since this 
is not possible, no two edges in $M_2$ have a common vertex. □
Theorem 3.4: Let G be an IFG with two colors. Then there is a weighted maximum 
matching in L(G) which contains all edges of L(G) with weight 2.
Proof: Let M be a weighted maximum matching in L(G). Let $M_2$ be the set of edges of 
L(G) of weight 2. If $M_2 \subseteq M$, then we have nothing to prove. Otherwise, let $e \in M_2$ be 
such that $e \notin M$. Construct a new matching $M'$ from M by removing the edges adjacent 
with e and inserting e. From Lemma 3.3, no edge adjacent with e can have weight 2. 
Also, there can be at most two edges adjacent with e which are in M. Therefore, $W(M') \ge W(M)$. 
Since M is assumed to be a weighted maximum matching, it follows that $W(M') = W(M)$. 
If $M_2 \subseteq M'$, we have the required matching. Otherwise, repeat the above 
procedure until we get a weighted maximum matching containing $M_2$. □
Theorem 3.5: Let $M_2$ be the set of edges of L(G) of weight 2. Let $L_1(G)$ be the graph 
obtained from L(G) by deleting the end vertices of all the edges in $M_2$. (Deleting a vertex 
will automatically delete all the edges adjacent with it.) Let $M_1$ be a maximum matching 
in $L_1(G)$. Then $M_1 \cup M_2$ is a weighted maximum matching in L(G).
Proof: Suppose there exists another matching M in L(G) such that $W(M) > W(M_1 \cup M_2)$. 
According to Theorem 3.4, we can assume that M contains $M_2$. Let $M = M_1' \cup M_2$. 
Therefore, $W(M) > W(M_1 \cup M_2)$ implies $|M_1'| > |M_1|$. None of the edges in $M_1'$ is adjacent 
with end vertices of any of the edges in $M_2$, because otherwise $M_1' \cup M_2$ would not be a 
matching in L(G). Therefore, $|M_1'| > |M_1|$ implies the existence of a matching in $L_1(G)$ 
whose cardinality is greater than $|M_1|$. This is not possible since $M_1$ is a maximum 
matching in $L_1(G)$. Thus, $M_1 \cup M_2$ is a weighted maximum matching in L(G). □
The above theorem provides us with a method to find an optimal color partition 
of a two color IFG. First, construct L(G) from G. Then find a maximum matching $M_1$ in 
$L_1(G)$, where $L_1(G)$ is obtained from L(G) by deleting the end vertices of all edges of weight 2. 
If $M_2$ is the set of edges of weight 2, then $M = M_1 \cup M_2$ is a weighted maximum matching in L(G). The 
vertex partition of L(G) associated with matching M corresponds to an optimal color 
partition of G. In [67], an algorithm is presented to find a maximum matching in a 
bipartite graph with n vertices in $O(n^{2.5})$ time. Utilizing that algorithm, we can find an 
optimal color partition of a two color IFG G in $O(|E(G)|^{2.5})$ time. Next we will develop 
a heuristic algorithm for the general case based on the two color case. The algorithm will 
give us a nearly optimal color partition of a general IFG.
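The two color method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation referred to in [67]: edges are hypothetical (tail, head) pairs, the function names are invented for the example, and the maximum matching in $L_1(G)$ is found with a simple augmenting-path search rather than the $O(n^{2.5})$ algorithm cited above.

```python
def compat(e1, e2):
    """Weight in L(G): 1 per shared tail or head (0, 1, or 2)."""
    return int(e1[0] == e2[0]) + int(e1[1] == e2[1])


def maximum_matching(adj, n_right):
    """Augmenting-path bipartite matching; adj[i] lists right neighbors of left i."""
    match_right = [None] * n_right

    def try_augment(i, seen):
        for k in adj[i]:
            if k not in seen:
                seen.add(k)
                if match_right[k] is None or try_augment(match_right[k], seen):
                    match_right[k] = i
                    return True
        return False

    for i in range(len(adj)):
        try_augment(i, set())
    return [(i, k) for k, i in enumerate(match_right) if i is not None]


def optimal_two_color_partition(color1, color2):
    # M2: all weight-2 edges of L(G); a matching by Lemma 3.3.
    m2 = [(i, k) for i, e in enumerate(color1)
          for k, f in enumerate(color2) if compat(e, f) == 2]
    used1 = {i for i, _ in m2}
    used2 = {k for _, k in m2}
    # L1(G): delete the end vertices of M2; remaining edges have weight 1.
    adj = [[k for k, f in enumerate(color2)
            if i not in used1 and k not in used2 and compat(e, f) == 1]
           for i, e in enumerate(color1)]
    pairs = m2 + maximum_matching(adj, len(color2))
    paired1 = {i for i, _ in pairs}
    paired2 = {k for _, k in pairs}
    partition = [[color1[i], color2[k]] for i, k in pairs]
    partition += [[e] for i, e in enumerate(color1) if i not in paired1]
    partition += [[f] for k, f in enumerate(color2) if k not in paired2]
    return partition


def count_interfaces(partition):
    """Total drivers plus receivers over all buses."""
    return sum(len({u for u, _ in s}) + len({v for _, v in s}) for s in partition)


# Hypothetical two color IFG fragment.
color1 = [(0, 1), (1, 2), (2, 3)]
color2 = [(0, 1), (1, 3), (4, 3)]
partition = optimal_two_color_partition(color1, color2)
```

In this example the parallel pair is placed on one bus first, and the remaining four edges are paired by the matching in $L_1(G)$, so the total weight is 4 and, by Theorem 3.3, the partition needs 2·6 − 4 = 8 interfaces.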
3.7 Heuristic Algorithm for the General Case
In developing a heuristic algorithm for the general case, it is assumed that the IFG 
under consideration contains broadcasting information. We base the algorithm on the 
following two assumptions.
(1) For optimality, data transfers associated with the same broadcasting instance must 
be assigned to the same bus.
(2) The local optimization is not very far away from the global optimization.
In Section 3.3, we showed how broadcasting information can be included in an 
IFG. But we did not illustrate how to make use of these for finding a color partition. 
Therefore, while developing the heuristic algorithm, we will clearly show how 
broadcasting information is incorporated into the color partition.
Since broadcasting operations are considered, each edge is assigned a 3-tuple of 
colors instead of a single color. Also, the meaning of the color partition is accordingly 
modified (see Definition 3.4'). To impose the first assumption on which our heuristic 
algorithm is based, if $e(r_i, \tau_i, a_i)$, $\forall i$, are the edges of the given IFG, then the data 
transfers corresponding to the edges in set $\{e(r_i, \tau_i, a_i) : \forall i,\ r_i = r_j \text{ and } \tau_i = \tau_j\}$ must be 
assigned to the same bus. In other words, the color partition must be performed such that 
the edges in set $\{e(r_i, \tau_i, a_i) : \forall i,\ r_i = r_j \text{ and } \tau_i = \tau_j\}$ belong to the same subset of the 
partition.
For simplicity, we will denote the set of edges in G with the first coordinate r 
(primary color) and the second coordinate $\tau$ by $q(G, r, \tau)$. That is, 
$q(G, r, \tau) = \{e(r_i, \tau_i, a_i) : r = r_i,\ \tau = \tau_i\}$. In the last section, when there were no broadcasting operations 
involved, we constructed a bipartite graph L(G) from the IFG G such that each vertex of 
L(G) represents a unique edge of G. Here we will construct a bipartite graph such that 
each of its vertices represents a set of edges $q(G, r, \tau)$ for some r and $\tau$. The vertex 
corresponding to $q(G, r, \tau)$ will be denoted by $v(q(G, r, \tau))$. Let $\{q(G, r, \tau)\}_\tau$† be denoted 
by Q(G, r). Similarly, let $\{v(q(G, r, \tau))\}_\tau$ be denoted by QV(G, r). Clearly, the cardinality 
of Q(G, r), as well as that of QV(G, r), is $\beta_r(G)$ (see Definition 3.6').
For the two color case, we constructed only one bipartite graph L(G). For the 
general case, we will construct a set of bipartite graphs $L(G)^r$, $1 \le r \le c - 1$, where c is 
the number of primary colors in G. As will be shown in the algorithm, the partite sets of 
$L(G)^r$, $1 \le r \le c - 1$, will be constructed recursively. The determination of the edges of 
$L(G)^r$, $1 \le r \le c - 1$, and of their weights is based on the following discussion.
When there is no broadcasting involved, we established an edge between $v_i^1 \in X(L(G))$ 
and $v_j^2 \in Y(L(G))$ iff the two edges $e_i^1$ and $e_j^2$ are compatible, where $v_i^1$ and $v_j^2$ 
are the representative vertices of the edges $e_i^1$ and $e_j^2$, respectively. With broadcasting 
operations involved, to establish the adjacency between vertices of $X(L(G)^r)$ and $Y(L(G)^r)$, 
we need to extend the definition of (head or tail) compatibility between the edges of an 
IFG.
† $\{f(\tau)\}_\tau$ represents the set of elements $f(\tau)$ obtained by taking all distinct values of $\tau$.
Definition 3.12: Let $q_1$ and $q_2$ be two non-empty disjoint subsets of edges of an IFG. 
Then $q_1$ and $q_2$ are said to be z-compatible if $|J(q_1)| + |J(q_2)| - |J(q_1 \cup q_2)| = z$. Also, $q_1$ 
and $q_2$ are said to be compatible if they are z-compatible and $z > 0$.
Notice that the above definition explicitly states: if the number of interfaces saved 
by assigning the edges in $q_1$ and $q_2$ to the same bus is z, then the subsets $q_1$ and $q_2$ are 
z-compatible. With the extended definition of the term "compatible", we can determine 
the edges and their weights in $L(G)^r$. Note that each vertex of $L(G)^r$ corresponds to a set 
of edges from G. Let $v(q_1)$ and $v(q_2)$ be two vertices of $L(G)^r$. Then there exists an edge 
of weight z between $v(q_1)$ and $v(q_2)$ of $L(G)^r$ if $q_1$ and $q_2$ are z-compatible in G. The 
weights of the edges in $L(G)^r$ are not bounded above by 2, in contrast with the case when 
broadcasting operations were not considered. Therefore, we need to find a general 
weighted maximum matching for $L(G)^r$. For a bipartite graph, such a matching can be 
found in $O(n^3)$ time [56].
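Definition 3.12 can be illustrated with a short Python sketch. It is only a sketch under assumptions: edges are hypothetical (tail, head, primary color) triples, and `interfaces` models $J(q)$ as one driver per distinct tail plus one receiver per distinct head, which matches the counting used in the two color case.

```python
def interfaces(edges):
    """Assumed model of J(q): drivers at distinct tails, receivers at distinct heads."""
    return {("drv", u) for u, _, _ in edges} | {("rcv", v) for _, v, _ in edges}


def z_compatibility(q1, q2):
    """z = |J(q1)| + |J(q2)| - |J(q1 u q2)| for two disjoint edge sets."""
    q1, q2 = set(q1), set(q2)
    return len(interfaces(q1)) + len(interfaces(q2)) - len(interfaces(q1 | q2))


# Two hypothetical broadcasting instances with the same three destinations.
q1 = {(0, 1, 1), (0, 2, 1), (0, 3, 1)}  # vertex 0 broadcasts (primary color 1)
q2 = {(4, 1, 2), (4, 2, 2), (4, 3, 2)}  # vertex 4 broadcasts (primary color 2)
q3 = {(5, 6, 2)}                        # unrelated transfer

z_shared = z_compatibility(q1, q2)    # the three receivers are shared
z_disjoint = z_compatibility(q1, q3)  # nothing is shared
```

Here `z_shared` exceeds 2, which is the situation described above in which the edge weights of $L(G)^r$ are no longer bounded by 2.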
Now, we can present the heuristic algorithm. For notational convenience, we will 
use $\beta_r$ instead of $\beta_r(G)$, $1 \le r \le c$. Also, without loss of generality, we assume that 
$\beta_1 \ge \beta_2 \ge \dots \ge \beta_c$. The outline of the algorithm is as follows. Initially we form the subsets 
$E_1, E_2, \dots, E_{\beta_1}$ such that $E_j$ contains edges of primary color 1 belonging to the $j$th broadcasting 
instance. Then we construct $L(G)^1$, where $X(L(G)^1) = \{v(E_j) : 1 \le j \le \beta_1\}$ and 
$Y(L(G)^1) = QV(G, 2)$. The edge set of $L(G)^1$ is determined by the compatibility among the elements 
in $\{E_j : 1 \le j \le \beta_1\}$ and those in Q(G, 2). Based on the maximum weighted matching of 
$L(G)^1$, we choose the edge set belonging to the best instance of broadcasting with primary 
color 2 which can be combined with each $E_j$. This operation will update each $E_j$. Next, 
we construct $L(G)^2$, where $X(L(G)^2) = \{v(E_j) : 1 \le j \le \beta_1\}$ and $Y(L(G)^2) = QV(G, 3)$, and 
edges are determined by the compatibility among the elements in $\{E_j : 1 \le j \le \beta_1\}$ and 
those in Q(G, 3). Then we find the maximum weighted matching in $L(G)^2$ and update the 
values of $E_j$, $1 \le j \le \beta_1$, accordingly. We repeat this procedure until all edges of G are 
exhausted. The final list $E_j$, $1 \le j \le \beta_1$, is the resulting color partition of G.
Algorithm 3.1
INPUT: An IFG G with c colors including broadcasting information.
OUTPUT: A color partition $\{E_1, E_2, \dots, E_{\beta_1}\}$ of G.
begin
    for j = 1 to $\beta_1$ do
        $E_j = q(G, 1, \tau_j)$;
    for r = 1 to (c - 1) do
    begin
        begin_construct $L(G)^r$
            $X(L(G)^r) = \{v(E_1), v(E_2), \dots, v(E_{\beta_1})\}$
            $Y(L(G)^r) = QV(G, r + 1)$
            $E(L(G)^r) = \{(x, y) : x \in X(L(G)^r),\ y \in Y(L(G)^r),\ x \text{ and } y \text{ are compatible}\}$
            for all $(x, y) \in E(L(G)^r)$
                if x and y are z-compatible then w(x, y) = z;
        end_construct;
        find a maximum weighted matching $M_r$ in $L(G)^r$;
        for j = 1 to $\beta_1$ do
            if $(v(E_j), v(q(G, r + 1, \tau))) \in M_r$
                then $E_j = E_j \cup q(G, r + 1, \tau)$;
    end;
end.
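A compact Python sketch of the iteration in Algorithm 3.1 is given below. Several points are assumptions made for the illustration and are not taken from the text: edges are hypothetical (tail, head, primary color) triples, $J$ is modeled as one driver per distinct tail and one receiver per distinct head, the weighted matching is found by an exhaustive bitmask search that is only practical for small inputs (it is not the $O(n^3)$ algorithm of [56]), and instances left unmatched because they are compatible with nothing are placed on any bus that has not yet received an instance of that color.

```python
from functools import lru_cache


def interfaces(edges):
    """Assumed model of J: drivers at distinct tails, receivers at distinct heads."""
    return {("drv", u) for u, _, _ in edges} | {("rcv", v) for _, v, _ in edges}


def z_compat(q1, q2):
    return len(interfaces(q1)) + len(interfaces(q2)) - len(interfaces(q1 | q2))


def max_weight_matching(weights):
    """Exhaustive search; weights[j][k] = 0 means no edge. Returns matched (j, k) pairs."""
    n_y = len(weights[0]) if weights else 0

    @lru_cache(maxsize=None)
    def best(j, used):
        if j == len(weights):
            return 0, ()
        total, pairs = best(j + 1, used)          # leave v(E_j) unmatched
        for k in range(n_y):
            w = weights[j][k]
            if w > 0 and not ((used >> k) & 1):
                t, p = best(j + 1, used | (1 << k))
                if t + w > total:
                    total, pairs = t + w, ((j, k),) + p
        return total, pairs

    return best(0, 0)[1]


def heuristic_partition(instances_by_color):
    """instances_by_color[r] lists the edge sets q(G, r+1, tau); color 1 has the most."""
    parts = [set(q) for q in instances_by_color[0]]
    for layer in instances_by_color[1:]:
        weights = [[z_compat(E, set(q)) for q in layer] for E in parts]
        pairs = max_weight_matching(weights)
        for j, k in pairs:
            parts[j] |= layer[k]
        # Assumption of this sketch: place unmatched instances on unused buses.
        free = [j for j in range(len(parts)) if j not in {a for a, _ in pairs}]
        for k, q in enumerate(layer):
            if k not in {b for _, b in pairs}:
                parts[free.pop()] |= q
    return parts


# Hypothetical input: two instances of color 1, two of color 2.
instances = [
    [{(0, 1, 1), (0, 2, 1)}, {(3, 4, 1)}],
    [{(5, 1, 2)}, {(6, 7, 2)}],
]
parts = heuristic_partition(instances)
total_interfaces = sum(len(interfaces(p)) for p in parts)
```

In the example, the instance from vertex 5 shares receiver 1 with the first broadcast and is matched to it, while the remaining instance is compatible with nothing and is simply placed on the other bus.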
Example 3.3: Suppose that we want to find an optimal color partition of the four color 
IFG, denoted by $G_5$, shown in Figure 3.10. The same line style is used to identify edges 
with the same primary color. For example, edges e(1, 1, 1), e(1, 1, 2), e(1, 2, 1), e(1, 2, 2), 
and e(1, 3, 1) have primary color 1.
Figure 3.10: IFG $G_5$
Edges e(1, 1, 1) and e(1, 1, 2) of $G_5$ belong to a single instance of broadcasting. Therefore, we 
set $E_1 = \{e(1, 1, 1), e(1, 1, 2)\}$. Also, edges e(1, 2, 1) and e(1, 2, 2) correspond to another 
concurrent broadcasting instance. Thus, we set $E_2 = \{e(1, 2, 1), e(1, 2, 2)\}$. Edge e(1, 3, 1) 
does not belong to a broadcasting operation. However, as we have stated earlier, we 
consider it as a trivial broadcasting operation of its own and set $E_3 = \{e(1, 3, 1)\}$. The first 
partite set of $L(G_5)^1$ is $\{v(E_1), v(E_2), v(E_3)\}$. The second partite set is 
$QV(G_5, 2) = \{v(q(2, 1)), v(q(2, 2)), v(q(2, 3))\}$†, where $q(2, 1) = \{e(2, 1, 1)\}$, $q(2, 2) = \{e(2, 2, 1)\}$, and 
$q(2, 3) = \{e(2, 3, 1), e(2, 3, 2)\}$. Bipartite graph $L(G_5)^1$ is shown in Figure 3.11. Since 
$E_1 = \{e(1, 1, 1), e(1, 1, 2)\}$ and $q(2, 1) = \{e(2, 1, 1)\}$ are 2-compatible, edge $(v(E_1), v(q(2, 1)))$ 
of $L(G_5)^1$ is assigned weight 2. Similarly, other edges of $L(G_5)^1$ are assigned their 
respective weights (see Figure 3.11).
† Here we omit $G_5$ from the notation $q(G, r, \tau)$ for simplicity.
Figure 3.11: $L(G_5)^1$.
Clearly, the maximum weighted matching in $L(G_5)^1$ is $\{(v(E_1), v(q(2, 1))), (v(E_2), v(q(2, 2))), (v(E_3), v(q(2, 3)))\}$. 
Therefore, the updated values of $E_j$ are:
$E_1 = \{e(1, 1, 1), e(1, 1, 2), e(2, 1, 1)\}$,
$E_2 = \{e(1, 2, 1), e(1, 2, 2), e(2, 2, 1)\}$,
$E_3 = \{e(1, 3, 1), e(2, 3, 1), e(2, 3, 2)\}$.
Bipartite graph $L(G_5)^2$ is shown in Figure 3.12. Again, the first partite set is $\{v(E_1), v(E_2), v(E_3)\}$ 
and the second partite set is $QV(G_5, 3)$. The corresponding weighted maximum 
matching is $\{(v(E_1), v(q(3, 1))), (v(E_2), v(q(3, 2))), (v(E_3), v(q(3, 3)))\}$. Now, the updated 
values for $E_j$ are:
$E_1 = \{e(1, 1, 1), e(1, 1, 2), e(2, 1, 1), e(3, 1, 1)\}$,
$E_2 = \{e(1, 2, 1), e(1, 2, 2), e(2, 2, 1), e(3, 2, 1)\}$,
$E_3 = \{e(1, 3, 1), e(2, 3, 1), e(2, 3, 2), e(3, 3, 1)\}$.
[Figure labels: $q(3, 1) = \{e(3, 1, 1)\}$, $q(3, 2) = \{e(3, 2, 1)\}$, $q(3, 3) = \{e(3, 3, 1)\}$.]
Figure 3.12: $L(G_5)^2$.
Bipartite graph $L(G_5)^3$ is shown in Figure 3.13. The weighted maximum matching is 
$\{(v(E_1), v(q(4, 1))), (v(E_2), v(q(4, 2))), (v(E_3), v(q(4, 3)))\}$. The updated final values of $E_j$ 
are:
$E_1 = \{e(1, 1, 1), e(1, 1, 2), e(2, 1, 1), e(3, 1, 1), e(4, 1, 1)\}$,
$E_2 = \{e(1, 2, 1), e(1, 2, 2), e(2, 2, 1), e(3, 2, 1), e(4, 2, 1)\}$, and
$E_3 = \{e(1, 3, 1), e(2, 3, 1), e(2, 3, 2), e(3, 3, 1), e(4, 3, 1)\}$.
Figure 3.14 shows the induced subgraphs corresponding to the above partition. By 
exhaustive search, we found that the color partition shown in Figure 3.14 is optimal. 
Therefore, in our particular example, the heuristic algorithm produced an optimal color 
partition. In Figure 3.14, induced subgraph $H_1$ corresponds to 2 drivers and 3 receivers. 
Induced subgraph $H_2$ corresponds to 3 drivers and 2 receivers. Induced subgraph $H_3$ 
corresponds to 3 drivers and 3 receivers. Therefore an optimal multiple bus system 
realizing IFG $G_5$ of Figure 3.10 consists of 3 buses, 8 drivers, and 8 receivers.
[Figure labels: $q(4, 1) = \{e(4, 1, 1)\}$, $q(4, 2) = \{e(4, 2, 1)\}$, $q(4, 3) = \{e(4, 3, 1)\}$.]
Figure 3.13: $L(G_5)^3$.
3.8 Performance of the Algorithm
In the above algorithm, we repeatedly find a maximum weighted matching of a 
bipartite graph. Each partite set has at most $\beta(G)$ vertices. Therefore, one step takes 
$O((\beta(G))^3)$ time using the algorithm given in [56]. Hence the time complexity of the 
heuristic algorithm is $O(c(\beta(G))^3)$. Our motive for presenting the above algorithm is not 
to provide the best possible algorithm to find a color partition of a general IFG. Instead, 
we meant to present an approach or methodology which can be used to find efficient 
heuristic algorithms. We will discuss how close our solution is to the optimal solution and 
provide some ideas to improve on it.
Figure 3.14: Induced subgraphs of $G_5$ corresponding to 
the color partition output by Algorithm 3.1.
For r = 1 through c - 1, we have constructed bipartite graph $L(G)^r$, where 
$X(L(G)^r) = \{v(E_j) : 1 \le j \le \beta_1\}$ and $Y(L(G)^r) = QV(G, r + 1)$. The outcome of the algorithm 
depends on the matching of vertices of $\{v(E_j) : 1 \le j \le \beta_1\}$ to those of $QV(G, r + 1)$. The 
criterion for matching vertex $v(E_j) \in \{v(E_j) : 1 \le j \le \beta_1\}$ to vertex $v(q(G, r + 1, \tau)) \in QV(G, r + 1)$ 
is the compatibility of $q(G, r + 1, \tau)$ and $E_j$. The main reason the outcome may 
deviate from the optimal solution is the following. Set $E_j$ contains edges of 
primary colors 1 through r and set $q(G, r + 1, \tau)$ contains edges of primary color r + 1 
only. We did not consider the compatibility of $q(G, r + 1, \tau)$ with the edges of primary 
colors r + 2 through c. We can improve the outcome of the algorithm by introducing 
some kind of lookahead at the edges of primary colors r + 2 through c. For example, 
suppose that there exists an edge set, say $q'(G, r + 2, \tau')$, which is $z_1$-compatible with 
$q(G, r + 1, \tau)$ and $z_2$-compatible with $E_j$. Unless $q(G, r + 1, \tau)$ is matched to $E_j$ at 
iteration r, we cannot include both subsets $q'(G, r + 2, \tau')$ and $q(G, r + 1, \tau)$ in $E_j$ at 
iteration r + 1, and so we cannot save $z_1 + z_2$ interfaces. In order to increase the chance of 
matching $q(G, r + 1, \tau)$ with $E_j$ at iteration r, we could add an extra weight $\delta$ to edge 
$(v(E_j), v(q(G, r + 1, \tau)))$ of $L(G)^r$. If the edge does not exist in $L(G)^r$, we can insert an 
edge with weight $\delta$. Therefore, by suitably assigning weights to edges in the bipartite 
graphs in each iteration, one can design a color partition algorithm whose outcome is very 
close to the optimal solution. Given a constant $\varepsilon$, it seems possible that one can choose 
suitable weights for edges in bipartite graphs in each iteration such that the cost of the 
resulting multiple bus system is guaranteed to be no more than $(1 + \varepsilon)\,opt$, where opt is 
the cost of an optimal bus system. This is an interesting and challenging problem in its 
own right. However, we do not pursue this problem further in this dissertation.
CHAPTER 4
OPTIMAL BUS ASSIGNMENT 
FOR VERTEX SYMMETRIC IFGs
Writing spaghetti code is generally considered bad practice. Structured, 
regular algorithms are preferred by today's standards. Furthermore, most 
efficient algorithms for many important problems exhibit inherent regularity. For example, 
Preparata and Vuillemin defined an algorithmic class called Ascend/Descend which is 
based on an iterative rendition of divide and conquer [115]. They have shown that 
fundamental parallel algorithms such as merging, Fast Fourier Transform, sorting, 
convolution and matrix operations are either instances of the scheme (covered by the 
algorithmic class) or simple combinations of such instances. The Ascend/Descend class 
exhibits a data exchange pattern which corresponds directly to cube permutations, which 
have a very regular IFG representation. Therefore, IFGs corresponding to most real 
algorithms of importance would have certain regularities. Besides, an optimal MBS 
realizing a heterogeneous IFG may also be heterogeneous, which is very undesirable from 
an implementation point of view. Modularity and regularity are very attractive features 
for VLSI implementations. Almost every interconnection network found in multiprocessor 
systems has a certain amount of regularity and modularity. Sets of interconnection 
functions used in most existing SIMD machines (such as hypercube, torus, star network 
etc.) correspond to symmetric IFGs. Multiple bus systems which are regular in nature are 
also attractive from an implementation point of view. On the other hand, an irregular MBS 
is very unattractive as a multiprocessor system. An optimal MBS realizing a given 
irregular algorithm will also be irregular.
Thus, even though finding an optimal color partitioning for a general IFG has 
some theoretical interest, it has very little practical implication in regard to the design of 
an optimal MBS. In this chapter, we address the problem of finding an optimal color 
partition of a vertex symmetric or regular IFG. We show that an IFG is vertex symmetric 
if and only if it is a Cayley color graph associated with a finite group and its generating 
set. This allows us to use some concepts from group theory to perform an optimal 
color partition of a vertex symmetric IFG and to analyze the properties of the resulting 
MBS. We show that there exist many optimal color partitions of a vertex symmetric IFG. 
We choose a particular partition which has many desirable properties other than being 
optimal. We show that the resulting MBS is symmetric. We also show the superiority of 
an MBS over a static direct link interconnection network realizing the same symmetric 
IFG, in terms of the number of ports per processor, the number of neighbors per 
processor, and the diameter. Furthermore, we present a polynomial time algorithm to 
perform an optimal color partition of a vertex symmetric or regular IFG.
4.1 Preliminaries
In this chapter we assume that the interconnection functions associated with an 
IFG do not contain broadcasting operations. When broadcasting operations are involved, 
Algorithm 3.1 can be used to find a near optimal color partition. Therefore, each vertex 
of an IFG is assumed to have outgoing edges of distinct colors.
Definition 4.1: A color preserving automorphism [146] of an edge colored digraph G is 
a permutation $\psi$ on V(G) such that (u, v) is an edge in E(G) if and only if $(\psi(u), \psi(v))$ 
is an edge in E(G) of the same color, for all $u, v \in V(G)$.
Definition 4.2: An edge colored digraph is said to be vertex symmetric (or vertex 
transitive) if for every pair of vertices u and v, there exists a color preserving 
automorphism of the graph which maps u to v.
Intuitively speaking, an edge colored digraph is vertex symmetric if it looks the 
same when viewed from any vertex. We will show how to find an optimal edge partition 
of a vertex symmetric IFG.
Lemma 4.1: If an IFG G is vertex symmetric, each vertex $v \in V(G)$ has incoming edges 
of distinct colors.
Proof: Let $|V(G)| = n$. Suppose a certain vertex in V(G) has $x > 1$ incoming edges of color 
r. Since G is vertex symmetric, each vertex must have x incoming edges of color r. 
Therefore, the total number of edges of color r is nx. Since G is vertex symmetric, each 
vertex must then have x outgoing edges of color r. Since in an IFG each vertex has outgoing 
edges of distinct colors, it is not possible for a vertex to have x outgoing edges of color 
r for $x > 1$. Thus each vertex in V(G) has incoming edges of distinct colors. □
Recently, much attention has been focused on analyzing existing interconnection 
networks and designing new ones using group theoretic concepts [5], [6], [27]. Here we 
will utilize group theoretic concepts in analyzing symmetric IFGs for the purpose of 
constructing an optimal MBS. For completeness, we briefly introduce the very basics of 
group theory, rather informally.
Let $\Gamma = \{\gamma_1, \gamma_2, \dots\}$ be a set and $\circ$ be a binary operator on the elements of the set. 
Then the algebraic structure $\langle\Gamma, \circ\rangle$ is called a group if it satisfies the four axioms: 1) 
closure: $\gamma_i \circ \gamma_j \in \Gamma$, $\forall \gamma_i, \gamma_j \in \Gamma$; 2) associativity: $\gamma_i \circ (\gamma_j \circ \gamma_k) = (\gamma_i \circ \gamma_j) \circ \gamma_k \in \Gamma$, $\forall \gamma_i, \gamma_j, \gamma_k \in \Gamma$; 
3) identity: there exists an element $e \in \Gamma$ such that $e \circ \gamma = \gamma \circ e = \gamma$, $\forall \gamma \in \Gamma$; and 4) 
inverse: for every $\gamma \in \Gamma$, there exists $\gamma^{-1} \in \Gamma$ such that $\gamma \circ \gamma^{-1} = \gamma^{-1} \circ \gamma = e$. With minor abuse 
of notation (which usually is the case), we will use the same symbol $\Gamma$ to represent both 
the set and the group; the binary operation is usually implied. Whether we are referring 
to the group or the set would be clear from the context. For simplicity, binary operator 
$\circ$, called group composition, is usually omitted from expressions. We write $\gamma_i\gamma_j$ instead of 
$\gamma_i \circ \gamma_j$. Also, $\gamma_i\gamma_j$ is called the product of $\gamma_i$ and $\gamma_j$. We abbreviate the product $\gamma\gamma\cdots\gamma$ by $\gamma^i$, 
where i is the number of terms in the product. The order, or the cardinality, of a group 
$\Gamma$, denoted by $|\Gamma|$, is the number of elements in the set $\Gamma$. The order of an element $\gamma \in \Gamma$ 
is the smallest positive integer q such that $\gamma^q = e$. If the binary operation is commutative, that 
is, if $\gamma_i\gamma_j = \gamma_j\gamma_i$ for all $\gamma_i, \gamma_j \in \Gamma$, then $\Gamma$ is called an abelian group. The group containing 
only the identity element is called the trivial group. A group of finite order is called a 
finite group. If a group is finite, all its elements are of finite order. In this paper, we only 
consider finite groups. A subset $\Delta \subseteq \Gamma$ is said to be a set of generators of $\Gamma$, if every 
element of $\Gamma$ can be expressed as a product of the elements in $\Delta$ and their inverses.
Let $\Gamma$ be a non trivial finite group with a generating set $\Delta$. We associate with $\Gamma$ 
and $\Delta$ an edge colored digraph $C_\Delta(\Gamma)$ called Cayley color graph [102], [146]. The vertices 
of $C_\Delta(\Gamma)$ are the group elements. Each of the generators in $\Delta$ is assigned a distinct color. 
There exists a directed edge $(\gamma_1, \gamma_2)$ of color r in $C_\Delta(\Gamma)$ if and only if $\delta_r \in \Delta$ and $\gamma_2 = \gamma_1\delta_r$. 
Clearly, the number of vertices in $C_\Delta(\Gamma)$ is equal to the cardinality of $\Gamma$. Without loss of 
generality, we assume that $\Delta$ does not contain the identity element e of the group $\Gamma$ (this 
disallows self-loops in the graph). A generator in $\Delta$ is said to be redundant [146] if it can 
be obtained from the remaining elements in $\Delta$.
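The construction of a Cayley color graph can be sketched in a few lines of Python. The example group (integers modulo 6 under addition, with generators 2 and 3) and the function names are assumptions chosen for illustration; edges are stored as (tail, head, color) triples.

```python
from collections import deque


def cayley_color_graph(elements, op, generators):
    """Color-r edge from g to g * delta_r, for every element g and generator delta_r."""
    return [(g, op(g, d), r)
            for g in elements
            for r, d in enumerate(generators, start=1)]


def reachable(edges, start):
    """Vertices reachable from start along directed edges (breadth-first search)."""
    succ = {}
    for u, v, _ in edges:
        succ.setdefault(u, []).append(v)
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in succ.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


# Hypothetical example: Z_6 under addition, generated by {2, 3}.
elements = list(range(6))
generators = [2, 3]
edges = cayley_color_graph(elements, lambda a, b: (a + b) % 6, generators)

# Every vertex has exactly one outgoing and one incoming edge of each color.
out_colors_ok = all(
    sorted(r for u, _, r in edges if u == g) == [1, 2] for g in elements)
in_colors_ok = all(
    sorted(r for _, v, r in edges if v == g) == [1, 2] for g in elements)
strongly_connected = all(
    reachable(edges, g) == set(elements) for g in elements)
```

The three checks at the end correspond to properties established later in this section: distinct outgoing colors, distinct incoming colors, and strong connectivity.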
Not only edge colored directed graphs, but also undirected graphs can be 
represented using groups and generating sets. If $\delta \in \Delta$ implies $\delta^{-1} \in \Delta$, then the 
underlying undirected graph of $C_\Delta(\Gamma)$ (by possibly coalescing multiple edges) will be 
called Cayley graph [146] and is usually denoted by $G_\Delta(\Gamma)$. Many static interconnection 
networks encountered in the literature can be represented as Cayley graphs. Some 
examples of such networks are hypercube [119], cube connected cycles [115], star [5], 
pancake [54], and bubble sort [5].
Example 4.1: Consider the group, denoted by $\Gamma_Q(n)$, whose elements are the binary vectors 
of length n and whose group composition is the bitwise exclusive-or operation. Clearly, $\Gamma_Q(n)$ 
is an abelian group and contains $2^n$ elements. The set of unit vectors in $\Gamma_Q(n)$ forms a non 
redundant generating set, denoted by $\Delta_Q(n)$. We will use notation $G_Q(n)$ to represent the 
Cayley color graph associated with group $\Gamma_Q(n)$ and generating set $\Delta_Q(n)$. In Example 2.4, 
we constructed an optimal IFG from the CFG $C_4$ (see Figure 2.12) representing the 
Ascend/Descend algorithmic class. The constructed IFG (shown in Figure 2.14) is in fact 
$G_Q(3)$. If we regard an IFG as the representation of interconnection functions of an SIMD 
machine, then $G_Q(n)$ is the IFG corresponding to the cube interconnection functions of an 
n-dimensional hypercube. Due to the importance of cube interconnection functions, we 
will refer to $G_Q(n)$ on several occasions in this dissertation.
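The graph of Example 4.1 is easy to generate programmatically. The Python sketch below is illustrative: binary vectors are encoded as integers, the unit vectors as powers of two, and the function name `hypercube_ifg` is an assumption of the example.

```python
def hypercube_ifg(n):
    """Edges (tail, head, color) of the Cayley color graph of Example 4.1.

    Elements are the integers 0 .. 2**n - 1 read as binary vectors,
    the group composition is bitwise XOR, and generator r is the unit
    vector with bit r - 1 set.
    """
    generators = [1 << i for i in range(n)]
    return [(g, g ^ d, r)
            for g in range(2 ** n)
            for r, d in enumerate(generators, start=1)]


edges = hypercube_ifg(3)

# XOR is commutative and every unit vector is its own inverse,
# so each color class consists of pairs of opposite edges.
symmetric = all((v, u, r) in edges for u, v, r in edges)
```

For n = 3 this produces 3·2³ = 24 colored edges, one of each color leaving every vertex.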
Theorem 4.1: Cayley color graph $C_\Delta(\Gamma)$ is strongly connected.
Proof: Let $\Delta = \{\delta_1, \delta_2, \dots, \delta_{|\Delta|}\}$. Let $\gamma_a$ and $\gamma_b$ be two arbitrary elements of $\Gamma$. Then 
$\gamma_b = \gamma_a h_1 h_2 \cdots h_r$, where $h_1$ through $h_r$ are elements (not necessarily distinct) taken from the set 
$\{\delta_1, \delta_2, \dots, \delta_{|\Delta|}\} \cup \{\delta_1^{-1}, \delta_2^{-1}, \dots, \delta_{|\Delta|}^{-1}\}$. Since we assume that $\Gamma$ is a finite group, each of 
its elements is of finite order. If $q_i$ is the order of $\delta_i$, then $\delta_i^{-1} = \delta_i^{q_i - 1}$, for $1 \le i \le |\Delta|$. 
Thus $\gamma_b = \gamma_a \delta_{x_1}^{s_1} \delta_{x_2}^{s_2} \cdots \delta_{x_r}^{s_r}$, where $s_1$ through $s_r$ are some finite positive integers and 
$x_i \in \{1, 2, \dots, |\Delta|\}$, for $1 \le i \le r$. Hence there is a directed path from $\gamma_a$ to $\gamma_b$, and 
therefore, $C_\Delta(\Gamma)$ is strongly connected. □
Lemma 4.2: Cayley color graph $C_\Delta(\Gamma)$ is an IFG.
Proof: Each vertex of $C_\Delta(\Gamma)$ has outgoing edges of distinct colors. In addition, Theorem 
4.1 indicates that the underlying undirected graph of $C_\Delta(\Gamma)$ is connected. Thus the proof 
is reached. □
The following theorem will provide us the main tool in constructing an optimal 
MBS realizing a symmetric IFG.
Theorem 4.2: An IFG is vertex symmetric if and only if it can be represented as a 
Cayley color graph $C_\Delta(\Gamma)$ associated with a finite group $\Gamma$ and its generating set $\Delta$. 
Proof: For sufficiency, let $G = C_\Delta(\Gamma)$ be the Cayley color graph associated with group 
$\Gamma$ and its generating set $\Delta$. We prove that, for any two vertices $\gamma_1$ and $\gamma_2$ in $C_\Delta(\Gamma)$, there 
exists a color preserving automorphism which maps $\gamma_1$ to $\gamma_2$. Consider the permutation $\psi$ 
defined by $\psi(\gamma) = \gamma_2\gamma_1^{-1}\gamma$, where $\gamma, \gamma_1, \gamma_2 \in \Gamma$. Let (x, y) be a color r edge in G. Then 
$y = x\delta_r$ for some $\delta_r \in \Delta$. From the definition, $\psi(y) = \gamma_2\gamma_1^{-1}y = \gamma_2\gamma_1^{-1}x\delta_r = \psi(x)\delta_r$. 
Therefore, $(\psi(x), \psi(y))$ is a color r edge in G. Therefore, $\psi$ is a color preserving 
automorphism for G. Furthermore, $\psi(\gamma_1) = \gamma_2\gamma_1^{-1}\gamma_1 = \gamma_2$. Hence $\gamma_1$ maps to $\gamma_2$, and 
therefore, G is vertex symmetric.
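The sufficiency argument can be exercised on a small non abelian case. In the Python sketch below, the group (permutations of three symbols), its two generators, and all function names are assumptions made for the illustration; permutations are stored as tuples and multiplied by composition.

```python
from itertools import permutations


def compose(p, q):
    """Product pq of permutations stored as tuples: (pq)[i] = p[q[i]]."""
    return tuple(p[i] for i in q)


def inverse(p):
    inv = [0] * len(p)
    for i, j in enumerate(p):
        inv[j] = i
    return tuple(inv)


# Hypothetical example: the symmetric group on three symbols,
# generated by a transposition and a 3-cycle.
elements = list(permutations(range(3)))
generators = [(1, 0, 2), (1, 2, 0)]
edges = {(g, compose(g, d), r)
         for g in elements
         for r, d in enumerate(generators, start=1)}


def translation(g1, g2):
    """The map psi(g) = g2 * g1^{-1} * g used in the sufficiency proof."""
    shift = compose(g2, inverse(g1))
    return lambda g: compose(shift, g)


def is_color_preserving_automorphism(psi):
    """psi permutes the vertices and sends every color-r edge to a color-r edge."""
    bijective = {psi(g) for g in elements} == set(elements)
    return bijective and all((psi(x), psi(y), r) in edges for x, y, r in edges)


vertex_symmetric = all(
    translation(g1, g2)(g1) == g2
    and is_color_preserving_automorphism(translation(g1, g2))
    for g1 in elements for g2 in elements)
```

Because the edge set is finite and $\psi$ is a bijection, mapping every colored edge to a colored edge of the same color is enough to establish the "if and only if" condition of Definition 4.1.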
For necessity, let G = (V, E) be a vertex symmetric IFG with c colors. We prove 
that G is a Cayley color graph. Let $V = \{v_1, v_2, \dots, v_n\}$. Let $\gamma_i$ be a color preserving 
automorphism which maps $v_1$ to $v_i$, for $1 \le i \le n$. We will show that the elements in the 
set $\{\gamma_1, \gamma_2, \dots, \gamma_n\}$ are unique, that is, there is only one color preserving automorphism 
which maps $v_1$ to $v_i$, for $1 \le i \le n$. Suppose there exists another color preserving 
automorphism $\gamma_i'$ which maps $v_1$ to $v_i$. Let $v_x$ be any vertex in V. Since the underlying 
undirected graph of G is connected, there exists an undirected path $v_1 = u_1, u_2, \dots, u_m = v_x$, 
where either $(u_i, u_{i+1})$ or $(u_{i+1}, u_i)$ is an edge in E for $1 \le i \le m - 1$. First suppose that 
edge $(u_1, u_2)$ is in E and that it is of color r. From the hypothesis, $\gamma_i(u_1) = \gamma_i'(u_1) = v_i$. 
Since there is only one edge of color r directed away from $v_i$, the two edges $(\gamma_i(u_1), \gamma_i(u_2))$ 
and $(\gamma_i'(u_1), \gamma_i'(u_2))$ must be the same, and therefore, $\gamma_i(u_2) = \gamma_i'(u_2)$. Next suppose $(u_2, u_1)$ 
is an edge in E of color r. Since there is only one edge of color r directed towards $v_i$ 
(from Lemma 4.1), we get $\gamma_i(u_2) = \gamma_i'(u_2)$. Similarly, we can show that $\gamma_i(u_3) = \gamma_i'(u_3)$. By 
repeated application of the same argument to the remaining vertices of the path $v_1 = u_1, 
u_2, \dots, u_m = v_x$, we get $\gamma_i(v_x) = \gamma_i'(v_x)$. This is true for every vertex $v_x \in V$. Therefore, $\gamma_i$ and 
$\gamma_i'$ represent the same automorphism. Therefore, the elements in the set $\{\gamma_1, \gamma_2, \dots, \gamma_n\}$ are 
distinct and all the color preserving automorphisms of G are included in the set.
It is straightforward to verify that the elements in the set $\{\gamma_1, \gamma_2, \dots, \gamma_n\}$ form a 
group with $\gamma_1$ as the identity element. Let that group be $\Gamma$. Let $w_r$ be the vertex in V such 
that $(v_1, w_r)$ is an edge of color r. Also, let the color preserving automorphism 
corresponding to vertex $w_r$ be $\delta_r$, for $1 \le r \le c$. Then $\Delta = \{\delta_1, \delta_2, \dots, \delta_c\}$ is a generating 
set for $\Gamma$ and G is the Cayley color graph $C_\Delta(\Gamma)$. □
The above theorem allows us to use group theoretic concepts to find an optimal 
color partition of a vertex symmetric IFG. If an IFG G is a Cayley color graph $C_\Delta(\Gamma)$, 
then the generating set $\Delta$ is isomorphic with the set of interconnection functions 
associated with G. This explains why we should not include the identity element e in the 
generating set $\Delta$: an identity interconnection function does not serve any purpose in 
interprocessor communication. Usually, interconnection functions in a set are independent, 
that is, one function cannot be constructed from the remaining functions in the set. 
Therefore, we focus on generating sets which are non redundant. In the first part of 
this chapter, unless otherwise stated, Cayley color graphs are assumed to have non 
redundant generating sets. In Section 4.5, we will specifically address Cayley color graphs 
with redundant generating sets.
For an arbitrary IFG G, a lower bound on $|\Gamma(\pi(G))|$ is $\beta(G)$ (see Lemma 3.2). As 
we have already stated in Section 3.4, the existence of such a lower bound on $|E(\pi(G))|$ 
is not known. However, we can establish a lower bound on $|E(\pi(G))|$ if G belongs to a 
certain class of graphs.
Definition 4.3: A sequence of edges in an IFG is called an alternately oriented path if 
every pair of consecutive edges in the sequence are directed towards or directed away 
from the same vertex. If the first edge and the last edge of the sequence are also directed 
towards or directed away from the same vertex, the sequence is called an alternately 
oriented cycle. An alternately oriented path or cycle will also be denoted by a sequence 
of vertices, where vertices are selected from the sequence of edges in the obvious manner. 
Definition 4.4: An IFG is said to belong to class % if it contains no alternately oriented 
cycles with distinct color edges.
Example 4.2: Figure 4.1(a) shows an IFG belonging to the class of Definition 4.4. Notice that although cycle $(v_1, v_2, v_3, v_4)$ of Figure 4.1(a) is an alternately oriented cycle, it contains two edges of the same color.

Figure 4.1: (a) An IFG in the class of Definition 4.4. (b) An IFG not in that class.
It is obvious that the number of edges in an alternately oriented cycle is even. It is interesting to notice that an alternately oriented path or cycle has the property that adjacent edges are compatible (see Definition 3.7). The following theorem establishes a lower bound on the number of interfaces of an MBS realizing an IFG belonging to the class of Definition 4.4.
Theorem 4.3: Let $G$ be an IFG belonging to the class of Definition 4.4. Then for any color partition $\pi$, $|E(\pi(G))| \ge |E(G)| + \beta(G)$.
Proof: Let $E_1, E_2, \ldots, E_b$ be the subsets of $E(G)$ corresponding to color partition $\pi$. Let $H_j$ be the subgraph of $G$ induced by the subset $E_j$, $1 \le j \le b$. (Note that $|Y(\pi(G))| = b$.) From $H_j$, construct $H_j^*$ by splitting every vertex $v$ incident with both incoming and outgoing edges into two vertices $v^{in}$ and $v^{out}$ such that $v^{in}$ is incident with only incoming edges and $v^{out}$ is incident with only outgoing edges, for $1 \le j \le b$. If the superscripts $in$ and $out$ of the labels of the vertices are disregarded, the edge set of $H_j^*$ is still $E_j$, for $1 \le j \le b$, and therefore, the final MBS is not affected. Any cycle present in $H_j^*$ must be an alternately oriented cycle. Furthermore, any cycle in $H_j^*$ is also a cycle in $G$ (again, if we disregard the superscripts of the labels). Since $G$ is assumed to be in the class of Definition 4.4, $H_j^*$ does not contain a cycle, that is, $H_j^*$ underlies a tree. Therefore, the number of vertices in $H_j^*$ is at least $|E_j| + 1$. According to the construction, each vertex in $H_j^*$ is incident with either incoming or outgoing edges (not both). Therefore, $|J(E_j)| \ge |E_j| + 1$, $1 \le j \le b$. Hence, by Lemma 3.1,
$$|E(\pi(G))| \ge \sum_{j=1}^{b} (|E_j| + 1) = \sum_{j=1}^{b} |E_j| + b = |E(G)| + b.$$
Also, by Lemma 3.2, $b \ge \beta(G)$. Hence the theorem follows. □
We must emphasize the fact that if an IFG $G$ belongs to the class of Definition 4.4, it does not necessarily mean that a partition $\pi$ exists such that $|E(\pi(G))|$ is equal to its lower bound $|E(G)| + \beta(G)$. In other words, the lower bound for $|E(\pi(G))|$ stated in Theorem 4.3 is not always reachable. However, if the lower bound can be reached by a color partition $\pi'$, then the proof of Theorem 4.3 shows that $|Y(\pi'(G))| = \beta(G)$. Therefore, such a color partition minimizes both $|Y(\pi'(G))|$ and $|E(\pi'(G))|$. Hence, irrespective of the values of the constants used in the optimization function, $\pi'$ is an optimal color partition. We state this in the following theorem.
Theorem 4.4: Let $G$ be an IFG belonging to the class of Definition 4.4. If color partition $\pi'$ satisfies the condition $|E(\pi'(G))| = |E(G)| + \beta(G)$, then $\pi'$ is an optimal color partition. □
4.2 Optimal Color Partitioning of a Cayley Color Graph
When the IFG is a Cayley color graph $C_\Delta(\Gamma)$, we can analytically find an optimal color partition of $C_\Delta(\Gamma)$ using group $\Gamma$ and the generating set $\Delta$. We first present some important characteristics of Cayley color graphs.
Lemma 4.3: $\beta(C_\Delta(\Gamma)) = |\Gamma|$.
Proof: Each vertex of $C_\Delta(\Gamma)$ has one edge from each color. Since there are $|\Gamma|$ vertices in the graph, there are $|\Gamma|$ edges from each color. Therefore, $\beta(C_\Delta(\Gamma)) = |\Gamma|$. □
Theorem 4.5: If $\Delta$ is a (non redundant) generating set of the group $\Gamma$, then the Cayley color graph $C_\Delta(\Gamma)$ belongs to the class of Definition 4.4.
Proof: If $C_\Delta(\Gamma)$ contains an alternately oriented cycle with distinct color edges, then there must exist distinct generators $\delta_1, \delta_2, \ldots, \delta_{2k}$ such that $h_1 h_2 \cdots h_{2k} = e$, where $h_i = (\delta_i)^{(-1)^{i-1}}$ for $1 \le i \le 2k \le |\Delta|$. Since the elements in $\Delta$ are non redundant, this is not possible. Hence $C_\Delta(\Gamma)$ belongs to the class of Definition 4.4. □
Theorem 4.6: Let $E_1 \subseteq E(C_\Delta(\Gamma))$ be a subset containing exactly one edge from each color. Also, let $H_1$ be the subgraph induced by $E_1$. Then $E(C_\Delta(\Gamma))$ can be partitioned into subsets $E_1, E_2, \ldots, E_{|\Gamma|}$ such that subgraph $H_j$ induced by $E_j$, $2 \le j \le |\Gamma|$, is isomorphic with $H_1$.
Proof: Define $\gamma_j(E_1) = \{(\gamma_j x, \gamma_j y) : (x, y) \in E_1\}$, $1 \le j \le |\Gamma|$. By the definition, if $(x, y) \in E_1$, then $y = x\delta$ for some generator $\delta \in \Delta$. Therefore, $(\gamma_j x, \gamma_j y) = (\gamma_j x, \gamma_j x \delta)$. Hence, edge $(x, y)$ in $E_1$ and its image $(\gamma_j x, \gamma_j y)$ in $\gamma_j(E_1)$ are of the same color. Thus, $\gamma_j(E_1)$, $1 \le j \le |\Gamma|$, contains exactly one edge from each color. Now, we make the claim that, for $j \ne j'$, $\gamma_j(E_1) \cap \gamma_{j'}(E_1) = \{\,\}$. Suppose on the contrary that edge $(x^*, y^*)$ is common to both $\gamma_j(E_1)$ and $\gamma_{j'}(E_1)$. For mapping $\gamma_j$, edge $(x^*, y^*)$ in $\gamma_j(E_1)$ must be the image of a certain edge, say, $(x, y)$, in $E_1$. Similarly, for mapping $\gamma_{j'}$, edge $(x^*, y^*)$ in $\gamma_{j'}(E_1)$ must be the image of a certain edge $(x', y') \in E_1$. We have already shown that mappings $\gamma_j$ and $\gamma_{j'}$ preserve the colors of the edges. Therefore, $(x, y)$ and $(x', y')$ must represent the same edge in $E_1$. Thus, $x^* = \gamma_j x = \gamma_{j'} x$. This is not possible since $j \ne j'$. Therefore, the claim is true, that is, $\gamma_j(E_1)$ and $\gamma_{j'}(E_1)$ are disjoint subsets. Now, without loss of generality, assume $\gamma_1 = e$. Then we have disjoint subsets $E_1, E_2, \ldots, E_{|\Gamma|}$, where $E_j = \gamma_j(E_1)$, $1 \le j \le |\Gamma|$. Since the total number of edges in these subsets is $|\Gamma| \cdot |\Delta|$, they form a partition of $E(C_\Delta(\Gamma))$. It is clear from mapping $\gamma_j$ that $H_j$ is isomorphic with $H_1$. □
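The translation argument in this proof can be tried out on a small case. The following is only a sketch under stated assumptions: the group is taken to be the 3-bit strings under exclusive-or, group elements are encoded as integers, an edge is a triple (tail, head, color), and the name `translates` and the sample subset `e1` are hypothetical choices made for illustration.

```python
def translates(e1, order):
    """Images gamma_j(E_1) of an edge subset under left multiplication.

    Assumption: the group is Z_2^k with XOR as composition, so the
    element gamma_j is simply the integer g and gamma_j * x is g ^ x.
    """
    return [[(g ^ x, g ^ y, c) for x, y, c in e1] for g in range(order)]


# A hypothetical E_1 with exactly one edge of each color in the 3-cube;
# color c corresponds to the generator 1 << (c - 1).
e1 = [(0, 1, 1), (5, 7, 2), (2, 6, 3)]
subsets = translates(e1, 8)

# Every translate keeps one edge per color, and no edge appears twice.
flat = [edge for subset in subsets for edge in subset]
print(len(flat), len(set(flat)))
```

Under these assumptions the eight translates are pairwise disjoint and together contain all 24 colored edges, which is the partition property the theorem asserts.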
Therefore, for any subset $E_1 \subseteq E(C_\Delta(\Gamma))$ containing one edge from each color, there exists a color partition such that the induced subgraphs are isomorphic to one another. According to Theorems 4.4 and 4.5, if we can find a color partition $\pi$ such that $|E(\pi(C_\Delta(\Gamma)))| = |\Gamma| \cdot |\Delta| + |\Gamma|$, then $\pi$ must be optimal. Therefore, according to Lemma 3.1 and Theorem 4.6, what we need to find is only one subset $E_1 \subseteq E(C_\Delta(\Gamma))$, containing exactly one edge from each color, such that $|J(E_1)| = |\Delta| + 1$. There are many different ways in which subset $E_1$ can be formed such that $|J(E_1)| = |\Delta| + 1$. In this dissertation, we do not attempt to exhaust all possible partitioning methods, nor do we try to compare their relative merits. We choose $E_1$ as an alternately oriented path with one edge from each color. As we see later, the corresponding optimal color partition will produce an MBS with many attractive features. Next we formally define such a color partition.
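Definition 4.5: Define $\phi_{\Gamma,\Delta} = \{E_j : 1 \le j \le |\Gamma|\}$, where
$$E_j = \{(\gamma_j, \gamma_j h_1), (\gamma_j h_1, \gamma_j h_1 h_2), \ldots, (\gamma_j h_1 h_2 \cdots h_{c-1}, \gamma_j h_1 h_2 \cdots h_c)\}, \quad h_i = (\delta_i)^{(-1)^{i-1}}, \text{ for } 1 \le i \le |\Delta|.$$
From the above discussion, the following theorem is straightforward.
Theorem 4.7: Let $C_\Delta(\Gamma)$ be a Cayley color graph. Then $\phi_{\Gamma,\Delta}$ is an optimal color partition. Furthermore, $|E(\phi_{\Gamma,\Delta}(C_\Delta(\Gamma)))| = |\Gamma|(|\Delta| + 1)$. □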
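As an illustrative sketch (not part of the original text), the partition of Definition 4.5 can be built mechanically for the group of $n$-bit strings under exclusive-or, whose generators are the unit vectors. The integer encoding of group elements, the (tail, head, color) edge triples, and the names `phi_partition` and `interface_count` are assumptions made for this example only.

```python
def phi_partition(n):
    """Edge subsets E_j of Definition 4.5 for Z_2^n under XOR.

    Assumptions: vertices are integers 0 .. 2**n - 1 and generator
    delta_i is the unit vector 1 << (i - 1); every generator is its
    own inverse, so h_i = delta_i.
    """
    parts = {}
    for gamma in range(2 ** n):
        # Vertices gamma, gamma*h_1, gamma*h_1*h_2, ... of the path.
        path = [gamma]
        for i in range(1, n + 1):
            path.append(path[-1] ^ (1 << (i - 1)))
        edges = []
        for i in range(1, n + 1):
            a, b = path[i - 1], path[i]
            # Tails sit at even positions, so consecutive edges
            # alternate in orientation along the path.
            edges.append((a, b, i) if i % 2 == 1 else (b, a, i))
        parts[gamma] = edges
    return parts


def interface_count(parts):
    """One driver per distinct tail and one receiver per distinct head."""
    total = 0
    for edges in parts.values():
        tails = {u for u, _, _ in edges}
        heads = {v for _, v, _ in edges}
        total += len(tails) + len(heads)
    return total


print(interface_count(phi_partition(3)))
```

For $n = 3$ each subset is a path on four vertices, so the interface count agrees with the value $|\Gamma|(|\Delta| + 1) = 8 \cdot 4$ given by Theorem 4.7.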
Let $H_j$ be the subgraph induced by subset $E_j$ of edges due to the optimal color partition $\phi_{\Gamma,\Delta}$. Then the vertices $\gamma_j$, $\gamma_j\delta_1$, $\gamma_j\delta_1\delta_2^{-1}$, ..., $\gamma_j\delta_1\delta_2^{-1}\cdots\delta_{|\Delta|}^{x}$ ($x = -1$ when $|\Delta|$ is even, otherwise $x = 1$) of $H_j$ will be called the 0th, the first, ..., the $|\Delta|$th vertex of $H_j$, respectively. Also, the $i$th vertex of $H_j$ will be denoted by $v_i(H_j)$ for $0 \le i \le |\Delta|$.

Figure 4.2: Induced subgraph $H_j$ of partition $\phi_{\Gamma,\Delta}$ ($|\Delta|$ is even).

Figure 4.2 shows the induced subgraph corresponding to color partition $\phi_{\Gamma,\Delta}$. The 0th vertex of $H_j$ (that is, $v_0(H_j)$) is $\gamma_j$. For simplicity, the bus corresponding to subgraph $H_j$ will also be denoted by $\gamma_j$. From the definition of color partition $\phi_{\Gamma,\Delta}$ of $C_\Delta(\Gamma)$ and from the notation used, the following observations can be made.
(1) There is a one to one correspondence between the elements in $\Gamma$ and the elements in each of the sets $\{v_i(H_j) : 1 \le j \le |\Gamma|\}$, $0 \le i \le |\Delta|$.
(2) There is a one to one correspondence between the elements of $\Gamma$ and the buses in $M(\Gamma, \Delta)$.
(3) If $i$ is even (odd), then vertex $v_i(H_j)$ is incident with only outgoing (incoming) edges in subgraph $H_j$, $1 \le j \le |\Gamma|$.
(4) If $i$ is even (odd), then the processor corresponding to vertex $v_i(H_j)$ is connected to bus $\gamma_j$ via a driver (receiver), $1 \le j \le |\Gamma|$.
Example 4.3: Consider the IFG $G_3$ shown in Figure 2.14. Notice that $G_3$ is obtained by optimal processor assignment to the CFG $C_4$, which corresponds to the Ascend/Descend class of parallel algorithms. As we have already stated, that IFG is the Cayley color graph $G_{Q(3)}$. The group associated with $G_{Q(3)}$ is $\Gamma_{Q(3)} = \{000, 001, 010, 011, 100, 101, 110, 111\}$ and its generating set is $\Delta_{Q(3)} = \{001, 010, 100\}$. Edges corresponding to generators 001, 010, and 100 are represented as solid, broken, and dotted lines, respectively. Group composition is the exclusive-or operation of binary strings. Figure 4.3 shows the subgraphs of $G_{Q(3)}$ induced by the subsets of partition $\phi_{\Gamma_{Q(3)},\Delta_{Q(3)}}$. Figure 4.4 shows the corresponding optimal MBS.














Figure 4.3: Induced subgraphs of $G_{Q(3)}$ by partition $\phi_{\Gamma_{Q(3)},\Delta_{Q(3)}}$.
4.3 Properties of MBSs Realizing Symmetric IFGs
In this section we will highlight some attractive features of an MBS which optimally realizes a symmetric IFG. From Theorem 4.2, a symmetric IFG can be expressed as a Cayley color graph $C_\Delta(\Gamma)$. Therefore we will use the group $\Gamma$ and its generating set $\Delta$ in obtaining the properties of an MBS realizing a symmetric IFG.









Figure 4.4: MBS corresponding to partition in Figure 4.3.
Definition 4.6: The MBS corresponding to the MBG $\phi_{\Gamma,\Delta}(C_\Delta(\Gamma))$ will be denoted by $M(\Gamma, \Delta)$, where $\phi_{\Gamma,\Delta}$ is the optimal color partition as given in Definition 4.5. The direct link interconnection network corresponding to the Cayley color graph $C_\Delta(\Gamma)$ will be denoted by $N(\Gamma, \Delta)$. Also, we use $N_{Q(n)}$ and $M_{Q(n)}$ to denote $N(\Gamma_{Q(n)}, \Delta_{Q(n)})$ and $M(\Gamma_{Q(n)}, \Delta_{Q(n)})$, respectively.
Other than the quite interesting properties inherent in any multiple bus system, the properties of the MBS $M(\Gamma, \Delta)$ derived in the following subsections show how well an MBS can be used as an algorithm specific architecture. We will first show that the MBS $M(\Gamma, \Delta)$ is symmetric. Then we will show the superiority of the MBS $M(\Gamma, \Delta)$ over the counterpart direct link interconnection network $N(\Gamma, \Delta)$, in terms of the number of ports per processor, the number of neighbors, and the diameter (to be defined).
4.3.1 Symmetry
To show that the MBS M(T, A) is symmetric, we introduce the following 
definition.
Definition 4.7: A directed bipartite graph (X , Y, E) is said to be vertex symmetric if for 
every pair of vertices u and v, both of them in X  or both of them in Y, there exists an 
automorphism of the graph that maps u to v. An MBS is said to be symmetric if its 
corresponding bipartite graph is vertex symmetric.
Intuitively speaking, if we view a symmetric MBS from a processor, it appears the 
same irrespective of the processor used. Similarly, if we view the system from a bus, it 
appears the same irrespective of the bus used.
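Theorem 4.8: The bipartite graph $\phi_{\Gamma,\Delta}(C_\Delta(\Gamma))$ is vertex symmetric.
Proof: Let $\phi_{\Gamma,\Delta}(C_\Delta(\Gamma)) = (X, Y, E)$. Consider a pair of vertices $\gamma_1$ and $\gamma_2$ in $X$. Let mapping $\psi$ be defined by $\psi(\gamma) = \gamma_2\gamma_1^{-1}\gamma$, where $\gamma$ is an element of $\Gamma$. Let $(\gamma_y, \gamma_x)$ be an edge in $E$ such that $\gamma_x \in X$ and $\gamma_y \in Y$. Then, according to our usual notation, $\gamma_x$ is the $r$th vertex of $H_y$ for some odd integer $r$. Thus, $\gamma_x = \gamma_y\delta_1\delta_2^{-1}\cdots\delta_r$, and therefore, $\gamma_y^{-1}\gamma_x = \delta_1\delta_2^{-1}\cdots\delta_r$. But $(\psi(\gamma_y))^{-1}(\psi(\gamma_x)) = \gamma_y^{-1}(\gamma_1\gamma_2^{-1})(\gamma_2\gamma_1^{-1})\gamma_x = \gamma_y^{-1}\gamma_x$. Thus $\psi(\gamma_x) = \psi(\gamma_y)\delta_1\delta_2^{-1}\cdots\delta_r$. Therefore, $\psi(\gamma_x)$ is the $r$th vertex of $H_v$, where $\psi(\gamma_y) = \gamma_v$. Hence, there is an edge from $\gamma_v$ to $\psi(\gamma_x)$, that is, $(\psi(\gamma_y), \psi(\gamma_x)) \in E$. Analogously, we can show that $(\psi(\gamma_y), \psi(\gamma_x)) \in E$ implies $(\gamma_y, \gamma_x) \in E$. Moreover, we can show that $(\gamma_x, \gamma_y) \in E$ if and only if $(\psi(\gamma_x), \psi(\gamma_y)) \in E$. Furthermore, $\psi(\gamma_1) = \gamma_2\gamma_1^{-1}\gamma_1 = \gamma_2$. Thus, for every pair of vertices $\gamma_1$ and $\gamma_2$ in $X$, there exists an automorphism of the bipartite graph which maps $\gamma_1$ to $\gamma_2$. By similar reasoning, we can show that for every pair of vertices $\gamma_3$ and $\gamma_4$ in $Y$, there exists an automorphism of the graph which maps $\gamma_3$ to $\gamma_4$. Hence the bipartite graph $\phi_{\Gamma,\Delta}(C_\Delta(\Gamma))$ is vertex symmetric. □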
The above result implies that the MBS M(T, A) is symmetric. This can be 
observed, for example, in Figure 4.4. Each processor (bus) is identical to every other 
processor (bus). This property suggests a simple and efficient VLSI implementation, due 
to the fact that symmetry leads to easy routing methods, convenient replacement of faulty 
components, and area efficient layout.
4.3.2 Number of Ports per Processor
The following result provides one of the favorable properties of an optimal MBS realizing a vertex symmetric IFG.
Theorem 4.9: The numbers of output and input communicating ports per processor in $M(\Gamma, \Delta)$ are $\lfloor |\Delta|/2 \rfloor + 1$ and $\lceil |\Delta|/2 \rceil$, respectively.
Proof: Subgraph $H_j$ induced by the subset of edges $E_j$, $1 \le j \le |\Gamma|$, is an alternately oriented path. With our usual notation, only the vertices $v_0(H_j), v_2(H_j), v_4(H_j), \ldots, v_{2\lfloor |\Delta|/2 \rfloor}(H_j)$ of $H_j$ are incident with outgoing edges (see Figure 4.2). Let $y_j$ be the vertex in $Y(\phi_{\Gamma,\Delta}(C_\Delta(\Gamma)))$ corresponding to subgraph $H_j$. Then there is an edge from each of the vertices $x_0, x_2, x_4, \ldots, x_{2\lfloor |\Delta|/2 \rfloor}$ to vertex $y_j$ in $\phi_{\Gamma,\Delta}(C_\Delta(\Gamma))$, where $x_i \in X(\phi_{\Gamma,\Delta}(C_\Delta(\Gamma)))$ is the vertex corresponding to $v_i(H_j)$ in $H_j$. Thus the number of incoming edges incident with $y_j$ is $\lfloor |\Delta|/2 \rfloor + 1$. That is, each bus in $M(\Gamma, \Delta)$ is connected to $\lfloor |\Delta|/2 \rfloor + 1$ drivers. Hence, there are $|\Gamma|(\lfloor |\Delta|/2 \rfloor + 1)$ drivers. Therefore, by symmetry, each processor is connected to $\lfloor |\Delta|/2 \rfloor + 1$ drivers. By similar reasoning we can show that each processor is connected to $\lceil |\Delta|/2 \rceil$ receivers. □
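The port counts can be checked by enumeration for the hypercube case. This is a minimal sketch, assuming processors and buses of $M_{Q(n)}$ are encoded as integers and that the $i$th vertex of the subgraph of bus $j$ is $j$ exclusive-or'ed with the string $0^{n-i}1^{i}$; the function names are hypothetical.

```python
from collections import Counter


def interfaces(n):
    """(processor, bus) pairs for the drivers and receivers of M_Q(n).

    Assumption: the i-th vertex of the subgraph of bus j is the integer
    j ^ (2**i - 1); even positions are drivers, odd ones receivers.
    """
    drivers, receivers = [], []
    for bus in range(2 ** n):
        for i in range(n + 1):
            proc = bus ^ ((1 << i) - 1)
            (drivers if i % 2 == 0 else receivers).append((proc, bus))
    return drivers, receivers


def ports_per_processor(n):
    """Sets of distinct output and input port counts over all processors."""
    drivers, receivers = interfaces(n)
    out_ports = Counter(p for p, _ in drivers)
    in_ports = Counter(p for p, _ in receivers)
    return set(out_ports.values()), set(in_ports.values())


for n in range(1, 7):
    print(n, ports_per_processor(n))
```

Each returned set contains a single value because every processor has the same number of ports, which is what the symmetry argument in the proof relies on.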
The above theorem establishes an important attraction for MBSs. It is obvious that in the direct link network $N(\Gamma, \Delta)$, the number of input and output ports is $|\Delta|$ each. Hence the optimal MBS realization of a set of interconnection functions requires approximately one half the number of ports used in a direct link network realizing the same set of interconnection functions. In fact, it is well known that the number of ports per processor is a very limiting factor in constructing large multiprocessor systems, particularly in a VLSI context [107]. To demonstrate the advantage of an MBS in this regard, we simply note that in implementing cube functions, an MBS with $N^2$ processors would have about the same number of ports per processor as a conventional hypercube with only $N$ processors. The following corollary is a direct consequence of Theorems 4.8 and 4.9.
Corollary: The numbers of drivers and receivers per bus in the MBS $M(\Gamma, \Delta)$ are $\lfloor |\Delta|/2 \rfloor + 1$ and $\lceil |\Delta|/2 \rceil$, respectively. □
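4.3.3 Number of Neighbors per Processor
In this section we derive an expression for the number of neighbors per processor in $M(\Gamma, \Delta)$ and show that it is much larger than that for the direct link network $N(\Gamma, \Delta)$.
Definition 4.8: Let $P_1$ and $P_2$ be two processors in an MBS such that they are connected to bus $B$ via a driver and a receiver, respectively. Then $P_2$ is said to be a neighbor of $P_1$ via bus $B$. Also, we say that there is a direct path from $P_1$ to $P_2$ in the MBS. If the identity of bus $B$ is irrelevant, we simply say that $P_2$ is a neighbor of $P_1$.
Lemma 4.4: For two processors $\gamma_x$ and $\gamma_y$ in $M(\Gamma, \Delta)$, there exists at most one bus $\gamma_z$ such that $\gamma_y$ is a neighbor of $\gamma_x$ via bus $\gamma_z$.
Proof: Suppose that $\gamma_y$ is a neighbor of $\gamma_x$ with respect to two distinct buses $\gamma_z$ and $\gamma_w$ in $M(\Gamma, \Delta)$. Then, with our usual notation, vertices $\gamma_x$ and $\gamma_y$ must be in both of the subgraphs $H_z$ and $H_w$. Furthermore, $\gamma_x$ must be the $a$th and $b$th vertices of $H_z$ and $H_w$, respectively, for some even integers $a, b \le |\Delta|$. Therefore,
$$\gamma_x = \gamma_z\delta_1\delta_2^{-1}\delta_3\cdots\delta_{a-1}\delta_a^{-1} = \gamma_w\delta_1\delta_2^{-1}\delta_3\cdots\delta_{b-1}\delta_b^{-1}.$$
Similarly, $\gamma_y$ must be the $c$th and $d$th vertices of $H_z$ and $H_w$, respectively, for some odd integers $c, d \le |\Delta|$. Therefore,
$$\gamma_y = \gamma_z\delta_1\delta_2^{-1}\delta_3\cdots\delta_{c-1}^{-1}\delta_c = \gamma_w\delta_1\delta_2^{-1}\delta_3\cdots\delta_{d-1}^{-1}\delta_d.$$
Thus, we can write
$$\gamma_x^{-1}\gamma_y = (\gamma_z\delta_1\delta_2^{-1}\delta_3\cdots\delta_{a-1}\delta_a^{-1})^{-1}(\gamma_z\delta_1\delta_2^{-1}\delta_3\cdots\delta_{c-1}^{-1}\delta_c) = Z, \text{ and}$$
$$\gamma_x^{-1}\gamma_y = (\gamma_w\delta_1\delta_2^{-1}\delta_3\cdots\delta_{b-1}\delta_b^{-1})^{-1}(\gamma_w\delta_1\delta_2^{-1}\delta_3\cdots\delta_{d-1}^{-1}\delta_d) = W.$$
Suppose $a > c$. Then $Z = (\delta_a\delta_{a-1}^{-1}\cdots\delta_2\delta_1^{-1}\gamma_z^{-1})(\gamma_z\delta_1\delta_2^{-1}\delta_3\cdots\delta_{c-1}^{-1}\delta_c) = \delta_a\delta_{a-1}^{-1}\delta_{a-2}\cdots\delta_{c+2}^{-1}\delta_{c+1}$. Now assume $b < d$. Then $W = (\delta_b\delta_{b-1}^{-1}\cdots\delta_2\delta_1^{-1}\gamma_w^{-1})(\gamma_w\delta_1\delta_2^{-1}\delta_3\cdots\delta_{d-1}^{-1}\delta_d) = \delta_{b+1}\delta_{b+2}^{-1}\cdots\delta_{d-2}\delta_{d-1}^{-1}\delta_d$. The generators $\delta_a, \delta_{a-1}, \delta_{a-2}, \ldots, \delta_{c+2}$, and $\delta_{c+1}$ used in word $Z$ are all distinct and so are the generators $\delta_{b+1}, \delta_{b+2}, \ldots, \delta_{d-2}, \delta_{d-1}$, and $\delta_d$ used in word $W$. Let $\Delta_Z = \{\delta_a, \delta_{a-1}, \delta_{a-2}, \ldots, \delta_{c+2}, \delta_{c+1}\}$, and $\Delta_W = \{\delta_{b+1}, \delta_{b+2}, \ldots, \delta_{d-2}, \delta_{d-1}, \delta_d\}$. Since the generators in $\Delta$ are non redundant, $Z = W$ implies that $\Delta_Z = \Delta_W$. Thus, $\delta_{b+1} = \delta_{c+1}$ ($\delta_{c+1}$ and $\delta_{b+1}$ are the generators with the least indices in sets $\Delta_Z$ and $\Delta_W$, respectively). Hence, $b = c$, which is impossible since $b$ is even and $c$ is odd. Therefore, $b$ must be greater than $d$ when $a > c$. In this case $W = \delta_b\delta_{b-1}^{-1}\delta_{b-2}\cdots\delta_{d+2}^{-1}\delta_{d+1}$. Now $Z = W$ implies $a = b$. But this is true only when $\gamma_z = \gamma_w$. The same is true when $a < c$. Therefore, $\gamma_y$ can be a neighbor of $\gamma_x$ via only one bus. □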
Theorem 4.10: The number of neighbors per processor in $M(\Gamma, \Delta)$ is $\lceil |\Delta|/2 \rceil (\lfloor |\Delta|/2 \rfloor + 1)$.
Proof: From Theorem 4.9, each processor is connected to $\lfloor |\Delta|/2 \rfloor + 1$ distinct buses via drivers. Also from the corollary to Theorem 4.9, each bus is connected to $\lceil |\Delta|/2 \rceil$ distinct processors via receivers. Thus the total number of neighbors is $\lceil |\Delta|/2 \rceil (\lfloor |\Delta|/2 \rfloor + 1)$. According to Lemma 4.4, all of these neighbors are distinct. □
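The count in Theorem 4.10 can be reproduced by enumeration for the hypercube MBS. The sketch below assumes the same integer encoding of processors and buses as before (the $i$th vertex of the subgraph of bus $j$ is `j ^ (2**i - 1)`); the name `neighbors` is hypothetical.

```python
def neighbors(n, p):
    """Processors reachable from processor p over a single bus of M_Q(n).

    Assumption: p drives the buses at even positions of its subgraphs,
    and the receivers of a bus sit at the odd positions.
    """
    result = set()
    for i in range(0, n + 1, 2):
        bus = p ^ ((1 << i) - 1)          # p is the i-th vertex of H_bus
        for k in range(1, n + 1, 2):
            result.add(bus ^ ((1 << k) - 1))
    return result


for n in (3, 6):
    # Theorem 4.10 value next to the direct link value |Delta| = n.
    print(n, len(neighbors(n, 0)), ((n + 1) // 2) * (n // 2 + 1), n)
```

Collecting the neighbors in a set means that a processor reachable over two different buses would be counted once, so agreement with the product formula also reflects the distinctness stated in Lemma 4.4.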
The number of neighbors per processor in the direct link network is clearly $|\Delta|$. The above theorem highlights another big advantage of an MBS over its direct link counterpart. An MBS has a larger number of neighbors (very useful for efficient communication and broadcasting) than a direct link interconnection network realizing the same set of interconnection functions, yet it uses roughly half the ports per processor. For $|\Delta| = 6$, $M(\Gamma, \Delta)$ would have 12 neighbors per processor, which is twice as many as in the direct link counterpart $N(\Gamma, \Delta)$. The advantage becomes even larger as the number of interconnection functions $|\Delta|$ increases.
4.3.4 Diameter
Another very useful parameter associated with any interconnection network is the 
diameter. For an MBS, we give the following definition.
Definition 4.9: The distance from processor $P_1$ to processor $P_2$ in an MBS is the minimum number of buses that must be used to transfer data from $P_1$ to $P_2$. The diameter of an MBS is the maximum distance from one processor to another, taken over all pairs of processors.
For a direct link interconnection network, the diameter corresponds to the usual graph theoretic definition [30]. If there is an edge from vertex $x$ to vertex $y$ in an IFG $G$, then processor $y$ is a neighbor of processor $x$ in the MBS which realizes $G$. Therefore, the diameter of an MBS is no more than that of the direct link interconnection network realizing the same IFG. This is true irrespective of the IFG used. Unfortunately, nothing more can be said about the relative magnitudes of the diameters even if the IFG is symmetric. There is no general formula for the diameter of an arbitrary Cayley color graph. However, for the important IFG $G_{Q(n)}$, we shall obtain a very attractive result, namely, the diameter of $M_{Q(n)}$ is $\lfloor n/2 \rfloor + 1$.
It is well known that the diameter of $N_{Q(n)}$ is $n$. Elements of group $\Gamma_{Q(n)}$ are the $2^n$ binary vectors of length $n$. Elements of the generating set $\Delta_{Q(n)}$ are the $n$ unit vectors of length $n$. In order to find the diameter of $M_{Q(n)}$ we will adhere to the following notation. The exclusive-or operation is represented by $\oplus$. Notation $0^i$ ($1^i$) is used to represent the string of zeros (ones) of length $i$. Binary string $w$ will be written as $w_n w_{n-1} \cdots w_2 w_1$, where $w_i$ is the $i$th bit of $w$, $1 \le i \le n$. The number of 1's in string $w$ is denoted by $\mathcal{H}(w)$. We will denote the diameter of $M_{Q(n)}$ by $diam(M_{Q(n)})$. First we will obtain an upper bound on $diam(M_{Q(n)})$.
Lemma 4.5: $diam(M_{Q(n)}) \le \lfloor n/2 \rfloor + 1$.
Proof: Let $\gamma_a$ and $\gamma_b$ be two arbitrary processors in $M_{Q(n)}$. We can express $\gamma_b = \gamma_a \oplus x$, for some binary string $x$ of length $n$. Then, in $M_{Q(n)}$, there is a path from $\gamma_a$ to $\gamma_b$ of length $\mathcal{H}(x)$. If $\mathcal{H}(x) \le \lfloor n/2 \rfloor + 1$, then we have a path from $\gamma_a$ to $\gamma_b$ of length at most $\lfloor n/2 \rfloor + 1$. Now suppose that $\mathcal{H}(x) > \lfloor n/2 \rfloor + 1$. Then, we will consider two cases. Let $\gamma_c = \gamma_a \oplus 1^n$.
Case 1: $n$ is odd.
In this case, processor $\gamma_c$ is connected to bus $\gamma_a$ via a receiver. Therefore, there is a direct path from $\gamma_a$ to $\gamma_c$ in $M_{Q(n)}$. (Note that processor $\gamma_a$ is connected to bus $\gamma_a$ via a driver.) Also there is a path from $\gamma_c$ to $\gamma_b$ of length $n - \mathcal{H}(x)$. Therefore, there is a path from $\gamma_a$ to $\gamma_b$ of length $n - \mathcal{H}(x) + 1$. Since $\mathcal{H}(x) > \lfloor n/2 \rfloor + 1$ and $n$ is odd by the hypothesis, we have $n - \mathcal{H}(x) + 1 < n - (\lfloor n/2 \rfloor + 1) + 1$, that is,
$$n - \mathcal{H}(x) + 1 \le n - (\lfloor n/2 \rfloor + 1) = n - ((n - 1)/2 + 1) = (n - 1)/2 = \lfloor n/2 \rfloor.$$
Case 2: $n$ is even.
In this case processor $\gamma_c$ is not connected to bus $\gamma_a$ via a receiver. Let $\gamma_d = \gamma_a \oplus 01^{n-1}$. Then, since $n - 1$ is odd, there is a direct path from $\gamma_a$ to $\gamma_d$ in $M_{Q(n)}$. Since $\gamma_c$ and $\gamma_d$ are adjacent in $G_{Q(n)}$, there is a direct path from $\gamma_d$ to $\gamma_c$. Therefore, there is a path of length $n - \mathcal{H}(x) + 2$ from $\gamma_a$ to $\gamma_b$. Since $n$ is even and $\mathcal{H}(x) > \lfloor n/2 \rfloor + 1$ by the hypothesis, we have $n - \mathcal{H}(x) + 2 < n - (\lfloor n/2 \rfloor + 1) + 2$, that is,
$$n - \mathcal{H}(x) + 2 \le n - (\lfloor n/2 \rfloor + 1) + 1 = n - (n/2 + 1) + 1 = n/2 = \lfloor n/2 \rfloor.$$
Therefore, in $M_{Q(n)}$, there always exists a path of length less than or equal to $\lfloor n/2 \rfloor + 1$ from processor $\gamma_a$ to processor $\gamma_b$. Since $\gamma_a$ and $\gamma_b$ are arbitrary processors of $M_{Q(n)}$, the lemma follows. □
Next we will obtain a lower bound on the diameter of $M_{Q(n)}$. For that purpose, some new notation will be introduced.
Definition 4.10: A binary string of the form $s = 0^{i}1^{2k+1}0^{n-i-2k-1}$ will be called an SCOL (String with Consecutive ones of Odd Length). Also define $\eta(s) = \max\{j : s_j = 1\}$, $\mu(s) = \min\{j : s_j = 1\}$, and $\iota(s) = \eta(s) - \mu(s) + 1$.
Example 4.4: The strings 001000, 011100, 111000, and 000001 are SCOLs because all 1's in each string are consecutive and they form substrings of odd length. However, the strings 000011, 000000, and 100011 are not SCOLs. For the SCOL 011100, $\eta(011100) = 5$, $\mu(011100) = 3$, and $\iota(011100) = 3$. Notice that $\iota(s)$ represents the length of the substring of $s$ with consecutive 1's.
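The definitions above translate directly into small predicates. This is a sketch only, assuming a string is written most significant bit first ($w_n \cdots w_1$) as in the text; the function names `eta`, `mu`, `iota`, and `is_scol` are hypothetical helpers, not notation from the source.

```python
def eta(s):
    """Largest bit position (counting from 1 at the right) holding a 1."""
    return max(j for j, bit in enumerate(reversed(s), 1) if bit == "1")


def mu(s):
    """Smallest bit position (counting from 1 at the right) holding a 1."""
    return min(j for j, bit in enumerate(reversed(s), 1) if bit == "1")


def iota(s):
    """Distance between the extreme 1's, inclusive."""
    return eta(s) - mu(s) + 1


def is_scol(s):
    """True when the 1's of s form one consecutive run of odd length."""
    if "1" not in s:
        return False
    run = s.strip("0")
    return "0" not in run and len(run) % 2 == 1


for s in ["001000", "011100", "111000", "000001", "000011", "000000", "100011"]:
    print(s, is_scol(s))
```

The loop runs the predicate over the seven strings of Example 4.4.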
Lemma 4.6: Processor $\gamma_b$ is a neighbor of processor $\gamma_a$ in $M_{Q(n)}$ if and only if $\gamma_b = \gamma_a \oplus s$, for some SCOL $s$.
Proof: First suppose that processors $\gamma_a$ and $\gamma_b$ are connected to bus $\gamma_c$ via a driver and a receiver, respectively. Then, vertices $\gamma_a$ and $\gamma_b$ must both belong to the same subgraph $H_c$. Furthermore, $\gamma_a$ must be the $g$th vertex of $H_c$ for an even integer $g$, and $\gamma_b$ must be the $h$th vertex of $H_c$ for an odd integer $h$. Therefore, $\gamma_a = \gamma_c \oplus 0^{n-g}1^{g}$ and $\gamma_b = \gamma_c \oplus 0^{n-h}1^{h}$. If $h > g$, then $\gamma_b = \gamma_a \oplus 0^{n-h}1^{h-g}0^{g}$. Since $h - g$ is odd, $0^{n-h}1^{h-g}0^{g}$ is an SCOL. The same result holds when $h < g$.
Now suppose that $\gamma_b = \gamma_a \oplus s$ for some SCOL $s$. Let $s = 0^{n-m-l}1^{l}0^{m}$, where $l$ is an odd integer. We will consider two cases.
Case 1: $m$ is even.
Let $\gamma_c = \gamma_a \oplus 0^{n-m}1^{m}$. Then, $\gamma_a = \gamma_c \oplus 0^{n-m}1^{m}$. Therefore, $\gamma_a$ is the $m$th vertex of the subgraph $H_c$. Since $m$ is even by the hypothesis, processor $\gamma_a$ is connected to bus $\gamma_c$ via a driver in $M_{Q(n)}$. Now, $\gamma_b = \gamma_a \oplus s = \gamma_a \oplus 0^{n-m-l}1^{l}0^{m} = \gamma_c \oplus 0^{n-m}1^{m} \oplus 0^{n-m-l}1^{l}0^{m} = \gamma_c \oplus 0^{n-m-l}1^{m+l}$. Therefore $\gamma_b$ is the $(m + l)$th vertex of subgraph $H_c$. Since $m + l$ is odd ($m$ is even and $l$ is odd), processor $\gamma_b$ is connected to bus $\gamma_c$ via a receiver. Hence $\gamma_b$ is a neighbor of $\gamma_a$ via bus $\gamma_c$.
Case 2: $m$ is odd.
Let $\gamma_c = \gamma_a \oplus 0^{n-m-l}1^{m+l}$. Since $m + l$ is even, processor $\gamma_a$ is connected to bus $\gamma_c$ via a driver. Now, $\gamma_b = \gamma_a \oplus s = \gamma_c \oplus 0^{n-m}1^{m}$, and therefore, processor $\gamma_b$ is connected to bus $\gamma_c$ via a receiver. Hence, $\gamma_b$ is a neighbor of $\gamma_a$. □
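Lemma 4.6 lends itself to an exhaustive check on small hypercubes. The following sketch assumes processors and buses are encoded as integers, with the $i$th vertex of the subgraph of bus $j$ equal to `j ^ (2**i - 1)`; `scols` and `bus_neighbors` are hypothetical names introduced for this illustration.

```python
def scols(n):
    """All SCOLs of length n as integers: an odd run of 1's, shifted left."""
    return {((1 << l) - 1) << m
            for l in range(1, n + 1, 2)
            for m in range(n - l + 1)}


def bus_neighbors(n, p):
    """Neighbors of p computed from the bus structure of M_Q(n).

    Assumption: drivers occupy even positions and receivers odd
    positions of each bus subgraph.
    """
    found = set()
    for i in range(0, n + 1, 2):
        bus = p ^ ((1 << i) - 1)
        for k in range(1, n + 1, 2):
            found.add(bus ^ ((1 << k) - 1))
    return found


n, p = 4, 0b0101
print(bus_neighbors(n, p) == {p ^ s for s in scols(n)})
```

The comparison puts the two sides of the lemma next to each other: the set obtained from driver and receiver positions, and the set obtained by exclusive-or with every SCOL.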
Lemma 4.7: Let $w = s^1 \oplus s^2 \oplus \cdots \oplus s^\alpha$, where each $s^i$, $1 \le i \le \alpha$, is an SCOL. Let $m = \max\{j : w_j = 1\}$. Then $w = t^1 \oplus t^2 \oplus \cdots \oplus t^\beta$, where each $t^i$, $1 \le i \le \beta \le \alpha$, is an SCOL with the property $\eta(t^i) \le m$.
Proof: Let $S = \{s^1, s^2, \ldots, s^\alpha\}$. In the proof we will construct $T = \{t^1, t^2, \ldots, t^\beta\}$ from $S$. Divide $S$ into three disjoint subsets $S_1$, $S_2$, and $S_3$ as follows.
$S_1 = \{s^i : \eta(s^i) \le m\}$
$S_2 = \{s^i : \eta(s^i) > m \ge \mu(s^i)\}$
$S_3 = \{s^i : \mu(s^i) > m\}$
In constructing $T$, we will retain all the elements in $S_1$, modify all the elements in $S_2$, and discard all the elements in $S_3$. Let $T_1 = S_1$. To modify the elements in $S_2$ and form $T_2$, we make the following claim.
Claim: If $S_3$ is empty, then the number of elements $s^i$ in $S_2$ with the same value of $\eta(s^i)$ is even.
Proof: Let $s^x$ be an element in $S_2$ such that $\eta(s^x)$ is the largest member in $\{\eta(s^i) : s^i \in S_2\}$. Since $w_j = 0$ for $m < j \le n$, there should be an even number of elements in $\{s^i : s^i \in S, \eta(s^i) = \eta(s^x)\}$. By the hypothesis of the claim, $S_3$ is empty. Therefore, there should be an even number of elements in $\{s^i : s^i \in S_2, \eta(s^i) = \eta(s^x)\}$. The exclusive-or summation of all these SCOLs will yield zeros in bit positions $m + 1$ through $n$. Therefore, we can disregard all those SCOLs and similarly prove that there is an even number of elements in $\{s^i : s^i \in S, \eta(s^i) = \eta(s^y)\}$, where $\eta(s^y)$ is the second largest member in $\{\eta(s^i) : s^i \in S_2\}$. By repeated application of this argument, the claim follows.
For each $s^i \in S_2$ such that $(\eta(s^i) - m)$ is even, let $t^i = s^i \oplus 0^{n-\eta(s^i)}1^{\eta(s^i)-m}0^{m}$. Since $(\eta(s^i) - m)$ is even, $t^i$ is also an SCOL. Furthermore, $\eta(t^i) = m$ and the $m$th bits of $s^i$ and $t^i$ are the same. Similarly, for each $s^i \in S_2$ such that $(\eta(s^i) - m)$ is odd, let $t^i = s^i \oplus 0^{n-\eta(s^i)}1^{\eta(s^i)-m+1}0^{m-1}$. Since $(\eta(s^i) - m + 1)$ is even, $t^i$ is also an SCOL. Furthermore, $\eta(t^i) = m - 1$ and the $m$th bits of $s^i$ and $t^i$ are different. Now we construct $T_2$ by letting $T_2 = \{t^i : s^i \in S_2\}$.
Let $v$ be the string obtained by exclusive-or summation of the elements in $T_1 \cup T_2$. If $v = w$, then we have the required set of SCOLs $T = T_1 \cup T_2$. Clearly, $\beta = |T| = |T_1| + |T_2| = |S_1| + |S_2| \le |S| = \alpha$. If $v \ne w$, then they must differ only in bit position $m$. Therefore, we can make $v = w$ by letting $T = T_1 \cup T_2 \cup \{0^{n-m}10^{m-1}\}$. If that is the case, we need only to prove that $\beta = |T| \le |S| = \alpha$. By constructing $t^i$ from $s^i$, we altered the $m$th bit of the SCOL $s^i$ only when $(\eta(s^i) - m)$ is odd. Therefore, for the $m$th bits of $v$ and $w$ to be different, the number of elements $s^i \in S_2$ with an odd value of $(\eta(s^i) - m)$ must be odd. But according to the previous claim, unless $S_3$ is non empty, the number of elements $s^i \in S_2$ with an odd value of $(\eta(s^i) - m)$ is even. Therefore, $v \ne w$ implies that $S_3$ is non empty. Therefore, $|S| \ge |S_1| + |S_2| + 1$, implying $|T| \le |S|$. Hence, the lemma follows. □
Lemma 4.8: Let $s$ be an SCOL of length $n$. Then $s \oplus 0^{n-\eta(s)}110^{\eta(s)-2}$ is also an SCOL whose $(\eta(s))$th bit is equal to zero.
Proof: For clarity denote $\eta(s)$ by $\eta$. First suppose that $\iota(s) = 1$. Then, $s = 0^{n-\eta}10^{\eta-1}$. Therefore,
$$s \oplus 0^{n-\eta}110^{\eta-2} = 0^{n-\eta}10^{\eta-1} \oplus 0^{n-\eta}110^{\eta-2} = 0^{n-\eta+1}10^{\eta-2}.$$
Clearly, $0^{n-\eta+1}10^{\eta-2}$ is an SCOL and its $\eta$th bit is equal to zero.
Now suppose that $\iota(s) = l > 1$. Then, $s = 0^{n-\eta}1^{l}0^{\eta-l}$. Therefore,
$$s \oplus 0^{n-\eta}110^{\eta-2} = 0^{n-\eta}1^{l}0^{\eta-l} \oplus 0^{n-\eta}110^{\eta-2} = 0^{n-\eta+2}1^{l-2}0^{\eta-l}.$$
Clearly, $0^{n-\eta+2}1^{l-2}0^{\eta-l}$ is an SCOL and its $\eta$th bit is equal to zero. □
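A short brute-force check of Lemma 4.8 is sketched below. It assumes SCOLs are encoded as integers and that $\eta(s) \ge 2$, since the string $0^{n-\eta}110^{\eta-2}$ is only defined in that case; `is_scol` and `clear_top` are hypothetical helper names.

```python
def is_scol(x):
    """True when the 1 bits of the integer x form one run of odd length."""
    if x == 0:
        return False
    x //= x & -x                      # drop trailing zeros
    return (x & (x + 1)) == 0 and x.bit_length() % 2 == 1


def clear_top(s):
    """XOR s with the pattern 11 placed at bit positions eta and eta - 1.

    Assumption: eta(s) = s.bit_length() is at least 2.
    """
    eta = s.bit_length()
    return s ^ (0b11 << (eta - 2))


s = 0b011100                          # the SCOL 011100, eta = 5
t = clear_top(s)
print(format(t, "06b"), is_scol(t))
```

For the SCOL 011100 the result is 000100: the two topmost 1's are removed, which corresponds to the case $\iota(s) > 1$ of the proof.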
Definition 4.11: Let $w$ be a binary string of length $n$. Then $L(w)$ is the minimum number of SCOLs whose exclusive-or summation produces $w$.
Lemma 4.9: Let $w$ be an arbitrary string of length $n$ such that $m = \max\{j : w_j = 1\} \le n - 2$. Let $w' = w \oplus 0^{n-m-2}10^{m+1}$. Then $L(w') \ge L(w) + 1$.
Proof: Let $S$ be a set of SCOLs whose exclusive-or summation yields $w'$. According to Lemma 4.7, without loss of generality, we can assume that $\eta(s^i) \le m + 2$ for every element $s^i \in S$. Since $w'_{m+2} = 1$, at least one SCOL $s^i$ in $S$ must be such that $\eta(s^i) = m + 2$. Suppose that $0^{n-m-2}1^{x}0^{m-x+2}$ and $0^{n-m-2}1^{y}0^{m-y+2}$ are two such SCOLs. Now,
$$0^{n-m-2}1^{x}0^{m-x+2} \oplus 0^{n-m-2}1^{y}0^{m-y+2}$$
$$= 0^{n-m-2}1^{x}0^{m-x+2} \oplus 0^{n-m-2}1^{y}0^{m-y+2} \oplus (0^{n-m-2}110^{m} \oplus 0^{n-m-2}110^{m})$$
$$= (0^{n-m-2}1^{x}0^{m-x+2} \oplus 0^{n-m-2}110^{m}) \oplus (0^{n-m-2}1^{y}0^{m-y+2} \oplus 0^{n-m-2}110^{m}).$$
Therefore, according to Lemma 4.8, $0^{n-m-2}1^{x}0^{m-x+2}$ and $0^{n-m-2}1^{y}0^{m-y+2}$ can be replaced by two SCOLs whose $(m + 2)$nd bits are equal to 0. Every pair of SCOLs in $S$ with $(m + 2)$nd bits equal to 1 can be similarly replaced. Therefore, we can assume that $S$ contains only one SCOL whose $(m + 2)$nd bit is equal to 1. Let that SCOL be
$$s^1 = 0^{n-m-2}1^{x}0^{m-x+2}.$$
Suppose that $x > 1$. Then the $(m + 1)$st bit of $s^1$ is also equal to 1. Since $w'_{m+1} = 0$, there should be an element of $S$ whose $(m + 1)$st bit is equal to 1. Let that element be
$$s^2 = 0^{n-m-1}1^{z}0^{m-z+1}.$$
Suppose that $z > x - 1$. Then,
$$s^1 \oplus s^2 = 0^{n-m-2}1^{x}0^{m-x+2} \oplus 0^{n-m-1}1^{z}0^{m-z+1}$$
$$= 0^{n-m-2}10^{x-1}1^{z-x+1}0^{m-z+1}$$
$$= 0^{n-m-2}10^{m+1} \oplus 0^{n-m+x-2}1^{z-x+1}0^{m-z+1}.$$
Since both $x$ and $z$ are odd, $z - x + 1$ is also odd. Therefore, $0^{n-m+x-2}1^{z-x+1}0^{m-z+1}$ is an SCOL. Thus, the two SCOLs $s^1$ and $s^2$ can be replaced by $0^{n-m-2}10^{m+1}$ and another SCOL with the $(m + 2)$nd bit equal to 0. The same result holds when $z < x - 1$. Therefore, we can assume that the only SCOL in $S$ with bit $(m + 2)$ equal to 1 is the string $0^{n-m-2}10^{m+1}$. Now, since $w'_{m+1} = 0$, there should be an even number of elements of $S$ whose $(m + 1)$st bits are equal to 1. Let $s^3$ and $s^4$ be such a pair of SCOLs. We have $s^3 \oplus s^4 = (s^3 \oplus 0^{n-m-1}110^{m-1}) \oplus (s^4 \oplus 0^{n-m-1}110^{m-1})$. According to Lemma 4.8, $(s^3 \oplus 0^{n-m-1}110^{m-1})$ and $(s^4 \oplus 0^{n-m-1}110^{m-1})$ are SCOLs with their $(m + 1)$st bits equal to 0. Therefore, every pair of SCOLs in $S$ with $(m + 1)$st bits equal to 1 can be replaced by another pair of SCOLs whose $(m + 1)$st bits are equal to 0. Therefore, we can safely assume that $S$ consists of two disjoint subsets $S_1$ and $S_2$, where every element of $S_1$ has its $(m + 2)$nd and $(m + 1)$st bits equal to zero and $S_2$ contains the single SCOL $0^{n-m-2}10^{m+1}$. Furthermore, $|S| = |S_1| + |S_2| = |S_1| + 1$.
Suppose that $|S| < L(w) + 1$. Then, $|S_1| < L(w)$. Since $w' = w \oplus 0^{n-m-2}10^{m+1}$, the exclusive-or summation of the elements of $S_1$ yields $w$. Therefore, we could construct a smaller set of SCOLs whose exclusive-or summation produces $w$. This violates the definition of $L(w)$. Therefore, $|S| \ge L(w) + 1$. That is, $L(w') \ge L(w) + 1$. □
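For small $n$, the quantity $L(w)$ of Definition 4.11 can be computed exhaustively and Lemma 4.9 checked against it. The sketch below is an illustration under the assumption that strings are encoded as integers; it uses breadth-first search over exclusive-or with SCOLs, which is a standard shortest-path technique and not a method taken from the source. The names `scol_distance` and `check_lemma_4_9` are hypothetical.

```python
from collections import deque


def scol_distance(n):
    """L(w) for every n-bit w, found by breadth-first search from 0."""
    steps = [((1 << l) - 1) << m
             for l in range(1, n + 1, 2)
             for m in range(n - l + 1)]
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for s in steps:
            v = u ^ s
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def check_lemma_4_9(n):
    """Test L(w') >= L(w) + 1 for every nonzero w with top bit m at most n - 2."""
    L = scol_distance(n)
    for w in range(1, 2 ** n):
        m = w.bit_length()            # position of the highest 1, from 1
        if m + 2 > n:
            continue
        w_prime = w ^ (1 << (m + 1))  # set bit position m + 2
        if L[w_prime] - L[w] not in range(1, n + 1):
            return False
    return True


print(scol_distance(4)[0b1011], check_lemma_4_9(5))
```

The search is only practical for small $n$, since it visits all $2^n$ strings.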
Lemma 4.10: $diam(M_{Q(n)}) \ge \lfloor n/2 \rfloor + 1$.
Proof: When $n = 1$ or 2, the truth of the lemma can be easily verified. Let $n > 2$. Let $\gamma_a$ be an arbitrary processor in the MBS $M_{Q(n)}$. Our proof consists of finding another processor $\gamma_b$ which is at a distance of at least $\lfloor n/2 \rfloor + 1$ from $\gamma_a$. For that purpose, we consider two cases.
Case 1: $n$ is even.
Let $\gamma_b$ be the processor in $M_{Q(n)}$ such that $\gamma_b = \gamma_a \oplus (10)^{n/2-1}11$. Since $0^{n-2}11$ is not an SCOL, $L(0^{n-2}11) \ge 2$. Therefore, according to Lemma 4.9, $L(0^{n-4}1011) \ge 2 + 1$. By repeatedly applying Lemma 4.9, we get
$$L((10)^{n/2-1}11) \ge 2 + (n/2 - 1) = n/2 + 1 = \lfloor n/2 \rfloor + 1.$$
Case 2: $n$ is odd.
Let $\gamma_b$ be the processor in $M_{Q(n)}$ such that $\gamma_b = \gamma_a \oplus 0(10)^{(n-1)/2-1}11$. Again, by Lemma 4.9,
$$L(0(10)^{(n-1)/2-1}11) \ge 2 + ((n - 1)/2 - 1) = (n - 1)/2 + 1 = \lfloor n/2 \rfloor + 1.$$
Therefore, in both cases, to construct $\gamma_b$ from $\gamma_a$ at least $\lfloor n/2 \rfloor + 1$ SCOLs are necessary. Hence, according to Lemma 4.6, the distance from $\gamma_a$ to $\gamma_b$ is at least $\lfloor n/2 \rfloor + 1$. Therefore, $diam(M_{Q(n)}) \ge \lfloor n/2 \rfloor + 1$. □
Theorem 4.11: $diam(M_{Q(n)}) = \lfloor n/2 \rfloor + 1$.
Proof: Directly follows from Lemma 4.5 and Lemma 4.10. □
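The diameter formula can be confirmed numerically for small $n$. The sketch below assumes the neighbor relation of Lemma 4.6 (processors differing by an SCOL are neighbors) with processors encoded as integers, and uses breadth-first search from processor 0; by the symmetry of Theorem 4.8 the eccentricity of one processor equals the diameter. The function name `mbs_diameter` is hypothetical.

```python
from collections import deque


def mbs_diameter(n):
    """Diameter of M_Q(n), using 'XOR with an SCOL' as the neighbor relation."""
    steps = [((1 << l) - 1) << m
             for l in range(1, n + 1, 2)
             for m in range(n - l + 1)]
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for s in steps:
            v = u ^ s
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # The MBS is symmetric, so the farthest processor from 0 gives the diameter.
    return max(dist.values())


for n in range(1, 9):
    print(n, mbs_diameter(n), n // 2 + 1)
```

Each printed line shows $n$, the computed diameter, and $\lfloor n/2 \rfloor + 1$; the diameter of the corresponding hypercube $N_{Q(n)}$ is $n$.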
This shows the existence of an MBS which emulates the SIMD hypercube with far superior features. The diameter of the MBS emulating the hypercube is about half that of the hypercube, while the number of ports per processor in the MBS is also about half that of the hypercube.
So far we have shown how to perform an optimal color partition of a vertex symmetric IFG with a non redundant generating set. We have also shown some attractive features of the resulting MBS. Next we consider regular IFGs. As we show, not only vertex symmetric IFGs but also regular IFGs can be optimally color partitioned in polynomial time if certain conditions are satisfied.
4.4 Optimal Color Partition of a Regular IFG
In Section 2.5, we introduced the notion of regular IFGs (see Definition 2.9). It 
is straightforward to see that, in a regular IFG, every vertex has exactly one incoming 
edge from each color (see the proof of Lemma 4.1). Obviously, every vertex symmetric 
IFG is regular, but the converse is not true. If an IFG is regular and belongs to class $\mathcal{W}$,
then we can find an optimal color partition in polynomial time. The following algorithm,
Algorithm 4.1, constructs such a partition.

INPUT : Regular IFG G, belonging to class W, with c colors,
        where V(G) = {v_1, v_2, ..., v_n}.
OUTPUT: An optimal color partition {E_1, E_2, ..., E_n}.
begin
  for j = 1 to n do
  begin
    E_j = { };
    w = v_j;
    for r = 1 to c do
    begin
      if r is odd then
        begin E_j = E_j ∪ {(w, w^in)}; w = w^in; end;
        /* w^in is the vertex which is incident with the
           edge of color r directed away from w. */
      if r is even then
        begin E_j = E_j ∪ {(w^out, w)}; w = w^out; end;
        /* w^out is the vertex which is incident with the
           edge of color r directed towards w. */
    end (r loop);
  end (j loop);
end.
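The pseudocode above can be sketched in Python as follows. This is an illustrative transcription, not the author's implementation. It assumes vertices are numbered 0 to n-1 and that the regular IFG is given as one list per color, where `out_edge[r][v]` (a hypothetical representation) is the head of the edge of color r+1 leaving v. Regularity means each such list is a permutation.

```python
def color_partition(n, out_edge):
    """Return n edge subsets; each edge is a tuple (tail, head, color)."""
    c = len(out_edge)
    # Invert each permutation to look up the tail of the incoming edge of a color.
    in_edge = []
    for perm in out_edge:
        inverse = [None] * n
        for tail, head in enumerate(perm):
            inverse[head] = tail
        in_edge.append(inverse)

    parts = []
    for start in range(n):
        w = start
        subset = []
        for r in range(1, c + 1):
            if r % 2 == 1:
                # Odd color: follow the edge of color r directed away from w.
                nxt = out_edge[r - 1][w]
                subset.append((w, nxt, r))
            else:
                # Even color: follow the edge of color r directed towards w.
                nxt = in_edge[r - 1][w]
                subset.append((nxt, w, r))
            w = nxt
        parts.append(subset)
    return parts
```

For instance, the three-dimensional cube functions can be supplied as `out_edge = [[v ^ (1 << i) for v in range(8)] for i in range(3)]`, in which case every subset contains one edge of each color and induces c + 1 = 4 vertices.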
Theorem 4.12: Algorithm 4.1 outputs an optimal color partition of G.
Proof: Since each vertex of G has exactly one outgoing (incoming) edge belonging to
each color, at the end of the execution of the algorithm, each subset $E_j$ has $c$
edges, one from each color. The $r$th edge in $E_j$ is of color $r$, for $1 \le r \le c$. Furthermore,
according to the algorithm, the edges in the set $E_j$ form an alternately oriented path (since G
is assumed to belong to class $\mathcal{W}$, no alternately oriented cycles exist). Thus, $E_j$ induces
$c + 1$ vertices. Let those vertices be $u_j^1, u_j^2, \ldots, u_j^c$, and $u_j^{c+1}$. The edge with end vertices $u_j^r$
and $u_j^{r+1}$ is of color $r$, for $1 \le r \le c$. The orientation of that edge depends on
whether $r$ is even or odd. According to the algorithm, $u_j^1 = v_j$, for $1 \le j \le n$. In other
words, vertices $u_1^1, u_2^1, \ldots, u_n^1$ are distinct. Each vertex has exactly one incoming edge
and one outgoing edge of color 1. Therefore, vertices $u_1^2, u_2^2, \ldots, u_n^2$ are all distinct.
Similarly, we can show that, for every $r$ ($1 \le r \le c$), vertices $u_1^r, u_2^r, \ldots, u_n^r$ are all
distinct.
We need to prove that the subsets $E_1$ through $E_n$ output by the algorithm are
edge disjoint. Suppose on the contrary that two (distinct) subsets $E_x$ and $E_y$ contain a
common edge. Let $(u_x^k, u_x^{k+1}) \in E_x$ and $(u_y^r, u_y^{r+1}) \in E_y$ represent the same edge in G.
Then they must be of the same color, and therefore, $k = r$. But as we have just shown, $u_x^r$
and $u_y^r$ are distinct vertices. Hence, $(u_x^r, u_x^{r+1})$ and $(u_y^r, u_y^{r+1})$ cannot be the same edge
of G. Thus there are no common edges in $E_x$ and $E_y$. Therefore, subsets $E_1, E_2, \ldots, E_n$ are
disjoint. Clearly, edges in $E_1, E_2, \ldots, E_n$ cover all edges in G. Thus $\{E_1, E_2, \ldots, E_n\}$ is a
color partition of G. Each subset $E_j$ induces an alternately oriented path in G. Therefore,
$|J(E_j)| = c + 1$, $1 \le j \le n$. Thus
$$\sum_{j=1}^{n} |J(E_j)| = n(c + 1) = nc + n = |E(G)| + p(G).$$
Furthermore, G is in class $\mathcal{W}$. Therefore, by Theorem 4.4, $\{E_1, E_2, \ldots, E_n\}$ is an optimal
color partition. □
Each subset $E_j$ of the output of the algorithm induces an alternately oriented path
of length $c$ such that the first vertex is incident with an outgoing edge. Therefore, in the
corresponding MBS, every bus is connected to the same number of drivers and the same
number of receivers. It is also easy to verify that every processor is connected to the same
number of drivers and receivers. Thus the resulting MBS is regular. But other features of
the MBS (such as symmetry and diameter) may not be as attractive as those obtained for
vertex symmetric IFGs.
The time complexity of Algorithm 4.1 can be determined as follows. The outer
loop of Algorithm 4.1 is executed $n$ times, and in each of those iterations the inner loop is
executed $c$ times. Therefore, the time complexity of the above algorithm is $O(nc) = O(|E(G)|)$. To
demonstrate the operation of the algorithm, we provide the following example.
Figure 4.5: A regular IFG G6.
Example 4.5: Consider the regular IFG, denoted G6, shown in Figure 4.5 with eight 
vertices and three colors. Solid, dotted, and broken lines represent edges of colors 1, 2, 
and 3, respectively. It can be easily verified that G6 is not vertex symmetric; therefore, 
it is not a Cayley color graph. Figure 4.6 shows the eight induced subgraphs 
corresponding to the optimal color partition obtained by using Algorithm 4.1. Figure 4.7 
shows the corresponding MBS. Notice that it is regular.
4.5 Optimal Color Partition of $C_\Delta(\Gamma)$ when $\Delta$ is Redundant
So far we have studied Cayley color graphs with non redundant generating sets.
Even though this is the case for almost all the situations of interest, for completeness we
consider how to perform an optimal color partition of $C_\Delta(\Gamma)$ when $\Delta$ is redundant.
The purpose of this section is to briefly outline the difficulties faced when
we try to find an optimal color partition of a Cayley color graph associated with a group
and a redundant generating set.
Figure 4.6: Color partition of G6 by Algorithm 4.1.
In finding an optimal color partition for $C_\Delta(\Gamma)$, we used the non redundant nature
of $\Delta$ to show that $C_\Delta(\Gamma)$ belongs to class $\mathcal{W}$. In fact, if $C_\Delta(\Gamma)$ belongs to class $\mathcal{W}$ then, even
if $\Delta$ is redundant, obtaining an optimal color partition can be done exactly the same way
as the non redundant case. All the results obtained in this chapter will still be valid.
However, when $\Delta$ is redundant, $C_\Delta(\Gamma)$ may not belong to the class $\mathcal{W}$. In such a case, the
color partition $\phi_{\Gamma,\Delta}$ given in Definition 4.5 may not be optimal.
Figure 4.7: MBS corresponding to partition shown in Figure 4.6.
Theorem 4.6 is valid whether the generating set is redundant or not. Thus, what
we need to do is to find only one subset $E_1$ of edges with distinct colors such that $|J(E_1)|$
is minimum. The remaining subsets can be easily determined by using the mapping given
in the proof of Theorem 4.6. When $C_\Delta(\Gamma)$ belongs to class $\mathcal{W}$, the minimum number of
vertices induced by any $|\Delta|$-element subset of distinct color edges is $|\Delta| + 1$. The color
partition $\phi_{\Gamma,\Delta}$ divides the edge set of $C_\Delta(\Gamma)$ into such subsets $E_j$ with the additional
property that $|J(E_j)| = |\Delta| + 1$, $1 \le j \le |\Gamma|$. Recall that, for color partition $\phi_{\Gamma,\Delta}$, edges in each
subset are selected such that the edge of color $r$ and the edge of color $r + 1$ are adjacent
(and compatible). This ordering of edges is not mandatory. Even if we change the
ordering of the colors, the corresponding color partition would still be optimal. But these
facts may not be true when $C_\Delta(\Gamma)$ does not belong to class $\mathcal{W}$. In the following two
examples, we show how the optimality is affected when the Cayley color graph does not
belong to class $\mathcal{W}$.
Figure 4.8: Cayley color graph G7 with redundant $\Delta$.
Example 4.6: Consider the IFG, denoted G7, shown in Figure 4.8. It is the Cayley color
graph associated with the group $\Gamma_{Q(3)}$ = {000, 001, 010, 011, 100, 101, 110, 111} and its
generating set $\Delta_1$ = {001, 010, 100, 111}. We will denote the generators 001, 010, 100,
and 111 by $\delta_1$, $\delta_2$, $\delta_3$, and $\delta_4$, respectively. For clarity, not all edges associated with generator $\delta_4$ are
shown in Figure 4.8. Since every element in $\Delta_1$ is its own inverse, we use
undirected edges to represent bidirectional edges. Note that the IFG G7 shown in Figure
4.8 corresponds to the folded hypercube of dimension three [91]. Figure 4.9 shows the
subgraph $H_{000}$ of G7 induced by subset $E_{000}$ of color partition $\phi_{\Gamma_{Q(3)},\Delta_1}$. Clearly, $H_{000}$
contains $4 = |\Delta_1|$ vertices instead of the $|\Delta| + 1$ obtained when $\Delta$ is non redundant. Furthermore, $|J(E_{000})|$
= 4. One can be easily convinced that the minimum value of $|J(E')|$ for any 4-element
subset $E'$ of distinct color edges of G7 is 4. Therefore, despite the fact that Cayley color
graph G7 does not belong to class $\mathcal{W}$, the color partition $\phi_{\Gamma_{Q(3)},\Delta_1}$ is optimal.
Figure 4.9: Induced subgraph $H_{000}$ of G7.
Figure 4.10: Cayley color graph G8 with redundant $\Delta$.
The above example shows us that even if $C_\Delta(\Gamma)$ does not belong to class $\mathcal{W}$, the
color partition $\phi_{\Gamma,\Delta}$ may be optimal. In the next example, we show a situation where the
optimality of color partition $\phi_{\Gamma,\Delta}$ depends on the order of the generators.
Example 4.7: Consider the generating set $\Delta_2 = \{\delta_1, \delta_2, \delta_3, \delta_4, \delta_5\}$ for group $\Gamma_{Q(3)}$, where
$\delta_1$ = 001, $\delta_2$ = 010, $\delta_3$ = 011, $\delta_4$ = 100, and $\delta_5$ = 111. Figure 4.10 shows the
corresponding Cayley color graph, denoted by G8. For clarity, not all edges associated with
generators $\delta_3$ and $\delta_5$ are shown. Figure 4.11 shows the subgraph $H_{000}$ induced
by subset $E_{000}$ of color partition $\phi_{\Gamma_{Q(3)},\Delta_2}$. Clearly, $|J(E_{000})| = 6$. Now, consider the induced
subgraph $H'_{000}$ shown in Figure 4.12, which was obtained by reordering the generators
as $\delta_1$, $\delta_2$, $\delta_4$, $\delta_5$, and $\delta_3$. From Figure 4.12 it is clear that $|J(H'_{000})| = 5$. Thus, the value
of $|J(E_{000})|$ depends on the order of the generators used in the partition $\phi_{\Gamma_{Q(3)},\Delta_2}$. It is easy
to verify that, for any 5-element subset $E'$ of distinct color edges of an IFG, the minimum
value of $|J(E')|$ is 5. Therefore, color partition $\phi_{\Gamma_{Q(3)},\Delta_2}$ with the generators taken in the
order $\delta_1$, $\delta_2$, $\delta_4$, $\delta_5$, $\delta_3$ is optimal, whereas that with the generators taken in the order $\delta_1$,
$\delta_2$, $\delta_3$, $\delta_4$, $\delta_5$ is not.
Figure 4.11: Induced subgraph $H_{000}$ of G8.
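The counts quoted in Examples 4.6 and 4.7 can be reproduced with a short sketch. It is an illustration only and rests on two assumptions: that the subset of the color partition is built as an alternating walk (odd colors leave the current vertex, even colors enter it), and that the quantity being counted is the number of distinct driver connections plus the number of distinct receiver connections on the corresponding bus. The function name is hypothetical.

```python
def interface_count(start, generators):
    """Drivers plus receivers on the bus built from `start` with this generator order."""
    drivers, receivers = set(), set()
    w = start
    for r, g in enumerate(generators, start=1):
        nxt = w ^ g  # under exclusive-or every generator is its own inverse
        if r % 2 == 1:
            # Odd color: the edge is directed from w to nxt.
            drivers.add(w)
            receivers.add(nxt)
        else:
            # Even color: the edge is directed from nxt to w.
            drivers.add(nxt)
            receivers.add(w)
        w = nxt
    return len(drivers) + len(receivers)


original_order = [0b001, 0b010, 0b011, 0b100, 0b111]
reordered = [0b001, 0b010, 0b100, 0b111, 0b011]
print(interface_count(0, original_order), interface_count(0, reordered))  # prints "6 5"
```

With the four generators of Example 4.6 in their stated order, the same function returns 4.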
When $C_\Delta(\Gamma)$ does not belong to class $\mathcal{W}$, the order of the generators resulting in an
optimal color partition $\phi_{\Gamma,\Delta}$ would not be obvious. In fact, a $|\Delta|$-element subset $E_1$ of
distinct color edges with minimum $|J(E_1)|$ may not correspond to color partition $\phi_{\Gamma,\Delta}$
irrespective of the ordering of the generators used. When the size of $C_\Delta(\Gamma)$ is not very
large, the best $E_1$ can always be found by trial and error. Usually, $|\Delta|$ is much smaller than
$|\Gamma|$. Therefore, utilizing an exhaustive search method to find $E_1$ with minimum $|J(E_1)|$
would not be time consuming. Finding a subset $E_1$ with minimum $|J(E_1)|$ is beyond the
scope of this dissertation.
Figure 4.12: Induced subgraph $H'_{000}$ of G8.
CHAPTER 5 
FAULT TOLERANCE
One of the key issues of a parallel processing system is its fault tolerance. Fault 
tolerance of a parallel processing system reflects its ability to function, possibly with 
degraded performance, under the failure of certain components. The likelihood of one or 
more components failing in a parallel processor system increases as the number of 
components increases. Therefore, fault tolerance is one of the richly addressed subjects 
in the area of parallel processing [4], [6], [7], [114], [117], [129]. Some systems such as
the hypercube [119], star [5], mesh [13], and conventional multiple bus systems [109] are
inherently fault tolerant. They can continue functioning as smaller networks even when
certain links and nodes have failed, provided that the network can detect and isolate the
fault. Some other systems such as the linear array [140] and the generalized cube network
[128] are not fault tolerant. However, they can be made fault tolerant by adding some
redundant components. For example, the generalized cube network can be made fault
tolerant by adding an extra stage [1], [4].
In this chapter, we study the fault tolerance of the MBSs corresponding to vertex
symmetric IFGs. That is, we study the fault tolerance of $M(\Gamma, \Delta)$ for a group $\Gamma$ and its
generating set $\Delta$, where $\Delta$ is assumed to be non redundant. We will analyze the behavior
of $M(\Gamma, \Delta)$ in the case of single bus failure, single interface failure, and single processor
failure. Even though the analysis carried out in this chapter can be extended to multiple
component failures, we do not attempt to do so in this dissertation. If an MBS can execute
its source algorithm (with performance degradation) although a fault exists, we say that
the MBS sustains that fault. On the other hand, if the MBS cannot execute its source
algorithm in the case of a fault, then we say that the MBS does not sustain the fault. Our
design criteria were directed towards constructing a minimal cost MBS realizing a given
source algorithm. An MBS designed under such criteria will always suffer some
performance degradation when a component fails, if it sustains the failure. Otherwise, one
could have constructed the MBS without using the failed component.
We have already seen many attractive features of the MBS $M(\Gamma, \Delta)$. The fault
tolerance capabilities of $M(\Gamma, \Delta)$ will further enhance the applicability of the MBS as
an algorithmically specialized parallel architecture. It is quite interesting to notice that
although $M(\Gamma, \Delta)$ is the minimal component MBS which can realize the given IFG, it still
has certain fault tolerant capabilities. In this chapter we analyze under what conditions
the MBS $M(\Gamma, \Delta)$ sustains a single bus failure, a single interface failure, or a single
processor failure. We specifically show the following in this chapter.
(1) $M(\Gamma, \Delta)$ can sustain a single bus failure if and only if $|\Delta| > 2$.
(2) $M(\Gamma, \Delta)$ can sustain a single driver failure if and only if $|\Delta| > 1$.
(3) $M(\Gamma, \Delta)$ can sustain a single receiver failure if and only if $|\Delta| > 2$.
(4) $M(\Gamma, \Delta)$ can sustain a single processor failure if and only if $|\Delta| > 1$.
Clearly, for most practical cases, $|\Delta| > 1$ and thus the MBS will normally sustain
any of the above failures. Also, when the MBS sustains the above faults, we address the
issue of performance degradation under each fault condition. Furthermore, when $M(\Gamma, \Delta)$




We assume that a faulty component can be isolated, located, and disconnected.
This implies that the effect of the fault can be isolated. Fault tolerance of a multiple bus
system depends on the connectivity of the faulty MBS with respect to the non faulty MBS.
Suppose that the original source algorithm requires direct data transfer from processor $P_1$
to processor $P_2$. Then the MBS realizing the algorithm has a direct path from $P_1$ to $P_2$.
Suppose after a component failure, the MBS does not contain a direct path from $P_1$ to $P_2$
(assuming that $P_1$ and $P_2$ are not faulty) but there is an indirect path from $P_1$ to $P_2$. In that
case data transfer from $P_1$ to $P_2$ may take several steps as opposed to the single step taken
by the healthy MBS. This is an instance of sustaining the fault with performance
degradation. But if the faulty MBS does not contain a path from $P_1$ to $P_2$, the communications
required by the source algorithm cannot be performed on the faulty machine, even
if performance degradation is acceptable. This is an instance where the MBS does not
sustain a fault. As we see throughout the chapter, fault tolerance of $M(\Gamma, \Delta)$ depends on
the connectivity property of the Cayley color graph $C_\Delta(\Gamma)$.
Definition 5.1: An MBS is strongly connected if there is a path from each processor to
every other processor.
Lemma 5.1: Multiple bus system $M(\Gamma, \Delta)$ is strongly connected.
Proof: According to the construction of $M(\Gamma, \Delta)$, for every edge $(u, v)$ in $C_\Delta(\Gamma)$ there is
a direct path from processor $u$ to processor $v$ in $M(\Gamma, \Delta)$. According to Theorem 4.1,
$C_\Delta(\Gamma)$ is strongly connected. Therefore, there is a path from every processor to every other
processor in the MBS $M(\Gamma, \Delta)$. □
The fault tolerance of $M(\Gamma, \Delta)$ in the presence of an interface failure or a bus failure is
based on the above lemma. We check the strong connectivity of the MBS after a
component failure. If the MBS with a failed component is also strongly connected, then
communications among processors are still possible but with some performance
degradation. As we see later, fault tolerance of $M(\Gamma, \Delta)$ in case of a processor failure also
depends on the strong connectivity of the faulty MBS. In the following sections we will
discuss each component failure separately. Before studying fault tolerance of $M(\Gamma, \Delta)$
we will present an important result concerning the connectivity of a Cayley color graph.
To that end, a few concepts from group theory are necessary [131], [113]. For
completeness, we state them next.
Definition 5.2: Let $\Lambda$ be a subset of the elements of group $\Gamma$. Let $\gamma$ be any element in $\Gamma$.
Then $\gamma\Lambda$ is defined as the set $\{\gamma a : a \in \Lambda\}$ [113].
Definition 5.3: Let $\Lambda$ be a subgroup of $\Gamma$. Then $\gamma\Lambda$ is said to be the left coset of $\Gamma$
generated by $\gamma$ (or, containing $\gamma$) relative to $\Lambda$ [113].
Similarly, we can define the right coset $\Lambda\gamma$. However, in this dissertation, we only use
left cosets. Therefore, unless otherwise stated, the word "coset" must be interpreted as a
left coset.
Lemma 5.2: Any two cosets of a group are either identical or else have no element in
common.
Proof: Let $\Lambda$ be a subgroup of group $\Gamma$. Let $\gamma_1\Lambda$ and $\gamma_2\Lambda$ be two cosets of $\Gamma$. Suppose
there exists an element, say $\gamma$, common to both $\gamma_1\Lambda$ and $\gamma_2\Lambda$. Then there is an element $\lambda_1 \in \Lambda$
such that $\gamma = \gamma_1\lambda_1$. Also, there is an element $\lambda_2 \in \Lambda$ such that $\gamma = \gamma_2\lambda_2$. Hence, $\gamma_1\lambda_1
= \gamma_2\lambda_2$, that is, $\gamma_1 = \gamma_2\lambda_2\lambda_1^{-1}$. Therefore, $\gamma_1\Lambda = \gamma_2\lambda_2\lambda_1^{-1}\Lambda = \gamma_2\Lambda$. This proves the lemma. □
Corollary: A family of cosets of a group is a partition on the elements of the group.
Consequently, a family of cosets of a group $\Gamma$ induces a partition on the vertex set
of the Cayley color graph associated with $\Gamma$ and its generating set. In general, partitions
corresponding to right and left cosets are not the same.
Definition 5.4: For a subgroup $\Lambda$ of $\Gamma$, if $\gamma\Lambda = \Lambda\gamma$ for every $\gamma \in \Gamma$, then $\Lambda$ is a normal
subgroup of $\Gamma$.
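The coset facts above can be illustrated on a small non-abelian group. The following sketch is only an example chosen for illustration (the symmetric group on three symbols does not appear in the original text). Permutations are stored as tuples, and the helper names are hypothetical.

```python
from itertools import permutations


def compose(p, q):
    """Return the permutation i -> p[q[i]]."""
    return tuple(p[i] for i in q)


def left_cosets(group, subgroup):
    """Family of sets gH for g in the group."""
    return {frozenset(compose(g, h) for h in subgroup) for g in group}


def right_cosets(group, subgroup):
    """Family of sets Hg for g in the group."""
    return {frozenset(compose(h, g) for h in subgroup) for g in group}


S3 = list(permutations(range(3)))
H = [(0, 1, 2), (1, 0, 2)]             # subgroup generated by one transposition
A3 = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]  # subgroup of the cyclic shifts
```

Here `left_cosets(S3, H)` consists of three pairwise disjoint sets covering all six elements, as the corollary states, and it differs from `right_cosets(S3, H)`. For `A3` the two families coincide, so `A3` is normal in the sense of Definition 5.4.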
Definition 5.5: Let $\Gamma$ be a finite group and $\Delta$ be a (non redundant) generating set for $\Gamma$.
Let $\Lambda$ be a subgroup of $\Gamma$. Then the subgraph of $C_\Delta(\Gamma)$ induced by a left coset of $\Gamma$
relative to $\Lambda$ will be called a coset subgraph of $C_\Delta(\Gamma)$ relative to $\Lambda$.
When there is no ambiguity as to the subgroup $\Lambda$ under consideration, we will
refer to a coset subgraph without referring to $\Lambda$. In such a case, the family of coset
subgraphs of a Cayley color graph will be denoted by $CS_1$, $CS_2$, ..., etc. It is clear that
every coset subgraph $CS_i$ is a Cayley color graph by itself. Theorem 5.1 (to be introduced)
provides an important result regarding the connectivity of $C_\Delta(\Gamma)$. To simplify its proof,
we give the following algebraic result.
Lemma 5.3: Let $x$ and $k$ be integers such that $k > x > 0$. Then there exists an integer $m$
such that $mx \bmod k \le k - x + 1$.
Proof: Let $m = \lceil (k - 1)/(k - x) \rceil - 1$. Then, we have the following two inequalities.
$$(m + 1)(k - x) \ge k - 1 \qquad (1)$$
$$m(k - x) < k - 1 \qquad (2)$$
From (2), we have $mk - mx < k - 1$. Therefore,
$$mx - (m - 1)k > 0 \qquad (3)$$
Also, from (1), we have $mk - mx - x \ge -1$. Therefore, by combining with (3), we have
$$0 < mx - (m - 1)k \le k - x + 1 \qquad (4)$$
Since $mx$ and $mx - (m - 1)k$ differ by a multiple of $k$, $mx \bmod k = (mx - (m - 1)k) \bmod k$. Inequality (4) implies
that $(mx - (m - 1)k) \bmod k \le k - x + 1$. Therefore the required result follows. □
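The choice of m in the proof can be checked exhaustively for small values. The sketch below is an illustration only, with a hypothetical function name. It computes the m of the proof in integer arithmetic and returns the quantity bounded in inequality (4).

```python
def lemma_5_3_witness(k, x):
    """Return (m, r) where m = ceil((k-1)/(k-x)) - 1 and r = m*x - (m-1)*k."""
    m = -(-(k - 1) // (k - x)) - 1  # ceiling division without floating point
    r = m * x - (m - 1) * k
    return m, r


for k in range(2, 30):
    for x in range(1, k):
        m, r = lemma_5_3_witness(k, x)
        # Inequality (4) of the proof.
        assert 0 < r <= k - x + 1
```

For example, with k = 3 and x = 2 the proof's choice is m = 1 and the bound in (4) holds with equality.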
Theorem 5.1: The graph obtained from $C_\Delta(\Gamma)$ by deleting a single edge is strongly
connected unless $|\Delta| = 1$.
Proof: If $|\Delta| = 1$, then $C_\Delta(\Gamma)$ is a (directed) cycle with $|\Gamma|$ vertices. Deletion of any edge
breaks the cycle and the remaining graph is therefore not strongly connected. Now
suppose $|\Delta| \ge 2$. Assume the deletion of edge $(e, \delta_1)$, where $\delta_1 \in \Delta$. We need only to
prove that there exists a directed path from $e$ to $\delta_1$ in $C_\Delta(\Gamma)$ which does not use edge
$(e, \delta_1)$. Let $\Lambda$ be the subgroup of $\Gamma$ generated by $\{\delta_1\}$. Let $CS_1$ be the coset subgraph of
$C_\Delta(\Gamma)$ (relative to $\Lambda$ and $\{\delta_1\}$) which contains vertex $e$. Since $\Gamma$ is a finite group, $\delta_1$ is of
finite order, say, $k$. Therefore, $CS_1$ is a cycle of size $k$ consisting of only color 1 edges.
It is also clear that $CS_1$ contains edge $(e, \delta_1)$. First we prove the following claim.
Claim: There exists a vertex $\gamma$ in $CS_1$ which can be reached from $e$ without using
any edge from $CS_1$.
Proof. Since $|\Delta| \ge 2$, there exists another generator, say $\delta_2 \in \Delta$. Let $a = \delta_1^{-1}$ and
$b = \delta_2$. Then there is an edge from $a$ to $e$ of color 1 and an edge from $e$ to $b$ of
color 2. See Figure 5.1. If we start from $e$ and traverse along edges of colors 2,
1, 2, 1, ..., in that order, we would come back to $e$. This is true because there is
a finite integer $g$ such that $(\delta_2\delta_1)^g = e$. Let $\gamma$ be the first vertex in $CS_1$ encountered
by the above traversal of edges. Then $\gamma \ne e$ since edge $(a, e)$ must be included in
the traversal immediately before it returns to $e$, and $a$ is a vertex of $CS_1$.
Figure 5.1: A path from $e$ to $\gamma$ which does not use any edge from $CS_1$.
Since $\gamma$ is in $CS_1$, we have $\gamma = \delta_1^x$ for some integer $x < k$. Therefore, there is a
jump of length $x$ along $CS_1$ from $e$ to $\gamma$. First suppose that there exists an integer $l$ such
that $lx = k + 1$. Then we can move from $e$ to $\delta_1$ by using $l$ jumps. Thus, in that case,
there exists a path from $e$ to $\delta_1$ that does not include edge $(e, \delta_1)$. Now suppose that such
$l$ does not exist. Then, let $m$ be the least integer such that $mx \bmod k \le k - x + 1$. By
Lemma 5.3, such an integer exists. Let $v = \delta_1^{mx}$ and $w = \delta_1^{k-x+1}$. Then $v$ can be reached
from $e$ by $m$ jumps. See Figure 5.2. Since our choice of $m$ guarantees that $mx \bmod k \le
k - x + 1$, $w$ can be reached from $v$ along $CS_1$ without using edge $(e, \delta_1)$. Furthermore,
$\delta_1$ can be reached from $w$ by a single jump. Therefore, there is a directed path from $e$ to
$\delta_1$ which does not use edge $(e, \delta_1)$. □
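Theorem 5.1 can be checked by brute force on small groups. The sketch below is an illustration under stated assumptions: the groups (integers modulo 6 and modulo 5 under addition, and 3-bit strings under exclusive-or) and their generating sets are examples chosen here, not taken from the text, and the function names are hypothetical.

```python
def cayley_edges(elements, generators, op):
    """Directed edges (g, g*d) of the Cayley color graph, colors omitted."""
    return [(g, op(g, d)) for g in elements for d in generators]


def strongly_connected(vertices, edges):
    """True if every vertex reaches, and is reached from, the first vertex."""
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)

    def reaches_all(adjacency):
        seen = {vertices[0]}
        stack = [vertices[0]]
        while stack:
            u = stack.pop()
            for v in adjacency.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == len(vertices)

    return reaches_all(forward) and reaches_all(backward)


def survives_every_single_edge_deletion(elements, generators, op):
    """Delete each edge in turn and test strong connectivity of what remains."""
    edges = cayley_edges(elements, generators, op)
    return all(strongly_connected(elements, edges[:i] + edges[i + 1:])
               for i in range(len(edges)))
```

With two generators, for example `survives_every_single_edge_deletion(list(range(6)), [2, 3], lambda a, b: (a + b) % 6)`, the check succeeds. With the single generator 1 of the integers modulo 5 it fails, matching the exceptional case of the theorem.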
5.2 Failure of a Bus
In this section we analyze the behavior of $M(\Gamma, \Delta)$ when a single bus fails. Failure
of a bus will remove direct paths for some communicating pairs of processors. According
to the way $M(\Gamma, \Delta)$ was constructed (see Section 4.2), bus $B_j$ is assigned the data transfers
corresponding to the set of edges $E_j$, $1 \le j \le |\Gamma|$, where
$$E_j = \{(\gamma_j, \gamma_j h_1), (\gamma_j h_1, \gamma_j h_1 h_2), \ldots, (\gamma_j h_1 h_2 \cdots h_{c-1}, \gamma_j h_1 h_2 \cdots h_c)\},$$
with $c = |\Delta|$, $h_i = \delta_i$ for odd $i$, and $h_i = \delta_i^{-1}$ for even $i$, $1 \le i \le |\Delta|$.
Figure 5.3 shows the set of processors connected to bus $B_j$, where $|\Delta|$ is assumed to be 5.
Failure of bus $B_j$ affects the direct communication among processors $\gamma_j$, $\gamma_j\delta_1$, $\gamma_j\delta_1\delta_2^{-1}$, ...,
$\gamma_j\delta_1\delta_2^{-1}\delta_3\delta_4^{-1}\delta_5$. As for the Cayley color graph, failure of $B_j$ corresponds to the removal of
edges belonging to $E_j$ from $C_\Delta(\Gamma)$. The following theorem provides yet another attractive
feature of the MBS $M(\Gamma, \Delta)$.
Figure 5.2: A path from $e$ to $\delta_1$ that does not include edge $(e, \delta_1)$.
Theorem 5.2: The MBS $M(\Gamma, \Delta)$ can sustain a single bus failure if and only if $|\Delta| > 2$.
Proof: First suppose $|\Delta| \le 2$. In this situation each processor $P$ is connected with only one
receiver (see Theorem 4.9). Failure of the bus connected to that receiver will destroy all
input paths to processor $P$.
Figure 5.3: Processors connected to bus $B_j$ in $M(\Gamma, \Delta)$.
Now suppose that $|\Delta| > 2$. Assume that bus $B_j$ in $M(\Gamma, \Delta)$ fails. Figure 5.4 shows
the induced subgraph $H_j$, where $v_i$ denotes the $i$th vertex of $H_j$. The set of processors
among which communication may be in jeopardy due to the failure of bus $B_j$ is
represented by $V(H_j)$. To prove the theorem, it is sufficient to show the existence of a
path in $C_\Delta(\Gamma)$ from $v_i$ to $v_{i'}$ which does not use any of the edges in $H_j$. Here $i$ is an even
integer in the range $0 \le i \le |\Delta|$ and $i'$ is an integer one less than or one more than $i$ in
the range $0 \le i' \le |\Delta|$.
Figure 5.4: Induced subgraph $H_j$.
We will use $\Lambda_{st}$ to denote the subgroup of $\Gamma$ generated by $\{\delta_s, \delta_t\}$ for $\delta_s, \delta_t \in \Delta$.
Denote by $CS_{13}$ the coset subgraph of $C_\Delta(\Gamma)$ corresponding to the left coset of $\Gamma$
associated with subgroup $\Lambda_{13}$ containing vertex $v_0 = \gamma_j$. It is clear that $CS_{13}$ is a Cayley
color graph with two colors. We claim that there is a path from $v_0$ to $v_1$ in $C_\Delta(\Gamma)$ which
does not use any of the edges in $H_j$. It is clear that edge $(v_0, v_1)$ of color 1 is in the
subgraph $CS_{13}$ (see Figure 5.4). Therefore, according to Theorem 5.1, there exists a path
from $v_0$ to $v_1$ in $CS_{13}$ which does not use edge $(v_0, v_1)$. Suppose that the color 3 edge $(v_2, v_3)$ is in that
path. Then both vertices $v_1$ and $v_2$ are in $CS_{13}$. Therefore, since $v_1 = v_2\delta_2$, generator $\delta_2$
could be expressed in terms of generators $\delta_1$ and $\delta_3$. This is not possible since we assume
that $\Delta$ is a non redundant generating set. Therefore, there is a path from $v_0$ to $v_1$ in $CS_{13}$
which does not use $(v_0, v_1)$ or $(v_2, v_3)$. Thus the claim is true and therefore, there exists
a path from $v_0$ to $v_1$ which does not use any of the edges in $H_j$.
Now, let $CS_{12}$ be the subgraph of $C_\Delta(\Gamma)$ corresponding to the left coset of $\Gamma$
associated with subgroup $\Lambda_{12}$ containing $v_2$. Then, according to Theorem 5.1, there is a
path, say $\mathcal{P}$, from $v_2$ to $v_1$ in $CS_{12}$ which does not use the color 2 edge $(v_2, v_1)$. First
suppose that $\mathcal{P}$ does not contain edge $(v_0, v_1)$. Then $\mathcal{P}$ is a path from $v_2$ to $v_1$ which does
not contain any edge from $H_j$. Next suppose $\mathcal{P}$ contains edge $(v_0, v_1)$. We have already
shown that there exists a path from $v_0$ to $v_1$ that does not use any edge from $H_j$. Hence,
there exists a path from $v_2$ to $v_1$ that does not use any edge from $H_j$. Similar to the case
for the path from $v_0$ to $v_1$, we can show that there exists a path from $v_2$ to $v_3$ that does not
use any edge from $H_j$. We can continue this argument to prove that for every even $i$, $0
\le i \le |\Delta|$, there exist a path from $v_i$ to $v_{i+1}$ and a path from $v_i$ to $v_{i-1}$ that do not use any
edge from $H_j$. Therefore, $M(\Gamma, \Delta)$ remains strongly connected even after the failure of
bus $B_j$. □
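The statement of Theorem 5.2 can be tried out on the cube functions. The sketch below is an illustration only. It assumes the group of n-bit strings under exclusive-or with the single-bit generators, models the failed bus by removing the alternately oriented edges of its subset, and then tests strong connectivity. The function names are hypothetical.

```python
def bus_edges(start, generators):
    """Edges assigned to one bus: odd colors leave the current vertex, even colors enter it."""
    edges, w = set(), start
    for r, g in enumerate(generators, start=1):
        nxt = w ^ g
        edges.add((w, nxt) if r % 2 == 1 else (nxt, w))
        w = nxt
    return edges


def sustains_bus_failure(n_bits, generators, failed_start=0):
    """Remove the edges of the failed bus and test strong connectivity."""
    vertices = range(1 << n_bits)
    dead = bus_edges(failed_start, generators)
    succ = {u: [u ^ g for g in generators if (u, u ^ g) not in dead] for u in vertices}
    pred = {u: [] for u in vertices}
    for u, targets in succ.items():
        for v in targets:
            pred[v].append(u)

    def covers(adjacency):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == 1 << n_bits

    return covers(succ) and covers(pred)
```

With three generators, `sustains_bus_failure(3, [1, 2, 4])` holds, whereas with two generators, `sustains_bus_failure(2, [1, 2])` does not, since one processor loses its only receiver.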
Therefore, $M(\Gamma, \Delta)$ can function with some performance degradation in the case
of a single bus failure, provided that $|\Delta| > 2$. Performance degradation of a faulty MBS
depends on how many buses are needed to send data from processor $P_1$ to processor $P_2$,
when the direct path from $P_1$ to $P_2$ is destroyed. Since $M(\Gamma, \Delta)$ is symmetric, performance
degradation is the same irrespective of the identity of the failed bus. Unfortunately, there
is no general formula to find the distance from $P_1$ to $P_2$ when the bus associated with the
direct path from $P_1$ to $P_2$ is destroyed. However, we will obtain a measure of performance
degradation when $M(\Gamma, \Delta)$ is the optimal MBS realizing cube interconnection functions.
5.3 Performance Degradation of $M_{Q(n)}$ due to a Bus Failure
Theorem 5.3: When a single bus fails in $M_{Q(n)}$ ($n > 2$), a processor connected to the
faulty bus via a driver can send data to a processor connected to the same bus via a
receiver, using three non faulty buses (i.e., via a path of length 3).
Proof: As we have already shown, vertices of the IFG $G_{Q(n)}$ can be labeled by binary
strings of length $n$. Furthermore, buses can also be assigned the same labels. Suppose that
bus $\gamma_c$ fails. Consider two processors $\gamma_a$ and $\gamma_b$ connected to bus $\gamma_c$ via a driver and a
receiver, respectively.
There is a direct path from processor $\gamma_a$ to processor $\gamma_b$ if and only if $\gamma_b = \gamma_a \oplus s$,
for some SCOL $s$ (see Lemma 4.6). By definition, the exclusive-or summation of two
SCOLs cannot be an SCOL. Therefore, there cannot exist a path of length two from $\gamma_a$ to
$\gamma_b$. Also, according to Lemma 4.4, there can be at most one direct path in $M(\Gamma, \Delta)$ from
one processor to another. Therefore, if bus $\gamma_c$ fails, at least three buses are required to
send data from $\gamma_a$ to $\gamma_b$. We will next show that this is actually possible.
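Since processor $\gamma_a$ is connected to bus $\gamma_c$ via a driver, $\gamma_a = \gamma_c \oplus 0^{n-\alpha}1^{\alpha}$, for an even
integer $\alpha$. Also, since processor $\gamma_b$ is connected to bus $\gamma_c$ via a receiver, $\gamma_b = \gamma_c \oplus 0^{n-\beta}1^{\beta}$,
for an odd integer $\beta$. Therefore, $s = 0^{n-\alpha}1^{\alpha} \oplus 0^{n-\beta}1^{\beta}$. Without loss of generality, assume
that $\alpha > \beta$. Then $\alpha \ge 2$. Consider bus $\gamma_c^1$ given by
$$\gamma_c^1 = \gamma_a \qquad (1)$$
It is clear that processor $\gamma_a$ is connected to bus $\gamma_c^1$ via a driver. Let
$$\gamma_b^1 = \gamma_c^1 \oplus 0^{n-1}1.$$
Now, processor $\gamma_b^1$ is connected to bus $\gamma_c^1$ via a receiver. Thus, there is a direct path
from processor $\gamma_a$ to processor $\gamma_b^1$ via bus $\gamma_c^1$. Next consider bus $\gamma_c^2$ given by
$$\gamma_c^2 = \gamma_b^1 \qquad (2)$$
Thus, processor $\gamma_b^1$ is connected to bus $\gamma_c^2$ via a driver. Let
$$\gamma_b^2 = \gamma_c^2 \oplus 0^{n-\beta}1^{\beta}.$$
Since $\beta$ is odd, processor $\gamma_b^2$ is connected to bus $\gamma_c^2$ via a receiver. Thus, there is a direct
path from processor $\gamma_b^1$ to processor $\gamma_b^2$ via bus $\gamma_c^2$. Finally, consider bus $\gamma_c^3$ given by
$$\gamma_c^3 = \gamma_b^2 \oplus 0^{n-\alpha}1^{\alpha} \qquad (3)$$
Since $\alpha$ is even, processor $\gamma_b^2$ is connected to bus $\gamma_c^3$ via a driver. Let
$$\gamma_b^3 = \gamma_c^3 \oplus 0^{n-1}1.$$
Then, processor $\gamma_b^3$ is connected to bus $\gamma_c^3$ via a receiver. Thus, there is a direct path
from processor $\gamma_b^2$ to processor $\gamma_b^3$ via bus $\gamma_c^3$. Hence, there is a path of length 3 from
processor $\gamma_a$ to processor $\gamma_b^3$ via the three buses $\gamma_c^1$, $\gamma_c^2$, and $\gamma_c^3$. Now,
$$\begin{aligned}
\gamma_b^3 &= \gamma_c^3 \oplus 0^{n-1}1 \\
&= \gamma_b^2 \oplus 0^{n-\alpha}1^{\alpha} \oplus 0^{n-1}1 \\
&= \gamma_c^2 \oplus 0^{n-\beta}1^{\beta} \oplus 0^{n-\alpha}1^{\alpha} \oplus 0^{n-1}1 \\
&= \gamma_b^1 \oplus 0^{n-\beta}1^{\beta} \oplus 0^{n-\alpha}1^{\alpha} \oplus 0^{n-1}1 \\
&= \gamma_c^1 \oplus 0^{n-1}1 \oplus 0^{n-\beta}1^{\beta} \oplus 0^{n-\alpha}1^{\alpha} \oplus 0^{n-1}1 \\
&= \gamma_c^1 \oplus 0^{n-\beta}1^{\beta} \oplus 0^{n-\alpha}1^{\alpha} \\
&= \gamma_a \oplus 0^{n-\beta}1^{\beta} \oplus 0^{n-\alpha}1^{\alpha} \\
&= \gamma_b.
\end{aligned}$$
Therefore, there is a path of length 3 from $\gamma_a$ to $\gamma_b$ which uses the three buses $\gamma_c^1$, $\gamma_c^2$,
and $\gamma_c^3$. It remains to show that none of those buses is $\gamma_c$ (the faulty one). Since $\gamma_c = \gamma_a
\oplus 0^{n-\alpha}1^{\alpha}$ (we assume $\alpha > 0$), according to (1) $\gamma_c^1 \ne \gamma_c$. Also, from (2), $\gamma_c^2 = \gamma_b^1 = \gamma_c^1 \oplus
0^{n-1}1 = \gamma_a \oplus 0^{n-1}1$. Clearly $\gamma_c^2 \ne \gamma_c$ since $\alpha \ne 1$. Furthermore, from (3),
$$\begin{aligned}
\gamma_c^3 &= \gamma_b^2 \oplus 0^{n-\alpha}1^{\alpha} \\
&= \gamma_c^2 \oplus 0^{n-\beta}1^{\beta} \oplus 0^{n-\alpha}1^{\alpha} \\
&= \gamma_a \oplus 0^{n-1}1 \oplus 0^{n-\beta}1^{\beta} \oplus 0^{n-\alpha}1^{\alpha}.
\end{aligned}$$
Now, $0^{n-1}1 \oplus 0^{n-\beta}1^{\beta}$ cannot be $0^n$ since $\beta \ne 1$. Therefore, $\gamma_c^3 \ne \gamma_c$. So, there is a path of
length 3 from $\gamma_a$ to $\gamma_b$ which uses three non faulty buses. The same result can be reached
if we assume $\alpha < \beta$. □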
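The length of the detour can also be found by a direct search. The sketch below is an illustration, not the construction used in the proof. It assumes, following the proof, that a processor drives bus c exactly when its label is c XOR 0^(n-a)1^a for an even a, and receives from bus c exactly when its label is c XOR 0^(n-b)1^b for an odd b. The function names are hypothetical.

```python
from collections import deque


def low_ones(i):
    """Integer whose binary form is i consecutive ones in the low-order bits."""
    return (1 << i) - 1


def detour_length(n, faulty_bus, source, target):
    """Fewest non faulty buses needed to go from source to target, or None."""
    even_masks = [low_ones(a) for a in range(0, n + 1, 2)]
    odd_masks = [low_ones(b) for b in range(1, n + 1, 2)]
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for drive in even_masks:
            bus = u ^ drive            # a bus that processor u can drive
            if bus == faulty_bus:
                continue
            for receive in odd_masks:
                v = bus ^ receive      # a processor receiving from that bus
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return None
```

Under these assumptions, for n = 3, 4, 5 and a failed bus labeled 0, every driver and receiver pair of the failed bus is at detour length 3. For n = 2 the receiver becomes unreachable, consistent with Theorem 5.2.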
To preserve possible precedence relations in the source algorithm, we should finish
data transfers corresponding to a particular interconnection function before starting data
transfers corresponding to another interconnection function. To perform the communication
corresponding to an interconnection function, four steps are necessary for the faulty
MBS as opposed to the single step necessary for the non faulty MBS. In the case of a bus
failure, one step is required for source processors not using the faulty bus to communicate.
Three more steps are necessary for the processor which could not communicate
because of the faulty bus to reroute the data to its target processor. Therefore, the
communication time will increase by a factor of at most four due to a single bus failure.
5.4 Failure of an Interface
In this section we will analyze the fault tolerance of $M(\Gamma, \Delta)$ in the case of a
single interface failure. Clearly, a bus failure is more serious than an interface failure, in
general. Failure of a bus is equivalent to the failure of all the interfaces connected to it,
according to our fault model. Therefore, we would expect that $M(\Gamma, \Delta)$ is more tolerant
to an interface failure than to a bus failure. But, as we shall see shortly, the above
statement is only partially true. Due to the specific way $M(\Gamma, \Delta)$ was constructed, fault
tolerance of $M(\Gamma, \Delta)$ with respect to a driver failure and a receiver failure are not
identical, in general.
Theorem 5.4: The MBS $M(\Gamma, \Delta)$ sustains any single driver failure if and only if $|\Delta| > 1$.
Furthermore, it sustains any single receiver failure if and only if $|\Delta| > 2$.
Proof: According to Theorem 4.9, the number of drivers and receivers connected to a
processor in $M(\Gamma, \Delta)$ are $\lfloor |\Delta|/2 \rfloor + 1$ and $\lceil |\Delta|/2 \rceil$, respectively. When $|\Delta| = 1$, that is,
when there is only one generator in the Cayley color graph, each processor has only one
driver and one receiver. Thus $M(\Gamma, \Delta)$ does not sustain a driver failure or a receiver
failure when $|\Delta| = 1$. When $|\Delta| = 2$, that is, when there are two generators, each processor
is connected with 2 drivers and 1 receiver. Therefore, $M(\Gamma, \Delta)$ does not sustain a receiver
failure when $|\Delta| = 2$. We will next show that $M(\Gamma, \Delta)$ sustains a driver failure when
$|\Delta| = 2$.
When $|\Delta| = 2$, let $\delta_1$ and $\delta_2$ be the two generators in $\Delta$. Then, bus $\gamma_j$ is connected
to processors $\gamma_j$ and $\gamma_j\delta_1\delta_2^{-1}$ through drivers and to processor $\gamma_j\delta_1$ through a receiver (see
Figure 5.3). Suppose that the driver connecting processor $\gamma_j$ to bus $\gamma_j$ fails. The only direct
path which uses the faulty driver is the one associated with edge $(\gamma_j, \gamma_j\delta_1)$. According to
Theorem 5.1, there is a path from $\gamma_j$ to $\gamma_j\delta_1$ in $C_\Delta(\Gamma)$ which does not use edge $(\gamma_j, \gamma_j\delta_1)$.
Therefore, there is a path from processor $\gamma_j$ to processor $\gamma_j\delta_1$ which does not require the
faulty driver. Similar reasoning applies if the driver connecting bus $\gamma_j$ to processor $\gamma_j\delta_1\delta_2^{-1}$
fails. It remains to show that, when $|\Delta| > 2$, $M(\Gamma, \Delta)$ sustains a driver or a receiver fault.
This result directly follows from Theorem 5.2. □
Therefore, even though a bus failure appears to be more serious than an interface
failure, the degree of fault tolerance of $M(\Gamma, \Delta)$ with respect to both is the same with only
one minor exception. Even though $M(\Gamma, \Delta)$ cannot sustain a bus failure when $|\Delta| = 2$, it
can sustain a driver failure. Similar to the faulty bus case, there is no general formula to
find the distance from $P_1$ to $P_2$ when the driver connected to $P_1$ or the receiver connected
to $P_2$ associated with the direct path from $P_1$ to $P_2$ is faulty. Therefore, a measure of
performance degradation under an interface failure cannot be obtained for the general
case. However, we can obtain a measure of performance degradation when the MBS
is $M_{Q(n)}$.
5.5 Performance Degradation of $M(Q_n)$ due to an Interface Failure
Theorem 5.5: If $M(Q_n)$ sustains the failure of a driver or receiver, then any pair of
processors in the faulty MBS can communicate using at most three buses.
Proof: First suppose that the MBS $M_{Q(n)}$ is such that $n > 2$. Let $\gamma_a$ and $\gamma_b$ be two
processors such that there is a direct path from $\gamma_a$ to $\gamma_b$ via bus $\gamma_c$. In Theorem 5.3, we
have already shown the existence of a path of length 3 from processor $\gamma_a$ to processor $\gamma_b$
which does not use bus $\gamma_c$. The same three buses can be used if the driver connecting $\gamma_a$
to $\gamma_c$ or the receiver connecting $\gamma_b$ to $\gamma_c$ fails.
Now suppose that $n = 2$. In this case, $M_{Q(n)}$ does not tolerate a receiver failure (see
Theorem 5.4). Suppose that the driver connecting $\gamma_a$ to $\gamma_c$ fails. As in the proof of
Theorem 5.3, let $\gamma_a = \gamma_c \oplus 0^{n-\alpha}1^{\alpha}$, for an even integer $\alpha$, and $\gamma_b = \gamma_c \oplus 0^{n-\beta}1^{\beta}$, for an odd
integer $\beta$. Since $n = 2$, $\beta$ must be equal to 1. Also, $\alpha$ must be either 0 or 2. We will
assume $\alpha = 2$. The same result can be reached when $\alpha = 0$.
Since $\alpha = 2$, we have $\gamma_c = \gamma_a \oplus 0^{n-2}11$. Let the two buses $\gamma_1^*$ and $\gamma_2^*$ be defined
by $\gamma_1^* = \gamma_a$ and $\gamma_2^* = \gamma_a \oplus 0^{n-2}10$. Then processor $\gamma_a$ and processor $\gamma_a \oplus 0^{n-1}1$ are
connected to bus $\gamma_1^*$ via a driver and a receiver, respectively. Therefore, there is a direct
path from processor $\gamma_a$ to processor $\gamma_a \oplus 0^{n-1}1$ via bus $\gamma_1^*$. Also, there is a direct path
from processor $\gamma_a \oplus 0^{n-1}1$ to processor $\gamma_a \oplus 0^{n-2}11$ via bus $\gamma_2^*$. Furthermore, there is a
direct path from processor $\gamma_a \oplus 0^{n-2}11$ to processor $\gamma_a \oplus 0^{n-2}10 = \gamma_b$ via bus $\gamma_c$. Hence there
is a path of length 3 from $\gamma_a$ to $\gamma_b$ which does not use the driver connecting $\gamma_a$ to $\gamma_c$. □
Notice that when $n = 2$, the indirect path from processor $\gamma_a$ to processor $\gamma_b$ uses
bus $\gamma_c$, but it does not use the faulty driver. While executing any interconnection
function, the faulty $M_{Q(n)}$ requires three extra communication steps. Therefore, similar to
the bus failure case, whenever $M_{Q(n)}$ sustains an interface failure, the communication time
will be increased by a factor of at most four.
5.6 Failure of a Processor
In this section we will analyze the behavior of $M(\Gamma, \Delta)$ when a processor fails.
Failure of a processor does not affect the direct paths among other processors, but it
may affect indirect paths. Also, when a processor fails, the subtask assigned to the faulty
processor must be reassigned to one or more non-faulty processors. We assume that the
multiple bus system detects the faulty processor and assigns its subtask to one or more
of its neighboring processors. We will show that unless $|\Delta| = 1$, $M(\Gamma, \Delta)$ is strongly
connected after a processor failure. Unlike the previous two cases (interface failure and
bus failure), more involved proofs are required to prove that $M(\Gamma, \Delta)$ remains strongly
connected after the failure of a single processor.
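Lemma 5.4: Let G be a Cayley color graph associated with group $\Gamma$ and its generating
set $\Delta$. Let $\Lambda$ be the subgroup of $\Gamma$ generated by $\Delta_1 \subset \Delta$. All edges of color r, $\delta_r \in \Delta - \Delta_1$,
originating from one coset subgraph relative to $\Lambda$ will terminate at another coset
subgraph (relative to $\Lambda$) if and only if $\Lambda$ is a normal subgroup of $\Gamma$.
Proof: First suppose that $\Lambda$ is a normal subgroup of $\Gamma$. Then $\gamma\Lambda\delta_r = \gamma\delta_r\Lambda$. Thus all edges
of color r originating from coset subgraph $\gamma\Lambda$ terminate at coset subgraph $\gamma\delta_r\Lambda$.
Now suppose that, for every coset $\gamma_1\Lambda$, there exists another coset $\gamma_2\Lambda$ such that
$\gamma_2\Lambda = \gamma_1\Lambda\delta_r$. Clearly, $\gamma_1\delta_r \in \gamma_1\Lambda\delta_r$. So, $\gamma_1\delta_r$ belongs to the left coset $\gamma_2\Lambda$. Furthermore, $\gamma_1\delta_r$
belongs to the left coset $\gamma_1\delta_r\Lambda$. Therefore, by Lemma 5.2, $\gamma_1\Lambda\delta_r = \gamma_1\delta_r\Lambda$. So, $\Lambda\delta_r = \delta_r\Lambda$.
This is true for every $\delta_r \in \Delta - \Delta_1$. Furthermore, for every $\delta_s \in \Delta_1$, we have $\Lambda\delta_s = \delta_s\Lambda$.
Let $\gamma$ be an arbitrary element in $\Gamma$. Since $\Delta$ is a generating set for $\Gamma$, $\gamma$ is a word in $\Delta$.
Hence, $\gamma\Lambda = \Lambda\gamma$, and therefore, $\Lambda$ is a normal subgroup of $\Gamma$. □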
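
Lemma 5.4 can be observed on a small example. The sketch below is an illustration under stated assumptions, not part of the original development: it takes the symmetric group on three points as the group, represents its elements as permutation tuples, and uses hypothetical helper names (`mul`, `generate`, `edges_reach_one_coset`). It compares a normal subgroup (generated by a 3-cycle) with a non-normal one (generated by a transposition).

```python
def mul(p, q):
    """Product pq of two permutations given as tuples (q is applied first)."""
    return tuple(p[i] for i in q)


def generate(gens, identity):
    """Subgroup generated by gens in a finite group, via repeated right multiplication."""
    elems = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for d in gens:
            h = mul(g, d)
            if h not in elems:
                elems.add(h)
                frontier.append(h)
    return elems


def left_coset(g, sub):
    return frozenset(mul(g, s) for s in sub)


def is_normal(group, sub):
    """True if the left and right cosets of sub coincide for every group element."""
    return all(left_coset(g, sub) == frozenset(mul(s, g) for s in sub) for g in group)


def edges_reach_one_coset(group, sub, delta):
    """True if, from every left coset, all edges x -> x*delta end in a single left coset."""
    for g in group:
        targets = {left_coset(mul(x, delta), sub) for x in left_coset(g, sub)}
        if len(targets) != 1:
            return False
    return True


E = (0, 1, 2)
A = (1, 2, 0)  # a 3-cycle
B = (1, 0, 2)  # a transposition
S3 = generate([A, B], E)
ROT = generate([A], E)   # normal subgroup of order 3
FLIP = generate([B], E)  # non-normal subgroup of order 2

print(is_normal(S3, ROT), edges_reach_one_coset(S3, ROT, B))
print(is_normal(S3, FLIP), edges_reach_one_coset(S3, FLIP, A))
```

In this example the edges of the remaining color stay together exactly when the subgroup is normal, which is the behavior the lemma describes.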
Lemma 5.5: Let G = (V, E) be the Cayley color graph associated with finite group $\Gamma$ and
its generating set $\Delta$. Let $A \subseteq V$ be a subset of vertices. Then the number of edges from
A to V - A of color r is equal to the number of edges from V - A to A of color r, for
every color r in G.
Proof: Each vertex in A has one outgoing edge of color r. Therefore, there are $|A|$ edges
of color r whose tails are in A. Of the heads of those edges, let q be in V - A. The
remaining $|A| - q$ heads are in A itself. Furthermore, there are $|A|$ edges of color r whose
heads are in A. Of those, $|A| - q$ originate in A itself. Therefore,
there are q edges of color r from V - A to A. □
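
The counting argument of Lemma 5.5 can be checked exhaustively on a small Cayley color graph. The following sketch assumes the additive cyclic group of order 6 with generators 2 and 3 as a test case; the group, the generators, and all function names are choices made for illustration.

```python
from itertools import combinations


def cayley_edges(n, generators):
    """Edges (tail, head, color) of the Cayley color graph of Z_n under addition mod n."""
    return [(g, (g + d) % n, color)
            for g in range(n)
            for color, d in enumerate(generators)]


def crossing_counts(edges, subset, color):
    """Number of edges of one color leaving the subset and entering the subset."""
    leaving = sum(1 for t, h, c in edges if c == color and t in subset and h not in subset)
    entering = sum(1 for t, h, c in edges if c == color and t not in subset and h in subset)
    return leaving, entering


def lemma_holds(n, generators):
    """Check every vertex subset and every color for equal leaving/entering counts."""
    edges = cayley_edges(n, generators)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            chosen = set(subset)
            for color in range(len(generators)):
                leaving, entering = crossing_counts(edges, chosen, color)
                if leaving != entering:
                    return False
    return True


print(lemma_holds(6, [2, 3]))
```

The balance holds because each color class is a permutation of the vertex set, so every vertex has exactly one incoming and one outgoing edge of each color.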
The next lemma, which requires a lengthy proof, provides an important
characterization of the connectivity among coset subgraphs of a Cayley color graph.
Lemma 5.6: Let G = (V, E) be the Cayley color graph associated with finite group $\Gamma$ and
its generating set $\Delta = \{\delta_1, \delta_2, \ldots, \delta_k\}$. Let $\Lambda$ be the subgroup of $\Gamma$ generated by $\Delta_1 \subset \Delta$
and S be the family of coset subgraphs relative to $\Lambda$ and $\Delta_1$. Then S can be divided into
two disjoint subsets A and B satisfying
(a) $|A| \geq 1$,
(b) $|B| \geq 2$, and
(c) all edges in $C_\Delta(\Gamma)$ from A to B terminate at a single coset subgraph in B,
only if $\Lambda$ is a normal subgroup of $\Gamma$.
Proof: Suppose that the statements (a), (b), and (c) of the lemma are true. Also suppose
that all edges from A to B terminate at coset subgraph $CS_t$. First, assume that A contains
only one coset subgraph, say $CS_a$. Then, all edges originating from $CS_a$ terminate in $CS_t$.
So, by Lemma 5.4, $\Lambda$ is a normal subgroup of $\Gamma$.
Next, assume that A contains more than one coset subgraph. In that case, we claim
that there exists a subset $A' \subset S$, $|A'| < |A|$, such that all edges from $A'$ to $S - A' = B'$
terminate at a single coset subgraph in $B'$. If the claim is true, we can ultimately find a
coset subgraph $CS_a$ such that all edges originating from $CS_a$ terminate at a single coset
subgraph. This will then imply that $\Lambda$ is a normal subgroup of $\Gamma$. We now prove the
correctness of the above claim.
We first observe that, by symmetry of $C_\Delta(\Gamma)$ and by Statement (c) of the lemma,
for every coset subgraph $CS_i$ there exists a subset $\xi(CS_i) \subset S$ such that all edges from
$\xi(CS_i)$ to $S - \xi(CS_i)$ terminate at $CS_i$. Clearly, $\xi(CS_t) = A$. It is also clear that $CS_i$ is not
contained in $\xi(CS_i)$, for $CS_i \in S$. If $CS_i$ and $CS_j$ are distinct coset subgraphs, then $\xi(CS_i) \neq \xi(CS_j)$,
for otherwise, edges originating from $\xi(CS_i)$ should terminate at $CS_i$ as well as
at $CS_j$.
To obtain the subset $A' \subset S$ stated in the claim, we first choose a coset subgraph
$CS_x$ in A such that $\xi(CS_x) \cap A$ is not empty. Suppose that $\xi(CS_x) \cap A$ is empty for the
selected $CS_x$. Then, $\xi(CS_x) \subset B$. Let there be q edges from A to B. Then, there are q edges
from A to $CS_t$. Therefore, by symmetry, there are q edges from $\xi(CS_x)$ to $CS_x$. From
Lemma 5.5, the number of edges from B to A is q. Since $\xi(CS_x) \subset B$, all edges from B
to A terminate at $CS_x$. Let $CS_{x'}$ be another coset subgraph other than $CS_x$ in A (since A is
assumed to contain more than one coset subgraph, such a coset subgraph exists). Then
there are no edges from B to $CS_{x'}$. Therefore, $\xi(CS_{x'})$ cannot be completely contained in
B. Hence $\xi(CS_{x'}) \cap A$ is not empty. Thus, without loss of generality, we will assume that
$\xi(CS_x) \cap A$ is not empty. It should be noticed that $\xi(CS_x) \cap B$ cannot be empty since
$\xi(CS_x) \neq A$. Denote $\xi(CS_x) \cap A$ and $\xi(CS_x) \cap B$ by $A_x$ and $B_x$, respectively (see Figure
5.5(a)). The construction of $A'$ will be different depending on whether $CS_t$ is contained
in $B_x$ or not.
Case 1: $CS_t$ is not contained in $B_x$.
Since all edges from A to B terminate at $CS_t$, there are no edges from $A_x$ to $B_x$.
Also, all edges originating from $A_x \cup B_x$ (= $\xi(CS_x)$) terminate at $CS_x$. Hence all edges
originating from $A_x$ should terminate at $CS_x$. Clearly, $|A_x| < |A|$. By letting $A' = A_x$, we
have the proof of the claim in this case.
Case 2: $CS_t$ is contained in $B_x$.
In this case, there are no edges from A to $B - B_x$, because, by Statement (c) of the
lemma, all edges from A to B terminate at $CS_t$. Also, there are no edges from $B_x$ to
$B - B_x$ because edges originating from $B_x$ terminate either in $CS_x$ or in $A_x$. Therefore,
there are no edges terminating at $B - B_x$. Since this violates the strong connectivity of
$C_\Delta(\Gamma)$, $B - B_x$ must be empty. Therefore, $B_x = B$. Let $CS_y$ be any coset subgraph in
$A - A_x - CS_x$. By the hypothesis of the lemma (Statement (b) stipulates that $S - \xi(CS_i)$
contains at least two coset subgraphs, for $CS_i \in S$), such a coset subgraph exists. Since
$|\xi(CS_y)| = |\xi(CS_x)| = |A_x| + |B| > |B|$, $\xi(CS_y) \cap A = A_y$ cannot be empty. Let $B_y = \xi(CS_y) \cap B$.
If $CS_t$ is not contained in $B_y$, then the claim holds by Case 1. If $CS_t$ is contained
in $B_y$, then $B - B_y$ must be empty. Therefore, $B_y = B$. Let $A_{xy} = A_x \cap A_y$ and $R = A_{xy} \cup B$.
We will now prove that $CS_x$ is contained in $A_y$. Suppose on the contrary that $CS_x$ is not
contained in $A_y$ (see Figure 5.5(b)). Since $R \subset \xi(CS_x)$, the edges originating from R would
not terminate in $(A - CS_x - A_x)$. Similarly, since $R \subset \xi(CS_y)$, the edges originating from
R would not terminate in $(A - CS_y - A_y)$. But $(A - CS_x - A_x) \cup (A - CS_y - A_y) = A - A_{xy} = S - R$.
So, there are no edges from R to S - R. But this violates the strong
connectivity of $C_\Delta(\Gamma)$, and so $CS_x$ must be contained in $A_y$. In that case (see Figure
5.5(c)), all edges originating from R will terminate at $CS_x$. Since $A_x \neq A_y$, we have
$|A_{xy}| < |A_x|$.
Thus $|R| = |A_{xy} \cup B| = |A_{xy}| + |B| < |A_x| + |B| = |\xi(CS_x)| = |A|$. We obtain $A'$ by letting
$A' = R$. □
Figure 5.5: Illustration for Lemma 5.6, panels (a), (b), and (c).
Lemma 5.7: Let $C_\Delta(\Gamma)$ be such that $|\Delta| > 1$. Then $V(C_\Delta(\Gamma))$ cannot be divided into two
disjoint subsets R and S, $|R| \geq 1$, $|S| \geq 2$, such that all edges from R to S terminate at a
single vertex in S.
Proof: Suppose that there exist two subsets of vertices R and S such that all edges from
R to S terminate at $v \in S$. Due to the strong connectivity of $C_\Delta(\Gamma)$, there should be at
least one edge from R to S. Let (u, v) be such an edge whose color is, say, 1. Let $\Lambda_1$ be
the subgroup of $\Gamma$ generated by $\{\delta_1\}$. Denote by $CS_i$ the family of coset subgraphs
relative to $\Lambda_1$ and $\{\delta_1\}$. It is clear that each $CS_i$ is a directed cycle consisting of color 1 edges.
Let the coset subgraph containing edge (u, v) be $CS_t$. Clearly, $CS_t$ is common to subsets
R and S (see Figure 5.6(a)). Suppose that there is a coset subgraph $CS'$, other than $CS_t$,
which is also common to both R and S. Then, there must be another edge of color 1 from
R to S. Since this is not possible by the hypothesis of the lemma, such a coset subgraph
$CS'$ does not exist. Therefore, $CS_t$ is the only coset subgraph common to R and S. Let
$CS_t \cap R$ and $CS_t \cap S$ be denoted by $CS_R$ and $CS_S$, respectively (see Figure 5.6(a)).
Now, we claim that $S - CS_S$ is empty. Suppose that it is not empty. Since all
edges from R to S terminate at v, it is true that all edges from $R - CS_t$ to $S \cup CS_t$
terminate at the single coset subgraph $CS_t$. Therefore, according to Lemma 5.6, $\Lambda_1$ must
be a normal subgroup of $\Gamma$. There are no edges from $CS_R$ to $S - CS_S$ by the hypothesis
of the lemma. Therefore, since $\Lambda_1$ is a normal subgroup, there are no edges from $CS_S$ to
$S - CS_S$. Also, by the hypothesis of the lemma, there are no edges from $R - CS_R$ to
$S - CS_S$. This implies that $S - CS_S$ has no incoming edges. Therefore, due to the strong
connectivity of $C_\Delta(\Gamma)$, $S - CS_S$ must be empty. Therefore, $S = CS_S$. Figure 5.6(b) shows
the new configuration. Let $v_1$ be a vertex in S other than v. Then incoming edges to $v_1$
of colors other than 1 should originate from R. Since this is not possible by the hypothesis
of the lemma, the only vertex in S must be v. This violates the condition given in the
lemma ($|S| \geq 2$). This completes the proof of the lemma. □
Figure 5.6: Illustration for Lemma 5.7, panels (a) and (b); panel (b) shows the case $S = CS_S$.
Lemma 5.8: Let $C_\Delta(\Gamma)$ be such that $|\Delta| > 1$. Then $C_\Delta(\Gamma)$ remains strongly connected after
the removal of a single vertex.
Proof: Let v be an arbitrary vertex of $C_\Delta(\Gamma)$. If the graph obtained by removing v is not
strongly connected, then there exist two vertices $v_1$ and $v_2$ such that all paths from $v_1$ to $v_2$
pass through v. Let R be the set of vertices of $C_\Delta(\Gamma)$ (including $v_1$ but excluding v) which
can be reached from $v_1$ without passing through v. Let $S = V(C_\Delta(\Gamma)) - R$. Since v and $v_2$ are
in S, we have $|S| \geq 2$. Furthermore, every edge from R to S terminates at v. This is not possible
according to Lemma 5.7. Hence, there is a path from $v_1$ to $v_2$ which does not pass through v. □
As a consequence of Lemma 5.8, we have the following theorem.
Theorem 5.6: If $|\Delta| > 1$, then the MBS $M(\Gamma, \Delta)$ remains strongly connected after the
failure of a single processor. □
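
The statement of Lemma 5.8 can be tested by brute force on small Cayley graphs. The sketch below assumes cyclic groups under addition as test cases and uses hypothetical function names; it removes each vertex in turn and checks strong connectivity with a forward and a backward search.

```python
def cayley_successors(n, generators):
    """Successor lists of the Cayley graph of Z_n under addition mod n."""
    return {g: [(g + d) % n for d in generators] for g in range(n)}


def reachable(adj, start, removed):
    """Vertices reachable from start along adj without visiting removed vertices."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def strongly_connected(succ, removed=frozenset()):
    """Strong connectivity of the graph after deleting the vertices in removed."""
    nodes = [v for v in succ if v not in removed]
    if not nodes:
        return True
    pred = {v: [] for v in succ}
    for u, heads in succ.items():
        for w in heads:
            pred[w].append(u)
    start = nodes[0]
    return (len(reachable(succ, start, removed)) == len(nodes)
            and len(reachable(pred, start, removed)) == len(nodes))


def survives_any_single_removal(n, generators):
    succ = cayley_successors(n, generators)
    return all(strongly_connected(succ, {v}) for v in succ)


print(survives_any_single_removal(6, [2, 3]))  # two generators
print(survives_any_single_removal(6, [1]))     # one generator: a unidirectional ring
```

With a single generator the graph is a directed ring, so removing any vertex leaves a directed path, which is why the condition on the number of generators is needed.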
Thus $M(\Gamma, \Delta)$ sustains a single processor failure provided that $|\Delta| > 1$. Similar to
the previous two cases, there is no general formulation for the performance degradation
of $M(\Gamma, \Delta)$ with respect to a processor fault. As before, we will obtain a measure of
performance degradation in the special case when the MBS is $M(Q_n)$.
5.7 Performance Degradation of $M(Q_n)$ due to a Processor Failure
As mentioned before, failure of a processor does not affect the communication
among other neighboring processors. But extra data transfers are required when the
subtask originally assigned to the faulty processor is reassigned to one or more non-faulty
processors. Let $P_f$ be the faulty processor. Let $P_i$ be the processor whose corresponding
vertex in the IFG $G_{Q(n)}$ is adjacent to the vertex corresponding to processor $P_f$ along
dimension i, $1 \leq i \leq n$. For the analysis in this section, we assume that the subtask of the
faulty processor is assigned to a single neighboring processor, say $P_1$. See Figure 5.7.
Consider the situation when $M(Q_n)$ is performing the interconnection function
associated with generator $\delta_2$, that is, when the MBS is performing the $cube_2$
interconnection function [125]. The subtasks assigned to processors $P_f$ and $P_2$ must
communicate with each other. In the non-faulty $M(Q_n)$, processor $P_f$ can communicate
with processor $P_2$ in one step. To accomplish this in the single processor fault situation,
processor $P_1$ should communicate (simultaneously, if possible) with processor $P_2$.
Let $\gamma_f$ be the vertex of the IFG $G_{Q(n)}$ corresponding to processor $P_f$ in the MBS
$M_{Q(n)}$. Then, $\gamma_f\delta_i$ is the vertex corresponding to processor $P_i$, $1 \leq i \leq n$. Now, there is a
direct path from $\gamma_f\delta_1$ to $\gamma_f\delta_1\delta_2$ via bus $\gamma_f\delta_2$. Also, there is a direct path from $\gamma_f\delta_1\delta_2$ to $\gamma_f\delta_2$
via bus $\gamma_f\delta_1\delta_2$. Thus, there is a path from $\gamma_f\delta_1$ to $\gamma_f\delta_2$ via the two buses $\gamma_f\delta_2$ and $\gamma_f\delta_1\delta_2$.
Similarly, there is a path from $\gamma_f\delta_2$ to $\gamma_f\delta_1$ via the two buses $\gamma_f\delta_1\delta_2$ and $\gamma_f$, in that order.
Notice that both paths contain the bus $\gamma_f\delta_1\delta_2$. When the data transfer from $P_1$ to $P_2$ is
using bus $\gamma_f\delta_1\delta_2$, the data transfer from $P_2$ to $P_1$ will be using bus $\gamma_f$. Also, when the data
transfer from $P_2$ to $P_1$ is using bus $\gamma_f\delta_1\delta_2$, the data transfer from $P_1$ to $P_2$ will be using bus
$\gamma_f\delta_2$. Therefore, no bus conflict will occur. Hence, processors $\gamma_f\delta_1$ (= $P_1$) and $\gamma_f\delta_2$ (= $P_2$)
can communicate with each other using only two steps. Thus, two extra communication
steps are required in order to perform the communications required by the faulty
processor corresponding to the interconnection function $\delta_2$. This is true for every other
interconnection function except $\delta_1$. Therefore, communication time will be increased by
at most three times because of a processor failure.
Figure 5.7: Communication among neighboring processors of a faulty processor in $M(Q_n)$.
5.8 Inclusion of Redundancy
When the MBS does not sustain the failure of a certain component, we can make
it fault tolerant by adding some redundant components. In all three cases of component
failures we have addressed (bus, interface, and processor), whether $M(\Gamma, \Delta)$ is fault
tolerant or not was decided by the size of $\Delta$. In $M(\Gamma, \Delta)$, only the number of interfaces
is dependent on $|\Delta|$. Therefore, to make $M(\Gamma, \Delta)$ fault tolerant, whenever necessary, only
interfaces need to be added.
If $\Delta = \{\delta_1\}$, that is, if $|\Delta| = 1$, then $M(\Gamma, \Delta)$ is not fault tolerant in the face of any
component failure. Note that $\Delta = \{\delta_1\}$ corresponds to an IFG with unidirectional ring
topology. In such a case, we can use generating set $\Delta' = \{\delta_1, \delta_1^{-1}\}$ of group $\Gamma$ instead of
$\Delta$. Even though $\Delta'$ is a redundant generating set, color partition $\phi_\Gamma$ is optimal (see
Section 4.2). From Theorems 5.4 and 5.6, $M(\Gamma, \Delta')$ is fault tolerant against a single
processor failure or a single driver failure. According to Theorem 4.9, the numbers of
interfaces in $M(\Gamma, \Delta)$ and $M(\Gamma, \Delta')$ are $2|\Gamma|$ and $3|\Gamma|$, respectively. Therefore, an MBS
$M(\Gamma, \Delta)$ which is not fault tolerant against a single processor or a single driver failure can
be converted to a fault-tolerant one by adding 50% more interfaces. Note that the
insertion of generator $\delta_1^{-1}$ will fail when $\delta_1^2 = e$. But this corresponds to the two-processor
case, which can easily be handled. For $M(\Gamma, \Delta)$ to tolerate a bus failure or a receiver
failure, $\Delta$ must have at least three generators. We may have to add one or two extra
interfaces per processor depending on whether $|\Delta| = 2$ or 1.
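
The interface totals quoted in this section can be written out as a short worked equation. It assumes only the per-processor driver and receiver counts quoted from Theorem 4.9 and that the MBS has one processor per element of the group.

```latex
\[
\underbrace{\left\lfloor \tfrac{|\Delta|}{2} \right\rfloor + 1}_{\text{drivers}}
+ \underbrace{\left\lceil \tfrac{|\Delta|}{2} \right\rceil}_{\text{receivers}}
= |\Delta| + 1
\quad\Longrightarrow\quad
\text{total interfaces} = (|\Delta| + 1)\,|\Gamma| .
\]
\[
|\Delta| = 1:\ 2|\Gamma|, \qquad
|\Delta| = 2:\ 3|\Gamma|, \qquad
|\Delta| = 3:\ 4|\Gamma| .
\]
```

Moving from $2|\Gamma|$ to $3|\Gamma|$ interfaces is the 50% increase, and reaching three generators costs one extra interface per processor starting from two generators and two starting from one.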
CHAPTER 6
CONSTRUCTION WITH GIVEN NUMBER OF
PROCESSORS AND BUSES
An optimal MBS realizing a given IFG may require a large number of processors 
and buses. When the IFG is vertex symmetric or regular, the number of processors, as 
well as the number of buses, of the optimal MBS is equal to the number of vertices of the 
IFG. Implementing an MBS with as many processors and buses as are needed for
optimality may be infeasible due to practical considerations. Therefore, it is imperative
to study how well an MBS with a given number of buses and/or processors can run the 
source algorithm(s). Such a system will inevitably suffer some performance penalties.
In this chapter we study how to design an optimal multiple bus system realizing 
a given IFG, when the number of processors and buses in the target MBS are specified. 
In this case, the optimality of the target MBS depends on the number of interfaces used. 
Therefore, in this chapter an optimal MBS means an MBS with minimum number of 
interfaces. This is our secondary objective of the dissertation and it is an extension to the 
work presented in Chapters 2, 3, and 4. Note that the problem at hand differs from 
mapping a given IFG  onto an existing MBS. In the mapping problem, how processors and 
buses are interconnected together is specified while in our design problem it is not 
specified. We set up the interconnection between the given set of processors and/or the 
buses such that the target MBS can realize the original source algorithm at maximum 
possible speed and the number of interfaces used is minimum.
First we consider the case with a given number of processors. We show that the
number of processors p must be a factor of the optimal number of processors in order to
utilize them efficiently. We will partition the vertex set of the given IFG into p subsets.
The optimality criterion is to minimize the number of edges that cross between subsets of
the partition (equivalently, to maximize the number of edges induced within the subsets).
Since the general partition problem is NP-Hard, we consider the case when the IFG is
vertex symmetric. By partitioning the vertex set of the given IFG $C_\Delta(\Gamma)$, we will construct
a new IFG which has p vertices. We state some specific problems associated with the
construction of the new IFG. We show that unless the vertex partition satisfies certain
conditions, those questions cannot be answered properly. Using cosets of group $\Gamma$, we will
show how such a partition can be made. We also show that the resulting MBS is regular.
Furthermore, we will provide a proper scheduling of the processors which guarantees
maximum speed.
Then, we will consider the case with a given number of buses. Here also, for
efficient utilization of buses, we show that the specified number of buses b must be a
factor of the optimal number of buses. We will show that even when the given IFG is
vertex symmetric, the optimal MBS with a given number of buses may be heterogeneous.
Therefore, we may sacrifice a certain amount of optimality in order to make the target MBS
homogeneous. Using cosets of $\Gamma$, we will show how to combine buses of the optimal
MBS $M(\Gamma, \Delta)$ to obtain a new MBS which is symmetric and contains b buses.
6.1 MBS with Given Number of Processors
In order to construct an MBS with p processors realizing a given IFG G, we
partition the vertex set of G into p subsets such that the subtasks associated with each
subset are to be performed by a single processor in the target MBS. (Note that each vertex
of the IFG corresponds to a single subtask.) Consider subset $V_1$ of vertices of G assigned
to processor $P_1$. The edges induced by subset $V_1$ will not impose any communications
since both the source and destination subtasks involved in the communication are assigned
to the same processor. These edges represent the communication hidden by processor $P_1$.
Therefore, the number of edges induced by $V_1$ is a measure of the amount of communication
hidden by processor $P_1$. Thus, we define an optimal vertex partition as follows.
Definition 6.1: A partition of the vertex set of a given IFG  is optimal if the total number 
of edges induced by the partition is maximum.
The optimal partition of a general IFG into a given number of subsets is 
equivalent to the general graph partition problem, and is, therefore, NP-Hard [24], [53], 
[60], [81]. Besides, we have already demonstrated the importance of symmetric IFGs. 
Therefore, we will only consider the partition of vertex symmetric IFGs. Unfortunately, 
as we demonstrate later, even when the IFG  is vertex symmetric, finding an optimal 
vertex partition is not a trivial problem for a given value of p.
We will now show an interesting relation that holds between the given number
of processors p and the optimal number of processors $|V(G)|$. Let $C_A$ be the (regular) CFG
from which the IFG G was constructed by an optimal level-disjoint partition. Then $|V(G)| = a(C_A)$.
Suppose that $C_A$ consists of n levels. Momentarily assume that $p = |V(G)| - 1$.
The source algorithm requires n parallel steps to be executed, where each parallel step
consists of $a(C_A) = |V(G)|$ single steps. Since there are only p (= $|V(G)| - 1$) processors
available, it takes two steps for the processors to execute one parallel computation step
of the source algorithm. This is true because we assume that all the computational steps
take the same amount of time. Also we assume that computations from different levels
cannot be interleaved because of possible data dependencies. Therefore, it takes
2n time steps to complete all the computations in the algorithm. Even if we use
$\lceil |V(G)|/2 \rceil$ processors, the computation time needed is 2n. Thus, if the specified
number of processors is only $|V(G)| - 1$, then the actual number of processors needed for
optimal execution is $\lceil |V(G)|/2 \rceil$. This idea is generalized in the following theorem.
Theorem 6.1: If the specified number of processors is p, then only $\lceil |V(G)|/k \rceil$ processors
are needed without affecting the performance, where $k = \lceil |V(G)|/p \rceil$. □
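
The quantity in Theorem 6.1 is straightforward to compute. The following sketch uses a hypothetical function name and relies only on the two ceiling expressions in the theorem: k is the number of time steps each parallel step takes with p processors, and the second ceiling gives the number of processors that achieves the same k.

```python
def processors_needed(num_vertices, p):
    """Return (k, needed) with k = ceil(|V(G)| / p) and needed = ceil(|V(G)| / k)."""
    if num_vertices < 1 or p < 1:
        raise ValueError("both arguments must be positive")
    k = -(-num_vertices // p)       # ceiling division
    needed = -(-num_vertices // k)
    return k, needed


# With 8 subtasks per level and 7 processors, each level takes 2 steps,
# and 4 processors already achieve that.
print(processors_needed(8, 7))
```

When p already divides the number of vertices, the function returns p itself, which is consistent with the assumption made next that p is a factor of the optimal number of processors.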
Therefore, without loss of generality, we assume that p is a factor of $|V(G)|$. Each
subset of the partition is assigned to a unique processor. The size of each subset will
determine the number of subtasks allocated to the corresponding processor.
Definition 6.2: A partition of the vertex set of an IFG  is said to be an isomorphic vertex 
partition if the induced subgraphs of the subsets are isomorphic to one another.
If a vertex partition is not isomorphic, different processors may be assigned 
different number of subtasks and thus processors may behave differently. Also, the 
number of outgoing edges of color r  from a subset of vertices in G may vary from subset 
to subset. Suppose all the processors in the new MBS are performing computation r. Then 
the number of times a processor will perform an external communication may be different 
from processor to processor. Then not only processor execution times are different but 
buses will also be used inefficiently. Therefore, we are interested in finding an isomorphic 
vertex partition even at the expense of optimality.
By partitioning the vertex set of the IFG G into p subsets of size k, we construct
another IFG $G^k$. As one would expect, the vertices of $G^k$ are the subsets of the
partition. We use notation $v_i$ to represent the vertex in $G^k$ corresponding to the subset $V_i$ of
G, according to a given vertex partition. When we want to make a distinction between
the two IFGs G and $G^k$, we will refer to the former as the principal IFG and to the latter
as the derived IFG. To determine the edge set of $G^k$, closer attention must be paid to the
concurrent data transfers dictated by the principal IFG G.
If there exists an edge in E(G) such that its initial vertex is in $V_i$ and its terminal
vertex is in $V_j$, we say that an edge exists from $V_i$ to $V_j$ in G. Suppose that there exists
an edge from $V_i$ to $V_j$ in G. Then the processor corresponding to $v_i$ (or, processor $v_i$), at
a certain time during its operation, should send data to the processor corresponding to $v_j$
(or, processor $v_j$). Conversely, if there are no edges in G from $V_i$ to $V_j$, then processor $v_i$
does not send data to processor $v_j$ at all. Therefore, we stipulate that $(v_i, v_j) \in E(G^k)$ if and
only if there is an edge from $V_i$ to $V_j$ in G. Obviously, there could be more than one edge
from $V_i$ to $V_j$. Those edges are called parallel edges. Interpretation of those parallel edges
in $G^k$ involves two important issues, merging and secondary coloring.
First we address the issue of merging. Suppose there is more than one edge in
G from $V_i$ to $V_j$ of the same color, say r. Since these data transfers must be performed
sequentially by processors $v_i$ and $v_j$, they can be represented by a single edge of color r
from $v_i$ to $v_j$ in $G^k$. This is called merging of parallel edges.
Definition 6.3: Let there be w edges of color r from $V_i$ to $V_j$ in G. Then the representative
edge of color r from $v_i$ to $v_j$ in $G^k$ has weight w.
Example 6.1: Suppose there are 2 edges from subset $V_1$ to subset $V_3$ and 3 edges from
subset $V_2$ to subset $V_4$ of color r in G. Assume that the subsets $V_1$, $V_2$, $V_3$, and $V_4$ in G are
all distinct. Representative edges $(v_1, v_3)$ and $(v_2, v_4)$ in $G^k$ are of weights 2 and 3,
respectively. See Figure 6.1.
Figure 6.1: Two edges of $G^k$ with different weights.
If the weights of the edges of $G^k$ are different, then bus utilization will be poor.
Some buses will stay idle while some others are in operation. This is undesirable since
we are interested in an MBS with regular features.
Definition 6.4: A uniform merging allocates the same weight to every edge in $G^k$.
With uniform merging, every bus carries the same amount of data concurrently.
Therefore, bus utilization is uniform. We are interested in finding a partition on V(G)
which allows for uniform merging of parallel edges while constructing the edge set of $G^k$.
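
Merging and the uniformity test can be sketched in a few lines of Python. This is an illustration under stated assumptions: the vertex labels, the dictionary-based partition, and the function names are hypothetical, and edges whose endpoints lie in the same subset are skipped on the assumption that they represent hidden communication. The data reproduce Example 6.1.

```python
from collections import Counter


def merge_parallel_edges(edges, part_of):
    """Merge parallel edges of the principal IFG into weighted edges of the derived IFG.

    edges   -- iterable of (tail, head, color) triples of the principal IFG
    part_of -- dict mapping each vertex to the index of its subset
    Returns a Counter mapping (i, j, color) to the weight of the merged edge.
    """
    weights = Counter()
    for tail, head, color in edges:
        i, j = part_of[tail], part_of[head]
        if i != j:  # edges inside a subset need no bus
            weights[(i, j, color)] += 1
    return weights


def is_uniform(weights):
    """True if every merged edge carries the same weight (Definition 6.4)."""
    return len(set(weights.values())) <= 1


# Example 6.1: two color-r edges from V1 to V3 and three from V2 to V4.
part_of = {"a1": 1, "a2": 1, "b1": 2, "b2": 2, "b3": 2,
           "c1": 3, "c2": 3, "d1": 4, "d2": 4, "d3": 4}
edges = [("a1", "c1", "r"), ("a2", "c2", "r"),
         ("b1", "d1", "r"), ("b2", "d2", "r"), ("b3", "d3", "r")]
weights = merge_parallel_edges(edges, part_of)
print(weights[(1, 3, "r")], weights[(2, 4, "r")], is_uniform(weights))
```

The two merged edges receive weights 2 and 3, so this partition does not give a uniform merging.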
Now we consider the issue of secondary coloring. Let $u_1$, $u_2$, $u_3$, and $u_4$ be four
vertices in G such that $(u_1, u_3)$ and $(u_2, u_4)$ are edges of the same color r. Assume that $V_1$,
$V_2$, and $V_3$ are distinct subsets of the partition such that $u_1, u_2 \in V_1$; $u_3 \in V_2$; and $u_4 \in V_3$.
Then we have color r edges $(v_1, v_2)$ and $(v_1, v_3)$ in $G^k$. See Figure 6.2. Should we retain
the colors of those two edges? If so, the two edges $(v_1, v_2)$ and $(v_1, v_3)$ in $G^k$ demand two
distinct buses for the corresponding data transfers. Processor $v_1$ will perform the subtasks
corresponding to $u_1$ and $u_2$ sequentially. Therefore, the two data transfers corresponding
to the edges $(u_1, u_3)$ and $(u_2, u_4)$ of G will not occur concurrently and can be carried out
via the same bus. Hence, for the optimality of the target MBS, different colors must be
assigned to these two edges.
Figure 6.2: Two edges of color r in $G^k$ with the same initial vertex.
We can assign two new colors to edges $(v_1, v_2)$ and $(v_1, v_3)$. Suppose we do this
for all edges originating from a vertex for the same color, for every color and every
vertex. Then there is no way of finding edges corresponding to potential concurrent data
transfers. Therefore, instead of assigning totally new colors to edges $(v_1, v_2)$ and $(v_1, v_3)$
we will assign two distinct secondary colors to those edges and retain the primary color
r.* Two edges having the same primary color and the same secondary color correspond
to concurrent data transfers.
*Note that this coloring scheme is different from that used for broadcasting in Section 3.3.
Now the question is: what criteria should be used to assign secondary colors to 
the edges with the same primary color? Different secondary colorings will result in 
different target MBSs. An arbitrary secondary coloring may result in an MBS which is 
heterogeneous.
Definition 6.5: A (secondary) coloring of the edges of $G^k$ is said to be a regular coloring
if each vertex has distinct colored incoming edges as well as distinct colored outgoing
edges according to that coloring.
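
Definition 6.5 translates directly into a check on a list of colored edges. The sketch below reads the definition as: no two outgoing edges of a vertex share a color, and no two incoming edges of a vertex share a color. The function name and the edge representation are hypothetical.

```python
def is_regular_coloring(colored_edges):
    """Check that colors are distinct among the outgoing and the incoming edges of each vertex.

    colored_edges -- iterable of (tail, head, color) triples
    """
    seen_out = set()  # (tail, color) pairs already used
    seen_in = set()   # (head, color) pairs already used
    for tail, head, color in colored_edges:
        if (tail, color) in seen_out or (head, color) in seen_in:
            return False
        seen_out.add((tail, color))
        seen_in.add((head, color))
    return True


# Two edges leaving v1: distinct secondary colors are regular, a shared color is not.
print(is_regular_coloring([("v1", "v2", "s1"), ("v1", "v3", "s2")]))
print(is_regular_coloring([("v1", "v2", "s1"), ("v1", "v3", "s1")]))
```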
If possible, we need to find a regular secondary coloring. If the secondary coloring
is regular, then the derived IFG is regular as well. In that case, Algorithm 4.1
can be used to find an optimal color partition of the IFG $G^k$. Furthermore, the resulting
MBS will also be regular. Now, our aim is to partition the vertex set of the principal IFG
G into p subsets of size k satisfying the following three conditions.
(a) Subsets are isomorphic
(b) Parallel edge merging is uniform
(c) There exists a regular secondary coloring for edges in $G^k$.
We will now show how to obtain a vertex partitioning of the symmetric IFG $C_\Delta(\Gamma)$
satisfying the above three requirements. The partition may not be optimal in general. But
it will be optimal in many cases of interest. We assume that there exists a subgroup of
$\Gamma$ of cardinality $k = |\Gamma|/p$. If there is more than one such subgroup, we will select the
one containing the largest number of generators from $\Delta$. By selecting the subgroup
containing the largest number of generators from $\Delta$, we actually maximize the
communication hidden by each processor. We will denote the selected subgroup of $\Gamma$
by $\Lambda$.
Remark: In general, there may not exist any subgroup of $\Gamma$ whose cardinality is
k. If such a subgroup does not exist, we will use the largest subgroup whose size
is less than $|\Gamma|/p$. But many of the Cayley color graphs which are of interest for
our purpose have subgroups of orders equal to almost every factor of $|\Gamma|$.
We partition the vertex set of $C_\Delta(\Gamma)$ into coset subgraphs relative to $\Lambda$. This is
called coset partitioning. First we will show that the coset partition is an isomorphic
vertex partition. Second, we will show that coset partitioning yields uniform merging of
parallel edges. Third, we will show that there exists a regular secondary coloring
corresponding to the coset partitioning.
6.1.1 Isomorphism
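Theorem 6.2: Coset partition of $C_\Delta(\Gamma)$ is an isomorphic vertex partition.
Proof: Let $\Lambda$ be a subgroup of $\Gamma$. Consider the coset subgraphs $CS_i$ and $CS_j$ corresponding
to the two distinct cosets $\gamma_i\Lambda$ and $\gamma_j\Lambda$, respectively. Let the mapping $\psi: V(CS_i) \to V(CS_j)$
be defined by $\psi(\gamma) = \gamma_j\gamma_i^{-1}\gamma$. Since left multiplication by a fixed element is invertible,
$\psi$ is one-to-one. Let $(\gamma_1, \gamma_2)$ be an edge of $CS_i$ whose color is, say, $r$. Then
$\gamma_2 = \gamma_1\delta_r$. So, $\psi(\gamma_2) = \gamma_j\gamma_i^{-1}\gamma_2 = \gamma_j\gamma_i^{-1}\gamma_1\delta_r = \psi(\gamma_1)\delta_r$. Since $\gamma_1 \in \gamma_i\Lambda$, it follows that
$\psi(\gamma_1) = \gamma_j\gamma_i^{-1}\gamma_1 \in \gamma_j\gamma_i^{-1}\gamma_i\Lambda = \gamma_j\Lambda$. Similarly, $\psi(\gamma_2) \in \gamma_j\Lambda$. Therefore, $(\psi(\gamma_1), \psi(\gamma_2))$ is a
color $r$ edge of $CS_j$. Thus mapping $\psi$ preserves adjacency and colors. Hence, $CS_i$ and $CS_j$
are isomorphic. □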
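As an illustrative check of Theorem 6.2 (a sketch under stated assumptions, not part of the original development), the following Python fragment builds the Cayley color graph of the permutations of four positions with prefix-reversal generators, takes the subgroup to be the permutations fixing the last position, and verifies that the mapping $\psi(\gamma) = \gamma_j\gamma_i^{-1}\gamma$ carries one coset subgraph onto another while preserving colored edges. The helper names `mul`, `inv`, `coset`, and `colored_edges`, and the tuple encoding of permutations, are assumptions made for this example.

```python
from itertools import permutations

def mul(a, b):
    """Compose permutations stored as tuples: (a*b)[i] = a[b[i]]."""
    return tuple(a[i] for i in b)

def inv(a):
    """Inverse permutation of a."""
    out = [0] * len(a)
    for i, x in enumerate(a):
        out[x] = i
    return tuple(out)

n = 4
group = list(permutations(range(n)))
# Assumed generators: reversal of the first r+1 positions, one color per r.
gens = {r: tuple(reversed(range(r + 1))) + tuple(range(r + 1, n)) for r in (1, 2, 3)}
# Assumed subgroup: all permutations that fix the last position (6 elements).
sub = [g for g in group if g[n - 1] == n - 1]

def coset(g):
    """Left coset g*sub as a frozenset of group elements."""
    return frozenset(mul(g, h) for h in sub)

def colored_edges(vertices):
    """Directed edges (u, v, color) of the subgraph induced on `vertices`."""
    return {(u, mul(u, d), r)
            for u in vertices for r, d in gens.items()
            if mul(u, d) in vertices}

gi, gj = group[0], group[-1]          # representatives of two distinct cosets
ci, cj = coset(gi), coset(gj)
shift = mul(gj, inv(gi))              # psi(g) = gj * gi^{-1} * g

def psi(g):
    return mul(shift, g)

image_edges = {(psi(u), psi(v), r) for (u, v, r) in colored_edges(ci)}
is_isomorphism = ({psi(g) for g in ci} == cj) and (image_edges == colored_edges(cj))
print(is_isomorphism, len(colored_edges(ci)))
```

The check succeeds because left multiplication commutes with right multiplication by a generator, which is exactly the step used in the proof.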
6.1.2 Uniform Merging
Here we show that merging of parallel edges corresponding to coset partitioning 
is uniform. For that purpose, some properties of cosets and their subgraphs are presented 
next.
Lemma 6.1: For every pair of coset subgraphs in $C_\Delta(\Gamma)$, there exists a color preserving
automorphism which maps the first coset subgraph to the second.
Proof: Let $\gamma_1\Lambda$ and $\gamma_2\Lambda$ be two cosets of $\Gamma$. Consider the mapping $\psi$ defined by
$\psi(\gamma) = \gamma_2\gamma_1^{-1}\gamma$, where $\gamma$ is any element in $\Gamma$. Clearly, $\psi$ corresponds to a color preserving
automorphism which maps left coset $\gamma_1\Lambda$ to left coset $\gamma_2\Lambda$. □
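Lemma 6.2: Let $\Gamma$ be a group and let $\Lambda$ be a subgroup of $\Gamma$. Then two elements $\gamma_1$ and
$\gamma_2$ of $\Gamma$ belong to the same left coset relative to $\Lambda$ if and only if $\gamma_1^{-1}\gamma_2 \in \Lambda$.
Proof: First suppose that $\gamma_1^{-1}\gamma_2 \in \Lambda$. Then $\gamma_1^{-1}\gamma_2 = \lambda$ for some element $\lambda \in \Lambda$. It is clear
that $\gamma_1$ and $\gamma_2$ are in the left cosets $\gamma_1\Lambda$ and $\gamma_2\Lambda$, respectively. But $\gamma_2\Lambda = \gamma_1\lambda\Lambda = \gamma_1(\lambda\Lambda)
= \gamma_1\Lambda$. Thus $\gamma_1$ and $\gamma_2$ belong to the same left coset. Next suppose that $\gamma_2 \in \gamma_1\Lambda$. Then
there exists an element $\lambda \in \Lambda$ such that $\gamma_2 = \gamma_1\lambda$. That is, $\gamma_1^{-1}\gamma_2 = \lambda \in \Lambda$. □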
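Lemma 6.2 can be checked exhaustively on a small nonabelian group. The sketch below uses the six permutations of three positions and a two-element subgroup; both choices, and the helper names, are assumptions made only for illustration.

```python
from itertools import permutations

def mul(a, b):
    """Compose permutations stored as tuples: (a*b)[i] = a[b[i]]."""
    return tuple(a[i] for i in b)

def inv(a):
    """Inverse permutation of a."""
    out = [0] * len(a)
    for i, x in enumerate(a):
        out[x] = i
    return tuple(out)

group = list(permutations(range(3)))
sub = {(0, 1, 2), (1, 0, 2)}          # identity and one transposition

def left_coset(g):
    return frozenset(mul(g, h) for h in sub)

# Same left coset  <=>  g1^{-1} * g2 lies in the subgroup.
lemma_holds = all(
    (left_coset(g1) == left_coset(g2)) == (mul(inv(g1), g2) in sub)
    for g1 in group for g2 in group
)
num_cosets = len({left_coset(g) for g in group})
print(lemma_holds, num_cosets)
```

The membership test `mul(inv(g1), g2) in sub` avoids building the cosets at all, which is how the lemma is used in the proofs that follow.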
Next we present a very interesting relationship among the vertices of $C_\Delta(\Gamma)$ which
correspond to parallel, similar color edges between coset subgraphs. Once that relationship
is obtained, merging of similar color edges can be done in a straightforward manner.
Theorem 6.3: Let $\Delta$ be a generating set for group $\Gamma$ and $\Lambda$ be a subgroup of $\Gamma$ generated
by $\Delta_1 \subset \Delta$. Let $\delta_r$ be a generator in $\Delta$ but not in $\Delta_1$. For an arbitrary element $\gamma_1$ of $\Gamma$, let
$\mathcal{A}_r \subseteq \gamma_1\Lambda$ be the largest cardinality subset of the coset $\gamma_1\Lambda$ which satisfies the condition
that all the elements in the set $\mathcal{A}_r\delta_r$ belong to the same left coset of $\Gamma$ relative to $\Lambda$. Then
$\mathcal{A}_r$ is a left coset of $\Gamma$ relative to a subgroup of $\Lambda$. Furthermore, that subgroup is unique.
Proof: Since $\mathcal{A}_r$ is a subset of $\gamma_1\Lambda$, we can write $\mathcal{A}_r = \gamma_1\mathcal{B}_r$, where $\mathcal{B}_r$ is a subset of $\Lambda$.
Let $\theta$ be an arbitrary element of $\mathcal{B}_r$. Let $\mathcal{C}_r = \theta^{-1}\mathcal{B}_r$; that is, $\mathcal{A}_r = \gamma_1\theta\mathcal{C}_r$. According to the
hypothesis of the theorem, all elements in set $\mathcal{A}_r\delta_r$ belong to the same left coset of $\Gamma$
relative to $\Lambda$. Let that left coset be $\gamma_2\Lambda$. Therefore, $\gamma_1\theta\mathcal{C}_r\delta_r \subseteq \gamma_2\Lambda$. Now we make the
following claim.
Claim: $\mathcal{C}_r$ is a group.
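Proof: Since $\mathcal{C}_r = \theta^{-1}\mathcal{B}_r$ and $\theta \in \mathcal{B}_r$, $\mathcal{C}_r$ contains the identity element $e$. Also, as
a result, $\gamma_1\theta\delta_r \in \gamma_2\Lambda$. Let $\varsigma_1$ and $\varsigma_2$ be two arbitrary elements in $\mathcal{C}_r$. Since
$\gamma_1\theta\mathcal{C}_r\delta_r \subseteq \gamma_2\Lambda$, we have $\gamma_1\theta\varsigma_1\delta_r \in \gamma_2\Lambda$. Hence, the elements $\gamma_1\theta\delta_r$ and $\gamma_1\theta\varsigma_1\delta_r$ belong to the
same left coset and therefore (from Lemma 6.2),
$(\gamma_1\theta\delta_r)^{-1}(\gamma_1\theta\varsigma_1\delta_r) = \delta_r^{-1}\varsigma_1\delta_r \in \Lambda$.
Similarly, $\delta_r^{-1}\varsigma_2\delta_r \in \Lambda$. Therefore,
$(\delta_r^{-1}\varsigma_1\delta_r)(\delta_r^{-1}\varsigma_2\delta_r) = \delta_r^{-1}\varsigma_1\varsigma_2\delta_r \in \Lambda$.
That is,
$(\gamma_1\theta\delta_r)^{-1}(\gamma_1\theta\varsigma_1\varsigma_2\delta_r) \in \Lambda$.
Therefore, from Lemma 6.2, both $\gamma_1\theta\delta_r$ and $\gamma_1\theta\varsigma_1\varsigma_2\delta_r$ belong to the same left coset
of $\Gamma$. But $\gamma_1\theta\delta_r$ belongs to left coset $\gamma_2\Lambda$. Therefore, $\gamma_1\theta\varsigma_1\varsigma_2\delta_r$ also belongs to left
coset $\gamma_2\Lambda$. From the definition, we have $\mathcal{C}_r = \theta^{-1}\mathcal{B}_r$ and $\theta \in \mathcal{B}_r$. Therefore, since
$\mathcal{B}_r$ is a subset of $\Lambda$, $\mathcal{C}_r$ is also a subset of $\Lambda$. Now, $\theta \in \mathcal{B}_r$ and $\varsigma_1, \varsigma_2 \in \mathcal{C}_r$. Hence,
$\theta\varsigma_1\varsigma_2 \in \Lambda$, and therefore, $\gamma_1\theta\varsigma_1\varsigma_2 \in \gamma_1\Lambda$. We have already shown that $\gamma_1\theta\varsigma_1\varsigma_2\delta_r \in
\gamma_2\Lambda$. Since $\mathcal{A}_r$ is assumed to be the largest subset of $\gamma_1\Lambda$ with the required
property, $\gamma_1\theta\varsigma_1\varsigma_2$ must belong to $\mathcal{A}_r$. That is, $\gamma_1\theta\varsigma_1\varsigma_2 \in \gamma_1\theta\mathcal{C}_r$. Therefore, $\varsigma_1\varsigma_2 \in \mathcal{C}_r$.
Let $\varsigma$ be any element of $\mathcal{C}_r$. Similar to the above reasoning, we can show $\varsigma^{-1} \in \mathcal{C}_r$.
Hence $\mathcal{C}_r$ is a group.
Now, $\mathcal{A}_r = \gamma_1\theta\mathcal{C}_r$. Therefore, $\mathcal{A}_r$ is a left coset of $\Gamma$ relative to $\mathcal{C}_r$. Since $\mathcal{A}_r$ is the largest
subset of $\gamma_1\Lambda$ with the required property, group $\mathcal{C}_r$ is unique. □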
Definition 6.6: The unique subgroup stated in Theorem 6.3 is denoted by $\Lambda_r$.
The above theorem states the existence of a unique subgroup $\Lambda_r$ of $\Lambda$ for every
color $r$. If for a certain color $r$ there are no parallel edges, that is, if there exists at most
one edge of color $r$ from one coset subgraph (relative to $\Lambda$) to another in $C_\Delta(\Gamma)$, then $\Lambda_r$
is the trivial group $\{e\}$.
Now we will show that merging of parallel edges is uniform for the coset
partitioning. Suppose there are $m_r$ parallel edges of color $r$ from one coset subgraph to
another in $C_\Delta(\Gamma)$. Then, from Theorem 6.3, there is a subgroup $\Lambda_r$ of $\Lambda$ of order $m_r$. Since
$\Lambda_r$ is unique, by symmetry (Lemma 6.1), for every edge of color $r$ in $C_\Delta(\Gamma)$, there are $m_r$
parallel edges. We can merge parallel edges of color $r$ to produce a single edge of color
$r$ and weight $m_r$ in the derived IFG. Thus the merging is uniform. Note that edges
belonging to different colors may have different numbers of parallel edges.
This does not create problems since data transfers corresponding to different color edges are
carried out sequentially.
Definition 6.7: The derived IFG obtained by coset partitioning $C_\Delta(\Gamma)$ relative to subgroup
$\Lambda$ is denoted by $C_\Delta(\Gamma, \Lambda)$.
An edge of color $r$ in $C_\Delta(\Gamma, \Lambda)$ corresponds to $|\Lambda_r|$ color $r$ edges of $C_\Delta(\Gamma)$, where
$\Lambda_r$ is as given in Definition 6.6. From Lagrange's Theorem [113], $|\Lambda_r|$ evenly divides $|\Lambda|$
(recall that $|\Lambda| = k$). Thus, there are $|\Lambda|/|\Lambda_r|$ $r$-neighbors for every vertex of $C_\Delta(\Gamma, \Lambda)$. When
we need to emphasize the correspondence between $C_\Delta(\Gamma)$ and $C_\Delta(\Gamma, \Lambda)$, we will
sometimes denote the vertices of $C_\Delta(\Gamma, \Lambda)$ by cosets $\gamma_i\Lambda$.
Definition 6.8: Let $v_1$ and $v_2$ be two vertices of $C_\Delta(\Gamma, \Lambda)$ such that there is an edge of
color $r$ from $v_1$ to $v_2$. Then $v_2$ is said to be an $r$-neighbor of $v_1$.
In $C_\Delta(\Gamma, \Lambda)$, there may still exist parallel edges of different colors. This occurs
when there are edges from one coset subgraph to another (relative to a subgroup $\Lambda$) in
$C_\Delta(\Gamma)$ belonging to two different colors. We will show how this situation can be easily
tackled. The following lemma from basic group theory is now in order [113].
Lemma 6.3: Let $\Lambda$ be a subgroup of the group $\Gamma$. Let $\gamma$ be an arbitrary element of $\Gamma$.
Then $\gamma^{-1}\Lambda\gamma$ is also a subgroup of $\Gamma$.
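Theorem 6.4: Suppose that there is an edge of color $r$ in parallel with an edge of color
$s$ in $C_\Delta(\Gamma)$ from coset subgraph $\gamma_i\Lambda$ to coset subgraph $\gamma_j\Lambda$ whose initial vertices are $\gamma_i\lambda_r$
and $\gamma_i\lambda_s$ (not necessarily distinct), respectively. Then $\Lambda_s = \lambda_s^{-1}\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s$, where $\Lambda_r$ and $\Lambda_s$
are as defined in Definition 6.6.
Proof: Since $\gamma_i\lambda_r, \gamma_i\lambda_s \in \gamma_i\Lambda$, it is clear that $\lambda_r, \lambda_s \in \Lambda$. Furthermore, since both edges
terminate in $\gamma_j\Lambda$, we get $\gamma_i\lambda_r\delta_r, \gamma_i\lambda_s\delta_s \in \gamma_j\Lambda$. In fact, there are $|\Lambda_r|$ parallel edges of color
$r$ from coset subgraph $\gamma_i\Lambda$ to coset subgraph $\gamma_j\Lambda$ of $C_\Delta(\Gamma)$. Furthermore, subgroup $\Lambda_r$ of
$\Lambda$ satisfies the condition that all the elements in $\Lambda_r\delta_r$ belong to the same coset of $\Gamma$
relative to $\Lambda$. Since $\gamma_i\lambda_r\delta_r \in \gamma_j\Lambda$, it follows that $\delta_r \in \lambda_r^{-1}\gamma_i^{-1}\gamma_j\Lambda$. But $\delta_r \in \Lambda_r\delta_r$ (because
$e \in \Lambda_r$). Therefore, one element of $\Lambda_r\delta_r$ is in the coset $\lambda_r^{-1}\gamma_i^{-1}\gamma_j\Lambda$. Since all elements of
$\Lambda_r\delta_r$ should belong to the same coset, we get $\Lambda_r\delta_r \subseteq \lambda_r^{-1}\gamma_i^{-1}\gamma_j\Lambda$. Hence, $\gamma_i\lambda_r\Lambda_r\delta_r \subseteq \gamma_j\Lambda$.
Therefore, $\gamma_i\lambda_r t\delta_r \in \gamma_j\Lambda$ for any arbitrary element $t \in \Lambda_r$. Hence, both $\gamma_i\lambda_r\delta_r$ and $\gamma_i\lambda_r t\delta_r$
belong to the same coset $\gamma_j\Lambda$, and therefore, from Lemma 6.2,
$(\gamma_i\lambda_r\delta_r)^{-1}(\gamma_i\lambda_r t\delta_r) = \delta_r^{-1}t\delta_r \in \Lambda$.
Furthermore, both $\gamma_i\lambda_r\delta_r$ and $\gamma_i\lambda_s\delta_s$ also belong to the same coset $\gamma_j\Lambda$. Hence,
$(\gamma_i\lambda_r\delta_r)^{-1}(\gamma_i\lambda_s\delta_s) = \delta_r^{-1}\lambda_r^{-1}\lambda_s\delta_s \in \Lambda$.
We have just shown that both $\delta_r^{-1}t\delta_r$ and $\delta_r^{-1}\lambda_r^{-1}\lambda_s\delta_s$ are elements of group $\Lambda$. Therefore,
their product $(\delta_r^{-1}t\delta_r)(\delta_r^{-1}\lambda_r^{-1}\lambda_s\delta_s) = \delta_r^{-1}t\lambda_r^{-1}\lambda_s\delta_s$ is also an element of $\Lambda$. But
$\delta_r^{-1}t\lambda_r^{-1}\lambda_s\delta_s = (\delta_r^{-1}\lambda_r^{-1}\gamma_i^{-1})(\gamma_i\lambda_r t\lambda_r^{-1}\lambda_s\delta_s) = (\gamma_i\lambda_r\delta_r)^{-1}(\gamma_i\lambda_r t\lambda_r^{-1}\lambda_s\delta_s)$.
Therefore, from Lemma 6.2, both $\gamma_i\lambda_r\delta_r$ and $\gamma_i\lambda_r t\lambda_r^{-1}\lambda_s\delta_s$ should belong to the same coset.
Therefore, since $\gamma_i\lambda_r\delta_r$ belongs to coset $\gamma_j\Lambda$, $\gamma_i\lambda_r t\lambda_r^{-1}\lambda_s\delta_s$ should also belong to coset $\gamma_j\Lambda$.
Since $t$ is an arbitrary element of $\Lambda_r$, we can deduce that $\gamma_i\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s\delta_s \subseteq \gamma_j\Lambda$. We have
already seen that $\lambda_r, \lambda_s \in \Lambda$. Since $\Lambda_r$ is a subgroup of $\Lambda$, every element of $\Lambda_r$ must also
be an element of $\Lambda$. Therefore, $\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s \subseteq \Lambda$. In other words, $\gamma_i\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s \subseteq \gamma_i\Lambda$.
Therefore, edges of color $s$ originating from vertices in the set $\gamma_i\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s$ will all
terminate at vertices in the coset $\gamma_j\Lambda$. Clearly, $\gamma_i\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s$ is the largest subset of $\gamma_i\Lambda$ which
satisfies the relation $\gamma_i\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s\delta_s \subseteq \gamma_j\Lambda$. According to Theorem 6.3, $\gamma_i\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s$ must be
a (left) coset of the subgroup $\Lambda_s$ of $\Lambda$. We can write $\gamma_i\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s = \gamma_i\lambda_s(\lambda_s^{-1}\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s)$. From
Lemma 6.3, $\lambda_s^{-1}\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s = (\lambda_r^{-1}\lambda_s)^{-1}\Lambda_r(\lambda_r^{-1}\lambda_s)$ is a subgroup of $\Lambda$. Obviously,
$\Lambda_s = \lambda_s^{-1}\lambda_r\Lambda_r\lambda_r^{-1}\lambda_s$. □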
Corollary: Suppose that there is an edge of color $r$ in parallel with an edge of color $s$ in
$C_\Delta(\Gamma)$ from coset subgraph $\gamma_i\Lambda$ to coset subgraph $\gamma_j\Lambda$. Then $|\Lambda_r| = |\Lambda_s|$.
Therefore, if there are two parallel edges of colors $r$ and $s$ from vertex $v_1$ to vertex
$v_2$ in $C_\Delta(\Gamma, \Lambda)$, then all edges of colors $r$ and $s$ of $C_\Delta(\Gamma, \Lambda)$ will be of the same weight.
Furthermore, it is clear that for every vertex in $C_\Delta(\Gamma, \Lambda)$, the number of $r$-neighbors is
equal to the number of $s$-neighbors. The following theorem states a much stronger result.
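Theorem 6.5: Let a certain edge of color $r$ be parallel with another edge of color $s$ in
$C_\Delta(\Gamma, \Lambda)$. Then for every vertex in $C_\Delta(\Gamma, \Lambda)$, an $r$-neighbor is also an $s$-neighbor, and vice
versa.
Proof: Suppose that there is an edge of color $r$ in parallel with an edge of color $s$ from
vertex $\gamma_1\Lambda$ to vertex $\gamma_2\Lambda$ in $C_\Delta(\Gamma, \Lambda)$. Let the initial vertices of those edges of color $r$ and
$s$ in $C_\Delta(\Gamma)$ be $\gamma_1\lambda_r$ and $\gamma_1\lambda_s$, respectively. Let $\gamma_i\Lambda$ and $\gamma_j\Lambda$ be two arbitrary vertices in
$C_\Delta(\Gamma, \Lambda)$. We need to prove that $\gamma_j\Lambda$ is an $r$-neighbor of $\gamma_i\Lambda$ if and only if $\gamma_j\Lambda$ is an
$s$-neighbor of $\gamma_i\Lambda$. Suppose that $\gamma_j\Lambda$ is an $r$-neighbor of $\gamma_i\Lambda$. Then there exists an element
$\lambda \in \Lambda$ such that $\gamma_i\lambda\delta_r \in \gamma_j\Lambda$. From the hypothesis, both $\gamma_1\lambda_r\delta_r$ and $\gamma_1\lambda_s\delta_s$ belong to the
same left coset $\gamma_2\Lambda$. Therefore, from Lemma 6.2, $(\gamma_1\lambda_r\delta_r)^{-1}(\gamma_1\lambda_s\delta_s) = \delta_r^{-1}\lambda_r^{-1}\lambda_s\delta_s \in \Lambda$. But,
$(\gamma_i\lambda\delta_r)^{-1}(\gamma_i\lambda\lambda_r^{-1}\lambda_s\delta_s) = \delta_r^{-1}\lambda_r^{-1}\lambda_s\delta_s$.
Therefore (from Lemma 6.2), $\gamma_i\lambda\delta_r$ and $\gamma_i\lambda\lambda_r^{-1}\lambda_s\delta_s$ belong to the same left coset. Since
$\gamma_i\lambda\delta_r \in \gamma_j\Lambda$, we have $\gamma_i\lambda\lambda_r^{-1}\lambda_s\delta_s \in \gamma_j\Lambda$. The three elements $\lambda$, $\lambda_r$, and $\lambda_s$ all belong to
subgroup $\Lambda$. Therefore $\lambda\lambda_r^{-1}\lambda_s$ also belongs to $\Lambda$. Hence, $\gamma_i\lambda\lambda_r^{-1}\lambda_s \in \gamma_i\Lambda$. So, there exists
an edge from $\gamma_i\Lambda$ to $\gamma_j\Lambda$ of color $s$. Hence $\gamma_j\Lambda$ is an $s$-neighbor of $\gamma_i\Lambda$. Similarly, we can
show that $\gamma_j\Lambda$ is an $r$-neighbor of $\gamma_i\Lambda$ if $\gamma_j\Lambda$ is an $s$-neighbor of $\gamma_i\Lambda$. □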
The following corollary is a direct consequence of the above two theorems.
Corollary: Let a certain edge of color $r$ be parallel with an edge of color $s$ in $C_\Delta(\Gamma, \Lambda)$.
Then, for every edge of color $r$ in $C_\Delta(\Gamma, \Lambda)$, there exists a distinct parallel edge of color $s$.
Suppose an edge of color $r$ is parallel with an edge of color $s$ in $C_\Delta(\Gamma, \Lambda)$. Data
transfers corresponding to edges of color $s$ can be carried out using paths allocated for
data transfers corresponding to edges of color $r$. We will say that edges of color $s$ can be
masked by edges of color $r$. Therefore, all edges of color $s$ can be removed from $C_\Delta(\Gamma, \Lambda)$
without affecting the target MBS. However, the data transfers corresponding to edges of
color $s$ are still required by the original algorithm. The removal of color $s$ edges does not
mean that those data transfers are nonexistent. What we actually do is let the data
transfers corresponding to color $s$ edges use the connections established for the edges of
color $r$.
6.1.3 Regular Secondary Coloring
Now we show that a regular secondary coloring of the edges of $C_\Delta(\Gamma, \Lambda)$ can be
performed. We will consider two cases. The first case is when $\Lambda$ is a normal subgroup
of $\Gamma$ (see Definition 5.4). In that case no secondary coloring is necessary. The second case
is when $\Lambda$ is not a normal subgroup of $\Gamma$. In that case, secondary coloring may be
necessary.
6.1.3.1 $\Lambda$ is a Normal Subgroup of $\Gamma$
If $\Lambda$ is a normal subgroup of $\Gamma$, then $\Lambda\delta_r = \delta_r\Lambda$ for any generator $\delta_r$. Therefore,
$\gamma\Lambda\delta_r = \gamma\delta_r\Lambda$, $\gamma \in \Gamma$. Therefore, in $C_\Delta(\Gamma)$, all edges of color $r$ originating from one coset
subgraph terminate at the same coset subgraph, for every generator $\delta_r \in \Delta$. In other words, $\Lambda_r = \Lambda$.
Therefore, every vertex of $C_\Delta(\Gamma, \Lambda)$ has only one outgoing edge of color $r$. Thus, a
secondary coloring is not necessary. The following is a well known theorem in group
theory [113].
Theorem 6.6: Let $\Gamma$ be a group and $\Lambda$ be a normal subgroup of it. Then the left (right) cosets
of $\Gamma$ relative to $\Lambda$ form a group denoted by $\Gamma/\Lambda$. □
The group $\Gamma/\Lambda$ is called the quotient group. Let $\Delta_1 \subset \Delta$ be the subset of generators
for $\Lambda$. Then, the derived IFG $C_\Delta(\Gamma, \Lambda)$ is the Cayley color graph associated with group
$\Gamma/\Lambda$ and the generating set $\Delta - \Delta_1$.
Lemma 6.4: The elements in the generating set $\Delta - \Delta_1$ corresponding to the quotient
group $\Gamma/\Lambda$ are non-redundant.
Proof: Suppose that there is a redundant generator $\delta \in \Delta - \Delta_1$ associated with group $\Gamma/\Lambda$.
Also suppose that generator $\delta$ transforms an element in coset $\gamma_i\Lambda$ to an element in coset
$\gamma_j\Lambda$. Then, $\gamma_i\Lambda\delta = \gamma_j\Lambda$. Therefore, element $\gamma_j\Lambda \in \Gamma/\Lambda$ can be obtained from element $\gamma_i\Lambda
\in \Gamma/\Lambda$ by right multiplication by $\delta$. Since $\delta$ is assumed to be a redundant generator for $\Gamma/\Lambda$,
$\gamma_j\Lambda \in \Gamma/\Lambda$ can be obtained from $\gamma_i\Lambda \in \Gamma/\Lambda$ by right multiplying it by $h_1h_2 \ldots h_x$, where $h_i$
($1 \le i \le x$) is an element selected from set $\Delta - \Delta_1$ and their inverses; that is, $\gamma_j\Lambda =
(\gamma_i\Lambda)h_1h_2 \ldots h_x$. Let $\gamma_1$ be an element in coset $\gamma_i\Lambda$. Then $\gamma_1\delta = \gamma_2$ is an element in coset $\gamma_j\Lambda$.
Furthermore, $\gamma_1h_1h_2 \ldots h_x$ is in coset $\gamma_i\Lambda h_1h_2 \ldots h_x = \gamma_j\Lambda$. Since both $\gamma_2$ and $\gamma_1h_1h_2 \ldots h_x$ belong
to the same coset $\gamma_j\Lambda$ and subgraphs corresponding to cosets are assumed to be connected,
it follows that $\gamma_2 = \gamma_1h_1h_2 \ldots h_xg_1g_2 \ldots g_y$, where $g_i$ (for $1 \le i \le y$) is an element selected from
$\Delta_1$ and their inverses. Therefore, $\delta = h_1h_2 \ldots h_xg_1g_2 \ldots g_y$; that is, $\delta$ can be expressed as the
product of the remaining generators in $\Delta$. This is not possible since $\delta$ is not redundant in
$\Delta$ with respect to group $\Gamma$. Hence elements in $\Delta - \Delta_1$ are non-redundant with respect to
the quotient group $\Gamma/\Lambda$. □
Therefore, if $\Lambda$ is a normal subgroup of $\Gamma$, we can find an optimal color
partitioning of the IFG $C_\Delta(\Gamma, \Lambda)$ using the method described in Section 4.2.
Example 6.2: Consider the 16-node Cayley color graph $G_{Q(4)}$ shown in Figure 6.3. The
generators are $\delta_1 = 0001$, $\delta_2 = 0010$, $\delta_3 = 0100$, and $\delta_4 = 1000$. Different line styles are
used for different generators. Also, since $\delta_i = \delta_i^{-1}$, $1 \le i \le 4$, undirected edges are used.
The number of processors needed for the optimal MBS $M_{Q(4)}$ is 16. Let the specified
number of processors be $p = 4$. Then $k = 16/4 = 4$. Let $\Lambda$ be the 4-element subgroup of
$\Gamma_{Q(4)}$ generated by $\{\delta_1, \delta_2\}$. It is clear that $\Lambda$ is a normal subgroup of $\Gamma_{Q(4)}$. Figure 6.4
shows the corresponding coset subgraphs (enclosed in dotted curves). Figure 6.5 shows
the derived IFG and Figure 6.6 shows the corresponding MBS containing four processors
and four buses.
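The coset construction of Example 6.2 can be reproduced in a few lines. The sketch below assumes that the 4-bit labels combine by bitwise XOR (the usual hypercube group), which the example does not state explicitly; the variable names are likewise chosen for illustration only.

```python
# Generators delta_1..delta_4 of the 16-element group of 4-bit labels.
gens = {1: 0b0001, 2: 0b0010, 3: 0b0100, 4: 0b1000}
group = range(16)

# Subgroup generated by delta_1 and delta_2 (assumed group operation: XOR).
sub = {0b0000, 0b0001, 0b0010, 0b0011}

def coset(g):
    """Coset of g relative to sub, as a frozenset of labels."""
    return frozenset(g ^ h for h in sub)

vertices = {coset(g) for g in group}          # vertices of the derived IFG

# For each (coset, color) collect the cosets reached by edges of that color.
derived = {}
for r in (3, 4):                              # generators outside the subgroup
    for g in group:
        derived.setdefault((coset(g), r), set()).add(coset(g ^ gens[r]))

# Normal subgroup: every color leads from a coset to exactly one other coset.
one_target_per_color = all(len(targets) == 1 for targets in derived.values())
print(len(vertices), one_target_per_color)
```

Because the subgroup is normal, each derived vertex has a single outgoing edge per remaining color, so no secondary coloring is needed in this case.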
Figure 6.3: Principal IFG $G_{Q(4)}$.
Figure 6.6: The optimal MBS corresponding to the derived IFG in Figure 6.5.
6.1.3.2 $\Lambda$ is not a Normal Subgroup of $\Gamma$
When $\Lambda$ is not a normal subgroup of $\Gamma$, $\Lambda_r$ can be a proper subgroup of $\Lambda$. In that
case, $C_\Delta(\Gamma, \Lambda)$ has more than one edge of the same color originating from the same
vertex. Thus secondary coloring is necessary. Can we find a regular secondary coloring
for $C_\Delta(\Gamma, \Lambda)$?
Each vertex has $k/|\Lambda_r|$ outgoing edges of (primary) color $r$. We can arbitrarily
assign $k/|\Lambda_r|$ secondary colors to those edges. We can repeat this for every vertex. This
will guarantee that no vertex has more than one outgoing edge of a given primary and a
given secondary color. So, this method guarantees that every vertex has outgoing edges
of distinct colors. But this method does not, in general, guarantee that each vertex has
incoming edges of distinct colors. In the following, we prove that a regular secondary
coloring always exists for $C_\Delta(\Gamma, \Lambda)$.
Theorem 6.7: There exists a regular secondary coloring for $C_\Delta(\Gamma, \Lambda)$.
Proof: We will provide a constructive proof. We will only show how to assign secondary
colors to edges with primary color $r$. The same method can be used with every primary
color. Construct the bipartite graph $G_\gamma$ as follows. For every vertex $v_i$ of $C_\Delta(\Gamma, \Lambda)$, there
exist two vertices $v_i^{in}$ and $v_i^{out}$ in $G_\gamma$, and vice versa. For every edge $(v_i, v_j)$ of color $r$
in $C_\Delta(\Gamma, \Lambda)$, there exists an undirected edge $(v_i^{out}, v_j^{in})$ in $G_\gamma$. It is clear that $G_\gamma$ is a
bipartite graph whose first partite set is $\{v_i^{in} : 1 \le i \le |\Gamma|/|\Lambda|\}$ and whose second partite set is
$\{v_i^{out} : 1 \le i \le |\Gamma|/|\Lambda|\}$. Furthermore, since each vertex of $C_\Delta(\Gamma, \Lambda)$ has $|\Lambda|/|\Lambda_r|$ incoming edges
and $|\Lambda|/|\Lambda_r|$ outgoing edges of color $r$, $G_\gamma$ is a regular bipartite graph. Therefore, $G_\gamma$ has a
perfect matching [30].
Let $M_1$ be a perfect matching in $G_\gamma$. For every element $(v_i^{out}, v_j^{in}) \in M_1$, assign
secondary color 1 to edge $(v_i, v_j)$ of $C_\Delta(\Gamma, \Lambda)$. Now remove all edges of $G_\gamma$ belonging
to $M_1$. Bipartite graph $G_\gamma$ remains regular since the degree of each vertex is one less than
before. Let $M_2$ be a perfect matching of $G_\gamma$ after the removal of the edges. Assign
secondary color 2 to every edge of $C_\Delta(\Gamma, \Lambda)$ corresponding to $M_2$. Repeat this procedure
until all edges of $G_\gamma$ are exhausted.
We will now show that the assigned secondary coloring is regular. There is exactly
one edge in matching $M_l$, $l = 1, 2, \ldots$, which is incident with vertex $v_i^{out}$ of $G_\gamma$.
Therefore, there is exactly one edge of primary color $r$ and secondary color $l$ directed
away from every vertex of $C_\Delta(\Gamma, \Lambda)$. Similarly, there is exactly one edge of primary color
$r$ and secondary color $l$ directed towards every vertex of $C_\Delta(\Gamma, \Lambda)$. Thus the secondary
coloring is regular. □
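The constructive proof translates directly into a procedure: repeatedly extract a perfect matching from the regular bipartite graph and use the round number as the secondary color. The sketch below is one possible realization, assuming a simple digraph (no repeated edges) and using a standard augmenting-path matching routine; the function names and the complete four-vertex digraph used as input are assumptions for illustration, not the dissertation's own code.

```python
def perfect_matching(adj, n):
    """Augmenting-path (Kuhn) matching; adj[u] is the set of 'in' vertices
    reachable from 'out' vertex u. Returns a set of (u, v) pairs."""
    match_in = {}                      # 'in' vertex -> matched 'out' vertex

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_in or try_augment(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    for u in range(n):
        if not try_augment(u, set()):
            raise ValueError("no perfect matching: graph is not regular")
    return {(u, v) for v, u in match_in.items()}

def secondary_coloring(n, edges):
    """Assign secondary colors 1, 2, ... to the directed edges of one primary
    color so that every vertex has exactly one outgoing and one incoming edge
    of each secondary color."""
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v)
    coloring, color = {}, 0
    while any(adj.values()):           # edges remain
        color += 1
        for u, v in perfect_matching(adj, n):
            coloring[(u, v)] = color
            adj[u].remove(v)           # graph stays regular, degree drops by one
    return coloring

# Assumed input: complete digraph on 4 vertices (3 outgoing edges per vertex).
edges = [(i, j) for i in range(4) for j in range(4) if i != j]
coloring = secondary_coloring(4, edges)
print(sorted(coloring.items()))
```

Each round removes one edge at every "out" and every "in" vertex, so regularity is preserved and the loop terminates after as many rounds as the common degree.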
The proof of the above theorem also describes a method to find a regular coloring
for $C_\Delta(\Gamma, \Lambda)$. Next we provide an example for the case when $\Lambda$ is not a normal subgroup
of $\Gamma$.
Example 6.3: Figure 6.7 shows the Cayley color graph associated with the symmetric
group with four symbols, denoted by $\Gamma_{P(4)}$, and three generators. The four symbols are a,
b, c, and d. The three generators are $\delta_1 = bacd$, $\delta_2 = cbad$, and $\delta_3 = dcba$. Let $\Delta_{P(4)} = \{\delta_1,
\delta_2, \delta_3\}$. Since each generator is its own inverse, we represent bidirectional edges by
undirected edges for convenience. Solid, broken, and dotted lines correspond to the
generators $\delta_1$, $\delta_2$, and $\delta_3$, respectively. In fact, the graph in Figure 6.7 is the pancake graph
associated with four symbols (if we ignore the edge colors). Since $C_{\Delta_{P(4)}}(\Gamma_{P(4)})$ has 24
vertices, the optimal multiple bus system $M(\Gamma_{P(4)}, \Delta_{P(4)})$ requires 24 processors and 24
buses. Let us construct an MBS with only four processors, that is, $p = 4$. Therefore, we
need to partition the vertex set of $C_{\Delta_{P(4)}}(\Gamma_{P(4)})$ into four subsets such that each subset has
6 vertices. Clearly, {abcd, bacd, cabd, acbd, bcad, cbad} is a six-element subgroup of $\Gamma_{P(4)}$.
Let that subgroup be $\Lambda$. The subgraph corresponding to $\Lambda$ has six vertices and six edges.
Coset subgraphs of $C_{\Delta_{P(4)}}(\Gamma_{P(4)})$ relative to $\Lambda$ are the hexagons in Figure 6.7 connected
by solid and broken lines. There are two dotted line edges (edges corresponding to
generator $\delta_3$) from each coset subgraph to every other coset subgraph. Figure 6.8 shows
$C_{\Delta_{P(4)}}(\Gamma_{P(4)}, \Lambda)$ after merging the parallel edges. Even though we could have used
undirected edges to represent bidirectional edges, we need to explicitly represent directed
edges in order to assign secondary colors.
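The counts quoted in this example (four coset subgraphs of six vertices each, and two parallel dotted edges between every ordered pair of them) can be confirmed by enumeration. The sketch below assumes that a generator acts on a vertex label by reversing a prefix of the string, and that the coset of a vertex relative to $\Lambda$ is identified by its last symbol (since $\Lambda$ consists of the arrangements ending in d); both conventions are assumptions consistent with the listed subgroup rather than statements from the text.

```python
from collections import Counter
from itertools import permutations

symbols = "abcd"
group = ["".join(p) for p in permutations(symbols)]

def flip(s, m):
    """Reverse the first m symbols of s (assumed action of a generator)."""
    return s[:m][::-1] + s[m:]

def coset_of(s):
    """Assumed coset label: the last symbol of the arrangement."""
    return s[-1]

coset_sizes = Counter(coset_of(s) for s in group)

# delta_1 and delta_2 (prefix lengths 2 and 3) never leave a coset subgraph.
stay_inside = all(coset_of(flip(s, m)) == coset_of(s) for s in group for m in (2, 3))

# delta_3 (full reversal): count edges between each ordered pair of cosets.
parallel = Counter((coset_of(s), coset_of(flip(s, 4))) for s in group)
print(dict(coset_sizes), stay_inside, set(parallel.values()))
```

Every ordered pair of distinct cosets receives the same count, which is the uniform merging property established in Section 6.1.2.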
Figure 6.7: The principal IFG $C_{\Delta_{P(4)}}(\Gamma_{P(4)})$.
Figure 6.8: Derived IFG $C_{\Delta_{P(4)}}(\Gamma_{P(4)}, \Lambda)$ without secondary colors.
Each directed edge in Figure 6.8 has a weight of 2. This is because one edge of
$C_{\Delta_{P(4)}}(\Gamma_{P(4)}, \Lambda)$ is obtained by merging two edges of $C_{\Delta_{P(4)}}(\Gamma_{P(4)})$. In Figure 6.8, each
vertex has 3 outgoing edges of the same (primary) color. Figure 6.9 shows the
corresponding $C_{\Delta_{P(4)}}(\Gamma_{P(4)}, \Lambda)$ after assigning a regular secondary coloring. Solid, dotted,
and broken lines correspond to the three secondary colors. It can be noticed that the
derived IFG in Figure 6.9 is vertex symmetric. Therefore, according to Theorem 4.2, the
graph in Figure 6.9 is the Cayley color graph associated with a group and its generating
set. Since in general $C_\Delta(\Gamma, \Lambda)$ may not be vertex symmetric, we should not pay attention
to the group and its generating set for the IFG in Figure 6.9. What we are concerned with
is its regularity. Clearly, the derived IFG in Figure 6.9 belongs to the class defined in Definition
4.4. Therefore, an optimal color partition can be found using Algorithm 4.1. The
subgraphs induced by the subsets of an optimal color partition of $C_{\Delta_{P(4)}}(\Gamma_{P(4)}, \Lambda)$ are
shown in Figure 6.10. Figure 6.11 shows the corresponding multiple bus system. It has
4 processors and 4 buses.
Figure 6.9: The IFG $C_{\Delta_{P(4)}}(\Gamma_{P(4)}, \Lambda)$ with secondary colors.
It is to be noted that, irrespective of whether or not $\Lambda$ is a normal subgroup of $\Gamma$,
the optimal MBS realizing $C_\Delta(\Gamma, \Lambda)$ has the same number of processors and buses.
Therefore, when we decrease the number of processors, the number of buses will
also decrease by the same factor.




Figure 6.10: Optimal color partition of the derived IFG in Figure 6.9.
6.1.4 Scheduling
Once the MBS has been constructed, processors must be properly scheduled in
order to run the source algorithm(s) at the maximum speed. Computations executed by
a processor correspond to vertices in the original CFG. The set of computations (one
computation from each level) represented by a single vertex in the principal IFG $C_\Delta(\Gamma)$
is a subtask (see Definition 2.2). Since one vertex of $C_\Delta(\Gamma, \Lambda)$ corresponds to a set of
vertices of $C_\Delta(\Gamma)$, each processor in the target MBS is assigned a set of subtasks. A certain
processor $P$ will first perform computation 1 (computation at level 1) of each of its
subtasks. We will say that $P$ is executing its subtasks at level 1. Concurrently, every other
processor must be executing computations at level 1 of all of its subtasks. After every
processor has finished executing its subtasks at level 1, they must execute all their
subtasks at level 2 (computation 2 of each of its subtasks), and so on. At a certain level,
what is the order of execution of the subtasks by a processor? In the optimal MBS $M(\Gamma,
\Delta)$, each processor was assigned a single subtask. Therefore, the problem of the order of
execution of the subtasks did not arise there.
Figure 6.11: The MBS corresponding to the color partition in Figure 6.10.
If the communications are ignored, the order of execution of the subtasks is
immaterial provided that every processor is executing at the same level concurrently. The
governing factor for proper ordering of the subtasks is the communication. Therefore, we
need to order the subtasks according to the way communications are performed. We need
only to consider processors executing at a particular level $r$. The same scheduling policy
can be applied to every level. Let $P_i$ be the processor assigned to vertex $v_i$ of $C_\Delta(\Gamma, \Lambda)$,
$1 \le i \le p$. Each vertex $v_i$ of $C_\Delta(\Gamma, \Lambda)$ corresponds to a coset $\gamma_i\Lambda$ of group $\Gamma$. There are $|\Lambda_r|$
parallel edges originating from coset subgraph $\gamma_i\Lambda$ of $C_\Delta(\Gamma)$. All these edges are
represented by a single edge of $C_\Delta(\Gamma, \Lambda)$ with primary color $r$ and a certain secondary
color. Therefore, a proper scheduling should be such that the tails of the aforementioned
$|\Lambda_r|$ parallel edges correspond to consecutive subtasks executed by processor $P_i$. Towards
that end, we give the following definition.
Definition 6.9: Let $(\psi^*(x), \psi^*(y))$ be the edge in the derived IFG $C_\Delta(\Gamma, \Lambda)$ corresponding
to edge $(x, y)$ in the principal IFG $C_\Delta(\Gamma)$. Then, we define
$TS(i, l, r) = \{\gamma : \gamma \in \gamma_i\Lambda,\ \text{edge } (\psi^*(\gamma), \psi^*(\gamma\delta_r)) \text{ is of secondary color } l\}$
With the above definition, we can describe processor scheduling as follows.
Consider the processors executing at level $r$. Each processor $P_i$ will execute the subtasks
belonging to set $TS(i, l, r)$ sequentially. Suppose that the two processors $P_{i_1}$ and $P_{i_2}$ are
executing the subtasks belonging to $TS(i_1, l, r)$ and $TS(i_2, l, r)$, respectively, at level $r$.
When communicating at level $r$, these two processors should not be allowed to use the
same bus simultaneously. The bus processor $P_{i_1}$ is using is the one assigned to the edge
of primary color $r$ and secondary color $l$ originating from vertex $v_{i_1}$. Also, the bus
processor $P_{i_2}$ is using is the one assigned to the edge of primary color $r$ and secondary
color $l$ originating from vertex $v_{i_2}$. Color partitioning guarantees that those two buses are
distinct. Therefore, no bus conflicts will occur. This scheduling can be used for every
level $r$ for which there is a primary color. For levels where there is no primary color (this
happens when that color was masked by another primary color), the same
scheduling is followed: all processors will perform computations concurrently and communications
concurrently. There will be no bus conflicts. Furthermore, bus utilization will be uniform.
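To make Definition 6.9 concrete, the following sketch computes the sets $TS(i, l, r)$ for the four-symbol graph of Example 6.3, with $r$ taken to be the full-reversal generator. It assumes that coset $i$ consists of the arrangements ending in the $i$-th symbol and that secondary color $l$ is assigned to the derived edge from coset $i$ to coset $(i + l) \bmod 4$. That cyclic assignment is one regular secondary coloring chosen here for illustration; it is not necessarily the one drawn in Figure 6.9.

```python
from itertools import permutations

symbols = "abcd"
group = ["".join(p) for p in permutations(symbols)]

def task_set(i, l):
    """TS(i, l, r) for r = full reversal: subtasks of processor i whose
    color-r edge ends in coset (i + l) mod 4 (assumed secondary color l).
    Reversing s moves its first symbol to the end, so the target coset is s[0]."""
    target = symbols[(i + l) % 4]
    return sorted(s for s in group if s[-1] == symbols[i] and s[0] == target)

# Processor i runs TS(i, 1, r), then TS(i, 2, r), then TS(i, 3, r).
schedule = {i: [task_set(i, l) for l in (1, 2, 3)] for i in range(4)}

# In step l the four processors address four different destination cosets.
destinations_distinct = all(
    len({(i + l) % 4 for i in range(4)}) == 4 for l in (1, 2, 3)
)
for i in range(4):
    print(i, schedule[i])
```

Subtasks whose outgoing edges were merged into the same derived edge end up adjacent in each processor's order, which is the property the scheduling policy asks for.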
6.2 MBS with Given Number of Buses
If the optimal number of buses is large, it may be very costly and impractical. If
the source algorithm requires more computation time and less communication time, then
reduction of communication hardware at the cost of some speed will be well justified. For
example, suppose that halving the number of buses results in a 5 percent decrease in speed.
Then it may be quite acceptable to reduce the number of buses by half at the cost of a little
extra time. Notice that, as shown in Section 6.1, reducing the number of processors
automatically reduces the number of buses by the same factor. The purpose of this section
is to analyze how to reduce the number of buses without altering the number of
processors.
We will denote the specified number of buses by $b$. Also, we assume that
$b < \beta(G)$. The construction of an MBS with $b$ buses realizing a given IFG $G$ is equivalent
to the partition of the edge set of $G$ into $b$ subsets such that data transfers corresponding
to the edges in each subset are carried out by a single bus. Since we assume that $b < \beta(G)$,
a subset of the partition may contain more than one edge from the same color. Thus the
partition does not correspond to a color partition (see Definition 3.4). According to our
model, all data transfers associated with similar color edges of the IFG can be performed
concurrently without affecting the integrity of the source algorithm. Therefore, when
$b < \beta(G)$, some of the concurrent data transfers implied by the source algorithm must be
carried out sequentially by the target MBS.
We have already proved that the problem of designing an optimal MBS with $\beta(G)$
buses to realize an arbitrary IFG $G$ is NP-Hard. Therefore, the problem of designing an
MBS with $b$ buses is also NP-Hard. If one needs to solve the general problem,
the heuristic algorithm given in Section 3.7 can be easily modified. As we have already
mentioned, the general problem is of no practical interest. Therefore, we do not attempt to
solve the general color partition problem here. We assume that the IFG is vertex
symmetric.
In Section 6.1, we showed that, in order to use processors efficiently, the number
of processors used must be a factor of the optimal number of processors. Here, we will
establish a similar result for the number of buses. When one processor is transferring data
corresponding to color $r$, every processor is transferring data corresponding to color $r$.
Suppose that the number of buses $b$ in the target MBS is one less than the optimal number
of buses $\beta(G)$. Then only $b$ processors can simultaneously send data to their $r$-neighbors.
The remaining processor, say $P_i$, must send its data to its $r$-neighbor after all other
processors have finished their data transfers corresponding to the $r$th interconnection
function. In order to preserve the integrity of the algorithm, while $P_i$ is transferring data
to its $r$-neighbor, other processors cannot be involved in communication transactions. Thus,
during that period, $b - 1$ buses must stay idle. We could have used $\lceil \beta(G)/2 \rceil$ buses and
obtained the same communication overhead. The following theorem generalizes this idea.
Theorem 6.8: If the specified number of buses is $b$, then only $\lceil \beta(G)/k \rceil$ buses can be
used without speed penalty, where $k = \lceil \beta(G)/b \rceil$.
Proof: If we partition the edge set of $C_\Delta(\Gamma)$ into $b$ subsets, at least one subset would
contain at least $\lceil \beta(G)/b \rceil = k$ similar color edges. Therefore, to perform data transfers
corresponding to a single interconnection function, at least $k$ communication steps are
necessary. Therefore, the actual number of buses needed is the minimum number of buses
required to perform a single interconnection function in $k$ steps. The minimum number
of disjoint subsets which can be formed without any subset exceeding $k$ edges from a
single color is $\lceil \beta(G)/k \rceil$. □
For example, suppose that $\beta(G) = 9$ and $b = 4$. Then $k = \lceil 9/4 \rceil = 3$. Therefore,
$\lceil 9/3 \rceil = 3$ buses are actually needed. Communications which can be performed by 4 buses
can also be performed by 3 buses without loss of speed; and fewer than 3 buses cannot
perform the communication without loss of speed. If $\beta(G)/k$ is not an integer, not all subsets
in the partition will have the same number of edges and the target MBS will be
heterogeneous. We have already mentioned many advantages of regularity and symmetry
of an MBS. Therefore, in order to maintain those regular properties of the target MBS, we
always assume that $b$ is a factor of $\beta(G)$.
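The arithmetic of Theorem 6.8 is easy to tabulate. The helper below (its name and signature are hypothetical) returns the number of communication steps $k$ and the number of buses that can actually be used without a speed penalty, using integer ceiling division to avoid floating-point rounding.

```python
def effective_buses(beta, b):
    """Return (k, buses) where k = ceil(beta / b) is the number of steps per
    interconnection function and buses = ceil(beta / k) is the number of buses
    that suffice for that many steps."""
    k = -(-beta // b)          # ceiling division on integers
    return k, -(-beta // k)

print(effective_buses(9, 4))   # the worked example: beta(G) = 9, b = 4
print(effective_buses(9, 8))   # one bus fewer than optimal
print(effective_buses(24, 6))  # b divides beta(G), so all b buses are used
```

When $b$ divides $\beta(G)$ the second component equals $b$, which is the case the text restricts itself to from this point on.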
By decreasing the number of buses by a factor of $k$, we reduce the cost of the
buses by the same factor. Decreasing the number of buses would have no effect on the
cost of the processors. Note that, by decreasing the number of interfaces, the cost of a
single processor would decrease due to the reduced number of ports. But we do not
consider this effect in this research. Reduction of buses may automatically make some of
the interfaces redundant. However, decreasing the number of buses by a certain factor will
not necessarily decrease the number of interfaces by the same factor. Our optimality
criterion for partitioning the edge set of the IFG into $b$ subsets would be to minimize the
number of interfaces. In other words, we need to find a partition $\pi$ of the edge set of
$C_\Delta(\Gamma)$ of cardinality $b$ such that each subset contains $k$ ($= |\Gamma|/b$) edges from each color and
$|E(\pi(C_\Delta(\Gamma)))|$ is minimum.
Unlike the case for $b = \beta(C_\Delta(\Gamma)) = |\Gamma|$, there is no straightforward method to
perform an optimal partition of $C_\Delta(\Gamma)$ for an arbitrary $b$. This rather unexpected nature of
Cayley color graphs will be clarified using a very simple example.
Figure 6.12: Principal IFG $C_{\Delta_{P(3)}}(\Gamma_{P(3)})$.
Example 6.4: Figure 6.12 shows the IFG $C_{\Delta_{P(3)}}(\Gamma_{P(3)})$, which is the Cayley color graph
associated with the symmetric group $\Gamma_{P(3)} = \{123, 213, 132, 312, 231, 321\}$ and its
generating set $\Delta_{P(3)} = \{213, 321\}$. We will denote 213 and 321 by $\delta_1$ and $\delta_2$, respectively.
Solid lines are used for generator $\delta_1$ (color 1) and broken lines are used for generator $\delta_2$
(color 2). If we replace bidirectional edges in Figure 6.12 with undirected edges and
ignore colors, what we get is the pancake (or star) graph of three symbols. The optimal
color partition of $C_{\Delta_{P(3)}}(\Gamma_{P(3)})$ contains six 2-element subsets such that for each subset $E_i$,
$|J(E_i)| = 3$. Now consider the problem of partitioning the edge set of $C_{\Delta_{P(3)}}(\Gamma_{P(3)})$ into
three subsets instead of six. We need to find three edge disjoint subsets $E_1$, $E_2$, and $E_3$























Figure 6.13: Optimal partition of $C_{\Delta_{P(3)}}(\Gamma_{P(3)})$ into three subsets.
subset $E_i$ of $E(C_{\Delta_{P(3)}}(\Gamma_{P(3)}))$ containing 2 edges from each color, the minimum value of
$|J(E_i)|$ is 5. One can also verify (possibly by exhaustive search) that $E(C_{\Delta_{P(3)}}(\Gamma_{P(3)}))$
cannot be partitioned into three subsets $E_1$, $E_2$, and $E_3$ such that $|J(E_1)| = |J(E_2)| = |J(E_3)|$
= 5. Figure 6.13 shows an optimal partition, where $|J(E_1)| = |J(E_2)| = 5$ and $|J(E_3)| = 6$.
Therefore, the optimal partition of $C_{\Delta_{P(3)}}(\Gamma_{P(3)})$ results in three induced subgraphs which
are not isomorphic to one another (recall that, for the optimal color partition $\phi_{\Gamma,\Delta}$, induced
subgraphs are isomorphic to one another). Figure 6.14 shows the MBS corresponding to
the optimal partition shown in Figure 6.13. In the MBS shown in Figure 6.14, bus 1 is
connected to 5 processors, whereas bus 3 is connected to only 3 processors. Furthermore,
processor 123 has two drivers and two receivers, whereas processor 321 has one driver
and one receiver. Therefore, although the IFG is vertex symmetric, the optimal MBS is
not even regular. This outcome is rather unexpected.
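The exhaustive search mentioned in Example 6.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cost |J(E)| of an edge subset is taken to be the number of distinct sending vertices (drivers) plus the number of distinct receiving vertices (receivers), and the names `act`, `interfaces`, `labelled_pair_splits`, `best_single_subset`, and `best_three_bus_partition` are hypothetical helpers, not notation from the dissertation.

```python
from itertools import combinations, product

GROUP = ["123", "213", "132", "312", "231", "321"]   # Gamma_P(3)
GENERATORS = {1: "213", 2: "321"}                     # color -> generator


def act(x, g):
    """Right-multiply x by g: position i of the result is x[g[i] - 1]."""
    return "".join(x[int(c) - 1] for c in g)


# Directed edges x -> x*delta of the Cayley color graph, grouped by color.
EDGES = {color: [(x, act(x, g)) for x in GROUP]
         for color, g in GENERATORS.items()}


def interfaces(edges):
    """Assumed cost |J(E)|: distinct drivers plus distinct receivers."""
    return len({u for u, _ in edges}) + len({v for _, v in edges})


def labelled_pair_splits(items):
    """Yield every split of six items into three labelled pairs."""
    for first in combinations(items, 2):
        rest = [e for e in items if e not in first]
        for second in combinations(rest, 2):
            third = tuple(e for e in rest if e not in second)
            yield first, second, third


def best_single_subset(per_color):
    """Minimum |J(E)| over subsets with `per_color` edges of each color."""
    return min(interfaces(c1 + c2)
               for c1 in combinations(EDGES[1], per_color)
               for c2 in combinations(EDGES[2], per_color))


def best_three_bus_partition():
    """Search all partitions into three subsets with 2 edges per color each."""
    best = None
    for split1, split2 in product(labelled_pair_splits(EDGES[1]),
                                  labelled_pair_splits(EDGES[2])):
        profile = sorted(interfaces(a + b) for a, b in zip(split1, split2))
        if best is None or sum(profile) < sum(best):
            best = profile
    return best
```

Under this cost model, `best_single_subset(1)` gives 3 and `best_single_subset(2)` gives 5, and `best_three_bus_partition()` returns the profile `[5, 5, 6]`, in agreement with the values quoted in the example.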
According to the above example, the optimal MBS with $b$ buses corresponding to
a vertex symmetric IFG can be heterogeneous. This undermines the whole purpose of
analyzing vertex symmetric IFGs. Besides, when the Cayley color graph is complex, it
is unlikely that an optimal partition can be found using group properties. Therefore, we
seek the best partition under the requirement of symmetry (even at the expense of
optimality). We will utilize the symmetric properties of $M(\Gamma, \Delta)$ to construct an MBS with
$b$ buses.
Figure 6.14: The MBS corresponding to the optimal partition in Figure 6.13.
Our approach is to combine the buses in $M(\Gamma, \Delta)$ in order to convert it into an
MBS with $b$ buses. This is equivalent to the partition of the buses in $M(\Gamma, \Delta)$ into
$k$-element subsets. In doing so, we will reduce the number of buses from $|\Gamma|$ to $b$. Let
processor $P$ be connected to buses $B_1$ and $B_2$ via two drivers. If we combine the two
buses together, only one driver is necessary. In other words, by combining the two buses
$B_1$ and $B_2$, we save a driver.
As was done in Section 6.1, we will use subgroups of $\Gamma$ to combine buses. Let $\Lambda$
be a subgroup of $\Gamma$ whose size is $|\Gamma|/b = k$. We will combine the buses of $M(\Gamma, \Delta)$ to
form the MBS $M(\Gamma, \Delta, \Lambda)$ such that all buses of $M(\Gamma, \Delta)$ belonging to the set $\{B_\gamma : \gamma \in \gamma_i\Lambda\}$
are replaced by the single bus $B_i$ in $M(\Gamma, \Delta, \Lambda)$. If there is more than one subgroup of
size $|\Gamma|/b$, then the one which saves the maximum number of interfaces must be chosen.
In the following, we show how to determine the number of interfaces saved by the MBS
$M(\Gamma, \Delta, \Lambda)$.
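Let $E_1, E_2, \ldots, E_{|\Gamma|}$ be the subsets of edges produced by the optimal color partition
$\phi_{\Gamma,\Delta}$ (see Definition 4.5). Also, let $H_j$ be the subgraph induced by $E_j$, $1 \le j \le |\Gamma|$.
Furthermore, let $\gamma_1\Lambda, \gamma_2\Lambda, \ldots, \gamma_b\Lambda$ be the distinct left cosets of $\Gamma$ relative to $\Lambda$. Consider
the set of edges $\bar{E}_i$ defined by $\bar{E}_i = \bigcup_j E_j$, where the union is taken over the subsets $E_j$
whose buses in $M(\Gamma, \Delta)$ are combined into bus $B_i$. Clearly, $\bar{E}_i$ contains the edges of $C_\Delta(\Gamma)$
assigned to bus $B_i$. As we have observed in Section 4.2, the $t$-th vertex of $H_j$ corresponds
to a driver (receiver) if $t$ is even (odd). Now, $\gamma_i\Lambda$ represents the set of 0th vertices of the
induced subgraphs $H_j$ with $E_j \subseteq \bar{E}_i$. Also, $\gamma_i\Lambda\delta_1\delta_2^{-1}$ represents the set of 2nd vertices of
these induced subgraphs, and so on. Therefore, the number of drivers connected
to bus $B_i$ of $M(\Gamma, \Delta, \Lambda)$ is equal to the cardinality of the set $D_i$, where
$$D_i = (\gamma_i\Lambda) \cup (\gamma_i\Lambda\delta_1\delta_2^{-1}) \cup (\gamma_i\Lambda\delta_1\delta_2^{-1}\delta_3\delta_4^{-1}) \cup \cdots \cup (\gamma_i\Lambda\delta_1\delta_2^{-1}\cdots\delta_{2\lfloor|\Delta|/2\rfloor}^{-1}).$$
Similarly, the number of receivers connected to bus $B_i$ is equal to the cardinality of the set
$R_i$, where
$$R_i = (\gamma_i\Lambda\delta_1) \cup (\gamma_i\Lambda\delta_1\delta_2^{-1}\delta_3) \cup \cdots \cup (\gamma_i\Lambda\delta_1\delta_2^{-1}\delta_3\cdots\delta_{2\lfloor(|\Delta|-1)/2\rfloor}^{-1}\,\delta_{2\lfloor(|\Delta|-1)/2\rfloor+1}).$$
Thus, the total number of interfaces in the MBS $M(\Gamma, \Delta, \Lambda)$ is
$$\sum_{i=1}^{b} (|D_i| + |R_i|) = b(|D_i| + |R_i|).$$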
With the help of Lemma 6.1, it is easy to verify that the MBS $M(\Gamma, \Delta, \Lambda)$ is
symmetric. Thus, we have constructed a symmetric MBS with $b = |\Gamma|/|\Lambda|$ buses and $|\Gamma|$
processors. That MBS has $|\Gamma|(|\Delta| + 1) - b(|D_i| + |R_i|)$ fewer interfaces compared with the
optimal MBS $M(\Gamma, \Delta)$. The following example illustrates the analysis done in this section.
Example 6.5: We will revisit the IFG $C_{\Delta_{P(3)}}(\Gamma_{P(3)})$ shown in Figure 6.12. Figure 6.15
shows the six induced subgraphs of the optimal color partition $\phi_{\Gamma_{P(3)},\Delta_{P(3)}}$. Let $\Lambda$ =
{123, 213}, that is, $\Lambda = \{e, \delta_1\}$. Induced subgraphs of the three subsets $\bar{E}_1$, $\bar{E}_2$, and $\bar{E}_3$,
denoted by $\bar{H}_1$, $\bar{H}_2$, and $\bar{H}_3$, respectively, are shown in Figure 6.16. Notice that $\bar{E}_1 =
E_1 \cup E_2$, $\bar{E}_2 = E_3 \cup E_4$, and $\bar{E}_3 = E_5 \cup E_6$. Isomorphism of the induced subgraphs can
be easily observed. Figure 6.17 shows the corresponding MBS $M(\Gamma, \Delta, \Lambda)$. Clearly, it has
18 interfaces (12 drivers and 6 receivers), in contrast to the 16 interfaces required by the
optimal MBS shown in Figure 6.14. In compensation, we gain symmetry. Notice
that bus loading and the maximum number of ports per processor are also better in the
system of Figure 6.17.
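The interface count in Example 6.5 can be reproduced with a short Python sketch of the coset construction. It is a minimal illustration, not the author's procedure: the helpers `act`, `inverse`, `left_cosets`, `drivers_and_receivers`, and `interface_count` are hypothetical names, and the sketch assumes that even-numbered generators enter the vertex walk through their inverses (which makes no difference here, since both generators are their own inverses).

```python
GROUP = ["123", "213", "132", "312", "231", "321"]   # Gamma_P(3)
GENERATORS = ["213", "321"]                           # delta_1, delta_2


def act(x, g):
    """Right-multiply x by g: position i of the result is x[g[i] - 1]."""
    return "".join(x[int(c) - 1] for c in g)


def inverse(g):
    """Return the element h with act(g, h) == '123'."""
    return "".join(str(g.index(c) + 1) for c in "123")


def left_cosets(subgroup):
    """Distinct left cosets gamma*Lambda, in order of first appearance."""
    cosets = []
    for gamma in GROUP:
        coset = frozenset(act(gamma, lam) for lam in subgroup)
        if coset not in cosets:
            cosets.append(coset)
    return cosets


def drivers_and_receivers(coset):
    """Even-numbered vertex sets are drivers, odd-numbered ones receivers."""
    current = set(coset)
    drivers, receivers = set(current), set()
    for t, delta in enumerate(GENERATORS, start=1):
        # Assumption: odd steps multiply by delta_t, even steps by its inverse.
        step = delta if t % 2 == 1 else inverse(delta)
        current = {act(x, step) for x in current}
        if t % 2 == 1:
            receivers.update(current)
        else:
            drivers.update(current)
    return drivers, receivers


def interface_count(subgroup):
    """Return (number of buses, total drivers, total receivers)."""
    cosets = left_cosets(subgroup)
    total_drivers = total_receivers = 0
    for coset in cosets:
        drivers, receivers = drivers_and_receivers(coset)
        total_drivers += len(drivers)
        total_receivers += len(receivers)
    return len(cosets), total_drivers, total_receivers
```

Calling `interface_count(["123", "213"])` yields 3 buses with 12 drivers and 6 receivers, matching the 18 interfaces stated in the example; the trivial subgroup `["123"]` yields the 6 buses and 18 interfaces of the uncombined system.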
Figure 6.16: Induced subgraphs obtained by combining those in Figure 6.15.
Figure 6.17: The MBS corresponding to the induced subgraphs in Figure 6.16.
CHAPTER 7
SUMMARY AND CONCLUSIONS
Applicability of multiple bus systems as special purpose architectures is relatively 
unexplored in the literature. In this dissertation, we developed a methodology to design
a multiple bus architecture which favors a given class of parallel algorithms. The
algorithmic class we considered here reflects the communication pattern of the member
algorithms. We represented the source algorithm (or the algorithmic class) as a labeled,
edge colored, directed graph called the CFG. The target MBS was represented as a directed
bipartite graph. According to the model we assumed, the target MBS operated in the
SIMD mode and used the message passing model for interprocessor communication.
We performed the construction of an optimal MBS from the given CFG in two
stages. In the first stage, which was addressed in Chapter 2, we constructed another
graph, the IFG, from the given CFG. The IFG reflects the communication pattern among
processors in the target architecture that can efficiently execute any algorithm belonging
to the given class. We proved that the construction of an optimal IFG from the given
CFG is an NP-Hard problem. Nevertheless, we showed that certain regularities present
in almost every well-behaved parallel algorithm can be exploited to obtain a polynomial
time solution. We also showed that when the CFG is regular, the resulting IFG is also 
regular.
In the second stage, starting from the IFG constructed in the first stage (or, from 
any other arbitrary IFG), we constructed an optimal MBS. The construction of an MBS 
optimally realizing a given IFG naturally answers two other important issues relating to 
MBSs. The first is the design of an optimal MBS emulating an existing, SIMD, static
interconnection network. The second is the design of an optimal MBS realizing a given
set of interconnection functions. Due to this wide applicability, we devoted a major part 
of the dissertation to the construction of an optimal MBS realizing a given IFG. We 
showed that the design of an MBS from a given IFG is equivalent to the partition of the 
edge set of the IFG with certain constraints. This is called color partition.
In Chapter 3, we proved that the optimal color partition of an arbitrary IFG is an
NP-Hard problem. We also proved that the optimal color partition problem is solvable in
polynomial time when the IFG has only two colors. Therefore, an optimal MBS realizing
two interconnection functions (such as shuffle and exchange) can be constructed in
polynomial time. Taking the solution method for the two color case as a guideline, we
developed a heuristic algorithm for solving the general color partition problem in 
polynomial time.
The analysis in Chapter 3 naturally opens some avenues for future research work.
Apart from vertex symmetry and regularity, we did not consider features of an IFG
that would guarantee a polynomial time solution. For example, when the IFG
is a tree, the problem may be solvable in polynomial time. Thus, the investigation of
attractive properties of algorithms which can be used to solve the color partition problem
in polynomial time is a possible extension to this work. The heuristic algorithm we
presented is certainly not the best possible for the general color partition problem. The
development of a better algorithm would be another extension to this work. It is an
interesting and challenging problem to develop an algorithm which, given an arbitrary
constant $\epsilon$, produces an output for the general color partition problem that is at most
$(1 + \epsilon)\,opt$, where $opt$ is the optimal output.
In Chapter 4, we showed how to obtain an optimal color partition of a vertex
symmetric IFG. This was done by using the analogy between such an IFG and a Cayley
color graph associated with a finite group $\Gamma$ and its generating set $\Delta$. We showed that
there can be many optimal color partitions; however, we have chosen a particular color
partition which results in an MBS, denoted by $M(\Gamma, \Delta)$, having many attractive properties.
We proved that $M(\Gamma, \Delta)$ is symmetric. We also showed the superiority of $M(\Gamma, \Delta)$ over
its static interconnection network counterpart in terms of the number of ports per 
processor, the number of neighbors per processor, and the diameter. As an automatic 
outcome of the optimal color partition of a vertex symmetric IFG, we showed how to find 
an optimal color partition for a regular IFG. We also presented an algorithm to perform 
such a partition.
Again, there are several possible avenues for future research work related to the
treatment in Chapter 4. The optimality of the color partition obtained for a vertex
symmetric or regular IFG was guaranteed only if the IFG belongs to a certain class; 
otherwise, the optimality was not guaranteed. Fortunately, IFGs belonging to that 
particular class encompass interconnection patterns of many well known algorithms as 
well as interconnection functions of many known SIMD machines. However, it may be 
worthwhile to explore the possibility of finding an optimal color partition for every vertex 
symmetric or regular IFG. In view of some of the unsolvable problems in group
theory, it is unlikely that one would come up with a polynomial time solution for finding
an optimal color partition for an arbitrary Cayley color graph. Therefore, a heuristic
algorithm would be a more appropriate goal.
In Chapter 5, we addressed the fault tolerance capabilities of the MBS $M(\Gamma, \Delta)$.
We proved that $M(\Gamma, \Delta)$ can sustain a single driver failure or a single processor failure
if and only if $|\Delta| > 1$. We also proved that $M(\Gamma, \Delta)$ can sustain a single receiver failure
or a single bus failure if and only if $|\Delta| > 2$. We also obtained the performance degradation
due to each component failure when the IFG represents cube functions. Furthermore, we
showed how to add redundancy to $M(\Gamma, \Delta)$ in order to increase its fault tolerance. Even
though we only analyzed single component failures in Chapter 5, the analysis can be easily
extended to multiple component failures. The number of edge disjoint paths and vertex disjoint
paths in Cayley color graphs can be utilized to analyze the fault tolerance of $M(\Gamma, \Delta)$
under multiple component failures. This will be a useful extension to the work done in
Chapter 5.
In Chapter 6, we addressed the problem of constructing an MBS with a given
number of processors and/or a given number of buses which can realize a given IFG
optimally. We analyzed the cases with a fixed number of processors and a fixed number of
buses separately. We showed that the number of processors (buses) must be a factor of
the optimal number of processors (buses) in order to utilize processors (buses) efficiently.
We used the properties of cosets of $\Gamma$ in order to construct an MBS with a given number
of buses and/or processors. When the number of processors was specified, by partitioning
the vertex set of the original IFG, we obtained a new IFG that has as many vertices as
the number of processors in the target MBS. We showed how the partition can be done
so that the derived IFG is regular. We also provided processor scheduling which
guarantees that the source algorithm would run on the target MBS at the maximum possible
speed. When the number of buses was specified, we showed how to combine buses in order
to obtain the target MBS. Again, we used cosets of $\Gamma$ to combine the buses in $M(\Gamma, \Delta)$.
In constructing an optimal MBS with a given number of processors and/or buses,
there are certain issues we did not address in Chapter 6. Even though the given IFG is
vertex symmetric, the optimal MBS with a given number of processors and/or buses may
not even be regular. Our focus in the chapter was on constructing an MBS which is at
least regular. We did not address the problem of finding an optimal MBS without paying
attention to its structure. This could be an extension to the work reported in Chapter 6.
Again, due to the computational difficulty of the general problem and due to the fact that 
there are certain unsolvable problems regarding groups and their generating sets, it is 
unlikely that we can find a polynomial time algorithm to construct an optimal MBS with 
a specified number of processors and/or buses. Therefore, one may have to be content 
with a heuristic algorithm.
REFERENCES
[1] G. B. Adams and H. J. Siegel, "The Extra-Stage Cube: A Fault-Tolerant
Interconnection Network for Supersystems", IEEE Trans, on Computers, vol. c-31, 
May 1982, pp. 443-454.
[2] A. Agarwal, "Limits on Interconnection Network Performance", IEEE Trans, on
Parallel & Distributed Systems, vol. 2, no. 4, October 1991, pp. 398-412.
[3] A. Aggarwal, "Optimal Bounds for Finding Maximum on Array of Processors
with k  Global Buses", IEEE Trans, on Computers, vol. c-35, no. 1, January 1986, 
pp. 62-64.
[4] D. P. Agrawal, "Testing and Fault Tolerance of Multistage Interconnection
Networks", Computer, April 1982, pp. 41-53.
[5] S. B. Akers and B. Krishnamurthy, "A Group-Theoretic Model for Symmetric
Interconnection Networks", IEEE Trans, on Computers, vol. 38, no. 4, April 1989, 
pp. 555-566.
[6] B. Alspach, "Cayley Graphs with Optimal Fault Tolerance", IEEE Trans, on 
Computers, vol. 41, no. 10, October 1992, pp. 1337-1340.
[7] T. Anderson and P. A. Lee, Fault Tolerance: Principles and Practice, Prentice 
Hall, Englewood Cliffs, NJ, 1981.
[8] J. Archibald and J. Baer, "An Evaluation of Cache Coherence Solutions in Shared- 
Bus Multiprocessors", ACM  Transactions on Computer Systems, 4, 4 (November 
1986), pp. 273-298.
[9] R. Arlauskas, "iPSCI2 System: A Second Generation Hypercube", Third 
Conference on Hypercube Concurrent Computers and Applications, ACM  1988, 
pp. 33-36.
[10] E. R. Barnes, "Partitioning the Nodes of a Graph", Graph Theory with 
Applications to Algorithms and Computer Science, Editors: Y. Alavi et al., John 
Wiley & Sons, pp. 57-72.
[11] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. 
Stokes, "The ILLIAC IV  Computer", IEEE Trans, on Computers, vol. 17, no. 8, 
August 1968, pp. 746-757.
[12] K. E. Batcher, "The Flip Network in STARAN", Proc. o f the International 
Conference on Parallel Processing, August 1976, pp. 65-71.
194
195
[13] K. Batcher, "Design of a Massively Parallel Processor", IEEE Trans, on 
Computers, vol. 29, no. 9, September 1980, pp. 836-840.
[14] J. Beetem, M. Denneau, and D. Weingarten, "The GF11 Supercomputer", Proc. 
o f the 12th International Symposium on Computer Architecture, Boston, 1985, pp. 
108-115.
[15] V. E. Benes, Mathematical Theory o f Communication Networks and Telephone 
Traffic, Academic Press, New York, 1965.
[16] B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon, "The Midway Distributed 
Shared Memory Systems", Proceedings o f the 1993 CompCon Conference, 
February 1993, pp. 528-537.
[17] M. J. Berger and S. H. Bokhari, "A Partitioning Strategy for Nonuniform 
Problems on Multiprocessors", IEEE Trans, on Computers, vol. c-36, no. 5, May 
1987, pp. 570-580.
[18] L. N. Bhuyan, "A Combinatorial Analysis of Multibus Multiprocessor Systems", 
Proceedings o f the 1984 International Conference in Parallel Processing, 1984, 
pp. 225-227
[19] L. N. Bhuyan and D. P. Agrawal, "Generalized Hypercube and Hyperbus 
Structures for a Computer Network", IEEE Trans, on Computers vol. c-33, no. 4, 
April 1984, pp. 323-333.
[20] L. N. Bhuyan, Q. Yang, and D. P. Agrawal, "Performance of Multiprocessor 
Interconnection Networks", IEEE Computer, February 1989, pp. 25-37.
[21] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, and E. W. Felten, "Virtual 
Memory Mapped Network Interface for the SHRIMP Multicomputer", Proceedings 
o f the 21st Annual International Symposium on Computer Architecture, April 
1994, pp. 142-153
[22] S. H. Bokhari, "Finding Maximum on an Array Processor with a Global Bus", 
IEEE Trans, on Computers, vol. c-33, no. 2, February 1984, pp. 133-139.
[23] S. H. Bokhari, "On the Mapping Problem", IEEE Trans, on Computers, vol. c-30, 
no. 3, March 1981, pp. 207-344.
[24] S. H. Bokhari, "Partitioning Problems in Parallel, Pipelined, and Distributed 
Computing", IEEE Trans, on Computers, vol. 37, no. 1, January 1988, pp. 48-57.
196
[25] B. L. Bondar and A. C. Liu, "Modelling and Performance Analysis of Single Bus 
Tightly Coupled Multiprocessors", IEEE Trans, on Computers, vol. 38, no. 3, 
March 1989, pp. 464-470.
[26] R. E. Buehrer et al. "The 2s77/-Multiprocessor EMPRESS: A Dynamically 
Configurable MIMD System", IEEE Trans, on Computers, vol. c-31, no. 11, 
November 1982, pp. 1035-1044.
[27] L. Campbell, G. E. Carlson, M. J. Dinneen, V. Faber, M. R. Fellows, M. A. 
Langston, J. W. Moore, A. P. Mullhaupt and H. B. Sexton, "Small Diameter 
Symmetric Networks from Linear Groups", IEEE Trans, on Computers, vol. 41, 
No. 2, February 1992, pp. 218-220.
[28] M. J. Carey and C. D. Thompson, "A Pipelined Architecture for Search Tree 
Maintenance", Algorithmically Specialized Parallel Computers, Academic Press, 
Inc., 1985, pp. 37-46.
[29] D. A. Carlson, "Performing Tree and Prefix Computations on Modified Mesh- 
Connected Parallel Computers", Proceedings o f the International Conference on 
Parallel Processing, 1985, pp. 715-718.
[30] G. Chartrand and L. Lesniak, Graphs and Digraphs, Wadsworth & Brooks/Cole
Advanced Books & Software, Pacific Grove, California, 1986.
[31] V. Chaudhary and J. K. Aggarwal, "A Generalized Scheme for Mapping Parallel 
Algorithms", IEEE Trans, on Parallel and Distributed Systems, vol. 4, no. 3, 
March 1993, pp. 328-346.
[32] M. S. Chen, K. G. Shin, and D. D. Kandlur, "Addressing, Routing, and
Broadcasting in Hexagonal Mesh Multiprocessors", IEEE Trans, on Computers,
vol. 39, no. 1, January 1990, pp. 10-18.
[33] W. T. Chen and J. P. Sheu, "Performance Analysis of Multiple Bus
Interconnection Networks with Hierarchical Requesting Model", IEEE Trans, on 
Computers, vol. 40, no. 7, July 1991, pp. 834-842.
[34] D. M. Chiarulli, S. P. Levitan, and R. Melhem, "Optical Bus Control for 
Distributed Multiprocessors", Journal o f Parallel and Distributed Computing 10, 
1990, pp. 45-54.
[35] A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel,
"Software versus Hardware Shared-Memory Implementations: A Case Study",
Proceedings o f the 21st Annual International Symposium on Computer
Architecture, April 1994, pp. 106-117
197
[36] W. Crowther, J. Goodhue, E. Starr, R. Thomas, W. Milliken, and T. Blackadar, 
"Performance Measurements on a 128-node Butterfly Parallel Processor", 
Proceedings o f  the International Conference on Parallel Processing, August 1985, 
pp. 531-540.
[37] W. J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks", 
IEEE Trans, on Computers, vol. 39. no. 6 , June 1990, pp. 775-785.
[38] W. J. Dally et al., "The 7-Machine: A Fine-Grain Concurrent Computer", 
Information Processing 89, Elsevier North Holland, 1989.
[39] R. F. Demara and D. I. Moldovan, "The SNAP-1 Parallel A I  Prototype", 
Proceedings o f the 18th Annual International Symposium on Computer 
Architecture", May 1991, pp. 2-11.
[40] A. M. Despain and D. A. Patterson, "X-Tree: A Tree Structured Multiprocessor 
Computer Architecture", Proc. 5th Annual Symp. on Computer Architecture, 1978, 
pp. 144-151.
[41] O. M. Dighe, R. Vaidyanathan, and S. Q. Zheng, "Bus-Based Tree Structures for 
Efficient Parallel Computation" Proceedings o f the International Conf. on Parallel 
Processing, August 1993, pp. 7-158 - 7-161.
[42] M. Dubois, "Throughput Analysis of Cache-Based Multiprocessors with Multiple 
Buses", IEEE Trans, on Computers, vol. 37, no. 1, January 1988, pp. 58-70.
[43] R. Duncan, "A survey of Parallel Computer Architectures", IEEE Computer, 
February 1990, pp. 5-16.
[44] A. El-Amawy and Priyalal Kulasinghe, "Optimal Mapping of Feedforward Neural 
Networks onto Multiple Bus Architectures", IEEE 37th Midwest Symposium on 
Circuits and Systems 1994, pp 477-483.
[45] R. D. Etchells and G. R. Nudd, "Software Metrics for Performance Analysis of 
Parallel Hardware", DARPA Image Processing Workshop, June 1983, pp. 137-147.
[46] T. Y. Feng, "A Survey of Interconnection Networks", IEEE Computer 14, 
December 1981, pp. 12-27.
[47] T. Feng, "Data Manipulating Functions in Parallel Processors and Their 
Implementations", IEEE Trans, on Computers, vol. c-23, no. 3, March 1974, pp. 
309-318.
198
[48] T. Y. Feng, "Some Characteristics of Associative/Parallel Processing", Proc. 1972 
Sagamore Computer Conference, Syracuse University, 1972, pp. 5-16.
[49] C. M. Fiduccia, "Bused Hypercubes and Other Pin-Optimal Networks", IEEE
Trans, on Parallel and Distributed Systems, vol. 3, no. 1, January 1992, pp. 14-24.
[50] M. J. Flynn, "Some Computer Organizations and Their Effectiveness", IEEE
Trans, on Computers, vol c-21, 1972, pp. 948-960.
[51] M. I. Frank, "A Hybrid Shared Memory /  Message Passing Parallel Machine", 
Proceedings o f  the International Conference on Parallel Processing, August 1993, 
pp. 7-232 - 7-236.
[52] D. D. Gajski, "Does General Purpose Mean Good for Nothing", Algorithmically 
Specialized Parallel Computers, Academic Press, Inc., 1985, pp. 249-250.
[53] M. Garey and D. S. Johnson, Computers and Intractability: A guide to the theory
o f NP-Completeness, W. H. Freeman and Company, New York, 1979.
[54] W. H. Gates and C. H. Papadimitriou, "Bounds for Sorting by Prefix Reversal", 
Discrete Mathematics 27, 1979, pp. 47-57.
[55] G. R. Goke and G. J. Lipovski, "Banyan Networks for Partitioning Multiprocessor 
Systems", Proc. First Annual Symp. on Computer Architecture, December 1973,
pp. 21-28.
[56] M. G. and M. Minoux (translated by S. Vaida), Graphs and Algorithms, John 
Wiley and Sons, 1984.
[57] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. 
Snir, "The NYU  Ultracomputer - Designing an MIMD Shared Memory Parallel 
Computer", IEEE Trans, on Computers, vol. c-32, no. 2, February 1983, pp. 175- 
189.
[58] Paul E. Green, "The Future of Fibre-Optic Computer Networks", IEEE Computer, 
September 1991, pp. 78-87.
[59] W. Handler, "The Impact of Classification Schemes on Computer Architecture" 
Proc. International Conference on Parallel Processing, IEEE, 1977, pp. 7-15.
[60] P. Hansen and IC. W. Lih, "Improved Algorithms for Partitioning Problems in 
Parallel, Pipelined, and Distributed Computing", IEEE Trans, on Computers, vol. 
41, no. 6 , June 1992, pp. 769-771.
199
[61] J. P. Hayes, T. N. Mudge, Q. F. Stout, S. Colley, and J. Palmer, "Architecture of 
a Hypercube Supercomputer", Proceedings o f  the 1986 International Conference 
on Parallel Processing, 1986, pp. 653-660.
[62] W. D. Hillis, "The Connection Machine, M IT  Press, 1985.
[63] M. D. Hill et al., "SPUR: A VLSI Multiprocessor Workstation", IEEE Computer, 
vol. 19, November 1986, pp. 8-22.
[64] R. W. Hockney and C. R. Jesshope, Parallel Computers, Adam Hilger Ltd., 1981.
[65] E. Hokens and A. Louri, "Performance Considerations Relating to the Design of
Interconnection Networks for Multiprocessing Systems", Proceedings o f the 1993 
International Conference on Parallel Processing, August 1993, pp. 7-206 - 7-209.
[66] M. A. Holloday and M. K. Vernon, "Exact Performance Estimates of
Multiprocessor Memory and Bus Interfaces", IEEE Trans, on Computers, vol. c- 
36, no. 1, January 1987, pp. 76-85.
[67] J. E. Hopcroft and R. M. Karp, "An n512 algorithm for maximum matchings in 
bipartite graphs", SIAM J. Computing 2, 1973, pp. 225-231.
[68] A. Hopper, A. Jones, and D. Liopis, "Multiple vs. Wide Shared Bus
Multiprocessors", 16th Annual Symposium on Computer Architecture, 1989, pp. 
300-306.
[69] E. Horowitz and A. Zorat, "The Binary Tree as an Interconnection Network: 
Applications to Multiprocessor Systems and VLSI", IEEE Trans, on Computers, 
vol. c-30, no. 4, April 1981, pp. 247-253.
[70] K. Hwang, "Advanced Parallel Processing with Supercomputer Architectures", 
Proceedings o f the IEEE, vol. 75, no. 10, October 1987, pp. 1348-1379.
[71] K. Hwang, P. Sheng, and D. Kim, "An Orthogonal Multiprocessors for Parallel 
Scientific Computations", IEEE Trans, on Computers, vol. 38, no. 1, January 
1989, pp. 47-60.
[72] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, 
McGraw-Hill, 1977, pp. 32-40.
[73] L. H. Jamiesson, "Characterizing Parallel Algorithms", The Characteristics o f  
Parallel Algorithms, The MIT  Press, 1987, pp. 64-100.
200
[74] H. Jiang and K. C. Smith, "A Partial-Multiple-Bus Computer Structure with 
Improved Cost-Effectiveness", Proceedings o f the 15th Annual Int. Symp. on 
Computer Architecture, 1988, pp. 116-122.
[75] S. I. Kartashev and S. P. Kartashev, "A Multicomputer System with Dynamic 
Architecture", IEEE Trans, on Computers, vol. c-28, October 1979, pp. 704-720.
[76] H. Kasahara and S. Narita, "Practical Multiprocessor Scheduling Algorithms for 
Efficient Parallel Processing", IEEE Trans, on Computers, vol c-33, no. 11, 
November 1984, pp. 1023-1029.
[77] J. Killian, S. Kipnis, and C. E. Leiserson, "The Organization of Permutation 
Architectures with Bused Interconnections", IEEE Trans, on Computers, vol. 39, 
no. 11, November 1990, pp. 1346-1357.
[78] A. C. Klaiber and H. M. Levy, "A Comparison of Message Passing and Shared 
Memory Architectures for Data Parallel Programs", Proceedings o f  the 21st 
Annual International Symposium on Computer Architecture, April 1994, pp. 94- 
105.
[79] S. C. Kothari and E. Gannett, "Optimal Design of Linear Flow Systolic 
Architectures", Proceedings o f the International Conference on Parallel 
Processing, August 1989, pp. 7-247 - 7-256.
[80] J. S. Kowalik (ed.), Parallel MIMD Computation: HEP Supercomputer and its 
Applications, MIT Press, Cambridge, MA, 1985.
[81] R. Krishnamurti and E. Ma, "The Processor Partitioning Problem in Special 
Purpose Partitionable Systems", Proceedings o f the Int. Conf. on Parallel 
Processing, 1988, pp. 434-443.
[82] D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh, "Parallel 
Supercomputing Today and the Cedar Approach", Science 231, February 1986, pp. 
967-974.
[83] D. P. Kulasinghe and A. El-Amawy, "On the Complexity of Optimal Bused 
Interconnections", IEEE Trans, on Computers, to appear.
[84] D. P. Kulasinghe and A. El-Amawy, "Optimal Realization of Sets of 
Interconnection Functions on Multiple Bus Systems" Submitted to the IEEE Trans, 
on Computers for second revision.
[85] H. T. Kung, "Why Systolic Architecture", IEEE Computer, January 1982, pp. 37- 
46.
201
[86] S. Y. Kung, " VLSI Array Processors", IEEE ASSP Magazine, vol. 2, no. 3, July 
1985, pp. 4-22.
[87] J. Kuskin et al., "The Stanford FLASH Multiprocessor", Proceedings o f  the 21st 
Annual International Symposium on Computer Architecture, April 1994, pp. 302- 
313.
[88] T. Lang and M. Valero, "M-users B-servers Arbiter for Multiple-Buses Multi­
processors", Microprocessing and Microprogramming, 1982, pp. 11-18.
[89] T. Lang, M. Valero, and I. Alledre, "Bandwidth of Crossbar and Multiple-Bus
Connections for Multiprocessors", IEEE Trans, on Computers, vol. c-31, no. 12,
December 1982, pp. 1227-1234.
[90] T. Lang, M. Valero, and M. A. Fiol, "Reduction of Connections for Multibus
Organization", IEEE Trans, on Computers, vol. c-32, no. 8, August 1983, pp. 707- 
715.
[91] S. Latifi and A. El-Amawy, "On Folded Hypercubes", Proceedings o f  the 1989 
International Conference on Parallel Processing, August 1989, pp. 7-180 - 7-187.
[92] D. H. Lawrie, "Access and Alignment of Data in an Array Processor", IEEE 
Trans, on Computers, vol. c-24, no. 12, December 1975, pp. 1145-1155.
[93] T. J. LeBlanc, "Problem Decomposition and Communication Tradeoffs in a Shared 
Memory Multiprocessor", Numerical Algorithms fo r  Modem Parallel Computer 
Architectures, IMA Volumes in Mathematics and its Applications, vol. 13, 
Springer-Verlag 1988, pp. 145-163
[94] T. J. LeBlanc, "Shared Memory Versus Message-Passing in a Tightly-Coupled 
Multiprocessor: A Case Study", Proceedings o f the 1986 International Conference 
on Parallel Processing, August 1986, pp. 463-466.
[95] I. Lee and D. Smitley, "A Synthesis Algorithm for Reconfigurable Interconnection 
Networks", IEEE Trans, on Computers, vol. 37, no. 6 , June 1988, pp. 691-699.
[96] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Goopta, and J. 
Hennessy, "The DASH Prototype: Implementation and Performance", 19th Annual 
International Symposium on Computer Architecture, May 1992, pp. 92-103.
[97] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems", 
ACM Transactions on Computer Systems, vol. 7, no. 4, November 1989, pp. 321- 
359.
202
[98] K. Li and R. Schaefer, "A Hypercube Shared Virtual Memory System", 
Proceedings o f the International Conference on Parallel Processing, August 1989, 
pp. 7-125 - 7-132.
[99] C. Lin and L. Snyder, "A Comparison of Programming Models for Shared 
Memory Multiprocessors", Proceedings o f the 1990 International Conference on 
Parallel Processing, August 1990, pp. 163-170.
[100] R. Lin and S. Olariu, "Application-Specific Array Processors for Binary Prefix 
Sum Computation", Proceedings o f the 6th IEEE Symposium on Parallel and 
Distributed Processing, October 1994, pp. 118-125.
[101] T. Lovett and S. Thakkar, "The Symmetry Multiprocessor System", Proceedings 
o f the International Conference on Parallel Processing, 1988, pp. 303-311.
[102] W. Magnus, A. Karr ass D. Solitar, Combinatorial Group Theory, Dover 
Publications, Inc, New York, 1976.
[103] Y. Mansour and L. Schulman, "Sorting on a Ring of Processors", Journal o f  
Algorithms, vol. 11, no. 4, December 1990, pp. 622-630.
[104] M. A. Marson et. al, "Modeling Bus Contention and Memory Interference in a 
Multiprocessor System", IEEE Trans, on Computers, January 1983, pp. 60-72.
[105] M. Martonosi and A. Gupta, "Tradeoffs in Message-Passing and Shared-Memory 
Implementations of a Standard Cell Router", Proceedings o f the 1989 International 
Conference on Parallel Processing, August 1989, pp. 777-88 - 777-96.
[106] A. Mathialagan and N. N. Biswas, "Optimal Interconnections in the Design of 
Microprocessor and Digital Systems", IEEE Trans, on Computers, vol. c-29, no. 
2, February 1980, pp. 145-148.
[107] C. A. Mead and L. A. Convey, Introduction to VLSI Systems, Reading MA, 
Addison-Wesley, 1980.
[108] M. D. Mickunas, "Using Projective Geometry to Design Bus Connection 
Networks", Proc. Workshop Interconnection Networks fo r  Parallel and Distributed 
Processing, ACM/IEEE, April 1990, pp. 47-55.
[109] T. N. Mudge, J. P. Hayes, and D. C. Winsor, "Multiple Bus Architectures", IEEE 
Computer, June 1987, pp. 42-48.
[110] P. A. Nelson and L. Snyder, "Programming Paradigms for Nonshared Memory 
Parallel Computers", The Characteristics o f Parallel Algorithms, The M IT  Press, 
1987, pp. 3-20.
203
[111] A. Osterhaug, Guide to Parallel Programming, Sequent Computer Systems Inc.,
1986.
[112] R. C. Pearce, J. A. Field, and W. D. Little, "Asynchronous Arbiter Module", IEEE 
Trans, on Computers, September 1975, pp. 931-932.
[113] C. C. Pinter, A Book o f Abstract Algebra, McGraw-Hill Publishing Company, 
1990.
[114] D. P. Pradhan, Fault-Tolerant Multiprocessor Link and Bus Network 
Architectures", IEEE Trans, on Computers, vol. 34, no. 1, January 1985, pp. 33- 
45.
[115] F. P. Preparata and J. Vuillemin, "The Cube-Connected Cycles: A Versatile 
Network for Parallel Computation", Computer Architecture and Systems, vol. 24, 
no. 5, May 1981, pp. 300-309.
[116] U. Ramachandran, G. Shah, and S. Ravikumar, "Scalability Study of the KSR-1", 
Proceedings of the 1993 International Conference on Parallel Processing, August 
1993, pp. I-237 - I-240.
[117] D. A. Rennels, "Distributed Fault-Tolerant Computer Systems", Computer, vol. 13, 
no. 3, March 1980, pp. 55-65.
[118] C. D. Rose, "Encore Eyes Multiprocessor Market", Electronics, July 8, 1985.
[119] Y. Saad and M. H. Schultz, "Topological Properties of Hypercubes", IEEE Trans. 
on Computers, vol. 37, no. 7, July 1988, pp. 867-872.
[120] P. Sadayappan and F. Ercal, "Nearest-Neighbor Mapping of Finite Element Graphs 
onto Processor Meshes", IEEE Trans. on Computers, vol. c-36, no. 12, December
1987, pp. 1408-1424.
[121] M. R. Samatham and D. K. Pradhan, "The De Bruijn Multiprocessor Network: 
A Versatile Parallel Processing and Sorting Network for VLSI", IEEE Trans. on 
Computers, vol. 38, no. 4, April 1989, pp. 567-581.
[122] C. L. Seitz, "The Cosmic Cube", Communications of the ACM, vol. 28, no. 1, 
January 1985, pp. 22-33.
[123] W. Shang and J. A. B. Fortes, "Independent Partitioning of Algorithms with 
Uniform Dependence", IEEE Trans. on Computers, vol. 41, no. 2, February 1992, 
pp. 190-206.
[124] W. Shang and B. W. Wah, "Dependence Analysis and Architecture Design for 
Bit-Level Algorithms", Proceedings of the 1993 International Conference on 
Parallel Processing, August 1993, pp. I-30 - I-38.
[125] H. Siegel, Interconnection Networks for Large-Scale Parallel Processing: Theory 
and Case Studies, Lexington Books, Lexington MA, 1984.
[126] H. J. Siegel and J. T. Kuehn, "PASM: A Partitionable SIMD/MIMD System for 
Parallel Image Processing Research", Algorithmically Specialized Parallel 
Computers, Academic Press, Inc., 1985, pp. 69-78.
[127] H. J. Siegel, R. J. McMillen, and P. T. Mueller, "A Survey of Interconnection 
Methods for Reconfigurable Parallel Processing Systems", National Computer 
Conference, 1979, pp. 529-541.
[128] H. J. Siegel and S. D. Smith, "Study of Multistage SIMD Interconnection 
Networks", Proc. of 5th Annual Symp. on Computer Architecture, April 1978, pp. 
223-229.
[129] D. B. Skillicorn, "A New Class of Fault-Tolerant Static Interconnection 
Networks", IEEE Trans. on Computers, vol. 37, no. 11, November 1988, pp. 
1468-1470.
[130] L. Snyder, "Introduction to the Configurable, Highly Parallel Computer", IEEE 
Computer 15, January 1982, pp. 47-56.
[131] H. S. Stone, Discrete Mathematical Structures and Their Applications, Science 
Research Associates, Inc., Chicago, Palo Alto, Toronto, Henley-on-Thames, 
Sydney, 1973.
[132] H. S. Stone, "Multiprocessor Scheduling with the Aid of Network Flow 
Algorithms", IEEE Trans. on Software Engineering, vol. SE-3, no. 1, January 
1977, pp. 85-93.
[133] H. S. Stone, "Special-Purpose vs. General-Purpose Systems: A Position Paper", 
Algorithmically Specialized Parallel Computers, Academic Press, Inc., 1985, pp. 
251-252.
[134] H. S. Stone, "Parallel Processing with Perfect Shuffle", IEEE Trans. on 
Computers, vol. c-20, February 1971, pp. 153-161.
[135] H. S. Stone and J. Cocke, "Computer Architecture in the 1990s", IEEE Computer, 
September 1991, pp. 30-38.
[136] Q. Stout, "Mesh-Connected Computers with Broadcasting", IEEE Trans. on 
Computers, vol. c-32, no. 9, September 1983, pp. 826-830.
[137] R. J. Swan, S. H. Fuller, and D. P. Siewiorek, "Cm* - A Modular Multiprocessor", 
Proc. AFIPS 1977 Fall Joint Computer Conference, vol. 46, 1977, pp. 637-644.
[138] T. Szymanski, "Graph Theoretic Models for Photonic Networks", Proceedings of 
the New Frontiers: A Workshop on Future Directions of Massively Parallel 
Processing, McLean, VA, 1992, pp. 85-96.
[139] C. Thacker and L. Stewart, "Firefly: A Multiprocessor Workstation", 2nd Intl. 
Conference on Architectural Support for Programming Languages and Operating 
Systems, ACM, October 1987, pp. 164-172.
[140] S. Todd, "Algorithms and Hardware for a merge sort using multiple processors", 
IBM Journal of Research and Development, vol. 22, no. 5, September 1987, pp. 
509-517.
[141] H. C. Torng and N. C. Wilhelm, "The Optimal Interconnection of Circuit Modules 
in Microprocessor and Digital System Design", IEEE Trans. on Computers, vol. 
c-26, no. 5, May 1977, pp. 450-457.
[142] L. W. Tucker and G. G. Robertson, "Architecture and Applications of the 
Connection Machine", IEEE Computer, August 1988, pp. 26-38.
[143] C. H. Tung and C. W. McCarron, "A High Performance Multiprocessor with 
Partially Orthogonal Multibus and Memory", ISMM International Conference on 
Parallel and Distributed Computing, and Systems, October 1990, pp. 62-66.
[144] R. Vaidyanathan, "Design of Multiple Bus Interconnection Networks for Fan-in 
Computations", Proc. 29th Annual Conf. on Comm., Control & Comp., 1991, pp. 
1093-1102.
[145] R. Vaidyanathan and A. Padmanabhan, "Bus-Based Networks for Fan-in and 
Uniform Hypercube Algorithms", submitted to Parallel Computation.
[146] A. T. White, Graphs, Groups and Surfaces, North-Holland Mathematics Studies, 
1984.
[147] A. W. Wilson, "Hierarchical Cache/Bus Architecture for Shared Memory 
Multiprocessors", Proceedings of the 14th Annual Symposium on Computer 
Architecture, June 1987, pp. 244-252.
[148] D. C. Winsor and T. N. Mudge, "Analysis of Bus Hierarchies for 
Multiprocessors", Proc. 15th Annual Int. Symp. on Computer Architecture, June 
1988, pp. 100-107.
[149] M. Wolfe and U. Banerjee, "Data Dependence and Its Application to Parallel 
Processing", International Journal of Parallel Programming, vol. 16, no. 2, 1987, 
pp. 137-179.
[150] Q. Yang and L. N. Bhuyan, "Analysis of Packet-Switched Multiple-Bus 
Multiprocessor Systems", IEEE Trans. on Computers, vol. 40, no. 3, March 1991, 
pp. 352-357.
[151] Q. Yang and S. G. Zaky, "Communication Performance in Multiple-Bus Systems", 
IEEE Trans. on Computers, vol. 37, no. 7, July 1988, pp. 848-853.
[152] Z. Zlatev, J. Wasniewski, M. Venugopal, and J. Moth, "Optimizing Air Pollution 
Models on Two Alliant Computers", Parallel Computation, Editors: A. E. 
Fincham and B. Ford, Clarendon Press 1993, pp. 115-133.
VITA
Priyalal Kulasinghe received the B.Sc. degree in Electrical and Electronics 
Engineering in 1981 from the University of Peradeniya, Sri Lanka. He received the M.S. 
degree in Electrical Engineering in 1990 from Louisiana State University, Baton Rouge. 
His research interests include design and analysis of algorithms, graph theory and graph 
algorithms, high performance computer architecture, interconnection networks, and theory 
of computing.





Priyalal D. Kulasinghe 
Electrical Engineering
Combinatorial Design and Analysis of Optimal 
Multiple Bus Systems for Parallel Algorithms
Approved:
Major Professor and Chairman
Dean of the Graduate School




April 4, 1995
