Efficient parallel computation on multiprocessors with optical interconnection networks by He, Min
Louisiana State University
LSU Digital Commons
LSU Doctoral Dissertations Graduate School
2002
Efficient parallel computation on multiprocessors
with optical interconnection networks
Min He
Louisiana State University and Agricultural and Mechanical College, mhe@bit.csc.lsu.edu
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations
Part of the Computer Sciences Commons
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in
LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please contactgradetd@lsu.edu.
Recommended Citation
He, Min, "Efficient parallel computation on multiprocessors with optical interconnection networks" (2002). LSU Doctoral
Dissertations. 28.
https://digitalcommons.lsu.edu/gradschool_dissertations/28
EFFICIENT PARALLEL COMPUTATION ON
MULTIPROCESSORS WITH OPTICAL INTERCONNECTION
NETWORKS
A Dissertation
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in
The Department of Computer Science
by
Min He
B.S., Hunan University, P.R.China, 1988
M.E., Hunan University, P.R.China, 1991
M.S., Louisiana State University, 1997
December, 2002
Acknowledgments
I am deeply grateful to my advisor, Dr. Si Qing Zheng. He has been very supportive and
encouraging during my time as a graduate student. His rigorous research attitude, breadth
of knowledge, and innovative ideas have been a great benefit for me to work on my Ph.D
research.
I would like to give my special thanks to Dr. Donald Kraft for his support and all the
helps he provided me during my time as a part-time Ph.D student. Whenever I need help,
he is always there helping me.
I would like to thank Dr. Ramachandran Vaidyanathan. I learned a lot from his courses
on methods and models on parallel computation. The theories and knowledge I learned from
his courses laid a good foundation for my later research in parallel computation area.
I would like to thank Dr. Michael Cherry, Dr. S. Sitharama Iyengar, Dr. Jianhua Chen,
Dr. Donald Kraft, and Dr. Ramachandran Vaidyanathan for serving on my Ph.D committee
and their helpful comments and suggestions on my dissertation.
I would like to thank my parents, Yi Guo and Guirong He, for their endless love and
support, for taking care of my daughter so that I could concentrate on my Ph.D research.
I would like to thank my husband, Qiang Liu, for his love, understanding and support.
I would like to thank my lovely daughter, Yuanyuan. She is an angel in my life and she
motivates me to do my best for this dissertation.
I would like to thank my brother Kai and my sister Pei for their support and love.
ii
Table of Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Optical Interconnection Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Optical Bus Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Comparison of Several Optical Bus Models . . . . . . . . . . . . . . . . . . . 10
2.3 The LARPBS Model and Its Basic Operations . . . . . . . . . . . . . . . . . 13
2.3.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Power of the LARPBS Model . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Basic Operations on LARPBS . . . . . . . . . . . . . . . . . . . . . . 19
2.4 2D LARPBS and Its Basic Operations . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 2D LARPBS – an extension to LARPBS . . . . . . . . . . . . . . . . 22
2.4.2 Basic Operations on 2D LARPBS . . . . . . . . . . . . . . . . . . . . 24
3 Boolean Matrix Multiplication and Its Applications. . . . . . . . . . . . . . . . . . . . . 31
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Definition of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 A New Constant Time Boolean Matrix Multiplication Algorithm . . . . . . . 32
3.4 Extension to 2D LARPBS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Discussion on Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Application to Transitive Closure . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 Application to Connected Component Problems . . . . . . . . . . . . . . . . 37
iii
4 Parallel Sorting Algorithms on LARPBS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 A Constant-Time Sorting Algorithm on LARPBS . . . . . . . . . . . . . . . 43
4.3 Cole’s Merge Sort Algorithm for CREW PRAM . . . . . . . . . . . . . . . . 43
4.3.1 Comparison of Cole’s Merge Sort and the AKS Sorting Network . . . 44
4.3.2 Definitions and Notations . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.3 Basic Idea of the CREW PRAM Merge Sort Algorithm . . . . . . . . 46
4.4 Parallel Merge Sort on LARPBS . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4.1 Problems Need to be Solved . . . . . . . . . . . . . . . . . . . . . . . 48
4.4.2 Details of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.3 Processor Assignment and Reuse . . . . . . . . . . . . . . . . . . . . 54
4.4.4 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.5 Correctness Proof and Complexity Analysis . . . . . . . . . . . . . . 59
5 Parallel Sorting Algorithms on Two Dimensional LARPBS . . . . . . . . . . . . . 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 A Constant-Time Sorting Algorithm on 2D LARPBS . . . . . . . . . . . . . 64
5.3 The Seven-Phase Columnsort Algorithm . . . . . . . . . . . . . . . . . . . . 66
5.4 An O(log r) Column Sort Algorithm on a 2D LARPBS . . . . . . . . . . . . 70
5.5 The Two-Way Merge Sort Algorithm on a 2D LARPBS . . . . . . . . . . . . 73
5.6 The Multi-Way Merge Algorithm on a 2D LARPBS . . . . . . . . . . . . . . 82
5.7 An O(log n)-Time Merge-Based Sorting Algorithm on a 2D LARPBS . . . . 88
5.8 An O(log n)-Time Generalized Columnsort Algorithm on a 2D LARPBS . . 93
6 Optimal Sorting Algorithms on Higher Dimensional LARPBS . . . . . . . . . . 97
6.1 An O(log n)-Time Generalized Columnsort Algorithm on a 3D LARPBS . . 97
6.2 A 5-Phase Sorting Algorithm on a 3D LARPBS . . . . . . . . . . . . . . . . 102
6.3 Comparison of The Two 3D Sorting Algorithms . . . . . . . . . . . . . . . . 105
7 Selection Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2 Parallel Comparison Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3 Finding Maximum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3.1 A Constant-Time Algorithm . . . . . . . . . . . . . . . . . . . . . . . 108
7.3.2 A Fast Non-Optimal Algorithm . . . . . . . . . . . . . . . . . . . . . 108
7.3.3 An Optimal-Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.4 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.4.1 The Cole-Yap O((log log n)2) Time Parallel Selection Algorithm . . . 111
7.4.2 An Optimal Parallel Selection Algorithm . . . . . . . . . . . . . . . . 112
7.4.3 An Implementation of Cole-Yap’s Algorithm on an LARPBS . . . . . 114
7.5 Application to Parallel Multi-selection . . . . . . . . . . . . . . . . . . . . . 118
iv
8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
v
List of Tables
2.1 Step 1: distributing the integers . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Step 2: calculating n
n
2 -ary representation . . . . . . . . . . . . . . . . . . . . 27
2.3 Step 2: saving corresponding n
n
2 -ary representation digit . . . . . . . . . . . 27
2.4 Step 3: multicasting 1s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Step 4: performing row BPS . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Step 5: collecting and converting results to decimal values on the first row . 27
2.7 Step 6: returning result to item’s original location . . . . . . . . . . . . . . . 27
3.1 Comparison of different parallel Boolean matrix multiplication algorithms. . 31
5.1 Comparison of different parallel sorting algorithms on multi-dimensional arrays. 64
7.1 Comparison of different parallel selection algorithms. . . . . . . . . . . . . . 107
vi
List of Figures
2.1 Structure of a folded optical bus . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Structure of a POB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Address scheme for an LARPBS . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Structures of LARPBS and LPB . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Conditional delay switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 The LARPBS model of size 6 with two sub-buses . . . . . . . . . . . . . . . 17
2.7 Address structure for basic operations . . . . . . . . . . . . . . . . . . . . . . 20
2.8 A 2D LARPBS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Structure of a folded optical bus: (a) step 1 (b) step 2 and step 3 (c) address
frames for some of active processors . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Connected components of graph G . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 SUP (v) → UP (u): (a) Pj−2, Pj−1 and Pj all have rank s − 1 in UP (u). (b)
boundary condition: Pj−2, Pj−1 and Pj all have rank 0 in UP (u). . . . . . . 50
4.2 Finding the ranks in SUP (w) for Pi’s two straddle processors Pj and Pj+1 in
UP (u). Solid lines represent data communication and dashed lines represent
ranks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Computing SUP (v) → NEWSUP (v). Solid lines represent data communi-
cation, and dashed lines represent ranking relation. . . . . . . . . . . . . . . 53
4.4 Simulating a binary tree on the LARPBS Model: (a) depth-first search ranking
(b) breath-first search ranking, and (c) simulation of binary tree on LARPBS 55
4.5 Simulating a parallel merge sort tree on the LARPBS Model . . . . . . . . . 56
4.6 Pipelined Merge Sort: life-cycle 1 (a) input, and (b) after stage 3 . . . . . . 58
4.7 Pipelined Merge Sort: life-cycle 2 (a) after stage 2, and (b) after stage 3 . . . 60
4.8 Pipelined Merge Sort: life-cycle 3 (a) after stage 1, (b) after stage 2, and (c)
after stage 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1 7-step Columnsort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 (a) Constructing middle matrix D, (b) Circular movements on column buses
in matrix D, and (c) Partial circular movements on row buses in A and B . 76
5.3 Two-way columnsort: reconstructing A and B . . . . . . . . . . . . . . . . . 80
6.1 n-way-column-shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
vii
Abstract
This dissertation studies optical interconnection networks, their architecture, address schemes,
and computation and communication capabilities. We focus on a simple but powerful optical
interconnection network model – the Linear Array with Reconfigurable pipelined Bus System
(LARPBS). We extend the LARPBS model to a simplified higher dimensional LAPRBS and
provide a set of basic computation operations.
We then study the following two groups of parallel computation problems on both one
dimensional LARPBS’s as well as multi-dimensional LARPBS’s: parallel comparison prob-
lems, including sorting, merging, and selection; Boolean matrix multiplication, transitive
closure and their applications to connected component problems.
We implement an optimal sorting algorithm on an n-processor LARPBS. With this
optimal sorting algorithm at disposal, we study the sorting problem for higher dimensional
LARPBS’s and obtain the following results:
• An optimal basic Columnsort algorithm on a 2D LARPBS.
• Two optimal two-way merge sort algorithms on a 2D LARPBS.
• An optimal multi-way merge sorting algorithm on a 2D LARPBS.
• An optimal generalized column sort algorithm on a 2D LARPBS.
• An optimal generalized column sort algorithm on a 3D LARPBS.
• An optimal 5-phase sorting algorithm on a 3D LARPBS.
Results for selection problems are as follows:
• A constant time maximum-finding algorithm on an LARPBS.
• An optimal maximum-finding algorithm on an LARPBS.
• An O((log log n)2) time parallel selection algorithm on an LARPBS.
• An O(k(log log n)2) time parallel multi-selection algorithm on an LARPBS.
viii
While studying the computation and communication properties of the LARPBS model,
we find Boolean matrix multiplication and its applications to the graph are another set of
problem that can be solved efficiently on the LARPBS. Following is a list of results we have
obtained in this area.
• A constant time Boolean matrix multiplication algorithm.
• An O(log n)-time transitive closure algorithm.
• An O(log n)-time connected components algorithm.
• An O(log n)-time strongly connected components algorithm.
The results provided in this dissertation show the strong computation and communication





The performance of many existing computational systems based on micro-electronic inte-
grated circuits is not limited by the devices themselves but by the performance of commu-
nication links between devices. Though the interconnect bottleneck occurs at all levels of
electronic packaging, there are three major levels:
• Level one: chip-level(intra-chip) interconnections between devices on a single very-
large-scale-integrated(VLSI) circuit or chip.
• Level two: multi-chip module (MCM) level (intra-MCM or chip-to-chip) interconnec-
tions are connections between VLSI chips within a single MCM.
• Level three: board-level (inter-MCM) interconnections are connections between MCMs
on a single printed circuit board.
The use of optical interconnects with optical signal paths to replace particular electrical
connection links can result in the following advantages: increased communication speed, re-
duced power consumption, reduced cross talk, increased reliability, increased fault tolerance,
reduced cost, enhanced circuit testing capabilities, and provision of a path toward dynamic
interconnects.
Optical waveguides have enhanced bandwidth, loosened loading constraints and large
physical distribution of computing resources. What’s more, the unique property of op-
tics, namely, unidirectional propagation of signals, can be used to further increase the ef-
fective bandwidth of optical buses. Three multi-access methods are used in optical bus
systems, namely, space-multiplexing, time-division multiplexing (TDM), and wavelength-
division multiplexing.
• Space Multiplexing [44] : Space Multiplexing is a technique for pipelining messages
on optical buses. The pipelining of control and data signals represents a significant
1
departure from the conventional exclusive access discipline which characterizes elec-
tronic bus-interconnected multiprocessors. By relaxing the exclusive access require-
ment, space multiplexing can support the design of large-scale, distributed, and tightly
coupled multiprocessor systems.
• Time-Division Multiplexing (TDM) : Time-division multiplexing of messages has the
same effect as message pipelining on optical waveguides. The TDM access protocol
may be used to improve the temporal bandwidth of interconnection network.
• Wavelength Division Multiplexing (WDM) : WDM is a technology used in optical inter-
connection networks, where the optical spectrum is divided into many different logical
channels, and each channel corresponds to a different wavelength. These channels can
be carried simultaneously on a small number of physical channels, e.g., optical fibers.
WDM exploits the terahertz (THz) bandwidth of optics, and enables multiple multi-
access channels to be realized on a single physical channel for large parallel computing
systems.
1.2 Main Results
In this dissertation, we study both optical interconnection models and efficient algorithms
for these models. We focus on a simple but powerful optical interconnection network model –
A Linear Array with Reconfigurable Pipelined Bus System (LARPBS). We extend this one-
dimensional optical bus model to two and three dimension to improve the scalability. We
also provide several basic operations for two-dimensional LARPBS’s. These basic operations
are used as building blocks for designing a few efficient algorithms given in this dissertation.
Then we study parallel comparison problems, including sorting, merging and selection,
the lower bounds for these problems, and the best parallel algorithms for these problems
that were designed under PRAM and parallel comparison models. Sorting is a well-known
problem that has been studied extensively in the literature because it has so many important
applications, and it has a theoretical importance as well. There are two optimal parallel
sorting algorithms: Cole’s pipelined merge sort [11] and the AKS sorting network [1]. Both
of them sort n elements in O(log n) time. Since the lower bound of parallel sorting problem
is Ω(log n) [76], both algorithms are actually Θ(log n) algorithms. We chose to implement
Cole’s pipelined merge sort because it has a much smaller constant in running time and it
does not depend on expander graphs. Li, Pan and Zheng [39] provided some methods for
simulating a CRCW PRAM on an LARPBS. But as their simulation methods simulate a
CRCW PRAM on an LARPBS with an O(log n) slowdown, we cannot obtain an O(log n)
optimal sorting algorithm by simply applying their simulation methods. Cole also gave an
EREW PRAM sorting algorithm with the same time complexity in [11]. Given that an
EREW PRAM can be simulated on an LARPBS in constant time [39], another approach
would be using the simulation methods given in [39] to implement this EREW PRAM sorting
algorithm on an LARPBS. But the EREW PRAM sorting algorithm has a larger constant
2
in running time and it is much more complex. We observe that in this algorithm the owners
of the data always know the address of the processors who need the data for all the required
concurrent read operations. Thus, concurrent read can be done by performing a multicast
operation on an LARPBS. Based on the above analysis and observation, we decide to directly
implement the CREW PRAM merge sort algorithm on an LARPBS.
A lot of parallel algorithms [24, 31, 35, 37, 38, 52, 59, 60, 62, 72, 74] have been designed
and implemented for the LARPBS since it is first proposed by Pan and Li [61]. Currently
the best known sorting algorithms implemented on an n-processor LARPBS sort n elements
in O(log n) expected time or O(log2 n) deterministic time [59], [36]. Thus to the best of our
knowledge, our O(log n)-time sorting algorithm is the first optimal sorting algorithm imple-
mented on an LARPBS. With this optimal sorting algorithm for one-dimensional LARPBS
at disposal, we further extend our research on the sorting problem for higher dimensional
LARPBS’s.
We started our research on two dimensional sorting problems with the well-known
Columnsort algorithms. There are two known versions of Columnsort algorithm: the 7-phase
Columnsort [30] and the 8-phase Columnsort [29]. Since the 8-phase algorithm requires ex-
tra processors for some of its phases and it works under a more restrictive assumption
(r ≥ 2(s− 1)2, where r is the number of rows and s is the number of columns), we chose
to implement the 7-phase Columnsort. We obtained our O(log n)-time 7-phase Columnsort
algorithm by applying our optimal sorting algorithm on one-dimensional LARPBS to sort-
ing each column. But this 7-phase Columnsort only works under the assumption of r ≥ s2,
where r is the number of rows and s is the number of columns. So we ask the following
question: Can we loosen the assumption so that it applies to a larger range of applications?
Or can we obtain an optimal sorting algorithm for an n×n 2D LARPBS? We came up with
the following two solutions to these questions:
1. Generalize the idea of basic Columnsort: we can simulate an n
√
n × √n imaginary
matrix on the n × n 2D LARPBS, and we can also simulate an n2 × n imaginary
matrix on the n× n× n 3D LARPBS. Since the assumption r ≥ s2 is satisfied in both
cases, we can apply the 7-phase Columnsort algorithm on the two imaginary matrices.
We called the sorting algorithms obtained this way generalized Columnsort algorithms.
2. Multi-way merge sorted sub-matrices: apply a multi-way merge method to
√
n n×√n
shape sub-matrices to obtain a sorted n×n shape matrix. We observed that the most
time consuming operations needed in the multi-way merge procedure are sorting an n-
processor one-dimensional LARPBS and sorting an n×2√n 2D LARPBS. We already
have an optimal sorting algorithm for one-dimensional LARPBS. If we can obtain an
optimal two-way merge sort algorithm, we can get an optimal algorithm for n×n shape
matrix.
We designed and implemented following optimal sorting algorithms based on the above
two solutions:
1. An optimal basic Columnsort algorithm on an n×√n two-dimensional LARPBS that
sorts n
√
n elements in O(log n) time.
3
2. An optimal two-way merge sort algorithm on an n × 2√n two-dimensional LARPBS
that sorts 2n
√
n in O(log n) time.
3. An optimal two-way merge sort algorithm on an n × √n two-dimensional LARPBS
that sorts 2n
√
n in O(log n) time.
4. An optimal multi-way merge sorting algorithm on an n×n two-dimensional LARPBS
that sorts n2 elements in O(log n) time.
5. An optimal generalized column sort algorithm on an n× n two-dimensional LARPBS
that sorts n2 elements in O(log n) time.
6. An optimal generalized column sort algorithm on an n × n × n three-dimensional
LARPBS that sorts n3 elements in O(log n) time.
7. An optimal 5-phase sorting algorithm on an n×n×n three-dimensional LARPBS that
sorts n3 elements in O(log n) time.
Selection is another comparison problem. A typical application for selection is multi-
selection – a fundamental algorithmic problem arising frequently in databases. The easiest
cases are extreme selection, which is finding the minimum or maximum. The hardest case
is finding the median. Valiant proved the lower bound of finding the maximum of n ele-
ments to be Ω(log log n) in [76]. This implies that the lower bound for selection problems
is Ω(log log n). Since sorting is a complete selection, the upper bound O(log n) for sorting
provided an upper bound for selection problem. Researches for an upper bound are done on
both PRAM models and Valiant’s parallel comparison model [76].
Cole and Yap [16] presented a selection algorithm on parallel comparison model in
year 1984. This selection algorithm improved the upper bound of selection problem to
O((log log n)2). Two years later, a even faster selection algorithm was presented in [3] that
pushes the upper bound further down to O(log log n). The existence of such an algorithm
implies that selection problem has an upper and lower bound of log log n. The best previ-
ous result for selection on the PRAM is given in [13] that uses n operations and runs in
O(log n log log n) on a CRCW PRAM.
We did research on the two upper bound algorithms [16][3] on parallel comparison model.
And we implemented the Cole-Yap’s O((log log n)2)-time selection algorithm on an LARPBS.
Since the O(log log n) algorithm is designed for parallel comparison model, time spent for
calculating transitive closure is not counted as transitive closure can be obtained free of
comparison cost. Since we have to count the time spent for calculating transitive closure in
the LARPBS model, we cannot implement this upper bound algorithm with the same time
complexity unless we can find a constant time transitive closure algorithm. It has caught our
attention that Han and Pan provided an O((log log n)2/log log log n)-time selection algorithm
in [24]. But their algorithm is based on Cole’s EREW PRAM selection algorithm [12]. By
using a fast sorting algorithm to sort a smaller set of data on the LARPBS model, Han
and Pan were able to speedup Cole’s algorithm and obtained the O((log log n)2/log log log n)
4
time selection algorithm. Our algorithm is different from Han and Pan’s algorithm. Our
algorithm is based on Cole’s selection algorithm designed on parallel comparison model. It
is conceptually simpler than the EREW PRAM selection algorithm [12].
To find an optimal maximum algorithm, we studied the accelerated cascading technol-
ogy. We provided a solution for simulating doubly logarithmic tree on an LARPBS, and
implemented an optimal algorithm for finding the maximum of n elements on an n-processor
LARPBS. Following is a list of algorithms we implemented for selection problems.
1. A constant time algorithm that finds the maximum of n elements on an n2-processor
LARPBS.
2. A non-optimal algorithm that finds the maximum of n elements on an n-processor
LARPBS in O(log log n) time.
3. An optimal algorithm that finds the maximum of n elements on an n-processor LARPBS
in O(log log n) time.
4. An O((log log n)2) time parallel selection algorithm on an n-processor one-dimensional
LARPBS.
5. An O(k(log log n)2) time parallel multi-selection algorithm on an n-processor one-
dimensional LARPBS.
Apart from the comparison problems, we also studied Boolean matrix multiplication
problem and its applications to graph problems. We presented a new constant time Boolean
matrix algorithm on an LARPBS. Our algorithm achieves much better speed-up compared
with the four Russians’ algorithm proposed by Agerwala and Lint in [4] and a hypercube
implementation provided by Plaxton [55]. We noticed that a parallel implementation of the
Four Russians’ algorithm on the LARPBS with constant running time is proposed by Keqin
Li in [31]. Compared with Li’s constant-time algorithm, our algorithm is not based on any
existing sequential or parallel algorithm and it is much simpler and uses fewer processors.
Our algorithm uses only O(n2) processors, while Li’s algorithm uses O(n3/ log n) processors.
We then applied the constant time Boolean matrix multiplication algorithm to two
graph problems, and obtained an efficient transitive closure algorithm and two connected
component algorithms. We noticed that a constant time algorithm for the transitive closure
was proposed in [77]. But their constant time algorithm is obtained at a cost of more
processors. Two constant time algorithms were given in [77]. One is designed on a three-
dimensional n × n × n processor array with a reconfigurable bus system and the other is
designed on a two-dimensional n2 × n2 processor array with a reconfigurable bus system,
where n is the number of vertices in the graph.
Following is a list of results we have obtained in this area.
1. A constant time algorithm on an n2-processor one-dimensional LARPBS that multiplies
two n× n Boolean matrices.
5
2. A constant time algorithm on an n× n two-dimensional LARPBS that multiplies two
n× n Boolean matrices.
3. An O(log n)-time algorithm on an n2-processor one-dimensional LARPBS that calcu-
lates the transitive closure for a undirected graph G with n vertices.
4. An O(log n)-time algorithm on an n2-processor one-dimensional LARPBS that calcu-
lates the connected components for a undirected graph G with n vertices.
5. An O(log n)-time algorithm on an n2-processor one-dimensional LARPBS that calcu-
lates the strongly connected components for a directed graph G with n vertices.
1.3 Organization of the Dissertation
The rest of the dissertation is organized into chapters as follows. In Chapter 2, we first
compare several similar optical bus models with the two models used in this dissertation.
Then we give a detailed description on the two models, i.e. LARPBS and 2D LARPBS,
and basic operations designed for these two models. In Chapter 3, we design a simple
constant-time Boolean matrix multiplication algorithm and apply it to solving the transitive
closure problem. In Chapter 4, we present an optimal sorting algorithm on LARPBS with
a solution on processor addressing and reuse. In Chapter 5, we present two optimal sorting
algorithms on 2D LARPBS as well as Columnsort and two-way Columnsort algorithm on
2D LARPBS. In Chapter 6, we present two optimal sorting algorithms for three-dimensional
LARPBS. In Chapter 7, we presents an implementation of Cole-Yap’[16] O([log log n]2)-time
parallel selection algorithm on LARPBS and apply this algorithm to solving the parallel
multi-selection problem. We conclude in Chapter 8.




2.1 Optical Bus Systems
Bus-based architectures refer to as the class of networks where arrays of processors are en-
hanced by various bus systems. Different from other network topologies, bus-based networks
have the advantage of having direct connection between any two processors in the system.
Two disadvantages of a bus-based network are: first, its large diameter, which causes long
communication delay; second, small bisection width. When there is a large amount of mes-
sage transfer between nodes in a network, the bus becomes a communication bottleneck
because it is shared by all processors in the network.
Due to recent advances in VLSI and fiber optic technologies these two disadvantages
become less significant. The study of computing on bus-based architectures becomes a new
and fast growing field. Various bus-based architectures have been proposed for efficient par-
allel computations, including arrays of processors with global buses[7], multiple buses [10],
hyperbuses [25], multi-dimensional bus architectures [32], and reconfigurable buses [47], etc.
Compared with theoretical parallel computing models, such as PRAM, bus-based architec-
tures are practical and can be implemented with current bus technologies.
In a parallel computing system, communications among processors are always the most
important issue. High communication bandwidth and less communication load is the goal
for designing a high performance parallel system. As the silicon and Ga-As technologies ad-
vances, processor speed has already reached the gigahertz (GHz) range. Compared with such
a high processor speed, traditional metal-based communication technologies used in parallel
computing systems become a bottleneck. There are two ways to solve this problem. That is,
either significantly improve the performance of traditional interconnect technologies, or in-
troduce some new interconnect technologies, such as optics, into parallel computing systems.
From the technological point of view, at least three fundamental constraints bound inter-
connections in electronic (or metal-based) systems: limited bandwidth, capacitive loading,
and cross-talk caused by mutual inductance [54].
Optical communications provide an opportunity to redesign the traditional multiproces-
sor solutions free of the three limitations mentioned above. Optical interconnections offer
7
the following attractive features: high speed, high bandwidth, low error probability, gigabit
transmission capacity, increased fanout, long interconnection lengths, low power requirement,
and free from capacitive bus loading, cross talk, and electro-magnetic interference. Optical
waveguide can be used to implement a bus. An optical bus can connect more processors
than an electrical bus. There are three main areas of application for the optical bus com-
munication technology in parallel distributed computing: optical networks, including both
local area network and wide area network; optical buses used to connect components within
a single system; and inter-chip interconnections. Advances in optical and optoelectronic
technologies indicate that they could also be used as interconnections in parallel computers.
Actually, many commercial massively parallel computers such as the Cray T3D have already
used optical technology in their communication subsystems. Thus, optical interconnections
have great potential for improving massively parallel processing systems, and will improve
system performance and algorithm design.
Although optical technology alleviates some of the communication bottlenecks in parallel
computing systems, there are some limitations to this solution. Two obvious limitations are:
first, the transmission speed on the optical channels is bounded by the speed of electronics
and of the optical/electrical and electrical/optical conversion devices. Any interface between
electronics and optics lowers the speed at that interface to the speed of electronics. In
other words, the speed of electronics puts bounds on the transmission speed of optical buses.
Secondly, the end-to-end transmission time for optical signal is not inherently shorter than
those of electrical signals, especially for short distance.
An optical bus system uses optical waveguides to transfer messages among electronic pro-
cessors. In addition to the high transmission speed, optical signals transmitted on an optical
waveguide have two important properties that are not shared with electrical signals running
on an electrical bus. That is, unidirectional propagation and predictable propagation delays
per unit length. These properties enable synchronized concurrent accesses of an optical bus
in a pipelined fashion [14], [33]. Compared with exclusive access for data transmission on an
electrical interconnection, the ability for an optical signal to concurrently access waveguide
greatly reduce the time complexity of many parallel algorithms. Pipelined optical bus sys-
tems can support a massive volume of communications simultaneously, and are particularly
appropriate for applications that involve intensive communication operations such as broad-
casting, n-way one-to-one communication, multicasting, global aggregation, extraction and
compression, and irregular communication patterns.
The optical bus architecture has been studied by many researchers [40, 64, 21, 63, 10,
22, 65, 61, 34, 78]. Figure 2.1 shows an SIMD linear array in which electronic processors
are connected with a folded optical bus. Each processor is connected to the bus with two
directional couplers, one for transmitting on the upper segment (the transmitting segment)
and the other for receiving from the lower segment (the receiving segment) of the bus.
Optical signals propagate unidirectional from right to left on the transmitting segment and
from left to right on the receiving segment. Data are represented by optical pulses in the
following way: An optical pulse p can represent a binary bit 1, and the absence of a pulse
represents bit 0. Messages are organized as fixed-length frames, with each frame containing
8
Figure 2.1: Structure of a folded optical bus
a maximum of b pulses. To be specific, let b = 8, the integer data 169 can be represented as
a message frame (p,−, p,−, p,−,−, p), where p stands for an optical pulse, and - stands for
the absence of a pulse. In order to describe an optical bus system precisely, we define the
most important hardware parameters of an optical bus system as follows:
• ω: the duration in seconds (temporal length) of a single optical pulse.
• cb: the velocity (speed) of light running in waveguides.
• δ: delay (spatial length) of a single optical pulse, the relation between δ and ω is
δ = ω × cb.
• τ : pipeline cycle (or petit cycle), i.e., the time for an optical pulse to traverse the
distance in the waveguide between two consecutive processors.
• T : bus cycle, i.e., end-to-end propagation delay on the bus (the time for an optical
pulse to propagate the entire bus). In the case of a folded linear optical bus, shown in
Figure 2.1, the entire bus is (P0 → P1 → ... → PN−1 → PN−1 → PN−2 → ... → P0).
T = (2N − 1)τ , when there is no delay introduced for reference and select pulses;
otherwise, (2N − 1)τ ≤ T ≤ (2N − 1)τ + (N − 1)ω.
• b: the length of a message frame. (We assume that a message frame is long enough to
hold one basic data item such as an array element. So if processors want to send their
indices, b should be at least dlog Ne.)
9
Messages sent by different processors may overlap with each other even if they propagate
in one direction on the bus. We call the overlapping of more than one message transmission
collision. To avoid transmission collision, we must make sure that τ > bω, so that each
message frame can fit into a pipeline cycle τ . In one bus cycle T , up to N messages can be
transmitted simultaneously in a pipeline fashion without collisions on the bus. And this is
done by assuming that all processors are synchronized at the beginning of each bus cycle.
It is a strictly non-blocking network in the sense that all processors can simultaneously
send/receive messages in one bus cycle. So routing on a bus network is very simple.
2.2 Comparison of Several Optical Bus Models
In this section, we compare three similar one-dimensional optical bus models and several
two-dimensional optical bus models, as they are related to the two optical models used or
proposed in this dissertation: LARPBS[61] and 2D LARPBS.
As we mention in previous section, the unidirectional propagation and predictable delay
of an optical bus enable synchronized concurrent access to an optical bus in a pipelined
fashion. Researchers proposed several models that allow pipelining of data through an optical
bus [21, 61, 34, 78, 45, 51, 57, 37]. Among them, the following three similar models are proved
to be equivalent in [72]:
• LARPBS[61]–the linear arrays with a reconfigurable pipelined bus system, see Figure
2.4, Figure 2.5, Figure 2.6.
• POB[34]–the pipelined optical bus, see Figure 2.2.
• LPB[51]–the linear pipelined bus, see Figure 2.4.
The differences among the models lie in whether or not there are reconfigurable switches
on the transmitting segment, the location of the conditional delay switches, and whether
or not there are fixed delay loops on the receiving segment: The LPB is identical to the
LARPBS with the exception that it does not have the reconfigurable switches. The POB is
even simpler: it has neither the reconfigurable switches nor the conditional delay switches
on the transmitting segment, but it has conditional delay switches instead of fix delay loop
on the receiving segment.
The equivalence of the above three models is based on the following facts:
1. an address in a segmented bus can always be mapped to an address in the non-segment
version of that bus, thus reconfigurable switches can be eliminated at the cost of extra
computation on the destination address.
2. the fact that a processor always receives the first message and ignores all that follow
plays an important role in simulating an operation for a segmented bus on a non-
segmented bus.
10
Figure 2.2: Structure of a POB
11
3. Basic operation binary-prefix-sum can be done without reconfigurable switches by using
the multicasting ability instead of reconfiguration ability. This basic operation also
plays an important role in the proof for equivalence.
The equivalence is proved by showing that each step in one model can be simulated
by another model in constant time, where each step is defined to consist of following three
phases:
1. determine parameters for the actual destinations of all messages
2. create the select frames
3. send the messages
Detailed proof can be found in [72]. In this dissertation, we are interested in LARPBS since
it is the most convenient model for presenting parallel algorithms.
There are several factors that constrain the size of an optical bus system: the bus fan-
out (split loss); power distribution problem; the length of unary addresses; and the increased
latency when the number of processors is large. To achieve better scalability, researchers
proposed the following two or higher dimensional optical array structures:
• PR-Mesh[73] – k-dimensional mesh of processors in which each processor has 2k ports,
with each linear bus in the system being an LARPBS.
• 2-D ACPOB[63] – array connected by pipelined optical buses with conditional delays
• APPBS[21] – array processor with pipelined buses using switches
• AROB[54, 58] – array with reconfigurable optical buses
• ASOS[65] – array structure with synchronous optical switches
• SASOS[40] – symmetric ASOS
• ARSOB[64] – reconfigurable array with spanning optical buses
The common properties of the above models are:
• connect rows and columns of processors using generic optical buses
• place optical switches at the intersections of the row and column buses
• the switches are controlled in synchronization with the TDM to allow packets to be
transmitted along multiple all-optical data paths without intermediate O/E and E/O
conversions.
12
The structural differences for these models lie in the locations of the switches and the
switching rules employed. Among the above five models, two-dimensional PR-Mesh, and
ACPOB are the two dimensional extension of LARPBS and POB, respectively. The equiva-
lence of PR-Mesh, APPBS, and AROB is established in [9]. The AROB model is actually an
implementation of reconfigurable mesh [46] - a powerful parallel computation model that has
been shown to be able to solve many problems very efficiently. All existing algorithms on a
reconfigurable mesh can be ported to its corresponding AROB. But it has a major drawback:
its structure, switching and operation mode are very complicated. Due to the following two
facts, the ACPOB is the simplest among the above models: 1. ACPOB has only one type
of switches – conditional delay switches, and one set of such switches for each dimension;
2. concurrently performing a row communication step and a column communication step is
not assumed in this model. However, computationally, the ACPOB is very powerful. It is
shown in [63] that ACPOB is more powerful than ASOS, SASOS, RASOB, and APPBS.
We proposed a two dimensional extension of LARPBS: we call it 2D LARPBS, see
Figure 2.8. Each processor has a row reconfigurable switch pairs and a column reconfigurable
switch pairs which can either be set to cross or straight. Like ACPOB, we do not assume
a column communication and a row communication to be performed in the same bus cycle.
The advantage of this simplification is that it simplifies the complexity of the model, and
provides convenience for algorithm design. The details of a 2D LARPBS is given in section
2.4.
2.3 The LARPBS Model and Its Basic Operations
2.3.1 The Model
A Linear Array with Reconfigurable Pipelined Bus System (LARPBS) is a folded optical bus
system with one set of fixed delays and two sets of flexible switches. Figure 2.4, Figure 2.5,
and Figure 2.6 show the LARPBS model. To avoid confusion, we use two figures to represent
the same LARPBS system. We show one set of switches in Figure 2.4 and Figure 2.5, and
another set in Figure 2.6. We call the upper half of the bus transmitting segment, the
lower half receiving segment. The optical bus contains three identical waveguides: message
waveguide used to carry messages, and reference waveguide and select waveguide used together
to carry address information, as shown in Figure 2.4. Messages are organized as fixed-length
message frames. Optical signals propagate unidirectional either from left to right on the
upper segment and from right to left on the lower segment of the bus, or vise versa, depending
on how we fold the bus. We define a unit delay ∆ to be the spatial length of a single optical
pulse, that is ∆ = ω × cb, where ω is the pulse duration in seconds and cb is the velocity of
light in the waveguides. Initially, processors are connected to those three waveguides such
that the same length of fiber is used on all three waveguides between any two consecutive
processors. A bus cycle for an optical bus is defined to be the end-to-end propagation delay
on the bus. If we denote τ as the time taken for an optical signal to propagate between
two consecutive processors, the bus cycle for the bus shown in Figure2.4 is 2Nτ . We add
13
one unit delay ∆ between any two processors on the receiving segments of the reference
waveguides and the message waveguides. The one unit delay can be implemented by adding
an extra segment of fiber and the amount of delay added can be accurately chosen based on
the length of the segment. Then we add a conditional delay ∆ between any two processors i
and i + 1, where 0 ≤ i ≤ N − 2, on the transmitting segments of the select waveguides. The
switch between processor i and i + 1 is local to processor i + 1.
The conditional delays can be implemented by 2 × 2 optical switches such as the
Ti:LiNbO3 switches used in an optical computer [6]. The conditional switch can be set
to two different states: cross and straight. When the conditional switch is set to cross, one
unit delay is added to an optical signal passing by; when it is set to straight, no delay is
added to the optical signal passing by. The conditional delay switches are used to change
the destination address at run time based on program needs. An address technology called
coincident pulse technique [14, 33] is used to make it possible. This technique can be used to
route information on an optical bus system in the case when the source processor knows its
destination address. By making use of the two properties of optical signals, namely unidi-
rectional propagation and predictable propagation delays, we can relate time with space and
positional encode information.
Here is how it works: consider an LARPBS with n processors Pi , i = 0, 1, ...n. Suppose
that processor Pi wants to send a message to processor Pj. Let tref denote the time when the
reference pulse is ejected, and tsel denote the time when a select pulse is ejected. Every time
a reference pulse passes a delay loop, either a fixed delay loop or a delay loop introduced by
a conditional delay switch, the reference pulse is postponed one petit time slot. By properly
selecting the time difference between tref and tsel, the two pulses can be made to meet at
certain processor and be detected by the processor’s detector. Figure 2.3(b) shows the initial
address frame for sending a message to processor j, and Figure 2.3(c) shows the same address
frame for the time when the message reaches its destination.
A set of flexible switches, shown in Figure 2.6, are reconfigurable switch pairs. We call
such switches RST (i) and RSR(i), where 1 ≤ i < N . RST (i) is a 1 × 2 optical switch
on the waveguide section between Pi and Pi+1 on the transmitting segment. RSR(i) is a
2× 1 optical switch on the waveguide section between Pi and Pi+1 on the receiving segment.
Both RST (i) and RSR(i) are controlled by processor Pi. These switch pairs are used to re-
configure the bus, either segment the bus into several independent sub-buses that can operate
in parallel, or fuse two or more buses into one single bus. The reconfiguration is performed in
the following way: When all RST and RSR switch pairs are set to straight, the bus system
operates as a regular pipeline single bus system. When one pair, say RST (i) and RSR(i), is
set to cross, the bus system is split into two independent sub-buses. One sub-bus consists of
processor P0, P1, · · ·, Pi. The other sub-bus consists of processor Pi+1, Pi+2, · · ·, PN−1. When
RST (i) is directly connected to RSR(i), the total delay for a signal to pass the optical
fiber between the two switches is made to be τ . Hence, in the above case, the sub-bus with
processors P0 to Pi can operate as a regular linear array with a pipelined bus system; so
does the sub-bus with processors Pi+1 to PN−1. Figure 2.6 demonstrates an LARPBS model
with N = 6 processors which is split into two segments. The first one has two processors,
14
Figure 2.3: Address scheme for an LARPBS
15
i.e., processors P0 and P1; the second one has four processors, i.e., processors P2 to P5. In
the figure, only one waveguide is shown. Another two sets of switches are omitted to avoid




Figure 2.4: Structures of LARPBS and LPB
Figure 2.5: Conditional delay switches
Computation on an LARPBS usually consists of a sequence of alternate communication
and computation steps. Communication time is measured by bus cycle. Thus, the time
complexity of an algorithm implemented on LARPBS is measured in terms of total number
of bus cycles and the time for parallel local computation.
The scalability of an algorithm can be defined as follows [74]: For a given algorithm
and problem size, let T1(N) be the time to run the algorithm on N processors, where N is
the number of processors required to run the algorithm as fast as possible. We then define
T2(P ) to be the time to run the algorithm adapted for P processors for the same problem




where g(N,P ) is the scalability factor of the algorithm. If g(N, P ) is a constant, then the
algorithm is completely scalable. If g(N, P ) is a function of P and not of N , then the














P P P P PP0 1 2 3 4 5
Processor 1X2 Switch 2X1 Switch
Figure 2.6: The LARPBS model of size 6 with two sub-buses
2.3.2 Power of the LARPBS Model
The LARPBS model mentioned in previous section is a powerful model for the following
facts:
• An LARPBS can simulate a PRAM very efficiently. For deterministic methods, each
step of a PRAM algorithm can be simulated by a p-processor LARPBS in O(log p +
log m) time [39], where m is the number of shared memory cells in the PRAM. For
probabilistic methods, an O(log p) time simulation complexity can be achieved with
high probability.
• If extended to higher dimension, the LARPBS model can be very versatile. Yueming
Li, S. Q. Zheng, and Xiangyang Yang proved in [34] that we can embed parallel com-
munication patterns supported by a wide variety of networks in parallel architectures
based on segmented buses with small slowdown factors. Those networks include linear
array, ring, complete binary tree, X-tree, mesh-of-trees, multi-dimensional mesh, torus,
multi-grid and pyramid.
• For some of typical parallel algorithms, an LARPBS can achieve the same or even
better time complexity than algorithms on the PRAM model. This is due to the fact
that some computer tasks can be accomplished during the process of communication.
The parallel random access machine (PRAM) model is the most convenient, standard
and popular abstract device for developing, and analyzing parallel algorithms. The PRAM
is the synchronous version of the shared-memory model. Its power comes from the following
facts [28]:
• There is a well-developed set of techniques and methods to handle many different
classes of computational problems on the PRAM model.
17
• The PRAM model idealizes parallel computing. The use of shared memory accessible
to all the processors makes interconnection communication relatively simple. Thus,
one does not need to worry about algorithmic details concerning synchronization and
communication, but concentrates on the structure properties of the problem and finds
the best parallelism in a computation.
• The PRAM model captures several important parameters of parallel computations. A
PRAM algorithm includes an explicit understanding of the operations to be performed
at each time unit for traditional electrical systems, and explicit allocation of processors
to jobs at each time unit.
• The PRAM gives robust design paradigms. Many practical parallel algorithms can be
directly derived from PRAM algorithms.
However, PRAM is obviously far from practical. Accessing a shared memory location
in constant time is impossible even in small scale shared memory systems. We cannot
ignore memory access delays through a processor-memory interconnection network and the
effect of memory access conflict/contention in shared memory modules. Nevertheless, many
PRAM algorithms have been designed. It is clear that the time complexities of most of
these algorithms are optimistic. They can be used as a measure for algorithms designed
for various practical interconnection network models. Thus, the problem of implementing
PRAM computations on more realistic parallel computers with minimum overheads has
received considerable attention.
Keqin Li, Yi Pan, and Si-Qing Zheng presented several deterministic and probabilistic
methods for simulating PRAM computation on a LARPBS model in [39]. They established
the following results:
• Each step of a p-processor priority CRCW PRAM computation with m = O(p) shared
memory cells can be simulated by a p-processors LRAPBS in O(log p) time, where the
constant in the big-O notation is small.
• Each step of a p-processor priority CRCW PRAM computation with m = Ω(p) shared
memory cells can be simulated by a p-processors LARPBS in O(log m) time.
• Each step of a p-processor priority CRCW PRAM computation with m shared memory
cells can be simulated on a p-processor LARPBS in O(log p) time with probability at
least 1− 1/pc for all c > 0.
These results indicate that a number of important PRAM algorithms can be ported on
the LARPBS model without loss much of speed. One good example is Cole’s pipelined merge
sort, which we will give a detailed implementation solution with the same time complexity
on LARPBS in chapter 4. However, it is not the only way to design efficient algorithms
on LARPBS by implementing or simulating the best known PRAM algorithm. Another
approach is to design efficient algorithms by directly taking the advantages of the special
18
properties of optical interconnection. This second approach may provide unusual methods to
perform some operations more efficiently than what PRAM can do. Here are two examples
for which LARPBS algorithms achieve better time complexities than corresponding PRAM
algorithms if we assume that a bus cycle takes O(1) time: 1. n elements can be sorted in
O(1) time with n2 processors in LARPBS, whereas Ω(log n) is the lower bound for sorting on
a PRAM. 2. binary-prefix-sums operation takes Ω(log n/ log log n) time on a priority CRCW
PRAM with n processors [5], but it takes O(1) time on an n-processor LARPBS. Based on
the above results and observations, we develop our efficient parallel algorithms on LARPBS
and higher dimensional LARPBS using both approaches in this dissertation.
2.3.3 Basic Operations on LARPBS
Assume there are n processors in the LARPBS: P0, P1, · · ·, Pn−1, and the LARPBS folded at
processor Pn−1. All of the following operations take O(1) bus cycle. No local computation
is involved.
Definition 2.3.1 One-to-one[37]: A group of processors (Pi1 , Pi2 , · · ·, Pim) each sends a
message to one of the processors in (Pj1 , Pj2 , · · ·, Pjm). No two processors in (Pi1 , Pi2 , · · ·, Pim)
send a message to the same processor in (Pj1 , Pj2 , · · ·, Pjm).
for 1 ≤ k ≤ m pardo
R(jk) ← R(ik)
endfor
One-to-one communication operation can be implemented in one bus cycle as follows:
each processor Pik in (Pi1 , Pi2 , ···, Pim) sends a reference pulse and a message frame containing
R(ik) at time tref , and a select pulse at time tsel(jk). See Figure 2.7(a) for the address frame.
Whenever a processor Pjk detects a coincidence of a reference pulse and a select pulse, it
reads in the message.
Note that a special case of the one-to-one communication operation is a permutation
operation. In a permutation operation, each processor sends a message and a reference pulse
at the beginning of a bus cycle. The time for a select pulse to be sent is determined by its
corresponding destination processor’s address. And every processor receives one and only
one message at the receiving segment.
Definition 2.3.2 Broadcast[37]: one processor Pi in an n-processor system sends a mes-
sage R(i) to all n processors (P1, P2, · · ·, Pn) in the system, i.e. R(0), R(1), R(2), ..., R(n −
1) ← R(i).
For this operation, all reconfigurable switch pairs are set to straight, and all conditional
delay switches are set to straight to allow no delay to be introduced on transmitting segment
of the select waveguide. The source processor Pi sends a reference pulse at the beginning of
its address frame, and it sends consecutive select pulses in all the n pulse slots in its address
frame on the select waveguide, as shown in Figure 2.7(c). Each processor in the system
19
Figure 2.7: Address structure for basic operations
20
will detect a double-height pulse on the receiving segment of the bus and read the message.
Clearly, one bus cycle is sufficient for broadcasting a message.
Definition 2.3.3 Multicast[62]: one processor Pi in an n-processor system sends a mes-
sage R(i) to a subset (Pj1 , Pj2 , · · ·, Pjm) of the n processors, i.e. R(j1), R(j2), ..., R(jm) ←
R(i).
Similar to broadcast operation, but processor i sends select pulses only at time
tsel(j1), tsel(j2), ..., tsel(jm), see Figure 2.7(b).
Definition 2.3.4 Binary-Prefix-Sums[37]: Assume each processor Pi in the n-processor
LARPBS holds a binary value R(i). The problem of calculating the binary prefix sums is to
calculate Si = R0 + R1 + · · ·+ Ri for all 0 ≤ i < n.
Step 1: Each processor Pi, where 0 ≤ i < n, sets its conditional delay switch in the
following way: if Ri = 1, set it to cross; otherwise, set it to straight. Then Pi sends
its address i to processor Pn−1. In other words, each processor sends a message frame, a
reference pulse, and a select pulse at the beginning of the bus cycle. A processor in the
system may receive multiple messages in this bus cycle, but each processor only reads the
first message and ignores the rest. After this step, all processors set their conditional delay
switches to straight.
Step 2: In the second bus cycle, each processor Pj, who receives address i in the first bus
cycle, sends the value s = (n− j − 1) to processor Pi. This is a one-to-one communication.
Step 3: Each processor Pi who receives a value s in the second bus cycle, sends s to
processor Pi+1. Processor Pi+1 then sets Si+1 = s, sets its reconfigurable switches to cross to
segment the bus into m+1 (m is the number of 1s in the system) sub-buses, and broadcasts
its Si+1. After receiving the broadcasted value, each processor Pi (1 ≤ i < n) sets its Si to
be the received value. And every processor in the system set its reconfigurable switch back
to straight.
Step 4: processor P0 checks its R0, if R(0) = 1, sets m = S0 +1; otherwise, sets m = S0.
This is because processor P0 does not have a conditional delay switch. Then processor P0
broadcast m to every processor in the system. After receiving m, each processor Pi in the
system modify its Si value: Si = m− Si.
Definition 2.3.5 Ordered-Compression [62]: Assume an array of n elements are clas-
sified into active and inactive. The ordered-compression operation is to compress all active
elements to one end of the LARPBS preserving their original order.
The ordered compression operation can be implemented with n processors in one bus
cycle as follows:
Let’s call each processor who holds an active element an active processor. Each processor
Pi, where 0 ≤ i < n, sets its local conditional delay switch to cross if it is active; otherwise,
set it to straight. Then each active processor Pi injects a reference pulse on the reference
21
waveguide and a select pulse on the select waveguide at the beginning of a bus cycle. In other
words, every active processor tries to address processor n− 1. A message frame containing
the element it holds is also sent simultaneously with the address frames. The select pulse
sent by the active processor whose index is the largest passes no cross-setting delay switch in
the transmitting segment. Thus, select pulse and reference pulse sent by that processor will
coincide at processor Pn−1, and the corresponding message is picked up by processor Pn−1.
Similarly, the select pulse sent by the processor whose index is the second largest passes
one cross-setting delay switch on the transmitting segment of the bus. Thus, the two pulses
will meet at processor Pn−2, the corresponding message is picked up by processor Pn−2. In
general, message sent by the active processor who is the kth largest is received by processor
Pn−k. At the end of this ordered compression, active elements are compressed to the higher
index end of the bus, and the original order is preserved.
Definition 2.3.6 Split[61]: Assume an array of n elements are classified into active and
inactive. The split operation is to separate all active elements from all inactive elements.
The split operation proposed in [61] can be performed in the following three steps:
Step 1: do an ordered-compression operation on active elements to move them to the
higher index end of the bus. Again we call each processor who holds an active element an
active processor. Then each active processor sends a dummy message to the processor on its
right. The active processor Pn−i who does not receive a dummy in the bus cycle broadcast
i to every processor in the system.
Step 2: do an ordered-compression operation on inactive elements to move them to the
higher index end of the bus. But this time each inactive processor addresses its message
to processor Pn−1−i. At the end of this operation, all inactive elements are moved to the
left side of the bus, and all active elements are moved to the right side of the bus. And we
complete the split operation.
2.4 2D LARPBS and Its Basic Operations
2.4.1 2D LARPBS – an extension to LARPBS
We propose a 2-D LARPBS model in this chapter. See Figure 2.8. A 2-D LARPBS has
two sets of switches at each row and column intersection: conditional delay switches and
reconfigurable switches. Figure 2.8 only draws processor connections to avoid confusion.
Compared with the 2-D ACPOB, it has one extra pair of switches for the convenience of
row bus reconfiguration and column bus reconfiguration. Actually, since we do not assume
column communication and row communication be performed concurrently, each processor
can use the same set of switches (including reconfigurable switch pairs and conditional delay
switch) for its row bus and column bus if the switching between column bus and row bus can
be done in a reasonable time. We classify a communication step into two types: a column
communication step and a row communication step. The communication performance of an
22
Figure 2.8: A 2D LARPBS
23
algorithm designed for this model is calculated by the total number of bus cycles for both
row communication steps and column communication steps. In the following section, we
follows the same idea in [63] to give some basic operations for the 2-D LARPBS model.
2.4.2 Basic Operations on 2D LARPBS
2-D LARPBS is more scalable than 1-D LARPBS, but communication in a 2-D LARPBS is
more restrictive. One-to-one communication in LARPBS can be done in constant time, but
not all permutation of the one-to-one communication can be done in 2-D LARPBS in con-
stant time. In this section, we show the following special permutation operations can be done
in O(1) time: Matrix Transpose and Reverse Matrix Transpose, 2D Integer Prefix Sums,
2D Block BPS, 2D BPS, and 2D-compaction operation. These basic permutation opera-
tions will be used as building blocks in the algorithms proposed in later chapters. Now we
present the basic operations one by one.
Definition 2.4.1 Transpose Matrix: Assume matrix A and B are both n × n matrices.
Then matrix B is the transpose of matrix A, if bij = aji for all 1 ≤ i, j ≤ n.
Definition 2.4.2 Matrix Transpose: For any matrix A, pick up the items in column-
major order and set them down in row-major order.
Definition 2.4.3 Reverse Matrix Transpose: For any matrix A, pick up the items in
row-major order and set them down in column-major order.
It is easy to see, if a matrix has equal number of rows and columns, then its Ma-
trix transpose operation and Reverse Matrix Transpose operation are equivalent. Thus we
will only give a description on the Matrix transpose operation. We follow the same idea for
the matrix transpose algorithm on ACPOB given in [63] to obtain the following algorithm
on 2D LARPBS.
Algorithm 2.4.1 Matrix Transpose
Input: a r × r matrix A.
Output: matrix B, the transpose of matrix A
Begin
1. Eliminate collision by row communication: processor Pi,j sends its item xi,j to processor
Pi,((j+i−1) mod r). In other words, we imagine each row in the matrix to be a circular
queue. Processor Pi,j moves its item i− 1 positions to the right, where 1 ≤ i ≤ n. The
purpose of this row shift communication is to send items on the same column in the
matrix to different columns. To be precise, no two items ai,j and ak,j, where i 6= k, are
in the same column after this row communication. Thus, there will be no collision in
the communication performed in the next step. Each processor will receive one and
only one items in the next step.
24
2. Send items to correct rows: a multiple one-to-one column communication operation is
performed so that all items, whose destination processors are in row i, are sent to that
row. To be precise, if processor Pi,j has item ei,k, then Pi,j sends ei,k to processor Pk,j.
3. Send items to the correct columns: a multiple one-to-one row communication operation
is performed so that all items, whose destination processors are in column j, are sent
to that column. To be precise, if processor Pi,j has item eh,i, then Pi,j sends eh,i to
processor Pi,h. Now all items eh,i are obtained by their destination processor Pi,h, where
1 ≤ h, i ≤ n.
End
Example 2.4.1 Matrix Transpose on 2D LARPBS.


1, 1 1, 2 1, 3 1, 4 1, 5
2, 1 2, 2 2, 3 2, 4 2, 5
3, 1 3, 2 3, 3 3, 4 3, 5
4, 1 4, 2 4, 3 4, 4 4, 5







1, 1 1, 2 1, 3 1, 4 1, 5
2, 5 2, 1 2, 2 2, 3 2, 4
3, 4 3, 5 3, 1 3, 2 3, 3
4, 3 4, 4 4, 5 4, 1 4, 2






1, 1 2, 1 3, 1 4, 1 5, 1
5, 2 1, 2 2, 2 3, 2 4, 2
4, 3 5, 3 1, 3 2, 3 3, 3
3, 4 4, 4 5, 4 1, 4 2, 4







1, 1 2, 1 3, 1 4, 1 5, 1
1, 2 2, 2 3, 2 4, 2 5, 2
1, 3 2, 3 3, 3 4, 3 5, 3
1, 4 2, 4 3, 4 4, 4 5, 4
1, 5 2, 5 3, 5 4, 5 5, 5


Step 2 Step 3
The basic idea of the 2D Binary Prefix Sums algorithm proposed in [63] is: divide the
n × n matrix A into √n sub-matrices, each has size √n × n; compute the binary prefix
sums for each sub-matrix using block binary prefix sums algorithm; then apply an integer
binary prefix sums algorithm to get the prefix sums for sub-matrix sums; finally the binary
prefix sums for each element can be obtained by adding its prefix sums within its sub-matrix
with prefix sums for sub-matrix sums of the proceeding sub-matrix. We give the binary
prefix sums algorithm by first describing the two basic algorithms: Integer Prefix Sums and
Block BPS, which will be used in the 2D Binary Prefix Sums for calculating binary prefix
sums of the sub matrices and calculating block prefix sums. Then we give a description for
the 2D Binary Prefix Sums algorithm. The 2D binary prefix sums algorithm given below
uses
√
n× n processors to computer the prefix sums of √n integers. Each integer falls into
the range [0, n2 − 1]. Thus the n 12 -ary representation of each integer requires at most d = 4
digits. Assume, this
√
n are first distributed to the last column of the
√
n×n 2D LARPBS.
Here is the detail of the algorithm.
25





Algorithm 2.4.2 2D Integer Prefix Sums
Input:
√
n integer distributed to the last column of the
√
n × n matrix A, one integer per
processor. And let d = 4.




1. Send integer to diagonal processors : each processor Pi,n, where 1 ≤ i ≤
√
n, sends its




2 -ary representation : each processor Pi,i, where 1 ≤ i ≤
√
n, broadcasts
Ii to every processor in column i. Refer to Table 2.2. Then, processor Pi,j, where
1 ≤ i ≤ d and 1 ≤ j ≤ √n, locally computes the n 12 -ary representation of integer Ii,
and saves the jth digit as its local value bi,j. Refer to Table 2.3.
3. Spread out each n
1
2 -ary representation digit with a sequence of 1s : each processor Pi,j,
where 1 ≤ i ≤ d and 1 ≤ j ≤ √n, use its bi,j as count for number of 1s and spread 1s
to the right in the following way: if bi,j 6= 0, processor Pi,j multicasts 1 to processors
P
i,(j−1)n 12 +1, · · ·, Pi,(j−1)n 12 +bi,j . Refer to Table 2.4.
4. Apply 1D binary prefix sums : Apply 1D binary-prefix-sums operation to the first d
rows to compute row binary prefix sums. Refer to Table 2.5.
5. Collect and convert results to decimal : processor Pi,k√n, where 2 ≤ i ≤ d and 1 ≤
k ≤ √n, sends its binary prefix sums si,k√n to processor P1,k√n one by one. As d is a
constant, these multiple one-to-one column communication operations take constant
time. Each processor P1,k√n then convert the n
1
2 -ary representation to decimal value.
Refer to Table 2.6.
6. Move output to its corresponding input location : processor P1,k√n, where 1 ≤ k ≤
√
n,
sends the decimal prefix sums value to processor Pk,n using a one-to-one column com-
munication operation followed by a one-to-one row communication operation. Now,
the output for the integer prefix sums of an input item can be found at the same
location where it is distributed at the beginning of the algorithm. Refer to Table 2.7.
End
Based on the above special case integer prefix sums algorithm, we can obtain the fol-
lowing block binary prefix sums algorithm.
26
Table 2.2: Step 2: calculating n
n
2 -ary representation
7=013 12 16 9
7 12=030 16 9
7 12 16=100 9
7 12 16 9=021
Table 2.3: Step 2: saving corresponding n
n
2 -ary representation digit
3 0 0 1
1 3 0 2
0 0 1 0
Table 2.4: Step 3: multicasting 1s
1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 1 1 1 0 0 0 0 0 1 2 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
Table 2.5: Step 4: performing row BPS
1 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4
1 1 1 1 2 3 4 4 4 4 4 4 5 7 7 7
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
Table 2.6: Step 5: collecting and converting results to decimal values on the first row
7=013 19=043 35=143 48=174






Algorithm 2.4.3 2D Block BPS
Input: n × √n Boolean values distributed in an n × √n 2D LARPBS A, one value per
processor.
Output: the binary prefix sums for every item in A.
Begin
1. Compute binary prefix sums on each row : do a binary-prefix-sums operation on each
row using the basic 1D algorithm.
2. Compute integer prefix sums for each row sum : the binary prefix sums si,n, where
1 ≤ i ≤ √n obtained in the previous step for the last processor in each row is in the
range [0, n]. Thus we apply algorithm 2.4.2 2D Integer Prefix Sums to the
√
n si,n to
obtain the row prefix sums Si =
∑i
j=1 si,n.
3. Pass row prefix sums to next row : the last processor in each row Pi,n, where 1 ≤ i < n,
sends its Si to processor P(i+1),n. This is a one-to-one column communication operation.
4. Broadcast row prefix sums within each row : the last processor in each row Pi,n, where
1 < i ≤ n, broadcasts the S(i−1) received in previous step to every processor in row i.
5. Calculate final binary prefix sums : every processor Pi,j, where 1 < i ≤ n and 1 ≤ j ≤ n
calculates its final prefix sums by adding the row prefix sums of the previous row with
the prefix sums calculated in step 1: si = S(i−1) + si.
End
Now we are ready for our constant time binary prefix sums algorithm on 2D LARPBS.
Algorithm 2.4.4 2D BPS
Input: n× n Boolean A, one element per processor.
Output: the binary prefix sums for every item in A.
Begin
1. Partition and compute Block BPS : partition A into
√
n sub-matrix A1, A2, · · ·,A√n.
Each of them is a
√




2. Compress Block BPS’s to the last processor on the first
√
n rows : the last processor
in each sub-matrix, processor Pk√n,n, where 1 ≤ k ≤
√
n, sends its prefix sums sk√n,n
obtained in the previous step to processor Pk,n. This one-to-one column communication
operation serves the purpose of compressing the
√
n sub-matrix prefix sums to the last
processor in the first
√
n rows so that their prefix sums can be calculated using the
first sub-matrix.
3. Compute sub matrix prefix sums : As sk√n,n ≤ n
√
n < n2 − 1, we can apply algorithm
2.4.2 2D Integer Prefix Sums to the first sub-matrix to compute prefix sums for sk√n,n,
where 1 ≤ k ≤ √n. The results are stored in the last processors on the first √n rows.
28
4. Multicast sub matrix prefix sums : processor Pi,n, where 1 ≤ i <
√
n, multicasts the
prefix sums Si obtained in the previous step to processors Pi√n+k,n, where 1 ≤ k ≤
√
n.
This is a column multiple multicast communication operation. After these multicast
operations, each processor Pj,n, where 1 ≤ j ≤ n, broadcasts the received prefix sum
Sbj/√nc to every processor in row j.
5. Calculate final binary prefix sums : the final binary prefix sums for each processor Pi,j,
where 1 ≤ i, j ≤ n, can be obtained by adding its local prefix sums obtained in step 1
and the one obtained in step 4 : si,j = si,j + Sbi/√nc.
End
It is clear that each step in algorithm 2.4.4 takes constant time. Therefore, algorithm
2.4.4 runs in O(1) time.
The following is a description for 2D compaction operation:
Algorithm 2.4.5 2D Compaction
Input: an n×n Boolean matrix A, where elements are distributed to n2 processors such that
each processor Pi,j holds a Boolean value ai,j. Each Pi,j also holds a data item bi,j. There
are a total of x processors with ai,j = 1.
Output: all items bi,j with ai,j = 1 are moved to the first x processors in A in row major
or column major order.
Begin
1. Compute binary prefix sums for each item: do a 2D-Binary-Prefix-Sum operation to
get the binary prefix sums si,j for each item bi,j in matrix A.
2. Compute destination address: each processor Pi,j who has ai,j = 1, compute its des-
tination address (i′, j′). For row major order, then i′ = bsi,j/nc, j′ = si,j mod n; if
output is in column major order, then j′ = bsi,j/nc, i′ = si,j mod n.
3. Send items to correct row or column: each processor Pi,j who has ai,j = 1 does the
following. For column major order, Pi,j sends its item bi,j and new column address j
′
to processor Pi′,j. As two processors on one column cannot be more than n positions
apart for column major compaction, this is a one-to-one row communication operation.
For row major order, Pi,j sends its item bi,j and new row address i
′ to processor Pi,j′ .
As two processors on one row cannot be more than n positions apart for row major
compaction, this is a one-to-one column communication operation.
4. Send items to their final destinations: For column major order, each processor Pi,j′ who
has received item bi,j and row address i
′ in the previous step sends bi,j to processor
Pi′,j′ . This is a one-to-one row communication operation. After this step, all items
whose ai,j = 1 are compressed to the first x processors in column major order. For row
major order, each processor Pi′,j who has received item bi,j and column address j
′ in
previous step sends bi,j to processor Pi′,j′ . This is a one-to-one column communication
29
operation. After this step, all items whose ai,j = 1 are compressed to the first x




Boolean Matrix Multiplication and Its
Applications
3.1 Introduction
General matrix multiplication problem on an LARPBS has been studied extensively by Keqin
Li, Yi Pan, and Si Qing Zheng in [35, 37, 38]. In this chapter, we present a constant-time
Boolean matrix multiplication algorithm on the LARPBS model. Our algorithm achieves
better speed-up compared with the four Russians’ algorithm proposed by Agerwala and
Lint in [4] and a hypercube implementation provided by C. Gregory Plaxton [55]. A parallel
implementation of the Four Russians’ algorithm on the LARPBS with constant running time
is proposed by Keqin Li in [31]. Compared with Li’s constant-time algorithm, our algorithm
is not based on any existing sequential or parallel algorithm and it is much simpler and
uses fewer processors. our algorithm uses only O(n2) processors, while Li’s algorithm uses
O(n3/logn) processors. Refer to Table 3.1 for a comparison on different parallel Boolean
matrix multiplication algorithms.
A closely related problem is the Boolean matrix closure (BMC) problem, i.e., the problem
of calculating transitive closure for a directed graph. By using our constant time BMM
algorithm, we can compute the transitive closure of a directed graph in O(log n) time on
an LARPBS with O(n2) processors. If a polynomial number of processors are employed, it
Table 3.1: Comparison of different parallel Boolean matrix multiplication algorithms.
Algorithm Time Complexity Processor Complexity
Our new LARPBS Algorithm O(1) O(n2)
Li’s Parallel Four Russians’ Algorithm [31] O(1) O(n3/logn)
Plaxton’s Hypercube Algorithm [55] O(log n) O(n3/log n2 log log n)
Agerwala and Lint’s Algorithm [4] O(log n) O(n3/(log n log log n))
31
is shown in [15] that the lower bound for parallel time complexity of transitive closure is
Ω(log n/ log log n) on a CRCW PRAM.
3.2 Definition of the Problem
Given two Boolean matrices A and B. Let C denotes the product of matrices A and B, the




ai,k ∧ bk,j (3.1)
where
∨
represents logical-or, and ∧ represents logical-and.
This equation leads to a simple and well-known O(log n) time matrix multiplication
algorithm running on a hypercube using n3 processors [17].
3.3 A New Constant Time Boolean Matrix Multipli-
cation Algorithm
In this section, we introduce a new constant time Boolean matrix multiplication algorithm
for the LAPBRS model. Compared with Li’s parallel implementation of four Russians’
algorithm on the LARPBS, our algorithm requires much less processors and is much simpler.
Instead of implementing an existing algorithm on a new model, we provide a totally new
idea for solving the problem on LARPBS.
The algorithm we present here uses O(n2) processors on an LARPBS. We use a different
strategy when computing matrix C: Instead of trying to send ci,j all the needed elements in
A and B, we think the other way around: for each element bi,j in B, we try to find out which
processor in the system should know its value in order to get the final results. Throughout
this chapter, we use Pi,j to denote the (in + j)
th processor on a one-dimensional LARPBS.
This helps simplify our algorithm description.
Assume that each processor Pi,j has the following registers:
• Ra: a one bit register used to store matrix element ai,j.
• RB: an n bit register, used to store elements in row j of matrix B, B(j) = (bj,1, bj,2, · · · , bj,n).
• Rc: a one bit register, used to store matrix element ci,j. It is initialized to 0.
At the beginning of the algorithm, elements in matrix A and B are distributed to
processors in LARPBS in the following way: A is distributed to the n2 processors one
element per processor in row major order, i.e., ai,j is stored in processor Pi,j. B is stored
in the first n processors one row per processor, i.e., B(j) = (bj,1, bj,2, · · · · ·, bj,n) is stored in
processor P1,j, where 1 ≤ j ≤ n. Results of the algorithm will be stored in the Rc register
in each processor.
32
Algorithm 3.3.1 Boolean Matrix Multiplication
Input: Two n× n boolean matrices A and B.
Output: boolean matrix C = A×B.
Begin
1. Multicast B(j): each processor P1,j, where 1 ≤ j ≤ n, multicasts its B(j) in RB to
processors Pk,j, where 1 < k ≤ n.
2. Each processor Pi,j checks its Ra, and sets itself to active if Ra = 1.
3. Form sub-buses: processors Pi,n, where 1 ≤ i ≤ n, set their reconfigurable switches to
cross to form n sub-buses. Each sub-bus has n processors.
4. Calculate ci,j: each active processor Pi,j does the following multicast operation: uses
the B(j) in its RB to set its select frame, and then multicasts a dummy message
within its sub-bus. For example, if B(j) = (1, 0, 0, 1, 0, 1), then the select frame is
set to (p,−,−, p,−, p). If processor Pi,j receives a message in this multiple multicast
operation, then Pi,j sets its Rc = 1; otherwise it sets its Rc = 0. After this local
operation, we obtain matrix C: processor Pi,j holds matrix element ci,j in register Rc.
End
After step 1, each processor Pi,j has bit array B(j) = (bj,1, bj,2, · · · , bj,n). So all processors
have their RB ready. This step consists of only one basic operation: multiple multicasts
operation. Step 2 performs the following logical-and operation: ai,j ∧ bj,k for 1 ≤ k ≤ n.
Step 3 involves no communication and computation. This step is used to reconfigure the
system into n sub-buses, with each of them responsible for calculating one row in matrix C.
Step 4 is the most important step that bears the main idea for this algorithm. The beauty
of this step is that communication itself also accomplishes special computations. Here is
how it works: The calculation of C is based on Equation 3.1. Each element ai,j in A will
participate in the following logical-and calculation with all elements on the jth row in matrix
B: ai,j ∧ bj,1, ai,j ∧ bj,2, · · ·, ai,j ∧ bj,n. The results of these calculations need to be sent to
processors Pi,1, Pi,2, · · ·, Pi,n for computing ci,1, ci,2, · · ·, ci,n. Here we compute the logical-or∨
and send the results to their correct processors in one multiple multicast communication
operation. Notice that if ai,j = 0, all of the above logical-or operations will result in 0. As we
have already initialized Rc to be 0, the logical-or operations are done by only allowing active
processors to send dummy messages. Now we show how to send the logical-or operations
results for processors whose ai,j equal to 1. If ai,j = 1, then we need to let processor Pi,k
know that there is a 1 result from ai,j ∧ bj,k only if bj,k is non-zero. Thus, if we use the vector
B(j) stored in RB to set the select frame for the dummy message, then this message will
be addressed to processor Pi,k only if bj,k is non-zero. This is how ai,j ∧ bj,k is done. Each
processor in each sub-bus can receive more than one message in Step 4. For example, assume
A and B are two 4 × 4 matrices. Then processor P2,1 will receive a dummy message from
every processor in the second sub-bus if the following matrix elements have value 1: a2,1,
b1,1, a2,2, b2,1, a2,3, b3,1, a2,4, and b4,1. As each processor only cares about whether a dummy
33
message is sent to it or not, it reads the first message addressed to it and ignores the rest.
This is how the logical-or operations of Equation 3.1 are done. Notice this multiple multicast
operation performs all the logical-or calculation needed for all elements in C. The reason
why we can use the binary vector B(i) to set the select frame is: if the LARPBS system has
n2 processors, then the system should be able to address the n2 processors. According to
the address mechanism of the LARPBS system, a select frame should consists of at least n2
pulses for it to address n2 processors. For the case of n×n 2D LARPBS, the address should
at least be n-pulse long. In that case, it is still long enough to hold vector B(i).
It is clear that this algorithm only consists of two multiple multicast operations and a
couple of local Boolean value comparisons. Thus, the time complexity of this algorithm is
O(1). We obtain the following theorem.
Theorem 1 Given two n × n Boolean matrices A and B. Let C denote the product of
matrices A and B. Matrix C can be calculated in O(1) bus cycles, and in O(1) computation
time, using O(n2) processors on an LARPBS.
Refer to Figure 3.1 for an example.
3.4 Extension to 2D LARPBS
Our algorithm can be easily extended to a 2D LARPBS. Here is the 2D LARPBS algorithm.
Algorithm 3.4.1 2D Boolean Matrix Multiplication
Input: Two n× n boolean matrices A and B.
Output: boolean matrix C = A×B.
Begin
1. Multicast B(j): each processor P1,j, where 1 ≤ j ≤ n, multicasts its B(j) in RB to
processors Pk,j using its column bus, where 1 < k ≤ n.
2. Each processor Pi,j checks its Ra, and sets itself to be active if Ra = 1.
3. Calculate ci,j: active processors Pi,j does the following multicast operation: use the
B(j) to set select frames, multicast a dummy message on their row buses. If processor
Pi,j receives a message in this multiple multicast operation, then it sets its Rc = 1;
otherwise it sets its Rc = 0. After this local operation, we obtain our final result C:
processor Pi,j holds matrix element ci,j in register Rc.
End
34
Figure 3.1: Structure of a folded optical bus: (a) step 1 (b) step 2 and step 3 (c) address
frames for some of active processors
35
3.5 Discussion on Scalability
Suppose the number of available processors is p. If n ≤ p < n2, we can modify algorithm
3.3.1 in the following way to achieve a good scalability:
We still want to split the p processors into n sub-buses in Step 3. Each group is re-
sponsible for one row of matrix A, B, and C. Thus there are m = p/n processors in each
sub-bus. Every processor has three arrays: A, B, and C, each of which has size n2/p. Now,
let’s reconsider the computation and communication time. Each of Step 1 and Step 4 takes
n2
p
bus cycles. There is no local computation in Step 1. Each of Step 4 and Step 2 takes n
2
p




words, we have T1(N) = O(1), T2(P ) = O(
n2
p
), where N = n2 and P = p. The scalability
factor for our algorithm is g(N,P ) = (T2(P )/T1(N)) × PN =1. Thus our Boolean matrix
multiplication algorithm is completely scalable. We obtain the following theorem.
Theorem 2 Given two n × n Boolean matrices A and B. Let C denote the product of
matrices A and B. There is a completely scalable algorithm on the LARPBS model. The
algorithm runs in O(n
2
p
) time, where p is the number of processors and 1 ≤ p ≤ n2.
3.6 Application to Transitive Closure
Li [31] gave an O(log n)-time transitive closure algorithm by repeatedly applying his constant-
time Boolean matrix multiplication (BMM) algorithm. Our BMM algorithm can also be used
directly for calculating the transitive closure of a directed graph.
The following lemma is given in [28] with proof.
Lemma 3.6.1 Let A be an n×n Boolean matrix representing the directed graph G = (V, E).
Then the incidence matrix A∗ of the transitive closure of the graph G is given by A∗ =
(I + A)2
dlog ne
, where I is the n× n identity matrix.
Based on Lemma 3.6.1, the incidence matrix A∗ can be obtained by repeatedly running
Boolean matrix multiplication algorithm for log n times to compute the following matrix:
(A ∨ I)21 , (A ∨ I)22 , · · ·, (A ∨ I)2dlog ne
Each of these matrices is obtained by following Boolean matrix multiplication operation:
(A ∨ I)2k = (A ∨ I)2(k−1) × (A ∨ I)2(k−1)
where 1 ≤ k ≤ log n.
By repeatedly applying our BMM algorithm, we can obtain an O(log n)-time transitive
closure algorithm using only O(n2) processors.
36
Algorithm 3.6.1 Transitive Closure
Input: adjacency matrix A for graph G = (V, E).
Output: transitive closure A′ of graph G.
Begin
1. Compute X = A ∨ In : each processor Pii, where 1 ≤ i ≤ n, set Ra = 1.




Theorem 3 The transitive closure of a directed graph with n vertices can be obtained in
O(log n) time on an LARPBS using O(n2) processors.
3.7 Application to Connected Component Problems
An application of transitive closure algorithm is finding the connected components for undi-
rected graphs and finding strongly connected components for directed graphs. We show
how to solve these problems on an LARPBS. Figure 3.2 shows a undirected graph G with 3
connected components.
Algorithm 3.7.1 Connected Component
Input: adjacency matrix A for undirected graph G = (V, E) distributed to an n2-processor
LARPBS, one element per processor.
Output: connected components of graph G with vertices and its component ids saved in the
first n processors, one vertex per processor.
Begin
1. Apply algorithm 3.6.1 Transitive Closure.
2. Each processor Pi,n, where 1 ≤ i ≤ n, sets its reconfigurable switches to cross to
segment the bus into n sub buses. Then each processor Pi,j whose ai,j = 1 sends index
j to processor Pi,n. Each processor Pi,n accepts the first message addressed to it and
ignores the rest. Each processor Pi,n, where 1 ≤ i ≤ n, checks the received index j: if
j = i, sets flagi,j = 1; otherwise, sets flagi,j = 0. Pi,n then broadcasts flagi,n on its
sub bus. Each processor Pi,j sets itself to active if the received flagi,j = 1 and ai,j = 1.
Then all processors set their reconfigurable switches to straight to form a single bus.
37
Figure 3.2: Connected components of graph G
38
3. Assign index for each connected component : each processor Pi,1, where 1 ≤ i ≤ n, sets
bi,1 = 1 if its flagi,1 = 1; all other processors Pi,j set bi,j = 0. Do a binary-prefix-sum
operation for bi,j. Each processor whose flag = 1 saves its prefix sum as its connected
component id x.
4. Collect connected components : do an ordered-compression operation for all active
processors Pi,j to compress index pairs (x, j) to the first n processors of the bus.
Each of the first n processors saves the received index pair (x, j): x is the connected
component id, j is the vertex index.
End
39
Example 3.7.1 Finding connected components for graph G in Figure 3.2.
j bps index

1 0 0 0 0 1 1
0 1 0 1 0 0 0
0 0 1 0 0 0 1
0 1 0 1 0 0 0
0 0 0 0 1 0 0
1 0 0 0 0 1 1






1 0 1 0 0 1 1
0 1 0 1 0 0 0
1 0 1 0 0 1 1
0 1 0 1 0 0 0
0 0 0 0 1 0 0
1 0 1 0 0 1 1
































Adjacency matrix A after step 2
In the matrix shown above, the elements with boxes are active elements. After step 5,
following data are stored in the first 7 processors:
{(1, 2), (1, 4), (2, 5), (3, 1), (3, 3), (3, 6), (3, 7)}.
These data give us the following connected components in G: (2, 4), (5), and (1, 3, 6, 7).
40
By adding an extra step, i.e. Step 2 in the following algorithm, we can obtain the
following strongly connected component algorithm.
Algorithm 3.7.2 Strongly Connected Component
Input: adjacency matrix A for directed graph G = (V, E) distributed to an n2-processor
LARPBS, one element per processor.
Output: strongly connected components of graph G with vertices and its component ids in
each component saved in the first n processors, one vertex per processor.
Begin
1. Apply algorithm 3.6.1 Transitive Closure.
2. Remove single directional connections : each processor Pi,j whose ai,j = 1 sends a
dummy message to processor Pj,i. Then each processor Pi,j whose ai,j = 1 and who
does not receive a message in this one-to-one communication operation sets its ai,j = 0.
3. Each processors Pi,n, where 1 ≤ i ≤ n, set their reconfigurable switches to cross to
segment the bus into n sub buses. Then each processor Pi,j, whose ai,j = 1, sends its
index j to processor Pi,n. Each processor Pi,n accepts the first message addressed to
it and ignores the rest. Each processor Pi,n checks the received index j: if j = i, sets
flagi,n = 1; otherwise, sets flagi,n = 0. Pi,n then broadcasts flagi,n on its sub bus.
Each processor Pi,j sets itself to be active if the received flagi,n = 1 and ai,j = 1. Then
all processors set their reconfigurable switches to straight to form a single bus.
4. Assign index for each connected component : each processor Pi,1, where 1 ≤ i ≤ n, sets
bi,1 = 1 if its flagi,1 = 1; all other processors Pi,j set bi,j = 0. Do a binary-prefix-sum
operation for bi,j. Each processor Pi,j whose flagi,j = 1 saves its prefix sum as its
connected component id x.
5. Collect connected components : do an ordered-compression operation for all active
processors Pi,j to compress index pairs (x, j) to the first n processors of the bus.
Each of the first n processors saves the received index pair (x, j): x is the connected
component id, j is the vertex index.
End
Theorem 4 The connected components of an undirected graph and the strongly connected
components of a directed graph with n vertices can be obtained in O(log n) time on an
LARPBS using O(n2) processors.
41
Chapter 4
Parallel Sorting Algorithms on
LARPBS
4.1 Introduction
Sorting is a well-known problem that has been studied extensively in the literature because
it has so many important applications and it has a theoretical importance. Researches
on parallel sorting algorithms have been done on different models. There are mainly two
theoretical computational models that have been considered for parallel sorting algorithms
– the circuit model and the PRAM model. A parallel sorting algorithm is often said to
be optimal (with respect to cost) if its cost is O(n log n). The first O(log n) time sorting
algorithm is the AKS sorting network on circuit model, and it is optimal as it uses O(n)
comparators. The second O(log n) time sorting algorithm is Richard Cole’s parallel merge
sort [11], which is an optimal sorting algorithm on CREW PRAM model. Compared with
AKS sorting network, Richard Cole’s merge sort algorithm has a much smaller constant in
O notation for running time. Although the AKS-network can be combined with a variation
of the odd-even network to give an optimal sorting network with O(log n) time and O(n)
comparators, the constant of proportionality of this algorithm is still immense.
Several researches have been done for sorting on the LARPBS model. A constant-
time parallel sorting algorithm is given in [62] which uses O(n2) processors. An efficient
and scalable quicksort algorithm on the LARPBS model proposed in [60] runs at O(log n)
average time on an LARPBS with size n, and O(n) time in the worse case. An O(log2 n)
time deterministic algorithm that sorts n elements on n processors LARPBS is given in
[52]. In this chapter, we present an innovative implementation of Cole’s parallel merge sort
algorithm [11] on LARPBS. To the best of our knowledge, it is the first O(log n) time sorting
algorithm implemented on an LARPBS of n processors. In later chapters, we will apply this
algorithm to higher dimensional LARPBS to obtain optimal sorting algorithms.
This chapter is organized as follows. Section 4.2 gives the existing O(n2) processor
constant time sorting algorithm on LARPBS. This algorithm is used in our selection algo-
rithm presented in a later chapter. Section 4.3 introduces the basic idea of the Cole’s CREW
42
PRAM merge sort algorithm. Section 4.4 describes our sorting algorithm on LARPBS model
in detail. The issues of processor assignment and reuse are also addressed in this section.
4.2 A Constant-Time Sorting Algorithm on LARPBS
In this section, we give a description of the constant-time sorting algorithm, which is slightly
different from the one proposed in [62]. This algorithm uses n2 processors. Without lose of
generality, we assume the n elements are distinct.
Algorithm 4.2.1 Constant Time Sort
Input: a sequence X of n unsorted elements distributed to the first n processors, one per
processor.
Output: the n elements stored in the first n processors in sorted order, one element per
processor.
Begin
1. Distribute the elements on the first n processors in the following way: processor Pi
multicast its element x to processor Pkn+i, where 1 ≤ k < n, and 1 ≤ i ≤ n.
2. Segment the bus into n sub-buses, each having n processors. Processor Pin+i set y = x
and broadcasts y to every other processor in its sub-bus.
3. Each processor compares the two elements x and y: if x ≤ y, sets its result = 1;
otherwise, sets its result = 0. Do a binary-prefix-sums operation on each sub-bus.
Then all processors set their reconfigurable switches to straight to form a single bus.
4. Each processor Pkn, where 1 ≤ k ≤ n, uses its binary prefix sums as the address for
the destination processor, and sends its element y to that processor.
End
Since we use only basic operations in the algorithm, it is clear to see that the total time
required for this algorithm is O(1).
4.3 Cole’s Merge Sort Algorithm for CREW PRAM
In order to obtain an optimal sorting algorithm on an LARPBS, we identify two possible
approaches. The first one is to implement Cole’s algorithm on an LARPBS, and the second
one is to used an LARPBS to simulate an AKS sorting network.
43
4.3.1 Comparison of Cole’s Merge Sort and the AKS Sorting Net-
work
Both sorting algorithms are asymptotically optimal, but they are different in the following
ways:
• AKS sorting is not uniform. Although for a given value of n, we can construct with a
great deal of effort a sorting network with O(n log n) comparators that sorts n numbers
in O(log n) time, the complexity-parameter, n, cannot be regard as one of the inputs
to the sorting algorithm. Different n values require different expander graphs as input.
so in a way, we have a different algorithm for each value of n.
• The constant in the complexity expression for AKS sorting is much larger than that of
Cole’s pipelined merge sort algorithm.
• AKS sorting uses expander graphs for computing the ε − halver of different element
sets. Calculation of those expander graphs is not trivial and it brings in a huge constant
to the running time complexity. These expander graphs are assumed to be known and
treated as part of the input data of the algorithm. Thus, extra work needs to be done
to make expander graphs ready before we run the AKS sorting. Cole’s pipelined sorting
does not make use of expander graphs or any related constructions, thus it avoids the
huge constant in the running time.
• AKS sorting is a sorting network; Cole’s sorting does not provide a sorting network, it
is designed for PRAM.
• Both AKS and Cole’s sorting algorithm use a binary tree model to solve the problem.
But the way they use it is different: Cole’s sorting algorithm starts from the leaves of
the tree with one element assigned to each leaf, and it walks up to the root in O(log n)
stages. A constant merging time is achieved at each stage by merging only samples of
two sorted arrays. As the merged samples from one level of the tree provide fairly good
samples at the next level of the tree, the merges at the different levels of the tree can be
pipelined to achieve an overall time complexity of O(log n). The algorithm stops when
it reaches the root node, with the final sorted array of elements stored in the UP array
of the root node. AKS sorting starts from the root with the unsorted array of elements
assigned to the root node. It tries to split the array of elements into two halves while it
walks down the binary tree. But in each stage, the split is just a good approximation.
The algorithm tries to correct the errors introduced in the approximation: it sends up
elements with possible wrong order for error correction while it walks down the tree.
And it stops when it reaches the leaf level of the tree with the elements stored in the
leaves in sorted order.
44
4.3.2 Definitions and Notations
In this section, we give the definitions and notations that we use for describing the pipelined
merge sort on an LARPBS throughout the rest of this chapter. Without loss of generality,
we assume all input elements are distinct.
• Interval I(e): Let L be a sorted array. Define L∞ = (−∞, L, +∞). If e ∈ L, g is the
next larger element in L∞, then we define [e, g) to be the interval induced by e. We
denote it as I(e).
• c-cover: Let c be a positive integer. Given two sorted arrays L and J . L is a c-cover
of J if each interval induced by an element in L contains at most c elements from J .
• Rank: The rank of an element x in a sorted array X is defined to be the number of
elements in X that are less than or equal to x. A processor’s rank refers to the rank
of the element assigned to it.
• Straddle: Let e, f, g be three elements, with e < g. We say that e and g straddle f if
e ≤ f and f < g. We refer to e and g as the two straddle elements for f . And if e, f, g
are associated to processors Pi, Pj, and Pk respectively, we refer to Pi and Pk as the
two straddle processors for processor Pj.
• L → J : Let L and J be two sorted arrays, L → J denotes that L is ranked in J ; i.e.,
for each element in L, we know its rank in J .
• L ↔ J : Let L and J be two sorted arrays, L ↔ J denotes that L and J are cross-
ranked.
• L(u): a final sorted set of elements rooted at node u.
• UP (u): a sorted subset of the elements in L(u).
• External node/internal node: node u is external if it has |UP (u)| = |L(u)|; otherwise,
it is an internal node.
• Active node: an active node is either an internal node whose UP array is non-empty;
or it is an external node in its first three stages (external nodes do not participate in
the algorithm after three stages).
• NEWUP (u), OLDUP (u): the corresponding UP arrays in the next, and the previous
stage, respectively.
• SUP (u): if u is an internal node, SUP (u) is a sorted array comprising every fourth
element in UP (u), measured from the right end; if u is an external node, SUP (u) is
every fourth, every second and every element in UP (u) for the first, second and third
stages.
45
• NEWSUP (u), OLDSUP (u): the corresponding SUP arrays in the next, and the
previous stage, respectively.
• Calculate NEWUP (u): Let u, v and w be three nodes in the sorting tree, such that
u is the parent of nodes v and w. We define NEWUP (u) = SUP (v)
⋃
SUP (w), i.e.,
array NEWUP (u) is the result of merging arrays SUP (v) and SUP (w).
• find-min-representatives: Assume that an LARPBS bus holds a sorted array of ele-
ments. Each processor in the array also holds the rank of its element. The rank can
be its element’s rank in the array it belongs to or the rank of its element in another
sorted array. The operation is done in the following way: each processor sends its rank
r to the one on its right, then compares the received rank with its own rank. If the two
ranks do not match, then the processor sets itself to be a representative. The last one
in the bus will not receive a rank in this one-to-one communication. It always sets itself
to be a representative. This operation takes one bus cycle and one local comparison.
• find-max-representatives: Same as the find-min-representatives operation except that
each processor sends its rank r to the one on its left, and the first processor always
sets itself to be representative.
4.3.3 Basic Idea of the CREW PRAM Merge Sort Algorithm
In this section, we describe the basic idea in Cole’s parallel merge sort on a CREW PRAM
[11]. This algorithm is based on an O(log n) time binary tree merge procedure, which is
performed in 3 log n stages. Let u be a node in the sorting tree, v and w be the two children
of node u. In each stage, the computation comprises the following two phases:
1. Form the array SUP (u).
2. Compute NEWUP (u) = SUP (v) ∪ SUP (w), where ∪ denotes merging.
If a node becomes external, then it will only be active for another three stages. On each of
these three external stages, phase two is not performed, and phase one is performed with
sampling rate of 4, 2, and 1 respectively for the first, second, and third stage, respectively.
Internal nodes use the following equation to compute SUP (u) arrays:
r = m− 3− 4i (4.1)
where r refers to qualified sampling rank, m is the number of elements in UP (u) array, and
0 ≤ i < bm/4c.
As phase 1 for internal nodes is obvious, we focus on phase 2 for internal nodes. The
following lemmas are provided [11], and we need them for developing our sorting algorithm
on an LARPBS:
Lemma 4.3.1 While 0 < |UP (u)| < |L(u)|, |NEWUP (u)| = 2|UP (u)|.
46
Lemma 4.3.2 The algorithm has 3 log n stages.
Lemma 4.3.3 OLDSUP (v) is a 3-cover for SUP (v) for each internal node.
Lemma 4.3.4 UP (u) is a 3-cover of SUP (v), UP (u) is a 3-cover of SUP (w), if u is the
parent of v and w.
Lemma 4.3.5 3 stages after node v becomes external, its parent, node u, also becomes
external.
Also we assume UP (u) → SUP (v) and UP (u) → SUP (w), as it is calculated in Step 2
of Phase 2 in the previous stage.
Phase 2 for internal node can be done in two steps:
• Step 1: computing NEWUP (u)
• Step 2: maintaining following ranks: NEWUP (u) → NEWSUP (v) and NEWUP (u) →
NEWSUP (w).
Step 1 has following three substeps:
• Substep 1: Compute SUP (v) → UP (u) and SUP (w) → UP (u). Let y be an ele-
ment in UP (u), I(y) be the interval in UP (u) induced by y, and ry be y’s rank in
SUP (v)(obtained from UP (u) → SUP (v)). Since UP (u) is a 3-cover of SUP (v),
there are at most 3 elements after the ry
th element in SUP (v) fall into I(y). To obtain
SUP (v) → UP (u), processor associated with element y can send its rank in UP (u) to
those elements in SUP (v). SUP (w) → UP (u) is calculated in a similar way.
• Substep 2: Cross rank SUP (v) and SUP (w): SUP (v) ↔ SUP (w). Using SUP (v) →
UP (u), for each element e in SUP (v), we can find the two elements d and f in UP (u)
that straddle element e. And using UP (u) → SUP (w), we can get the rank of d
and f in SUP (w), say r and t. So e’s rank in SUP (w) falls into interval (r, t]. As
UP (u) is a 3-cover for SUP (w), by means of at most three comparisons we will be
able to find e’s rank in SUP (w). Thus, we obtain SUP (v) → SUP (w). We can obtain
SUP (w) → SUP (v) in a similar way.
• Substep 3: Compute NEWUP (u): NEWUP (u) = SUP (v) ∪ SUP (w). Based the
result obtained in Substep 2, for any element e in SUP (v) or SUP (w), its rank in
NEWUP (u) can be obtained simply by adding its rank in SUP (v) and its rank in
SUP (w). So computing NEWUP (u) = SUP (v) ∪ SUP (w) can be done in constant
time.
47
Step 2 maintains the following ranks that will be used in Step 1 of the next stage:
NEWUP (u) → NEWSUP (v) and NEWUP (u) → NEWSUP (w). We describe in detail
how to rank NEWUP (u) in NEWSUP (v). Ranking NEWUP (u) in NEWSUP (w) can
be done in a similar way. We classify elements in NEWUP (u) into two types: those came
from SUP (v) and those came from SUP (w). As the ranking process is different for the two
types of elements, we show how to rank each type in NEWSUP (v) separately as follows:
Observation 1: Given the rank of an element from UP (u) in both SUP (v) and SUP (w)
(we obtained it in Step 2 of Phase 2 in the previous stage), and given NEWUP (u) =
SUP (v)∪SUP (w), we can easily rank an element e from UP (u) in NEWUP (u) by adding
e’s rank in SUP (v) and SUP (w). So at the end of Step 1 of Phase 2, we have UP (u) →
NEWUP (u).
• Elements came from SUP (v): Assume that element e in NEWUP (u) came from
SUP (v). As we know e’s rank in SUP (v) and SUP (v) is sampled from UP (v), we
can obtain e’s rank in UP (v). Apply Observation 1 to node v, we have UP (v) →
NEWUP (v). Thus, we can get e’s rank in NEWUP (v). As NEWSUP (v) is sampled
from NEWUP (v), we can get e’s rank in NEWSUP (v). So for every element e in
NEWUP (u) that came from SUP (v), we obtain its rank in NEWSUP (v).
• Elements came from SUP (w): Assume that element e in NEWUP (u) came from
SUP (w). From the results obtained in Step 1, we know e’s two straddle elements d
and f in SUP (v). Based on the results we obtained in the previous paragraph, we
know the ranks r and t in NEWSUP (v) for d and f , respectively. So we know the
interval in NEWSUP (v) that e falls into is (r, t]. Next we need to find out the rank
for e in interval (r, t]. According to Lemma 4.3.3 and the 3-cover property, we can find
out e’s rank in (r, t] by means of at most three comparisons. So for every element e in
NEWUP (u) that came from SUP (w), we obtain its rank in NEWSUP (v).
As both Step 1 and Step 2 requires constant time, and the algorithm has 3 log n stages,
the time complexity of this algorithm is O(log n). Each stage of the algorithm requires O(n)
processors. As processors can be reused from stage to stage, the algorithm needs only O(n)
processors.
4.4 Parallel Merge Sort on LARPBS
4.4.1 Problems Need to be Solved
PRAM is a less restrictive model. It provides an excellent framework for studying paral-
lelism. There are no wires to worry about, and one never needs to worry about getting
the right data to the right place at the right time. So the PRAM model abstractly removes
most of the important details associated with implementing a parallel algorithm on a parallel
machine. But in the LARPBS model, we do need to take care of inter-processor communi-
cation explicitly and efficiently. Thus to implement the pipelined merge sort algorithm on
the LARPBS model, we are facing the following challenges:
48
• Simulating multiple binary comparison trees on an LARPBS model: we need to keep
track of three arrays, UP , SUP , and NEWUP , for each active node in the sorting
tree. How to assign processors for items in these arrays and how to rank them among
each other are important issues need to be addressed.
• Communications: PRAM is a shared memory parallel computing model. So every
processor has access to the shared memory. Communication among processors and
exchanging information in the PRAM model is obvious and simple. But LARPBS is
a distributed memory interconnection model. How to efficiently communicate among
processors and exchange information among them are important issues that we have
to take care of in order to implement the algorithm on LARPBS model.
• Processor assignment and reuse: In order to use only O(n) processors, we need to find
a solution to processor reuse.
4.4.2 Details of the Algorithm
This algorithm is based on the parallel sorting tree model. The algorithm “moves up” from
the leaves to the root, and it runs in 3 log n stages. In Cole’s algorithm, each stage consists
of two phrases. We convert the two phrases into 5 steps in our algorithm: 1 step for Phase
1 and 4 steps for Phase 2. In order to address the fact that external nodes only stay active
for three stages, we call each 3-stage period from the time a set of nodes becomes external,
a life cycle. So the algorithm runs in log(n) life cycles. At the beginning of the algorithm,
we only have UP arrays: each leaf has an UP array which contains one item. The LARPBS
merge sort algorithm proceeds in O(log n) stages as follows:
Algorithm 4.4.1 Pipelined Merge Sort
Input: a sequence of n keys S
Output: a sorted sequence S ′ of the n keys
Begin






for stage = 1 to 3 do
Step 1 – Compute SUP arrays:
For internal node: each processor Pi in UP array checks its rank r in that array, if
r = m − 3 − 4j, then Pi sends its element to the jth processor in SUP array; for external
nodes: sampling rate is 2(3−stage), action for sampled processors is the same as internal nodes.
Step 2 – Compute rank SUP (v) → UP (u) and SUP (w) → UP (u):
49
1. Do a find-min-representatives operation in UP (u) based on each element’s rank in
SUP (v). As we have UP (u) → SUP (v) (obtained from step 5 of the previous stage and
Step 1 of current stage. For stage 1, since only leaf nodes have non-empty SUP arrays,
and all these SUP arrays have only one element, the ranking of UP (u) → SUP (v) is
obvious), each representative processor Pi in UP (u) sends its rank s in UP (u) to the r
th
processor Pj in SUP (v) if Pi’s rank in SUP (v) is r. As different elements in UP (u) can
have the same rank in SUP (v), we need to use the find-min-representatives operation
here to guarantee the communication from UP (u) to SUP (v) performed in this step is
one-to-one instead of multiple-to-one. This step consists of two basic operations: find-
min-representatives and one-to-one communication. Both of them takes O(1) time.
2. Each processor Pj in SUP (v) who received a rank s in the previous bus cycle sets its
switch to cross to form a sub-bus and broadcast rank max(0, s − 1) to every other
processor in the sub-bus. Figure 4.1 shows two examples: example (a) illustrates a
general case, and example (b) gives a boundary situation where Pi is the first processor
in the UP (u) array. This sub-step only consists of a multiple broadcast operation.
Now we have obtained SUP (v) → UP (u). SUP (w) → UP (u) is done in a similar
way. We can further parallelize the two computations in the following way: do Step 1
separately for the two computations, then as elements in SUP (w) array and in SUP (v)
array are associated with different processors, Step 2 can be performed for SUP (w) and
SUP (v) simultaneously. The operations needed for this step is: 2 find-min-representatives
operations, 2 one-to-one communication operations and 1 multiple broadcast. Thus it takes
O(1) time.
Figure 4.1: SUP (v) → UP (u): (a) Pj−2, Pj−1 and Pj all have rank s − 1 in UP (u). (b)
boundary condition: Pj−2, Pj−1 and Pj all have rank 0 in UP (u).
50
Step 3 – Cross-rank SUP (v) and SUP (w):
We give a description on the details of ranking SUP (v) in SUP (w). SUP (w) → SUP (v)
can be done in a similar way.
1. Do a find-max-representatives operation in SUP (v) based on elements’ ranks in UP (u)
using SUP (v) → UP (u). The find-max-representatives here is necessary because two
or more consecutive elements in SUP (v) can have the same rank in UP (u).
Figure 4.2: Finding the ranks in SUP (w) for Pi’s two straddle processors Pj and Pj+1 in
UP (u). Solid lines represent data communication and dashed lines represent ranks.
2. Each representative processor Pi in SUP (v) finds out the ranks in SUP (w) for the two
processors that straddle Pi in UP (u) in the following way: using SUP (v) → UP (u),
each representative Pi sends a request to its two straddle processors in UP (u), Pj and
Pj+1. This is a multiple multicast operation. The request specifies the order in which
they should send data back. Processor Pj and Pj+1 then send their ranks r and t in
SUP (w) back to Pi in the specified order. See Figure 4.2.
3. Representative processors in SUP (v) obtain elements in (r, t] from SUP (w): each
representative processor sends a request to processors whose ranks are in the range
(r, t] in SUP (w). This is another multiple-multicast operation. The request message
indicates the order in which each processor should send its element back. For example,
the (r + 1)th processor is the first one to send its elements back, (r + 2)th is the second
one, etc. Each processor in SUP (w) who received a request then sends its element f
back to its corresponding representative processor in SUP (v) in the specified order.
51
As UP (u) is a 3-cover of SUP (w) (see Lemma 4.3.4), a maximum of 3 one-to-one
communication operations are needed in this SUP (w) to SUP (v) communication.
4. Calculate SUP (v) → SUP (w): After receiving all elements sent by processors in
SUP (w), each representative processor sets its switch to cross to form a sub-bus and
broadcasts these elements and value r in the sub-bus. Each processor in SUP (v) now
compares these elements with its own element and determines its rank in SUP (w).
For example, if a processor owns element x and it received elements q, y, z, and a rank
r. If q < x < y < z, then x’s rank in SUP (w) is r + 1. Again due to the 3-cover
property of UP (u) stated in Lemma 4.3.4, a maximum of 3 comparisons are needed in
the above local computation.
While Computing SUP (w) → SUP (v) in a similar way, the above Step 2 needs to be done
separately for computing the two ranks, but Step 1, 3, and 4 can be done simultaneously for
the two computation: SUP (w) → SUP (v) and SUP (v) → SUP (w).
Step 4: Compute NEWUP (u) = SUP (v) ∪ SUP (w):
Each processor Pi in SUP (v) and SUP (w) computes its rank r in NEWUP (u) by adding
its rank i in SUP (v) and its rank j in SUP (w). Each Pi then sends its element to the r
th
processor in NEWUP (u) to form NEWUP (u). Note that SUP (v) → NEWUP (u) and
SUP (w) → NEWUP (u) are two side products of this step that will be used in Step 5.
Step 5: Maintain ranks: NEWUP (u) → NEWSUP (v), NEWUP (u) → NEWSUP (w).
These ranks will be used in Step 2 of the next stage. And we only need to compute them
for internal nodes. We show how to compute NEWUP (u) → NEWSUP (v) in detail.
Computing NEWUP (u) → NEWSUP (w) is done in a similar way. Here is the ba-
sic idea: We have SUP (v) → NEWUP (u) and SUP (w) → NEWUP (u) obtained from
Step 4. If we can get SUP (v) → NEWSUP (v) and SUP (w) → NEWSUP (v), then use
NEWUP (u) = SUP (v)
⋃
SUP (w), we can obtain NEWUP (u) → NEWSUP (v). Now we
give the detail.
1. Compute UP (u) → NEWUP (u): Given UP (u) → SUP (v) and UP (u) → SUP (w),
each processor in UP (u) computes its rank r in NEWUP (u) by adding its rank i in
SUP (v) and its rank j in SUP (w), i.e. r = i + j.
2. Compute SUP (v) → NEWSUP (v): Using UP (v) → NEWUP (v), each processor Pi
in UP (v) whose element is sampled into SUP (v) as the kth element sends its rank i in
UP (v) to the jth processor in NEWUP (v) if the rank of Pi’s element in NEWUP (v)
is j. According to Lemma 4.4.3, this is a one-to-one communication. Each processor
Pj in NEWUP (v) then computes its element’s rank r in NEWSUP (v) and sends r
back to processor Pi in UP (v). After receiving the rank r, processor Pi forwards r to
the kth processor in SUP (v). Now each processor in SUP (v) has its element’s rank in
NEWSUP (v). See Figure 4.3.
3. Compute NEWUP (u) → NEWSUP (v): We classify elements in NEWUP (u) into
two classes: those came from SUP (v) and those came from SUP (w).We need to
compute their ranks in NEWSUP (v) using different methods:
52
Figure 4.3: Computing SUP (v) → NEWSUP (v). Solid lines represent data communica-
tion, and dashed lines represent ranking relation.
• Elements came from SUP (v): Using SUP (v) → NEWUP (u), each processor
Pi in SUP (v) sends its element’s rank r in NEWSUP (v) to the j
th processor
Pj in NEWUP (u) if this element’s rank in NEWUP (u) is j. Processor Pj in
NEWUP (u) set its element’s rank in NEWSUP(v) to be the received rank r.
After this one-to-one communication, we obtained the ranks in NEWSUP (v) for
all elements in NEWUP (u) who came from SUP (v). The correctness of this
operation is based on the following two facts: (1) Processor Pi in SUP (v) holds
the same element as processor Pj in NEWUP (u) as NEWUP (u) = SUP (v) ∪
SUP (w); (2) According to Lemma 4.4.4, processor Pi in SUP (v) holds the same
element as the rth processor in NEWSUP (v).
• Elements came from SUP (w): We haven’t formed NEWSUP (v) arrays at this
time. So what we will do is: for each element e in NEWUP (u) that came
from SUP (w), find out the ranks r and t in NEWSUP (v) for the two elements
in SUP (v) that straddle e, and keep them in e’s owner Pi’s local memory in
NEWUP (u) array. Then we compare e with the elements whose ranks are in the
range (r, t] in NEWSUP (v) in Step 1 of the next stage. We use a similar method
as what we do in Step 3 to obtain the two ranks r and t.
(a) Do a find-max-representatives in SUP (w) using the ranks SUP (w) → SUP (v).
Then each representative processor Pi sends request to the two processors in
SUP (v) that straddle Pi for their ranks r and t in NEWSUP (v). Proces-
sors in SUP (v) who received the request, find out the requested rank using
53
SUP (v) → NEWSUP (v) and send them back to Pi.
(b) Each representative processor Pi in SUP (w) then sends the two ranks r and t
to the kth processor Pj in NEWUP (u), where k is Pi’s rank in NEWUP (u).
Processor Pj in NEWUP (u) receives and saves the two ranks, and marks
itself to be a representative.
(c) In Step 1 of the next stage, after building NEWSUP (u) arrays, each repre-
sentative processor Pj in NEWUP (u), sends a request to processors in the
range (r, t] in NEWSUP (v). Applying a similar computation as what we do
in Step 3, each processor in NEWUP (u) whose elements came from SUP (w)
will get its rank in NEWSUP (v) .
Now we have NEWUP (u) → NEWSUP (v). NEWUP (u) → NEWSUP (w) can be
obtained in a similar way.
endfor
End
4.4.3 Processor Assignment and Reuse
Consider the following two approaches for simulating a binary tree on an LARPBS:
1. Depth-first search based simulation: Associate one processor to a node in the tree based
on its rank obtained from a depth-first search. See Figure 4.4(a).
2. Breath-first search based simulation: Associate one processor to a node in the tree
based on its rank obtained from a breath-first search. See Figure 4.4(b).
The simulation for the sorting tree in our algorithm is more involved. We use a breath-
first search based simulation, but we view the three groups of arrays as three binary trees –
UP tree, SUP tree, and NEWUP tree. We simulate the three binary trees on LARPBS by
segmenting the bus into three sub-buses in that order (See Figure 4.5). We call the sub-bus
allocated to one of the three trees a tree-bus. Within each tree-bus, we only assign processors
to active nodes with non-zero array elements. With respect to each tree-bus, we further
segment the bus into smaller sub buses, one for each node; we call it a node-bus. On each
node-bus, elements are always aligned in sorted order, one element per processor. The final
result is on the root node-bus of the UP tree-bus.
Now we determine how many processors the algorithm needs.
Lemma 4.4.1 The number of processors needed on any stage of the parallel merge sort
algorithm is upper bounded by 4n.
Proof. According to [11], the size of the UP arrays is upper bounded by n + n/7,
n + 2n/7, and n + 4n/7 on the first, second and third stages respectively, in each life cycle.
The sizes of SUP/NEWUP arrays are upper bounded by 2n/7, 4n/7, and 8n/7 on the first,
54
Figure 4.4: Simulating a binary tree on the LARPBS Model: (a) depth-first search ranking
(b) breath-first search ranking, and (c) simulation of binary tree on LARPBS
55
Figure 4.5: Simulating a parallel merge sort tree on the LARPBS Model
56
second and third stages respectively. Thus the maximum size for all arrays on any stage is
upper bounded by (3 + 6/7)n < 4n. This proves the lemma.
Based on Lemma 4.4.1, we reserve 4n processors for our parallel merge sort algorithm.
Notice that number of elements for each node on the three trees in any stage of a life cycle is
fixed, so the address for each element in an array for any active node can be pre-calculated
and available to all processors as input data. As we need one processor for each element in
the UP, NEWUP, and SUP arrays, and the sizes of these arrays, for each node u, change
from stage to stage, the processors must be reassigned dynamically so that we can reuse
them. Processor reuse is done in the following two ways:
• At the end of each stage, we set all processors assigned to SUP to be available for
reuse, and copy the NEWUP tree to UP tree. Thus processors are reused from stage
to stage.
• As external nodes become inactive at the end of each life-cycle, processors assigned to
them can be reused by new active nodes.
4.4.4 An Example
Now we provide an example for the pipelined merge sort algorithm. Suppose the input is an
array of unsorted elements, S = 7, 3, 19, 1, 23, 74, 32, 17. We put these elements at the leaves
of a complete binary tree, one number per leaf. Let us define the level of a node to be the
length of the path from that node to the root node. Then in our example, the binary sorting
tree has following 4 levels from root to leaves: 0, 1, 2, 3.
Initially, elements in S are distributed to leaf nodes, one per node. And all leaf nodes
are active external nodes. The UP array for each leaf node is the element stored there. All
other nodes in the tree are inactive (See Figure 4.6 (a)).
life-cycle 1:
• Stage 1: According to the sampling rule for external nodes nothing happens in this
stage. See Figure 4.6 (a).
• Stage 2: same as first stage, nothing happens.
• Stage 3: we have SUP = L for all the leaf nodes. So UP arrays for the nodes on level
2 are given non-zero values, and become new active external nodes. The resulting tree
is shown in Figure 4.6 (b).
The leaf nodes, i.e. level 3 nodes in our example, will become inactive in the next
life-cycle.
57
Figure 4.6: Pipelined Merge Sort: life-cycle 1 (a) input, and (b) after stage 3
58
life-cycle 2:
• Stage 1: as the sampling rate for this stage is 4 and there are only 2 elements in each
UP array in all active nodes, nothing happens.
• Stage 2: as the sampling rate for external nodes is reduced to 2, so the first element
in the UP array for level 2 node will be sampled to form its SUP array. That will
provide a non-empty UP array for each node on level 1. See Figure 4.7 (a).
• Stage 3: as the sampling rate for external nodes becomes every node in UP array, we
have SUP = L for all nodes in level 2. But the SUP arrays for the nodes on level
1 will still be empty as number of elements in the UP for level 1 nodes are less than
4. Thus level 1 UP arrays will be updated, and level 2 nodes become external. See
Figure 4.7 (b).
Level 2 nodes will become inactive in the next life-cycle.
life-cycle 3:
• Stage 1: the first element in UP array for each level 1 node will be sampled into its
SUP array. Thus root node will becomes active with two elements in its UP array.
See Figure 4.8 (a).
• Stage 2: the first and third elements in UP array for each level 1 node will be sampled
into its SUP array. And UP array for root node will be updated to have four elements.
See Figure 4.8 (b).
• Stage 3: All elements in UP array for level 1 nodes will be sampled and we will have
the final sorted array of elements in the UP array of root node. See Figure 4.8 (c).
4.4.5 Correctness Proof and Complexity Analysis
The following lemmas are used to show the correctness of the algorithm:
Lemma 4.4.2 SUP (v) is a subset of NEWSUP (v), i.e., SUP (v) ⊂ NEWSUP (v).
Proof. We prove the result by induction on the levels of the nodes in the sorting tree.
The claim is true for the nodes on the lowest active level of the sorting tree in a life cycle,
i.e., the external nodes. If v is an external node, then (a) in the first stage after it becomes
external, SUP (v) consists of every forth elements in UP (v) = L(v); (b) in the second stage,
SUP (v) consists of every second element in L(v); and (c) in the third stage, SUP (v) consists
of every element in L(v). Thus our claim holds for all external nodes in each life cycle.
Inductive step: We seek to prove that for any node u on level k, we have SUP (u) ⊂
NEWSUP (u), where k is a level above the external node level. Suppose our claim for
59
Figure 4.7: Pipelined Merge Sort: life-cycle 2 (a) after stage 2, and (b) after stage 3
60
Figure 4.8: Pipelined Merge Sort: life-cycle 3 (a) after stage 1, (b) after stage 2, and (c)
after stage 3
61
SUP (u) ⊂ NEWSUP (u) on level k is not true. Then there must be an element e ∈ SUP (u),
but e /∈ NEWSUP (u). As e comes from UP (u) which in term comes from the OLDSUP
array of one of its children, we use the same induction to trace the origin of u down to the
lowest active level, which is the level that has the external nodes. Assume that e originally
comes from external node f . So we have e ∈ OLDSUP (f), but e /∈ SUP (f). That is
conflict with what we have already proved for the initial case. Thus our claim is true for
level k.
Summarizing the initial step and inductive step, we conclude that SUP (v) ⊂ NEWSUP (v).
Lemma 4.4.3 UP (u) is a subset of NEWUP (u), i.e., UP (u) ⊆ NEWUP (u).
Proof. Case 1: |UP (u)| < L(u), NEWUP (u) = SUP (v) ∪ SUP (w), and UP (u) =
OLDSUP (v) ∪ OLDSUP (w). Using Lemma 4.4.2, we get UP (u) ⊆ NEWUP (u). Case
2, |UP (u)| = L(u), we have NEWUP (u) = UP (u). Combined the two cases, we obtain
UP (u) ⊆ NEWUP (u).
Lemma 4.4.4 If e in SUP (v) has a rank r in NEWSUP (v), and the rth element in
NEWSUP (v) is f , then we have e = f .
Proof. This lemma directly follows from Lemma 4.4.2 and the assumption that all n
numbers are distinct.
Lemma 4.4.5 If e in UP (u) has a rank r in NEWUP (u), and the rth element in NEWUP (u)
is f , then we have e = f .
Proof. This lemma directly follows from Lemma 4.4.3 and the assumption that all n
numbers are distinct.
Now we calculate the time complexity of the algorithm. Step 1 consists of a simple local
comparison and a one-to-one communication, so it is clear that it takes constant time. Step
2 consists of following three basic operations: find-min-representative, one-to-one communi-
cation, and broadcast. Thus it runs in constant time. Step 3 consists of constant number of
following basic communication operations: find-max-representative, multiple-multicast, one-
to-one communication and broadcast. A maximum of 3 local comparisons is needed for finding
the rank at the end of this step. Thus step 3 also runs in constant time. Step 4 consists of a
simple local addition and a one-to-one communication, which runs in constant time. Step 5
consists of constant number of one-to-one communication and similar communications and
computations as Step 3. Thus Step 5 runs in constant time. Now we conclude that the time
complexity of each stage is O(1). As the algorithm has 3 log n stages, we obtain the following
theorem:
Theorem 5 n elements can be sorted in O(log n) time on an n-processor LARPBS.
62
Chapter 5
Parallel Sorting Algorithms on Two
Dimensional LARPBS
5.1 Introduction
By applying algorithm 4.4.1 Pipelined Merge Sort to higher dimensional LARPBS’s, such
as a 2-dimensional and a 3-dimensional LARPBS, we can obtain work-time optimal sorting
algorithms on 2D and 3D LARPBS’s.
Table 5.1 compares multi-dimensional parallel sorting algorithms implemented in differ-
ent models with the algorithms that will be presented in this chapter on multi-dimensional
LARPBS.
In this and the next chapter, we present the following results on efficient sorting algo-
rithms for two-dimensional and three-dimensional LARPBS’s:
1. A constant-time sorting algorithm that sorts n elements on an n× n two-dimensional
LARPBS.
2. An O(log n) time basic column sort algorithm that sorts n
√
n elements on a n × √n
two dimensional LARPBS.
3. An O(log n) time two-way merge sort algorithm that sorts 2n
√
n elements on a n×2√n
two dimensional LARPBS.
4. An O(log n) time two-way merge sort algorithm that sorts 2n
√
n elements on a n×√n
two dimensional LARPBS.
5. An O(log n) time multi-way merge algorithm that merges two sorted array with n
elements on an n× n two dimensional LARPBS.
6. An optimal multi-way merge sort algorithm that sorts n2 elements in O(log n) time on
an n× n two dimensional LARPBS.
63
Table 5.1: Comparison of different parallel sorting algorithms on multi-dimensional arrays.
Algorithm Time Complexity Processor Array
2D Shearsort[30] O(n log n) O(n× n)
2D 8-phase Block Sort Algorithm[30] O(n) O(n× n)
Sorting on AROB [69] O(1) nε × n, where ε > 0
Algorithm 5.2.1 2D Constant Time Sort [30] O(1) O(n× n)
Algorithm 5.4.3 Basic Columnsort O(log n) O(n×√n)
Algorithm 5.5.1 Two Way Merge Columnsort O(log n) O(n× 2√n)
Algorithm 5.5.2 Two Way Merge Columnsort1 O(log n) O(n×√n)
Algorithm 5.8.1 Generalized Columnsort O(log n) O(n× n)
Algorithm 5.7.1 Multiway Merge Sort O(log n) O(n× n)
Algorithm 6.2.1 5-Phase Sort on 3D Array[30] O(n) O(n× n× n)
Algorithm 6.1.1 3D Generalized Columnsort O(log n) O(n× n× n)
Algorithm 6.2.2 5-Phase Sort on 3D LARPBS O(log n) O(n× n× n)
7. An optimal generalized Columnsort algorithm that sorts n2 elements in O(log n) time
on an n× n two dimensional LARPBS.
8. An optimal generalized Columnsort algorithm that sorts n3 elements in O(log n) time
on an n× n× n three dimensional LARPBS.
9. An optimal 5-phase sorting algorithm that sorts n3 elements in O(log n) time on an
n× n× n three dimensional LARPBS
5.2 A Constant-Time Sorting Algorithm on 2D LARPBS
We first show how to sort n elements on an n2-processor 2D LARPBS in O(1) time.
Algorithm 5.2.1 2D Constant Time Sort
Input: a set S of n elements to be sorted. Assume that the n elements are initially distributed
to the n processors on the first row, one per processor. Without lose of generality, we assume
that the n elements are distinct.
Output: a sorted array of n elements S ′ stored in the first row.
Begin
1. Broadcast elements on column buses : processor P1,j broadcasts its element ej to pro-
cessor Pi,j, where 1 ≤ i ≤ n, so that all processors Pi,j in column j have a copy of the
jth element ej saved as ci,j.
64
2. Diagonal processors broadcast elements on row buses : each diagonal processor Pi,i,
where 1 ≤ i ≤ n, broadcasts the element ci,i to every processor in the ith row using
only row buses. After this multiple row broadcast communication operation, each
processor Pi,j saves the received element as ri,j.
3. Compare and set results : each processor Pi,j compares ci,j with ri,j. If ci,j ≤ ri,j, it
sets result = 1; otherwise, sets result = 0.
4. Sum up 1s : perform a one-dimensional binary-prefix-sums operation on each row
simultaneously for the value of result. Each processor Pi,j, where 1 ≤ i, j ≤ n, stores
the binary prefix sums to bi,j.
5. Route element to correct location : the last processor on each row, processor Pi,n,
where 1 ≤ i ≤ n, sends a dummy message to processor Pi,k, where k = bi,n. This
is a multiple one-to-one row communication operation. After receiving the dummy
message, processor Pi,k sends its ri,k to processor P1,k. After this multiple one-to-one
column communication operation, all elements are sorted and are stored in the first
row in increasing order.
End
Each step in the above algorithm runs in constant time. Thus we have the following
theorem:
Theorem 6 There is a constant time sorting algorithm on a 2D LARPBS that sorts n
elements using n2 processors.
Example 5.2.1 2D Constant Time Sort: sort 5 elements.


7 2 9 16 5
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0






7 2 9 16 5
7 2 9 16 5
7 2 9 16 5
7 2 9 16 5






7, r=7 2, 7 9, 7 16, 7 5, 7
7, 2 2, r=2 9, 2 16, 2 5, 2
7, 9 2, 9 9, r=9 16, 9 5, 9
7, 16 2, 16 9, 16 16, r=16 5, 16






1 1 0 0 1
0 1 0 0 0
1 1 1 0 1
1 1 1 1 1
0 1 0 0 1






0 0 0 0 b = 3, r = 7
0 0 0 0 b = 1, r = 2
0 0 0 0 b = 4, r = 9
0 0 0 0 b = 5, r = 16






0 0 7 0 0
2 0 0 0 0
0 0 0 9 0
0 0 0 0 16






2 5 7 9 16
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0


step 4 step 5 step 5 :
row movement column movement
5.3 The Seven-Phase Columnsort Algorithm
Our O(log n) sorting algorithm on the LARPBS using O(n) processors can be used to obtain
an O(log n) column sort algorithm for a 2D LARPBS. There are two well known versions of
Columnsort [29] [30] for sorting an r× s array in column-major order: one has eight phases
and the other has seven. The 8-phase version [29] has a more restrictive assumption: it
requires r ≥ 2(s− 1)2. The 7-phase version [30] requires r ≥ s2. Thus the 7-phase version has
a larger range of applicability. Olariu, Pinotti, and Zheng proposed an extended Columnsort
algorithm in [49] that enlarges the applicability of the 8-phase Columnsort algorithm from
r ≥ 2(s− 1)2 to r ≥ s(s− 1) by adding an additional sorting step at the end of the algorithm.
In this section, we will develop a 7-phase Columnsort algorithm that runs in O(log n)
time on a 2D LARPBS by applying our O(log n) sorting algorithm presented in previous
sections.
We start by presenting the original idea of the 7-phase Columnsort algorithm.
Algorithm 5.3.1 7-phase Columnsort
Input: rs elements distributed to an r × s processor array, where r ≥ s2, one element per
processor.
Output: a stored r × s array in column major order, one element per processor.
Begin
• Phase 1: sort elements within each column in ascending order.
• Phase 2: transpose matrix by picking up the elements in column-major order and
setting them down in row major order, preserving the r × s shape.
• Phase 3: sort elements within each column again in ascending order.
• Phase 4: reverse transpose the matrix by picking up the elements in row-major order
and setting them down in column major order, again preserving the r × s shape.
• Phase 5: sort elements within each column, but this time adjacent columns are sorted
in reverse order.
66
• Phase 6: apply two steps of odd-even transposition sort to each row – in the first
step, we compare elements 1 and 2, 3 and 4, etc., in each row; in the second step, we
compare elements 2 and 3, 4 and 5, etc., in each row.
• Phase 7: apply the regular sort again in each column, i.e., sort each column in as-
cending order.
End
Theorem 7 The 7-step Columnsort algorithm correctly sorts rs elements of an r× s array
in column-major order.
We need to use following well-known 0-1 sorting lemma given in [30] for the proof of the
above 7-phase column sort algorithm.
Lemma 5.3.1 The 0-1 Sorting Lemma. If an oblivious comparison-exchange algorithm
sorts all input sets consisting solely of 0s and 1s, then it sorts all input sets with arbitrary
values.
Since the Columnsort algorithm is an oblivious comparison-exchange algorithm, we can
use the 0-1 sorting Lemma to prove its correctness. In other words, we only need to check
that it correctly sorts any set of 0s and 1s. We now give the proof by illustrating the effects
of each phase on the 0-1 input in the Columnsort algorithm.
• phase 1: after sorting each column, 0s and 1s in each column are grouped into two
separate sets (See Figure 5.1 (a)).
• phase 2: the operations performed in this phase can be viewed in the following way
– we use a window of length s to extract the elements for each row while moving the
window from left most column to right most column. Each time we move the window
by s elements. Within a window we can see one of the following three possibilities:
clean 0s, clean 1s, and dirty (with 0s and 1s). This is because in each column, 0s and
1s only meet at one point after phase 1. Thus we will see at most two dirty rows when
we move in one column: one is the window that covers the point where 0s and 1s meet;
the other is the window that covers the boundary of two columns. Thus after this
phase, there are at most 2s− 1 dirty rows. See Figure 5.1 (b).
• phase 3: as we sort each column in this phase. At the end of this phase, all dirty rows
are moved together between clean 0 rows and clean 1 rows. For any two dirty rows,
one of the following possibilities applies: more 0s, more 1s, or equal 0s and 1s. Thus
after sorting each column, number of dirty rows will be reduced by one half. In other
words, after this phase, number of dirty rows becomes s− 1/2. See Figure 5.1 (c).
67
Figure 5.1: 7-step Columnsort
68
• phase 4: after the reverse transpose, all dirty rows are distributed into a number of
consecutive columns. Number of dirty columns resulted from this transpose is at most
d(s(s− 1/2))/re ≤ d(s2 − s/2)/s2e = d1− (1/2s)e = 1, where r ≥ s2 and s ≥ 1. Due
to the fact that the first dirty row may not start exactly at the beginning of a column
when permuted to column, there can be at most two dirty columns if the dirty rows
does not starts a column. Therefore, the (s − 1/2) dirty rows gives rise to at most
2 dirty columns, with all clean 0 columns on the left and all clean 1 columns on the
right. See Figure 5.1 (d).
• phase 5: As after phase 4, the two dirty columns are next to each other, the opposite
column sorting for two adjacent columns makes sure that when we apply one step of
odd-even sort on the two dirty columns in the next phase, at least on clean column
will be produced. See Figure 5.1 (e)
• phase 6: the two steps of odd-even transposition sort are performed in this phase
to guarantee the two dirty columns are compared and exchanged so that at most one
dirty column is left. See Figure 5.1 (f).
• phase 7: the column sort invoked in this phase will make sure that the only one dirty
column will be sorted. Thus after this column sort, the whole matrix will be sorted in
column major order.










































































































step 4 step 5 step 6 step 7
69
5.4 An O(log r) Column Sort Algorithm on a 2D LARPBS
Key operations needed for implementing the 7-step Columnsort:
1. Sort r elements with r processors in O(log r) parallel time.
2. r × s shape matrix transpose.
3. r × s shape reverse matrix transpose.
4. Odd-even transposition sort.
Operation 1 can be done by applying our O(log n) time LARPBS sorting algorithm.
And it is obvious that operation 4 can be done in constant time. The question is: can we do
operations 2 and 3 in constant time?
As in the above Columnsort, the matrix we are sorting has r ≥ s2. Matrix transposition
cannot be done by simply applying algorithm 2.4.1 Matrix Transpose for 2D LARPBS. Thus
we propose the following two special matrix transpose algorithms for r × s matrices with
r = ks, where k > 1. The basic idea for this extended transpose algorithm is: we split the
r× s matrix into a group of k sub-matrices each of size s× s, then do the matrix transpose
for each sub-matrix. Although that basic matrix transpose does not give us what we want,
we can get the final result by rearranging the rows. The details of the algorithm are given
below.
Algorithm 5.4.1 Extended Matrix Transpose
Input: an r × s matrix A, where r = ks, k > 1.
Output: the transpose of matrix A, preserving the r × s shape.
Begin
• Step 1. Partition into regular shape sub matrices: processor Pms,j, where 1 ≤ m ≤ k
and 1 ≤ j ≤ s, set its column reconfigurable switches to cross to partition A into k
sub-matrices A1, A2, · · ·, Ak. Each of them is an s× s matrix.
• Step 2. Transpose sub matrices: apply algorithm 2.4.1 Matrix Transpose to each
sub-matrix Ai
• Step 3. Fuse sub matrices and row shuffle: each processor Pms,j, where 1 ≤ m ≤ k
and 1 ≤ j ≤ s, sets its column reconfigurable switches back to straight to fuse the k
sub-matrices into one r× s matrix. Then each processor Pi,j in matrix A calculate its
new row address i′: i′ = k((i − 1) mod s) + di/se. A permutation communication is
performed in each column bus based on the new row addresses. Each processor Pi,j
use the calculated i′ as destination address to send its element to processor Pi′,j. This
is a one-to-one column communication operation and can be done in constant time.
End
70
Example 5.4.1 Extended Matrix Transpose on an 8× 4 matrix.


1, 1 1, 2 1, 3 1, 4
2, 1 2, 2 2, 3 2, 4
3, 1 3, 2 3, 3 3, 4





5, 1 5, 2 5, 3 5, 4
6, 1 6, 2 6, 3 6, 4
7, 1 7, 2 7, 3 7, 4






1, 1 2, 1 3, 1 4, 1
1, 2 2, 2 3, 2 4, 2
1, 3 2, 3 3, 3 4, 3





5, 1 6, 1 6, 1 7, 1
5, 2 6, 2 6, 2 7, 2
5, 3 6, 3 6, 3 7, 3






1, 1 2, 1 3, 1 4, 1
5, 1 6, 1 6, 1 7, 1
1, 2 2, 2 3, 2 4, 2
5, 2 6, 2 6, 2 7, 2
1, 3 2, 3 3, 3 4, 3
5, 3 6, 3 6, 3 7, 3
1, 4 2, 4 3, 4 4, 4
5, 4 6, 4 6, 4 7, 4


Step 1 Step 2 Step 3
Let’s call each group of s elements in a column as a super element. And let A′ denote the
new matrix consists of the super elements. Then it is clear that matrix A′ is an s×s matrix.
After applying algorithm 5.4.1 Extended Matrix Transpose, we should have: the elements in
the first super element on the first column are put on the first row of matrix B, the elements
in the second super element on the first column are put on the second row of B, etc. In
general, elements in the lth super element on the kth column are put on the ((k − 1)s + l)th
row of B. But what we have achieved after step 5.4 is that elements in the lth super element
on the kth column are put on the ((l − 1)s + k)th row of B. In other words, after Step 5.4,
we have the correct column index for elements in each row and a wrong row index. Step 5.4
does the required row permutation to send elements to the correct row.
The idea in Extended Reverse Matrix Transpose is similar to that in
Extended Matrix Transpose. But the order for the basic operations needs to be changed. We
need to do a row permutation first so that all rows are in correct position before we call the
regular transpose algorithm. Here are the details.
Algorithm 5.4.2 Extended Reverse Matrix Transpose
Input: a r × s matrix A, where r = ks, k > 1.
Output: matrix B, which is the reverse transpose of matrix A, preserving the r × s shape.
Begin
• Step 1. Row shuffle: each processor Pi,j in matrix A calculate its new row address i′:
i′ = s((i−1) mod k)+di/ke. Then a permutation communication is performed in each
column bus based on the new row addresses. Each processor Pi,j use the calculated
i′ as destination address to route its element. This is a one-to-one communication
operation and can be done in constant time.
• Step 2. Partition into regular shape sub matrices: processor Pms,j, where 1 ≤ m ≤ k
and 1 ≤ j ≤ s, set its column reconfigurable switches to cross to divide A into k
sub-matrices A1, A2, · · ·, Ak. Each of them is an s× s matrix, where 1 ≤ i ≤ s.
71
• Step 3. Reverse transpose and fuse into one matrix: Since reverse matrix trans-
pose is equivalent to matrix transpose for r × r matrices, apply algorithm 2.4.1 Ma-
trix Transpose to each sub-matrix Ai to reverse transpose Ai. Then processor Pms,j,
where 1 ≤ m ≤ k and 1 ≤ j ≤ s, set its column reconfigurable switches back to
straight to fuse the k sub-matrices into one r × s matrix.
End
Example 5.4.2 Extended Reverse Matrix Transpose on an 8× 4 matrix.


1, 1 1, 2 1, 3 1, 4
2, 1 2, 2 2, 3 2, 4
3, 1 3, 2 3, 3 3, 4
4, 1 4, 2 4, 3 4, 4
5, 1 5, 2 5, 3 5, 4
6, 1 6, 2 6, 3 6, 4
7, 1 7, 2 7, 3 7, 4






1, 1 1, 2 1, 3 1, 4
3, 1 3, 2 3, 3 3, 4
5, 1 5, 2 5, 3 5, 4





2, 1 2, 2 2, 3 2, 4
4, 1 4, 2 4, 3 4, 4
6, 1 6, 2 6, 3 6, 4






1, 1 3, 1 5, 1 7, 1
1, 2 3, 2 5, 2 7, 2
1, 3 3, 3 5, 3 7, 3





2, 1 4, 1 6, 1 8, 1
2, 2 4, 2 6, 2 8, 2
2, 3 4, 3 6, 3 8, 3
2, 4 4, 4 6, 4 8, 4


input step 1 & 2 step 3
odd-even-transposition-sort: This operation is for one-dimensional LARPBS of size
m. In an odd step, processor P2k−1, where 1 ≤ k ≤ m/2, sends its element x to the processor
P2k. Processor P2k then compares its element y with x. If y < x, then processor P2k set its
element to be x and sends y back to processor P2k−1. An even step is similar to an odd step,
but this time even indexed processors send their elements to the odd index processors. In
other words, processor P2k, where 1 ≤ k ≤ m/2− 1, sends its element x to processor P2k+1.
Now that we have all the required basic algorithms, we are ready to give our Columnsort
algorithm on LARPBS.
Algorithm 5.4.3 Basic Columnsort
Input: an r × s matrix A, where r = s2.
Output: matrix A sorted in column major order.
Begin
• Phase 1: sort elements within each column in ascending order by calling algorithm
4.4.1 Pipelined Merge Sort for each column bus. Only column buses are used in this
step.
• Phase 2: apply algorithm 5.4.1 Extended Matrix Transpose on A.
• Phase 3: for each column bus, run algorithm 4.4.1 Pipelined Merge Sort to sort ele-
ments again in ascending order. Only column buses are used in this step.
• Phase 4: apply algorithm 5.4.2 Extended Reverse Matrix Transpose on A.
72
• Phase 5: for each odd column, run algorithm 4.4.1 Pipelined Merge Sort to sort ele-
ments in ascending order; and for each even column, run algorithm 4.4.1 Pipelined Merge Sort
to sort elements in descending order. Only column buses are used in this step.
• Phase 6: use only row buses to perform the two steps of odd-even transposition sort
on each row. Do odd-even-transposition-sort twice: one odd step followed by an even
step.
• Phase 7: apply algorithm 4.4.1 Pipelined Merge Sort again to finally sort A in column
major order.
End
The operations used in the above basic columnsort algorithm are: Pipelined Merge Sort,
Extended Matrix Transpose, Extended Reverse Matrix Transpose, and two steps of odd-even-
transposition-sort. All those operations take constant time except Pipelined Merge Sort
which takes O(log n) time. Thus algorithm 5.4.3 Basic Columnsort runs in O(log n) time.
Theorem 8 There is a Columnsort algorithm that sorts O(n
√
n) elements in O(log n) time
on an n×√n-processor 2D LARPBS.
5.5 The Two-Way Merge Sort Algorithm on a 2D LARPBS
One application of the basic Columnsort algorithm is a two-way merge Columnsort algorithm.
The general two-way merge problem is defined as follows:
Definition 5.5.1 Given two sorted arrays A = (a1, a2, · · ·, an) and B = (b1, b2, · · ·, bn), the
two-way merge problem is to obtain the sorted array C = (c1, c2, · · ·, c2n) by merging A and
B.
Our two-way merge Columnsort algorithm relies on the following lemma given in [49]
with proof.
Lemma 5.5.1 Assume that A = (a1, a2, · · ·, an) and B = (b1, b2, · · ·, bn) are two sorted
sequences, and adn/2e ≤ bdn/2e+1. Let D = (d1, d2, · · ·, dn) be the sorted sequence obtained
by merging b1, b2 · ··, bdn/2e and adn/2e+1, adn/2e+2, · · ·, an. Then, no element in the sequence
E = (a1, a2, · · ·, adn/2e, d1, d2, · · ·, dbn/2c) is strictly larger than any element in the sequence
F = (bdn/2e+1, bdn/2e+2, · · ·, bn, dbn/2c+1, dbn/2c+1, · · ·, dn).
Proof. We prove this lemma by proving the following two facts:
1. no ai, where 1 ≤ i ≤ dn2 e, in E is strictly larger than any element in F .
2. no di, where 1 ≤ i ≤ bn2 c, in E is strictly larger than any element in F .
73
Fact 1: Given the assumption that A and B are both sorted and adn/2e ≤ bdn/2e+1,
it is clear that no ai, where 1 ≤ i ≤ dn2 e, in E is strictly larger than any bj in F , where
bn
2
c + 1 ≤ j ≤ n. Thus we only needs to prove that ai is strictly larger than any dj in F ,
where bn
2
c + 1 ≤ j ≤ n. We claim that there is an (ai in a1, a2, · · ·, adn
2
e) larger than some
element in F . Again the assumption adn/2e ≤ bdn/2e+1 guarantees that this element can only
be some dk, where k ≥ bn/2c+ 1, in sorted array D. Observe that all the bn/2c elements in
D that comes from A are greater than or equal to ai, where 1 ≤ i ≤ dn2 e, therefore strictly
larger than dk, which implies k ≤ bn/2c, a contradiction. Thus Fact 1 is proved.
Fact 2: Suppose that there is a di in E strictly larger than some element in F . Since
D is sorted, this element must be some bk, where k ≥ bn/2c+ 1, i.e., ci > bk. Since all
elements in D that come from B are less than or equal to bk, therefore strictly smaller than
di. It follows that i ≥ bn/2c+ 1. This contradicts with the fact that di is in E. Thus our
claim cannot be true. This proves Fact 2.
To summarize Fact 1 and Fact 2, no elements in E is strictly larger than any element
in F . The lemma is proved.
The following basic operation will be used as a building block in our two-way merge
algorithm:
2D broadcast: processor Pi,j who holds the key to be broadcast performs a basic bus
broadcast operation on the ith row. Then every processor in the ith row perform a column
broadcast operation simultaneously. After this multiple column broadcast operation, every
processor in the 2D LARPBS gets the key.
Now we are ready to give the details of our two-way merge columnsort algorithm on an
r × 2s 2D LARPBS.
Algorithm 5.5.1 Two Way Merge Columnsort
Input: an r × 2s matrix C, where r = s2.
Output: matrix C sorted in column major order.
Begin
1. Segment and sort : Each processor Pi,s, where 1 ≤ i ≤ r, in C set its row reconfigurable
switches to cross to segment C into two r × s matrices: A and B. Apply algorithm
5.4.3 Basic Columnsort on A and B simultaneously.
2. Compare and exchange : Processor P1,1 calculate the address (α, β) for the median
element in matrix A: α = dsr/2e mod r. If α = 0, then set α = r, β = dsr/2e/r, and
flag = 0; otherwise, β = dsr/2e/r + 1, flag = 1. Then each processor Pi,s, where
1 ≤ i ≤ r, sets its row reconfigurable switches back to straight to fuse A and B into
one matrix C. Processor P1,1 do a 2D broadcast communication operation to broadcast
addresses (α, β) and flag to every processor in matrix C. Now processor Pα,β in matrix
A sends its element aα,β = l to the median processor of matrix B, i.e., processor Pα,β+s.
Processor Pα,β+s compares the received element l with its own element bα,β = r. If
l > r, processor Pα,β+s broadcasts a signal message to every processor in C using a
2D broadcast communication operation. Upon receiving the signal message, processors
74
in matrices A and B exchange the corresponding elements using a row one-to-one
communication operation on the r rows simultaneously. To be precise, processor Pi,j
exchange element with processor Pi,j+s, where 1 ≤ i ≤ r.
3. Construct middle matrices D : If flag = 1, we need to move elements around in
matrices A and B to form an r × s shape middle matrix D. Thus processors Pi,β
exchange its element with the elements of Pi,β+s, where α ≤ i ≤ r. This is to move
all elements in [aα+1,β, ar,s] and [b1,1, bα,β] into the middle matrix D. Then processors
Pi,β and Pi,β+s, where 1 ≤ i ≤ r, set their row reconfigurable switches to cross to form
matrix D. See in Figure 5.2(a).
4. Sort matrix D : Apply algorithm 5.4.3 Basic Columnsort to matrix D.
5. Reconstruct A and B : If flag = 1, we need to move the elements around in matrices
A and B to recover the elements that belong to matrix B and were moved to matrix
A for the convenience of sorting matrix D in the previous step. This can be done in
the following two sub-steps:
(a) Circular movement on column buses in D : Processors in matrix D move their
elements up x = r− β positions, i.e. processor Pi,j sends its element to processor
Pk,j, where k = i − x, if i − x > 0; otherwise, k = x + i − 1. This is a multiple
one-to-one column communication operation. See Figure 5.2 (b).
(b) Partial circular movement on row buses in A and B : Processors Pi,β and Pi,β+s,
where 1 ≤ i ≤ r, set their row reconfigurable switches to straight to form a single
matrix. Then processor Pi,j sends its element to processor Pi,k, where α < i ≤ r,
β ≤ j ≤ β + s, k = j − 1 for β + 1 ≤ j ≤ β + s and k = j + s for j = β. See
Figure 5.2 (c).
6. Partition and Sort : Each processor Pi,s, where 1 ≤ i ≤ r, set its row reconfig-
urable switches to cross to segment C again into A and B. Apply algorithm 5.4.3
Basic Columnsort to matrices A and B. Then each processor Pi,s, where 1 ≤ i ≤ r,
sets its row reconfigurable switches back to straight to fuse A and B into one matrix
C. Now matrix C is sorted in column major order.
End
Since the most time consuming operation in the above algorithm is algorithm 5.4.3
Basic Columnsort which runs in O(log n) time, we conclude that algorithm 5.5.1 runs in
O(log n) time.
Theorem 9 The two-way merge sort algorithm on an n × 2√n 2D LARPBS sorts 2n√n
elements in O(log n) time.
75
Figure 5.2: (a) Constructing middle matrix D, (b) Circular movements on column buses in
matrix D, and (c) Partial circular movements on row buses in A and B
76
Example 5.5.1 Two Way Merge Columnsort for 9× 6 matrix.


13 23 42 40 51 11
44 53 14 54 19 7
5 36 6 35 22 9
49 20 30 21 37 28
15 46 52 12 18 1
32 38 43 3 26 41
17 45 34 50 29 16
33 27 47 31 10 8


















































































































































































Step 5 : substep1


1 16 14 28 41 42
2 18 15 29 48 43
3 19 17 30 50 44
7 21 20 31 51 45
8 22 23 32 54 46
9 4 24 35 33 47
10 5 25 37 34 49
11 6 26 39 36 52






























Step 5 : substep2 Step 6
If we take a second thought on the above two-way merge algorithm, we will find that
we don’t need 2n
√
n processors. After some modification to the algorithm, we can use only
n
√
n processors to sort 2n
√
n elements. Here is a modified version of our two-way merge
sort algorithm.
Algorithm 5.5.2 Two Way Merge Columnsort1
Input: two r×s matrices A and B, where r = s2. Elements for both matrices are distributed
to an r × s LARPBS, one element per processor for each matrix, i.e. processor Pi,j holds
element ai,j from A and element bi,j from B.
Output: 2n
√
n sorted elements stored in A and B in following column major order: (a1,1, a2,1, ··
·, an,1, · · ·, a1,√n, a2,√n, · · ·, an,√n, b1,1, b2,1, · · ·, bn,1, · · ·, b1,√n, b2,√n, · · ·, bn,√n).
Begin
1. Double sort : Apply algorithm 5.4.3 Basic Columnsort twice – once for sorting matrix
A and once for sorting matrix B.
2. Compare and set : Let (α, β) be the address for the median processor, where α =
dsr/2e mod r. If α = 0, then set α = r, β = dsr/2e/r; otherwise, β = dsr/2e/r + 1.
Then processor Pα,β compares its aα,β and bα,β. If aα,β > bα,β, each processor Pi,j




3. Construct matrix of active elements : Processors Pi,j in the first half of the matrix,
where 1 ≤ i ≤ α and j = β, or 1 ≤ i ≤ n and j < β, set bi,j as active element.
Processors Pk,l in the second half of the matrix, where α < k ≤ r and l = β, or
1 ≤ k ≤ n and l > β, set ak,l as active element.
4. Sort active elements : Apply algorithm 5.4.3 Basic Columnsort to A sort all active
elements.
5. Reconstruct A and B : Processor Pi,j exchanges its bi,j value with processor Pi,j′ ’s ai,j′
value, where 1 ≤ i ≤ n, 1 ≤ j ≤ β if α = 0, otherwise 1 ≤ j < β, and j′ = s − j + 1.
This can be done with a one-to-one row communication operation. If α 6= 0, do the
following one-to-one column communication operation on column β: processor Pi,β
exchanges its bi,β value with processor Pi′,β’s ai′,β value, where 1 ≤ i ≤ bsr/2c mod r,
i′ = r − i + 1.
6. Sort new A and B : Apply algorithm 5.4.3 Basic Columnsort twice again to sort matrix
A and B. At the end of this step, all A-elements are sorted in column major order, all
B-elements are sorted in column major order. And all B-elements are strictly larger
than all A-elements. Thus the 2n
√
n elements are sorted.
End
Same as algorithm 5.5.1 Two Way Merge Columnsort, the most time consuming oper-
ation in the above algorithm is algorithm 5.4.3 Basic Columnsort which runs in O(log n)
time, thus we have the following theorem.
Theorem 10 There is a two-way merge sort algorithm on an n × √n 2D LARPBS that
sorts 2n
√
n elements in O(log n) time.
Example 5.5.2 Two Way Merge Columnsort1 for a 9×3 matrix. Elements in each matrix
location (i, j) is always shown as (ai,j, bi,j) pair.


13, 40 23, 51 42, 11
44, 54 53, 19 14, 7
5, 35 36, 22 6, 9
49, 21 20, 37 30, 28
15, 12 46, 18 52, 1
32, 3 38, 26 43, 41
17, 50 45, 29 34, 16
33, 31 27, 10 47, 8






4, 1 24, 16 42, 35
5, 2 25, 18 43, 37
6, 3 27, 19 44, 39
13, 7 30, 21 45, 40
14, 8 32 , 22 46, 41
15, 9 33, 26 47, 48
17, 10 34, 28 49, 50
20, 11 36, 29 52, 51









1, 4 16, 24 35, 42
2, 5 18, 25 37, 43
3, 6 19, 27 39, 44
7, 13 21, 30 40, 45
8, 14 22 , 32 41, 46
9, 15 26, 33 48, 47
10, 17 28, 34 50, 49
11, 20 29, 36 51, 52






1, 4 16, 24 35 , 42
2, 5 18, 25 37 , 43
3, 6 19, 27 39 , 44
7, 13 21, 30 40 , 45
8, 14 22, 32 41 , 46
9, 15 26 , 33 48 , 47
10, 17 28 , 34 50 , 49
11, 20 29 , 36 51 , 52
12, 23 31 , 38 54 , 53


Step 2 Step 3


1, 4 16, 24 35 , 42
2, 5 18, 25 37 , 43
3, 6 19, 26 39 , 44
7, 13 21, 27 40 , 45
8, 14 22, 28 41 , 46
9, 15 29 , 33 48 , 47
10, 17 30 , 34 50 , 49
11, 20 31 , 36 51 , 52






1, 35 16, 32 4 , 42
2, 37 18, 31 5 , 43
3, 39 19, 30 6 , 44
7, 40 21, 29 13 , 45
8, 41 22, 28 14 , 46
9, 48 27 , 33 15 , 47
10, 50 26 , 34 17 , 49
11, 51 25 , 36 20 , 52
12, 54 24 , 38 23 , 53


Step 4 Step 5


1, 28 10, 37 19, 46
2, 29 11, 38 20, 47
3, 30 12, 39 21, 48
4, 31 13, 40 22, 49
5, 32 14, 41 23, 50
6, 33 15, 42 24, 51
7, 34 16, 43 25, 52
8, 35 17, 44 26, 53




This version of two-way merge sort algorithm uses less processors and simpler. What
is more important is that it greatly simplifies the algorithm design for the multi-way merge
algorithm to be presented in the next section.
81
5.6 The Multi-Way Merge Algorithm on a 2D LARPBS
The multi-way merge problem we would like to solve in this section is defined as follows:
Definition 5.6.1 Multi-way Merge Problem:
Given a collection of sorted arrays A = {A1, A2, · · ·, Am}, each of them has size n ×
√
n,
the multi-way merge problem is to obtain a sequence of sorted elements that contains all the
elements in A.
In this section, we are specifically interested in the Multi-way Merge Problem where
m =
√
n, since for this case we can use algorithms developed in the previous sections to
obtain an efficient multi-way merge algorithm. We follow the following basic idea proposed
in [49] to develop our algorithm on a 2D LARPBS: evenly take n elements as samples from
sorted arrays A = {A1, A2, · · ·, A√n}, where A is an n×
√
n array; sort those samples; then
partition A into
√
n arrays B1, B2, · · ·, B√n, each of them has size at most 2n ×
√
n. To
be specific, this partition is done by assigning each elements in A to one of the arrays Bi
based on a partition scheme stated below in Definition 5.6.7. This partition scheme does
not guarantee that elements in each Bi are in order, but it guarantees that no element in Bi
is strictly greater than any element in Bi+1. Thus based on this partition, what we need to
do to get the final results is to sort each Bi simultaneously. In summary, the multi-way sort
can be done in following 4 steps:
1. Form sample set S = {s1, s2, · · ·, sn} by selecting every mth element in each sorted
sequence Ai, where 1 ≤ i ≤
√
n.
2. Sort sample set S.
3. Partition A into
√
n arrays: B1, B2, ···, B√n. Each Bi is obtained by calculating regular
matrix Xi and special matrix Yi, and let Bi = Xi
⋃
Yi.
4. Sort each Bi using algorithm 5.5.2 Two Way Merge Columnsort1.
5. Coalesce the sorted Bis into an n× n column major sorted array.
We need the following definitions and lemmas before we can describe our algorithm in
details.
Definition 5.6.2 Regular Column
Let A = {A1, A2, · · ·, A√n} be an n×n matrix, each Ai is an n×
√
n matrix sorted in column
major order. Let B = {B1, B2, · · ·, B√n} be a partition of matrix A. Let element b be the
leader of column k, and rb be the rank of b in sorted sample set S. A leader of a column is
defined to be the last element on that column. Column k is said to be regular with respect to
Bi if (i − 1)
√
n < rk ≤ i
√
n. All regular columns with respect to Bi are assigned to regular
matrix Xi.
82
Definition 5.6.3 Regular Matrix
Regular matrices are defined to be a set of n×√n matrices, X = {X1, X2, · · ·, X√n}. Regular
matrix Xi consists of all regular columns with respect to Bi.
Definition 5.6.4 Special Column
Let A = {A1, A2, · · ·, A√n} be an n×n matrix. Assume B = {B1, B2, · · ·, B√n} is a partition
of A obtained by the partition scheme defined in Definition 5.6.7. Assume each n × √n
matrix Ai, where 1 ≤ i ≤
√
n, is sorted in column major order. Let rj be the rank of the
leader of column j in sorted sample set S. Then column j in matrix A is said to be special
with respect to Bt if r(j−1) ≤ t
√
n ≤ rj. All special columns with respect to Bi are assigned
to special matrix Yi.
Definition 5.6.5 Special Matrix
Special matrices are defined to be a set of n×√n matrices, Y = {Y1, Y2, · · ·, Y√n}. A column
in each special matrix Yi, is either a special column with respect to Bi, or a column with all
elements set to ∞.
Definition 5.6.6 Qualification Rule
Given an element e and the sorted sample S = {s1 < s2 < · · · < sn}. Let m =
√
n. Then e
is said to be qualified with respect to Bi if one of the following conditions is satisfied:
1. s(i−1)m < e ≤ sim whenever s(i−1)m < sim
2. s(i−1)m ≤ e ≤ sim whenever s(i−1)m = sim
Definition 5.6.7 Partition Scheme
Given
√
n matrices Ai, where 1 ≤ i ≤
√
n, the partition of A into
√
n subsets Bi, where
1 ≤ i ≤ √n, is done in three phrases:
1. Associate qualified regular elements on regular columns with their regular matrices: us-
ing Qualification Rule defined in Definition 5.6.6 to filter out qualified regular elements
and associate them with their regular matrix Xi.
2. Associate leftover elements on special columns with their special matrices: for non-
qualified regular elements, find each element e′s special matrices, Yj1 , Yj2 , · · ·, Yjk , where
j1 < j2 < · · · < jk. Then associate e with special matrix Yji, where ji is the smallest
index in (j1 < j2 < · · · < jk) obtained by Qualification Rule defined in Definition 5.6.6.
3. Obtain Bi: let Bi = Xi
⋃
Yi
The following lemmas are implicitly given in [49] without proof. Since they are important
facts that guarantee the correctness of our multi-way merge algorithm, we now give the
description of the lemmas followed by a detailed proof for each of them.
Lemma 5.6.1 Each sorted sequence Ai contains at most one special column with respect to




Proof. Assume that a sorted sequence Ai contains two special columns with respect to Bj,
i.e., the gth column Aig and the k
th column Aik in Ai. We denote the rank for the leader of
those two columns as rg and rk, respectively. Assume rg < rk. Since column Aik is a special
column with respect to Bj, we have r(k−1) ≤ j < rk. As rg < rk, we have rg ≤ r(k−1). Since
Ai is sorted and r(k−1) ≤ j, we must have rg ≤ j. But according to our assumption, column
Aig is also a special column with respect to Bj. Based on the definition of a special column,
we have r(g−1) ≤ j < rg; in other words, rg > j. This contradict with rg ≤ j. The lemma is
proved.
Lemma 5.6.2 The partition scheme defined in Definition 5.6.7 guarantees every element in
A will be associated to exactly one Bi.
Proof. Based on Definition 5.6.7, once an element is assigned to a regular matrix Xi or a
special matrix Yi, it will not be assigned to any other regular or special matrix. Thus any
element will be assigned to at most one Bi.
Assume that an element e is in a regular column with respect to Bi and it does not meet
the qualification rule for Bi, then we must have e ≤ s(i−1)√n. Since each column is sorted,
we always have si√n ≥ e. Assume e ∈ Ak, sc is the leader of the column that e belongs to,
sa is the leader of the proceeding column in Ak. Since Ak is sorted in column major order,
we have e ≥ sa. From s(i−1)√n ≥ e and e ≥ sa, we obtain s(i−1)√n ≥ sa. As e ∈ Bi, we
have s(i−1)√n ≤ sc ≤ si√n. Thus we obtain sa ≤ s(i−1)√n ≤ sc, which means e is in a special
column with respect to at least one special matrix Y(i−1). We apply a similar argument if
e disqualifies for special matrix Y(i−1) until we find a special matrix for which e qualifies.
If e disqualifies for any special matrix until we have e < s√n, then e qualifies for special
matrix Y1 as e is always greater than or equal to the smallest element in A. Thus we proved
there exists at least one special matrix Yi for any disqualified regular element such that this
element qualifies for the special matrix Yi. In other words, any element qualifies for at lease
one Bi, thus it will be assigned to at least one Bi.
Summarizing the above argument, every element in A will be associated to one and only
one Bi. The lemma is proved.





n columns are assigned to each Xi, it is clear that Xi will be assigned
at most n
√
n qualified elements. According to Lemma 5.6.1, there will be at most one special
column from each Aj assigned to a Bi. Since 1 ≤ j ≤
√
n, there will be at most
√
n special
columns assigned to a special matrix Yi. Thus Yi will be assigned at most n
√
n qualified
elements. The lemma is proved.
The following lemma is given in [49] with proof. We reiterate the proof here for the
completeness of this discussion.
Lemma 5.6.4 Assume A is an n × n matrix. Matrix Ai, where 1 ≤ i ≤
√
n, is obtained
by partition A into
√




obtained by collecting qualified regular elements and qualified special elements with respect to
Bi. Then all Bi, where 1 ≤ i ≤
√
n, satisfy following condition: for every i and j, where
1 ≤ i < j ≤ √n, no element in Bi is strictly larger than any element in Bj.
Proof.
Based on the qualification rule defined in Definition 5.6.6, all elements that are assigned
to Bi are in interval [s(i−1)m, sim]; all elements that are assigned to Bi+1 are in interval
[sim, s(i+1)m]; etc. In general, all elements that are assigned to Bk are in interval [s(k−1)m, skm]
for 1 ≤ k ≤ √n. Since we have s1 < s2 < · · · < s√n, it is clear that for every i and j, where
1 ≤ i < j ≤ s, no element in Bi is strictly larger than any element in Bj.
We develop following multi-way merge algorithm for a 2D LARPBS based on the basic
idea of [49], but we use different techniques that apply to the LARPBS model.
Algorithm 5.6.1 Multiway Merge
Input: An n× n matrix A, split into √n sorted sub-matrices A = {A1, A2, · · ·, A√n}, each
of them is an n×√n matrix. Elements in each sub-matrix Ai, where 1 ≤ i ≤
√
n, are sorted
in column major order. But A is not sorted.
Output: matrix A sorted in column major order
Begin
• Step 1: Let the last processor of each column in A be the leader of that column. So
leader processors are Pn,j for 1 ≤ j ≤ n. The nth row bus consists of all the leader
processors. We call this bus the sample bus, and we use S to denote the sample set,
which contains all leaders.
• Step 2: Run algorithm 4.4.1 Pipepined Merge Sort on the sample bus to sort the
elements in the leader processors. When an element is being moved in the sample bus
while being sorted in the sorting operation, the original column index of an element is
also moved with that element. After sorting operation is done, each leader processor
sends its element and its current column index r back to that element’s original location
using the original column index that moves with the element. After receiving its original
element and index r, each leader processor sets its element to be equal to the original
one, set its column rank to be r, and broadcast r to every processor in the same column.
This is a multiple column broadcast communication operation.
• Step 3: Construct regular matrices Xi.
1. Move each column to its regular matrix : each processor Pi,j sends its element ai,j
to processor Pi,r if its column rank in S is r. After this one-to-one row commu-
nication operation, each processor Pi,j sets its xi,j to be the element received.
2. Calculate bounding interval I : each processor Pn,j√n, where 1 ≤ j <
√
n, sends its
element xn,j√n to processor Pn,j√n+1. Then each processor Pi,j√n, where 1 ≤ i ≤ n
and 1 ≤ j < √n, sets its reconfigurable switches to cross to form√n sub matrices.
Within each n×√n sub-matrix, leader of the first column broadcast the element a
85
just received and leader of the last column broadcasts its element b. 2D Broadcast
operation is used in both of those broadcasting. After the two broadcasting
operations, each processor in the system sets its bounding interval I = (a, b],
and all processors set their reconfigurable switches back to straight to form a
single 2D LARPBS.
3. Determine xi,j: each processor Pi,j compares the received element with its bound-
ing interval. If the element falls out of the bounding interval, it sets its xi,j = ∞.
Then each processor Pi,j whose xi,j 6= ∞ sends a signal message to processor Pi,k
if its xi,j comes from Pi,k. Processor Pi,k, who received a signal message in this
one-to-one row bus communication operation, sets its ai,j = ∞.
• Step 4: Construct special matrices Yi.
1. Identify special columns and their special matrix Yi: each leader processor sends
its column rank r to the leader processor on its right. This is a one-to-one row
communication operation. Upon receiving the rank from the processor on its
left, each leader processor find the Bi for its column using Definition 5.6.4, and
broadcasts the indices i of all Bis found in its column bus. This is a multiple
column broadcast communication operation.
2. Send special elements to their special matrices: If processor Pi,j received special
matrix indices (l, l+1, ···, h) in the previous step and ai,j 6= ∞, then Pi,j multicasts
its element ai,j to the k
th column of matrices Al, A(l+1), ···, Ah , where k = dj/
√
ne.
According to Lemma 5.6.1, each processor in the system will receive at most one
element in this multiple row multicast communication operation.
3. Determine qualified special elements: all processors do initialization yi,j = ∞.
Each processor Pi,j who received an element in previous step, compares the re-
ceived element y with its bounding interval. If y falls in the bounding interval,
then processor Pi,j does two things: (1) set yi,j equals to the element received.
(2) multicast a dummy message to processors Pi,j′ , where j
′ = j + m
√
n, j′ < n
and m ≥ 1 (i.e. the processors on the same row on kth column in every matrix
Ar that is on the right of processor Pi,j, where k = dj/
√
ne). Each processor Pi,j′
who received a dummy message in the multicast communication operation sets
yi,j′ = ∞.
• Step 5: Apply alogirthm 5.4.3 Basic Columnsort twice on each n × √n sub-matrix
Ai, once for sorting Xi, and once for sorting Yi.
• Step 6: Apply algorithm 5.5.2 Two Way Merge Columnsort1 on each n × √n sub-
matrix Ai to merge elements in each Xi and Yi and sort them in column-major order.
• Step 7: Calculate final address for each element in Xi and Yi.
1. Calculate number of elements in each Xi and Yi: each processor Pi,j checks its yi,j
value. If yi,j = ∞, it sends a signal message to P((i−1) mod n),j. If processor Pn,j
86
whose yn,j = ∞, received a message in this one-to-one column communication
operation, it sends a signal message again to processor Pn,j−1, where j > 1.
Only one processor whose yi,j 6= ∞ should get a message in the two one-to-one
column and row communication operations. And this processor holds the largest
element in its Yi matrix. This processor then sends its address to the header
processor of the sub-matrix it belongs to. Let Bi = Xi
⋃
Yi, and |Bi|, |Xi|, and
|Yi| represent number of non-∞ elements in Bi, Xi, and Yi, respectively. Since
each Yi are sorted, the header processor who receives an address can calculate
the rank r from the received address and obtain the number of elements in Bi:
|Bi| = |Xi| + |Yi| = r + n
√
n. If a header processor does not receive an address,
then it broadcast a signal to every processor in its sub-matrix to calculate the
number of elements in Xi using the same method as we used for calculating the
number of elements in Yi, and set |Bi| = |Xi|. For both cases, |Bi| is stored in
the header processor.
2. Calculate integer prefix sums for each sub-matrix: each header processor P1,k√n+1,
where 0 ≤ k < n, sends its |Bk+1| to processor P1+k,k√n+1. Processor P1+k,k√n+1
then sends the received |Bk+1| to processor P1+k,n. Now run algorithm 2.4.2
2D Integer Prefix Sums on the first
√
n × n sub-matrix to calculate the integer
prefix sums for |Bi|. Let PSi be the prefix sums of |Bi|. Then at the end of
algorithm 2.4.2 2D Integer Prefix Sums, PSi will be stored in processor Pi,n.
3. Calculate final destination address for each element: processor Pi,n sends PSi
to processor Pi,(i+1)√n, where 1 ≤ i <
√
n. Processor Pi,(i+1)√n then broadcast
PSi in the i
th n×√n matrix using 2D Broadcast operation. Each processor Pi,j
calculates the final address for the element it holds by adding its rank in its sub-
matrix with the prefix sums it received and convert it to a two-dimensional array
address.
• Step 8: do the following operations for matrix Xi and Yi separately: send elements to
their final destinations using a one-to-one column communication operation followed
by a one-to-one row communication operation – use the column communication to
send an element to the correct row, and then use the row communication to send an
element to the correct column.
End
The non-constant time operations used in the above algorithm are: algorithm 5.5.2
Two Way Merge Columnsort1, alogirthm 5.4.3 Basic Columnsort, and algorithm 4.4.1
Pipepined Merge Sort. All of those three algorithms run in O(log n) time, thus we obtain
the following theorem:
Theorem 11 The multi-way merge algorithm on an n× n 2D LARPBS merges √n sorted
list each of size n
√
n into one sorted list in O(log n) time.
87
5.7 An O(log n)-Time Merge-Based Sorting Algorithm
on a 2D LARPBS
With algorithm 5.4.3 Basic Columnsort and algorithm 5.6.1 Multiway Merge at disposal, we
are ready to present our O(log n) sorting algorithm on 2D LARPBS. This algorithm sorts
n2 elements in O(log n) time using an n× n 2D LARPBS.
Algorithm 5.7.1 Multiway Merge Sort
Input: a sequence of n2 unsorted elements distributed to an n×n 2D LARPBS, one element
per processor.
Output: a sequence of n2 sorted elements stored in the n×n 2D LARPBS in column major
order, one element per processor
Begin
1. Partition and sort: each processor Pi,j√n set its row reconfigurable switches to cross to
partition the n× n matrix into √n sub-matrices, where 1 ≤ i ≤ n, 1 ≤ j ≤ √n. Each
sub-matrix runs the algorithm 5.4.3 Basic Columnsort simultaneously.
2. Multi-way merge: Apply Multiway Merge algorithm to the
√
n sub matrices to obtain
the sorted sequence in column major order.
End
Since both algorithm 5.4.3 Basic Columnsort and algorithm 5.6.1 Multiway Merge run
in O(log n) time, it is clear that algorithm 5.7.1 Multiway Merge Sort algorithm runs in
O(log n) time.
Theorem 12 Algorithm 5.7.1 Multiway Merge Sort sorts n2 elements on an n × n 2D
LARPBS in O(log n) time.
Example 5.7.1 Multiway Merge Sort for a 9× 9 matrix.


41 48 23 33 40 57 64 56 54
5 67 44 59 69 74 2 35 58
61 15 25 80 24 31 66 37 52
36 42 62 7 13 12 71 49 16
17 19 65 21 60 27 68 70 45
10 77 30 38 28 51 11 20 81
32 55 14 3 50 76 29 53 47
79 8 63 75 4 34 46 26 39






















































































































































































































































































































































after merging sorted special matrices and regular matrices


0 0 0 0 0 0 0 0 39
0 0 0 0 0 0 0 0 17
0 0 0 0 0 0 0 0 25


number of elements in each Bi


0 0 0 0 0 0 0 0 39
0 0 0 0 0 0 0 0 56
0 0 0 0 0 0 0 0 81


prefix sum on the number of elements in each Bi


1, 28 10, 37 19
2, 29 11, 38 20
3, 30 12, 39 21
4, 31 13 22
5, 32 14 23
6, 33 15 24
7, 34 16 25
8, 35 17 26






































































row communication : send elements to correct columns
5.8 An O(log n)-Time Generalized Columnsort Algorithm
on a 2D LARPBS
In this section, we give a generalized Columnsort algorithm to sort n2 elements on an n× n
2D LARPBS in O(log n) time. The basic idea is: partition matrix A into
√
n sub-matrices:
A1, A2, · · ·, A√n, each being an n ×
√
n matrix. We then construct a virtual matrix B of
size n
√
n×√n by stretching all elements in each sub-matrix to a so called “super-column”
of n
√
n elements, and we call it a super column. We call B a virtual matrix because it is
a logical matrix created for the convenience of explaining the algorithm. Since the virtual
matrix satisfies the restriction for column sort, i.e. number of rows greater than the square
of the number of columns, we can use Algorithm 5.4.3 Basic Columnsort to sort the virtual
matrix B. Thus by introducing the concept of super column and virtual matrix, we convert
the problem of sorting an n×n matrix into sorting an n√n×√n matrix. And we will show
that sorting an n
√
n × √n matrix with n × n processors can be done on 2D LARPBS in
O(log n) time.
Before we present our generalized Columnsort algorithm, we need to provide following
basic operations:
• √n-way-column-shuffle: Each processor Pi,j, where 1 ≤ i, j ≤ n, sends its element
to processor Pi,k, where k = (j mod
√
n) × √n + dj/√ne. This is a one-to-one row
communication.
• form-super-columns: Processors Pi,j√n, where 1 ≤ i ≤
√
n, set their row reconfigurable
switches to cross to partition matrix A into
√
n sub-matrices: A1, A2, · · ·, A√n. Each
of those sub-matrices Ai is a n×
√
n matrix and we call it super-column i.
• fuse-super-columns: Processors Pi,j√n, where 1 ≤ i ≤
√
n, set their row reconfigurable
switches to straight to fuse
√
n super columns into one n× n matrix A.
Algorithm 5.8.1 Generalized Columnsort
Input: a sequence of n2 unsorted elements distributed to an n×n 2D LARPBS, one element
per processor.
93
Output: a sequence of n2 sorted elements stored in the n×n 2D LARPBS in column major
order, one element per processor
Begin
1. Partition and sort each super column: Do a form-super-columns operation to parti-
tion matrix A into
√
n sub-matrices. Each sub-matrix runs the algorithm 5.4.3 Ba-
sic Columnsort simultaneously to sort its elements in column major order. This step
serves the purpose of sorting items in each column of the virtual matrix in column
major order. After sorting is done, do a fuse-super-columns operation to form on 2D
LARPBS.
2. Shuffle and transpose: Each sub-matrix Ai does an Extended Matrix Transpose opera-
tion. (This operation actually equivalent to
√




3. Shuffle and sort: Do a
√
n-way-column-shuffle to form the new super columns. Then
run the algorithm 5.4.3 Basic Columnsort again simultaneously in each super column
to sort its elements in columns major order.
4. Shuffle and reverse transpose: Do a
√
n-way-column-shuffle operation to form the
new super columns. Then do a Extended Reverse Matrix Transpose operation in each
sub-matrix Ai. (Actually we should do a
√
n-way-column-shuffle operation after the
Extended Reverse Matrix Transpose to move the items to their correct locations for
correctly reversely transpose the n
√
n × √n virtual matrix. But as we need another√
n-way-column-shuffle operation after this one to shuffle elements in a super column
into a sub-matrix so that we can alternate sort each super column, two shuffles cancel
each other. So we skip the shuffle here and the one in the next step as well)
5. Alternate sort: Run algorithm 5.4.3 Basic Columnsort again simultaneously on each
sub-matrix(super column). This time, odd indexed sub-matrices sort their elements in
increasing order; even indexed sub-matrices sort their elements in decreasing order.
6. Shuffle and 2 steps of odd-even sort: Do an
√
n-way-column-shuffle operation to put
elements in their correct virtual matrix locations. Then do a form-super-columns oper-
ation and an odd step followed by an even step of odd-even-transposition-sort operation
on row buses to partially sort each super column.
7. Shuffle and final column sort: Do a
√
n-way-column-shuffle operation to form super
columns for the final column sort. Then run algorithm 5.4.3 Basic Columnsort within
each super column to sort elements for each super column in column major order.
After this column sort step, all elements in matrix A are sorted in column major order





The correctness of this algorithm is based on the 7-phase Columnsort algorithm men-
tioned in section 5.3. Since the virtual matrix on which we apply the 7-phase Columnsort
algorithm satisfies r ≥ s2, Algorithm 5.8.1 Generalized Columnsort correctly sorts the n2
elements in column-major order. As the only non-constant operation performed by Algo-
rithm 5.8.1 is Algorithm 5.4.3 Basic Columnsort which runs in O(log n) time, we obtain the
following theorem:
Theorem 13 The Generalized Columnsort algorithm sorts n2 elements on an n × n 2D
LARPBS in O(log n) time.
Example 5.8.1 Generalized Columnsort for 4× 4 matrix.


4 13 9 16
3 14 5 12
1 8 6 15



































































































































































































































1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16


Step 6 Step 7 output
96
Chapter 6
Optimal Sorting Algorithms on
Higher Dimensional LARPBS
6.1 An O(log n)-Time Generalized Columnsort Algorithm
on a 3D LARPBS
The basic idea of our 3D generalized Columnsort algorithm is: partition the n3 elements
into n group, each of which is an n×n matrix. We call each of these n×n matrices a super
column. So the n3 elements can be visualized as distributed into an n2 × n virtual matrix,
where the number of rows r = n2, and the number of columns s = n. As r = s2, we can
apply basic Columnsort algorithm to this virtual matrix. For the convenience of description,
we denote the three different groups of buses as: x buses, y buses, and z buses for dimension
1, 2, and 3, respectively. We denote each xy array an xy-plane, each xz array an xz-plane,
and each yz array a yz-plane. Our sorting algorithm sorts the n3 elements distributed to the
three-dimensional LARPBS into zyx-order. For example, a yx-order for a two dimensional
array refers to column-major order, and an xy-order for a two dimensional array refers to
row-major order.
The following basic operation will be used in our algorithm:
n-way-column-shuffle: each processor Pi,j,k, where 1 ≤ i, j, k ≤ n, sends its element to
processor Pi,j′,k′ , where j
′ = k, k′ = j. This can be done by simultaneously running algorithm
2.4.1 Matrix Transpose on all xz-planes(1 ≤ i ≤ n) (See Figure 6.1 for an example).
Algorithm 6.1.1 3D Generalized Columnsort
Input: a sequence of n3 unsorted elements distributed to an n × n × n 3D LARPBS, one
element per processor.
Output: a sequence of n3 sorted elements stored in the n×n×n 3D LARPBS in zyx-order,
one element per processor
Begin




each xy-plane using only x buses and y buses. This will sort the n xy-planes in yx-
order.
2. Transpose and shuffle: run algorithm 2.4.1 Matrix Transpose on each xy-plane. Then
do an n-way-column-shuffle operation on the 3D LARPBS. (The Matrix Transpose
operation performed here is equivalent to an n-way-column-shuffle operation plus a
transposition operation on an n2 × n matrix.)
3. Sort super column: run algorithm 5.8.1 Generalized Columnsort again on each xy-plane
to sort each of n n× n matrix in yx-order.
4. Shuffle and reverse transpose: do an n-way-column-shuffle operation, then run algo-
rithm 2.4.1 Matrix Transpose on xy-planes (the Matrix Transpose operation performed
in this step is equivalent to a reverse transpose on n2×n matrix followed by an n-way-
column-shuffle operation).
5. Alternate sort: run algorithm 5.8.1 Generalized Columnsort on each xy-plane so that
adjacent xy-planes are sorted in reverse yx-order.
6. shuffle and odd-even transposition sort: do an n-way-column-shuffle operation, followed
by an odd step and an even step of odd-even-transposition-sort on y buses.
7. shuffle and sort: do an n-way-column-shuffle operation, then run algorithm 5.8.1
Generalized Columnsort to finally sort the n3 elements in zyx-order.
End
Since Algorithm 6.1.1 3D Generalized Columnsort applies the 7-phase Columnsort al-
gorithm to an r × s matrix, where r = n2, s = n, and r = s2, it correctly sorts the
n3 elements. This r × s matrix is simulated by an n × n × n 3D LARPBS in Algorithm
6.1.1 3D Generalized Columnsort. As the only non-constant time operation performed in the
above algorithm is algorithm 5.8.1 Generalized Columnsort, we obtain the following theorem:
Theorem 14 Algorithm 6.1.1 3D Generalized Columnsort sorts n3 elements on an n×n×n
3D LARPBS in O(log n) time.





22 7 39 50
47 53 10 27
41 25 15 63





28 54 42 8
5 34 21 57
44 60 3 26





32 46 56 43
11 29 17 9
52 62 20 14





49 16 6 40
2 19 58 18
36 59 35 51






7 22 38 47
10 25 39 50
13 27 41 53





3 12 28 54
4 21 34 57
5 24 42 60





1 17 30 48
9 20 32 52
11 23 43 56





2 19 37 55
6 33 40 58
16 35 49 59






7 10 13 15
22 25 27 31
38 39 41 45





3 4 5 8
12 21 24 26
28 34 42 44





1 9 11 14
17 20 23 29
30 32 43 46





2 6 16 18
19 33 35 36
37 40 49 51
55 58 59 64


Input Step 1 Step 2
yx− ordered xy − transpose


7 3 1 2
22 12 17 19
38 28 30 37





10 4 9 6
25 21 20 33
39 34 32 40





13 5 11 16
27 24 23 25
41 42 43 49





15 8 14 18
31 26 29 36
45 44 46 51






1 12 28 47
2 17 30 48
3 19 37 54





4 20 33 50
6 21 34 52
9 25 39 57





5 23 41 53
11 24 42 56
13 27 43 59





8 26 44 61
14 29 45 62
15 31 46 63






1 4 5 8
2 6 11 14
3 9 13 15





12 20 23 26
17 21 24 29
19 25 27 31





28 33 41 44
30 34 42 45
37 39 43 46





47 50 53 61
48 52 56 62
54 57 59 63
55 58 60 64


Step 2 Step 3 Step 4




1 2 3 7
4 6 9 10
5 11 13 16





12 17 19 22
20 21 25 32
23 24 27 35





28 30 37 38
33 34 39 40
41 42 43 49





47 48 54 55
50 52 57 58
53 56 59 60






1 5 9 14
2 6 10 15
3 7 11 16





36 29 24 20
35 27 23 19
32 26 22 17





28 37 41 45
30 38 42 46
33 39 43 49





64 60 56 52
63 59 55 50
62 58 54 48






1 36 28 64
2 35 30 63
3 32 33 62





5 29 37 60
6 27 38 59
7 26 39 58





9 24 41 56
10 23 42 55
11 22 43 54





14 20 45 52
15 19 46 50
16 17 49 48
18 12 51 47


Step 4 Step 5 Step 6
xy − transpose column− shuffle


1 28 36 64
2 30 35 63
3 32 33 62





5 29 37 60
6 27 38 59
7 26 39 58





9 24 41 56
10 23 42 55
11 22 43 54





14 20 45 52
15 19 46 50
16 17 48 49






1 5 9 14
2 6 10 15
3 7 11 16





28 29 24 20
30 27 23 19
32 26 22 17





36 37 41 45
35 38 42 46
33 39 43 48





64 60 56 52
63 59 55 50
62 58 54 49






1 5 9 13
2 6 10 14
3 7 11 15





17 21 25 29
18 22 26 30
19 23 27 31





33 37 41 45
34 38 42 46
35 39 43 47





49 53 57 61
50 54 58 62
51 55 59 63
52 56 60 64


Step 6 Step 7 Step 7
o/e sorted column− shuffle zyx− sorted
101
6.2 A 5-Phase Sorting Algorithm on a 3D LARPBS
Leighton has described an O(n1/3)-time algorithm that sorts n elements on an n1/3× n1/3×
n1/3 array [30]. In other words, this algorithm sorts n3 elements in O(n) time on a n3 array.
By applying either algorithm 5.8.1 Generalized Columnsort or 5.7.1 Multiway Merge Sort to
an n×n×n 3D LARPBS, the time complexity of this algorithm can be reduced to O(log n).
This O(n) time 3D sorting algorithm proceeds in 5 simple phases. Here are the details
of the algorithm.
Algorithm 6.2.1 5-Phase Sort on 3D Array
Input: a sequence of n3 unsorted elements distributed to an n × n × n array, one element
per processor.
Output: a sequence of n3 elements stored into zyx-order, one element per processor.
Begin
1. Phase 1. sort the elements within each xz-plane into zx-order.
2. Phase 2. sort the elements within each yz-plane into zy-order.
3. Phase 3. sort the elements within each xy-plane into yx-order, but reverse the order
on every other plane.
4. Phase 4. do two steps of odd-even transposition sort to each z-line.
5. Phase 5. sort the elements within each xy-plane into yx-order.
End
We reiterate the correctness proof here for the completeness of this algorithm. Since
the algorithm is an oblivious comparison-exchange algorithm, it is sufficient to prove the
correctness using Lemma 5.3.1 The 0-1 Sorting Lemma. A dirty line is defined to be a line
with both 0 and 1. The argument goes as follows:
At the end of Phase 1, since each xz-plane is sorted in zx-order, there is at most one
dirty x-line in each xz-plane. Hence, the number of 1s for any two z-lines in a xz-plane differs
at most 1. Since each yz-plane contains n z-lines, the number of 1s for any two yz-plane can
differ at most n after Phase 1.
After Phase 2, each yz-plane is sorted into zy-order. Since the number of 1s for any two
yz-plane differs at most n, the number of all-1 and all-0 y lines between any two yz-plane
can differ at most 1. Since there is at most one dirty y-line in each yz-plane and this dirty
line is between all-0 lines and all-1 lines, we can conclude that there are at most two dirty
xy-planes at the end of phase 2, and the two dirty planes are consecutive.
Phase 3 sorts the two consecutive dirty xy-planes in reverse orders which makes it
possible for Phase 4 to merge those two dirty xy-planes using odd-even transposition sort so
that only one dirty xy-plane is left.
Phase 5 is used to sort the one dirty xy-plane into yx-order. Since Phase 1 has already
sort the z-lines, Phase 5 completes the sorting, i.e., the n3 elements are sorted in zyx-order.
102
By using an O(n)-time sorting algorithm for two dimensional arrays, the time complexity
of 6.2.1 5-Phase Sort on 3D Array is O(n).
Observe that only the following two key operations are involved in this algorithm:
1. a two-dimensional sorting operation
2. two steps of one dimensional odd-even transposition operation
The two dimensional sorting operation can be done on a 2D LARPBS in O(log n) time
using either algorithm 5.8.1 Generalized Columnsort or algorithm 5.7.1 Multiway Merge Sort,
and two steps of the one dimensional odd-even transposition sort can be done in constant
time which we have already shown in previous section. Thus we have a 5-phrase O(log n)
sorting algorithm that sorts n3 elements on an n× n× n LARPBS. Here is the algorithm.
Algorithm 6.2.2 5-Phase Sort on 3D LARPBS
Input: a sequence of n3 unsorted elements distributed to an n × n × n 3D LARPBS, one
element per processor.
Output: a sequence of n3 elements stored in zyx-order, one element per processor.
Begin
1. Phase 1. run algorithm 5.8.1 Generalized Columnsort in each xz-plane to sort the
elements into zx-order.
2. Phase 2. run algorithm 5.8.1 Generalized Columnsort in each yz-plane to sort the
elements into zy-order.
3. Phase 3. run algorithm 5.8.1 Generalized Columnsort in each xy-plane to sort the
elements into yx-order, but reverse the order in every other plane.
4. Phase 4. do two steps of odd-even-transposition-sort operation on each z-bus.
5. Phase 5. run algorithm 5.8.1 Generalized Columnsort in each xy-plane to sort the
elements into yx-order.
End
Since the only non-constant time operation performed in the above algorithm is algo-
rithm 5.8.1 Generalized Columnsort, we obtain the following theorem:
Theorem 15 Algorithm 6.2.2 5-Phase Sort on 3D LARPBS sorts n3 elements on an
n× n× n 3D LARPBS in O(log n) time.




22 28 32 49
47 5 11 2
41 44 52 36





7 54 46 16
53 34 29 19
25 60 62 59





39 42 56 6
10 21 17 58
15 3 20 35





50 8 43 40
27 57 9 18
63 26 14 51





1 13 36 49
2 22 41 52
5 28 44 55





7 24 33 54
16 25 34 59
19 29 46 60





3 15 35 48
6 17 38 56
10 20 39 58





4 18 37 50
8 26 40 51
9 27 43 57





1 13 36 49
7 24 33 54
3 15 35 48





2 22 41 52
16 25 34 59
6 17 38 56





5 28 44 55
19 29 46 60
10 20 39 58





11 32 47 61
23 31 53 62
12 21 42 64





1 13 33 48
3 15 35 49
4 18 36 50





2 17 34 51
6 22 38 52
8 25 40 56





5 20 39 55
9 27 43 57
10 28 44 58





11 21 42 61
12 30 45 62
14 31 47 63
23 32 53 64


Input Step 1 Step 2 Step 2
xz − planes zx− ordered yz − planes zy − ordered
xz − planes yz − planes


1 3 4 7
2 6 8 16
3 7 11 19





13 15 18 24
17 22 25 26
20 27 28 29





33 35 36 37
34 38 40 41
39 43 44 46





48 49 50 54
51 52 56 59
55 57 58 60





1 5 9 14
2 6 10 16
3 7 11 19





32 28 24 18
31 27 22 17
30 26 21 15





33 37 41 45
34 38 42 46
35 39 43 47





64 60 56 51
63 59 55 50
62 58 54 49





1 5 9 14
2 6 10 16
3 7 11 15





32 28 24 18
31 27 22 17
30 26 21 19





33 37 41 45
34 38 42 46
35 39 43 47





64 60 56 51
63 59 55 50
62 58 54 49





1 5 9 13
2 6 10 14
3 7 11 15





17 21 25 29
18 22 26 30
19 23 27 31





33 37 41 45
34 38 42 46
35 39 43 47





49 53 57 61
50 54 58 62
51 55 59 63
52 56 60 64


Step 3 Step 3 Step 4 Step 5
xy − planes yx− ordered odd− even− sorted zyx− ordered
xy − planes xy − planes xy − planes
104
6.3 Comparison of The Two 3D Sorting Algorithms
We have presented two three-dimensional sorting which achieve the same time complexity.
Now let’s compare the two algorithms:
In Algorithm 6.1.1 3D Generalized Columnsort, all sorting are done in the xy-plane.
Since only the sorting operations require reconfigurable switches, we can eliminate recon-
figurable switches on z-buses. But this algorithm runs in seven phases. The trade-off for
reducing one set of reconfigurable switches is to introduce several transpose operations which
can be done in constant time.
In Algorithm 6.2.2 5-Phase Sort on 3D LARPBS, we need to sort elements in the xy-
plane, the yz-plane, and the xz-plane. Thus reconfigurable switches are needed for all three
buses. But this algorithm is simpler in terms of the number of phases involved. With only
5 phases, this algorithm completes the sorting.
To Summarize, algorithm 6.1.1 3D Generalized Columnsort can run on a simpler 3D
LARPBS model, but requires several extra constant time operations; algorithm 6.2.2 5-





The selection problem is to determine the kth largest element in a given set of n elements.
Parallel sorting trivially solves the selection problem, but sorting is known to be computa-
tionally harder than selection. The easy cases are extreme selection where k = 1 or k = n,
and the hard case is the median, where k = dn/2e. Valliant proved the lower bound of
finding the maximum to be Ω(log log n) in [76]. This implies that the lower bound of se-
lection problem is Ω(log log n). Cole and Yap [16] present a selection algorithm on parallel
comparison model. This selection algorithm brings the upper bound of selection problem
from O(log n) down to O((log log n)2). Later, a even tighter upper bound selection algorithm
[3] is presented that pushes the upper bound further down to O(log log n). The existence of
such an algorithm implies that selection problem has an upper and lower bound of log log n.
The best previous result for selection on the PRAM is given in [13] that uses n operations
and runs in O(logn log log n) on a CRCW PRAM.
Previous research results for parallel selection problem on LARPBS model are the fol-
lowing: Pan gave a randomized algorithm which runs in Õ(n log n/p) expected time using p
processors [51]. Li et al. presented a deterministic selection algorithm that runs in O(log n)
time on an n-processor LARPBS. Han and Pan described an O((log log n)2/ log log log n)
deterministic selection algorithm on an n-processor LARPBS [24]. Research results for se-
lection problems on other parallel computing models are listed in Table 7.1.
The rest of this chapter is organized as follows. In section 7.2, we give a definition
on parallel comparison model. In section 7.3, we study the extreme selection problem, i.e.
finding the maximum. In section 7.4, we do research on two upper bound algorithms and
implement the Cole-Yap’s algorithm on the LARPBS model. In section 7.5, we provide an
application of our selection algorithm.
106
Table 7.1: Comparison of different parallel selection algorithms.
Algorithm Model Time Complexity Processor
Cole-Yap’s[16] PCT O((log log n)2) n
An optimal algorithm[3] PCT O(log log n) n
Li et al LARPBS O(log n) n
Our LARPBS algorithm LARPBS O((log log n)2) n
Han and Pan [24] LARPBS O((log log n)2/ log log log n) n
Cole [12] EREW PRAM O(log n log∗ n) n
Chaudhuri et al.[13] CRCW PRAM O(log n/log log n) n
Pan [51] LARPBS Õ(n log n/p) p
Rajasekaran & Sahni [68] AROB Õ(1) n2
7.2 Parallel Comparison Model
Valliant first introduced parallel comparison model in [76] for the purpose of studying com-
parison dominated parallel problems. Since then, a lot of researchers used this model as a
tool to study the parallelism in comparison problems. On the parallel comparison model, we
only count the time taken to carry out comparisons, and ignore time spent in communica-
tions, in determining which comparisons to do next, and any other side computations. Thus
any lower bound on the time complexity of solving a problem on this model will necessarily
also be a lower bound for any other model of parallelism that has binary comparisons as
the basic operations. And the best constructive upper bound will correspond to the fastest
algorithm for comparison dominated problems. The common properties of such problems
are: for each problem, the input consists of a set of elements on which there is a linear
ordering. By performing a comparison operation on any pair of elements, we can find the
ordering relationship between these two elements. The following are some problems that can
be studied using the parallel comparison model: finding the maximum, finding the medium,
sorting, and merging two sorted lists. The following definition for parallel comparison model
comes from [76].
Definition 7.2.1 Parallel comparison model
There are k processors available in the model, so k comparisons can be performed simultane-
ously. The processors are synchronized so that each of them completes a comparison within
each time interval. At the end of the interval the algorithm decides which k (not necessarily
disjoint) pairs of elements are to be compared during the next interval and have processors
assigned to them. The decision mentioned above is made by inspecting the ordering relation-
ships that have already been established in previous comparisons. The computation terminates
when the relationships that have been discovered are sufficient to specify the solution to the
given problem. The time complexity of each problem will be expressed as a function of the
number of processors, and of the size of the input set.
107
7.3 Finding Maximum
7.3.1 A Constant-Time Algorithm
This section gives a constant time algorithm for finding the maximum of n elements using
n2 processors on LARPBS. Assume that the n elements are initially distributed to the first
n processors.
Algorithm 7.3.1 Basic Maximum
Input: An unsorted array A of n distinct elements.
Output: the maximum element x stored in the first processor.
Begin
1. Distribute the elements on the first n processors in the following way: processor Pi
multicast its element x to processor Pkn+i, where 1 ≤ k < n, and 1 ≤ i ≤ n.
2. Segment the bus into n sub-buses so that each has n processors. Processor Pin+i, where
1 ≤ i ≤ n, sets y = x and broadcasts y to every other processor in its sub-bus.
3. Each processor in the system compares its x and y. If x > y, then sends a dummy
message to the first processor in its sub-bus. Then all processors set their reconfigurable
switches to straight to fuse sub-buses into one bus.
4. The first processor of a sub-bus who does not receive anything in previous bus cycle
holds the maximum element y. So it sends its y to the first processor in the system.
End
The above algorithm runs in constant time with n2 processors. Although the number of
processors is not appealing to us, the constant time complexity is very appealing. By using
a doubly logarithmic tree structure, the algorithm can be improved so that the number of
processors is reduced to O(n) and running time reaches lower bound Ω(log log n) [28].
7.3.2 A Fast Non-Optimal Algorithm
A fast algorithm [28] that reaches the lower bounds Ω(log log n) of the maximum finding
problem can be obtained on a CRCW PRAM based on the idea of the doubly logarithmic-
depth tree [28].
Definition 7.3.1 A doubly logarithmic tree is a balanced tree with the following property
the number of children of an internal node u is equal to d√nue, where nu is the number of
leaves in the subtree rooted at u.
We implement the same idea on an LARPBS as one of the comparison algorithms. Two
key issues need to be considered for developing this fast maximum algorithm:
108
1. how to simulate the doubly logarithmic tree on an LARPBS model; and
2. how to perform the computation needed and how to move intermediate results around
for the computation performed in the next stage.
Our solutions are:
1. Simulating the doubly logarithmic tree on an LARPBS : At the beginning of the al-
gorithm, elements are distributed to the n processors, one for each processor. As the
algorithm goes from stage to stage, we segment the bus into sub-buses in the following
way: on the first stage, each sub-bus has size 2; on the second stage, each sub-bus has
size 22 = 4; on the third stage, each sub-bus has size 24 = 16. In general, on the ith
stage, where 0 ≤ i ≤ log log n, each sub-bus has size 22i . So when the algorithm reaches
the log log n stage, the system consists only one bus which has a size of 2log n = n.
2. Computation and communication : The algorithm proceeds in O(log log n) stages. On
the first stage, a binary tree comparison is used to reduce number of candidates by a
factor of 2. On each stage that follows, the system is segmented into multiple sub-buses
based on the above segmentation rule. After segmentation is done, do a compression
operation to compress results from the previous stage to the left most processors of
the sub-bus. Then each sub-bus calls Algorithm 7.3.1 Basic Maximum which runs in
constant time. The reason why Algorithm 7.3.1 Basic Maximum can be used starting
from the second stage is as follows: in the second stage, the number of candidates on
each sub-bus is m′ = 2, and the number of processors on each sub-bus is n′ = 4; in
the third stage, the number of candidates on each sub-bus is m′ = 4, and the number
of processors on each sub-bus is n′ = 16; etc. So in general, in the (i + 1)th stage, the
number of candidates on each sub-bus is m′ = 22
i
, and the number of processors on
each sub-bus is n′ = 22
(i+1)








Assume that n elements are initially distributed to n processors, one element per processor.
Here is the detail of the algorithm:
Algorithm 7.3.2 Fast Maximum
Input: An array A of n distinct elements.
Output: the maximum element x stored in the first processor.
Begin
1. Each processor P2i, where 0 < i ≤ n/2, sends its element to processor P2i−1. Processor
P2i−1 sets itself to active, compare the element it received with its own, and keeps the
larger one.
2. for i = 1 to log log n pardo
109
(a) Segment the array into n/22
i
sub-buses: each processor Pj sets its reconfigurable
switches to cross if j = k22
i
, where 1 ≤ k ≤ n/22i . All other processors set their
reconfigurable switches to straight. Each sub-bus does a ordered-compression
operation to compress all elements held by the active processors to the left most
processors of the sub-bus.
(b) Each sub-bus calls Algorithm7.3.1 Basic Maximum and sets the processor who
holds the maximum element for a sub-bus to be active.
End
This algorithm is very fast, it reaches the lower bound of maximum finding problem.
But it is not optimal as the operations required is O(n log log n). By using the accelerated
cascading technique, we can obtain a fast optimal algorithm[28].
7.3.3 An Optimal-Algorithm
Definition 7.3.2 The Accelerated Cascading Strategy is an algorithm design strategy that
starts with the optimal algorithm until the size of the problem is reduced to a certain threshold
value, then shifts to the fast but non-optimal algorithm.
The accelerated cascading strategy is usually used to make a fast non-optimal algorithm
optimal. Here is how it is used to make our fast maximum finding algorithm optimal:
Algorithm 7.3.3 Optimal Maximum
Input: An array A of n distinct elements.
Output: the maximum element x stored in the first processor.
Begin
1. Perform log log log n steps of the binary tree maximum algorithm.
2. Run algorithm 7.3.2 Fast Maximum for the remaining candidates.
End
The algorithm called in the first step is optimal but not fast, the running time is
O(log log log n). The number of operations required in Step 1 is O(n), and the number of
operations required in Step 2 is O((n/ log log n) log log (n/ log log n)) = O(n). Thus the total
number of operations required is reduce to O(n). Obviously Step 1 requires O(log log log n)
computation time. As at the end of Step 1, the size of the array is reduced to n/ log log n,
Step 2 only requires O(log log (n/ log log n)) = O(log log n) time. Thus algorithm 7.3.3 Op-
timal Maximum is both very fast and optimal.
110
7.4 Selection
7.4.1 The Cole-Yap O((log log n)2) Time Parallel Selection Algo-
rithm
Cole and Yap [16] present a well-known O((log log n)2) time parallel selection algorithm for
parallel comparison model. This algorithm uses O(n) processors. The Cole-Yap algorithm
proceeds in O(log log n) stages. In each stage, it finds two approximations to the kth smallest
item e: one is guaranteed to be no larger than e, and the other is guaranteed to be no smaller
than e. Thus, candidate elements can be reduced by eliminating all elements that fall outside
the interval defined by these two approximations. The approximation calculation performed
in each stage of the selection algorithm proceeds by sampling the surviving set of elements
and recursively find the approximation for the kth element in the sample set. By smartly
sampling the remaining elements, the size of the surviving set shrinks in a doubly logarithmic
rate. Thus each call to the approximate calculation takes O(log log n) time, which brings
the total time required for selection algorithm to O((log log n)2).
Now we need to prove why the approximation strategy mentioned above correctly selects
the samples. A brief proof for lemma 7.4.1 is given in [16]. We give a more detailed proof in
the following paragraphs.
Lemma 7.4.1 For j ≤ 2m/r, the (bj/rc − s)th smallest element in Si+1 is no larger than
the jth smallest element in Si, and is at least the (bj/rcr − sr)th smallest element in Si.
Proof. Assume that the number of processors is n, and the number of elements in Si is m,




e. In stage i of the selection algorithm, the Si elements is divided
into s = m/r2 short sets, each has r2 elements. As Si+1 consists of the (rk)
th elements, for
k = 1, 2, · · ·, r, for each of the sorted short sets, the number of elements sampled into Si+1
in each of short set is r.
First we prove that after sorting those short sets, and forming the surviving set Si+1, we
have the following upper and lower approximation for the kth element in Si+1:
• upper approximation: the kth smallest element in Si+1 is at most the (kr + s(r − 1))th
smallest element in Si.
• lower approximation: the kth smallest element in Si+1 is at least the krth smallest
element in Si.
Suppose x and y are the (l− 1)rth and the lrth element in short set V , where 1 < l ≤ r. We
associate with element y those elements in V lying between x and y, plus y, and denote it as
a sorted array L(y). So each element z in Si+1 is associated with r elements L(z) in Si, and
z is the largest of the r elements in L(z). There are rs such sorted arrays. Assume the kth
element is e. If x ≤ e, then all the elements in L(x) are not greater than e. Therefore there
are at least kr elements in Si that are less than or equal to e. Thus the lower approximation
of the kth smallest element in Si+1 is the kr
th element in Si.
111
Now we calculate the upper approximation: since we have already taken care of all
L(x)’s for 0 < x ≤ e, let’s consider all L(y)’s, where e < y ≤ r. Notice that each short set
is sorted. So no matter how many samples in a short set are greater than e, at most r − 1
of all the elements associated with the samples can be smaller than e. To be specific, only
the elements in L(y) (not including y), where y is the smallest sample in a short set that is
greater than e, can be smaller than e. Thus a maximum of r − 1 elements associated with
samples greater than e in one short set can be smaller than e. Since there are s short sets,
the maximum number of elements who are associated with samples greater than e and who
themselves are smaller than e is s(r − 1). So the total maximum number of elements in Si
that is smaller than e is: kr + s(r − 1).
To prove the two claims stated in the lemma, we do the following:
For j ≥ 2m/r, let k = b j
r
c − s, the upper approximation for the rank of kth element e
in Si is
kr + s(r − 1) = (bj
r
c − s)r + s(r − 1) < (bj
r
c − s)r + sr < j
and the lower approximation for the rank of kth element e in Si is
kr = (bj
r
c − s)r = bj
r
cr − sr
Thus the lemma is proved.
The reason why sorting each short set in each stage of the approximate algorithm takes
















So we have np = ni
2, and sorting can be done in constant time using algorithm 4.2.1
Constant Time Sort.
7.4.2 An Optimal Parallel Selection Algorithm
An O(log log n) parallel selection algorithm was presented in [3], which proves the upper
bound of parallel selection problem to be O(log log n), and establishes that all comparison
tasks (selection, merging, sorting) have upper and lower bounds of the same order in both
random and deterministic models. This algorithm is designed on Valian’s PCT model, i.e.,
the parallel comparison tree model.
There are two basic ideas underlying the algorithm:
1. Use the transitivity of the relation R, obtained free of comparison cost, to achieve a
constant time approximation in each stage of the selection algorithm. R refers to the
relation obtained after each comparison.
112
2. Using a weak (A,α)− expander graph to make sure that if m ≤ n/(log n)3, the upper
and lower approximations l and r are quite accurate for most elements of V and l and
r can be determined in constant time.
Similar to Cole-Yap’s selection algorithm, this optimal selection algorithm uses a pro-
cedure called SELECT which, given V = v1, · · ·, vm, will return the kth smallest element
vk. SELECT proceeds in O(log log n) stages. On each stage, an upper approximation and a
lower approximation for the rank of each elements in the surviving set is calculated in the
following way: the n processors will perform n comparisons according to an undirected graph
G in the natural way: G has vertices v1, · · ·, vm and n of the possible (
m
2) edges. Elements
x and y are compared only if there is an edge between them. The n comparisons induce a
relation R on V × V . Let ≤G represent the partial order given by the transitive closure of
the relation R. Then the algorithm calculates the upper and lower approximations ρ+ and
ρ− for the rank of each element based on the transitive closure ≤ G of relation R where ρ+
and ρ− are defined to be:
ρ− = |j : vj ≤G vi|
and
ρ+ = (m + 1)− |j : vi ≤G vj|
Lower approximation l and upper approximation r for k is determined using the following
two inequalities:
j −D ≤ ρl− ≤ ρl+ ≤ j (7.1)
j ≤ ρr− ≤ ρr+ ≤ j + D (7.2)
where D = (20m log m)/A.
Then form a new surviving set V ′ = {u ∈ V : l ≤ u ≤ r}. It can be done in two
parallel steps: the m processors compare its estimated rank simultaneously with l and r to
decide whether the element associated with it is a survival element or not. The graph G
that the SELECT procedure uses is called a weak (A,α)-expander with m vertices and n
edges, where m ≤ n/(log n)3. The property of the weak (A,α)-expander and the transitive
closure of the n comparison results guarantee the existence of such l and r whose upper and
lower approximate ranks given by the transitive closure of R satisfy both inequality 7.2 and
inequality 7.2. By choosing A = n/(m log n), we have
D = (20m log m)/A = (20m log m)/(n/(m log n)) = (20m2 log m log n)/n ≤ (20(m log n)2)/n
Thus
|V ′| ≤ 40|V |(log n)2/(n/|V |)
Therefore, if
|V | ≤ n/(log n)1+δ
then
|V ′| ≤ 40|V |/(log n)2δ
113
At this rate of shrinking, O(log log n) recursive calls to SELECT will force the surviving
set V ′ to reach a size of at most
√
n. Now as m <
√
n, the surviving set V ′ can be sorted in
constant time. But in order to have |V | ≤ n/(log n)1+δ, a preprocessing is needed to shrink
the original set size to the desired one. Now AKS sorting algorithm comes into play.
The following property of the AKS sorting algorithm is implicit in the argument in [61]:
After running O(log log n) stages, 9
10
of the elements being sorted will be within m of k +m.
So the preprocessing consists of the following function call sequence:
1. O(log log n) steps of AKS sorting algorithm
2. O(log log n) recursive calls to SELECT
3. another O(log log n) steps of AKS sorting algorithm
After the above O(log log n) preprocessing, the size of the surviving set can be reduced
to n/(log n)3 or less, without increasing the overall time complexity. After preprocessing,
O(log log n) recursive calls to SELECT will reduce the size for surviving set to
√
n or less,
and then we find out the jth element by sorting the surviving set. Thus the overall time
complexity of this algorithm is O(log log n). Notice that only the time spent for comparison
is counted in parallel comparison model, thus the following key computation is considered
to be available free of time cost: computing the transitive closure of relation R.
Also this algorithm does not count the time for computing the expander graph. It
assumes the expander graphs used by the algorithm are known and just feed the algorithm
with the required expander graphs. This algorithm is not uniform in a sense because different
expander graphs are required for different n.
We did some work implementing this optimal parallel selection algorithm on the LARPBS
model. We found that all the operations in this algorithm can be implemented on the
LARPBS model with the same time complexity except the calculation for transitive closure.
Our best transitive closure algorithm runs in O(log n) time. If we use this transitive closure
algorithm, the overall time complexity of the selection algorithm would be O(log n log log n),
which is slower than the Cole-Yap’s selection algorithm. So we conclude that this optimal
parallel selection algorithm can be implemented on the LARPBS with same time complexity
only if there is a constant time transitive closure algorithm available for this model.
7.4.3 An Implementation of Cole-Yap’s Algorithm on an LARPBS
The basic operations needed for implementing the Cole-Yap’s O((log log n)2) selection algo-
rithm are:
1. Sort multiple sub-sequences of length O(m) in constant time using O(m2) processors.
2. Sample a set to form a subset for further processing.
3. Find out all elements that are greater or smaller than a given element.
114
4. Compress selected elements to one end of the bus.
Operation 1 can be done using algorithm 4.2.1 Constant Time Sort designed for LARPBS.
Operation 2 and 3 can be done in constant time. And operation 4 can be done by the basic op-
eration order-compression. Now we present the implementation details of the O((log log n)2)
selection algorithm. Assume we have n = 4096m processors, and the n elements in A are
initially distributed to the first m processors. Here are some data that will be used in the
algorithm:
• number of processors: n




• number of elements in each short set: r2 = d n
m
e





Algorithm 7.4.1 Find Lower Approximation(n,m,k)
Input: A sequence L of m distinct elements distributed to the first m processors in an n-
processor system, one per processor.
Output: lower approximation l for the kth element. l is stored in the first processor, which
is called the header processor.
Begin
if n ≥ m2, apply algorithm 4.2.1 Constant Time Sort to find the kth element directly,
and processor Pk sends its element to the header processor.
else
1. Calculate s and r2: The header processor performs the following calculation:




• Calculate number of elements in each short set: r2 = d n
m
e





2. Partition List L: each processor Pi who initially holds an element, calculates the
destination processor address j using j = i × r + i mod r. Then Pi sends its element
to processor Pj. After this one-to-one communication operation, processors Plp, where
1 ≤ l ≤ s, set their switches to cross to form s sub-buses.
3. Sort short sets: apply algorithm 4.2.1 Constant Time Sort on each sub-bus to sort each
short set in increasing order.
4. Identify sample elements: In each sub-bus, processor Pi sets itself to active if its rank
obtained in the previous step is jr, where 0 < j ≤ r.
115
5. Form surviving set: All processors set their switches to straight to form a single bus.
Do a order-compression operation to move all surviving elements to the beginning of
the bus.
6. Update k: if k ≤ 8m/r, then set k = 0; otherwise, set k = bk/rc − s.
7. Update m: set m = m/r
8. recursively find l: do Find Lower Approximation(n,m,k)
endif
End






+ i mod r2 =
i
n/m
× (n2/m2) + i mod (n/m) = i× r + i mod r
Finding the upper approximation is a symmetric operation, the only difference is in the
calculation of the new rank k, see details below:
Algorithm 7.4.2 Find Upper Approximation(n,m,k)
Input: An sequence L of m distinct elements. Assume the m elements are distributed to
the first m processors in the n-processor system, one per processor.
Output: upper approximation u for the kth element. u is stored in the first processor, which
is called the header processor.
Begin
if n ≥ m2, apply algorithm 4.2.1 Constant Time Sort to find the kth element directly,
and processor Pk sends its element to the header processor.
else
1. Calculate s and r2: The header processor performs the following calculation:




• Calculate number of elements in each short set: r2 = d n
m
e





2. Partition array A: each processor Pi who initially holds an element, calculates the
destination processor address j using j = i × r + i mod r. Then Pi sends its element
to processor Pj. After this one-to-one communication operation, processors Plp, where
1 ≤ l ≤ s, set their switch to cross to form s sub-buses.
3. Sort short sets: apply algorithm 4.2.1 Constant Time Sort on each sub-bus to sort each
short set in increasing order.
116
4. Identify sample elements: In each sub-bus, processor Pi sets itself to active if its rank
obtained in the previous step is jr, where 0 < j ≤ r.
5. Form sample sets: All processors set their switches to straight to form a single bus.
Do a order-compression operation to move all sampled elements to the beginning of
the bus.
6. Update k: if k ≥ m− 8m/r, then set k = m; otherwise, set k = k/r.
7. Update m: set m = m/r




Input: An array A of m distinct elements. Assume the m elements are distributed to the
first m processors in the system, one per processor.
Output: the kth element x stored in the first processor, which is called the header processor.
Begin
if n ≥ m2, apply algorithm 4.2.1 Constant Time Sort to find the kth element directly,
and processor Pk send its element to the header processor.
else
1. Calculate the lower and upper approximation l and u:
• Do Find Lower Approximate(n, m, k) to find lower approximation l.
• Do Find Upper Approximate(n,m, k) to find upper approximation u.
After the above two operations, header processor P0 has both the upper approximation
u and the lower approximation l. Then P0 broadcasts u and l to the first m processors
in the system.
2. Find the rank rl and ru for the lower and upper approximations: each Pi, where 0 ≤
i < m, compares its own element x with l it received in the previous step, and set
flag = 1 if x < l; otherwise, set flag = 0. Processor Pm−1 set its switch to cross to
form a sub-bus that contains the first m processors. Do a binary-prefix-sums operation
in this sub-bus. Processor Pm−1 then sends its prefix sums to the header processor.
This prefix sums is the rank for l. The header processor sets rl to be equal to the prefix
sums received in previous bus cycle. ru is obtained in the same way.
3. Identify surviving elements: each processor Pi who received the two approximations in
previous step, compares its element x with the two approximations, and sets itself to
active if l ≤ x ≤ u. Otherwise, set it to inactive.
117
4. Form surviving set: Do a compression operation to move all surviving elements to the
beginning of the bus.
5. Update k and m: set k = k − rl, m = ru − rl + 1.
6. Recursively find k: Apply algorithm 7.4.3 Select(n, m, k).
endif
End
Step 1 takes O(log log n) time, all other steps take constant time. The recursive proce-
dure terminates after log log n stages. So the total execution time is O((log log n)2).
Theorem 16 Cole-Yap’s selection algorithm can be implemented on an LARPBS in O((log log n)2)
time using n processors.
7.5 Application to Parallel Multi-selection
An immediate application of algorithm 7.4.3 Select(n,m,k) is the parallel multi-selection prob-
lem. Multi-selection is a fundamental algorithmic problem arising frequently in databases.
It is defined as follows: Given an unordered set S of n records and a sequence of m integers
1 ≤ q1 < q2 < . . . < qm ≤ n, we are interested in answering queries of the type “find the qthi
smallest elements in S”. The problem is to answer all these queries as fast as possible. In
database applications, this problem translates as answering a corresponding set of queries
on a record set S. A sequential solution runs in O(min{mn, n log n}) time. S. Olariu and
Z. Wen proposed an EREW-PRAM solution running in O(n/p log i log 2m) time using p
processors with 1 ≤ p ≤ dn/ log n log∗ ne [50]. By applying algorithm 7.4.3 Select(n,m,k) k




In this chapter, we would like to summarize the contributions of this dissertation and point
out several topics that we would like to work on in the future.
8.1 Contributions
In this dissertation, we studied both optical interconnection models and efficient algorithms
for these models. We focus on a simple but powerful optical interconnection network model
– A Linear Array with Reconfigurable Pipelined Bus System (LARPBS). We extended this
one-dimensional optical bus model to two and three dimension to improve the scalability. We
also provided several basic operations for two-dimensional LARPBS. These basic operations
are used as building blocks for designing a few efficient algorithms given in this dissertation.
Then we studied parallel comparison problems, including sorting, merging and selection,
the lower bounds for these problems, and the best parallel algorithms for these problems
that were designed under PRAM and parallel comparison models. Sorting is a well-known
problem that has been studied extensively in the literature because it has so many important
applications and it has a theoretical importance. We implement an optimal CREW PRAM
merge sort algorithm on an LARPBS. To the best of our knowledge, our O(log n)-time
sorting algorithm is the first optimal sorting algorithm implemented on an LARPBS. With
this optimal sorting algorithm for one-dimensional LARPBS at disposal, we further extended
our research on the sorting problem for higher dimensional LARPBS’s.
We obtained our O(log n)-time 7-phase Columnsort algorithm by applying our optimal
sorting algorithm on one-dimensional LARPBS to sorting each column. We further general-
ized our two-dimensional sorting to a larger range of two-dimensional LARPBS’s, i.e. n× n
shape two-dimensional LARPBS’s. The generalization was done based on the following two
ideas:
1. Generalize the idea of basic Columnsort: we can simulate an n
√
n × √n imaginary
matrix on the n × n 2D LARPBS, and we can also simulate an n2 × n imaginary
matrix on the n× n× n 3D LARPBS. Since the assumption r ≥ s2 is satisfied in both
119
cases, we can apply the 7-phase Columnsort algorithm on the two imaginary matrices.
We called the sorting algorithms obtained this way generalized Columnsort algorithms.
2. Multi-way merge sorted sub-matrices: apply a multi-way merge method to
√
n n×√n
shape sub-matrices to obtain a sorted n×n shape matrix. We observed that the most
time consuming operations needed in the multi-way merge procedure are sorting an n-
processor one-dimensional LARPBS and sorting an n×2√n 2D LARPBS. We already
have an optimal sorting algorithm for one-dimensional LARPBS. If we can obtain an
optimal two-way merge sort algorithm, we can get an optimal algorithm for n×n shape
matrix.
The algorithms we designed and implemented for two-dimensional sorting problem are
listed below:
1. An optimal basic Columnsort algorithm on an n×√n two-dimensional LARPBS that
sorts n
√
n elements in O(log n) time.
2. An optimal two-way merge sort algorithm on an n × 2√n two-dimensional LARPBS
that sorts 2n
√
n in O(log n) time.
3. An optimal two-way merge sort algorithm on an n × √n two-dimensional LARPBS
that sorts 2n
√
n in O(log n) time.
4. An optimal multi-way merge sorting algorithm on an n×n two-dimensional LARPBS
that sorts n2 elements in O(log n) time.
5. An optimal generalized column sort algorithm on an n× n two-dimensional LARPBS
that sorts n2 elements in O(log n) time.
6. An optimal generalized column sort algorithm on an n × n × n three-dimensional
LARPBS that sorts n3 elements in O(log n) time.
7. An optimal 5-phase sorting algorithm on an n×n×n three-dimensional LARPBS that
sorts n3 elements in O(log n) time.
We also did some research on selection problem and its application to multi-selection
– a fundamental algorithmic problem arising frequently in databases. The easiest cases
are extreme selection, which is finding the minimum or maximum. The hardest case is
finding the median. We studied the upper bound and lower bound of selection problem. To
find an optimal maximum algorithm, we studied the accelerated cascading technology. We
provided a solution for simulating doubly logarithmic tree on an LARPBS, and implemented
an optimal algorithm for finding the maximum of n elements on an n-processor LARPBS.
We implemented the Cole-Yap’s O((log log n)2)-time selection algorithm on an LARPBS.
Although it is a little bit slower than Han and Pan’s selection algorithm [24], it is simpler.
Following is a list of algorithms we implemented for selection problems.
120
1. A constant time algorithm that finds the maximum of n elements on an n2-processor
LARPBS.
2. A non-optimal algorithm that finds the maximum of n elements on an n-processor
LARPBS in O(log log n) time.
3. An optimal algorithm that finds the maximum of n elements on an n-processor LARPBS
in O(log log n) time.
4. An O((log log n)2) time parallel selection algorithm on an n-processor one-dimensional
LARPBS.
5. An O(k(log log n)2) time parallel multi-selection algorithm on an n-processor one-
dimensional LARPBS.
Apart from the comparison problems, we also studied Boolean matrix multiplication
problem and its applications to graph problems. We presented a new constant time Boolean
matrix algorithm on an LARPBS. Our algorithm achieves much better speed-up compared
with the four Russians’ algorithm proposed by Agerwala and Lint in [4] and a hypercube
implementation provided by C. Gregory Plaxton [55]. We noticed that a parallel imple-
mentation of the Four Russians’ algorithm on the LARPBS with constant running time is
proposed by Keqin Li in [31]. Compared with Li’s constant-time algorithm, our algorithm
is not based on any existing sequential or parallel algorithm and it is much simpler and
uses fewer processors. Our algorithm uses only O(n2) processors, while Li’s algorithm uses
O(n3/ log n) processors.
We then applied the constant time Boolean matrix multiplication algorithm to two
graph problems, and obtained an efficient transitive closure algorithm and two connected
component algorithms. We noticed that a constant time algorithm for the transitive closure
was proposed in [77]. But their constant time algorithm is obtained at a cost of much more
processors. Two constant time algorithms were given in [77]. One is designed on a three-
dimensional n × n × n processor array with a reconfigurable bus system and the other is
designed on a two-dimensional n2 × n2 processor array with a reconfigurable bus system,
where n is the number of vertices in the graph.
Following is a list of results we have obtained in this area.
1. A constant time algorithm on an n2-processor one-dimensional LARPBS that multiplies
two n× n Boolean matrices.
2. A constant time algorithm on an n× n two-dimensional LARPBS that multiplies two
n× n Boolean matrices.
3. An O(log n)-time algorithm on an n2-processor one-dimensional LARPBS that calcu-
lates the transitive closure for a undirected graph G with n vertices.
4. An O(log n)-time algorithm on an n2-processor one-dimensional LARPBS that calcu-
lates the connected components for a undirected graph G with n vertices.
121
5. An O(log n)-time algorithm on an n2-processor one-dimensional LARPBS that calcu-
lates the strongly connected components for a directed graph G with n vertices.
8.2 Future Works
We identified that the following topics are interesting and can be further studied in the
future:
1. Extend the generalized Columnsort algorithms to d-dimension: we expect that the
same idea of generalized Columnsort algorithms can be applied to a d-dimensional
LARPBS. We define a d-dimensional LARPBS to be an
d︷ ︸︸ ︷
n× n× · · · × n LARPBS.
Again we assume that communications on different dimension cannot be done in the
same bus cycle. This assumption helps us simplify the architecture of the d-diemnsional
LARPBS. With this assumption, we can use only one set of switches for all d dimen-
sions. Only the buses that are being used in a bus cycle will be connected to the
switches.
2. Extend the multi-way merge algorithm to d-dimension: we expect that the multi-way
merge algorithm can be extended to d-dimension.
3. Extend the 5-phase sorting algorithm to d-dimension: we expect that the 5-phase
merge sort algorithm can be extended to d-dimension with a combination of wither
multi-way merge algorithm or generalized Columnsort algorithm.
4. Improve the selection algorithm: we expect that our O((log log n)2)-time selection
algorithm on an n-processor LARPBS can be further improved. We have two ideas for
improving this algorithm: First, we can use a fast sorting algorithm to help speedup the
selection process. Second, we can use the current method for some iteration to reduce
the size of the surviving set. At some point, we may have enough processors to perform
a constant time transitive closure algorithm provided in [77]. Then we can switch to
the optimal selection algorithm given in [3]. The challenges for this improvement will
be:
(a) This constant time transitive closure algorithm provided in [3] is designed on a
different model. So it needs to be converted to an LARPBS first.
(b) Even if we have a constant time transitive closure algorithm, there are still a lot of
details needs to be worked out for implementing the O(log log n) optimal selection
algorithm on an LARPBS.
The results obtained in this dissertation show the strong communication and computa-
tion power of optical interconnection networks. The special properties of an optical inter-
connection network give us a chance to design more efficient algorithms for a lot of problems
122
in a new and different way, and to implement efficient parallel algorithms that were designed
under theoretical parallel models, such as PRAM, parallel comparison model, etc.
123
Bibliography
[1] M. Ajtai, J. Komlos, and E. Szemeredi, “An O(n log n) Sorting Network,”Proc. 15th
Ann. ACM Symp. On Theory of Computing, pp.1-9, August 1983.
[2] M. Ajtai, J. Komlos, and E. Szemeredi, “Sorting in C log N Parallel
Steps,”Combinatorica, vol 3. pp.1-19, 1983.
[3] Miklos Ajtai, Janos Komlos, W. L. Steiger, and Endre Szemeredi, “Optimal Parallel Se-
lection Has Complexity O(log log n),”Journal of Computer And System Sciences, vol.38,
pp.125-133, 1989.
[4] T. Agerwala and B. Lint, “Communication in Parallel Algorithm for Boolean Matrix
Multiplication,”Proceedings of the 1978 IEEE International Conference on Parallel Pro-
cessing, pp.146-153, 1978.
[5] P. Beame and J. Hastad “Optimal Bounds for Decision Problems on the CRCW
PRAM,”Journal of the ACM, vol.36, pp. 643-670, 1989.
[6] A. F. Benner, H. F. Jordan, and V. P. Heuring, “Digital Optical Computing with Opti-
cally Switched Directional Couplers,”Optical Engineering, 30, 12, pp.1936-1941 (1991).
[7] S. H. Bolhari, “Finding Maximum on an Array Processor with a Global Bus,”IEEE
Transactions on Computers, vol.32, pp.133-139, 1984.
[8] Y. Ben-Asher, D. Peleg, R. Ramaswami, A. Shuster, “The Power of Reconfigura-
tion,”Journal of Parallel and distributed Computing, 13, 1991, pp. 139-153.
[9] A. G. Bourgeois and J. L. Trahan “Relating Two-Dimensional Reconfigurable Meshes
with Optically Pipelined Buses,”International Journal of Foundations of Computer Sci-
ence, vol.11, no.4, pp. 553-571, 2000.
[10] K. L. Chung, “Generalized Mesh-Connected Computers With Multiple
Buses,”Proceedings of International Conference on Parallel and Distributed sys-
tems, pp.622-626, December 1993.
[11] Richard Cole, “Parallel Merge Sort,”SIAM J. Computing, vol.17, No. 4, pp.1431-1442,
August 1988.
124
[12] R.J. Cole, “An Optimally Efficient Selection Algorithm,”Information Processing Letters
vol.26, pp.295-299, 1987/1988.
[13] S. Chaudhuri, T. Hagerup, and R. Raman, Computer Science, Springer-Verlag, pp.352-
361, 1993.
[14] D. Chiarulli, R. Melhem, and S. Levitan, “Using Coincident Optical Pulses for Parallel
Memory Addressing,”IEEE Computer, vol.30, pp.48-57, 1987.
[15] A. Chandra, L. Stockmeyer, and U. Vishkin. “Constant Depth Reducibility,”SIAM
Journal on Computing, 13, pp.423-439, 1984.
[16] Richard Cole and Chee K. Yap, “A Parallel Median Algorithm,”Information Processing
Letters, vol.20, pp.137-139, 1985.
[17] E. Dekel, D. Nassimi, and S. Sahni, “Parallel Matrix and Graph Algorithms,”SIAM
Journal on Computing, 10, pp. 657-673, 1981.
[18] J. A. Fernandez-Zepeda, R. Vaidyanathan, and J. L. Trahan “Using Bus Linearization to
Scale the Reconfigurable Mesh,”Journal on Parallel and Distributed Computing, vol.62,
no.4, pp. 495-516, 2002.
[19] Paul E. Green, IBM T. J. Watson Research Center, “The Future of Fiber Optic Com-
puter Networks,”IEEE Computer, September 1991, pp. 78-87.
[20] Z. Guo, “Sorting on Array Processors with Pipelined Buses,”Proceedings 1992 Interna-
tional Conference on Parallel Processing, 1992, III, pp. 289-292.
[21] Z. Guo, “Optically Interconnected Processor Arrays With Switching Capabil-
ity,”Journal of Parallel and Distributed Computing 23, pp.314-329, 1994.
[22] Z. Guo, R. Melhem, R. Hall, D. Chiarulli, and S. Levitan, “Array Processors with
Pipelined Optical Buses,”Journal of Parallel and Distributed Computing, vol.12, no. 3,
pp. 269-282, 1991.
[23] D, H. Hartman, “Use of Guided Wave Optics for Board Level and Mainframe Level In-
terconnects,”Proceedings of the 41st Electronic Components and Technology Conference,
June 1991, pp. 463-474.
[24] Yijie Han, Yi Pan, “Sublogarithmic Deterministic Selection on Arrays with a Reconfig-
urable Optical Bus,”IEEE Transaction on Computers vol.51, No.6, June 2002.
[25] S. J. Horng, “Prefix Computation and Some Related Applications on Mesh Connected
Computers with Hyperbus Broadcasting,”Proceedings of International Conference on
Computing and Information, pp.366-388, July 1995.
125
[26] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability.
New York, NY: McGraw-Hill, 1993.
[27] Min He, S.Q.Zheng, “An Optimal Sorting Algorithm on a Linear Array with Reconfig-
urable Pipelined Bus System,”Proceedings of the ISCA 15th International Conference
on Parallel and Distributed Computing Systems, 2002.
[28] Joseph JaJa, An Introduction to Parallel Algorithms, Addison-Wesley Publishing com-
pany, 1992.
[29] F.T. Leighton, “Tight Bounds on the Complexity of Parallel Sorting,”IEEE trans. Com-
puters, vol.34, pp.344-354, 1985.
[30] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees,
Hypercubes. San Mateo, Calif.:Morgan Kaufmann, 1992.
[31] KeQin Li, “Constant Time Boolean Matrix Multiplication on a Linear Array with a
Reconfigurable Pipelined Bus System,”The Journal of Supercomputing, vol.11, pp.391-
403, 1997.
[32] A. Louri, “Three-Dimensional Optical Architecture and Data-Parallel Algorithms for
Massively Parallel Computing,”IEEE Micro, pp.24-81,1991.
[33] S. Levitan, D. Chiarulli, and R. Melhem, “Coincident Pulse Techniques For Multipro-
cessor Interconnection Structures,”Applied Optics, 29(14), pp.2024-2039,1990.
[34] Yueming Li, Yi Pan and S.Q.Zheng, “Pipelined Time-Division Multiplexing Optical Bus
with Conditional Delays,”Optical Engineering, vol.36, no. 9, pp.2417-2424, September
1997.
[35] K. Q. Li, Y. Pan, S. Q. Zheng, “Fast and Processor Efficient Parallel Matrix Multiplica-
tion Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System,”IEEE
transactions on Parallel and Distributed systems, Vol.9, No.8, pp.705-720, August 1998.
[36] K.Li, Y.Pan, and S.Q. Zheng, eds., Parallel Computing Using Optical Interconnections,
Kluwer Academic Publishers, Boston, Massachusetts, 300 pp., 1998.
[37] K. Li, Y. Pan, S.Q. Zheng, “Fast and Efficient Parallel Matrix Computations on a Lin-
ear Array with a Reconfigurable Pipelined Bus System,”High Performance Computing
Systems and Applications, J. Schaeffer ed., pp. 363-380, Kluwer Academic Publishers,
Boston, Massachusetts, 1998.
[38] K. Li, Y. Pan, and S.Q. Zheng, “Fast Matrix Multiplication and Related Operations Us-
ing Reconfigurable Optical Buses,”Parallel Computing Using Optical Interconnections,
pp. 249-273, Kluwer Academic Publishers, 1998.
126
[39] Keqin Li, Yi Pan, Si-Qing Zheng, “Efficient Deterministic and Probabilistic Simulations
of PRAMs on Linear Arrays with Reconfigurable Pipelined Bus Systems,”Journal of
Supercomputing, pp. 163-181, 2000.
[40] Y. Li, J. Tao and S.Q.Zheng, “A Symmetric Processor Array With Synchronous Optical
Buses And Switches,”Parallel Processing Letters, Vol. 8, No. 3, 1998.
[41] Y. Li, and S. Q. Zheng, “Parallel Selection on a Pipelined TDM Optical
Bus,”Proceedings of the 9th International Conference on Parallel and Distributed Com-
puting Systems, pp.69-73, 1996.
[42] Y. Li, and S. Q. Zheng, “Processor Arrays with Asynchronous TDM Optical Buses,”it
Proceedings of SPIE Photonics West ’97, pp. 291-302, 1997.
[43] Yueming Li, S. Q. Zheng, Xiangyang Yang, “Versatile Processor Arrays Based on Seg-
mented Optical Buses,”Proceedings of SPIE Photonics West ’97, pp. 280-290, 1997.
[44] R. Melhem, D. Chiarulli, and S. Levitan, “Space Multiplexing of Waveguides in Op-
tically Interconnected Multiprocessor Systems,”The Computer Journal, vol. 32, no. 4,
pp. 362-369, 1989.
[45] M. Middendorf and H. ElGindy, “Matrix Multiplication on Processors with Optical
Buses”Informatica, in press.
[46] R. Miller, V. K. Prasanna-Kumar, D. I. Reisis, and Q. F. Stout, “Meshes with Recon-
figurable Buses,”Proc. MIT Conf. on Advanced Research in VLSI, pp.163-178, 1988.
[47] R. Miller, V. K. Prasanna-Kumar, D. I. Reisis, and Q. F. Stout, “Parallel Computations
on Reconfigurable Meshes,”IEEE Transactions on Computers, vol.42, pp.678-692, 1993.
[48] R. A. Nordin, A. F. J. Levi, R. N. Nottenburg, J. O’Gorman, T. Tanbun-Ek, and R.
A. Logan, “A Systems Perspective on Digital Interconnection Technology,”J. Lightwave
Tech., Vol. 10, No.6, pp.811-827, June 1992.
[49] Stephan Olariu, Cristina Pinotti, and Si Qing Zheng, “An Optimal Hardware-Algorithm
for Sorting Using a Fixed-Size Parallel Sorting Device,”IEEE Transaction on Comput-
ers, Vol.49, No.12, pp.1310-1324, December, 2000.
[50] S. Olariu and Z. Wen, “An Efficient Parallel Algorithm for Multiselection,”Parallel
Computing 17, 1991, pp. 689-693.
[51] Y. Pan, “Order Statistics on Optically Interconnected Multiprocessor Systems, Opti.
Laser Tech., vol.26, pp.281-287, 1994.
[52] Y. Pan, “Basic Data Movement Operations on the LARPBS Model, ”in Parallel Com-
puting using Optical Interconnections, ed. By S,Q,Zheng, Kluwer Academic Publishers,
Boston, USA, 1998.
127
[53] M. S. Paterson, “Improved Sorting Networks with O(log N) Depth,”Algorithmica, vol.5,
pp.75-92, 1990.
[54] Sandy D Pavel, Computation and Communication Aspects of Arrays with Optical
Pipelined Buses, A thesis submitted to the Department of Computing and Information
Science in conformity with the requirements for the degree of Doctor of Philosophy,
Queen’s University, 1996
[55] C. Gregory Plaxton, Efficient Computation on Sparse Interconnection Networks, Thesis,
Department of Computer Science, Stanford University, September, 1989.
[56] S. Pavel and S. G. Akl, “Matrix Operations Using Arrays with Reconfigurable Optical
Buses,”Parallel Algorithms and Applications, vol. 11 , pp. 223-242, 1996.
[57] S. Pavel and S. G. Akl, “Integer Sorting and Routing in Arrays with Reconfigurable
Optical Buses,”Proc. Intl. Conf. Par. Processing, pp.III-90-III-94, 1996.
[58] S. Pavel and S. G. Akl, “On The Power of Arrays with Optical Pipelined
Buses,”Proceedings of 1996 International Conference on Parallel and Distributed Pro-
cessing Techniques and Applications, pp.1443-1454, Sunnyvale, California, August 9-11,
1996.
[59] Y. Pan and M. Hamdi, “Quicksort on a Linear Array With a Reconfigurable Pipelined
Bus System,”Proc. of IEEE International Symposium on Parallel Architectures, Algo-
rithms, and Networks, pp. 313-319, June 12-14, 1996.
[60] Yi Pan, Mounir Hamdi, and Keqin Li, “Efficient And Scalable Quicksort on a Linear Ar-
ray with a Reconfigurable Pipelined Bus System”Future Generation Computer Systems,
13 (1997/98) pp.501-513.
[61] Y. Pan and K. Li, “Linear Array with Reconfigurable Pipelined Bus System - Concepts
and Applications,”Proceedings of International Conference on Parallel and Distributed
Processing Techniques and Applications, vol.III, pp.1431-1442, August 1996.
[62] Y. Pan, K. Q. Li, S. Q. Zheng, “Fast Nearest Neighbor Algorithms on a Linear Array
with a Reconfigurable Pipelined Bus System,”Parallel Algorithms and Applications,
Vol.13, pp.1-25, 1998.
[63] M. C. Pinotti and S. Q. Zheng, “Efficient Parallel Computation on a Processor Array
with Pipelined TDM Optical Buses,”Proceedings of 12th ISCA International Conference
on Parallel and Distributed Computing Systems, pp.114-120, 1999.
[64] C. Qiao, “On Designing Communication-Intensive Algorithms For A Spanning Optical
Bus-Based Array,”Parallel Processing Letters, 5(3), pp.499-511, 1995.
[65] C. Qiao, R. Melhem, “Time-Division Optical Communications in Multiprocessor Ar-
rays,”IEEE Trans. On Computers, 42(5), pp.577-590, 1993.
128
[66] C. Qiao, R. Melhem, D. Chiarulli, and S. Levitan, “Optical Multicasting in Linear
Arrays,”International Journal of Optical Computing, vol.2, no. 1, pp. 31-48, 1991.
[67] M. Ryckebusch, “A High Performance Electrical and Optical Interconnection Technol-
ogy,”Proceedings of the 41st Electronic Components and Technology Conference, June
1990, Vol.2, pp.974-979.
[68] S. Rajasekaran and S. Sahni, “Sorting, Selection and Routing on the Arrays with Recon-
figurable Optical Buses,”IEEE Transaction on Parallel and Distributed Systems vol.8,
no.11, pp.1123-1132, November, 1997.
[69] S. Rajasekaran and S. Shni, “Sorting, Selection and Routing on The Arrays with Recon-
figurable Optical Buses,”IEEE transactions on Parallel and Distributed Systems, vol. 8,
no. 11, pp. 1123-1132, Nov. 1997.
[70] S. Sahni, “Models and Algorithms for Optical and Optoelectronic Parallel Computers”,
Proceedings of the Fourth IEEE International Symposium on Parallel Architectures,
Algorithms, and networks, pp. 2-7, Fremantle, Australia, June 23-25, 1999.
[71] H. Shen, “Improved Universal k-Selections in Hypercubes,”Parallel Computing vol.18,
no.2, pp.177-184, 1992.
[72] Jerry L. Trahan, Anu G. Bourgeois, Yi Pan, and Ramachandran Vaidyanathan, “Op-
timally Scaling Permutation Routing on Reconfigurable Linear Arrays with Optical
Buses,”, Journal of Parallel and Distributed Computing, vol.60, pp.1125-1136, 2000.
[73] J. L. Trahan, A. G. Bourgeois and R. Vaidyanathan, “Tighter and Broader Complexity
Results for Reconfigurable Models,”Parallel Processing Letters, special issue on Bus-
based Architectures, vol.8, no.3, pp. 271-282, 1998.
[74] J. L. Trahan, Y. Pan, R. Vaidyanathan and A. Bourgeois, “Scalable Basic Algorithms
on a Linear Array with a Reconfigurable Pipelined Bus System, ”Proc. 10th ISCA
International Conference on Parallel and Distributed Computing Systems, pp. 564-569,
New Orleans, LA, October 1997.
[75] J. L. Trahan, R. Vaidyanathan and C. P. Subbaraman, “Constant Time Graph Algo-
rithms on the Reconfigurable Multiple Bus Machine,”Journal of Parallel and Distributed
Computing, vol.46, pp.1-14, 1997.
[76] L.Valiant, “Parallelism in Comparison Problems,”SIAM J. Comput. vol.4, pp.348-355,
1975.
[77] Biing-Feng Wang, and Gen-Huey Chen, “Constant Time Algorithms for the Transitive
Closure and Some Related Graph Problems on Processor Arrays with Reconfigurable
Bus Systems,”IEEE Transactions on Parallel and Distributed Systems vol.1, No. 4,
pp.500-507, October 1990.
129
[78] S. Q. Zheng and Y. Li, “Pipelined Asynchronous Time-division Multiplexing Optical
Bus,”Optical Engineering vol.36, pp.3392-3400, 1997.
130
Vita
Min He got her Bachelor’s and Master’s degrees at Hunan University, Changsha, Hunan,
P.R.China in year 1988 and 1991, respectively. She worked in Zhuhai Telecommunication
Office for several years before she came to US and got her Master’s degree in Systems
Science at Louisiana State University in year 1997. She started working for Qualcomm,
Inc in year 1998. And she is now a senior engineer in Qualcomm Consumer Product,
Inc.
131
