Algorithms for the parallel alternating direction access machine  by Chlebus, Bogdan S. et al.
Theoretical Computer Science 245 (2000) 151{173
www.elsevier.com/locate/tcs
Algorithms for the parallel alternating direction
access machine(
Bogdan S. Chlebusa, Artur Czumajb ;1, Leszek Gasieniecc,
Miros law Kowaluka, Wojciech Plandowskia
aInstytut Informatyki, Uniwersytet Warszawski, Banacha 2, 02-097 Warszawa, Poland
bDepartment of Computer and Information Science, New Jersey Institute of Technology,
University Heights, Newark, NJ 07102-1982, USA
cDepartment of Computer Science, The University of Liverpool, Chadwick Building,
Peach Street, Liverpool L69 7ZF, UK
Abstract
We describe a number of algorithms for the model for parallel computation called parallel
alternating-direction access machine (PADAM). This model has the memory modules of the global
memory arranged as a two-dimensional array, with each processor assigned to a row and a
column, the processors can switch synchronously between row and column access modes. We
study the issues of inter-processor communication and of ecient use of memory on the PADAM,
and develop: an optimal routing scheme among memory modules, algorithms enhancing random
access of processors to all memory blocks, and general simulations of shared memory machines.
Finally, we present optimal algorithms for the problems of selection, merging, and sorting.
c© 2000 Elsevier Science B.V. All rights reserved.
Keywords: Algorithm; Parallel computation; Routing; Simulation
1. Introduction
One of fundamental goals in parallel processing is to develop eective models
for parallel computation. The purpose they serve is to provide suitable abstraction
which allows us to concentrate on the high-level structure of design while developing
(Work partially supported by EC Cooperative Action IC-1000 (project ALTEC: Algorithms for Future
Technologies) and a research grant from Matsushita Electric Industrial Company Ltd.
1 Work partially supported by DFG-Graduiertenkolleg \Parallele Rechnernetzwerke in der Produktion-
stechnik", ME 872=4-1, by EU ESPRIT Long Term Research Project 20244 (ALCOM-IT), and by DFG
Leibniz Grant Me872=6-1.
E-mail addresses: chlebus@mimuw.edu.pl (B.S. Chlebus), Leszek@csc.liv.ac.uk (L. Gasieniec), kowaluk
@mimuw.edu.pl (M. Kowaluk), wojtekpl@mimuw.edu.pl (W. Plandowski).
0304-3975/00/$ - see front matter c© 2000 Elsevier Science B.V. All rights reserved.
PII: S0304 -3975(99)00280 -7
152 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
parallel algorithms. In this paper, we investigate the model of parallel computation
called parallel alternating-direction access machine (PADAM). PADAM is an abstraction
of the architecture of the machine ADENART of Matsushita (see, e.g., [11, 12, 21] and
[30, Chapter 3.4.11]) and the prototype USC=OMP developed at the University of South-
ern California [8]. This model has been previously considered in [9, 25] (cf. also [6,
Chapter 9.2.3]) under the name of orthogonal multiprocessor (OMP). The key feature of
ADENART, OMP and their abstraction PADAM is the structured organization of the global
memory: the memory modules are arranged as a two-dimensional array and each pro-
cessor has the assigned row and column that it is able to access. The processors switch
simultaneously between the column and row access modes and then access the assigned
groups of memory modules without conicts.
The goal of the designers of ADENART and OMP was to develop a special-purpose paral-
lel computer oriented mainly towards scientic applications, able to solve quickly prob-
lems in matrix algebra, PDEs, image processing, computer vision, etc. For such pur-
poses it is crucial to be able to perform eciently complex operations on large matrices.
The design was also motivated by the assumption that if each processor has an assigned
set of memory modules disjoint from the memory of other processors, then a fast access
to them can be implemented with modest hardware requirements (there are two such as-
signments for the PADAM, and the processors may switch between them). The inventors
of OMP provided in [9] (see also [25]) many arguments why the orthogonal multiproces-
sor is an attractive architecture for parallel scientic computations. They also presented
successful simulation performance results of the behavior of the prototype machine
USC=OMP in [7]. The Matsushita ADENART was tested on numerical and computer-graphics
applications in [11]. The results of further successful tests of the NAS Parallel Bench-
mark were presented in [22] (see also [24] for comparison of these results with other
existing supercomputers). This provides ample evidence that the architecture can be
successfully applied in the area of scientic and engineering applications.
In this research we develop algorithmic techniques that could be used to enhance
PADAM-like architectures to make them viable general-purpose computers. The main
diculty in designing algorithms on the PADAM is caused by the restricted memory-
access pattern. Therefore, our study is mainly concentrated on the issues of ecient
use of the available memory. The rst problem we consider is that of routing among
memory units, which is an important generic problem of communication in a parallel
setting. Then we design schemes to streamline access of processors to the memory units.
Next, we develop general shared memory simulations. Finally, we present algorithms
for selected problems (selection, merging, sorting) as case studies.
1.1. Models of computation
Parallel alternating-direction access machine consists of a set of n processors and
a global memory. The global memory is organized as an n n array of memory units
(MUs). In one step a processor can either execute an instruction of its local data or
access a MU for reading or writing. Let P(i) denote the ith processor and M [i; j] de-
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 153
note the MU in the ith row and the jth column. Processor P(i) can access the MUs in
the ith row and in the ith column. The system is always in one of two states: either in
the row-access mode or in the column-access mode. While in the row-access mode, all
the processors may access their rows of MUs, that is, processor P(i) can read or write
to MU M [i; k], for any k; 16k6n. Similarly, if the computer is in the column-access
mode, processor P(i) can access M [k; i], for any k; 16k6n. Switching modes is pro-
grammable, that is, there is an instruction available, say switch, to change between
the row and column access modes. All the processors execute the same program. Com-
putations are globally synchronized; in particular, if a loop is executed then the next
instruction will be started only after all the processors have already exited the loop.
The same holds for the switch instruction, that is, the processors which are ready to
execute it remain idle waiting until the others, still performing their computations, are
ready to switch the access mode (see Fig. 1).
Notice that with this restricted access pattern, processors can still communicate di-
rectly in constant time. Namely, when processor P(i) needs to send a message to
processor P(k), it rst writes its value into M [i; k], in the row-access mode, then P(k)
waits till the change of mode to the column-access is done, and nally P(k) reads
M [i; k] (this takes time O(1) if there is a constant delay between switching the modes).
We make no general assumptions on the size of a MU. The size of memory required
by the presented algorithms is proportional to the size needed to store input and output.
For instance, the 1{1 routing algorithm can be implemented on a PADAM with each MU
of size O(1), but in the case of h{h routing the MUs need to have the capacity to
accommodate h packets each, and nothing more. Also in the case of the randomized
PRAM simulation, MUs should be large enough to store up to O(t p) words after t steps
of the simulation of a p-processor PRAM.
A distributed memory machine (DMM) consists of a number of synchronized pro-
cessors and memory modules (see [14, 19, 29]). Each module can be accessed by any
processor in one step, with some semantics to resolve conicts for access. The DMM
has been studied under various names in the literature, usually the authors assumed a
specic way to resolve conicts for access to modules; examples are direct connection
machine in [14] and SPRAM in [29].
A parallel random access machine (PRAM) consists of a number of synchronized
processors that share a global memory, and every basic CPU operation, including ac-
cesses to the shared memory, is executed in a single time step (cf. [10, 13]). So the
PRAM is like a DMM in which each module has the capacity of one word. The PRAM
model has proven to be particularly appealing and popular among parallel models (see
[10]). The reason is that its key feature, the random access to memory, facilitates the
design and analysis of algorithms. However, this very property of the PRAM model is
unrealistic, because the cost of memory access can hardly be neglected (cf. [1]).
The DMM model is considered more realistic than the PRAM one, as it captures the
phenomena occurring if the global memory is partitioned into modules. Simulating a
PRAM on a DMM has been a subject of extensive study (see [19, 20, 29]). Notwithstanding
its similarity to the DMM and PRAM, the PADAM architecture is more restricted than that of
154 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
DMM, because each MU can be accessed directly by at most two processors. We show
that the PADAM is strictly weaker than both the DMM and PRAM (see Theorems 5 and 9).
The PADAM architecture is related also to that of a mesh (cf. [15]) and a mesh with
buses (called also mesh with broadcasting) (cf. [23, 27, 28]). All these architectures
have the underlying n  n array. The mesh and the mesh with buses consist of n
times as many processors as the PADAM. Another dierence between the mesh and
the PADAM is that each processor of the mesh can communicate only with its four
neighbors in the grid, whereas any pair of processors of the PADAM can communicate
directly in constant time. The dierence between the mesh with buses and the PADAM
is that the former consists of n2 processors that can use buses for fast communication
in rows and columns, whereas the n-processor PADAM consists of an n  n array of
memory units and n processors that have a random access to their assigned rows and
columns. Clearly, the mesh with buses is a stronger model than PADAM, with respect
to communication capabilities.
1.2. Overview of results
There are four groups of results: (1) routing is studied as an important generic
problem; (2) a direct access to the MUs is implemented; (3) general shared mem-
ory simulations are provided; (4) algorithms for selection, merging and sorting are
developed, as case studies.
1.2.1. Routing
The problem of routing is specied as follows: there are a number of packets residing
in the memory units and the task is to relocate them to other memory units that are
their destinations. In the case of xed-connection networks, where nodes store packets
which are to be redistributed, the problem of routing is fundamental because this is
how the communication between processors can be performed (see [15, 16]). For a
machine with a global memory and no restrictions on the access of processors to
memory modules (this is the case of the PRAM), routing can be performed trivially by
a sequence of read{write operations. The PADAM processors cannot fetch the variable
values or write arbitrarily, and if they need to learn values stored in the regions of
memory which they cannot reach directly, then they resort to a routing procedure.
This makes routing important for PADAM. If O(h) packets are in each MU and O(h)
packets are addressed to any MU then this is an instance of h{h routing. We present
an algorithm solving any h{h routing problem on the PADAM with n processors in time
O(h  n) with queues of size O(h). The algorithm is deterministic and asymptotically
optimal. Recently, other routing algorithms were developed for PADAM, see [4], both
deterministic and randomized, having smaller constants in their performance bounds.
Even though PADAM resembles meshes and meshes with buses, for which routing was
studied extensively (cf. [17, 23, 26{28] and [16, Section 1:7]), there is no apparent and
general way to adapt techniques from these models. In the n  n mesh, there are n2
processors that can move n2 packets in one step, but only to the adjacent nodes. This
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 155
also holds in the n  n mesh with buses, and additionally one item can be broadcast
in unit time to all the nodes in its row or column. It is not clear how these computing
capabilities could be translated to the PADAM, where there are n processors only, each
can handle one packet at a time to transfer it to a possibly distant node (as opposed
to the mesh) but only to one at a time (as opposed to the mesh with buses).
The high-level description of the routing algorithm described in this paper is as
follows. The array of MUs is partitioned into
p
n horizontal and vertical strips, an
intersection of a vertical and horizontal strip is a block. The algorithm proceeds in
three phases: (1) routing to the target vertical strips; (2) routing to target blocks;
(3) routing within blocks.
1.2.2. Enhancing access to memory
Next, we study ways of implementing a programmable access of processors to the
whole memory. We consider two possible approaches. The low-level one is to imple-
ment access of the processors to any memory module, similarly as they access the
modules in their assigned rows and columns, but additionally taking care of possible
conicts for access. This is actually an emulation of the DMM with exactly the same
processors and MUs as in the emulating machine. We call this emulation of ran-
dom access to memory units. The high-level approach is to simulate general shared
memory, that is, to simulate a PRAM which has an arbitrary number of processors and
memory words. In this paper the term simulation usually means a general PRAM sim-
ulation, and emulation denotes the DMM emulation as described above. The presented
DMM emulations are simpler than PRAM simulations, and are expected to be signicantly
faster and easier to implement on real machines. The motivation for developing em-
ulations is not to provide general DMM simulations, because next we develop PRAM
simulations, but rather to enhance the ease of programming on ADENART=OMP without
slowing down the programs signicantly. We develop two DMM emulations, one deter-
ministic and the other randomized, and two PRAM simulations, one deterministic and
the other randomized. The randomized emulation and simulation are signicantly faster
than the corresponding deterministic algorithms.
The deterministic emulation is implemented by a specialized routing algorithm, the
main dierence with the general case is that the number of packets is equal to the
number of processors rather than to the number of MUs. A single access step can be
performed in time O(
p
n), this is shown to be optimal by a matching lower bound.
The randomized emulation operates with the expected slowdown O(log n), this again
is optimal. The algorithm is very simple and could be of practical value. The idea
is to rotate conceptually the rows by random and independent cyclic shifts, creating
new physical addresses of MUs corresponding to the virtual addresses used by the
programmer. This makes the physical column-addresses (of the packets in a batch of
memory requests) distributed evenly among the columns with high probability, thus
avoiding unbalanced load of the processors.
The deterministic PRAM simulation is an extension and modication of the deter-
ministic DMM emulation. We show that a PRAM with p processors, for p6n2, can
156 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
be simulated deterministically step-by-step with the slowdown O(
p
p), which is opti-
mal. Next we present randomized PRAM simulations. It is shown that one step of the
p-processor CRCW PRAM can be simulated on the n-processor PADAM in the optimal time
O(dp=ne), provided p = 
(n1+), for any constant >0. For smaller values of p,
the simulation has delay O(logp  dp=ne). The simulations are based on the universal
hashing of the PRAM memory among the MUs of the PADAM.
1.2.3. Case studies
We also consider three specic problems of selection, merging and sorting. These
problems are investigated as case studies. Our aim is to show that each of these
problems can be solved optimally on the PADAM, by a relatively simple algorithm.
1.3. Miscellany
If some property depending on n holds with the probability 1 − n−a, for any con-
stant a, then the property is said to hold with high probability, abbreviated w.h.p. All
the presented randomized algorithms are Las Vegas, in the sense that they always
output the correct solution.
The n-processor PADAM can simulate any p-processor constant degree network (e.g.
binary trees, butteries and comparator-sorting networks) with delay O(dp=ne). A step
of the network is performed by communicating with each of the neighbors. Since each
processor knows its neighbors, the corresponding processors of the PADAM know where
to look in the memory of messages.
We often refer to certain indexings of the MUs determined by linear orderings.
Dene the row-major order as follows: the MUs in row i precede the MUs in row
j, if i<j, and all the rows are ordered from left to right. The column-major order is
dened similarly.
1.4. Paper organization
In Section 2 we present an optimal deterministic h{h routing algorithm. Section 3 in-
cludes two methods of emulating random access to memory of the PADAM. In
Section 4 we consider general simulations of a PRAM. Section 5 includes the algo-
rithms for selection, merging, and sorting.
A preliminary version of this research appeared as [3].
2. Routing
Suppose that each MU stores at most h packets, and the packets need to be rearranged
among the MUs in such a way that each MU receives at most h new packets. This is
the h{h routing problem.
We rst describe an optimal deterministic 1{1 routing algorithm operating in time
O(n). Let us suppose that initially the packets are stored in the MUs, at most one
packet at each M [i; j]. The array M of MUs is partitioned into
p
n horizontal strips,
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 157
which are segments of
p
n rows, and similarly into
p
n vertical strips comprised ofp
n columns. The strips are numbered from left to right and from the bottom up, re-
spectively. The
p
n MUs in a row that are in the same vertical strip make a horizontal
segment, the vertical segments are dened similarly. An intersection of a horizontal
and vertical strip is alled a block. There are n blocks, each having n elements or-
ganized as an
p
n  pn array. The algorithm can be conceptually divided into three
phases:
PADAM ROUTING
Step 1: Route the packets to their vertical strips.
Step 2: Route the packets to their blocks.
Step 3: Finish routing in the blocks.
These steps are actually subroutines, in the sense that a solution to implementing
Step i can be used in designing solution for Step j, for j<i. The phases are presented
in the reversed order, to facilitate the comprehension.
Step 3: Routing packets inside the blocks: Let us suppose that the packets have
already been moved to their destination blocks. To complete routing, we convert blocks
into rows, then route packets in the rows, and convert rows back to blocks. The details
follow (see Fig. 2).
Enumerate the blocks in the (regular) row-major order, which is analogous to the or-
dering of the array of MUs. We move all the packets from the ith block to the ith row.
To this end, the kth row inside a row of blocks, for 16k6
p
n, is shifted cyclically
(k − 1) pn columns to the right. More precisely, a packet in the row i and column j,
for 16i; j6n, is moved to the column number (j + ((i− 1) modpn)  pn) mod n in
the ith row. Now all the packets from a block are in dierent columns. Permute the
packets in the columns so that all the packets from the ith block are moved to the ith
row.
This determines a translation of addresses in the block to addresses in the row.
Permute the packets in each row, translating block addresses to row addresses, to
achieve the desired routing in the block. Finally, convert back rows to blocks in the
reversed way, the ith row to the ith block.
Step 2: Routing packets inside the vertical strips: Suppose that the packets have
already been moved to their destination vertical strips. We show how to move the
packets to their destination blocks. Consider a vertical strip. The rst stage is to sort
the packets in each column on their block addresses. This can be done in time O(n),
one processor per column, as follows. The processor scans the column and nds the
number ai of packets with destination address in the ith block. The number ai is stored
at the top MU of the ith vertical segment of the column. Next the processor computes
the partial sums of these numbers: f1 = 0, and fk = a1 +   +ak−1, for k>1. Now the
ith packet among those going to the jth block is moved to the location i + fj in the
column. This can be done by one additional scan of the column, and completes sorting.
The sorted packets in each column are partitioned into two categories: A-packets
and B-packets, as follows. For each block, if there are i packets in the column with
158 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
this destination address, then i − (imodpn) are designated as A-packets and the
remaining ones as the B-packets. Next the columns are sorted in a stable way such
that the A-packets precede the B-packets. This can be done in time O(n) similarly
as sorting on the block addresses. The A-packets are dealt with rst. Afterwards, we
add some dummy B-packets to make exactly
p
n packets with each block address, and
the B-packets are sorted on their block addresses. This operation moves them to their
blocks. The dummy packets are then discarded.
It remains to consider the A-packets in detail. At this moment, every vertical seg-
ments contains only A-packets with the same block address. Rearrange the A-packets
in such a way that those in the rows congruent to i modulo
p
n go to the ith block (we
refer to this permutation of rows in the vertical strips as ). Next solve the routing
problem in each block treating the original block addresses as row addresses inside
the block (this is done similarly as in Step 3). Afterwards the packets are rearranged
according to the permutation −1 (see Fig. 3).
Lemma 1. At this stage of the algorithm the packets are in their proper destination
blocks.
Proof. Consider a block B just after permuting columns according to . If there is a
packet in B with some block address k then it was taken from a vertical segment of
packets with the same block address and moved as the only representative of them
to B. Hence there are at most
p
n packets in B with the block address k. All these
packets are to be routed to the kth row in B. Notice that the column addresses do
not matter since the packets are to be routed to their destination blocks to whatever
column, and this is achieved by rst routing the packets to their rows inside blocks.
Step 1: Routing the packets to their vertical strips: First the packets in each row are
sorted on their destination vertical-strip addresses. This is done in time O(n) similarly
as sorting columns on their block addresses. Next, for each vertical-strip address, if
there are some i packets with this destination address, then i−(imodpn) are designated
as A-packets and the remaining ones as B-packets. When the A-packets have been
routed, the B-packets are padded with the dummy packets to make exactly
p
n packets
with each vertical-strip address in each row.
The packets are sorted in each row and this operation places them in the proper
vertical strip. It remains to show how to handle the A-packets. First the A-packets and
B-packets are sorted in a stable way in the rows preserving that the A-packets precede
the B-packets. Now we permute the A-packets so that those in the columns congruent
to i modulo
p
n are moved to the ith vertical strip (we refer to this permutation of
columns as ). When inside the vertical strips, treat the original vertical-strip addresses
as column addresses inside the block, and move the packets to the respective blocks
(this is done similarly as in Step 2).
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 159
Next move the packets from the ith block to the ith column inside vertical strips,
and nish the routing by rearranging the packets according to the permutation −1 (see
Fig. 4).
Lemma 2. Routing to the vertical strips is correct and can be accomplished in time
O(n).
Proof. Consider the A-packets in a vertical strip V just after applying the permuta-
tion . There are at most n packets in V with the same original vertical-strip destination
addresses, which are now treated as the block addresses inside V . This guarantees that
there is enough room in each block to accommodate the packets.
h{h routing: Adaptation to the h{h case is as follows. While routing inside the
blocks, all the packets from a block are reallocated to a row, then the routing in
that row is performed. While routing inside vertical strips, we need to dene indexing
of the locations of packets in a column to determine the sorting order, and then the
A-packets and B-packets. The indexing of packet locations is such that all the packets
from the MU in a row i are to precede the packets in the MU in the row j, if i<j.
For each vertical-strip address, if there are some i packets with this destination address,
then some i− (imod (pn  h)) of them are designated as A-packets and the remaining
ones as B-packets. The phase of routing packets to their destination vertical strips is
modied accordingly.
Traditionally, all the packets stored in one unit of storage are referred to as the
queue.
Theorem 1. The algorithm PADAM ROUTING solves an instance of the h{h routing prob-
lem on the PADAM with n processors in time O(h  n); while the queues are of size
O(h).
3. Emulating DMM
In this section we show how to implement an access of each processor to any
memory unit on the PADAM. This is the same as emulating a DMM with the same
processors and MUs. Two emulations are presented. One is deterministic and uses the
original memory addressing of the PADAM, the other is randomized, the randomization
is used in shifting cyclically the addresses in rows. Each processor creates its memory
request as a packet storing some address X of a MU, and memory word w in X ; the
packet additionally carries the bit dening the request as either reading or writing, and,
in the case of a writing request, it carries the additional value y. If the request is for
reading then the meaning is that the word w in MU X is to be read (and then brought
back to the processor). Otherwise, the meaning is that the value y is to be written
into the word w in MU X . The set of packets created by all the processors that are
160 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
to be realized concurrently is called a batch of memory requests. We do not assume
any specic semantics regarding conicts for access of MUs, but all the popular ones
(like EREW, CREW, arbitrary CRCW, priority CRCW) can be handled with natural
stipulations augmenting the given algorithm.
3.1. Deterministic emulation
The emulation is performed in a step-by-step manner. Each processor creates a packet
and places it in its row in the rst column. The packets are rst organized into groups
determined by the MU addresses. For each group, only one packet is designated to be
routed, its selection depends on the conict-of-access-resolution protocol. In the case
of a writing step, this packet will carry the value which is to be written. In the case
of a reading step, this packet will be also routed back by a reversed route and will
bring the value retrieved from the accessed MU. This value will be distributed among
the packets in the group, if necessary, and nally these packets will be sent back to
their original rows.
EMULATION E1
Step 1: The packets have been placed in the rst column. Sort them on their
destination-column addresses by emulating the Batcher’s sorting network (see [15]).
Now the groups of packets with the same addresses are in contiguous blocks of loca-
tions in the rst column.
Step 2: Select one packet (if any) from each group. The selection is made depending
on the conict-resolution protocol.
Step 3: The selected packets are routed up to ll a contiguous block of MUs in the
rst column, including the top row, one packet per MU. This is accomplished by rst
computing the addresses for the packets by the parallel prex algorithm executed on a
simulated tree, and then by routing on a simulated buttery by applying a monotone-
routing algorithm (see [15]). After this the selected packets remain sorted on their
destination-column addresses. We consider only the selected packets from now on.
Step 4: For each group of the packets with the same destination-column address,
if there are some i packets in the group then designate i − (imodpn) packets as
A-packets, the remainder are the B-packets. This can be done on a simulated tree.
The A-packets are then routed to a contiguous block of locations, preserving their
relative order. This is accomplished similarly as in Step 3.
We rst deal with the A-packets.
Step 5: The ith vertical segment of
p
n A-packets is routed i−1 columns to the right.
Then the processors switch to the column-access mode. The ith processor, for i6
p
n,
looks for
p
n packets in its column in the consecutive rows starting from the row
(i − 1)pn+ 1. If there are any packets there then they are moved to their destination
rows.
Step 6: The processors switch to the row-access mode. Each processor reads the rstp
n MUs in its column looking for packets, and if it nds any then they are moved to
their destination columns.
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 161
Lemma 3. The A-packets are routed in time O(
p
n).
Proof. All the packets that have been in the same vertical segment and were moved
to the same column in Step 5 have the same destination-column address and hence
distinct destination-row addresses. This implies that there is at most one packet in every
MU in the rst
p
n columns after Step 5.
This completes routing of the A-packets. The next steps take care of the B-packets.
They are already in a contiguous block of memory locations, and the packets with the
same destination-column addresses also make contiguous block. These blocks will now
be moved to their destination columns. Before that we need to inform the processors
where they are to look for packets in their columns.
Step 7: For each packet which is the rst in a group of packets with the same
column address j, and which is in the row i, create a special packet to carry the
number i and send it to the jth processor, say to M [j; j]. To this end, the special
packet from the ith row is rst moved to the column b(i − 1)=pnc + 1. Then the
processors switch to the column-access mode and the kth processor, for k6
p
n, scansp
n MUs, starting from the row (k − 1)pn+ 1. If a packet is found then it is routed
to its destination row. Finally, the processors switch to the row-access mode and scan
the rst
p
n columns and move the packets (if any) to their column addresses. The
special packets are handled in time O(
p
n), because there is never more than a single
packet in each MU, this follows from the fact that they have distinct row addresses.
Step 8: Now the B-packets are moved to their destination columns in one step. The
processors switch to the column-access mode and place themselves at the rows whose
numbers have just been distributed by the special packets. Then they move the packets
waiting there to their destination rows.
This completes the emulation and proves:
Theorem 2. The algorithm E1 on a PADAM with n processors performs one batch of
memory request in time O(
p
n).
The following matching lower bound holds.
Theorem 3. Any emulation of random access to memory units on the n-processor
PADAM requires 
(
p
n) operations per batch of memory request in the worst case.
Proof. Consider a batch of requests to n dierent MUs which are located in
p
n rows
and
p
n columns. The PADAM is either in row-access or column-access mode. In any
case, in one step at most
p
n processors can access the required MUs.
3.2. Randomized emulation
The emulation is based on an oset addressing mechanism. It is determined by a
sequence ai of osets, where 16i6n and 16ai6n. Instead of M [i; j], the access to
162 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
the ith row and the jth column is meant to be to M 0[i; j] =M [i; ( j + ai) mod n]. Just
before the computation starts, each P(i) generates a random oset ai and stores it in
M [i; 1]. Then the input is read and stored in the MUs. The following is an emulation
of a single memory reference.
EMULATION E2
Step 1: Each processor P(i) generates a memory access request and places the
respective (ordinary) packet pi in M [i; 1]. Assume that packet address are all distinct,
otherwise proceed as in E1.
Step 2: The packets are sorted on their row addresses. The processor P(i) such that
the packet at M [i; 1] is the rst in a block of packets with the same row address, say ri,
generates a packet to inquire about the oset ari . The inquiring packets are routed to
their respective rows in the rst column on a simulated buttery, and then routed back
(these are instances of monotone routing). The osets they bring are disseminated
among other packets with the same row address on a simulated tree.
Step 3: The packets pi are sorted on their oset column addresses ki and stored
again in the column M [; 1].
Step 4: Each processor P(i) checks to see if the packet residing now in M [i; 1] is
the rst in the block of packets with the same key, say ki. If this is the case then it
generates a special packet storing i and the address M [ki; 1].
Step 5: The special packets are routed to their destination addresses on a simulated
buttery. Then each processor P(i) checks M [i; 1] for such a packet.
Step 6: Each processor P(i) moves the (ordinary) packet from M [i; 1] to M [i; ki],
where ki is the key of the packet.
Step 7: The processors switch to the column-access mode and processor P(j) scans
consecutive locations in the column j, starting from the address delivered by the
special packet (if any) and moves the packets waiting there (if any) to their nal
destinations.
Theorem 4. The algorithm E2 performs one batch of memory requests on a PADAM
with n processors in time O(log n) w.h.p.
Proof. The sortings in Steps 2 and 3 are performed by emulating the randomized
hypercubic sorting algorithm of Leighton and Plaxton [18]. Hence the Steps 1{6 are
performed in time O(log n). The time of Step 7 depends on the maximum number of
packets in a column. Consider a specic column number j. A packet in row i goes
to M [i; j] with the probability 1=n since the oset ai was selected randomly. If all
the packets were in dierent rows then we would have n independent selections of a
column, each selected with the probability 1=n. It is well known that every column
would be selected at most O(log n= log log n) times w.h.p. It may happen that many
packets are in the same row. Then our estimates hold true as well because if one of
these packets is selected for a particular column then the remaining packets must go
to other columns.
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 163
The following matching lower bound holds.
Theorem 5. Any randomized emulation of random memory access on the n-processor
PADAM has the expected number of 
(log n) steps per batch of memory requests.
Proof. Consider the special case when MU stores O(1) words and we emulate the
EREW DMM. Then the lower bound follows from Theorem 9.
4. Shared memory simulations
In this section we consider simulations of the random-access shared-memory archi-
tecture on the PADAM. First, we describe a deterministic simulation of a PRAM with some
restrictions on the number of processors and the size of the shared memory. Next, we
show a general randomized simulation of a p-processor CRCW PRAM with unbounded
shared memory on the n-processor PADAM. The simulation has an optimal delay for
p=O(n) and for p= 
(n1+), for any constant >0. If t is the running time of a
CRCW PRAM algorithm and pt>n2, then it is assumed that each MU of the PADAM can
store 
(pt=n2) words.
4.1. Deterministic simulation
We present a deterministic simulation of a PRAM with memory O(n2) and p pro-
cessors, for p6n2. The algorithm uses O(1) memory of each MU, but, alternatively,
could be easily modied to an emulation of a DMM with p processors and the MUs
of the PADAM. It is possible to use the algorithm E1 directly to obtain a similar sim-
ulation, or the respective DMM emulation, as follows. Divide the PRAM processors into
dp=ne groups of at most n processors. To simulate one step, perform dp=ne simulation
phases, in each one taking care of n processors. Each phase takes O(
p
n) steps, and
the delay is O(p=
p
n). We present an alternative simulation with delay O(
p
p), which
is better if p>n.
The simulation is an extension of emulation E1, so we just sketch it. Initially, the
PRAM processors are divided into groups of size p=n and the groups assigned to the
PADAM processors. Let K be the set of MUs M [i; j] with i; j6dppe.
SIMULATION S1
Step 1: The processors create packets storing the memory requests, and place them
in their rows in the rst p=n columns. Next, the packets are allocated to the square
K, one packet per MU. This can be accomplished in a straightforward way in time
p=n+
p
p62 pp.
Step 2: The packets in K are sorted on their destination-column addresses in column
major order in time O(
p
p).
A packet is dened to be an A-packet if all the packets in its column in K have
the same destination-column address. The remaining packets are B-packets.
164 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
Step 3: The A-packets are moved to their destination rows. Then the processors
switch to the row access mode, scan the rst
p
p MUs in their rows and move the
A-packets residing there to their nal destinations.
Step 4: The B-packets with the same column address are in contiguous blocks in
at most two columns. For each such block(s) of packets going to the column i create
a special packet pi with the row addresses of the blocks. Then the special packets are
routed to the diagonal MUs, pi to M [i; i], and retrieved by the processors.
Step 5: The B-packets are routed along the rows to their destination columns. Then
the processors switch to the column access mode and deliver the B-packets to their
nal destinations.
Theorem 6. A p-processor PRAM with O(n2) memory words can be simulated on the
n-processor PADAM with delay O(
p
p); provided p6n2.
To prove Theorem 6, we need only to show how to sort eciently in Step 2; this
is described in Theorem 13.
4.2. Randomized simulation
In this subsection a randomized shared memory simulation is presented. We assume
the strongest CRCW model of computation on the PRAM. The algorithm hashes the shared
memory of the PRAM into the (usually smaller) address space of the PADAM. First, we
give an optimal simulation of a n-processor CRCW PRAM on the n-processor PADAM.
Next, we extend it to the case of a p-processor CRCW PRAM, for any p. Our algorithm
achieves the optimal delay O(dp=ne) for p>n1+. Since our simulation hashes the
shared memory of the PRAM, we do not assume any bounds on the size of the memory
of the PRAM.
A PRAM consists of p processors P(1); P(2); : : : ; P(p) and a shared memory with cells
U = f0; : : : ; u − 1g, where u is arbitrary. The PADAM has n processors P(1); P(2); : : : ;
P(n). In our simulations the shared memory cells of the PRAM are distributed among
the rows of the PADAM using a hash function h :U!f1; : : : ; ng drawn at random from
the universal class of hash functions R(n; n; d) introduced by Dietzfelbinger and Meyer
auf der Heide [5]. This family of hash functions has the following useful property:
Fact 1 (Dietzfelbinger and Meyer auf der Heide [5]). If h is a random hash
function from class R(n; n; d); then; for a constant d suciently large; and for any
X U; w.h.p. max16i6n fjh−1(i)\X jg=O(jX j=n+ log n). R(n; n; d) consists of func-
tions h= h(f; a0; : : : ; an−1); where f is a polynomial of degree d − 1 in the eld
of residues of some large prime taken modulo n2; a0; : : : ; an−1 2f0; : : : ; n − 1g; and
h(x) = (f(x) mod n+ af(x)divn) mod n for x2U .
Lemma 4. Any n requests to the hash values of h2R(n; n; d) can be evaluated on
the n-processor PADAM in time O(log n).
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 165
Proof. Suppose processor P(i) needs h(xi), for some xi. First, processor P(1) gen-
erates a random f in time O(d). Next, each processor P(i) generates a random
ai−1 2f0; : : : ; n − 1g. Function f is distributed to all the processors on an emulated
binary tree. The notation f1(x) means f(x)div n. Packets with values f1(x1); : : : ; f1(xn)
are sorted. Each ai is routed to the rst x such that f1(x) = i, and next broadcast to
the other locations storing the sorted values of f1 equal to i. Finally, the packets with
af1(xi) are routed back to P(i).
We have actually shown that any p requests of values of h2R(n; n; d) can be
evaluated on the n-processors PADAM in time needed to perform integer sorting of p
numbers.
A dictionary is a data structure supporting two operations:
LOOKUP(x) | returns the information associated with x that is currently stored in the
dictionary; and
INSERT(x; c) | stores x together with the information c associated with x in the
dictionary; if x was already stored then it is rst removed from the dictionary.
Fact 2 (Dietzfelbinger and Meyer auf der Heide [5]). There is a randomized imple-
mentation of a dictionary which uses O(n) space and that performs each of a sequence
of O(n) dictionary operations in constant time w.h.p.
At the beginning of the simulation we choose randomly a hash function h :U!f1;
: : : ; ng from R(n; n; d). Function h denes the row h(x) of the PADAM in which cell
x of the PRAM will be stored. Let M= fm1; : : : ; mpgU . In a given PRAM step, P(i)
wants to access cell mi and either read its content as xi (in the reading phase) or write
yi to mi (in the writing phase). We refer to the respective pair (mi; xi) or (mi; yi) as
the packet i.
SIMULATION S2
Step 1: Each processor P(i) creates a packet i, places it in M [i; 1], and evaluates
h(mi).
Step 2: All the packets are sorted on their PRAM addresses from M. Now the groups
of packets with the same addresses are in contiguous block of locations in the rst
column. Select one packet from each group according to the writing-conict-resolution
protocol.
Step 3: The selected packets are routed up to ll a contiguous block of MUs in
the rst column, including the top row, one packet per MU. We consider only these
packets from now on.
Step 4: These packets are sorted on their destination-row addresses h(mi). Using
the obtained sorted sequence, each packet computes the number of packets with the
same destination row and the number of packets with the same destination row which
precede it in the sorted sequence. This is accomplished by emulating a binary tree.
Step 5: The rst packet destined for each row is moved to that row, as on a binary
tree.
166 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
Step 6: The remaining packets are handled as follows. If the ith packet i (with
respect to the order obtained by sorting in Step 4) was moved in Step 5 to row k and
there are j packets to be delivered to row k, then in the next j − 1 moves: processor
P(i + l) writes i+l to M [i + l; k] in move l and P(k) reads i+l from that cell.
Afterwards P(k) stores all the packets destined for row k.
Step 7: Each processor P(i) maintains a local dictionary in its row i. In the writing
phase, P(i) executes operation INSERT(mj; yj) for every packet j with h(mj) = i. In the
reading phase, P(i) executes operation xj = LOOKUP(mj) for every h(mj) = i.
Step 8: The values read are distributed among the read packets left in Step 2.
Theorem 7. Simulations S2 performs one step of the n-processor CRCW PRAM on the
n-processor PADAM in time O(log n) w.h.p.
Proof. The correctness of the simulation is straightforward and the optimality follows
from Theorem 5. The following time estimates hold w.h.p. Steps 3,5{7 are performed
in time O(log n). Sortings are performed in time O(log n) by a simulation of the sorting
circuit of Leighton and Plaxton [18]. Fact 1 ensures that in each step each processor
performs only O(log n) operations on its dictionary. Step 7 needs O(log n) time, by
Fact 2.
The algorithm above can be extended to simulate the p-processor PRAM on the
n-processor PADAM for any p. We use the following simple class of hash functions. Let q
be a prime and U = f0; : : : ; u−1g for u6q. For s>1 dene the set H1s = fha; b: 06a; b
<qg where ha; b(x) = (a+ bxmod q) mod s, for x2U and 06a; b<q.
Fact 3 (Dietzfelbinger and Meyer auf der Heide [5]). Let h :U!f0; : : : ; s − 1g be a
random function from class H1s ; then for any X U; if 2jX jc+26s then Pr(max16i6s
fjh−1(i)\X jg>1 − jX j−c.
Observe that a function h2H1s can be stored in O(1) cells, and can be generated
and evaluated in constant time by one processor. Hence, it can be generated by P(1)
and broadcast by emulating a binary tree to all the processors of the PADAM in time
O(log n). Afterwards each processor can evaluate h in constant time.
Theorem 8. One step of the p-processor CRCW PRAM can be simulated on the
n-processor PADAM in the optimal time O(dp=ne); provided p = 
(n1+) for any con-
stant >0. For smaller values of p the simulation has delay O(logp  dp=ne).
Proof. The simulation is an extension of S2. The case p<n is straightforward. If p>n,
then we can randomly divide p requests to the memory of the PRAM step into batches
of size at most n. Then we perform the algorithm S2 for each batch independently. For
p= nO(1) and T = nO(1) this yields the simulation of one step of the p-processor CRCW
PRAM on the n-processor PADAM with delay O(log n  dp=ne) w.h.p. Now we sketch the
optimal simulation for p = (n1+), for any constant >0.
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 167
We proceed similarly as in the simulation for p= n. Now however, Steps 2 and 4
use integer sorting instead of general sorting. Theorem 14 shows that p integers can
be sorted in the optimal time O(p=n) on the n-processor PADAM, for p= n1+. In order
to use integer sorting we have to reduce the problem of eliminating all requests to the
same address in the PRAM to the integer case. We can do this by employing a random
hash function h : U!f1; : : : ; 2 pc+2g from class H12pc+2 . Fact 3 implies that the set
fmi 2M: 9mj 6=mi h(mi) = h(mj)g is empty with the probability at least 1−p−c. Hence,
we can apply integer sorting on h(m1); : : : ; h(mp) to store the addresses with the same
value in a contiguous subsequence of the sorted sequence w.h.p.
The algorithm can be extended to simulate a p-processor CRCW PRAM on the
n-processor PADAM also when p is arbitrarily large.
In this proof, we have essentially reduced the problem of simulating the p-processor
CRCW PRAM on an n-processor PADAM to the problem of sorting p integer numbers on
the n-processor PADAM. Thus, any improvement of such an algorithm for p=!(n) and
p= o(n1+), for any constant >0, would also improve our simulation.
4.3. Lower bound for shared memory simulations
In this section we show that emulation E2 and simulation S2 are optimal. This follows
from the lower bound on randomized PRAM simulations given in Theorem 9.
We will use the following fact:
Fact 4 (Yao [32]). Let P1 be the probability that a xed randomized algorithm solves
problem A in T steps; minimized over all the possible inputs. Let P2 be the probability
that a deterministic algorithm solves A in T steps; subject to a xed probability
distribution D on the inputs; maximized over all the deterministic algorithms operating
in T steps. Then P16P2.
Theorem 9. Any randomized simulation of the n-processor EREW PRAM on the PADAM;
that operates correctly with the probability larger than 12 + ; for some constant >0;
requires time 
(log n).
Proof. Consider the following problem: given n bits, x1; : : : ; xn, of which at most
one is a 1 and the others are 0’s, compute OR of x1; : : : ; xn, that is, verify whether
there exists i with xi = 1. Clearly, this problem can be solved in constant time on the
n-processor EREW PRAM. We show that there exists an input distribution D such that any
deterministic PADAM algorithm (with arbitrary number of processors) which solves that
problem with probability ( 12 + ) must run in time 
(log n). Combined with Fact 4,
this will establish the theorem. Our proof is similar to that of the lower bound for the
CROW PRAM for the same problem given by Snir.
Let Ij, for 16j6n, be the input with xj = 1 and 0’s everywhere else, and let I0 be
the input consisting of all 0’s. We say that a processor P or a memory cell C depends
168 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
on the input variable xj at time t, if its state on input I0 is distinct from its state on
input Ij at time t. We will prove by induction on t the following fact: Any processor
P or memory cell C at time t depends on at most 3t variables.
The basis of induction is clear. Consider the inductive step. Observe that at time t
a processor P can depend on:
(1) any variable it depends on at time t − 1, and
(2) any variable that the memory cell CP depended on, if it read CP during the last
step.
By the inductive hypothesis, there are at most 3t−1 variables that P depended on
at time t − 1, and at most 3t−1 variables that CP depended on at time t − 1. Hence
processor P can depend on at most 2  3t−163t variables.
Now let us consider memory cell C at time t. It can depend on:
(1) any variable it depends on at time t − 1, and
(2) any variable that all the processors PC depended on, where PC are exactly the
processors allowed to write into C during the last step.
Observe that at most two PADAM processors can write into a given memory cell. This
implies that memory cell C at time t depends on at most 3  3t−1 = 3t variables.
Let P1 be the processor writing the answer and let D be the following input distribu-
tion: Pr[I0 is the chosen input] = 12 and Pr[Ij is the chosen input] = 1=2n, for 16j6n.
After t steps P1 depends on at most 3t variables xi1 ; : : : ; xi3t . With probability 3
t =2n
one of these variables has value 1. Then P1 gives the correct answer. If none of the
variables is 1, then P1 has to guess the answer to be either 1 or 0, basing this deci-
sion only on the fact that all the variables in fxi1 ; : : : ; xi3t g are zeros. If it decides to
output 0, then it fails for all the inputs Ij with xj =2fxi1 ; : : : ; xi3t g, and gives the wrong
answer with the probability (n− 3t)=2n. Otherwise, it fails on I0, and gives the wrong
answer with the probability 12 . Hence the algorithm has the success probability at most
(n+ 3t)=2n.
Thus in order to make the success probability at least 12 + , we must have 3
t =2n>.
Therefore, if the PADAM algorithm runs in time t, then 3t>2n.
5. Case studies: selection, merging, sorting
In the previous sections we considered algorithmic problems related to the commu-
nication issues on the PADAM. Now we present optimal algorithms for the problems of
selection, merging and sorting. They are examples of solutions on the PADAM of the
classical algorithmic problems, and they are included as case studies. Some algorithms
for specic problems were developed in [9, 25] in particular for matrix multiplication,
LU decomposition, triangular systems, solving linear system, FFT, linear program-
ming, solving partial dierential equation (PDE), and polynomial manipulations. In [9]
a nonoptimal sorting algorithm was given.
For the sake of simplicity of the exposition, we assume that there are exactly n2
keys which are stored in the global memory, one key per MU, unless stated otherwise.
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 169
5.1. Selection
Given a sequence X of n2 keys and an integer k; 16k6n2, the task is to nd the
kth key in X . This key is dened by the property that there are less then k keys in
X smaller than x, and for each y2X; y>x, there are at least k keys in X which
are smaller than y. The notation jY j means the number of elements in the set Y . The
median of Y is the djY j=2eth key in Y . In the following presentation of the algorithm,
it is assumed that all the keys are distinct to simplify the exposition, the general case
can be handled along similar lines.
PADAM SELECTION
Step 1: The processors nd in parallel the median of the keys in each row. They
place the medians in the rst column, next switch to the column-access mode, and the
rst processor nds the median z of the row medians.
Step 2: The value of z is made known to all the processors who count the number
kz of occurrences of keys smaller than z. Let Sz = fx2X j x<zg and kz = jSzj. If kz<k
then let X1 :=X − Sz and k1 := k − kz, otherwise let X1 := Sz and k1 := k.
Step 3: The keys in X1 are ranked in the row-major order, that is, a key x in row i
obtains the rank r if there are r − 1 keys from X1 in the rows with numbers smaller
than i and in the row i to the left of x. This is accomplished by combining prex
computations in the rows and one column.
Step 4: The keys from X1 are moved to the square array A of the MUs M [i; j] such
that i; j6dpX1e. This is accomplished by the algorithm PADAM ROUTING. The key of
rank r in X1 is routed to the rth MU in A with respect to the row-major order.
Step 5: Recursively, the k1th key in X1 is found in A by the processors P(i) such
that i6dpX1e.
Theorem 10. The algorithm PADAM SELECTION nds the kth key in a set of n2 keys in
time O(n) on the PADAM with n processors.
Proof. The rst step is performed by executing the sequential selection algorithm of
Blum et al. [2] in each row in time O(n). The Steps 2{4 are also completed within
this time bound. There are at most d3n2=4e keys in the set X1 because z is the median
of the row medians. The total time bound T (n) of the algorithm satises the equation
T (n) =T (dp3n=2e) + O(n), hence T (n) =O(n).
5.2. Merging
The task is to combine two sorted sequences into one sorted sequence. Assume that
one of two input sequences occupies the rst n=2 rows, and the other the remaining
rows.
PADAM MERGE
Step 1: Rearrange the keys in such a way that the upper sequence is sorted in the
regular row-major left-to-right order, and the lower sequence similarly in the right-to-
170 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
left order.
Step 2: Sort the columns.
Step 3: Sort the rows.
Step 1 can be performed in time O(n), as an instance of o-line routing. Step 2
simply merges two sequences, since each column is a bitonic sequence.
Lemma 5. After Step 2; the sequences stored in the rows are bitonic.
Proof. Take a key x. Let 0s denote the keys y such that y6x, and 1s keys y such
that y>x. After Step 1 there are at most two dirty rows (containing both 0s and 1s),
one having 0s in a block on the left side (if any), the other on the right side (if any).
After Step 2, there is exactly one dirty row, of one of two possible forms: 0s1s0s or
1s0s1s. The lemma follows.
Hence in Step 3 it is sucient to merge the columns. This proves the following:
Theorem 11. The algorithm PADAM MERGE merges two sorted sequences of n2=2 keys
in time O(n) on the n-processor PADAM.
5.3. Sorting
The sorting problem is to rearrange the elements of an n  n array in such a way
that they appear in a nondecreasing order with respect to some indexing of array
entries. Sorting on the PADAM can be accomplished in the optimal time O(n log n) by
a mergesort, we use the left-to-right row-major indexing:
PADAM MERGESORT
Step 1: Sort each row.
Step 2: Merge repeatedly blocks of sorted rows.
Step 1 can be accomplished in time O(n log n), and there are O(log n) iterations of
Step 2, each taking time O(n).
Theorem 12. The algorithm PADAM MERGESORT sorts O(n2) keys in time
O(n log n) on the PADAM with n processors.
If the keys are integers of polynomial magnitude then they can be sorted faster. We
start with a result that is used in the deterministic simulation of PRAM on PADAM.
Theorem 13. Suppose that at most n2 keys are distributed among the n2 MUs of
the PADAM with n processors; each key is a natural number between 1 and na; for a
constant a. Then the keys can be sorted in time O(n).
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 171
Proof. The keys are sorted in the column-major order. It is sucient to show how
to sort in a stable way a set of keys in the range [1 : : : n]. If we refer to an array of
numbers [xk; l] then it is understood that the entry xk; l is stored in M [k; l].
First each column is sorted in a stable way, similarly as in PADAM ROUTING. Then the
location of the beginning of the block of keys equal to i in column j, denoted bi; j, is
found. The numbers ci; j, equal to the number of occurrences of keys i in column j,
are computed, and then the prexes di; j =
Pj
k=1 ci; k . Let us consider the key s which
is now stored in M [i; j]. It is to be moved to the ai; jth MU (we refer to the numbering
according to the column-major order), where ai; j is equal to the number of occurrences
of keys smaller than s or those equal to s in the columns <j or in the column j but
in the rows 6i. The number of keys equal to s above s in column j can be found
from the value of bs; j. This is combined with ds; j to yield ai; j. Finally, the packets are
moved to their positions by invoking PADAM ROUTING.
We can improve the above theorem in the case when the number of keys is strictly
less than n2. This result is also used in simulation S2.
Theorem 14. Suppose that m6n2 keys are distributed evenly among the n processors
of the PADAM; each key is a natural number between 1 and r. Then the keys can be
sorted in time O((m=n+ log n)d(log r)= log(m=n+ log n)e).
The above sorting algorithm is optimal for m>n1+ and r=mO(1).
Proof. The algorithm employs an adaptation of the idea of the bucket sorting algorithm
of Wagner and Han [31]. Assume rst that r6m=n+ log n.
Step 1: Each processor computes sequentially the number of its occurrences of each
key and the rank of each amongst these occurrences. The set of keys of value j stored
by processor P(i) is referred to as the bucket B(j; i). The size of bucket B(j; i) is
denoted as S(j; i). A label C(x) is attached to each key x, which is the rank of its
occurrence in bucket B(j; i), where j is the value of key x.
Step 2: The buckets are divided into n groups of size r. We order the buckets
lexicographically with respect to their identication numbers. The buckets in the jth
group, which are assigned to P(j), precede those in the (j + 1)st group. The sizes
S(k; i) of the buckets B(k; i) in group j are routed to processor P(j).
Step 3: Each P(i) computes sequentially prex sums E(i; j); 06j6r, of the sizes
of the rst j buckets in group i. Let E(i) =E(i; r) be the number of keys in the buckets
in group i.
Step 4: All processors compute in parallel the prex sums F(i) =
Pi−1
k=1 E(k); 16i
6n, which is the number of keys in the buckets in groups 1; 2; : : : ; i − 1.
Step 5: Each P(i) computes sequentially G(i; j) =F(i) + E(i; j − 1) for 16j6r,
which is the number of keys in the buckets smaller than the jth bucket in group i.
Step 6: Observe, that if x was in bucket B(i; j), which is the kth bucket in group l,
then its rank is G(l; k) + C(x). Hence, to nish the algorithm, we only need to route
the values G(i; j) similarly as in Step 2, but in the reverse order.
172 B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173
Next we estimate the performance of the algorithm. Step 1 can be performed in time
O(m=n) and Steps 3 and 5 in time O(r). Step 4 can be performed in time O(log n) on
an emulated binary tree by the standard prex sums algorithm. Finally, Steps 2 and
6 can be done in O(r) time because our routing pattern is xed. The total time is
O(m=n+ log n), because we have assumed that r6m=n+ log n.
The algorithm can be modied to handle the case r>m=n + log n using the idea
of radix sort. Divide log r bits representing the numbers into parts of size log(m=n +
log n), and, starting from the least signicant part, apply the algorithm as described
above to all the parts. In order to sort all the numbers, this procedure is repeated
(log r)= log(m=n+ log n) times.
References
[1] A. Aggarwal, A.K. Chandra, M. Snir, On communication latency in PRAM computations, in: Proc.
ACM Symp. on Parallel Algorithms and Architectures, 1989, pp. 11{21.
[2] M. Blum, R.W. Floyd, V.R. Pratt, R.L. Rivest, R.E. Tarjan, Time bounds for selection, J. Comput.
System Sci. 7 (1972) 448{461.
[3] B.S. Chlebus, A. Czumaj, L. Gasieniec, M. Kowaluk, W. Plandowski, Parallel alternating-direction
access machine, in: Proc. 21st Internat. Symp. on Mathematical Foundations of Computer Science,
1996, Lecture Notes in Computer Science, Vol. 1113, pp. 267{278.
[4] B.S. Chlebus, A. Czumaj, J.F. Sibeyn, Routing on the PADAM: degrees of optimality, in: Proc.
Euro-Par’97 Parallel Processing, 1997, Lecture Notes in Computer Science, Vol. 1300, pp. 272{279.
[5] M. Dietzfelbinger, F. Meyer auf der Heide, Dynamic hashing in real time, in: J. Buchmann, H.
Ganzinger, W.J. Paul (Eds.), Informatik: Festschrift zum 60. Geburtstag von Gunter Hotz, B.G. Teuber
Verlagsgesellschaft, Leipzig, 1992, pp. 95{119. A preliminary version appeared as: M. Dietzfelbinger,
F. Meyer auf der Heide, A new universal class of hash functions and dynamic hashing in real time,
in: Proc. 17th Internat. Colloq. on Automata, Languages and Programming, 1990, pp. 6{19.
[6] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill,
New York, 1993.
[7] K. Hwang, C.-M. Cheng, Simulated performance of a RISC-based multi-processor using
orthogonal-access memory, J. Parallel Distrib. Comput. 13 (1991) 43{57.
[8] K. Hwang, M. Dubois, D.K. Panda, S. Rao, S. Shang, A. Uresin, W. Mao, H. Nair, M. Lytwyn, F.
Hsieh, J. Liu, S. Mehrotra, C.M. Cheng, OMP: a RISC-based multiprocessor using orthogonal-access
memories and multiple spanning buses, in: Proc. 1990 Internat. Conf. on Supercomputing, 1990, pp.
7{22. ACM SIGARCH Computer Architecture News 18 (3) (1990) 7{22.
[9] K. Hwang, P.-S. Tseng, D. Kim, On orthogonal multiprocessor for parallel scientic computations,
IEEE Trans. Comput. 38 (1989) 47{61.
[10] J. JaJa, An Introduction to Parallel Algorithms, Addison-Wesley, Reading, MA, 1992.
[11] H. Kadota, K. Kaneko, I. Okabayashi, T. Okamoto, T. Mimura, Y. Nakakura, A. Wakatani, M.
Nakajima, J. Nishikawa, K. Zaiki, T. Nogi, Parallel computer ADENART | its architecture and
application, in: Proc. Fifth ACM Internat. Conf. on Supercomputing, 1991, pp. 1{8.
[12] H. Kadota, K. Kaneko, Y. Tanikawa, T. Nogi, VLSI parallel computer with data transfer network:
ADENA, in: Proc. Internat. Conf. on Parallel Processing, Vol. I, 1989, pp. 319{322.
[13] R.M. Karp, V. Ramachandran, Parallel algorithms for shared-memory machines, in: J. van Leeuwen
(Ed.), Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity, Elsevier
Science Publishers, Amsterdam, 1990, pp. 870{941.
[14] C.P. Kruskal, L. Rudolph, M. Snir, A complexity theory of ecient parallel algorithms, Theoret.
Comput. Sci. 71 (1990) 95{132.
[15] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes,
Morgan Kaufman Publishers, San Mateo, California, 1991.
[16] T. Leighton, Methods for message routing in parallel machines, in: Proc. 24th ACM Symp. on Theory
of Computing, 1992, pp. 77{96.
B.S. Chlebus et al. / Theoretical Computer Science 245 (2000) 151{173 173
[17] T. Leighton, F. Makedon, Y. Tollis, A 2n−2 step algorithm for routing in an nn array with constant
size queues, in: Proc. ACM Symp. on Parallel Algorithms and Architectures, 1989, pp. 328{335.
[18] T. Leighton, C.G. Plaxton, A (fairly) simple circuit that (usually) sorts, in: Proc. 31st IEEE Symp.
on Foundations of Computer Science, 1990, pp. 264{274.
[19] K. Mehlhorn, U. Vishkin, Randomized and deterministic simulations of PRAMs by parallel machines
with restricted granularity of parallel memories, Acta Inform. 21 (1984) 339{374.
[20] F. Meyer auf der Heide, Hashing strategies for simulating shared memory on distributed memory
machines, in: F. Meyer auf der Heide, B. Monien, A.L. Rosenberg (Eds.), Proc. First Heinz Nixdorf
Symposium Parallel Architectures and their Ecient Use, Lecture Notes on Computer Science, Vol.
678, 1992, pp. 20{29.
[21] T. Nogi, Parallel computation on ADENA, in: D.J. Evans, G.R. Joubert, H. Liddell (Eds.), Parallel
Computing’91, Elsevier Science Publishers, Amsterdam, 1992.
[22] T. Nogi, T. Imamura, K. Sakai, Y. Obayashi, NAS parallel benchmark results on ADENART, Parallel
CFD’94, Kyoto, 1994.
[23] S. Rajasekaran, Mesh connected computers with xed and recongurable buses: packet routing, sorting
and selection, in: Proc. European Symp. on Algorithms, Lecture Notes in Computer Science, Vol. 726,
1993, pp. 309{320.
[24] S. Saini, D.H. Bailey, NAS parallel benchmark results 12-95, Technical Report NAS-95-021, NASA
Ames Research Center, Moett Field, CA 94035-1000, 1995. This report is available electronically at
http://www.nax.nasa.gov/NAS/TechReports/NASreports/NAS-95-021/NAS-95-021.htm.
[25] I.D. Sherson, Y. Ma, Analysis and applications of the orthogonal access multiprocessor, J. Parallel
Distrib. Comput. 7 (1989) 232{255.
[26] J.F. Sibeyn, B.S. Chlebus, M. Kaufmann, Deterministic permutation routing on meshes, J. Algorithms
22 (1997) 111{141.
[27] J.F. Sibeyn, M. Kaufmann, R. Raman, Randomized routing on meshes with buses, in: Proc. European
Symp. on Algorithms, Lecture Notes in Computer Science, Vol. 726, 1993, pp. 333{344.
[28] Q. Stout, Mesh connected computers with broadcasting, IEEE Trans. Comput. 32 (1983) 826{830.
[29] L.G. Valiant, General purpose parallel architectures, in: J. van Leeuwen (Ed.), Handbook of Theoretical
Computer Science, Vol. A: Algorithms and Complexity, Elsevier Science Publishers, Amsterdam, 1990,
pp. 943{972.
[30] A.J. van der Steen, J.J. Dongarra, Overview of Recent Supercomputres, NHSA Review, 1996 Volume,
1st issue, 1996. This report is available electronically at http://www.crpc.rice.edu/NHSEreview/ORS/.
[31] R.A. Wagner, Y. Han, Parallel algorithms for bucket sorting and the data dependent prex problem,
in: Proc. 15th Internat. Conf. on Parallel Processing, 1986, pp. 924{930.
[32] A.C. Yao, Probabilistic computations: towards a unied measure of complexity, in: Proc 18th Annual
Symp. on Foundations of Computer Science, 1977, pp. 222{227.
