Large-Scale Sorting in Uniform Memory Hierarchies

Jeffrey Scott Vitter* and Mark H. Nodine†

Dept. of Computer Science
Brown University
Providence, R.I. 02912-1910

December 19, 1991
Abstract. We present several ecient algorithms for sorting on the uniform
memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its paral-
lelization P-UMH. We give optimal and nearly-optimal algorithms for a wide range of
bandwidth degradations, including a parsimonious algorithm for constant bandwidth.
We also develop optimal sorting algorithms for all bandwidths for other versions of
UMH and P-UMH, including natural restrictions we introduce called RUMH and
P-RUMH, which more closely correspond to current programming languages.
1 Introduction
In many large-scale computer systems, memory progresses from very small but very
fast registers to successively larger but slower components, such as several layers of
cache, primary memory, disks, and archival storage. In order to achieve optimal

* Support was provided in part by a National Science Foundation Presidential Young Investigator
Award CCR-9047466 with matching funds from IBM Corporation, by National Science Foundation
grant CCR-9007851, by the U.S. Army Research Office under grant DAAL03-91-G-0035, and by
the Office of Naval Research and the Defense Advanced Research Projects Agency under contract
N00014-83-K-0146 and ARPA order 6320, amendment 1.
† Support was provided in part by an IBM Graduate Fellowship, by a National Science Foundation
Presidential Young Investigator Award CCR-9047466 with matching funds from IBM Corporation,
by National Science Foundation grant CCR-9007851, and by the U.S. Army Research Office under
grant DAAL03-91-G-0035.
performance on such a computer, it is often necessary for the algorithm designer to
take into account the physical characteristics of the memory hierarchy. Unfortunately,
there are too many possible variables to consider (e.g., the block size of each level,
the number of blocks at each level, the bandwidth between one level and the next)
to allow the design of general algorithms; hence some degree of abstraction of the
memory hierarchy is required.
Several interesting and elegant hierarchical memory models have been proposed
recently to model the many levels of memory typically found in large-scale computer
systems. The HMM model of Aggarwal, Alpern, Chandra, and Snir [AAC] allows access to an individual memory location x in time f(x). The BT model of Aggarwal, Chandra, and Snir [ACSa] represents a notion of block transfer applied to HMM; in the BT model, access to the t + 1 records at locations x − t, x − t + 1, ..., x takes time f(x) + t. Typical access cost functions are f(x) = log x and f(x) = x^ε, for some ε > 0.¹ A model similar to the BT model that allows pipelined access to memory in O(log n) time was developed independently by Luccio and Pagli [LuP]. Optimal sorting algorithms for
each of these models have been developed [AAC, ACSa, LuP].
In this paper we concentrate on a newer hierarchical memory model introduced by Alpern, Carter, and Feig [ACF, ACSb], called the uniform memory hierarchy (UMH), which offers an alternative model of blocked multilevel memories. In the UMH_{b(ℓ)} model (for integer constants α, ρ ≥ 2), the ℓth memory level (as illustrated in Figure 1) consists of αρ^ℓ blocks, each of size ρ^ℓ; it is connected via buses to levels ℓ − 1 and ℓ + 1. Each individual block on level ℓ can be randomly accessed as a unit and transferred to or from level ℓ + 1 at a bandwidth of b(ℓ); that is, each block transfer takes time ρ^ℓ/b(ℓ). The CPU resides at level 0.
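To make the level parameters concrete, the following small sketch (our illustration in Python, not code from the paper) tabulates block counts, block sizes, and block-transfer times for the bandwidth functions considered later; the constants and helper names are assumptions chosen only for this example.

    # Cost-model sketch of the UMH parameters described above (illustrative).
    # Level ell has alpha * rho**ell blocks, each of size rho**ell, and moving
    # one such block between levels ell and ell+1 takes rho**ell / b(ell) time.

    def num_blocks(ell, alpha=3, rho=2):
        return alpha * rho**ell

    def block_size(ell, rho=2):
        return rho**ell

    def transfer_time(ell, b, rho=2):
        """Time to move one level-ell block to or from level ell + 1."""
        return block_size(ell, rho) / b(ell)

    # The three bandwidth functions studied in the paper.
    def b_const(ell):
        return 1.0                        # b(ell) = 1

    def b_inverse(ell):
        return 1.0 / (ell + 1)            # b(ell) = 1/(ell + 1)

    def b_exponential(ell, c=1.0, rho=2):
        return float(rho) ** (-c * ell)   # b(ell) = rho^(-c*ell)

    for ell in range(5):
        print(ell, num_blocks(ell), block_size(ell),
              transfer_time(ell, b_const),
              transfer_time(ell, b_inverse),
              transfer_time(ell, b_exponential))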
A model for parallel hierarchies was introduced by Vitter and Shriver, in which P
hierarchies are connected at their base level via an interconnection network as shown
in Figure 2. Communication between the P hierarchies takes place at the base memory
level (call it level 0), which consists of location 1 from each of the P hierarchies. The P
base memory level locations are interconnected via a network such as the hypercube or
cube-connected cycles so that the P records in the base memory level can be sorted
¹ We use the notation log x, where x ≥ 1, to denote the quantity max{1, log_2 x}.
Figure 1: The uniform memory hierarchy (UMH), pictured here for the case α = 3, ρ = 2. The ℓth memory level contains αρ^ℓ blocks, each of size ρ^ℓ. It is connected via buses to levels ℓ − 1 and ℓ + 1. Each level-ℓ block can be randomly accessed and transferred to level ℓ + 1 at a bandwidth of b(ℓ) (that is, in ρ^ℓ/b(ℓ) time).
in O(log P ) time (perhaps via a randomized algorithm [ReV]). Vitter and Shriver
introduced optimal randomized sorting algorithms for P-HMM and P-BT [ViSa]. The
algorithms were based on their randomized two-level partitioning technique applied to
the optimal single-hierarchy algorithms for HMM and BT developed in [AAC, ACSa].
We can consider parallel UMH hierarchies (analogous to P-HMM and P-BT), and
we call the resulting model P-UMH. (This is fundamentally different from the parallel type of UMH called UPHM mentioned in [ACF].) The initial input of N elements resides at level s = ⌈(1/2) log_ρ(N/P)⌉.
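As a small worked instance (ours, with arbitrary sample values), the starting level can be computed directly from this formula:

    import math

    # Starting level s = ceil((1/2) * log_rho(N / P)), as stated above.
    def starting_level(N, P, rho=2):
        return math.ceil(0.5 * math.log2(N / P) / math.log2(rho))

    print(starting_level(N=1 << 20, P=16))   # prints 8 for these sample values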
A UMH or P-UMH "program" consists of a schedule of choreographed block transfers and computations. If a RAM program that runs in T(N) steps can be scheduled in UMH in ∼T(N) time, the program is said to be parsimonious; note that the
Figure 2: A parallel hierarchical memory. The P individual memory hierarchies are
all of the same type, such as HMM, BT, or UMH. The P CPUs can communicate
among one another via the interconnection network (which can be a hypercube or
cube-connected cycles, for example).
constant factor must be 1. If the UMH program runs in time O(T (N)), it is said to
be ecient . A UMH program whose running time is within a constant factor of best
possible for that problem in the UMH model is said to be optimal.
In Section 2 we give optimal and near-optimal sorting algorithms for UMH and
P-UMH for a wide range of bandwidth rates b(ℓ), and we present a parsimonious
schedule for merge sort for the case b(ℓ) = 1. In Section 3 we also introduce a natural
and easy-to-program restriction of UMH, called random-access UMH (or RUMH), for
which we have optimal upper and lower bounds for all bandwidths and amounts of
parallelism, and a sequential model of UMH called SUMH, for which we do the same.
2 Sorting in UMH and its Parallelization
Optimal sorting in O(N log N) time in UMH is possible only when the bandwidth b(ℓ) at level ℓ is Ω(1/ℓ), or else the time required just to access the N records will be greater than O(N log N). Many buses may be active simultaneously in the UMH model, so conceivably it is possible to sort in O(N log N) time even with small bandwidth b(ℓ) = 1/(ℓ + 1).
Recently other authors announced an efficient UMH sorting algorithm for the case b(ℓ) = 1/(ℓ + 1), based on the optimal two-level distribution sort algorithm of [AgV], but their UMH_{1/(ℓ+1)} algorithm turned out to be inefficient, with a running time of Ω(N log^c N), for c > 3. Whether or not an O(N log N)-time UMH_{1/(ℓ+1)} algorithm exists is still open.
In this section we give a near-optimal sorting algorithm for the small bandwidth
case b(ℓ) = 1/(ℓ + 1), and optimal sorting algorithms for several other bandwidths. For
the special case of constant bandwidth, we present a parsimonious algorithm. Since
optimal sorting seems to require nonoblivious UMH programs, the oblivious UMH
model of [ACF] must be modied in a reasonable way. In Theorem 1, we assume that
the `th level of the hierarchy can initiate a transfer from the (` + 1)st level without
involving the CPU when one of its blocks becomes empty. In the remaining theorems,
we assume that the CPU can originate the transfer of a block at level ` given the
address of the block, with suitable delay.
The fastest oblivious algorithm we have found for sorting in UMH_{1/(ℓ+1)} is based on a simple schedule of Batcher's bitonic sort [Akl], where each of the log² N parallel time steps is implemented in O(N log N) time, for an overall running time of O(N log³ N). It is also possible to schedule a recursive version of Columnsort [Lei] on UMH_{1/(ℓ+1)} in a manner that is efficient with respect to the RAM algorithm, but this observation is not very useful, since both algorithms have running time that is O(N log^c N), where c ≥ 3.4.
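For reference, here is a minimal in-memory rendering (ours, in Python) of the oblivious compare-exchange structure that such a schedule choreographs, namely Batcher's bitonic sorting network; the UMH_{1/(ℓ+1)} scheduling of its Θ(log² n) stages is not modelled, and a power-of-two input length is assumed.

    # Batcher's bitonic sorting network, iterative form (illustrative only;
    # the memory-hierarchy scheduling discussed above is not shown).
    # Assumes len(a) is a power of two.
    def bitonic_sort(a):
        n = len(a)
        assert n & (n - 1) == 0, "length must be a power of two"
        k = 2
        while k <= n:                # size of the bitonic runs being merged
            j = k // 2
            while j >= 1:            # distance between compare-exchange pairs
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        if (a[i] > a[partner]) == ascending:
                            a[i], a[partner] = a[partner], a[i]
                j //= 2
            k *= 2
        return a

    print(bitonic_sort([7, 3, 9, 1, 4, 8, 2, 6]))   # [1, 2, 3, 4, 6, 7, 8, 9]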
2.1 Parsimonious sorting in UMH_1
Theorem 1 A variant of merge sort can be scheduled in UMH_1 parsimoniously, assuming α ≥ 2 and ρ ≥ 6.
Proof: The basic idea is to schedule a systolic binary merge sort in such a way that the CPU is always kept busy (after a small initial delay and with a small final delay for propagating the results back). After the initial delay, the CPU (level 0) reads one element from each of the two lists. At every time step after the initial delay, the CPU writes the smaller element to the output list and then reads the next element from the list that had the smaller element at the previous step. We use a double-buffering scheme so that level ℓ, for ℓ ≥ 1, contains room for two blocks from each of the two lists being merged. It also has two blocks for the output list. When level ℓ − 1 requests a subblock from one of the lists, and this request causes level ℓ's buffer to be emptied, then level ℓ requests the next block from level ℓ + 1. In this way, level ℓ always has at least one subblock for level ℓ − 1 available on demand. The output blocks fill up at a known rate, so they can be scheduled in advance (again using double-buffering to keep an empty subblock available for writing from level ℓ − 1). At the end of each list, we immediately begin to send a new list down for the next merge. The CPU can keep track of how many elements have been read from each list, so that when one list is finished it knows to copy the rest of the other list to the output. The number of wasted CPU cycles is only O(log N) = o(N log N), so the schedule is parsimonious. □
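The merge step itself is simple once each level keeps a block of each run buffered. The following toy sketch (ours) shows only the CPU-level loop and a stand-in for "request the next block" from the level above; the double-buffering and its timing, which are the substance of the proof, are not modelled.

    from collections import deque

    # Toy rendering of the merge described above (illustrative): the "CPU"
    # repeatedly emits the smaller front element and refills from whichever
    # run it consumed.  Whole blocks are fetched at a time, standing in for
    # the level-to-level block transfers.
    class BlockedRun:
        def __init__(self, data, block_size):
            self.data = list(data)
            self.block_size = block_size
            self.buffer = deque()
            self.pos = 0

        def _fetch_block(self):
            # Stands in for "level l requests the next block from level l+1".
            self.buffer.extend(self.data[self.pos:self.pos + self.block_size])
            self.pos += self.block_size

        def peek(self):
            if not self.buffer and self.pos < len(self.data):
                self._fetch_block()
            return self.buffer[0] if self.buffer else None

        def pop(self):
            head = self.peek()
            self.buffer.popleft()
            return head

    def merge(run_a, run_b, block_size=4):
        a, b, out = BlockedRun(run_a, block_size), BlockedRun(run_b, block_size), []
        while a.peek() is not None or b.peek() is not None:
            if b.peek() is None or (a.peek() is not None and a.peek() <= b.peek()):
                out.append(a.pop())
            else:
                out.append(b.pop())
        return out

    print(merge([1, 4, 5, 9], [2, 3, 7, 8, 10]))   # [1, 2, 3, 4, 5, 7, 8, 9, 10]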
2.2 Sorting in P-UMH
Theorem 2 Distribution sort algorithms can be scheduled on P-UMH with the following running times. The algorithms for nonconstant P for the first two bandwidth cases are randomized.

Θ((N/P) log N)   if b(ℓ) = 1;

O((N/P) log N · log((log N)/(log P)))   if b(ℓ) = 1/(ℓ + 1);

Θ((N/P)^{1+c/2} + (N/P) log N)   if b(ℓ) = ρ^{−cℓ}, c > 0.
Proof: The lower bound for the first case b(ℓ) = 1 follows from the conventional Ω(N log N) serial bound for sorting on a RAM. With P processors, the P-UMH sorting time can be at most P times faster, giving an Ω((N/P) log N) lower bound. The best known lower bound for the case b(ℓ) = 1/(ℓ + 1) is the same as when b(ℓ) = 1. To prove the lower bound when b(ℓ) = ρ^{−cℓ}, we assume that the N records reside initially in level s = ⌈(1/2) log_ρ(N/P)⌉. It then follows that it takes Ω((N/P)^{1+c/2}) time to get the N records from level s to level s − 1, thus completing the lower bound for the third case.
The algorithm that achieves the upper bound for the first case b(ℓ) = 1 is based on a simulation of the P-BT algorithm for access cost function f(x) = √x given in [ViSb]. The time for the simulation is bounded by a constant times the P-BT running time. Let us consider moving any b + 1 elements from locations x − b, ..., x to locations y − b, ..., y in the BT model with access cost function f(x) = √x. The amount of time taken for this transfer is √x + √y + b. It can be shown that the amount of time taken by a UMH to do the same transfer, with multiple levels of the hierarchy active concurrently, is
√x + (ρ − 1)(√(x − 1) − √y)/√ρ + b,

which is bounded by c(√x + √y + b) for all c ≥ ρ/√ρ.
The other simulations presented in this paper are a bit trickier, since they require that effective use be made of blocking in the UMH simulation, and therefore that the algorithm meet certain constraints. A sufficient constraint is that operations in the algorithm process all the elements at any level in the hierarchy consecutively. It may be convenient for the algorithm to describe the elements as comprising groups that may be unrelated to the block size for that level, but as long as all the elements are accessed consecutively, the intermediate levels of the hierarchy can act as buffers to allow reblocking to occur as needed without losing efficiency, as long as ρ ≥ 3. The algorithms given in [ViSb] all meet this constraint.
For the second case b(ℓ) = 1/(ℓ + 1), the upper bound is related to the P-HMM approach for f(x) = log x [ViSb]. The P-HMM algorithm needs to be modified to
reblock the buckets prior to sorting them recursively, by substituting Step 8 of the P-BT algorithm of [ViSb] into the P-HMM algorithm. The cost of accessing an element at location x in the HMM model is log x; the amortized cost of accessing the same element in the UMH model when an entire block is brought to the base memory level is log(x/ρ)/log ρ, which is within a constant factor of the HMM cost.
The upper bound for the third case b(ℓ) = ρ^{−cℓ} makes use of an algorithm based on a deterministic, two-way merge sort. This algorithm gives rise to the recurrence relation

S(N) = 2 S(N/2) + k (N/P)^{c/2+1}   if N > P;
S(N) = O(log N)   if N ≤ P;

which gives the stated bound. □
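As a quick numerical sanity check (ours; the constant k and the base-case cost are arbitrary stand-ins), the recurrence can be unrolled and compared with the claimed Θ((N/P)^{1+c/2} + (N/P) log N) growth:

    import math

    # Unroll the recurrence above and compare with the claimed bound.
    def S(N, P, c, k=1.0):
        if N <= P:
            return math.log2(max(N, 2))        # stand-in for the O(log N) base case
        return 2 * S(N / 2, P, c, k) + k * (N / P) ** (c / 2 + 1)

    def claimed_bound(N, P, c):
        return (N / P) ** (1 + c / 2) + (N / P) * math.log2(N)

    for N in [2**12, 2**16, 2**20]:
        P, c = 16, 1.0
        print(N, S(N, P, c) / claimed_bound(N, P, c))   # ratios stay bounded as N grows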
The algorithms are optimal, except for the middle b(ℓ) = 1/(ℓ + 1) case, which is off from the best known lower bound of Ω((N/P) log N) by a log((log N)/log P) factor.
3 Sorting in SUMH and RUMH and their Parallelizations
The UMH model can be dicult to program because many buses can be active simul-
taneously. An earlier version of [ACF] introduced a sequential UMH model, appro-
priately called SUMH, that allowed at most one bus to be active at a time. However,
the SUMH restriction can be regarded as too severe, since it forfeits much power of
the UMH model.
We introduce the following more natural and less severe restriction that fits in
nicely with feasible and easy-to-use programming languages: We require that the
UMH program correspond exactly to a RAM program in which the RAM instruction
set is augmented with a block move command that can move t contiguous memory
elements in time t, for arbitrary t. Each such block transfer can be implemented in
UMH by a coordinated series of transfers in which several buses are simultaneously
active but cooperating on that single transfer. We call this natural variant of UMH the
random-access UMH model, or simply RUMH. For example, a block of √N elements can be moved from the top memory level all the way down to the CPU (or anywhere in between) in Θ(√N) time in UMH_1 and RUMH_1, but it requires Θ(√N log N) time in SUMH_1.
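The gap in this example can be seen with a back-of-the-envelope cost model (ours, not the paper's analysis): with cooperating buses the √N elements stream down in Θ(√N) time, whereas with one bus active at a time they must cross each of the roughly log_ρ √N buses separately.

    import math

    # Illustrative costs for moving m contiguous elements from level s down
    # to the CPU at unit bandwidth (rough model; constants are not exact).
    def level_of(m, rho=2):
        # smallest level whose block size rho**l is at least m
        return math.ceil(math.log2(m) / math.log2(rho)) if m > 1 else 0

    def rumh_move_cost(m, rho=2):
        s = level_of(m, rho)
        # cooperating, pipelined buses: about 2m in total, i.e. Theta(m)
        return m + sum(rho**l for l in range(s))

    def sumh_move_cost(m, rho=2):
        s = level_of(m, rho)
        # one bus at a time: the m elements cross every bus below level s
        return m * s

    N = 1 << 20
    m = math.isqrt(N)
    print(rumh_move_cost(m), sumh_move_cost(m))   # e.g. 2047 versus 10240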
The parallel versions of RUMH and SUMH are called P-RUMH and P-SUMH,
respectively. Theorems 3 and 4 give matching upper and lower bounds for sorting
in the RUMH and SUMH models and their parallelizations. The structures of the
formulas in Theorems 3 and 4 suggest several different relationships between the
RUMH and SUMH models on the one hand and the HMM, BT, and two-level models
on the other hand (cf. Theorems 5 and 6 in [ViSa]); accordingly the upper and lower
bounds combine in an interesting way several techniques from [AAC, ACSa, AgV,
ViSa].
Theorem 3 The running times mentioned in Theorem 2 are matching upper and
lower bounds for sorting in P-RUMH. The algorithms for nonconstant P for the first
two bandwidth cases are randomized.
Proof: The upper bounds all follow directly from the proof of Theorem 2, since all the algorithms given there are P-RUMH algorithms.
The lower bounds for b(ℓ) = 1 and b(ℓ) = ρ^{−cℓ} are the same as and follow directly from those for P-UMH. When b(ℓ) = 1/(ℓ + 1), we can prove a tight lower bound by simulating RUMH_{1/(ℓ+1)} by HMM with access cost function f(x) = log x. Specifically, any block transfer of ρ^{ℓ−1} elements from level ℓ to level ℓ′, where ℓ > ℓ′, will take an amount of time that is

[ρ^{ℓ−1}((ℓ − 2)(ρ − 1) − 1) − ρ^{ℓ′}((ℓ′ − 1)(ρ − 1) − 1)] / (ρ − 1)² + ℓρ^{ℓ−1} = Θ(ℓρ^{ℓ−1}).
Doing the same move in the HMM model requires time at most

ρ^{ℓ−1}(log(ρ^{2ℓ}) + log(ρ^{2(ℓ′+1)})) = O(ℓρ^{ℓ−1}),

so the simulation by HMM is bounded by a constant times the RUMH_{1/(ℓ+1)} running time. Hence, the lower bound for P-HMM for f(x) = log x given in [ViSb] also holds for P-RUMH_{1/(ℓ+1)}. □
Theorem 4 The following bounds are matching upper and lower bounds for sorting in P-SUMH. The algorithms for nonconstant P for the first two bandwidth cases are randomized.

Θ((N/P) log N · log((log N)/(log P)))   if b(ℓ) = 1;

Θ((N/P) log N · log(N/P))   if b(ℓ) = 1/(ℓ + 1);

Θ((N/P)^{1+c/2} + (N/P) log N)   if b(ℓ) = ρ^{−cℓ}, c > 0.
Proof: We prove the lower bounds using an approach similar to that of [ViSb]. Let us define the "sequential time" of a P-SUMH algorithm to be the sum of its time costs for each of the P hierarchies. The sequential time can be at most P times the P-SUMH running time. We superimpose on the P-SUMH model a sequence of one-disk, two-level memories of the type studied in [AgV, ViSb], in the following way: For 1 ≤ ℓ ≤ (1/2) log_ρ(N/P), the ℓth two-level memory has one disk, internal memory size M_ℓ = P(ρ^{2(ℓ+1)} − 1)/(ρ² − 1), and block size B_ℓ = ρ^ℓ. An I/O in the ℓth two-level memory corresponds to a single block transfer between levels ℓ and ℓ + 1 in one of the hierarchies in the P-SUMH model, which requires sequential time

C_ℓ = B_ℓ   if b(ℓ) = 1;
C_ℓ = B_ℓ log B_ℓ   if b(ℓ) = 1/(ℓ + 1);
C_ℓ = B_ℓ^{1+c}   if b(ℓ) = ρ^{−cℓ}, c > 0.
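The sketch below (ours) simply tabulates these per-level quantities; the constant α from the UMH definition is dropped, which only affects constant factors, and the 1/(ℓ + 1) case uses B_ℓ(ℓ + 1) = Θ(B_ℓ log B_ℓ).

    # Tabulate M_l, B_l, and C_l for the superimposed two-level memories.
    def M(ell, P, rho=2):
        # internal memory size of the ell-th two-level memory
        return P * (rho**(2 * (ell + 1)) - 1) // (rho**2 - 1)

    def B(ell, rho=2):
        # block size of the ell-th two-level memory
        return rho**ell

    def C(ell, case, rho=2, c=1.0):
        # sequential-time cost of one I/O for the three bandwidth cases
        if case == "b = 1":
            return B(ell, rho)
        if case == "b = 1/(l+1)":
            return B(ell, rho) * (ell + 1)        # Theta(B_l log B_l)
        if case == "b = rho^(-c l)":
            return B(ell, rho) ** (1 + c)
        raise ValueError(case)

    for ell in range(1, 5):
        print(ell, M(ell, P=16), B(ell),
              C(ell, "b = 1"), C(ell, "b = 1/(l+1)"), C(ell, "b = rho^(-c l)"))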
The minimum number of I/Os required for sorting in the ℓth two-level memory is

Ω( (N/B_ℓ) · log(N/B_ℓ)/log(M_ℓ/B_ℓ) − M_ℓ/B_ℓ ),
as shown in [AgV]. Each such I/O contributes C_ℓ to the sequential time in the P-SUMH model, since in the P-SUMH model only one level can be active at a time in each hierarchy. This gives a lower bound on the P-SUMH sequential time:
T(N) = Ω( Σ_{1 ≤ ℓ ≤ (1/2) log_ρ(N/P)}  C_ℓ · ( (N/B_ℓ) · log(N/B_ℓ)/log(M_ℓ/B_ℓ) − M_ℓ/B_ℓ ) ).
We get the desired lower bound on the P-SUMH time by substituting the values of
M_ℓ, B_ℓ, and C_ℓ for the three cases into the above summation, and then dividing by P.
The b(ℓ) = ρ^{−cℓ} case in addition requires the use of the conventional Ω(N log N) serial bound for sorting.
The upper bounds for the first two cases b(ℓ) = 1 and b(ℓ) = 1/(ℓ + 1) are achieved by simulating the optimal P-HMM algorithm of [ViSb], for access cost functions f(x) = log x and f(x) = log² x, respectively. Since a UMH in each case can simulate an HMM with the appropriate cost function in a running time that is at most a constant times the HMM time, the P-HMM bound holds for the P-UMH simulation. The upper bound for the b(ℓ) = ρ^{−cℓ} case is achieved by the same deterministic merge sort as in the previous theorems. □
4 Conclusions
We have given optimal or near-optimal sorting algorithms for UMH and for its parallelization P-UMH, which we have introduced. We have derived tight matching
upper and lower bounds for sorting in the restricted models RUMH and SUMH and
their parallelizations. Some of the algorithms are randomized. The RUMH model
is particularly useful because it is easy to visualize and it matches well with current
programming languages and compilers.
An interesting open problem is whether it is possible to sort in O(N log N) time with the UMH_{1/(ℓ+1)} model. The related FFT computation can be done in UMH_{1/(ℓ+1)} in O(N log N) time. Another open problem is whether a parsimonious oblivious algorithm can be found to replace our non-oblivious one in UMH_1, or whether deterministic algorithms can be found to replace the randomized ones.
References

[AAC] A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir, "A Model for Hierarchical Memory," Proceedings of the 19th Annual ACM Symposium on Theory of Computing (May 1987), 305-314.

[ACSa] A. Aggarwal, A. Chandra, and M. Snir, "Hierarchical Memory with Block Transfer," Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science (October 1987), 204-216.

[AgV] A. Aggarwal and J. S. Vitter, "The Input/Output Complexity of Sorting and Related Problems," Communications of the ACM (September 1988), 1116-1127; also appears in Proceedings of the 14th Annual International Colloquium on Automata, Languages, and Programming (ICALP), LNCS 267, Springer-Verlag, Berlin, 1987.

[Akl] S. G. Akl, Parallel Sorting Algorithms, Notes and Reports in Computer Science and Applied Mathematics #12, Academic Press, Inc., Orlando, 1985.

[ACF] B. Alpern, L. Carter, and E. Feig, "Uniform Memory Hierarchies," Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science (October 1990), 600-608.

[ACSb] B. Alpern, L. Carter, and T. Selker, "Visualizing Computer Memory Architectures," Proceedings of the 1990 IEEE Visualization Conference (October 1990).

[Lei] T. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," IEEE Transactions on Computers C-34 (April 1985), 344-354; also appears in Proceedings of the 16th Annual ACM Symposium on Theory of Computing (April 1983), 71-80.

[LuP] F. Luccio and L. Pagli, "A Model of Sequential Computation Based on a Pipelined Access to Memory," Proceedings of the 27th Annual Allerton Conference on Communication, Control, and Computing (September 1989).

[ReV] J. H. Reif and L. G. Valiant, "A Logarithmic Time Sort on Linear Size Networks," Journal of the ACM 34 (January 1987), 60-76; also appears in Proceedings of the 15th Annual ACM Symposium on Theory of Computing (April 1983), 10-16.

[ViSa] J. S. Vitter and E. A. M. Shriver, "Optimal Disk I/O with Parallel Block Transfer," Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (May 1990), 159-169.

[ViSb] J. S. Vitter and E. A. M. Shriver, "Algorithms for Parallel Memory II: Hierarchical Multilevel Memories," Technical Report CS-90-22, Brown University, September 1990.
