In-Place Parallel-Partition Algorithms using Exclusive-Read-and-Write
  Memory by Kuszmaul, William & Westover, Alek
ar
X
iv
:2
00
4.
12
53
2v
2 
 [c
s.D
S]
  8
 Ju
l 2
02
0
In-Place Parallel-Partition Algorithms
using Exclusive-Read-and-Write Memory
William Kuszmaul
∗1
and Alek Westover
†1
1MIT
kuszmaul@mit.edu, alek.westover@gmail.com
Abstract
We present an in-place algorithm for the parallel partition
problem that has linear work and polylogarithmic span.
The algorithm uses only exclusive read/write shared vari-
ables, and can be implemented using parallel-for-loops
without any additional concurrency considerations (i.e.,
the algorithm is EREW). A key feature of the algorithm
is that it exhibits provably optimal cache behavior, up to
small-order factors.
We also present a second in-place EREW algorithm
that has linear work and span O(log n · log logn), which
is within an O(log logn) factor of the optimal span. By
using this low-span algorithm as a subroutine within the
cache-friendly algorithm, we are able to obtain a single
EREW algorithm that combines their theoretical guaran-
tees: the algorithm achieves span O(log n · log logn) and
optimal cache behavior. As an immediate consequence,
we also get an in-place EREW quicksort algorithm with
work O(n log n), span O(log2 n · log logn).
Whereas the standard EREW algorithm for parallel
partitioning is memory-bandwidth bound on large num-
bers of cores, our cache-friendly algorithm is able to
achieve near-ideal scaling in practice by avoiding the
memory-bandwidth bottleneck. The algorithm’s perfor-
mance is comparable to that of the Blocked Strided Algo-
rithm of Francis, Pannan, Frias, and Petit, which is the
previous state-of-the art for parallel EREW sorting algo-
rithms, but which lacks theoretical guarantees on its span
and cache behavior.
1 Introduction
A parallel partition operation rearranges the elements
in an array so that the elements satisfying a particular
pivot property appear first. In addition to playing a cen-
tral role in parallel quicksort, the parallel partition opera-
tion is used as a primitive throughout parallel algorithms.1
A parallel algorithm can be measured by its work ,
∗Supported by a Hertz fellowship and a NSF GRFP fellowship
†Supported by MIT
1In several well-known textbooks and surveys on parallel
algorithms [2, 7], for example, parallel partitions are implicitly used
extensively to perform what are referred to as filter operations.
the time needed to execute in serial, and its span , the
time to execute on infinitely many processors. There is
a well-known algorithm for parallel partition on arrays of
size n that we call the Standard Algorithm with work
O(n) and span O(log n) [7, 2]. Moreover, the algorithm
uses only exclusive read/write shared memory variables
(i.e., it is an EREW algorithm). This eliminates the
need for concurrency mechanisms such as locks and
atomic variables, and ensures good behavior even if the
time to access a location is a function of the number of
threads trying to access it (or its cache line) concurrently.
EREW algorithms also have the advantage that their
behavior is internally deterministic, meaning that the
behavior of the algorithm will not differ from run to run,
which makes test coverage, debugging, and reasoning
about performance substantially easier [8].
The Standard Algorithm suffers from using a large
amount of auxiliary memory, however. Whereas the serial
partition algorithm is typically implemented in place, the
Standard Algorithm relies on the use of multiple auxiliary
arrays of size n. To the best of our knowledge, the only
known linear-work and polylog(n)-span algorithms for
parallel partition that are in-place require the use of
atomic operations (e.g, fetch-and-add) [21, 6, 29].
An algorithm’s memory efficiency can be critical on
large inputs. The memory consumption of an algorithm
determines the largest problem size that can be executed
in memory. Many external memory algorithms (i.e.,
algorithms for problems too large to fit in memory)
perform large subproblems in memory; the size of these
subproblems is again bottlenecked by the algorithm’s
memory-overhead [30]. In multi-user systems, processes
with larger memory-footprints can hog the cache and the
memory bandwidth, slowing down other processes.
For sorting algorithms, in particular, special attention
to memory efficiency is often given. This is because (a)
a user calling the sort function may already be using
almost all of the memory in the system; and (b) sorting
algorithms, and especially parallel sorting algorithms,
are often bottlenecked by memory bandwidth. The
latter property, in particular, means that any parallel
sorting algorithm that wishes to achieve state-of-the art
performance on a large multi-processor machine must be
(at least close to) in place.
1
William Kuszmaul and Alek Westover
Currently the only practical in-place parallel sorting
algorithms either rely heavily on concurrencymechanisms
such as atomic operations [21, 6, 29], or abstain from
theoretical guarantees [14]. Parallel merge sort [18] was
made in-place by Katajainen [22], but has proven too
sophisticated for practical applications. Bitonic sort
[9] is naturally in-place, and can be practical in certain
applications on super computers, but suffers in general
from requiring work Θ(n log2 n) rather than O(n logn).
Parallel quicksort, on the other hand, despite the many
efforts to optimize it [21, 6, 29, 14, 15], has eluded any
in-place EREW (or CREW2) algorithms due to its
reliance on parallel partition.
Results. We consider the problem of designing a theoret-
ically efficient parallel-partition algorithm that also per-
forms well in practice. All of the algorithms considered
in this paper use only exclusive read/write shared vari-
ables, and can be implemented using CILK parallel-for-
loops without any additional concurrency considerations.
We give a simple in-place parallel-partition algorithm
called the Smoothed Striding Algorithm , with linear
work and polylogarithmic span. Additionally, the algo-
rithm exhibits provably optimal cache behavior up to
low-order terms. In particular, if the input consists of m
cache lines, then the algorithm incurs at mostm(1+o(1))
cache misses, with high probability in m.
We also develop a suite of techniques for transforming
the standard linear-space parallel partition algorithm into
an in-place algorithm. The new algorithm, which we call
the Blocked-Prefix-Sum Partition Algorithm , has
work O(n) and span O(log n · log logn), which is within a
log logn factor of optimal. As an immediate consequence,
we also get an in-place quicksort algorithm with work
O(n log n) and span O(log2 n log logn). Moreover, we
show that, by using the Blocked-Prefix-Sum Partition
Algorithm as a subroutine within the Smoothed Striding
Algorithm, one can combine the theoretical guarantees of
the two algorithms, achieving a span of O(log n log logn),
while also achieving optimal cache behavior up to
low-order terms.
In addition to analyzing the algorithms, we experimen-
tally evaluate their performances in practice. We find that
an algorithm based on the Blocked-Prefix-Sum Partition
Algorithm is able to achieve speedups over the standard
linear-space algorithm due to increased cache efficiency.
The Blocked-Prefix-Sum Partition Algorithm does not
exhibit optimal cache behavior, however, and as we show
in our experimental evaluation, the algorithm remains
bottlenecked by memory throughput. In contrast, the
cache-optimality of the Smoothed Striding Algorithm
eliminates the memory-throughput bottleneck, allowing
for nearly perfect scaling on many processors.
The memory-bandwidth bottleneck previously led
researchers [14, 15] to introduce the Strided Algorithm ,
2In a CREW algorithm, reads may be concurrent, but writes
may not. CREW stands for concurrent-read exclusive-write.
which has near optimal cache behavior in practice, but
which exhibits theoretical guarantees only on certain
random input arrays. The Smoothed Striding Algorithm
is designed to have similar empirical performance to
the Strided Algorithm, while achieving both theoretical
guarantees on work/span and on cache-optimality. This
is achieved by randomly perturbing the internal structure
of the Strided Algorithm, and adding a recursion step that
was previously not possible. Whereas the Strided Algo-
rithm comes with theoretical guarantees only for certain
inputs, the Smoothed Striding Algorithm has polyloga-
rithmic span, and exhibits provably optimal cache behav-
ior up to small-order factors for all inputs. In practice,
the Smoothed Striding Algorithm performs within 15%
of the Strided Algorithm on a large number of threads.
Outline. We begin in Section 2 by discussing background
on parallel algorithms and the Parallel Partition Prob-
lem. In Section 4, we present and analyze the Smoothed
Striding Algorithm. In Section 3 we present and analyze
the Blocked-Prefix-Sum Partition Algorithm, which
achieves a nearly optimal span of O(log n log logn) but
is not cache-optimal; this algorithm can be used as a
subroutine within the Smoothed Striding Algorithm
to achieve the same span. Section 5 implements the
algorithms from this paper, along with the Strided
Algorithm and the standard linear-space algorithm, in
order to experimentally evaluate their performances.
Finally, we conclude with open questions in Section 6.
2 Preliminaries
We begin by describing the parallelism and memory
model used in the paper, and by presenting background
on the parallel partition problem.
WorkflowModel. We consider a simple language-based
model of parallelism in which algorithms achieve paral-
lelism through the use of parallel-for-loops (see, e.g.,
[7, 2, 13]); function calls within the inner loop then allow
for more complicated parallel structures (e.g., recursion).
Our algorithms can also be implemented in the PRAM
model [7, 2].
Formally, a parallel-for-loop is given a range sizeR ∈ N,
a constant number of arguments arg1, arg2, . . . , argc, and
a body of code. For each i ∈ {1, . . . , R}, the loop launches
a thread that is given loop-counter i and local copies of
the arguments arg1, arg2, . . . , argc. The threads are then
taken up by processors and the iterations of the loop are
performed in parallel. Only after every iteration of the
loop is complete can control flow continue past the loop.
A parallel algorithm may be run on an arbitrary
number p of processors. The algorithm itself is oblivious
to p, however, leaving the assignment of threads to
processors up to a scheduler.
The work T1 of an algorithm is the time that the
algorithm would require to execute on a single processor.
2
William Kuszmaul and Alek Westover
The span T∞ of an algorithm is the time to execute on
infinitely many processors. The scheduler is assumed to
contribute no overhead to the span. In particular, if each
iteration of a parallel-for-loop has span s, then the full
parallel loop has span s+O(1) [7, 2].
The work T1 and span T∞ can be used to quantify
the time Tp that an algorithm requires to execute on
p processors using a greedy online scheduler. If the
scheduler is assumed to contribute no overhead, then
Brent’s Theorem [12] states that for any p,
max(T1/p, T∞) ≤ Tp ≤ T1/p+ T∞.
The work-stealing algorithms used in the Cilk exten-
sion of C/C++ realize the guarantee offered by Brent’s
Theorem within a constant factor [10, 11], with the
added caveat that parallel-for-loops typically induce an
additional additive overhead of O(logR).
Memory Model. Memory is exclusive-read and
exclusive-write. That is, no two threads are ever
permitted to attempt to read or write to the same
variable concurrently. The exclusive-read exclusive-write
memory model is sometime referred to as the EREW
model (see, e.g., [18]).
Note that threads are not in lockstep (i.e., they may
progress at arbitrary different speeds), and thus the
EREW model requires algorithms to be data-race free
in order to avoid the possibility of non-exclusive data
accesses.
In an in-place algorithm, each thread is given
O(polylog n) memory upon creation that is deallocated
when the thread dies. This memory can be shared
with the thread’s children. However, the depth of the
parent-child tree is not permitted to exceedO(polylog n).
Whereas the EREW memory model prohibits concur-
rent accesses to memory, on the other side of the spectrum
are CRCW (concurrent-read-concurrent-write) models,
which allow for both reads and writes to be performed
concurrently (and in some variants even allow for atomic
operations) [7, 2, 25]. One approach to designing efficient
EREW algorithms is to simulate efficient CRCW algo-
rithms in the EREW model [25]. The known simulation
techniques require substantial space overhead, however,
preventing the design of in-place algorithms [25].3
In addition to being the first in-place and
polylogarithmic-span EREW algorithms for the parallel-
partition problem, our algorithms are also the first such
CREW algorithms. In a CREW algorithm, reads may
be concurrent, but writes may not – CREW stands
for concurrent-read exclusive-write. In practice, the
important property of our algorithms is that they avoid
concurrent writes (which can lead to non-determinacy
and cache ping-pong effects).
3The known simulation techniques also increase the total work
in the original algorithm, although this can be acceptable if only a
small number of atomic operations need to be simulated.
The Parallel Partition Problem. The paral-
lel partition problem takes an input array A =
(A[1], A[2], . . . , A[n]) of size n, and a decider function
dec that determines for each element A[i] ∈ A whether
or not A[i] is a predecessor or a successor . That is,
dec(A[i]) = 1 if A[i] is a predecessor, and dec(A[i]) = 0 if
A[i] is a successor. The behavior of the parallel partition
is to reorder the elements in the array A so that the
predecessors appear before the successors. Note that, in
this paper, we will always treat arrays as 1-indexed.
The (Standard) Linear-Space Parallel Partition.
The linear-space implementation of parallel partition
consists of two phases [7, 2]:
The Parallel-Prefix Phase: In this phase, the algo-
rithm first creates an array D whose i-th element
D[i] = dec(A[i]). Then the algorithm constructs an array
S whose i-th element S[i] =
∑i
j=1D[i] is the number
of predecessors in the first i elements of A. The trans-
formation from D to S is called a parallel prefix sum
and can be performed with O(n) work and O(log n) span
using a simple recursive algorithm: (1) First construct an
array D′ of size n/2 with D′[i] = D[2i − 1] + D[2i]; (2)
Recursively construct a parallel prefix sum S′ of D′; (3)
Build S by setting each S[i] = S′[⌊i/2⌋] + A[i] for odd i
and S[i] = S′[i/2] for even i.
The Reordering Phase: In this phase, the algorithm
constructs an output-arrayC by placing each predecessor
A[i] ∈ A in position S[i] of C. If there are t predecessors
inA, then the first t elements ofC will now contain those t
predecessors in the same order that they appear inA. The
algorithm then places each successor A[i] ∈ A in position
t+ i−S[i]. Since i−S[i] is the number of successors in the
first i elements of A, this places the successors in C in the
same order that they appear in A. Finally, the algorithm
copies C into A, completing the parallel partition.
Both phases can be implemented with O(n) work and
O(log n) span. Like its serial out-of-place counterpart,
the algorithm is stable but not in place. The algorithm
uses multiple auxiliary arrays of size n. Kiu, Knowles,
and Davis [24] were able to reduce the extra space con-
sumption to n+ p under the assumption that the number
of processors p is hard-coded; their algorithm breaks the
array A into p parts and assigns one part to each thread.
Reducing the extra space below o(n) has remained open
until now, even when the number of threads is fixed.
3 An In-Place Partition Algo-
rithm with Span O(logn log logn)
In this section, we present an in-place algorithm for par-
allel partition with span O(log n log logn). Each thread
in the algorithm requires memory at most O(log n).
Prior to beginning the algorithm, the first implicit step
of the algorithm is to count the number of predecessors
in the array, in order to determine whether the major-
3
William Kuszmaul and Alek Westover
ity of elements are either predecessors or successors.
Throughout the rest of the section, we assume without
loss of generality that the total number of successors in A
exceeds the number of predecessors, since otherwise their
roles can simply be swapped in the algorithm. Further, we
assume for simplicity that the elements of A are distinct;
this assumption is removed at the end of the section.
Algorithm Outline. We begin by presenting an
overview of the key algorithmic ideas needed to construct
an in-place algorithm.
Consider how to remove the auxiliary array C from
the Reordering Phase. If one attempts to simply swap in
parallel each predecessorA[i] with the element in position
j = S[i] ofA, then the swaps will almost certainly conflict.
Indeed, A[j] may also be a predecessor that needs to be
swapped with A[S[j]]. Continuing like this, there may be
an arbitrarily long list of dependencies on the swaps.
To combat this, we begin the algorithm with a Prepro-
cessing Phase in whichA is rearranged so that every prefix
is successor-heavy , meaning that for all t, the first t
elements contain at least t4 successors. Then we compute
the prefix-sum array S, and begin the Reordering Phase.
Using the fact that the prefixes of A are successor-heavy,
the reordering can now be performed in place as follows:
(1) We begin by recursively reordering the prefix P of A
consisting of the first 4/5 · n elements, so that the prede-
cessors appear before the successors; (2) Then we simply
swap each predecessor A[i] in the final 1/5 · n elements
with the corresponding element S[A[i]]. The fact that the
prefix P is successor-heavy ensures that, after step (1),
the final 15 ·n elements of (the reordered)P are successors.
This implies in step (2) that for each of the swaps between
predecessors A[i] in the final 1/5 · n elements and earlier
positions S[A[i]], the latter element will be in the prefix
P . In other words, the swaps are now conflict free.
Next consider how to remove the array S from the
Parallel-Prefix Phase. At face value, this would seem
quite difficult since the reordering phase relies heavily on
S. Our solution is to implicitly store the value of every
O(log n)-th element of S in the ordering of the elements of
A. That is, we breakA into blocks of sizeO(log n), and use
the order of the elements in each block to encode an entry
of S. (If the elements are not all distinct, then a slightly
more sophisticated encoding is necessary.) Moreover, we
modify the algorithm for building S to only construct
every O(log n)-th element. The new parallel-prefix sum
performsO(n/ log n) arithmetic operations on values that
are implicitly encoded in blocks; since each such operation
requires O(log n) work, the total work remains linear.
In the remainder of the section, we present the algorithm
in detail, and prove the key properties of each phase of the
algorithm. We also provide detailed pseudocode in Figure
1 and Figure 2. The algorithm proceeds in three phases.
A Preprocessing Phase. The goal of the Preprocessing
phase is to make every prefix of A successor-heavy. To
perform the Preprocessing phase on A, we begin with
a parallel-for-loop: For each i = 1, . . . , ⌊n/2⌋, if A[i] is
a predecessor and A[n − i + 1] is a successor, then we
swap their positions in A. To complete the Preprocessing
phase on A, we then recursively perform a Preprocessing
phase on A[1], . . . , A[⌈n/2⌉].
Lemma 3.1. The Preprocessing Phase has workO(n) and
span O(log n). At the end of the Preprocessing Phase,
every prefix of A is successor-heavy.
Proof. Recall that for each t ∈ 1, . . . , n, we call the
t-prefix A[1], . . . , A[t] of A successor-heavy if it contains
at least t4 successors.
The first parallel-for-loop ensures that at least half
the successors in A reside in the first ⌈n/2⌉ positions,
since for i = 1, . . . , ⌊n/2⌋, A[n − i + 1] will only be a
successor if A[i] is also a successor. Because at least half
the elements in A are successors, it follows that the first
⌈n/2⌉ positions contain at least ⌈n/4⌉ successors, making
every t-prefix with t ≥ ⌈n/2⌉ successor-heavy.
After the parallel-for-loop, the first ⌈n/2⌉ positions
of A contain at least as many successors as predecessors
(since ⌈n/4⌉ ≥ ⌈n/2⌉2 ). Thus we can recursively apply the
argument above in order to conclude that the recursion
onA[1], . . . , A[⌈n/2⌉] makes every t-prefix with t ≤ ⌈n/2⌉
successor-heavy. It follows that, after the recursion, every
t-prefix of A is successor-heavy.
Each recursive level has constant span and performs
work proportional to the size of the subarray being
considered. The Preprocessing phase therefore has total
work O(n) and span O(log n).
An Implicit Parallel Prefix Sum. Pick a block-size
b ∈ Θ(logn) satisfying b ≥ 2⌈log(n+1)⌉. ConsiderA as a
series of ⌊n/b⌋ blocks of size b, with the final block of size
between b and 2b−1. Denote the blocks byX1, . . . , X⌊n/b⌋.
Within each blockXi, we can implicitly store a value in
the range 0, . . . , n through the ordering of the elements:
Lemma 3.2. Given an arrayX of 2⌈log(n+1)⌉ distinct el-
ements, and a value v ∈ {0, . . . , n}, one can rearrange the
elements of X to encode the bits of v using work O(log n)
and span O(log logn); and one can then later decode v
from X using work O(log n) and span O(log logn).
Proof. Observe that X can be broken into (at least)
⌈log(n + 1)⌉ disjoint pairs of adjacent elements
(x1, x2), (x3, x4), . . ., and by rearranging the order in
which a given pair (xj , xj+1) occurs, the lexicographic
comparison of whether xj < xj+1 can be used to encode
one bit of information. Values v ∈ [0, n] can therefore be
read andwritten toX withworkO(b) = O(log n) and span
O(log b) = O(log logn) using a simple divide-and-conquer
recursive approach to encode and decode the bits of v.
To perform the Parallel Prefix Sum phase, our algo-
rithm begins by performing a parallel-for loop through
the blocks, and storing in each block Xi a value vi equal
4
William Kuszmaul and Alek Westover
Figure 1: Blocked-Prefix-Sum Partition Algorithm: Helper Functions
procedureWriteToBlock(A, b, i, v) ⊲Write value v to the i-th blockXi ofA, where A = X1 ◦X2 ◦ · · · ◦X⌊n/b⌋
for all j ∈ {1, 2, . . . , ⌊b/2⌋} in parallel do
if 1Xi[2j]<Xi[2j+1] 6= (the j-th binary digit of v) then
Swap Xi[2j] and Xi[2j + 1]
end if
end for
end procedure
procedure ReadFromBlock(A, i, j) ⊲ Reads the value v stored in A[i], A[i+ 1], . . . , A[j]
if j − i = 2 then
return 1A[i]<A[i+1]
else
Parallel-Spawn v0 ← ReadFromBlock(A, i, i+ (j − i)/2)
Parallel-Spawn vf ← ReadFromBlock(A, i+ (j − i)/2 + 1, j)
Parallel-Sync
return vf · 2 j−i4 + v0
end if
end procedure
Require: A has more successors than predecessors
Ensure: Each prefix of A is “successor heavy”
procedureMakeSuccessorHeavy(A, n)
for all i ∈ {1, 2, . . . , ⌊n/2⌋} in parallel do
if A[i] is a predecessor and A[n− i+ 1] is a successor then
Swap A[i] and A[n− i+ 1]
end if
end for
MakeSuccessorHeavy(A, ⌈n/2⌉) ⊲ Recurse on A[1], A[2], . . . , A[⌈n/2⌉]
end procedure
to the number of predecessors in the block. (This can be
done in place with work O(n) and span O(log logn) by
Lemma 3.2.)
The algorithm then performs an in-place parallel-prefix
operation on the values v1, . . . , v⌊n/b⌋ stored in the
blocks. This is done by first resetting each even-indexed
value v2i to v2i + v2i−1; then recursively performing a
parallel-prefix sum on the even-indexed values; and then
replacing each odd-indexed v2i+1 with v2i+1 + v2i, where
v0 is defined to be zero.
Lemma 3.3 analyzes the phase:
Lemma 3.3. The Parallel Prefix Sum phase uses work
O(n) and span O(log n log logn). At the end of the
phase, eachXi encodes a value vi counting the number of
predecessors in the prefixX1◦X2◦· · ·◦Xi; and each prefix
of blocks (i.e., each prefix of the form X1 ◦X2 ◦ · · · ◦Xi)
is successor-heavy.
Proof. If the vi’s could be read and written in constant
time, then the prefix sum would take work O(n/ logn)
and span O(log n), since there are O(n/ logn) vi’s.
Because each vi actually requires work O(log n) and span
O(log log n) to read/write (by Lemma 3.2), the prefix
sum takes work O(n) and span O(log n · log logn).
Once the prefix-sum has been performed, every block
Xi encodes a value vi counting the number of predecessors
in the prefix X1 ◦ X2 ◦ · · · ◦ Xi. Moreover, because the
Parallel Prefix Sum phase only rearranges elements
within each Xi, Lemma 3.1 ensures that each prefix of
the form X1 ◦X2 ◦ · · · ◦Xi remains successor-heavy.
In-Place Reordering. In the final phase of the al-
gorithm, we reorder A so that the predecessors appear
before the successors. Let P = X1 ◦X2 ◦ · · · ◦Xt be the
smallest prefix of blocks that contains at least 4/5 of the
elements in A. We begin by recursively reordering the
elements in P so that the predecessors appear before the
successors; as a base case, when |P | ≤ 5b = O(log n), we
simply perform the reordering in serial.
To complete the reordering of A, we perform a parallel-
for-loop through each of the blocksXt+1, . . . , X⌊n/b⌋. For
each blockXi, we first extract vi (with work O(log n) and
span O(log logn) using Lemma 3.2). We then create an
auxiliary array Yi of size |Xi|, using O(log n) thread-local
memory. Using a parallel-prefix sum (with work O(log n)
and span O(log logn)), we set each Yi[j] equal to vi
plus the number of predecessors in Xi[1], . . . , Xi[j]. In
other words, Yi[j] equals the number of predecessors in A
5
William Kuszmaul and Alek Westover
Figure 2: Blocked-Prefix-Sum Partition Algorithm: Main Functions
Require: Each prefix of A is “successor heavy”
Ensure: Each block Xi stores how many predecessors occur in X1 ◦X2 ◦ · · · ◦Xi
procedure ImplicitParallelPrefixSum(A, n)
Pick b ∈ Θ(logn) to be the ”block size”
Logically Partition A into blocks, with A = X1 ◦X2 ◦ · · · ◦X⌊n/b⌋
for all i ∈ {1, 2, . . . , ⌊n/b⌋} in parallel do
vi ← 0 ⊲ vi will store number of predecessors in Xi
for all a ∈ Xi in serial do
if a is a predecessor then
vi ← vi + 1
end if
end for
WriteToBlock(A, b, i, vi) ⊲ Now we encode the value vi in the block Xi
end for
Perform a parallel prefix sum on the values vi stored in the Xi’s
end procedure
Require: Each block Xi stores how many predecessors occur in X1 ◦X2 ◦ · · · ◦Xi
Ensure: A is partitioned
procedure Reorder(A, n)
t← least integer such that t · b > n · 4/5
Reorder(A, t) ⊲ Recurse on A[1], A[2], . . . , A[t]
for all i ∈ {t+ 1, t+ 2, . . . , ⌊n/b⌋} do
vi ← ReadFromBlock(A, b · i+ 1, b · (i+ 1))
Instantiate an array Yi with |Yi| = |Xi| ∈ Θ(logn),
In parallel, set Yi[j]← 1 if Xi[j] is a predecessor, and Yi[j]← 0 otherwise.
Perform a parallel prefix sum on Yi, and add vi to each Yi[j]
for all j ∈ {1, 2, . . . , b} do
if Xi[j] is a predecessor then
Swap Xi[j] and A[Yi[j]]
end if
end for
end for
end procedure
procedure ParallelPartition(A, n)
k ← count number of successors in A in parallel
if k < n/2 then
Swap the role of successors and predecessors in the algorithm (i.e. change the decider function)
At the end we consider A′[i] = A[n− i+ 1], the logically reversed array, as output
end if
MakeSuccessorHeavy(A, n) ⊲ prepreocessing phase
ImplicitParallelPrefixSum(A, n) ⊲ Implicit Parallel Prefix Sum
Reorder(A, n) ⊲ In-Place Reordering Phase
end procedure
appearing at or before Xi[j].
After creating Yi, we then perform a parallel-for-loop
through the elements Xi[j] of Xi (note we are still within
another parallel loop through the Xi’s), and for each
predecessorXi[j], we swap it with the element in position
Yi[j] of the array A. This completes the algorithm.
Lemma 3.4. The Reordering phase takes work O(n) and
span O(log n log log n). At the end of the phase, the array
A is fully partitioned.
Proof. After P has been recursively partitioned, it will
be of the form P1 ◦ P2 where P1 contains only prede-
cessors and P2 contains only successors. Because P was
successor-heavy before the recursive partitioning (by
Lemma 3.3), we have that |P2| ≥ |P |/4, and thus that
|P2| ≥ |Xt+1 ◦ · · · ◦X⌊n/b⌋|.
6
William Kuszmaul and Alek Westover
After the recursion, the swaps performed by the algo-
rithm will swap the i-th predecessor inXt+1 ◦ · · · ◦X⌊n/b⌋
with the i-th element in P2, for i from 1 to the
number of predecessors in Xt+1 ◦ · · · ◦ X⌊n/b⌋. Because
|P2| ≥ |Xt+1◦· · ·◦X⌊n/b⌋| these swaps are guaranteed not
to conflict with one-another; and since P2 consists of suc-
cessors, the final state of arrayA will be fully partitioned.
The total work in the reordering phase is O(n) since
each Xi appears in a parallel-for-loop at exactly one level
of the recursion, and incurs O(log n) work. The total
span of the reordering phase is O(log n · log logn), since
there are O(log n) levels of recursion, and within each
level of recursion each Xi in the parallel-for-loop incurs
span O(log logn).
Combining the phases, the full algorithm has work
O(n) and span O(log logn). Thus we have:
Theorem 3.1. There exists an in-place algorithm using
exclusive-read-write variables that performs parallel-
partition with work O(n) and span O(log n · log logn).
Allowing for Repeated Elements. In proving The-
orem 3.1 we assumed for simplicity that the elements of
A are distinct. To remove this assumption, we conclude
the section by proving a slightly more complex variant
of Lemma 3.2, eliminating the requirement that the
elements of the arrayX be distinct:
Lemma 3.5. Let X be an array of b = 4⌈log(n + 1)⌉ + 2
elements. The there is an encode function, and a decode
function such that:
• The encode function modifies the array X (possibly
overwriting elements in addition to rearranging
them) to store a value v ∈ {0, . . . , n}. The first
time the encode function is called on X it has work
and span O(log n). Any later times the encode
function is called on X , it has work O(log n) and
span O(log logn). In addition to being given an
argument v, the encode function is given a boolean
argument indicating whether the function has been
invoked on X before.
• The decode function recovers the value of v from
the modified array X , and restores X to again be
an array consisting of the same multiset of elements
that it began with. The decode function has work
O(log n) and span O(log logn).
Proof. Consider the first b letters of X as a sequence of
pairs, given by (x1, x2), . . . , (xb−1, xb). If at least half
of the pairs (xi, xi+1 satisfy xi 6= xi+1, then the encode
function can reorder those pairs to appear at the front of
X , and then use them to encode v as in Lemma 3.2. Note
that the reordering of the pairs will only be performed
the first time that the encode function is invoked on X .
Later calls to the encode function will have workO(log n)
and span O(log logn), as in Lemma 3.2.
If, on the other hand, at least half the pairs consist of
equal-value elements xi = xi+1, then we can reorder the
pairs so that the first ⌈log(n+1)⌉+1 of them satisfy this
property. (This is only done on the first call to encode.)
To encode a value v, we simply explicitly overwrite the
second element in each of the pairs (x3, x4), (x5, x6), . . .
with the bits of v, overwriting each element with one bit.
The reordering performed by the first call to encode has
work and span O(log n); the writing of v’s bits can then
be performed in work O(log n) and span O(log logn)
using a simple divide-and-conquer approach.
To perform a decode and read the value v, we check
whether x1 = x2 in order to determine which type of
encoding is being used, and then we can unencode the
bits of v using work O(log n) and span O(log logn); if
the encoding is the second type (i.e., x1 = x2), then the
decode function also restores the elements x2, x4, x6, . . .
of the array X as it extracts the bits of v. Note that
checking whether x1 = x2 is also used by the encode
function each time after the first time it is called, in order
determine which type of encoding is being used.
The fact that the first call to the encode function on
eachXi has spanO(log n) (rather than O(log logn)) does
not affect the total span of our parallel-partition algo-
rithm, since this simply adds a step with O(log n)-span
to the beginning of the Parallel Prefix phase. Lemma
3.5 can therefore used in place of Lemma 3.2 in order
to complete the proof of Theorem 3.1 for arrays A that
contain duplicate elements.
4 A Cache-Efficient Partition
Algorithm
In this section we present the Smoothed Striding Algo-
rithm , which exhibits provably optimal cache behavior
(up to small-order factors). The Smoothed Striding
Algorithm is fully in-place and has polylogarithmic
span. In particular, this means that the total amount
of auxiliary memory allocated at a given moment in the
execution never exceeds polylogn per active worker.
Modeling Cache Misses. We treat memory as con-
sisting of fixed-size cache lines, each of some size b. Each
processor is assumed to have a small cache of polylogn
cache lines. A cache miss occurs on a processor when the
line being accessed is not currently in cache, in which case
some other line is evicted from cache to make room for
the new entry. Each cache is managed with a LRU (Least
Recently Used) eviction policy.
We assume that threads are scheduled using work
stealing [1], and that the work-stealing itself has no
cache-miss overhead. Note that caches belong to proces-
sors, not threads, meaning that when a processor takes a
new thread (i.e., performs work stealing), the processor’s
cache contents are a function of what the processor was
7
William Kuszmaul and Alek Westover
previously executing. In order to keep our analysis of
cache misses independent of the number of processors,
we will ignore the cost of warming up each processor’s
cache. In particular, if there are polylogn global variables
that each processor must access many times, we do not
consider the initial p · polylogn cost of loading those
global variables into the caches (where p is the number of
processors). In practice, p ≪ n on large inputs, making
the cost of warming up caches negligible.
Although each cache is managed with LRU evic-
tion, we may assume for the sake of analysis that
each cache is managed by the optimal off-line eviction
strategy OPT (i.e. Furthest in the Future). This is
because, up to resource augmentation, LRU eviction is
(1+1/ polylogn)-competitive with OPT. Formally this is
due to the following theorem by Sleator and Tarjan [28]:
Theorem 4.1. LRU operating on a cache of sizeK ·M for
someK > 1 will incur at most 1+ 1K−1 times the number
of times cache misses of OPT operating on a cache of size
M , for the same series of memory accesses.
Recall that each processor has a cache of size logc n for
c a constant of our choice. Up to changes in c LRU incurs
no more than a 1+ 1polylogn factor more cache misses than
OPT incurs. Thus, up to a 1 + 1polylog(n) multiplicative
change in cache misses, and a polylog(n) factor change in
cache size, we may assume without loss of generality that
cache eviction is performed by OPT.
Because each processor’s cache is managed by OPT
(without loss of generality), we can assume that each
processor pins certain small arrays to cache (i.e., the
elements of those arrays are never evicted). In fact, this
is the only property of OPT we will use; that is, our
analyses will treat non-pinned contents of the cache as
being managed via LRU.
The Strided Algorithm [14]. The Smoothed Strid-
ing Algorithm borrows several structural ideas from a
previous algorithm of Francis and Pannan [14], which
we call the Strided Algorithm. The Strided Algorithm is
designed to behave well on random arrays A, achieving
span O˜(n2/3) and exhibiting only n/b + O˜(n2/3/b)
cache misses on such inputs. On worst-case inputs,
however, the Strided Algorithm has span Ω(n) and incurs
n/b+Ω(n/b) cache misses. Our algorithm, the Smoothed
Striding Algorithm, builds on the Strided Algorithm by
randomly perturbing the internal structure of the original
algorithm; in doing so, we are able to provide provable
performance guarantees for arbitrary inputs, and to add
a recursion step that was previously impossible.
The Strided Algorithm consists of two steps:
The Partial Partition Step.
Let g ∈ N be a parameter, and assume for simplicity that
gb | n. Logically partition the array A into n/(gb) chunks
C1, . . . , Cn/(gb), each consisting of g cache lines of size b.
For i ∈ {1, 2, . . . , g}, define Pi to consist of the i-th cache
line from each of the chunks C1, . . . , Cn/(gb). One can
think of the Pi’s as forming a strided partition of arrayA,
since consecutive cache lines in Pi are always separated
by a fixed stride of g − 1 other cache lines.
The first step of the Strided Algorithm is to perform an
in-place serial partition on each of the Pi’s, rearranging
the elements within the Pi so that the predecessors come
first. This step requires work Θ(n) and span Θ(n/g).
The Serial Cleanup Step. For each Pi, define the
splitting position vi to be the position in A of the first
successor in (the already partitioned) Pi. Define vmin =
min{v1, . . . , vg} and define vmax = max{v1, . . . , vg}.
The second step of the Strided Algorithm is to perform a
serial partition on the sub-arrayA[vmin], . . . , A[vmax− 1].
This step has no parallelism, and thus has work and span
of Θ(vmax − vmin).
In general, the lack of parallelism in the Serial Cleanup
step results in an algorithm with linear-span (i.e., no
parallelism guarantee). When the number of predeces-
sors in each of the Pi’s is close to equal, however, the
quantity vmax − vmin can be much smaller than Θ(n).
For example, if b = 1, and if each element of A is selected
independently from some distribution, then one can use
Chernoff bounds to prove that with high probability in
n, vmax − vmin ≤ O(
√
n · g · logn). The full span of the
algorithm is then O˜(n/g +
√
n · g), which optimizes at
g = n1/3 to O˜(n2/3). Since the Partial Partition Step
incurs only n/b cache misses, the full algorithm incurs
n+ O˜(n2/3) cache misses on a random array A.
Using Hoeffding’s Inequality in place of Chernoff
bounds, one can obtain analogous bounds for larger
values of b; in particular for b ∈ polylog(n), the optimal
span remains O˜(n2/3) and the number of cache misses
becomes n/b + O˜(n2/3/b) on an array A consisting of
randomly sampled elements.4
The Smoothed Striding Algorithm. To obtain an
algorithm with provable guarantees for all inputs A, we
randomly perturb the internal structure of each of the
Pi’s. Define U1, . . . , Ug (which play a role analogous to
P1, . . . , Pg in the Strided Algorithm) so that each Ui
contains one randomly selected cache line from each of
C1, . . . , Cn/gb (rather than containing the i-th cache line
of each Cj). This ensures that the number of predecessors
in each Ui is a sum of independent random variables with
values in {0, 1, . . . , n/g}.
By Hoeffding’s Inequality, with high probability in
n, the number of predecessors in each Ui is tightly
concentrated around µng , where µ is the fraction of
elements in A that are predecessors. It follows that, if
we perform in-place partitions of each Ui in parallel, and
then define vi to be the position in A of the first successor
in (the already partitioned) Ui, then the difference
4The original algorithm of Francis and Pannan [14] does not
consider the cache-line size b. Frias and Petit later introduced the
parameter b [15], and showed that by setting b appropriately, one
obtains an algorithm whose empirical performance is close to the
state-of-the-art.
8
William Kuszmaul and Alek Westover
between vmin = mini vi and vmax = maxi vi will be small
(regardless of the input array A!).
Rather than partitioning A[vmin], . . . , A[vmax − 1] in
serial, the Smoothed Striding Algorithm simply recurses
on the subarray. Such a recursion would not have been
productive for the original Strided Algorithm because
the strided partition P ′1, . . . , P
′
g used in the recursive
subproblem would satisfy P ′1 ⊆ P1, . . . , P ′g ⊆ Pg and thus
each P ′i is already partitioned. That is, in the original
Strided Algorithm, the problem that we would recurse on
is a worst-case input for the algorithm in the sense that
the partial partition step makes no progress.
The main challenge in designing the Smoothed Strid-
ing Algorithm becomes the construction of U1, . . . , Ug
without violating the in-place nature of the algorithm.
A natural approach might be to store for each Ui, Cj the
index of the cache line in Cj that Ui contains. This would
require the storage of Θ(n/b) numbers as metadata,
however, preventing the algorithm from being in-place.
To save space, the key insight is to select a random offset
Xj ∈ {1, 2, . . . , g} within each Cj , and then to assign
the (Xj + i (mod g)) + 1-th cache line of Cj to Ui for
i ∈ {1, 2, . . . , g}. This allows for us to construct the
Ui’s using only O(n/(gb)) machine words storing the
metadata X1, . . . , Xn/(gb). By setting g to be relatively
large, so that n/(gb) ≤ polylog(n), we can obtain an
in-place algorithm that incurs (1+ o(1))n/b cache misses.
The recursive structure of the Smoothed Striding Algo-
rithm allows for the algorithm to achieve polylogarithmic
span. As an alternative to recursing, one can also use the
Blocked-Prefix-Sum Partition Algorithm from Section 3
in order to partitionA[vmin], . . . , A[vmax−1]. This results
in an improved span (since the algorithm from Section 3
has span onlyO(log n log log n)), while still incurring only
(1 + o(1))n/b cache misses (since the cache-inefficient
algorithm from Section 3 is only used on a small subarray
of A). We analyze both the recursive version of the
Smoothed Striding Algorithm, and the version which
uses as a final step the Blocked-Prefix-Sum Partition
Algorithm; one significant advantage of the recursive
version is that it is simple to implement in practice.
Formal Algorithm Description. Let b < n be the size
of a cache line, let A be an input array of size n, and let g
be a parameter. (One should think of g as being relatively
large, satisfying n/(gb) ≤ polylog(n).) We assume for
simplicity that that n is divisible by gb, and we define
s = n/(gb). 5
In the Partial Partition Step the algorithm par-
titions the cache lines of A into g sets U1, . . . , Ug where
each Ui contains s cache lines, and then performs a serial
5This assumption can be made without loss of generality by
treating A as an array of size n′ = n+ (gb− n (mod gb)), and then
treating the final gb − n (mod gb) elements of the array as being
successors (which consequently the algorithm needs not explicitly
access). Note that the extra n′ − n elements are completely virtual,
meaning they do not physically exist or reside in memory.
partition on eachUi in parallel over theUi’s. To determine
the sets U1, . . . , Ug, the algorithm uses as metadata an
array X = X [1], . . . , X [s], where each X [i] ∈ {1, . . . , g}.
More specifically the algorithm does the following:
Set each of X [1], . . . , X [s] to be uniformly random and
independently selected elements of {1, 2, . . . , g}. For each
i ∈ {1, 2, . . . , g}, j ∈ {1, 2, . . . , s}, define
Gi(j) = (X [j] + i (mod g)) + (j − 1)g + 1.
Using this terminology, we define eachUi for i ∈ {1, . . . , g}
to contain the Gi(j)-th cache line of A for each
j ∈ {1, 2, . . . , s}. That is, Gi(j) denotes the index of the
j-th cache line from array A contained in Ui.
Note that, to compute the index of the j-th cache
line in Ui, one needs only the value of X [j]. Thus the
only metadata needed by the algorithm to determine
U1, . . . , Ug is the array X . If |X | = s = ngb ≤ polylog(n),
then the algorithm is in place.
The algorithm performs an in-place (serial) partition
on each Ui (and performs these partitions in parallel
with one another). In doing so, the algorithm, also
collects vmin = mini vi, vmax = maxi vi, where each vi
with i ∈ {1, . . . , g} is defined to be the index of the first
successor in A (or n if no such successor exists).6
The array A is now “partially partitioned”, i.e. A[i] is
a predecessor for all i ≤ vmin, and A[i] is a successor for
all i > vmax.
The second step of the Smoothed Striding Algorithm is
to complete the partitioning of A[vmin + 1], . . . , A[vmax].
This can be done in one of two ways: The Recur-
sive Smoothed Striding Algorithm partitions
A[vmin + 1], . . . , A[vmax] recursively using the same
algorithm (and resorts to a serial base case when
the subproblem is small enough that g ≤ O(1)); the
Hybrid Smoothed Striding Algorithm partitions
A[vmin+1], . . . , A[vmax] using the in-place algorithmgiven
in Theorem 3.1 with span O(log n log logn). In general,
the Hybrid algorithm yields better theoretical guarantees
on span than the recursive version; on the other hand, the
recursive version has the advantage that it is simple to
implement as fully in-place, and still achieves polyloga-
rithmic span. We analyze both algorithms in this section.
Detailed pseudocode for the Recursive Smoothed
Striding Algorithm can be found in Figure 3.
Algorithm Analysis. Our first proposition analyzes
the Partial Partition Step.
Proposition 4.1. Let ǫ ∈ (0, 1/2) and δ ∈ (0, 1/2) such
that ǫ ≥ 1/ poly(n) and δ ≥ 1/ polylog(n). Suppose
6One can calculate vmin and vmax without explicitly storing
each of v1, . . . , vg as follows. Rather than using a standard g-way
parallel for-loop to partition each of U1, . . . , Ug, one can manually
implement the parallel for-loop using a recursive divide-and-conquer
approach. Each recursive call in the divide-and-conquer can then
simply collect the maximum and minimum vi for the Ui’s that are
partitioned within that recursive call. This adds O(logn) to the
total span of the Partial Partition Step, which does not affect the
overall span asymptotically.
9
William Kuszmaul and Alek Westover
Figure 3: Smoothed Striding Algorithm
Recall:
A is the array to be partitioned, of length n.
We break A into chunks, each consisting of g cache lines of size b.
We create g groups U1, . . . , Ug that each contain a single cache line from each chunk,
Ui’s j-th cache line is the (X [j] + i mod g + 1)-th cache line in the j-th chunk of A.
procedure Get Block Start Index(X , g, b, i, j) ⊲
This procedure returns the index in A of the start of U ′is j-th block.
return b · ((X [j] + i mod g) + (j − 1) · g) + 1
end procedure
procedure ParallelPartition(A, n, g, b)
if g < 2 then
serial partition A
else
for j ∈ {1, 2, . . . , n/(gb)} do
X [j]← a random integer from [1, g]
end for
for all i ∈ {1, 2, . . . , g} in parallel do ⊲We perform a serial partition on all Ui’s in parallel
low← GetBlockStartIndex(X ,g,b,i,1) ⊲ low← index of the first element in Ui
high← GetBlockStartIndex(X ,g,b,i,n/(gb)) + b− 1 ⊲ high← index of the last element in Ui
while low < high do
while A[low] ≤ pivotValue do
low← low+1
if low modb ≡ 0 then ⊲ Perform a block increment once low reaches the end of a block
k ← number of block increments so far (including this one)
low← GetBlockStartIndex(X ,g,b,i,k) ⊲ Increase low to start of block k of Gi
end if
end while
while A[high] > pivotValue do
high← high−1
if high modb ≡ 1 then ⊲ Perform a block decrement once high reaches the start of a block
k ← number of block decrements so far (including this one)
k′ ← n/(gb)− k
high← GetBlockStartIndex(X , g,b,i,k′) +b− 1 ⊲ Decrease high to end of block k′ of Gi
end if
end while
Swap A[low] and A[high]
end while
end for
Recurse on A[vmin], . . . , A[vmax − 1]
end if
end procedure
s > ln(n/ǫ)δ2 , and that each processor has a cache of size at
least s+ c for a sufficiently large constant c.
Then the Partial-Partition Algorithm achieves work
O(n); achieves span O(b · s); incurs s+nb + O(1) cache
misses; and guarantees that with probability at least 1−ǫ,
vmax − vmin < 4nδ.
Proof. Since
∑
i |Ui| = n, and since the serial partitioning
of each Ui takes time O(|Ui|), the total work performed
by the algorithm is O(n).
To analyze cache misses, we assume without loss of
generality that array X is pinned in each processor’s
cache (note, in particular, that |X | = s ≤ polylog(n),
and so X fits in cache). Thus we can ignore the cost of
accesses toX . Note that each Ui consists of s = polylogn
cache lines, meaning that each Ui fits entirely in cache.
Thus the number of cache misses needed for a thread to
partition a given Ui is just s. Since there are g of the U
′
is,
the total number of cache misses incurred in partitioning
all of the Ui’s is gs = n/b. Besides these, there are s/b
10
William Kuszmaul and Alek Westover
cache misses for instantiating the array X ; and O(1)
cache misses for other instantiating costs. This sums to
n+ s
b
+O(1).
The span of the algorithm isO(n/g+s) = O(b ·s), since
the eachUi is of sizeO(n/g), and because the initialization
of arrayX can be performed in time O(|X |) = O(s).
It remains to show that with probability 1 − ǫ,
vmax − vmin < 4nδ. Let µ denote the fraction of elements
in A that are predecessors. For i ∈ {1, 2, . . . , g}, let µi
denote the fraction of elements in Ui that are predeces-
sors. Note that each µi is the average of s independent
random variables Yi(1), . . . , Yi(s) ∈ [0, 1], where Yi(j)
is the fraction of elements in the Gi(j)-th cache line of
A that are predecessors. By construction, Gi(j) has the
same probability distribution for all i, since (X [j] + i)
(mod g) is uniformly random in Zg for all i. It follows
that Yi(j) has the same distribution for all i, and thus
that E[µi] is independent of i. Since the average of the
µis is µ, it follows that E[µi] = µ for all i ∈ {1, 2, . . . , g}.
Since each µi is the average of s independent [0, 1]-
random variables, we can apply Hoeffding’s inequality
(i.e. a Chernoff Bound for a random variable on [0, 1]
rather than on {0, 1}) to each µi to show that it is tightly
concentrated around its expected value µ, i.e.,
Pr[|µi − µ| ≥ δ] < 2 exp(−2sδ2).
Since s > ln(n/ǫ)δ2 ≥ ln(2n/(bǫ))2δ2 , we find that for all
i ∈ {1, . . . , g},
Pr[|µi − µ| ≥ δ] < 2 exp
(
−2ln(2n/(bǫ))
2δ2
δ2
)
=
ǫ
n/b
<
ǫ
g
.
By the union bound, it follows that with probability at
least 1− ǫ, all of µ1, . . . , µg are within δ of µ.
To complete the proof we will show that the occurrence
of the event that all µy simultaneously satisfy |µ−µy| < δ
implies that vmax − vmin ≤ 4nδ.
Recall that Gi(j) denotes the index within A of the j
th cache-line contained in Ui. By the definition of Gi(j),
(j − 1)g + 1 ≤ Gi(j) ≤ jg.
Note that A[vi] will occur in the ⌈sµi⌉-th cache-line of Ui
because Ui is composed of s cache lines. Hence
(⌈sµi⌉ − 1)gb+ 1 ≤ vi ≤ ⌈sµi⌉gb,
which means that
sµigb− gb+ 1 ≤ vi ≤ sµigb+ gb.
Since sgb = n, it follows that |vi − nµi| ≤ gb. Therefore,
|vi − nµ| < gb+ nδ.
This implies that the maximum of |vi − vj | for any i and
j is at most, 2bg + 2δn. Thus,
vmax − vmin ≤ 2n
(
δ +
bg
n
)
= 2n (δ + 1/s)
≤ 2n
(
δ +
δ2
ln(n/ǫ)
)
< 4n · δ.
We use Proposition 4.1 as a tool to analyze the
Recursive and the Hybrid Smoothed Striding Algorithms.
Rather than parameterizing the Partial Partition step
in each algorithm by s, Proposition 4.1 suggests that it
is more natural to parameterize by ǫ and δ, which then
determine s.
We will assume that both the hybrid and the recursive
algorithms use ǫ = 1/nc for c of our choice (i.e. with
high probability inn). Moreover, the Recursive Smoothed
Striding Algorithm continues to use the same value of ǫ
within recursive subproblems (i.e., the ǫ is chosen based
on the size of the first subproblem in the recursion), so that
the entire algorithm succeeds with high probability in n.
For both algorithms, the choice of δ results in a tradeoff
between cache misses and span. For the Recursive
algorithm, we allow for δ to be chosen arbitrarily at the
top level of recursion, and then fix δ = Θ(1) to be a
sufficiently small constant at all levels of recursion after
the first; this guarantees that we at least halve the size of
the problem between recursive iterations7. Optimizing δ
further (after the first level of recursion) would only affect
the number of undesired cachemisses by a constant factor.
Next we analyze the Hybrid Smoothed Striding
Algorithm.
Theorem 4.2. The Hybrid Smoothed Striding Algorithm
using parameter δ ∈ (0, 1/2) satisfying δ ≥ 1/ polylog(n):
has work O(n); achieves span
O
(
logn log logn+
b logn
δ2
)
,
with high probability in n; and incurs fewer than
(n+O(nδ))/b
cache misses with high probability in n.
An interesting corollary of Theorem 4.2 concerns what
happens when b is small (e.g., constant) and we choose δ
to optimize span.
Corollary 4.2 (Corollary of Theorem 4.2). Suppose
b ≤ o(log logn). Then the Hybrid Smoothed Striding
using δ = Θ
(√
b/ log logn
)
, achieves workO(n), and with
high probability in n, achieves spanO(log n log logn) and
incurs fewer than (n+ o(n))/b cache misses.
7In general, setting δ = 1/8 will result in the problem size
being halved. However, this relies on the assumption that gb | n,
which is only without loss of generality by allowing for the size
of subproblems to be sometimes artificially increased by a small
amount (i.e., a factor of 1 + gb/n = 1 + 1/s). One can handle this
issue by decreasing δ to, say, 1/16.
11
William Kuszmaul and Alek Westover
Proof of Theorem 4.2. We analyze the Partial Partition
Step using Proposition 4.1. Note that by our choice of ǫ,
s = O
(
logn
δ2
)
. The Partial Partition Step therefore has
work O(n), span O
(
b logn
δ2
)
, and incurs fewer than
n
b
+O
(
logn
bδ2
)
+O(1)
cache misses.
By Theorem 3.1, the subproblem of partitioning of
A[vmin + 1], . . . , A[vmax] takes work O(n). With high
probability in n, the subproblem has size less than 4nδ,
which means that the subproblem achieves span
O(log nδ log lognδ) = O(log n log logn),
and incurs at most O(nδ/b) cache misses.
The total number of cache misses is therefore,
n
b
+O
(
logn
bδ2
+
nδ
b
)
+O(1),
which since δ ≥ 1/ polylog(n), is at most
(n+O(nδ))/b +O(1) ≤ (n+O(nδ))/b, as desired.
Proof of Corollary 4.2. We use δ =
√
b/ log logn in the
result proved in Theorem 4.2.
First note that the assumptions of Theorem 4.2 are
satisfied because O(
√
b/ log logn) > 1/ polylog(n). The
algorithm achieves work O(n). With high probability in
n the algorithm achieves span
O
(
logn log logn+
b logn
δ2
)
= O(log n log logn).
With high probability inn the algorithm incurs fewer than
(n+O(nδ))/b = (n+O(n
√
b/ log logn))/b
cache misses. By assumption
√
b/ log logn = o(1), so
this reduces to (n+ o(n))/b cache misses, as desired.
The next theorem analyzes the span of the Recursive
Smoothed Striding Algorithm.
Theorem 4.3. With high probability in n, the Recur-
sive Smoothed Striding algorithm using parameter
δ ∈ (0, 1/2) satisfying δ ≥ 1/ polylog(n): achieves work
O(n), attains span
O
(
b
(
log2 n+
logn
δ2
))
,
and incurs (n+O(nδ))/b cache misses.
A particularly natural parameter setting for the
Recursive algorithm occurs at δ = 1/
√
logn.
Corollary 4.3 (Corollary of Theorem 4.3). With high
probability in n, the Recursive Smoothed Striding
Algorithm using parameter δ = 1/
√
logn: achieves work
O(n), attains span O(b log2 n), and incurs (1 + o(1))n/b
cache misses.
Proof of Theorem 4.3. To avoid confusion, we use δ′,
rather than δ, to denote the constant value of δ used at
levels of recursion after the first.
By Proposition 4.1, the top level of the algorithm has
work O(n), span O
(
b lognδ2
)
, and incurs s+nb +O(1) cache
misses. The recursion reduces the problem size by at least
a factor of 4δ, with high probability in n.
At lower layers of recursion, with high probability in n,
the algorithm reduces the problem size by a factor of at
least 1/2 (since δ is set to be a sufficiently small constant).
For each i > 1, it follows that the size of the problem at
the i-th level of recursion is at most O(nδ/2i).
The sum of the sizes of the problems after the first level
of recursion is therefore bounded above by a geometric se-
ries summing to at mostO(nδ). This means that the total
work of the algorithm is at most O(nδ) +O(n) ≤ O(n).
Recall that each level i > 1 uses s = ln(2
−inδ′/b)
δ′2 , where
δ′ = Θ(1). It follows that level i uses s ≤ O(log n). Thus,
by Proposition 4.1, level i contributesO(b ·s) = O(b logn)
to the span. Since there are at most O(log n) levels of
recursion, the total span in the lower levels of recursion is
at most O(b log2 n), and the total span for the algorithm
is at most,
O
(
b
(
log2 n+
logn
δ2
))
.
To compute the total number of cache misses of the
algorithm, we add together (n + s)/b + O(1) for the top
level, and then, by Proposition 4.1, at most
∑
0≤i<O(log n)
1
b
O
(
22−inδ + logn
) ≤ O
(
1
b
(nδ + log2 n)
)
.
for lower levels. Thus the total number of cache misses
for the algorithm is,
1
b
(
n+
logn
δ2
)
+O(nδ + log2 n)/b = (n+O(nδ))/b.
Proof of Corollary 4.3. By Theorem 4.3, with high prob-
ability in n, the algorithm has work O(n), the algorithm
has span
O
(
b
(
log2 n+
logn
δ2
))
= O(b log2 n),
and the algorithm incurs
(n+O(nδ))/b = (n+O(n/
√
logn))/b = (n+ o(n))/b
cache misses.
5 Performance Comparisons
In this section, we implement the techniques from Section
3 and Section 4 to build space-efficient and in-place
parallel-partition functions.
12
5.1 Comparing Parallel-Prefix-Based Algorithms William Kuszmaul and Alek Westover
Each implementation considers an array of n 64-bit
integers, and partitions them based on a pivot. The
integers in the array are initially generated so that each
is randomly either larger or smaller than the pivot.
In Subsection 5.1, we evaluate the techniques in Section
3 for transforming the standard parallel-prefix-based par-
tition algorithm into an in-place algorithm. We compare
the performance of three parallel-partition implementa-
tions: (1) The high-space implementation which follows
the standard parallel-partition algorithm exactly; (2) a
medium-space implementation which reduces the space
used for the Parallel-Prefix phase; and (3) a low-space
implementation which further eliminates the auxiliary
space used in the Reordering phase of the algorithm. The
low-space implementation still uses a small amount of
auxiliary memory for the parallel-prefix, storing every
O(log n)-th element of the parallel-prefix array explicitly
rather than using the implicit-storage approach in Section
3. Nonetheless the space consumption is several orders of
magnitude smaller than the original algorithm.
In addition to achieving a space-reduction, the better
cache-behavior of the low-space implementation allows
for it to achieve a speed advantage over its peers, in
some cases completing roughly twice as fast as the
medium-space implementation and four times as fast
as the low-space implementation. We show that all
three implementations are bottlenecked by memory
throughput, however, suggesting that the cache-optimal
Smoothed Striding Algorithm can do better.
In Subsection 5.2, we evaluate the performance of
the Recursive Smoothed Striding Algorithm and the
Strided Algorithm. Unlike the algorithms described
above, the implementations of both of these algorithms
are fully in-place, meaning that the total space overhead
is only polylogn. The cache efficiency of these two
algorithms allows for them to achieve substantially better
scaling than their parallel-prefix-based counterparts.
The Strided Algorithm tends to slightly outperform the
Smoothed Striding Algorithm, though on 18 threads their
performance is within 15% of one-another. We conclude
that the Smoothed Striding Algorithm allows for one to
obtain empirical performance comparable to that of the
Strided Algorithm, while simultaneously achieving the
provable guarantees on span and cache-efficiency missing
from the original Strided Algorithm.
Machine Details. Our experiments are performed on a
two-socket machine with eighteen 2.9 GHz Intel Xeon E5-
2666 v3 processors. To maximize the memory bandwidth
of the machine, we use a NUMA memory-placement
policy in which memory allocation is spread out evenly
across the nodes of the machine; this is achieved using
the interleave=all option in the Linux numactl tool [23].
Worker threads in our experiments are each given their
own core, with hyperthreading disabled.
Our algorithms are implemented using the CilkPlus
task parallelism library in C++. The implementations
avoid the use of concurrency mechanisms and atomic
operations, but do allow for concurrent reads to be
performed on shared values such as n and the pointer to
the input array. Our code is compiled using g++ 7.3.0,
with march=native and at optimization level three.
Our implementations are available on GitHub.
5.1 Comparing Parallel-Prefix-Based
Algorithms
In this section, we compare four partition implemen-
tations, incorporating the techniques from Section 3 in
order to achieve space efficiency:
• ASerial Baseline: This uses the serial in-place partition
implementation from GNU Libc quicksort, with minor
adaptations to optimize it for the case of sorting 64-bit
integers (i.e., inlining the comparison function, etc.).
• The High-Space Parallel Implementation: This uses
the standard parallel partition algorithm [7, 2], as
described in Section 2. The space overhead is roughly
2n eight-byte words.
• The Medium-Space Parallel Implementation: Starting
with the high-space implementation, we reduce the
space used by the Parallel-Prefix phase by only con-
structing every O(log n)-th element of the prefix-sum
array B, as in Section 3. (Here O(log n) is hard-coded
as 64.) The arrayB is initialized to be of size n/64, with
each component equal to
∑64
i=1 dec(A[64(i − 1) + 1]),
and then a parallel prefix sum is computed on the array
B. Rather than implicitly encoding the elements of B
in A, we use an auxiliary array of size n/64 to explicitly
store the prefix sums.
The algorithmhas a space overhead of n32+n eight-byte
words.8
• The Low-Space Parallel Implementation: Starting
with the medium-space implementation, we make
the reordering phase completely in-place using the
preprocessing technique in Section 3.9 The only space
overhead in this implementation is the n32 additional
8-byte words used in the prefix sum.
We remark that the ample parallelism of the low-space
algorithm makes it so that for large inputs the value 64
can easily be increased substantially without negatively
effecting algorithm performance. For example, on an
8In addition to the auxiliary array of size n/64, we use a
series of smaller arrays of sizes n/128, n/256, . . . in the recursive
computation of the prefix sum. The alternative of performing
the parallel-prefix sum in place, as in Section 3, tends to be less
cache-friendly in practice.
9Depending on whether themajority of elements are predecessors
or successors, the algorithm goes down separate (but symmetric)
code paths. In our timed experiments we test only with inputs
containing more predecessors than successors, since this the slower
of the two cases (by a very slight amount) for our implementation.
13
5.1 Comparing Parallel-Prefix-Based Algorithms William Kuszmaul and Alek Westover
input of size 228, increasing it to 4096 has essentially
no effect on the empirical runtime while bringing the
auxiliary space-consumption down to a 12048 -fraction of
the input size. (In fact, the increase from 64 to 4096
results in roughly a 5% speedup.)
An Additional Optimization for The High-Space
Implementation. The optimization of reducing the
prefix-sum by a factor of O(log n) at the top level of
recursion, rather than simply by a factor of two, can also
be applied to the standard parallel-prefix algorithm when
constructing a prefix-sum array of size n. Even without
the space reduction, this reduces the (constant) overhead
in the parallel prefix sum, while keeping the overall span
of the parallel-prefix operation at O(log n). We perform
this optimization in the high-space implementation.
Performance Comparison.
Figure 4 graphs the speedupof the each of the parallel al-
gorithms over the serial algorithm, using varying numbers
of worker threads on an 18-coremachine with a fixed input
size of n = 230. Both space optimizations result in perfor-
mance improvements, with the low-space implementation
performing almost twice as well as the medium-space
implementation on eighteen threads, and almost four
times as well as the high-space implementation.
Figure 5 compares the performances of the implemen-
tations in serial. Parallel-for-loops are replaced with
serial for-loops to eliminate scheduler overhead. As the
input-size varies, the ratios of the runtimes vary only
slightly. The low-space implementation performs within
a factor of roughly 1.9 of the serial implementation. As in
Figure 4, both space optimizations result in performance
improvements.
The Source of the Speedup. If we compare the number
of instructions performed by the three parallel implemen-
tations, then the medium-space algorithm would seem
to be the clear winner. Using Cachegrind to profile the
number of instructions performed in a (serial) execution
on an input of size 228,10 the high-space, medium-space,
and low-space implementations perform 4.4 billion, 2.9
billion, and 4.6 billion instructions, respectively.
Cache misses tell a different story, however. Using
Cachegrind to profile the number of top-level cachemisses
in a (serial) execution on an input of size 228, the high-
space, medium-space, and low-space implementations
incur 305 million, 171 million, and 124 million cache
misses, respectively.
To a first approximation, the number of cache misses
by each algorithm is proportional to the number of times
that the algorithm scans through a large array. By elimi-
nating the use of large auxiliary arrays, the low-space im-
plementation has the opportunity to achieve a reduction
in the number of such scans. Additionally, the low-space
algorithmallows for steps fromadjacent phases of the algo-
10This smaller problem size is used to compensate for the fact
that Cachegrind can be somewhat slow.
rithm to sometimes be performed in the samepass. For ex-
ample, the enumeration of the number of predecessors and
the top level of the Preprocessing phase can be performed
together in a single pass on the input array. Similarly,
the later levels of the Preprocessing phase (which focus on
only one half of the input array) can be combined with the
construction of (one half of) the auxiliary array used in the
Parallel Prefix Sum phase, saving another half of a pass.
The Memory-Bandwidth Limitation.
The comparison of cache misses suggests that per-
formance is bottlenecked by memory bandwidth. To
evaluate whether this is the case, we measure for each
t ∈ {1, . . . , 18} the memory throughput of t threads
attempting to scan through disjoint portions of a large
array in parallel. Wemeasure two types of bandwidth, the
read-bandwidth , in which the threads are simply trying
to read from the array, and the read/write bandwidth ,
in which the threads are attempting to immediately
overwrite entries to the array after reading them. Given
read-bandwidth r bytes/second and read/write band-
width w bytes/second, the time needed for the low-space
algorithm to perform its memory operations on an input
of m bytes will be roughly 3.5m/w + .5m/r seconds.11
We call this the bandwidth constraint . No matter how
optimized the implementation of the low-space algorithm
is, the bandwidth constraint serves as a hard lower bound
for the running time.12
Figure 6 compares the time taken by the low-space
algorithm to the bandwidth constraint as the number
of threads t varies from 1 to 18. As the number of
threads grows, the algorithm becomes bandwidth limited,
achieving its best possible parallel performance on the
machine. The algorithm scales particularly well on the
first socket of the machine, achieving a speedup on nine
cores of roughly six times better than its performance on
a single core, and then scales more poorly on the second
socket as it becomes bottlenecked by memory bandwidth.
Implementation Details. In each implementation, the
parallelism is achieved through simple parallel-for-loops,
with one exception at the beginning of the low-space
implementation, when the number of predecessors in the
input array is computed. Although CilkPlus Reducers
(or OpenMP Reductions) could be used to perform this
parallel summation within a parallel-for-loop [16], we
11A naive implementation of the algorithm would require roughly
m/r time to count the number of predecessors, followed by 2m/w
time to perform the Preprocessing Phase, followed by roughly m/r
time to perform the Parallel Prefix Sum Phase, and then roughly
1.5m/w time for the In-Place Reordering Phase. As described in
the previous paragraph, however, the counting of predecessors and
the Parallel Prefix Sum phase can both be overlapped with the
Preprocessing phase so that their total added contribution to the
Memory-Bandwidth Limitation is only .5m/r.
12Empirically, on an array of size n = 228, the total number of
cache misses is within 8% of what this assumption would predict,
suggesting that the bandwidth constraint is within a small amount
of the true bandwidth-limited runtime.
14
5.1 Comparing Parallel-Prefix-Based Algorithms William Kuszmaul and Alek Westover
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0
2
4
6
8
10
12
Number of Threads
S
p
ee
d
u
p
O
v
er
S
er
ia
l
P
a
rt
it
io
n
Speedup Versus Number of Threads
Low-Space Med-Space
High-Space Smoothed-Striding
Strided
Figure 4: For a fixed table-size n = 230, we compare each implementation’s runtime to the Libc serial baseline, which
takes 3.9 seconds to complete (averaged over five trials). The x-axis plots the number of worker threads being used,
and the y-axis plots the multiplicative speedup over the serial baseline. Each time (including the serial baseline) is
averaged over five trials.
23 24 25 26 27 28 29 30
0
1
2
3
4
5
6
Log Input Size
S
lo
w
d
ow
n
O
v
er
S
er
ia
l
P
a
rt
it
io
n
Slowdown Versus Input Size in Serial
Low-Space Med-Space
High-Space Smoothed Striding
Strided
Figure 5: We compare the performance of the implementations in serial, with no scheduling overhead. The x-axis
is the log-base-2 of the input size, and the y-axis is the multiplicative slowdown when compared to the Libc serial
baseline. Each time (including the baseline) is averaged over five trials.
15
5.2 Comparing the Smoothed Striding and Strided Algorithms William Kuszmaul and Alek Westover
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0
5
10
15
20
Number of Threads
S
p
ee
d
u
p
O
v
er
S
er
ia
l
P
a
rt
it
io
n
Speedup Versus Number of Threads
Low-Space Low-Space Bandwidth Constraint
Smoothed Striding Smoothed Striding Bandwidth Constraint
Figure 6: We compare the performances of the low-space and Smoothed Striding parallel-partition algorithms to
their ideal performance determined by memory-bandwidth constraints on inputs of size 230. The x-axis is the number
of worker threads, and the y-axis is the multiplicative speedup when compared to the Libc serial baseline (which is
computed by an average over five trials). Each data-point is averaged over five trials.
found a slightly more ad-hoc approach to be faster: Using
a simple recursive structure, we manually implemented
a parallel-for-loop with Cilk Spawns and Syncs, allowing
for the summation to be performed within the recursion.
5.2 Comparing the Smoothed Striding
and Strided Algorithms
In this section we consider the performance of the
Strided Algorithm and the Recursive Smoothed Striding
Algorithm. Past work [15] found that, on large numbers
of threads, the Strided Algorithm has performance close
to that of other non-EREW state-of-the art partition
algorithms (i.e., within 20% of the best atomic-operation
based algorithms). The Strided Algorithm does not
offer provable guarantees on span and cache-efficiency,
however; and indeed, the reason that the algorithm
cannot recurse on the subarray A[vmin + 1], . . . , A[vmax]
is that the subarray has been implicitly constructed to be
worst-case for the algorithm. In this subsection, we show
that, with only a small loss in performance, the Smoothed
Striding Algorithm can be used to achieve provable
guarantees on arbitrary inputs. We remark that we do
not make any attempt to generate worst-case inputs for
the Strided Algorithm (in fact the random inputs that
we use are among the only inputs for which the Strided
Algorithm does exhibit provable guarantees!).
Figures 5 and 4 evaluate the performance of the
Smoothed Striding and Strided algorithms in serial and
in parallel. On a single thread, the Smoothed Striding
and Strided algorithms perform approximately 1.5
times slower than the Libc-based serial implementation
baseline. When executed on multiple threads, the perfor-
mances of the Smoothed Striding and Strided Algorithms
scale close to linearly in the number of threads. On 18
threads, the Smoothed Striding Algorithm achieves a
9.6× speedup over the Libc-based Serial Baseline, and
the Strided Algorithm achieves an 11.1× speedup over
the same baseline.
The nearly-ideal scaling of the two algorithms can
be explained by their cache behavior. Whereas the
parallel-prefix-based algorithms were bottlenecked by
memory bandwidth, Figure 6 shows that the same is
no longer true for the Smoothed Striding Algorithm.
The figure compares the performance of the Smoothed
Striding Algorithm to the minimum time needed simple
to read and overwrite each entry of the input array using
18 concurrent threads without any other computation
(i.e., the memory bandwidth constraint). On 18 threads,
the time required by the memory bandwidth constraint
constitutes 58% of the algorithm’s total running time.
NUMA Effects. We remark that the use of the Linux
numactl tool [23] to spread memory allocation evenly
across the nodes of the machine is necessary to prevent
the Smoothed Striding Algorithm and the Strided Algo-
rithm from being bandwidth limited. For example, if we
replicate the 18-thread column of Figure 6 without using
numactl, then the speedup of the Smoothed Striding Al-
16
William Kuszmaul and Alek Westover
gorithm is 8.2, whereas the memory-bandwidth bound for
maximum possible speedup is only slightly larger at 10.2.
Implementation Details. Both algorithms use
b = 512. The Smoothed Striding Algorithm uses slightly
tuned ǫ, δ parameters similar to those outlined in Corol-
lary 4.3. Although vmin and vmax could be computed
using CilkPlus Reducers [16], we found it advantageous
to instead manually implement the parallel-for-loop in
the Partial Partition step with Cilk Spawns and Syncs,
and to compute vmin and vmax within the recursion.
Example Application: A Full Quicksort.
In Figure 7, we graph the performance of a par-
allel quicksort implementation using the low-space
parallel-prefix-based algorithm, the Smoothed Striding
Algorithm, and the Strided Algorithm. We compare the
algorithm performances with varying numbers of worker
threads and input sizes to GNU Libc quicksort; the input
array is initially in a random permutation.
Our parallel quicksort uses the parallel-partition algo-
rithm at the top levels of recursion, and then swaps to the
serial-partitioning algorithm once the input size has been
reduced by at least a factor of 8p, where p is the number
of worker threads. By using the serial-partitioning algo-
rithm on the small recursive subproblems we avoid the
overhead of the parallel algorithm, while still achieving
parallelism between subproblems. Small recursive prob-
lems also exhibit better cache behavior than larger ones,
reducing the effects of memory-bandwidth limitations
on the performance of the parallel quicksort, and further
improving the scaling.
6 Conclusion and Open Questions
Parallel partition is a fundamental primitive in parallel al-
gorithms [7, 2]. Achieving faster and more space-efficient
implementations, even by constant factors, is therefore
of high practical importance. Until now, the only space-
efficient algorithms for parallel partition have relied exten-
sively on concurrency mechanisms or atomic operations,
or lacked provable performance guarantees. If a parallel
function is going to be invokedwithin a large variety of ap-
plications, then provable guarantees are highly desirable.
Moreover, algorithms that avoid the use of concurrency
mechanisms tend to scale more reliably (and with less de-
pendency on the particulars of the underlying hardware).
In this paper, we have shown that, somewhat surpris-
ingly, one can adapt the classic parallel algorithm to
completely eliminate the use of auxiliary memory, while
still using only exclusive read/write shared variables,
and maintaining a polylogarithmic span. Although the
superior cache performance of the low-space algorithm
results in practical speedups over its out-of-place coun-
terpart, both algorithms remain far from the state-of-the
art due to memory bandwidth bottlenecks. To close this
gap, we also presented a second in-place algorithm, the
Smoothed Striding Algorithm, which achieves polyloga-
rithmic span while guaranteeing provably optimal cache
performance up to low-order factors. The Smoothed
Striding Algorithm introduces randomization techniques
to the previous (blocked) Striding Algorithm of [15, 14],
which was known to perform well in practice but which
previously exhibited poor theoretical guarantees. Our
implementation of the Smoothed Striding Algorithm is
fully in-place, exhibits polylogarithmic span, and has
optimal cache performance.
Our work prompts several theoretical questions. Can
fast space-efficient algorithms with polylogarithmic
span be found for other classic problems such as ran-
domly permuting an array [5, 4, 27], and integer sorting
[26, 20, 3, 19, 17]? Such algorithms are of both theoretical
and practical interest, and might be able to utilize some
of the techniques introduced in this paper.
Another important direction of work is the design of
in-place parallel algorithms for sample-sort, the variant of
quicksort in whichmultiple pivots are used simultaneously
in each partition. Sample-sort can be implemented to ex-
hibit fewer cachemisses than quicksort, which is especially
important when the computation is memory-bandwidth
bound. The known in-place parallel algorithms for
sample-sort rely heavily on atomic instructions [6] (even
requiring 128-bit compare-and-swap instructions). Find-
ing fast algorithms that use only exclusive-read-write
memory (or concurrent-read-exclusive-write memory) is
an important direction of future work.
References
[1] Blelloch Acar and Blumofe. The data locality of
work stealing. 2000.
[2] Umut A Acar and Guy Blelloch. Algorithm design:
Parallel and sequential, 2016.
[3] Susanne Albers and Torben Hagerup. Improved
parallel integer sorting without concurrent writing.
Information and Computation, 136(1):25–51, 1997.
[4] Laurent Alonso and Rene´ Schott. A parallel
algorithm for the generation of a permutation
and applications. Theoretical Computer Science,
159(1):15–28, 1996.
[5] R Anderson. Parallel algorithms for generating ran-
dom permutations on a shared memory machine. In
Proceedings of the second annual ACM Symposium
on Parallel Algorithms and Architectures, pages
95–102. ACM, 1990.
[6] Michael Axtmann, Sascha Witt, Daniel Ferizovic,
and Peter Sanders. In-place parallel super scalar
samplesort. arXiv preprint arXiv:1705.02257, 2017.
[7] Guy E Blelloch. Programming parallel algorithms.
Communications of the ACM, 39(3):85–97, 1996.
17
REFERENCES William Kuszmaul and Alek Westover
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0
2
4
6
8
10
12
14
Number of Threads
S
p
ee
d
u
p
O
v
er
S
er
ia
l
P
a
rt
it
io
n
Speedup Versus Number of Threads
Low-Space Smoothed Striding
Strided
Figure 7: We compare the performance of the low-space and high-span sorting implementations running on varying
numbers of threads and input sizes. The x-axis is the number of worker threads and the y-axis is the multiplicative
speedup when compared to the Libc serial baseline. Each time (including each serial baseline) is averaged over five
trials.
[8] Guy E Blelloch, Jeremy T Fineman, Phillip B
Gibbons, and Julian Shun. Internally deterministic
parallel algorithms can be fast. In ACM SIGPLAN
Notices, volume 47, pages 181–192. ACM, 2012.
[9] Guy E. Blelloch, Charles E. Leiserson, Bruce M
Maggs, C Greg Plaxton, Stephen J Smith, and
Marco Zagha. An experimental analysis of parallel
sorting algorithms. Theory of Computing Systems,
31(2):135–167, 1998.
[10] Robert D Blumofe, Christopher F Joerg, Bradley C
Kuszmaul, Charles E Leiserson, Keith H Randall,
and Yuli Zhou. Cilk: An efficient multithreaded
runtime system. Journal of parallel and distributed
computing, 37(1):55–69, 1996.
[11] Robert D Blumofe and Charles E Leiserson. Schedul-
ing multithreaded computations by work stealing.
Journal of the ACM (JACM), 46(5):720–748, 1999.
[12] Richard P Brent. The parallel evaluation of gen-
eral arithmetic expressions. Journal of the ACM
(JACM), 21(2):201–206, 1974.
[13] Thomas H Cormen, Charles E Leiserson, Ronald L
Rivest, and Clifford Stein. Introduction to algo-
rithms. MIT press, 2009.
[14] Rhys S. Francis and LJH Pannan. A parallel
partition for enhanced parallel quicksort. Parallel
Computing, 18(5):543–550, 1992.
[15] Leonor Frias and Jordi Petit. Parallel partition revis-
ited. In InternationalWorkshop on Experimental and
Efficient Algorithms, pages 142–153. Springer, 2008.
[16] Matteo Frigo, Pablo Halpern, Charles E Leiserson,
and Stephen Lewin-Berlin. Reducers and other
cilk++ hyperobjects. In Proceedings of the twenty-
first annual symposium on Parallelism in algorithms
and architectures, pages 79–90. ACM, 2009.
[17] Alexandros V Gerbessiotis and Constantinos J
Siniolakis. Probabilistic integer sorting. Information
processing letters, 90(4):187–193, 2004.
[18] Torben Hagerup and Christine Ru¨b. Optimal
merging and sorting on the erew pram. Information
Processing Letters, 33(4):181–185, 1989.
[19] Yijie Han. Improved fast integer sorting in lin-
ear space. In Proceedings of the twelfth annual
ACM-SIAM symposium on Discrete algorithms,
pages 793–796. Society for Industrial and Applied
Mathematics, 2001.
[20] Yijie Han and Xin He. More efficient parallel integer
sorting. In Frontiers in Algorithmics and Algorith-
mic Aspects in Information and Management, pages
279–290. Springer, 2012.
[21] Philip Heidelberger, Alan Norton, and John T.
Robinson. Parallel quicksort using fetch-and-add.
18
REFERENCES William Kuszmaul and Alek Westover
IEEE Transactions on Computers, 39(1):133–138,
1990.
[22] Jyrki Katajainen, Christos Levcopoulos, and
Ola Petersson. Space-efficient parallel merging.
RAIRO-Theoretical Informatics and Applications,
27(4):295–310, 1993.
[23] Andi Kleen. A numa api for linux. Novel Inc, 2005.
[24] Jie Liu, Clinton Knowles, and Adam Brian Davis.
A cost optimal parallel quicksorting and its imple-
mentation on a shared memory parallel computer.
In International Symposium on Parallel and Dis-
tributed Processing and Applications, pages 491–502.
Springer, 2005.
[25] E Matias and Uzi Vishkin. A note on reducing
parallel model simulations to integer sorting. In
Parallel Processing Symposium, 1995. Proceedings.,
9th International, pages 208–212. IEEE, 1995.
[26] Sanguthevar Rajasekaran and Sandeep Sen. On par-
allel integer sorting. Acta Informatica, 29(1):1–15,
1992.
[27] Julian Shun, Yan Gu, Guy E Blelloch, Jeremy T
Fineman, and Phillip B Gibbons. Sequential random
permutation, list contraction and tree contraction
are highly parallel. In Proceedings of the twenty-
sixth annual ACM-SIAM symposium on Discrete
algorithms, pages 431–448. Society for Industrial
and Applied Mathematics, 2015.
[28] Daniel D Sleator and Robert E Tarjan. Amor-
tized efficiency of list update and paging rules.
Communications of the ACM, 28(2):202–208, 1985.
[29] Philippas Tsigas and Yi Zhang. A simple, fast paral-
lel implementation of quicksort and its performance
evaluation on sun enterprise 10000. In Proceedings
of the Eleventh Euromicro Conference on Parallel,
Distributed and Network-Based Processing, page
372. IEEE, 2003.
[30] Jeffrey Scott Vitter. Algorithms and data structures
for external memory. Foundations and Trends R© in
Theoretical Computer Science, 2(4):305–474, 2008.
19
