Cache Analysis of Non-uniform Distribution Sorting Algorithms by Rahman, Naila & Raman, Rajeev
arXiv:0706.2839v2 [cs.DS] 13 Aug 2007
Cache Analysis of Non-uniform Distribution Sorting
Algorithms
Naila Rahman
Department of Computer Science
University of Leicester
Leicester LE1 7RH, UK.
naila@mcs.le.ac.uk
Rajeev Raman∗
Department of Computer Science,
University of Leicester,
Leicester LE1 7RH, UK.
r.raman@mcs.le.ac.uk.
November 21, 2018
Abstract
We analyse the average-case cache performance of distribution sorting algorithms in the
case when keys are independently but not necessarily uniformly distributed. The analysis
is for both ‘in-place’ and ‘out-of-place’ distribution sorting algorithms and is more accurate
than the analysis presented in [13]. In particular, this new analysis yields tighter upper and
lower bounds when the keys are drawn from a uniform distribution. We use this analysis to
tune the performance of the integer sorting algorithm MSB radix sort when it is used to sort
independent uniform floating-point numbers (floats). Our tuned MSB radix sort algorithm
comfortably outperforms cache-tuned implementations of bucketsort [11] and Quicksort when
sorting uniform floats from [0, 1).
1 Introduction
Distribution sorting is a popular alternative to comparison-based sorting which involves placing
n input keys into k ≤ n classes based on their value [6]. The classes are chosen so that all the
keys in the ith class are smaller than all the keys in the (i + 1)st class, for i = 1, . . . , k − 1, and
furthermore, the class to which a key belongs can be computed in O(1) time (e.g. if the keys
are floats in the range [a, b), we can calculate the class of a key x as $1 + \lfloor \frac{x-a}{b-a} \cdot k \rfloor$). Thus, the
original sorting problem is reduced in linear time to the problem of sorting the keys in each class.
A number of distribution sorting algorithms have been developed which run in linear (expected)
time under some assumptions about the input keys, such as bucket sort and radix sort. Due to
their poor cache utilisation, even good implementations—which minimise instruction counts—of
these ‘linear-time’ algorithms fail to outperform general-purpose O(n log n)-time algorithms such
as Quicksort or Mergesort on modern computers [8, 11].
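The O(1) class computation for floats in [a, b) mentioned above can be sketched in Python as follows. This is only an illustration: the function name and argument order are assumptions made here, and the clamp guards against floating-point rounding for keys very close to b.

```python
import math

def classify(x, a, b, k):
    """Return the 1-based class of key x in [a, b) among k equal-width classes."""
    if not (a <= x < b):
        raise ValueError("key outside [a, b)")
    c = 1 + math.floor((x - a) / (b - a) * k)
    # Rounding in (x - a) / (b - a) * k could yield k + 1 for x just below b.
    return min(c, k)
```

With k = 4 classes on [0, 1), the key 0.25 falls on the boundary between the first and second classes and is assigned to the second.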
Most algorithms are based upon the random-access machine model [1], which assumes that
main memory is as fast as the CPU. However, in modern computers, main memory is typically one
or two orders of magnitude slower than the CPU [4]. To mitigate this, one or more levels of cache
are introduced between CPU and memory. A cache is a fast associative memory which holds the
∗Supported in part by EPSRC grant GR/L92150
values of some main memory locations. If the CPU requests the contents of a memory location, and
the value of that location is held in some level of cache (a cache hit), the CPU’s request is answered
by the cache itself in typically 1-3 clock cycles; otherwise (a cache miss) it is answered by accessing
main memory in typically 30-100 clock cycles. Since typical programs exhibit locality of reference
[4], caches are often effective. However, algorithms such as distribution sort have poor locality of
reference, and their performance can be greatly improved by optimising their cache behaviour. A
number of papers have recently addressed this issue [7, 8, 11, 12, 9, 14], mostly in the context of
sorting and related problems. There is also a large literature on algorithms specifically designed
for hierarchical models of memory [15, 2], but there are some important differences between these
models and ours (see [10] for a summary).
The cache performance of comparison-based sorting algorithms was studied in [8, 9, 14] and
distribution sorting algorithms were considered in [7, 8, 11]. One pass of a distribution sort consists
of a count phase where the number of keys in each class is determined, followed by a permute
phase where the keys belonging to the same class are moved to consecutive locations in an array. We
give an analysis of the cache behaviour of the permute phase, assuming the keys are independently
drawn from a non-uniform distribution. In [13] we focused on ‘in-place’ permute, where the keys
are rearranged without placing them first in an auxiliary array. In this paper we extend the analysis
to ‘out-of-place’ permute. We model the above algorithms as probabilistic processes, and analyse
the cache behaviour of these processes. For each process we give an exact expression for, as well
as matching closed-form upper and lower bounds on, the number of misses.
In previous work on the cache analysis of distribution sorting, [7] have analysed the (somewhat
easier) count phase for non-uniform keys, and [11] gave an empirical analysis of the permute phase
for uniform keys. The process of accessing multiple sequences of memory locations, which arises in
multi-way merge sort, was analysed previously by [9, 14]. The analysis in [9] assumes that accesses
to the sequences are controlled by an adversary; our analysis demonstrates, among other things,
that with uniform randomised accesses to the sequences, more sequences can be accessed optimally.
In [14] a lower bound on cache misses is given for uniform randomised accesses; our lower bound
is somewhat sharper. The analysis also improves upon the results in [13], by giving tighter upper
and lower bounds when the keys are drawn from a uniform distribution.
In practice there are often cases when keys are not uniform (e.g., they may be normally dis-
tributed); our analysis can be used to tune distribution sort in these cases. We consider a different
application here: sorting uniform floats using an integer sorting algorithm. It is well known that
one can sort floats by sorting the bit-strings representing the floats, interpreting them as integers
[4]. Since (simple) operations on integers are faster than operations on floats, this can improve
performance; indeed, in [11] it was observed that an ad hoc implementation of the integer sorting
algorithm most-significant-bit first radix sort (MSB radix sort) outperformed an optimised version
of bucket sort on uniform floats. We observe that a uniform distribution on floating-point numbers
induces a non-uniform distribution on the representing integers, and use our cache analysis to
improve the performance of MSB radix sort on our machine. Our tuned ‘in-place’ MSB radix sort
comfortably outperforms optimised implementations of other in-place or ‘in-place’ algorithms such
as Quicksort or MPFlashsort [11], which is a cache-tuned version of bucket sort.
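The observation that floats can be sorted through their integer bit patterns, and that uniform floats induce non-uniform integers, can be illustrated with a short Python sketch. It assumes non-negative IEEE 754 single-precision keys; the helper name `float_bits` is introduced here for illustration only.

```python
import struct

def float_bits(x):
    """Bit pattern of x, rounded to IEEE 754 single precision, as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def exponent_field(x):
    """Biased 8-bit exponent of a non-negative single-precision float."""
    return float_bits(x) >> 23

keys = [0.75, 0.1, 0.5, 0.25, 0.0]
# For non-negative floats, integer order of the bit patterns matches numeric order.
by_bits = sorted(keys, key=float_bits)
```

Every float in [0.5, 1) has the same exponent field, so about half of a set of uniform keys from [0, 1) share the same leading bits; this is the kind of non-uniformity the analysis is meant to handle.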
2 Cache preliminaries
This section introduces some terminology and notation regarding caches. The size of the cache
is normally expressed in terms of two parameters, the block size (B) and the number of cache
blocks (C). We consider main memory as being divided into equal-sized blocks consisting of B
consecutively-numbered memory locations, with blocks starting at locations which are multiples
of B. The cache is also divided into blocks of size B; one cache block can hold the value of exactly
one memory block. Data is moved to and from main memory only as blocks.
In a direct-mapped cache, the value of memory location x can only be stored in cache block
c = (x div B) mod C. If the CPU accesses location x and cache block c holds the values from
x’s block the access is a cache hit; otherwise it is a cache miss and the contents of the block
containing x are copied into cache block c, evicting the current contents of cache block c. For our
Permute phase (out-of-place permutation)
1 for i := 0 to n − 1 do
      key := DATA[i];
      x := classify(key);
      idx := COUNT[x];
      COUNT[x]++;
      DEST[idx] := key;
Figure 1: Permute phase for an ‘out-of-place’ permutation in a generic distribution sorting algorithm. DATA holds the input keys. COUNT and DEST are auxiliary arrays.
purposes, cache misses can be classified into compulsory misses, which occur when a memory block
is accessed for the first time, capacity misses, which occur on an access to a memory block that
was previously evicted because the cache could not hold all the blocks being actively accessed, and
conflict misses, which happen when a block is evicted from cache because another memory block
that mapped to the same cache block was accessed.
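The direct-mapped placement rule c = (x div B) mod C can be made concrete with a small simulator. The following Python sketch is illustrative only; the class name and interface are assumptions made here, and it counts misses without distinguishing the three kinds.

```python
class DirectMappedCache:
    """Direct-mapped cache with C blocks of B locations each."""

    def __init__(self, B, C):
        self.B = B
        self.C = C
        self.tags = [None] * C  # memory block currently held by each cache block
        self.misses = 0

    def access(self, x):
        """Access memory location x; return True on a hit, False on a miss."""
        block = x // self.B       # memory block containing x
        c = block % self.C        # the only cache block that may hold it
        if self.tags[c] == block:
            return True
        self.tags[c] = block      # evict the current contents of cache block c
        self.misses += 1
        return False
```

With B = 4 and C = 2, locations 0 and 8 lie in memory blocks 0 and 2, which both map to cache block 0, so alternating between them misses on every access.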
3 Distribution sorting
As noted in the introduction, a distribution pass has two main phases, a count phase and a permute
phase, and our focus here is on the latter.
In describing these algorithms, the term data array refers to the array holding the input keys,
and the term count array refers to an auxiliary array used by these algorithms.
The count phase counts for class 1 ≤ i ≤ k− 1, the total number of keys in classes 0, . . . , i− 1.
For class i = 0 this cumulative count is 0. Ladner et al [7] give an analysis of the count phase of
distribution sorting on a direct-mapped cache for uniformly and randomly distributed keys.
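A minimal Python sketch of the count phase follows, assuming classes are numbered 0, . . . , k − 1 and that `classify` is supplied by the caller; the function name `count_phase` is introduced here for illustration.

```python
def count_phase(data, k, classify):
    """Return COUNT, where COUNT[i] is the number of keys in classes 0..i-1."""
    count = [0] * k
    for key in data:
        count[classify(key)] += 1
    total = 0
    for i in range(k):
        # Replace the per-class count by the cumulative count of lower classes.
        count[i], total = total, total + count[i]
    return count
```

After this phase COUNT[i] is the index at which the first key of class i belongs in the sorted order of classes.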
There are two main variants of the permute phase. In the first variant, keys are permuted from
the data array to an auxiliary destination array; this is called an out-of-place permutation. In the
second variant, keys in the data array are permuted within the data array; this is called an in-place
permutation.
3.1 Permute phase
The permute phase uses the cumulative count of keys generated during the count phase, to permute
the keys to their respective classes. We now describe the two variants of the permute phase. In
the description below it is assumed that k has been appropriately initialised, and that the function
classify maps a key to a class numbered {0, . . . , k − 1} in O(1) time.
3.1.1 Out-of-place permute
During an out-of-place permute, for any class j, unless all elements of that class have already
been moved, COUNT[j] points to the leftmost (lowest-numbered) available location for an element
of class j in an n element auxiliary array, DEST. Figure 1 shows the pseudo-code for out-of-place
permutation. In Step 1, for each element in DATA: we determine its class; using the count array
we determine the next available location for this key in the DEST array; we increment the count
array, thus setting the location for the next key of the same class; finally we move the key to its
location in DEST. Since each step takes constant time, this out-of-place permutation takes O(n)
time whenever k ≤ n.
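The out-of-place permute described above can be sketched in Python as follows. The sketch assumes that `count` holds the cumulative counts produced by the count phase and that `classify` returns a class in 0, . . . , k − 1; the function name is an assumption made here.

```python
def permute_out_of_place(data, count, classify):
    """Move each key of data to its class region in a new array DEST."""
    dest = [None] * len(data)
    count = list(count)       # work on a copy so the caller's array is unchanged
    for key in data:
        x = classify(key)
        idx = count[x]        # next available location for class x
        count[x] += 1
        dest[idx] = key
    return dest
```

Keys of the same class keep their relative input order, since each class region is filled from left to right.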
Permute phase (in-place permutation)
1   leader := n − 1;
2   idx := leader; key := DATA[idx];
3.1 x := classify(key);
3.2 idx := COUNT[x];
3.3 COUNT[x]++;
3.4 swap key and DATA[idx];
3.5 if idx ≠ leader repeat 3.1;
4   while (x > 0 ∧ COUNT[x − 1] ≥ START[x])
        x--;
5   if (x > 0) leader := START[x] − 1;
        go to 2;
Figure 2: Permute phase for an ‘in-place’ permutation in a generic distribution sorting algorithm.
DATA holds the input keys. COUNT and START are auxiliary arrays. After the count phase, COUNT is
copied into START.
3.1.2 In-place Permute
The in-place permutation strategy described here is similar to that described by Knuth [6, Soln
5.2-13]. Before an in-place permute phase begins, a copy of the count array is made in a k element
auxiliary start array. During the permute phase, for any class j, an invariant is that locations
START[j], START[j]+ 1, . . . , COUNT[j]− 1 contain elements of class j, i.e. COUNT[j] points to the
leftmost (lowest-numbered) available location for an element of class j. Thus, for j = 0, . . . , k− 2,
all elements of class j have been permuted if COUNT[j] ≥ START[j + 1], and such a class will be
called complete in what follows. Class k − 1 is complete when COUNT[k− 1] ≥ n. Figure 2 shows
the pseudo-code for in-place permutation. We now describe this permutation, which consists of
two main activities: cycle following and cycle leader finding. In cycle following, keys are moved to
their final destinations in the data array along a cycle in the permutation (Steps 2 and 3). Once
a cycle is completed, we move to cycle leader finding, where we find the ‘leader’ (index of the
rightmost element) of the next cycle (Steps 1, 4 and 5). A cycle leader is simply the rightmost
location of the highest-numbered incomplete class. By the definition of a complete class, initially
the leader must be position n− 1. In more detail, the steps are as follows:
• In Step 1 n− 1 is selected as the first cycle leader.
• In Step 2 the key at the leader’s position is copied into the variable key, thus leaving a ‘hole’
in the leader’s position.
• In Steps 3.1-3.5 the key key is swapped with the key at key’s final position. If key ‘fills the
hole’, the cycle is complete, otherwise we repeat these steps.
• In Step 4 the algorithm searches for a new cycle leader. Suppose the leader of the cycle
which just completed was the last location of class j. When this cycle ends, class j must also
be complete, as a key of class j has been moved into the last location of class j. Note that
classes j + 1, j + 2, . . . must already have been complete when the leader of this cycle was
found. Note that the program variable x has value j at the end of this cycle, so the search
for the next leader begins with class j − 1, counting down (Step 4).
• In Step 5 we check to see if all classes have completed and terminate if this is the case.
Clearly the in-place permutation in one pass of distribution sorting takes O(n) time whenever
k ≤ n.
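The cycle following and cycle leader finding steps above can be sketched in Python as follows. This is an illustrative transcription of the steps, not a tuned implementation; it assumes `count` holds the cumulative counts from the count phase and that classes are numbered 0, . . . , k − 1.

```python
def permute_in_place(data, count, classify):
    """Rearrange data in place so that keys are grouped by class."""
    n = len(data)
    if n == 0:
        return
    count = list(count)   # COUNT: next free location of each class
    start = list(count)   # START: copy of COUNT taken before permuting
    leader = n - 1        # Step 1
    while True:
        idx = leader      # Step 2: picking up the key leaves a hole at leader
        key = data[idx]
        while True:       # Steps 3.1-3.5: follow one cycle
            x = classify(key)
            idx = count[x]
            count[x] += 1
            key, data[idx] = data[idx], key
            if idx == leader:
                break
        # Step 4: skip classes below x that are already complete
        while x > 0 and count[x - 1] >= start[x]:
            x -= 1
        if x == 0:        # Step 5: every class is complete
            return
        leader = start[x] - 1
```

For example, with integer keys that are their own classes, the input [3, 0, 2, 0, 1] with cumulative counts [0, 2, 3, 4] is rearranged into [0, 0, 1, 2, 3] using two cycles.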
4 Cache analysis
We now analyse cache misses in a direct-mapped cache during the permute phase of distribution
sorting when the keys are independently drawn from a non-uniform random distribution. In the
permute phase of distribution sorting, when a key is moved to its destination, the algorithms
described in Section 3 access any one of k elements in the COUNT array and any one of k locations
in the DATA or DEST arrays, depending on whether the permutation is in-place or out-of-place.
The actual locations accessed are dependent on the value of the permuted key, so, if the keys
are independently and randomly distributed then, for every key permuted there are two random
accesses to memory, one in the count array and one in DATA or DEST. These random accesses can
potentially lead to a large number of cache conflict misses.
Our approach is to define two continuous processes which model in-place and out-of-place
permutations. Process “in-place” models an in-place permutation and is shown in Figure 3, and
Process “out-of-place” models an out-of-place permutation and is shown in Figure 4. Each round of
a process models the permutation of a key to its destination, and we analyse the expected number
of cache misses in n rounds of these processes. Our precise equations are difficult to compute so
we also give closed-form upper and lower bounds on these precise equations. We use our results
for in-place permutations to get upper and lower bounds on the expected number of cache misses
in a process which models accesses to multiple sequences.
The assumptions in the processes mean that we have to access at least n distinct locations in
memory, which requires Ω(n/B) cache misses. In the analysis, we will say that a process is optimal
if it incurs O(n/B) cache misses. In distribution sorting, the larger the value of k, the fewer the
number of passes over the data, hence the fewer the capacity misses. As we will see, if k is too
large, then there can be a large number of conflict misses. The aim of the analysis is to determine
the largest value of k, for a particular distribution of keys, such that there are O(n/B) misses in
one pass of distribution sorting.
4.1 Processes
We now give the two processes which model the distributing of keys drawn independently and
randomly from a non-uniform distribution into k classes.
4.1.1 Process to model an in-place permutation
Let k be an integer, 2 ≤ k ≤ CB. We are given k probabilities $p_1, \ldots, p_k$, such that $\sum_{i=1}^{k} p_i = 1$.
The process maintains k pointers D1, . . . , Dk, and there are also k consecutive ‘count array’ loca-
tions, C = c1, . . . , ck. The process (henceforth called Process “in-place”) executes a sequence of
rounds, where each round consists in performing steps 1-3 below:
Process “in-place”
1. Pick an integer x from {1, . . . , k} such that Pr[x = i] = pi,
independently of all previous picks.
2. Access the location cx.
3. Access the location pointed to by Dx, increment Dx by 1.
We denote the locations accessed by the pointer Di by di,1, di,2, . . ., for i = 1, . . . , k. We assume
that:
(a) the start position of each pointer is uniformly and independently distributed over the cache,
i.e., for each i, di,1 mod BC is uniformly and independently distributed over {0, . . . , BC−1},
(b) during the process, the pointers traverse sequences of memory locations which are disjoint
from each other and from C,
Figure 3: Process “Inplace”. (Diagram: 1) randomly select x from 1, . . . , k; 2) access $c_x$; 3) access at $D_x$, increment $D_x$.)
(c) c1 is located on an aligned block boundary, i.e., c1 mod B = 0,
(d) the pointers Di, for i = 1, . . . , k, are in separate memory blocks.
Assuming that the cache is initially empty, the objective is to determine the expected number
of cache misses incurred by the above process over n rounds, with the expectation taken over the
random choices in Step 1 as well as the starting positions of the pointers.
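To make the process concrete, the following Python sketch simulates n rounds of Process “in-place” on a direct-mapped cache and counts misses. The memory layout is an assumption made here to satisfy (a)-(d): the count locations occupy addresses 0, . . . , k − 1, and each pointer sequence lives in its own region, starting at a uniformly random offset modulo BC. The function name and the `seed` parameter are illustrative.

```python
import random

def simulate_in_place(p, B, C, n, seed=0):
    """Return the number of cache misses in n rounds of Process "in-place"."""
    rng = random.Random(seed)
    k = len(p)
    tags = [None] * C
    misses = 0

    def access(addr):
        nonlocal misses
        block = addr // B
        slot = block % C
        if tags[slot] != block:
            tags[slot] = block
            misses += 1

    # Each region is a multiple of B*C and longer than n + B*C, so the
    # sequences stay disjoint from each other and from the count array.
    region = (n // (B * C) + 2) * B * C
    D = [(i + 1) * region + rng.randrange(B * C) for i in range(k)]
    for x in rng.choices(range(k), weights=p, k=n):  # Step 1
        access(x)                                    # Step 2: location c_x
        access(D[x])                                 # Step 3: location at D_x
        D[x] += 1
    return misses
```

Averaging the returned value over many seeds gives an empirical estimate of the expectation analysed in this section.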
4.1.2 Process to model an out-of-place permutation
This process is like Process “in-place”, but it is augmented with accesses to a sequence of con-
secutive locations in a source array, S, determined by an index s. The process, henceforth called
Process “out-of-place”, executes a sequence of rounds, where each round consists in performing
steps 1-4 below:
Process “out-of-place”
1. Access the location S[s], increment s by 1.
2. Pick an integer x from {1, . . . , k} such that Pr[x = i] = pi,
independently of all previous picks.
3. Access the location cx.
4. Access the location pointed to by Dx, increment Dx by 1.
We make assumptions (a), (c), and (d) from Process “in-place”; assumption (b) is modified as
below and we add a further assumption:
(b) during the process, the pointers traverse sequences of memory locations which are disjoint
from each other, from C and from S.
Assuming that the cache is initially empty, again the objective is to determine the expected
number of cache misses incurred by the above process over n rounds, with the expectation taken
over the random choices in Step 2 as well as the starting positions of the pointers.
4.2 Preliminaries
We now introduce some notation that will be used for the analysis. We use k to denote the number
of classes that the keys will be distributed into, and throughout the analysis we assume that B
divides k. Assume that we are given a set of k probabilities $p_1, \ldots, p_k$, such that $\sum_{i=1}^{k} p_i = 1$. The
Figure 4: Process “Out-of-place”. (Diagram: 1) access S[s]; 2) randomly select x from 1, . . . , k; 3) access $c_x$; 4) access at $D_x$, increment $D_x$.)
expected value of a function f of a random variable X is denoted as E[f(X)]. When we wish to
make explicit the distribution D from which the random variable is drawn, we will use the notation
$E_{X \sim D}[f(X)]$. All vectors have dimension k (the number of classes) unless stated otherwise, and
we denote the components of a vector $\bar{x}$ by $x_1, x_2, \ldots, x_k$. We now define some probabilities:
(i) For all $i \in \{1, \ldots, k/B\}$, $P_i = \sum_{l=(i-1)B+1}^{iB} p_l$.
(ii) For all $i \in \{1, \ldots, k\}$, we denote by $\bar{a}^i$ the following vector: $a^i_j = 0$ if $i = j$, and $a^i_j = p_j/(1 - p_i)$
otherwise and by $\bar{b}^i$ the following vector: $b^i_j = 0$ if $(i-1)B + 1 \le j \le iB$, and $b^i_j = p_j/(1 - P_i)$
otherwise. (Note that $\sum_j a^i_j = \sum_j b^i_j = 1$).
Let $m \ge 0$ be an integer and $\bar{q}$ be a vector of non-negative reals such that $\sum_i q_i = 1$. We denote
by $\varphi(m, \bar{q})$ the probability distribution on the number of balls in each of k bins, when m balls are
independently put into these bins, and a ball goes in bin i with probability $q_i$, for $i \in \{1, \ldots, k\}$.
Thus, $\varphi(m, \bar{q})$ is a distribution on vectors of non-negative integers. If $\bar{\mu}$ is drawn from $\varphi(m, \bar{q})$,
then:
\[ \Pr[\mu_1 = m_1, \ldots, \mu_k = m_k] = \left( \prod_{j=1}^{k} q_j^{m_j} \right) m! \Big/ \prod_{j=1}^{k} m_j! \tag{1} \]
whenever $\sum_{i=1}^{k} m_i = m$; all other vectors have zero probability$^1$. We now define functions $f(x)$
for $x \ge 0$ and $g(\bar{m})$ for a vector $\bar{m}$ of non-negative integers:
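\[ f(x) = \begin{cases} 1 & \text{if } x = 0, \\ 1 - \frac{x + B - 1}{BC} & \text{if } 0 < x \le BC - B + 1, \\ 0 & \text{otherwise.} \end{cases} \tag{2} \]
\[ g(\bar{m}) = \frac{1}{C} \sum_{i=1}^{k/B} \min\Big\{ 1, \sum_{l=(i-1)B+1}^{iB} m_l \Big\}. \tag{3} \]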
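The two functions can be transcribed directly into Python, which may help when evaluating the expressions of this section numerically. The sketch below assumes B divides the length of the vector passed to `g`, and passes B and C as explicit parameters; these signatures are illustrative choices.

```python
def f(x, B, C):
    """Eq. 2: f(0) = 1, then a linear decrease, and 0 beyond BC - B + 1."""
    if x == 0:
        return 1.0
    if 0 < x <= B * C - B + 1:
        return 1.0 - (x + B - 1) / (B * C)
    return 0.0

def g(m, B, C):
    """Eq. 3: fraction of the C cache blocks covered by accessed count blocks."""
    touched = sum(min(1, sum(m[i:i + B])) for i in range(0, len(m), B))
    return touched / C
```

Note the jump at zero: f(0) = 1 while f(1) = 1 − 1/C.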
We now set out some propositions that are used in the proofs.
Proposition 1 For all real numbers $x_i$, $i = 1, \ldots, k$, such that $|x_i| \le 1$ we have that:
\[ \prod_{i=1}^{k} (1 - x_i) \ge 1 - \sum_{i=1}^{k} x_i. \]
$^1$We take $0^0 = 1$ in Eq. 1.
Proposition 2 (a) For all real numbers x, such that $|x| < 1$ we have that:
\[ \sum_{m=0}^{\infty} x^m = \frac{1}{1 - x}. \]
(b) For all real numbers x, such that $|x| < 1$ we have that:
\[ \sum_{m=0}^{\infty} m x^m = \frac{x}{(1 - x)^2}. \]
(c) For all real numbers x, such that $0 < x < 2$, we have that:
\[ \sum_{m=0}^{\infty} x (1 - x)^m m = \frac{1}{x} - 1. \]
Proof. Proposition 2(a) is the standard summation for an infinite decreasing geometric series.
We obtain Proposition 2(b) by differentiating both sides of the equation in Proposition 2(a) and
multiplying by x. Proposition 2(c) is obtained by applying Proposition 2(b) to 1 − x and multiplying
by x; it is the expected value of the geometric distribution multiplied by 1 − x. ✷
Proposition 3 For all real numbers p and q such that $0 < p + q < 2$, we have that:
\[ \sum_{m=0}^{\infty} p (1 - p)^m \left( 1 - \frac{q}{1 - p} \right)^m = \frac{p}{p + q}. \]
Proof. Since $(1 - p)\left(1 - \frac{q}{1 - p}\right) = 1 - p - q$, using Proposition 2(a) we get that:
\[ \sum_{m=0}^{\infty} p (1 - p)^m \left( 1 - \frac{q}{1 - p} \right)^m = \sum_{m=0}^{\infty} p (1 - p - q)^m = \frac{p}{p + q}. \]
✷
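Propositions 2(c) and 3 are easy to sanity-check numerically by truncating the series. The Python sketch below does so for one illustrative choice of values; the truncation length and helper names are assumptions made here.

```python
def prop2c_lhs(x, terms=2000):
    """Truncated sum of x (1 - x)^m m over m = 0 .. terms - 1."""
    return sum(x * (1 - x) ** m * m for m in range(terms))

def prop3_lhs(p, q, terms=2000):
    """Truncated sum of p (1 - p)^m (1 - q / (1 - p))^m over m = 0 .. terms - 1."""
    return sum(p * (1 - p) ** m * (1 - q / (1 - p)) ** m for m in range(terms))

check_2c = abs(prop2c_lhs(0.25) - (1 / 0.25 - 1))
check_3 = abs(prop3_lhs(0.3, 0.2) - 0.3 / (0.3 + 0.2))
```

Both differences are at the level of floating-point rounding, because the neglected tails decay geometrically.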
Proposition 4 (a) For all real numbers x, we have that:
\[ e^{-x} \ge 1 - x. \]
(b) For all real numbers $x \ge 0$, we have that:
\[ e^{-x} \le 1 - x + \frac{x^2}{2}. \]
(c) For all real numbers $x_i$, $i = 1, \ldots, k$, such that $x_i \le 1$ we have that:
\[ \prod (1 - x_i) \le 1 - \sum x_i + \frac{\sum x_i^2}{2}. \]
Proof. Propositions 4(a) and 4(b) are from Taylor’s series. For Proposition 4(c) we use Proposition 1. ✷
Proposition 5 For all real numbers x and y, such that $x \le 1$ and $y \ge 0$, we have that:
\[ e^{-xy} \ge (1 - x)^y. \]
Proof. This proposition is proved using Proposition 4(a). ✷
Proposition 6 (a) For all real numbers x and p and integer y, such that $0 < p \le 1$, $y \ge 0$ and
$x(1/p + y) = O(1)$, we have that:
\[ \sum_{m=0}^{y} p (1 - p)^m m x = x \left( \frac{1}{p} - 1 \right) - O(e^{-py}). \]
(b) For all real numbers x and p and integer y, such that $0 < p \le 1$, $y \ge 0$ and $x = O(1)$, we
have that:
\[ \sum_{m=0}^{y} p (1 - p)^m x = x - O(e^{-py}). \]
(c) For all real numbers x, p and q and integer y, such that $0 < p + q < 2$, $y \ge 0$ and $\frac{xp}{p+q} = O(1)$,
we have that:
\[ \sum_{m=0}^{y} p (1 - p - q)^m x = \frac{xp}{p + q} - O(e^{-(p+q)y}). \]
(d) For all real numbers x, p and q and integer y, such that $0 < p + q < 2$, $y \ge 0$ and
$\frac{xp}{p+q} \left( \frac{1-p-q}{p+q} + y + 1 \right) = O(1)$, we have that:
\[ \sum_{m=0}^{y} p (1 - p - q)^m m x = \frac{xp(1 - p - q)}{(p + q)^2} - O(e^{-(p+q)y}). \]
Note that we are misusing the O notation here to hide constant factors that are independent of
the variables in the equations.
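Proof. Using Proposition 2(c) and Proposition 5, Proposition 6(a) is proved as follows:
\[ \begin{aligned} \sum_{m=0}^{y} p (1 - p)^m m x &= \sum_{m=0}^{\infty} p (1 - p)^m m x - (1 - p)^{y+1} \sum_{m=0}^{\infty} p (1 - p)^m (m + y + 1) x \\ &= x \left( \frac{1}{p} - 1 \right) - (1 - p)^{y+1} x \left( \frac{1}{p} + y \right) \\ &= x \left( \frac{1}{p} - 1 \right) - O(e^{-py}). \end{aligned} \]
The proofs of Propositions 6(b), 6(c) and 6(d) are now trivial. ✷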
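The middle line of the derivation above is an exact identity for the truncated sum, which can be checked numerically. The Python sketch below takes x = 1 and compares both sides for a few values of p and y; the helper names are assumptions made here.

```python
def truncated_sum(p, y):
    """Sum of p (1 - p)^m m for m = 0 .. y."""
    return sum(p * (1 - p) ** m * m for m in range(y + 1))

def closed_form(p, y):
    """(1/p - 1) - (1 - p)^(y+1) (1/p + y), the middle line of the proof."""
    return (1 / p - 1) - (1 - p) ** (y + 1) * (1 / p + y)

gap = max(abs(truncated_sum(p, y) - closed_form(p, y))
          for p in (0.1, 0.5, 0.9) for y in (0, 1, 2, 10, 50))
```

For p = 0.5 and y = 2 both sides equal 0.5, and the correction term (1 − p)^{y+1}(1/p + y) shrinks geometrically in y, which is what the O(e^{−py}) term records.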
The vector of random variables $X = (X_1, \ldots, X_n)$ is negatively associated [5] if for every two
disjoint index sets, $I, J \subset [n]$,
\[ E[f(X_i, i \in I) \, g(X_j, j \in J)] \le E[f(X_i, i \in I)] \, E[g(X_j, j \in J)] \]
for all functions $f : \Re^{|I|} \to \Re$ and $g : \Re^{|J|} \to \Re$ that are both non-decreasing or both non-increasing.
Proposition 7 If the random variables $X_1, \ldots, X_k$ are negatively associated, then for any non-decreasing
functions $f_i$, $i \in [k]$, we have that:
\[ E\left[ \prod_{i=1}^{k} f_i(X_i) \right] \le \prod_{i=1}^{k} E[f_i(X_i)]. \]
Proof. The proof follows directly from the definition of negatively associated variables. ✷
Figure 5: m rounds of Process “in-place”. Between two accesses to Di, there are m accesses to
“other” pointers, and m+ 1 accesses to C.
4.3 Cache Analysis of In-place Permutation
In this section we analyse the cache misses in a direct-mapped cache during n rounds of Process
“in-place”, introduced in Section 4.1.1. We derive a precise equation for the expected number of
cache misses and then give closed form upper and lower bounds on this equation. We then derive
upper and lower bounds assuming the keys are drawn independently from a uniform distribution.
4.3.1 Average case analysis
We start by proving a theorem for the expected number of cache misses during n rounds of Process
“in-place”.
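Theorem 1 The expected number X of cache misses in n rounds of Process “in-place” satisfies
$n(p_c + p_d) \le X \le n(p_c + p_d) + k(1 + 1/B)$, where:
\[ p_c = \sum_{i=1}^{k/B} P_i \left( 1 - \sum_{m=0}^{\infty} P_i (1 - P_i)^m \, E_{\bar{\nu} \sim \varphi(m, \bar{b}^i)} \left[ \prod_{j=1}^{k} f(\nu_j) \right] \right) \]
and
\[ p_d = \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i \left( 1 - \sum_{m=0}^{\infty} p_i (1 - p_i)^m \, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)} \left[ (1 - g(\bar{\mu})) \prod_{j=1}^{k} f(\mu_j) \right] \right). \]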
Proof. We first analyse the miss rates for accesses to pointers D1, . . . , Dk. Fix an i, 1 ≤ i ≤ k
and a z ≥ 1. Let µ be the random variable which denotes the number of rounds between accesses
to locations di,z and di,z+1 (µ = 0 if these locations are accessed in consecutive rounds). Figure 5
shows the other memory accesses between accesses z and z + 1 to Di. Clearly, $\Pr[\mu = m] = p_i (1 - p_i)^m$,
for m = 0, 1, . . .. Let Xi denote the event that none of the memory accesses in these
µ rounds accesses the cache block to which di,z is mapped. We now fix an integer m ≥ 0 and
calculate Pr[Xi|µ = m]. Let µ¯ be a vector of random variables such that for 1 ≤ j ≤ k, µj is
the random variable which denotes the number of accesses to Dj in these m rounds. Clearly µ¯ is
drawn from ϕ(m, a¯i) (note that Di is not accessed in these m rounds by definition).
Fix any vector m¯, such that Pr[µ¯ = m¯] ≠ 0, and let mj be the number of accesses to pointer Dj
in these m rounds. Since mi must be zero, f(mi) = 1, and for j ≠ i, f(mj) is the probability that
none of the mj locations accessed by Dj in these m rounds is mapped to the same cache block as
location di,z [9, 14]. Similarly g(m¯) ·C is the number of count blocks accessed in these rounds, and
so 1− g(m¯) is the probability that the cache block containing di,z does not conflict with the blocks
from C which were accessed in these m rounds. As the latter probability is determined by the
starting location of sequence i and the former probabilities by the starting location of sequences
j, j ≠ i, we conclude that for a given configuration m¯ of accesses, the probability that the cache
block containing $d_{i,z}$ is not accessed in these m rounds is $(1 - g(\bar{m})) \prod_{j=1}^{k} f(m_j)$. Averaging over
all configurations $\bar{m}$, we get that
\[ \Pr[X_i \mid \mu = m] = E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)} \left[ (1 - g(\bar{\mu})) \prod_{j=1}^{k} f(\mu_j) \right]. \tag{4} \]
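Finally we get,
\[ \begin{aligned} \Pr[X_i] &= \sum_{m=0}^{\infty} \Pr[\mu = m] \Pr[X_i \mid \mu = m] \\ &= \sum_{m=0}^{\infty} p_i (1 - p_i)^m \, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)} \left[ (1 - g(\bar{\mu})) \prod_{j=1}^{k} f(\mu_j) \right]. \end{aligned} \tag{5} \]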
If di,z is at a cache block boundary or if Xi does not occur given that di,z is not at a cache
block boundary (Pr[Xi] does not change under this condition), then a cache miss will occur. The
first access to a pointer is a cache miss. So other than for the first access, the probability pd of a
cache miss for a pointer access is:
\[ p_d = \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i (1 - \Pr[X_i]). \tag{6} \]
Including the first access misses, the expected number of cache misses for pointer accesses is at
most
\[ \sum_{i=1}^{k} \left[ 1 + (n p_i - 1) \left( \frac{B - 1}{B} (1 - \Pr[X_i]) + \frac{1}{B} \right) \right] \le n p_d + k. \tag{7} \]
We now consider the probability of a cache miss for an access to a count array location. It is
convenient to partition C into count blocks of B locations each, where the i-th count block consists
of the locations $c_{(i-1)B+1}, \ldots, c_{iB}$, for i = 1, . . . , k/B. So Pi is the probability of access to the
i-th block. We fix an i ∈ {1, . . . , k/B} and a z ≥ 1. Let ν be the random variable that denotes
the number of rounds between the z-th and (z + 1)-st accesses to the i-th count block. We have
that $\Pr[\nu = m] = P_i (1 - P_i)^m$, for m = 0, 1, . . .. Let Yi denote the event that none of the memory
accesses in these m rounds accesses the cache block to which the i-th count block is mapped.
We now fix an integer m ≥ 0 and calculate Pr[Yi|ν = m]. Let ν¯ be a vector of random variables
such that for 1 ≤ j ≤ k, νj is the random variable which denotes the number of accesses to Dj
in these m rounds. The fact that k ≤ BC, together with assumption (c), means that two blocks from C cannot
conflict with each other. As the pointers $D_{(i-1)B+1}, \ldots, D_{iB}$ will not be accessed between two
successive accesses to count block i, the probability of accessing pointer $D_j$ is given by $b^i_j$ and
$\varphi(m, \bar{b}^i)$ is the distribution for $\bar{\nu}$. Arguing as above:
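\[ \begin{aligned} \Pr[Y_i] &= \sum_{m=0}^{\infty} \Pr[\nu = m] \Pr[Y_i \mid \nu = m] \\ &= \sum_{m=0}^{\infty} P_i (1 - P_i)^m \, E_{\bar{\nu} \sim \varphi(m, \bar{b}^i)} \left[ \prod_{j=1}^{k} f(\nu_j) \right]. \end{aligned} \tag{8} \]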
The first access to a count array block is a cache miss, for all other accesses there is a cache
miss if event Yi does not occur. So other than for the first access, the probability pc of a cache
miss for a count array access is:
\[ p_c = \sum_{i=1}^{k/B} P_i (1 - \Pr[Y_i]). \tag{9} \]
Including the first access misses, the expected number of cache misses for count array accesses
is at most
\[ \sum_{i=1}^{k/B} \left[ 1 + (n P_i - 1)(1 - \Pr[Y_i]) \right] \le n p_c + k/B. \tag{10} \]
Plugging in the values from Eq. 5 into Eq. 7 and from Eq. 8 into Eq. 10 we get the upper bound
on X, the expected number of cache misses in the process. The lower bound in Theorem 1 is
obvious. ✷
4.3.2 Upper bound
We now prove a theorem on the upper bound to the expected number of cache misses during n
rounds of Process “in-place”.
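Theorem 2 The expected number of cache misses in n rounds of Process “in-place” is at most
$n(p_d + p_c) + k(1 + 1/B)$, where:
\[ p_d \le \frac{1}{B} + \frac{k}{BC} + \frac{B - 1}{BC} \sum_{i=1}^{k} \left[ \sum_{j=1}^{k/B} \frac{p_i P_j}{p_i + P_j} + \frac{B - 1}{B} \sum_{j=1}^{k} \frac{p_i p_j}{p_i + p_j} \right], \]
\[ p_c \le \frac{k}{B^2 C} + \frac{B - 1}{BC} \sum_{i=1}^{k/B} \sum_{j=1}^{k} \frac{P_i p_j}{P_i + p_j}. \]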
Proof. In the proof we derive lower bounds for Pr[Xi] and Pr[Yi] and use these to derive the
upper bounds on pd and pc.
Again, we consider a fixed i and consider the event Xi defined in the proof of Theorem 1. We
now obtain a lower bound on Pr[Xi].
Lower bound on Pr[Xi]
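Letting $\Gamma(x) = 1 - f(x)$ and using Proposition 1 we can rewrite Eq. 5 as:
\[ \Pr[X_i] \ge \sum_{m=0}^{\infty} \Pr[\mu = m] \, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)} \left[ 1 - g(\bar{\mu}) - \sum_{j=1}^{k} \Gamma(\mu_j) \right]. \tag{11} \]
We know that the j-th count block contributes 1/C to $g(\bar{\mu})$ if there is an access to that block
and $\Pr[\text{j-th count block accessed} \mid \mu = m] = 1 - (1 - c^i_j)^m$, where $c^i_j = \frac{P_j}{1 - p_i}$. So we have that,
\[ E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[g(\bar{\mu})] = \sum_{j=1}^{k/B} \frac{1}{C} \left( 1 - (1 - c^i_j)^m \right), \]
and we get,
\[ \begin{aligned} \sum_{m=0}^{\infty} \Pr[\mu = m] \, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[g(\bar{\mu})] &= \sum_{m=0}^{\infty} p_i (1 - p_i)^m \sum_{j=1}^{k/B} \frac{1}{C} \left( 1 - (1 - c^i_j)^m \right) \\ &= \frac{1}{C} \sum_{j=1}^{k/B} \sum_{m=0}^{\infty} p_i (1 - p_i)^m \left( 1 - (1 - c^i_j)^m \right), \end{aligned} \]
and using Proposition 3 we get,
\[ \sum_{m=0}^{\infty} \Pr[\mu = m] \, E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[g(\bar{\mu})] = \frac{1}{C} \sum_{j=1}^{k/B} \frac{P_j}{p_i + P_j}. \tag{12} \]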
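We now evaluate
\[ \sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)]. \]
Our approach is to first fix j and evaluate $E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)]$. For $m \le BC$, we know that
\[ E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)] = \sum_{l=0}^{m} \Pr[\mu_j = l] \frac{l + B - 1}{BC} - \Pr[\mu_j = 0] \frac{B - 1}{BC}. \]
The last term is due to the fact that $\Gamma(x)$ is discontinuous and $\Gamma(0) = 0$. Similarly for $m > BC$
we know that
\[ E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)] = \sum_{l=0}^{m} \Pr[\mu_j = l] \frac{l + B - 1}{BC} - \Pr[\mu_j = 0] \frac{B - 1}{BC} - \sum_{l=BC-B+1}^{m} \Pr[\mu_j = l] \left( \frac{l + B - 1}{BC} - 1 \right). \]
The last term is due to the fact that $\Gamma(x) = 1$ for $x \ge BC - B + 1$. If we drop this last term when
$m > BC$, we get that for all m
\[ E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)] \le \frac{1}{BC} \left[ \sum_{l=0}^{m} \Pr[\mu_j = l] \, l + (B - 1)(1 - \Pr[\mu_j = 0]) \right]. \]
The summation term is the expected value of the random variable with the binomial distribution
$b(l; m, a^i_j)$. So we get that
\[ E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)] \le \frac{1}{BC} \left[ m a^i_j + (B - 1) \left( 1 - (1 - a^i_j)^m \right) \right]. \tag{13} \]
We now evaluate $\sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)]$ as
\[ \sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)] \le \sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} \frac{1}{BC} \left[ m a^i_j + (B - 1) \left( 1 - (1 - a^i_j)^m \right) \right]. \]
Since $\sum_{j=1}^{k} m a^i_j = m$, we get $\sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} m a^i_j = \frac{1}{p_i} - 1$ by an application of Proposition 2(c).
By applying Proposition 3 we get that $\sum_{j=1}^{k} \sum_{m=0}^{\infty} \Pr[\mu = m] (B - 1) \left( 1 - (1 - a^i_j)^m \right) = (B - 1) \sum_{j=1}^{k} \frac{p_j}{p_i + p_j}$. So we get:
\[ \sum_{m=0}^{\infty} \Pr[\mu = m] \sum_{j=1}^{k} E_{\bar{\mu} \sim \varphi(m, \bar{a}^i)}[\Gamma(\mu_j)] \le \frac{1}{BC} \left[ \frac{1}{p_i} + (B - 1) \sum_{j=1}^{k} \frac{p_j}{p_i + p_j} \right]. \tag{14} \]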
Substituting Eq. 12 and 14 in Eq. 11 we obtain the following lower bound for Pr[Xi]
\[ \Pr[X_i] \ge 1 - \frac{1}{C} \sum_{j=1}^{k/B} \frac{P_j}{p_i + P_j} - \frac{1}{BC} \left[ \frac{1}{p_i} + (B - 1) \sum_{j=1}^{k} \frac{p_j}{p_i + p_j} \right]. \tag{15} \]
Upper bound on pd
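Finally, substituting Pr[Xi] from Eq. 15 in Eq. 6 we get:
\[ \begin{aligned} p_d &\le \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i \left[ \frac{1}{C} \sum_{j=1}^{k/B} \frac{P_j}{p_i + P_j} + \frac{1}{BC} \left( \frac{1}{p_i} + (B - 1) \sum_{j=1}^{k} \frac{p_j}{p_i + p_j} \right) \right] \\ &= \frac{1}{B} + \frac{(B - 1)k}{B^2 C} + \frac{B - 1}{BC} \sum_{i=1}^{k} \sum_{j=1}^{k/B} \frac{p_i P_j}{p_i + P_j} + \frac{(B - 1)^2}{B^2 C} \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{p_i p_j}{p_i + p_j} \\ &\le \frac{1}{B} + \frac{k}{BC} + \frac{B - 1}{BC} \sum_{i=1}^{k} \left[ \sum_{j=1}^{k/B} \frac{p_i P_j}{p_i + P_j} + \frac{B - 1}{B} \sum_{j=1}^{k} \frac{p_i p_j}{p_i + p_j} \right]. \end{aligned} \]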
We can evaluate pc using a very similar approach, as sketched out now. We again consider a
fixed i and consider the event Yi defined in the proof of Theorem 1. We now obtain a lower bound
on Pr[Yi].
Lower bound on Pr[Yi]
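Again letting Γ(x) = 1 − f(x) and using Proposition 1, we can rewrite Eq. 8 as:
\[
\Pr[Y_i] \ge \sum_{m=0}^{\infty} \Pr[\nu = m]\, E_{\bar\nu \sim \varphi(m,\bar b_i)} \left[ 1 - \sum_{j=1}^{k} \Gamma(\nu_j) \right]. \tag{16}
\]
Arguing as for the derivation of Eq. 13, we get
\[
E_{\bar\nu \sim \varphi(m,\bar b_i)}[\Gamma(\nu_j)] \le \frac{1}{BC} \left[ m b^i_j + (B - 1)\left( 1 - (1 - b^i_j)^m \right) \right].
\]
Then arguing as for the derivation of Eq. 14, we get
\[
\sum_{m=0}^{\infty} \Pr[\nu = m] \sum_{j=1}^{k} E_{\bar\nu \sim \varphi(m,\bar b_i)}[\Gamma(\nu_j)] \le \frac{1}{BC} \left[ \frac{1}{P_i} + (B - 1) \sum_{j=1}^{k} \frac{p_j}{P_i + p_j} \right].
\]
Substituting this into Eq. 16, we get:
\[
\Pr[Y_i] \ge 1 - \frac{1}{BC} \left[ \frac{1}{P_i} + (B - 1) \sum_{j=1}^{k} \frac{p_j}{P_i + p_j} \right]. \tag{17}
\]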
Upper bound on pc
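Substituting $\Pr[Y_i]$ from Eq. 17 in Eq. 9 we get
\[
p_c \le \sum_{i=1}^{k/B} P_i\, \frac{1}{BC} \left( \frac{1}{P_i} + (B - 1) \sum_{j=1}^{k} \frac{p_j}{P_i + p_j} \right)
= \frac{k}{B^2 C} + \frac{B - 1}{BC} \sum_{i=1}^{k/B} \sum_{j=1}^{k} \frac{P_i p_j}{P_i + p_j}.
\]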
✷
This proves the upper bound for the equation in Theorem 1. We now prove a lower bound on that
equation.
4.3.3 Lower bound
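Theorem 3 When $p_i \ge 1/C$ then the expected number of cache misses in n rounds of Process “in-place” is at least $n p_d + k$, where:
\[
\begin{aligned}
p_d \ge{} & \frac{1}{B} + \frac{k(2C - k)}{2C^2} + \frac{k(k - 3C)}{2BC^2} - \frac{1}{2BC} - \frac{k}{2B^2 C}
+ \frac{B(k - C) + 2C - 3k}{BC^2} \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{(p_i)^2}{p_i + p_j} \\
& + \frac{(B - 1)^2}{B^3 C^2} \sum_{i=1}^{k} p_i \left[ \sum_{j=1}^{k} \frac{p_i (1 - p_i - p_j)}{(p_i + p_j)^2} - \frac{B - 1}{2} \sum_{j=1}^{k} \sum_{l=1}^{k} \frac{p_i}{p_i + p_j + p_l - p_j p_l} \right] - O(e^{-B}).
\end{aligned}
\]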
Proof. We again consider a fixed i and consider the event Xi defined in the proof of Theorem 1.
Let µ¯ be as defined in the proof of Theorem 1. We now obtain an upper bound on Pr[Xi].
Upper bound on Pr[Xi]
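In [3] it is shown that the variables $\mu_j$ are negatively associated [5]. Noting that f(x) is a non-increasing function of x, then using Proposition 7 we have that:
\[
E_{\bar\mu \sim \varphi(m,\bar a_i)}\left[ \prod_{j=1}^{k} f(\mu_j) \right] \le \prod_{j=1}^{k} E_{\bar\mu \sim \varphi(m,\bar a_i)}[f(\mu_j)].
\]
So we can re-write Eq. 5 as:
\[
\Pr[X_i] \le \sum_{m=0}^{BC-B} \Pr[\mu = m] \prod_{j=1}^{k} E_{\bar\mu \sim \varphi(m,\bar a_i)}[f(\mu_j)] + \sum_{m=BC-B+1}^{\infty} \Pr[\mu = m]. \tag{18}
\]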
We first bound the last term. We know that
\[
\sum_{m=BC-B+1}^{\infty} \Pr[\mu = m] = (1 - p_i)^{BC-B+1} \sum_{m=0}^{\infty} \Pr[\mu = m] = (1 - p_i)^{BC-B+1}.
\]
Using Proposition 5 we get that $(1 - p_i)^{BC-B+1} \le e^{-(BC-B+1)p_i}$. Assuming $p_i \ge 1/C$ the last term is at most $O(e^{-B})$.
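We now bound the first term in Eq. 18. We use an approach similar to the derivation of Eq. 13. Since $\mu \le BC - B$ implies $\mu_j \le BC - B$, we don’t have to drop any terms in the simplification, so we get that:
\[
E_{\bar\mu \sim \varphi(m,\bar a_i)}[f(\mu_j)] = 1 - \frac{1}{BC} \left( m a^i_j + (B - 1)\left( 1 - (1 - a^i_j)^m \right) \right).
\]
Letting $t_j(m) = \frac{1}{BC} \left( m a^i_j + (B - 1)(1 - (1 - a^i_j)^m) \right)$ and using Proposition 4(a) we get that $e^{-\sum_j t_j(m)} \ge \prod_j (1 - t_j(m))$. So we have that
\[
\begin{aligned}
\Pr[X_i] &\le \sum_{m=0}^{BC-B} \Pr[\mu = m] \exp\left( -\frac{1}{BC} \sum_{j=1}^{k} \left( m a^i_j + (B - 1)(1 - (1 - a^i_j)^m) \right) \right) + O(e^{-B}) \\
&\le \sum_{m=0}^{BC-B} \Pr[\mu = m] \exp\left( -\frac{1}{BC} \left( m + (B - 1)\Bigl( k - \sum_{j=1}^{k} (1 - a^i_j)^m \Bigr) \right) \right) + O(e^{-B}).
\end{aligned}
\]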
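Using Proposition 4(b) and letting $\beta_j = (1 - a^i_j)$ we get that:
\[
\begin{aligned}
\Pr[X_i] \le \sum_{m=0}^{BC-B} \Pr[\mu = m] \Biggl[ & 1 - \frac{(B - 1)k}{BC} + \frac{((B - 1)k)^2}{2(BC)^2} + \frac{m^2}{2(BC)^2} - \frac{1}{BC} \Bigl( m - (B - 1) \sum_{j=1}^{k} \beta_j^m \Bigr) \\
& - \frac{(B - 1)}{2(BC)^2} \Bigl( 2m \sum_{j=1}^{k} \beta_j^m + 2(B - 1)k \sum_{j=1}^{k} \beta_j^m \Bigr) \\
& + \frac{(B - 1)}{2(BC)^2} \Bigl( 2mk + (B - 1) \sum_{j=1}^{k} \sum_{l=1}^{k} \beta_j^m \beta_l^m \Bigr) \Biggr] + O(e^{-B}). \tag{19}
\end{aligned}
\]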
We now evaluate the terms in Eq. 19 assuming that $p_i \ge 1/C$, so $k \le C$. For the simplifications of the subtractive terms we use the fact that $e^{-p_i(BC-B+1)} \le e^{-B}$.
Since $p_i \ge 1/C$, $(1/p_i + BC - B)/(BC) = O(1)$, so using Proposition 6(a), we get that
\[
\sum_{m=0}^{BC-B} \Pr[\mu = m]\, \frac{m}{BC} = \frac{1}{BC} \left( \frac{1}{p_i} - 1 \right) - O(e^{-B}). \tag{20}
\]
Since $p_i \ge 1/C$, $(B - 1)k/(BC) < 1$, so using Proposition 6(b), we get that
\[
\sum_{m=0}^{\alpha-1} \Pr[\mu = m]\, \frac{(B - 1)k}{BC} = \frac{(B - 1)k}{BC} - O(e^{-B}). \tag{21}
\]
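We now evaluate the term
\[
\begin{aligned}
\frac{(B - 1)}{(BC)^2} \sum_{m=0}^{\alpha-1} \Pr[\mu = m]\, m \sum_{j=1}^{k} \beta_j^m
&= \frac{(B - 1)}{(BC)^2} \sum_{j=1}^{k} \sum_{m=0}^{\infty} \Pr[\mu = m]\, m \beta_j^m
 - (1 - p_i)^{\alpha} \frac{(B - 1)}{(BC)^2} \sum_{m=0}^{\infty} \Pr[\mu = m] (m + \alpha) \sum_{j=1}^{k} \beta_j^m \\
&= \frac{(B - 1)}{(BC)^2} \sum_{j=1}^{k} \frac{p_i (1 - p_i - p_j)}{(p_i + p_j)^2}
 - (1 - p_i)^{\alpha} \frac{(B - 1)}{(BC)^2} \left[ \sum_{j=1}^{k} \frac{p_i (1 - p_i - p_j)}{(p_i + p_j)^2} + \sum_{j=1}^{k} \frac{\alpha p_i}{p_i + p_j} \right] \\
&= \frac{(B - 1)}{(BC)^2} \sum_{j=1}^{k} \frac{p_i (1 - p_i - p_j)}{(p_i + p_j)^2} - O(e^{-B}). \tag{22}
\end{aligned}
\]
The last simplification is due to $p_i \ge 1/C$ and $k \le C$, so $\sum_{1 \le j \le k} (p_i (1 - p_i - p_j))/(p_i + p_j)^2 \le kC \le C^2$ and $\sum_{1 \le j \le k} (\alpha p_i)/(p_i + p_j) \le kCB \le BC^2$.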
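Substituting back $(1 - a^i_j) = \beta_j$ and using Proposition 3 we get that
\[
\begin{aligned}
\frac{(B - 1)^2 k}{(BC)^2} \sum_{m=0}^{BC-B} \Pr[\mu = m] \sum_{j=1}^{k} \beta_j^m
&= \frac{(B - 1)^2 k}{(BC)^2} \sum_{j=1}^{k} \frac{p_i}{p_i + p_j} - (1 - p_i)^{\alpha} \frac{(B - 1)^2 k}{(BC)^2} \sum_{j=1}^{k} \frac{p_i}{p_i + p_j} \\
&= \frac{(B - 1)^2 k}{(BC)^2} \sum_{j=1}^{k} \frac{p_i}{p_i + p_j} - O(e^{-B}). \tag{23}
\end{aligned}
\]
The last step used $\sum_{j=1}^{k} p_i/(p_i + p_j) \le k$ and $((B - 1)k)^2/(BC)^2 < 1$.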
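We now evaluate the additive terms, starting with
\[
\sum_{m=0}^{BC-B} \Pr[\mu = m]\, \frac{mk(B - 1)}{(BC)^2} \le \frac{(B - 1)k}{(BC)^2} \left( \frac{1}{p_i} - 1 \right). \tag{24}
\]
Since $m < BC$ we now get that:
\[
\sum_{m=0}^{BC-B} \Pr[\mu = m]\, \frac{m^2}{2(BC)^2} \le \frac{1}{2BC} \left( \frac{1}{p_i} - 1 \right). \tag{25}
\]
Substituting back $(1 - a^i_j) = \beta_j$ and using Proposition 3 we get that:
\[
\frac{B - 1}{BC} \sum_{m=0}^{BC-B} \Pr[\mu = m] \sum_{j=1}^{k} \beta_j^m \le \frac{B - 1}{BC} \sum_{j=1}^{k} \frac{p_i}{p_i + p_j}. \tag{26}
\]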
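Finally we evaluate $\sum_{m=0}^{\alpha-1} \Pr[\mu = m] \sum_{j=1}^{k} \sum_{l=1}^{k} \beta_j^m \beta_l^m$, by first evaluating
\[
(1 - p_i)\beta_j \beta_l = (1 - p_i)(1 - a^i_j)(1 - a^i_l) = \frac{1 - 2p_i - p_j - p_l + p_i (p_i + p_j + p_l) + p_j p_l}{(1 - p_i)}.
\]
Using this result and Proposition 2(a), we get that
\[
\begin{aligned}
\sum_{m=0}^{\alpha-1} \Pr[\mu = m]\, \beta_j^m \beta_l^m
&\le \sum_{m=0}^{\infty} p_i \left( \frac{1 - 2p_i - p_j - p_l + p_i (p_i + p_j + p_l) + p_j p_l}{(1 - p_i)} \right)^m \\
&= \frac{p_i}{p_i + p_j + p_l - p_j p_l/(1 - p_i)} \\
&\le \frac{p_i}{p_i + p_j + p_l - p_j p_l}.
\end{aligned}
\]
So we get that
\[
\frac{(B - 1)^2}{2(BC)^2} \sum_{m=0}^{\alpha-1} \Pr[\mu = m] \sum_{j=1}^{k} \sum_{l=1}^{k} \beta_j^m \beta_l^m \le \frac{(B - 1)^2}{2(BC)^2} \sum_{j=1}^{k} \sum_{l=1}^{k} \frac{p_i}{p_i + p_j + p_l - p_j p_l}. \tag{27}
\]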
Lower bound on pd
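Plugging Eqs. 20, ..., 27 into Eq. 19 we get that:
\[
\begin{aligned}
\Pr[X_i] \le{} & 1 - \frac{1}{BC} \left( \frac{1}{p_i} - 1 \right) - \frac{(B - 1)k}{BC} - \frac{(B - 1)}{(BC)^2} \sum_{j=1}^{k} \frac{p_i (1 - p_i - p_j)}{(p_i + p_j)^2} - \frac{(B - 1)^2 k}{(BC)^2} \sum_{j=1}^{k} \frac{p_i}{p_i + p_j} \\
& + \frac{(B - 1)k}{(BC)^2} \left( \frac{1}{p_i} - 1 \right) + \frac{1}{2BC} \left( \frac{1}{p_i} - 1 \right) + \frac{B - 1}{BC} \sum_{j=1}^{k} \frac{p_i}{p_i + p_j} \\
& + \frac{(B - 1)^2}{2(BC)^2} \sum_{j=1}^{k} \sum_{l=1}^{k} \frac{p_i}{p_i + p_j + p_l - p_j p_l} + \frac{((B - 1)k)^2}{2(BC)^2} + O(e^{-B}). \tag{28}
\end{aligned}
\]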
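Plugging Eq. 28 into Eq. 6 we get:
\[
\begin{aligned}
p_d \ge \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i \Biggl[ & \left( \frac{1}{p_i} - 1 \right) \frac{1}{2BC} \left( 1 - \frac{2(B - 1)k}{BC} \right) + \sum_{j=1}^{k} \frac{p_i}{p_i + p_j}\, \frac{B - 1}{BC} \left( \frac{(B - 1)k}{BC} - 1 \right) \\
& + \frac{(B - 1)k}{BC} \left( 1 - \frac{(B - 1)k}{2BC} \right) + \frac{(B - 1)}{(BC)^2} \sum_{j=1}^{k} \frac{p_i (1 - p_i - p_j)}{(p_i + p_j)^2} \\
& - \frac{(B - 1)^2}{2(BC)^2} \sum_{j=1}^{k} \sum_{l=1}^{k} \frac{p_i}{p_i + p_j + p_l - p_j p_l} \Biggr] - O\left( e^{-B} \right).
\end{aligned}
\]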
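Simplifying further and using $\sum_{i=1}^{k} \sum_{j=1}^{k} p_i^2/(p_i + p_j) < k$ and $\sum_{i=1}^{k} p_i (1/p_i - 1) = k - 1$, we get:
\[
\begin{aligned}
p_d \ge{} & \frac{1}{B} + \frac{k(2C - k)}{2C^2} + \frac{k(k - 3C)}{2BC^2} - \frac{1}{2BC} - \frac{k}{2B^2 C}
+ \frac{B(k - C) + 2C - 3k}{BC^2} \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{(p_i)^2}{p_i + p_j} \\
& + \frac{(B - 1)^2}{B^3 C^2} \sum_{i=1}^{k} p_i \left[ \sum_{j=1}^{k} \frac{p_i (1 - p_i - p_j)}{(p_i + p_j)^2} - \frac{B - 1}{2} \sum_{j=1}^{k} \sum_{l=1}^{k} \frac{p_i}{p_i + p_j + p_l - p_j p_l} \right] - O(e^{-B}).
\end{aligned}
\]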
✷
4.3.4 Upper and lower bounds for uniformly random data
Using the upper and lower bound theorems just proven for general probability distributions, we
now derive corollaries giving upper and lower bounds for the uniform distribution.
Corollary 1 If $p_1 = \ldots = p_k = 1/k$ then the number of cache misses in n rounds of Process “in-place” is at most:
\[
n \left( \frac{1}{B} + \frac{k(B + 5)}{2BC} + \frac{k}{B^2 C} \right) + k \left( 1 + \frac{1}{B} \right).
\]
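Proof. Since $P_i$ in $p_c$ and $P_j$ in $p_d$ are both B/k in the equations in Theorem 2, we get that:
\[
\begin{aligned}
p_d + p_c &\le \frac{1}{B} + \frac{2(B - 1)}{BC} \cdot \frac{k^2}{B} \cdot \frac{B/k}{B + 1} + \frac{(B - 1)^2}{B^2 C} \cdot k^2 \cdot \frac{1/k}{2} + \frac{k}{B^2 C} + \frac{k}{BC} \\
&= \frac{1}{B} + \frac{2(B - 1)}{BC} \cdot \frac{k}{B + 1} + \frac{(B - 1)^2}{B^2 C} \cdot \frac{k}{2} + \frac{k}{B^2 C} + \frac{k}{BC} \\
&\le \frac{1}{B} + \frac{k}{C} \left[ \frac{3}{B} + \frac{B - 1}{2B} \right] + \frac{k}{B^2 C} \\
&= \frac{1}{B} + \frac{k(B + 5)}{2BC} + \frac{k}{B^2 C}.
\end{aligned}
\]
✷
Remark: As we will see later, Process “in-place” models the permute phase of distribution sorting
and Corollary 1 shows that one pass of uniform distribution sorting incurs O(n/B) cache misses if
and only if k = O(C/B).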
The following corollary is from the lower bound result in Theorem 3.
Corollary 2 If $p_1 = \ldots = p_k = 1/k$ then the number of cache misses in n rounds of Process “in-place” is at least:
\[
k + \frac{n}{B} + n \left[ \frac{k}{2C} - \frac{k^2}{BC^2} - \frac{k + 1}{2BC} - \frac{k}{2B^2 C} + \frac{(B - 1)^2}{12 B^3 C^2} \left( k^2 (5 - 2B) - 7k + 2 \right) \right].
\]
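Proof. Plugging $p_i = 1/k$ in the equation in Theorem 3 we get that:
\[
\begin{aligned}
p_d &\ge \frac{1}{B} + \frac{k(2C - k)}{2C^2} + \frac{k(k - 3C)}{2BC^2} - \frac{1}{2BC} - \frac{k}{2B^2 C} + \frac{B(k - C) + 2C - 3k}{BC^2} \cdot \frac{k}{2}
+ \frac{(B - 1)^2}{B^3 C^2} \left[ \frac{(k - 2)k}{4} - \frac{B - 1}{2} \left( \frac{k^3}{3k - 1} \right) \right] \\
&\ge \frac{1}{B} + \frac{k}{2C} - \frac{k^2}{BC^2} - \frac{k + 1}{2BC} - \frac{k}{2B^2 C} + \frac{(B - 1)^2}{12 B^3 C^2} \left( k^2 (5 - 2B) - 7k + 2 \right).
\end{aligned}
\]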
✷
Remark: From Corollary 1 we have that for uniformly random data and k = αC, where α ≤ 1,
other than for small values of B, the upper bound for the number of cache misses in n rounds is
roughly
\[
\frac{\alpha n}{2},
\]
and from Corollary 2 we have that for uniformly random data and k = αC, where α ≤ 1, other
than for small values of B, the lower bound for the number of cache misses in n rounds is roughly
\[
n \left( \frac{\alpha}{2} - \frac{\alpha^2}{6} \right).
\]
The ratio between the upper and lower bound is 3/(3 − α). So we have that for uniformly
random data the lower bound is within a factor of about 3/2 of the upper bound when k ≤ C and
is much closer when k ≪ C.
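To get a feel for how close the two bounds are for concrete cache parameters, the closed forms of Corollaries 1 and 2 can be evaluated directly. The following is a minimal sketch; the function names are ours, and B and C are taken in the units used by the corollaries.

```python
def in_place_upper(n, k, B, C):
    """Corollary 1: upper bound on cache misses for uniform keys (p_i = 1/k)."""
    per_round = 1 / B + k * (B + 5) / (2 * B * C) + k / (B * B * C)
    return n * per_round + k * (1 + 1 / B)


def in_place_lower(n, k, B, C):
    """Corollary 2: lower bound on cache misses for uniform keys (p_i = 1/k)."""
    bracket = (
        k / (2 * C)
        - k * k / (B * C * C)
        - (k + 1) / (2 * B * C)
        - k / (2 * B * B * C)
        + (B - 1) ** 2 / (12 * B ** 3 * C * C) * (k * k * (5 - 2 * B) - 7 * k + 2)
    )
    return k + n / B + n * bracket


if __name__ == "__main__":
    n, B, C = 1_000_000, 64, 8192  # illustrative values, not from the paper
    for alpha in (0.125, 0.5, 1.0):
        k = int(alpha * C)
        hi, lo = in_place_upper(n, k, B, C), in_place_lower(n, k, B, C)
        print(f"alpha={alpha}: lower={lo:.0f} upper={hi:.0f} ratio={hi / lo:.3f}")
```

For finite B the ratio differs somewhat from the asymptotic 3/(3 − α), because the remark drops the lower-order terms in B.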
4.4 Cache Analysis of Out-of-place Permutation
In this section we analyse the cache misses in a direct mapped cache during n rounds of Process
“out-of-place”, introduced in Section 4.1.2. We derive a precise equation for the expected
number of cache misses and closed-form upper and lower bounds. During the analysis we re-use
$k, D_i, c_i, C, p_i, P_i, \bar a, \bar b, f(x)$ and $g(m)$ introduced in Section 4.3.
4.4.1 Average case analysis
We start by proving a theorem for the expected number of cache misses during n rounds of Process
“out-of-place”.
Theorem 4 The expected number X of cache misses in n rounds of Process “out-of-place” satisfies
$n(p_c + p_d + p_s) \le X \le n(p_c + p_d + p_s) + k(1 + 1/B) + 1$, where:
\[
p_c = \sum_{i=1}^{k/B} P_i \left( 1 - \sum_{m=0}^{\infty} P_i (1 - P_i)^m E_{\bar\nu \sim \varphi(m,\bar b_i)} \left[ f(m + 1) \prod_{j=1}^{k} f(\nu_j) \right] \right),
\]
\[
p_d = \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i \left( 1 - \sum_{m=0}^{\infty} p_i (1 - p_i)^m E_{\bar\mu \sim \varphi(m,\bar a_i)} \left[ (1 - g(\bar\mu)) f(m + 1) \prod_{j=1}^{k} f(\mu_j) \right] \right),
\]
\[
p_s = \frac{1}{B} + \frac{B - 1}{B} \left( 1 - \left( 1 - \frac{1}{C} \right)^2 \right).
\]
[Figure 6: m rounds of Process “out-of-place”. Between two accesses to $D_i$, there are m accesses
to “other” pointers, m + 1 accesses to C, and m + 1 accesses to consecutive locations in S.]
Proof. We first analyse the miss rates for accesses to pointers D1, . . . , Dk. Fix an i, 1 ≤ i ≤ k
and a z ≥ 1 and consider the probability of a miss between access z and z + 1 to pointer Di.
We define µ, µj,m, µ¯, m¯, di,z , ϕ(m, a¯i) and Xi as in the proof of Theorem 1. Again f(mj) is the
probability that none of the mj locations accessed by Dj in m rounds is mapped to the same cache
block as location di,z. Similarly g(m¯) · C is the number of count blocks accessed in m rounds,
and so 1 − g(m¯) is the probability that the cache block containing di,z does not conflict with the
blocks from C which were accessed in these m rounds. We also have accesses to m+ 1 contiguous
locations in S and f(m+1) is the probability that these m+1 accesses are not to the cache block
containing di,z. Figure 6 shows the other memory accesses between accesses z and z + 1 to Di.
For a given configuration $\bar m$ of accesses, as the probabilities $f(m_j)$, $g(\bar m)$ and $f(m + 1)$ are
independent, we conclude that the probability that the cache block containing $d_{i,z}$ is not accessed
in these m rounds is $(1 - g(\bar m)) f(m + 1) \prod_{j=1}^{k} f(m_j)$. Averaging over all configurations $\bar m$, we get that
\[
\Pr[X_i \mid \mu = m] = E_{\bar\mu \sim \varphi(m,\bar a_i)} \left[ (1 - g(\bar\mu)) f(m + 1) \prod_{j=1}^{k} f(\mu_j) \right]. \tag{29}
\]
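Using this we get
\[
\Pr[X_i] = \sum_{m=0}^{\infty} \Pr[\mu = m] \Pr[X_i \mid \mu = m]
= \sum_{m=0}^{\infty} p_i (1 - p_i)^m E_{\bar\mu \sim \varphi(m,\bar a_i)} \left[ (1 - g(\bar\mu)) f(m + 1) \prod_{j=1}^{k} f(\mu_j) \right]. \tag{30}
\]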
Arguing as for Eq. 6 we get that, other than for the first access, the probability $p_d$ of a cache
miss for a pointer access is:
\[
p_d = \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i (1 - \Pr[X_i]). \tag{31}
\]
Including the first access misses, the expected number of cache misses for pointer accesses is at
most
\[
\sum_{i=1}^{k} \left[ 1 + (n p_i - 1) \left( \frac{B - 1}{B} (1 - \Pr[X_i]) + \frac{1}{B} \right) \right] \le n p_d + k. \tag{32}
\]
We now consider the probability of a cache miss for an access to a count array location. Fix
an $i \in \{1, \ldots, k/B\}$ and a z ≥ 1 and consider the probability of a miss between access z and z + 1
to count block $c_i$. We define $\nu, \nu_j, m, \bar\nu, \bar m, P_i, \varphi(m, \bar b_i)$, and $Y_i$ as in the proof of Theorem 1.
Again, k ≤ BC and assumption (c) mean that two blocks from C cannot conflict
with each other. So we need to determine the probability of a conflict given $m_j$ accesses to the
pointer $D_j$, for all $j \in \{1, \ldots, k\}$, and m + 1 accesses to contiguous locations in S. Again $f(m_j)$ is
the probability that none of the $m_j$ locations accessed by $D_j$ in m rounds is mapped to the same
cache block as $c_i$ and f(m + 1) is the probability that the accesses to m + 1 contiguous locations
in S are not to the same cache block as $c_i$.
So we have:
\[
\Pr[Y_i] = \sum_{m=0}^{\infty} \Pr[\nu = m] \Pr[Y_i \mid \nu = m]
= \sum_{m=0}^{\infty} P_i (1 - P_i)^m E_{\bar\nu \sim \varphi(m,\bar b_i)} \left[ f(m + 1) \prod_{j=1}^{k} f(\nu_j) \right]. \tag{33}
\]
Arguing as for Eq. 9, the probability $p_c$ of a cache miss for a count array access is:
\[
p_c = \sum_{i=1}^{k/B} P_i (1 - \Pr[Y_i]). \tag{34}
\]
Including the first access misses, the expected number of cache misses for count array accesses
is at most
\[
\sum_{i=1}^{k/B} \left[ 1 + (n P_i - 1)(1 - \Pr[Y_i]) \right] \le n p_c + k/B. \tag{35}
\]
We now calculate cache misses for accesses to the array S. We consider the probability of a
cache miss between accesses to S[s] and S[s + 1]. We know that there is exactly one access to
a count block and one access to a pointer between two accesses to S. The probability that the
pointer access is to the same cache block as S[s] is 1/C. The probability that a block from C maps
to the same cache block as S[s] is k/(BC). Given that a block from C maps to the same cache block
as S[s], the probability that the access to the count array is to the same cache block as S[s] is
B/k. So the probability that the count array access is to the same cache block as S[s] is also 1/C. So
the probability that there are no memory accesses to the cache block that S[s] is mapped to before
the access to S[s + 1] is
\[
(1 - 1/C)^2.
\]
We have a cache miss if S[s] is at a cache block boundary; otherwise the probability of a cache
miss is $1 - (1 - 1/C)^2$. So the probability $p_s$ of a cache miss for an access to S is
\[
p_s = \frac{1}{B} + \frac{B - 1}{B} \left( 1 - \left( 1 - \frac{1}{C} \right)^2 \right).
\]
The first access to S is always a cache miss, so the expected number of cache misses in accesses
to S is:
nps + 1.
Plugging in the values from Eq. 30 into Eq. 32 and from Eq. 33 into Eq. 35 we get the upper
bound on X , the expected number of cache misses in the processes.
The lower bound in Theorem 4 is obvious.
✷
4.4.2 Upper bound
We now prove a theorem on the upper bound to the expected number of cache misses during n
rounds of Process “out-of-place”.
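Theorem 5 The expected number of cache misses in n rounds of Process “out-of-place” is at most
$n(p_d + p_c + p_s) + k(1 + 1/B) + 1$, where:
\[
p_d \le \frac{1}{B} + \frac{2(B - 1)k}{B^2 C} + \frac{B - 1}{BC} \sum_{i=1}^{k} \sum_{j=1}^{k/B} \frac{p_i P_j}{p_i + P_j} + \frac{(B - 1)^2}{B^2 C} \left[ 1 + \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{p_i p_j}{p_i + p_j} \right],
\]
\[
p_c \le \frac{2k}{B^2 C} + \frac{B - 1}{BC} \left[ 1 + \sum_{i=1}^{k/B} \sum_{j=1}^{k} \frac{P_i p_j}{P_i + p_j} \right],
\]
\[
p_s = \frac{1}{B} + \frac{B - 1}{B} \left( 1 - \left( 1 - \frac{1}{C} \right)^2 \right).
\]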
Proof. As for the upper bound for in-place permutation, in this proof we derive lower bounds
for Pr[Xi] and Pr[Yi] and we will use these to derive the upper bounds on pd and pc. We make
extensive use of the results obtained during the proof of Theorem 1.
Again, we consider a fixed i and consider the event Xi defined in the proof of Theorem 4. We
now obtain a lower bound on Pr[Xi].
Lower bound on Pr[Xi]
Letting Γ(x) = 1 − f(x) and using Proposition 1 we can rewrite Eq. 5 as:
\[
\Pr[X_i] \ge \sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar\mu \sim \varphi(m,\bar a_i)} \left[ 1 - g(\bar\mu) - \Gamma(m + 1) - \sum_{j=1}^{k} \Gamma(\mu_j) \right]. \tag{36}
\]
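We can use Eq. 12 as a simplification for
\[
\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar\mu \sim \varphi(m,\bar a_i)}[g(\bar\mu)],
\]
and Eq. 14 as an upper bound on
\[
\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar\mu \sim \varphi(m,\bar a_i)} \left[ \sum_{j=1}^{k} \Gamma(\mu_j) \right].
\]
So we just have to evaluate
\[
\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar\mu \sim \varphi(m,\bar a_i)}[\Gamma(m + 1)].
\]
Since we always have at least one access to S, we have that
\[
\begin{aligned}
\sum_{m=0}^{\infty} \Pr[\mu = m]\, E_{\bar\mu \sim \varphi(m,\bar a_i)}[\Gamma(m + 1)]
&= \sum_{m=0}^{\infty} \Pr[\mu = m]\, \frac{m + B}{BC} - \sum_{m=BC-B}^{\infty} \Pr[\mu = m] \left( \frac{m + B}{BC} - 1 \right) \\
&\le \frac{1}{BC} \left[ \sum_{m=0}^{\infty} \Pr[\mu = m]\, m + B \right] \\
&= \frac{1}{BC} \left( \frac{1}{p_i} - 1 + B \right), \tag{37}
\end{aligned}
\]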
where the last simplification used Proposition 2(c). Substituting Eq. 12, Eq. 14 and Eq. 37 in
Eq. 36 we obtain the following lower bound for $\Pr[X_i]$:
\[
\Pr[X_i] \ge 1 - \frac{1}{C} \sum_{j=1}^{k/B} \frac{P_j}{p_i + P_j} - \frac{1}{BC} \left[ \frac{2}{p_i} + (B - 1) \left( 1 + \sum_{j=1}^{k} \frac{p_j}{p_i + p_j} \right) \right]. \tag{38}
\]
Upper bound on pd
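Finally, substituting $\Pr[X_i]$ from Eq. 38 in Eq. 6 we get
\[
\begin{aligned}
p_d &\le \frac{1}{B} + \frac{B - 1}{B} \sum_{i=1}^{k} p_i \left[ \frac{1}{C} \sum_{j=1}^{k/B} \frac{P_j}{p_i + P_j} + \frac{1}{BC} \left( \frac{2}{p_i} + (B - 1) \left( 1 + \sum_{j=1}^{k} \frac{p_j}{p_i + p_j} \right) \right) \right] \\
&= \frac{1}{B} + \frac{2(B - 1)k}{B^2 C} + \frac{B - 1}{BC} \sum_{i=1}^{k} \sum_{j=1}^{k/B} \frac{p_i P_j}{p_i + P_j} + \frac{(B - 1)^2}{B^2 C} \left[ 1 + \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{p_i p_j}{p_i + p_j} \right].
\end{aligned}
\]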
We can evaluate pc using a very similar approach to that used in the proof of Theorem 2. We
again consider a fixed i and consider the event Yi defined in the proof of Theorem 4. We now
obtain a lower bound on Pr[Yi].
Lower bound on Pr[Yi]
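We can rewrite Eq. 33 as:
\[
\Pr[Y_i] \ge \sum_{m=0}^{\infty} \Pr[\nu = m]\, E_{\bar\nu \sim \varphi(m,\bar b_i)} \left[ 1 - \Gamma(m + 1) - \sum_{j=1}^{k} \Gamma(\nu_j) \right]. \tag{39}
\]
The derivation of Eq. 17 gives us an upper bound on
\[
\sum_{m=0}^{\infty} \Pr[\nu = m] \sum_{j=1}^{k} E_{\bar\nu \sim \varphi(m,\bar b_i)}[\Gamma(\nu_j)].
\]
Arguing as for Eq. 37
\[
\sum_{m=0}^{\infty} \Pr[\nu = m]\, E_{\bar\nu \sim \varphi(m,\bar b_i)}[\Gamma(m + 1)] \le \frac{1}{BC} \left( \frac{1}{P_i} - 1 + B \right). \tag{40}
\]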
Substituting Eq. 17 and Eq. 40 in Eq. 39 we obtain the following lower bound for $\Pr[Y_i]$:
\[
\Pr[Y_i] \ge 1 - \frac{1}{BC} \left[ \frac{2}{P_i} + (B - 1) \left( 1 + \sum_{j=1}^{k} \frac{p_j}{P_i + p_j} \right) \right]. \tag{41}
\]
Upper bound on pc
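Finally, substituting $\Pr[Y_i]$ from Eq. 41 in Eq. 9 we get:
\[
p_c \le \sum_{i=1}^{k/B} P_i\, \frac{1}{BC} \left( \frac{2}{P_i} + (B - 1) \left( 1 + \sum_{j=1}^{k} \frac{p_j}{P_i + p_j} \right) \right)
= \frac{2k}{B^2 C} + \frac{B - 1}{BC} \left[ 1 + \sum_{i=1}^{k/B} \sum_{j=1}^{k} \frac{P_i p_j}{P_i + p_j} \right].
\]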
✷
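Since the bounds of Theorem 5 hold for arbitrary class probabilities, they can be evaluated numerically for any given distribution. The sketch below does this under two assumptions of ours: every $p_i$ is positive, and count block j covers the B consecutive classes $jB, \ldots, jB + B - 1$ (0-based), so that $P_j$ is the sum of those $p_i$. The function name is hypothetical.

```python
def out_of_place_upper(p, B, C):
    """Return the Theorem 5 upper bounds (p_d, p_c) and the value p_s.

    p is the list of class probabilities p_1..p_k (all > 0, len(p) divisible by B).
    """
    k = len(p)
    # Assumption: count block j holds the counters of B consecutive classes.
    P = [sum(p[j:j + B]) for j in range(0, k, B)]
    # sum_i sum_j p_i P_j / (p_i + P_j); the same double sum appears in p_d and p_c.
    pair_pP = sum(pi * Pj / (pi + Pj) for pi in p for Pj in P)
    # sum_i sum_j p_i p_j / (p_i + p_j)
    pair_pp = sum(pi * pj / (pi + pj) for pi in p for pj in p)
    p_d = (1 / B
           + 2 * (B - 1) * k / (B * B * C)
           + (B - 1) / (B * C) * pair_pP
           + (B - 1) ** 2 / (B * B * C) * (1 + pair_pp))
    p_c = 2 * k / (B * B * C) + (B - 1) / (B * C) * (1 + pair_pP)
    p_s = 1 / B + (B - 1) / B * (1 - (1 - 1 / C) ** 2)
    return p_d, p_c, p_s


if __name__ == "__main__":
    # A skewed example: class i is half as likely as class i - 1.
    weights = [2.0 ** -i for i in range(16)]
    total = sum(weights)
    print(out_of_place_upper([w / total for w in weights], B=4, C=64))
```

The total miss bound of the theorem is then n(p_d + p_c + p_s) + k(1 + 1/B) + 1.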
4.4.3 Lower bound
It is quite obvious that the lower bound for in-place permutation, given in Theorem 3, is a lower
bound for out-of-place permutation.
4.4.4 Upper and lower bounds for uniformly random data
Using the upper bound Theorem just proven, we now derive a Corollary for an upper bound to
the number of cache misses if the data is uniformly distributed.
Corollary 3 If $p_1 = \ldots = p_k = 1/k$ then the number of cache misses in n rounds of Process
“out-of-place” is at most:
\[
n \left( \frac{1}{B} + \frac{k(B + 3)}{2BC} + \frac{k}{B^2 C} + \frac{k}{BC} \right) + k \left( 1 + \frac{1}{B} \right).
\]
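Proof. Since $P_i$ in $p_c$ and $P_j$ in $p_d$ are both B/k in the equations in Theorem 5, we get that
\[
\begin{aligned}
p_d + p_c + p_s &\le \frac{2}{B} + \frac{2(B - 1)}{BC} \cdot \frac{k^2}{B} \cdot \frac{B/k}{B + 1} + \frac{(B - 1)^2}{B^2 C} \cdot k^2 \cdot \frac{1/k}{2} + \frac{2k}{B^2 C} + \frac{2(B - 1)k}{B^2 C} + \frac{B - 1}{B} \left( 1 - \frac{(C - 1)^2}{C^2} \right) \\
&= \frac{2}{B} + \frac{2(B - 1)}{BC} \cdot \frac{k}{B + 1} + \frac{(B - 1)^2}{B^2 C} \cdot \frac{k}{2} + \frac{2k}{B^2 C} + \frac{2(B - 1)k}{B^2 C} + \frac{B - 1}{B} \cdot \frac{2C - 1}{C^2} \\
&\le \frac{2}{B} + \frac{k}{C} \left[ \frac{4}{B} + \frac{B - 1}{2B} \right] + \frac{2k}{B^2 C} + \frac{2}{C} \\
&= \frac{2}{B} + \frac{k(B + 7)}{2BC} + \frac{2k}{B^2 C} + \frac{2}{C}.
\end{aligned}
\]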
✷
Remark: Corollaries 1 and 3 show that for uniformly distributed data, other than for small
values of B, the numbers of cache misses during in-place and out-of-place permutations are quite
close. As for an in-place permutation, one pass of uniform distribution sorting using out-of-place
permutations incurs O(n/B) cache misses if and only if k = O(C/B).
Using Corollary 2 for the lower bound and Corollary 3 above, we see that when k ≤ C the
lower bound is again within 3/2 of the upper bound and is much closer when k ≪ C.
4.5 Cache Analysis of Multiple Sequences Access
Accessing k sequences is like Process “in-place” in Section 4.1.1 except that there is no interaction
with a count array, so we delete step 2 and assumption (c). An analogue of Theorem 1 is easily
obtained. An easy modification to the proof of Theorem 2 gives:
Theorem 6 The expected number of cache misses in n rounds of sequence accesses is at most:
\[
k + n \left[ \frac{1}{B} + \frac{k(B - 1)}{B^2 C} + \frac{(B - 1)^2}{B^2 C} \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{p_i p_j}{p_i + p_j} \right].
\]
Corollary 4 If $p_1 = \ldots = p_k = 1/k$ then the number of cache misses in n rounds of sequence
accesses is at most:
\[
n \left( \frac{1}{B} + \frac{k(B + 3)}{2BC} \right) + k.
\]
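As a quick numerical check of how Corollary 4 follows from Theorem 6, both expressions can be evaluated side by side. This is a sketch with function names of our choosing; it assumes all access probabilities are positive.

```python
def sequence_miss_bound(n, p, B, C):
    """Theorem 6: upper bound on misses when sequence i is accessed with probability p[i]."""
    k = len(p)
    pair = sum(pi * pj / (pi + pj) for pi in p for pj in p)
    per_round = (1 / B
                 + k * (B - 1) / (B * B * C)
                 + (B - 1) ** 2 / (B * B * C) * pair)
    return k + n * per_round


def uniform_sequence_miss_bound(n, k, B, C):
    """Corollary 4: the simplified bound for p_1 = ... = p_k = 1/k."""
    return n * (1 / B + k * (B + 3) / (2 * B * C)) + k


if __name__ == "__main__":
    n, k, B, C = 1000, 8, 4, 32  # illustrative values only
    print(sequence_miss_bound(n, [1 / k] * k, B, C))
    print(uniform_sequence_miss_bound(n, k, B, C))
```

For uniform probabilities the double sum equals k/2, so the Theorem 6 expression is never larger than the rounded form stated in Corollary 4.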
Remark: From Corollary 4, k = O(C/B) random sequences can be accessed incurring an optimal
O(n/B) misses. This essentially agrees with the results obtained by Mehlhorn and Sanders [9] and
Sen and Chatterjee [14].
Remark: Since its derivation ignored the effects of the count array, the lower bound in Theorem 3
applies directly to sequence accesses. Note that the lower bound we obtain for uniformly random
data, as stated in Corollary 2, is sharper than the lower bound of $0.25(1 - e^{-0.25k/C})$ obtained
in [14].
Remark: Our upper and lower bounds are also closer than those in [9]. The analysis in [9] assumes
that accesses to the sequences are controlled by an adversary; our analysis demonstrates that, with
uniform randomised accesses to the sequences, more sequences can be accessed optimally.
4.6 Correspondence between the processes and the permute phase
We now show how the Processes “in-place” and “out-of-place” model the permute phase of a
generic distribution sorting algorithm.
The correspondence between Process “in-place” of Section 4.1.1 and the pseudocode in Figure 2
is as follows. Each iteration of the inner loop (steps 3.1-3.5) of the pseudocode corresponds to a
round of Process “in-place”. The array COUNT corresponds to the locations C, and the pointer Di
points to DATA[idx]. The variables x in the process and the pseudocode play a similar role. It can
easily be verified that in each iteration of the loop in the pseudocode, the value of x is any integer
1, . . . , k with probability p1, . . . , pk, independently of its previous values, as in Step 1 of Process
“in-place”. A read at a location immediately followed by a write to the same location is counted
as one access. Thus, the read and increment of COUNT[x] in Steps 3.2 and 3.3 of the pseudocode
constitutes one access, equivalent to Step 2. Similarly the “swap” in Step 3.5 of the pseudocode
corresponds to the memory access in Step 3 of the process. The process does not model the initial
access in Step 1 of the pseudocode, and nor does it model the task of looking for new cycle leaders
in Steps 4 and 5 of the pseudocode.
The correspondence between Process “out-of-place” of Section 4.1.2 and the pseudocode in
Figure 1 is as follows. The array COUNT corresponds to the locations C, the array DATA corresponds
to the locations S, i in the pseudocode corresponds to s in the process, and the pointer Di points to
DEST[idx]. The increment of i in the pseudocode is equivalent to the increment of s in the process,
and the accesses to DATA[i] and S[s] are equivalent. As above, the variables x in the process and
the pseudocode play a similar role. Again, the read and increment of COUNT[x]
in the pseudocode constitutes one access, and is equivalent to Step 3 of the process. The access to
DEST[idx] in the pseudocode corresponds to the memory access in Step 4 of the process.
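The memory accesses being modelled can be made concrete with a small out-of-place distribution pass. The sketch below is not the pseudocode of Figure 1; it is an illustrative reconstruction that reuses the names COUNT, DATA, DEST, x and idx from the text, with a hypothetical `classify` callback and 0-based class numbers.

```python
def distribute_out_of_place(data, k, classify):
    """One out-of-place distribution pass: count phase, prefix sums, permute phase.

    classify(key) must return a class number in 0..k-1 (the paper numbers classes 1..k).
    """
    # Count phase: COUNT[x] = number of keys in class x.
    count = [0] * k
    for key in data:
        count[classify(key)] += 1

    # Exclusive prefix sum: COUNT[x] becomes the first free slot of class x in DEST.
    start = 0
    for x in range(k):
        count[x], start = start, start + count[x]

    # Permute phase: one round of Process "out-of-place" per loop iteration.
    dest = [None] * len(data)
    for i in range(len(data)):   # sequential scan of DATA (the locations S)
        x = classify(data[i])
        idx = count[x]           # read COUNT[x] ...
        count[x] = idx + 1       # ... and increment it: counted as one access
        dest[idx] = data[i]      # write through pointer D_x to DEST[idx]
    return dest


if __name__ == "__main__":
    keys = [0.9, 0.1, 0.5, 0.3, 0.7, 0.05]
    print(distribute_out_of_place(keys, 4, lambda v: int(v * 4)))
```

Each iteration of the last loop touches exactly one location of DATA, one of COUNT and one of DEST, which is the access pattern analysed in Section 4.4.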
Assumption (b) of the processes is clearly satisfied and assumption (c) can normally be made
to hold. Assumption (d) and k ≤ CB or k ≤ C may not hold in practice; in [11] we give an
approximate analysis which deals with this. Assumption (a) of the processes, that the starting
locations of the pointers Di are uniformly and independently distributed, is patently false; we
discuss this in more detail in [11]. We may force it to hold by adding random offsets to the starting
location of each pointer, at the cost of needing more memory and adding a compaction phase
after the permute; this has also been suggested by Mehlhorn and Sanders [9]. This only works
if the permute is not in-place, and if k is sufficiently small (e.g. k ≤ n/(CB)). In [11] we study
assumption (a) empirically in the context of uniform distribution sorting. Another weakness is
that our processes are continuous, so the sequence lengths are not specified, whereas in distribution
sorting we sort n keys and each sequence is of a finite length.
5 MSB radix sort
We now consider the problem of sorting n independent and uniformly-distributed floating-point
numbers in the range [0, 1) using the integer sorting algorithm MSB radix sort. As noted earlier,
it suffices to sort lexicographically the bit-strings which represent the floats, by viewing them as
integers. One pass of MSB radix sort using radix size r groups the keys according to their most
significant r bits in $O(2^r + n)$ time. For random integers, a reasonable choice for minimising
instruction counts is $r = \lceil \log n - 3 \rceil$ bits, or classifying into about n/8 classes. Since each class
has about 8 keys on average, they can be sorted using insertion sort. Using this approach for this
problem gives terrible performance even at small values of n (see Table 1). As we now show, the
problem lies with the distribution of the integers on which MSB radix sort is applied.
5.1 Radix sorting floating-point numbers
A floating-point number is represented as a triple of non-negative integers 〈i, j, l〉. Here i is called
the sign bit and is a 0-1 value (0 indicating non-negative numbers, 1 indicating negative numbers),
j is called the exponent and is represented using e bits and l is called the mantissa and represented
using m bits. Let $j^* = j - 2^{e-1} + 1$ denote the unbiased exponent of 〈i, j, l〉. Roughly following
the IEEE 754 standard, let the triple 〈0, 0, 0〉 represent the number 0, and let 〈i, j, l〉, where j > 0,
represent the number $\pm 2^{j^*}(1 + l\,2^{-m})$, depending on whether i = 0 or 1; no other triple is a
floating-point number. Internally each member of the triple is stored in consecutive fields of a
word. The IEEE 754 standard specifies e = 8 and m = 23 for 32-bit floats and e = 11 and m = 52
for 64-bit floats [4].
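The triple 〈i, j, l〉 and the fact that non-negative floats sort correctly when viewed as integers can be illustrated with Python's `struct` module. This is a sketch for 32-bit floats (e = 8, m = 23); the helper names are ours.

```python
import struct


def float32_key(x):
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def fields(x):
    """Split the 32-bit pattern of x into (sign i, biased exponent j, mantissa l)."""
    bits = float32_key(x)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


if __name__ == "__main__":
    for x in (0.0, 0.5, 0.75):
        i, j, l = fields(x)
        # Unbiased exponent j* = j - 2^(e-1) + 1 = j - 127 for e = 8 (defined for j > 0).
        print(x, (i, j, l), "j* =", j - 127 if j > 0 else None)
    keys = [0.75, 0.0, 0.1, 0.5, 0.3]
    # For non-negative floats, ordering by bit pattern matches ordering by value.
    print(sorted(keys, key=float32_key) == sorted(keys))
```

Note that 0.75 = 2^{-1}(1 + 2^{22} · 2^{-23}), so its unbiased exponent is −1, as for every key in [0.5, 1).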
We model the generation of a random float in the range [0, 1) as follows: generate an (infinite-precision)
random real number, and round it down to the next smaller float. On average, half
the numbers generated will lie in the range [0.5, 1) and will have an unbiased exponent of −1.
In general, for all non-zero numbers, the unbiased exponent has value i with probability $2^i$, for
$i = -1, -2, \ldots, -2^{e-1} + 2$, whereas the mantissa is a random m-bit integer. The value 0 has
probability $2^{-2^{e-1}+2}$. Clearly, the distribution is not uniform, and it is easy to see that the average
size of the largest class after the first pass of MSB radix sort with radix r is $n\left(1 - 1/2^{2^{e-r+1}}\right)$ if
r < e + 1, and $n/2^{r-e}$ if r ≥ e + 1.
This shows, e.g., that the largest sub-problems in the examples of Table 1 would be of size
$n/2^{\lceil \log n - 3 \rceil - 11} \approx 2^{14}$, so using insertion sort after one pass is inefficient in this case². To get down
to problems of size 8 in one pass requires a radix of about log n + 8, which is impractical. Also,
MSB radix sort applied to random integers has O(n) expected running time independently of the
word size, but this is not true for floats. A first pass with r ≪ e barely reduces the largest problem
size, and the same holds for subsequent passes until bits from the mantissa are reached. As the
radix in any pass is limited to log n + O(1) bits, we may need Ω(e/ log n) passes, introducing a
dependence on the word size.
² In fact, the total number of keys in all sub-problems of this size would be n/2 on average.
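The effect of the radix on the largest class is easy to observe by classifying keys on the top r bits of their 64-bit pattern (e = 11). The following sketch uses helper names of our own and Python's `struct` and `random` modules; it illustrates the skew rather than measuring it carefully.

```python
import random
import struct


def top_bits(x, r):
    """Class of x under one pass of MSB radix sort: the top r bits of its double pattern."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> (64 - r)


def largest_class_fraction(keys, r):
    """Fraction of the keys that fall into the most populated class for radix r."""
    counts = {}
    for x in keys:
        c = top_bits(x, r)
        counts[c] = counts.get(c, 0) + 1
    return max(counts.values()) / len(keys)


if __name__ == "__main__":
    rng = random.Random(1)
    keys = [rng.random() for _ in range(100_000)]
    for r in (8, 12, 16, 20):
        print(r, largest_class_fraction(keys, r))
```

With r = 12 the class is exactly the sign bit plus the exponent, so all keys in [0.5, 1) share one class; with r = 8 the low four exponent bits are ignored and the largest class holds nearly every key, in line with the formula for r < e + 1.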
5.2 Using Quicksort
To get around the problem of having several passes before we reduce the largest class, we partition
the input keys around a value 1/n ≤ θ ≤ 1/(log n), and sort the keys smaller than θ in O(n)
expected time using Quicksort. We then apply MSB radix sort to the remaining keys. Let
$e' = \min\{\lceil \log \log(1/\theta) \rceil, e\}$ denote the effective exponent, since the remaining keys have exponents which
vary only in the lower order e′ bits. This means that keys can be grouped according to a radix
r = e + 1 + m′ with m′ ≥ 0 in $O(n + 2^{e'+m'})$ time and $O(2^{e'+m'})$ space. Since e′ = O(log log n),
we can take up to log n − O(log log n) bits from the mantissa as part of the first radix; as all
sub-problems now only deal with bits from the mantissa they can be solved in linear expected time,
giving a linear running time overall.
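To see how small the effective exponent is in practice, the definition of e′ and the resulting count-array size can be computed directly. The sketch below assumes that the logarithms are base 2 and that 0 < θ < 1/2; the function names are hypothetical.

```python
import math


def effective_exponent(theta, e):
    """e' = min(ceil(log2(log2(1/theta))), e) for a partition threshold 0 < theta < 1/2."""
    return min(math.ceil(math.log2(math.log2(1 / theta))), e)


def count_array_size(theta, e, m_prime):
    """Number of classes (and pointers) 2^(e' + m') when m' mantissa bits join the radix."""
    return 2 ** (effective_exponent(theta, e) + m_prime)


if __name__ == "__main__":
    n = 2 ** 20
    theta = 1 / math.log2(n) ** 2  # the compromise value used in Section 5.4
    for e in (8, 11):
        print(e, effective_exponent(theta, e), count_array_size(theta, e, 10))
```

Because e′ grows only doubly logarithmically in 1/θ, almost all of the radix budget can be spent on mantissa bits.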
5.3 Cache analysis
We now use our analysis to calculate an upper bound for the cache misses in the permute phase of
the first pass of MSB radix sort using a radix r = e + 1 + m′, for some m′ ≥ 0, assuming also that
all keys are in the range [θ, 1), for some θ ≥ 1/n. There are $2^{e'+m'}$ pointers in all, which can be
divided into $g = 2^{e'}$ groups of $K = 2^{m'}$ pointers each. Group i corresponds to keys with unbiased
exponent −i, for i = 1, . . . , g. All pointers in group i have an access probability of $1/(K 2^i)$. Using
Theorem 1 and a slight extension of the methods of Theorem 2 we are able to prove Theorem 7
below, which states that the number of misses is essentially independent of g:
Theorem 7 Provided gK ≤ CB and K ≤ C the number of misses in the first pass of the permute
phase of MSB radix sort is at most:
\[
n \left( \frac{1}{B} + \frac{2K}{BC} \left( 2.3B + 2 \log B + \log C - \log K + 0.7 \right) \right) + gK(1 + 1/B).
\]
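Proof. Using Eq. 15 we can calculate a lower bound on the probability of event $X_{(i-1)K+1}$ as:
\[
\begin{aligned}
\Pr[X_{(i-1)K+1}] &\ge 1 - \frac{K 2^i}{BC} - \frac{1}{C} \sum_{j=1}^{g} \sum_{l=1}^{K/B} \frac{B 2^{-j}}{B 2^{-j} + 2^{-i}} - \frac{B - 1}{BC} \sum_{j=1}^{g} \sum_{l=1}^{K} \frac{2^{-j}}{2^{-j} + 2^{-i}} \\
&= 1 - \frac{K}{BC} \left[ \sum_{j=1}^{g} \frac{B 2^{-j}}{B 2^{-j} + 2^{-i}} + 2^i + (B - 1) \sum_{j=1}^{g} \frac{2^{-j}}{2^{-j} + 2^{-i}} \right] \\
&\ge 1 - \frac{K}{BC} \left[ \log B + i + 2^i + (B - 1) \sum_{j=1}^{g} \frac{2^{-j}}{2^{-j} + 2^{-i}} \right]. \tag{42}
\end{aligned}
\]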
If $K 2^i/(BC) \ge 1$ then $\Pr[X_{(i-1)K+1}]$ would be negative, so we place a bound on this term such
that $K 2^i < BC$. The maximum value of i such that $K 2^i/(BC) < 1$ is $\log BC - \log K - 1$.
Since the probabilities of access to pointers $D_{(i-1)K+1}, \ldots, D_{iK}$ are all $1/(K 2^i)$ we can calculate
an upper bound on $p_d$ using Eqs. 6 and 42 as:
\[
\begin{aligned}
p_d &\le \frac{K^2}{BC} \left[ \sum_{i=1}^{g} p_i \left( \log B + i + (B - 1) \sum_{j=1}^{g} \frac{2^{-j}}{2^{-j} + 2^{-i}} \right) + \sum_{i=1}^{\log BC - \log K - 1} p_i 2^i \right] + \sum_{i=\log BC - \log K}^{g} K p_i \\
&= \sum_{i=1}^{g} \frac{1}{2^i} \frac{K}{BC} \left( \log B + i + (B - 1) \sum_{j=1}^{g} \frac{2^{-j}}{2^{-j} + 2^{-i}} \right) + \sum_{i=1}^{\log BC - \log K - 1} \frac{1}{2^i} \frac{K}{BC} 2^i + \sum_{i=\log BC - \log K}^{g} \frac{1}{2^i} \\
&\le \frac{K}{BC} \left[ \sum_{i=1}^{g} \frac{\log B}{2^i} + \sum_{i=1}^{g} \frac{i}{2^i} + (B - 1) \sum_{i=1}^{g} \frac{1}{2^i} \sum_{j=1}^{g} \frac{2^{-j}}{2^{-j} + 2^{-i}} \right] + \frac{K}{BC} (\log BC - \log K - 1) + \frac{2K}{CB} \\
&\le \frac{K}{BC} \left( 2 \log B + 3 + \log C - \log K + 2.3(B - 1) \right).
\end{aligned}
\]
✷

Table 1: Memory-tuned Quicksort and Naive1 MSBRadix. Running times in seconds of memory-tuned
Quicksort and Naive1 MSBRadix sort (single pass MSBRadix sort without partitioning,
r = ⌈log n − 3⌉) on floating point keys.

n         1×10^6   2×10^6   4×10^6   8×10^6   16×10^6   32×10^6
MTQuick   0.7400   1.5890   3.3690   7.2430   15.298    32.092
Naive1    7.0620   14.192   28.436   57.082   115.16    233.16
5.4 Tuning MSB radix sort
We now optimise parameter choices in our algorithms. The smaller the value of θ, the fewer keys
are sorted by Quicksort, but reducing θ may increase e′. A larger value of e′ does not mean
more misses, by Theorem 7, but it does mean a larger count array. We choose $\theta = 1/(\log n)^2$ as a
compromise, ensuring that Quicksort uses o(n) time. Using the above analysis we are also able to
determine an optimal number of classes to use in each sorting sub-problem. We use two criteria
of optimality. In the first, we require that each pass incur no more than (2 + ε)n/B misses for
some constant ε > 0, thus seeking essentially to minimise cache misses (2n/B misses is the bare
minimum for the count and permute phases). In the second, we trade-off reductions in cache misses
against extra computation. The latter yields better practical results, and results shown below are
for this approach.
5.5 Experimental results
Table 2 compares tuned MSB radix sort with memory-tuned Quicksort [8] and MPFlashsort [11], a
memory-tuned version of a distribution sorting algorithm which assumes that the keys are independently
drawn from a uniformly random distribution. The algorithms were coded in C and compiled
using gcc 2.8.1. The experiments were run on our Sun UltraSparc-II with 2 × 300 MHz processors,
1GB main memory, a 16KB L1 data cache and a 512KB direct-mapped L2 cache. Observe that
MSB radix sort easily outperforms the other algorithms for the range of values considered.
Table 2: MPFlashsort, memory-tuned Quicksort and MSBRadix. Running times in seconds of
MPFlashsort, memory-tuned Quicksort and MSBRadix sort on a Sun UltraSparc-II using single
precision floating point keys.

n          1×10^6   2×10^6   4×10^6   8×10^6   16×10^6   32×10^6   64×10^6
MPFlash    0.6780   1.3780   2.2756   6.1700   13.308    27.738    56.796
MTQuick    0.7400   1.5890   3.3690   7.2430   15.298    32.092    67.861
MSBRadix   0.3865   0.8470   1.9820   5.0300   9.4800    19.436    40.663

6 Conclusions
We have analysed the average-case cache performance of the permute phase of distribution sorting
when the keys are independently but not uniformly distributed. We have presented equations for
the number of misses during in-place and out-of-place permutations and have given closed-form
upper and lower bounds on these. We have shown that the upper and lower bounds are quite close
when k ≤ C and the data is known to be independently and uniformly distributed. We have shown
how this analysis can easily be extended to obtain the number of cache misses during accesses to
multiple sequences.
We have shown that if the integer sorting algorithm MSB radix sort is used to sort uniformly
and randomly distributed floating point numbers then a non-uniform distribution of keys to classes
is induced. We have shown that a naive implementation of this algorithm would have very poor
performance due to this non-uniform distribution. We have shown that by partitioning the keys, to
remove keys which are expected to go into small classes, and by using our analysis, the algorithm
can be tuned for good cache performance. Due to fast integer operations and good cache utilisation,
the tuned algorithm outperforms both MPFlashsort, a cache-tuned distribution sorting algorithm, and
memory-tuned Quicksort.
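The non-uniform distribution of keys to classes can be illustrated with a minimal sketch in C. It assumes IEEE-754 single-precision floats and non-negative keys; the helper names `msb_class` and `count_in_class`, and the choice of inspecting the top 9 bits (sign plus exponent), are hypothetical and are not the tuned parameters of the algorithm described here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Class of a non-negative float when an MSB radix pass inspects the top
   `bits` bits of its 32-bit representation. Requires 1 <= bits <= 32.
   Assumes IEEE-754 single precision. (Hypothetical helper.) */
static uint32_t msb_class(float x, unsigned bits)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* reinterpret the bit pattern safely */
    return u >> (32u - bits);
}

/* Number of the m evenly spaced keys 0/m, 1/m, ..., (m-1)/m in [0, 1)
   that fall into class `cls`. (Hypothetical helper.) */
static size_t count_in_class(size_t m, unsigned bits, uint32_t cls)
{
    size_t count = 0;
    for (size_t i = 0; i < m; i++) {
        float key = (float)i / (float)m;
        if (msb_class(key, bits) == cls)
            count++;
    }
    return count;
}
```

With 9 bits the class is the biased exponent, so every key in [0.5, 1) lands in class 126 and every key in [0.25, 0.5) in class 125. For example, `count_in_class(1024, 9, 126)` is 512 and `count_in_class(1024, 9, 125)` is 256: one class receives half of the evenly spread keys, the next a quarter, and so on, which is the kind of skew the partitioning step is meant to handle.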
References
[1] A. V. Aho, J. E. Hopcroft and J. D. Ullman. The Design and Analysis of Computer Algorithms.
Addison-Wesley, 1974.
[2] L. Arge. External memory data structures (Invited Paper). In Proc. 9th Annual European Symposium
on Algorithms, LNCS 2161, pp. 1–29, 2001.
[3] D. Dubhashi and D. Ranjan. Balls and Bins: A Study in Negative Dependence. Random Structures
and Algorithms 13, pp. 99–124, 1998.
[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 2nd ed.
Morgan Kaufmann, 1996.
[5] K. Joag-Dev and F. Proschan. Negative association of random variables, with applications. Annals
of Statistics 11, pp. 286–295, 1983.
[6] D. E. Knuth. The Art of Computer Programming. Volume 3: Sorting and Searching, 3rd ed.
Addison-Wesley, 1997.
[7] R. E. Ladner, J. D. Fix and A. LaMarca. Cache Performance Analysis of Traversals and Random
Accesses. In Proc. 10th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 613–622, 1999.
[8] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. Journal of
Algorithms 31, pp. 66–104, 1999.
[9] K. Mehlhorn and P. Sanders. Scanning Multiple Sequences via Cache Memory. Algorithmica 35(1),
pp. 75–93, 2003. Preliminary version in Proc. 26th Annual International Colloquium on Automata,
Languages and Programming, LNCS 1555, pp. 655–664, 1999.
[10] N. Rahman. Internal Memory Sorting and Searching. Ph.D. Thesis. King’s College, University of
London.
[11] N. Rahman and R. Raman. Analysing Cache Effects in Distribution Sorting. ACM Journal of
Experimental Algorithmics 5, Article 14, 2001. Preliminary version in Proc. 3rd Workshop on
Algorithm Engineering, LNCS 1668, pp. 184–198, 1999.
[12] N. Rahman and R. Raman. Adapting radix sort to the memory hierarchy. ACM Journal of
Experimental Algorithmics, (to appear). Preliminary version in Proc. 2nd Workshop on Algorithm
Engineering and Experiments, 2000.
[13] N. Rahman and R. Raman. Analysing the Cache Behaviour of Non-uniform Distribution Sorting
Algorithms. In Proc. 8th Annual European Symposium on Algorithms, LNCS 1879, pp. 380–391,
2000.
[14] S. Sen and S. Chatterjee. Towards a theory of cache-efficient algorithms (extended abstract). In Proc.
11th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 829–838, 2000.
[15] J. S. Vitter. External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM
Computing Surveys 33, pp. 209–271, 2001.
