We analyse the average-case cache performance of distribution sorting algorithms in the case when keys are independently but not necessarily uniformly distributed. The analysis is for both 'in-place' and 'out-of-place' distribution sorting algorithms and is more accurate than the analysis presented in [13] . In particular, this new analysis yields tighter upper and lower bounds when the keys are drawn from a uniform distribution. We use this analysis to tune the performance of the integer sorting algorithm MSB radix sort when it is used to sort independent uniform floating-point numbers (floats). Our tuned MSB radix sort algorithm comfortably outperforms a cache-tuned implementations of bucketsort [11] and Quicksort when sorting uniform floats from [0, 1).
Introduction
Distribution sorting is a popular alternative to comparison-based sorting which involves placing n input keys into k ≤ n classes based on their value [6] . The classes are chosen so that all the keys in the ith class are smaller than all the keys in the (i + 1)st class, for i = 1, . . . , k − 1, and furthermore, the class to which a key belongs can be computed in O(1) time (e.g. if the keys are floats in the range [a, b), we can calculate the class of a key x as 1 + ⌊ x−a b−a · k⌋). Thus, the original sorting problem is reduced in linear time to the problem of sorting the keys in each class. A number of distribution sorting algorithms have been developed which run in linear (expected) time under some assumptions about the input keys, such as bucket sort and radix sort. Due to their poor cache utilisation, even good implementations-which minimise instruction counts-of these 'linear-time' algorithms fail to outperform general-purpose O(n log n)-time algorithms such as Quicksort or Mergesort on modern computers [8, 11] .
Most algorithms are based upon the random-access machine model [1] , which assumes that main memory is as fast as the CPU. However, in modern computers, main memory is typically one or two orders of magnitude slower than the CPU [4] . To mitigate this, one or more levels of cache are introduced between CPU and memory. A cache is a fast associative memory which holds the values of some main memory locations. If the CPU requests the contents of a memory location, and the value of that location is held in some level of cache (a cache hit), the CPU's request is answered by the cache itself in typically 1-3 clock cycles; otherwise (a cache miss) it is answered by accessing main memory in typically 30-100 clock cycles. Since typical programs exhibit locality of reference [4] , caches are often effective. However, algorithms such as distribution sort have poor locality of reference, and their performance can be greatly improved by optimising their cache behaviour. A number of papers have recently addressed this issue [7, 8, 11, 12, 9, 14] , mostly in the context of sorting and related problems. There is also a large literature on algorithms specifically designed for hierarchical models of memory [15, 2] , but there are some important differences between these models and ours (see [10] for a summary).
The cache performance of comparison-based sorting algorithms was studied in [8, 9, 14] and distribution sorting algorithms were considered in [7, 8, 11] . One pass of a distribution sort consists of a count phase where the number of keys in each class are determined, followed by a permute phase where the keys belonging to the same class are moved to consecutive locations in an array. We give an analysis of the cache behaviour of the permute phase, assuming the keys are independently drawn from a non-uniform distribution. In [13] we focused on 'in-place' permute, where the keys are rearranged without placing them first in an auxiliary array. In this paper we extend the analysis to 'out-of-place' permute. We model the above algorithms as probabilistic processes, and analyse the cache behaviour of these processes. For each process we give an exact expression for, as well as matching closed-form upper and lower bounds on, the number of misses.
In previous work on the cache analysis of distribution sorting, [7] have analysed the (somewhat easier) count phase for non-uniform keys, and [11] gave an empirical analysis of the permute phase for uniform keys. The process of accessing multiple sequences of memory locations, which arises in multi-way merge sort, was analysed previously by [9, 14] . The analysis in [9] assumes that accesses to the sequences are controlled by an adversary; our analysis demonstrates, among other things, that with uniform randomised accesses to the sequences, more sequences can be accessed optimally. In [14] a lower bound on cache misses is given for uniform randomised accesses; our lower bound is somewhat sharper. The analysis also improves upon the results in [13] , by giving tighter upper and lower bounds when the keys are drawn from a uniform distribution.
In practice there are often cases when keys are not uniform (e.g., they may be normally distributed); our analysis can be used to tune distribution sort in these cases. We consider a different application here: sorting uniform floats using an integer sorting algorithm. It is well known that one can sort floats by sorting the bit-strings representing the floats, interpreting them as integers [4] . Since (simple) operations on integers are faster than operations on floats, this can improve performance; indeed, in [11] it was observed that an ad hoc implementation of the integer sorting algorithm most-significant-bit first radix sort (MSB radix sort) outperformed an optimised version of bucket sort on uniform floats. We observe that a uniform distribution on floating-point numbers induces a non-uniform distribution on the representing integers, and use our cache analysis to improve the performance of MSB radix sort on our machine. Our tuned 'in-place' MSB radix sort comfortably outperforms optimised implementations of other in-place or 'in-place' algorithms such as Quicksort or MPFlashsort [11] , which is a cache-tuned version of bucket sort.
Cache preliminaries
This section introduces some terminology and notation regarding caches. The size of the cache is normally expressed in terms of two parameters, the block size (B) and the number of cache blocks (C). We consider main memory as being divided into equal-sized blocks consisting of B consecutively-numbered memory locations, with blocks starting at locations which are multiples of B. The cache is also divided into blocks of size B; one cache block can hold the value of exactly one memory block. Data is moved to and from main memory only as blocks.
In a direct-mapped cache, the value of memory location x can only be stored in cache block c = (x div B) mod C. If the CPU accesses location x and cache block c holds the values from x's block the access is a cache hit; otherwise it is a cache miss and the contents of the block containing x are copied into cache block c, evicting the current contents of cache block c. purposes, cache misses can be classified into compulsory misses, which occur when a memory block is accessed for the first time, capacity misses, which occurs on an access to a memory block that was previously evicted because the cache could not hold all the blocks being actively accessed, and conflict misses, which happen when a block is evicted from cache because another memory block that mapped to the same cache block was accessed.
Distribution sorting
As noted in the introduction, a distribution pass has two main phases, a count phase and a permute phase, and our focus here is on the latter. While describing this algorithm, the term data array refers to the array holding the input keys, and the term count refers to an auxiliary array used by these algorithms. Each pass consists of two main phases, a count phase followed by a permute phase.
The count phase counts for class 1 ≤ i ≤ k − 1, the total number of keys in classes 0, . . . , i − 1. For class i = 0 this cumulative count is 0. Ladner et al [7] give an analysis of the count phase of distribution sorting on a direct-mapped cache for uniformly and randomly distributed keys.
There are two main variants of the permute phase, in the first variant keys are permuted from the data array to the auxiliary destination array, this is called an out-of-place permutation. In the second variant keys in the data array are permuted within the data array, this is called an in-place permutation.
Permute phase
The permute phase uses the cumulative count of keys generated during the count phase, to permute the keys to their respective classes. We now describe the two variants of the permute phase. In the description below it is assumed that k has been appropriately initialised, and that the function classify maps a key to a class numbered {0, . . . , k − 1} in O(1) time.
Out-of-place permute
During an out-of-place permute, for any class j, unless all elements of that class have already been moved, COUNT[j] points to the leftmost (lowest-numbered) available location for an element of class j in an n element auxiliary array, DEST. Figure 1 shows the pseudo-code for out-of-place permutation. In Step 1, for each element in DATA: we determine its class; using the count array we determine the next available location for this key in the DEST array; we increment the count array, thus setting the location for the next key of the same class; finally we move the key to its location in DEST. Since each step takes constant time, this out-of-place permutation takes O(n) time whenever k ≤ n. 
go to 2; Figure 2 : Permute phase for an 'in-place' permutation in a generic distribution sorting algorithm. DATA holds the input keys. COUNT and START are auxiliary arrays. After the count phase, COUNT is copied into START.
In-place Permute
The in-place permutation strategy described here is similar to that described by Knuth [6, . Before an in-place permute phase begins, a copy of the count array is made in a k element auxiliary start array. During the permute phase, for any class j, an invariant is that locations Figure 2 shows the pseudo-code for in-place permutation. We now describe this permutation, which consists of two main activities: cycle following and cycle leader finding. In cycle following, keys are moved to their final destinations in the data array along a cycle in the permutation (Steps 2 and 3). Once a cycle is completed, we move to cycle leader finding, where we find the 'leader' (index of the rightmost element) of the next cycle (Steps 1, 4 and 5). A cycle leader is simply the rightmost location of the highest-numbered incomplete class. By the definition of a complete class, initially the leader must be position n − 1. In more detail, the steps are as follows:
• In Step 1 n − 1 is selected as the first cycle leader.
• In Step 2 the key at the leader's position is copied into the variable key, thus leaving a 'hole' in the leader's position.
• In Steps 3.1-3.5 the key key is swapped with the key at key's final position. If key 'fills the hole', the cycle is complete, otherwise we repeat these steps.
• In Step 4 the algorithm searches for a new cycle leader. Suppose the leader of the cycle which just completed was the last location of class j. When this cycle ends, class j must also be complete, as a key of class j has been moved into the last location of class j. Note that classes j + 1, j + 2, . . . must already have been complete when the leader of this cycle was found. Note that the program variable x has value j at the end of this cycle, so the search for the next leader begins with class j − 1, counting down (Step 4).
• In Step 5 we check to see if all classes have completed and terminate if this is the case.
Clearly the in-place permutation in one pass of distribution sorting takes O(n) time whenever k ≤ n.
Cache analysis
We now analyse cache misses in a direct-mapped cache during the permute phase of distribution sorting when the keys are independently drawn from a non-uniform random distribution. In the permute phase of distribution sorting, when a key is moved to its destination, the algorithms described in Section 3 access any one of k elements in the COUNT array and any one of k locations in the DATA or DEST arrays, depending on whether the permutation is in-place or out-of-place. The actual locations accessed are dependent on the value of the permuted key, so, if the keys are independently and randomly distributed then, for every key permuted there are two random accesses to memory, one in the count array and one in DATA or DEST. These random accesses can potentially lead to a large number of cache conflict misses.
Our approach is to define two continuous processes which model in-place and out-of-place permutations. Process "in-place" models an in-place permutation and is shown in Figure 3 , and Process "out-of-place" models an out-of-place permutation and is shown in Figure 4 . Each round of a process models the permutation of a key to its destination, and we analyse the expected number of cache misses in n rounds of these processes. Our precise equations are difficult to compute so we also give closed-form upper and lower bounds on these precise equations. We use our results for in-place permutations to get upper and lower bounds on the expected number of cache misses in a process which models accesses to multiple sequences.
The assumptions in the processes mean that we have to access at least n distinct locations in memory, which requires Ω(n/B) cache misses. In the analysis, we will say that a process is optimal if it incurs O(n/B) cache misses. In distribution sorting, the larger the value of k, the fewer the number of passes over the data, hence the fewer the capacity misses. As we will see, if k is too large, then there can be a large number of conflict misses. The aim of the analysis is to determine the largest value of k, for a particular distribution of keys, such that there are O(n/B) misses in one pass of distribution sorting.
Processes
We now give the two processes which model the distributing of keys drawn independently and randomly from a non-uniform distribution into k classes.
Process to model an in-place permutation
Let k be an integer, 2 ≤ k ≤ CB. We are given k probabilities p 1 , . . . , p k , such that
The process maintains k pointers D 1 , . . . , D k , and there are also k consecutive 'count array' locations, C = c 1 , . . . , c k . The process (henceforth called Process "in-place") executes a sequence of rounds, where each round consists in performing steps 1-3 below:
Process "in-place"
independently of all previous picks.
2. Access the location c x .
3. Access the location pointed to by D x , increment D x by 1.
We denote the locations accessed by the pointer D i by d i,1 , d i,2 , . . ., for i = 1, . . . , k. We assume that: Assuming that the cache is initially empty, the objective is to determine the expected number of cache misses incurred by the above process over n rounds, with the expectation taken over the random choices in Step 1 as well as the starting positions of the pointers.
Process to model an out-of-place permutation
This process is like Process "in-place", but it is augmented with accesses to a sequence of consecutive locations in a source array, S, determined by an index s. The process, henceforth called Process "out-of-place", executes a sequence of rounds, where each round consists in performing steps 1-4 below:
Process "out-of-place" We make assumptions (a), (c), and (d) from Process "in-place", assumption (b) is modified as below and we add a further assumption: (b) during the process, the pointers traverse sequences of memory locations which are disjoint from each other, from C and from S.
Assuming that the cache is initially empty, again the objective is to determine the expected number of cache misses incurred by the above process over n rounds, with the expectation taken over the random choices in Step 2 as well as the starting positions of the pointers.
Preliminaries
We now introduce some notation that will be used for the analysis. We use k to denote the number of classes that the keys will be distributed into, and throughout the analysis we assume that B divides k. Assume that we are given a set of k probabilities p 1 , . . . , p k , such that expected value of a function f of a random variable X is denoted as E[f (X)]. When we wish to make explicit the distribution D from which the random variable is drawn, we will use the notation E X∼D [f (X)]. All vectors have dimension k (the number of classes) unless stated otherwise, and we denote the components of a vectorx by x 1 , x 2 , . . . , x k . We now define some probabilities:
(ii) For all i ∈ {1, . . . , k}, we denote byā i the following vector: a 
Let m ≥ 0 be an integer andq be a vector of non-negative reals such that i q i = 1. We denote by ϕ(m,q) the probability distribution on the number of balls in each of k bins, when m balls are independently put into these bins, and a ball goes in bin i with probability q i , for i ∈ {1, . . . , k}. Thus, ϕ(m,q) is a distribution on vectors of non-negative integers. Ifμ is drawn from ϕ(m,q), then:
all other vectors have zero probability 1 . We now define functions f (x) for x ≥ 0 and g(m) for a vectorm of non-negative integers:
We now set out some propositions that are used in the proofs.
Proposition 1 For all real numbers x i , i = 1, . . . , k, such that |x i | ≤ 1 we have that:
1 We take 0 0 = 1 in Eq. 1.
Proposition 2 (a) For all real numbers x, such that |x| < 1 we have that:
(b) For all real numbers x, such that |x| < 1 we have that:
(c) For all real numbers x, such that 0 < x < 2, we have that:
Proof. Proposition 2(a) is the standard summation for an infinite decreasing geometric series. We obtain Proposition 2(b) by differentiating both sides of the equation in Proposition 2(a). Proposition 2(c) is obtained using Proposition 2(b) and is the expected value of the geometric distribution multiplied by 1 − x. 2 Proposition 3 For all real numbers p and q such that 0 < p − q < 2, we have that:
Proof.
using Proposition 2(a) we get that:
2 Proposition 4 (a) For all real numbers x, we have that:
(b) For all real numbers x ≥ 0, we have that:
(c) For all real numbers x i , i = 1, . . . , k, such that x i ≤ 1 we have that:
Proof. Propositions 4(a) and 4(b) are from Taylor's series. For Propositions 4(c) we use Proposition 1. 2
Proposition 5 For all real numbers x and y, such that x ≤ 1 and y ≥ 0, we have that:
Proof. This proposition is proved using Proposition 4(a). 2 Proposition 6 (a) For all real numbers x and p and integer y, such that 0 < p ≤ 1, y ≥ 0 and x(1/p + y) = O(1), we have that:
(b) For all real numbers x and p and integer y, such that 0 < p ≤ 1, y ≥ 0 and x = O(1), we have that:
(c) For all real numbers x, p and q and integer y, such that 0 < p−q < 2, y ≥ 0 and xp p+q = O(1), we have that:
(d) For all real numbers m, x, p and q and integer y, such that 0 < p − q < 2, y ≥ 0 and xp p+q (
, we have that:
Note that we are misusing the O notation here to hide constant factors that are independent of the variables in the equations. Proof. Using Proposition 2(c) and Proposition 5, Proposition 6(a) is proved as follows:
The proofs of Propositions 6(b), 6(c) and 6(d) are now trivial. 2
The vector of random variables X = (X 1 , . . . X n ), is negatively associated [5] if for every two disjoint index sets,
for all functions f : ℜ |I| → ℜ and f : ℜ |J| → ℜ that are both non-decreasing or non-increasing.
Proposition 7
If the random variables X 1 , . . . X k are negatively associated, then for any nondecreasing function f i , i ∈ [k], we have that:
Proof. The proof follows directly from the definition of negatively associated variables. 
Cache Analysis of In-place Permutation
In this section we analyse the cache misses in a direct mapped cache during n rounds of Process "in-place", introduced in Section 4.1.1. We derive a precise equation for the expected number of cache misses and then give closed form upper and lower bounds on this equation. We then derive upper and lower bounds assuming the keys are drawn independently from a uniform distribution.
Average case analysis
We start by proving a theorem for the expected number of cache misses during n rounds of Process "in-place".
Theorem 1
The expected number X of cache misses in n rounds of Process "in-place" satisfies
, where:
Proof. We first analyse the miss rates for accesses to pointers D 1 , . . . , D k . Fix an i, 1 ≤ i ≤ k and a z ≥ 1. Let µ be the random variable which denotes the number of rounds between accesses to locations d i,z and d i,z+1 (µ = 0 if these locations are accessed in consecutive rounds). Figure 5 shows the other memory accesses between accesses z and z + 1 to
denote the event that none of the memory accesses in these µ rounds accesses the cache block to which d i,z is mapped. We now fix an integer m ≥ 0 and calculate Pr[X i |µ = m]. Letμ be a vector of random variables such that for 1 ≤ j ≤ k, µ j is the random variable which denotes the number of accesses to D j in these m rounds. Clearlyμ is drawn from ϕ(m,ā i ) (note that D i is not accessed in these m rounds by definition).
Fix any vectorm, such that Pr[μ =m] = 0, and let µ j be the number of accesses to pointer D j in these m rounds. Since m i must be zero, f (m i ) = 1, and for j = i, f (m j ) is the probability that none of the m j locations accessed by D j in these m rounds is mapped to the same cache block as location d i,z [9, 14] . Similarly g(m) · C is the number of count blocks accessed in these rounds, and so 1 − g(m) is the probability that the cache block containing d i,z does not conflict with the blocks from C which were accessed in these m rounds. As the latter probability is determined by the starting location of sequence i and the former probabilities by the starting location of sequences j, j = i, we conclude that for a given configurationm of accesses, the probability that the cache block containing d i,z is not accessed in these m rounds is (1 − g(m)) k j=1 f (m j ). Averaging over all configurationsm, we get that
Finally we get,
If d i,z is at a cache block boundary or if X i does not occur given that d i,z is not at a cache block boundary (Pr[X i ] does not change under this condition), then a cache miss will occur. The first access to a pointer is a cache miss. So other than for the first access, the probability p d of a cache miss for a pointer access is:
Including the first access misses, the expected number of cache misses for pointer accesses is at most
We now consider the probability of a cache miss for an access to a count array location. It is convenient to partition C into count blocks of B locations each, where the i-th count block consists of the locations c (i−1)B+1 , . . . , c iB , for i = 1, . . . , k/B. So P i is the probability of access to the i-th block. We fix an i ∈ {1, . . . , k/B} and a z ≥ 1. Let ν be the random variable that denotes the number of rounds between the z-th and (z + 1)-st accesses to the i-th count block. We have that Pr[ν = m] = P i (1 − P i ) m , for m = 0, 1, . . .. Let Y i denote the event that none of the memory accesses in these m rounds accesses the cache block to which the i-th count block is mapped.
We now fix an integer m ≥ 0 and calculate Pr[Y i |ν = m]. Letν be a vector of random variables such that for 1 ≤ j ≤ k, ν j is the random variable which denotes the number of accesses to D j in these m rounds. Given that k ≤ BC and assumption (c) mean that two blocks from C cannot conflict with each other. As the pointers D (i−1)B+1 , . . . , D iB will not be accessed between two successive accesses to count block i, the probability of accessing pointer D j is given by b i j and ϕ(m,b i ) is the distribution forν. Arguing as above:
The first access to a count array block is a cache miss, for all other accesses there is a cache miss if event Y i does not occur. So other than for the first access, the probability p c of a cache miss for a count array access is:
Including the first access misses, the expected number of cache misses for count array accesses is at most
Plugging in the values from Eq. 5 into Eq. 7 and from Eq. 8 into Eq. 10 we get the upper bound on X, the expected number of cache misses in the processes. The lower bound in Theorem 1 is obvious. 2
Upper bound
We now prove a theorem on the upper bound to the expected number of cache misses during n rounds of Process "in-place".
Theorem 2
The expected number of cache misses in n rounds of Process "in-place" is at most
Proof. Letting Γ(x) = 1 − f (x) and using Proposition 1 we can rewrite Eq. 5 as:
We know that the j-th count block contributes 1/C to g(μ) if there is an access to that block and Pr 
and we get,
and using Proposition 3 we get,
We now evaluate
Our approach is to first fix j and evaluate Eμ ∼ϕ(m,āi) [Γ(µ j )]. For m ≤ BC, we know that
The last term is due to the fact that Γ(x) is discontinuous and Γ(0) = 0. Similarly for m > BC we know that
The last term is due to the fact that Γ(x) = 1 for x ≥ BC − B + 1. If we drop this last term when m > BC, we get that for all m
The summation term is the expected value of the random variable with the binomial distribution b(l; m, a i j ). So we get that
pi − 1 by an application of Proposition 2(c). By applying Proposition 3 we get that
pi+p+j . So we get:
Substituting Eq. 12 and 14 in Eq. 11 we obtain the following lower bound for Pr[
Upper bound on p d
Finally, substituting Pr[X i ] from Eq. 15 in Eq. 6 we get:
We can evaluate p c using a very similar approach, as sketched out now. We again consider a fixed i and consider the event Y i defined in the proof of Theorem 1. We now obtain a lower bound on Pr[Y i ].
Lower bound on Pr[Y i ]
Again letting Γ(x) = 1 − f (x) and using Proposition 1, we can rewrite Eq. 8 as:
Arguing as for the derivation of Eq. 13, we get
Then arguing as for the derivation of Eq. 14, we get
Substituting this into Eq. 16, we get:
Upper bound on p c Substituting Pr[Y i ] from Eq. 17 in Eq. 9 we get
2 This proves the upper bound for the equation in Theorem 1. We now prove a lower bound on that equation.
Lower bound
Theorem 3 When p i ≥ 1/C then the expected number of cache misses in n rounds of Process "in-place" is at least np d + k, where:
Proof. We again consider a fixed i and consider the event X i defined in the proof of Theorem 1. Letμ be as defined in the proof of Theorem 1. We now obtain an upper bound on Pr[X i ].
Upper bound on Pr[X i ]
In [3] it is shown that the variables µ j are negatively associated [5] . Noting that f (x) is a nonincreasing function of x, then using Proposition 7 we have that:
So we can re-write Eq. 5 as:
We first bound the last term. We know that
Using Proposition 5 we get that (1 − p i ) BC−B+1 ≤ e −(BC−B+1)pi . Assuming p i ≥ 1/C the last term is at most O(e −B ). We now bound the first term in Eq. 18. We use an approach similar to the derivation of Eq. 13 and since µ ≤ BC − B, so µ j ≤ BC − B, we don't have to drop any terms in the simplification, so we get that:
Letting t j (m) = Using Proposition 4(b) and letting β j = (1 − a i j ) we get that:
We now evaluate the terms in Eq. 19 assuming that p i ≥ 1/C, so k ≤ C. For the simplifications of the subtractive terms we use the fact that e −pi(BC−B+1) ≤ e −B .
, so using Proposition 6(a), we get that
Since p i ≥ 1/C, (B − 1)k/(BC) < 1, so using Proposition 6(b), we get that
We now evaluate the term
The last simplification is due to p i ≥ 1/C and k ≤ C, so 1≤j≤k (p i (1
Substituting back (1 − a i j ) = β j and using Proposition 3 we get that
The last step used
2 ) < 1. We now evaluate the additive terms, starting with
Since m < BC we now get that:
Substituting back (1 − a i j ) = β j and using Proposition 3 we get that:
Finally we evaluate
Using this result and Proposition 2(a), we get that
So we get that
Lower bound on p d
Plugging Eqs. 20 , .., 27 into Eq. 19 we get that:
Plugging Eq. 28 into Eq. 6 we get:
Simplifying further and using 
Upper and lower bounds for uniformly random data
Using the upper and lower bound Theorems just proven for general probability distributions, we now derive Corollaries for upper and lower bounds for uniform distribution.
. . = p k = 1/k then the number of cache misses in n rounds of Process "in-place" is at most :
Proof. Since P i in p c and P j in p d are both B/k in the equations in Theorem 2, we get that:
2 Remark: As we will see later, Process "in-place" models the permute phase of distribution sorting and Corollary 1 shows that one pass of uniform distribution sorting incurs O(n/B) cache misses if and only if k = O(C/B).
The following corollary is from the lower bound result in Theorem 3.
Corollary 2 If p 1 = . . . = p k = 1/k then the number of cache misses in n rounds of Process "in-place" is at least:
Proof. Plugging p i = 1/k in the equation in Theorem 3 we get that:
2 Remark: From Corollary 1 we have that for uniformly random data and k = αC, where α ≤ 1, other than for small values of B, the upper bound for the number of cache misses in n round is roughly αn 2 ,
and from Corollary 2 we have that for uniformly random data and k = αC, where α ≤ 1, other than for small values of B, the lower bound for the number of cache misses in n round is roughly
The ratio between the upper and lower bound is 3/(3 − α). So we have that for uniformly random data the lower bound is within a factor of about 3/2 of the upper bound when k ≤ C and is much closer when k ≪ C.
Cache Analysis of Out-of-place Permutation
In this section we analyse the cache misses in a direct mapped cache during n rounds of Process "out-of-place", introduced in Section 4.1.2. We derive a precise equation for the expected number of cache misses and closed-form upper and lower bounds. During the analysis we re-use k, D i , c i , C, p i , P i ,ā,b, f (x) and g(m) introduced in Section 4.3.
Average case analysis
We start by proving a theorem for the expected number of cache misses during n rounds of Process "out-of-place".
Theorem 4
The expected number X of cache misses in n rounds of Process "out-of-place" is Figure 6 : m rounds of Process "out-of-place". Between two accesses to D i , there are m accesses to "other" pointers, and m + 1 accesses to C, and m + 1 accesses to consecutive locations in S.
Proof. We first analyse the miss rates for accesses to pointers D 1 , . . . , D k . Fix an i, 1 ≤ i ≤ k and a z ≥ 1 and consider the probability of a miss between access z and z + 1 to pointer D i . We define µ, µ j , m,μ,m, d i,z , ϕ(m,ā i ) and X i as in the proof of Theorem 1. Again f (m j ) is the probability that none of the m j locations accessed by D j in m rounds is mapped to the same cache block as location d i,z . Similarly g(m) · C is the number of count blocks accessed in m rounds, and so 1 − g(m) is the probability that the cache block containing d i,z does not conflict with the blocks from C which were accessed in these m rounds. We also have accesses to m + 1 contiguous locations in S and f (m + 1) is the probability that these m + 1 accesses are not to the cache block containing d i,z . Figure 6 shows the other memory accesses between accesses z and z + 1 to D i . For a given configurationm of accesses, as the probabilities f (m j ), g(m) and f (m + 1) are independent, we conclude that the probability that the cache block containing d i,z is not accessed in these m rounds is ( 
. Averaging over all configurationsm, we get that
Using which we get,
Arguing as for Eq. 6 we get that, other than for the first access, the probability p d of a cache miss for a pointer access is:
We now consider the probability of a cache miss for an access to a count array location. Fix an i ∈ {1, . . . , k/B} and a z ≥ 1 and consider the probability of a miss between access z and z + 1 to count block c i . We define ν, ν j , m,ν,m, P i , ϕ(m,b i ), and Y i as in the proof of Theorem 1.
Again, given that k ≤ BC and assumption (c) mean that two blocks from C cannot conflict with each other. So we need to determine the probability of a conflict given m j accesses to the pointer D j , for all j ∈ {1, . . . , k}, and m accesses to contiguous locations in S. Again f (m j ) is the probability that none of the m j locations accessed by D j in m rounds is mapped to the same cache block as c i and f (m + 1) is the probability that the accesses to m + 1 contiguous locations in S are not to the same cache block as c i .
So we have:
Arguing as for Eq. 9, the probability p c of a cache miss for a count array access is:
We now calculate cache misses for accesses to the array S. We consider the probability of a cache miss between accesses to S[s] and S[s + 1]. We know that there is exactly one access to a count block and one access to a pointer between two accesses to S. The probability that the pointer access is to the same cache block as S[s] is 1/C. The probability that a block from C maps to the same cache block as S[s] is k/BC. Given that a block from C maps to the same cache block as S[s], the probability that the access to the count array is to the same cache block as S[s] is B/k. So the probability that the pointer access is to the same cache block as S[s] is also 1/C. So the probability that there are no memory accesses to the cache block that S[s] is mapped to before the access to
We have a cache miss if S[s] is at a cache block boundary, otherwise the probability of a cache miss is 1
So the probability p s of an cache miss for an access to S is
The first access to S is always a cache miss, so the expected number of cache misses in accesses to S is: np s + 1.
Plugging in the values from Eq. 30 into Eq. 32 and from Eq. 33 into Eq. 35 we get the upper bound on X, the expected number of cache misses in the processes.
The lower bound in Theorem 4 is obvious. 2
Upper bound
We now prove a theorem on the upper bound to the expected number of cache misses during n rounds of Process "out-of-place".
Theorem 5
The expected number of cache misses in n rounds of Process "out-of-place" is at most n(p d + p c + p s ) + k(1 + 1/B) + 1, where:
Proof. As for the upper bound for in-place permutation, in this proof we derive lower bounds for Pr[X i ] and Pr[Y i ] and we will use these to derive the upper bounds on p d and p c . We make extensive use of the results obtained during the proof of Theorem 1. Again, we consider a fixed i and consider the event X i defined in the proof of Theorem 4. We now obtain a lower bound on Pr[X i ].
Lower bound on Pr[X i ] Letting Γ(x) = 1 − f (x) and using Proposition1 we can rewrite Eq. 5 as:
We can use Eq. 12 as a simplification for
and Eq. 14 as an upper bound on
So we just have to evaluate
Since we always have at least one access to S, we have that
where the last simplification used Proposition 2(c). Substituting Eq. 12, Eq. 14 and Eq. 37 in Eq. 36 we obtain the following lower bound for Pr[
Upper bound on p d Finally, substituting Pr[X i ] from Eq. 38 in Eq. 6 we get
We can evaluate p c using a very similar approach to that used in the proof of Theorem 2. We again consider a fixed i and consider the event Y i defined in the proof of Theorem 4. We now obtain a lower bound on Pr[Y i ].
Lower bound on Pr[Y i ]
We can rewrite Eq. 33 as:
Eq. 17 gives us
Arguing as for Eq. 37
Substituting Eq. 17 and Eq. 40 in Eq. 39 we obtain the following lower bound for Pr[X i ]:
Upper bound on p c Finally, substituting Pr[Y i ] from Eq. 41 in Eq. 9 we get: 
Lower bound
It is quite obvious that the lower bound for in-place permutation, given in Theorem 3, is a lower bound for out-of-place permutation.
Upper and lower bounds for uniformly random data
Using the upper bound Theorem just proven, we now derive a Corollary for an upper bound to the number of cache misses if the data is uniformly distributed.
Corollary 3 If p 1 = . . . = p k = 1/k then the number of cache misses in n rounds of Process "in-place" is at most:
Proof. Since P i in p c and P j in p d are both B/k in the equations in Theorem 5, we get that
Remark: Corollaries 1 and 3 shows that for uniformly distributed data, other than for small values of B, the number of cache misses during in-place and out-of-place permutations are quite close. As for an in-place permutation, one pass of uniform distribution sorting using out-of-place permutations incurs O(n/B) cache misses if and only if k = O(C/B). Using Corollary 2 for the lower bound and Corollary 3 above, we see that when k ≤ C the lower bound is again within 3/2 of the upper bound and is much closer when k ≪ C.
Cache Analysis of Multiple Sequences Access
Accessing k sequences is like Process "in-place" in Section 4.1.1 except that there is no interaction with a count array, so we delete step 2 and assumption (c). An analogue of Theorem 1 is easily obtained. An easy modification to the proof of Theorem 2 gives:
Theorem 6 The expected number of cache misses in n rounds of sequence accesses is at most:
Corollary 4 If p 1 = . . . = p k = 1/k then the number of cache misses in n rounds of sequence accesses is at most:
Remark: From Corollary 4, k = O(C/B) random sequences can be accessed incurring an optimal O(n/B) misses. This essentially agrees with the results obtained by Mehlhorn and Sanders [9] and Sen and Chatterjee [14] .
Remark: Since its derivation ignored the effects of the count array, the lower bound in Theorem 3 applies directly to sequence accesses. Note that the lower bound we obtain for uniformly random data, as stated in Corollary 2, is sharper than the lower bound of 0.25(1 − e −0.25k/C ) obtained in [14] .
Remark: Our upper and lower bounds are also closer than those in [9] . The analysis in [9] assumes that accesses to the sequences are controlled by an adversary; our analysis demonstrates, that with uniform randomised accesses to the sequences, more sequences can be accessed optimally.
Correspondence between the processes and the permute phase
We now show how the Processes "in-place" and "out-of-place" model the permute phase of a generic distribution sorting algorithm.
The correspondence between Process "in-place" of Section 4.1.1 and the pseudocode in Figure 2 is as follows. Each iteration of the inner loop (steps 3.1-3.5) of the pseudocode corresponds to a round of Process "in-place". The array COUNT corresponds to the locations C, and the pointer D i points to DATA [idx] . The variables x in the process and the pseudocode play a similar role. It can easily be verified that in each iteration of the loop in the pseudocode, the value of x is any integer 1, . . . , k with probability p 1 , . . . , p k , independently of its previous values, as in Step 1 of Process "in-place". A read at a location immediately followed by a write to the same location is counted as one access. Thus, the read and increment of COUNT[x] in Steps 3.2 and 3.3 of the pseudocode constitutes one access, equivalent to Step 2. Similarly the "swap" in Step 3.5 of the pseudocode corresponds to the memory access in Step 3 of the process. The process does not model the initial access in Step 1 of the pseudocode, and nor does it model the task of looking for new cycle leaders in Steps 4 and 5 of the pseudocode.
The correspondence between Process "out-of-place" of Section 4.1.2 and the pseudocode in Figure 1 is as follows. The array COUNT corresponds to the locations C, the array DATA corresponds to the locations S, i in the pseudocode corresponds to s in the process, and the pointer D i points to DEST [idx] . The increment of i in the pseudocode is equivalent to the increment of s in the process, and the accesses to DATA[i] and S[s] are equivalent. As above, the variables x in the process and the pseudocode in the pseudocode play a similar role. Again, the read and increment of COUNT [x] of the pseudocode constitutes one access, and is equivalent to Step 3 of the process. The access to DEST[idx] of the pseudocode corresponds to the memory access in Step 4 of the process.
Assumption (b) of the processes is clearly satisfied and assumption (c) can normally be made to hold. Assumption (d) and k ≤ CB or k ≤ C may not hold in practice, in [11] we give an approximate analysis which deals with this. Assumption (a) of the processes, that the starting locations of the pointers D i are uniformly and independently distributed, is patently false, we discuss this in more detail in [11] . We may force it to hold by adding random offsets to the starting location of each pointer, at the cost of needing more memory and adding a compaction phase after the permute, this has also been suggested by Mehlhorn and Sanders [9] . This only works if the permute is not in-place, and if k is sufficiently small (e.g. k ≤ n/(CB)). In [11] we study assumption (a) empirically in the context of uniform distribution sorting. Another weakness is that our processes are continuous, so the sequence lengths are not specified, whereas in distribution sorting we sort n keys and each sequence is of a finite length.
MSB radix sort
We now consider the problem of sorting n independent and uniformly-distributed floating-point numbers in the range [0, 1) using the integer sorting algorithm MSB radix sort. As noted earlier, it suffices to sort lexicographically the bit-strings which represent the floats, by viewing them as integers. One pass of MSB radix sort using radix size r groups the keys according to their most significant r bits in O(2 r + n) time. For random integers, a reasonable choice for minimising instruction counts is r = ⌈log n − 3⌉ bits, or classifying into about n/8 classes. Since each class has about 8 keys on average, they can be sorted using insertion sort. Using this approach for this problem gives terrible performance even at small values of n (see Table 1 ). As we now show, the problem lies with the distribution of the integers on which MSB radix sort is applied.
Radix sorting floating-point numbers
A floating-point number is represented as a triple of non-negative integers i, j, l . Here i is called the sign bit and is a 0-1 value (0 indicating non-negative numbers, 1 indicating negative numbers), j is called the exponent and is represented using e bits and l is called the mantissa and represented using m bits. Let j * = j − 2 e−1 + 1 denote the unbiased exponent of i, j, l . Roughly following the IEEE 754 standard, let the triple 0, 0, 0 represent the number 0, and let i, j, l , where j > 0, represent the number ±2 j * (1 + l2 −m ), depending on whether x = 0 or 1; no other triple is a floating-point number. Internally each member of the triple is stored in consecutive fields of a word. The IEEE 754 standard specifies e = 8 and m = 23 for 32-bit floats and e = 11 and m = 52 for 64-bit floats [4] .
We model the generation of a random float in the range [0, 1) as follows: generate an (infiniteprecision) random real number, and round it down to the next smaller float. On average, half the numbers generated will lie in the range [0.5, 1) and will have an unbiased exponent of −1. In general, for all non-zero numbers, the unbiased exponent has value i with probability 2 i , for i = −1, −2, . . . , −2 e−1 + 2, whereas the mantissa is a random m-bit integer. The value 0 has probability 2 −2 e−1 +2 . Clearly, the distribution is not uniform, and it is easy to see that the average size of the largest class after the first pass of MSB radix sort with radix r is n 1 − 1 2 2 e−r+1 if r < e + 1, and n/(2 r−e ) if r ≥ e + 1. This shows, e.g., that the largest sub-problems in the examples of Table 1 would be of size n/2 ⌈log n−3⌉−11 ≈ 2 14 , so using insertion sort after one pass is inefficient in this case 2 . To get down to problems of size 8 in one pass requires a radix of about log n + 8, which is impractical. Also, MSB radix sort applied to random integers has O(n) expected running time independently of the word size, but this is not true for floats. A first pass with r ≪ e barely reduces the largest problem size, and the same holds for subsequent passes until bits from the mantissa are reached. As the radix in any pass is limited to log n + O(1) bits, we may need Ω(e/ log n) passes, introducing a dependence on the word size.
Using Quicksort
To get around the problem of having several passes before we reduce the largest class, we partition the input keys around a value 1/n ≤ θ ≤ 1/(log n), and sort the keys smaller than θ in O(n) expected time using Quicksort. We then apply MSB radix sort to the remaining keys. Let e ′ = min{⌈log log(1/θ)⌉, e} denote the effective exponent, since the remaining keys have exponents which vary only in the lower order e ′ bits. This means that keys can be grouped according to a radix
′ ) space. Since e ′ = O(log log n), we can take up to log n − O(log log n) bits from the mantissa as part of the first radix; as all subproblems now only deal with bits from the mantissa they can be solved in linear expected time, giving a linear running time overall.
Cache analysis
We now use our analysis to calculate an upper bound for the cache misses in the permute phase of the first pass of MSB radix sort using a radix r = e + 1 + m ′ , for some m ′ ≥ 0, assuming also that all keys are in the range [θ, 1), for some θ ≥ 1/n. There are 2 e ′ +m ′ pointers in all, which can be divided into g = 2 e ′ groups of K = 2 m ′ pointers each. Group i corresponds to keys with unbiased exponent −i, for i = 1, . . . , g. All pointers in group i have an access probability of 1/(K2 i ). Using Theorem 1 and a slight extension of the methods of Theorem 2 we are able to prove Theorem 7 below, which states that the number of misses is essentially independent of g:
Theorem 7
Provided gK ≤ CB and K ≤ C the number of misses in the first pass of the permute phase of MSB radix sort is at most: n 1 B + 2K BC (2.3B + 2 log B + log C − log K + 0.7) + gK(1 + 1/B).
Proof. Using Eq. 15 we can calculate an upper bound on the probability of event X (i−1)K+1 as: 
If K2 i /(BC) ≥ 1 then Pr[X (i−1)K+1 ] would be negative, so we place a bound on this term such that K2 i < BC. The maximum value of i such that K2 i /(BC) < 1 is log BC − log K − 1. Since the probabilities of access to pointer D (i−1)K+1 , . . . , D iK are all 1/(K2 i ) we can calculate an upper bound on p d using Eq. 6 and 42 as: 
Tuning MSB radix sort
We now optimise parameter choices in our algorithms. The smaller the value of θ, the fewer keys are sorted by Quicksort, but reducing θ may may increase e ′ . A larger value of e ′ does not mean more misses, by Theorem 7, but it does mean a larger count array. We choose θ = 1/(log n) 2 as a compromise, ensuring that Quicksort uses o(n) time. Using the above analysis we are also able to determine an optimal number of classes to use in each sorting sub-problem. We use two criteria of optimality. In the first, we require that each pass incur no more than (2 + ε)n/B misses for some constant ε > 0, thus seeking essentially to minimise cache misses (2n/B misses is the bare minimum for the count and permute phases). In the second, we trade-off reductions in cache misses against extra computation. The latter yields better practical results, and results shown below are for this approach. Table 2 compares tuned MSB radix sort with memory-tuned Quicksort [8] and MPFlashsort [11] , a memory-tuned version of a distribution sorting algorithm which assumes that the keys are independently drawn from a uniformly random distribution. The algorithms were coded in C and compiled using gcc 2.8.1. The experiments were our Sun UltraSparc-II with 2 × 300 MHz processors and 1GB main memory, and a 16KB L1 data cache, 512KB L2 direct-mapped cache. Observe that MSB radix sort easily outperforms the other algorithms for the range of values considered.
Experimental results

Conclusions
We have analysed the average-case cache performance of the permute phase of distribution sorting when the keys are independently but not uniformly distributed. We have presented equations for Table 2 : MPFlashsort, memory-tuned Quicksort and MSBRadix. Running times in seconds of MPFlashsort, memory-tuned Quicksort and MSBRadix sort on a Sun UltraSparc-II using single precision floating point keys.
the number of misses during in-place and out-of-place permutations and have given closed-form upper and lower bounds on these. We have shown that the upper and lower bounds are quite close when k ≤ C and the data is known to be independently and uniformly distributed. We have shown how this analysis can easily be extended to obtain the number of cache misses during accesses to multiple sequences.
We have shown that if the integer sorting algorithm MSB radix sort is used to sort uniformly and randomly distributed floating point numbers then a non-uniform distribution of keys to classes is induced. We have shown that a naive implementation of this algorithm would have very poor performance due to this non-uniform distribution. We have shown that by partitioning the keys, to remove keys which are expected to go into small classes, and by using our analysis, the algorithm can be tuned for good cache performance. Due to fast integer operations and good cache utilisation the tuned algorithm outperforms MPFlashsort, a cache-tuned distribution sorting algorithm, and memory-tuned Quicksort.
