We derive CPU time formulae for the two simplest linear time-sorting algorithms, linear probing sort and bucket sort, as a function of the load factor, and show agreement with experimentally measured CPU times. This allows us to compute optimal load factors for each algorithm, whose values have previously been identified only approximately in the literature. We also present a simple model of cache latency and apply it not only to linear probing sort and bucket sort, where the bulk of the latency is due to random access, but also to the log-linear algorithm quicksort, where the access is primarily sequential, and again show agreement with experimental CPU times. With minor modifications, our model also fits CPU times previously reported by LaMarca and Ladner for radix sort, and by Rahman and Raman for most significant digit radix sort, Flashsort1, and memory tuned quicksort.
INTRODUCTION

Overview
Efficient sorting algorithms fall into two broad classes: loglinear algorithms, such as quicksort and shellsort, which take O(N log N) time to sort an array of N elements [1, 2] , and linear time algorithms, such as bucket sort and linear probing sort [3, 4] , which take O(N) time. Log -linear algorithms compare only the order of pairs of elements being sorted: given any pair x and y, you only need to know if x , y or y , x. Linear time algorithms compare the sizes of the elements being sorted: given any one element x, you can estimate from the (not necessarily uniform) distribution of the sizes of the other elements approximately where in the final sorted list it should be placed, a process known as hashing [1] .
Although log-linear algorithms have attracted the bulk of research attention in the past, recent technological changes have prompted a renewed interest in linear time algorithms for several reasons. First, modern math coprocessors can now perform the previously very slow multiplications involved in hashing in only a small multiple of the time required to perform the previously much faster subtractions involved in pair comparisons. Second, frequent sorting of data sets drawn from a common source, which would be the case, for example, in the construction of internet search engine databases, means that the previously restrictive condition of knowing the distribution of the sizes of the elements to be sorted in advance can often now be easily assumed. Third, the sheer quantity of data involved in modern IT applications means that any algorithm which is asymptotically superior in principle can be expected to yield significantly better performance in practice.
For these reasons, linear time algorithms, and those which can be implemented to run in linear time, have been the subject of many recent research papers, including most significant digit radix sort [5, 6] and Flashsort1 [7] , an in-place version of bucket sort [8, 9] . We looked at linear probing sort [10 -13] because, although linear time, it had not yet received the same kind of contemporary treatment as the other algorithms. In Section 1.2, we will review the literature covering the theory of linear probing sort, i.e., which is valid for small N independent of hardware. We will argue that, for the purposes of minimizing the total CPU time rather than just the number of 'probes', the analysis of linear probing sort as it currently stands is incomplete.
An important characteristic of linear time algorithms is that they require random access to the data being sorted, whereas log -linear algorithms primarily require sequential access. For Permissions, please email: journals.permissions@oxfordjournals.org doi:10.1093/comjnl/bxm097 impact of cache latency upon the performance of sorting algorithms. This has been the subject of many recent studies [14 -19] involving log -linear and linear time sorting algorithms, but not yet linear probing sort. In Section 1.3, we will review the literature covering this practice of sorting algorithms, i.e., which is valid for large N but which depends on the actual hardware used. We shall argue that in order to compute a formula for total CPU time, rather than just the number of 'cache misses', it is necessary to distinguish between random access and sequential access cache latency.
Review of theory
The theory of linear probing sort hinges on the notion of displacement, which is the difference between the estimated, or hashed, position of an element and its final (not necessarily completely sorted) position in the hash table. This in turn depends on the load factor a ¼ N/M, where N is the number of elements to be sorted and M is the size of the hash table.
Many recent papers have been devoted to calculating both the average displacement [20 -23] and the distribution of individual displacements [24, 25] of variants of linear probing hashing as functions of a. From this, it has been computed that the sum of displacement and space compression factors, known collectively as probes, incurred when (completely) sorting an array using linear probing is least when a ¼ (22 ffiffi ffi 2 p )' 0.59 [3, 11] . A number of interchanges are performed along with the displacements, denoted I N [3] , which also contribute to the total CPU time. Although I N was measured experimentally in [3] , no formula for the average number of interchanges a function of a appears to exist in the literature, and without this it is impossible to predict the load factor minimizing total computation time. This is a consequence of the emphasis in the literature on the cost of using a hash table to retrieve data, which only depends on the total displacement, rather than on the cost of constructing it, which involves both displacements and interchanges. While this would be appropriate in a static context, where the data once sorted remains unchanged, in modern IT applications data may change and need to be resorted so frequently that the cost of sorting becomes as important as the cost of accessing the data.
In this paper, inspired by Knuth's famous Parking Problem [1, Sec. 6.4, Ex. 29] (also [26, 27] ), we show that linear probing sort in fact corresponds to the simplest traffic light queueing system [28, 29] , in which case the average displacement is the same as the average length of queue, which can be read off from standard results in queueing theory. Viewed from this perspective, it is not hard to derive a formula for the average number of interchanges as well. The results are quoted in Section 2.1, while the details of the derivation are in Section 2.3. We compare the predictions of these formulae with experimentally measured values and show that agreement is excellent. In Section 2.2, we show that bucket sort also has a load factor and interchanges, although the formulae in this case are much easier to derive.
In Section 2.4, imitating Knuth's counting of lines of MIX code [1] , we combine these two formulae (displacements and interchanges) into one formula for the total CPU time of linear probing sort as a function of the load factor, which accurately matches experimentally measured CPU times and shows that total CPU time is minimized when a ' 0.45, not 0.59 as above. We derive a similar formula for bucket sort and show that total CPU time is minimized at R ¼ a 21 ' 0.6, confirming results previously observed by Neubert [7, Fig. 1] . Note that, following [7] , we use R rather than a in our formulae. Also note that the correspondence with Knuth's MIX line counting is much easier to make in FORTRAN than C, but since we used a C compiler with a FORTRAN front end [30] (but with a better random number generator from the literature [31] ), this choice of high-level programming language probably made little difference at the level of machine code for simple programs like these.
Review of practice
In practice, the larger the array being sorted, the larger the memory location or cache in which the array is stored, and hence the longer the access time or latency [32, 33] . Many recent studies have modelled the impact of cache latency on the performance of sorting algorithms [14 -19] . A key parameter in these models is the block size, which represents the amount of data read at one time from one cache to another when the data requested is not present at first, a circumstance known as a cache miss. Since the block size was quite small, either 32 bytes [15] or 64 bytes [18] , both sequential and random access to arrays incurred significant cache misses, which complicated the theory. Consequently, these previous studies only compared predicted cache miss rates with results obtained from cache simulations and only of the L2 cache. The actual CPU times, which involved also the L1 cache, were reported separately and no attempt was made to model them theoretically.
In fact, modern memory technologies such as fast paging and burst access [32, 33] mean that sequential access to data incurs considerably less latency than does random access. The ideal approach would therefore be to take the cache miss analyses of these previous studies, separate out terms involving random access from those involving sequential access and weight them with separate latencies D ran and D seq . Unfortunately, the arrays to which these latencies refer, typically an initial unsorted array accessed sequentially and a final sorted array accessed at random, are generally about the same size (or, in the case of in-place algorithms, the same array) and so will exceed the capacity of any given cache at about the same value of N. This means that experimentally measured CPU times are generally only capable of determining the sum D ran þ D seq and not the separate values.
Page 2 of 19 P.M.E. SHUTLER et al.
Fortunately, since linear time algorithms access data in both ways, that sequential access latency is so much smaller than random access latency means that to a good approximation the former can be set to zero. In practice, this means using a single latency D which we think of as the random access latency, while accepting the fact that experimentally it does include a small amount of sequential access latency. In Section 3.2, we incorporate this assumption into a simple probability model of random access cache latency. In Section 3.4, we show that this model accurately matches CPU times measured on the three-level memory structure (L1 cache, L2 cache, RAM) of a Pentium PC, for both linear probing sort and bucket sort.
In Section 4.1 we show that, with minor modifications, our model accurately matches the radix sort results reported in [15] , and in Section 4.2 that it also matches the radix sort and Flashsort1 results reported in [18] . Since these results were generated on workstations, not PCs, this confirms that our theory applies across different architectures. Fitting our theory to data from the literature can be problematic, as discussed in Section 3.3, when there are too few data points to support the complexity of the algorithms being studied. Finally, all of the caches involved in these studies were exclusive, but equally common is the inclusive design where a copy of the contents of the L1 cache is kept in the L2 cache, but this can be taken into account simply by adjusting the size of the L2 cache in the model.
Although sequential access latency is very small compared to random access latency, for those log -linear algorithms such as quicksort which access arrays primarily sequentially it is still present and measurable. In Section 4.3, we present a simple model of sequential access cache latency and show that it accurately matches experimentally measured CPU times not only for a standard implementation [34] of quicksort [35] [36] [37] [38] , but also the memory tuned quicksort results in [18] . This is one reason why we conducted our experiments on an old Pentium system since on a modern Pentium 4 machine a much larger block size combined with hardware prefetch [39] makes sequential access latency quite difficult to measure directly [40] . In Section 4.4, we consider external storage and optimal sorting.
Another reason for using an older Pentium system is that the L2 cache is on the motherboard, so it is relatively large and slow [32, 33] giving broad and well-separated plateaux on the performance curves with a characteristic shape (cf. our results with [14, Fig. 2] ). The later trend was for the L2 cache to be integrated on the processor die, hence smaller and faster, making the L2 plateaux harder to resolve. More recently, L2 caches have grown in size again but on a modern Pentium 4 both caches are 8-way set associative whereas the caches of the machines used in these previous studies [15, 18] were direct mapped. The Pentium system used in our study therefore nicely bridges the gap, since it has a 4-way set associative L1 cache and a direct mapped L2 cache, so both types can be seen in a single set of data.
Nevertheless, our results are not significantly different from what would be obtained on a contemporary machine because, while overall speeds have increased by more than an order of ANALYSIS OF LINEAR TIME SORTING ALGORITHMS Page 3 of 19 magnitude over the last decade, the ratio of CPU to main memory speeds has changed very little. This is important because the comparisons which we make in this paper between memory-intensive (linear time) and CPU-intensive (log-linear) sorting algorithms depend only on the ratio of CPU to memory speeds, not on their absolute values. Compared with our Pentium, a modern PC might have a CPU which is 3. Another point that many authors emphasize in their introductions [15, 39] is that a random access cache miss to main memory now takes of the order of a hundred CPU clock cycles to handle. When comparing the performance of different sorting algorithms, this is interesting since the amount of extra computation which would have to be carried out to avoid that miss to main memory also takes of the order of a hundred CPU clock cycles [18] . In fact, for about the last 10 years there has existed a balance between CPU-and memory-intensive algorithms, in the sense that the performance curves of different algorithms do not by in large cross each other but run in approximately parallel tracks. In other words, we shall find that if one ranks the algorithms according to their absolute speed at a given N then the result is largely independent of the value of N chosen.
THEORY OF LINEAR TIME SORTING ALGORITHMS
Linear probing sort
To understand linear probing sort [10 -13] , or LPSORT for short, consider the simple example shown in Fig. 1(a) , where N ¼ 10 integers are chosen at random in the range [1, M ¼ 100] and placed in an initial array or IARRAY for short. Divide the range into B ¼ 10 bins of a final array or FARRAY for short, shown in Fig. 1(b) , where each bin can contain exactly one element, and is labelled with the corresponding subrange 1 -10, 11-20, etc. Taking each of the elements in IARRAY in turn, attempt to insert it into the bin in FARRAY whose subrange corresponds to the value of that element. If the bin is already occupied, compare the size of that element with the one already in the bin, and if the element already in the bin is larger, then we have a pair of elements which must be swapped so that the current element becomes the larger of the two. Then bump the current element to the next bin, and so on, until an empty bin is found. Although the distribution from which the elements of IARRAY were drawn was uniform, the number of elements falling in each bin is far from constant. When we try to insert the ninth element with value 25, all the bins up to subrange 51-60 are already occupied, or have become saturated, so we have to bump many times before we can find an empty bin. LPSORT eases the problems of saturation by allowing more than one bin per element uniformly throughout the range. Fig. 1(c) shows the case where B ¼ 2N and we leave it as an exercise to the reader to show that the number of bumps drops from 10 to just 2, while the number of pairs drops from 3 to just 1. The price to be paid is that FARRAY now has many spaces which need to be compressed before the sorted array regains its original size. In general, if we define an expansion factor R ¼ B/N ¼ a
21
, then as R increases the number of bumps and pairs decreases, but the number of spaces increases, so overall there should be some value R for which the total time taken to sort is minimum. To identify the optimal value of R, we need to compute the number of BUMPS, PAIRS and SPACES as functions of R.
The function BUMPS, known in the literature as the total displacement, has been the subject of recent research [20 -23] . Inspired by Knuth's famous Parking Problem [1, Sec. 6.4, Ex. 29] we show in Section 2.3 that linear probing sort corresponds to the simplest traffic light queueing system, in which case BUMPS is just the average length of queue, which can be computed using standard generating function techniques. No such formula for PAIRS appears to exist in the literature, but we show in Section 2.3 that it can be computed by integrating BUMPS during sorting to yield the following formulae: Linear probing sort can be applied to any distribution, not just the uniform distribution, using the cumulative probability function [12, 13] . For example, given floating point data with probability density function f(t) ¼ dF/dt, the estimated position, or hashing function of an element is just NF(t). Even if the distribution of the data is not known in advance, it can be sampled, provided N is large enough, and the distribution estimated, an approach already applied successfully to quicksort partitioning [37] . Note also that just as elements are bumped inside FARRAY, they can also be bumped beyond the end of the array, so some additional space much be provided. Fortunately, provided R is not close to 1, the amount of additional space is almost certainly negligible compared to the size of the array [3, 11] . Finally, the formula for BUMPS given above is strictly speaking an approximation, based on the assumption that N is sufficiently large for the traffic light queueing system to have reached a steady state, and as can be seen from Fig. 2 this begins to break down as R approaches unity from above. This has also been the subject of recent research, which implies that provided R is not close to unity (in Section 2.4 we shall see that we are only interested in R ' 2), the error in the formula is only O(N 21 ) [3, 11] and is again negligible.
Bucket sort
BUCKETSORT [4, 7 -9] avoids the problems of saturation not by having more bins, but by allowing each bin to effectively contain arbitrary many elements inserted on a first come first served basis. The example shown in Fig. 3 , using the same data as Fig. 1 , uses B ¼ 5. Again we have a notion of expansion factor R ¼ B/N, except that it is no longer constrained to be greater than unity; indeed we would expect the optimal value to be less than 1, a contraction. An elementary application of the Poisson distribution then shows that the average number of pairs remaining to be swapped in each bin is a This flexibility in the size of a bin is in practice impossible to achieve on a computer. An alternative approach originally implemented by Gonnet in 1984 [4, p. 253] is to use the bins to count the observed frequency of elements in each subrange. These frequencies are prefix summed, taking a time proportional to the number of elements in the frequency array JARRAY, denoted by the function COUNTS ¼ RN. The resulting cumulative frequencies are then used to insert each element from the initial IARRAY into the partially sorted KARRAY so that bumps are avoided entirely, by inserting into the bins from right to left and decrementing the cumulative frequency count at each step. The remaining swaps are achieved by a final pass of insertion sort, which up to constants takes a time proportional to SWAPS. The total time to sort is therefore T ¼ const þ T COUNTS þ T SWAPS where T COUNTS / R is increasing, and T SWAPS / 1/(4R) is decreasing with R, so T should be minimum for a particular R. A rough approximation would be to set the derivative of R þ 1/(4R) to zero, giving R ¼ 1/ ffiffi ffi 2 p ' 0.71, whereas Neubert obtained R ' 0.43 via a slightly different approximation [7] .
Traffic light queueing systems
In this section, we show how to compute BUMPS and PAIRS as functions of the expansion factor R for LPSORT by forming a correspondence with a traffic light queueing system and applying standard generating function techniques [28, Chapter 7] . Consider the diagram in Fig. 4 (a) which shows one of the B ¼ NR bins in FARRAY. Temporarily ignoring the order in which the operations actually occur, denote by the symbol * elements which at any point are hashed into this bin, and by the symbol † elements which are at any point bumped into this bin from the bin below. In terms of queueing theory, the bin represents one cycle of the traffic 
ANALYSIS OF LINEAR TIME SORTING ALGORITHMS Page 5 of 19
light, the elements * are the vehicles which arrive during that cycle and the elements † are the vehicles which are queueing from the previous cycle. That only one element is actually inserted into this bin in the expanded array, while all the rest are bumped into the bin above, means that the length of the green light is only sufficient to allow one vehicle to pass, while the rest queue until the next cycle. Note that time in the queue corresponds to the index of the bins in the expanded array, while time in the sorting algorithm determines the queue discipline, and a 21 ¼ R is the traffic intensity. The probability distribution p Ã ¼ (p 0 , p 1 , p 2 ,. . .) of the elements * follows the Poisson distribution p i ¼ e 2a a i /i! where a ¼ 1/R is the average number of elements per bin, and we can define the generating function Q(z) ¼ S i p i z i ¼ expf2aþazg. Provided the bin shown in Fig. 4 (a) is sufficiently far from the start of the expanded array, we can assume that the probability distribution p Ã ¼ (p 0 , p 1 , p 2 ,. . .) of the elements † also does not change as we move from one bin to the next. In terms of queueing theory, this corresponds to the steady-state assumption that the probability distribution for the number of vehicles in the queue is constant in time. We know, however, that the number of elements bumped from this bin into the next is the number of elements bumped into this bin, plus the number of elements which are hashed into this bin, minus the one element which is inserted into the bin. We can therefore express this steady-state condition by the system of equations
Multiplying the equation in p k by z k and summing yields the generating function equation
, and this rearranges to give
It is now possible to apply well-known techniques to compute the probability distribution p Ã , regarding z as a complex variable and locating the zeros of the denominator Q(z) 2 z in the complex plane [28, 29] . It is also clear how to generalize the case of the bulk service traffic light queueing system, where several vehicles pass on each green light, which in terms of sorting algorithms would correspond to a cross between LPSORT and BUCKETSORT.
For the purposes of this paper, however, we only need to compute expectations, as follows. The average number of elements bumped from one bin to the next is
is the derivative of the generating function G. If we differentiate our previous functional equation for G(z), we cannot evaluate directly at z ¼ 1 because both numerator and denominator are zero there, but an application of l'Hôpital's rule, or a simple power series expansion in z, yields the following result:
To compute p 0 we simply apply l'Hôpital's rule to evaluate the relation G(1) ¼ 1 and deduce that p 0 ¼ e a (1 2 a) and
To compute PAIRS as a function of R, we do actually have to pay attention to the queue discipline, i.e., the order in which elements insert into the expanded array. Consider the diagram in Fig. 4 (b) which illustrates the case of the (n þ 1)th element w which we hash into FARRAY, the previous n elements having already been inserted in for some n , N. The average number of elements per bin is then a 0 ¼ n/(NR) and we let p 0 * and p 0 * denote the prevailing probability distributions. If the (n þ 1)th element finds the bin already occupied, and is bumped to a higher bin, it passes three kinds of elements: those originally hashed to a lower bin but which themselves were bumped upwards (symbol †) and which therefore should lie below w in the final sorted order; those originally hashed the same bin and again were themselves bumped upwards (symbol *) which should lie below or above w with equal probability; and those elements originally hashed to a higher bin (symbol W ), which may or may not have themselves been bumped upwards, and which should lie above w in the final sorted order. All four possibilities ( †, * low, * high and W ) are illustrated in Fig. 4(b) by the element w ¼ 25, which also shows that the value of w can change as it moves upwards.
Denote by the function Bumps, the expected number of bumps that the element w generates and by Pairs the expected number of pairs which need to be swapped as a result. Since Pairs begin half way through the * block, it equals Bumps minus all of the † bumps and half of the * bumps. The expected number of pairs associated with w is therefore
where we have substituted for the expectations E (p 0 ) and E(p 0 ) in terms of the prevailing average occupancy a 0 using the formulae derived previously in terms of a. Summing over all N elements we find
where approximating the sum by an integral with respect to da 0 ¼ 1/(NR) is accurate provided N is sufficiently large. Evaluating the integral Ð 0 a a/(1 2 a) da ¼ 2log e (1 2 a) 2 a by substitution, and inserting this and our previous formula for BUMPS immediately yields the formula for PAIRS quoted in Section 2.1.
CPU time formulae
To find the optimal expansion factor R for LPSORT, we must express the total CPU time as a function of R in the form
To do this, we must convert the formulae given in Section 2.1 into expressions involving CPU time. If we adopted Knuth's MIX code line counting approach [1] , this would involve counting lines of machine code, requiring a rather thorough knowledge of the inner workings of compilers. A crude but effective approach is simply to count the number of lines of high-level code and assume that each of these lines of code takes the same amount of CPU time T 0 . In this paper, we shall compromise by weighting the lines of high-level code in the simplest possible way: to fast lines, such as those which increment counters, we assign a single weight, i.e., assume they take a time T 0 ; to the more complicated lines, such as those reading elements of the arrays, we shall assign a double weight, i.e., assume they take a time 2T 0 to execute. Figure 5 lists the essential lines of FORTRAN for LPSORT, where the weight of each line is placed in one of the columns labelled N, BUMPS, PAIRS, or SPACES according to how many times the line executes, equivalent to Knuth's fifth column in any of his MIX program listings in [1, Sec. 5.2]. The END DO lines have the same weight as the matching DO WHILE but in a different column, neatly separating the times when the logical condition is checked from inside the loop from the times when the condition is checked at the start to see if the loop executes at all. In C this would be much less clear since the loop terminator is just one g or ; among many. The hashing line number 50 takes a significantly longer time T 1 , so must be accounted for separately, which is indicated by placing it in brackets (T 1 ) in the column labelled N. Figure 6 does the same for BUCKETSORT, and multiplying the weight totals at the foot of each column with the factor at the top, and adding everything up, gives
To test these formulae, the variation with R of experimentally measured CPU times T for both algorithms on a 200 MHz Pentium PC using the GNU FORTRAN compiler [30] averaged over 10 4 arrays of size N ¼ 10 3 consisting of random integers in the range [1, M ¼ 10 8 ] is shown by the data points in Fig. 7 . The hashing time T 1 is not a parameter since it can be measured directly, being T 1 ¼ 135 ns on our machine. The only free parameter is therefore T 0 and the theoretical curves shown in Fig. 7 use best-fit values T 0 ¼ 33.0 ns for LPSORT and T 0 ¼ 26.6 ns for BUCKETSORT. That both sets of data show a clear minimum as a function of R confirms our earlier discussion, and in the case of BUCKETSORT matches the experimental results reported in [7, Fig. 1 ]. The close fit between theory and data, especially at low R, for very similar values of T 0 confirms the validity of our ANALYSIS OF LINEAR TIME SORTING ALGORITHMS Page 7 of 19 weighted line counting approach. That T 1 ' 5T 0 confirms the efficiency of the modern math coprocessor mentioned in Section 1.1. Given the form of the above equations, the poor fit above the R minimum cannot be cured simply by adjusting T 0 but is due to cache latency, and is discussed in the next section.
PRACTICE OF LINEAR TIME SORTING ALGORITHMS
Experimental results
Fixing the expansion factor R at the optimal values seen in Fig. 7 (R ¼ 2.2 for LPSORT and R ¼ 0.6 for BUCKETSORT), Fig. 8 shows the variation in CPU time per sorted element T/N for both algorithms as a function of the number of elements N. The data points are averages over 1000 arrays for each value N up to 10 6 and over 100 arrays for N above this, and to ensure that the number of equal-valued elements was negligible and the range of the random integers being sorted is again [ 
Instead of the expected horizontal lines, LPSORT shows three plateaux (marked L1, L2 and RAM) separated by two sharp kinks. BUCKETSORT has twice as many kinks, hence not so well-defined plateaux. Corresponding results for a standard implementation of QUICKSORT [34] are also shown, which is much smoother by comparison, but which also has two slight kinks at approximately the same positions as LPSORT. For very large N, both linear time algorithms begin to access the hard disk drive A simplified explanation of these results is as follows. Single-precision integers are stored as 4 bytes, so the LPSORT kinks and HDD spike roughly match the values of N where FARRAY no longer fits inside the 16 kB L1 data cache on the 200 MHz MMX Pentium CPU [32, 33] , the 512 kB L2 cache on the TX97 motherboard [41, 42] and the 64 MB PC66 RAM [43] . BUCKETSORT is similar, except that there are twice as many kinks because two arrays being randomly accessed (JARRAY and KARRAY), but the expansion factor R applies to only one, so they exceed the size of the various memory locations at different values of N. The first kink near N ¼ 10 3 for both LPSORT and BUCKETSORT explains the poor fit between theory and data in Fig. 7 above the optimal R because, from a memory point of view, increasing R while keeping N fixed has the same effect on CPU time as increasing N while keeping R fixed. QUICKSORT has only two kinks because there is again only one array involved, but these kinks fall slightly later than those of LPSORT as the array is not expanded by a factor R. The shallowness of the QUICKSORT kinks confirms that sequential access incurs significantly less latency than random access, but we will analyse this in more detail in Section 4.3.
The hard disk drive spikes in Fig. 8 highlight the large speed difference between the hard drive and main memory, and can be modelled as follows. Our Asus TX97 motherboard with Intel 430TX chipset supports up to 256 Mbytes of RAM [41, 42] , but in fact only the first 64 Mbytes are cacheable [33, 42] Fig. 8 and fit very closely the point where there is a very abrupt increase in CPU time. The principal aim of the next section is to show that a similar, albeit much more precise, theory explains the other kinks shown in Fig. 8 .
A simple model of cache latency
To help us model random access latency more accurately, consider Fig. 9 which shows a typical Pentium PC memory structure [32, 33] , where the L1 cache, L2 cache and RAM can store up to, say, M 1 , M 2 and M 3 bytes of data, respectively. Data in the RAM must first be read into the L2 cache, taking a time D 32 say, and then into the L1 cache, taking a time D 21 , before it can be read into the CPU registers. If the cache controller is operating with maximum efficiency, an array of size M 0 bytes is stored entirely within the L1 cache if M 0 M 1 , otherwise at most M 0 2 M 1 bytes overflows into the L2 cache, and when that is full at most M 0 2 M 1 2 M 2 bytes overflows into the RAM. So the probability that randomly chosen piece of data lies in the RAM is
Arguing similarly for the L2 cache, and writing (T/N) 0 for the CPU time per element of the array in the case M 0 M 1 yields
Now suppose we have two arrays: an initial unsorted array IARRAY accessed sequentially, and a final sorted array KARRAY accessed at random. Provided both arrays are much larger than the cache in question, then a steady state will quickly be reached where the proportion of the cache occupied by IARRAY and KARRAY blocks, respectively, is constant. Suppose IARRAY accesses each of its blocks l times, KARRAY accesses each of its blocks m times, and let the faction of the cache occupied by IARRAY be r. For every l iterations, a new IARRAY block is loaded to the cache and at the same time an average of lr IARRAY blocks will have been ejected at random by KARRAY blocks, so in the steady state we have lr ¼ 1 [15, Equation (10)]. Furthermore, of the m times each KARRAY block is accessed, only the first m 2 1 times will result in a 'useful' block being loaded, since the mth time a block is loaded which will never be accessed again. So the fraction of the cache a occupied by KARRAY which will reduce cache misses in the sense of Equation (10) ANALYSIS OF LINEAR TIME SORTING ALGORITHMS 
Conflict misses may still occur even when the arrays are small, but for a direct mapped cache the blocks are generally not mapped contiguously but rather according to some complicated permutation. So, assuming that the overlap between the two arrays is effectively random, the probability that a given array KARRAY block conflicts with IARRAY is the ratio of the size of IARRAY to the size of the cache [18, p. 26, para. 2] . Given that they do conflict, the KARRAY block will suffer one (compulsory) miss when it is first loaded, and a further (conflict) miss when the sequential access to IARRAY passes through that block, provided this occurs in one of the m 2 1 gaps between the KARRAY random accesses, and not before the first, or after the last. Approximating these random accesses as equally spaced, the fraction of KARRAY accesses in a conflicting block which result in a cache miss is (cf.
[18, Equation (26)]):
To make the above more concrete, consider Equation (10) in the case of a direct mapped L 2 cache which is much larger than the L 1 cache, so that all terms involving the L 1 cache may be neglected or absorbed into (T/N) 0 (true, e.g., in [15] ). When N is large
where M K is the size of KARRAY and a is its cache fraction. When N is small
where M I is the size of the IARRAY and b is the cache miss probability. Suppose that both IARRAY and KARRAY consist of 4-byte integers in 32-byte blocks so M I ¼ M K ¼ 4N bytes and l ¼ m¼ 8, and the L 2 cache is directly mapped with capacity M 2 ¼ 512 kbytes and latency D 32 ¼ 300 ns, and (T/N) 0 ¼ 1000 ns. Equation (11) then gives a ¼ (1 2 1/8) 2 ' 0.77 and the resulting curve for Equation (13) is shown solid in Fig. 10 . Similarly, Equation (12) gives b ¼ 2/9 ' 0.22 and the resulting curve for Equation (14) is shown dotted in Fig. 10 . Also shown dotted is Equation (13) but for a ¼ 1/2, for later reference. Note that the point marked W at which the dotted b curve meets the solid a arch is also the point at which Equation (13) first starts to become reasonably accurate, i.e., when
Next let us consider what difference it would make if the cache were 4-way associative rather than direct mapped. This means that the cache is split into four parts, each of which is separately direct mapped, so that when a block is loaded from main memory it has a choice of a set of four locations to choose from. Which of these four blocks is ejected in favour of this new block is decided in hardware according to some simple rule such as LRU (least recently used), FIFO or Random [39, p. C10]. For a LRU cache when an IARRAY block is loaded it automatically becomes the MRU (most recently used) among its set of four, and so will not be ejected until the other three blocks in its set have been ejected first. Since the average time taken for a set of four blocks to be hit four times by KARRAY is the same as the average time for a single block to be hit once in the direct mapped case, r and hence a will be unchanged. This perhaps slightly surprising result is supported by data in [39, When N is low, associativity does make a significant difference and although an exact computation of b in this case is very hard, we can use the curves plotted in Fig. 10 to place close bounds in its behaviour. Provided both IARRAY and KARRAY are less than half the size of the cache, then conflict misses will be eliminated entirely, and so the performance curve must lie below the dotted a ¼ 1/2 curve in Fig. 10 . Conversely, when IARRAY and KARRAY are more than half the size of the cache, then conflict misses are unavoidable, but will be no worse than if the cache were direct mapped; so the performance curve lies below the dotted b ¼ 0.22 curve shown in Fig. 10 . This means that the performance curve of a 4-way associative LRU cache must pass through the trapezium labelled A in Fig. 10 after which it will asymptotically approach the solid a ¼ 0.77 curve. So associativity sharpens the kink (the point where M K ¼ aM 2 ) but leaves the large N performance unchanged.
Fitting theory to data
Unlike Fig. 10 , which represents a single cache/array pair, the algorithms studied in this paper generally have several arrays accessed at random and several caches, hence many possible array/cache pairs. Experimental data taken from these algorithms, therefore, rather than having a single well-defined latency 'arch' as in Fig. 10 will instead consist of a sequence of many, partially overlapping arches each with its own pair of constants (a, D). Although the value of a can generally be predicted using Equation (11), the random access latency D can only be obtained from the manufacturer's specifications. Unfortunately, most manufacturers are quite vague about this (in [43] it is listed simply as 'max. ¼ 1000 ns') preferring to give their attention instead to the sequential access latency, since it is much shorter. The best that we can do, therefore, is to adjust the value of D by hand, one for each arch in the data, until a 'best fit' is obtained. If the fit is precise this not only validates the theory but in fact constitutes an accurate measurement of the random access latency, something which we believe has not been attempted before in the literature. When we attempt to apply this approach to the existing data in the literature [15, 18] , however, there are several problems. First, the algorithms themselves tend to be rather complex, so that the number of latency arches in the data can be quite large. Second, the values of N used in these earlier studies increased by powers of 2, which on a logarithmic scale are spaced quite far apart. Consequently, the number of data points lying on each arch can be rather small, even as few as two points per arch, which is the bare minimum required to obtain a value for D. Since code and hardware are no longer available (the code posted for [18] is in fact the code for [19] posted twice by mistake), there is clearly little that can be done about the data. We can at least verify that the values of a are close to the theoretical values (in particular that the value of a is independent of the cache) that the latencies D are reasonable and that the array/cache pairs correspond to the description of the algorithm.
Our own data collection was designed to avoid these problems. First, we chose relatively simple algorithms with as few arches as possible; in fact LPSORT has only two. Second, we chose values of N which were much more closely spaced, so that each latency arch would have many data points. Finally, the large difference in the size of L 1 and L 2 caches in the older hardware used meant that the latency arches were as widely spaced as possible. Consequently, the shape of the latency arches can be located accurately, especially at high N (around N ¼ 10 6 in Fig. 10 ). This is important because the data will show a variety of low N effects near the kink (around N ¼ 10 5 in Fig. 10 ). For associative caches, the kink is quite sharp, following very closely the solid theoretical curve in Fig. 10 . For direct mapped caches when there are three or more arrays involved, then the kink generally follows the b type dotted curve. When there are only two arrays involved, however, we shall see strong evidence that the direct mapping does manage to keep them from conflicting for small N, the data initially following the a ¼ 1/2 dotted curve shown in Fig. 10 , but then transitioning to the value of a given by Equation (11) at some intermediate value of N.
Applying the model to LPSORT and BUCKETSORT
First, we apply Equation (10) to the random access to FARRAY in line 100 of the LPSORT code in Fig. 5 , but modified like Equation (13) for the interaction with IARRAY. Both arrays consist of 4-byte integers in 32-byte blocks so we have l I ¼ 8 and m F ¼ 8/R where R ¼ 2.2 is the expansion factor, so Equation (11) gives since the L2 cache of the TX97 motherboard runs at twice the speed of the PC66 RAM [41] . We can clearly see the difference between the 4-way associative L1 cache [32, 33] and the direct mapped L2 cache [42] , since the first kink is sharp, while the second initially follows the (dotted) a ¼ 1/2 curve. For BUCKETSORT, the three random memory accesses in lines 100, 400 and 500 shown in Fig. 6 should give six kinks, but the data in Fig. 8 show only four. This is because the summing of the observed frequencies in JARRAY in the DO 300 loop preloads JARRAY into the cache, so the random access to JARRAY in line 400 attracts zero CPU time penalty, at least for moderate N. As N increases and the cache fills up, there should be a measurable cache latency, but this will be partially masked behind the random access to KARRAY in line 500, which itself is partially masked ANALYSIS OF LINEAR TIME SORTING ALGORITHMS 
The resulting (solid) curve in 
DATA FROM THE LITERATURE
Comparison with LaMarca and Ladner
Our results partly confirm those of a recent study [15] in which the performance of several log -linear algorithms including quicksort was compared with a linear time implementation of radix sort. Except that their L2 cache was 2 MB rather than 512 kB, the CPU times reported in [15, Fig. 9(c) ] can be compared directly with our Fig. 8 . The authors of that earlier study focused on cache misses, simulated rather than actual, but since data were cached only 32 bytes at a time (four elements of their array) both sequential and random access to arrays incurred significant cache misses. This meant that both quicksort and radix sort exhibited a very large increase in cache misses at the L2/RAM transition near N ¼ 10 5 in [15, Fig. 9(b) ]. Comparing this with the CPU times in [15, Fig. 9(c)] , which was the focus of our study, we see a similarly large increase in CPU time for radix sort, but only a small rise for quicksort, confirming that the latency incurred by sequential access is much less than that incurred by random access. This can also be seen in [15, Fig. 3(c) ] where memory tuning of quicksort yields a big reduction in cache misses, but only a small reduction in CPU time.
Another important difference between that earlier study and our own lies in their implementation of radix sort which, like bucket sort, uses frequency arrays and buckets and therefore has an expansion factor R. The authors chose to fix number of buckets at B ¼ 2 16 since that was optimal for the largest size of array N ¼ 4 Â 10 6 , and allowed both of their frequency arrays to fit entirely within the L2 cache [15, p. 96 ]. This also meant that for very small N, the expansion factor became very large, more than R ¼ 16 for N ¼ 4000, at which point the two frequency arrays were mostly empty space. This explains why the CPU time per array element shown in [15, Fig. 7(c) ], far from being constant with N, in fact increased very rapidly with decreasing N. By contrast, in this paper we have shown that for optimal performance it is the expansion factor R rather than the number of buckets B which should be held constant, and then the CPU time per sorted element is constant for small N.
Finally, although our LPSORT and BUCKETSORT were about twice as fast as our QUICKSORT, their implementation of radix sort was slower than their quicksort by about the same margin. This may have been because they regarded each of the randomly chosen 64-bit binary integers in their initial array as consisting of four 'digits' in base 2 16 , which they then sorted by making a total of four passes through the array, starting with the least significant 'digit' and ending with the most significant. Since the maximum size of the array was only N ¼ 4 Â 10 6 ' 2 22 , (2 16 ) 2 , however, only the first two base 2 16 'digits' were truly significant, with about one in a thousand pairs differing only in their third or fourth 'digit'. The sorting of the array could therefore have been accomplished by making only two passes rather than four, applied to the first and second 'digits', followed by a single pass of insertion sort to catch the few remaining out of order pairs, approximately doubling the speed. Furthermore, by increasing the number of buckets to allow a singlebase 2 22 digit, the number of passes could have been reduced to 1, approximately doubling the speed again. Their radix sort might then have outperformed their quicksort by about a factor of 2.
Despite its relatively poor performance, LaMarca and Ladner s RADIXSORT is quite interesting to analyse, mostly due to the fact that B rather than R is held constant. Although they do not quote numerical data, and the resolution of the published graphs is rather poor, LaMarca s personal Web site [44] includes soft copies of the original graphs from which the data can be reconstructed reasonably accurately, and [15, Fig. 7(c) ] is replotted on a logarithmic scale in our Fig. 11 . Note that the vertical scale is clock cycles, but Page 12 of 19 P.M.E. SHUTLER et al.
given their reported clock frequency of 288 MHz, it follows that each clock cycle is about 3.57 ns and from this it is easy to compute CPU times. Note also that the values of N start from 4000 then increase by a factor 2 at each step; so while not exactly powers of 2 they are nearly so, and we have indicated the approximate values N ' 2 16 and N ' 2 18 against two points on the graph which will be important in our discussion below. Finally, the L2 cache was rather large at 2 MB, so the 16 kB L1 cache is small enough to be neglected as discussed in Section 3.2. First we compute the baseline performance (T/N) 0 , and being essentially a bucket sort but without any out of order pairs at the end, this is Equation (9), but with SWAPS set to zero. Furthermore, since B rather than R is fixed, the term COUNTS in Equation (9) is proportional to B ¼ 2 16 rather than to N. Putting these two things together we deduce that (T/N) 0 must be of the form
Assuming that the cache latency for the first two data points shown in Fig. 11 is negligible, we can set these values equal to the above expression and deduce that a ¼ 288 and b ¼ 96.6 clock cycles. If we use these values to plot the function (T/N) 0 for all N, we obtain the dotted curve shown in Fig. 11 . Alternatively, we can make use of the fact to be deduced later that for low N the cache latency, i.e., the gap between the data points and the dotted curve, is proportional to N. The third, fourth and fifth points up to N ' 2 16 would then give us an independent estimate of a and b which turn out to be very close to the above values.
For the rest of the analysis, some knowledge of the details of their RADIXSORT algorithm is unavoidable, and these are summarised as follows. In each of four passes, data move from an initial array IARRAY accessed sequentially to a final array KARRAY accessed at random, whose roles then reverse at the next pass. This data movement is mediated by two frequency arrays J 1 ARRAY and J 2 ARRAY, both accessed at random, where the first gives the position to which the data are moved from IARRAY to JARRAY, while the second precomputes the position in the opposite direction ready for the next pass. Both IARRAY and KARRAY consist of N 8-byte integers in 32-byte blocks, so l I ¼ m K ¼ 4 entries per block, while the J i ARRAY each consist of 2 16 4-byte integers. So while both IARRAY and KARRAY exceed the size of the L2 cache around N ¼ 2
17
, matching the kink shown in Fig. 11 , the J i ARRAY are of fixed size, each 1/8 hence collectively 1/4 of the L2 cache. Finally, there is an initial pass to precompute J 1 ARRAY before the first of the four passes, but we shall ignore this as it is largely balanced out by the fact that precomputing J 2 ARRAY in the last of the four passes is omitted, and we shall also ignore the prefix sum passes through the J i ARRAY since they are purely sequential.
When N is large, each block in the J i ARRAY is accessed more frequently than the blocks in either IARRAY or KARRAY that effectively these data arrays are excluded from the quarter of the L2 cache occupied by the J i ARRAY. This means that the KARRAY random access latency is of the form of Equation (13) except that M 2 is replaced by (3/4)M 2 . Nevertheless, because the L2 cache is directly mapped, at each iteration in N a KARRAY block will be loaded to the part of the cache occupied by the J i ARRAY with probability 1/4 where, because the J i ARRAY are accessed so much more frequently, it will almost certainly cause a J i ARRAY miss at some point soon afterwards. IARRAY causes J i ARRAY misses in exactly the same way except only every fourth iteration since it is accessed sequentially in blocks of four elements, hence the probability of causing a J i ARRAY miss is (1/4) 2 ¼ 1/16. (Note that LaMarca and Ladner include the J i ARRAY misses due to IARRAY [15, Equation (7)] but neglect those due to KARRAY which explains the gap between their theory and data curves [15, Fig. 8 ] when the radix is large.) Writing 1/4 þ 1/16 ¼ 5/16, M K ¼ 8N and a K for the KARRAY cache fraction inside (3/4)M 2 , we obtain 
ANALYSIS OF LINEAR TIME SORTING ALGORITHMS Page 13 of 19
For the cache fraction a K note that Equation (11) was based on the assumption that the KARRAY access is random, but here it is limited to the 2 16 entry points determined by the J i ARRAY. Although these entry points are themselves randomly distributed, there are relatively few of them, in fact one per block in the cache since there are 2 MB/32 bytes ¼ 2 16 blocks in total, and they move relatively slowly during the course of the algorithm. We can therefore apply the Poisson distribution with mean x ¼ 1 and suppose that IARRAY occupies all of the e 2x of the blocks without an entry point, but still only 1/l I ¼ 1/4 of the blocks with an entry point hence r ' 1 Â
) ' 0.53 and applying Equation (11) we obtain a K ¼ (1 2 r) Â (1 2 1/m K ) ' 0.36 and this value of a K is used for the solid curve shown in Fig. 11 for N ! 2 18 . To cover the whole of the latency arch shown, however, we must be able to handle smaller values of N systematically, but setting x ¼ max (1, 2 18 /N) we can fill in the rest of the curve down to the vertical stroke marked at N ¼ 10
, it is clear from the very large drop in the cache latency, i.e., the rapid narrowing of the gap between the data points and the dotted baseline curve, that the large constant 5/16 in Equation (18) is no longer present. From this, we deduce that IARRAY and KARRAY are no longer in conflict with the J i ARRAY, but they must still be in conflict with each other since the cache latency is non-zero. This would be the case if IARRAY and KARRAY were confined to one-half of the cache, while the J i ARRAY to the other half (cf. the a ¼ 1/2 behaviour seen earlier) in which case Equation (14) would give (18) and (19) simultaneously when D 32 ¼ 300 clock cycles. Given that this is over four passes, and each clock cycle is about 3.57 ns, we deduce that the single-element random access cache latency is about 300 Â 3.57/4 ' 268 ns.
Comparison with Rahman and Raman
Our results also confirm those obtained in another study [18] which compared two linear time algorithms, FLASHSORT1 and MSBRADIX, with the log -linear algorithm MTQUICK, on a Sun UltraSparc-II [45] with direct mapped L1 and L2 caches the same size as on our PC. Although these authors followed the same approach as [15] , in that they compared their theory only with the results from a simulated L2 cache, the CPU times quoted in [18] are much more complete and accurate than those in [15] . In particular, the data in [18, Fig. 3 ], plotted on a linear scale in [18, Fig. 4 ], become much clearer when plotted on a logarithmic scale and are shown in our Fig. 12 . In this section, we adapt Equation (10) to model their FLASHSORT1 and MSBRADIX data. In Section 4.3, we shall do the same for their MTQUICK.
FLASHSORT1 [18] is an in-place bucket sort combining IARRAY and KARRAY into a single array called DATA. It performs the cycle-leader-finding step differently from Neubert's original Flashsort1 [7] , having different copies of the frequency array, called COUNT and START in [18] , but if the cycles are long, START is accessed so infrequently that it effectively drops out of the latency analysis. Otherwise, FLASHSORT1 makes the same number of random memory calls as our BUCKETSORT (lines 100, 400 and 500 in Fig. 6 ). All integers are four bytes in 64-byte blocks, so l D ¼ m D ¼ 16 (DATA accessed both sequentially and randomly) while m C ¼ 16/R ¼ 160 since [18] follows Neubert [7] in choosing R ¼ 0.1 to save space.
During the DO 600 loop, both DATA and COUNT are accessed at random, and if both are larger than the cache in question we would expect them to share the cache blocks equally and a C ¼ (1/2)(1 2 1/m C ) ' 0.50. During the DO 200 loop, we would expect the cache fraction of COUNT to be much larger, since DATA is accessed sequentially, but with only two arrays the data should initially follow an a ¼ 1/2 curve. So the two pairs of terms of the form
þ arising from lines 100 and 400, respectively, mostly share the same cache fraction a C ¼ 0.50 hence are mathematically indistinguishable. So a version of Equation (16) should apply to the FLASHSORT1 data shown in Fig. 12 , which is why we see four main kinks rather than six. When DATA is accessed at random in the DO 600 loop, the COUNT array, being 10 times smaller and accessing its blocks 10 times more frequently, will effectively exclude DATA, so they will share the cache in the ratio 1:10 hence Fig. 7 , it would seem that choosing R ¼ 0.1 instead of the optimal R ¼ 0.6 is the cause of the relatively small gap between FLASHSORT1 and MTQUICK in Fig. 12 , compared to the larger gap between BUCKETSORT and QUICKSORT in Fig. 8 .
FLASHSORT1 is slower than MTQUICK above N ¼ 10
is so large. This is tackled in [18] with the two-pass MPFLASH: the first pass distributes DATA into sufficiently many bins B 1 so that each bin fits inside roughly half the L2 cache, so that the second pass, applying FLASHSORT1 to each bin, incurs no random access L2 cache latency; the first pass also incurs no random access L2 cache latency provided B 1 is small, since each bin is filled from one end. So, defining T P to be the CPU time per sorted element for a single distribution pass, the net effect of the two passes is to trade [18, p. 15 ] the best time to switch to two passes is the D D 32 kink at N ¼ 10 5.8 matching the N ¼ 621 530 used in [18] . So, the (solid)
MPFLASH theoretical curve shown in Fig. 12 MSBRADIX [18] , an in-place most significant digit radix sort, makes two passes using the first two base 2 16 'digits', insertion sorting any remaining out of order pairs. As described in detail in [17, p. 389] , when applying an integer algorithm to floating point data the first 8 bits, which form the exponent, are essentially ineffective, most of the data being distributed among only about 2 8 bins. This exceeds T TLB , so the first pass DATA array attracts cache latency, while on the second pass all the arrays are essentially 2 8 times smaller. We should therefore combine Equations (10) and (20) , except the term in D 32 C which falls outside the range of the data, and so can be omitted.
8 and M C ¼ 4RN/2 8 bytes, where R ¼ 0.1, but using the same cache fractions a C and a D as for FLASHSORT1, we obtain
The resulting theoretical MSBRADIX curve shown in Fig. 12 
Measuring sequential access latency
Omitting the details, the line counting approach of Section 2.4 applied to the standard implementation of quicksort used in ANALYSIS OF LINEAR TIME SORTING ALGORITHMS Page 15 of 19 this paper [34] yields a CPU time formula of the form
where c is the Hoare cutoff [36] (we found c ¼ 9 to be optimal, confirming the predictions made by Knuth [1, p. 122] and Sedgewick [38] ) and is absorbed into l ¼ l 2 k log 2 c, which can therefore be positive or negative [1, p. 122] . Since quicksort partitions the array in approximately log 2 (N/c) passes, we would expect any sequential latency to show up as a variation in the 'constant' k. This is best seen by plotting not the CPU time T but rather Figure 13 shows just such a plot for our QUICKSORT where the value of l ¼ 210 ns was adjusted so that, below N ¼ 10 3.2 , k N is indeed a constant. For MTQUICK the choice of l is more difficult since all of the algorithms studied in [18] exhibited an anomalous increase in CPU time for very small decreasing N. Anticipating that there should be an L2 plateaux around N ¼ 10 5.0 , however, we found that l ¼ 210 ns, resulting in the plot in Fig. 14 .
The sequential access cache latency incurred by any implementation of quicksort depends on the order in which the partitioned subarrays are processed. As [34] follows Hoare [36] in placing these subarrays on a pushdown stack, all subarrays smaller than some fraction a Ã of the size of the cache in question M Ã would already have been loaded by a preceding iteration and hence incur zero latency. We therefore
þ where M 0 ¼ 4N bytes, and this turns out to be true for the RAM. Judging from the shape of the curves in Figs. 13 and 14 , however, it is clear that the L2 cache plateaux follows the simple (probability) model of cache latency from Section 3.2. This implies that the order in which the subarrays are processed must, for some reason, become uncorrelated with the contents of the cache, so we should write instead
þ . Putting these two things together, we therefore arrive at the following model:
where l ¼ 2 for QUICKSORT, since [34] follows Sedgewick [38] in insertion sorting in a final pass for a total of log 2 (2N/c) ¼ log 2 (N/c) þ 1 passes, while l ¼ 1 for MTQUICK, since it is memory tuned and so insertion sorts each subarray as soon as it is produced [18] . Best-fit values for Fig. 13 theoretical curves in Fig. 13 
External storage and optimal sorting
The HDD spikes in Fig. 8 were due to the large speed difference between the hard drive and 64 Mbyte main memory. The data from [18] plotted in Fig. 12 do not exhibit any spikes because the Sun UltraSparc-II used [18] had 512 Mbytes of RAM and the largest arrays studies by these authors was consistently only 256 Mbytes. Figure 11 is similar except the size of RAM in the machine used in [15] is not specified. Any linear time algorithm must switch to a multipass version at some point, and the simplest option is to do as many passes of quicksort as required to split the original array into RAMsized portions. Given that the maximum size of main memory on a contemporary PC is 4 GB, the RAM plateaux shown in Figs. 8, 11 and 12 would extend much further to the right, giving multipass versions of these algorithms an even greater performance lead over purely log -linear algorithms.
Even with regard to the much smaller L2 cache latency, a multipass version can help, and [18] considers in detail the optimal number of distribution passes k opt for MPFLASH as a function of N. From the analysis in Section 4.2, however, it is clear that the switch from one to two passes should take place at N 0 where 4RN 0 ' (1/2)M 2 , since that is where the large increase in latency due to D 32 C occurs. For R ¼ 0.1, this is N 0 ' 2 19.3 and beyond that point the number of bins in each pass cannot exceed T TLB ¼ 2 6 . We deduce that k opt ¼ dlog 2 (N/2 19. 3 )/6e þ 1 which implies that MPFLASH has loglinear rather than linear time complexity. Furthermore, since one-sixth of the time to perform a single distribution pass, T P /6 ' 43.3 ns, is larger than the quicksort constant k ¼ 34.2 ns computed in Section 4.3 for the machine in [18] , it is probably faster to do the multi-passing using six quicksort partitioning passes (or samplesort [37] for non-uniform data) in place of each bucket sort distribution pass. This quicksort approach is a convenient option when formulating multipass versions of linear time algorithms such as linear probing sort which do not use buckets.
Nevertheless, a two-pass approach does not always achieve a performance benefit as this depends on the ratio of the speed of the processor to the speed of main memory. Bucket sort makes three random accesses per sorted element, two to the COUNT array and one to the DATA array, which explained why D 32 C ' 2D 32 D , while in Section 4.2 we saw that on the machine used in [18] D 32 D ' T P very accurately. So a two-pass version, which trades D 32 D þ D 32 C for T P , gives an approximate three times reduction in random access L2 cache latency. Linear probing sort, however, makes only one random access per sorted element, so the trade is T P for D 32 D which makes essentially no difference. The ratio of the memory access time to the CPU clock cycle of the machine in [18] s ¼ 2 Â 300 MHz Â 300 ns ' 180 [45] , therefore represents a critical value: on machines with s 180 linear probing sort will not benefit from a two-pass approach, while for s 180/3 bucketsort will not. On the machine used in our study with s ¼ 200 MHz Â 300 ns ' 60, neither would benefit. As explained in Section 1.3 a contemporary PC would have a ratio about 50% bigger, i.e., s ' 90, so linear probing sort would therefore not benefit from a two-pass approach, and bucket sort only marginally so. In this sense, the Pentium PC used in our study is closer to a contemporary PC than the machine used in [18] .
CONCLUSION
The results described in this paper underline the importance of CPU time as a measure of the success or failure of any theoretical analysis. Far from erasing differences between different factors, it forced us to look more closely at the differences, resulting in an improved theory and an excellent fit to data. For linear probing sort, this was the difference between displacement and interchanges, and for cache latency this was the difference between random access and sequential access. There are still places, however, where either the model fits the data but the parameters are somewhat anomalous (such as some of the cache latencies) or where the model itself is somewhat unexpected (such as the probability model for sequential access in quicksort) and where our understanding of what actually happens on the level of hardware needs to be improved.
With regard to cache latency, although for technical reasons our measurements were conducted on rather old hardware, the key assumptions underlying our model should be even more accurately true on a contemporary machine. First, we assumed, then later verified, that sequential access cache latency is negligible compared to random access, and the much larger block sizes of modern computers, combined with hardware prefetch, should increase this disparity even further. Second, our assumption that both conflict and capacity misses can be modelled collectively as capacity misses but with respect to a cache fraction, which we found to be more accurate for a 4-way associative cache than for a direct mapped cache, should apply even more accurately on a contemporary 8-way or 16-way associative cache.
With regard to the relative performance of the different algorithms, on the other hand, some general conclusions can be drawn. Linear time algorithms as a whole outperform ANALYSIS OF LINEAR TIME SORTING ALGORITHMS Page 17 of 19 log-linear algorithms such as quicksort, and although this is dependent on the ratio of the speed of the CPU to main memory, these have kept pace with each other in recent years and can be expected to continue to do so. Among linear time algorithms, those such as linear probing sort, which makes only one random access per array element, outperform those such as bucket sort, which makes two, while radix sort falls somewhere in between. Finally, because main memory is finite, an in-place version of the algorithm is useful, and some sort of multi-pass approach is mandated. So, assuming that the distribution of the elements to be sorted is known, or can be approximated, the 'best' sorting algorithm would seem to be multi-pass in-place linear probing sort. This paper also underlines the enduring influence of Knuth's book [1] : the line counting derivation of the CPU time formulae in Section 2.4 is modelled on Knuth's MIX code line counting approach; the clue to the traffic light model of average displacement in Section 2.3 was Knuth's parking problem interpretation of linear probing; Knuth himself published a paper on linear probing not too long ago [21] which answered some questions which he had left unanswered in his book; from his 'personal remarks' at the end of that paper, it is clear that linear probing is a personal favourite; and finally, even the desktop publishing software T E X in which this paper was typed was created by Knuth while writing his book.
