There have been many proposals for sorting integers on multicores/GPUs that include radixsort and its variants or other approaches that exploit specialized hardware features of a particular multicore architecture. Comparison-based algorithms have also been used. Network-based algorithms have also been used with primary example Batcher's bitonic sorting algorithm. Although such a latter approach is theoretically "inefficient", if there are few keys to sort, it can lead to better running times as it has low overhead and is simple to implement.
Overview
There have been many proposals for sorting integers on multicore machines including GPUs. These include traditional distribution-specific algorithms such as radix-sort [3, 10, 22, 24] , or variants and derivative algorithms of it that use fewer rounds of its baseline count-sort implementation whenever more information about the range of key values is available [6, 34] . Other proposals include algorithms that use specialized hardware or software features of a particular multicore architecture [4, 6, 19, 22] . Comparison-based algorithms have also been used with some obvious tweaks: use of regular-sampling based sorting [30] that utilizes sequential (serial) radix-sort for local sorting [7, 8, 9] or not [33, 3, 5, 6, 19] . Network-based algorithms have also been exploited with a primary example being Batcher's [1] bitonic sorting algorithm [20, 3, 26, 27, 5] . Although such a latter approach can be considered theoretically "inefficient", if there are few keys to sort, it can lead to better running times as it has low overhead and is simple to implement.
In this work we perform an experimental study of integer sorting on multicore processors using not only multithreading but also multiprocessing parallel programming approaches. Our implementations need only recompilation of the source in most cases to work under Open MPI [25] , MulticoreBSP [32] , and a non-multithreading, multi-processing and out of maintenace library, BSPlib [17] . We have implemented plain-vanilla radix-sort (serial and parallel) for various radixes and also previously little explored or unexplored variants of bitonic-sort and odd-even transposition sort methods.
We offer our observations on a performance evaluation of such algorithm implementations on multiple platforms and architectures and multiple programming libraries. If we can conclude anything is that modeling their performance by taking into consideration architecture dependent features such as the structure and characteristics of multiple memory hierarchies is difficult and more often than not unusable or unreliable. However we can still draw some very simple conclusions using traditional architecture independent parallel modeling under L. G. Valiant's BSP model [31] or the augmented MBSP model [15] that has recently been proposed by this author.
For example for very small problem sizes (say n, the number of integer keys is lower than about 10,000) a variant of bitonic sort as proposed for small size (sample) sorting in [11, 13, 12] and that we shall call BTN, indeed outperforms serial or parallel radix-sort with their more time-consuming setup and overhead as long as the number of cores or threads used is relatively low. This has been observed indepependently by others eg [3] . What has been quite even more amazing is that certain variant of odd-even transposition sort that we shall call OET (in other words, unoptimized bubble-sort), might be slightly better, if p the number of threads or cores is also kept small. This did not use to be the case when hardware configurations involved p processors of a cluster, an SMP machine, or a supercomputer.
Moreover, we have observed that assigning multiple threads per core is not recommended for
CPUs with large number of cores. This is indeed the case where access to main memory (RAM) is required either because of program or data complexity, and non-locality. And if it is used for CPUs with moderate number of cores, it should never exceed the hardware supported bound (usually two), or be even lower than that. If the number of cores is kept small, multiple threads per core can be used as long as problem size is kept small. For parallel radix-sort of 32-bit integers radix-2 8 radix-sorting i.e. four rounds of baseline count-sort is faster than the alternative radix-2 16 sorting that uses two rounds.
However, depending on the architecture and its structure of its caches (level-1, level-2 and level-3) it is possible that radix-2 16 to outperform radix-2 8 . Overall efficiency is dependent on the number of cores if the degree of parallelism is large. For degree of parallelism less than four or eight, efficiency can be expressed either in terms of number of cores or threads. Naturally the difference (i.e. ratio) of the two nevers exceeds a factor of two (threads over cores) as beyond a hardwareregulated two ineffiencies and significant drop in performance is observed. In this latter case number of cores rather than number of threads determines or best describes speedup or efficiency.
Related work
The experimental work of [3] offers a collection of parallel algorithms that have been used unmodified or not as a basis for integer multicore or parallel sorting. One such algorithm has been radix-sort, another one has been sample-sort [29, 18, 28] i.e. randomized oversampling-based sorting along the lines of [28] : in order to sort n keys with p processors, one uses a sample of ps − 1 uniformly 3 at random selected keys where s is a random oversampling factor. After sorting the sample of size ps − 1, one then identifies p − 1 splitters as equidistant keys in the sorted sample. Those p − 1 keys split the input into p sequences of approximately the same size than can the be sorted independently; the analysis of [28] shows how one can choose s so that each one of the p sequences is of size O(n/p) with high probability. A third algorithm used in [3] is bitonic sorting. For sorting n keys (integer or otherwise) the bitonic sort of [3] employs a Θ(lg 2 (n)) stage bitonic sort. If the input is properly partitioned n/p or so of the keys are located inside a single processor. Thus bitonic merging of those keys can be done entirely within the corresponding processor as observed in [3] ; they alternatively proposed using linear-time serial merging over the slower bitonic merging.
It is worth noting that the bitonic sort of [3] for small problem sizes outperformed the other sorting methods as implemented on a Connection Machine CM-2.
This approach of [3] to bitonic sorting has been followed since then. More recently [27] is using an implementation drawn from [20] for bitonic sorting on GPUs and CPUs. Such an implementation resembles the one above where for n/p keys local in-core operations are involved. Alas, overall GPU performance is rather unimpressive: a 10x speedup over the CPU implementation on an NVIDIA GT520, or 17x on a Tesla K40C for 5M keys, and speedup in the range of 2-3x for fewer keys.
However neither [3] nor the other implementations of bitonic sort cited [27, 20] or to be cited later in this section considered that bitonic sorting involving n keys on p processors/ cores/ threads can utilize a bitonic network of Θ(lg 2 (p)) stages. If p is substantially smaller than n, then the savings are obvious compared to a Θ(lg 2 (n)) stage bitonic sorter utilized in [3, 27, 20] and other works.
Indeed this author [11, 13] highlighted the possibility and used Θ(lg 2 (p)) stage bitonic sorting for sample sorting involving more than p keys in the context of bulk-synchronous parallel [31] sorting. [11, 13] cite the work of [2, 21] for first observing that a p-processor bitonic sorter can sort n keys in lg (p)(lg (p) + 1)/2 stages (or rounds). The input would consist of p sequences of length n/p and "comparison of two keys" gets replaced by a (serial) merging operation that separates the n/p smallest from the n/p largest keys. At start-up before the bitonic sorting commences each one of the p sequences needs to be sorted. And if the input is a sequence of integer keys such sorting can utilize a linear time radix-sort rather than logarithmic time bitonic-based sorting or merge-sort.
4
Such a bitonic sorting method has been implemented in this work. We shall call this BTN in the remainder.
Odd-even transposition sort [23] is an unrefined version of bubble-sort that has been used for sorting in array structured parallel architectures (one-dimensional arrays, two-dimensional meshes, etc). In such a sort n keys can be sorted by n processors in n rounds using an oblivious sorting algorithm. In an odd-indexed round a key at an odd-indexed position compares itself to the evenindexed key to its immediate right (index one more) with a swap if the former key is greater.
Like-wise for an even-indexed round. A very simple observation we make is that if the number of processors/cores is p, then a p-round odd-even transposition sort can sort n keys by dealing with n/p-element sequences, one such sequence per processor. Using the remark of [11, 13] referring to the work of [2, 21] the p sequences must be sorted before odd-even transposition sort commences its execution on the n key input. We shall call this OET for future references.
In [7] an algorithm is presented and implemented for sorting integers that utilizes the deterministic regular sampling algorithm of [30] . Deterministic regular sampling [30] works as follows. Split regularly and evenly n input keys into p sequences of equal size n/p and then sort the p sequences independently. Pick then from each sequence p − 1 sample (and equidistant) keys for a total sample of size p(p − 1). A serial sorting of the p(p − 1) sample keys is then performed by one of the p processors/cores. Subsequentaly that processor selects p − 1 equidistant splitters from the sorted sample, broadcasts them to the remaining processors, and all processors can the split their input keys around the p − 1 splitters. A routing operation and subsequent p-way merging of the sorted sequences completes the process. In [7] , for integer sorting, the first step of independently sorting the p sequences independently as well as the last step of p-way merging are replaced and realized with a simpler linear-time radix-sort. One can prove [30] that if n distinct keys are split around the p − 1 splitters as explained, then none of the p resulting sequences will be of size more than 2n/p. For this method to be optimal one needs to maintain n/p > p 2 . [7] also discuss the case of n/p < p 2 .
As a side note, [11, 13] extend the notion of deterministic regular sampling of [30] to determininistic regular oversampling. The random oversampling factor s whose behavior was analyzed by [28] and finetuned in the context of bulk-synchronous parallel sorting by [14] can be transformed into a deterministic regular oversampling factor: a sample of p(p − 1)s keys can then reduce a 2n/p imbalance into an (1 + ǫ)n/p one with ǫ depending on s. See [11, 13] for details.
In [8, 9] a variation of the sample sort of [30] as used in [7] is being utilized for GPU sorting of arbitrary (not necessarily integer) keys. Similarly but differently from [11, 13] a sample of size ps is being used rather than the p(p − 1)s of [11, 13] . The GPU architecture's block thread size determines p and other GPU constraints dictate s. Thus effectively the implied oversampling factor of [11, 13] becomes s/(p − 1) in [8] .
The work in [8] was published around the time of publication of [26] . The latter work uses bitonic sorting for GPU sorting. Both sets of authors' conclusions [26, 8] agree than bitonic-sorting works better for small values of n and either sample sort [8] is better for the larger n that can be afforded by a memory bound and memory bandwidth bound GPU or radix-sort [26] . Thus their overall conclusions are in line with those of [3] .
The work of [4] involves sorting using AVX-512 instructions on Intel's Knights Landing. The driver algorithm is quicksort. For small problem sizes the author explores the use of odd-even transposition sort or bitonic sort, and picks the latter over the former for vectorization. This is reminiscent of the work of [5] that use a quick-sort on the GPU called GPU-Quicksort.
The thread processors perform a single task: determining whether a key is smaller or not than a splitter. Then a rearranging of the keys is issued following something akin to a scan/parallel prefix [23] operation. Its performance is compared to a radix-sort implementation and shown to be slightly better but mostly worse than a Hybridsort approach that uses bucketsort and merge-sort (whose performance depends on the distribution of the input keys).
In [6] an integer sorting algorithm is presented that splits the input based on the size of a (shared) L2 cache and private L1 caches and utilizing SIMD instructions to optimize performance.
It reads like a BucketSort algorithm. It also assumes that key values are no more than a given bound m, an assumption that can shave off rounds from an otherwise value oblivious radix-sort.
A Figure 6 in [6] show that their algorithm exhibits a speedup of approximately 3 over serial radix sort (running time of 3sec for 32,000,000 integers, over approximately 9.5 sec for serial radix-sort).
What the input distribution of the input keys of the experiment was, was not clear.
The implementation of AA-sort is undertaken in [19] . AA-sort can be thought of as a bubble-sort 6 like enhancement of the odd-even transposition sort we discussed earlier and called OET. initially each n/p sequence is sorted; in the case of AA-sort, a merge-sort is used. The use of a bubble-sort oriented approach is to exploit vectorization instructions of the specific target platforms:
PowerPC 970MP and Cell Broadband Engine(BE). In the implementations [19] present also results for a bitonic sort based implementation. AA-sort seems to be slightly faster than bitonic sort with SIMD enhancements for 16K random integers. In the Cell BE, AA-sort outperforms bitonic sort for 32M integers (12.2 speedup for the former over 7.1 for the latter on 16 CELL BE cores).
The AQsort algorithm of [22] utilizes quicksort, a comparison-based sorting algorithm, and
OpenMP is utilized to provide a parallel/multithread version of quicksort. Though its discussion in an otherwise integer sorting oriented works as this one might not make sense, there are some interesting remarks made in [22] that are applicable to this work. It is observed that hyperthreading provides no benefit and that for Intel and AMD CPUs best performance is obtained for assigning one thread per core. For Intel Phi and IBM BG/Q two threads per core provide marginally lower running times even if four threads are hardware supported.
In [24] radix-sort is discussed in the context of reducing the number of rounds of count-sort inside radix-sort by inspecting key values' most significant bits. A parallelized radix-sort along those lines achieves efficiencies of approximately 15-30% (speed up of 5-10 on 32 cores). Other conclusions are in line with [22] in that memory channels can't keep up with the work assigned from many parallel threads.
In [33] a parallel merge-sort is analyzed and implemented on multicores. The parallelization of merge-sort is not optimal. An (n/p) lg (n/p) local and independent sorting on p threads/cores is followed by a merging that takes time n + n/2 + . . . + n/p ≈ 2n utilizing respectively 1, 2, . . . , p/2 cores. Thus the overall speedup of this straightforward approach is bounded by n lg (n)/(n lg (n)/p+ 2n). For n = 2000000 and p = 8 a bound on speedup is about 4 and for efficiency around 50%.
Superlinear speedups obtained for p = 5, 6 [33] and small problem sizes are probably due to caching effects. This work is similar to that of [34] . The latter deals with multisets (n keys taking only k distinct values). Even on four processing cores speedups are limited to less than 50% efficiency.
Implementations
In Using the MBSP cost modeling generic cache performance will be abstracted by the pair (L, G).
Parameter m would be set to p and M will be ignored; we will assume that M is large enough to accommodate the radix-related information of radix-sort. Intercore communication will be abstracted by (p, l, g). Since such communication is done through main memory g would be the cost of accessing non-cache memory (aka RAM). We shall in the remainder ignore l and L as we will be modeling our algorithms at a higher level. This is possible because in integer-sorting the operations performed are primitive and interaction with memory is the dominant operation. Thus the cost model of an algorithm would abstract only cost of access to the fast memory (G) and cost of access to the slow memory (g). Then we will use the easy to abstract g = 5G to further simplify our derivations. This is based on the rather primitive thinking that 20ns and 100ns reflect access times to a cache (L2 or higher) and main memory respectively thus defining a ratio of five between them.
Serial radix-r radix-sort A sequential radix-sort (called SR4) was implemented and used for local independent sorting in the odd-even transposition sort and bitonic sort implementations. The radix used was r = 256
i.e. it is a four-round count-sort. For such an implementation sorting N keys requires four rounds of a count-sort. In each round of count-sort the input is read twice, first during the initial count process and last when the output is to be generated, and the output is finally written. Thus the cost of such memory accesses is 3N g, with g referring to the cost of accessing the main memory.
Moreoever allocation and initialization of the count array incurs a cost of 2rG, with G being the cost of accessing the fast cache memory. We shall ignore this cost that is dominated by other terms. During the count operation the count array is accessed N times and so is during the output operation for a total cost of 2N G. Thus the overall cost of a round is 3N g + 2N G. For all four rounds of 32-bit sorting the total cost is given by the following.
If g = 5G and r = 256 then
Parallel radix-r radix-sort
We shall denote with PR2 and PR4 radix r = 2 16 and r = 2 8 parallel radix-sort algorithms.
Ignoring some details that are implementation dependent such as the contribution of counters used in the serial part and their copies involved in the parallel part, we recognize a cost 2rpg due to a scatter and gather operations involved in the parallel part the algorithm. If n keys are to be sorted, each processor or core is assigned roughly N = n/p keys. A 2N G is assigned for the same reasons that was assigned in the serial version. A 3N g of the serial version will become 4N g to account a communication required before the output array is formed in a given round of count-sort.
If g = 5G and r = 256 2 then
Odd-even transposition sort
We analyze the algorithm previously referred to as OET. If n keys are to be sorted, each processor or core is assigned roughly N = n/p keys. First the N keys per processor or core are sorted using a radix r = 256 radix-sort independently and in parallel of each other that requires time T s (n/p, G).
Then a p round odd-even transposition sort takes place utilizing n/p sorted sequences as explained earlier for OET. One round of it requires roughly 4N g for communication and merging (two input and one output arrays). Thus the overall cost of all p phases of OET will be as follows.
Bitonic Sort
We analyze the algorithm previously referred to as BTN. If n keys are to be sorted, each processor or core is assigned roughly N = n/p keys. First the N keys per processor or core are sorted using a radix r = 2 8 radix-sort independently and in parallel of each other that requires time T s (n/p, G).
Then lg (p)(lg (p) + 1)/2 stages of a p-processor bitonic-sort are realized as explained in Section 2.
One round of it requires roughly 4N g for communication and merging/comparing (two input and one output arrays). Thus the overall cost of all stages of bitonic sort will be as follows.
Experiments
All algorithms have been implemented in ANSI C. The code has been programmed in such a way that can be recompiled but does not need to be rewritten and works with three parallel, multiprocessing or multithreaded programming libraries: OpenMPI [25] , MulticoreBSP [32] , and BSPlib [17] . The latter library was only used on the Intel platform. The resulting source code that has been used in these experiments is publically and currently available through the author's web-page [16] .
A 8-processor quad-core AMD Opteron 8384 Scientific Linux 7 workstation with 128GiB of memory has been used for the experiments. We refer to it as the AMD platform in the remainder.
The version of OpenMPI available and used was 1.8.4. A quad-core Intel Xeon E3-1240 3.3Ghz
Scientific Linux 7 workstation with 16GiB of memory has also been used for the experiments. We refer to it as the Intel platform in the remainder. The version of OpenMPI available and used was 1.8.1. We also run some experiments with version 2.1.1 which was marginally faster for some experiments. However it caused a problem with some of our runs that we did not have time to fix so we report results based on 1.8. We have also run some experiments for smaller problem sizes to determine the cut-off point where bitonic-sort or odd-even transposition sort are superior to radix-sort methods. The input for all algorithms is the same set of random uniformly drawn integers.
For the serial algorithm SR4 we report in both Table 1 and Table 2 the corresponding serial execution time in seconds. For all other algorithms, PR4, PR2, BTN and OET we report speedup figures. For Table 3 , Table 4 , and Table 5 we report timing results in microseconds. We offer some of the observations that we consider important in the context of this experimentation.
Observation 1: Thread size per core. For the AMD platform one thread per core was a requirement for extracting best performance. This is in accordance to remarks by [22] and [24] .
Observation 2: Hyperthreading. For the Intel platform two threads per core was a requirement for extracting consistently better performance thus deviating from [22] . For smaller or larger problem size one thread or four threads per core improved speedup slightly but to the detriment of (thread) efficiency. Thus the one thread per core recommendation or observation of [22] might not be current any more for more recent architectures. For the Intel platform, experiments with p = 8 on a multicore use exclusively two threads per core. OpenMPI was able to cope with it, MulticoreBSP consistently did not, and so was not (obviously) BSPlib. The last two suffered a performance drop by 30% or so.
Observation 3: Libraries. For the AMD platform, OpenMPI had library latency and performed better in larger problem sizes than MulticoreBSP. To some degree the same can be said for the Intel platform. BSPlib with its multiprocessing only support but low library overhead was extremely competitive and bettered MulticoreBSP almost always. Moreover it was more often than not better than OpenMPI, despite its age and non support.
Observation 4: 4-round vs 2-round radix-sort. On the AMD platform PR4 was superior to the low overhead BTN, OET implementations, something that was not surprising. However in the AMD platform a two-round radix-sort on 32-bit integers was better than a four-round one for both libraries used and for large problem sizes. In the case of MulticoreBSP this was so for 32M and 128M but for the case of OpenMPI only for 128M. On the Intel platform a four-round was always the winner. Note also that our version of bitonic or odd-even transposition sort differs from other approaches in that it has only lg (p)(lg (p) + 1)/2 and p stages for BTN and OET rather than lg (n)(lg (n) + 1)/2 and n respectively. These results are shown in Table 3 , Table 4 , and Table 5 with figures indicating microseconds, i.e.
actual timing results rather than speedup information. Under OpenMPI BTN was better through 128K and only for 512K was OET marginally better. Both of them got beaten by the serial fourround radix-sort through around 32K where both started getting better running times (i.e. speedup greater than 1 or even close to 2 for 512K). Under MulticoreBSP, BTN was marginally better for the 1K-32K range and OET for the 128K-512K range. For the 1K-32K both were marginally better than serial radix-sort SR4 and beyond that their speedup was in the one to two range over SR4. For BSPlib, OET was consistently better than BTN except for p = 8 and sizes of 32K or more when it was marginally (less than 10%) slower. It was for the 128K-512K range that speedup figures were slightly over 2 for p = 4 and 128K and p = 2, 4, 8 for 512K. Both OET and to a lesser degree BTN exhibited a speed up of around 2 starting with 2K problem size This raises the question whether
OET is just good because p is consistently small, or there is some promise to it for GPU architectures and appropriate implementations. But it is safe to say that with increasing processor or thread sizes BTN will prevail.
Observation 7: MBSP modeling SR4 vs PR4, PR2. We may use equation 1 and equation 2 to determine the relative efficiency of a parallel four-round radix-sort. We have then than
The fraction to the limit goes to approximately 68 * p/88 ≈ 0.75p. Thus for p = 4, 8, 16 we should not be anticipating speedups higher than about 3, 6 and 12 respectively for the AMD platform.
Indeed this is the case from Table 1 . Likewise for the Intel platform. In the latter case however, we have a speedup of 3.61 over the "predicted" 3 for p = 4. Thus we restrained in expecting accuracy from such an estimation given the assumptions and simplifications incorporated in the formulas. it is approximately
For the problem sizes of the experiments and p = 2 − 16 we observe that we expect OET to be up to about 40% slower than BTN for p = 2, 4, 8 but up to 70% slower for p = 16. This is confirmed for the AMD platform AMD by looking at the speedup data for the two algorithms. 
Conclusion
We presented an experimental study of integer sorting on multicore processors using multithreading and multiprocessing parallel programming approaches for a code that is transportable and executable using three parallel/multithreading libraries, Open MPI, MulticoreBSP, and BSPlib.
We have implemented plain-vanilla serial and parallel radix-sort for various radixes and also some previously little explored or unexplored variants of bitonic-sort and odd-even transposition sort.
We offered a series of observations obtained through this evaluations organized and grouped in a way that has not been done before, to our knowledge. Some of those observations have been made previously, but some of them might not be valid any more.
Moreover we expressed the performance of our implementations in the context of the MBSP model [15] . We showed how one can use the model to compare the theoretical performance of the implementations. 
