Integer sorting on multicores: some (experiments and) observations by Gerbessiotis, Alexandros V.
arXiv:1708.09495v1 [cs.DC] 30 Aug 2017
Integer sorting on multicores: some (experiments and) observations
Alexandros V. Gerbessiotis∗
May 1, 2018
Abstract
There have been many proposals for sorting integers on multicores/GPUs that include radix-
sort and its variants or other approaches that exploit specialized hardware features of a particular
multicore architecture. Comparison-based algorithms have also been used, as have network-based
algorithms, the primary example being Batcher's bitonic sorting algorithm. Although the
latter approach is theoretically "inefficient", if there are few keys to sort it can lead to
better running times as it has low overhead and is simple to implement.
In this work we perform an experimental study of integer sorting on multicore processors
using not only multithreading but also multiprocessing parallel programming approaches. Our
implementations work under Open MPI, MulticoreBSP, and BSPlib. We have implemented
serial and parallel radix-sort for various radixes and also some previously little explored or
unexplored variants of bitonic-sort and odd-even transposition sort.
We offer our observations on a performance evaluation using the MBSP model of such
algorithm implementations on multiple platforms and architectures and multiple programming
libraries. If we can conclude anything, it is that modeling their performance by taking into
consideration architecture dependent features such as the structure and characteristics of multiple
memory hierarchies is difficult and more often than not unsuccessful or unreliable. However, we
can still draw some very simple conclusions using traditional architecture independent parallel
modeling.
∗CS Department, New Jersey Institute of Technology, Newark, NJ 07102, USA. Email: alexg@cs.njit.edu
1 Overview
There have been many proposals for sorting integers on multicore machines including GPUs. These
include traditional distribution-specific algorithms such as radix-sort [3, 10, 22, 24], or variants
and derivative algorithms of it that use fewer rounds of its baseline count-sort implementation
whenever more information about the range of key values is available [6, 34]. Other proposals
include algorithms that use specialized hardware or software features of a particular multicore
architecture [4, 6, 19, 22]. Comparison-based algorithms have also been used with some obvious
tweaks: use of regular-sampling based sorting [30] that utilizes sequential (serial) radix-sort for
local sorting [7, 8, 9] or not [33, 3, 5, 6, 19]. Network-based algorithms have also been exploited
with a primary example being Batcher's [1] bitonic sorting algorithm [20, 3, 26, 27, 5]. Although
the latter approach can be considered theoretically "inefficient", if there are few keys to sort it
can lead to better running times as it has low overhead and is simple to implement.
In this work we perform an experimental study of integer sorting on multicore processors using
not only multithreading but also multiprocessing parallel programming approaches. In most cases
our implementations need only recompilation of the source to work under Open MPI [25],
MulticoreBSP [32], and BSPlib [17], a multiprocessing-only (non-multithreading) library that is
no longer maintained. We have implemented plain-vanilla radix-sort (serial and parallel) for
various radixes and also previously little explored or unexplored variants of bitonic-sort and
odd-even transposition sort methods.
We offer our observations on a performance evaluation of such algorithm implementations on
multiple platforms and architectures and multiple programming libraries. If we can conclude
anything, it is that modeling their performance by taking into consideration architecture dependent
features such as the structure and characteristics of multiple memory hierarchies is difficult and more
often than not unusable or unreliable. However, we can still draw some very simple conclusions
using traditional architecture independent parallel modeling under L. G. Valiant's BSP model [31]
or the augmented MBSP model [15] that has recently been proposed by this author.
For example, for very small problem sizes (say n, the number of integer keys, is lower than about
10,000) a variant of bitonic sort as proposed for small size (sample) sorting in [11, 13, 12], which
we shall call BTN, indeed outperforms serial or parallel radix-sort with their more time-consuming
setup and overhead as long as the number of cores or threads used is relatively low. This has
been observed independently by others, e.g. [3]. What has been even more surprising is that
a certain variant of odd-even transposition sort that we shall call OET (in other words, unoptimized
bubble-sort) might be slightly better, if p, the number of threads or cores, is also kept small. This
did not use to be the case when hardware configurations involved p processors of a cluster, an SMP
machine, or a supercomputer.
Moreover, we have observed that assigning multiple threads per core is not recommended for
CPUs with a large number of cores. This is indeed the case when access to main memory (RAM) is
required either because of program or data complexity, or non-locality. If it is used for CPUs
with a moderate number of cores, the number of threads per core should never exceed the hardware
supported bound (usually two), and may need to be lower than that. If the number of cores is kept
small, multiple threads per core can be used as long as problem size is kept small. For parallel
radix-sort of 32-bit integers, radix-2⁸ sorting, i.e. four rounds of baseline count-sort, is faster
than the alternative radix-2¹⁶ sorting that uses two rounds.
However, depending on the architecture and the structure of its caches (level-1, level-2 and
level-3) it is possible for radix-2¹⁶ to outperform radix-2⁸. Overall efficiency is dependent on the
number of cores if the degree of parallelism is large. For a degree of parallelism less than four or eight,
efficiency can be expressed either in terms of number of cores or threads. Naturally the ratio
of the two (threads over cores) never exceeds a factor of two, as beyond a hardware-regulated
two, inefficiencies and a significant drop in performance are observed. In this latter case the number
of cores rather than the number of threads determines or best describes speedup or efficiency.
2 Related work
The experimental work of [3] offers a collection of parallel algorithms that have been used unmodified
or not as a basis for integer multicore or parallel sorting. One such algorithm has been radix-sort,
another one has been sample-sort [29, 18, 28] i.e. randomized oversampling-based sorting along
the lines of [28]: in order to sort n keys with p processors, one uses a sample of ps − 1 keys selected
uniformly at random, where s is a random oversampling factor. After sorting the sample of
size ps − 1, one then identifies p − 1 splitters as equidistant keys in the sorted sample. Those
p − 1 keys split the input into p sequences of approximately the same size that can then be sorted
independently; the analysis of [28] shows how one can choose s so that each one of the p sequences
is of size O(n/p) with high probability. A third algorithm used in [3] is bitonic sorting. For sorting
n keys (integer or otherwise) the bitonic sort of [3] employs a Θ(lg² (n)) stage bitonic sort. If the
input is properly partitioned, n/p or so of the keys are located inside a single processor. Thus
bitonic merging of those keys can be done entirely within the corresponding processor as observed
in [3]; they alternatively proposed using linear-time serial merging over the slower bitonic merging.
It is worth noting that the bitonic sort of [3] for small problem sizes outperformed the other
sorting methods as implemented on a Connection Machine CM-2.
This approach of [3] to bitonic sorting has been followed since then. More recently [27] is using
an implementation drawn from [20] for bitonic sorting on GPUs and CPUs. Such an implementation
resembles the one above where for n/p keys local in-core operations are involved. Alas, overall GPU
performance is rather unimpressive: a 10x speedup over the CPU implementation on an NVIDIA
GT520, or 17x on a Tesla K40C for 5M keys, and speedup in the range of 2-3x for fewer keys.
However, neither [3] nor the other implementations of bitonic sort cited [27, 20] or to be cited
later in this section considered that bitonic sorting involving n keys on p processors/cores/threads
can utilize a bitonic network of Θ(lg² (p)) stages. If p is substantially smaller than n, then the
savings are obvious compared to the Θ(lg² (n)) stage bitonic sorter utilized in [3, 27, 20] and other
works.
Indeed this author [11, 13] highlighted the possibility and used Θ(lg² (p)) stage bitonic sorting
for sample sorting involving more than p keys in the context of bulk-synchronous parallel [31]
sorting. [11, 13] cite the work of [2, 21] for first observing that a p-processor bitonic sorter can sort
n keys in lg (p)(lg (p) + 1)/2 stages (or rounds). The input would consist of p sequences of length
n/p and “comparison of two keys” gets replaced by a (serial) merging operation that separates the
n/p smallest from the n/p largest keys. At start-up before the bitonic sorting commences each one
of the p sequences needs to be sorted. And if the input is a sequence of integer keys such sorting
can utilize a linear time radix-sort rather than logarithmic time bitonic-based sorting or merge-sort.
Such a bitonic sorting method has been implemented in this work. We shall call this BTN in the
remainder.
Odd-even transposition sort [23] is an unrefined version of bubble-sort that has been used for
sorting in array structured parallel architectures (one-dimensional arrays, two-dimensional meshes,
etc). In such a sort n keys can be sorted by n processors in n rounds using an oblivious sorting
algorithm. In an odd-indexed round a key at an odd-indexed position compares itself to the even-
indexed key to its immediate right (index one more) with a swap if the former key is greater.
Likewise for an even-indexed round. A very simple observation we make is that if the number of
processors/cores is p, then a p-round odd-even transposition sort can sort n keys by dealing with
n/p-element sequences, one such sequence per processor. Following the remark of [11, 13] referring to
the work of [2, 21], the p sequences must be sorted before odd-even transposition sort commences
its execution on the n key input. We shall call this OET in the remainder.
In [7] an algorithm is presented and implemented for sorting integers that utilizes the deterministic
regular sampling algorithm of [30]. Deterministic regular sampling [30] works as follows. Split
regularly and evenly n input keys into p sequences of equal size n/p and then sort the p sequences
independently. Then pick from each sequence p − 1 equidistant sample keys for a total sample
of size p(p − 1). A serial sorting of the p(p − 1) sample keys is then performed by one of the p
processors/cores. Subsequently that processor selects p − 1 equidistant splitters from the sorted
sample, broadcasts them to the remaining processors, and all processors can then split their input
keys around the p − 1 splitters. A routing operation and subsequent p-way merging of the sorted
sequences completes the process. In [7], for integer sorting, the first step of independently sorting
the p sequences as well as the last step of p-way merging are replaced and realized
with a simpler linear-time radix-sort. One can prove [30] that if n distinct keys are split around
the p − 1 splitters as explained, then none of the p resulting sequences will be of size more than
2n/p. For this method to be optimal one needs to maintain n/p > p². [7] also discuss the case of
n/p < p².
As a side note, [11, 13] extend the notion of deterministic regular sampling of [30] to deterministic
regular oversampling. The random oversampling factor s whose behavior was analyzed
by [28] and fine-tuned in the context of bulk-synchronous parallel sorting by [14] can be transformed
into a deterministic regular oversampling factor: a sample of p(p − 1)s keys can then reduce a 2n/p
imbalance into a (1 + ε)n/p one with ε depending on s. See [11, 13] for details.
In [8, 9] a variation of the sample sort of [30] as used in [7] is utilized for GPU sorting
of arbitrary (not necessarily integer) keys. Similarly to, yet differently from, [11, 13], a sample of size
ps is used rather than the p(p − 1)s of [11, 13]. The GPU architecture's block thread size
determines p and other GPU constraints dictate s. Thus effectively the implied oversampling factor
of [11, 13] becomes s/(p − 1) in [8].
The work in [8] was published around the time of publication of [26]. The latter work uses
bitonic sorting for GPU sorting. The conclusions of both sets of authors [26, 8] agree that bitonic sorting
works better for small values of n, and that either sample sort [8] or radix-sort [26] is better for
the larger n that can be afforded by a memory bound and memory bandwidth bound GPU. Thus their
overall conclusions are in line with those of [3].
The work of [4] involves sorting using AVX-512 instructions on Intel’s Knights Landing. The
driver algorithm is quicksort. For small problem sizes the author explores the use of odd-even
transposition sort or bitonic sort, and picks the latter over the former for vectorization.
This is reminiscent of the work of [5] that use a quick-sort on the GPU called GPU-Quicksort.
The thread processors perform a single task: determining whether a key is smaller or not than
a splitter. Then a rearranging of the keys is issued following something akin to a scan/parallel
prefix [23] operation. Its performance is compared to a radix-sort implementation and shown to be
slightly better but mostly worse than a Hybridsort approach that uses bucketsort and merge-sort
(whose performance depends on the distribution of the input keys).
In [6] an integer sorting algorithm is presented that splits the input based on the size of a
(shared) L2 cache and private L1 caches and utilizing SIMD instructions to optimize performance.
It reads like a BucketSort algorithm. It also assumes that key values are no more than a given
bound m, an assumption that can shave off rounds from an otherwise value oblivious radix-sort.
Figure 6 in [6] shows that their algorithm exhibits a speedup of approximately 3 over serial radix-
sort (running time of 3 sec for 32,000,000 integers, over approximately 9.5 sec for serial radix-sort).
The distribution of the input keys in that experiment was not clear.
The implementation of AA-sort is undertaken in [19]. AA-sort can be thought of as a bubble-sort-like
enhancement of the odd-even transposition sort we discussed earlier and called OET. Whereas in
OET we have odd and even phases, in AA-sort there is no such distinction and either a bubble-sort
step (from left to right) is executed if a gap parameter g has a value of 1, or non-adjacent keys are
bubble-sorted if g is greater than 1. (Thus a[i] and a[i + g] are then compared.) As in OET,
initially each n/p sequence is sorted; in the case of AA-sort, a merge-sort is used. The purpose of the
bubble-sort oriented approach is to exploit vectorization instructions of the specific target platforms:
PowerPC 970MP and Cell Broadband Engine (BE). [19] also present results
for a bitonic sort based implementation. AA-sort seems to be slightly faster than bitonic sort with
SIMD enhancements for 16K random integers. On the Cell BE, AA-sort outperforms bitonic sort
for 32M integers (12.2 speedup for the former over 7.1 for the latter on 16 Cell BE cores).
The AQsort algorithm of [22] utilizes quicksort, a comparison-based sorting algorithm, and
OpenMP is utilized to provide a parallel/multithreaded version of quicksort. Though its discussion
in an otherwise integer-sorting oriented work such as this one might seem out of place, there are some
interesting remarks made in [22] that are applicable to this work. It is observed that hyperthreading
provides no benefit and that for Intel and AMD CPUs best performance is obtained for assigning
one thread per core. For Intel Phi and IBM BG/Q two threads per core provide marginally lower
running times even if four threads are hardware supported.
In [24] radix-sort is discussed in the context of reducing the number of rounds of count-sort
inside radix-sort by inspecting key values’ most significant bits. A parallelized radix-sort along
those lines achieves efficiencies of approximately 15-30% (speed up of 5-10 on 32 cores). Other
conclusions are in line with [22] in that memory channels can’t keep up with the work assigned
from many parallel threads.
In [33] a parallel merge-sort is analyzed and implemented on multicores. The parallelization of
merge-sort is not optimal. An (n/p) lg (n/p) local and independent sorting on p threads/cores is
followed by a merging that takes time n+ n/2 + . . .+ n/p ≈ 2n utilizing respectively 1, 2, . . . , p/2
cores. Thus the overall speedup of this straightforward approach is bounded by n lg (n)/(n lg (n)/p+
2n). For n = 2000000 and p = 8 a bound on speedup is about 4.5 and on efficiency about 57%.
Superlinear speedups obtained for p = 5, 6 [33] and small problem sizes are probably due to caching
effects. This work is similar to that of [34]. The latter deals with multisets (n keys taking only k
distinct values). Even on four processing cores speedups are limited to less than 50% efficiency.
3 Implementations
In this section we introduce the algorithms that we implement and analyze their performance
using the Multi-memory Bulk-Synchronous Parallel (MBSP) model of computation [15]. The
MBSP is parameterized by the septuplet (p, l, g, m, L, G, M) to abstract computation and memory
interactions among multi-cores. In addition to the modeling offered by the BSP model [31] and
abstracted by the triplet (p, l, g), the collection of core/processor components has m alternative
memory units distributed in an arbitrary way, and the size of the “fast memory” is M words of
information. The cost of memory unit-related I/O is modeled by the pair (L,G). L and G are
similar to the BSP parameters l and g respectively. Parameter G expresses the unit transfer time
per word of information and thus reflects the memory-unit throughput cost of writing a word of
information into a memory unit (i.e. number of local computational operations as units of time
per word read/written). Parameter L abstracts the memory-unit access latency time that reflects
access delays contributed mainly but not exclusively by two factors: (a) unit access-related delays
that can not be hidden or amortized by G, and (b) possible communication-related costs involved
in accessing a non-local unit as this could require intraprocessor or interprocessor communication.
Using the MBSP cost modeling generic cache performance will be abstracted by the pair (L,G).
Parameter m would be set to p and M will be ignored; we will assume that M is large enough
to accommodate the radix-related information of radix-sort. Intercore communication will be ab-
stracted by (p, l, g). Since such communication is done through main memory g would be the cost
of accessing non-cache memory (aka RAM). We shall in the remainder ignore l and L as we will be
modeling our algorithms at a higher level. This is possible because in integer-sorting the operations
performed are primitive and interaction with memory is the dominant operation. Thus the cost
model of an algorithm would abstract only cost of access to the fast memory (G) and cost of access
to the slow memory (g). Then we will use the easy to abstract g = 5G to further simplify our
derivations. This is based on the rather primitive thinking that 20ns and 100ns reflect access times
to a cache (L2 or higher) and main memory respectively thus defining a ratio of five between them.
Serial radix-r radix-sort
A sequential radix-sort (called SR4) was implemented and used for local independent sorting
in the odd-even transposition sort and bitonic sort implementations. The radix used was r = 256,
i.e. for 32-bit keys it is a four-round count-sort. In each round of count-sort on N keys the input
is read twice, first during the initial count
process and again when the output is to be generated, and the output is finally written. Thus the
cost of such memory accesses is 3Ng, with g referring to the cost of accessing the main memory.
Moreover, allocation and initialization of the count array incurs a cost of 2rG, with G being the
cost of accessing the fast cache memory. We shall ignore this cost as it is dominated by other
terms. During the count operation the count array is accessed N times, and likewise during the output
operation, for a total cost of 2NG. Thus the overall cost of a round is 3Ng + 2NG. For all 32/lg (r)
rounds of 32-bit sorting the total cost is given by the following.
Ts(N, g,G, r) = (32/ lg (r)) · (3Ng + 2NG)
If g = 5G and r = 256 then
Ts(N,G) = 68NG (1)
Parallel radix-r radix-sort
We shall denote with PR2 and PR4 the radix r = 2¹⁶ and r = 2⁸ parallel radix-sort algorithms respectively.
Ignoring some details that are implementation dependent such as the contribution of counters used
in the serial part and their copies involved in the parallel part, we recognize a cost 2rpg due to the
scatter and gather operations involved in the parallel part of the algorithm. If n keys are to be sorted,
each processor or core is assigned roughly N = n/p keys. A cost of 2NG is assigned for the same reasons
as in the serial version. The 3Ng of the serial version becomes 4Ng to account for the
communication required before the output array is formed in a given round of count-sort.
Tp(N, g,G, p, r) = (32/ lg (r)) · (4Ng + 2NG+ 2prg)
If g = 5G and r = 256 then
Tp(n,G, p) = (88n/p + 40 · 256 · p)G (2)
If g = 5G and r = 256² then
Tp(n,G, p) = (44n/p + 20 · 256² · p)G (3)
Odd-even transposition sort
We analyze the algorithm previously referred to as OET. If n keys are to be sorted, each processor
or core is assigned roughly N = n/p keys. First the N keys per processor or core are sorted using a
radix r = 256 radix-sort independently and in parallel of each other that requires time Ts(n/p,G).
Then a p round odd-even transposition sort takes place utilizing n/p sorted sequences as explained
earlier for OET. One round of it requires roughly 4Ng for communication and merging (two input
and one output arrays). Thus the overall cost of all p phases of OET will be as follows.
To(n, g,G, p) = Ts(n/p,G) + p (4n/p) g
If g = 5G and r = 256 then
To(n,G, p) = (68n/p + 20n)G (4)
Bitonic Sort
We analyze the algorithm previously referred to as BTN. If n keys are to be sorted, each processor
or core is assigned roughly N = n/p keys. First the N keys per processor or core are sorted using
a radix r = 2⁸ radix-sort independently and in parallel of each other that requires time Ts(n/p,G).
Then lg (p)(lg (p) + 1)/2 stages of a p-processor bitonic-sort are realized as explained in Section 2.
One round of it requires roughly 4Ng for communication and merging/comparing (two input and
one output arrays). Thus the overall cost of all stages of bitonic sort will be as follows.
Tb(n, g,G, p) = Ts(n/p,G) + (lg (p) · (lg (p) + 1)/2) · (4n/p) g
If g = 5G and r = 256 then
Tb(n,G, p) = (68n/p + (10n lg (p)(lg (p) + 1)) /p)G (5)
4 Experiments
All algorithms have been implemented in ANSI C. The code has been written in such a
way that it can be recompiled, without needing to be rewritten, to work with three parallel,
multiprocessing or multithreaded programming libraries: OpenMPI [25], MulticoreBSP [32], and
BSPlib [17]. The latter library was only used on the Intel platform. The resulting source code
that has been used in these experiments is currently publicly available through the author's
web-page [16].
An 8-processor quad-core AMD Opteron 8384 Scientific Linux 7 workstation with 128GiB of
memory has been used for the experiments. We refer to it as the AMD platform in the remainder.
The version of OpenMPI available and used was 1.8.4. A quad-core Intel Xeon E3-1240 3.3GHz
Scientific Linux 7 workstation with 16GiB of memory has also been used for the experiments. We
refer to it as the Intel platform in the remainder. The version of OpenMPI available and used
was 1.8.1. We also ran some experiments with version 2.1.1, which was marginally faster for some
experiments. However, it caused a problem with some of our runs that we did not have time to
fix, so we report results based on 1.8.1. Version 1.2.0 of MulticoreBSP was used on both platforms
and version 1.4 of BSPlib was used. The source code is compiled using the native gcc compiler
gcc version 4.8.5 with optimization options -O2 -mtune=native and -march=native and using
otherwise the default compiler and library installation. Indicated timing results (wall-clock time in
seconds) in the tables to follow are the averages of four experiments. We used small problem sizes
of 8 × 10⁶, 32 × 10⁶, and 128 × 10⁶ integers. This is the total problem size, not the per processor size.
We have also run some experiments for smaller problem sizes to determine the cut-off point where
bitonic-sort or odd-even transposition sort are superior to radix-sort methods. The input for all
algorithms is the same set of random uniformly drawn integers.
For the serial algorithm SR4 we report in both Table 1 and Table 2 the corresponding serial
execution time in seconds. For all other algorithms, PR4, PR2, BTN and OET we report speedup
figures. For Table 3, Table 4, and Table 5 we report timing results in microseconds. We offer some
of the observations that we consider important in the context of this experimentation.
Observation 1: Threads per core. For the AMD platform one thread per core was a
requirement for extracting best performance. This is in accordance with remarks by [22] and [24].
Observation 2: Hyperthreading. For the Intel platform two threads per core was a requirement
for extracting consistently better performance, thus deviating from [22]. For smaller or larger
problem sizes one thread or four threads per core improved speedup slightly but to the detriment
of (thread) efficiency. Thus the one thread per core recommendation or observation of [22] might
not be current any more for more recent architectures. For the Intel platform, experiments with
p = 8 on a multicore use exclusively two threads per core. OpenMPI was able to cope with it;
MulticoreBSP consistently was not, and neither (obviously) was BSPlib. The last two suffered a
performance drop of 30% or so.
Observation 3: Libraries. For the AMD platform, OpenMPI had library latency and performed
better in larger problem sizes than MulticoreBSP. To some degree the same can be said for the Intel
platform. BSPlib with its multiprocessing only support but low library overhead was extremely
competitive and bettered MulticoreBSP almost always. Moreover it was more often than not better
than OpenMPI, despite its age and non support.
Observation 4: 4-round vs 2-round radix-sort. On the AMD platform PR4 was superior to
the low overhead BTN and OET implementations, something that was not surprising. However, on the
AMD platform a two-round radix-sort on 32-bit integers was better than a four-round one for both
libraries used and for large problem sizes. In the case of MulticoreBSP this was so for 32M and
128M but in the case of OpenMPI only for 128M. On the Intel platform the four-round one was always
the winner.
Observation 5: Bitonic vs Odd-even transposition sort. On the AMD platform BTN was a
clear winner. On the Intel platform, surprisingly, OET was better more often than not across all
three libraries. Only for p = 8 did it marginally lose to BTN. And this is despite the fact that
lg (p)(lg (p) + 1)/2 was still smaller than p. It would be quite interesting to compare the two on GPU platforms.
Note also that our version of bitonic or odd-even transposition sort differs from other approaches
in that it has only lg (p)(lg (p) + 1)/2 and p stages for BTN and OET rather than lg (n)(lg (n) + 1)/2
and n respectively.
Observation 6: Bitonic vs Odd-even transposition sort threshold. Under the Intel platform
we tested both implementations for smaller problem sizes ranging from 1K (= 10³) to about 512K.
These results are shown in Table 3, Table 4, and Table 5 with figures indicating microseconds, i.e.
actual timing results rather than speedup information. Under OpenMPI BTN was better through
128K and only for 512K was OET marginally better. Both of them were beaten by the serial four-
round radix-sort through around 32K, beyond which both started getting better running times (i.e. speedup
greater than 1 or even close to 2 for 512K). Under MulticoreBSP, BTN was marginally better for
the 1K-32K range and OET for the 128K-512K range. For the 1K-32K range both were marginally better
than serial radix-sort SR4 and beyond that their speedup was in the one to two range over SR4. For
BSPlib, OET was consistently better than BTN except for p = 8 and sizes of 32K or more when it
was marginally (less than 10%) slower. It was for the 128K-512K range that speedup figures were
slightly over 2 for p = 4 and 128K and p = 2, 4, 8 for 512K. Both OET and to a lesser degree BTN
exhibited a speedup of around 2 starting with a 2K problem size. This raises the question whether
OET is good just because p is consistently small, or whether there is some promise to it for GPU
architectures and appropriate implementations. But it is safe to say that with an increasing number
of processors or threads BTN will prevail.
Observation 7: MBSP modeling SR4 vs PR4, PR2. We may use equation 1 and equation 2 to
determine the relative efficiency of a parallel four-round radix-sort. We then have that
Ts(n,G)/Tp(n,G, p) = 68nG/ ((88n/p + 40 · 256 · p)G)
For large n the fraction tends to 68 · p/88 ≈ 0.77p. Thus for p = 4, 8, 16 we should
not be anticipating speedups much higher than about 3, 6 and 12 respectively for the AMD platform.
Indeed this is the case from Table 1. Likewise for the Intel platform. In the latter case however, we
have a speedup of 3.61 over the "predicted" 3 for p = 4. Thus we should be restrained in expecting accuracy
from such an estimation given the assumptions and simplifications incorporated in the formulas.
For example it is possible that g = 5G is not an accurate assumption for the Intel platform. Repeating
this analysis for PR2 using equation 1 and equation 3 gives a limit of 68 · p/44 ≈ 1.55p, which
does not restrict the possible speedup.
Thus we can make the case that the MBSP model can be used to reason usefully about the results
of our experiments.
Observation 8: MBSP modeling OET vs BTN. If we take the ratio of equation 4 and equation 5
it is approximately
To(n,G, p)/Tb(n,G, p) ≈ (68 + 20p)/(68 + 10 lg² (p)).
For the problem sizes of the experiments and p = 2, . . . , 16 we expect OET to be up
to about 40% slower than BTN for p = 2, 4, 8 but up to 70% slower for p = 16. This is confirmed for
the AMD platform by looking at the speedup data for the two algorithms. For p = 4, 8 BTN
is faster by about 10-25% than OET. For p = 16 however BTN is faster by 50-60% with respect to
OET under OpenMPI and 30-65% under MulticoreBSP, thus confirming this empirical finding. The
fact that this is not reproducible on the Intel platform might indicate that the ratio g/G there
is different. Thus it indicates something about the potential of MBSP in modeling behavior: it
might be a useful and usable model, but if one tries to use it with the intent of achieving accuracy the
results will be mixed: it is difficult to model or abstract precisely the interactions of the underlying
architecture, its memory hierarchies and its core interactions.
Observation 9: MBSP modeling BTN vs SR4. The ratio of equation 1 and equation 5 is
approximately
\[ \frac{T_s(n,G)}{T_b(n,G,p)} \approx \frac{68\,p}{68 + 10 \lg p\,(\lg p + 1)}. \]
On the AMD platform this suggests that for p = 4, 8, 16 we should expect speedups of
roughly 2.2, 2.9 and 4 respectively. Table 1 shows that this is broadly the case for p = 4 and 8,
where the highest observed speedups were 2.63 and 3.09 respectively; for p = 16 the highest
observed speedup was 2.91, which falls short of the estimate. Likewise for the Intel platform it
suggests that for p = 2, 4, 8 we should expect speedups of roughly 1.6, 2.2 and 2.9 respectively.
The highest speedups observed for the corresponding processor/thread counts were 1.74, 2.26 and
1.62 respectively, as obtained from Table 2.
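The estimate of Observation 9 can be set against the measurements with a small sketch. It evaluates 68p/(68 + 10 lg p (lg p + 1)) and prints it next to the highest BTN speedup found in the corresponding row of Table 1 (AMD) or Table 2 (Intel). The names are ours and illustrative; the speedup values are copied from the BTN rows of the two tables, across all libraries and problem sizes.

```python
from math import log2


def btn_speedup_estimate(p):
    """Approximate Ts/Tb = 68p / (68 + 10 lg p (lg p + 1))."""
    lg = log2(p)
    return 68 * p / (68 + 10 * lg * (lg + 1))


# BTN speedups from Table 1 (AMD: OpenMPI, MulticoreBSP).
AMD_BTN = {
    4: (2.03, 2.02, 2.44, 2.18, 2.11, 2.63),
    8: (2.51, 2.34, 2.88, 2.76, 2.49, 3.09),
    16: (2.19, 1.84, 2.30, 2.91, 1.97, 2.35),
}
# BTN speedups from Table 2 (Intel: OpenMPI, MulticoreBSP, BSPlib).
INTEL_BTN = {
    2: (1.15, 1.58, 1.74, 1.13, 1.52, 1.71, 1.15, 1.56, 1.71),
    4: (0.97, 1.37, 2.24, 0.93, 1.25, 2.07, 1.01, 1.42, 2.26),
    8: (0.69, 0.96, 1.62, 0.58, 0.83, 1.33, 0.67, 0.96, 1.55),
}

for name, rows in (("AMD", AMD_BTN), ("Intel", INTEL_BTN)):
    for p, speedups in rows.items():
        # Print platform, p, model estimate and best observed speedup.
        print(name, p, round(btn_speedup_estimate(p), 2), max(speedups))
```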
5 Conclusion
We presented an experimental study of integer sorting on multicore processors using multithread-
ing and multiprocessing parallel programming approaches for a code that is transportable and
executable using three parallel/multithreading libraries, Open MPI, MulticoreBSP, and BSPlib.
We have implemented plain-vanilla serial and parallel radix-sort for various radixes and also some
previously little explored or unexplored variants of bitonic-sort and odd-even transposition sort.
We offered a series of observations obtained through these evaluations, organized and grouped
in a way that has not been done before, to our knowledge. Some of those observations have been
made previously, but some of them might not be valid any more.
Moreover we expressed the performance of our implementations in the context of the MBSP
model [15]. We showed how one can use the model to compare the theoretical performance of the
implementations. Several conclusions drawn through this theoretical comparison are in line with
the experimental results we obtained. This would suggest that MBSP might have merit in studying
the behavior of multicore and multi-memory hierarchy algorithms and thus be a useful and usable
model.
References
[1] K. Batcher. Sorting Networks and their applications. In Proceedings of the AFIPS Spring
Joint Computing Conference, pp. 307-314, 1968.
[2] G. Baudet and D. Stevenson. Optimal sorting algorithms for parallel computers. IEEE Trans-
actions on Computers, C-27(1):84-87, 1978.
[3] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. A
comparison of sorting algorithms for the connection machine CM-2. In Proc. of the Symposium
on Parallel Algorithms and Architectures (SPAA ’91), pp. 3-16, Hilton Head, SC, USA, July
21-24, ACM Press, 1991.
[4] B. Bramas. Fast Sorting Algorithms using AVX-512 on Intel Knights Landing.
arXiv:1704.08579, 24 Apr 2017.
[5] D. Cederman and P. Tsigas. On sorting and load balancing on GPUs. In Proc. ACM
SIGARCH Computer Architecture News, Vol. 36(5), Dec. 2008, pp 11-18, ACM NY, USA.
[6] Z. Cheng, K. Qi, L. Jun, and H. Yi-Ran. Thread-Level Parallel Algorithm for Sorting Integer
Sequence on Multi-core Computers. In Proc. International Symposium on Parallel Architec-
tures, Algorithms and Programming, Tianjin, China, Dec. 9-11, 2011, pp. 37-41, IEEE Press.
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PAAP.2011.57
[7] A. Chan, F. Dehne, and H. Zaboli. A note on coarse grained parallel integer sorting. Parallel
Processing Letters, Vol. 9, No. 4, pp. 533-538, 1999.
[8] F. Dehne and H. Zaboli. Deterministic Sample sort for GPUs. Parallel Processing Letters,
Vol. 22, No. 3, pp. 1250008-1 to 1250008-14, 2012.
[9] F. Dehne and H. Zaboli. Parallel Sorting for GPUs. In: Adamatzky A. (eds) Emergent
Computation. Emergence, Complexity and Computation, vol 24., pp 293-302, Springer, 2017.
[10] Brian A. Garber, Dan Hoeflinger, Xiaoming Li, Maria Jesus Garzaran, and David Padua.
Automatic Generation of a Parallel Sorting Algorithm. IEEE International Symposium on
Parallel and Distributed Processing, 2008, 14-18 April 2008, p 1-5.
[11] A. V. Gerbessiotis and C. J. Siniolakis. Deterministic sorting and randomized median finding
on the BSP model. In Proceedings of the 8-th Annual ACM Symposium on Parallel Algorithms
and Architectures, pp. 223-232, Padua, Italy, June 1996.
[12] A. V. Gerbessiotis and C. J. Siniolakis. An Experimental Study of BSP Sorting Algorithms.
In Proceedings of 6th Euromicro Workshop on Parallel and Distributed Processing, Madrid,
Spain, January, IEEE Computer Society Press, 1998.
[13] A. V. Gerbessiotis and C. J. Siniolakis. Efficient deterministic sorting on the BSP model.
Parallel Processing Letters, Vol 9 No 1 (1999), pp 69-79, World Scientific Publishing Company.
[14] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous algorithms. Journal of Parallel
and Distributed Computing, 22:251-267, Academic Press, 1994.
[15] A. V. Gerbessiotis. Extending the BSP model for multi-core and out-of-core computing:
MBSP. Parallel Computing, 41:90-102, 2015.
[16] A. V. Gerbessiotis. http://www.cs.njit.edu/~alexg/cluster/software.html. August
2017.
[17] D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific
Programming, 6 (1997), pp. 249-274.
[18] J. S. Huang and Y. C. Chow. Parallel sorting and data partitioning by sampling. IEEE
Computer Society’s Seventh International Computer Software and Applications Conference,
pages 627–631, November 1983.
[19] H. Inoue, T. Moriyama, H. Komatsu and T. Nakatani. A high-performance sorting algorithm
for multicore single-instruction multiple-data processors. Softw. Pract. Exper., 42:753-777,
Wiley, 2012. doi:10.1002/spe.1102
[20] M. F. Ionescu and K. E. Schauser. Optimizing parallel bitonic sort. In Proc. IEEE Parallel
Processing Symposium, Geneva, Switzerland, IEEE, pp 303-309, 1997.
[21] D. E. Knuth. The Art of Computer Programming. Volume III: Sorting and Searching.
Addison-Wesley, Reading, 1973.
[22] D. Langr, P. Tvrdik, and I. Simecek. AQsort: Scalable Multi-Array In-Place Sorting with
OpenMP. Scalable Computing: Practice and Experience, Vol 17(4), pp 369-391,SCPE, 2016.
[23] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays - Trees - Hy-
percubes. Morgan Kaufmann, California, 1991.
[24] A. Maus. A full parallel radix sorting algorithm for multicore processors. In Proc. Norsk
Informatikkonferanse (NIK 2011), pp. 37-48, 2011.
[25] R.L. Graham, T.S. Woodall, J.M. Squyres. Open MPI: A Flexible High Performance MPI. In
Proceedings, 6th Annual International Conference on Parallel Processing and Applied Mathe-
matics, September 2005, Poznan, Poland, Springer Verlag Lecture Series in Computer Science,
pp. 228-239, Lecture Notes in Computer Science, Vol. 3911, Springer.
[26] H. Peters, O. Schulz-Hildebrandt and N. Luttenberger. Fast in-place, comparison-based sorting
with CUDA: a study with bitonic sort. Concurrency Computat.: Pract. Exper., 23:681-693,
2011. doi:10.1002/cpe.1686
[27] S. Rathi. Optimizing sorting algorithms using ubiquitous multi-core massively parallel GPGPU
processors. In Proceedings 7th Int. Conference on Communication, Computing, and Vizual-
ization, 2016, Procedia Computer Science 79, pp. 231-237, 2016.
[28] J. H. Reif and L. G. Valiant. A logarithmic time sort for linear size networks. Journal of the
ACM, 34:60-76, January 1987.
[29] R. Reischuk. Probabilistic parallel algorithms for sorting and selection. SIAM Journal on
Computing, 14(2):396-409, 1985.
[30] H. Shi and J. Schaeffer. Parallel sorting by regular sampling. Journal of Parallel and Distributed
Computing, 14:362-372, 1992.
[31] L. G. Valiant. A bridging model for parallel computation. Comm. of the ACM, 33(8):103-111,
August 1990.
[32] A. N. Yzelman, R. H. Bisseling, D. Roose, and K. Meerbergen. MulticoreBSP for C: a
high-performance library for shared-memory parallel programming. Technical report TW 624, KU
Leuven, 2013. Also International Journal of Parallel Programming, Vol. 42(4), pp 619-642,
August 2014, Springer.
[33] S. S. Zaghloul, L. M. AlShehri, M. F. AlJouie, N. E. AlEissa, N. A. AlMogheerah. Analytical
and experimental performance evaluation of parallel merge sort on multi-core systems. In-
ternational Journal of Engineering and Computer Science, Vol 6 (6), pp. 21764-21773, June
2017.
[34] C. Zhong, Z. Qu, F. Yang, M. Yin, X. Li. Efficient and scalable parallel algorithm for sorting
multisets on multi-core systems. Journal of Computers, Vol 7(1), pp. 30-41, IAP, January
2012.
Speedup on AMD Platform
            OpenMPI              MulticoreBSP
            8M     32M    128M   8M     32M    128M
sr4 p = 1   0.362  1.459  7.454  0.362  1.459  7.454
pr4 p = 4   2.46   2.54   3.17   2.46   2.46   3.10
pr4 p = 8   4.02   4.19   5.27   4.36   4.27   5.42
pr4 p = 16  4.70   5.78   7.55   6.03   5.67   7.09
pr2 p = 4   1.84   2.43   2.72   2.51   3.16   4.41
pr2 p = 8   2.58   3.87   4.82   4.20   5.32   7.63
pr2 p = 16  3.14   5.21   7.90   5.56   7.01   9.56
btn p = 4   2.03   2.02   2.44   2.18   2.11   2.63
btn p = 8   2.51   2.34   2.88   2.76   2.49   3.09
btn p = 16  2.19   1.84   2.30   2.91   1.97   2.35
oet p = 4   1.81   1.82   2.21   2.11   2.04   2.61
oet p = 8   2.06   1.93   2.41   2.54   2.36   2.91
oet p = 16  1.36   1.18   1.46   1.78   1.47   1.76
Table 1: Speedup for PR4, PR2, BTN and OET on the AMD platform; time (sec) for SR4
Speedup on Intel Platform
            OpenMPI              MulticoreBSP         BSPlib
            8M     32M    128M   8M     32M    128M   8M     32M    128M
sr4 p = 1   0.074  0.420  2.878  0.074  0.420  2.878  0.074  0.420  2.878
pr4 p = 2   1.21   1.62   1.79   0.98   1.36   1.65   1.25   1.73   1.87
pr4 p = 4   1.27   1.89   3.07   1.21   1.64   2.70   1.68   2.33   3.61
pr4 p = 8   1.64   2.41   4.02   1.19   1.66   2.59   1.64   2.32   3.57
pr2 p = 2   0.55   0.65   0.45   0.64   0.94   1.57   0.59   0.97   1.67
pr2 p = 4   0.66   0.73   0.37   0.80   1.19   2.04   0.64   1.27   2.25
pr2 p = 8   1.08   0.71   0.51   0.58   0.87   1.52   0.40   0.86   1.63
btn p = 2   1.15   1.58   1.74   1.13   1.52   1.71   1.15   1.56   1.71
btn p = 4   0.97   1.37   2.24   0.93   1.25   2.07   1.01   1.42   2.26
btn p = 8   0.69   0.96   1.62   0.58   0.83   1.33   0.67   0.96   1.55
oet p = 2   1.23   1.71   1.82   1.17   1.57   1.73   1.23   1.67   1.79
oet p = 4   1.04   1.47   2.43   0.94   1.31   2.12   1.05   1.50   2.38
oet p = 8   0.56   0.93   1.56   0.55   0.78   1.27   0.64   0.92   1.48
Table 2: Speedup for PR4, PR2, BTN and OET on the Intel platform; time (sec) for SR4
Running time (µs) on Intel Platform
OpenMPI
           1K    2K    8K    32K   128K  512K
sr4 p = 1  10    30    90    320   1330  6800
pr4 p = 2  560   420   640   640   1320  4000
pr4 p = 4  600   910   700   790   1800  2800
pr4 p = 8  790   880   990   1070  1550  4310
btn p = 2  40    40    80    250   980   3600
btn p = 4  50    70    90    210   1190  3000
btn p = 8  90    100   150   300   890   3700
oet p = 2  70    60    100   260   900   3200
oet p = 4  100   140   160   290   1230  3300
oet p = 8  270   260   310   480   1200  4200
Table 3: Timing results (µs) for OpenMPI on the Intel platform
Running time (µs) on Intel Platform
MulticoreBSP
           1K    2K    8K    32K   128K  512K
sr4 p = 1  10    30    90    320   1330  6800
pr4 p = 2  160   180   220   490   1090  3180
pr4 p = 4  80    90    140   210   460   1690
pr4 p = 8  120   150   190   260   500   1590
btn p = 2  10    20    80    290   1020  3700
btn p = 4  9     10    40    150   650   3580
btn p = 8  10    10    50    180   790   5260
oet p = 2  20    20    80    280   980   3440
oet p = 4  10    10    40    170   680   3290
oet p = 8  20    30    70    230   910   4950
Table 4: Timing results (µs) for MulticoreBSP on the Intel platform
Running time (µs) on Intel Platform
BSPlib
           1K    2K    8K    32K   128K  512K
sr4 p = 1  10    30    90    320   1330  6800
pr4 p = 2  120   240   370   530   1050  2970
pr4 p = 4  70    80    190   460   810   1980
pr4 p = 8  420   630   910   970   1250  2270
btn p = 2  30    40    70    230   890   3630
btn p = 4  20    20    50    170   670   3470
btn p = 8  10    20    90    220   820   3350
oet p = 2  10    20    60    210   870   3250
oet p = 4  10    10    40    160   670   2830
oet p = 8  20    30    70    240   950   3870
Table 5: Timing results (µs) for BSPlib on the Intel platform
