Bank Conflict Free Comparison-based Sorting On GPUs by Sitchinava, Nodari & Weichert, Volker
ar
X
iv
:1
30
6.
50
76
v2
  [
cs
.D
S]
  1
1 O
ct 
20
16
Bank Conflict Free Comparison-based Sorting
On GPUs
Nodari Sitchinava1 and Volker Weichert2
1 University of Hawaii, Manoa, HI, USA
2 Goethe University Frankfurt am Main, Frankfurt, Germany
Abstract. In this paper we present a framework for designing algo-
rithms in shared memory of GPUs without incurring memory bank con-
flicts. Using our framework we develop the first comparison-based shared
memory sorting algorithm that incurs no bank conflicts. It can be used
as a subroutine for GPU sorting algorithms to replace current use of
sorting networks in shared memory. Using our bank conflict free shared
memory sorting subroutine as a black box, we design BCFMergesort,
an algorithm for merging sorted streams of data that are larger than
shared memory. Our algorithm performs all accesses to global memory
in coalesced manner and incurs no bank conflicts during the merge.
1 Introduction
During the past decade GPUs – the massively parallel processors developed for
graphics rendering – have been adapted for general purpose computation and
CUDA and OpenCL – extensions for C programming language – have been
developed for programming them. With hundreds of cores capable of running
thousands of threads, GPUs have become a standard computational platform in
high-performance computing (HPC) and have been successfully used to analyze
data in natural sciences, such as biology, chemistry, physics and astronomy.
1.1 Brief introduction to GPU organization
To manage the large number of cores, GPUs are designed hierarchically. They
consist of the GPU’s on-board memory, known as global memory, and a number
of streaming multiprocessors (SMs). Each SM consists of multiple scalar proces-
sors, or cores, and several types of faster but smaller memories.
Each physical core can support multiple threads. The management of so
many threads comes with a very specific thread organization. All threads on a
GPU are organized into warps – a collection of 32 threads – which must execute
code as a single-instruction, multiple data (SIMD) processor in an implicitly
Work supported in part by the DFG grant ME 2088/3-1 within the priority pro-
gramme SPP 1736 (Algorithms for Big Data) and by the Danish National Re-
search Foundation grant DNRF84 through Center for Massive Data Algorithmics
(MADALGO).
synchronized (lock-step) manner. SIMD execution implies that any if statement
that requires some subset of threads to execute one instruction, while another
subset to execute another instruction, will result in the two subsets of threads
waiting for each other. This is known as branch divergence in GPU literature
and can result in drastic loss of parallelism for nested conditional statements.
Warps are grouped into cooperative thread arrays (CTAs), also known as
thread blocks. All warps of a CTA must execute the same code, are run asyn-
chronously from each other with the ability to call explicit barrier synchroniza-
tions. All warps of a CTA are guaranteed to be run on the same physical SM.
Finally, synchronization across multiple CTAs can be performed using explicit
barriers.
The memory of a GPU is organized into multiple memory types each with
its own advantages, but also their own challenges.
– Global memory is the largest but the slowest of all memory types. To hide
the latency associated with access to global memory, GPU programmers are
encouraged to coalesce their accesses to global memory. The accesses are
coalesced when the threads of a warp access contiguous address space in the
global memory. Global memory is accessible by all threads running on the
GPU at all times and the programmer is responsible to ensure the correctness
of the data under concurrent updates in such an asynchronous setting.
– Shared memory is faster but also smaller than global memory. It does not
require coalesced accesses, however, it is organized into a limited number of
memory banks. Each thread can access any memory bank, however, simul-
taneous accesses to the same memory bank by multiple threads of a warp
cause a contention, known in GPU literature as bank conflicts. The threads
causing a bank conflict are serviced sequentially resulting in the loss of par-
allelism. Each CTA implements its own address space within shared memory
and the data in shared memory is exclusive to all threads of a single CTA,
i.e., the threads of one CTA cannot access the data of another CTA via
shared memory.
– Registers are the fastest type of memory on GPUs. Efficient access to data
stored in registers does not require coalescing, nor does it suffer from memory
bank conflicts. However, registers are private to each thread, very limited in
size (only 63 of them per thread on our NVIDIA GTX 580 graphics card)
and the addressing must be exactly the same for all threads of a warp. If
more registers are required by a thread, the excess is stored in local memory.
– Local memory is not a separate physical memory type, but instead is im-
plemented as a logical partition of global memory. However, the latency of
accessing slow global memory is mitigated by automatic caching of recently
used data of local memory in the same memory banks where shared memory
resides.
– Texture memory and Constant memory are other types of memory which
are very useful in graphics rendering applications. However they are very
specialized and their use for general purpose computations is very poorly
understood. In this paper we do not use those memories.
2
1.2 Effects of bank conflicts on runtime
Due to large latency of accessing global memory, fast practical implementations
typically first load data into shared memory and process data from there. Ef-
ficient access to contiguous data in global memory through coalescing is well
understood and Merrill and Grimshaw [10] has optimized the process to get
sustainable throughput that are close to theoretical maximal.
However, once the data in shared memory, the access patterns in the presence
of banked memory is less understood.
To observe the effects of bank conflicts on runtime, consider the problem
of colored prefix sums: given an array of N elements and a set of d colors
{c1, c2, . . . , cd}, with a color ci associated with each element, colored prefix sums
asks to compute d independent prefix sums among the elements of the same
color.
Given P ≪ N processors, colored prefix sums can be solved using the stan-
dard three phase approach to solving prefix sums: partition the input into P
subsets of contiguous N/P elements and compute the d sums (one for each
color) within each subset using an independent processor; compute d indepen-
dent prefix sums on the d sets of P values each from previous phase; finally add
the value of the appropriate color from the previous step while computing the
prefix sum within the subsets for each of d colors using independent processor
per subset. If P ≪ N the runtime of the second step is negligible (in practice it is
typically performed using a single processor) and the algorithm is dominated by
the first and the third step of the algorithm. If the second step takes negligible
time and d≪ N/P , the runtime should vary little as a function of d.
Figure 1a (dashed line) shows the actual runtime of colored prefix sums.
Unlike our prediction, we can see that it grows significantly as a function of d.
Figure 1b shows the runtime of just the third phase and the number of bank
conflicts incurred during execution. One can clearly see a correlation between
the two. After implemented an improved version of colored prefix sums that
does not incur any bank conflicts (see Appendix A for implementation details),
one can see that the runtime of the third phase indeed becomes constant as a
function of d (solid lines). This simple example demonstrates that if we want
to be able to accurately predict runtimes of GPU algorithms using asymptotic
analysis, it is very important to design algorithms without any bank conflicts.
1.3 Past work on sorting on GPUs
Sorting numbers in an array was one of the first combinatorial problems imple-
mented on GPUs. Purcel et al. [14] were the first to implement a bitonic sorting
network on a GPU, while Sintorn et al. [17] implemented the first sorting algo-
rithm in the CUDA programming environment. Since then many different algo-
rithms have been implemented on GPUs, e.g. mergesort [15,5,18], quicksort [4],
bitonic sort [13], radixsort [11] and samplesort [9].
For comparison-based algorithms, when data is small enough, fast imple-
mentations typically implement a sorting network (usually Batcher’s odd-even
3
 0
 50
 100
 150
 200
 250
 0  2  4  6  8  10  12  14  16
ru
n
tim
e 
[m
s]
colors
colored prefix sums
bank conflict free colored prefix sums
(a)
 60
 65
 70
 75
 80
 85
 90
 95
 100
 0  2  4  6  8  10  12  14  16
 0
 25
 50
 75
 100
 125
 150
 175
 200
ru
n
tim
e 
[m
s]
ba
nk
 c
on
flic
ts
 [1
06
]
colors
kernel 3 runtime
kernel 3 bank conflicts
bank-conflict free kernel 3 runtime
(b)
Fig. 1: The runtime of the colored prefix sums algorithm rises with increasing
number of colors (a). The correlation between the number of bank conflicts and
runtime can be seen in the example of the prefix sums kernel (b). The input size
is 229 elements (2 GB).
mergesort or bitonic mergesort). The likely reason for the success of sorting net-
works on GPUs is the small constant factors in the run time analysis of sorting
networks and their data-oblivious nature. The data-obliviousness of sorting net-
works implies that the execution stream of every thread is the same regardless
of the value of the input data. Thus, there is no need for conditional statements
in the code which bodes well with the SIMD nature of threads in a warp.
However, the rigid access pattern of the sorting networks makes it difficult
to avoid memory bank conflicts and over the decade of GPU computing no one
has been able to come up with an efficient sorting network that causes no bank
conflicts on inputs larger than the number of memory banks in shared memory.
1.4 Our Contributions
We present a framework for designing bank conflict free algorithms in shared
memory. Using our framework we identified an existing 27 year old sorting algo-
rithm as a candidate for a bank conflict free comparison-based sorting algorithm
on GPUs for data as large as shared memory.
Using our new sorting algorithm as a subroutine, we developed a bank conflict
free merging of pairs of sorted sequences, thus, producing BCFMergesort – the
first fully bank conflicts free comparison-based sorting GPU algorithm.
The lack of bank conflicts in BCFMergesort allows us to perform a worst-
case theoretical analysis and accurately predict the runtime of our algorithm on
GPUs. This is the first comparison-based sorting algorithm on GPUs that we
are aware of that has such theoretical analysis. We implemented our algorithm
and show that it outperforms current GPU mergesort implementations.
4
Memory Bank 7
Memory Bank 6
Memory Bank 5
Memory Bank 4
Memory Bank 3
Memory Bank 2
Memory Bank 1
Memory Bank 0
A7
A6
A5
A4
A3
A2
A1
A0
A15
A14
A13
A12
A11
A10
A9
A8
A23
A22
A21
A20
A19
A18
A17
A16
A31
A30
A29
A28
A27
A26
A25
A24
A39
A38
A37
A36
A35
A34
A33
A32
A47
A46
A45
A44
A43
A42
A41
A40
A55
A54
A53
A52
A51
A50
A49
A48
A63
A62
A61
A60
A59
A58
A57
A56
Fig. 2: The view of shared memory as a b× ⌈M/b⌉ matrix.
2 Algorithmic Framework
Let Ph be the number of SMs on the hardware and Mh be the amount of shared
memory available on each SM. Let P be the total number of CTAs. Let w be
the number of threads within each warp and for simplicity we let w represent
the number of banks in shared memory. Let M be the size of shared memory
required by each CTA. We choose M to be at most Ph·MhP to guarantee that
all CTAs can fit on all SMs and run concurrently without any context switches.
Let W be the number of warps per CTA. We set P and W to be the minimum
values which maximize the throughput between global and shared memories.
For ease of exposition we define p = PW . In our experiments we observed that
p = PW = 8Ph was the optimal choice for P and W to maximize throughput.
This relationship and the constraint M ≤ Ph·MhP allows us to varyM depending
on the specific needs of the algorithm.
During algorithm design, we treatΘ(M) elements as a single contiguous block
and transfer these blocks between the global and shared memories. Note that
M > 32, is the lower bound to perform such transfer in coalesced manner. Our
experiments showed that loading M ≥ 1024 elements per warp maximizes the
throughput. OnceM elements are in shared memory, we can focus on processing
them without any bank conflicts as efficiently as possible.
2.1 Designing Bank Conflict Free Algorithms In Shared Memory
We extend the view of shared memory as observed by Dotsenko et al. [6]. Just
like them we view the shared memory of size M as a matrix M, with w rows
and
⌈
M
w
⌉
columns. Each row corresponds to one of the w memory banks. When
r items are loaded into contiguous address space in shared memory, it is loaded
in column major order into
⌈
r
w
⌉
columns (see Figure 2). Dotsenko et al. used
this view to pad the input to offset the starting indices of each thread, so that
a parallel scan of contiguous data would result in each thread starting from a
different memory bank.
We extend this view to computations other than parallel scan. In particular,
observe that if we view the shared memory as a matrix and process it in such
a way that each thread processes a separate row of the matrix, each thread will
5
be accessing a separate bank, thus resulting in no bank conflicts. Furthermore,
observe that if we can transform the matrix from the column-major layout to
the row-major layout without causing any bank conflicts, the original columns
become rows and, therefore, the columns of the original matrix can also be
processed without any bank conflicts.
A transformation of a square matrix between the two layouts is simply a
transposition of it. A warp can perform bank conflict free transposition of a
w×w matrix in w
2
−1 rounds, where in each round i: 1 ≤ i ≤ w
2
−1, each thread
0 ≤ t ≤ w−1 swaps the elementM[t, (i+ t) mod w] with the elementM[(i+ t)
mod w, t]. Note that within each reading or writing step, all w threads access a
different row (memory bank) of the matrix. The lock-step execution among the
threads of the warp ensures that the reads and writes are performed one after
another without interfering with each other.
Since the transposition incurs no bank conflicts, all w threads can perform
each step without interfering with each other and we can easily analyze the
asymptotic worst-case runtime as
(
w
2
− 1
)
· O(1) = O(w) parallel steps.
We can easily generalize the transformation for input sizes n = hw2 for a
small integer constant h that divides w. The details are presented in Appendix E.
3 BFCMergesort: Bank-conflict free sorting
For our bank conflict free sorting we chose the classic mergesort algorithm. Prac-
tical mergesort implementations distinguish two phases: the base case phase and
the merging phase. In the base case phase,
⌈
N
M
⌉
sequences of M elements each
are sorted. The merging phase proceeds in
⌈
log
(
N
M
)⌉
rounds, in each round
merging pairs of sorted sequences from the previous round to create increasingly
larger sequences, until only a single sorted sequence remains. In this section, we
present how to implement each of these phases without any bank conflicts.
3.1 Base case (ShearSort)
On GPUs, the base case is sorted in shared memory using a sorting network.
Sorting networks are ideal for GPUs because they are data oblivious (resulting in
no branch divergence) and they are very simple (resulting in very small constant
factors). However, sorting networks still cause bank conflicts if the size of the
sequence to be sorted is larger than w. This is probably the reason why the fastest
mergesort implementations on GPUs (e.g. the Thrust library [15], the CUDPP
library [5] and warpsort [18]) use sorting networks only for up to 2w = 64
elements.
In 1983, Sen et al.[16] introduced ShearSort – a sorting network to sort num-
bers arranged in an m ×m matrix in snake-like order along the column-major
order. The algorithm proceeds in rounds. In each round it sorts even-numbered
columns in ascending order and odd-numbered columns in descending order,
then sorts each row in ascending order. Using 0-1 principle [8], it is easy to show
that after Θ(logm) rounds the matrix is sorted.
6
Using our observation from Section 2.1, it is easy to see that ShearSort can
be implemented on GPUs to sort sequences of M = w2 elements in shared
memory with w banks without any bank conflicts. Using w threads, it will take
Tb(M) = O
(
tˆ
(
M
w
)
· logw
)
parallel time, where tˆ(n) is the time it takes to sort
a row or a column of n elements using a single thread. We can still benefit from
the simplicity and data-oblivious nature of traditional sorting networks by im-
plementing sequential sorting of rows and columns with a sorting network. Thus,
if the row/column sorting is implemented using Batcher’s sorting network [2],
then tˆ(n) = O(n log2 n) and the total time it takes to sort M = w2 elements
is Tb(M) = O(
M
w log
3 w) = O(w log3 w). Notice, that although this approach
incurs O(logw) extra steps over a simple implementation of a sorting network
in shared memory, this time complexity is guaranteed in the worst case and our
experiments show that the savings due to lack of bank conflicts are larger than
the extra O(logw) factor (see Section 4).
The following simple extension allows us to sort larger inputs: for any integer
h that divides w, we can sort M = w2h elements using h warps of a CTA.
Remember that w2h elements can be viewed in shared memory as a w × (wh)
matrix. Each column of this matrix can be sorted as before. However, each
row now consists of wh elements and sorting it using a single thread would be
too slow. Instead, we first pack wh such rows into separate w × w matrices as
follows. We transpose each of h square submatrices and gather the columns that
comprise an original row of wh elements into adjacent columns. These columns
can easily be identified using index arithmetic. Once each original row is packed
into h contiguous columns, we run Segmented ShearSort on w × w submatrices.
Segmented ShearSort is a modified ShearSort, which instead of sorting the whole
rows of a square matrix, sorts groups of wh elements within each row. Thus, after
log
(
w
h
)
rounds, each w ×
(
w
h
)
submatrix corresponding to the original rows of
wh elements will be sorted.
Transformation of the w × (wh) matrices takes O(w) parallel time when
performed using h warps (wh threads). Segmented ShearSort of a square matrix
using w threads takes O
(
tˆ
(
M
wh
)
log wh
)
= O(w·log2 w·log wh ) parallel time, which
is the total runtime for sorting M = w2h elements using h warps.
3.2 Bank-conflict free merging
After sorting the base case, our algorithm performs a binary merge of the re-
sulting sorted sequences using a separate warp to merge each pair of sequences.
Let P be the minimal number of CTAs that maximizes throughput, as men-
tioned in Section 2. If the number of sequences to be merged is fewer than 2P ,
some of the CTAs will be sitting idle, thus, reducing the available parallelism.
Ye et al. [18] presented a randomized approach to mitigate this problem. Their
idea is to suspend binary merging when the number of sequences drops below
2P , distribute the sequences into P buckets, such that all elements of bucket i
are smaller than elements in bucket i + 1 for all buckets, and continue binary
merging on the portions of the sequences within each bucket. Ye et al. choose
bucket boundaries as random samples of the input. In this section we present a
7
deterministic approach for determining bucket boundaries, while adapting their
merging algorithm to use our ShearSort from the previous section.
Binary merging. To merge two sorted sequences A and B, a standard sequen-
tial algorithm starts off by loading the smallest element from each sequence. Then
it repeatedly outputs the smaller of the two items and loads the next element
from the sequence, whose element was written out.
Since each warp consists of w = 32 threads, it is inefficient to load only one
element from each sequence. Instead Ye et al [18] load a page of size M
2
from each
sequence into shared memory, merge M elements using all w threads of a warp,
output at least M
2
smallest elements, and load the next page of M
2
elements from
the sequence, whose input is exhausted first.
In our implementation, we setM = w2 and use our ShearSort implementation
to merge the two pages in shared memory. Since both input sequences are already
sorted, by the 0-1 principle we can reduce the number of rounds of ShearSort to
just one, resulting in overall merging time Tm(M) = O
(
tˆ
(
M
w
))
= O(w log2 w).
Distribution. Each merge round reduces the number of sequences s by a factor
of 2, causing the algorithm to terminate after O
(
log
(
N
M
))
rounds. However, if
s < 2P , some CTAs will be idle while others have to work on longer sequences in
each round. To improve resource utilization we suspend the merge at s = P and
distribute the sequences across P buckets. To determine the bucket boundaries
deterministically, we adapt the approach of Aggarwal and Vitter [1].
Given s sorted sequences, each of size Ns , we pick t evenly spaced elements
as potential splitters from each sequence and sort the st elements (let’s call
them S). We pick P − 1 evenly spaced items from S as the final splitters that
define bucket boundaries. To perform the distribution of the s sequences into the
P buckets defined by the P−1 splitters, we calculate the rank of each splitter for
every sorted sequence by parallel scans of the data, and sort the subsequences
by bucket and sequence number.
Lemma 1. Assuming all elements are distinct, the size of each bucket is at most
O
(
N
t ·
(
1 + tP
))
elements.
Proof. The number of elements within each sequence between any pair of poten-
tial splitters is
⌈
N
st
⌉
. There are
⌈
st
P
⌉
potential splitters between a pair of selected
final splitters. Thus, the total number of elements between final selected splitters
is at most s ·
⌈
N
st
⌉
+
⌈
st
P
⌉
·
⌈
N
st
⌉
= O
(
N
t ·
(
1 + tP
))
.
Thus, if we pick t = P potential splitters from each sequence, the size of each
bucket is at most O(NP ).
Final merge phase. After rearranging the data the number of sorted subse-
quences is P 2. CTA pi then sorts the ith bucket by merging the sorted sub-
sequences of the bucket. Since the subsequences may have sizes that are not a
multiple of the page size, padding is introduced implicitly in shared memory,
but not written to the output array.
This phase requires log(P ) additional merge rounds in which none of the P
CTAs runs out of work prematurely.
8
Lemma 2. BCFMergesort takes O
(
N
Pw ·
(
1 + log2(w) · log
(
N
w
)))
time and re-
quires O
(
1 + log
(
N
M
))
parallel scans of the data in global memory.
Proof. The proof for Lemma 2 can be found in Appendix B.
3.3 Optimizations
Use of registers. While latency of accessing shared memory is much lower than
accessing global memory, it still incurs some latency. Any computation that
requires multiple accesses to the same location in shared memory can be sped
up by implementing it in registers. Since our GPU hardware has 64 physical
registers per thread (63 of which can be used for data), and our ShearSort sorts
M = w2 elements, where w = 32, the sorting of each row (and column after the
matrix transposition) of the w×w matrix can be implemented in registers. Note
that the indices for accessing registers when loading a row of w = 32 elements do
not depend on the thread index and the data is not transferred to local memory.
The effects of register optimization are presented in Appendix D.
Implicit matrix transposition. Processing the columns of the w × w matrix
requires transposition of the matrix. The challenge with loading the column
directly into registers is the fact that it either causes bank conflicts, or the indices
of registers being loaded must depend on the thread identifier. The latter case
causes the local memory (instead of actual registers) being used, which is actually
implemented in off-chip global memory (see Section 1.1). Thus, neither of the
approaches is acceptable. However, if we define column j as shifted diagonals of
the matrix (i.e. matrix elements M[(j + k) mod w, k] for 0 ≤ k < w) and row
i to be the ith column, shifted upwards by i elements (i.e. elements M[(i + k)
mod w, i] for 0 ≤ k < w), we can load each row and column directly into registers
without the need for transposition.
Duplicate keys. Lemma 1 guarantees that bucket sizes are at most a constant
factor more than NP only if all the items are distinct. A large number of duplicates
might cause some bucket to be much larger than others. To mitigate this issue, we
define additional P −1 buckets and place all items equal to the bucket boundary
into their own bucket. Thus, after the distribution, any duplicate items equal to
bucket boundaries will already be in the correct position and do not need to be
processed any further. We call this version BCFMergesortPB.
4 Experimental results
We conduct our experiments on a standard PC with a 2.4 GHz Intel Core2
Quad CPU, 8 GB of RAM and an NVIDIA Geforce GTX580 GPU with 3 GB of
memory, Ph = 16 SMs and Mh = 48KB per SM (with additional 16KB used for
caching local memory). There are w = 32 physical cores per SM, w = 32 threads
per warp and the shared memory consists of w = 32 banks. Our algorithms
are implemented using the CUDA runtime API and compiled with the CUDA
toolkit version 5.5, gcc version 4.7 and the -O 3 optimization flag. The runtimes
9
reported do not include the time needed to transfer data from the host computer
to the GPU or back.
We compare our sorting algorithm to the well-known implementation of the
Thrust library [3] (version 1.7) and warpsort [18].3 We did not include CUDPP
mergesort [5] in our experiments, as the available code imposes severe limitations
on the input and does not sort data larger than 64 MB. Unfortunately, we also
couldn’t find a working implementation of the Samplesort [9] for our hardware.
We report performance on four different data sets: Random data set consists
of keys chosen uniformly from [0 . . . 106], distinct data set is the sequence 1 . . . n
permuted by a shuffle algorithm [7], 0-1 data set contains uniformly chosen
entries of zero or one and defined duplicates data set is formed by k copies of
each of nk distinct keys, randomly permuted by the shuffle algorithm.
Warpsort uses a bucket distribution as well, but the splitter selection is ran-
domized. It is known that deterministic algorithms result in significantly slower
implementations in practice. Therefore, it is notable that our deterministic al-
gorithms are only slightly slower than the randomized warpsort.
Duplicate keys skew the bucket sizes of the BCFMergesort. The effects of
having separate buckets for pivot values can be seen in Figure 4. It’s notable
that warpsort fails to sort 0-1 data completely.
Though our merge algorithm is not in-place either, it requires less extra space
than both Thrust and warpsort. BCFMergesort lets us sort up to 1GiB of data
on our hardware, while Thrust and warpsort run on inputs up to 512 MiB and
256MiB, respectively.
The relevant parameters for the merge phase were empirically determined to
be p = 8Ph = 128 and M = w
2 = 1024.
The runtimes per key for the different algorithms and different input sets are
shown in Figures 3 and 4. BCFMergesort with pivot buckets (BCFMergesortPB)
for inputs with few distinct keys suffers a performance loss of 6% compared to
our normal mergesort. While warpsort falls exactly into this gap for random
keys, the randomized pivot selection appears to be beneficial when dealing with
inputs with all distinct keys, putting it on par and even surpassing our normal
mergesort on two instances. And while thrust mergesort is slightly faster than
BCFMergesort on the smallest instance (5%), we beat thrust by up to 64% on
random inputs and 54% on distinct keys.
The chart for 0-1 data impressively shows the strength of pivot buckets in-
troduced at the end of Section 3.2. Since the heuristic causes BCFMergesortPB
to have nothing more to do after rearranging the data, BCFMergesortPB beats
BCFMergesort on these inputs by a factor of up to 15.5 and thrust sort by a
factor of up to 2.7. In the chart for defined duplicates, the less extreme versions
3 The results for warpsort presented in [18] included heuristics not mentioned in the
paper. Unfortunately, these heuristics did not guarantee the runtimes mentioned in
the paper and because of this the warpsort would run much slower for some inputs.
Therefore, (with the approval of the authors of [18]) we modified warpsort to align
with the description within their paper, resulting in more uniform runtimes, albeit
slightly slower than what is presented in [18].
10
 0
 0.005
 0.01
 0.015
 0.02
 0  50  100  150  200  250  300
ru
n
tim
e 
pe
r e
le
m
en
t [µ
s/
ke
y]
Input size [106 keys]
warpsort
thrust
BCFMergesort
BCFMergesortPB
(a)
 0
 0.005
 0.01
 0.015
 0.02
 0  50  100  150  200  250  300
ru
n
tim
e 
pe
r e
le
m
en
t [µ
s/
ke
y]
Input size [106 keys]
warpsort
thrust
BCFMergesort
BCFMergesortPB
(b)
Fig. 3: Runtime per key for the random (a) and distinct (b) data sets.
 0
 0.01
 0.02
 0.03
 0.04
 0.05
 0.06
 0  50  100  150  200  250  300
ru
n
tim
e 
pe
r e
le
m
en
t [µ
s/
ke
y]
Input size [106 keys]
thrust
BCFMergesort
BCFMergesortPB
(a)
 0
 0.002
 0.004
 0.006
 0.008
 0.01
 0  250  500  750  1000  1250  1500  1750  2000  2250
ru
n
tim
e 
pe
r e
le
m
en
t [µ
s/
ke
y]
distinct keys
warpsort
thrust
BCFMergesort
BCFMergesortPB
(b)
Fig. 4: Runtime per key for the 0-1 (a) and defined duplicates (b) data sets. For
defined duplicates we limit the axes to display the interesting part. BCFMergesort
for two distinct keys takes 0.051 µs per key. Input size for (b) is 225 keys.
of up to 2250 distinct keys can be seen. BCFMergesortPB first surpasses the
normal version of BCFMergesort when 1
4
of all keys are equal to a splitter.
5 Conclusion
We discussed the properties of GPUs, focusing on the memory hierarchy. We
provide experimental data that shows the great impact of shared memory bank
conflicts on the runtime of GPU programs and described methods to design bank
conflict free algorithms for GPUs. Using those methods, we developed the first
comparison-based bank conflict free sorting algorithm for GPUs and showed that
the benefits of having no bank conflicts outweigh the additional work necessary
to make algorithms bank conflict free.
Acknowledgments. We would like to thank Vitaly Osipov for helpful discus-
sions and sharing his insights on GPUs.
11
References
1. Aggarwal, A., Vitter, J.: The Input/Output Complexity of Sorting and Related
Problems. Communications of the ACM pp. 1116–1127 (1988)
2. Batcher, K.E.: Sorting Networks and their Applications. In: Proceedings of the
AFIPS Spring Joint Computer Conference. pp. 307–314. ACM (1968)
3. Bell, N., Hoberock, J.: Thrust: A productivity-oriented library for CUDA. GPU
Computing Gems: Jade Edition pp. 359–372 (2011)
4. Cederman, D., Tsigas, P.: A Practical Quicksort Algorithm for Graphics Proces-
sors. In: Algorithms-ESA 2008, pp. 246–258. Springer (2008)
5. Davidson, A., Tarjan, D., Garland, M., Owens, J.D.: Efficient Parallel Merge Sort
for Fixed and Variable Length Keys. In: Innovative Parallel Computing (InPar),
2012. pp. 1–9. IEEE (2012)
6. Dotsenko, Y., Govindaraju, N.K., Sloan, P.P., Boyd, C., Manfedelli, J.: Fast Scan
Algorithms on Graphics Processors. In: Proceedings of the 22nd Annual Interna-
tional Conference on Supercomputing (2008)
7. Knuth, D.E.: Seminumerical Algorithms, The Art of Computer Programming,
vol. 2. Addison-Wesley, Reading, third edn. (1998)
8. Knuth, D.E.: Sorting and Searching, The Art of Computer Programming, vol. 3.
Addison-Wesley, Reading, third edn. (1998)
9. Leischner, N., Osipov, V., Sanders, P.: GPU Sample Sort. In: Proceedings of the
24th IEEE Parallel and Distributed Processing Symposium (IPDPS) (2010)
10. Merrill, D., Grimshaw, A.: Parallel Scan for Stream Architectures. Tech. Rep.
CS2009-14, Department of Computer Science, University of Virginia (2009)
11. Merrill, D., Grimshaw, A.: High Performance and Scalable Radix Sorting: A Case
Study of Implementing Dynamic Parallelism for GPU Computing. Parallel Pro-
cessing Letters 21(02), 245–272 (2011)
12. NVIDIA: NVIDIA CUDA Programming Guide 5.0 (2012),
http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, last
viewed on 2/11/2013
13. Peters, H., Schulz-Hildebrandt, O., Luttenberger, N.: Fast In-Place Sorting with
CUDA based on Bitonic Sort. In: Parallel Processing and Applied Mathematics,
pp. 403–410. Springer (2010)
14. Purcell, T.J., Donner, C., Cammarano, M., Jensen, H.W., Hanrahan, P.: Pho-
ton Mapping on Programmable Graphics Hardware. In: Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. pp. 41–50. Eu-
rographics Association (2003)
15. Satish, N., Harris, M., Garland, M.: Designing Efficient Sorting Algorithms for
Manycore GPUs. In: Proceedings of the 23rd IEEE Parallel and Distributed Pro-
cessing Symposium (IPDPS). pp. 1–10. IEEE (2009)
16. Sen, S., Scherson, I.D., Shamir, A.: Shear Sort: A True Two-Dimensional Sorting
Techniques for VLSI Networks. In: ICPP. pp. 903–908 (1986)
17. Sintorn, E., Assarsson, U.: Fast Parallel GPU-Sorting using a Hybrid Algorithm.
Journal of Parallel and Distributed Computing 68(10), 1381–1388 (2008)
18. Ye, X., Fan, D., Lin, W., Yuan, N., Ienne, P.: High Performance Comparison-Based
Sorting Algorithm on Many-Core GPUs. In: Proceedings of the 24th IEEE Parallel
and Distributed Processing Symposium (IPDPS) (2010)
12
A Impact of Bank Conflicts on Runtimes
The importance of shared memory bank conflicts can easily be demonstrated
by looking at the problem of colored prefix sums: given an array of N elements
and a set of d colors {c1, c2, . . . , cd}, with a color ci associated with each ele-
ment, colored prefix sums asks to compute d independent prefix sums among
the elements of the same color.
Our algorithm is basically a three-phase prefix sums algorithm similar to
Merrill’s [10]. The elements are assigned to P CTAs in contiguous tiles of size
N
P . For colored prefix sums, we keep intermediate values for each color and
perform extra operations to identify the color of each element.
In the first phase all elements of the tiles are reduced using the prefix sums
operation. This results in d values per CTA, one for each color. The second
phase calculates the prefix sums of those P · d reduced values as offsets for the
neighboring tiles. This can be done on a single CTA interpreting the values as a
d×P matrix. In the final phase, the prefix sums of the tiles are calculated, seeded
with the offsets computed in the second phase. Since the tiles may be too large
for shared memory, they are loaded in contiguous pages of M elements. Between
neighboring pages, page offsets for each color have to be stored in shared memory.
The calculation of colored prefix sums within a page resembles the algorithm
itself: Each thread is assigned a contiguous chunk of the page and first counts
the entries per color in its chunk, then the intra-page offsets are calculated by
one scan of those sums per color, and finally the threads calculate colored prefix
sums on their chunks seeded with the tile-, inter-page and intra-page offsets.
We would expect the runtime to depend on the input size N , the number of
CTAs P and the warp size w, but not d since all calculations depending on the
number of colors are performed in parallel. However, the runtime increases with
the number of colors (see Figure 1). The slowdown correlates nicely with the
number of shared memory bank conflicts incurred when calculating intra-page
offsets in the third phase.
To improve our algorithm, we implement a way to calculate the intra-page
offsets without any bank conflicts. We assign each thread one memory bank
for storage of intermediate values. The resulting w × d matrix lists the color
values in columns. To avoid having multiple threads working on the same banks
we transpose the matrix and then calculate the prefix sums with d threads in
parallel. After retransposing the matrix, we can propagate the offsets.
We observe that the runtime of the algorithm is now constant for different
numbers of colors (see Figure 1). We limit the number of colors to 16 because of
the limited amount of shared memory available for each multiprocessor.
B Proof for Lemma 2
Proof. The BCFMergesort runs in four basic phases: the base case phase, the
first merge phase, the bucket distribution and the final merge phase. We will
look at each phase separately and combine the results later.
13
The base case phase uses P CTAs working in parallel to sort the input in NM
segments of sizeM . Each CTA has to perform
⌈
N
M·P
⌉
rounds of sorting, resulting
in O
(
N
M·P · tˆ(w) · log(w)
)
parallel time. In the process, each data item is read
and written once.
To reduce the number of sorted sequences to P , log
(
N
M·P
)
rounds of bi-
nary merge have to be performed in the first merge phase. To merge two se-
quences of size N
2
, a total of 2 · N/2M/2 =
2·N
M pages of size
M
2
have to be processed
in each round: one page is read, two pages are merged, one page is written.
Therefore, each element is merged twice in average, resulting in a parallel time
of O
(
N
M·P · log
(
N
M·P
)
· tˆ(w)
)
and log
(
N
M·P
)
additional reads and writes of the
data.
The bucket distribution in itself consists of five phases: picking t pivots from
each sorted sequence, sorting those t · P pivots, choosing the P − 1 final pivots
and finding their location in the sorted sequences, calculating the splitter offsets
and rearranging the data by bucket. Since we use one CTA per sorted sequence,
selecting t pivots takes O
(
t
w
)
time.
The pivots are then sorted using mergesort: base case and binary merge
phases. That results inO
(
P 2
M · tˆ(w) · log(w) + log
(
P 2
M
)
· tˆ(w)
)
= O
(
P 2
M · tˆ(w) · log(w)
)
parallel time.
The splitter are located by sequentially scanning the sequences using w
threads for each. Therefore, each CTA needs O
(
N
Pw
)
time. In the process, the
data is read once.
The calculation of the P 2 subsequence offsets takes another O(P ) time.
Finally, the data is rearranged. Each of the P CTAs moves P subsequences
to their proper positions, causing O(P ) computation and one more read and
write of all data.
The final merge phase causes another log(P ) binary merges. Implicit padding
can at most be M
2
− 1 elements per subsequence,
(
M
2
− 1
)
· P 2 in total. This is
O(N) for any reasonable combination of parameters, so the time of the page-wise
binary merge applies.
14
This results in a total parallel runtime of
O
(
N
M · P
· tˆ(w) · log(w)
)
+O
(
N
M · P
· log
(
N
M · P
)
· tˆ(w)
)
+ O
(
t
w
)
+O
(
P 2
M
· tˆ(w) · log(w)
)
+O
(
N
Pw
)
+O(P ) +O(P )
+ O
(
N
M · P
· log(P ) · tˆ(w)
)
= O
(
N
M · P
· tˆ(w) · log(w) +
N
M · P
· log
(
N
M · P
)
· tˆ(w) +
N
Pw
+
N
M · P
· log(P ) · tˆ(w)
)
= O
(
N
M · P
· tˆ(w) ·
(
log(w) + log
(
N
M · P
)
+ log(P )
)
+
N
Pw
)
= O
(
N
M · P
· tˆ(w) · log
(
N · w
M
)
+
N
Pw
)
.
Considering that M = w2 and tˆ(w) = w · log2(w), we get
O
(
N
M · P
· tˆ(w) · log
(
N · w
M
)
+
N
Pw
)
= O
(
N
Pw
· log2(w) · log
(
N
w
)
+
N
Pw
)
= O
(
N
Pw
·
(
1 + log2(w) · log
(
N
w
)))
.
and in O
(
1 + log
(
N
M·P
)
+ log(P )
)
= O
(
1 + log
(
N
M
))
passes over the data.
15
C Additional charts
For completeness we present the plain runtime charts of GPU mergesort algo-
rithms in Figures 5 and 6. The dominance of the linear term makes it difficult
to see the lower order terms, for sorting, which is the reason why we chose to
normalize the runtimes per element in the main presentation of the results.
 0
 0.5
 1
 1.5
 2
 0  50  100  150  200  250  300
ru
n
tim
e 
[s]
Input size [106 keys]
warpsort
thrust
BCFMergesort
BCFMergesortPB
(a)
 0
 0.5
 1
 1.5
 2
 0  50  100  150  200  250  300
ru
n
tim
e 
[s]
Input size [106 keys]
warpsort
thrust
BCFMergesort
BCFMergesortPB
(b)
Fig. 5: Runtime for the random (a) and distinct (b) data sets.
 0
 2
 4
 6
 8
 10
 12
 14
 16
 0  50  100  150  200  250  300
ru
n
tim
e 
[s]
Input size [106 keys]
thrust
BCFMergesort
BCFMergesortPB
(a)
 0
 0.02
 0.04
 0.06
 0.08
 0.1
 0  250  500  750  1000  1250  1500  1750  2000  2250
ru
n
tim
e 
pe
r e
le
m
en
t [µ
s/
ke
y]
distinct keys
warpsort
thrust
BCFMergesort
BCFMergesortPB
(b)
Fig. 6: Runtimes for the 0-1 (a) and defined duplicates (b) data sets. For defined
duplicates we limit the axes to display the interesting part. BCFMergesort for
two distinct keys takes 0.051 µs per key. Input size for (b) is 225 keys.
D Register optimization
We ran experiments to determine the speedup provided by sorting rows and
columns in register space. The results in Figure 7 show the runtimes of BCFMerge-
sort with transposition on a w×w matrix with register optimization and without.
As a comparison, we also provide the runtime of an implementation of Odd-Even
Transposition Sort on w2 = 1024 elements. It demonstrates the effect of bank
conflicts in shared memory on runtimes. The sorting algorithms are applied to
subsets of w2 elements on the total input of 225 elements.
16
 0
 50
 100
 150
 200
 250
 300
 350
 400
ShearSort with register optimization ShearSort Odd-Even Transposition Sort
ru
n
tim
e 
[m
s]
Fig. 7: Comparison of runtimes of ShearSort with and without register optimiza-
tion and of Odd-Even Transposition sorting network on subsets of 1024 elements
each, for the total of size of 225 elements.
E Bank-conflict free computation on non-square matrices
The approach to bank conflict free computation from Section 2.1 can easily be
generalized for inputs of size n = hw2 for a small integer constant h that divides
w, by transforming the input from column-major to row-major order and vice
versa. Since space in shared memory is limited [12], it is desirable to perform
this transformation in-place. Although there is no simple in-place algorithm for
transforming general rectangular matrices from row-major to column-major or-
der, h is a sufficiently small integer for current sizes of shared memories on GPUs
(currently h ≤ 12).
To transform the matrix from column-major to row-major layout, we split the
w×hw matrix into h square w×w matrices, transpose each one using a different
warp (see Figure 8), and combine h original columns (now h partial rows) that
compose a single hw-sized row using all h warps (see Figure 9). Transformation
from row-major to column-major layout is done similarly. For pseudocode refer
to Appendix F.
Then all h columns resulting from the transformation of the same row are
gathered together in the correct order. The gathering of all h columns resulting
from the transformation of the same row can be done by h ·w threads in parallel
without bank conflicts, because the w threads of the same warp will always run
in lock-step. Each thread is responsible for one element of each resulting column
of wh rows.
When analyzing the runtime of transforming a w × hw matrix one might
think that running h warps on the same streaming multiprocessor with only w
physical cores would increase the runtime by a factor of h. However, in practice
17
Fig. 8: 8×32 matrix with division into 8×8-submatrices, before and after trans-
forming the square submatrices
Fig. 9: The columns resulting from each row are gathered and placed in ascending
order of rows.
this is not the case. The reason is that as long as all data required by the warps
fits in the shared memory the context switch across the warps incurs no overhead
and running a single warp takes the same time as running h warps with extra
time spent waiting for data access due to the latency of accessing shared memory.
Since there are no bank conflicts in this procedure, the runtime of transform-
ing an w × hw matrix remains O(w) time.
F Pseudocode of BCFMergesort
Sorting w2 elements in shared memory without bank conflicts. ShearSort sorts
a matrix of values in snake-like order by first sorting the rows in alternating
directions and then columns in ascending order. We use the fact that transform-
ing a square matrix in shared memory from column-major to row-major order
can be done by simply transposing the matrix (see Algorithm 1). That way our
implementation can assign one memory bank to each of the w threads (see Sec-
tion 3.1). For simplicity we assume that the result should be sorted in ascending
order. Since all threads execute the same program in parallel, this pseudocode
is written for one thread with index tid.
18
Algorithm 1 SHEARSORT: Bank-conflict free sorting in shared memory
INPUT: w ×w matrix m
OUTPUT: Sorted matrix m
register array R[w]
FOR i IN 1. . . log
2
(w) DO
# Sort the rows of matrix m in alternating order
FOR j in 0 . . . (w − 1) DO R[j] = m[tid][j]
BATCHERS-ODD-EVEN-MERGESORT(R)
IF tid mod 2 == 0 THEN
FOR j in 0 . . . (w − 1) DO m[tid][j] = R[j]
ELSE
FOR j in 0 . . . (w − 1) DO m[tid][w − j] = R[j]
# Change matrix from column-major order to row-major order
TRANSPOSE(m)
# Sort the columns of matrix m
FOR j in 0 . . . (w − 1) DO R[j] = m[tid][j]
BATCHERS-ODD-EVEN-MERGESORT(R)
FOR j in 0 . . . (w − 1) DO m[tid][j] = R[j]
# Change matrix from row-major order to column-major order
TRANSPOSE(m)
# Clean up and remove snake-like order
FOR j in 0 . . . (w − 1) DO R[j] = m[tid][j]
BATCHERS-ODD-EVEN-MERGESORT(R)
FOR j in 0 . . . (w − 1) DO m[tid][j] = R[j]
# Change matrix from column-major order to row-major order
TRANSPOSE(m)
19
As explained in Section 3.3, the transpositions can be removed if we treat
diagonals as columns and shifted columns as rows. The pseudocode for this opti-
mization can be seen in Algorithm 2. The extra step of transforming diagonals to
rows at the end of the algorithm is easily amortized by the missing 2 · log2(w)+1
transpositions.
Algorithm 2 SHEARSORT-DIAGONAL: Bank-conflict free sorting in shared
memory without transpositions
INPUT: w ×w matrix m
OUTPUT: Sorted matrix m
register array R[w]
FOR i IN 1. . . log
2
(w) DO
# Sort the diagonals of matrix m in alternating order
FOR j in 0. . . (w − 1) DO R[j] = m[(tid+ j) mod w][j]
BATCHERS-ODD-EVEN-MERGESORT(R)
IF tid mod 2 == 0 THEN
FOR j in 0. . . (w − 1) DO m[(tid+ j) mod w][j] = R[j]
ELSE
FOR j in 0. . . (w − 1) DO m[(tid+ w − j) mod w][w − j] = R[j]
# Sort the shifted columns of matrix m
FOR j in 0. . . (w − 1) DO R[j] = m[(tid+ j) mod w][tid]
BATCHERS-ODD-EVEN-MERGESORT(R)
FOR j in 0. . . (w − 1) DO m[(tid+ j) mod w][tid] = R[j]
# Remove snake-like order
FOR j in 0. . . (w − 1) DO R[j] = m[(tid+ j) mod w][j]
BATCHERS-ODD-EVEN-MERGESORT(R)
FOR j in 0. . . (w − 1) DO m[(tid+ j) mod w][j] = R[j]
# Transform diagonals to columns
FOR j in 0. . . (w − 1) DO R[j] = m[(2 · tid+ j) mod w][(j + tid) mod w]
FOR j in 0. . . (w − 1) DO m[(tid+ j) mod w][tid] = R[j]
Sorting h ·w2 elements in shared memory without bank conflicts. We extend
our ShearSort implementation to matrices of size w× (h ·w). We define the func-
tions CHANGE-TO-ROWMAJOR (Algorithm 3) and CHANGE-TO-COLUMNMAJOR (Algo-
rithm 4) that transforms the long rows into h columns of size w. These functions
require h warps.
Furthermore, SHEARSORT has to be modified to handle the long rows (Algo-
rithm 5). For this procedure to run, we need a subroutine SHEARSORT-SEGMENTED
20
Algorithm 3 CHANGE-TO-ROWMAJOR: Turn column-major long matrices
to row-major order
INPUT: w × (wh) matrix m
OUTPUT: Transformed matrix m
widx = ⌊ tid
w
⌋
wpos = tid mod w
# Change all w × w submatrices m1 . . .mh of matrix m to row-major order
TRANSPOSE(mwidx)
register array R[w]
FOR i in 0 . . .
(
w
h
− 1
)
DO
FOR j in 0 . . . (h− 1) DO
R[h · i+ j] = m[wpos][(widx · w
h
+ i) + j · w]
FOR i in 0 . . . (w − 1) DO
m[wpos][widx · w + i] = R[i]
Algorithm 4 CHANGE-TO-COLUMNMAJOR: Turn row-major long matrices
to column-major order
INPUT: w × (wh) matrix m
OUTPUT: Transformed matrix m
widx = ⌊ tid
w
⌋
wpos = tid mod w
register array R[w]
FOR i in 0 . . . (w − 1) DO
R[i] = m[wpos][widx · w + i]
FOR i in 0 . . .
(
w
h
− 1
)
DO
FOR j in 0 . . . (h− 1) DO
m[wpos][(widx · w
h
+ i) + j · w] = R[h · i+ j]
# Change all w × w submatrices m1 . . .mh of matrix m to row-major order
TRANSPOSE(mwidx)
21
that takes three parameters: the matrix, the number of rows per segment and
a flag snake-like that indicates whether the segments should be rearranged
in the snake-like order required by ShearSort. SHEARSORT-SEGMENTED performs
SHEARSORT where the second sort (the columns) is replaced by a segmented sort
routine that sorts wh segments of size h.
For the diagonal version of BCFMergesort, changing the matrix to a different
memory representation is reduced to gathering and scattering the diagonals in
the same manner as CHANGE-TO-ROWMAJOR or CHANGE-TO-COLUMNMAJOR treat the
columns.
Page-wise binary merge. To perform the bank conflict free page-wise binary
merge (see Section 3.2), we load one page of data from both input streams, merge
them and output one full page. Figuratively, the keys from the first exhausted
page are moved to the other stream and are replaced keys from that stream. That
is why merging has to continue until both streams are exhausted and cannot be
stopped when the first input stream is finished.
For a better performance we modify the SHEARSORT subroutine to only per-
form one round of ShearSort instead of log
2
(w). By the 0-1 principle this is
enough to merge two sorted sequences. We call this shortened sorting routine
SHEARSORT-MERGE(S).
In the case of arbitrary input stream sizes, we have to determine the amount
of data to load (e.g. by setting an integer ls = min
(
M
2
, n[e]− ic[e]
)
) and in
case of partially filled pages pad them with ∞. Since the padding will remain
in shared memory until the merge is finished, we also have to keep track of the
amount of padding used and subtract it from the size of the final outputs.
Bucket distribution. The bucket distribution of Section 3.2 consists of two
phases: finding the splitters and determining their positions in the input data,
and rearranging the data. The second part is not strictly necessary but it makes
life easier when the merging begins again.
The subroutine FIND-BUCKETS is called on P CTAs, identified by their index
pid, with w threads each. The sorted substreams of input stream I are marked
as I1 . . . IP . We also need a function EXCLUSIVE-PREFIX-SUMS that performs the
namesake operation on P integers. Since P = 128 in our experiments, we chose
a simple linear solution executed on one thread.
22
Algorithm 5 SHEARSORT-LONG: Bank-conflict free sorting of long matrices
in shared memory
INPUT: w × (wh) matrix m
OUTPUT: Sorted matrix m
register array R[w]
FOR i IN 1. . . log
2
(w) DO
# Change matrix to row-major order
CHANGE-TO-ROWMAJOR(m)
# Change all segmented w × w submatrices ms1 . . .m
s
h from row-major order to column-major order
TRANSPOSE(mswidx)
# Segmented sort submatrices ms1 . . .m
s
h of matrix m in snake-like order
SHEARSORT-SEGMENTED(mwidx, h, true)
# Change matrix to column-major order
CHANGE-TO-COLUMNMAJOR(m)
# Change submatrices m1 . . .mh from column-major order to row-major order
TRANSPOSE(mwidx)
# Sort the columns of all submatrices m1 . . .mh
FOR j in 0. . . (w − 1) DO R[j] = mwidx[tid][j]
BATCHERS-ODD-EVEN-MERGESORT(R)
FOR j in 0. . . (w − 1) DO mwidx[tid][j] = R[j]
# Change submatrices m1 . . .mh from row-major order to column-major order
TRANSPOSE(mwidx)
# Change matrix to row-major order
CHANGE-TO-ROWMAJOR(m)
# Change segmented submatrices ms1 . . .m
s
h from row-major order to column-major order
TRANSPOSE(mswidx)
# Segmented sort submatrices ms1 . . .m
s
h in ascending order
SHEARSORT-SEGMENTED(mwidx, h, false)
23
Algorithm 6 BINARY-MERGE: Page-wise binary merge
INPUT: Streams I [0] and I [1], stream size n
OUTPUT: Output stream O
shared memory array S[M ]
integer e
# Load a page of M
2
keys from each input stream into S
S[0 . . .
(
M
2
− 1
)
] = I [0][0 . . .
(
M
2
− 1
)
]
S[M
2
. . . (M − 1)] = I [1][0 . . .
(
M
2
− 1
)
]
# Determine the input stream, whose page will be exhausted first
IF S[M
2
− 1] ≤ S[M − 1] THEN
e = 0
ELSE
e = 1
# Merge the first pages of both input streams
SHEARSORT-MERGE(S)
# Write the first page to the output stream
O[0 . . .
(
M
2
− 1
)
] = S[0 . . .
(
M
2
− 1
)
]
# Offset counters for all streams
integer ic[0] = ic[1] = oc = M
2
WHILE (ic[0] < n) OR (ic[1] < n) DO
# If there are elements left in the designated input stream, load a page.
# Otherwise load a page from the other stream. Adjust the offset counter.
IF ic[e] ≥ n THEN e = 1− e
S[0 . . .
(
M
2
− 1
)
] = I [e][ic[e] . . .
(
ic[e] + M
2
− 1
)
]
ic[e] = ic[e] + M
2
# Determine which is the next page to be exhausted
IF S[M
2
− 1] > S[M − 1] THEN e = 1− e
# Merge the pages in shared memory
SHEARSORT-MERGE(S)
# Write the next page to the output stream and adjust the offset counter
O[oc . . .
(
oc+ M
2
− 1
)
] = S[0 . . .
(
M
2
− 1
)
]
oc = oc+ M
2
# Write the final remaining page to the output stream
O[oc . . . (2 · n− 1)] = S[M
2
. . . (M − 1)]
24
Algorithm 7 FIND-BUCKETS: Bucket distribution
INPUT: Input stream I , consisting of P sorted substreams of size n, number of pivots t
OUTPUT: Output stream O, stream separators Q
global potential pivot array S[P · t]
global final pivot array F [P ]
global temporary splitter array T [P 2]
global bucket offset array B[P ]
integer c, oc
# Pick t potential splitters per substream Ik
FOR i in 0 . . . t DO S[pid · t+ i] = Ipid[i ·
n
t
]
# Sort all potential pivots
MERGESORT(S)
# Pick the final pivots: Every tth element of S
F [pid] = A[pid · t]
# Determine the rank of each pivot in each substream
c = 1
FOR i in 0 . . . (n− 1) DO
IF Ipid[i] > F [c] THEN
T [pid · P+] = pid · n+ i
WHILE Ipid[i] > F [c] DO c = c+ 1
# Calculate target substream offsets (sum entries of each bucket, run prefix sums)
B[pid] = (T [0 · P + pid+ 1]− T [0 · P + pid]) + · · ·+ (T [(P − 1) · P + pid+ 1]− T [(P − 1) · P + pid])
EXCLUSIVE-PREFIX-SUMS(B)
# Sort substreams by bucket
oc = B[pid]
FOR i in 0 . . . (P − 1) DO
O[oc . . . (oc+ (T [pid+ i · P + 1]− T [pid+ i · P ]))] =
I [T [pid+ i · P ]] . . . I [T [pid+ i · P + 1]− 1]
Q[pid · P + i] = oc
oc = (oc+ (T [pid+ i · P + 1]− T [pid+ i · P ]))
25
