Engineering Faster Sorters for Small Sets of Elements by Marianczuk, Jasper
Bachelor Thesis
Engineering Faster Sorters
for Small Sets of Items
Jasper Anton Marianczuk
Date: May 09, 2019
Supervisors: Prof. Dr. Peter Sanders
Dr. Timo Bingmann
Institute of Theoretical Informatics, Algorithmics
Department of Informatics
Karlsruhe Institute of Technology
I hereby declare that I have written this thesis independently, that I have used no sources
or aids other than those indicated, that I have marked all passages taken over verbatim or in
substance as such, and that I have observed the statutes of the Karlsruhe Institute of Technology
for safeguarding good scientific practice in their currently valid version.
Karlsruhe, May 09, 2019
Abstract
Sorting a set of items is a task that can be useful by itself or as a building
block for more complex operations. That is why a lot of effort has been put into
finding sorting algorithms that sort large sets as efficiently as possible. But the
more sophisticated and fast the algorithms become asymptotically, the less efficient
they are for small sets of items due to large constant factors.
A relatively simple sorting algorithm that is often used as a base case sorter is
insertion sort, because it has small code size and small constant factors influencing
its execution time.
This thesis aims to determine if there is a faster way to sort these small sets of
items to provide an efficient base case sorter. We looked at sorting networks, at
how they can improve the speed of sorting few elements, and how to implement
them in an efficient manner by using conditional moves. Since sorting networks
need to be implemented explicitly for each set size, providing networks for larger
sizes becomes less efficient due to increased code sizes. To also enable the sorting
of slightly larger base cases, we modified Super Scalar Sample Sort and created
Register Sample Sort, to break down those larger sets into sizes that can in turn be
sorted by sorting networks.
From our experiments we found that when sorting only small sets, the sorting
networks outspeed insertion sort by at least 25% for any array size between 2 and
16. When integrating sorting networks as a base case sorter into quicksort, we
achieved far smaller performance improvements over using insertion sort, which is due
to the networks having a larger code size and cluttering the L1 instruction cache.
The same effect occurs when including Register Sample Sort as a base case sorter
for IPS4o. But for computers that have a larger L1 instruction cache of 64 KiB or
more, we obtained speed-ups of 6.4% when using sorting networks as a base case
sorter in quicksort, and of 9.2% when integrating Register Sample Sort as a base
case sorter into IPS4o, each in comparison to using insertion sort as the base case
sorter.
In conclusion, the desired improvement in speed could only be achieved under
special circumstances, but the results clearly show the potential of using conditional
moves in the field of sorting algorithms.
Summary
Sorting a set of elements is a process that can be useful on its own or serve as a
building block for more complex operations. For this reason, great effort has already
been invested in the design of sorting algorithms that sort large sets of elements
efficiently. But the more sophisticated and asymptotically faster the algorithms are,
the less efficient they become when sorting smaller sets, due to high constant factors.
A relatively simple sorting algorithm that is often used as a base case sorter is
insertion sort, because its code is short and it has small constant factors.
This bachelor thesis aims to find out whether there is a faster algorithm for sorting
such small numbers of elements, so that it can be used as an efficient base case
sorter. To this end we looked at sorting networks, at how they can speed up the
sorting of small lists, and at how to implement them efficiently: by exploiting
conditional move instructions. Because sorting networks must be implemented explicitly
for each list size, the efficiency of sorting with sorting networks decreases, due to
increased code size, the larger the lists to be sorted are. To also enable the sorting
of somewhat larger base cases, we modified Super Scalar Sample Sort and designed
Register Sample Sort, which splits a larger list into several small lists that can
then be sorted by the sorting networks.
In our experiments we found that, when only small sets are sorted, the sorting
networks are at least 25% faster than insertion sort for all lists containing between
2 and 16 elements. When integrating the sorting networks as a base case sorter into
quicksort, we obtained far less speed-up over using insertion sort, which is because
the code of the networks requires more space and evicts the code for quicksort from
the L1 instruction cache. The same effect also occurs when using Register Sample Sort
as a base case sorter for IPS4o. However, on machines with a larger L1 instruction
cache of 64 KiB or more, we improved over insertion sort as the base case sorter by
6.4% with sorting networks in quicksort and by 9.2% with Register Sample Sort in IPS4o.
In summary, we achieved the desired improvement only under special conditions, but
the results clearly indicate that conditional move instructions have potential in the
field of sorting algorithms.
Contents
1 Introduction
1.1 Motivation
1.2 Overview of the thesis
2 Sorting Networks
2.1 Preliminaries
2.1.1 Networks in Practice
2.1.2 Improving the Speed of Sorting through Sorting Networks
2.1.3 Compare-And-Swap
2.1.4 Assembly Code
2.2 Implementation of Sorting Networks
2.2.1 Providing the Network Frame
2.2.2 Implementing the Conditional Swap
3 Register Sample Sort
3.1 Preliminaries
3.2 Implementing Sample Sort for medium-sized Sets
4 Experimental Results
4.1 Environment
4.2 Generating Plots
4.3 Conducting the Measurements
4.4 Sorting one set of 2-16 items
4.5 Sorting many continuous Sets of 2-16 Items
4.6 Sorting a large Set of Items with Quicksort
4.7 Sorting a medium-sized Set of Items with Sample Sort
4.8 Sorting a large Set of Items with IPS4o
5 Conclusion
5.1 Results and Assessment
5.2 Experiences and Hurdles
5.3 Possible Additions
List of Figures
1 Sorting network by Bose and Nelson for 6 elements
2 Best network with optimal length for 16 elements
3 Bose Nelson network for 16 elements optimizing locality
4 Bose Nelson network for 16 elements optimizing parallelism
5 Single sort for array size = 8 on machine A
6 Single sort for array size = 8 on machine B
7 Single sort for array size = 8 on machine C
8 Single sort of array sizes 2 to 16 on machine A
9 Single sort of array sizes 2 to 16 on machine B
10 Single sort of array sizes 2 to 16 on machine C
11 Continuous sorting of array sizes 2 to 16 on machine A
12 Continuous sorting of array sizes 2 to 16 on machine B
13 Continuous sorting of array sizes 2 to 16 on machine C
14 Sorting times of quicksort with different base cases on machine A
15 Sorting times of quicksort with different base cases on machine B
16 Sorting times of quicksort with different base cases on machine C
17 Sample sort on machine A with 256 items and different configurations
18 Sample sort on machine B with 256 items and different configurations
19 Sample sort on machine C with 256 items and different configurations
20 Sample sort 332 with different base cases on machine A
21 Sample sort 332 with different base cases on machine B
22 Sample sort 332 with different base cases on machine C
23 Distribution of the size of the array passed to the base case sorter when executing IPS4o with parameter BaseCaseSize4 = 16
24 Distribution of the size of the array passed to the base case sorter when executing IPS4o with parameter BaseCaseSize4 = 32
25 Sorting times for IPS4o on machine A with different base cases and base case sizes
26 Sorting times for IPS4o on machine B with different base cases and base case sizes
27 Sorting times for IPS4o on machine C with different base cases and base case sizes
List of Tables
1 Registers required by Register Sample Sort with three or seven splitters
2 Hardware properties of the machines used
3 Average number of CPU cycles per iteration of single array sorting on machine A
4 Average number of CPU cycles per iteration of single array sorting on machine B
5 Average number of CPU cycles per iteration of single array sorting on machine C
6 Average number of CPU cycles per iteration of single array sorting across all machines
7 Average number of CPU cycles per array of continuous sorting across all machines
8 Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in quicksort and unmodified std::sort
9 Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in sample sort and unmodified std::sort
10 Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in IPS4o and unmodified std::sort
Algorithms
1 Register Sample Sort Classification(array, elementCount, predicate)
2 MeasureSorting(arraySize, numberOfIterations, seed)
3 MeasureSortingInRow(arraySize, numberOfArrays, seed)
1 Introduction
1.1 Motivation
Sorting, that is rearranging the elements in a set to be in a specific order, is one of the basic
algorithmic problems. In school and university, basic sorting algorithms like bubble sort,
insertion sort, and merge sort, as well as a simple variant of quicksort, are taught first. These
algorithms are rated by the number of comparisons they require to sort a set of items. This
number of comparisons is put into relation to the input size and looked at on an asymptotic
level. Only later does one realize that what looks good on paper does not have to work well in
practice, so factors like average cases, cache effects, hardware setups, and constant factors need
to be taken into consideration, too. An informed choice of which sorting algorithm to use
(for a particular use case) should be influenced by all of these factors.
Complex sorting algorithms aim to sort a large number of items quickly, and a lot of them
follow the divide-and-conquer idea of designing an algorithm. However, sorting small sets of
items, e.g. with 16 elements or less, is usually fast enough that investing a lot of effort into
optimizing sorting algorithms for those cases results in very small gains, looking at the absolute
amount of time saved.
The complex sorters do not perform as well when sorting small sets of items, having good
asymptotic properties but larger constant factors that become more important for the small
sizes. Because of that, the base case of sorting small enough subsets is performed using a
simpler algorithm, which is often insertion sort. It has a worst-case run-time of O(n^2), but small
constant factors that make it suitable to use for small n. If this sorter is executed many times
as base case of a larger sorter though, the times do sum up to contribute a substantial part
of the sorting time.
The guiding question of this thesis is:
Is there a faster way to sort sets of up to 16 elements than insertion sort?
When sorting a set of uniformly distributed random numbers, the chance of any number being
greater than another is on average 50%. Therefore, whenever a conditional branch is influenced
by one element’s relation to another, one in two of those branches will be mispredicted, which
leads to an overall performance penalty.
This is a problem that has already been looked at by Michael Codish, Luís Cruz-Filipe, Markus
Nebel and Peter Schneider-Kamp in “Optimizing sorting algorithms by using sorting networks”
[CCNS17] in 2017, and this thesis has taken a great deal of inspiration from it.
1.2 Overview of the thesis
We will first look at sorting networks in section 2. Section 2.1 gives a basis of sorting networks
and assembly code. After that, we look at different ways of implementing sorting networks
efficiently in C++ in section 2.2. For that we focused on elements that consist of a key and
an additional reference value. This enables the sorting of complex items, not being limited to
integers.
In section 3 we will take a small detour to look at Super Scalar Sample Sort and develop
an efficient modified version for sets with 256 elements or less by holding the splitters
in general purpose registers instead of an array. After that, section 4 discusses the results
and improvements of using sorting networks we achieved in our experiments, measuring the
performance of the sorting networks and sample sort individually, and also including them as
base cases into quicksort and IPS4o [AWFS17]. Finally, we summarize the results of this thesis
in section 5.
2 Sorting Networks
2.1 Preliminaries
Sorting algorithms can generally be classified into two groups: those whose behaviour
depends on the input, e.g. quicksort, where the sorting speed depends on how well the chosen
pivot partitions the set into equally-sized halves, and those whose behaviour is not
influenced by the configuration of the input. The latter are also called data-oblivious.
One example of a data-oblivious sorting algorithm is the sorting network. A sorting network
of size n consists of a number of n so-called channels numbered 1 to n, each representing one
of the inputs, and connections between the channels, called comparators. Where two channels
are connected by a comparator it means that the values are to be compared, and if the channel
with the lower number currently holds a value that is greater than the value of the channel with
the higher number, the values are to be exchanged between the channels. The comparators are
given in a fixed order that determines the sequence of executing these conditional swaps, so
that in the end
(i) the channels contain a permutation of the original input, and
(ii) the values held by the channels are in nondecreasing order.
Sorting networks are data-oblivious because all the comparisons are always performed, and in
the same order, no matter which permutation of an input is given.
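The definition above can be illustrated with a minimal sketch. It assumes zero-based channel numbers and uses the well-known five-comparator network for four inputs; the names Comparator, ApplyNetwork, kNetwork4 and SortsAllPermutationsOf4 are illustrative and are not taken from the thesis code.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// A comparator connects two channels (lower channel number first).
using Comparator = std::pair<std::size_t, std::size_t>;

// Five-comparator network for four inputs, channels numbered from 0.
const std::vector<Comparator> kNetwork4 = {{0, 1}, {2, 3}, {0, 2}, {1, 3}, {1, 2}};

// Visits every comparator in the given fixed order, regardless of the input.
// Note: the if statement here may still compile to a branch.
template <typename T, std::size_t N>
void ApplyNetwork(std::array<T, N>& values, const std::vector<Comparator>& network)
{
    for (const Comparator& c : network) {
        if (values[c.first] > values[c.second]) {
            std::swap(values[c.first], values[c.second]);
        }
    }
}

// Checks properties (i) and (ii) for all 4! = 24 permutations of {1, 2, 3, 4}.
bool SortsAllPermutationsOf4()
{
    std::array<int, 4> input = {1, 2, 3, 4};
    const std::array<int, 4> expected = {1, 2, 3, 4};
    do {
        std::array<int, 4> values = input;
        ApplyNetwork(values, kNetwork4);
        if (values != expected) { return false; }
    } while (std::next_permutation(input.begin(), input.end()));
    return true;
}
```

The loop visits the same comparators in the same order for every input, which is the data-oblivious property described above.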
For any sorting network, two metrics can be used to quantify it: the length and the depth.
A network’s length refers to the number of comparators it contains, and a network’s depth
describes the minimal number of levels the network can be divided into.
Where two comparators are ordered one after the other, and no channel is used by both
comparators, they can be combined into a level. In other words: the result of the second comparator
does not depend upon the result of the first. Inductively, any comparator can be merged into a
level that executes right before or after it, if its channels are not already used by any
comparator in the level. Since all the comparators in a level are then independent from one another,
they can be executed in parallel.
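As a sketch of how the two metrics could be computed for a network given as a comparator list: each comparator is placed into the earliest level that comes after the last level using one of its channels. The function names and the zero-based channel numbering are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Comparator = std::pair<std::size_t, std::size_t>;

// Length: the number of comparators in the network.
std::size_t NetworkLength(const std::vector<Comparator>& network)
{
    return network.size();
}

// Depth: number of levels when every comparator is scheduled as early as
// its two channels allow.
std::size_t NetworkDepth(const std::vector<Comparator>& network, std::size_t channelCount)
{
    std::vector<std::size_t> lastLevel(channelCount, 0); // 0 = channel not used yet
    std::size_t depth = 0;
    for (const Comparator& c : network) {
        const std::size_t level = std::max(lastLevel[c.first], lastLevel[c.second]) + 1;
        lastLevel[c.first] = level;
        lastLevel[c.second] = level;
        depth = std::max(depth, level);
    }
    return depth;
}
```

For the four-input network (0,1), (2,3), (0,2), (1,3), (1,2) this yields length 5 and depth 3, because the first two and the next two comparators each share a level.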
2.1.1 Networks in Practice
• Best known networks: For networks of up to size 16 there exist proven optimal lengths
and proven optimal depths. For example, the network for 10 elements with optimal
length 29 has depth 9, the one with optimal depth 7 has length 31 [Knu98, CCFS14].
For networks of greater size there only exist currently known lowest numbers of length
or depth. Those best networks are acquired through optimizations that were initially
done by hand and nowadays are realized e.g. with the help of computers and evolutionary
algorithms [ber18].
• Recursive networks: For creating sorting networks there also exist algorithms that
work in a recursive divide-and-conquer way: split the input into two parts, sort each part
recursively, and merge the two parts together in the end. Representatives for this kind
of approach are the construction of R.J. Nelson and B.C. Bose [BN62] and the algorithm
by K.E. Batcher [Bat68]. Bose and Nelson split the input sequence into first and second
half, while Batcher partitions into elements with an even index and elements with an odd
index. The advantage of those recursive networks over the specially optimized ones is that
they can easily be created even for large network sizes. While the generated networks
may have more comparators than the best known networks, the number of comparators
in a network acquired from either Bose-Nelson or Batcher of size n has an upper bound
of O(n (log n)^2) [Knu98].
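The recursive split-sort-merge idea can be sketched for the Bose-Nelson construction as follows. This follows a commonly circulated recursive formulation of their algorithm and has not been checked against the generator used in this thesis; the function names and the zero-based channel numbers are assumptions.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Comparator = std::pair<std::size_t, std::size_t>;

// Emits comparators that merge the sorted run of length x starting at channel i
// with the sorted run of length y starting at channel j.
void BoseNelsonMerge(std::size_t i, std::size_t x, std::size_t j, std::size_t y,
                     std::vector<Comparator>& out)
{
    if (x == 1 && y == 1) {
        out.push_back({i, j});
    } else if (x == 1 && y == 2) {
        out.push_back({i, j + 1});
        out.push_back({i, j});
    } else if (x == 2 && y == 1) {
        out.push_back({i, j});
        out.push_back({i + 1, j});
    } else {
        const std::size_t a = x / 2;
        const std::size_t b = (x % 2 == 1) ? y / 2 : (y + 1) / 2;
        BoseNelsonMerge(i, a, j, b, out);
        BoseNelsonMerge(i + a, x - a, j + b, y - b, out);
        BoseNelsonMerge(i + a, x - a, j, b, out);
    }
}

// Emits comparators that sort m channels starting at channel i.
void BoseNelsonSort(std::size_t i, std::size_t m, std::vector<Comparator>& out)
{
    if (m > 1) {
        const std::size_t a = m / 2;
        BoseNelsonSort(i, a, out);                 // sort the first half
        BoseNelsonSort(i + a, m - a, out);         // sort the second half
        BoseNelsonMerge(i, a, i + a, m - a, out);  // merge both halves
    }
}

std::vector<Comparator> BoseNelsonNetwork(std::size_t n)
{
    std::vector<Comparator> network;
    BoseNelsonSort(0, n, network);
    return network;
}
```

For n = 4 the sketch produces the comparators (0,1), (2,3), (0,2), (1,3), (1,2). A generated network can be validated with the zero-one principle by checking that it sorts all 2^n inputs consisting only of zeros and ones.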
Figure 1: Sorting network by Bose and Nelson for 6 elements
Sorting networks are usually depicted by using horizontal lines for the channels, and vertical
connections between these lines for the comparators. A network by Bose and Nelson for 6
elements displayed like that can be seen in figure 1.
2.1.2 Improving the Speed of Sorting through Sorting Networks
An important question to ask is how sorting networks can improve the sorting speed on a set of
elements (on average) if they cannot take any shortcuts for “good” inputs, unlike an insertion sort
that would leverage an already sorted input and do one comparison per element. The answer
to this question is branching. Because the compiler knows in advance which comparisons are
going to be executed in which order, the control flow does not contain conditional branches,
which in particular gets rid of expensive branch mispredictions. On uniformly distributed random
inputs, the chance that any number is smaller than another is 50% on average, making branches
unpredictable. In the case of insertion sort that means not knowing in advance with how many
elements the next one has to be compared until it is inserted into the right place.
Even though with sorting networks the compiler knows in advance when to execute which
comparator, it might still generate branches if the compare-and-swap operation is implemented
in a naive way (as seen in 2.1.3). In that case, the sorting networks are no faster than
insertion sort, or even slower.
2.1.3 Compare-And-Swap
For sorting networks, the basic operation used is to compare two values against each other.
If they are in the wrong order (the “smaller” element occurs after the “bigger” one in the
sequence), they are swapped. Intuitively, one might implement the operation in C++ like this:
void ConditionalSwap(TValueType& left, TValueType& right)
{
if (left > right) { std::swap(left, right); }
}
Here TValueType is a template typename and can be instantiated with any type that imple-
ments the > operator.
As suggested in [CCNS17], the same piece of code can be rewritten like this:
void ConditionalSwap2(TValueType& left, TValueType& right)
{
TValueType temp = left;
if (temp > right) { left = right; }
if (temp > right) { right = temp; }
}
At first glance it looks like we now have two branches that can be taken. But the code executed
if the condition is true now consists of only a single assignment each, which can be expressed on
the x86 architecture through a conditional move instruction. In AT&T syntax (see section 2.1.4),
a conditional move (cmov a,b) will write the value in register a into register b if a condition
is met. If the condition is not met, no operation takes place (still taking the same number of
CPU cycles as the move operation would have). Since the address of the next instruction no
longer depends upon the previously evaluated condition, the control flow now does not contain
branches. The only downside of the conditional move is that it can take longer than a normal
move instruction on certain architectures, and that it can only be executed once the comparison
has been performed and its result is available.
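For experimentation, the rewritten swap can be put into a complete, compilable form as sketched below. Whether a compiler actually turns the two guarded assignments into conditional moves is not guaranteed and has to be checked in the generated assembly; the helper Sort3 is a hypothetical three-input network added only to exercise the swap.

```cpp
#include <cstdint>

// The rewritten swap as a function template. Both guarded statements are
// single assignments, which gives the compiler the option of using cmov.
template <typename TValueType>
inline void ConditionalSwap2(TValueType& left, TValueType& right)
{
    TValueType temp = left;
    if (temp > right) { left = right; }
    if (temp > right) { right = temp; }
}

// Hypothetical three-input network built from the swap (zero-based channels).
inline void Sort3(std::uint64_t* a)
{
    ConditionalSwap2(a[1], a[2]);
    ConditionalSwap2(a[0], a[2]);
    ConditionalSwap2(a[0], a[1]);
}
```

Note that the second condition still compares against the unchanged value of right, so both assignments are guarded by the same comparison result.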
When the elements to be sorted are only integers, some compilers do generate code with
conditional moves for those operations. When the elements are more general (in this thesis we
will look at pairs of an unsigned 64-bit integer key and an unsigned 64-bit reference value, which
could be a pointer or an address in an array), gcc 7.3.0, the compiler used for the experiments,
does not generate conditional moves. To force the usage of conditional moves, a feature of gcc
was used that allows the programmer to specify small amounts of assembly code to be inserted
into the regular machine code generated by gcc, called inline assembly [Fre19]. This mechanism
and its notation are further explained in section 2.1.4.
2.1.4 Assembly Code
Assembly code represents the machine instructions executed by the CPU. It can be given as
the actual opcodes or as human-readable text. There are two different conventions for the
textual representation, the Intel syntax or MASM syntax and the AT&T syntax. The main
differences are:
                  Intel                               AT&T
Operand size      The size of the operand does not    The size of the operand is appended to
                  have to be specified                the instruction: b (byte = 8 bit),
                                                      l (long = 32 bit), q (quad-word = 64 bit)
Parameter order   The destination is written first,   The source is written first, then the
                  then the source of the value:       destination: movq src,dest
                  mov dest,src
In this thesis only the AT&T syntax will be used.
The gcc C++ compiler has a feature that allows the programmer to write assembly instructions
in between regular C++ code, called “inline assembly” (asm) [Fre19]. A set of assembly
instructions to be executed must be given, followed by a definition of input and output variables
and a list of clobbered registers. This extra information is there to communicate to the
compiler what is happening inside the asm block. Gcc itself does not parse or optimize the
given assembly statements; they are only inserted into the generated assembly code, which is
then processed by the GNU Assembler. A variable being in the output list means that its value will be
modified; a clobbered register is one where gcc cannot assume that the value it held before the
asm block will be the same as after the block. In this thesis, the clobbered registers will almost
always be the condition-code register (cc), which includes the carry flag, zero flag and
sign flag, which are modified during a compare instruction. This way of specifying the input,
output and clobbered registers is also called extended asm.
Taking the code from 2.1.3, and assuming TValueType = uint64_t, the statement
if (temp > right) { left = right; }
can now be written as
__asm__(
"cmpq %[temp],%[right]\n\t" //performs right - temp internally
"cmovbq %[right],%[left]\n\t" //left = right, if right < temp
: [left] "=&r"(left) //output
: "0"(left), [right] "r"(right), [temp] "r"(temp) //input
: "cc" //clobber
);
In extended asm, one can define C++ variables as input or output operands, and gcc will assign a
register for that value (if it has the "r" modifier), and also write the value in an output register
back to the given variable after the asm block. Note that the names in square brackets are
symbolic names only valid in the context of the assembly instructions and independent from
the names in the C++ code before. The link between the C++ names and the symbolic names
happens in the input and output declarations.
With the conditional moves it is important to properly declare the input and output variables,
because they perform a task that is a bit unusual: an output variable may be overwritten, and
also may not. For the output register for left, two things must apply:
(i) if the condition is false, it must hold the value of left, and
(ii) if the condition is true, it must hold the value of right.
For optimization purposes, the compiler might reduce the number of registers used by placing
the output of one operation into a register that previously held the input for some other
operation. To prevent this, the declaration for the output [left] "=&r"(left) has the "&"
modifier added to it, meaning it is an “early clobber” register and that no other input can be
placed in that register. In combination with "0"(left) in the input line, it is also tied to an
input, so that the previous value of left is loaded into the register beforehand, to comply with
constraint (i). Because we already declared it as output, instead of giving it a new symbolic
name we tie it to the output by referencing its index in the output list, which since it is the
first output variable is "0". The "=" in the output declaration solely means that this register
will be written to. Any output needs to have the "=" modifier.
We see that each assembly instruction is postfixed with \n\t. That is because the instruction
strings are appended into a single instruction string during compilation and \n\t tells the GNU
assembler where one instruction ends and the next begins.
The cmov instruction is postfixed with a b in this example, which stands for “below”. So the
cmov will be executed if right is below temp (unsigned comparison right < temp). Apart
from below we will also see not equal (ne) and carry (c) as a postfix.
In addition to that, both the cmp and the cmovb are postfixed with a q (quad-word) to indicate
that the operands are 64-bit values.
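To try the statement in isolation, it can be wrapped into a small function as sketched below. The sketch assumes gcc or a compatible compiler targeting x86-64; the function name MoveIfBelow and the portable fallback for other platforms are illustrative additions that do not appear in the thesis code.

```cpp
#include <cstdint>

// Returns right if right < temp (unsigned comparison), otherwise left.
inline std::uint64_t MoveIfBelow(std::uint64_t left, std::uint64_t right, std::uint64_t temp)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__(
        "cmpq %[temp],%[right]\n\t"    // computes right - temp and sets the flags
        "cmovbq %[right],%[left]\n\t"  // left = right, if right < temp
        : [left] "=&r"(left)                                 // output
        : "0"(left), [right] "r"(right), [temp] "r"(temp)    // input
        : "cc"                                               // clobber
    );
#else
    // Assumed fallback for other compilers and architectures (may branch).
    if (right < temp) { left = right; }
#endif
    return left;
}
```

For example, MoveIfBelow(5, 3, 5) yields 3, since 3 is below 5, while MoveIfBelow(5, 7, 5) leaves the result at 5.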
When a subtraction minuend−subtrahend is performed and subtrahend is larger than minuend
(interpreted as unsigned numbers), the operation causes an underflow which results in the carry
flag being set to 1. The check for that carry flag being 1 can be used as a condition by itself,
and the carry flag influences other condition checks like below. This property of the comparison
setting the carry flag will be used in section 3.2.
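The underflow condition can also be expressed in portable C++, as the following sketch shows. It only mirrors the arithmetic condition under which the carry flag is set; it does not read the actual CPU flag, and the function name is an assumption made for this illustration.

```cpp
#include <cstdint>

// True exactly when minuend - subtrahend underflows as an unsigned 64-bit
// subtraction, i.e. when subtrahend is larger than minuend.
inline bool SubtractionBorrows(std::uint64_t minuend, std::uint64_t subtrahend)
{
    const std::uint64_t difference = minuend - subtrahend; // wraps modulo 2^64
    return difference > minuend; // a wrapped result is larger than the minuend
}
```

The result agrees with the unsigned comparison minuend < subtrahend, which is the relation that the below condition tests.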
2.2 Implementation of Sorting Networks
2.2.1 Providing the Network Frame
The best networks for sizes of up to 16 elements were taken from John Gamble’s Website
[Gam19] and are length-optimal.
The Bose Nelson networks have been generated using the instructions from their paper [BN62].
For sizes of 8 and below the best and generated networks have the same number of comparators
and levels. For sizes larger than 8 the generated networks are at a disadvantage because they
have more comparators and/or levels. In return, their recursive structure makes it possible
to leverage a different trait: locality. Instead of optimizing them to sort as parallel as possible,
we can first sort the first half of the set, then the second half, and then apply the merger. This
way, chances are higher that all n/2 elements of the first half fit into the processor’s general
purpose registers. During this part of the sorting routine, no accesses to memory or cache are
required. To determine if there is a visible speed-up, the networks were generated optimizing
(a) locality and (b) parallelism.
As an extra idea, the Bose Nelson networks were generated in a way that one can pass the
elements as separate parameters instead of as an array. That way one can sort elements that
are not contiguously placed in memory. Because the networks were implemented as method
calls to the smaller sorters and merge methods, there would be a large overhead in placing
many elements onto the call stack for each method call. While we hoped this would make a
difference by reducing code size, the overhead for the method calls was too large. That is why
all the methods are declared inline, which results in the same flat sequence of swaps for each
size that the networks optimizing locality have.
Examples of networks for 16 elements can be seen in figures 2, 3 and 4.
All networks are implemented so that they have an entry method that takes a pointer to an ar-
ray A and an array size n as input and delegates the call to the specific method for that number
of elements, which in turn executes all the comparators. To measure different implementations
for the conditional swaps, the network methods and the swap are templated, so that when
calling the network with an array of a specific type the respective specialized conditional-swap
implementation will be used.
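A minimal sketch of such a frame is given below, restricted to sizes of up to four elements. The names SortN2, SortN3, SortN4 and SortNetwork as well as the boolean return value are assumptions made for this illustration and do not reflect the actual interface of the thesis code.

```cpp
#include <cstddef>
#include <utility>

// Default conditional swap; a specialization for a specific element type
// could replace this body with another implementation.
template <typename TValueType>
inline void ConditionalSwap(TValueType& left, TValueType& right)
{
    if (right < left) { std::swap(left, right); }
}

template <typename TValueType>
inline void SortN2(TValueType* A)
{
    ConditionalSwap(A[0], A[1]);
}

template <typename TValueType>
inline void SortN3(TValueType* A)
{
    ConditionalSwap(A[1], A[2]);
    ConditionalSwap(A[0], A[2]);
    ConditionalSwap(A[0], A[1]);
}

template <typename TValueType>
inline void SortN4(TValueType* A)
{
    ConditionalSwap(A[0], A[1]);
    ConditionalSwap(A[2], A[3]);
    ConditionalSwap(A[0], A[2]);
    ConditionalSwap(A[1], A[3]);
    ConditionalSwap(A[1], A[2]);
}

// Entry method: delegates to the network for exactly n elements.
// Returns false if this sketch provides no network for n.
template <typename TValueType>
inline bool SortNetwork(TValueType* A, std::size_t n)
{
    switch (n) {
        case 0:
        case 1: return true; // nothing to sort
        case 2: SortN2(A); return true;
        case 3: SortN3(A); return true;
        case 4: SortN4(A); return true;
        default: return false;
    }
}
```

Because the swap is resolved through the element type, calling SortNetwork with an array of a different type picks up whichever ConditionalSwap specialization exists for that type.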
Figure 2: Best network with optimal length for 16 elements
Figure 3: Bose Nelson network for 16 elements optimizing locality
Figure 4: Bose Nelson network for 16 elements optimizing parallelism
Our approach differs from the work in [CCNS17] in the type of elements that were sorted.
While they measured the sorting of ints, which are usually 32-bit sized integers, we made the
decision to sort elements that consist of a 64-bit integer key and a 64-bit integer reference value,
enabling not only the sorting of numbers but also the sorting of complex elements, when giving
a pointer or an array index as the reference value to the original set. This was implemented by
creating structs that contain a key and reference value each, having the following structure:
struct SortableRef
{
uint64_t key, reference;
};
They also define the operators >, >=, ==, <, <= and != for reasons of usability, and
templated methods uint64_t GetKey(TSortable) and uint64_t GetReference(TSortable)
are available.
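One possible shape of such a struct with its operators and accessors is sketched below. That every comparison, including == and !=, looks only at the key, and that the accessors take their argument by const reference, are assumptions of this sketch; the thesis code may differ.

```cpp
#include <cstdint>

struct SortableRef
{
    std::uint64_t key, reference;
};

// Assumption of this sketch: all comparisons look at the key only.
inline bool operator<(const SortableRef& a, const SortableRef& b) { return a.key < b.key; }
inline bool operator>(const SortableRef& a, const SortableRef& b) { return a.key > b.key; }
inline bool operator<=(const SortableRef& a, const SortableRef& b) { return a.key <= b.key; }
inline bool operator>=(const SortableRef& a, const SortableRef& b) { return a.key >= b.key; }
inline bool operator==(const SortableRef& a, const SortableRef& b) { return a.key == b.key; }
inline bool operator!=(const SortableRef& a, const SortableRef& b) { return a.key != b.key; }

// Templated accessors, usable with any struct exposing key and reference.
template <typename TSortable>
inline std::uint64_t GetKey(const TSortable& item) { return item.key; }

template <typename TSortable>
inline std::uint64_t GetReference(const TSortable& item) { return item.reference; }
```

With these operators in place, the struct can be passed to the generic ConditionalSwap variants as well as to standard algorithms such as std::sort.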
2.2.2 Implementing the Conditional Swap
The ConditionalSwap is implemented as a templated method like this:
template <typename TValueType>
inline
void ConditionalSwap(TValueType& left, TValueType& right)
{
//body
}
The following variants each represent the body of one specialization of the template function
for a specific struct. Each of them was given a three-letter abbreviation to name it in the
results. We implemented the following approaches:
• using std::swap (Def)
• using inline if statements (QMa)
• using std::tie and std::tuple (Tie)
• using jmp and xchg (JXc)
• using four cmovs and temp variables (4Cm)
• using four cmovs split from one another and temp variables (4CS)
• using six cmovs and temp variables (6Cm)
• moving pointers with cmov instead of values (Cla)
• moving pointers and supporting a predicate (CPr)
The details of implementation can be seen in the following paragraphs.
using std::swap (Def) The default implementation for the template makes use of the defined
< operator:
if (right < left)
std::swap(left, right);
This is the intuitive way of writing the conditional swap we already saw in section 2.1.3, without
any inline assembly.
using inline if statements (QMa)
bool r = (left > right);
auto temp = left;
left = r ? right : left;
right = r ? temp : right;
This variant tries to convince the compiler to generate conditional moves by using
inline if-statements (ternary expressions) with trivial values in the else part.
using std::tie and std::tuple (Tie)
std::tie(left, right) =
(right < left) ? std::make_tuple(right, left) : std::make_tuple(left, right);
This approach uses assignable tuples (tie).
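The three variants that need no inline assembly can be put side by side in one self-contained sketch. The struct Item and the function names SwapDef, SwapQMa and SwapTie are hypothetical stand-ins for the template specializations; the bodies follow the snippets above. Whether the compiler turns any of them into conditional moves depends on the compiler and its flags and is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>
#include <utility>

struct Item {
    uint64_t key, reference;
};
inline bool operator<(const Item& a, const Item& b) { return a.key < b.key; }
inline bool operator>(const Item& a, const Item& b) { return a.key > b.key; }

// Def: branch plus std::swap.
inline void SwapDef(Item& left, Item& right) {
    if (right < left) std::swap(left, right);
}

// QMa: ternary expressions with a trivial "else" value.
inline void SwapQMa(Item& left, Item& right) {
    bool r = (left > right);
    Item temp = left;
    left = r ? right : left;
    right = r ? temp : right;
}

// Tie: assign a tuple of values to a tuple of references.
inline void SwapTie(Item& left, Item& right) {
    std::tie(left, right) =
        (right < left) ? std::make_tuple(right, left) : std::make_tuple(left, right);
}
```

All three leave the element with the smaller key in left and move the reference value together with its key.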
using jmp and xchg (JXc)
__asm__(
"cmpq %[left_key],%[right_key]\n\t"
"jae %=f\n\t"
"xchg %[left_key],%[right_key]\n\t"
"xchg %[left_reference],%[right_reference]\n\t"
"%=:\n\t"
: [left_key] "=&r"(left.key), [right_key] "=&r"(right.key),
[left_reference] "=&r"(left.reference),
[right_reference] "=&r"(right.reference)
: "0"(left.key), "1"(right.key), "2"(left.reference), "3"(right.reference)
: "cc"
);
The %= generates a unique label for each instance of the asm statement, so that the jumps go
where they belong.
using four cmovs and temp variables (4Cm)
uint64_t tmp = left.key;
uint64_t tmpRef = left.reference;
__asm__(
"cmpq %[left_key],%[right_key]\n\t"
"cmovbq %[right_key],%[left_key]\n\t"
"cmovbq %[right_reference],%[left_reference]\n\t"
"cmovbq %[tmp],%[right_key]\n\t"
"cmovbq %[tmp_ref],%[right_reference]\n\t"
: [left_key] "=&r"(left.key), [right_key] "=&r"(right.key),
[left_reference] "=&r"(left.reference),
[right_reference] "=&r"(right.reference)
: "0"(left.key), "1"(right.key), "2"(left.reference), "3"(right.reference),
[tmp] "r"(tmp), [tmp_ref] "r"(tmpRef)
: "cc"
);
using four cmovs split from one another and temp variables (4CS)
uint64_t tmp = left.key;
uint64_t tmpRef = left.reference;
__asm__ volatile (
"cmpq %[left_key],%[right_key]\n\t"
:
: [left_key] "r"(left.key), [right_key] "r"(right.key)
: "cc"
);
__asm__ volatile (
"cmovbq %[right_key],%[left_key]\n\t"
: [left_key] "=&r"(left.key)
: "0"(left.key), [right_key] "r"(right.key)
:
);
__asm__ volatile (
"cmovbq %[right_reference],%[left_reference]\n\t"
: [left_reference] "=&r"(left.reference)
: "0"(left.reference), [right_reference] "r"(right.reference)
:
);
__asm__ volatile (
"cmovbq %[tmp],%[right_key]\n\t"
: [right_key] "=&r"(right.key)
: "0"(right.key), [tmp] "r"(tmp)
:
);
__asm__ volatile (
"cmovbq %[tmp_ref],%[right_reference]\n\t"
: [right_reference] "=&r"(right.reference)
: "0"(right.reference), [tmp_ref] "r"(tmpRef)
:
);
Because we split the asm blocks, they have to be declared volatile so that the optimizer does
not move them around or reorder them. Without declaring them volatile, some of the networks
were not sorting correctly. The blocks were split because we hoped the compiler would
be able to insert operations that do not affect the condition codes and are unrelated to the
current conditional swap between the cmp instruction and the conditional moves, to reduce the
number of wait cycles that have to be performed. This was successful, as can be seen in the
experimental results in section 4.4.
using six cmovs and temp variables (6Cm)
uint64_t tmp;
uint64_t tmpRef;
__asm__ (
"cmpq %[left_key],%[right_key]\n\t"
"cmovbq %[left_key],%[tmp]\n\t"
"cmovbq %[left_reference],%[tmp_ref]\n\t"
"cmovbq %[right_key],%[left_key]\n\t"
"cmovbq %[right_reference],%[left_reference]\n\t"
"cmovbq %[tmp],%[right_key]\n\t"
"cmovbq %[tmp_ref],%[right_reference]\n\t"
: [left_key] "=&r"(left.key), [right_key] "=&r"(right.key),
[left_reference] "=&r"(left.reference),
[right_reference] "=&r"(right.reference),
[tmp] "=&r"(tmp), [tmp_ref] "=&r"(tmpRef)
: "0"(left.key), "1"(right.key), "2"(left.reference), "3"(right.reference),
"4"(tmp), "5"(tmpRef)
: "cc"
);
moving pointers with cmov instead of values (Cla) This idea came from the assembly that
the clang compiler generated for the ConditionalSwap2 method shown in section 2.1.3.
To port it to gcc, we placed only the minimal instructions needed for the
conditional move into the asm block:
SortableRef_ClangVersion* leftPointer = &left;
SortableRef_ClangVersion* rightPointer = &right;
uint64_t rightKey = right.key;
SortableRef_ClangVersion tmp = left;
__asm__ volatile(
"cmpq %[tmp_key],%[right_key]\n\t"
"cmovbq %[right_pointer],%[left_pointer]\n\t"
: [left_pointer] "=&r"(leftPointer)
: "0"(leftPointer), [right_pointer] "r"(rightPointer),
[tmp_key] "m"(tmp.key), [right_key] "r"(rightKey)
: "cc"
);
left = *leftPointer;
leftPointer = &tmp;
__asm__ volatile(
"cmovbq %[left_pointer],%[right_pointer]\n\t"
: [right_pointer] "=&r"(rightPointer)
: "0"(rightPointer), [left_pointer] "r"(leftPointer)
:
);
right = *rightPointer;
moving pointers and supporting a predicate (CPr) Instead of performing the comparison
inside the asm block, which requires knowledge of the datatype of the key, it can also be done
over a predicate, using the result of that comparison inside the inline assembly:
SortableRef_ClangPredicate* leftPointer = &left;
SortableRef_ClangPredicate* rightPointer = &right;
SortableRef_ClangPredicate temp = left;
int predicateResult = (int) (right < temp);
__asm__ volatile(
"cmp $0,%[predResult]\n\t"
"cmovneq %[right_pointer],%[left_pointer]\n\t"
: [left_pointer] "=&r"(leftPointer)
: "0"(leftPointer), [right_pointer] "r"(rightPointer),
[predResult] "r"(predicateResult)
: "cc"
);
left = *leftPointer;
leftPointer = &temp;
__asm__ volatile(
"cmovneq %[left_pointer],%[right_pointer]\n\t"
: [right_pointer] "=&r"(rightPointer)
: "0"(rightPointer), [left_pointer] "r"(leftPointer)
:
);
right = *rightPointer;
For the Cla implementation the b in cmovb was used to execute the conditional move if
right_key was smaller than tmp_key. In the CPr implementation, the predicate returns true
in exactly that case, which as an int is a value not equal to zero. When comparing this result
to 0, the cmov is to be executed if the result was any value other than zero, so the postfix
here is ne (not equal).
Note that while the knowledge of how to compare the elements is still present by doing the
comparison directly (right < temp), the compiler now needs to take the result from the com-
parison, and put it into an integer that is then used in the asm block. The only addition to
make it completely independent from the sorted elements would be to pass a predicate to do the
comparison, which would also involve modifying the network frame to take and pass the pred-
icate. To measure on the same network frame we took this shortcut of doing the comparison
using the < operator.
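How a fully predicate-based variant could look is sketched below without inline assembly. The name ConditionalSwapWithPredicate and its signature are hypothetical, and the ternary pointer selections only stand in for the two cmovne instructions; the compiler is free to emit branches for them.

```cpp
#include <cassert>

// Sketch: the comparison is supplied by the caller, so the swap itself needs
// no knowledge of the element type. less(a, b) must return true if a belongs
// before b.
template <typename T, typename Less>
inline void ConditionalSwapWithPredicate(T& left, T& right, Less less) {
    T temp = left;
    const bool predicateResult = less(right, temp);

    // First pointer move: left receives right if the predicate holds.
    const T* leftPointer = predicateResult ? &right : &left;
    left = *leftPointer;

    // Second pointer move: right receives the saved copy of left if it holds.
    const T* rightPointer = predicateResult ? &temp : &right;
    right = *rightPointer;
}
```

Passing a different predicate, for example one that compares with >, sorts in descending order without any change to the swap.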
3 Register Sample Sort
3.1 Preliminaries
Sample sort is a sorting algorithm that follows the divide-and-conquer principle. The input
is separated into k subsets, each containing the elements within one interval of the total
ordering, with the intervals being disjoint from one another. That is done by first choosing
a sample S of a · k elements and sorting S. Afterwards the splitters
$\{s_0, s_1, \dots, s_{k-1}, s_k\} = \{-\infty, S_a, S_{2a}, \dots, S_{(k-1)a}, \infty\}$
are taken from S. The parameter a denotes the oversampling
factor. Oversampling is used to get a better sample of splitters to achieve more evenly-sized
partitions, at the cost of the time that is required to sort the larger sample.
With the splitters the elements $e_i$ are then classified, placing them into buckets $b_j$, where
$j \in \{1, \dots, k\}$ and $s_{j-1} < e_i \le s_j$. For k being a power of 2, this placement can be achieved by
viewing the splitters as a binary tree, with $s_{k/2}$ being the root, all $s_l$ with $l < k/2$ representing
the left subtree and those with $l > k/2$ the right one. To place an element, one must only
traverse this binary tree, resulting in a binary search instead of a linear one [SW04].
Quicksort is therefore a specialization of sample sort with fixed parameter k = 2, having only
one splitter, the pivot, and splitting the input into two partitions.
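The choice of splitters and the bucket condition can be sketched as follows. The function names ChooseSplitters and BucketOf are hypothetical, the elements are plain ints, the sample is assumed to be indexed from 0 (so that S_a is sample[a]), and bucket indices are 0-based here, whereas the text above counts buckets from 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts a sample of a*k elements and returns the k-1 inner splitters
// S_a, S_2a, ..., S_(k-1)a. The sentinels -infinity and +infinity are implicit.
inline std::vector<int> ChooseSplitters(std::vector<int> sample, std::size_t k, std::size_t a) {
    assert(sample.size() == a * k);
    std::sort(sample.begin(), sample.end());
    std::vector<int> splitters;
    for (std::size_t j = 1; j < k; ++j) {
        splitters.push_back(sample[j * a]);
    }
    return splitters;
}

// 0-based bucket index of e: the number of splitters strictly smaller than e.
// std::lower_bound performs the binary search over the sorted splitters.
inline std::size_t BucketOf(const std::vector<int>& splitters, int e) {
    return static_cast<std::size_t>(
        std::lower_bound(splitters.begin(), splitters.end(), e) - splitters.begin());
}
```

With k = 2 and a single splitter this degenerates to the partitioning step of quicksort, the splitter taking the role of the pivot.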
3.2 Implementing Sample Sort for medium-sized Sets
The motivation to look at sample sort was that we wanted to see how well the sorting networks
perform when using them as a base case for the In-Place Parallel Super Scalar Samplesort
(IPS4o) by Michael Axtmann, Sascha Witt, Daniel Ferizovic and Peter Sanders [AWFS17].
The problem that occurred is that IPS4o can go into the base case with sizes larger than 16,
while the networks we looked at only sort sets of up to 16 elements.
To close that gap, we created a sequential version of Super Scalar Sample Sort [SW04] that can
reduce base case sizes of up to 256 down to blocks of 16 or less in an efficient manner.
Since the total size was expected to not be much greater than 256, not much effort was made
to keep the algorithm in-place. The central idea was to place the splitters not into an array, as
described in [SW04], but to hold them in general purpose registers for the whole duration of
the element classification.
The question now arose as to which splitter an element needs to be compared to after the first
comparison with the middle splitter. When the splitters are organized in a binary heap in an
array, that can be done by using array indices, the children of splitter j being at positions $2j$
and $2j + 1$. If an element is smaller than $s_j$, it would afterwards be compared to $s_{2j}$, otherwise
to $s_{2j+1}$. But this way of accessing the splitters does not work when they are placed in registers.
The solution was to create a copy of the left subtree, and to conditionally overwrite that with
the right subtree, should the element be greater than the root node. The next comparison
is then made against the root of the temporary tree that now contains the correct splitters
to compare that element against. For 3 splitters that requires 1 conditional move, and for
7 splitters would require 3 conditional moves after the first comparison and 1 more after the
second comparison, per element.
After finding the correct splitters to compare to, we are left with one more problem: how to
know which bucket the element is to be placed into at the end. In [SW04] this was done by
making use of the calculated index determining the next splitter to compare to. We chose an
approach similar to creating this index, using the correlation between binary numbers and the
tree-like structure of the splitters. We will be viewing the splitters not as a binary heap but
just as a list where the middle of the list represents the root node of the tree, its children being
the middle element of the left and the middle element of the right list.
If an element $e_i$ is larger than the first splitter $s_{k/2}$ (with k − 1 being the number of splitters), it
must be placed in a bucket $b_j$ with $j \ge k/2$ (assuming 0-based indexing for b). That also means
that the index of that bucket, represented as a binary number, must have its bit at position
$l := \log \frac{k}{2}$ set to 1. That way, the result of the comparison $(e_i > s_{k/2})$ can be interpreted as an
integer (1 for true, 0 for false) and added to j. If that was not the last comparison, j is then
multiplied by 2 (meaning its bits are shifted left by one position). This means the bit from
the first comparison makes its way “left” in the binary representation while the comparison
traverses down the tree, and so forth with the other comparisons. After traversing the splitter
tree to the end, $e_i$ will have been compared to the correct splitters and j will hold the index
of the bucket that $e_i$ belongs in. These operations can be implemented without branches by
making use of the way comparisons are done:
At the end of section 2.1.4 we explained that when comparing (unsigned) numbers (which is
nothing but a subtraction), and the subtrahend being greater than the minuend, the operation
causes an underflow and the carry flag is set. We also notice that when converting the result
of the predicate $(e_i > s_{k/2})$ to an integer value, the integer will be 1 for true and 0 for false.
So in assembly code, we can compare the result from evaluating the predicate to the value 0:
cmp %[predResult],%[zero] where zero is just a register that holds the value 0. This trick is
needed because the cmp instruction needs the second operand to be a register. This will execute
0 − predResult, which underflows for the predicate returning true. This way we can postfix
the cmov needed for moving the next splitters with a c checking for a set carry flag. The second
instruction we make use of is the rotate carry left (rcl) instruction, which performs a rotate
left instruction on j, but includes the carry flag as an additional bit after the least significant
bit of the integer. This exactly takes the predicate result and puts it at the bottom of j, with
the previous content being shifted one to the left beforehand. That means it performs two
necessary operations at once.
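The way the bucket index is built bit by bit can be sketched in portable C++ for any power of two k. This is an illustration only: the function name ClassifyTree is hypothetical, the splitters are kept in a sorted array instead of registers, and the variable carry merely stands in for the carry flag that the cmp instruction would set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Classifies e into one of k buckets (0-based) using k-1 sorted splitters,
// where k is a power of two. The splitter list is read as a tree: the middle
// of the current sublist is the node to compare against.
inline unsigned ClassifyTree(const std::vector<uint64_t>& splitters, uint64_t e) {
    const std::size_t k = splitters.size() + 1;
    assert((k & (k - 1)) == 0);  // k must be a power of two
    unsigned j = 0;
    for (std::size_t width = k / 2; width >= 1; width /= 2) {
        // Middle splitter of the sublist selected by the bits collected so far.
        const std::size_t node = (2 * j + 1) * width - 1;
        // 1 if e is greater than the splitter; models the carry flag.
        const unsigned carry = (splitters[node] < e) ? 1u : 0u;
        // Shift j left and append the carry: the effect of rcl on j.
        j = (j << 1) | carry;
    }
    return j;
}
```

After the loop, j equals the number of splitters that are strictly smaller than e, which is the 0-based index of its bucket.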
As an addition to the efficient classification, we allow multiple elements to be classified
and placed into buckets per loop iteration, so that all the registers in the machine can be used.
This additional parameter is called blockSize.
There is one downside to this approach: The keys of the splitters (since we only need a splitter’s
key for classifying an element) must be small enough to fit into a general purpose register.
Needing more than one register per key would mean either running out of registers or spending
extra time to conditionally move the splitter keys around. For three splitters the needed
number of registers for block sizes 1 to 5 are as seen in table 1. We can see that the trade-off
for classifying multiple elements at the same time is the amount of registers needed.
If we were to use 7 splitters instead of three, the number of registers required for classifying just
1 element at a time would go up to 15. Also, with 8 buckets, if we get recursive subproblems
with sizes just over 16, classifying into 8 buckets again would be greatly inefficient, resulting in
many buckets containing very few elements. This is why we decided to only use three splitters
for this particular sorter.
Pseudocode to implement the classification can be seen as an example for an array of integers
and blockSize = 1 in algorithm 1. j is here called state, and the temporary subtree consists
of one splitter which we gave the name splitterx. For the branchless implementation we used
the cmovc for line 9 and the rcl instruction for line 10. At the last level of classification no more
moving of splitters is required, so instead of doing another comparison against the predicate
result and using rcl, we can just shift state left by one position and add the predicate’s result
to it (line 12). Alternatively we could use a bitwise OR or XOR after the shift, which would have
the same result. But we decided that adding the predicate result was more readable.
For sorting the splitter sample, the same sorting method can be used as for the base case.
                         3 splitters              7 splitters
                         block size               block size
                          1   2   3   4   5        1   2   3   4   5
splitters                 3   3   3   3   3        7   7   7   7   7
buckets pointer           1   1   1   1   1        1   1   1   1   1
current element index     1   1   1   1   1        1   1   1   1   1
element count             1   1   1   1   1        1   1   1   1   1
state                     1   2   3   4   5        1   2   3   4   5
predicate result          1   2   3   4   5        1   2   3   4   5
splitterx                 1   2   3   4   5        3   6   9  12  15
sum                       9  12  15  18  21       15  20  25  30  35
Table 1: Registers required by Register Sample Sort with three or seven splitters
Algorithm 1: Register Sample Sort Classification(array, elementCount, predicate)
 1  int splitter0, splitter1, splitter2 ← determineSplitters()
 2  int state, predicateResult, splitterx
 3  int* b0, b1, b2, b3 ← allocateBuckets(elementCount)
 4  for 1 ≤ i ≤ elementCount do
 5      state ← 0
 6      predicateResult ← (int) predicate(splitter1 < array[i])
 7      splitterx ← splitter0
 8      if predicateResult > 0 then
 9          splitterx ← splitter2
10          state ← (state << 1) + 1
11      predicateResult ← (int) predicate(splitterx < array[i])
12      state ← (state << 1) + predicateResult
13      place array[i] in bucket b_state
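A portable C++ rendering of this classification, followed by sorting and concatenating the buckets, might look like the sketch below. It is not the measured implementation: the function name ClassifyAndSort is hypothetical, the splitters are passed in instead of being determined from a sample, ternary expressions and shifts stand in for cmovc and rcl, and std::sort stands in for the sorting network used as the base case.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Classifies all elements into four buckets with three splitters
// (splitter0 <= splitter1 <= splitter2), then sorts each bucket and
// concatenates the results.
inline std::vector<uint64_t> ClassifyAndSort(const std::vector<uint64_t>& input,
                                             uint64_t splitter0, uint64_t splitter1,
                                             uint64_t splitter2) {
    std::array<std::vector<uint64_t>, 4> buckets;
    for (uint64_t element : input) {
        unsigned state = 0;
        unsigned predicateResult = (splitter1 < element) ? 1u : 0u;
        // Select the next splitter (stands in for the conditional move).
        const uint64_t splitterx = predicateResult ? splitter2 : splitter0;
        // Shift in the first comparison result (stands in for rcl).
        state = (state << 1) | predicateResult;
        predicateResult = (splitterx < element) ? 1u : 0u;
        // Last level: shift and add the predicate result.
        state = (state << 1) + predicateResult;
        buckets[state].push_back(element);
    }
    std::vector<uint64_t> result;
    for (auto& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end());  // stand-in for the base case sorter
        result.insert(result.end(), bucket.begin(), bucket.end());
    }
    return result;
}
```

Since every element of bucket j is no larger than any element of bucket j + 1, concatenating the sorted buckets yields a sorted sequence.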
4 Experimental Results
In the tests we ran, different sorting algorithms and conditional-swap implementations were
compared. For the details about the different sorters and swaps refer to section 2.2.
The names of the sorters are abbreviated according to the following format:
(i) It starts with an I or an N, indicating if the used algorithm is insertion sort or a sorting
network.
• In case of sorting networks, if it is a Best network or a Bose Nelson network (BoNe).
– For a Bose Nelson network whether it was optimized for Locality (L), Parallelism
(P) or generated to take the items as single parameters M (see section 2.2)
(ii) Then follows the type of benchmark, -N for sorting one set of items (“normal sort”,
section 4.4), -I for sorting many continuous sets of items (“inrow sort”, section 4.5), -S
for sorting with Sample Sort (section 4.7), -Q for sorting with quicksort (section 4.6) and
-4 for sorting with IPS4o (section 4.8).
• In case of Sample Sort, the Parameters numberOfSplitters, oversamplingFactor
and blockSize are appended as numbers
(iii) Lastly, the name of the struct used for the template specialization is appended (see
section 2.2.2 for the abbreviations for conditional swaps) as well as a single K for elements
that have only a key and KR for those that have a key and a reference value.
Where for comparison std::sort was run, the name in step (i) is StdSort.
For example, when measuring sample sort with parameters 332 and a Bose Nelson network
optimizing parallelism as the base case with conditional swap 4CS, the sorter name would be
N BoNeP -S332 KR 4CS.
4.1 Environment
Machine Name           A                        B                        C
CPU                    2 x Intel Xeon 8-core    2 x Intel Xeon 12-core   AMD Ryzen 8-core
                       E5-2650 v2 2.6 GHz       E5-2670 v3 2.3 GHz       1800X 3.6 GHz
RAM                    128 GiB DDR3             128 GiB DDR4             32 GB DDR4
L1 Cache (per Core)    32 KiB I + 32 KiB D      32 KiB I + 32 KiB D      64 KiB I + 32 KiB D
L2 Cache (per Core)    256 KiB                  256 KiB                  512 KiB
L3 Cache (total)       20 MiB                   30 MiB                   16 MiB [8 MiB]
Table 2: Hardware properties of the machines used
The code was compiled with the gcc C++ compiler in version 7.3.0, using the -O3 flag.
The measurements were done with only essential processes running on the machine apart from
the measurement. To prevent the process from being swapped to another core during execution
it was run with taskset 0x1.
In total, three different machines were used to do the measurements. Their hardware properties
can be seen in table 2. “I” and “D” refer to dedicated Instruction and Data caches. Also note
that while the AMD Ryzen’s L3 cache has a total size of 16 MiB, it is divided into two 8 MiB
caches that are exclusive to 4 cores each. Since all measurements were done on a single core,
the L3 cache size in brackets is the one available to the program. The operating system on all
machines was Ubuntu 18.04.
4.2 Generating Plots
Due to the high number of dimensions in the measurements (machine the measurement is run
on, type of network, conditional swap implementation, array size) the results could not always
be plotted two-dimensionally. We used box-plots where applicable to show more than just an
average value for a measurement. The box encloses all values between the first quartile (1Q) and
third quartile (3Q). The line in the middle shows the median. Further, the interquartile range
(IQR) is calculated as the distance between first and third quartile. The lines (called whiskers)
left and right of the boxes extend to the smallest value greater than 1Q − 1.5 · IQR and the greatest
value smaller than 3Q + 1.5 · IQR respectively. Values outside these ranges are called outliers and
are shown as individual dots.
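The quantities of such a box plot can be computed as in the following sketch. The names BoxStats, Quantile and ComputeBoxStats are hypothetical, and the quartiles are obtained by linear interpolation between neighboring values, which is only one of several common conventions and not necessarily the one used by the plotting tool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct BoxStats {
    double q1, median, q3;           // box and middle line
    double lowWhisker, highWhisker;  // ends of the whiskers
    std::vector<double> outliers;    // values drawn as individual dots
};

// Assumption: quantile by linear interpolation on the sorted values.
inline double Quantile(const std::vector<double>& sorted, double q) {
    const double pos = q * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

inline BoxStats ComputeBoxStats(std::vector<double> values) {
    assert(!values.empty());
    std::sort(values.begin(), values.end());
    BoxStats stats;
    stats.q1 = Quantile(values, 0.25);
    stats.median = Quantile(values, 0.5);
    stats.q3 = Quantile(values, 0.75);
    const double iqr = stats.q3 - stats.q1;
    const double lowFence = stats.q1 - 1.5 * iqr;
    const double highFence = stats.q3 + 1.5 * iqr;
    stats.lowWhisker = stats.q1;
    stats.highWhisker = stats.q3;
    bool first = true;
    for (double v : values) {  // values are visited in ascending order
        if (v < lowFence || v > highFence) {
            stats.outliers.push_back(v);
        } else {
            if (first) { stats.lowWhisker = v; first = false; }
            stats.highWhisker = v;
        }
    }
    return stats;
}
```

For the values 1 to 9 plus a single value of 100, the whiskers end at 1 and 9 and the value 100 is reported as an outlier.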
4.3 Conducting the Measurements
Random Numbers In order to measure the time needed to sort some data, one has to have
data first. For these measurements, the data consisted of pairs of a 64-bit unsigned integer key
and a 64-bit unsigned integer reference value. Those were generated as uniformly distributed
random numbers by a lightweight implementation of the std::minstd_rand generator from the
C++ <random> library that works as follows:
First a seed is set, taken e.g. from the current time. When a new random number is requested,
the generator calculates seed = seed · 48271 % 2147483647 and returns the current seed.
The numbers generated like that do not use all 64 bits available, which is only for practicality
with the permutation check as will be seen below.
For each measurement i, a new seedi is taken from the current time. The same seedi is then
set before the execution of each sorter, to provide all sorters with the same random inputs.
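A generator following this recurrence can be sketched in a few lines. The class name LightMinstdRand and the helper MatchesStdMinstdRand are hypothetical; the handling of a seed that is congruent to zero mirrors std::minstd_rand and is an assumption about the lightweight implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Sketch of a lightweight minstd_rand: seed = seed * 48271 % 2147483647.
class LightMinstdRand {
public:
    explicit LightMinstdRand(uint64_t seed) : seed_(seed % kModulus) {
        if (seed_ == 0) seed_ = 1;  // a zero state would stay zero forever
    }

    // Advances the state and returns the new seed. The product stays below
    // 2^47, so it cannot overflow a 64-bit integer.
    uint64_t operator()() {
        seed_ = seed_ * kMultiplier % kModulus;
        return seed_;
    }

private:
    static constexpr uint64_t kMultiplier = 48271;
    static constexpr uint64_t kModulus = 2147483647;  // 2^31 - 1
    uint64_t seed_;
};

// Compares the first `count` outputs with std::minstd_rand for the same seed.
inline bool MatchesStdMinstdRand(uint64_t seed, int count) {
    LightMinstdRand light(seed);
    std::minstd_rand reference(static_cast<std::minstd_rand::result_type>(seed));
    for (int i = 0; i < count; ++i) {
        if (light() != static_cast<uint64_t>(reference())) return false;
    }
    return true;
}
```

All generated values are smaller than 2^31, which is why the keys do not use all 64 available bits.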
Measuring The actual measuring was done via Linux's PERF_EVENT interface, which allows
fine-grained measurements. Here, the number of CPU cycles spent on sorting was the unit
of measurement. That also means that the results do not depend on clock speeds (e.g. when
overclocking), but only on the CPU's architecture.
Compilation When we started this project, it was only a single source file (.cpp) with an
increasing amount of headers that were all included in that single file. That is also due to the
fact that templated methods cannot be placed in source files because they need to be visible
to all including files at compile time. The increasing amount of code and the many different
templates brought the compiler to a point where it took over a minute to compile the project.
The problem we encountered was that the compiler only gives itself a limited amount of time for
compiling a single source file. In order to stay within the time boundaries for a single file, the
optimization became poor. We saw measurements being slower for no apparent reason. To solve
that problem, we used code generation to create source files that contain an acceptable amount
of methods that initiate part of a measurement in a wrapper method. This way, from the main
source file we only need to call the correct wrapper methods to perform the measurements, and
this way we achieved results that were more stable and reproducible.
For compilation, the flag -O3 was used to achieve high optimization and speed. That also
means that, without using the sorted data in some way, the compiler would deem the result
unimportant and skip the sorting altogether. That is why after each sort, to generate a side-
effect, the set is checked for two properties: That it is sorted, and that it is a permutation of
the previously generated set. The first can easily be done by checking for each value that it is
not greater than the value before it.
Permutation Check The permutation check is done probabilistically: At design time, a
(preferably large) prime number p is chosen.
Before sorting, $v = \prod_{i=1}^{n} (z - a_i) \bmod p$ is calculated for a number z and values
$a = \{a_1, \dots, a_n\}$. To check the permutation after sorting and obtaining
$a' = \{a'_1, \dots, a'_n\}$, $w = \prod_{i=1}^{n} (z - a'_i) \bmod p$
is calculated. If $v \neq w$, $a'$ cannot be a permutation of $a$. If $v = w$, we claim that $a'$ is a
permutation of $a$.
To minimize the chances of $a'$ not being a permutation of $a$, but $v$ being equal to $w$, $v = 0$ was
disallowed in the first step. If $v$ is zero, $z$ is incremented by one and the product calculated
again, until $v \neq 0$.
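The check can be sketched as follows. The names Fingerprint, PermutationCheck, PreparePermutationCheck and LooksLikePermutation are hypothetical, and the prime 2^32 − 5 is an assumed choice that keeps every intermediate product within 64 bits; the prime actually used is not stated here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed prime p = 2^32 - 5. Two residues below 2^32 multiply to less than 2^64.
constexpr uint64_t kPrime = 4294967291ULL;

// Computes the product of (z - a_i) over all i, modulo p.
inline uint64_t Fingerprint(const std::vector<uint64_t>& a, uint64_t z) {
    uint64_t product = 1;
    for (uint64_t x : a) {
        const uint64_t factor = (z % kPrime + kPrime - x % kPrime) % kPrime;
        product = product * factor % kPrime;
    }
    return product;
}

struct PermutationCheck {
    uint64_t z;  // evaluation point that was finally used
    uint64_t v;  // fingerprint of the unsorted input, never zero
};

// Before sorting: increment z until the fingerprint is non-zero.
inline PermutationCheck PreparePermutationCheck(const std::vector<uint64_t>& a, uint64_t z) {
    uint64_t v = Fingerprint(a, z);
    while (v == 0) {
        ++z;
        v = Fingerprint(a, z);
    }
    return {z, v};
}

// After sorting: false means "certainly not a permutation", true means
// "a permutation with high probability".
inline bool LooksLikePermutation(const PermutationCheck& check,
                                 const std::vector<uint64_t>& sorted) {
    return Fingerprint(sorted, check.z) == check.v;
}
```

The product is zero exactly when z is congruent to one of the values, so the loop that skips v = 0 ends after at most n increments.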
Benchmarks The benchmark seen in algorithm 2 was used for most of the measurements.
To reduce the chance of cache misses at the beginning of the measurement, one warmup run of
random generation, sorting and sorted checking is done beforehand (lines 5 to 7). The array is
then sorted numberOfIterations times and checked for the sorted and permutation properties.
After that only the generation of the random numbers and the sorted and permutation checking
is measured, to later subtract the time from the previously measured one, resulting in the time
needed for the sorting alone. Since this is not deterministic in time, and both measurements
are subject to their own deviation, it can occasionally happen that the second measurement
takes longer than the first, even though less work has been done. We get those negative times
more often for the sorters with small array sizes, where the sorting itself takes relatively little
time compared to the random generation and sorted checking. The negative times show up as
outliers in the results.
The function simulateCheckSorted checks the permutation like checkSorted, but since randomly
generated arrays are rarely ordered, instead of checking for each element whether it is smaller
than its predecessor, it checks for equality. Equal neighbors should never occur with the random
number generator used, so the function runs for the same number of cycles as checkSorted.
The function MeasureSorting is called a total of numberOfMeasures times for each arraySize
that is sorted.
For the measurements shown in section 4.5 the benchmark was slightly modified as can be seen
in algorithm 3. Here the goal was to look at cache- and memory-effects by creating an array
that does not fit into the CPU’s L3-cache, and then filling the cache with something else, in this
case the reference array. We then split the original array into many blocks of size arraySize
and sort each independently. Because we have to create the whole array at the beginning, we
can generate the numbers before and check for correct sorting after measuring, so there is no
need to do a second measurement like in the first benchmark (lines 15 to 21 in algorithm 2).
Here, instead of giving a numberOfIterations parameter to indicate how often the sorting is to
be executed, we provide a numberOfArrays value that says how many arrays of size arraySize
are to be created contiguously. This parameter is chosen for each arraySize in a way that
numberOfArrays × arraySize does not fit into the L3 cache of the machine the measurement
is performed on.
Algorithm 2: MeasureSorting(arraySize, numberOfIterations, seed)
 1  foreach sorter do
 2      setSeed(seed)
 3      arr ← makeArray(arraySize)
 4      numberOfBadSorts ← 0
 5      arr ← generateRandomArray()
 6      sorter(arr)
 7      checkSorted(arr) // create side-effect
 8      startMeasuring()
 9      for i ← 0 to numberOfIterations do
10          arr ← generateRandomArray()
11          sorter(arr)
12          checkSorted(arr) // create side-effect
13      stopMeasuring()
14      outputResult()
15      setSeed(seed)
16      startMeasuring()
17      for i ← 0 to numberOfIterations do
18          arr ← generateRandomArray()
19          simulateCheckSorted(arr) // create side-effect
20      stopMeasuring()
21      outputResult()
Algorithm 3: MeasureSortingInRow(arraySize, numberOfArrays, seed)
 1  foreach sorter do
 2      SetSeed(seed)
 3      arr ← makeArray(arraySize × numberOfArrays)
 4      arr ← GenerateRandomArray()
 5      compareArr ← makeArray(arraySize × numberOfArrays)
 6      compareArr ← CopyArray(arr)
 7      foreach currentArr in compareArr of size arraySize do
 8          sort(currentArr, arraySize) // sort reference array
 9      // warmup on single array of size arraySize like in algorithm 2, lines 5 to 7
10      StartMeasuring()
11      foreach currentArr in arr of size arraySize do
12          sorter(currentArr, arraySize)
13      StopMeasuring()
14      CheckArraysForEquality(arr, compareArr) // check correct sorting, create side-effect
15      OutputResult()
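The in-row benchmark can be sketched for a single sorter as shown below. This is not the measurement code: the names InRowResult and MeasureSortingInRow are hypothetical, the warmup run is left out, std::chrono::steady_clock replaces the cycle counter read through the PERF_EVENT interface, and the array length is assumed to be a multiple of arraySize.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

struct InRowResult {
    bool sortedCorrectly;              // outcome of the equality check
    std::chrono::nanoseconds elapsed;  // wall-clock stand-in for CPU cycles
};

// Sorts consecutive blocks of arraySize elements with `sorter` and compares
// the outcome with a reference copy sorted block by block with std::sort.
template <typename Sorter>
inline InRowResult MeasureSortingInRow(std::vector<uint64_t> arr, std::size_t arraySize,
                                       Sorter sorter) {
    assert(arraySize > 0 && arr.size() % arraySize == 0);

    // Reference array: also pushes `arr` out of the cache if it is large enough.
    std::vector<uint64_t> compareArr = arr;
    for (std::size_t offset = 0; offset < compareArr.size(); offset += arraySize) {
        std::sort(compareArr.begin() + static_cast<std::ptrdiff_t>(offset),
                  compareArr.begin() + static_cast<std::ptrdiff_t>(offset + arraySize));
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t offset = 0; offset < arr.size(); offset += arraySize) {
        sorter(arr.data() + offset, arraySize);
    }
    const auto stop = std::chrono::steady_clock::now();

    // The comparison is the side-effect that keeps the sorting from being optimized away.
    return {arr == compareArr,
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Any callable that takes a pointer and a size can be passed as the sorter, for example a lambda wrapping insertion sort or one of the sorting networks.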
4.4 Sorting one set of 2-16 items
The benchmark from algorithm 2 was used with parameters
• numberOfIterations = 100
• numberOfMeasures = 500
• arraySize ∈ {2, . . . , 16}.
The results seen in tables 3, 4 and 5 contain the name of the sorter and the average number of
cycles per iteration, over the total of all measurements, for machines A, B and C. The algorithm
that performed best in a column is marked in bold font, and for each column the value relative
to the best in that column was calculated. For each row the geometric mean is calculated over
the relative values and from that the rank is determined.
Table 6 contains the geometric mean and rank taking the results from all three machines into
consideration.
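The ranking procedure can be sketched as follows. The function names GeometricMeans and Ranks are hypothetical, and the numbers in the usage note are made up for illustration; they are not measurement results.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// cycles[r][c] holds the average cycles of sorter r for array size c.
// Returns, per sorter, the geometric mean of its values relative to the
// best (smallest) value in each column.
inline std::vector<double> GeometricMeans(const std::vector<std::vector<double>>& cycles) {
    const std::size_t rows = cycles.size();
    const std::size_t cols = rows == 0 ? 0 : cycles[0].size();
    std::vector<double> best(cols, std::numeric_limits<double>::infinity());
    for (const auto& row : cycles) {
        for (std::size_t c = 0; c < cols; ++c) best[c] = std::min(best[c], row[c]);
    }
    std::vector<double> result;
    for (const auto& row : cycles) {
        double logSum = 0.0;
        for (std::size_t c = 0; c < cols; ++c) logSum += std::log(row[c] / best[c]);
        result.push_back(std::exp(logSum / static_cast<double>(cols)));
    }
    return result;
}

// Rank 1 goes to the smallest geometric mean; equal means share a rank.
inline std::vector<std::size_t> Ranks(const std::vector<double>& means) {
    std::vector<std::size_t> ranks(means.size(), 1);
    for (std::size_t i = 0; i < means.size(); ++i) {
        for (std::size_t j = 0; j < means.size(); ++j) {
            if (means[j] < means[i]) ++ranks[i];
        }
    }
    return ranks;
}
```

For three hypothetical sorters with cycle counts {10, 20}, {20, 20} and {40, 80}, the column minima are 10 and 20, the geometric means are 1, √2 and 4, and the ranks are 1, 2 and 3.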
Here it becomes visible that the implementations that have conditional branches and those that
do not are clearly separated by rank: the former occupy the worse ranks, while the
latter take all the better ranks. We see that the claim from section 2.2.2 for the 4CS conditional
swap is true for machines A and B, but not for machine C. We also see in table 6 that the first
three ranks have the same geometric mean, so the Bose Nelson networks can compete with the
optimized networks that have fewer comparators due to their locality.
The boxplots for array size 8 are given for each machine in figures 5, 6 and 7, showing that
these higher-ranked implementations are not only faster on average, but that their distribution
is almost entirely faster than any of the insertion sort implementations, together with a lower
variance. To improve readability, the variants JXc, 6Cm and QMa are omitted. Also, one outlier
with value −42.6 was removed from the dataset of machine B for the 'N BoNeL -N KR Cla' sorter
so that the plot has a scale similar to those of the other two machines, to improve comparability.
The result set for machine A contains a lot of outliers that we did not want to exclude. To be
able to compare it easily with the other two plots we added an additional axis at the top that
shows the CPU cycles per iteration as percentages where the average of the best insertion sort
is 100%.
To see a trend in increasing array size, we chose a few Conditional Swap implementations that
do best for more than one network and array size on all machines. Their average sorting times
can be seen in figures 8, 9 and 10. For visibility reasons, we omitted the Bose Nelson Parameter
networks in these plots. What we already saw from the tables is visible here as well: the 4Cm
and 4CS implementations have good performance and are almost always faster on average than
insertion sort (apart from arraySize = 2 on machine A).
These results indicate that there is potential in using sorting networks, showing an improvement
of 32% of the best network over the best insertion sort, on average, for any array size. Problems
with this way of measurement are that the same space in memory is sorted over and over again,
which is rarely a use case when sorting a base case. Because of this, the measurements probably
reflect unrealistic conditions regarding cache accesses and cache misses. To get a bit closer to
actual base case sorting, the next section has a different approach to not sort the same space
in memory twice.
| Sorter | Overall Rank | Overall GeoM | Array Size 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I -N KR POp | 22 | 1.85 | 15.21 | 37.82 | 80.52 | 124.17 | 166.14 | 204.37 | 250.08 | 282.97 | 323.87 | 369.31 | 417.57 | 437.93 | 509.37 | 520.20 | 579.44 |
| I -N KR STL | 26 | 1.92 | 13.82 | 39.99 | 83.75 | 128.44 | 178.50 | 213.03 | 257.62 | 287.15 | 346.35 | 382.08 | 434.47 | 455.29 | 532.36 | 554.80 | 614.56 |
| I -N KR Def | 33 | 2.07 | 17.23 | 40.53 | 84.80 | 132.12 | 178.09 | 220.69 | 277.29 | 311.27 | 378.66 | 412.51 | 475.48 | 499.27 | 595.68 | 614.74 | 693.90 |
| I -N KR AIF | 36 | 2.21 | 16.55 | 50.76 | 90.56 | 148.37 | 202.86 | 252.22 | 307.29 | 342.66 | 400.98 | 442.48 | 485.63 | 518.62 | 593.07 | 609.29 | 672.98 |
| N Best -N KR 4CS | 1 | 1.07 | 11.59 | 24.11 | 34.35 | 66.58 | 82.54 | 96.95 | 125.56 | 134.92 | 183.73 | 218.14 | 254.95 | 278.53 | 356.75 | 353.24 | 395.58 |
| N Best -N KR 4Cm | 5 | 1.12 | 16.33 | 24.21 | 38.05 | 54.80 | 74.74 | 85.23 | 127.40 | 141.99 | 201.85 | 238.39 | 279.15 | 301.47 | 375.39 | 399.06 | 450.54 |
| N Best -N KR Cla | 8 | 1.23 | 8.91 | 31.22 | 40.41 | 86.73 | 110.04 | 144.05 | 163.77 | 188.67 | 220.54 | 240.00 | 298.76 | 285.42 | 347.36 | 349.35 | 400.98 |
| N Best -N KR CPr | 10 | 1.27 | 8.13 | 32.20 | 46.88 | 87.99 | 112.46 | 146.10 | 164.91 | 190.61 | 208.00 | 256.03 | 296.58 | 301.14 | 382.17 | 381.71 | 438.18 |
| N Best -N KR 6Cm | 13 | 1.37 | 17.20 | 25.57 | 46.86 | 64.68 | 96.36 | 107.56 | 146.34 | 176.34 | 259.56 | 289.24 | 339.02 | 392.64 | 490.05 | 502.11 | 585.95 |
| N Best -N KR Def | 21 | 1.84 | 20.09 | 37.08 | 73.94 | 113.14 | 144.92 | 179.09 | 248.33 | 268.15 | 302.44 | 338.76 | 417.69 | 429.27 | 555.59 | 552.84 | 712.22 |
| N Best -N KR Tie | 25 | 1.90 | 20.47 | 38.06 | 63.58 | 98.92 | 139.21 | 182.82 | 238.59 | 271.77 | 316.34 | 369.96 | 477.94 | 519.14 | 597.69 | 639.07 | 753.42 |
| N Best -N KR JXc | 32 | 2.04 | 18.44 | 36.50 | 68.67 | 113.06 | 167.99 | 207.12 | 264.82 | 293.56 | 347.16 | 409.67 | 506.11 | 522.21 | 680.63 | 711.20 | 791.41 |
| N Best -N KR QMa | 37 | 2.60 | 17.72 | 44.69 | 96.02 | 149.03 | 207.73 | 252.95 | 341.19 | 397.81 | 438.98 | 573.64 | 681.13 | 700.09 | 832.27 | 910.77 | 1057.45 |
| N BoNeL -N KR 4CS | 2 | 1.08 | 11.45 | 24.99 | 35.82 | 67.94 | 82.41 | 98.05 | 128.16 | 132.68 | 186.46 | 224.66 | 262.69 | 275.88 | 344.76 | 352.46 | 387.59 |
| N BoNeL -N KR 4Cm | 3 | 1.11 | 13.53 | 25.18 | 38.33 | 55.62 | 76.06 | 86.06 | 132.02 | 142.06 | 193.99 | 232.70 | 284.86 | 302.12 | 383.21 | 386.96 | 423.71 |
| N BoNeL -N KR 6Cm | 15 | 1.42 | 15.96 | 28.18 | 45.90 | 73.18 | 90.18 | 115.24 | 148.93 | 214.27 | 278.32 | 298.28 | 365.05 | 414.27 | 493.26 | 508.85 | 560.16 |
| N BoNeL -N KR Cla | 16 | 1.42 | 8.72 | 31.04 | 40.71 | 82.99 | 112.46 | 143.75 | 163.76 | 239.80 | 270.36 | 325.30 | 354.56 | 403.83 | 452.67 | 493.90 | 550.58 |
| N BoNeL -N KR CPr | 17 | 1.44 | 9.03 | 33.01 | 47.21 | 88.37 | 113.62 | 147.21 | 166.12 | 238.67 | 265.87 | 321.12 | 347.10 | 401.81 | 446.13 | 482.28 | 536.51 |
| N BoNeL -N KR Tie | 27 | 1.93 | 20.87 | 40.57 | 64.35 | 99.65 | 137.68 | 173.18 | 231.53 | 265.85 | 343.47 | 383.88 | 472.81 | 513.56 | 636.85 | 676.79 | 782.22 |
| N BoNeL -N KR Def | 28 | 1.94 | 20.11 | 40.52 | 78.60 | 102.58 | 139.42 | 170.73 | 237.18 | 265.27 | 366.93 | 372.58 | 478.09 | 481.87 | 637.88 | 636.35 | 764.29 |
| N BoNeL -N KR JXc | 31 | 2.04 | 18.95 | 36.18 | 68.86 | 108.92 | 160.18 | 196.61 | 256.54 | 314.83 | 368.23 | 427.12 | 504.82 | 570.41 | 642.64 | 662.73 | 789.99 |
| N BoNeL -N KR QMa | 38 | 2.67 | 18.16 | 45.95 | 93.38 | 143.03 | 196.39 | 241.82 | 326.07 | 406.05 | 514.55 | 578.58 | 685.08 | 776.92 | 912.15 | 998.34 | 1163.99 |
| N BoNeM -N KR 4Cm | 7 | 1.22 | 16.08 | 27.11 | 38.84 | 54.86 | 82.51 | 94.36 | 119.90 | 214.28 | 252.78 | 251.85 | 284.48 | 318.74 | 415.51 | 394.31 | 500.68 |
| N BoNeM -N KR 4CS | 11 | 1.28 | 11.79 | 24.70 | 43.96 | 73.23 | 83.76 | 130.14 | 115.48 | 204.19 | 273.27 | 286.72 | 281.43 | 314.56 | 487.85 | 464.53 | 518.94 |
| N BoNeM -N KR 6Cm | 18 | 1.51 | 15.65 | 27.98 | 52.71 | 93.07 | 85.76 | 113.21 | 134.35 | 223.11 | 340.69 | 340.96 | 393.85 | 434.99 | 575.98 | 499.30 | 630.78 |
| N BoNeM -N KR Cla | 19 | 1.63 | 13.05 | 32.46 | 54.07 | 90.32 | 116.27 | 149.86 | 175.80 | 278.17 | 314.13 | 348.62 | 395.85 | 448.91 | 559.24 | 566.87 | 639.09 |
| N BoNeM -N KR CPr | 20 | 1.67 | 15.10 | 33.91 | 46.42 | 103.89 | 120.13 | 153.62 | 203.55 | 287.35 | 322.22 | 347.58 | 374.67 | 423.75 | 521.46 | 565.28 | 726.59 |
| N BoNeM -N KR Def | 29 | 1.94 | 18.38 | 39.34 | 75.91 | 113.68 | 157.87 | 199.64 | 237.50 | 259.60 | 352.19 | 369.31 | 455.58 | 479.48 | 596.34 | 633.54 | 748.67 |
| N BoNeM -N KR Tie | 30 | 2.01 | 21.34 | 38.64 | 62.66 | 96.05 | 135.19 | 172.29 | 229.37 | 265.31 | 368.96 | 452.05 | 554.50 | 544.95 | 769.84 | 767.68 | 788.78 |
| N BoNeM -N KR JXc | 35 | 2.18 | 19.47 | 38.29 | 70.01 | 108.73 | 147.29 | 184.76 | 252.61 | 358.55 | 454.97 | 478.92 | 594.31 | 589.14 | 769.05 | 751.62 | 924.53 |
| N BoNeM -N KR QMa | 39 | 2.71 | 24.28 | 54.14 | 100.79 | 136.42 | 204.57 | 251.56 | 325.24 | 403.05 | 480.82 | 547.65 | 651.81 | 739.13 | 864.35 | 966.34 | 1092.30 |
| N BoNeP -N KR 4CS | 4 | 1.11 | 11.41 | 24.83 | 35.67 | 61.12 | 84.51 | 96.59 | 130.30 | 159.81 | 199.35 | 230.24 | 271.39 | 302.50 | 363.86 | 388.34 | 422.56 |
| N BoNeP -N KR 4Cm | 6 | 1.14 | 13.00 | 25.08 | 38.14 | 53.95 | 74.04 | 94.47 | 119.49 | 151.09 | 209.73 | 237.70 | 291.55 | 334.29 | 398.82 | 426.06 | 468.23 |
| N BoNeP -N KR Cla | 9 | 1.25 | 8.81 | 31.30 | 41.47 | 80.30 | 100.54 | 130.65 | 147.00 | 211.21 | 233.02 | 265.58 | 286.03 | 320.45 | 363.05 | 385.07 | 438.35 |
| N BoNeP -N KR CPr | 12 | 1.28 | 9.68 | 33.04 | 47.46 | 82.36 | 94.76 | 116.33 | 147.20 | 212.97 | 223.40 | 267.75 | 289.20 | 337.12 | 401.57 | 417.11 | 468.35 |
| N BoNeP -N KR 6Cm | 14 | 1.41 | 15.17 | 27.86 | 45.69 | 66.43 | 86.38 | 102.07 | 151.31 | 207.81 | 275.71 | 317.69 | 375.96 | 418.47 | 505.54 | 545.92 | 617.54 |
| N BoNeP -N KR Tie | 23 | 1.88 | 20.56 | 37.00 | 64.49 | 97.05 | 131.03 | 171.37 | 223.33 | 270.21 | 324.68 | 395.20 | 462.10 | 534.33 | 590.53 | 642.31 | 744.88 |
| N BoNeP -N KR Def | 24 | 1.88 | 19.80 | 41.86 | 74.79 | 111.41 | 144.15 | 174.45 | 237.91 | 265.89 | 325.01 | 350.58 | 420.83 | 473.65 | 563.51 | 599.15 | 727.96 |
| N BoNeP -N KR JXc | 34 | 2.08 | 20.22 | 36.90 | 69.50 | 109.37 | 150.28 | 188.40 | 251.74 | 293.71 | 379.19 | 439.07 | 525.51 | 585.25 | 719.42 | 744.81 | 844.24 |
| N BoNeP -N KR QMa | 40 | 2.79 | 24.23 | 52.09 | 99.01 | 148.85 | 192.75 | 263.79 | 338.27 | 395.25 | 517.71 | 584.97 | 677.76 | 794.20 | 938.78 | 1006.31 | 1166.80 |

Table 3: Average number of CPU cycles per iteration of single array sorting on machine A
| Sorter | Overall Rank | Overall GeoM | Array Size 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I -N KR POp | 25 | 1.84 | 12.56 | 36.78 | 73.37 | 111.91 | 151.52 | 183.05 | 221.86 | 263.07 | 302.35 | 353.36 | 399.83 | 439.54 | 475.60 | 508.11 | 550.21 |
| I -N KR STL | 29 | 1.93 | 10.97 | 37.26 | 78.04 | 122.75 | 161.05 | 201.27 | 247.30 | 280.78 | 321.94 | 376.15 | 420.50 | 461.52 | 493.07 | 519.96 | 559.69 |
| I -N KR Def | 32 | 1.98 | 14.03 | 40.14 | 76.93 | 117.68 | 157.59 | 198.50 | 242.60 | 280.73 | 322.72 | 375.54 | 422.67 | 465.00 | 511.51 | 558.47 | 599.95 |
| I -N KR AIF | 36 | 2.32 | 14.98 | 54.78 | 92.20 | 135.59 | 195.79 | 245.14 | 292.73 | 327.82 | 380.85 | 438.38 | 489.94 | 538.38 | 573.31 | 612.57 | 656.83 |
| N Best -N KR 4CS | 1 | 1.06 | 8.00 | 22.10 | 35.70 | 60.36 | 71.61 | 93.40 | 112.30 | 144.17 | 171.34 | 214.03 | 238.42 | 285.88 | 316.71 | 344.10 | 364.40 |
| N Best -N KR 4Cm | 3 | 1.10 | 8.75 | 20.34 | 34.47 | 54.70 | 70.88 | 90.27 | 115.48 | 137.92 | 189.67 | 222.14 | 262.03 | 307.09 | 353.77 | 391.72 | 432.70 |
| N Best -N KR CPr | 8 | 1.19 | 5.23 | 25.15 | 38.71 | 71.13 | 92.24 | 124.12 | 145.15 | 188.31 | 194.53 | 234.72 | 266.15 | 307.63 | 343.99 | 372.53 | 409.52 |
| N Best -N KR Cla | 9 | 1.20 | 6.86 | 28.30 | 38.70 | 75.69 | 92.84 | 129.24 | 146.96 | 190.31 | 196.30 | 233.05 | 261.56 | 278.43 | 306.94 | 335.57 | 363.08 |
| N Best -N KR 6Cm | 14 | 1.36 | 9.52 | 24.23 | 39.97 | 69.73 | 88.59 | 111.52 | 136.40 | 174.35 | 228.98 | 283.96 | 321.57 | 402.13 | 458.39 | 499.02 | 553.25 |
| N Best -N KR Def | 21 | 1.78 | 16.38 | 33.75 | 64.57 | 95.65 | 125.05 | 164.26 | 204.86 | 246.37 | 265.34 | 336.21 | 399.28 | 428.57 | 508.38 | 550.37 | 631.51 |
| N Best -N KR Tie | 22 | 1.82 | 16.04 | 33.74 | 57.37 | 88.86 | 114.24 | 165.85 | 203.09 | 250.32 | 280.61 | 352.47 | 435.29 | 508.75 | 539.74 | 598.76 | 685.05 |
| N Best -N KR JXc | 33 | 2.00 | 16.52 | 33.44 | 63.44 | 98.88 | 138.36 | 179.19 | 218.44 | 282.01 | 309.11 | 385.21 | 480.67 | 541.52 | 626.83 | 665.20 | 766.23 |
| N Best -N KR QMa | 37 | 2.53 | 14.98 | 42.43 | 86.31 | 140.41 | 184.73 | 235.35 | 301.62 | 364.06 | 383.06 | 539.44 | 623.15 | 659.71 | 744.50 | 846.43 | 932.15 |
| N BoNeL -N KR 4CS | 2 | 1.08 | 8.79 | 23.49 | 37.04 | 62.21 | 72.40 | 93.37 | 112.70 | 142.31 | 173.74 | 218.67 | 237.75 | 281.41 | 309.13 | 342.61 | 349.19 |
| N BoNeL -N KR 4Cm | 4 | 1.11 | 9.06 | 21.77 | 35.45 | 55.14 | 72.10 | 90.98 | 116.37 | 142.51 | 181.85 | 223.36 | 263.63 | 319.19 | 355.23 | 380.37 | 404.78 |
| N BoNeL -N KR CPr | 13 | 1.34 | 6.54 | 27.47 | 39.79 | 72.94 | 93.51 | 125.24 | 145.99 | 215.92 | 243.90 | 285.57 | 315.19 | 374.44 | 404.65 | 434.14 | 454.70 |
| N BoNeL -N KR Cla | 15 | 1.39 | 7.30 | 29.84 | 39.35 | 76.71 | 92.91 | 130.63 | 147.23 | 226.02 | 247.07 | 293.28 | 324.55 | 393.59 | 415.47 | 460.03 | 477.18 |
| N BoNeL -N KR 6Cm | 16 | 1.43 | 10.65 | 25.90 | 40.56 | 70.90 | 87.99 | 113.19 | 137.92 | 217.54 | 260.44 | 292.33 | 353.98 | 427.97 | 475.74 | 503.83 | 529.54 |
| N BoNeL -N KR Tie | 26 | 1.87 | 17.59 | 33.75 | 57.62 | 87.79 | 115.16 | 157.70 | 198.00 | 245.89 | 307.31 | 367.01 | 444.25 | 506.00 | 579.13 | 658.81 | 722.07 |
| N BoNeL -N KR Def | 27 | 1.89 | 16.64 | 36.37 | 67.23 | 90.58 | 120.61 | 158.18 | 200.27 | 250.83 | 320.55 | 355.67 | 453.90 | 487.58 | 575.63 | 625.45 | 688.18 |
| N BoNeL -N KR JXc | 31 | 1.97 | 16.28 | 33.15 | 62.43 | 91.86 | 133.58 | 171.44 | 214.22 | 279.84 | 338.71 | 398.70 | 456.86 | 536.44 | 615.27 | 660.74 | 748.77 |
| N BoNeL -N KR QMa | 38 | 2.63 | 14.51 | 43.23 | 82.70 | 134.53 | 178.88 | 225.65 | 290.45 | 391.15 | 462.34 | 547.55 | 646.34 | 742.97 | 835.85 | 940.49 | 1057.89 |
| N BoNeM -N KR 4Cm | 7 | 1.17 | 9.86 | 21.49 | 34.89 | 55.43 | 72.68 | 89.56 | 106.83 | 212.18 | 224.25 | 233.20 | 260.33 | 325.45 | 378.21 | 405.97 | 438.89 |
| N BoNeM -N KR 4CS | 12 | 1.28 | 8.90 | 23.66 | 41.74 | 72.84 | 72.88 | 129.91 | 108.65 | 206.26 | 248.00 | 262.31 | 262.53 | 321.93 | 431.00 | 445.25 | 452.85 |
| N BoNeM -N KR 6Cm | 18 | 1.56 | 11.72 | 25.59 | 48.66 | 95.36 | 84.11 | 110.15 | 133.02 | 234.05 | 318.24 | 343.21 | 382.43 | 447.82 | 527.23 | 528.53 | 599.74 |
| N BoNeM -N KR CPr | 19 | 1.58 | 11.92 | 29.69 | 39.66 | 93.65 | 94.49 | 144.63 | 168.80 | 258.49 | 301.06 | 331.26 | 341.15 | 432.34 | 444.36 | 531.04 | 566.47 |
| N BoNeM -N KR Cla | 20 | 1.60 | 12.11 | 33.08 | 47.86 | 88.18 | 94.52 | 149.64 | 147.27 | 249.49 | 280.22 | 330.82 | 350.09 | 443.00 | 490.15 | 546.82 | 537.15 |
| N BoNeM -N KR Def | 28 | 1.92 | 15.74 | 35.40 | 67.25 | 105.51 | 137.72 | 179.21 | 211.66 | 250.63 | 320.06 | 350.38 | 443.83 | 481.71 | 549.25 | 621.84 | 689.81 |
| N BoNeM -N KR Tie | 30 | 1.96 | 16.95 | 33.72 | 56.67 | 86.72 | 115.31 | 154.85 | 198.75 | 245.12 | 339.09 | 443.65 | 526.52 | 553.74 | 695.92 | 741.39 | 723.53 |
| N BoNeM -N KR JXc | 35 | 2.14 | 16.46 | 35.71 | 60.93 | 95.72 | 123.50 | 167.92 | 218.13 | 359.78 | 433.33 | 449.88 | 539.73 | 577.80 | 689.96 | 729.62 | 862.97 |
| N BoNeM -N KR QMa | 39 | 2.66 | 20.64 | 48.64 | 90.61 | 126.40 | 185.62 | 229.86 | 300.87 | 373.85 | 431.10 | 518.01 | 615.57 | 735.18 | 780.40 | 895.37 | 951.65 |
| N BoNeP -N KR 4CS | 5 | 1.12 | 8.46 | 23.34 | 37.09 | 57.80 | 73.68 | 92.70 | 113.69 | 159.60 | 190.06 | 219.94 | 252.55 | 307.79 | 328.41 | 377.27 | 392.42 |
| N BoNeP -N KR 4Cm | 6 | 1.14 | 8.95 | 22.69 | 35.57 | 55.35 | 69.60 | 90.41 | 111.13 | 149.56 | 197.46 | 225.24 | 271.98 | 335.76 | 362.40 | 414.60 | 447.03 |
| N BoNeP -N KR CPr | 10 | 1.22 | 5.97 | 26.99 | 39.44 | 67.44 | 80.12 | 111.68 | 129.78 | 190.36 | 211.08 | 251.35 | 278.01 | 336.65 | 370.57 | 405.37 | 436.74 |
| N BoNeP -N KR Cla | 11 | 1.23 | 7.10 | 29.97 | 39.96 | 73.81 | 84.93 | 116.40 | 131.89 | 200.48 | 213.50 | 243.83 | 265.25 | 317.23 | 337.10 | 377.03 | 402.82 |
| N BoNeP -N KR 6Cm | 17 | 1.44 | 10.32 | 25.62 | 40.47 | 69.19 | 84.75 | 110.24 | 134.78 | 210.52 | 252.52 | 313.76 | 368.79 | 434.89 | 481.18 | 543.99 | 580.31 |
| N BoNeP -N KR Def | 23 | 1.83 | 16.91 | 38.39 | 65.18 | 97.07 | 120.63 | 156.82 | 195.89 | 242.43 | 288.94 | 336.14 | 403.45 | 458.41 | 522.17 | 589.12 | 656.89 |
| N BoNeP -N KR Tie | 24 | 1.84 | 17.37 | 32.93 | 57.10 | 86.35 | 114.55 | 156.12 | 193.53 | 248.25 | 299.72 | 369.58 | 429.28 | 509.05 | 566.42 | 628.05 | 716.46 |
| N BoNeP -N KR JXc | 34 | 2.04 | 16.51 | 32.47 | 60.17 | 95.97 | 137.22 | 171.44 | 221.18 | 283.43 | 339.45 | 422.44 | 492.87 | 576.73 | 655.46 | 723.94 | 784.93 |
| N BoNeP -N KR QMa | 40 | 2.75 | 20.44 | 48.92 | 87.15 | 142.84 | 174.98 | 248.01 | 298.49 | 385.09 | 462.66 | 544.00 | 626.85 | 763.33 | 847.00 | 940.79 | 1027.04 |

Table 4: Average number of CPU cycles per iteration of single array sorting on machine B
| Sorter | Overall Rank | Overall GeoM | Array Size 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I -N KR POp | 21 | 2.45 | 11.39 | 36.05 | 77.28 | 128.96 | 181.15 | 227.59 | 265.00 | 298.11 | 335.56 | 370.76 | 404.10 | 450.91 | 488.88 | 529.10 | 582.69 |
| I -N KR Def | 33 | 2.90 | 15.33 | 48.24 | 95.20 | 148.27 | 195.18 | 249.58 | 305.80 | 347.41 | 395.45 | 434.98 | 479.17 | 524.59 | 575.01 | 628.20 | 688.63 |
| I -N KR STL | 35 | 2.97 | 16.64 | 52.56 | 102.10 | 154.79 | 194.80 | 248.41 | 306.96 | 346.42 | 393.33 | 436.37 | 486.46 | 536.80 | 580.03 | 634.94 | 702.84 |
| I -N KR AIF | 36 | 3.23 | 16.77 | 56.45 | 109.56 | 160.66 | 215.97 | 268.93 | 332.20 | 379.01 | 432.62 | 479.89 | 540.53 | 601.36 | 643.93 | 708.10 | 776.50 |
| N Best -N KR 4Cm | 2 | 1.11 | 7.31 | 16.27 | 29.45 | 47.03 | 70.03 | 82.22 | 96.48 | 126.39 | 143.46 | 169.87 | 176.58 | 233.61 | 268.71 | 310.79 | 345.07 |
| N Best -N KR 4CS | 5 | 1.19 | 7.07 | 18.10 | 34.24 | 55.89 | 67.86 | 93.54 | 110.54 | 137.94 | 146.50 | 175.16 | 195.19 | 239.20 | 274.19 | 315.00 | 343.90 |
| N Best -N KR 6Cm | 8 | 1.25 | 6.85 | 17.84 | 33.82 | 62.44 | 79.27 | 97.94 | 116.81 | 142.67 | 165.77 | 184.17 | 206.94 | 256.17 | 295.75 | 316.26 | 352.90 |
| N Best -N KR CPr | 11 | 1.43 | 3.09 | 30.86 | 36.06 | 73.57 | 89.71 | 124.63 | 140.12 | 190.47 | 199.37 | 251.14 | 265.94 | 292.43 | 335.35 | 364.68 | 389.62 |
| N Best -N KR Cla | 12 | 1.45 | 4.73 | 26.60 | 37.62 | 75.06 | 87.90 | 116.55 | 142.33 | 194.22 | 196.00 | 251.06 | 260.73 | 294.26 | 326.31 | 356.21 | 384.08 |
| N Best -N KR Tie | 26 | 2.75 | 15.76 | 38.05 | 73.01 | 109.62 | 144.16 | 211.29 | 266.32 | 314.66 | 362.63 | 424.85 | 500.30 | 595.58 | 687.34 | 733.87 | 897.51 |
| N Best -N KR JXc | 27 | 2.79 | 15.56 | 42.27 | 75.44 | 114.30 | 147.68 | 205.71 | 254.78 | 309.67 | 375.45 | 433.90 | 491.17 | 605.14 | 703.28 | 774.28 | 901.10 |
| N Best -N KR Def | 30 | 2.85 | 15.40 | 37.48 | 80.60 | 126.48 | 158.41 | 222.60 | 294.91 | 346.24 | 358.13 | 456.70 | 514.14 | 561.78 | 694.25 | 758.85 | 846.88 |
| N Best -N KR QMa | 38 | 3.79 | 11.05 | 54.53 | 110.76 | 180.78 | 239.95 | 310.40 | 389.88 | 468.14 | 494.45 | 665.40 | 741.91 | 783.56 | 920.78 | 1000.01 | 1110.08 |
| N BoNeL -N KR 4Cm | 1 | 1.07 | 6.05 | 16.77 | 29.70 | 47.32 | 70.14 | 83.57 | 96.87 | 125.01 | 149.19 | 164.25 | 174.84 | 216.16 | 244.09 | 266.22 | 288.18 |
| N BoNeL -N KR 4CS | 4 | 1.15 | 6.51 | 18.51 | 33.64 | 57.84 | 69.08 | 90.70 | 112.78 | 134.28 | 153.70 | 182.76 | 185.91 | 234.77 | 256.08 | 272.71 | 299.99 |
| N BoNeL -N KR 6Cm | 9 | 1.31 | 8.37 | 20.04 | 35.32 | 67.72 | 77.64 | 99.97 | 120.89 | 156.16 | 177.52 | 199.67 | 217.96 | 260.26 | 282.87 | 316.31 | 348.27 |
| N BoNeL -N KR CPr | 17 | 1.58 | 3.27 | 32.08 | 37.08 | 73.85 | 91.57 | 124.20 | 141.78 | 205.08 | 251.50 | 290.99 | 312.06 | 369.32 | 405.85 | 437.23 | 462.54 |
| N BoNeL -N KR Cla | 18 | 1.59 | 4.48 | 27.21 | 37.60 | 76.08 | 89.28 | 117.83 | 141.78 | 200.79 | 240.48 | 289.48 | 306.41 | 371.91 | 403.27 | 443.11 | 457.93 |
| N BoNeL -N KR JXc | 22 | 2.56 | 15.24 | 35.21 | 66.96 | 100.40 | 131.09 | 189.59 | 235.21 | 304.23 | 364.99 | 418.31 | 472.06 | 536.86 | 655.17 | 704.77 | 777.83 |
| N BoNeL -N KR Tie | 24 | 2.69 | 15.15 | 37.39 | 64.10 | 112.33 | 139.39 | 191.19 | 240.18 | 314.21 | 375.78 | 435.35 | 500.68 | 610.78 | 682.65 | 765.66 | 882.64 |
| N BoNeL -N KR Def | 31 | 2.89 | 14.65 | 39.25 | 78.65 | 115.49 | 145.62 | 206.33 | 273.29 | 347.40 | 409.92 | 460.74 | 560.31 | 611.72 | 725.22 | 827.99 | 932.90 |
| N BoNeL -N KR QMa | 39 | 3.81 | 10.33 | 50.08 | 97.42 | 168.02 | 223.17 | 286.64 | 359.03 | 484.01 | 578.52 | 663.41 | 766.77 | 906.75 | 1002.35 | 1112.73 | 1223.32 |
| N BoNeM -N KR 4Cm | 7 | 1.23 | 7.29 | 17.72 | 29.35 | 45.03 | 61.88 | 82.52 | 98.43 | 202.36 | 212.83 | 200.65 | 243.93 | 238.92 | 308.91 | 320.33 | 364.00 |
| N BoNeM -N KR 6Cm | 15 | 1.50 | 7.00 | 21.07 | 40.85 | 84.36 | 80.01 | 99.31 | 117.10 | 197.20 | 266.33 | 235.02 | 258.16 | 323.18 | 415.37 | 352.34 | 399.50 |
| N BoNeM -N KR 4CS | 16 | 1.51 | 6.96 | 17.60 | 40.66 | 80.80 | 70.93 | 145.19 | 112.51 | 208.82 | 246.55 | 256.60 | 238.25 | 282.13 | 411.16 | 416.42 | 433.64 |
| N BoNeM -N KR CPr | 19 | 1.95 | 8.86 | 33.28 | 38.15 | 94.93 | 98.62 | 147.79 | 186.70 | 251.71 | 298.36 | 323.52 | 341.95 | 417.32 | 448.73 | 548.35 | 592.07 |
| N BoNeM -N KR Cla | 20 | 1.96 | 13.01 | 32.55 | 47.51 | 90.72 | 93.87 | 154.40 | 146.24 | 232.55 | 276.11 | 316.92 | 340.40 | 421.70 | 482.01 | 533.05 | 533.90 |
| N BoNeM -N KR JXc | 28 | 2.82 | 16.39 | 37.79 | 66.83 | 110.69 | 134.03 | 194.48 | 243.91 | 385.89 | 444.32 | 496.26 | 538.42 | 586.56 | 705.07 | 767.01 | 877.98 |
| N BoNeM -N KR Tie | 29 | 2.84 | 17.52 | 41.29 | 65.86 | 105.65 | 145.14 | 198.25 | 259.15 | 322.99 | 398.23 | 468.78 | 547.83 | 627.10 | 796.34 | 798.66 | 842.88 |
| N BoNeM -N KR Def | 34 | 2.95 | 11.71 | 37.08 | 84.63 | 136.06 | 176.03 | 231.43 | 289.16 | 351.31 | 413.91 | 456.64 | 542.57 | 641.53 | 720.54 | 826.37 | 908.10 |
| N BoNeM -N KR QMa | 37 | 3.77 | 17.63 | 58.32 | 107.62 | 142.40 | 237.64 | 264.96 | 384.33 | 435.37 | 500.00 | 574.65 | 726.09 | 825.43 | 913.20 | 980.12 | 1129.61 |
| N BoNeP -N KR 4Cm | 3 | 1.15 | 6.86 | 17.92 | 29.84 | 49.01 | 61.90 | 79.90 | 106.49 | 129.82 | 156.20 | 163.84 | 201.19 | 251.63 | 287.49 | 319.13 | 345.12 |
| N BoNeP -N KR 4CS | 6 | 1.21 | 6.63 | 20.99 | 33.47 | 50.87 | 69.91 | 89.23 | 114.03 | 132.59 | 155.05 | 197.48 | 204.60 | 262.52 | 285.92 | 321.02 | 342.46 |
| N BoNeP -N KR 6Cm | 10 | 1.31 | 7.80 | 20.58 | 34.62 | 60.07 | 83.27 | 101.94 | 126.02 | 143.48 | 174.07 | 207.35 | 228.25 | 262.75 | 287.39 | 344.92 | 355.76 |
| N BoNeP -N KR CPr | 13 | 1.48 | 3.24 | 31.77 | 37.47 | 74.46 | 84.18 | 123.89 | 146.83 | 196.26 | 213.12 | 254.61 | 274.11 | 323.22 | 347.05 | 384.41 | 414.97 |
| N BoNeP -N KR Cla | 14 | 1.48 | 4.77 | 27.88 | 38.17 | 69.83 | 75.02 | 116.28 | 141.25 | 191.49 | 221.49 | 250.44 | 274.70 | 325.92 | 350.47 | 394.49 | 424.23 |
| N BoNeP -N KR JXc | 23 | 2.69 | 13.90 | 35.16 | 65.39 | 105.11 | 139.71 | 197.48 | 255.87 | 294.85 | 364.97 | 461.81 | 505.71 | 605.46 | 686.42 | 814.11 | 919.46 |
| N BoNeP -N KR Tie | 25 | 2.72 | 15.69 | 34.72 | 68.55 | 112.03 | 142.03 | 209.70 | 257.04 | 312.52 | 367.93 | 434.85 | 509.64 | 578.63 | 702.64 | 752.57 | 892.05 |
| N BoNeP -N KR Def | 32 | 2.90 | 14.81 | 39.78 | 78.61 | 127.92 | 181.05 | 231.43 | 274.24 | 346.11 | 379.57 | 453.44 | 491.92 | 610.98 | 708.39 | 773.08 | 859.50 |
| N BoNeP -N KR QMa | 40 | 3.99 | 19.09 | 56.49 | 99.26 | 180.09 | 222.01 | 298.15 | 371.18 | 475.00 | 549.86 | 660.00 | 745.95 | 881.24 | 994.07 | 1082.76 | 1196.13 |

Table 5: Average number of CPU cycles per iteration of single array sorting on machine C
| Sorter | Overall Rank | Overall GeoM | Array Size 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I -N KR POp | 21 | 1.97 | 13.31 | 36.84 | 77.52 | 121.73 | 166.39 | 204.93 | 245.35 | 281.53 | 320.20 | 364.03 | 407.29 | 443.03 | 491.19 | 519.29 | 570.38 |
| I -N KR STL | 29 | 2.18 | 13.99 | 44.13 | 88.72 | 136.05 | 178.19 | 221.54 | 271.65 | 306.46 | 354.48 | 399.11 | 448.19 | 485.94 | 536.15 | 570.51 | 626.51 |
| I -N KR Def | 34 | 2.22 | 15.49 | 43.53 | 85.92 | 133.04 | 177.05 | 222.94 | 275.77 | 313.38 | 366.00 | 407.47 | 459.20 | 496.26 | 561.57 | 600.20 | 659.81 |
| I -N KR AIF | 36 | 2.48 | 16.19 | 53.96 | 98.14 | 148.52 | 205.27 | 255.57 | 311.14 | 350.39 | 404.94 | 454.16 | 505.94 | 553.38 | 604.04 | 645.66 | 703.20 |
| N Best -N KR 4CS | 1 | 1.08 | 9.49 | 21.14 | 34.61 | 61.39 | 74.12 | 93.88 | 116.89 | 138.90 | 166.86 | 201.46 | 229.80 | 267.67 | 318.35 | 338.59 | 367.90 |
| N Best -N KR 4Cm | 4 | 1.10 | 11.80 | 20.24 | 33.80 | 51.24 | 71.67 | 85.75 | 113.85 | 135.38 | 178.33 | 208.90 | 238.10 | 278.08 | 333.61 | 367.57 | 413.45 |
| N Best -N KR Cla | 8 | 1.26 | 7.07 | 28.71 | 38.97 | 80.04 | 96.97 | 129.58 | 150.98 | 191.32 | 204.88 | 242.45 | 273.71 | 286.36 | 326.97 | 347.76 | 382.99 |
| N Best -N KR CPr | 9 | 1.27 | 5.62 | 29.14 | 40.79 | 78.30 | 98.99 | 133.63 | 150.52 | 189.73 | 200.78 | 247.67 | 278.47 | 300.13 | 353.26 | 373.06 | 412.66 |
| N Best -N KR 6Cm | 12 | 1.31 | 11.86 | 21.84 | 40.18 | 65.61 | 87.83 | 105.09 | 133.64 | 163.41 | 217.18 | 252.83 | 286.92 | 353.76 | 417.93 | 429.41 | 497.07 |
| N Best -N KR Tie | 23 | 2.07 | 17.79 | 36.53 | 64.86 | 99.31 | 132.75 | 186.83 | 236.91 | 278.75 | 320.47 | 382.78 | 471.00 | 542.17 | 608.62 | 658.26 | 779.34 |
| N Best -N KR Def | 24 | 2.07 | 17.67 | 36.25 | 73.23 | 111.95 | 143.00 | 189.50 | 250.12 | 287.76 | 308.47 | 378.63 | 443.90 | 474.71 | 587.42 | 626.30 | 731.54 |
| N Best -N KR JXc | 33 | 2.19 | 17.11 | 37.58 | 69.36 | 108.46 | 151.93 | 197.17 | 245.69 | 295.11 | 343.64 | 409.43 | 492.56 | 556.92 | 670.48 | 716.70 | 818.76 |
| N Best -N KR QMa | 37 | 2.86 | 14.49 | 47.71 | 97.71 | 157.50 | 211.46 | 266.79 | 344.29 | 410.69 | 439.47 | 592.67 | 682.56 | 714.70 | 833.13 | 919.82 | 1034.29 |
| N BoNeL -N KR 4Cm | 2 | 1.08 | 9.98 | 21.36 | 34.70 | 51.99 | 73.39 | 87.23 | 116.70 | 135.64 | 174.36 | 205.56 | 240.54 | 275.67 | 330.05 | 346.83 | 375.29 |
| N BoNeL -N KR 4CS | 3 | 1.08 | 9.11 | 22.19 | 35.65 | 62.92 | 75.16 | 94.38 | 119.00 | 136.72 | 173.22 | 211.53 | 232.70 | 265.93 | 304.89 | 326.46 | 348.16 |
| N BoNeL -N KR 6Cm | 14 | 1.37 | 12.10 | 24.61 | 41.19 | 70.46 | 84.83 | 108.23 | 135.38 | 196.49 | 245.98 | 259.64 | 310.46 | 366.85 | 429.46 | 437.78 | 485.76 |
| N BoNeL -N KR Cla | 16 | 1.43 | 6.67 | 29.23 | 39.13 | 79.28 | 99.19 | 130.78 | 150.76 | 222.19 | 252.77 | 303.20 | 328.71 | 389.41 | 422.62 | 465.09 | 493.86 |
| N BoNeL -N KR CPr | 17 | 1.43 | 6.24 | 30.74 | 41.61 | 79.86 | 100.76 | 134.10 | 151.81 | 219.75 | 254.20 | 299.55 | 327.56 | 382.07 | 421.80 | 452.26 | 485.46 |
| N BoNeL -N KR Tie | 25 | 2.08 | 18.14 | 37.39 | 61.98 | 100.26 | 130.55 | 174.16 | 223.09 | 275.50 | 342.71 | 395.67 | 472.43 | 544.67 | 633.90 | 701.33 | 794.77 |
| N BoNeL -N KR JXc | 27 | 2.12 | 17.03 | 34.89 | 65.91 | 100.48 | 142.81 | 186.18 | 235.30 | 299.63 | 356.66 | 414.47 | 478.47 | 548.58 | 637.53 | 677.42 | 771.92 |
| N BoNeL -N KR Def | 28 | 2.15 | 17.48 | 38.62 | 74.44 | 103.11 | 135.11 | 178.95 | 237.15 | 288.96 | 365.87 | 397.72 | 498.02 | 529.04 | 646.10 | 699.94 | 796.40 |
| N BoNeL -N KR QMa | 38 | 2.92 | 14.43 | 46.63 | 90.93 | 149.35 | 199.66 | 251.64 | 325.05 | 427.89 | 518.88 | 596.66 | 699.04 | 810.17 | 917.14 | 1017.55 | 1148.28 |
| N BoNeM -N KR 4Cm | 7 | 1.19 | 11.67 | 22.36 | 34.35 | 51.89 | 72.19 | 89.01 | 109.41 | 208.81 | 231.50 | 227.94 | 263.31 | 294.41 | 367.71 | 374.03 | 431.20 |
| N BoNeM -N KR 4CS | 13 | 1.32 | 9.54 | 21.89 | 42.09 | 76.18 | 77.20 | 136.41 | 112.08 | 206.68 | 256.74 | 270.60 | 260.43 | 306.41 | 441.37 | 442.21 | 467.06 |
| N BoNeM -N KR 6Cm | 18 | 1.49 | 11.67 | 24.90 | 47.78 | 90.53 | 83.76 | 107.69 | 128.11 | 219.37 | 311.29 | 297.59 | 345.98 | 401.83 | 503.23 | 457.17 | 541.58 |
| N BoNeM -N KR Cla | 19 | 1.68 | 12.87 | 32.68 | 50.31 | 89.81 | 103.25 | 151.29 | 157.12 | 253.04 | 290.36 | 332.44 | 361.42 | 437.62 | 510.11 | 548.59 | 572.66 |
| N BoNeM -N KR CPr | 20 | 1.69 | 12.25 | 32.26 | 42.04 | 98.50 | 105.12 | 149.09 | 186.85 | 265.49 | 308.33 | 334.27 | 355.31 | 424.85 | 476.11 | 548.73 | 628.61 |
| N BoNeM -N KR Def | 30 | 2.18 | 15.33 | 37.26 | 76.07 | 119.07 | 157.49 | 203.52 | 246.69 | 289.45 | 362.59 | 392.89 | 482.09 | 536.33 | 621.77 | 697.33 | 783.48 |
| N BoNeM -N KR Tie | 31 | 2.18 | 18.97 | 38.05 | 61.85 | 96.35 | 131.68 | 175.20 | 229.28 | 278.30 | 368.86 | 455.01 | 542.54 | 576.28 | 753.78 | 769.33 | 784.82 |
| N BoNeM -N KR JXc | 35 | 2.30 | 17.73 | 37.10 | 65.93 | 104.73 | 135.12 | 182.40 | 237.82 | 368.48 | 444.19 | 474.94 | 558.74 | 584.25 | 722.61 | 749.45 | 889.37 |
| N BoNeM -N KR QMa | 39 | 2.92 | 21.00 | 53.83 | 99.47 | 135.04 | 209.36 | 248.78 | 337.02 | 403.97 | 470.19 | 546.77 | 664.62 | 768.21 | 853.46 | 946.93 | 1058.06 |
| N BoNeP -N KR 4Cm | 5 | 1.12 | 10.10 | 21.80 | 34.53 | 52.94 | 69.20 | 88.78 | 112.80 | 143.67 | 188.17 | 210.52 | 252.70 | 303.77 | 355.74 | 392.22 | 419.03 |
| N BoNeP -N KR 4CS | 6 | 1.13 | 9.31 | 23.11 | 35.51 | 56.29 | 77.11 | 92.99 | 120.16 | 151.19 | 179.63 | 217.68 | 244.75 | 292.50 | 328.03 | 362.46 | 387.20 |
| N BoNeP -N KR Cla | 10 | 1.28 | 6.86 | 29.69 | 39.90 | 74.93 | 87.15 | 122.21 | 139.99 | 201.27 | 222.60 | 253.53 | 275.37 | 321.67 | 350.16 | 385.47 | 421.30 |
| N BoNeP -N KR CPr | 11 | 1.30 | 6.55 | 30.51 | 42.04 | 75.27 | 86.68 | 117.72 | 140.35 | 199.82 | 216.17 | 258.35 | 280.93 | 331.44 | 373.93 | 402.71 | 440.35 |
| N BoNeP -N KR 6Cm | 15 | 1.38 | 11.51 | 24.77 | 40.86 | 65.58 | 84.98 | 105.34 | 137.82 | 187.49 | 235.82 | 276.33 | 324.97 | 372.66 | 435.94 | 470.59 | 525.82 |
| N BoNeP -N KR Tie | 22 | 2.06 | 18.15 | 34.93 | 63.42 | 98.77 | 129.21 | 179.81 | 224.98 | 277.16 | 331.25 | 399.71 | 466.61 | 540.55 | 620.63 | 675.01 | 785.68 |
| N BoNeP -N KR Def | 26 | 2.11 | 17.48 | 39.97 | 72.50 | 112.59 | 149.29 | 187.89 | 236.08 | 284.89 | 332.48 | 380.62 | 439.68 | 515.16 | 600.16 | 656.77 | 749.71 |
| N BoNeP -N KR JXc | 32 | 2.19 | 17.01 | 34.81 | 65.09 | 103.38 | 143.31 | 185.32 | 242.27 | 290.60 | 361.36 | 441.20 | 508.06 | 589.21 | 687.21 | 762.41 | 849.86 |
| N BoNeP -N KR QMa | 40 | 3.05 | 21.39 | 52.56 | 94.71 | 158.02 | 196.77 | 270.19 | 335.88 | 419.22 | 509.74 | 596.41 | 684.62 | 813.48 | 926.26 | 1010.03 | 1129.40 |

Table 6: Average number of CPU cycles per iteration of single array sorting across all machines
[Plot not reproduced: CPU cycles per iteration for each sorting algorithm at ArraySize = 8, shown in absolute terms and as a value in relation to 'I -N KR POp'.]
Figure 5: Single sort for array size = 8 on machine A
[Plot not reproduced: CPU cycles per iteration for each sorting algorithm at ArraySize = 8, shown in absolute terms and as a value in relation to 'I -N KR POp'.]
Figure 6: Single sort for array size = 8 on machine B
[Plot not reproduced: CPU cycles per iteration for each sorting algorithm at ArraySize = 8, shown in absolute terms and as a value in relation to 'I -N KR POp'.]
Figure 7: Single sort for array size = 8 on machine C
[Plot not reproduced: CPU cycles per element over array sizes 2 to 16. Legend entries: 4Cm, 4CS, CPr, POp, STL; I, N Best, N BoNeL, N BoNeP.]
Figure 8: Single sort of array sizes 2 to 16 on machine A
[Plot not reproduced: CPU cycles per element over array sizes 2 to 16. Legend entries: 4Cm, 4CS, CPr, POp, STL; I, N Best, N BoNeL, N BoNeP.]
Figure 9: Single sort of array sizes 2 to 16 on machine B
[Plot not reproduced: CPU cycles per element over array sizes 2 to 16. Legend entries: 4Cm, 4CS, CPr, POp, STL; I, N Best, N BoNeL, N BoNeP.]
Figure 10: Single sort of array sizes 2 to 16 on machine C
[Plot not reproduced: CPU cycles per element over array sizes 2 to 16. Legend entries: 4Cm, 4CS, CPr, POp, STL; I, N Best, N BoNeL, N BoNeP.]
Figure 11: Continuous sorting of array sizes 2 to 16 on machine A
4.5 Sorting many continuous Sets of 2-16 Items
Here the benchmark shown in algorithm 3 was used. Instead of sorting a single array multiple
times, multiple arrays are created adjacent to each other and sorted in series.
The number of arrays is chosen such that they do not all fit into the CPU's L3 cache
together. Since the reference array is sorted before the measurement, the original array should
no longer be present in the cache, so that accessing it causes cache misses.
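The setup described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm 3: the names `sort_block`, `measure_continuous` and `ContinuousResult` are hypothetical, plain 64-bit keys stand in for the actual items, an insertion sort stands in for the sorter under test, and wall-clock time via `std::chrono` replaces the CPU cycle counts reported in the tables.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for the sorter under test (plain insertion sort).
inline void sort_block(std::uint64_t* first, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        const std::uint64_t key = first[i];
        std::size_t j = i;
        while (j > 0 && first[j - 1] > key) {
            first[j] = first[j - 1];
            --j;
        }
        first[j] = key;
    }
}

struct ContinuousResult {
    double nanos_per_array;   // wall-clock time, not CPU cycles (assumption)
    bool matches_reference;   // true if every block ended up sorted correctly
};

// Sorts many adjacent arrays of `array_size` keys, one after another.
// `min_total_bytes` is meant to be chosen larger than the L3 cache.
inline ContinuousResult measure_continuous(std::size_t array_size,
                                           std::size_t min_total_bytes,
                                           std::uint64_t seed) {
    assert(array_size > 0);
    const std::size_t bytes_per_array = array_size * sizeof(std::uint64_t);
    const std::size_t num_arrays = min_total_bytes / bytes_per_array + 1;

    // One contiguous buffer holding all arrays back to back.
    std::vector<std::uint64_t> data(num_arrays * array_size);
    std::mt19937_64 rng(seed);
    for (std::uint64_t& x : data) x = rng();

    // Reference copy, sorted block by block before timing starts. Touching
    // it last makes it likely that `data` has been evicted from the cache.
    std::vector<std::uint64_t> reference(data);
    for (std::size_t a = 0; a < num_arrays; ++a) {
        std::uint64_t* block = reference.data() + a * array_size;
        std::sort(block, block + array_size);
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t a = 0; a < num_arrays; ++a) {
        sort_block(data.data() + a * array_size, array_size);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double nanos =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {nanos / static_cast<double>(num_arrays), data == reference};
}
```

A caller would pass a byte count above the L3 size of the target machine, for example `measure_continuous(8, 64u << 20, 42)` on a machine with less than 64 MiB of L3 cache.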
The results are similar to the previous ones. One difference visible when comparing figures
11, 12 and 13 to figures 8, 9 and 10 from the single sort measurement is that the CPr swap,
which operates on pointers and moves values around in memory, became worse relative to the
4Cm and 4CS implementations for array sizes greater than 2. With 4Cm and 4CS, the values can
probably be pre-loaded for the next conditional swap while the current one is finishing, whereas
CPr accesses the element's reference value only once the destination address has been calculated,
which leaves less room for pre-loading.
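To illustrate the kind of difference discussed here, the following sketch contrasts a conditional swap that loads both items before selecting them with one that first selects addresses and only then loads through them. It is an illustration under assumptions, not the 4Cm, 4CS or CPr code that was measured: the `Item` layout and both function names are hypothetical, and whether a compiler emits conditional moves for the ternary expressions depends on the compiler and its flags.

```cpp
#include <cstdint>

// Hypothetical item layout: a sort key plus an attached reference value.
struct Item {
    std::uint64_t key;
    std::uint64_t ref;
};

// Value-based conditional swap: both items are loaded up front, and every
// output field is chosen with a select. No load depends on the comparison.
inline void cswap_values(Item& a, Item& b) {
    const Item x = a;
    const Item y = b;
    const bool swap = y.key < x.key;
    a.key = swap ? y.key : x.key;
    a.ref = swap ? y.ref : x.ref;
    b.key = swap ? x.key : y.key;
    b.ref = swap ? x.ref : y.ref;
}

// Pointer-based conditional swap: the comparison selects source addresses,
// and the items are loaded through those addresses afterwards, so the
// loads have to wait for the selection.
inline void cswap_pointers(Item& a, Item& b) {
    const bool swap = b.key < a.key;
    const Item* smaller = swap ? &b : &a;
    const Item* larger = swap ? &a : &b;
    const Item lo = *smaller;
    const Item hi = *larger;
    a = lo;
    b = hi;
}
```

Both functions leave the item with the smaller key in `a` and the other in `b`; they differ only in whether the loads depend on the outcome of the comparison.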
A complete overview of the average values of each sorter across all three machines can be
seen in table 7. Using the sorting networks yields speed-ups from 25% at array size 2 all
the way up to 59% at array size 15.
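As a worked example of how such a percentage can be read off, assume the speed-up is defined as one minus the ratio of network cycles to insertion sort cycles. For array size 15, table 7 lists 511.97 cycles for `I -I KR Def` and 207.23 cycles for `N BoNeL -I KR 4CS`. Which rows the quoted figures were actually computed from is not stated here, so this pairing is an assumption, and `relative_speedup` is a hypothetical helper.

```cpp
#include <cassert>

// Relative speed-up of a faster sorter over a baseline, expressed as a
// fraction of the baseline: 0.5 means half the cycles are needed.
inline double relative_speedup(double baseline_cycles, double faster_cycles) {
    assert(baseline_cycles > 0.0);
    return 1.0 - faster_cycles / baseline_cycles;
}

// Array size 15, values taken from table 7 (pairing of rows assumed):
//   relative_speedup(511.97, 207.23)
```

With these two entries the result is roughly 0.595, in line with the 59% quoted for array size 15.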
[Plot not reproduced: CPU cycles per element over array sizes 2 to 16. Legend entries: 4Cm, 4CS, CPr, POp, STL; I, N Best, N BoNeL, N BoNeP.]
Figure 12: Continuous sorting of array sizes 2 to 16 on machine B
[Plot not reproduced: CPU cycles per element over array sizes 2 to 16. Legend entries: 4Cm, 4CS, CPr, POp, STL; I, N Best, N BoNeL, N BoNeP.]
Figure 13: Continuous sorting of array sizes 2 to 16 on machine C
| Sorter | Overall Rank | Overall GeoM | Array Size 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I -I KR POp | 26 | 2.39 | 20.67 | 44.67 | 72.23 | 104.33 | 142.53 | 182.47 | 222.07 | 262.90 | 302.13 | 342.53 | 382.87 | 425.57 | 468.73 | 513.80 | 556.07 |
| I -I KR Def | 27 | 2.42 | 22.17 | 47.30 | 74.60 | 106.70 | 143.40 | 183.47 | 223.37 | 263.23 | 302.53 | 342.47 | 383.63 | 425.73 | 467.77 | 511.97 | 554.67 |
| I -I KR STL | 34 | 2.57 | 27.07 | 53.47 | 82.40 | 114.07 | 150.63 | 191.53 | 232.87 | 273.00 | 313.67 | 355.40 | 398.73 | 441.80 | 484.63 | 529.97 | 573.47 |
| I -I KR AIF | 35 | 2.57 | 23.67 | 50.00 | 79.53 | 111.97 | 149.87 | 193.77 | 237.77 | 280.87 | 324.37 | 367.43 | 412.23 | 456.13 | 499.43 | 544.10 | 589.47 |
| N Best -I KR 4CS | 2 | 1.05 | 15.97 | 25.00 | 35.67 | 48.73 | 57.60 | 70.90 | 79.30 | 96.50 | 111.23 | 133.43 | 148.70 | 175.07 | 222.03 | 226.00 | 261.80 |
| N Best -I KR 4Cm | 5 | 1.08 | 15.97 | 23.67 | 34.87 | 47.20 | 56.30 | 70.20 | 82.20 | 95.07 | 114.20 | 131.30 | 146.00 | 205.50 | 236.97 | 271.73 | 299.80 |
| N Best -I KR 6Cm | 8 | 1.29 | 15.97 | 25.70 | 37.03 | 55.13 | 66.80 | 85.50 | 94.37 | 117.63 | 139.07 | 160.90 | 181.33 | 270.23 | 301.70 | 347.80 | 379.53 |
| N Best -I KR Cla | 11 | 1.40 | 14.00 | 25.30 | 39.90 | 70.83 | 87.67 | 114.43 | 128.97 | 169.67 | 172.10 | 201.00 | 218.60 | 231.20 | 254.33 | 278.03 | 295.50 |
| N Best -I KR CPr | 13 | 1.43 | 13.90 | 28.37 | 41.07 | 74.03 | 91.33 | 117.27 | 130.77 | 167.40 | 165.93 | 201.17 | 220.90 | 233.23 | 259.73 | 288.53 | 307.23 |
| N Best -I KR Def | 22 | 2.33 | 23.67 | 44.33 | 72.00 | 96.07 | 129.27 | 167.40 | 205.97 | 245.73 | 262.57 | 328.70 | 373.57 | 408.03 | 478.33 | 531.47 | 605.10 |
| N Best -I KR Tie | 23 | 2.34 | 25.97 | 43.77 | 69.00 | 92.00 | 119.37 | 166.30 | 200.83 | 235.50 | 274.13 | 324.33 | 380.43 | 435.33 | 486.90 | 555.63 | 639.40 |
| N Best -I KR JXc | 32 | 2.52 | 24.60 | 45.30 | 70.63 | 99.83 | 137.83 | 173.73 | 213.70 | 253.73 | 295.23 | 363.17 | 418.60 | 479.23 | 560.80 | 638.80 | 696.60 |
| N Best -I KR QMa | 37 | 3.10 | 24.67 | 46.00 | 83.13 | 130.07 | 176.83 | 223.30 | 279.90 | 333.30 | 357.03 | 500.63 | 575.63 | 580.67 | 684.17 | 773.97 | 845.93 |
| N BoNeL -I KR 4CS | 1 | 1.04 | 16.00 | 25.00 | 35.63 | 48.50 | 57.10 | 69.30 | 78.83 | 100.57 | 115.27 | 136.13 | 150.63 | 174.33 | 200.53 | 207.23 | 247.57 |
| N BoNeL -I KR 4Cm | 3 | 1.06 | 15.97 | 24.00 | 34.97 | 47.70 | 57.00 | 70.50 | 79.10 | 100.80 | 117.30 | 136.77 | 148.63 | 180.77 | 198.23 | 245.17 | 268.67 |
| N BoNeL -I KR 6Cm | 9 | 1.32 | 16.00 | 26.00 | 37.60 | 55.70 | 66.93 | 85.33 | 94.57 | 124.93 | 145.57 | 173.90 | 194.77 | 269.73 | 328.20 | 350.90 | 355.43 |
| N BoNeL -I KR Cla | 17 | 1.65 | 13.93 | 25.13 | 40.00 | 70.97 | 87.63 | 114.33 | 129.00 | 200.47 | 223.80 | 268.67 | 286.87 | 344.30 | 365.33 | 400.97 | 416.60 |
N
Bo
Ne
L
-I
KR
CP
r
18
1.
67
14
.6
0
28
.4
0
41
.1
0
74
.0
0
91
.4
0
11
7.
00
13
0.
73
19
7.
03
22
9.
17
26
5.
43
28
2.
93
33
5.
90
36
2.
13
38
9.
33
40
7.
03
N
Bo
Ne
L
-I
KR
Ti
e
25
2.
36
25
.9
7
43
.6
3
68
.3
3
92
.8
7
12
0.
40
16
6.
40
20
0.
63
23
7.
40
28
2.
20
33
1.
43
38
4.
17
44
4.
20
49
4.
67
56
2.
53
63
9.
57
N
Bo
Ne
L
-I
KR
De
f
28
2.
42
24
.1
0
45
.4
0
74
.3
7
95
.9
0
12
8.
67
16
6.
37
20
5.
67
24
8.
33
30
6.
17
33
0.
20
41
4.
60
43
8.
80
52
5.
63
57
5.
17
63
0.
03
N
Bo
Ne
L
-I
KR
JX
c
31
2.
47
24
.3
3
45
.3
3
70
.6
7
99
.6
7
13
7.
83
17
3.
80
21
3.
67
26
0.
80
30
3.
70
34
7.
07
40
2.
80
45
9.
87
53
3.
17
57
6.
37
64
0.
67
N
Bo
Ne
L
-I
KR
QM
a
39
3.
27
25
.0
0
46
.5
0
83
.2
0
13
0.
23
17
7.
13
22
3.
37
28
0.
17
37
1.
97
43
4.
07
50
5.
10
58
2.
43
68
8.
13
75
6.
13
83
3.
53
94
5.
70
N
Bo
Ne
M
-I
KR
4C
m
7
1.
19
16
.0
0
23
.9
3
34
.3
3
47
.3
3
55
.9
0
66
.6
0
76
.5
7
17
0.
50
17
7.
00
16
7.
90
19
4.
00
19
4.
70
25
3.
57
26
2.
33
30
7.
97
N
Bo
Ne
M
-I
KR
4C
S
12
1.
41
16
.0
0
24
.6
3
41
.2
0
63
.0
3
55
.7
7
11
1.
00
77
.5
0
16
5.
30
20
3.
97
20
4.
63
20
0.
43
29
4.
33
32
1.
33
34
9.
40
38
2.
30
N
Bo
Ne
M
-I
KR
6C
m
16
1.
52
16
.0
0
25
.9
3
36
.9
0
53
.2
3
65
.3
0
12
5.
07
92
.4
7
19
3.
63
24
4.
90
23
9.
67
23
6.
83
34
4.
47
34
8.
43
38
3.
13
36
3.
77
N
Bo
Ne
M
-I
KR
Cl
a
19
1.
86
16
.6
7
29
.3
3
41
.2
0
83
.6
7
89
.0
0
13
5.
63
12
8.
30
22
0.
20
25
9.
97
29
7.
53
31
1.
73
40
2.
10
42
7.
67
48
4.
07
47
7.
33
N
Bo
Ne
M
-I
KR
CP
r
20
1.
93
16
.2
7
29
.7
3
41
.9
0
92
.6
7
91
.7
0
13
7.
23
15
7.
13
24
0.
17
27
5.
77
30
4.
70
30
9.
10
39
9.
97
40
4.
70
47
9.
83
53
1.
63
N
Bo
Ne
M
-I
KR
Ti
e
29
2.
45
26
.0
0
44
.3
3
69
.0
3
92
.7
0
12
1.
00
16
2.
47
19
7.
57
23
9.
17
30
5.
27
37
3.
97
42
7.
80
51
6.
20
56
9.
87
56
6.
87
64
8.
47
N
Bo
Ne
M
-I
KR
De
f
30
2.
46
23
.3
3
45
.4
0
73
.7
3
10
5.
90
14
4.
73
18
2.
37
21
2.
07
24
6.
03
30
5.
07
32
9.
70
41
0.
60
43
2.
07
50
1.
03
57
5.
37
63
3.
07
N
Bo
Ne
M
-I
KR
JX
c
36
2.
64
24
.4
0
42
.6
7
70
.6
3
12
3.
73
13
0.
33
16
8.
57
21
2.
20
28
1.
13
33
3.
73
43
2.
47
45
8.
37
49
2.
10
58
3.
77
63
2.
10
74
1.
07
N
Bo
Ne
M
-I
KR
QM
a
38
3.
20
29
.0
0
52
.0
0
89
.7
7
12
6.
97
17
7.
13
22
5.
23
27
8.
20
34
6.
27
40
0.
33
46
7.
37
53
3.
50
61
1.
23
70
2.
33
77
4.
00
87
8.
87
N
Bo
Ne
P
-I
KR
4C
S
4
1.
08
16
.0
0
24
.9
7
35
.7
0
45
.7
0
54
.7
7
66
.7
0
78
.7
3
99
.6
0
12
2.
90
14
5.
37
15
6.
00
19
0.
20
21
0.
63
26
3.
43
27
9.
30
N
Bo
Ne
P
-I
KR
4C
m
6
1.
09
16
.0
0
23
.8
0
36
.7
3
46
.9
3
54
.9
7
65
.8
0
78
.7
3
10
1.
50
11
9.
97
13
8.
13
15
5.
37
21
8.
03
21
1.
63
28
5.
87
30
5.
37
N
Bo
Ne
P
-I
KR
6C
m
10
1.
35
16
.3
3
26
.0
0
37
.3
0
52
.4
3
63
.1
7
79
.5
7
93
.2
7
12
4.
70
15
1.
13
17
4.
90
24
1.
00
28
7.
13
32
9.
40
38
5.
30
41
1.
77
N
Bo
Ne
P
-I
KR
Cl
a
14
1.
45
14
.6
3
25
.0
3
40
.0
3
66
.0
0
81
.4
7
10
8.
33
11
6.
67
17
8.
70
18
8.
40
21
5.
53
22
7.
33
26
9.
33
28
5.
97
31
6.
60
33
3.
97
N
Bo
Ne
P
-I
KR
CP
r
15
1.
47
13
.9
0
28
.3
3
41
.1
3
70
.2
7
83
.4
3
10
8.
47
11
6.
93
17
7.
00
18
5.
57
21
0.
00
22
8.
57
26
3.
30
29
5.
20
31
8.
77
34
1.
63
N
Bo
Ne
P
-I
KR
De
f
21
2.
32
23
.6
7
46
.1
0
71
.0
0
96
.9
0
12
8.
83
16
0.
97
19
6.
50
24
1.
30
27
8.
13
32
1.
10
36
4.
47
41
9.
27
47
8.
67
54
4.
53
59
7.
17
N
Bo
Ne
P
-I
KR
Ti
e
24
2.
35
25
.5
0
44
.6
7
69
.6
3
90
.3
3
12
4.
73
16
0.
10
19
3.
67
23
6.
70
28
4.
33
32
7.
73
38
3.
80
44
0.
30
50
0.
37
55
7.
47
63
3.
60
N
Bo
Ne
P
-I
KR
JX
c
33
2.
54
24
.6
7
43
.3
3
70
.6
0
10
0.
33
13
3.
17
16
9.
03
21
7.
23
25
8.
40
31
8.
80
37
7.
17
42
8.
93
48
9.
80
56
4.
90
62
5.
70
71
3.
63
N
Bo
Ne
P
-I
KR
QM
a
40
3.
29
26
.3
3
49
.6
7
87
.5
7
13
0.
00
16
5.
33
22
4.
03
27
4.
07
36
7.
53
42
8.
20
50
9.
07
57
0.
77
68
4.
67
77
6.
17
84
9.
17
92
7.
27
Ta
bl
e
7:
Av
er
ag
e
nu
m
be
r
of
C
PU
cy
cl
es
pe
r
ar
ra
y
of
co
nt
in
uo
us
so
rt
in
g
ac
ro
ss
al
lm
ac
hi
ne
s
Figure 14: Sorting times of quicksort with different base cases on machine A
(Plot omitted. Panel: QuickSort; x-axis: CPU cycles per iteration, with a second scale giving the
value in relation to 'I -Q KR Def'; y-axis: sorting algorithm. Sorters shown: I -Q KR {STL, AIF,
POp, Def}; N {Best, BoNeL, BoNeM, BoNeP} -Q KR {Tie, Def, 4Cm, 4CS, CPr, Cla}; QSort -Q KR Def;
StdSort -Q KR Def.)
4.6 Sorting a large Set of Items with Quicksort
After seeing the first two results, we wanted to know how the base case sorters perform when
used inside a scalable sorting algorithm. For that we modified introsort, the quicksort
implementation from the STL, as follows: Introsort calls insertion sort only once, right at the
end. Since that is not possible with the sorting networks, they had to be called directly whenever
the partitioning resulted in a partition of 16 elements or fewer. We also determined the pivot
using the 3-element Bose Nelson parameter network instead of using if-else and std::swap.
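The structure of this modification can be sketched as follows. This is a minimal illustration
under stated assumptions, not the thesis' code: the names cond_swap, median_of_three,
kBaseCaseLimit and quicksort_with_base_case are hypothetical, the conditional swap is written in
portable C++ instead of inline assembly, the partitioning uses std::partition rather than the
introsort partitioning loop, and introsort's depth limit with heapsort fallback is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Portable conditional swap; a stand-in for the inline-assembly variants of the thesis.
inline void cond_swap(std::uint64_t& a, std::uint64_t& b) {
    const std::uint64_t lo = a < b ? a : b;
    const std::uint64_t hi = a < b ? b : a;
    a = lo;
    b = hi;
}

// Median of three values, found with a 3-comparator network on local copies
// (the parameters are passed by value, so the array itself is not modified).
inline std::uint64_t median_of_three(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    cond_swap(b, c);
    cond_swap(a, c);
    cond_swap(a, b);
    return b;
}

constexpr std::size_t kBaseCaseLimit = 16;  // partitions of this size or smaller go to the base case

// base_case is any callable taking (pointer, length); assumed to sort that range.
template <typename BaseCase>
void quicksort_with_base_case(std::uint64_t* first, std::uint64_t* last, BaseCase base_case) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n <= kBaseCaseLimit) {
        base_case(first, n);  // called directly on each small partition, not once at the end
        return;
    }
    const std::uint64_t pivot = median_of_three(first[0], first[n / 2], last[-1]);
    // Three-way split [< pivot | == pivot | > pivot]; the middle part contains the pivot,
    // so both recursive calls work on strictly smaller ranges.
    std::uint64_t* mid1 =
        std::partition(first, last, [pivot](std::uint64_t x) { return x < pivot; });
    std::uint64_t* mid2 =
        std::partition(mid1, last, [pivot](std::uint64_t x) { return !(pivot < x); });
    quicksort_with_base_case(first, mid1, base_case);
    quicksort_with_base_case(mid2, last, base_case);
}
```

Passing the base case as a callable makes it possible to swap insertion sort for a sorting network
without touching the partitioning code, which mirrors how the different base cases are compared
in this section.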
The sorters were measured using benchmark 2 with parameters
• numberOfIterations = 50
• numberOfMeasures = 200
• arraySize = 1024 × 16 = 16384 = 2^14.
To have a basis for comparison we also measured sorting with std::sort. The resulting times are
shown in figures 14, 15 and 16.
The QSort -Q KR Def sorter is just a direct copy of the STL sort with a final insertion sort
at the end. We measured it to verify that our copy of the code performs as well as std::sort
before making the modifications.
Figure 15: Sorting times of quicksort with different base cases on machine B
(Plot omitted. Panel: QuickSort; x-axis: CPU cycles per iteration, with a second scale giving the
value in relation to 'I -Q KR Def'; y-axis: sorting algorithm. Sorters shown: I -Q KR {AIF, STL,
POp, Def}; N {Best, BoNeL, BoNeM, BoNeP} -Q KR {Tie, Def, 4Cm, 4CS, CPr, Cla}; QSort -Q KR Def;
StdSort -Q KR Def.)
Figure 16: Sorting times of quicksort with different base cases on machine C
(Plot omitted. Panel: QuickSort; x-axis: CPU cycles per iteration, with a second scale giving the
value in relation to 'I -Q KR POp'; y-axis: sorting algorithm. Sorters shown: I -Q KR {STL, Def,
AIF, POp}; N {Best, BoNeL, BoNeM, BoNeP} -Q KR {Tie, Def, 4Cm, 4CS, CPr, Cla}; QSort -Q KR Def;
StdSort -Q KR Def.)
              A: N Best -Q KR Cla   B: N Best -Q KR Cla   C: N Best -Q KR Cla
I -Q KR Def   1.76%                 2.1%                  8.76%
I -Q KR POp   3.99%                 2.58%                 6.47%
StdSort -Q    12.3%                 10.6%                 14%
Table 8: Average speed-ups of the fastest sorting network over the fastest insertion sort as base
case in quicksort and unmodified std::sort
The speed-ups obtained by including sorting networks in a sorting algorithm like quicksort can be
seen in table 8.
What is notable is that the variants with insertion sort as base case are faster than the one with
the final insertion sort. This should come from the fact that they are already specialized for
the item type they sort and do not require a predicate for the sorting. Also, the base case is
called right after the partitioning has reached a low enough level, which means that the elements
are still present in the first- or second-level cache. That also explains why the Cla conditional
swap performs best with quicksort, while we saw in the last section that this is not necessarily
the case when we have a cache miss.
Recalling the results from the previous sections, we appeared to be achieving great improvements
in reducing the time needed for sorting sets of 2-16 items. By measuring only the sorting
of the small sets we exploited the networks' strength: not containing conditional branches.
The results from the measurements with quicksort highlight the networks' weakness: the larger
code size.
When integrating the sorting networks into quicksort for sorting the base cases, every time a
partition results in one part having 16 elements or fewer, we switch from the code for quicksort
to the code for the sorting network. Thus, the code for quicksort is partly removed from the L1
instruction cache and replaced with the code for the sorting network. Because the network's
code is just a flat sequence of conditional swaps, each line of code is accessed exactly once per
sort. That means the network evicts a lot of quicksort's code from the instruction cache without
gaining a speed-up from its own code being cached, and its code is evicted again when quicksort
is handed back the flow of control and loads its own code back into the instruction cache.
We can see that effect especially for machines A and B, which have 32 KiB of L1 instruction
cache: there the speed-up of the best network base case over the best insertion sort base case
is hardly over 2%. We got a much larger improvement on machine C, which has double the space
in its L1 instruction cache. Here we achieved a speed-up of almost 6.5% when making use of the
best networks.
It is no surprise that we do not see improvements similar to those in section 4.4 or 4.5, because
the partitioning that quicksort performs takes the same amount of time no matter which base
case sorter is used. It represents a part of the algorithm that cannot be optimized by using
sorting networks.
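This limit can be written down as a rough model. The following is a sketch under simplifying
assumptions that were not measured in this thesis: let p be the fraction of the running time that
quicksort spends on partitioning and 1 - p the fraction spent in the base case sorter, and let the
base case become faster by a factor s while partitioning stays unchanged. The relative reduction
of the total time is then

\[
1 - \left(p + \frac{1-p}{s}\right) = (1-p)\left(1 - \frac{1}{s}\right).
\]

With purely hypothetical values p = 0.8 and s = 2, the reduction is 0.2 · 0.5 = 0.1, that is
10%, even though the base case itself became twice as fast. The model ignores the instruction
cache effects described above, which reduce the gain further.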
              A: N BoNeL -s332 4CS   B: N BoNeL -s332 4CS   C: N BoNeL -s332 4CS
I -s332 Def   17.4%                  17.5%                  29.2%
StdSort -s    43.6%                  43.49%                 51%
Table 9: Average speed-ups of the fastest sorting network over the fastest insertion sort as base
case in sample sort and unmodified std::sort
4.7 Sorting a medium-sized Set of Items with Sample Sort
Sample sort was measured using benchmark 2 with parameters:
• numberOfIterations = 50
• numberOfMeasures = 200
• arraySize = 256.
The measurements were done with two different goals in mind: The first was to see which pa-
rameters work best for the machines used and the array size set. This can be seen in figures 17,
18 and 19 for the Bose Nelson networks optimizing locality. To be able to compare the results
on the different machines, the configurations were ordered based on the times from machine A,
and are in the same order in the other two plots. An oversampling factor of 3 and block size of
2 performed best on machine A and B. That configuration also performs best when using the
other networks or insertion sort as a base case.
On machine C, block sizes larger than 2 performed better (on average), along with oversampling
factors of 3 or greater. We measured larger variances and got a lot more outliers, so
choosing a “best” configuration was not so easy here. When looking at the other networks and
insertion sort as base case, consistently well performing parameters are an oversampling factor
of 3 and a block size of 4, but with very little lead over other configurations. That is interesting
to see because all three machines run x86 assembly instructions and have the same number of
general purpose registers available. What comes into play here is the size of the instruction
cache: Machine C has double the amount of L1 instruction cache that machines A and B
have. We can only assume that the instructions for classifying three elements at once need more
space than the smaller 32 KiB instruction caches can provide, while the 64 KiB instruction cache
of machine C fits the instructions for classifying four and almost five elements at
once, considering that block size 5 also performs well.
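The role of the block size can be illustrated with a small classification sketch. It rests on
assumptions and is not the thesis' implementation: Splitters, classify and count_buckets are
hypothetical names, the three splitters are passed by value so that a compiler may keep them in
registers (the thesis pins them to general-purpose registers explicitly), and only the bucket
sizes are counted instead of moving the elements.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Three sorted splitters s0 <= s1 <= s2 define four buckets.
struct Splitters {
    std::uint64_t s0, s1, s2;
};

// Two comparison steps without an if/else on the data: first against the middle
// splitter, then against the left or right splitter selected by the first result.
inline std::size_t classify(std::uint64_t x, const Splitters& s) {
    const std::size_t upper = x > s.s1 ? 1 : 0;
    const std::uint64_t next = upper ? s.s2 : s.s0;
    return 2 * upper + (x > next ? 1 : 0);
}

// Counts bucket sizes, classifying BlockSize elements per loop iteration. The
// classifications inside one block are independent of each other, so a larger
// block gives the CPU more independent work but also produces more code.
template <std::size_t BlockSize>
std::array<std::size_t, 4> count_buckets(const std::uint64_t* in, std::size_t n, Splitters s) {
    static_assert(BlockSize > 0, "block size must be positive");
    std::array<std::size_t, 4> counts{};
    std::size_t i = 0;
    for (; i + BlockSize <= n; i += BlockSize) {
        std::size_t bucket[BlockSize];
        for (std::size_t k = 0; k < BlockSize; ++k) bucket[k] = classify(in[i + k], s);
        for (std::size_t k = 0; k < BlockSize; ++k) ++counts[bucket[k]];
    }
    for (; i < n; ++i) ++counts[classify(in[i], s)];  // leftover elements, one at a time
    return counts;
}
```

When the inner loops are unrolled, the loop body grows roughly linearly with BlockSize, which is
consistent with the assumption above that larger block sizes only pay off when the instruction
cache is large enough.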
The second goal was to see if the results from sections 4.4, 4.5 and 4.6 would carry over to the
results from using sample sort with the sorting networks as base cases. These results can be seen
in figures 20, 21 and 22 for the 332 configuration. All measurements were made with a base case
limit of 16. Here, too, a single outlier was excluded from the dataset for scaling purposes: a
value of 40177 measured on machine B for the ’N BoNeP -s332 KR 4Cm’ sorter.
The achieved speed-ups of using the sorting networks are given in table 9. On the left we see
sample sort with insertion sort as base case and std::sort that was also measured sorting 256
elements. On the top we see the best performing network ’N BoNeL -s332 4CS’ as a base case
for sample sort on all three machines. The number indicates the speed-up of sample sort with
the network over sample sort with insertion sort and over std::sort.
Again we see that due to machine C having a larger L1 instruction cache the performance gain
is almost double that for the other machines. Unlike in the previous section though we got much
greater speed-ups as a result of using the sorting networks as a base case. That comes from
the fact that sample sort has no unpredictable branches classifying the elements, as opposed to
quicksort having to deal with conditional branches during the partitioning, while both need to
invest the same time to sort all the base cases. So with sample sort, the base case sorting takes
up a larger time slot of the whole execution than it does with quicksort. We also see that with
very few conditional branches we can get up to 50% faster than std::sort (for sets of up to 256
items at least).
Figure 17: Sample sort on machine A with 256 items. -Sxyz has parameters x =
numberOfSplitters, y = oversamplingFactor and z = blockSize
(Plot omitted. Panel: SampleSort; x-axis: CPU cycles per iteration; y-axis: sorting algorithm,
N BoNeL -Sxyz KR 4Cm, listed in the order 311, 315, 341, 321, 314, 313, 355, 325, 312, 353, 354,
324, 345, 323, 344, 342, 331, 343, 335, 351, 334, 352, 333, 322, 332.)
Figure 18: Sample sort on machine B with 256 items. -Sxyz has parameters x =
numberOfSplitters, y = oversamplingFactor and z = blockSize
(Plot omitted. Panel: SampleSort; x-axis: CPU cycles per iteration; y-axis: sorting algorithm,
N BoNeL -Sxyz KR 4Cm, listed in the order 311, 315, 341, 321, 314, 313, 355, 325, 312, 353, 354,
324, 345, 323, 344, 342, 331, 343, 335, 351, 334, 352, 333, 322, 332.)
Figure 19: Sample sort on machine C with 256 items. -Sxyz has parameters x =
numberOfSplitters, y = oversamplingFactor and z = blockSize
(Plot omitted. Panel: SampleSort; x-axis: CPU cycles per iteration; y-axis: sorting algorithm,
N BoNeL -Sxyz KR 4Cm, listed in the order 311, 315, 341, 321, 314, 313, 355, 325, 312, 353, 354,
324, 345, 323, 344, 342, 331, 343, 335, 351, 334, 352, 333, 322, 332.)
Figure 20: Sample sort 332 with different base cases on machine A
(Plot omitted. Panel: SampleSort; x-axis: CPU cycles per iteration, with a second scale giving the
value in relation to 'I -s332 KR Def'; y-axis: sorting algorithm. Sorters shown: I -s332 KR {POp,
STL, AIF, Def}; N {Best, BoNeL, BoNeM, BoNeP} -s332 KR {Tie, Def, 4Cm, CPr, Cla, 4CS}.)
Figure 21: Sample sort 332 with different base cases on machine B
(Plot omitted. Panel: SampleSort; x-axis: CPU cycles per iteration, with a second scale giving the
value in relation to 'I -s332 KR Def'; y-axis: sorting algorithm. Sorters shown: I -s332 KR {AIF,
STL, POp, Def}; N {Best, BoNeL, BoNeM, BoNeP} -s332 KR {Tie, Def, 4Cm, Cla, CPr, 4CS}.)
Figure 22: Sample sort 332 with different base cases on machine C
(Plot omitted. Panel: SampleSort; x-axis: CPU cycles per iteration, with a second scale giving the
value in relation to 'I -s332 KR Def'; y-axis: sorting algorithm. Sorters shown: I -s332 KR {AIF,
STL, POp, Def}; N {Best, BoNeL, BoNeM, BoNeP} -s332 KR {Tie, Def, 4Cm, CPr, Cla, 4CS}.)
4.8 Sorting a large Set of Items with IPS4o
With the efficient implementation of sample sort for medium-sized sets we can now include
the new base case sorters in a complex sorting algorithm. The In-Place Parallel Super
Scalar Samplesort (IPS4o) [AWFS17] was executed without introducing parallelism. The
algorithm has many parameters that can be adjusted. The important parameter for us was
BaseCaseSize4¹: it tells IPS4o to aim for base case sizes that are smaller than or equal to
BaseCaseSize4. Even though that is the goal, for a large-scale sorter like IPS4o it would
be far less efficient to partition e.g. 32 elements into many buckets, which might end up
containing only few elements each, than to just use the base case sorter in these situations,
even though the number of items is larger than the specified BaseCaseSize4.
That was the reason to develop Register Sample Sort, which can break those medium-sized sets
down into sizes that can be sorted using the sorting networks.
We started the measurements using the best configuration of sample sort from section 4.7 as a
base case for IPS4o, together with the default BaseCaseSize4 = 16, but that turned out to
perform worse than plain insertion sort.
The distribution of the base case array sizes can be seen in figure 23 for BaseCaseSize4 = 16
and figure 24 for BaseCaseSize4 = 32. From that it was evident that in most of the instances
with parameter BaseCaseSize4 = 16 the base case sorter was being invoked on sets even smaller
than 32 elements. That also meant that sample sort had to deal with a larger overhead
than insertion sort, not justified by a larger number of items.
In addition to that, the size of the instruction cache, which had already had a great influence
on the measurements of quicksort, seemed to be another factor in the bad performance of Register
Sample Sort as a base case.
That is why we decided to measure the following setups:
• Pure insertion sort as base case (I) with
– BaseCaseSize4 = 16 and 32
• Register Sample Sort as base case (S+N) with
– BaseCaseSize4 = 16, 32, and 64,
– Configurations 331 and 332, and
– Best networks and Bose Nelson networks (optimizing locality) as base case for Register
Sample Sort, with the 4CS conditional swap and base case size 16
• A combination of the sorting networks and insertion sort (I+N):
Since the base case sizes were often smaller than 16, we wanted to make use of that
by using the sorting networks, while not having to rely on Register Sample Sort with
its larger overhead for the slightly larger base cases. The solution was to use the Bose
Nelson networks (optimizing locality) if the set had 16 elements or fewer, and insertion sort
otherwise.
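The I+N combination from the last item amounts to a size-based dispatch, which can be sketched as
follows. The names network_sort, insertion_sort, kNetworkLimit and hybrid_base_case are
hypothetical, and network_sort is only a stand-in: it uses odd-even transposition, a simple
data-oblivious network that works for any size, where the thesis uses explicitly generated Bose
Nelson networks with the 4CS conditional swap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Stand-in network: n rounds of compare-exchange steps on fixed positions
// (odd-even transposition). The sequence of comparisons does not depend on the data.
inline void network_sort(std::uint64_t* a, std::size_t n) {
    for (std::size_t round = 0; round < n; ++round) {
        for (std::size_t i = round % 2; i + 1 < n; i += 2) {
            const std::uint64_t lo = std::min(a[i], a[i + 1]);
            const std::uint64_t hi = std::max(a[i], a[i + 1]);
            a[i] = lo;
            a[i + 1] = hi;
        }
    }
}

// Plain insertion sort for the sets that are too large for the networks.
inline void insertion_sort(std::uint64_t* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        const std::uint64_t v = a[i];
        std::size_t j = i;
        while (j > 0 && v < a[j - 1]) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = v;
    }
}

constexpr std::size_t kNetworkLimit = 16;  // largest size handled by a network

// I+N: networks for small sets, insertion sort for everything larger.
inline void hybrid_base_case(std::uint64_t* a, std::size_t n) {
    if (n <= kNetworkLimit) {
        network_sort(a, n);
    } else {
        insertion_sort(a, n);
    }
}
```

The single size check is the only overhead added on top of plain insertion sort, which is the
motivation given above for preferring this combination over Register Sample Sort for slightly
larger base cases.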
¹ We use the 4 to distinguish IPS4o’s BaseCaseSize from Register Sample Sort’s base case size.
Figure 23: Distribution of the size of the array passed to the base case sorter when executing
IPS4o with parameter BaseCaseSize4 = 16
(Histogram omitted. x-axis: Base Case Size, 0 to 384; y-axis: Frequency, 0.00 to 0.04.)
Figure 24: Distribution of the size of the array passed to the base case sorter when executing
IPS4o with parameter BaseCaseSize4 = 32
(Histogram omitted. x-axis: Base Case Size, 0 to 576; y-axis: Frequency, 0.000 to 0.020.)
Figure 25: Sorting times for IPS4o on machine A with different base cases and base case sizes
(Plot omitted. x-axis: CPU cycles per iteration, with a second scale giving the value in relation
to 'I -4 16 KR Def'; y-axis: sorting algorithm. Sorters shown: I -4 {32, 16} KR Def; I + N -4 16
KR 4CS; S+N {Best, BoNeL} -4 {16, 32, 64}_{331, 332} KR 4CS.)
Figures 25, 26 and 27 display the results of the measurements with the above variants. The
BaseCaseSize4 is appended after the -4, followed by an underscore and the Register
Sample Sort configuration.
The benchmark from algorithm 2 was used with parameters
• numberOfIterations = 50
• numberOfMeasures = 200
• arraySize = 1024 × 32 = 32768 = 2^15.
As already seen in [AWFS17], we get a speed-up of over 59% over std::sort with unchanged
IPS4o on all machines. On machine A, none of the variants we tried led to an improvement
in sorting speed over the default use of insertion sort at BaseCaseSize4 = 16. For machine B,
interestingly, using Register Sample Sort did not lead to an improvement, but the combination
of insertion sort and Bose Nelson networks did manage to reduce the sorting time by 4.3%.
For machine C we see the impact of the large L1 instruction cache in the visible improvement
of 9.2% for having Register Sample Sort as a base case instead of insertion sort, though the
combination of insertion sort and the sorting networks also performed well. It is notable
that, while Register Sample Sort by itself did well with blockSizes of 4 or 5, here it is beneficial
to use blockSize = 1, which has a smaller impact on the instruction cache.
             A: S+N BoNeL 16_331 4CS   B: I+N 16   C: S+N BoNeL 16_331 4CS
I 16 Def     -3.4%                     4.3%        9.2%
StdSort -s   59.1%                     61.7%       65%
Table 10: Average speed-ups of the fastest sorting network over the fastest insertion sort as
base case in IPS4o and unmodified std::sort
Figure 26: Sorting times for IPS4o on machine B with different base cases and base case sizes
(Plot omitted. x-axis: CPU cycles per iteration, with a second scale giving the value in relation
to 'I -4 16 KR Def'; y-axis: sorting algorithm. Sorters shown: I -4 {32, 16} KR Def; I + N -4 16
KR 4CS; S+N {Best, BoNeL} -4 {16, 32, 64}_{331, 332} KR 4CS.)
Figure 27: Sorting times for IPS4o on machine C with different base cases and base case sizes
(Plot omitted. x-axis: CPU cycles per iteration, with a second scale giving the value in relation
to 'I -4 16 KR Def'; y-axis: sorting algorithm. Sorters shown: I -4 {32, 16} KR Def; I + N -4 16
KR 4CS; S+N {Best, BoNeL} -4 {16, 32, 64}_{331, 332} KR 4CS.)
5 Conclusion
5.1 Results and Assessment
In this thesis we have seen that for sorting sets of up to 16 elements it can be viable to use
sorting algorithms other than insertion sort. We looked at sorting networks in particular, paying
special attention to the implementation of the conditional swap and giving multiple alternative
ways of realizing that implementation.
After seeing in sections 4.4 and 4.5 that the sorting networks outperform insertion sort when
each is run on its own for a specific array size, we saw in section 4.6 that this improvement does
not necessarily transfer to sorting networks being used as base case sorters in quicksort. Because
the networks have a larger code size, the code for quicksort is removed from the instruction cache,
and the advantage of not having conditional branches is impaired by that larger code size. But we
also saw that for machines with larger instruction caches, using sorting networks with quicksort
can lead to visible improvements of about 6.5%.
After that we integrated the sorting networks into a very advanced sorter like IPS4o, which
was possible by adding an intermediate sorter into the procedure. For that we created Register
Sample Sort, which is an implementation of Super Scalar Sample Sort that holds the splitters
in general-purpose registers instead of an array. When measuring IPS4o with Register Sample
Sort as a base case, we found that the instruction cache makes even more of a difference,
because we now add the code size for Register Sample Sort on top of the code size for the
sorting networks.
We proposed an additional alternative to Register Sample Sort, using a combination of insertion
sort and sorting networks: For base cases of 16 elements or less, we used the sorting network,
for any size above that insertion sort.
On one of the two machines with the smaller instruction cache of 32 KiB we could not achieve a
speed-up with any of the variants. On the other, the combination of insertion sort and sorting
networks led to an improvement in sorting time of 4.3%. The only substantial improvement
we achieved with IPS4o was on the machine with 64 KiB of L1 instruction cache, where using
Register Sample Sort led to an improvement of 9.2% over plain insertion sort.
In closing, we want to mention that this particular implementation only compiles with
the GCC C++ compiler due to compiler-dependent inline-assembly statements. This also means
that the code is probably not as fast as it could be, because the inline assembly is not
optimized by the compiler. The complete project is available on GitHub at
https://github.com/JMarianczuk/SmallSorters.
5.2 Experiences and Hurdles
The greatest hurdle we encountered during this project was, as mentioned in section 4.3, the fact
that the compiler reduces its optimizations with increasing compilation effort when compiling
only a single source file. That can lead to performance variations that happen for no “apparent”
reason, and is especially tricky when dealing with templated methods that cannot be moved
from header files into source files. The solution was to use code generation and to include all
logically coherent method invocations in one wrapper method that is then placed in its own
source file, so that different parts of the program do not influence each other in the decision
which one gets optimized and which one does not.
5.3 Possible Additions
In addition to the work in this thesis, we would like to explore further ways of implementing
the conditional swap for the sorting networks, and to determine which C++ compilers
generate conditional moves from portable C++ code instead of compiler-dependent
inline assembly. That also includes looking at conditional swaps for element types other than
the 64-bit key and reference value pair that we looked at in this thesis.
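To illustrate what a portable conditional swap could look like, the following sketch expresses the swap through two selections on the same predicate instead of inline assembly. It is an assumption-laden example, not the implementation from the thesis: the names conditional_swap, Item and ByKey are hypothetical, and whether a given compiler actually turns the selections into conditional moves is not guaranteed and would have to be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical element type: a 64-bit key with a 64-bit reference value.
struct Item {
    std::uint64_t key;
    std::uint64_t ref;
};

// Hypothetical comparator that orders items by key only.
struct ByKey {
    bool operator()(const Item& x, const Item& y) const { return x.key < y.key; }
};

// Portable conditional swap: afterwards a holds the smaller and b the larger
// element. Both results are selected with the same predicate, which gives the
// compiler the chance (but not the obligation) to emit conditional moves.
template <typename T, typename Less = std::less<T>>
inline void conditional_swap(T& a, T& b, Less less = Less{}) {
    const bool swap_needed = less(b, a);
    const T smaller = swap_needed ? b : a;
    const T larger = swap_needed ? a : b;
    a = smaller;
    b = larger;
    assert(!less(b, a));  // postcondition: the pair is now in order
}
```

Because the function is a template over the element type and the comparator, the same sketch covers element types other than the key and reference value pair, for example conditional_swap(x, y) for two integers or conditional_swap(p, q, ByKey{}) for two items.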
Furthermore, we would like to look at implementing sorting networks in a way that takes
up less code space, and at what the trade-off for that decreased code size would be.
Apart from the sorting networks, we would also like to take another look at Register Sample
Sort to find out whether using seven splitters instead of three is more practical for
input sizes larger than 256.
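The difference between three and seven splitters can be made concrete with the implicit-tree classification known from Super Scalar Sample Sort: three splitters yield four buckets after two comparisons per element, seven splitters yield eight buckets after three comparisons. The sketch below shows this generic scheme only; it is not the register-based classification of Register Sample Sort, and the names fill_tree, build_tree and classify are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Places the sorted splitters into an implicit binary search tree
// (1-based: children of node j are 2j and 2j+1, index 0 is unused).
template <std::size_t K>
void fill_tree(std::array<std::uint64_t, K + 1>& tree, std::size_t node,
               const std::array<std::uint64_t, K>& sorted,
               std::size_t lo, std::size_t hi) {
    if (lo >= hi) return;
    const std::size_t mid = lo + (hi - lo) / 2;
    tree[node] = sorted[mid];
    fill_tree<K>(tree, 2 * node, sorted, lo, mid);
    fill_tree<K>(tree, 2 * node + 1, sorted, mid + 1, hi);
}

template <std::size_t K>
std::array<std::uint64_t, K + 1> build_tree(
        const std::array<std::uint64_t, K>& sorted) {
    static_assert(((K + 1) & K) == 0, "K must be of the form 2^d - 1");
    std::array<std::uint64_t, K + 1> tree{};
    fill_tree<K>(tree, 1, sorted, 0, K);
    return tree;
}

// Returns the bucket index in [0, K] for a key. The loop runs exactly
// log2(K + 1) times, and each step is a comparison whose result is added
// to the index, so no data-dependent branch is required.
template <std::size_t K>
std::size_t classify(const std::array<std::uint64_t, K + 1>& tree,
                     std::uint64_t key) {
    static_assert(((K + 1) & K) == 0, "K must be of the form 2^d - 1");
    std::size_t j = 1;
    while (j <= K) {
        j = 2 * j + (key > tree[j] ? 1 : 0);
    }
    const std::size_t bucket = j - (K + 1);
    assert(bucket <= K);
    return bucket;
}
```

With K = 3 each element costs two comparisons and lands in one of four buckets; with K = 7 it costs three comparisons and lands in one of eight, so larger inputs are split into smaller pieces per pass at the price of keeping more splitters available.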
References
[AWFS17] Axtmann, Michael ; Witt, Sascha ; Ferizovic, Daniel ; Sanders, Peter:
In-Place Parallel Super Scalar Samplesort (IPSSSSo). In: 25th Annual Euro-
pean Symposium on Algorithms, ESA 2017, September 4-6, 2017, Vienna, Austria.
https://github.com/SaschaWitt/ips4o, 2017, 9:1–9:14
[Bat68] Batcher, Kenneth E.: Sorting Networks and Their Applications. In: Ameri-
can Federation of Information Processing Societies: AFIPS Conference Proceedings:
1968 Spring Joint Computer Conference, Atlantic City, NJ, USA, 30 April - 2 May
1968, 1968, 307–314
[ber18] bertdobbelaere: SorterHunter. 2018
[BN62] Bose, R. C. ; Nelson, R. J.: A Sorting Problem. In: J. ACM 9 (1962), Nr. 2, 282–
296. http://dx.doi.org/10.1145/321119.321126. – DOI 10.1145/321119.321126
[CCFS14] Codish, Michael ; Cruz-Filipe, Luís ; Frank, Michael ; Schneider-Kamp, Pe-
ter: Twenty-Five Comparators Is Optimal When Sorting Nine Inputs (and Twenty-
Nine for Ten). In: 26th IEEE International Conference on Tools with Artificial
Intelligence, ICTAI 2014, Limassol, Cyprus, November 10-12, 2014, 2014, 186–193
[CCNS17] Codish, Michael ; Cruz-Filipe, Luís ; Nebel, Markus ; Schneider-Kamp, Pe-
ter: Optimizing sorting algorithms by using sorting networks. In: Formal Asp. Com-
put. 29 (2017), Nr. 3, 559–579. http://dx.doi.org/10.1007/s00165-016-0401-3.
– DOI 10.1007/s00165–016–0401–3
[Fre19] Free Software Foundation: How to Use Inline Assembly Language in C Code.
https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.
html, 2019
[Gam19] Gamble, John M.: Sorting network generator. http://pages.ripco.net/
~jgamble/nw.html, 2019
[Knu98] Knuth, Donald E.: The art of computer programming, , Volume III, 2nd Edi-
tion. Addison-Wesley, 1998 http://www.worldcat.org/oclc/312994415. – ISBN
0201896850
[SW04] Sanders, Peter ; Winkel, Sebastian: Super Scalar Sample Sort. In: Algorithms -
ESA 2004, 12th Annual European Symposium, Bergen, Norway, September 14-17,
2004, Proceedings, 2004, 784–796
51
