Parallel Sorting on Multi-core Architecture by Wang, Wei
Wright State University 
CORE Scholar 
Browse all Theses and Dissertations Theses and Dissertations 
2011 
Parallel Sorting on Multi-core Architecture 
Wei Wang 
Wright State University 
Follow this and additional works at: https://corescholar.libraries.wright.edu/etd_all 
 Part of the Computer Sciences Commons 
Repository Citation 
Wang, Wei, "Parallel Sorting on Multi-core Architecture" (2011). Browse all Theses and Dissertations. 484. 
https://corescholar.libraries.wright.edu/etd_all/484 
This Thesis is brought to you for free and open access by the Theses and Dissertations at CORE Scholar. It has 
been accepted for inclusion in Browse all Theses and Dissertations by an authorized administrator of CORE 
Scholar. For more information, please contact library-corescholar@wright.edu. 
Parallel Sorting on Multi-core Architecture
A thesis submitted in partial fulfillment








SCHOOL OF GRADUATE STUDIES
August 19, 2011
I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPER-
VISION BY Wei Wang ENTITLED Parallel Sorting on Multi-core ArchitectureBE
ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF Master of Science.






Meilin Liu, Ph. D.
Jack Jean, Ph. D.
T.K. Prasad, Ph. D.
Andrew Hsu, Ph. D.





Wang, Wei. M.S. Department of Computer Science and Engineeri g, Wright State
University, 2011. Parallel Sorting on Multi-Core Architecure.
With the limitations given by the power consumption (power wall), memory
wall and the instruction level parallelism, the computing idustry has turned its direc-
tion to multi-core architectures. Nowadays, the multi-core and many-core architectures
are becoming the trend of the processor design. But how to expl it these architectures
is the primary challenge for the research community. To takeadvantage of the multi-
core architectures, the software design has undergone fundamental changes.
Sorting is a fundamental, important problem in computer scien e. It is utilized
in many applications such as databases and search engines. In this thesis, we will in-
vestigate and auto-tune two parallel sorting algorithms, i.e. radix sort and sample sort
on two parallel architectures, the many-core nVIDIA CUDA enabled graphics proces-
sors, and the multi-core Cell Broadband Engine. We redesignand manually tune these
two parallel sorting algorithms to take advantage of multiple-level parallelism simul-
taneously, i.e., thread level parallelism, loop level parallelism, data level parallelism
(SIMD instructions). At the same time, we try to take advantage of the high-speed
shared memory. The experimental results showed that the parallel implementation of
these two sorting algorithms on these two multi-core archite tures achieved significant
performance improvement compared to the corresponding sequential version.
iv
TABLE OF CONTENTS
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . 8
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . 9
CHAPTER 2. BASIC CONCEPTS ABOUT PARALLEL COMPUTING . . . . . . 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Data Dependence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . 10
2.3 Locality of Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Instruction-Level Parallelism . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 15
2.4.2 Data-Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . 16
2.4.3 Thread-Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Amdahl’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Memory Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . 20
2.6.1 Shared Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.2 Distributed Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 21
CHAPTER 3. THE ARCHITECTURE OF CELL BROADBAND ENGINE . . . 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Decreasing the Impact of Serial-performance Walls . . . .. . . . . . . . . . . . . . . . . . . 24
3.3 Architecture Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . 25
3.4 PowerPC Processor Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 PowerPC Processor Unit . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . 26
3.4.2 PowerPC Processor Storage Subsystem . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Synergistic Processor Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
v
3.5.1 Synergistic Processor Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.2 Memory Flow Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 The Element Interconnect Bus . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 30
3.7 The Memory interface Controller. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . 30
3.8 Cell Broadband Engine Interface Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.9 Programming Model for the PPE and SPEs . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.9.1 Language-extension Differences Between PPE and SPE .. . . . . . . . . . . 33
3.9.2 Communication between the PPE and SPEs . . . . . . . . . . . . . . . . . . 34
CHAPTER 4. THE ARCHITECTURE OF GPU . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 An Overview of the GPU Architecture. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . 36
4.2.1 Streaming Processor Array . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 37
4.2.2 Streaming Multi-processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.3 Streaming Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . 39
4.3 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . 39
4.3.1 Global Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . 39
4.3.2 Shared Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.3 Local Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . 40
4.4 Programming Model for CUDA GPU . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 40
4.4.1 CUDA Programming Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Single Instruction Multiple Threads and Warp Divergenc . . . . . . . . . . 42
CHAPTER 5. PARALLEL RADIX SORT. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . 45
5.1 The Prefix Sum Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 45
5.1.1 Sequential Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . 46
5.1.2 An Inefficient Parallel Scan . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 47
5.1.3 An Efficient Parallel Scan . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 48
5.2 Radix Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 Serial Radix Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . 51
5.2.2 CUDA Parallel Radix Sort . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . 53
vi
5.2.3 Parallel Radix Sort on the Cell B.E . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
CHAPTER 6. PARALLEL SAMPLE SORT . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 63
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Sample Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 CUDA Parallel Sample Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.4 Cell B.E. Sample Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
CHAPTER 7. CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . .. . . 76
REFERENCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . 79
vii
LIST OF FIGURES
2.1 Basic Structure of UMA System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Basic Structure of NUMA System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Basic Architecture of Distributed Memory System. . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 Overview of CBEA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26
3.2 PPE Structure Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Structure of SPE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Vector Add Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 The G80 GPU Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Texture/Processor Cluster (TPC). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 CUDA Memory Hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Warp Scheduling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 An Illustration of the Inefficient Scan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 An Illustration of the Up-Sweep Phase of the Efficient Scan. . . . . . . . . . . . . . . . 49
5.3 An Illustration of the Down-Sweep Phase of the Efficient Scan. . . . . . . . . . . . . 50
5.4 The Operation of Radix Sort on 6 4-digit Numbers. . . . . . . . . . . . . . . . . . . . . . . . 51
5.5 The Layout of the Histogram Table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6 TheSplit Operation Based on the Least Significant Bit. . . . . . . . . . . . . . . . . . . . . 57
5.7 The Extended Parallel Scan Algorithm for a Large Array. . . . . . . . . . . . . . . . . . . 58
5.8 The Comparison of Radix Sorting Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1 An Illustration of the sample Sort. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2 The Comparison of Parallel Sample Sort and the Sequential Quick Sort . . . . . . 74
viii
LIST OF TABLES
3.1 Vector Data Type Supported by PPE and SPE . . . . . . . . . . . . . .. . . . . . . . . . . 33
5.1 The Running Time for Transferring220 Keys to/from Shared Memory . 54
5.2 Running Time of the Radix Sorting Algorithm . . . . . . . . . . .. . . . . . . . . . . . . 60
6.1 Running Time of the Parallel Sample Sort and Serial QuickSort . . . . . . . 74
ix
ACKNOWLEDGEMENTS
First, I would like to thank my advisor, Dr. Meilin Liu, for her ideas, advice, and
constant encouragement. Without her guidance and advice, this body of work would
not have been possible.
I also would like to thank my thesis committee members: Drs. Jack Jean and
T.K. Prasad for taking their invaluable time to serve on my committee and help in
improving this document.
Next, I would like to thank Mark Haines for correcting this thesis.
I also wish to thank my friends in Wright State University fortheir help and






On 19th April 1965, Gordon E. Moore published a paper with thetitl ”Cramming
more components onto integrated circuits” in ”Electronic Magazine” [14]. In this pa-
per, he stated that ”The complexity for minimum component costs has increased at a
rate of roughly a factor of two per year. Certainly over the short term this rate can be
expected to continue, if not to increase. Over the longer term, the rate of increase is
a bit more uncertain, although there is no reason to believe it will not remain nearly
constant for at least 10 years. That means by 1975, the numberof components per
integrated circuit for minimum cost will be 65,000 [21].” The author presented that he
believed such a large circuit can be built on a single wafer [21]. In the revised version
of that paper, the author stated that transistor density on integrated circuits doubles ev-
ery 18 months. Although Moore is adamant that he did not use the term ”18 months”,
we still refer it as Moore’s law, which is a very significant principle affecting the de-
velopment of computer industry. Under the guidance of Moore’s law, the computer
industry has been witnessing tremendous advances in microprocessor technology over
the past couple of decades. Clock cycle speed of processors has been improved from
about 4.77MHz in the 1980s to about 2GHz in the 2000s.
While Moore’s law continues to be a reliable predictor in thecomputer indus-
try for the coming decade, we can expect little improvement in system performance of
1
2
the general purpose CPUs if we only double the number of transistor integrated on a
chip to increase clock rate. The first reason for this is powerdissipation. As we know,
power dissipation in an electronic device is proportional to the clock rate. While the
clock frequency has increased by a factor of 4000 in the last deca e, the technology of
power dissipating has almost reached physical limitation nw. It is impossible to in-
crease the clock frequency without taking cooling techniques into account. Therefore,
a significant increment of the clock rate will not be achievedwithout a breakthrough
in the technology of power dissipation. This is the power wall which impedes the
improvement of system performance.
The gap between the performance improvement of processors and the memory
systems is another factor impacting system performance. Becaus the performance
improvement of the memory systems can not keep up with the growth f CPU speed,
which has introduced a large gap between them. In order to reduce the average memory
access time, thecacheis introduced to the memory hierarchy. For modern computer
architecture, increasing the cache size and utilizing multi-level cache are two primary
methods to reduce the cache miss rate. However, even with theelp of cache, we
can not prevent memory performance improvement from lagging behind the processor
speed yet.
System performance can be improved not only from the growth of e clock
frequency, but also from theinstruction level parallelism(ILP). The primary benefit we
can obtain from ILP is that the performance of a sequential program will be increased
without any modification. However, the speedup of system performance by utilizing
ILP is usually stalled. The data dependences are the main factor which prevents the
consecutive instructions from being executed in parallel.And the conditional branch is
the second. For the branch instruction, its execution path cn not be decided before the
3
result of the current branch is known. But in order to achieveILP, the target instruction
must be predicted andspeculative executedbefore the correct result is produced. Since
the branch is very difficult to predict successfully, it is very likely that the processor
speculatively executes an incorrect branch. Because the penalty of taking the incorrect
branch is significant, the improvement of system performance by using ILP can not be
guaranteed.
David Patterson from U.C Berkeley has a formula which summarizes the prob-
lem of serial performance: The power wall + the memory wall + the ILP wall = the
wall for system performance [12]. While Moore’s law will continue to guarantee that
more gates can be integrated into a chip, the improvement of system performance ob-
taining from increasing clock frequency has reached the limitation. Therefore, the
hardware engineers have to veer their focus to parallel architecture to keep making
progress. Rather than aggregating more transistors withina si gle processor, com-
puter architects use more possible gates to create additional independent processors in
the chip.
The nVIDIA CUDA enabled graphics processors, as well as the Sony Cell
Broadband Engine architecture are two well-designed parallel platforms. CUDA is a
software and hardware architecture which allows programmers to code in high level
languages without considering the graphics interfaces of the GPU. Instead of using
APIs provided by GPU, a programmer can write a general-purpose application with
CUDA C/C++. Compared with general-purpose CPU, the programming model for
CUDA is single program multiple data (SPMD), in which each instruction of a pro-
gram will be executed by multiple threads in parallel. The Cell Broadband Engine ar-
chitecture is a single-chip heterogeneous multiprocessorsystem cooperating with both
shared and distributed memory. It has nine processors: One Pow r C Processor Ele-
4
ment, which is optimized for scheduling tasks and running operating system, and eight
Synergistic Processor Elements, which are optimized for executing compute-intensive
tasks.
Parallelism introduces much more complexity into the process of program-
ming. In parallel programming, the most important aspect isto exploit concurrency,
which means decomposing a problem into sub-problems which can be executed in par-
allel. Therefore, how to find parallelism in a sequence program, and how to expose the
parallelism are the major problems. In order to implement cocurrency, it is important
to write code to decompose problems and exploit parallelismin the running time. The
two decomposition strategies, task decomposition and datadecomposition, are usually
used to achieve this goal. Intuitively, this pattern usually maps similar operations onto
different elements of data arrays.
Parallelism brings unique challenges to programmers. It isvery common that
concurrent tasks which constitute the problems involve dependences which must be
identified and managed appropriately. The execution of tasks must follow program
order. A good parallel programmer must rule out all the non-deterministic issues.
Structuring correct parallel programs may take considerabl effort from programmers
when compared to writing sequential programs.
Even though a parallel program is safe, it is possible that the performance im-
provement realized from exploiting concurrency is not as good as expected. Under
these circumstances, we must make sure that the overhead of incurred parallelism does
not overwhelm the program’s runtime. Moreover, distributing tasks in a absolute bal-
anced way is not straightforward. Therefore, the performance of one parallel program




Sorting plays a very important role in the area of computer science, because it not only
can be used in many applications, but also has theoretical significance. It has been
applied in numerous fields such as database management system and knowledge base
management systems. And for the architecture of modern advanced computers, the
ratio of the sorting time to the total CPU time is very significant. Someone has con-
ducted a survey, and concluded that general-purpose computers can spend to75% total
CPU time on sorting [10]. When computing or data processing is executed, it usually
involves the sorting problem. Programmers need to select different sorting algorithms
to obtain the expected performance on various computer architectures. Moreover, for
the well-known algorithms, sorting is a very interesting topic deserving studying and
analyzing. Thus, it is critical for students whose major is computer science to best
understand the various sorting algorithms.
Generally speaking, sorting is a rearrangement process condu ted in accor-
dance with the increasing or decreasing relationship between k ys’ values of records,
which changes the original set of records with arbitrary order into a set of records
with monotone increase or monotone decrease order based on the values of keys. Sup-
pose a file with n records is represented as{R1, R2, ..., Rn}, and the keys’ values
corresponding to each record are{k1, k2, ..., kn}, then we can determine a permuta-
tion t(1), t(2), ..., t(n), such that the order of the keys’ values satisfy the following
requirement:
kt(1) ≤ kt(2) ≤ ... ≤ kt(n)
or
6
kt(1) ≥ kt(2) ≥ ... ≥ kt(n)
the corresponding file will become{Rt(1), Rt(2), ..., Rt(n)}, which is permuted in terms
of the order of keys’ values. We call this type of operation sorting, and state that the
previous relation is an ascending order, and the later orderis a descending order. In
practice, we can simply consider a sorting operation as a process in which a sequence
without order is transformed into a sequence with order based on a given value.
Sorting, also known as classification, is a very efficient wayto improve timing
performance for the search operation. If the search is executed on an unordered se-
quence with the size ofn, its time complexity isO(n). But if this sequence is in order,
its time complexity will be decreased toO(logn2 ).
Sorting can be divided into several categories from different points of view:
Based on the usage of different storage in the process of sorting, sorting algo-
rithms can be categorized as eitherinternalor externalsorting algorithms. For internal
sorting, the size of the keys is small enough that all the keyscan be entirely stored
in the main memory. On the contrary, an external sorting algorithm needs the help of
the external storage such as hard disk. Processors must exchange data between main
memory and external storage. This kind of sorting is named external sorting. The
speed of external sorting is much slower than internal sorting. Since there is a limita-
tion of the storage ability of main memory, some files with large amounts of data can
not be loaded into the main memory at one time to sort. Therefore, external sorting is
the only way in which we can sort files with large amount of data.
Sorting algorithms can also be categorized asst ble sortingor unstable sorting
algorithms. Stable sorting keeps the relative order for thekeys with same value. A
sorting algorithm is stable if the keys with same value aftersorting are in the same
7
order as they are before sorting. If all the keys are distinct, the property of stability is
trivial. Most of the sorting algorithms such ascount sortare stable. Unstable sorting
does not need to keep the relative order for keys with same value. Thequick sortis
an unstable sorting algorithm. The property of stability isvery important because it
is the primary reason which some sorting algorithms such as te radix sortcan work
correctly.
Sorting algorithms can be divided intoc mparison-basedsorting andnon comparison-
basedsorting algorithms. A comparison-based sorting algorithmsorts an unordered
sequence by repeatedly comparing two sequential keys, and exchanges them if they
are out of order. Non comparison-based sorting algorithms exploit some properties of
the keys, such as their binary representation, to sort.
Some of the most important applications for sorting are illustrated in [10]:
Solving the problem of togetherness, in which all keys with the same value are
concatenated together. Keys with the same value will appearin consecutive positions
after sorting. This is the conventional meaning of a sortingproblem.
Matching keys among files: When the keys in several files are inorder, the
problem of matching keys can be satisfied in one sequential pass without backing up.
Accessing the information table sorted on the key is usuallyf ster than the randomly
table, unless the table is so small that all the keys can be stor d entirely in high-speed
random access memory.
Searching information based on the value of the keys: Sorting can be used to
accelerate the speed of searching. Sorting makes the data record asier to be read,
because in most cases people feel more comfortable reading an information report
sorted in alphabetical order than a report without any order.
8
The emergence of parallel computer architecture with largecaches turns our
attention more on the efficient parallel approaches to solvesorting problems. Paral-
lelizing sequential sorting algorithms usually raises a number of issues. And those
problems vary based on different sorting algorithms. For the parallel radix sorting al-
gorithm which we will discuss in chapter 5, our primary concer is how to eliminate
data dependences in the HISTOGRAM phase. For the parallel sample sort presented in
chapter 6, our focus is on how to distribute keys into bucketsas efficiently as possible.
1.3 Problem Statement
Nowadays, the multi-core and many-core architectures are becoming the trend of the
processor design. From the above introduction we can conclude that sorting has been
widely used. In order to achieve better performance, a largenumber of parallel sorting
algorithms have been published. In this thesis, we will investigate and auto-tune two
parallel sorting algorithms, i.e., radix sort and sample sort on the two parallel architec-
tures, the many-core nVIDIA CUDA enabled graphics processors, and the multi-core
Cell Broadband Engine. We re-design and manually tune thesetwo parallel sorting
algorithms to take advantage of multiple-level parallelism imultaneously, i.e., thread
level parallelism, loop level parallelism, and data level parallelism (SIMD instruc-
tions). At the same time, we try to take advantage of the high-speed shared memory.
Our experimental results showed that the parallel radix sort implemented on CUDA
and Cell B.E. yield, on average, 45 and 7 times performance improvement, respec-
tively, compared to the sequential radix sort. The experimental results also showed
that the parallel sample sort implemented on CUDA and Cell B.E. have roughly 68
times and 4 times speedup on average, respectively, over thesequential quick sort.
9
1.4 Organization
The organization of the paper is as follows:
Chapter 1 gives the motivation and the background of the problem, also out-
lines the problems which this thesis will address.
Chapter 2 overviews some basic parallel concepts and princiles related to this
thesis, which include data dependency, instruction/data/thread level parallelism, gran-
ularity. This chapter will also introduce Amdahls law, whicis the basis for later
chapters. In the end of this chapter, we will introduce memory hierarchy.
Chapter 3 introduces the Cell Broadband Engine Architectur, as well as the
architecture features which will be utilized in the Cell B.E. parallel sorting algorithms.
Chapter 4 illustrates the architecture of the nVIDIA graphics processors, along
with the CUDA C language extension which will be exploited inthe CUDA parallel
sorting algorithms.
Chapter 5 presents the radix sort algorithm and its implementatio s on both
target architectures.
Chapter 6 describes the sample sort algorithm in detail, as well as the imple-
mentations on both target architectures.
Chapter 7 concludes the master thesis.
CHAPTER 2
BASIC CONCEPTS ABOUT PARALLEL COMPUTING
2.1 Introduction
In this chapter, we elaborate on some basic definitions and cocepts related to the
techniques will be presented in this thesis, which includesdata dependence, locality of
reference, granularity, and instruction/data/thread level parallelism. Also, this chapter
will introduce Amdahl’s law, which is the fundamental for later chapters. And this
chapter will conclude with memory hierarchy.
2.2 Data Dependence
Dependence is a kind of relationship that represents how oneinstruction depending
on the others. It determines to what extent parallelism can exist in a program as well
as how to exploit this parallelism. Also, it restricts how the data is generated and
consumed in correct program order. This relationship produce by this constraint is
called data dependence. If two instructions are dependent,they should be executed in
order. On the contrary, if two instructions are not dependent, they are parallel and can
be run simultaneously.
There are three different types of dependences existing among instructions:
data dependence, name dependence, and control dependence.In this chapter, we only
discuss data dependence, which is highly related to the following chapters. Consider
10
11
the following C style segment of code:
S1 A = 5
S2 B = 2
S3 C = A + B
It is safe if we change the order of statement S1 and S2, but it will produce the po-
tentially incorrect result when we reschedule S3 before either S1 or S2. In order to
keep the program order that statement S1 and S2 execute beforS3, we introduce data
dependence from S1 and S2 to S3. There is no data dependence betwe n S1 and S2,
since the interchange of S1 and S2 will generate the same valuas the execution order
of S1, S2, S3. The following definition of data dependence is from [1].
Definition 2.2.1. There is a data dependence from statement S1 to statement S2 (state-
ment S2 depends on statement S1) if and only if
• both statements access the same memory location and at leastone of them stores
into it,
• there is a feasible runtime execution path from S1 to S2 [1].
In order to produce correct result, we must take care of the ord r of load and
store to the same memory location. Also, we must make sure that the order of two
stores to the same location is correct so that the following reference will get the ex-
pected value.
There are three different kinds of data dependences which can be produced in
a program based on [1]:
12
• True dependence or flow dependence. The former statement writes to, and the
latter statement reads from the same memory location:
S1 M =...
S2 ... = M
This type of dependence guarantees that the result read by S2is written by S1 at
first.
• Anti-dependence. The former statement reads from, and the la ter statement
writes to the same memory location:
S1 ... = M
S2 M =...
The interchange of S1 and S2 will lead to S1 using an unexpected value produced
by S2. Also, it will introduce true dependence if we change the order of these
statements. This type of dependence guarantees that this case can not happen.
• Output dependence. Both statements store into the same memory location:
S1 M =...
S2 M =...
This type of dependence prevents S1 and S2 from interchange,which will cause
the latter statement statement reading from an unexpected value. Consider the
following example:
S1 M = 1
S2 ...
S3 M = 2
S4 N = 4 * M
The result of variableN in S4 will be 8 if the above code fragment is executed
in order. Suppose the order of S1 and S3 is exchanged, it will produce 4 instead
13
of 8 as a result for variableN , which is not what we expected.
From the perspective of hardware design, dependence is often call dhazardor
stall. True dependence is equal to read after write hazard; anti-dependence is equiv-
alent with read after write hazard; and output dependence isthe ame as write after
write hazard [1,5].
2.3 Locality of Reference
In the area of computer science, locality of reference, alsoknown as the principle
of locality, is the phenomenon of the same value or related storage locations being
frequently accessed [20]. There are two basic types of locality which should be con-
sidered [16]:
• Temporal locality (locality in time): if an item is referenced, it will tend to be
referenced again soon.
• Spatial locality (locality in space): if an item is referencd, items whose ad-
dresses are close by will tend to be referenced soon.
Consider the following example:
FOR I = 0 TO N
A[I] = B[I] * 2
C[I] = A[I] * B[I+1]
ENDFOR
The reference of A[I] is temporal locality, since memory location of A[I] is used twice
in each iteration. Accessing array B not only owns temporal locality but also has
14
spatial locality. Because this array is allocated to a contiguous memory space, after the
access of an element of B in the first instruction, the item that next to this element will
be accessed immediately in the second statement of each iteration.
The reason that locality occurs often is twofold. Firstly, highly related data is
always stored in neighboring memory locations. For instance, we prefer storing a set
of related data into an array or a vector. Secondly, programsinvolving loops usually
access arrays or vectors by index. Due to these properties, it s feasible to use cache to
improve system performance. Cache is a hardware component which is used to store
data. It is proximate to CPU with small storage compared to main memory. Hence
fetching data from cache is much faster than fetching from main emory. because not
only is it close to CPU, but also it spends short time to addressing. Therefore, as the
existence of locality, we prefer putting frequently accessed data into cache to improve
system performance.
2.4 Granularity
In the field of parallel computing, granularity refers to theamount of computation in
relation to communication [19]. It is a basic concept about data decomposition, and
has a significant influence on the efficiency of parallel computing. Fine-grained paral-
lelism, also called fine-grained data decomposition, whichmeans there are larger num-
ber of individual tasks with smaller amounts of computation. On the other hand, for
coarse-grained parallelism or coarse-grained decomposition, there are smaller number
of individual tasks with larger amounts of computation. Although fine-grained decom-
position can better utilize the potential parallelism, theov rhead of synchronization
and communication will become larger. To gain the best performance, care should be
taken to find the balance of the overhead between computationnd communication.
15
If the decomposition is too fine-grained, the timing performance can be dominated by
communication; if it is too coarse-grained, computation caoverwhelm the communi-
cation time.
In most cases, programmers manually tune the data size to obtain the best par-
allel performance based on their experience instead of deriving an optimum granu-
larity mathematically. The shape of the data for each task alo ffects the parallel
performance. Normally, the performance is better if the data stored in memory could
be coalesced. The potential reason is that the function unitmay combine multiple
memory access instructions together and then issue them, which results in accessing
memory only once.
2.4.1 Instruction-Level Parallelism
Since about 1985, all processors adopt pipelining to overlap the execution of instruc-
tions and improve performance. This potential parallelismamong instructions is called
instruction-level parallelism(ILP), because instructions may be executed in parallel [5].
Instruction-level parallelism can be increased by two mainmethods. The first is
overlapping more instructions to be executed simultaneously by increasing the pipeline
depth. The other method is doubling the functional units inside CPU so that multiple
instructions can be issued in each clock cycle. We call this technique super-scalar.
However, if data dependence exists among instructions, theperformance speedup which
can be achieved by using these methods is limited. In following code segment:
S1 M = A * B
S2 N = C + D
S3 P = M * N
Statements S1 and S2 are independent, so they can be executedin parallel. Suppose
16
each instruction can be finished in one clock cycle, and if we only take statements S1
and S2 into account, the ILP is2. However, since statement S3 depends on the results
of S1 and S2, it can not be executed until its required values ar available. For all of




Typically, programmers write code in a sequential manner bywhich they sup-
pose instructions are executed in the order specified by programs. In this model, they
do not need to consider how to schedule their code to best utilize ILP. In contrast, it
is the task of compiler or underlying hardware to identify and exploit as much ILP as
possible to improve performance.
2.4.2 Data-Level Parallelism
Data-level parallelism (DLP), or loop-level parallelism,describes applying the same
instruction to different data. A typical example of DLP is matrix addition. Since all
the elements of two matrices have no data dependences for addition, the operation of
addition can be performed simultaneously.
SIMD, single instruction multiple data, is a basic concept highly related with
DLP. From the hardware perspective, it allows all function units to execute in parallel,
and allows all of them to perform a single instruction but with different data sets. From
the programmer’s point of view, a set of data can be operated at the same time, such as
addition of vectors.
The benefit behind DLP is that it amortizes the cost of instruction fetch and
decode, because one issued instruction can be used in multiple function units. DLP is
very efficient when it works on arrays in loops.
17
FOR I = 0 TO 7
C[I] = A[I] * B[I]
ENDFOR
In the above example, if there is only one function unit residing in the processor, it
will take 8 iterations to finish this loop. However, suppose we have two function units.
Then array A, B and C can be fairly distributed to each unit. Therefore, the first unit
needs only to execute an even number of iterations, while thesecond one is taking care
of odd number of iterations. All of the function units will perform multiplication but
on different data which is decided by the loop index.
2.4.3 Thread-Level Parallelism
It is totally transparent to programmers that using instruction-level parallelism im-
proves system performance , but ILP’s application is very limited, especially for in-
structions with data dependences. Thread-level parallelism, as known as task-level
parallelism, is higher-level parallelism. It logically groups instructions as separate
threads which can be executed in parallel. A thread is the smallest execution unit
which can be scheduled independently by the operating system. Compared with ILP,
which focuses on finding and exploiting the parallelism within loops or straight-line
instructions, thread-level parallelism explicitly distrbutes tasks to multiple threads by
using the inherently parallelism of the program, and executes them in parallel.
Thread-level parallelism is an important alternative to data-level or instruction-
level parallelism [5]. As we will discuss in later chapters,the high performance can
be achieved in the CELL B.E. platform by exploiting the data-level parallelism such
as SIMD instructions. But as to according to taking advantage of the thread-level
parallelism, even higher performance is gained in the platform of CUDA.
18
Following code segment of CUDA is a straightforward exampleshowing how




int A[4], B[4], C[4];
...




__global__ void array_add(g_A[], g_B[], g_C[])
{
idx = threadIdx.x;
g_C[thid] = g_A[idx] + g_B[idx];
}
In this example, four threads are explicitly declared and dispatched to execute
the function of array addition. Each thread performs addition only once, and the data
it needs can be located by using its unique thread number, so the e four threads work
together to complete the operation of array addition.
19
2.5 Amdahl’s Law
In parallel computing, the improvement of system performance which can be obtained
is limited by the program segment that can be parallelized. In other words, not an
entire program can be parallelized. It is usually impossible to optimize some part
of a program. Even if we can, the return in doing that is very trivial or may cause
side-effect. Amdahl’s law serves as a guideline to find the maxi um performance
improvement. It states that the performance improvement which can be obtained from
using parallelism is limited by the fraction of a program in which the parallelism can
be used [5]. We can use a formula to express it:
Tnew = Told ∗ Fractionunaffected +
Told ∗ Fractionaffected
Amount of improvement
For example, suppose the execution time of a sequential program is 100 seconds. And
we use4 cores to run80% of this program in parallel after optimizing. Therefore, for
this problem:
Tnew = 100 ∗ 20% +
100 ∗ 80%
4
= 20 + 20 = 40seconds
Amdahl’s law also defines thespeedupto tell us how much faster a parallelized
program will run compared with the original sequential program.Speedupis the ratio:
speedup =
Execution time for an original program
Execution time for this parallized program
For the above example, thespeedup = 100
40
= 2.5. Note that if we take the parallel
overhead, such as communication between cores into account, the ime spent on a
parallelization program may be more than the sequential one. That is, the overhead of
20
exploiting parallelism overwhelms the parallel computation and causes a degradation
of performance.
2.6 Memory Organization
For modern multi-processors, there are two different approaches used for accessing
data among them. In the first method, data is accessed based onthe address of shared
space. We call this type of memory architecture shared memory. Another way of multi-
processors access data is by utilizing the distributed memory, in which each processor
has in its own private memory.
2.6.1 Shared Memory
In a shared memory system, all processors share a single memory address and commu-
nicate with each other by writing to and reading from shared variables [13]. The most
popular class of shared memory systems is calleduniform memory access(UMA),
which arises from the fact that the latency of accessing a memory location is indepen-
dent from the processor making this request. Figure 2.1 showits bacic structure. Al-
though this type of memory system is easy for programming, UMA is not well-scaled,
since the more memory access requests the processors present, the less memory band-
width per processor has.
The other category of shared memory systems is nonuniform meory access
(NUMA), as shown in Figure 2.2. Memory in this system is stillaccessible for all
processors, but the latency depends on the physical distance between the requesting
processor and the target memory block. This eliminates the scalable problem. But
care must be taken to distribute data into different memory blocks when programming,
because the system performance is influenced by the layout ofthe data. System perfor-
21
Figure 2.1.Basic Structure of UMA System
mance will be improved if we arrange the data most often used close to its associated
processor. Obviously, this requires extra work for programming.
Figure 2.2.Basic Structure of NUMA System
2.6.2 Distributed Memory
Alternatively, in distributed memory systems, each process has its private address
space which is not addressable by other processors. Instead, they must use message
passing to communicate with each other. Figure 2.3 shows itslayout. This type of
memory system can be divided into two further categories: massively parallel proces-
22
sors (MPP) and clusters [13]. For an MMP system, processors are highly coupled on
a board and specialized for parallel computing. One exampleof this system is CELL
B.E. architecture. Clusters are composed of independent computers connected by a
network. Depending on how closely the multi-processors areintegrated together, the
communicating time is varied greatly. For CELL B.E. architecture, the communication
time between PPE or SPE and SPE is not longer than shared memory system. In con-
trast, the communication time of clusters is always severalorders of magnitude longer
than CELL B.E..
Figure 2.3.Basic Architecture of Distributed Memory System
CHAPTER 3
THE ARCHITECTURE OF CELL BROADBAND ENGINE
3.1 Introduction
In Randall Hyde’s book,Write Great Code, one of his basic ideas is that, for program
optimizing, you need to not only know the APIs supplied by thetarget platform but also
understand how the code is executed on it. This principal still works for programmers
who pursue high performance coding by using Cell Broadband Egine Architecture
(CBEA). It is not good enough to know the extension of C/C++ commands in CELL.
In order to obtain high performance, one needs to understandthe esign principles of
this architecture.
Although the initial purpose of Cell Broadband Engine is used for media-rich
applications such as game consoles, this architecture expos s the outstanding perfor-
mance for scientific applications. The Cell Broadband Engine is a single-chip hetero-
geneous multi-processor system cooperating with both shared and distributed memory.
It has nine processors:
• one PowerPC Processor Element (PPE), which is optimized forscheduling tasks
and running operating system.




The PPE has the same memory hierarchy as the ordinary CPU, andit loa s data or
instructions into cache from main memory which is shared by all processors before
consuming them. In contrast, the SPE accesses the main memory by using direct
memory access (DMA) commands, and stores data or instructions into the private local
store to be used in the future. Replacing cache with local store is a radical innovation
in computer architecture.
The reason for this radical change is based on the fact that the bottleneck of
system performance is, in most cases, determined by the memory latency rather than
computation ability or peak bandwidth of memory. When cachemiss occurs for a se-
quential program running on the conventional architecture, th re will be several hun-
dred clock cycles penalty. But for SPE, if a miss happens, thesetting up time of DMA
is only few cycles; which is very small compared with severalhundred clock cycles
halting.
3.2 Decreasing the Impact of Serial-performance Walls
As we have discussed in chapter 1, the performance of modern computer architecture
is limited by the problems of power dissipation, memory latency and frequency. How-
ever, due to the innovation of the Cell Broadband Engine Archite ture, these impacts
are scaled down.
Power dissipation of a circuit is proportional to the clock frequency, the tech-
nology used by manufacturers to dissipate heat has almost reached a physical limit.
Hardware engineers can not integrate more transistors becaus of lacking necessary
power dissipation technique support. Therefore, the performance limitation has been
shifted from clock speed to the available technology for power dissipation. CBEA im-
proves power efficiency to alleviate this problem. Rather than using common general-
25
purpose processors, two types of special designed cores areadopted. The PPE is spe-
cialized for control-intensive code, and eight SPEs are optimized for computing-rich
programs.
Currently, the ratio of memory latency to clock speed is approaching 1000. As
a result, the serial performance is determined by the way of how to efficiently accessing
memory. CBEA uses two mechanisms to mitigate the problem of memory latency:
• a 3-level memory hierarchy of SPE; that is main memory, with loca stores and
private register file;
• asynchronous DMA transfers between memory and local store [8].
The long latency can be efficiently covered by scheduling data tr nsfer and computa-
tion simultaneously.
Traditionally, processors need to make pipelines deeper toachieve higher through-
put. But this approach has reached its limitation and even gar ers negative return if
we take power consumption into account. The reason for the high performance of the
PPE is that it is able to run two threads simultaneously. EachSPE achieves efficiency
primarily by using a large register file and asynchronous DMAtransfer.
3.3 Architecture Overview
Figure 3.1 shows the primary block diagram of the CBEA: the PowerPC Processor El-
ement (PPE), eight Synergistic Processor Elements (SPEs),the Element Interconnect
Bus (EIB), the Cell Broadband Engine Interface (BEI), the Memory Interface Con-
troller (MIC), etc. We will explore some components in greater depth in the following
sections.
26
Figure 3.1.Overview of CBEA
3.4 PowerPC Processor Element
The PowerPC Processor Element (PPE) is the center of CBEA. Itruns the operating
system, manages and dispatches workload among SPEs, and coordinates operations.
The PPE is a general-purpose, dual-threaded, 64-bit RISC processor with the vec-
tor/SIMD multimedia extension processor [7]. As shown in Figure 3.2, the PPE con-
sists of two main elements: the PowerPC Processor Unit, or PPU, and the PowerPC
Processor Storage Subsystem (PPSS) [7].
3.4.1 PowerPC Processor Unit
The PPU has a level-one instruction cache, a level-one data cache, a register file and
several functional units. It executes the 64-bit RISC PowerPC Architecture instruction
set and the vector/SIMD multimedia extension instructions[7], and performs 32-byte
load and 16-byte store operations. One notable feature of the PPU is that it allows two
threads to run simultaneously, which is called symmetric multi-threading (SMT). The
27
Figure 3.2.PPE Structure Diagram
PPU has the following functional units [7]:
• Instruction Unit (IU)– The IU contains a 2-way set associative level-1, 32 KB
instruction cache with cache-line size of 128 bytes. It execut s instruction fetch,
decode, issue, and branch.
• Fixed-Point Unit (FXU)– The FXU performs fixed-point operations, such as
arithmetic and logic operations.
• Load and Store Unit (LSU)– The LSU is responsible for data access. It has a
level-1, 32 KB, 4-way set associative and write-through data cache. The cache-
line size is 128 bytes.
• Vector/Scalar Unit (VSU)– The VSU contains a floating-point unit (FPU) which
executes floating-point operations, and a 128-bit vector/SIMD multimedia ex-
tension unit (VXU) which performs vector/SIMD multimedia ext nsion instruc-
tions.
28
• Memory Management Unit (MMU)– The MMU takes care of address trans-
lation, which includes one 64-entry segment look-aside buffer (SLB) and one
1024-entry translation look-aside buffer (TLB).
3.4.2 PowerPC Processor Storage Subsystem
The PPSS handles all memory requests from the PPU and the Element Interconnect
Bus (EIB).It has a unified, 512 KB, 8-way set-associative, write-back level-2 instruc-
tion and data cache. The cache-line size is 128 bytes, which is the same as L1 cache [8].
The L2 cache does not need to contain all the contents of L1 instruction cache, but in-
cludes all the contents of the L1 data cache. The interface between the PPU and the
PPSS supports 32-byte load and 16-byte store operation. Andboth the load port and
store port between the PPSS and the EIB are 16 bytes wide.
3.5 Synergistic Processor Elements
The PPE is a general-purpose process, but the eight Synergistic Processor Elements
(SPEs) are specifically designed for only one purpose: performing high-speed SIMD
operations. Each SPE is a 128-bit RISC processor including two parallel pipelines
which execute SIMD instruction set. Every SPE provides a separate computing en-
vironment for tasks. Since the SPE does not have cache, cachemiss is not affecting
its performance. Each SPE contains two main units: the Synergistic Processor Unit
(SPU) and the Memory Flow Controller (MFC) [7], shown in Figure 3.3.
3.5.1 Synergistic Processor Unit
The SPU receives tasks from the PPU and executes them. The required instructions
and data for each SPU are stored in its local store. Each SPU isan independent pro-
29
Figure 3.3.Structure of SPE
cessor with four execution units, and specified to run SPU programs. The SPU loads
instructions and data to its local store using DMA transfer,then consumes them and
stores the result back to local store. Local store, along with SPU’s large register file, is
the memory hierarchy that the SPU can access directly.
Local store is a unified non-caching memory with a size of 256 KB. Each SPU
has its own local store memory. And it is not addressable fromanother remote SPUs.
The bandwidth of the SPU instruction and data transfer are 128 bytes and 16 bytes per
clock cycle, respectively. But the data access needs to be quad-word aligned. For DMA
transfer, the bandwidth is 128 bytes per clock cycle. There is no address translation
when an SPU accesses its local store.
3.5.2 Memory Flow Controller
Each SPE has one MFC which plays a role as the SPU’s interface.The primary func-
tion of the MFC is executing DMA commands autonomously. Whenthe PPE needs
30
to send data to the SPEs, it specifies the source address of themain memory, the des-
tination address of their local stores, and notifies the DMA controller of each MFC to
start working. Also, when the SPEs need to communicate with the main memory, they
set up DMA transfers and tell MFC to move data from local stores. This makes SPUs
focusing only on computing, and frees it from dealing with data transfer. While DMA
transfer allows up to 16K bytes of data transfer for each DMA transfer instance, all the
transferred data need to be 128-bit aligned. Moreover, the size of the transferred data
needs to be 1, 2, 4, 8, 16 or multiples of 16 bytes.
3.6 The Element Interconnect Bus
The Element Interconnect Bus (EIB) is the separated communication paths for com-
mands and data among the PPE, the SPEs, the controllers for the main memory, and
the I/O interface. Functionally, it consists of four data rings. Two of them transfer data
in the clockwise direction, and the other two run in the counterclockwise direction.
The peak bandwidth of the EIB is 204.8 GB/s [4].
3.7 The Memory interface Controller
The Memory Interface Controller (MIC) connects the EIB and physical memory, and
provides two channels to physical memory. This memory, called Extreme Data Rate
(XDR) DRAM, differs from the conventional DRAM because it supports eight data
transfer per clock cycle rather than the usual two or four. Therefore, the memory can
achieve a high data rate without very high clock frequency [17].
31
3.8 Cell Broadband Engine Interface Unit
The on-chip Cell Broadband Engine Interface Unit (BEI) serves as the I/O interface.
It contains a Bus Interface Controller (BIC), an I/O controller (IOC), and an Internal
Interrupt Controller (IIC). It manages data transfer between the EIB and the I/O devices
[7].
3.9 Programming Model for the PPE and SPEs
Both the PPE and the SPE instruction sets support C-languagecommands. Addition-
ally, these sets also include vector data types and a large set of calar and vector in-
trinsics. Intrinsics are, in essential, inline assembly-language instructions which have
similar syntax to C-language [8]. A vector is an operand containing multiple unified
data elements. The type of the element consists of the C-language data type. Most
PPE and SPE instructions operate on vector operands, which are also calledSIMD
operandsor packed operands[7].
Figure 3.4.Vector Add Operation
32
SIMD which exploits data-level parallelism is widely applied in the Cell Broad-
band Engine. In the PPE, it is supported by the Vector/SIMD Multimedia Extension
instruction set. In the SPE, it is supported by a new instruction set which is called
SPU instruction set. For both the PPE and the SPE, the vector registers are 128-bit
wide, which can hold multiple data elements. For instance, four 32-bit integers can be
loaded into a single vector register simultaneously. The vector add operation is shown
in Figure 3.4.
Programmers can manually vectorize scalar data array to vecr. This process
is calledvectorizationor SIMDization. The following code is an example showing the
difference between scalar addition and vector addition:
i n t a [ 4 ] = {1 , 2 , 3 , 4} ;
i n t b [ 4 ] = {5 , 6 , 7 , 8} ;
i n t c [ 4 ] ;
f o r ( i = 0 ; i < 4 ; i ++)
c [ i ] = a [ i ] + b [ i ] ;
i n t a [ 4 ] = {1 , 2 , 3 , 4} ;
i n t b [ 4 ] = {5 , 6 , 7 , 8} ;
i n t c [ 4 ] ;
v e c t o r s igned i n t ∗va = ( v e c t o r s igned i n t ∗ ) a ;
v e c t o r s igned i n t ∗vb = ( v e c t o r s igned i n t ∗ ) b ;
v e c t o r s igned i n t ∗vc = ( v e c t o r s igned i n t ∗ ) c ;
∗vc = vec add (∗ va , ∗vb ) ; / / PPE: 1 + 5 , 2 + 6 , 3 + 7 , 4 + 8
Data multi-buffering is another mechanism to improve system p rformance. It
can efficiently hide the overhead of DMA transfer: The SPUs are busy consuming the
current data set, while the MFC is loading data which will be us d in the next round
to another data buffer. So each SPE loads new data into its local store when it operate
on the data already in its local store. Thus, the overhead of the data transfer can be
hiddern.
33
3.9.1 Language-extension Differences Between PPE and SPE
Both instruction sets of the PPE and the SPE are operated on128-bit SIMD vectors.
But from the programmer’s perspective, they are quite different because of the differ-
ent intrinsics and different supported data types. Table 3.1 specifies the vector data
supported by PPE and SPE [8].
Vector Data Type Meaning PPE SPE
vector unsigned char 16 8-bit unsigned values Y Y
vector signed char 16 8-bit signed values Y Y
vector bool char 16 8-bit unsigned boolean Y N
vector unsigned short 8 16-bit unsigned values Y Y
vector signed short 8 16-bit signed values Y Y
vector bool short 8 16-bit unsigned boolean Y N
vector unsigned int 4 32-bit unsigned values Y Y
vector signed int 4 32-bit unsigned values Y Y
vector bool int 4 32-bit unsigned values Y N
vector float 4 32-bit signed precision Y Y
vector unsigned long long2 64-bit unsigned values N Y
vector signed long long 2 64-bit signed values N Y
vector double 2 64-bit signed values N Y
Table 3.1. Vector Data Type Supported by PPE and SPE
The differences are:
• The PPE does not support long long type;
• The SPE does not support boolean value.
34
In a specific instruction set, instruction names are usuallystarted by a unified
prefix. That is still true for both the PPE and the SPE instruction sets. For the SIMD
instructions of the PPE, they are named by the prefix of”vec ” , such as vecadd and
vec or. For the SPE, instructions are denominated by the prefix of”spu ” , for instance,
spuadd and spuor which are the same functions asvecaddandvec or in the PPE.
3.9.2 Communication between the PPE and SPEs
While DMA transfers allow the data passing, using mailbox isanother way to commu-
nicate between the PPE and the SPE. The mailbox is designed for transferring small
message, such as a memory address and status information between the PPE and the
SPE, because it can only transfer 32-bit data each time. In essence, a mailbox is a
queue which can hold only small number of entries. Here, we only introduce two
types of mailboxes which are related to our later chapters.
• SPU Inbound Mailbox: It can only store up to 4 32-bit messages, and is used to
transfer data from the PPE to the SPE. If no message is available when it is read
by an SPE, this SPE will stall until there is data available.
• SPU Outbound Mailbox: It only has space for holding 1 32-bit message, and is
used to send data from the SPE to the PPE. If it is full, writingof the incoming
data will be suspended until the PPE reads data from it.
CHAPTER 4
THE ARCHITECTURE OF GPU
4.1 Introduction
The first Graphics Processing Unit (GPU), invented by nVIDIAin 1999, was aiming at
processing and generating graphics. Driven by the market demand for real-time graph-
ics, the GPU has evolved into a many-core processor with multi-threading and is highly
parallel. It has over-performed the general purpose CPUs inarithmetic throughput and
memory bandwidth from 2003. Since then, the research on non-graphics applications
by exploiting the GPUs began to emerge. We call these general-pu pose programming
on GPU GPGPU.
Although the GPU has better performance, it suffers some flaws. Firstly, it re-
quired programmers to have knowledge about the GPU architecture and its program-
ming languages, for instance, DirectX and openGL. Moreover, some programming
features such as double precision and random memory read andwrite were not sup-
ported. These factors impeded the making of the GPU into a universal platform for
scientific applications.
In order to address these problems, nVIDIA introduced two significant archi-
tectures in November 2006 [3]: the G80 unified graphics and compute architecture and
Compute Unified Device Architecture (CUDA). CUDA is a software and hardware
architecture that allows programmers to code in high level languages without consid-
ering the graphics interfaces of the GPU. Instead of using APIs provided by GPU, a
35
36
programmer can write a general-purpose application with CUDA C/C++. Compared
with the general-purpose CPU, the programming model for CUDA is single program
multiple data (SPMD), in which each instruction of a programwill be executed by
multiple threads in parallel.
4.2 An Overview of the GPU Architecture
Figure 4.1 shows the typical GPU architecture. It is based ona programmable pro-
Figure 4.1.The G80 GPU Architecture
cessor array calledStreaming Multiprocessor(SM) which is highly parallel. In G80
architecture, two SMs form a Texture/Processor Cluster (TPC), and each SM has 8
highly multi-threadedstreaming Processors(SPs).
37
The G80 GPU has up to 4 gigabytes of graphics double data rate (GDDR)
DRAM which is known as global memory [9]. The global memory isa very high
bandwidth, off-chip memory. Its available bandwidth is 84.6 GB/s, which is over-
whelming the 8 GB/s bandwidth of system memory. Since the latency is a little more
than typical system memory, the high bandwidth is a compensation for the longer la-
tency in parallel applications.
Each SM supports 768 threads, and there are 16 SMs resident inthe G80 GPU,
so more than 12,000 threads coexist in the chip. The level of parallelism provided by
the GPU is very high, which is deserved to putting effort on exploiting this feature for
massively parallel applications.
4.2.1 Streaming Processor Array
As shown in Figure 4.1, the G80 GPU with 16 SMs is organized into 8 TPCs, and this
8-TPC vector forms the Streaming Processor Array (SPA) which manages the thread
scheduling and executes GPU computing programs. The numberof TPCs determines
the programming ability of a GPU. The more TPCs available, thhigher performance
a GPU can deliver.
4.2.2 Streaming Multi-processor
In Figure 4.2, each Texture/Processor Cluster (TPC) consists of a geometry controller,
an SM controller (SMC), two SMs, and a texture unit [11]. The Streaming Multipro-
cessor (SM) is a homogeneous multi-processor which can execute both graphics and
parallel computing applications. Each SM contains 8 Streaming Processors (SPs), 2
Special Function Units (SFUs), and a read/write shared memory.
To efficiently create and execute hundreds even thousands ofthreads concur-
38
Figure 4.2.Texture/Processor Cluster (TPC)
rently when running a program, the SM is hardware multi-threaded. In other words,
the threads are managed and scheduled by hardware without any overhead. In order
to achieve parallel computing, each thread has its own execution state, which allows
threads to execute independently. The concurrent threads of a program can synchro-
nize at a barrier with a single CUDA C/C++ instruction” syncthreads” [11]. More-
over, compared to the Cell B.E. architecture which runs SIMDinstruction, the threads
in the SM execute scalar CUDA C/C++ instructions because scalar instructions are
simpler and compiler-friendly [16]. The low-cost thread creation and management
with an easy way of synchronization, along with a small instruction set, make the
GPU to be an efficient platform to realize parallelism.
39
4.2.3 Streaming Processor
The Streaming Processor (SP) is the key functional element in the SM. Each SP has
a scalar register file with 1024 32-bit entries, and supportsup to 64 threads. The
SP, optimized for fully pipelined, performs all 32-bit and 64-bit integer arithmetic,
comparison, and logical instructions [11]. It also implements the fundamental float-
point operations such as add and multiply-add.
4.3 Memory Hierarchy
The CUDA-capable GPU presents three levels of memories to allow thread access:
the global memory, the shared memory and the local memory. Itprovides different
memory space to programmers for storing value in the best-performance way.
4.3.1 Global Memory
The global memory is implemented in external DRAM [11], and shared by all threads
of a program. It has two key functions: providing a relative large storage space for a
GPU program and a communication way among SMs. In fact, reference a location in
global memory by multiple threads is executed sequentiallyr ther than in parallel, and
there is no any guarantee that these threads will have sequential consistency.
Within a thread, the order of memory access to the same address is maintained,
but the order of memory read and write to different addressesi not preserved [16]. The
memory access requests are processed without order. Withina read block, the CUDA
barrier/fence instruction” syncthreads” can be used to synchronize these threads.
This instruction provides a way to synchronize threads which commit prior memory
operations and make them available to later memory access.
40
4.3.2 Shared Memory
The shared memory can not be shared by all threads of a program. It is only accessible
to the threads in a thread block. As shown in Figure 4.1, the shared memory resides on-
chip, which makes memory accessing much faster. Compared tothe global memory,
the shared memory does not suffer the problem of competing with the limited off-chip
bandwidth within a thread block. So it is practical to provide very high-bandwidth
memory space for each thread block.
4.3.3 Local Memory
The local memory is only visible to an individual thread. It is physically larger than
per-thread’s register file, and is addressable by a program.In order to achieve larger
memory space for each single thread, the operation of memoryallocation is imple-
mented in external DRAM.
4.4 Programming Model for CUDA GPU
CUDA, invented by nVIDIA in late 2006, is an extension to the conventional C/C++
language. It is a scalable parallel programming language which exploits a large de-
gree of thread-level parallelism for general applications. It provides three key utilities
for parallelism — a hierarchy of the thread group, shared memory and barrier/fence
synchronization [16].
4.4.1 CUDA Programming Paradigm
In order to solve a problem efficiently in the platform of CUDAGPU, the programmer
usually partitions the problem into a number of relatively small sub-problems which
41
can be solved inkernelsin parallel. A kernel is a function that can be executed by
many threads simultaneously. The programmer organizes sevral threads into athread
blockor aCooperative Thread Array(CTA) for better management and scheduling. A
thread block is usually used to solve a sub-problem. A group of thread blocks form a
grid to solve a certain problem by executing the same kernel function in parallel.
The declaration of a kernel is prefixed by theglobal qualifier. Its return
type must bevoid. The syntax of kernel procedure called from a CUDA program is:
kernel name <<< dimGrid, dimBlock >>> (...);
wheredimGrid anddimBlockare three-element vectors which specify the dimensions
of the grid in blocks and the dimensions of the blocks in threads respectively [16]. A
kernel can call device functions as well. The declaration ofa function started with the
device identifier indicates that it is a device function and must be called from the
kernel only.
When a kernel is invoked, there aredimGrid × dimBlock threads will be
generated. Each thread in a block is assigned to a thread number threadIdx. Each
block has a unique block numberblockIdxwhich is visible to all threads in the block.
As Figure 4.3 shows, a thread may use different level memories in its execution.
A CUDA program uses the local memory for per-thread storage,which may be larger
than its register file, for stacks and register spilling. Thelocal variable, which is stored
in local memory, is declared by thedevice identifier proceeding the variable type.
Each thread block has a shared memory, which can be accessed by all the threads in
this block. The threads in a thread block always work together loading the most often
used data into their shared memory for future utilization. Declaring variables in shared
memory is prefixed by shared qualifier.
42
Figure 4.3.CUDA Memory Hierarchy
The global memory space for a kernel is managed at run time throug the
CUDA APIs: cudaMalloc()andcudaFree(). It is visible to the whole grid running this
kernel. Moreover,cudaMemcpy()can be used to transfer data between GPU and CPU
or between GPU’s.
4.4.2 Single Instruction Multiple Threads and Warp Divergence
The CUDA architecture exploits thread-level parallelism to improve system perfor-
mance. This coarse-grained parallelism is called Single Instruction Multiple Threads
(SIMT) which executes a single instruction by multiple independent threads in paral-
lel. Threads are created, scheduled and executed as groups calledwarps. Each SM has
a pool consisting of 16 warps [16], and each warp contains 32 threads.
Figure 4.4 shows how the thread scheduler dispatch threads to execute instruc-
tions. Since warps are independent, at each instruction issue time slot, the instruction
43
Figure 4.4.Warp Scheduling
scheduler can only select one warp which has all of its threads ready for the same in-
struction, then issue this instruction for this warp. If forany reason, however, some
threads in a warp are not ready for the same instruction, thiswarp will not be scheduled
to execute.
The high performance of a CUDA program can be achieved when all threads
in a warp have the same execution path. And the performance will be degraded if a
warp diverges, which means the execution path is not unified within the thread warp.
For instance, a branch condition is taken in half of a warp, and is ot taken in the other
half. In this case, the instructions of branch taken and not taken must be executed
sequentially rather than in parallel for this warp. Different execution paths among
warps have no affect on system performance.
From the perspective of code correctness, programmers may ignore the concept
44
of warp and its divergence. But potential performance will be improved by putting
effort into minimizing the warp divergence.
CHAPTER 5
PARALLEL RADIX SORT
In this chapter, we elaborate on the parallel radix sort imple ented on both Cell Broad-
band Architecture and CUDA GPU Architecture. Parallel prefix sum is a key primitive
for the CUDA radix and sample sorts, so it will be introduced in the first subsection
of this chapter. To achieve high performance, these algorithms are carefully designed
to exploit the features of the target platforms. The resultsshow that the CUDA radix
sort yields about 45 times performance improvement compared to the sequential radix
sort, and the radix sort on Cell B.E. is about 7 times faster.
5.1 The Prefix Sum Primitives
In computer science, thexclusive prefix sum, or exclusive scanis an operation that
takes a binary operator⊕ with identity I, a sequence of elements:
[x0, x1, ..., xn−1],
and returns a new sequence [2]:
[I, x0, (x0 ⊕ x1), ..., (x0 ⊕ x1 ⊕ ...⊕ xn−2)].
For the exclusive prefix sum, each elementi in the input is replaced by the sum of all
elements up to but not including the elementi. In contrast, theinclusive prefix sum, or
inclusive scanreturns:
[x0, (x0 ⊕ x1), ..., (x0 ⊕ x1 ⊕ ...⊕ xn−1)],
45
46
where each input element is substituted by the sum up to and includi gi. For example,
for the binary operation addition, given an array:
[3, 1, 2, 2, 7, 4, 6, 5],
the exclusive scan will return the result:
[0, 3, 4, 6, 8, 15, 19, 25],
but the result of the inclusive scan is:
[3, 4, 6, 8, 15, 19, 25, 30].
The exclusive and inclusive scan can be transformed mutually. An exclusive
scan can be produced by shifting the inclusive scanned arrayone element right and
inserting the identity in the leftmost position. Similarly, an inclusive scan can be gen-
erated by shifting the exclusive scanned array one element left and inserting the sum
of the whole input array into the rightmost position [2]. Thetypical binary functions
are addition, max, min and logical operations. But for this and the following chapters,
we focus only on the discussion of the exclusive scan with binary operation addition,
and simply refer to it ascan.
5.1.1 Sequential Scan
Algorithm 5.1.1 The Sequential Version of Scan Algorithm
Input: an array A
Output: a scanned array B
B[0]← 0
for i← 1 to length[A]− 1 do
B[i]← A[i− 1] +B[i− 1]
end for
47
From Algorithm 5.1.1, in order to get the result ofB[i] in output (1 ≤ i ≤
length[A]), the previous element of inputA[i − 1] needs to be added to the previous
element of outputB[i− 1]. The minimum number of additions needed to produce the
scanned array isO(n) for an array with length ofn. In other words,n − 1 iterations
are required for traversing all the elements of the input to generate the output data, so
the time complexity isO(n).
5.1.2 An Inefficient Parallel Scan
Algorithm 5.1.1 does not take advantage of any parallelism.We prefer to develop
the parallel version of the scan algorithm by exploiting theplatform’s parallelism.
Hillis and Steele proposed the first parallel scan in 1986 which is shown in Algo-
rithm 5.1.2 [6]. Its operation is illustrated in Figure 5.1.Algorithm 5.1.2 executeslogn2
Algorithm 5.1.2 An Inefficient Parallel Scan Algorithm
Input: an array A
Output: the scanned array A
for i← 0 to loglength[A]2 − 1 do
for all t in paralleldo
if t ≥ 2i then




iterations, where n is the size of array A, so totallyO(nlogn2 ) additions are performed.
The time complexity for Algorithm 5.1.2 isO(logn2 ), if there are as many ALUs as the
array elements. But it is usually the case that the number of elements needed to be
processed is greater than the number of ALUs. Hence, there isno guarantee that the
performance of Algorithm 5.1.2 is better than Algorithm 5.1.1 In the worst case, if
48
Algorithm 5.1.2 is executed on the platform with only one processor, the time com-
plexity will be degraded toO(nlogn2 ), which is apparently worse than the sequential
one.
Figure 5.1.An Illustration of the Inefficient Scan
5.1.3 An Efficient Parallel Scan
The performance of Algorithm 5.1.2 is not guaranteed to be good, especially when
working with large arrays. Sengupta and Harriset al. [18] presented a work-efficient
parallel scan based on the pattern of the balanced tree in 2007. They partitioned the
Algorithm 5.1.3 The Up-Sweep Phase of the Efficient Scan Algorithm
Input: an array A
Output: the scanned array A
for i← 0 to loglength[A]2 − 1 do
for all t = 0 to n− 1 by 2i+1 in paralleldo
A[t+ 2i+1 − 1]← A[t+ 2i − 1] +A[t+ 2i+1 − 1]
end for
end for
scan algorithm into two phases: up-sweep, or reduce phase and down-sweep phase.
The up-sweep phase is illustrated in Figure 5.2. Pseudocodefor the up-sweep phase is
49
given in Algorithm 5.1.3 [18]. The up-sweep phase performsn− 1 additions for ann
elements array, wheren = 2d. And logn2 iterations are needed to finish the up-sweep
phase.
Figure 5.2.An Illustration of the Up-Sweep Phase of the Efficient Scan
Algorithm 5.1.4 is the pesudocode for down-sweep phase [18]. And its oper-
ation is shown in Figure 5.3. For the down-sweep phase, Algorithm 5.1.4 starts from
assigning0 to the last element of array A. Aftern − 1 additions andn − 1 swaps,
Algorithm 5.1.4 The Down-Sweep Phase of the Efficient Scan Algorithm
Input: an array A
Output: the scanned array A
A[length[A] − 1]← 0
for i← loglength[A]2 − 1 down to0 do
for all t = 0 to length[A] − 1 by 2i+1 in paralleldo
temp← A[t+ 2i − 1]
A[t+ 2i − 1]← A[t+ 2i+1 − 1]
A[t+ 2i+1 − 1]← temp+A[t+ 2i+1 − 1]
end for
end for
we get the scanned array. Also, Algorithm 5.1.4 still needslogn2 iterations, so the time
complexity for this efficient scan algorithm isO(logn2 ). There are totally3×(n−1)+1
50
arithmetic operations performed (2×(n−1) additions,n−1 swaps and1 assignment).
Therefore, the efficient parallel scan algorithm executesO(n) operations, which has
better performance than Algorithm 5.1.2, especially when working with large arrays.
Figure 5.3.An Illustration of the Down-Sweep Phase of the Efficient Scan
5.2 Radix Sort
The radix sortingalgorithm depends on the representation of the keys to be sort d.
Since numbers are stored as a sequence of binary bits in the memory, it is usually
the case that radix sort is performed based on the binary repres ntation of the keys
rather than their decimal representation. The radix sorting algorithm considers a bits
sequence as a multiple-digit number, and sorts it from its lea t significant digit to its
most significant digit. The value of each digit is in the range[0, 1, ..., m], wherem is
called the radix.
For a sequence of 32-bit integers, for instance, they could be treated as a group
of 8-digit numbers with the radixr being 16 (r = 232/8 = 24 = 16). The radix
sort requires 8 iterations to sort these integers. For iterat on i, it sorts the keys based
51
on theirith least significant digits. Usually, the radixr is selected to obtain the best
performance and highly relies on the size of the keys to be sort d.
Radix sort is a stable non-comparison sorting algorithm, wherestablemeans
the output preserves the relative input order for keys with same value. However, be-
cause of randomly accessing memory, the performance of radix sort is often worse
than quick sort for the sequential version.
5.2.1 Serial Radix Sort
The main idea of radix sort is very straightforward. It divides keys into multiple digits,
and uses a stable sorting algorithm to sort one digit in each iterat on from the least
significant digit. Figure 5.4 is an example showing how radixsort operates on six
4-digit numbers.
Figure 5.4.The Operation of Radix Sort on 6 4-digit Numbers
Thecounting sortis an alternative usually used to sort each digit. Suppose n-
bit unsorted keys are considered as m-digit numbers, where each digit is composed of
k bits, so there arer = 2k possible values for each digit. And the counting sort will be
calledt = n/k times to sort the keys. In each iteration,r buckets are required with one
for each possible digit value. The functions of these buckets are counting how many
times each possible value appears, and giving each key its corresp nding temporary
52
position in the output of each iteration. Pesudocode is given in Algorithm 5.2.1 [22],
wheres is the size of the unsorted arrayin, d[i] is the given digit value of theith input
element. The sorted keys will be written to the arrayout.
Algorithm 5.2.1 COUNTING-SORT
/* HISTOGRAM-KEYS */
for i← 0 to r − 1 do
bucket[i]← 0
end for
for i← 0 to s− 1 do















Counting sort consists of three phases. In the HISTOGRAM-KEYS phase, all
the elements of thebuckets are initially set to0. Then, for iterationi, the value of the
given digit of theith element is used as an offset to locate the corresponding buckets,
then the value of this bucket is incremented by1. After all keys have been looped over,
bucket[i] is the number of keys with valuei for the given digit. The sum of thebuckets
53
is equal to the size of the input.
For the phase of SCAN-BUCKETS, the scan algorithm which we have dis-
cussed in section 5.1 is performed on thebuckets. After the execution of the scan, the
first position in the output for all possible values of a givendigit is known.
In the phase of RANK-AND-PERMUTE, each key with digiti is placed in
the output position based on the value ofbucket[i]. Then the value ofbuckets[i] is
increased by1 to record the next location for the following key with the same valuei.
5.2.2 CUDA Parallel Radix Sort
Since radix sort uses counting sort for each pass, it is one ofthe sorting algorithms
easy to be parallelized. For the serial radix sort, HISTOGRAM-KEYS and RANK-
AND-PERMUTE phases have no data dependences, so it is very straightforward to
parallelize them. But due to the fact that the position in theoutput of each key relies
on the number of previous keys with the same value, the phase of SCAN-BUCKETS
is our central concern. By replacing the second phase with the parallel scan as we
discussed in section 5.1, this constraints can be eliminated.
To design an efficient radix sorting algorithm for CUDA, we should first con-
sider the size of the sub-task for each thread block. As we know, accessing shared
memory is much faster than accessing global memory, so all the operations on the data
should be performed on shared memory instead of global memory to achieve better
performance. Because of the relatively small size of sharedm mory (16 KB for G80,
48 KB for GTX480), and with extra overhead taken into account, processing 1024 el-
ements in each thread block is the best choice. Therefore, for a sequence with the size
of power-of-2, the number of thread blocks is in proportion tthe sequence size.
Intuitively, assigning one element for each thread would bea natural choice,
54
because each thread only has to access global memory twice (loading and storing). But
from Table 5.1, we can find that assigning two elements per thrad is more efficient.
Table 5.1 comes from the result of our experiment running10000 times. As Table 5.1





Table 5.1. The Running Time for Transferring220 Keys to/from Shared Memory
shows, for1024 thread blocks of1024 elements assigned to each thread block, the
execution time for loading all the elements from global memory t shared memory
then writing back varies with the threads number. Due to the fact that the overhead of
creating1024 threads can not be totally hidden by the relatively small globa memory
access time.512 threads in each thread block for1024 elements provides the overall
best performance.
There are eight passes in total, and four phases for each passin the radix sorting
algorithm:
• An unsorted sequence is divided intop chunks with1024 elements each chunk.
Each thread block loads its chunk into shared memory, then performs4 iterations
of 1-bit split sorting operation;
• Each thread block generates its own24-entry local histogram, and writes the
histogram along with the sorted chunk back to global memory.The p × 24
55
local histogram table is stored in column-major in global memory as shown in
Figure 5.5;
• p thread blocks cooperatively perform the parallel scan on this histogram table
to produce the global position for each possible digit value;
• Theith threads block writes its elements to the correct position inglobal memory
according to theith column in the global histogram table.
Figure 5.5.The Layout of the Histogram Table
The 1-bit split operation ”splits” keys with the value of0 for a given bit to
the left side of the output, and the keys with1 value to the right side. Hence, for
n-bit integers, a 1-bit split requiresn iterations to sort them. If we treat each key
with integer type as a 32-digit number, the split operation will need to access global
memory64 times. Since transferring data to and from global memory is relatively
expensive, we should avoid this kind of strategy. To performsplit operation efficiently,
we consider each key as an 8-digit number, and each digit consists of 4 bits. Although
the split operation will still be called32 times, it needs to access global memory2
56
times for each4 split operations on the shared memory. The pesudocode is given in
Algorithm 5.2.2. And its operation is shown in Figure 5.6.
Algorithm 5.2.2 1-bit Split Operation
for all t from 0 to n− 1 in paralleldo








f ← flag[n− 1]
perform parallel scan onflag
totalFalses← f + flag[n− 1]
for all t from 0 to n− 1 in paralleldo
d[t]← b[t] ? (t− flag[t] + totalFalses) : flag[t]
end for
for all t from 0 to n− 1 in paralleldo
out[d[t]]← in[t]
end for
The efficient parallel scan given in section 5.1 can only operate on an array
within a single thread block. And the size of the array can notexceed twice the number
of maximum threads in a block, since each thread processes two elements at most. For
the G80 and GTX 480 GPUs, the maximum number of elements are1024 and2048
respectively. However, the histogram table with size ofp × 24 is usually larger than
these numbers, so we need to extend the efficient parallel scan algorithm so that it
could operate on relative large arrays.
The input array is partitioned into tiles with suitable sizewhich can be privately
57
Figure 5.6.TheSplit Operation Based on the Least Significant Bit
scanned by each thread block, then the first phase of the scan will be conducted. When
it is done, we write the last element of each tile which is called total suminto an
auxiliary array, then continue the second phase of the scan.Next, the efficient parallel
scan is performed on this auxiliary array to generate thelocalsum array. At last, each
elementi in the localsum is added to all elements of tilei to get the scanned output.
The illustration of this procedure is shown in Figure 5.7.
5.2.3 Parallel Radix Sort on the Cell B.E
The implementation of the Cell radix sorting algorithm for integer keys uses the SPE
SIMD instructions. Since each DMA can only transfer 4K integers at a time between
the PPE and a SPE, the communication time is significant, and cnot be ignored.
Thus, the double buffering technique is also adopted to hidet is penalty.
58
Figure 5.7.The Extended Parallel Scan Algorithm for a Large Array
We treat integer keys as 8-digit numbers which is the same as the CUDA radix
sort, so there are still 8 iterations of counting sort. In each pass, the keys are sorted
based on its24 = 16 possible values instead of using the 1-bit scan operation with 4
iterations.
For each pass, the main task of the PPE is dividing the sequence into equal-
sized chunks, and sending them to 4 SPEs. When the operation is done by the SPEs,
it performs serial scan, then writes each element of all chunks to its correct position.
The job of each SPE is receiving data chunks, sorting them based on a given digit (4
bits), and writing it back with its corresponding local histogram.
• Given an integer sequenceinput with size of power of two, the PPE partitions it
into n chunks with4096 elements each, and sendsn/4 chunks to each SPE.
• Each SPE receives a chunk by the double buffered DMA.
59
• Each SPE moves all the elements to the16 buckets usingSIMDshift and
SIMDand instructions, and records the number of elements in each bucket.
• Each SPE packs these buckets to a chunk, and writes it back to PPE with double
buffered DMA.
• After all chunks have been written back, the PPE builds up theglobal histogram,
and writes each element to its correct position in memory.
Since the Cell B.E. architecture uses 128-bit vector registr file, then each SPU
can perform four efficientshift andand operations for 32-bit integers simultaneously
by using SIMD instructions. In order to improve the system performance, the loop
structure can be unrolled by at least a factor of four to exploit the vector register file.
There are two reasons that the CUDA radix sorting algorithm is not appropriate
for the Cell architecture. Firstly, for the counting sort, it accesses memory randomly
in the phase of key permutation. This is impossible for Cell since the DMA transfer
issued by both the PPE and the SP needs to be 128-bit aligned. The size of the data
which can be transferred has to be 1, 2, 4, 8, 16 or multiples of16 bytes. Also, the
maximum size of transferred data is 16 KB.
Secondly, though both the SPE and the thread block are the basic units to solve
the sub-problem, we can not treat a SPE as the analog of a thread block in CUDA
programming. There are hundreds even thousands of thread blocks which can be gen-
erated in a CUDA program. In contrast, we only have eight SPEsin one Cell processor.
The SPE achieves parallelism using SIMD instruction. Therefore, the well-designed
parallel scan algorithm executed by multiple threads is useles for Cell radix sort.
60
5.2.4 Results
The radix sorting algorithm is implemented on both CUDA GPU and the Cell Broad-
band Engine Architecture. The CUDA radix sort is executed onthe platform of the
nVIDIA GTX 480 GPU. The Cell B.E radix sorting algorithm is run on the Sony
playstation 3 by using 4 out of 6 SPEs. For better comparison,the serial version of
radix sort described in Algorithm 5.2.1 is also implemented. And its performance is
also given.
Number of Elements Serial Radix CUDA Cell B.E.
65536 0.112145s 0.004195s 0.023989s
131072 0.231582s 0.006578s 0.040264s
262144 0.466402s 0.011950s 0.072786s
524288 0.940496s 0.022037s 0.138929s
1048576 1.881464s 0.042267s 0.271606s
Table 5.2. Running Time of the Radix Sorting Algorithm
Table 5.2 shows the timing performances of each implementatio of radix sort.
Each radix sort with different data size is executed 100 times, then the average is
calculated. The relationship between the data set size and the running time is shown in
Figure 5.8.
As Figure 5.8 shows, the CUDA radix sort outperforms the Cellsort. This is
determined by the fact that Cell B.E. and CUDA are two totallydifferent architectures.
The primary reason for the outformance of CUDA is memory latency. In Cell
B.E. architecture, accessing the cache-like local store ofthe SPE is very fast, but first
the data must be transferred from memory to local store by DMA. The communication
61

























Figure 5.8.The Comparison of Radix Sorting Algorithms
time is the most important concern of ours. Even though we canuse double buffer or
triple buffer to overlap certain part of data transfer time and computing time in SPE,
but the PPE can build up the global histogram only after all the c unks are written
back. In other words, the PPE has been idle since it distributed the task. And it can
not execute the next task before all the SPEs finish its job. However, this is not the
case for CUDA. Like the Cell B.E architecture, each thread block needs to load data
from global memory to its private shared memory before the computations begin, but
this memory operation can be performed by multiple threads simultaneously within
a few clock cycles. That means the data transfer time is almost hidden by the large
amount of thread-level parallelism. Moreover, the overhead of the threads creation
and synchronization is extremely trivial for the CUDA program.
The different parallel strategy is the secondary reason. Each SPE has a 128-
bit register file, and four 32-bit integers can be stored in a single register. Thus, each
62
SIMD instruction can only operate four integers within one SPE at any given time.
The entire sixteen integer values can be processed when using four SPEs. In contrast,
CUDA uses SPMD mode to implement parallelism. By defining thenumber of threads
in a block, each thread can operate on a different scalar integer value simultaneously.
Therefore, the maximum number of integer values which can beprocessed in each
threads block can vary from1 to 1024 in GTX 480 GPU at any given time.
CHAPTER 6
PARALLEL SAMPLE SORT
Sample sort is a comparison based sorting algorithm, which is known as a generaliza-
tion of quick sort. It is very suitable for multi-cores and many-cores processors. In this
chapter, we re-design, implement, and manually auto-tune the parallel sample sort on
both Cell Broadband Architecture and CUDA GPU Architecture. To achieve high per-
formance, these algorithms eliminate the expensive key comparison operations using
bit shift operations. The experiments showed that, for the CUDA sample sort, we have
obtained an increase in speed of up to 68 times over the sequential quick sort. And the
Cell B.E. sample sort is 4 times faster than the sequential quck sort.
6.1 Introduction
In general, for a serial external sorting algorithm, the combination of an internal sort
and some other phases is a common strategy to solve the sorting p oblem with large
input size. Firstly, in the phase ofkey partition, the input with a size exceeding the
available storage capacity is partitioned into tiles. Eachdivided tile can vary from
only one element to as large as the memory size. Next, one selected internal sorting
algorithm operates on these tiles, and generates the sortedtiles. We call this stagelocal
computationphase. Then, in thekey combinationphase, the resulting sorted tiles are
written back to main memory.
Due to limitations given by the data dependences, these phases have to be al-
63
64
ternately executed several times to generate the sorted output. The efficiency is usually
gained from making the size of each tile as large as possible that s ill can be resident
in the memory. Because of the similar phase patterns betweenthe sample sort and
the external sort, the designing strategy for the external sort i still applicable for the
sample sort. However, rather than using one processor sorting each tile, the sample
sort distributes tiles to different processors to be sorted, then writes the results back.
6.2 Sample Sort
Thesample sort, also calledsplitter sort, drives this style of alternating phases to the
extreme, since each phase is executed only once. The principle behind the sample
sort is quite simple. In the phase of key partition,s samples could be selected from
an unsorted input withn items, then these samples are sorted by a sequential sorting
algorithm. Because the number of samples is relatively small, the chosen algorithm for
sorting these samples has no influence on the performance of th sample sort. Then,
p− 1 elements calledsplittersare picked up from the sorted samples array, wherep is
usually the number of processors. After definingp buckets by thep− 1 splitters, each
key of the input sequence will be written to the correct bucket. Finally, each bucket is
sent to a different processor.
In the local computation step, each processor receives one bucket and stores
it to its private memory. Since this phase dominates the overall performance of the
sample sort, the local sorting algorithms should be carefully designed. Usually, the
best candidate for the local sorting algorithm is recursivequick sort, because quick
sort is well-known to be the fastest serial-sorting algorithm in practice. And its average
time complexity isO(nlogn2 ).
After the local sorting is done, all the buckets are sent backto main memory
65
in the key combination phase. If these buckets can be writtenfrom the local memory
back to the original input position directly, this algorithm is calledin-placealgorithm.
On the other hand, if these buckets are first transferred fromlocal memory to some
other position in main memory, then overwrite the input, we call this algorithmout-in-
placealgorithm. Figure 6.1 shows an example of in-place sample sort which sorts24
elements.
Figure 6.1.An Illustration of the sample Sort
As shown in Figure 6.1, we can find that sample sort is not a balanced sort-
ing algorithm; that is, the workload distributed to each processor is not balanced. So
the goal for the phase of key partition is to make buckets as balanced as possible.
Therefore, sample selection and the sample size selection serves this goal. One of the
strategies is randomly generating some samples and sortingthem, then selectingp− 1
splitters. However, using the other strategy, we can obtainmore balanced buckets: a
position of the input array is randomly produced, then a certain number of consecutive
66
elements are selected from this location as a part of the samples. After several repe-
titions of this process, we get the final sample array. Due to the fact that the samples
can reflect the distribution of their population, the splitters coming from these samples
can make the buckets more balanced than the previous method.Intuitively, the more
the samples, the more balanced the buckets are. However, there is a tradeoff between
the sample size and the balance. Since with the increment of the sample size, the
performance of the sample sort will be degraded.
While the balance problem is one concern for us, the large number of condi-
tional branch is the other. Givenp−1 splitters, each key would be moved to the correct
bucket underp/2 operations ofconditional branchon average. Assume the size of the
input sequence isn, the sample sort algorithm has to performp/2 ∗ n = O(n) con-
ditional branches in the key partition phase. With the increment of the value ofn, its
performance will be degraded drastically since the conditional branch may generate
significant penalty.
When a conditional branch comes into the first stage of the pipline, the in-
struction which will be executed next must be predicted since the result of this branch
will not be known until it reaches the third stage. If the prediction hit, there will not
be any penalty. If the predictionmisses, the pipeline must be flushed because of its
incorrect execution path. And the penalty produced by the missed prediction is in
proportion to the pipeline depth.
For the input sequence with a integer type, conditional branches can be elimi-
nated by the bitwise shift operation. Also, some steps such as sample selection, sample
sorting and splitter selection can be skipped as well. The only thing we need to do in
the key partition phase is to shift each key to given bits, anddistribute it to the correct
bucket. For instance, suppose there are8 processors available, and the same number
67
of buckets have been established. For the input sequence of 32-bit integers, we need to
shift each key31 − 3 = 28 bits right (the most significant bit is sign bit). After these
shift right operations, the value of each key will fall into the range between0 and7.
Then we use each altered value as the offset for the buckets, and distribute the key to
its corresponding bucket.
In essence, some special splitters are still chosen to put keys into different
buckets. And the number of splitters is one less than the number of processors in the
regular sample sort. From the above example, we can find that there are7 splitters:229,
230, 229+230, 231, 229+231, 230+231, 229+230+230. By using these splitters with the
pattern of power-of-two, the key partition phase based on comparison is transformed
to the phase based on bitwise shift. Even though the operation of the shift will make
the buckets quite unbalanced, this problem can be totally hidden by the improvement
of the performance by using bitwise shift operations.
6.3 CUDA Parallel Sample Sort
In order to map the sorting problem to the CUDA GPU architecture efficiently, the
task needs to be divided into data-independent subproblemswhich can be solved in
parallel by thread blocks. So the input sequence with sizen must be partitioned into
b = n/(t × s) blocks, wheret threads are assigned to each block with each thread
processings elements. Because the phase of local computation plays a vital role in the
performance of sample sort, the average number of elements within a block on average
first needs to be determined according to the given numbern. Then the specific values
of b, t, ands can be decided.
According to the discussion of section 6.2, we know that the performance of the
sample sort can be improved by replacing conditional branches with the shift operation.
68
This is one benefit provided by the bitwise shift operation. Othe other hand, it also
implies that some of these most significant bits of the input sequence have been sorted
by the shift operation. For example, suppose we define16 buckets. When the key
partition phase is done, each input element is moved to the corr ct bucket. The keys
with the same three most significant bits are resident in a bucket having the same
offset. Thus, in the next step, we need only to sort the keys locally based on their
28 least significant bits. Moreover, CUDA provides atomic write functions. And the
parallel scan is a mature technique applied in the CUDA GPU platform. All of these
features make the counting sort the best candidate for the local c mputation phase.
As we know, counting sort includes three phases: HISTOGRAM-KEYS, SCAN-
BUCKETS, and RANK-AND-PERMUTE. The HISTOGRAM-KEYS can be paral-
lelized by theatomicAddinstruction as the following code shows. Here, each thread
has its uniquethid, 64 is the threads number in each group, andbucket num is the
number of defined buckets.
grp num = t h i d / 64 ;
p = d a t a [ t h i d ] >> 31 − l og2 ( bucket num ) ;
atomicAdd(&temp [ p ∗ 4 + grp num ] , 1 ) ;
The functionatomicAddreads the old value located at theaddress in global or shared
memory, computes(old + val), and writes the result back to the sameaddress, then
returns the old value. These four operations are performed in one atomic transac-
tion [15]. When multiple threads try to change the number of the elements in the same
bucket simultaneously, this function guarantees the altered value is correct. Therefore,
the shift and addition operations can be performed at the same ti e by the threads
within a block.
The phase of the HISTOGRAM-KEYS can be parallelized by the atomic write
69
transaction, but this is not the case for the RAND-AND-PERMUTE phase. Since
there is no atomic read operation supplied by the CUDA, the reading operation from
the scanned array is needs to be executed sequentially. So the phase of RAND-AND-
PERMUTE is a bottleneck of system performance. Because of this reason, each bucket
can not hold too many keys. We empirically choose128 elements on average for each
bucket. With the growth of the input size, we increase the number of the buckets
manually. And each bucket has a128 elements on average.
The CUDA sample sorting algorithm can be described in3 phases with the
given input sizen:
Each thread block defines ak × h-entry local histogram and loadst keys into
its shared memory, wheren/h = 128 on average. The allocated shared memory for
the local histogram and the loaded keys can not exceed 16 KB for G80 GPU, or 48
KB for GTX 480 GPU. Each thread computes its group number according to its thread
ID, and executes the shift operation on the keys. Since the local histogram is a two
dimensional array, each thread uses the shifted value as a row index and its group
number as a column index to find the target entry, then increases the value of that entry
by one with the atomic write operation. The local parallel scan is executed and the
total sum of each row in the local histogram is written to one column of an auxiliary
table. The auxiliary table is stored in column-major and located in global memory. It
is a two-dimensional array with size ofb × h, whereb is the number of thread blocks
created in this phase.
The extended parallel scan algorithm discussed in Figure 5.7 is performed on
the b × h auxiliary table to produce the global histogram. Each thread block loads
one column from the global histogram, indexed by its block IDto shared memory, and
adds it on each column of thek × h-entry local histogram. Each thread block now
70
has the knowledge of the bucket position for its keys, so it distributes each key to the
correct bucket.
Each thread block loads one bucket to its shared memory, and performs a local
radix sorting algorithm which consists of several iterations of counting sort. For the
radix sort, the number of bits of the keys which can be sorted in each pass and the
number of times the counting sort is needed are all determined by the buckets count.
For instance, if there are27 = 128 buckets available, the counting sort needs to be
looped3 times sorting8 bits of the integer keys in each iteration. After the sorting, all
the threads in each block cooperatively write their bucket back to the global memory.
In the second phase, before a key can be sent to the correct bucket, its cor-
responding position must be read from shared memory. Due to the fact that CUDA
does not support the atomic read transaction, the step of dispatching keys can not be
done in parallel. If we try to parallelize it with several threads, it is very possible that
multiple threads read the same position for different keys,which will cause a writing
conflict. This is the primary reason we usek sets of the local histogram rather than
only 1 set in the previous phase. By using ak × h local histogram, distributing keys
to the global memory can be performed byk threads simultaneously. Sincek threads
access the global memory in a random way, their memory accessing requests can not
be coalesced. Moreover, due to the small size of the shared memory, the value ofk
can not be very large. Thus, each thread still needs to accessth global memory many
times to write the keys back. Therefore, dispatching keys tothe buckets is the first
bottleneck for the CUDA sample sort.
For the counting sort in the local computation stage, each key ne ds to be sent
to the correct shared memory position after the parallel scan is performed. For the
same reason that CUDA does not supply atomic read for us, the operation of reading
71
a position for each key must be done sequentially. Although each key is written to the
shared memory instead of the global memory at this time, it isthe other bottleneck for
the sample sort.
6.4 Cell B.E. Sample Sort
In the Cell B.E. radix sorting algorithm, the PPE divides theinput sequence into tiles
with the size of4096 each, and sends them to its SPEs to perform radix sort. After
the radix sort is complete, each SPE writes its tiles back to the main memory. For the
sorting process in the SPEs, the SIMD instruction and doublebuffer can be utilized
to improve system performance. Since the Cell radix sortingalgorithm sorts only 4
bits for the keys each time, it requires8 iterations to complete the sorting process.
Unlike the radix sort, the Cell B.E. sample sort only needs one iteration to produce the
sorted output. Before the local computation phase starts, each key has been moved to
the correct bucket. This operation eliminates the data dependences among the buckets
before the execution of the sorting algorithm. So after the sorting, the output can be
generated.
The Cell B.E. sample sort can be described in the following sta es:
• Given an integer input sequence with the size ofn, the PPE distributes each key
into one ofb buckets, and sends these buckets to the SPEs.
• Each SPE receives a bucket by the double buffered DMA.
• Each SPE performs quick sort on this bucket.
• Each SPE sends the sorted bucket to the main memory by the double b ffered
DMA, then go to the second step.
72
• The PPE discards the padding elements in each bucket, and concatenates these
buckets to form the sorted output.
In the first step, due to the the constraint of the DMA transfer, we can not
define the size of each bucket arbitrarily. As we discussed inchapter 3, for each DMA
transfer in the Cell B.E. architecture, the transferred data size needs to be 1, 2, 4, 8,
16 or multiples of 16 bytes. In order to make the data transfermo e efficient, we
specify the bucket size to be 16 KB, which is the maximum DMA transfer size as
well. Moreover, we need to guarantee that the number of elements can not exceed
4096 in each bucket. The PPE uses the SIMD shift right instructionto move each key
into the correct bucket, and also needs to compute the numberof k ys in each bucket.
The following pseudocode shows the associative operationsin the first stage. Here,
thevec splats function returns a vector with all of its elements being the given scalar
value.
Algorithm 6.4.1 The First Phase of the Cell B.E Sample Sort
vector int *vs← (vector int *)in
num buckets← n / 2048
vector int shft← vec splats(31 - log(numbuckets))
for i ← 0 to (numbuckets - 1)do
countbuckets[i]← 0
malloc(buckets[i], 4096 * sizeof(int))
end for
for i ← 0 to n/4 do







As we know, the sample sort is an unbalanced sorting algorithm. The num-
ber of elements in each bucket can not be known until the run time. So we need
another data structure to store it in the program. And this isthe main function of
the arraycount buckets. Aafter the PPE sends the array ofbuckets along with the
count buckets to the SPEs, the SPEs perform the recursive quick sort only onthe first
count buckets[i] elements forbuckets[i], then write the whole bucket back.
Due to the data size limitation of the DMA transfer, the padding elements are
necessary in each bucket. Therefore, when each bucket is written back to main mem-
ory, the PPE first discards them, then concatenates these buck ts to generate the sorted
output.
Although the Cell B.E. processors support SIMD instructions, key distribution
is a bottleneck for system performance because this operation is done by the PPE
sequentially. With the increment of the input size, the performance will be degraded
significantly. On the other hand, when all the buckets are written back to the main
memory, the PPE will access main memory too many times to discard the padding
elements and concatenate these buckets to generate the output. Thus, this is another
bottleneck for the Cell B.E. sample sort.
6.4.1 Results
The sample sorting algorithm is implemented on both CUDA GPUand the Cell B.E.
architecture. The CUDA sample sort is executed on the platform of the nVIDIA GTX
480 GPU. And the Cell B.E radix sorting algorithm is run on theSony playstation 3
by using 4 out of 6 SPEs. For better comparison, the serial version of the quick sort is
also implemented. The following table shows the results.
Table 6.1 shows the timing performances of the parallel sample sort algorithms
74
Input Size Serial Quick Sort CUDA Sample Sort Cell Sample Sort
65536 0.048718s 0.001142s 0.022649s
131072 0.102867s 0.001901s 0.037609s
262144 0.220722s 0.003754s 0.067547s
524288 0.470589s 0.007245s 0.128581s
1048576 0.980413s 0.014418s 0.251127s
Table 6.1. Running Time of the Parallel Sample Sort and Serial Quick Sort
together with the sequential quick sort. Each sorting algorithm with different data
size is executed 100 times, then the average timing performance is calculated. The
relationship between the data set size and the running time is shown in Figure 6.2.

























Figure 6.2.The Comparison of Parallel Sample Sort and the Sequential Quick Sort
As Figure 6.2 shows, the CUDA sample sort outperforms the Cell sort. The
reasons are the same as we discussed in section 5.2.4. The differ nt memory accessing
75
times is the first reason. For the Cell B.E. sample sort, the PPE has to distribute all
the keys to buckets by using SIMD instruction. Since this task involves the indirect
addressing mode, the cache miss penalty will dominate this operation. So the PPE
needs to access main memory approximatelyO(n) times to complete this work. On
the contrary, for the CUDA sample sort, all the threads in each block first cooperatively
load their tile to the shared memory, and perform the shift operation for the keys, then
write them back to global memory. On average, in this process, each thread needs only
to access the global memoryO(1) times.
The different parallel degree is the second reason. The PPE has a 128-bit reg-
ister file, so four 32-bit integers can be stored in a single entry. Therefore, by using
the SIMD instruction, the PPE can operate on four integers atany given time. For the
CUDA GPU, each thread can operate on only one key at any time, because the kernel
function does not support the SIMD instructions. However, each thread block can cre-
ate up to1024 threads, the number of keys which can be processed in the GPU is much
greater than the Cell B.E. architecture at any given time.
CHAPTER 7
CONCLUSION AND FUTURE WORK
This thesis has presented the implementation and the tuningof parallel radix sort and
parallel sample sort on the nVIDIA CUDA enabled graphics processors, as well as on
the Cell B.E. architecture. From our experimental results,we can summarize that the
sorting algorithms on the GPU processors outperforms the parallel sorting algorithm
on the Cell B.E. processors.
For the CUDA GPU, the high performance is achieved by exploiting hree
levels of memory hierarchy. Rather than consuming data resident in the low-speed
global memory directly, all the threads in a thread block first cooperatively load the
required data into the high-speed shared memory, then process it. This can effectively
reduce the average memory access time, especially when the threads need to access
the global memory multiple times in one kernel function. Also, by combination con-
secutive access requests to the global memory issued by the threads within a thread
block into a single operation, the memory latency can be further optimized. Unlike the
CUDA GPU, the Cell B.E. architecture adopts the unified cache-like local store for its
SPEs, and attempts to overlap the overhead of the DMA transfer with the computation
through double buffer to obtain better system performance.B fore the data can be
processed by the SPEs, it must be transferred from the main memory to the local store.
Although the memory latency for local store is trivial, the extra cost for DMA transfer
can not be ignored. And in most cases, it is usually very difficult to completely hide
76
77
the overhead of DMA transfer by the computation.
The many-core GPU exploits fine-grained parallelism. The programming mode
for CUDA is SPMD, so a large number of data can be processed simultaneously. This
can be reflected from our parallel scan primitive and bitwiseshift operation. In con-
trast, the parallel structure for the multi-core Cell B.E. architecture is more suitable to
exploit coarse-grained parallelism. Because of its limited number of processors, even
with the help of the SIMD instructions, the total number of integer data which can be
processed is trivial when compared to the CUDA at any given time. For the process
of building the global histogram in the radix sort, the Cell B.E. implementation has to
do it sequentially. In some cases, the Cell B.E. can only process one piece of data at a
time.
The other conclusion we can gain from our experiments is thatthe performance
of the sample sort is better than that of the radix sort for both CUDA and Cell B.E.
implementations. For the radix sort, the parallel countingsort needs to be invoked
eight times to complete the sorting process. But for the sample sort, the similar steps
need only to be executed once as the algorithm of sample sort eliminates the data
dependences before the sorting procedure begins. Comparedwith the eight iterations’
execution of the counting sort within the radix sort, the overhead introduced from
removing the data dependences of the sample sort is inconsequential.
Although a processor core of the multi-core architecture ismore sophisticated
than that of the many-core architecture, the latter is the trend for the parallel computing,
since the potential improvement performance produced by the many-core architecture
would surpass that of the multi-core architecture. However, the parallel computing
also presents some new challenges for the programmer. Sincesequential algorithms
are not appropriate for parallel architectures, almost allof the parallel versions of these
78
algorithms should be re-designed to achieve higher system performance on parallel
architectures. Because of the existence of data dependences, the parallel algorithms
are always more difficult to design than their sequential versions.
The future work of this thesis will include tuning and improving the imple-
mentations of the radix and sample sorts on parallel architectur s while identifying
the bottlenecks resident in these algorithms. It is also an interesting future direction
to extend our algorithms to sort key-value pairs. Finally, we are also interested in
developing efficient parallel scan primitives on the Cell B.E. architecture.
REFERENCES
[1] Randy Allen and Ken Kennedy.Optimizing Compilers for Modern Architectures:
A Dependence-based Approach. Morgan Kaufmann, 2001.
[2] Guy E. Blelloch.Vector Models for Data-Parallel Computing. MIT Press, 1990.
[3] NVIDIA Corporation. Nvidia fermi compute architecturewhitepaper. 2009.
[4] Michael Gschwind, David Erb, Sid Manning, and Mark Nutter. An open source
environment for cell broadband engine system software.Computer, 40:37–47,
June 2007.
[5] John L. Hennessy and David A. Patterson.Computer Architecture: A Quantita-
tive Approach, 4th Edition. Morgan Kaufmann, 2006.
[6] W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM,
29:1170–1183, December 1986.
[7] IBM. Cell broadband engine programming handbook.
[8] IBM. Programming tutorial.
[9] David B. Kirk and Wen mei W. Hwu. Programming Massively Parallel Pro-
cessors: A Hands-on Approach (Applications of GPU Computing Series). 1st
Edition. Morgan Kaufmann, 2010.
79
80
[10] Donald Ervin Knuth. The Art of Computer Programming, Volum 3. Addison
Wesley, 1998.
[11] Erik Lindholm, John Nickolls, Stuart Oberman, and JohnMontrym. Nvidia tesla:
A unified graphics and computing architecture.IEEE Micro, 28:39–55, 2008.
[12] J.L. Manferdelli, N.K. Govindaraju, and C. Crall. Challenges and opportunities
in many-core computing.Proceedings of the IEEE, 96(5):808–815, May 2008.
[13] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns for
Parallel Programming. Addison Wesley, 2004.
[14] GORDON E. MOORE. Cramming more components onto integrat d circuits.
Electronic Magazine, 1965.
[15] nVIDIA. Nvidia cuda c progamming guide.
[16] David A. Patterson and John L. Hennessy.Computer Organization and Design,
Fourth Edition: The Hardware/Software Interface. Morgan Kaufmann, 2008.
[17] Matthew Scarpino.Programming the Cell Processor: For Games, Graphics, and
Computation. 1st Edition. Prentice Hall, 2008.
[18] Shubhabrata Sengupta, Mark Harris, Yao Zhang, and JohnD. Owens. Scan prim-
itives for gpu computing. InProceedings of the 22nd ACM SIGGRAPH/EURO-
GRAPHICS symposium on Graphics hardware, GH ’07, pages 97–106, Aire-la-
Ville, Switzerland, Switzerland, 2007. Eurographics Association.
[19] Wikipedia. Granularity. http://en.wikipedia.org/wiki/
Granularity.
81
[20] Wikipedia. Locality of reference.http://en.wikipedia.org/wiki/
Locality_of_reference.
[21] Wikipedia. Moore’s law. http://en.wikipedia.org/wiki/Moore’
s_law.
[22] Marco Zagha and Guy E. Blelloch. Radix sort for vector multiprocessors. InIn
Proceedings Supercomputing ’91, pages 712–721, 1991.
