GPU Array Access Auto-Tuning by Weber, Nicolas
GPU Array Access Auto-Tuning
Dissertation
approved by the Department of Computer Science
of Technische Universität Darmstadt
for the award of the academic degree
Doktor-Ingenieur (Dr.-Ing.)
submitted by
Nicolas Weber, M.Sc.
born in Hanau.
Referees: Prof. Dr.-Ing. Michael Goesele, Technische Universität Darmstadt
Prof. Dr. Michael Gerndt, Technische Universität München
Date of submission: 13.04.2017
Date of defense: 19.06.2017
Darmstadt dissertation, 2017
D 17

Declaration on the Dissertation
I hereby declare that I have written this dissertation independently, using only the cited sources and aids. All passages taken from sources are marked as such. This work has not been submitted to any examination authority in the same or a similar form.
Darmstadt, 13.04.2017
Nicolas Weber
Abstract
Graphics Processing Units (GPUs) have been used for years in compute-intensive applications. Their massive parallel processing capabilities can speed up calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only one piece of achieving high performance. Nearly as important as suitable algorithms are the actual implementation and the use of special hardware features such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time-consuming task that requires a deep understanding of the algorithms and the underlying hardware. Unlike Central Processing Units (CPUs), the internal structure of GPUs has changed significantly and will likely change even more over the years. Therefore, it does not suffice to optimize the code once during development; it has to be optimized for each new GPU generation that is released. To efficiently (re-)optimize code for the underlying hardware, auto-tuning tools have been developed that perform these optimizations automatically, taking this burden from the programmer. In particular, NVIDIA, the leading manufacturer of GPUs today, applied significant changes to the memory hierarchy over the last four hardware generations. This makes the memory hierarchy an attractive objective for an auto-tuner.
In this thesis we introduce the MATOG auto-tuner, which automatically optimizes array access for NVIDIA CUDA applications. To achieve these optimizations, MATOG has to analyze the application to determine optimal parameter values. The analysis relies on empirical profiling combined with a prediction method and a data post-processing step. This makes it possible to find nearly optimal parameter values in a minimal amount of time. Further, MATOG is able to automatically detect varying application workloads and can apply different optimization parameter settings at runtime. To show MATOG's capabilities, we evaluated it on a variety of applications, ranging from simple algorithms to complex applications, on the last four hardware generations with a total of 14 GPUs. MATOG is able to achieve equal or even better performance than hand-optimized code. Further, it is able to provide performance portability across different GPU types (low-, mid-, high-end and HPC) and generations. In some cases it is able to exceed the performance of hand-crafted code that has been specifically optimized for the tested GPU by dynamically changing data layouts throughout the execution.
Zusammenfassung (Summary)
Graphics Processing Units (GPUs) have been used for years in compute-intensive applications. Their massively parallel computing power can accelerate calculations significantly. To achieve this acceleration, algorithms have to be revised or newly developed to enable parallel computation. These algorithms, however, are only one part of achieving high computation speeds. Just as important as sophisticated algorithms are the actual implementation and the use of special components such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time-intensive task that requires a comprehensive understanding of the algorithms and the accelerator. Unlike Central Processing Units (CPUs), the internal structure of GPUs has changed considerably in recent years and will certainly continue to evolve. It is therefore not sufficient to optimize programs only during development. To optimize programs efficiently for a particular device, auto-tuners have been developed that perform these optimizations automatically and thus relieve the programmers. NVIDIA, the leading manufacturer of GPUs, has made significant changes to the memory hierarchy in the last four generations. This makes the memory hierarchy an attractive target for an auto-tuner.
In this thesis we present the MATOG auto-tuner, which automatically optimizes array accesses in NVIDIA CUDA applications. To achieve these optimizations, the application has to be analyzed and optimal parameters have to be found. This analysis is based on empirical measurements combined with a prediction method and a data post-processing step. This makes it possible to find nearly optimal parameters in a very short time. MATOG is furthermore able to detect different program states and to apply different optimizations at runtime. To demonstrate MATOG's capabilities, we tested a selection of simple and complex applications on the last four hardware generations with a total of 14 different GPUs. MATOG is able to achieve equivalent, and in some cases better, performance than hand-optimized implementations. Furthermore, it offers performance portability across different GPU types and generations. In some cases MATOG can exceed the performance of hand-optimized code by dynamically adapting the memory layouts at runtime.
Acknowledgements
First of all, I would like to thank my supervisor Michael Goesele for the opportunity to work in his research group for the last four years, his support and valuable feedback. The work environment he created is challenging and sometimes exhausting, but exactly this makes it fruitful and pushes one to seek and achieve even more. It was a pleasure to work with him for all these years.
I would also like to extend my sincere gratitude to Michael Gerndt, who kindly agreed to be referee for this thesis.
Next, I would like to give special thanks to my colleagues, starting with Martin Hess. We shared an office for all these years and had many inspiring/funny discussions! One great hug goes to Michael Wächter, who never gave up on me and pushed me all the way through the obstacles of the Detail-Preserving Image Downscaling (DPID) paper [Weber et al. 2016]. Further, I want to thank Dominik Wodniok, not only for all the discussions we had, but also for his valuable feedback on MATOG and DPID. A special thanks goes to Sandra C. Amend for her support in all the annoying programming tasks that had to be done.
Finally, I want to thank my parents, who always supported me, my brothers and all my friends for their help and support.
The work of Nicolas Weber is supported by the ‘Excellence Initiative’ of the German Federal and State Governments and the Graduate School of Computational Engineering at Technische Universität Darmstadt.
Contents
Frontmatter
Abstract
Zusammenfassung
Acknowledgements
Main Content
1 Introduction
1.1 Contributions
1.2 Thesis Outline
2 Background
2.1 Computational Basics
2.1.1 Processor Architectures
2.1.2 Memory Hierarchy and Caches
2.1.3 Multi-Tasking and Scheduling
2.1.4 Processing Improvements
2.1.5 Multi-Processing
2.1.6 Performance Classification
2.1.7 Performance Limitations
2.2 Hardware Implementations
2.2.1 Bus Systems
2.2.2 Processor Performance
2.2.3 Caches for Parallel Processing
2.2.4 Parallel Processors and Accelerators
2.2.5 Graphic Processing Units
2.2.6 Memory Types
2.3 Array Layouts
2.3.1 Multidimensional Indexing
2.3.2 Struct Layouts
3 Target Architecture and Platform
3.1 NVIDIA CUDA
3.1.1 Compute Model
3.1.2 Programming Language
3.1.3 CUDA Profiling Tools Interface
3.2 NVIDIA GPUs
3.2.1 Fermi Architecture
3.2.2 Kepler Architecture
3.2.3 Maxwell Architecture
3.2.4 Pascal Architecture
4 Auto-Tuning and Related Work
4.1 Options for optimization
4.2 How to integrate optimizations into applications?
4.3 Which configurations are optimal?
4.4 How to detect and handle runtime dependent performance effects?
4.5 Overview
4.5.1 Performance Measurement, Modeling and Simulation
4.5.2 Compilers
4.5.3 Programming Languages
4.5.4 Domain Dependent Auto-Tuning
4.5.5 Domain Independent Auto-Tuning
4.5.6 Memory Access and Data Layouts Auto-Tuning
5 MATOG Auto-Tuner
5.1 Programming Interface
5.2 Programming Example
5.3 Implementation Details
5.3.1 Texture Memory
5.3.2 Shared Memory
5.3.3 Optimization Hints
6 Application Analysis
6.1 Optimization Problem
6.2 Step 1: Application Profiling
6.2.1 In-Application Profiling
6.2.2 Prediction Based Profiling
6.3 Step 2: Determine Optimal Configurations
6.3.1 Decisions
6.3.2 Array Dependencies
6.3.3 Exhaustive Search
6.3.4 Predictive Search
6.4 Step 3: Decision Models
6.4.1 Directional Model
6.5 MATOG Runtime System
7 Evaluation
7.1 Benchmark Applications
7.1.1 Bitonic Sort
7.1.2 Speckle Reducing Anisotropic Diffusion
7.1.3 Hotspot
7.1.4 Detail Preserving Image Downscaling
7.1.5 Coevolution via MI on CUDA
7.1.6 Renders Everything You Ever Saw
7.1.7 KD-Tree
7.2 Execution Performance
7.2.1 GPU Execution Time
7.2.2 Application Execution Time
7.2.3 Performance Portability
7.2.4 Analysis Time
8 Empirical Performance Models
8.1 Model Training and Prediction Accuracy
8.1.1 Single Dataset
8.1.2 Multiple Datasets
8.1.3 Error Cases
8.2 Predicting Unknown Configuration Performance
9 Discussion
9.1 Summary
9.2 Is auto-tuning useful?
9.3 Which optimizations are optimal?
9.4 MATOG Implementation Improvements
9.5 Conclusion
10 Future Work
10.1 Future of MATOG
10.2 Evaluation and comparability
10.2.1 Benchmark Suites
10.2.2 Collective Knowledge
10.3 How to combine different optimizations?
10.4 Performance models, continuous monitoring and adaptive reoptimization
10.5 Incompatibility of Auto-Tuners
10.6 Performance Portability Aware Software Stack
Appendix
A Benchmark Training-/Testing-Data
A.1 Bitonic Sort
A.2 SRAD
A.3 Hotspot
A.4 DPID
A.5 COMIC
A.6 REYES
A.7 KD-Tree
Acronyms
Bibliography
(Co-)Authored Publications
Chapter 1
Introduction
Writing efficient code is one of the major objectives for programmers, besides the correctness of the calculations. However, efficiency covers multiple factors, such as time, energy and cost. Depending on the application, the focus shifts between these factors. For example, gaming hardware is tuned to deliver the highest performance rather than low energy consumption [Mills and Mills 2015]. The costs have to be moderate, so that a private person can afford the computer. Supercomputers have also striven for the highest available performance [Top 500 2016] for many years. However, today many supercomputers seek to provide high performance with low energy consumption [Green 500 2016], to keep the costs reasonable. In general, costs consist of the initial expenses for the hardware, maintenance costs, power consumption (which can be several million dollars per year for supercomputers) and costs for developing and optimizing the software. To reduce the costs, on the one hand applications have to be time and energy efficient to optimize the utilization and lower the energy consumption. On the other hand, this increases the costs for development and optimization. Unfortunately, code efficiency always depends on the underlying hardware: the code has to be specifically designed for the hardware it is running on. This requires that experts with a deep understanding of both the hardware and the application optimize the code, which again increases the costs [Bischof et al. 2012].
The first computers executed a single application on a single-threaded processor [Goodacre 2011]. After its introduction, Intel's x86 architecture became the standard architecture for many years [Levenson 2013]. As a consequence, software was mainly developed for this particular architecture. For years this worked, as technological advances allowed faster processors to be developed without changing the architecture significantly. This effect had been predicted by Moore [1965] and became known as “Moore’s Law”. However, these advances have stalled, heralding the demise of Moore's Law [Berkeley 2014] (Section 2.2.2). To further improve the performance of processors, other methods had to be found. One of these is parallel computation. Parallel processors can have up to several thousand compute cores operating in parallel. This parallel processing power comes with the necessity of writing efficient code that enables all cores of the processor to solve a problem together. Therefore, not only do algorithms have to be rethought and specifically designed for parallel processing, but the actual implementation also has to utilize the available hardware features. Today, the market is filled with all kinds of processors, specifically designed to excel in a particular field. They range from processors with few but fast compute cores (multi-core CPUs) to processors with thousands of rather slow cores (GPUs). Special-purpose processors are also available, e.g., Google's Tensor Processing Unit (TPU) [Jouppi 2016], which is optimized for neural network processing. This variety of processors poses a challenge to software developers, as they no longer have only one dominant architecture that their software has to be developed and optimized for. They have to deal with varying processor designs, numbers of processing cores, specialized hardware functionality, programming languages, library support and operating systems.
Further, hardware architectures (even from the same manufacturer) undergo constant changes and improvements, introducing new, changed or removed functionality. NVIDIA is the leading manufacturer of GPUs today [Shilov 2016]. NVIDIA's GPUs have a complex memory hierarchy with a series of different automatic and self-organized caches that need to be used efficiently (Section 3.1.1). In the last four GPU generations, many significant changes have been applied to this memory hierarchy (Section 3.2), so that code written for prior generations does not necessarily run as efficiently as it could on newer generations. For example, in their Fermi architecture NVIDIA added the option to define a trade-off between more self-managed shared memory and more automatic L1 cache [NVIDIA 2009]. The next generation [NVIDIA 2014a] added more trade-off options to choose from. In the third generation [NVIDIA 2014b], this feature was removed entirely. So in three consecutive hardware generations (2009-2014) of the same manufacturer, the behavior changed with every generation. Another example is the vector processing units in CPUs. Advanced Vector Extensions (AVX) 1.0 [Reinders 2013] were added in the 2nd generation of Intel's i7 processors [INTEL 2011]. In the 4th generation [INTEL 2013a], AVX 2.0 followed. As AVX 2.0 instructions are not available on older CPUs, programmers have to explicitly check the capabilities of the CPU their software is running on.
One of the major problems of parallel programming is data access. As the time to access a single memory cell has not really improved over the years [CRUCIAL 2015], today's memory systems do not read a single data cell but entire blocks. These are then transferred to the processor over wide memory buses to push through the required amounts of data. This method is only efficient if not just a single core but (in the best case) all cores are served by the data provided by this transfer. If not all cores get the data they are seeking, the performance drops significantly, as the remaining cores have to wait for the next data transmission. Therefore, it is necessary to optimize data access so that as many cores as possible are served by the data within one block. Unfortunately, optimizing the utilization of these memory blocks is not the only optimization objective. Different data layouts can have varying computational and resource requirements. In some layouts the location of a memory cell can be obtained quite easily, while others require either more calculations or more temporary registers to determine the correct location (Section 2.3.2). Using a different layout could improve the utilization of the data block but, at the same time, increase the resource usage. This can be a problem for GPUs, as their cores are quite limited in their computational and register resources. This additional consumption can reduce the performance of the cores, which would result in a lower overall performance, although the memory utilization is better. Therefore, it is necessary to find a good balance between both optimization goals.
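The effect of the data layout on block utilization can be illustrated with a small toy model. The following Python sketch is not taken from MATOG; it assumes a warp of 32 threads, 128-byte memory blocks and 4-byte fields, and compares two common struct layouts, Array of Structs (AoS) and Struct of Arrays (SoA). All function names are hypothetical and only serve the illustration.

```python
"""Toy model: how many memory blocks does one warp touch per layout?"""

WARP_SIZE = 32      # assumption: threads that issue one memory request together
BLOCK_BYTES = 128   # assumption: size of one memory block (transaction)
FIELD_BYTES = 4     # assumption: one 32-bit value per struct field


def aos_address(index, field, num_fields):
    """Array of Structs: all fields of one element are stored adjacently."""
    return (index * num_fields + field) * FIELD_BYTES


def soa_address(index, field, num_elements):
    """Struct of Arrays: every field has its own contiguous array."""
    return (field * num_elements + index) * FIELD_BYTES


def blocks_touched(addresses):
    """Number of distinct memory blocks needed to serve all addresses."""
    return len({addr // BLOCK_BYTES for addr in addresses})


def warp_blocks(layout, field, num_fields, num_elements):
    """Blocks touched when thread i of a warp reads `field` of element i."""
    if layout == "aos":
        addrs = [aos_address(i, field, num_fields) for i in range(WARP_SIZE)]
    elif layout == "soa":
        addrs = [soa_address(i, field, num_elements) for i in range(WARP_SIZE)]
    else:
        raise ValueError(f"unknown layout: {layout}")
    return blocks_touched(addrs)


if __name__ == "__main__":
    for layout in ("aos", "soa"):
        n = warp_blocks(layout, field=0, num_fields=4, num_elements=1024)
        print(f"{layout}: {n} block(s) for one warp")
```

Under these assumptions, the AoS layout spreads the 32 requested values over four blocks, while the SoA layout serves the whole warp from a single block. The model ignores the additional index calculations and register usage that a layout may require, which is exactly the second optimization objective described above.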
In general, this variety of hardware architectures and software environments overwhelms the abilities of many programmers, especially when they are not hardware enthusiasts but scientific programmers with a background in biology, physics, or chemical, mechanical, or electrical engineering. It has been argued that more experienced programmers are therefore needed to tune code to maintain high efficiency, especially for supercomputers [Bischof et al. 2012]. An alternative to manual optimization is to develop tools that apply these optimizations automatically. This so-called “auto-tuning” of software has been an active research topic for several years and is still flourishing, as the publication rates in Figure 1.1 indicate. The idea of auto-tuning is that software adjusts itself to the underlying hardware without any manual interaction. In general, there are multiple goals for auto-tuning. First, it is supposed to find an optimal implementation automatically and, if possible, in less time than manual tuning would take. Second, the software should adjust itself to the hardware it is running on, not to the hardware it was developed for. So the auto-tuning has to be usable by the customer and not only by the developers. Third, auto-tuning should provide performance portability across different hardware types and generations, if possible even for unknown future hardware. However, support for future hardware can also be added by updating the auto-tuner, as long as this does not require any adjustments to the application code.
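The basic idea of empirical auto-tuning can be sketched in a few lines: run every candidate implementation on the machine at hand, measure it, and keep the fastest one. The following Python sketch is only an illustration of this principle and not MATOG's interface; the names `measure`, `autotune` and the two toy variants are hypothetical.

```python
import time


def measure(run, repetitions=3, clock=time.perf_counter):
    """Return the best wall-clock time of `run` over several repetitions."""
    best = float("inf")
    for _ in range(repetitions):
        start = clock()
        run()
        best = min(best, clock() - start)
    return best


def autotune(variants, measure_fn=measure):
    """Measure every variant and select the fastest one.

    `variants` maps a configuration name to a zero-argument callable.
    Returns the name of the fastest variant and all measured timings.
    """
    timings = {name: measure_fn(run) for name, run in variants.items()}
    best_name = min(timings, key=timings.get)
    return best_name, timings


def sum_row_major(matrix):
    """Toy variant 1: traverse the matrix row by row."""
    return sum(value for row in matrix for value in row)


def sum_column_major(matrix):
    """Toy variant 2: traverse the matrix column by column."""
    rows, cols = len(matrix), len(matrix[0])
    return sum(matrix[r][c] for c in range(cols) for r in range(rows))


if __name__ == "__main__":
    data = [[1.0] * 256 for _ in range(256)]
    variants = {
        "row_major": lambda: sum_row_major(data),
        "column_major": lambda: sum_column_major(data),
    }
    best, timings = autotune(variants)
    print("selected:", best)
```

Because the selection relies on measurements taken on the executing machine, the same program can pick a different variant on different hardware without any change to the application code. Which variant is selected here depends on the machine, so no expected output is given.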
To summarize: parallel processors such as GPUs suffer significantly from bad data access. As many programmers are overwhelmed by the complexity of programming and optimizing code for specific hardware, we developed an auto-tuner in this thesis that helps all kinds of programmers (independent of their skill level) to overcome the obstacles of optimizing GPU code. We set the following goals for the auto-tuner. First, the auto-tuner should optimize array access in NVIDIA GPU applications independent of the used hardware and application domain. A general approach is important to make it available for a wide range of applications and not to limit it to a specific kind of GPU, hardware generation or application domain.
Figure 1.1: Number of scientific auto-tuning publications over the years (x-axis: year, 1998 to 2016; y-axis: publications per year, 0 to 70). An increasing trend can be seen. (Statistics are taken over the papers that we discuss in Chapter 4.)
Second, analyzing the application to find nearly optimal parameters should require only a minimal amount of time. Software that requires hours or days to find suitable optimizations will probably not be adopted by developers, as they cannot afford to wait too long for the auto-tuner during development. Third, the auto-tuner should achieve at least equal, and in the best case higher, performance than hand-optimized code. Fourth, it is supposed to provide performance portability across multiple GPU generations without code adjustments. This ensures that software can be used on different hardware without any manual optimization. Fifth, the effort for the developer to integrate the auto-tuner into the application should be minimal.
1.1 Contributions
To show the applicability of our techniques, we have developed the “MATOG: Auto-Tuning on GPUs” (MATOG) auto-tuner in this thesis. It optimizes the array access and the utilization of the memory hierarchy for NVIDIA Compute Unified Device Architecture (CUDA) applications. The main contributions of this thesis have been published as peer-reviewed papers in a series of international conferences [Weber and Goesele 2014; Weber et al. 2015; Weber and Goesele 2016] and journals [Weber and Goesele 2017].
The first problem we faced was to design an Application Programming Interface (API) that allows MATOG to be integrated into CUDA applications with little effort. Our API is divided into two components. The first mimics the CUDA Driver API, which makes it easy to use existing CUDA code in MATOG. The second uses code generation to create data structures that are tailor-made to the needs of the application [Weber and Goesele 2014].
Next we had to analyze the application and how different data layouts impact the performance. We chose to use empirical profiling, as this yields accurate execution times without the need for models, which could break whenever a new GPU generation is released. However, as MATOG applications can have more than a million different parameter configurations, it is infeasible to perform an exhaustive search. So we developed a specialized three-step analysis method.
In the first step we execute the application. Every time a GPU kernel is executed, we run the kernel in different implementations, measure the time and store the results in a database. For this we developed a specialized prediction method that only needs to measure the time for a small fraction of the possible configurations to estimate the performance of the entire solution space [Weber et al. 2015].
With this data we can determine optimal configurations for each kernel. However, as data is shared between kernels, we have to find configurations that use the same data layouts for shared arrays, so we need a solution that is optimal for the entire application. For this we developed a dependency-graph-based method that puts the kernel executions into relation [Weber and Goesele 2016]. As this graph can be very complex, it was necessary to develop a method that is able to find the optimum of the graph in a short time. We reused knowledge from our prediction method to speed up the processing [Weber and Goesele 2017].
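Why per-kernel optima do not suffice can be seen in a small toy example. The following Python sketch uses made-up timings for two hypothetical kernels that share the array `x` and simply enumerates all layout combinations. It is a brute-force illustration under these assumptions and not the dependency-graph method developed in this thesis, which exists precisely because such an enumeration is infeasible for real applications.

```python
from itertools import product

LAYOUTS = ("aos", "soa")

# Hypothetical measurements in milliseconds. Each kernel lists the arrays it
# accesses and its runtime for every combination of their layouts.
KERNELS = {
    "kernel_a": {
        "arrays": ("x",),
        "times": {("aos",): 4.0, ("soa",): 1.0},
    },
    "kernel_b": {
        "arrays": ("x", "y"),
        "times": {
            ("aos", "aos"): 2.0, ("aos", "soa"): 3.0,
            ("soa", "aos"): 6.0, ("soa", "soa"): 7.0,
        },
    },
}


def total_time(assignment, kernels):
    """Sum of all kernel runtimes for one layout per array."""
    total = 0.0
    for kernel in kernels.values():
        key = tuple(assignment[array] for array in kernel["arrays"])
        total += kernel["times"][key]
    return total


def best_assignment(kernels, layouts=LAYOUTS):
    """Enumerate all layout assignments and return the cheapest one."""
    arrays = sorted({a for k in kernels.values() for a in k["arrays"]})
    best = None
    for choice in product(layouts, repeat=len(arrays)):
        assignment = dict(zip(arrays, choice))
        cost = total_time(assignment, kernels)
        if best is None or cost < best[1]:
            best = (assignment, cost)
    return best


if __name__ == "__main__":
    assignment, cost = best_assignment(KERNELS)
    print(assignment, cost)
```

In this toy data, `kernel_a` alone would prefer the SoA layout for `x`, whereas `kernel_b` strongly prefers AoS. The application-wide optimum therefore assigns AoS to both arrays, for a total of 6.0 ms, and accepts a slower `kernel_a`.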
Having an application-wide optimal solution, however, did not suffice, as optimal data access depends not only on the used algorithms or how the code is written, but also on the actual data. As a result, a single kernel can have different optimal solutions, depending on the data. To handle such effects, we automatically gather meta data during the profiling and use it to construct decision models that can react to data-dependent effects at runtime. For this, MATOG continuously monitors certain parameters and adjusts the data layouts accordingly [Weber and Goesele 2016; Weber and Goesele 2017].
This thesis is mainly based on our fourth paper [Weber and Goesele 2017], but adds additional information and evaluation results. Further, we show results for experiments we have conducted to generate performance models based on automatically gathered profiling and meta data [Amend 2017]. These performance models show promising results but have not made it into the active development of MATOG yet.
1.2 Thesis Outline
We start this thesis by introducing the basics of today's compute hardware, memory types and hierarchies, how these theoretical concepts are implemented in today's hardware, and conclude with different methods to store arrays (Chapter 2). We continue by introducing our target platform (Chapter 3), including the NVIDIA compute model, the used programming language and the differences in the memory hierarchy of the last four NVIDIA GPU generations. As a next step we define the term auto-tuning and discuss methods that have been used by prominent auto-tuners. We continue with a discussion of the different obstacles auto-tuners have to deal with and how these have been addressed in the literature. Finally, we give an overview of the state of the art in auto-tuning. While we mainly concentrate on GPUs in this thesis, we also include papers for other hardware, as the concepts usually stay the same (Chapter 4). Then we introduce the main ideas of MATOG, show programming examples and provide details of the implementation itself (Chapter 5). In Chapter 6 we explain our multi-step application analysis. It starts with profiling the application using our prediction-based algorithm. The gathered data is then analyzed in an offline analysis step utilizing a specialized data and execution dependency graph. The graph is used to model the relation between multiple kernel calls. This allows us to select optimized layouts according to the runtime ratio of the different kernels. Finally, we construct decision models that can be used during runtime to determine optimized configurations according to the current application workload. At the end of the chapter we explain how the MATOG runtime system works and how it gathers meta data during the execution to facilitate the adaptive decision-making. In our evaluation (Chapter 7) we apply MATOG to a series of different benchmark applications, ranging from simple algorithms to very complex applications with changing workload. All tests are performed on 14 GPUs from the last four NVIDIA GPU architectures. Before we discuss our results, we present techniques that could be integrated into MATOG to improve the decision-making (Chapter 8). Specifically, we explore options for automatically generating performance models and assess their usefulness. This work has been performed together with Sandra C. Amend [Amend 2017]. In Chapter 9 we reflect on what can be learned from our experiments. We draw conclusions from our results, summarize our contributions, and reflect on our proposed methods and whether there is room for improvement. Finally, we identify open issues for future research and outline directions that auto-tuning could pursue (Chapter 10).
Chapter 2
Background
This chapter gives an overview of technologies and methods used in this thesis. We start with a high-level view of computers and the definition of important terminology (Section 2.1). These are the foundations for the hardware (Section 2.2) that we target in this thesis. In Section 2.3 we introduce different ways to store data in arrays and how these differ in implementation, resource and computational requirements.
2.1 Computational Basics
First we give an overview of the structure of computers, their components and concepts. These are the foundations for the actual hardware implementations that we explain in Section 2.2 and the target architecture of this thesis (Chapter 3). Most of the information in this section is taken from Patterson and Hennessy [2013]. A computer usually consists of a processor, which performs all calculations and controls the operation of the computer, a main memory that is used to store temporary data, a mass storage device that permanently stores data, and a bus system that interconnects all of these components. There are also other components, such as monitors, keyboards and mice, which we will not introduce further in this thesis. In the following we discuss the mentioned components individually.
2.1.1 Processor Architectures
The main component of a computer is the processor. It consists of Processing Units (PUs), which perform calculations. There are separate PUs for different types of operations, such as integer or floating point calculations. To store intermediate results, every processor has a certain number of registers. A control unit reads the instructions of a program and coordinates the operations of the PUs. The processor can further contain input/output (I/O) interfaces to communicate with other hardware components, such as the system memory. There are mainly three architectural types in use today: Von-Neumann, Harvard and modified Harvard [Patterson and Hennessy 2013, CD 1.7]. The main difference between the Von-Neumann and Harvard architecture is that the first uses the same memory for data and instructions, while the second uses two different memories. The modified Harvard architecture is a hybrid of both that uses the same memory to store data and instructions, but utilizes two separate access paths to the memory. The advantage of the Von-Neumann architecture is its simplicity and that it can interleave programs with data. Harvard, on the one hand, can more easily parallelize the loading of data and instructions but, on the other hand, comes with higher hardware requirements. The modified Harvard architecture tries to combine the advantages of both approaches by removing the instruction memory, but keeping the separate instruction path. Which architecture is used depends on the application of the processor. Figure 2.1 shows the schematics for these architectures.

Figure 2.1: Schematic illustration of the Von-Neumann, Harvard and modified Harvard architecture. The architectures differ in how instructions are stored, either in a separate memory or in the same memory as the data.
2.1.2 Memory Hierarchy and Caches
Memory is characterized by three properties: latency, capacity and bandwidth. Latency is the time between requesting data at a particular memory address and the delivery of the data by the memory. Capacity is the amount of data that can be stored in a memory, and bandwidth (or throughput) is the amount of data that can be transferred in a certain time frame, usually measured in bits per second (Bit/s). Figure 2.2 shows a schematic illustration of the memory hierarchy of a computer. We already mentioned that processors have registers to store intermediate results. These registers are extremely fast and have a very low latency, but are very limited in size. To work on large amounts of data, computers have a main memory that is slower but much bigger. It can be accessed through the system bus of the processor. As data is often used multiple times, caches are used on the data path between main memory and registers to temporarily store data. This allows quick access to data that has been read before, without keeping it in a register. As caches do not only load single data words but entire data lines, they also improve access to neighboring data. Caches can only store a small fraction of the data, as they are significantly smaller than the system memory. Therefore, special strategies are used to ensure that the caches only hold the data that is most likely to be used [Patterson and Hennessy 2013, p. 457]. If the size of the main memory does not suffice, or data is supposed to be stored permanently so that it remains available after powering down the machine, mass storage devices are used. Their capacity is significantly larger than that of the main memory, but they are also significantly slower. To improve the performance, caches can be used between the main memory and the storage device. Overall we see that in the worst case, data has to go through all levels of the memory hierarchy to get to the processor, which can take quite some time as the latencies of all levels add up.

Figure 2.2: Memory hierarchy of a computer, from the registers (Reg.) in the processor over the caches and the main memory (attached via the system bus) down to the mass storage devices (attached via PCIe/SATA). With increasing bandwidth, the latency improves but the capacity decreases [Patterson and Hennessy 2013, p. 454].
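The benefit of loading entire data lines can be illustrated with a toy model. The following sketch simulates a direct-mapped cache and counts the hits for a sequence of byte addresses. The function names, the direct-mapped organization, the 64 byte line size and the eight cache lines used in the usage note are illustrative assumptions and do not describe a specific processor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy direct-mapped cache: counts how many of the given byte addresses hit.
// line_bytes and num_lines are illustrative parameters, not real hardware values.
std::size_t count_cache_hits(const std::vector<std::uint64_t>& addresses,
                             std::uint64_t line_bytes, std::size_t num_lines) {
    assert(line_bytes > 0 && num_lines > 0);
    std::vector<std::uint64_t> tags(num_lines, 0);  // line stored in each slot
    std::vector<bool> valid(num_lines, false);      // slot holds data at all?
    std::size_t hits = 0;
    for (std::uint64_t addr : addresses) {
        const std::uint64_t line = addr / line_bytes;  // memory line of addr
        const std::size_t slot = static_cast<std::size_t>(line % num_lines);
        if (valid[slot] && tags[slot] == line) {
            ++hits;              // the line is already cached
        } else {
            valid[slot] = true;  // miss: the entire line is loaded
            tags[slot] = line;
        }
    }
    return hits;
}

// Generates `count` byte addresses starting at 0 with the given stride.
std::vector<std::uint64_t> strided_addresses(std::size_t count,
                                             std::uint64_t stride) {
    std::vector<std::uint64_t> result;
    result.reserve(count);
    for (std::size_t i = 0; i < count; ++i) result.push_back(i * stride);
    return result;
}
```

Under these assumptions, 64 consecutive 4 byte words touch only four lines, so `count_cache_hits(strided_addresses(64, 4), 64, 8)` reports 60 hits, whereas a stride of 64 bytes places every access in a new line and produces no hits at all.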
2.1.3 Multi-Tasking and Scheduling
On a computer usually multiple applications are running simultaneously (e.g., a browser, text processing or image viewing applications) and each of these applications is executed as a separate process. As all of these have to share the same compute resources, it is necessary to apply a schedule that defines which application is allowed to use the processor at a certain time. Basic scheduling algorithms (e.g., round robin) define a strict schedule of when and for how long an application is allowed to use the compute resources. While this scheduling method is easy to implement, it does not guarantee good performance. Applications can be put on hold while they wait for data, so that others that have their data available can continue their computations. This concept is called latency hiding and is actively used at multiple levels in the processor, not only on the application level. Within the same application it is possible to have multiple active, separate calculations, called threads. In general, switching between different applications produces a certain overhead, as the data in the registers has to be stored in another location so that another application or thread can use the registers for its calculations.
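The strict schedule of round robin can be sketched in a few lines. The following example is a simplified simulation under our own assumptions: all processes are ready at time zero, the time slice (`quantum`) is fixed, and the switching overhead mentioned above is ignored. The function name `round_robin_finish_times` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates round-robin scheduling with a fixed time slice (quantum).
// remaining[i] is the compute time process i still needs. The result holds
// the time at which each process finishes. Switching overhead is ignored.
std::vector<int> round_robin_finish_times(std::vector<int> remaining,
                                          int quantum) {
    assert(quantum > 0);
    std::vector<int> finish(remaining.size(), 0);
    int clock = 0;
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (std::size_t i = 0; i < remaining.size(); ++i) {
            if (remaining[i] <= 0) continue;  // process already finished
            const int slice = remaining[i] < quantum ? remaining[i] : quantum;
            clock += slice;                   // process i uses the processor
            remaining[i] -= slice;
            if (remaining[i] == 0) {
                finish[i] = clock;
            } else {
                work_left = true;             // needs another turn
            }
        }
    }
    return finish;
}
```

For three processes that need 5, 3 and 1 time units and a quantum of 2, the sketch returns the finish times 9, 8 and 5. The shortest process finishes first although it is served last in each round, which illustrates that the schedule is fair but not necessarily fast.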
Figure 2.3: Simple pipeline example with four instructions that each pass through four stages (+0 to +3), plotted over the execution time in cycles (0 to 15) for serial and pipelined execution. The pipelined variant requires only seven cycles, compared to 16 in the purely serial variant.
2.1.4 Processing Improvements
Executing one operation after the other is very costly, as the next calculation has to wait until the previous one is complete. Therefore, Instruction Level Parallelism (ILP) [Patterson and Hennessy 2013, p. 391] is used to better utilize the compute resources. There are two kinds of ILP. First, it is possible to divide operations into a series of stages. Normally, operations (e.g., an addition) require several clock cycles. To better utilize the operation unit, it is fed with new data in every clock cycle. In every following cycle the data is passed to the next stage of the operation. This technique is called pipelining and is illustrated in Figure 2.3.
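The cycle counts of Figure 2.3 follow from a simple calculation, sketched here under the assumption of an ideal pipeline in which every stage takes exactly one cycle and no stalls occur. The function names are our own.

```cpp
#include <cassert>

// Cycles needed when each instruction has to finish all of its stages
// before the next instruction may start.
long serial_cycles(long instructions, long stages) {
    assert(instructions >= 0 && stages > 0);
    return instructions * stages;
}

// Cycles needed with ideal pipelining: the first instruction takes `stages`
// cycles, and every further instruction completes one cycle later.
long pipelined_cycles(long instructions, long stages) {
    assert(instructions >= 0 && stages > 0);
    return instructions == 0 ? 0 : stages + (instructions - 1);
}
```

For four instructions with four stages each, `serial_cycles(4, 4)` gives 16 and `pipelined_cycles(4, 4)` gives 7, matching the numbers stated for Figure 2.3.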
The second method is to provide multiple PUs that can be used in parallel. However, this method requires that the code allows different instructions to be mapped onto the PUs. This mapping can be done statically [Patterson and Hennessy 2013, p. 393] during programming of the application. Very Long Instruction Word (VLIW) processors require this kind of program. Other processors (called superscalar) do this mapping on-the-fly by analyzing the code before execution [Patterson and Hennessy 2013, p. 397]. Figure 2.4 shows example code that can be executed on two different PUs in parallel. Of course, both ILP methods can also be combined.
2.1.5 Multi-Processing
Another method to improve the performance of a computer is multi-processing. Its goal is to further increase parallel computations, but on an even higher level.
Serial:
int a = 1;
a = a + 5;
int b = 2;
b = b + 7;
int c = a + b;
c = c + 3;

Instruction parallelized:
PU0            PU1
int a = 1;     int b = 2;
a = a + 5;     b = b + 7;
int c = a + b;
c = c + 3;

Figure 2.4: Simple superscalar ILP example. The serial code (left), its corresponding data dependency graph (DDG, center) and how this code is executed in a reordered ILP manner (right). As can be seen, the first operations on a and b are independent of each other and can be processed in parallel. Starting with the initialization of variable c, the execution depends on the results of a and b.
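How a data dependency graph limits the achievable parallelism can be sketched with a small greedy scheduler. The example below is an illustration under our own assumptions: every instruction takes exactly one cycle, and in each cycle the scheduler issues the first ready instructions in program order. It is not meant to describe how a real superscalar processor or VLIW compiler works internally, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// deps[i] lists the instructions whose results instruction i needs.
// Returns the number of cycles a greedy scheduler needs on `units` PUs,
// assuming that every instruction takes exactly one cycle.
int schedule_cycles(const std::vector<std::vector<int>>& deps, int units) {
    assert(units > 0);
    const std::size_t n = deps.size();
    std::vector<int> done_cycle(n, -1);  // cycle in which instruction i ran
    std::size_t finished = 0;
    int cycle = 0;
    while (finished < n) {
        int issued = 0;
        for (std::size_t i = 0; i < n && issued < units; ++i) {
            if (done_cycle[i] != -1) continue;  // already executed
            bool ready = true;
            for (int d : deps[i]) {
                // A dependency must have completed in an earlier cycle.
                const int dep_cycle = done_cycle[static_cast<std::size_t>(d)];
                if (dep_cycle == -1 || dep_cycle >= cycle) ready = false;
            }
            if (ready) {
                done_cycle[i] = cycle;
                ++issued;
                ++finished;
            }
        }
        assert(issued > 0 && "dependency graph must be acyclic");
        ++cycle;
    }
    return cycle;
}

// Data dependency graph of the six instructions in Figure 2.4.
std::vector<std::vector<int>> figure_2_4_dependencies() {
    return {
        {},      // 0: int a = 1;
        {0},     // 1: a = a + 5;
        {},      // 2: int b = 2;
        {2},     // 3: b = b + 7;
        {1, 3},  // 4: int c = a + b;
        {4},     // 5: c = c + 3;
    };
}
```

With one PU the six instructions need six cycles, with two PUs they need four. A third PU would not help, as the longest dependency chain (a, a + 5, c, c + 3) already has four instructions.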
There are four different processing types categorized by Flynn's Taxonomy [Flynn 1966]. The first category describes serial processors as Single Instruction, Single Data (SISD) processors. This also includes processors that utilize ILP. In some applications it is possible to process data in a vector fashion. Here, the same operation is applied to a set of data and therefore it is classified as Single Instruction, Multiple Data (SIMD). The third option is to use multiple processors that can operate separately on the data, called Multiple Instruction, Multiple Data (MIMD). Flynn's Taxonomy also specifies Multiple Instruction, Single Data (MISD), but there is no known implementation today, as it can just be implemented using a MIMD architecture. Figure 2.5 shows an example of the different architectures. The previously described multi-tasking can easily be realized with the MIMD pattern, as a process or task can be mapped onto a single processor. In contrast, it is not possible to map this onto SIMD processors, as they can only handle one instruction at a time.
Among the biggest problems of parallel processing (besides developing suitable algorithms that leverage enough parallelism to be efficient) are race conditions, deadlocks and hazards. Race conditions are timing problems that can make the outcome of an algorithm appear random. For example, if an algorithm is supposed to count the occurrences of a specific value in a list, a processor has to read the current count, increase the value and write it back. If another processor updates the value between the read and the write of the first processor, the result will be wrong. Deadlocks appear whenever parallel threads exclusively reserve resources that are also required by other threads. For example, if two threads require two resources (A and B) and the first grabs A while the second takes B, both wait for the other resource to be released, which will never happen. Hazards occur when data is accessed in parallel. These may or may not be problematic. For example, when multiple threads write data to the same memory cell, it is undetermined which data will be stored in the end. However, if all threads write the same value, it does not matter as the result is always the same. To prevent hazards efficiently, atomic operations can be used to enforce certain constraints, e.g., to store the maximum value. Cases where data is read from a cell that was previously overwritten by another thread can be problematic. Without explicit synchronization it is not guaranteed that all threads read the new value, as this depends on the execution order on the device.

Figure 2.5: Example for a SISD, SIMD and MIMD architecture, showing control units, processors (P), instruction and data paths, input/output and the interconnect. MIMD architectures do not necessarily have to consist of SISD units, but could also consist of SIMD units.
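The two examples from the text, counting occurrences and storing the maximum value, can be expressed with atomic operations. The following is a minimal sketch using the C++ standard library on the CPU; GPU programming models offer comparable atomic operations, which are not shown here. The function names are our own.

```cpp
#include <atomic>
#include <cassert>

// Atomically raises `target` to `value` if `value` is larger.
// The compare-exchange loop retries whenever another thread changed
// `target` between our read and our write, so that no update is lost.
void atomic_store_max(std::atomic<int>& target, int value) {
    int current = target.load();
    while (current < value &&
           !target.compare_exchange_weak(current, value)) {
        // On failure `current` holds the latest value and is checked again.
    }
    // Holds as long as all writers only ever raise the stored value.
    assert(target.load() >= value);
}

// Atomically increments a shared counter, e.g., when counting occurrences.
// Read, increment and write back happen as one indivisible step.
void atomic_count(std::atomic<int>& counter) {
    counter.fetch_add(1);
}
```

Both functions may be called from several threads at the same time on the same variable. In contrast to a plain read, modify and write sequence, no update can be overwritten by a concurrent one.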
2.1.6 Performance Classification
So far we have introduced how computers work and some techniques that are used to improve the performance through parallelism. To be able to compare the performance of different processors, it is necessary to find a suitable metric. Simple processors without any ILP and multi-processing capabilities, where each instruction requires exactly one clock cycle, can be compared by their clock frequency. However, with all of the improvements described above, this measure is not sufficient. Therefore, today measures such as Cycles Per Instruction (CPI), Instructions Per Second (IPS) or Floating Point Operations Per Second (FLOPS) [Patterson and Hennessy 2013, p. 70] are used, depending on the application of the processor. To calculate the improvement (also called speedup $S$) of a specific application on different processors, or to determine the improvement achieved through parallelization of code, the ratio between the original execution time ($T_{\text{original}}$) and the optimized or parallelized execution time ($T_{\text{optimized}}$) can be calculated:

\[ S = \frac{T_{\text{original}}}{T_{\text{optimized}}} \tag{2.1} \]

Figure 2.6: Parallel search for the maximum value of a list with 16 numbers on four processors (Processor0 to Processor3).
Unfortunately, applications can never be entirely parallelized, so their execution time always consists of a serial ($T_{\text{serial}}$) and a parallel ($T_{\text{parallel}}$) fraction, with $p$ as the number of parallel processors:

\[ T_{\text{total}} = T_{\text{serial}} + \frac{T_{\text{parallel}}}{p} \tag{2.2} \]
Additionally, it is common that parallel algorithms require more operations than serial algorithms. This is necessary as parallel algorithms often have to aggregate the results across the parallel processors. For example, to find the maximum value of a list on a serial processor, all elements have to be processed, which results in a linear complexity of O(n). Figure 2.6 shows a multi-processor and how a parallel algorithm would process the elements. As can be seen, it requires significantly fewer processing steps (5 vs. 16), but it is only 3.2x faster and not 4x as a perfect speedup would suggest.
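Equations 2.1 and 2.2 translate directly into code. The sketch below uses our own function names; the second set of numbers in the usage note is made up for illustration only.

```cpp
#include <cassert>

// Equation 2.1: ratio of the original to the optimized execution time.
double speedup(double t_original, double t_optimized) {
    assert(t_optimized > 0.0);
    return t_original / t_optimized;
}

// Equation 2.2: the serial part plus the parallel part shared by p processors.
double total_time(double t_serial, double t_parallel, int p) {
    assert(p > 0);
    return t_serial + t_parallel / p;
}
```

For the maximum search, `speedup(16, 5)` yields the 3.2x stated above. As a made-up second example, an application with 2 time units of serial and 8 time units of parallelizable work needs `total_time(2, 8, 4)`, i.e. 4 time units, on four processors, which is a speedup of only 2.5 over the 10 time units on a single processor.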
2.1.7 Performance Limitations
Overall we can see that the performance of an application can be limited by three factors:
1. By latency, if the application has to wait too long for data to arrive at the processor.
2. By bandwidth, if the memory is unable to provide enough data in time.
3. By computation, if there are not enough computational resources or the computation itself depends on too many intermediate results.
2.2 Hardware Implementations
In this section, we dive deeper into the actual implementation of processors, memory and storage devices. The processor of a computer is called CPU, which also describes the type of processor. CPU cores are optimized for serial processing performance. To utilize parallelism, processors can be placed in separate chips and interconnected by a bus, which is called a multi-CPU system, or multiple processors can be placed into the same chip, which is called a multi-core CPU. Every sub-processor in such a CPU is called a core. The on-chip memory hierarchy (registers and caches) of CPUs usually has multiple layers of different caches. The system memory (called Random Access Memory (RAM)) is traditionally placed in a separate hardware component. Depending on the application, RAM can also be embedded directly into the same chip together with the processor. This not only reduces the energy consumption but also the signal latency. As today's RAM is usually manufactured as Dynamic RAM (DRAM), temporary data is lost as soon as the power to the device is turned off. Therefore, mass storage devices are used to permanently store data. Traditionally this task was performed by Hard Disk Drives (HDDs), but in recent years these are more and more replaced by Solid State Drives (SSDs). The difference is that HDDs store their data on magnetic disks while SSDs store the data in non-mechanical memory chips. This property makes SSDs more robust against physical damage, reduces their weight and size, and significantly improves the access performance. However, there are concerns that the life cycle of SSDs is shorter than that of HDDs because their memory cells deteriorate with every write operation. Experiments conducted by Gasior [2014] showed that even consumer SSDs can survive over 2 PB of written data. The reason for the high performance of SSDs compared to HDDs is that SSDs can directly access their memory cells, whereas HDDs rely on physical read heads for the magnetic disks that need to be repositioned to access a memory cell. Despite the advantages of SSDs, HDDs are still used because of their high storage capacity and lower prices. There are also other techniques to permanently store data directly inside the RAM, called Non-volatile RAM (NVRAM). NVRAM is a categorical term describing various technologies, e.g., Static RAM (SRAM) combined with a battery, Ferroelectric RAM (FeRAM), Magnetoresistive RAM (MRAM) or Phase-change RAM (PCRAM). These technologies are still very expensive and hardly used in today's computer systems. The latency, capacity and bandwidth of these components differ greatly, as shown in Table 2.1.

Component  Model                Capacity       Latency   Bandwidth    Reference
CPU        Intel I7-6770        few kB         4-49 ns   -            [7-CPU 2016]
RAM        DDR4-2400 CL17       1-16 GB        14.17 ns  18.75 GB/s   [CRUCIAL 2015]
SSD        Samsung 960 Pro M.2  512 GB - 2 TB  21.9 µs   2.25 GB/s    [Armstrong 2016]
HDD        WD4001FAEX           4 TB           6.62 ms   148.55 MB/s  [Tom's Hardware 2017]

Table 2.1: Exemplary properties of a CPU, RAM, SSD and HDD. As can be seen, with increasing capacity the latency significantly increases while the bandwidth decreases.
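To give a feeling for what the values in Table 2.1 mean in practice, the following sketch estimates the time to fetch a block of data as one access latency plus the transfer time at the given bandwidth. This is a deliberately simple model of our own that ignores caches, queuing and access patterns, and it assumes 1 MB = 10^6 bytes.

```cpp
#include <cassert>

// Estimated time in seconds to fetch `bytes` from a device: one access
// latency plus the pure transfer time at the given bandwidth.
double access_time_s(double latency_s, double bandwidth_bytes_per_s,
                     double bytes) {
    assert(bandwidth_bytes_per_s > 0.0 && bytes >= 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}
```

With the HDD values from Table 2.1, `access_time_s(6.62e-3, 148.55e6, 1e6)` gives about 13.4 ms for 1 MB, while the SSD values, `access_time_s(21.9e-6, 2.25e9, 1e6)`, give about 0.47 ms. For small requests the latency dominates, for large requests the bandwidth.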
2.2.1 Bus Systems
In order to connect all components of a computer (e.g., HDD, RAM and CPU) an interconnection bus is required. The main bus of a computer is the system bus. It usually connects the CPU with the RAM. Further, it comes with an I/O component to which other buses can be attached. Depending on the topology of the computer, e.g., if it is equipped with multiple processors, the system bus also connects the different CPUs (e.g., using Intel's QuickPath Interconnect (QPI) [INTEL 2009]). Other components are usually connected using the Peripheral Component Interconnect Express (PCIe) [PCI-SIG 2010] bus. It was introduced in 2004 and is currently available in its third revision. PCIe uses lanes, which can also be bundled to transfer more data in parallel. Up to 32 lanes are possible, although at most 16 lanes are used today, leading to up to 15.75 GB/s¹ total bandwidth. PCIe can be used to connect all kinds of components. A more specialized bus is the Serial AT Attachment (SATA) bus that is used for HDDs and SSDs. SSDs can also be attached via PCIe to benefit from the higher bandwidth. Figure 2.7 shows an illustration of a multi-processor topology.
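One way to arrive at the 15.75 GB/s figure is sketched below. It assumes the third PCIe revision with 8 GT/s per lane and a 128b/130b line encoding, 16 lanes, and 8 bit per byte. The function and parameter names are our own.

```cpp
#include <cassert>

// Usable bandwidth of a PCIe link in GB/s.
// giga_transfers_per_s: raw rate per lane in GT/s.
// payload_bits / total_bits: line encoding, e.g., 128 / 130.
double pcie_bandwidth_gb_s(double giga_transfers_per_s, int lanes,
                           int payload_bits, int total_bits) {
    assert(lanes > 0 && payload_bits > 0 && total_bits >= payload_bits);
    const double gbit_per_lane =
        giga_transfers_per_s * payload_bits / total_bits;
    return gbit_per_lane * lanes / 8.0;  // 8 bit per byte
}
```

`pcie_bandwidth_gb_s(8.0, 16, 128, 130)` evaluates to approximately 15.75, while a single lane provides slightly less than 1 GB/s under the same assumptions.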
2.2.2 Processor Performance
Traditionally the speed of a hardware component is defined by its clock frequency. To increase the performance it is possible to raise the clock frequency. In the past this was usually limited by the feature size of the manufacturing process of the processors. If the feature size of the chip is too big and the frequency too high, there is not enough time for the electrons to travel through the circuits before the next cycle starts. Electrons in silicon travel at a maximum of $10^7$ cm/s (known as velocity saturation) [Yu and Cardona 2010, p. 226], so at a frequency of 3 GHz they can only travel up to 33.33 µm until the next clock cycle. This also means that the smaller a transistor is, the faster it can operate. Therefore, the clock frequency was increased every time manufacturers were able to produce chips with a smaller feature size. Moore [1965] predicted that the number of transistors would double approximately every two years. This increase would improve the processor performance at the same rate. Since its proclamation in 1965, Moore's law has been more or less accurate. With more and faster transistors, manufacturers have been able to constantly improve the performance of processors.

¹ \( \underbrace{8\,\mathrm{GT/s}}_{\text{per lane}} \cdot \underbrace{16}_{\text{lanes}} \cdot \underbrace{\frac{128\,\mathrm{bit}}{130\,\mathrm{T}}}_{\text{encoding}} \cdot \frac{1\,\mathrm{B}}{8\,\mathrm{bit}} = 15.75\,\mathrm{GB/s} \)

Figure 2.7: Illustration of the bus topology of a two-processor system with four RAM modules and some additional components such as storage drives, controllers and accelerators.
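The travel distance stated for a 3 GHz clock can be reproduced with a short calculation, sketched here with our own function name. It uses the saturation velocity of 10^7 cm/s from the text and multiplies it by the duration of one clock cycle.

```cpp
#include <cassert>

// Maximum distance in micrometers that can be covered within one clock
// cycle, given the carrier velocity in cm/s and the clock frequency in Hz.
double max_distance_um(double velocity_cm_per_s, double frequency_hz) {
    assert(frequency_hz > 0.0);
    const double cycle_time_s = 1.0 / frequency_hz;
    const double distance_cm = velocity_cm_per_s * cycle_time_s;
    return distance_cm * 1.0e4;  // 1 cm = 10^4 micrometers
}
```

`max_distance_um(1.0e7, 3.0e9)` evaluates to roughly 33.33 µm. Doubling the frequency halves this distance, which is why higher clock rates require smaller structures.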
However, in the last decade the advances in shrinking the feature size have slowed down, as the manufacturing processes reached physical limitations [Berkeley 2014]. There are multiple reasons for this limitation. First, it became more and more difficult to create methods that are able to produce small enough structures in the silicon. The second reason is heat. Every time a transistor is switched it consumes energy, which produces heat. With increasing clock frequency the transistors are switched more often, creating more heat in less time. To compensate for this increased heat, the voltage that is driving the transistors has to be reduced. However, as components require a certain minimum voltage, it cannot be reduced indefinitely. Another phenomenon is electric leakage. This is an electric current that is lost even when an electric component is not actually switched. The leakage increases with smaller feature sizes. Methods as presented by Zhang et al. [2005] significantly reduce this leakage. However, the development of new techniques to shrink the feature size and to reduce effects such as the electric leakage has slowed down in recent years. More information on the topic can be found in Ahmed and Schuegraf [2011]. We can summarize that the methods that used to drive the development no longer work and other solutions have to be found.
2.2.3 Caches for Parallel Processing
Before we introduce different parallel processor implementations, we take a closer look at caches in parallel processors. As previously mentioned, caches are used to store data directly in the processor for faster access. In parallel processor systems, multiple processing units access the same RAM and therefore utilize the same caches. To achieve high performance, cache hierarchies are used. There are multiple levels of caches, where some of the caches only serve a single core, while others serve a group or all cores. Every level has to be kept synchronized with the next higher level to ensure that processors work on the correct data. The higher a cache is in the hierarchy, the bigger its size and latency. Depending on the processor the number of cache levels can differ. Today two or three levels are normally used. Figure 2.8 shows an example cache hierarchy used in today's processors. If the left processor changes a value and stores the result in its L1 cache, the result has to be communicated to the other cores, otherwise they will use outdated data. There are many ways to ensure synchronization, depending on the processor's purpose, implementation and features used [Patterson and Hennessy 2013, pp. 534].
2.2.4 Parallel Processors and Accelerators
The meaning of the term MIMD is very wide, as it describes not only multi-processor and multi-core systems but also any interconnected compute cluster. It is therefore difficult to pinpoint an exact date when the first MIMD devices appeared. One of the first articles about multi-processing was published by Krajewski [1985, pp. 171-181]. In 2005 the first consumer multi-core processors were introduced [INTEL 2005], where multiple processor cores have been put onto the same chip. One problem of processors is thread switching, which can be costly. Simultaneous Multithreading (SMT) is a solution for this. It provides multiple register sets that can be switched efficiently without copying the data to another memory. One implementation is Intel's HyperThreading [INTEL 2002] technology.
The first implementations of SIMD instructions for consumer CPUs were Intel's Multi Media Extension (MMX) [INTEL 1997] and AMD's 3DNow! [AMD 2000]. Today's CPUs support AVX with up to 512 bit wide operations [Reinders 2013]. To use these SIMD capabilities, the code has to explicitly use the corresponding SIMD instructions. Unfortunately, not all CPUs support all techniques and commands, so it is necessary to check which instructions are supported. Luckily, projects such as the Intel Single-Program Multiple-Data Program Compiler (ISPC) [Babokin and Brodman 2016] try to automatically parallelize code for SIMD CPUs. However, even with multi-core and SIMD instructions, CPUs are still mostly tuned for serial performance and support only few parallel workloads. In the mid-90s GPUs [Glatter 2015] were introduced. However, their design was very crude compared to today's GPUs. Initially they were solely designed for 3D rendering. Nevertheless, their design evolved and today they are massively parallel processors with up to several thousand cores in a single chip. In some early developments [Buck et al. 2004] the rendering pipeline of GPUs was misused to accelerate certain procedures, e.g., matrix multiplications. Later, easier programming languages such as CUDA emerged, which we will introduce in Section 3.1.

Figure 2.8: Schematic view of a 3-layer cache hierarchy in a multi-core setup with two CPUs. As can be seen, the L1 and L2 caches are placed in the cores, so that when one core changes a value, the change has to be propagated to the other cores to ensure consistency. The same applies to the two CPUs.
Other approaches include the Intel Xeon Phi [INTEL 2013b], which puts up to 72 cores onto a chip and is a mixture of a massively parallel GPU and a multi-core CPU, also called Many Integrated Core (MIC) architecture. In general, manufacturers have to find a trade-off between serial processing, leading to fewer but faster cores, and parallel processing, resulting in slower but many more cores.
Further, there are many different ways to interconnect processors. Multi-CPU systems are usually interconnected by buses such as Intel's QPI. Accelerators usually operate as slave devices in workstation or server computers, connected using PCIe. On top of this, these computers can be interconnected to clusters by different types of network adapters, based on copper or fiber cables. Interconnects such as InfiniBand are optimized for high bandwidth and low latency [InfiniBand 2016] and are usually used in computer clusters. These again can be built as tightly interconnected clusters or as loosely coupled compute clouds [Buyya et al. 2009].
However, parallel processing and higher clock frequencies are not the only ways to speed up computations. For certain application domains, specialized processors such as Digital Signal Processors (DSPs), the Epiphany-V (a 1024-core processor) [Olofsson 2016] or Google's TPU [Jouppi 2016] exist that are specially designed to operate efficiently on the operations needed for these applications. They usually provide specialized hardware processing units. Sometimes, if the application does not change, e.g., in video de-/encoding, the algorithm itself is put into hardware (so-called Application Specific Integrated Circuits (ASICs)), without any means of altering it after it has been manufactured. This results in the best possible performance per energy consumption ratio, but bears the danger of implementation errors on the chip, which can only be fixed by replacing the entire component. More general are Field Programmable Gate Arrays (FPGAs), which can be reconfigured but also translate the program they execute directly into hardware. Although they usually have a very low clock frequency compared to CPUs, they can achieve much higher performance in specialized applications or when non-standard variable types are used. Many studies have been performed in different application fields, with varying conclusions on whether an FPGA, CPU or GPU performs better [Chase et al. 2008; Papadonikolakis et al. 2009; Pauwels et al. 2011]. Except for CPUs, processors are usually designed to function as slaves. This prevents them from operating on their own, so they must be controlled by a master CPU.
2.2.5 Graphics Processing Units
As the main focus of this thesis is on GPUs, we take a deeper look into their implementation. GPUs have been specifically designed to run thousands of calculations in parallel. Therefore, their cores are much simpler than those of CPUs, with fewer features and a lower clock frequency, but their high number of processing cores compensates for this. The advantage of GPUs is that, because of their SIMD architecture, groups of cores perform the same operation in parallel, so that not every single core requires its own control infrastructure. Instead the cores can be grouped together in so-called SIMD groups, as all of them are supposed to execute the same operation. GPUs can be integrated into a computer in different ways. Traditionally they are extension cards that are attached either using PCIe or with the recently introduced NVLink [NVIDIA 2014c] bus, which provides higher bandwidth than PCIe but requires special support from the CPU. So far only IBM's Power processors [Gupta 2016] are announced to support NVLink. Another option is to put the GPU directly onto the mainboard (usually referred to as “on-board GPU”). In these cases the GPU is still attached to the CPU using PCIe.
Due to the low bandwidth of PCIe compared to the bandwidth of the system memory, PCIe attached GPUs are equipped with their own memory. This requires explicitly copying data between the system memory and the GPU. This copy can become a major bottleneck in many applications. Manufacturers such as Intel or AMD therefore provide CPUs with directly integrated GPUs. This allows the GPUs to be directly attached to the system bus and to access the main memory. This is, e.g., done in AMD's Accelerated Processing Unit (APU) [Gaster and Howes 2011]. However, in these combinations CPU and GPU share the same chip and usually provide only limited compute performance, as both produce heat on the chip. AMD wants to remove these limitations in their recently proposed Exascale Heterogeneous Processor (EHP) [Vijayaragavan et al. 2017] that aims at an even tighter coupling of cache coherent CPU and GPU cores, using fast on-chip and slow but bigger off-chip memory, which can be accessed by all CPU and GPU cores of the processor.
Today there are different types of GPUs available. Intel focuses mainly on low-end and multimedia GPUs that require only a small amount of energy and are therefore especially suited for low-power and mobile systems. The same applies to the Mali GPUs from ARM, which are specifically tailored to smartphone applications. These GPUs provide only limited compute capabilities. Matrox mainly provides GPUs for multi-display setups in professional environments with advanced features, low energy consumption and high reliability. AMD is mainly established in low-end and gaming GPUs. The latter type aims at providing high performance for real-time 3D rendering applications. To expand into the High Performance Computing (HPC) market, they recently released the Radeon Instinct GPUs [Hook and Graves 2016]. NVIDIA tries to provide GPUs for the entire market, from low-end (GT-series) over gaming (GTX-series) and professional (Quadro-series) up to HPC (Tesla-series). The GT-series is meant for multimedia applications and comes with very limited compute capabilities, requiring only a low amount of energy. The GTX-series is usually optimized for high single precision float performance. The Quadro-series aims at providing professional features comparable to the GPUs from Matrox, such as advanced multi-display support or GPUDirect, which enables other devices connected to the PCIe bus to directly access the memory of a GPU. The Tesla-series aims at high performance, providing excellent performance for single and double precision float computation. However, Tesla GPUs usually do not have any display ports and therefore can only be used as compute accelerators, which cannot be attached to a monitor.
2.2.6 Memory Types
As we have already mentioned, there are many different types of memory built into a computer, ranging from registers and caches over RAM up to storage devices. Over time there have been significant changes to how RAM has been implemented. Traditionally Synchronous Dynamic RAM (SD-RAM) was used in computers, transferring one data word at the positive clock edge over a 64 bit interface. This was later improved by Double Data Rate (DDR) SD-RAM, which transfers data not only on the positive but also on the negative clock edge. The memory loads two data words into an I/O buffer and then transfers the data. This method is called 2n-prefetch. Today the fourth revision of DDR is used, which still uses a 64 bit interface but an 8n-prefetch, resulting in much higher external clock frequencies. Figure 2.9 shows an illustration of the differences between SD-RAM and DDR. As GPUs are massively parallel processors with hundreds or thousands of active cores, specialized Graphics DDR (GDDR) has been developed, which is a modified version of DDR memory with extended bus width. Today GDDR5 with 8n-prefetch or GDDR5X with 16n-prefetch are used. Further, High Bandwidth Memory (HBM) [AMD 2015] has been proposed as successor for GDDR. For HBM, multiple DRAM modules are stacked on top of each other. To transfer the data to the processor, a wide memory interface is used, significantly wider than for DDR or GDDR modules. As HBM is currently still expensive to manufacture, it is mainly used in high-end gaming (e.g., AMD Radeon R9 Fury Series [Macri 2015]) or HPC GPUs (e.g., NVIDIA Tesla P100 [NVIDIA 2016d]).
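The relation between transfer rate, interface width and the resulting peak numbers can be sketched as follows. The example applies it to the DDR4-2400 CL17 module from Table 2.1, i.e. 2400 million transfers per second over a 64 bit interface with a CAS latency of 17 cycles. The function names are our own, and the comparison with the table values assumes that 1 GB is counted as 1024 MB there, which is our interpretation.

```cpp
#include <cassert>

// Peak transfer rate in MB/s: transfers per second times bytes per transfer.
double peak_bandwidth_mb_s(double mega_transfers_per_s, int bus_width_bits) {
    assert(bus_width_bits > 0 && bus_width_bits % 8 == 0);
    return mega_transfers_per_s * (bus_width_bits / 8);
}

// CAS latency in nanoseconds: latency cycles divided by the I/O clock.
// With double data rate, the I/O clock in MHz is half the MT/s figure.
double cas_latency_ns(int cas_cycles, double mega_transfers_per_s) {
    assert(cas_cycles > 0 && mega_transfers_per_s > 0.0);
    const double io_clock_mhz = mega_transfers_per_s / 2.0;
    return cas_cycles / io_clock_mhz * 1000.0;  // cycles / MHz = microseconds
}
```

`peak_bandwidth_mb_s(2400.0, 64)` gives 19200 MB/s, which corresponds to 18.75 GB/s under the stated assumption, and `cas_latency_ns(17, 2400.0)` gives about 14.17 ns. Both agree with the RAM row of Table 2.1.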
2.3 Array Layouts
Arrays are one of the most important building blocks in applications. As hardware architectures use certain cache hierarchies, strategies, memory technologies and buses, it is important to optimize the access to the data in an array. Arrays can be one- or multi-dimensional and consist of scalar or multiple data fields or even contain sub-arrays. Therefore, the structure of an array can have a significant impact on the performance if it is used in an inadequate way.
2.3.1 Multidimensional Indexing
Every multi-dimensional array has to be stored in a linearized manner. The way this is done can be arbitrary as long as every entry is mapped to one unique index.
Figure 2.9: Illustration of how SD-RAM and DDR work. DDR is shown for 2n-prefetch, other modes work the same way. The SD-RAM only transfers one data package per clock cycle. In contrast, DDR loads two data packages into an I/O buffer and transfers one of these at each clock edge.
Figure 2.10: From left to right: two different linear transpositions, z-order curve and triangular matrix.
Usually this is done by a simple linearization such as x + y · |x|, where |x| denotes the extent of the x dimension, as it is easy and fast to compute. However, in some applications more complex schemes such as the z-order curve [Morton 1966] yield better results. Figure 2.10 shows some examples of different indexing schemes. For simple linearizations of the form x + y · |x|, the number of possible combinations is given by the factorial of the number of dimensions, as the dimensions can be ordered arbitrarily. A 5D array therefore has 5! = 120 different linearizations.
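The indexing schemes described above can be sketched as follows. This is a minimal host-side illustration, not the implementation used in this thesis. All function names are hypothetical, and the z-order variant is restricted to 2D coordinates below 2^16 as a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simple linearization x + y * |x|: x runs fastest, width is the extent of x.
std::size_t linearIndex(std::size_t x, std::size_t y, std::size_t width) {
    assert(x < width);
    return x + y * width;
}

// Spread the lower 16 bits of v so that a zero bit separates neighboring bits.
std::uint32_t spreadBits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// 2D z-order (Morton) index: bits of x on even, bits of y on odd positions.
std::uint32_t mortonIndex(std::uint32_t x, std::uint32_t y) {
    return spreadBits(x) | (spreadBits(y) << 1);
}

// Number of simple linearizations of an array with 'dims' dimensions: dims!.
std::uint64_t countLinearizations(unsigned dims) {
    std::uint64_t n = 1;
    for (unsigned d = 2; d <= dims; ++d) n *= d;
    return n;
}
```

For example, linearIndex(2, 1, 4) maps the entry in column 2 of row 1 of a 4-wide array to index 6, while mortonIndex keeps the four entries of every 2x2 tile next to each other in memory.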
2.3.2 Struct Layouts
Another type of array contains structures instead of primitive types. These are called Array of Structs (AoS). In an AoS, data that belongs to the same index is stored in one block. For parallel applications this works well when each thread accesses the entire struct at the same moment. It performs badly when only one of the components is used at a time. In this case, a Structure of Arrays (SoA) performs better. Here, every component is stored in a separate array. This is also the format that the NVIDIA Programming Guide [NVIDIA 2016a] recommends. However, one disadvantage of the SoA is that its implementation either requires more registers, as the root pointers to each sub-array have to be stored, or the root pointers have to be explicitly recalculated every time the array is accessed. In general, SoA requires more registers than an equivalent AoS. If memory access is the bottleneck, this additional resource or compute overhead can be smaller than the benefit gained from the improved memory utilization.
Figure 2.11: Different ways to store the data of a struct array with three child items (orange, green and blue). The AoS stores all components of the same index next to each other. In contrast, the SoA groups all elements of the same component together. AoSoA is a hybrid that stores small groups of elements in an SoA way. SoAoS stores parts of the array as AoS and others as SoA.
Further, there are hybrid formats such as Array of Structure of Arrays (AoSoA) (sometimes also referred to as tiled-AoS [Kofler et al. 2015] or Array-of-Structure-of-Tiled-Arrays (ASTA) [Sung et al. 2012]), which combine AoS and SoA. In AoSoA, small tiles of data are stored in an SoA style, while these tiles themselves are organized in an AoS way. The size of the tiles is application dependent, but for GPUs a size equal to the SIMD group size has proved to work well in many scenarios. One disadvantage of AoSoA is that memory is wasted in the last part of the array if the number of elements is not divisible by the tile size. Changing the layout of an AoS is not the only transformation that can be applied. The AoS itself can also be divided into different arrays, where each of the resulting arrays can be stored in a different layout. Peng et al. [2016] refer to this as Structure of Array of Structures (SoAoS). They take an AoS and store one part as SoA and the other as AoS. Figure 2.11 shows how the data is stored for the mentioned layouts. Further, Listing 2.1 shows the differences in register and computational requirements for AoS, SoA and AoSoA.
/****************************** AoS ******************************/
// Registers:  3 (array, index, address)
// Operations: 3 (2x add., 1x mult.)
struct AoS {                        | address = index * sizeof(AoS);
    int A, B, C, D;                 | address = address + offsetof(AoS, C);
} *array;                           | address = address + array;
array[index].C;                     |

/************************* SoA, variant A ************************/
// Registers:  6 (array.{A,B,C,D}, index, address)
// Operations: 2 (1x add., 1x mult.)
struct SoA {                        | address = index * sizeof(int);
    int *A, *B, *C, *D;             | address = address + array.C;
} array;                            |
array.C[index];                     |

/************************* SoA, variant B ************************/
// Registers:  4 (array, index, count, address)
// Operations: 4 (2x add., 2x mult.)
int* array;                         | address = count * 2;
array[index + 2 * count];           | address = address + index;
                                    | address = address * sizeof(int);
                                    | address = address + array;

/************************ AoSoA (2-tiles) ************************/
// Registers:  4 (array, index, address, temp)
// Operations: 7 (3x add., 2x mult., 1x div., 1x mod.)
struct AoSoA {                      | address = index % 2;
    int A[2], B[2], C[2], D[2];     | address = address + 4;
} *array;                           | address = address * sizeof(int);
array[index / 2].C[index % 2];      | temp    = index / 2;
                                    | temp    = temp * sizeof(AoSoA);
                                    | address = address + temp;
                                    | address = address + array;
Listing 2.1: Example for AoS, SoA (in two different implementations) and AoSoA. Left: Description of the data layout and an exemplary access to the item array[index].C. Right: Necessary calculations to acquire the address of the item. Registers and operations indicate how many registers and calculations are necessary to calculate the address.
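The address calculations of the three layouts can also be written as plain offset functions. The following host-side sketch assumes a struct with four int components as in Listing 2.1 and returns offsets in units of int. The function names and the free choice of the tile size are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kComponents = 4; // A, B, C, D as in Listing 2.1

// AoS: all components of one element are adjacent.
std::size_t offsetAoS(std::size_t index, std::size_t component) {
    return index * kComponents + component;
}

// SoA (variant B of Listing 2.1): one contiguous sub-array per component.
std::size_t offsetSoA(std::size_t index, std::size_t component, std::size_t count) {
    assert(index < count);
    return component * count + index;
}

// AoSoA: tiles of tileSize elements, stored in an SoA style inside each tile.
std::size_t offsetAoSoA(std::size_t index, std::size_t component, std::size_t tileSize) {
    const std::size_t tile = index / tileSize;
    const std::size_t lane = index % tileSize;
    return tile * kComponents * tileSize + component * tileSize + lane;
}

// Number of ints an AoSoA allocation needs; the last tile is padded.
std::size_t sizeAoSoA(std::size_t count, std::size_t tileSize) {
    const std::size_t tiles = (count + tileSize - 1) / tileSize;
    return tiles * kComponents * tileSize;
}
```

For component C (index 2) of element 3, the offsets are 14 for AoS, 23 for an SoA with 10 elements and 13 for an AoSoA with tiles of size 2. With 5 elements and a tile size of 2, sizeAoSoA returns 24 instead of the 20 ints actually needed, which is the wasted memory mentioned above.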
Chapter 3
Target Architecture and Platform
This chapter gives an overview of NVIDIA’s CUDA (Section 3.1) and all NVIDIA GPU architectures (Section 3.2) that can still be used with the current CUDA toolkit. This covers four hardware generations, ranging from “Fermi” up to the most recent “Pascal” architecture.
3.1 NVIDIA CUDA
CUDA [NVIDIA 2016a] was introduced by NVIDIA in 2007. It is the standard for non-graphical, compute-intensive programming on NVIDIA GPUs. CUDA stands for “Compute Unified Device Architecture” and is specifically designed to model massively parallel computations. CUDA consists of a compute model and a programming language, which are explained in detail in the following sections. There are also competing programming languages such as OpenCL¹, OpenACC² and C++ AMP³. We do not explain these in this thesis, as our research concentrates on CUDA.
3.1.1 Compute Model
The CUDA compute model is designed for massively parallel processors. As GPUs implement the SIMD scheme, they organize threads in SIMD groups, which are called warps. Since the beginning of CUDA, a warp has consisted of 32 threads, although in early GPU generations only 16 of them were active at the same time. This limitation is no longer present in newer GPUs. Multiple warps are grouped in so-called blocks. The current maximum is 1024 threads, or 32 warps, per block. During the execution, one block is mapped onto one Streaming Multi-Processor (SM) on the GPU. This concept is illustrated in Figure 3.1. These multi-processors usually consist of hundreds of small cores, where one thread is mapped to one of the cores. The GPU threads are very lightweight, so that whenever one warp stalls, because it has to wait for memory to be accessed, it can be immediately replaced by another warp on the same multi-processor that is ready to run, with very little swapping overhead. This method is comparable to Intel’s HyperThreading [INTEL 2002].
¹ www.khronos.org/opencl
² www.openacc.org
³ blogs.msdn.microsoft.com/nativeconcurrency
Figure 3.1: Schematic illustration of a CUDA program with six blocks and how they are scheduled onto two different GPUs, one with two SMs and one with four SMs. (Based on [NVIDIA 2016a, Figure 1])
Whenever a condition is not fulfilled by all threads in a warp, a so-called thread divergence occurs. In this situation only the threads that fulfill the condition continue the execution until the end of the conditional block, while the others are suspended. NVIDIA refers to its GPUs as Single Instruction, Multiple Thread (SIMT) devices, as they can put single threads inside a SIMD group to sleep. An example of thread divergence is illustrated in Figure 3.2. Before the second generation of the Kepler architecture, GPUs had to be explicitly controlled by a CPU. All GPU functions (in the following called kernels) had to be called directly by the host system. Since then GPUs have been able to invoke kernels from within other kernels. This is called Dynamic Parallelism. However, the initial kernel call still has to be made by the CPU.
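The cost of thread divergence can be illustrated with a deliberately simplified host-side model. It only counts how many code paths of an if/else a warp has to execute one after the other. The function name and the assumption that each path costs one pass are hypothetical simplifications and do not describe the actual hardware scheduler.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;

// Number of code paths a warp executes one after another for an if/else:
// 1 if all lanes agree on the condition, 2 if the warp diverges.
int serializedPaths(const std::array<bool, kWarpSize>& takesIfBranch) {
    bool anyTrue = false;
    bool anyFalse = false;
    for (bool taken : takesIfBranch) {
        anyTrue = anyTrue || taken;
        anyFalse = anyFalse || !taken;
    }
    return (anyTrue && anyFalse) ? 2 : 1;
}
```

In this model a single thread that disagrees with the rest of its warp is enough to double the number of passes, which is why conditions that are uniform within a warp are preferable.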
Feeding thousands of threads with data on a single multi-processor is a difficult task. Therefore, GPUs are equipped with a complex memory system, consisting of different automatically managed caches and on- and off-chip memories. This memory hierarchy has changed significantly with every GPU generation so far and is explained in more detail in Section 3.2. However, the main concept has remained unchanged. The main memory of the GPU is off-chip and called global memory or device memory. It can be accessed by all threads for read and write operations. To guarantee that all threads in a warp, a block or the entire GPU read the most recent data, special synchronization functions can be used. However, this synchronization is costly and should be avoided if possible. To share data within a block, the so-called shared memory can be used, which is a very fast on-chip memory. In effect, this memory is a cache that has to be explicitly programmed.
Figure 3.2: Thread divergence occurs whenever conditionals are processed where not all threads continue on the same code path.
Figure 3.3: Memory divergence occurs whenever threads access non-contiguous data. The optimal* (blue) case is only optimal if it is a read operation. A write would create a write hazard unless an atomic write is used, which in turn serializes the memory access. Serialization should be avoided in parallel systems.
Shared memory can be accessed and synchronized quite fast by all threads of a block. It is organized in memory banks. Simultaneous accesses to the same bank result in conflicts that cause a serialization of the memory access. Therefore, the memory access has to be optimized to ensure that different banks are accessed in parallel. However, the size of the shared memory is limited to a few kB, depending on the GPU architecture. With the Kepler architecture, NVIDIA added the shuffle functions, which transfer 32-bit values between the threads of a single warp without using any additional memory resources or synchronization and are therefore very fast. Shuffle supports not only broadcasting a value to all threads in a warp, but also exchanging values between specific threads. Further, the CUDA programming model allows the usage of so-called local memory, which resides in the main memory of the GPU but is private to a single thread. As this restriction removes the need for any synchronization between the multi-processors, accessing local memory can be faster than accessing global memory. On the one hand, local memory is stored in the off-chip memory and is therefore significantly slower than registers or shared memory. On the other hand, the number of registers per thread is very limited, so local memory allows more memory to be used per thread. The employed caches differ significantly between the GPU generations. However, usually there is an L1 and an L2 cache, which serve global and/or local memory requests. Additionally, GPUs feature a non-coherent or texture cache that serves read-only requests to the off-chip memory. In older GPUs this cache can only be used through textures and is therefore often referred to as texture memory. In newer GPUs this cache can be accessed directly. Finally, every multi-processor has a small but fast read-only memory for constant values (constant memory), which can be written by the host system prior to calling a GPU function. As synchronizing all threads is costly, especially for off-chip memory, GPUs support 32- and 64-bit atomic operations on global and shared memory with the add, min, max, exchange and compare-and-swap operations for integer and float values, although older GPUs do not support all combinations of operation and value type. Available atomic operations are listed in the programming guide [NVIDIA 2016a, B.12]. Figure 3.4 shows a schematic illustration of the CUDA memory hierarchy. To supply such a large number of cores with data efficiently, the data accessed by neighboring threads has to be stored in neighboring memory cells. Otherwise, because memory is always transferred in larger contiguous chunks, data is transmitted that is never used, which reduces the bandwidth utilization. Figure 3.3 shows examples of efficient and inefficient memory access patterns.
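Both effects described above, wasted bandwidth through scattered global memory accesses and serialization through shared memory bank conflicts, can be estimated with a small host-side model. The sketch assumes a warp of 32 threads in which lane i accesses element i times the stride. The segment size and the number of banks are parameters, because the concrete values (for example 128-byte segments or 32 banks of 4-byte words) are assumptions here and depend on the architecture.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kWarpSize = 32;

// Number of distinct memory segments a warp touches when lane i reads the
// element at index i * strideElements. Fewer segments mean less wasted traffic.
std::size_t segmentsTouched(std::size_t strideElements, std::size_t elementBytes,
                            std::size_t segmentBytes) {
    assert(elementBytes > 0 && segmentBytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t address = lane * strideElements * elementBytes;
        segments.insert(address / segmentBytes);
    }
    return segments.size();
}

// Largest number of lanes that hit the same shared memory bank when lane i
// accesses word i * strideWords. A result of 1 means no conflict.
std::size_t bankConflictDegree(std::size_t strideWords, std::size_t bankCount) {
    assert(strideWords >= 1 && bankCount >= 1);
    std::vector<std::size_t> hits(bankCount, 0);
    std::size_t worst = 0;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t bank = (lane * strideWords) % bankCount;
        worst = std::max(worst, ++hits[bank]);
    }
    return worst;
}
```

With 4-byte elements and assumed 128-byte segments, a stride of 1 touches a single segment, whereas a stride of 32 touches 32 segments for the same amount of useful data. Similarly, with 32 assumed banks a stride of 32 words maps all lanes to one bank, while padding the stride to 33 words removes the conflict in this model.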
As the resources per thread (registers) and per block (shared memory) are very limited on GPUs, it is possible to adjust their usage to the application. If a kernel requires a large amount of shared memory or registers, more resources are assigned to each block, so that fewer blocks can be assigned to a multi-processor, which reduces the ability to swap between warps. This utilization is measured using the occupancy metric, which is defined as the ratio of warps resident on a multi-processor to the maximum number of warps the multi-processor supports. Although higher occupancy means that more warps are eligible to be scheduled, it does not necessarily result in higher performance. The reason for this effect is that all SMs share the same memory interface. If too much data is requested by a high number of warps, the interface becomes the limiting factor. More information on this effect can be found in Volkov [2010].
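How the resource usage of a kernel limits the number of resident blocks can be estimated with a simplified calculation. The following sketch computes occupancy as resident warps divided by the maximum number of warps per multi-processor. It ignores the allocation granularities of real hardware. All structure and function names are hypothetical, and the limits have to be taken from the target device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct SmLimits {                  // limits of one multi-processor
    std::size_t maxWarps;
    std::size_t maxBlocks;
    std::size_t registers;         // number of 32-bit registers
    std::size_t sharedMemBytes;
};

struct KernelUsage {               // resource usage of one kernel
    std::size_t threadsPerBlock;
    std::size_t registersPerThread;
    std::size_t sharedMemPerBlock; // in bytes
};

// Number of blocks that fit onto one multi-processor at the same time.
std::size_t residentBlocks(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threadsPerBlock > 0);
    const std::size_t warpsPerBlock = (k.threadsPerBlock + 31) / 32;
    std::size_t blocks = std::min(sm.maxBlocks, sm.maxWarps / warpsPerBlock);
    if (k.registersPerThread > 0)
        blocks = std::min(blocks, sm.registers / (k.registersPerThread * warpsPerBlock * 32));
    if (k.sharedMemPerBlock > 0)
        blocks = std::min(blocks, sm.sharedMemBytes / k.sharedMemPerBlock);
    return blocks;
}

// Occupancy: resident warps relative to the maximum number of warps.
double occupancy(const SmLimits& sm, const KernelUsage& k) {
    const std::size_t warpsPerBlock = (k.threadsPerBlock + 31) / 32;
    return static_cast<double>(residentBlocks(sm, k) * warpsPerBlock) /
           static_cast<double>(sm.maxWarps);
}
```

For illustrative limits of 64 warps, 16 blocks, 64 K registers and 48 kB of shared memory, a kernel with 256 threads per block and 32 registers per thread reaches an occupancy of 1.0 in this model. Doubling the register usage to 64 halves it to 0.5, and requesting 16 kB of shared memory per block instead limits it to 0.375.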
3.1.2 Programming Language
The CUDA programming language consists of two different parts, one for the host system that controls the execution and one for the device that performs the actual computation.
Figure 3.4: Schematic illustration of a CUDA capable GPU with all kinds of memory and their locations. Each SM contains registers, an L1 cache, shared memory, constant memory and a texture cache. The SMs share the L2 cache and the memory controllers on the chip, which connect to the off-chip device memory (global, local and texture memory). Blue elements access off-chip memory, green elements are on-chip memory and orange elements are compute cores.
The host system is supposed to work as a master device and controls most of the functions of the GPU. It is responsible for allocating off-chip memory, copying data to and from the device and launching compute functions. Two different host APIs are available. The first is the Runtime API, which allows GPU and CPU code to be written in the same file. This is very convenient to use and most likely one of the reasons for the success of CUDA. Over time, this API has been enriched with more features, most recently the option to declare an array as managed, which implicitly copies data to the GPU. The new Pascal architecture also allows data to be actively swapped between the GPU and the host during execution [NVIDIA 2016d]. The second API is the Driver API. It not only supports all features of the Runtime API, but also comes with advanced features, e.g., dynamically loading kernel implementations at runtime. This allows different implementations of a kernel to be chosen according to the underlying hardware without recompiling the actual application. However, while the Runtime API is tuned for usability, the Driver API is tuned for features, so that it requires more programming effort than the Runtime API. It is also possible to mix both APIs, although certain constraints have to be observed. More details about the APIs can be found in the programming guide [NVIDIA 2016a].
To write a kernel, certain requirements have to be fulfilled. First, it is necessary to annotate the function with the __global__ keyword, so that the compiler knows that this function can be invoked by the host system. Further, it cannot return any values. Every output needs to be stored in memory that is passed to the kernel as a pointer. The kernel itself then has to be written from the perspective of a single thread, while all threads execute the same code. To distinguish between the threads, it is possible to acquire the ID of a thread inside a block, the ID of a block, as well as the sizes and counts of blocks through the variables threadIdx.{x, y, z}, blockIdx.{x, y, z}, blockDim.{x, y, z} and gridDim.{x, y, z}. However, these variables always require a register if used, so sizes or numbers of blocks that are known at compile time should be set as constant values so that precious hardware resources are not wasted. Further, kernels can only invoke functions that are annotated with the __device__ keyword. GPUs with Dynamic Parallelism can also call other kernel functions. Unlike CPUs, GPUs require variables to be aligned, meaning that the address of a 4B variable needs to be a multiple of 4B. This can cause problems when data structures are shared between CPUs and GPUs, as it is not guaranteed that the host compiler obeys the alignment requirements of the GPU.
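The alignment requirement can be checked on the host side before data is handed to the GPU. The following sketch shows how a compiler pads a struct so that a 4B member starts at an aligned offset, and how an arbitrary address can be tested. The struct and the helper function are hypothetical examples. The layout shown is the one produced by common host compilers and is not guaranteed to match the device compiler in every case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Natural layout: the compiler inserts padding after 'tag' so that 'value'
// starts at an offset that is a multiple of its alignment.
struct Padded {
    char tag;
    std::int32_t value;
};

static_assert(offsetof(Padded, value) % alignof(std::int32_t) == 0,
              "value must start at an aligned offset");

// True if the address is a multiple of the given alignment (in bytes).
bool isAligned(const void* ptr, std::size_t alignment) {
    assert(alignment > 0);
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

A check such as isAligned(ptr, 4) before reinterpreting a byte buffer as an array of 4B values helps to detect layouts that work on the host but violate the requirements of the device.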
Listing 3.1 shows a simple kernel that searches for the minimum in a float array. Listing 3.2 shows a different version of the same operation that utilizes the shuffle functionality. It requires significantly less shared memory (16B instead of 512B) and far fewer synchronizations (1 instead of 8). This illustrates the major problem of CUDA. Programmers have to explicitly use different memory types, caches, programming concepts, algorithms and implementations to achieve optimal performance. This makes it difficult to write good GPU code that works optimally on each hardware generation.
There also exists an assembly language for CUDA capable devices, called Parallel Thread eXecution (PTX). PTX is an intermediate assembly language that works on all GPUs. However, before PTX code can be executed, it has to be compiled into a device-specific assembly language called SASS, which can be translated directly into device micro-code. SASS is specialized for the particular hardware and cannot be used on a device of a different architecture.
3.1.3 CUDA Profiling Tools Interface
CUDA comes with a series of tools and libraries to assist the development of GPU-driven applications. One of these tools is the CUDA Profiling Tools Interface (CUPTI), which provides access to the CUDA profiler from inside a CUDA application. This allows an application to monitor its own performance and gather metrics such as cache hit rates, memory and processor utilization, achieved FLOPS and others. As CUPTI directly accesses the data within the GPU driver, its measurements are much more precise than measuring the execution time of a kernel with general-purpose timing facilities such as std::chrono. However, CUPTI implements a pure C, callback-driven API that is uncomfortable to use and requires significant programming effort.
#include <cfloat> // FLT_MAX
#define NUM_THREADS 128

// Expects to be launched with a single block of NUM_THREADS threads.
__global__ void findMinimum(float* globalValues, const int elementCount) {
    // search through all items, each thread handles a strided subset
    float localValue = FLT_MAX;

    for(int i = threadIdx.x; i < elementCount; i += NUM_THREADS)
        localValue = fminf(localValue, globalValues[i]);

    // reduce results
    __shared__ float shared[NUM_THREADS];
    shared[threadIdx.x] = localValue;

    // sync required so that all threads know the actual values
    __syncthreads();

    // perform reduction, halving the number of active threads in each step
    for(int stride = NUM_THREADS/2; stride > 0; stride >>= 1) {
        if(threadIdx.x < stride)
            shared[threadIdx.x] = fminf(shared[threadIdx.x + stride], shared[threadIdx.x]);
        __syncthreads();
    }

    // write back result into first element of globalValues
    if(threadIdx.x == 0)
        globalValues[0] = shared[0];
}
Listing 3.1: This kernel searches for the minimum in the float array globalValues. First, the threads jointly search through all values, with each thread storing a local minimum value. Then shared memory and a parallel reduction are used to find the global minimum. In the end, the first thread stores the result back into the array. The fminf(float, float) function returns the minimum of two float values.
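The reduction phase of Listing 3.1 can be replayed sequentially on the host to check its logic and to count the barriers. In the following sketch the vector plays the role of the shared memory array and every iteration of the outer loop corresponds to one __syncthreads() call. The type and function names are hypothetical, and the sketch assumes a power-of-two number of threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ReductionResult {
    float minimum;
    int barriers; // number of __syncthreads() calls the kernel would execute
};

// Host-side model of the shared memory tree reduction.
ReductionResult treeReduceMin(std::vector<float> shared) {
    // the scheme requires a power-of-two number of entries
    assert(!shared.empty() && (shared.size() & (shared.size() - 1)) == 0);
    int barriers = 1; // barrier after the initial store to shared memory
    for (std::size_t stride = shared.size() / 2; stride > 0; stride >>= 1) {
        for (std::size_t t = 0; t < stride; ++t) // threads with threadIdx.x < stride
            shared[t] = std::min(shared[t], shared[t + stride]);
        ++barriers; // barrier at the end of each reduction step
    }
    return {shared[0], barriers};
}
```

For 128 entries the model performs seven reduction steps plus the initial barrier, which gives the eight synchronizations that Listing 3.1 requires.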
#include <cfloat> // FLT_MAX
#define NUM_THREADS 128
#define NUM_WARPS (NUM_THREADS/32)

// Expects to be launched with a single block of NUM_THREADS threads.
__global__ void findMinimum(float* globalValues, const int elementCount) {
    // search through all items, each thread handles a strided subset
    float localValue = FLT_MAX;

    for(int i = threadIdx.x; i < elementCount; i += NUM_THREADS)
        localValue = fminf(localValue, globalValues[i]);

    // use shuffle to reduce values on warp level
    localValue = fminf(__shfl_down(localValue, 16), localValue);
    localValue = fminf(__shfl_down(localValue, 8), localValue);
    localValue = fminf(__shfl_down(localValue, 4), localValue);
    localValue = fminf(__shfl_down(localValue, 2), localValue);
    localValue = fminf(__shfl_down(localValue, 1), localValue);

    // use shared memory to reduce the remaining elements across warps
    __shared__ float shared[NUM_WARPS];
    if(threadIdx.x % 32 == 0)                  // only first thread in a warp
        shared[threadIdx.x / 32] = localValue; // id of the warp

    __syncthreads(); // sync required so that all threads know the actual values

    // use NUM_WARPS threads in first warp to reduce
    if(threadIdx.x < NUM_WARPS) {
        localValue = shared[threadIdx.x];
        localValue = fminf(__shfl_down(localValue, 2), localValue);
        localValue = fminf(__shfl_down(localValue, 1), localValue);
    }

    // write back result into first element of globalValues
    if(threadIdx.x == 0)
        globalValues[0] = localValue;
}
Listing 3.2: This kernel is an optimized version of the code shown in Listing 3.1. It uses the shuffle functionality to perform local reductions inside every warp and then uses shared memory only once to exchange the results across warp boundaries. The __shfl_down(value, offset) function returns the float value held by the thread whose ID within the warp is the current thread's ID plus offset.
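The warp-level reduction in Listing 3.2 can be modeled on the host by treating a warp as an array of 32 values. The sketch below is an illustration under stated assumptions and not the device implementation. The names are hypothetical, and lanes whose source would lie outside the warp simply keep their own value.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;
using Warp = std::array<float, kWarpSize>;

// Model of a shuffle-down: lane i receives the value of lane i + offset.
// Lanes whose source would lie outside the warp keep their own value.
Warp shuffleDown(const Warp& values, std::size_t offset) {
    Warp result = values;
    for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane)
        result[lane] = values[lane + offset];
    return result;
}

// Five shuffle steps (16, 8, 4, 2, 1) leave the minimum of the warp in lane 0.
float warpReduceMin(Warp values) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset >>= 1) {
        const Warp shifted = shuffleDown(values, offset);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane)
            values[lane] = std::min(values[lane], shifted[lane]);
    }
    return values[0];
}
```

Only lane 0 is guaranteed to hold the minimum of the whole warp after the last step, which is why Listing 3.2 lets only the first thread of each warp write to shared memory.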
3.2 NVIDIA GPUs
In this section we give an overview of the last four NVIDIA GPU architectures: “Fermi”, “Kepler”, “Maxwell” and the most recent “Pascal” architecture. Older GPUs are no longer supported as of CUDA v8.0 [NVIDIA 2016a]. The next generation will be called “Volta” and is announced for 2018. So far only a few rumors about it are known. Most likely HBM2 and GDDR6 will be used [Moammer 2016]. To distinguish between the functionality supported by different GPUs, NVIDIA introduced the Compute Capability (CC). Table 3.1 shows an extract of the compute capability dependent features and technical details. For a full list, please refer to the CUDA Programming Guide [NVIDIA 2016a, Table 12 and 13].
3.2.1 Fermi Architecture
The Fermi architecture [NVIDIA 2009] (CC = 2.x) was introduced in 2009. It was used in GPUs of the 400 and 500 series, in low-end GPUs of the 600 series and in some GPUs of the Quadro and Tesla C series. Fermi was the first GPU with unified compute cores and an L1/L2 cache hierarchy. Comparable to normal CPUs, the L1 cache serves as an additional layer between the L2 cache and the cores. The new instruction set introduced a unified address space for local, shared and global memory. In this architecture, the L1 cache and the shared memory are the same hardware component, and the split between them can be adjusted for each kernel execution, so that the programmer can choose to prefer either shared memory or L1 cache. One of these is assigned 48 kB and the other 16 kB. To utilize the non-coherent/texture cache, it is necessary to explicitly bind the memory addresses to texture references from the host system prior to a kernel execution. These texture bindings are subject to certain limitations: first, the memory address needs to be 512B aligned; second, a texture can only contain up to 2^27 elements; and third, only 1, 2 and 4B variables are supported. Further, it is possible to load multiple elements as a vector of size 1, 2 or 4 at the same time. To read long or double values, it is necessary to read a vector of two 4B values and use a reinterpret cast.
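The last step, reassembling an 8B value from a vector of two 4B values, can be sketched on the host as follows. The struct Int2 is a hypothetical stand-in for a two-component integer vector, and std::memcpy is used here in place of a reinterpret cast because it is well defined in standard C++. In device code, CUDA additionally provides the intrinsic __hiloint2double for this purpose.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for a vector of two 4B values as it would be fetched from a texture.
struct Int2 {
    std::int32_t x;
    std::int32_t y;
};

static_assert(sizeof(Int2) == sizeof(double), "two 4B values must cover one double");

// Split a double into two 4B parts (what is stored in the texture).
Int2 splitDouble(double value) {
    Int2 parts;
    std::memcpy(&parts, &value, sizeof(value));
    return parts;
}

// Reassemble the double from the two 4B parts (what the kernel has to do).
double joinDouble(Int2 parts) {
    double value;
    std::memcpy(&value, &parts, sizeof(value));
    return value;
}
```

Splitting and joining are exact inverses, so the round trip reproduces the original bit pattern of the double.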
3.2.2 Kepler Architecture
The next GPU architecture was called Kepler [NVIDIA 2014a] (CC = [3.0, 3.2, 3.5 and 3.7]) and was released in 2012. It was used in GPUs of the 600, 700 and 800 series, in low-end GPUs of the 900 series and in the Quadro K and Tesla K series. There have been two major revisions of the Kepler architecture, starting with CC 3.0 and CC 3.5. GPUs with CC 3.0 introduced the already mentioned shuffle functions, which allow a thread to access values held by other threads within the same warp without additional hardware registers. Further, Kepler added a third mode to the L1 cache and shared memory, which distributes them equally with 32 kB each. The size of the shared memory banks can be adjusted to either 32 or 64 bit, depending on the data that is supposed to be stored. Further, the operation mode of the L1 cache was changed to only serve local memory accesses. Most likely this was done to remove any synchronization between the SMs for global memory and thus to reduce the communication between the SMs. With the Kepler architecture it also became possible to use so-called “unbound textures”, which do not have to be explicitly bound to a texture reference, but can be passed to the kernel as an execution parameter. However, these textures still need to be explicitly initialized by the host code and have the same restrictions as textures using texture references.
With the second generation of Kepler (CC ≥ 3.5), Dynamic Parallelism was introduced, allowing kernels to start other kernels directly from the GPU. Another innovation was the __ldg(ptr) function. It allows access to the texture cache without binding any textures in advance and removes restrictions such as the limit of 2^27 elements.
3.2.3 Maxwell Architecture
The first generation of Maxwell [NVIDIA 2014b] (CC = 5.0) GPUs was released in 2014 with the Geforce GTX 745, 750 and 750 Ti GPUs, followed by the second generation cards in the 900 series (CC = 5.2) and the Jetson TX1/Tegra X1 (CC = 5.3) embedded processors. This architecture introduced significant changes to the memory system. First of all, the shared memory can no longer be configured, neither in capacity nor in memory bank size. Further, the L1 cache has been merged with the non-coherent cache. In addition, the implementation of atomic operations for global and shared memory has been significantly improved.
3.2.4 Pascal Architecture
Pascal [NVIDIA 2016d] (CC = [6.0, 6.1 and 6.2]) is the newest GPU architecture, released in 2016 with the Geforce GTX 10XX and Tesla P series. One of the new features is that the GPUs can actively swap data between the device and the host system during the execution of a kernel. This allows a single kernel to use more memory than the device is equipped with. Further, the Tesla P100 features the new HBM memory. This allows much higher memory capacities (up to 16GB) and bus widths (4096 bit).
Compute capabilities (columns): Fermi: 2.x; Kepler: 3.0, 3.2, 3.5, 3.7; Maxwell: 5.0, 5.2, 5.3; Pascal: 6.0, 6.1, 6.2
Feature support (rows, NVIDIA CUDA Programming Guide, v8, Table 12): Atomic addition operating on 64-bit floating point values in global memory and shared memory (atomicAdd()); Unified Memory Programming; Funnel shift (see reference manual); Dynamic Parallelism; Half-precision floating-point operations: addition, subtraction, multiplication, comparison, warp shuffle functions, conversion
Technical specifications (rows, NVIDIA CUDA Programming Guide, v8, Table 13): Maximum number of resident grids per device (Concurrent Kernel Execution); Maximum x-dimension of a grid of thread blocks; Maximum number of resident blocks per multiprocessor; Maximum number of resident warps per multiprocessor; Maximum number of resident threads per multiprocessor; Number of 32-bit registers per multiprocessor; Maximum number of 32-bit registers per thread block; Maximum number of 32-bit registers per thread; Maximum amount of shared memory per multiprocessor; Cache working set per multiprocessor for constant memory; Cache working set per multiprocessor for texture memory; Maximum width, height, and depth for a 3D texture reference bound to a CUDA array; Maximum number of textures that can be bound to a kernel; Maximum number of surfaces that can be bound to a kernel
Table 3.1: Extract of compute capability dependent features and technical details [NVIDIA 2016a, Table 12 and 13]. As can be seen, significant differences are present even within the same architecture.
Chapter 4
Auto-Tuning and Related Work
This chapter starts by introducing the term auto-tuning as well as the techniques and concepts used in auto-tuning, and provides an overview of a wide range of auto-tuning projects and publications. In general, the term auto-tuning is not consistently defined. It is usually used to describe automated systems that tune certain parameters to achieve a specific objective. This objective can be to optimize the utilization of compute resources, to achieve optimal performance, to provide a certain quality of service or to reduce the energy consumption. In mathematical terms, auto-tuning is an optimization problem. It is difficult to pinpoint an exact date or publication that marks the beginning of auto-tuning research because the term is not clearly defined. In fact, the first optimizing compiler could already be seen as an auto-tuner, as it automatically optimized the performance of the code. To limit the scope of this thesis, we define auto-tuning as a method that automatically adjusts an application to operate optimally on the executing hardware.
The idea of such auto-tuning applications and frameworks is to optimize existing code for a given hardware. Software is usually written once and used for years, while hardware is constantly improved and enriched with new features, so that code that was written in the past might not leverage the full capabilities of the processor. Without an auto-tuning system, the code has to be readjusted by an experienced programmer who knows not only the abilities, strengths and weaknesses of the hardware the application is supposed to run on, but also the application and the algorithms used. This procedure needs to be repeated whenever new hardware is released. Especially for long-living codes this is a significant engineering and cost overhead. With an increasing number of different processor types, it is necessary not only to optimize the existing code for the unique specifications, strengths and weaknesses of the processors, but also to evaluate alternative algorithms that are more efficient on the given hardware. Auto-tuning (regarding our focus) aims at solving this problem. The goal is to establish a system that can be integrated into an application and that automatically analyzes and optimizes the application without any user interaction. This not only saves time for the tuning itself, but also enables software users (even if they are not experienced programmers) to optimize their applications for the hardware.
As already mentioned, it is difficult to pinpoint an exact publication that initiated auto-tuning research, but the ATLAS auto-tuner [Whaley and Dongarra 1998] is certainly one of the first publications. It automatically optimizes Basic Linear Algebra
Chapter 4: Auto-Tuning and Related Work
[Figure 4.1 (diagram not reproduced). Panel titles: Abstract, MATOG. Diagram labels: Application, Auto-Tuner, Control, Feedback, Execution, API, Profiler DB, Data Tracker, Decision Model.]
Figure 4.1: Left: Schematic feedback loop of an auto-tuner. The exact implementation of the control and feedback depends on the application and on what the auto-tuner is optimizing. The feedback usually contains some kind of measure that characterizes the performance of the application (execution time, energy consumption, ...), but also other metrics (cache hit/miss rates, hardware utilization, ...) or meta data (data amount, hardware parameters, ...) that might help the auto-tuner to derive better decisions. Right: Schematic of the MATOG auto-tuner. MATOG bases its decisions on meta data tracked during the execution and on performance data captured in previous executions. More details are explained in Chapter 5.
Subprograms (BLAS) operations for CPUs. Another is Active Harmony [Hollingsworth and Keleher 1998], an auto-tuning system for optimizing applications in parallel, dynamic environments. Karsai et al. [2001] described an abstract self-adaptive software system that monitors performance metrics and automatically tunes parameters using a feedback loop. They port the concept of feedback-based system control (used in other engineering disciplines) to the field of software development. Figure 4.1 illustrates an example of such a feedback-based application. For a given set of tuning parameters (which can be specified either by the auto-tuner or by the programmer), the auto-tuner analyzes the application and how it performs in terms of its objective function on the given hardware. According to the objective function, it decides which values for the tuning parameters are likely optimal. In the following, we call a combination of tuning parameter values a configuration. Given this abstract description, auto-tuning can be defined by the following four building blocks. Informally, they can be referred to as “The Programmer’s Guide to build an Auto-Tuner”:
1. “What to optimize?”
2. “How are optimizations integrated into the application?”
3. “Which configurations are optimal?”
4. “How to detect and handle runtime dependent performance effects?”
In the following we discuss these building blocks separately. As the number of publications on this topic is very high, we first concentrate on a few hand-picked examples to give an overview of the topic and the techniques used (Sections 4.1 to 4.4) and later take a wider look at the different auto-tuners and concepts proposed in the literature (Section 4.5).
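The feedback loop of Figure 4.1 (left) can be illustrated with a minimal C++ sketch. It is not the implementation of any of the cited auto-tuners: the names `Configuration`, `Objective` and `tune_exhaustive` are hypothetical, the objective is assumed to be a cost that should be minimized (e.g., a measured execution time), and the search strategy is a plain exhaustive enumeration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// A configuration assigns one value index to every tuning parameter.
using Configuration = std::vector<std::size_t>;

// Hypothetical objective: returns the measured cost (e.g., execution time)
// of running the application with the given configuration.
using Objective = std::function<double(const Configuration&)>;

// Visits every combination of parameter values (exhaustive search) and
// returns the configuration with the lowest measured cost.
// domain_sizes[i] is the number of possible values of parameter i.
Configuration tune_exhaustive(const std::vector<std::size_t>& domain_sizes,
                              const Objective& measure) {
    for (std::size_t n : domain_sizes) assert(n > 0);

    Configuration current(domain_sizes.size(), 0);
    Configuration best = current;
    double best_cost = std::numeric_limits<double>::infinity();

    while (true) {
        // Control: execute with `current`. Feedback: the measured cost.
        const double cost = measure(current);
        if (cost < best_cost) {
            best_cost = cost;
            best = current;
        }
        // Advance to the next configuration like an odometer.
        std::size_t i = 0;
        while (i < current.size() && ++current[i] == domain_sizes[i]) {
            current[i] = 0;
            ++i;
        }
        if (i == current.size()) break;  // wrapped around: all visited
    }
    return best;
}
```

The number of measurements grows with the product of all domain sizes, which is why the search strategies discussed in Section 4.3 try to avoid visiting every configuration.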
4.1 Options for optimization
The variety of available optimizations in applications seems to be nearly endless. It starts with simple and easily applicable optimizations such as compiler flags [Ansel 2014]. As compute intensive applications usually iterate over large datasets, it is very important to transform loops [Hall et al. 2009] and apply optimized launch configurations [Bergstra et al. 2012; Liu et al. 2008]. Lutz et al. [2015] adjust the usage of API calls to overlap memcopy and execution. Magni et al. [2014] apply thread coarsening by letting one thread do the work of multiple threads, which optimizes the resource usage. As GPUs are used for their compute power and memory bandwidth, optimized data access is necessary. This can be done either by specifically targeting particular operations such as 3D stencil codes [Lutz et al. 2013], dense [Sorensen 2012] or sparse [Choi et al. 2010; Monakov et al. 2010] matrix vector multiplications, or by applying application independent optimizations, e.g., tuning the cache utilization [Li et al. 2015], data mapping and partitioning [Ben-Nun et al. 2015], data placement [Li 2016] or data layouts [Kofler et al. 2015; Peng et al. 2016]. Edwards and Trott [2013] provide an optimization API for data layouts that does not perform any automatic optimizations. Sung et al. [2012] propose a framework that transforms AoS to AoSoA based on static rules and executes explicit data transformations prior to the kernel launch. Hsu et al. [2014] extend this work by proposing a hardware controller that performs these layout transformations during memcopy, eliminating the conversion overhead. More general are approaches such as [Muralidharan et al. 2014; Ansel 2014; Nugteren and Codreanu 2015] that allow user specified optimizations, which can, e.g., switch between alternative algorithms. MATOG optimizes data layouts, data placement, the utilization of caches and user specified optimizations.
4.2 How to integrate optimizations into applications?
There are multiple ways to integrate optimizations into an application. A popular technique is the usage of compilers. Some approaches use code annotations [Liu et al. 2008; Han and Abdelrahman 2009; Lee and Vetter 2014] to generate parallel code from given serial code, or to apply optimizations such as loop transformations. Li et al. [2015] apply their optimizations directly to the PTX intermediate format. Auto-tuners that allow user defined optimizations (as mentioned in Section 4.1) usually require using a specific API to instruct the auto-tuner what to optimize. They use annotations or macros inside the code to enable or disable features or to adjust parameters. For MATOG we decided to use code generation, where the generated code has to be manually integrated into the application. On the one hand, this is less comfortable to use, as it requires some manual work. On the other hand, our code generation makes the code easily portable between different operating systems and compilers. Our API mimics the CUDA Driver API and therefore requires only a few changes to existing CUDA applications. Listing 4.1 shows an overview of different data layouts, how these are implemented in CUDA and how they can be accessed using the MATOG data structures. These data structures transparently map each memory access onto one of these methods.
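To make the index arithmetic behind the layouts of Listing 4.1 concrete, the following host-side C++ sketch computes the linear element offset for AoS, SoA and AoSoA, and for two of the multi-dimensional orderings. It is only an illustration and not part of the MATOG API: the function names are hypothetical, and all fields of a struct are assumed to have the same size, so that offsets can be counted in field values.

```cpp
#include <cassert>
#include <cstddef>

// Offsets are counted in field values for a struct array with `num_fields`
// equally sized fields and `count` elements (a simplifying assumption).

// AoS: array[index].field
std::size_t aos_offset(std::size_t index, std::size_t field,
                       std::size_t num_fields) {
    return index * num_fields + field;
}

// SoA: array.field[index]
std::size_t soa_offset(std::size_t index, std::size_t field,
                       std::size_t count) {
    return field * count + index;
}

// AoSoA: array[index / tile_size].field[index % tile_size]
std::size_t aosoa_offset(std::size_t index, std::size_t field,
                         std::size_t num_fields, std::size_t tile_size) {
    assert(tile_size > 0);
    const std::size_t tile = index / tile_size;  // which tile
    const std::size_t lane = index % tile_size;  // position inside the tile
    return tile * (num_fields * tile_size) + field * tile_size + lane;
}

// 3D array with x as the fastest varying dimension: array[x][y][z]
std::size_t offset_xyz(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t size_x, std::size_t size_y) {
    return x + y * size_x + z * size_x * size_y;
}

// Transposed variant with z as the fastest varying dimension: array[z][y][x]
std::size_t offset_zyx(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t size_z, std::size_t size_y) {
    return z + y * size_z + x * size_z * size_y;
}
```

With a tile size of 1, the AoSoA offset equals the AoS offset, and with a tile size equal to the number of elements it equals the SoA offset, which shows that AoSoA interpolates between the two layouts.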
4.3 Which conﬁgurations are optimal?
Finding optimal configurations or values for parameters is essential for auto-tuning. There are mainly two methods: first, analyzing the code (static analysis) and, second, measuring the execution time (profiling). Peng et al. [2016] recently proposed a metric to estimate which of AoS, SoA, AoSoA or SoAoS yields better performance. The advantage of such static analysis methods over profiling is that they can be performed quite fast. However, they ignore data dependent effects that can occur at runtime, as they solely operate on heuristics that make certain assumptions about the data and workload of the application. Further, the hardware of GPUs is proprietary, which makes it difficult to establish suitable models for static analysis methods, especially if a new architecture with significant internal changes is released. Hong [2009] established an analytical model for CUDA code. However, they mention that their model cannot represent cache misses. As they worked on a GPU that does not have a cache for off-chip memory, this was not a problem. In contrast, modern GPUs have multiple caches for this type of memory, so their model is likely not very accurate today.
The second technique is empirical profiling. Executing the application in different implementations is quite time intensive, so multiple approaches have emerged to reduce the required time. Kofler et al. [2015] use a hybrid approach, as they
/************************** AoS-Layouts **************************/
array[index].field // AoS
array.field[index] // SoA
array[index/tile_size].field[index%tile_size] // AoSoA
» array[index].field « // MATOG

/*********************** Multi-Dimensional ***********************/
array[x + y * size_x + z * size_x * size_y] // array[x][y][z]
array[x + z * size_x + y * size_x * size_z] // array[x][z][y]
array[y + x * size_y + z * size_y * size_x] // array[y][x][z]
array[y + z * size_y + x * size_y * size_z] // array[y][z][x]
array[z + x * size_z + y * size_z * size_x] // array[z][x][y]
array[z + y * size_z + x * size_z * size_y] // array[z][y][x]
» array[x][y][z] « // MATOG

/************************ Texture Memory *************************/
tex1Dfetch(textureReference, index + offset) // until CC 3.0
__ldg(&array[index]) // since CC 3.5
» array[index] « // MATOG
Listing 4.1: Overview of how AoS layouts, multi-dimensional arrays and texture memory can be used in CUDA compared to MATOG. Combinations of these layouts can also be used with MATOG, e.g., array[x][y][z].field.
utilize a generic GPU benchmark to capture GPU specific properties and then project them onto the code to estimate the performance using a directed execution graph. Magni et al. [2014] train a neural network on a dataset consisting of static code features and best candidates for some reference benchmarks. With this trained model they are able to predict optimal thread coarsening factors based on static code features. Other approaches use methods such as exhaustive search [Muralidharan et al. 2014], greedy search [Liu et al. 2008], swarm search, simulated annealing [Nugteren and Codreanu 2015], the Nelder-Mead downhill simplex method [Chung and Hollingsworth 2004], or a genetic algorithm [Ansel 2014] to guide the search towards the optimal configuration. We proposed a prediction-based profiling method that executes only a very small subset of configurations and uses the results to estimate the execution time of the non-profiled configurations. This method is based on the observation that changes induced by one optimization have little influence on other optimizations. With this technique it is possible to estimate the optimal configuration from only a small number of executed configurations, making the method very fast while it achieves nearly the same results as an exhaustive search.
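The independence observation behind prediction-based profiling can be illustrated with a small C++ sketch. It is not the method used in MATOG, which is not reproduced here. It makes two loud assumptions of its own: only a base configuration and all configurations that differ from it in exactly one parameter are profiled, and the effects of the parameters add up as time differences relative to the base. The type `OneAtATimeProfile` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Configuration = std::vector<std::size_t>;

// profiled[i][v] is the measured time when parameter i is set to value v and
// all other parameters keep their base value 0. Consequently, profiled[i][0]
// is the time of the base configuration for every i.
struct OneAtATimeProfile {
    std::vector<std::vector<double>> profiled;

    double base_time() const { return profiled.at(0).at(0); }

    // Illustrative additive model (an assumption of this sketch): every
    // parameter contributes its time difference relative to the base.
    double predict(const Configuration& c) const {
        assert(c.size() == profiled.size());
        double t = base_time();
        for (std::size_t i = 0; i < c.size(); ++i)
            t += profiled[i].at(c[i]) - base_time();
        return t;
    }

    // Under the additive model the best configuration is obtained by picking
    // the fastest value of every parameter independently.
    Configuration best_predicted() const {
        Configuration best(profiled.size(), 0);
        for (std::size_t i = 0; i < profiled.size(); ++i)
            for (std::size_t v = 1; v < profiled[i].size(); ++v)
                if (profiled[i][v] < profiled[i][best[i]]) best[i] = v;
        return best;
    }
};
```

For parameters with n_1, ..., n_k possible values, this scheme needs 1 + (n_1 - 1) + ... + (n_k - 1) profiling runs instead of the n_1 * ... * n_k runs of an exhaustive search. The prediction is only as good as the independence assumption.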
4.4 How to detect and handle runtime dependent performance effects?
The performance of some applications changes depending on varying input data. Applications with such data dependent effects can benefit from dynamic decisions that are made at runtime. To establish such a decision system, it is necessary to find measures to identify these effects and to create decision models that can distinguish between them.
To dynamically select optimal configurations during runtime, the auto-tuner has to be able to identify different data properties. Approaches such as those presented by Muralidharan et al. [2014] or Chung and Hollingsworth [2004] rely on user defined callback functions. This requires that users know how to characterize the properties of their data. In contrast, we use a fully automatic process that relies on meta data which is already contained in our system (i.e., the launch configuration of a kernel or the size of arrays). Additionally, MATOG supports monitoring user defined variables, which can provide information beyond our automatically gathered meta data.
With the ability to identify different data properties, it is possible to create decision models based on machine learning techniques, e.g., regression trees [Bergstra et al. 2012] or Support Vector Machines (SVMs) [Muralidharan et al. 2014]. Initially, we relied on SVMs, but encountered cases with high false decision rates. In this thesis we propose an improved decision model that achieves better decisions for our optimization problem. If the extent of the used meta data is unknown, models can derive bad decisions when the data differs significantly from the training data. In our case, a specialized nearest neighbor search using an implicit normalization yields equal or better results compared to an SVM. We explain this further in Section 6.4.1.
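As a rough illustration of a nearest neighbor decision on meta data, consider the following C++ sketch. It is not the decision model of Section 6.4.1. The names `Sample`, `relative_difference` and `decide` are hypothetical, and the normalization is an assumption of this sketch: every meta data dimension is compared by its relative difference, so that values of very different magnitude (e.g., an array size and a block size) contribute on a comparable scale without a separate normalization pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One training sample: meta data observed during profiling (e.g., array
// sizes or launch dimensions) and the id of the configuration that won.
struct Sample {
    std::vector<double> meta;
    int best_configuration;
};

// Scale-free difference of two values: 0 if they are equal, and at most 1
// if both have the same sign. This is the assumed normalization.
double relative_difference(double a, double b) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return scale == 0.0 ? 0.0 : std::fabs(a - b) / scale;
}

// Returns the configuration of the training sample whose meta data is
// closest to the query.
int decide(const std::vector<Sample>& training,
           const std::vector<double>& query) {
    assert(!training.empty());
    int best = training.front().best_configuration;
    double best_distance = std::numeric_limits<double>::infinity();
    for (const Sample& s : training) {
        assert(s.meta.size() == query.size());
        double distance = 0.0;
        for (std::size_t i = 0; i < query.size(); ++i)
            distance += relative_difference(s.meta[i], query[i]);
        if (distance < best_distance) {
            best_distance = distance;
            best = s.best_configuration;
        }
    }
    return best;
}
```

A query that lies far outside the range of the training data is still assigned to the closest known sample, which is one way such a model can remain usable when the extent of the meta data is unknown in advance.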
4.5 Overview
In this section we give an overview of existing auto-tuners. The literature in this area is very extensive, so we clustered it into categories: performance measurement, modeling and simulation (Section 4.5.1), compilers (Section 4.5.2), programming languages (Section 4.5.3), domain dependent (Section 4.5.4) and domain independent approaches (Section 4.5.5), as well as memory access and data layout auto-tuning (Section 4.5.6).
4.5.1 Performance Measurement, Modeling and Simulation
Measuring the performance of hardware is a key component of many auto-tuners. However, most hardware implementations are closed. To reveal implementation details of hardware, Wong et al. [2010] proposed a method based on micro-benchmarking. Mei and Chu [2017] propose a similar micro-benchmarking method to analyze the GPU memory hierarchy. Accuracy is a very important factor in performance measurement. A high profiling overhead can not only be time consuming but also introduce distortions into the measurement. Therefore, Baghsorkhi et al. [2012] provide a framework for profiling memory hierarchy effects in GPU applications. Zheng et al. [2012] proposed a low-overhead profiler for GPUs, called GMProf. Finding an optimal configuration is a key component of auto-tuning. Oliveira Castro et al. [2013] proposed an adaptive profiling technique that concentrates its search on parts of the solution space with high irregularity.
Modeling performance is another important research topic, as it allows predicting how long an execution will take without actually running the code. However, as the internals of hardware are usually proprietary, it is difficult to predict the performance of unknown hardware. Hong [2009] introduced a performance model for CUDA code that is able to reliably predict the performance of the used hardware, but as they do not model cache misses, their method most likely fails on today’s hardware. Baghsorkhi et al. [2010] proposed similar work that explicitly takes global memory latency and shared memory bank conflicts into account. Wang and Chu [2017] use micro-benchmarking and some performance counters to predict the performance of GPU kernels based on the used core and memory frequency. GROPHECY [Meng et al. 2011] estimates the performance of a GPU implementation given a simple CPU skeleton implementation, which helps developers to predict whether their code would run faster when executed on GPUs. Calotoiu et al. [2013] use empirical profiling data and a performance model to find scalability problems in complex parallel programs. Similarly, Shudler et al. [2015] build performance models to predict the scaling behavior of cluster applications. Calotoiu et al. [2016] extend their previous work to create multi-parameter performance models. Other machine learning based performance prediction methods have been proposed by Ipek et al. [2005], Lee et al. [2007], and Wu et al. [2015].
Bakhoda et al. [2009] use a different approach and proposed GPGPU-Sim, which actually simulates CUDA PTX code. This provides an execution time estimate and allows capturing other internals of the execution. Similar work was proposed by Collange et al. [2009]. Other simulators are WASTE¹ or Ocelot². However, these projects are discontinued and no longer maintained, so that current GPU hardware cannot be simulated and new CUDA features cannot be used.
¹ code.google.com/archive/p/cuda-waste/
² gpuocelot.gatech.edu
4.5.2 Compilers
Compilers are widely used in programming to translate textual code into machine code. They apply certain optimizations to the code, based on heuristics, to achieve higher performance through more ILP, better jump predictions or other optimizations. To improve these heuristics, Stephenson et al. [2003], Agakov et al. [2006], Fursin et al. [2008], and Park et al. [2011] use machine learning techniques that help to guide the search for optimal implementations. Long and Fursin [2005] provide a heuristic search algorithm to efficiently locate good optimizations in compiler search spaces.
Automatic parallelization is a wide research topic. There exist different techniques for multi-core (OpenMP) or accelerator (OpenACC) programming, which require the code to be annotated with instructions on how to parallelize it. Iwainsky et al. [2015] investigated scalability properties of OpenMP applications on different hardware and using different compilers from various vendors. Approaches such as the one presented by Wang et al. [2015] use OpenMP code to generate optimized OpenCL code that can be executed on an accelerator. They further use a predictor to estimate whether running the code on a GPU will be faster than on a CPU and automatically select the optimal device. Kim et al. [2016] proposed a similar framework that explicitly tries to remove unnecessary memory transfers. Afonso et al. [2016] automatically generate OpenCL from annotated Android Java code for mobile devices. OmpSs³ is an approach to extend OpenMP for heterogeneous devices. Elangovan et al. [2015] auto-tune code that was generated using OmpSs-OpenCL. Lee and Vetter [2014] published OpenARC, an OpenACC compiler that aims at automatically optimizing the resulting parallel implementation. Siddiqui et al. [2014] auto-tune OpenACC parameters using data from prior executions. Lashgar and Baniasadi [2016] investigate the effectiveness of different OpenACC cache directives and draw conclusions on when to use which settings. Calore et al. [2016] use OpenACC to achieve performance portability of accelerated lattice Boltzmann applications. Hybrid Multicore Parallel Programming (HMPP) [Dolbeau et al. 2007] is a directive-based language that also aims at automatic parallelization of serial code for GPUs. Grauer-Gray et al. [2012] auto-tune the directives of HMPP to achieve high performance.
Going even further, Marangoni and Wischgoll [2016] proposed DawnCC, a compiler that has the goal of generating OpenACC code from serial C code without any annotations. However, their results indicate that, except for one benchmark, the parallelization yields no real improvement. In some cases their method even decreases the performance. Commercial products such as the Silexica SLX Parallelizer⁴ also automatically parallelize serial C code. Barman et al. [2011] automatically synthesize parallel code for scan based algorithms. Langdon et al. [2016] use genetic programming to generate parallel code from serial C code. togpu [Marangoni and Wischgoll 2016] automatically translates CPU C++ code to CUDA.
³ pm.bsc.es/ompss
4.5.3 Programming Languages
Besides implementing auto-tuners as libraries or frameworks, some approaches define their own optimizable programming language. The advantage of this method is that optimizations can be defined directly in the language. However, the drawback is that the program needs to be written in this particular language. These languages usually have restrictions or requirements that can make it difficult to implement certain algorithms, as they do not allow all combinations of algorithm and language constructs that are available in other high-level languages.
An example of these optimizable languages is Sequoia [Fatahalian et al. 2006], which breaks down memory access onto different levels, making it easy to map memory accesses onto different levels of memory, as is the case, e.g., for GPUs or special purpose processors. Ansel et al. [2009] proposed PetaBricks, a language that natively supports providing multiple implementations of the same algorithm on CPUs. Its compiler not only automatically selects the best working implementation, but also auto-tunes parameters if possible. This was used by Chan et al. [2009] to optimize implementations of the multi-grid technique that is used to solve partial differential equations. In Ansel et al. [2011] a new evolutionary algorithm was presented that optimizes PetaBricks based applications and finds solutions much faster than other approaches. Pacula et al. [2012] proposed an in-situ method to optimize PetaBricks based applications. Hall et al. [2009] define transformation recipes, which can be used to transform loops and optimize their execution behavior through auto-tuning. Han and Abdelrahman [2009] and Han and Abdelrahman [2011b] propose hiCUDA, a directive-based language extension to enable automatic CUDA kernel generation from serial code, which can be seen as a predecessor of OpenACC. PyCUDA and PyOpenCL [Klöckner et al. 2011] allow running CUDA or OpenCL applications from Python. Copperhead [Catanzaro et al. 2010] is based on this, as it uses a Python based language to provide data parallel instructions and building blocks to facilitate commonly used execution primitives on GPUs. Rudy et al. [2011] proposed CUDA CHiLL, a parallel language that maps down to parallel building blocks and instructions, which allow generating CUDA code. Khan et al. [2013] use CUDA CHiLL to apply code transformations to achieve highly efficient CUDA code. Devito et al. [2013] proposed Terra, a Lua-based language that enables writing highly efficient parallel GPU code. TANGRAM [Chang et al. 2016] synthesizes efficient portable code based on code skeletons that use specific qualifiers, primitives and containers. Heterogeneous Habanero-C (H2C) [Majeti et al. 2016] uses an abstract language to optimize code for parallel execution and explicitly optimizes the usage of data layouts. Pony⁵ is an object-oriented, actor-model based programming language for HPC applications.
⁴ www.silexica.com
Other languages are designed directly for a specific application domain. For example, Hong et al. [2012] proposed Green-Marl, which enables writing efficient graph analysis algorithms. Vollmer et al. [2015] proposed an auto-tuning framework for optimizing functional programs written in Obsidian, a Haskell based GPU programming language.
4.5.4 Domain Dependent Auto-Tuning
There are many application domains that can benefit from auto-tuning. Matrix multiplications have been targeted in many auto-tuning publications, as they are a major building block in many applications. These can be categorized into Dense Matrix-Vector Multiplication (GEMV) [Sorensen 2012], Sparse Matrix-Vector Multiplication (SpMV) [Vuduc et al. 2005; Choi et al. 2010; Guo and Wang 2010; Guo et al. 2011; Tang et al. 2015; Zhang et al. 2016] and Dense Matrix-Matrix Multiplication (GEMM) [Whaley and Dongarra 1998; Kurzak et al. 2012; Matsumoto et al. 2012; Steuwer et al. 2016; Veras et al. 2016]. These approaches mainly try to optimize different algorithms, storage formats, thread and data mappings or launch configurations. Bell and Garland [2008] introduced multiple implementations for efficient SpMV on GPUs for a variety of sparse storage formats, without automatic selection of optimal layouts. This automatic selection was later added by Muralidharan et al. [2014]. Beaumont et al. [2016] proposed a framework for efficient scheduling of linear algebra kernels on heterogeneous hardware.
Stencil codes are another important operation in many applications. Approaches such as those presented by Datta et al. [2008], Kamil et al. [2010], Monakov et al. [2010], Christen et al. [2011], Lutz et al. [2013], Zhang and Mueller [2013], Luo et al. [2015], and Jia and Zhou [2016] provide auto-tunable implementations for common stencil operations that are adjusted towards the given computational problem and hardware.
Halide [Ragan-Kelley et al. 2012; Ragan-Kelley et al. 2013] is a domain specific language for image processing pipelines. It supports common operations as building blocks, which make the code easy to understand. Similar work was done by Yang et al. [2016], but instead of using their own language, they rely on Python.
⁵ ponylang.org
Other projects target large tridiagonal systems [Davidson et al. 2011], Digital Signal Processing (DSP) [Püschel et al. 2004], Discrete Fourier Transformation (DFT) [Frigo and Johnson 1998; Frigo 1999; Frigo and Johnson 2005], ray-tracing [Ganestam and Doggett 2012], hypernuclear spectroscopy [Bianchin et al. 2008; Bajrovic et al. 2013], graph based algorithms [Pai and Pingali 2016], or Neural Networks (NNs) [Li et al. 2016b; Moskewicz et al. 2016; Imani et al. 2017].
4.5.5 Domain Independent Auto-Tuning
Besides optimal performance, energy consumption is an important optimization objective. Hoffmann et al. [2010] proposed Application Heartbeats, an API for developing auto-tuners. It provides an interface for passing monitored performance data from the application to the auto-tuner and execution parameters from the auto-tuner to the application. PowerDial [Hoffmann et al. 2011] aims at reducing computational accuracy for power efficiency and can use the Heartbeats API. Coplin and Burtscher [2015] investigate the effects of source-code optimizations for GPUs concerning performance and energy consumption. Bao et al. [2016] proposed a compile-time based CPU frequency selection that outperforms runtime based approaches. Sensi et al. [2016] select optimal configurations for either performance or power consumption without relying on previous application runs. Some auto-tuners combine the energy consumption and performance objectives and try to find an optimum that satisfies both. For this, Jordan et al. [2012] introduced a multi-objective optimization infrastructure that generates multi-versioned executables, which makes it possible to select optimal configurations during runtime. Durillo and Fahringer [2015] provide an overview of the term “auto-tuning” and discuss the advantages and problems of multi-objective auto-tuning. The AutoTune⁶ project aims at providing auto-tuning for cluster applications. As part of the project, Miceli et al. [2013] proposed the Periscope Tuning Framework (PTF), a plug-in driven framework that aims at providing a solid base structure for auto-tuning of HPC applications. Preliminary results of their research have been presented in Miceli and Bodin [2013]. Pimenta et al. [2013], Liu et al. [2014], and Sikora et al. [2016] optimize the usage of the Message Passing Interface (MPI) in cluster applications using PTF. The Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing (READEX) project [Gerndt 2016] is a successor of the AutoTune project and explores the potential of dynamic auto-tuning for energy saving in HPC environments. Other energy optimizing auto-tuning techniques have been proposed by Götz et al. [2010], Tiwari et al. [2011], and Tomusk et al. [2016].
⁶ www.autotune-project.eu
Some auto-tuners are not designed for a specific application. Many of them provide a framework that is able to find optimal parameter values for the given application. Hollingsworth and Keleher [1998] introduced the Active Harmony framework, which is designed to tune parameters in distributed systems. In this initial version the authors used greedy empirical profiling to find optimal parameters, which was replaced with the Nelder-Mead method by Tapus et al. [2002]. Chung and Hollingsworth [2004] added a parameter prioritization to better guide the search in high dimensional search spaces. Further, they keep a record of previous profiling data and feed it into the optimization process. Tiwari et al. [2009] used Active Harmony to search for optimal parameters for the CHiLL compiler framework [Rudy et al. 2011]. Chang and Karamcheti [2001] provided an auto-tuner to optimize distributed applications. Specifically, they allow adjusting parameters during runtime according to user defined constraints. A similar idea was proposed by Bhat et al. [2006], who also optimize distributed applications that run on remote supercomputing sites. The GADAPT auto-tuner [Liu et al. 2008] bases its optimization decisions on a heuristic-based empirical search and builds decision models that decide, based on the program parameters, which implementation to choose. For this, they compile the application in multiple different configurations and use a dispatcher that selects and executes the optimal configuration. The OpenTuner [Ansel et al. 2014] framework is a Python based auto-tuner designed to optimize compiler flag values. Initially it was used to optimize GCC compiler flags, but it can also be used for other compilers such as the NVIDIA CUDA Compiler [Bruel et al. 2015] or to enable preprocessor based optimizations. CLTune [Nugteren and Codreanu 2015] is a similar framework, specifically targeting OpenCL applications. Muralidharan et al. [2014] proposed the Nitro auto-tuning framework. Nitro uses an exhaustive search with empirical profiling and is therefore mainly intended for applications with only a few possible configurations. It features a callback mechanism to feed user defined meta data into the SVM based decision system. One of their benchmarks chooses optimal layouts for SpMV based algorithms. In Muralidharan et al. [2016a] the framework is extended to predict optimal configurations for unknown GPUs, based on micro-benchmarking. Tillmann et al. [2013] introduced ATuneRT, which aims at optimizing parameters in applications, e.g., the launch configuration of kernels. In Tillmann et al. [2016] they apply it to a KD-Tree building and ray casting pipeline for real-time rendering with an in-situ optimization. In this example they optimize, e.g., the costs for the KD-Tree building heuristic.
Other projects go one level lower. These do not tune specific parameter values, but explicitly apply optimizations to the implementation. Han and Abdelrahman [2011a] targeted branch divergence in GPU applications by either delaying an execution inside a loop or by aggressive branch transformations. In Han and Abdelrahman [2013] they reduce branch divergence by merging loops. However, both methods require the optimizations to be applied manually. Magni et al. [2014] provide a compiler chain that applies thread coarsening to reduce similar calculations and optimize resource usage on GPUs. Xu and Gregg [2015] use a compiler that employs hyper loop parallelism to merge multiple SIMD executions into one thread. The framework of Gao and Peterson [2015] analyzes shared memory bank conflicts and allows automatically optimizing the shared memory access. Li et al. [2015] try to achieve higher performance by explicitly bypassing caches on GPUs. They determine a certain threshold and only allow thread groups within this threshold to use the caches, while the others circumvent them and directly access the off-chip memory. Moreira et al. [2017] perform call re-vectorization for SIMD platforms such as CPUs and GPUs. This method wakes up dormant threads to collaborate, e.g., in a memcopy.
In recent years, executing code on heterogeneous devices has become very popular. This allows scheduling tasks onto the devices that are better suited for a specific algorithm. Dandelion was proposed by Rossbach et al. [2013]. It uses a C# or F# implementation of an application and maps predefined parallel building blocks to generate device code. This can then be mapped onto different heterogeneous devices, such as CPUs, GPUs, FPGAs or cloud applications. The Heterogeneous Programming Library (HPL) [Viñas et al. 2013] makes it easy to run OpenCL kernels on different kinds of hardware. It allows writing code similar to the CUDA Runtime API. Fabeiro et al. [2014] extend HPL to allow auto-tuning of parameters. Viñas et al. [2016] added analytical models for optimal workload balancing. Gadioli et al. [2014] proposed a similar framework, targeting heterogeneous OpenCL applications with auto-tuning and runtime resource management. Jääskeläinen et al. [2014] introduced an OpenCL performance portability optimizing compiler that modifies kernels in a way that they work optimally on the given architecture. Paone et al. [2014] propose a runtime resource management system and conclude that this objective can be orthogonal to auto-tuning objectives. Grasso et al. [2013] provide a compiler that takes a single-device OpenCL application and transforms it into a multi-device application. They leverage a machine learning based prediction model using static program and dynamic input features to predict an optimal partitioning. Helium [Lutz 2015] is an OpenCL framework that optimizes the usage of OpenCL API functions by postponing, gathering or even removing API calls to optimize the overall execution time. Bodin et al. [2016] proposed Diplomat, which generates an optimal mapping of kernels onto CPUs and GPUs. Bolchini et al. [2016] published an operating system based resource scheduler for scheduling OpenCL applications on heterogeneous hardware. Cheng et al. [2017] auto-tune the task scheduling of heterogeneous MapReduce cluster applications. As more and more heterogeneous computers and clusters are used, many approaches proposed frameworks to assist the development, optimization and scheduling of such applications [Bauer 2014; Bajrovic and Benkner 2014; Chang et al. 2016; Fachada et al. 2016; Gray and Stratford 2016; Hechtman et al. 2016; Helal et al. 2016; Muralidharan et al. 2016b; Panneerselvam and Swift 2016; Rossi and Zhou 2016; Srivastava et al. 2016; Zenker et al. 2016; Inggs et al. 2017; Yamato 2017].
As data partitioning in multi-device and heterogeneous-device applications is very important, as is optimizing the memory transfer between devices, MAPS [Rubin et al. 2014; Ben-Nun et al. 2015] was proposed. It is a framework aiming at data abstraction and partitioning for single- and multi-GPU applications. It is specifically designed to map onto commonly used operations in GPU programming. Sakai et al. [2016] automate mapping single-GPU applications onto multiple GPUs. SkePU [Enmyren and Kessler 2010] uses code skeletons for multi-GPU programming. Arslan et al. [2016] proposed HARP, a prediction-based auto-tuner for application-level data transfer, which uses historical data analysis and real-time probing for its decision-making. dCUDA [Gysi et al. 2016] allows device-to-device memory access in CUDA applications on a cluster level with automatic latency hiding. Similar work is done by Tausche et al. [2016] for OpenCL.
4.5.6 Memory Access and Data Layouts Auto-Tuning
Optimal memory access is key to optimal performance, especially on GPUs, as discussed by Ryoo et al. [2008]. Strzodka [2011] provides specialized C++ containers that make it easy to switch between different AoS layouts, as well as multi-valued containers [Strzodka 2012]. Edwards and Trott [2013] pursue a similar path by providing adjustable containers. However, the optimizations of these approaches have to be applied by the user and are not automated. Sung et al. [2012] proposed DL, a framework that tunes the memory layouts in GPU applications. It uses static decision rules to determine which data layout is used. They provide an efficient data conversion prior to kernel runs if the data is not in the correct format. Hsu et al. [2014] extended this approach by proposing a hardware conversion unit that converts the data format during the memory transfer. Catanzaro et al. [2014] introduced a method for efficient in-place matrix transposition on GPUs. Kofler et al. [2015] auto-tune access to struct arrays in OpenCL applications, including a tile size selection for AoSoA. Peng et al. [2016] propose a static cost estimation to select optimal struct array layouts. These two are probably the closest auto-tuners to our approach. In contrast, we do not rely on static analysis but use empirical profiling. Further, MATOG is able to apply significantly more optimizations, e.g., transpositions, memory placement, or selection of optimal L1 cache sizes.
Other approaches do not optimize on the high level of data layouts, but directly target the actual implementation inside the code. Cruz et al. [2016] provide a mechanism to improve memory page hit rates by analyzing the memory access behavior of parallel applications. Li et al. [2016a] optimize the usage of on-chip memory resources in CUDA applications. Ainsworth and Jones [2017] provide a compiler that enables software-based prefetching for indirect memory accesses. Yang et al. [2010] provide a source-to-source compiler that takes a naïve implementation and applies certain optimizations to enforce coalescing of memory accesses, thread regrouping and automatic usage of shared memory. Han and Abdelrahman [2014] train a machine-learning model to predict whether an array should be buffered in shared memory. Michaud [2016] provides a mathematical proof of cache replacement effects and a new algorithm for reasoning about optimal cache replacements.
Besides changing the code, the hardware could also be changed. Park et al. [2015] proposed Earliest Load First (ELF), an alternative warp scheduling method that tries to maximize the memory-level parallelism on GPUs. Dublish et al. [2016] propose a change to the L1 caches in GPUs to employ a cooperative caching network, allowing data in these caches to be reused efficiently. Ziabari et al. [2016] proposed a unified hardware-based memory hierarchy for CPU/GPU systems.
Chapter 5
MATOG Auto-Tuner
MATOG is a recursive acronym and stands for “MATOG: Auto-Tuning on GPUs”. It is an application domain independent array layout auto-tuner for CUDA applications. We first introduced MATOG in Weber and Goesele [2014] and have constantly improved it since then. MATOG is open source and can be downloaded from our project page¹. The main concept of MATOG is that it abstracts the memory access from the application and dynamically chooses optimal array layouts during runtime. To achieve this, MATOG needs to run and analyze the application with real data. During this profiling it measures the execution time of different data layouts, tracks meta data, and stores the results in a database (Section 6.2). These profiling results are then analyzed (Section 6.3) and used to build decision models that are utilized during runtime to select optimal configurations according to changing data properties (Section 6.4). As shown in Figure 5.1, there can be different optimal configurations of a kernel, depending on the input data. We automatically identify these data properties and adjust the layouts accordingly.
MATOG optimizes array access for one- or multi-dimensional arrays of primitive, struct and hierarchical types. Arrays of pointers are currently not supported. It arranges these arrays in different struct layouts, i.e., AoS, SoA or AoSoA (with a tile size of 32, the size of a CUDA thread group). We do not support separating the fields of an AoS and storing them in different layouts. Further, multi-dimensional arrays can be stored in a transposed way. The array layouts can be optimized separately for global, shared and local memory. For read-only arrays residing in global memory, MATOG determines which of these arrays should use the default and which the non-coherent cache (also known as texture memory) to optimize the cache utilization. Arrays with constant size (known at compile time) can be placed in constant memory. On Fermi and Kepler GPUs the optimal ratio between L1 cache and shared memory size is determined automatically. Additionally, the programmer can define preprocessor optimizations to use discrete values and value ranges in the code. This can be used to implement different algorithms for a task, or to evaluate a series of values as a parameter of an algorithm, as shown in Listing 5.1. Experienced users can further customize MATOG by defining their own indexing schemes, e.g., to implement a special triangular matrix or z-order curves, or by providing their own allocation/deallocation mechanism in order to combine MATOG with other frameworks.
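The three struct layouts differ only in how an (element, field) pair is mapped to a linear offset. The following host-side C++ sketch illustrates this for a one-dimensional array. The function names and the simplification to equally sized fields are assumptions made for illustration; this is not the code that MATOG generates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers: offset (counted in field slots) of field `f` of
// element `i`, assuming that all fields have the same size.

// AoS: all fields of one element are adjacent.
inline std::size_t offsetAoS(std::size_t i, std::size_t f, std::size_t numFields) {
    return i * numFields + f;
}

// SoA: all values of one field are adjacent; `n` is the number of elements.
inline std::size_t offsetSoA(std::size_t i, std::size_t f, std::size_t n) {
    return f * n + i;
}

// AoSoA: elements are grouped into tiles; inside a tile the layout is SoA.
inline std::size_t offsetAoSoA(std::size_t i, std::size_t f,
                               std::size_t numFields, std::size_t tile = 32) {
    assert(tile > 0);
    const std::size_t tileIdx = i / tile;  // tile that contains element i
    const std::size_t inTile  = i % tile;  // position of element i inside its tile
    return tileIdx * tile * numFields + f * tile + inTile;
}
```

With a tile size of 32, the threads of one thread group that access the same field of consecutive elements touch consecutive slots, as in SoA, while all fields of an element stay within one tile.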
¹ matog.org
[Figure 5.1: plot of the relative execution time (y-axis, 0.0 to 1.0) per configuration for Dataset 1 and Dataset 2]
Figure 5.1: Execution time for two kernel runs with different data. All results have been normalized between 0.0 (best) and 1.0 (worst), and are sorted in ascending order for the first call (black). Configurations that were optimal for the first call became the worst in the second (blue) execution. The shown kernel is the first main kernel of our KD-Tree benchmark. The big jumps come from using local memory as a buffer for shared memory. This works well for the black iteration, but not for the blue one.
5.1 Programming Interface
The MATOG programming interface was designed to reduce programming overhead and to be as compatible with CUDA as possible. To support platform independence and easy maintainability, we decided to use code generation instead of a custom source-to-source compiler, so no changes to the compile chain are required. MATOG requires only small changes to the source code as it mimics the CUDA Driver API [NVIDIA 2016a] interface. It is necessary to use the Driver API instead of the Runtime API, as the latter does not allow kernel implementations to be loaded and exchanged dynamically during runtime. To access data, MATOG supports multi-dimensional memory access (e.g., array[x][y].subarray[z].field), similar to the code skeletons used by GROPHECY [Meng et al. 2011]. To apply its optimizations, it intercepts all communication with CUDA.
5.2 Programming Example
In order to use MATOG arrays, the programmer has to provide a JavaScript Object Notation (JSON) description of the arrays (example shown in Listing 5.2), which is then used to generate the optimization code. This code can then be included into the application. The main contribution of MATOG is a dynamic selection of optimal parameters at runtime. However, MATOG also automatically performs some trivial static optimizations, e.g., it ensures an optimized alignment inside the arrays to avoid alignment padding (example shown in Figure 5.2). Further, the user can specify for each kernel how data is used (only read, only written or read and
    // Define Example: values = [15, 42]
    #if MY_DEFINE == 15
        // algorithm 1 ...
    #elif MY_DEFINE == 42
        // algorithm 2 ...
    #else
        #error NOT_IMPLEMENTED
    #endif

    // Range Example: min: 32, step: 32, max: 512
    __shared__ int bins[MY_BIN_COUNT];
    for(...) {
        int binIndex = (int)(value[i] / MAX_VALUE * MY_BIN_COUNT);
        //...
    }
Listing 5.1: Example code to show the define/range feature. The first code snippet shows how different implementations can be differentiated using a preprocessor define. The second shows how a range of values can be evaluated.
write), which enables certain optimizations. This is explained in more detail in Sections 5.3.3 and 6.3. The programmer can specify whether multiple arrays always have the same size. This reduces the number of used registers, as normally every MATOG data structure maintains a separate copy of its sizes. Finally, compiler flags can be specified to enable GPU debugging or specific features (e.g., --use_fast_math). There are three types of MATOG data structures that can be used in the host code: one for arrays located in host memory (Array::Host), one representing dynamic shared memory (Array::Dyn) and one for device memory (Array::Device). Data inside host arrays can be accessed directly. Data transfers are done by the CUDA memcopy functions. For kernel modules, the source code has to be provided instead of a pre-compiled CUDA module. MATOG takes care of the actual compile process in a just-in-time manner. Nothing else has to be changed for loading and executing GPU code, as MATOG array references can simply be put into the kernel argument list. Listing 5.3 shows a host code example, with the changes compared to the CUDA Driver API indicated. Inside the kernel code, each memory type uses a separate implementation. As multiple instances of global or dynamic shared memory arrays can have different layouts, a template is used to distinguish the different instances. Only for static shared memory types, the same layout is used for all instances of the same type. Listing 5.4 shows an example of kernel code. Figure 5.3 shows a schematic workflow of MATOG. In the development phase the user specifies the optimizable data structures, which
Left (C++):

    typedef struct {
        long rand;
        float result;
    } MyStruct[3][7];

Right (MATOG JSON):

    { "name": "MyStruct",
      "counts": [0, 3, 7],
      "fields": [
        {"name": "rand", "type": "long"},
        {"name": "result", "type": "float"}
      ]}

Listing 5.2: Left: Example data structure in C++. Right: Same data structure in MATOG JSON notation.
Example Struct:

    struct {
        double a; // 64bit
        short b;  // 16bit
        double c; // 64bit
        char d;   // 8bit
        int e;    // 32bit
    }

Naïve Alignment (declaration order): a, b, c, d, e
Optimized Alignment: a, c, e, b, d
Figure 5.2: An example struct (left) and how it would usually be stored in CUDA (center). MATOG uses an optimized layout (right) since GPUs require n-byte variables to be n-byte aligned.
are then generated by the code generator. The user then incorporates these into the application and compiles it using a standard host compiler (e.g., GCC or Visual Studio). Listing 5.5 shows an example of using a MATOG-based application and how to execute the optimization procedure. By default, MATOG runs in an unoptimized mode. To optimize the application, MATOG has to be switched into a profiling mode that measures the execution time of the kernels. Every time a kernel is executed during optimization, MATOG runs multiple implementations of the same kernel to determine which layouts work best. After the profiling, MATOG analyzes the results and builds decision models. These are used to determine optimal layouts during runtime.
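The static alignment optimization illustrated in Figure 5.2 can be sketched as follows. The sketch assumes that every n-byte field must start at a multiple of n and that the optimized layout simply orders the fields by decreasing size, which matches the order a, c, e, b, d shown in the figure. The function names are hypothetical, and the actual MATOG code generator may use a different strategy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Size in bytes of a struct whose fields (given by their sizes) are stored in
// the given order. Each n-byte field is aligned to n bytes and the total size
// is rounded up to a multiple of the largest field size.
inline std::size_t paddedSize(const std::vector<std::size_t>& fieldSizes) {
    std::size_t offset = 0, maxAlign = 1;
    for (std::size_t s : fieldSizes) {
        assert(s > 0);
        offset = (offset + s - 1) / s * s;  // round up to the next multiple of s
        offset += s;
        maxAlign = std::max(maxAlign, s);
    }
    return (offset + maxAlign - 1) / maxAlign * maxAlign;
}

// Assumed optimization: store the largest fields first to avoid padding.
inline std::size_t optimizedSize(std::vector<std::size_t> fieldSizes) {
    std::sort(fieldSizes.begin(), fieldSizes.end(), std::greater<std::size_t>());
    return paddedSize(fieldSizes);
}
```

For the example struct of Figure 5.2 (field sizes 8, 2, 8, 1 and 4 bytes), the declaration order occupies 32 bytes under these rules, whereas the reordered struct occupies 24 bytes.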
5.3 Implementation Details
This section gives more insight into the implementation of MATOG and, in particular, its data structures. The main concept behind MATOG data structures is to provide a class that offers a multi-dimensional AoS-like memory access (e.g., array[x][y].field) to the user and internally maps it onto an optimized layout. As
    /************************* allocate data *************************/
    int X = ..., Y = 3, Z = 7;
    // CUDA:  MyStruct* host = new MyStruct[X * Y * Z];
    MyStruct::Host host(X, _fl);                                  // MATOG
    // CUDA:  CUdeviceptr device = cuMemAlloc(sizeof(MyStruct) * X * Y * Z);
    MyStruct::Device device(X, _fl);                              // MATOG

    /*************************** load data ***************************/
    for(...) {
        // CUDA:  host[x + y * X + z * X * Y].rand = rand();
        // CUDA:  host[x + y * X + z * X * Y].result = 0.0f;
        host[x][y][z].rand = rand();                              // MATOG
        host[x][y][z].result = 0.0f;                              // MATOG
    }

    /************************** load module **************************/
    CUmodule module;
    // CUDA:  cuModuleLoad(&module, "preCompiledCode.ptx");
    cuModuleLoad(&module, "sourceCode.cu");                       // MATOG

    /************************* load function *************************/
    CUfunction function;
    cuModuleGetFunction(&function, module, "kernel");

    /***************************** exec ******************************/
    void* args[] = {&paramA, &paramB, &device, 0};
    cuLaunchKernel(function, ...);

Listing 5.3: Host code example: CUDA Driver API (lines marked "// CUDA:") compared to MATOG (lines marked "// MATOG").
MATOG supports storing data in hierarchical data structures, all sub-structures can use a different indexing and struct layout, so that, e.g., the root can be stored as an untransposed AoS while the sub-array is stored as a transposed SoA. However, we decided that sub-arrays cannot be stored as AoSoA because, depending on the tile size and number of elements, this layout wastes memory. As sub-arrays appear hundreds or thousands of times in the root array, this waste would be enormous. On the host system, MATOG uses a dynamically adjustable implementation, so that the host code does not need to be recompiled to switch the layouts. While this is very convenient to use, it can slightly decrease the performance of the CPU code compared to a purely static implementation. In the device code, we explicitly compile the exact layout into the code to achieve maximal performance. To provide the necessary information to the GPU code, we modify the kernel argument list and pass a struct to the GPU that has the same signature as the GPU
    /************************* function body *************************/
    // CUDA:  the last parameter is "MyStruct* data"
    __global__ void kernel(const float paramA,
                           const int paramB,
                           MyStruct<> data)                       // MATOG
    {
        const int x = threadIdx.x, y = threadIdx.y, z = threadIdx.z;
        float result = 0.0f;

        /********************** define shared memory **********************/
        // CUDA:  __shared__ MyStruct shared[128*Y*Z];
        __shared__ MyStructShared<128> shared;                    // MATOG

        /***************** global memory » shared memory *****************/
        // CUDA:  shared[x + y * X + z * X * Y] = data[x + y * X + z * X * Y];
        shared[x][y][z] = data[x][y][z];                          // MATOG
        ...

        /******************** register » global memory *******************/
        // CUDA:  data[x + y * X + z * X * Y].result = result;
        data[x][y][z].result = result;                            // MATOG
    }

Listing 5.4: Kernel code example: CUDA (lines marked "// CUDA:") compared to MATOG (lines marked "// MATOG").
data structure implementation and contains all necessary information, such as the pointers and sizes. However, one disadvantage of the way the data structures are designed is that we cannot make use of the restrict keyword and are therefore prone to pointer aliasing [Cook 1997].
5.3.1 Texture Memory
As described in Section 3.2, the way texture memory can be used has changed significantly throughout the GPU generations. MATOG employs different implementations depending on the GPU generation. On Fermi and first-generation Kepler cards, we use texture references. To circumvent the limitation of 2^27 items per texture, we concatenate multiple textures. Unfortunately, CUDA does not allow arrays of texture references to be created, so normally it is necessary to use if-statements to distinguish between different textures. This, however, would result in thread divergence, making this approach unusable. Combining a detailed study of the available documentation with an experimental code analysis, we have been able to implement a method that can select the correct texture without the need for an if-statement. This is based on the fact that texture references
[Figure 5.3: workflow diagram. Elements: JSON, MATOG code generator, GPU and CPU data structures, user GPU and CPU code, host compiler (e.g., GCC or VS), MATOG runtime library (linked), executable, file system with kernel_1.cu ... kernel_N.cu. Legend: user, MATOG, compiler]
Figure 5.3: Schematic workflow of MATOG. Orange indicates steps that are performed by the user, green is automated by MATOG and blue is done by the host compiler.
are actually pointers to a point in memory where all information about the texture resides. This texture information is stored adjacently, so that the textures can actually be switched by adding a certain offset to the pointer. For Fermi GPUs this offset is calculated by texture(i) = rootTexture · 0x800000 · i and on Kepler by texture(i) = rootTexture − i. The order of the textures is determined by the order in which they are specified in the source code. However, this alone does not work, as the compiler recognizes when texture references are not explicitly referenced in the code and removes them. To prevent this, we specify an init method that references all possible texture references using a read operation, but we skip the actual execution of this part with a goto-statement. As a result, the init method is never actually executed, but the compiler keeps the texture references. As texture memory does not allow 64-bit values to be read, we use the vector functionality in this case, reading two 32-bit values at once and casting them to the corresponding 64-bit value. Starting with the second-generation Kepler GPUs, NVIDIA added the __ldg(ptr*) command, which allows texture memory to be accessed directly without texture references, which makes it much easier to use and significantly reduces the engineering overhead. As MATOG does not allow direct access to the CUDA module or function objects, it is not possible to use user-defined texture references. However, since the first generation of the Kepler architecture, unbound textures are supported, which are passed to the kernel as arguments. For Fermi GPUs there is no such solution available.
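The reassembly of a 64-bit value from two 32-bit words, as used for texture reads, can be illustrated on the host side. The following C++ sketch is only an illustration of the bit manipulation: the function names are hypothetical, and it is assumed that the low word is stored in the first component and the high word in the second.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits wide");

// Split a double into two 32-bit words, as they would be stored in a
// two-component 32-bit texel (assumed order: low word first).
inline void splitDouble(double v, std::uint32_t& lo, std::uint32_t& hi) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof(bits));  // well-defined bit copy
    lo = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    hi = static_cast<std::uint32_t>(bits >> 32);
}

// Reassemble the double from the two 32-bit words.
inline double joinDouble(std::uint32_t lo, std::uint32_t hi) {
    const std::uint64_t bits = (static_cast<std::uint64_t>(hi) << 32) | lo;
    double v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}
```

In CUDA device code, the intrinsic __hiloint2double(hi, lo) performs the same reassembly for a value that was fetched as two 32-bit integers.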
1 ## -------------------- execute optimization ------------------- ##
2 export MATOG_PROFILING=1 # enable profiling
3 ./myApp training_0 ... # run application one or multiple times
4 ./myApp training_1 ...
5 ...
6 unset MATOG_PROFILING # disable profiling
7 matog-analyze # run analyzer
8
9 ## ---------------------- run application ---------------------- ##
10 ./myApp param_0 param_1 param_2 ...
Listing 5.5: Example for running a MATOG application, including the optimization process. If lines 1-7 are omitted, the application will run unoptimized using default implementations for the data structures.
[Figure 5.4: shared memory segment, split into a static and a dynamic part. Legend: User Array, MATOG Array]
Figure 5.4: Illustration of the shared memory placement. For static shared memory, MATOG data structures can be mixed with user-defined arrays. For dynamic shared memory, MATOG ensures that all MATOG data structures are placed at the end, so that the user can use the first part of the dynamic shared memory segment for their own data structures.
5.3.2 Shared Memory
For shared memory, MATOG supports two different implementations, one for static and one for dynamic shared memory. The static variant requires all sizes to be known at compile time. As MATOG also supports user-managed dynamic shared memory, it places all MATOG-controlled dynamic shared memory data structures at the end of the dynamic shared memory segment to ensure that they do not interfere with the user-managed segment. This does not have any performance implications, but is intended to make usage easier: the user does not need to be aware of any offsets to obey when using dynamic shared memory. Figure 5.4 illustrates this.
5.3.3 Optimization Hints
We allow a series of compiler hints to be specified to assist MATOG in optimizing the application. One of these hints is the way data is used, where we allow defining whether it is read (R), written (W) and read once (O). Depending on the combination in which these flags are used, we can employ different optimizations.
RW: By default, only global memory is available.
R: Read-only allows texture memory to be used, and the data does not need to be restored after a profiling run.
W: Write-only implies that the data is entirely overwritten, therefore the layout can be switched prior to each execution of this kernel. Further, the data does not need to be restored after a profiling run as it will be overwritten entirely.
RWO: Read-Once-Then-Write will read the entire data once at the beginning of the kernel and then entirely overwrite it, which enables the use of texture memory for reading the data.
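The rules above can be summarized as a small decision function. The following C++ sketch uses a hypothetical flag encoding and hypothetical names; it encodes only what is stated in the list and conservatively assumes that RWO data still has to be restored after a profiling run, which the text does not specify.

```cpp
#include <cassert>

// Hypothetical encoding of the usage hints as bit flags.
enum UsageHint : unsigned { READ = 1u, WRITE = 2u, ONCE = 4u };

struct Allowed {
    bool textureMemory;  // data may be read through texture memory
    bool skipRestore;    // data does not need to be restored after profiling
    bool switchLayout;   // layout may be switched prior to each kernel execution
};

inline Allowed allowedFor(unsigned flags) {
    const bool r = (flags & READ) != 0;
    const bool w = (flags & WRITE) != 0;
    const bool o = (flags & ONCE) != 0;
    Allowed a{};
    a.textureMemory = (r && !w) || (r && w && o);  // R or RWO
    a.skipRestore   = (r && !w) || (w && !r);      // R or W (RWO: assumed false)
    a.switchLayout  = (w && !r);                   // W only
    return a;
}
```

For example, allowedFor(READ | WRITE) enables none of the three optimizations, which corresponds to the default RW case.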
In some cases multiple arrays have the same sizes. In this case the user can specify the underlying relation, which allows the memory size counters of one array to be reused instead of each array keeping its own copy. As C++ does not allow access to the private member variables of other class instances, we utilize inline PTX to directly access the necessary elements in the kernel arguments. However, this only improves the performance if the number of used registers is at the border of an occupancy drop. Moreover, a higher occupancy is not guaranteed to result in better performance.
Chapter 6
Application Analysis
To optimize an application using MATOG it is necessary to determine which configurations achieve optimal performance. As MATOG is meant to work for arbitrary NVIDIA GPU architectures and applications from all kinds of application domains, we treat GPUs, kernels and optimizations as black boxes. In particular, we assume that we do not know anything about their specific properties, except for which optimizations can be applied for a given GPU, a particular kernel and the hints given by the programmer. This potential lack of information might complicate the analysis, but guarantees that our concept can be applied to any application on any past, current and future hardware, as long as the optimization-to-hardware relation is modeled and the code can actually be compiled for the specific hardware. In the following section, we formalize our optimization problem. Then we explain our algorithm to find optimal solutions for a particular application on a given GPU. We use a three-step application analysis consisting of an empirical profiling (Section 6.2) to determine which optimizations are optimal, an offline analysis (Section 6.3) to determine the application-wide optimal solution, and a decision model training (Section 6.4) that enables the selection of data-dependent optimal solutions during runtime.
6.1 Optimization Problem
Our optimization problem has multiple optimization dimensions. If an array uses struct types, we can optimize the struct layout ($d_L \in [\text{AoS}, \text{SoA}, \text{AoSoA}]$). For multi-dimensional arrays we can optimize the transposition ($d_T$), where the number of possible values is the factorial of the number of array dimensions. If data is only read from arrays residing in global memory, we can determine whether we use the default or the non-coherent cache hierarchy, or place the data in constant memory ($d_M \in [\text{Default}, \text{Non-Coherent}, \text{Constant}^1]$). On Fermi and Kepler cards we can further determine the size of the L1 cache ($d_{L1} \in [\text{Prefer SM}, \text{Prefer L1}, \text{Prefer EQ}^2]$) by trading it against shared memory. Additionally, we can select different implementations using user-defined preprocessor instructions ($d_D \in [\dots]$) and value ranges ($d_R \in [\text{min}, \text{min} + \text{step}, \dots, \text{max}]$). The number of dimensions for a single kernel can be very high, as a single array can already have up to three optimization dimensions. At the same time, the number of values per dimension is very low. This results in a very high-dimensional optimization problem with a very limited extent per dimension but a very high number of configurations ($C$) for a given kernel ($k \in K$), as this is defined by $|C_k| = \prod_{d \in D} |d|$. This gets even more complex when multiple kernels are used, as this results in $|C| = \prod_{k \in K} |C_k|$ configurations.

1 if size is known at compile time
2 only for Kepler
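To give a feeling for how quickly $|C_k|$ grows, the following C++ sketch counts the configurations of one kernel. The function names and the example kernel (a single read-only, three-dimensional struct array on a Kepler GPU) are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// |C_k|: product of the number of values of every optimization dimension.
inline std::uint64_t kernelConfigs(const std::vector<std::uint64_t>& dimSizes) {
    std::uint64_t n = 1;
    for (std::uint64_t d : dimSizes) n *= d;
    return n;
}

// Number of possible transpositions of an array with `rank` dimensions (rank!).
inline std::uint64_t transpositions(unsigned rank) {
    std::uint64_t f = 1;
    for (unsigned i = 2; i <= rank; ++i) f *= i;
    return f;
}
```

For the assumed example, the dimensions are the struct layout (3 values), the transposition (3! = 6 values), the memory placement (3 values) and the L1 cache size (3 values), so that kernelConfigs({3, transpositions(3), 3, 3}) yields 162 configurations for a single array in a single kernel.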
6.2 Step 1: Application Proﬁling
As processor architectures can change significantly from one generation to another, purely analytical models can break, as new or changed features no longer perform as modeled. Therefore we rely solely on empirical profiling without any code parsing or specific GPU model. This removes the necessity of constantly updating an analytical GPU model with each new generation. However, the high number of configurations mentioned above is a big problem for empirical profiling. These configurations not only have to be executed but also compiled, as the optimizations for the GPU code have to be compiled explicitly to achieve maximal performance. Further, to gather the execution time of the application, it has to be restarted in each configuration, which leads to massive overhead, especially if a large amount of I/O takes place. To tackle these issues we used a specialized profiling technique in Weber et al. [2015], which relies on in-application profiling (similar to the NVIDIA CUDA profiler) and a prediction algorithm that performs a specialized search space pruning. For time measurements we rely on CUPTI (Section 3.1.3). To validate correctness, MATOG not only checks whether configurations can be executed (e.g., do not exceed constant memory limitations or try to use texture memory on non-read-only arrays) but also has a verification mode that can be activated during profiling to verify that all configurations produce the same results.
6.2.1 In-Application Proﬁling
In-application profiling has the advantage (compared to restarting an application) that it does not require repeating any application setup and finalization procedures, which can be very time consuming depending on how much data has to be read or written and from where. In addition, in-application profiling reduces the number of configurations to test from $\prod_{k \in K} |C_k|$ to $\sum_{k \in K} |C_k|$, as we can calculate the execution time for all permutations of kernel configurations and do not need to measure them explicitly. One drawback of this method is that it requires much more memory, as data has to be duplicated so that it can be restored prior to (re-)executing the same kernel in another configuration. Further, it has to be converted into other data layouts. In MATOG we copy the data to the host system prior to each kernel execution and start parallel CPU threads to convert data if necessary. Further, all necessary kernel configurations are compiled in parallel to the GPU profiling to overlap compilation and execution. This is realized using multiple threads, so it does not interfere with the profiling of the kernel execution.
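The reduction from a product to a sum can be made concrete with a short C++ sketch. The function names are hypothetical, and the sketch assumes that the time of a combination of kernel configurations is simply the sum of the individually measured kernel times.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Runs needed if every combination of kernel configurations is measured by
// restarting the application: product of |C_k| over all kernels.
inline std::uint64_t restartRuns(const std::vector<std::uint64_t>& configsPerKernel) {
    return std::accumulate(configsPerKernel.begin(), configsPerKernel.end(),
                           std::uint64_t{1}, std::multiplies<std::uint64_t>());
}

// Kernel executions needed with in-application profiling: sum of |C_k|.
inline std::uint64_t inApplicationRuns(const std::vector<std::uint64_t>& configsPerKernel) {
    return std::accumulate(configsPerKernel.begin(), configsPerKernel.end(),
                           std::uint64_t{0});
}

// Time of one combination, calculated from per-kernel measurements:
// times[k][c] is the measured time of kernel k in its configuration c.
inline double combinedTime(const std::vector<std::vector<double>>& times,
                           const std::vector<std::size_t>& choice) {
    assert(times.size() == choice.size());
    double total = 0.0;
    for (std::size_t k = 0; k < times.size(); ++k) total += times[k].at(choice[k]);
    return total;
}
```

For three kernels with 162, 18 and 6 configurations, restarting would require 17496 runs, whereas in-application profiling measures only 186 kernel executions and calculates the remaining combinations.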
6.2.2 Prediction Based Proﬁling
In Weber et al. [2015] we introduced a prediction-based algorithm that requires only very few samples to estimate the performance of the entire optimization space for a single kernel. It is based on the observation that many optimizations do not influence others. Formally, this means that many dimensions are independent of each other and can therefore be optimized independently. This allows us to use a difference model to predict the execution time of a kernel in a specific configuration ($p(k, c_p)$). For this we require a reference point that we call the base configuration ($c_b$), which can be selected arbitrarily. Further, several additional data points are required, called support configurations ($c_{s,d}$). For these, all values are equal to the base configuration, except for the value of one dimension $d$. To predict the performance of a specific configuration, we take, for every dimension, the support configuration that shares its value in this dimension with $c_p$, use the differences ($\Delta(t_1, t_2) = t_1 - t_2$) of the execution times between support and base configurations, sum them up and add the time of the base configuration.

\[ p(k, c_p) = t(c_b) + \sum_{d \in D} \underbrace{\left( t(c_{s,d}) - t(c_b) \right)}_{\Delta(c_{s,d},\, c_b)} \tag{6.1} \]
However, in reality this does not always work, as not all dimensions are independent of the others. This is caused by the fact that some optimizations, e.g., the L1 cache size ($d_{L1}$), significantly influence the availability of resources and can thus lead to strongly varying hardware utilization. Other dimension types (e.g., layouts or transpositions) are mostly independent and hardly change the amount of used hardware resources. Given this influence, we have to modify the prediction formula. First, we divide all dimensions into two sets. The first contains the independent dimensions ($D_I$) and the second contains all dimensions that have an influence on others, which we call shared dimensions ($D_S$). For each value combination of the $D_S$, we create a separate prediction domain. Only configurations that are located inside this domain can be used to predict configurations in this domain. Put simply, each domain requires its own base configuration and, consequently, its own support configurations. For the profiling this means that we have to execute all $D_S \otimes D_I$ combinations to gather all required
[Figure 6.1: two grids (left and right) over the dimensions L1 Cache (x-axis: SM, L1, EQ) and Layout (y-axis: AoS, SoA, AoSoA)]
Figure 6.1: Left: all $c_b$ (grey) and $c_s$ (blue) that have to be profiled to estimate the performance for all non-profiled configurations (white) for three optimization dimensions. Right: Selection of the $c_s$ and $c_b$ to predict the execution time of $c_p$ (green).
measurements to estimate the performance of all other configurations (shown in Figure 6.1). The number of these configurations is significantly smaller than for an exhaustive search, as can be seen by:

\[ \prod_{d_s \in D_S} |d_s| \cdot \sum_{d_i \in D_I} |d_i| \ll \prod_{d \in D} |d|, \quad \text{with } D = D_S \cup D_I \text{ and } D_S \cap D_I = \emptyset \tag{6.2} \]
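Both sides of Equation 6.2 can be evaluated with a few lines of C++. The function names and the example dimension sizes are assumptions for illustration; the left-hand side is implemented exactly as written in the equation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Sizes = std::vector<std::uint64_t>;

inline std::uint64_t sizeProduct(const Sizes& s) {
    std::uint64_t p = 1;
    for (std::uint64_t v : s) p *= v;
    return p;
}

inline std::uint64_t sizeSum(const Sizes& s) {
    std::uint64_t t = 0;
    for (std::uint64_t v : s) t += v;
    return t;
}

// Left-hand side of Equation 6.2: configurations that have to be profiled.
inline std::uint64_t profiledConfigs(const Sizes& shared, const Sizes& independent) {
    return sizeProduct(shared) * sizeSum(independent);
}

// Right-hand side of Equation 6.2: configurations of an exhaustive search.
inline std::uint64_t exhaustiveConfigs(const Sizes& shared, const Sizes& independent) {
    return sizeProduct(shared) * sizeProduct(independent);
}
```

With one shared dimension of three values (e.g., the L1 cache size) and independent dimensions of 3, 6 and 3 values, the left-hand side evaluates to 36 and the right-hand side to 162. The gap widens quickly as further independent dimensions are added.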
Figure 6.1 shows an example of how the support configurations are selected for the prediction. Note that we showed empirically that this works on NVIDIA GPUs, but we cannot provide a proof of why it works, as the GPUs and the compilers are proprietary and their specifications are not publicly available. We can, however, provide the following argument: Let us assume a very simple kernel with a 5×5 AoS consisting of two fields, a memory access as shown in Figure 6.2, and a device with four hardware threads and a memory controller that can fetch one memory bank with four adjoining items at once. For our theoretical system, each memory bank load requires 10 clock cycles, plus one additional clock cycle for reading memory banks that are not adjoined. Further, let us assume that the data can only be represented as AoS or SoA, and can be stored untransposed or transposed. This results in a total of four configurations. Figure 6.2 shows the memory access for all configurations in the first iteration of the inner loop. Our major goal is to minimize the required clock cycles for all memory loads. As can be seen, SoAT is the best layout with only two lines to be read. To find the optimal layout, we have to sample AoS as base configuration. Further we require
[Figure 6.2 diagram: memory lines of the 5x5 array (fields A and B) for the four configurations AoS, AoST, SoA and SoAT]
struct AoS { int a; int b; };
AoS array[5][5];

int tid   = threadIdx.x; // {0..3}
int bsize = blockDim.x;  // 4

int sum = 0;
for (int y = 0; y < 5; y++)                  // rows
    for (int x = tid; x < 5; x += bsize)     // columns, strided over the threads
        sum += array[y][x].a;                // only field a is read
Figure 6.2: Example for storing a 5x5 AoS. Each line represents a memory line with four items. Blue boxes show the items accessed in the first iteration of the code. Red bars indicate where the data is scattered over the memory. When AoS is used, each iteration of the inner loop (although executed in parallel) has to read a separate, non-connected data line. In contrast, for AoST only three neighboring lines have to be read. SoAT is the best layout, as only two consecutive lines have to be read.
two support configurations, which are AoST and SoA. The measured execution times are t(AoS) = 54, t(AoST) = 30 and t(SoA) = 51 clock cycles for the inner loop. When we apply our predictor (Equation 6.1), we get the predicted execution time p(SoAT) = 27, which is the best result. Although this value differs from the exact value t(SoAT) = 20, it is still a useful prediction. In fact, the difference is caused by the choice of parameters for our artificial example. However, in a real system we would expect a deviation anyway, caused by measurement noise. Figure 6.3 shows a real example for the prediction of a kernel, where we require three base and 36 support configurations to estimate the performance of 5145 non-profiled configurations. The 39 profiled configurations are 0.75% of the total configuration count of 5184. As can be seen, the prediction is not perfect, but it is good enough to select a configuration that is close to the optimal solution.
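The arithmetic of this example can be retraced with a few lines of host-side C++. Equation 6.1 is not repeated in this section, so the sketch assumes an additive form: the base time plus the difference that each support configuration measured relative to the base. This assumption reproduces the value p(SoAT) = 27 stated above; the function name `predictTime` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// ASSUMPTION: additive predictor, inferred from the worked example and not
// copied from Equation 6.1. Each support configuration changes exactly one
// dimension of the base configuration, and the measured changes are summed.
double predictTime(double baseTime, const std::vector<double>& supportTimes) {
    assert(baseTime >= 0.0);
    double predicted = baseTime;
    for (double supportTime : supportTimes) {
        predicted += supportTime - baseTime;  // effect of one changed dimension
    }
    return predicted;
}
```

With the values of the example, `predictTime(54.0, {30.0, 51.0})` evaluates to 54 + (30 - 54) + (51 - 54) = 27 clock cycles.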
[Figure 6.3: two plots of Time (ms) over the sorted configurations; left: all 5184 configurations (60 to 560 ms), right: the best 680 configurations (65 to 90 ms); legend: Measured, Predicted]
Figure 6.3: Left: The measured performance (black) of 5184 configurations sorted from best to worst, and our predicted performance (green) based on measured support (blue cross) and base (red cross) configurations. Right: A closer look at the section of the best configurations (indicated by the black box). It can be seen that the prediction can be noisy and can deviate slightly from the exact results.
6.3 Step 2: Determine Optimal Conﬁgurations
Given the prediction algorithm, we can calculate the optimal configuration for each kernel execution. Now we have to determine an application-wide optimal solution. This is difficult, as arrays that are used in multiple kernels introduce dependency constraints that need to be resolved. In general, there are two ways to handle these: either data is converted between two kernel executions, or it is not. Without knowing how long a kernel will execute on unknown data, we have no means to predict at runtime whether a conversion will yield enough improvement to compensate for its cost, so we decided not to allow any conversions once a layout has been determined. However, we conducted initial experiments on predicting the performance based on automatically generated models and show our results in Chapter 8.
6.3.1 Decisions
As data is allocated at various points during the execution, we may have to applysome data layouts already prior to any kernel executions, e.g., host arrays areusually ﬁlled with data before they are copied to the device. We diﬀerentiatebetween three decision events that require action:
1. whenever a host array is allocated
2. whenever a device array is used in a kernel (if it has not inherited a layout through a prior memcopy)
3. whenever a kernel is executed; non-device memory decisions (e.g., shared/local memory, defines, ranges, ...) can be determined without any constraints
There is one special case: If a device array is marked as write-only, the kernel will overwrite the entire content of the array and is therefore not constrained by the existing layout. In this case MATOG can assign a new layout, resulting in a new decision (Type 2). As an allocation can occur multiple times during the execution (e.g., in a loop), every array is assigned a unique id. This also allows us to use the size of previously allocated arrays as meta data during the decision-making process.
6.3.2 Array Dependencies
To resolve the array dependencies, we previously used an Array Dependency Graph (ADG) [Weber and Goesele 2016]. An ADG is a directed graph that contains nodes for each array allocation, kernel execution and memcopy between host and device. This graph maps onto the life-cycle of arrays during the application execution. While the ADG representation exactly fits the flow of the application, it is difficult to parse. Therefore, we propose a simplified version, called Decision Dependency Graph (DDG). This is also a directed graph, but instead of mapping onto the life-cycle of arrays, it maps onto the decisions made during the execution. There are only two types of nodes in this graph: global and kernel decisions. Global decisions occur whenever the layout of a global memory array needs to be determined (Types 1 and 2). These nodes are connected to kernel decisions (Type 3) by edges. Each edge is labeled with the global decision it originates from and connects all kernels in the sequence in which the array is used. Figure 6.4 shows an artificial example of a DDG with all possible cases. Depending on the structure of the application, the graph does not need to be fully connected and can be a forest of multiple graphs. This also happens when the application has been profiled with multiple different datasets, where each of these profiling results is a disjoint graph. In the following we describe our method on one graph, but it can easily be extended to a forest of graphs.
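One possible in-memory representation of such a graph is sketched below in host-side C++. All type and member names are hypothetical and do not reflect MATOG's actual data structures; the helper merely counts how many disjoint graphs a DDG forest consists of.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

enum class DecisionKind { Global, Kernel };

struct DecisionNode {
    DecisionKind kind;
    std::string label;  // illustrative labels such as "A#1" or "Kernel#3"
};

struct DecisionEdge {
    std::size_t from;          // node index where the edge starts
    std::size_t to;            // next kernel decision that uses the array
    std::size_t globalOrigin;  // global decision this edge is labeled with
};

struct DecisionGraph {
    std::vector<DecisionNode> nodes;
    std::vector<DecisionEdge> edges;
};

// Counts the weakly connected components (the graphs of the forest) with a
// small union-find structure.
std::size_t countComponents(const DecisionGraph& g) {
    std::vector<std::size_t> parent(g.nodes.size());
    std::iota(parent.begin(), parent.end(), std::size_t{0});
    auto find = [&parent](std::size_t v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];  // path halving
            v = parent[v];
        }
        return v;
    };
    std::size_t components = g.nodes.size();
    for (const DecisionEdge& e : g.edges) {
        assert(e.from < g.nodes.size() && e.to < g.nodes.size());
        const std::size_t a = find(e.from);
        const std::size_t b = find(e.to);
        if (a != b) {
            parent[a] = b;
            --components;
        }
    }
    return components;
}
```

A graph in which array A feeds two kernels and an unrelated array B feeds a third kernel would consist of two components, so each of them could be analyzed on its own.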
6.3.3 Exhaustive Search
In Weber and Goesele [2016] we proposed an exhaustive analysis of the DDG. This method is guaranteed to find the optimum for the application, but can be very time consuming for complex applications with a high number of reallocations. The exhaustive search evaluates all possible combinations of the DDG and calculates the total execution time of all kernel executions using exhaustive profiling data or our predicted execution times. If predicted data is used, the search can be sped up by splitting all decisions into two categories: local and global. Local decisions influence only one kernel execution, e.g., shared/local memory layouts, usage of the non-coherent cache, constant memory, ranges or defines. Global decisions are
[Figure 6.4 diagram labels: A#1, B#1, C#1 to C#3, Kernel#1 to Kernel#4 (Kernel#3 appears three times), Loop with Iterations 1 to 3, Reallocations]
Figure 6.4: DDG example with global decisions of Type 1 (grey) and Type 2 (green), and kernel decisions of Type 3 (blue). Global decisions are shown by their unique decision identifier (A-C) and their unique allocation id (#1-3). The array corresponding to C#X is reallocated inside the loop in every iteration.
all global memory struct layouts and transpositions. This separation allows us to precompute optimal values for the local decisions of each kernel, as we can store one optimal local value for each possible permutation of the global decisions.
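The precomputation of local decisions can be sketched as follows. The table layout (one row per global decision permutation, one column per local choice) and the function name are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// time[g][l] holds the (predicted) execution time of ONE kernel for the
// global decision permutation g and the local choice l. The result stores,
// for every g, the index of the fastest local choice.
std::vector<std::size_t> bestLocalPerGlobal(
    const std::vector<std::vector<double>>& time) {
    std::vector<std::size_t> best;
    best.reserve(time.size());
    for (const std::vector<double>& row : time) {
        assert(!row.empty());
        std::size_t argmin = 0;
        for (std::size_t l = 1; l < row.size(); ++l) {
            if (row[l] < row[argmin]) argmin = l;
        }
        best.push_back(argmin);
    }
    return best;
}
```

Once this table exists, the search over the global decisions only needs a lookup per kernel instead of re-evaluating every local choice for every candidate.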
6.3.4 Predictive Search
The problem of an exhaustive search is that if an application allocates or reallocates many arrays (e.g., in every iteration of a loop), the number of global decision nodes grows rapidly. As the number of combinations that have to be evaluated thus increases exponentially, this method quickly becomes infeasible. Unfortunately, this often occurs in real applications. We therefore extended our predictive profiling to the search for an application-wide optimal solution. The assumption of the prediction algorithm implies that we can select optimal values without respecting any interactions between the optimization dimensions. In the first step, we iterate over all global decisions separately and select optimal values for these. We ignore the optimization domains in this step and only search inside one domain of the shared dimensions. This is possible because, as mentioned previously, layouts and transpositions (as employed by global decisions) hardly change the amount of used resources. In a second step, we iterate over all local decisions and search for the best values, but now we obey the optimization domains. We will show in our experiments that this method provides results comparable to an exhaustive search.
6.4 Step 3: Decision Models
At this point we know the optimal solutions for all profiled application runs. Now we have to use this information to build decision models that can select optimal configurations at runtime. For these decisions we require some kind of meta data to distinguish between different input data scenarios. Auto-tuners such as Active Harmony [Chung and Hollingsworth 2004] or Nitro [Muralidharan et al. 2014] use user-defined callback functions that calculate arbitrary, user-determined metrics for this distinction. Instead, we use a fully automatic approach. As analyzing and categorizing arbitrary data is a very difficult and often time-consuming task (depending on which analyses are performed), we chose to use the meta data that is already available in our system, such as the sizes of arrays and the launch configurations of kernels. In contrast to our previous publications, our improved meta data gathering system does not only track the most recent arrays in the system, but also previous data, using unique ids for each allocated array. We additionally allow the user to register variables, which are monitored during the execution. These can contain data that has been explicitly calculated by the user or, e.g., counts of items in a preallocated, non-resized array. This information is usually available in the application without any need to calculate it specifically.

For our decision model creation, we gather all decision events from all DDGs and group them by their decision event. We create one model per decision event. For this we collect all meta data that was present in the profiling when the decision event occurred and store it in a matrix, in which each column represents one kind of meta data (e.g., the size of a specific array or the launch configuration of a kernel) and each row represents one decision event. At this point it does not matter whether it is an array allocation or a kernel execution decision. The gathered meta data usually contains a large amount of redundant, linearly dependent or constant data. To simplify the decision model creation and to reduce the meta data gathering overhead at runtime, we pre-process this data by removing all redundant data. If multiple rows with equal meta data but different optimal configurations occur, we perform a majority vote and keep the configuration that was chosen most often.
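The majority vote over rows with identical meta data can be sketched as follows. The types, the integer configuration ids and the tie-breaking rule (the smallest id wins) are assumptions for this illustration and are not taken from MATOG.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using MetaRow = std::vector<std::int64_t>;  // one row of the meta data matrix

// Collapses rows with equal meta data into one entry and keeps the
// configuration id that was chosen most often for that meta data.
std::map<MetaRow, int> majorityVote(
    const std::vector<std::pair<MetaRow, int>>& samples) {
    std::map<MetaRow, std::map<int, int>> votes;  // meta data -> (config -> count)
    for (const auto& sample : samples) {
        ++votes[sample.first][sample.second];
    }
    std::map<MetaRow, int> result;
    for (const auto& entry : votes) {
        assert(!entry.second.empty());
        int bestConfig = -1;
        int bestCount = 0;
        // Ascending iteration plus the strict comparison means that the
        // smallest configuration id wins a tie (an arbitrary choice).
        for (const auto& candidate : entry.second) {
            if (candidate.second > bestCount) {
                bestConfig = candidate.first;
                bestCount = candidate.second;
            }
        }
        result[entry.first] = bestConfig;
    }
    return result;
}
```

Three samples with the meta data (1024, 32) and the optimal configurations 7, 3 and 7 would thus be reduced to a single training row that maps (1024, 32) to configuration 7.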
6.4.1 Directional Model
In Weber and Goesele [2016] we used an SVM as decision model. This proved to be suitable but may not be optimal, depending on the application. The reason lies in the kind of meta data that we use in MATOG, which consists of array sizes and launch configurations. These come with some difficulties. For example, as we do not know the limits of our gathered meta data, there is no way to normalize it, which can result in bad decisions if the values differ too much from the training
data. Further, as array sizes and launch configurations are usually used as iteration counts for loops, they imply a certain linear scaling, which can be leveraged in a decision model. We therefore analyzed our meta data and determined that in some of the failure cases a specialized nearest neighbor search model, called Directional Model (DM) in the following, worked better. The DM uses the cosine similarity, which interprets the training data ($\vec{t}$) and the meta data ($\vec{m}$) as vectors. It then applies the normalized dot product to these vectors to compute the metric:

$$\lambda = \cos(\gamma) = \frac{\vec{t} \cdot \vec{m}}{|\vec{t}|\,|\vec{m}|} \tag{6.3}$$
The resulting λ is in the range [−1, 1]. To evaluate the model, we iterate over all training samples, calculate λ and select the data point where λ is maximal. The advantage of this method is that it performs an implicit normalization, resulting in better decisions even if the meta and training data values differ significantly. Figures 6.5 and 6.6 show a simple example for a linear SVM compared to the DM, both trained and evaluated on data taken from our KD-Tree benchmark, run on a Tesla K20c.
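The evaluation described above amounts to a nearest neighbor search under Equation 6.3. The following host-side C++ sketch shows one way to write it; the struct and function names are hypothetical, and treating a zero-length vector as the worst possible match is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Equation 6.3: normalized dot product of training and meta data vector.
double cosineSimilarity(const std::vector<double>& t,
                        const std::vector<double>& m) {
    assert(t.size() == m.size());
    double dot = 0.0, normT = 0.0, normM = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        dot += t[i] * m[i];
        normT += t[i] * t[i];
        normM += m[i] * m[i];
    }
    // ASSUMPTION: the direction of a zero vector is undefined, so it is
    // treated as the lowest possible similarity.
    if (normT == 0.0 || normM == 0.0) return -1.0;
    return dot / (std::sqrt(normT) * std::sqrt(normM));
}

struct TrainingSample {
    std::vector<double> meta;  // meta data observed during profiling
    int decision;              // id of the optimal configuration
};

// Returns the decision of the training sample with maximal lambda.
int evaluateDirectionalModel(const std::vector<TrainingSample>& training,
                             const std::vector<double>& meta) {
    assert(!training.empty());
    double bestLambda = -2.0;  // below the valid range [-1, 1]
    int bestDecision = training.front().decision;
    for (const TrainingSample& sample : training) {
        const double lambda = cosineSimilarity(sample.meta, meta);
        if (lambda > bestLambda) {
            bestLambda = lambda;
            bestDecision = sample.decision;
        }
    }
    return bestDecision;
}
```

Because only the direction of the vectors matters, a query such as (10,000,000, 200,000) is matched to a training sample (1,000, 10) although the magnitudes differ by four orders of magnitude, which illustrates the implicit normalization.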
6.5 MATOG Runtime System
In this section we briefly explain our runtime system. At the beginning of the application run, MATOG is automatically initialized once the first CUDA-related function is executed. Whenever an array is allocated, we determine which decision event it belongs to and store its meta data in a centralized meta data storage. If another array of the same decision event is allocated, it is assigned a new instance id. If the array is a host array, its decision model is evaluated immediately and the layout is set (Type 1). Whenever a memcopy occurs, MATOG propagates the layout from the source to the destination array. If a kernel is executed, all non-initialized device arrays evaluate their decision model and set their layouts (Type 2). Further, the kernel evaluates its own decision model (Type 3). The final executed configuration of the kernel is the combination of all layouts of the device arrays (which have been evaluated separately before) and the configuration from the kernel's decision model. If this configuration has been compiled before, it is loaded from the database and then executed. If not, it is compiled first and then executed. In Weber and Goesele [2016] we compiled all available combinations that could occur after our application analysis process. However, depending on the complexity of the application, this can be several million configurations, which makes this approach unusable. We discuss in our future work how this high number of configurations could be reduced to improve the compilation time and also to reduce the number of implementation switches during runtime.
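The compile-on-demand behavior described above follows a common caching pattern, which the next sketch illustrates in host-side C++. The class, the string-based configuration key and the `CompiledKernel` placeholder are hypothetical; MATOG's database and compilation interface are not shown in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a compiled kernel; a real system would hold a
// loaded module or function handle here.
struct CompiledKernel {
    std::string binary;
};

class KernelCache {
public:
    using Compiler = std::function<CompiledKernel(const std::string&)>;

    explicit KernelCache(Compiler compile) : compile_(std::move(compile)) {
        assert(compile_ != nullptr);
    }

    // Returns the kernel for the given configuration key and compiles it
    // only if this configuration has not been requested before.
    const CompiledKernel& get(const std::string& configKey) {
        auto it = cache_.find(configKey);
        if (it == cache_.end()) {
            it = cache_.emplace(configKey, compile_(configKey)).first;
            ++compilations_;
        }
        return it->second;
    }

    std::size_t compilations() const { return compilations_; }

private:
    Compiler compile_;
    std::unordered_map<std::string, CompiledKernel> cache_;
    std::size_t compilations_ = 0;
};
```

With such a cache, only the configurations that the decision models actually select are compiled, rather than every combination that could occur.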
[Figure 6.5: two log-log plots of Subtrees (1 to 1,000,000) over Triangles (10 to 10,000,000), each divided into a Shared Memory and a Local Memory region; legend: Shared Memory, Local Memory, Support Vectors, False Decision]
Figure 6.5: Example for a linear SVM. On top, the data used for training and the resulting models are shown. At the bottom, the models are applied to the testing datasets. Every dot represents the meta data of a kernel execution, and colors indicate whether using local (orange) or shared memory (blue) is the better choice. The support vectors and false decisions are highlighted by black circles and red crosses, respectively. The hyperplane of the SVM is shown as a black line and the colored background visualizes the classification of the SVM. As can be seen, the number of false decisions is quite high for this decision model. Note that the meta dimensions are hand-picked and that the actual model does not only decide whether to use local or shared memory, but also decides on layouts, transpositions and texture memory usage.
[Figure 6.6: two log-log plots of Subtrees (1 to 1,000,000) over Triangles (10 to 10,000,000); legend: Shared Memory, Local Memory, False Decision]
Figure 6.6: Example for a Directional Model using the same data as in Figure 6.5. The decision vectors are shown as arrows in the color of the corresponding decision. As can be seen, the number of false decisions is significantly lower for the DM than for the linear SVM.
Chapter 7
Evaluation
In this chapter we evaluate the different analysis steps of MATOG, their efficiency and the achieved performance on seven applications. All tests have been performed on a system equipped with dual Intel Xeon E5649, 48GB DDR3-1333, Ubuntu 16.04 and CUDA 8.0 (driver version 367.57). We evaluated the last four NVIDIA GPU architectures: Fermi, Kepler, Maxwell and Pascal. Table 7.1 shows an overview of all evaluated GPUs. Our results are compared to the unoptimized performance of MATOG and to a hand-optimized reference code that does not use MATOG.
Providing a fair and direct comparison to related approaches on our benchmarks is difficult. On the one hand, there is to our knowledge no publicly available code for several other memory access auto-tuners [Sung et al. 2012; Kofler et al. 2015; Peng et al. 2016]. On the other hand, auto-tuners such as Nitro [Muralidharan et al. 2014] or OpenTuner [Ansel 2014] do provide code but lack specific, automatically generated optimizations. Instead, they require all considered optimizations to be explicitly hand-coded by the user, a stark contrast to MATOG's automated approach. We can, however, make indirect but meaningful comparisons: Nitro is designed to handle only a small number of configurations and thus relies on exhaustive search. Since we provide both timing and performance evaluations for exhaustive search as part of our results below, these can be seen as representative for Nitro. OpenTuner uses a genetic algorithm. In Weber et al. [2015] we showed that, for the optimization problem that we face in MATOG, our prediction-based search uses a minimal set of configurations that suffices to find comparable results and is faster than a genetic algorithm. This performance argument is still valid for the current version of MATOG and thus provides a clear advantage.
7.1 Benchmark Applications
We evaluate seven different GPU applications, ranging from very simple algorithms with regular workload up to very irregular algorithms with varying workload. Table 7.2 shows the number of possible configurations per GPU and benchmark. Most of our applications originate from our research group or student projects. Except for DPID, these applications have been developed for Fermi GPUs. The Speckle Reducing Anisotropic Diffusion (SRAD) and Hotspot benchmarks are taken from the Rodinia benchmark suite [Che et al. 2009] v3.1. Besides the fact that most
| Name | Arch. (CC) | Chip | Released | Cores | SMs | Processor Clock (MHz) | Processor Boost (MHz) | Memory Type | Memory Clock (MHz) | Memory Bus (bit) |
|---|---|---|---|---|---|---|---|---|---|---|
| GT 440 | Fermi (2.1) | GF108 | Feb-10 | 96 | 2 | 810 | 1,620 | DDR3 | 900 | 128 |
| GTX 480 | Fermi (2.0) | GF100 | Mar-10 | 480 | 15 | 701 | 1,401 | GDDR5 | 924 | 384 |
| Tesla C2070 | Fermi (2.0) | GF100 | Jul-11 | 448 | 14 | 575 | 1,150 | GDDR5 | 750 | 384 |
| GTX 560 Ti | Fermi (2.1) | GF114 | Jan-11 | 384 | 8 | 823 | 1,645 | GDDR5 | 1,002 | 256 |
| GTX 570 | Fermi (2.0) | GF110 | Dec-10 | 480 | 15 | 732 | 1,464 | GDDR5 | 950 | 320 |
| GTX 590 | Fermi (2.0) | GF110 | Mar-11 | 512 | 16 | 608 | 1,215 | GDDR5 | 854 | 384 |
| GT 620 | Fermi (2.1) | GF108 | May-12 | 96 | 2 | 700 | 1,400 | DDR3 | 533 | 64 |
| GTX 680 | Kepler (3.0) | GK104 | Mar-12 | 1,536 | 8 | 1,006 | 1,058 | GDDR5 | 1,502 | 256 |
| GT 730 | Kepler (3.5) | GK208 | Jul-14 | 384 | 2 | 902 | 902 | DDR3 | 800 | 64 |
| GTX 780 | Kepler (3.5) | GK110 | May-13 | 2,304 | 12 | 863 | 902 | GDDR5 | 1,502 | 384 |
| Tesla K20c | Kepler (3.5) | GK110 | Nov-12 | 2,496 | 13 | 706 | 706 | GDDR5 | 1,300 | 320 |
| GTX 980 | Maxwell (5.2) | GM204 | Sep-14 | 2,048 | 16 | 1,127 | 1,216 | GDDR5 | 1,753 | 256 |
| GTX TITAN X | Maxwell (5.2) | GM200 | Mar-15 | 3,072 | 24 | 1,000 | 1,089 | GDDR5 | 1,753 | 384 |
| GTX 1080 | Pascal (6.1) | GP104 | May-16 | 2,560 | 20 | 1,607 | 1,733 | GDDR5X | 1,251 | 256 |

Table 7.1: GPUs used in our evaluation. It can easily be seen that even within the same GPU generation, the number of cores, SMs, clock rate, memory type and bus width differ significantly. Color scale from low (blue) to high (yellow). (Data taken from TechPowerUp GPU Database [TechPowerUp.com 2017] and CUDA Device Query [NVIDIA 2016c])
of the benchmarks in this suite are very simple and do not provide many ways to enhance the performance (in terms of memory access), we also encountered a series of programming errors in the code, e.g., in the b+tree benchmark:
• a non-zero-terminated C string, which can cause a segmentation fault (in b+tree/main.c lines 1937-1941)
• no fixed seed for random values
• wrongly used C functions: char* output; ... fputs("Fail to open %s !\n", output) (in b+tree/main.c line 2224; the signature of the function is fputs(const char*, FILE*))
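The last item can be illustrated with a short sketch. `fputs` writes a string verbatim and expects the stream as its second argument, so a format string needs `fprintf` instead. The helper below is a hypothetical replacement for illustration and not the code of the Rodinia suite.

```cpp
#include <cassert>
#include <cstdio>

// fprintf takes the stream first and interprets the format string, so the
// placeholder %s is replaced by the path. Returns the number of characters
// written, or a negative value on failure.
int reportOpenFailure(std::FILE* stream, const char* path) {
    assert(stream != nullptr && path != nullptr);
    return std::fprintf(stream, "Fail to open %s !\n", path);
}
```

A call such as `reportOpenFailure(stderr, output)` prints the intended message, whereas the original call passes a `char*` where a `FILE*` is required.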
Other applications even lack certain functionality; e.g., the computation of the reverse substring matches in the MummerGPU benchmark is simply commented out (most likely because it does not work correctly). Therefore, we decided not to include more of these in our evaluation.
7.1.1 Bitonic Sort
Bitonic Sort [Batcher 1968] is a widely used parallel sorting algorithm. In our implementation it sorts a 1D AoS with four integer fields (8, 4, 2 and 1B), ensuring that the 8B value is sorted first and, only if this value is equal, the 4B value is sorted, and so on. This results in a list that is sorted with respect to all fields. To ensure conflicting
| Configurations | Fermi (2.0, 2.1) | Kepler (3.0, 3.5) | Maxwell, Pascal (5.2, 6.1) |
|---|---|---|---|
| Theoretical | 48 | 72 | 24 |
| Exhaustive | 42 | 63 | 21 |
| Predictive | 18 | 27 | 9 |
| Theoretical | 786,432 | 1,179,648 | 393,216 |
| Exhaustive | 147,456 | 221,184 | 73,728 |
| Predictive | 62 | 93 | 31 |
| Theoretical | 1,024 | 1,536 | 512 |
| Exhaustive | 512 | 768 | 256 |
| Predictive | 18 | 27 | 9 |
| Theoretical | 3,744 | 5,616 | 1,872 |
| Exhaustive | 1,872 | 2,808 | 936 |
| Predictive | 40 | 60 | 20 |
| Theoretical | 3,112 | 4,668 | 1,556 |
| Exhaustive | 780 | 1,170 | 390 |
| Predictive | 54 | 81 | 27 |
| Theoretical | 27,247,112 | 40,870,668 | 13,623,556 |
| Exhaustive | 13,600,232 | 20,400,348 | 6,800,116 |
| Predictive | 158 | 237 | 79 |
| Theoretical | 1,290,054,564 | 1,935,081,846 | 645,027,282 |
| Exhaustive | 40,353,714 | 60,530,571 | 20,176,857 |
| Predictive | 200 | 300 | 100 |

Benchmark row labels (one per group of three rows, in extracted order): KD-Tree, Bitonic, SRAD, Hotspot, COMIC, REYES, DPID
Table 7.2: Theoretical number of configurations (entire solution space), the number of configurations an exhaustive search has to execute, and the number of configurations required for our predictive method, for all evaluated GPU architectures and benchmarks. The exhaustive search contains fewer configurations than the theoretical solution space, as texture memory can only be used if the data is read-only.
rows, we limit all values to 0-1023 (255 for the 1B field). The application consists of two kernels: One uses shared memory in loop iterations where it can be used efficiently, while the other operates directly on global memory. The reference implementation is not optimized and uses a naïve AoS layout. We train on two datasets and evaluate on five others, all with varying element counts (64k to 4M).
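The sort order described for this benchmark corresponds to a lexicographic comparison over the four fields. The following sketch shows it together with a sequential, host-side version of the bitonic sorting network. It is not the benchmark's CUDA implementation; the field names and the choice of unsigned types are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// One AoS element with four integer fields of 8, 4, 2 and 1 byte.
struct Element {
    std::uint64_t a;
    std::uint32_t b;
    std::uint16_t c;
    std::uint8_t d;
};

// Lexicographic order: a decides first, b only if a is equal, and so on.
bool lessThan(const Element& x, const Element& y) {
    return std::tie(x.a, x.b, x.c, x.d) < std::tie(y.a, y.b, y.c, y.d);
}

// Sequential bitonic sorting network. On a GPU, the innermost loop is the
// part that is executed by parallel threads.
void bitonicSort(std::vector<Element>& data) {
    const std::size_t n = data.size();
    assert((n & (n - 1)) == 0);  // the network needs a power-of-two length
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of the bitonic blocks
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;  // handle each pair once
                const bool ascending = (i & k) == 0;
                const bool outOfOrder = ascending
                    ? lessThan(data[partner], data[i])
                    : lessThan(data[i], data[partner]);
                if (outOfOrder) std::swap(data[i], data[partner]);
            }
        }
    }
}
```

Limiting the value range, as described above, produces many elements with equal leading fields, so the later fields of the comparison are actually exercised.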
7.1.2 Speckle Reducing Anisotropic Diﬀusion
SRAD [Che et al. 2009] is a diffusion method for ultrasonic and radar imaging applications. The benchmark's computations are quite simple and use a straightforward implementation without much program logic or dynamic allocations. We train on three parameter sets with the same grid size and one iteration, and evaluate on 12 other parameter sets with varying grid sizes and 100 iterations.
Figure 7.1: From left to right: The original image, the most recent competing algorithms (Kopf et al. [2013] and Öztireli and Gross [2015]) and DPID. Except for the original image, all are downscaled to a width of 128px.
7.1.3 Hotspot
Hotspot [Che et al. 2009] calculates the processor temperature based on an architectural floor plan and simulated power measurements. The overall execution time of the application is quite low. The benchmark comes with three datasets. We train on the medium-sized one and evaluate on the other two.
7.1.4 Detail Preserving Image Downscaling
DPID [Weber et al. 2016] is a perceptually inspired image downscaling algorithm that, in contrast to traditional downscaling algorithms, is not based on physical effects. In a first kernel it computes a smooth downscaled version of the image. Then a second kernel uses this image to assign weights to each input pixel and generates the final output image according to the influence of each input pixel. The implementation loads a video using OpenCV [Bradski 2000], copies the data to the device using a page-locked memory segment to speed up the memcopy (as this is the major bottleneck of the entire application), downscales it, copies the result back and saves it to an output video file. The original code was developed for a GTX 680 and uses the shuffle command to exchange data between threads inside a warp. To be compatible with older cards, we modified the code to use shared memory instead of shuffle on Fermi GPUs, as these do not support the shuffle operation. The MATOG implementation also uses page-locked memory for faster memcopy onto the device. Figure 7.1 shows an image generated with DPID and three competing algorithms.
7.1.5 Coevolution via MI on CUDA
Coevolution via MI on CUDA (COMIC) [Waechter et al. 2012] calculates the co-evolutionary mutual information for protein and Deoxyribonucleic acid (DNA) sequences. It consists of three kernels: the first one initializes a randomization seed that is used in a second kernel to permute the sequences. The third kernel performs the main operation by creating a 3D histogram of occurrences in the permuted sequences. This kernel is templated and uses different compression schemes depending on the input data. Further, it uses a complex memory access pattern for the 3D histogram, as the algorithm only needs to store its results in a triangular pyramid. Our MATOG variant is able to buffer results from the histogram in local memory instead of directly storing them in shared memory. The original implementation was optimized for a GTX 480. We train on two datasets with one iteration and evaluate on 10 others using 100 iterations.
7.1.6 Renders Everything You Ever Saw
Renders Everything You Ever Saw (REYES) [Cook et al. 1987] is a technique used in movie productions. In contrast to classic 3D mesh rendering, it uses patches to model smooth surfaces. The patches are transformed into micro-polygons and iteratively split into smaller polygons until they have subpixel size. The implementation uses four kernels: The first kernel performs the splitting, while the second compresses the micro-polygons after each iteration. A third kernel uses the final polygons and renders the resulting image into a depth buffer, which contains depth and color information for each pixel. Finally, a fourth kernel extracts the color information from this buffer into a 2D texture, which can then be displayed. The benchmark was originally developed for a GTX 480. As this benchmark has a very high number of configurations, we do not show any exhaustive profiling results, as it would require several weeks to profile the application. We train on one model, rendering two frames at two different resolutions, and evaluate on 8 models with 100 frames and varying resolutions. The models are rotated after each frame to change the workload, as more or fewer polygons have to be split and rendered depending on the viewing angle. An example rendering is shown in Figure 7.2.
7.1.7 KD-Tree
The KD-Tree benchmark constructs acceleration structures for 3D ray tracing and resembles the work of Popov et al. [2006]. The application consists of two main and six maintenance kernels, the latter of which have a low total execution time. The first main kernel discretizes a triangulated scene into multiple bins, which are separated by equidistant planes in all three dimensions. Then all triangles in the scene are
Figure 7.2: REYES rendering of the Utah teapot. The smooth rendered surface is visible on the left, while the generated micro-polygon mesh is shown on the right, where one grid cell corresponds to 16x16 polygons.
processed and the number of starting and ending triangles in a bin are counted. In the last step a prefix and a postfix sum are executed on these values. With these final values, the kernel calculates a building heuristic that is used to select the best split plane. Our optimizable version uses an adjustable preprocessor implementation, which is able to buffer the binning results in local memory instead of using an atomicAdd on shared memory. The second main kernel performs the actual splitting of a subtree and stores all necessary data in two different data segments, one for each child tree. Additionally, it has to perform some recalculations if a triangle is located directly on the split plane. The maintenance kernels only account for a small portion of the overall execution time. One is run at the beginning of the application to calculate the axis-aligned bounding boxes for the input geometry. Another initializes the default data for each iteration step. Two others calculate the necessary offsets for storing the results of the splitting kernel. If subtrees are marked as leaves, another kernel is used to compact the header data. The last kernel is a post-processing kernel for the splitting step. All kernels are built so that they can process multiple subtrees in parallel.
The difficulty of this benchmark lies in the data characteristics, which change over time. At the beginning only a few but big subtrees are processed. While the number of subtrees increases significantly during the processing, their size is reduced and the number of leaf subtrees constantly increases. In the end, many small subtrees are processed. This change has a significant impact on the performance
Figure 7.3: The Buddha (left) is a 3D scan. It mainly consists of small, equally distributed triangles. The Kitchen (right) has varying triangle sizes and distributions.
Figure 7.4: San Miguel (left) and the Powerplant (right) are the biggest scenes we evaluate. Like the Kitchen, they have varying triangle sizes and distributions, but with much more variety in density and distribution.
and should benefit from MATOG's adaptiveness. This benchmark was originally optimized for a GTX 590. Due to the extremely high number of configurations, we do not show any exhaustive profiling results. Further, as memory is reallocated in each iteration, our implementation of the exhaustive post-processing cannot be used, as the number of combinations exceeds $2^{64}$.
We run this application using 32 bins and 9 different scenes ranging from small (69k triangles) to big (12M triangles), and from 3D scans (with mostly small, equally distributed triangles) to artist-generated scenes (with triangles of varying size and distribution). Figures 7.3 and 7.4 show four example scenes with varying properties. Due to memory capacity limitations, not all scenes can be executed on all GPUs (refer to Section A.7 for more details). The implementation we are comparing against uses a mixed set of data structures such as AoS, SoA or hierarchical AoS (e.g., aabb[a].point[b].dim[c]). As the first kernel uses an atomic-based mutex in global memory that does not work on Fermi GPUs, the benchmark uses an alternative implementation for this kernel on these GPUs.
7.2 Execution Performance
In this section we evaluate the achieved execution performance of MATOG against the unoptimized MATOG variant and the hand-optimized, pure CUDA reference code. When unoptimized, MATOG uses SoA, a naïve transposition (T0 in Figure 9.1), no texture memory, no constant memory, prefers shared memory over L1 cache and uses the default option for user-defined preprocessor optimizations. This can be seen as a naïve implementation, as it does not apply any special optimizations and strictly follows the CUDA programming guide [NVIDIA 2016a], which claims SoA to be optimal in most cases. All executions have been repeated five times (except for the application profiling runs) to reduce the influence of measurement noise. All evaluations have been performed on separate training and evaluation datasets. Details are listed in Appendix A.
7.2.1 GPU Execution Time
First, we take a look at the achieved speedup of the kernel execution time. We have evaluated three different decision model types: SVM, DM and a non-adaptive model that always selects the single configuration that was chosen most often, to verify whether adaptive decisions are necessary to achieve better performance. Further, we evaluated three analysis strategies: exhaustive profiling/exhaustive analysis (EE), exhaustive profiling/predictive analysis (EP) (the method used in Weber and Goesele [2016]) and predictive profiling/predictive analysis (PP) (our new method). The results are shown in Figures 7.5 to 7.11.
As mentioned before, Bitonic Sort, SRAD, Hotspot and DPID are simple algorithms without much divergence and with very regular processing. In the Bitonic case the reference code is quite slow, caused by the naïve AoS layout. SRAD and Hotspot are already optimal in terms of memory access. MATOG is slightly slower than the reference code, as MATOG data structures incur a certain overhead compared to hard-coded memory access. One reason for this overhead is pointer aliasing, as the way we have implemented the MATOG data structures does not allow using the restrict keyword to declare the internal pointers as non-overlapping. Therefore, the compiler is unable to perform certain optimizations, which can slightly decrease the performance. For DPID the reference code usually is slightly faster than the MATOG variant, except for GPUs with CC 3.5 (GT 730, GTX 780 and Tesla K20c), where MATOG achieves a significantly higher speed up. In most cases no real difference between the analysis methods and decision models can be seen, implying that these benchmarks do not benefit from adaptive optimizations. For SRAD, the exhaustive profiling usually finds slightly better solutions than our predictive profiling, but has to profile over 2371x more configurations to achieve this result. The irregularities in the results of the GT 730 on Hotspot are difficult to explain.
[Figure 7.5: bar chart; x-axis: GPU (GT 440, GTX 480, GTX 560 Ti, GTX 570, GTX 590, Tesla C2070, GT 620, GTX 680, GT 730, GTX 780, Tesla K20c, GTX 980, GTX TITAN X, GTX 1080); y-axis: Relative Speed Up; series: R, EEN, EES, EED, PEN, PES, PED, PPN, PPS, PPD]
Figure 7.5: Bitonic: This is the only benchmark where the reference code is not optimized. The results show that the unoptimized MATOG variant is already optimal, so that MATOG is unable to further increase the performance. There is no need for any adaptive optimizations, because of the simplicity of the algorithm.
[Figure 7.6: bar chart; x-axis: GPU (GT 440 to GTX 1080); y-axis: Relative Speed Up; series: R, EEN, EES, EED, PEN, PES, PED, PPN, PPS, PPD]
Figure 7.6: SRAD: The reference code for this benchmark is optimal and usually faster than the result MATOG can achieve. This is caused by the rather basic algorithm of SRAD. The small drop in performance between the reference and MATOG is caused by overhead of the MATOG data structures, e.g., through pointer aliasing.
[Figure 7.7: bar chart; x-axis: GPU (GT 440 to GTX 1080); y-axis: Relative Speed Up; series: R, EEN, EES, EED, PEN, PES, PED, PPN, PPS, PPD]
Figure 7.7: Hotspot: Again, the reference code is already optimal. The same observations apply as for the SRAD benchmark.
[Figure 7.8: bar chart; x-axis: GPU (GT 440 to GTX 1080); y-axis: Relative Speed Up; series: R, EEN, EES, EED, PEN, PES, PED, PPN, PPS, PPD]
Figure 7.8: DPID: Again, the reference implementation is slightly faster than the optimized MATOG, except on GPUs with compute capability 3.5 (GT 730, GTX 780 and Tesla K20c), where MATOG achieves higher performance. This is due to the usage of texture memory. In general, however, the unoptimized code is already very good, so most improvements are minimal.
[Figure 7.9: bar chart; x-axis: GPU (GT 440 to GTX 1080); y-axis: Relative Speed Up; series: R, EEN, EES, EED, PEN, PES, PED, PPN, PPS, PPD]
Figure 7.9: COMIC: On the Fermi GPUs the reference code is optimal and MATOG achieves comparable performance. However, starting with the GTX 680 the reference code performance drops significantly, whereas MATOG achieves up to 10% over the unoptimized variant. Cases where the optimized variant is slower than the unoptimized one can be caused by false predictions of the decision models.
[Figure 7.10: bar chart; x-axis: GPU (GT 440 to GTX 1080); y-axis: Relative Speed Up; series: R, PEN, PES, PED, PPN, PPS, PPD]
Figure 7.10: REYES: For the REYES benchmark the reference code is always slower than the unoptimized MATOG variant. This is caused by the fact that the code often uses AoS as a layout, while SoA performs better, even on the older cards. For Fermi we see that MATOG is unable to gain more performance, but for all newer cards it achieves up to 18% over the unoptimized variant. The dynamic decision making mostly pays off on the Tesla K20c.
[Figure 7.11: bar chart; x-axis: GPU (GT 440 to GTX 1080); y-axis: Relative Speed Up; series: R, PPN, PPS, PPD; four bars carry the value labels 0.20, 0.43, 0.10 and 0.41]
Figure 7.11: KD-Tree: For the KD-Tree the reference code performance varies significantly between the GPUs, and on the newest GPU it is even slower than the unoptimized variant. However, the optimized variant always achieves a significant speed up of up to 38%. The dynamic variant using the DM usually achieves the highest speed up, followed by the static model. In this benchmark the SVM usually achieves very bad results. This is caused by the fact that the meta data gathered during training can differ significantly from the data gathered during testing, causing bad decisions.
We think these irregularities are caused by the heat of the chip, as this GPU is passively cooled and therefore can only regulate its temperature by reducing the clock frequency.
So far, all benchmarks have been simple algorithms, consisting of only 1-2 kernels with a very low execution time that do not benefit from adaptiveness during runtime. The three remaining applications are much more complex, consisting of 3-8 kernels. For COMIC we can see that the reference code is optimal for the GTX 480 and 570, for which it was developed. On the newer cards (starting with the GTX 680), the reference code's performance is significantly lower than that of the unoptimized MATOG code. Further, MATOG achieves even more performance when optimized, except on the GTX 980. We think that this exception is caused by differences between the test and training datasets. The main speed up is achieved through transposition of the data stored in shared memory. Local memory is never used. This benchmark also does not benefit from adaptiveness.
Although REYES was optimized for the GTX 480, even the unoptimized MATOG code is faster than the reference code. On the newer architectures the reference performance decreases further, to below 80% of the unoptimized code. However, the optimized variant is unable to find much better solutions until the GT 730. As can be seen, the adaptive methods achieve up to 6% higher performance on some GPUs, while on others the static model performs better.
The KD-Tree is by far the most complex and most difficult application to optimize that we evaluate. It has many changing data properties, a huge number of reallocations and irregular work processing. The non-adaptive model achieves decent speed ups, but the DM always achieves the best results and is up to 33% faster than the non-adaptive solution. The SVM always performs worse than the DM, especially on the GTX 560 Ti, GTX 590, GT 620 and GT 730. Here the SVM is unable to capture the underlying structure of the meta data, since the values differ too much from the training data and cannot be normalized. This causes the SVM to use local memory very often, although this is only optimal in the very first iterations with few big subtrees and performs very badly in all subsequent iterations. It is difficult to say why this does not happen on the other GPUs; it might correlate with how well the local memory variants work on the respective GPUs.
To summarize, MATOG achieved in general the highest speed ups on the Tesla K20c. As the codes have mostly been developed for consumer/GTX GPUs, this is no surprise, as the Tesla K20c is an HPC card with different properties. There is no real difference between the analysis methods in terms of quality, except for the SRAD benchmark, where the exhaustive profiling is often able to find slightly better configurations. Further, the DM has achieved the highest performance compared to the other methods. In the very complex applications, our adaptive optimizations outperform the static optimizations by up to 33%.
7.2.2 Application Execution Time
Next, we take a look at the application execution time, which contains not only the GPU but also the CPU time. For this we only use the DM results. Figure 7.12 shows the percentage of time spent on CPU or GPU for the reference code on the Tesla K20c. Figures 7.13 to 7.16 show the total application speed up in relation to the unmodified MATOG variant (the order of the benchmarks has been altered to maximize the axis ranges of the graphs). For the Bitonic Sort we can see that the reference code is slower than the unmodified MATOG, which already uses the optimal configuration for this benchmark. In some cases the optimized variant is slower, though. This is caused by the fact that it does not always use the same implementation but multiple slightly different ones and therefore has to load more modules, causing a higher framework overhead. We will discuss this effect in more detail in Section 9.4. For COMIC we can see mostly better performance for the optimized MATOG and decreasing performance for the reference code on the newer GPUs. This benchmark clearly shows the necessity of auto-tuning for performance portability, as code optimized for older hardware can perform worse on newer hardware. For SRAD, Hotspot and DPID we see that the reference code is usually faster than the optimized MATOG code. This is caused by the fact that the total GPU execution time of these benchmarks is very low and the additional CPU overhead of the MATOG framework causes a slowdown. For REYES we see that although we mainly had a speed up for the GPU code, the overall performance is usually decreased. Again, this is caused by (un-)loading and maintaining multiple variants of kernel implementations. Finally, for the KD-Tree benchmark we see that the optimized code achieves up to 36% higher overall performance compared to the unoptimized variant.
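The interplay between GPU speed up and additional CPU overhead can be made concrete with a small calculation. The following Python sketch is only an illustration: the function name and all timings are hypothetical and were not measured in our benchmarks.

```python
def total_speed_up(cpu_time, gpu_time, gpu_speed_up, extra_cpu_overhead=0.0):
    """Speed up of the whole application when only the GPU part gets faster
    and the framework adds `extra_cpu_overhead` to the CPU part."""
    before = cpu_time + gpu_time
    after = cpu_time + extra_cpu_overhead + gpu_time / gpu_speed_up
    return before / after


# CPU-dominated application: the overhead outweighs the faster kernels.
cpu_bound = total_speed_up(cpu_time=8.0, gpu_time=2.0,
                           gpu_speed_up=1.25, extra_cpu_overhead=0.5)
# GPU-dominated application: the same kernel speed up pays off.
gpu_bound = total_speed_up(cpu_time=2.0, gpu_time=8.0,
                           gpu_speed_up=1.25, extra_cpu_overhead=0.5)
print(cpu_bound < 1.0, gpu_bound > 1.0)  # True True
```

This mirrors the qualitative observation above: benchmarks with a very low GPU execution time can become slower overall even if their kernels become faster.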
[Figure 7.12: stacked bar chart; y-axis: Percentage of total execution time (0% to 100%); series: CPU, GPU]
Figure 7.12: Percentage of execution time spent on CPU and GPU using the Tesla K20c. As can be seen, the first four benchmarks spend more than 50% of their execution time on the CPU.
[Figure 7.13: bar chart; panel: DPID; x-axis: GPU; y-axis: Total Speed Up; series: CUDA, MATOG]
Figure 7.13: Total application speed up for the DPID benchmark. The results are mixed, but the reference code is usually the fastest, due to lower CPU overhead.
[Figure 7.14: bar chart; panels: Bitonic, COMIC; x-axis: GPU; y-axis: Total Speed Up; series: CUDA, MATOG]
Figure 7.14: Total application speed up for the Bitonic Sort and COMIC benchmarks. The reference code is usually slower than the unoptimized and optimized MATOG code. For the Bitonic Sort, in some rare cases the optimized code is slower than the unoptimized code, caused by the overhead of maintaining multiple kernel implementations. For COMIC we see that the reference code performance decreases significantly on the newer GPUs.
[Figure 7.15: bar chart; panels: Hotspot, REYES; x-axis: GPU; y-axis: Total Speed Up; series: CUDA, MATOG]
Figure 7.15: Total application speed up for the Hotspot and REYES benchmarks. Especially for REYES we can see that the performance is slightly lower than that of the unoptimized code, caused by the overhead of maintaining multiple configurations per kernel.
[Figure 7.16: bar chart; panels: SRAD, KD-Tree; x-axis: GPU; y-axis: Total Speed Up; series: CUDA, MATOG]
Figure 7.16: Total application speed up for the SRAD and KD-Tree benchmarks. For SRAD the CUDA code is usually the fastest due to lower CPU overhead, while for the KD-Tree the optimized MATOG is faster.
7.2.3 Performance Portability
Pennycook et al. [2016] proposed a method to measure performance portability for applications across multiple platforms. They calculate the harmonic mean over the execution time for one approach over a series of executed applications and normalize it with the harmonic mean of the best results for the applications. Figure 7.17 shows the results for the hand-optimized CUDA code and our proposed method, compared to all other methods we have analyzed. Except for SRAD and Hotspot, our method is superior to the pure CUDA implementation in both GPU and application efficiency. Overall, CUDA achieves an average of 88.33% GPU and 96.09% application efficiency, while our proposed method achieves 98.26% and 95.11%, respectively. Assuming that an exhaustive search would yield 100.0% GPU efficiency, our method comes very close, while requiring significantly less time to achieve its results, as we show in the next section.
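A minimal Python sketch of such a metric is given below. It assumes that the efficiency of one case is the best known execution time divided by the achieved execution time and that the metric is the harmonic mean of these efficiencies, with a value of zero if a case is not supported. The function names and timings are hypothetical, and the exact normalization used for Figure 7.17 may differ.

```python
def harmonic_mean(values):
    """Harmonic mean of a list of positive numbers."""
    return len(values) / sum(1.0 / v for v in values)


def efficiency(times, best_times):
    """Per-case efficiency: best known time divided by achieved time."""
    return [best / t for t, best in zip(times, best_times)]


def performance_portability(times, best_times):
    """Harmonic mean of the efficiencies; 0 if any case is unsupported (None)."""
    if any(t is None for t in times):
        return 0.0
    return harmonic_mean(efficiency(times, best_times))


# Hypothetical timings in seconds on two platforms.
achieved = [2.0, 4.0]
best = [1.0, 4.0]
pp = performance_portability(achieved, best)
print(f"{pp:.4f}")  # 0.6667
```

The harmonic mean penalizes a single poorly performing case more strongly than the arithmetic mean, which is why it is a common choice for portability metrics.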
[Figure 7.17: two bar charts; x-axis: benchmark (Bitonic, SRAD, Hotspot, DPID, COMIC, REYES, KD-Tree); y-axis: GPU Efficiency (top) and Application Efficiency (bottom), 0% to 100%; series: CUDA, MATOG]
Figure 7.17: GPU (top) and application (bottom) efficiency for CUDA and our proposed method. Except for SRAD, Hotspot and DPID for the application efficiency, MATOG is superior to CUDA for both metrics.
7.2.4 Analysis Time
Finally, we take a look at the time required for the analysis on the GTX 1080. Figure 7.18 shows the time required for the profiling. It is easy to see that the exhaustive search requires significantly more time than our predictive profiling. Given the high number of configurations for the REYES and KD-Tree benchmarks, the overall exhaustive profiling for these two benchmarks would require a lot of time. Assuming that compiling a kernel requires approximately 3 s, on 16 cores the compilation alone would require ~15 days for REYES and ~44 days for the KD-Tree.
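The estimate above follows from simple arithmetic, which the following Python sketch reproduces. It assumes perfect scaling over all 16 cores; the function names are hypothetical, and the resulting kernel counts are back-of-the-envelope values implied by the stated assumptions, not measured numbers.

```python
SECONDS_PER_DAY = 86_400


def compile_days(num_kernels, seconds_per_kernel=3.0, cores=16):
    """Wall-clock days needed to compile `num_kernels` kernels in parallel."""
    return num_kernels * seconds_per_kernel / cores / SECONDS_PER_DAY


def kernels_for_days(days, seconds_per_kernel=3.0, cores=16):
    """Number of kernel compilations that fit into `days` of wall-clock time."""
    return days * SECONDS_PER_DAY * cores / seconds_per_kernel


print(round(kernels_for_days(15)))  # 6912000
print(round(kernels_for_days(44)))  # 20275200
```

Under these assumptions, ~15 days and ~44 days correspond to roughly 6.9 and 20.3 million kernel compilations, respectively.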
Figure 7.19 shows the ratio between the profiling time and the time of a normal application run. As can be seen, this ratio depends on the number of configurations: our method usually requires below 10x, and for REYES and KD-Tree less than 100x, the time of a normal run, while the exhaustive profiling can reach up to 10,000x (for SRAD).
In Figure 7.20 we show the measured times for the profiling and analysis of the benchmarks. As stated before, REYES and KD-Tree require too much time for the exhaustive profiling and have therefore been excluded. Except for the Bitonic Sort, our predictive profiling is always significantly faster than the exhaustive search. The reason is that the Bitonic Sort has only very few possible configurations, so that the numbers of profiled configurations hardly differ.
[Figure 7.18: bar chart; x-axis: benchmark (Bitonic, SRAD, Hotspot, DPID, COMIC, REYES, KD-Tree); y-axis: Time (s), logarithmic from 0.1 to 100000.0; series: Exhaustive, Predictive]
Figure 7.18: Execution time for the profiling. It is easy to see that the predictive profiling is significantly faster than the exhaustive profiling and can profile all applications within a few minutes.
[Figure 7.19: bar chart; x-axis: benchmark (Bitonic, SRAD, Hotspot, DPID, COMIC, REYES, KD-Tree); y-axis: Profiling / Reference Execution Time, logarithmic from 1 to 10,000; series: Exhaustive, Predictive]
Figure 7.19: Ratio between profiling and execution time. As can be seen, with increasing complexity of the benchmark, the profiling time increases significantly, especially for the exhaustive search. Even for the complex KD-Tree benchmark, our predictive method requires less than 100x the time of a normal application run. The exhaustive search can be multiple orders of magnitude slower. Note that the profiling time contains not only the pure execution time, but also the time required to compile the GPU kernels, convert data into other layouts and restore the input data before starting a kernel in another configuration.
[Figure 7.20: bar chart; x-axis: benchmark (Bitonic, SRAD, Hotspot, DPID, COMIC, REYES, KD-Tree); y-axis: Time (s), logarithmic from 0.01 to 10.00; series: Exhaustive - Exhaustive, Predictive - Exhaustive, Predictive - Predictive]
Figure 7.20: Time required for the analysis of the profiling data. The analysis is quite fast for all methods (compared to the profiling); however, our proposed method (green) is always the fastest. Further, it can easily be seen that the execution time rises with increasing complexity. The exhaustive analysis requires more time using exhaustive data (blue) than with predictive data (orange), caused by much more data that needs to be loaded and processed.
Chapter 8
Empirical Performance Models
Before we discuss the results of our evaluation, we take a look at work we have conducted to further extend the decision making of MATOG. This chapter contains recent research results that have not yet made it into the active development of MATOG and therefore are not part of the evaluation in the previous chapter. This work was conducted in conjunction with Sandra C. Amend for her Master's thesis [Amend 2017].
So far we have explained our concepts for auto-tuning array layouts, analyzing applications, using meta data to predict optimal layouts and using categorical decision models, and how all of this is implemented in MATOG. What these models do not provide is an estimate of how long a kernel will execute on specific hardware. This is an important feature. One use case would be to estimate, prior to a kernel execution, whether converting an array into another layout between two kernel calls would yield enough speed up to compensate for the time necessary for the conversion. Figure 8.1 shows in which situations this would yield an improvement and in which it would not. Another application would be data partitioning in applications that use multiple heterogeneous devices. The problem here is that heterogeneous devices can require different amounts of time to process the same amount of data. With an accurate prediction model it would be possible to divide the data into exactly the required pieces, so that all devices complete their task at the same time, reducing the synchronization overhead between the devices to a minimum. Our assumption is that when an application is executed using the same data on the same GPU, the kernel runtimes will be deterministic. Therefore we expect that it is possible to extract characteristics of the data that enable us to estimate how long a kernel will run on a specific GPU. In this chapter we evaluate whether our automatically gathered meta data supplies the required information.
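Both use cases can be sketched in a few lines of Python, assuming that a performance model already supplies the predicted times. The function names and all numbers are hypothetical; this is not part of MATOG.

```python
def should_convert(time_current_layout, time_other_layout, conversion_time):
    """Convert only if the predicted gain Δ(A,B) exceeds convert(A,B)."""
    gain = time_current_layout - time_other_layout
    return gain > conversion_time


def partition(num_items, predicted_time_per_item):
    """Split items so that all devices are predicted to finish together."""
    speeds = [1.0 / t for t in predicted_time_per_item]
    total = sum(speeds)
    shares = [int(num_items * s / total) for s in speeds]
    shares[0] += num_items - sum(shares)  # rounding remainder goes to device 0
    return shares


# Predicted kernel times in ms for the current and the alternative layout.
print(should_convert(5.0, 3.0, conversion_time=1.5))  # True
print(should_convert(5.0, 4.5, conversion_time=1.5))  # False
# Device 0 is predicted to be twice as fast per item as device 1.
print(partition(900, [1.0, 2.0]))  # [600, 300]
```

In both cases the quality of the decision depends directly on the accuracy of the predicted times, which motivates the models discussed in this chapter.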
A lot of research has already been conducted on performance models. There are two main methods: analytical and empirical performance models. Analytical methods model the correlation between the computation and the used hardware. To establish these analytical models, a deep knowledge of the algorithm and hardware is necessary [Wolf et al. 2014]. They are usually handcrafted and therefore not suited for our application, as we aim for a fully automatic auto-tuner. Empirical performance models use measured performance data, similar to the data that MATOG gathers during its profiling, and then fit some kind of performance curve to the measured data. Please refer to Section 4.5.1 for an overview of state-of-the-art techniques in this area.
[Figure 8.1: line chart; x-axis: Meta Data; y-axis: Execution Time; series: Layout A, Layout B, Δ(A,B), convert(A,B); left region annotated Δ(A,B) < convert(A,B), right region annotated Δ(A,B) > convert(A,B)]
Figure 8.1: Artificial example for the execution time of two different layouts depending on a given meta data value. Δ(A,B) is the speed up that could be achieved if the faster layout were used, and convert(A,B) is the time necessary for converting the data. In the left example, the conversion takes more time, so that it would not yield a significant improvement. However, in the right example the conversion would result in a notable speed up.
Which technique is suitable to establish these empirical models depends on the application and on which data is available. Approaches that utilize Neural Networks (NNs) [Ipek et al. 2005; Lee et al. 2007; Wu et al. 2015] achieve high-quality estimations but require a significant number of data samples to train the networks. MATOG gathers data automatically; however, after removing redundant, constant and linearly dependent information these datasets are rather small and usually contain fewer than 100 entries, which is too few to be used in NNs. Gaussian Processes (GP) [Rasmussen and Williams 2006] are another method, with the advantage that they only require very few data samples. Another advantage of GP is that they provide not only a predicted value but also an error estimate that indicates how certain the model is of the value.
In the following sections we explain how we use GP to automatically generate performance models, evaluate their accuracy and discuss how these models could be used in MATOG. These experiments have been conducted using MATLAB and the Gaussian Processes for Machine Learning (GPML) Toolbox¹ to quickly generate results and evaluate different model implementations.
¹ www.gaussianprocess.org/gpml/code
8.1 Model Training and Prediction Accuracy
In order to train our models, we use the MATOG profiler to gather all necessary data. This data is then extracted from the MATOG database, converted into a format that can be processed by MATLAB and then directly fed into the GPML Toolbox. For the data we use the same filtering as described in Section 6.4, where we remove all constant and linearly dependent entries. We evaluated multiple GP model implementations, using different covariance functions: linear (GP lin.), squared exponential (GP SE), squared exponential with automatic relevance determination (GP SE + ARD) and combined linear plus squared exponential (GP lin. + SE) [Rasmussen and Williams 2006]. Further, we compare all results against an ordinary linear regression (lin. reg.) to evaluate whether the usage of a complex model such as GP is beneficial. For the evaluation we chose the "bound and split" kernel from REYES (Section 7.1.6) and the first main kernel of the KD-Tree benchmark (Section 7.1.7), both executed on a GTX 680. We chose these kernels, as they are the main kernels of the two most complex benchmarks we have available. Further, these two benchmarks produce the highest amount of meta data.
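To illustrate how a GP with a combined linear plus squared exponential covariance yields both a prediction and an error estimate, the following pure Python sketch implements standard GP regression for a single scalar input. It is not the GPML Toolbox: the hyperparameters are fixed by hand instead of being optimized, and all names and data values are hypothetical.

```python
import math


def kernel_lin_se(x, y, lin_var=1.0, se_var=1.0, length=1.0):
    """Sum of a linear and a squared exponential covariance (scalar inputs)."""
    return lin_var * x * y + se_var * math.exp(-0.5 * ((x - y) / length) ** 2)


def cholesky(a):
    """Lower triangular factor L with L * L^T = a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Forward substitution: solve L * y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return y


def solve_upper_from_lower(low, y):
    """Backward substitution: solve L^T * x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / low[i][i]
    return x


def gp_predict(xs, ts, x_new, noise=1e-2):
    """Return predictive mean and standard deviation at x_new."""
    n = len(xs)
    k = [[kernel_lin_se(xs[i], xs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    low = cholesky(k)
    alpha = solve_upper_from_lower(low, solve_lower(low, ts))
    k_star = [kernel_lin_se(x, x_new) for x in xs]
    mean = sum(a * b for a, b in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    var = kernel_lin_se(x_new, x_new) - sum(e * e for e in v)
    return mean, math.sqrt(max(var, 0.0))


sizes = [0.0, 1.0, 2.0, 3.0]  # hypothetical meta data values
times = [1.0, 3.0, 5.0, 7.0]  # hypothetical kernel times
mean_near, std_near = gp_predict(sizes, times, 1.0)
mean_far, std_far = gp_predict(sizes, times, 10.0)
print(std_far > std_near)  # True
```

The error estimate grows for inputs far away from the training data, which is the property that allows a GP to signal unreliable predictions.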
8.1.1 Single Dataset
First, we take a look at the accuracy of these models for predicting a single dataset. For this we trained our models on all available datasets and predicted the performance on the "Utah Teapot" (REYES) and "Kitchen" (KD-Tree) datasets. In Figure 8.2 we show the results of the best performing GP model (GP lin. + SE) against the lin. reg. model. In this test we can see that for both benchmarks, most predictions are quite accurate. For the KD-Tree benchmark we can see that in the lower part of the figures some samples deviate significantly. In this area the models are unable to use our automatically gathered data to perform accurate predictions. However, overall most samples for both methods seem to be accurate enough for our purpose.
Next, we directly compare the accuracy of the GP lin. + SE and lin. reg. models for each sample. Results are shown in Figure 8.3. For the GP lin. + SE model the error estimate is shown as well. The results of REYES show that the GP lin. + SE model is in most cases closer to the measured results than the lin. reg. model. For the KD-Tree both seem to be equally good. In the first 10 samples both models differ significantly from the measured results. This is the area where the models cannot use our meta data to accurately predict the performance.
[Figure 8.2: scatter plots; x-axis: Measured Time; y-axis: Predicted Time; columns: lin. reg. (left), GP lin. + SE (right); rows: REYES (top, in μs), KD-Tree (bottom, in ms)]
Figure 8.2: Results for the lin. reg. (left) and GP lin. + SE (right). The black line indicates the optimum, where the prediction is identical to the measured value. The more a value deviates, the less accurate the model is. For REYES (top) we see that both models perform equally. For the KD-Tree (bottom) we can see that in the lower part of the chart both models deviate significantly from the expected values. This is an area where the models are unable to use our automatically gathered data to predict the performance.
[Figure 8.3: line charts; x-axis: Samples; y-axis: Execution Time; panels: REYES (top, in μs), KD-Tree (bottom, in ms); series: Measured, GP lin. + SE, lin. reg.]
Figure 8.3: Predicted and measured execution time for the GP lin. + SE model with 95% confidence interval and the lin. reg. model. The samples for REYES (top) are sorted in ascending order of the measured execution time. As can be seen, the lin. reg. model deviates more than the GP lin. + SE model. For the KD-Tree (bottom) the samples are not sorted. Here we can clearly see the area where the models deviate from the measured runtime.
8.1.2 Multiple Datasets
Next, we perform our analysis on all available datasets. For this we iterate over all datasets available for the benchmarks and train the models on all datasets except the one we are predicting. Further, we show results for all tested GP model variants and the lin. reg. model. We first take a look at the Relative Root Mean Squared Error (RRMSE). This is defined as:
\[ \mathrm{RRMSE} = \frac{\sqrt{\sum_{i=1}^{d} \left(t_m(i) - t_p(i)\right)^2}}{\max(t_m)} \tag{8.1} \]
It can be used to compare the accuracy of models. The lower the RRMSE, the better the model performance. Our results are shown in Figure 8.4.
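A direct Python transcription of Equation 8.1 could look as follows. It assumes that t_m and t_p denote the measured and predicted times of the d samples and that the square root covers only the numerator; the function name and the example values are hypothetical.

```python
import math


def rrmse(measured, predicted):
    """Relative error following Equation 8.1: root of the summed squared
    differences, divided by the largest measured value."""
    squared_error = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(squared_error) / max(measured)


print(rrmse([3.0, 4.0], [0.0, 0.0]))  # 1.25
print(rrmse([1.0, 2.0, 4.0], [1.0, 2.0, 4.0]))  # 0.0
```

Dividing by the largest measured time makes the error independent of the absolute time scale, so that kernels with runtimes in microseconds and milliseconds can be compared.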
[Figure 8.4: horizontal bar charts; x-axis: RRMSE; panels: REYES (top), KD-Tree (bottom); models: GP lin. + SE, GP SE + ARD, GP SE, GP lin., lin. reg.]
Figure 8.4: RRMSE for all tested models (lower is better). As can be seen, GP lin. + SE works best for REYES (top), followed by the GP lin. and lin. reg. models. For the KD-Tree, GP lin. + SE again performs best, while the difference to the lin. reg. model is quite low.
Finally, we compare the ratios between the measured and the predicted values (Figure 8.5). This shows how big the differences are. We visualize this using a box plot. The closer the median (innermost line) is to the 1.0 ratio (indicated by the black line), the better. The box itself shows the upper and lower quartile around the median value. The black error bars further show 1.5x of the box extent. All remaining samples are outside this range.
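The quantities shown in such a box plot can be computed as in the following Python sketch. It assumes that the ratio is measured time divided by predicted time and that the error bars extend 1.5x the box extent beyond the quartiles; the function name and the sample values are hypothetical.

```python
import statistics


def box_plot_summary(measured, predicted):
    """Median, quartiles, whisker limits and outliers of measured/predicted."""
    ratios = sorted(m / p for m, p in zip(measured, predicted))
    q1, median, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
    extent = q3 - q1
    low, high = q1 - 1.5 * extent, q3 + 1.5 * extent
    outliers = [r for r in ratios if r < low or r > high]
    return {"median": median, "q1": q1, "q3": q3,
            "whiskers": (low, high), "outliers": outliers}


measured = [8.0, 9.0, 10.0, 11.0, 12.0, 30.0]
predicted = [10.0] * 6
summary = box_plot_summary(measured, predicted)
print(summary["outliers"])  # [3.0]
```

In this example the last sample is underestimated by a factor of three and therefore falls outside the error bars.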
[Figure 8.5: box plots; x-axis: Ratios; panels: REYES (top), KD-Tree (bottom); models: GP lin. + SE, GP SE + ARD, GP SE, GP lin., lin. reg.]
Figure 8.5: Distributions of ratios for the tested models. For REYES (top) it can be seen that GP lin. + SE again performs best, followed by GP lin. and lin. reg., where the latter deviates most from the 1.0 ratio. The results of the GP SE for the KD-Tree (bottom) deviate significantly from the optimum. According to the plot, the lin. reg. is closer to the median and its box is tighter than for the GP lin. + SE.
8.1.3 Error Cases
So far the models have mostly worked as expected, except for the few samples in Figure 8.3. However, we encountered a case where the models provided very bad predictions. This case occurred when training our prediction models for the KD-Tree on the GTX 1080. The results of the models are shown in Figures 8.6 to 8.9. What we see is that the predictions are very bad for the first 45 samples (Figure 8.7). The reason for this lies in the differences between the 3D scenes that we use to train our models, which is similar to the SVM problem we encountered previously (Section 6.4.1). There are multiple problems. First, as shown in Figures 7.3 and 7.4, the scenes differ in their properties. The 3D scans (e.g., Buddha) are very dense with small, nearly equally sized triangles. The other scenes (e.g., Kitchen) are artist-modeled, with varying triangle sizes. San Miguel and the Powerplant introduce a new property that differs significantly from the others. These scenes consist of densely packed and very sparse or even empty areas. These scene-dependent properties have an impact on the performance, but are not captured by any of our automatically gathered meta data. Second, Figure 8.10 shows the distribution of our meta data in linear space. For the two additional datasets (San Miguel and Powerplant) the data differs significantly from the others. When predicting for the Powerplant, the models only know meta data up to 25M triangles and 4.2M subtrees. As the Powerplant uses up to 45M triangles and 6M subtrees, the performance models have no reference data available and need to extrapolate. Third, when we predict the performance for the Kitchen (Figure 8.7), the two additional scenes provide meta data in the same range as the smaller scenes, but their runtime differs (because of their different properties), so that the prediction models interpolate between the smaller and the bigger scenes.
All these effects have a negative impact on the model's accuracy. However, a positive property of the GP is that it knows that its predictions are bad, as indicated by the error bars in Figure 8.7. This does not solve the problem, but it allows us to detect when the model provides bad predictions. So far we do not know a solution to this problem. It might be possible to build a two-layer model that first determines a rough classification (e.g., "small", "medium" or "big" scene) and then builds a separate prediction model for each of these classes.
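The extrapolation problem can at least be detected with a simple range check on the meta data, as the following Python sketch illustrates. The function name and the small training scene are hypothetical; the upper bounds of 25M triangles and 4.2M subtrees as well as the Powerplant values are the ones named above.

```python
def outside_training_range(sample, training_samples):
    """Names of meta data dimensions where `sample` leaves the training range."""
    flagged = []
    for name, value in sample.items():
        seen = [t[name] for t in training_samples]
        if value < min(seen) or value > max(seen):
            flagged.append(name)
    return flagged


training = [
    {"triangles": 69e3, "subtrees": 1.0e3},  # hypothetical small scene
    {"triangles": 25e6, "subtrees": 4.2e6},  # upper bounds named in the text
]
powerplant = {"triangles": 45e6, "subtrees": 6e6}
print(outside_training_range(powerplant, training))  # ['triangles', 'subtrees']
```

Such a check only covers the second problem (extrapolation). It cannot detect scene properties that are not captured by the meta data at all.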
[Figure 8.6: two scatter plots of Predicted Time [ms] over Measured Time [ms], both axes from 0 to 4; left panel: lin. reg., right panel: GP lin. + SE.]
Figure 8.6: Results for the lin. reg. (left) and GP lin. + SE (right). Both methods significantly differ from the optimum, while the lin. reg. is overall closer to the optimum, the GP lin. + SE is much better in the lower region and worse in the upper.
[Figure 8.7: plot of Execution Time [ms] (0 to 20) over Samples (0 to 70); series: Measured, GP lin. + SE, lin. reg.]
Figure 8.7: Predicted over measured execution time for the GP lin. + SE model with 95% confidence interval and the lin. reg. model. The samples are not sorted. As can be seen, both models deviate significantly from the measured value for the first 45 samples. Also the error estimate of the GP is very high in this area.
[Figure 8.8: distributions of Ratios (axis from 0.0 to 5.0) for the models GP lin. + SE, GP SE + ARD, GP SE, GP lin. and lin. reg.]
Figure 8.8: Distributions of ratios for the tested models. Again the GP lin. + SE has the best result. Although most of its values are in an acceptable range, it has a significant number of outliers.
[Figure 8.9: RRMSE (axis from 0 to 0.5) for the models GP lin. + SE, GP SE + ARD, GP SE, GP lin. and lin. reg.]
Figure 8.9: RRMSE for all tested models. Again the GP lin. + SE performs best, although the results of the other figures suggest an overall bad fit of the model.
[Figure 8.10: scatter plot of Subtrees (10^6, 0 to 6) over Triangles (10^6, 0 to 40); series: Powerplant, San Miguel, Others.]
Figure 8.10: Selected meta data dimensions for all KD-Tree datasets. The two largest (San Miguel and Powerplant) are highlighted. As can be seen, their meta data values significantly differ from those of the other models.
8.2 Predicting Unknown Configuration Performance
In order to be useful for MATOG, we have to combine our previously introduced prediction for non-profiled configurations (Section 6.2.2) with the meta data based prediction models. This enables us to predict the performance of kernels for non-profiled configurations and non-profiled meta data. With this we would be able to use only a few prediction models to predict the performance of the entire solution space. The previously mentioned use case of converting an array prior to a kernel execution could use this method. Further, it could be used to validate and improve the models' accuracy, as the predicted time could be compared with the actual execution time. Starting with an initial model generated using our profiling method (Section 6.2), the auto-tuner would compare the predictions with the actual measured performance and adjust the models accordingly to further increase their accuracy.
For this experiment, we first built a GP prediction model for each base and support configuration. Second, we used our prediction formula (Equation 6.1), but instead of fixed time values we used our prediction models. All tests have been performed on a GTX 680 using the GP lin. + SE model, as this proved to work best in the previous evaluations. Results are shown in Figures 8.11 and 8.12. Again, the REYES case works better than the KD-Tree, whose predictions show an offset of approximately 10 ms. However, the quality of the predictions is comparable to the predictions in Figure 6.3.
[Figure 8.11: plot of Execution Time [µs] (100 to 900) over Samples (0 to 240); series: Measured, Predicted.]
Figure 8.11: Results for predicting 240 randomly selected configurations of the bound and split kernel. The accuracy is quite good and comparable with other results of the MATOG prediction.
[Figure 8.12: plot of Execution Time [ms] (250 to 290) over Samples (0 to 50); series: Measured, Predicted.]
Figure 8.12: Results for predicting 50 randomly selected configurations of the binning kernel. There is a slight offset of approximately 10 ms. Despite this offset, the results follow the measured values relatively well.
Chapter 9
Discussion
In this chapter we discuss our results and whether MATOG was able to satisfy our goals. First, we summarize this thesis and its contributions (Section 9.1). Then we analyze whether auto-tuning is capable of providing the advances that it promises and how applicable they are (Section 9.2). Further, we take a closer look at the optimizations that have been chosen by MATOG throughout our experiments and analyze whether we can extract knowledge from them that can be used to improve auto-tuners or manual code optimizations (Section 9.3). In Section 9.4 we take a closer look at MATOG itself, how it performed and whether there are aspects that should be improved or changed in the future. Finally, we conclude our discussion by reflecting on the goals we set for this thesis (Section 9.5).
9.1 Summary
In this thesis we have addressed the question of whether it is possible to automatically optimize the performance of memory access in GPU applications. We therefore developed an auto-tuner that specifically targets array access in NVIDIA CUDA applications. The main problem we faced was how to efficiently determine optimal array layouts. As we decided to use empirical profiling instead of an analytical method, this analysis can take a huge amount of time for very complex applications if done with an exhaustive search, which is in general infeasible. We therefore developed multiple techniques to reduce this time to a minimum, while achieving nearly optimal performance, comparable to the results of an exhaustive search. For this we employ in-application profiling that uses a prediction algorithm to estimate the performance of large parts of the solution space. This procedure only requires a very limited number of measurements. Given this data, we established a dependency graph to model the relation between multiple kernel executions and estimate the overall application execution time. This gives us optimal decisions, which we use to train decision models that adaptively select array layouts during runtime according to the data used in the application. The entire optimization process is designed to work fully automatically and therefore does not require any user interaction (except for integrating MATOG into the application).
9.2 Is auto-tuning useful?
In our introduction we stated that to achieve optimal performance it is necessary to adapt the code to the underlying hardware. For the evaluation we ran code on different kinds of GPUs (low-, mid-, high-end and HPC) of four different GPU generations. As we have seen, the performance of the reference code was usually optimal for the GPU that the code was developed for, but the more complex an application was, the higher the chance that this reference code performed less well on newer hardware. This happened, e.g., for the COMIC benchmark, where starting with Kepler GPUs the reference code was even slower than the unoptimized baseline code of MATOG, whereas on Fermi it was up to 5% faster. This clearly shows the necessity of auto-tuning in terms of performance portability.
Further, we have seen that, depending on the complexity of the application, static optimizations may suffice (Bitonic Sort, SRAD, Hotspot, DPID and COMIC); in these cases adaptiveness did not reduce the performance. In some cases (REYES and KD-Tree) dynamic optimizations achieved a significantly higher performance.
The results showed that optimal memory access is crucial in many applications and can significantly influence the performance of the code. However, solely optimizing the GPU code does not necessarily improve the overall performance of an application, as can be seen in the REYES benchmark, where MATOG achieves a significant GPU speed up on all cards with CC 3.5 or higher (Figure 7.10), but the total application speed up is slightly negative in most cases (Figure 7.15). This is caused by memory layouts that perform less well on the CPU and by framework overhead. Section 9.4 discusses this in more detail.
Our results further show that auto-tuning an application can be done quite fast, as for all benchmarks we required only a few minutes to optimize them using MATOG. Even for applications with longer execution times, this is most likely still significantly faster than optimizing the code manually. It also does not require any knowledge of the hardware or software and can therefore be carried out by any person. Manual code optimizations, in contrast, always require significant programming skills and a sophisticated knowledge of the application, the applied algorithms and the hardware.
9.3 Which optimizations are optimal?
A key question is whether we can find any regularity in which optimizations are optimal, since this would allow static analysis to be performed. In Tables 9.1 to 9.4 we show how often each layout, cache size, transposition or memory has been used by MATOG. Figure 9.1 shows the different indexing schemes used by MATOG.
[Figure 9.1: six diagrams labeled T0 to T5, each showing a different ordering of the X, Y and Z dimensions.]
Figure 9.1: Different indexing schemes (transpositions) for a 3D matrix and the corresponding internal identification (T0 to T5). Lower dimensional matrices behave equally. C++ code for T0 is: x + y * size_x + z * size_x * size_y
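The T0 formula from the caption generalizes to any ordering of the three dimensions. The following sketch computes the linear index for an arbitrary dimension order; the function name and the encoding of the order as an array are assumptions made for illustration. The text does not state which permutation corresponds to which identifier T1 to T5, so only the T0 case (order x, y, z) is taken from the source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Linear index of the element at `coord` in a 3D matrix of extent `size`.
// Dimensions are numbered 0 = x, 1 = y, 2 = z. order[0] names the fastest
// varying dimension and order[2] the slowest, so the order {0, 1, 2} yields
// x + y * size_x + z * size_x * size_y (T0 in Figure 9.1).
inline std::size_t linearIndex(const std::array<std::size_t, 3>& coord,
                               const std::array<std::size_t, 3>& size,
                               const std::array<std::size_t, 3>& order) {
    std::size_t index = 0;
    std::size_t stride = 1;
    for (std::size_t dim : order) {
        assert(dim < 3 && coord[dim] < size[dim]);
        index += coord[dim] * stride;  // contribution of this dimension
        stride *= size[dim];           // the next dimension varies more slowly
    }
    return index;
}
```

Because every permutation of the dimensions addresses the same number of elements, switching between transpositions changes only the access pattern and not the memory footprint.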
The results (Tables 9.1 to 9.4) show that significantly different configurations are used not only between the different benchmarks, but also between the GPUs, even within the same architecture. There is no clear tendency towards a specific struct layout, transposition or L1 cache size visible. The only clear trend is that using local memory instead of shared memory is beneficial only in some rare cases (Table 9.4). REYES (Table 9.3) is the only benchmark where array sizes are known at compile time. This benchmark greatly benefits from using constant memory, although again the usage of texture, global and constant memory varies significantly between the different architectures. Further, it can be seen that the usage of texture memory is preferred on all newer platforms, whereas in most of the cases a certain balance between global and texture memory is chosen. What we can see is the obvious fact that texture memory only benefits arrays that are read more than once. Not using texture memory for read-once data allows the cache to be used more efficiently, as read-once data does not evict other data. Note that the percentages do not represent the amount of memory that is placed in the respective memory type, but the number of arrays.
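For readers unfamiliar with the three struct layouts named in the tables, the following sketch shows how the element offset of one field can be computed in each case. It assumes equally sized fields and, for AoSoA, an element count that is a multiple of the tile width. The function names and the tile width are hypothetical; the text does not state which tile width MATOG uses.

```cpp
#include <cassert>
#include <cstddef>

// Offset (in elements) of field `field` of logical element `i`.

// Array of Structs: all fields of one element are adjacent.
inline std::size_t aosOffset(std::size_t i, std::size_t field,
                             std::size_t fields) {
    return i * fields + field;
}

// Struct of Arrays: all elements of one field are adjacent.
inline std::size_t soaOffset(std::size_t i, std::size_t field,
                             std::size_t count) {
    return field * count + i;
}

// Array of Structs of Arrays: elements are grouped into tiles of `tile`
// elements, and each tile is stored in SoA order.
inline std::size_t aosoaOffset(std::size_t i, std::size_t field,
                               std::size_t fields, std::size_t tile) {
    assert(tile > 0);
    const std::size_t block = i / tile;  // which tile the element is in
    const std::size_t lane = i % tile;   // position inside the tile
    return block * fields * tile + field * tile + lane;
}
```

Neighboring threads that read the same field of neighboring elements touch adjacent addresses with SoA and, within one tile, with AoSoA, whereas AoS places them `fields` elements apart.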
Cache Size: SM L1 EQ | Layout: AoS SoA AoSoA | Transposition: T0 T1 T2 T3 T4 T5 | Local Arrays: Shared Local | Global Arrays: Global Texture Constant
Benchmark: Bitonic
GT 440 76% 24% 0% 75% 25% 100% 0%
GTX 480 76% 24% 0% 75% 25% 100% 0%
Tesla C2070 71% 29% 0% 75% 25% 100% 0%
GTX 560 Ti 71% 29% 25% 75% 0% 100% 0%
GTX 570 79% 21% 0% 75% 25% 100% 0%
GTX 590 80% 20% 0% 75% 25% 100% 0%
GT 620 71% 29% 0% 100% 0% 100% 0%
GTX 680 50% 21% 29% 0% 100% 0% 100% 0%
GT 730 71% 0% 29% 25% 75% 0% 100% 0%
GTX 780 75% 25% 0% 25% 75% 0% 50% 50%
Tesla K20c 51% 29% 21% 0% 100% 0% 50% 50%
GTX 980 0% 75% 25% 50% 50%
GTX TITAN X 0% 100% 0% 100% 0%
GTX 1080 25% 75% 0% 100% 0%
Benchmark: SRAD
GT 440 100% 0% 56% 44% 60% 40%
GTX 480 100% 0% 52% 48% 60% 40%
Tesla C2070 100% 0% 52% 48% 60% 40%
GTX 560 Ti 100% 0% 56% 44% 50% 50%
GTX 570 100% 0% 52% 48% 60% 40%
GTX 590 100% 0% 52% 48% 50% 50%
GT 620 100% 0% 65% 35% 60% 40%
GTX 680 100% 0% 0% 69% 31% 90% 10%
GT 730 100% 0% 0% 66% 34% 80% 20%
GTX 780 100% 0% 0% 66% 34% 0% 100%
Tesla K20c 100% 0% 0% 52% 48% 0% 100%
GTX 980 56% 44% 20% 80%
GTX TITAN X 61% 39% 10% 90%
GTX 1080 70% 30% 100% 0%
Table 9.1: Usage of optimizations for the Bitonic and SRAD benchmark.
Cache Size: SM L1 EQ | Layout: AoS SoA AoSoA | Transposition: T0 T1 T2 T3 T4 T5 | Local Arrays: Shared Local | Global Arrays: Global Texture Constant
Benchmark: Hotspot
GT 440 0% 100% 50% 50% 100% 0%
GTX 480 0% 100% 50% 50% 100% 0%
Tesla C2070 0% 100% 50% 50% 100% 0%
GTX 560 Ti 0% 100% 50% 50% 33% 67%
GTX 570 0% 100% 50% 50% 100% 0%
GTX 590 0% 100% 50% 50% 100% 0%
GT 620 0% 100% 50% 50% 100% 0%
GTX 680 0% 0% 100% 50% 50% 100% 0%
GT 730 0% 0% 100% 50% 50% 33% 67%
GTX 780 0% 0% 100% 50% 50% 33% 67%
Tesla K20c 100% 0% 0% 50% 50% 33% 67%
GTX 980 50% 50% 67% 33%
GTX TITAN X 50% 50% 67% 33%
GTX 1080 17% 83% 67% 33%
Benchmark: DPID
GT 440 67% 33% 58% 42% 0% 100% 0% 100% 0%
GTX 480 100% 0% 42% 48% 10% 83% 17% 100% 0%
Tesla C2070 67% 33% 28% 72% 0% 94% 6% 100% 0%
GTX 560 Ti 17% 83% 58% 42% 0% 100% 0% 100% 0%
GTX 570 50% 50% 28% 72% 0% 94% 6% 100% 0%
GTX 590 0% 100% 42% 48% 10% 75% 25% 100% 0%
GT 620 33% 67% 58% 42% 0% 100% 0% 100% 0%
GTX 680 50% 0% 50% 27% 31% 42% 76% 24% 100% 0%
GT 730 17% 0% 83% 47% 53% 0% 100% 0% 42% 58%
GTX 780 33% 0% 67% 69% 31% 0% 75% 25% 42% 58%
Tesla K20c 83% 0% 17% 31% 69% 0% 89% 11% 42% 58%
GTX 980 10% 48% 42% 37% 63% 42% 58%
GTX TITAN X 22% 37% 42% 31% 69% 42% 58%
GTX 1080 0% 42% 58% 31% 69% 94% 6%
Table 9.2: Usage of optimizations for the Hotspot and DPID benchmark.
Cache Size: SM L1 EQ | Layout: AoS SoA AoSoA | Transposition: T0 T1 T2 T3 T4 T5 | Local Arrays: Shared Local | Global Arrays: Global Texture Constant
Benchmark: COMIC
GT 440 52% 48% 8% 75% 0% 8% 8% 0% 100% 0% 100% 0%
GTX 480 75% 25% 26% 66% 0% 8% 0% 0% 100% 0% 100% 0%
Tesla C2070 84% 16% 28% 63% 0% 0% 0% 8% 100% 0% 100% 0%
GTX 560 Ti 50% 50% 3% 80% 0% 0% 8% 8% 100% 0% 100% 0%
GTX 570 66% 34% 26% 66% 0% 0% 8% 0% 100% 0% 100% 0%
GTX 590 52% 48% 12% 80% 0% 0% 0% 8% 100% 0% 100% 0%
GT 620 50% 50% 8% 75% 0% 8% 8% 0% 100% 0% 100% 0%
GTX 680 50% 34% 16% 38% 54% 0% 8% 0% 0% 100% 0% 100% 0%
GT 730 52% 39% 9% 36% 47% 0% 0% 0% 17% 100% 0% 50% 50%
GTX 780 50% 11% 39% 38% 46% 8% 0% 0% 8% 100% 0% 50% 50%
Tesla K20c 66% 11% 23% 49% 34% 0% 0% 0% 17% 100% 0% 50% 50%
GTX 980 54% 38% 8% 0% 0% 0% 100% 0% 63% 38%
GTX TITAN X 83% 8% 0% 8% 0% 0% 100% 0% 75% 25%
GTX 1080 54% 38% 0% 8% 0% 0% 100% 0% 75% 25%
Benchmark: REYES
GT 440 79% 21% 25% 41% 34% 74% 14% 0% 0% 11% 0% 37% 11% 52%
GTX 480 69% 31% 32% 48% 20% 48% 41% 0% 0% 0% 11% 49% 6% 45%
Tesla C2070 77% 23% 12% 79% 9% 76% 23% 0% 0% 0% 0% 61% 7% 33%
GTX 560 Ti 90% 10% 11% 65% 23% 84% 5% 0% 0% 11% 0% 56% 4% 40%
GTX 570 88% 12% 22% 41% 36% 44% 45% 11% 0% 0% 0% 37% 22% 41%
GTX 590 66% 34% 12% 53% 36% 47% 42% 0% 0% 11% 0% 48% 19% 33%
GT 620 79% 21% 30% 56% 13% 83% 5% 0% 0% 11% 0% 52% 4% 45%
GTX 680 44% 29% 27% 19% 42% 39% 46% 40% 13% 0% 0% 0% 7% 51% 42%
GT 730 31% 20% 49% 54% 11% 35% 72% 26% 1% 0% 0% 1% 3% 44% 53%
GTX 780 44% 2% 54% 29% 58% 13% 68% 19% 2% 0% 11% 0% 3% 42% 56%
Tesla K20c 50% 10% 40% 14% 72% 13% 68% 18% 0% 0% 13% 0% 0% 45% 55%
GTX 980 63% 26% 11% 53% 33% 1% 0% 14% 0% 12% 38% 50%
GTX TITAN X 53% 29% 18% 43% 21% 1% 0% 33% 2% 16% 45% 39%
GTX 1080 72% 6% 22% 42% 22% 2% 1% 33% 0% 7% 58% 35%
Table 9.3: Usage of optimizations for the COMIC and REYES benchmark.
Cache Size: SM L1 EQ | Layout: AoS SoA AoSoA | Transposition: T0 T1 T2 T3 T4 T5 | Local Arrays: Shared Local | Global Arrays: Global Texture Constant
Benchmark: KD-Tree
GT 440 39% 61% 29% 70% 1% 39% 40% 16% 5% 1% 0% 100% 0% 100% 0%
GTX 480 80% 20% 27% 71% 2% 56% 26% 11% 7% 0% 0% 100% 0% 100% 0%
Tesla C2070 40% 60% 30% 68% 2% 50% 32% 11% 7% 0% 0% 100% 0% 100% 0%
GTX 560 Ti 57% 43% 68% 30% 2% 52% 23% 10% 11% 5% 0% 43% 57% 100% 0%
GTX 570 66% 34% 81% 17% 2% 28% 50% 10% 11% 1% 0% 100% 0% 99% 1%
GTX 590 61% 39% 50% 48% 2% 51% 29% 13% 6% 0% 1% 97% 3% 100% 0%
GT 620 60% 40% 27% 60% 13% 40% 44% 10% 5% 0% 0% 91% 9% 77% 23%
GTX 680 64% 22% 14% 69% 21% 10% 74% 14% 10% 0% 2% 0% 100% 0% 100% 0%
GT 730 47% 29% 24% 20% 66% 14% 54% 32% 12% 0% 1% 0% 96% 4% 81% 19%
GTX 780 48% 26% 27% 22% 76% 2% 55% 32% 10% 3% 0% 0% 100% 0% 46% 54%
Tesla K20c 23% 41% 36% 44% 30% 26% 65% 23% 11% 1% 0% 0% 100% 0% 50% 50%
GTX 980 22% 65% 13% 43% 36% 7% 0% 14% 1% 100% 0% 55% 45%
GTX TITAN X 8% 71% 21% 47% 35% 6% 1% 11% 0% 100% 0% 57% 43%
GTX 1080 32% 64% 4% 56% 25% 1% 3% 10% 4% 100% 0% 52% 48%
Table 9.4: Usage of optimizations for the KD-Tree benchmark.
9.4 MATOG Implementation Improvements
In our results we have seen that MATOG achieves considerable speed ups over an unoptimized version and also over the hand-written code, especially on the cards the reference code has not been optimized for. Nevertheless, there are several cases where the implementation of MATOG could be improved.
Starting with the implementation of MATOG data structures: as previously mentioned, they are prone to pointer aliasing, which can reduce performance as the compiler cannot know whether different pointers overlap. This can decrease the amount of ILP, as the compiler cannot determine whether results depend on each other. Currently we use C++ classes to hide the memory access from the user. As classes are normally instantiated multiple times, it is not possible to tell the compiler that pointers returned by the class do not overlap. For MATOG each class is uniquely instantiated using a template and the framework also ensures that the pointer only exists once in the code. As far as we know, there is no way to tell the compiler about this. One solution would be to switch away from code generation and use a source-to-source compiler, which then puts the optimization directly into the code, without classes. Establishing and maintaining such a compiler would require a lot of work. As C++, and therefore CUDA, allows all kinds of elaborate typedef structures and preprocessor hacks, it could be difficult to build a compiler that can transform all types of possible C++ code into a MATOG compatible solution. Using an existing optimizing compiler, e.g., OpenARC [Lee and Vetter 2014], which automatically generates parallel CUDA code from OpenACC, could be an option, but it would remove the ability to directly write CUDA code.
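To illustrate what the compiler is missing, the sketch below contrasts a function with plain pointers and one whose parameters carry a restrict qualifier. This is what generated code without wrapper classes could express directly. The qualifier is a compiler extension in C++ (`__restrict__` in GCC, Clang and NVCC, `__restrict` in MSVC); the macro and function names are made up for this example and are not MATOG code.

```cpp
#include <cstddef>

// Restrict is not part of standard C++, so select the compiler's spelling.
#if defined(_MSC_VER)
#define SKETCH_RESTRICT __restrict
#else
#define SKETCH_RESTRICT __restrict__
#endif

// The qualifier promises that out, a and b never overlap, so loads from a and
// b may be reordered across stores to out and the loop can be vectorized.
inline void addNoAlias(float* SKETCH_RESTRICT out,
                       const float* SKETCH_RESTRICT a,
                       const float* SKETCH_RESTRICT b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Without the qualifier the compiler has to assume that a store to out[i] may
// modify a[j] or b[j], which restricts reordering and can reduce ILP.
inline void addMayAlias(float* out, const float* a, const float* b,
                        std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions compute the same result for non-overlapping arrays; they differ only in what the compiler is allowed to assume. Calling the first variant with overlapping arrays is undefined behavior.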
Another limitation of MATOG is that it does not take any CPU times into consideration. Layouts that decrease the CPU performance are currently not taken into account during the optimization. Capturing the CPU time during the execution requires inserting check points. We have two ideas for solving this. First, as MATOG already intercepts CUDA Driver API calls, it could track the time elapsing between these. This can be implemented very easily but would also capture all kinds of I/O, which can introduce a lot of noise. The other option is to disallow access to MATOG data structures in the host code. Instead, CPU kernels could be supported that are equivalent to GPU kernels but are executed on the CPU. In this case it would be possible not only to auto-tune CPU code inside such a kernel, as is done with the GPU code, but also to explicitly track the time and optimize for it. CPU memory hierarchies are getting more and more complex. In the next generation HPC processors of AMD [Vijayaragavan et al. 2017], not only will a CPU and GPU be integrated onto the same chip, the chip will also have on-chip (HBM) and off-chip (DDR) memory. This makes it essential that CPU applications also become more aware of the underlying memory hierarchy, to benefit from the different kinds of memory. Zivanovic et al. [2017] explore the possibilities of such systems for common HPC applications and argue that the benefit comes with an increased development and optimization effort. This is where the functionality of MATOG could be used to assist the development. For MATOG this would mean that a compiler and module load/unload infrastructure equivalent to the existing CUDA infrastructure has to be implemented.
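The first idea, measuring the host time that elapses between intercepted driver calls, could look roughly like the following sketch. The class and its hook names are hypothetical; how MATOG's interception layer would invoke them is an assumption, and the timestamps are passed in explicitly so that the logic can be tested without a GPU.

```cpp
#include <chrono>

// Hypothetical tracker that accumulates the time spent on the host between
// the exit of one intercepted driver call and the entry of the next one.
class HostTimeTracker {
public:
    using Clock = std::chrono::steady_clock;
    using TimePoint = Clock::time_point;

    // To be called when an intercepted driver call is entered.
    void onCallEnter(TimePoint now = Clock::now()) {
        if (hasLastExit_) {
            hostTime_ += now - lastExit_;  // host work since the last call
        }
    }

    // To be called when the intercepted driver call returns.
    void onCallExit(TimePoint now = Clock::now()) {
        lastExit_ = now;
        hasLastExit_ = true;
    }

    std::chrono::microseconds hostTime() const {
        return std::chrono::duration_cast<std::chrono::microseconds>(hostTime_);
    }

private:
    TimePoint lastExit_{};
    bool hasLastExit_ = false;
    Clock::duration hostTime_ = Clock::duration::zero();
};
```

As noted above, a measurement of this kind also includes I/O and any other host activity between two calls, so the accumulated value is an upper bound on the time spent accessing MATOG data structures.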
One other aspect that decreases the CPU performance is the loading and unloading of CUDA modules by MATOG. As several hundred different modules may be used during the execution, MATOG loads and unloads these in the background, while it maintains a cache that employs a least recently used (LRU) strategy. This part of MATOG has several issues. First, MATOG stores the compiled CUDA images inside its database. The advantage is very easy and fast access to the modules (as parts of the database are kept in memory), and we do not need to store every image as a single file, which would litter the hard drive. However, to load these images we have to use the function cuModuleLoadData, which in contrast to cuModuleLoad seems to require significantly more time: ~838.66µs vs ~163.09µs per module. We have not been able to figure out a reason for this, and neither the CUDA programming guide [NVIDIA 2016a] nor the Driver API reference [NVIDIA 2016b] gives any information on this effect. Even more problematic is the execution time of unloading modules, which is significantly higher (~3.533ms per module). At the moment we do not know a good method to decrease this overhead; most likely, reducing the total number of configurations would be the best solution. This would require clustering the configurations according to their performance. We will discuss this in more detail in Section 10.1. Loading an unlimited number of modules does not work either, as all modules require memory on the GPU and loading or keeping too many modules could cause an out-of-memory error in the application, which is not desirable. Table 9.5 shows an example of the time required for CUDA Driver API calls using MATOG and the reference code for the KD-Tree on a GTX 1080.
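The LRU strategy mentioned above can be sketched with a list and a hash map. This is a generic textbook construction and not MATOG's actual cache: `Handle` stands in for a driver module handle, and the `load` and `unload` callbacks are placeholders for whatever wraps cuModuleLoadData and cuModuleUnload.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical LRU cache for loaded modules, keyed by a configuration name.
template <typename Handle>
class LruModuleCache {
public:
    using Load = std::function<Handle(const std::string&)>;
    using Unload = std::function<void(Handle)>;

    LruModuleCache(std::size_t capacity, Load load, Unload unload)
        : capacity_(capacity), load_(std::move(load)), unload_(std::move(unload)) {
        assert(capacity_ > 0);
    }

    // Returns the handle for `key`. On a miss the module is loaded and, if the
    // cache is full, the least recently used module is unloaded first.
    Handle get(const std::string& key) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            // Hit: move the entry to the front (most recently used).
            entries_.splice(entries_.begin(), entries_, it->second);
            return it->second->second;
        }
        if (entries_.size() >= capacity_) {
            // Miss on a full cache: evict the entry at the back.
            unload_(entries_.back().second);
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(key, load_(key));
        index_[key] = entries_.begin();
        return entries_.front().second;
    }

    std::size_t size() const { return entries_.size(); }

private:
    using Entry = std::pair<std::string, Handle>;
    std::size_t capacity_;
    Load load_;
    Unload unload_;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::string, typename std::list<Entry>::iterator> index_;
};
```

The capacity bounds the GPU memory held by loaded modules, but each eviction pays the unload cost reported above, which is why reducing the number of distinct configurations attacks the overhead more directly than enlarging the cache.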
                        CUDA Reference                       MATOG
Function Name           Time (µs)     Time (%)  Calls        Time (µs)     Time (%)  Calls
cuMemcpyHtoD            795,830.0     51.39%    134          413,150.0     29.96%    133
cuCtxCreate             436,130.0     28.16%    1            422,030.0     30.61%    1
cuModuleUnload          -             0.00%     -            194,440.0     14.10%    55
cuCtxSynchronize        238,630.0     15.41%    138          183,360.0     13.30%    138
cuMemAlloc              46,455.0      3.00%     139          41,441.0      3.01%     209
cuModuleLoad(Data)      1,304.7       0.08%     8            88,059.0      6.39%     105
cuMemFree               17,190.0      1.11%     139          18,269.0      1.32%     209
cuLaunchKernel          6,550.1       0.42%     542          8,647.7       0.63%     542
cuDeviceGetAttribute    3,311.9       0.21%     365          3,352.4       0.24%     409
cuDeviceTotalMem        928.7         0.06%     4            1,875.8       0.14%     8
cuMemcpyDtoH            1,581.2       0.10%     69           1,600.0       0.12%     69
cuDeviceGetName         284.6         0.02%     4            592.9         0.04%     8
cuCtxGetCurrent         -             0.00%     -            823.8         0.06%     2671
cudaDeviceReset         506.1         0.03%     1            512.9         0.04%     1
cuFuncSetCacheConfig    -             0.00%     -            492.3         0.04%     542
cuCtxGetDevice          -             0.00%     -            180.0         0.01%     542
cuModuleGetFunction     3.4           0.00%     8            106.8         0.01%     105
cuDeviceGet             5.0           0.00%     13           5.9           0.00%     13
cuDeviceGetCount        2.2           0.00%     3            3.4           0.00%     4
cuInit                  0.9           0.00%     1            1.0           0.00%     1
cuDriverGetVersion      0.7           0.00%     1            0.5           0.00%     1
Total Driver API        1,548,714.4                          1,378,944.3
Table 9.5: Time required for CUDA Driver API calls for running the KD-Tree benchmark on a GTX 1080 with the reference and optimized MATOG variant. The highlighted calls (cuModuleLoadData, cuModuleUnload and cuModuleGetFunction) require significantly more time in the MATOG variant than in the reference code.
9.5 Conclusion
Overall we can summarize that the goals that we defined in Chapter 1 have been fulfilled. First, we were able to build a tool that can optimize array access in CUDA applications independent of the used hardware and the application domain. We showed this by our evaluation on 14 GPUs from four different hardware generations and seven applications from various application domains (image processing, bioinformatics, real-time rendering and simulation). Second, our application analysis achieves results comparable to an exhaustive search in significantly less time. The optimization process is fully automated and does not require any user interaction. Finally, our tool is capable of reaching performance comparable to hand-optimized code, or even outperforming it. Further, it can dynamically react to changing application workloads and thereby achieve even higher performance than purely static optimization. In Chapter 8 we have analyzed how automatic performance models could be generated using the data that we have available in MATOG. Our results show that in most cases these already work as expected, but before they can be used in MATOG the method has to be improved, as in some rare situations the performance models currently do not work as expected.
Chapter 10
Future Work
It is easy to answer the question "Is auto-tuning at its end?", as it definitely is not. Rather, it is at its beginning! There are many open research topics, which are presented in the following sections.
10.1 Future of MATOG
As already mentioned in Section 9.4, the implementation itself could be improved to achieve higher performance: first, by changing the data structures so that no pointer aliasing occurs; second, by reducing the overhead of the module (un-)loading by reducing the number of configurations that are used; and finally, by extending MATOG to other compute device types (e.g., CPUs or FPGAs). Another big topic in compute intensive applications is sparse matrices. They are very data dependent and can greatly benefit from auto-tuning [Muralidharan et al. 2014]. Therefore, it could be beneficial to integrate not only sparse matrix data structures into MATOG, but also an automatic detection of whether a dense or sparse matrix should be used. Besides optimized memory access, it would also be possible to choose between different compilers. For NVIDIA GPUs not only NVIDIA's own NVCC compiler exists, but also an alternative called GPUCC. Wu et al. [2016] have shown that their compiler can outperform NVCC in some applications, therefore it would be convenient to have a mechanism that detects which of these compilers works better for a given kernel. Thinking further, adding an MPI layer to MATOG would also be interesting, as performance is of the essence especially in HPC. This would allow multiple cluster nodes to be used for the calculation, while MATOG would take care of the data layouts. Further, it could automatically partition the data, similar to MAPS [Ben-Nun et al. 2015]. However, in its current implementation it is not possible to run MATOG in a distributed environment, as its database system is based on SQLite (sqlite.org), which is not designed to be operated in a network environment. One option would be to use one instance as master for accessing the database, or to switch entirely to a network capable database such as MySQL (mysql.org) or MariaDB (mariadb.org), or even to a fully distributed database such as Apache Cassandra (cassandra.apache.org).
Another very interesting optimization is kernel fusion/splitting. This deals with the question of when it makes sense to put multiple different processing steps into the same kernel or to split them into different kernels. More functionality in the same kernel allows shared memory to be used to store intermediate results and access them faster. However, this usually consumes more resources, which can decrease the utilization of the kernel. Furthermore, necessary synchronization between the different processing parts can decrease the performance. Splitting a kernel can reduce the resource consumption, but does not allow data to be shared between two kernels using shared memory. To our knowledge, some work has been conducted in this direction [Wang et al. 2010; Fousek et al. 2011; Filipovic et al. 2012; Filipovic et al. 2015], but so far there is no fully automatic tool for arbitrarily complex kernels, only for some specialized applications. As laid out in Chapter 8, the usage of performance models could improve the decision-making during runtime.
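The trade-off between fused and split kernels can be shown with a host-side analogy. In the sketch below each loop stands in for one kernel launch, and the temporary vector stands in for an intermediate result that has to travel through global memory between two kernels. The two processing steps (scaling and adding an offset) are arbitrary placeholders chosen for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Split variant: two passes. The intermediate result is written to and read
// back from a full-size buffer, as it would be between two separate kernels.
inline std::vector<float> scaleThenOffsetSplit(const std::vector<float>& in,
                                               float scale, float offset) {
    std::vector<float> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        tmp[i] = in[i] * scale;  // first "kernel"
    }
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = tmp[i] + offset;  // second "kernel"
    }
    return out;
}

// Fused variant: one pass. The intermediate value stays in a local variable
// (the analog of a register or shared memory), at the cost of a larger body.
inline std::vector<float> scaleThenOffsetFused(const std::vector<float>& in,
                                               float scale, float offset) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float scaled = in[i] * scale;
        out[i] = scaled + offset;
    }
    return out;
}
```

The fused variant saves one full write and one full read of the intermediate buffer. On a GPU the larger fused kernel would in turn need more registers or shared memory per thread, which is the resource cost discussed above.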
10.2 Evaluation and comparability
One big problem of auto-tuning is evaluation and comparability. Hardly any auto-tuning project publishes its code. We would have liked to compare MATOG against other auto-tuners [Sung et al. 2012; Kofler et al. 2015; Peng et al. 2016] in our evaluation, but as their projects are not publicly available, this was not possible. We want to encourage researchers to publish their work as open source, as we did for MATOG from the beginning. Another option is to provide a web platform, as done, e.g., by the 3D Web Reconstruction project (gccvmwebreconstruction.igd.fraunhofer.de) or the DawnCC project (cuda.dcc.ufmg.br/dawn), which allow their tools to be used without giving away the code. Whether such a web-based approach is feasible for auto-tuning has to be evaluated.
Another issue is that there is no standard method or application to evaluate auto-tuners, and given the variety of possible optimizations and methods, such a standard is most likely difficult to establish in the community. As a result, some authors compare their auto-tuner to hand-optimized code and "only" achieve a speed up of a few percent, while others compare against "naïve" solutions (which are most likely the worst performing they were able to find) and achieve speed ups of several orders of magnitude. This is similar to the "GPUs are 100x faster than CPUs" myth, where an optimized GPU implementation is compared against an unoptimized CPU version [Lee et al. 2010]. Neither variant can be blamed, but this variety does not help to compare the different approaches. Further, the used hardware and software stack (drivers, CUDA version, operating system, ...) is often omitted, which makes results even less comparable. There are ambitions in the community to change this using benchmark suites such as Rodinia [Che et al. 2009], Parboil [Stratton et al. 2012] or Polybench [Grauer-Gray et al. 2012] and, building upon these, the project of Fursin et al. [2016]. However, none of these approaches yet provides a satisfying solution, as we will lay out in the following sections.
10.2.1 Benchmark Suites
The mentioned benchmark suites usually consist of overly simplistic applications, such as SRAD or Hotspot from our evaluation, whose runtime is in the millisecond range. To show the ability of auto-tuners, in our opinion it is necessary to test them on realistic applications with runtimes of seconds, minutes or even hours. In the simplistic benchmarks, most of the time is usually used for I/O or setup of the application and only a very small fraction of the already short execution time is used for the actual computation. Further, these benchmarks do not come with predefined tests, so it is impossible to recreate the same results, and no author can be blamed for choosing datasets and parameters that work well with their tool. Another issue is the maintainability of these benchmarks. They often come with no common build chain; instead, every single application has its own way of being built and usually requires some adaptation to work on other machines. Tools such as CMake (cmake.org) could be a solution for this, as they would even allow the benchmarks to be used on multiple platforms such as Windows, Linux and MacOS without changes (if the source code does not use platform dependent functionality). Further, some of the benchmarks even contain programming errors or non-functioning code (Section 7.1). We only took a look at some of the benchmarks in the Rodinia suite, but are certain that we would be able to find similar errors in the other benchmarks as well, which raises the question of their quality and usability.
Better organized examples of benchmarks can be found in other communities, e.g., the Common Visual Data Foundation [CVDF 2016], which poses explicit challenges with precisely defined task descriptions that have to be completed and rules for how the results are scored. The problem for auto-tuning in such challenges is most likely the scoring, as different hardware could yield different scores (even equal GPUs from different manufacturers can vary in performance, caused by customizations such as varying clock frequencies or modified cooling systems). Another example is the Middlebury benchmark [Baker et al. 2011], which explicitly contains a training and an evaluation dataset and even allows results to be submitted to the authors' homepage.
10.2.2 Collective Knowledge
The project of Fursin et al. [2016] (cTuning Foundation) goes in the right direction, as they provide a set of predefined benchmarks, datasets and a repository to store the benchmark results. However, one major problem is the comparability of different hardware and software setups. Everyone is using different hardware, drivers and operating systems, and even small changes (e.g., a newer GPU driver version) can result in a change of performance, making it difficult to compare the results. They also store information about the test setup, but this does not solve the comparability problem. Metrics such as FLOPS are no good alternative either, as higher FLOPS do not necessarily guarantee shorter execution times. We do not have a solution for this. One idea could be a community driven centralized evaluation cluster, where researchers can upload their code and evaluate it on a standardized software/hardware stack with a predefined set of benchmarks, parameters and datasets. This would enable comparability across different approaches, but setting up and maintaining such an infrastructure is a very costly endeavor.
10.3 How to combine diﬀerent optimizations?
Today there is a wide variety of auto-tuners available for all kinds of optimizations such as data layouts (MATOG), data partitioning and multi-device scheduling (MAPS [Ben-Nun et al. 2015]), sparse matrix formats (Nitro [Muralidharan et al. 2014]), and many more. But all of these focus only on a particular optimization field and provide analysis and optimization methods that are suitable for exactly their solution. How can they be combined? When an auto-tuner has the option to choose a different algorithm, it most likely has to reevaluate which data layouts could be optimal for the different algorithms. But what about data partitioning and data layouts? Do they exclude each other? Are they orthogonal to each other? Do layouts have an influence on partitioning at all? To our knowledge there has been no research conducted in this direction so far.
If different optimizations do not allow any conclusions to be drawn about each other, empirical profiling would become very difficult, as this would significantly increase the number of configurations that actually have to be evaluated. In this case a more analytical or hybrid method could be the right choice. However, as already mentioned, this could again have difficulties with unknown future hardware and unknown internal implementations of proprietary hardware. One idea for research could be: “How to design hardware that has a predictable performance, where the prediction can be done either entirely analytically or in a hybrid way?” Even if this hardware were slower than an unpredictable one, the option to optimize software optimally for the hardware without “guessing” and “magic witchcraft” of compilers or auto-tuners could in the end make it as fast or even faster.
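How quickly the number of configurations grows can be sketched with a short calculation. The two helper functions below are hypothetical and only contrast two idealized extremes: optimizations that must be evaluated jointly (product of the option counts) and optimizations that are assumed to be fully independent and can be tuned one after another (sum of the option counts).

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Configurations to evaluate if no conclusions can be drawn across optimizations:
// every option of one optimization is combined with every option of all others.
inline std::uint64_t jointConfigurations(const std::vector<std::uint64_t>& optionsPerOptimization) {
    std::uint64_t total = 1;
    for (std::uint64_t n : optionsPerOptimization) total *= n;
    return total;
}

// Configurations to evaluate under the (idealized) assumption that the
// optimizations are independent and can be tuned one after another.
inline std::uint64_t independentConfigurations(const std::vector<std::uint64_t>& optionsPerOptimization) {
    return std::accumulate(optionsPerOptimization.begin(), optionsPerOptimization.end(),
                           std::uint64_t{0});
}
```

For example, with 3 algorithms, 4 data layouts, 8 partitionings and 5 schedules (made-up counts), the joint search covers 480 configurations, whereas the independent case needs only 20 runs.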
10.4 Performance models, continuous monitoring and adaptive reoptimization
MATOG and many other auto-tuners rely on profiling the application with a reference dataset, which is supposed to represent a realistic workload. However, many users will most likely use a rather small dataset as they do not want to wait hours for the auto-tuner to optimize the application. Depending on the application, this can lead to suboptimal optimization results. A better approach would be to use the current techniques to establish an initial performance model and then continuously monitor the performance of the application during runtime. With performance models (based on the initial profiling) it would be possible to check whether the prediction differs from the monitoring results. The models could then be continuously improved in parallel to the application runtime, on realistic workloads. Depending on the applied optimizations, and if it is possible to draw conclusions from running one configuration about others (as is possible with the prediction that MATOG is built upon), the auto-tuner could make better decisions. Further, it could be constantly improved during runtime, even with changing workloads. In Chapter 8 we showed initial experiments for such a system, based on the data available in MATOG.
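The check whether the prediction differs from the monitoring results could, for instance, look like the following sketch. It is not the method used in MATOG or in Chapter 8; the class `DriftMonitor`, its threshold and its smoothing factor are hypothetical, and predicted times are assumed to be positive.

```cpp
#include <cmath>

// Hypothetical runtime monitor for one kernel configuration. It compares the
// time predicted by a performance model with measured times and keeps an
// exponentially smoothed relative error.
class DriftMonitor {
public:
    DriftMonitor(double threshold, double smoothing)
        : threshold_(threshold), smoothing_(smoothing) {}

    // Records one observation. Returns true if the smoothed relative error now
    // exceeds the threshold, i.e., the model should be refined or tuning repeated.
    bool observe(double predictedSeconds, double measuredSeconds) {
        const double relError =
            std::fabs(measuredSeconds - predictedSeconds) / predictedSeconds;
        smoothedError_ = hasSample_
            ? smoothing_ * relError + (1.0 - smoothing_) * smoothedError_
            : relError;  // the first sample initializes the average
        hasSample_ = true;
        return smoothedError_ > threshold_;
    }

    double smoothedError() const { return smoothedError_; }

private:
    double threshold_;
    double smoothing_;
    double smoothedError_ = 0.0;
    bool hasSample_ = false;
};
```

The smoothing keeps a single outlier (e.g., a run disturbed by another process) from immediately triggering a re-optimization, while a persistent change in the workload raises the error above the threshold after a few observations.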
10.5 Incompatibility of Auto-Tuners
Another unresolved problem is that most auto-tuners are incompatible with each other, so that, e.g., using MAPS [Ben-Nun et al. 2015] and MATOG together would not work, as MATOG data structures cannot be used together with the data partitioning of MAPS. The same applies to libraries: MATOG can be used together with THRUST data structures in the same kernel, but MATOG data structures cannot be used in THRUST functions. This problem already starts with the array layouts, as partitioning an AoS is quite simple and can be done by a simple memory copy, whereas for SoA or even AoSoA the data itself has to be explicitly unweaved into different data segments.
However, the problem of incompatibility is rooted even lower, at the hardware level. With an increasing number of different compute platforms (CPUs, GPUs, FPGAs, ...) the number of languages, dialects and extensions to program these increases significantly. Approaches such as OpenCL tried to put all of these under the same roof, but actually failed as vendor support is being reduced more and more. NVIDIA only supported OpenCL until v1.2 [NVIDIA 2015] and Intel no longer supports it for the Xeon Phi Knights Landing. Even AMD no longer seems to believe in it, as they released their Heterogeneous-Compute Interface for Portability (HIP) [AMD 2016], which is very similar to CUDA and even provides a CUDA to HIP converter.
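The difference between partitioning an AoS and an SoA can be made concrete with a minimal host-side sketch. The types `Particle` and `ParticlesSoA` and both functions are hypothetical illustrations and not part of MATOG or MAPS; the requested range is assumed to lie within the data, and AoSoA is not covered.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Particle { float x, y, z; };  // one element of an AoS

// SoA: every field is stored in its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// AoS: the partition [begin, begin + count) is one contiguous byte range,
// so a single memcpy extracts it.
inline std::vector<Particle> partitionAoS(const std::vector<Particle>& data,
                                          std::size_t begin, std::size_t count) {
    std::vector<Particle> part(count);
    if (count > 0) {
        std::memcpy(part.data(), data.data() + begin, count * sizeof(Particle));
    }
    return part;
}

// Copies the range [begin, begin + count) of a single field.
inline std::vector<float> slice(const std::vector<float>& field,
                                std::size_t begin, std::size_t count) {
    const auto first = field.begin() + static_cast<std::ptrdiff_t>(begin);
    return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(count));
}

// SoA: the same partition is spread over one segment per field, so every
// field has to be copied separately.
inline ParticlesSoA partitionSoA(const ParticlesSoA& data,
                                 std::size_t begin, std::size_t count) {
    return ParticlesSoA{slice(data.x, begin, count),
                        slice(data.y, begin, count),
                        slice(data.z, begin, count)};
}
```

A partitioning library that only knows how to copy one contiguous range therefore handles the AoS case, but needs layout knowledge (number of fields, their offsets and element sizes) for the SoA case.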
The reason for this is most likely that OpenCL is too low level and has to be optimized specifically for the underlying hardware (as is necessary with every other language as well), so the dream of writing code once and executing it on all platforms is not fulfilled. On the contrary, it not only requires writing different kernels for all kinds of devices, but also does not allow the use of specialized features that the hardware provides (e.g., the shuffle operation in CUDA).
The problem of many new programming languages today is that they do not live very long. Every year new languages appear with new fancy features and disappear some years later because someone came up with a new, even more “hipster” language. The only constant languages for years have been C/C++ and Java [TIOBE 2016]. But these require the software to be specifically tuned for the hardware, which includes a very high maintenance overhead to keep long-living software functional and performant on today's and future hardware. This misses the actual goal of code portability and maintainability. In order to establish a successor for OpenCL, it would be necessary to establish an entire system stack rather than only a programming language. The focus of such a system should explicitly be performance portability and maintainability. An idea for such a system is discussed in the next section.
10.6 Performance Portability Aware Software Stack
In this section we discuss the idea of a performance portability aware software stack. The main concept is that algorithm and implementation are separated. This could be achieved if the application only serves as a control server that schedules tasks and receives results. The underlying automatic task scheduler either controls a single workstation with CPUs and accelerator cards, or it controls an entire cluster and offloads the tasks onto various heterogeneous compute nodes. This should be transparent to the program and could use a vendor specific implementation, similar to today's MPI implementations. Several such scheduling systems already exist today, also for heterogeneous hardware, as shown in Section 4.5. However, they do not really provide performance portability, as they still rely on a tight coupling of algorithm and implementation. Figure 10.1 shows an overview of the system that we are proposing.
The key idea of our programming model is to split the program code into a high-level algorithmic part and a low-level implementation part, because algorithms are in general portable but implementations are device specific. The algorithmic part has to be high-level so that the programmer can concentrate on the algorithm rather than on hardware specific functionality or limitations.
[Figure 10.1 (diagram). Labels shown: Tasks (Task A, Task B, Task C); Libraries (BLAS, Matrix Multiplication) with their algorithms and device specific implementations (Intel CPU, Intel Xeon Phi, NVIDIA GPU, NVIDIA Kepler, NVIDIA P100, AMD GPU, AMD Kaveri); Scheduler; Auto-Tuning (Algorithm Selection, Implementation Selection, Node Selection, Data Partitioning, Data Layouts); hardware (Workstation, Cluster Nodes).]
Figure 10.1: Schematic illustration of our proposed system. At the user level (blue) a programmer can define tasks and base their functionality on libraries that provide highly optimized implementations for commonly used functions. A scheduler then takes control of tasks and data transfers in the system, which can be a workstation or even an entire cluster. As no two applications are alike and every workstation or cluster differs in its hardware and interconnect, an auto-tuning layer should take care of selecting optimal parameters for the code, as well as the hardware it is supposed to run on.
The low-level implementation part can use hardware specific functionality and should be provided by experienced programmers. This is complemented by an auto-tuning approach that chooses the actual implementation and the processing unit on which the algorithm is executed in each specific instance. As auto-tuning always depends on the underlying hardware, it could be designed in a driver fashion, similar to drivers in operating systems, which usually implement predefined APIs to plug into the system while their actual operation is hidden inside the driver. In the same way, different auto-tuning optimizations could be plugged into the system, depending on the demands of the application or hardware.
Algorithms should be programmed in encapsulated functions (in the following called tasks) in any high-level language (e.g., C#), but cannot access any low-level or vendor specific functionality. Instead, parallelization features as known from OpenMP and OpenACC or commonly used parallel functions as in CUB [NVIDIA 2013] or Intel's Threading Building Blocks (TBB) [INTEL 2016] should be available. As different processor types are better suited for certain algorithms, the user should be able to provide multiple implementations of an algorithm, so that a suitable algorithm for a platform can be chosen automatically. For example, the user could provide an exact and a Monte Carlo based solving technique. However, code performance always benefits if the code is specifically designed for a given hardware and uses vendor specific functionality. For this purpose, libraries with an API (e.g., BLAS) that can be used inside the tasks should be available. The actual implementation of the libraries can be specifically assigned to a hardware type (e.g., NVIDIA GPUs), a hardware generation (e.g., NVIDIA Pascal GPUs) or even a specific model (e.g., NVIDIA GTX 1080) and can be provided in a vendor specific language, e.g., CUDA, OpenCL or HIP. Again, multiple implementations are possible, from which an auto-tuner could then choose. The implementations of libraries are meant for experts and hardware enthusiasts only. However, novice users would still be able to use such a system, as basic functionality such as BLAS or matrix multiplication can easily be provided by vendors, as it is already available today.
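The assignment of library implementations to a hardware type, generation or model could be represented roughly as in the following sketch. All names (`Device`, `Implementation`, `selectImplementation`) are hypothetical, and the fixed "most specific match wins" rule is only a stand-in for the decision that an auto-tuner would make in the proposed system.

```cpp
#include <string>
#include <vector>

// Hypothetical description of the device a task is about to run on.
struct Device {
    std::string type;        // e.g., "NVIDIA GPU"
    std::string generation;  // e.g., "Pascal"
    std::string model;       // e.g., "GTX 1080"
};

// Hypothetical registry entry; an empty field acts as a wildcard.
struct Implementation {
    std::string name;
    std::string type, generation, model;
};

// Returns -1 if `impl` does not fit `dev`, otherwise the number of
// non-wildcard fields that match (higher means more specific).
inline int specificity(const Implementation& impl, const Device& dev) {
    int score = 0;
    if (!impl.type.empty())       { if (impl.type != dev.type) return -1; ++score; }
    if (!impl.generation.empty()) { if (impl.generation != dev.generation) return -1; ++score; }
    if (!impl.model.empty())      { if (impl.model != dev.model) return -1; ++score; }
    return score;
}

// Picks the most specific fitting implementation, or nullptr if none fits.
inline const Implementation* selectImplementation(
        const std::vector<Implementation>& candidates, const Device& dev) {
    const Implementation* best = nullptr;
    int bestScore = -1;
    for (const Implementation& impl : candidates) {
        const int score = specificity(impl, dev);
        if (score > bestScore) { best = &impl; bestScore = score; }
    }
    return best;
}
```

In a real system the static rule would presumably be replaced or complemented by measurements, since a more specific implementation is not guaranteed to be the fastest one for every input.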
In addition to the separation of algorithm and low-level implementation, the auto-tuning should dynamically adjust performance critical parameters at runtime. Optimizations should include algorithm selection, data layouts, data partitioning, device selection/scheduling, message passing, and a dynamic allocation of compute nodes to fit the needs of the current compute state, so that clusters can be used more efficiently.
Of course, establishing such a system requires a lot of work and definitely cannot be done by a single person or a small research group alone. It most likely requires support from industry and the community. Therefore, it would already be a great contribution if a concrete concept and a suitable API existed, comparable to the MPI standard.
Appendix A
Benchmark Training-/Testing-Data
This appendix contains information about the training and testing datasets and parameters used in our evaluation.
A.1 Bitonic Sort
The datasets for the Bitonic Sort contain varying random, ascending or descending values, ranging from 0 to 1023 (255 for the 1B field). Format: dataset name (number of items).
Training:
• dddd.dat (131.0k)
• rddd.dat (65.5k)
Testing:
• adad.dat (2.1M)
• raaa.dat (262.1k)
• dada.dat (524.3k)
• aaaa.dat (1.0M)
• rrrr.dat (4.2M)
A.2 SRAD
For SRAD we used three different speckle parameter sets that we executed with the given sizes. For training we executed only one iteration, for testing 100. We further varied the grid size, as shown below. The speckle values are:
• X = (0, 127), Y = (0, 127), λ = 0.2
• X = (253, 213), Y = (32, 74), λ = 1.0
• X = (0, 53), Y = (74, 222), λ = 0.6
Training:
• 512x512
Testing:
• 128x128
• 256x256
• 1024x1024
• 2056x2056
A.3 Hotspot
The Hotspot benchmark is only provided with three datasets.
Training:
• power_512
Testing:
• power_64
• power_1024
A.4 DPID
For this benchmark we used the video “Fuerteventura 4K - A Timelapse Adventure”¹. We want to thank S. Schall and J. Schmid for letting us use their video. The video has an input resolution of 3840x2178. For training we downscaled only one frame, for testing 100 frames. Further, we changed the output resolution to:
Training:
• 1920x1123
• 768x436
Testing:
• 2048x1162
• 640x363
• 320x182
¹ youtu.be/40s_HSZkt3U
A.5 COMIC
For COMIC we used a series of different datasets with varying counts of sequences and sequence lengths. Format: dataset name (count of sequences / length of sequences).
Training:
• PF00520_mod (2,261 / 238)
• L4 (390 / 244)
Testing:
• alnComplete_A2Seqs1 (140 / 71)
• Calmodulin_MSA_clustalw (753 / 264)
• hcn (211 / 570)
• aln254718 (211 / 465)
• Calmodulin_MSA_muscle (753 / 275)
• PF07885_mod (4,204 / 120)
• ProtShort1_Selection (500 / 99)
• ProtShort1 (16,000 / 99)
• S2 (376 / 222)
• PF01007_mod (616 / 336)
A.6 REYES
All training runs rendered two frames, while all testing runs rendered 100 frames. Format: dataset name (render resolution / patch counts).
Training:
• Utah Teapot (1920x1080 / 32)
• Utah Teapot (640x480 / 32)
Testing:
• Aphroidite (1920x1080 / 4,004)
• Bike (640x480 / 5,216)
• Cube (1024x786 / 6)
• Gumbo (720x480 / 128)
• Motor Bike (1440x960 / 826)
• Plato (1920x1200 / 224)
• Square (320x200 / 1)
• Utah Teapot (1280x854 / 32)
A.7 KD-Tree
All runs have been performed with 32 bins. Format: dataset name (number of triangles / type of scene).
Training:
• Happy Buddha (1,087,716 / 3D-Scan)
Testing:
• Bunny (69,451 / 3D-Scan)
• Conference Room (282,755 / artistic scene)
• Crytek Sponza (262,267 / artistic scene)
• Dabrovic Sponza (66,450 / artistic scene)
• Kitchen (425,504 / artistic scene)
• Mustang (787,668 / artistic scene)
• Sibenik (75,284 / artistic scene)
• San Miguel² (10,500,482 / artistic scene)
• Powerplant³ (12,759,246 / artistic scene)
² only on GTX 1080 and Tesla K20c
³ only on GTX 1080
Acronyms
ADG: Array Dependency Graph
AoS: Array of Structs
AoSoA: Array of Structure of Arrays
API: Application Programming Interface
APU: Accelerated Processing Unit
ASIC: Application Specific Integrated Circuits
ASTA: Array-of-Structure-of-Tiled-Arrays
AVX: Advanced Vector Extensions
Bit/s: Bits per second
BLAS: Basic Linear Algebra Subprograms
CC: Compute Capabilities
COMIC: Coevolution via MI on CUDA
CPI: Cycles Per Instruction
CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture
CUPTI: CUDA Profiling Tools Interface
DDG: Decision Dependency Graph
DDR: Double Data Rate
DFT: Discrete Fourier Transformation
DM: Directional Model
DNA: Deoxyribonucleic acid
DPID: Detail-Preserving Image Downscaling
DRAM: Dynamic RAM
DSP: Digital Signal Processing
EE: exhaustive profiling/exhaustive analysis
EHP: Exascale Heterogeneous Processor
EP: exhaustive profiling/predictive analysis
ELF: Earliest Load First
FeRAM: Ferroelectric RAM
FLOPS: Floating Point Operations Per Second
FPGA: Field Programmable Gate Array
GDDR: Graphics DDR
GEMM: Dense Matrix-Matrix Multiplication
GEMV: Dense Matrix-Vector Multiplication
GP: Gaussian Processes
GPML: Gaussian Processes for Machine Learning
GP lin.: linear
GP lin. + SE: combined linear plus squared exponential
GP SE: squared exponential
GP SE + ARD: squared exponential with automatic relevance determination
GPU: Graphics Processing Unit
H2C: Heterogeneous Habanero-C
HBM: High Bandwidth Memory
HDD: Hard Disk Drive
HIP: Heterogeneous-Compute Interface for Portability
HMPP: Hybrid Multicore Parallel Programming
HPC: High Performance Computing
HPL: Heterogeneous Programming Library
I/O: input/output
ILP: Instruction Level Parallelism
IPS: Instructions per Second
ISPC: Intel Single-Program Multiple-Data Program Compiler
JSON: JavaScript Object Notation
lin. reg.: linear regression
LRU: least recently used
MATOG: “MATOG: Auto-Tuning on GPUs”
MIC: Many Integrated Core
MIMD: Multiple Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MMX: Multi Media Extension
MPI: Message Passing Interface
MRAM: Magnetoresistive RAM
NN: Neural Network
NVRAM: Non-volatile RAM
PCIe: Peripheral Component Interconnect Express
PCRAM: Phase-change RAM
PP: predictive profiling/predictive analysis
PTF: Periscope Tuning Framework
PTX: Parallel Thread eXecution architecture
PU: Processing Unit
QPI: QuickPath Interconnect
RAM: Random Access Memory
READEX: Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing
REYES: Renders Everything You Ever Saw
RRMSE: Relative Root Mean Squared Error
SATA: Serial AT Attachment
SD-RAM: Synchronous Dynamic RAM
SIMD: Single Instruction, Multiple Data
SIMT: Single Instruction, Multiple Thread
SISD: Single Instruction, Single Data
SM: Streaming Multi-Processor
SMT: Simultaneous Multithreading
SoA: Structure of Arrays
SoAoS: Structure of Array of Structures
SpMV: Sparse Matrix-Vector Multiplication
SRAD: Speckle Reducing Anisotropic Diffusion
SRAM: Static RAM
SSD: Solid State Disk
SVM: Support Vector Machine
TBB: Threading Building Blocks
TPU: Tensor Processing Unit
VLIW: Very Long Instruction Word
Bibliography
[7-CPU 2016] 7-CPU (2016). Intel Skylake - Intel i7-6700. www.7-cpu.com/cpu/Skylake.html [accessed 07.04.2017] (cit. on p. 15).
[AMD 2000] AMD (2000). 3DNow! Technology Manual. support.amd.com/TechDocs/21928.pdf [accessed 07.04.2017] (cit. on p. 17).
[AMD 2015] AMD (2015). High-Bandwidth Memory (HBM) - Reinventing Memory Technology. www.amd.com/Documents/High-Bandwidth-Memory-HBM.pdf [accessed 07.04.2017] (cit. on p. 21).
[AMD 2016] AMD (2016). It’s HIP to be Open - Convert your CUDA Code to C++ Using AMD’s New HIP Tool. www.amd.com/Documents/HIP-Datasheet.pdf [accessed 07.04.2017] (cit. on p. 121).
[Afonso et al. 2016] Afonso, S., A. Acosta, and F. Almeida (2016). “Automatic Generation of OpenCL Code for ARM Architectures”. In: Proc. Euro-Par (cit. on p. 44).
[Agakov et al. 2006] Agakov, F., E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams (2006). “Using Machine Learning to Focus Iterative Optimization”. In: Proc. CGO (cit. on p. 44).
[Ahmed and Schuegraf 2011] Ahmed, K. and K. Schuegraf (2011). “Transistor Wars - Rival architectures face off in a bid to keep Moore’s Law alive”. In: IEEE Spectrum. spectrum.ieee.org/semiconductors/devices/transistor-wars [accessed 07.04.2017] (cit. on p. 17).
[Ainsworth and Jones 2017] Ainsworth, S. and T. M. Jones (2017). “Software Prefetching for Indirect Memory Accesses”. In: Proc. CGO (cit. on p. 51).
[Amend 2017] Amend, S. C. (2017). “Predicting Execution Time of GPU Kernels using automatic Performance Models”. In: TU Darmstadt, Master Thesis (cit. on pp. 5, 6, 95).
[Ansel 2014] Ansel, J. (2014). “Autotuning Programs with Algorithmic Choice”. PhD thesis. MIT (cit. on pp. 39, 41, 75).
[Ansel et al. 2009] Ansel, J., C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe (2009). “PetaBricks: A Language and Compiler for Algorithmic Choice”. In: Proc. PLDI (cit. on p. 45).
[Ansel et al. 2011] Ansel, J., M. Pacula, S. Amarasinghe, and U.-M. O’Reilly (2011). “An Efficient Evolutionary Algorithm for Solving Incrementally Structured Problems”. In: Proc. GECCO (cit. on p. 45).
[Ansel et al. 2014] Ansel, J., S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O’Reilly, and S. Amarasinghe (2014). “OpenTuner: An Extensible Framework for Program Autotuning”. In: Proc. PACT (cit. on p. 48).
[Armstrong 2016] Armstrong, A. (2016). “Samsung 960 EVO M.2 NVMe SSD Review”. In: StorageReview.com. www.storagereview.com/samsung_960_evo_m2_nvme_ssd_review [accessed 07.04.2017] (cit. on p. 15).
[Arslan et al. 2016] Arslan, E., K. Guner, and T. Kosar (2016). “HARP: Predictive Transfer Optimization Based on Historical Analysis and Real-time Probing”. In: Proc. SC (cit. on p. 50).
[Babokin and Brodman 2016] Babokin, D. and J. Brodman (2016). Intel SPMD Program Compiler - An open-source compiler for high-performance SIMD programming on the CPU. ispc.github.io/ [accessed 07.04.2017] (cit. on p. 18).
[Baghsorkhi et al. 2010] Baghsorkhi, S. S., M. Delahaye, S. J. Patel, W. D. Gropp, and W. mei W. Hwu (2010). “An Adaptive Performance Modeling Tool for GPU Architectures”. In: Proc. PPoPP (cit. on p. 43).
[Baghsorkhi et al. 2012] Baghsorkhi, S. S., I. Gelado, M. Delahaye, and W. mei W. Hwu (2012). “Efficient Performance Evaluation of Memory Hierarchy for Highly Multithreaded Graphics Processors”. In: Proc. PPoPP (cit. on p. 43).
[Bajrovic and Benkner 2014] Bajrovic, E. and S. Benkner (2014). “Automatic Performance Tuning of Pipeline Patterns for Heterogeneous Parallel Architectures”. In: Proc. PDPTA (cit. on p. 50).
[Bajrovic et al. 2013] Bajrovic, E., S. Benkner, J. Dokulil, and M. Sandrieser (2013). “Autotuning of Pattern Runtimes for Accelerated Parallel Systems”. In: CSE (cit. on p. 47).
[Baker et al. 2011] Baker, S., D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski (2011). “A Database and Evaluation Methodology for Optical Flow”. In: IJCV (cit. on p. 119).
[Bakhoda et al. 2009] Bakhoda, A., G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt (2009). “Analyzing CUDA Workloads Using a Detailed GPU Simulator”. In: Proc. ISPASS (cit. on p. 43).
[Bao et al. 2016] Bao, W., C. Hong, S. Chunduri, S. Krishnamoorthy, L.-N. Pouchet, F. Rastello, and P. Sadayappan (2016). “Static and Dynamic Frequency Scaling on Multicore CPUs”. In: ACM TACO (cit. on p. 47).
[Barman et al. 2011] Barman, S., R. Bodik, S. Jain, Y. Pu, S. Srivastava, and N. Tung (2011). “Parallel Programming with Inductive Synthesis”. In: Proc. HotPar (cit. on p. 45).
[Batcher 1968] Batcher, K. E. (1968). “Sorting Networks and Their Applications”. In: Proc. SJCC (cit. on p. 76).
[Bauer 2014] Bauer, M. (2014). “Legion: Programming Distributed Heterogeneous Architectures with Logical Regions”. PhD thesis. Stanford University (cit. on p. 50).
[Beaumont et al. 2016] Beaumont, O., T. Cojean, L. Eyraud-Dubois, A. Guermouche, and S. Kumar (2016). “Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources”. In: Proc. HiPC (cit. on p. 46).
[Bell and Garland 2008] Bell, N. and M. Garland (2008). Efficient Sparse Matrix-Vector Multiplication on CUDA. Tech. rep. NVIDIA (cit. on p. 46).
[Ben-Nun et al. 2015] Ben-Nun, T., E. Levy, A. Barak, and E. Rubin (2015). “Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle”. In: Proc. SC (cit. on pp. 39, 50, 117, 120, 121).
[Bergstra et al. 2012] Bergstra, J., N. Pinto, and D. Cox (2012). “Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees”. In: Proc. InPar (cit. on pp. 39, 42).
[Berkeley 2014] Berkeley (2014). Moore’s Law and Computer Processing Power. Tech. rep. datascience.berkeley.edu/moores-law-processing-power/ [accessed 07.04.2017]. Berkeley - University of California (cit. on pp. 1, 16).
[Bhat et al. 2006] Bhat, V., M. Parashar, H. Liu, M. Khandekar, N. Kandasamy, and S. Abdelwahed (2006). “Enabling Self-Managing Applications using Model-based Online Control Strategies”. In: Proc. ICAC (cit. on p. 48).
[Bianchin et al. 2008] Bianchin, S., P. Achenbach, S. Ajimura, O. Borodina, T. Fukuda, J. Hoffmann, M. Kavatsyuk, K. Koch, T. Koike, N. Kurz, F. Maas, S. Minami, Y. Mizoi, T. Nagae, D. Nakajima, A. Okamura, W. Ott, B. Özel, J. Pochodzalla, C. Rappold, T. R. Saito, A. Sakuguchi, M. Sako, M. Sekimoto, H. Suhimura, T. Takahashi, H. Tamura, K. Tanida, and W. Trautmann (2008). “The HyPHI Project: Hypernuclear Spectroscopy with Stable Heavy Ion Beams and Rare Isotope Beams at GSI and FAIR”. In: ArXiv (cit. on p. 47).
[Bischof et al. 2012] Bischof, C., D. an Mey, and C. Iwainsky (2012). “Brainware for green HPC”. In: Computer Science - Research and Development (cit. on pp. 1, 3).
[Bodin et al. 2016] Bodin, B., L. Nardi, P. H. J. Kelly, and M. F. P. O’Boyle (2016). “Diplomat: Mapping of multi-kernel applications using a static dataflow abstraction”. In: Proc. MASCOTS (cit. on p. 49).
[Bolchini et al. 2016] Bolchini, C., S. Cherubin, G. C. Durelli, S. Liutti, A. Miele, and M. D. Santambrogio (2016). “A Runtime Controller for OpenCL Applications on Heterogeneous System Architectures”. In: Proc. ESWEEK (cit. on p. 49).
[Bradski 2000] Bradski, G. (2000). “The OpenCV Library”. In: Dr. Dobb’s Journal of Software Tools (cit. on p. 78).
[Bruel et al. 2015] Bruel, P., M. A. Gonzalez, and A. Goldman (2015). “Autotuning GPU Compiler Parameters Using OpenTuner”. In: Proc. SHPC (cit. on p. 48).
[Buck et al. 2004] Buck, I., T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan (2004). “Brook for GPUs: Stream Computing on Graphics Hardware”. In: Proc. SIGGRAPH (cit. on p. 18).
[Buyya et al. 2009] Buyya, R., C. S. Yeo, S. Venugopal, J. Broberg, and I. Bradic (2009). “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility”. In: Future Generation Computer Systems (cit. on p. 19).
[CRUCIAL 2015] CRUCIAL (2015). Speed vs. Latency - Why CAS latency isn’t an accurate measure of memory performance. www.crucial.com/usa/en/memory-performance-speed-latency [accessed 07.04.2017] (cit. on pp. 2, 15).
[CVDF 2016] CVDF (2016). Common Visual Data Foundation. www.cvdfoundation.org [accessed 07.04.2017] (cit. on p. 119).
[Calore et al. 2016] Calore, E., A. Gabbana, J. Kraus, S. F. Schifano, and R. Tripiccione (2016). “Performance and portability of accelerated lattice Boltzmann applications with OpenACC”. In: CCPE (cit. on p. 44).
[Calotoiu et al. 2013] Calotoiu, A., T. Hoefler, M. Poke, and F. Wolf (2013). “Using automated performance modeling to find scalability bugs in complex codes”. In: Proc. SC (cit. on p. 43).
[Calotoiu et al. 2016] Calotoiu, A., D. Beckingsale, C. W. Earl, T. Hoefler, I. Karlin, M. Schulz, and F. Wolf (2016). “Fast Multi-Parameter Performance Modeling”. In: Proc. CLUSTER (cit. on p. 43).
[Cantanzaro et al. 2014] Cantanzaro, B., A. Keller, and M. Garland (2014). “A Decomposition for In-place Matrix Transposition”. In: Proc. PPoPP (cit. on p. 50).
[Catanzaro et al. 2010] Catanzaro, B., M. Garland, and K. Keutzer (2010). “Copperhead: Compiling an Embedded Data Parallel Language”. In: Proc. PPoPP (cit. on p. 45).
[Chan et al. 2009] Chan, C., J. Ansel, Y. L. Wong, S. Amarasinghe, and A. Edelman (2009). “Autotuning Multigrid with PetaBricks”. In: Proc. SC (cit. on p. 45).
[Chang and Karamcheti 2001] Chang, F. and V. Karamcheti (2001). “A Framework for Automatic Adaptation of Tunable Distributed Applications”. In: Cluster Computing (cit. on p. 48).
[Chang et al. 2016] Chang, L.-W., H.-S. Kim, and W. mei W. Hwu (2016). “DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model”. In: Proc. ASPLOS (cit. on pp. 46, 50).
[Chase et al. 2008] Chase, J., B. Nelson, J. Bodily, and L. Dha-Jye (2008). “Real-Time Optical Flow Calculations on FPGA and GPU Architecture: A Comparison Study”. In: Proc. FPCCM (cit. on p. 19).
[Che et al. 2009] Che, S., M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron (2009). “Rodinia: A Benchmark Suite for Heterogeneous Computing”. In: Proc. IISWC (cit. on pp. 75, 77, 78, 118).
[Cheng et al. 2017] Cheng, D., J. Rao, Y. Guo, C. Jiang, and X. Zhou (2017). “Improving Performance of Heterogeneous MapReduce Clusters with Adaptive Task Tuning”. In: IEEE TPDS (cit. on p. 49).
[Choi et al. 2010] Choi, J. W., A. Singh, and R. W. Vuduc (2010). “Model-driven autotuning of sparse matrix-vector multiply on GPUs”. In: Proc. PPoPP (cit. on pp. 39, 46).
[Christen et al. 2011] Christen, M., O. Schenk, and H. Burkhart (2011). “PATUS: A Code Generation and Auto-Tuning Framework For Parallel Stencil Computations”. In: Proc. IPDPS (cit. on p. 46).
[Chung and Hollingsworth 2004] Chung, I.-H. and J. K. Hollingsworth (2004). “Using Information from Prior Runs to Improve Automated Tuning Systems”. In: Proc. SC (cit. on pp. 41, 42, 48, 71).
[Collange et al. 2009] Collange, S., D. Defour, and D. Parello (2009). Barra, a Modular Functional GPU Simulator for GPGPU. Tech. rep. Univ. de Perpignan (cit. on p. 43).
[Cook 1997] Cook, D. (1997). “Performance Implications of Pointer Aliasing”. In: SGI. ftp.sgi.com/sgi/audio/audio.apps/dev/aliasing.html [accessed 07.04.2017] (cit. on p. 58).
[Cook et al. 1987] Cook, R. L., L. Carpenter, and E. Catmull (1987). “The Reyes Image Rendering Architecture”. In: Proc. SIGGRAPH (cit. on p. 79).
[Coplin and Burtscher 2015] Coplin, J. and M. Burtscher (2015). “Effects of Source-Code Optimizations on GPU Performance and Energy Consumption”. In: Proc. GPGPU (cit. on p. 47).
[Cruz et al. 2016] Cruz, E. H. M., M. Diener, L. L. Pilla, and P. O. A. Navaux (2016). “Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures”. In: ACM TACO (cit. on p. 51).
[Datta et al. 2008] Datta, K., M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick (2008). “Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures”. In: Proc. SC (cit. on p. 46).
[Davidson et al. 2011] Davidson, A., Y. Zhang, and J. D. Owens (2011). “An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU”. In: Proc. IPDPS (cit. on p. 47).
[Devito et al. 2013] Devito, Z., J. Hegarty, A. Aiken, P. Hanrahan, and J. Vitek (2013). “Terra: A Multi-Stage Language for High-Performance Computing”. In: Proc. PLDI (cit. on p. 46).
[Dolbeau et al. 2007] Dolbeau, R., S. Bihan, and F. Bodin (2007). “HMPP: A Hybrid Multi-core Parallel Programming Environment”. In: Proc. GPGPU (cit. on p. 44).
[Dublish et al. 2016] Dublish, S., V. Nagarajan, and N. Topham (2016). “Cooperative Caching for GPUs”. In: ACM TACO (cit. on p. 51).
[Durillo and Fahringer 2015] Durillo, J. and T. Fahringer (2015). “From single- to multi-objective auto-tuning of programs: Advantages and implications”. In: Scientific Programming (cit. on p. 47).
[Edwards and Trott 2013] Edwards, H. C. and C. R. Trott (2013). “Kokkos: Enabling performance portability across manycore architectures”. In: Proc. XSW (cit. on pp. 39, 50).
[Elangovan et al. 2015] Elangovan, V. K., R. M. Badia, and E. Ayguadé (2015). “Auto-Tuning OmpSs-OpenCL Kernels Across GPU Machines”. In: Proc. PARMA-DITAM (cit. on p. 44).
[Enmyren and Kessler 2010] Enmyren, J. and C. W. Kessler (2010). “SkePU: a multi-backend skeleton programming library for multi-GPU systems”. In: Proc. HLPP (cit. on p. 50).
[Fabeiro et al. 2014] Fabeiro, J. F., D. Andrade, B. B. Fraguela, and R. Doallo (2014). “Writing self-adaptive codes for heterogeneous systems”. In: Proc. Euro-Par (cit. on p. 49).
[Fachada et al. 2016] Fachada, N., V. V. Lopes, R. Martins, and A. C. Rosa (2016). “cf4ocl: a C framework for OpenCL”. In: ArXiv (cit. on p. 50).
[Fatahalian et al. 2006] Fatahalian, K., T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan (2006). “Sequoia: Programming the Memory Hierarchy”. In: Proc. SC (cit. on p. 45).
[Filipovic et al. 2012] Filipovic, J., J. Fousek, and B. Lakomy (2012). “Automatically Optimized GPU Acceleration of Element Subroutines in Finite Element Method”. In: Proc. SAAHPC (cit. on p. 118).
[Filipovic et al. 2015] Filipovic, J., M. Madzin, J. Fousek, and L. Matyska (2015). “Optimizing CUDA code by kernel fusion: application on BLAS”. In: SC (cit. on p. 118).
[Flynn 1966] Flynn, M. J. (1966). “Some Computer Organizations and Their Effectiveness”. In: IEEE Transactions on Computers (cit. on p. 11).
[Fousek et al. 2011] Fousek, J., J. Filipovic, and M. Madzin (2011). “Automatic fusions of CUDA-GPU kernels for parallel map”. In: ACM SIGARCH Computer Architecture News (cit. on p. 118).
[Frigo 1999] Frigo, M. (1999). “A fast Fourier transform compiler”. In: Proc. PLDI (cit. on p. 47).
[Frigo and Johnson 1998] Frigo, M. and S. G. Johnson (1998). “FFTW: An adaptive software architecture for the FFT”. In: Proc. CASS (cit. on p. 47).
[Frigo and Johnson 2005] Frigo, M. and S. G. Johnson (2005). “The Design and Implementation of FFTW3”. In: IEEE (cit. on p. 47).
[Fursin et al. 2008] Fursin, G., C. Miranda, O. Temam, M. Namolaru, E. Yom-Tov, A. Zaks, E. Bonilla, J. Thomson, H. Leather, C. Williams, and M. O’Boyle (2008). “MILEPOST GCC: machine learning based research compiler”. In: Proc. GCC (cit. on p. 44).
[Fursin et al. 2016] Fursin, G., A. Lokhmotov, and E. Plowman (2016). “Collective Knowledge: Towards R&D sustainability”. In: Proc. DATE (cit. on pp. 119, 120).
[Gadioli et al. 2014] Gadioli, D., S. Libutti, G. Massari, E. Paone, M. Scandale, P. Bellasi, G. Palermo, V. Zaccaria, G. Agosta, W. Fornaciari, and C. Silvano (2014). “OpenCL Application Auto-Tuning and Run-Time Resource Management for Multi-Core Platforms”. In: Proc. ISPA (cit. on p. 49).
[Ganestam and Doggett 2012] Ganestam, P. and M. Doggett (2012). “Auto-tuning Interactive Ray Tracing using an Analytical GPU Architecture Model”. In: Proc. GPGPU (cit. on p. 47).
[Gao and Peterson 2015] Gao, S. and G. D. Peterson (2015). “Optimizing CUDA Shared Memory Usage”. In: Proc. SC (cit. on p. 49).
[Gasior 2014] Gasior, G. (2014). “The SSD Endurance Experiment: Two freaking petabytes”. In: TechReport.com. techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes [accessed 07.04.2017] (cit. on p. 14).
[Gaster and Howes 2011] Gaster, B. R. and L. Howes (2011). The Future of the APU - Braided Parallelism. AMD Fusion Developer Summit (cit. on p. 20).
[Gerndt 2016] Gerndt, M. (2016). “The READEX Project for Dynamic Energy Efficiency Tuning”. In: Proc. SEM4HPC (cit. on p. 47).
[Glatter 2015] Glatter, Z. (2015). “ATI 3D Rage”, “NVIDIA NV1” and “3dfx Voodoo”. In: Vintage3D.org. vintage3d.org/rage.php / vintage3d.org/nv1.php / vintage3d.org/3dfx1.php [accessed 07.04.2017] (cit. on p. 18).
[Goodacre 2011] Goodacre, J. (2011). “The evolution of the microprocessor - from single cpus to many core devices”. In: New Electronics. www.newelectronics.co.uk/electronics-technology/the-evolution-of-the-microprocessor-from-single-cpus-to-many-core-devices/35556/ [accessed 07.04.2017] (cit. on p. 1).
[Götz et al. 2010] Götz, S., C. Wilke, M. Schmidt, S. Chech, and U. Assmann (2010). “Towards Energy Auto-Tuning”. In: Proc. GREEN IT (cit. on p. 47).
[Grasso et al. 2013] Grasso, I., K. Kofler, B. Cosenza, and T. Fahringer (2013). “Automatic problem size sensitive task partitioning on heterogeneous parallel systems”. In: Proc. PPoPP (cit. on p. 49).
[Grauer-Gray et al. 2012] Grauer-Gray, S., L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos (2012). “Auto-tuning a High-Level Language Targeted to GPU Codes”. In: Proc. InPar (cit. on pp. 44, 119).
[Gray and Stratford 2016] Gray, A. and K. Stratford (2016). “A Lightweight Approach to Performance Portability with targetDP”. In: HPCA (cit. on p. 50).
[Green 500 2016] Green 500 (2016). November 2016. www.top500.org/green500/lists/2016/11/ [accessed 07.04.2017] (cit. on p. 1).
[Guo and Wang 2010] Guo, P. and L. Wang (2010). “Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs”. In: Proc. ICCIS (cit. on p. 46).
[Guo et al. 2011] Guo, P., H. Huang, Q. Chen, L. Wang, E.-J. Lee, and P. Chen (2011). “A Model-Driven Partitioning and Auto-tuning Integrated Framework for Sparse Matrix-Vector Multiplication on GPUs”. In: Proc. TeraGrid (cit. on p. 46).
[Gupta 2016] Gupta, S. (2016). “IBM and NVIDIA present the NVLink server you’ve been waiting for”. In: IBM Systems Blog. www.ibm.com/blogs/systems/ibm-nvidia-present-nvlink-server-youve-waiting/ [accessed 07.04.2017] (cit. on p. 20).
[Gysi et al. 2016] Gysi, T., J. Baer, and T. Hoefler (2016). “dCUDA: hardware supported overlap of computation and communication”. In: Proc. SC (cit. on p. 50).
[Hall et al. 2009] Hall, M., J. Chame, C. Chen, J. Shin, G. Rudy, and M. M. Khan (2009). “Loop Transformation Recipes for Code Generation and Auto-Tuning”. In: Proc. LCPC (cit. on pp. 39, 45).
[Han and Abdelrahman 2009] Han, T. D. and T. S. Abdelrahman (2009). “hiCUDA: A High-Level Directive-based Language for GPU Programming”. In: Proc. GPGPU (cit. on pp. 40, 45).
[Han and Abdelrahman 2011a] Han, T. D. and T. S. Abdelrahman (2011a). “Reducing Branch Divergence in GPU Programs”. In: Proc. GPGPU (cit. on p. 48).
[Han and Abdelrahman 2011b] Han, T. D. and T. S. Abdelrahman (2011b). “hiCUDA: High-Level GPGPU Programming”. In: IEEE TPDS (cit. on p. 45).
[Han and Abdelrahman 2013] Han, T. D. and T. S. Abdelrahman (2013). “Reducing Divergence in GPGPU Programs with Loop Merging”. In: Proc. GPGPU (cit. on p. 49).
[Han and Abdelrahman 2014] Han, T. D. and T. S. Abdelrahman (2014). “Automatic Tuning of Local Memory Use on GPGPUs”. In: ArXiv (cit. on p. 51).
[Hechtman et al. 2016] Hechtman, B. A., A. D. Hilton, and D. J. Sorin (2016). “TREES: A CPU-GPU Task-Parallel Runtime with Explicit Epoch Synchronization”. In: ArXiv (cit. on p. 50).
[Helal et al. 2016] Helal, A. E., P. Sathre, and W.-c. Feng (2016). “MetaMorph: a library framework for interoperable kernels on multi- and many-core clusters”. In: Proc. SC (cit. on p. 50).
[Hoffmann et al. 2010] Hoffmann, H., J. Eastep, M. D. Santambrogio, J. E. Miller, and A. Agarwal (2010). “Application Heartbeats: A Generic Interface for Specifying Program Performance and Goals in”. In: Proc. ICAC (cit. on p. 47).
[Hoffmann et al. 2011] Hoffmann, H., S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard (2011). “Dynamic Knobs for Power-Aware Computing”. In: Proc. ASPLOS (cit. on p. 47).
[Hollingsworth and Keleher 1998] Hollingsworth, J. K. and P. J. Keleher (1998). “Prediction and Adaption in Active Harmony”. In: Proc. HPDC (cit. on pp. 38, 48).
[Hong et al. 2012] Hong, S., H. Chafi, E. Sedlar, and K. Olukotun (2012). “Green-Marl: A DSL for Easy and Efficient Graph Analysis”. In: Proc. ASPLOS (cit. on p. 46).
[Hong 2009] Hong, S. (2009). “An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness”. In: Proc. ISCA (cit. on pp. 40, 43).
[Hook and Graves 2016] Hook, C. and L. Graves (2016). “AMD introduces Radeon Instinct: Accelerating Machine Intelligence”. In: AMD Press Releases. www.amd.com/en-us/press-releases/Pages/radeon-instinct-2016dec12.aspx [accessed 07.04.2017] (cit. on p. 20).
[Hsu et al. 2014] Hsu, C.-C., C.-Y. Lin, S. K. Chen, C.-W. Liu, and J.-K. Lee (2014). “Optimized Memory Access Support for Data Layout Conversion on Heterogeneous Multi-core Systems”. In: Proc. ESTIMedia (cit. on pp. 39, 50).
[INTEL 1997] INTEL (1997). Intel Architecture Software Developer’s Manual, Volume 1: Basic Architecture. download.intel.com/design/intarch/manuals/24319001.PDF [accessed 07.04.2017] (cit. on p. 17).
[INTEL 2002] INTEL (2002). Intel Hyper-Threading Technology - Get Faster Performance for Many Demanding Business Applications. www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html [accessed 07.04.2017] (cit. on pp. 17, 25).
[INTEL 2005] INTEL (2005). Intel Pentium D Processor 805, Specifications. ark.intel.com/de/products/27511/ [accessed 07.04.2017] (cit. on p. 17).
[INTEL 2009] INTEL (2009). An Introduction to the Intel QuickPath Interconnect. www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf [accessed 07.04.2017] (cit. on p. 15).
[INTEL 2011] INTEL (2011). Intel Core i7 2700K, Specifications. ark.intel.com/de/products/61275/ [accessed 11.04.2017] (cit. on p. 2).
[INTEL 2013a] INTEL (2013a). Intel Core i7 4765T, Specifications. ark.intel.com/de/products/75121/ [accessed 11.04.2017] (cit. on p. 2).
[INTEL 2013b] INTEL (2013b). Intel Xeon Phi Coprocessor Developer’s Quick Start Guide. software.intel.com/sites/default/files/managed/ee/4e/intel-xeon-phi-coprocessor-quick-start-developers-guide.pdf [accessed 07.04.2017] (cit. on p. 18).
[INTEL 2016] INTEL (2016). Threading Building Blocks - Developer Guide. software.intel.com/en-us/node/506045 [accessed 07.04.2017] (cit. on p. 124).
[Imani et al. 2017] Imani, M., D. Peroni, Y. Kim, A. Rhaimi, and T. Rosing (2017). “Efficient Neural Network Acceleration on GPGPU using Content Addressable Memory”. In: Proc. DATE (cit. on p. 47).
[InfiniBand 2016] InfiniBand (2016). InfiniBand Architecture Specification Volume 2, Release 1.3.1. cw.infinibandta.org/document/dl/8125 [accessed 07.04.2017] (cit. on p. 19).
[Inggs et al. 2017] Inggs, G., D. B. Thomas, and W. Luk (2017). “A Domain Specific Approach to High Performance Heterogeneous Computing”. In: IEEE TPDS (cit. on p. 50).
[Ipek et al. 2005] Ipek, E., B. R. de Supinski, M. Schulz, and S. A. McKee (2005). “An approach to performance prediction for parallel applications”. In: Proc. ECPP (cit. on pp. 43, 96).
[Iwainsky et al. 2015] Iwainsky, C., S. Shudler, A. Calotoiu, A. Strube, M. Knobloch, C. Bischof, and F. Wolf (2015). “How Many Threads will be too Many? On the Scalability of OpenMP Implementations”. In: Proc. EUROPAR (cit. on p. 44).
[Jääskeläinen et al. 2014] Jääskeläinen, P., C. S. de La Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg (2014). “pocl: A Performance-Portable OpenCL Implementation”. In: Parallel Programming (cit. on p. 49).
[Jia and Zhou 2016] Jia, Q. and H. Zhou (2016). “Tuning Stencil Codes in OpenCL for FPGAs”. In: Proc. ICCD (cit. on p. 46).
[Jordan et al. 2012] Jordan, H., P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, and H. Moritsch (2012). “A Multi-Objective Auto-Tuning Framework for Parallel Codes”. In: Proc. SC (cit. on p. 47).
[Jouppi 2016] Jouppi, N. (2016). “Google supercharges machine learning tasks with TPU custom chip”. In: Google Cloud Platform Blog. cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html [accessed 07.04.2017] (cit. on pp. 2, 19).
[Kamil et al. 2010] Kamil, S., C. Chan, L. Oliker, J. Shalf, and S. Williams (2010). “An Auto-Tuning Framework for Parallel Multicore Stencil Computations”. In: Proc. IPDPS (cit. on p. 46).
[Karsai et al. 2001] Karsai, G., A. Ledeczi, and J. Sztipanovits (2001). “An Approach to Self-Adaptive Software based on Supervisory Control”. In: Self-Adaptive Software: Applications (cit. on p. 38).
[Khan et al. 2013] Khan, M., P. Basu, G. Rudy, M. Hall, C. Chen, and J. Chame (2013). “A script-based autotuning compiler system to generate high-performance CUDA code”. In: ACM TACO (cit. on p. 46).
[Kim et al. 2016] Kim, J., Y.-J. Lee, J. Park, and J. Lee (2016). “Translating OpenMP Device Constructs to OpenCL using Unnecessary Data Transfer Elimination”. In: Proc. SC (cit. on p. 44).
[Klöckner et al. 2011] Klöckner, A., N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih (2011). “PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation”. In: Parallel Computing (cit. on p. 45).
[Kofler et al. 2015] Kofler, K., B. Cosenza, and T. Fahringer (2015). “Automatic Data Layout Optimizations for GPUs”. In: Proc. Euro-Par (cit. on pp. 23, 39, 40, 50, 75, 118).
[Kopf et al. 2013] Kopf, J., A. Shamir, and P. Peers (2013). “Content-Adaptive Image Downscaling”. In: ACM TOG (cit. on p. 78).
[Krajewski 1985] Krajewski, R. (1985). “Multiprocessing: An Overview”. In: BYTE Magazine 5 (cit. on p. 17).
[Kurzak et al. 2012] Kurzak, J., S. Tomov, and J. Dongarra (2012). “Autotuning GEMM Kernels for the Fermi GPU”. In: IEEE TPDS (cit. on p. 46).
[Langdon et al. 2016] Langdon, W. B., B. Y. H. Lam, M. Modat, J. Petke, and M. Harman (2016). “Genetic Improvement of GPU Software”. In: Genetic Programming and Evolvable Machines (cit. on p. 45).
[Lashgar and Baniasadi 2016] Lashgar, A. and A. Baniasadi (2016). “OpenACC cache Directive: Opportunities and Optimizations”. In: Proc. WACCPD (cit. on p. 44).
[Lee et al. 2007] Lee, B. C., D. M. Brooks, B. R. de Supinski, M. Schulz, K. Singh, and S. A. McKee (2007). “Methods of inference and learning for performance modeling of parallel applications”. In: Proc. PPPP (cit. on pp. 43, 96).
[Lee and Vetter 2014] Lee, S. and J. S. Vetter (2014). “OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing”. In: Proc. HPDC (cit. on pp. 40, 44, 114).
[Lee et al. 2010] Lee, V. W., C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennpaty, P. Hammarlund, R. Singhal, and P. Dubey (2010). “Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU”. In: Proc. ISCA (cit. on p. 118).
[Levenson 2013] Levenson, M. D. (2013). “Lessons From Past Architecture Wars”. In: Semiconductor Manufacturing and Design. semiengineering.com/lessons-architecture-wars/ [accessed 07.04.2017] (cit. on p. 1).
[Li 2016] Li, A. (2016). “GPU Performance Modeling and Optimization”. PhD thesis. Technische Universiteit Eindhoven (cit. on p. 39).
[Li et al. 2015] Li, A., G.-J. van den Braak, A. Kumar, and H. Corporaal (2015). “Adaptive and Transparent Cache Bypassing for GPUs”. In: Proc. SC (cit. on pp. 39, 40, 49).
[Li et al. 2016a] Li, C., Y. Yang, Z. Lin, and H. Zhou (2016a). “Automatic Data Placement into GPU On-Chip memory resources”. In: Proc. CGO (cit. on p. 51).
[Li et al. 2016b] Li, C., Y. Yang, M. Feng, S. Chakradhar, and H. Zhou (2016b). “Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs”. In: Proc. SC (cit. on p. 47).
[Liu et al. 2014] Liu, W., I. A. C. Ureña, M. Gerndt, and B. Gong (2014). “Automatic MPI-IO Tuning with the Periscope Tuning Framework”. In: Proc. IPDPS (cit. on p. 47).
[Liu et al. 2008] Liu, Y., E. Z. Zhang, and X. Shen (2008). A Cross-Input Adaptive Framework for GPU Programs Optimization. Tech. rep. College of William and Mary (cit. on pp. 39–41, 48).
[Long and Fursin 2005] Long, S. and G. Fursin (2005). “A heuristic search algorithm based on Unified Transformation Framework”. In: Proc. ICPP (cit. on p. 44).
[Luo et al. 2015] Luo, Y., G. Tan, Z. Mo, and M. Suo (2015). “FAST: A Fast Stencil Autotuning Framework Based on an Optimal-solution Space Model”. In: Proc. ICS (cit. on p. 46).
[Lutz 2015] Lutz, T. (2015). “Enhancing Productivity and Performance Portability of OpenCL Applications on Heterogeneous Systems using Runtime Optimizations”. PhD thesis. The University of Edinburgh (cit. on p. 49).
[Lutz et al. 2013] Lutz, T., C. Fensch, and M. Cole (2013). “PARTANS: An autotuning framework for stencil computation on multi-GPU systems”. In: ACM TACO (cit. on pp. 39, 46).
[Lutz et al. 2015] Lutz, T., C. Fensch, and M. Cole (2015). “Helium: A Transparent Inter-kernel Optimizer for OpenCL”. In: Proc. GPGPU (cit. on p. 39).
[Macri 2015] Macri, J. (2015). “AMD’s Next Generation GPU and High Bandwidth Memory Architecture: FURY”. In: Proc. Hot Chips Symposium (cit. on p. 21).
[Magni et al. 2014] Magni, A., C. Dubach, and M. O’Boyle (2014). “Automatic Optimization of Thread-Coarsening for Graphics Processors”. In: Proc. PACT (cit. on pp. 39, 41, 49).
[Majeti et al. 2016] Majeti, D., K. S. Meel, R. Barik, and V. Sarkar (2016). “Automatic Data Layout Generation and Kernel Mapping for CPU+GPU Architectures”. In: Proc. CC (cit. on p. 46).
[Marangoni and Wischgoll 2016] Marangoni, M. and T. Wischgoll (2016). “Togpu: Automatic Source Transformation from C++ to CUDA using Clang-LLVM”. In: Proc. Electronic Imaging (cit. on pp. 44, 45).
[Matsumoto et al. 2012] Matsumoto, K., N. Nakasato, and S. G. Sedukhin (2012). “Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs”. In: Proc. SCC (cit. on p. 46).
[Mei and Chu 2017] Mei, X. and X. Chu (2017). “Dissecting GPU Memory Hierarchy Through Microbenchmarking”. In: ArXiv (cit. on p. 43).
[Meng et al. 2011] Meng, J., V. A. Morozov, K. Kumaran, V. Vishwanath, and T. D. Uram (2011). “GROPHECY: GPU Performance Projection from CPU Code Skeletons”. In: Proc. SC (cit. on pp. 43, 54).
[Miceli and Bodin 2013] Miceli, R. and F. Bodin (2013). The State-of-the-Art in Directive-Guided Auto-Tuning for Accelerator and Heterogeneous Many-Core Architectures. Tech. rep. PRACE White Papers (cit. on p. 47).
[Miceli et al. 2013] Miceli, R., G. Civario, A. Sikora, E. César, M. Gerndt, H. Haitof, C. Navarrete, S. Benkner, M. Sandrieser, L. Morin, and F. Bodin (2013). “AutoTune: A Plugin-Driven Approach to the Automatic Tuning of Parallel Applications”. In: Proc. PARA (cit. on p. 47).
[Michaud 2016] Michaud, P. (2016). “Some Mathematical Facts About Optimal Cache Replacement”. In: ACM TACO (cit. on p. 51).
[Mills and Mills 2015] Mills, N. and E. Mills (2015). “Taming the energy use of gaming computers”. In: Energy Efficiency (cit. on p. 1).
[Moammer 2016] Moammer, K. (2016). “Nvidia Plans GTX 2080 Ti, 2080 and 2070 Refresh With GDDR5X and Faster Clocks In 2017 - Volta GPUs With HBM2 and GDDR6 in 2018”. In: WCCFTECH.com. wccftech.com/nvidia-pascal-volta-gpu-leaked-2017-2018/ [accessed 07.04.2017] (cit. on p. 33).
[Monakov et al. 2010] Monakov, A., A. Lokhmotov, and A. Avetisyan (2010). “Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures”. In: Proc. HiPEAC (cit. on pp. 39, 46).
[Moore 1965] Moore, G. E. (1965). “Cramming more components onto integrated circuits”. In: Electronics Magazine (cit. on pp. 1, 16).
[Moreira et al. 2017] Moreira, R. E. A., S. Collange, and F. M. Q. Pereira (2017). “Function Call Re-Vectorization”. In: Proc. PPoPP (cit. on p. 49).
[Morton 1966] Morton, G. M. (1966). A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing. Tech. rep. IBM Corporation (cit. on p. 22).
[Moskewicz et al. 2016] Moskewicz, M. W., A. Jannesari, and K. Keutzer (2016). “A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications”. In: ArXiv (cit. on p. 47).
[Muralidharan et al. 2014] Muralidharan, S., M. Shantharam, M. Hall, M. Garland, and B. Catanzaro (2014). “Nitro: A Framework for Adaptive Code Variant Tuning”. In: Proc. IPDPS (cit. on pp. 39, 41, 42, 46, 48, 71, 75, 117, 120).
[Muralidharan et al. 2016a] Muralidharan, S., A. Roy, M. Hall, M. Garland, and P. Rai (2016a). “Architecture-Adaptive Code Variant Tuning”. In: Proc. ASPLOS (cit. on p. 48).
[Muralidharan et al. 2016b] Muralidharan, S., M. Garland, A. Sidelnik, and M. Hall (2016b). “Designing a Tunable Nested Data-Parallel Programming System”. In: ACM TACO (cit. on p. 50).
[NVIDIA 2009] NVIDIA (2009). NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf [accessed 07.04.2017] (cit. on pp. 2, 33).
[NVIDIA 2013] NVIDIA (2013). CUB. nvlabs.github.io/cub/ [accessed 07.04.2017] (cit. on p. 124).
[NVIDIA 2014a] NVIDIA (2014a). Kepler GK110/210 Whitepaper. images.nvidia.com/content/pdf/tesla/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf [accessed 07.04.2017] (cit. on pp. 2, 33).
[NVIDIA 2014b] NVIDIA (2014b). NVIDIA GeForce GTX 980. international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF [accessed 07.04.2017] (cit. on pp. 2, 34).
[NVIDIA 2014c] NVIDIA (2014c). NVIDIA NVLink High-Speed Interconnect: Application Performance. info.nvidianews.com/rs/nvidia/images/NVIDIA NVLink High-Speed Interconnect Application Performance Brief.pdf [accessed 07.04.2017] (cit. on p. 20).
[NVIDIA 2015] NVIDIA (2015). Release 349 Graphics Drivers for Windows, Version 350.12. de.download.nvidia.com/Windows/350.12/350.12-win8-win7-winvista-desktop-release-notes.pdf [accessed 07.04.2017] (cit. on p. 121).
[NVIDIA 2016a] NVIDIA (2016a). CUDA Programming Guide v8.0. docs.nvidia.com/cuda/cuda-c-programming-guide/index.html [accessed 07.04.2017] (cit. on pp. 22, 25, 26, 28, 29, 33, 35, 54, 82, 115).
[NVIDIA 2016b] NVIDIA (2016b). NVIDIA CUDA Driver API v8.0. docs.nvidia.com/cuda/cuda-driver-api/index.html [accessed 07.04.2017] (cit. on p. 115).
[NVIDIA 2016c] NVIDIA (2016c). NVIDIA CUDA Samples v8.0. [included in CUDA Toolkit v8.0] (cit. on p. 76).
[NVIDIA 2016d] NVIDIA (2016d). NVIDIA Tesla P100. images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf [accessed 07.04.2017] (cit. on pp. 21, 29, 34).
[Nugteren and Codreanu 2015] Nugteren, C. and V. Codreanu (2015). “A Generic Auto-Tuner for OpenCL Kernels”. In: Proc. MCSoC (cit. on pp. 39, 41, 48).
[Oliveira Castro et al. 2013] Oliveira Castro, P. de, E. Petit, A. Farjallah, and W. Jalby (2013). “Adaptive Sampling for Performance Characterization of Application Kernels”. In: Concurrency and Computation: Practice and Experience (cit. on p. 43).
[Olofsson 2016] Olofsson, A. (2016). Epiphany-V: A 1024 processor 64-bit RISC System-On-Chip. Tech. rep. www.parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor/ [accessed 07.04.2017]. Adapteva Inc. (cit. on p. 19).
[Öztireli and Gross 2015] Öztireli, A. C. and M. Gross (2015). “Perceptually Based Downscaling of Images”. In: ACM TOG (cit. on p. 78).
[PCI-SIG 2010] PCI-SIG (2010). PCI Express Base Specification Revision 3.0. composter.com.ua/documents/PCI_Express_Base_Specification_Revision_3.0.pdf [accessed 07.04.2017] (cit. on p. 15).
[Pacula et al. 2012] Pacula, M., J. Ansel, S. Amarasinghe, and U.-M. O’Reilly (2012). “Hyperparameter Tuning in Bandit-Based Adaptive Operator Selection”. In: Proc. EvoApplications (cit. on p. 45).
[Pai and Pingali 2016] Pai, S. and K. Pingali (2016). “A Compiler for Throughput Optimization of Graph Algorithms on GPUs”. In: Proc. OOPSLA (cit. on p. 47).
[Panneerselvam and Swift 2016] Panneerselvam, S. and M. Swift (2016). “Rinnegan: Efficient Resource Use in Heterogeneous Architectures”. In: Proc. PACT (cit. on p. 50).
[Paone et al. 2014] Paone, E., D. Gadioli, G. Palermo, V. Zaccaria, and C. Silvano (2014). “Evaluating Orthogonality between Application Auto-Tuning and Run-Time Resource Management for Adaptive OpenCL Applications”. In: Proc. ASAP (cit. on p. 49).
[Papadonikolakis et al. 2009] Papadonikolakis, M., C.-S. Bouganis, and G. Constantinides (2009). “Performance comparison of GPU and FPGA architectures for the SVM training problem”. In: Proc. FPT (cit. on p. 19).
[Park et al. 2011] Park, E., L.-N. Pouchet, J. Cavazos, A. Cohen, and P. Sadayappan (2011). “Predictive Modeling in a Polyhedral Optimization Space”. In: Parallel Programming (cit. on p. 44).
[Park et al. 2015] Park, J. J. K., Y. Park, and S. Mahlke (2015). “ELF: Maximizing Memory-level Parallelism for GPUs with Coordinated Warp and Fetch Scheduling”. In: Proc. SC (cit. on p. 51).
[Patterson and Hennessy 2013] Patterson, D. A. and J. L. Hennessy (2013). Computer Organization and Design. Vol. 4. Morgan Kaufmann (cit. on pp. 7–10, 12, 17).
[Pauwels et al. 2011] Pauwels, K., M. Tomasi, J. D. Alonso, E. Ros, and M. M. V. Hulle (2011). “A Comparison of FPGA and GPU for Real-Time Phase-based Optical Flow, Stereo, and Local Image Features”. In: IEEE TOC (cit. on p. 19).
[Peng et al. 2016] Peng, Y., M. Grossman, and V. Sarkar (2016). “Static Cost Estimation for Data Layout Selection”. In: Proc. PMBS (cit. on pp. 23, 39, 40, 50, 75, 118).
[Pennycook et al. 2016] Pennycook, S. J., J. D. Sewall, and V. W. Lee (2016). “A Metric for Performance Portability”. In: Proc. PMBS (cit. on p. 91).
[Pimenta et al. 2013] Pimenta, A., E. Cesar, and A. Sikora (2013). “Methodology for MPI Applications Autotuning”. In: Proc. EuroMPI (cit. on p. 47).
[Popov et al. 2006] Popov, S., J. Günther, H.-P. Seidel, and P. Slusallek (2006). “Experiences with Streaming Construction of SAH KD-Trees”. In: Proc. IEEE IRT (cit. on p. 79).
[Püschel et al. 2004] Püschel, M., B. Singer, J. Xiong, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, and R. W. Johnson (2004). “SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms”. In: IJHPCA (cit. on p. 47).
[Ragan-Kelley et al. 2012] Ragan-Kelley, J., A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand (2012). “Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines”. In: ACM TOG (cit. on p. 46).
[Ragan-Kelley et al. 2013] Ragan-Kelley, J., C. Barnes, A. Adams, S. Paris, F. Durand, and S. P. Amarasinghe (2013). “Halide: A Language and Compiler for Optimizing Parallelism, Locality and Recomputation in Image Processing Pipelines”. In: Proc. PLDI (cit. on p. 46).
[Rasmussen and Williams 2006] Rasmussen, C. E. and C. K. I. Williams (2006). “Gaussian Processes for Machine Learning”. In: The MIT Press (cit. on pp. 96, 97).
[Reinders 2013] Reinders, J. (2013). “Intel AVX-512 instructions”. In: Intel Developer Zone. software.intel.com/en-us/blogs/2013/avx-512-instructions [accessed 07.04.2017] (cit. on pp. 2, 17).
[Rossbach et al. 2013] Rossbach, C., Y. Yu, J. Currey, and J.-P. Martin (2013). Dandelion: a Compiler and Runtime for Heterogeneous Systems. Tech. rep. Microsoft Research (cit. on p. 49).
[Rossi and Zhou 2016] Rossi, R. A. and R. Zhou (2016). “Hybrid CPU-GPU Framework for Network Motifs”. In: ArXiv (cit. on p. 50).
[Rubin et al. 2014] Rubin, E., E. Levy, A. Barak, and T. Ben-Nun (2014). “MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction”. In: ACM TACO (cit. on p. 50).
[Rudy et al. 2011] Rudy, G., M. M. Khan, M. Hall, C. Chen, and J. Chame (2011). “A Programming Language Interface to Describe Transformations and Code”. In: Proc. LCPC (cit. on pp. 45, 48).
[Ryoo et al. 2008] Ryoo, S., C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu (2008). “Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA”. In: Proc. PPoPP (cit. on p. 50).
[Sakai et al. 2016] Sakai, R., F. Ino, and K. Hagihara (2016). “Towards Automating Multi-Dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System”. In: Proc. CSA (cit. on p. 50).
[Sensi et al. 2016] Sensi, D. D., M. Torquati, and M. Danelutto (2016). “A Reconfiguration Algorithm for Power-Aware Parallel Applications”. In: ACM TACO (cit. on p. 47).
[Shilov 2016] Shilov, A. (2016). “Discrete Desktop GPU Market Trends Q2 2016: AMD Grabs Market Share, But NVIDIA Remains on Top”. In: AnandTech.com. www.anandtech.com/show/10613/discrete-desktop-gpu-market-trends-q2-2016-amd-grabs-market-share-but-nvidia-remains-on-top [accessed 07.04.2017] (cit. on p. 2).
[Shudler et al. 2015] Shudler, S., A. Calotoiu, T. Hoefler, A. Strube, and F. Wolf (2015). “Exascaling Your Library: Will Your Implementation Meet Your Expectations”. In: Proc. ICS (cit. on p. 43).
[Siddiqui et al. 2014] Siddiqui, S., F. AlZayer, and S. Feki (2014). “Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications”. In: Proc. VECPAR (cit. on p. 44).
[Sikora et al. 2016] Sikora, A., E. César, I. Comprés, and M. Gerndt (2016). “Auto-tuning of MPI Applications Using PTF”. In: Proc. SEM4HPC (cit. on p. 47).
[Sorensen 2012] Sorensen, H. H. B. (2012). “Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs”. In: Proc. PPAM (cit. on pp. 39, 46).
[Srivastava et al. 2016] Srivastava, P., M. Kotsifakou, and V. Adve (2016). “HPVM: A Portable Virtual Instruction Set for Heterogeneous Parallel Systems”. In: ArXiv (cit. on p. 50).
[Stephenson et al. 2003] Stephenson, M., S. Amarasinghe, M. Martin, and U.-M. O’Reilly (2003). “Improving compiler heuristics with machine learning”. In: Proc. PLDI (cit. on p. 44).
[Steuwer et al. 2016] Steuwer, M., T. Remmelg, and C. Dubach (2016). “Matrix multiplication beyond auto-tuning: rewrite-based GPU code generation”. In: Proc. CASES (cit. on p. 46).
[Stratton et al. 2012] Stratton, J. A., C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, D. Liu, and W.-m. W. Hwu (2012). Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Tech. rep. UIUC (cit. on p. 119).
[Strzodka 2011] Strzodka, R. (2011). Abstraction for AoS and SoA Layout in C++ (cit. on p. 50).
[Strzodka 2012] Strzodka, R. (2012). “Data Layout Optimization for Multi-Valued Containers in OpenCL”. In: Parallel and Distributed Computing (cit. on p. 50).
[Sung et al. 2012] Sung, I.-J., G. D. Liu, and W.-M. W. Hwu (2012). “DL: A Data Layout Transformation System for Heterogeneous Computing”. In: Proc. InPar (cit. on pp. 23, 39, 50, 75, 118).
[TIOBE 2016] TIOBE (2016). TIOBE Index. www.tiobe.com/tiobe-index [accessed 12.04.2017] (cit. on p. 122).
[Tang et al. 2015] Tang, W. T., R. Zhao, M. Lu, Y. Liang, H. P. Huynh, X. Li, and R. S. M. Goh (2015). “Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi”. In: Proc. CGO (cit. on p. 46).
[Tapus et al. 2002] Tapus, C., I.-H. Chung, and J. K. Hollingsworth (2002). “Active Harmony: Towards Automated Performance Tuning”. In: Proc. SC (cit. on p. 48).
[Tausche et al. 2016] Tausche, K., M. Plauth, and A. Polze (2016). “dOpenCL: Evaluation of an API-Forwarding Implementation”. In: Proc. HPI Cloud Symposium (cit. on p. 50).
[TechPowerUp.com 2017] TechPowerUp.com (2017). GPU Database. www.techpowerup.com/gpudb/ [accessed 07.04.2017] (cit. on p. 76).
[Tillmann et al. 2013] Tillmann, M., T. Karcher, C. Dachsbacher, and W. Tichy (2013). “Application-independent Autotuning for GPUs”. In: Proc. ParCo (cit. on p. 48).
[Tillmann et al. 2016] Tillmann, M., P. Pfaffe, C. Kaag, and W. F. Tichy (2016). “Online-Autotuning of Parallel SAH kD-Trees”. In: Proc. IPDPS (cit. on p. 48).
[Tiwari et al. 2009] Tiwari, A., C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth (2009). “A Scalable Auto-tuning Framework for Compiler Optimization”. In: Proc. IPDPS (cit. on p. 48).
[Tiwari et al. 2011] Tiwari, A., M. A. Laurenzano, L. Carrington, and A. Snavely (2011). “Auto-tuning for Energy Usage in Scientific Applications”. In: Proc. EUROPAR (cit. on p. 47).
[Tom’s Hardware 2017] Tom’s Hardware (2017). Enterprise HDD Charts. www.tomshardware.com/charts/enterprise-hdd-charts/benchmarks,156.html [accessed 07.04.2017] (cit. on p. 15).
[Tomusk et al. 2016] Tomusk, E., C. Dubach, and M. O’Boyle (2016). “Selecting Heterogeneous Cores for Diversity”. In: ACM TACO (cit. on p. 47).
[Top 500 2016] Top 500 (2016). November 2016. www.top500.org/lists/2016/11/ [accessed 07.04.2017] (cit. on p. 1).
[Veras et al. 2016] Veras, R. M., T. M. Low, T. M. Smith, R. van de Geijn, and F. Franchetti (2016). “Automating the Last-Mile for High Performance Dense Linear Algebra”. In: ArXiv (cit. on p. 46).
[Vijayaragavan et al. 2017] Vijayaragavan, T., Y. Eckert, G. H. Loh, M. J. Schulte, M. Ignatowski, B. M. Beckmann, W. C. Brantley, J. L. Greathouse, W. Huang, A. Karaunanithi, O. Kayiran, M. Meswani, I. Paul, M. Poremba, S. Raasch, S. K. Reinhardt, G. Sadowski, and V. Sridharan (2017). “Design and Analysis of an APU for Exascale Computing”. In: Proc. HPCA (cit. on pp. 20, 114).
[Viñas et al. 2013] Viñas, M., Z. Bozkus, and B. B. Fraguela (2013). “Exploiting heterogeneous parallelism with the Heterogeneous Programming Library”. In: Parallel and Distributed Computing (cit. on p. 49).
[Viñas et al. 2016] Viñas, M., B. B. Fraguela, D. Andrade, and R. Doallo (2016). “High Productivity Multi-device Exploitation with the Heterogeneous Programming Library”. In: Parallel and Distributed Computing (cit. on p. 49).
[Volkov 2010] Volkov, V. (2010). “Better Performance at Lower Occupancy”. In: GPU Tech Conference (cit. on p. 28).
[Vollmer et al. 2015] Vollmer, M., B. J. Svensson, E. Holk, and R. R. Newton (2015). “Meta-Programming and Auto-Tuning in the Search for High Performance GPU Code”. In: Proc. FHPC (cit. on p. 46).
[Vuduc et al. 2005] Vuduc, R., J. W. Demmel, and K. A. Yelick (2005). “OSKI: A library of automatically tuned sparse matrix kernels”. In: Journal of Physics (cit. on p. 46).
[Waechter et al. 2012] Waechter, M., K. Jaeger, S. Weissgraeber, S. Widmer, M. Goesele, and K. Hamacher (2012). “Information-theoretic Analysis of Molecular (Co)Evolution Using Graphics Processing Units”. In: Proc. ECMLS (cit. on p. 79).
[Wang et al. 2010] Wang, G., Y. Lin, and W. Yi (2010). “Kernel Fusion: an Effective Method for Better Power Efficiency on Multithreaded GPU”. In: Proc. GreenCom (cit. on p. 118).
[Wang and Chu 2017] Wang, Q. and X. Chu (2017). “GPGPU Performance Estimation with Core and Memory Frequency Scaling”. In: ArXiv (cit. on p. 43).
[Wang et al. 2015] Wang, Z., D. Grewe, and M. F. P. O’Boyle (2015). “Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-based Heterogeneous Systems”. In: ACM TACO (cit. on p. 44).
[Weber and Goesele 2014] Weber, N. and M. Goesele (2014). “Auto-Tuning Complex Array Layouts for GPUs”. In: Proc. EGPGV (cit. on pp. 4, 53).
[Weber and Goesele 2016] Weber, N. and M. Goesele (2016). “Adaptive GPU Array Layout Auto-Tuning”. In: Proc. SEM4HPC (cit. on pp. 4, 5, 69, 71, 72, 82).
[Weber and Goesele 2017] Weber, N. and M. Goesele (2017). “MATOG: Array Layout Auto-Tuning for CUDA”. In: ACM TACO. *in review* (cit. on pp. 4, 5).
[Weber et al. 2015] Weber, N., S. C. Amend, and M. Goesele (2015). “Guided Profiling for Auto-Tuning Array Layouts on GPUs”. In: Proc. PMBS (cit. on pp. 4, 5, 64, 65, 75).
[Weber et al. 2016] Weber, N., M. Waechter, S. C. Amend, S. Guthe, and M. Goesele (2016). “Rapid, Detail-Preserving Image Downscaling”. In: Proc. SIGGRAPH Asia (cit. on pp. VII, 78).
[Whaley and Dongarra 1998] Whaley, R. C. and J. J. Dongarra (1998). “Automatically Tuned Linear Algebra Software”. In: Proc. SC (cit. on pp. 37, 46).
[Wolf et al. 2014] Wolf, F., C. Bischof, T. Hoefler, B. Mohr, G. Wittum, A. Calotoiu, C. Iwainsky, A. Strube, and A. Vogel (2014). “Catwalk: A Quick Development Path for Performance Models”. In: Proc. EUROPAR (cit. on p. 95).
[Wong et al. 2010] Wong, H., M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos (2010). “Demystifying GPU Microarchitecture through Microbenchmarking”. In: Proc. ISPASS (cit. on p. 43).
[Wu et al. 2015] Wu, G., J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou (2015). “GPGPU performance and power estimation using machine learning”. In: Proc. HPCA (cit. on pp. 43, 96).
[Wu et al. 2016] Wu, J., A. Belevich, E. Bendersky, M. Heffernan, C. Leary, J. Pienaar, B. Roune, R. Springer, X. Weng, and R. Hundt (2016). “GPUCC - An Open-Source GPGPU Compiler”. In: Proc. SCGO (cit. on p. 117).
[Xu and Gregg 2015] Xu, S. and D. Gregg (2015). “Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU”. In: Proc. TRUSTCOM (cit. on p. 49).
[Yamato 2017] Yamato, Y. (2017). “Optimum Application Deployment Technology for Heterogeneous IaaS Cloud”. In: Information Processing (cit. on p. 50).
[Yang et al. 2010] Yang, Y., P. Xiang, J. Kong, and H. Zhou (2010). “A GPGPU Compiler for Memory Optimization and Parallelism Management”. In: Proc. PLDI (cit. on p. 51).
[Yang et al. 2016] Yang, Y., S. Prestwood, and C. Barnes (2016). “VizGen: Accelerating Visual Computing Prototypes in Dynamic Languages”. In: Proc. SIGGRAPH Asia (cit. on p. 47).
[Yu and Cardona 2010] Yu, P. Y. and M. Cardona (2010). Fundamentals of Semiconductors. Vol. 4. Pearson (cit. on p. 16).
[Zenker et al. 2016] Zenker, E., B. Worpitz, R. Widera, A. Huebl, G. Juckeland, A. Knüpfer, W. E. Nagel, and M. Bussmann (2016). “Alpaka - An Abstraction Library for Parallel Kernel Acceleration”. In: ArXiv (cit. on p. 50).
[Zhang et al. 2005] Zhang, K., U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, and M. Bohr (2005). “SRAM Design on 65-nm CMOS Technology With Dynamic Sleep Transistor for Leakage Reduction”. In: SSC (cit. on p. 16).
[Zhang and Mueller 2013] Zhang, Y. and F. Mueller (2013). “Auto-Generation and Auto-Tuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters”. In: IEEE TPDS (cit. on p. 46).
[Zhang et al. 2016] Zhang, Y., S. Li, S. Yan, and H. Zhou (2016). “A Cross-Platform SpMV Framework on Many-Core Architectures”. In: ACM TACO (cit. on p. 46).
[Zheng et al. 2012] Zheng, M., V. T. Ravi, W. Ma, F. Qin, and G. Agrawal (2012). “GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs”. In: Proc. HiPC (cit. on p. 43).
[Ziabari et al. 2016] Ziabari, A. K., Y. Sun, Y. Ma, D. Schaa, J. L. Abellán, R. Ubal, J. Kim, A. Joshi, and D. Kaeli (2016). “UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs”. In: ACM TACO (cit. on p. 51).
[Zivanovic et al. 2017] Zivanovic, D., M. Pavlovic, M. Radulovic, H. Shin, J. Son, S. A. McKee, P. M. Carpenter, P. Radojković, and E. Ayguadé (2017). “Main Memory in HPC: Do We Need More or Could We Live with Less?” In: ACM TACO (cit. on p. 115).
(Co-)Authored Publications
Patrick Weber, Nicolas Weber, Michael Goesele and Rüdiger Kabst. Prospect for Knowledge in Survey Data – An Artificial Neural Network Sensitivity Analysis. Social Science Computer Review (SSCR), 2017.
Nicolas Weber and Michael Goesele. MATOG: Array Layout Auto-Tuning for CUDA. ACM Transactions on Architecture and Code Optimization (TACO), 2017.
Nicolas Weber, Michael Waechter, Sandra C. Amend, Stefan Guthe and Michael Goesele. Rapid, Detail-Preserving Image Downscaling. ACM Transactions on Graphics (TOG), SIGGRAPH Asia, 2016.
Nicolas Weber and Michael Goesele. Adaptive GPU Array Layout Auto-Tuning. In proceedings of Software Engineering Methods for Parallel and High Performance Applications, SEM4HPC, 2016.
Nicolas Weber, Sandra C. Amend and Michael Goesele. Guided Profiling for Auto-Tuning Array Layouts on GPUs. In proceedings of 6th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, PMBS, 2015.
Nicolas Weber and Michael Goesele. Auto-Tuning Complex Array Layouts on GPUs. In proceedings of Eurographics Symposium on Parallel Graphics and Visualization, EGPGV, 2014.
Sven Widmer, Dominik Wodniok, Nicolas Weber and Michael Goesele. Fast Dynamic Memory Allocator for Massively Parallel Architectures. In proceedings of 6th Workshop on General Purpose Processing Using GPUs, GPGPU, 2013.
