Runtime Prediction of Fused Linear Algebra in a Compiler Framework by Karlin, Ian
University of Colorado, Boulder
CU Scholar
Computer Science Graduate Theses & Dissertations Computer Science
Spring 1-1-2011
This Dissertation is brought to you for free and open access by Computer Science at CU Scholar. It has been accepted for inclusion in Computer
Science Graduate Theses & Dissertations by an authorized administrator of CU Scholar. For more information, please contact
cuscholaradmin@colorado.edu.
Recommended Citation
Karlin, Ian, "Runtime Prediction of Fused Linear Algebra in a Compiler Framework" (2011). Computer Science Graduate Theses &
Dissertations. Paper 26.
Runtime Prediction of Fused Linear Algebra in a Compiler
Framework
by
Ian Karlin
B.S., University of California at Davis, 2005
M.S., University of Colorado, Boulder, 2007
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science
2011
This thesis entitled:
Runtime Prediction of Fused Linear Algebra in a Compiler Framework
written by Ian Karlin
has been approved for the Department of Computer Science
Elizabeth Jessup
Prof. Jeremy Siek
Prof. Xiao-Chuan Cai
Prof. Manish Vachharajani
Dr. Jonathan Hu
Date
The final copy of this thesis has been examined by the signatories, and we find that both the
content and the form meet acceptable presentation standards of scholarly work in the above
mentioned discipline.
Karlin, Ian (Ph.D., Computer Science)
Runtime Prediction of Fused Linear Algebra in a Compiler Framework
Thesis directed by Prof. Elizabeth Jessup
On modern processors, data transfer exceeds floating-point operations as the predominant
cost in many linear algebra computations. For these memory-bound calculations, reducing data
movement is often the only way to significantly increase their speed. One tuning technique that
focuses on reducing memory accesses is loop fusion. However, determining the optimum amount
of loop fusion to apply to a routine is difficult as fusion can both positively and negatively impact
memory traffic.
In this thesis, we perform an in-depth analysis of how loop fusion affects data movement
throughout the memory hierarchy. The results of this analysis are used to create a memory model
for fused linear algebra calculations. The model predicts data movement throughout the memory
hierarchy. Included in its design are runtime and accuracy tradeoffs based on our fusion research.
The model’s memory traffic predictions are converted to runtime estimates that can be used to
compare loop fusion variants on serial and shared memory parallel machines. We integrate our
model into a compiler where its predictions often reduce compile times by 99% or more. The kernels
produced by the compiler with the model turned on are usually the same as the optimal kernels for
the target architecture found by exhaustively testing all possible loop fusion combinations.
Dedication
I dedicate this thesis to all the friends and family who helped me through the ups and downs
of life and contributed to me reaching this point.
Acknowledgements
I would like to thank the National Science Foundation for funding the grants that made
it possible for me to complete this research. I also thank Professor Jessup for the guidance and
support she provided especially when it came to transforming the quality of my writing. I would
like to thank and acknowledge my thesis committee for their time and help. In particular, I thank
Professor Siek, who wrote the initial version of the Build to Order (BTO) compiler. Additional thanks go to
my other co-authors, Geoffrey Belter, Erik Silkensen, Pavel Zelinsky, and Thomas Nelson, for help
both in writing papers and in ideas contributed during various discussions. In particular, I would like
to recognize Geoffrey, with whom I worked closely during the integration of the BTO compiler with
my model and the analysis of the two working together.
Contents
Chapter
1 Introduction
2 Background
2.1 Computer Architecture
2.1.1 Memory Subsystem
2.1.2 Multi-core Architectures
2.1.3 Processor Memory Speed Gap
2.2 Performance Tuning Techniques
2.2.1 Use Faster but Equivalent Instructions
2.2.2 Pipeline
2.2.3 Register
2.2.4 Memory
2.3 Linear Algebra Libraries
2.3.1 Basic Linear Algebra Subprograms
2.3.2 Linear Algebra Package (LAPACK)
2.3.3 Higher Level Packages
2.4 Autotuning Programs
2.5 Models
2.5.1 Memory Models that Predict Performance
2.5.2 Models that Generate Performance Bounds
2.6 Hybrid Search
3 Loop Fusion
3.1 Loop Fusion Applications to Linear Algebra
3.1.1 Optimizing Specific Routines
3.1.2 General Tools for Fusion
3.2 Fusion Reduces Memory Traffic
3.2.1 Independent Calculations
3.2.2 Dependent Calculations
3.3 Build to Order (BTO) Compiler
3.3.1 Functionality of BTO
3.3.2 Using BTO to Test and Improve the Memory Model
3.3.3 Leveraging BTO to Learn More About Loop Fusion
3.4 Negative Memory Effects of Fusion
3.4.1 Performance Degradation
3.4.2 Suboptimal Memory Access Patterns
3.5 Overcoming Bad Memory Effects of Fusion (Combining Fusion and Other Optimizations)
3.5.1 Cache Blocking
3.5.2 Software Pipelining
3.5.3 Not Fusing Too Much
3.6 Search Complexity of Fusion Decisions
3.6.1 Similar Routine Performance
3.6.2 Operations That Significantly Impact Performance
3.6.3 Removing Vector Operations
3.7 Summary
4 Predicting Memory Traffic on a Single Processor
4.1 Runtime and Accuracy Tradeoffs
4.2 Theoretical Framework (Equations)
4.3 Implementation
4.3.1 Cache and TLB Prediction
4.3.2 Register prediction
4.4 Validation
4.4.1 Comparison of Implementation to Equations and Instrumented Code
4.4.2 Comparison of Predictions to Hardware Counters
5 Runtime Prediction
5.1 Automatically Determining Machine Characteristics
5.1.1 How Many Caches and Their Sizes
5.1.2 How Many Registers
5.1.3 Useable Bandwidth from Memory Structure to the Processor
5.2 Converting Memory Traffic Predictions to Runtime Predictions
5.2.1 Theoretical Equations
5.2.2 Implementation
5.3 Validation and Accuracy
5.3.1 Validation of Implementation
5.3.2 Runtime Prediction Accuracy
6 Parallel Shared Memory Model
6.1 Parallel Machine and Execution Features
6.1.1 Data and Workload Distribution
6.1.2 Routine Execution
6.2 Modeling Parallel Memory Traffic
6.2.1 Parallel Memory Prediction
6.2.2 Prediction Results
6.3 Parallel Runtime Prediction
6.3.1 Theoretical Expression
6.3.2 Proof of Best and Worst Case
6.3.3 Implementation
6.3.4 Results
7 Integration of Runtime Prediction into BTO Compiler
7.1 Integration of the Model into BTO
7.2 Analysis of Model’s Effectiveness in Reducing Compile Time
7.2.1 Experimental Setup
7.2.2 Cost of Modeling and Empirically Testing Runtimes
7.2.3 Model’s Impact on Serial Compile Time
7.2.4 Model’s Impact on Parallel Compile Time
8 Conclusions
9 Future Work
Bibliography
Appendix
Tables
Table
3.1 Specifications of the test machines. For TLB, we list the number of entries.
3.2 Kernel specifications.
3.3 Search Space of Routines
4.1 Instrumented and modeled reuse distances for unfused b = AA^T x.
4.2 Instrumented and modeled reuse distances for fused b = AA^T x.
6.1 Specifications of parallel test machines. All machines are homogeneous and caches are
shared by the same number of cores. The number of sockets is equal to the number
of buses from memory to the processor in all cases.
7.1 The number of serial and parallel versions produced by the BTO compiler for various
routines.
7.2 Number of routines the model and empirical testing evaluate per second on the Work
machine.
7.3 Number of routines the model and empirical testing evaluate per second on the
Opteron machine.
7.4 Impact of model on reducing search space and runtime for serial routines on the
Work machine.
7.5 Impact of model on reducing search space and runtime for serial routines on the
Opteron machine.
7.6 Impact of model on reducing search space and runtime for parallel routines on the
Work machine.
7.7 Impact of model on reducing search space and runtime for parallel routines on the
Opteron machine.
Figures
Figure
2.1 Dual socket quad-core Intel Clovertown
2.2 Optimized vs. un-optimized matrix matrix multiply performance.
2.3 Blocking a Matrix-Matrix Multiply.
2.4 How to interleave data to create a multivector.
2.5 Software pipelining of a calculation
3.1 Fusing Ax = r, A^T y = s
3.2 Memory performance of fused and unfused Ax = r, A^T y = s
3.3 Performance of fused and unfused Ax = r, A^T y = s
3.4 Fusing b = AA^T x
3.5 Performance of fused and unfused AA^T x = b
3.6 Memory performance of fused and unfused AA^T x = b
3.7 DGEMV2
3.8 Three possible loop fusion options
3.9 The performance of fusing only outer loops for various nvecs.
3.10 The memory effects of fusing loops for nvecs = 8.
3.11 Performance of Fully Fusing
3.12 Memory Effects of Fully Fusing
3.13 Fully Fused Assembly
3.14 Cache Blocking Ax = r, A^T y = s
3.15 Performance with Cache Blocking
3.16 Memory Performance of Cache Blocking
3.17 Software Pipelining b = AA^T x
3.18 Performance with Software Pipelining
3.19 Unfused GESUMMV
3.20 All Versions of GEMVER Actual and Predicted Performance
3.21 Runtime of fusing a vector operation for the GESUMMV calculation where the vector
is accessed n times on an Opteron system.
4.1 The abstract syntax tree taken in by the model for b = AA^T x.
4.2 A machine structure representing a Core 2 machine.
4.3 Predicted vs. actual memory misses of b = AA^T x on the Work machine.
4.4 Predicted vs. actual memory misses of b = AA^T x on the Opteron.
5.1 Code used to determine number of registers available on a machine.
5.2 Predicted vs. actual runtime of three kernels on the Work machine.
5.3 Predicted vs. actual runtime of three kernels on the Opteron system.
5.4 Predicted vs. actual runtime of the 648 versions of GEMVER produced by the BTO
compiler on the Work machine.
5.5 Accuracy of Model Predictions With and Without Registers
6.1 Parallel machine storage example.
6.2 Predicted and actual cache misses for r = A^T x, s = Ay.
6.3 Predicted and actual cache misses for b = AA^T x.
6.4 Actual and predicted runtimes comparison for matrix kernels on the Work machine.
6.5 Actual and predicted runtimes comparison for vector kernels on the Work machine.
6.6 Actual and predicted runtimes comparison for matrix kernels on the Opteron machine.
6.7 Actual and predicted runtimes comparison for vector kernels on the Opteron machine.
7.1 Time to model all versions of GEMVER on Work machine.
Chapter 1
Introduction
The execution of mathematical kernels is often the most time-consuming part of scientific
applications in the diverse fields of atmospheric science [107], quantum physics [97], and structural
engineering [95]. Reduction in the runtime of these mathematical kernels is achieved using various
methods. The increasing speed and parallelism of modern computer hardware produces large
reductions in the overall program runtime [106]. Algorithmic improvements, such as iterative
methods, diminish runtimes by decreasing the number of operations that need to be performed
[68]. Finally, software tuning techniques reduce kernel runtimes by improving the computation’s
speed [13,30,55]. For many mathematical kernels, such as matrix-matrix multiplication, the result
is near optimal performance on a wide variety of hardware [49, 105]. The performance of these
kernels also improves in lockstep with processor improvements.
Despite the large number of ways to raise performance, many calculations speed up more
slowly than advances in processors and performance tuning techniques speed up other
operations. Most of these calculations perform few floating-point operations per memory access.
The result is that their performance is bounded by the speed of the memory bus, which is slower
than the CPU [27]. This speed difference is a consequence of memory bandwidth increasing by
7-10% per year while the CPU’s floating-point throughput has increased by approximately 60%
per year [52]. This trend has occurred for the past 30 years and is expected to continue. Therefore,
the speed of memory-bound calculations increases at the slower rate of memory bandwidth
improvements [3].
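To give a sense of how quickly these rates diverge, the following worked calculation compounds them over ten years. It is only an illustration: it assumes the quoted annual rates hold constant, and the ten-year window is an arbitrary choice, not a figure from the cited sources.

```latex
\[
\left(\frac{1.60}{1.10}\right)^{10} \approx 42
\qquad\text{and}\qquad
\left(\frac{1.60}{1.07}\right)^{10} \approx 56 .
\]
```

Under these assumptions, floating-point throughput would outgrow memory bandwidth by a factor of roughly 42 to 56 in a single decade.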
The combination of memory-bound algorithms and slower memory bandwidth increases
means that many computations perform significantly below theoretical peak machine performance
[21, 93]. To counteract this problem, work has been performed to reduce the amount of memory
traffic required to perform calculations [3, 5, 10, 30, 55]. Loop fusion [44], an optimization that
combines multiple loops together, is one such technique.
Fusing loops that access the same data elements can reduce the amount of data that must
be read from memory. An added benefit of loop fusion is that it often combines operations in a
way that makes array contraction possible [44]. Array contraction is an optimization that reduces
the size of an array that is neither an input nor an output to a smaller array or scalar quantity. Array
contraction reduces the working set size of a program and may further decrease data movement by
diminishing the number of conflict misses in a cache. The result is fused codes that can run much
faster than their unfused equivalents [10,55,101].
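As a minimal sketch of loop fusion and array contraction, consider a hypothetical calculation chosen only for illustration: t = αx + y followed by the dot product of t and z. The function names and the calculation itself are assumptions and are not kernels from this thesis. In the unfused version the temporary vector t is written to memory and read back; in the fused version t is contracted to a scalar.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two loops. The temporary vector t is written by the first
   loop and read back by the second, so it travels through memory twice. */
double axpy_dot_unfused(size_t n, double alpha, const double *x,
                        const double *y, const double *z, double *t)
{
    assert(x && y && z && t);
    for (size_t i = 0; i < n; i++)
        t[i] = alpha * x[i] + y[i];

    double dot = 0.0;
    for (size_t i = 0; i < n; i++)
        dot += t[i] * z[i];
    return dot;
}

/* Fused: one loop. Because t is neither an input nor an output, array
   contraction replaces it with the scalar ti, which can live in a register. */
double axpy_dot_fused(size_t n, double alpha, const double *x,
                      const double *y, const double *z)
{
    assert(x && y && z);
    double dot = 0.0;
    for (size_t i = 0; i < n; i++) {
        double ti = alpha * x[i] + y[i];
        dot += ti * z[i];
    }
    return dot;
}
```

Both functions return the same value for the same inputs; the fused version simply avoids storing and reloading the n elements of t.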
One frequent application of loop fusion is to memory-bound linear algebra operations. Creating
an efficient fused routine is important, since the kernel that can be fused often contributes a
large portion of the overall routine runtime [55]. However, generating a routine with the optimal
amount of fusion is not always easy or obvious because fusion interacts in both positive and negative
ways with the memory hierarchy and other optimizations [61]. Without exploring the whole fusion
search space, it is possible to produce a sub-optimal routine. However, finding the optimal way
to fuse a routine is an NP-complete problem [26], and therefore it is not always feasible to enumerate
and test all possible combinations.
Multiple approaches to creating efficient routines have been used by researchers, including
optimizing single routines to run as fast as possible [55,101], performing fusion on a general purpose
language input [16, 85], using a domain-specific language and compiler [10], and adding important
and commonly used functions to a widely used application programming interface [15]. The projects
that accept diverse inputs often include cost models or heuristics to predict the performance of
routines [10, 16, 86]. These models and heuristics decrease the time the compilers take to generate
fused routines, though they sometimes produce routines with sub-optimal amounts of fusion [82].
The result of these efforts is highly efficient routines that can be over 100% faster than non-fused
implementations.
In this thesis, we present a model designed with enough speed to handle large search spaces
and enough accuracy to result in a feasible number of routines that must be tested to assure good
performance [10, 65]. We integrate our model into the Build to Order (BTO) compiler framework
[10] and demonstrate that it reduces the search space of compilation without significantly impacting
routine quality. BTO is a compiler that takes in an annotated subset of MATLAB and produces
optimized kernels in C. Optimizations within BTO include two forms of loop fusion and
data partitioning. The data partitioning enables cache blocking and the creation of shared memory
parallel codes [11]. BTO enumerates the entire search space of potentially profitable loop fusion
combinations that access common data to avoid generating sub-optimal fused kernels, which is
feasible for the restricted domain in which BTO works. Those operations are then tested using
a hybrid analytical/empirical testing methodology. In this methodology, the model presented in
this thesis is used to produce kernel runtime estimates and the best identified kernels are then
empirically tested. The compiler then outputs the routine identified as the fastest. The result is
efficient linear algebra kernels produced in a small amount of time that run over 100% faster than
vendor-optimized Basic Linear Algebra Subprograms (BLAS) on serial and parallel machines.
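The hybrid analytical/empirical methodology described above can be sketched as follows. This is an illustration under stated assumptions, not BTO's actual code: the names `hybrid_search`, `cost_fn`, `toy_predict` and `toy_measure` are hypothetical, and the selection rule (empirically test every variant whose predicted runtime is within a factor `slack` of the best prediction) is one plausible way to choose "the best identified kernels".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: returns a runtime for a given variant index. */
typedef double (*cost_fn)(size_t variant);

/* Assumed selection rule: use the model to find the best predicted runtime,
   then empirically time only the variants predicted to be within a factor
   `slack` of it, and return the fastest measured variant.
   For brevity the sketch calls predict() twice per variant. */
size_t hybrid_search(size_t n_variants, double slack,
                     cost_fn predict, cost_fn measure)
{
    assert(n_variants > 0 && slack >= 1.0);

    double best_pred = predict(0);
    for (size_t i = 1; i < n_variants; i++) {
        double p = predict(i);
        if (p < best_pred)
            best_pred = p;
    }

    size_t best = n_variants;   /* sentinel: nothing measured yet */
    double best_time = 0.0;
    for (size_t i = 0; i < n_variants; i++) {
        if (predict(i) <= slack * best_pred) {
            double t = measure(i);  /* the expensive empirical test */
            if (best == n_variants || t < best_time) {
                best = i;
                best_time = t;
            }
        }
    }
    return best;
}

/* Toy stand-ins for a model and a timer, with made-up numbers. */
double toy_predict(size_t v)
{
    static const double p[] = {4.0, 1.0, 1.1, 3.0};
    return p[v];
}

double toy_measure(size_t v)
{
    static const double m[] = {4.2, 1.3, 1.0, 2.9};
    return m[v];
}
```

With the toy data, a slack of 1.2 causes only variants 1 and 2 to be timed, and variant 2 is returned because it measures fastest even though the model ranked it second. A slack of 1.0 trusts the model completely and returns variant 1.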
In designing the model, we draw inspiration from the fact that, for memory bound operations,
the best way to compare the performance of two implementations is by comparing their memory
costs, not their operation counts [3]. Reducing the memory traffic of an operation results in a cor-
responding reduction in runtime of memory bound operations [30]. Additionally, the performance
of memory-bound codes can be predicted from memory access patterns and benchmarks of the
system [20]. Finally, we use only the most distinguishing memory effects for fused calculations to
reduce model runtime. The result is a model that in a small amount of time accurately distin-
guishes between multiple versions of the same routine to find a small subset that can feasibly be
empirically tested. Included in this thesis are models for both single processor and multiprocessor
shared memory systems.
The rest of the thesis is organized as follows: In Chapter 2, we present an overview of the
important parts of computer architecture as relevant to the performance of loop fused operations.
We review performance tuning techniques other than loop fusion and their interactions with the
computer architecture. Additionally, we describe other memory models and runtime prediction
methods.
In Chapter 3, we review others’ loop fusion work including the analysis of its complexity.
Different ways to produce fused routines are presented in detail. The impact of fusion on the
memory subsystem is explored, along with a detailed explanation of when and why fusion impacts
data movement. We show ways to mitigate the negative effects of fusion while not losing positive
effects. We also discuss the complexity of searching for the optimum amount of fusion and how to
reduce decisions considered without missing the best performing routines. Included in this chapter
is an overview of the BTO compiler system, its capabilities and features, and how we leverage it to
test, develop, and improve our model.
We explain how we model memory traffic in Chapter 4. We discuss the assumptions made
in the model and the accuracy and performance tradeoffs that result. The chapter also contains
details of the theoretical framework for the model, discussion of its implementation and analysis of
the model’s accuracy, strengths and weaknesses.
Chapter 5 describes how memory predictions are converted into runtime predictions. It covers
the method used to perform this translation and the accuracy of these predictions. Included is a
theoretical framework and implementation details. We also explain a procedure to automatically
determine hardware characteristics of a machine relevant to memory and runtime prediction.
We expand our memory traffic predictions to shared memory parallel machines in
Chapter 6. Included in this expansion are runtime predictions for routines on parallel machines.
The chapter also covers the assumptions made about parallel machines, the differences between modeling
serial and parallel machines, and validation of the model’s accuracy.
In Chapter 7, we describe how the model and runtime prediction function are integrated and
used within the BTO compiler system. We show how, and by how much, the model reduces the
amount of time required to find an efficient kernel.
Finally, Chapter 8 includes conclusions, and Chapter 9 presents areas for future work.
Chapter 2
Background
To achieve good performance, programmers of linear algebra software must account for the
underlying hardware on which their programs are run [49, 106]. Tuning of routines for the target
architecture is important because the speed differential between naive and tuned algorithms is often
significant [5,6,100,106] due to the naive algorithm’s inefficient use of the computer’s hardware [101].
Memory models [41,47] and automated tuning programs [13,42,98,105] aid in the creation of efficient
programs. As computer architectures continue to improve and become increasingly complex, new
models and tuning techniques are needed to aid the implementation of high performing routines
on modern hardware [31].
In this chapter we begin with a review of the important elements of computer architecture
for linear algebra operations for serial and parallel systems. We then present an analysis of various
tuning techniques and their interaction with computer hardware. A summary of some of the major
libraries that perform linear algebra computations follows. Next, we present a description of mem-
ory models used to aid in performance analysis and tuning of linear algebra computations. Then we
discuss other approaches to autotuning software. Finally, we present how others incorporate models
into hybrid search strategies. Throughout this chapter and the rest of this thesis linear algebra
computations are presented. When they are, all variables that are capital Roman letters denote
matrices, lower case Roman letters are vectors and lower case Greek letters are scalar quantities.
2.1 Computer Architecture
Creating high performing linear algebra kernels requires understanding the computer hard-
ware on which the routines are run. For calculations that are bound by data movement, knowledge
of the memory subsystem and how it impacts runtimes aids in tuning calculations. In this section,
we describe in detail the memory subsystem and how data move through it. We also discuss the gap
between data movement speed and the speed of the processor. Included in this section are the significant
differences in memory subsystem design between single processor and parallel machines.
2.1.1 Memory Subsystem
Efficient use of the memory subsystem is important to linear algebra routine performance
[13,49,70,101]. Designing efficient routines requires a good understanding of the memory subsystem,
which is made up of multiple components of varying sizes and speeds. From the fastest and smallest
to the largest and slowest, these are registers, caches, and main memory. Often multiple caches of
varying size and speed are used. These memory structures along with the translation lookaside
buffer (TLB), which speeds up the process of finding where data is stored in computers running
multiple processes, make up the memory hierarchy.
In this section, we explain in detail the components of a computer’s memory hierarchy and
how these structures combine to impact performance. The material is a summary of information
covered in Computer Architecture: A Quantitative Approach by John Hennessy and David
Patterson [52].
2.1.1.1 Registers
Within the processor, all operations and calculations occur by manipulating data that are
stored in registers. Registers are high speed memory structures that typically store 32 or 64 bits of
information that can be accessed nearly instantly by the processor. The speed of access comes at a
large monetary and transistor cost. Therefore, most modern processors contain only 8 to 128 general
purpose registers that are available for programs to use. Due to the limited number of
registers, most calculations require some or all of their data to be read from memory before the
calculation begins.
2.1.1.2 Caches
To perform a calculation on data not within a register, a computer needs to retrieve data from
its main memory and transfer it into the processor where the computation occurs. How long the
transfer takes is dependent on the latency and bandwidth between main memory and the processor.
Latency is the amount of time it takes to move data from memory to the processor, and bandwidth
is the amount of data that can be moved between two parts of a computer system in a given time.
They are the limiting factors in reading data from memory. Since main memory has a high latency
and a low bandwidth to the processor, reading large amounts of data from main memory can limit
the performance of a program. High latency affects the program’s runtime by causing the processor
to stall while waiting for data to be retrieved from main memory. A computer with low bandwidth
cannot move data from main memory to the processor as rapidly as the processor can perform
calculations. Therefore, decreasing latency and increasing bandwidth by reading as much data as
possible from closer, faster parts of the memory hierarchy is important to decreasing a computer program’s
runtime.
The pattern and distance of reads from different memory addresses determine whether bandwidth
or latency is more important to data retrieval speeds. When a program accesses main
memory sequentially, bandwidth is more important than latency for performance. Once the ini-
tial memory location is accessed, data are streamed from consecutive memory locations as fast
as the machine’s bandwidth allows. When a program accesses data from nonsequential memory
addresses, low latency is more important than high bandwidth for performance. Each access to
a nonsequential memory address results in a high latency memory lookup, which can take a few
hundred cycles.
To reduce the effects of high latency and low bandwidth, caches are used. A cache is a fast
memory structure that stores a subset of the data in main memory. The data stored in a cache
are organized into units called cache lines, which are typically 32 to 128 bytes in size. These
lines contain consecutive memory addresses from main memory. The first cache line begins at the
memory address 0, with the next line beginning immediately after the first line. Therefore, a cache
line begins every x bytes, where x is the line size used by the cache. As a segment of data is read
into the processor from main memory, it is put into the cache; at the same time, the other data
members on the same cache line are also read into the cache.
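The line layout described above can be sketched in a few lines of Python. This is only an illustration: the 64-byte line size and the function names are assumptions chosen for the example, not values taken from a particular machine.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; the text gives 32 to 128 as typical)

def line_index(address, line_size=LINE_SIZE):
    """Return which line-sized chunk of memory holds this byte address."""
    return address // line_size

def line_start(address, line_size=LINE_SIZE):
    """Return the first byte address of the line containing this address."""
    return line_index(address, line_size) * line_size

def same_line(addr_a, addr_b, line_size=LINE_SIZE):
    """True when both addresses are brought into cache by the same line fill."""
    return line_index(addr_a, line_size) == line_index(addr_b, line_size)
```

Under these assumptions, byte addresses 128 through 191 share one line, so a read of address 130 also brings the other 63 bytes of that line into the cache.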
There are two ways programs can take advantage of caches. Consecutively reading data from
nearby memory addresses results in spatial locality of data accesses. For example, spatially local
reads occur when a program reads two data elements on the same cache line. When a program
reads a data element that is already in cache due to a previous read, the program is exhibiting
good temporal locality. Each of these access patterns reduces the amount of data read from slower
memory structures and speeds up routine execution.
Most caches are small in comparison to main memory. Typical cache sizes range from 32
KB to 24MB while main memory can be many gigabytes in size. Therefore, caches can only store
a subset of main memory at a given time. To store and retrieve values from a cache quickly, the
data are stored by using the first bits of the address in main memory to index into a set where
typically two to sixteen cache lines are stored. Once the proper set is found, the next bits of the
address are compared with the tag of the cache line to see if the set contains the line sought. If
the line is found, a cache hit has occurred, and the data sought are read from the cache line to the
processor. If the line that contains the data is not found then a cache miss has occurred. The cache
line containing the data sought is then read from main memory and stored in the appropriate set.
If the set is already full with other lines, then the least recently used line is evicted from the cache.
The same procedure is followed for computers with multiple levels of cache. When there is a miss
in a faster and smaller cache, the next larger and slower caches are tried. If the data are not found
in any cache, they are read from main memory.
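The lookup and eviction procedure can be modeled with a toy simulator. The sketch below assumes a single cache level, true LRU replacement, and a set index taken from the low-order bits of the line number; the class name, default sizes, and that indexing choice are assumptions made for illustration.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Toy model of one cache level with LRU replacement (sizes are assumptions)."""

    def __init__(self, num_sets=4, ways=2, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered dict per set; keys are tags, ordered oldest to newest use.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up a byte address; return True on a hit, False on a miss."""
        line = address // self.line_size
        index = line % self.num_sets      # bits that select the set
        tag = line // self.num_sets       # remaining bits identify the line
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)     # evict the least recently used line
        lines[tag] = True
        self.misses += 1
        return False
```

With the default geometry, addresses 0, 256 and 512 all map to set 0, so touching all three in turn evicts the line holding address 0 even though the other three sets are empty.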
On most computers cache lines are arranged in sets to simplify cache design and reduce
manufacturing cost. For these set associative caches the number of lines in the cache in which
a data element can be stored is the associativity of a cache. When a data element is read from
memory it is stored in a cache set using a mapping based on where in memory it is stored. If that
set is full, the cache evicts the least recently used (LRU) cache line. Often, tracking
which line was used least recently is too expensive or complicated, and a pseudo-LRU replacement
strategy is used.
However, some computers use fully associative caches where a data element can be stored in
any cache line. In a fully associative cache, all cache lines need to be checked whenever
data are read. To avoid slowing reads, fully associative caches require more circuitry
to check each line for the data in parallel. A fully associative cache avoids conflict misses
and usually results in fewer overall cache misses. However, set associative caches usually have a
better cost-benefit tradeoff.
To take full advantage of a cache, the number of cache misses needs to be limited since every
cache miss means a read from either a slower cache or main memory. There are three types of
cache misses: compulsory, capacity and conflict. Compulsory misses, which are unavoidable, occur
the first time that data are read in from main memory. A large cache line size, however, can limit
the number of these misses since, when one element on a cache line is read in, the other elements
on that line are also read. The extra data that are read on the cache line are prefetched data
and, if the processor uses them before the cache line is evicted, then some compulsory misses are
avoided. Capacity misses occur when a cache is not large enough to hold all the data the processor
is currently using to perform calculations. The two ways to avoid capacity misses are to increase
the size of the cache or to change a program so the amount of data it uses at a given time fits
within the cache. The final type of cache miss is a conflict miss which occurs when there is space
in the cache for a line, but it was evicted because the set in which it is stored was full. To avoid
these misses, a cache either needs to be made with more lines per set, which is expensive in terms
of the transistors necessary, or a program needs to access memory sequentially, which distributes
the cache lines being used by a program more uniformly across cache sets.
2.1.1.3 Translation Lookaside Buffer
Another memory structure used to speed the operation of the memory hierarchy is the trans-
lation lookaside buffer (TLB), which stores virtual to physical page translations. A virtual address
is the address a program uses to access data. The computer, when retrieving the data, translates
it into the physical address where the data reside. Virtual addresses have both advantages and
disadvantages but are necessitated by the need to share the limited amount of physical memory
among multiple processes and provide data protection to each process’ data.
The advantages of virtual memory are that code is easier to write, multiple processes can
access the same virtual addresses, and a program can address more memory than is physically
present. Virtual addresses enable a program to reside in any part of main memory during execution
because the physical address corresponding to a virtual address can be anywhere. Therefore, virtual
addresses allow a programmer not to have to worry about memory management when writing code.
Virtual addresses also allow multiple programs to access the same virtual address because, for each
program, there is a separate mapping of virtual to physical addresses. Therefore, a programmer
does not need to worry about another program accessing said program’s data because virtual
memory provides data protection. In addition, virtual addresses allow a program to address more
memory than is physically present by allowing access of data on disk as if it were within memory.
In this thesis, the concern is only with data sets that fit within memory. However, the cost of
virtual to physical address translations is significant in both space and time, and it is important
to the performance of routines that fit within memory [49].
For the proper physical page to be accessed when a virtual address is referenced, a translation
between virtual and physical pages must occur. For each process, therefore, there must be a
virtual to physical page table to translate addresses for proper access. The page table is stored
within memory, and, when the tables for all running programs are combined, they can take up a
significant portion of memory. A virtual address must be translated to a physical address by looking
up the physical address in the page table. Each of these lookups requires at least one additional
memory access. For multilevel page tables, which are found on most machines, two or more reads
are required per translation. The TLB, by storing recent translations, reduces the need to go to
memory to translate addresses and, as with caches, reduces the amount of memory traffic.
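A small model makes the saving concrete. The sketch below assumes 4KB pages, a one-level page table held in a dictionary, and a tiny TLB that discards its oldest entry when full; the class and its replacement policy are illustrative assumptions rather than a description of real hardware.

```python
PAGE_SIZE = 4096  # bytes per page (assumed)

class Translator:
    """Toy address translator: a small TLB in front of a one-level page table."""

    def __init__(self, page_table, tlb_entries=2):
        self.page_table = page_table   # virtual page number -> physical page number
        self.tlb_entries = tlb_entries
        self.tlb = {}                  # recent translations, oldest inserted first
        self.page_table_reads = 0      # extra memory accesses caused by TLB misses

    def translate(self, virtual_address):
        """Return the physical address for a virtual address."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.tlb:
            self.page_table_reads += 1          # TLB miss: consult the page table
            if len(self.tlb) == self.tlb_entries:
                oldest = next(iter(self.tlb))   # discard the oldest translation
                del self.tlb[oldest]
            self.tlb[page] = self.page_table[page]
        return self.tlb[page] * PAGE_SIZE + offset
```

Every address on a page shares one translation, so a sequential walk through a 4KB page pays for at most one page table read in this model.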
A TLB typically contains 32 to 1024 entries, where each stores the translation for a single
page of memory. Pages typically contain 4KB to 2MB of data [52]. Most TLBs are fully associative
as the added cost of a fully associative structure is offset by its ability to reduce the number of
page table lookups. When a miss occurs in the TLB, the cost is amortized if there is good spatial
locality among reads since each page contains more data than a cache line.
2.1.1.4 Other Memory Sub-system Structures
In addition to efficient use of caches and the TLB, hardware prefetching improves perfor-
mance. To reduce the latency of reading data from memory, the next data element to be used is
predicted. Then, before that datum is accessed by the program, the hardware begins reading it
from memory. If the prediction is correct, prefetching reduces the time that it takes for the datum
to be read from memory. The memory prefetcher has many choices for predicting the next accessed
data element but typically fetches data consecutive to a current cache miss.
Another hardware feature that improves the performance of the memory subsystem is non-
blocking caches. A non-blocking cache allows reads of data within cache to be serviced while a
previous cache miss is serviced from memory. By not stalling on a cache miss, the latency of
reading data from memory is hidden as other data are read from cache while the memory read is
being serviced. Hardware support for out of order execution, which is explained in section 2.1.3, is
needed to take advantage of a non-blocking cache.
2.1.1.5 Putting it all Together
The total cost of moving data through the memory hierarchy can be expressed neatly
as shown by Byna et al. [20]. For a memory hierarchy with k levels of cache (called L1, L2, ...,
Lk) and one TLB, the amount of time spent moving data can be calculated using the following
equation [52]:
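\begin{equation}
\begin{aligned}
\text{TotalMemoryCost} ={} & (\text{Number of TLB hits} \times \text{Time to access TLB}) \\
& + (\text{Number of TLB misses} \times \text{TLB miss penalty}) \\
& + (\text{Number of L1 hits} \times \text{Time to access L1}) \\
& + (\text{Number of L1 misses} \times \text{L1 penalty}) \\
& + (\text{Number of L2 hits} \times \text{Time to access L2}) \\
& + \cdots + (\text{Number of L}k\text{ misses} \times \text{L}k\text{ penalty}) \\
& + (\text{Number of memory hits} \times \text{Time to access memory})
\end{aligned}
\tag{2.1}
\end{equation}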
Byna et al. also show that, if the memory behavior of a sequence of operations can be predicted
and the performance of the sequence of operations is limited by data movement, program runtime
can be approximated without running the program. In this thesis, we build upon this work and
create a memory model for more fused linear algebra routines that combine multiple operations
into one calculation.
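Equation 2.1 is a sum of count-times-cost terms and can be evaluated directly once the counts are known. The helper below is a sketch; its name and the tuple layout of its arguments are assumptions, and any numbers passed to it would have to come from measurement or prediction.

```python
def total_memory_cost(tlb, caches, memory):
    """Evaluate Equation 2.1.

    tlb    -- (hits, access_time, misses, miss_penalty)
    caches -- list of (hits, access_time, misses, miss_penalty), one per level L1..Lk
    memory -- (hits, access_time)
    """
    cost = 0.0
    # The TLB and every cache level contribute a hit term and a miss term.
    for hits, access_time, misses, penalty in [tlb] + list(caches):
        cost += hits * access_time + misses * penalty
    # Main memory contributes only a hit term.
    mem_hits, mem_time = memory
    return cost + mem_hits * mem_time
```

For instance, with made-up counts and cycle costs, `total_memory_cost((90, 1, 10, 30), [(80, 3, 20, 0), (15, 12, 5, 0)], (5, 200))` adds 390 cycles for the TLB, 240 for L1, 180 for L2 and 1000 for memory.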
2.1.2 Multi-core Architectures
Current microarchitecture design has moved from increasing the number of instructions that
can be executed by a single core in a given amount of time towards keeping the speed of the cores
on a processor fairly constant while increasing the number of cores. Multi-core processors are single
circuits that contain two or more processing cores, each with the full computing capabilities of a
normal processor. There are two reasons leading to this shift in design. The more important is heat
dissipation. Approximately every two years, the number of transistors that fits on a piece of silicon
doubles, but the amount of power each transistor uses decreases at a slower rate [67]. Therefore,
power consumption per chip is increasing, resulting in more heat on a processor. The other reason
for a shift in design is the problem with synchronizing operations across the processor because,
at current clock speeds, it takes an electronic pulse more than one clock cycle to move across a
processor. Multi-core processors are the most common solution to the heat and synchronization
problems since they decrease power consumption, which reduces heat production, by performing the
same number of operations at a lower clock speed. The lower frequency and decreased size of the
multi-core processors also make it easier to synchronize chips.
Most multi-core chips are combined together into shared memory parallel (SMP) systems
that have multiple sockets each with a bus to memory. The processors on these sockets typically
have multiple cores that all share the same memory bus. For example, the dual socket quad-core
Intel Clovertown shown in Figure 2.1 has two buses from memory, one per socket. The added
buses per socket increase overall memory to processor bandwidth available to perform calculations.
However, sharing buses between cores can decrease available memory bandwidth per core and create
contention for the memory bus.
In contrast to memory buses, which typically only increase with the number of sockets,
the number of caches and buses from cache to the processor often increase more rapidly. In our
Clovertown example, there are four L2 caches and eight L1 caches. Additionally, each core has a
bus from both an L1 and L2 cache dedicated to it. Therefore, when compared to using a single core
of the Clovertown system, memory to processor bandwidth is increased by a factor of two when
using the whole system, while cache to processor bandwidth is increased by a factor of eight.
Another positive of parallel systems is that additional caches increase the total amount of
data that can be stored close to the processor. However, when caches are shared among cores, as
the L2 cache is in our Clovertown system, the amount of each cache a given core can use is reduced.
Contention for the cache can cause data to be evicted, resulting in more costly reads from slower
memory structures.
Additionally, the organization of the memory sub-system of some SMP computers varies.
For some machines all the data stored in memory can be accessed with the same latency and
bandwidth by all processors. In the other common memory organization, reads of the same
data by different processors result in different latencies and data throughput rates. In these Non-
Uniform Memory Access (NUMA) systems each processor has its own memory address space, and
reads from a processor’s own address space are quicker than those from other processors’ address
spaces. Therefore, keeping data reads local to a processor’s own address space can increase routine
performance.
Figure 2.1: Dual socket quad-core Intel Clovertown
2.1.3 Processor Memory Speed Gap
The speed of processors is increasing at a greater rate than the speed data are moved from
memory into the processor [52,81]. The result of this differing pace of performance improvement is
that, for each word read in from memory, the processor can perform multiple calculations. There
are many ways in which computer designers attempt to increase the amount of data moved to the
processor per calculation. Caches allow algorithms with good temporal locality to reuse data read
from memory more than once, thereby improving the ratio of calculations to memory reads.
To allow the processor to still perform work while waiting for high latency cache or memory
reads, out-of-order execution allows a processor to execute later instructions while the current
instruction is stalled. The processor analyzes the instructions to be executed after the stalled
instruction, and any instruction not dependent on the result of a stalled instruction is executed
while the cache or memory read is serviced. Through out-of-order execution, the latency of a
memory read is mitigated or hidden as the computer performs useful work while waiting for a
data read to be serviced. Out-of-order execution, therefore, can increase the total throughput of
a computer by eliminating the need for a processor to stall on high latency reads. However, out-
of-order execution is limited by the number of instructions that can be stalled at once and how
many instructions in the future the processor can examine for calculations that can be executed
immediately.
Caches and out of order execution are effective at increasing the overall throughput of pro-
grams with good spatial and temporal locality and few latency bound reads. However, they are
not effective at increasing the speed of algorithms that do not reuse a datum once it is read from
cache and that have many reads from high-latency or low-bandwidth memory. Such algorithms
are inherently limited by the bandwidth from memory to the processor [3]. For algorithms with
little data reuse, the only way to increase their speed is by reducing memory operations [30] or
combining two algorithms that access the same data [55].
2.2 Performance Tuning Techniques
Naively designed linear algebra routines take longer to execute than highly optimized routines
although both should produce the same result to within machine precision. The performance
gap occurs because optimized routines take into account the underlying hardware on which the
program runs. An example of two routines that perform the same calculations at dramatically
different speeds is the pair of matrix-matrix multiply routines in Netlib BLAS [78] and GotoBLAS [22].
The BLAS provide an application programming interface (API) developed to give application
developers a standardized interface for calling highly optimized linear algebra routines. Each of the
two routines is an implementation of the BLAS [36,37,71] routine GEMM. The Netlib routine is an
untuned reference BLAS implementation written in Fortran. GotoBLAS is a highly tuned BLAS
implementation written primarily in assembly. The top left graph in Figure 2.2 shows that GotoBLAS
matrix-matrix multiplication outperforms Netlib BLAS matrix-matrix multiplication by 25% to 900% in
MFlops across a wide range of matrix sizes even though each performs the same number of
floating-point operations. The remaining three graphs in the figure show the number of L1,
L2, and TLB misses per flop. Executing GotoBLAS results in significantly fewer of these costly misses
for large matrix orders, which dramatically improves its performance.
In this section, we discuss the tuning techniques that are used by GotoBLAS and other codes
to achieve performance beyond that of a naive implementation. Techniques are grouped into four
categories to reflect the portion of the hardware where they improve program performance: faster
instructions, pipeline, register and memory. When a tuning technique positively affects more than
one hardware structure, we note the other structures in its description and group the technique
where its largest effect occurs. Techniques that are of primary interest in this thesis are
given a full explanation rather than a quick mention and citation. Loop fusion, which is a memory
optimization and the primary focus of this thesis, is left to the next chapter.
Figure 2.2: Optimized vs. un-optimized matrix matrix multiply performance. [Four panels compare Netlib and GotoBLAS against matrix order n from 10^1 to 10^4: Throughput Matrix-Matrix Multiply (MFlops), L1 Cache Behavior (L1 cache misses per flop), L2 Cache Behavior (L2 cache misses per flop), and TLB Behavior (TLB misses per flop).]
2.2.1 Use Faster but Equivalent Instructions
Some operations can be performed by a computer in multiple ways to yield the same result.
For example, 5∗2 = 10 can be computed by shifting 5 one bit to the left, multiplying 5∗2 or adding
5 + 5. Although each instruction yields identical results, bit shifting is preferable as it executes in
the least amount of time on most CPUs. Also, converting integer multiplies to adds when possible
results in equivalent code that executes in less time. Eliminating magnitude compares and replacing
them with inequality or equality compares results in faster execution. For best performance, pointer
updates should be minimized to allow the use of register plus offset addressing mode [13].
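The equivalences mentioned above can be written out explicitly. The sketch below is in Python only to show the transformations; the function names are hypothetical, and any speed difference would appear in compiled code rather than in an interpreter.

```python
def times_two_variants(x):
    """Three ways to double a non-negative integer; all give the same value."""
    return x * 2, x << 1, x + x

def offsets_with_multiply(n, stride):
    """Compute i * stride for each i with one multiply per iteration."""
    return [i * stride for i in range(n)]

def offsets_with_add(n, stride):
    """Same offsets, replacing the multiply with a running addition."""
    offsets, current = [], 0
    for _ in range(n):
        offsets.append(current)
        current += stride
    return offsets
```

The second pair of functions shows the multiply-to-add conversion in the setting where it usually arises, the computation of array offsets inside a loop.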
2.2.2 Pipeline
Flushing the pipeline because branches are mispredicted or instructions need to be delayed
while waiting for the results of other instructions slows the execution of programs. Mispredicted
branches can be reduced by minimizing the number of branch instructions. By exposing inde-
pendent operations through the use of local variables and false dependency elimination, more
instructions can be executed by the processor’s out of order execution unit at a given time, thereby
keeping the pipeline busy [13]. In addition, by unrolling loops [104] and balancing the instruction
mix [13], the computer has more choices when scheduling instructions, decreasing the likelihood of
the pipeline’s idling.
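Loop unrolling and the use of independent local variables can be combined in one small routine. The dot product below is a sketch with an assumed unroll factor of four; it shows the shape of the transformed loop and makes no claim about measured speed.

```python
def dot_unrolled(x, y):
    """Dot product unrolled by four with independent partial sums."""
    n = len(x)
    s0 = s1 = s2 = s3 = 0.0            # local accumulators with no mutual dependency
    limit = n - n % 4
    for i in range(0, limit, 4):       # one loop branch per four multiply-adds
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    for i in range(limit, n):          # clean-up loop for leftover elements
        s0 += x[i] * y[i]
    return (s0 + s1) + (s2 + s3)
```

Because the four partial sums do not depend on each other, a processor is free to overlap their multiply-adds. Note that the summation order differs from a simple loop, so floating-point results can differ in the last bits.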
2.2.3 Register
Once data are stored in registers, they are within the fastest memory structures in the com-
puter. Therefore, using a value as much as possible while it is in a register improves performance.
Two ways to keep a value within a register are register blocking and the use of local variables.
Register blocking alters loops so that each iteration of a loop uses all the registers and therefore
maximizes their reuse. Using local variables in code eliminates false read-after-write
dependencies. False read-after-write dependencies occur when a compiler (due to indirect accesses)
cannot discern if a data element being written to memory will be used by a future instruction.
When the compiler wrongly predicts a data element is going to be read again, that data element
must be written out to memory when it could remain in a register [13].
2.2.4 Memory
To reduce memory reads and the effects of high latency and low bandwidth, programs can
be rearranged. Blocking is used in matrix-matrix multiplications to reduce memory traffic by
performing as many calculations as possible on data read from memory before the data are evicted
from cache. Figure 2.3 shows how matrices are blocked when they are multiplied. The block sizes
are chosen such that one block of A and one block of B fit within cache at once and C is streamed
through memory. The operation is performed by multiplying each block in a row of A by each
block in a column of B yielding a partial block of C. The process continues for all block rows and
block columns until C is produced. The result is a reduction in the number of reads from memory
from O(n^3) to O(n^3/blocksize^2). However, careful choice of blocksize is important to performance
as shown by Lam et al. [70]. Blocking can be applied to all levels of the cache and the TLB and is
usually done for each memory structure [49].
Figure 2.3: Blocking a Matrix-Matrix Multiply. [Diagram: A is partitioned into blocks A11 through A32 and B into blocks B11 through B23; the product A * B = C is shown with blocks C11, C12, C21 and C22.]
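The blocked loop structure can be sketched as follows. The code assumes square matrices stored as Python lists of rows and a hypothetical `block` parameter; a tuned implementation would choose the block size from the cache size and would not be written in Python.

```python
def blocked_matmul(A, B, block=2):
    """Multiply square matrices (lists of rows) one block at a time."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):              # block row of A and C
        for jj in range(0, n, block):          # block column of B and C
            for kk in range(0, n, block):      # block along the shared dimension
                # Multiply block A[ii.., kk..] by block B[kk.., jj..] and
                # accumulate the partial result into block C[ii.., jj..].
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = A[i][k]         # reused across the inner loop
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

The three outer loops walk over blocks, and the three inner loops perform all the work available on one pair of blocks before moving on, which is the reuse pattern blocking is meant to create.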
When blocking is applied to matrix-matrix multiplication routines written in Fortran, two
matrices are accessed by reading down their columns and one matrix is accessed by reading across
its rows. In Fortran, where data are stored in a column-wise manner, reading elements row-wise
results in bad spatial locality of reads. Through the use of a copy optimization, data elements in the
matrix can be rearranged so they are accessed in a consecutive manner [70]. Copy optimizations can
be used on blocked matrices to rearrange data within blocks so that they are stored consecutively
and that successively accessed blocks follow each other. The combination of blocking and copy
optimizations reduces the number of memory reads and improves memory access patterns.
To reduce the number of conflict misses caused by data accessing the same set within cache,
array padding [87] can be used. Array padding is a technique in which more memory is allocated
to an array than is necessary [38]. Extra data are added to the end of each row or column of a
matrix thus separating the last value of the previous row or column from the first of the next with
empty space. The result is that data values map to different sets in cache, reducing conflict misses.
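The effect of padding on set mapping can be checked with simple address arithmetic. The sketch below assumes a cache with 64 sets of 64-byte lines, 8-byte elements, and a column-major matrix; all of these values and both function names are assumptions for illustration.

```python
def set_index(address, line_size=64, num_sets=64):
    """Cache set selected by a byte address (cache geometry is assumed)."""
    return (address // line_size) % num_sets

def row_sets(columns, leading_dim, row=0, elem_size=8):
    """Sets touched when walking across one row of a column-major matrix."""
    return [set_index((col * leading_dim + row) * elem_size) for col in range(columns)]
```

With a leading dimension of 512 elements, consecutive columns are 4096 bytes apart and every element of a row falls in the same set. Padding each column to 520 elements shifts successive columns by one extra line, so the first 16 elements of a row land in 16 different sets.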
A1 A2 A3 B1 B2 B3
A1 B1 A2 B2 A3 B3
Figure 2.4: How to interleave data to create a multivector.
Data interleaving, also called multivector optimization, reorganizes multiple vectors to im-
prove the data access pattern of the vectors [66]. The goal is to decrease the number of cache lines
accessed and thus the potential for conflicts. Figure 2.4 shows how two arrays can be interleaved.
The top image in the figure is the data in memory where the two vectors are declared separately.
The bottom image is the data in memory where the elements of each vector are interleaved. To
perform a multivector optimization, the first elements of each vector, A1 and B1 in our example, are
moved to successive memory addresses. Then the second element of each vector immediately follows
the first in the same order. The process is repeated with the rest of the elements in each
vector, resulting in a single multivector composed of independent vectors. If both vectors A and
B are accessed consecutively within a loop, then all the data are accessed consecutively, improving
spatial locality.
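The interleaving step of Figure 2.4 can be written as a short routine. This sketch assumes two vectors of equal length held in Python lists; the function names are hypothetical.

```python
def interleave(a, b):
    """Build the multivector [a1, b1, a2, b2, ...] from two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    multivector = []
    for a_i, b_i in zip(a, b):
        multivector.extend((a_i, b_i))
    return multivector

def sum_of_products(multivector):
    """Walk the multivector once, using each adjacent (a_i, b_i) pair together."""
    return sum(multivector[k] * multivector[k + 1]
               for k in range(0, len(multivector), 2))
```

A loop that needs a_i and b_i in the same iteration, such as the inner product in `sum_of_products`, then reads one contiguous stream instead of two separate ones.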
Another way to reduce the number of reads from memory is software pipelining. Software
pipelining interleaves operations where a dependency exists. It reorders dependent operations such
that they are overlapped in code [2]. An example of software pipelining is shown in Figure 2.5.
The left calculation is the unpipelined routine and the right calculation is the pipelined one. The
software pipelined routine reduces the number of times the vectors b and c need to be read from
memory by one since all accesses occur within a few calculations of each other rather than a loop
apart.
Left (unpipelined):
    for i = 1:n
        b(i) = b(i−1) + c(i−1)
        c(i) = c(i−1) + c(i−2)
    for i = 1:n
        a(i) = b(i+1) + c(i+1)
Right (software pipelined):
    b(1) = b(0) + c(0)
    c(1) = c(0) + c(−1)
    for i = 2:n
        b(i) = b(i−1) + c(i−1)
        c(i) = c(i−1) + c(i−2)
        a(i−1) = b(i) + c(i)
    a(n) = b(n+1) + c(n+1)
Figure 2.5: Software pipelining of a calculation
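The two versions in Figure 2.5 can be transcribed into Python to confirm that they compute the same vector a. The transcription assumes that b and c are dictionaries holding initial values for every index from -1 to n+1, since the figure reads entries outside the range the loops write; the function names are hypothetical.

```python
def unpipelined(b, c, n):
    """Two separate loops: update b and c, then compute a in a second pass."""
    b, c, a = dict(b), dict(c), {}
    for i in range(1, n + 1):
        b[i] = b[i - 1] + c[i - 1]
        c[i] = c[i - 1] + c[i - 2]
    for i in range(1, n + 1):
        a[i] = b[i + 1] + c[i + 1]
    return a

def pipelined(b, c, n):
    """One loop: a(i-1) is computed right after b(i) and c(i) are produced."""
    b, c, a = dict(b), dict(c), {}
    b[1] = b[0] + c[0]                 # prologue: first iteration of the update loop
    c[1] = c[0] + c[-1]
    for i in range(2, n + 1):
        b[i] = b[i - 1] + c[i - 1]
        c[i] = c[i - 1] + c[i - 2]
        a[i - 1] = b[i] + c[i]         # uses values produced one statement earlier
    a[n] = b[n + 1] + c[n + 1]         # epilogue: last iteration of the second loop
    return a
```

In the pipelined version each b(i) and c(i) is consumed immediately after it is produced, which is the proximity of accesses that the text credits with saving a pass over the two vectors.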
2.3 Linear Algebra Libraries
Since many scientific computing applications and other computing problems use similar sets
of linear algebra calculations, libraries are designed to allow the reuse of codes that perform those
calculations [4, 7, 15]. Routines are written for various linear algebra structures and environments
including dense matrices, sparse matrices and parallel processing environments. Many approaches
use the same API to allow for an application developer to design one program and link in the proper
optimized library for the machine on which the program is run. In addition, many packages build
on each other to allow for code reuse and modularity. The result can be highly portable code that
performs efficiently across many machines for many commonly used operations. In this section,
we discuss major libraries and APIs that perform linear algebra computations and discuss their
importance in scientific computing.
2.3.1 Basic Linear Algebra Subprograms
The BLAS are a standard Fortran interface for dense linear algebra operations [36,37,71] that
has been updated to include sparse operations, additional routines and a C interface [15, 34, 35].
The BLAS are broken into three levels of operations. Level 1 routines contain vector operations,
such as an inner product, that perform O(n) computation on O(n) data. Level 2 routines are
matrix-vector operations, such as outer products and matrix-vector multiplies, that have O(n^2)
computation on O(n^2) data. Level 3 routines carry out matrix-matrix operations, such as
matrix-matrix multiplications, where O(n^3) computation is performed on O(n^2) data. With the updated
standard, four new routines that perform the work of multiple old routines and a sparse BLAS
(SBLAS) interface which provides a standard interface for a variety of sparse matrix storage formats
[39] are added.
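The difference between the three levels shows up in the ratio of computation to data. The sketch below uses the usual leading-order operation counts for one representative routine per level (an inner product, a matrix-vector multiply, and a matrix-matrix multiply on n-by-n operands); the function name and the choice of representatives are assumptions.

```python
def flops_per_element(n):
    """Approximate flops per data element touched, one routine per BLAS level."""
    dot = (2 * n) / (2 * n)                  # level 1: x^T y reads two vectors
    gemv = (2 * n * n) / (n * n + 2 * n)     # level 2: y = A x touches A, x and y
    gemm = (2 * n ** 3) / (3 * n * n)        # level 3: C = A B touches three matrices
    return dot, gemv, gemm
```

For n = 1000 the ratios are 1, just under 2, and about 667, which is why only level 3 routines offer substantial opportunity to reuse data held in cache.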
The BLAS were first developed to provide linear algebra operations to application developers
through a standardized interface. Before their creation, application developers would create their
own code to perform linear algebra routines for each program. Since the development of tuned
BLAS implementations application developers are able to create a portable program that, when
linked with a tuned BLAS library on the target machine, delivers high performance. Another
benefit of the BLAS is that linear algebra routines often need to be tuned only once for each
architecture, by library developers, rather than by each programmer.
There are three types of BLAS implementations available: reference, handcoded and machine
generated. The Netlib BLAS [78] are a reference dense BLAS that other developers can use to
compare their implementations against for correctness. The library contains slightly optimized or
unoptimized versions of the routines defined in the BLAS specifications [15, 36, 37, 71]. Reference
sparse BLAS are available in Fortran from Netlib [79] and C++ from NIST [80]. The Netlib BLAS
also provide a C interface to ease the development of programs written in C and C++.
Handcoded BLAS routines are highly optimized routines that typically use assembly code to
perform the most important parts of their calculations and as a result are among the fastest BLAS
implementations available. Most computer vendors supply their own versions for their machines.
Some examples are the Engineering and Scientific Subroutine Library (ESSL) [56] for IBM processors,
the Math Kernel Library (MKL) [57] for Intel processors, and the Sun Performance Library [96] for Sun
processors. Some of these libraries, such as MKL, also include optimized sparse computations. In
addition to vendor supplied packages, Kazushige Goto, formerly of the Texas Advanced Computing
Center, developed and distributed GotoBLAS [22], another highly tuned BLAS package available
for multiple architectures.
The problem with handcoded BLAS routines is that they are expensive to create. An al-
ternative is the automatic generation of kernels. Two programs, Automatically Tuned Linear
Algebra Software (ATLAS) [105] and Portable High Performance ANSI C (PHiPAC) [13], generate
high performance dense linear algebra algorithms for many architectures. In addition, OSKI [98]
generates high performance sparse kernels and can perform runtime tuning by transforming data
structures. All three programs search through a range of parameters to find the best settings for a
given machine. They can either figure out cache sizes or be given machine parameters to shorten
the amount of time spent searching for them. The result is performance that is comparable to or
better than handcoded operations, but with less programmer effort. While very good at producing
highly efficient multi-platform code, these packages need constant updating to take advantage of
new instructions and architectures as they appear.
Since most scientific simulations are run on parallel machines, parallel BLAS routines have
been developed. ATLAS, GotoBLAS, and most vendor BLAS have multi-threaded implementa-
tions, and a multi-core aware version of OSKI was considered [99]. The multi-threaded libraries
are used on SMP machines. For distributed memory machines, Parallel BLAS (PBLAS) [24] are
used. PBLAS calls the Basic Linear Algebra Communications Subprograms (BLACS) [32] to han-
dle communication between nodes. In addition, the author of this thesis implemented an interface
to OSKI within Epetra [90], a linear algebra library, that allows for most of OSKI’s kernels to be
used for serial computation within parallel routines [64].
Furthermore, implementations of the new routines contained in the updated BLAS standard
[15] have resulted in significant speedups [55]. These routines combine the functionality of two or
more level 1 or level 2 BLAS routines into a single more memory-efficient routine. They have been
used by Howell et al. to improve the speed of Householder bidiagonalization by 10% to 25% [55].
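The memory saving from combining routines can be seen in a small sketch. The pair of functions below is loosely modeled on fusing a vector update with an inner product; the function names and the exact operation are illustrative assumptions, not routines taken from the updated standard.

```python
def axpy_then_dot(alpha, v, w, u):
    """Unfused: two passes over the updated vector, as with two separate calls."""
    w_new = [w_i - alpha * v_i for w_i, v_i in zip(w, v)]   # pass 1 writes w_new
    r = sum(w_i * u_i for w_i, u_i in zip(w_new, u))        # pass 2 rereads w_new
    return w_new, r

def fused_axpy_dot(alpha, v, w, u):
    """Fused: each updated element feeds the inner product while still at hand."""
    w_new, r = [], 0.0
    for v_i, w_i, u_i in zip(v, w, u):
        updated = w_i - alpha * v_i
        w_new.append(updated)
        r += updated * u_i
    return w_new, r
```

Both versions perform the same arithmetic, but the fused loop touches each element of the updated vector once instead of twice, which matters when the vectors are too large to stay in cache between the two passes.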
2.3.2 Linear Algebra Package (LAPACK)
Once the BLAS were designed and widely available, it became practical to have higher level
packages that use them to perform more complex calculations. The first of these packages were
EISPACK [45,94], which includes routines for computing eigenvalues and eigenvectors of matrices,
and LINPACK [33], which is used to solve linear equations and linear least-squares problems. Both
EISPACK and LINPACK were written in Fortran and use column operations and functions in
the level 1 BLAS. The advantage of having one combined package, the ability to optimize those
routines through calls to the more efficient level 3 BLAS routines, and algorithm advances led to
the creation of the Linear Algebra Package (LAPACK) [4]. Since it uses level 3 BLAS routines,
LAPACK takes better advantage of the memory hierarchy, which for most machines results in more
efficient computation than the level 1 BLAS routines used by its predecessors.
A parallel version of LAPACK called Scalable LAPACK (ScaLAPACK) [14] is available. To
increase code reuse, ScaLAPACK uses PBLAS and BLACS for computation and communication
whenever possible.
2.3.3 Higher Level Packages
Leveraging the BLAS and LAPACK are libraries such as Trilinos [54] and PETSc [7]. These
packages build upon the BLAS and LAPACK by providing data structure support and higher
level solvers and preconditioners. Trilinos is an object-oriented framework for the solution of large
scale complex physics, engineering and scientific problems. The project is designed to assist the
development of independent packages which are autonomous pieces of code that can function in-
dependently or, using support provided by the Trilinos framework, communicate with each other.
Examples of such packages and their functions are Sacado [91], which performs automatic differen-
tiation, ML [46], a multigrid preconditioning package and Amesos [89], a direct sparse linear solver
package.
PETSc provides many sparse linear solvers, handles communication for the programmer and
provides automatic profiling of floating-point and memory usage, all within a consistent interface.
In addition, related projects that use the same data structures and some PETSc routines provide
added functionality. TAO [74], the Toolkit for Advanced Optimization, adds in the ability to solve
optimization problems. SLEPc [53], the Scalable Library for Eigenvalue Computation, is used for
solving eigenvalue problems. Finally, Prometheus [1] is a multigrid solver used to solve unstructured
finite element problems.
The importance of the speed of linear algebra computations to overall runtime of higher level
libraries and application codes is shown by the constant effort made to improve the runtime of
the underlying linear algebra. The OSKI interface within Epetra is now available to all Trilinos
package developers and allows them another option when looking for the fastest available routines.
In addition, the author of this thesis worked on improving the speed of a matrix-matrix multiply
within ML by creating a block version of it [63]. The constant search for faster algorithms and
methods within Trilinos demonstrates that improvements in the speed of linear algebra routines
are both sought and used in real scientific applications.
2.4 Autotuning Programs
Autotuning reduces programmer effort by having a program tune itself based on
a set of rules. Efforts for diverse calculations, such as matrix multiplication and fast Fourier
transforms, have shown success in producing routines that are faster than hand-tuned libraries.
In this section, we examine other auto-tuned linear algebra software.
ATLAS [105] and PHiPAC [13] both produce matrix-matrix multiplication kernels that
are competitive with hand-tuned codes across various architectures. Both of these systems query
the hardware to determine characteristics of the target machine and produce optimized codes during
installation. ATLAS produces multi-threaded code that runs on parallel machines.
OSKI [98] takes a different approach. It focuses on sparse kernels and matrix-vector operations.
It profiles the hardware at install time, but performs runtime tuning because the proper
tuning techniques for sparse kernels are not known until the matrix structure can be determined.
Therefore, by deferring tuning until runtime, more specialized optimization can be performed for
the target matrix. The drawback of tuning at runtime is that the tuned kernel needs to be called
many times to amortize the tuning cost [64].
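The amortization argument can be made concrete with a little arithmetic. The sketch below assumes a one-time tuning cost and fixed per-call times for the untuned and tuned kernels. The function name and the numbers are illustrative and are not taken from OSKI or from [64].

```python
import math

def break_even_calls(tuning_cost, untuned_time, tuned_time):
    """Smallest number of kernel calls for which runtime tuning pays off.

    All arguments use the same time unit. Returns None when the tuned
    kernel is not faster, because the tuning cost is then never recovered.
    """
    saving_per_call = untuned_time - tuned_time
    if saving_per_call <= 0:
        return None
    # Tuning pays off once calls * saving_per_call >= tuning_cost.
    return math.ceil(tuning_cost / saving_per_call)

# Hypothetical numbers: tuning costs as much as 40 untuned calls and the
# tuned kernel needs 75% of the untuned time per call.
print(break_even_calls(tuning_cost=40.0, untuned_time=1.0, tuned_time=0.75))  # 160
```

With these made-up numbers the tuned kernel must be called 160 times before the tuning cost is recovered, which is why runtime tuning suits iterative solvers that reuse the same matrix many times.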
FFTW [42] and SPIRAL [84] both perform discrete Fourier transforms. In addition, SPIRAL
performs other transforms used in digital signal processing such as discrete cosine transforms.
FLAME [51] partially automates the process of generating provably correct linear algebra algo-
rithms.
2.5 Models
Memory models are used to estimate, without running the code, the number of reads and writes
that will occur at each level of memory when a program is executed. Through the use of
such models, the memory behavior of different algorithms can be predicted. From these predictions,
the runtime can then sometimes be estimated, for example, by using Equation 2.1. In this section,
we discuss memory models that have been implemented including what they estimate and what
they do not. The models are divided into two sections, those that predict performance and those
that give performance bounds. The models presented are chosen for their diversity and influence
on this research.
2.5.1 Memory Models that Predict Performance
Some memory models attempt to approximate how many misses occur in each memory
structure and use those estimates to predict runtime performance. Predictions can aid developers
in creating code and can lead to large reductions in program runtimes. Here, we summarize models
that attempt to project the performance of calculations they model.
Byna et al. [20] predict the amount of time each memory reference in a segment of code
takes. They analyze the cost of different access patterns including sequential and strided. These
costs and their memory predictions are used to estimate the amount of time two matrix transpose
algorithms take.
The PAD/PADLITE model is designed to approximate conflict misses and then minimize
them through array padding [87]. It uses the Euclidean algorithm for computing the greatest
common divisor to calculate conflicts between array columns in linear algebra code. Padlite
is the simpler version of the model and is only effective for eliminating conflicts in benchmark
programs. Pad is a more complex version that can improve performance for more complicated
kernels. The heuristics used lead to large improvements for some problems and sizes and minimal
to no improvement for others. The largest gains come for arrays where the size has a large power
of two as a factor.
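The role of the greatest common divisor can be illustrated in a simplified setting. The setting is an assumption of this sketch and not the PAD/PADLITE algorithm itself: a direct-mapped cache that holds `cache_elems` array elements and a column-major array with leading dimension `ld`. The starting addresses of successive columns, j*ld mod cache_elems, then take only cache_elems / gcd(ld, cache_elems) distinct values, so a leading dimension with a large power of two as a factor maps many columns onto the same cache positions.

```python
from math import gcd

def distinct_column_positions(ld, cache_elems):
    """Number of distinct values of (j * ld) mod cache_elems over all j."""
    return cache_elems // gcd(ld, cache_elems)

def pad_leading_dimension(ld, cache_elems, min_positions):
    """Grow ld one element at a time until the column starts spread over at
    least min_positions cache positions (a naive, illustrative heuristic)."""
    if min_positions > cache_elems:
        raise ValueError("min_positions cannot exceed cache_elems")
    padded = ld
    while distinct_column_positions(padded, cache_elems) < min_positions:
        padded += 1
    return padded

# Hypothetical cache of 4096 elements (32 KB of 8-byte doubles).
print(distinct_column_positions(1024, 4096))     # 4
print(pad_leading_dimension(1024, 4096, 1024))   # 1025
```

In this toy setting a single element of padding is enough to spread the columns of a 1024-row array over the whole cache, which matches the observation that the largest gains occur for sizes with a large power of two as a factor.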
The Sparse Linear Algebra Memory Model (SLAMM) [29,30] uses automated memory analysis
to speed up iterative methods. Using compiler techniques, SLAMM is able to analyze a MATLAB
implementation of an algorithm and determine the minimum amount of data movement necessary
to implement the algorithm in C or Fortran. SLAMM proved effective by showing that the con-
jugate gradient (CG) solver in the Parallel Ocean Program (POP) had more memory traffic than
necessary. Through the use of SLAMM, memory traffic for CG was reduced by 18% resulting in
a 46% speedup in the runtime of CG. The speedup in CG resulted in an overall 9% reduction
in the runtime of POP and a savings of 216,000 CPU hours per year at the National Center for
Atmospheric Research (NCAR) [30].
Ghosh et al. [47] formulate equations for reuse distances and cache misses that provide a high
degree of accuracy. Their model almost exactly predicts cache misses when compared to actual
misses provided by hardware counters. While accurate, this approach suffers from a large cost of
creating and evaluating the equations.
Yotov et al. [108] develop an analytic model for matrix multiplication. They show that their
analytic model can produce codes that perform almost as well as the automatically generated
routines from ATLAS. Their model takes approximately half as long to run as ATLAS does and
produces routines whose performance ranges from equal to that of the ATLAS-generated kernels to less than 10% worse.
2.5.2 Models that Generate Performance Bounds
Other memory models do not attempt to generate accurate predictions but rather generate
upper and/or lower bounds on performance. Using these bounds, which can be the number of
misses to a structure, runtime or MFlops, an estimate of performance is obtained. The following
is a summary of models that provide performance bounds in their predictions.
Ferrante [41] presents a capacity miss model that calculates the amount of data accessed by
a series of nested loops. It computes a bound on how many cache lines are used by these data.
Validation tests performed on matrix-matrix multiplications show that the model can predict when
to interchange loops to decrease the amount of data accessed within inner loops.
In [88], Rivera and Tseng present models that leverage their previous work on Pad to predict
tile and array pad sizes for 3D stencils. Rivera and Tseng’s model is then used in [27] on SMPs and
other new microprocessors to predict upper and lower bounds of execution time of algorithms. For
the lower bound only compulsory misses are considered while the upper bound is calculated from
the maximum amount of data movement that could occur during execution. The model is effective
as it bounds the actual runtime in all but one case.
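The idea of bracketing runtime between a compulsory-miss bound and a worst-case bound can be sketched for a dense matrix-vector product y = A*x. This is an illustration under stated assumptions (a purely bandwidth-limited routine, 8-byte elements, hypothetical function names) and is not the model used in [27].

```python
def matvec_traffic_bounds(n, bytes_per_elem=8):
    """Lower and upper bounds on bytes moved for y = A*x with an n-by-n A."""
    # Lower bound: compulsory traffic only, each element of A, x and y
    # is moved exactly once.
    lower = (n * n + 2 * n) * bytes_per_elem
    # Upper bound: every reference in y(j) += A(j,i) * x(i) goes to memory:
    # read A(j,i), read x(i), read y(j) and write y(j).
    upper = 4 * n * n * bytes_per_elem
    return lower, upper

def time_bounds(n, bandwidth_bytes_per_s):
    """Convert the traffic bounds into runtime bounds in seconds."""
    lower, upper = matvec_traffic_bounds(n)
    return lower / bandwidth_bytes_per_s, upper / bandwidth_bytes_per_s

# Hypothetical machine with 4 GB/s of sustained memory bandwidth.
fastest, slowest = time_bounds(1000, 4.0e9)
print(fastest, slowest)
```

The gap between the two bounds is roughly a factor of four here, which shows why bound-based models give an estimate of performance and not a precise prediction.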
2.6 Hybrid Search
Many authors combine models and other search techniques to optimize routines. For example,
Yotov et al. [109] use analytic methods to perform global search and approximate the values that
should be used for tuning parameters. They then use empirical search to fine tune their estimates
and further improve performance. The combined search produces better performing routines than
using a model alone and takes significantly less time than global search.
Chen et al. [23] use analytic modeling and heuristics to select a small number of optimization
variants. Combinations of loop permutation, unrolling, register and loop tiling, copy optimization,
and prefetching are selected. Empirical testing is used via guided search to select the best variant,
typically in three to eight minutes. The resulting performance is comparable to vendor BLAS and
ATLAS implementations of matrix multiplication.
Epshteyn et al. [40] consider loop tiling decisions in the context of matrix multiplication.
They use an explanation-based learning algorithm to adapt their analytic model based on empirical
results. The result is faster code than modeling or empirical search alone in comparable or less
time than ATLAS’s empirical search and somewhat more time than the modeling used by Yotov.
Qasem [85] uses pattern-based direct search to find good combinations of loop fusion decisions
for Fortran programs. A model guides the search direction of his compiler for these decisions. Since
Qasem targets a general purpose language and compiler, he needs the scalability of direct search to
produce tractable compile times even though direct search sometimes misses the globally optimal
fusion decision.
Chapter 3
Loop Fusion
Loop fusion is a memory optimization that combines two loops that access the same data into
one [44]. For calculations where memory bandwidth is the limiting factor on performance, fusing
loops can reduce data traffic through the memory subsystem and decrease runtime [28, 55, 101].
However, fusing loops can decrease performance if not done carefully. Fusion can increase the
amount of data stored in a cache or registers leading to capacity and conflict misses [60,61]. Also,
determining which fusion decisions result in the best performance is non-trivial as the problem
of finding the optimal set of fusion decisions is NP-complete [26].
In this chapter, we first overview other research using loop fusion to improve the speed of
linear algebra calculations. Then, we show how to use loop fusion to reduce memory traffic. A
detailed overview of the BTO compiler follows, including how we leverage it to improve our memory
model and learn more about loop fusion. Next, we discuss how negative memory effects can result
when loops are fused. How the memory subsystem interacts with the fused calculations’ data
access patterns to decrease performance is explained. We then describe how to combine other
tuning techniques with fusion to mitigate these negative memory effects. Finally, we present the
search complexity of finding the optimal amount of fusion. In our discussion of the complexity
we present work that shows ways to limit the testing of fused routines without reducing routine
performance.
In this chapter and the rest of the thesis, tests were run on various computers. These computers
and their associated architectures are listed in Table 3.1. Throughout the thesis they are
referred to by the description in the name column of Table 3.1. Also, various mathematical kernels
were used to perform tests. Table 3.2 contains the operations performed in these kernels.
Throughout the thesis some of these kernels are presented in more detail to illustrate points.

Name        Processor      Speed    Memory  Bus Speed  L1     L2      L3    TLB
Hemisphere  Intel Xeon     2.4 GHz  2 GB    400 MHz    8 KB   512 KB  N/A   64
Quadfather  Intel Core 2   2.4 GHz  4 GB    1066 MHz   32 KB  4 MB    N/A   256
Work        Intel Core 2   2.4 GHz  2 GB    1333 MHz   32 KB  4 MB    N/A   256
Opteron     AMD Opteron    2.6 GHz  3 GB    1000 MHz   64 KB  1 MB    N/A   40/512
PowerPC     PowerPC 970FX  2.3 GHz  8 GB    1150 MHz   32 KB  512 KB  N/A   1024
Nahalem     Intel Core i7  2.8 GHz  4 GB    1333 MHz   32 KB  256 KB  8 MB  64/512
Table 3.1: Specifications of the test machines. For TLB, we list the number of entries.
3.1 Loop Fusion Applications to Linear Algebra
To improve the speed of memory bound linear algebra routines, many authors have used loop
fusion on a diverse set of calculations [16,18,55,82,83,101]. The work has ranged from optimizing
a single routine to general purpose tools within compiler frameworks. The result is significant
reductions in runtime for the routines. In this section, we present a survey of tools and approaches
used to create loop fusion routines. The techniques are broken into those focused on extracting the
most performance possible from a single routine or a few routines and general approaches designed
to handle a large number of routines well.
3.1.1 Optimizing Specific Routines
Many frequently used sequences of calculations can be combined using loop fusion. For these
calculations, such as the four fused routines recently added to the BLAS [15], spending a large
amount of time optimizing them is worthwhile. Two of these routines, GEMVT and GEMVER, are
used by Howell et al. [55] to speed up the execution of the Golub and Kahan algorithm to perform
Householder Bidiagonalization [48] by up to 25% as compared to the LAPACK implementation
GEBRD. Since Householder Bidiagonalization is half the cost of computing a singular value
decomposition (SVD) [48] in LAPACK and SVD is a frequently used algorithm, fast implementations
of GEMVT and GEMVER are important enough to warrant the time investment.
Also, Premkumar shows how to efficiently parallelize GEMVT [83].

Kernel    Operation
AXPYDOT   $z \leftarrow w - \alpha v$
          $r \leftarrow z^T u$
AATX      $y \leftarrow A A^T x$
BiCGK     $q \leftarrow A p$
          $s \leftarrow A^T r$
DGEMV     $z \leftarrow \alpha A x + \beta y$
DGEMVT    $x \leftarrow \beta A^T y + z$
          $w \leftarrow \alpha A x$
GEMVER    $B \leftarrow A + u_1 v_1^T + u_2 v_2^T$
          $x \leftarrow \beta B^T y + z$
          $w \leftarrow \alpha B x$
GESUMMV   $y \leftarrow \alpha A x + \beta B x$
GRAMM     $r \leftarrow q^T v$
          $v \leftarrow v - r q$
HOUSE     $A \leftarrow A - \alpha v (v^T A)$
VADD      $x \leftarrow w + y + z$
WAXPBY    $w \leftarrow \alpha x + \beta y$
Table 3.2: Kernel specifications.
Vuduc et al. present an efficient sparse implementation of $b = A^TAx$ where they combine
loop fusion and other performance tuning techniques to produce fast performing routines [101].
This calculation appears in linear programming and linear least squares [68,103]. The same fusion
strategy can be used to compute $b = AA^Tx$, which occurs when using Kleinberg’s algorithm to
find authorities in hyperlinked environments [69]. These two fused kernels, along with $b = A^kx$ and the
fusion of two matrix-vector multiplies, are implemented in the Optimized Sparse Kernel Interface
(OSKI) [98, 102]. OSKI is an auto-tuning package that makes tuning decisions at runtime based
on user input parameters. The calculation $b = A^kx$ appears in s-step methods [25] while the two
matrix-vector multiplies occur in the bi-conjugate gradient method [8]. OSKI increases the speed
of parallel sparse block matrix-vector multiplies within the Epetra package of Trilinos [64]. It is
also included within the Portable Extensible Toolkit for Scientific Computation (PETSc) [106].
3.1.2 General Tools for Fusion
Fusing and tuning one or a few routines at a time is costly. Constantly changing computing
architectures makes the challenge greater. Updating routines for new platforms is costly and only
worthwhile for frequently used routines. Specialized compilers reduce the time and effort to fuse
other routines.
PLUTO, which performs automatic parallelization and locality optimization for multicores
using the polyhedral model, also performs loop fusion [16, 18]. PLUTO produces speedups of up
to 100% over vendor-tuned BLAS implementations and vendor compilers. Runtime options allow
a user to have PLUTO fuse as many loops as possible or use heuristics to determine the optimal
amount of fusion to apply. Unfortunately, the heuristics are not published so there is no way to
measure their usefulness other than a posteriori. Additionally, results from their website show
that their heuristics do not always produce optimal routines [17].
An updated approach using other polyhedral tools along with PLUTO adds iterative search
[82]. The result is speedups as compared to using the Intel C Compiler (icc) [58] with auto paral-
lelization on multiple platforms. This approach is limited by tiling only being performed for the
L1 cache regardless of profitability. Also, because it uses directed and guided search, it may find
local rather than global maxima [72].
A similar approach taken by Qasem works on Fortran codes [85]. He uses a model that
accounts for all levels of cache to guide empirical search [86]. Qasem also targets a general purpose
compiler and uses direct search, which is more scalable than exhaustive search. However, like
PLUTO his search can end up in local extrema missing the globally optimal solution.
In Section 3.3, we describe our approach of using a domain specific language and compiler
to solve the loop fusion problem. Starting in Chapter 4, the remainder of this thesis focuses on
how we model loop fusion to reduce the search space of our compiler. Our approach differs from
the previous compilers in the following ways: we enumerate all possible loop fusion combinations
and then let our model determine a set of the best routines. These routines are then empirically
tested to determine the highest performing. Using this approach we are able to find the globally
optimal routine. Enumerating all possible loop fusion combinations differs from both the Qasem
et al. and PLUTO approaches of using models or heuristics to guide and narrow the search space.
By searching through the whole space, we cannot get stuck in local extrema. Additionally, our
model and code generation include parallel capabilities missing from Qasem’s approach.
3.2 Fusion Reduces Memory Traffic
When properly applied, loop fusion reduces the amount of data that must move through
the memory hierarchy to perform a calculation. For memory bound routines, loop fusion can
increase performance nearly proportionately to the reduction in memory traffic. In this section, we
demonstrate how to fuse independent and dependent calculations and show how each improves the
performance of a routine by reducing memory traffic.
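A rough traffic count illustrates the claim. The count assumes, for this sketch only, that the matrix is too large to stay in cache between loop nests and that each vector is moved once. For the pair of matrix-vector products $r = Ax$ and $s = A^Ty$ with an n-by-n matrix A, counting elements moved to or from memory gives:

```latex
\begin{align*}
T_{\text{unfused}} &= \underbrace{2n^2}_{A \text{ read twice}} + \underbrace{4n}_{x,\; y,\; r,\; s}, \\
T_{\text{fused}}   &= \underbrace{n^2}_{A \text{ read once}} + \underbrace{4n}_{x,\; y,\; r,\; s}, \\
\frac{T_{\text{unfused}}}{T_{\text{fused}}} &= \frac{2n^2 + 4n}{n^2 + 4n} = \frac{2n + 4}{n + 4} \longrightarrow 2 \quad (n \to \infty).
\end{align*}
```

Under these assumptions a purely memory-bound routine could run at most about twice as fast after fusion, which is in line with the approximately 90% improvement reported for this kernel in Section 3.2.1.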
3.2.1 Independent Calculations
Independent calculations occur when the result of one calculation does not affect the result of
the other. When independent calculations occur in close proximity to each other in an algorithm,
they can be combined using loop fusion. Fusing independent calculations is the simplest case
because data dependencies do not need to be observed. The two matrix-vector multiplies $Ax = r$
and $A^Ty = s$ that occur in the bi-conjugate gradient method [8] are one example.
In Figure 3.1 we show how to fuse the calculations $Ax = r$ and $A^Ty = s$. By accessing each element of A
twice in succession, the value can remain within a register during the calculation instead of being
read from a cache or memory. The resulting one half reduction in memory reads on the Quadfather
machine for large matrix orders is shown by the decrease in L2 cache misses in Figure 3.2. Fusion
also reduces the number of executed loads and the number of misses to the
L1 cache and the TLB. L1 cache misses, also shown in Figure 3.2, are reduced by one half for small
matrices and one quarter for larger ones. All values are normalized to be per flop. TLB misses, which
are not shown, are reduced by one half for large matrix orders. Additionally, for all matrix orders, there is a reduction
of up to one quarter in the number of load instructions executed. The corresponding performance
increase of approximately 90% is shown in Figure 3.3.
Unfused:                          Fused:
for i = 1:n                       for i = 1:n
  for j = 1:n                       for j = 1:n
    r(j) += A(j, i) ∗ x(i)            r(j) += A(j, i) ∗ x(i)
for i = 1:n                           s(i) += A(j, i) ∗ y(j)
  for j = 1:n
    s(i) += A(j, i) ∗ y(j)
Figure 3.1: Fusing $Ax = r$, $A^Ty = s$
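The loop structure of Figure 3.1 can be written as a small runnable sketch. It uses plain Python lists, so it illustrates only the loop structure and the single load of each matrix element, not the memory layout or the performance of a compiled kernel. The function names are illustrative.

```python
def unfused_matvec_pair(A, x, y):
    """Compute r = A*x and s = A^T*y with two loop nests (A is traversed twice)."""
    n = len(A)
    r = [0.0] * n
    s = [0.0] * n
    for i in range(n):
        for j in range(n):
            r[j] += A[j][i] * x[i]
    for i in range(n):
        for j in range(n):
            s[i] += A[j][i] * y[j]
    return r, s

def fused_matvec_pair(A, x, y):
    """Compute r = A*x and s = A^T*y in one loop nest (A is traversed once)."""
    n = len(A)
    r = [0.0] * n
    s = [0.0] * n
    for i in range(n):
        for j in range(n):
            a = A[j][i]          # loaded once, used by both updates
            r[j] += a * x[i]
            s[i] += a * y[j]
    return r, s

A = [[1.0, 2.0], [3.0, 4.0]]
print(fused_matvec_pair(A, [1.0, 1.0], [1.0, 2.0]))  # ([3.0, 7.0], [7.0, 10.0])
```

Because the two calculations are independent, both statements can share the innermost loop body and no ordering constraint between them has to be respected.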
3.2.2 Dependent Calculations
Dependent calculations occur when part of one calculation must be complete before the
other calculation can be started. An example of one such calculation is $b = AA^Tx$, which, as mentioned
previously, occurs when using Kleinberg’s algorithm to find authorities in hyperlinked environments
[69]. It is important to note that the same technique used to combine $b = AA^Tx$ is used to fuse
the computation $b = A^TAx$ [101].
[Figure 3.2: Memory performance of fused and unfused $Ax = r$, $A^Ty = s$. Panels (a) L1 and (b) L2 plot cache misses per flop against matrix size for the baseline, singleproc, and dualcore versions.]
[Figure 3.3: Performance of fused and unfused $Ax = r$, $A^Ty = s$. The plot shows throughput in MFlops against matrix size for the baseline, singleproc, and dualcore versions.]
Fusing loops of $b = AA^Tx$ requires accounting for the dependence between the two matrix-vector
multiplies. The matrix-vector product $A^Tx$ is computed as inner products of the rows of
$A^T$ with the vector x. Each inner product results in one element of the vector t. The matrix-vector
product $b = At$ is then computed via a linear combination of the columns of A with the
elements of t as coefficients. Thus, the second matrix-vector product cannot begin until at least
one element of the vector t has been computed. As each new element of t is computed, one more
column of A can be added to the linear combination. Because the inner product with that column
is computed independently of the linear combination, the column is retrieved from cache once for
each matrix-vector product.
An important consideration with dependent calculations is that they sometimes require data
structures to be aligned for good performance. For example, to efficiently compute a fused version
of $b = AA^Tx$ a column-major data structure is needed so that the memory accesses for each inner
product and linear combination are sequential rather than strided. A corollary is that to perform
$A^TAx$ efficiently a row-major data structure must be used.
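The difference between sequential and strided access can be made explicit by computing the distance in memory between consecutive inner-loop accesses. The sketch below assumes a dense n-by-n array stored contiguously with zero-based indices, and that the inner loop walks down one column as in the fused $b = AA^Tx$ kernel. The function names are illustrative.

```python
def linear_index(j, i, n, layout):
    """Offset of A(j, i) (row j, column i, zero-based) in an n-by-n array."""
    if layout == "column-major":
        return j + i * n
    if layout == "row-major":
        return j * n + i
    raise ValueError("unknown layout: " + layout)

def inner_loop_stride(n, layout):
    """Distance in elements between A(j, i) and A(j+1, i), the accesses made
    by consecutive iterations of an inner loop over j with i fixed."""
    return linear_index(1, 0, n, layout) - linear_index(0, 0, n, layout)

print(inner_loop_stride(1000, "column-major"))  # 1
print(inner_loop_stride(1000, "row-major"))     # 1000
```

With a column-major layout the inner loop touches adjacent elements, whereas the same loop over a row-major array jumps a full row between accesses. For $A^TAx$ the roles of rows and columns are exchanged, which is why the row-major layout is the sequential one in that case.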
In Figure 3.4 we show how to fuse the calculation $AA^Tx = b$. In this case, only the outer
loops of the calculation are fused because of the dependence between the calculations performed in
the inner loops. The resulting performance improvement of approximately 60% on the Quadfather
machine for matrices larger than 1000 is shown in Figure 3.5. The improvements in memory and
cache reads are shown in Figure 3.6. Memory reads for the fused calculation are half those of the
unfused version. Reads from the L2 cache are the same for both the fused and unfused versions for matrices of order 3000
and larger. The increased L2 reads are the added cost of performing the fused dependent calculation when
compared to the fused independent calculation. The reason for the added cost is that these reads
occur while the memory bus is idle, as explained in the next section.
Unfused:                          Fused:
for i = 1:n                       for i = 1:n
  for j = 1:n                       t = 0.0
    t(i) += A(j, i) ∗ x(j)          for j = 1:n
for i = 1:n                           t += A(j, i) ∗ x(j)
  for j = 1:n                       for j = 1:n
    b(j) += A(j, i) ∗ t(i)            b(j) += A(j, i) ∗ t
Figure 3.4: Fusing $b = AA^Tx$
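The pseudocode of Figure 3.4 translates directly into a runnable sketch. As with any list-based Python version, it shows only the loop structure: the outer loops are fused, the inner loops stay separate because the second needs the finished inner product, and the intermediate vector t shrinks to a scalar. The function names are illustrative.

```python
def unfused_aatx(A, x):
    """Compute b = A * (A^T * x) with a full intermediate vector t."""
    n = len(A)
    t = [0.0] * n
    b = [0.0] * n
    for i in range(n):
        for j in range(n):
            t[i] += A[j][i] * x[j]      # t = A^T x, one inner product per column
    for i in range(n):
        for j in range(n):
            b[j] += A[j][i] * t[i]      # b = A t, linear combination of columns
    return b

def fused_aatx(A, x):
    """Compute b = A * (A^T * x) with fused outer loops and a scalar t."""
    n = len(A)
    b = [0.0] * n
    for i in range(n):
        t = 0.0
        for j in range(n):
            t += A[j][i] * x[j]         # inner product with column i
        for j in range(n):
            b[j] += A[j][i] * t         # column i is reused while still in cache
    return b

A = [[1.0, 2.0], [3.0, 4.0]]
print(fused_aatx(A, [1.0, 1.0]))  # [16.0, 36.0]
```

Each column of A is traversed twice in quick succession in the fused version, so the second traversal can be served from cache when a column fits there.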
[Figure 3.5: Performance of fused and unfused $AA^Tx = b$. The plot shows throughput in MFlops against matrix size for the Baseline and Composed versions.]
[Figure 3.6: Memory performance of fused and unfused $AA^Tx = b$. Panels (a) L1 and (b) L2 plot cache misses per flop against matrix size.]
3.3 Build to Order (BTO) Compiler
The BTO compiler [10, 92] is a system that takes in a subset of annotated MATLAB and
produces optimized kernels in C. Its primary goal is to create memory-efficient linear algebra
kernels for shared memory computers by reducing data traffic through the memory hierarchy. To
limit memory traffic, the compiler uses two forms of loop fusion. Data partitioning enables two
additional features: cache blocking, which can further reduce data movement, and shared memory
parallel codes [11]. BTO ensures the creation of efficient routines by exploring the entire search
space of potentially profitable parallelization and optimization decisions. A secondary goal of the
project is ease of use. Ease of use is accomplished by automating the creation of efficient linear
algebra routines from an accessible high level input.
In this section, we first describe the BTO compiler at a high level. Components relevant to
the analytic memory model presented in Chapters 4, 5 and 6 are discussed more thoroughly than
other components. We also discuss how we use BTO to aid in the testing and development of the
memory model and increase our knowledge of loop fusion.
3.3.1 Functionality of BTO
The BTO compiler works in phases. In the first phase, it parses the input MATLAB and
generates a data flow graph of that input. Next, it performs the refinement phase where it generates
loops and scalars from the high level input of matrix and vector operations by performing graph
lowering operations. The first step during the loop creation and lowering is a data partitioning
algorithm that labels some loops to be used to create shared memory parallel code or cache blocks
during the code generation phase of the compiler. Data partitioning is applied to a single operation in
a calculation and then propagated to other data structures that share a dependency with it. Once
data partitioning decisions are complete, the compiler then performs graph lowering to generate
loops and scalars to carry out the calculations expressed by the MATLAB input.
Next, the optimization phase applies loop fusion to the input routine. First, it enumerates
all potentially profitable combinations of two forms of loop fusion, interleaving and pipelining, that
can be applied to the input routine. A fusion opportunity is potentially profitable when the loops
share at least one data structure. Interleaving involves fusing loops of two independent operations.
In this case, any data accessed by both operations are read once after the routines are fused.
Pipelining fuses two operations where one operation consumes the result of another. Pipelining
reduces the number of data traversals and removes the need for an intermediate array. Examples
of these operations are found in Section 3.2 where independent fusion represents interleaving and
dependent fusion is an example of pipelining. Routines fused using pipelining could benefit from
software pipelining as described in Section 3.5.2; however, this optimization is not currently included
in BTO.
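The distinction between the two forms of fusion can be sketched with a toy classification of operation pairs. The representation below (each operation as a dictionary of the data it reads and writes) and the function names are assumptions made for illustration. BTO itself works on a data flow graph, enumerates combinations of fusions and not only pairs, and must also check that the loops being fused are compatible, none of which is modeled here.

```python
from itertools import combinations

def classify_fusion(op_a, op_b):
    """Classify a pair of operations, each a dict with 'reads' and 'writes' sets.

    Returns 'pipelining' if one operation consumes the other's result,
    'interleaving' if they are independent but share input data, and None
    if they share no data (fusion is then not potentially profitable).
    """
    if op_a["writes"] & op_b["reads"] or op_b["writes"] & op_a["reads"]:
        return "pipelining"
    if op_a["reads"] & op_b["reads"]:
        return "interleaving"
    return None

def fusion_opportunities(ops):
    """Return (name_a, name_b, kind) for every potentially profitable pair."""
    found = []
    for (name_a, op_a), (name_b, op_b) in combinations(ops.items(), 2):
        kind = classify_fusion(op_a, op_b)
        if kind is not None:
            found.append((name_a, name_b, kind))
    return found

# The BiCGK and AATX kernels of Table 3.2 in this toy representation.
bicgk = {"q = A*p":  {"reads": {"A", "p"}, "writes": {"q"}},
         "s = A'*r": {"reads": {"A", "r"}, "writes": {"s"}}}
aatx = {"t = A'*x": {"reads": {"A", "x"}, "writes": {"t"}},
        "y = A*t":  {"reads": {"A", "t"}, "writes": {"y"}}}
print(fusion_opportunities(bicgk))
print(fusion_opportunities(aatx))
```

In this sketch the two BiCGK products are classified as an interleaving opportunity because they only share the input A, while the two AATX products form a pipelining opportunity because the second consumes t.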
The optimization phase produces multiple versions of the input routine. Each version differs
from all others in at least one way. The aspects that vary between routines are the amount of fusion,
parallelization, number of cache blocks and size of the blocks. These versions are then passed to
the evaluation phase.
In the evaluation phase, all versions of a routine are tested using a two step process. First
the analytic memory model is run on all versions of a routine, producing a sorted list of predicted
runtimes. Then the best routines are empirically tested, and the fastest is generated as C code.
The interaction between the two steps in the evaluation phase is user-controllable through runtime
options explained in Chapter 7, which also presents how the options influence compiler runtime and
the performance of the produced routine. After the evaluation phase, the code generation phase
outputs the best performing version from empirical testing as C code. For shared memory parallel
codes, Pthreads [76] is used to create separate threads that can be run on different processors in
parallel.
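The two-step evaluation can be summarized in a few lines. This is a sketch under assumptions and not BTO's implementation: `predict_runtime` stands in for the analytic memory model, `measure_runtime` for empirical timing, and `num_to_test` for the user-controllable options described in Chapter 7. All three names are hypothetical.

```python
def select_version(versions, predict_runtime, measure_runtime, num_to_test):
    """Rank all versions with a model, then time only the most promising ones."""
    ranked = sorted(versions, key=predict_runtime)   # step 1: model on every version
    candidates = ranked[:num_to_test]                # keep the predicted best
    return min(candidates, key=measure_runtime)      # step 2: empirical test

# Made-up predicted and measured runtimes for four versions of a routine.
predicted = {"v0": 4.0, "v1": 1.0, "v2": 1.2, "v3": 3.0}
measured = {"v0": 0.5, "v1": 1.1, "v2": 0.9, "v3": 3.0}

print(select_version(list(predicted), predicted.get, measured.get, num_to_test=2))  # v2
print(select_version(list(predicted), predicted.get, measured.get, num_to_test=4))  # v0
```

The made-up numbers show the trade-off behind the runtime options: testing only the two versions the model ranks highest is cheaper but returns v2, while testing all four finds v0, which the model in this example mispredicted.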
3.3.2 Using BTO to Test and Improve the Memory Model
To extensively test the memory model, we leverage the BTO compiler’s ability to empirically
test all versions of a routine it produces. Using BTO, we can compare memory predictions to
actual memory traffic measured by hardware performance counters across multiple versions of a
wide range of routines. Runtime predictions produced by the model are easily compared to actual
runtimes of the routines. The large number of versions BTO produces helps ensure that the model
can handle diverse input and assists in discovering bugs in the model. When earlier versions of
the model were inaccurate, the feedback from these tests showed us ways to improve it. We can
evaluate the model’s memory and runtime prediction accuracy across a wide number of routines as
presented in the next four chapters of this thesis.
3.3.3 Leveraging BTO to Learn More About Loop Fusion
The BTO compiler’s ability to enumerate all potentially profitable loop fusion optimizations
allows us to run experiments that increase our understanding of loop fusion. We then add this new
knowledge to the model to increase its accuracy. For example, we used the compiler’s enumeration
capabilities to fuse an arbitrary number of matrix-vector multiplies together in a controlled exper-
iment, which we present in Section 3.4. This experiment demonstrates that fusing too many loops
together negatively impacts performance. The results of the experiment led us to add registers to
the memory model [60], as explained in detail in Section 4.3.2.
Another use of BTO’s ability to enumerate all fusion possibilities is that we can test how
the fusion of certain operations impacts performance. We use BTO to show that fusing vector
operations with matrix operations when the vector is accessed only once never significantly improves
routine performance [62]. The results of these experiments shown in Section 3.6 will allow us to
decrease the size of the search space enumerated by the compiler, reducing compile times.
3.4 Negative Memory Effects of Fusion
Loop fusion does not always result in increased performance. Decreases in performance occur
when fusion creates code that requires more data to fit in a memory structure of the computer than
the structure can hold [60,61]. In other cases, loop fusion results in memory access patterns that are
sub-optimal [61], for example, non-consecutive reads. In this section, we explore how loop fusion
can negatively affect cache, registers and memory access patterns.
3.4.1 Performance Degradation
In order to observe effects that occur when many loops are fused, we look at what happens
when many matrix-vector multiplies are combined. For example, we define a routine DGEMV2
that multiplies vectors u0 and u1 in turn by a matrix A as shown in Figure 3.7. The annotated
MATLAB provided in this figure serves as input to the BTO compiler which generates all possible
loop fusion combinations for the pair of matrix-vector multiplies. Figure 3.8 shows three of these
possibilities ranging from the least complex to the most complex: no loop fusion, only outer loops
fused, and all loops fused.
DGEMV2
in
u0 : vector, u1 : vector,
A : row matrix
out
v0 : vector, v1 : vector
{
v0 = A * u0
v1 = A * u1
}
Figure 3.7: DGEMV2
3.4.1.1 Cache Effects
If we fuse the outer loops of an arbitrary number of matrix-vector multiplies on the Opteron
system, performance degrades for large matrices as shown in Figure 3.9. The dropoffs occur
due to increased L2 cache misses as shown in Figure 3.10. The u_x vectors
are accessed with each iteration of the outer loop. For large matrix orders, however, the combined
size of the vectors is larger than the cache, meaning that they must be read from memory. Increasing
the number of vectors accessed per iteration of the outer loop causes these misses to begin
at smaller matrix orders.
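A rough worked example shows why more vectors move the dropoff to smaller matrix orders. The helper max_order_in_cache below is a hypothetical name, and the cache size passed to it is an assumed parameter rather than a measured property of the Opteron system.

```c
#include <stddef.h>

/* Largest matrix order n for which nvecs double-precision vectors of
   length n still fit together in a cache of cache_bytes bytes.
   Requires nvecs > 0. Ignores the space taken by the matrix row and
   the result elements, so it is an optimistic upper bound. */
size_t max_order_in_cache(size_t cache_bytes, size_t nvecs)
{
    return cache_bytes / (nvecs * sizeof(double));
}
```

Assuming a hypothetical 1 MiB cache and 8-byte doubles, the bound is 32768 for nvecs = 4, 16384 for nvecs = 8 and 10922 for nvecs = 12. The bound shrinks in proportion to the number of vectors, and in practice misses would be expected to begin earlier because the streamed matrix data competes for the same cache.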
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    v0[i] += A[i][j] * u0[j]
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    v1[i] += A[i][j] * u1[j]
(a) No Fusion
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    v0[i] += A[i][j] * u0[j]
  for (j = 0; j < n; j++)
    v1[i] += A[i][j] * u1[j]
(b) All Outer Loops Fused
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    v0[i] += A[i][j] * u0[j]
    v1[i] += A[i][j] * u1[j]
(c) All Loops Fused
Figure 3.8: Three possible loop fusion options
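The fused variants generalize from two multiplies to nvecs multiplies. The C sketch below is an illustration under stated assumptions: A is stored row major in a flat array, the nvecs input and output vectors are stored one after another in U and V, and the loop over k stands in for the nvecs statements that generated code would list explicitly. The function names are hypothetical.

```c
#include <stddef.h>

/* A is n x n, row major. Vector k of U (or V) starts at U + k*n.
   V must be zeroed by the caller. */

/* Only outer loops fused: one complete inner loop per vector. */
void gemv_outer_fused(size_t n, size_t nvecs, const double *A,
                      const double *U, double *V)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < nvecs; k++)
            for (size_t j = 0; j < n; j++)
                V[k * n + i] += A[i * n + j] * U[k * n + j];
}

/* All loops fused: each A[i][j] is loaded once and used for every vector. */
void gemv_fully_fused(size_t n, size_t nvecs, const double *A,
                      const double *U, double *V)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double a = A[i * n + j];
            for (size_t k = 0; k < nvecs; k++)
                V[k * n + i] += a * U[k * n + j];
        }
}
```

Both functions compute the same results; they differ only in how often row i of A and the input vectors are traversed.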
[Plot: mflops versus matrix order with all outer loops fused, for nvecs = 4, 8 and 12.]
Figure 3.9: The performance of fusing only outer loops for various nvecs.
[Plot: L2 misses per flop versus matrix order for nvecs = 8, comparing all outer loops fused with fully fused.]
Figure 3.10: The memory effects of fusing loops for nvecs = 8.
3.4.1.2 Register Effects
Figure 3.11 shows the effects of fusing all loops of one to eight matrix-vector multiplies on
the Opteron system for a fixed matrix size. The figure shows that fully fusing loops increases
performance from one multiply through four. For five or more matrix-vector multiplies, the speed
of the calculations decreases with each added operation. The slowdown is not a result of L1 cache,
L2 cache or TLB misses, because the fully fused routine has the same number of these misses as the
outer-loop-fused routine, or fewer, as shown in Figure 3.12. However, the assembly produced by icc
for these operations, shown in Figure 3.13, reveals that the falloff in speed is a result of an increased
number of mov instructions, each of which corresponds to an extra read from the L1 cache. The
mov instructions occur because there are not enough registers to store the results of the innermost
loop. When data accessed between successive iterations of the innermost loop cannot be kept
in registers, the result is referred to as register spill. To prevent such register spill, the amount of fusion
should be limited.
[Plot: mflops versus nvecs (1 to 8) at matrix order 6500, comparing all outer loops fused with fully fused.]
Figure 3.11: Performance of Fully Fusing
(a) L1 Cache
(b) L2 Cache
(c) TLB
Figure 3.12: Memory Effects of Fully Fusing
(a) nvecs = 4 (b) nvecs = 5 (c) nvecs = 6
Figure 3.13: Fully Fused Assembly
3.4.2 Suboptimal Memory Access Patterns
In the previous section of this chapter, we described how loop fusion can increase or decrease the
speed of calculations through its interactions with the memory subsystem. What is less apparent
is that fusing loops can improve performance without maximizing it. For example, in
the calculation of AA^T x = b in Figure 3.4, each matrix column A[i] is accessed twice. If this vector
is too large to fit in the level 1 cache, then the second read must come from the slower level 2 cache.
In addition, when the vector is read a second time from cache, the read is latency limited because
the access is non-consecutive: the sequence of reads jumps from the end of the vector back to
the beginning. The first read of the vector is consecutive because it follows the second read of the
previous vector. Finally, the algorithm does not read new data from memory while performing
At = b, which leaves the memory bus idle. As the movement of data over the memory bus is the
limiting factor in routine performance, performing the calculation in this manner is inefficient.
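The access pattern described above can be seen in a minimal C sketch of a fused b = AA^T x. The column-major layout and the function name aatx_fused are assumptions made for illustration; Figure 3.4 itself is not reproduced here.

```c
#include <stddef.h>

/* Fused b = A * A^T * x for an n x n matrix stored column major,
   so column i occupies A[i*n] .. A[i*n + n - 1]. */
void aatx_fused(size_t n, const double *A, const double *x, double *b)
{
    for (size_t j = 0; j < n; j++)
        b[j] = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double *col = A + i * n;
        double t = 0.0;
        for (size_t j = 0; j < n; j++)  /* first pass over column i */
            t += col[j] * x[j];
        for (size_t j = 0; j < n; j++)  /* second pass jumps back to col[0] */
            b[j] += col[j] * t;
    }
}
```

The second inner loop re-reads the column that the first inner loop has just finished, which is the jump from the end of the vector back to its beginning; no new matrix data is requested from memory while it runs.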
3.5 Overcoming Bad Memory Effects of Fusion (Combining Fusion and
Other Optimizations)
By combining loop fusion with other optimizations or limiting loop fusion to amounts that
produce positive memory impacts, we can maximize the performance of calculations. In this section,
we use the tuning techniques of cache blocking and software pipelining to overcome some of the
negative effects of loop fusion. We also show that, by fusing the optimal amount, the performance
of a routine can be increased.
3.5.1 Cache Blocking
For many calculations where loop fusion is profitable, more data must fit within caches for
good performance to be achieved. In certain cases, the data do not fit within cache and performance
of the routine suffers. Examples of this effect are shown in Section 3.4.1.1. To reduce the amount
of data that must fit within cache for good performance, cache blocking can be used.
Figure 3.14 shows cache blocking applied to a fused implementation of Ax = r and A^T y = s
shown in 3.1. With cache blocking, only pieces of the vectors r and y are accessed by the innermost
loop. When the block size is chosen to be small enough that these pieces remain in cache between
iterations of the middle loop, performance gains result.
Figure 3.15 shows an example of performance gains attributable to cache blocking on the
Hemisphere system. A corresponding reduction in memory reads is shown in Figure 3.16. Both
figures show the calculation with the matrix divided into two blocks. Performance increases to
within 10% of what it was without cache blocking and memory reads are reduced by approximately
50%. The gap in performance can be attributed to the fact that cache blocking creates
high-latency non-consecutive memory reads, which can also disrupt the memory prefetcher. Also, for
smaller matrices, cache blocking produces a slower algorithm than fusing alone because it introduces
non-consecutive memory reads without reducing the number of memory reads.
blocksize = n/blocks
for k = 1:blocks
  for i = 1:n
    for j = (k − 1) ∗ blocksize + 1 : k ∗ blocksize
      r(j) += A(j, i) ∗ x(i)
      s(i) += A(j, i) ∗ y(j)
Figure 3.14: Cache Blocking Ax = r,AT y = s
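A runnable C rendering of the blocked loop nest is sketched below. It assumes column-major storage, zero-based indices, a matrix order divisible by the number of blocks, and the hypothetical function name fused_blocked; the loop variable k selects which range of rows the innermost loop covers.

```c
#include <assert.h>
#include <stddef.h>

/* Fused r = A*x and s = A^T*y with the rows of A split into blocks.
   A is n x n, column major: element A(j,i) is A[j + i*n]. */
void fused_blocked(size_t n, size_t blocks, const double *A,
                   const double *x, const double *y,
                   double *r, double *s)
{
    assert(blocks > 0 && n % blocks == 0);  /* simplifying assumption */
    size_t blocksize = n / blocks;
    for (size_t j = 0; j < n; j++) r[j] = 0.0;
    for (size_t i = 0; i < n; i++) s[i] = 0.0;

    for (size_t k = 0; k < blocks; k++)
        for (size_t i = 0; i < n; i++)
            /* only rows k*blocksize .. (k+1)*blocksize - 1 of r and y
               are touched here, so they can stay in cache across i */
            for (size_t j = k * blocksize; j < (k + 1) * blocksize; j++) {
                double a = A[j + i * n];
                r[j] += a * x[i];
                s[i] += a * y[j];
            }
}
```

With blocks = 1 the routine reduces to the unblocked fused calculation, which gives a simple way to check that blocking does not change the results.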
3.5.2 Software Pipelining
During the execution of the loop that computes At = b for AA^T x = b, the memory bus is
idle, meaning that computing resources are not efficiently used. To keep data moving through the
[Plot: throughput in MFlops versus matrix size for Ax = r, A^T y = s; series: baseline, composed, cache block.]
Figure 3.15: Performance with Cache Blocking
[Plot: L2 cache misses per flop versus matrix size for Ax = r, A^T y = s; series: baseline, composed, cache block.]
Figure 3.16: Memory Performance of Cache Blocking
bus, the computation can be rearranged using software pipelining. To do so, the (n − 1)st
iteration of the second inner loop is interleaved with the nth iteration of the first inner loop. The first
iteration of A^T x = t and the last iteration of At = b are performed in separate loops to complete
the computation as shown in Figure 3.17.
for j = 1:n
  t(1) += A(j, 1) ∗ x(j)
for i = 2:n
  for j = 1:n
    t(i) += A(j, i) ∗ x(j)
    b(j) += A(j, i − 1) ∗ t(i − 1)
for j = 1:n
  b(j) += A(j, n) ∗ t(n)
Figure 3.17: Software Pipelining b = AA^T x
Software pipelining results in a 30% speedup for matrix orders larger than 2000 on the
Quadfather machine as shown in Figure 3.18. However, software pipelining does not affect the
number of cache misses that occur during routine execution. The number of reads from each
memory structure is the same as for the fused calculation without pipelining. The
speedup is a result of the memory bus never being idle.
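The pipelined loop structure of Figure 3.17 translates to C as sketched below, assuming column-major storage, zero-based indices and the hypothetical function name aatx_pipelined. The caller supplies the work vector t.

```c
#include <stddef.h>

/* Software-pipelined b = A * A^T * x for an n x n column-major matrix,
   where element A(j,i) is A[j + i*n]. t must hold n doubles. */
void aatx_pipelined(size_t n, const double *A, const double *x,
                    double *t, double *b)
{
    if (n == 0)
        return;
    for (size_t j = 0; j < n; j++) {
        t[j] = 0.0;
        b[j] = 0.0;
    }
    for (size_t j = 0; j < n; j++)      /* prologue: t for the first column */
        t[0] += A[j] * x[j];

    for (size_t i = 1; i < n; i++)      /* steady state */
        for (size_t j = 0; j < n; j++) {
            t[i] += A[j + i * n] * x[j];            /* reads new column i */
            b[j] += A[j + (i - 1) * n] * t[i - 1];  /* reuses column i-1 */
        }

    for (size_t j = 0; j < n; j++)      /* epilogue: update for the last column */
        b[j] += A[j + (n - 1) * n] * t[n - 1];
}
```

In the steady-state loop every iteration requests a new element of column i from memory while the update with column i − 1 is computed, so new matrix data is requested throughout the loop.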
3.5.3 Not Fusing Too Much
Less outer loop fusion can increase performance by allowing more data to remain in cache
and registers between loop iterations. In this section, we present examples of when reducing fusion
allows data to remain in a cache or registers. In the case of register spill, reducing fusion is the best
way to increase performance. For caches, though, other optimizations, such as tiling, are often a
better choice.
Earlier we showed that fusing all the inner loops of six matrix-vector multiplies decreases
performance when compared to fusing just the outer loops due to register spill. However, if we fuse
the innermost loops in groups of three, then performance is better than with either all outer loops
fused or all loops fused. When the loops are fused in groups of three there is no register spill, and some of the benefits of fusing
[Plot: throughput in MFlops versus matrix size (0 to 2 x 10^4) for AA^T x = b; series: baseline, composed, software pipelined.]
Figure 3.18: Performance with Software Pipelining
inner loops are realized.
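Fusing in groups of three can be sketched in C as follows. The layout (row-major A, the six vectors stored one after another) and the function name gemv6_groups_of_three are assumptions for illustration, not the code BTO generated for the experiment.

```c
#include <stddef.h>

/* Six multiplies v_k = A * u_k with all outer loops fused and the
   inner loops fused in two groups of three. A is n x n, row major.
   Vector k of U (or V) starts at U + k*n. */
void gemv6_groups_of_three(size_t n, const double *A,
                           const double *U, double *V)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t g = 0; g < 6; g += 3) {
            const double *u0 = U + (g + 0) * n;
            const double *u1 = U + (g + 1) * n;
            const double *u2 = U + (g + 2) * n;
            /* only three accumulators are live in this inner loop */
            double s0 = 0.0, s1 = 0.0, s2 = 0.0;
            for (size_t j = 0; j < n; j++) {
                double a = A[i * n + j];
                s0 += a * u0[j];
                s1 += a * u1[j];
                s2 += a * u2[j];
            }
            V[(g + 0) * n + i] = s0;
            V[(g + 1) * n + i] = s1;
            V[(g + 2) * n + i] = s2;
        }
    }
}
```

Each row of A is traversed twice rather than six times, while the number of values that must stay in registers across inner-loop iterations is half that of the fully fused loop.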
When calculations become large or many are fused together, not fusing all the outer loops
together can also increase the speed of a calculation. For example, when all outer loops are fused
in Figure 3.9, the performance of the calculations begins to decrease at large matrix orders as level
2 cache misses start occurring. The more loops that are fused, the smaller the matrices for which
these effects occur. The costs and benefits of not fusing, however, must be weighed against the
benefits of combining another optimization such as cache blocking with loop fusion.
3.6 Search Complexity of Fusion Decisions
Finding the optimal amount of loop fusion is an NP-complete problem [26]. Therefore, any
consideration of loop fusion with other optimizations is also NP-complete, and the practicality of
testing all possible loop fusion decisions is limited to a small number of loops. To create efficient
code in a feasible amount of time, models [86] and/or heuristics [16] can be used to reduce the size
of the search space. The risk in these cases is that local extrema can result in sub-optimal results.
Our approach to modeling fusion takes a different tack. We enumerate all possible options
using the BTO compiler and then test only those that are predicted by our model to have the
potential to be among the best routines. We explain how the model and compiler interact to
efficiently produce high performing routines in Chapter 7.
In this section, we present an experiment that used the BTO compiler to show that certain
fusion decisions never positively impact performance [62]. We plan to use these results to reduce
our fusion search, which will reduce search times.
3.6.1 Similar Routine Performance
The enumeration of routines to be considered and the testing of those routines using hybrid
search dominate the runtime of the BTO compiler. For many routines, multiple versions have near
identical runtimes as well as model predictions of small runtime differences. For example, for the
GESUMMV calculation shown in Figure 3.19, the compiler enumerates twelve possible versions
for i = 1:n //Loop 1
  for j = 1:n //Loop 2
    t1(i) = t1(i) + A(i, j) ∗ x(j)
for i = 1:n //Loop 3
  t1(i) = α ∗ t1(i)
for i = 1:n //Loop 4
  for j = 1:n //Loop 5
    t2(i) = t2(i) + B(i, j) ∗ x(j)
for i = 1:n //Loop 6
  t2(i) = β ∗ t2(i)
for i = 1:n //Loop 7
  y(i) = t1(i) + t2(i)
Figure 3.19: Unfused GESUMMV
with different amounts of fusion. The model produces predictions for the Core 2 system that all
differ by less than 1%, and actual performance differences are less than 3% for the best and worst
of these versions. Also, as shown in Figure 3.20, when we graph the actual and predicted runtimes
for the GEMVER calculation for the Work system, we notice that many of the predicted and actual
runtimes of routines are nearly identical.
A closer examination of these routines reveals that, for many pairs of routines with near
identical performance and predictions, the only difference between the two routines
is the fusion of a vector operation with a matrix operation. If it were always the case
that fusing a vector operation with matrix operations does not significantly improve performance,
then we would not need to enumerate and test these fusions in the BTO compiler.
3.6.2 Operations That Significantly Impact Performance
In this section, we show that fusing vector operations with matrix operations does not sig-
nificantly impact performance. Throughout this section, fusion of a vector operation refers to the
fusion of loops where each loop accesses the same vector. An example calculation with three sets of
loops that contain vector operations and can be fused is the GESUMMV calculation shown with all
loops unfused in Figure 3.19. Loops 1 and 2 can be fused with loops 4 and 5 to reduce the number
of accesses to the vector x, where each element is accessed n times. Loops 3 and 6 can both be
Figure 3.20: All Versions of GEMVER Actual and Predicted Performance
fused with loop 7 reducing the number of accesses to each element of the t1 and t2 vectors by one.
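For comparison with the unfused loops of Figure 3.19, the C sketch below applies all three fusions at once. It assumes row-major storage and the hypothetical function name gesummv_fused, and it corresponds to only one of the versions the compiler enumerates.

```c
#include <stddef.h>

/* y = alpha*A*x + beta*B*x with every fusable set of loops fused.
   A and B are n x n, row major. */
void gesummv_fused(size_t n, double alpha, double beta,
                   const double *A, const double *B,
                   const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {       /* loops 1, 3, 4, 6 and 7 */
        double t1 = 0.0, t2 = 0.0;
        for (size_t j = 0; j < n; j++) {   /* loops 2 and 5: x[j] read once */
            t1 += A[i * n + j] * x[j];
            t2 += B[i * n + j] * x[j];
        }
        /* scaling and summation applied while t1 and t2 are still in registers */
        y[i] = alpha * t1 + beta * t2;
    }
}
```

In this version the temporaries t1 and t2 shrink from vectors to scalars, which is the saving of one access per element described above, whereas fusing the two inner loops halves the traversals of x.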
To determine whether fusing vector operations with matrix operations significantly impacts
performance, we ran a series of tests. The test results were then analyzed to determine the signifi-
cance of fusing vector operations. In this section, we first describe the environment, routines and
methodology used to perform tests. We then present the results of these experiments including a
statistical analysis of the results when needed.
3.6.2.1 Test Environment and Methodology
To determine the impact of fusing vector operations with matrix calculations, we ran the three
calculations GEMVER, GESUMMV and DGEMVT on the Work, Nahalem, PowerPC and Opteron
machines. All tests were compiled using gcc with the -O3 compiler flag turned on. The DGEMVT
and GEMVER kernels were chosen from the updated Basic Linear Algebra Subprograms [15] and
contain vector operations where the vector is accessed only once. The GESUMMV operation was
chosen because it contains vector operations where the vector is accessed both once and multiple
times. For DGEMVT, there are two sets of loops that can be fused that contain vector operations.
For the GEMVER and GESUMMV calculations, there are respectively four and three sets of loops
that contain vector operations and can be fused. All routines were chosen because they occur
in important numerical linear algebra routines such as Householder bidiagonalization [55].
Table 3.3: Search Space of Routines
Routine Name    With Vector Operations    Without Vector Operations
DGEMVT          8                         2
GEMVER          648                       162
GESUMMV         12                        3
Routines were run five times for each test of interest, and performance differences of less than
3% were considered too small to be significant. Any difference in the means of two routines
being compared that was greater than 3% was subjected to statistical analysis to determine whether the
difference was statistically significant at a 95% confidence level. A one-directional Student's paired
T-test [59] was used to compare the results.
Our null hypothesis for the T-test was that fusing vector operations never resulted in a
statistically significant performance increase. A one-directional T-test was therefore used because
we want to identify when fusion improves performance. If fusion negatively impacts the performance
of a routine in a statistically significant manner, then the hypothesis is accepted. The hypothesis
is considered true unless the p value from running the T-test is less than 0.05. A p value is the
probability of obtaining a result at least as extreme as the one observed, assuming the null
hypothesis is true. A p value below 0.05 indicates there is a statistically significant difference
between the means of the two samples we compared.
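The decision rule can be sketched in C for the five-run case. This is an illustration, not the analysis code used for the experiments: the function name is hypothetical, runtimes are paired run by run, and instead of computing a p value the sketch compares the paired t statistic with the one-sided critical value for four degrees of freedom at the 0.05 level, which is approximately 2.132. The comparison is done on squared quantities so that no square root is needed.

```c
#include <stddef.h>

/* One-sided critical value of Student's t for 4 degrees of freedom
   (five paired runs) at the 0.05 level. */
#define T_CRIT_DF4 2.132

/* Returns 1 when the fused routine is significantly faster, that is,
   when the mean of d_i = unfused[i] - fused[i] is positive and the
   paired statistic t = mean / sqrt(ss / ((n - 1) * n)) exceeds t_crit,
   where ss is the sum of squared deviations of d_i from their mean.
   Requires n >= 2. */
int fusion_significantly_faster(const double *unfused, const double *fused,
                                size_t n, double t_crit)
{
    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += unfused[i] - fused[i];
    mean /= (double)n;

    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (unfused[i] - fused[i]) - mean;
        ss += d * d;
    }
    if (mean <= 0.0)
        return 0;  /* fusion was not faster on average */
    /* t > t_crit  <=>  mean^2 * n * (n - 1) > t_crit^2 * ss */
    return mean * mean * (double)n * (double)(n - 1) > t_crit * t_crit * ss;
}
```

For example, runtimes of 11, 12, 13, 14 and 15 without fusion against 10 in every fused run give t of about 4.24, which exceeds 2.132, so the null hypothesis would be rejected for that pair.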
3.6.2.2 Results and Analysis
The middle column of Table 3.3 shows the number of ways to fuse each routine with all vector
operations considered. In all cases, we compared the performance of fusing and not fusing each
operation by keeping all other fusion decisions the same and only changing the loop of interest.
When a single pair of loops had a performance difference of more than 3%, we then used Student’s
paired T-test to determine if the differences were significant. The paired T-test was run for all pairs
of routines that were identical except that one routine fused a vector operation with a matrix operation
and the other did not.
For the DGEMVT and GEMVER calculations, the fusion of vector operations never sig-
nificantly impacted routine performance. On the Work, PowerPC and Nahalem machines, the
performance differences were always less than 2%. On the Opteron, differences were larger, and a
paired t-test analysis of the four pairs of interest resulted in p values of 0.14 to 0.68. Therefore,
there was not a statistically significant difference in runtime for any loop pair of interest at the 95%
confidence level.
For the GESUMMV calculation on the Work, Nahalem and PowerPC systems, differences
in performance were always less than 3% for all fusion possibilities. On the Opteron system,
however, performance differences were greater than 3%. For vectors accessed once, the differences
were not statistically significant, with p values of 0.27 and 0.087. For vectors accessed many times,
performance differences of over 30% resulted as shown in Figure 3.21. The top line shows the
performance of fusing all vectors accessed once, while the bottom line shows the performance of
fusing all vector operations. The resulting speedups were statistically significant with a p value of
0.04349 for a matrix order of 3000.
From these experiments, we conclude that the fusion of vector operations where each element
of the vector is accessed many times can significantly increase performance and must be considered
when fusing loops. We also conclude that vector operations that only access elements once do not
significantly impact performance when fused and can be removed from the search space.
3.6.3 Removing Vector Operations
For each routine we tested, there are two vector operations where all elements in the vector are
accessed once. Removing the fusion of these operations with each other and with matrix operations
from the space of considered routines results in a 75% reduction in the number of routines to be
tested as shown in Table 3.3. For most routines, being able to eliminate a vector operation from the
search space reduces the number of routines to be considered by approximately one half.
To perform this reduction within the compiler, we have developed two potential strategies.
For each, we need to determine the relative cost of various operations. One option is to always fuse
Figure 3.21: Runtime of fusing a vector operation for the GESUMMV calculation where the vector
is accessed n times on an Opteron system.
vector operations when enumerating routines and then unfuse them when modeling and empirically
testing their performance. Another option is to only fuse vector operations with matrix operations
when fusing the vector operation enables the fusion of matrix operations.
3.7 Summary
Loop fusion is a powerful optimization that can dramatically reduce the runtime of linear
algebra routines by reducing data movement. The impact of loop fusion on linear algebra routines
has led to work on highly optimizing important routines and on adding loop fusion capabilities to
optimizing compilers and domain-specific compilers. However, loop fusion can increase data movement
and slow down routine performance if not carefully applied. Therefore, when fusing loops it is
important to consider the underlying hardware and other optimizations being applied.
Chapter 4
Predicting Memory Traffic on a Single Processor
To predict the amount of data moving between each memory structure and the processor
for fused linear algebra calculations, we developed a memory model. The model takes in a tree
representation of calculations and uses reuse distances to determine from which memory structure
each data structure is read. A reuse distance is a measure of how many unique data elements are
accessed between accesses to a data element. The model also predicts how many times a data
structure is read from each memory structure.
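The definition of a reuse distance can be made concrete with a short C sketch that works on an explicit access trace. This only illustrates the definition: the model itself operates on a tree representation of the calculation rather than on traces, and the function name and the use of integer identifiers for data elements are assumptions.

```c
#include <stddef.h>

/* Reuse distance of the access at position pos in trace[0..len-1]:
   the number of distinct elements accessed since the previous access
   to the same element. Returns -1 if pos is out of range or if this
   is the first access to the element. */
long reuse_distance(const int *trace, size_t len, size_t pos)
{
    if (pos >= len)
        return -1;

    /* locate the previous access to the same element */
    size_t prev = pos;
    int found = 0;
    while (prev > 0) {
        prev--;
        if (trace[prev] == trace[pos]) {
            found = 1;
            break;
        }
    }
    if (!found)
        return -1;

    /* count distinct elements strictly between the two accesses */
    long distinct = 0;
    for (size_t i = prev + 1; i < pos; i++) {
        int seen = 0;
        for (size_t k = prev + 1; k < i; k++)
            if (trace[k] == trace[i]) {
                seen = 1;
                break;
            }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

For the trace 0, 1, 2, 1, 0, the final access to element 0 has a reuse distance of 2, because only elements 1 and 2 are touched in between. Roughly speaking, an element can still be in a fully associative, least-recently-used memory structure when its reuse distance is smaller than the number of elements the structure holds.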
In this chapter, we first present the tradeoffs the memory model uses and explain the reasons
for these tradeoffs. We then present the framework the model uses, including the data structures
that store information about calculations and memory structures of a machine. Next, we explain
how we predict misses for both caches and registers. Finally, we present how the model was
validated, including a comparison of misses measured from hardware performance counters to
model predictions.
4.1 Runtime and Accuracy Tradeoffs
To be useful within the BTO compiler, our model must be able to accurately identify the best
routines in a practical amount of time. To meet this constraint, we restrict our model to calculating
only the most distinguishing memory factors that impact the performance of fused linear algebra
routines. Assumptions are made about memory structure features, data access patterns and cache
state that decrease model runtime, but impact prediction accuracy. In this section, each tradeoff
is listed along with why we use it and the impact it has on model runtime and accuracy.
Consecutive data access patterns: Data movement through the memory hierarchy is
most efficient if data elements stored next to each other in memory are read one after another. A
consecutive access pattern is one that always reads neighboring data sequentially. For arrays, a
consecutive access pattern means that column major matrices are read a column at a time and row major
matrices are read a row at a time. Also, within the columns and one dimensional arrays, indices
are accessed sequentially, which means that the second element of an array is read immediately
after the first element. When possible, linear algebra codes are designed to use consecutive access
patterns. For example, the BTO compiler does so when it creates loops. Accounting for non-
consecutive accesses would require adding latency and cache line size into the model, which would
complicate it and only add small accuracy gains to predictions in a few cases. An example of where
accuracy gains would occur by not assuming consecutive access patterns is in the summing of a
row major and a column major matrix into the row major matrix where it is impossible to access
both data structures sequentially. Elements stored consecutively in one matrix are stored the size
of either a column or a row apart in the other matrix. Therefore, reading adjacent elements of one
matrix results in non-consecutive accesses to the other matrix.
Latency not modeled: When a consecutive access pattern is used, latency bound reads
are rare, occurring only when all iterations of an inner loop access the same element of an array
and then another loop is incremented changing the element of the array accessed. Also, out of
order processors can hide some or all of the effects of latency. Therefore, exactly how much a
non-consecutive memory access slows down program execution is impossible to know until runtime.
To estimate the cost of latency bound reads on out of order processors would require knowing
the size of the instruction window and whether the cache is blocking or not. When the cache is
non-blocking, accurate estimates depend on knowing how many misses can be outstanding while
the cache still services hits. Even with the ability to model latency bound reads, the benefit for
routines with consecutive access patterns only occurs rarely. Due to the limited number of routines
affected by latency and the added complexity of calculating hardware parameters, latency is left
out. The marginal benefit of including latency does not justify the significant computational cost.
TLBs and caches can be treated the same: The main differences between TLBs and
caches are that a TLB page addresses more data than a cache line holds, while a cache has many more
lines than a TLB has pages. The latency penalty of an unanticipated TLB miss is typically an order
of magnitude or more larger than an unanticipated cache miss. TLBs are usually fully associative
and caches are typically set associative. As shown later, these differences are among the least
significant factors in memory prediction and impact different versions of the same routine equally.
Therefore, treating caches and TLBs the same reduces model complexity and increases model speed
without significantly impacting the accuracy of our predictions.
Fully associative memory structures: For set associative caches and some TLBs, as-
suming memory structures are fully associative causes predictions to produce an abrupt increase in
cache misses while actual misses follow a curved slope. Section 4.4.2 contains graphs showing the
results of this assumption and a full explanation of why actual measured misses on the computer
follow a curved slope. Since TLBs and registers are usually fully associative structures, this
assumption does not affect them.
Line size is equal to word size: By assuming that all cache lines and TLB pages are
one data element, the model ignores data that are moved but not used. When reading data
consecutively, most but not all data moved through the memory hierarchy are used. For example,
all but the first and last cache lines and TLB pages are fully used, but the first and last lines and
pages may not be fully used if the calculation starts or stops reading data in the middle of a line
or page. However, the data movement not modeled in the first and last page is insignificant and
ignoring it in our model has a negligible impact on accuracy. To determine how much data stored
in a TLB page or cache line is moved but not used requires knowing where data will be stored
in memory at runtime, which is impossible to predict. If strided accesses were introduced into
the model then this assumption would need to be relaxed in order to account for the extra data
movement caused by only using part of a cache line.
Warm cache assumed: A warm cache already contains data relevant to the routine being
executed. In almost all programs, the calculations we model occur in the middle of a string of
computations or are called multiple times. In these cases, most of the data that can stay within
cache during a computation are there to begin with. Therefore, to mimic this behavior, we model
the calculation by assuming it was the last routine executed by the processor.
All TLBs and caches use Least Recently Used replacement: For TLBs, the cost of
a page fault is large enough and they occur infrequently enough that LRU replacement is typically
used. For caches, a pseudo LRU strategy is often used in each set to reduce cost [52]. Therefore,
assuming LRU is a close approximation of how actual memory structures operate in real hardware.
Reuse distance at the first element of an array is good for all: The reuse distance of
a value in an array can change at different points within the array. In practice, the reuse distance
changes on the order of one dimension of the array size, and arrays are equally likely to have
increasing or decreasing reuse distance as calculations move further away from the start of the
array. Sacrificing accuracy here allows for significant runtime speedups of the model while still
accurately being able to distinguish between routines with large memory traffic differences as we
show in Chapter 7.
Reads and Writes to Registers are Instantaneous: In most modern processors, ac-
cessing registers is immediate unless there is a dependency between multiple accesses. When a
dependency does exist, often it does not delay execution because out of order execution hides it. Ad-
ditionally, no experiments we have run indicate that accessing data in registers limits performance.
Therefore, with few possible instances where accessing data from registers impacts performance, it
is not worthwhile to model dependencies that occur as calculations execute on machines.
Arithmetic Costs Ignored: For memory bound calculations, the best way to compare their
runtimes is data movement [3]. Since our model is designed for memory bound routines with large
data sets, we ignore the cost of executing arithmetic within the processor. Ignoring arithmetic costs
sacrifices accuracy for small problem sizes but increases our model’s speed. In the next chapter,
the impact of this assumption is seen in our inaccurate runtime predictions for small calculations.
4.2 Theoretical Framework (Equations)
To allow a general implementation of our model for those who want to use different tradeoffs,
we develop a set of equations. These equations also allow us to verify that the implementation of
our model is correct.
To define equations for the number of accesses to each memory structure, we establish several
auxiliary notions. Let x range over memory structures in a machine. We write prev(x) for the next
smaller memory structure than x. For the smallest memory structure (typically registers) there is
no smaller structure, so prev(x) is instead set to the load and store instructions completed during
program execution, which we represent with the symbol ⊥. The amount of data a
memory structure x can hold is written size(x).
To represent all memory accesses within a loop L, we use a multiset of addresses R(L). Each
address occurs once in R(L) for each time it is accessed in loop L (not including subloops). Let d
range over memory addresses. We write R(L)(d) for the number of occurrences of d in R(L). We
write L1 ≤ L2 when L1 is either the same loop as L2 or nested somewhere within L2. For a loop
L, the working set of the loop, written WS (L), is the number of unique data accesses in the loop
(including sub-loops).
\[ WS(L) = \Bigl|\bigl\{\, d \mid 0 < \sum_{L' \le L} R(L')(d) \,\bigr\}\Bigr|. \tag{4.1} \]
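Equation 4.1 can be illustrated with a minimal sketch. It assumes each loop is a dictionary holding a multiset of addresses under the hypothetical key "R" and its sub-loops under "children"; these names and the sample addresses are illustrative and are not taken from the model's implementation.

```python
from collections import Counter

def working_set(loop):
    """Count the distinct addresses touched by `loop` or any loop nested in it."""
    totals = Counter()
    stack = [loop]
    while stack:
        node = stack.pop()
        totals.update(node["R"])        # add this loop's access counts
        stack.extend(node["children"])  # then visit every sub-loop
    # An address belongs to the working set when its summed count is positive.
    return sum(1 for count in totals.values() if count > 0)

# Hypothetical two-level loop nest: addresses are just labels here.
inner = {"R": Counter({"A0": 1, "A1": 1, "x0": 2}), "children": []}
outer = {"R": Counter({"t0": 3}), "children": [inner]}
```

Here working_set(outer) counts t0 together with the three addresses of the inner loop, while working_set(inner) counts only the inner loop's addresses.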
The reuse distance of data element d in loop L, written RD(d, L), is the number of unique
data accesses between two accesses to d during the execution of loop L.
Now we define the hits to a memory structure x in loop L, written H(x, L), and we define
the accesses to x in loop L, written A(x, L). These two multisets are mutually recursive; appending
(d) to either one gives its count for a specific data element d. However, the base case of A
does not rely on H, while H always relies on A.
\[
H(x, L)(d) =
\begin{cases}
A(x, L)(d) & \text{if } WS(L) \le size(x) \text{ or } \bigl(RD(d, L) \le size(x) \text{ and } R(L)(d) > 1\bigr) \\
0 & \text{otherwise,}
\end{cases}
\tag{4.2}
\]
\[
A(x, L) =
\begin{cases}
R(L) & \text{if } prev(x) = \bot \\
A(prev(x), L) - H(prev(x), L) & \text{otherwise.}
\end{cases}
\tag{4.3}
\]
The number of accesses to memory structure x in loop L (not counting sub-loops) is |A(x, L)|.
Equation 4.2 calculates how many data elements are read from each memory structure and
not a larger structure based on reuse distances and working set size. A data element d that was not
stored within a smaller memory structure is read from a memory structure if one of two conditions
is met: The working set of the current loop is smaller than the memory structure’s size or its reuse
distance is smaller than the memory structure’s size and it is not the first read to d. Otherwise,
the read occurs from a larger memory structure.
Equation 4.3 totals up how many data elements pass through a memory structure assuming
inclusivity among memory hierarchy levels. For the smallest memory structure (usually registers),
all data are read from or transferred through it during execution. For all other memory structures,
the amount of data that moves through it is the amount of data moved through the next smaller
memory structure minus the amount of data read from the smaller memory structure and not larger
structures.
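The recursion in Equations 4.2 and 4.3 can be sketched for a single loop as follows. This is an illustration under stated assumptions, not the model's implementation: the function name, the argument layout, and the sample values are hypothetical, and sizes and reuse distances are assumed to share one unit.

```python
from collections import Counter

def accesses_and_hits(R, RD, WS, sizes):
    """Apply Equations 4.2 and 4.3 to one loop.

    R     -- Counter mapping each address d to its access count R(L)(d)
    RD    -- dict mapping each address d to its reuse distance RD(d, L)
    WS    -- working set size WS(L)
    sizes -- size(x) for each memory structure, smallest first
    Returns one (A, H) pair of Counters per memory structure.
    """
    results = []
    A = Counter(R)  # base case of Equation 4.3: prev(x) is the loads and stores
    for size in sizes:
        H = Counter()
        for d, count in A.items():
            # Equation 4.2: every access to d hits when the working set fits,
            # or when d is reused within a distance that fits.
            if WS <= size or (RD[d] <= size and R[d] > 1):
                H[d] = count
        results.append((A, H))
        A = A - H  # Equation 4.3: misses become accesses to the next structure
    return results

# Hypothetical loop: "a" is read once, "x" four times with a short reuse distance.
levels = accesses_and_hits(
    R=Counter({"a": 1, "x": 4}), RD={"a": 100, "x": 2}, WS=50, sizes=[8, 64]
)
```

In this example the four reads of x hit in the smaller structure, so only the single read of a reaches the larger one, where it hits because the working set fits.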
4.3 Implementation
Our implementation of the model uses the equations in the previous section and the as-
sumptions listed in Section 4.1 to predict memory traffic. The implementation takes as input an
abstract syntax tree representing the calculation to be analyzed and a machine structure depicting
the system on which the calculation is going to be run. The model then estimates how much data
will be read from each memory structure when the routine executes.
The abstract syntax tree, shown for both unfused and fused b = AA^T x routines in Figure
4.1, contains two element types: nodes and leaves. Nodes are loops and statements and leaves are
variables. Each node contains the name of the variable that is incremented by the loop, and the
number of iterations the loop is run, which is set to zero if it is a statement. Nodes also contain
the number of variables that are accessed when the loop or statement is executed and pointers to
those variables. The number of subloops and statements is stored along with pointers to them. For
statements, the number of subloops is zero and the pointers to other nodes are set to NULL. A loop also
contains a pointer to the iterate that accesses the loop, which is set to NULL for a statement.
Variables are stored as leaves of the tree. If a variable appears multiple times in code, each
instance of the variable is stored separately. Each leaf contains a variable’s name, how many loop
variables iterate the array and pointers to those variables. Additionally, each leaf holds a pointer
to an array of miss counts for that variable, with each value in the array corresponding to
misses to one memory structure in the machine structure.
A machine is represented in a structure that contains its name, the number of memory
structures it contains, and pointers to those memory structures. As shown in Figure 4.2 for a
Core 2 system, each memory structure contains the amount of data it can hold or address, in bytes,
and a bandwidth, in bytes per second. The bandwidth represents how fast data can be moved from the
next larger memory structure to the processor.
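The data structures described above can be sketched as follows. The class and field names are hypothetical stand-ins chosen for illustration, with Python references replacing the pointers of the actual implementation; the machine values are those listed in Figure 4.2.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Leaf:
    name: str                 # variable name
    iterates: List[str]       # loop variables that index the variable
    misses: List[int] = field(default_factory=list)  # one entry per memory structure

@dataclass
class Node:
    iterate: Optional[str]    # loop variable, or None for a statement
    iterations: int           # 0 for a statement
    variables: List[Leaf]
    children: List["Node"] = field(default_factory=list)

@dataclass
class MemoryStructure:
    name: str
    size_bytes: int
    bandwidth: int            # bytes per second

@dataclass
class Machine:
    name: str
    structures: List[MemoryStructure]

# The Core 2 machine structure of Figure 4.2.
core2 = Machine("Core 2", [
    MemoryStructure("Registers", 48, 17416986893),
    MemoryStructure("L1", 32 * 1024, 14005944862),
    MemoryStructure("L2", 4 * 1024 * 1024, 4192213168),
])

# One statement from Figure 4.1 and the innermost loop that contains it.
stmt = Node(None, 0, [Leaf("x", ["c"]), Leaf("A", ["c", "b"]), Leaf("t3", ["b"])])
inner = Node("c", 3000, stmt.variables, [stmt])
```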
4.3.1 Cache and TLB Prediction
The memory prediction function starts at the outermost loop of the loop nest. Its first step
is determining if the working set of a calculation or a loop is small enough to fit within a given
cache. If the working set of a loop fits within a cache then the prediction function returns zero
misses for that loop and cache. If the working set is not smaller than the cache then the function
to calculate working set size is recursively called on each loop within the outer loop. The function
is called until a working set smaller than the memory structure is found. If a working set smaller
than the memory structure is not found, misses are calculated for the innermost statement.

(a) Unfused
itr:a; iters:1; num_chld:2; vars:6
    itr:b; iters:3000; num_chld:1; vars:3
        itr:c; iters:3000; num_chld:1; vars:3
            stmt; vars:3
                x; 1; c
                A; 2; c,b
                t3; 1; b
    itr:b; iters:3000; num_chld:1; vars:3
        itr:c; iters:3000; num_chld:1; vars:3
            stmt; vars:3
                A; 2; c,b
                t3; 1; b
                y; 1; c

(b) Fused
itr:a; iters:1; num_chld:1; vars:6
    itr:b; iters:3000; num_chld:2; vars:6
        itr:c; iters:3000; num_chld:1; vars:3
            stmt; vars:3
                A; 2; c,b
                t8; 1; b
                y; 1; c
        itr:c; iters:3000; num_chld:1; vars:3
            stmt; vars:3
                x; 1; c
                A; 2; c,b
                t8; 1; b

Figure 4.1: The abstract syntax tree taken in by the model for b = AA^T x.

Core 2, 3
    Registers: Size = 48 Bytes; Bandwidth = 17416986893 B/s
    L1: Size = 32 KB; Bandwidth = 14005944862 B/s
    L2: Size = 4 MB; Bandwidth = 4192213168 B/s

Figure 4.2: A machine structure representing a Core 2 machine.
The first time the working set is not smaller than the memory structure size, misses are
calculated for each instance of a variable. Each element of a data structure that is only accessed
once by the current loop results in a miss to that memory structure. If elements of a data structure
are accessed multiple times then reuse distances are calculated and used to determine if those reads
occur from the current memory structure or a larger one. When the reuse distance is larger than
the memory structure, each access to an element of the data structure results in a miss to the
memory structure. When an inner loop contains misses, reuse distance is used for all variables
without any misses in the inner loop to determine if misses occur to them in the current loop. For
variables that already contain misses, the number of misses in the inner loop is multiplied by the
number of iterations of the outer loop.
Misses to a data structure rather than hits are stored, which reduces the number of memory
structures represented in the machine structure by one. The structure removed is usually
main memory; this is possible because accessing registers is assumed to have no cost. Additionally, the
implementation becomes cleaner because the bandwidth from the next larger memory structure
to the processor is stored with each memory structure. Therefore, the number of misses and the
bandwidth of the bus over which the missed data are moved are stored together.
Reuse distances for arrays are calculated for the first array element only. Reuse distances are
found by counting all unique data accesses that occur between two accesses to the first element of
an array. When an element of an array is only accessed once, or for the first access to the array in
a loop, the warm cache assumption is used in the calculation. The loop being modeled is assumed
to have been executed twice in a row with the second execution being the one modeled. The reuse
distance is then calculated from the last access to that array element in the first instance of the
loop.
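The warm cache rule for reuse distance can be sketched on an explicit access trace. This is an illustration only: the model works from the loop structure rather than from a trace, and the function name, the trace, and the choice to count distinct addresses (rather than bytes) are assumptions made for this example.

```python
def warm_reuse_distance(trace, target):
    """Reuse distance of the first access to `target` under the warm cache rule.

    The loop is treated as if it ran twice in a row. The distance is the number
    of distinct other addresses touched between the last access to `target` in
    the first run and its first access in the second run.
    """
    first = trace.index(target)                      # first access in the second run
    last = len(trace) - 1 - trace[::-1].index(target)  # last access in the first run
    between = trace[last + 1:] + trace[:first]
    return len(set(between) - {target})

# Hypothetical trace of a 2 x 2 matrix-vector product t = A^T x.
trace = ["x0", "A00", "t0", "x1", "A10", "t0",
         "x0", "A01", "t1", "x1", "A11", "t1"]
```

For this trace, x0 is reused after four distinct addresses, while A00, which is read only once per run, is separated from its previous use by every other address in the loop.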
4.3.2 Register prediction
Registers must be treated differently in some respects from caches because the compiler is
able to allocate them using additional information available to it. Therefore, we do not assume
that they are allocated in a least recently used manner. To approximate how registers are allocated
by the compiler, we use the following three heuristics. The first is that the iterate of an innermost
loop is always stored in a register. The second is that variables accessed more than once within
an inner loop are assumed to be stored in registers. The third is that values read from the same
memory location within each loop iteration are likewise said to be stored in registers.
Also, values stored in a register do not incur cache misses while they remain in the register, even
if their reuse distance is larger than the cache. Other than the heuristics presented in this section,
registers are treated identically to caches.
4.4 Validation
To ensure the correctness and accuracy of our implementation, we validated our memory
predictions using multiple techniques. To verify that reuse distances and memory misses were being
calculated properly, the implementation was compared to the theoretical equations and to code
instrumented to track the number of reads and writes between successive accesses to the same data
element. Additionally, predicted memory misses were compared to actual memory misses measured
from hardware counters.
4.4.1 Comparison of Implementation to Equations and Instrumented Code
To ensure the correctness and accuracy of the implementation, we first compare the reuse
distances computed when the analytic model is executed to those resulting from running instru-
mented code. The greatest possible difference between the two is twice the maximum number of
variables in a statement (an array only counts once) times the wordsize (single or double) of the
data structures being modeled. This difference results from the model's not enforcing the ordering
of variables within a statement. It is small compared to both cache size and reuse distance.

Variable    Instrumented RD    Model RD
a           81592              81600
x           1600               1608
t           81600              81600
a           81592              81600
t           81600              81600
b           1608               1608

Table 4.1: Instrumented and modeled reuse distances for unfused b = AA^T x.
In Tables 4.1 and 4.2, we compare the reuse distances (RD) calculated by the model to those
calculated by instrumenting code for the outer loop of the matrix-vector multiplies for the calculation of
fused and unfused b = AA^T x. The tables show that the model predicts reuse distance to within twenty-four
bytes for a matrix order of one hundred. Since all statements contain three elements and wordsize
is eight bytes, the predictions are within our thirty-two byte tolerance. A small matrix order was
used in the validation due to the long runtime of instrumented code.
Using the reuse distance calculations, we can calculate the number of misses to each memory
structure that should occur per variable at each loop. Below, we compare the number of misses per
variable produced by the model for unfused b = AA^T x to hand calculations for a 32 KB cache.
Calculations are performed at the outer loop of the matrix-vector multiplies.
For the unfused routine the model predicts the following number of misses:
x: 800
A: 80000
t: 800
A: 80000
t: 800
b: 800

Variable    Instrumented RD    Model RD
t           8                  24
a           81600              81600
x           3192               3216
a           1600               1608
t           8                  24
b           2408               2408

Table 4.2: Instrumented and modeled reuse distances for fused b = AA^T x.
Hand calculations of misses match the model's estimates:
For both instances of A the reuse distance is 81,600 bytes. Since this exceeds the 32,768 byte cache size, all 100 (rows)
x 100 (columns) x 8 (wordsize) = 80,000 bytes of A are read from a larger cache.
For x and b the reuse distances (1608) are less than the cache's size. However, each is still read
once from memory because between matrix-vector multiplies its reuse distance is larger than the cache,
resulting in 800 bytes read.
Each instance of t in both matrix-vector multiplies has a reuse distance of 81,600 the first time it is
accessed, resulting in 800 bytes read from memory. However, subsequent accesses in the inner loop
have an RD of 24 and, therefore, come from cache.
In all cases our model calculates the same number of misses as our hand calculations.
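The arithmetic behind these hand calculations can be reproduced with a short sketch. The helper name is hypothetical, and the rule it applies (a variable's whole footprint is missed when its reuse distance exceeds the cache size) is a simplification of the procedure described above; the numbers are those given in the text.

```python
ORDER = 100            # matrix order used in the validation
WORD = 8               # wordsize in bytes (double precision)
CACHE = 32 * 1024      # 32 KB cache

def bytes_missed(reuse_distance, footprint_bytes, cache_bytes=CACHE):
    """Bytes read from a larger structure for one variable instance."""
    return footprint_bytes if reuse_distance > cache_bytes else 0

matrix_bytes = ORDER * ORDER * WORD   # footprint of A
vector_bytes = ORDER * WORD           # footprint of a vector such as t

misses_A = bytes_missed(81600, matrix_bytes)        # A: reuse distance 81,600
misses_t_first = bytes_missed(81600, vector_bytes)  # first access to each t element
misses_t_inner = bytes_missed(24, vector_bytes)     # later inner-loop accesses to t
```

With these inputs the sketch gives 80,000 bytes for A and 800 bytes for the first pass over t, and no further misses for the inner-loop accesses to t.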
4.4.2 Comparison of Predictions to Hardware Counters
We also compare our model’s memory read predictions to the actual number of reads mea-
sured by hardware performance counters. For each memory structure and loop, we divide the
number of accesses by the memory structure’s line or page size, written LS(x), to obtain the
number of cache lines or page table walks needed for that memory structure, written LA(x):
\[ LA(x) = \lceil A(x, L) / LS(x) \rceil. \tag{4.4} \]
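As a small worked instance of Equation 4.4, the sketch below converts byte counts into line counts. The 64 byte line size is an assumption chosen for illustration and is not a value taken from the machines in Table 3.1.

```python
def line_accesses(access_bytes, line_bytes):
    """Equation 4.4: round the number of lines or pages up to a whole number."""
    return -(-access_bytes // line_bytes)  # integer ceiling division

LINE_BYTES = 64  # assumed cache line size

lines_for_matrix = line_accesses(80000, LINE_BYTES)  # 80,000 bytes of A
lines_for_vector = line_accesses(800, LINE_BYTES)    # 800 bytes of a vector
```

The vector case shows why the ceiling matters: 800 bytes span twelve full lines plus part of a thirteenth.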
Figures 4.3 and 4.4 show the predicted and actual memory miss results normalized per floating
point operation for the kernel b = AA^T x on the Work and Opteron machines shown in Table 3.1.
Misses to a cache are equivalent to accesses from the next larger structure. Misses to the TLB
result in page table walks. In each figure the actual 1 and predicted 1 lines show results with no
fusion and the actual 2 and predicted 2 lines display the results for fully fused routines.
[Figure 4.3, panels (a) L1, (b) L2 and (c) TLB: each panel plots misses per flop against matrix order (0 to 3500) for aatx, with four series: Predicted 1, Predicted 2, Actual 1 and Actual 2.]
Figure 4.3: Predicted vs. actual memory misses of b = AA^T x on the Work machine.
[Figure 4.4, panels (a) L1, (b) L2 and (c) TLB: each panel plots misses per flop against matrix order (0 to 10000) for aatx, with four series: Predicted 1, Predicted 2, Actual 1 and Actual 2.]
Figure 4.4: Predicted vs. actual memory misses of b = AA^T x on the Opteron.
On both machines, the predicted misses for the L1 and L2 caches are accurate to within 1%
except near cache boundaries. In those cases, conflict misses play an important role and expose
the difference between the set associativity of the actual caches and the full associativity assumed
by the model. Set associativity causes misses to begin sooner than predicted. Additionally, once
the number of predicted misses spikes, fewer occur than predicted. The misses that happen before
the predicted jump result from some cache sets filling before others and so causing conflicts. Actual
misses then follow a curve as the matrix order increases due to progressively more sets having
conflict misses. The curve of actual misses approaches the predicted number of misses from below
because not all sets have conflict misses at the point where a fully associative cache begins to miss.
The predicted misses for the TLB are accurate to within 10% for large matrices on the Work
machine and within 2% on the Opteron except at TLB boundaries. On the Work machine the TLB
is fully associative, so conflict misses do not occur and the predicted and measured points at which
misses begin to occur line up. The Opteron's TLB is four way set associative, resulting in a curve in
the measured misses for the same reason as for the caches.
Chapter 5
Runtime Prediction
The model uses a runtime prediction function to convert memory traffic estimates into a
single value in seconds that allows the direct comparison of different implementations of the same
routine. The function takes as input the estimated memory accesses to each loop determined by
the memory prediction function presented in Chapter 4 and the usable bandwidth between each
memory structure and the CPU. These inputs are then converted into runtime estimates.
In this chapter, we first describe how we automatically determine machine characteristics.
Machine characteristics include number of registers, number of caches and their sizes, and the
useable bandwidth between the processor, caches and memory. Then we present our serial runtime
prediction function. Finally, we examine the validity of the runtime prediction function and present
experiments showing its application to various problems.
5.1 Automatically Determining Machine Characteristics
To convert our memory traffic predictions to runtime predictions, the memory structure sizes
and the bandwidth between these memory structures and the processor are needed. We determine
these values through the use of an automated system, which increases the model's portability and
ease of use: users do not need to determine the machine specifications and useable bandwidth
of their system themselves. Additionally,
automation allows the model to be included with the BTO compiler without user input during the
installation process.
In this section, we describe how we determine the number of caches on the system and
their sizes. At the same time, we find which processors share which caches and buses. We
then explain how the number of registers on a system is determined. Finally, we present how the
bandwidth from caches and memory to the processor is found.
5.1.1 How Many Caches and Their Sizes
To determine the number of caches and their sizes automatically we use Portable Hardware
Locality (hwloc) [19]. Hwloc is a subproject of OpenMPI [43] that determines the memory hierarchy
and processor layout of most modern computers. Information obtained includes how many cores
and sockets a machine contains. Also determined is the layout of cores and caches on sockets.
For example, hwloc discovers which cores share which caches. The single processor model uses
the number of caches and their sizes, while the parallel model also uses the information about
core, socket and cache layout. Hwloc works on operating systems as varied as Linux, Mac OS X,
Microsoft Windows and Solaris. Its output is parsed and stored in a machine structure as described
in Section 4.3 for serial predictions and a machine structure as described in Section 6.2 for parallel
calculations.
5.1.2 How Many Registers
To determine the number of registers accessible by the compiler on a system the following
procedure is used:
A test code produces n + 1 programs with bodies as shown in Figure 5.1, where j ranges from
0 to n. Each is compiled to assembly code and all lines containing 3456789 are counted and pulled
out of the file for analysis. Lines that contain indirect addressing, which indicates the datum is
read from memory, are removed from the count. The program is run until three test codes in a
row produce the same register count, which is then used to represent the number of registers on a
machine. The procedure works in 32 bit and 64 bit mode for regular and vector registers.
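The generation of the test programs and the stopping rule can be sketched as follows. This is a hypothetical illustration rather than the actual test code: both function names are invented here, the constants use the prefix j + 1 instead of j (an assumption made so that no constant starts with a zero, which C would parse as octal), and the compilation and assembly filtering steps are not shown.

```python
def make_test_source(j):
    """Build a C program that declares and sums the register variables a0..aj."""
    names = [f"a{i}" for i in range(j + 1)]
    lines = ["int main(void) {"]
    lines += [f"    register int {name};" for name in names]
    # Each constant ends in the marker 3456789 so it can be found in the assembly.
    lines += [f"    {name} = {i + 1}3456789;" for i, name in enumerate(names)]
    lines.append(f"    a0 = {' + '.join(names)};")
    lines.append("    return a0;")
    lines.append("}")
    return "\n".join(lines)

def first_stable_count(counts, run=3):
    """Return the first register count repeated `run` times in a row, else None."""
    for i in range(run - 1, len(counts)):
        window = counts[i - run + 1:i + 1]
        if len(set(window)) == 1:
            return window[0]
    return None
```

For example, if the successive programs yielded the counts 3, 5, 6, 6, 6, the procedure would stop and report six registers.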
Tests on Intel processors running both Linux and Mac OS X along with Opteron processors
running Linux find the exact number of registers for 32 and 64 bit modes and regular and vector
register choices.

register int a0;
register int a1;
:
:
register int an;
a0 = 03456789;
a1 = 13456789;
:
:
aj = j3456789;
a0 = a0 + a1 + ... + aj;

Figure 5.1: Code used to determine number of registers available on a machine.
5.1.3 Useable Bandwidth from Memory Structure to the Processor
To determine the amount of bandwidth available from a memory structure to the processor,
we use the STREAM TRIAD benchmark [73]. STREAM is a widely accepted benchmark for
determining bandwidths. TRIAD was chosen because, of the four stream benchmarks, it best
approximates the expected mix of instructions for targeted applications. It requires two loads and
a store to perform one multiplication and one addition, which is similar to the instruction and
memory loads of many of the memory bound operations the model is designed to handle best.
The benchmark is run for each memory structure other than registers. For the smallest cache,
the data size used is half of the cache size. For all other caches, the benchmark uses a data set
that is the average of the size of the cache being profiled and the next smaller cache. To profile
the bandwidth from main memory to the processor, we use a data set three times the largest cache
size. The bandwidth for the smallest cache is stored in the machine structure with the registers,
the bandwidth for the second smallest cache in the machine structure with the smallest cache and
so on until the bandwidth from memory to the processor is stored in the machine structure with
the largest cache. Therefore, each memory structure stores the bandwidth available to move data
that does not fit in that memory structure.
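The choice of data set sizes described above can be sketched as follows. The function name is hypothetical, and the cache sizes in the example are the L1 and L2 sizes of the Core 2 machine in Figure 4.2.

```python
def stream_data_sizes(cache_sizes):
    """Data set sizes, in bytes, for profiling each cache and then main memory.

    cache_sizes -- cache sizes in bytes, smallest first, registers excluded
    """
    sizes = [cache_sizes[0] // 2]                      # half of the smallest cache
    for smaller, larger in zip(cache_sizes, cache_sizes[1:]):
        sizes.append((smaller + larger) // 2)          # average of adjacent caches
    sizes.append(3 * cache_sizes[-1])                  # three times the largest cache
    return sizes

core2_sizes = stream_data_sizes([32 * 1024, 4 * 1024 * 1024])
```

For the Core 2 example this gives a 16 KB data set for L1, a data set of roughly 2 MB for L2, and a 12 MB data set for main memory.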
5.2 Converting Memory Traffic Predictions to Runtime Predictions
To create a single value for predicted runtime from memory miss predictions, we use a run-
time prediction function. The function takes in memory misses determined by the hardware model
described in Chapter 4 and bandwidths discovered by using STREAM in the hardware profiling
system presented in this chapter. Cost predictions can be used to compare the predicted perfor-
mance of different versions of the same calculation or to estimate the runtime of a routine. In this
section, we present equations that express the runtime prediction function and then describe the
implementation of the function.
5.2.1 Theoretical Equations
We express the conversion of memory traffic and machine characteristics to runtime predic-
tions in a mathematical form. We continue to use the terms defined in Section 4.2 where x ranges
over all memory structures and A(x, L) is the number of accesses to the memory structure x for
loop L. The runtime of the entire function is predicted by the following equations and the values
calculated by the memory traffic prediction equations presented in Section 4.2.
If L is an inner loop and B(x) is the bandwidth between memory structure x and the pro-
cessor, the runtime is computed as follows.
runtime(L) = max{A(x, L)/B(x) | for all x}. (5.1)
We use the largest value of A(x, L)/B(x) over x because that represents the memory structure that
is the bottleneck that limits performance for that loop.
We write child(L) for each of the loops directly nested within loop L. If L is an outer loop,
which is any loop with another loop inside of it, the runtime is computed as follows.
\[ runtime(L) = \max\Bigl\{ \max\{ A(x, L)/B(x) \mid \text{for all } x \},\ \sum_{c \in child(L)} runtime(c) \Bigr\}. \tag{5.2} \]
When applied recursively from the root of the abstract syntax tree described in Section 4.3, Equa-
tion 5.2 produces a runtime cost estimate.
The sum of the inner loops’ runtimes in this equation captures the impact of loops being
bound by different memory structures, which can increase overall runtime. Once the sum of the inner
loops' runtimes is found, it is compared to the predicted runtime of the outer loop. The greater of
the two values is then used as the current loop’s runtime.
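Equations 5.1 and 5.2 can be sketched as one recursive function. The dictionary layout and the names below are hypothetical; the bandwidths come from Figure 4.2 and the miss predictions from the b = AA^T x example of order 3000 reported in Section 5.3.1. The nesting of the numbered loops is an assumption based on the shape of the trees in Figure 4.1.

```python
BANDWIDTH = {"L1": 14005944862, "L2": 4192213168, "Reg": 17416986893}  # B/s

def runtime(loop, bandwidth=BANDWIDTH):
    """Equation 5.2; with no children the sum is zero, which gives Equation 5.1."""
    own = max(loop["accesses"][x] / bandwidth[x] for x in bandwidth)
    nested = sum(runtime(child, bandwidth) for child in loop.get("children", []))
    return max(own, nested)

def loop(l1, l2, reg, children=()):
    """Helper that builds one loop record from its three miss predictions."""
    return {"accesses": {"L1": l1, "L2": l2, "Reg": reg}, "children": list(children)}

# Unfused: Loop1 contains Loop2 and Loop3, which contain Loop4 and Loop5.
unfused = loop(288048000, 144096000, 288048000, [
    loop(144024000, 72048000, 144024000, [loop(144024000, 72048000, 144024000)]),
    loop(144024000, 72048000, 144024000, [loop(144024000, 72048000, 144024000)]),
])

# Fused: Loop1 contains Loop2, which contains Loop3 and Loop4.
fused = loop(288048000, 72072000, 288048000, [
    loop(288048000, 72072000, 288048000, [
        loop(144024000, 72048000, 144024000),
        loop(144024000, 24000, 144024000),
    ]),
])
```

Under these assumptions runtime(unfused) and runtime(fused) evaluate to approximately 0.0343723 and 0.0274692 seconds, the values reported in Section 5.3.1.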
5.2.2 Implementation
The runtime prediction function takes as input a machine structure as described above and
the annotated tree containing memory predictions produced by the memory model presented in
Section 4.3 and converts them into costs in seconds. The time it takes to move data is
found by dividing the amount read from a memory structure other than registers by the usable
bandwidth between that structure and the processor.
The first step in determining the cost of a calculation is a recursive traversal of the tree from
root to leaves. At each node, the runtime prediction function is applied to the corresponding loop
for each memory structure. The cost of the loop is the maximum of those contributions. The
second step is a reverse traversal of the tree from leaves to root in which the cost of each parent
node is set to be the greater of the cost returned from the runtime prediction function for it and the
sum of the costs of its children. The cost of the root is the estimated cost of running the routine.
5.3 Validation and Accuracy
To validate the implementation of the runtime prediction algorithm and test its predictive
accuracy, we performed the following tests. The values produced by the implementation of the
algorithm were compared to hand calculations to make sure they were exactly the same. The
model’s runtime predictions were also compared to the measured runtimes of modeled routines.
Additionally, the rank orderings of different versions produced by the model were compared to the
rank orderings of actual runtimes of routines.
5.3.1 Validation of Implementation
The following procedure was used to verify that the implementation of the runtime prediction
function produces the same prediction as the theoretical equations. The implementation of the
memory model in Chapter 4 was set to output its miss predictions. These values were then used in
hand calculations as demonstrated below for the fused and unfused calculation b = AA^T x, which is
described in more detail in Section 3.2. The results of the hand calculations were then compared to
the results of running the runtime prediction function with the same input machine and memory
miss predictions. This procedure was completed for fused and unfused b = AA^T x, r = Ax and
s = A^T y, and the GEMVER calculation found within the new BLAS standard [15] to ensure the
accurate implementation of the runtime prediction function.
Here we show an example of our validation of these calculations for unfused and fused
b = AA^T x with a matrix order of 3000 on the Work machine. The bandwidths from each memory
structure to the processor are shown in Figure 4.2.
The following charts show the memory miss predictions produced by the model:
unfused:
         L1          L2          Reg
Loop1:   288048000   144096000   288048000
Loop2:   144024000    72048000   144024000
Loop3:   144024000    72048000   144024000
Loop4:   144024000    72048000   144024000
Loop5:   144024000    72048000   144024000
fused:
         L1          L2          Reg
Loop1:   288048000    72072000   288048000
Loop2:   288048000    72072000   288048000
Loop3:   144024000    72048000   144024000
Loop4:   144024000       24000   144024000
The resulting runtime estimates from the cost function are 0.034372 seconds for the unfused routine
and 0.0274692 seconds for the fused routine.
Hand calculations result in the same predictions as the cost function. For the unfused routine
the runtimes of the loops are found as follows:
Loop 1 = max(288048000/14005944862, 144096000/4192213168, 288048000/17416986893) = 0.0343723
all other loops = max(144024000/14005944862, 72048000/4192213168, 144024000/17416986893) = 0.0171861
Since the sum of the costs of loops 2 and 3 equals the cost of loop 1, this value is the cost of
running that loop.
For the fused routine the runtimes of the loops are found as follows:
Loops 1 and 2 = max(288048000/14005944862, 72072000/4192213168, 288048000/17416986893) = 0.0205661
Loop 3 = max(144024000/14005944862, 72048000/4192213168, 144024000/17416986893) = 0.0171861
Loop 4 = max(144024000/14005944862, 24000/4192213168, 144024000/17416986893) = 0.0102831
Since the sum of the costs of Loops 3 and 4 is larger than Loop 2’s own cost, 0.0274692 is used as
the cost of Loop 2. That cost is larger than Loop 1’s cost and is therefore used as the
calculation’s cost.
In both cases, the runtimes from our hand calculations match the cost function’s predictions,
verifying its implementation.
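The hand calculation can also be replayed in a few lines of code. The sketch below uses the miss counts from the charts above and the three bandwidths that appear as divisors in the hand calculations; the variable names are ours, and the nesting of the fused loops (Loop 1 contains Loop 2, which contains Loops 3 and 4) is read off the comparison described in the text.

```python
# Bandwidths in bytes per second, as used in the hand calculations.
BANDWIDTH = {"L1": 14005944862, "L2": 4192213168, "Reg": 17416986893}


def loop_cost(reads):
    """Cost in seconds of one loop: the slowest memory structure bounds it."""
    return max(reads[x] / BANDWIDTH[x] for x in reads)


# Unfused routine: Loop 1, and the four identical remaining loops.
unfused_loop1 = loop_cost({"L1": 288048000, "L2": 144096000, "Reg": 288048000})
unfused_other = loop_cost({"L1": 144024000, "L2": 72048000, "Reg": 144024000})

# Fused routine: Loops 1 and 2 share the same counts.
fused_loop12 = loop_cost({"L1": 288048000, "L2": 72072000, "Reg": 288048000})
fused_loop3 = loop_cost({"L1": 144024000, "L2": 72048000, "Reg": 144024000})
fused_loop4 = loop_cost({"L1": 144024000, "L2": 24000, "Reg": 144024000})

# A parent loop costs the greater of its own cost and the sum of its children.
fused_loop2_total = max(fused_loop12, fused_loop3 + fused_loop4)
fused_total = max(fused_loop12, fused_loop2_total)

print(f"unfused Loop 1: {unfused_loop1:.7f} s")
print(f"fused routine:  {fused_total:.7f} s")
```

Loop 3 is bound by L2 traffic while Loop 4 is bound by L1 traffic, which is why their sum, rather than the cost of the enclosing loop, determines the fused estimate.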
5.3.2 Runtime Prediction Accuracy
To show the usefulness and accuracy of the combined memory model and runtime prediction
function, a series of tests was performed. Two criteria were used to judge the success of the
combined system. How close runtime predictions are to actual runtimes measures the accuracy
of predictions. Also, the difference in predicted runtimes between routines was compared against
differences in their actual runtimes.
For using the model within the BTO compiler, this second criterion is especially important
because accurate prediction is not as important as correctly identifying the best performing routines.
Within BTO, it is also important that small differences between routines are not exaggerated and
large differences are not minimized. Chapter 7 presents an analysis of the model’s usefulness within
BTO, while here we examine accuracy and correctness of ranking routines.
In Figures 5.2, 5.3 and 5.4, we compare the estimates produced by the runtime prediction
function with actual runtimes of the compiler-generated codes on the Work and Opteron machines.
The graphs in Figures 5.2 and 5.3 show the results for all versions produced by the compiler for
three routines, b = AA^T x, x = w + y + z and w = αx + βy. In each figure, the lines are numbered
according to how much fusion was applied to that version of the kernel, with number 1 corresponding
to no fusion and the largest number referring to the routine that is fused as much as possible.
Figure 5.2 shows that our predictions for large matrices and vectors are accurate on the
Work machine with predicted and actual performances nearly identical. Figure 5.3 shows that, on
the Opteron, our predictions for large matrices overestimate actual performance on some routines
by up to 25%. However, the relative ranking of the routines is accurate, meaning that within
the compiler we are able to prune the search space using the model’s predictions. For smaller
matrices, our predictions are less accurate because the model does not take the cost of arithmetic
or other computations into account, so the model overpredicts performance when computation is
the bottleneck.
[Figure 5.2 panels: (a) AATX, mflops vs. matrix order, Predicted and Actual versions 1-2; (b) VADD, mflops vs. vector dimension, Predicted and Actual versions 1-2; (c) WAXPBY, mflops vs. vector dimension, Predicted and Actual versions 1-4.]
Figure 5.2: Predicted vs. actual runtime of three kernels on the Work machine.
[Figure 5.3 panels: (a) AATX, mflops vs. matrix order, Predicted and Actual versions 1-2; (b) VADD, mflops vs. vector dimension, Predicted and Actual versions 1-2; (c) WAXPBY, mflops vs. vector dimension, Predicted and Actual versions 1-4.]
Figure 5.3: Predicted vs. actual runtime of three kernels on the Opteron system.
At cache boundaries, our predictions abruptly change while the compiler-produced versions
follow smooth curves. The abrupt changes happen because the runtime prediction function uses the
memory model’s predictions. Therefore, jumps in performance in Figures 5.2 and 5.3 correspond
to the jumps in memory predictions in Figures 4.3 and 4.4.
Figure 5.4 compares our predictions to actual performance for the 648 versions of GEMVER
produced by the BTO compiler at matrix order 3000 on the Work machine. It shows that, as
the actual performance increases, the predicted performance, for the most part, increases as well.
The model overpredicts performance in all cases, though this inaccuracy is less important than the
performance difference between versions. On the Work machine, the overpredictions occur
when kernels require temporary storage. When temporary storage is used, the first write to a
temporary array produces ten times the expected number of TLB misses. If we replace the TLB
predictions with the actual number of TLB misses, the resulting costs are extremely accurate.
The last observation from Figures 5.2, 5.3 and 5.4 is that our runtime prediction function
always ranks the best kernel first.
In Section 3.4.1.2, we showed that data not being stored in registers significantly impacts
routine performance, and in Chapter 4 we showed how we include registers in our memory model. In
Figure 5.5 we show how adding registers increases the predictive capabilities of the model. As
seen in the figure, the model including registers predicts a decrease in performance beginning at
nvecs = 5, which is when five matrix-vector multiplies are computed. This drop off matches
the behavior of the measured performance, while the model without registers predicts increased
performance. At nvecs = 5, not all vectors in the inner loop of the fully fused calculation shown
in Figure 3.8 remain in registers. The resulting growth in L1 cache reads causes performance to
be bound on traffic from the L1 cache. Additionally, each increase of nvecs beyond five raises the
ratio of L1 reads to computation and so reduces the performance predicted by the model.
Figure 5.4: Predicted vs. actual runtime of the 648 versions of GEMVER produced by the BTO
compiler on the Work machine.
Figure 5.5: Accuracy of Model Predictions With and Without Registers
Chapter 6
Parallel Shared Memory Model
In order for our memory and cost predictions to be useful on modern machines, which often
contain multiple sockets each with a multi-core processor, we add features relevant to parallel
machines. To do so, we expand our serial model to account for additional hardware features such
as multiple buses between main memory and processors that serial models do not incorporate. Our
parallel model also records which processors share caches and buses and how data and work are
divided and distributed among them.
In this chapter and the following one, tests are run on some of the same machines listed in
Table 3.1. In Table 6.1, we expand our description of the machines on which we run parallel tests
by listing the number of processing units and memory structures they contain.
Name      Sockets   Cores   L2 Caches   NUMA
Work      1         4       2           No
Opteron   2         4       4           Yes
Table 6.1: Specifications of parallel test machines. All machines are homogeneous and caches are
shared by the same number of cores. The number of sockets is equal to the number of buses from
memory to the processor in all cases.
In this chapter, we extend our serial memory model and runtime prediction function to SMP
machines. First, we present how parallel machines change data and workload patterns. We then
detail changes in implementation to our memory predictions for parallel machines. Finally, we
present changes to our cost function and introduce a second cost function that allows us to account
for potential non-deterministic execution of certain routines on a parallel machine.
6.1 Parallel Machine and Execution Features
Additional processors, cores, buses, caches and registers in parallel machines increase the
resources available to execute a program. However, shared caches and buses can impact runtimes in
positive and negative ways through contention for resources and synergistic accesses. Also, division
of data and workload between the cores and how a program is executed impacts performance. In
this section, we explain how data division and program execution on parallel machines affect data
movement and routine runtime.
6.1.1 Data and Workload Distribution
To take advantage of the additional processors, work must be divided among all of their cores.
Two ways to partition the workload among cores are data parallelism and task parallelism. A data
parallel approach splits the data being worked on among processing cores with each core executing
the same instructions on different data. A task parallel approach allocates different operations to
each core. The operations are then performed on the entire data set. A hybrid approach can also
be used, with data divided among groups of cores that then split the tasks among themselves.
How data and tasks are apportioned to cores impacts the amount of data stored in caches, reuse
distances and data movement through buses.
For example, dividing the workload using a data parallel approach can result in extra data
stored in shared caches and increased reuse distances. When computing a matrix-vector multiply,
each core accesses a different part of the matrix. The data accessed by each core are stored in the
shared cache, increasing the amount of data in the shared cache and the reuse distances of other
data elements. Additionally, if the matrix is divided into groups of columns, each core produces a
partial result of the whole product vector. The partial result occurs because each core only uses
part of each row of the matrix. When performing a matrix-vector multiply on data divided by
columns, a column of the matrix is multiplied by a single element of the multiplicand vector, producing
a partial resultant vector. These partial results are stored separately and summed, causing extra
data to be resident in shared caches. The increased data stored in caches can produce additional
misses, leading to increased reads from slower memory structures.
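The extra storage created by a column-wise data parallel division can be made concrete with a small sketch. The function below is an illustration only: it simulates the cores one after another in plain Python, and the name `matvec_by_columns` and the block partitioning scheme are assumptions, not the code BTO generates.

```python
def matvec_by_columns(A, x, num_cores):
    """Compute A*x with the columns of A divided among `num_cores` cores.

    Returns the product and the list of per-core partial result vectors.
    The cores are simulated sequentially; only the data division matters here.
    """
    n_rows, n_cols = len(A), len(x)
    block = (n_cols + num_cores - 1) // num_cores   # columns per core
    partials = []
    for core in range(num_cores):
        cols = range(core * block, min((core + 1) * block, n_cols))
        # Each core sees only part of every row, so it can only produce a
        # partial result for every element of the product vector.
        partial = [sum(A[i][j] * x[j] for j in cols) for i in range(n_rows)]
        partials.append(partial)
    # The partial results are stored separately and summed at the end.
    result = [sum(p[i] for p in partials) for i in range(n_rows)]
    return result, partials


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]
result, partials = matvec_by_columns(A, x, num_cores=2)
print(result)
print(partials)
```

With two cores, two full-length partial vectors are live at once instead of one result vector. This additional data is what enlarges the footprint in a shared cache.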
Another potential pitfall of data parallelism is contention for shared memory buses. If two
cores read different data at the same time, those data must move over the same memory
bus. The data can move over the bus in one of two ways: either one core stalls while waiting
for the other core’s read to complete, or the reads are interleaved with each core receiving data
at half the transfer rate of the bus. However, there are no extra data in cache or bus contention
if the calculation is performed on two cores using a task parallel division of work. With one core
performing the multiply and the other the add, each accesses the same part of the matrix requiring
no partial results and not causing bus contention.
Splitting data using task parallelism results in different potential pitfalls. If cores performing
tasks on the same data each read that data through different memory buses then there can be an
increase in total data movement as compared to serial and data parallel routines. Additionally, if
the two cores become out of sync, such that data do not stay within shared caches between when
two tasks access them, extra memory traffic can result. Finally, there also is added overhead needed
to enforce dependencies between tasks. Data parallel divisions avoid both these issues, though cores
finishing their workload later than others can lead to wasted clock cycles and memory bandwidth.
Both forms of parallel computation can increase or decrease data movement compared to
serial execution. When used together, their strengths can be combined and weaknesses hidden.
For example a calculation can be implemented using task parallelism when cores share caches and
memory buses and data parallelism when cores do not share a bus. The combined approach can
reduce contention on that memory bus, the amount of data stored in the shared cache and data
movement. This hybrid approach can be the most effective way to perform a calculation. However,
our parallel memory model is limited to modeling data parallelism because BTO
does not currently create task parallel routines.
6.1.2 Routine Execution
Due to non-deterministic execution patterns, accounting for how a routine can be run is necessary
on parallel machines. Different execution patterns result in varying amounts of contention
for hardware resources among cores. In certain cases, one execution pattern might sometimes
leave the bus that limits data movement idle while a different pattern does not. When
non-deterministic execution occurs, the runtime of the calculation can vary depending on which
execution pattern occurs during runtime. In other cases, both execution patterns can result in
identical performance, for example when neither execution pattern leaves the bus that limits data
traffic idle more than the other.
An example of a routine where contention may or may not occur depending on how the routine
is executed is the fused kernel b = AA^T x shown on the right side of Figure 3.4. For this routine, we
present two execution paths, which demonstrate possible executions, and use
them to develop performance bounds. In actuality, a processor can switch between these execution
patterns unless the code is written to enforce one or the other. We assume that two columns of A
and all vectors fit within cache. For our examples, we run the calculation on two cores of a parallel
machine that share a cache and divide the data by performing half of the outer loop iterations on
each core.
For the first execution pattern, each core executes the same inner loop, t = A^T ∗ x, at the same
time. The cores access different columns of A that are read from memory simultaneously, thus
creating bus contention. Then each core executes the second inner loop, b = A ∗ t. Since the data of A
accessed by the second inner loop are in cache, there is no bus contention because no data are read
from memory.
For the second execution pattern, the two cores become out of sync. While one core executes
t = A^T ∗ x, the other core executes b = A ∗ t. Only the core performing t = A^T ∗ x is reading
data from main memory since the core performing b = A ∗ t is reading data from cache. Then the
cores switch which calculation they are performing, resulting in no bus contention since only one
core is ever reading data from memory at a time.
For our machine with a single shared cache, if moving data from memory is the limiting
factor for routine performance, then the first execution pattern runs more slowly than the second.
The disparity in speed is the amount of time it takes to execute b = At. The time difference occurs
because the second pattern hides the cost of performing b = At by executing it on one core while
t = A^T x reads data from memory and runs on the other core. However, if data movement from
cache limits performance, the runtime of both patterns is the same since calculating both parts
of b = At at the same time does not create bus contention because cores executing the routine
use separate buses to access cache. When and by how much execution patterns impact a routine’s
runtime is both machine and routine dependent.
6.2 Modeling Parallel Memory Traffic
To add parallel capabilities to our model, we made a series of changes. First we added
more information to our abstract syntax tree (AST) and machine structure. An additional field
is appended to all variables in the AST to mark those that store partial sums once per running
thread. We also use this field to indicate data structures accessed from more than one location by
different threads.
Machine structures were changed to store a hierarchy of structures. Figure 6.1 shows an
example of the new machine structure for the dual socket Clovertown machine shown in Figure 2.1.
The added fields of cores, threads and sockets each express how many processing units access that
memory structure directly. At the top level in this figure is the machine level. The second level
is the socket level for this machine. The third level represents the two cores that share the level 2
cache. The lowest level represents the individual cores, level 1 caches and registers. For homogeneous
systems or heterogeneous systems with homogeneous components, identical memory structures are
compressed to a single representation. A new field called compressed is added, and it stores the
number of memory structures that the single representation stands for. Therefore,
all memory structures stored at a given level of the hierarchy are accessed by the same number of
cores, which is stored at that level.
Level 1 (machine): Two Socket Clovertown
    Cores = 8, Threads = 8, Sockets = 2, Caches = 0, Compressed = 0
Level 2 (socket):
    Cores = 4, Threads = 4, Sockets = 1, Caches = 0, Compressed = 2
Level 3 (cores sharing an L2 cache):
    Cores = 2, Threads = 2, Sockets = 0, Caches = 1, Compressed = 2
    L2: Size = 4 MB, Bandwidth = 4192213168 B/s
Level 4 (individual core):
    Cores = 1, Threads = 1, Sockets = 0, Caches = 2, Compressed = 2
    L1: Size = 32 KB, Bandwidth = 14005944862 B/s
    Registers: Size = 48 Bytes, Bandwidth = 17416986893 B/s
Figure 6.1: Parallel machine storage example.
Buses are represented implicitly with the number of buses from memory equal to the number
of compressed structures in the second level of the hierarchy. The number of buses from all caches
at a given level is found by dividing the total number of cores by the number in the next lower level
of the hierarchy. The number of cores, threads and sockets at a level of the hierarchy represents
how many of these hardware structures share all caches at that level.
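One way to picture this hierarchical machine structure is as a list of levels, each carrying the fields listed in Figure 6.1. The sketch below is an assumed encoding for illustration: the class names, the list layout and the reading of 4 MB and 32 KB as binary units are ours, and only the rule for buses from memory is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStructure:
    name: str
    size_bytes: int
    bandwidth: int          # bytes per second


@dataclass
class Level:
    cores: int              # cores that share the structures at this level
    threads: int
    sockets: int
    caches: int
    compressed: int         # identical structures folded into this entry
    memory: list = field(default_factory=list)


# The two socket Clovertown example, from the machine level down to one core.
clovertown = [
    Level(cores=8, threads=8, sockets=2, caches=0, compressed=0),
    Level(cores=4, threads=4, sockets=1, caches=0, compressed=2),
    Level(cores=2, threads=2, sockets=0, caches=1, compressed=2,
          memory=[MemoryStructure("L2", 4 * 1024**2, 4192213168)]),
    Level(cores=1, threads=1, sockets=0, caches=2, compressed=2,
          memory=[MemoryStructure("L1", 32 * 1024, 14005944862),
                  MemoryStructure("Registers", 48, 17416986893)]),
]


def memory_buses(machine):
    """Buses from memory: the compressed count stored at the second level."""
    return machine[1].compressed


print(memory_buses(clovertown))
```

Because identical structures are compressed, a homogeneous machine needs only one entry per level no matter how many sockets or cores it has.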
Currently our model is only designed to work on homogeneous machines, but all changes
made to parallel system representation are forward looking towards the advent of generating kernels
for heterogeneous computing environments. In this section, we present our memory prediction
algorithms for shared memory parallel machines that use these modified data structures. Implemen-
tation details are explained and then followed by analysis of the accuracy of our model’s predictions.
Since the theoretical equations for memory predictions presented in Section 4.2 do not change from
the serial case, we do not present them here.
6.2.1 Parallel Memory Prediction
To predict the amount of data movement on a parallel machine, we must adjust reuse distance
calculations to account for extra variables stored in caches, such as partial results and variables
used by other cores that share a given cache. These additional data can increase reuse distances in
shared caches when multiple cores access different parts of a shared data structure. Also increasing
reuse distances are partial results that must be stored and accessed by different cores. If a partial
result is reused by its core, but is accessed fully before the reuse occurs, then the reuse distance
of each partial result is made larger by the number of partial results times the sizes of the partial
results. In our model, we adjust our reuse distance calculations to account for these factors.
Whenever a partial result is updated between accesses to any data structure, the reuse dis-
tance of the data structure increases by the size of the partial result times the number of instances
of the partial result stored in the memory structure of interest. The same adjustment is applied to
all data structures divided among cores and accessed in different places by the cores. Calculating
reuse distances in this way accounts for both shared caches and the extra data stored in them when
executing a calculation in parallel. The result is that reuse distances for the same variable can
differ for various memory structures due to the number of cores that access that structure and,
therefore, must be calculated per structure. Then to determine the memory structure from which
reads to each data structure occur, reuse distances are compared to the memory structure’s size.
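The per-structure adjustment can be sketched as follows. This is a simplified illustration under stated assumptions: reuse distances and sizes are in bytes, a read is served by the closest structure whose size is at least the adjusted reuse distance, and the function name, argument layout and example values are hypothetical.

```python
def serving_structure(base_distance, partial_size, structures):
    """Return the name of the memory structure that serves a reuse.

    base_distance: reuse distance in bytes before the parallel adjustment
    partial_size:  size in bytes of one partial result updated between accesses
    structures:    list of (name, size_bytes, partial_instances), ordered from
                   closest to farthest; partial_instances is the number of
                   copies of the partial result stored in that structure
    """
    for name, size, instances in structures:
        # The adjustment differs per structure because shared caches hold
        # the partial results of every core that uses them.
        distance = base_distance + partial_size * instances
        if distance <= size:
            return name
    return "memory"


# A private L1 holds one partial result; an L2 shared by two cores holds two.
structures = [("L1", 32 * 1024, 1), ("L2", 4 * 1024**2, 2)]
print(serving_structure(8000, 8000, structures))
print(serving_structure(24000, 24000, structures))
```

In the second call the adjusted distance is 48000 bytes for L1 and 72000 bytes for L2, so the reuse no longer fits in the 32 KB L1 and is served from L2.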
6.2.2 Prediction Results
The first step in evaluating the accuracy of our parallel memory model was comparing memory
miss predictions to the actual number of memory misses observed by measuring hardware counters.
In Figures 6.2 and 6.3, we compare the actual and predicted L1 and L2 misses for the Intel Core 2
Duo machine. Results are normalized per floating-point operation to aid in presentation.
For small orders, overhead can cause our predictions to be inaccurate. Also, at cache bound-
aries, we abruptly predict increased misses due to assuming a fully associative cache while actual
misses increase in a curve as explained in Section 4.1. Otherwise, for both calculations, our
predictions are accurate other than in two noteworthy cases. For b = AA^T x, the unfused calculation’s L1
cache miss predictions are inaccurate for matrix orders of approximately 2000-4500. This probably
occurs because the processor fits most of the x and t vectors into cache by not including a column
of A. We are working on testing this explanation. The other inaccuracy is in the L1 cache miss
prediction of the fused calculation of r = A^T x, s = Ay. For large orders, our model predicts only
about 75% of the actual misses. As for our previous inaccuracy, we assume the hardware cache
strategy is causing the discrepancy.
6.3 Parallel Runtime Prediction
In this section, we explain how to convert parallel memory traffic predictions to runtime
predictions. We present two functions for converting predictions to runtimes that provide best and
worst case predictions when nondeterminism is possible. The section is organized as follows. First,
we present theoretical equations for the parallel case. A proof of why the equations presented
represent the best and worst cases follows. We then describe the implementation of the model.
Finally, we demonstrate the accuracy of our performance predictions.
[Figure 6.2 panels: (a) L1 cache misses per FLOP and (b) L2 cache misses per FLOP vs. matrix order for r = A^T x, s = Ay; each panel plots unfused and fused, actual and predicted.]
Figure 6.2: Predicted and actual cache misses for r = A^T x, s = Ay.
[Figure 6.3 panels: (a) L1 cache misses per FLOP and (b) L2 cache misses per FLOP vs. matrix order for b = AA^T x; each panel plots unfused and fused, actual and predicted.]
Figure 6.3: Predicted and actual cache misses for b = AA^T x.
6.3.1 Theoretical Expression
To provide a theoretical framework that is implementation independent we developed equa-
tions to express our parallel runtime predictions. These two equations are used to represent the
worst case and best case execution scenarios described in Section 6.1.2.
The worst case is expressed by equations 6.1 and 6.2 which are modifications of the equations
5.1 and 5.2 for the serial case presented in Section 5.2.1. Equation 6.1 is applied to the innermost
loops and equation 6.2 is applied to all other loops. To the serial equations, we add the term
Bus(x), which is the number of buses that all memory structures x use to access a processor. This
added term expresses the additional bandwidth available due to parallel execution.
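\[ \mathrm{runtime\_inner}(L) = \max\{\, A(x, L) / (B(x) \cdot \mathrm{Bus}(x)) \mid \text{for all } x \,\} \qquad (6.1) \]
\[ \mathrm{runtime\_outer}(L) = \max\Big\{ \max\{\, A(x, L) / (B(x) \cdot \mathrm{Bus}(x)) \mid \text{for all } x \,\},\ \sum_{c \in \mathrm{child}(L)} \mathrm{runtime}(c) \Big\} \qquad (6.2) \]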
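Equations 6.1 and 6.2 can be evaluated with one recursive function, because an innermost loop has no children and its child sum is zero. The sketch below is illustrative only: the dictionary layout of a loop, the structure names and the numbers are assumptions, with `reads` standing for A(x, L), `bandwidth` for B(x) and `buses` for Bus(x).

```python
def runtime_worst(loop, bandwidth, buses):
    """Worst case runtime of `loop` following equations 6.1 and 6.2.

    loop:      {"reads": {structure: bytes}, "children": [loops]}
    bandwidth: {structure: bytes per second over one bus}
    buses:     {structure: number of buses from that structure}
    """
    own = max(loop["reads"][x] / (bandwidth[x] * buses[x])
              for x in loop["reads"])
    # For an innermost loop the sum is 0, which reduces to equation 6.1.
    child_sum = sum(runtime_worst(c, bandwidth, buses)
                    for c in loop.get("children", []))
    return max(own, child_sum)


# Toy example: two buses from L1, a single bus from memory.
bandwidth = {"L1": 100.0, "mem": 10.0}
buses = {"L1": 2, "mem": 1}
inner_a = {"reads": {"L1": 400, "mem": 10}}      # max(2.0, 1.0) = 2.0
inner_b = {"reads": {"L1": 200, "mem": 30}}      # max(1.0, 3.0) = 3.0
outer = {"reads": {"L1": 600, "mem": 40}, "children": [inner_a, inner_b]}
worst = runtime_worst(outer, bandwidth, buses)
print(worst)
```

Setting every entry of `buses` to 1 recovers the serial case, which shows that the only change from the serial equations is the extra bandwidth term.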
The best case scenario assumes that all data movement not from the limiting memory struc-
ture is overlapped with data movement from the memory structure that limits performance. There-
fore, the memory structure that bounds performance at the outermost loop level is the memory
structure that limits performance. The limiting memory structure is found by applying equation 6.1
to all memory structures for the outermost loop. The runtime returned is the routine’s performance
estimate.
There is one exception to this use. We define runtest(L) as the result of applying equation
6.2 to loop L. When the inequality in equation 6.3 is true for the outermost loop, data movement
is not overlapped for memory structures closer to the processor.
\[ \mathrm{runtime}(L) \cdot \mathrm{processors} / \mathrm{Bus}(x) < \mathrm{runtest}(L) \qquad (6.3) \]
For closer memory structures, their runtime estimate is the same as produced by a worst case
analysis and applying equations 6.1 and 6.2. The result from the worst case analysis is compared
to the best case prediction and the greater of the two is used as the runtime estimate.
6.3.2 Proof of Best and Worst Case
For the best case, we assume that data are always moved from the memory structure with
the highest data movement cost to the processor whenever possible. We divide the proof into
two cases, structures larger than the bound memory structure and those smaller than the bound
structure. For structures larger than the bound structure, reads are included in the reads to the
bound structure. Therefore, because the cost of moving data from the larger structure is less than
the bound structure, the data from the larger structure can be moved to the bound structure while
the bound structure is moving data to the processor and does not impact runtime.
For structures smaller than the structure with the largest data movement cost, data that are
not read from the bound structure can be moved to the processor when there is spare bandwidth.
The reads can be interleaved between reads from the bound structure or travel on a different bus
from the smaller structure to a different processing core. However, certain data access patterns
prevent data movement from a closer cache to a processor from being performed in a manner that
does not impact performance. When the data movement cost cannot be hidden, we cannot use
the best case estimate.
Instead, when the runtime of the bound structure times the number of cores that share a
bus from the bound structure is less than the worst case analysis, we use the worst case analysis.
When this inequality holds, the calculation is executed in a way such that it is impossible to
overlap data movement from closer caches.
In the worst case, data movement is assumed to cause conflicts for resources whenever possible.
We prove that having all cores execute the same code simultaneously is the worst possible
execution pattern by contradiction. We assume that multiple cores are executing a different part
of the program at once and that this execution pattern results in the most possible conflicts. We
also assume each core needs to move data over the same memory buses and that one memory bus
is used at full capacity. The bus limiting data movement can be described using two cases.
In the first case, the combined execution of the codes causes the bus closer to the processor
to be used at full capacity. One of two scenarios or a combination of both can occur. In the first
scenario, codes not bound on data movement from the bus run at the same speed as if each core
were executing them. The core bound on data movement over the bus at capacity uses the excess
bandwidth not used by the other cores to execute calculations faster than if all cores were executing
the same code. In the second scenario, cores bound on data movement over the bus at capacity run
at the same speed as they would if all cores were executing the same code. The cores that would be
bound on a bus farther from the processor if all cores were executing the same code as each other
can execute faster. They can use the excess bandwidth on the bus farther from the processor. In
both scenarios, one or more cores execute more code than if both cores executed the same code,
creating a contradiction.
In the second case, the combined calculation is bound on data traveling over a bus farther from
the processor than part of the calculation which is bound on another bus closer to the processor.
In this case there are once again two scenarios. In the first scenario, codes not bound on data
movement from the bus run at the same speed as if each core were executing them. Since these
cores do not use as much of the bandwidth from the farther memory structure, there is more
available to execute the calculation bound on data movement over the farther memory structure.
In the second scenario, cores bound on data movement over the bus at capacity run at the same
speed as they would if all cores were executing the same code. The cores running calculations not
bound on data movement from the farther structure are able to use the extra bandwidth available
between closer memory structures and the processor to run faster than if all cores were executing
the same calculation. In both scenarios of this case, more computation occurs if cores execute
different calculations than if they execute the same calculation at once. Therefore, having all
cores execute the same code at once produces the worst possible data movement patterns.
6.3.3 Implementation
For converting memory traffic estimates to runtime predictions on parallel machines, both
the best and worst case execution patterns must be modeled. We solve this problem by creating two
runtime prediction methods. The worst case prediction method performs the first execution pattern
presented in Section 6.1.2, and the best case method uses the second execution pattern. Both of
these prediction methods account for the same hardware features but assume different calculation
execution paths. Hardware features modeled include the amount of bandwidth available between
each memory structure and core, how many data are moved over that bandwidth and how many
cores share that bandwidth.
The worst case prediction method uses a modified version of the single processor runtime
prediction method described in Section 5.2.2, accounting for the increase in bandwidth created by
multiple buses. The serial algorithm is used to traverse the AST. In parallel, the serial traversal
scheme implies that all cores are executing the same loop of a calculation at the same time.
Simultaneous execution is implied because all runtime predictions are calculated for each loop. To
account for parallel structures, bandwidths from each cache to processing cores are multiplied by
the number of buses from a memory structure to cores. For the example Clovertown machine in
Figure 6.1, the bandwidth between memory and the cores is multiplied by two and the bandwidth
between the L1 and L2 caches is multiplied by eight. The total number of misses for all cores to a
memory structure for all inner loops is divided by the bandwidth to produce runtime estimates for
all inner loops. These runtime estimates are propagated up the AST in the same manner as in the
serial algorithm to produce overall runtime predictions.
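The worst case calculation for a single inner loop can be sketched as follows. This is a minimal illustration rather than BTO's implementation: the names MemoryLevel and worst_case_loop_time are hypothetical, the bandwidth figures are made up, and adding the per-structure costs together is an assumption standing in for the serial rule of Section 5.2.2. Only the bus counts (two and eight) come from the Clovertown example above.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    name: str
    bus_bandwidth: float  # bytes/second over a single bus (illustrative unit)
    num_buses: int        # buses between this structure and the cores


def worst_case_loop_time(bytes_missed, levels):
    """Estimate one inner loop's runtime when all cores run the same loop.

    bytes_missed maps a level name to the data moved from that level,
    totalled over all cores. Per-level costs are added together, which is
    an assumption made for this sketch, not a rule taken from the thesis.
    """
    total = 0.0
    for level in levels:
        # Parallel structures: scale the single-bus bandwidth by the bus count.
        aggregate_bandwidth = level.bus_bandwidth * level.num_buses
        total += bytes_missed.get(level.name, 0.0) / aggregate_bandwidth
    return total


# Bus counts follow the Clovertown example; the bandwidths are made up.
levels = [
    MemoryLevel("L2", bus_bandwidth=20e9, num_buses=8),
    MemoryLevel("memory", bus_bandwidth=5e9, num_buses=2),
]
estimate = worst_case_loop_time({"L2": 1.6e9, "memory": 1.0e9}, levels)
print(f"worst-case estimate: {estimate:.3f} s")
```

Estimates computed this way for each inner loop would then be combined up the AST as in the serial algorithm.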
The best case prediction method assumes that data are always read from the memory struc-
ture that limits performance. This assumption implies that computations that can be overlapped
always are. The predicted runtime is found by summing all misses to each memory structure. The
sum of the number of misses to each memory structure is then divided by the bandwidth between
that memory structure and the processor in the cost algorithm. The maximum of these values is
the runtime estimate. There is one exception in the algorithm, as explained in the previous
section. When only a single core accesses a memory structure over a given bus and another memory
structure is between it and the CPU, the worst case prediction method is used for those memory
structures since calculations cannot be overlapped. That cost is then compared to the cost of
moving data through memory structures where overlapping can occur, with the greater of the two
selected.
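The best case rule, including its exception, can be sketched in a few lines. The function name best_case_time, its parameters, and the numbers are hypothetical. Treating the exception as a sum of the per-structure costs is an assumption made for illustration, since the worst case rule itself is defined elsewhere in the thesis.

```python
def best_case_time(traffic, bandwidth, serialized=()):
    """Best-case runtime estimate from per-structure traffic and bandwidth.

    traffic[name]   -- data moved from a memory structure, summed over all
                       loops and cores (bytes, illustrative unit)
    bandwidth[name] -- aggregate bandwidth from that structure (bytes/second)
    serialized      -- structures reached by a single core through another
                       structure, where transfers cannot be overlapped
    """
    # Transfers that can overlap are limited by the slowest structure.
    overlapped = [
        traffic[name] / bandwidth[name]
        for name in traffic
        if name not in serialized
    ]
    # Assumption: the cost of the non-overlappable structures is additive.
    chained = sum(traffic[name] / bandwidth[name] for name in serialized)
    # The greater of the two costs is selected.
    return max(overlapped + [chained])


# Made-up traffic and bandwidth values, for illustration only.
traffic = {"L2": 1.6e9, "memory": 1.0e9}
bandwidth = {"L2": 160e9, "memory": 10e9}
low = best_case_time(traffic, bandwidth)
print(f"best-case estimate: {low:.3f} s")
```

With no serialized structures the estimate is simply the largest per-structure cost, which is why it can never exceed an estimate that adds the same costs together.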
In practice, when executing a calculation where overlapping is possible, the execution results
in some contention and some overlapping. Therefore, we use our best and worst case estimates in
combination to provide high and low bounds on potential performance.
6.3.4 Results
The second step in evaluating the accuracy of our parallel memory model was comparing
runtime predictions to the actual routine runtimes. We ran tests using parallel versions of routines
produced by BTO on the Work and Opteron machines. In this section we present selected results
that are typical of performance on these two machines.
Figures 6.4 and 6.5 show runtime estimates and measured performance on the Work machine.
In these figures and the rest in this section, routines are numbered, with one representing
the least amount of fusion and progressively higher numbers indicating more fusion.
Figure 6.4 shows that, for matrix kernels on the Work machine, our model’s runtime predictions
were not as accurate as they were in serial. However, the runtime
predictions are useful in determining performance differences between versions of the same
routine. Being able to accurately distinguish performance differences between versions based on
their memory costs means that our model can be used to determine which versions produced by
BTO will perform the best. Figure 6.5 shows that, for large vector dimensions on the Work machine,
our model accurately predicts routine runtime. Predictions for all the versions shown are within
10% of their actual values.
Figures 6.6 and 6.7 show runtime estimates and measured performance on the Opteron machine.
Predictions on the Opteron are within 10% for large matrix orders and vector dimensions.
The actual values have more variability than on the Work machine. The variability is probably due
to the NUMA architecture, with threads changing which processor they are executing on during
execution. Pinning threads to a single processor might reduce the variability of runtimes, but for
now BTO does not support pinning.
On both machines, for matrix-based kernels the speed of the actual kernels increases with
matrix order. Our model does not capture this effect because it only profiles the hardware and
measures usable bandwidth at one size. To capture the increase in speed in our runtime predictions,
we would need to profile the machine at various sizes and interpolate those values. Using this more
accurate benchmarking procedure for various kernel sizes might improve our predictions, but the
benefits need to be weighed against the added costs of measuring more values at install time and
looking up values at runtime. However, as implemented, the relative accuracy of our predictions
demonstrates that the model can be used to accurately and efficiently trim the search space of
versions of routines considered within BTO, as shown in the next chapter.
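One way the multi-size profiling idea could look is sketched below. It is not part of BTO: the function interpolate_bandwidth and the profile values are hypothetical, and linear interpolation with clamping at the ends is only one possible choice of interpolation scheme.

```python
from bisect import bisect_left


def interpolate_bandwidth(size, profile):
    """Look up usable bandwidth for a problem size.

    profile is a list of (size, bandwidth) pairs measured at install time,
    sorted by size. Sizes outside the measured range are clamped to the
    nearest measurement; sizes inside it are linearly interpolated.
    """
    sizes = [s for s, _ in profile]
    if size <= sizes[0]:
        return profile[0][1]
    if size >= sizes[-1]:
        return profile[-1][1]
    hi = bisect_left(sizes, size)  # index of first measured size >= size
    (s0, b0), (s1, b1) = profile[hi - 1], profile[hi]
    return b0 + (b1 - b0) * (size - s0) / (s1 - s0)


# Made-up measurements (matrix order, bytes/second), for illustration only.
profile = [(1000, 1.0e9), (2000, 2.0e9), (4000, 3.0e9)]
print(interpolate_bandwidth(1500, profile))
print(interpolate_bandwidth(3000, profile))
```

The lookup is a binary search over a short table, so its runtime cost is small. The install-time cost grows with the number of sizes profiled, which is the trade-off noted above.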
Also, on both machines actual performance for small kernels is significantly less than predicted.
There are two reasons for the gap. As with our serial predictions, the parallel model does not account
for the cost of arithmetic. The new factor not accounted for in parallel is the amount of time it
takes to generate threads for parallel computation.
[Figure data omitted: "Predicted" and "Actual" series for routine versions 1, 2 and 3, plotted against size.]
9 5900 0.135136 0.135136 0.09103
9 6000 0.139749 0.139749 0.100462
9 6100 0.14444 0.14444 0.098302
9 6200 0.149208 0.149208 0.099579
9 6300 0.154054 0.154054 0.104181
9 6400 0.158977 0.158977 0.126391
9 6500 0.163978 0.163978 0.108779
9 6600 0.169056 0.169056 0.113469
9 6700 0.174211 0.174211 0.11459
9 6800 0.179444 0.179444 0.120153
9 6900 0.184754 0.184754 0.123704
9 7000 0.190142 0.190142 0.1233
9 7100 0.195607 0.195607 0.132093
9 7200 0.20115 0.20115 0.136312
9 7300 0.20677 0.20677 0.134607
9 7400 0.212467 0.212467 0.140334
9 7500 0.218242 0.218242 0.143307
9 7600 0.224095 0.224095 0.148815
9 7700 0.230024 0.230024 0.153551
9 7800 0.236032 0.236032 0.15533
9 7900 0.242116 0.242116 0.160892
9 8000 0.248278 0.248278 0.171144
9 8100 0.254518 0.254518 0.16853
9 8200 0.260835 0.260835 0.17114
9 8300 0.267229 0.267229 0.176634
9 8400 0.273701 0.273701 0.185034
9 8500 0.28025 0.28025 0.189966
9 8600 0.286876 0.286876 0.193146
9 8700 0.293581 0.293581 0.199141
9 8800 0.300362 0.300362 0.198671
9 8900 0.307221 0.307221 0.204815
9 9000 0.314157 0.314157 0.207469
9 9100 0.321171 0.321171 0.212361
9 9200 0.328262 0.328262 0.21657
9 9300 0.335431 0.335431 0.220759
9 9400 0.342677 0.342677 0.227259
9 9500 0.350001 0.350001 0.231485
9 9600 0.357402 0.357402 0.253371
9 9700 0.36488 0.36488 0.241818
9 9800 0.372436 0.372436 0.245686
9 9900 0.380069 0.380069 0.248548
9 10000 0.38778 0.38778 0.256528
[Plot: Mflops (0 to 15,000) versus Matrix Order (0 to 10000). Legend: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2, Predicted - 3, Actual - 3]
(a) BiCGK
vid | size | runtime (s): Predicted - 1, Actual - 1, Best - 2, Worst - 2, Actual - 2 | size | Mflops: Predicted - 1, Actual - 1, Best - 2, Worst - 2, Actual - 2
6 100 4.69E-06 0.000153 4.97E-06 4.97E-06 8.30E-05 100 8537.2126428 261.4379085 8044.291871 8044.291871 481.92771084
6 200 2.43E-05 0.000158 1.91E-05 2.25E-05 0.000115 200 6573.2174256 1012.6582278 8362.9958342 7110.9846936 1391.3043478
6 300 5.37E-05 0.000257 4.25E-05 4.92E-05 0.000177 300 6709.7207079 1400.7782101 8474.8954763 7315.3335487 2033.8983051
6 400 9.44E-05 0.000392 7.50E-05 8.62E-05 0.000262 400 6780.1206861 1632.6530612 8531.9795926 7421.9764723 2442.7480916
6 500 0.000146562 0.000537 0.000116732 0.000133556 0.000408 500 6823.0509955 1862.1973929 8566.6312579 7487.4958819 2450.9803922
6 600 0.000210157 0.001086 0.00016764 0.00019119 0.001057 600 6852.0201564 1325.9668508 8589.8353615 7531.7746744 1362.346263
6 700 0.000285181 0.00114 0.000227734 0.000259131 0.000872 700 6872.8281337 1719.2982456 8606.5321823 7563.7418912 2247.706422
6 800 0.000371632 0.00221 0.000297016 0.000337379 0.001055 800 6888.5348947 1158.3710407 8619.0642928 7587.9055899 2426.5402844
6 900 0.000469511 0.003011 0.000375484 0.000425935 0.00186 900 6900.7967864 1076.054467 8628.8630141 7606.7944639 1741.9354839
6 1000 0.000578818 0.003158 0.00046314 0.000524798 0.001748 1000 6910.6351219 1266.6244459 8636.6973269 7621.9802667 2288.3295195
6 1100 0.00471871 0.003906 0.00239981 0.00274125 0.002184 1100 1025.7040589 1239.1193036 2016.8263321 1765.6178751 2216.1172161
6 1200 0.00561228 0.004705 0.00285027 0.00325703 0.002576 1200 1026.3208536 1224.229543 2020.8611816 1768.482329 2236.0248447
6 1300 0.00658327 0.005373 0.00333945 0.00381724 0.004165 1300 1026.8453215 1258.1425647 2024.2854362 1770.9130157 1623.0492197
6 1400 0.0076317 0.006072 0.00386734 0.00442189 0.003407 1400 1027.2940498 1291.1725955 2027.2331887 1772.9975192 2301.1447021
6 1500 0.00875757 0.007081 0.00443395 0.00507096 0.003783 1500 1027.6823365 1271.0069199 2029.7928484 1774.8118699 2379.0642347
6 1600 0.00996086 0.008874 0.00503928 0.00576446 0.005134 1600 1028.0236847 1153.9328375 2032.0363226 1776.4022996 1994.5461628
6 1700 0.0112416 0.009993 0.00568332 0.0065024 0.005425 1700 1028.3233703 1156.8097668 2034.0223672 1777.8051181 2130.875576
6 1800 0.0125997 0.009981 0.00636608 0.00728476 0.006033 1800 1028.595919 1298.4670875 2035.7896853 1779.0565509 2148.1849826
6 1900 0.0140353 0.011155 0.00708755 0.00811155 0.005385 1900 1028.8344389 1294.4867772 2037.3753977 1780.1776479 2681.5227484
6 2000 0.0155484 0.012302 0.00784774 0.00898278 0.005548 2000 1029.0447892 1300.6015282 2038.8035281 1781.1857799 2883.9221341
6 2100 0.0171388 0.013212 0.00864664 0.0111589 0.008331 2100 1029.2435876 1335.1498638 2040.0988129 1580.800975 2117.39287
6 2200 0.0188067 0.015055 0.00948427 0.0122419 0.008893 2200 1029.4203661 1285.9515111 2041.2746579 1581.4538593 2176.9931407
6 2300 0.020552 0.015685 0.0103606 0.013375 0.010295 2300 1029.5834955 1349.0596111 2042.3527595 1582.0560748 2055.3666829
6 2400 0.0223748 0.016361 0.0112757 0.0145583 0.009831 2400 1029.7298747 1408.226881 2043.3321213 1582.6023643 2343.6069576
6 2500 0.024275 0.01799 0.0122294 0.0157916 0.009617 2500 1029.8661174 1389.6609227 2044.2540108 1583.1201398 2599.5632734
6 2600 0.0262526 0.019753 0.0132219 0.0170752 0.012528 2600 1029.9932197 1368.905989 2045.0918552 1583.5832084 2158.3652618
6 2700 0.0283076 0.021 0.0142531 0.0184089 0.012752 2700 1030.1120547 1388.5714286 2045.8707229 1584.0164268 2286.7001255
6 2800 0.0304401 0.023341 0.015323 0.0197927 0.01447 2800 1030.2200058 1343.558545 2046.5966195 1584.4225396 2167.2425708
6 2900 0.03265 0.023394 0.0164317 0.0212267 0.014786 2900 1030.3215926 1437.9755493 2047.262304 1584.7965063 2275.1251184
6 3000 0.0349373 0.026074 0.017579 0.0227108 0.013979 3000 1030.4173476 1380.6857406 2047.8980602 1585.1489159 2575.2915087
6 3100 0.0373021 0.027566 0.0187651 0.024245 0.014651 3100 1030.504985 1394.4714503 2048.4836212 1585.4815426 2623.711692
6 3200 0.0397443 0.030141 0.0199899 0.0258294 0.017486 3200 1030.588034 1358.9462858 2049.0347626 1585.7898364 2342.4453849
6 3300 0.042264 0.029753 0.0212534 0.027464 0.017987 3300 1030.6643952 1464.054045 2049.5544242 1586.0763181 2421.749041
6 3400 0.044861 0.031133 0.0225556 0.0291486 0.017667 3400 1030.739395 1485.2407413 2050.0452216 1586.3540616 2617.3091074
6 3500 0.0475355 0.033144 0.0238965 0.0308835 0.024469 3500 1030.8085536 1478.3972966 2050.5094888 1586.6077355 2002.5338183
6 3600 0.0502875 0.034859 0.0252761 0.0326685 0.019712 3600 1030.8724832 1487.1338822 2050.9493158 1586.8497176 2629.8701299
6 3700 0.0531168 0.036516 0.0266945 0.0345036 0.022381 3700 1030.9355985 1499.6166064 2051.3588942 1587.0807684 2446.7181985
6 3800 0.0560236 0.038555 0.0281516 0.0363888 0.021739 3800 1030.9940811 1498.1195694 2051.7483909 1587.3015873 2656.9759419
6 3900 0.0590079 0.038458 0.0296474 0.0383242 0.020659 3900 1031.048385 1581.9855427 2052.1192415 1587.508676 2944.9634542
6 4000 0.0620695 0.044265 0.0311819 0.0403098 0.025894 4000 1031.1022322 1445.8375692 2052.4727486 1587.7032384 2471.615046
6 4100 0.0652086 0.044962 0.0327551 0.0423455 0.024896 4100 1031.1523327 1495.4850763 2052.8100967 1587.8900946 2700.8354756
6 4200 0.0684252 0.047436 0.0343671 0.0444313 0.027504 4200 1031.1990319 1487.4778649 2053.1263912 1588.0696716 2565.4450262
6 4300 0.0717191 0.049338 0.0360177 0.0465673 0.028572 4300 1031.245512 1499.0473874 2053.434839 1588.2389574 2588.548229
6 4400 0.0750905 0.05329 0.0377071 0.0487534 0.028882 4400 1031.2889114 1453.1807093 2053.7246301 1588.4020397 2681.2547608
6 4500 0.0785393 0.051573 0.0394352 0.0509897 0.030245 4500 1031.3308115 1570.5892618 2054.0025155 1588.5561202 2678.1286163
6 4600 0.0820656 0.054457 0.041202 0.0532761 0.029895 4600 1031.3700259 1554.2538149 2054.2692102 1588.7048789 2831.2426827
6 4700 0.0856693 0.057892 0.0430075 0.0556127 0.032715 4700 1031.407984 1526.2903337 2054.5253735 1588.8457133 2700.901727
6 4800 0.0893504 0.060025 0.0448517 0.0579994 0.033721 4800 1031.4447389 1535.3602666 2054.771614 1588.981955 2733.0150351
6 4900 0.093109 0.062288 0.0467347 0.0604362 0.033146 4900 1031.479234 1541.8700231 2055.0040976 1589.1138093 2897.4838593
6 5000 0.0969449 0.064828 0.0486564 0.0629232 0.03839 5000 1031.5137774 1542.5433455 2055.2280892 1589.2389453 2604.8450117
6 5100 0.100858 0.067548 0.0506168 0.0654603 0.037822 5100 1031.5493069 1540.2380529 2055.4440423 1589.3602687 2750.7799693
6 5200 0.104849 0.068709 0.0526159 0.0680476 0.039601 5200 1031.5787466 1574.1751444 2055.6523788 1589.4756024 2731.2441605
6 5300 0.108917 0.072109 0.0546537 0.070685 0.040267 5300 1031.6112269 1558.196619 2055.8534921 1589.587607 2790.3742519
6 5400 0.113063 0.075874 0.0567302 0.0733726 0.042155 5400 1031.6372288 1537.2854996 2056.0477488 1589.6942455 2766.9315621
6 5500 0.117286 0.081043 0.0588455 0.0761103 0.042619 5500 1031.6661835 1493.0345619 2056.2319973 1589.7979643 2839.1093174
6 5600 0.121587 0.080813 0.0609994 0.0788981 0.044215 5600 1031.6892431 1552.2255083 2056.4136696 1589.8988696 2837.0462513
6 5700 0.125965 0.083273 0.0631921 0.0817361 0.046523 5700 1031.715159 1560.6499105 2056.5861872 1589.9951184 2793.4569998
6 5800 0.13042 0.087437 0.0654235 0.0846243 0.049223 5800 1031.7435976 1538.9366058 2056.7533073 1590.0870081 2733.6814091
6 5900 0.134953 0.092316 0.0676936 0.0875626 0.048616 5900 1031.766615 1508.2975866 2056.9152771 1590.1766279 2864.0776699
6 6000 0.139564 0.100131 0.0700025 0.090551 0.051495 6000 1031.7847009 1438.116068 2057.0693904 1590.2640501 2796.3879988
6 6100 0.144251 0.098366 0.07235 0.0935896 0.054284 6100 1031.8126044 1513.1244536 2057.2218383 1590.3476455 2741.8760592
6 6200 0.149016 0.099027 0.0747363 0.0966783 0.054904 6200 1031.8355076 1552.7078474 2057.3670358 1590.4292897 2800.5245519
6 6300 0.153859 0.103582 0.0771612 0.0998171 0.056632 6300 1031.8538402 1532.6987314 2057.5107697 1590.509041 2803.3620568
6 6400 0.158779 0.134035 0.0796249 0.103006 0.060324 6400 1031.8744922 1222.3672921 2057.6477961 1590.5869561 2716.0002652
6 6500 0.163777 0.110093 0.0821273 0.106245 0.059232 6500 1031.8909249 1535.065808 2057.7810302 1590.66309 2853.1874662
6 6600 0.168851 0.113277 0.0846685 0.109535 0.060745 6600 1031.9157127 1538.1763288 2057.9081949 1590.724426 2868.3842292
6 6700 0.174004 0.114914 0.0872483 0.112874 0.0651 6700 1031.9303005 1562.5598273 2058.0343686 1590.8003615 2758.218126
6 6800 0.179234 0.119447 0.0898669 0.116264 0.072951 6800 1031.9470636 1548.4691955 2058.1548935 1590.8621757 2535.4004743
6 6900 0.184541 0.122286 0.0925241 0.119703 0.067397 6900 1031.9657962 1557.33281 2058.2745468 1590.9375705 2825.6450584
6 7000 0.189925 0.126017 0.0952201 0.123193 0.072179 7000 1031.9863104 1555.3457073 2058.3889326 1590.9994886 2715.471259
6 7100 0.195387 0.143582 0.0979548 0.126733 0.079678 7100 1032.0031527 1404.3543063 2058.5004512 1591.061523 2530.6860112
6 7200 0.200927 0.133363 0.100728 0.130323 0.081495 7200 1032.016603 1554.854045 2058.6132952 1591.1235929 2544.4505798
6 7300 0.206544 0.137169 0.10354 0.133964 0.075611 7300 1032.0319157 1553.9954363 2058.7212671 1591.1737482 2819.1665234
6 7400 0.212238 0.140354 0.106391 0.137654 0.078585 7400 1032.0489262 1560.6252761 2058.8207649 1591.2359975 2787.3003754
6 7500 0.21801 0.144679 0.109281 0.141395 0.080369 7500 1032.0627494 1555.1669558 2058.9123452 1591.2868206 2799.5869054
6 7600 0.223859 0.148867 0.112209 0.145185 0.08263 7600 1032.0782278 1551.9893596 2059.014874 1591.3489686 2796.078906
6 7700 0.229786 0.152904 0.115176 0.149026 0.092545 7700 1032.0907279 1551.0385601 2059.1095367 1591.4001584 2562.6451996
6 7800 0.23579 0.158896 0.118182 0.152917 0.091375 7800 1032.1048391 1531.5678179 2059.196832 1591.4515718 2663.3105335
6 7900 0.241871 0.161911 0.121226 0.156858 0.093813 7900 1032.1204278 1541.8347117 2059.2942108 1591.503143 2661.0384488
6 8000 0.24803 0.170071 0.124309 0.16085 0.096138 8000 1032.1332097 1505.2536882 2059.3842763 1591.5449176 2662.8388358
6 8100 0.254267 0.171334 0.127431 0.164891 0.101368 8100 1032.1433768 1531.7450127 2059.4674765 1591.5968731 2588.9827164
6 8200 0.260581 0.170876 0.130592 0.168983 0.101462 8200 1032.1550689 1574.0068822 2059.5442294 1591.6393957 2650.8446512
6 8300 0.266972 0.178343 0.133791 0.173124 0.102544 8300 1032.1681674 1545.1125079 2059.6303189 1591.6915044 2687.2366984
6 8400 0.27344 0.184055 0.137029 0.177316 0.106282 8400 1032.1825629 1533.4546739 2059.7099884 1591.7345304 2655.5766734
6 8500 0.279987 0.186446 0.140306 0.181558 0.106434 8500 1032.1907803 1550.0466623 2059.7836158 1591.7778341 2715.2977432
6 8600 0.28661 0.193379 0.143621 0.185851 0.109678 8600 1032.2040403 1529.8455365 2059.865897 1591.8127963 2697.3504258
6 8700 0.293311 0.194282 0.146976 0.190193 0.110732 8700 1032.2149527 1558.3533215 2059.9281515 1591.8566929 2734.1689846
6 8800 0.300089 0.208094 0.150368 0.194585 0.117123 8800 1032.227106 1488.5580555 2060.0127687 1591.9007118 2644.7409988
6 8900 0.306945 0.198312 0.1538 0.199028 0.115091 8900 1032.2370457 1597.6844568 2060.0780234 1591.9368129 2752.9520119
6 9000 0.313879 0.206249 0.15727 0.203521 0.122676 9000 1032.2449097 1570.9167075 2060.1513321 1591.9733099 2641.1033943
6 9100 0.320889 0.21162 0.160779 0.208064 0.119872 9100 1032.2572603 1565.2584822 2060.2193072 1592.0101507 2763.2808329
6 9200 0.327977 0.221676 0.164327 0.212657 0.123587 9200 1032.2675066 1527.2740396 2060.2822421 1592.0472874 2739.4467056
6 9300 0.335143 0.221279 0.167914 0.2173 0.131591 9300 1032.2757748 1563.4560894 2060.3404124 1592.0846756 2629.0551785
6 9400 0.342386 0.225179 0.171539 0.221993 0.127726 9400 1032.2851986 1569.5957438 2060.4060884 1592.1222741 2767.1734807
6 9500 0.349706 0.227943 0.175203 0.226737 0.135098 9500 1032.2956998 1583.7292656 2060.4670011 1592.1530231 2672.1343025
6 9600 0.357104 0.257354 0.178905 0.23153 0.131361 9600 1032.3043147 1432.4238209 2060.5349208 1592.1910768 2806.3123758
6 9700 0.36458 0.239173 0.182647 0.236374 0.134641 9700 1032.3111526 1573.5889921 2060.586815 1592.2224949 2795.28524
6 9800 0.372132 0.24281 0.186427 0.241268 0.138543 9800 1032.3218643 1582.1424159 2060.6457219 1592.2542567 2772.8575244
6 9900 0.379762 0.247763 0.190245 0.246212 0.139492 9900 1032.3307756 1582.3185867 2060.7111882 1592.2863224 2810.4837553
6 10000 0.38747 0.254743 0.194103 0.251206 0.145424 10000 1032.3379875 1570.2099763 2060.7615544 1592.3186548 2750.5776213
3 100 4.69E-06 4.69E-06 0.000153
3 200 2.43E-05 2.43E-05 0.000158
3 300 5.37E-05 5.37E-05 0.000257
3 400 9.44E-05 9.44E-05 0.000392
3 500 0.000146562 0.000146562 0.000537
3 600 0.000210157 0.000210157 0.001086
3 700 0.000285181 0.000285181 0.00114
3 800 0.000371632 0.000371632 0.00221
3 900 0.000469511 0.000469511 0.003011
3 1000 0.000578818 0.000578818 0.003158
3 1100 0.00471871 0.00471871 0.003906
3 1200 0.00561228 0.00561228 0.004705
3 1300 0.00658327 0.00658327 0.005373
3 1400 0.0076317 0.0076317 0.006072
3 1500 0.00875757 0.00875757 0.007081
3 1600 0.00996086 0.00996086 0.008874
3 1700 0.0112416 0.0112416 0.009993
3 1800 0.0125997 0.0125997 0.009981
3 1900 0.0140353 0.0140353 0.011155
3 2000 0.0155484 0.0155484 0.012302
3 2100 0.0171388 0.0171388 0.013212
3 2200 0.0188067 0.0188067 0.015055
3 2300 0.020552 0.020552 0.015685
3 2400 0.0223748 0.0223748 0.016361
3 2500 0.024275 0.024275 0.01799
3 2600 0.0262526 0.0262526 0.019753
3 2700 0.0283076 0.0283076 0.021
3 2800 0.0304401 0.0304401 0.023341
3 2900 0.03265 0.03265 0.023394
3 3000 0.0349373 0.0349373 0.026074
3 3100 0.0373021 0.0373021 0.027566
3 3200 0.0397443 0.0397443 0.030141
3 3300 0.042264 0.042264 0.029753
3 3400 0.044861 0.044861 0.031133
3 3500 0.0475355 0.0475355 0.033144
3 3600 0.0502875 0.0502875 0.034859
3 3700 0.0531168 0.0531168 0.036516
3 3800 0.0560236 0.0560236 0.038555
3 3900 0.0590079 0.0590079 0.038458
3 4000 0.0620695 0.0620695 0.044265
3 4100 0.0652086 0.0652086 0.044962
3 4200 0.0684252 0.0684252 0.047436
3 4300 0.0717191 0.0717191 0.049338
3 4400 0.0750905 0.0750905 0.05329
3 4500 0.0785393 0.0785393 0.051573
3 4600 0.0820656 0.0820656 0.054457
3 4700 0.0856693 0.0856693 0.057892
3 4800 0.0893504 0.0893504 0.060025
3 4900 0.093109 0.093109 0.062288
3 5000 0.0969449 0.0969449 0.064828
3 5100 0.100858 0.100858 0.067548
3 5200 0.104849 0.104849 0.068709
3 5300 0.108917 0.108917 0.072109
3 5400 0.113063 0.113063 0.075874
3 5500 0.117286 0.117286 0.081043
3 5600 0.121587 0.121587 0.080813
3 5700 0.125965 0.125965 0.083273
3 5800 0.13042 0.13042 0.087437
3 5900 0.134953 0.134953 0.092316
3 6000 0.139564 0.139564 0.100131
3 6100 0.144251 0.144251 0.098366
3 6200 0.149016 0.149016 0.099027
3 6300 0.153859 0.153859 0.103582
3 6400 0.158779 0.158779 0.134035
3 6500 0.163777 0.163777 0.110093
3 6600 0.168851 0.168851 0.113277
3 6700 0.174004 0.174004 0.114914
3 6800 0.179234 0.179234 0.119447
3 6900 0.184541 0.184541 0.122286
3 7000 0.189925 0.189925 0.126017
3 7100 0.195387 0.195387 0.143582
3 7200 0.200927 0.200927 0.133363
3 7300 0.206544 0.206544 0.137169
3 7400 0.212238 0.212238 0.140354
3 7500 0.21801 0.21801 0.144679
3 7600 0.223859 0.223859 0.148867
3 7700 0.229786 0.229786 0.152904
3 7800 0.23579 0.23579 0.158896
3 7900 0.241871 0.241871 0.161911
3 8000 0.24803 0.24803 0.170071
3 8100 0.254267 0.254267 0.171334
3 8200 0.260581 0.260581 0.170876
3 8300 0.266972 0.266972 0.178343
3 8400 0.27344 0.27344 0.184055
3 8500 0.279987 0.279987 0.186446
3 8600 0.28661 0.28661 0.193379
3 8700 0.293311 0.293311 0.194282
3 8800 0.300089 0.300089 0.208094
3 8900 0.306945 0.306945 0.198312
3 9000 0.313879 0.313879 0.206249
3 9100 0.320889 0.320889 0.21162
3 9200 0.327977 0.327977 0.221676
3 9300 0.335143 0.335143 0.221279
3 9400 0.342386 0.342386 0.225179
3 9500 0.349706 0.349706 0.227943
3 9600 0.357104 0.357104 0.257354
3 9700 0.36458 0.36458 0.239173
3 9800 0.372132 0.372132 0.24281
3 9900 0.379762 0.379762 0.247763
3 10000 0.38747 0.38747 0.254743
4 100 4.98E-06 4.98E-06 0.000142
4 200 2.50E-05 2.50E-05 0.000154
4 300 5.47E-05 5.47E-05 0.000231
4 400 9.58E-05 9.58E-05 0.00034
4 500 0.000148276 0.000148276 0.000515
4 600 0.000212214 0.000212214 0.0007
4 700 0.000287581 0.000287581 0.001362
4 800 0.000374375 0.000374375 0.001897
4 900 0.000472597 0.000472597 0.002523
4 1000 0.000582246 0.000582246 0.002361
4 1100 0.00474426 0.00474426 0.003121
4 1200 0.00564015 0.00564015 0.003734
4 1300 0.00661347 0.00661347 0.004642
4 1400 0.00766423 0.00766423 0.005388
4 1500 0.00879241 0.00879241 0.005855
4 1600 0.00999803 0.00999803 0.006937
4 1700 0.0112811 0.0112811 0.00763
4 1800 0.0126416 0.0126416 0.008736
4 1900 0.0140795 0.0140795 0.010046
4 2000 0.0155948 0.0155948 0.010731
4 2100 0.0171876 0.0171876 0.014524
4 2200 0.0188578 0.0188578 0.012852
4 2300 0.0206054 0.0206054 0.014009
4 2400 0.0224305 0.0224305 0.016462
4 2500 0.024333 0.024333 0.017362
4 2600 0.026313 0.026313 0.017089
4 2700 0.0283703 0.0283703 0.020353
4 2800 0.0305051 0.0305051 0.020761
4 2900 0.0327174 0.0327174 0.021499
4 3000 0.035007 0.035007 0.024416
4 3100 0.0373741 0.0373741 0.025199
4 3200 0.0398187 0.0398187 0.027761
4 3300 0.0423406 0.0423406 0.030126
4 3400 0.04494 0.04494 0.02992
4 3500 0.0476168 0.0476168 0.031461
4 3600 0.0503711 0.0503711 0.033885
4 3700 0.0532028 0.0532028 0.035946
4 3800 0.0561119 0.0561119 0.038736
4 3900 0.0590985 0.0590985 0.037212
4 4000 0.0621625 0.0621625 0.040698
4 4100 0.0653039 0.0653039 0.044532
4 4200 0.0685227 0.0685227 0.044633
4 4300 0.071819 0.071819 0.0454
4 4400 0.0751927 0.0751927 0.048476
4 4500 0.0786439 0.0786439 0.050258
4 4600 0.0821725 0.0821725 0.055013
4 4700 0.0857785 0.0857785 0.056282
4 4800 0.0894619 0.0894619 0.056828
4 4900 0.0932228 0.0932228 0.060008
4 5000 0.0970611 0.0970611 0.065059
4 5100 0.100977 0.100977 0.065096
4 5200 0.10497 0.10497 0.067545
4 5300 0.109041 0.109041 0.069145
4 5400 0.113189 0.113189 0.076047
4 5500 0.117414 0.117414 0.075214
4 5600 0.121717 0.121717 0.077553
4 5700 0.126097 0.126097 0.079759
4 5800 0.130555 0.130555 0.08473
4 5900 0.13509 0.13509 0.084799
4 6000 0.139703 0.139703 0.091075
4 6100 0.144393 0.144393 0.095622
4 6200 0.14916 0.14916 0.099264
4 6300 0.154005 0.154005 0.099825
4 6400 0.158928 0.158928 0.103831
4 6500 0.163928 0.163928 0.105096
4 6600 0.169005 0.169005 0.110473
4 6700 0.174159 0.174159 0.11121
4 6800 0.179391 0.179391 0.114842
4 6900 0.184701 0.184701 0.118737
4 7000 0.190088 0.190088 0.12168
4 7100 0.195552 0.195552 0.122448
4 7200 0.201094 0.201094 0.128105
4 7300 0.206713 0.206713 0.132585
4 7400 0.21241 0.21241 0.139777
4 7500 0.218184 0.218184 0.141151
4 7600 0.224036 0.224036 0.147102
4 7700 0.229965 0.229965 0.149473
4 7800 0.235971 0.235971 0.149631
4 7900 0.242055 0.242055 0.156229
4 8000 0.248216 0.248216 0.156071
4 8100 0.254455 0.254455 0.165584
4 8200 0.260771 0.260771 0.167796
4 8300 0.267165 0.267165 0.169234
4 8400 0.273636 0.273636 0.175893
4 8500 0.280184 0.280184 0.180811
4 8600 0.28681 0.28681 0.181922
4 8700 0.293513 0.293513 0.188664
4 8800 0.300294 0.300294 0.193007
4 8900 0.307152 0.307152 0.19109
4 9000 0.314088 0.314088 0.199983
4 9100 0.321101 0.321101 0.203201
4 9200 0.328191 0.328191 0.208857
4 9300 0.335359 0.335359 0.211957
4 9400 0.342604 0.342604 0.213224
4 9500 0.349927 0.349927 0.220172
4 9600 0.357327 0.357327 0.226959
4 9700 0.364805 0.364805 0.227261
4 9800 0.37236 0.37236 0.234706
4 9900 0.379992 0.379992 0.24146
4 10000 0.387702 0.387702 0.244702
5 100 4.98E-06 4.98E-06 7.50E-05
5 200 2.50E-05 2.50E-05 0.000179
5 300 5.47E-05 5.47E-05 0.000179
5 400 9.58E-05 9.58E-05 0.0003
5 500 0.000148276 0.000148276 0.000477
5 600 0.000212214 0.000212214 0.000716
5 700 0.000287581 0.000287581 0.000887
5 800 0.000374375 0.000374375 0.001191
5 900 0.000472597 0.000472597 0.001699
5 1000 0.000582246 0.000582246 0.001963
5 1100 0.00474426 0.00474426 0.002722
5 1200 0.00564015 0.00564015 0.003113
5 1300 0.00661347 0.00661347 0.00385
5 1400 0.00766423 0.00766423 0.004705
5 1500 0.00879241 0.00879241 0.006047
5 1600 0.00999803 0.00999803 0.00748
5 1700 0.0112811 0.0112811 0.007118
5 1800 0.0126416 0.0126416 0.008291
5 1900 0.0140795 0.0140795 0.009702
5 2000 0.0155948 0.0155948 0.011037
5 2100 0.0171876 0.0171876 0.011624
5 2200 0.0188578 0.0188578 0.012443
5 2300 0.0206054 0.0206054 0.014039
5 2400 0.0224305 0.0224305 0.014004
5 2500 0.024333 0.024333 0.015346
5 2600 0.026313 0.026313 0.016622
5 2700 0.0283703 0.0283703 0.017761
5 2800 0.0305051 0.0305051 0.019111
5 2900 0.0327174 0.0327174 0.020377
5 3000 0.035007 0.035007 0.022576
5 3100 0.0373741 0.0373741 0.023311
5 3200 0.0398187 0.0398187 0.025044
5 3300 0.0423406 0.0423406 0.026121
5 3400 0.04494 0.04494 0.028098
5 3500 0.0476168 0.0476168 0.030698
5 3600 0.0503711 0.0503711 0.031442
5 3700 0.0532028 0.0532028 0.0331
5 3800 0.0561119 0.0561119 0.03636
5 3900 0.0590985 0.0590985 0.038928
5 4000 0.0621625 0.0621625 0.038672
5 4100 0.0653039 0.0653039 0.042162
5 4200 0.0685227 0.0685227 0.044112
5 4300 0.071819 0.071819 0.044416
5 4400 0.0751927 0.0751927 0.047492
5 4500 0.0786439 0.0786439 0.050174
5 4600 0.0821725 0.0821725 0.051435
5 4700 0.0857785 0.0857785 0.053204
5 4800 0.0894619 0.0894619 0.055043
5 4900 0.0932228 0.0932228 0.058601
5 5000 0.0970611 0.0970611 0.059793
5 5100 0.100977 0.100977 0.063391
5 5200 0.10497 0.10497 0.067525
5 5300 0.109041 0.109041 0.069822
5 5400 0.113189 0.113189 0.070287
5 5500 0.117414 0.117414 0.073467
5 5600 0.121717 0.121717 0.07767
5 5700 0.126097 0.126097 0.078002
5 5800 0.130555 0.130555 0.083166
5 5900 0.13509 0.13509 0.086624
5 6000 0.139703 0.139703 0.089
5 6100 0.144393 0.144393 0.094021
5 6200 0.14916 0.14916 0.094185
5 6300 0.154005 0.154005 0.09868
5 6400 0.158928 0.158928 0.101093
5 6500 0.163928 0.163928 0.102494
5 6600 0.169005 0.169005 0.10845
5 6700 0.174159 0.174159 0.108222
5 6800 0.179391 0.179391 0.112249
5 6900 0.184701 0.184701 0.118593
5 7000 0.190088 0.190088 0.120917
5 7100 0.195552 0.195552 0.125017
5 7200 0.201094 0.201094 0.126349
5 7300 0.206713 0.206713 0.130459
5 7400 0.21241 0.21241 0.133054
5 7500 0.218184 0.218184 0.135616
5 7600 0.224036 0.224036 0.142696
5 7700 0.229965 0.229965 0.14433
5 7800 0.235971 0.235971 0.149429
5 7900 0.242055 0.242055 0.151873
5 8000 0.248216 0.248216 0.155645
5 8100 0.254455 0.254455 0.160177
5 8200 0.260771 0.260771 0.16344
5 8300 0.267165 0.267165 0.168061
5 8400 0.273636 0.273636 0.174249
5 8500 0.280184 0.280184 0.17425
5 8600 0.28681 0.28681 0.180314
5 8700 0.293513 0.293513 0.182358
5 8800 0.300294 0.300294 0.193029
5 8900 0.307152 0.307152 0.195334
5 9000 0.314088 0.314088 0.199515
5 9100 0.321101 0.321101 0.202743
5 9200 0.328191 0.328191 0.203787
5 9300 0.335359 0.335359 0.20803
5 9400 0.342604 0.342604 0.218191
5 9500 0.349927 0.349927 0.216665
5 9600 0.357327 0.357327 0.226423
5 9700 0.364805 0.364805 0.230515
5 9800 0.37236 0.37236 0.230479
5 9900 0.379992 0.379992 0.234907
5 10000 0.387702 0.387702 0.243543
[Plot: Mflops (0 to 9,000) versus Matrix Order (0 to 10000). Legend: Predicted - 1, Actual - 1, Best - 2, Worst - 2, Actual - 2]
(b) AATX
Figure 6.4: Comparison of actual and predicted runtimes for matrix kernels on the Work machine.
vid | size | runtime (s): Predicted - 1, Actual - 1, Predicted - 2, Actual - 2 | size | Mflops: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2
5 100000 0.000171417 0.000911 0.000114279 0.000343 100000 1750.118133 329.30845225 2625.1542278 874.63556851
5 200000 0.000342834 0.002285 0.000228557 0.000977 200000 1750.118133 262.58205689 2625.1657136 614.12487206
5 300000 0.0029037 0.003146 0.00232297 0.002214 300000 309.94937494 286.0775588 387.43505082 406.50406504
5 400000 0.00464592 0.004099 0.00309729 0.002671 400000 258.29114578 292.75433032 387.43546778 449.26993635
5 500000 0.0058074 0.006235 0.00387161 0.003487 500000 258.29114578 240.57738573 387.43571796 430.16919989
5 600000 0.00696889 0.006628 0.00464593 0.004556 600000 258.29077514 271.57513579 387.43588474 395.0834065
5 700000 0.00813037 0.00789 0.00542025 0.005115 700000 258.29082809 266.15969582 387.43600387 410.55718475
5 800000 0.00929185 0.009099 0.00619457 0.0059 800000 258.2908678 263.76524893 387.43609322 406.77966102
5 900000 0.0104533 0.010646 0.00696889 0.006612 900000 258.29163996 253.61638174 387.43616272 408.34845735
5 1000000 0.0116148 0.012552 0.00774321 0.007553 1000000 258.29114578 239.00573614 387.43621831 397.19316828
5 1100000 0.0127763 0.013531 0.00851753 0.008009 1100000 258.29074145 243.88441357 387.4362638 412.03645898
5 1200000 0.0139378 0.013907 0.00929185 0.008817 1200000 258.29040451 258.86244337 387.43630171 408.30214359
5 1300000 0.0150993 0.014718 0.0100662 0.009539 1300000 258.29011941 264.98165512 387.43517911 408.84788762
5 1400000 0.0162607 0.016071 0.0108405 0.010185 1400000 258.29146347 261.34030241 387.43600387 412.37113402
5 1500000 0.0174222 0.017491 0.0116148 0.01098 1500000 258.29114578 257.27517009 387.43671867 409.83606557
5 1600000 0.0185837 0.018401 0.0123891 0.011756 1600000 258.2908678 260.85538829 387.43734412 408.30214359
5 1700000 0.0197452 0.019518 0.0131635 0.012303 1700000 258.29062253 261.29726406 387.43495271 414.53304072
5 1800000 0.0209067 0.020665 0.0139378 0.013237 1800000 258.29040451 261.31139608 387.43560677 407.94742011
5 1900000 0.0220681 0.02168 0.0147121 0.013953 1900000 258.29137987 262.91512915 387.43619198 408.514298
5 2000000 0.0232296 0.023355 0.0154864 0.014449 2000000 258.29114578 256.90430315 387.43671867 415.25365077
5 2100000 0.0243911 0.024082 0.0162607 0.015383 2100000 258.29093399 261.60617889 387.4371952 409.54300202
5 2200000 0.0255526 0.025225 0.0170351 0.016018 2200000 258.29074145 261.64519326 387.43535406 412.03645898
5 2300000 0.0267141 0.026337 0.0178094 0.016609 2300000 258.29056566 261.988837 387.43584848 415.43741345
5 2400000 0.0278755 0.029087 0.0185837 0.017447 2400000 258.2913311 247.53326228 387.43630171 412.67839743
5 2500000 0.029037 0.028878 0.019358 0.01828 2500000 258.29114578 259.71327654 387.43671867 410.28446389
5 2600000 0.0301985 0.029728 0.0201323 0.019096 2600000 258.29097472 262.37890205 387.43710356 408.46250524
5 2700000 0.03136 0.030924 0.0209067 0.019681 2700000 258.29081633 261.93247963 387.43560677 411.56445303
5 2800000 0.0325215 0.032042 0.021681 0.020278 2800000 258.29066925 262.15592035 387.43600387 414.2420357
5 2900000 0.0336829 0.033131 0.0224553 0.021279 2900000 258.29129915 262.59394525 387.4363736 408.85379952
5 3000000 0.0348444 0.034573 0.0232296 0.021977 3000000 258.29114578 260.31874584 387.43671867 409.51904264
5 3100000 0.0360059 0.035581 0.0240039 0.022679 3100000 258.29100231 261.37545319 387.43704148 410.07099078
5 3200000 0.0371674 0.036581 0.0247783 0.023356 3200000 258.2908678 262.4313168 387.4357805 411.02928584
5 3300000 0.0383289 0.038645 0.0255526 0.024137 3300000 258.29074145 256.17803079 387.43611218 410.15867755
5 3400000 0.0394904 0.039254 0.0263269 0.024665 3400000 258.29062253 259.84613033 387.43642434 413.5414555
5 3500000 0.0406518 0.04183 0.0271012 0.025613 3500000 258.29114578 251.01601721 387.43671867 409.94807324
5 3600000 0.0418133 0.041625 0.0278755 0.025793 3600000 258.29102223 259.45945946 387.43699665 418.71825689
5 3700000 0.0429748 0.042502 0.0286499 0.026686 3700000 258.29090537 261.16418051 387.43590728 415.94843738
5 3800000 0.0441363 0.043562 0.0294242 0.027459 3800000 258.29079465 261.69597355 387.43619198 415.16442696
5 3900000 0.0452978 0.044808 0.0301985 0.028343 3900000 258.29068961 261.11408677 387.43646208 412.80033871
5 4000000 0.0464592 0.045797 0.0309728 0.029132 4000000 258.29114578 262.02589689 387.43671867 411.91816559
5 4100000 0.0476207 0.048959 0.0317471 0.029974 4100000 258.2910373 251.23062154 387.43696275 410.35564156
5 4200000 0.0487822 0.048134 0.0325215 0.030524 4200000 258.29093399 261.76922757 387.43600387 412.78993579
5 4300000 0.0499437 0.049138 0.0332958 0.031252 4300000 258.29083548 262.52594733 387.43625322 412.77358249
5 4400000 0.0511052 0.051565 0.0340701 0.032049 4400000 258.29074145 255.98758848 387.43649123 411.8693251
5 4500000 0.0522666 0.05266 0.0348444 0.032482 4500000 258.29114578 256.36156476 387.43671867 415.61480204
5 4600000 0.0534281 0.053506 0.0356188 0.033165 4600000 258.29104909 257.91500019 387.43584848 416.10131162
5 4700000 0.0545896 0.053734 0.0363931 0.034322 4700000 258.29095652 262.40369226 387.43607992 410.81522056
5 4800000 0.0557511 0.056061 0.0371674 0.034823 4800000 258.2908678 256.86305988 387.43630171 413.51980013
5 4900000 0.0569126 0.05615 0.0379417 0.035424 4900000 258.29078271 261.79875334 387.43651444 414.97289973
5 5000000 0.058074 0.057569 0.038716 0.036537 5000000 258.29114578 260.55689694 387.43671867 410.5427375
5 5100000 0.0592355 0.059041 0.0394904 0.036803 5100000 258.29105857 259.1419522 387.4359338 415.72697878
5 5200000 0.060397 0.060013 0.0402647 0.03789 5200000 258.29097472 259.94367887 387.43614133 411.71813143
5 5300000 0.0615585 0.060768 0.041039 0.038456 5300000 258.29089403 261.65086888 387.43634104 413.45953817
5 5400000 0.06272 0.061844 0.0418133 0.039035 5400000 258.29081633 261.94942112 387.43653335 415.01216857
5 5500000 0.0638814 0.063311 0.0425876 0.040021 5500000 258.29114578 260.618218 387.43671867 412.28355114
5 5600000 0.0650429 0.064704 0.043362 0.040461 5600000 258.29106636 259.64391691 387.43600387 415.21465115
5 5700000 0.0662044 0.065971 0.0441363 0.041202 5700000 258.29098972 259.20480211 387.43619198 415.02839668
5 5800000 0.0673659 0.067536 0.0449106 0.042311 5800000 258.29091573 257.64036958 387.4363736 411.24057574
5 5900000 0.0685274 0.068383 0.0456849 0.042886 5900000 258.29084425 258.83626047 387.43654906 412.72210045
5 6000000 0.0696889 0.068521 0.0464592 0.043254 6000000 258.29077514 262.69318895 387.43671867 416.14648356
5 6100000 0.0708503 0.07057 0.0472336 0.044445 6100000 258.29107287 259.31699022 387.43606246 411.74485319
5 6200000 0.0720118 0.071282 0.0480079 0.044669 6200000 258.29100231 260.93543952 387.43623445 416.39615841
5 6300000 0.0731733 0.072234 0.0487822 0.045786 6300000 258.29093399 261.64963867 387.43640098 412.78993579
5 6400000 0.0743348 0.073687 0.0495565 0.045852 6400000 258.2908678 260.56156446 387.43656231 418.73855012
5 6500000 0.0754963 0.075197 0.0503308 0.046973 6500000 258.29080366 259.31885581 387.43671867 415.13209716
5 6600000 0.0766577 0.075877 0.0511052 0.047977 6600000 258.29107839 260.94864056 387.43611218 412.69775101
5 6700000 0.0778192 0.077377 0.0518795 0.048522 6700000 258.29101301 259.76711426 387.43627059 414.2450847
5 6800000 0.0789807 0.079126 0.0526538 0.049777 6800000 258.29094956 257.81664687 387.43642434 409.82783213
5 6900000 0.0801422 0.079632 0.0534281 0.050028 6900000 258.29088795 259.94575045 387.43657364 413.76828976
5 7000000 0.0813037 0.080709 0.0542024 0.050755 7000000 258.29082809 260.19403041 387.43671867 413.75233967
5 7100000 0.0824651 0.081227 0.0549768 0.051471 7100000 258.29108314 262.22807687 387.43615489 413.82526083
5 7200000 0.0836266 0.082787 0.0557511 0.052446 7200000 258.29102223 260.91052943 387.43630171 411.85219082
5 7300000 0.0847881 0.084455 0.0565254 0.052574 7300000 258.290963 259.30969155 387.4364445 416.55571195
5 7400000 0.0859496 0.086516 0.0572997 0.053534 7400000 258.29090537 256.59993527 387.43658344 414.68972989
5 7500000 0.0871111 0.087663 0.058074 0.054777 7500000 258.29084927 256.66472742 387.43671867 410.75633934
5 7600000 0.0882725 0.087765 0.0588484 0.055079 7600000 258.29108726 259.7846522 387.43619198 413.95087057
5 7700000 0.089434 0.089241 0.0596227 0.055898 7700000 258.29103026 258.8496319 387.43632878 413.25271029
5 7800000 0.0905955 0.090745 0.060397 0.05643 7800000 258.29097472 257.86544713 387.43646208 414.67304625
5 7900000 0.091757 0.091336 0.0611713 0.05733 7900000 258.29092058 259.48147499 387.436592 413.39612768
5 8000000 0.0929185 0.091702 0.0619457 0.058136 8000000 258.2908678 261.71730169 387.43609322 412.82509977
5 8100000 0.09408 0.093088 0.06272 0.058745 8100000 258.29081633 261.04331385 387.43622449 413.65222572
5 8200000 0.0952414 0.094013 0.0634943 0.059462 8200000 258.2910373 261.66593982 387.43635255 413.70959604
5 8300000 0.0964029 0.095336 0.0642686 0.060249 8300000 258.29098502 261.18150541 387.43647753 413.2848678
5 8400000 0.0975644 0.095923 0.0650429 0.061847 8400000 258.29093399 262.71071589 387.43659954 407.45711191
5 8500000 0.0987259 0.098008 0.0658173 0.061634 8500000 258.29088416 260.18284222 387.43613001 413.73268001
5 8600000 0.0998874 0.098802 0.0665916 0.062816 8600000 258.29083548 261.12831724 387.43625322 410.72338258
5 8700000 0.101049 0.100012 0.0673659 0.063657 8700000 258.29053232 260.96868376 387.4363736 410.00989679
5 8800000 0.10221 0.101886 0.0681402 0.064866 8800000 258.29175227 259.11312644 387.43649123 406.99287762
5 8900000 0.103372 0.101398 0.0689145 0.064488 8900000 258.29044616 263.31880313 387.43660623 414.03051731
5 9000000 0.104533 0.103009 0.0696889 0.065681 9000000 258.29163996 262.11301925 387.43616272 411.07778505
5 9100000 0.105695 0.104268 0.0704632 0.066467 9100000 258.29036378 261.82529635 387.4362788 410.73013676
5 9200000 0.106856 0.106295 0.0712375 0.066922 9200000 258.29153253 259.65473447 387.43639235 412.42042975
5 9300000 0.108018 0.10638 0.0720118 0.067442 9300000 258.29028495 262.26734349 387.43650346 413.68879926
5 9400000 0.109179 0.108253 0.0727861 0.068746 9400000 258.29142967 260.50086372 387.43661221 410.20568469
5 9500000 0.110341 0.110491 0.0735605 0.069642 9500000 258.29020944 257.93956069 387.43619198 409.23580598
5 9600000 0.111502 0.110018 0.0743348 0.070308 9600000 258.2913311 261.77534585 387.43630171 409.62621608
5 9700000 0.112664 0.111093 0.0751091 0.070233 9700000 258.29013704 261.94269666 387.43640917 414.3351416
5 9800000 0.113825 0.113059 0.0758834 0.071569 9800000 258.29123655 260.04121742 387.43651444 410.79238218
5 9900000 0.114987 0.113451 0.0766577 0.07238 9900000 258.29006757 261.787027 387.43661759 410.3343465
5 10000000 0.116148 0.115782 0.0774321 0.072546 10000000 258.29114578 259.10763331 387.43621831 413.53072533
[Figure 6.5(a) plot: Mflops (axis 0 to 3000) versus Vector Dimension (axis 0 to 10,000,000); series: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2]
(a) VADD
Columns: size; runtime in seconds, analytic-parallel (Predicted) then empirical (Actual), for series 1 to 4; size; Mflops, Predicted then Actual, for series 1 to 4.
100000 0.000199989 0.001144 0.000142851 0.000717 0.000142851 0.000962 8.57E-05 0.00018 100000 1500.0825045 262.23776224 2100.0903039 418.41004184 2100.0903039 311.85031185 3500.0484173 1666.6666667
200000 0.000399975 0.003095 0.000285698 0.001633 0.000285698 0.001505 0.000171422 0.000635 200000 1500.0937559 193.86106624 2100.1197068 367.42192284 2100.1197068 398.67109635 3500.1341718 944.88188976
300000 0.00348444 0.004197 0.00232297 0.002644 0.00232297 0.002594 0.00025713 0.001471 300000 258.29114578 214.43888492 387.43505082 340.39334342 387.43505082 346.95451041 3500.1750088 611.82868797
400000 0.00542025 0.005619 0.00387161 0.004371 0.00387161 0.00382 0.00232298 0.002151 400000 221.39200221 213.56113187 309.94857437 274.53671929 309.94857437 314.13612565 516.57784398 557.88005579
500000 0.00677531 0.006771 0.00483952 0.005204 0.00483952 0.004839 0.00290372 0.002925 500000 221.3920839 221.53300842 309.94809403 288.23981553 309.94809403 309.98140112 516.57873349 512.82051282
600000 0.00813037 0.008611 0.00580742 0.00593 0.00580742 0.006041 0.00348446 0.003485 600000 221.39213837 209.03495529 309.94830751 303.54131535 309.94830751 297.96391326 516.5793265 516.49928264
700000 0.00948544 0.010044 0.00677532 0.006806 0.00677532 0.006865 0.0040652 0.003423 700000 221.39194386 209.08004779 309.94846 308.55127828 309.94846 305.89949017 516.57975007 613.49693252
800000 0.0108405 0.011493 0.00774322 0.007779 0.00774322 0.007804 0.00464594 0.004559 800000 221.39200221 208.82276168 309.94857437 308.52294639 309.94857437 307.53459764 516.58006776 526.43123492
900000 0.0121956 0.013016 0.00871112 0.008931 0.00871112 0.009026 0.00522668 0.005421 900000 221.39132146 207.43700061 309.94866332 302.31776957 309.94866332 299.13582982 516.58031485 498.06308799
1000000 0.0135506 0.014752 0.00967902 0.010396 0.00967902 0.010114 0.00580742 0.005782 1000000 221.39241067 203.36225597 309.94873448 288.5725279 309.94873448 296.61854855 516.58051252 518.85160844
1100000 0.0149057 0.01655 0.0106469 0.010816 0.0106469 0.011055 0.00638816 0.006596 1100000 221.39181655 199.39577039 309.94937494 305.1035503 309.94937494 298.50746269 516.58067425 500.30321407
1200000 0.0162607 0.017438 0.0116148 0.012571 0.0116148 0.012256 0.0069689 0.006727 1200000 221.39268297 206.44569331 309.94937494 286.37339909 309.94937494 293.73368146 516.58080902 535.15683068
1300000 0.0176158 0.018675 0.0125827 0.012965 0.0125827 0.013012 0.00754964 0.007776 1300000 221.39215931 208.83534137 309.94937494 300.80987273 309.94937494 299.72333231 516.58092306 501.54320988
1400000 0.0189709 0.021415 0.0135506 0.014749 0.0135506 0.013971 0.00813038 0.008044 1400000 221.39171046 196.124212 309.94937494 284.76506882 309.94937494 300.62271849 516.58102081 522.12829438
1500000 0.0203259 0.022013 0.0145185 0.015295 0.0145185 0.015094 0.00871112 0.008829 1500000 221.39241067 204.42465816 309.94937494 294.21379536 309.94937494 298.13170796 516.58110553 509.68399592
1600000 0.021681 0.024239 0.0154864 0.015983 0.0154864 0.016503 0.00929186 0.009223 1600000 221.39200221 198.02797145 309.94937494 300.31908903 309.94937494 290.85620796 516.58117966 520.43803535
1700000 0.023036 0.024596 0.0164543 0.01777 0.0164543 0.017046 0.0098726 0.009538 1700000 221.39260288 207.35078875 309.94937494 287.00056275 309.94937494 299.19042591 516.58124506 534.70329209
1800000 0.0243911 0.026421 0.0174222 0.018804 0.0174222 0.018163 0.0104533 0.010185 1800000 221.39222913 204.38287726 309.94937494 287.17294193 309.94937494 297.30771348 516.58327992 530.19145803
1900000 0.0257462 0.027399 0.0183901 0.019031 0.0183901 0.019189 0.0110341 0.01085 1900000 221.39189473 208.03678966 309.94937494 299.51132363 309.94937494 297.04518214 516.58041888 525.34562212
2000000 0.0271012 0.029887 0.019358 0.020939 0.019358 0.020288 0.0116148 0.011424 2000000 221.39241067 200.75618162 309.94937494 286.54663546 309.94937494 295.74132492 516.58229156 525.21008403
2100000 0.0284563 0.030839 0.0203259 0.021973 0.0203259 0.0213 0.0121956 0.012252 2100000 221.39209946 204.28677973 309.94937494 286.7155145 309.94937494 295.77464789 516.57975007 514.20176298
2200000 0.0298114 0.031684 0.0212938 0.022558 0.0212938 0.02247 0.0127763 0.01257 2200000 221.39181655 208.30703194 309.94937494 292.57912936 309.94937494 293.72496662 516.5814829 525.05966587
2300000 0.0311664 0.03316 0.0222617 0.023266 0.0222617 0.023299 0.013357 0.013575 2300000 221.3922686 208.08202654 309.94937494 296.5701023 309.94937494 296.15004936 516.58306506 508.28729282
2400000 0.0325215 0.034583 0.0232296 0.024388 0.0232296 0.024457 0.0139378 0.014099 2400000 221.39200221 208.19477778 309.94937494 295.2271609 309.94937494 294.39424296 516.58080902 510.67451592
2500000 0.0338765 0.036029 0.0241975 0.025235 0.0241975 0.025162 0.0145185 0.014655 2500000 221.39241067 208.16564434 309.94937494 297.20626115 309.94937494 298.06851602 516.58229156 511.77072671
2600000 0.0352316 0.037855 0.0251654 0.026319 0.0251654 0.026532 0.0150993 0.014811 2600000 221.39215931 206.04939902 309.94937494 296.36384361 309.94937494 293.98462234 516.58023882 526.63560867
2700000 0.0365867 0.039197 0.0261333 0.027249 0.0261333 0.028114 0.01568 0.015634 2700000 221.39192657 206.648468 309.94937494 297.258615 309.94937494 288.11268407 516.58163265 518.10157349
2800000 0.0379417 0.040959 0.0271012 0.028297 0.0271012 0.02825 0.0162607 0.015982 2800000 221.39229397 205.08313191 309.94937494 296.85125632 309.94937494 297.34513274 516.58292693 525.5912902
2900000 0.0392968 0.041513 0.0280691 0.030062 0.0280691 0.030254 0.0168415 0.016706 2900000 221.39207264 209.57290487 309.94937494 289.40190273 309.94937494 287.56528062 516.58106463 520.77098049
3000000 0.0406518 0.043819 0.029037 0.030818 0.029037 0.030366 0.0174222 0.01736 3000000 221.39241067 205.39035578 309.94937494 292.03712116 309.94937494 296.38411381 516.58229156 518.43317972
3100000 0.0420069 0.045237 0.0300049 0.031506 0.0300049 0.032358 0.018003 0.018104 3100000 221.39219985 205.58392466 309.94937494 295.18187012 309.94937494 287.40960504 516.58056991 513.69863014
3200000 0.043362 0.04595 0.0309728 0.033073 0.0309728 0.032467 0.0185837 0.018201 3200000 221.39200221 208.92274211 309.94937494 290.26698515 309.94937494 295.68484923 516.58173561 527.44354706
3300000 0.044717 0.047971 0.0319407 0.034042 0.0319407 0.034504 0.0191645 0.018857 3300000 221.39231165 206.37468471 309.94937494 290.81722578 309.94937494 286.92325527 516.58013515 525.0039773
3400000 0.0460721 0.048489 0.0329086 0.034607 0.0329086 0.034332 0.0197452 0.019762 3400000 221.39212235 210.35698818 309.94937494 294.73805877 309.94937494 297.09891646 516.58124506 516.14209088
3500000 0.0474271 0.049996 0.0338765 0.035209 0.0338765 0.035221 0.0203259 0.020346 3500000 221.39241067 210.01680134 309.94937494 298.21920532 309.94937494 298.1176003 516.58229156 516.07195518
3600000 0.0487822 0.052806 0.0348444 0.036558 0.0348444 0.036623 0.0209067 0.020771 3600000 221.39222913 204.52221338 309.94937494 295.42097489 309.94937494 294.89664965 516.58080902 519.95570748
3700000 0.0501373 0.053777 0.0358123 0.037968 0.0358123 0.038562 0.0214874 0.021238 3700000 221.39205741 206.40794392 309.94937494 292.35145386 309.94937494 287.84814066 516.58181074 522.64808362
3800000 0.0514923 0.055392 0.0367802 0.038451 0.0367802 0.039317 0.0220682 0.021804 3800000 221.39232468 205.80589255 309.94937494 296.48123586 309.94937494 289.95091182 516.58041888 522.8398459
3900000 0.0528474 0.05621 0.0377481 0.039862 0.0377481 0.039343 0.0226489 0.022333 3900000 221.39215931 208.14801637 309.94937494 293.51261853 309.94937494 297.38454109 516.58137923 523.88841625
4000000 0.0542024 0.058731 0.038716 0.040547 0.038716 0.040929 0.0232296 0.022816 4000000 221.39241067 204.32139756 309.94937494 295.95284485 309.94937494 293.19064722 516.58229156 525.94670407
4100000 0.0555575 0.059032 0.0396839 0.041909 0.0396839 0.042037 0.0238104 0.02336 4100000 221.39225127 208.36156661 309.94937494 293.49304445 309.94937494 292.59937674 516.58098982 526.54109589
4200000 0.0569126 0.06101 0.0406518 0.042296 0.0406518 0.042363 0.0243911 0.024386 4200000 221.39209946 206.52352073 309.94937494 297.90051069 309.94937494 297.42936053 516.58186798 516.68990404
4300000 0.0582676 0.063219 0.0416197 0.043709 0.0416197 0.044058 0.0249719 0.024497 4300000 221.39233468 204.05257913 309.94937494 295.13372532 309.94937494 292.79586 516.58063664 526.59509328
4400000 0.0596227 0.063354 0.0425876 0.044706 0.0425876 0.044518 0.0255526 0.025327 4400000 221.39218787 208.35306374 309.94937494 295.26238089 309.94937494 296.50927715 516.5814829 521.18292731
4500000 0.0609778 0.064688 0.0435555 0.045223 0.0435555 0.046252 0.0261333 0.026442 4500000 221.3920476 208.69403908 309.94937494 298.52066426 309.94937494 291.87927009 516.58229156 510.55139551
4600000 0.0623328 0.066368 0.0445234 0.046592 0.0445234 0.047041 0.0267141 0.026305 46 0000 221.39226 6 207.93153327 309.94937494 296.18 18681 309.9493 494 293.36111052 516.58113131 524.61509219
4700000 0.0636879 0.069494 0.0454913 0.047862 0.0454913 0.048474 0.0272948 0.027418 47 0000 221.39213 57 202.89521398 309.94937494 294.59696628 309.94937494 290.8775838 516.58191304 514.26070465
4800000 0.0650429 0.068684 0.0464592 0.048351 0.0464592 0.049009 0.0278756 0.027377 48 0000 221.39234259 209.65581504 309.94937494 297.82217534 309.94937 94 293.82358342 51 .58080902 525.98896884
4900000 0.066398 0.070498 0.0474271 0.050212 0.0474271 0.049172 0.0284563 .028 4 49 0 00 221.39221061 208.51655366 309.9493 494 292.7587031 309.94937494 298.950622 1 51 .58156542 524.2510699
5000000 0.0677531 0.071427 0.048395 0.05109 0.048395 0.051313 0.029 37 0.028957 50 0000 221.3920839 210.0046201 309.9493 494 293.59953024 309.94937494 292.32358272 516.58229156 518.00946231
5100000 0.0691081 0.074605 0.0493629 0.05166 0.0493629 0.0513 0.0 96178 0.029587 51 0000 221.39228253 205.080 8847 309.9493 494 296.16724739 309.94937494 298.24561 04 5 6.58124506 517.11900497
5200000 0.0704632 0.075269 0.0503309 0.052364 0.0503309 0.053423 0.0 1985 0.029719 52 0000 221.3921 931 207.25663952 309.94875911 297.914 9782 309.94875911 292.00905977 516.58194943 524.91671994
5300000 0.0718182 0.076412 0.0512988 0.053233 0.0512988 0.054089 0.03 7793 0.030902 53 0000 221.39234901 208.08250013 309.9487 073 298.68690474 309.94877073 293.95995 89 516.5809 888 514.5298039
5400000 0.0731733 0.0787 0.0522667 0.054491 0.0522667 0.053994 0. 3136 .031 7 54 0000 221.39222913 205.84498094 309.94878192 297. 9680131 309.94878192 300.0 333704 51 .58163265 519.73051011
5500000 0.0745284 0.079849 0.0532346 0.055645 0.0532346 0.05542 0.03194 7 0.031667 55 0000 221.39211361 206.640 3306 309.948 927 296.52259862 309.9487927 297.72645254 5 6.58229156 521.04714687
5600000 0.0758834 0.080402 0.0542025 0.057429 0.0542025 0.057634 0.0325215 0.032056 56 0000 221.39229397 208.950 2612 309.9488031 292.53513033 309.948 031 291.49460388 516.58 385 524.082855
5700000 0.0772385 0.082268 0.0551704 0.05723 0.0551704 0.057785 0.0331 22 0.032917 57 0 00 221.39218136 207.85724705 09.94881313 298.79433863 309.94881313 295.92454789 516.58197945 519.48841024
5800000 0.0785935 0.084095 0.0561383 0.058757 0.0561383 0.058732 0.033683 0.033203 58 0000 221.39235433 206.90885308 09.94882282 296.1349286 309.94882282 2 6.26098209 516.58106463 524.04903171
5900000 0.0799486 0.085897 0.0571062 0.059917 0.0571062 0.059866 0.0342637 0.034034 59 0000 221.39224452 206.06074717 309.94883218 295.40864863 309.94883218 295.66030802 516.58168849 520.06816713
6000000 0.0813037 0.087046 0.0580741 0.061926 0.0580741 0.060203 0.0348444 0. 34685 60 00 0 221.39213837 206.78721595 309.94884 22 290.66950877 309.94884122 298.9 84225 516.58229156 518.95632118
6100000 0.0826587 0.089819 0.059042 0.062968 0.059042 0.0624 0.0354252 0.034927 61 00 0 221.39230 53 203.743 8331 309.94884997 290.6 3 08 2 309.94884997 293.26923077 516.58141662 523.94995276
6200000 0.0840138 0.089594 0.0600099 0.062618 0.0600099 0.063148 0.0 6 59 0.0356 8 62 0000 221.39219985 207.60318771 309.94885844 297.039190 1 309.94885 44 294.5461455 516.58200462 522.35452707
6300000 0.0853689 0.092283 0.0609778 0.063196 0.0609778 0.063294 0.0365867 0.036195 63 0000 221.392 9946 204.80478528 309.94886664 299.06956136 309.94886664 298.60650299 516.58116201 522. 7157066
6400000 0.0867239 0.095086 0.0619457 0.06543 0.0619457 0.065252 0.0371674 0.036692 64 00 0 221.3922 75 201.92247018 309.94887458 293.4433746 309.94887458 94.24385 59 516.5817 561 523.2748283
6500000 0.088079 0.094573 0.0629136 0.065987 0.0629136 0.066552 0.0377481 0.0374 65 0000 221.39215931 206.18992736 309.9488 228 295.51275251 309.94888 28 293.00396682 516.58229156 520.18032918
6600000 0.089434 0.095746 0.0638815 0.066676 0.0638815 0.067669 0.0383289 0.03826 660 0 221.39231165 206.79715079 309.9488 974 296.9 42582 309.94888974 292.60074776 51 .58 4829 517.51176163
6700000 0.0907891 0.099417 0.0648494 0.068228 0.0648494 0.067081 0.0389 96 0. 38643 670 000 221.39221559 202.17870183 309.94889698 294.60045729 309.94889698 299.6 7751 8 5 6.582 2603 520.1459514
6800000 0.0921442 0.098044 0.0658173 0.068811 0.0658173 0.069172 0.03949 4 0.039265 680 000 221.39212235 208.06984619 309.948904 1 296.4642281 309.94890401 294.91701845 516.5812 506 519.54667006
6900000 0.0934992 0.099447 0.0667852 0.071089 0.0667852 0.071209 0.04 711 0.03964 690 0 0 221.3922686 208.151 7545 309.94891 84 291.184290 2 309.9489 084 90.6935921 516.58177589 522.1602805
7000000 0.0948543 0.102689 0.0677531 0.071556 0.0677531 0.07074 0.04 6518 0.040288 700 000 221.3921 727 204.50096895 309.94891747 293.476438 309.94891747 296.86174724 516.58229156 521.24702145
7100000 0.0962093 0.101358 0.068721 0.072585 0.068721 0.071174 0.0412326 0.041229 710 000 221.39231862 210.14621441 309.94892391 293.44905 72 309.948 2391 299.26658611 516.58153985 516.62664629
7200000 0.0975644 0.104123 0.0696889 0.073259 0.0696889 0.072942 0.0418133 0.041829 7200 00 221.39222913 207.44696177 309.94893017 294.84431 47 309.94893017 2 6.1256889 516.58204447 516.38815176
7300000 0.0989195 0.10516 0.0706568 0.074247 0.0706568 0.073873 0.0 23941 0.04216 730 000 221.3921421 208.254 8901 309.94893627 294.96141258 309.94893627 296.454726 5 516.58131674 519.36348329
7400000 0.100275 0.107424 0.0716247 0.075493 0.0716247 0.074593 0.0429748 0.042796 7400000 221.39117427 206.65773012 309.94894219 294.06699959 309.94894219 297.61505771 516.58181074 518.74006917
7500000 0.10163 0.108765 0.0725926 0.076549 0.0725926 0.075446 0.0435555 0.043136 7500000 221.39132146 206.8680182 309.94894796 293.92937857 309.94894796 298.22654614 516.58229156 521.60608309
7600000 0.102985 0.109871 0.0735605 0.07733 0.0735605 0.077727 0.0441363 0.044077 7600000 221.39146478 207.51608705 309.94895358 294.84029484 309.94895358 293.33436258 516.5815893 517.27658416
7700000 0.10434 0.112501 0.0745284 0.078089 0.0745284 0.079292 0.044717 0.044753 7700000 221.39160437 205.33150816 309.94895905 295.81631216 309.94895905 291.32825506 516.58206051 516.16651398
7800000 0.105695 0.112349 0.0754963 0.078774 0.0754963 0.078959 0.0452978 0.044875 7800000 221.39174039 208.27955745 309.94896439 297.05232691 309.94896439 296.35633683 516.58137923 521.44846797
7900000 0.10705 0.114173 0.0764642 0.079631 0.0764642 0.079469 0.0458785 0.045261 7900000 221.39187296 207.5797255 309.94896958 297.6227851 309.94896958 298.22949829 516.58184117 523.6296149
8000000 0.108405 0.114037 0.0774321 0.081285 0.0774321 0.080784 0.0464593 0.045767 8000000 221.39200221 210.45800924 309.94897465 295.25742757 309.94897465 297.08853238 516.58117966 524.39530666
8100000 0.10976 0.116807 0.0784 0.080854 0.0784 0.081361 0.04704 0.04664 8100000 221.39212828 208.03547733 309.94897959 300.54171717 309.94897959 298.66889542 516.58163265 521.01200686
8200000 0.111115 0.116632 0.0793679 0.083751 0.0793679 0.08323 0.0476207 0.047233 8200000 221.39225127 210.91981617 309.94898441 293.72783609 309.94898441 295.56650246 516.5820746 520.82230644
8300000 0.11247 0.119533 0.0803358 0.0844 0.0803358 0.085546 0.0482015 0.04796 8300000 221.3923713 208.31067571 309.94898912 295.02369668 309.94898912 291.07147032 516.58143419 519.18265221
8400000 0.113825 0.121389 0.0813037 0.085079 0.0813037 0.085478 0.0487822 0.04818 8400000 221.39248847 207.59706398 309.94899371 296.19530084 309.94899371 294.81270034 516.58186798 523.03860523
8500000 0.11518 0.122099 0.0822716 0.086207 0.0822716 0.086783 0.049363 0.048966 8500000 221.39260288 208.8469193 309.9489982 295.79964504 309.9489982 293.83635044 516.58124506 520.76951354
8600000 0.116535 0.123113 0.0832395 0.087494 0.0832395 0.087952 0.0499437 0.049837 8600000 221.39271464 209.56357168 309.94900258 294.87736302 309.94900258 293.34182281 516.58167096 517.68766178
8700000 0.11789 0.124543 0.0842074 0.08846 0.0842074 0.089762 0.0505244 0.050541 8700000 221.39282382 209.56617393 309.94900686 295.04860954 309.94900686 290.76892226 516.58208707 516.41241764
8800000 0.119245 0.125779 0.0851753 0.089051 0.0851753 0.089657 0.0511052 0.050082 8800000 221.39293052 209.89195335 309.94901104 296.45933229 309.94901104 294.4555361 516.5814829 527.13549778
8900000 0.1206 0.128212 0.0861432 0.090262 0.0861432 0.089364 0.0516859 0.051035 8900000 221.39303483 208.24883786 309.94901513 295.80554386 309.94901513 298.77803142 516.58189177 523.17037327
9000000 0.121956 0.128253 0.0871111 0.092346 0.0871111 0.090785 0.0522667 0.052171 9000000 221.39132146 210.52139131 309.94901913 292.37866285 309.94901913 297.40595913 516.5813032 517.52889536
9100000 0.123311 0.131455 0.088079 0.091915 0.088079 0.092089 0.0528474 0.052047 9100000 221.39144115 207.67563044 309.94902304 297.01354512 309.94902304 296.45234501 516.58170506 524.52590927
9200000 0.124666 0.132752 0.0890469 0.092743 0.0890469 0.093229 0.0534281 0.052893 9200000 221.39155824 207.90647222 309.94902686 297.59658411 309.94902686 296.04522198 516.58209818 521.80817878
9300000 0.126021 0.13402 0.0900148 0.095585 0.0900148 0.09618 0.0540089 0.053623 9300000 221.39167282 208.1778839 309.9490306 291.88680232 309.9490306 290.08109794 516.58152638 520.29912538
9400000 0.127376 0.134236 0.0909827 0.094931 0.0909827 0.095705 0.0545896 0.054286 9400000 221.39178495 210.07777347 309.94903427 297.05786308 309.94903427 294.65545165 516.58191304 519.47095015
9500000 0.128731 0.13773 0.0919506 0.095701 0.0919506 0.097531 0.0551704 0.054841 9500000 221.39189473 206.92659551 309.94903785 297.8025308 309.94903785 292.21478299 516.58135522 519.6841779
9600000 0.130086 0.138198 0.0929185 0.096273 0.0929185 0.096714 0.0557511 0.054752 9600000 221.39200221 208.39664829 309.94904136 299.14929419 309.94904136 297.78522241 516.58173561 526.00818235
9700000 0.131441 0.139629 0.0938864 0.100787 0.0938864 0.099483 0.0563318 0.055907 9700000 221.39210749 208.40942784 309.9490448 288.7277129 309.9490448 292.51228853 516.58210815 520.507271
9800000 0.132796 0.141094 0.0948543 0.099202 0.0948543 0.098433 0.0569126 0.056531 9800000 221.39221061 208.37172382 309.94904817 296.36499264 309.94904817 298.68032062 516.58156542 520.06863491
9900000 0.134151 0.144422 0.0958222 0.100089 0.0958222 0.099571 0.0574933 0.056602 9900000 221.39231165 205.64733905 309.94905147 296.73590504 309.94905147 298.27961957 516.58193216 524.71644112
10000000 0.135506 0.141887 0.0967901 0.100576 0.0967901 0.100349 0.0580741 0.057992 10000000 221.39241067 211.43586093 309.94905471 298.28189628 309.94905471 298.95664132 516.58140204 517.31273279
[Figure 6.5(b) plot: Mflops (axis 0 to 4000) versus Vector Dimension (axis 0 to 10,000,000); series: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2, Predicted - 3, Actual - 3, Predicted - 4, Actual - 4]
(b) WAXPBY
Figure 6.5: Comparison of actual and predicted runtimes for vector kernels on the Work machine.
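The data tabulated with Figure 6.5 gives both runtimes and Mflops rates for each vector dimension. The following is a minimal sketch of how the two appear to be related. It assumes the runtimes are in seconds and that each kernel is credited with 3 floating-point operations per vector element; that factor is inferred from the tabulated values and is not stated in this section. The function names `mflops` and `relative_error` are illustrative only.

```python
def mflops(n, seconds, flops_per_element=3):
    """Convert a runtime in seconds into a Mflops rate for a length-n vector kernel.

    flops_per_element=3 is an assumption inferred from the tabulated data.
    """
    return flops_per_element * n / seconds / 1e6


def relative_error(predicted, actual):
    """Relative difference between a predicted and a measured runtime."""
    return abs(predicted - actual) / actual


# WAXPBY, series 1, vector dimension 10,000,000 (runtimes copied from the data above)
n = 10_000_000
t_predicted = 0.135506
t_actual = 0.141887

print(f"predicted rate: {mflops(n, t_predicted):.2f} Mflops")
print(f"actual rate:    {mflops(n, t_actual):.2f} Mflops")
print(f"runtime error:  {100 * relative_error(t_predicted, t_actual):.1f}%")
```

For this row the computed rates agree with the tabulated Mflops entries (221.39 predicted, 211.44 actual), which supports the assumed factor of 3.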
Columns: size; runtime in seconds, analytic-parallel (Predicted) then empirical (Actual), for series 1 to 3; size; Mflops, Predicted then Actual, for series 1 to 3.
100 6.47E-06 6.20E-05 6.47E-06 7.80E-05 4.98E-06 5.60E-05 100 6185.5638219 645.16129032 6185.5638219 512.82051282 8037.5191393 714.28571429
200 2.49E-05 0.000103 2.49E-05 0.000111 1.89E-05 0.000115 200 6437.7250689 1553.3980583 6437.7250689 1441.4414414 8468.5657425 1391.3043478
300 5.52E-05 0.000173 5.52E-05 0.000149 4.18E-05 0.000139 300 6526.4211282 2080.9248555 6526.4211282 2416.1073826 8622.6925316 2589.9280576
400 9.74E-05 0.000249 9.74E-05 0.000211 7.35E-05 0.000207 400 6571.6852779 2570.2811245 6571.6852779 3033.1753555 8701.8948376 3091.7874396
500 0.000151535 0.000443 0.000151535 0.000324 0.000114284 0.000303 500 6599.1355132 2257.3363431 6599.1355132 3086.4197531 8750.131252 3300.330033
600 0.000217602 0.000586 0.000217602 0.000474 0.000163961 0.000588 600 6617.5862354 2457.337884 6617.5862354 3037.9746835 8782.5763444 2448.9795918
700 0.000295589 0.001417 0.000295589 0.00099 0.000222579 0.000717 700 6630.8286168 1383.203952 6630.8286168 1979.7979798 8805.8621883 2733.6122734
800 0.00214176 0.001892 0.00128526 0.001526 0.00109572 0.001125 800 1195.2786493 1353.0655391 1991.8148857 1677.5884666 2336.3633045 2275.5555556
900 0.00270368 0.002661 0.00161983 0.001601 0.00137979 0.001767 900 1198.3666706 1217.5873732 2000.2098986 2023.7351655 2348.1834192 1833.6162988
1000 0.00333097 0.003245 0.00199306 0.001994 0.00169654 0.001812 1000 1200.8514036 1232.6656394 2006.9641657 2006.0180542 2357.7398706 2207.5055188
1100 0.00402365 0.003923 0.00240493 0.002708 0.00204598 0.002099 1100 1202.8879251 1233.7496814 2012.5325893 1787.2968981 2365.6145221 2305.8599333
1200 0.0047817 0.004632 0.00285546 0.002823 0.00242811 0.002475 1200 1204.5925089 1243.5233161 2017.1881238 2040.3825717 2372.2154268 2327.2727273
1300 0.00560512 0.005421 0.00334463 0.003272 0.00284293 0.002882 1300 1206.0401918 1247.0023981 2021.1503216 2066.0146699 2377.8285079 2345.593338
1400 0.00649393 0.006344 0.00387245 0.003796 0.00329044 0.003326 1400 1207.2812611 1235.813367 2024.5580963 2065.3319283 2382.6600698 2357.1858088
1500 0.00744811 0.007215 0.00443892 0.00435 0.00377064 0.003791 1500 1208.3602417 1247.4012474 2027.5202076 2068.9655172 2386.8627077 2374.0437879
1600 0.00846767 0.008147 0.00504403 0.004937 0.00428352 0.004314 1600 1209.3055114 1256.904382 2030.1227392 2074.1340895 2390.5572987 2373.6671303
1700 0.00955261 0.009229 0.0056878 0.005539 0.0048291 0.00487 1700 1210.1404747 1252.5734099 2032.4202679 2087.0193176 2393.8207948 2373.7166324
1800 0.0107029 0.01037 0.00637021 0.006181 0.00540736 0.007232 1800 1210.886769 1249.75892 2034.4698212 2096.748099 2396.7333412 1792.0353982
1900 0.0119186 0.01144 0.00709128 0.007036 0.00601831 0.005991 1900 1211.5516923 1262.2377622 2036.3037421 2052.3024446 2399.3446665 2410.2820898
2000 0.0131997 0.012755 0.00785099 0.007683 0.00666195 0.006572 2000 1212.148761 1254.4100353 2037.959544 2082.519849 2401.6992022 2434.5709069
2100 0.0145461 0.014016 0.00864935 0.008407 0.00733828 0.007191 2100 1212.6961866 1258.5616438 2039.4596126 2098.2514571 2403.8330508 2453.0663329
2200 0.0159579 0.015334 0.00948636 0.009175 0.00804729 0.00834 2200 1213.192212 1262.553802 2040.8249318 2110.0817439 2405.7788398 2321.3429257
2300 0.0174351 0.016789 0.010362 0.010091 0.008789 0.00874 2300 1213.6437416 1260.3490381 2042.0768191 2096.9180458 2407.5548982 2421.0526316
2400 0.0189777 0.018214 0.0112763 0.010976 0.00956339 0.009413 2400 1214.0564979 1264.961019 2043.2233977 2099.1253644 2409.1875371 2447.6787422
2500 0.0205857 0.019713 0.0122293 0.012084 0.0103705 0.01041 2500 1214.4352633 1268.1986506 2044.2707269 2068.8513737 2410.6841522 2401.5369837
2600 0.022259 0.021297 0.0132209 0.012784 0.0112102 0.011454 2600 1214.7895233 1269.6623938 2045.2465415 2115.1439299 2412.0889904 2360.7473372
2700 0.0239977 0.023062 0.0142511 0.013864 0.0120827 0.012008 2700 1215.1164487 1264.4176568 2046.1578404 2103.2890941 2413.3678731 2428.3810793
2800 0.0258018 0.024563 0.01532 0.014898 0.0129879 0.012903 2800 1215.4190793 1276.7170134 2046.997389 2104.9805343 2414.5550859 2430.4425327
2900 0.0276713 0.026364 0.0164276 0.015926 0.0139257 0.013725 2900 1215.7000213 1275.9824002 2047.7732596 2112.2692453 2415.6774884 2451.0018215
3000 0.0296061 0.0282 0.0175738 0.017068 0.0148962 0.014609 3000 1215.9656287 1276.5957447 2048.504023 2109.210218 2416.7237282 2464.2343761
3100 0.0316063 0.030026 0.0187586 0.018197 0.0158994 0.015618 3100 1216.2132233 1280.223806 2049.1934366 2112.4361158 2417.7012969 2461.2626457
3200 0.0336719 0.032044 0.0199821 0.019243 0.0169353 0.016803 3200 1216.4445725 1278.2424167 2049.834602 2128.5662319 2418.6167355 2437.6599417
3300 0.0358029 0.033912 0.0212443 0.020604 0.0180039 0.017755 3300 1216.6612202 1284.5010616 2050.4323513 2114.1525917 2419.4757802 2453.3934103
3400 0.0379993 0.035369 0.0225451 0.02187 0.0191052 0.018893 3400 1216.8645212 1307.3595522 2050.9999956 2114.3118427 2420.283483 2447.4673159
3500 0.040261 0.038036 0.0238845 0.02682 0.0202392 0.020033 3500 1217.058692 1288.2532338 2051.5397015 1826.99478 2421.0443101 2445.9641591
3600 0.0425881 0.04031 0.0252626 0.024629 0.0214058 0.021124 3600 1217.241436 1286.0332424 2052.045316 2104.8357627 2421.773538 2454.0806665
3700 0.0449806 0.04176 0.0266794 0.025806 0.0226052 0.022385 3700 1217.4137295 1311.302682 2052.5199217 2121.9871348 2422.4514713 2446.2809917
3800 0.0474384 0.045003 0.0281347 0.027031 0.0238372 0.023803 3800 1217.5790077 1283.4699909 2052.9808386 2136.8058895 2423.1033846 2426.5848843
3900 0.0499617 0.046311 0.0296288 0.028408 0.0251019 0.024923 3900 1217.7327833 1313.7267604 2053.4074954 2141.6502394 2423.7209136 2441.1186454
4000 0.0525503 0.047971 0.0311615 0.02995 0.0263994 0.026342 4000 1217.8807733 1334.1393759 2053.8164081 2136.8948247 2424.2975219 2429.5801382
4100 0.0552043 0.050651 0.0344705 0.031386 0.0277295 0.027666 4100 1218.02106 1327.515745 1950.6534573 2142.3564647 2424.8543969 2430.4200101
4200 0.0579236 0.053675 0.0361663 0.033139 0.0290923 0.029037 4200 1218.1563301 1314.5784816 1950.9875215 2129.2133136 2425.3840363 2430.0030995
4300 0.0607084 0.055672 0.0379027 0.034567 0.0304877 0.030308 4300 1218.2828077 1328.4954735 1951.3121756 2139.6129256 2425.8963451 2440.2797941
4400 0.0635585 0.059393 0.03968 0.036594 0.0319159 0.031706 4400 1218.4050914 1303.8573569 1951.6129032 2116.1939116 2426.3768216 2442.4399167
4500 0.066474 0.061887 0.0414979 0.03781 0.0333768 0.033175 4500 1218.5215272 1308.837074 1951.9060001 2142.2903994 2426.8354066 2441.5975885
4600 0.0694549 0.065702 0.0433565 0.039758 0.0348703 0.034738 4600 1218.6325227 1288.240845 1952.1871 2128.8797223 2427.2805224 2436.5248431
4700 0.0725012 0.06836 0.0452559 0.041747 0.0363965 0.036149 4700 1218.7384485 1292.5687537 1952.4526084 2116.5592737 2427.7059607 2444.3276439
4800 0.0756128 0.070266 0.047196 0.043475 0.0379555 0.037785 4800 1218.8412544 1311.5873965 1952.7078566 2119.8389879 2428.1065985 2439.0631203
4900 0.0787898 0.07419 0.0491767 0.045123 0.0395471 0.039157 4900 1218.9395074 1294.5140855 1952.9573965 2128.404583 2428.4966534 2452.6904513
5000 0.0820322 0.077844 0.0511982 0.04716 0.0411714 0.040575 5000 1219.0335015 1284.6205231 1953.193667 2120.4410517 2428.8705266 2464.5717807
5100 0.0853399 0.080593 0.0532605 0.048995 0.0428284 0.042348 5100 1219.1249345 1290.9309742 1953.417636 2123.481988 2429.2292031 2456.7866251
5200 0.0887131 0.081969 0.0553634 0.051003 0.044518 0.044078 5200 1219.2111424 1319.5232344 1953.6372405 2120.659569 2429.5790467 2453.8318436
5300 0.0921516 0.087023 0.0575071 0.0538 0.0462404 0.04562 5300 1219.2951615 1291.1529136 1953.8456991 2088.4758364 2429.9097759 2462.9548444
5400 0.0956555 0.090947 0.0596914 0.055131 0.0479954 0.047542 5400 1219.3757808 1282.5051953 1954.0503322 2115.6880884 2430.2328973 2453.4096168
5500 0.0992248 0.093514 0.0619165 0.056938 0.0497832 0.049409 5500 1219.4532012 1293.9239044 1954.2448297 2125.11855 2430.5388163 2448.9465482
5600 0.102859 0.097775 0.0641823 0.059386 0.0516036 0.051168 5600 1219.5335362 1282.9455382 1954.43292 2112.2823561 2430.8381586 2451.5322076
5700 0.106559 0.100723 0.0664888 0.062202 0.0534567 0.052439 5700 1219.6060398 1290.2713382 1954.6149126 2089.3218868 2431.1265005 2478.3081294
5800 0.110325 0.10484 0.0688361 0.063628 0.0553426 0.054846 5800 1219.6691593 1283.4795879 1954.7882579 2114.7922298 2431.4000426 2453.4150166
5900 0.114156 0.107036 0.071224 0.065518 0.057261 0.056623 5900 1219.7343985 1300.8707351 1954.9590026 2125.2174975 2431.6725171 2459.0714021
6000 0.118052 0.111123 0.0736527 0.070741 0.0592122 0.058813 6000 1219.8014434 1295.8613428 1955.1218082 2035.5946339 2431.9312574 2448.4382705
6100 0.122013 0.11413 0.076122 0.070065 0.0611961 0.059816 6100 1219.8700139 1304.1268729 1955.28231 2124.3131378 2432.1811357 2488.2974455
6200 0.12604 0.118516 0.0786321 0.073404 0.0632127 0.062618 6200 1219.9301809 1297.3775693 1955.4355028 2094.7087352 2432.4225986 2455.5239707
6300 0.130133 0.12161 0.0811829 0.07449 0.0652619 0.064167 6300 1219.9826332 1305.4847463 1955.5842425 2131.292791 2432.6597908 2474.1689654
6400 0.13429 0.125454 0.0837745 0.078363 0.0673439 0.06667 6400 1220.0461687 1305.9766927 1955.7263845 2090.7826398 2432.8855323 2457.4771261
6500 0.138513 0.128998 0.0864067 0.07826 0.0694585 0.068558 6500 1220.1020843 1310.097831 1955.8668483 2159.4684385 2433.107539 2465.0660754
6600 0.142802 0.133405 0.0890797 0.081992 0.0716058 0.069091 6600 1220.1509783 1306.0979723 1956.0011989 2125.0853742 2433.3224404 2521.8914186
6700 0.147156 0.136636 0.0917934 0.084412 0.0737858 0.072466 6700 1220.2016907 1314.1485406 1956.1319223 2127.1857082 2433.5305709 2477.8516822
6800 0.151575 0.141755 0.0945477 0.086946 0.0759985 0.075117 6800 1220.2539997 1304.7864273 1956.2612311 2127.297403 2433.7322447 2462.2921576
6900 0.156059 0.144791 0.0973428 0.089593 0.0782439 0.076809 6900 1220.3077041 1315.2751207 1956.3850639 2125.612492 2433.9277567 2479.3969457
7000 0.160609 0.149926 0.100179 0.092478 0.080522 0.078988 7000 1220.3550237 1307.3116071 1956.4978688 2119.4229979 2434.1173841 2481.3895782
7100 0.165224 0.152721 0.103055 0.09472 0.0828327 0.080374 7100 1220.4038154 1320.316132 1956.6251031 2128.8006757 2434.3043267 2508.7714933
7200 0.169905 0.157927 0.105972 0.097112 0.0851762 0.0831 7200 1220.4467202 1313.0117079 1956.7432907 2135.2664964 2434.482872 2495.3068592
7300 0.174651 0.160502 0.10893 0.099455 0.0875523 0.085746 7300 1220.4911509 1328.0831392 1956.8530249 2143.2808808 2434.6590552 2485.9468663
7400 0.179463 0.166224 0.111929 0.101645 0.0899611 0.087755 7400 1220.5301371 1317.7399172 1956.9548553 2154.9510551 2434.8301655 2496.0401117
7500 0.18434 0.16896 0.114969 0.10396 0.0924027 0.090208 7500 1220.5706846 1331.6761364 1957.0492915 2164.2939592 2434.9937826 2494.2355445
7600 0.189282 0.174116 0.118049 0.107037 0.0948769 0.090798 7600 1220.6126309 1326.931471 1957.1533855 2158.5059372 2435.1554488 2544.5494394
7700 0.194289 0.17709 0.121169 0.109364 0.0973838 0.092533 7700 1220.6558271 1339.2060534 1957.2662975 2168.538093 2435.3126495 2562.9775323
7800 0.199362 0.182378 0.124331 0.112212 0.0999233 0.09527 7800 1220.6940139 1334.3714702 1957.3557681 2168.7520051 2435.468004 2554.4242679
7900 0.204501 0.185091 0.127533 0.114298 0.102496 0.097727 7900 1220.727527 1348.7419702 1957.4541491 2184.1152076 2435.6072432 2554.4629427
8000 0.209704 0.191066 0.130776 0.117795 0.105101 0.099942 8000 1220.7683211 1339.8511509 1957.545727 2173.2671166 2435.7522764 2561.4856617
8100 0.214973 0.193858 0.13406 0.118197 0.107738 0.101645 8100 1220.8044731 1353.7744122 1957.6309115 2220.3609229 2435.9093356 2581.927296
8200 0.220308 0.200064 0.137384 0.123565 0.110409 0.107087 8200 1220.8362837 1344.3698017 1957.7243347 2176.6681504 2436.0332944 2511.6027155
8300 0.225708 0.203103 0.14075 0.125707 0.113112 0.106202 8300 1220.8694419 1356.7500234 1957.7975133 2192.0815865 2436.1694604 2594.6780663
8400 0.231173 0.208229 0.144155 0.128832 0.115847 0.108511 8400 1220.9038253 1355.4307997 1957.8925462 2190.7600596 2436.3168662 2601.026624
8500 0.236703 0.210262 0.147602 0.130825 0.118616 0.110559 8500 1220.9393206 1374.4756542 1957.9680492 2209.0579018 2436.4335334 2613.9889109
8600 0.242299 0.2167 0.151089 0.13412 0.121417 0.113605 8600 1220.970784 1365.205353 1958.0512148 2205.7858634 2436.5616018 2604.1107346
8700 0.247961 0.219141 0.154617 0.136173 0.124251 0.11507 8700 1220.9984635 1381.5762454 1958.1287957 2223.3482408 2436.6805901 2631.0941166
8800 0.253688 0.225718 0.158186 0.139964 0.127117 0.117758 8800 1221.0274037 1372.3318477 1958.201105 2213.1405218 2436.8101827 2630.4794579
8900 0.25948 0.22859 0.161795 0.1418 0.130016 0.119375 8900 1221.0574996 1386.0623824 1958.2805402 2234.4146685 2436.9308393 2654.1570681
9000 0.265337 0.234845 0.165445 0.145549 0.132948 0.122189 9000 1221.0886533 1379.6333752 1958.3547402 2226.0544559 2437.0430544 2651.6298521
9100 0.27126 0.236237 0.169136 0.148282 0.135913 0.123852 9100 1221.1162722 1402.1512295 1958.4239902 2233.851715 2437.1472928 2674.4824468
9200 0.277248 0.242806 0.172868 0.151924 0.13891 0.12719 9200 1221.1449677 1394.3642249 1958.4885577 2228.4826624 2437.2615362 2661.8444846
9300 0.283302 0.245423 0.17664 0.154345 0.14194 0.12998 9300 1221.1703412 1409.6478325 1958.5597826 2241.472027 2437.3679019 2661.6402523
9400 0.289421 0.253316 0.180453 0.157666 0.145002 0.132579 9400 1221.1968033 1395.2533594 1958.6263459 2241.7008106 2437.4836209 2665.8822287
9500 0.295605 0.254953 0.184307 0.159506 0.148098 0.135861 9500 1221.2242689 1415.947253 1958.6884926 2263.2377465 2437.5751192 2657.1275053
9600 0.301855 0.262408 0.188201 0.164062 0.151226 0.139734 9600 1221.2486127 1404.8352184 1958.756861 2246.9554193 2437.676061 2638.1553523
9700 0.30817 0.264731 0.192136 0.166772 0.154386 0.14278 9700 1221.2739722 1421.6695438 1958.8208352 2256.7337443 2437.7858096 2635.9434094
9800 0.314551 0.272281 0.196112 0.17038 0.15758 0.144964 9800 1221.2963875 1410.8953618 1958.8806396 2254.7247329 2437.8728265 2650.0372506
9900 0.320996 0.275818 0.200129 0.173373 0.160806 0.148759 9900 1221.3236302 1421.3720642 1958.936486 2261.251752 2437.9687325 2635.4035722
10000 0.327508 0.282297 0.204186 0.177434 0.164064 0.150843 10000 1221.3442114 1416.9473994 1958.9981683 2254.3593674 2438.0729471 2651.7637544
[Plot: Mflops versus matrix order; series: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2, Predicted - 3, Actual - 3]
(a) BiCGK
vid size analytic-parallel empirical size Predicted - 1 Actual - 1 Predicted - 2 Actual - 2
6 100 6.08E-06 0.00012 6.45E-06 7.60E-05 100 6,579.7373698 333.33333333 6,199.8490337 526.31578947
6 200 2.41E-05 0.000167 2.48E-05 0.000126 200 6,644.8770075 958.08383234 6,445.4533369 1269.8412698
6 300 5.40E-05 0.000243 5.51E-05 0.000223 300 6,666.8888963 1481.4814815 6,531.714194 1614.3497758
6 400 9.58E-05 0.000345 9.73E-05 0.00041 400 6,677.9426511 1855.0724638 6,575.7095345 1560.9756098
6 500 0.000149598 0.000478 0.00015146 0.00061 500 6684.5813447 2092.0502092 6602.4032748 1639.3442623
6 600 0.000215278 0.000803 0.000217513 0.00093 600 6689.0253533 1793.2752179 6620.2939594 1548.3870968
6 700 0.000292878 0.002165 0.000295485 0.001399 700 6692.2063112 905.31177829 6633.1624279 1401.0007148
6 800 0.00212607 0.003182 0.00127991 0.001702 800 1204.0995828 804.52545569 2000.1406349 1504.1128085
6 900 0.00268603 0.003751 0.00161382 0.001513 900 1206.2411812 863.76966142 2007.6588467 2141.440846
6 1000 0.00331136 0.004541 0.00198637 0.001936 1000 1207.9628914 880.86324598 2013.7235258 2066.1157025
6 1100 0.00400207 0.00494 0.00239758 0.002256 1100 1209.3741489 979.75708502 2018.7021914 2145.3900709
6 1200 0.00475816 0.006417 0.00284743 0.00265 1200 1210.5519781 897.61570827 2022.8767696 2173.5849057
6 1300 0.00557963 0.006808 0.00333593 0.003087 1300 1211.5498698 992.94947121 2026.4214177 2189.8283123
6 1400 0.00646647 0.007517 0.00386309 0.003544 1400 1212.4080062 1042.9692697 2029.4634606 2212.1896163
6 1500 0.00741869 0.007674 0.00442889 0.004071 1500 1213.1521872 1172.7912432 2032.1118836 2210.7590273
6 1600 0.00843629 0.009231 0.00503333 0.004648 1600 1213.8036981 1109.3056007 2034.4384334 2203.0981067
6 1700 0.00951926 0.010726 0.00567643 0.005109 1700 1214.3801094 1077.7549879 2036.4912454 2262.6737131
6 1800 0.0106676 0.011189 0.00635818 0.005877 1800 1214.8936968 1158.280454 2038.3191416 2205.2067381
6 1900 0.0118813 0.012647 0.00707857 0.006488 1900 1215.3552221 1141.7727524 2039.9600484 2225.647349
6 2000 0.0131605 0.013745 0.00783762 0.007236 2000 1215.7592797 1164.0596581 2041.4360482 2211.1663903
6 2100 0.0145049 0.014742 0.00863531 0.007925 2100 1216.1407524 1196.5811966 2042.7755344 2225.8675079
6 2200 0.0159148 0.018919 0.00947165 0.008847 2200 1216.477744 1023.3099001 2043.9944466 2188.3124223
6 2300 0.01739 0.020032 0.0103466 0.009698 2300 1216.7912593 1056.3099042 2045.1162701 2181.8931739
6 2400 0.0189307 0.023199 0.0112603 0.010641 2400 1217.0706841 993.1462563 2046.1266574 2165.2100367
6 2500 0.0205366 0.021607 0.0122126 0.011347 2500 1217.3388 1157.0324432 2047.0661448 2203.225522
6 2600 0.022208 0.023504 0.0132035 0.012494 2600 1217.5792507 1150.4424779 2047.9418336 2164.238834
6 2700 0.0239448 0.024176 0.0142331 0.013496 2700 1217.8009422 1206.1548643 2048.7455298 2160.6401897
6 2800 0.0257469 0.027156 0.0153013 0.01429 2800 1218.010712 1154.8092503 2049.4990622 2194.541637
6 2900 0.0276144 0.029957 0.0164082 0.015542 2900 1218.2049945 1122.9428848 2050.194415 2164.4575988
6 3000 0.0295473 0.030088 0.0175537 0.017046 3000 1218.3854362 1196.4902951 2050.8496784 2111.9324182
6 3100 0.0315455 0.031948 0.0187379 0.017625 3100 1218.557322 1203.2052085 2051.4572071 2180.9929078
6 3200 0.0336092 0.034028 0.0199607 0.018906 3200 1218.7139236 1203.714588 2052.0322434 2166.5079869
6 3300 0.0357382 0.037295 0.0212222 0.020442 3300 1218.8638488 1167.9849846 2052.5675943 2130.9069563
6 3400 0.0379326 0.03861 0.0225223 0.021593 3400 1219.0042338 1197.6171976 2053.0762844 2141.4347242
6 3500 0.0401923 0.03893 0.0238611 0.022521 3500 1219.1389893 1258.6694066 2053.5515965 2175.7470805
6 3600 0.0425175 0.041537 0.0252385 0.02409 3600 1219.2626566 1248.0439127 2054.0047943 2151.9302615
6 3700 0.044908 0.043882 0.0266546 0.025364 3700 1219.3818473 1247.8920742 2054.4296294 2158.9654629
6 3800 0.0473639 0.04757 0.0281093 0.0273 3800 1219.4941717 1214.210637 2054.835944 2115.7509158
6 3900 0.0498852 0.048771 0.0296027 0.028141 3900 1219.6002021 1247.4626315 2055.2179362 2161.9700792
6 4000 0.0524718 0.051646 0.0311347 0.030141 4000 1219.7027737 1239.2053596 2055.5842838 2123.3535715
6 4100 0.0551239 0.05579 0.0344437 0.031512 4100 1219.7975833 1205.2339129 1952.1712243 2133.7903021
6 4200 0.0578413 0.053972 0.0361388 0.033466 4200 1219.8895945 1307.344549 1952.4721352 2108.408534
6 4300 0.0606241 0.057687 0.0378746 0.034349 4300 1219.9768739 1282.0912857 1952.7598971 2153.1922327
6 4400 0.0634722 0.06159 0.0396512 0.03693 4400 1220.0616963 1257.3469719 1953.0304253 2096.9401571
6 4500 0.0663858 0.06487 0.0414685 0.03771 4500 1220.1404517 1248.6511485 1953.2898465 2147.9713604
6 4600 0.0693647 0.068693 0.0433264 0.040108 4600 1220.2171998 1232.1488361 1953.5433362 2110.3021841
6 4700 0.072409 0.069349 0.0452252 0.041535 4700 1220.2902954 1274.1351714 1953.7779822 2127.3624654
6 4800 0.0755186 0.070156 0.0471646 0.044052 4800 1220.3616063 1313.6438794 1954.0078788 2092.0730046
6 4900 0.0786937 0.075015 0.0491447 0.04511 4900 1220.4280648 1280.2772779 1954.229042 2129.0179561
6 5000 0.0819341 0.076686 0.0511656 0.048061 5000 1220.4930548 1304.0189865 1954.4381381 2080.6891242
6 5100 0.0852399 0.080694 0.0532271 0.04885 5100 1220.5551625 1289.3151907 1954.6434053 2129.7850563
6 5200 0.0886111 0.082196 0.0553294 0.051804 5200 1220.6145731 1315.8791182 1954.8377535 2087.8696626
6 5300 0.0920477 0.087263 0.0574724 0.052896 5300 1220.6714562 1287.6018473 1955.0253687 2124.1681791
6 5400 0.0955496 0.090326 0.0596561 0.055569 5400 1220.7272453 1291.3225428 1955.2065925 2099.0120391
6 5500 0.0991169 0.092291 0.0618805 0.057147 5500 1220.7807145 1311.0704186 1955.3817438 2117.3464924
6 5600 0.10275 0.094294 0.0641457 0.060566 5600 1220.8272506 1330.3073366 1955.5480726 2071.1290163
6 5700 0.106448 0.099078 0.0664516 0.061146 5700 1220.8777995 1311.693817 1955.7091176 2125.4047689
6 5800 0.110211 0.101277 0.0687981 0.064456 5800 1220.9307601 1328.6333521 1955.8679673 2087.6256671
6 5900 0.11404 0.105179 0.0711854 0.065461 5900 1220.9750965 1323.8384088 1956.0190713 2127.0680252
6 6000 0.117934 0.10891 0.0736134 0.069531 6000 1221.0219275 1322.1926361 1956.1655894 2071.0186823
6 6100 0.121894 0.112541 0.0760822 0.070099 6100 1221.0609218 1322.5402298 1956.3051542 2123.2827858
6 6200 0.125919 0.116319 0.0785916 0.074259 6200 1221.1024548 1321.8820657 1956.4431822 2070.5907701
6 6300 0.130009 0.11768 0.0811418 0.074699 6300 1221.1462283 1349.082257 1956.5747864 2125.3296564
6 6400 0.134165 0.121445 0.0837326 0.079392 6400 1221.1828718 1349.0880646 1956.7050348 2063.6839984
6 6500 0.138386 0.126453 0.0863642 0.079522 6500 1221.2217999 1336.4649316 1956.8293344 2125.1980584
6 6600 0.142672 0.12813 0.0890365 0.083362 6600 1221.2627565 1359.8688832 1956.9502395 2090.1609846
6 6700 0.147024 0.135996 0.0917495 0.084321 6700 1221.2972032 1320.332951 1957.0678859 2129.4813866
6 6800 0.151441 0.138442 0.0945033 0.089017 6800 1221.3337207 1336.0107482 1957.1803313 2077.805363
6 6900 0.155924 0.140826 0.0972977 0.088348 6900 1221.3642544 1352.3071024 1957.291899 2155.5666229
6 7000 0.160472 0.144064 0.100133 0.093402 7000 1221.3968792 1360.5064416 1957.3966624 2098.4561358
6 7100 0.165085 0.147916 0.103009 0.094853 7100 1221.4313838 1363.206144 1957.4988593 2125.815735
6 7200 0.169764 0.154757 0.105925 0.099189 7200 1221.4603803 1339.9070801 1957.6115176 2090.5543962
6 7300 0.174508 0.158383 0.108883 0.097903 7300 1221.4912783 1345.8515118 1957.6977122 2177.257081
6 7400 0.179318 0.163468 0.111881 0.103594 7400 1221.5170814 1339.9564441 1957.7944423 2114.4081704
6 7500 0.184192 0.164752 0.114919 0.103812 7500 1221.5514246 1365.6890356 1957.9007823 2167.3794937
6 7600 0.189133 0.174218 0.117999 0.108771 7600 1221.5742361 1326.1545879 1957.9826948 2124.0955769
6 7700 0.194138 0.171904 0.121119 0.108239 7700 1221.6052499 1379.6072226 1958.0742906 2191.0771533
6 7800 0.199209 0.17627 0.12428 0.114013 7800 1221.6315528 1380.6092926 1958.1589958 2134.493435
6 7900 0.204346 0.182907 0.127482 0.113967 7900 1221.6534701 1364.846616 1958.2372413 2190.4586415
6 8000 0.209547 0.192144 0.130724 0.120208 8000 1221.6829637 1332.3340828 1958.3244087 2129.6419539
6 8100 0.214814 0.194722 0.134007 0.12048 8100 1221.7080823 1347.7675866 1958.4051579 2178.2868526
6 8200 0.220147 0.19346 0.137331 0.126148 8200 1221.7291174 1390.2615528 1958.4798771 2132.0988046
6 8300 0.225545 0.203614 0.140695 0.126256 8300 1221.7517569 1353.3450549 1958.5628487 2182.5497402
6 8400 0.231008 0.205129 0.1441 0.132718 8400 1221.7758692 1375.9146683 1958.6398334 2126.6143251
6 8500 0.236537 0.209752 0.147546 0.132801 8500 1221.7961672 1377.8176132 1958.7111816 2176.1884323
6 8600 0.242131 0.215312 0.151033 0.138049 8600 1221.8179415 1374.0060935 1958.7772209 2143.0071931
6 8700 0.24779 0.219469 0.15456 0.138486 8700 1221.8410751 1379.5114572 1958.8509317 2186.2137689
6 8800 0.253515 0.224021 0.158128 0.145332 8800 1221.8606394 1382.7275122 1958.9193565 2131.3957009
6 8900 0.259305 0.230244 0.161737 0.145048 8900 1221.881568 1376.1053491 1958.9827931 2184.3803431
6 9000 0.265161 0.233626 0.165387 0.152285 9000 1221.8991481 1386.8319451 1959.0415208 2127.5897166
6 9100 0.271081 0.237302 0.169077 0.151524 9100 1221.9225988 1395.858442 1959.1073889 2186.0563343
6 9200 0.277068 0.24418 0.172808 0.159746 9200 1221.9382967 1386.5181424 1959.168557 2119.3644911
6 9300 0.283119 0.24774 0.176579 0.157581 9300 1221.9596707 1396.4640349 1959.2363758 2195.4423439
6 9400 0.289236 0.252767 0.180392 0.166997 9400 1221.9779004 1398.283795 1959.2886603 2116.4452056
6 9500 0.295419 0.25768 0.184245 0.162894 9500 1221.993169 1400.962434 1959.3476078 2216.1651135
6 9600 0.301667 0.263875 0.188139 0.17369 9600 1222.0096994 1397.0251066 1959.4023568 2122.4019805
6 9700 0.30798 0.268483 0.192073 0.169985 9700 1222.0274044 1401.8019763 1959.463329 2214.0777127
6 9800 0.314358 0.272844 0.196048 0.179784 9800 1222.0462021 1407.9840495 1959.5201175 2136.7863659
6 9900 0.320802 0.278794 0.200064 0.17686 9900 1222.0622066 1406.1995595 1959.5729367 2216.6685514
6 10000 0.327312 0.284279 0.204121 0.186749 10000 1222.0755732 1407.0684081 1959.6219889 2141.9124065
[Plot: Mflops versus matrix order; series: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2]
(b) AATX
Figure 6.6: Actual and predicted runtimes comparison for matrix kernels on the Opteron machine.
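The Mflops columns in the tabulated data for Figure 6.6 appear to follow directly from the runtime columns. The following is a minimal sketch of that conversion, assuming runtimes are in seconds and that both matrix kernels are counted as 4n^2 floating-point operations for a matrix of order n. That operation count is inferred from the tabulated rows rather than stated with the figure, and the function names are hypothetical.

```python
def matrix_kernel_flops(n: int) -> int:
    """Operation count for a matrix kernel of order n.

    Assumption: 4 * n**2 flops, inferred from the tabulated rows
    (runtime times Mflops equals 4 * n**2 / 1e6), not taken from the text.
    """
    return 4 * n * n


def mflops(flop_count: float, seconds: float) -> float:
    """Convert an operation count and a runtime in seconds to Mflops."""
    return flop_count / seconds / 1e6


# Order-2400 AATX row, "Predicted - 2" and "Actual - 2" runtime columns.
predicted = mflops(matrix_kernel_flops(2400), 0.0112603)
actual = mflops(matrix_kernel_flops(2400), 0.010641)
print(f"predicted: {predicted:.4f} Mflops, actual: {actual:.4f} Mflops")
```

Under these assumptions the two results can be compared against the Mflops entries of the same row (2046.1266574 and 2165.2100367), which is also a convenient way to check a row for transcription damage.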
size size Predicted - 1 Actual - 1 Predicted - 2 Actual - 2
100000 0.000240817 0.001391 0.000160546 0.000617 100000 830.50615197 143.78145219 1245.7488819 324.14910859
200000 0.00196132 0.002872 0.00130755 0.001675 200000 203.94428242 139.27576602 305.91564376 238.80597015
300000 0.00294198 0.00394 0.00196133 0.002476 300000 203.94428242 152.28426396 305.91486389 242.32633279
400000 0.00392264 0.004901 0.0026151 0.003231 400000 203.94428242 163.23199347 305.91564376 247.60136181
500000 0.0049033 0.00612 0.00326887 0.004999 500000 203.94428242 163.39869281 305.91611168 200.040008
600000 0.00588396 0.008509 0.00392265 0.004281 600000 203.94428242 141.02714773 305.91564376 280.30833917
700000 0.00686462 0.007988 0.00457642 0.005134 700000 203.94428242 175.26289434 305.91597799 272.6918582
800000 0.00784528 0.009609 0.00523019 0.005726 800000 203.94428242 166.51056301 305.91622866 279.42717429
900000 0.00882594 0.011101 0.00588396 0.00642 900000 203.94428242 162.14755427 305.91642363 280.37383178
1000000 0.0098066 0.013861 0.00653774 0.006988 1000000 203.94428242 144.28973379 305.91611168 286.20492272
1100000 0.0107873 0.012737 0.00719151 0.00772 1100000 203.94352618 172.72513151 305.91628184 284.97409326
1200000 0.0117679 0.014341 0.00784528 0.008847 1200000 203.94462903 167.35234642 305.91642363 271.27839946
1300000 0.0127486 0.014618 0.00849906 0.00919 1300000 203.94396247 177.86290874 305.91618367 282.91621328
1400000 0.0137292 0.015287 0.00915283 0.009555 1400000 203.94487661 183.16216393 305.91631222 293.04029304
1500000 0.0147099 0.017354 0.0098066 0.01053 1500000 203.94428242 172.87080788 305.91642363 284.9002849
1600000 0.0156906 0.017393 0.0104604 0.011491 1600000 203.94376251 183.98206175 305.91564376 278.4788095
1700000 0.0166712 0.018305 0.0111141 0.011474 1700000 203.94452709 185.74160066 305.91770814 296.32211957
1800000 0.0176519 0.022645 0.0117679 0.013613 1800000 203.94405135 158.97549128 305.91694355 264.45309631
1900000 0.0186325 0.02073 0.0124217 0.013976 1900000 203.94472025 183.3092137 305.91625945 271.89467659
2000000 0.0196132 0.021833 0.0130755 0.014648 2000000 203.94428242 183.20890395 305.91564376 273.0748225
2100000 0.0205939 0.023392 0.0137292 0.015509 2100000 203.9438863 179.54856361 305.91731492 270.81049713
2200000 0.0215745 0.023952 0.014383 0.016155 2200000 203.94447148 183.7007348 305.91670722 272.36149799
2300000 0.0225552 0.026099 0.0150368 0.016992 2300000 203.94410158 176.25196368 305.91615237 270.71563089
2400000 0.0235358 0.029459 0.0156906 0.019243 2400000 203.94462903 162.93832106 305.91564376 249.4413553
2500000 0.0245165 0.029795 0.0163443 0.017153 2500000 203.94428242 167.81339151 305.91704753 291.49419927
2600000 0.0254972 0.028907 0.0169981 0.017834 2600000 203.94396247 179.88722455 305.91654361 291.57788494
2700000 0.0264778 0.031691 0.0176519 0.018817 2700000 203.94443647 170.39538039 305.91607702 286.9745443
2800000 0.0274585 0.030651 0.0183057 0.020287 2800000 203.94413387 182.70203256 305.91564376 276.03884261
2900000 0.0284391 0.032138 0.0189594 0.021036 2900000 203.94456927 180.47171573 305.91685391 275.71781708
3000000 0.0294198 0.033488 0.0196132 0.022735 3000000 203.94428242 179.16865743 305.91642363 263.91027051
3100000 0.0304005 0.032884 0.020267 0.022834 3100000 203.94401408 188.54153996 305.91602112 271.52491898
3200000 0.0313811 0.034682 0.0209207 0.025304 3200000 203.9444124 184.53376391 305.91710602 252.92443882
3300000 0.0323618 0.036309 0.0215745 0.024227 3300000 203.94415638 181.7731141 305.91670722 272.42332934
3400000 0.0333424 0.037461 0.0222283 0.024643 3400000 203.94452709 181.52211633 305.91633188 275.94042933
3500000 0.0343231 0.037483 0.0228821 0.023618 3500000 203.94428242 186.75132727 305.91597799 296.38411381
3600000 0.0353037 0.039669 0.0235358 0.025266 3600000 203.94462903 181.50192846 305.91694355 284.96794111
3700000 0.0362844 0.040805 0.0241896 0.026814 3700000 203.94439484 181.35032472 305.91659225 275.97523682
3800000 0.0372651 0.04259 0.0248434 0.027933 3800000 203.94417297 178.44564452 305.91625945 272.07961909
3900000 0.0382457 0.043907 0.0254972 0.026404 3900000 203.94449572 177.64821099 305.91594371 295.4097864
4000000 0.0392264 0.046812 0.0261509 0.028963 4000000 203.94428242 170.89635136 305.91681357 276.21448054
4100000 0.040207 0.044227 0.0268047 0.032808 4100000 203.94458676 185.40710426 305.91649972 249.93903926
4200000 0.0411877 0.0469 0.0274585 0.028948 4200000 203.94438145 179.10447761 305.91620081 290.17548708
4300000 0.0421684 0.046559 0.0281123 0.028843 4300000 203.94418569 184.71187096 305.91591581 298.16593281
4400000 0.043149 0.048127 0.028766 0.045805 4400000 203.94447148 182.84954392 305.91670722 192.11876433
4500000 0.0441297 0.048625 0.0294198 0.030754 4500000 203.94428242 185.08997429 305.91642363 292.64485921
4600000 0.0451103 0.04899 0.0300736 0.047312 4600000 203.94455368 187.79342723 305.91615237 194.45383835
4700000 0.046091 0.050511 0.0307273 0.032371 4700000 203.94437092 186.09807765 305.91688824 290.38336783
4800000 0.0470717 0.054124 0.0313811 0.034025 4800000 203.94419577 177.3704826 305.9166186 282.14548126
4900000 0.0480523 0.052686 0.0320349 0.032987 4900000 203.94445219 186.00766807 305.91635997 297.08673114
5000000 0.049033 0.054358 0.0326887 0.03344 5000000 203.94428242 183.96556165 305.91611168 299.0430622
5100000 0.0500136 0.053303 0.0333424 0.034208 5100000 203.94452709 191.35883534 305.91679063 298.17586529
5200000 0.0509943 0.057653 0.0339962 0.034465 5200000 203.94436241 180.3895721 305.91654361 301.75540403
5300000 0.051975 0.07622 0.03465 0.039618 5300000 203.94420394 139.07110994 305.91630592 267.5551517
5400000 0.0529556 0.058894 0.0353038 0.056477 5400000 203.94443647 183.38031039 305.91607702 191.22828762
5500000 0.0539363 0.061178 0.0359575 0.037982 5500000 203.94428242 179.80319723 305.91670722 289.61086831
5600000 0.0549169 0.068256 0.0366113 0.037827 5600000 203.94450524 164.08813877 305.91647934 296.08480715
5700000 0.0558976 0.062367 0.0372651 0.041182 5700000 203.94435539 182.78897494 305.91625945 276.81996989
5800000 0.0568783 0.058818 0.0379188 0.059709 5800000 203.94421071 197.21853854 305.91685391 194.27556985
5900000 0.0578589 0.064766 0.0385726 0.039262 5900000 203.94442342 182.19436124 305.91663512 300.54505629
6000000 0.0588396 0.065715 0.0392264 0.041076 6000000 203.94428242 182.6067108 305.91642363 292.14139644
6100000 0.0598202 0.066766 0.0398802 0.057987 6100000 203.94448698 182.72773567 305.91621908 210.39198441
6200000 0.0608009 0.065582 0.0405339 0.04256 6200000 203.94434951 189.07627093 305.91677583 291.35338346
6300000 0.0617816 0.06743 0.0411877 0.06393 6300000 203.9442164 186.86044787 305.91657218 197.09056781
6400000 0.0627622 0.067576 0.0418415 0.046628 6400000 203.9444124 189.41636084 305.91637489 274.51316805
6500000 0.0637429 0.072424 0.0424953 0.057476 6500000 203.94428242 179.49850878 305.91618367 226.18136266
6600000 0.0647235 0.071302 0.043149 0.062159 6600000 203.94447148 185.1280469 305.91670722 212.35862868
6700000 0.0657042 0.071629 0.0438028 0.045689 6700000 203.9443445 187.07506736 305.91651675 293.2872245
6800000 0.0666849 0.071735 0.0444566 0.045775 6800000 203.94422126 189.58667317 305.91633188 297.10540688
6900000 0.0676655 0.072374 0.0451104 0.046596 6900000 203.94440298 190.67620969 305.91615237 296.16276075
7000000 0.0686462 0.076062 0.0457641 0.049701 7000000 203.94428242 184.0603718 305.91664645 281.68447315
7100000 0.0696268 0.078627 0.0464179 0.048454 7100000 203.94445817 180.5995396 305.91646757 293.06146035
7200000 0.0706075 0.076517 0.0470717 0.048406 7200000 203.94434019 188.19347335 305.91629365 297.483783
7300000 0.0715882 0.074944 0.0477254 0.052255 7300000 203.94422544 194.81212639 305.9167655 279.39910056
7400000 0.0725688 0.08137 0.0483792 0.052672 7400000 203.94439484 181.88521568 305.91659225 280.98420413
7500000 0.0735495 0.076914 0.049033 0.053641 7500000 203.94428242 195.02301272 305.91642363 279.63684495
7600000 0.0745301 0.081723 0.0496868 0.079205 7600000 203.94444661 185.99415097 305.91625945 191.90707657
7700000 0.0755108 0.083292 0.0503405 0.053121 7700000 203.94433644 184.89170629 305.91670722 289.90418102
7800000 0.0764915 0.081394 0.0509943 0.074296 7800000 203.9442291 191.66031894 305.91654361 209.9709271
7900000 0.0774721 0.081231 0.0516481 0.053132 7900000 203.94438772 194.50702318 305.91638415 297.3725815
8000000 0.0784528 0.082799 0.0523019 0.057373 8000000 203.94428242 193.23904878 305.91622866 278.87682359
8100000 0.0794334 0.084104 0.0529556 0.05607 8100000 203.94443647 192.61866261 305.91665471 288.92455859
8200000 0.0804141 0.084075 0.0536094 0.090531 8200000 203.94433315 195.06393101 305.91649972 181.15341706
8300000 0.0813948 0.085746 0.0542632 0.055839 8300000 203.94423231 193.5950365 305.91634846 297.2832608
8400000 0.0823754 0.083933 0.054917 0.07683 8400000 203.94438145 200.15965115 305.91620081 218.66458415
8500000 0.0833561 0.090667 0.0555707 0.061869 8500000 203.94428242 187.49931066 305.91660713 274.77411951
8600000 0.0843367 0.089015 0.0562245 0.058854 8600000 203.94442751 193.22586081 305.91645991 292.24861522
8700000 0.0853174 0.088054 0.0568783 0.064752 8700000 203.94433023 197.60601449 305.91631606 268.71756857
8800000 0.0862981 0.091391 0.057532 0.09159 8800000 203.94423516 192.57913799 305.91670722 192.16071624
8900000 0.0872787 0.090968 0.0581858 0.059706 8900000 203.94437589 195.67320376 305.91656384 298.12749137
9000000 0.0882594 0.093457 0.0588396 0.061592 9000000 203.94428242 192.60194528 305.91642363 292.2457462
9100000 0.08924 0.093202 0.0594934 0.083444 9100000 203.94441954 195.27477951 305.91628651 218.11034946
9200000 0.0902207 0.101209 0.0601471 0.068303 9200000 203.94432763 181.80201365 305.91666099 269.38787462
9300000 0.0912014 0.093528 0.0608009 0.062727 9300000 203.9442377 198.87092635 305.91652426 296.52302836
9400000 0.092182 0.095594 0.0614547 0.065114 9400000 203.94437092 196.66506266 305.91639045 288.72439107
9500000 0.0931627 0.09506 0.0621085 0.070944 9500000 203.94428242 199.87376394 305.91625945 267.81686964
9600000 0.0941433 0.100678 0.0627622 0.06447 9600000 203.9444124 190.7070065 305.9166186 297.81293625
9700000 0.095124 0.101545 0.063416 0.067733 9700000 203.9443253 191.04830371 305.91648795 286.41873238
9800000 0.0961047 0.101084 0.0640698 0.06643 9800000 203.94423998 193.89814412 305.91635997 295.04741834
9900000 0.0970853 0.10247 0.0647235 0.067281 9900000 203.94436645 193.22728603 305.91670722 294.28813484
10000000 0.098066 0.104061 0.0653773 0.067219 10000000 203.94428242 192.19496257 305.91657961 297.53492316
[Plot: Mflops versus vector dimension; series: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2]
(a) VADD
size analytic-parallel empirical size Predicted - 1 Actual - 1 Predicted - 2 Actual - 2 Predicted - 3 Actual - 3 Predicted - 4 Actual - 4
100000 0.000280957 0.001646 0.000200686 0.000777 0.000200686 0.001001 0.000120415 0.000131 100000 1067.7790552 182.2600243 1494.872587 386.1003861 1494.872587 299.7002997 2491.3839638 2290.0763359
200000 0.00228822 0.002682 0.00163445 0.001795 0.00163445 0.00183 0.000980686 0.000968 200000 262.21254949 223.71364653 367.095965 334.26183844 367.095965 327.86885246 611.81662632 619.83471074
300000 0.00343232 0.005048 0.00245167 0.002989 0.00245167 0.003006 0.00147102 0.001667 300000 262.21331344 178.28843106 367.09671367 301.10404818 367.09671367 299.4011976 611.82036954 539.8920216
400000 0.00457643 0.006222 0.00326889 0.003921 0.00326889 0.003949 0.00196135 0.002201 400000 262.21312246 192.86403086 367.097088 306.04437643 367.097088 303.87439858 611.82348892 545.20672422
500000 0.00572053 0.00682 0.0040861 0.004978 0.0040861 0.005531 0.00245168 0.002798 500000 262.21346623 219.94134897 367.09821101 301.32583367 367.09821101 271.19869825 611.82536057 536.09721229
600000 0.00686463 0.008324 0.00490332 0.005892 0.00490332 0.005883 0.00294201 0.003421 600000 262.21369542 216.24219125 367.09821101 305.49898167 367.09821101 305.9663437 611.82660834 526.16194095
700000 0.00800873 0.009685 0.00572053 0.006796 0.00572053 0.006634 0.00343234 0.00398 700000 262.21385913 216.83014972 367.09885273 309.00529723 367.09885273 316.55110039 611.82749961 527.63819095
800000 0.00915284 0.011303 0.00653775 0.007714 0.00653775 0.007759 0.00392266 0.004587 800000 262.21369542 212.33300894 367.09877251 311.12263417 367.09877251 309.31821111 611.82972779 523.2177894
900000 0.0102969 0.012677 0.00735497 0.009199 0.00735497 0.008794 0.00441299 0.005191 900000 262.21484136 212.98414451 367.09871012 293.51016415 367.09871012 307.02751876 611.83007439 520.13099595
1000000 0.011441 0.013558 0.00817218 0.009564 0.00817218 0.009633 0.00490332 0.005752 1000000 262.21484136 221.27157398 367.09910942 313.67628607 367.09910942 311.42946123 611.83035168 521.55771905
1100000 0.0125851 0.015179 0.0089894 0.010107 0.0089894 0.010561 0.00539365 0.006498 1100000 262.21484136 217.40562619 367.09902774 326.50638172 367.09902774 312.47041 611.83057855 507.84856879
1200000 0.0137292 0.01726 0.00980662 0.011393 0.00980662 0.0113 0.00588398 0.00721 1200000 262.21484136 208.57473928 367.09895968 315.98349864 367.09895968 318.5840708 611.83076761 499.30651872
1300000 0.0148734 0.017359 0.0106238 0.012326 0.0106238 0.012304 0.00637431 0.007547 1300000 262.21307838 224.66731955 367.10028427 316.40434853 367.10028427 316.97009103 611.83092758 516.76162714
1400000 0.0160175 0.018571 0.011441 0.0131 0.011441 0.01345 0.00686464 0.00811 1400000 262.21320431 226.15906521 367.1007779 320.61068702 367.1007779 312.26765799 611.8310647 517.87916153
1500000 0.0171616 0.020153 0.0122583 0.014332 0.0122583 0.014207 0.00735497 0.008582 1500000 262.21331344 223.2918176 367.09821101 313.98269606 367.09821101 316.74526642 611.83118354 524.3532976
1600000 0.0183057 0.021919 0.0130755 0.015256 0.0130755 0.015168 0.0078453 0.008668 1600000 262.21340894 218.98809252 367.09877251 314.63030939 367.09877251 316.4556962 611.83128752 553.76095985
1700000 0.0194498 0.023356 0.0138927 0.01591 0.0138927 0.017171 0.00833563 0.009105 1700000 262.2134932 218.3593081 367.09926796 320.55311125 367.09926796 297.01240464 611.83137927 560.13179572
1800000 0.0205939 0.024569 0.0147099 0.0171 0.0147099 0.016768 0.00882596 0.009798 1800000 262.2135681 219.78916521 367.09970836 315.78947368 367.09970836 322.04198473 611.83146083 551.13288426
1900000 0.021738 0.02585 0.0155271 0.017903 0.0155271 0.01823 0.00931629 0.010182 1900000 262.21363511 220.50290135 367.1001024 318.38239401 367.1001024 312.67142074 611.8315338 559.81143194
2000000 0.0228821 0.0264 0.0163443 0.018792 0.0163443 0.018539 0.00980662 0.010855 2000000 262.21369542 227.27272727 367.10045704 319.28480204 367.10045704 323.64205189 611.83159947 552.7406725
2100000 0.0240262 0.027647 0.0171616 0.019745 0.0171616 0.019478 0.010297 0.011456 2100000 262.21374999 227.87282526 367.09863882 319.06811851 367.09863882 323.44183181 611.82868797 549.9301676
2200000 0.0251703 0.028807 0.0179788 0.020581 0.0179788 0.0206 0.0107873 0.011876 2200000 262.2137996 229.11097997 367.09902774 320.68412614 367.09902774 320.38834951 611.83057855 555.7426743
2300000 0.0263144 0.031173 0.018796 0.021361 0.018796 0.021115 0.0112776 0.012244 2300000 262.21384489 221.34539505 367.09938285 323.01858527 367.09938285 326.7819086 611.83230475 563.54132636
2400000 0.0274585 0.031675 0.0196132 0.022373 0.0196132 0.022555 0.0117679 0.014244 2400000 262.21388641 227.308603 367.09970836 321.81647522 367.09970836 319.21968521 611.8338871 505.47598989
2500000 0.0286026 0.033904 0.0204304 0.023217 0.0204304 0.027468 0.0122583 0.013155 2500000 262.21392461 221.21283624 367.10000783 323.03915235 367.10000783 273.04499782 611.83035168 570.12542759
2600000 0.0297467 0.034493 0.0212476 0.024467 0.0212476 0.024978 0.0127486 0.014038 2600000 262.21395987 226.13283855 367.10028427 318.79674664 367.10028427 312.27480183 611.83188742 555.6347058
2700000 0.0308908 0.03536 0.0220649 0.025054 0.0220649 0.025174 0.0132389 0.014409 2700000 262.21399252 229.07239819 367.0988765 323.3016684 367.0988765 321.7605466 611.83330941 562.14865709
2800000 0.0320349 0.036822 0.0228821 0.025499 0.0228821 0.025373 0.0137293 0.015523 2800000 262.21402283 228.12449079 367.09917359 329.42468332 367.09917359 331.0605762 611.83017342 541.13251305
2900000 0.033179 0.036892 0.0236993 0.027179 0.0236993 0.026777 0.0142196 0.015104 2900000 262.21405106 235.82348477 367.09945019 320.10007727 367.09945019 324.90570266 611.83155644 576.00635593
3000000 0.0343231 0.039516 0.0245165 0.027529 0.0245165 0.028523 0.0147099 0.015713 3000000 262.2140774 227.75584573 367.09970836 326.92796687 367.09970836 315.53483154 611.83284727 572.77413607
3100000 0.0354672 0.039848 0.0253337 0.028691 0.0253337 0.02892 0.0152003 0.016413 3100000 262.21410204 233.38687011 367.09994987 324.14345962 367.09994987 321.57676349 611.83002967 566.62401755
3200000 0.0366113 0.041797 0.0261509 0.029224 0.0261509 0.02955 0.0156906 0.017058 3200000 262.21412515 229.68155609 367.10017628 328.49712565 367.10017628 324.87309645 611.83128752 562.78578966
3300000 0.0377554 0.050923 0.0269682 0.030693 0.0269682 0.030407 0.0161809 0.017471 3300000 262.21414685 194.41116981 367.09902774 322.54911543 367.09902774 325.58292498 611.83246915 566.6533112
3400000 0.0388995 0.043616 0.0277854 0.031642 0.0277854 0.031646 0.0166712 0.019072 3400000 262.21416728 233.85913426 367.09926796 322.3563618 367.09926796 322.31561651 611.83358127 534.81543624
3500000 0.0400436 0.045946 0.0286026 0.032873 0.0286026 0.032095 0.0171616 0.019515 3500000 262.21418654 228.52914291 367.09949445 319.41106683 367.09949445 327.15376227 611.8310647 538.04765565
3600000 0.0411877 0.048067 0.0294198 0.033099 0.0294198 0.033316 0.0176519 0.020726 3600000 262.21420473 224.68637527 367.09970836 326.29384574 367.09970836 324.16856766 611.83215405 521.084628
3700000 0.0423318 0.048559 0.030237 0.033872 0.030237 0.033673 0.0181422 0.02061 3700000 262.21422193 228.58790338 367.09991071 327.70429854 367.09991071 329.64095863 611.83318451 538.57350801
3800000 0.0434759 0.049941 0.0310542 0.035213 0.0310542 0.03443 0.0186326 0.02132 3800000 262.21423823 228.26935784 367.1001024 323.74407179 367.1001024 331.10659309 611.83087706 534.70919325
3900000 0.04462 0.049676 0.0318715 0.03536 0.0318715 0.035728 0.0191229 0.020131 3900000 262.2142537 235.52620984 367.09913245 330.88235294 367.09913245 327.47424989 611.83188742 581.19318464
4000000 0.0457641 0.052298 0.0326887 0.036704 0.0326887 0.036445 0.0196132 0.021121 4000000 262.21426839 229.45428123 367.09933402 326.93984307 367.09933402 329.26327343 611.83284727 568.15491691
4100000 0.0469082 0.053162 0.0335059 0.037374 0.0335059 0.03767 0.0201036 0.02179 4100000 262.21428236 231.36827057 367.09952576 329.10579547 367.09952576 326.51977701 611.83071689 564.47911886
4200000 0.0480523 0.054747 0.0343231 0.039091 0.0343231 0.038729 0.0205939 0.023808 4200000 262.21429567 230.14959724 367.09970836 322.3248318 367.09970836 325.33760231 611.83165889 529.23387097
4300000 0.0491964 0.056309 0.0351403 0.03901 0.0351403 0.039711 0.0210842 0.023939 4300000 262.21430836 229.09304019 367.09988247 330.68443989 367.09988247 324.84701972 611.83255708 538.86962697
4400000 0.0503405 0.056467 0.0359575 0.040057 0.0359575 0.040282 0.0215745 0.023272 4400000 262.21432048 233.76485381 367.10004867 329.53041915 367.10004867 327.689787 611.83341445 567.20522516
4500000 0.0514846 0.056778 0.0367748 0.041162 0.0367748 0.041687 0.0220649 0.02366 4500000 262.21433205 237.76814964 367.09920924 327.97240173 367.09920924 323.84196512 611.83146083 570.58326289
4600000 0.0526288 0.058448 0.037592 0.041845 0.037592 0.04188 0.0225552 0.024315 4600000 262.21384489 236.10730906 367.09938285 329.7885052 367.09938285 329.51289398 611.83230475 567.55089451
4700000 0.0537729 0.061302 0.0384092 0.042506 0.0384092 0.043327 0.0230455 0.024902 4700000 262.21386609 230.00880885 367.09954907 331.71787512 367.09954907 325.43217855 611.83311276 566.21958076
4800000 0.054917 0.063321 0.0392264 0.043425 0.0392264 0.043582 0.0235359 0.02604 4800000 262.21388641 227.41270668 367.09970836 331.60621762 367.09970836 330.41163783 611.83128752 552.99539171
4900000 0.0560611 0.063737 0.0400436 0.044022 0.0400436 0.044762 0.0240262 0.025319 4900000 262.2139059 230.6352668 367.09986115 333.92394712 367.09986115 328.40355659 611.83208331 580.59165054
5000000 0.0572052 0.066555 0.0408608 0.045763 0.0408608 0.045684 0. 245165 0.02571 5 00000 262.21392461 225.3775 732 367.10000 83 327.775714 367.10000783 3 8.34252692 611.83284727 583.43 57176
5100000 0.0583493 0.068988 0.0416781 0.046511 0.0416781 0.046121 0. 250068 0.028 41 5100000 262.2139 258 221.7777 047 367.09926 96 328.95444088 367.09926796 3 .73608551 611.8 358127 536.07091 53
5200000 0.0594934 0.066335 0.0424953 0.046495 0.0424953 0.046487 0. 254972 0.028818 5200000 262.213959 7 235.1699706 367.09942041 335.51994838 367.09942041 33 .57 68839 611.83188742 541.32833646
5300000 0.0606375 0.067944 0.0433125 0.048938 0.0433125 0.049394 0. 259875 0.029948 5300000 262.2139765 234.01624868 367.09956 1 324.90089501 367.0995671 321.90144552 611.83261183 53 .92026179
5400000 0.0617816 0.074258 0.0441297 0.0484 0.0441297 0.048667 0. 264778 0.027 74 5400000 262.21399252 218.15831291 367.09970836 334.7107438 367.0997083 3 2. 7 43237 611.8 330941 587.5 9 316
5500000 0.0629257 0.071349 0.0449469 0.050629 0.0449469 0.050821 0. 269682 0.028387 5500000 262.21400795 231.25762099 367.09984448 325.90017579 367.09984 48 324.66 936 7 611. 3171291 581.25198154
5600000 0.0640698 0.072252 0.0457641 0.050566 0.0457641 0.052078 0. 274585 0.0 927 5600000 262.2140 283 232.51951503 367.09997575 332.23905391 367.09997575 322.59303 53 611.83240162 573.96651862
5700000 0.0652139 0.072267 0.0465814 0.053665 0.0465814 0.053412 0. 279488 0.029234 5700000 262.21403719 236.6225248 367.09931432 318.6434 613 367.09931432 320.15277466 611.83306618 584.93 34 25
5800000 0.066358 0.077697 0.0473986 0.052003 0.0473986 0.052136 0. 284392 .032349 5 00000 262.214051 6 223.94687054 367.09945019 334.596081 367.09945019 3 .74251956 611.83155644 537.883 583
5900000 0.0675021 0.076404 0.0482158 0.052949 0.0482158 0.054801 0. 289295 0.047058 5900000 262.21406445 231.6632637 367.09958146 334.2839336 367.09958146 322.98680681 611.83221279 376.1315 23
6000000 0.0686462 0.076633 0.049033 0.054048 0.049033 0.053897 0. 294198 0.030572 6000000 262.2140774 234.88575418 367.09970 36 333.03730018 367.0997 836 333.9703508 611.83284727 588.77404161
6100000 0.0697903 0.077738 0.0498502 0.054477 0.0498502 0.058063 0. 299101 0.031819 61 0000 262.21408992 235.4061077 367.09983109 335.92158158 367.09983109 315.17489623 611.83346094 5 5.12806814
6200000 0.0709344 0.080626 0.0506674 0.071042 0.0506674 0.057956 0. 304005 0.0 2227 6200000 262.21410 0 230.69481309 367.09994987 261.81695335 367.09994987 320.93312168 611.83204224 577.15580104
6300000 0.0720785 0.079448 0.0514847 0.057904 0.0514847 0.057886 0. 308908 0.036813 6300000 262.214113 237.89145101 367.09935185 326.40232108 367.09935185 326.5038178 611.8326492 513.40559042
6400000 0.0732226 0.082037 0.0523019 0.056647 0.0523019 0.061599 0. 313811 0.033755 6400000 262.21412515 234.04073772 367.0994744 338.94116193 367.0994744 3 1.6 33716 611.8332372 568.804 154
6500000 0.0743667 0.085761 0.0531191 0.058119 0.0531191 0.058409 0. 318715 0.0343 65 0000 262.21413617 227.37608004 367.09959318 335.51850514 367.0995 318 33.85265969 611.8 188742 568.513 1953
6600000 0.0755108 0.084104 0.0539363 0.05936 0.0539363 0.058734 0. 323618 0.036291 6600000 262.21414685 235.4228 985 367.09970836 333.55795148 367.0997083 337.11308612 611.8 246915 545.58981566
6700000 0.0766549 0.086267 0.0547535 0.063165 0.0547535 0.059796 0. 328521 .053074 6700000 262.21415722 232.99755411 367.0998201 318.2142009 367.0998201 336.14288581 611.8330335 378.71650903
6800000 0.077799 0.086296 0.0555707 0.076938 0.0555707 0.064654 0. 333425 .034555 6800000 262.21416728 236.39566144 367.09992856 265.14856118 367.09992856 315.52 72153 611.83174627 590.36318912
6900000 0.0789431 0.089637 0.056388 0.068345 0.056388 0.069506 0. 338328 0.054007 6 00000 262.214177 5 230.93142341 367.09938285 3 2.87511888 367.09938285 2 7.81601588 611.83230475 383.28364842
7000000 0.0800872 0.092182 0.0572052 0.062087 0.0572052 0.066503 0. 343231 0.037 23 7000000 262.214186 227.8102 156 367.099494 5 338.23505726 367.0994 445 315.77522819 611.83284727 564.16731591
7100000 0.0812313 0.093331 0.0580224 0.063736 0.0580224 0.063298 0. 348134 0.039549 7100000 262.21419576 228.21999121 367.09960291 334.1910 803 367.09960291 3 .50352 2 611.83337451 538.57240385
7200000 0.0823754 0.091669 0.0588396 0.063985 0.0588396 0.067731 0. 353038 0.0 5981 72 0000 262.21420473 235.63036577 367.09970836 337.57912011 367.09970836 318.90862382 611.83215405 469.75924839
7300000 0.0835195 0.093558 0.0596568 0.066537 0.0596568 0.065205 0. 357941 0.038893 7300000 262.21421 5 234.0793946 367.09981092 329.14017765 367.09981092 3 5.86381412 611. 3267633 563.08 3 119
7400000 0.0846636 0.093416 0.060474 0.066471 0.060474 0.066972 0. 362844 0.046 35 7400000 262.21422193 237.64665582 367.09991071 333.9802 198 367.09991071 331.4818133 611.83318451 475.01872258
7500000 0.0858077 0.094962 0.0612913 0.0696 0.0612913 0.066673 0. 367748 0.039447 75 0000 262.21423019 236.93688002 367.09940889 323.27586207 367.09940889 337.4679405 611.8320154 570.3855 065
7600000 0.0869518 0.097025 0.0621085 0.06691 0.0621085 0.067982 0. 372651 0.039393 7600000 262.21423823 234.99098171 367.0995 134 34 .7562 973 367.09951134 335.38289547 611.8325189 578.78303252
7700000 0.0880959 0.100624 0.0629257 0.068343 0.0629257 0.068812 0.0377554 0.061666 7700000 262.21424607 229.56749881 367.09961113 338.00096572 367.09961113 335.69726211 611.83300932 374.59864431
7800000 0.08924 0.099373 0.0637429 0.069454 0.0637429 0.069965 0.0382458 0.061226 7800000 262.2142537 235.47643726 367.09970836 336.91364068 367.09970836 334.45294076 611.83188742 382.19057263
7900000 0.0903841 0.101586 0.0645601 0.0701 0.0645601 0.071867 0.0387361 0.062021 7900000 262.21426114 233.29986415 367.09980313 338.08844508 367.09980313 329.77583592 611.83237342 382.12863385
8000000 0.0915283 0.105528 0.0653773 0.071955 0.0653773 0.071049 0.0392264 0.06335 8000000 262.21398191 227.42779168 367.09989553 333.54179696 367.09989553 337.79504286 611.83284727 378.84767167
8100000 0.0926724 0.101783 0.0661945 0.071744 0.0661945 0.074144 0.0397167 0.063203 8100000 262.21399252 238.74320859 367.09998565 338.70428189 367.09998565 327.74061286 611.83330941 384.47542047
8200000 0.0938165 0.10581 0.0670118 0.075113 0.0670118 0.079496 0.0402071 0.049833 8200000 262.21400287 232.49220301 367.09952576 327.50655679 367.09952576 309.44953205 611.83223858 493.64878695
8300000 0.0949606 0.10404 0.067829 0.073479 0.067829 0.073124 0.0406974 0.064967 8300000 262.21401297 239.33102653 367.09961816 338.87233087 367.09961816 340.51747716 611.83269693 383.27150707
8400000 0.0961047 0.109943 0.0686462 0.073743 0.0686462 0.074665 0.0411877 0.065823 8400000 262.21402283 229.20968138 367.09970836 341.72735039 367.09970836 337.50753365 611.83314436 382.84490224
8500000 0.0972488 0.107062 0.0694634 0.076697 0.0694634 0.074476 0.0416781 0.047934 8500000 262.21403246 238.17974632 367.09979644 332.47715034 367.09979644 342.39218003 611.83211327 531.98147453
8600000 0.0983929 0.111331 0.0702806 0.078255 0.0702806 0.077997 0.0421684 0.067382 8600000 262.21404187 231.7413838 367.09988247 329.69139352 367.09988247 330.78195315 611.83255708 382.89157342
8700000 0.099537 0.10973 0.0710978 0.077035 0.0710978 0.076788 0.0426587 0.045623 8700000 262.21405106 237.856557 367.09996652 338.80703576 367.09996652 339.89685888 611.83299069 572.07987199
8800000 0.100681 0.111715 0.0719151 0.081526 0.0719151 0.078764 0.0431491 0.046514 8800000 262.21432048 236.31562458 367.09953821 323.82307485 367.09953821 335.17850795 611.8319965 567.57105388
8900000 0.101825 0.113353 0.0727323 0.078238 0.0727323 0.078716 0.0436394 0.048503 8900000 262.21458384 235.5473609 367.09962424 341.26639229 367.09962424 339.19406474 611.83242666 550.48141352
9000000 0.102969 0.116219 0.0735495 0.081687 0.0735495 0.082075 0.0441297 0.050108 9000000 262.21484136 232.32001652 367.09970836 330.52994969 367.09970836 328.96740786 611.83284727 538.83611399
9100000 0.104113 0.115775 0.0743667 0.079358 0.0743667 0.081617 0.04462 0.051023 9100000 262.21509322 235.80220255 367.09979063 344.01068575 367.09979063 334.48913829 611.83325863 535.05281932
9200000 0.105257 0.118205 0.0751839 0.081041 0.0751839 0.084173 0.0451104 0.052885 9200000 262.2153396 233.49266105 367.09987112 340.56835429 367.09987112 327.89611871 611.83230475 521.88711355
9300000 0.106402 0.119117 0.0760011 0.081516 0.0760011 0.081875 0.0456007 0.048845 9300000 262.21311629 234.22349455 367.09994987 342.26409539 367.09994987 340.76335878 611.83271309 571.19459515
9400000 0.107546 0.120731 0.0768184 0.086259 0.0768184 0.083788 0.046091 0.048949 9400000 262.21337846 233.57712601 367.09954907 326.9224081 367.09954907 336.56370841 611.83311276 576.1098286
9500000 0.10869 0.120776 0.0776356 0.083638 0.0776356 0.085718 0.0465814 0.049382 9500000 262.21363511 235.97403458 367.09962955 340.75420264 367.09962955 332.48559229 611.83219053 577.13336843
9600000 0.109834 0.122722 0.0784528 0.084692 0.0784528 0.088661 0.0470717 0.051542 9600000 262.21388641 234.67674908 367.09970836 340.05573136 367.09970836 324.83279007 611.83258731 558.767607
9700000 0.110978 0.121379 0.07927 0.088439 0.07927 0.085176 0.047562 0.053553 9700000 262.21413253 239.74493117 367.09978554 329.0403555 367.09978554 341.64553395 611.83297591 543.3869251
9800000 0.112122 0.125193 0.0800872 0.091222 0.0800872 0.089536 0.0480524 0.050779 9800000 262.21437363 234.83741104 367.09986115 322.29067549 367.09986115 328.35954253 611.83208331 578.9794994
9900000 0.113266 0.126542 0.0809044 0.087453 0.0809044 0.087524 0.0485427 0.052155 9900000 262.21460986 234.70468303 367.09993523 339.61099105 367.09993523 339.33549655 611.83246915 569.45642796
10000000 0.11441 0.126274 0.0817217 0.088986 0.0817217 0.089409 0.049033 0.05415 10000000 262.21484136 237.57859892 367.09955862 337.13168364 367.09955862 335.53669094 611.83284727 554.0166205
[Plot omitted. x-axis: Vector Dimension (0 to 10000000); y-axis: Mflops (0 to 3000); series: Predicted - 1, Actual - 1, Predicted - 2, Actual - 2, Predicted - 3, Actual - 3, Predicted - 4, Actual - 4]
(b) WAXPBY
Figure 6.7: Actual and predicted runtimes comparison for vector kernels on the Opteron machine.
Chapter 7
Integration of Runtime Prediction into BTO Compiler
The serial and parallel memory models described in this thesis are integrated into the BTO
compiler within which they reduce the amount of time used to empirically test different versions of
the same routine. In this chapter, we describe how the models are integrated into BTO and how
we use them to reduce search time in BTO. Throughout the chapter, we compare the search
time and the produced routine’s performance when the model is used in conjunction with empirical
search against those of exhaustive empirical search.
7.1 Integration of the Model into BTO
To include the models in BTO, we first convert the internal graph representation of a routine,
used by the compiler, to the tree format used by the model. Different representations are used
because BTO requires information to make decisions that the model does not need. The less
complex tree representation used by the model adds convenience and speed. Next, the compiler
calls a create machine function, which generates a machine structure for the current system and
model found at install time, as described in Section 5.1. Then the compiler invokes an interface
function that runs the memory traffic prediction and cost functions. The interface function returns
runtime estimates for the input routine and machine pair. In serial, a single estimate is returned,
and, in parallel, the best and worst case predictions are returned. These values are then placed in
a vector structure sorted by model predictions.
The BTO compiler has five flags that control how the model and empirical testing interact
when searching for high performing routines. Two of these flags, -m [ --model off] and -e [
--empirical off], each turn off one evaluation method. The other three control the routines and the
sizes that are empirically tested. The flag -l [ --limit] specifies a maximum amount of time in
seconds to be used for empirical search. When selected, the empirical search begins with the routine
predicted best by the model and continues until the time limit is reached. The size(s) of routines
to model and empirically test are set by -r [ --test param]. How aggressively the model trims
the search space is determined by the -t [ --threshold] flag. This flag restricts the compiler to
empirically test only those versions within a set percentage of the version predicted best by the
model.
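The threshold rule described above can be sketched in a few lines. This is only an illustration, not BTO's implementation: the function name `select_for_testing`, the representation of predictions as (version, predicted runtime) pairs, and the reading of the threshold as a fraction of the best predicted runtime are all assumptions made for the example.

```python
def select_for_testing(predictions, threshold):
    """Return the versions whose predicted runtime lies within `threshold`
    (a fraction, e.g. 0.01 for 1%) of the best predicted runtime.

    `predictions` is a list of (version_name, predicted_seconds) pairs.
    Hypothetical helper; BTO's internal data structures are not shown here.
    """
    if not predictions:
        return []
    # Sort by predicted runtime so the best predicted version comes first.
    ranked = sorted(predictions, key=lambda p: p[1])
    best = ranked[0][1]
    cutoff = best * (1.0 + threshold)
    return [name for name, seconds in ranked if seconds <= cutoff]


# Made-up predictions for four versions of one routine.
candidates = [("v0", 1.000), ("v1", 1.005), ("v2", 1.200), ("v3", 0.998)]
selected = select_for_testing(candidates, 0.01)
print(selected)
```

A smaller threshold shrinks the list toward the single best predicted version, while a larger one approaches exhaustive testing.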
The use of these flags can be both advantageous and detrimental to compiler runtime and
produced routine performance. To decrease compile time, empirical testing can be turned off,
in which case the compiler produces the version of the routine the model predicts as best. While
this substantially reduces compile time, not empirically testing routines can result in subpar performance of the produced routine,
as shown later in this chapter. A second way to reduce compile time is to order all versions based
on the model’s predictions and then test them in that order until a given amount of time elapses.
By specifying a maximum amount of time for empirical testing, compilation is guaranteed to finish
in an acceptable amount of time. However, routines with poor predicted and actual performance
might be tested, thus wasting time.
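The time-limited strategy can be sketched as a loop over the model-ordered versions. The names `timed_search`, `run_test` and `FakeClock` are hypothetical, and the injected clock exists only so the example runs deterministically; the sketch assumes a test that has started is allowed to finish, so the budget can be overshot by at most one test.

```python
import time


def timed_search(ranked_versions, run_test, time_limit, clock=time.monotonic):
    """Empirically test versions in model-predicted order until the time
    budget is spent. Returns (best_version, best_runtime, tested_versions)."""
    start = clock()
    best_version, best_runtime = None, float("inf")
    tested = []
    for version in ranked_versions:
        # Stop before starting a new test once the budget is used up.
        if clock() - start >= time_limit:
            break
        runtime = run_test(version)
        tested.append(version)
        if runtime < best_runtime:
            best_version, best_runtime = version, runtime
    return best_version, best_runtime, tested


class FakeClock:
    """Deterministic stand-in for a wall clock, used only in this demo."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
measured = {"v3": 0.9, "v0": 0.7, "v1": 1.1, "v2": 1.5}  # made-up runtimes


def fake_test(version):
    clock.now += 1.0  # pretend each empirical test costs one second
    return measured[version]


result = timed_search(["v3", "v0", "v1", "v2"], fake_test, 2.5, clock=clock)
print(result)
```

In the demo, the version predicted best ("v3") is not the fastest measured one, which is the situation in which spending part of the budget on further versions pays off.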
Another way to trim search time is to empirically test only those routines predicted by the model
to have runtimes within a certain percentage of the best predicted runtime for a version.
Empirically testing only the best predicted versions impacts compile time and produced routine
performance to varying degrees, depending on the threshold set. The threshold and time limit flags can
be used together, combining the advantages and disadvantages of each. If time is not a
concern, then all routines can be empirically tested by turning off the model or setting the threshold
parameter to one. Finally, if a user knows that, on their machine, routines of varying sized kernels
perform similarly, empirical tests can be performed on smaller test sizes than the target kernel size.
Empirical testing time is reduced and/or more kernels can be tested in the same amount of time,
but the user risks a suboptimal kernel being produced.

Kernel      Serial Versions   Parallel Versions
AATX        2                 4
BiCGK       3                 8
DGEMV       4                 18
DGEMVT      8                 31
GEMVER      648               7808
GESUMMV     12                52
HOUSE       6                 24
AXPYDOT     4                 9
GRAMM       2                 3
VADD        2                 3
WAXPBY      4                 9

Table 7.1: The number of serial and parallel versions produced by the BTO compiler for various
routines.
7.2 Analysis of Model’s Effectiveness in Reducing Compile Time
The model and runtime prediction function are only useful to the BTO compiler if they
reduce overall search time without compromising the quality of the produced routines. In this
section, we analyze the model’s effectiveness at reducing the search space and compare the model’s
speed to empirical testing. We include discussion of the model’s strengths and weaknesses along
with performance results in our evaluation.
7.2.1 Experimental Setup
We tested the model’s ability to trim the compiler’s search space for the latest release of the
BTO compiler (Version 1.2) on the kernels listed in Table 3.2. The selected kernels were chosen
for their diversity of routine features, real-world use and varying search space sizes. Tests were
performed on the Work and Opteron systems in Tables 3.1 and 6.1 for both serial and parallel
routines. The number of versions produced by BTO for each of these routines is shown in Table
7.1. The kernels at the top of the table contain matrices and vectors and the kernels at the bottom
include only vectors. All the tables in this chapter are divided in this manner.
We ran all the kernels on both machines and measured the amount of time used by the model
to predict the runtimes of all versions of a routine. We also measured the amount of time needed
to empirically test all produced versions.
7.2.2 Cost of Modeling and Empirically Testing Runtimes
Tables 7.2 and 7.3 show the number of versions of a routine that our model and empirical
testing analyze per second on the Work and Opteron machines. The tables show that the model
predicts the performance of routines hundreds to thousands of times faster than empirical testing
measures it. The tables also show that our model’s runtime varies little with data structure
size, whereas the amount of time needed to empirically test routines is proportional to the dimensions
of the matrices and vectors tested.
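The ratio between the two evaluation rates can be computed directly from the reported figures. The snippet below uses the serial entries for AATX and GEMVER from Table 7.2 (Work machine); the dictionary layout and variable names are illustrative only.

```python
# (kernel, dimension) -> (model versions/s, empirical versions/s),
# serial values taken from Table 7.2 (Work machine).
rates = {
    ("AATX", 2000): (4843, 2.37),
    ("AATX", 10000): (4890, 0.20),
    ("GEMVER", 2000): (734, 1.31),
    ("GEMVER", 6000): (733, 0.23),
}

# How many times more versions per second the model evaluates.
speedups = {key: model / empirical for key, (model, empirical) in rates.items()}

for (kernel, dimension), ratio in speedups.items():
    print(f"{kernel} n={dimension}: model is about {ratio:.0f}x faster")
```

Because the model's rate stays nearly constant while the empirical rate drops as the dimension grows, the ratio widens for larger problems.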
From the tables, we observe that both empirical testing and the model are slower for more
complex routines. For example, GEMVER, which is the most complex routine containing matrices,
is the slowest to model and empirically test on both machines in parallel and serial. The parallel
model runs at least 25% slower than the serial model. The parallel model is slower because it calls
a second runtime prediction function and parallel machines are more complex. Empirical testing of
routines occurs at approximately the same speed whether the routine is serial or parallel. Finally,
not shown in the table but of note is that the model evaluates each version of a routine three to
four times faster than the compiler generates them on both machines.
Figure 7.1 shows that the runtime of the model is fairly independent of matrix order. When
GEMVER is modeled for matrix orders ranging from one to one million on the Work machine,
there are only slight runtime differences as size changes. For both the serial and parallel models,
runtimes vary with routine size by only 20% to 30%.

Kernel      Dimension     Model (Serial/Parallel)   Empirical (Serial/Parallel)
AATX        2000          4843/3344                 2.37/2.19
            10,000        4890/3309                 0.20/0.20
BICG        2000          5396/3501                 2.40/2.17
            10,000        5396/3492                 0.19/0.20
DGEMV       2000          3252/2575                 2.35/1.79
            10,000        3147/2497                 0.20/0.17
DGEMVT      2000          2092/1406                 2.28/2.03
            10,000        2116/1413                 0.19/0.19
GEMVER      2000          734/456                   1.31/1.14
            6000          733/455                   0.23/0.23
GESUMMV     2000          2210/1431                 1.59/1.47
            10,000        2224/1431                 0.10/0.11
HOUSE       2000          2326/1622                 2.21/1.99
            10,000        2325/1614                 0.17/0.18
AXPYDOT     2,000,000     5487/3314                 1.70/1.57
            10,000,000    5479/3308                 0.47/0.46
GRAMM       2,000,000     5076/3128                 2.43/2.23
            10,000,000    5115/3128                 0.81/0.79
VADD        2,000,000     7905/5848                 1.63/1.65
            10,000,000    7782/5747                 0.46/0.47
WAXPBY      2,000,000     5706/3965                 1.96/1.77
            10,000,000    5755/3956                 0.56/0.59

Table 7.2: Number of routines the model and empirical testing evaluate per second on the Work
machine.

Kernel      Dimension     Model (Serial/Parallel)   Empirical (Serial/Parallel)
AATX        1000          2058/1500                 1.57/3.58
            10,000        2058/1466                 0.16/0.18
BICG        1000          2445/1574                 1.71/3.60
            10,000        2368/1587                 0.17/0.19
DGEMV       1000          1494/824                  1.55/2.49
            10,000        1440/830                  0.18/0.16
DGEMVT      1000          891/610                   2.35/2.81
            10,000        943/607                   0.16/0.19
GEMVER      1000          763/458                   1.07/1.80
            10,000        752/466                   0.19/0.22
GESUMMV     1000          986/620                   2.65/2.35
            10,000        966/620                   0.10/0.10
HOUSE       1000          1035/697                  2.31/2.70
            10,000        1027/687                  0.15/0.17
AXPYDOT     1,000,000     2509/1483                 1.80/1.65
            10,000,000    2536/1483                 0.38/0.41
GRAMM       1,000,000     2291/1372                 1.60/1.96
            10,000,000    2210/1386                 0.63/0.65
VADD        1,000,000     3305/2573                 1.38/2.31
            10,000,000    3521/2560                 0.41/0.37
WAXPBY      1,000,000     2559/1824                 2.11/2.10
            10,000,000    2492/1832                 0.45/0.50

Table 7.3: Number of routines the model and empirical testing evaluate per second on the Opteron
machine.

[Plots omitted. Panels: (a) Serial, (b) Parallel; x-axis: Matrix Order (1 to 1000000); y-axis: Seconds]
Figure 7.1: Time to model all versions of GEMVER on Work machine.

Both models follow the same pattern with runtime increasing as routine size increases up to
one thousand and then slightly decreasing as routines get larger. The increase for small matrix
orders occurs as the calculation becomes larger than successive caches, increasing the number of
caches for which misses must be calculated. From a matrix order of ten to one hundred, the matrices
become larger than the L1 cache and from one hundred to one thousand they become larger than
the L2 cache. For a matrix order of one, the model takes longer to run than for an order of ten.
While modeling for such a small size is not practical, the extra runtime is probably attributable to
modeling register misses for more variables. Within the model the code to model register misses
is more complex than the code to model cache misses, which most likely increases the runtime.
Finally, the extra predicted runtime produced by the parallel model reduces variance by adding a
fixed cost to the parallel model.
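The cache-boundary argument in the paragraph on matrix orders can be checked with a short footprint calculation. The sketch assumes 8-byte double-precision elements and counts a single square matrix only; the cache capacities to compare against are those listed for the machine and are not reproduced here.

```python
BYTES_PER_ELEMENT = 8  # assumption: double-precision matrix entries


def matrix_bytes(order):
    """Storage for one dense square matrix of the given order."""
    return order * order * BYTES_PER_ELEMENT


# Footprint at the matrix orders discussed in the text.
footprints = {order: matrix_bytes(order) for order in (10, 100, 1000)}
for order, size in footprints.items():
    print(f"order {order}: {size} bytes")
```

Each tenfold increase in matrix order multiplies the footprint by one hundred, so a single step on the logarithmic axis is enough to move the working set from one cache level to the next.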
7.2.3 Model’s Impact on Serial Compile Time
Tables 7.4 and 7.5 show the impact of the serial model on the number of routines tested and on compile time
on the Work and Opteron machines. The runtime column reports the amount of time it takes to
generate the routines, perform model predictions for all versions of the routine and empirically
test the best predicted versions. The savings column reports the compile time reduction from using
the model in hybrid search, as opposed to empirically testing all routines. Positive values represent
savings and negative values represent increased costs. For both tables, different versions of routines
are only empirically tested if their predicted runtime is within 1% of that of the best predicted
routine. When a single routine is predicted to be the best routine, it is not tested.
In all but one case, the optimal routine was either the single routine predicted best by the
model or in the group of best predicted routines. Compile times are also reduced by over 99% for
all but a few routines when using hybrid search. For the routines where multiple versions were
empirically tested, compile time was still reduced by half for AXPYDOT and slightly increased for
DGEMV and GESUMMV.
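As a worked illustration, the relative reduction can be derived from the runtime and savings columns. The sketch assumes that the savings value equals the exhaustive search time minus the hybrid search time, so the exhaustive time is their sum; this relationship is an interpretation of the column description, and the helper names are hypothetical. The three rows are taken from Table 7.4 (Work machine).

```python
def exhaustive_time(hybrid_runtime, savings):
    # Assumption: savings = exhaustive time - hybrid time.
    return hybrid_runtime + savings


def percent_reduction(hybrid_runtime, savings):
    """Compile time reduction relative to exhaustive search, in percent."""
    return 100.0 * savings / exhaustive_time(hybrid_runtime, savings)


# kernel/size -> (runtime, savings), values from Table 7.4.
rows = {
    "GEMVER, 2000": (3.528834, 492.419),
    "AXPYDOT, 2,000,000": (1.175, 1.200),
    "DGEMV, 2000": (1.723230, -0.01230),
}

reductions = {name: percent_reduction(*row) for name, row in rows.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}% reduction")
```

A negative result, as for DGEMV, corresponds to the case where every version is still tested and the cost of running the model is added on top.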
For a matrix order of 6000 on the Work machine, the GEMVER routine predicted best by
the model was 2% slower than the optimal routine found by empirical search. The optimal routine
was predicted to be just over 8% slower than the best predicted routine and was found in a group of
seven routines predicted to have nearly identical performance. This cluster of versions was ranked
as the second best through eighth best by the model and empirically testing the first eight routines
118
Kernel Size Empirically Runtime Model
Tested Savings
AATX
2000 0 0.020413 0.822587
10,000 0 0.020409 10.224591
BiCGK
2000 0 0.021556 1.249444
10,000 0 0.021556 15.756444
DGEMV
2000 4 1.723230 -0.01230
10,000 4 19.941271 -0.01271
DGEMVT
2000 0 0.033825 3.501175
10,000 0 0.033781 41.130219
GEMVER
2000 0 3.528834 492.419
6,000 0 3.529889 2877.409
GESUMMV
2000 12 7.587429 -0.005429
10,000 12 123.887396 -0.005396
HOUSE
2000 0 0.029580 2.706420
10,000 0 0.029581 34.920419
AXPYDOT
2,000,000 2 1.175 1.200
10,000,000 2 4.236 4.360
GRAMM
2,000,000 0 0.020394 0.822606
10,000,000 0 0.030391 2.481609
VADD
2,000,000 0 0.020253 1.227747
10,000,000 0 0.020257 4.325743
WAXPBY
2,000,000 0 0.022701 2.042299
10,000,000 0 0.022695 7.149305
Table 7.4: Impact of model on reducing search space and runtime for serial routines on the Work
machine.
119
Kernel    Size         Empirically Tested   Runtime (s)   Model Savings (s)
AATX      1000         0                    0.044972      1.27003
          10,000       0                    0.044908      12.4001
BiCGK     1000         0                    0.047227      1.75277
          10,000       0                    0.047267      17.5447
DGEMV     1000         4                    2.63568       -0.008974
          10,000       4                    22.8628       -0.008484
DGEMVT    1000         0                    0.064974      3.39703
          10,000       0                    0.064484      49.1745
GEMVER    1000         0                    3.29977       606.778
          6,000        0                    0.861650      3429.02
GESUMMV   1000         12                   4.59417       -0.012167
          10,000       12                   125.288       -0.012423
HOUSE     1000         0                    0.056798      2.58520
          10,000       0                    0.05684       40.7542
AXPYDOT   1,000,000    2                    1.498         0.719
          10,000,000   2                    5.272         5.232
GRAMM     1,000,000    0                    0.044873      1.24613
          10,000,000   0                    0.044905      3.18510
VADD      1,000,000    0                    0.043605      1.44940
          10,000,000   0                    0.043568      4.88432
WAXPBY    1,000,000    0                    0.045563      1.89744
          10,000,000   0                    0.045695      8.90040
Table 7.5: Impact of model on reducing search space and runtime for serial routines on the Opteron
machine.
would increase compile time to 43.169 seconds.
On both machines, there are three routines - DGEMV, GESUMMV and AXPYDOT - where
the model groups multiple versions of the same routine together. For DGEMV, the grouping is
accurate because on both machines and for all matrix orders, the performance difference of the
kernels produced was less than 3%. However, for AXPYDOT, performance differences between
the two best predicted versions were 15% on the Opteron and 20% on the Work machine. For
GESUMMV, performance differences were less than 3% on the Work machine for all versions
tested, but on the Opteron the gap between the best and worst routines is over 50% for a matrix
order of 1000 and 30% for a matrix order of 10,000. On the Opteron, the GESUMMV versions
clustered into three groups of four routines with near identical performance. Within the groups
the routines only differ by the fusion of a vector operation. Therefore, if we applied our result in
section 3.6 to GESUMMV, we would test three routines and not twelve to find one of the best
performing routines. The same research can also be applied to DGEMV, reducing its search space
to a single routine on both machines.
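The pruning idea described above, testing one routine per group of versions that differ only in the fusion of a vector operation, can be sketched as follows. The version encoding (a name plus a matrix-level choice and a vector-level choice) and all labels are hypothetical and chosen only to mirror the three groups of four observed for GESUMMV on the Opteron. BTO's internal representation is not shown in this section.

```python
from collections import defaultdict
from itertools import product


def representatives(versions):
    """Keep one version per distinct matrix-level fusion choice.

    versions: iterable of (name, matrix_choice, vector_choice) tuples.
    Versions sharing a matrix_choice are assumed to perform nearly
    identically, so only the first one seen is kept for empirical testing.
    """
    groups = defaultdict(list)
    for name, matrix_choice, _vector_choice in versions:
        groups[matrix_choice].append(name)
    return {choice: names[0] for choice, names in groups.items()}


# Hypothetical search space: 3 matrix-level choices x 4 vector-level choices.
matrix_choices = ["m0", "m1", "m2"]
vector_choices = ["v0", "v1", "v2", "v3"]
versions = [(f"{m}/{v}", m, v) for m, v in product(matrix_choices, vector_choices)]

to_test = representatives(versions)
print(len(versions), "versions reduced to", len(to_test), "empirical tests")
```

With this toy search space, twelve versions collapse to three tests, which matches the reduction described for GESUMMV.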
The model was not ideal in two cases. It missed the best routine, using our 1% criterion, for
one matrix order of GEMVER on the Work machine and failed to differentiate between versions
of GESUMMV on the Opteron with 30-50% performance differences. For GEMVER, depending
on how much a user was planning on using the produced kernel and the amount of time they had
for compilation, our results might be satisfactory. Alternatively, a small increase in compile time
would quickly find the best routine. Not reducing the search space of the GESUMMV routine
highlights a weakness of our approach. We only try to capture large memory differences caused by
loop fusion and can miss smaller effects.
7.2.4 Model’s Impact on Parallel Compile Time
Tables 7.6 and 7.7 show the impact the parallel model has on reducing the number of routines
tested and compile time on the Work and Opteron machines. In the tables, the columns contain
the same information presented for the serial case with one exception. In some cells we present two
values for the number of routines tested, runtime and model savings since we use two criteria to
determine which routines to empirically test. The first criterion is the average of the best and worst
case estimates and the second is the best case estimate. As in our serial experiments, we assume
different versions of routines are empirically tested only if their predicted runtimes are within 1%
of the best predicted routine.
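The selection rule described above can be sketched as a small filter over predicted runtimes. This is a minimal illustration, assuming each version carries a best-case and a worst-case estimate in seconds. The function names and the example predictions are hypothetical.

```python
def candidates_within_threshold(predictions, threshold=0.01, criterion="average"):
    """Return the versions predicted to be within `threshold` of the best.

    predictions: dict mapping version name -> (best_case_s, worst_case_s).
    criterion:   "average" scores by the mean of the two estimates,
                 "best" scores by the best-case estimate alone.
    """
    def score(bounds):
        best, worst = bounds
        return best if criterion == "best" else (best + worst) / 2.0

    scores = {name: score(bounds) for name, bounds in predictions.items()}
    fastest = min(scores.values())
    group = [n for n, s in scores.items() if s <= fastest * (1.0 + threshold)]
    return sorted(group, key=lambda n: (scores[n], n))


def num_empirical_tests(group):
    """A single best-predicted routine is accepted without being tested."""
    return len(group) if len(group) > 1 else 0


# Hypothetical predictions (seconds) for four versions of one routine.
preds = {
    "v1": (1.000, 1.20),
    "v2": (1.000, 1.21),
    "v3": (1.005, 1.50),
    "v4": (1.300, 1.40),
}
by_average = candidates_within_threshold(preds, criterion="average")
by_best = candidates_within_threshold(preds, criterion="best")
print(by_average, by_best)
```

The two criteria can select groups of different sizes for the same predictions, which is why some table cells report two values.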
In parallel, compile time was greatly reduced on both machines, with savings exceeding 99% for
many routines. Additionally, in our parallel tests, the optimal routine was either the
best predicted version or in the group of best predicted versions, with three exceptions. On the
Work machine, the optimal GEMVER implementation for a matrix order of 2000 was 1% faster
than the best kernel in the model’s best predicted group. The optimal routine was predicted to
be just under 17% slower than the best predicted routine and was found in a group of 61 routines
predicted to have nearly identical performance. This cluster of versions was ranked as the 32nd best
through 92nd best by the model. Testing the first ninety-two routines would increase compile time
by 64.922 seconds. On the Opteron with a matrix order of 2000, the best routine was not found
for the GESUMMV and DGEMV routines. The routine found was 23% slower than the optimal
for GESUMMV and 11% slower for DGEMV. For both routines, increasing the threshold to 2%
would have tested all versions produced by the compiler and found the optimal implementation.
The number of additional routines tested would increase by four for DGEMV and sixteen for
GESUMMV.
For the routines on the Work machine where empirical testing was used along with modeling,
the performance of the tested versions varied more than in serial. The versions of DGEMV and
GESUMMV tested saw performance variations of about 25% for a matrix order of 2000. For a
matrix order of 10,000, DGEMV did not have large performance differences between versions, but
GESUMMV had four versions perform about 25% slower than the rest. AXPYDOT had about a
15% performance difference between the two kernels tested at both vector sizes. For
BiCGK with a matrix order of 10,000, the best kernel is found in the group of two kernels with
the best predicted runtime when the best- and worst-case predictions are averaged. Within this
Kernel    Size         Empirically Tested   Runtime (s)        Model Savings (s)
AATX      2000         0                    0.028196           1.82280
          10,000       0                    0.028209           20.1668
BiCGK     2000         2/4                  0.995485/1.88149   2.72252/1.83652
          10,000       2/4                  9.99729/19.9873    30.2127/20.2227
DGEMV     2000         22                   10.1360            -0.006991
          10,000       22                   109.125            -0.007210
DGEMVT    2000         0                    0.123053           15.2779
          10,000       0                    0.122946           159.156
GEMVER    2000         30/31                111.387/111.872    6738.81/6738.32
          6,000        0                    206.047/209.456    34,393.9/34,390.5
GESUMMV   2000         52                   35.5043            -0.036337
          10,000       52                   493.943            -0.036328
HOUSE     2000         0                    0.092797           12.0692
          10,000       0                    0.092788           136.786
AXPYDOT   2,000,000    2                    1.25127            4.47573
          10,000,000   2                    4.261271           15.3687
GRAMM     2,000,000    0                    0.025959           1.34204
          10,000,000   0                    0.025977           3.80302
VADD      2,000,000    0                    0.023513           1.81749
          10,000,000   0                    0.023513           6.39778
WAXPBY    2,000,000    0                    0.034270           5.08973
          10,000,000   0                    0.034275           15.1957
Table 7.6: Impact of model on reducing search space and runtime for parallel routines on the Work
machine.
Kernel    Size         Empirically Tested   Runtime (s)   Model Savings (s)
AATX      1000         0                    0.058666      1.11433
          10,000       0                    0.058728      22.0533
BiCGK     1000         2                    1.149         1.131
          10,000       2                    10.737        31.122
DGEMV     1000         18                   6.795         0.421
          10,000       22                   114.063       -0.021688
DGEMVT    1000         0                    0.203885      10.9771
          10,000       0                    0.204357      164.424
GEMVER    1000         30                   89.439        4307.01
          6000         30                   198.612       35293
GESUMMV   1000         36                   15.604        6.553
          10,000       52                   508.904       -0.083913
HOUSE     1000         0                    0.161424      8.84058
          10,000       0                    0.161914      141.207
AXPYDOT   1,000,000    3                    2.170         3.355
          10,000,000   3                    7.110         14.727
GRAMM     1,000,000    0                    0.054186      1.53081
          10,000,000   0                    0.054164      4.57884
VADD      1,000,000    0                    0.048166      1.29683
          10,000,000   0                    0.048172      8.01283
WAXPBY    1,000,000    0                    0.062934      4.28307
          10,000,000   0                    0.062913      18.0741
Table 7.7: Impact of model on reducing search space and runtime for parallel routines on the
Opteron machine.
group, runtimes vary by 10%. However, for a matrix order of 2000, the best kernel is not found
until all four routines with similar best case runtimes are included in the empirical testing. Within
this group of four kernels, performance varies by 20%. For the GEMVER kernel, both criteria
produced the same routine for both matrix orders. Runtimes of the routines clustered into three
groups for a matrix order of 2000. Of note is that all kernels with the best predicted performance
were in the first group.
On the Opteron, both the average and best predicted time criteria produce groups of equal
size. For AXPYDOT, in the best predicted group, there is a 17% performance difference between
the best and worst kernel for the 1,000,000 test size and a 9% difference for the 10,000,000 size.
Performance for both the DGEMV and GEMVER kernels varies greatly for a matrix order of
1000, with the slowest routine taking over 150% of the time of the fastest. However, for a matrix
order of 10,000, DGEMV sees performance differences of less than 3% and, for a matrix order of
6000, GEMVER performance varies by less than 10%. For GESUMMV with a matrix
order of 10,000, actual routine runtimes fall into two clusters separated by about 15%, while for a
matrix order of 1000 performance varies by just over a factor of two from best to worst. BiCGK
has a small runtime difference of less than 3% for a 10,000 matrix order and a difference of just
under 10% for a matrix order of 1000.
As with our serial model, the parallel model failed to distinguish between versions with
significant performance differences for the GESUMMV and DGEMV kernels on the Opteron. The
parallel model also produced the second best routine on the Work machine for one matrix order of
GEMVER. However, as in the serial case, the model dramatically reduced compile time for most
routines with only a minimal reduction in the quality of the produced kernel.
Chapter 8
Conclusions
Data movement through the memory hierarchy often limits linear algebra routine perfor-
mance. For these routines, reducing data traffic often results in significant speedups. Throughout
this thesis, our focus is on loop fusion, which is one optimization used to minimize data reads and
writes from slow memory. We show the positive and negative impacts of fusion on data movement
and routine performance for linear algebra kernels. How loop fusion affects reads and writes from
memory structures is the basis for the memory model we present.
Our model works in two steps. First, it predicts the amount of data movement from each
memory structure needed to execute a linear algebra routine. Then the model converts those
estimates into runtime predictions in seconds. These runtime predictions are used on both serial
and parallel machines to compare different implementations of the same calculation within the
BTO compiler. When turned on, the model usually reduces compile time by over 99%, while
having negligible impact on the quality of the routine BTO produces.
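The two-step structure described above can be illustrated with a toy calculation. The sketch below is not the thesis model: it assumes a single memory level, a made-up bandwidth figure, and a pair of matrix-vector products that share an n x n matrix too large to stay in cache, so that separate loops read the matrix twice and a fused loop reads it once. All names are hypothetical.

```python
DOUBLE_BYTES = 8  # size of one double-precision value


def matrix_traffic_bytes(n, fused):
    """Step 1 (toy): bytes of the shared n x n matrix read from main memory.

    Assumes the matrix does not fit in cache, so unfused loops read it twice.
    Vector traffic is ignored to keep the example short.
    """
    reads_of_matrix = 1 if fused else 2
    return reads_of_matrix * n * n * DOUBLE_BYTES


def predict_runtime(bytes_moved, bandwidth):
    """Step 2 (toy): convert per-level traffic into seconds.

    bytes_moved and bandwidth map a memory level name to bytes and to
    bytes per second. Summing the per-level times is an assumption here.
    """
    return sum(bytes_moved[level] / bandwidth[level] for level in bytes_moved)


bandwidth = {"memory": 5.0e9}  # hypothetical sustained bandwidth in bytes/s
unfused = predict_runtime({"memory": matrix_traffic_bytes(10_000, fused=False)}, bandwidth)
fused = predict_runtime({"memory": matrix_traffic_bytes(10_000, fused=True)}, bandwidth)
print(f"unfused: {unfused:.2f} s, fused: {fused:.2f} s")
```

Comparing the two predictions, rather than trusting either absolute number, is the use the compiler makes of such estimates when ranking versions.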
The model achieves these compile time reductions at a small accuracy cost by using a series of
tradeoffs between speed and accuracy. For example, we incorporate cache size, data transfer rates
and how many processors share a given cache into the model but do not include cache associativity,
the latency of data reads or the cost of arithmetic instructions. These tradeoffs result in a model
that efficiently reduces the number of routines empirically tested by our compiler, but the
model has a few weaknesses. For almost all the kernels used to test our model, compile time was
significantly reduced and a high performing routine produced. However, this was not the case
for the DGEMV and GESUMMV calculations. When vector operations are fused with matrix
operations, routine performance is usually not impacted. However, since the model is ineffective
at differentiating between routines with small performance differences, we empirically test these
routines. Also, on the Opteron machine, there was a significant performance difference between
versions of GESUMMV but the model predicted them to perform equally. Another way to reduce
the compile time of DGEMV and GESUMMV is to remove fusion decisions that never positively
impact routine performance from the search space.
Overall, the model achieves the task it is designed to handle. It accurately and efficiently
distinguishes between large routine performance differences for most routines and dramatically reduces
the runtime of the BTO compiler.
Chapter 9
Future Work
One way our model can be improved is by increasing the variety of machines for
which it predicts the runtimes of loop fusion kernels. For example, fusion when applied to OpenCL
[50] kernels on graphical processing units (GPUs) is effective at reducing data movement [12]. As
with CPUs, the performance of GPUs is often limited by reads and writes. Additionally, as with
the CPU, fusing many kernels on the GPU can decrease performance. To model GPUs requires
the ability to analyze data movement within the GPU and data transfers between a computer’s
main memory and the GPU. Also needed is a method to determine, at install time, the amount of
useable bandwidth between the CPU and GPU and within the GPU.
Another new machine class that can be modeled is clusters and other distributed memory
machines [75]. Estimating runtime on clusters requires adding latency, which is often a larger cost
of data movement than bandwidth between distributed nodes, to the model. Also, a way to indicate
segments of code that can execute while data transfers occur is needed because overlapping data
movement and computation is common in distributed memory programs.
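One common way to express the two ingredients mentioned above is a latency-plus-bandwidth cost for each transfer, combined with a flag marking whether computation can overlap it. The sketch below is an illustration under those assumptions, not a design taken from the thesis. The function names and the example numbers are hypothetical.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Time for one message: fixed latency plus size divided by bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def step_time(compute_s, comm_s, overlappable):
    """Time for one step of a distributed routine.

    If the code segment can execute while data is in flight, the slower of
    the two activities bounds the step. Otherwise their costs add.
    """
    return max(compute_s, comm_s) if overlappable else compute_s + comm_s


# Hypothetical network: 2 microseconds latency, 1 GB/s bandwidth.
small_message = transfer_time(8, 2e-6, 1e9)          # one double
large_message = transfer_time(80_000_000, 2e-6, 1e9)  # ten million doubles
print(small_message, large_message)
print(step_time(1.0, 0.4, overlappable=True), step_time(1.0, 0.4, overlappable=False))
```

For the small message the latency term dominates, which is the situation the paragraph above describes for data movement between distributed nodes.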
Other enhancements to the model, with applicability to many computing environments, in-
clude the ability to analyze strided data access patterns and recommend cache block sizes. Strided
access patterns occur in many mathematical computations, such as transposing a matrix, and re-
cently the BTO compiler was improved to allow the production of codes with strided accesses [9].
To predict the cost of strided accesses, cache line sizes and/or data movement rates for various
strides are needed and should be found at install time. Strided access patterns create additional
data movement because they only use part of a cache line, but they require the entire line to be
moved through the memory hierarchy. To model strided patterns, an additional field denoting
variables that are accessed using a strided pattern needs to be added to variable representations.
Also, modifications to the model are required to count the extra data moved but not used when
only part of a cache line is used.
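The extra traffic caused by a strided access can be estimated by counting whole cache lines rather than elements. The sketch below assumes 8-byte elements, a 64-byte cache line, an access that starts on a line boundary, and no reuse of a line once it has been evicted. The line size would in practice be found at install time, and the function name is hypothetical.

```python
import math


def strided_bytes_moved(num_elements, stride, element_bytes=8, line_bytes=64):
    """Bytes moved through the hierarchy for a strided traversal.

    stride is measured in elements. Whole cache lines are moved even when
    only part of each line is used.
    """
    if num_elements <= 0:
        return 0
    if stride * element_bytes >= line_bytes:
        # Consecutive accesses are at least one line apart.
        lines = num_elements
    else:
        # Several accesses share a line: count lines spanned from the
        # first accessed byte to the last.
        span = ((num_elements - 1) * stride + 1) * element_bytes
        lines = math.ceil(span / line_bytes)
    return lines * line_bytes


used = 1000 * 8  # bytes actually needed for 1000 doubles
for stride in (1, 2, 8):
    moved = strided_bytes_moved(1000, stride)
    print(f"stride {stride}: {moved} bytes moved, {moved / used:.0f}x the data used")
```

Under these assumptions a stride of eight doubles moves eight times as much data as the routine uses, which is the kind of overhead the added variable field would let the model count.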
The model is currently able to predict the memory traffic of code with cache blocks, but it
cannot determine if the proper size was chosen. Adding the ability to determine the maximum size
of blocks that allows data to fit within cache might improve search times for optimal block size
depending on the additional costs.
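As an illustration of bounding the block size, consider a hypothetical blocked kernel whose working set is one b x b block of doubles plus two vectors of length b. The largest b that fits satisfies 8(b^2 + 2b) <= cache size, which can be solved exactly with an integer square root. The working-set formula, the `sharers` parameter and the function name are assumptions made for this sketch.

```python
import math


def max_block_size(cache_bytes, element_bytes=8, sharers=1):
    """Largest b with b*b + 2*b elements fitting in this thread's cache share.

    sharers is the number of processors assumed to split the cache evenly.
    b*b + 2*b <= capacity  is equivalent to  (b + 1)**2 <= capacity + 1.
    """
    capacity = (cache_bytes // sharers) // element_bytes  # elements that fit
    return max(math.isqrt(capacity + 1) - 1, 0)


b = max_block_size(32 * 1024)  # hypothetical 32 KiB cache
print(b, 8 * (b * b + 2 * b))
```

Any block size above this bound can be discarded before empirical testing, which is one way such a bound might shorten the search.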
To improve model runtime, different versions of the same routine can be evaluated in parallel.
For parallel evaluation to significantly impact compile time, BTO needs to generate routines in
parallel because routine generation is a larger cost than modeling. However, searching for cache
block sizes in parallel with the current model could benefit from parallel evaluation by dedicating
a thread to evaluating different block sizes for the same version of a routine.
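Evaluating several block sizes concurrently for one version of a routine could look like the sketch below, which uses Python's standard `concurrent.futures` thread pool. The `predicted_runtime` function is a hypothetical stand-in with a toy cost, not the model itself. In CPython, threads may not speed up pure-Python arithmetic, so the example shows the structure rather than a measured benefit.

```python
from concurrent.futures import ThreadPoolExecutor


def predicted_runtime(block_size):
    """Toy stand-in for the model's prediction at one block size.

    Loop overhead shrinks as blocks grow, and a penalty applies once a
    block_size x block_size tile exceeds a 4096-element cache.
    """
    overhead = 1.0 / block_size
    penalty = 1.0 if block_size * block_size > 4096 else 0.0
    return overhead + penalty


def best_block_size(candidates, workers=4):
    """Evaluate every candidate concurrently and return the fastest one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        times = list(pool.map(predicted_runtime, candidates))
    return min(zip(times, candidates))[1]


choice = best_block_size([8, 16, 32, 64, 128])
print("best predicted block size:", choice)
```

Because each block size is evaluated independently, no synchronization is needed beyond collecting the results.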
Finally, search strategies other than exhaustive search, such as genetic algorithms, are being
evaluated for their use within BTO [77]. When using the model in these alternative search strategies,
different tradeoffs between runtime and accuracy might be beneficial. The tradeoffs used would
vary depending on their strengths and weaknesses and how the search strategies use the model.
Bibliography
[1] Mark F. Adams. Multigrid Equation Solvers for Large Scale Nonlinear Finite Element
Simulations. PhD thesis, University of California, Berkeley, Berkeley, CA, January 1999.
[2] Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. Software pipelining.
ACM Computing Surveys, 27(3):367–432, September 1995.
[3] W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Achieving
high sustained performance in an unstructured mesh cfd application. In Proceedings of the
1999 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’99, Portland,
Oregon, United States, 1999. ACM.
[4] E. Angerson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. DuCroz, S. Hammarling,
J. Demmel, C. Bischof, and D. Sorensen. LAPACK: A portable linear algebra library for high
performance computers. In Proceedings of Supercomputing ’90, pages 2–11, New York, NY,
November 1990.
[5] A. H. Baker, J. M. Dennis, and E. R. Jessup. An efficient block variant of GMRES. SIAM
J. Sci. Comput., 27:1608–1626, 2006.
[6] A. H. Baker, E. R. Jessup, and T. Manteuffel. A technique for accelerating the convergence
of restarted GMRES. SIAM J. Matrix Anal. Appl., 26:962–984, 2005.
[7] Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. PETSc users
manual. Technical Report ANL-95/11–Revision 2.1.2, Argonne National Laboratory, Ar-
gonne, IL, 2002.
[8] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo,
C. Romine, and H. V. der Vorst. Templates for the solution of linear systems: Building Blocks
for Iterative Methods. SIAM, second edition, 1994.
[9] Geoffrey Belter. Personal communication, December 2010.
[10] Geoffrey Belter, E. R. Jessup, Ian Karlin, and Jeremy G. Siek. Automating the genera-
tion of composed linear algebra kernels. In SC ’09: Proceedings of the Conference on High
Performance Computing Networking, Storage and Analysis, pages 1–12, New York, NY, USA,
2009. ACM.
[11] Geoffrey Belter, Jeremy G. Siek, Ian Karlin, and E. R. Jessup. Automatic generation of tiled
and parallel linear algebra routines. In The Fifth International Workshop on Automatic
Performance Tuning (iWAPT10), pages 1–15, Berkeley, California, June 2010.
[12] B.K. Bergen, M.G. Daniels, and P.M. Weber. A hybrid programming model for compress-
ible gas dynamics using opencl. In Parallel Processing Workshops (ICPPW), 2010 39th
International Conference on, pages 397–404, 2010.
[13] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and James Demmel. Optimizing matrix
multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. In
Proceedings of 11th International Conference on Supercomputing, pages 340–347, New York,
NY, July 1997. ACM Press.
[14] L. S. Blackford, J. Choi, A. Cleary, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling,
G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A Portable
Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance.
In Supercomputing, 1996. Proceedings of the 1996 ACM/IEEE Conference on (CDROM),
pages 1–5, November 1996.
[15] L. Susan Blackford, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry,
Michael Heroux, Linda Kaufman, Andrew Lumsdaine, Antoine Petitet, Roldan Pozo, Karin
Remington, and R. Clint Whaley. An updated set of Basic Linear Algebra Subprograms
(BLAS). ACM Transactions on Mathematical Software, 28(2):135–151, June 2002.
[16] Uday Bondhugula. Effective Automatic Parallelization and Optimization Using the
Polyhedral Model. PhD thesis, The Ohio State University, August 2008.
[17] Uday Bondhugula. PLUTO an automatic loop nest parallelizer for multicores.
http://pluto-compiler.sourceforge.net/, August 2010.
[18] Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, J. Ramanujam, Atanas Roun-
tev, and P. Sadayappan. Automatic transformations for communication-minimized par-
allelization and locality optimization in the polyhedral model. In CC’08/ETAPS’08:
Proceedings of the Joint European Conferences on Theory and Practice of Software 17th
international conference on Compiler construction, pages 132–146, Berlin, Heidelberg, 2008.
Springer-Verlag.
[19] François Broquedis, Jérôme Clet-Ortega, Stéphanie Moreaud, Nathalie Furmento, Brice
Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. hwloc: a Generic
Framework for Managing Hardware Affinities in HPC Applications. In IEEE, editor,
PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and
Network-Based Computing, pages 180–186, Pisa, Italy, February 2010.
[20] Surendra Byna, Xian-He Sun, William Gropp, and Rajeev Thakur. Predicting memory-access
cost based on data-access patterns. In Proceedings of the 2004 IEEE International Conference
on Cluster Computing, pages 327–336, San Diego, CA, September 2004.
[21] Jonathan Carter, Leonid Oliker, and John Shalf. Performance evaluation of scientific applica-
tions on modern parallel vector systems. In High Performance Computing for Computational
Science - VECPAR 2006, volume 4395 of Lecture Notes in Computer Science, pages 490–503.
Springer Berlin / Heidelberg, May 2007.
[22] Texas Advanced Computing Center. GotoBLAS.
http://www.tacc.utexas.edu/resources/software/#blas, 2007.
[23] Chun Chen, Jacqueline Chame, and Mary Hall. Combining models and guided empirical
search to optimize for multiple levels of the memory hierarchy. In CGO ’05: Proceedings of the
international symposium on Code generation and optimization, pages 111–122, Washington,
DC, USA, March 2005. IEEE Computer Society.
[24] Jaeyoung Choi, Jack Dongarra, Susan Ostrouchov, Antoine Petitet, David W. Walker, and
R. Clinton Whaley. A proposal for a set of parallel basic linear algebra subprograms. In PARA
’95: Proceedings of the Second International Workshop on Applied Parallel Computing,
Computations in Physics, Chemistry and Engineering Science, pages 107–114, London, UK,
1996. Springer-Verlag.
[25] A. T. Chronopoulos. s-step iterative methods for (non)symmetric (in)definite linear systems.
SIAM J. Numer. Anal., 28:1776–1789, December 1991.
[26] Alain Darte. On the complexity of loop fusion. Parallel Computing, 26:1175–1193, July 2000.
[27] Kaushik Datta, Shoaib Kamil, Samuel Williams, Leonid Oliker, John Shalf, and Katherine
Yelick. Optimization and performance modeling of stencil computations on modern micro-
processors. SIAM Review, 51:129–159, February 2009.
[28] J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick. Avoiding communication in
sparse matrix computations. Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE
International Symposium on, pages 1–12, April 2008.
[29] J. M. Dennis. Automated memory analysis: improving the design and implementation of
iterative algorithms. PhD thesis, University of Colorado, Boulder, CO, July 2005.
[30] J. M. Dennis and E. R. Jessup. Applying automated memory analysis to improve iterative
algorithms. SIAM Journal on Scientific Computing, 29(5):2210–2223, September 2007.
[31] J. Dongarra, D. Gannon, G. Fox, and K. Kennedy. The impact of multicore on computational
science software. CTWatch Quarterly, 3:3–10, 2007.
[32] J. Dongarra and R. Whaley. A user’s guide to the BLACS v 1.0. Technical Report UT
CS-95-281, LAPACK Working Note No. 94, University of Tennessee, Knoxville, TN, March
1995.
[33] J. J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users’ Guide. SIAM,
Philadelphia, Pa., 1979.
[34] Jack Dongarra. Preface: Basic Linear Algebra Subprograms Technical (BLAST) Forum
Standard I. International Journal of High Performance Applications and Supercomputing,
16(1):1–111, Spring 2002.
[35] Jack Dongarra. Preface: Basic Linear Algebra Subprograms Technical (BLAST) Forum
Standard II. International Journal of High Performance Applications and Supercomputing,
16(2):115–199, Summer 2002.
[36] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended
set of FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical
Software, 14(1):1–17, March 1988.
[37] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 Basic
Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1–17,
March 1990.
[38] Craig C. Douglas, Jonathan Hu, Markus Kowarschik, Ulrich Rüde, and Christian Weiss.
Cache optimization for structured and unstructured multigrid. Electronic Transactions on
Numerical Analysis, 10:21–40, 2000.
[39] I. Duff, M. Heroux, and R. Pozo. An overview of the Sparse Basic Linear Algebra Subpro-
grams: The new standard from the BLAS technical forum. ACM TOMS, 28(2):239–267, June
2002.
[40] Arkady Epshteyn, Mara Garzaran, Gerald DeJong, David Padua, Gang Ren, Xiaoming Li,
Kamen Yotov, and Keshav Pingali. Analytic models and empirical search: A hybrid approach
to code optimization. In Languages and Compilers for Parallel Computing, volume 4339 of
Lecture Notes in Computer Science, pages 259–273. Springer Berlin / Heidelberg, 2006.
[41] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness.
Lecture Notes in Computer Science, 589:328–343, 1991.
[42] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings
of the IEEE, 93(2):216–231, February 2005. Special issue on “Program Generation, Optimiza-
tion, and Platform Adaptation”.
[43] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jef-
frey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine,
Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI:
Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th
European PVM/MPI Users’ Group Meeting, pages 97–104, Budapest, Hungary, September
2004.
[44] G. Gao, R. Olson, V. Sarkar, and R. Thekkath. Collective loop fusion for array contraction.
In Proceedings of the Fifth Workshop on Languages and Compilers for Parallel Computing,
pages 281–295, New Haven, CT, Aug. 2004.
[45] B. S. Garbow, J. M. Boyle, J. J. Dongarra, and C. B. Moler. Matrix Eigensystem
Routines-EISPACK Guide Extension, volume 51. Springer-Verlag, New York, 1977.
[46] M. W. Gee, C. M. Siefert, J. J. Hu, R.S. Tuminaro, and M. G. Sala. ML 5.0 Smoothed
Aggregation User’s Guide. Technical Report SAND2006-2649, Sandia National Laboratories,
May 2006.
[47] Somnath Ghosh, Margaret Martonosi, and Sharad Malik. Cache miss equations: An analyti-
cal representation of cache misses. In Proceedings of the 1997 ACM International Conference
on Supercomputing, pages 317–324. ACM Press, July 1997.
[48] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix.
Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis,
2(2):205–224, 1965.
[49] Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix multipli-
cation. ACM Transactions on Mathematical Software, 34(3):25, May 2008.
[50] Khronos OpenCL Working Group. The OpenCL Specification Version 1.1.
http://www.khronos.org/opencl, 2011.
[51] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME:
Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software,
27(4):422–455, December 2001.
[52] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative
Approach. Morgan Kaufmann, third edition, 2003.
[53] Vicente Hernandez, Jose E. Roman, and Vicente Vidal. SLEPc: A scalable and flexible
toolkit for the solution of eigenvalue problems. ACM Transactions on Mathematical Software
(TOMS), 31(3):351–362, September 2005.
[54] Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howell, Robert J. Hoekstra, Jonathan J.
Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, Eric T.
Phipps, Andrew G. Salinger, Heidi K. Thornquist, Ray S. Tuminaro, James M. Wil-
lenbring, Alan Williams, and Kendall S. Stanley. An overview of the Trilinos project.
ACM Transactions on Mathematical Software (TOMS), 31(3):397–423, September 2005.
[55] Gary W. Howell, James W. Demmel, Charles T. Fulton, Sven Hammarling, and Karen Mar-
mol. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Transactions on
Mathematical Software (TOMS), 34(3), May 2008.
[56] IBM. Engineering Scientific Subroutine Library.
http://www-03.ibm.com/systems/p/software/essl/index.html, 2008.
[57] Intel. Intel Math Kernel Library.
http://www.intel.com/cd/software/products/asmo-na/eng/307757, 2007.
[58] Intel. Intel Compilers and Libraries - Intel Software Network.
http://software.intel.com/en-us/articles/intel-compilers, 2010.
[59] Raj Jain. The Art of Computer Systems Performance Analysis : Techniques for Experimental
Design, Measurement, Simulation and Modeling. Wiley, New York, 1991.
[60] Elizabeth R. Jessup, Ian Karlin, Erik Silkensen, Geoffrey Belter, and Jeremy Siek. Under-
standing memory effects in the automated generation of optimized matrix algebra kernels.
Procedia Computer Science, 1(1):1867 – 1875, May 2010.
[61] I. Karlin. Memory analysis and tuning of composed linear algebra kernels. In Proceedings of
Colorado Celebration of Women in Computing, pages 1–5, Boulder, CO, April 2008.
[62] I. Karlin, E. Silkensen, E. R. Jessup, G. Belter, T. Nelson, P. Zelinsky, and J. G. Siek. A
statistical approach to reducing an optimization search space. In Proceedings of Colorado
Celebration of Women in Computing, pages 1–5, Golden, CO, November 2010.
[63] Ian Karlin and Jonathan Hu. Implementing and profiling of a variable block matrix-matrix
multiply in ML. Technical Report SAND 2007-7977, Sandia National Laboratories, December
2007.
[64] Ian Karlin and Jonathan Hu. Overview and performance analysis of the epetra/OSKI matrix
class in trilinos. Technical Report SAND2008-8257P, Sandia National Laboratories, December
2008.
[65] Ian Karlin, Elizabeth R. Jessup, Geoffrey Belter, and Jeremy Siek. Parallel memory prediction
for fused linear algebra kernels. In Proceedings of 1st International Workshop on Performance
Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS
10), pages 1–8, New Orleans, LA, November 2010.
[66] D. K. Kaushik, D. E. Keyes, and B. F. Smith. Toward realistic performance bounds
for implicit cfd codes. In Proceedings of Parallel CFD99, pages 233–240. Elsevier, 1999.
[67] Nam Sung Kim, Todd Austin, David Blaauw, Trevor Mudge, Krisztián Flautner, Jie S.
Hu, Mary Jane Irwin, Mahmut Kandemir, and Vijaykrishnan Narayanan. Leakage current:
Moore’s law meets static power. Computer, 36(12):68–75, December 2003.
[68] D. Kincaid and W. Cheney. Numerical Analysis: Mathematics of Scientific Computing.
Brooks/Cole, third edition, 2002.
[69] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46:604–632,
September 1999.
[70] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache performance and
optimizations of blocked algorithms. In Proceedings of the Fourth International Conference
on Architectural Support for Programming Languages and Operating Systems, pages 63–74,
Palo Alto, CA, Apr. 1991.
[71] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic Linear Algebra Sub-
programs for Fortran usage. ACM Transactions on Mathematical Software, 5(3):308–323,
September 1979.
[72] Robert Michael Lewis, Virginia Torczon, and Michael W. Trosset. Direct search methods:
then and now. Journal of Computational and Applied Mathematics, 124(1-2):191 – 207,
December 2000.
[73] John D. McCalpin. Memory bandwidth and machine balance in current high performance
computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA)
Newsletter, pages 19–25, December 1995.
[74] Lois Curfman McInnes, Jorge Moré, Todd Munson, and Jason Sarich. TAO user manual
(revision 1.10.1). Technical Report ANL/MCS-TM-242-Revision 1.10.1, Mathematics and
Computer Science Division, Argonne National Laboratory, July 2010.
[75] J.C. Meyer and A.C. Elster. Performance modeling of heterogeneous systems. In Parallel
Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International
Symposium on, pages 1–4, April 2010.
[76] Frank Mueller. A library implementation of POSIX threads under UNIX. In
Proceedings of the USENIX Conference, pages 29–41, January 1993.
[77] Thomas Nelson. Personal communication, November 2010.
[78] Netlib. BLAS. http://www.netlib.org/blas/index/html, 2008.
[79] Netlib. Sparse BLAS. http://www.netlib.org/sparse-blas/index.html, 2008.
[80] NIST. Sparse BLAS. http://math.nist.gov/spblas, 2008.
[81] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas,
and K. Yelick. A case for intelligent RAM. IEEE Micro, pages 34–44, March/April 1997.
[82] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, and
P. Sadayappan. Combined iterative and model-driven optimization in an automatic paral-
lelization framework. In Proceedings of the 2010 ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis, SC ’10, pages 1–11, Wash-
ington, DC, USA, November 2010. IEEE Computer Society.
[83] Madhan Premkumar. Parallelization of cache efficient BLAS 2.5 operator GEMVT. Master’s
thesis, Florida Institute of Technology, Melbourne, FL, May 2005.
[84] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso,
Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen,
Robert W. Johnson, and Nicholas Rizzolo. SPIRAL: Code generation for DSP trans-
forms. Proceedings of the IEEE, special issue on “Program Generation, Optimization, and
Adaptation”, 93(2):232–275, February 2005.
[85] Apan Qasem. Automatic Tuning of Scientific Applications. PhD thesis, Rice University, July
2007.
[86] Apan Qasem and Ken Kennedy. A cache-conscious profitability model for empirical tuning
of loop fusion. In Eduard Ayguadé, Gerald Baumgartner, J. Ramanujam, and P. Sadayappan,
editors, LCPC, volume 4339 of Lecture Notes in Computer Science, pages 106–120. Springer,
2006.
[87] Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses.
In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design
and Implementation, pages 38–49, Montreal, Quebec, Canada, May 1998.
[88] Gabriel Rivera and Chau-Wen Tseng. Tiling optimizations for 3D scientific computations.
In Supercomputing ’00: Proceedings of the 2000 ACM/IEEE conference on Supercomputing
(CDROM), pages 1–23, Washington, DC, USA, November 2000. IEEE Computer Society.
[89] M. Sala, K. Stanley, and M. Heroux. Amesos: A set of general interfaces of sparse direct
solver libraries. 4699:976–985, 2007.
[90] Sandia National Laboratories. Epetra - Home.
http://trilinos.sandia.gov/packages/epetra/index.html, 2008.
[91] Sandia National Laboratories. Sacado - Home.
http://trilinos.sandia.gov/packages/sacado/index.html, 2008.
[92] Jeremy G. Siek, Ian Karlin, and E. R. Jessup. Build to order linear algebra kernels. In
Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL
2008), pages 1–8, Miami, FL, April 2008.
[93] Horst Simon, Leonid Oliker, Andrew Canning, Jonathan Carter, Stephane Ethier, and John
Shalf. Evaluation of leading scalar and vector architectures for scientific computations. Tech-
nical Report LBNL-55291, Lawrence Berkeley National Laboratory, 2004.
[94] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B.
Moler. Matrix Eigensystem Routines EISPACK Guide, volume 6. Springer-Verlag, New York,
2nd edition, 1976.
[95] B. Spencer Jr., T. Finholt, I. Foster, C. Kesselman, C. Beldica, J. Futrelle, S. Gullapalli,
P. Hubbard, L. Liming, D. Marcusiu, L. Pearlman, C. Severance, and G. Yang. NEESgrid:
A distributed collaboratory for advanced earthquake engineering experiment and simulation.
In 13th World Conference on Earthquake Engineering, Vancouver, B.C., Canada, August 2004.
Paper No. 1674.
[96] Sun. Sun Performance Library.
http://developers.sun.com/sunstudio/overview/topics/perflib_index.html, 2008.
[97] G. Vidal. Efficient simulation of one dimensional quantum many-body systems. Physical
Review Letters, 93(4):1–4, July 2004.
[98] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix
kernels. Journal of Physics: Conference Series, 16:521–530, June 2005.
[99] Richard Vuduc. Personal communication, July 2008.
[100] Richard Vuduc, James W. Demmel, Katherine A. Yelick, Shoaib Kamil, Rajesh Nishtala, and
Benjamin Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In
Proceedings of the IEEE/ACM Conference on Supercomputing, pages 1–35, Baltimore, MD,
November 2002.
[101] Richard Vuduc, Attila Gyulassy, James W. Demmel, and Katherine A. Yelick. Memory
hierarchy optimizations and performance bounds for sparse ATAx. In ICCS 2003: Workshop
on Parallel Linear Algebra, Melbourne, Australia, June 2003.
[102] Richard W. Vuduc. Automatic performance tuning of sparse matrix kernels. PhD thesis,
University of California, Berkeley, CA, USA, January 2004.
[103] W. Wang and D. P. O’Leary. Adaptive use of iterative methods in predictor-corrector interior
point methods for linear programming. Numerical Algorithms, 25:387–406, September 2000.
[104] Shlomo Weiss and James E. Smith. Study of scalar compilation techniques for pipelined super-
computers. ACM Transactions on Mathematical Software (TOMS), 16(3):223–245, September
1990.
[105] R. Clint Whaley and Jack Dongarra. Automatically tuned linear algebra software. In
Proceedings of 1998 ACM/IEEE Conference on Supercomputing (CDROM), pages 1–27,
Washington DC, November 1998. IEEE Computer Society.
[106] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James
Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore plat-
forms. Parallel Computing, 35(3):178–194, 2009.
[107] M. Xue, K. K. Droegemeier, and V. Wong. The advanced regional prediction system (ARPS)
- a multiscale nonhydrostatic atmospheric simulation and prediction model. Part I: model
dynamics and verification. Meteorology and Atmospheric Physics, 75:161–193, December
2000.
[108] K. Yotov, X. Li, G. Ren, M.J.S. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search
really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2):358–386,
February 2005.
[109] Kamen Yotov, Keshav Pingali, and Paul Stodghill. Think globally, search locally. In ICS ’05:
Proceedings of the 19th annual international conference on Supercomputing, pages 141–150,
New York, NY, USA, June 2005. ACM.
