Performance modeling and prediction for linear algebra algorithms by Iakymchuk, Roman
Performance Modeling and Prediction
for Linear Algebra Algorithms
Von der Fakultät für Mathematik, Informatik und
Naturwissenschaften der RWTH Aachen University
zur Erlangung des akademischen Grades
eines Doktors der Naturwissenschaften genehmigte Dissertation
vorgelegt von
Master of Applied Mathematics
Roman Iakymchuk
aus Lviv, Ukraine
Berichter: Juniorprofessor Paolo Bientinesi, Ph.D.
Universitätsprofessor Dr. Enrique S. Quintana Ortí
Universitätsprofessor Dr. rer. nat. Uwe Naumann
Tag der mündlichen Prüfung: 30.08.2012
Diese Dissertation ist auf den Internetseiten der Hochschulbibliothek online verfügbar.
II
Abstract
This dissertation incorporates two research projects: performance modeling and pre-
diction for dense linear algebra algorithms, and high-performance computing on clouds.
The first project is focused on dense matrix computations, which are often used
as computational kernels for numerous scientific applications. To solve a particular
mathematical operation, linear algebra libraries provide a variety of algorithms. The
algorithm of choice depends, obviously, on its performance. Performance of such
algorithms is affected by a set of parameters, which characterize the features of the
computing platform, the algorithm implementation, the size of the operands, and the
data storage format. Because of this complexity, predicting algorithmic performance
is a challenging task, to the point that developers are often forced to rely on extensive
trial-and-error technique. We approach this problem from a different perspective.
Instead of performing exhaustive tests, we introduce two techniques for modeling
performance: one is based on measurements and requires a limited number of sampling
tests, while the other needs neither the execution of the algorithms nor parts of them.
Both techniques employ a bottom-up approach: we first model the performance of the
BLAS kernels. Then, using these results we create models for higher-level algorithms,
e.g. the LU factorization, that are built on top of the BLAS kernels. As a result, by
using the hierarchical and modular structure of linear algebra libraries, we develop two
techniques for hierarchical performance modeling; both techniques yielding accurate
predictions.
The second project is concerned with high-performance computing on commercial
cloud environments. We empirically study the computational efficiency of compute-
intensive scientific applications in such an environment, where resources are shared
under high contention. Although high performance is occasionally obtained, con-
tention for the CPU and cache space degrades the expected performance and intro-
duces significant variance. Using the matrix-matrix multiplication kernel of BLAS
and the LINPACK benchmark, we show that the underutilization of resources sub-
stantially improves the expected performance. For instance, for a number of cluster
configurations, the solution is reached considerably faster when the available resources
are underutilized. Finally, since the performance measurements of scientific applica-
tions show high fluctuation, we propose alternatives, such as expected performance,
expected execution time, and cost per GFlop, to the standard definitions of efficiency.
III
IV
Zusammenfassung
Die vorliegende Dissertation umfasst zwei Forschungsprojekte: Zum einen die Perfor-
mance-Modellierung und -Vorhersage von Algorithmen der Linearen Algebra und zum
anderen das Hochleistungsrechnen in Clouds.
Das erste Projekt konzentriert sich auf Berechnungen mit dicht-besetzten Matri-
zen, die häufig den Kern wissenschaftlicher Simulationen bilden. Zur Durchführung
dieser mathematischen Operationen bieten Lineare-Algebra-Bibliotheken eine Viel-
zahl verschiedener Algorithmen an. Die Wahl des entsprechenden Algorithmus wird
maßgeblich durch die Performance bedingt, welche wiederum durch eine Vielzahl von
Parametern beeinflusst wird. Unter diesen Einflussgrößen befinden sich die charak-
teristischen Eigenschaften der Computerplattform, die Implementierungsdetails des
Algorithmus sowie die Größe und das Speicherformat der Operanden. Aufgrund
dieser Komplexität stellt die Vorhersage der Performance eine anspruchsvolle Her-
ausforderung dar – die Entwickler sind meist zu einer aufwendigen Trial-and-Error-
Herangehensweise gezwungen. In unserer Arbeit zeigen wir eine alternative Methode
auf und stellen dazu zwei Techniken zur Modellierung der Performance vor. Die Erste
basiert auf stichprobenartigen Messungen ausgewählter Kernoperationen, während die
Zweite vollständig auf das Ausführen des Algorithmus oder Teilen davon verzichtet.
Beide Techniken verwenden einen Bottom-up-Ansatz: Zuerst wird die Performance
der BLAS Kernoperationen modelliert. Diese Modelle werden dann genutzt, um die
Performance von übergeordneten Algorithmen (z.B. der LU-Faktorisierung einer Ma-
trix) zu modellieren, die auf den Kernoperationen hierarchisch aufbauen. Wir zeigen,
dass beide Techniken exzellente Performance-Vorhersagen liefern.
Das zweite Forschungsprojekt der Dissertation befasst sich mit dem Hochleistungs-
rechnen in einer kommerziellen Cloud-Computing-Umgebung. Wir untersuchen em-
pirisch die Effizienz dieser Umgebung für rechenintensive wissenschaftliche Anwen-
dungen und den Einfluss der dabei auftretenden geteilten Ressourcennutzung mit
anderen Nutzern. Obwohl gelegentlich eine hohe Rechenleistung erzielt wird, führt
die gemeinsame Nutzung der Prozessoren sowie der Cache-Speicher im Allgemeinen
zu einer verminderten durchschnittlichen Rechenleistung mit hoher Varianz in der
Performance. Mit Hilfe der Matrix-Matrix-Multiplikation und des LINPACK Bench-
marks zeigen wir, dass durch die Unterauslastung der Ressourcen eine deutlich bes-
sere durchschnittliche Rechenleistung erzielt werden kann. Zum Beispiel werden für
eine Reihe von Cluster-Konfigurationen Operationen wesentlich schneller berechnet,
wenn die vorhandenen Ressourcen nicht vollständig genutzt werden. Da die üblichen
Performance-Metriken wissenschaftlicher Anwendungen hohe Fluktuationen zeigen,
schlagen wir abschließend alternative Messgrößen vor, z.B. die zu erwartende Perfor-
mance, die Kosten pro GFlop oder die Kosten für die Lösung eines Problems.
V
VI
Acknowledgments
“Patriae Decori Civibus Educandis”
Motto of Lviv National University
First and foremost, I would like to express my gratitude to my advisor, Prof. Paolo
Bientinesi, for allowing me to do my Ph.D. under his guidance, for inspiring me
throughout the doctoral study, and also for always being open to my questions. I was
always astonished by the passion he shows in doing research. I would like to thank my
co-advisor, Prof. Enrique S. Quintana Ortí, for many useful suggestions and for those
rare, but extremely productive visits to the University Jaime I, Castellón, Spain. I
would also like to thank Prof. Uwe Naumann for agreeing to be a co-referee in the
committee for my doctoral exam.
Thanks to Prof. Mykhailo Bartish and Prof. Mykhailo Samotyuk for their patience
in explaining math, for many interesting stories and problems to solve, and for their
supervision at the Youth Academy of Science of Ukraine. Many thanks to Prof. Stepan
Shakhno for supervising my Bachelor and Master theses and for having encouraged
and supported me in applying for a doctoral study.
This dissertation is the result of the work, which I started in February 2009, at
the graduate school AICES (Aachen Institute for advanced study in Computational
Engineering Science) at RWTHAachen. I gratefully acknowledge the financial support
by grant GSC 111 from the Deutsche Forschungsgemeinschaft (DFG) and by project
50225798 PARSEMUL from the Deutscher Akademischer Austausch Dienst (DAAD).
During my doctoral study I have been provided with the best tools and opportunities
to improve my skills and to do research. In this context, I would like to thank the
AICES Service Team Annette de Haes, Nadine Bachem, and Nicole Faber, who did a
great job in supporting us.
A special thanks to my academic brothers Diego, Matthias, and Elmar, for proof-
reading my dissertation and for many useful suggestions.
Thanks to my friends Aravind and Ionut for sharing with me good and bad moments
at work and outside, for scientific discussions, and their support. Thanks to my
Ukrainian friends Yurii, Taras, Mariya, Petro, Svitlana, and Stepan for encouraging
me and providing their kind support all the time. Even in Germany, they made me
feel like at home.
Many thanks to mama and tato, Mariya and Petro, for giving me the opportunity to
receive good education, for their sincere love and support throughout my life. Thanks
to my brother, Yurii, for his support and also for taking care of our parents during my
academic life abroad. Finally, I would like to thank Zoriana for her love and support
in these often stressful times.
Aachen, June 2012
VII
VIII
Contents
Abstract III
Acknowledgments VII
1 Introduction 3
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Performance Modeling and Prediction: General Concepts 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Hierarchy of Dense Linear Algebra Libraries . . . . . . . . . . . 8
2.1.2 State-of-the-art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Performance Prediction through Measurements . . . . . . . . . . . . . . 12
2.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Timing Methodologies . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Execution-Less Performance Modeling 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 A General Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Modeling Memory-Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 DGER and LU Factorization . . . . . . . . . . . . . . . . . . . . 28
3.4.2 DGEMV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.3 DTRSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Assembling the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.1 L1 Miss Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.2 Flop Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.3 Balancing Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6.1 Reading and Writing a Matrix . . . . . . . . . . . . . . . . . . . 59
3.6.2 DGER and LU Factorization . . . . . . . . . . . . . . . . . . . . 63
3.6.3 DGEMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.8 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.8.1 Modeling Performance of Parallel LU Factorization . . . . . . . 71
IX
Contents
4 High-Performance Computing on Clouds 75
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.1.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.3 State-of-the-art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Single Node Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.1 DGEMM Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.2 HPL Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Multiple Nodes Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.1 HPL Minimum Evaluation . . . . . . . . . . . . . . . . . . . . . . 90
4.3.2 HPL Average Evaluation . . . . . . . . . . . . . . . . . . . . . . . 93
4.3.3 HPL Evaluation through Resource Underutilization . . . . . . . 95
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5 Conclusions 103
A Measuring Memory Latency 105
B Measuring Flop Time 107
Bibliography 109
Lebenslauf 117
X
List of Figures
2.1 Hierarchical structure of linear algebra libraries. . . . . . . . . . . . . . . . . . . 10
2.2 Measurements of DGER from the GotoBLAS2 library using timing approaches
with and without cache flushing. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Minimum, median, and average time of DGER from the GotoBLAS2 library
on the Harpertown and Barcelona processors. . . . . . . . . . . . . . . . . . . . . 16
2.4 An unblocked algorithm for the LU factorization. . . . . . . . . . . . . . . . . . 17
2.5 3 × 3 partitioning of the matrix A. . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Piecewise-parabolic behavior of the execution time of DGER from the Goto-
BLAS2 library on Harpertown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Modeling the execution time of DGER from the GotoBLAS2 library. . . . . . 19
2.8 Modeling the execution time of the unblocked LU factorization. . . . . . . . . . 20
3.1 The memory hierarchy. Source: Bryant et al. [12] . . . . . . . . . . . . . . . . . 24
3.2 Dependence between the execution time and L1 misses of DGER on Penryn. . 27
3.3 Alignment of cache lines within the matrix A. . . . . . . . . . . . . . . . . . . . 30
3.4 Predicted and measured L1 misses for DGER from the GotoBLAS library on
Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Partitioning of operands used in DGER for DAXPY; both operations are from
the GotoBLAS2 library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.6 Predicted and measured L1 misses for DGER from the reference BLAS library
on Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 Deviation between predicted and measured L1 misses for DGER from the ref-
erence BLAS library on Barcelona; p ≥ q, LDA = 512. . . . . . . . . . . . . . . . 33
3.8 Alignment of the matrix A for the unlocked LU factorization within a big
matrix for a blocked LU factorization with pivoting. . . . . . . . . . . . . . . . . 34
3.9 Predicted and measured L1 misses for DGER from the GotoBLAS2 library on
Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.10 Predicted and measured L1 misses for DGER from the reference BLAS library
on Penryn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.11 Predicted and measured L1 misses for DGER from the GotoBLAS2 library on
Penryn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.12 Deviation between predicted and measured L1 misses for DGER from the Go-
toBLAS2 library on Penryn; p ≥ q, LDA = 512. . . . . . . . . . . . . . . . . . . . 39
3.13 Predicted and measured L1 misses for Algorithm 2.4 when used as part of a
blocked LU factorization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.14 A matrix A used in DGEMV as a part of a big m × n matrix. . . . . . . . . . . 40
3.15 Predicted and measured L1 misses for DGEMV from the reference BLAS li-
brary on Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.16 Predicted and measured L1 misses for DGEMV from the GotoBLAS2 library
on Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.17 Predicted and measured L1 misses for DGEMV from the reference BLAS li-
brary on Penryn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.18 Predicted and measured L1 misses for DGEMV from the GotoBLAS2 library
on Penryn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
XI
List of Figures
3.19 Partitioning of a lower triangular matrix A used in DTRSV from the Goto-
BLAS2 library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.20 Predicted and measured L1 misses for DTRSV from the reference BLAS library
on Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.21 Predicted and measured L1 misses for DTRSV from the GotoBLAS2 library
on Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.22 Predicted and measured L1 misses for DTRSV from the reference BLAS library
on Penryn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.23 Predicted and measured L1 misses for DTRSV from the GotoBLAS2 library
on Penryn; A is a unit lower triangular matrix. . . . . . . . . . . . . . . . . . . . 51
3.24 Predicted and measured L1 misses for DTRSV from the GotoBLAS2 library
on Penryn; A is a non-unit upper triangular matrix. . . . . . . . . . . . . . . . . 52
3.25 The time spent serving one L1 miss. . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.26 Latency of different levels in the memory hierarchy on Barcelona. . . . . . . . . 55
3.27 Latency of different levels in the memory hierarchy on Penryn. . . . . . . . . . 56
3.28 Predicted and measured execution time of reading and writing a matrix on
Barcelona; the code was compiled without optimization. . . . . . . . . . . . . . 61
3.29 Predicted and measured execution time of reading and writing a matrix on
Barcelona; the code was compiled with the optimization level two enabled. . . 62
3.30 Predicted and measured execution time of reading and writing a matrix on
Penryn; the code was compiled without optimization. . . . . . . . . . . . . . . . 62
3.31 Predicted and measured execution time of reading and writing a matrix on
Penryn; the code was compiled the optimization level two enabled. . . . . . . . 63
3.32 Predicted and measured execution time of DGER from the reference BLAS
library on Barcelona; the library was compiled without optimization. . . . . . 64
3.33 Predicted and measured execution time of DGER from the reference BLAS
library on Barcelona; the library was compiled with the optimization level two
enabled. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.34 Predicted and measured execution time of DGER from the GotoBLAS2 library
on Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.35 Predicted and measured execution time of DGER from the reference BLAS
library on Penryn; the library was compiled without optimization. . . . . . . . 66
3.36 Predicted and measured execution time of DGER from the reference BLAS
library on Penryn; the library was compiled with the optimization level two
enabled. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.37 Predicted and measured execution time of DGER from the GotoBLAS2 library
on Penryn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.38 Predicted and measured execution time of the unblocked LU factorization on
Penryn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.39 Predicted and measured execution time of DGEMM from the GotoBLAS2 library. 69
3.40 Partitioning of the matrix A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1 DGEMM’s execution time over 6 hours using all 8 cores of a c1.xlarge Amazon
EC2 instance; see Table 4.1 for a description of this instance type. The matrix
size is n = 10 k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2 DGEMM’s execution time over 6 hours using 1 of 8 cores of the c1.xlarge
instance. The size of the input matrices is n = 10 k. . . . . . . . . . . . . . . . . 83
4.3 DGEMM’s execution time over 6 hours using 4 of 8 cores of the c1.xlarge
instance. The size of the input matrices is n = 10 k. . . . . . . . . . . . . . . . . 84
4.4 DGEMM’s average and best execution times on c1.xlarge vs. number of cores.
The size of the input matrices is n = 10 k; error bars show one standard deviation. 85
XII
List of Figures
4.5 DGEMM’s average and best execution times on a standard cluster vs. number
of cores. The size of the input matrices is n = 10 k; error bars show one standard
deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Average and minimum execution times of HPL on c1.xlarge vs. number of
cores. The size of the input/output matrix is n = 25 k. The input configuration
to each execution of HPL is identical. . . . . . . . . . . . . . . . . . . . . . . . . 87
4.7 Required number of nodes for different input matrix sizes n. The number of
nodes is determined by maximizing usage of RAM on each node. We could not
allocate enough m1.large nodes of AMD type to solve a problem of size n = 80k. 88
4.8 Efficiency scaling by problem size. Efficiency is determined with respect to the
theoretical peak given in Table 4.1 multiplied by the number of nodes used. . 90
4.9 Total time to solution for different instance types by problem size. . . . . . . . 90
4.10 Cost to solution prorated to actual time spent for different instance types by
problem size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.11 Actual cost to solution for different instance types by problem size. Execution
time is rounded to the nearest hour for cost calculation. . . . . . . . . . . . . . 91
4.12 Cost per GFlop ($/GFlop) prorated to actual time spent for different instance
types by problem size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.13 Actual cost to solution per GFlop ($/GFlop) for different instance types by
problem size. Execution time is rounded to the nearest hour for cost calculation. 92
4.14 Minimum and average execution time of HPL on m1.large (Intel) instances by
problem size. Error bars on average show one standard deviation. . . . . . . . 94
4.15 Minimum and average execution time of HPL on m1.xlarge (AMD) instances
by problem size. Error bars on average show one standard deviation. . . . . . 94
4.16 Minimum and average execution time of HPL on m1.xlarge (Intel) instances
by problem size. Error bars on average show one standard deviation. . . . . . 95
4.17 Minimum and average execution time of HPL on c1.xlarge instances by problem
size. Error bars on average show one standard deviation. . . . . . . . . . . . . . 95
4.18 Average execution time of repeated HPL on m1.large (Intel) by number of cores
used per each node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.19 Average execution time of repeated HPL on m1.xlarge (AMD) by number of
cores used per each node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.20 Average execution time of repeated HPL on m1.xlarge (Intel) by number of
cores used per each node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.21 Average execution time of repeated HPL on c1.xlarge (Intel) by number of
cores used per each node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.22 Average cost per GFlop ($/GFlop) prorated to actual time spent for m1.xlarge
(AMD) by number of cores used per each node. . . . . . . . . . . . . . . . . . . 99
4.23 Average cost per GFlop ($/GFlop) prorated to actual time spent for m1.xlarge
(Intel) by number of cores used per each node. . . . . . . . . . . . . . . . . . . . 99
4.24 Average cost per GFlop ($/GFlop) prorated to actual time spent for c1.xlarge
(Intel) by number of cores used per each node. . . . . . . . . . . . . . . . . . . . 100
4.25 Time to solution versus prorated cost for c1.xlarge (Intel). . . . . . . . . . . . 101
4.26 Time to solution versus prorated cost for different instance types. . . . . . . . 101
XIII
XIV
List of Tables
2.1 Cache system on Intel Xeon E5450 (Harpertown). . . . . . . . . . . . . . . . . . 13
2.2 Cache system on Intel Core 2 P8600 (Penryn). . . . . . . . . . . . . . . . . . . . 13
2.3 Cache system on AMD Opteron 8356 (Barcelona). . . . . . . . . . . . . . . . . . 14
3.1 Modeled and measured L1 misses for the unblocked LU factorization on Barcelona. 34
3.2 Modeled and measured L1 misses for the unblocked LU factorization on Penryn. 38
3.3 Results of measuring the Flop time using different test cases and the optimiza-
tion levels enabled on Barcelona. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1 Information about various instance types: processor type, number of cores
per instance, installed RAM (in Gb), and theoretical peak performance (in
GFlops/sec). Prices are on Amazon EC2 as of May, 2010. . . . . . . . . . . . . 80
4.2 Results of the HPL benchmark for matrix size n = 20 k. Columns are, in
order: Amazon EC2 instance type, CPU type (Intel or AMD), block size (NB),
process grid (p × q), number of threads used in the BLAS routines, broadcast
algorithm (BFACT ), best elapsed time, and corresponding efficiency using the
theoretical peak performance from Table 4.1. A more detailed description of
the HPL parameters can be found in Subsection 4.1.2. . . . . . . . . . . . . . . 88
4.3 Results of the HPL benchmark for matrix size n = 30 k. . . . . . . . . . . . . . . 89
4.4 Results of the HPL benchmark for matrix size n = 50 k. . . . . . . . . . . . . . . 89
4.5 Results of the HPL benchmark for matrix size n = 80 k.a . . . . . . . . . . . . . 89
A.1 The results of measuring the memory latency by lat_mem_ld on Penryn. . . . 105
XV
XVI
Acronyms and Notation
ALU Arithmetic Logic Unit
Amazon EC2 Amazon Elastic Compute Cloud
ATLAS Automatically Tuned Linear Algebra Software
BLAS Basic Linear Algebra Subprograms
b Byte (8 bits)
CPU Central Processing Unit
FLAME Formal Linear Algebra Methods Environment
Flop Floating point operation
Gb Gigabyte (230 b)
GFlops GigaFlops (109 Flops)
GFlops/sec GFlops per second
GFlop/$ GFlops per dollar
GHz GigaHertz
GPU Graphical Processing Unit
HPL High-Performance Linpack
Kb Kilobyte (210 b)
LAPACK Linear Algebra PACKage
L1 cache Level 1 cache
L2 cache Level 2 cache
L3 cache Level 3 cache
Mb Megabyte (220 b)
MPI Message Passing Interface
ns nanoseconds
RAM Random Access Memory
SIMD Single-Instruction-Multiple-Data
SSE Streaming SIMD Extensions
1
2
Chapter 1
Introduction
In this dissertation we combine two research projects. The goal of the first project
is to develop theory and tools to accurately model and predict the performance of
matrix computations. The second project aims at investigating the effects of shared
resources for high-performance computing (HPC) on a commercial cloud environment.
A short description of these projects follows.
Performance modeling and prediction. In linear algebra, the same math-
ematical operation can be solved by different algorithms. Choosing the best
performing algorithm on the target architecture is not a trivial task. This is
of particular relevance in combination with techniques of automatic algorithm
generation and tuning [5], in which dozens of algorithmic variants – written in
terms of BLAS (Basic Linear Algebra Subprograms) routines [60, 30, 28] – are
produced. The standard practice to select the best algorithm is to perform tens
or hundreds of experiments, measuring the execution time of each algorithm
for many different combinations of parameters. The LINPACK benchmark1 [3]
provides us with a typical example. While the actual run to determine the speed
of a supercomputer takes few days, the search for the fastest algorithmic variant
takes weeks. We approach the problem from a different perspective. Instead of
trying to speed up the search by reducing the algorithmic space, we aim at accu-
rate performance modeling with limited number of executions or even without
them.
Given the hierarchical and modular structure of dense linear algebra algorithms,
our analysis is based on a bottom-up approach. First we build models for
the kernel operations like the ones from the BLAS library and then use those
models for higher-level operations such as matrix factorizations. We present
two techniques that implement the bottom-up approach: one is the performance
prediction through measurements; the other is the execution-less performance
modeling. The latter can be said to be the main contribution of this project.
Both techniques produce accurate results – the deviation from the measurements
is within 3%.
In this dissertation we do not pursue an exhaustive analysis of a full set of
linear algebra routines. Instead, we propose methodologies, techniques, and also
implementations that are general enough to be adapted to dense linear algebra
1This benchmark computes the solution of a random dense linear system, and it is used to rank the
fastest supercomputers in the world (the TOP500 list [86]).
3
Chapter 1 Introduction
algorithms2. A detailed review on the development of performance modeling
and prediction can be found in Chapter 2.
High-performance computing on clouds. Computing as a utility has rea-
ched the mainstream. Scientists can easily rent resources on large clusters that
can be expanded and reduced on-demand. Such service of delivering compute
cycles and storage capacity to end-users by commercial or non-commercial or-
ganizations is called cloud computing. Commercial clouds are designed for web
service workload and data storage, which do not require brand-new processors
and fast interconnect. Unlike conventional cluster systems, clouds do not re-
quire a significant monetary or time investment, in infrastructure or people, be
paid upfront. Instead of paying for the resources according to average or peak
load, the cloud user can pay costs directly proportional to current need. When
resources are not in use, total cost can be close to zero. We aim at investigating
both the applicability of commercial cloud environments for high-performance
scientific computing and the cost for such service. For this study, we consider
dense linear algebra algorithms which provide a favorable ratio of computation
to communication. Using the DGEMM kernel [28] – the matrix multiplication
kernel of BLAS – and the LINPACK benchmark, we show that contention for
the CPU and cache space affects the computation time and, hence, the perfor-
mance. But, we have found that if we use only a part of the allocated resources,
then the expected performance improves significantly. A detailed review on the
development of cloud computing can be found in Chapter 4.
Some of the numerical results presented in this document have already been pub-
lished or submitted, see [47, 48, 46, 6, 49, 68, 69].
1.1 Contributions
As this dissertation covers two different research projects, the contributions of each
of them are separately listed below.
• Performance modeling and prediction
– Performance prediction through measurements. This study focuses on un-
blocked algorithms (memory-bound) that are typical building blocks of
blocked algorithms. We apply the bottom-up approach to predict the per-
formance of dense matrix computations. We first model the performance
of the kernel operations, for which we conduct several measurements. Per-
formance of higher-level algorithms is then predicted by assembling the
models for their building blocks (kernel operations).
– Execution-less performance modeling. We present a methodology for mod-
eling performance of memory-bound algorithms through their memory-
stalls. Its important feature is that it does not require the execution of
either the algorithms or the parts of them. The methodology consists
2Such as BLAS kernels or algorithms that are built on top of them.
4
1.2 Outline of the Thesis
of several stages such as modeling the memory-stalls and estimating the
characteristics of CPU and memory system. Each of these stages is rep-
resented in the general performance model. We demonstrate that with
minor changes the model can be adapted to various algorithms. We also
show evidence that the performance of compute-intensive algorithms can
be modeled by applying the same methodology.
– Analytical models for memory-stalls. Within the framework of the execution-
less methodology, we provide a technique for predicting cache misses of
dense matrix computations. Using this technique we derive analytical mod-
els for counting cache misses of memory-bound algorithms. Each of those
depend not only on the parameters of the algorithm at hand, but also on
the characteristics of the architecture (CPU and memory) used for exe-
cution. Accordingly, we tailor the model for different architectures. We
show that the model can be adapted with minor changes to algorithms
with similar structure.
• High-performance computing on clouds
– We examined through empirical evaluation the computational efficiency of
high-performance numerical applications in a commercial cloud environ-
ment. The execution time of the applications showed high fluctuation due
to the high contention for resources such as CPU and cache. In compar-
ison, there is almost no fluctuation in the achieved results on a standard
cluster.
– We presented a strategy for the efficient usage of the commercial cloud
resources where multiple virtual machines share a single hardware node.
Applying this strategy considerably improves the expected execution time
and thus reduces the cost of completion.
– We introduced alternative definitions of efficiency for commercial cloud en-
vironments where strong performance guarantees do not exist. The new
definitions are expected performance, cost to completion, and cost per
GFlop. We believe that these concepts should complement or even substi-
tute the standard definitions of efficiency.
1.2 Outline of the Thesis
This dissertation is organized as follows. Chapter 2 introduces general concepts of
performance modeling and presents a technique for performance prediction through
time measurements. An execution-less technique for modeling both memory-stalls
and performance is developed and assessed in Chapter 3. In Chapter 4, we study
the applicability of commercial clouds for high-performance computing and provide
a strategy for the efficient use of cloud resources. Finally, Chapter 5 summarizes the
results of this thesis.
5
6
Chapter 2
Performance Modeling and Prediction:
General Concepts
2.1 Introduction
In general, the performance of an algorithm can be modeled by any of the following
approaches [94].
1. Modeling through measurements. It is classified further as
• Interpolation. The execution time of the algorithm or more commonly its
components are measured for a set of problem sizes and then the modeling
is performed on the same interval using polynomial fitting.
• Extrapolation. This approach is similar to the interpolation. The execution
time of the algorithm is measured for a number of problem sizes, but the
performance is modeled for problems that are outside of the interval.
• Similarity between algorithms. A set of both compute- and memory-intensive
benchmarks, which are not necessarily the building blocks of the algorithm,
is used to capture the behavior of the algorithm and, therefore, to model
its performance.
Interpolation is often a preferable solution for modeling the performance of an
algorithm. In the literature this approach is also called semi-analytical since it
requires execution of the building blocks, but not of the considered algorithm
itself.
2. Analytical modeling. A mathematical formula is developed to predict the
performance of the algorithm by taking into account the detailed information
regarding the algorithm as well as the target architecture. The disadvantage
of analytical modeling is that it cannot completely capture all the performance
factors due to the simplified assumptions that are made in the model. How-
ever, in contrast to the previous approaches, analytical modeling requires only
a limited number of measurements of system parameters such as cache latency.
Each of these approaches has its advantages and disadvantages. The most suitable
choice depends on the posed requirements, the type of the algorithm, and the envi-
ronment (hardware architecture, workload, etc.).
The main focus of our study is on dense linear algebra algorithms, which possess a
number of properties that make them a promising test case for our investigation.
7
Chapter 2 Performance Modeling and Prediction: General Concepts
• Modularity. As a result of more than thirty years development, the linear
algebra algorithms are nicely organized into libraries which are often arranged
in a hierarchical structure. Many libraries from this hierarchy, both sequential
and parallel (LAPACK [2], FLAME [42], etc.), are almost entirely built on top
of the BLAS routines.
• Regularity. The performance of sparse matrix algorithms is dictated by the
density and pattern of non-zero elements. Performance of dense matrix compu-
tations mostly does not depend on the numerical values of the matrix elements.
• Automation. It was shown that for a set of operations it is possible to generate
algorithms automatically. Even though those algorithms are mathematically
equivalent, their performance may vary significantly. Thus, there is a need for
a performance analysis tool in the automation system.
We present two techniques for modeling performance of dense linear algebra algo-
rithms: one is based on cycle-accurate time measurements of the kernels; the other
does not require the execution of the algorithms or their building blocks. The idea of
the first approach is to predict the performance (the execution time) of higher-level
algorithms through sampling the BLAS kernels. For instance, the execution time of
the LU factorization [37, 82, 90] is predicted through modeling the time of its build-
ing blocks – BLAS level-1 and level-2 routines. As the measurements confirmed, the
execution time of the BLAS kernels has a piecewise-polynomial behavior. Thus, their
computation time is constructed by sampling them only for several problem sizes and
then applying polynomial fitting.
In the second approach, we aim at modeling performance without actually exe-
cuting the code. For that, we create a general performance model for an algorithm,
incorporating the time that the CPU spends on both the computation and waiting for
the memory system (memory-stalls). Since we mainly focus on memory-bound algo-
rithms, accurate modeling of memory-stalls for these algorithms is a key factor for the
accurate prediction of their execution time. We first create analytical models for the
BLAS kernels. Then, by composing the BLAS models, we develop the performance
models for higher-level algorithms.
In summary, in this dissertation we present methodologies and techniques for model-
ing and predicting performance. As the implementation would show, those are general
enough to be applied to any linear algebra routine that is built on top of the BLAS
kernels.
2.1.1 Hierarchy of Dense Linear Algebra Libraries
In many fields of science and engineering, the process of finding the solution for
a specific problem requires to solve a system of linear equations, or a least squares
problem, or eigenvalue problem. The common approach is to develop solvers for those
tasks alone and then spend tremendous amount of time on tuning them. However, the
best practice suggests to use already optimized solution-routines contained in linear
algebra libraries.
8
2.1 Introduction
The development of linear algebra libraries has its beginning in the early 1970s.
From that time many libraries have been released. With the influence of common
HPC computers, which were based on vector processors, in 1979 a first set of Basic
Linear Algebra Subprograms (BLAS level-1 or BLAS-1) [43, 60] was designed as a set
of basic vector operations. In 1988 the idea of BLAS was developed further yielding to
a second set of routines formatrix-vector operations (BLAS level-2 or BLAS-2) [30, 29].
For those routines the amount of data required and floating point operations (Flops)
performed have quadratic complexity.
When architectures with multiple layers of cache memory appeared, the perfor-
mance for both BLAS-1 and BLAS-2 operations became an issue: for these routines
the ratio between the numbers of Flops and memory accesses is only O(1). In order
to attain high performance on architectures with a hierarchical memory system, in
1990 the third level of BLAS (BLAS level-3 or BLAS-3) [28] with matrix-matrix oper-
ations was defined. These routines perform O(n3) Flops over O(n2) data, giving the
opportunity to hide the memory latency and offer performance close to the achievable
peak.
A generic implementation of the BLAS specification [71] is provided since the an-
nouncement of the library in 1979. This reference implementation is equipped with
the complete functionality, but it is not optimized for any architecture. Thus, pro-
cessor manufacturers as well as the scientists developed tuned implementations of the
BLAS for each architecture. Prominent examples of these implementations are In-
telMKL [52], AMD ACML [50], IBM ESSL [51], and SUN Performance Library [54].
Independently from the manufacturers, the scientific community provided two im-
plementations of the BLAS, namely the Automatically Tuned Linear Algebra Soft-
ware (ATLAS) [92] and GotoBLAS [38], for general-purpose architectures. ATLAS
is based on auto-tuned empirical approach while GotoBLAS is hand-tuned machine-
specific implementation of the BLAS. The latter is also the fastest open source BLAS
for many architectures. Due to the gained popularity of the Graphical Processing
Units (GPUs) for high-performance and scientific computing, NVIDIA provided a
GPU-version of the BLAS (cuBLAS) [53].
Since BLAS defines a set of fundamental operations, for more complex linear alge-
bra operations, like matrix factorizations, higher-level libraries were also developed.
A good example is the Linear Algebra PACKage (LAPACK) [2] that provides rou-
tines for solving linear systems of equations, least squares problems, or eigenvalue
problems. In order to exploit the optimized BLAS implementations, LAPACK builds
its blocked algorithms on top of the BLAS-3 operations. As with the BLAS, man-
ufacturers also provide implementations of LAPACK that are tuned for the specific
architectures. There is also a commercial non-vendor-provided library – the NAG For-
tran Library which is a part of the NAG Numerical library [41] – that incorporates
different implementations of both BLAS and LAPACK, and provides new interfaces
to them.
The Formal Linear Algebra Methods Environment (FLAME) project [42] created
a new methodology for the derivation of high-performance linear algebra algorithms.
Similar to LAPACK, FLAME builds higher-level algorithms on top of optimized lower-
level such as the BLAS routines.
The hierarchical structure of dense linear algebra libraries is illustrated in Figure 2.1.
9
Chapter 2 Performance Modeling and Prediction: General Concepts
Basic Linear Algebra Subprograms (BLAS)
Reference BLAS Vendor BLAS GotoBLAS ATLAS
LAPACK FLAME NAG
Figure 2.1 Hierarchical structure of linear algebra libraries.
For the above mentioned libraries, performance is a driving factor that leads to
the development of highly optimized routines for many architecture. Apart from the
optimization, scientists also investigate predictability of those routines and algorithms
that are built on top of them. Such performance prediction can be used as an indicator
to provide the essential feedback for the possible improvement of the efficiency of both
routines and algorithms.
2.1.2 State-of-the-art
Predicting the performance of linear algebra algorithms is a problem that, although
widely investigated, is still far from solved. Cuenca et al. [19] developed an analyti-
cal model to represent the execution time of self-optimizing individual routines as a
function of the problem size, the system and the algorithmic parameters. In other
papers by Cuenca et al. [20, 21, 35], the study was extended to the development of
automatically tuned parallel linear algebra libraries. In these libraries, the execution
time of higher-level routines like the LU factorization is modeled using information
generated from the lower-level BLAS kernels. For instance, the execution time of a
sequential blocked LU factorization for problems larger than 512 in size is predicted
using the optimal execution time of the BLAS routines. The optimal time is selected
from a set of timings that are gathered during the installation of the automatically
tuned libraries [22, 21].
Phansalkar et al. [76] proposed two simple techniques to predict performance and
cache miss-rates based on the similarity of applications. The performance of a new
application can be predicted without running cycle accurate simulation, but just by
measuring similarity between the application and the set of benchmarks in the repos-
itory.
A methodology for analyzing, modeling, and predicting the performance of MPI
applications was presented by Midorikawa et al. [65]. This methodology is composed
of several techniques: graph modeling, performance measurements, curve fitting, and
analysis of the application code. The aim of the study was to predict the execu-
tion time of an MPI application in a particular environment by taking into account
variations in the number of processors and the amount of input data.
For predicting and evaluating the performance of parallel and distributed appli-
cations Pllana et al. [77] introduced a tool named Performance Prophet. The tool
10
2.1 Introduction
provides a graphical interface for a user to specify the performance model using the
Unified Modeling Language (UML) [17]. Then, the tool automatically transforms the
model from the UML representation to a C++ code and evaluate the performance of
the C++ program.
In order to achieve best performance for linear algebra kernels, machine-specific
hand tuning of those kernels is often applied. In contrast, scientists aim to optimize
this process for existing and upcoming architectures through the automatic generation
of such routines. As the matrix-matrix multiplication (DGEMM) is the core of the
BLAS library (all other BLAS-3 routines can be expressed in terms of DGEMM), in
several works [9, 92, 96] the problem of optimizing this routine for a given architecture
was tackled by applying the automatic generation approach. In [10, 9] Bilmes et al.
developed a a Portable, High-Performance, ANSI C (PHiPAC) methodology, which
consists of three components: 1. Models for current machines and C compilers that
provide guidelines to produce portable high-performance C code; 2. Parametrized code
generators that produce code according to the guidelines; 3. A set of search scripts
that automatically tune the code for a target architecture by varying the code param-
eters and benchmarking the routine. The ATLAS project [92, 93] extended the ideas
developed as part of the PHiPAC project by reducing the search space through limit-
ing the number of generated algorithms1 for the studied operation. ATLAS provides
a very good implementation of BLAS by tuning routines for various architectures;
those are centralized around a highly tuned matrix-matrix product that is automati-
cally optimized for different levels of the memory hierarchy.
Instead of the empirical search in ATLAS, Yotov et al. [96] introduced a model-drive
approach for DGEMM to estimate optimal values of parameters such as tile sizes and
loop unrolling factors. In order to determine which values give the best performance,
the search engine runs routines with many different combinations of parameter values
on the actual hardware. On modern architectures, such approach may take from eight
minutes to more than eight hours. During the study, the authors of [96] discovered
that cache misses slow the execution time and therefore decrease the performance.
Accordingly, the tiling block sizes for different cache levels are modeled by minimizing
the number of cache misses. The experiments demonstrated that the model-driven
optimization can be very effective and can generate code with the performance that
is close to optimal. This study also shows that compilers can generate excellent code
using analytical models.
We approach the problem from a different perspective. Instead of applying modeling
for deriving high performing algorithms or constructing corresponding libraries, our
focus is on predicting performance of such algorithms. We benefit from the hierarchical
structure of linear algebra libraries and start the prediction from the kernel operations
– the bottom-up approach. We consider modeling of the kernels from two different
implementations of the BLAS library, namely reference BLAS, which is close to naive,
and GotoBLAS2. For predicting performance of those kernels, we create a model and
then tailor it for the selected architectures as well as the specific implementation. The
latter is quite important and difficult to capture since each implementation follows
its own path to attain the high performance on a target architecture. Our aim is to
1algorithmic variants
11
Chapter 2 Performance Modeling and Prediction: General Concepts
accurately match and express those different implementations in analytical models.
2.2 Performance Prediction through Measurements
In general, the performance of an algorithm can be defined as a ratio between the
number of floating point operations performed by the algorithm and the execution
time:
Performance = Flops
Execution time
. (2.1)
For direct algorithms, Flops can be calculated a priori from the structure of the
algorithm. Since Flops is known, the performance prediction reduces to modeling the
execution time.
Here, we analyze the impact of accuracy in measurements on performance predic-
tion, and investigate modeling of the execution time through measurements. Previ-
ously, Cuenca et al. developed performance models for linear algebra routines [22];
those models are based on measurements. The authors focused on the modeling for
relatively big problem sizes (n ≥ 512). We, instead, are interested in modeling the
execution time of linear algebra algorithms for smaller problem sizes (n < 512). These
are typical for unblocked algorithms as well as algorithms-by-blocks [16]. Exploiting
the hierarchical structure of linear algebra libraries (Figure 2.1), the execution time
of higher level algorithms, like those included in the LAPACK library, is predicted
through models for the kernel operations such as the BLAS routines. As the measure-
ments confirm, the execution time of the BLAS routines has a piecewise-polynomial
behavior. Therefore, we conduct few measurements of the routine using the cycle-
accurate timer and, then, model the computation time of it for any given problem
size by applying polynomial fitting. The approach is verified by predicting the execu-
tion time of an unblocked variant of the LU factorization [37, 87] that is built on top
of two BLAS routines, namely DGER [30] and DSCAL [60].
First, we present timing methodologies (Subsection 2.2.2) and then show how they
can be used for modeling performance (Subsections 2.2.3 and 2.2.4).
2.2.1 Experimental Setup
For the performance experiments we use three different architectures (processors),
namely
• Intel Xeon E5450 (code name Harpertown);
• Intel Core 2 P8600 (code name Penryn);
• AMD Opteron 8356 (code name Barcelona).
In order to refer to the Intel Xeon E5450, Intel Core 2 P8600, and AMD Opteron
8356 architectures we introduce short names Harpertown, Penryn, and Barcelona,
accordingly.
Each of the four cores on Harpertown operates at 3.0GHz and can execute four
Flops per cycle, at a peak performance of 12.0GFlops/sec per core or 48.0GFlops/sec
12
2.2 Performance Prediction through Measurements
Table 2.1 Cache system on Intel Xeon E5450 (Harpertown).
Name Total size Line size Associativity n (n × n)
L1 data cache 32Kb 64 b 8 64
L2 unified cache 6144Kb 64 b 24 886
per socket. Memory-wise, each core includes a 32Kb L1 data cache. Each chip has
two 6Mb L2 unified caches; each of those is shared between two cores. Table 2.1 pro-
vides detailed information regarding the cache system on Harpertown and additionally
matrix sizes that can completely utilize the particular cache.
The RWTH Xeon cluster’s nodes execute the Scientific Linux operating system
release 6.1 (Carbon) with the 2.6.32 Linux kernel [1]. On Harpertown, we used the
GotoBLAS2 library [38] of version 1.07. The library was compiled with version 4.4.5
of gcc compiler from GNU Compiler Collection [84] under the optimization level two
enabled. Since the optimization influences the data access patterns of non-tuned
algorithms like the ones from the reference BLAS library, the library was compiled
without optimization. The GOTO_NUM_THREADS option was set to one meaning that
only one core per node was used for both libraries. The Performance Application
Programming Interface (PAPI) libraries [11, 56, 66] of versions 4.1.x and 4.2.x were
used to access the processor performance counters.
Table 2.2 Cache system on Intel Core 2 P8600 (Penryn).
Name Total size Line size Associativity n (n × n)
L1 data cache 32Kb 64 b 8 64
L2 unified cache 3072Kb 64 b 12 627
Penryn is another Intel processor. Each of its two cores operates at 2.4GHz and
can execute four Flops per cycle, at a peak performance of 9.6GFlops/sec per core or
19.2GFlops/sec per socket. Each core includes a 32Kb L1 data cache and each chip
has a shared 3Mb L2 unified cache. The information concerning the cache system on
Penryn is shown in Table 2.2.
Penryn executes the Ubuntu 10.04 LTS (Lucid Lynx) operating system using the
2.6.32 Linux kernel; by default the kernel does not provide an access to the hard-
ware performance counters. In order to enable the access to them from the PAPI
library (version 4.1.0) we patched the kernel. The gcc compiler of version 4.4.3, with
optimization level two enabled, was used for the GotoBLAS2 library of version 1.07.
Barcelona has four cores, operating at 2.3GHz. Each core can execute three Flops
per cycle for a peak performance of 6.9GFlops/sec per core or 27.6GFlops/sec per
socket. Memory-wise, each core contains a 64Kb L1 data cache and a 512Kb L2
unified cache. In addition, each chip has a 2Mb L3 unified cache shared among
all four cores. The comprehensive specification of the cache system on Barcelona is
presented in Table 2.3.
13
Chapter 2 Performance Modeling and Prediction: General Concepts
Table 2.3 Cache system on AMD Opteron 8356 (Barcelona).
Name Total size Line size Associativity n (n × n)
L1 data cache 64Kb 64 b 2 90
L2 unified cache 512Kb 64 b 16 256
L3 unified cache 2048Kb 64 b 32 512
Barcelona supports a wide variety of x86 instruction-extensions like MMX, Ex-
tended 3DNow!, SSE, SSE2, SSE3, SSE4a, and AMD64 [62, 44]. This processor
is compatible to the x86 architecture implementing a floating-point stack using 80-
bit registers opposed to 64-bit registers used in many other processor architectures.
Therefore floating point results might slightly vary when using this mode. Some
compilers offer a switch to adjust the floating point precision to the IEEE standard.
When employing SSE instructions by appropriate compiler options, 64-bit operations
are used [1].
Barcelona is equipped with the same operating system and software as Harpertown.
2.2.2 Timing Methodologies
Accurate time measurements are crucial not only for performance evaluation, but also
for performance prediction. Whaley et al. [91] investigated the importance of context-
sensitive timers in achieving optimal performance of HPC applications. As a result,
techniques and methodologies were developed in order to obtain accurate timings
of HPC kernel routines for the empirical tuning ATLAS framework [93]. For our
performance study, we apply similar concepts in combination with a cycle-accurate
timer, which has a resolution of one clock cycle.
In general, system timers report time according to the system clock; those timers
can be divided into two categories, depending on measurements: wall timers or CPU
timers. Wall timers provide a very accurate measure of elapsed time with resolution
of clock cycles. The problem with wall timers is that they also include time spent by
the processor on other users’ tasks and unrelated OS processes. CPU timers instead
measure only the time during which the processor is working on a particular process.
The disadvantage of the CPU time measurements is that the timer aggregates the
time slices of the appropriate process that are measured in terms of wall time to the
total CPU time of the same. Therefore, CPU timer cannot have a finer resolution
than the most accurate wall timer. Also, CPU time might be highly inaccurate [91]
– from one measurement to another it can vary significantly. In our experiments we
rely on a cycle-accurate wall timer.
The accuracy in measurements does not only depend on the accuracy of the timer,
but also on the data placement, meaning whether the operands are in or out of cache.
Thus, if we compare a particular routine using timing approach when the operands
are kept in cache with the same when operands are out of cache, we would see that
the results are better in the former case. This is particularly the case for relatively
small problems that fit in cache.
14
2.2 Performance Prediction through Measurements
Flush No Flush
20 30 40 50 60
0
0.1
0.2
0.3
×105
Matrix size [m = n]
T
im
e
[c
yc
le
s]
(a) Harpertown
20 40 60 80
0
0.5
1
1.5
×105
Matrix size [m = n]
T
im
e
[c
yc
le
s]
(b) Barcelona
Figure 2.2 Measurements of DGER from the GotoBLAS2 library using timing approaches
with and without cache flushing.
To demonstrate the effect on timings when the operands are in or out of cache we
chose DGER – a BLAS-2 routine [30] that updates a matrix with an outer product
of two vectors. Figure 2.2 demonstrates the DGER measurements that were obtained
using two different timing approaches: one (Flush) flushes cache before the compu-
tation to ensure that the operands are out of cache, while the other (No Flush) does
not do that. The results differ significantly: On Harpertown the No Flush approach
outputs the results that are four times smaller than the ones reported by the Flush;
on Barcelona the difference is even greater, nearly one order of magnitude. To con-
clude, in order to achieve meaningful results while comparing algorithms, it is better
to consider both the execution time at one hand and the timing approach at the other.
Another timing approach is based on sampling the computational kernels in the
same way as they are called within an algorithm [91]. For instance, an unblocked
variant of the LU factorization [8, 87] is built on top of two BLAS routines, namely
DGER and DSCAL. Hence, while measuring the execution time of this algorithm
cache is flushed only once before the first call to the BLAS routines; for the next calls
operands are partially in cache. To summarize, if there is no good reason to place
operands in cache, it is recommended to flush cache before timing the routine.
In order to improve the accuracy of outcomes, we repeat the measurements multiple
times and, then, find minimum, median, or average2. While using wall timers the
inaccuracy in the results can be caused by another processes. Accordingly, in terms
of wall time the most stable and reliable timings are minimum and median.
Figure 2.3 shows the timings of DGER: For each problem size we conducted 10 (or
100) samples and then selected minimum, median, and average. The distance between
2We discard the first measurement because it includes the time spent on caching instructions. Thus,
it may corrupt the results.
15
Chapter 2 Performance Modeling and Prediction: General Concepts
min med avg
20 30 40 50 60
0
0.2
0.4
0.6
0.8
×104
Matrix size [m = n]
T
im
e
[c
yc
le
s]
(a) Harpertown. Without cache flushing
20 30 40 50 60
0
1
2
3
4
×104
Matrix size [m = n]
T
im
e
[c
yc
le
s]
(b) Harpertown. With cache flushing
20 40 60 80
0
0.05
0.1
0.15
×105
Matrix size [m = n]
T
im
e
[c
yc
le
s]
(c) Barcelona. Without cache flushing
20 40 60 80
0
0.5
1
1.5
×105
Matrix size [m = n]
T
im
e
[c
yc
le
s]
(d) Barcelona. With cache flushing
Figure 2.3 Minimum, median, and average time of DGER from the GotoBLAS2 library on
the Harpertown and Barcelona processors.
the minimum and median timings is almost negligible; whereas the average can vary
considerably. Due to the usage of the cycle-accurate wall timer and unstable average
timings, we chose the median timing for the performance prediction experiments.
2.2.3 Prediction
In this section, we apply the gained knowledge regarding the timing techniques to
predict the performance of linear algebra algorithms. The approach is demonstrated
on DGER and the unblocked variant for computing the LU factorization without
pivoting [8, 87] (Algorithm 2.4). This algorithm decomposes an m × n matrix A in
the product of an m×min(m,n) unit lower triangular matrix L and an min(m,n)×n
16
2.2 Performance Prediction through Measurements
upper triangular matrix U :
A = LU.
The computation is performed by calling two BLAS routines:
• DSCAL [43, 60] – a BLAS-1 routine – scales a vector by a constant
x = αx; (2.2)
• DGER [30, 29] – a BLAS-2 routine – computes a rank-1 update of a general
m × n matrix A:
A = αxyT +A, (2.3)
where α is a scalar, x and y are vectors of size m and n, respectively.
Partition
A→ (ATL ATR
ABL ABR
)
where ATL is 0 × 0
While size(ATL) < size(A) do
Repartition
(ATL ATR
ABL ABR
)→⎛⎜⎝
A00 a01 A02
aT10 α11 a
T
12
A20 a21 A22
⎞⎟⎠
where α11 is 1 × 1
a21 ∶= a21/α11 [DSCAL]
A22 ∶= A22 − a21aT12 [DGER]
Continue with
(ATL ATR
ABL ABR
)←⎛⎜⎝
A00 a01 A02
aT10 α11 a
T
12
A20 a21 A22
⎞⎟⎠
endwhile
Algorithm 2.4 An unblocked algorithm
for the LU factorization.
A00 a01 A02
aT10 α11 a
T
12
A20 a21 A22
i
1
m − i − 1
i 1 n − i − 1
Figure 2.5 3×3 partitioning of the ma-
trix A.
The unblocked variant of the LU factorization is presented using the FLAME no-
tation [7, 8, 87]. This notation makes it easier to identify what regions of the matrix
are updated and used, see Figure 2.5. In Algorithm 2.4, size(A) indicates the number
of columns of the matrix A; ’T’, ’B’, ’L’, and ’R’ stand for ’Top’, ’Bottom’, ’Left’, and
’Right’, respectively. Before the computation starts, the matrix A is partitioned into
four parts ATL, ATR, ABL, and ABR, where ATL is a 0 × 0 matrix. The matrix A is
traversed from the top left to the bottom right corner. At iteration i of the loop, A
is repartitioned from 2 × 2 to 3 × 3 form, where
17
Chapter 2 Performance Modeling and Prediction: General Concepts
0.0e+00
1.0e+04
2.0e+04
3.0e+04
4.0e+04
5.0e+04
6.0e+04
7.0e+04
 20  40  60  80  100  120  140  160  180  200
T
im
e[
cy
cl
es
]
Matrix size[m=n]
GER
Figure 2.6 Piecewise-parabolic behavior of the execution time of DGER from the Goto-
BLAS2 library on Harpertown.
• A00,A02,A20, and A22 are matrices;
• a01, aT10, a
T
12, and a21 are vectors;
• α11 is a scalar.
The algorithm updates the matrix A22 and the vector a21 using DGER and DSCAL,
respectively. To denote the size of the matrix A22 we use p × q, where p = m − i − 1
and q = n − i − 1. As the algorithm progresses, A22 decreases in size, starting from
m − 1 × n − 1 down to m − n + 1 × 1; for simplicity we consider the case when m ≥ n.
At the end of the computation, the matrix A is overwritten by the upper triangular
matrix U and the unit lower triangular matrix L.
Figure 2.6 illustrates how the execution time of DGER can be represented by a
piecewise-parabolic function; the number of parabolas depends on the number of
memory levels. For instance, the execution time for problems up to p = 64, which
completely fit in the L1 cache on the Harpertown processor, can be characterized by
one parabola and for bigger problems, which fit in the L2, by another one. Therefore,
for each cache level we developed a model that constructs the time for DGER by
evaluating it only for few problems and then applying parabolic fitting.
Since DGER performs roughly 96% of the total Flops in the LU factorization, it is
the major contributor to the execution time of the algorithm; the fraction of DSCAL’s
Flops as well as its impact on the total execution time are negligible. Therefore, we
predict the computation time of the LU factorization by combining the modeled time
of DGER for different problem sizes p × q from m − 1 × n − 1 down to m − n + 1 × 1.
18
2.2 Performance Prediction through Measurements
Measured Modeled Deviation
20 40 60
0
2
4
6
8
10
p = q
D
ev
ia
ti
on
[%
]
0
2
4
6
8
×103
T
im
e
[c
yc
le
s]
(a) Harpertown, p ≤ 64
0 100 200 300
0
2
4
6
8
10
p = q
D
ev
ia
ti
on
[%
]
0
0.5
1
1.5
×105
T
im
e
[c
yc
le
s]
(b) Harpertown, p < 300
20 40 60 80
0
2
4
6
8
10
p = q
D
ev
ia
ti
on
[%
]
0
0.5
1
1.5
×104
T
im
e
[c
yc
le
s]
(c) Barcelona, p < 90
0 100 200 300
0
2
4
6
8
10
p = q
D
ev
ia
ti
on
[%
]
0
0.5
1
1.5
2
×105
T
im
e
[c
yc
le
s]
(d) Barcelona, p < 300
Figure 2.7 Modeling the execution time of DGER from the GotoBLAS2 library.
2.2.4 Evaluation
In these experiments, we focus on modeling the execution time for small problems that
fit in cache. The predictions of the execution time for DGER and the LU factorization
on the aforementioned architectures were verified by comparing the modeled time
with the time separately measured in practice. The measurements were performed on
DGER from the GotoBLAS2 library (Subsection 2.2.1). The deviation (in percentage)
between the modeled and measured execution time was calculated by
Deviation = ∣Modeled time −Measured time∣
Measured time
. (2.4)
Figure 2.7 presents the results of modeling the execution time of DGER for different
problem sizes on the Harpertown and Barcelona processors. For each cache level, we
performed only 4−6 measurements, applied parabolic fitting by solving a least squares
problem. Figures 2.7(a) and 2.7(c) show the results when the operands fit in the L1
19
Chapter 2 Performance Modeling and Prediction: General Concepts
Measured Modeled Deviation
0 100 200 300
0
2
4
6
8
10
p = q
D
ev
ia
ti
on
[%
]
0
0.5
1
1.5
×107
T
im
e
[c
yc
le
s]
(a) Harpertown
0 100 200 300
0
2
4
6
8
10
p = q
D
ev
ia
ti
on
[%
]
0
1
2
×107
T
im
e
[c
yc
le
s]
(b) Barcelona
Figure 2.8 Modeling the execution time of the unblocked LU factorization.
cache, while Figures 2.7(b) and 2.7(d), illustrate the scenario when the operands
require more memory than the L1 cache provides. To summarize, the modeling is
more accurate on Harpertown than on Barcelona; in most cases the deviation is less
than 4%.
Figures 2.8(a) and 2.8(b) demonstrate the modeled and measured time of the un-
blocked variant of the LU factorization for different problem sizes. The execution time
of Algorithm 2.4 is modeled by assembling the predicted time of DGER for problems
from m−1×n−1 down to m−n+1×1. The deviation between the modeled and mea-
sured results is slightly higher for smaller problems. Nevertheless, when the problem
size increases the deviation drops. In general, the deviation is mostly less than 3%.
2.3 Summary
We studied a technique for predicting the performance of linear algebra algorithms for
relatively small problem sizes; those problems often occur when lower-level algorithms
such as the BLAS kernels are used as building blocks of higher-level algorithms like
the ones included in the LAPACK library. This technique is based on modeling per-
formance of an algorithm through measuring the execution time of either the entire
algorithm (in case of the kernel operation) or parts of it. To ensure the accuracy of the
measurements, we used a cycle-accurate wall timer. As the time measurements con-
firmed, the execution time of the BLAS routines has a piecewise-polynomial behavior.
Therefore, we modeled the execution time of them by conducting only few measure-
ments and applying polynomial fitting. The execution time of higher-level algorithms,
which are built on top of the BLAS routines, is modeled by assembling the predicted
time of their building blocks. The approach was validated by comparing the models
for DGER – a BLAS-2 routine – and the unblocked LU factorization, which is built
on top of DGER, with the actual measurements on the Intel and AMD processors.
20
2.3 Summary
The results are satisfactory with deviations below 4% even for small problems.
In the following chapter we introduce a new technique for modeling the execu-
tion time of dense linear algebra algorithms. The main concept of that technique is
the execution-less modeling, meaning that neither algorithms nor parts of them are
executed in order to build performance models.
21
22
Chapter 3
Execution-Less Performance Modeling
3.1 Introduction
In Section 2.2 a technique for modeling the execution time through time measure-
ments was demonstrated. In contrast, here we avoid measurements altogether and
aim at obtaining execution time through analytical modeling. This is of particu-
lar relevance in combination with techniques of algorithm generation and tuning [5],
where for each target architecture dozens or even hundreds of algorithmic variants
have to be evaluated. Based on this technique, several prominent libraries were de-
veloped: PhiPAC [10, 9], FFTW [34], ATLAS [92, 93], and SPIRAL [78, 81]. These
libraries generate more efficient code than what native compilers can provide. In
contrast, Yotov et al. [96] showed evidence that compatible results can be achieved
through analytical modeling of optimal parameters, e.g. tile sizes and loop unrolling
factors, for the matrix-matrix multiplication.
We propose estimating the execution time by using detailed information about the
algorithm and the architecture (CPU and memory) it is executed on. While computing
an algorithm, the CPU spends time on both computations (“CPU active time”) and
waiting for the requested data from the memory system (“CPU idle time”); the latter
is also called “memory-stalls time”. Since memory-stalls, e.g. L1 data cache misses
(L1 misses), idle the CPU and add a significant overhead to the computation time,
we model the execution time not only through the CPU active time, but also through
the CPU idle time. The latter plays an especially important role in operations that
are memory-bound, because for those the major fraction of the computation time is
spent on waiting for the memory system. This is the case for unblocked algorithms
that are built on top of memory-bound operations like the BLAS-2 routines [30].
This is also relevant to the LAPACK library, where each routine has unblocked and
blocked implementations. For instance, DGETF2 implements an unblocked variant of
the LU factorization while DGETRF stands for a blocked version of the same [2, 8].
The blocked algorithms instead are built on top of compute-bound operations like the
BLAS-3 routines [28] whose computation time can be estimated rather accurately by
simply counting the Flops performed. Here, we focus on the more challenging goal of
predicting performance for memory-bound algorithms.
This chapter is organized as follows. Section 3.2 reviews the main aspects of the
memory hierarchy. Section 3.3 describes the general performance model. Section 3.4
introduces analytical models for cache misses that are the main component in the
execution time of memory-bound algorithms. Section 3.5 assembles the performance
model, while Section 3.6 validates the model by comparing with the actual measure-
23
Chapter 3 Execution-Less Performance Modeling
ments. Finally, Section 3.7 summarizes the results of the chapter and Section 3.8
proposes future research directions.
3.2 Memory Hierarchy
Regs
L1 cache 
(SRAM)
Main memory
(DRAM)
Local secondary storage
(local disks)
Larger,  
slower, 
and 
cheaper
Remote secondary storage
(distributed file systems, Web servers)
L2 cache 
(SRAM)
L0:
L1:
L2:
L3:
L4:
L5:
Smaller,
faster,
and 
costlier
L3 cache 
(SRAM)
L6:
Figure 3.1 The memory hierarchy. Source: Bryant et al. [12]
In a typical memory hierarchy, as shown in Figure 3.1, the storage devices get larger,
slower and cheaper as moving from higher (registers) to lower levels (local disks). The
highest level (L0) consists of very fast CPU registers that can be accessed directly by
ALUs. The next levels are two or three small to moderate-size cache levels (or caches)
that the CPU can access in a few / tens of clock cycles. The caches are followed by a
large main memory (RAM) that can be accessed in hundreds of clock cycles. Finally,
the lowest levels are local disks and, sometimes, disks on remote servers. It takes
hundred thousand times longer to read information from a disk than from the main
memory and million times longer than from the cache [12].
Each level k in the memory hierarchy caches data from the next lower level k + 1.
The storage at each level is partitioned into contiguous chunks called blocks. Block
is used as a unit for transferring data between memory levels; the size of block may
differ among pairs of memory levels. For instance, data is transferred between the L1
cache and the CPU registers by one-word1 blocks; between the L2 cache and the L1
cache as well as the main memory and the L2 cache, data is usually moved in blocks
of 8 or 16 words called cache lines. In general, memory levels that are further from
the CPU have longer access time, and therefore large blocks are used to amortize this
1The term “word” is used for a small group of bits which are handled simultaneously by processors of
a particular architecture. The size of a word is thus CPU-specific. On the architectures considered
in this study it is 64-bit.
24
3.3 A General Model
penalty [12].
In order to know whether a piece of data is in cache or not and in which cache
line it is, the hardware implements a mapping between memory and cache locations.
The simplest approach is the so called the direct mapped cache, indicating that each
memory location maps to exactly one cache line. However, the direct mapped cache
causes a problem with overwriting data, because many memory blocks can be mapped
into the same cache line. Opposite to the direct mapped is the fully associative cache,
in which a block from memory can be placed in any location in the cache. This
solution is not very efficient, because to find a particular cache line all entries in the
cache must be searched. A compromise between the direct mapped and the fully
associative caches is the set associative cache, in which each block in memory can be
placed in any location within one set of cache lines. A cache with n cache lines per
set is called the n-way set associative cache [44].
To perform computation, a program needs to bring blocks of data from memory
to registers. Before retrieving a block of data d from level k + 1, the program first
looks for d at level k. If d is cached at level k, then a cache hit occurs. The time
needed to fetch d from level k to the processor is called cache hit time, it includes
the time needed to determine whether the access is a hit or not. On the other hand,
if d is not found at level k, then a cache miss occurs. The time needed to serve the
cache miss is called cache miss time, which is the time to fetch the block of data from
level k + 1 to level k and the delivery time of this block to the processor. Due to
the smaller and faster upper levels, the hit time is much smaller than the miss time.
While serving a cache miss, it may happen that level k is full, so d has to overwrite an
existing block there following a replacement strategy. Two commonly used strategies
are Random Replacement (RR) and Least Recently Used (LRU) schemes. The RR
scheme randomly selects the candidate data block while the LRU replaces the cache
line that was unused for the longest time [12, 74].
3.3 A General Model
A simple formula for modeling the execution time combines the number of clock cycles
required by the algorithm and the time of one clock cycle [74]:
Execution time = CPU clock cycles ×Clock cycle time. (3.1)
The CPU clock cycles can be divided into the cycles that the CPU spends executing
the algorithm and those spent waiting for the memory system. Hence,
Execution time = (CPU active clock cycles+
Memory-stall clock cycles)× (3.2)
Clock cycle time.
The memory-stalls, which are composed of cache and TLB misses, are divided into
read-stalls and write-stalls. The write-stalls may occur without influencing the execu-
tion since data can be copied in the background. For this reason, in the following, we
consider only read-stalls. Modeling read-stalls is quite a complex task that requires
25
Chapter 3 Execution-Less Performance Modeling
deep understanding of the memory system behavior.
In order to obtain a more comprehensive model, the CPU active clock cycles and
memory-stall clock cycles can be refined into smaller components. The CPU active
clock cycles can be written as
CPU active clock cycles = Flops × Flop time
where Flops is the number of floating operations performed. The memory-stall clock
cycles, which include the L1 data cache hits (L1 hits)2, can be split into four or five
components depending on the number of cache levels in the memory hierarchy:
Memory-stall clock cycles = n∑
i=1Li misses × Li miss time+
L1 hits × L1 hit time+
TLB misses ×TLB miss time,
where n is the number of cache levels. Including the above mentioned information,
Equation (3.2) becomes
Execution time = n∑
i=1αi × Li misses × Li miss time+
α01 × L1 hits × L1 hit time+
β ×TLB misses ×TLB miss time+
γ × Flops × Flop time.
(3.3)
The balancing weights (α01, αi, β, and γ) are used to model the overlap between
computation and data movement; α01, αi, β, γ ∈ [0,1]. For compute-bound algo-
rithms, γ ≈ 1 and αi and β are both close to zero. For memory-bound algorithms, the
situation is reversed.
We restrict the study to a scenario where data resides in L2 cache, so only L1
data cache misses (L1 misses) occur. Our interest is in the analytical modeling of L1
misses. Being able to accurately model L1 misses is the key to accurate modeling of
execution time.
3.4 Modeling Memory-Stalls
The main focus of this study is on modeling memory-bound algorithms. The execution
time of such algorithms is strongly influenced by the time spent on memory-stalls
such as L1 misses. For instance, Figure 3.2 shows clear dependence of the DGER
execution time on the number of its L1 misses; DGER is used from both the reference
BLAS and the GotoBLAS2 libraries. Therefore, we predict the execution time of
memory-bound algorithms through modeling their memory-stalls. The latter is also
an important stage in our performance modeling methodology, which is represented
by Equation (3.3).
2The successful accesses to the L2 (L3) caches are covered by L1 (L2) misses.
26
3.4 Modeling Memory-Stalls
0 2 4 6 8×103
0
0.5
1
1.5
×105
L1 misses
T
im
e
[c
yc
le
s]
(a) Reference DGER
0 2 4 6 8×103
0
0.5
1
×105
L1 misses
T
im
e
[c
yc
le
s]
(b) Goto’s DGER
Figure 3.2 Dependence between the execution time and L1 misses of DGER on Penryn.
A technique, which is named Parallel Cache Assignment (PCA), for parallelizing
panel operations in LAPACK was discussed by Castado et al. [13]. The key to PCA
is a data-ownership model of parallelism where data is assigned to the cores in order
to enable cache reuse. A side effect of the data assignment is that synchronization
between the cores is introduced into the innermost loop of the algorithms in order to
maximize cache reuse. This counter-intuitive optimization turns to be highly efficient.
Without this modification the performance is limited to the speed of memory. The
approach demonstrates that the performance can be improved by carefully reusing
data in order to reduce or avoid extra memory-stalls. Those have a significant impact
on the performance of memory-bound algorithms and therefore needs to be thoroughly
studied.
A study on improving hardware prefetching in case of L1 misses was conducted by
Mahjur et al. [61]. They presented a two-phase prediction algorithm that identifies
the possibility of the next cache miss and prefetches only one corresponding cache
line.
Fraguela et al. presented a strategy to develop probabilistic analytical models of the
cache behavior [33]. The approach is based on automatic generation of equations that
estimate the number of cache misses of typical code found in scientific applications –
such as non- and perfectly nested loops and matrix-matrix products. Their aim was to
apply these models during the compilation in order to obtain automatic optimization
of the code.
We instead consider studying/modeling memory-stalls from a different perspective.
We aim at developing analytical models for counting cache misses of dense linear
algebra algorithms. Afterwards, these models would be applied in order to predict
the execution time of those algorithms.
Under the assumption that data resides in L2 cache, our goal is to model L1 misses.
It is obvious that the number of L1 misses strongly depends on the capacity of L1
cache. For instance, most problems do not fit in L1 cache, so data has to be moved
from L2 cache to L1 first and then to the CPU registers. Data movement between
27
Chapter 3 Execution-Less Performance Modeling
the caches causes not only L1 misses, but also increases the computation time.
Determining the number of L1 misses is a complex task that depends on the or-
ganization of L1 cache, the processor type and the memory access pattern of the
algorithm. Different organizations of L1 cache may result in different numbers of
L1 misses for the same algorithm. For instance, more cache misses are measured on
2-way associative than on 8-way associative caches [74]. Concerning the processor
type, one of the important features supported by processors is hardware prefetching.
In general, prefetching means bringing data or instructions from the memory into
the cache before they are needed. Data can be prefetched either into the L1 or into
the L2 cache. Usually hardware prefetching is very efficient, but in the case of mis-
prediction of data accesses it pollutes (when redundant data is prefetched) the cache
and causes unexpected cache misses [14]. The memory access pattern of the algorithm
is also crucial. The development of algorithms is mainly based on taking advantage of
cached data. To exploit the temporal (data in cache) and spatial (data in cache line)
localities, algorithms are adapted to the organization of the cache and the processor
type. Thus, the organization of the cache, the processor type, and the memory access
pattern of the algorithm are tightly coupled and altogether dictate the behavior of L1
misses.
As case studies, we chose the unblocked variant of the LU factorization without
pivoting [8] and its building block DGER. We develop an analytical formula for mod-
eling L1 misses for DGER. Then, we model the L1 misses for the LU factorization by
assembling the predicted misses of DGER. We also show that the model for DGER
can be easily adapted for other algorithms with similar and even different structures
such as other BLAS-2 operations [30] like DGEMV and DTRSV. As a result, by using
models for the LU factorization and DTRSV, we can predict L1 misses for solving a
linear system of equations.
This section is divided into three subsections: DGER and LU factorization, DGEMV,
and DTRSV. In each of those, we introduce a model and show how it captures the
behavior of the algorithm on two various architectures. In order to verify the accu-
racy of the models, we use two different BLAS implementations: the reference BLAS
library [71] and the highly optimized GotoBLAS2 library [38]. The purpose of us-
ing the reference BLAS is to better understand the kernel operations and, especially,
data movements within them. Then, using the gained knowledge, we directly tackle
modeling L1 misses of DGEMV and DTRSV from the GotoBLAS2 library.
3.4.1 DGER and LU Factorization
In Subsection 2.2.3 (Algorithm 2.4) we presented the unblocked variant of the LU
factorization. DSCAL, which is one (of two) building blocks of the unblocked LU,
has linear complexity while DGER has quadratic complexity; therefore, DGER is the
main source of L1 misses of the LU. Accordingly, we model the number of L1 misses
for the LU factorization through the number of L1 misses of DGER.
Let us first consider DGER in isolation, see Equation (2.3) (Subsection 2.2.3).
This scenario rarely occurs in practice and it is rather a theoretical study. But, it
is a starting point for our modeling process. DGER performs 2mn floating point
operations over mn+m+n data, which means that mn+m+n elements of the matrix
28
3.4 Modeling Memory-Stalls
and vectors have to be loaded into the CPU registers to perform the computation.
Theoretically, each of those elements may cause an L1 miss. However, the elements of
the matrix A – stored by columns – and both vectors x and y are contiguous, meaning
that adjacent entries along one column are stored next to one another in the memory.
Since the data movement in cache is organized in cache lines, only one cache miss per
accessed cache line occurs.
To build a model for L1 misses, we make the following assumptions.
1. The matrix A and the vectors x and yT are aligned to the beginning of a cache
line. This means that the first elements of the matrix and the vectors are the
first elements in cache lines.
2. The L1 cache is large enough to contain both x and yT , as well as a part of the
matrix A.
3. Once the vectors x and yT are loaded into the L1 cache, they remain there for
the whole computation.
With these assumptions, which are commonly satisfied, the number of L1 misses of
DGER can be modeled by
L1 misses = ⌈mn
d
⌉ + ⌈m
d
⌉ + ⌈n
d
⌉, (3.4)
where d is the number of double precision floating point values in a cache line; ⌈mn
d
⌉ is
the number of L1 misses when reading the contiguous matrix A; ⌈m
d
⌉ and ⌈n
d
⌉ are the
numbers of cache misses when reading the contiguous vectors x and yT respectively.
Equation (3.4) covers only the simplified scenario when the matrix and vectors
are contiguous in memory. In practice, DGER is used as a building block of other
algorithms. In case of the unblocked variant of the LU factorization, see Algorithm 2.4
and Figure 2.5, DGER operates on: the matrix A22 and the vectors aT12 and a21, where
A22 and aT12 are non-contiguous
3. Therefore, this influences the number of L1 misses
(more cache misses occur) and adds complexity to the model. The storage of the
matrices A and A22 and the vectors a21 and aT12 is organized as follows: A is stored by
columns and is contiguous in memory; the leading dimension of A, which is denoted
as LDA (LDA =m), is the memory distance between two successive entries in a row;
A22, which is the bottom-right part of A, is partially contiguous, meaning that the
elements of each column of A22 are contiguous, but columns are non-contiguous; a21
is contiguous as it is the part of a column of A; and aT12 is non-contiguous since each
element belongs to a different column of A. As reading the matrix A22 is the major
source of L1 misses in DGER, we use a simplified case where the vector aT12 is also
contiguous. By taking this into account, Equation (3.4) can be modified to
L1 misses = ζ + ⌈p
d
⌉ + ⌈q
d
⌉, (3.5)
3This means that adjacent entities along one column are not stored next to another in the memory.
29
Chapter 3 Execution-Less Performance Modeling
A
A22
(a) m − p < d
A
A22
(b) m − p ≥ d
Figure 3.3 Alignment of cache lines within the matrix A.
where p × q is the size of the matrix A22. The quantity
ζ =
⎧⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎩
⌊mq
d
⌋, if m − ⌊p
d
⌋d < d
⌈p
d
⌉ + n−1∑
i=1 ⌈p + (mimod d)d ⌉, otherwise
(3.6)
accounts for the L1 misses to load the matrix A22 from L2 cache. For simplicity, from
now on, we use m − p < d instead of m − ⌊p
d
⌋d < d.
Equation (3.6) covers all possible cases depending on the value LDA−m, which is the
distance in memory between adjacent columns of the matrix A22. These two cases are
illustrated by the two plots in Figure 3.3. On each of these plots, the storage of A and
partitioning of its elements into cache lines (eight elements per a cache line) is shown.
Figure 3.3(a) represents the situation when A22 is the bottom-right part of A and
m−p < d. Thus, due to the organization of data movement in cache, the whole matrix
A22 and the unused data between two columns of A22 will be loaded. Accordingly, we
use ⌊mq
d
⌋ to model L1 misses on reading A22. Figure 3.3(b) demonstrates the case
when m − p ≥ d: only some unused data between two columns of A22 will be fetched.
The amount of the unused data depends on the alignment of the cache lines as well
as the position of A22 with respect to that alignment. We take the unused data into
account by using a “tail” (mimod d) in the second case of Equation (3.6).
We applied the above described model to predict the L1 misses of DGER from the
GotoBLAS library on Barcelona. Figure 3.4 shows the modeled and measured L1
misses as well as the deviation from the measurements. The results are satisfactory
only when LDA = p;x in other cases they show very high deviation; the expected
cache misses are much lower than the those measured when LDA > p. However,
Figures 3.4(a) and 3.4(b) reveal that independently of LDA the measured L1 misses
30
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 10
 20
 30
 40
 50
L
1
 m
is
se
s
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 10
 20
 30
 40
 50
L
1
 m
is
se
s
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.4 Predicted and measured L1 misses for DGER from the GotoBLAS library on
Barcelona.
demonstrate linear behavior on the interval [LDA−7d,LDA]. We refer this to the spe-
cific implementation of the algorithm in combination with the impact of the hardware.
This effect is explained in more details in the following subsection.
In the next two subsections we show how the basic model for L1 misses, Equa-
tions (3.5) and (3.6), captures the behavior of DGER on two different architectures:
Barcelona and Penryn.
Barcelona
The first testbed for our experiments is Barcelona, described in Subsection 2.2.1.
Barcelona has two prefetching units: one for instructions and one for data. The data
prefetching unit fetches data only into the L2 cache [62]. Thus, hardware prefetching
does not play a role in modeling L1 misses.
In contrast to hardware prefetching, which is a feature automatically provided and
controlled by the hardware, data can be fetched in advance at the software level. This
mechanism is called software prefetching.
:= + Ô⇒
(a) DGER
:= +
(b) DAXPY
Figure 3.5 Partitioning of operands used in DGER for DAXPY; both operations are from
the GotoBLAS2 library.
Our study revealed that the rank-1 update of a matrix performed by DGER from
31
Chapter 3 Execution-Less Performance Modeling
the GotoBLAS2 library is decomposed into smaller tasks such as linear combination
of two vectors by DAXPY, see Figure 3.5. The code for DAXPY as well as other
routines is a mixture of C and assembly languages. The combination of programming
languages with a tremendous length of source files, e.g. up to 900 lines for a simple
operation as DAXPY, makes the code almost unreadable. Nevertheless, we were able
to find that in the implementation of DAXPY, prefetching is accomplished through
assembly instructions. The results in Figure 3.4 emphasize the side effect of software
prefetching: up to seven extra cache lines per column of A22 are loaded. The number
of prefetched cache lines depends on the distance between m and p. By taking the
extra cache misses into account, Equation (3.6) in the model for Goto’s DGER can
be modified to
ζ =
⎧⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎩
⌊mq
d
⌋, if m − p < d
⌈p
d
⌉ + n−1∑
i=1 (⌈p + (mimod d)d ⌉ + η(i)), otherwise,
(3.7)
where
η(i) =min (d − 1, ⌊m + (mimod d)
d
⌋ − ⌈p + (mimod d)
d
⌉)
indicates the number of cache misses caused by the mis-prediction of software prefetch-
ing.
Since neither hardware nor software prefetching impacts modeling L1 misses of the
reference DGER, the model, Equations (3.5) and (3.6), can be applied to it without
any changes.
Evaluation. Now, we verify the accuracy of the models for DGER and the unblocked
LU factorization by comparing them with the number of cache misses measured in
practice. The experiments were conducted using DGER from the reference BLAS and
GotoBLAS2. The deviation (in percentage) between the modeled and measured L1
misses was calculated by
Deviation = ∣Modeled misses −Measured misses ∣
Measured misses
. (3.8)
Figure 3.6 reports the accuracy of the model for L1 misses of DGER from the
reference BLAS. We plot the expected and measured misses, as well as the deviation
from the measurements, for problems of size p = q and leading dimension m. We chose
the problem size from the interval [16,256] – typical sizes for the unblocked variant
of the LU factorization (Algorithm 2.4). In fact, these problem sizes are small enough
to fit in L2 cache. Moreover, as the results suggest, our model for L1 misses becomes
more accurate as the problem size increases. The results of the modeling are always
accurate, even for small problems; in most cases, the deviation is less than 1%.
In order to simulate the behavior of DGER within Algorithm 2.4, we fix m and
vary the problem size (p = q) from p = m down to p = 16, see Figure 3.3. A detailed
picture of modeling L1 misses, where the problem size changes with a step size 1, is
shown in Figures 3.6(a) and 3.6(b) for LDA = 128 and 192, respectively.
32
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  20  40  60  80  100 120 140 160 180 200
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
Figure 3.6 Predicted and measured L1 misses for DGER from the reference BLAS library
on Barcelona.
 50  100  150  200  250p  0
 50
 100
 150
 200
 250
q
 0
 1
 2
 3
 4
 5
D
ev
ia
ti
o
n
 [
%
]
 0
 1
 2
 3
 4
Figure 3.7 Deviation between predicted and measured L1 misses for DGER from the refer-
ence BLAS library on Barcelona; p ≥ q, LDA = 512.
Figure 3.7 refers to the more general scenario where p ≥ q, which covers the shapes
of both A and A22 most commonly encountered in a factorization. This scenario
occurs when the unblocked LU factorization is used as a building block of a blocked
LU factorization with pivoting. Thus, Algorithm 2.4 operates on a panel matrix, see
Figure 3.8. In these experiments, we set the leading dimenssion, LDA, to 512. As
for the case p = q, the deviation is small, usually less than 2%. For problems of very
small size, the deviation is around 4%; such deviation carries little or no weight at all
when considered as part of a factorization.
The expected and measured L1 misses of DGER from the GotoBLAS2 library
are shown on Figure 3.9. The deviation from the measurements is high for very
small problems. However, when the problem size increases the deviation decreases
significantly and usually it is less than 2%.
33
Chapter 3 Execution-Less Performance Modeling
A
m
n
p
q
(a) On i-th iteration
m
nq
A
(b) k iterations
Figure 3.8 Alignment of the matrix A for the unlocked LU factorization within a big matrix
for a blocked LU factorization with pivoting.
The impact of the mis-prediction of software prefetching on the modeling can be
noticed when comparing Figures 3.9(a) and 3.9(c) with Figures 3.4(a) and 3.4(b),
correspondingly. When the extra cache misses due to software prefetching are included
in the model, Figure 3.9, the expected cache misses nicely capture the behavior of the
measured ones. This, therefore, leads to the low deviation from the measurements.
Table 3.1 Modeled and measured L1 misses for the unblocked LU factorization on Barcelona.
m = LDA Modeled Measured Deviation [%]
64 12,601 12,612 0.087
96 40,977 41,306 0.797
128 94,889 95,899 1.053
192 312,089 313,926 0.585
256 729,737 732,501 0.377
Even if the DGER models seem to have a somewhat large error for small sizes, this
does not affect the predictive power of the model for higher-level algorithms that use
DGER models. The unblocked LU factorization is a clear example. Since DGER is
a computational kernel for the LU factorization, we are interested in the sequence of
problems (p = q) from the interval [1,m]. In Table 3.1, we sum the number of modeled
L1 misses of DGER for different values of m. The result is compared with the number
of measured L1 misses for the LU factorization. In this scenario, the model attains a
high level of accuracy: the deviation is mostly less than 1%.
34
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) LDA = 256
Figure 3.9 Predicted and measured L1 misses for DGER from the GotoBLAS2 library on
Barcelona.
Penryn
The second testbed for our experiments is Penryn, described in Subsection 2.2.1. Our
interest in the Penryn processor is due to the different organization of the cache system
with respect to the Barcelona processor:
• Barcelona has three levels of cache, while Penryn has only two;
• The L1 data cache on Penryn is two times bigger than the one on Barcelona;
• Since the L2 cache on Barcelona is also a private cache4, altogether, Barcelona
owns much more memory in terms of private cache than Penryn.
• The last level cache (L3) on Barcelona, which is shared among the four cores
and has a higher associativity than the one (L2) on Penryn, is 1MB smaller
than the L2 cache on Penryn.
Unlike Barcelona, which has only two prefetching units, Penryn has a powerful
prefetching mechanism: a dedicated unit detects multiple reads from a single cache line
4Private cache is a cache that belongs to one core and it is not shared among other cores.
35
Chapter 3 Execution-Less Performance Modeling
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128; step = 1
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256; step = 1
Figure 3.10 Predicted and measured L1 misses for DGER from the reference BLAS library
on Penryn.
and loads the next line into the L1 cache [23]. Each access to the L2 cache, whether it is
due to the prefetching or not, counts as an L1 miss by processor performance counters.
Prefetching may lead to the mis-prediction of data causing extra L1 misses. Even
though the GotoBLAS2 library is equipped with the hand-tuned implementations of
each routine for many architectures, on Penryn Goto’s DGER does not benefit from
software prefetching. Thus, by taking the L1 misses due to hardware prefetching into
account, our model, Equations (3.5) and (3.6), is modified as follows
L1 misses = ζ ′ + ⌈p
d
⌉ + ⌈q
d
⌉ + 3, (3.9)
where
ζ
′ =
⎧⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎩
⌊mq
d
⌋, if m − p < d
⌈p
d
⌉ + n−1∑
i=1 (⌈p + (mimod d)d ⌉ + η(i)), otherwise.
(3.10)
In Equation (3.9), due to the prefetching of one extra cache line after the load of each
operand, three cache misses are added: one for the matrix and two for the vectors. In
Equation (3.10), cache misses caused by the mis-prediction of hardware prefetching
are also included. For instance, η(i), which represents the number of additional L1
misses per each column of A22 due to the mis-prediction, is given by
η(i) = ⎧⎪⎪⎪⎨⎪⎪⎪⎩
1, if ⌊m + (mimod d)
d
⌋ − ⌈p + (mimod d)
d
⌉ > 0
0, otherwise.
Evaluation. On the Penryn processor, we modeled L1 misses of DGER from both
the reference BLAS and GotoBLAS2 libraries. Figures 3.10(a) and 3.10(b), where the
problem size changes with step = 1, provide a more refined picture of modeling L1
misses for DGER from the reference BLAS library. The oscillation in the deviation is
36
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) LDA = 256
Figure 3.11 Predicted and measured L1 misses for DGER from the GotoBLAS2 library on
Penryn.
observed especially for small problems. However, for most cases the deviation is less
than 2%.
In the rest of the evaluation, our focus is on results for the GotoBLAS2 library. The
case p = q for various values of LDA is illustrated in Figure 3.11. For each LDA, we
plot a separate image: Figures 3.11(a) to 3.11(c) show a clear dependency between the
deviation and the problem size. The deviation is larger for problems of size smaller
than p = 60. This is due to the small number of L1 misses; even a slight difference
of 2-6 cache misses between the model and the measurements becomes significant
in terms of deviation. In general, for most problems, the deviation is less than 3%.
All results indicate that modeling L1 misses is accurate for relatively big problems.
These problems dictate the behavior of L1 misses in Algorithm 2.4 since their number
of cache misses is much higher than for the smaller ones.
In Table 3.2, we report the number of misses for the LU factorization by adding
up all the predictions of DGER. The deviation of the combined results is mostly less
than 2%. This demonstrates that the relatively high deviation for small problem sizes
is negligible in the context of a factorization.
Figure 3.12 presents the results of modeling L1 misses of DGER in the more general
37
Chapter 3 Execution-Less Performance Modeling
Table 3.2 Modeled and measured L1 misses for the unblocked LU factorization on Penryn.
m = LDA Modeled Measured Deviation [%]
64 14,009 13,500 3.770
96 44,433 43,698 1.682
128 101,289 100,643 0.642
192 327,065 326,145 0.282
256 756,822 753,946 0.382
scenario when p ≥ q (LDA = 512). Figures 3.12(a) and 3.12(b) indicate that in the
area close to the origin the deviation is slightly higher than the rest of the spectrum.
However, as demonstrated above, such deviations do not affect the accuracy of the
model for the whole LU factorization.
Finally, we look at the scenario in which the unblocked LU factorization is a building
block for a blocked LU with pivoting, see Figure 3.8; in this case, q is small and
fixed, p ≥ q, and LDA might be much greater than p. To model the L1 misses of
Algorithm 2.4, we aggregate the models of DGER for problem sizes from p − 1 × q − 1
down to p − q + 1 × 1. Figure 3.13 shows the modeled and measured L1 misses for
LDA = 2048 and two fixed q (q = 64 and 128). Since q is fixed and only p varies, the
modeled and measured misses demonstrate a linear behavior. In summary, the results
are quite accurate – the model is within 4% (for q = 64) and 2% (for q = 128) of the
measurements.
3.4.2 DGEMV
DGEMV and DTRSV were selected to extend the modeling of L1 misses and show
that the analytical model, Equations (3.5) and (3.6), can be adapted without major
hurdles to the other BLAS-2 operations.
We first look at DGEMV since it is used as a building block for other operations
such as DTRSV in the GotoBLAS2 library. DGEMV [30] is the BLAS-2 routine that
computes the matrix-vector operation
y = αAx + βy, (3.11)
where α and β are scalars; y and x are two vectors of size m and n, respectively; and
A is a general m × n matrix. DGEMV performs 2mn floating point operations over
mn +m + n data, meaning that mn +m + n elements of the matrix and the vectors
have to be fetched into the CPU registers to perform the computation.
Suppose that the operands used in DGEMV are contiguous in memory and the
assumptions made in Subsection 3.4.1 are satisfied. Then, the number of L1 misses
for DGEMV can be modeled by Equation (3.4). However, when DGEMV is used
as a building block of DTRSV, the matrix A is non-contiguous. In this case, the
matrix A of size p × q is part of a big m × n matrix, see Figure 3.14. Accordingly,
Equations (3.5) and (3.6) can be applied to model L1 misses. Thus, the basic model
for DGER, without applying any changes, suites DGEMV as well.
38
3.4 Modeling Memory-Stalls
 50  100  150  200  250p  0
 50
 100
 150
 200
 250
q
 0
 1500
 3000
 4500
 6000
 7500
 9000
L
1
 m
is
se
s
 0
 1500
 3000
 4500
 6000
 7500
 9000
(a) Measurements
 50  100  150  200  250p  0
 50
 100
 150
 200
 250
q
 0
 5
 10
 15
 20
D
ev
ia
ti
o
n
 [
%
]
 0
 2
 4
 6
 8
 10
 12
 14
 16
(b) Deviation
Figure 3.12 Deviation between predicted and measured L1 misses for DGER from the Go-
toBLAS2 library on Penryn; p ≥ q, LDA = 512.
Next, we show how the basic model for DGEMV can be tailored for the Barcelona
and Penryn architectures.
39
Chapter 3 Execution-Less Performance Modeling
0.0e+00
5.0e+04
1.0e+05
1.5e+05
2.0e+05
2.5e+05
3.0e+05
3.5e+05
4.0e+05
 0  300  600  900  1200  1500
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p
Measured
Modeled
Deviation
(a) q = 64, LDA = 2048
0.0e+00
2.0e+05
4.0e+05
6.0e+05
8.0e+05
1.0e+06
1.2e+06
1.4e+06
1.6e+06
 0  300  600  900  1200  1500
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p
Measured
Modeled
Deviation
(b) q = 128, LDA = 2048
Figure 3.13 Predicted and measured L1 misses for Algorithm 2.4 when used as part of a
blocked LU factorization.
A
m
n
p
q
Figure 3.14 A matrix A used in DGEMV as a part of a big m × n matrix.
Barcelona
Since the Barcelona processor does not support hardware prefetching from the L2
cache, modeling L1 misses for the reference DGEMV can be performed by Equa-
tions (3.5) and (3.6).
As it was discovered with DGER from the GotoBLAS2 library, in the implemen-
tation of DGEMV, which consist of up to 2900 lines of C-assembly code, prefetching
is realized through assembly instructions. Software prefetching may lead to the mis-
prediction of data, which, therefore, causes additional cache misses. We observed up
to
d
2
extra cache misses per fetched column of A when m − p > d. By taking into
40
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.15 Predicted and measured L1 misses for DGEMV from the reference BLAS library
on Barcelona.
account these extra cache misses, Equation (3.6) becomes
ζ =
⎧⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎩
⌊mq
d
⌋, if m − p < d
⌈p
d
⌉ + n−1∑
i=1 (⌈p + (mimod d)d ⌉ + η(i)), otherwise,
(3.12)
where
η(i) =min (d
2
, ⌊m + (mimod d)
d
⌋ − ⌈p + (mimod d)
d
⌉)
represents the cache misses caused by the mis-prediction of software prefetching.
To summarize, L1 misses for Goto’s DGEMV can be modeled by Equations (3.5)
and (3.12).
Evaluation. To verify the models for DGEMV, we conducted a set of experiments,
in which we focus on small problems that completely fit into the L2 cache.
Figure 3.15 demonstrates the expected and measured L1 misses for DGEMV from
the reference BLAS library. Figures 3.15(a) and 3.15(b) illustrate the results of mod-
eling for two different leading dimensions LDA = 128 and 256, respectively. The model
for DGEMV produces very accurate results on Barcelona. In most cases, the deviation
from the measurements is below 2%.
We also studied DGEMV from the highly optimized GotoBLAS2 library: the re-
sults of modeling L1 misses for DGEMV are shown on Figure 3.16. The behavior of
Goto’s DGEMV revealed that software prefetching has a significant impact on both
achieving the optimal performance and modeling cache misses. As we observed, soft-
ware prefetching leads to the mis-prediction of up to four cache lines on loading each
column of A when LDA − p > d. We also account these extra cache misses in the
model. In summary, the results of modeling are satisfactory with the deviation from
the measurements less than 4% for all problem sizes.
Figures 3.16(e) and 3.16(f) present the modeled and measured L1 misses for Goto’s
41
Chapter 3 Execution-Less Performance Modeling
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  20  40  60  80  100 120 140 160 180 200
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) LDA = 256
 0
 5000
 10000
 15000
 20000
 25000
 30000
 35000
 0  100  200  300  400  500
 0
 2
 4
 6
 8
 10
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(d) LDA = 512
 0
 100
 200
 300
 400
 500
 600
 700
 800
 900
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p
Measured
Modeled
Deviation
(e) q = 48;LDA = 128
 0
 200
 400
 600
 800
 1000
 1200
 1400
 1600
 1800
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p
Measured
Modeled
Deviation
(f) q = 48;LDA = 256
Figure 3.16 Predicted and measured L1 misses for DGEMV from the GotoBLAS2 library
on Barcelona.
DGEMV used as a building block of DTRSV. In this case q is fixed (q = 48) and only
p varies. Because of this, the L1 misses demonstrate piecewise linear behavior. The
deviation, except for very small problems, is within 3%.
42
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 192
 0
 5000
 10000
 15000
 20000
 25000
 30000
 35000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 512
Figure 3.17 Predicted and measured L1 misses for DGEMV from the reference BLAS library
on Penryn.
Penryn
In contrast to Barcelona, the Penryn processor has a powerful prefetching mechanism
supporting data prefetching from the L2 cache. The prefetching unit detects the data
access pattern and loads the next cache line into the L1 cache. Additionally, data
prefetching can be initialized and controlled by software, separately from hardware.
As it was shown for DGER, hardware prefetching on Penryn causes one extra cache
miss per column of the matrix A whenm−p > d and three additional cache misses after
loading the matrix and two vectors. Thus, the L1 misses for the reference DGEMV
can also be modeled by Equations (3.9) and (3.10).
With the implementation of DGEMV from the GotoBLAS2 library the situation is
slightly different: not only hardware, but also software prefetching are involved. As a
consequence, up to eight extra cache lines are loaded per column of A when m−p > d;
one of those is due to the mis-prediction of hardware prefetching, while the other
seven are due to software prefetching. By taking into account these extra L1 misses,
Equation (3.10) becomes
ζ
′ =
⎧⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎩
⌊mq
d
⌋, if m − p < d
⌈p
d
⌉ + n−1∑
i=1 (⌈p + (mimod d)d ⌉ + η(i)), otherwise,
(3.13)
where
η(i) =min (d, ⌊m + (mimod d)
d
⌋ − ⌈p + (mimod d)
d
⌉)
represents the cache misses caused by the mis-prediction of hardware and software
prefetching. Therefore, the L1 misses for the Goto’s DGEMV can be modeled by
Equations (3.9) and (3.13).
43
Chapter 3 Execution-Less Performance Modeling
Evaluation. Figure 3.17 shows the expected and measured L1 misses for DGEMV
from the reference BLAS library. Figures 3.17(a) and 3.17(b) represent modeling L1
misses for LDA = 192 and 512, respectively. In general, the results are accurate; for
LDA = 512 the deviation from the measurements is within 2%. The only exception is
the high deviation for very small problems. However, when the problem size increases
the deviation decreases considerably.
We examine DGEMV from the GotoBLAS2 library as well: Figure 3.18 demon-
strates the results of modeling L1 misses for square matrices and different LDA. As
we observed, the mis-prediction of software prefetching increases the number of cache
misses per column of the matrix A significantly. The side effect of software prefetching
can be seen clearly in the results for problems in the interval p ∈ [LDA − 7d;LDA]:
when the problem size increases, the number of cache misses grows linearly. This is
because of the smaller distance between LDA and p that corresponds to the lower
number of extra cache misses. The results of modeling show spikes in the deviation
for very small problems, however for the rest, the deviation is within 4%.
The scenario in which DGEMV is a building block of DTRSV is considered too, see
Figures 3.18(e) and 3.18(f). According to this scenario q is fixed (q = 256) and only
p varies. Therefore, the expected and measured cache misses show a piecewise linear
behavior. The results of modeling are highly accurate – the deviation is mostly less
than 2%.
3.4.3 DTRSV
Modeling for DTRSV is the natural next step in the prediction of the number of L1
misses for solving a linear system of equations. Our interest is in modeling DTRSV
as
• Forward substitution, where A is a unit lower triangular matrix;
• Backward substitution, where A is a non-unit upper triangular matrix.
Analytical models for these two variants would complete the modeling of L1 misses
for the solution of a linear system.
DTRSV [30] is the BLAS-2 routine that solves a system of equations
Ax = b, (3.14)
where x and b are two vectors of size n; A is an n × n unit lower or non-unit upper
triangular matrix. Thus, DTRSV operates on
n2 − n
2
+n or n2 + n
2
+n data, meaning
that, compared to DGER or DGEMV, only half of the elements is used. During the
computation the vector b is overwritten by the vector x. Therefore, in order to capture
the specific features of DTRSV, the basic formula for modeling L1 misses of DGER,
Equation (3.4), can be written as
L1 misses = n∑
i=1 ⌈ id⌉ + ⌈nd ⌉, (3.15)
44
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
 25
 30
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
 25
 30
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
 25
 30
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) LDA = 256
 0
 5000
 10000
 15000
 20000
 25000
 30000
 35000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
 25
 30
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(d) LDA = 512
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p
Measured
Modeled
Deviation
(e) q = 256;LDA = 256
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p
Measured
Modeled
Deviation
(f) q = 256;LDA = 512
Figure 3.18 Predicted and measured L1 misses for DGEMV from the GotoBLAS2 library
on Penryn.
where ⌈n
d
⌉ is the number of cache misses when reading the vectors b and x; ⌈ i
d
⌉ are the
L1 misses when fetching the i-th column of the upper triangular or the (n−i+1)-th of
the lower triangular matrix A. The L1 misses when reading b and x are counted only
once since these vectors refer to the same location in memory. Equation (3.15) differs
from the simple formula for DGER due to the distinct structures of these algorithms;
45
Chapter 3 Execution-Less Performance Modeling
DTRSV
DTRSV
DTRSV
DTRSV
DGEMV
DGEMV
DGEMV
b
Figure 3.19 Partitioning of a lower triangular matrix A used in DTRSV from the Goto-
BLAS2 library.
especially, the difference can be seen in modeling cache misses that occur on loading
the matrix A. In the more general scenario when DTRSV is a building block of
another algorithm, Equations (3.5) and (3.6) can be modified to
L1 misses = ζ + ⌈p
d
⌉, (3.16)
where p × p is the size of the matrix A and LDA = n. The quantity
ζ = 1∑
i=p ⌈ i + (n(p − i) mod d)d ⌉ (3.17)
represented the cache misses committed when fetching A from the L2 cache. Equa-
tion (3.17) considers the case when A is a unit lower triangular matrix. For a non-unit
upper triangular matrix,
n
d
cache misses need to be added in order to take into account
the diagonal elements.
By working on DTRSV from the GotoBLAS2 library, we observed a difference in
cache misses for various combinations of two parameters: uplo and diag; uplo in-
dicates whether A is an upper or a lower triangular matrix and diag whether it is
unit or non-unit triangular. Using the Callgrind tool [25] from the Valgrind frame-
work [26, 70], we discovered that for each of these combinations a separate private
routine is called. For instance, the routine DTRSV_NLU finds the solution of a linear
system with a non-transpose (N) unit (U) lower (L) triangular matrix. Moreover,
the performance experiments using the Callgrind tool on DTRSV revealed that the
matrix A, which is for example unit lower triangular, is partitioned into smaller blocks
of size b, see Figure 3.19. This particular partitioning, which is one out of two pos-
46
3.4 Modeling Memory-Stalls
sibilities for this situation, was also confirmed by the measured cache misses. After
partitioning the matrix A, each of the triangular subsystems is solved by DTRSV and
then DGEMV is applied to the remaining part of each panels. Accordingly, modeling
L1 misses for DTRSV from the GotoBLAS2 library can be done by
L1 misses =n
b
model_trsv(b) + model_trsv(nmod b)+
n/b∑
i=1 model_gemv(n − ib, b), (3.18)
where model_gemv() refers to the model in Equations (3.5) and (3.6) and model_trsv()
to the one in Equations (3.16) and (3.17).
Barcelona
For modeling L1 misses of DTRSV from the reference BLAS library, Equations (3.16)
and (3.17) can be used without any changes. For Goto’s implementation of DTRSV
the model in Equation (3.18) stays the same, but model_gemv() refers to the model
in Equations (3.5) and (3.12); the block size is b = 48.
Evaluation. We carried out a set of experiments in order to verify the models for
DTRSV. Figure 3.20 demonstrates the results of predicting L1 misses for DTRSV from
the reference BLAS library. In the experiments with a unit lower triangular matrix,
see Figures 3.20(a) and 3.20(b), the results are extremely accurate – the deviation is
within 1%. The other two figures, Figures 3.20(c) and 3.20(d), correspond to modeling
L1 misses for solving a linear system with a non-unit upper triangular matrix; NUN
stands for a non-transpose (N) upper (U) non-unit (N) triangular matrix. In this case,
the spike and small oscillation in the results, especially for very small problems, are
due to the reverse order access of the matrix A (the matrix is read from the bottom-
right to the top-left corner). This data access pattern influences the measurements
too, causing the oscillation in the cache misses. Nevertheless, for most problems the
deviation is within 3%.
Figure 3.21 presents the expected and measured cache misses for forward and back-
ward substitutions using DTRSV from the GotoBLAS2 library. For both cases the
results are very accurate with nicely decreasing deviation. Despite the spike and slight
oscillation for small problems, the deviation is within 2%.
Penryn
While modeling L1 misses for DTRSV on Penryn we observed that only hardware
prefetching leads to mis-prediction and, therefore, causes additional L1 misses. By
including these cache misses into the model, Equations (3.16) and (3.17) for DTRSV
from the reference BLAS library can be rewritten as
L1 misses = ζ ′ + ⌈p
d
⌉ + 2, (3.19)
47
Chapter 3 Execution-Less Performance Modeling
 0
 200
 400
 600
 800
 1000
 1200
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) NLU; LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) NLU; LDA = 256
 0
 500
 1000
 1500
 2000
 2500
 3000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) NUN; LDA = 192
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(d) NUN; LDA = 256
Figure 3.20 Predicted and measured L1 misses for DTRSV from the reference BLAS library
on Barcelona.
where
ζ
′ = 1∑
i=p (⌈ i + (n(p − i) mod d)d ⌉ + η(i)). (3.20)
Equation (3.19) includes two extra cache misses that correspond to the prefetched
cache line after loading the matrix and the vector. In Equation (3.20), the extra
cache misses caused by the mis-prediction are counted by
η(i) = ⎧⎪⎪⎪⎨⎪⎪⎪⎩
1, if ⌊n + (nimod d)
d
⌋ − ⌈p + (nimod d)
d
⌉ > 0,
0, otherwise.
Prediction of L1 misses for DTRSV from the GotoBLAS2 library stays the same as
it was initially constructed – Equation (3.18) with the block size b = 256. However,
model_gemv() refers to the model in ?? and eq. (3.13) and model_trsv() to the model
in Equations (3.19) and (3.20).
48
3.4 Modeling Memory-Stalls
 0
 500
 1000
 1500
 2000
 2500
 3000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) NLU; LDA = 192
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 20000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) NLU; LDA = 512
 0
 1000
 2000
 3000
 4000
 5000
 6000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) NUN; LDA = 256
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 20000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(d) NUN; LDA = 512
Figure 3.21 Predicted and measured L1 misses for DTRSV from the GotoBLAS2 library on
Barcelona.
Evaluation. Figure 3.22 shows the modeled and measured L1 misses of DTRSV from
the reference BLAS library: Figures 3.22(a) and 3.22(b) illustrate the modeling for
backward substitution while Figures 3.22(c) and 3.22(d) correspond to forward sub-
stitution. Except for very small problems, the results of modeling are quite accurate.
When the problem size increases the deviation decreases sharply.
Figure 3.23 presents the results of predicting cache misses of DTRSV for a unit
lower triangular matrix and various LDA. The behavior of L1 misses for DTRSV
varies depending on the problem size and the partitioning of the matrix in Goto’s
implementation (Figure 3.19). When the routine DTRSV_NLU is called for matrices of
size p ≤ 256, the cache misses show parabolic behavior. At the interval p ∈ [256,256+
7d] the behavior changes to linear, but then it is parabolic again. The linear behavior
is because of the mis-prediction of software prefetching in DGEMV, which is also
involved as a building block of TRSV. The piecewise polynomial behavior of DTRSV
can be noticed on Figure 3.23(d). In summary, apart from the peak for very small
problems, the deviation is below 4% and, when the problem size grows, it decreases
significantly.
Modeling L1 misses for backward substitution with a non-unit upper triangular
49
Chapter 3 Execution-Less Performance Modeling
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) NLU; LDA = 256
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) NLU; LDA = 512
 0
 500
 1000
 1500
 2000
 2500
 3000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) NUN; LDA = 192
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 5000
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(d) NUN; LDA = 256
Figure 3.22 Predicted and measured L1 misses for DTRSV from the reference BLAS library
on Penryn.
matrix is illustrated in Figure 3.24. Since the diagonal elements of the matrix A are
also used, one would expect to see more cache misses for backward substitution than
for forward one. However, the opposite is true. This confusion is due to different im-
plementations of the called routines and the optimization applied within each routine.
The linear behavior of cache misses for sizes p ∈ [256,256 + 7d], see Figure 3.24(d),
is due to the side effect of software prefetching, which is inherited from DGEMV. In
general, the results are satisfactory with the slightly higher deviation for very small
problems. However for the rest, the deviation is mostly less than 4%; for p ≥ 256 it is
even below 1%.
Analytical models for cache misses can also be applied as performance indicators
for new automatically generated algorithms [5, 32, 31] that are built on top of the
BLAS or LAPACK routines.
The next section presents evidence that the accurate model for cache misses leads
to accurate prediction for the execution time.
50
3.5 Assembling the Model
 0
 200
 400
 600
 800
 1000
 1200
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) LDA = 256
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 20000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(d) LDA = 512
Figure 3.23 Predicted and measured L1 misses for DTRSV from the GotoBLAS2 library on
Penryn; A is a unit lower triangular matrix.
3.5 Assembling the Model
“Even in the relatively simple context of the BLAS,
there are aspects of program behavior
that may not be worth modeling in practice
even if they can be modeled in principle.”
Kamen Yotov et al. [96]
In the previous section we have presented the results of modeling L1 misses for the
BLAS-2 routines and the unblocked LU factorization. Here, we show how the modeled
cache misses are used to model the execution time of these routines. We also consider
the scenario when the operands reside in the L2 cache. According to this scenario, α2
(and α3) and β equal to zero in Equation (3.3). Therefore, the performance model
51
Chapter 3 Execution-Less Performance Modeling
 0
 200
 400
 600
 800
 1000
 1200
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
 0
 500
 1000
 1500
 2000
 2500
 3000
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
 0
 500
 1000
 1500
 2000
 2500
 3000
 3500
 4000
 4500
 0  50  100  150  200  250
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(c) LDA = 256
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 20000
 0  100  200  300  400  500
 0
 5
 10
 15
 20
L
1
 m
is
se
s
D
ev
ia
ti
o
n
 [
%
]
p=q
Measured
Modeled
Deviation
(d) LDA = 512
Figure 3.24 Predicted and measured L1 misses for DTRSV from the GotoBLAS2 library on
Penryn; A is a non-unit upper triangular matrix.
can be written as
Execution time = α1 × L1 misses × L1 miss time+
α01 × L1 hits × L1 hit time+ (3.21)
γ × Flops × Flop time.
L1 hits impacts the execution time mostly when the routines are compiled without
optimization. Otherwise, the impact of L1 hits is negligible and, therefore, α01 is set
to zero.
Our aim is to predict the execution time spent on solving a linear system of equa-
tions. Following the same scheme as for modeling memory-stalls, we predict the
execution time needed to find the solution of linear systems through modeling the
execution time of the unblocked LU factorization and DGER, and DTRSV. Those are
refined accordingly into smaller units such as calculating the number of Flops (or just
Flops); modeling L1 misses; estimating the L1 hit time, the L1 miss time, and the
Flop time.
The number of floating point operations performed by the dense linear algebra
52
3.5 Assembling the Model
algorithms can be calculated a priori. For the studied algorithms the number of Flops
is the following
• DGER: 2mn + n;
• Unblocked LU factorization, Algorithm 2.4: mn2 − n3
3
+ m2
2
−mn + m
2
− n
6
;
• DGEMV: 2mn +m + n;
• DTRSV: n2 for non-unit and n2−3n+2 for unit triangular matrices, respectively.
In order to complete the model (Equation (3.21)) the remaining five parameters
need to be determined: α1, α01, γ, the L1 miss time, and the Flop time. We first
explain the approach to measure the time spent on one L1 hit and miss, and one
Flop. Then, we describe the selection of the balancing weights α1 and γ. Since
all four parameters are architecture dependent, we determine them on two different
architectures, namely Barcelona and Penryn.
3.5.1 L1 Miss Time
The time spent on one L1 miss varies from architecture to architecture and often it is
either not provided by processor manufactures or hard to estimate. Hence, we derive
the L1 miss time from experiments by following the next procedure
1. Create a general matrix A of size n × n, where n is in the interval [16,200];
2. Flush the L1 cache to ensure that the matrix resides only in the L2 cache;
3. Measure the time spent on reading A from the L2 cache by accessing it with a
stride h; the stride is equal to the number of double precision floating point val-
ues in both the read and the prefetched cache lines. Accordingly, each memory
access causes only one L1 miss;
4. The L1 time is calculated by
L1 miss time = Access time
Accesses
. (3.22)
Since the time spent on one L1 miss is equivalent to the time spent on one L2 cache
hit, we use these quantities interchangeably.
Figure 3.25 presents the results of measuring the time of one L1 miss when reading
the elements of the n×n matrix A from the L2 cache with a stride h on both Barcelona
and Penryn. Since Barcelona does not have prefetching unit for loading data into the
L1 cache, the access stride is h = 8. Unlike Barcelona, Penryn incorporates a prefetcher
that loads the current as well as the next cache line into the L1 cache, meaning that
h = 16. Figures 3.25(a) and 3.25(b) emphasize that the increase in the matrix size n
leads to more stable results. Thus, the L1 miss time equals 15 clock cycles on both
Barcelona and Penryn.
53
Chapter 3 Execution-Less Performance Modeling
 10
 15
 20
 25
 30
 0  20  40  60  80  100  120  140  160  180  200
T
im
e[
cy
cl
es
]
n
L1 miss time
(a) Penryn
 10
 15
 20
 25
 30
 0  20  40  60  80  100  120  140  160  180  200
T
im
e 
[c
y
cl
es
]
n
L1 miss time
(b) Barcelona
Figure 3.25 The time spent serving one L1 miss.
As an alternative to the above-mentioned approach, one can use open source tools
like LMbench [64], which is a suite of utilities for measuring cache and memory band-
width and latency. We use this tool in order to ensure the accuracy of the measured
parameters since the official information concerning them is rather missing or is diffi-
cult to find.
With the lat_mem_rd utility [63] from LMbench, it is possible to accurately measure
the latency of all caches and main memory [79] within one execution of the utility.
So, this provides the whole picture of the memory system instead of a narrow focus
on one of its levels. Since we concern about the L1 hit and miss times, we measure
them as the L1 and L2 cache latencies, accordingly. More detailed information on the
lat_mem_rd utility and the experiments can be found in Appendix A.
Figure 3.26 illustrates the memory latency when an array is accessed with a stride
h = 8 (size of one cache line) on Barcelona. The latency of different levels in the
memory hierarchy can be seen on Figure 3.26(b). While Figure 3.26(a) focuses on the
latency of L1, L2, and L3 caches. For better understanding, we convert the measured
latency from nanoseconds (ns) into clock cycles. As a result, the L1 cache latency is
3 clock cycles and the L2 latency is 16. Therefore, the L1 hit and miss time equal 3
and 16 clock cycles, correspondingly.
Figure 3.27 demonstrates the latency for different levels of the memory hierarchy
on Penryn. A stride h = 16 (size of two cache lines) is chosen to ensure that the
prefetching of data does not interfere the measurements. Figure 3.27(b) shows the
latency of all three levels in the memory hierarchy, while Figure 3.27(a) provides a
detailed view on the cache latencies. These figures reveal the following information:
• The L1 cache latency is 3 clock cycles;
• The L2 cache latency is 15 clock cycles;
• The latency of the main memory is an order of magnitude higher than the
latency of the L2 cache.
54
3.5 Assembling the Model
 0
 10
 20
 30
 40
 50
2
0
2
2
2
4
2
6
2
8
2
10
T
im
e[
cy
cl
es
]
Memory size[Kb]
L1 cache
L2 cache
L3 cache
(a) L1, L2, and L3 caches
 0
 50
 100
 150
 200
 250
 300
2
-10
2
-8
2
-6
2
-4
2
-2
2
0
2
2
2
4
2
6
2
8
T
im
e[
cy
cl
es
]
Memory size[Mb]
L1 cache
L2 cache
L3
Main memory
(b) All memory levels
Figure 3.26 Latency of different levels in the memory hierarchy on Barcelona.
We presented two approaches for measuring the L1 miss time (the L2 hit time):
one is based on the use of the lat_mem_rd utility, while the other is our own imple-
mentation. Our approach supports the measurements of the L2 hit time only, thus
it needs to be adapted for the other memory levels. However, at the current stage
of modeling the execution time, when the operands reside in the L2 cache, we would
rely on both approaches in order to ensure the accuracy of the measurements.
3.5.2 Flop Time
We use an empirical approach in order to find the time spent on one floating point
operation. The main idea of the approach is to fill the processor’s pipeline completely
and, therefore, maximize the number of Flops performed per clock cycle. To reach this,
55
Chapter 3 Execution-Less Performance Modeling
 0
 5
 10
 15
 20
 25
 30
2
0
2
2
2
4
2
6
2
8
2
10
T
im
e[
cy
cl
es
]
Memory size[Kb]
L1 cache
L2 cache
(a) L1 and L2 caches
 0
 50
 100
 150
 200
2
-10
2
-8
2
-6
2
-4
2
-2
2
0
2
2
2
4
2
6
2
8
T
im
e[
cy
cl
es
]
Memory size[Mb]
L1 cache
L2 cache
Main memory
(b) All memory levels
Figure 3.27 Latency of different levels in the memory hierarchy on Penryn.
the data needs to be located as close to the CPU (in the registers) as possible to reduce
the load time and to keep the ALUs busy. The results of measuring the Flop time
on Barcelona are presented in Table 3.3. The code is compiled with different levels
of optimization (-O0, -O1, -O2, and -O3). The Flop time as well as the efficiency are
presented for each (of 19) test case in columns under the indicator of the optimization
level. The efficiency is used to demonstrate how good the ALUs are utilized. Since
the efficiency of DGEMM is roughly 90%, we would expect to attain even higher
efficiency for our test cases since they do not require loads from the memory. The
results with -O3 are not shown, because they are the same as the ones with -O2.
Our approach is based on addition5 and multiplication since these operations mostly
5Subtraction is equivalent to addition
56
3.5 Assembling the Model
occur in the studied subset of linear algebra algorithms. We grouped the operations
by duplicating and mixing them in test cases in order to fill the processor’s pipeline
and, therefore, increase the performance. As a result, we created 19 test cases, which
can be divided into 3 groups:
• Addition (test cases 1 − 6);
• Multiplication (test cases 7 − 12);
• Mixed addition and multiplication (test cases 13 − 19).
Each test case from Table 3.3 was executed in a loop with n = 100,000 repetitions.
Afterwards, we divided the measured time by n×Flops, where Flops is the number of
floating operations performed by a test case when it is called only once. For instance,
Flops = 8 for test case 16. To ensure the accuracy of the achieved timings, each test
(with n repetitions) was executed five times and the minimum time was taken. The
achieved performance (in GFlops) was calculated as a ratio between Flops and the
measured time. Then, the efficiency in percentage was computed as a ration between
the achieved and theoretical performance on one core.
All variables used in the operations are double precision. Except for β = K and
τ = 1.1, they equal to 1.0 + K, where K is 1e − 6 in order to prevent an overflow
in the computations. The overflow requires additional time that, therefore, leads to
incorrect results.
The results show that the efficiency of each test case is getting better with higher
optimization level enabled. For test cases 18 and 19 the efficiency is poor since it is
bounded by the dependencies between the operations. While compiling the code with
-O1 or -O2, the compiler is able to identify trivial operations such as multiplication by
one and, therefore, they are ignored. Because of this, the time and efficiency for test
case 7 are misleading. The main message of the experiments is in the -O2 column of
Table 3.3: Test cases 1 − 3, 8 − 10, and 13 − 16 show that the ALUs on Barcelona are
able to perform up to four additions and/or multiplications simultaneously through
filling the processor’s pipeline. However, this does not influence the Flop time, but
improves the efficiency. Adding more operations in a test case does not change the
results. The maximum gained efficiency is only 62% even though the data is kept in
the CPU registers during the measurements. As the results on Penryn carry the same
message, we do not show them.
Since the data is placed in the CPU registers, one would expect that the achieved
efficiency would be close to the 100% peak. However, the obtained efficiency us-
ing the above-mentioned approach is less than 2/3 of the peak. For this reason,
we apply another approach that is based on Streaming SIMD Extensions (SSE) in-
structions [12, 44]. Single-instructions-multiple-data (SIMD) instructions [74, 44] can
greatly increase performance when exactly the same operations are to be performed
on multiple data objects. Modern SSE contain more than 70 new instructions, most
of which work on double precision floating point data. Also, SSE added new registers
and introduced both scalar and packed floating point operations [12, 44]. Using the
SSE instructions, we reached 99+% of the peak performance on the Barcelona and
Penryn processors and issued up to four double precision floating point operations
57
C
hapter
3
E
xecution-Less
P
erform
ance
M
odeling
# Test case -O0 -O1 -O2
Cycles Efficiency[%] Cycles Efficiency[%] Cycles Efficiency[%]
1 α+ = τ 12.11482 2.56 12.12867 2.56 4.00063 7.76
2 α+ = τ ;β+ = τ ;γ+ = τ 14.00126 6.65 13.89021 6.71 4.00128 23.28
3 α+ = τ ;β+ = τ ;γ+ = τ ; ζ+ = τ 14.41263 8.62 14.48214 8.58 4.00098 31.04
4 α+ = τ ;β+ = τ ;γ+ = τ ; ζ+ = τ ;η+ = τ 13.10768 11.84 14.39455 10.78 5.00099 31.04
5 α+ = τ ;β+ = τ ;γ+ = τ ; ζ+ = τ ;η+ = τ ; θ+ = τ 13.11687 14.20 13.68510 13.61 6.00130 31.04
6 α+ = α + τ 17.67147 3.51 17.61483 3.53 11.33385 5.48
7 α∗ = 1.0 2.00125 15.51 0.00086 36101.48 ≈0 34497.00
8 α∗ = δ 12.00126 2.59 12.07544 2.57 4.00076 7.76
9 α∗ = δ;β∗ = δ;γ∗ = δ 14.12095 6.60 13.57253 6.86 4.00120 23.28
10 α∗ = δ;β∗ = δ;γ∗ = δ; ζ∗ = δ 14.85523 8.36 14.83327 8.37 4.00117 31.04
11 α∗ = δ;β∗ = δ;γ∗ = δ; ζ∗ = δ;η∗ = δ 14.00162 11.09 14.00097 11.09 5.00129 31.04
12 α∗ = δ;β∗ = δ;γ∗ = δ; ζ∗ = δ;η∗ = δ; θ∗ = δ 14.04933 13.26 14.71298 12.66 6.00098 31.04
13 α∗ = δ;β+ = τ 13.50219 4.60 12.00133 5.17 4.00124 15.52
14 α∗ = δ;β+ = τ ;γ∗ = δ; ζ+ = τ 13.46800 9.22 14.25825 8.71 4.00111 31.04
15 α∗ = δ;β+ = τ ;γ∗ = δ; ζ+ = τ ;η∗ = δ; θ+ = τ 13.97384 13.33 14.07524 13.23 4.00148 46.55
16 α∗ = δ;β+ = τ ;γ∗ = δ; ζ+ = τ ;η∗ = δ; θ+ = τ ; ξ∗ = τ ;σ+ = τ 15.77207 15.75 14.59915 17.01 4.00132 62.07
17 α∗ = δ;β+ = τ ;γ∗ = δ; ζ+ = τ ;η∗ = δ; θ+ = τ ; ξ∗ = δ;σ+ = τ ;χ∗ = τ ;ψ+ = τ 20.04116 15.49 14.13381 21.97 5.00134 62.08
18 α+ = α ∗ β 18.65953 3.33 17.50211 3.55 8.00126 7.76
19 γ+ = τ ;α+ = γ ∗ δ 14.35192 6.49 13.50219 6.90 4.33444 21.49
Table 3.3 Results of measuring the Flop time using different test cases and the optimization levels enabled on Barcelona.
58
3.6 Evaluation
per clock cycle. As a result, the Flop time is 0.33 and 0.25 cycles on Barcelona and
Penryn, accordingly. More information regarding the measurements of the Flop time
can be found in Appendix B.
3.5.3 Balancing Weights
To complete modeling the execution time we need to specify the balancing weights
α01, α1, and γ, which represent the overlap between L1 hits, L1 misses, and the CPU
active execution, respectively. The balancing weights show also the impact of each of
these factors on the execution time of the algorithm. Since the value of α01, α1, and
γ depend on the algorithm type, architecture, and compiler optimizations, they vary
from scenario to scenario. Accordingly, we define them through the empirical exper-
iments and we present their values in the following section within the performance
discussion. Preliminarily, for memory-bound algorithms, for which the time spent on
memory-stalls is the major factor in the computation time, γ varies and is close to
zero, while α1 is one. The situation is the opposite for CPU-bound algorithms.
3.6 Evaluation
In this section, we verify the accuracy of the model represented in Equation (3.21) by
comparing the modeled time with the time measured in practice. The performance
experiments were conducted using one core on both the Barcelona and Penryn proces-
sors. The detailed specification of these architectures can be found in Subsection 2.2.1.
The deviation from the measurements was calculated by Equation (2.4).
The model (Equation (3.21)) consists of several independent components; each of
those requiring accurate prediction in order to ensure the accuracy of the model.
Since modeling the execution time of an algorithm is a very complicated task, we
begin with modeling the time spent on reading and updating a matrix. Then, by
taking arithmetic operations into account, we model the execution time of DGER
and the unblocked LU factorization.
3.6.1 Reading and Writing a Matrix
Listing 3.1 Measuring the time spent on reading and writing a matrix.
1 for (i = 0; i < n_repeat; i++) {
2 // start timing
3 rdtscll(ticks_start);
4
5 for (j = 0; j < n; j++)
6 for (k = 0; k < n; k++) {
7 alpha = A[j * lda + k];
8 A[j * lda + k] = beta;
9 }
10
11 // stop timing
12 rdtscll(ticks_end);
13 vtime[i] = ticks_end - ticks_start;
14 // to prevent compiler optimization
15 beta = alpha;
16 }
59
Chapter 3 Execution-Less Performance Modeling
Listing 3.1 presents the C implementation for measuring the execution time spent on
reading (alpha = A[j * lda + k]) and writing (A[j * lda + k] = beta) a matrix.
In order to avoid the influence of compiler optimization and to have a clearer view on
where the time is spent, we first compile the code with -O0. We also analyze the case
with the optimization level two enabled.
Let us first consider the case without optimization (-O0). We refine the model
(Equation (3.21)), which expects that compiler optimization is applied, for the non-
optimized case. As no Flops are performed, γ is set to zero and α1 to one in Equa-
tion (3.21). Since L1 hits on reading the matrix A also have a strong impact on the
execution time, α01 is set to one. Using the Cachegrind tool [24] from the Valgrind
framework we discovered that not only L1 read hits, but also L1 hits on writing the
matrix are important. Therefore, the latter is included in the model too. The re-
maining fraction of the total time (up to 33%) is spent on the double loop. At each
iteration of the loop the following operations are performed: the comparisons j < n
and k < n; and the increment of the loop counters j + + and k + +. Penalties can
be also paid for reading j, k, and n even though they should be kept in the CPU
registers. The study of the time spent on the double loop is by itself independent6
and requires deep knowledge of the architecture, assembly language, and compiler.
We do not attempt to refine the model in this direction since the time of the double
loop only plays a role in the non-optimized case. Therefore, we measure only once the
time spent on one iteration of the loop and then use it in all experiments. Finally, by
putting everything together, the model for reading and writing a matrix is
Execution time = L1 misses × L1 miss time+
L1 read hits × L1 read hit time+ (3.23)
L1 write hits × L1 write hit time+
Loop iterations × Loop iteration time.
Since only the first access to a cache line is counted as a miss, the other d−1 accesses
are hits. Accordingly, the L1 read hits equal m(d − 1). After a cache line is accessed
it is kept in the L1 cache, therefore each of mn writes – are performed immediately
after reads – is counted as an L1 write hit. Thus, the L1 write hits equal mn. As the
experiments confirmed, the L1 write hit time is equivalent to the L1 read hit time.
We denote the latter as the L1 hit time.
After the comprehensive study on reading and writing a matrix for the non-optimized
case, we proceed with modeling its time when the code is compiled with the optimiza-
tion level two enabled. In this case, the time of the double loop becomes negligible.
The L1 write hits are hidden (performed in the background) and do not influence
the execution time. The L1 read hits overlap with the L1 misses. So, the balancing
weight α01 is chosen during the evaluation since its value depends on the algorithm,
the problem size, and the architecture. The cost of L1 misses remains the same, so
α1 = 1. In order to model the time spent on reading and writing a matrix for the
6The double loop time does not depend on the operation performed.
60
3.6 Evaluation
0.0e+00
1.0e+05
2.0e+05
3.0e+05
4.0e+05
5.0e+05
6.0e+05
7.0e+05
8.0e+05
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 192
0.0e+00
2.0e+05
4.0e+05
6.0e+05
8.0e+05
1.0e+06
1.2e+06
1.4e+06
 0  50  100  150  200  250
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.28 Predicted and measured execution time of reading and writing a matrix on
Barcelona; the code was compiled without optimization.
optimized case, Equation (3.21) is modified as
Execution time = L1 misses × L1 miss time+
α01 × L1 read hits × L1 read hit time. (3.24)
In the following paragraphs, we show how the models (Equations (3.23) and (3.24))
are tailored for the Barcelona and Penryn processors.
Barcelona
The time spent on one iteration in the double loop was determined by measurements,
being approximately eight clock cycles. We observed that five additional clock cycles
occur per loop iteration: four cycles are due to the computation of the index j ∗
lda + k and one is caused by the memory address calculation. By including this
into Equation (3.23), we successfully modeled the time spent on reading and writing
a matrix, see Figure 3.28. The deviation between the modeled and measured time is
mainly below 4%.
When compiler optimization is applied, the behavior of the execution time becomes
more difficult to predict. As shown in Subsection 2.2.3, the execution time of DGER
has a piecewise-parabolic behavior. We have also observed the piecewise-parabolic
behavior of the execution time spent on reading and writing a matrix. For problems
up to 90 × 90 (fit in the L1 cache) L1 read hits are as important as L1 read misses
and, therefore, α01 = 1. For bigger problem sizes, an overlap between L1 read hits
occurs; thus, α01 is empirically chosen as 0.88. Figure 3.29 illustrates the results of
modeling the time spent on reading and writing a matrix when the code is compiled
with -O2. Except for very small problems for which the error is high, the deviation is
mainly less than 4% and for LDA = 192 it is lower than 2%.
61
Chapter 3 Execution-Less Performance Modeling
0.0e+00
1.0e+04
2.0e+04
3.0e+04
4.0e+04
5.0e+04
6.0e+04
7.0e+04
8.0e+04
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
0.0e+00
2.0e+04
4.0e+04
6.0e+04
8.0e+04
1.0e+05
1.2e+05
1.4e+05
1.6e+05
1.8e+05
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
Figure 3.29 Predicted and measured execution time of reading and writing a matrix on
Barcelona; the code was compiled with the optimization level two enabled.
0.0e+00
5.0e+04
1.0e+05
1.5e+05
2.0e+05
2.5e+05
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
0.0e+00
1.0e+05
2.0e+05
3.0e+05
4.0e+05
5.0e+05
6.0e+05
7.0e+05
8.0e+05
9.0e+05
1.0e+06
 0  50  100  150  200  250
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.30 Predicted and measured execution time of reading and writing a matrix on
Penryn; the code was compiled without optimization.
Penryn
For this type of processor, we also model the time spent on reading and writing a
matrix with and without compiler optimization. We consider first the non-optimized
case, for which we used the same model (Equation (3.23)) as on Barcelona. The L1
hits on read and write as well as the L1 misses are fully counted in Equation (3.23),
meaning that the appropriate balancing weights are one. The time spent on one
iteration of the double loop is empirically determined and it approximately equals
four clock cycles. The time of one iteration on Penryn is half of the one on Barcelona.
Three additional clock cycles occur at each loop iteration on Penryn too: two cycles
are due to the computation of the matrix index and one is caused by the memory
address calculation. These extra clock cycles are also included in Equation (3.23).
Figure 3.30 reports the accuracy of the model for reading and writing a matrix when
the code is compiled with -O0. The deviation, except for few problem sizes, is below
62
3.6 Evaluation
0.0e+00
2.0e+04
4.0e+04
6.0e+04
8.0e+04
1.0e+05
1.2e+05
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 192
0.0e+00
2.0e+04
4.0e+04
6.0e+04
8.0e+04
1.0e+05
1.2e+05
1.4e+05
1.6e+05
1.8e+05
2.0e+05
 0  50  100  150  200  250
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.31 Predicted and measured execution time of reading and writing a matrix on
Penryn; the code was compiled the optimization level two enabled.
2%.
With the optimization level two enabled the cost for the double loop as well as
the L1 write hits becomes negligible due to compiler optimization. On Penryn, we
also observe the piecewise-parabolic behavior of the execution time spent on reading
and writing a matrix. In order to balance the overlap between the L1 read hits, the
balancing weight α01 is empirically selected as 0.28 for problem sizes up to 64×64 (fit
in the L1 cache); for bigger problems, α01 = 0.38. The impact of the L1 misses on the
total time is even stronger than with -O0 since α1 equals one too.
Figure 3.31 shows the results of modeling the time spent on reading and writing a
matrix when compiler optimization is applied. For both LDA = 192 and 256, when
the problem size increases, the deviation decreases smoothly. Except peaks for very
small problems, the deviation is lower than 2%.
3.6.2 DGER and LU Factorization
In the previous subsection we modeled the time spent on reading and writing a matrix.
Here, we expand this preliminary study and apply it to model the execution time of
DGER and the unblocked LU factorization.
We first consider modeling the execution time of DGER from the reference BLAS
library compiled with -O0 and then analyze the results achieved with -O2. In the
non-optimized case, all balancing weights equal to one, meaning that neither the CPU
active cycles nor both L1 hits and misses overlap. In order to model the execution
time of DGER, Equation (3.23) is modified by including the time spent on reading
63
Chapter 3 Execution-Less Performance Modeling
0.0e+00
5.0e+04
1.0e+05
1.5e+05
2.0e+05
2.5e+05
3.0e+05
3.5e+05
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
0.0e+00
1.0e+05
2.0e+05
3.0e+05
4.0e+05
5.0e+05
6.0e+05
7.0e+05
8.0e+05
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
Figure 3.32 Predicted and measured execution time of DGER from the reference BLAS
library on Barcelona; the library was compiled without optimization.
two vectors (hidden in the L1 hits and misses) and performing Flops
Execution time = L1 misses × L1 miss time+
L1 read hits × L1 read hit time+
L1 write hits × L1 write hit time+ (3.25)
Flops × Flop time+
Loop iterations × Loop iteration time.
When compiler optimization (-O2) is applied the time spent on the double loop
and the L1 write hits becomes negligible. Accordingly, the DGER time is modeled
by Equation (3.21).
Barcelona
In the non-optimized case, on average the Flop time equals two clock cycles. By
taking this into account, we validate the model (Equation (3.25)) by comparing it
with the measured time of GER from the reference BLAS library, see Figure 3.32.
For most problem sizes, the deviation is within 4%.
In the optimized (-O2) case, the balancing weight α01 is chosen as 0.47 due to
the overlap between the L1 read hits and L1 misses, while α1 = 1 stays the same.
On average the time of one Flop is 0.33 clock cycles, as shown in Subsection 3.5.2.
Figure 3.33 presents the results of modeling DGER from the reference BLAS library
when the library is compiled with the optimization level two enabled. The DGER time
is up to five times smaller compared to the one without optimization. The deviation
as well as the measured time suddenly increase when p > 168 due to the appearance
of the L2 misses, which are not yet included in the model. Except for very small
problems, the deviation is below 3%.
We also model the execution time of DGER from the GotoBLAS2 library compiled
with the optimization level two enabled. While working on Goto’s DGER, we discover
64
3.6 Evaluation
0.0e+00
1.0e+04
2.0e+04
3.0e+04
4.0e+04
5.0e+04
6.0e+04
7.0e+04
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
0.0e+00
2.0e+04
4.0e+04
6.0e+04
8.0e+04
1.0e+05
1.2e+05
1.4e+05
1.6e+05
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
Figure 3.33 Predicted and measured execution time of DGER from the reference BLAS
library on Barcelona; the library was compiled with the optimization level two
enabled.
0.0e+00
5.0e+03
1.0e+04
1.5e+04
2.0e+04
2.5e+04
3.0e+04
3.5e+04
4.0e+04
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
0.0e+00
1.0e+04
2.0e+04
3.0e+04
4.0e+04
5.0e+04
6.0e+04
7.0e+04
8.0e+04
9.0e+04
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
Figure 3.34 Predicted and measured execution time of DGER from the GotoBLAS2 library
on Barcelona.
that the measured time may oscillate from one execution to another. For instance, for
n = 128 and LDA = 128 one measurement can report 37.5 k clock cycles while another
yields 40 k clock cycles; the deviation in measurements can reach 7%. For this reason,
in our experiments we aim at modeling the median time. So, in the evaluation the
measured time may vary in both directions from the modeled one.
Figure 3.34 shows the accuracy of the model for DGER from the GotoBLAS2 library.
The estimated time demonstrates the step-wise behavior due to the behavior of the L1
misses that are the major factor in the DGER time. Since the cache misses strongly
influence the DGER time, they are entirely (α1 = 1) counted in the model. We take
into account L1 hits (α01 = 0.3) as well when modeling the time of DGER for very
small problem sizes. For problem sizes that occupy nearly half of the L1 cache, Flops
also have a considerable impact (γ = 1) on the execution time. For bigger problems,
65
Chapter 3 Execution-Less Performance Modeling
0.0e+00
1.0e+05
2.0e+05
3.0e+05
4.0e+05
5.0e+05
6.0e+05
7.0e+05
8.0e+05
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 192
0.0e+00
1.0e+05
2.0e+05
3.0e+05
4.0e+05
5.0e+05
6.0e+05
7.0e+05
8.0e+05
9.0e+05
1.0e+06
 0  50  100  150  200  250
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.35 Predicted and measured execution time of DGER from the reference BLAS
library on Penryn; the library was compiled without optimization.
we observed that γ depends not only on the algorithm and architecture, but also on
the leading dimension of the matrix. Accordingly, γ = 0.58 for LDA = 128 and γ = 0.35
for LDA = 192. In general, the models are accurate with the deviation mainly less
than 3%.
Penryn
We also consider two cases of modeling the execution time of DGER: without and
with optimization. In the first case, the model in Equation (3.25) is used; on average
the time spent performing one Flop on Penryn equals one clock cycle. Figure 3.35(a)
presents the modeled and measured time of DGER for LDA = 192. The deviation
from the measurements is mostly less than 3%. For LDA = 256 (Figure 3.35(b)) the
measured time varies a bit, thus the deviation is slightly higher than for LDA = 192.
In general, the deviation is below 4%.
When the reference BLAS library is compiled with the optimization level two en-
abled, Equation (3.21) is used to model the execution time of DGER. In this case,
both the L1 misses and Flops are entirely counted in the model, meaning that α1 = 1
and γ = 1. According to the results in Subsection 3.5.2, on average the Flop time
equals 0.25 clock cycles. The L1 hits overlap, so that α01 is empirically chosen as
0.13. Figures 3.36(a) and 3.36(b) report the accuracy of modeling the execution time
of DGER for LDA = 192 and 256, respectively. The results show high deviation for
very small problems. When the problem size increases, the deviation decreases and it
is lower than 2%.
For DGER from the GotoBLAS2 library we consider only the scenario when com-
piler optimization is applied. In this case, both the L1 hits and Flops influence the
execution time only for small problem sizes. For instance, α01 = 0.2 and γ = 1 for
problems that fit in half of the L1 cache; for bigger problems, which can still fit in the
L1 cache, γ = 0.45. Since, the number of L1 misses for DGER varies depending on
both the problem size and the leading dimension of A, the balancing weight α1 varies
66
3.6 Evaluation
0.0e+00
2.0e+04
4.0e+04
6.0e+04
8.0e+04
1.0e+05
1.2e+05
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 192
0.0e+00
2.0e+04
4.0e+04
6.0e+04
8.0e+04
1.0e+05
1.2e+05
1.4e+05
1.6e+05
1.8e+05
 0  50  100  150  200  250
 0
 5
 10
 15
 20
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.36 Predicted and measured execution time of DGER from the reference BLAS
library on Penryn; the library was compiled with the optimization level two
enabled.
0.0e+00
5.0e+03
1.0e+04
1.5e+04
2.0e+04
2.5e+04
3.0e+04
3.5e+04
 0  20  40  60  80  100  120  140
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 128
0.0e+00
1.0e+04
2.0e+04
3.0e+04
4.0e+04
5.0e+04
6.0e+04
7.0e+04
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 192
0.0e+00
2.0e+04
4.0e+04
6.0e+04
8.0e+04
1.0e+05
1.2e+05
 0  50  100  150  200  250
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(c) LDA = 256
Figure 3.37 Predicted and measured execution time of DGER from the GotoBLAS2 library
on Penryn.
67
Chapter 3 Execution-Less Performance Modeling
0.0e+00
5.0e+05
1.0e+06
1.5e+06
2.0e+06
2.5e+06
3.0e+06
3.5e+06
4.0e+06
4.5e+06
5.0e+06
 0  20  40  60  80  100 120 140 160 180 200
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(a) LDA = 192
0.0e+00
2.0e+06
4.0e+06
6.0e+06
8.0e+06
1.0e+07
1.2e+07
 0  50  100  150  200  250
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
p=q
Measured
Modeled
Deviation
(b) LDA = 256
Figure 3.38 Predicted and measured execution time of the unblocked LU factorization on
Penryn.
accordingly. For LDA = 128 and 192, α1 = 1 for problem sizes up to 88× 88, α1 = 0.94
for p ∈ (88,136], and α1 = 0.89 for the other cases. For LDA = 256 the boundaries of
the intervals are slightly shifted: α1 = 1 for problem sizes up to 104 × 104, α1 = 0.94
for p ∈ (104,152], and α1 = 0.89 otherwise.
Figure 3.37 reports the results of modeling the execution time of DGER from the
GotoBLAS2 library. The deviation is quite high for few very small problems. We
assume that this is due to the repacking of the operands in memory. For the rest,
the deviation decreases significantly. In Figure 3.37(c), which shows the results for
LDA = 256, “jumps” in the deviation can be noticed; they are caused by changes in
the balancing weight α1. In general, the accuracy of the model is sufficient – the
deviation is below 4%.
After successful modeling of the execution time of DGER, we proceed with the
unblocked LU factorization. We applied the recursive approach: the execution time
of the unblocked LU is predicted through modeling its L1 misses, which therefore are
modeled through L1 misses of DGER. As with Goto’s DGER, the L1 misses of the
LU factorization and the appropriate balancing weight depend also on the problem
size and the leading dimension of the matrix. For example, when LDA = 192, α1 = 1
for problem sizes up to 64×64 (fit in the L1 cache), while for LDA = 256 this holds up
to 152×152; in the other cases α1 = 0.94 for both LDAs. Flops are taken into account
(γ = 1) only for very small problem sizes that fit in the L1 cache.
Figure 3.38 shows the expected and estimated execution time of the unblocked LU
factorization. Since the unblocked LU is often used as a building block of a blocked
algorithm, the main focus is on problem sizes starting from 64 × 64. For those, the
deviation is mostly less than 4%.
3.6.3 DGEMM
In the previous subsections we consider modeling the execution time of memory-bound
algorithms. Here, our focus is on modeling for CPU-bound algorithms. As case study,
68
3.6 Evaluation
0.0e+00
2.0e+08
4.0e+08
6.0e+08
8.0e+08
1.0e+09
 0  200  400  600  800  1000  1200
 0
 5
 10
 15
 20
 25
 30
 35
 40
T
im
e[
cy
cl
es
]
D
ev
ia
ti
o
n
[%
]
m=n=k
Measured
Modeled
Deviation
Figure 3.39 Predicted and measured execution time of DGEMM from the GotoBLAS2 li-
brary.
we chose DGEMM [28, 39, 40], which is a BLAS-3 routine and the core of the BLAS
library. DGEMM computes the matrix-matrix products as
C ∶= αAB + βC, (3.26)
where α and β are scalars; A,B, and C are general matrices with A a m × k matrix,
B a k×n matrix, and C a m×n matrix. DGEMM performs 2mnk floating operations
overmk+kn+mn data. Whenm = n = k the ratio between Flops and memory accesses
is
2n
3
. This means that most memory accesses can be performed in the background
while the processor performs the computation.
Figure 3.39 provides evidence that the execution time of DGEMM7 from the Go-
toBLAS2 library can be modeled mainly through counting Flops. We discard the L1
hits and misses in Equation (3.21) by setting α1 and α01 to zero. The deviation is
very high for small problem. However, when the problem size increases the deviation
decreases and stays within 5%.
Modeling DGEMM reveals that counting Flops is not enough to obtain sufficient re-
sults, especially for very small problems. In order to attain more accurate predictions,
cache misses also need to be taken into account. This is a part of our future work as
well as modeling the execution time for a blocked variant of the LU factorization. For
the latter, DGEMM is a crucial building block.
7This implementation of DGEMM is highly optimized for each target architecture and its perfor-
mance is 90+% of the processor peak.
69
Chapter 3 Execution-Less Performance Modeling
3.7 Summary
The main goal of this research work was to predict performance of linear algebra
algorithms without actual code execution. In fact, we proposed an analytical model
which is based on the knowledge of the algorithm at hand as well as the CPU type
and the memory hierarchy. As shown, modeling performance is equivalent to modeling
both the CPU active time and the time spent on memory-stalls. Our focus was on
memory-bound algorithms for which the memory-stalls bring the major contribution
to the execution time. We considered the scenario in which the input data resides
in L2 cache, therefore only L1 misses occur. For modeling L1 misses of the linear
algebra kernels, like the BLAS routines, we built an analytical formula8. To verify
the formula, we carried out a set of experiments using DGER from the reference BLAS
and the GotoBLAS2 on two different architectures. By working on two architectures,
we tailored the basic formula for different processor types and memory hierarchies.
We also showed that the basic formula for DGER can be adapted with minor changes
to other BLAS-2 operations such as DGEMV and DTRSV. Since the DGER kernel is
responsible for most computations within an unblocked variant of the LU factorization,
we modeled the number of misses for the LU factorization by composing L1 misses for
DGER. In all cases, the models showed extremely accurate results, with the deviations
from the actual measurements normally lower than 3%. As a result, by combining
the models for the LU factorization and DTRSV together, we can predict L1 misses
for solving a linear system of equations.
From modeling L1 misses we proceeded to modeling execution time. We first con-
sidered simple memory operations such as reading and writing a matrix. For this, two
models were developed that cover non-optimized (-O0) and optimized (-O2) cases.
For the former, not only cache misses, but also cache hits and both the update of
loop indexes and the calculation of memory addresses were take into account. For
the optimized case, the impact of these last three factors disappear since they either
are performed in the background or overlapped with cache misses. By incorporating
floating point operations, we constructed models for DGER. In order to handle an
overlap between the computations and data movements, we used balancing weights.
Those differ from one implementation of DGER to another, so they were chosen em-
pirically. However, we observed that the balancing weights are constant on certain
intervals and correlate with the chosen algorithmic variant and both the problem size
and the leading dimension. Therefore, there is a possibility for constructing the bal-
ancing weights analytically which remains as future work. We validated the models
for reading and writing a matrix, the reference and Goto’s DGER, and the LU factor-
ization by comparing the predicted and measured results – the deviation was mostly
less than 3%. This shows that the accurate modeling of memory-stalls for memory-
bound algorithms leads to the accurate prediction of the execution time. Finally, we
provided evidence that the execution time of CPU-bound algorithms like DGEMM
can also be modeled analytically.
8The formula does not take into account conflict misses because those are hard to model analytically.
There is a work on modeling conflict misses in the context of compiler optimization [18], however
these models appear to be intractable. Also, the effect of L1 conflict misses on the execution time
of the considered algorithms is negligible.
70
3.8 Future Work
3.8 Future Work
Here we propose possible directions to extend and strengthen the results achieved in
this chapter. First off, larger problems will not fit in the L2 cache, therefore causing
not only L1, but also L2 misses. Lifting memory constraints and incorporating L2
cache misses into the model is the natural next step; for even larger problems, TLB
misses must be taken into account too. In this document we present only a set of
BLAS operations; many other building blocks are needed. Conceptually, our model
can be adapted to other level-1 and level-2 BLAS operations, which are memory-
bound, with minor changes. Since level-3 operations are CPU-bound, the modeling
for those operations is substantially different; as a consequence, data movement – and
cache misses – are of secondary importance for accurate timing predictions. However,
as we showed for DGEMM, the same methodology can be applied for both CPU-
and memory-bound operations: different ratios between memory-stalls and Flops are
handled through balancing weights. Those weights can be also modeled analytically.
3.8.1 Modeling Performance of Parallel LU Factorization
Partition
A→ (ATL ATR
ABL ABR
), p→ (pT
pB
)
where ATL is 0 × 0, pT has 0 elements
While size(ATL) < size(A) do
Repartition
(ATL ATR
ABL ABR
)→⎛⎜⎝
A00 A01 A02
A10 A11 A12
A20 A21 A22
⎞⎟⎠, (pTpB )→
⎛⎜⎝
p0
p1
p2
⎞⎟⎠
where A11 is b × b, p1 is b × 1
( A11
A21
) , p1 ∶= LUPIV ( A11
A21
)
( A10 A12
A20 A22
) ∶= P (p1)( A10 A12A20 A22 )
A12 ∶= L−111A12 [DTRSM]
A22 ∶= A22 −A21A12 [DGEMM]
Continue with
(ATL ATR
ABL ABR
)←⎛⎜⎝
A00 A01 A02
A10 A11 A12
A20 A21 A22
⎞⎟⎠, (pTpB )←
⎛⎜⎝
p0
p1
p2
⎞⎟⎠
endwhile
Algorithm 3.1: A blocked algorithm for the LU factorization with partial pivoting.
71
Chapter 3 Execution-Less Performance Modeling
A00 A01 A02
A10 A11 A12
A20 A21 A22A
j+1
22
(j − 1)b
b
n − jb
(j − 1)b b n − jb
Figure 3.40 Partitioning of the matrix A.
Let us consider a general matrix A of size n × n. Algorithm 3.1 demonstrates a
block variant of the LU factorization with partial pivoting and a block size b. More
detailed information regarding the algorithm and the FLAME notation can be found
in [87]. At each iteration of the loop, the matrix A is partitioned from 2× 2 into 3× 3
form, see Figure 3.40. Then, the algorithm updates the matrices A22, A12, and both
A11 and A21 using DTRSM, DGEMM, and an unblocked algorithm for the recursive
LU factorization, respectively. The major computation cost of Algorithm 3.1 is in the
update of the matrix A22 (DGEMM).
We present models for predicting the computation time of the blocked LU factor-
ization with partial pivoting. In this study, we consider two different scenarios. In the
first scenario, we assume that the resources are “unlimited” – the number of processors
is significantly bigger than the matrix size. In this case, DGEMM and DTRSM can
scale and, therefore, their execution time is negligible compared to the unblocked LU
factorization. Instead, in the second scenario, the number of processors is smaller
than the matrix size. Thus, the BLAS-3 routines dictate the value of the execution
time.
Let us denote by LUj , Pj , Tj , and Gj the operations performed at the iteration
j (j = 1,2, ..., n/b) of Algorithm 3.1. These operations correspond to the unblocked
LU factorization of the current (j-th) panel, the permutations, solving the triangular
system and the matrix-matrix product, respectively.
Assume that t threads are created to perform the computation and each of these
threads is assigned to a separate core. n is an exact multiple of both b and t, which
implies that there exist two integers, i and j, such that n = bj and n = it.
Let us also assume that the operations Pj , Tj , and Gj are well parallelized by
columns or blocks of columns. This means that the t threads collaborate in one of
these operations by partitioning the corresponding data into t columns or blocks of
columns in such a way that each thread takes care of the operations on one of these
72
3.8 Future Work
columns or blocks of columns. Parallel partitioning by blocks of columns implies that
only the number of columns is divided by t.
With these assumptions, modeling the execution time of the blocked algorithm
for the LU factorization with partial pivoting can be done by one of the following
scenarios:
A. Theoretical scenario, when the number of threads t are unlimited, e.g. t = n,
then the computation can be easily distributed among these threads. Thus, the
execution time of the blocked LU can be modeled as
time(LU) = n/b∑
j=1 [time(LUj) + time(Pj,c) + time(Tj,c) + time(Gj,c)], (3.27)
where Pj,c, Tj,c, and Gj,c are the corresponding operations applied to one sin-
gle column of the matrix; time(Pj,c) is the time spent for the permutations;
time(Tj,c) is the time spent for solving the triangular system; time(Gj,c) – the
time spent for the matrix-matrix product. Given that time(Pj,c), time(Tj,c),
and time(Gj,c) are likely negligible compared to time(LUj) (provided b ≫ 1),
the execution time of Algorithm 3.1 can be modeled by combining only the time
spent on the unblocked LU (LUj). This scenario will hold for the blocked LU
factorization with partial pivoting both with and without look-ahead (compute-
ahead) 9.
B. Practical scenario (t is smaller than n), then modeling the execution time be-
comes much more difficult and, therefore, the previously suggested model, Equa-
tion (3.27), needs to be refined. Hence, the time spent at the j-th iteration, see
Figure 2.5, can be modeled as follows
B.1. For the blocked LU factorization with partial pivoting and without look-
ahead:
• Pivoting: time(Pj,b), where Pj,b corresponds to the operations per-
formed for the permutations on a block of b/t columns;
• DTRSM: time(Tj,b), where Tj,b stands for the operations performed
for solving the triangular system with (n − jb)/t right-hand sides;
• DGEMM: time(Gj,b), where Gj,b – the operations performed by the
matrix-matrix product with a size (n − jb) × (n − jb)/t of the updated
submatrix.
By summation, Equation (3.27) can be written as
time(LU) = n/b∑
j=1 [time(LUj)+ time(Pj,b)+ time(Tj,b)+ time(Gj,b)]. (3.28)
9In the case of the blocked LU factorization, the update of A22 is subdivided and partially computed
(Aj+122 ), so that the unblocked LU factorization from the next iteration can be performed ahead
of the current iteration in parallel with the rest of the update to A22 [15].
73
Chapter 3 Execution-Less Performance Modeling
B.2. For the blocked LU with partial pivoting and look-ahead:
• DTRSM: time(Tj,b′ ), where Tj,b′ stands for the operations performed
for solving the triangular system with b/t right-hand sides;
• DGEMM: time(Gj,b′ ), where Gj,b′ – the operations performed for the
matrix-matrix product with a size (n − jb) × b/t of the updated sub-
matrix.
As a result, Equation (3.27) becomes
time(LU) = n/b∑
j=1 [time(LUj) + time(Pj,b) + time(Tj,b′ ) + time(Gj,b′ )]. (3.29)
Compared to Equation (3.27), Equations (3.28) and (3.29) demonstrate a more real-
istic scenario where t is limited and the smallest unit in the computation is a block.
In the B.2 case we assume that only a single thread computes LUj (in the other
cases too) and the other t − 1 threads collaborate in the update to the block Aj+122 of
size (n− jb)× b; the update aims at the preparation for and the speed-up of the next
factorization LUj+1. The update of the rest of blocks of A12 and A22 is overlapped
with the next factorizations.
We presented performance models for Algorithm 3.1 considering two different sce-
narios: theoretical and practical. In order to prove the posed statements, we require
to implement, possibly refine, and validate these models.
Even thought unblocked algorithms rarely dictate the computation time of blocked
ones, the theoretical scenario showed that such a situation is possible and needs to be
taken into account too. Under this scenario, modeling the computation time of the
blocked LU factorization with partial pivoting can be performed mainly through the
unblocked LU factorization.
74
Chapter 4
High-Performance Computing on
Clouds
4.1 Introduction
In the last few years cloud computing became a fascinating topic in both academia and
industry. Scientists show interest in using clouds as an alternative to supercomputers
that can help to solve their problems.
Cloud computing could be defined as a set of virtual servers that work together
through the Internet and can be dynamically managed, monitored, and maintained.
There are several distinct characteristics of cloud computing:
1. Virtualization – that could be defined as a layer between an operating system
and hardware. In most cases, virtualization is provided by VMware and Xen
that are virtual machine (VM) monitors for different architectures. They allow
several guest operating systems to execute concurrently on the same hardware.
Normally, the user allocates physical resources on clouds and uploads there his
own image of the operating system and the software using VMware and Xen
virtual machines. Also, the user can develop his own virtual image or use the
existing one as an executable environment on the cloud.
2. Scalability on-demand – users can have as much or as little of resources as they
want at any given time.
3. “Pay-as-you-go” – users pay for what they use, typically by minute or hour.
Cloud users can quickly create and scale-up a custom compute cluster, paying
only for sporadic usage. However, there are also some disadvantages. Costs
are divided into different categories that are billed separately: for example,
network, storage, and CPU usage. This model can be complex when attempting
to minimize costs [83]. Furthermore, the setup time for computation resources is
currently quite long (on the order of minutes for Amazon), and the granularity
for billing CPU resources is coarse: by hour.
4. Clouds are fully managed and supported by the provider.
Characteristics 2 and 3 imply that resources should be scaled conservatively in current
clouds, reducing some of the benefits of scaling on-demand.
With the gained popularity of cloud computing, numerous companies started pro-
viding cloud-based services. The latter can be divided into the following three layers:
75
Chapter 4 High-Performance Computing on Clouds
1. Infrastructure-as-a-Service (IaaS) delivers computer infrastructure, typically a
platform virtualization environment. Rather than purchasing servers, software,
data center space or network equipment, clients buy instead those resources as
a fully outsourced service. Amazon and Rackspace are good examples.
2. Platform-as-a-Service (PaaS) delivers a computing platform as a service, often
consuming cloud infrastructure and sustaining cloud applications. It facilitates
deployment of applications without the cost and complexity of buying and man-
aging the underlying hardware and software layers. Best examples of PaaS are
Microsoft’s Azure and Google’s AppEngine.
3. Software-as-a-Service (SaaS) is a model of software deployment over the inter-
net. With SaaS, a provider licenses an application to customers for use as a
service on-demand through a “pay-as-you-go” model. Also known as “software
on-demand”, the SaaS model allows vendors to develop, host and operate soft-
ware for customer use. Saleforce.com is the pioneer in this category.
We are interested in studying the use of commercial cloud resources, which are pro-
vided as the IaaS service, for high-performance computing. We chose Amazon Elastic
Compute Cloud (EC2) as a commercial cloud environment. Since the Amazon clouds
were mainly intended for web hosting and data management, we aim at analyzing
• Applicability of clouds for both compute-intensive and high-performance scien-
tific applications.
• Difference in the delivered performance on standard clusters and on clouds.
• Cost of doing HPC on clouds.
• Strategies for the efficient use of clouds1.
The rest of the chapter is organized as follows. We first present a methodology of the
study in Subsection 4.1.1 and give an overview of the Amazon EC2 environment and
the HPL benchmark in Subsection 4.1.2. Then, we describe the single node evaluation
in Section 4.2, while Section 4.3 explains the multiple nodes evaluation and provides
a strategy for the efficient use of Amazon EC2. Finally, we summarize the results
in Section 4.4.
4.1.1 Methodology
The ability to allocate nodes on-demand in current commercial cloud environments
implies a new problem in scientific computing: contention. Good high-performance
computing clusters balance the CPU, memory, and network usage of applications to
maintain efficiency while scaling up resource usage but assume exclusive access to
these resources by an application. For example, the RoadRunner supercomputer at
Los Alamos National Laboratory can solve a linear system using 100 k cores while
1This statement arose during our investigation after we have observed high variability in the per-
formance results.
76
4.1 Introduction
maintaining 90% efficiency2 of CPU usage. This requires carefully balancing the
application’s needs across CPU, memory, and the network. With exclusive access,
developers of applications and compute libraries can achieve such balanced computing
by reducing the noise in resource usage that prevents efficient synchronization. For
example, one node out of a large system might run extra software for monitoring that
causes it to reach the end of a computation slower than the other nodes. If all nodes
must synchronize at the end of a computation, all other nodes in the system must
wait for the slowest node to finish. Small amounts of noise can significantly affect the
overall performance of a tightly coupled application [75].
Although cloud environments typically provide a small degree of resource availabil-
ity guarantees, they do not provide a complete facsimile of an unshared virtual node
or strong performance guarantees [6]. Contention is visible along almost all resources:
memory bandwidth, last-level cache space, network bandwidth and latency [89], and
even CPU time. In addition to introducing the kind of noise discussed above, such
contention can severely limit the use of resources shared between virtual environ-
ments. For example, memory contention can increase the compute time of a simple
matrix-matrix multiplication by an order of magnitude.
To run scientific applications efficiently, cloud computing needs to provide resources
to scientific applications comparable to current well-balanced, exclusive-access high-
performance computing systems. Isolation between different virtual machines in the
cloud should provide predictable performance for developers of compute-intensive ap-
plications and libraries. However, our evaluation of the Amazon EC2 cloud – currently
the largest commercial cloud environment – demonstrates that significant work is still
needed to isolate virtual machines in the cloud or to adjust HPC libraries to adapt
to the dynamic, contentious environment of the cloud. The Amazon EC2 cloud sys-
tem was built for web service workloads (that is, without extremely fast interconnects)
and cannot compare favorably with purpose-built, heavily-engineered HPC supercom-
puters for many tightly-coupled scientific computations such as dense linear algebra.
Although we do not expect comparable performance, the dynamic scalability of the
environments holds some promise for smaller workloads.
In order to determine the achievable efficiency of current cloud systems for HPC,
we consider the execution of dense linear algebra, which provides a favorable ratio
of computation to communication: O(n3) operations on O(n2) data. Dense linear
algebra algorithms can thus overlap the slow communication of data with quick com-
putations over much more data [27]. However, HPC on cloud systems will still gen-
erally be limited by slow communication more than specialized HPC systems with
high-throughput and low-latency interconnect. We are concerned primarily with the
efficiency of shared cloud systems as they scale up and show that the effects of con-
tention can far outweigh (at least in current offerings) the imbalance caused by an
underprovisioned interconnection. We thus focus on CPU costs, ignoring possible
associated (and significantly less per computation) storage and networking costs.
We evaluate the performance of HPC on shared cloud environments, performing
2Efficiency is the ratio between attained and theoretical (peak) performance. Attained performance
comes from the ratio between number of executed flops and execution time. Theoretical perfor-
mance is calculated using the method described in Subsection 4.1.2.
77
Chapter 4 High-Performance Computing on Clouds
 0
 100
 200
 300
 400
 500
 600
16:00 16:30 17:00 17:30 18:00 18:30 19:00 19:30 20:00 20:30 21:00 21:30
E
x
ec
u
ti
o
n
 t
im
e 
(s
ec
s)
Time of day
Average
Figure 4.1 DGEMM’s execution time over 6 hours using all 8 cores of a c1.xlarge Amazon
EC2 instance; see Table 4.1 for a description of this instance type. The matrix
size is n = 10 k.
hundreds of high-performance experiments on Amazon EC2. The experiments occur
while nodes are shared with other unknown applications. The results show that the
performance of single nodes available on EC2 can be as good as nodes found in current
HPC systems [86], but on average performance is much worse and shows high variabil-
ity both on single node and small cluster evaluations. Typical fluctuation in the results
is demonstrated in Figure 4.1. The graph presents timings of repeated execution of
DGEMM using all eight cores of a High-CPU Extra Large (c1.xlarge) instance3. The
standard deviation of the average performance is 33%, several orders of magnitude
larger than that observed in standard high-performance computing clusters. This
is because the contention skews the balance of node’s CPU and memory resource
availability to network capacity to prevent efficient usage of the resources. Cache
utilization becomes highly unpredictable and similarly affects computation time.
Due to the high variability in EC2’s performance, we also empirically explore com-
putational efficiency in a cloud environment when resources are underutilized. We
observe an abnormal behavior in the average performance: the peak average perfor-
mance is reached when only a portion of the available resources is used, typically
25–50%. This behavior occurs both in single node computations and on clusters of
up to 64 compute cores. Our results show that it is often more efficient to underutilize
CPU resources – the solution can be obtained faster and thus the corresponding cost
is also lower. Finally, we show that there is still available parallelism when under-
utilizing resources. In some cases, adding an extra node to the cluster is free: the
computation finishes earlier enough to compensate for the marginal price of the extra
node.
In light of the high-contention, we believe that alternative definitions of efficiency
for cloud environments should be introduced. Concepts like expected performance,
3Instance is a compute node on Amazon EC2.
78
4.1 Introduction
expected cost to completion, and variance measures – traditionally ignored in the HPC
context – should now complement or even substitute the standard definitions of effi-
ciency. In addition to standard metrics such as GFlops and efficiency used in HPC,
we introduce GFlop/$ (billions of floating point operations per dollar) to analyze in
depth the pros and cons of clouds. The GFlop/$ metric allows users to estimate
straightforward costs – currently limited to the CPU usage – for different applications
with respect to the computational efficiency.
4.1.2 Experimental Setup
We perform our experiments on the Amazon EC2 service as a case study for commer-
cial cloud environments [80]. Although there are competing cloud offerings that were
publicly available at the time [95, 45], Amazon’s service is the largest that provides
highly configurable virtual machines. The nodes allocated by EC2 run a kernel or
operating system configured by Amazon, but all software above this level is config-
ured by the user. Many other cloud offerings by other providers limit applications
to certain APIs or languages. To use existing highly optimized dense linear algebra
libraries, we use Amazon EC2 as a case study.
Nodes allocated through EC2 are called instances. Instances are allocated from
Amazon’s data centers according to unpublished scheduling algorithms. Allocations
are initially limited to 20 total instances, but this restriction can be lifted upon re-
quest. Data centers are combined into entities known as an availability zone, Amazon’s
smallest logical geographic entity for allocation. These zones are further combined into
geographical regions: the US, Europe, Asia Pacific, and South America4.
After allocation, each instance automatically loads a user-specified image containing
the proper operating system (in our case Linux) and user software (described below).
Images are loaded automatically by Amazon services onto one or more virtualized pro-
cessors using the Xen virtual machine [4]. Each processor is itself multi-core, resulting
in a total of 2 to 8 virtual cores for the instances we reserved. The Terms of Service
provided by Amazon do not provide strong performance guarantees. Most impor-
tantly for this study, they do not limit Amazon’s ability to implement multi-tenancy;
that is, to co-locate VMs from different customers. We discuss the performance char-
acteristics of different instances below.
Tools written to Amazon’s public APIs provide the abilities to allocate extra nodes
on demand, release unused nodes, and create and destroy images to be loaded onto
allocated instances. Using these tools and developing our own, we built images with
the latest compilers provided by the hardware CPU vendors AMD and Intel. We use
HPL 2.0 [3] from the University of Tennessee, compiled with GotoBLAS 1.26 [38]
from the Texas Advanced Computing Center (TACC), and MPICH2 1.0.8 [59] from
the Argonne National Laboratory. Using our tools we can allocate and configure
variable size clusters automatically, including support for MPI applications.
Although we developed tools to automatically manage and configure nodes for our
applications, there are also other publicly available tools for running scientific appli-
cations on cloud platforms (including EC2) [73, 72]. Further, as the cloud computing
4At the moment of the study, only the US and Europe were available.
79
Chapter 4 High-Performance Computing on Clouds
Table 4.1 Information about various instance types: processor type, number of cores per in-
stance, installed RAM (in Gb), and theoretical peak performance (in GFlops/sec).
Prices are on Amazon EC2 as of May, 2010.
Instance Processor Cores RAM Peak Price ($/hr)
m1.large Intel Xeon E5430 2 7.5 21.28 $0.34
m1.large AMD Opteron 270 2 7.5 8.00 $0.34
m1.xlarge Intel Xeon E5430 4 15.0 42.56 $0.68
m1.xlarge AMD Opteron 270 4 15.0 16.00 $0.68
c1.xlarge Intel Xeon E5345 8 7.0 74.56 $0.68
m2.2xlarge Intel Xeon X5550 4 34.2 42.72 $1.20
m2.4xlarge Intel Xeon X5550 8 68.4 85.44 $2.40
platform matures, we expect much more development for specific applications such as
high-performance computing to reduce or eliminate much of the initial learning curve
for deploying scientific applications on cloud platforms. Already, for example, public
images are available on EC2 supporting MPICH [36].
Amazon EC2
Our case study was carried out using various instance types on the Amazon EC2
service from November 2008 through April 2010. Table 4.1 describes the differences
between the instance types: number of cores per instance, installed memory, theoret-
ical peak performance, and the cost of the instance per hour. We only used instances
with 64-bit processors, so that we treat the m1.large as the smallest instance Amazon
provides. The costs per node vary by a factor of 7 from $0.34 for the smallest to
$2.40 for nodes with the highest theoretical peak performance and installed memory.
We note that cost scales more closely with installed RAM than with peak CPU per-
formance – the c1.xlarge instance being the outlier. Peak performance is calculated
using processor-specific capabilities. For example, the c1.xlarge instance type consists
of two Intel Xeon quad-core processors operating at a frequency of 2.3GHz with a
total memory of 7Gb. Each core is capable of executing 4 floating-point operations
per clock cycle, leading to a theoretical peak performance of 74.56GFlops per node.
There are additional costs for bandwidth used into and out from Amazon’s network
and for long-term storage of data, but we ignore these costs in our calculations because
they are negligible compared to the costs of the computation itself in our experiments.
In regards to multi-threaded parallelism provided by the multi-core processors, ex-
tensive testing typically delivered the best performance when we set the GotoBLAS
library to use as many threads as available cores per socket – 4 and 2 for the Intel
and the AMD processors, respectively. We provide the number of threads used to ob-
tain specific results in the following section when presenting peak achieved efficiency.
With these settings and using the platform-specific libraries and compilers, we reached
76% and 68% of theoretical peak performance (as measured in GFlops) for the In-
80
4.1 Introduction
tel Xeon E5345 and the AMD Opteron 270 processors, respectively, for single node
performance on xlarge instances. We thus believe the configuration and execution of
LINPACK in HPL on the high-CPU and standard instances is efficient enough to use
as an exemplar of compute-intensive applications for the purposes of our evaluation.
All instance types (with Intel or AMD CPUs) execute the RedHat Fedora Core
8 operating system using the 2.6.21 Linux kernel. The 2.6 line of Linux kernels
supports autotuning of buffer sizes for high-performance networking, which is enabled
by default. The specific interconnect used by Amazon is unspecified [80] and multiple
instances might even share a single hardware network card [89]. Therefore, the entire
throughput might not be available to any particular instance. In order to reduce the
number of hops between nodes, we run all experiments with cluster nodes allocated
in the same availability zone.
HPL Benchmark
Our goal is to determine the suitability of commercial cloud environments for certain
kinds of scientific applications. We focus on the HPL benchmark [3] as the exemplar of
tightly coupled, highly parallel scientific applications. HPL computes the solution of
a random dense system of linear equations via LU factorization with partial pivoting
and triangular solves. This algorithm requires O(n3) floating point operations on
O(n2) data; that is, HPL is compute-intensive, and represents a realistic upper bound
for the performance of such scientific applications. The actual implementation is
driven by more than a dozen parameters, all of which may have a significant impact
on the resulting performance and therefore require fine tuning. We describe the HPL
parameters that were tuned below:
1. Block size (NB). It is determined in relation to the problem size and the per-
formance of the underlying BLAS kernels. We used four different block sizes,
namely 192, 256, 512, and 768.
2. Process grid (p × q). This is the number of process rows and columns of the
compute grid. As with most clusters, we empirically observed that on EC2 it
is better to use process grids where p ≤ q. This is a product of the data flow of
the algorithms used in HPL.
3. Broadcast algorithm (BFACT ). It depends on the problem size and network
performance. Testing suggested that the best broadcast parameters are 3 and
5. For large machines featuring fast nodes compared to the available network
bandwidth, algorithm 5 is observed to be best.
We kept the other HPL settings fixed for all the experiments.
4.1.3 State-of-the-art
Cloud computing has been proposed as a service to scale out or share existing scientific
clusters. The Eucalyptus project [72] provides an open source cloud management
tool. Similarly, the Nimbus project [57] provides cloud management tools for clusters.
81
Chapter 4 High-Performance Computing on Clouds
These tools do not typically provide for multi-tenancy – unlike the commercial cloud
environments we study – and thus observe different performance characteristics closer
to existing scientific clusters with exclusive access. We are concerned in this chapter
with the effects of contention on HPC applications when multiple VMs share resources.
Tikotekar et al. [85] ran different HPC benchmarks in virtual machines to deter-
mine the effects of virtualization on high-performance computing applications. The
results show isolating the effects of multiple VMs on the same machine is difficult,
but in general virtualization has little effect on the performance of compute-intensive
HPC applications. Similarly, Youseff et al. [97] saw little performance impact on the
memory hierarchy behavior of linear algebra libraries under virtualization. However,
this study did not consider contention between multi-tenant VMs.
The impact of virtualization on the networking performance of EC2 was analyzed
empirically by Wang and Ng [89]. They reported heavy network instabilities and
delays, especially on small 32-bit nodes. We observed poor performance on the High-
Performance Linpack (HPL) parallel benchmark using multiple instances, but we are
more concerned with the scaling properties. Although the instances do not approach
peak performance, the parallel job can scale (for tens of nodes) with poor network
performance provided the instances have enough installed RAM due to the high ra-
tio of computation to communication in dense linear algebra. Our experiments (see
Figure 4.8) demonstrate this effect.
The effects of sharing the last-level cache in multi-core processors was investigated
by Iyer et al. [55]. The paper discusses different approaches to improving the perfor-
mance guarantees of each core with respect to cache behavior. Using these techniques
could greatly increase the predictability of performance on cloud platforms.
Previously, Edward Walker [88] had compared EC2 nodes to current HPC systems.
We also demonstrated through empirical evaluation the computational efficiency of
compute-intensive applications on EC2 nodes [6, 49, 68, 69]. Our results here are
similar to Walker’s for the small clusters of 4 nodes that he used. Preliminary results
were antecedently reported by Napper and Bientinesi in [67].
4.2 Single Node Scaling
We begin our empirical analysis of Amazon EC2 by evaluating the consistency of
achievable performance on a single node. Our goal is to analyze the stability of achiev-
able performance and to characterize how it varies with the utilized number of cores.
In our experiments, we execute compute-intensive applications such as DGEMM and
HPL tests for 24 hours, repeating the experiments over different days. We first present
the results for DGEMM and then discuss the results for HPL.
4.2.1 DGEMM Evaluation
We perform measurements of the execution time of DGEMM under different scenarios
to determine peak and average performance5 of single allocated VMs. DGEMM is
5We calculate average performance in GFlops by dividing the required flops for the operation (2n3
for DGEMM and 2n3/3 for the HPL benchmark) by the average execution time of the samples.
82
4.2 Single Node Scaling
 0
 50
 100
 150
 200
 250
15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00
E
x
ec
u
ti
o
n
 t
im
e 
(s
ec
s)
Time of day
Average
Figure 4.2 DGEMM’s execution time over 6 hours using 1 of 8 cores of the c1.xlarge instance.
The size of the input matrices is n = 10 k.
highly optimized for target architectures, and its performance is typically taken as the
peak achievable performance for a processor, often in the 90+% range of efficiency.
More detailed information about DGEMM can be found in Subsection 3.6.3. To
measure the achievable performance on allocated VM instances, we initialize three
square matrices and invoke the GotoBLAS [38] implementation of DGEMM. We only
measure the time spent in the BLAS library and not the time spent allocating and
initializing the matrices. Due to the dense nature of the matrices involved in the
experiments – most of the entries are non-zero – we expect little to no fluctuations in
the measured time on a single node independent of problem size and the number of
cores used.
Figure 4.2 shows the execution time of repeated DGEMM on the c1.xlarge instance
over 6 hours for matrices of size n = 10 k; the size is such that the input matrices (each
768Mb) do not to fit in the 8Mb L2 cache of the instance. For space reasons we show
only 6 hours; however, the results over 24 hours and on different days show similar
behavior. In this figure, only 1 out of 8 cores of the c1.xlarge instance was used, and
the results are quite stable. The average execution time6 over all our experiments
is 227.9 s with standard deviation of only 0.23 s, i.e., 3 orders of magnitude smaller
than the average. We calculate the average using the arithmetic mean of the samples,
to demonstrate the statistically expected execution time of the computation. In the
rest of this chapter we use average, arithmetic mean, and expected execution time
interchangeably. These experiments show that contention among the co-located VMs
(and possibly the hypervisor itself) degrades performance.
Executing on only 1 of 8 cores in an instance heavily underutilizes the paid re-
sources. However, attempting to increase utilization quickly degrades performance.
For example, in Figure 4.3 we present the time to complete the same experiments
using 4 of 8 cores of the c1.xlarge instance. The results show very high variability
6We calculate the average execution time using the arithmetic mean over sample runs.
83
Chapter 4 High-Performance Computing on Clouds
 0
 50
 100
 150
 200
 250
 300
 350
 400
16:30 17:00 17:30 18:00 18:30 19:00 19:30 20:00 20:30 21:00 21:30 22:00 22:30
E
x
ec
u
ti
o
n
 t
im
e 
(s
ec
s)
Time of day
Average
Figure 4.3 DGEMM’s execution time over 6 hours using 4 of 8 cores of the c1.xlarge instance.
The size of the input matrices is n = 10 k.
in the execution time: the average execution time is 191.8 s, with standard deviation
of 68.6 s. The average execution time is 84% of the single core case using four times
the resources. In addition, the standard deviation is of the same order (36%) as the
expected execution time. For comparison, we also ran the same DGEMM experiments
on a node of a standard cluster: Intel Harpertown E5450 processor with 8 cores and
16GB RAM. The node is quite similar to the c1.xlarge instance on Amazon EC2,
but shows very little variability (Figure 4.5): when using 4 of 8 cores and limiting
the memory to 7GB, the average execution time is 43.73 s (25% of single core), with
standard deviation of only 0.053 s (0.1% of the average).
On the c1.xlarge instance we note that occasionally DGEMM executes with very
high performance so that the best results are as fast as they would be expected,
achieving 90% of the efficiency. These fast executions imply that the virtualization
itself is not limiting the efficiency of the computation; instead, competition for the
CPU and cache space through multi-tenancy significantly degrade the expected per-
formance.7 The timings for the same experiments executed with all 8 cores show
an even higher level of variability and much worse average execution time as shown
previously in Figure 4.1.
The change in average and minimum execution times as utilization increases is
demonstrated in Figure 4.4. The figure presents timings for DGEMM executed over
24 hours. Error bars show the standard deviation. The best performance (minimum)
line indicates that DGEMM reaches roughly 90% of the efficiency; for example, the
minimum execution time on 4 cores is 57.16 secs, therefore the efficiency is 87.4%.
Such performance is close to the optimal achievable. The average performance shows
that similar computations will not likely ever reach optimal efficiency. As the num-
7The nature and the causes of the observed performance fluctuation and degradation would likely
be better understood by using performance counters. Unfortunately, at the time of this study
performance counters could not be used on EC2 instances, as necessary patches to the Linux
kernels were not permitted.
84
4.2 Single Node Scaling
 0
 50
 100
 150
 200
 250
 300
 350
 400
 450
 500
 1  2  4  6  8
T
im
e 
(s
ec
s)
Number of cores
Minimum
Average
Figure 4.4 DGEMM’s average and best execution times on c1.xlarge vs. number of cores.
The size of the input matrices is n = 10 k; error bars show one standard deviation.
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
 1  2  4  6  8
T
im
e 
(s
ec
s)
Number of cores
Minimum
Average
Figure 4.5 DGEMM’s average and best execution times on a standard cluster vs. number
of cores. The size of the input matrices is n = 10 k; error bars show one standard
deviation.
ber of cores increases, the average execution time significantly increases along with
the standard deviation. The standard deviation increases by four orders of mag-
nitude. Expected performance diverges so significantly from best performance that
when using eight cores one would rarely expect to achieve the best performance. For
comparison we present in Figure 4.5 the average and minimum execution times of
repeated DGEMM on the node of a standard cluster (as described previously). The
85
Chapter 4 High-Performance Computing on Clouds
difference between the average and minimum execution times is negligible, and thus
the standard deviation of the average is always close to zero.
Underutilization is clearly a “good policy” on EC2 to reduce overhead due to con-
tention between VMs. Conversely, the results from the standard, unshared cluster
imply the extra parallelism provided by more cores should reduce the execution time
correspondingly. Each execution uses roughly the same amount of RAM with more
cores using more (small) temporary buffers. Memory allocation is identical between
the EC2 and cluster experiments. We infer that due to co-location of tenants, the
higher pressure on cache space due to increased parallelism has the negative effect on
EC2 than on an unshared cluster. As utilization increases, the performance on EC2
actually decreases roughly exponentially as the widening gap between the minimum
and the average performance demonstrates.
Using two cores appears to be optimal; however, this sweet spot does not necessarily
hold true for different workloads in the allocated VM or the co-allocated tenants. In
our DGEMM experiment, using 4 (of 8) cores of a c1.xlarge instance takes longer on
average than using only 2, and the trend worsens as the number of cores increases.
Adding parallelism after two cores only succeeds in increasing contention. We conclude
that multi-tenancy on EC2 probably does not go beyond 3–4 VMs per physical node
(each with two cores) since some benefit from parallelism is still available.
Since the average performance is better (best) on 2 cores or even 1 core than on 8
cores, it would be better to rent part of the node than the whole node. By multiplexing
several VMs on the same hardware, Amazon effectively supplies only part of the node
to the customer, but provides access to all of the cores. We have shown that it is
better to ignore the access to all of the cores and underutilize the instance to reduce
the overhead of contending VMs.
4.2.2 HPL Evaluation
The GotoBLAS implementation of DGEMM supports parallelism only on shared
memory architectures. In order to determine whether the effects we have seen us-
ing DGEMM extends to multiple nodes, we use the HPL benchmark [3] that can
scale to large compute grids using MPI [59].
We first consider HPL on a single-node to provide a basis for comparison with the
DGEMM kernel, and then extend the analysis to small clusters of instances. Figure 4.6
presents average and minimum execution times of HPL with n = 25 k against the
number of utilized cores on the c1.xlarge instance. As with the previous DGEMM
kernel experiments, error bars show the standard deviation. The input/output matrix
for the computation fits in the main memory of a single instance. We do not directly
compare the DGEMM and HPL experiments because of the different operations that
they implement; however, we do note that Figures 4.4 and 4.6 show a similar behavior.
Likewise, the comparison between HPL on EC2 and HPL on a standard cluster mirrors
the comparison between DGEMM on EC2 and on a standard cluster; thus, we focus
on the effects of scaling HPL.
From the single node experiments we can conclude that the best performance of
DGEMM and HPL on the EC2 cloud is nearly the same as on a standard clus-
ter. While the individual nodes provided are capable of efficient HPC computations
86
4.3 Multiple Nodes Scaling
 0
 200
 400
 600
 800
 1000
 1200
 1400
 1600
 1800
 1  2  4  6  8
T
im
e 
(s
ec
s)
Number of cores
Minimum
Average
Figure 4.6 Average and minimum execution times of HPL on c1.xlarge vs. number of cores.
The size of the input/output matrix is n = 25 k. The input configuration to each
execution of HPL is identical.
even within virtual machines, the expected performance represented by the arithmetic
mean can be several orders of magnitude worse than the best performance. As the
number of utilized cores per node increases, both the average and the standard devia-
tion increase considerably. We infer that competition for CPU and cache space among
co-located VMs causes significant performance fluctuations and overall degradation.
On the High-CPU c1.xlarge instances, the best average performance is obtained using
from a quarter to half of the cores available! Amazon provides no quality of service
guarantees on the number of VMs that might be multiplexed on a single hardware
node, so this ratio is subject to change as Amazon changes the rate of multi-tenancy
within EC2 nodes.
4.3 Multiple Nodes Scaling
The previous section demonstrates the significant effects of contention on single-node
performance in a commercial cloud environment. In this section, we extend our em-
pirical analysis to parallel multi-node computations by using the HPL benchmark on
a cluster composed of allocated EC2 instances. We executed the HPL benchmark
on different clusters of EC2 instances, varying the parameters as described in Sub-
section 4.1.2 (HPL Benchmark). Using different instance types allowed us to see the
effects on performance by varying the amount RAM available and problem size.
Tables 4.2 to 4.5 show the parameters used to obtain the best results over different
problem sizes and instance types. The total number of cores used for a particular
execution is p × q × threads. It was necessary to try different input configurations to
maximize performance. So, we minimized reliance on the interconnect by maximizing
the memory allocation per node used to solve a particular problem. Figure 4.7 shows
87
Chapter 4 High-Performance Computing on Clouds
 0
 1
 2
 3
 4
 5
 6
 7
 8
 20000  30000  50000  80000
N
u
m
b
er
 o
f 
n
o
d
es
Matrix size
m1(c1).(x)large
m1.xlarge
m2.2xlarge
m2.4xlarge
Figure 4.7 Required number of nodes for different input matrix sizes n. The number of
nodes is determined by maximizing usage of RAM on each node. We could not
allocate enough m1.large nodes of AMD type to solve a problem of size n = 80k.
Table 4.2 Results of the HPL benchmark for matrix size n = 20 k. Columns are, in order:
Amazon EC2 instance type, CPU type (Intel or AMD), block size (NB), process
grid (p × q), number of threads used in the BLAS routines, broadcast algorithm
(BFACT ), best elapsed time, and corresponding efficiency using the theoretical
peak performance from Table 4.1. A more detailed description of the HPL pa-
rameters can be found in Subsection 4.1.2.
Instance Processor Block p×q Threads Bcast Time Efficiency
(NB) (min:sec) (% peak)
m1.large Intel 256 1 × 1 2 3 06:29 64.4
m1.large AMD 192 1 × 2 1 3 12:24 89.6
m1.xlarge Intel 256 1 × 1 4 3 05:07 40.8
m1.xlarge AMD 256 1 × 2 2 3 06:28 85.9
c1.xlarge Intel 512 1 × 1 8 3 01:52 64.0
m2.2xlarge Intel 256 1 × 2 2 3 03:20 62.5
m2.4xlarge Intel 256 1 × 1 8 3 01:45 59.7
the number of nodes used for different instance types to solve each problem. Only the
m2.4xlarge instance with 68.4Gb of RAM is large enough to solve all problems on a
single node. We used this as a reference point in Figure 4.8 to bound the effects of
network usage on the performance.
In Figures 4.8 to 4.13 we present different aspects of our HPL benchmark exper-
iments; the results are given in Tables 4.2 to 4.5. We first discuss the best results
obtained in a contentious commercial cloud environment. In these parallel experi-
88
4.3 Multiple Nodes Scaling
Table 4.3 Results of the HPL benchmark for matrix size n = 30 k.
Instance Processor Block p×q Threads Bcast Time Efficiency
(NB) (min:sec) (% peak)
m1.large Intel 256 1 × 1 2 3 20:21 69.3
m1.large AMD 256 1 × 2 1 3 41:21 90.8
m1.xlarge Intel 256 1 × 2 2 3 11:01 64.0
m1.xlarge AMD 192 1 × 2 2 3 21:21 87.9
c1.xlarge Intel 512 1 × 1 8 3 05:19 75.7
m2.2xlarge Intel 256 1 × 2 2 3 11:12 62.7
m2.4xlarge Intel 512 1 × 1 8 3 05:47 60.8
Table 4.4 Results of the HPL benchmark for matrix size n = 50 k.
Instance Processor Block p×q Threads Bcast Time Efficiency
(NB) (hr:min:sec) (% peak)
m1.large Intel 512 1 × 3 2 5 35:56 60.6
m1.large AMD 256 1 × 3 2 3 1:09:44 83.0
m1.xlarge Intel 512 1 × 2 4 5 28:08 58.0
m1.xlarge AMD 512 1 × 4 2 5 1:00:47 71.4
c1.xlarge Intel 512 3 × 1 8 5 18:14 34.0
m2.2xlarge Intel 256 1 × 2 2 3 51:09 63.1
m2.4xlarge Intel 256 1 × 2 8 3 26:17 61.9
Table 4.5 Results of the HPL benchmark for matrix size n = 80 k.a
Instance Processor Block p×q Threads Bcast Time Efficiency
(NB) (hr:min:sec) (% peak)
m1.large Intel 768 2 × 4 2 5 1:01:06 54.7
m1.xlarge Intel 512 1 × 4 4 3 1:14:58 44.6
m1.xlarge AMD 512 2 × 2 4 5 3:27:47 42.8
c1.xlarge Intel 768 1 × 8 8 5 2:17:51 0.07
m2.2xlarge Intel 512 1 × 2 4 3 1:50:57 60.0
m2.4xlarge Intel 512 1 × 1 8 3 1:46:06 62.8
a We could not allocate enough m1.large nodes of AMD type to solve this problem.
ments, the minimum times do not differ as greatly from the average as on the single
node experiments for several reasons: 1) the elapsed time required (more than an
hour for large problems) implies an averaging effect; 2) the overhead for parallel jobs
89
Chapter 4 High-Performance Computing on Clouds
 0
 10
 20
 30
 40
 50
 60
 70
 80
 90
 100
 20000  30000  50000  80000
E
ff
ic
ie
n
cy
 (
%
)
Matrix size
m1.large (Xeon)
m1.large (Opt.)
m1.xlarge (Xeon)
m1.xlarge (Opt.)
c1.xlarge (Xeon)
m2.4xlarge (Xeon)
m2.4xlarge (Xeon)
Figure 4.8 Efficiency scaling by problem size. Efficiency is determined with respect to the
theoretical peak given in Table 4.1 multiplied by the number of nodes used.
 0
 0.5
 1
 1.5
 2
 2.5
 3
 3.5
 20000  30000  50000  80000
T
im
e 
(h
o
u
rs
)
Matrix size
m1.large (Xeon)
m1.large (Opt.)
m1.xlarge (Xeon)
m1.xlarge (Opt.)
c1.xlarge (Xeon)
m2.2xlarge (Xeon)
m2.4xlarge (Xeon)
Figure 4.9 Total time to solution for different instance types by problem size.
is higher than for single node experiments due to the network usage. We note that in
these figures the line for m1.large instances using the AMD processor does not extend
to n = 80 k because we could not allocate enough m1.large instances with AMD CPUs
in a single availability zone.
4.3.1 HPL Minimum Evaluation
Figure 4.8 demonstrates that efficiency is generally better than 60% for problem sizes
below 80 k. Here, efficiency is considered to be the percentage of theoretical peak
given in Table 4.1 multiplied by the number of nodes used. We consider 60% to be
90
4.3 Multiple Nodes Scaling
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 20000  30000  50000  80000
C
o
st
 (
$
)
Matrix size
m1.large (Xeon)
m1.large (Opt.)
m1.xlarge (Xeon)
m1.xlarge (Opt.)
c1.xlarge (Xeon)
m2.2xlarge (Xeon)
m2.4xlarge (Xeon)
Figure 4.10 Cost to solution prorated to actual time spent for different instance types by
problem size.
 0
 2
 4
 6
 8
 10
 12
 20000  30000  50000  80000
C
o
st
 (
$
)
Matrix size
m1.large (Xeon)
m1.large (Opt.)
m1.xlarge (Xeon)
m1.xlarge (Opt.)
c1.xlarge (Xeon)
m2.2xlarge (Xeon)
m2.4xlarge (Xeon)
Figure 4.11 Actual cost to solution for different instance types by problem size. Execution
time is rounded to the nearest hour for cost calculation.
reasonable performance given the relatively slow interconnect provided by Amazon
EC2 compared to purpose-built HPC systems. The balance of resources available to
the computation is clearly important. For example, the m2.2xlarge and m2.4xlarge
instances have roughly equal performance at 80 k although HPL needs two of the
m2.2xlarge instances and only one m2.4xlarge. In this case the interconnect does not
have a significant effect because the nodes are provisioned with enough main memory
to keep the CPU busy. The c1.xlarge instances suffer severe performance degradation.
At n = 80 k, the c1.xlarge instances perform two orders of magnitude worse than a
single node. We conjecture that the 7Gb of RAM of c1.xlarge nodes is insufficient to
91
Chapter 4 High-Performance Computing on Clouds
 0
 0.05
 0.1
 0.15
 0.2
 0.25
 0.3
 0.35
 20000  30000  50000  80000
C
o
st
 p
er
 G
fl
o
p
 (
$
)
Matrix size
m1.large (Xeon)
m1.large (Opt.)
m1.xlarge (Xeon)
m1.xlarge (Opt.)
c1.xlarge (Xeon)
m2.2xlarge (Xeon)
m2.4xlarge (Xeon)
Figure 4.12 Cost per GFlop ($/GFlop) prorated to actual time spent for different instance
types by problem size.
 0
 0.05
 0.1
 0.15
 0.2
 0.25
 0.3
 0.35
 0.4
 20000  30000  50000  80000
C
o
st
 p
er
 G
fl
o
p
 (
$
)
Matrix size
m1.large (Xeon)
m1.large (Opt.)
m1.xlarge (Xeon)
m1.xlarge (Opt.)
c1.xlarge (Xeon)
m2.2xlarge (Xeon)
m2.4xlarge (Xeon)
Figure 4.13 Actual cost to solution per GFlop ($/GFlop) for different instance types by
problem size. Execution time is rounded to the nearest hour for cost calculation.
keep the CPU busy given the same interconnect between instances. In general, nodes
with more RAM clearly scale up in problem size much better as expected due to the
high ratio between the required number of operations and the data size to solve a
dense linear system.
While efficiency is important to gauge the scalability of the implementation, for any
particular experiment the most important concern is generally the time to solution.
Figure 4.9 demonstrates the total time to solution for different instance types as the
problem size varies. This graph shows that smaller nodes reach the solution faster
although the efficiency of larger nodes is somewhat better. Excepting the extreme
92
4.3 Multiple Nodes Scaling
case of c1.xlarge instances, the largest (and most expensive) instances are also the
slowest.
One of the important concerns when using commercial cloud environments are
the differing costs of different instances. Figure 4.10 provides the cost to solution
for different instance types by problem size. The cost is prorated to the second
to illustrate the marginal costs incurred when allocating a cluster to solve several
problems. Figure 4.11 provides the comparable actual costs of using the different
instance types to solve each problem size; that is, this graph includes the costs of
the remaining hour after the execution completes. The salient difference between the
figures is the relation of the largest nodes to the smaller nodes. Using absolute cost,
the largest nodes are cheapest for large problems, but using prorated costs they are
more expensive. Since the prorated trends are more informative for different problem
sizes and for multiple jobs, we conclude the smaller instances m1.large and m1.xlarge
are the most cost effective for parallel jobs.
In addition to cost to solution, we also consider a more general cost measure,
$/GFlop, calculated using the ratio between the total GFlops returned by the HPL
benchmark and the (prorated) cost for the specific computation. This measure allows
a rough conversion of expected cost for other problem sizes including other scientific
applications characterized by similar computational needs. Figures 4.12 and 4.13 show
the results of the HPL benchmark giving the cost per GFlop for different instances by
problem size. The cost per GFlop measure magnifies the differences between the in-
stance types, but of course the best instance type in Figures 4.10 and 4.11 – m1.large
– is still the best for the price to performance ratio.
The m1.large instance is the fastest by Figure 4.9 and the best for price to perfor-
mance by Figure 4.10, leading us to recommend the smallest (64-bit) instance for most
parallel linear algebra compute jobs on the Amazon EC2 cloud environment according
to the empirical upper bound on performance. The high-compute and high-memory
instances are not worth the extra costs in our experiments. Given the high variabil-
ity in the measurements of single nodes, in the following subsection we examine the
average execution time of HPL on all instances in order to determine whether our
recommendation holds also for the expected performance.
4.3.2 HPL Average Evaluation
The minimum execution times from the previous subsection provide a rough upper
bound for performance. In this subsection, we examine the average execution times
for HPL to provide a better estimate of the average performance of applications with
dense linear algebra. Figures 4.14 to 4.17 provide the minimum and average (and
standard deviation) execution times for different instances by problem size. We do
not show the m2.2xlarge or m2.4xlarge instances due to insufficient data. These large
instances are also quite expensive to allocate.
As with the single node experiments, Figures 4.14 to 4.17 show that the expected
performance is worse than the best performance, but by a much smaller margin.
Indeed, for the m1.xlarge instances in Figure 4.15, the average execution time at n =
80 k is only 12% worse than the minimum execution time. Generally, the average for
the HPL experiments at n = 80 k is between 12% and 100% worse than the minimum
93
Chapter 4 High-Performance Computing on Clouds
 0
 0.2
 0.4
 0.6
 0.8
 1
 1.2
 1.4
 1.6
 20000  30000  50000  80000
T
im
e 
(h
o
u
rs
)
Matrix size
Minimum
Average
20k	 avg/min = 1.41
30k	 avg/min = 1.68
50k	 avg/min = 1.25
80k	 avg/min = 1.22
Figure 4.14 Minimum and average execution time of HPL on m1.large (Intel) instances by
problem size. Error bars on average show one standard deviation.
 0
 0.5
 1
 1.5
 2
 2.5
 3
 3.5
 4
 20000  30000  50000  80000
T
im
e 
(h
o
u
rs
)
Matrix size
Minimum
Average
20k	 avg/min = 1.82
30k	 avg/min = 1.53
50k	 avg/min = 1.40
80k	 avg/min = 1.12
Figure 4.15 Minimum and average execution time of HPL on m1.xlarge (AMD) instances
by problem size. Error bars on average show one standard deviation.
execution times. As we already mentioned, we believe the smaller differences between
the expected execution times and the best times are due to an averaging effect from the
length of the experiments and greater communication overhead in the HPL evaluation
compare to the single node experiments.
The m1.large instance – the smallest that we tested – remains the best EC2 instance
for tightly coupled, compute-intensive, parallel jobs. Although the average execution
time on m1.large is 20% worse than the minimum for n = 80 k, the average is still
better than the minimum time on m1.xlarge (Intel) – the second best instance. Given
that the m1.large also costs half as much as the m1.xlarge, the smallest instance is
94
4.3 Multiple Nodes Scaling
 0
 0.5
 1
 1.5
 2
 2.5
 3
 20000  30000  50000  80000
T
im
e 
(h
o
u
rs
)
Matrix size
Minimum
Average
20k	 avg/min = 2.67
30k	 avg/min = 1.33
50k	 avg/min = 1.63
80k	 avg/min = 1.84
Figure 4.16 Minimum and average execution time of HPL on m1.xlarge (Intel) instances by
problem size. Error bars on average show one standard deviation.
 0
 1
 2
 3
 4
 5
 6
 20000  30000  50000  80000
T
im
e 
(s
ec
s)
Matrix size
Minimum
Average
20k	 avg/min = 2.98
30k	 avg/min = 4.35
50k	 avg/min = 6.02
80k	 avg/min = 2.08
Figure 4.17 Minimum and average execution time of HPL on c1.xlarge instances by problem
size. Error bars on average show one standard deviation.
the clear winner for the tightly-coupled, dense linear algebra computations that we
evaluated. The high marginal costs for high-compute or high-memory nodes is not
cost effective.
4.3.3 HPL Evaluation through Resource Underutilization
In Subsection 4.2.2 we described the significant benefits of the underutilization of
resources on the performance of single node computations. Here, we extend our study
to parallel multi-node computations by measuring the execution time of the HPL
95
Chapter 4 High-Performance Computing on Clouds
 0
 1000
 2000
 3000
 4000
 5000
 1  2
T
im
e 
(s
ec
s)
Number of cores
Minimum 3 nodes
Average 3 nodes
Figure 4.18 Average execution time of repeated HPL on m1.large (Intel) by number of cores
used per each node.
benchmark on a cluster composed of EC2 instances. We execute HPL for a problem
of size 50 k on different clusters; such problem is the smallest in our experiments
(see Figure 4.7) that requires a small cluster in order to be solved. Also, all instances
can be allocated to find the solution of a problem of size n = 50 k instead of n =
80 k. We again observe significant performance improvements when underutilizing
the resources on clusters of instances. In the rest of this subsection we show that
not only does underutilization decrease the overall time to solution for computations
across multiple nodes, sometimes adding extra nodes also increases performance and
decreases marginal cost.
We analyze the performance results on multiple nodes using the average execution
time. In order to determine the average, we conduct 7–29 and 3–11 sample runs on
Intel and AMD instances, respectively. The AMD instances have fewer runs because
those instances became increasingly difficult to allocate starting in the Fall of 2009.
We believe that those nodes were replaced by nodes with Intel processors in the
availability zone that we used.
To demonstrate the effects of contention among clusters of multi-tenant VMs, in Fig-
ures 4.18 to 4.21 we present the minimum and average execution times for different
instance types, varying the numbers of nodes allocated and cores used. For each in-
stance type we determined the minimum number of nodes needed to host a problem of
size n = 50 k – HPL requires around 19GB of RAM for the input/output matrix. As
an example, in the case of c1.xlarge instances (7GB of RAM per node), three nodes
are required, and thus a maximum of 24 cores can be used. This minimum number
of nodes, fully utilized, would typically provide the best performance in an unshared
standard cluster environment. We introduce a term “total cores” which corresponds to
the number of cores on those required nodes. Total cores is used as the upper bound
for the number of cores in the experiments. In other words, we allow the use of extra
96
4.3 Multiple Nodes Scaling
 0
 2000
 4000
 6000
 8000
 10000
 12000
 1  2  3  4
T
im
e 
(s
ec
s)
Number of cores
Minimum 2 nodes
Average 2 nodes
Minimum 3 nodes
Average 3 nodes
Figure 4.19 Average execution time of repeated HPL on m1.xlarge (AMD) by number of
cores used per each node.
 0
 1000
 2000
 3000
 4000
 5000
 6000
 1  2  3  4
T
im
e 
(s
ec
s)
Number of cores
Minimum 2 nodes
Average 2 nodes
Minimum 3 nodes
Average 3 nodes
Figure 4.20 Average execution time of repeated HPL on m1.xlarge (Intel) by number of
cores used per each node.
(underutilized) nodes, as long as the number of cores involved does not exceed total
cores.
The execution times for the m1.large instances (see Figure 4.18), which have two
cores per node, improve when using both cores. In contrast, the figures show that
on nodes with four or more cores, similarly to the single node evaluation, the av-
erage performance is substantially worse than the best performance. The variance
on multiple nodes is smaller than the variance on single nodes, possibly because a
97
Chapter 4 High-Performance Computing on Clouds
 0
 1000
 2000
 3000
 4000
 5000
 6000
 7000
 1  2  4  6  8
T
im
e 
(s
ec
s)
Number of cores
Minimum 3 nodes
Average 3 nodes
Minimum 4 nodes
Average 4 nodes
Figure 4.21 Average execution time of repeated HPL on c1.xlarge (Intel) by number of cores
used per each node.
conspicuous fraction of the execution time is spent on network overhead. The benefit
of underutilization – that contention on memory is reduced – appears to extend to
clusters of nodes.
Although the interconnects used by Amazon are not specified, it is conjectured
that nodes are typically connected with Gigabit Ethernet [88]. It is not obvious that
while using such a slow interconnect (compared to specialized HPC interconnects)
the different CPU performance obtained with underutilization would have a tangible
effect on the overall performance or whether the variation in CPU performance will
have a significant effect on network performance. However, Figures 4.18 to 4.21 show
that underutilization is still effective for cluster computations. Moreover, using extra
underutilized nodes can achieve better results.
Figure 4.21 demonstrates clearly that the expected execution time of HPL is lower
using a 4 node cluster than using a 3 node cluster – the degree depends upon the
level of underutilization. Note that the fastest expected execution time is obtained
using 50% of a 4 node cluster, which is slightly lower than using 50% or more of a 3
node cluster even though the data easily fits within the main memory of the 3 node
cluster. The 50% utilization is unsurprising because it was also optimal for the average
execution time in the single node experiments. On an unshared cluster with a slow
interconnect such situation would not typically occur, meaning 3 nodes would perform
better than 4 due to the reduced network overhead. The extra network overhead
incurred on EC2 using 4 nodes is more than compensated by the extra parallelism.
However, the computation time does not decrease linearly with the number of nodes;
we infer that the extra network overhead remains significant.
On a fixed cluster infrastructure, higher efficiency for each job increases the number
of jobs that can be completed in a given time. Increasing overall throughput indirectly
reduces costs per job since the cost of the cluster is typically fixed. On commercial
98
4.3 Multiple Nodes Scaling
 0
 0.1
 0.2
 0.3
 0.4
 0.5
 0.6
 1  2  3  4
A
v
er
ag
e 
co
st
 p
er
 G
F
L
O
P
 (
$
)
Number of cores
2 nodes
3 nodes
Figure 4.22 Average cost per GFlop ($/GFlop) prorated to actual time spent for m1.xlarge
(AMD) by number of cores used per each node.
 0
 0.02
 0.04
 0.06
 0.08
 0.1
 0.12
 0.14
 0.16
 0.18
 1  2  3  4
A
v
er
ag
e 
co
st
 p
er
 G
F
L
O
P
 (
$
)
Number of cores
2 nodes
3 nodes
Figure 4.23 Average cost per GFlop ($/GFlop) prorated to actual time spent for m1.xlarge
(Intel) by number of cores used per each node.
clouds instead, each computation can be priced individually. We empirically studied
the costs of using EC2 as determined by the execution time and instance types in our
experiments. Figures 4.22 to 4.24 provide the average cost for different instance types
by utilized cores per node. To illustrate the marginal costs incurred when allocating a
cluster to solve several problems, we calculate an average cost per GFlop as the ratio
between the (prorated) average cost to solution and the total GFlops achieved by
the computation. This measure allows a rough estimation of expected cost for other
99
Chapter 4 High-Performance Computing on Clouds
 0
 0.05
 0.1
 0.15
 0.2
 0.25
 0.3
 0.35
 0.4
 1  2  4  6  8
A
v
er
ag
e 
co
st
 p
er
 G
fl
o
p
 (
$
)
Number of cores
3 nodes
4 nodes
Figure 4.24 Average cost per GFlop ($/GFlop) prorated to actual time spent for c1.xlarge
(Intel) by number of cores used per each node.
problem sizes including other scientific applications that have similar computational
needs.
Since marginal cost is directly proportional to execution time, we include the stan-
dard deviation as well as the expected cost. Figures 4.23 and 4.24 show that the
underutilization of resources is not only a beneficial approach from the time to solu-
tion perspective, but it is also the most cost-effective approach. For instance, on the
m1.xlarge (Intel) instances it is more cost-effective to use only 75% of the available
resources: the average cost is 18% less than when using all the cores. The results
on the c1.xlarge instances are even more dramatic: it is more than 3 times cheaper
to use 50% of the resources than to utilize them all. In fact, for a 4 node cluster,
any underutilization, even 12.5%, is more effective than using all the cores of each
machine.
There is clearly a tradeoff between price and speed in our EC2 cluster experiments.
To demonstrate this tradeoff, we plot time to solution versus prorated cost for the
HPL cluster evaluation. Figure 4.25 refers to the c1.xlarge instance, and Figure 4.26
combines the results from all instance types. In these graphs, closer to the origin is
better. For all instance types except c1.xlarge, using fewer nodes is cheapest; for the
c1.xlarge instance, an additional node can be added to get a noticeable speedup at
the same cost. In this case, the extra parallelism speeds up the computation enough
to offset the extra cost of the node. This is quite different from an unshared cluster
with fixed costs. Figure 4.26 shows that for our problem size underutilizing (by 50%)
a c1.xlarge instances with 4 nodes is the fastest and most cost effective strategy.
This strategy differs significantly from the expectation on an unshared cluster where
performance is more stable and using fewer nodes is typically the optimal strategy.
100
4.3 Multiple Nodes Scaling
 0
 2000
 4000
 6000
 8000
 10000
 12000
 0  0.5  1  1.5  2  2.5  3  3.5  4  4.5
T
im
e 
(s
ec
s)
Cost ($)
3 nodes
4 nodes
5 nodes
6 nodes
1 core  (12.5%)
2 cores   (25%)
4 cores   (50%)
6 cores   (75%)
8 cores (100%)
Figure 4.25 Time to solution versus prorated cost for c1.xlarge (Intel).
 0
 2000
 4000
 6000
 8000
 10000
 12000
 0  0.5  1  1.5  2  2.5  3  3.5  4  4.5
T
im
e 
(s
ec
s)
Cost ($)
m1.large (Intel)
m1.xlarge (Intel)
m1.xlarge (AMD)
c1.xlarge (Intel)
Figure 4.26 Time to solution versus prorated cost for different instance types.
101
Chapter 4 High-Performance Computing on Clouds
4.4 Summary
In this chapter we have empirically demonstrated the computational efficiency of
high-performance numerical applications in a commercial cloud environment when
resources are shared under high contention. Through a case study using the DGEMM
kernel and the HPL benchmark, we showed that cache utilization becomes highly
unpredictable and similarly affects computation time. For some problems, it is not
only more efficient to underutilize resources, but the solution can be reached much
earlier in real-time (wall-time). We also showed that the smallest, cheapest (64-bit)
instance on the Amazon EC2 commercial cloud environment is not only the fastest,
but also the best for price to performance ratio.
We investigated in depth the beneficial effects of underutilizing resources for high-
performance computing on Amazon EC2. And, we showed that underutilization of
resources can considerably improve performance. For instance, for some cluster config-
urations, the solution was reached much faster on average when the available resources
were underutilized. The improvement in performance for single node computations
are even more impressive: underutilization improved the expected execution time by
two orders of magnitude. Finally, extending underutilized clusters by adding more
nodes can often improve the execution time by exploiting more parallelism. In the
best case in our experiments, the execution time was improved enough to entirely
offset the cost of the extra node in the cluster, achieving faster computation at the
same cost using extra underutilizes nodes.
We presented average performance, average execution time, cost to solution, and
variance measures – traditionally ignored in the high-performance computing context
– to determine the efficiency and performance of the Amazon EC2 commercial cloud
environment where multiple VMs share a single hardware node. Under virtualization
in this competitive environment, the average performance diverges by an order of
magnitude from the best achieved performance. We conclude that there is significant
space for improvement in providing predictable performance in such environments.
Further, adaptive libraries that dynamically adjust utilized resources in the shared
VM environment have the potential to significantly increase performance.
Remarks: After we finished with the study and presented the results in a book
chapter [6] and in articles [49, 68, 69], in July 2010 Amazon released new cluster
compute instances with single user access. In addition, they are equipped with brand-
new processors and fast interconnect. For instance, Cluster Compute Quadruple Extra
Large (cc1.4xlarge) instance consists of two Intel Xeon X5570 (quad-core Nehalem
architecture) processors operating at a frequency of 2.93GHz with a total memory
of 23Gb; those nodes are connected with 10 Gigabit Ethernet. We observed that
cc1.4xlarge instances deliver performance (best and expected) comparable to standard
clusters, which makes them suitable candidates for high-performance computing.
102
Chapter 5
Conclusions
In this dissertation we addressed two research problems: 1. Performance modeling
and prediction for dense matrix computations; 2. Investigation of the effects of shared
resources for high-performance computing on commercial cloud environments.
• Performance modeling and prediction. We presented two techniques for
modeling and predicting performance of dense matrix computations: the first
is based on sampling the building blocks, while the second requires neither the
execution of the algorithms nor their building blocks. Although the performance
prediction through measurements was our initial strategy (see Section 2.3 for
details), our main focus was on the execution-less modeling. We modeled per-
formance of algorithms through counting the number of Flops performed and
modeling the number of cache misses occurred. The latter is the major factor in
the computation time of memory-bound algorithms on which we focused. Given
the algorithm structure, we built a general analytical model for estimating the
number of cache misses. However, the modeled cache misses differed consider-
ably from the actual measurements. In order to improve the predicted results,
we tailored the model for the target architecture by taking into account the spe-
cific features of the architecture such as software and hardware prefetching. As
a result, we obtained highly accurate predictions in terms of both cache misses
and execution time. The tailored model, however, became strongly architecture
dependent and, thus, lost the generality. Therefore, there is a tradeoff between
the desired accuracy of the model and its generality. Detailed conclusions re-
garding the execution-less performance modeling can be found in Section 3.7.
• High-performance computing on clouds. We investigated the possibility of
using commercial cloud environments for high-performance computing as an al-
ternative to standard clusters. In order to examine the computational efficiency
of compute-intensive scientific applications on clouds, we used the DGEMM ker-
nel and the LINPACK benchmark. We observed that in terms of minimum exe-
cution time, which corresponds to 10% of all executions, commercial clouds can
deliver nearly the same performance as standard clusters. However, expected
performance (average execution time) on clouds can be quite poor due to high
contention for resources. In fact, when the number of utilized cores/instance in-
creased1, the expected performance deteriorated significantly. Because of this,
we suggested to underutilize the allocated resources. Such an underutilization
1It is only possible to pay per allocated instance, but not per utilized cores.
103
Chapter 5 Conclusions
considerably improved the expected performance by reducing contention for the
CPU and cache space. As a result, for some cluster configurations, the solu-
tions were reached almost an order of magnitude earlier. Detailed conclusions
concerning the use of a commercial cloud environment for high-performance
computing can be found in Section 4.4.
104
Appendix A
Measuring Memory Latency
In order to measure the memory latency we use the lat_mem_ld utility [63] from the
LMBench tool [64]. lat_mem_ld is executed from a command line as
$ taskset 0x1 ./lat_mem_rd -N [x] -P [y] [depth] [stride],
where taskset is used to set the CPU affinity of a current process with a given CPU
affinity; 0x1 means that the first CPU is used by the utility for each individual run.
[x] equals the number of times the process is run before reporting the latency; it
was set to 3, which is sufficient for the accurate measurements. For the -P option,
[y] equals the number of processes invoked in the execution of the benchmark. It is
recommended to measure the access latency with only one processing core or thread,
therefore [y] should always be 1. The [depth] indicates how far into memory the
utility will access. To ensure the satisfactory accuracy, a specified amount should go
far enough beyond the cache so that it does not impact on the latency measurements.
Finally, [stride] is the distance in memory between reads. If [stride] is not large
enough, hardware prefetching of data can be involved providing misleading results for
all levels of memory. The utility by default uses [stride] of size 128 bytes. If the
measured numbers are unrealistic, it is better to increase the stride of the command
to avoid the data prefetching. Experiments report that the stride should be between
64 and 512 bytes [79].
Table A.1 The results of measuring the memory latency by lat_mem_ld on Penryn.
Accessed memory [Mb] Time [ns]
0.00049 1.216
0.00098 1.214⋮ ⋮
0.02344 1.229
0.03125 1.214
0.04688 6.076
0.06250 6.069⋮ ⋮
192.00000 79.687
256.00000 79.893
The results shown on Figure 3.27 (Subsection 3.5.1) were achieved by executing the
utility as follow
105
Appendix A Measuring Memory Latency
$ taskset 0x1 ./lat_mem_rd -N 3 -P 1 256M 128.
Thus, lat_mem_rd was bent to the first (of 2) CPU on Penryn (architecture speci-
fication can be found in Subsection 2.2.1). The measurements were repeated three
times accessing up to 256Mb of memory with a stride of 128 bytes. The output of
the measurements is shown by two columns, see Table A.1: the first column points
at a memory level where the measurements happened; the second column presents
the latency in nanoseconds. Thus, the utility runs through each level of memory on
Penryn reporting the latency of each of them.
106
Appendix B
Measuring Flop Time
Listing B.1 Measuring the time of one double precision floating point operation.
1 #include <stdio.h>
2 #include <unistd.h>
3
4 // external vector length
5 #define LEN 8
6 // internal vector length
7 #define VLEN 2
8 #define MAX_ITER 100000000
9
10 typedef union {
11 #if VLEN > 1
12 double __attribute__ (( vector_size(sizeof(double) * VLEN))) v;
13 #else
14 double v;
15 #endif
16 double d[VLEN];
17 } vdouble;
18
19 int main() {
20 int i, j, k;
21 vdouble a[LEN], b[LEN], c[LEN];
22
23 // initialize
24 for (j = 0; j < LEN; j++)
25 for (k = 0; k < VLEN; k++) {
26 a[j].d[k] = 0.0;
27 b[j].d[k] = 1.0 + 1e-6;
28 c[j].d[k] = 1e-6;
29 }
30
31 long long int ticks_start , ticks_end;
32 rdtscll(ticks_start);
33 for (i = 0; i < MAX_ITER; i++)
34 for (j = 0; j < LEN; j++) {
35 a[j].v += b[j].v * c[j].v;
36 b[j].v *= c[j].v;
37 }
38 rdtscll(ticks_end);
39
40 return 0;
41 }
Listing B.1 shows the code implementation1 in C programming language [58] for
measuring the Flop time using SSE vector instructions. The code performs update
of two vectors in a double loop. At each iteration of the loop, two operations, which
are often occur in linear algebra algorithms, are applied: the product of two scalars
1Thanks to my colleague Elmar Peise, who kindly provided the above-listed source code.
107
Appendix B Measuring Flop Time
(β = β ∗ γ) and the accumulation of the product (α = α + β ∗ γ). In order to benefit
from the SSE instructions (perform two operations at once), variables LEN and VLEN
are used to fill the processor’s pipeline. LEN and VLEN are processor dependent – their
product corresponds to the length of the vectors and the number of double precision
floating point values that can be kept in the CPU registers during the computation.
For instance, LEN=8 and VLEN=2 are chosen on Penryn. To ensure the accuracy of the
measurements, the cycle accurate timer is used and the number of execution is set to
108. Compared to the implementation without SSE vector instruction, the execution
time is now reduced by nearly 50%. As a result, we reach 99% of the theoretical peak
performing up to four floating point operations per cycle.
108
Bibliography
[1] Dieter an Mey, Christian Terboven, Paul Kapinos, Dirk Schmidl, Sandra Wienke,
Tim Cramer, and Michael Wirtz. The RWTH HPC-Cluster User’s Guide: Ver-
sion 8.2.1, February 2012. Available via the WWW. Cited 22 March 2012. http:
//www.rz.rwth-aachen.de/ca/c/pil/.
[2] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Ham-
merling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: a portable linear algebra
library for high-performance computers. In Proceedings of the 1990 ACM/IEEE con-
ference on Supercomputing, Supercomputing’90, pages 2–11. IEEE Computer Society
Press, 1990.
[3] Antoine Petitet, R. Clint Whaley, Jack Dongarra and Andrew Cleary. HPL – a
portable implementation of the high-performance linpack benchmark for distributed-
memory computers, September 2008. Available via the WWW. Cited 19 January 2012.
http://www.netlib.org/benchmark/hpl/.
[4] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Timothy L. Harris, Alex Ho,
Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In
SOSP, pages 164–177, 2003.
[5] Paolo Bientinesi, Brian Gunter, and Robert A. van de Geijn. Families of algorithms
related to the inversion of a Symmetric Positive Definite matrix. ACM Trans. Math.
Softw., 35(1):1–22, 2008.
[6] Paolo Bientinesi, Roman Iakymchuk, and Jeff Napper. HPC on Competitive Cloud Re-
sources. In Borko Furht and Armando Escalante, editors, Handbook of Cloud Computing,
pages 493–516. Springer, 2010.
[7] Paolo Bientinesi, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. Representing
linear algebra algorithms in code: The FLAME application programming interfaces.
ACM Transactions on Mathematical Software, 31(1):27–59, March 2005.
[8] Paolo Bientinesi and Robert A. van de Geijn. Representing dense linear algebra algo-
rithms: A farewell to indices. FLAMEWorking Note #17. Technical Report TR-2006-10,
The University of Texas at Austin, Department of Computer Sciences, February 2006.
[9] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. Optimizing matrix
multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology.
In Proceedings of the 11th international conference on Supercomputing, ICS ’97, pages
340–347, Vienna, Austria, 1997. ACM.
[10] Jeff Bilmes, Krste Asanović, Jim Demmel, Dominic Lam, and CheeWhye Chin. PHiPAC:
A portable, high-performance, ANSI C coding methodology and its application to matrix
multiply. LAPACK working note 111, University of Tennessee, 1996.
[11] Simone Browne, Jack Dongarra, Nancy Garner, Kevin London, and Phil Mucci. A
portable programming interface for performance evaluation on modern processors. The
International Journal of High Performance Computing Applications, 14:189–204, 2000.
[12] Randal E. Bryant and David R. O’Hallaron. Computer Systems – a Programmers Per-
spective. Pearson Education, 2011.
109
Bibliography
[13] Anthony M. Castaldo and R. Clint Whaley. Scaling lapack panel operations using parallel
cache assignment. In PPoPP’08: Proceedings of the 15th ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming, Bangalore, India, pages 223–231.
ACM, 2010.
[14] Shannon Cepeda. What you need to know about prefetching, August 2009. Available via
the WWW. Cited 19 January 2012. http://software.intel.com/en-us/blogs/2009/
08/24/what-you-need-to-know-about-prefetching/.
[15] Ernie Chan, Robert van de Geijn, and Andrew Chapman. Managing the complexity of
lookahead for lu factorization with pivoting. In SPAA’10: Proceedings of the 22nd ACM
symposium on Parallelism in algorithms and architectures, Thira, Santorini, Greece,
pages 200–208, 2010.
[16] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Grego-
rio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime
scheduling system for algorithms-by-blocks. In PPoPP’08: Proceedings of the 13th ACM
SIGPLAN Symposium on Principles and practice of parallel programming, Salt Lake
City, UT, USA, pages 123–132. ACM, 2008.
[17] J. M. Charnes, Sabri Pllana, and Thomas Fahringer. UML based modeling of perfor-
mance oriented parallel and distributed applications. In Proceedings of the 2002 Winter
Simulation Conference, pages 8–11. IEEE Press, 2002.
[18] Siddhartha Chatterjee, Erin Parker, Philip J. Hanlon, and Alvin R. Lebeck. Exact
analysis of the cache behavior of nested loops. In Proceedings of the ACM SIGPLAN
2001 conference on Programming language design and implementation, Snowbird, Utah,
USA, PLDI ’01, pages 286–297. ACM, 2001.
[19] Javier Cuenca, Domingo Giménez, and José González. Modeling the behaviour of linear
algebra algorithms with message-passing. In Proceedings of the Euromicro Workshop on
Parallel and Distributed Processing, Mantova, Italy, pages 282–289. IEEE Press, 2001.
[20] Javier Cuenca, Domingo Giménez, and José González. Towards the design of an au-
tomatically tuned linear algebra library. In Proceedings of the Euromicro Workshop on
Parallel and Distributed Processing, Canary Islands, Spain, pages 201–208. IEEE Press,
2002.
[21] Javier Cuenca, Domingo Giménez, and José González. Architecture of an automatically
tuned linear algebra library. Parallel Comput., 30(2):187–210, February 2004.
[22] Javier Cuenca, Domingo Giménez, José González, Jack Dongarra, and Kenneth Roche.
Automatic optimisation of parallel linear algebra routines in systems with variable load.
In Proceedings of the Euromicro Workshop on Parallel and Distributed Processing, Gen-
ova, Italy, pages 409–416. IEEE Press, 2003.
[23] Franck Delattre and Marc Prieur. Intel Core 2 Duo – test, July 2006. Available via
the WWW. Cited 19 January 2012. http://www.behardware.com/articles/623-1/
intel-core-2-duo-test.html.
[24] The Valgrind Developers. Cachegrind: a cache and branch-prediction profiler. Avail-
able via the WWW. Cited 19 January 2012. http://valgrind.org/docs/manual/
cg-manual.html.
[25] The Valgrind Developers. Callgrind: a call-graph generating cache and branch prediction
profiler. Available via the WWW. Cited 19 January 2012. http://valgrind.org/docs/
manual/cl-manual.html.
110
Bibliography
[26] The Valgrind Developers. Valgrind: an instrumentation framework for building dynamic
analysis tools. Available via the WWW. Cited 19 January 2012. http://valgrind.org/.
[27] Jack Dongarra, Robert van de Geijn, and David Walker. Scalability issues affecting the
design of a dense linear algebra library. Journal of Parallel and Distributed Computing,
22(3):523–537, September 1994.
[28] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic
linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–17,
March 1990.
[29] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. Algo-
rithm 656: An extended Set of Basic Linear Algebra Subprograms: Model Implemen-
tation and Test Programs. ACM Transactions on Mathematical Software, 14(1):18–32,
March 1988.
[30] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An
extended set of FORTRAN basic linear algebra subprograms. ACM Transactions on
Mathematical Software, 14(1):1–17, March 1988.
[31] Diego Fabregat-Traver and Paolo Bientinesi. Automatic generation of loop-invariants
for matrix operations. In Proceedings of the 2011 International Conference on Compu-
tational Science and Its Applications, ICCSA ’11, pages 82–92. IEEE Computer Society,
2011.
[32] Diego Fabregat-Traver and Paolo Bientinesi. Knowledge-based automatic generation of
partitioned matrix expressions. In Vladimir Gerdt, Wolfram Koepf, Ernst Mayr, and
Evgenii Vorozhtsov, editors, Computer Algebra in Scientific Computing, volume 6885 of
Lecture Notes in Computer Science, pages 144–157. Springer Berlin / Heidelberg, 2011.
[33] Basilio B. Fraguela, Ramon Doallo, and Emilio L. Zapata. Automatic analytical modeling
for the estimation of cache misses. In Proceedings of the 1999 International Conference
on Parallel Architectures and Compilation Techniques, PACT ’99, pages 221–232. IEEE
Computer Society, 1999.
[34] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Pro-
ceedings of the IEEE, 93(2):216–231, 2005. Special issue on “Program Generation, Op-
timization, and Platform Adaptation”.
[35] Luis-Pedro García, Javier Cuenca, and Domingo Giménez. Using experimental data
to improve the performance modelling of parallel linear algebra routines. In Parallel
Processing and Applied Mathematics, volume 4967 of Lecture Notes in Computer Science,
pages 1150–1159. Springer Berlin / Heidelberg, 2008.
[36] Chris Gemignani and Peter Skomoroch. Elasticwulf: Beowulf cluster run on amazon
EC2. Available via the WWW. Cited 23 March 2010. http://code.google.com/p/
elasticwulf/.
[37] Gene H. Golub and Charles F. Van Loan. Matrix computations, Third Edition. Johns
Hopkins University Press, 1996.
[38] Kazushige Goto. The GotoBLAS library. Available via the WWW. Cited 19 January
2012. http://www.tacc.utexas.edu/tacc-projects/gotoblas2.
[39] Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix
multiplication. ACM Trans. Math. Softw., 34(3), May 2008.
[40] Kazushige Goto and Robert A. van de Geijn. High-performance implementation of the
level-3 BLAS. ACM Trans. Math. Softw., 35(1), July 2008.
111
Bibliography
[41] The Numerical Algorithms Group. The NAG Fortran Library. Available via the WWW.
Cited 22 March 2012. http://www.nag.com/numeric/fl/nagdoc_fl23/html/GENINT/
news.html.
[42] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn.
FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathe-
matical Software, 27(4):422–455, December 2001.
[43] R. J. Hanson, F. T. Krogh, and C. L. Lawson. Improving the efficiency of portable
software for linear algebra. SIGNUM Newsl., 8(4):16–16, October 1973.
[44] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative
Approach, Fifth Edition. Morgan Kaufmann, 2011.
[45] ServePath Dedicated Hosting. GoGrid cloud hosting. Available via the WWW. Cited
23 March 2010. http://gogrid.com.
[46] Roman Iakymchuk. Performance prediction through time measurements. In Proceedings
of the 1st International Conference on High-Performance Computing, HPC-UA 2011,
pages 26–32, 2011.
[47] Roman Iakymchuk and Paolo Bientinesi. Execution-less performance modeling. In Pro-
ceedings of the second international workshop on Performance modeling, benchmarking
and simulation of high performance computing systems, PMBS ’11, pages 11–12. ACM,
2011.
[48] Roman Iakymchuk and Paolo Bientinesi. Modeling performance through memory-stalls.
ACM SIGMETRICS Performance Evaluation Review, 40(2), 2012.
[49] Roman Iakymchuk, Jeff Napper, and Paolo Bientinesi. Improving high-performance
computations on clouds through resource underutilization. In Proceedings of the 2011
ACM Symposium on Applied Computing, SAC ’11, pages 119–126. ACM, 2011.
[50] AMD Inc. AMD Core Math Library (ACML). Available via the WWW. Cited 22 March
2012. http://www.amd.com/acml.
[51] IBM Inc. Engineering and Scientific Subroutine Library (ESSL). Available via the
WWW. Cited 22 March 2012. http://www-03.ibm.com/systems/p/software/essl/
index.html.
[52] Intel Inc. Intel Math Kernel Library (MKL). Available via the WWW. Cited 22 March
2012. http://software.intel.com/en-us/intel-mkl/.
[53] NVIDIA Inc. CUDA Basic Linear Algebra Subroutines (cuBLAS) library. Available via
the WWW. Cited 22 March 2012. http://developer.nvidia.com/cublas.
[54] Oracle Inc. Sun Performance Library. Available via the WWW. Cited 22 March 2012.
http://docs.oracle.com/cd/E24457_01/html/E21987/gkezy.html.
[55] Ravi Iyer, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin,
Lisa Hsu, and Steve Reinhardt. Qos policies and architecture for cache/memory in cmp
platforms. SIGMETRICS Performance Evaluation Review, 35(1):25–36, 2007.
[56] Jack Dongarra, Kevin London, Shirley Moore, Phil Mucci and Dan Terpstra. Using
PAPI for hardware performance monitoring on linux systems. In Conference on Linux
Clusters: The HPC Revolution, Linux Clusters Institute, 2001.
[57] Kate Keahey, Tim Freeman, Jerome Lauret, and Doug Olson. Virtual Workspaces for
Scientific Applications. In SciDAC 2007 Conference, June 2007.
112
Bibliography
[58] Brian W. Kernighan and Dennis Ritchie. The C Programming Language, Second Edition.
Prentice-Hall, 1988.
[59] Argonne National Laboratory. MPICH2: High-performance and widely portable MPI.
Available via the WWW. Cited 19 January 2012. http://www.mcs.anl.gov/research/
projects/mpich2/.
[60] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra
subprograms for Fortran usage. ACM Transactions on Mathematical Software, 5(3):308–
323, September 1979.
[61] Ali Mahjur and Amir H. Jahangir. Two-phase prediction of L1 data cache misses. In
IEE Proceedings, Computers and Digital Techniques, volume 153(6), pages 381–388,
November 2006.
[62] Yury Malich. AMD K10 micro-architecture, August 2007. Available via the WWW.
Cited 19 January 2012. http://www.xbitlabs.com/articles/cpu/display/amd-k10_
9.html.
[63] Larry McVoy and Carl Staelin. lat_mem_rd – memory read latency benchmark. Avail-
able via the WWW. Cited 19 January 2012. http://www.bitmover.com/lmbench/lat_
mem_rd.8.html.
[64] Larry McVoy and Carl Staelin. LMbench – Tools for Performance Analysis. Available
via the WWW. Cited 19 January 2012. http://www.bitmover.com/lmbench/.
[65] Edson Midorikawa, Helio Oliveira, and Jean Laine. PEMPIs: A New Methodology
for Modeling and Prediction of MPI Programs Performance. International Journal of
Parallel Programming, 33(5):499–527, 2005.
[66] Phil Mucci and the ICL Group. Performance Application Programming Interface (PAPI).
Available via the WWW. Cited 19 January 2012. http://icl.cs.utk.edu/papi/.
[67] Jeff Napper and Paolo Bientinesi. Can cloud computing reach the TOP500? In Uncon-
ventional High-Performance Computing (UCHPC), May 2009.
[68] Jeff Napper, Roman Iakymchuk, and Paolo Bientinesi. Crowded Clouds. HPCwire,
August 2010. Available via the WWW. Cited 22 March 2012. http://www.hpcwire.
com/news/Crowded-Clouds-100587449.html.
[69] Jeff Napper, Roman Iakymchuk, and Paolo Bientinesi. HPC Sharing in the Cloud. HPC
in the Cloud, August 2010. Available via the WWW. Cited 22 March 2012. http://
www.hpcinthecloud.com/hpccloud/2010-08-12/hpc_sharing_in_the_cloud.html.
[70] Nicholas Nethercote, Robert Walsh, and Jeremy Fitzhardinge. Invited tutorial: Building
Workload Characterization Tools with Valgrind. In IEEE International Symposium on
Workload Characterization (IISWC 2006), San Jose, California, USA, page 2, October
2006.
[71] Netlib.org. The reference BLAS library. Available via the WWW. Cited 19 January
2012. http://www.netlib.org/blas/.
[72] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia
Youseff, and Dmitrii Zagorodnov. The Eucalyptus open-source cloud-computing sys-
tem. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster
Computing and the Grid, CCGRID ’09, pages 124–131, 2009.
[73] University of Chicago. Nimbus Science Clouds. Available via the WWW. Cited 23 March
2010. http://www.nimbusproject.org/.
113
Bibliography
[74] David A. Patterson and John L. Hennessy. Computer Organization and Design: The
Hardware / Software Interface, Revised Fourth Edition, Fourth Edition. Morgan Kauf-
mann, 2011.
[75] Fabrizio Petrini, Darren J. Kerbyson, and Scott Pakin. The case of the missing su-
percomputer performance: Achieving optimal performance on the 8,192 processors of
ASCI Q. In SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing,
page 55, Washington, DC, USA, 2003. IEEE Computer Society.
[76] Aashish Phansalkar and Lizy K. John. Performance prediction using program similarity.
In Proceedings of the 2006 SPEC Benchmark Workshop, January 2006.
[77] Sabri Pllana and Thomas Fahringer. Performance Prophet: a performance modeling
and prediction tool for parallel and distributed programs. In Proceedings of the 2005
International Conference on Parallel Processing Workshops, pages 509–516. IEEE Press,
2005.
[78] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso,
Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang
Chen, Robert W. Johnson, and Nicholas Rizzolo. SPIRAL: Code generation for DSP
transforms. Proceedings of the IEEE, special issue on ”Program Generation, Optimiza-
tion, and Adaptation”, 93(2):232– 275, 2005.
[79] Joshua Ruggiero. Measuring Cache and Memory Latency and CPU to Memory Band-
width. Intel White Papers, pages 1–14, December 2008.
[80] Amazon Web Services. Amazon Elastic Compute Cloud (EC2). Available via the WWW.
Cited 13 May 2010. http://aws.amazon.com/ec2.
[81] SPIRAL Team. SPIRAL: Software/Hardware Generation for DSP Algorithms. Available
via the WWW. Cited 22 March 2012. http://spiral.net/index.html.
[82] Gilbert W. Stewart. Matrix Algorithms: Basic decompositions. Miscellaneous Titles in
Applied Mathematics Series. Society for Industrial and Applied Mathematics, 1998.
[83] Jörg Strebel and Alexander Stage. An economic decision model for business software
application deployment on hybrid cloud environments. In Matthias Schumann, Lutz M.
Kolbe, and Michael H. Breitner, editors, Tagungsband Multikonferenz Wirtschaftsinfor-
matik, 2010.
[84] GCC Team. GCC, the GNU Compiler Collection. Available via the WWW. Cited 19
January 2012. http://gcc.gnu.org/.
[85] Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong Ong, Christian Engelmann,
and Stephen L. Scott. An Analysis of HPC Benchmarks in Virtual Machine Envi-
ronments. In Eduardo César, Michael Alexander, Achim Streit, Jesper Larsson Träff,
Christophe Cérin, Andreas Knüpfer, Dieter Kranzlmüller, and Shantenu Jha, editors,
Euro-Par 2008 Workshops - Parallel Processing, pages 63–71. 2009.
[86] TOP500.Org. Top 500 supercomputer sites. Available via the WWW. Cited 19 January
2012. http://www.top500.org/.
[87] Robert A. van de Geijn and Enrique S. Quintana Ortí. The Science of Programming
Matrix Computations. www.lulu.com, 2008.
[88] Edward Walker. Benchmarking Amazon EC2. ;LOGIN:, pages 18–23, October 2008.
[89] Guohui Wang and Eugene Ng. The impact of virtualization on network performance of
Amazon EC2 data center. In INFOCOM ’10: Proceedings of the 2010 IEEE Conference
on Computer Communications. IEEE Communication Society, 2010.
114
Bibliography
[90] David S. Watkins. Fundamentals of Matrix Computations, Second Edition. Wiley-
Interscience, 2002.
[91] R. Clint Whaley and Anthony M. Castaldo. Achieving accurate and context-sensitive
timing for code optimization. Softw., Pract. Exper., pages 1621–1642, 2008.
[92] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In
Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), Super-
computing ’98, pages 1–27. IEEE Computer Society, 1998.
[93] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated Empirical Opti-
mization of Software and the ATLAS Project. Parallel Computing, 27(1-2):3–35, 2001.
[94] Xingfu Wu. Performance Evaluation, Prediction and Visualization of Parallel Systems.
Kluwer Academic Publisher, 1999.
[95] xcalibre communications ltd. FlexiScale cloud computing. Available via the WWW.
Cited 23 March 2010. http://www.flexiscale.com.
[96] Kamen Yotov, Xiaoming Li, Gang Ren, María Jesús Garzarán, David Padua, Keshav
Pingali, and Paul Stodghill. Is Search Really Necessary to Generate High-Performance
BLAS? Proceedings of the IEEE, 93(2):358–386, February 2005.
[97] Lamia Youseff, Keith Seymour, Haihang You, Jack Dongarra, and Rich Wolski. The
impact of paravirtualized memory hierarchy on linear algebra computational kernels
and software. In HPDC ’08: Proceedings of the 17th international symposium on High
performance distributed computing, pages 141–152, 2008.
115
116
Lebenslauf
Persönliche Daten
Name Iakymchuk
Vorname Roman
Geburtstag 27.11.1985
Geburtsort Lemberg (Lviv), Ukraine
Staatsangehörigkeit Ukrainisch
Qualifikation
2003 Schulabschluss mit Hochschulzugangsberechtigung
2003 bis 2008 Studium der Angewandten Mathematik und Informatik
an der Nationalen Iwan-Franko-Universität Lemberg
Abschluss: Магiстер з прикладної математики
Master of Applied Mathematics
ab 2009 Promotionsstudium an der Graduiertenschule AICES
(Aachen Institute for Advanced Study in Computational
Engineering Science), RWTH Aachen University
117
