On the Role of Search in Generating High-Performance BLAS Libraries by Yotov, Kamen
ON THE ROLE OF SEARCH IN GENERATING
HIGH-PERFORMANCE BLAS LIBRARIES
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulﬁllment of the Requirements for the Degree of
Doctor of Philosophy
by
Kamen Yotov Yotov
January 2006c ￿ 2006 Kamen Yotov Yotov
ALL RIGHTS RESERVEDON THE ROLE OF SEARCH IN GENERATING HIGH-PERFORMANCE
BLAS LIBRARIES
Kamen Yotov Yotov, Ph.D.
Cornell University 2006
A key step in program optimization is the estimation of optimal values for pa-
rameters such as tile sizes and loop unrolling factors. Traditional compilers use
simple analytical models to compute these values. In contrast, library generators
like ATLAS use global search over the space of parameter values by generating pro-
grams with many diﬀerent combinations of parameter values, and running them
on the actual hardware to determine which values give the best performance. It is
widely believed that traditional model-driven optimization cannot compete with
search-based empirical optimization because tractable analytical models cannot
capture all the complexities of modern high-performance architectures. This the-
sis disproves this belief.
In this work we replaced the global search engine in ATLAS with a model-
driven optimization engine, and measured the relative performance of the code
produced by the two systems on a variety of architectures. Our experiments show
that model-driven optimization can be surprisingly eﬀective, and can generate code
with performance comparable to that of code generated by ATLAS using global
search. For improving the code further, we advocate complementing modelling
with local search and model reﬁnement.
Model-driven optimization needs accurate values of hardware parameters. Inthis thesis we also describe X-Ray, a robust framework of micro-benchmark for
measuring such hardware parameters. X-Ray is designed to be extensible to make
it easy to implement new micro-benchmarks. We have developed novel algorithms
for measuring hardware parameters commonly used in optimizing software perfor-
mance, and we have implemented them in X-Ray. We evaluate X-Ray experimen-
tally on traditional workstations and servers as well as on embedded architectures,
and show that it produces more accurate and complete results than existing tools.BIOGRAPHICAL SKETCH
Kamen Yotov was born on 06/03/1977 in Soﬁa, Bulgaria. He spent his early
childhood years in Sopot, Bulgaria. He returned to Soﬁa in 1982 to attend elemen-
tary and secondary school. In 1995 he graduated from the National High-School for
Mathematics and Natural Sciences and joined Soﬁa University, where he received
an M.Sc. degree in Computer Science degree in 1999.
Kamen joined the Ph.D. program in Computer Science at Cornell University
in 2000 and received an M.S. degree in 2003, and a Ph.D. degree in 2006. For the
time being, he does not plan to study more...ever.
Kamen enjoys hiking, cooking, and achieving high performance and productiv-
ity in all aspects of science and life, but even more so in driving cars.
iiiIn memory of my late grandmother Maria Ungarova
ivACKNOWLEDGEMENTS
I am especially grateful to my advisor Keshav Pingali for encouraging me and
standing by me through the long journey. He contaminated me with his scientiﬁc
views of fundamental simplicity, and at the same time, need of great depth in
understanding the issues at hand. Over the years, these views have become my
own, but even today, I cannot quite manage to express the essence of my work
so succinctly and beautifully, as he would. I can only hope that this talent will
develop in me later in my professional carrier. This thesis would not have been
possible without all the hard work we did together. I learned a lot from you in the
process and I continue learning from you to this day!
I would like to extend special thanks to Paul Stodghill for standing beside
me throughout my graduate studies and providing not only academic advice but
also friendly support, encouragement, and strong belief in my abilities. One of the
many insights, which we came across in our Friday afternoon meetings, was central
to the design of the X-Ray system, described in Chapter 4.
I thank the other two members of my committee – Ken Birman and Martin
Burtscher – for being open and ready to give advice whenever I needed it, for
reviewing the draft of this thesis and making constructive suggestions on improving
its content, and for the wonderful classes I took with them (CS514, CS614 and
ECE475, ECE575 respectively).
I thank all other professors in the Computer Science and Electrical and Com-
puting Engineering Departments for showing me computer science in a much
brighter light through their classes. These include (in alphabetical order) Joe
Halpern (CS676), Mark Heinrich (ECE572), Andrew Myers (CS611), and Eva
Tardos (CS681).
vI thank David Padua from UIUC for the great time working with his group and
for the willingness to give honest advice and sense of direction when I needed it.
I thank Fred Gustavson and John Gunnels from IBM T. J. Watson Research
Center for all the interesting conversations and for the motivation and encourage-
ment. I am sure we will do great things together during my post-doctorate work
at IBM.
There are too many other people to thank, who have positively inﬂuenced me in
the last six years, and whom I would be eternally grateful for that. I though I would
mention the following few (in alphabetical order), whose comments and remarks
helped me to stay on track and complete this big step of my life: Gianfranco
Bilardi, Jack Dongarra, Ken Kennedy, Pratap Pattnaik, Marcus Peuschel, and
Clint Whaley.
I thank Becky Stewart, Stephanie Meik, Helene Croft, Rebecca Rich-Goldweber,
Tammy Howe, Cindy Robinson, and Bonnie Maine from the department of Com-
puter Science at Cornell for always being there for me, and never getting tired of
my “special” requests, no matter how diﬀerent or unseen they were.
I thank my colleagues Dan Marques, Greg Bronevetsky, and Tom Roeder for
the interesting conversations, support, and an overall unforgettable experience we
shared as graduate students.
I am especially grateful to my parents, for bringing me to this world, for putting
high hopes in me, and for inﬂuencing my personal and professional development,
so that I can be the person who I am today. Without you, all this could never
happen – even in theory ;)
Finally, I want to express my deepest appreciation and gratitude to my wife
Borislava. She was the one who was on my side for the last eight years, helping
vime and supporting me through the hardest times. She was the one who followed
me in this new world, who gave me my daughter Julianne, and because of whom
I believe family is more important than anything. Thank you!
viiTABLE OF CONTENTS
1 Introduction 1
2 Background 7
2.1 Cache-aware programs and restructuring compilers . . . . . . . . . 7
2.1.1 Legality of Restructuring Transformations . . . . . . . . . . 9
2.1.2 Linear Loop Transformations . . . . . . . . . . . . . . . . . 13
2.1.3 Other Restructuring Transformations . . . . . . . . . . . . . 19
2.1.4 Conventional Compiler Transformations . . . . . . . . . . . 23
2.2 Cache Oblivious Algorithms . . . . . . . . . . . . . . . . . . . . . . 27
3 High-performance Code Generation 32
3.1 Tiling for the L1 data cache . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Tiling for the register ﬁle . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Scalarization of array elements . . . . . . . . . . . . . . . . . . . . . 35
3.4 Pipeline scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Additional details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Measuring Hardware Parameters 42
4.1 Introduction to X-Ray . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 X-Ray Framework Overview . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Nano-benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Nano-benchmark Generator . . . . . . . . . . . . . . . . . . 48
4.2.3 Implementing a new micro-benchmark . . . . . . . . . . . . 49
4.3 CPU Micro-benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 CPU Frequency . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 Instruction Latency . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Instruction Throughput . . . . . . . . . . . . . . . . . . . . 51
4.3.4 Instruction Existence . . . . . . . . . . . . . . . . . . . . . . 53
4.3.5 Number of Registers . . . . . . . . . . . . . . . . . . . . . . 54
4.3.6 SMP and SMT parallelism . . . . . . . . . . . . . . . . . . . 55
4.4 Memory Hierarchy Micro-benchmarks . . . . . . . . . . . . . . . . . 56
4.4.1 Previous Approaches . . . . . . . . . . . . . . . . . . . . . . 57
4.4.2 Compactness of Sequences . . . . . . . . . . . . . . . . . . . 61
4.4.3 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.4 Lower Levels of the Memory Hierarchy . . . . . . . . . . . . 74
4.4.5 Measuring TLB Paramaters . . . . . . . . . . . . . . . . . . 81
4.5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
viii5 Global Search in ATLAS 85
5.1 Estimating hardware parameters . . . . . . . . . . . . . . . . . . . . 85
5.2 Search strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Finding NB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4 Finding MU and NU . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Finding KU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Finding Ls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.7 Finding FF, IF, and NF . . . . . . . . . . . . . . . . . . . . . . . . 89
5.8 Finding NCNB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.9 Finding parameters for clean-up codes . . . . . . . . . . . . . . . . 89
5.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 Model-Driven Optimization 91
6.1 Estimating hardware parameters . . . . . . . . . . . . . . . . . . . . 91
6.2 Modeling strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Estimating NB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.1 Fully associative, optimal replacement, unit line size . . . . . 92
6.3.2 Correcting for non-unit line size . . . . . . . . . . . . . . . . 93
6.3.3 Correcting for LRU replacement policy . . . . . . . . . . . . 94
6.3.4 Correcting to avoid micro-MMM clean-up code . . . . . . . 96
6.3.5 Other cache organizations . . . . . . . . . . . . . . . . . . . 97
6.4 Estimating MU and NU . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5 Estimating KU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 Estimating Ls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.7 Estimating other parameters . . . . . . . . . . . . . . . . . . . . . . 100
6.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7 Model Reﬁnement and Local Search 102
8 Experimental Results 105
8.1 Global search vs. model-driven optimization . . . . . . . . . . . . . 105
8.1.1 DEC Alpha 21264 . . . . . . . . . . . . . . . . . . . . . . . . 109
8.1.2 IBM Power 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.1.3 IBM Power 4 . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.1.4 SGI R12K . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.1.5 Sun UltraSPARC IIIi . . . . . . . . . . . . . . . . . . . . . . 129
8.1.6 Intel Itanium 2 . . . . . . . . . . . . . . . . . . . . . . . . . 134
8.1.7 AMD Opteron 240 . . . . . . . . . . . . . . . . . . . . . . . 141
8.1.8 AMD Athlon MP . . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.9 Intel Pentium III . . . . . . . . . . . . . . . . . . . . . . . . 155
8.1.10 Intel Pentium 4 . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.1.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.2 Model reﬁnement and local search . . . . . . . . . . . . . . . . . . . 167
8.2.1 x86-based architectures . . . . . . . . . . . . . . . . . . . . . 167
ix8.2.2 Multilevel memory hierarchies . . . . . . . . . . . . . . . . . 167
8.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9 Conclusions and Future Work 173
A X-Ray: Experimental Results 176
A.1 CPU Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.1.1 CPU Frequency . . . . . . . . . . . . . . . . . . . . . . . . . 179
A.1.2 Instruction Latency and Throughput . . . . . . . . . . . . . 179
A.1.3 Number of Registers . . . . . . . . . . . . . . . . . . . . . . 181
A.2 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
A.2.1 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . 184
A.2.2 Lower Level Caches . . . . . . . . . . . . . . . . . . . . . . . 184
Bibliography 189
xLIST OF TABLES
3.1 Summary of optimization parameters . . . . . . . . . . . . . . . . . 40
8.1 DEC Alpha 21264: Platform Speciﬁcation . . . . . . . . . . . . . . 109
8.2 DEC Alpha 21264: Optimization Parameters . . . . . . . . . . . . 109
8.3 DEC Alpha 21264: Timings . . . . . . . . . . . . . . . . . . . . . . 109
8.4 IBM Power 3: Platform Speciﬁcation . . . . . . . . . . . . . . . . . 115
8.5 IBM Power 3: Optimization Parameters . . . . . . . . . . . . . . . 115
8.6 IBM Power 3: Timings . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.7 IBM Power 4: Platform Speciﬁcation . . . . . . . . . . . . . . . . . 119
8.8 IBM Power 4: Optimization Parameters . . . . . . . . . . . . . . . 119
8.9 IBM Power 4: Timings . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.10 SGI R12K: Platform Speciﬁcation . . . . . . . . . . . . . . . . . . 124
8.11 SGI R12K: Optimization Parameters . . . . . . . . . . . . . . . . . 124
8.12 SGI R12K: Timings . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.13 Sun UltraSPARC IIIi: Platform Speciﬁcation . . . . . . . . . . . . 129
8.14 Sun UltraSPARC IIIi: Optimization Parameters . . . . . . . . . . 129
8.15 Sun UltraSPARC IIIi: Timings . . . . . . . . . . . . . . . . . . . . 129
8.16 Intel Itanium 2: Platform Speciﬁcation . . . . . . . . . . . . . . . . 134
8.17 Intel Itanium 2: Optimization Parameters . . . . . . . . . . . . . . 134
8.18 Intel Itanium 2: Timings . . . . . . . . . . . . . . . . . . . . . . . 134
8.19 AMD Opteron 240: Platform Speciﬁcation . . . . . . . . . . . . . . 141
8.20 AMD Opteron 240: Optimization Parameters . . . . . . . . . . . . 141
8.21 AMD Opteron 240: Timings . . . . . . . . . . . . . . . . . . . . . 141
8.22 AMD Athlon MP: Platform Speciﬁcation . . . . . . . . . . . . . . 150
8.23 AMD Athlon MP: Optimization Parameters . . . . . . . . . . . . . 150
8.24 AMD Athlon MP: Timings . . . . . . . . . . . . . . . . . . . . . . 150
8.25 Intel Pentium III: Platform Speciﬁcation . . . . . . . . . . . . . . . 155
8.26 Intel Pentium III: Optimization Parameters . . . . . . . . . . . . . 155
8.27 Intel Pentium III: Timings . . . . . . . . . . . . . . . . . . . . . . . 155
8.28 Intel Pentium 4: Platform Speciﬁcation . . . . . . . . . . . . . . . 160
8.29 Intel Pentium 4: Optimization Parameters . . . . . . . . . . . . . . 160
8.30 Intel Pentium 4: Timings . . . . . . . . . . . . . . . . . . . . . . . 160
8.31 Optimization Paramameters for Intel Itanium 2 . . . . . . . . . . . 169
8.32 Timings for Intel Itanium 2 . . . . . . . . . . . . . . . . . . . . . . 169
8.33 Optimization Paramameters for SGI R12000 . . . . . . . . . . . . . 170
8.34 Timings for SGI R12000 . . . . . . . . . . . . . . . . . . . . . . . . 171
A.1 Summary of Experimental Results: CPU Features . . . . . . . . . 178
A.2 Experimental results for registers . . . . . . . . . . . . . . . . . . . 182
A.3 Summary of experimental results . . . . . . . . . . . . . . . . . . . 187
A.4 Summary of Itanium 2 C3 parameters . . . . . . . . . . . . . . . . 188
xiLIST OF FIGURES
1.1 Architecture of ATLAS and of Model-driven ATLAS . . . . . . . . 2
2.1 Structure of a simple compiler . . . . . . . . . . . . . . . . . . . . 8
2.2 Loop nest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Loop permutation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Loop skewing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Loop reversal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Na¨ ıve MMM Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Tiled MMM Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8 Simple Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.9 Strip-mined Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.10 Simple Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.11 Completely Unrolled Loop . . . . . . . . . . . . . . . . . . . . . . . 23
2.12 Non-pipelined loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.13 Pipelined loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Na¨ ıve MMM Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 MMM tiled for L1 data cache and registers . . . . . . . . . . . . . 34
3.3 mini-MMM and micro-MMM . . . . . . . . . . . . . . . . . . . . . 35
4.1 A micro-benchmark in X-Ray . . . . . . . . . . . . . . . . . . . . . 45
4.2 High-level structure of a nano-benchmark . . . . . . . . . . . . . . 46
4.3 Approach for timing nano-benchmarks . . . . . . . . . . . . . . . . 47
4.4 Standard memory hierarchy benchmark . . . . . . . . . . . . . . . 58
4.5 Memory address decomposition on Pentium III . . . . . . . . . . . 61
4.6 Sequences of sequences . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Example of (semi-/non-)compact . . . . . . . . . . . . . . . . . . . 66
4.8 Decomposition of W = hm0,S,Ni . . . . . . . . . . . . . . . . . . 68
4.9 Measuring C and A of L1 Data Cache . . . . . . . . . . . . . . . . 70
4.10 Modiﬁed address sequence for measuring B . . . . . . . . . . . . . 71
4.11 Algorithm for measuring B . . . . . . . . . . . . . . . . . . . . . . 72
4.12 Improved timing of memory accesses . . . . . . . . . . . . . . . . . 73
4.13 Memory address decomposition on Pentium III . . . . . . . . . . . 82
6.1 Schematic Pseudo-Code for mini-MMM . . . . . . . . . . . . . . . 92
8.1 DEC Alpha 21264: MMM Performance . . . . . . . . . . . . . . . 110
8.2 DEC Alpha 21264: Sensitivity of performance to NB . . . . . . . . 110
8.3 DEC Alpha 21264: Sensitivity of performance to MU and NU . . . 111
8.4 DEC Alpha 21264: Sensitivity of performance to KU . . . . . . . . 111
8.5 DEC Alpha 21264: Sensitivity of performance to Ls . . . . . . . . 111
8.6 IBM Power 3: MMM Performance . . . . . . . . . . . . . . . . . . 116
8.7 IBM Power 3: Sensitivity of performance to NB . . . . . . . . . . . 116
xii8.8 IBM Power 3: Sensitivity of performance to MU and NU . . . . . . 117
8.9 IBM Power 3: Sensitivity of performance to KU . . . . . . . . . . . 117
8.10 IBM Power 3: Sensitivity of performance to Ls . . . . . . . . . . . 117
8.11 IBM Power 4: MMM Performance . . . . . . . . . . . . . . . . . . 120
8.12 IBM Power 4: Sensitivity of performance to NB . . . . . . . . . . . 120
8.13 IBM Power 4: Sensitivity of performance to MU and NU . . . . . . 121
8.14 IBM Power 4: Sensitivity of performance to KU . . . . . . . . . . . 121
8.15 IBM Power 4: Sensitivity of performance to Ls . . . . . . . . . . . 121
8.16 SGI R12K: MMM Performance . . . . . . . . . . . . . . . . . . . . 125
8.17 SGI R12K: Sensitivity of performance to NB . . . . . . . . . . . . 125
8.18 SGI R12K: Sensitivity of performance to MU and NU . . . . . . . . 126
8.19 SGI R12K: Sensitivity of performance to KU . . . . . . . . . . . . 126
8.20 SGI R12K: Sensitivity of performance to Ls . . . . . . . . . . . . . 126
8.21 Sun UltraSPARC IIIi: MMM Performance . . . . . . . . . . . . . . 130
8.22 Sun UltraSPARC IIIi: Sensitivity of performance to NB . . . . . . 130
8.23 Sun UltraSPARC IIIi: Sensitivity of performance to MU and NU . 131
8.24 Sun UltraSPARC IIIi: Sensitivity of performance to KU . . . . . . 131
8.25 Sun UltraSPARC IIIi: Sensitivity of performance to Ls . . . . . . . 131
8.26 Intel Itanium 2: MMM Performance . . . . . . . . . . . . . . . . . 135
8.27 Intel Itanium 2: Sensitivity of performance to NB . . . . . . . . . . 135
8.28 Intel Itanium 2: Sensitivity of performance to MU and NU . . . . . 136
8.29 Intel Itanium 2: Sensitivity of performance to KU . . . . . . . . . . 136
8.30 Intel Itanium 2: Sensitivity of performance to Ls . . . . . . . . . . 136
8.31 Intel Itanium 2: Sensitivity of performance to KU . . . . . . . . . . 138
8.32 AMD Opteron 240: MMM Performance . . . . . . . . . . . . . . . 142
8.33 AMD Opteron 240: Sensitivity of performance to NB . . . . . . . . 142
8.34 AMD Opteron 240: Sensitivity of performance to MU and NU . . . 143
8.35 AMD Opteron 240: Sensitivity of performance to KU . . . . . . . . 143
8.36 AMD Opteron 240: Sensitivity of performance to Ls . . . . . . . . 143
8.37 (MU,NU) = (6,1) code for x86 CISC . . . . . . . . . . . . . . . . . 146
8.38 ATLAS unroll (MU,NU) = (6,1) code for x86 CISC . . . . . . . . . 148
8.39 AMD Athlon MP: MMM Performance . . . . . . . . . . . . . . . . 151
8.40 AMD Athlon MP: Sensitivity of performance to NB . . . . . . . . 151
8.41 AMD Athlon MP: Sensitivity of performance to MU and NU . . . . 152
8.42 AMD Athlon MP: Sensitivity of performance to KU . . . . . . . . 152
8.43 AMD Athlon MP: Sensitivity of performance to Ls . . . . . . . . . 152
8.44 AMD Athlon MP: Sensitivity of performance to KU . . . . . . . . 154
8.45 Intel Pentium III: MMM Performance . . . . . . . . . . . . . . . . 156
8.46 Intel Pentium III: Sensitivity of performance to NB . . . . . . . . . 156
8.47 Intel Pentium III: Sensitivity of performance to MU and NU . . . . 157
8.48 Intel Pentium III: Sensitivity of performance to KU . . . . . . . . . 157
8.49 Intel Pentium III: Sensitivity of performance to Ls . . . . . . . . . 157
8.50 Intel Pentium 4: MMM Performance . . . . . . . . . . . . . . . . . 161
8.51 Intel Pentium 4: Sensitivity of performance to NB . . . . . . . . . 161
xiii8.52 Intel Pentium 4: Sensitivity of performance to MU and NU . . . . . 162
8.53 Intel Pentium 4: Sensitivity of performance to KU . . . . . . . . . 162
8.54 Intel Pentium 4: Sensitivity of performance to Ls . . . . . . . . . . 162
8.55 Summary of Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.56 Summary of mini-MMM Performance. Performance numbers are
normalized to that of ATLAS CGw/S, which is presented as 100% 166
8.57 Intel Itanium 2: MMM Performance . . . . . . . . . . . . . . . . . 169
8.58 SGI R12000: MMM Performance . . . . . . . . . . . . . . . . . . . 171
xivPREFACE
Every once in a while, a truly outstanding paper1 comes along that serves as
a model for other papers in the area. This is such a paper. The work provides
a thorough analysis of the potential for compiler models to select the parameters
for optimizing transformations in a way that is competitive with the well-known
ATLAS system, which uses an experimental search mechanism with trial runs to
select these same parameters.
In addition to its lucid description of the experimental method, the paper
provides an excellent description of the ATLAS system and its methodology for
selection of optimization parameters. This is followed by a discussion of the model-
based approach to selecting parameters. Finally, the paper presents its experimen-
tal results. I must admit that I was amazed by the depth and breadth of the
experimental evaluation: it covers every interesting platform and evaluates the
performance on a variety of diﬀerent dimensions and for a variety of matrix sizes.
Although the experiment focuses on matrix multiplication only, it is nevertheless a
fascinating study, ﬁlled with rich insights on the problems of building useful models
into compilers for optimizing performance on diﬀerent machines. I also found the
discussion of the results and conclusions in the ﬁnal section to be balanced and
insightful.
Prof. Ken Kennedy
Rice University
1This preface is extracted from a review, written by Prof. Ken Kennedy for [65],
which is at the heart of this dissertation. It is included here by permission of the
author.
xvChapter 1
Introduction
The sciences do not try to explain, they hardly even try to inter-
pret, they mainly make models. By a model is meant a mathematical
construct which, with the addition of certain verbal interpretations, de-
scribes observed phenomena. The justiﬁcation of such a mathematical
construct is solely and precisely that it is expected to work.
John Von Neumann
It is a fact universally recognized that current restructuring compilers do not
generate code that can compete with hand-tuned code in eﬃciency, even for a
simple kernel like matrix multiplication. This inadequacy of current compilers
does not stem from a lack of technology for transforming high-level programs into
programs that run eﬃciently on modern high-performance architectures; over the
years, the compiler community has invented innumerable techniques for enhancing
locality and parallelism, described in more detail in Chapter 2.
The simplest manual approach to tuning a program for a given platform is to
write diﬀerent versions of that program, evaluate the performance of these versions
on the target platform, and select the one that gives the best performance. These
diﬀerent versions usually implement the same algorithm, but diﬀer in the values
they use for parameters such as tile sizes and loop unroll factors. The architectural
insights and domain knowledge of the programmer are used to limit the number
of versions that are evaluated. In eﬀect, the analytical techniques used in current
compilers to derive optimal values for such parameters are replaced by an empirical
search over a suitably restricted space of parameter values (by empirical search, we
12
mean a three step process: (1) generating a version of the program corresponding
to each combination of the parameters under consideration, (2) executing each
version on the target machine and measuring its performance, and (3) selecting
the version that performs best). This approach has been advocated most force-
fully by Fred Gustavson and his co-workers at IBM, who have used it for many
years to generate the highly optimized ESSL and PESSL libraries for IBM ma-
chines [51]. Recently, a number of projects such as FFTW [27, 28], PhiPAC [8, 3],
ATLAS [60, 1], and SPIRAL [36, 50] have automated the generation of the dif-
ferent program versions whose performance must be evaluated. Experience shows
that these library generators produce much better code than native compilers do
on modern high-performance architectures.
ATLAS Search 
Engine
(mmsearch)
Execute
&
Measure
ATLAS MM Code 
Generator
(mmcase)
mini-MMM
Source LS
MFLOPS
MU, NU, KU
NB
FF, IF, NF
FMA
Model Parameter 
Estimator
(mmmodel)
ATLAS MM Code 
Generator
(mmcase)
LS
MU, NU, KU
NB
FF, IF, NF
FMA
L1Size
NR
LS
FMA
Detect
Hardware
Parameters
mini-MMM
Source
L1Size
NR
L*, |ALUFP|
FMA
Detect
Hardware
Parameters
L1 I-Cache
Figure 1.1: Architecture of ATLAS and of Model-driven ATLAS
This thesis was motivated by a desire to understand the performance gap be-
tween the BLAS codes produced by ATLAS and by restructuring compilers, with
the ultimate goal of improving the state of the art of current compilers. One reason
why compilers might be at a disadvantage is that they are general-purpose and
must be able to optimize any program, whereas a library generator like ATLAS
can focus on a particular problem domain. However, this is somewhat implausible3
because dense numerical linear algebra, the particular problem domain of ATLAS,
is precisely the area that has been studied most intensely by the compiler com-
munity, and there is an extensive collection of well-understood transformations for
optimizing dense linear algebra programs. Another reason for the inadequacy of
current compilers might be that new transformations, unknown to the compiler
community, are required to produce code of the same quality as the code produced
by ATLAS. Finally, it is possible that the analytical models used by compilers to
estimate optimal values for transformation parameters are overly simplistic, given
the complex hardware of modern computers, so they are not able to produce good
values for program optimization parameters.
No deﬁnitive studies exist to settle these matters. This dissertation is the ﬁrst
quantitative study of these issues.
Figure 1.1 shows our experimental set-up, which makes use of the original
ATLAS system (top of the ﬁgure) and a modiﬁed version (bottom of the ﬁgure)
that uses analytical models instead of empirical search. Like any system that uses
empirical search, ATLAS has (i) a module that controls the search, which is used
to determine optimal values for code optimization parameters (mmsearch), and
(ii) a module that generates code, given these values (mmcase). The parameters
used by ATLAS are described in more detail in Chapter 3; for example, NB is the
tile size to be used when optimizing code for the L1 data cache.
In general, there is an unbounded number of possible values for a parameter
like NB so it is necessary to bound the size of the search space. When ATLAS is
installed, it ﬁrst runs a set of micro-benchmarks to determine hardware parame-
ters such as the capacity of the L1 data cache and the number of registers. The
values of these hardware parameters are used to bound the search space. Since4
ATLAS relies on search, it can aﬀord to use approximate values for these hardware
parameters. In Chapter 4, we describe X-Ray, a tool that we have implemented
for more accurate measurement of such hardware parameters.
The mmsearch module within ATLAS enumerates points within the bounded
search space, invokes the mmcase module to generate the appropriate code (de-
noted by mini-MMM in the ﬁgure), runs this code on the actual machine for a
long enough time to ensure accurate measurement, and records its execution time.
At the end of the search, the parameter values that gave the best performance are
used to generate the library code. This library is coded in a simple subset of C,
which can be viewed as portable assembly code, and it is compiled to produce the
ﬁnal executable.
We ﬁrst studied the code generation module1, and determined that the code it
produces can be viewed as the end result of applying standard compiler transfor-
mations to high-level BLAS codes. As we describe in Chapter 3, the code produced
by ATLAS is similar to what we would get if we applied cache tiling, register tiling,
and operation scheduling to the standard three-loop matrix multiplication code.
This exercise ruled out the possibility that ATLAS incorporated some transfor-
mation, unknown to the compiler community, that was critical for obtaining good
performance. We then modiﬁed ATLAS by replacing the search module, described
in more detail in Chapter 5, with a module (mmmodel) that uses standard analyti-
cal models to estimate optimal values for the optimization parameters, as described
in Chapter 6. Since both ATLAS and the modiﬁed ATLAS use the same code gen-
1Our description of ATLAS was arrived at by studying the ATLAS source
code. In case of any discrepancy between this description and how the ATLAS
system is actually implemented, the documentation of the ATLAS project should
be considered to be authoritative [60, 59, 58].5
erator, we are assured that any diﬀerence in the performance of the generated code
results solely from diﬀerent choices for optimization parameter values. Of course,
the optimization parameter values produced by using models can be reﬁned by
using local search, as described in Chapter 7.
In Chapter 8, we present experimental results on ten diﬀerent platforms, com-
paring
• the time spent to determine the parameter values,
• the values of the parameters, and
• the relative performance of generated code.
Our results show that on all ten platforms, a relatively simple and very intuitive
model is able to estimate near-optimal values for the optimization parameters used
by the ATLAS Code Generator. We conclude in Chapter 9 with a discussion of
our main ﬁndings, and suggest future directions for research.
One feature of ATLAS is that it can make use of hand-tuned BLAS routines,
many of which are included in the ATLAS distribution. When ATLAS is installed
on a machine, these hand-coded routines are executed and evaluated. If the per-
formance of one of these hand-coded routines surpasses the performance of the
code generated by the ATLAS Code Generator, the hand-coded routine is used
to produce the library. For example, neither the ATLAS Code Generator nor the
C compilers on the Pentium 4 exploit the SSE2 vector extensions to the x86 in-
struction set, so ATLAS-generated matrix multiplication code runs at around 1.5
GFLOPS on this architecture. However, the matrix multiplication routine in the
library produced by ATLAS runs at 3.3 GFLOPS because it uses SSE2 kernels
carefully hand-coded by expert programmers.6
Our concern in is not with handwritten code, but with the code produced by
the ATLAS Code Generator and with the estimation of optimal values for the
parameters that are inputs to the code generator. To make clear distinctions, we
use the following terminology.
• ATLAS CGw/S: This refers to the ATLAS system in which all code is pro-
duced by the ATLAS Code Generator with Search to determine parameter
values. No hand-written, contributed code is allowed.
• ATLAS Model: This refers to the modiﬁed ATLAS system we built in which
all code is produced by the ATLAS Code Generator, using parameter values
produced from analytical models.
• ATLAS Unleashed: This refers to the complete ATLAS distribution, which
may use hand-written codes and predeﬁned parameter values (architectural
defaults) to produce the library. Where appropriate, we include, for com-
pleteness, the performance graphs for the libraries produced by ATLAS Un-
leashed.Chapter 2
Background
In this chapter, we discuss two diﬀerent approaches to optimizing the cache per-
formance of programs. The ﬁrst approach is cache-aware, and it exploits detailed
knowledge of the memory hierarchy of a machine to improve cache performance.
In particular, there is a large body of research on transforming aﬃne loop nests
to improve locality in a cache-aware manner, and we summarize this work in this
chapter. The second approach we discuss is a programming methodology called the
cache-oblivious approach. The key argument for the cache-oblivious approach is
that programs that implement divide-and-conquer algorithms perform well on any
memory hierarchy without the need for tuning. The argument is that each division
step usually reduces the problem data size, so the data for entire sub-problems ﬁt
in diﬀerent levels of the memory hierarchy at some point during the divide-and-
conquer process. Once the data for a sub-problem is in cache, that sub-problem
can be solved completely without cache misses other than conﬂict misses.
2.1 Cache-aware programs and restructuring compilers
Compiler design has been an active area of research since the ﬁfties. The structure
of a simple compiler is presented in Figure 2.1 [46].
Optimizing program execution was a concern even for early compilers. Pro-
gram optimizations can be either machine-independent or machine-dependent.
Machine-independent optimizations usually try to reduce the amount of com-
putation by eliminating repeated or redundant computations; for example, loop
invariant removal [5] moves invariant computations out of loop bodies. Other
78
Lexical Analyzer
string
Parser
token sequence
Semantic Analyzer
Abstract syntax tree
Code Generator
Intermediate code
Object code
Figure 2.1: Structure of a simple compiler
machine-independent optimizations such as strength reduction [19] replace expen-
sive operations with simpler ones. Machine-dependent optimizations on the other
hand attempt to tailor the code to take advantage of particular machine features;
for example, register allocation [14, 16, 13] attempts to assign variables to registers
because it is faster to access a value that is in a register than one that is stored in
memory. One complication with machine-dependent optimizations is that taking
full advantage of machine-speciﬁc features usually conﬂicts with portability of the
compiler.
In general, optimizations are performed on some form of intermediate program9
representation. When this intermediate form is an abstract syntax tree, the opti-
mization is sometimes referred to as a source-to-source transformation because it
is possible to generate a high-level program from the transformed abstract syntax
tree. Compilers that perform optimizations at this level are known as restructur-
ing compilers. Current compilers implement several levels of intermediate code,
gradually converting the program from more abstract to more machine dependent
representation. There are usually at least two levels of intermediate representation:
• High-Level Intermediate Representation (HIR): The structure of loops and
array references are preserved in this representation so that high-level analy-
ses such as dependence analysis and restructuring transformations such as
loop transformations can be performed.
• Low-Level Intermediate Representation (LIR): Once high-level transforma-
tions have been performed, it is necessary to optimize address arithmetic and
perform traditional ”Dragon-book” style optimizations. To facilitate this,
the level of the code is lowered by replacing loops and array references with
conditional transfer of control constructs and pointer arithmetic respectively.
2.1.1 Legality of Restructuring Transformations
Before a compiler can restructure a program, it is important to prove that the
transformation is legal, i.e., that the transformation does not change the semantics
of the program. To prove legality, current compilers use a conservative approach
like dependence analysis.
Intuitively, a data dependence exists from statement S1 to statement S2 if (i)
they both touch the same location with at least one of them writing to it, and (ii)10
there is a possible control ﬂow from S1 to S2. For such statements, it is clear that
program semantics will change if the two statements are interchanged. On the
other hand, transformations which preserve every dependence in a source program
are legal [6].
Dependences can be classiﬁed in the following four categories.
1. Flow dependence: A ﬂow dependence is said to exist from statement S1 to
statement S2 if S1 writes to a location which may later be read by S2, e.g.:
S1: X = ...
...
S2: ... = X
This dependence is also known as a read-after-write (RAW) dependence.
2. Anti-dependence: An anti-dependence is said to exist from statement S1 to
statement S2 if S1 reads from a location that is later written to by statement
S2, e.g.:
S1: ... = X
...
S2: X = ...
This dependence is also known as a write-after-read (WAR) dependence.
3. Output dependence: An output dependence is said to exist from statement
S1 to statement S2 if S1 and S2 write to the same location, e.g.:
S1: X = ...
...
S2: X = ...11
This dependence is also known as a write-after-write (WAW) dependence.
4. Input dependence is a dependence in which both S1 and S2 read the same
location, e.g.:
S1: ... = X
...
S2: ... = X
Unlike other dependences, input dependences can be reordered without vio-
lating the semantics of the program, and therefore they are not considered
when proving the legality of optimizing transformations. On the other hand,
they are important to consider for optimizations for improving data locality.
This dependence is also known as a read-after-read (RAR) dependence.
Restructuring compilers have focused mainly on optimizing aﬃne loop nests,
i.e., loop nests where loop boundaries and array indices are aﬃne functions of loop
control variables. Loop nests are usually divided into perfectly nested loop nests
and imperfectly nested loop nests. In a perfectly nested loop nest, all assignment
statements are contained in the innermost loop of the loop nest. Figure 2.2 shows
a perfectly nested loop nest (for loop index sets we use MATLAB notation [First :
Step : Last], which represents the set of all integers between First and Last in
steps of Step). If there are assignment statements contained within some but not
all of the loops in a loop nest, that loop nest is called an imperfectly nested loop
nest. For example, if there was an assignment statement between the i1 and i2
loops in Figure 2.2, the loop nest would be imperfectly nested.
An iteration is typically denoted by an iteration vector ¯ v = hv1,v2,...,vni,
where vk is the value of the control variable ik for this iteration. The space of12
for i1 ∈ [L1 : 1 : U1]
for i2 ∈ [L2 : 1 : U2]
...
for in ∈ [Ln : 1 : Un]
S (i1,i2,...,in)
Figure 2.2: Loop nest
iteration vectors is known as iteration space.
If there is a dependence from the instance of the loop body in an iteration ¯ v =
hv1,v2,...,vni to the instance of the loop body in iteration ¯ w = hw1,w2,...,wni,
we can compute the dependence distance vector ¯ δ = ¯ w − ¯ v = hδ1,δ2,...,δni. As
with basic dependences, each dependence vector can be classiﬁed as RAW, WAR,
WAW, or RAR dependence.
Diﬀerent approaches have been developed for computing dependence distance
and direction vectors for aﬃne loop nests. These include simple conservative tests
like Zero Induction Variable (ZIV), Single Induction Variable (SIV), Multiple In-
duction Variable (MIV) [6], and more complex and powerful tests based on integer
linear programming techniques, such as the Omega test [49].
In some programs, it may not be possible to represent a dependence by a single
distance vector. In that case, an abstraction of distance vectors called direction
vectors is often used. A direction vector ¯ d = hd1,d2,...,dni which corresponds to
the distance vector ¯ δ = hδ1,δ2,...,δni is deﬁned as follows:
di =

      
      
+, δi > 0
−, δi < 0
0, δi = 0
Sometimes an asterisk (*) is used as an element of a direction vector, when13
that element can potentially fall in more than one of the disjoint groups above.
Note that a distance (or direction) vector ¯ δ is always lexicographically positive
(denoted ¯ δ ￿ ¯ 0), which means that the ﬁrst non-zero element of it is positive (or +).
For a reordering transformation to be legal, it needs to preserve the lexicographical
positivity of dependence vectors (otherwise it violates that particular dependence).
2.1.2 Linear Loop Transformations
Although one can imagine many transformations of perfectly nested loops, there
are only a handful that are important in practice. Of these, permutation is the
most useful one, and it is described in more detail below. In some loop nests, it
may be necessary to ﬁrst transform the loop nest using other transformations such
as loop skewing and reversal [7, 62, 40] before permutation can be applied legally.
A powerful way of thinking about these loop transformations is to consider
them to be linear transformations that change the basis of the iteration space
of the loop nest I = {hi1,i2,...,ini : ik ∈ [Lk,Uk],k ∈ [1,n]}. Any linear trans-
formation can be described by means of a transformation matrix T. If T is an
integer, n × n unimodular matrix (that is, it has a determinant of +1 or −1),
we call this transformation a unimodular loop transformation. Unimodular linear
transformations map dense sets of integer points to dense sets of integer points.
For aﬃne loop nests, loop bounds can be described by an expression of the
form ¯ aT ×¯ i ≤ b, where ¯ a is the vector of coeﬃcients for each loop control variable
in the iteration vector ¯ i. Therefore all loop bounds can be expressed in matrix
form as:
A ×¯ i ≤ ¯ b.
Note that this notation allows us to express complex loop bounds, such as14
conjunctions of several aﬃne inequalities, just by including each element of the
conjunction as a separate row of matrix A. Such conjunctions often arise from
inequalities of the form ... ≥ max(...) and min(...) ≤ ..., which are common for
loop nests describing convex iteration spaces.
A unimodular loop transformation with matrix T on the iteration space I
produces a new iteration space J as follows:
J = T × I.
A speciﬁc iteration¯ i is mapped to the iteration ¯ j = T×¯ i, and therefore¯ i = T −1×¯ j.
Therefore we can construct the body of the transformed loop nest from the body
of the original loop nest by substituting each occurrence of a loop control variable
from¯ i with the corresponding expression from T −1×¯ j. The loop bound conditions
of the new iteration space can be computed using integer-linear programming from
the inequality:
￿
A × T
−1
￿
× ¯ j ≤ b.
One of the main advantages of unimodular loop transformations lies in the
easiness of checking their legality. Because we have a unimodular transformation
T of the iteration space we can easily compute how an existing dependence vector
¯ δ is transformed by this particular transformation:
¯ δnew = T × ¯ δ.
To ensure dependences are respected, all transformed dependence vectors should
be lexicographically positive. If we express the dependence vectors as columns in
a dependence matrix D, this condition can be expressed as D × T ￿ 0.
As discussed below, unimodular loop transformations [7, 17, 23, 40, 62, 37]
include loop permutation (interchange), loop skewing, and loop reversal. In ad-15
dition, because a product of two unimodular matrices is a unimodular matrix,
any sequence of these transformations is also a unimodular transformation. More-
over, any unimodular loop transformation can be represented as a sequence of loop
permutation, loop skewing, and loop reversal.
Loop scaling [40] is also a linear loop transformation but it is not unimodu-
lar, because it stretches the transformed space, transforming dense sets of integer
points into sparse sets of integer points. Including loop scaling in our repertoire of
transformations simpliﬁes the synthesis of appropriate sequences of loop transfor-
mations for enhancing locality of reference or for parallelization.
Loop permutation
Loop permutation is a transformation with a two-fold eﬀect. By changing the loop
order, one can change the stride with which a certain array is accessed in memory.
Achieving stride one access is crucial for improving spacial locality and in some
cases triggering hardware data prefetchers. On another level, loop permutation
(together with strip-mining) allows loop tiling, which can increase temporal data
locality.
Loop permutation can be represented by a unimodular loop transformation,
whose corresponding matrix T is a permutation matrix. Consider the following
loop nest:
for i1 ∈ [1 : 1 : N]
for i2 ∈ [1 : 1 : M]
S(i1,i2);16
If we apply the unimodular transformation:
T =




0 1
1 0




The result is:
for j1 ∈ [1 : 1 : M]
for j2 ∈ [1 : 1 : N]
S(j2,j1);
The transformation can be presented pictorially as shown in Figure 2.3.
i2
i1
j2
j1
Figure 2.3: Loop permutation
Note that, because of the generality of the approach, loop permutation ex-
pressed as an unimodular transformation can more complex iteration spaces like
the triangular ones in linear algebra codes like triangular solve and Cholesky fac-
torization.
It is obvious that, since loop permutation can be expressed by a unimodular
permutation matrix, the elements of dependence vectors will also be permuted
in the resulting loop nest. Therefore loop permutation is not always legal. One
special case is having a fully permutable loop nest, which as we will see later, is
requirement for legality for loop tiling. Since in this case any permutation should
be legal, to guarantee that the resulting dependence vectors are lexicographically17
positive, it is necessary that all elements of the dependence matrix D are non-
negative.
In cases when a loop nest is not fully-permutable, there might be a chance
to transform it to a fully-permutable loop nest by employing other unimodular
transformations like loop skewing and look reversal. We discuss these next.
Loop skewing and Loop reversal
Consider the following loop nest:
for i1 ∈ [1 : 1 : N]
for i2 ∈ [1 : 1 : N]
X [i1,i2] = X [i1 − 1,i2 + 1];
It is clear that the only dependence has a distance vector ¯ δ = h1,−1i, and
therefore the dependence matrix is:
D =




1
−1



.
Therefore loop interchange is not legal. We will correct this problem by apply-
ing the loop skewing and loop reversal unimodular transformations. First we can
apply loop skewing with transformation matrix
TS =




1 1
0 1



,
which transforms the loop nest as shown in Figure 2.4.
Unfortunately, TS × ¯ δ = h0,−1i and therefore the resulting loop nest is neither
fully-permutable, nor legal. We will ﬁx this problem by applying loop reversal18
i2
i1
j2
j1
Figure 2.4: Loop skewing
using the transformation matrix
TR =




1 0
0 −1



,
which transforms the loop nest as shown in Figure 2.5.
i2
i1 k2
k1
Figure 2.5: Loop reversal
The resulting transformation can be represented by a single unimodular matrix
T = TR × TS =

 

1 1
0 1

 
 ×

 

1 0
0 −1

 
 =

 

1 1
0 −1

 
,
and the resulting dependence vector is T × ¯ δ = h0,1i. Therefore the unimodular
transformation with matrix T is legal, even though one of its component TR is not
legal. Furthermore the resulting loop nest is fully permutable, and therefore can
be tiled.19
2.1.3 Other Restructuring Transformations
Loop tiling
Modern hardware architectures have complex memory hierarchies, and optimizing
programs to execute eﬃciently requires taking this design feature into account.
Loop tiling [38, 39] is a restructuring technique, which aims improve data locality
by reducing the working set size of a piece of code, so that it ﬁts into some level
of the memory hierarchy, and therefore executes faster. We discuss loop tiling in
more detail in terms of matrix multiplication.
Blocking is an algorithmic transformation that converts the matrix multiplica-
tion into a sequence of small matrix multiplications, each of which multiplies small
blocks of the original matrices. Blocking matrix multiplication for memory hierar-
chies was discussed by McKellar and Coﬀman as early as 1969 [43]. The eﬀect of
blocking can be accomplished by the loop transformation called tiling, which was
introduced by Wolfe in 1987 [63].
In Figure 2.6 we show the simple triply-nested version of Matrix-Matrix Mul-
tiply. It is a well-known fact that if this code is executed as is, exhibits very poor
data locality and has very low performance.
for i ∈ [0 : 1 : N − 1]
for j ∈ [0 : 1 : M − 1]
for k ∈ [0 : 1 : K − 1]
Cij = Cij + Aik × Bkj
Figure 2.6: Na¨ ıve MMM Code
To improve performance the three loops in Figure 2.6 can be tiled by a factor
NB, producing the code in Figure 2.7. For simplicity we have assumed that the20
tile size NB divides the matrix dimensions M, N, and K.
for i ∈ [0 : NB : N − 1]
for j ∈ [0 : NB : M − 1]
for k ∈ [0 : NB : K − 1]
for i0 ∈ [i : 1 : NB − 1]
for j0 ∈ [j : 1 : NB − 1]
for k0 ∈ [k : 1 : NB − 1]
Ci0j0 = Ci0j0 + Ai0k0 × Bk0j0
Figure 2.7: Tiled MMM Code
Intuitively, the code in Figure 2.7 performs better because the inner loop nest
hi0,j0,k0i only operates on relatively small NB × NB tiles of the full matrices, and
does not touch other tiles before it is ﬁnished with the ones it has started working
on.
It is not immediately obvious that such drastic code transformation is legal,
and the fact is that, although it preserves the semantics of the computation for
this simple case (Matrix-Matrix Multiply), there are many other codes for which
it does not. We formalize the legality conditions further below.
One way to look at tiling is as a combination of strip-mining and loop per-
mutation. Strip-mining is a simple procedure in which a loop is chopped into
smaller loops of speciﬁc size (say NB). For example the loop in Figure 2.8, when
strip-mined with size of NB, transforms into the loop in Figure 2.9.
for i ∈ [0 : 1 : N − 1]
S (i);
Figure 2.8: Simple Loop21
for i ∈ [0 : NB : N − 1]
for i0 ∈ [i : 1 : NB − 1]
S (i0);
Figure 2.9: Strip-mined Loop
Strip-mining is always legal, as it does not change the relative order of the
original iterations. If we perform strip-mining on all the loops in Figure 2.6 in the
same way, we would get the loop nest hi,i0,j,j0,k,k0i. To obtain the same result
as tiling, as presented in Figure 2.7, we need to permute the loops. Because each
loop needs to be interchangeable with any other loop, it is necessary to have a
fully permutable loop nest. As we discussed in Section 2.1.2, the necessary and
suﬃcient condition for legality of such transformation is that all element of the
dependence matrix should be non-negative.
Loop tiling can be performed several times to accommodate arbitrary complex
memory hierarchies. Tiles do not have to be the same size in all dimensions (like
NB in our example).
Tiling for the register ﬁle and loop unrolling
The register ﬁle is the highest, smallest, and fastest level of the memory hierarchy.
As such, it should be given major consideration during program optimization. In
a sense it is not very diﬀerent from a cache, which is fully associative with optimal
replacement and unit line size. In case of SIMD vector registers it even has non-unit
line size.
To make eﬀective use of hardware registers it is necessary to store frequently
used variables in them. The procedure of assigning which variable is stored in
which register at a certain point in time is known as register allocation. Current22
compilers typically cannot allocate elements of aggregate data types like arrays and
structures to registers directly. Therefore it is necessary to copy these elements
to scalar variables, a technique known as scalar replacement or scalarization. We
discuss register allocation and scalar replacement techniques in more detail in
Section 2.1.4.
Since there is no indexed access to registers, tiling for the register ﬁle requires
an extra step compared to tiling for the regular cache. This step, known as loop
unrolling [6, 48], not only facilitates register allocation, but also decreases loop
overhead.
Since the register ﬁle is at the top of the memory hierarchy, its corresponding
loop nest is the innermost loop nest of the tiled code. This loop nest needs to
be completely unrolled. Complete loop unrolling is when the looping construct is
removed and its diﬀerent iterations are explicitly inlined in its place.
For example the result of completely unrolling the loop in Figure 2.10 is pre-
sented in Figure 2.11.
...
for i ∈ [imin : istep : imin + k × istep]
S (i);
...
Figure 2.10: Simple Loop
Unrolling is always legal, as it does not change the relative order of any itera-
tions, but it can have adverse eﬀect on performance if the total size of the unrolled
code is larger than the instruction cache capacity.23
...
S (imin);
S (imin + istep);
S (imin + 2 × istep);
...
S (imin + k × istep);
...
Figure 2.11: Completely Unrolled Loop
2.1.4 Conventional Compiler Transformations
There are numerous conventional compiler transformations which are generally
useful for many diﬀerent types of code. Some examples are dead code elimination,
constant propagation, induction variable substitution, etc. In this section we will
concentrate on some which are of greater importance to dense numerical linear
algebra codes, namely:
• register allocation and scalar replacement, and
• instruction scheduling and software pipelining.
Register allocation and scalar replacement
Register allocation assigns variables to registers. It is of paramount importance
because registers are almost always a scarce resource.
Register allocation is perhaps simplest to perform using local methods, which
estimate the beneﬁt from allocating a speciﬁc variable to a register inside a speciﬁc
block of code, by some quantitative measure. One can then locally allocate those
variables to available registers which show highest beneﬁt.24
Modern compilers often implement more sophisticated, global register alloca-
tion schemes, primarily based on graph coloring. The idea can be summarized in
the following four steps.
1. Allocate as many symbolic registers as are necessary to hold all objects that
are candidates for allocation (variables, temporaries, constants, etc.)
2. Construct an interference graph, with one node for each symbolic and each
real register, and undirected edges representing interferences as follows.
• An edge between two symbolic registers means that the corresponding
objects have overlapping lifespan, and therefore cannot be assigned to
the same register.
• An edge between a symbolic register and a real register means that
the corresponding variable cannot be assigned to that particular real
register.
• An edge is also introduced between any two real registers.
3. Color the nodes of the interference graph with R colors, where R is the
number of available registers, so that adjacent nodes have diﬀerent colors.
4. Assign each object to the register with the same color.
Eﬀective register allocation is even more important for scientiﬁc codes, like
dense numerical linear algebra ones, because it is often the case that such codes
require whole submatrices to be stored in registers to obtain good performance (in
contrast to just scalar variables and temporaries). This poses a problem, because
traditional register allocation is designed to work for scalars.25
To enable register allocation for elements of aggregate objects compilers use a
transformation called scalar replacement [12]. Scalar replacement is performed by
declaring scalar variables of the type of the elements of the aggregate object, copy-
ing the actual data to these scalar variables, and ﬁnally copying the results back
to the aggregate object if necessary. This procedure enables the use of standard
register allocation of elements of aggregate types.
To determine legality of scalar replacement, compilers need to perform alias
analysis on both the whole aggregate object and the corresponding element.
Instruction scheduling and software pipelining
Instruction scheduling is the process of breaking the computation of the origi-
nal program into small pieces which the underlying architecture understands and
then serializing them in the appropriate order to achieve high performance. To
eﬀectively schedule instructions, a code scheduler needs detailed description of the
architecture issue width and latencies and throughputs of all instructions.
One place where instruction scheduling is crucial, at least for dense numerical
linear algebra codes, is inside a single basic block. Intuitively it consist of build-
ing a dependence DAG from the instructions in the block and then producing a
topological sort of the DAG which executes fastest on the desired architecture.
Even in this simple formulation instruction scheduling is an NP-complete prob-
lem, so compilers traditionally employ heuristics, like list scheduling, to perform
it in polynomial time.
It is worth noting that instruction scheduling is not independent from register
allocation. Assigning several temporaries to the same register creates dependences
in the code and restricts the set of possible schedules. Many compilers perform26
instruction scheduling immediately before register allocation and then repeat in-
struction scheduling if any spill code is generated. This approach has proven to be
more eﬀective than a single scheduling pass either before or after register allocation.
An important optimization that can improve performance of executing loops
is software pipelining. It can be very eﬀective on architectures which beneﬁt from
instruction-level parallelism (ILP). It works by allowing parts of several iterations
to be executed simultaneously.
Intuitively, each iteration of a non-pipelined loop may take a while to ramp up
to a point where it is exploiting a lot of instruction-level parallelism; similarly, the
instruction pipelines drain out as the execution of the iteration completes. This
idea is presented graphically in Figure 2.12.
time
Figure 2.12: Non-pipelined loop
After software pipelining the loop, we allow iterations to overlap which can
potentially avoid the drops of sustained performance as shown in Figure 2.13.
time
Figure 2.13: Pipelined loop
We conclude the discussion with a simple example of software pipelining. Con-
sider the following loop, where the operations S1 and S2 have to be executed27
sequentially:
for i ∈ [1 : 1 : N]
{
S1(i);
S2(i);
}
One software pipelined version of this loop, where S1 and S2 could be executed
in parallel is:
S1(1);
for i ∈ [1 : 1 : N − 1]
{
S2(i);
S1(i + 1);
}
S2(N);
Note that S1 could be communicating results to S2 which might be register
allocated. Software pipelining executes two iterations simultaneously, which could
increase register pressure, and ultimately decrease performance. Therefore this
transformation is not always beneﬁcial.
2.2 Cache Oblivious Algorithms
Compiler transformations like tiling, discussed in Section 2.1.3, aim to improve
locality to achieve better performance. When implementing such transformations,
compilers need to prove their legality before they can apply them. Many of these
transformations also require parameters; for example, tiling requires a value for28
NB, the cache tile size. Such optimization parameters are very architecture speciﬁc
and inherently non-portable [10, 20, 54]. For these reasons, performance-conscious
programmers often optimize their programs manually [25, 29].
The values of optimization parameters can be determined by performing global
search in a subspace of the space of possible values or using analytical models to
compute their values based on the hardware parameters. Chapters 5 and 6 of this
thesis discuss these two approaches in detail. We refer to such approaches as cache-
aware, because they are tailored to the speciﬁc machine and therefore re-tuning is
necessary when the code is ported to a diﬀerent architecture.
Cache-oblivious algorithms and data structures are an alternative approach
to solving the problem of program optimization for memory hierarchies. In this
approach, programs are aware that modern processors have complex multi-level
memory hierarchies, but are oblivious to the particular parameters of these hier-
archies (e.g., cache size at each level, block size, etc.) In principle, cache-oblivious
programs provides good performance on a range of architectures without tuning.
Most commonly cache oblivious algorithms are described as divide-and-conquer
versions of conventional algorithms. The idea is that, by partitioning the problem
enough times, the working set of the computation is reduced, and it will eventually
ﬁt in a particular level of the memory hierarchy. The ﬁt is not guaranteed to be
precise, but often the number of data transfers between the diﬀerent levels of the
memory hierarchy are provably optimal in an asymptotic sense.
However, our experiments show that there is signiﬁcant performance penalty
in using cache oblivious algorithms in this context, and this mainly stems from
the existence of optimized codes which often achieve near-peak performance. For
example, one way to implement a cache oblivious matrix multiplication is as fol-29
lows. An M × K matrix A is multiplied by an K × N matrix B, and the result
is accumulated into a M × N matrix C. The main idea of the algorithm is to
divide the largest of the dimensions M, N, and K, and solve the larger problem
by solving two smaller problems. The partitioning can continue until all dimen-
sions are of size 1, at which point the problem can be solved directly as a single
multiply-accumulate operation.
Such a cache-oblivious implementation has asymptotically optimal I/O com-
plexity, but it may have a signiﬁcant performance problem.
• Recursion overhead – recursing down to a single matrix element produces a
great amount of function call overhead, which signiﬁcantly impairs perfor-
mance.
• Ignoring Registers – recursing down to a single matrix element does not give
the compiler a chance to optimize for the register ﬁle, which is the highest
level of the memory hierarchy, and therefore the one from which performance
beneﬁts the most. The problem here arises partially from the fact that cache-
oblivious algorithms assume that all levels of the memory hierarchy above
main memory are transparent to the program.
Both these problems can be alleviated by inlining the function calls at the bot-
tom of the call tree for some number of levels. This eliminates some of the call
overhead and produces large basic blocks, which compilers can schedule and per-
form register allocation on. Unfortunately such inlining is of limited use, because
it can easily overﬂow the instruction cache. Moreover the recursive order of access-
ing the matrix elements, while beneﬁcial at the high-level, is disastrous for spacial
locality of the inlined code; moreover, the irregular access pattern is not suitable30
for the LRU replacement most caches implement. After performing a number of
optimizations to address these problems, we were not able to achieve performance
higher than roughly 60% of peak on most architectures, whereas carefully tuned
cache-aware implementations easily an at more than 90% of peak.
One way to get the best of both worlds is to be partially aware of the memory
hierarchy, say only to registers and L1 data cache. In an implementation like
the one above, this would mean performing the recursive partitioning until some
problem size is reached, and then using a L1 and register optimized kernel to solve
the problem of that size directly.
Not surprisingly, cache oblivious algorithms perform best if paired with cache
oblivious data structures. The idea of implementing such data structures is similar
to the divide-and-conquer approach used to implement the corresponding algo-
rithms. For example, in dense numerical linear algebra, matrices can be stored in
a non-conventional manner, in which a space ﬁlling curve (e.g. Morton Z-order) is
used to recursively order their elements. Such approach is most beneﬁcial when the
control ﬂow of the corresponding algorithm working on this matrix, follows pre-
cisely the way the matrix is partitioned, because otherwise indexing the individual
matrix elements is very hard.
Previous work exists on implementing cache-oblivious variants of other impor-
tant data structures, like binary trees, priority queues, B-Trees, etc.
The real challenge we are facing in dense numerical algebra is to show whether
the semi-aware approach, discussed above, is enough to achieve compatible perfor-
mance with completely cache-aware algorithms on all architectures. One reason
performance is similar at this point, is that for current architectures, a single level
of cache tiling is enough to achieve near-peak performance. Intuitively, this picture31
might change when the constantly increasing performance gap between processor
and memory speed reaches some threshold. This is an active area of research we
are pursuing at the moment, and therefore it is not further discussed in this thesis.Chapter 3
High-performance Code Generation
In this section, we use the framework of restructuring compilers, described in
Chapter 2, to describe the structure of the code generated by the ATLAS Code
Generator. While reading this description, it is important to keep in mind that
ATLAS is not a compiler. Nevertheless, thinking in these terms helps clarify the
signiﬁcance of the code optimization parameters used in ATLAS.
We concentrate on matrix-matrix multiplication (MMM), which is the key rou-
tine in the BLAS. Na¨ ıve MMM code is shown in Figure 3.1. We use the MATLAB
notation [First : Step : Last] to represent the set of all integers between First
and Last in steps of Step.
for i ∈ [0 : 1 : N − 1]
for j ∈ [0 : 1 : M − 1]
for k ∈ [0 : 1 : K − 1]
Cij = Cij + Aik × Bkj
Figure 3.1: Na¨ ıve MMM Code
The code shown in Figure 3.1 can be optimized for locality by blocking for the
L1 data cache and registers.
3.1 Tiling for the L1 data cache
ATLAS implements an MMM as a sequence of mini-MMMs, where each mini-
MMM multiplies sub-matrices of size NB × NB. NB is an optimization parameter
whose value must be chosen so that the working set of the mini-MMM ﬁts in the
3233
cache.
In the terminology of restructuring compilers, the triply-nested loop of Fig-
ure 3.1 is tiled with tiles of size NB × NB × NB, producing an outer and an inner
loop nest. For the outer loop nest, code for both the JIK and IJK loop orders are
implemented. When the MMM library routine is called, it uses the shapes of the
input arrays to decide which version to invoke, as described later in this section.
For the inner loop nest, only the JIK loop order is used, with (j0,i0,k0) as control
variables. This inner loop nest multiplies sub-matrices of size NB × NB, and we
call this computation a mini-MMM.
3.2 Tiling for the register ﬁle
ATLAS represents each mini-MMM into a sequence of micro-MMMs, where each
micro-MMM multiplies an MU × 1 sub-matrix of A by a 1 × NU sub-matrix of B
and accumulates the result into an MU × NU sub-matrix of C. MU and NU are
optimization parameters that must be chosen so that a micro-MMM can be exe-
cuted without ﬂoating-point register spills. For this to happen, it is necessary that
MU + NU + MU × NU ≤ NR, where NR is the number of ﬂoating-point registers.
In terms of restructuring compiler terminology, the (j0,i0,k0) loops of the mini-
MMM from the previous step are tiled with tiles of size NU ×MU ×KU, producing
an extra (inner) loop nest. The JIK loop order is chosen for the outer loop nest
after tiling, and the KJI loop order for the loop nest of the mini-MMM after tiling.
The resulting code after the two tiling steps is shown in Figure 3.2. To keep
this code simple, we have assumed that all step sizes in these loops divide the
appropriate loop bounds exactly (so NB divides M, N, and K, etc.). In reality,
code should also be generated to handle the fractional tiles at the boundaries of34
// MMM loop nest (j,i,k)
// copy full A here
for j ∈ [1 : NB : M]
// copy a panel of B here
for i ∈ [1 : NB : N]
// possibly copy a tile of C here
for k ∈ [1 : NB : K]
// mini-MMM loop nest (j0,i0,k0)
for j0 ∈ [j : NU : j + NB − 1]
for i0 ∈ [i : MU : i + NB − 1]
for k0 ∈ [k : KU : k + NB − 1]
for k00 ∈ [k0 : 1 : k0 + KU − 1]
// micro-MMM loop nest (j00,i00)
for j00 ∈ [j0 : 1 : j0 + NU − 1]
for i00 ∈ [i0 : 1 : i0 + MU − 1]
Ci00j00 = Ci00j00 + Ai00k00 × Bk00j00
Figure 3.2: MMM tiled for L1 data cache and registers
the three arrays; we omit this clean-up code to avoid complicating the description.
The strategy used by ATLAS to copy blocks of the arrays into contiguous
storage is discussed later in this section. Figure 3.3 is a pictorial view of a mini-
MMM computation within which a micro-MMM is shown using shaded rectangles.
In this ﬁgure, the values assigned to variable K are produced by executing the two
for loops in Figure 3.2 corresponding to indices k0 and k00.35
B
NB
N
B
A C
K
M
U
NU
K
Figure 3.3: mini-MMM and micro-MMM
3.3 Scalarization of array elements
To perform register allocation for the array variables referenced in the micro-MMM
code, ATLAS uses techniques similar to those presented in [11]: the micro-MMM
loop nest (j00,i00) in Figure 3.2 is fully unrolled, producing MU × NU multiply and
add statements in the body of the middle loop nest. In the unrolled loop body,
each array element is accessed several times. To enable register allocation of these
array elements, ATLAS uses scalar replacement to introduce a scalar temporary
for each element of A, B, and C that is referenced in the unrolled micro-MMM code,
and replaces array references in the unrolled micro-MMM code with references to
these scalars. Appropriate assignment statements are introduced to initialize the
scalars corresponding to A and B elements. In addition, assignment statements are
introduced before and after the k0 loop to initialize the scalars corresponding to C
elements, and to write the values back into the array respectively. It is expected
that the back-end compiler will allocate ﬂoating-point registers for these scalars.36
3.4 Pipeline scheduling
The resulting straight-line code in the body of the k00 loop is scheduled to exploit
instruction-level parallelism. Note that the operations in the k00 loop are the MU +
NU loads of A and B elements required for the micro-MMM, and the corresponding
MU × NU multiplications and additions. On hardware architectures that have a
fused multiply-add instruction, the scheduling problem is much simpler because
multiplies and adds are executed together. Therefore, we only discuss the more
interesting case when a multiply-add instruction is not present. An optimization
parameter FMA tells the code generator whether to assume that a fused multiply-
add exists. The scheduling of operations can be described as follows.
• Construct two sequences of length (MU × NU), one containing the multiply
operations (we will denote them by mul1, mul2, ..., mulMU×NU) and the
other containing the add operations (we will denote them by add1, add2, ...,
addMU×NU).
• Interleave the two sequences as shown below to create a single sequence that
is obtained by skewing the adds by a factor of Ls, where Ls is an optimization
parameter. Intuitively, this interleaving separates most dependent multiplies
and adds by 2×Ls−1 independent instructions to avoid stalling the processor37
pipeline.
mul1
mul2
···
mulLs
add1
mulLs+1
add2
mulLs+2
···
mulMU×NU−1
addMU×NU−Ls
mulMU×NU
addMU×NU−Ls+1
addMU×NU−Ls+2
···
addMU×NU
• Inject the MU + NU loads of the elements of A and B into the resulting
sequence of arithmetic operations by scheduling a block of IF (Initial Fetch)
loads in the beginning and blocks of NF loads thereafter as needed. IF and
NF are optimization parameters.
• Unroll the k00 loop completely. The parameter KU must be chosen to be large
enough to reduce loop overhead, but not so large that the body of the k0 loop
overﬂows the L1 instruction cache.
• Reorganize the k0 loop to enable the target machine to overlap the loads from38
one iteration with arithmetic operations from previous iterations. Techniques
for accomplishing this are known as software pipelining or modulo schedul-
ing [52].
Note that skewing of dependent adds and multiplies increases register pressure;
in particular, the following inequality must hold to avoid register spills (that is,
saving in memory the value stored in a processor register):
MU × NU + MU + NU + Ls ≤ NR (3.1)
3.5 Additional details
There are several details we have not discussed so far.
• ATLAS considers a primitive form of L2 cache tiling, driven by a parameter
called CacheEdge. ATLAS empirically ﬁnds the best value of CacheEdge
and uses it to compute KP, based on Inequality 3.2.
2 × KP × NB + N
2
B ≤ CacheEdge (3.2)
KP is further trimmed to be a multiple of NB. The computed value of KP
is used to block the K dimension of the original problem for one additional
level of the memory hierarchy.
• ATLAS chooses the outermost loop order (shown as JIK in Figure 3.2) during
runtime. This technique is known as versioning, because it requires both
versions of the code to be compiled in the library.39
The decision of which loop order to choose is based on the size of matrices A
and B. If A is smaller than B (N < M), ATLAS chooses the JIK loop order.
This guarantees that if A ﬁts completely into the L2 or a higher cache level,
it is reused successfully by the loop nest. Similarly, if B is the smaller matrix
(M < N), ATLAS chooses the IJK loop order.
For brevity, we consider only the JIK loop order.
• Unless the matrices are too small or too large, ATLAS copies tiles of matrices
A, B and C to sequential memory to reduce the number of conﬂict misses and
TLB misses during the execution of a mini-MMM. Copying is performed in
a manner that allows the copied tiles to be reused by diﬀerent mini-MMMs.
The comments in Figure 3.2 and the discussion below explain how this goal
is achieved for the JIK loop order.
– Copy all tiles of A before the beginning of the outermost j loop. This is
necessary as these tiles are fully reused in each iteration of the j loop.
– Copy all tiles from the jth vertical panel of B before the beginning of the
i loop. This is necessary as this panel is fully reused by each iteration
of the i loop.
– The single (i,j) tile of C is copied before the beginning of the k loop if
KP
NB ≥ 12. This may reduce TLB misses, which may be beneﬁcial since
this tile is reused by each iteration of the k loop, provided that the cost
of copying the tile of C to a temporary buﬀer and back can be amortized
by the computation (large enough KP).
If the matrices are very small or if there is insuﬃcient memory for copying
tiles, the cost of copying might outweigh the beneﬁts of reducing conﬂict40
misses during the computation. Therefore, ATLAS generates non-copying
versions of mini-MMM as well and decides at runtime which version to use.
Without copying, the number of conﬂict misses and TLB misses may rise,
so it makes sense to use a smaller tile size for the non-copying mini-MMM.
In ATLAS, this tile size is another optimization parameter called NCNB
(non-copying NB). Roughly speaking, the non-copy version is used if (i) the
amount of computation is less than some threshold (M×N ×K in Figure 3.1
is less than some threshold), and (ii) at least one dimension of one of the
three matrices is smaller than 3×NCNB. The non-copy version is also used
when there is insuﬃcient memory to perform the copying.
3.6 Discussion
Table 3.1 lists all optimization parameters.
Table 3.1: Summary of optimization parameters
Name Description
NB L1 data cache tile size
NCNB L1 data cache tile size for non-copying version
MU, NU Register tile size
KU Unroll factor for k0 loop
Ls Latency for computation scheduling
FMA 1 if fused multiply-add available, 0 otherwise
FF, IF, NF Scheduling of loads
It is intuitively obvious that the performance of the generated mini-MMM
code suﬀers if the values of the optimization parameters in Table 3.1 are too small41
or too large. For example, if MU and NU are too small, the MU × NU block
of computation instructions might not be large enough to hide the latency of the
MU +NU loads. On the other hand, if these parameters are too large, register spills
happen. Similarly, if the value of KU is too small, there is more loop overhead,
but if this value is too big, the code in the body of the k0 loop will overﬂow the
instruction cache. The goal is to determine optimal values of these parameters for
obtaining the best mini-MMM code.Chapter 4
Measuring Hardware Parameters
4.1 Introduction to X-Ray
There is growing interest in the design of self-optimizing computer systems that
can improve their performance without human intervention. Some of these systems
such as ATLAS [60] and FFTW [27] optimize themselves statically at installation
time, while other systems such as those envisioned in IBM’s autonomic systems
initiative [2] optimize their performance continuously during execution.
In many self-optimizing systems, the process of optimization involves choosing
optimal values for certain performance-critical parameters. For example, when
ATLAS, a portable system for generating numerical linear algebra libraries, is
installed on a machine, it performs a series of experiments to determine optimal
values for software parameters like loop tile sizes and loop unroll factors, and then
generates appropriate code with these parameter values. Although the optimal
values of the parameters depend on hardware resources like cache capacity, number
of registers, etc. and therefore vary from machine to machine, ATLAS is able to
adapt itself automatically in this way to each platform without human intervention.
In general, the space of possible values for optimization parameters is too large
to be explored exhaustively, so it is necessary to limit the search for optimal pa-
rameter values. For example, to ﬁnd the optimal cache tile size, it is suﬃcient to
limit the search to tile sizes that are smaller than the capacity of the cache; larger
tiles will not ﬁt in the cache and cannot be optimal. Similarly the search space
for register tiles can be bounded by the number of registers in the processor. In
general therefore, bounding the search space of optimization parameters requires
4243
knowledge of hardware parameters such as the capacity of the cache, and the num-
ber of registers. Furthermore, for some programs, analytical models can be used to
compute optimal values for code optimization parameters; not surprisingly, these
models may require more detailed and accurate information about the hardware,
such as the line size and associativity of various levels of the memory hierarchy [64].
For software to be self-tuning and self-optimizing, it is necessary that these
hardware parameters be determined automatically without human intervention.
On some machines, it may be possible to determine some of this information by
reading special hardware registers or records in the operating system [21]. However,
most processors and operating systems do not support such mechanisms or provide
very limited support; for example, we do not know of any platform on which the
latency and throughput of instructions can be determined in this manner.
An alternative approach is to use carefully crafted micro-benchmarks to mea-
sure hardware parameters automatically. Perhaps the most well-known micro-
benchmark is the Hennessy-Patterson benchmark [33] that determines memory
hierarchy parameters by measuring the time required to access a series of array
elements with diﬀerent strides. However, the timing results from this benchmark
must be analyzed manually to determine memory hierarchy parameters, so it is
not suitable for incorporation into an autonomic system. Furthermore, the bench-
mark makes ﬁxed stride accesses to memory, so hardware pre-fetching on machines
like the Power 3 can corrupt the timing results. Finally, this benchmark does not
determine the latency and throughput of instructions, or whether instructions like
fused multiply-add exist. Some of these problems were addressed in more recent
tools like Calibrator [41], lmbench [45], and MOB [9], but our experience with
these tools is somewhat mixed, as we describe in Appendix A.44
This chapter makes the following contributions.
• We describe a robust, easily extensible framework for implementing micro-
benchmarks to measure hardware parameters.
• We design a number of novel algorithms for measuring memory hierarchy
parameters and other architectural features, and show how they can be im-
plemented in this framework.
• We describe a tool called X-Ray, which is an implementation of these ideas,
and evaluate its capabilities on seven modern architectures. We show that it
produces more accurate and complete results than preexisting tools.
For portability, X-Ray is implemented completely in C [24]. One of the in-
teresting challenges of this approach is to ensure that the C compiler does not
perform any high-level optimizations on our benchmarks that might pollute the
timing results.
4.2 X-Ray Framework Overview
Figure 4.1 shows the structure of a micro-benchmark in the X-Ray framework.
Each hardware parameter is measured by one such micro-benchmark.
For example, consider measuring the number of available registers for a par-
ticular data type T. One way to determine this value is to execute a number
experiments, all of which perform the same computations, but on a diﬀerent num-
ber of variables (N) of type T. When N exceeds the number of available registers
for type T, not all variables can be register allocated, and execution time should
increase substantially. The number of available registers can be inferred from this
cross-over point.45
Nano-benchmark
Specification Control
Engine
Nano-
benchmark
Generator
Compile,
Execute, 
Time
Nano-benchmark
C Code
Execution Time
Micro-benchmark
Parameters
Hardware 
Parameter Value
Figure 4.1: A micro-benchmark in X-Ray
Some general conclusions can be drawn from this example. A single micro-
benchmark may need to execute several timing experiments, which we call nano-
benchmarks. Since there may be no a priori bound of the number of required nano-
benchmarks, we need a Nano-benchmark Generator, which can produce Nano-
benchmark C Code from a high-level Nano-benchmark Speciﬁcation. Finally, gen-
eration should happen on-the-ﬂy since the results of one nano-benchmark may
determine the nano-benchmark to be executed next.
The execution of a micro-benchmark is orchestrated by its Control Engine,
which chooses which nano-benchmarks to execute, along with the appropriate pa-
rameters, and the order in which to execute them. It determines the value of the
hardware parameter based on these timing results.
Some micro-benchmarks may also require input from the results of running
other micro-benchmarks. For example, to determine the latency of an instruction
in cycles rather than in nanoseconds, the control engine needs to know the cycle
time of the processor. This can be speciﬁed by the user or it can be measured by
another micro-benchmark as discussed in Section 4.3.46
4.2.1 Nano-benchmarks
Even with access to a high-resolution timer, it is hard to accurately time operations
that take only a few CPU cycles to execute. Suppose we want to measure the
time required to execute a C statement S. If this time is small compared to
the granularity of the timer, we must measure the time required to execute this
statement some number of times RS (dependent on S), and divide that time by
RS. If RS is too small, the time for execution cannot be measured accurately,
whereas if RS is too big, the experiment will take longer than it needs to.
RS ← 1;
while(measureS (RS) < tmin)
RS ← RS × 2;
return(measureS (RS) ÷ RS);
Figure 4.2: High-level structure of a nano-benchmark
Figure 4.2 shows the timing strategy used in X-Ray nano-benchmarks. In
this code, measureS(RS) measures the time required to execute RS repetitions of
statement S. To determine a reasonable value for RS, the code in Figure 4.2 starts
by setting RS to 1, and then doubles it until the experiment runs for at least tmin
seconds. The value of tmin can be speciﬁed by the user and defaults to 0.25 seconds
in the current implementation.
A simplistic implementation of measureS is shown in Figure 4.3(a). The call
to now() is assumed to invoke the most accurate timer available on the platform.
The code in Figure 4.3(a) incurs considerable loop overhead, which might even be
greater than the time spent in executing S. To address this problem, we can unroll
the loop U times as shown in Figure 4.3(b).47
ts = now();
i = R;
loop: S;
if (--i)
goto loop;
te = now();
return te − ts;
(a)
ts = now();
i = R / U;
loop:
S;
S;
...repeat U times...
S;
if (--i)
goto loop;
te = now();
return te − ts;
(b)
initialize;
volatile int v = 0;
switch (v)
{
case 0:
i = R/U;
ts = now();
loop:
case 1: S;
case 2: S;
...
case U: S;
if (--i)
goto loop;
te = now();
if (!v)
return te − ts;
}
use;
(c)
initialize;
volatile int v = 0;
switch (v)
{
case 0:
i = R/U;
ts = now();
loop:
case 1: S1;
case 2: S2;
...
case i: Si;
...
case n: Sn;
case n + 1: S1;
...
case W: Sn;
if (--i)
goto loop;
te = now();
if (!v)
return te − ts;
}
use;
(d)
Figure 4.3: Approach for timing nano-benchmarks
Another problem is that restructuring compiler optimizations may corrupt the
experiment. For example, consider the case when we want to measure the latency
of a single addition. In our framework, we would measure the time taken to execute
the C statement p0 = p0 + p1. It is important to optimize p0 and p1 by allocating
them in registers, but it is crucial not to optimize the U statements in the loop
body to p0 = p0 +U ×p1, which would prevent from timing the original statement
correctly.48
To solve such problems, we need to generate programs which the compiler can
aggressively optimize, without disrupting the sequence and type of operations,
whose execution time we want to measure. We solve this problem using a switch
statement on a volatile variable v as shown in Figure 4.3(c). The semantics
of C require that all accesses of v go to the memory hierarchy, and therefore the
compiler cannot assume anything about which case of the switch is selected.
Because there is potential control ﬂow to each of the case blocks, it is impossible
to combine or reorder them in any way.
The ﬁnal problem is that if the compiler is able to deduce that the result of the
computations performed in S is not used in the rest of the code, it might perform
dead-code elimination and remove all instances of S altogether. To prevent this
unwanted optimization we assign the initial value of all variables that appear in
S from appropriately typed volatile variables in the initialize statement and
copy their ﬁnal values back to the same volatile variables in the use statement.
As we will see in Section 4.3, there are cases where we wish to measure the
performance of a sequence of diﬀerent statements S1,S2,...,Sn. To prevent the
compiler from optimizing this sequence, the code generator will give each Si a
diﬀerent case label, generating code of the form shown in Figure 4.3(d). In this
ﬁgure, the number of case labels W is the smallest multiple of n greater than or
equal to U.
4.2.2 Nano-benchmark Generator
The X-Ray nano-benchmark generator accepts as an input a nano-benchmark
speciﬁcation and produces nano-benchmark C code structured as shown in Fig-
ure 4.3(d).49
The nano-benchmark speciﬁcation is a tuple which contains a statement S to
be timed and type information for all variables in S. For example, to measure
the latency of double-precision ﬂoating point ADD operation, we use the nano-
benchmark speciﬁcation hp1 = p1 + p2,hp1,p2 : F64ii, which means that we time
the statement p1 = p1 + p2, where p1 and p2 are variables of type double (deﬁned
as F64 in X-Ray). Given this speciﬁcation, the nano-benchmark generator can
produce code as shown in Figure 4.3(c). Generating code of the form shown in
Figure 4.3(d) is more complex and requires the ﬁrst element of the tuple to be a
function f : integer → string, which computes the code for statement Si from the
case label i.
4.2.3 Implementing a new micro-benchmark
The following two steps describe the process of implementing a new micro-benchmark
within the X-Ray framework.
1. Decide what timing experiments are needed and implement the correspond-
ing nano-benchmarks. If the structure of these nano-benchmarks ﬁts the out-
put of the X-Ray nano-benchmark generator, as presented in Figure 4.3(d),
one only needs to provide the corresponding nano-benchmark speciﬁcations.
2. Implement themicro-benchmark control engine, describing which nano-bench-
marks to run, with what parameters, in what order, and how to produce a
ﬁnal result from the external parameters and the timings.
As we will see in Section 4.3, both these steps require very few lines code.50
4.3 CPU Micro-benchmarks
We now describe how X-Ray measures a number of key CPU parameters.
4.3.1 CPU Frequency
CPU frequency (FCPU) is an important hardware parameter because other parame-
ters are measured relative to it (in clock cycles). X-Ray assumes that dependent
integer additions can be executed at the rate of one per cycle, which is valid for most
current processors. The assumption of dependence is important because modern
architectures can often issue two or more independent integer addition operations
in one cycle, so timing independent addition operations would be misleading.
For this micro-benchmark we use a single nano-benchmark with speciﬁcation
S = hp0 = p0 + p1,hp0,p1 : intii. Given the time time(S) in nanseconds required
to execute the statement S, we compute the CPU frequency in MHz as follows.
FCPU ← 1000 ÷ time(S)
4.3.2 Instruction Latency
The latency LO,T of an operation (instruction) O, with operands of type T, is the
number of cycles after one such instruction is dispatched until its result becomes
available to subsequent dependent instructions.
For this micro-benchmark we use a nano-benchmark with speciﬁcation SO,T =
hp0 = O(p0,p1),hp0,p1 : Tii. We then compute the instruction latency in clock
cycles as follows.
LO,T ←
time(SO,T)
1000 ÷ FCPU51
4.3.3 Instruction Throughput
The throughput TPO,T of an operation (instruction) O, with operands of types
T, is the rate in cycles at which the CPU can issue independent instructions of
that type. On modern processors the throughput of an instruction is usually much
smaller than its latency, because of pipelining and super-scalar execution.
To measure TPO,T, we use a nano-benchmark speciﬁcation
SO,T,N = hpi%N = O(pi%N,pN),hp0,p1,...,pN : Tii.
Note that this speciﬁcation generates code of the form shown in Figure 4.3(d). It is
further parameterized by N – the number of independent instructions to generate.
For example, the sequence of statements generated for N = 3 is the following.
case 0 : p0 = O(p0,p3);
case 1 : p1 = O(p1,p3);
case 2 : p2 = O(p2,p3);
case 3 : p0 = O(p0,p3);
...
case W : p2 = O(p0,p3);
We then compute the instruction throughput in clock cycles as follows.
N ← 2;
while
￿
time(SO,T,N)
time(SO,T,N−1) > 1 − ￿
￿
N ← N + 1;
TPO,T ←
time(SO,T,N−1)
1000÷FCPU ;
The nano-benchmark code for SO,T,N exhibits instruction-level parallelism (ILP)
on the order of N. The control engine times the nano-benchmark for successively52
growing values of N while performance continues to increase due to the additional
ILP. When the performance levels oﬀ for some N , the throughput TPO,T is com-
puted as the last (fastest) timing divided by the clock cycle time.
In practice, the X-Ray implementation of the throughput micro-benchmark is
somewhat more complex. The reason for this increased complexity is that the im-
plementation we just described does not permit the exploitation of instruction-level
parallelism in statically scheduled VLIW cores because the case labels prevent the
compiler from scheduling VLIW bundles with more than one instruction. Therefore
we use a nano-benchmark speciﬁcation with a more complex statement generating
function, that puts B independent instructions at each case statement, as follows.
SO,T,N,B =
<
{
p(i×B+0)%N = O(p(i×B+0)%N,pN);
p(i×B+1)%N = O(p(i×B+1)%N,pN);
...
p(i×B+B−1)%N = O(p(i×B+B−1)%N,pN);
},
hp0,p1,...,pN : Ti
>53
The corresponding control engine algorithm is as follows.
N ← 2;
while
￿
time(SO,T,N,1)
time(SO,T,N−1,1) > 1 − ￿
￿
N ← N + 1;
B ← 2;
N ← N − 1;
while
￿
time(SO,T,N×B,B)÷B
time(SO,T,N×(B−1),B−1)÷(B−1) > 1 − ￿
￿
B ← B + 1;
TPO,T ←
time(SO,T,N×(B−1),B−1)÷(B−1)
1000÷FCPU ;
4.3.4 Instruction Existence
Many embedded processors do not have dedicated ﬂoating-point hardware. Some
have single-precision ﬂoating-point functional units, but not double-precision ones.
In case dedicated ﬂoating-point hardware is not present, usually an emulation li-
brary is used. In X-Ray we can easily determine the presence of dedicated ﬂoating-
point hardware by measuring the latency of a ﬂoating-point ADD operation of the
appropriate type. If the latency is more than a few cycles (10 in our implementa-
tion), we conclude that the operation is emulated, otherwise we conclude that it
is executed in hardware.
In certain cases, it is not obvious how a certain C statement is translated to
instructions. One very common operation for numerical applications is p1 = p1 +
p2×p3. On some platforms, this statement is compiled into a single fused multiply-
add instruction (FMA), while on some it is compiled into a separate multiply and
add instructions. If an FMA instruction does not exist, the compiler will need
an extra register to store the intermediate value and schedule two instructions54
instead of one. This has an impact on how such sequences of statements need to
be scheduled. For example, the ATLAS [60, 61] library generator produces diﬀerent
code for the compiler depending upon the existence of an FMA instruction.
With an FMA instruction the CPU can execute an add instruction “for free”
together with a multiply instruction. Therefore we determine the existence of FMA
by comparing the throughput of a simple multiply with that of a fused multiply-
add.
4.3.5 Number of Registers
To measure the number of registers NRT of particular type T, we use the speciﬁ-
cation ST,N =
D
pi%N = pi%N + p(i+N−1)%N,hp0,p1,...,pN : Ti
E
. For example, the
sequence of statements generated for N = 4 is as follows.
case 0 : p0 = p0 + p3;
case 1 : p1 = p1 + p0;
case 2 : p2 = p2 + p1;
case 3 : p3 = p3 + p0;
case 4 : p0 = p0 + p3;
...
case W : p3 = p3 + p0;
If all of pi are allocated in registers, the time per operation is much smaller
than when some are allocated in memory. The goal is to determine the maximum
N, for which no variables are allocated to memory. The control engine doubles N
until it observes a drop in performance. After that it performs a binary search in55
the interval [N ÷ 2,N). The actual control engine algorithm is as follows.
N ← 4;
while
￿
time(ST,N)
time(ST,2) < 1 + ￿
￿
N ← N × 2;
R ← N;
L ← N
2
while(R − L > 1)
P ←
(R+L)
2 ;
if
￿
time(ST,P)
time(ST,2) < 1 + ￿
￿
R ← P;
else
L ← P;
NRT ← L;
4.3.6 SMP and SMT parallelism
To measure the number of processors in a SMP architecture, X-Ray once again
uses the “throughput” nano-benchmark. Its conﬁguration is instantiated with
hADD,I32,ni, where n is the value at which the CPU is saturated as measured by
the Initiation Interval micro-benchmark.
The number p of concurrent instances of this conﬁguration, which exhibits no
slowdown compared to running a single instance characterizes the number of CPUs
in a SMP. Reading the number of CPUs with an OS call returns the number v
of virtual SMT processors. The SMT per CPU of the system is computed as v
p.
To ﬁnd which two virtual processors share the same physical processor, X-Ray
executes instances of the aforementioned conﬁguration concurrently on both. If56
there is no slowdown, the two virtual processors do not share a physical processor.
4.4 Memory Hierarchy Micro-benchmarks
We present micro-benchmarks for measuring the parameters of the memory hier-
archy of a platform, including all levels of cache, registers, and the TLB. Existing
tools such as lmbench [44], Calibrator [41] and MOB [9] measure some of these
memory hierarchy parameters, but our experiments show that none of them oﬀer
the same coverage of parameters or the accuracy of our micro-benchmarks. These
tools implement variations of the micro-benchmark developed by Saavedra [53],
which is reproduced in Hennessy and Patterson’s architecture book [33] and is
discussed in Section 4.4.1. This benchmark, which is a C program, measures the
time required to access a series of array elements with diﬀerent strides. The timing
results are fairly complex because the micro-benchmark considers all levels of the
memory hierarchy simultaneously. Therefore, these results are usually interpreted
manually to obtain the memory hierarchy parameters. Although tools like MOB,
Calibrator and lmbench can determine some cache parameters automatically from
these timing results, none of them are able to measure cache associativity for ex-
ample. Moreover, optimizations performed by modern compilers when compiling
the C code can confuse the timing measurements. Yet another problem is that
hardware pre-fetching for ﬁxed-stride accessed to memory on machines like the
IBM Power architecture can compromise the timing measurements further.
The micro-benchmarks we present determine cache parameters for one level of
cache at a time, rather than considering all levels simultaneously. In particular,
when measuring the parameters of a cache level i, the micro-benchmarks ensure
that higher cache levels L1, L2, ... Li+1 are “transparent” to the measurements57
in the sense that the memory accesses are guaranteed to miss in those caches. Of
course this is not a problem for the L1 cache, so our micro-benchmarks use ﬁxed
stride accesses there, as we discuss in Section 4.4.3. However, our algorithms are
novel, and we take care to ensure that compiler optimizations and hardware pre-
fetching do not compromise the timing measurements. In Section 4.4.4, we show
that by using more complex memory access patterns (in particular, a sequence of
sequences), we can measure the parameters of lower cache levels without interfer-
ence from higher cache levels. In Sections 4.4.5 and A.1.3, we show how some
TLB parameters and the number of registers can be measured automatically. We
present experimental results in Appendix A that show that we provide better accu-
racy and coverage of memory hierarchy parameter measurement than preexisting
tools. Finally, in Section 4.5, we discuss future work.
4.4.1 Previous Approaches
The general approach to measuring memory hierarchy parameters is to repeatedly
access the elements of a large array in memory using diﬀerent strides, and measure
the average time per access. The results are then interpreted to deduce diﬀerent
memory hierarchy parameters. The most widely known micro-benchmark for such
measurements is the benchmark of Saavedra [53], a stylized version of which is
presented in Figure 4.4. We make the following observations.
1. The code performs series of experiments for pairs of the form hcsize,stridei,
where the array size (csize) varies between CACHE MIN and CACHE MAX and
the stride (stride) varies between 1 and csize. Both are restricted to pow-
ers of 2.58
#define SAMPLE (5)
#define CACHE MIN (1024)
#define CACHE MAX (16*1024*1024)
int x[CACHE MAX];
int main ()
{
int temp;
for (int csize = CACHE MIN; csize <= CACHE MAX; csize *= 2)
for (int stride = 1; stride <= csize / 2; stride *= 2)
{
double time = 0.0;
int steps = 0;
int tsteps = 0;
int limit = csize - stride + 1;
do
{
double time0 = get time();
for (int i = SAMPLE * stride; i != 0; --i)
for (int index = 0; index < limit; index += stride)
x[index]++;
steps++;
time += get time() - time0;
} while (time < 1.0);
do
{
double time0 = get time();
for (int i = SAMPLE * stride; i != 0; --i)
for (int index = 0; index < limit; index += stride)
temp += index;
tsteps++;
time -= get time() - time0;
} while (tsteps < steps);
printf("size: %d, stride: %d, time: %d",
csize * sizeof(int), stride * sizeof(int),
(int)(time * 1E9 / (steps * SAMPLE * stride *
((limit - 1) / (stride + 1)));
}
}
Figure 4.4: Standard memory hierarchy benchmark59
2. For each pair hcsize,stridei, the benchmark traverses the array x with the
speciﬁed stride enough times (SAMPLE×steps) to ensure that the total time
spent is at least one second.
3. The measurement for the same pair hcsize,stridei is repeated using the
exact same looping code structure, and the same number of times, but this
time accessing a single scalar variable (temp) instead of elements of x.
The benchmark has problems at both the algorithmic and implementation level,
as summarized below.
1. Algorithmic Level
(a) The benchmark considers all levels of the memory hierarchy simultane-
ously, so each timing result is possibly inﬂuenced by several parameters
from diﬀerent cache levels. Therefore, the interpretation of the timing
results is complex.
(b) The benchmark does not interpret the timing results to produce ac-
tual memory hierarchy parameters itself, but rather produces a set of
measurements that need to be interpreted manually.
(c) The benchmark uses only array sizes restricted to powers of 2, which
prevents it from measuring cache capacities that are not a power of 2.
However, an increasing number of caches on modern architectures, such
as the Level 3 cache on the Itanium 2, have capacities that are not a
power of 2.
2. Implementation Level60
(a) The source code uses a very complex looping structure, which is the
source of substantial loop overhead. An attempt is made to account for
that overhead by measuring and subtracting the time of execution of a
cloned version the same looping structure which does not perform any
memory accesses. Unfortunately, there is no control over the back-end
compiler, so diﬀerent code may be produced for the two replicas, thus
yielding inaccurate results.
(b) All memory accesses are independent, which allows an aggressively op-
timizing compiler to schedule them in a way so that some overlap. This
is mainly a problem for measuring access latency, but also cause inac-
curacies in measuring other parameters.
(c) Both memory read and write are performed on the current array ele-
ment, which introduces interference with write buﬀers and further pro-
hibit the measurement of these operations in isolation.
(d) The addressing mode used to access array elements involves base ad-
dress and oﬀset and on many RISC architectures this operation requires
an extra address computation instruction before performing the actual
memory access instruction.
(e) The source code does not use the values of accessed array elements and
more importantly the value of the temp variable for anything mean-
ingful, so a smart optimizing compiler can prune portions of the code
during dead code elimination.
(f) There is a constant stride between accesses of array elements and some
modern architectures provide hardware that is able to prefetch constant61
stride accesses to memory into the higher levels of the memory hierarhcy.
(g) It is implicitly assumed and very important for the benchmark that the
array x is stored in a contiguous chunk of memory. In reality, it is only
guaranteed to be contiguous in virtual memory and can be fragmented
in physical memory. In many cases lower cache levels are physically
addressed, which invalidates this important assumption.
The existing systems we examined all use this micro-benchmark in one form
or another, although some of them attempt to address some of these problems
in various ways. Our approach is very diﬀerent at the algorithmic level, and it
eliminates the implementation problems discussed above.
4.4.2 Compactness of Sequences
Our micro-benchmarks measure the associativity (A), block size (B), capacity (C),
and hit latency (l) of caches. The ﬁrst three parameters are sometimes referred to
as the hA,B,Ci of caches.
offset￿ index￿ tag￿
t=20￿ i=7￿ b=5￿
Figure 4.5: Memory address decomposition on Pentium III
Figure 4.5 shows the typical structure of a memory address. We use the Intel
Pentium III architecture in the following explanations. On these machines, the L1
data cache is organized as hA,B,Ci = h4,32,16KBi. Therefore the cache contains
C ÷B = 16384÷32 = 512 individual blocks, divided into 512÷A = 512÷4 = 128
sets of 4 blocks each. The highest t = 20 bits constitute the block tag, i = 7 bits62
are needed to index one of the 128 sets, and b = 5 bits are needed to store the
oﬀset of a particular byte within the 32-byte block.
Deﬁnition 1. For a cache with associativity A and capacity C, we deﬁne the stride
T of that cache as T ≡ C
A.
Note that T = 2i+b, and thus C = A×2i+b. Lemma 1 gives another character-
ization of T.
Lemma 1. Consider a cache with stride T, and addresses m0 and m aligned to
a cache block boundary. The address m maps to the same cache set as m0 iﬀ
m = m0 + k × T for some integer k.
Proof. Follows directly from the deﬁnition.
Unlike cache stride, associativity and capacity do not have to be a power of 2.
For example, some versions of the Intel Itanium 2 have a 24-way set associative L3
cache with a capacity of 6MB.
If W is a set of addresses, we deﬁne projecti (W) to be the subset of W con-
taining only the addresses that map to cache set i, and indices(W) to be the set
of cache indices of the elements of W.
Deﬁnition 2. For a set of addresses W, and an index i,
projecti (W) ≡ {m ∈ W : index(m) = i}
Deﬁnition 3. For a set of addresses W,
indices(W) ≡ {i : projecti (W) 6= ∅}
We assume that set-associative caches implement the least-recently-used (LRU)
replacement policy. This assumption is reasonable because most modern processors63
implement variants of this policy. Moreover, our experimental results show that
our micro-benchmarks can be accurate even when the policy is not LRU.
Sequences
Some of our micro-benchmarks access sequences of N addresses, where successive
addresses are separated by a stride S = 2σ as shown in Figure 4.6(a). Such
sequences are completely characterized by their starting address m0, stride S and
number of elements N. Therefore we use the notation hm0,S,Ni to represent them.
To measure parameters of multilevel memory hierarchies our micro-benchmarks use
sequences of sequences, as shown in Figure 4.6(b). To represent them we use the
notation W = hhm0,s,ni,S,Ni.
S S ...
1 2 3 N
m0
(a) hm0,S,Ni
s s ...
S-(n-1)*s
s s ...
...
s s ...
N
1 2 3 n 1 2 3 n 1 2 3 n
S ...
m0
(b) hhm0,s,ni,S,Ni
Figure 4.6: Sequences of sequences
Deﬁnition 4.
(a) hm0,S,Ni ≡ [m0,m0 + S,...,m0 + (N − 1)S]64
(b) hhm0,s,ni,S,Ni ≡ ∪i∈[0,N−1] hm0 + i × S,s,ni
In Deﬁnition 4(b), we call each subsequence hm0 + i × S,s,ni of hhm0,s,ni,S,Ni
an inner subsequence.
Notice that the sequence of addresses in Figure 4.6(b) can also be expressed as
hhm0,S,Ni,s,ni. This property is expressed in Lemma 2.
Lemma 2. hhm0,s,ni,S,Ni ≡ hhm0,S,Ni,s,ni
Compactness
We determine cache parameters by measuring the average time per memory access
when accessing the elements of certain sets of memory addresses.
When all addresses of an address sequence W can coexist together in a cache
we say that W is compact with respect to that cache and the average access time
is the cache hit latency lhit. When the sequence is not compact and we repeatedly
access its elements the cache will suﬀer some misses. If every single access is a
cache miss, we say that W is non-compact and the average access time is the
cache miss latency lmiss, which is typically much greater than lhit. Finally, when
some accesses are cache hits and some are cache misses, the average access time is
between lhit and lmiss and we say that W is semi-compact. Deﬁnition 5 presents
this concepts formally.
Deﬁnition 5. For a cache with associativity A,
compact(W) ≡ ∀i ∈ indices(W) : |projecti (W)| ≤ A
non-compact(W) ≡ ∀i ∈ indices(W) : |projecti (W)| > A
semi-compact(W) ≡ ¬compact(W) ∧ ¬non-compact(W)65
The deﬁnition says that, for any cache index from the set of indices for W,
a compact sequence will have at most A elements with this index, while a non-
compact sequence will have at least A + 1 elements with this index. A sequence
is semi-compact if there is an index with at most A elements, as well as an index
with at least A + 1 elements.
Lemma 3. Compact sequences have the following properties.
(a) For a cache with capacity C and block size B, and an address m0, aligned on
a cache block boundary, the half-open interval [m0,m0 + C) is compact.
(b) A subset of a compact sequence is compact.
(c) If indices(W1)∩indices(W2) = ∅, and W1 and W2 are compact, then W1∪W2
is compact.
(d) If indices(W1)∩indices(W2) = ∅, and W1∪W2 is non-compact, then W1 and
W2 are non-compact.
(e) If W1 and W2 are non-compact then W1 ∪ W2 is non-compact.
Proof. (a) The interval [m0,m0 + C) is equivalent to the sequence W = hm0,1,Ci =
D
hm0,1,Bi,B, C
B
E
. Because m0 is aligned on B, the cache lines used by W are
the same as the cache lines used by c W =
D
m0,B, C
B
E
, in which only one ad-
dress is mapped to a single cache line. Furthermore c W can be expressed as
DD
m0,B, T
B
E
,T, C
T
E
. From Lemma 1, all inner subsequences c wi =
D
m0 + i × T,B, T
B
E
map exactly one element to each cache set. Therefore c W maps exactly A = C
T el-
ements to each cache set, and by Deﬁnition 5 it is compact. Because W uses the
exact same cache lines, it is also compact.
Results (b)-(e) follow directly from Deﬁnition 5.66
4.4.3 L1 Data Cache
To measure the parameters of the L1 data cache our micro-benchmarks measure
the average time per element to access the elements of certain compact and non-
compact ﬁxed stride sequences.
S 2S C C + T
lmiss
lhit
N × S
latency
Figure 4.7: Example of (semi-/non-)compact
Figure 4.7 gives some intuition about the compactness properties of a sequence
W = hm0,S,Ni where S ≤ T. When N × S ≤ C the sequence is compact as it
maps at most A addresses to each cache set. When N × S ≥ C + T the sequence
is non compact, as it maps at least A + 1 addresses to each cache set. When
C < N ×S < C +T, the sequence maps A addresses to some of the cache sets and
A + 1 address to the rest of the cache sets. For S ≥ T there are no semi-compact
sequences, and for S < T, W is semi-compact for T
S −1 diﬀerent values of N. For
example, for S =
T
2 there is only one N =
C+S
S for which the W is semi-compact.
Theorem 1 describes the necessary and suﬃcient conditions for compactness
and non-compactness of a sequence of this type for a given cache. Informally, this67
theorem says that as the stride S gets bigger, the maximum length of a compact
sequence with that stride decreases until it bottoms out at A, while the minimum
length of a non-compact sequence with that stride decreases until it bottoms out
at A + 1.
Theorem 1. Consider a cache with parameters hA,B,Ci and a sequence W =
hm0,S,Ni.
(a) compact(W) ⇔ N ≤ Nc = A
l
T
S
m
(b) non-compact(W) ⇔ N ≥ Nnc = (A + 1)
l
T
S
m
Proof. There are two cases to consider.
• S ≥ T. In this case Nc = A and Nnc = A + 1.
Since both S and T are powers of 2, S must be an integer multiple of T.
From Lemma 1 it follows that all N addresses in the sequence map to the
same cache set.
Therefore the sequence is compact iﬀ for N ≤ A = Nc and non-compact iﬀ
N ≥ A + 1 = Nnc.
• S < T. Since S and T are both powers of 2,
T
S is an integer, and Nc = A×
T
S
and Nnc = (A + 1) × T
S.
Let N = p×T
S+r, where 0 ≤ r < p. We can divide W into p+1 parts, in which
the ﬁrst p have T
S elements, and the last one has r elements. Furthermore,
we represent W as the union of two sequences of sequences: one with p + 1
subsequences of length r and one with p subsequences of length T
S − r. This68
r T/S-r r T/S-r r
...
T T T/S*r
(number of addresses)
(size)
m0
Figure 4.8: Decomposition of W = hm0,S,Ni
is presented pictorially in Figure 4.8.
W = hhm0,S,ri,T,p + 1i
∪
￿￿
m0 + r × S,S,
T
S
− r
￿
,T,p
￿
= hhm0,T,p + 1i,S,ri
∪
￿
hm0 + r × S,T,pi,S,
T
S
− r
￿
From Lemma 1, the elements of each inner subsequence map to the same
cache set, whereas elements from diﬀerent subsequences map to diﬀerent
cache sets. Therefore r cache sets will have p+1 diﬀerent addresses mapped
to each of them and
T
S − r cache sets will have p addresses mapped to each
of them.
– N < Nc. In this case p < A, i.e. p + 1 ≤ A. Therefore T
S cache sets
have at most A diﬀerent addresses mapped to each of them, so W is
compact.
– N = Nc. In this case p = A and r = 0. Therefore T
S cache sets have
exactly A diﬀerent addresses mapped to each of them, so W is compact.
– Nc < N < Nnc. In this case p = A and 0 < r < T
S. Therefore r
cache sets have exactly A + 1 diﬀerent addresses mapped to each of69
them and
T
S − r diﬀerent cache sets have exactly A diﬀerent addresses
mapped to each of them, so W is neither compact nor non-compact (it
is semi-compact).
– N ≥ Nnc. In this case p ≥ A + 1. Therefore T
S cache sets have at least
A+1 diﬀerent addresses mapped to each of them, so W is non-compact.
The required result follows directly from this.
Algorithms for Measuring Parameters
In this section we use the function is compact(W) to determine empirically if W
is compact. Our implementation of this function repeatedly accesses each address
in W, computes the average time per access l, and declares the sequence to be
compact if l is close to the hit latency of the cache lhit, which is measured as
described below.
Although this procedure seems simple in principle, the timing measurements
require some special care to avoid the problems discussed in Section 4.4.1 and we
discuss how we address these in Section 4.4.3.
Cache Latency
We determine lhit by measuring the average time per access of the sequence
hm0,1,1i, which is compact since it contains a single element.
Capacity and Associativity
Theorem 1 suggests a method for determining the capacity C and the associa-
tivity A of the cache. First, we ﬁnd A by determining the asymptotic limit of the
length of a compact sequence as the stride is increased. The smallest value of the70
stride for which this limit is reached is T, the stride of the cache; once we know A
and T, we can ﬁnd C.
S ← 1;
N ← 1;
while (is compact(hm0,S,Ni)
N ← 2 × N
Nold ← N;
N ← 0;
while (N 6= Nold)
S ← 2 × S;
Nold ← N;
N ← minNmin ∈ [1,Nold] : ¬is compact(hm0,S,Nmini);
A ← N − 1;
C ← S
2 × A;
Figure 4.9: Measuring C and A of L1 Data Cache
Pseudo-code for measuring C and A of the L1 data cache is shown in Fig-
ure 4.9. The algorithm can be described as follows. Start with the sequence
hm0,S,Ni = hm0,1,1i, which is compact, and keep doubling N until the sequence
is not compact. Let Nold is the ﬁrst N for which this happens. Now start doubling
the stride S, and for each S compute the smallest N, for which hm0,S,Ni is not
compact. This value of N can be found by using binary search in the interval
[1,Nold]. If N 6= Nold, Let Nold = N and recompute N for the next S. Repeat this
step until N = Nold. At this point, declare A = N − 1 and the C = S
2 × A.
The largest stride S used in this algorithm is 2T. We will exploit this fact when71
we consider multi-level cache hierarchies.
Note that the number of addresses accessed by the algorithm in this micro-
benchmark is on the order of the associativity of the cache, which is superior to
previous approaches because non-compactness produces a very pronounced perfor-
mance drop, which is much easier to detect automatically.
Block Size
For given cache parameters C, A and T, hm0,T,2Ai is non-compact since all
2A addresses map to the same cache set. This sequence can also be expressed as
hhm0,T,Ai,C,2i. If we oﬀset the second half of the sequence by a constant δ, as
shown in Figure 4.10, we get the sequence hhm0,T,Ai,C + δ,2i.
T T ... T T ... T+B
C+B
1 2 3 A 1 2 3 A
m0
Figure 4.10: Modiﬁed address sequence for measuring B
The addresses in each of the inner subsequences hm0,T,Ai and hm0 + C + δ,T,Ai
map to a single cache set. When 0 ≤ δ < B this cache set is the same for both
subsequences. When δ ≥ B they map to two diﬀerent cache sets. Therefore the
smallest value of δ for which the full sequence hhm0,T,Ai,C + δ,2i is compact is
δ = B. Figure 4.11 shows pseudo-code for the algorithm.
Implementation of is compact
The algorithms in Section 4.4.3 call the function is compact(W) to determine
whether sequence W is compact. We now describe how this function is imple-72
δ ← 1
while (!is compact(hhm0,T,Ai,C + δ,2i))
δ ← 2 × δ;
return δ;
Figure 4.11: Algorithm for measuring B
mented to avoid the problems discussed in Section 4.4.1.
The array of elements is declared of type pointer (void *) instead of integer
(int) as in the Saavedra benchmark. The array is initialized in such a way that each
element contains the address of the element that should be accessed immediately
after it. A local variable p is initialized with the address of the element that should
be accessed ﬁrst. This initialization is performed oﬀ-line, before the actual timing.
A simpliﬁed version of the timing routine is presented in Figure 4.12. The
variable R is chosen so that the loop executes for at least a predetermined amount
of time t. Larger values of R are likely to produce more accurate timing results
at the expense of additional running time. In our implementation, we use t = 1
second.
In addition, in the actual implementation, the while loop is unrolled several
times to avoid loop overhead.
It is easy to see that the only operation performed in the loop body is p ←
*(void **)p, which reads the memory address stored at address p and updates p
with it, eﬀectively following the pointer chain preprogrammed inside the array.
The following points address the implementation problems of the Saavedra
benchmark discussed in Section 4.4.1.
(a) The code in Figure 4.12 uses the simplest possible looping structure, and73
startTime ← get time();
while (--R)
p ← *(void **)p;
timePerAccess ← (get time() − startTime) ÷ R;
printf("", p);
Figure 4.12: Improved timing of memory accesses
loop overhead can be reduced as much as needed by suﬃcient unrolling. In
our implementation we unroll 256 times.
(b) Each of the memory accesses depends on the previous one to produce the
actual address to access, so aggressive compilers cannot take advantage of
instruction-level parallelism and overlap them.
(c) Each memory access constitutes precisely one memory read instruction, so
the actual timing corresponds exactly to the average latency per access.
(d) All modern architectures today support an indirect addressing mode, so each
operation should be translated to a single machine instruction (e.g., “lea
eax, [eax]” on x86 ISA).
(e) The ﬁnal value of the variable p is used by the printf statement, so the
compiler is not able to optimize the memory accesses away by dead code
elimination.
(f) For a correct implementation of is compact(W), it is important that we re-
peatedly access all elements of the sequence, but the actual order in which we
access them is irrelevant. To prevent hardware constant stride prefetchers,74
like those on the IBM Power architecture, from interfering with our timings,
we initialize the array elements by chaining the pointers so that we visit the
elements in a pseudo-random order.
Suppose the address sequence is m0,m1,...,mn−1. One way to reorder
this sequence is to choose a number p, such that p and n are mutually
prime. Then, after element mi, visit element m(i+p) modulo n instead of el-
ement a(i+1) modulo n. As p and n are mutually prime, the recurrence i ←
(i + p) modulo n is guaranteed to generate all the integers between 0 and
n − 1 before repeating itself.
(g) All modern processors have virtually indexed L1 data caches and therefore
physical continuity is not an issue. Lower levels of the memory hierarchy
are usually physically indexed, so physical continuity is important for lower
levels of the memory hierarchy, as we discuss in Section 4.4.4.
4.4.4 Lower Levels of the Memory Hierarchy
We denote the cache at level i as Ci, its hA,B,Ci parameters as hAi,Bi,Cii, its
stride as Ti and its hit latency as li. We extend the notation from the previous
section, so that compacti (W) denotes that compact(W) holds with respect to Ci.
We extend non-compact and semi-compact in the same way.
Measuring parameters of lower levels of the memory hierarchy is considerably
more diﬃcult than measuring the parameters of the L1 data cache. One reason
why the algorithms described in Section 4.4.3 cannot be used directly is that Ci
is accessed only if Ci−1 suﬀers a miss. Therefore compactness with respect to Ci
of a sequence of addresses can be accurately determined empirically only if this75
sequence is non-compact with respect to C1,C2,...Ci−1.
Our solution to this problem is to transform any sequence W into a new se-
quence W ∗, with the following properties.
1. compacti (W ∗) ⇔ compacti (W)
2. non-compactj (W ∗), for all j ∈ [1,i − 1]
Intuitively, W is of the form presented in Figure 4.6(a). We want to transform
it to W ∗, which is a sequence of sequences of the form presented in Figure 4.6(b),
so that the extra memory accesses exhaust the associativity at cache levels above
Ci. Such a transformation may be necessary because on some architectures, lower
level caches are less associative than higher level caches. For example some ver-
sions of the IBM Power 3 have 8MB, 8-way set associative C2 and 64KB, 128-
way set associative C1. Therefore the ﬁnal iteration of the algorithm in Figure 4.9
should be examining the sequence W = hm0,2MB,9i and declaring it non-compact.
Without transforming W this will not happen, because although the sequence is
non-compact with respect to C2, it is compact with respect to C1. As we discuss
later, the corresponding W ∗ we use for such W is W ∗ = hhm0,512,15i,2MB,9i,
which is non-compact with respect to C1. Another way to view this sequence is
W ∗ = hhm0,2MB,9i,512,15i, i.e., 15 copies of the original sequence W shifted by
a factor of 512. Each of these copies behaves identically to the original W with
respect to C2, but together they force non-compactness with respect to C1.
To generalize Theorem 1 to sequences of sequences we ﬁrst prove Lemma 4.
Lemma 4. Consider a cache with parameters hA,B,Ci and stride T. If W1 =
hm0,S,Ni and W2 = hm0 + δ,S,Ni, where m0 and m0 + δ is aligned on a cache76
block boundary, and 0 < δ < min(S,T), then indices(W1) and indices(W2) are dis-
joint.
Proof. We will consider two cases.
First, let S ≥ T. From Lemma 1 all elements of W1 map to the same cache set
i1; similarly all elements of W2 map to the same cache set i2. Because δ < T, m0
and m0 + δ map to diﬀerent cache sets, so i1 6= i2.
Second, let S < T. Let N = p × T
S + r, where 0 ≤ r < T
S. Then:
W1 = hm0,S,Ni ⊂
￿
m0,S,(p + 1) ×
T
S
￿
= d W1
W2 = hm0 + δ,S,Ni ⊂
￿
m0 + δ,S,(p + 1) ×
T
S
￿
= d W2
Now we split W1 and W2, which both have (p + 1) × T
S elements, into two
sequences of p + 1 subsequences with T
S elements each.
d W1 =
￿
c w1 =
￿
m0,S,
T
S
￿
,T,p + 1
￿
d W2 =
￿
c w2 =
￿
m0 + δ,S,
T
S
￿
,T,p + 1
￿
From Lemma 1, indices
￿
d W1
￿
= indices(c w1) and indices
￿
d W2
￿
= indices(c w2).
The addresses of the last elements of c w1 and c w2 are m0+T −S and m0+δ+T −S
respectively. Therefore , all addresses in c w1 and c w2 are contained in the half-open
interval [m0,m0 + T). Any two addresses m1 ∈ c w1 and m2 ∈ c w2 are aligned on a
cache block boundary and therefore from Lemma 1 they map to diﬀerent cache sets.
Therefore indices(c w1) and indices(c w2) are disjoint, so indices
￿
d W1
￿
and indices
￿
d W2
￿
are disjoint, which implies that indices(W1) and indices(W2) are disjoint.
Theorem 2. Consider a cache with parameters hA,B,Ci and stride T, and a
sequence of sequences W ∗ = hhm0,s,ni,S,Ni, where (n − 1)×s < min(T,S) and
B ≤ s.77
(a) compact(W ∗) ⇔ N ≤ Nc = A
l
T
S
m
(b) non-compact(W ∗) ⇔ N ≥ Nnc = (A + 1)
l
T
S
m
Proof. From Lemma 2 and Deﬁnition 4,
W
∗ = hhm0,S,Ni,s,ni = ∪i∈[0,n−1] hm0 + i × s,S,Ni.
From Theorem 1 each of the sequences wi = hm0 + i × s,S,Ni for i ∈ [0,n−1]
is compact for N ≤ Nc, non-compact for N ≥ Nnc, and semi-compact otherwise.
From Lemma 4, indices(wi) are pairwise disjoint sets for all i ∈ [0,n − 1]. The
required result follows from Lemma 3.
Note that Theorem 1 is a special case of Theorem 2 for n = 1. In this case the
constraint (n − 1)×s < T is trivially true and the sequence hm0,s,ni has a single
element (m0).
Two Cache Levels
Consider two cache levels, C1 = hA1,B1,C1i and C2 = hA2,B2,C2i.
To apply the algorithms in Section 4.4.3 to measure parameters for C2, we
replace each sequence W in those algorithms with a sequence of sequences W ∗,
such that compact2 (W ∗) ⇔ compact2 (W) and non-compact1 (W ∗).
Ideally we would have a general construction that could construct such a W ∗
from any W = hm0,S,Ni. Since we do not have such a general construction, we
will present an approach, which works for the particular sequences used by the
algorithms in Section 4.4.3. In particular we will restrict ourselves to sequences for
which S ≤ 2T (because 2T is the largest stride used by these algorithms). Further-
more, it is invariably the case that C2 ≥ 2C1, so if (N − 1)×S ≤ 2C1 the sequence78
W can be assumed compact without performing an empirical measurement. There-
fore we can restrict ourselves to sequences for which (N − 1) × S > 2C1.
With these restrictions, we choose
W
∗ = hm0,S,Ni = hhm0,s,ni,S,Ni,
where:
s = T1
n =
￿A1 + 1
N
￿
.
Lemma 5. If S ≤ 2T2 and (N − 1) × S > 2C1 then
(a) compact2 (W ∗) ⇔ compact2 (W) and
(b) non-compact1 (W ∗).
Proof.
(a) First we show that Inequality (4.1) holds.
￿￿A1 + 1
N
￿
− 1
￿
× T1 <
S
2
(4.1)
The opposite is impossible, because then:
S
2
≤
￿￿A1 + 1
N
￿
− 1
￿
× T1 ≤
￿A1 + N
N
− 1
￿
× T1
≤
A1
N
× T1 <
S
2C1
× A1 × T1 =
S
2
⇒
S
2
<
S
2
.
From (4.1) and S ≤ 2T2 we conclude that
￿l
A1+1
N
m
− 1
￿
× T1 < min(S,T2).
Therefore we can apply Theorem 2 to W ∗, and so compact2 (W ∗) ⇔ N ≤ Nc,
where Nc = A2×
l
T2
S
m
. On the other hand, from Theorem 1, compact2 (W) ⇔
N ≤ Nc. Therefore compact2 (W ∗) ⇔ compact2 (W).79
(b) From (N − 1) × S > 2C1 we obtain:
N >
2C1 + S
S
≥
A1 × T1 + T1
S
= (A1 + 1) ×
T1
S
≥ (A1 + 1) ×
￿T1
S
￿
⇒ N > (A1 + 1) ×
￿T1
S
￿
From Theorem 2, it follows that non-compact1 (W ∗)
Multiple Cache Levels
To generalize the approach from Section 4.4.4 to multiple cache levels C1,C2,...,Ck
we replace W with W ∗ = hhm0,s,ni,S,Ni, where
s = min
i<k
Ti
n = max
i<k
￿Ai + 1
N
￿
×
Ti
s
Lemma 6. If S ≤ 2Tk and (N − 1) × S > 2Ci for all i ∈ [1,k − 1] then
(a) compactk (W ∗) ⇔ compactk (W) and
(b) non-compacti (W ∗) for all i ∈ [1,k − 1].
Proof.
(a) By analogy with Inequality (4.1), Inequality (4.2) holds.
max
i∈[1,k−1]
￿￿Ai + 1
N
￿
− 1
￿
× Ti <
S
2
(4.2)
Therefore:
(n − 1) × s =
￿
max
i<k
￿Ai + 1
N
￿
×
Ti
s
− 1
￿
× s
= max
i<k
￿Ai + 1
N
￿
× Ti − s
<
S
2
− s
< min(S,Tk).80
From Theorem 2, applied to W ∗, it follows that if Nc = A2 ×
l
Tk
S
m
, then
compactk (W ∗) ⇔ N ≤ Nc. From Theorem 1, compactk (W) ⇔ N ≤ Nc for
the same Nc. Therefore compactk (W ∗) ⇔ compactk (W).
(b) (N − 1) × S > 2Ci for all i ∈ [1,k − 1] and by analogy with the proof of
Theorem 2(b), non-compacti (W ∗) holds for all i ∈ [1,k − 1].
Algorithms for Measuring Parameters
We use the function is compacti (W) to determine empirically if compacti (W)
holds. Our implementation of this function repeatedly accesses each address in
W, computes the average time per access l, and declares the sequence to be com-
pact if l is close to li (the hit latency of Ci).
Given the transformation from W to W ∗ as discussed in Section 4.4.4, we can
use the algorithms in Section 4.4.3 to measure latency, capacity and associativity
at any cache level.
Implementation of is compact
There is one important complication when measuring parameters of lower cache
levels. On modern platforms C1 is typically virtually indexed, but lower levels are
always physically indexed. This is a problem because continuity in virtual memory
is not a suﬃcient condition for continuity in physical memory, and thus a ﬁxed
stride sequence of addresses in the virtual address space may not map to a ﬁxed
stride sequence in physical address space.
To measure parameters of lower cache levels it is therefore necessary to allocate
physically contiguous memory. There are two ways to acquire such memory in a81
modern operating system: (i) request physically contiguous pages from the kernel,
or (ii) request virtual memory backed by a super-page.
The ﬁrst approach is generally possible only in kernel mode, and there are strict
limits on the amount of allocatable memory. It is mainly used for direct memory
access (DMA) devices. Another, somewhat smaller problem is that such memory
regions typically consist of many pages and TLB misses might introduce inter-level
interference noise in our cache measurements.
The second approach is more promising, but currently there is no portable way
to request super-pages from all operating systems. To address this problem, in
our implementation we provide OS-speciﬁc memory allocation and deallocation
routines, which are then used by the cache micro-benchmarks to allocate memory
supported by super-pages. We have implemented this approach for Linux, and we
will implement it for other operating systems in the near future.
There has been some work on transparently supporting variable size pages
in the OS [47]. When such support becomes generally available, our OS-speciﬁc
solution will not be required.
4.4.5 Measuring TLB Paramaters
The general structure of a virtual memory address is shown in Figure 4.13 (the ﬁeld
widths are Intel Pentium III speciﬁc). The low-order bits contain the page oﬀset,
while the hi-order bits are used for indexing page tables during the translation to
a physical address. Because the translation from virtual to physical address is too
expensive to perform on every memory access, a TLB is used to cache and reuse
the results.
A TLB has a certain number of entries E each of which can cache the address82
page offset￿ indices to page tables￿
TLB index￿ TLB tag￿
20￿ 12￿
16￿ 4￿
Figure 4.13: Memory address decomposition on Pentium III
translation for a single virtual memory page of size P. Even though the TLB does
not store the actual data but only its physical address and a few ﬂags, it uses the
upper portion of the virtual address in a way a normal cache does (for encoding
index and tag), and so we can consider it a normal cache CTLB = hA,B,Ci =
hATLB,P,E × Pi. Ideally we would like to use our cache parameter measurement
algorithms discussed in Section 4.4.3, but some complications arise as outlined
below.
1. Variable page size: measuring parameters for caches with variable block size is
not possible with our current algorithms. On current operating systems, the
default is to use only a single page size, and therefore there is no immediate
danger of measurement failure. Furthermore, [47] suggests that when trans-
parent support for multiple page sizes becomes available, TLB misses will
be automatically minimized and will have negligible impact on performance.
At that point measuring the TLB parameters would not be necessary.
2. Replacement policy: typically a TLB has a high associativity and LRU is
impractical to implement because of speed issues. In practice processors use
much simpler replacement policies like round-robin or random. Some even
perform a software interrupt on a TLB miss and leave to the operating system
to do the replacement. Surprisingly these inconsistencies do not prevent us
from producing accurate measurement results.83
3. Ensuring TLB access: As in the case of lower cache levels, we need to make
sure that the TLB is accessed when memory references are issued by the
processor. In modern platforms this is ensured by the fact that L1 data
caches are usually physically tagged, but even more importantly by the fact
that TLB caches memory protection information which is needed to complete
the particular memory operation.
4. Physical Continuity: As with lower cache levels, we need physically contigu-
ous memory to perform TLB measurements. Unfortunately, using super-
pages is not an alternative for obvious reasons, and so a kernel module is
required.
For a sequences W = hm0,S,Ni, let N = p ×
l
T1
S
m
+ r, where 0 ≤ r <
l
T1
S
m
.
To measure TLB parameters using the algorithms described in Section 4.4.3, we
transform W into (T1 and B1 are the stride and the block size of C1 respectively):
W
∗ =
￿￿
m0,S,
￿T1
S
￿￿
,T1 + B1,p
￿
∪
hm0 + (p − 1) × (T1 + B1),S,ri
We assume that the C1 has at least twice as many blocks as there are entries
in the TLB, i.e., C1
B1 ≥ 2
CTLB
BTLB, which is true for all modern platforms today. Under
this assumption, it is easy to see that compact1 (W ∗).
Because we do not have a portable solution to (4) above, our experience with
measuring TLB parameters is limited. None of the other tools produced any
correct results on any of the tested platforms. Therefore, we describe our limited
experimental results in this section.
Using the algorithms in Section 4.4.3 with the modiﬁed sequences W ∗, we were
able to accurately measure the TLB parameters of a Pentium III as 64 page entries,84
4-way set associative, and page size of 4KB. We also measured the TLB parameters
of a Pentium 4 as 65 page entries, fully-associative, with a page size of 4KB. On
the Pentium 4 our measurement is close to the correct one (measured associativity
65 vs. actual associativity of 64)1.
4.5 Future Work
We are actively designing and developing new micro-benchmarks and we are cur-
rently working on:
• implementing OS support for Solaris, AIX, etc.,
• improving quality of TLB measurements,
• measuring instruction cache parameters,
• cache bandwidth, parallelism, write mode, and sharedness (uniﬁed or dedi-
cated).
X-Ray is freely available from http://iss.cs.cornell.edu/software/x-ray.
aspx.
1This problem may be similar to the one we discuss about the L1 data cache
of Power 3 in Section A.2.1Chapter 5
Global Search in ATLAS
ATLAS performs a global search to determine optimal values for the optimization
parameters listed in Table 3.1. In principle, the search space is unbounded because
most of the parameters, such as NB, are integers. Therefore, it is necessary to
bound the search space using parameters of the machine hardware; for example,
MU and NU, the dimensions of the register tile, must be less than the number of
registers.
5.1 Estimating hardware parameters
The machine parameters measured by ATLAS are the following.
• C1: the size of the L1 data cache.
• NR: the number of ﬂoating-point registers.
• FMA: the availability of a fused multiply-add instruction.
• Ls: software latency
Although Ls is not a hardware parameter per se, it is directly related to the
latency of ﬂoating point multiplication, as explained in Section 3.4. ATLAS mea-
sures this optimization parameter directly using a micro-benchmark.
The micro-benchmarks used to measure machine parameters are independent
of matrix multiplication. For example, the micro-benchmark for estimating C1 is
similar to the one discussed in Hennessy and Patterson [33].
8586
Two other machine parameters are critical for performance: (i) the L1 in-
struction cache size, and (ii) the number of outstanding loads that the hardware
supports. ATLAS does not determine these parameters explicitly using micro-
benchmarks; instead, they are considered implicitly during the optimization of
matrix multiplication code. For example, the size of the L1 instruction cache lim-
its the KU parameter in Figure 3.2. Rather than estimate the size of the instruc-
tion cache directly by running a micro-benchmark and using that to determine the
amount of unrolling, ATLAS generates a suite of mini-MMM kernels with diﬀerent
KU values and selects the kernel that achieves the best performance.
5.2 Search strategy
To ﬁnd optimal values for the optimization parameters in Table 3.1, ATLAS uses
orthogonal line search, which ﬁnds an approximation to the optimal value of a
function y = f(x1,x2,...,xn), an n-dimensional optimization problem, by solving
a sequence of n 1-dimensional optimization problems corresponding to each of the
n parameters. When optimizing the value of parameter xi, it uses reference values
for parameters (xi+1,xi+2,...,xn) that have not yet been optimized. Orthogonal
line search is heuristic because it does not necessarily ﬁnd the optimal value even
for a convex function, but with luck, it may come close.
To specify an orthogonal line search, it is necessary to specify (i) the order
in which the parameters are optimized, (ii) the set of possible values considered
during the optimization of each parameter, and (iii) the reference value used for
parameter k during the optimization of parameters 1, 2, ..., k − 1.
The optimization sequence used in ATLAS is the following.
1. Finding NB.87
2. Finding MU and NU.
3. Finding KU.
4. Finding Ls.
5. Finding FF, IF, and NF.
6. Finding NCNB: a non-copy version of NB.
7. Finding parameters for clean-up codes.
We now discuss each of these steps in greater detail.
5.3 Finding NB
In this step, ATLAS generates a number of mini-MMMs for matrix sizes NB ×NB
where NB is a multiple of four that satisﬁes the following inequality:
16 ≤ NB ≤ min
￿
80,
q
C1
￿
(5.1)
The reference values of MU and NU are set to the values closest to each other
that satisfy (3.1). For each matrix size, ATLAS tries two extreme cases for KU –
no unrolling (KU = 1) and full unrolling (KU = NB).
The NB value that produces the highest MFLOPS is chosen as “best NB” value,
and it is used from this point on in all experiments as well as in the ﬁnal version
of the optimized mini-MMM code.88
5.4 Finding MU and NU
This step is a straightforward search that reﬁnes the reference values of MU and
NU that were used to ﬁnd the best NB. ATLAS tries all possible combinations of
MU and NU that satisfy inequality (3.1). Cases when MU or NU is 1 are treated
specially. A test is performed to see if 1 × 9 unrolling or 9 × 1 unrolling is better
than 3 × 3 unrolling. If not, unrolling factors of the form 1 × U and U × 1 for
values of U greater than 3 are not checked.
5.5 Finding KU
This step is another simple search. Unlike MU and NU, KU does not depend on
the number of available registers, so it can be made as large as desired without
causing register spills. The main constraint is the instruction cache size. ATLAS
tries values for KU between 4 and
NB
2 as well as the special values 1 and NB. The
value that gives best performance (based on NB, MU and NU as determined from
the previous steps) is declared the optimal value for KU.
5.6 Finding Ls
In this step, ATLAS uses Ls values in the interval [1,6] to schedule the compu-
tations in the micro-MMM of Figure 3.2 to determine the best choice for Ls. It
also ensures that the chosen value divides MU ×NU ×KU to facilitate instruction
scheduling.89
5.7 Finding FF, IF, and NF
In this step, ATLAS searches for the values of FF, IF and NF. First, ATLAS
determines the value of FF (0 or 1). Then, it searches for the best value of the
pair (IF, NF) where IF is in the interval [2,MU+NU] and NF is in the interval
[1,MU+NU-IF].
5.8 Finding NCNB
For the non-copying version of mini-MMM, ATLAS uses the same values of MU,
NU, FF, IF, and NF that it uses for the copying version. Without copying, the
likelihood of conﬂict misses is higher, so it makes sense to use a smaller L1 cache tile
size than in the version of mini-MMM that performs copying. ATLAS searches
for an optimal value of NCNB in the range [NB : −4 : 4]. We would expect
performance to increase initially as the tile size is decreased, but decrease when the
tile size becomes too small. ATLAS terminates the search when the performance
falls by 20% or more from the best performance it ﬁnds during this search. Finally,
some restricted searches for better values of KU and Ls are done.
5.9 Finding parameters for clean-up codes
If the tile size is not a multiple of the original matrix size, there may be left-over
rows and columns, at the boundaries of the matrices, forming fractional tiles. To
handle these fractional tiles, ATLAS generates clean-up code – a special mini-
MMM in which one or more of the dimensions of the three tiles is smaller than
NB. For M and N clean-up only the corresponding dimension is smaller than NB,
while for K cleanup, any of the three dimensions can be smaller than NB.90
For example, ATLAS generates K clean-up codes as follows. For each value of
L, representing the size of the K dimension, starting with L = NB − 1 and going
down, it generates a specialized version of the mini-MMM code in which some
of the loops are fully unrolled. Full unrolling is possible because the shapes of
the operands are completely known. When the performance of the general version
falls within 1% of the performance of the current specialized version, the generation
process is terminated. The current L is declared to be the Crossover Point. At
runtime, the specialized versions are invoked when the dimension of the left-over
tile is greater than L and the general version is invoked otherwise.
For M and N clean-up ATLAS produces only a general version, as these are
outer loops in the outermost loop nest in Figure 3.2 and they are not as crucial to
performance as K clean-up is. The use of clean-up code in ATLAS is discussed in
more detail in [58].
5.10 Discussion
In optimization problems, there is usually a trade-oﬀ between search time and
the quality of the solution. For example, we can reﬁne the parameters found
by ATLAS by repeating the orthogonal line search some number of times, using
the values determined by one search as the reference values for the next search.
It is also possible to use more powerful global search algorithms like simulated
annealing. However, the potential for obtaining better solutions must be weighed
carefully against the increase in installation time. We will address this point in
the conclusions.Chapter 6
Model-Driven Optimization
In this section, we present analytical models for estimating optimal values for the
parameters in Table 3.1. To avoid overwhelming the reader, we ﬁrst present models
that ignore interactions between diﬀerent levels of the memory hierarchy (in this
case, L1 data cache and registers). Then, we reﬁne the models to correct for such
interactions.
6.1 Estimating hardware parameters
Model-based optimization requires more machine parameters than the ATLAS
approach because there is no search. The hardware parameters required by our
model are as follows.
• C1,B1: the capacity and the line size of the L1 data cache.
• CI: The capacity of the L1 instruction cache.
• L×: hardware latency of the ﬂoating-point multiply instruction
• |ALUFP|: number of ﬂoating-point functional units
• NR: the number of ﬂoating-point registers.
• FMA: the availability of a fused multiply-add instruction.
Empirical optimizers use the values of machine parameters only to bound the
search space, so approximate values for these parameters are adequate. In contrast,
analytical models require accurate values for these parameters. Therefore, we have
developed a tool called X-Ray [66], which accurately measures these values.
9192
6.2 Modeling strategy
We build analytical models which compute optimization parameters directly from
the corresponding hardware parameters. We explain our models for all optimiza-
tion parameters in the following sections.
6.3 Estimating NB
We present our model for estimating NB using a sequence of reﬁnements for in-
creasingly complex cache organizations. We start with the mini-MMM code in
Figure 6.1, and then adjust the model to take register tiling into account.
for j0 ∈ [0 : 1 : NB − 1]
for i0 ∈ [0 : 1 : NB − 1]
for k0 ∈ [0 : 1 : NB − 1]
Ci0j0 = Ci0j0 + Ai0k0 × Bk0j0
Figure 6.1: Schematic Pseudo-Code for mini-MMM
6.3.1 Fully associative, optimal replacement, unit line size
The goal is to ﬁnd the value of NB that optimizes the use of the L1 data cache.
First, we consider a simple cache of capacity C1, which is fully-associative with an
optimal replacement policy and a unit line size. There are no conﬂict misses, and
spatial locality is not important.
The working set in memory of the mini-MMM loop nest in Figure 6.1 consists
of three NB ×NB tiles, one from each of the matrices A, B, and C. For the rest of
this section, we will refer to these tiles just as A, B, and C. This working set ﬁts93
entirely in the cache if Inequality (6.1) holds.
3N
2
B ≤ C1 (6.1)
A more careful analysis shows that it is not actually necessary for all three
NB × NB blocks to reside in the cache for the entire duration of the mini-MMM
computation. Consider the mini-MMM code shown in Figure 6.1. Because k0 is
the innermost loop, elements of C are computed in succession; once a given element
of C has been computed, subsequent iterations of the loop nest do not touch that
location again. Therefore, with this loop order, it is suﬃcient to hold a single
element of C in the cache, rather than the entire array. The same reasoning shows
that it is suﬃcient to hold a single column of B in the cache. Putting these facts
together, we see that with this loop order, there will be no capacity misses if the
cache can hold all of A, a single column of B, and a single element of C. This leads
to Inequality (6.2).
N
2
B + NB + 1 ≤ C1 (6.2)
6.3.2 Correcting for non-unit line size
In reality, caches have non-unit line size. Assume that the line size is B1. If the
three tiles are stored in column major order, both B and C are walked by columns
and A is in the cache for the entire duration of the mini-MMM. This leads to the
reﬁned constraint shown in Inequality (6.3).
&
N2
B
B1
'
+
￿NB
B1
￿
+ 1 ≤
C1
B1
(6.3)94
6.3.3 Correcting for LRU replacement policy
We can further relax the restrictions of our cache organization to allow for Least
Recently Used (LRU) replacement instead of optimal replacement. To determine
the eﬀects of LRU replacement on the optimal tile size NB, we must examine the
history of memory accesses performed by the loop nest. This analysis is in the
spirit of Mattson et al. [42], who introduced the notions of stack replacement and
stack distance.
We start with the innermost loop of the mini-MMM loop nest. A single iteration
hj,i,ki of this loop touches elements
Aik;Bkj;Cij;
where the most recently accessed element is written rightmost in this sequence.
Extending this analysis to the middle loop, we see that the sequence of memory
access for a given value of the outer loop indices hj,ii is the following (as before,
the most recently accessed element is rightmost):
Ai0;B0j;Cij;Ai1;B1j;Cij;...;Ai,NB−1;BNB−1,j;Cij;
Note that the location Cij is touched repeatedly, so the corresponding history
of memory accesses from least recently accessed to most recently accessed is the
following:
Ai0;B0j;Ai1;B1j;...;Ai,NB−1;BNB−1,j;Cij;
Extending this to a single iteration j of the outermost loop, we see that the95
sequence of memory accesses is the following (in left-to-right, top-to-bottom order):
A00;B0j; ... A0,NB−1;BNB−1,j; C0j;
A10;B0j; ... A1,NB−1;BNB−1,j; C1j;
. . .
ANB−1,0;B0j; ... ANB−1,NB−1;BNB−1,j; CNB−1,j;
Note that the column of B is reused NB times, and thus the corresponding history
of memory accesses from least recently accessed to most recently accessed is
A00; ... A0,NB−1; C0j;
A10; ... A1,NB−1; C1j;
. . .
ANB−1,0;B0j; ... ANB−1,NB−1;BNB−1,j; CNB−1,j;
We do not want to evict the oldest element of this history (A00) because, as
we discussed before, A is completely reused in all iterations of the outermost loop.
Therefore we need to choose NB is such a way that this whole history ﬁts in the
cache.
Furthermore, after the jth iteration of the outermost loop is complete, the
j+1st iteration will bring in the j+1st column of B, which participates in an inner
product with all the rows of A. Because of LRU, this new column will not be able
to “optimally” replace the old jth column of B, since the old column of B has been
used quite recently. For the same reason the new element of C, namely C0,j+1, will
not be able to optimally replace the old C0j. To account for this, we need extra
storage for an extra column of B and an extra element of C.
Putting all this together, we see that if the cache is fully-associative with ca-
pacity C1, line size B1 and has an LRU replacement policy, we need to cache all of96
A, two columns of B and a column plus an element of C. This result is expressed
formally in Inequality (6.4).
&
N2
B
B1
'
+ 3
￿NB
B1
￿
+ 1 ≤
C1
B1
(6.4)
Finally, to model the mini-MMM code of Figure 3.2, which includes register
tiling, we need to take into account interactions between the register ﬁle and the
L1 cache. Thus far, we implicitly assumed that the computation works directly
on the scalar elements of the tiles. As Figure 3.2 shows, the mini-MMM loop nest
actually works on register tiles. We reﬁne Inequality (6.4) by recognizing that
considerations of rows, columns, and elements of A, B, and C respectively must be
replaced by considerations of horizontal panels, vertical panels, and register tiles
instead. Taking this into account, we get Inequality (6.5).
&
N2
B
B1
'
+ 3
￿NB × NU
B1
￿
+
￿MU
B1
￿
× NU ≤
C1
B1
(6.5)
6.3.4 Correcting to avoid micro-MMM clean-up code
Note that estimating NB using Inequality (6.4), it is possible to get a value for NB
that is not an exact multiple of MU and NU. This requires the generation of clean-
up code for fractional register tiles at the boundaries of mini-MMM tiles. This
complicates code generation, and generally lowers performance. We avoid these
complications by trimming the value of NB determined from Inequality (6.4) so
that it becomes a multiple of MU and NU. The ATLAS Code Generator requires
NB to be an even integer, so we enforce this constraint as well.
If N0
B is the tile size obtained by using Inequality (6.4), we set NB to the value
j
N0
B
lcm(MU,NU,2)
k
× lcm(MU,NU,2).97
Note this requires that the value of NB be determined after the values of MU
and NU have been determined as described below.
6.3.5 Other cache organizations
If the cache organization is not fully-associative, conﬂict misses must be taken
into account. Although there is some work in the literature on modeling conﬂict
misses [15, 18], these models are computationally intractable. Therefore, we do
not model conﬂict misses, although there are some general remarks we can make.
If A, B, and C are copied to 3N2
B contiguous storage locations, Inequality (6.1)
can also be viewed as determining the largest value of NB for which there are no
capacity or conﬂict misses during the execution of the mini-MMM in any cache
organization. Although ATLAS usually copies tiles, the code in Figure 3.2 shows
that the three copied tiles are not necessarily adjacent in memory. However, if the
set-associativity of the L1 data cache is at least 3, there will be no conﬂict misses.
Inequality (6.2) determines the largest NB for which there are no capacity
misses during the execution of the mini-MMM, although there may be conﬂict
misses if the cache is direct-mapped or set-associative. Notice that these conﬂict
misses arise even if data from all three matrix tiles is copied into contiguous mem-
ory, because the amount of data touched by the program is more than the capacity
of the cache, and some elements will map to the same cache set.
6.4 Estimating MU and NU
One can look at the register ﬁle as a software-controlled, fully-associative cache
with unit line size and a capacity equal to the number of available registers NR.
Therefore we can use a variant of Inequality (6.2) to estimate the optimal register98
ﬁle tile size.
The ATLAS Code Generator uses the KIJ loop order to tile for the register
ﬁle, and thus we need to cache the complete MU ×NU tile of C, an 1×NU row of
B and a single element of A. Therefore the analog of Inequality (6.2) for registers
is Inequality (6.6), shown below.
MU × NU + NU + 1 ≤ NR (6.6)
Because the register ﬁle is software controlled, the ATLAS Code Generator is
free to allocate registers diﬀerently than Inequality (6.6) prescribes. In fact, as
discussed in Chapter 3, it allocates to registers a MU × 1 column of A, rather
than a single element of A. Furthermore, it needs Ls registers to store temporary
values of multiplication operations to schedule for optimal use of the ﬂoating point
pipelines. Taking into account these details, we reﬁne Inequality (6.6) to obtain
Inequality (6.7).
MU × NU + NU + MU + Ls ≤ NR (6.7)
NR is a hardware parameter, whose eﬀective value is measured by the micro-
benchmarks. The value of the optimization parameter Ls is estimated as discussed
in Section 6.6. Therefore the only unknowns in Inequality (6.7) are MU and NU.
We estimate their values using the following procedure.
• Let MU = NU = u. Solve Inequality (6.7) for u.
• Let MU = max(u,1). Solve Inequality (6.7) for NU.
• Let NU = max(NU,1)
• Let hMU,NUi = hmax(MU,NU),min(MU,NU)i.99
6.5 Estimating KU
Although KU is structurally similar to MU and NU, it is obviously not limited by
the size of the register ﬁle. Therefore the only practical limit for KU is imposed by
the size of the instruction cache. To avoid micro-MMM clean-up code, we trim KU
so that NB is a multiple of KU. Note that in case of complete unrolling (KU = NB),
KU is not changed by this update.
Therefore our model for estimating KU is to unroll the loop as much as possible
within the size constraints of the L1 instruction cache, while ensuring that KU
divides NB. On most platforms, we found that the loop can be unrolled completely
(KU = NB).
6.6 Estimating Ls
Ls is the optimization parameter that represents the skew factor the ATLAS Code
Generator uses when scheduling dependent multiplication and addition operations
for the CPU pipeline.
Studying the description of the scheduling in Chapter 3, we see that the sched-
ule eﬀectively executes Ls independent multiplications and Ls − 1 independent
additions between a multiplication muli and the corresponding addition addi. The
hope is that these 2×Ls −1 independent instructions will hide the latency of the
multiplication. If the ﬂoating-point units are fully pipelined and the latency of
multiplication is L×, we get the following inequality, which can be solved to obtain
a value for Ls.
2 × Ls − 1 ≥ L× (6.8)100
On some machines, there are multiple ﬂoating-point units. If |ALUFP| is the
number of ﬂoating-point ALUs, Inequality (6.8) gets reﬁned as follows.
2 × Ls − 1
|ALUFP|
≥ L× (6.9)
Solving Inequality (6.9) for Ls, we obtain Inequality (6.10).
Ls =
&
L× × |ALUFP| + 1
2
'
(6.10)
Of the machines in our study, only the Intel Pentium machines have ﬂoating-
point units that are not fully pipelined; in particular, multiplications can be issued
only once every 2 cycles. Because Pentium machines have only one ﬂoating-point
functional unit and ATLAS does not schedule back-to-back multiply instructions,
this does not introduce any error in our model. Therefore, Inequality (6.8) holds.
6.7 Estimating other parameters
Our experience shows that performance is insensitive to the values of the scheduling
optimization parameters FF, IF, and NF. Therefore we set FF = 1(true), IF = 2
and NF = 2.
FMA is a hardware parameter, independent of the speciﬁc application. If our
micro-benchmarks determine that the architecture supports a fused multiply-add
instruction, we set this parameter appropriately.
Finally, we set NCNB = NB. That is, we use the same tile size for the non-
copying version of mini-MMM as we do for the copying version. In our experiments,
ATLAS always decided to use the copying version of mini-MMM1, so the value of
1Using the non-copy version is mainly beneﬁcial when the matrices involved in
the computation are either very small or are long and skinny [56].101
this parameter was moot. A careful model for NCNB is diﬃcult because it is hard
to model conﬂict misses analytically. There is some work on this in the compiler
literature but most of the models are based on counting integer points within
certain parameterized polyhedra and appear to be intractable [15, 18]. Fraguela et
al. have proposed another approach to modeling conﬂict misses when the sizes of
matrices are known [26]. In some compilers, this problem is dealt with heuristically
by using the eﬀective cache capacity, deﬁned to be a fraction (such as
1
3) of the
actual cache capacity, when computing the optimal tile size. In our context, we
could set NCNB to the value determined from Inequality (6.4) with C1 replaced
with
C1
3 . We recommend this approach should it become necessary to use a smaller
tile size on some architectures.
6.8 Discussion
We have described a fairly elaborate sequence of models for estimating the optimal
value of NB. In practice, the value found by using Inequality (6.3), a relatively
simple model, is close to the value found by using more elaborate models such as
Inequalities (6.4) and (6.5).Chapter 7
Model Reﬁnement and Local Search
Here we address two problems in using model-driven optimization in the context
of general-purpose compilers.
The ﬁrst problem is that there can be a potential performance gap between
code produced by library generators and code produced by using model-driven
optimization in case the model is not an accurate abstraction of the architecture.
Such performance penalty is unacceptable for critical applications that will be run
many times. On the other hand, global search as performed by ATLAS does not
scale well to large programs or to complex architectures, so it cannot be used in
general-purpose compilers. Dongarra and co-workers are exploring faster search
algorithms like the simplex method [22], but it is not clear that these algorithms
alone are adequate.
The second problem is performance portability. It may seem that the use of
global search ensures that library generators will work “out of the box” on new
architectures, whereas model-driven optimization may fail if the model is a poor
abstraction of the new architecture. However, global search is not a panacea. In
particular, if the code generator does not exploit aspects of an architecture that are
key to performance, the resulting code may be poor regardless of how exhaustive
the search for optimal parameter values is. The methodology used in the ATLAS
system to adapt to new architectures for which search alone is not suﬃcient is
to include a collection of user-contributed hand-tuned kernels in the distribution;
during the search process, the performance of these codes is evaluated, and if
one of them performs better than the code generated by the code generator, it is
102103
used to produce the library. This methodology cannot be used for model-driven
optimization because it runs counter to the spirit of using models to optimize
programs.
The approach that we advocate to address both problems is to use a combi-
nation of model reﬁnement and local search. To close the performance gap with
code produced by empirical optimization, we advocate using local search in the
neighborhood of the parameter values produced by using the model. Of course lo-
cal search alone may not be adequate if the model is not a good abstraction of the
architecture. In that case, we advocate using model reﬁnement in the same spirit
as ATLAS incorporates user-contributed code - we study the new architecture and
reﬁne the model as needed. Note that like the production of user-contributed code,
model reﬁnement must be done manually. Intuitively, in our approach, small per-
formance problems are tackled using local search, while large performance problems
are tackled using model reﬁnement.
The cases where we found that it is necessary to reﬁne the models described in
Chapter 6 are presented along with our experimental results in Chapter 8, as we
believe they are much better justiﬁed in this context.
Now we describe some very simple local search techniques, targeted at possible
inaccuracies in the models for NB, MU, NU, and Ls.
Local Search for NB
If NBM is the value of NB estimated by the model, we can reﬁne this value by local
search in the interval [NBM − lcm(MU,NU,2) : 2 : NBM + lcm(MU,NU,2)]. This
ensures that we examine the ﬁrst values of NB in the neighborhood of NBM that
are divisible by both MU and NU.104
Local Search for MU, NU, and Ls
Unlike sensitivity graphs for NB, sensitivity graphs for MU and NU tend to be con-
vex in the neighborhood of model-predicted values. This is probably because reg-
ister allocation is under compiler control, and there are no conﬂict misses. Hence,
we use a simple hill-climbing search strategy to improve these parameters.
We start with the model predicted values for MU, NU, and Ls and determine
if performance improves by changing each of them by +1 and −1. We continue
following the path of increasing performance until we stop at a local maximum.
On platforms on which there is a Fused-Multiply-Add instruction (FMA = 1), the
optimization parameter Ls has no eﬀect on the generated code and in that case
we only consider MU and NU for the hill-climbing local search.Chapter 8
Experimental Results
8.1 Global search vs. model-driven optimization
Models are to be used, not believed.
H. Theil ‘Principles of Econometrics’
In this section, we present the results of running ATLAS CGw/s and ATLAS
Model on ten common platforms. For all experiments we used the latest stable
version of ATLAS, which as of this writing is 3.6.0. Where appropriate, we also
present numbers for ATLAS Unleashed and vendor supported, native BLAS.
We performed our experiments on the following platforms.
• RISC, Out-of-order
– DEC Alpha 21264
– IBM Power 3
– IBM Power 4
– SGI R12K
• RISC In-order
– Sun UltraSPARC IIIi
• EPIC In-order
– Intel Itanium 2
• CISC, Out-of-order
105106
– AMD Opteron 240
– AMD Athlon MP
– Intel Pentium III
– Intel Pentium 4
For each platform, we present the following results.
• Times:
– X-Ray: time taken by X-Ray to determine hardware parameters;
– ATLAS Micro-benchmarks: time taken by the micro-benchmarks in
ATLAS to determine hardware parameters;
– ATLAS Optimization Parameter Search: time taken by global search in
ATLAS for determining optimization parameter values.
We do not report the actual installation time of any of the versions of ATLAS
because most of this time is spent in optimizing other BLAS kernels, gener-
ating library code, building object modules, etc.
We do not discuss the timing results in detail as they are not particularly
surprising. X-Ray is faster than ATLAS in measuring hardware parameters
on nine out of the ten platforms and has comparable timing (10% slower) on
one (IBM Power 3). Moreover, while ATLAS CGw/S spends a considerable
amount of time, ranging between 8 minutes on the DEC Alpha to more than
8 hours on the Intel Itanium 2, to ﬁnd the optimal values for the optimization
parameters, the model-based approach takes no measurable time.107
• Performance:
– Optimization parameter values: values determined by ATLAS CGw/S
and ATLAS Model. Where appropriate, we also report these values for
ATLAS Unleashed.
– mini-MMM performance: performance of mini-MMM code produced by
ATLAS CGw/S, ATLAS Model and ATLAS Unleashed.
– MMM performance: for matrices sized 100 × 100 to 5000 × 5000. We
report performance of complete MMM computations using (i) vendor
supported, native BLAS, and the code produced by (ii) ATLAS CGw/S,
(iii) ATLAS Model, (iv) ATLAS Unleashed, and (v) the native FOR-
TRAN compiler. On each platform, the code produced by ATLAS is
compiled with the native C compiler. The input FORTRAN program
is the standard triply-nested loop shown in Figure 3.1.
For vendor supported, native BLAS (labeled “BLAS” on all ﬁgures)
we used the following libraries and corresponding versions, which were
current at the time of our experiments:
∗ DEC Alpha: CXML 5.2
∗ IBM Power 3/4: ESSL 3.3
∗ SGI R12K: SCSL 6.5
∗ SUN UltraSPARC IIIi: Sun One Studio 8
∗ Intel Itanium 2, Pentium III/4: MKL 6.1
∗ AMD Opteron, Athlon: ACML 2.0
• Sensitivity Analysis: this describes the relative change of performance as
we change one of the optimization parameters, keeping all other parameters108
ﬁxed to the values found by ATLAS CGw/S. The sensitivity analysis explains
how variations in the values of the optimization parameters inﬂuence the
performance of the generated mini-MMM kernel.
– NB: change in mini-MMM performance when the value of NB is changed
– MU, NU: change in mini-MMM performance when values of MU and
NU are changed. Because optimal values of MU and NU depend on the
same hardware resource (NR), we vary them together.
– KU: change in min-MMM performance when value of KU is changed.
– Ls: change in mini-MMM performance when Ls is changed.
– FF, IF and NF: we do not show sensitivity graphs for these parameters
because performance is relatively insensitive to their values.109
8.1.1 DEC Alpha 21264
Table 8.1: DEC Alpha 21264: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, RISC
CPU Core Frequency 833 MHz
L1 Data Cache 64 KB, 64 B/line, 2-way
L1 Instruction Cache 64 KB, 64 B/line, 2-way
L2 Uniﬁed Cache 4 MB, 64 B/line, 1-way
Floating-Point Registers 32
Floating-Point Functional Units 2
Floating-Point Multiply Latency 4
Has Fused Multiply Add No
Operating System Tru64 v5.1B (rev.2650)
C Compiler Compaq C v6.5-003
Fortran Compiler GNU Fortran 3.3
Table 8.2: DEC Alpha 21264: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 72 4, 4, 72 4 0 1, 7, 1 1281
Model 84 4, 4, 84 4 0 0, 2, 2 1189
Unleashed 80 1491
Table 8.3: DEC Alpha 21264: Timings
Search Model
Machine Parameters 148s 101s
Optimization Parameters 556s
Total 704s 101s110
1000 2000 3000 4000 5000
Size
200
400
600
800
1000
1200
1400
MFLOPS
Compiler
Model
CGw￿S
BLAS
Unleashed
Figure 8.1: DEC Alpha 21264: MMM Performance
200 400 600 800
NB
200
400
600
800
1000
1200
MFLOPS
(a)
20 40 60 80 100 120
NB
200
400
600
800
1000
1200
MFLOPS
(b) Zoomed
Figure 8.2: DEC Alpha 21264: Sensitivity of performance to NB111
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
500
1000
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.3: DEC Alpha 21264: Sensitivity of performance to MU and NU
10 20 30 40 50 60 70
KU
200
400
600
800
1000
1200
MFLOPS
Figure 8.4: DEC Alpha 21264: Sensitivity of performance to KU
2 4 6 8 10 12
LS
200
400
600
800
1000
1200
MFLOPS
Figure 8.5: DEC Alpha 21264: Sensitivity of performance to Ls112
mini-MMM
On this machine the model-determined optimization parameters provided perfor-
mance of about 100 MFLOPS (7%) slower than the ones determined by search.
The reason of the diﬀerence is the suboptimal selection of the NB parameter (84
for Atlas Model vs. 72 for ATLAS CGw/S), as can be seen in the NB sensitivity
graph of Figure 8.2(b).
MMM Performance
Figure 8.1 shows the MMM performance. ATLAS Unleashed produces the fastest
BLAS implementation because it uses highly-optimized, hand-tuned BLAS kernels
written by Goto. A newer version of these kernels is described in [35]. The native
BLAS library is only marginally slower.
Although the gap in performance of the mini-MMM codes produced by ATLAS
CGw/S and ATLAS Model is 100 MFLOPS, the gap in performance of complete
MMM computations is only about 50 MFLOPS (4%) for large matrices. Finally,
we note that the GNU FORTRAN compiler is unable to deliver acceptable per-
formance. We did not have access to the Compaq FORTRAN compiler, so we did
not evaluate it.
Sensitivity Analysis
Figure 8.3 shows the sensitivity of performance to the values of MU and NU. The
optimal value is (4,4), closely followed by (3,6), and (6,3). These match our
expectations that optimal unroll factors are as close to square as possible, while
dividing the tile size NB = 72 without reminder.
Figure 8.2(a) shows the sensitivity of performance to the value of NB. Fig-113
ure 8.2(b) shows a scaled-up version of this graph in the region of the optimal NB
value. The optimal value for NB is 88. ATLAS does not ﬁnd this point because
it does not explore tile sizes greater than 80, as explained in Section 5, but it
chooses a tile size of 72, which is close to optimal. If we use Inequality (6.5) to
determine NB analytically, we obtain NB = 84. Note that using the simpler model
of Inequality (6.3), we obtain NB = 90, which appears to be almost as good as the
value determined by the more complex model.
The NB sensitivity graph of Figure 8.2(b) has a saw-tooth of periodicity 4, with
notable peaks occurring with a periodicity of 8. The saw-tooth of periodicity 4
arises from the interaction between cache tiling and register tiling - the register tile
is (4,4), so whenever NB is divisible by 4, there is no clean-up code for fractional
register tiles in the mini-MMM code, and performance is good. We do not yet
understand why there are notable peaks in the saw-tooth with a periodicity of 8.
Figure 8.4 shows the sensitivity of performance to the value of KU. On this
machine the entire mini-MMM loop body can ﬁt into the L1 instruction cache for
values of KU up to NB. Performance is relatively insensitive to KU as long as the
value of this parameter is suﬃciently large (KU > 7).
Figure 8.5 shows the sensitivity of performance to the value of Ls. The graph
is convex upwards, with a peak at 4. The multiplier on this machine has a latency
of 4 cycles, hence the model for Ls in Chapter 6 computes Ls = 5, which is
close to optimal. The inverted-U shape of this graph follows our expectations.
For very small values of Ls, dependent multiplications and additions are not well
separated and CPU pipeline utilization is low. As Ls grows, the problem gradually
disappears, until the performance peak is reached when the full latency of the
multiplication is hidden. Increasing Ls further does not improve performance as114
there is no more latency to hide. On the contrary, more temporary registers are
needed to save multiplication results, which causes more register spills to memory,
decreasing performance.115
8.1.2 IBM Power 3
Table 8.4: IBM Power 3: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, RISC
CPU Core Frequency 375 MHz
L1 Data Cache 64 KB, 128 B/line, 128-way
L1 Instruction Cache 32 KB, 128 B/line, 128-way
L2 Uniﬁed Cache 4 MB, 128 B/line, direct-mapped
Floating-Point Registers 32
Floating-Point Functional Units 2
Floating-Point Multiply Latency 4
Has Fused Multiply Add Yes
Operating System AIX
C Compiler XL C for AIX v.5
Fortran Compiler XL Fortran for AIX
Table 8.5: IBM Power 3: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 80 4, 5, 80 6 1 0, 8, 1 1264
Model 84 4, 4, 84 4 1 0, 2, 2 1225
Unleashed 80 1257
Table 8.6: IBM Power 3: Timings
Search Model
Machine Parameters 139s 154s
Optimization Parameters 1984s
Total 2123s 154s116
1000 2000 3000 4000 5000
Size
200
400
600
800
1000
1200
1400
MFLOPS
Compiler
Model
BLAS
Unleashed
CGw￿S
Figure 8.6: IBM Power 3: MMM Performance
200 400 600 800 1000
NB
200
400
600
800
1000
1200
1400
MFLOPS
(a)
20 40 60 80 100 120
NB
200
400
600
800
1000
1200
1400
MFLOPS
(b) Zoomed
Figure 8.7: IBM Power 3: Sensitivity of performance to NB117
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
500
1000
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.8: IBM Power 3: Sensitivity of performance to MU and NU
20 40 60 80
KU
200
400
600
800
1000
1200
MFLOPS
Figure 8.9: IBM Power 3: Sensitivity of performance to KU
2 4 6 8 10 12
LS
200
400
600
800
1000
1200
MFLOPS
Figure 8.10: IBM Power 3: Sensitivity of performance to Ls118
mini-MMM
Mini-MMM code produced by ATLAS Model is about 40 MFLOPS (3%) slower
than mini-MMM code produced by ATLAS CGw/S. Figure 8.7(b) shows that one
reason for this diﬀerence is the sub-optimal choice of NB; ﬁxing the values of
all parameter other than NB to the ones chosen by ATLAS CGw/S and using the
model-predicted value of 84 for NB results in mini-MMM code that performs about
100 MFLOPS worse than the mini-MMM code produced by ATLAS CGw/S.
MMM Performance
For multiplying large matrices, the handwritten BLAS as well as the codes pro-
duced by ATLAS CGw/S, ATLAS Model, and ATLAS Unleashed perform almost
identically.
Sensitivity Analysis
Figure 8.7(a) shows the sensitivity of performance to the value of NB. Figure 8.7(b)
shows a scaled-up version of this graph in the region of the optimal NB value. Fig-
ure 8.8 shows the sensitivity of performance to the values of MU and NU. Figure 8.9
shows the sensitivity of performance to the value of KU. On this machine, the en-
tire mini-MMM loop body can ﬁt into the L1 instruction cache for values of KU
up to NB. Performance is relatively insensitive to KU as long as the value of this
parameter is suﬃciently large (KU > 5). We do not understand the sudden drop
in performance at KU = 3. Figure 8.10 shows the sensitivity of performance to
the value of Ls. The Power 3 platform has a fused multiply-add instruction, which
the ATLAS micro-benchmarks and X-ray ﬁnd, and the Code Generator exploits,
so performance does not depend on the value of Ls.119
8.1.3 IBM Power 4
Table 8.7: IBM Power 4: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, RISC
CPU Core Frequency 1450 MHz
L1 Data Cache 32 KB, 128 B/line, 2-way
L1 Instruction Cache 64 KB, 128 B/line, 1-way
L2 Uniﬁed Cache 1.5 MB, 128 B/line, 8-way
L3 Cache 32 MB, 512B/line, 8-way
Floating-Point Registers 32
Floating-Point Functional Units 2
Floating-Point Multiply Latency 4
Has Fused Multiply Add Yes
Operating System AIX
C Compiler XL C for AIX v.5
Fortran Compiler XL Fortran for AIX
Table 8.8: IBM Power 4: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 64 4, 4, 64 1 1 1, 8, 1 3468
Model 56 4, 4, 56 6 1 0, 2, 2 3400
Unleashed 64 3468
Table 8.9: IBM Power 4: Timings
Search Model
Machine Parameters 175s 125s
Optimization Parameters 2390s
Total 2665s 125s120
1000 2000 3000 4000 5000
Size
500
1000
1500
2000
2500
3000
3500
MFLOPS
Compiler
Model
CGw￿S
Unleashed
BLAS
Figure 8.11: IBM Power 4: MMM Performance
200 400 600 800 1000 1200 1400
NB
500
1000
1500
2000
2500
3000
MFLOPS
(a)
20 40 60 80 100 120
NB
500
1000
1500
2000
2500
3000
MFLOPS
(b) Zoomed
Figure 8.12: IBM Power 4: Sensitivity of performance to NB121
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
1000
2000
3000
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.13: IBM Power 4: Sensitivity of performance to MU and NU
10 20 30 40 50 60
KU
500
1000
1500
2000
2500
3000
3500
MFLOPS
Figure 8.14: IBM Power 4: Sensitivity of performance to KU
2 4 6 8 10 12
LS
500
1000
1500
2000
2500
3000
3500
MFLOPS
Figure 8.15: IBM Power 4: Sensitivity of performance to Ls122
mini-MMM
On this machine, mini-MMM code produced by ATLAS Model is about 70 MFLOPS
(2%) slower than mini-MMM code produced by ATLAS CGw/S. Figure 8.12(b)
shows that one reason for this diﬀerence is a slightly sub-optimal choice of NB;
ﬁxing the values of all parameter other than NB to the ones chosen by ATLAS
CGw/S and using the model-predicted value of 56 for NB results in mini-MMM
code that performs slightly worse than the mini-MMM code produced by ATLAS
CGw/S.
MMM Performance
Figure 8.11 shows MMM performance. For large matrices, the hand-tuned BLAS
performs the best, although by a small margin. The code produced by ATLAS
Model, ATLAS CGw/S and ATLAS Unleashed perform almost identically. On this
machine the native IBM XL Fortran compiler produced relatively good results for
small matrices.
Sensitivity Analysis
Figure 8.13 shows the sensitivity of performance to changes in the values of MU
and NU. The parameter values (4,4) perform best, and these are the values used
by both ATLAS CGw/S and ATLAS Model.
Figure 8.12(a) shows the sensitivity of performance to the value of NB. Fig-
ure 8.12(b) shows a scaled-up version of this graph in the neighborhood of the NB
value determined by ATLAS CGw/S. Figure 8.12(a) shows that on this machine,
NB values between 150 and 350 give the best performance of roughly 3.5 GFLOPS.
Using Inequality (6.1) for the L2 cache (capacity of 1.5 MB) gives NB= 254, while123
Inequality (6.5) gives NB= 436, showing that on this machine, it is better to tile
for the L2 cache rather than the L1 cache.
Figure 8.14 shows the sensitivity of performance to the value of KU. The L1
instruction cache on this machine is large enough that we can set KU to NB.
As on the Power 3, unrolling by 3 gives poor performance for reasons we do not
understand.
Figure 8.15 shows the sensitivity of performance to the value of Ls. The
Power 4 platform has a fused multiply-add instruction, which the ATLAS micro-
benchmarks ﬁnd and the Code Generator exploits, so performance does not depend
on the value of Ls.124
8.1.4 SGI R12K
Table 8.10: SGI R12K: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, RISC
CPU Core Frequency 270 MHz
L1 Data Cache 32 KB, 32 B/line, 2-way
L1 Instruction Cache 32 KB, 32 B/line, 2-way
L2 Uniﬁed Cache 4 MB, 32 B/line, 1-way
Floating-Point Registers 32
Floating-Point Functional Units 2
Floating-Point Multiply Latency 2
Has Fused Multiply Add Yes
Operating System IRIX64
C Compiler SGI MIPSPro C 7.3.1.1m
Fortran Compiler SGI MIPSPro FORTRAN 7.3.1.1m
Table 8.11: SGI R12K: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 64 4, 5, 32 3 0 1, 8, 1 459
Model 58 5, 4, 58 1 1 0, 2, 2 442
Unleashed 64 464
Table 8.12: SGI R12K: Timings
Search Model
Machine Parameters 251s 117s
Optimization Parameters 5015s
Total 5266s 117s125
1000 2000 3000 4000 5000
Size
100
200
300
400
500
MFLOPS
Compiler
Model
Unleashed
CGw￿S
BLAS
Figure 8.16: SGI R12K: MMM Performance
200 400 600 800 1000
NB
100
200
300
400
500
600
MFLOPS
(a)
20 40 60 80 100 120
NB
100
200
300
400
500
600
MFLOPS
(b) Zoomed
Figure 8.17: SGI R12K: Sensitivity of performance to NB126
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
100
200
300
400
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.18: SGI R12K: Sensitivity of performance to MU and NU
10 20 30 40 50 60
KU
100
200
300
400
500
MFLOPS
Figure 8.19: SGI R12K: Sensitivity of performance to KU
2 4 6 8 10 12
LS
100
200
300
400
500
MFLOPS
Figure 8.20: SGI R12K: Sensitivity of performance to Ls127
mini-MMM
On this machine, mini-MMM code produced by ATLAS Model is about 20 MFLOPS
(4%) slower than mini-MMM code produced by ATLAS CGw/S. The performance
of both codes is similar to that of mini-MMM code produced by ATLAS Unleashed.
MMM Performance
Figure 8.16 shows MMM performance. The hand-coded BLAS performs best by a
small margin. On this machine the native compiler (in this case, the SGI MIPSPro)
generats relatively good code that is only 20% lower in performance than the hand-
coded BLAS, at least for small matrices.
Sensitivity Analysis
Figure 8.18 shows the sensitivity of performance to the values of MU and NU. This
machine has a relatively large number of registers (32), so there is a fairly broad
performance plateau in this graph.
Figure 8.17(a) shows the sensitivity of performance to the value of NB. Fig-
ure 8.17(b) shows a scaled-up version of this graph in the region of the optimal
NB value. Figure 8.17(a) shows that NB values between 300 and 500 give the
best performance of roughly 510 MFLOPS. Using Inequality (6.1) for the L2 cache
(capacity of 4MB) gives NB= 418, while Inequality (6.5) gives NB = 718, showing
that it is better to tile for the L2 cache rather than the L1 cache.
Figure 8.19 shows the sensitivity of performance to the value of the KU. The
instruction cache is large enough that full unrolling (KU = NB) is possible.
Figure 8.20 shows the sensitivity of performance to the value of the Ls. The
R12K processor has a fused multiply-add instruction, so we would expect the per-128
formance of the generated code to be insensitive to the value of Ls. While this is
borne out by Figure 8.20, notice that Table 8.11 shows that the micro-benchmark
used by ATLAS did not discover the fused multiply-add instruction on this machine
(FMA = 0)! It is worth mentioning that on this platform the FMA instruction,
while present in the ISA, is not backed up by a real FMA pipeline in hardware.
Instead it allows the two separate functional units (for multiplication and addition
respectively) to be used sequentially saving one latency cycle. Therefore, in theory,
peak performance is achievable even by using separate multiply and add instruc-
tions. Although the ATLAS Code Generator schedules code using Ls = 3, the SGI
MIPSPro compiler is clever enough to discover the separated multiplies and adds,
and fuse them. In fact the compiler is able to do this even when Ls = 20, which is
impressive.129
8.1.5 Sun UltraSPARC IIIi
Table 8.13: Sun UltraSPARC IIIi: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, RISC
CPU Core Frequency 1060 MHz
L1 Data Cache 64 KB, 32 B/line, 4-way
L1 Instruction Cache 32 KB, 32 B/line, 4-way
L2 Uniﬁed Cache 1 MB, 32 B/line, 4-way
Floating-Point Registers 32
Floating-Point Functional Units 2
Floating-Point Multiply Latency 4
Has Fused Multiply Add No
Operating System SUN Solaris 9
C Compiler SUN C 5.5
Fortran Compiler SUN FORTRAN 95 7.1
Table 8.14: Sun UltraSPARC IIIi: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 44 4, 3, 44 5 0 0, 3, 2 986
Model 84 4, 4, 84 4 0 0, 2, 2 1149
Unleashed 168 1695
Table 8.15: Sun UltraSPARC IIIi: Timings
Search Model
Machine Parameters 203s 112s
Optimization Parameters 1254s
Total 1457s 112s130
1000 2000 3000 4000 5000
Size
250
500
750
1000
1250
1500
1750
MFLOPS
Compiler
CGw￿S
Model
BLAS
Unleashed
Figure 8.21: Sun UltraSPARC IIIi: MMM Performance
100 200 300 400 500 600
NB
200
400
600
800
1000
1200
1400
MFLOPS
(a)
20 40 60 80 100 120
NB
200
400
600
800
1000
1200
1400
MFLOPS
(b) Zoomed
Figure 8.22: Sun UltraSPARC IIIi: Sensitivity of performance to NB131
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
200
400
600
800
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.23: Sun UltraSPARC IIIi: Sensitivity of performance to MU and NU
10 20 30 40
KU
200
400
600
800
1000
MFLOPS
Figure 8.24: Sun UltraSPARC IIIi: Sensitivity of performance to KU
2 4 6 8 10 12
LS
200
400
600
800
1000
MFLOPS
Figure 8.25: Sun UltraSPARC IIIi: Sensitivity of performance to Ls132
mini-MMM
On this machine, mini-MMM code produced by ATLAS Model is about 160 MFLOPS
(17%) faster than mini-MMM code produced by ATLAS CGw/S. The main reason
for this is that the micro-benchmarks used by ATLAS incorrectly measured the ca-
pacity of the L1 data cache as 16 KB, rather than 64 KB. Therefore ATLAS only
searched for NB values less than 44. Our micro-benchmarks on the other hand
correctly measured the capacity of the L1 cache, so the model estimated NB = 84,
which gave better performance as can be seen in Figure 8.22(b).
MMM Performance
Figure 8.21 shows the MMM performance. On this machine, the hand-coded BLAS
and ATLAS Unleashed perform roughly 50% better than the code produced by
ATLAS CGw/S. The reason for this diﬀerence is that the mini-MMM code in
ATLAS Unleashed (and perhaps the hand-coded BLAS) pre-fetches portions of
the A and B matrices required for the next mini-MMM. This may be related to
the level-3 pre-fetching idea of Gustavson et al. [4].
Sensitivity Analysis
Figure 8.23 shows the sensitivity of the performance to the values of MU and NU.
Figure 8.22(a) shows the performance sensitivity to the value of NB. Figure 8.22(b)
shows a scaled-up version of this graph in the region of the optimal NB value. On
this machine, as on many other machines, it is better to tile for the L2 cache,
as can be seen in Figure 8.22(a). Using Inequality (6.1) for the L2 cache (with
capacity of 1 MB), we obtain NB = 208, which gives roughly 1380 MFLOPS.
Using Inequality (6.5), we obtain NB = 356, which is close to the NB value in133
Figure 8.22(a) where the performance drops rapidly.
Figure 8.24 shows the sensitivity of performance to the value of the KU. On
this machine, the instruction cache is large enough that full unrolling (KU=NB) is
possible.
Figure 8.25 shows the sensitivity of performance to the value of the Ls. This
machine does not have a fused multiply-add instruction, so the value of the Ls
parameter aﬀects performance. Both the model and ATLAS CGw/S ﬁnd good
values for this parameter.134
8.1.6 Intel Itanium 2
Table 8.16: Intel Itanium 2: Platform Speciﬁcation
Feature Value
Architecture In-Order, EPIC, IA-64
CPU Core Frequency 1500 MHz
L1 Data Cache 16 KB, 64 B/line, 4-way
L1 Instruction Cache 16 KB, 64 B/line, 4-way
L2 Uniﬁed Cache 256 KB, 128 B/line, 8-way
L3 Cache 3 MB, 128B/line, 12-way
Floating-Point Registers 128
Floating-Point Functional Units 2
Floating-Point Multiply Latency 4
Has Fused Multiply Add Yes
Operating System Linux 2.4.18-e.31smp
C Compiler GNU C/C++ 3.3
Fortran Compiler GNU Fortran 3.3
Table 8.17: Intel Itanium 2: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 80 10, 10, 4 4 1 0, 19, 1 4028
Model 30 10, 10, 8 1 1 0, 2, 2 1806
Unleashed 120 4891
Table 8.18: Intel Itanium 2: Timings
Search Model
Machine Parameters 1555s 143s
Optimization Parameters 30710s
Total 32265s 143s135
1000 2000 3000 4000 5000
Size
1000
2000
3000
4000
5000
MFLOPS
Compiler
Model
CGw￿S
Unleashed
BLAS
Figure 8.26: Intel Itanium 2: MMM Performance
200 400 600 800 1000
NB
1000
2000
3000
4000
MFLOPS
(a)
20 40 60 80 100 120
NB
1000
2000
3000
4000
MFLOPS
(b) Zoomed
Figure 8.27: Intel Itanium 2: Sensitivity of performance to NB136
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
1000
2000
3000
4000
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.28: Intel Itanium 2: Sensitivity of performance to MU and NU
20 40 60 80
KU
1000
2000
3000
4000
MFLOPS
Figure 8.29: Intel Itanium 2: Sensitivity of performance to KU
2 4 6 8 10 12
LS
1000
2000
3000
4000
MFLOPS
Figure 8.30: Intel Itanium 2: Sensitivity of performance to Ls137
mini-MMM
On this machine, the mini-MMM code produced by ATLAS Model is about 2.2
GFLOPS (55%) slower than mini-MMM code produced by ATLAS CGw/S. This
is a rather substantial diﬀerence in performance, so it is necessary to examine the
sensitivity graphs to understand the reasons why ATLAS Model is doing so poorly.
Figure 8.27(b) shows that one reason for this diﬀerence is that ATLAS Model
used NB = 30, whereas ATLAS CGw/S used NB = 80. ATLAS CGw/S uses NB =
80 because it disregards the L1 data cache size (16KB) and considers directly the L2
cache size (256KB), and therefore the expression min
￿
80,
√
C
￿
in Inequality (5.1)
evaluates to 80, the largest possible value of NB in the search space used by ATLAS.
While the value NB = 30 used by ATLAS Model is correct with respect to
the L1 data cache size, the Intel Itanium 2 does not allow storing ﬂoating point
numbers in the L1 data cache, and thus L2 has to be considered instead. Once
we incorporate in X-Ray the ability to measure this speciﬁc hardware feature, the
shortcoming of ATLAS Model will be resolved.
MMM Performance
Figure 8.26 shows MMM performance. The hand-written BLAS and ATLAS Un-
leashed give the best performance. The code produced by ATLAS CGw/S runs
about 1.5 GFlops slower than the hand-written BLAS, while the code produced
by ATLAS Model runs about 3.5 GFlops slower.
Sensitivity Analysis
Figure 8.28 shows the sensitivity of performance to the values of MU and NU.
The Itanium 2 has 128 general-purpose registers, so the optimal register tiles are138
2 4 6 8 10 12 14
KU
1000
2000
3000
4000
MFLOPS
NB=360
NB=80
Figure 8.31: Intel Itanium 2: Sensitivity of performance to KU
relatively large. There is a broad plateau of (MU,NU) values that give excellent
performance.
Figure 8.27(a) shows the sensitivity of performance to the value of NB. Fig-
ure 8.27(b) shows a scaled-up version of this graph in the region of the optimal
NB value. Figure 8.27(a) shows that on this machine, the best performance is
obtained by tiling for the L3 cache! Indeed, using Inequality (6.1) for the L3
cache (capacity of 3 MB), we obtain NB = 360, which gives roughly 4.6 GFLOPS.
Figure 8.27(a) shows that this value is close to optimal. Using Inequality (6.5),
we obtain NB = 610, which is close to the NB value in Figure 8.27(a) where the
performance starts to drop.
Figure 8.29 shows the sensitivity of performance to the value of KU. On the
Itanium 2, unlike on other machines in our study, performance is highly sensitive
to the value of KU. The main reason is the large register tile (MU,NU) = (10,10);
after unrolling the micro-MMM loops, we get a very long straight-line code se-
quence. Furthermore, unrolling of the k00 loop creates numerous copies of this code
sequence. Unfortunately, the L1 instruction cache on this machine has a capacity
of 32 KB, so it can hold only about 9 copies of the micro-MMM code sequence.
Therefore, performance drops oﬀ dramatically for values of KU greater than 9.139
Since this is the only machine in our study in which the KU parameter mattered,
we decided to investigate the sensitivity graph more carefully. Figure 8.31 shows a
magniﬁed version of Figure 8.29 in the interval KU ∈ [0,15]. We would expect the
KU sensitivity graph to exhibit the typical inverted-U shape, and it more or less
does. However, performance for KU = 7 is signiﬁcantly worse than the performance
for KU = 6, and KU = 8, which appears anomalous.
The anomaly arises from clean-up code that is required when KU does not
divide NB evenly (see the k0 loop in the tiled code in Figure 3.2). If we unroll
the k0 loop by KU, the number of times the completely unrolled micro-MMM
code is replicated inside the mini-MMM is not KU, but KU + NB%KU (% is the
reminder from integer division). The ﬁrst term in the sum is the expected number
of repetitions inside the unrolled k0 loop, while the second part is the clean-up code
that takes care of the case when KU does not divide NB exactly. This second piece
of code is still part of the mini-MMM loop nest, and it has to be stored in the L1
instruction cache during execution to achieve optimal performance.
For NB = 80, performance increases initially as KU increases because loop
overhead is reduced. When KU = 6, there are 8 copies of the unrolled micro-MMM
code in the mini-MMM, and this is close to the I-cache limit. When KU = 7, there
are 7+80%7 = 10 copies of the micro-MMM code, which exceeds the I-cache limit,
and performance drops substantially. However, when KU = 8, there is no clean-up
code, and there are only 8 copies of the unrolled micro-MMM code, so performance
goes up again. Beyond this point, the code sizes overﬂows the I-cache and grows
larger, and performance degrades gradually. Ultimately, performance is limited by
the rate at which L1 I-cache misses can be serviced. For NB = 360, the trends are
similar, but the eﬀect of clean-up code is less because the clean-up code performs140
a smaller fraction of the computations of the k0 loop (less than 1% compared to
about 5% for NB = 80).
Figure 8.30 shows the sensitivity of performance to the value of the Ls. The
Itanium 2 has a fused multiply-add instruction, so performance is insensitive to
the Ls parameter.
In summary, the code produced by ATLAS Model on this machine did not
perform as well as the code produced by ATLAS CGw/S. However, this is because
ATLAS Model tiled for the L1 cache, whereas on this machine, the best perfor-
mance is obtained by tiling for L3 cache. ATLAS CGw/S gets better performance
because the tile size is set to a larger value than the value used by ATLAS Model.141
8.1.7 AMD Opteron 240
Table 8.19: AMD Opteron 240: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, CISC, x86-64
CPU Core Frequency 1400 MHz
L1 Data Cache 64 KB, 64 B/line, 2-way
L1 Instruction Cache 64 KB, 64 B/line, 2-way
L2 Uniﬁed Cache 1024 MB, 64 B/line, 16-way
Floating-Point Registers 8 x87
Floating-Point Functional Units ADD + MUL + Memory
Floating-Point Multiply Latency 4
Has Fused Multiply Add No
Operating System Linux 2.4.19
C Compiler GCC C/C++ 3.3.2
Fortran Compiler GNU Fortran 3.3.2
Table 8.20: AMD Opteron 240: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 60 6, 1, 60 6 1 0, 6, 1 2072
Model 88 2, 1, 88 2 0 0, 2, 2 1282
Unleashed 56 2608
Table 8.21: AMD Opteron 240: Timings
Search Model
Machine Parameters 148s 101s
Optimization Parameters 556s
Total 704s 101s142
1000 2000 3000 4000 5000
Size
500
1000
1500
2000
2500
MFLOPS
Compiler
Model
CGw￿S
BLAS
Unleashed
Figure 8.32: AMD Opteron 240: MMM Performance
100 200 300 400 500 600
NB
500
1000
1500
2000
MFLOPS
(a)
20 40 60 80 100 120
NB
500
1000
1500
2000
MFLOPS
(b) Zoomed
Figure 8.33: AMD Opteron 240: Sensitivity of performance to NB143
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
500
1000
1500
2000
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.34: AMD Opteron 240: Sensitivity of performance to MU and NU
10 20 30 40 50 60
KU
500
1000
1500
2000
MFLOPS
Figure 8.35: AMD Opteron 240: Sensitivity of performance to KU
2 4 6 8 10 12
LS
500
1000
1500
2000
MFLOPS
Figure 8.36: AMD Opteron 240: Sensitivity of performance to Ls144
mini-MMM
Table 8.21 shows that on this machine, the mini-MMM code generated by ATLAS
Model runs roughly 38% slower than the code generated by ATLAS CGw/S. The
values of almost all optimization parameters determined by the two systems are
diﬀerent, so it is not obvious where the problem is. To get some insight, it is
necessary to look at the sensitivity graphs.
Figure 8.33(a) shows the performance sensitivity graph for NB. Both 60 and
88 appear to be reasonable values, so the problem with ATLAS Model is not in
its choice of NB. Because KU is bound to the value of NB, the only remaining
diﬀerences are those between MU, NU, Ls, and FMA. Table 8.20 shows that
ATLAS Model chose MU = 2, NU = 1, FMA = 0, while ATLAS CGw/S chose
MU = 6, NU = 1, FMA = 1. We veriﬁed experimentally that if the model had
chosen MU = 6 and FMA = 1, keeping the rest of the parameters the same, the
mini-MMM performance becomes 2050 MFLOPS, closing the performance gap
with ATLAS CGw/S.
The parameters values used by ATLAS CGw/S are puzzling for several reasons.
First, the Opteron does not have an FMA instruction, so it is not clear why ATLAS
CGw/S chose to set FMA = 1. Second, choosing 6 and 1 for the values of MU
and NU violates Inequality (6.7) since the Opteron has only 8 registers.
We address the problem of the register-tile size ﬁrst. Recall that Inequality (6.7)
stems from the fact that ATLAS uses registers to multiply an MU ×1 vector-tile of
matrix A (which we call ¯ a) with a 1×NU vector-tile of matrix B (which we call ¯ b),
accumulating the result into an MU ×NU tile of matrix C (which we call ¯ c). Notice
that if NU = 1, then ¯ b is a single scalar that is multiplied by each element of ¯ a.
Therefore no reuse exists for elements of ¯ a. This observation lets us generate the145
code in Figure 8.37, which uses 1 register for ¯ b (rb), 6 registers for ¯ c (rc1 ...rc6)
and 1 temporary register (rt) to hold elements of ¯ a.
Even if there are enough logical registers, this kind of scheduling may be bene-
ﬁcial if the ISA is 2-address rather than 3-address, because one of the operands is
overwritten. This is true on the Opteron when the 16 SSE vector registers are used
to hold scalar values, which is GCC’s default behavior. Even though Inequality 3.1
prescribes 3×3 register tiles, the reﬁned model prescribes 14×1 tiles. Experiments
show that this performs better [57].
One might expect that this code will not perform well because there are de-
pendences between most of the instructions that arise from the use of temporary
register rt. In fact, experiments show that the code in Figure 8.37 performs well
because of two architectural features of the Opteron.
1. Out-of-order execution: it is possible to schedule several multiplications in
successive cycles without waiting for the ﬁrst one to complete.
2. Register renaming: the single temporary register rt is renamed to a diﬀerent
physical register for each pair of multiply-add instructions.
Performing instruction scheduling as described in Chapter 3 requires additional
logical registers for temporaries, which in turn limits the sizes of the register tiles.
When an architecture provides out-of-order execution and a small number of logical
registers, it is better to use the logical registers for allocating larger register tiles
and leave instruction scheduling to the out-of-order hardware core, which can use
the extra physical registers to hold the temporaries.
These insights permit us to reﬁne the model described in Chapter 6 as follows:
for processors with out-of-order execution and a small number of logical registers,146
rc1 ← ¯ c1 ...rc6 ← ¯ c6
...
loop k
{
rb ← ¯ b1
rt ← ¯ a1
rt ← rt × rb
rc1 ← rc1 + rt
rt ← ¯ a2
rt ← rt × rb
rc2 ← rc2 + rt
. . .
rt ← ¯ a6
rt ← rt × rb
rc6 ← rc6 + rt
}
...
¯ c1 ← rc1 ...¯ c6 ← rc6
Figure 8.37: (MU,NU) = (6,1) code for x86 CISC147
set NU = 1, MU = NR − 2, FMA = 1.
It is interesting to analyze how the ATLAS search engine settled on these
parameter values. Note that on a processor that does not have a fused multiply-
add instruction, FMA = 1 is equivalent to FMA = 0 and Ls = 1. The code
produced by the ATLAS Code Generator is shown schematically in Figure 8.38.
Note that this code uses 6 registers for ¯ a (ra1 ...ra6), 1 register for ¯ b (rb), 6
registers for ¯ c (rc1 ...rc6) and 1 temporary register (implicitly by the multiply-
add statement). However, the back-end compiler (GCC) reorganizes this code into
the code pattern shown in Figure 8.37.
Notice that the ATLAS Code Generator itself is not aware that the code of
Figure 8.37 is optimal. However, setting FMA = 1 (even though there is no
fused-multiply instruction) produces code that triggers the right instruction reor-
ganization heuristics inside GCC, and performs well on the Opteron. This illus-
trates the well-known point that search does not need to be intelligent to do the
right thing! Nevertheless, our reﬁned model explains the observed performance
data, makes intuitive sense, and can be easily incorporated into a compiler.
MMM Performance
Figure 8.32 shows the MMM performance. ATLAS Unleashed is once again the
fastest implementation here, as it uses the highly-optimized, hand-tuned BLAS
kernels, using the SSE2 SIMD instructions, for which the ATLAS Code Generator
does not generate code. The native BLAS library is about 200 MFLOPS slower
on average. ATLAS CGw/S and ATLAS Model perform at the same level as their
corresponding mini-MMM kernels.
Reﬁning the model as explained above brings ATLAS Model on par with148
rc1 ← ¯ c1 ...rc6 ← ¯ c6
...
loop k
{
ra1 ← ¯ a1
rb ← ¯ b1
rc1 ← rc1 + ra1 × rb
ra2 ← ¯ a2
ra3 ← ¯ a3
rc2 ← rc2 + ra2 × rb
rc3 ← rc3 + ra3 × rb
ra4 ← ¯ a4
ra5 ← ¯ a5
rc4 ← rc4 + ra4 × rb
rc5 ← rc5 + ra5 × rb
ra6 ← ¯ a6
rc6 ← rc6 + ra6 × rb
}
...
¯ c1 ← rc1 ...¯ c6 ← rc6
Figure 8.38: ATLAS unroll (MU,NU) = (6,1) code for x86 CISC
ATLAS CGw/s. To bridge the gap between ATLAS CGw/S and user contributed
code, we would need a diﬀerent code generator – one that understands SIMD and
prefetch instructions. GCC exposes these as intrinsic functions and we plan to149
explore this in future work.
Performance Sensitivity Analysis
Figure 8.34 shows the performance sensitivity to the values of the MU and NU
optimization parameters. As discussed in Section 8.1.7, the optimal value, is (6,1).
From the graph we can see that the only plausible values are those with NU =
1. Furthermore, performance increases while we grow MU from 1 to 6, while it
suddenly drops for MU = 7. This is easily explained by our reﬁned model, as
MU + 2 ≤ NR would require 9 registers, while only 8 are available.
Figure 8.33(a) shows the sensitivity of performance to the value of the NB
optimization parameter. The ﬁrst drop in performance is the result of L1 data
cache misses starting to occur. This fact is accurately captured by our model for
NB in Inequality (6.5). Solving the inequality for C = 8192 (the L1 data cache
capacity in double-sized ﬂoating-point values), we obtain NB = 89. Similarly the
second drop in performance in Figure 8.33(a) can be explained by applying the
same model to the 1MB L2 cache.
Figure 8.35 shows the performance sensitivity to the value of the KU opti-
mization parameter. On this machine the entire mini-MMM loop body can ﬁt
into the L1 instruction cache for arbitrary KU values (up to KU = NB). Perfor-
mance is relatively insensitive to KU as long as this unroll factor is suﬃciently
large (KU > 10).
Figure 8.36 shows the performance sensitivity to the value of the Ls optimiza-
tion parameter. As we mentioned before, when FMA = 1, the Ls optimization
parameter does not inﬂuence the generated code. Therefore, performance is con-
stant with respect to Ls.150
8.1.8 AMD Athlon MP
Table 8.22: AMD Athlon MP: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, CISC, x86
CPU Core Frequency 1733 MHz
L1 Data Cache 64 KB, 64 B/line, 2-way
L1 Instruction Cache 64 KB, 64 B/line, 2-way
L2 Uniﬁed Cache 256 KB, 64 B/line, 16-way
Floating-Point Registers 8
Floating-Point Functional Units ADD + MUL + Memory
Floating-Point Multiply Latency 4
Has Fused Multiply Add No
Operating System Linux 2.4.20
C Compiler GNU C/C++ 3.2.2
Fortran Compiler GNU Fortran 3.2.2
Table 8.23: AMD Athlon MP: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 76 4, 1, 76 1 0 0, 3, 2 1531
Model 88 2, 1, 88 2 0 0, 2, 2 1239
Unleashed 30 2512
Table 8.24: AMD Athlon MP: Timings
Search Model
Machine Parameters 220s 121s
Optimization Parameters 3195s
Total 3415s 121s151
1000 2000 3000 4000 5000
Size
500
1000
1500
2000
2500
MFLOPS
Compiler
Model
CGw￿S
BLAS
Unleashed
Figure 8.39: AMD Athlon MP: MMM Performance
50 100 150 200 250
NB
200
400
600
800
1000
1200
MFLOPS
(a)
20 40 60 80 100 120
NB
200
400
600
800
1000
1200
MFLOPS
(b) Zoomed
Figure 8.40: AMD Athlon MP: Sensitivity of performance to NB152
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
500
1000
1500
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.41: AMD Athlon MP: Sensitivity of performance to MU and NU
10 20 30 40 50 60 70
KU
250
500
750
1000
1250
1500
MFLOPS
Figure 8.42: AMD Athlon MP: Sensitivity of performance to KU
2 4 6 8 10 12
LS
250
500
750
1000
1250
1500
MFLOPS
Figure 8.43: AMD Athlon MP: Sensitivity of performance to Ls153
The AMD Athlon implements the x86 instruction set, so we would expect the
experimental results to be similar to those on the Opteron.
mini-MMM
Table 8.24 shows that on this machine, the mini-MMM code generated by ATLAS
Model runs roughly 20% slower than the code generated by ATLAS CGw/S. Fig-
ure 8.40(a) shows that the choice of NB made by the model is reasonable, while
Figure 8.41 shows that the register-tile values were not chosen optimally by the
model, as on the Opteron. The problem and its solution are similar to those on
the Opteron.
MMM Performance
Figure 8.39 shows MMM performance. ATLAS Unleashed out-performs the other
approaches by a signiﬁcant margin. The hand-coded BLAS do almost as well,
followed by ATLAS CGw/S.
Sensitivity Analysis
Figure 8.41 shows the sensitivity of performance to the values of MU and NU.
Figure 8.40(a) shows the sensitivity of performance to the value of NB. Fig-
ure 8.40(b) shows a scaled-up version of this graph in the region of the optimal
NB value. Both ATLAS Model and ATLAS CGw/S choose good values of NB. In
Figure 8.40(b), the saw-tooth with period 2 arises from the overhead of executing
clean-up code when the value of NB is odd, and therefore not divisible by the
value of MU(= 2). As on other machines, we do not understand the saw-tooth
with period 4 that has larger spikes in performance.154
10 20 30 40 50 60 70
KU
250
500
750
1000
1250
1500
1750
MFLOPS
FMA=1
FMA=0
Figure 8.44: AMD Athlon MP: Sensitivity of performance to KU
Figure 8.42 shows the sensitivity of performance to the value of KU. The L1 I-
cache is large enough to permit full unrolling (KU = NB). However, the sensitivity
graph of KU is anomalous; performance is relatively low for all values of KU other
than KU = NB. By examining the code produced by the native compiler (GCC),
we found that this anomaly arose from interference between instruction scheduling
in ATLAS and instruction scheduling in GCC. Notice that ATLAS CGw/S uses
FMA = 0, so it attempts to schedule instructions and perform software pipelining
in the mini-MMM code. Fully unrolling the k0 loop (KU = NB) produces straight-
line code that is easier for GCC to schedule.
To verify this conjecture, we redid the KU sensitivity study with FMA set to
1. Figure 8.44 shows the results. Setting FMA = 1 dissuades the ATLAS Code
Generator from attempting to schedule code, so GCC has an easier job, producing
a KU sensitivity graph that is in line with what we would expect.
Notice that our reﬁned model, described in the context of the Opteron, per-
forms extremely well – 1544 MFLOPS for mini-MMM, which is faster than the
performance of the mini-MMM produced by ATLAS CGw/S.
Figure 8.43 shows the sensitivity of performance to the value of the Ls.155
8.1.9 Intel Pentium III
Table 8.25: Intel Pentium III: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, CISC, x86
CPU Core Frequency 1266 MHz
L1 Data Cache 16 KB, 32 B/line, 4-way
L1 Instruction Cache 16 KB, 32 B/line, 4-way
L2 Uniﬁed Cache 512 MB, 32 B/line, 8-way
Floating-Point Registers 8
Floating-Point Functional Units 1
Floating-Point Multiply Latency 5
Has Fused Multiply Add No
Operating System Linux 2.4.20-28.8smp
C Compiler GNU C/C++ 3.2
Fortran Compiler GNU Fortran 3.2
Table 8.26: Intel Pentium III: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 44 4, 1, 44 3 0 0, 3, 2 894
Model 42 2, 1, 42 2 0 0, 2, 2 841
Unleashed 40 951
Table 8.27: Intel Pentium III: Timings
Search Model
Machine Parameters 133s 100s
Optimization Parameters 630s
Total 763s 100s156
1000 2000 3000 4000 5000
Size
200
400
600
800
1000
1200
MFLOPS
Model
CGw￿S
Unleashed
BLAS
Figure 8.45: Intel Pentium III: MMM Performance
50 100 150 200 250 300 350
NB
200
400
600
800
1000
MFLOPS
(a)
20 40 60 80 100 120
NB
200
400
600
800
1000
MFLOPS
(b) Zoomed
Figure 8.46: Intel Pentium III: Sensitivity of performance to NB157
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
200
400
600
800
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.47: Intel Pentium III: Sensitivity of performance to MU and NU
10 20 30 40
KU
200
400
600
800
1000
MFLOPS
Figure 8.48: Intel Pentium III: Sensitivity of performance to KU
2 4 6 8 10 12
LS
200
400
600
800
1000
MFLOPS
Figure 8.49: Intel Pentium III: Sensitivity of performance to Ls158
mini-MMM
On this machine, mini-MMM code produced by ATLAS Model is about 50 MFLOPS
(6%) slower than mini-MMM code produced by ATLAS CGw/S. The code pro-
duced by ATLAS Unleashed performs roughly 50 MFLOPS better than the code
produced by ATLAS CGw/S.
The diﬀerence in performance between the codes produced by ATLAS CGw/S
and ATLAS Model arises mostly from the sub-optimal register tile chosen by the
model, as explained in the context of the Opteron in Section 8.1.7. Using (6,1) as
the register tile raises mini-MMM performance to 916 MFLOPS.
MMM Performance
Figure 8.45 shows MMM performance. The hand-coded BLAS performs at roughly
1100 MFLOPS, whereas the codes produced by ATLAS CGw/S and ATLAS Un-
leashed perform roughly at 900 MFLOPS. The code produced by ATLAS Model
runs roughly at 850 MFLOPS; using the reﬁned model improves performance to a
point that is slightly above the performance of code produced by ATLAS CGw/S.
Sensitivity Analysis
Figure 8.47 shows the sensitivity of performance to the values of MU and NU. Like
all x86 machines, the Pentium III has a limited number of logical registers. Our
base-line model picked (2,1) for the register tile, whereas ATLAS CGw/S chose
(4,1). If we use the reﬁned model described in Section 8.1.7, the size of the register
tile becomes (6,1), and mini-MMM performance rises to 916 MFLOPS.
Figure 8.46(a) shows the sensitivity of performance to the value of NB. Fig-
ure 8.46(b) shows a scaled-up version of this graph in the region of the optimal159
NB value. The broad peak in Figure 8.46(a) arises from the inﬂuence of the L2
cache (capacity of 512 KB). Using Inequality (6.1) for the L2 cache, we obtain
NB = 104, which is the NB values where the peak starts, while Inequality (6.5)
gives NB = 164, which corresponds to the NB value where the peak ends. The
L2 cache on the Pentium III is 8-way set-associative, so the drop in performance
between NB = 104 and NB = 164 is small.
Figure 8.48 shows the sensitivity of performance to the value of the KU. On
this machine, the L1 instruction cache is large enough to permit full unrolling
(KU = NB).
Figure 8.49 shows the sensitivity of performance to the value of the Ls. There
is no fused multiply-add instruction, so performance is sensitive to the value of
Ls, but both ATLAS Model and ATLAS CGw/S ﬁnd reasonable values for this
parameter. If we use the reﬁned model described in Section 8.1.7, we set FMA = 1,
and the value of the Ls parameter becomes irrelevant.160
8.1.10 Intel Pentium 4
Table 8.28: Intel Pentium 4: Platform Speciﬁcation
Feature Value
Architecture Out-Of-Order, CISC, x86
CPU Core Frequency 2000 MHz
L1 Data Cache 8 KB, 64 B/line, 4-way
L1 Instruction Cache 12 K uOPs, 6 uOPs/line, 8-way
L2 Uniﬁed Cache 512 KB, 128 B/line, 8-way
Floating-Point Registers 8
Floating-Point Functional Units 1
Floating-Point Multiply Latency 7
Has Fused Multiply Add No
Operating System Linux 2.4.20-30.9smp
C Compiler GNU C v3.2.2
Fortran Compiler GNU Fortran 3.2.2
Table 8.29: Intel Pentium 4: Optimization Parameters
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
CGw/S 28 3, 1, 28 1 0 0, 2, 1 1504
Model 30 1, 1, 30 4 0 0, 2, 2 913
Unleashed 72 3317
Table 8.30: Intel Pentium 4: Timings
Search Model
Machine Parameters 136s 98s
Optimization Parameters 643s
Total 779s 98s161
1000 2000 3000 4000 5000
Size
500
1000
1500
2000
2500
3000
3500
MFLOPS
Model
CGw￿S
Unleashed
BLAS
Figure 8.50: Intel Pentium 4: MMM Performance
50 100 150 200 250 300 350
NB
500
1000
1500
2000
2500
MFLOPS
(a)
20 40 60 80 100 120
NB
500
1000
1500
2000
2500
MFLOPS
(b) Zoomed
Figure 8.51: Intel Pentium 4: Sensitivity of performance to NB162
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2
4
6
8
10
12
14
16
NU
2
4
6
8
10
12
14
16
MU
0
500
1000
1500
2
4
6
8
10
12
14
NU
2
4
6
8
10
12
14
MU
Figure 8.52: Intel Pentium 4: Sensitivity of performance to MU and NU
5 10 15 20 25
KU
250
500
750
1000
1250
1500
MFLOPS
Figure 8.53: Intel Pentium 4: Sensitivity of performance to KU
2 4 6 8 10 12
LS
250
500
750
1000
1250
1500
MFLOPS
Figure 8.54: Intel Pentium 4: Sensitivity of performance to Ls163
mini-MMM
On this machine, the mini-MMM code produced by ATLAS Model is about 600
MFLOPS (40%) slower than the mini-MMM code produced by ATLAS CGw/S.
This is mostly because of the sub-optimal register tile used by ATLAS Model;
changing it to (6,1) improves the performance of the mini-MMM code produced
by ATLAS Model to 1445 MFLOPS, which is only 50 MFLOPS (3%) less than
the performance of the mini-MMM code produced by ATLAS CGw/S.
The mini-MMM produced by ATLAS Unleashed is roughly twice as fast as the
mini-MMM produced by ATLAS Model because this code uses the SSE2 vector
extensions to the x86 instruction set.
MMM Performance
Figure 8.50 shows the MMM performance. The hand-coded BLAS routines for
this machine perform best, followed by the code produced by ATLAS Unleashed.
Both the hand-coded BLAS and the code produced by ATLAS Unleashed use the
SSE2 vector extensions, and this accounts for most of the gap between these codes
and the codes produced by ATLAS Model and ATLAS CGw/S. We do not know
why the hand-coded BLAS performs substantially better than the code produced
by ATLAS Unleashed.
The gap in performance between the codes produced by ATLAS CGw/S and
ATLAS Model disappears if the reﬁned model for register tiles is used.
Sensitivity Analysis
Figure 8.52 shows the sensitivity of performance to the values of MU and NU. This
ﬁgure shows that the best register tile is (5,1), which produces mini-MMM code164
that runs at 1605 MFLOPS. Using (6,1) as the register tile is not as good because
it reduces performance to 1521 MFLOPS.
Figure 8.51(a) shows the sensitivity of performance to the value of NB. Fig-
ure 8.51(b) shows a scaled-up version of this graph in the region of the optimal
NB value. Both ATLAS Model and ATLAS CGw/S choose good tile sizes for the
L1 cache. Tiling for the L2 cache gives slightly better performance. The L2 cache
on this machine has a capacity of 256 KB; using Inequalities (6.1) and (6.5), we
get NB = 105 and NB = 180, which agree well with the data.
Figure 8.53 shows the sensitivity of performance to the value of KU. On this
machine, the L1 instruction cache is large enough to permit full unrolling (KU =
NB). Figure 8.54 shows the sensitivity of performance to the value of Ls.
8.1.11 Discussion
The experimental results in this section can be summarized as follows. Figure 8.55
describes the analytical models used to compute the values for the optimization
parameters. This ﬁgure also shows the reﬁned model used to compute the register
tile values for the x86 architectures.
Figure 8.56 shows the relative performance of the mini-MMM codes produced
by ATLAS Model and by ATLAS Unleashed, using the performance of the codes
produced by ATLAS CGw/S as the base line (the 100% line in this ﬁgure represents
the performance of ATLAS CGw/S on all machines). All the performance numbers
for ATLAS Model in this graph are obtained by tiling for the L1 cache.
We see that on all machines other than the Itanium, the codes produced by
using the analytical models perform almost as well or slightly better than the
codes produced using global search. On the Itanium, we saw that it is best to165
• Estimating FMA: Use the machine parameter FMA
• Estimating Ls: Ls =
l
L××|ALUFP|+1
2
m
• Estimating MU and NU: MU × NU + NU + MU + Ls ≤ NR
1. MU,NU ← u.
2. Solve constraint for u.
3. MU ← max(u,1).
4. Solve constraint for NU.
5. NU ← max(NU,1).
6. If MU < NU then swap MU and NU.
7. Reﬁned Model: If NU = 1 then
– MU ← NR − 2
– NU ← 1
– FMA ← 1
• Estimating NB:
￿
N2
B
B1
￿
+ 3
l
NB×NU
B1
m
+
l
MU
B1
m
× NU ≤ C1
B1.
Trim NB, to make it a multiple of MU, NU, and 2.
• Estimating KU: Choose KU as the maximum value for which mini-MMM
ﬁts in the L1 instruction cache. Trim KU to make it divide NB evenly.
• Estimating FF, IF, and NF: FF = 0,IF = 2,NF = 2
Figure 8.55: Summary of Model
tile for the L3 cache, rather than the L1 cache. By using the L2 cache instead,
ATLAS CGw/S was able to obtain some of the beneﬁts of tiling for the L3 cache.166
0% 50% ATLAS
CGw￿S
100%
150% 200%
Pentium4
PentiumIII
AthlonMP
Opteron 240
Itanium2
UltraSparcIIIi
R12K
Power 4
Power 3
Alpha21264
Unleashed
RefinedModel
Model
Figure 8.56: Summary of mini-MMM Performance. Performance numbers are
normalized to that of ATLAS CGw/S, which is presented as 100%
If we use this value in the model of Figure 8.55, we produce mini-MMM code of
comparable performance. Using the actual capacity of the L3 cache gives even167
better performance.
In our experiments we noticed that on several platforms, we get better MMM
performance by tiling for a lower cache level, such as L2 or L3, rather than L1.
This may result in a large value for NB, which may hurt overall performance if
the resulting MMM library routine is invoked from other routines such as LU and
Cholesky factorizations [32]. It is unclear to us that this is an issue in the context
of compilers, where codes like LU and Cholesky would be optimized directly, rather
than built upon MMM.
8.2 Model reﬁnement and local search
8.2.1 x86-based architectures
In Section 8.1.7 we noticed that the original model (denoted simply as “Model” in
this section) performed almost 40% worse than ATLAS CGw/S. We analyzed the
sensitivity of optimization parameters and designed a Reﬁned Model for estimating
register tile size on x86-based architectures. As we discuss in Section 8.1.11 this re-
ﬁnement alone is enough to close the performance gap between Model and ATLAS
CGw/S on all x86-based architectures. Results are presented in Figure 8.56.
8.2.2 Multilevel memory hierarchies
As discussed in Section 8.1, there are some machines for which tiling for the L2
or L3 cache will give better performance than tiling for the L1 cache. The model
presented in Chapter 6 does not account for cache miss penalties at diﬀerent cache
levels, so although we estimate tile sizes for diﬀerent cache levels, we cannot de-
termine which level to tile for.168
One approach to addressing this problem in the context of model-driven opti-
mization is to reﬁne the model to include miss penalties. Our experience however
is that it is diﬃcult to use micro-benchmarks to measure miss penalties accurately
for lower levels of the memory hierarchy on modern machines, because the actual
penalties are variable and depend on the current utilization of the bus. Therefore,
we decided to estimate tile sizes for all the cache levels according to Inequalities
(6.1) and (6.5), and then empirically determine which one gives the best perfor-
mance. When combined with the local search techniques presented in Chapter 7,
we call this approach Multilevel (ML) Local Search.
Notice that in the context of global search, the problem can be addressed by
making the search space for NB large enough. However, this would increase the
search time substantially since the size of an L3 cache, which would be used to
bound the search space, is typically much larger than the size of an L1 cache.
This diﬃculty highlights the advantage of our approach of using model-driven
optimization together with a small amount of search - we can tackle multi-level
memory hierarchies without increasing installation time signiﬁcantly.
ML Local Search resolved the multilevel memory hierarchy problems on all
architectures discussed in Section 8.1. Here we present experimental results from
two architectures where this ML Local Search was particularly useful – the Intel
Itanium 2 and the SGI R12K.
Intel Itanium 2
Table 8.31 shows the values of the optimization parameters for Model, Reﬁned
Model, Local Search, Multi-Level (ML) Local Search, Global Search, and Un-
leashed, along with the corresponding performance numbers for mini-MMM.169
Table 8.31: Optimization Paramameters for Intel Itanium 2
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
Model 30 10, 10, 4 1 1 0, 2, 2 3130
Reﬁned Model 30 10, 10, 4 1 1 0, 2, 2 3130
Local Search 30 10, 10, 4 1 1 0, 2, 2 3130
ML Local Search 360 10, 10, 4 1 1 0, 2, 2 4602
Global Search 80 10, 10, 4 4 1 0, 19, 1 4027
Unleashed 120 4890
1000 2000 3000 4000 5000
Size
1000
2000
3000
4000
5000
MFLOPS
Compiler
Model
GlobalSearch
ML LocalSearch
Unleashed
BLAS
Figure 8.57: Intel Itanium 2: MMM Performance
Table 8.32 shows the times taken by our micro-benchmarks and by the ATLAS
micro-benchmarks for determining machine and optimization parameters.
Table 8.32: Timings for Intel Itanium 2
Parameters Total
Machine Optimization (sec)
Model 143 6 149
Reﬁned Model 143 6 149
Local Search 143 162 305
ML Local Search 143 278 421
Global Search 1554 29667 31221
Figure 8.57 shows the MMM performance for all these approaches. Reﬁned
Model and Local Search are not plotted on this platform, because their performance
is virtually equivalent to that of Model. The native BLAS used is MKL 6.1.170
Model does not perform well because it tiles for the L1 data cache. For the
Itanium, ATLAS used the size of the L2 cache (256KB) to restrict NB, eﬀectively
selecting the maximum value in the search range (NB = 80). Nevertheless this tile
size is not optimal either. The Multi-level model determined that tiling for the
3 MB L3 cache is optimal, and chooses a value of NB = 362. This is reﬁned to
NB = 360 by local search. This improves performance compared to both Model
and Global Search.
SGI R12000
Table 8.33 shows the values of the optimization parameters for Model, Reﬁned
Model, Local Search, Multi-Level (ML) Local Search, Global Search, and Un-
leashed, along with the corresponding performance numbers for mini-MMM.
Table 8.33: Optimization Paramameters for SGI R12000
NB MU, NU, KU Ls FMA FF, IF, NF MFLOPS
Model 58 5, 4, 58 1 1 0, 2, 2 440
Reﬁned Model 58 5, 4, 58 1 1 0, 2, 2 440
Local Search 58 5, 4, 58 1 1 0, 2, 2 440
ML Local Search 418 5, 4, 16 1 1 0, 2, 2 508
Global Search 64 4, 5, 64 1 0 1, 8, 1 457
Unleashed 64 463
Table 8.34 shows the times taken by our micro-benchmarks and by the ATLAS
micro-benchmarks for determining machine and optimization parameters.
Figure 8.58 shows the MMM performance on the SGI R12K. Reﬁned Model
and Local Search are not plotted on this platform, because as Table 8.33 suggests,
their performance is virtually equivalent to that of Model. For native BLAS we
used SGI SCSL v.1.4.1.3.
The most interesting fact on this platform is that Multi-Level Local Search171
Table 8.34: Timings for SGI R12000
Parameters Total
Machine Optimization (sec)
Model 118 13 131
Reﬁned Model 118 13 131
Local Search 118 457 575
ML Local Search 118 496 608
Global Search 251 2131 2382
1000 2000 3000 4000 5000
Size
420
440
460
480
500
MFLOPS
Compiler
Unleashed
Model
GlobalSearch
BLAS
ML LocalSearch
Figure 8.58: SGI R12000: MMM Performance
successfully ﬁnds that it is worth tiling for the L2 cache. By doing this, it achieves
better performance than even the native BLAS. Global Search achieves slightly
better performance than Model due to the minor diﬀerences in several optimization
parameters. Unleashed does ﬁne for relatively small matrices but for large ones
performs worse than Model and Global Search. Although not entirely visible from
the plot, on this platform, the native compiler (SGI MIPSPro) does a relatively
good job.
8.2.3 Discussion
The experiments reported in show that the combination of model reﬁnement and
local search is eﬀective in closing performance gaps between the model-generated
code and the code generated by global search, while keeping code generation time172
small. However, it is important to realize that reducing library generation time is
not the primary focus of our work; rather, our goal is to ﬁnd optimization strategies
for generating very high-performance code that can be used in general-purpose
compilers because they scale to large programs and complex architectures.Chapter 9
Conclusions and Future Work
...the end of all our exploring
Will be to arrive where we started
And know the place for the ﬁrst time.
T.S.Eliot, Four Quartets
The experimental results demonstrate that it is possible to use analytical mod-
els to determine near-optimal values for the optimization parameters needed in
the ATLAS system to produce high-quality BLAS. Our models were designed to
be compatible with the ATLAS Code Generator; for example, since ATLAS uses
square cache tiles, we had only one parameter NB, whereas a diﬀerent Code Gen-
erator that uses general rectangular tiles may require three cache tile parameters.
Van de Geijn and co-workers have considered such models in their work on opti-
mizing matrix multiplication code for multi-level memory hierarchies [30, 31, 34].
Our results show that using models to determine values for the optimization
parameters is much faster than using empirical search. However, this does not
imply that search has no role to play in the generation of high-performance code.
Systems like FFTW and SPIRAL use search not to choose optimal values for
transformation parameters, but to choose an optimal algorithm from a whole suite
of algorithms. We do not know if model-driven optimization is eﬀective in this
context. Even in the relatively simple context of the BLAS, there are aspects of
program behavior that may not be worth modeling in practice even if they can
be modeled in principle. For example, the analytical models for NB described in
Chapter 6 ignore conﬂict misses. Although there is some work in the compiler
173174
literature on modeling conﬂict misses [15, 18], these models appear to be com-
putationally intractable. Fortunately, the eﬀect of conﬂict misses on performance
can be reduced by appropriate copying. If necessary, the value of NB found by the
model can be reﬁned by local search in the neighborhood of the NB value predicted
by the model. This combination of modeling and local search may be the most
tractable approach for optimizing large programs for complex high-performance
architectures.
At the end of this dissertation, we are left with the same question that we asked
at its beginning: how do we improve the state of the art of compilers? Conven-
tional wisdom holds that current compilers are unable to produce high-quality code
because the analytical models they use to estimate optimization parameter values
are overly simplistic compared to the complexity of modern high-performance ar-
chitectures. The results in this dissertation contradict this conventional wisdom,
and suggest that there is no intrinsic reason why compilers cannot use analytical
models to generate excellent code, at least for the BLAS.
However, it is important not to underestimate the challenge in improving
general-purpose compilers to bridge the current performance gap with library gen-
erators. Although the techniques used by ATLAS, such as loop tiling, unrolling,
and instruction scheduling, have been in the compiler literature for many years,
it is not easy to incorporate them into general-purpose compilers. For example,
transformations such as tiling are not always legal, so a general-purpose compiler
must perform dependence analysis before transforming a program. In contrast,
the implementor of a library generator focuses on one application and knows the
precise structure of the code to be generated for that application, so she is not en-
cumbered by the baggage required to support the restructuring of general codes.175
At the very least, improving the state of the art of compilation technology will
require an open compiler infrastructure that permits researchers to experiment
easily with diﬀerent transformations and to vary the parameters of those transfor-
mations. This has been a long-standing problem, and no adequate infrastructure
exists in spite of many attempts.
An equally important conclusion of this study is that there is still a signiﬁcant
gap in performance between the code generated by ATLAS CGw/S and the vendor
BLAS routines. Although we understand some of the reasons for this gap, the
problem of automating library generation remains open. The high cost of library
and application tuning makes this one of the most important questions we face
today.Appendix A
X-Ray: Experimental Results
A.1 CPU Features
In this section, we will present and discuss the results from running X-Ray on a
number of diﬀerent hardware architectures. In particular, we will show results for
the following,
• Intel Itanium 2, 2-way SMP, 1.5GHz
• AMD Opteron 240, 2-way SMP, 1.4GHz
• Sun UltraSPARC IIIi, 2-way SMP, 1GHz
• Intel Pentium 4 Xeon, 2-way SMP, 2.2GHz
• AMD Athlon MP 2800+, 2-way SMP, 2.25GHz
• SGI R12000, 300MHz
• IBM Power 3, 8-way SMP, 375MHz
We also compare our results with those from three other similar platform-
independent tools:
• Calibrator v0.9e [41] is a memory system benchmark aimed at measuring
capacity, block size, and latency at each level of the memory hierarchy and
the corresponding parameters for TLBs, such as the number of entries, the
page size, and the latency.
176177
• lmbench v3.0a3 [45, 55, 44] is a suite of benchmarks for measuring oper-
ating systems parameters such as thread-creation time and context-switch
time. In version 3, the authors have included several hardware benchmarks
for measuring CPU frequency, latency and parallelism of diﬀerent operations,
capacity, block size, and latency of each level of the memory hierarchy, and
the number of TLB entries.
• MOB v0.1.1 [9] is an ambitious project to create a benchmark suite capable
of measuring a large number of properties of the memory hierarchy, includ-
ing capacity, block size, associativity, sharedness, replacement policy, write
mode, and latency of each level, as well as the corresponding parameters for
TLBs.
We also looked at ATLAS v3.6.0 [56], which has a set of micro-benchmarks to
aid its empirical search engine. These micro-benchmarks measure the latency of
ﬂoating-point multiplication, the number of ﬂoating-point registers, the existence
of a fused multiply-add instruction and the capacity of the L1 data cache. ATLAS
needs only approximate values for these hardware parameters because it uses these
values to bound the search space for optimization parameters, and not to estimate
optimal values for these parameters. The micro-benchmarks in ATLAS are of
limited precision, although they are perfectly adequate for what is needed of them
in the context of the ATLAS system. Therefore, we decided not compare X-Ray
directly to it.
Because all the tools, including X-Ray, measure hardware parameters empiri-
cally, the results sometimes vary from one execution to the next. These variations
are negligible with X-Ray, but sometimes quite noticeable with the other tools.
The results we present are the best ones we obtained in several trial runs.178
Table A.1: Summary of Experimental Results: CPU Features
Parameter Tool Itanium Opteron SPARC Pentium4 Athlon R12000 Power 3
Frequency Actual 1500 1400 1000 2200 2250 300 375
X-Ray 1488.16 1379.31 994.21 4324.09 2129.19 298.26 375.43
(MHz) lmbench 1497 1392 1001 2164 2117 292 375
FP Add Actual 4 4 4 5 4 2 4
Latency X-Ray 3.98 3.96 4.04 10 4 2 4
(cycles) lmbench 4.01 4.04 3.88 5 4 2.01 4.01
FP Multiply Actual 4 4 4 7 4 2 4
Latency X-Ray 3.98 4 4.11 14.09 4 2 4
(cycles) lmbench 4.01 4.04 3.79 6.99 4 2.01 4.01
FP Add Actual 0.5 1 1 1 1 1 0.5
Initiation Interval X-Ray 0.51 0.99 1 2.03 1.01 1 0.5
(cycles) lmbench 0.34 1.48 2.73 1.02 0.73 3.44 1.34
FP Multiply Actual 0.5 1 1 2 1 1 0.5
Initiation Interval X-Ray 0.51 1 0.99 4 1.01 1 0.5
(cycles) lmbench 0.34 1.48 2.85 1.05 0.75 3.44 1.34
FMA existence Actual yes no no no no yes yes
X-Ray yes no no no no yes yes
(yes/no) lmbench !supported !supported !supported !supported !supported !supported !supported
SMP Actual 2 2 2 2 2 1 8
X-Ray 2 2 2 2 2 1 8
(count) lmbench !supported !supported !supported !supported !supported !supported !supported
SMT per CPU Actual 2 1 1 1 1 1 1
X-Ray 2 1 1 1 1 1 1
(count) lmbench !supported !supported !supported !supported !supported !supported !supported
Table A.1 shows the CPU parameters measured by X-Ray and lmbench. Cal-
ibrator and MOB address only the memory hierarchy, so we do not discuss them
here. The parameters that lmbench does not measure are marked “!supported”.
Now we discuss the individual parameters in greater detail.179
A.1.1 CPU Frequency
Processor frequency is not a valuable parameter by itself; its only purpose is to
express the latencies of other hardware parameters in clock cycles. Therefore it is
only important that it be a consistent base for relative comparisons. In addition,
the actual CPU frequency may diﬀer from the advertised frequency by 5%.
On most of the architectures, the results of both X-Ray and lmbench (in Ta-
ble A.1) were close to the advertised actual frequencies. The only exception is the
Pentium 4, where X-Ray measured double the actual value. The reason for this
inconsistency is that the X-Ray methodology for computing frequency assumes
that integer adds have an initiation interval of 1, but the Pentium 4 has a double-
pumped integer ALU unit. In eﬀect, the integer add instruction has an initiation
interval of 0.500 on the P4. For relative comparisons of instruction latencies, this
anomaly is not important.
A.1.2 Instruction Latency and Throughput
X-Ray can measure the latency and initiation interval of any instruction; for lack
of space, we show only the results for double-precision ﬂoating-point addition and
multiplication.
One detail is that lmbench measures these values in diﬀerent units. It measures
instruction latency l in nanoseconds and instruction parallelism p as a factor of
improvement over l. To produce directly comparable results, we converted the
latency to cycles, using the lmbench’s measured processor frequency FCPU: lcycles =
l × FCPU. Then we converted the parallelism to our notion of initiation interval
using II =
lcycles
p .
Similarly, Calibrator does not implement CPU frequency measurement, but180
rather requires the user to input it as an external parameter. We used the adver-
tised frequency of the processors to provide a value for this external parameter.
Both X-Ray and lmbench measure latencies very accurately, but lmbench does
not measure some initiation intervals correctly. Two points deserve mention:
• The X-Ray numbers on the Pentium 4 are twice the actual values because
they are relative to the integer ALU frequency, which, as mentioned in Sec-
tion A.1.1, is twice higher than the rest of the core.
• On the Itanium 2 and the Power 3, the initiation interval of both instructions
is 0.5, which means that the CPU is dispatching 2 instruction per cycle. This
is because these architectures have two FP ALUs.
Paradoxically, X-Ray found that not every CPU with two fully pipelined
ALUs has an initiation interval of 0.5. For example the Pentium III and
Pentium 4 both have two integer ALUs, but their initiation intervals are
0.67. This anomaly arises from the fact that there are other bottlenecks
in the CPU that prevent it from issuing the two integer instructions at a
sustained rate. On the Pentium III the bottleneck is that there are only two
register read ports, while on the Pentium 4 the bottleneck is the bandwidth
between the instruction cache and the CPU core.
This paradox highlights a recurring theme in our investigation. When there
is a disagreement between X-Ray and the vendor’s manuals, the values mea-
sured by X-Ray are more likely to be relevant to program optimization.181
A.1.3 Number of Registers
Registers are often considered a level-0 cache C0, as they are at the top of the
memory hierarchy. If a machine has N registers of type T, we can characterize
C0 = hA,B,Ci = hN,sizeof(T),N × sizeof(T)i. C0 can exhibit spacial locality
only in the case of vector registers (MMX, SSE, etc.). Furthermore, it is fully
associative and the replacement policy is software controlled.
The only way to directly exercise this control is to program in assembly lan-
guage. Portable software, on the other hand, is usually written in a high-level
language like C and the native compiler is responsible for register allocation, reg-
ister spills and ﬁlls. Nevertheless, when the ultimate goal is high-performance,
programmers need to make assumptions about the number of registers available
for register allocation and apply optimizing transformations like array scalarization
and loop unrolling appropriately (e.g., ATLAS [61]).
Note that we measure the eﬀective number of available registers, which is the
value that is relevant for program optimization. This value is often be smaller than
the number of actual registers in the given architecture for the following reasons.
• Some registers may be reserved for the Stack Pointer, Frame Pointer, Return
Address, etc.
• Some registers may be hardwired with speciﬁc values, most often the ﬂoating
point values 0.0 and 1.0.
• Compilers may use some registers in a special way, and they might not be
available to the general register allocator, e.g., accumulators, register win-
dows, etc.182
• Compilers might not use all available registers for diﬀerent reasons, e.g.,
targeting an older version of the ISA.
By appropriately deﬁning the operation add, this method is able to measure
all types of registers, including integer, ﬂoating point, and vector registers (e.g.,
MMX, SSE, 3DNow!, Altivec) through compiler intrinsics.
None of lmbench, Calibrator, and MOB try to measure the number of available
registers. The ATLAS framework attempts to provide a rough estimate for the
number of ﬂoating point registers, but they can aﬀord to be conservative, as op-
posed to precise, because they only use the estimate to bound their search space.
Table A.2 summarizes our measurement results.
Table A.2: Experimental results for registers
available / actual
Architecture int double MMX SSE
Pentium 4 5 / 8 8 / 8 8 / 8 8 / 8
Itanium 2 123 / 128 128 / 128 n/a n/a
Athlon MP 5 / 8 8 / 8 8 / 8 8 / 8
Opteron 240 14 / 16 16 / 16 8 / 8 16 / 16
UltraSPARC IIIi 24 / 32 31 / 32 n/a n/a
R12000 22 / 32 32 / 32 n/a n/a
Power 3 28 / 32 32 / 32 n/a n/a
As expected the number of available integer registers is always less than the
actual number of registers because some registers are reserved for use either by the
hardware or by the compiler. The measured number of ﬂoating point registers is
equal to the actual number in all cases except on the UltraSPARC IIIi machine,
where one of the registers is hardwired to 0.0. The measured number of vector
registers is always equal to the actual number. We do not provide results for
3DNow! and SSE2 registers, because they are equivalent to MMX and SSE register183
respectively.
A.2 Memory Hierarchy
To report cache latency in CPU cycles we use a micro-benchmark for measuring
CPU frequency, which is part of X-Ray. In this section we compare the results of
running the memory-hierarchy portion of X-Ray on 7 platforms with the results
of running the following three tools.
• Calibrator v0.9e [41] is a memory system benchmark aimed at measuring
capacity, block size, and latency at each level of the memory hierarchy and
TLB parameters, such as number of entries, page size, and latency.
• lmbench v3.0a3 [44, 45, 55] is a suite of benchmarks for measuring oper-
ating systems parameters such as thread-creation time and context-switch
time. Version 3 contains micro-benchmarks for measuring latency and par-
allelism of diﬀerent operations, capacity, block size, and latency of each level
of the memory hierarchy, and the number of TLB entries.
• MOB v0.1.1 [9] is an ambitious project to create a benchmark suite ca-
pable of measuring a large number of properties of the memory hierarchy,
including capacity, block size, associativity, sharedness, replacement policy,
write mode, and latency of each level, as well as the corresponding TLB
parameters.
Table A.3 shows the memory hierarchy parameters, along with the results from
measuring them with the diﬀerent tools. Whenever a parameter was not success-
fully computed, we use the following special entries to specify the reason:184
• n/a – the tool does not claim to be able to measure this hardware parameter;
• empty – the benchmark completed but did not produce a value for this
parameter;
• abort – an abnormal termination of some kind occurred prevented the bench-
mark from completion;
• build – the benchmark did not build successfully;
• os – OS-speciﬁc support is required for X-Ray to complete this measurement
and we have not implemented such support yet.
A.2.1 L1 Data Cache
As Table A.3 shows, X-Ray successfully found the correct values for all L1 cache
parameters on all the platforms other than the Power 3, where it decided that the
cache was 129-way set associative although it is actually 128-way set-associative.
For reasons we do not understand, there was no performance loss in the micro-
benchmark when moving from 128 to 129 steps, but there was a performance loss
in moving from 129 to 130. This anomaly also aﬀected the determination of the
cache capacity slightly. The performance of the other tools varies, and the details
are presented in Table A.3.
A.2.2 Lower Level Caches
Lower level caches are physically addressed on all modern machines so we found
it necessary to use super-pages to obtain consistent measurements of lower level
cache parameters, as discussed in Section 4.4.4. Support for super-pages is very185
OS-speciﬁc, so we targeted the Linux system as a proof of concept. Table A.3 shows
that X-Ray was able to measure lower level cache parameters correctly on all the
Linux machines in our study (Pentium 4, Itanium 2, Athlon MP, and Opteron
240). We are currently working on the implementation for Solaris, IRIX and AIX,
which will allow us to test X-Ray on the rest of the machines as well.
The numbers for the AMD machines (Athlon and Opteron) are interesting be-
cause they expose the fact that the L1 and L2 caches on these machines implement
cache exclusion. Most platforms support cache inclusion, which means that infor-
mation cached at a particular level of the memory hierarchy should also be cached
in all lower levels. This is necessary to support cache-coherency protocols in SMP
systems. AMD machines on the other hand use exclusion, so data never resides
in both the L1 and L2 caches simultaneously. While this requires the L1 cache
to snoop on the bus to resolve coherency issues, it eﬀectively increases the useful
capacity of L2 by the capacity of the L1.
X-Ray classiﬁed the 512KB, 16-way associative L2 cache of the AthlonMP as an
18-way set-associative cache with a capacity of 576KB (exactly C1+C2). Similarly
on the Opteron 240, the 1MB L2 was classiﬁed as a 17-way set associative cache
with an eﬀective capacity 1088KB (exactly C1 + C2). If the actual capacity of the
L2 cache is needed, it can be obtained by subtracting the capacity of the L1 cache,
although the combined capacity is what is actually relevant for autonomic code
that wants to perform an optimization like cache tiling.
The performance of the other tools varied. Calibrator produced somewhat
pessimistic results for cache capacity on some of the Linux machines; we believe
this eﬀect, too, arises from non-contiguous physical memory since this reduces the
eﬀective cache capacity. lmbench terminates abnormally on some platforms, but186
produces accurate results when it terminates cleanly. MOB produced accurate
results only for the capacity of the L2 cache of Itanium 2. In all other cases, it
either aborted, produced a wrong result or did not produce a result at all.
The cache access latency ﬁgures produced by all the tools for lower level caches
should be taken with a grain of salt since the actual access time can ﬂuctuate
substantially depending on what other memory bus transactions are occurring at
the same time.187
Table A.3: Summary of experimental results
Architecture Actual X-Ray Calibrator lmbench MOB
Pentium 4 8 8 8 8 8
Itanium 2 16 16 16 abort 4
L1 Athlon MP 64 64 64 empty abort
C Opteron 240 64 64 64 abort empty
(KB) UltraSPARC IIIi 64 64 64 64 abort
R12000 32 32 32 32 build
Power 3 64 64.5 64 64 empty
Pentium 4 64 64 32 64 abort
Itanium 2 64 64 64 abort 104
L1 Athlon MP 64 64 64 empty abort
B Opteron 240 64 64 32 abort empty
(bytes) UltraSPARC IIIi 32 32 32 32 abort
R12000 16 16 64 32 build
Power 3 128 128 128 128 empty
Pentium 4 4 4 n/a n/a empty
Itanium 2 4 4 n/a n/a empty
L1 Athlon MP 2 2 n/a n/a empty
A Opteron 240 2 2 n/a n/a empty
(count) UltraSPARC IIIi 4 4 n/a n/a empty
R12000 2 2 n/a n/a empty
Power 3 128 129 n/a n/a empty
Pentium 4 2 2 2 2 abort
Itanium 2 2 2 2 abort 5
L1 Athlon MP 3 3 3 empty abort
l Opteron 240 3 3 3 abort empty
(cycles) UltraSPARC IIIi 2 2 2 2 abort
R12000 2 2 2 2 build
Power 3 2 2 2 2 empty
Pentium 4 512 512 384 512 abort
Itanium 2 256 256 256 abort 256
L2 Athlon MP 512 576 384 512 abort
C Opteron 240 1024 1088 768 abort empty
(KB) UltraSPARC IIIi 512 os 1024 1024 abort
R12000 512 os 2048 2048 build
Power 3 512 os 6144 6144 0
Pentium 4 128 128 128 128 abort
Itanium 2 128 128 128 128 empty
L2 Athlon MP 64 64 64 64 abort
B Opteron 240 64 64 64 64 empty
(bytes) UltraSPARC IIIi 64 os 64 64 abort
R12000 128 os 128 128 build
Power 3 128 os 128 128 empty
Pentium 4 8 8 n/a n/a empty
Itanium 2 8 8 n/a n/a empty
L2 Athlon MP 16 18 n/a n/a empty
A Opteron 240 16 17 n/a n/a empty
(count) UltraSPARC IIIi ? os n/a n/a empty
R12000 ? os n/a n/a empty
Power 3 ? os n/a n/a empty
Pentium 4 ? 21 18 20 abort
Itanium 2 ? 6 4 abort 6
L2 Athlon MP ? 36 18 3 abort
l Opteron 240 ? 23 13 abort empty
(cycles) UltraSPARC IIIi ? 13 12 15 abort
R12000 ? 14 12 14 build
Power 3 ? 18 9 17 1
Pentium 4 ? 381 372 368 abort
Itanium 2 ? 298 281 abort empty
Memory Athlon MP ? 471 401 198 abort
l Opteron 240 ? 136 127 abort empty
(cycles) UltraSPARC IIIi ? os 164 173 abort
R12000 ? os 111 122 build
Power 3 ? os 136 161 empty188
Table A.4: Summary of Itanium 2 C3 parameters
Actual X-Ray Calibrator lmbench MOB
C (KB) 6144 6144 6144 abort 4096
B (bytes) 128 128 128 abort empty
A (count) 24 24 n/a n/a n/a
l (cycles) ? 19 14 abort 6BIBLIOGRAPHY
[1] ATLAS homepage. http://math-atlas.sourceforge.net/.
[2] Autonomic Computing: creating self-managing computing systems. http:
//www.research.ibm.com/autonomic/.
[3] The PHiPAC home page. http://www.icsi.berkeley.edu/∼bilmes/
phipac.
[4] R. C. Agarwal, F. G. Gustavson, and M. Zubair. Improving performance of
linear algebra algorithms for dense matrices using algorithmic prefetch. IBM
Journal of Research and Development, 38(3):265–275, 1994.
[5] Alfred V. Aho, Ravi Sethi, and Jeﬀrey D. Ullman. Compilers: Principles,
Techniques and Tools. Addison Wesley, Reading, MA, second edition, 1986.
[6] R. Allan and K. Kennedy. Optimizing Compilers for Modern Architectures.
Morgan Kaufmann Publishers, 2002.
[7] Utpal Banerjee. Unimodular transformations of double loops. In Advances in
Languages and Compilers for Parallel Computing. Pitman Publisher, 1991.
[8] Jeﬀ Bilmes, Krste Asanovi´ c, Chee whye Chin, and Jim Demmel. Optimiz-
ing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C
coding methodology. In Proceedings of International Conference on Super-
computing, Vienna, Austria, July 1997.
[9] Josep M. Blanquer and Robert C. Chalmers. MOB: Memory Organization
Benchmark. http://www.nmsl.cs.ucsb.edu/mob.
[10] Pierre Boulet, Alain Darte, Tanguy Risset, and Yves Robert. (Pen)-ultimate
tiling? In INTEGRATION, the VLSI Journal, volume 17, pages 33–51. 1994.
[11] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improv-
ing balance for pipelined architectures. Journal of Parallel and Distributed
Computing, 5(4):334–358, 1988.
[12] David Callahan, Steve Carr, and Ken Kennedy. Improving register allocation
for subscripted variables. In SIGPLAN Conference on Programming Language
Design and Implementation, pages 53–65, 1990.
[13] David Callahan and Brian Koblenz. Register allocation via hierarchical graph
coloring. In PLDI ’91: Proceedings of the ACM SIGPLAN 1991 conference
on Programming language design and implementation, pages 192–203, New
York, NY, USA, 1991. ACM Press.
189190
[14] G. J. Chaitin. Register allocation & spilling via graph coloring. In SIGPLAN
’82: Proceedings of the 1982 SIGPLAN symposium on Compiler construction,
pages 98–101, New York, NY, USA, 1982. ACM Press.
[15] Siddhartha Chatterjee, Erin Parker, Philip J. Hanlon, and Alvin R. Lebeck.
Exact analysis of the cache behavior of nested loops. In Proceedings of the
ACM SIGPLAN 2001 conference on Programming language design and im-
plementation, pages 286–297. ACM Press, 2001.
[16] Frederick Chow and John Hennessy. Register allocation by priority-based
coloring. In SIGPLAN ’84: Proceedings of the 1984 SIGPLAN symposium
on Compiler construction, pages 222–232, New York, NY, USA, 1984. ACM
Press.
[17] Michael Cierniak and Wei Li. Unifying data and control transformations
for distributed shared memory machines. In SIGPLAN 1995 conference on
Programming Languages Design and Implementation, June 1995.
[18] Phillipe Claus. Counting solutions to linear and nonlinear constraints through
Erhart polynomials. In ACM International Conference on Supercomputing.
ACM, May 1996.
[19] John Cocke and Ken Kennedy. An algorithm for reduction of operator
strength. Commun. ACM, 20(11):850–856, 1977.
[20] Stephanie Coleman and Kathryn S. McKinley. Tile size selection using cache
organization and data layout. In SIGPLAN Conference on Programming Lan-
guage Design and Implementation, pages 279–290, 1995.
[21] J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra, H. You, and
M. Zhou. Experiences and lessons learned with a portable interface to hard-
ware performance counters. In PADTAD Workshop, IPDPS 2003, April 2003.
[22] Jack Dongarra. Personal communication.
[23] Paul Feautrier. Some eﬃcient solutions to the aﬃne scheduling problem -
part 1: one dimensional time. International Journal of Parallel Programming,
October 1992.
[24] International Organization for Standardization. ISO/IEC 9899-1999, Pro-
gramming Languages: C, 1999. Technical Committee: JTC 1/SC 22/WG
14.
[25] Martin Fowler. Yet another optimization article. IEEE Software, pages 20–21,
May/June 2002.191
[26] B. B. Fraguela, R. Doallo, and E. Zapata. Automatic analytical modeling
for the estimation of cache misses. In Parallel Architectures and Compilation
Techniques (PACT), pages 221–231, 1999.
[27] Matteo Frigo and Steven G. Johnson. FFTW: An adaptive software architec-
ture for the FFT. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal
Processing, volume 3, pages 1381–1384, Seattle, WA, May 1998.
[28] Matteo Frigo and Steven G. Johnson. The design and implementation of
FFTW3. Proceedings of the IEEE, 93(2), 2005. special issue on ”Program
Generation, Optimization, and Adaptation”.
[29] Stefan Goedecker and Adolfy Hoisie. Performance Optimization of Numeri-
cally Intensive Codes. Society for Industrial & Applied Mathematics, 2001.
[30] Kazushige Goto and Robert van de Geijn. On reducing tlb misses in matrix
multiplication. Technical Report TR-2002-55, University of Texas at Austin,
Department of Computer Sciences, November 2002.
[31] John A. Gunnels, Greg M. Henry, and Robert A. van de Geijn. A family of
high-performance matrix algorithms. In Proceedings of International Confer-
ence of Computational Science - ICCS 2001: San Francisco, CA, USA, May
28-30, 2001 Proceedings, Part I, pages 51–60. Springer, 2001.
[32] Fred Gustavson. Personal communication.
[33] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative
Approach. Morgan Kaufmann Publishers, 1990.
[34] G. Henry. Flexible high-performance matrix multiply via self-modifying run-
time code, 2001.
[35] High-performance blas by kazushige goto. http://www.cs.utexas.edu/
users/flame/goto/.
[36] Jeremy Johnson, Robert W. Johnson, David A. Padua, and Jianxin Xiong.
Searching for the best FFT formulas with the SPL compiler. In Proc. of
the 13th International Workshop on Languages and Compilers for Parallel
Computing, pages 109–124, 2000.
[37] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Improving
locality using loop and data transformations in an integrated framework. In
MICRO 31: Proceedings of the 31st annual ACM/IEEE international sym-
posium on Microarchitecture, pages 285–297, Los Alamitos, CA, USA, 1998.
IEEE Computer Society Press.192
[38] Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali. Data-centric
multi-level blocking. In Programming Languages, Design and Implementation.
ACM SIGPLAN, June 1997.
[39] Induprakas Kodukula and Keshav Pingali. Imperfectly nested loop transfor-
mations for memory hierarchy management. In International Conference on
Supercomputing, Rhodes, Greece, June 1999.
[40] W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA
compilers. ACM Transactions on Computer Systems, 1993.
[41] Stefan Manegold. The calibrator: a cache-memory and TLB calibration tool.
http://homepages.cwi.nl/∼manegold/Calibrator/calibrator.shtml.
[42] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques
for storage hierarchies. IBM Systems Journal, 9(2):78–92, 1970.
[43] A. C. McKellar and E. G. Coﬀman, Jr. Organizing matrices and matrix
operations for paged memory systems. Commun. ACM, 12(3):153–165, 1969.
[44] Larry McVoy and Carl Staelin. lmbench. http://www.bitmover.com/
lmbench/.
[45] Larry McVoy and Carl Staelin. lmbench: Portable tools for performance
analysis. In USENIX 1996 Annual Technical Conference, January 22–26,
1996. San Diego, CA, pages 279–294, Berkeley, CA, USA, January 1996.
[46] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan
Kaufmann Publishers, 1997.
[47] Juan Navarro, Sitararn Iyer, Peter Druschel, and Alan Cox. Practical, trans-
parent operating system support for superpages. SIGOPS Oper. Syst. Rev.,
36(SI):89–104, 2002.
[48] David Padua and Michael Wolfe. Advanced compiler optimization for super-
computers. Communications of the ACM, 29(12):1184–1201, December 1986.
[49] William Pugh. The omega test: a fast and practical integer programming
algorithm for dependence analysis. In Supercomputing ’91: Proceedings of the
1991 ACM/IEEE conference on Supercomputing, pages 4–13, New York, NY,
USA, 1991. ACM Press.
[50] Markus P¨ uschel, Jos´ e M. F. Moura, Jeremy Johnson, David Padua, Manuela
Veloso, Bryan W. Singer, Jianxin Xiong, Franz Franchetti, Aca Gaˇ ci´ c, Yevgen
Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo. SPIRAL: Code
generation for DSP transforms. Proceedings of the IEEE, 93(2), 2005. special
issue on ”Program Generation, Optimization, and Adaptation”.193
[51] Joan McComb Ramesh C. Agarwal, Fred G. Gustavson and Stanley Schmidt.
Engineering and Scientiﬁc Subroutine Library Release 3 for IBM ES/3090
Vector Multiprocessors. IBM Systems Journal, 28(2):345–350, 1989.
[52] B. Ramakrishna Rau. Iterative modulo scheduling. Technical Report HPL-
94-115, Hewlett-Packard Research Laboratories, November 1995.
[53] Rafael H. Saavedra and Alan Jay Smith. Measuring cache and TLB per-
formance and their eﬀect of benchmark run. Technical Report CSD-93-767,
February 1993.
[54] Robert Schreiber and Jack Dongarra. Automatic blocking of nested loops.
Technical Report CS-90-108, Knoxville, TN 37996, USA, 1990.
[55] Carl Staelin and Larry McVoy. mhz: Anatomy of a micro-benchmark. In
USENIX 1998 Annual Technical Conference, January 15–18, 1998. New Or-
leans, Louisiana, pages 155–166, Berkeley, CA, USA, June 1998.
[56] R. Clint Whaley. Personal communication.
[57] R. Clint Whaley. http://sourceforge.net/mailarchive/forum.php?
thread id=1569256&forum id=426.
[58] R. Clint Whaley. User contribution to atlas. http://math-atlas.
sourceforge.net/devel/atlas contrib.
[59] R. Clint Whaley and Antoine Petitet. Minimizing development and mainte-
nance costs in supporting persistently optimized BLAS. Accepted for publica-
tion in Software: Practice and Experience, 2004. http://www.cs.utk.edu/
∼rwhaley/papers/spercw04.ps.
[60] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Au-
tomated empirical optimization of software and the ATLAS project.
Parallel Computing, 27(1–2):3–35, 2001. Also available as Univer-
sity of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000
(www.netlib.org/lapack/lawns/lawn147.ps).
[61] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical
optimization of software and the ATLAS project. Parallel Computing, 27(1–
2):3–35, 2001. Also available as University of Tennessee LAPACK Working
Note #147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.
ps).
[62] Michael E. Wolf and Monica S. Lam. An algorithmic approach to compund
loop transformations. In Advances in Languages and Compilers for Parallel
Computing. Pitman Publisher, 1991.194
[63] M. Wolfe. Iteration space tiling for memory hierarchies. In Third SIAM
Conference on Parallel Processing for Scientiﬁc Computing, December 1987.
[64] Kamen Yotov, Xiaoming Li, Gang Ren, Michael Cibulskis, Gerald DeJong,
Maria Garzaran, David Padua, Keshav Pingali, Paul Stodghill, and Peng Wu.
A comparison of empirical and model-driven optimization. In Proceedings of
the ACM SIGPLAN 2003 conference on Programming language design and
implementation, pages 63–76. ACM Press, 2003.
[65] Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Ke-
shav Pingali, and Paul Stodghill. Is search really necessary to generate high-
performance BLAS? Proceedings of the IEEE, 93(2), 2005. special issue on
”Program Generation, Optimization, and Adaptation”.
[66] Kamen Yotov, Keshav Pingali, and Paul Stodghill. X-ray: A tool for auto-
matic measurement of architectural parameters. Technical Report TR2004-
1966, Cornell University, Computer Science, October 2004.