Application Signature: a new way to predict application performance by Todi, Rajat Kumar
Retrospective Theses and Dissertations Iowa State University Capstones, Theses and Dissertations
2003
Iowa State University
Recommended Citation
Todi, Rajat Kumar, "Application Signature: a new way to predict application performance" (2003). Retrospective Theses and
Dissertations. 1913.
https://lib.dr.iastate.edu/rtd/1913
Application Signature: A new way to predict application performance 
by 
Rajat Kumar Todi
A dissertation submitted to the graduate faculty 
in partial fulfillment of the requirements for the degree of 
DOCTOR OF PHILOSOPHY 
Major: Computer Science 
Program of Study Committee: 
John Gustafson, Co-major Professor 
Gurpur Prabhu, Co-major Professor 
Don Heller 
Srinivas Aluru 
Doug Jacobson 
Iowa State University 
Ames, Iowa 
2003 
Copyright © Rajat Kumar Todi, 2003. All rights reserved.
UMI Number: 3279645 
Graduate College 
Iowa State University 
This is to certify that the doctoral dissertation of 
Rajat Kumar Todi
has met the dissertation requirements of Iowa State University 
Co-major Professor 
Co-major Professor 
For the Major Program
Signature was redacted for privacy.
Signature was redacted for privacy.
Signature was redacted for privacy.
TABLE OF CONTENTS 
List of Figures ix 
List of Tables xv 
Acknowledgments xxi 
Abstract xxii 
CHAPTER 1 Introduction 1 
CHAPTER 2 Benchmarks 5 
2.1 Introduction 5 
2.2 User Groups of Benchmarks 7 
2.3 Usefulness of Benchmarking 14 
CHAPTER 3 Benchmarks Classification 15 
3.1 Classification Based on Usage 15 
3.2 Benchmark Strategy 16 
3.3 Narrow versus Broad-spectrum Benchmark 16 
3.4 Benchmark Examples 17 
3.4.1 Peak Performance 17 
3.4.2 Linpack 17 
3.4.3 STREAM 17 
3.4.4 SPEC CPU95 19 
3.4.5 SPEC CPU2000 24 
3.4.6 SPEC CPU2004 24 
3.4.7 NPB Benchmarks 25 
3.4.8 SPLASH Benchmarks 27 
3.4.9 GAMESS 27 
3.4.10 Assorted Benchmarks 27 
3.4.11 HINT 28 
3.4.12 Peak FLOPS 29 
3.4.13 Lawrence Livermore Loops 29 
3.4.14 Whetstone 29 
3.4.15 SLALOM Benchmark 30 
CHAPTER 4 Common Problems with Benchmarks 31 
4.1 Benchmarks Won't Follow Moore's Law 31 
4.2 Benchmarks Won't Correlate With Real Applications 31 
4.3 Benchmarks are Redundant 33 
4.3.1 Inter-Benchmark Redundancy 34 
4.3.2 Intra-Benchmark Redundancy 34 
4.4 Past Benchmarks Predict Future Performance 35 
4.5 Other Selected Problems of Benchmark 36 
4.6 Summary 39 
CHAPTER 5 Metrics 40 
5.1 Characteristics of Good Performance Metrics 40 
5.2 Means versus Ends Metrics 41 
5.3 Uniprocessor Performance Metrics 41 
5.3.1 MFLOPS 41 
5.3.2 MIPS 42 
5.3.3 Clock Frequency 42 
5.3.4 QUIPS 42 
5.4 Parallel Processing Performance Metrics 43 
5.4.1 Speedup 43 
5.4.2 Efficiency 44 
5.4.3 Scalability 45
5.5 Summary 49 
CHAPTER 6 Statistical Background 50 
6.1 Pearson Product Moment Correlation 50 
6.2 Linear Relation 52 
6.3 Spearman's Rank Correlation 52 
6.3.1 A Matlab Example 53 
6.4 The Harmonic Mean 54 
6.5 The Weighted Harmonic Mean 55 
CHAPTER 7 HINT: The Hardware Signature 57 
7.1 Introduction 57 
7.2 Task and Terminology 58 
7.3 An Example using 8-bit Data Type 59 
7.4 Salient features 62 
7.5 Understanding HINT Graphs 64 
7.5.1 Generic HINT Graphs 64 
7.5.2 Classical Memory-Regime Revealing Graph 65 
7.5.3 Varying Precision 66 
7.5.4 Varying Main Memory 67 
7.5.5 Varying Clock Speed 68 
7.5.6 Cache-Dependent and Cache-Independent systems 68 
7.5.7 Dedicated Machine versus Machine with Interrupts 69 
7.5.8 Scalable Parallel Computers 70 
7.5.9 Non-Scalable Parallel Computers 70 
7.5.10 Special-Purpose Computer 71 
7.5.11 Business computer 71 
7.5.12 Serial versus Workstation Clusters 72 
7.5.13 Same Machine Different Operating System 76 
7.5.14 Serial versus Vector Computer 76 
7.5.15 Region of Computation 77 
7.5.16 Superset of Other Benchmarks 78 
7.5.17 Problem Detection using HINT 79 
7.5.18 Identical Machines Varied Performance 80 
7.5.19 Bug in Motherboard's BIOS software 81 
7.5.20 Dual processors Pentium machine with Slow Memory Bandwidth 82
CHAPTER 8 Application Signature 85 
8.1 History of Application Signature 85 
8.2 What is Application Signature? 87 
8.3 Characteristics of Application Signature 89 
8.4 Modeling Application-Architecture Performance: A Car Transportation Analogy 90 
8.5 Application Performance Model 91 
8.5.1 Hardware Performance Predictors 91 
8.5.2 Application Performance Predictors 93 
8.5.3 The Proposed Computer Design Model 94 
8.6 Experiment Setup 94 
8.6.1 Machines Used 94 
8.6.2 Benchmarks Used 96 
8.7 Summary 97 
CHAPTER 9 Definitions and Notations 99 
9.1 Measured Time, APPMAP Time, and Projected Time 104 
9.2 Validation Strategy for the Models 106 
CHAPTER 10 Model 1: Application Signature Using Instantaneous QUIPS 107 
10.1 Model 107 
10.1.1 Using Instantaneous QUIPS as Application Signature 108 
10.1.2 Using NetQUIPS as Application Signature 108 
10.1.3 Using NetQUIPS and Instantaneous QUIPS Application Signature 109
10.1.4 Using Correlation Vector as Application Signature 109 
10.2 Results 109 
10.3 Summary 110 
CHAPTER 11 Model 2: Application Signature Using Optimization Method 112 
11.1 Model 112 
11.2 Results 113 
CHAPTER 12 Model 3: Application Signature Using Cache Misses 115 
12.1 Model 115 
12.2 Results 116 
CHAPTER 13 Model 4: Application Signature Using Cache Sensitivity 118
13.1 Model 118 
13.2 Results 119 
CHAPTER 14 Applications of APPMAP technology 120 
14.1 System Design 120 
14.2 Selecting System on Applications 121 
14.3 Multiprocessor Scheduling 122 
14.4 Utility based Computing 122 
14.5 Power versus Performance 123 
14.6 Chapter Summary 123 
CHAPTER 15 Conclusion and Future Directions 125 
15.1 Original Contributions of the Thesis 126 
15.2 Future Directions 126 
APPENDIX A Cache Memory Subsystem 129 
APPENDIX B HINT Database 132 
APPENDIX C Application Characteristics - I 135 
APPENDIX D Application Characteristics - II 162 
APPENDIX E Machine Characteristics using HINT 176 
APPENDIX F LMBENCH 195
APPENDIX G Machine Profile Using Stream Benchmark 201 
APPENDIX H More Model1 Results 211
APPENDIX I More Model2 Results 227 
Bibliography 249 
LIST OF FIGURES 
1.1 Application Signature Performance Model 2 
2.1 iCOMP Index 2.0 Weightings 11 
4.1 Computation Chemistry vs LINPACK 32 
4.2 Peak FLOPS versus EP benchmark 33 
4.3 Workload Benchmarks (a) Ideal (b) Redundancy in SPEC CFP2000 
Benchmarks 33 
4.4 Intra-Benchmark Redundancy in SWIM benchmark of SPEC CFP2000 34 
4.5 Past Benchmarks are used for Future System Design 35 
4.6 Benchmarks emphasize different Problem Sizes 37 
7.1 Problem Solved by HINT: Area to be Bounded under the Curve 58
7.2 Two Subintervals of One Dimension Integration with 8-bit Data Precision 60 
7.3 Sequence of Hierarchical Refinement of Integral Bounds 61 
7.4 Precision-Limited Last Iteration, 8-bit data 62 
7.5 Memory Cost versus QUIPS 63 
7.6 Generic HINT Graphs 65 
7.7 Memory Regime Revealing Graph 66 
7.8 Varying Precision 67 
7.9 Varying Main Memory 68 
7.10 Varying Clock Speed 69 
7.11 Cache-independent and Cache-dependent System 70 
7.12 Dedicated Machine versus Machine with interrupts 71 
7.13 Scalable Parallel Computer 72 
7.14 Unscalable Parallel Computer 73 
7.15 Special Purpose Computer 73 
7.16 Business Computer 74 
7.17 Serial versus Workstation Cluster 74 
7.18 Linux Cluster 75 
7.19 Serial versus Workstation Cluster 76 
7.20 Serial versus Vector Machine 77 
7.21 Vector versus Parallel Computers 78 
7.22 Region of Computation 79 
7.23 Superset of Other Benchmarks 80 
7.24 Identical Machine Varied Performance 81 
7.25 Mosix Cluster's Identical Nodes Perform Differently 82
7.26 Bug in Alpha LX motherboard's BIOS 83 
7.27 Serial versus Threaded HINT on Dual Processors 300 MHz Pentium II 84
8.1 Hypothetical Application Signature for (a) Word Processing Application
(b) Computational Fluid Dynamic 86
8.2 Gustafson's Great Crossover: The crossover of memory and arithmetic 
performance 92 
8.3 HINT (Double) QUIPS-Time Graph for Machines M1-M8 96 
9.1 Two Different Rankings of Machines k1 and k2 at Memory Points m1
and m2. At Memory Point m1, k1 > k2, whereas at Memory Point m2,
k2 > k1 103
13.1 Working Set Method for SU2COR: (a) Ideal Cache Miss (b) Cache 
Sensitivity or Application Signature 118 
13.2 Working Set Method for VORTEX: (a) Ideal Cache Miss (b) Cache 
Sensitivity or Application Signature 119 
A.1 Memory Hierarchy 129
B.1 HINT Graphs can be used for Consumer Computer Performance Guide 134
E.1 HINT (Double) QUIPS-Time Graph for Machines M1-M8 177
E.2 HINT (Double) QUIPS-Memory Graph for Machines M1-M8 177
E.3 HINT (Int) QUIPS-Time Graph for Machines M1-M8 178
E.4 HINT (Int) QUIPS-Memory Graph for Machines M1-M8 178
F.1 LMBENCH: Memory Latency Graph for Machines (a) M1 (b) M2 (c) M3 (d) M4 197
F.2 LMBENCH: Memory Latency Graph for Machines (a) M5 (b) M6 (c) M7 (d) M8 198
F.3 LMBENCH: Memory reread bandwidth for Machine M1 199
F.4 LMBENCH: Context Switch Latency for Machine M1 199
F.5 LMBENCH: Memory bandwidth for Machine M1 200
G.1 STREAM Benchmark: System Bandwidth using (a) Copy Kernel (b) Scale Kernel (c) Sum Kernel (d) Triad Kernel 210
H.1 Correlation of Instantaneous QUIPS and Measured Time for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 212
H.2 Correlation of Instantaneous QUIPS and Measured Time for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 213
H.3 Correlation of Instantaneous QUIPS and Measured Time for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 214
H.4 Correlation of Instantaneous QUIPS and Measured Time for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 215
H.5 Correlation of Instantaneous QUIPS and Measured Time for Applications (a) F7 (b) F8 (c) F9 216
H.6 Rank Correlation of Instantaneous QUIPS and Measured Time for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 217
H.7 Rank Correlation of Instantaneous QUIPS and Measured Time for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 218
H.8 Rank Correlation of Instantaneous QUIPS and Measured Time for Signature for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 219
H.9 Rank Correlation of Instantaneous QUIPS and Measured Time for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 220
H.10 Rank Correlation of Instantaneous QUIPS and Measured Time for Applications (a) F7 (b) F8 (c) F9 221
H.11 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and Measured Time for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 222
H.12 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and Measured Time for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 223
H.13 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and Measured Time for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 224
H.14 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and Measured Time for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 225
H.15 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and Measured Time for Applications (a) F7 (b) F8 (c) F9 226
I.1 SEARCH Result: Application Signature as a function of time for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 229
I.2 SEARCH Result: Application Signature as a function of time for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 230
I.3 SEARCH Result: Application Signature as a function of time for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 231
I.4 SEARCH Result: Application Signature as a function of time for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 232
I.5 SEARCH Result: Application Signature as a function of time for Applications (a) F7 (b) F8 (c) F9 233
I.6 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of time for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 234
I.7 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of time for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 235
I.8 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of time for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 236
I.9 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of time for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 237
I.10 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of time for Applications (a) F7 (b) F8 (c) F9 238
I.11 SEARCH Result: Application Signature as a function of problem size for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 239
I.12 SEARCH Result: Application Signature as a function of problem size for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 240
I.13 SEARCH Result: Application Signature as a function of problem size for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 241
I.14 SEARCH Result: Application Signature as a function of problem size for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 242
I.15 SEARCH Result: Application Signature as a function of problem size for Applications (a) F7 (b) F8 (c) F9 243
I.16 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of problem size for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 244
I.17 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of problem size for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 245
I.18 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of problem size for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 246
I.19 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of problem size for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 247
I.20 SEARCH Result: Projected Time Vs Measured Time for Application Signature as a function of problem size for Applications (a) F7 (b) F8 (c) F9 248
LIST OF TABLES 
2.1 Intel's iCOMP 2.0 Benchmark Composition 10 
3.1 High Performance Computers (circa 2000) - Sorted by Price 18 
3.2 The STREAM Benchmark Operations 19 
3.3 The STREAM2 Benchmark Operations 20 
3.4 SPEC CPU 95 Integer Applications: CINT95 20 
3.5 SPEC CPU 95 Floating Point Applications: CFP95 21 
3.6 SPEC CPU 2000 Integer Applications: CINT2000 24 
3.7 SPEC CPU 2000 Floating Point Applications: CFP2000 25 
3.8 NAS Parallel Benchmarks Problems and their Sizes 26
3.9 GAMESS Capabilities % 
3.10 Assorted Tiny Benchmarks Maintained by Al Aburto 29
5.1 Examples of Algorithms Performance measured by Ends-based Metrics 
versus Means-based Metrics 41 
6.1 Statistical Analysis of 10 Machines using Applications A and B 53
6.2 A Matlab Example showing Statistical Analysis 55
8.1 A Simple Car Analogy to Calculate Time Taken for a Trip 91 
8.2 Machines Processors and Cache Configurations 95 
8.3 Machines Memory Configurations 95 
8.4 Integer Benchmarks 97 
8.5 Floating-Point Benchmarks 98 
10.1 Column Definitions for Tables 10.2, 10.3 109 
10.2 Application Signature Results using Instantaneous QUIPS or NetQUIPS 
for Integer Applications 110 
10.3 Application Signature Results using Instantaneous QUIPS or NetQUIPS 
for Floating-point Applications 111
11.1 Search Method (function of problem size) Results for Integer Applications 113 
11.2 Search Method (function of problem size) Results for Floating-Point 
Applications 114 
12.1 Model3 Results for Floating-Point Applications 116
12.2 Model3 Results for Integer Applications 117 
13.1 Working Set Method Results for SU2COR (F1) and VORTEX (I5)
Applications 119
B.1 A Sample of HINT Database Entry form for a Typical Workstation 133
C.1 Hardware Event Counters for R10000 and R12000 135
C.2 Characteristics for 099.go using input null.in (I1) 136
C.3 Characteristics for 099.go using input null 1.in (I2) 137
C.4 Characteristics for 099.go using input 5stone21.in (I3) 138
C.5 Characteristics for 099.go using input 9stone21.in (I4) 139
C.6 Characteristics for 147.vortex using input vortex.in (I5) 140
C.7 Characteristics for 132.ijpeg using input penguin.ppm (I6) 141
C.8 Characteristics for 132.ijpeg using input specmun.ppm (I7) 142
C.9 Characteristics for 132.ijpeg using input vigo.ppm (I8) 143
C.10 Characteristics for 126.gcc using input 1expr.i (I9) 144
C.11 Characteristics for 126.gcc using input 1recog.i (I10) 145
C.12 Characteristics for 126.gcc using input 1reload1.i (I11) 146
C.13 Characteristics for 126.gcc using input 2stmt.i (I12) 147
C.14 Characteristics for 124.m88ksim using input ctl.raw (I13) 148
C.15 Characteristics for 124.m88ksim using input test.raw (I14) 149
C.16 Characteristics for 129.compress using input bigtest.in (I15) 150
C.17 Characteristics for 129.compress using input test.in (I16) 151
C.18 Characteristics for 130.li using input - (I17) 152
C.19 Characteristics for 103.su2cor using input su2cor.in (F1) 153
C.20 Characteristics for 102.swim using input swim.in (F2) 154
C.21 Characteristics for 102.swim using input swim2.in (F3) 155
C.22 Characteristics for 110.applu using input applu.in (F4) 156
C.23 Characteristics for 145.fpppp using input natoms.in (F5) 157
C.24 Characteristics for 141.apsi using input apsi.in (F6) 158
C.25 Characteristics for 146.wave5 using input wave5.in (F7) 159
C.26 Characteristics for 107.mgrid using input mgrid.in (F8) 160
C.27 Characteristics for 125.turb3d using input turb3d.in (F9) 161
D.1 Quadwords Written Back from Scache per 1000 Graduate Instructions
for Integer-Type Benchmarks 163 
D.2 Quadwords Written Back from Scache per 1000 Graduate Instructions 
for Float-Type Benchmarks 163 
D.3 Graduated Loads per 1000 Graduate Instructions for Integer-Type Benchmarks 164
D.4 Graduated Loads per 1000 Graduate Instructions for Float-Type Benchmarks 164
D.5 Primary Instruction Cache Misses per 1000 Graduate Instructions for 
Integer-Type Benchmarks 165 
D.6 Primary Instruction Cache Misses per 1000 Graduate Instructions for 
Float-Type Benchmarks 165 
D.7 Primary Data Cache Misses per 1000 Graduate Instructions for Integer-
Type Benchmarks 166 
D.8 Primary Data Cache Misses per 1000 Graduate Instructions for Float-
Type Benchmarks 166 
D.9 Secondary Instruction Cache Misses per 1000 Graduate Instructions for 
Integer-Type Benchmarks 167 
D.10 Secondary Instruction Cache Misses per 1000 Graduate Instructions for 
Float-Type Benchmarks 167 
D.11 Secondary Data Cache Misses per 1000 Graduate Instructions for Integer-Type Benchmarks 168
D.12 Secondary Data Cache Misses per 1000 Graduate Instructions for Float-
Type Benchmarks 168 
D.13 Graduate Instructions In Billions for Integer-Type Benchmarks 169
D.14 Graduated Floating Point Instructions per 1000 Graduate Instructions 
for Integer-Type Benchmarks 170 
D.15 Graduated Floating Point Instructions per 1000 Graduate Instructions 
for Float-Type Benchmarks 170 
D.16 Issued Instructions per 1000 Graduate Instructions for Integer-Type 
Benchmarks 171 
D.17 Issued Instructions per 1000 Graduate Instructions for Float-Type Benchmarks 171
D.18 TLB Misses per 1000 Graduate Instructions for Integer-Type Benchmarks 172 
D.19 TLB Misses per 1000 Graduate Instructions for Float-Type Benchmarks 172 
D.20 Graduated Stores per 1000 Graduate Instructions for Integer-Type Benchmarks 173
D.21 Graduated Stores per 1000 Graduate Instructions for Float-Type Benchmarks 173
D.22 Cycles Per Instruction for Integer-Type Benchmarks 174 
D.23 Cycles Per Instruction for Float-Type Benchmarks 174 
D.24 Mispredicted Branches per 1000 Graduate Instructions for Integer-Type 
Benchmarks 175 
D.25 Mispredicted Branches per 1000 Graduate Instructions for Float-Type 
Benchmarks 175 
E.1 Truncated HINT Data (DOUBLE) for Hydra (Node 1) (M1) 179
E.2 Truncated HINT Data (INT) for Hydra (Node 1) (M1) 180
E.3 Truncated HINT Data (DOUBLE) for Helix (Node 0) (M2) 181 
E.4 Truncated HINT Data (INT) for Helix (Node 0) (M2) 182 
E.5 Truncated HINT Data (DOUBLE) for Helix (Node 3) (M3) 183 
E.6 Truncated HINT Data (INT) for Helix (Node 3) (M3) 184 
E.7 Truncated HINT Data (DOUBLE) for Chronus (M4) 185 
E.8 Truncated HINT Data (INT) for Chronus (M4) 186 
E.9 Truncated HINT Data (DOUBLE) for Tajar (M5) 187 
E.10 Truncated HINT Data (INT) for Tajar (M5) 188 
E.11 Truncated HINT Data (DOUBLE) for Hermes (Node 0) (M6) 189
E.12 Truncated HINT Data (INT) for Hermes (Node 0) (M6) 190 
E.13 Truncated HINT Data (DOUBLE) for DC (Node 0) (M7) 191 
E.14 Truncated HINT Data (INT) for DC (Node 0) (M7) 192 
E.15 Truncated HINT Data (DOUBLE) for Exiguus (M8) 193 
E.16 Truncated HINT Data (INT) for Exiguus (M8) 194 
F.1 Memory Latencies in Nanoseconds (using LMBENCH) 196
G.1 STREAM Benchmark for Memory Size 91.6 MB 201
G.2 STREAM Benchmark for Hydra, Processor 1 (M1) 202
G.3 STREAM Benchmark for Helix, Processor 1 (M2) 203 
G.4 STREAM Benchmark for Helix, Processor 3 (M3) 204 
G.5 STREAM Benchmark for Chronus (M4) 205 
G.6 STREAM Benchmark for Tajar (M5) 206 
G.7 STREAM Benchmark for Hermes (M6) 207 
G.8 STREAM Benchmark for DC (M7) 208 
G.9 STREAM Benchmark for Exiguus (M8) 209 
H.1 NetQUIPS Results for Integer Applications 211
H.2 NetQUIPS Results for Floating-Point Applications 211 
I.1 Search Method (function of time) Results for Integer Applications 227
I.2 Search Method (function of time) Results for Floating-Point Applications 228
Acknowledgments¹
I am indebted to Dr. John Gustafson for providing his mentorship, funding, and guidance
throughout my thesis and my stay at the Scalable Computing Laboratory (SCL), Ames Laboratory.
Working with Dr. Gustafson helped me to understand in depth the problems and fallacies 
associated with computer benchmarks. 
I am grateful to Dr. Gurpur Prabhu for helping with my writing and encouraging me 
throughout the years. It was a pleasure to interact with him and to learn from his book 
Anita's Legacy. 
One person who inspired me with his experiments is Dr. Don Heller. I have been fortunate 
to work with Dr. Heller and learn from his work. Dr. Heller's projects on A Brief History of 
Time() and Rabbit gave me insight into measurement techniques. 
I am thankful to Prof. Mark Gordon, Dr. Dave Turner, and Dr. Dave Halstead for their 
support. I also thank Brian Smith, Bogdan Vasilu, Qisun Feng, Sairam Sankarnarayan, Charles
Shorb, Quinn Snell, Shri Amit, Joe Metzer, Vicky O'Neal, Nan Ripley, Maria Blanco, Melanie 
Eckhart, Lucy Zhu, Yunshue Shen and others for a memorable time at 237 Wilhelm. 
I would like to thank the late Professor Charles Wright, Dr. Srinivas Aluru, and Dr. Doug
Jacobson for being my committee members. 
I am thankful to my manager Robert Brooks and my colleagues at Hewlett Packard for
accommodating my extended leave of absence from work.
Thanks are due to my friend Rushi Bhatt for helping me with statistical studies. 
Last but not least, I am thankful to my family and my teachers for their support. 
¹ This work was performed at Ames Laboratory under Contract No. W-7405-Eng-82 with
the U.S. Department of Energy. The United States government has assigned the DOE Report
number IS-T 1953.
Abstract 
Advances in digital computers have been spectacular, but the machines have become increasingly
complex to model. Even cycle-accurate simulators, which are costly to develop and run, have
questionable accuracy. This thesis provides a simple, accurate, scientifically proven, analytic
model to predict the performance of real applications. The method creates two profiles as
functions of time or problem size. The first profile, the Hardware Signature, reveals computer
hardware speed and is obtained by running a universal benchmark, HINT, or by running an
analytical model, AHINT. The second profile, the Application Signature (APPMAP), divulges
intrinsic application requirements and can be obtained by four different methods outlined in the
thesis. The convolution of these two profiles is used to predict real application performance.
The model was tested using 25,000 performance measurements and was validated by determining
Pearson's correlation, Spearman's rank correlation, and maximum deviation from linearity.
Furthermore, through the Hardware Signature of the analytical models, one can obtain precise
answers to questions about the optimum size of memory and caches and the numerical precision
for a given clock rate.
CHAPTER 1 Introduction 
System analysts apply various techniques such as simulation, modeling, and analysis to
accurately predict the performance of an application on present and future machines. All of
these techniques trade accuracy against research time and cost. In most cases there is a race
against time to deliver a fairly accurate answer, because timely design decisions are required.
Benchmarks have been used successfully as indicators of system performance, but they are
widely known to be unsuccessful in predicting the performance of any real application on a
known system. Hence, they fail to comply with the first principle of performance analysis,
that is, to predict application performance.
This thesis proposes convolution techniques using hardware and software signatures to
accurately predict system performance in a relatively short time. Figure 1.1 gives an overview
of the technology. There are two basic components: the hardware signature is an
application-independent component that reveals machine performance, and the application
signature is a machine-independent component that reveals application characteristics. The
convolution of the two yields the application performance.
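As a rough illustration of this idea, the sketch below treats the hardware signature as a table of speeds at several memory sizes and the application signature as the amount of work the application performs at each of those sizes. The projected time is then the sum of work divided by speed over all sizes, much as a trip time is the sum of distance over speed for each road segment. This is only one simple reading of "convolution"; the memory points, speeds, work units, and the function name `project_time` are hypothetical, and the actual methods are those developed in Chapters 10 through 13.

```python
# Hypothetical sketch: combine a hardware signature with an application
# signature to project a run time. All numbers are made up for illustration.


def project_time(hardware_speed, application_work):
    """Sum work/speed over the memory sizes named in the application signature.

    hardware_speed:   {memory_bytes: speed in work units per second}
    application_work: {memory_bytes: work units the application performs there}
    """
    total = 0.0
    for memory, work in application_work.items():
        speed = hardware_speed[memory]  # KeyError if the machine was not measured there
        if speed <= 0:
            raise ValueError("speed must be positive")
        total += work / speed
    return total


# Machine-dependent, application-independent profile (assumed values).
hardware_signature = {
    32 * 1024: 8.0e6,
    1024 * 1024: 4.0e6,
    64 * 1024 * 1024: 1.0e6,
}

# Application-dependent, machine-independent profile (assumed values).
application_signature = {
    32 * 1024: 1.6e7,
    1024 * 1024: 8.0e6,
    64 * 1024 * 1024: 3.0e6,
}

projected = project_time(hardware_signature, application_signature)
print(f"projected time: {projected:.1f} s")
```

Keeping the two profiles in separate tables mirrors the separation described above: swapping in another machine's table leaves the application table untouched, and vice versa.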
The hardware signature encompasses hardware characteristics such as processor speed; the
number of cache regimes; the size, access time, line size, and miss penalty associated with
each cache level; and memory size, memory speed, and memory latency. In the case of parallel
systems these characteristics also include the number of processors, communication latency, and
message overhead. Hardware signatures can be obtained by running the universal benchmark
HINT [Gustafson and Snell, 1995a] on actual machines or can be derived for hypothetical
machines by using analytical HINT (AHINT) [Snell and Gustafson, 1996].
[Figure 1.1 Application Signature Performance Model: the HARDWARE SIGNATURE and the APPLICATION SIGNATURE feed the CONVOLUTION METHODS, which produce the APPLICATION PROJECTED TIME.]
The application signature encompasses software characteristics such as spatial and temporal
locality, problem size, and the data type of the application. This thesis presents various
techniques to obtain the application signature. These techniques differ in the speed and cost
of obtaining the information and in the accuracy of the projected results. In all the techniques
the application is treated as a black box. The information extracted about the application in
order to model its performance is known as the application signature or application map
(APPMAP).
In this thesis, the hardware signatures of state-of-the-art systems from SGI, CRAY,
Sun, Hewlett Packard, and IBM are presented, as well as the hardware signatures of some
commodity clusters of Alpha workstations, Apple G4, IBM workstations, and Intel
Pentium-Pro based machines. To demonstrate the developed models specifically, systems based
on SGI's MIPS R10000 [Yeager, 1996] and MIPS R12000 are used with varying processor clock
speeds and memory subsystems.
The application signature is collected by running a number of popular high-performance
workload benchmarks, such as the NPB and SPEC benchmarks, and real applications,
including computational chemistry applications like GAMESS, which is developed in-house at
Ames Laboratory. Various other known benchmarks such as Linpack (100 x 100), Linpack
(1000 x 1000), STREAM, and LMBENCH were also run. To demonstrate the performance
prediction techniques, the SPEC CPU benchmark suite was used, which has 28 combinations
of application binaries and inputs.
There are three important characteristics of benchmarks that are missing in most of the 
benchmark studies found in the literature. These are defined in [Gustafson and Todi, 1998] as 
follows: 
1. Correct ranking. For computers A and B, does benchmark(A) < benchmark(B)
imply application(A) < application(B)?
2. Correlation. Does a set of measurements benchmark(X_i) show correlation close to unity
with the set of measurements application(X_i)?
3. Linearity. Is there a proportionality constant k such that application(X) = k × benchmark(X)
within some percentage error, and what is that error?
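The three questions can be checked numerically. The sketch below is a minimal illustration that assumes both the benchmark and the application are reported as run times on the same set of machines; the sample numbers and helper names are hypothetical. Spearman's rank correlation addresses correct ranking, Pearson's correlation addresses correlation, and a least-squares fit through the origin gives k together with the worst relative error for linearity.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def linearity(bench, app):
    """Least-squares k for app = k * bench, plus the worst relative error."""
    k = sum(b * a for b, a in zip(bench, app)) / sum(b * b for b in bench)
    worst = max(abs(a - k * b) / a for b, a in zip(bench, app))
    return k, worst


# Hypothetical run times (seconds) on four machines.
benchmark_time = [10.0, 20.0, 30.0, 40.0]
application_time = [21.0, 39.0, 62.0, 80.0]

rho = spearman(benchmark_time, application_time)
r = pearson(benchmark_time, application_time)
k, worst_error = linearity(benchmark_time, application_time)
print(f"rank correlation {rho:.3f}, correlation {r:.3f}")
print(f"k = {k:.3f}, worst relative error {100 * worst_error:.1f}%")
```

A rank correlation of 1 means the benchmark orders every pair of machines the same way the application does, which is exactly the first question in the list.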
The objective of this thesis is to investigate these questions using HINT results from eight
different machines and using SPEC benchmarks as applications. A number of graphs, tables,
and plots from over 25,000 experiments are presented here to illustrate the methods used to
extract the application signature and the hardware signature and to demonstrate the use of the
convolution of the two signatures to predict real application performance. Standard statistical
measures such as Pearson's correlation, Spearman's rank correlation, and relative error are used
to validate the accuracy of each model.
To summarize, the objectives of the thesis are as follows:
1. To establish that the Application Signature exists.
2. To devise methods to obtain the application signature.
3. To convolve the application signature with the hardware signature to obtain the
application time.
4. To validate the model through statistical analyses such as correlation, rank correlation,
and relative error.
The thesis is organized as follows. Chapter 2 describes why computer benchmarks are
important. Chapter 3 discusses many known benchmarks and benchmark strategies.
Chapter 4 describes the problems associated with computer benchmarks. Chapter 5 briefly
discusses metrics for uniprocessors and multiprocessors, along with the characteristics
of a good metric. Chapter 6 covers the statistical background necessary for understanding
the results.
Chapter 7 describes the hardware signature, the HINT benchmark, in detail. This chapter
presents numerous graphs that reveal hardware characteristics. The HINT benchmark
deploys a variable-computation, variable-time benchmark strategy. It uses a realistic
performance metric, QUIPS (Quality Improvement per Second), and it is easily portable
from supercomputers to ultraportable computers. In addition, the benchmark can be run
at varying precisions.
Chapter 8 introduces the Application Signature by providing a few analogies from science 
and past work on the technology. In Chapter 9 basic notations and definitions required to 
understand the various models to obtain the Application Signature are described. 
Chapter 10 presents the first model for the Application Signature, using Instantaneous QUIPS 
and NetQUIPS. In Chapter 11 an improvement over the model in Chapter 10 is presented 
by applying the Newton QR line search method. In Chapter 12 a simple model based on 
cache misses is presented. Chapter 13 uses memory traces from the application and derives a simple 
application signature based on cache sensitivity. 
In Chapter 14 a few novel applications of HINT and Application Signature technology 
are discussed. Chapter 15 outlines future directions and describes suggestions to extend the 
Application Signature and HINT technology. 
CHAPTER 2 Benchmarks 
A benchmark is a standard of measurement or evaluation. Ever since the advent of 
the electronic computer, performance measurement has had a profound impact on driving 
technology. In this chapter, the different benchmark user groups are explained. At the end of the 
chapter, the uses of benchmarks are listed. 
2.1 Introduction 
There have been a number of benchmarks and metrics proposed over time to compare 
different computers. Although benchmarks have proven to be useful and have been successfully 
used for system evaluation, almost all of them have failed to make performance easy to 
comprehend. They indicate only part of system performance. Many benchmarks use 
only specific applications to indicate the performance and behavior of a whole class of applications. For 
example, the Linpack benchmark, a widely used benchmark that solves a dense system of linear 
equations, has been used to indicate the performance of scientific workloads in general. To this 
day, there is little or no consensus among vendors, researchers, and users about the right 
benchmarking techniques. 
Problem 1 Benchmarks are usually specific programs that project performance for a larger 
workload space. 
A computer system is difficult to evaluate as there are a number of factors that contribute 
to its performance. Some of these factors are due to the interaction of various software and 
hardware components present in the system: applications; system software (i.e. compilers, 
loaders, and operating systems); processor type, architecture, and the number of processors; 
memory subsystem; graphics subsystem; input-output subsystem; and run-time configurations. 
Can the complex interplay of hardware and software, and the performance variations 
due to each of these factors, be summarized by a single number? Intuition says that this is unrealistic, 
but unfortunately some popular benchmarks do so. While a single number gives an indication 
of performance, a benchmark alone fails to explain the causes of variation in application 
performance. For example, a few applications perform poorly on a system with a larger but slower 
secondary cache. A system designer faces many such tradeoffs, and they result 
in variation in performance. Can we understand all of the favorable and unfavorable 
tradeoffs made by a system designer through an application? 
Problem 2 Benchmarks unrealistically summarize system performance and complex system 
behavior by a single number. 
Problem 3 Benchmarks lack easy comprehension of performance tradeoffs as chosen by system 
designers and as seen by the user. 
In order to get a price-performance estimate we need to know the cost of the system. 
The cost of a product depends upon a number of factors such as manufacturing, research, 
and development. It is also largely dependent on the volume of product manufactured at 
one time. Some of the newer embedded or mobile chips are cutting cost by developing cost-
constrained processors. A two-bit branch prediction technique does not perform as well as 
a Branch Target Buffer (BTB) technique, but it provides enormous cost savings with lower 
die-area requirements. Another example of low-cost specialized processors are digital signal 
processors which avoid the redundant hardware found in general purpose processors. 
Because of the popularity of laptop and mobile computing, power conservation, battery 
life, form factor, and heat dissipation are key marketing points. Some of the newer types 
of benchmarks stress system power consumption. As the embedded systems market 
grows rapidly, demand is increasing for benchmarks that evaluate embedded 
systems. 
In short, a trillion-dollar computer industry is driven by benchmark results. A higher 
benchmark number implies a higher-priced system and lots of free publicity. Thus each company 
and research organization strives to achieve the best benchmark number, and 
benchmarks have a huge impact on future system design. Because many design decisions 
are usually restricted to a few benchmarks, benchmarks have to be as broad and as scalable 
as possible. 
Benchmarks help to uncover hidden bottlenecks and hotspots in the computer system and 
help to fine-tune high-level and low-level compiler designs. They usually capture part of a 
computer system's performance, but they leave many questions unanswered and sometimes 
give rise to misleading conclusions. Some of these questions and concerns can be thought of as 
the fundamental questions of benchmarking. They are listed as follows: 
• Benchmark X performs K times better on machine A than on machine B. Does my application 
Y perform α × K times better on machine A than on machine B? 
• Benchmark X scales K times more on machine A than on machine B. Does my application 
Y scale α × K times more on machine A than on machine B? 
2.2 User Groups of Benchmarks 
To better understand benchmarking, it is important to understand the benchmark users. 
There are several user groups that rely on the results of benchmarks. They are categorized into 
several main groups depending on the type of information they seek and kinds of workloads they 
require to evaluate the system. These user groups include system designers, system integrators, 
component designers, resellers, customers, performance evaluators, marketers, and capacity 
planners. Let us look at some of the users and how they use and influence the benchmarks. 
There are several companies such as AMD, Intel, Motorola, Hewlett Packard (HP), 
International Business Machines (IBM), Sun Microsystems (Sun), Silicon Graphics Inc. (SGI), 
and Transmeta that design and tune processors for their customers. For example, most Sun 
customers run internet applications on their servers for internet 
business. Hence, Sun's UltraSPARC family of processors is optimized using internet workload 
benchmarks. 
On the other hand, most Hewlett Packard and IBM customers are traditional banks, airlines, 
and warehouses. For these kinds of customers, the core application is the concurrent 
execution of multiple transaction types against a central database. These transactions span 
a breadth of complexity. For example, for a warehouse, transactions include entering and delivering 
orders, recording payments, checking or withdrawing funds, and monitoring the stock 
at the warehouses. Thus, the designers of HP's PA-RISC family of processors and IBM's POWER family of 
processors are concerned with optimizing for such transaction processing benchmarks. 
There are many specialized processor designers. Silicon Graphics (SGI) has, in the past, 
focused its processor development on scientific and graphics applications. Motorola and Texas 
Instruments (TI) have a majority of clients that need embedded digital signal processing. 
Hence, Motorola and TI use numerous digital signal processing workloads to evaluate their 
components. Transmeta has developed a new generation of x86-compatible, low-cost, energy-efficient 
processors that enable design flexibility and maximize performance-per-watt and performance-per-dollar metrics. Still 
another company, Broadcom, makes processors that handle wireless and network protocols. 
Observation 1 The real world applications define the workload benchmarks. 
Developers of other components such as chip sets, storage devices, and network switches 
also use customer-centric benchmarks to evaluate their products. Software components 
such as Microsoft's Windows operating system; database solutions from Oracle, Siebel, and 
Sybase; XML and web services; and network security products are also evaluated using a variety of 
benchmarks. 
In general, it is beneficial for a company to broaden its product coverage to maximize its 
profit as well as to serve a larger customer base. Even though each of the companies mentioned 
above excels in their niche areas, they do sell their products competitively in other segments 
dominated by their competitors. To do so, companies would like to have their technology, 
products, system, subsystems, and components out-perform competitors in their niche areas 
while also performing competitively on competitors' niche segments. Hence, a company tends 
to pick a variety of workload benchmarks from different application segments. In order to derive 
a single number, weights are assigned for each benchmark. Ideally, these weights reflect a typical 
customer's usage pattern for these kinds of applications. However, since it is hard to define 
what a typical customer's application mix is, and to highlight the performance dominance of 
the company in certain market segments and for certain applications, these weights often turn 
into magic numbers, which are rarely disclosed and continually changing. 
Observation 2 Due to a myriad of components, subsystems, and systems availability, the 
number of ways to assemble a computer is large with many configurations. It is a non-trivial 
problem to rank these combinations. Benchmarks are supposed to aid in this effort. 
Problem 4 A typical user pattern is evaluated in order to estimate processor performance. 
However, usage patterns vary from person to person. A custom-built user workload pattern 
estimator is desired for better performance estimation. 
Intel's Comparative Microprocessor (iComp) Index 2.0 is an example of a benchmark 
index that combines assorted benchmark suites with varying weights. The iComp index [Intel, ] was 
developed to compare 32-bit x86 processors (not systems). Intel's 32-bit x86 processors are 
developed to perform well in three basic segments: integer applications, floating-point applications, 
and internet and multimedia applications. To illustrate its processors' performance 
in those three areas, Intel developed its own index. The present version, iComp Index 2.0, 
is a weighted average of five industry 
standard benchmarks: CPUmark32, Norton SI-32, SPECint95, SPECfp95, and Intel's Media 
Benchmark. 
Table 2.1 summarizes the specific weights applied to each benchmark. Figure 2.1, taken 
from the iComp Index 2.0 report [Intel, ], illustrates the composition of iComp Index 2.0 as 
well as the respective weights of the multimedia components in Intel's Media Benchmark suite. For 
a benchmark category i, BM_i denotes the benchmark score and P_i the respective weight. It 
is interesting to see that a heavy weight of 40% is applied to traditional business applications 
such as Microsoft Office and Lotus Smart Suite, whereas a light weight of just 5% is applied 
to floating-point applications. The reasoning given in the iComp manual [Intel, ] is that the 
weights are based on the usage patterns of Intel's customers. Since Intel's processors are sold 
more to traditional business users than for scientific purposes, traditional business applications 
are weighted more heavily than floating-point applications. Therefore, a customer who buys an 
Intel-based system for scientific applications and bases the purchase on iComp Index 
2.0 might be misled by its valuation. A much more realistic approach for advanced 
users would be to apply customized weights instead of P_i (as shown in Equation 2.1), based on 
their usage patterns, to study the relative performance of processors or systems. 
Table 2.1 Intel's iCOMP 2.0 Benchmark Composition 
i | Category | Benchmark Score (BM_i) | Weight (P_i) 
1 | Traditional Benchmarks | CPUMark32 | 40% 
2 | High-End | NSI32 | 15% 
3 | General Purpose Integer | SPECbase_int95 | 20% 
4 | General Purpose Floating-Point | SPECbase_fp95 | 5% 
5 | General Multimedia, Communications and Visualization | Intel Media Benchmark | 20% 
The formula to compute iComp Index 2.0 is given in Equation 2.1. iComp Index 2.0 is 
computed as the weighted geometric mean of a processor's relative performance 
in each of the categories compared to a known base processor. In iComp Index 2.0 the base 
processor is Intel's Pentium processor running at 120 MHz. 
\[ \mathrm{iComp\ Index\ 2.0} = 100 \times \prod_{i=1}^{5} \left( \frac{BM_i}{\mathit{Base\_BM}_i} \right)^{P_i} \qquad (2.1) \] 
Here Base_BM_i denotes the score of the base processor on benchmark i. 
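As a rough illustration of Equation 2.1, the following sketch computes a weighted geometric mean index using the weights of Table 2.1. Only the weights come from the table; the benchmark scores, the base scores, the dictionary keys, and the function name are hypothetical. The second call applies a customized, floating-point-heavy weighting of the kind an advanced user might choose.

```python
import math

# Weights P_i from Table 2.1 (they sum to 1).
ICOMP_WEIGHTS = {
    "CPUmark32": 0.40,
    "NSI32": 0.15,
    "SPECint95": 0.20,
    "SPECfp95": 0.05,
    "IntelMedia": 0.20,
}


def weighted_index(scores, base_scores, weights, scale=100.0):
    """scale * product over i of (score_i / base_i) ** weight_i."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    log_sum = sum(w * math.log(scores[name] / base_scores[name])
                  for name, w in weights.items())
    return scale * math.exp(log_sum)


# Hypothetical scores: the processor is 2x the base processor everywhere
# except floating point, where it is 4x the base processor.
base = {name: 1.0 for name in ICOMP_WEIGHTS}
cpu = {name: 2.0 for name in ICOMP_WEIGHTS}
cpu["SPECfp95"] = 4.0

default_index = weighted_index(cpu, base, ICOMP_WEIGHTS)

# Hypothetical weights for a user who mostly runs floating-point codes.
fp_heavy = {"CPUmark32": 0.10, "NSI32": 0.10, "SPECint95": 0.20,
            "SPECfp95": 0.50, "IntelMedia": 0.10}
custom_index = weighted_index(cpu, base, fp_heavy)
print(f"default: {default_index:.1f}, floating-point heavy: {custom_index:.1f}")
```

With the default weights the floating-point advantage barely moves the index, while the customized weights make it dominant, which is the sensitivity to weighting discussed above.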
Many of the component designers, such as IBM, Hewlett Packard, SGI, and Sun Microsystems, 
also develop their own systems. Such companies are known as system designers. System 
designers build systems with different configurations using mainly their own components. Their 
workloads, as mentioned earlier, primarily target their own customer base but also target 
those of their competitors. They prefer to use their own components, software or 
hardware, to maximize their profits. To manufacture their products, they benchmark 
different configurations with varying systems, sub-systems, and components to identify the 
system yielding the best price-performance1 numbers. They also do competitive benchmarking to 
understand where they stand in each market segment. System designers have leverage over 
system integrators (discussed below) in the sense that they have some say over the design of 
the components used in their systems. 
Figure 2.1 iCOMP Index 2.0 Weightings 
There is another user group of benchmarks that functions primarily as system integrators. 
This group includes companies like Dell, Apple, Gateway, Hewlett Packard, and several 
white-box vendors. The system integrators evaluate different components and build the best 
price-performance systems targeted at a number of segments of the computing industry. The 
majority of their clients are cost conscious and use general purpose applications. Unlike system 
designers, the system integrators are usually not tied to specific components, as they do not 
develop them. Also, the system integrators usually use components that are well-established 
standards in the industry and are already mass produced. The choice of components usually 
explodes the number of configurations they need to evaluate. The choices usually include processor 
type and the number of processors; operating system (OS), such as Linux variations, 
proprietary UNIX flavors (in a few cases) like HP-UX and Solaris, or Windows-based OS variations; 
type and size of cache and memory; memory bus bandwidth; type and size of hard disks; 
number of input-output ports; and so on. Thus, system integrators use a variety of benchmarks 
and cost considerations to decide their offerings. 
1 Price-performance is the amount of dollars spent to achieve a unit of performance. 
System integrators offer several pre-configured packages and custom-made packages. A 
pre-configured package usually has popular components and is categorized for business, home, 
home/office, or power users2. On the other hand, custom-made packages usually start with a 
bare-minimum pre-configured package with a choice of many possible components. A customer 
can add or remove a component based on his liking or budget. Because of this, one can 
purchase additional X amount of main memory by paying Y amount of dollars. Thus, custom 
configuration provides flexibility to customers in terms of price and components choice. A 
customer will immediately know the cost of a new configuration. Unfortunately, as of today, the 
customer will not know what spending extra Y number of dollars or buying extra X amount of 
main memory means in terms of performance. He would also not know how much performance 
improvement he can expect to see for his favorite applications such as Internet Explorer or 
Netscape for the extra dollars he would be paying. Since benchmarks so far lack predictive 
value, system integrators are unable to project performance figures for their customers. 
Problem 5 Benchmarks lack performance prediction for other applications. You can only get 
performance prediction for your favorite applications by actually running the applications on 
the real system or the system simulator. 
Benchmarks have long been used to sell computers. That is why benchmarking 
is often called bench-marketing. Companies often highlight only the portion of the benchmark 
that yields the best performance. If the computer has a high clock speed and a low memory 
bandwidth, only the applications that fit in the cache, and thus benefit from the high 
clock speed, will be shown in the marketing brochure. The marketing brochure often shows results 
that use highly optimized benchmarks or that are derived from specially configured 
systems. The highly optimized systems and benchmarks may show good benchmark results 
and yield the best system costs, but they may not be good for the real applications a buyer might 
use the system for. Sometimes the system will be benchmarked (and sold) with the minimum 
memory that fits the benchmark well. 
2 The video gaming community is an example of power users. 
Problem 6 Benchmarking is often called bench-marketing as its results are used to sell the 
product. This is the single largest factor corrupting the benchmarking practice, as benchmarks 
that expose the problems in a system are never shown to the actual buyers. 
The Standard Performance Evaluation Corporation (SPEC), responsible for establishing and 
maintaining the SPEC CPU benchmarks, drops certain benchmarks from its standard SPEC suite 
if one of its founding vendors is not willing to support the benchmark for performance reasons 
[Henning, 2000a]. 
The largest user group of benchmarks is the actual buyers. There are two main categories 
of such users. First, there are the masses, who are often the target of bench-marketing 
and are probably not so technology savvy. Second, there are well-informed advanced users and 
technophiles. Well-informed users, especially buyers of high-end servers or bulk purchasers, 
want to know whether they are getting the best price-performance3. Then there are capacity 
planners, who monitor resource usage and plan upgrade cycles for their companies. Such 
users have basic questions like "How much performance improvement should I expect if I 
replace a 500 MHz Intel Pentium III processor with a 1 GHz Intel Pentium III? Is it worth the 
cost?" A similar question from an end user might be "Should I add an extra 128 megabytes of memory to all 
the workstations in my office? Will our users' spreadsheet applications benefit from it? If so, 
what is the δ performance improvement?" 
In short, benchmarks are widely used. Benchmarks mean different things to different user 
groups. 
3 There are other factors besides performance, such as total cost of ownership (TCO), that influence the actual 
buying decision. 
2.3 Usefulness of Benchmarking 
To summarize, benchmarking can be helpful in revealing system or component behavior. 
The following are a few benefits of benchmarking: 
1. Benchmarking helps in ranking different computers based on varying hardware architecture. 
2. Benchmarking helps in ranking different configurations of the same computer. 
3. Benchmarking can provide interesting insight into the computer system which would lead 
to better future systems or better configurations. Benchmarks are used by the system 
designers to design future systems and they are also used by informed customers to make 
buying decisions. 
4. Benchmarking models What-If scenarios. Some extrapolations from present systems can 
give answers for what would happen if one upgraded the CPU or added more memory, 
etc. Also, modeling of the performance and validation of the results are tasks worth 
undertaking. 
5. System properties can be related with the benchmark results. For example, the HINT 
benchmark clearly shows the cache hierarchy, the LMBENCH benchmark shows memory 
latency, and the STREAM benchmark shows memory bus bandwidth. 
6. Benchmarking can sometimes be useful in comparing similar systems, especially in homogeneous 
clusters. This helps in isolating abnormal behavior of one node. 
7. Benchmarking can be helpful in tuning real-world applications. For example, it is sometimes 
extremely difficult to simulate all aspects of a real application, such as a transaction 
processing system, in the laboratory. So selecting a smaller but representative workload, 
such as a TPC benchmark, is usually helpful in fine-tuning all aspects of the system (compilers, 
operating systems, processor tunables, software, etc.). 
CHAPTER 3 Benchmarks Classification 
In this chapter, the classification of benchmarks is discussed. Benchmarks are traditionally 
classified by their usage: what is the purpose of the benchmark? Also, for over a decade researchers 
have classified benchmarks by whether the amount of computation or the computation time is fixed. Is a 
benchmark bounded by some fixed task? Does a benchmark have some fixed time to do 
a certain task? Selected benchmarks are discussed in this chapter. 
3.1 Classification Based on Usage 
The following kinds of benchmarks are distinguished by usage. 
1. Synthetic Benchmarks These are artificial benchmarks that do no useful work. They 
are used to study instruction mix. The Gibson mix is a good example of a synthetic benchmark. 
2. Kernel Benchmarks These are benchmarks that extract the essential routines of some 
real applications. The NAS benchmarks are a good example of kernel benchmarks. 
3. Workload Benchmarks These are sets of benchmarks taken from a broad set of applications 
and generally designed for a broad range of users. The SPEC benchmarks are a good example 
of workload benchmarks. 
4. Microbenchmarks These are small benchmarks specially designed to test certain subsystems. 
STREAM and LMBENCH are good examples of microbenchmarks. 
5. Real Applications Often real applications are used as benchmarks. The computational 
quantum chemistry application GAMESS is one example of such a benchmark. 
Each of the above benchmark classes can be further classified based on the usage of the 
benchmarks. The sub-categories are as follows. 
1. CPU Benchmarks 
2. Multimedia Benchmarks 
3. Network Benchmarks 
4. Input Output Benchmarks 
3.2 Benchmark Strategy 
There are three main types of benchmark strategies [Lilja, 2002]. 
1. Fixed Computation 
2. Fixed Time 
3. Variable Computation Variable Time 
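The three strategies differ in what is held fixed and what is reported. The sketch below contrasts them on a toy workload. It is an illustration under stated assumptions: the function names, the simulated clock, and the quality measure are hypothetical, and the clock is passed in as a parameter so that the example is deterministic. In the third strategy neither the work nor the time is fixed in advance, and the result is the rate at which the quality of the answer improves, which is the idea behind the QUIPS metric of HINT; a real run would continue refining until resources are exhausted rather than for a preset number of steps.

```python
import time


def fixed_computation(work_unit, n_units, clock=time.perf_counter):
    """Fixed computation: do a set amount of work, report the elapsed time."""
    start = clock()
    for _ in range(n_units):
        work_unit()
    return clock() - start


def fixed_time(work_unit, budget_seconds, clock=time.perf_counter):
    """Fixed time: run for a set time budget, report how much work was done."""
    start = clock()
    done = 0
    while clock() - start < budget_seconds:
        work_unit()
        done += 1
    return done


def variable_computation_variable_time(refine, steps, clock=time.perf_counter):
    """Report quality per second after each refinement step.

    refine() performs one refinement and returns the current answer quality.
    """
    start = clock()
    rates = []
    for _ in range(steps):
        quality = refine()
        elapsed = clock() - start
        if elapsed > 0:
            rates.append(quality / elapsed)
    return rates


class SimulatedClock:
    """Deterministic stand-in for a timer; each work unit advances it."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


# Each simulated work unit costs 0.5 seconds.
sim = SimulatedClock()
elapsed = fixed_computation(lambda: sim.advance(0.5), 4, clock=sim)

sim = SimulatedClock()
units_done = fixed_time(lambda: sim.advance(0.5), 2.0, clock=sim)

sim = SimulatedClock()
quality = [0.0]


def refine():
    sim.advance(0.5)
    quality[0] += 1.0
    return quality[0]


rates = variable_computation_variable_time(refine, 3, clock=sim)
print(elapsed, units_done, rates)
```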
3.3 Narrow versus Broad-spectrum Benchmark 
Depending on the granularity of the benchmark focus, benchmarks can be divided into 
two types: narrow benchmarks and broad-spectrum benchmarks. 
Definition 1 A Narrow Benchmark is a benchmark that indicates performance due to a single 
factor or summarizes many system variables into a single number. 
Definition 2 A Broad-Spectrum Benchmark covers a wide range of performance. It indicates 
performance due to more than one factor and presents results over a range of such factors. 
Depending upon the task at hand, either a narrow or a broad-spectrum benchmark can be useful. 
Narrow benchmarks resemble microbenchmarks, whereas broad-spectrum benchmarks are 
more like system benchmarks. HINT is an example of a broad-spectrum benchmark. STREAM 
can be regarded as a narrow benchmark. Appendix G provides results for the STREAM benchmark. 
The interesting aspect is that we converted STREAM from a narrow benchmark into a 
broad-spectrum benchmark in which one can observe the effective bandwidth at different levels of 
the cache hierarchy. 
3.4 Benchmark Examples 
3.4.1 Peak Performance 
Peak performance is defined as the maximum number of floating-point operations that can be 
executed in parallel in one clock step, multiplied by the clock speed. This is clearly not a good performance 
measure, as sustained performance is far below peak performance. 
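The definition amounts to a single multiplication, and dividing a system's price by its peak rate gives the price/peak-performance ratio reported in Table 3.1. The sketch below is illustrative only: the function names are hypothetical, the optional processor count is an added assumption, the processor with 2 operations per cycle at 500 MHz is made up, and the three (price, peak) pairs are rows copied from Table 3.1.

```python
def peak_gflops(flops_per_cycle, clock_hz, processors=1):
    """Peak rate = floating-point operations per clock step x clock speed (x processors)."""
    return flops_per_cycle * clock_hz * processors / 1e9


def price_per_peak_gflops(price_k_dollars, peak):
    """Thousands of dollars paid per peak GFLOPS."""
    return price_k_dollars / peak


# Hypothetical processor: 2 floating-point operations per cycle at 500 MHz.
single = peak_gflops(2, 500e6)
eight_way = peak_gflops(2, 500e6, processors=8)

# (price in $K, peak GFLOPS) for three rows of Table 3.1.
rows = {
    "SGI 2100 (8 CPUs)": (92, 5.6),
    "SGI 2400 (64 CPUs)": (2201, 51),
    "SGI 2800 (256 CPUs)": (10133, 205),
}
ratios = {name: round(price_per_peak_gflops(price, peak))
          for name, (price, peak) in rows.items()}
print(single, eight_way, ratios)
```

The rounded ratios reproduce the last column of the table for these rows, and they differ by roughly a factor of three across the three systems.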
Table 3.1, taken from an IDC report, shows the price-performance of some 
high performance computers. The prices are as of circa 2000. One thing to observe is 
that peak performance is not linear in the price of the machines. 
Table 3.1 High Performance Computers (circa 2000) - Sorted by Price 
Computer System | Price ($K) | Peak Performance (GFLOPS) | Price/Peak Performance ($K/GFLOPS) 
IBM RS/6000 44P Model 270 (4 CPUs) | [illegible] | 6 | $6 
SGI 2100 (8 CPUs) | $92 | 5.6 | $16 
SUN 4-way cluster of E420Rs. Each node has 4 = 40 CPUs | $373 | 36 | $10 
IBM RS/6000 SP short frame, 6X 375 MHz POWER3 SMP nodes (each with 4 CPUs) | $465 | 36 | $13 
HP 9000 N-Class Server including HP-UX PA-8500 360 MHz CPU 512 MB SDRAM | $67 | 35 | $13 
SGI 01200 (116 CPUs) | $494 | 81 | $6 
HP 9000 L-Class Server PA-8500 360 MHz | $500 | 63 | $8 
Compaq 84 node AlphaServer DS10 Beowulf cluster | $500 | 78 | $6 
Compaq AlphaServer HPC 320 (using EV67 Alpha chip) | $832 | 43 | $19 
Compaq 168 node AlphaServer DS10 Beowulf cluster | $1,000 | 157 | $6 
SGI 01200 (256 CPUs) | $1,079 | 179 | $6 
HP 9000 L-Class Server PA-8500 360 MHz CPU | $1,090 | 144 | $8 
IBM RS/6000 SP tall frame, 15x 375 MHz POWER3 SMP nodes (each with 4 CPUs) | $1,098 | 90 | $12 
HP 9000 N-Class Server PA-8500 360 MHz CPU | $1,190 | 92 | $13 
Sun Enterprise 10000 (64 CPUs @ 800 MF) | $2,140 | 51 | $42 
SGI 2400 (64 CPUs) | $2,201 | 51 | $43 
SGI 2800 (256 CPUs) | $10,133 | 205 | $49 
3.4.2 Linpack 
Linpack is a linear algebra library routine for solving a general dense system of equations with 
partial pivoting [Dongarra et al., 1979], [Dongarra, 1984], [Dongarra, 1987], [Dongarra, 1992], 
[Dongarra and Gentzsch, 1993]. It has countless variations, but survives mainly in its original 
100 by 100 size, a 1000 by 1000 size, and a scaled version. The Linpack benchmark has a fixed 
set of rules; for example, no optimization or code tweaking is permitted. In its original form the 
data structure size of Linpack is less than 0.5 megabytes. This fits into the primary cache of any 
modern computer. Hence, most of the time it is not representative of memory-bound problems. 
3.4.3 STREAM 
The STREAM benchmark is a synthetic benchmark program developed by John McCalpin 
[McCalpin, 1995], [McCalpin, b], [McCalpin, c], [McCalpin, a]. It is one of the first benchmarks 
to decouple measurement of the memory bandwidth from the peak CPU performance of the 
system. It emphasizes the growing gap between CPU speed and memory module speed. 
It is written in standard Fortran 77 (with a corresponding version in C). It measures the 
performance of four long vector operations. These operations are COPY, SCALE, SUM, and 
TRIAD. They are explained in Table 3.2. 
Table 3.2 The STREAM Benchmark Operations 
Name | Kernel | Bytes per Iteration | FLOPS per Iteration 
COPY | a(i) = b(i) | 16 | 0 
SCALE | a(i) = q*b(i) | 16 | 1 
SUM | a(i) = b(i) + c(i) | 24 | 1 
TRIAD | a(i) = b(i) + q*c(i) | 24 | 2 
These basic operations are used individually in the inner loop of the four long vector operations. 
The array sizes are defined so that each array is at least four times larger than the 
last level of cache of the machine to be tested. Data re-use is not possible in the code. The 
code can be ported easily to uniprocessors, vector processors, shared-memory multiprocessors, and 
distributed-memory computers. 
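The four kernels and the byte counts of Table 3.2 can be sketched as follows. This is not the STREAM source code, which is written in Fortran 77 and C; it is a plain Python illustration with hypothetical function names, and the interpreter overhead means that the rates it prints say little about the memory system. The array length is kept small here, whereas a real run would size each array at least four times larger than the last level of cache.

```python
import time
from array import array

# Bytes moved per iteration for 8-byte elements, as in Table 3.2.
BYTES_PER_ITERATION = {"COPY": 16, "SCALE": 16, "SUM": 24, "TRIAD": 24}


def run_kernel(name, a, b, c, q):
    """Apply one STREAM-style kernel, storing the result in array a."""
    n = len(a)
    if name == "COPY":
        for i in range(n):
            a[i] = b[i]
    elif name == "SCALE":
        for i in range(n):
            a[i] = q * b[i]
    elif name == "SUM":
        for i in range(n):
            a[i] = b[i] + c[i]
    elif name == "TRIAD":
        for i in range(n):
            a[i] = b[i] + q * c[i]
    else:
        raise ValueError(f"unknown kernel {name!r}")


def measure(name, n=100_000, q=3.0):
    """Return (megabytes per second, result array) for one kernel."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    run_kernel(name, a, b, c, q)
    elapsed = time.perf_counter() - start
    return BYTES_PER_ITERATION[name] * n / elapsed / 1e6, a


for kernel in BYTES_PER_ITERATION:
    rate, _ = measure(kernel)
    print(f"{kernel:6s} {rate:10.1f} MB/s")
```

Each kernel touches every element exactly once, so no data is re-used, and the reported rate is simply the bytes moved divided by the elapsed time.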
STREAM is a small collection of very simple loop operations. It tries to estimate the total 
rate at which all addressable memory spaces can deliver data to their respective processors, 
unfettered by any other operation. For the "peak" measure, one simply multiplies the width of 
the bus in bytes by its maximum repetition rate. Effects such as the need for memory refresh, 
input/output interrupts, or other burdens on the memory bus, are ignored. 
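The "peak" figure is again a simple product. A minimal sketch follows, with hypothetical function names and a made-up bus (8 bytes wide, 100 million transfers per second) used only as an example; the second function expresses a measured rate, such as one reported by STREAM, as a fraction of that peak.

```python
def peak_bandwidth_bytes_per_s(bus_width_bytes, transfers_per_second):
    """Peak memory bandwidth = bus width in bytes x maximum repetition rate."""
    return bus_width_bytes * transfers_per_second


def sustained_fraction(measured_bytes_per_s, bus_width_bytes, transfers_per_second):
    """Fraction of the peak figure that a measured rate achieves."""
    return measured_bytes_per_s / peak_bandwidth_bytes_per_s(
        bus_width_bytes, transfers_per_second)


# Hypothetical 64-bit (8-byte) bus at 100 million transfers per second.
peak = peak_bandwidth_bytes_per_s(8, 100_000_000)
fraction = sustained_fraction(400_000_000, 8, 100_000_000)
print(peak, fraction)
```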
At the time of writing, STREAM2 is being developed by the author of the STREAM benchmark. 
STREAM2 addresses two basic issues. 
1. It distinguishes read and write performance. 
2. It measures sustained bandwidth at all levels of the memory hierarchy. 
Table 3.3 lists the kernel operations of STREAM2. 
Table 3.3 The STREAM2 Benchmark Operations 
Name | Kernel | Bytes Read per Iteration | Bytes Written per Iteration | FLOPS per Iteration 
FILL | a(i) = q | 0 | 8 | 0 
COPY | a(i) = b(i) | 8 | 8 | 0 
DAXPY | a(i) = a(i) + q*b(i) | 16 | 8 | 2 
SUM | sum = sum + a(i) | 8 | 0 | 1 
3.4.4 SPEC CPU95 
SPEC CPU95 (released in 1995) and SPEC CPU 2000 (released in 2000) [SPEC, 2003], 
[Reilly, 1996] are among the most widely used benchmarks based on compute-intensive workloads. 
The Standard Performance Evaluation Corporation (SPEC) is a non-profit organization 
primarily comprising computer vendors, systems integrators, universities, and research organizations 
from all over the world. 
CPU95 is a component-level or application-level benchmark as opposed to a system-level benchmark. 
The primary objective of this benchmark is to evaluate the performance of the processor, 
the memory architecture, and the compiler. Its applications are subdivided into two basic groups: 
integer-based and floating-point-based applications. 
The integer-based applications are called CINT95. They are a group of eight applications written in C, as listed 
in Table 3.4. 
Table 3.4 SPEC CPU 95 Integer Applications: CINT95 
Benchmark Name | Description 
099.go | Artificial intelligence; plays the game of "Go" 
124.m88ksim | Moto 88K chip simulator; runs test program 
126.gcc | New version of GCC; builds SPARC code 
129.compress | Compresses and decompresses file in memory 
130.li | LISP interpreter 
132.ijpeg | Graphic compression and decompression 
134.perl | Manipulates strings (anagrams) and prime numbers in Perl 
147.vortex | A database program 
The floating-point-based applications are called CFP95. They are a group of ten applications written in Fortran 77, as 
listed in Table 3.5. 
Table 3.5 SPEC CPU 95 Floating Point Applications: CFP95 
Benchmark Name | Description 
101.tomcatv | A mesh-generation program 
102.swim | Shallow water model with 513 x 513 grid 
103.su2cor | Quantum physics; Monte Carlo simulation 
104.hydro2d | Astrophysics; hydrodynamical Navier Stokes equations 
107.mgrid | Multi-grid solver in 3D potential field 
110.applu | Parabolic/elliptic partial differential equations 
125.turb3d | Simulates isotropic, homogeneous turbulence in a cube 
141.apsi | Solves problems regarding temperature, wind, velocity and distribution of pollutants 
145.fpppp | Quantum chemistry 
146.wave5 | Plasma physics; electromagnetic particle simulation 
According to SPEC [SPEC, b], the criteria for choosing applications for the CPU 
95 suites are as follows. The basic idea is that the emphasis is on CPU-related activity rather than 
on disk or any other system component. 
• The benchmark is portable to all SPEC-supported hardware (32-bit and 64-bit) and 
operating systems (UNIX flavors, Microsoft NT, and VMS). 
• The benchmark should not measure input-output. 
• The benchmark should not include networking or graphics. 
• The benchmark should not include more than 5% of activity other than the 
SPEC-supplied software. 
• The benchmark should run in 64 MB RAM without swapping. 
Prior to CPU 95, the SPEC CPU benchmarks included CPU 92 and CPU 89. The 
primary reason for introducing newer benchmark suites is that technology is continuously 
improving and older applications are not representative of the current or future workloads for 
which the machine being benchmarked is intended. 
The rules that define SPEC benchmarks, such as the requirement above that a benchmark in the CPU 
95 suites should run within 64 MB without swapping, prevent the present workload from scaling to future 
machines. The benchmarks need to scale with newer computer systems in order 
to be realistic and useful. Here are some of the concerns. 
• Older benchmarks generally fit in the primary cache of newer machines, so they
do not exercise the memory bus in the way that is typical of current applications. They are
also not representative of current or future applications, which have grown in complexity and size.
• Older benchmarks take less than a minute on state-of-the-art systems. Measurements of
such benchmarks are usually not stable, because small changes due to compiler optimization,
system improvements, or fluctuations in the system have a significant impact on the
percentage improvements.
• Newer applications such as imaging and databases were not well represented in past
suites, so newer benchmark suites increase variety and representation within the suites.
• Newer coding styles and languages are not well represented in past benchmarks. For example,
SPEC's CINT2000 benchmarks include a ray-tracing application, 252.eon, written
in C++. No benchmark in earlier SPEC suites was coded in C++.
However, due to the popularity of the C++ language in recent years, and in an effort to
benchmark the performance of C++ compilers, SPEC decided to include the eon benchmark.
CPU95 reports the following types of measurements [SPEC, b].
• Base versus non-base measurement: There are a number of compiler optimization options.
A base measurement requires that users use the same flags in the same order for all benchmarks.
This measurement is meant to reflect what most users usually do. For example, on
SGI IRIX 6.5, the native cc compiler has an '-Ofast' option to maximize performance
for a given target platform; it switches on all the best-performing compiler optimization
flags. Base measurements usually use such options, which are good for most typical
applications and are generally used by users. Non-base measurement, on the other hand,
is less restrictive. It allows different optimizations to be used on different
benchmarks. This is generally useful for vendors who want to report the best performance.
• Rate versus non-rate metrics:
A rate metric measures the number of simultaneous tasks a computer can accomplish in a
certain amount of time. This is called a throughput, capacity, or rate measure. A non-rate
metric measures the performance of running a single task.
The metrics reported by CINT95 are as follows [SPEC, b].
• SPECint95: The geometric mean of eight normalized ratios (one for each integer benchmark)
when each benchmark is compiled with the best (most aggressive) optimization.
• SPECint_base95: The geometric mean of eight normalized ratios when each
benchmark is compiled with typical optimization.
• SPECint_rate95: The geometric mean of eight normalized throughput ratios when
each benchmark is compiled with the best (most aggressive) optimization.
• SPECint_rate_base95: The geometric mean of eight normalized throughput ratios
when each benchmark is compiled with typical optimization.
Similarly, the metrics reported by CFP95 are as follows [SPEC, b].
• SPECfp95: The geometric mean of ten normalized ratios (one for each floating-point
benchmark) when each benchmark is compiled with the best (most aggressive) optimization.
• SPECfp_base95: The geometric mean of ten normalized ratios when each benchmark
is compiled with typical optimization.
• SPECfp_rate95: The geometric mean of ten normalized throughput ratios when each
benchmark is compiled with the best (most aggressive) optimization.
• SPECfp_rate_base95: The geometric mean of ten normalized throughput ratios when
each benchmark is compiled with typical optimization.
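The metrics listed above are all geometric means of normalized ratios. A minimal sketch of that
calculation follows. It assumes that each ratio is a reference run time divided by the measured
run time, and it uses three made-up run times rather than real SPEC data; the function names
are illustrative only.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_ratios(reference_times, measured_times):
    """Reference run time divided by measured run time, one ratio per benchmark."""
    return [ref / t for ref, t in zip(reference_times, measured_times)]


# Hypothetical run times in seconds for three benchmarks (not SPEC data).
reference_times = [1000.0, 2000.0, 4000.0]
measured_times = [250.0, 500.0, 4000.0]

ratios = normalized_ratios(reference_times, measured_times)
score = geometric_mean(ratios)
print(ratios, round(score, 3))
```

Because the geometric mean multiplies the ratios, one benchmark that runs no faster than the
reference (ratio 1.0) pulls the overall score down without dominating it.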
In the next section, SPEC CPU 2000 benchmarks will be discussed. 
3.4.5 SPEC CPU2000 
CPU2000, which superseded CPU95, is a suite of compute-intensive workloads from SPEC. The
criteria for selecting workloads and the metrics of CPU2000 are very similar to those of SPEC
CPU95. It has two major components: CINT2000 for integer applications and CFP2000 for
floating point applications.
CINT2000 comprises 11 applications written in C and one application written in C++
(252.eon). The applications are described in Table 3.6.
Table 3.6 SPEC CPU 2000 Integer Applications: CINT2000
Benchmark Name   Description
164.gzip         Data compression utility
175.vpr          FPGA circuit placement and routing
176.gcc          C compiler
181.mcf          Minimum cost network flow solver
186.crafty       Chess program
197.parser       Natural language processing
252.eon          Ray tracing
253.perlbmk      Perl
254.gap          Computational group theory
255.vortex       Object-oriented database
256.bzip2        Data compression utility
300.twolf        Place and route simulator
Floating-point applications are called CFP2000. CFP2000 comprises six applications written
in FORTRAN77, four applications written in FORTRAN90, and four applications written in
C. The applications are described in Table 3.7.
3.4.6 SPEC CPU2004 
At the time of writing this thesis, the SPEC committee is looking for applications for its CPU
2004 suite. The criteria for the future CPU benchmark suite are as follows [SPEC, 2003].
• The program can be made compute bound.
Table 3.7 SPEC CPU 2000 Floating Point Applications: CFP2000
Benchmark Name   Description
101.tomcatv      A mesh-generation program
168.wupwise      Quantum chromodynamics
171.swim         Shallow water modeling
172.mgrid        Multi-grid solver in 3D potential field
173.applu        Parabolic/elliptic partial differential equations
177.mesa         3D graphics library
178.galgel       Fluid dynamics: analysis of oscillatory instability
179.art          Neural network simulation: adaptive resonance theory
183.equake       Finite element simulation: earthquake modeling
187.facerec      Computer vision: recognizes faces
188.ammp         Computational chemistry
189.lucas        Number theory: primality testing
191.fma3d        Finite-element crash simulation
200.sixtrack     Particle accelerator model
301.apsi         Solves problems regarding temperature,
                 wind, distribution of pollutant
• The program can easily be made portable across different hardware architectures and
operating systems.
• The program represents the state of the art in its field.
How does one interpret hardware advancement over time using SPEC's benchmark suites
(CPU '89, CPU '92, CPU '95, CPU '00, CPU '03), which were introduced at different times?
Comparing SPEC benchmark suites introduced over time becomes even more complex when the
suites contain different applications and there is no common baseline to compare against.
3.4.7 NPB Benchmarks 
The NASA Ames Research Center (NAS) Parallel Benchmarks (NPB) [Bailey et al., 1991a],
[Bailey et al., 1991b] are a set of eight kernel programs used to evaluate the performance of
parallel computers. The kernel codes are derived from computational fluid dynamics.
Unfortunately, the authors have stopped supporting the benchmarks, primarily because of the
time and cost involved in supporting them.
There are primarily three kinds of NPB.
Table 3.8 NAS Parallel Benchmarks Problems and their Sizes
Benchmark Code                  Class A        Class B        Class C
Embarrassingly parallel (EP)    2^28           2^30           2^32
Multigrid (MG)                  256^3          256^3          512^3
Conjugate gradient (CG)         1.4 x 10^4     7.5 x 10^4     1.5 x 10^5
3-D FFT PDE (FT)                256^2 x 128    512 x 256^2    512^3
Integer sort (IS)               2^23           2^25           2^27
LU solver (LU)                  64^3           102^3          162^3
Pentadiagonal solver (SP)       64^3           102^3          162^3
Block tridiagonal solver (BT)   64^3           102^3          162^3
1. NPB 1
These are the "pencil and paper" benchmarks. The problems are well defined, and developers
and vendors of parallel computers are given the flexibility to choose the programming
language, algorithms, and hardware. The results are verified by NAS. This
approach helps the benchmarking community in many ways. It helps
manufacturers try out their best and sometimes innovative approaches to get performance. It
helps NASA in particular to readily apply those techniques in its core CFD applications
without devoting much research effort to them.
2. NPB 2
These are Message Passing Interface (MPI) [Gropp et al., 1999a], [Gropp et al., 1999b]
based implementations of the above benchmarks. Since MPI is generally available on most
parallel computers, it is easy to port NPB 2 with little effort.
3. NPB 2-serial
These are serialized versions of the NPB benchmarks. They are intended for benchmarking
workstations and PCs. They are also a testbed for parallelization tools.
A typical characteristic of the NAS benchmarks is that they come in different classes, which
refer to different problem sizes. The names of the kernels and the different classes are
summarized in Table 3.8.
3.4.8 SPLASH Benchmarks 
Stanford Parallel Applications for Shared Memory (SPLASH) has the following characteristics.
1. It has a combination of kernel benchmarks and real scientific workloads.
2. The codes are already parallelized, covering parallelism, data decomposition, and
communication. The area of remote migration can be helpful.
3. It is good for testing hardware without the best compiler.
3.4.9 GAMESS 
GAMESS is a good example of a real-world application benchmark. GAMESS is a program
for performing ab initio quantum chemistry calculations. The program can perform automatic
geometry optimization and transition state searches, and can trace the intrinsic reaction path
from a transition state to reactants or products, so that the whole reaction path can be constructed.
A brief list of the computing capabilities of GAMESS is provided in Table 3.9. In Table 3.9,
C stands for conventional storage of integrals on disk, D stands for direct atomic orbital
integral computation, and P stands for parallel execution. A detailed description of the program
is available in [Schmidt et al., 1993].
GAMESS allows us to analyze the energetics of a reaction and to predict the mechanism of
the reaction. Computation of the energy Hessian (second derivatives) makes available normal
modes, vibrational frequencies, and IR intensities of stationary points. Various molecular
properties can be calculated, from dipole, quadrupole, and octupole moments to electron density
and spin density. This is done at different levels of theory.
GAMESS has been used at Ames Laboratory, USDOE, to run and benchmark several
systems in the past. It has also been part of the SPEC HPC suite.
3.4.10 Assorted Benchmarks 
At any given time, there are dozens of benchmarks used to compare computers. Some 
are fleeting, and others might have a lifetime of more than a decade. Some are targeted at 
Table 3.9 GAMESS Capabilities
SCF TYPE            RHF    ROHF   UHF    GVB    MCSCF
Energy              CDP    CDP    CDP    CDP    CDP
analytic gradient   CDP    CDP    CDP    CDP    CDP
numerical Hessian   CDP    CDP    CDP    CDP    CDP
analytic Hessian    CDP    CDP    -      CDP    -
MP2 energy          CDP    CDP    CDP    -      C
MP2 gradient        CD     -      -      -      -
CI energy           CDP    CDP    -      CDP    CDP
CI gradient         CD     -      -      -      -
MOPAC energy        yes    yes    yes    yes    -
MOPAC gradient      yes    yes    yes    -      -
supercomputers, some at personal computers, and some at business computers. A web page
[Aburto, ] maintained by A. Aburto Jr. facilitated our study by consolidating the results of
popular benchmarks in an easily accessible way. The assorted list of benchmarks is
maintained by Al Aburto [Aburto, ] of the Naval Command, Control and Ocean Surveillance
Center (NCCOSC) RDT&E Division (NRaD) in San Diego, CA. These benchmarks provide
a good example of past workloads. They are too small for today's computers
and are intentionally included in our study to test the hypothesis that they reveal only part of
a larger picture. They are summarized in Table 3.10.
3.4.11 HINT 
HINT [Gustafson et al., ], [Gustafson and Snell, 1995a], [Snell and Gustafson, 1996] is a 
variable-computation, variable-time benchmark. HINT rigorously defines quality of a solution 
of a given mathematical problem. HINT is a superset of other popular CPU and system 
benchmarks [Gustafson and Todi, 1998]. This is because HINT presents a spectrum of memory 
performance whereas other benchmarks pick a particular problem size and measure execution 
time. A detailed discussion of HINT is presented in later chapters.
Table 3.10 Assorted Tiny Benchmarks Maintained by Al Aburto
Benchmark      Data Type        Benchmark Description
sim            integer          Compares DNA segments for similarity.
fhourstones    integer          Finds solutions to the 'connect-4' game.
dhrystone      integer          Provides a MIPS rating based upon a 'typical'
                                instruction mix.
nsieve         integer          Generates prime numbers based on the Sieve
                                of Eratosthenes using array sizes from 8 KBytes to
                                2 MBytes. MIPS rating is reported.
heapsort       integer          Uses the heap sort method to sort a random
                                array of long integers.
hanoi          integer          Solves the Towers of Hanoi puzzle.
queens         integer          Solves the 14 Queens Problem.
flops          floating-point   Estimates peak MFLOPS for specific FADD,
                                FSUB, FMUL, and FDIV.
clinpack       floating-point   C version of Linpack program.
fft            integer          An FFT test program from Ron Mayer.
tfftdp         integer          An FFT program using the Duhamel-Hollman method,
                                from 32 to 262,144 points.
mm             floating-point   9 different algorithms for doing matrix
                                multiplication (500 x 500).
3.4.12 Peak FLOPS 
This is usually obtained by computing the rate at which the floating-point adders and
floating-point multipliers in the hardware can fire, unfettered by any other operation.
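A minimal sketch of this calculation is given below. It assumes that every adder and every
multiplier can deliver one result per clock cycle, which is a simplification; the clock rate,
unit counts, and processor count are hypothetical values chosen for illustration.

```python
def peak_flops(clock_hz, adders, multipliers, processors=1):
    """Peak rate, assuming every adder and multiplier fires once per cycle."""
    return clock_hz * (adders + multipliers) * processors


# Hypothetical machine: 500 MHz clock, one adder and one multiplier
# per processor, four processors.
peak = peak_flops(500e6, adders=1, multipliers=1, processors=4)
print(f"{peak / 1e9:.1f} GFLOPS peak")
```

No memory access, instruction issue limit, or data dependency appears in this formula, which
is why the number is an upper bound and not a measurement.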
3.4.13 Lawrence Livermore Loops 
The Lawrence Livermore Loops [McMahon, 1986] were designed by Frank McMahon by
taking excerpts from Fortran application programs used at Lawrence Livermore National
Laboratory.
3.4.14 Whetstone 
This is the latest version of the 1976 Whetstone benchmark [Curnow and Wichmann, 1976] 
written in C. It stresses unoptimized scalar performance, since it is designed to defeat any 
effort to find concurrency. When MIPS ratings were in favor, the Whetstone benchmark was 
a popular way to estimate MIPS, and one occasionally sees "WIPS" (Whetstone Instructions 
Per Second) in the historical literature. 
3.4.15 SLALOM Benchmark 
The SLALOM benchmark [Gustafson et al., 1991], [Diane et al., 1991], [Gustafson and Snell, 1995a]
is an excellent example of a fixed-time benchmark.
CHAPTER 4 Common Problems with Benchmarks 
The benchmark overview and classification were discussed in Chapter 2 and Chapter 3. This
chapter lists the problems noted in the previous chapters and adds a few others. It
will be useful in understanding why application signature techniques are required.
4.1 Benchmarks Won't Follow Moore's Law 
According to Moore's law, computing performance increases about 60% per year. However,
fixed-computation (fixed-size) benchmarks do not scale, and they become obsolete
within a few years of their release. For example, the LINPACK benchmark specifies that the
matrix size for linear decomposition is 100 x 100. This means that LINPACK requires
less than 1 megabyte of memory and fewer than a million floating point operations.
Any small system can finish the work within milliseconds. In addition,
a scaled version of Linpack may not work, as it may be limited by human patience in waiting
for an answer [Gustafson and Todi, 1998]. A scaled version of Linpack can take over 70 days to
run on a 100 TFLOPS supercomputer.
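The size of the fixed 100 x 100 problem can be checked with a short calculation. The sketch
below assumes the conventional LINPACK operation count of 2n^3/3 + 2n^2 and 8-byte matrix
elements; the 1 GFLOPS sustained rate is a hypothetical figure, and the growth calculation
simply compounds the 60% annual rate quoted above.

```python
import math


def linpack_flops(n):
    """Conventional LINPACK operation count for an n x n system: 2n^3/3 + 2n^2."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def matrix_bytes(n, bytes_per_element=8):
    """Storage for one dense n x n matrix of 8-byte floating point numbers."""
    return n * n * bytes_per_element


def years_until_speedup(target, annual_growth=0.60):
    """Years needed for performance to grow by `target` at a fixed annual rate."""
    return math.log(target) / math.log(1.0 + annual_growth)


n = 100
flops = linpack_flops(n)                  # roughly 0.69 million operations
memory = matrix_bytes(n)                  # 80,000 bytes, far below 1 megabyte
seconds_at_1_gflops = flops / 1e9         # hypothetical sustained rate
years_for_10x = years_until_speedup(10)   # time for a tenfold speedup at 60% per year

print(f"{flops:.0f} flops, {memory} bytes, {seconds_at_1_gflops * 1e3:.3f} ms at 1 GFLOPS")
print(f"{years_for_10x:.1f} years for a tenfold speedup")
```

Under these assumptions the run time of the fixed-size problem shrinks by a factor of ten about
every five years, which is why its measurements quickly fall below what a timer can resolve.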
4.2 Benchmarks Won't Correlate With Real Applications 
If one runs benchmarks or real applications, one may expect the results among the benchmarks
or the real applications to correlate, and thus to have a linear relationship with each
other. If the linear relationship fails, one may still expect monotonicity among the results.
However, many relationships between benchmarks and applications (or between benchmarks
and other benchmarks) fail both linearity and monotonicity. Here are a few examples.
Figure 4.1 Computational Chemistry vs LINPACK
(horizontal axis: LINPACK speed relative to IBM 3090)
For example, the relationship between Linpack and a computational chemistry benchmark
[Gustafson and Todi, 1998] is shown in Figure 4.1. In the figure, the horizontal axis represents
LINPACK performance and the vertical axis represents performance on the GAMESS
computational chemistry application, both normalized to an IBM 3090 = 1.0. These data were
measured by S. Elbert at Ames Lab, circa 1989. The correlation between the results is positive.
However, Linpack overestimates the performance of the machines being benchmarked.
There are cases where the correlation between a benchmark and actual application performance
might not even be positive. As specified in Chapter 3, the peak FLOPS rating is considered a
benchmark because it is an easy number to obtain and has been widely used in the performance
community as if it were actual performance. A scatter plot of peak FLOPS rating versus
effective FLOPS as observed on the NAS Parallel Benchmarks is shown in Figure 4.2. The
correlation is negative (-0.692). This shows that the results gathered using these benchmarks
are not equivalent, even within some approximate scale factor.
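The two properties discussed in this section, linearity and monotonicity, can be checked
numerically. The sketch below uses the Pearson correlation coefficient as a test of linearity
and the Spearman rank correlation as a test of monotonicity. Both choices are assumptions made
for illustration, the rank function assumes no tied values, and the data are made up rather
than taken from the figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient: +1 or -1 means a perfectly linear relation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank of each value, 1 for the smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman rank correlation: +1 means the two rankings agree exactly."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical speeds of five machines, relative to a common baseline.
benchmark_speed = [1.0, 2.0, 3.0, 4.0, 5.0]
application_speed = [1.0, 1.5, 1.7, 1.8, 1.85]   # same ranking, but saturating

linear_fit = pearson(benchmark_speed, application_speed)
rank_fit = spearman(benchmark_speed, application_speed)
print(round(linear_fit, 3), round(rank_fit, 3))
```

In this made-up case the ranking is preserved while the linear fit is not perfect, which is the
milder of the two failures; a negative coefficient, as reported for Figure 4.2, means that
even the ranking is reversed.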
Figure 4.2 Peak FLOPS versus EP benchmark
(horizontal axis: Peak Advertised Performance, GFLOPS)
4.3 Benchmarks are Redundant 
Standard workload benchmark suites such as those of SPEC are designed to represent a
wide variety of workloads. Such suites contain diverse workloads to represent the demands of a
broad range of users. However, workload benchmark suites suffer, by nature and by design, from
two kinds of redundancy: inter-benchmark and intra-benchmark.
Figure 4.3 Workload Benchmarks (a) Ideal (b) Redundancy in SPEC
CFP2000 Benchmarks
(horizontal axis in both panels: first principal component; cluster labels in panel (b):
wupwise, mgrid, swim, mesa; galgel, art, ammp, lucas, sixtrack)
4.3.1 Inter-Benchmark Redundancy 
Consider Figure 4.3(a). Each dot in the figure represents a combination of architectural
and algorithmic properties of a benchmark in a benchmark suite. For an ideal benchmark
suite, one might expect the dots to be equally spaced in the two-dimensional space. However,
the reality, as measured on the SPEC CFP2000 benchmarks, is that almost all the benchmarks
display similar execution behavior. This is shown in Figure 4.3(b), which is based on data
collected by the author at the Hewlett Packard Performance Laboratory.
Figure 4.4 Intra-Benchmark Redundancy in SWIM benchmark of SPEC
CFP2000
(plot title: 171.swim; horizontal axis: first 5000 samples of 10,000,000 instructions in time order)
4.3.2 Intra-Benchmark Redundancy 
Workload benchmarks are also marred by intra-benchmark redundancy. If a benchmark
runs for an hour doing a similar kind of work again and again, revealing the same information
at each time step, then there is no use in spending the redundant time steps waiting
for the benchmark to complete. Therefore, a desired quality of a good benchmark is that
it should, within a short time, provide adequate information for comparing performance.
Consider the example of the first 50 billion instructions of the SWIM benchmark in Figure
4.4. This represents 10% of the total run. Every tick on the horizontal axis represents a chunk
of 10 million instructions. The vertical axis is cycles per instruction. A quick look at the
figure shows the uniformity within the benchmark from start to end. The same pattern is
repeated throughout the benchmark, showing intra-benchmark redundancy. Such patterns and
phases are common throughout the SPEC CFP2000 and CINT2000 benchmarks [Todi, 2001],
[Todi, 2003].
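One way to quantify this kind of redundancy is to ask how short a prefix of the run already
gives the same average cycles per instruction as the whole run. The sketch below is an
illustration under stated assumptions and not the method used in this thesis: the tolerance
criterion is hypothetical, and the samples are a made-up repeating pattern standing in for
per-chunk measurements.

```python
def shortest_representative_prefix(samples, tolerance=0.02):
    """Smallest prefix length k such that every prefix of length >= k has a mean
    within `tolerance` (relative) of the mean of the whole run."""
    total_mean = sum(samples) / len(samples)
    running_sum = 0.0
    last_bad = 0  # longest prefix whose mean is still outside the tolerance
    for k, value in enumerate(samples, start=1):
        running_sum += value
        if abs(running_sum / k - total_mean) > tolerance * total_mean:
            last_bad = k
    return min(last_bad + 1, len(samples))


# Hypothetical cycles-per-instruction samples, one per chunk of instructions.
# A short pattern repeated many times mimics a benchmark with repeating phases.
cpi_samples = [1.0, 1.2, 0.8, 1.0] * 250

needed = shortest_representative_prefix(cpi_samples, tolerance=0.03)
print(f"{needed} of {len(cpi_samples)} samples already give the full-run mean")
```

For a strongly periodic trace the answer is a tiny fraction of the run, which is the sense in
which the remaining time steps are redundant.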
For an in-depth analysis of the desired properties of workload benchmarks, readers are
referred to [Dujmovic, 1999].
4.4 Past Benchmarks Predict Future Performance 
The problem with workload benchmarks is that they are applications of the past.
Workload benchmarks are often used to design the systems of the future. As a result,
by the time the future systems are ready for use, the benchmarks used to design these
systems are of little value.
Figure 4.5 Past Benchmarks are used for Future System Design
(horizontal axis: Product Development Time (Years); labels: Design Phase, Product Development
Phase, Product Release, Real Applications, Workload Benchmarks)
Figure 4.5 is based on a presentation by Gregory F. Pfister of IBM in a panel discussion.
During the product design phase, workload benchmarks are given the most emphasis. During
that phase, real applications are usually unknown or are too big to run on design tools such
as simulators. However, when the product is released in the market, the order of importance of
workload benchmarks and real applications is naturally reversed. At product release,
the product is expected to run best on real and newer applications.
For example, the Internet boom was not foreseen by the traditional computer vendors until
mid-1995. Thus most of the workload suites used to design computer servers were either
computation-intensive benchmarks such as SPEC or transaction processing benchmarks such as
TPC. This means that servers designed prior to 1995 never considered Internet workloads.
Another problem is that the source code of a chosen benchmark is frozen early so that it can be
considered in the selection process of a benchmark suite. For example, the source code of the
SPEC2K suite was selected during 1998-1999. Consequently, by the time the SPEC2K benchmark
suite was approved by the SPEC committee [Henning, 2000b], the workload reflected old coding
practices and had also become a victim of Moore's Law.
4.5 Other Selected Problems of Benchmark 
This section lists some selected common problems found with
existing benchmarks.
1. The emphasis of a benchmark is too narrow. Benchmarks such as Dhrystone and Whetstone
exercise just CPU speed but not memory speed. Similarly, standard Linpack,
a popular benchmark, provides a workload that fits in the primary cache of most current
workstations.
2. Benchmarks run different instruction mixes than real applications, most often with
fewer memory references. One reason why benchmarks and applications rank differently
is illustrated in Figure 4.6. Benchmarks sample some parts of the HINT curve, whereas
applications sample other parts. The HINT graph is explained in detail
in Chapter 7. This example was taken from [Gustafson, b].
3. The lifetime of workload benchmarks is usually very short. For example, every
3-5 years, SPEC releases a newer suite of applications for benchmarking computationally
intensive applications. It is virtually impossible to track SPEC performance from one
year to the next since the definition of the problem set is always changing. There are
many reasons for such changes.
Figure 4.6 Benchmarks emphasize different Problem Sizes
(plot title: Graphs that Cross; curves: SGI Indigo2 and HP 712/80i; horizontal axis: Time in Seconds)
4. Several benchmarks are poorly crafted. One of the SPEC CFP95 benchmarks, 145.fpppp,
can be considered a poorly crafted benchmark. The benchmark 145.fpppp represents
a quantum chemistry application like GAMESS, described in Chapter 3. However,
unlike a real quantum chemistry application, which is a memory-intensive computation, the
data footprint of 145.fpppp is so small that it fits in the primary cache of most machines.
Because of this, it appeared as an anomaly in one of the Application Signature results.
Whatever the reasons for selecting 145.fpppp, this benchmark was not representative of
GAMESS, other quantum chemistry applications, or any real scientific application.
This benchmark was later dropped and does not appear in the SPEC CFP2000 suite.
5. The rules of benchmarks often get broken. The SLALOM benchmark, as discussed in
Chapter 3, was initially of order O(N^3), where N is the number of patches. However, a
method of order N log N was later found, which became a reason for the benchmark to become
obsolete.
6. Sometimes compilers turn benchmarks into toy programs. Compilers sometimes perform
precise recognition to optimize certain benchmarks. For example, the EQNTOTT benchmark
was in the SPEC89 suite but was later removed from the SPEC95 suite. EQNTOTT-specific
compilation generated exceptionally good code for EQNTOTT; however, it generated
poor code for programs that were only slightly different from EQNTOTT. To tackle this
problem, compilers were made to pattern match certain hotspots of EQNTOTT and exclude
other code from that optimization. Refer to the website [NULLSTONE, a] for further
details on this problem.
7. The lifetime of a benchmark is also affected by cost and support considerations.
A benchmark can cease to exist when, for example, the people supporting it
leave the organization or there is a lack of funds. A recent example is the NAS benchmarks,
which provide kernel benchmarks in different sizes and for which production of future
benchmarks has stopped.
8. A compiler can remove parts of the code, e.g., loops that do no useful work, yielding high
performance numbers. Hence synthetic benchmarks are not very useful, and one needs to be
careful when using those numbers.
9. Benchmark code often gets tweaked or hand-written. This artificially improves the
benchmark's results; however, it reduces the benchmark's reliability.
10. Benchmarks are often run on customized configurations. Most customers report that
they do not get the performance reported by the vendors. This is because the
vendors use specialized configurations that are hardly available to customers. We had a
similar experience when Ames Laboratory acquired a 256
node Intel Paragon (xps150). The Top-500 URL (www.top500.org) showed the xps150
with 1024 nodes achieving 127.1 GFLOPS running Linpack. Since Ames Laboratory
received one-fourth of the machine, around 31.775 GFLOPS (124.12 MFLOPS per node)
was expected on the acquired hardware. Surprisingly, however, only 40 MFLOPS per
node was achieved. That is about one-third of the expected performance per node. After
talking to the person (Steve Nassar) who was involved in the original benchmarking, it
was found that Intel had used a different BLAS library and a different operating system
(SUNMOS instead of the supplied OSF) for the benchmark runs reported on the top500 website.
11. There are many popular benchmarks, and vendors find time to hand-tune
some of them. A classic example is the Linpack benchmark. This hand-tuning yields better
performance and does not reflect the typical performance that a user might experience.
12. If the computer is being evaluated for purchase, the benchmarks do
not cover aspects other than price and performance. Such aspects, according to
[Hennessey, 1999], [Joseph et al., 2000], should also include reliability, availability,
maintainability, scalability, testability, serviceability, supportability, compatibility, porting
facilities, system administration, system tools, the future road map of the systems, and
so on.
13. Benchmark results do not correlate with the cost of the system. For example,
Table 3.1 shows that peak performance is not linear in the cost of the
system.
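The arithmetic behind the Intel Paragon example in item 10 can be reproduced in a few lines.
The sketch below uses only the figures quoted in that item and, like the text, assumes that
Linpack performance scales linearly with the number of nodes; the variable names are
illustrative.

```python
reported_gflops = 127.1           # Linpack result listed for the 1024-node system
reported_nodes = 1024
acquired_nodes = 256              # one-fourth of the listed machine
achieved_mflops_per_node = 40.0   # measured on the acquired hardware

# Expected per-node and whole-machine rates under linear scaling.
expected_mflops_per_node = reported_gflops * 1000.0 / reported_nodes
expected_gflops = expected_mflops_per_node * acquired_nodes / 1000.0
fraction_achieved = achieved_mflops_per_node / expected_mflops_per_node

print(f"expected {expected_mflops_per_node:.2f} MFLOPS per node, "
      f"{expected_gflops:.3f} GFLOPS in total")
print(f"achieved fraction: {fraction_achieved:.2f}")
```

The achieved fraction comes out at roughly one-third, matching the shortfall described in the
item.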
4.6 Summary 
According to Weicker [Weicker, 1990] [Weicker, 1991], "Fair benchmarking" would be less of
an oxymoron if those using benchmark results knew what tasks the benchmarks really perform
and what they measure. In this chapter, many problems related to benchmarks were studied.
Even though the benchmark problems are intuitive and comprehensible, the solutions to these
problems are not trivial. The HINT benchmark, along with APPMAP technology, provides
solutions to many of these problems. In the next chapters, we will see how HINT reduces many
of the problems highlighted in this chapter. HINT has many desirable properties, such as
linearity, code that is not easily tweaked, and mathematical soundness.
CHAPTER 5 Metrics 
A performance metric is a standard unit used to describe the performance of the system. 
Unfortunately, this is one of the most confusing topics in the computing community. In this 
chapter, characteristics of good metrics are discussed. Brief definitions of performance metrics 
as present in single and parallel systems are provided. 
5.1 Characteristics of Good Performance Metrics 
The characteristics of good performance metrics are as follows [Lilja, 2002]: 
1. Linearity: This characteristic implies that a metric should respond to system changes
in a linear way. If the actual performance of the machine varies by a certain ratio, then
the metric should indicate this with a proportional change.
2. Reliability: A performance metric is reliable if, whenever system A always outperforms
system B, the corresponding values of the metric for the two systems indicate that. This
characteristic can also be viewed as a predictability measure of the metric.
3. Repeatability: A repeatable metric gives the same value no matter how many times
the experiment is done.
4. Easiness of measurement: The metric should be easy to measure.
5. Consistency: A consistent performance metric is architecture-independent. In other
words, the metric should be comparable across different machines.
6. Independence: A good metric should always be independent of outside influence.
5.2 Means versus Ends Metrics 
A good performance metric should be reliable. However, reliability requires an understanding
of what is and what is not useful work.
Definition 3 An ends-based metric, or utility-based metric, measures the quality of the answer.
Definition 4 A means-based metric measures the work done, whether it is useful or not.
MFLOPS is an example of a means-based metric: a machine can run an inferior
algorithm with more floating point operations and thereby obtain a higher MFLOPS rating.
Table 5.1 is taken from Dr. John Gustafson's Grand Challenge presentation in Japan. It shows
that there are numerous examples where algorithms with a higher FLOPS rating may not lead to
a faster answer.
Table 5.1 Examples of Algorithms Performance measured by Ends-based
Metrics versus Means-based Metrics
Algorithms that perform better on     Algorithms that perform better on
Means-based Metrics                   Ends-based Metrics
(Higher FLOPS rate)                   (Faster Answers)
Explicit Timestepping                 Implicit Timestepping
Conventional Matrix Multiplication    Strassen, Winograd Method
Cholesky Decomposition                PC Conjugate Gradient
All-to-All N-Body Methods             Barnes-Hut, Greengard
Successive Over-Relaxation            Multigrid
Time-Domain Operators                 FFT
Recompute Gaussian Integrals          Compute Once and Store
Material Property Function            Table Look-Up
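The difference between the two kinds of metric can be made concrete with a small calculation.
The sketch below compares two hypothetical algorithms that solve the same problem; the
operation counts and run times are invented for illustration and are not measurements of any
method listed in Table 5.1.

```python
def mflops(flop_count, seconds):
    """Means-based rating: millions of floating point operations per second."""
    return flop_count / seconds / 1e6


# Hypothetical algorithms for the same problem (made-up numbers).
algorithms = {
    "conventional": {"flops": 2.0e12, "seconds": 4000.0},
    "fast": {"flops": 1.0e11, "seconds": 1000.0},
}

ratings = {name: mflops(a["flops"], a["seconds"]) for name, a in algorithms.items()}
best_by_means = max(ratings, key=ratings.get)                              # highest MFLOPS
best_by_ends = min(algorithms, key=lambda name: algorithms[name]["seconds"])  # fastest answer

print(ratings)
print("means-based winner:", best_by_means)
print("ends-based winner:", best_by_ends)
```

The algorithm that performs twenty times more operations earns the higher MFLOPS rating, yet
the other one delivers the answer four times sooner.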
5.3 Uniprocessor Performance Metrics 
5.3.1 MFLOPS 
MFLOPS stands for millions of floating point operations executed per second. It is very
hard to get a consistent floating point operation count, because the count differs across
architectures.
5.3.2 MIPS 
MIPS stands for millions of instructions executed per second. MIPS is independent
and repeatable, as well as easy to measure. However, MIPS is not linear, reliable, or consistent.
Arguments against MIPS are given in [Gustafson and Todi, 1999a].
Neither MIPS nor MFLOPS accounts for the cost of memory access, which is the dominant cost
in most modern systems. A simple matrix multiplication algorithm will have its performance
vary by over an order of magnitude depending on the order of nesting of its three loops.
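The effect of loop nesting can be seen by looking at the memory stride of the innermost loop.
The sketch below assumes row-major storage of square matrices and the update
C[i][j] += A[i][k] * B[k][j]; it reports the element stride of each array in the innermost
loop and confirms that all six nesting orders compute the same product. It illustrates the
access pattern only and does not measure time.

```python
from itertools import permutations, product


def innermost_strides(order, n):
    """Element stride of A[i][k], B[k][j], and C[i][j] in the innermost loop,
    for row-major n x n arrays. `order` lists loop variables, outermost first."""
    inner = order[-1]

    def stride(row_index, col_index):
        if inner == col_index:
            return 1      # walks along a row: consecutive elements
        if inner == row_index:
            return n      # walks down a column: jumps a full row each step
        return 0          # element does not change in the innermost loop

    return {"A": stride("i", "k"), "B": stride("k", "j"), "C": stride("i", "j")}


def matmul(a, b, order="ijk"):
    """Multiply square matrices using the given loop nesting order."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for values in product(range(n), repeat=3):   # last position varies fastest
        index = dict(zip(order, values))
        i, j, k = index["i"], index["j"], index["k"]
        c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
reference = matmul(a, b, "ijk")

for perm in permutations("ijk"):
    order = "".join(perm)
    assert matmul(a, b, order) == reference      # same answer, same operation count
    print(order, innermost_strides(order, n=100))
```

Every order performs the same 2n^3 floating point operations, so a MFLOPS or MIPS count cannot
distinguish them; only the strides, and hence the cache behavior, differ.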
5.3.3 Clock Frequency 
Clock frequency is the most used metric, probably because it is easy to obtain. Intuitively, a
higher clock rate may seem to mean better performance. However, the work done per cycle is
unknown. Clock rate is nonlinear, and the metric is unreliable.
5.3.4 QUIPS 
The QUIPS metric is part of the HINT benchmark. QUIPS is an example of an ends-based metric. 
The useful work is mathematically defined as Quality, which is a measure of the quality of the 
solution obtained. QUIPS is quality improvement per second. QUIPS has all the qualities of 
a good metric: 
1. Linearity: The quality is linearly related to the time needed to achieve the solution. 
2. Reliability: As HINT is a wide-spectrum benchmark, it captures many aspects of system 
performance. Studies show that QUIPS consistently correlates and rank-correlates 
with other benchmarks. So QUIPS is a reliable metric. This characteristic will be 
discussed in detail throughout the thesis. 
3. Repeatability: The HINT rating is the same for repeated runs. 
4. Ease of measurement: HINT source code is portable to many platforms and is 
used to measure machines ranging from a supercomputer to a tiny personal digital assistant (PDA). 
5. Consistency: Since quality is mathematically defined, the QUIPS metric is 
architecturally independent. 
6. Independence: Since quality is mathematically sound, the QUIPS metric is 
independent of outside influence. 
HINT and QUIPS will be discussed in detail in Chapter 7. 
5.4 Parallel Processing Performance Metrics 
In the next three sections some of the widely referenced (but not used) definitions of 
speedup, efficiency, and scalability are presented, along with the insights that each measure offers. 
Most of the definitions are based on [Sahni and Thanvantri, 1996a] and 
[Sun and Ni, 1992]. Three definitions of scalability are discussed in detail: asymptotic scalability, 
isoefficiency, and isospeed. 
5.4.1 Speedup 
Speedup is a measure of the performance improvement of the parallel domain over the serial 
domain. The speedup is the ratio of serial execution time to parallel execution time. Various 
definitions of speedup follow. Here I is the problem instance, P is the number of 
processors, Q is the parallel program, and n is the size of the problem instance. 
1. Relative Speedup (I,P): It is the ratio of the time to solve the problem instance I using 
program Q and 1 processor to the time to solve the problem instance I using the same 
program Q and P processors. The relative speedup depends upon the instance I being 
solved and the number of processors P. 
2. Real Speedup (I,P): It is the ratio of the time to solve the problem instance I using the best 
serial program and 1 processor to the time to solve the problem instance I using the 
program Q and P processors. In practice the best serial algorithm may not be known or 
available, or there may not be a single best algorithm for all instances. So the available 
sequential algorithm is generally used. 
3. Absolute Speedup (I,P): It is the ratio of the time to solve the problem instance I using the 
best serial program and the single fastest processor to the time to solve the problem instance I 
using the program Q and P processors. Similar to real speedup, the available sequential 
algorithm is most often used. 
4. Asymptotic Real Speedup (n): It is the ratio of the asymptotic complexity of the best 
serial program to the asymptotic complexity of the program Q using as many processors 
as possible. For algorithms such as sorting, where the asymptotic complexity is not 
uniquely characterized by the instance size n, the worst-case complexity is used instead. 
Note that this metric does not rely on the number of processors P available in the 
parallel system because P is assumed to be unbounded. 
5. Asymptotic Relative Speedup (n): It is the ratio of the asymptotic complexity of the program 
Q using 1 processor to the asymptotic complexity of the program Q using as many 
processors as possible. For algorithms such as sorting, where the asymptotic complexity 
is not uniquely characterized by the instance size n, the worst-case complexity is used 
instead. Note that, as in asymptotic real speedup, this metric does not rely on the number 
of processors P available in the parallel system because P is assumed to be unbounded. 
5.4.2 Efficiency 
Efficiency is the ratio of the speedup to the number of processors. Depending upon 
the definition of speedup, one can have different definitions of efficiency. Note that efficiency 
can be greater than 1, as the speedup can be more than P for P processors (for example, with slow 
processors, limited memory, or inefficient serial code). 
Efficiency [Carmona and Rice, 1991a] can also be defined as the ratio of the work accomplished ($w_a$) 
by a parallel algorithm to the work expended ($w_e$) by the algorithm. Work accomplished 
($w_a$) can be defined as the product of the execution time of the best serial algorithm and the speed of 
an individual processor S. Work expended ($w_e$) can be defined as the product of the parallel 
execution time, the speed of an individual parallel processor S, and the number of processors P. 
\[ \text{Efficiency} = \frac{w_a}{w_e} = \frac{\text{best sequential time}}{P \times \text{parallel time}} \tag{5.1} \]
Thus from the definition of real speedup and equation 5.1 
\[ \text{Efficiency} = \frac{\text{real speedup}}{P} \tag{5.2} \]
Similarly, if work accomplished ($w_a$) is defined as the product of the execution time of the 
parallel algorithm on a single processor and the speed of the individual processor S, then the 
efficiency will be the relative speedup divided by P. 
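The definitions of relative speedup, real speedup, and the corresponding efficiencies can be sketched in Python as follows. The timings are hypothetical values chosen only for illustration, and the function names are not taken from the source.

```python
def relative_speedup(t_q_one_proc, t_q_p_procs):
    """Time of program Q on 1 processor divided by time of Q on P processors."""
    return t_q_one_proc / t_q_p_procs

def real_speedup(t_best_serial, t_q_p_procs):
    """Time of the best (available) serial program divided by time of Q on P processors."""
    return t_best_serial / t_q_p_procs

def efficiency(speedup, p):
    """Speedup divided by the number of processors P."""
    return speedup / p

if __name__ == "__main__":
    # Hypothetical measurements, in seconds.
    t_best_serial = 80.0   # best available serial program on 1 processor
    t_q_one_proc = 100.0   # parallel program Q on 1 processor
    t_q_p_procs = 10.0     # parallel program Q on P processors
    p = 16

    rel = relative_speedup(t_q_one_proc, t_q_p_procs)
    real = real_speedup(t_best_serial, t_q_p_procs)
    print("relative speedup:", rel, "relative efficiency:", efficiency(rel, p))
    print("real speedup:    ", real, "real efficiency:    ", efficiency(real, p))
```

With these numbers the relative speedup exceeds the real speedup because program Q on one processor is slower than the best serial program, which is why the choice of speedup definition changes the reported efficiency.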
5.4.3 Scalability 
The scalability of a parallel system is defined as the change in the performance of the system as 
problem size and computer size increase. A system is called scalable if its performance increases with the 
increase in the size of the system. 
5.4.3.1 Asymptotic Scalability 
Asymptotic scalability is defined as the ratio of the asymptotic relative speedup of a parallel 
system to the asymptotic relative speedup of an ideal system consisting of a similar parallel 
algorithm and an Exclusive Read Exclusive Write Parallel Random Access Machine (EREW 
PRAM). 
For example, an n x n mesh-connected computer takes $\Theta(n)$ time to run the classical 
$\Theta(n^3)$ matrix-multiplication algorithm. The relative speedup of the mesh algorithm is thus 
$\Theta(n^3/n) = \Theta(n^2)$. This algorithm involves a number of data alignment and interprocessor data shifts 
that are typical of the mesh architecture. But these steps are unnecessary for an EREW PRAM. 
As a result we need to use an algorithm on the EREW PRAM that has the same computations 
but avoids the communication-necessitated work done on the mesh. Note that we may not 
use a parallel algorithm based on an asymptotically faster serial algorithm, only one based on a similar 
serial algorithm. The runtime of the similar parallel algorithm on an EREW PRAM with 
$\Theta(n^3/\log n)$ processors is $\Theta(\log n)$. 
Thus the asymptotic scalability of the mesh matrix multiplication system is as follows: 
\[ \text{Asymptotic Scalability} = \frac{\text{Asymptotic Relative Speedup of Parallel System}}{\text{Asymptotic Relative Speedup of an EREW PRAM}} = \frac{\Theta(n^2)}{\Theta(n^3/\log n)} = \Theta\!\left(\frac{\log n}{n}\right) \tag{5.3} \]
So from equation 5.3, as the problem size grows to infinity the asymptotic scalability decreases 
to zero. This means that the mesh matrix multiplication is not scalable. 
Notice that asymptotic speedup is independent of the number of processors. Thus if another 
matrix multiplication algorithm on an EREW PRAM with $\Theta(n^3)$ processors takes $\Theta(\log n)$ time, the 
asymptotic scalability is the same as in equation 5.3. However, the relative efficiency 
of this $\Theta(n^3)$-processor system is $\Theta(1/\log n)$ (so its efficiency decreases as n increases), whereas that of the 
$\Theta(n^3/\log n)$-processor system is a constant $\Theta(1)$. 
5.4.3.2 Isoefficiency 
Isoefficiency [Kumar et al., 1994a], [Grama et al., 1993a] is a scalability measure based on 
efficiency. It is defined as the rate at which the size of the workload should be increased 
relative to the rate of increase in the number of processors such that the efficiency of the 
parallel system remains the same. Hence, depending upon the version of efficiency, there can 
be different versions of isoefficiency: real, absolute, or relative. 
For example, consider multiplying n x n matrices on an m x m mesh computer using the classical 
$\Theta(n^3)$ matrix-multiplication algorithm. The workload is $cn^3$. Assuming one unit of workload 
takes one unit of time, the execution time of the serial algorithm is $cn^3$. If the problem is 
uniformly spread over the $P = m^2$ processors, then each processor has an (n/m) x (n/m) portion of 
the matrix. If the communication and other overheads are $bn^2/m$, then the parallel execution time 
is $cn^3/m^2 + bn^2/m$. Thus the relative speedup of the matrix multiplication algorithm is given as 
follows. 
\[ \text{relative speedup} = \frac{cn^3}{cn^3/m^2 + bn^2/m} \tag{5.4} \]
From equation 5.4 it follows that for m x m processors the relative efficiency is given as follows. 
\[ \text{relative efficiency} = \frac{cn^3}{m^2\,(cn^3/m^2 + bn^2/m)} = \frac{1}{1 + bm/(cn)} \tag{5.5} \]
Thus from equation 5.5, for the relative efficiency to remain constant, bm/cn 
must be constant, so n should increase at the rate of bm/c. The workload $n^3$ 
should therefore increase at the rate of $(bm/c)^3$ in order to maintain constant efficiency. Since $m = P^{1/2}$, the 
isoefficiency $ie(W = n^3, P)$ of the mesh matrix multiplication system is $(b/c)^3 P^{1.5}$. 
The isoefficiency concept helps to scale results from a small number of processors to a large 
number of processors. Thus one can test parallel programs using a small number of processors 
and then predict the performance for a larger number of processors. Also, one can take a small 
instance of the problem and reportedly get a good projection for a larger problem instance. 
Unlike asymptotic scalability, isoefficiency helps us to analyze how changing the number 
of processors and the communication speed affects performance. 
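A small numerical check of the isoefficiency argument is sketched below in Python. It evaluates the relative efficiency of the mesh matrix multiplication example from its stated serial time $cn^3$ and parallel time $cn^3/m^2 + bn^2/m$. The constants b = 4 and c = 1 and the function name are illustrative assumptions, not values from the source.

```python
def relative_efficiency(n, m, b, c):
    """Relative efficiency of n x n matrix multiplication on an m x m mesh.

    Uses the serial time c*n^3 and the parallel time c*n^3/m^2 + b*n^2/m
    from the mesh example; P = m^2 processors.
    """
    serial_time = c * n**3
    parallel_time = c * n**3 / m**2 + b * n**2 / m
    return serial_time / (m**2 * parallel_time)

if __name__ == "__main__":
    b, c = 4.0, 1.0  # hypothetical overhead and computation constants

    # Growing n in proportion to m keeps b*m/(c*n) fixed, so efficiency stays constant.
    for m in (2, 4, 8, 16):
        n = 8 * m
        print("m =", m, "n =", n, "efficiency =", round(relative_efficiency(n, m, b, c), 4))

    # Holding n fixed while m grows lets the overhead term dominate.
    for m in (2, 4, 8, 16):
        print("m =", m, "n = 64 efficiency =", round(relative_efficiency(64, m, b, c), 4))
```

In the first loop the efficiency stays at 1/(1 + b/(8c)), while in the second it falls as processors are added, which is the behaviour the isoefficiency function is meant to capture.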
The problem with isoefficiency is that the more scalable system can run slower than the less 
scalable system. Hence this metric has to be considered together with execution time. Here 
is an example that demonstrates the problem by using isoefficiency to compare two matrix 
multiplication systems. Suppose both systems have an equal number of identical processors. 
One has a speedup and efficiency given by equations 5.4 and 5.5. Suppose in the second system 
the communication overhead is reduced by a factor of two, but at the expense of doubling the 
time spent on computation. Thus the parallel execution time is $2cn^3/m^2 + (b/2)n^2/m$ for 
m < n. The isoefficiency of the second system is $(b/4c)^3 P^{1.5}$, which is better than the first 
system by a factor of 64. The second system is therefore considerably more scalable than the 
first. 
But, if we compare the runtime of both the systems, we see that when b = 4c 
\[ \frac{\text{runtime of system 2}}{\text{runtime of system 1}} = \frac{2n^3 + 2mn^2}{n^3 + 4mn^2} \tag{5.6} \]
The ratio in equation 5.6 is greater than one for n greater than 2m. Hence, in that range, the more scalable system 
runs slower than the less scalable system. 
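The runtime comparison can be checked numerically with the short Python sketch below. It assumes the parallel times $cn^3/m^2 + bn^2/m$ for the first system and $2cn^3/m^2 + (b/2)n^2/m$ for the second, that is, doubled computation and halved overhead; the second expression is an assumption about the intended form. The function names and the choice c = 1 are illustrative. With m = 1 the crossover reduces to n = 2.

```python
def runtime_system1(n, m, b, c):
    """Parallel time of the first system: computation plus communication overhead."""
    return c * n**3 / m**2 + b * n**2 / m

def runtime_system2(n, m, b, c):
    """Second system: computation doubled, communication overhead halved (assumed form)."""
    return 2 * c * n**3 / m**2 + (b / 2) * n**2 / m

def runtime_ratio(n, m, c=1.0):
    """Runtime of system 2 divided by runtime of system 1 for the case b = 4c."""
    b = 4 * c
    return runtime_system2(n, m, b, c) / runtime_system1(n, m, b, c)

if __name__ == "__main__":
    for n, m in ((6, 4), (8, 4), (64, 4), (3, 1)):
        print("n =", n, "m =", m, "ratio =", round(runtime_ratio(n, m), 4))
```

Under these assumptions the ratio equals (2n + 2m)/(n + 4m), so the system with the better isoefficiency is the slower one whenever n exceeds 2m.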
5.4.3.3 Isospeed 
Sun and Rover [Sun and Rover, 1994] defined scalability as the average workload needed 
to sustain a specified computational speed. They define workload W as the amount of work, measured 
in units such as floating point operations or instructions. It excludes extraneous computations, 
communications, and other overheads that might be introduced during the parallel execution. 
One problem of using the workload measure is as follows: if the workload is large, then when 
running on the serial processor the speed can be affected by remote memory access. On 
the other hand, if the workload is small, many of the features such as loop initialization, vector 
setup, etc., which otherwise would not have counted, have a considerable effect on the Mflops (or 
speed) of the system. 
Suppose that a P-processor parallel system achieves an average per-processor speed of x 
Mflops using a workload of W. The parallel execution time of the system is then T = W/(Px) 
seconds. If the number of processors increases to P', the average speed will generally decrease 
unless the workload increases. For many applications, overheads (such as interprocessor 
communication) increase when the workload is fixed and the number of processors increases. To 
obtain a speed of x Mflops, we need to increase the workload to W'. The time to perform the 
increased workload using P' processors is T' = W'/(P'x). The scalability or isospeed, $\psi(P, P')$, 
of the parallel system is 
\[ \psi(P, P') = \frac{P'W}{PW'} = \frac{T}{T'} \tag{5.7} \]
Because the average speed (x Mflops) is a function of the workload, $\psi(P, P')$ is really a 
function of the initial workload W. To remove this dependence on W, a particular value of x was 
proposed: $r_\infty/2$, where $r_\infty$ was defined as the maximum speed 
attained by a single-processor execution of the parallel algorithm as the workload increases. 
$r_\infty/2$ becomes the average speed that is to be attained during parallel execution. 
Parallel systems with an isospeed closer to one are more scalable than those with an isospeed 
much smaller than one. Recall that when the isoefficiency metric is used, systems with a small 
isoefficiency value are more scalable than those with a large one. One can show isospeed to be 
inversely proportional to isoefficiency when the speed of serial computation is independent of 
the workload. 
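The isospeed computation can be sketched in Python as follows, taking isospeed to be P'W/(PW'), which equals T/T' when the per-processor speed x is held fixed. The workloads, processor counts, and speed below are hypothetical, and the function names are illustrative.

```python
def parallel_time(workload, processors, speed_per_processor):
    """T = W / (P * x) for workload W, P processors and per-processor speed x."""
    return workload / (processors * speed_per_processor)

def isospeed(p, w, p_new, w_new):
    """Isospeed of scaling from (P, W) to (P', W') at the same per-processor speed."""
    return (p_new * w) / (p * w_new)

if __name__ == "__main__":
    # Hypothetical case: keeping x = 50 Mflops per processor while going from
    # 4 to 16 processors requires the workload to grow from 1000 to 5000 units.
    p, w = 4, 1000.0
    p_new, w_new = 16, 5000.0
    x = 50.0

    psi = isospeed(p, w, p_new, w_new)
    ratio_of_times = parallel_time(w, p, x) / parallel_time(w_new, p_new, x)
    print("isospeed:", psi)
    print("T / T':  ", ratio_of_times)
```

An isospeed of 1 would mean the workload only had to grow in proportion to the processor count; here it had to grow faster, so the value falls below 1.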
5.5 Summary 
In this chapter, a detailed discussion of system performance metrics for both 
uniprocessors and multiprocessors was presented. Besides performance, metrics like reliability, 
availability, maintainability, testability, scalability, and usability [Hennessey, 1999] are of 
interest to the system research community. Energy consumption per performance unit per dollar is 
another important metric that is quite useful. 
CHAPTER 6 Statistical Background 
In this chapter some basic statistical terms are discussed, in particular Pearson Product 
Moment correlation, Spearman's correlation, and linear relationship. The concept of the 
harmonic mean, which is widely used in physics to average speeds, is also discussed. 
6.1 Pearson Product Moment Correlation 
One objective of the thesis is to compare the performance of two sets of machines to 
understand whether there is a linear relationship between the two sets. Pearson Product 
Moment Correlation or Pearson's correlation is used for this purpose. 
Pearson's correlation r is the measure of the correlation between two variables X and Y. 
It shows the strength and direction of a linear relationship between the X and Y variables. 
The description is provided in equation 6.1. Here $X_i$ refers to the ith observation of the 
variable X and $Y_i$ refers to the ith observation of the variable Y. Also, $|X| = |Y| = n \ge 1$. 
\[ r = r_{xy} = \frac{c(xy)}{\sqrt{v(x)\,v(y)}}, \qquad -1 \le r_{xy} \le 1 \tag{6.1} \]
where 
\[ c(xy) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \qquad -\infty < c(xy) < \infty \tag{6.2a} \]
\[ v(x) = c(xx) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \tag{6.2b} \]
\[ v(y) = c(yy) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \tag{6.2c} \]
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{6.2d} \]
\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{6.2e} \]
The properties of Pearson's correlation [H.J. Newton, J.H. Carroll, N. Wang, 2003] can be 
summarized as follows: 
• The value of r does not depend on the units of measurement. For example, X can be 
expressed in seconds or X can be expressed in milliseconds, and the correlation with Y 
(any unit) would be same. 
• The value of r does not depend on labeling of variables X and Y. Any variable can be 
chosen to be either X or Y. 
• A positive value of r implies a positive linear relationship between the variables and a 
negative value of r implies a negative linear relationship between the variables. 
• r = 1 or r = -1 happens only when all the points of the scatter plot lie exactly on a 
straight line. A value of r closer to 1 or -1 implies a stronger linear relationship between 
those variables. 
• A value of r equal to 0 implies little or no linear relationship between X and Y. 
• r measures only the linear relationship between X and Y. r = 0 does not mean that there 
is no relationship between X and Y. 
• The strength of correlation can be defined as follows: 
Strong : |r| > 0.8 (6.3a) 
Moderate : 0.5 < |r| < 0.8 (6.3b) 
Weak : |r| < 0.5 (6.3c) 
Throughout the thesis the term correlation means Pearson's correlation. 
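The definition of Pearson's correlation and two of its listed properties can be sketched in Python as follows. The function follows the covariance-over-root-of-variances form of equation 6.1 with 1/n normalization; the sample timings are hypothetical and the function name is illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation r = c(xy) / sqrt(v(x) * v(y))."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    c_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    v_x = sum((x - mean_x) ** 2 for x in xs) / n
    v_y = sum((y - mean_y) ** 2 for y in ys) / n
    return c_xy / sqrt(v_x * v_y)

if __name__ == "__main__":
    # Hypothetical runtimes of two programs on four machines, in seconds.
    secs_x = [1.0, 2.0, 4.0, 7.0]
    secs_y = [2.0, 3.0, 7.0, 9.0]

    r = pearson(secs_x, secs_y)
    # Changing the unit of X from seconds to milliseconds leaves r unchanged.
    r_ms = pearson([1000.0 * s for s in secs_x], secs_y)
    # Swapping the roles of X and Y leaves r unchanged.
    r_swapped = pearson(secs_y, secs_x)
    print(r, r_ms, r_swapped)
```

The normalization factor cancels between numerator and denominator, so using 1/(n-1) instead of 1/n would give the same r.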
6.2 Linear Relation 
Suppose variables X and Y are highly correlated, so that there is a linear trend between X and 
Y. Given the value of X, the value of Y can be predicted. Let the predicted value of variable Y 
be $\hat{Y}$. A mathematical definition of $\hat{Y}$ can be written as in equation 6.4. 
\[ \hat{Y} = \beta_0 + \beta_1 X \tag{6.4} \]
Furthermore, to reduce the degrees of freedom in equation 6.4 from two to one, the value of 
$\beta_0$ is set to zero. This assumption is useful in a case where variable X represents the measured 
time of an application on a set of machines and variable Y represents the predicted time of an 
application on a set of machines. The value $0 \in X$ implies there exists a hypothetical machine 
with infinite speed that takes zero time to complete the task. For such a hypothetical machine 
one can assume time $0 \in Y$. Hence equation 6.4 is simplified by setting $\beta_0$ to zero, 
which implies $0 \in X, Y$. 
The error in the prediction is given by equation 6.5. 
\[ \text{error}_i = Y_i - \hat{Y}_i \tag{6.5} \]
The relative error in the prediction is given by equation 6.6. Based on the value of 
relative-error$_i$ for all i, the maximum, the minimum, the average, and the standard deviation 
of the relative error can be determined. 
\[ \text{relative-error}_i = \frac{|Y_i - \hat{Y}_i|}{|Y_i|} \tag{6.6} \]
6.3 Spearman's Rank Correlation 
One important use of benchmarking is to rank a set of machines by performance. Ideally 
a good benchmark should rank the machines in the same way a real application would. 
Suppose A is a real application and B is a benchmark. Both A and B are used to rank a set 
of machines from best to worst. In order to find how strongly the rankings of machines by A 
and B are related, Spearman's Rank Correlation [McClave and II, 1988] is used. Spearman's 
Rank Correlation, or rank correlation, is concerned with the trend in ranking: if A 
ranks a machine as the fastest or slowest, does B rank the same machine as the fastest 
or slowest? Is the relationship between the ranking by A and the ranking by B linear? 
Suppose X and Y are two variables. To find the rank correlation between X and Y, the n 
instances of variables X and Y are first ranked from 1 to n. Ties in the ranks are averaged. 
Suppose variables U and V are the ranked variables corresponding to variables X and Y. Let 
$U_i$ represent the ith instance of variable U, and let $V_i$ represent the ith instance of variable V. 
Spearman's rank correlation ($r_s$) is defined in the same way as Pearson's correlation in equation 6.1, but over 
variables U and V. The simplified form for rank correlation is given by equation 6.7. 
(6.7) 
The properties of Spearman's rank correlation are similar to those of Pearson's correlation. The rank 
correlation $r_s$ varies from -1 (perfect negative correlation) to 1 (perfect positive correlation). 
An $r_s$ value of 0 implies no rank correlation. 
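The ranking step and the rank correlation can be sketched in Python as follows. The sketch ranks values from 1 to n with ties averaged, then applies Pearson's correlation to the ranks, which is the general definition given above rather than any simplified form. The function names are illustrative, and the input data are the Appl A and Appl B runtimes of Table 6.1.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest) to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson's correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cxy / sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's correlation of the averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

if __name__ == "__main__":
    appl_a = [4, 1, 3, 5, 2, 1, 3, 5, 2, 1]
    appl_b = [5, 2, 4, 5, 2, 2, 4, 4, 3, 2]
    print("Rank A:", average_ranks(appl_a))
    print("Rank B:", average_ranks(appl_b))
    print("rank correlation:", round(spearman(appl_a, appl_b), 4))
```

Computing Pearson's correlation on the ranks remains valid when ties are present, which matters here because both runtime columns contain repeated values.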
6.3.1 A Matlab Example 
Table 6.1 Statistical Analysis of 10 Machines using Applications A and B 
Machine  Appl A  Appl B  Rank A  Rank B  Rel. Error
1        4       5       8       9.5     0.1158
2        1       2       2       2.5     0.4474
3        3       4       6.5     7       0.1711
4        5       5       9.5     9.5     0.1053
5        2       2       4.5     2.5     0.1053
6        1       2       2       2.5     0.4474
7        3       4       6.5     7       0.1711
8        5       4       9.5     7       0.3816
9        2       3       4.5     5       0.2632
10       1       2       2       2.5     0.4474
The statistics, Pearson's correlation, linear relation, and Spearman's rank correlation, are 
illustrated with an example. Assume that there are two applications (or benchmarks), A and 
B, which are used to evaluate ten computers. The runtime of application A is shown in column 
Appl A of Table 6.1. The runtime of application B is shown in column Appl B of Table 
6.1. Using equation 6.1 and columns Appl A and Appl B, the correlation r between A and B 
is found to be equal to 0.9007, implying strong correlation. 
The rank of the machines is obtained by a Matlab function findRank. The rank of the 
machines as determined by application A is given under the column Rank A. The rank of the 
machines as determined by application B is given under the column Rank B. Using equation 
6.7 and columns Rank A and Rank B, the rank correlation $r_s$ between A and B is found to be 
equal to 0.9094, implying strong rank correlation. 
Can we predict the result of application B using application A? Since the correlation 
between the results of application A and application B is high, there is a linear relationship 
between A and B. Using equation 6.4, the value of $\beta_1$ is found to be 1.1053, assuming $\beta_0$ is equal to 
0. The relative error as per equation 6.6 is shown in column Rel. Error of Table 6.1. The 
maximum, minimum, mean, median, and standard deviation of the relative error in predicting 
the application B results from the application A results are 44.74%, 10.53%, 26.55%, 21.71%, and 
15.07% respectively. 
The Matlab code for doing the above statistical analysis is shown in Table 6.2. 
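The prediction step can also be sketched in Python, as a cross-check of the figures quoted above. The sketch fits a line through the origin by least squares, which is one reading of "setting the intercept to zero", and then computes the relative errors of equation 6.6. The function names are illustrative; the data are the Appl A and Appl B columns of Table 6.1.

```python
from statistics import mean, median, stdev

def slope_through_origin(xs, ys):
    """Least-squares slope of y = slope * x with the intercept fixed at zero."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def relative_errors(xs, ys, slope):
    """|y - predicted y| / |y| for each observation."""
    return [abs(y - slope * x) / abs(y) for x, y in zip(xs, ys)]

if __name__ == "__main__":
    appl_a = [4, 1, 3, 5, 2, 1, 3, 5, 2, 1]
    appl_b = [5, 2, 4, 5, 2, 2, 4, 4, 3, 2]

    slope = slope_through_origin(appl_a, appl_b)
    errs = relative_errors(appl_a, appl_b, slope)
    print("slope:  ", round(slope, 4))
    print("maximum:", round(max(errs), 4))
    print("minimum:", round(min(errs), 4))
    print("mean:   ", round(mean(errs), 4))
    print("median: ", round(median(errs), 4))
    print("std dev:", round(stdev(errs), 4))  # sample standard deviation (n - 1)
```

The sample standard deviation (dividing by n - 1) is used here because it reproduces the 15.07% quoted in the text.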
6.4 The Harmonic Mean 
The harmonic mean [Lilja, 2002] is one common metric used for averaging speed. It is 
defined to be 
\[ \bar{x}_h = \frac{n}{\sum_{i=1}^{n} (1/x_i)} \tag{6.8} \]
where $x_i$ represents the ith instance of the n values that are averaged together. 
For example, let a car take n back and forth trips from Ames to Des Moines, the ith trip at a speed of 
$S_i$. For each trip the time taken is $X_i$ to travel a constant distance D. The speed $S_i$ can also 
be written as $S_i = D/X_i$. From equation 6.8, the following equation is obtained. 
\[ S_h = \frac{n}{\sum_{i=1}^{n} (1/S_i)} = \frac{nD}{\sum_{i=1}^{n} X_i} \tag{6.9} \]
Table 6.2 A Matlab Example showing Statistical Analysis 
% matlab code 
clear all; 
x = [4, 1, 3, 5, 2, 1, 3, 5, 2, 1]; 
y = [5, 2, 4, 5, 2, 2, 4, 4, 3, 2]; 
% find Pearson's Correlation 
corrcoef(x,y) 
% findRank (not listed here) returns the ranks, with ties averaged 
xr = findRank(x) 
yr = findRank(y) 
% find Spearman's Rank Correlation 
corrcoef(yr,xr) 
% Prediction: least-squares slope of y = m*x through the origin 
m = y/x 
% error 
rel_error = abs(y - x*m)./y 
minimum_error = min(rel_error) 
maximum_error = max(rel_error) 
average_error = mean(rel_error) 
median_error = median(rel_error) 
std_error = std(rel_error) 
The mean value $S_h$ is the total distance travelled divided by the total time taken to 
complete the n trips. Thus the harmonic mean seems an appropriate mean to summarize 
speed. The harmonic mean is used to calculate a mean value for a sequence of QUIPS. 
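The car example can be checked with a few lines of Python. The distance and speeds below are hypothetical; the point of the sketch is that the harmonic mean of the per-trip speeds equals total distance divided by total time when every trip covers the same distance.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals of the n values."""
    return len(values) / sum(1.0 / v for v in values)

if __name__ == "__main__":
    distance = 30.0        # hypothetical one-way distance, in miles
    speeds = [60.0, 30.0]  # hypothetical speeds of two trips, in miles per hour

    times = [distance / s for s in speeds]
    total_distance = distance * len(speeds)
    print("harmonic mean of speeds:", harmonic_mean(speeds))
    print("total distance / time:  ", total_distance / sum(times))
    print("arithmetic mean:        ", sum(speeds) / len(speeds))
```

The arithmetic mean overstates the average speed because it ignores that the slower trip takes longer.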
6.5 The Weighted Harmonic Mean 
The implicit assumption in each of the measurements $x_i$ used in calculating the harmonic 
mean is that each of the n individual measurements is equally important in calculating the 
mean. For example, the example used to illustrate the harmonic mean assumed that 
each trip was made back and forth from Ames to Des Moines. One may travel from Ames to 
other destinations, and the distance D would then no longer be constant but would vary with the 
trip. 
Let $w_i$ be the fraction representing the relative importance of a measure $x_i$, such that 
the sum of all $w_i$ is equal to 1. 
\[ \sum_{i=1}^{n} w_i = 1 \tag{6.10} \]
The weighted harmonic mean [Lilja, 2002] can be defined mathematically as 
\[ \bar{x}_{h,w} = \frac{1}{\sum_{i=1}^{n} (w_i/x_i)} \tag{6.11} \]
In the car example, the value of weight $w_i$ is the fraction of the total distance travelled in the 
ith trip. Let $D_1, D_2, \ldots, D_n$ be the distances travelled in the trips. Then $w_i$ is defined as 
\[ w_i = \frac{D_i}{\sum_{j=1}^{n} D_j} \tag{6.12} \]
The weighted harmonic mean for the trips made by the car can be calculated from 
equations 6.11 and 6.12. 
\[ S_{h,w} = \frac{1}{\sum_{i=1}^{n} (w_i/S_i)} \tag{6.13} \]
where $S_i$ is the speed on the ith trip. 
The weighted harmonic mean is used to calculate the average QUIPS, where the weights are 
application-dependent and depend on the memory usage pattern of the application. These 
weights are called the Application Signature or APPMAP. 
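The weighted harmonic mean can be sketched in Python as follows, using the car example with trips of different lengths. The distances and speeds are hypothetical and the function names are illustrative. With weights equal to the fraction of total distance, the weighted harmonic mean again equals total distance divided by total time.

```python
def distance_weights(distances):
    """Fraction of the total distance covered by each trip; the weights sum to 1."""
    total = sum(distances)
    return [d / total for d in distances]

def weighted_harmonic_mean(values, weights):
    """1 divided by the sum of w_i / x_i, for weights that sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 1.0 / sum(w / v for w, v in zip(weights, values))

if __name__ == "__main__":
    distances = [30.0, 90.0]  # hypothetical trip lengths, in miles
    speeds = [60.0, 45.0]     # hypothetical trip speeds, in miles per hour

    weights = distance_weights(distances)
    total_time = sum(d / s for d, s in zip(distances, speeds))
    print("weights:", weights)
    print("weighted harmonic mean:", weighted_harmonic_mean(speeds, weights))
    print("total distance / time: ", sum(distances) / total_time)
```

The same function applies to averaging QUIPS values once application-dependent weights are available; how those weights are derived is outside this sketch.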
CHAPTER 7 HINT: The Hardware Signature 
7.1 Introduction 
HINT (Hierarchical INTegration) [Gustafson et al., ], [Gustafson and Snell, 1994], 
[Gustafson and Snell, 1995a], [Dowd and Severance, 1998], [Lilja, 2002] is a scientific 
benchmark created in 1995 at Ames Laboratory by Dr. John Gustafson and Dr. Quinn Snell. It is 
a universal benchmark which fixes neither time nor workload to measure the performance of a 
computer system. It is infinitely scalable, as there is no mathematical limit and computation 
continues as long as memory, time, and precision are available. HINT has been ported to 
almost all types of computers: parallel (MIMD, multiple instruction multiple data; SIMD, single 
instruction multiple data; SMP, symmetric multiprocessor), vector, parallel-vector, 
and serial. There is even HINT for humans, for measuring the computing speed of a human being 
[Gustafson, a]. 
HINT succeeded the Scalable, Language-independent, Ames Laboratory, One-minute 
Measurement (SLALOM) benchmark [Gustafson et al., 1991], [Diane et al., 1991], 
[Gustafson and Snell, 1995a], a past effort of Dr. John Gustafson, the first author of HINT, 
and others. Like SLALOM, HINT uses answer quality as a figure of merit. Also like SLALOM, 
HINT is scalable and language-independent. However, unlike SLALOM, HINT doesn't fix the 
time, and it is easier to parallelize, port, and maintain. Unlike other popular benchmarks, such 
as SPEC95, SPEC2000 [Reilly, 1996], [SPEC, 2003], and Linpack [Dongarra, 1987], HINT 
produces a graphical, broad spectrum of performance results with respect to a scalable workload 
instead of a single number. However, it also produces a single performance number: NetQUIPS. 
7.2 Task and Terminology 
HINT uses interval subdivision to find rational upper and lower bounds of the function in 
equation 7.1 using only the monotonically decreasing property of the function. 
\[ f(x) = \frac{1 - x}{1 + x} \quad \text{for } 0 \le x \le 1 \tag{7.1} \]
The problem is to find the rational bound on the area in the X-Y plane where x ranges from 
0 to 1 and y ranges from 0 to f(x), where f(x) is as defined before. The problem is to subdivide 
the x and y ranges into an integer power of two equal subintervals and count the squares thus 
defined that are completely inside the area (lower bound) or completely contain the area (upper 
bound). The only knowledge used about the function is that f(x) is a monotonically decreasing 
function, so the upper bound comes from the left function value and the lower bound comes 
from the right function value. The problem solved by HINT is shown in Figure 7.1. 
Figure 7.1 Problem Solved by HINT: Area to be Bounded under the Curve 
Quality is defined as the reciprocal of the difference between the upper and lower bounds. 
The objective of the problem is to obtain the highest quality in the least time, for a given range 
of problem. The range of problem depends upon the precision chosen for the computation. 
At each time step, the Quality improvement is calculated with respect to the time spent 
for the calculation, yielding another metric, QUIPS, which stands for Quality Improvement Per 
Second. 
NetQUIPS summarizes the QUIPS over time. Mathematically, it is the integral of quality 
Q divided by the square of time, from the first time of quality improvement ($t_0$) to the last time 
measured ($t_1$). It is the area under the QUIPS curve plotted on a log(time) scale. For details 
of the metric, refer to [Gustafson and Snell, 1994], [Gustafson and Snell, 1995a]. 
\[ \text{NetQUIPS} = \int_{\log(t_0)}^{\log(t_1)} \text{QUIPS}(t)\, d(\log t) \tag{7.2} \]
7.3 An Example using 8-bit Data Type 
We present here a brief example to explain the task solved by HINT [Gustafson and Snell, 1995a]. 
The reader may like to see [Gustafson and Snell, 1995a] for a detailed version of the following 
example. 
As noted earlier, our task is to bound the function f(x). Assume a word size of $b_d$ bits. Then 
the x and y axes will be represented by $\lfloor b_d/2 \rfloor$ and $b_d - \lfloor b_d/2 \rfloor$ bits. For example, an eight-bit word can 
represent values from 0 to 255. Thus x and y are represented by word sizes of $\lfloor 8/2 \rfloor = 4$ bits and 
$8 - \lfloor 8/2 \rfloor = 4$ bits. So the function f(x) can be superimposed on a grid of 16 by 16. 
Two types of precision are needed: precision of the data and precision of the index. Precision 
of the data is used to count units of the area above and below the function. Precision of the 
index is used to specify positions of x and y. Suppose $b_i$ bits are needed for the index precision. 
Then $b_i \ge b_d - \lfloor b_d/2 \rfloor$. Thus we need at least four-bit indices to specify the index. 
Let $n_x$ and $n_y$ represent the number of units of area in the x and y directions. Let i be the 
column number. Then the function in equation 7.1 can also be written as in equation 7.3. 
Notice that scaling by $n_y$ is done in order to take full use of $b_d$ bits of precision. 
\[ f'(x) = \frac{n_y (n_x - i)/(n_x + i)}{n_y} \quad \text{where } i \in (0, n_x) \tag{7.3} \]
For example, x = 1/2 is represented by i = 8. Then equation 7.3 yields 
\[ f'(1/2) = \frac{16\,(16 - 8)/(16 + 8)}{16} = \frac{128/24}{16} \]
Thus the function f(x) at x = 1/2 can be bounded by 
\[ \frac{5}{16} < f(1/2) < \frac{6}{16} \]
Figure 7.2 Two Subintervals of One Dimension Integration with 8-bit Data Precision (legend: known to contribute to lower bound; limited by arithmetic precision; available for further refinement; known not to contribute to upper bound) 
Figure 7.2 shows the state of the bounds after the first subdivision. There are four regions in the figure: 
area known to contribute to the lower bound, area known not to contribute to the upper bound, area 
limited by arithmetic precision, and area available for further refinement. The upper left area has 
(8 x 11) - 1 = 87 units and the lower right area has (8 x 6) - 1 = 47 units. Both contribute 
to the region that still needs refinement. One square in each of these areas is limited by arithmetic 
precision. We also know for sure that the lower left area, which contains 8 x 5 = 40 units, 
contributes to the lower bound, and that the upper right area, which contains 8 x 10 = 80 units, 
does not contribute to the upper bound. Error is defined as the difference between the upper bound and 
the lower bound. Hence the error is (256 - 80 - 40)/256 = 136/256. Since quality is defined as the inverse of the 
error, it is 256/136. 
Figure 7.3 Sequence of Hierarchical Refinement of Integral Bounds 
(Panels: Partition 2, splits error = 256/256, Quality = 256/136 = 1.88...; Partition 3, splits error = 87/256, Quality = 256/96 = 2.66...; Partition 4, splits error = 47/256, Quality = 256/76 = 3.36...; Partition 5, splits error = 27/256, Quality = 256/64 = 4.00) 
Figure 7.3 shows a sequence of four refinement steps, with steady improvement in 
quality. 
Figure 7.4 shows the last iteration, where no more error can be eliminated. It should be 
noted that for a larger word size, i.e., higher precision, the quality of the integral would be much 
higher. With infinite precision it is possible to achieve infinite quality. However, to achieve 
higher quality we need to spend more computing time. Hence there is a tradeoff between 
quality and time. 
Figure 7.4 Precision-Limited Last Iteration, 8-bit data (legend: known to contribute to lower bound; limited by arithmetic precision; known not to contribute to upper bound) 
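The 8-bit example can be reproduced with a minimal Python sketch of hierarchical refinement on a 16 by 16 grid. This is a simplified model, not the HINT source code: it assumes that the integer bounds at column i are the floor and ceiling of $n_y (n_x - i)/(n_x + i)$, that the subinterval with the largest error is split at its midpoint, and that precision-limited squares are counted as part of the error. All names are illustrative.

```python
NX = NY = 16  # grid size for 4-bit x and 4-bit y

def lower_value(i):
    """Largest integer not above the scaled function value at column i."""
    return (NY * (NX - i)) // (NX + i)

def upper_value(i):
    """Smallest integer not below the scaled function value at column i."""
    return -((-NY * (NX - i)) // (NX + i))

def interval_error(a, b):
    """Squares between the upper bound (left value) and lower bound (right value)."""
    return (b - a) * (upper_value(a) - lower_value(b))

def refine(steps):
    """Return the total error, in grid squares, before and after each split."""
    intervals = [(0, NX)]
    errors = [sum(interval_error(a, b) for a, b in intervals)]
    for _ in range(steps):
        splittable = [iv for iv in intervals if iv[1] - iv[0] > 1]
        if not splittable:
            break
        a, b = max(splittable, key=lambda iv: interval_error(*iv))
        mid = (a + b) // 2
        intervals.remove((a, b))
        intervals.extend([(a, mid), (mid, b)])
        errors.append(sum(interval_error(x, y) for x, y in intervals))
    return errors

if __name__ == "__main__":
    for error in refine(4):
        print("error = %d/256, quality = %.2f" % (error, (NX * NY) / error))
```

Under these assumptions the error after each of the first four splits is 136, 96, 76, and 64 squares out of 256, which matches the quality values listed for Figure 7.3.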
7.4 Salient features 
Salient features of HINT are as follows. 
1. Speed is defined as quality improvement per second (QUIPS). "Quality" is the reciprocal 
of the error, which is the difference between an upper bound and a lower bound on 
the answer. The error thus combines precision loss and discretization error. As the 
benchmark is run, the error steadily decreases until memory or precision is exhausted. 
2. Neither the task size nor the execution time are specified; speed is measured as a function 
of time (or, if you prefer, as a function of the memory size of the problem). 
3. The problem can be run with any data type: floating point (any precision), integer (any 
precision), BCD arithmetic, extended-precision arithmetic, etc. Using a lower precision 
may make the speed higher in the millisecond range, but will cripple performance for 
longer execution times. 
Figure 7.5 Memory Cost versus QUIPS (plot title: Cost versus Performance; curve label: Cost of Main Memory; horizontal axis: Cost) 
4. HINT produces a broad spectrum of graphical output. This helps to unravel the different 
rankings of machine performance observed by different applications run on it. 
5. While HINT provides a graph of performance, it also has a "single number" measure 
(the area under the graph) that summarizes performance in a meaningful way for simple 
one-dimensional rankings. 
6. As the size of the HINT task grows, the memory access pattern becomes more complicated 
in a way that defeats caches. This is more representative of real applications than simply 
increasing the size of a loop to make a benchmark larger. 
7. The only run rule imposed by the HINT benchmark is that no prior knowledge about the
function f(x) = (1 - x)/(1 + x) for 0 ≤ x ≤ 1 may be used, except that it is a monotonically
decreasing function. Thus, any algorithm and any architecture can be used to run the benchmark.
8. Unlike other benchmarks, it takes only a couple of hours to port HINT to a new machine 
environment and to run the benchmark for a near-optimal result. 
9. One of the most important characteristics of HINT is that Quality changes linearly with time
or subintervals. This may lead to many interesting findings, such as the one shown in Figure
7.5, where QUIPS varies linearly with the cost of adding extra memory.
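The definitions in item 1 can be sketched in a few lines of Python. This is a minimal illustration, not code taken from HINT: the function names `quality` and `quips` are hypothetical, and the sketch assumes that QUIPS is read as the quality reached divided by the elapsed time.

```python
def quality(upper_bound, lower_bound):
    """Quality is the reciprocal of the error (upper bound minus lower bound)."""
    error = upper_bound - lower_bound
    if error <= 0:
        raise ValueError("upper bound must exceed lower bound")
    return 1.0 / error


def quips(upper_bound, lower_bound, elapsed_seconds):
    """Quality improvement per second, assuming quality starts from zero."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return quality(upper_bound, lower_bound) / elapsed_seconds
```

As the bounds tighten, the error shrinks and the quality grows; dividing by the time taken gives the speed plotted on the y-axis of a HINT graph.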
7.5 Understanding HINT Graphs 
HINT generates five columns of data in the output file: time, QUIPS, Quality,
subintervals, and memory used in bytes. The first two columns, time and QUIPS, are used to
produce the QUIPS graph. This graph can be thought of as a speed graph of the
computer. The third and fourth columns, Quality and subintervals, can be
used to check for loss of quality due to insufficient precision and poor choice of which rectangle
to split. The fifth column, memory used in bytes, is useful in determining the memory regimes,
i.e., cache and memory sizes.
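A minimal sketch of how such an output file might be read and summarized is given below. Several things here are assumptions rather than facts about HINT: the file layout (five whitespace-separated columns in the order listed above), the helper names `parse_hint_output` and `area_under_quips`, and the choice to take the "area under the graph" over the natural logarithm of time, which mirrors the logarithmic x-axis used for the graphs.

```python
import math


def parse_hint_output(text):
    """Parse rows of: time, QUIPS, quality, subintervals, bytes (assumed layout)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank lines and anything that is not a data row
        try:
            t, q, qual, sub, mem = (float(x) for x in fields)
        except ValueError:
            continue  # skip header or comment lines
        rows.append({"time": t, "quips": q, "quality": qual,
                     "subintervals": int(sub), "bytes": int(mem)})
    return rows


def area_under_quips(rows):
    """Trapezoidal area under QUIPS plotted against ln(time)."""
    pts = sorted((r["time"], r["quips"]) for r in rows if r["time"] > 0)
    area = 0.0
    for (t0, q0), (t1, q1) in zip(pts, pts[1:]):
        area += 0.5 * (q0 + q1) * (math.log(t1) - math.log(t0))
    return area
```

The parsed `time` and `quips` fields give the QUIPS graph, while `bytes` against `quips` gives the memory-regime view discussed later.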
HINT results can be used to interpret and explain a variety of aspects of computer performance.
A recent study [Gustafson and Todi, 1998] shows that HINT has the potential to express
a superset of the information given by any particular fixed-size benchmark. The following analyses
use a range of actual HINT results from the steadily growing database. The systems that
generated the data are not identified because the point here is to understand the interpretation
of the HINT graphs, not to critique specific systems. Each graph is accompanied by an
explanation of how to interpret the data. Unless explicitly mentioned, a logarithmic time scale
for the x-axis and a linear QUIPS scale for the y-axis are used.
7.5.1 Generic HINT Graphs 
Figure 7.6 shows a pair of typical HINT curves for a workstation. The form of the HINT 
curves reveals machine personality. 
The farther left the curve starts, the lower the latency of the system. Note how the 64-bit
performance has several drop-offs, marking the end of primary cache, secondary cache, and
main memory. This workstation has good performance over a broad time scale. It has enough
main memory to show high speed for tasks in the several-second range. After a few seconds,
it runs out of memory and performance falls with the use of disk storage.
Figure 7.6 Generic HINT Graphs (QUIPS versus time in seconds; curves: 32-bit integer, 64-bit floating-point)
7.5.2 Classical Memory-Regime Revealing Graph 
Figure 7.7 reveals the memory hierarchy in a computer system. Towards the left side of
the curve, the problem size is small and fits in the cache completely. Hence, the speed is
at its maximum on this side of the curve. As the problem size increases (moving right), the data
require more memory than the primary cache holds, and hence one can see a sudden dip in
the curve. A similar dip is observed when the problem size exceeds the secondary cache
and main memory sizes.
Hence, a sudden dip in QUIPS reveals a change in the memory regime of the system.
Looking at the above graph, one can easily visualize a three-level memory (two levels of cache
and one level of main memory). A similar graph (Figure 7.9) can be drawn using logarithmic memory
on the x-axis and QUIPS on the y-axis. In such a graph, the approximate cache and memory sizes can be
easily determined from each sudden-dip point and its corresponding x-axis reading.
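The reading of sudden dips described above can be sketched as a simple scan over (memory, QUIPS) samples. This is only an illustration: the function name `find_regime_boundaries` and the 20 percent drop threshold are assumptions chosen for the example, not part of HINT, and real curves with jitter would need smoothing first.

```python
def find_regime_boundaries(memory_bytes, quips, drop_fraction=0.2):
    """Return the memory sizes just before QUIPS falls by more than drop_fraction.

    memory_bytes and quips are parallel lists ordered by increasing memory use.
    """
    boundaries = []
    for i in range(1, len(quips)):
        prev, cur = quips[i - 1], quips[i]
        if prev > 0 and (prev - cur) / prev > drop_fraction:
            boundaries.append(memory_bytes[i - 1])
    return boundaries
```

Each returned value approximates the capacity of one level of the memory hierarchy, since it is the last problem size that still ran at the higher speed.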
Figure 7.7 Memory Regime Revealing Graph (plot titled "Various Memory Regime"; QUIPS versus time in seconds; curve: Workstation; labeled region: Secondary Memory)
7.5.3 Varying Precision 
The graph in Figure 7.8 has logarithmic QUIPS on the y-axis. It shows HINT run on a single
computer while varying the precision of the data type used for the computation: long (32 bits),
double (53 bits) and longlong (64 bits).
There is a tradeoff between answer quality and calculation speed. The 32-bit graph has
the highest QUIPS, but exhausts its precision at about 0.1 second. The 53-bit precision of an
IEEE double allows answer refinement to about 3 seconds. The 64-bit graph, though lowest in
QUIPS, runs the longest time before running out of precision.
If the NetQUIPS for integer precision is higher than that for floating point, it may indicate
a computer designed for business and text-processing work. Scientific computers typically
achieve higher NetQUIPS when using 64-bit floating-point data types. (HINT can be run
using any data type that supports +, -, *, / on whole numbers.)
Figure 7.8 Varying Precision (logarithmic QUIPS versus time in seconds; curves: double, long, longlong)
7.5.4 Varying Main Memory 
Figure 7.9 shows HINT run by varying only the main memory and leaving the rest of
the configuration the same. With each increase in memory size, the area under the graph
increases, implying an increase in the QUIPS rating. Note also that, with
the same cache and at least the minimum amount of main memory (128 megabytes), the left
portions of the HINT curves are almost identical.
The graph becomes more interesting if one adds a cost dimension to it: how much
performance does a user get for each extra dollar?
From this graph, the price-performance graph in Figure 7.5 is derived, where one can find the
performance gain in QUIPS from adding a given amount of extra memory. The figure shows that
QUIPS varies linearly with the cost of adding extra memory. This finding can be of immense
help in optimizing the cost of a personal computer.
Figure 7.9 Varying Main Memory (plot titled "Varying Main Memory Size"; QUIPS versus time in seconds; curves: 128 MB, 256 MB, 512 MB, 768 MB, 1024 MB, 2048 MB, 3072 MB)
7.5.5 Varying Clock Speed 
The HINT graph in Figure 7.10 shows what happens if we increase only the clock
speed, keeping all other system configurations the same. An increase in clock speed reduces
the runtime of a program. But what does it really mean to the actual user? Will
applications of different sizes see the same speedup? The answer is no.
From the graph it is clear that the increase at the left end of the curve is higher than
at the right end. So small applications, which fit in the cache, will see a larger
speedup than larger applications, which do not fit in the cache. This difference is due
to the cache/memory miss rate and miss penalty coming into play for the large applications.
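The argument can be made concrete with a toy runtime model. The sketch below is an assumption-laden illustration, not a model from this thesis: it splits runtime into compute time, which scales with the clock, and memory stall time, which is assumed to stay fixed in seconds when only the clock changes. The function names and the sample numbers in the usage note are hypothetical.

```python
def runtime_seconds(cpu_cycles, clock_hz, cache_misses, miss_penalty_s):
    """Compute time (scales with the clock) plus memory stall time (assumed fixed)."""
    return cpu_cycles / clock_hz + cache_misses * miss_penalty_s


def clock_speedup(cpu_cycles, cache_misses, miss_penalty_s, old_hz, new_hz):
    """Speedup obtained by raising the clock from old_hz to new_hz."""
    old = runtime_seconds(cpu_cycles, old_hz, cache_misses, miss_penalty_s)
    new = runtime_seconds(cpu_cycles, new_hz, cache_misses, miss_penalty_s)
    return old / new
```

With no cache misses, going from 100 MHz to 133 MHz gives the full clock ratio as speedup; with a non-zero miss count the fixed stall time dilutes the gain, which is the behavior seen at the right end of the curve.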
Figure 7.10 Varying Clock Speed (QUIPS versus time in seconds; curves: 100 MHz, 133 MHz)
7.5.6 Cache-Dependent and Cache-Independent systems
In a cache-independent (or balanced) system (Figure 7.11), the different system components
are closely matched. No particular system component strongly limits the system performance.
The processor speed matches closely with the memory bandwidth, interconnection network (if
present) and input-output system. In Figure 7.11, there is a significant dip in the QUIPS for
the cache-dependent (unbalanced) system at each memory level change (primary cache to
secondary cache, secondary cache to main memory). In contrast, for the cache-independent
(balanced) system, the dip in the curve (from primary cache to main memory) is not significant.
This suggests that performance will not drop significantly once the application size grows
from primary cache to main memory.
7.5.7 Dedicated Machine versus Machine with Interrupts 
Figure 7.12 has logarithmic QUIPS on the y-axis and logarithmic time on the x-axis. It
shows the contrast between a dedicated machine and a machine with lots of interrupts. The
jitter in the HINT curve shows the disturbance in the system. The jitter may be due to a multi-user
environment, network daemons, or any other kind of periodic or aperiodic activity of the
system. One can reduce small amounts of jitter by sampling a large number of data points
and by increasing the number of repetitions per data point.
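One common way to use repetitions against jitter is to time the same task several times and keep the fastest run, on the view that interrupts can only add time. The sketch below illustrates that idea; it is an assumption about how repetitions might be combined, not a description of what HINT does, and `best_time` is a hypothetical helper.

```python
import time


def best_time(task, repetitions=5, timer=time.perf_counter):
    """Run task several times and return the smallest elapsed time in seconds."""
    best = float("inf")
    for _ in range(repetitions):
        start = timer()
        task()
        elapsed = timer() - start
        best = min(best, elapsed)
    return best
```

The `timer` parameter exists only so that the clock can be swapped out, for example for a deterministic fake in a test.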
Figure 7.11 Cache-independent and Cache-dependent System (QUIPS versus time in seconds; curves: cache-dependent (unbalanced) system, cache-independent (balanced) system)
7.5.8 Scalable Parallel Computers 
Figure 7.13 shows HINT run on a scalable parallel computer. It plots logarithmic time on
the x-axis and logarithmic QUIPS on the y-axis. There is an initial overhead to run the program
on more than one node, and the overhead increases with the number of nodes. The scalability of the
computer is shown by the fact that the QUIPS increases uniformly as the number of nodes
increases (in powers of 2).
7.5.9 Non-Scalable Parallel Computers 
Figure 7.14 shows HINT run on an unscalable parallel computer. It plots logarithmic time
on the x-axis and logarithmic QUIPS on the y-axis. There is an initial overhead to run the program on
more than one node, and it increases with the number of nodes. The non-scalability of the computer
is shown by the fact that the QUIPS decreases as the number of nodes increases. This
is because the processors share a single memory bus, which saturates as nodes are added.
Figure 7.12 Dedicated Machine versus Machine with interrupts (logarithmic QUIPS versus time in seconds; curves: dedicated machine, machine with interrupts)
7.5.10 Special-Purpose Computer 
Figure 7.15 compares a special-purpose computer with a general-purpose one. The y-axis
is logarithmic. The special-purpose computer has an initial startup overhead. It has excellent
computation speed but is limited by its small memory. For the general-purpose computer, the
initial overhead is lower. It is relatively slower, but it finds a more accurate answer (larger
memory). Usually one needs to customize the HINT code for the special-purpose computer.
7.5.11 Business computer 
A business computer is usually used for spreadsheet and word-processing applications.
Such applications make little or no use of the floating-point hardware in the system. So a good
business computer may have hardware performance as shown in Figure 7.16, where the integer
performance exceeds the floating-point performance. On the other hand, for a scientific workstation
it is desirable that the floating-point performance exceed the integer performance.
Figure 7.13 Scalable Parallel Computer (logarithmic QUIPS versus time in seconds; curves: 1, 4, 8, 16, 32 and 64 nodes)
7.5.12 Serial versus Workstation Clusters 
Figure 7.17 compares a serial computer (a typical workstation) with a cluster of workstations
(of a similar type). The logarithmic QUIPS is plotted on the y-axis. The cluster has high startup
overhead, and the overhead increases with the number of nodes in the cluster. The graph
shows the scalability of the cluster. It also shows that if an application stresses small subtasks
and rapid control changes (i.e., it falls on the left-hand side of the serial curve) that fit in the
cache, it is better to use a single computer.
Similarly, Figure 7.18 shows a Linux-based Pentium Pro cluster. Here the overhead of going
from a single computer to a two-computer cluster increases drastically. However, the overhead
is amortized as the number of computers used for the computation increases.
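The MQUIPS ratings listed in the legend of Figure 7.18 allow a rough check of how well the Pentium Pro cluster scales. The sketch below computes per-node efficiency relative to the two-node run, because a single-node rating is not legible in the figure legend. The helper name `relative_efficiency` and the choice of baseline are assumptions made for this illustration.

```python
# MQUIPS ratings as listed in the legend of Figure 7.18 (nodes -> MQUIPS).
PENTIUM_PRO_MQUIPS = {2: 6.14, 4: 12.17, 8: 24.39, 16: 49.20, 32: 90.26, 64: 163.54}


def relative_efficiency(ratings, base_nodes):
    """Per-node rating of each run divided by the per-node rating of the baseline run."""
    base_per_node = ratings[base_nodes] / base_nodes
    return {n: (r / n) / base_per_node for n, r in sorted(ratings.items())}


if __name__ == "__main__":
    for nodes, eff in relative_efficiency(PENTIUM_PRO_MQUIPS, 2).items():
        print(f"{nodes:3d} nodes: {eff:.3f}")
```

A value near 1 means the run delivers about as much per node as the two-node baseline; values that fall as nodes are added indicate growing overhead.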
Figure 7.14 Unscalable Parallel Computer (logarithmic QUIPS versus time in seconds; curves: 1, 2, 4 and 8 nodes)
Figure 7.15 Special Purpose Computer (plot titled "Special Purpose versus General Purpose"; logarithmic QUIPS versus time in seconds; curves: special purpose, general purpose)
Figure 7.16 Business Computer (QUIPS versus time in seconds; curves: floating point, integer)
Figure 7.17 Serial versus Workstation Cluster (logarithmic QUIPS versus time in seconds; curves: serial, 2 nodes, 4 nodes, 8 nodes)
Figure 7.18 Linux Cluster (plot titled "Pentium Pro Cluster"; x-axis: time in seconds; curves: 2 nodes, 6.14 MQUIPS; 4 nodes, 12.17 MQUIPS; 8 nodes, 24.39 MQUIPS; 16 nodes, 49.20 MQUIPS; 32 nodes, 90.26 MQUIPS; 64 nodes, 163.54 MQUIPS)
7.5.13 Same Machine Different Operating System
Figure 7.19 Serial versus Workstation Cluster (plot titled "Hint Double on G3 - 266 MHz - S:512K - M-96M"; QUIPS versus time in seconds; curves: Mac OS 8.1 (10.71 MQUIPS), MkLinux pre-dr3 (11.60 MQUIPS))
Figure 7.19 compares two different operating systems, Mac OS 8.1 and MkLinux, on an Apple
Power Mac with a G3 processor running at 266 MHz, a 512 KB secondary cache,
and 96 MBytes of memory. As we can see from the graph, MkLinux provides a much smoother
curve, and the performance (QUIPS on the y-axis) of MkLinux is also much higher than that of
Mac OS. This is because Mac OS is a very large monolithic operating system
and has a higher overhead than MkLinux. The pre-release of MkLinux did not provide
disk support when the test was done.
7.5.14 Serial versus Vector Computer
Figure 7.20 compares a vector computer with a high-performance serial computer. The
logarithmic QUIPS is plotted on the y-axis. The NetQUIPS of the vector computer is
nearly twice that of the high-performance serial computer. The HINT code was rewritten
to suit the vector computer. On a vector computer, the speedup of vector HINT with respect
to serial HINT is around 8.
Figure 7.20 Serial versus Vector Machine (plot titled "Serial versus Vectorized HINT on Cray C90"; logarithmic QUIPS versus time in seconds; curves: serial HINT (4.39 MQUIPS), vectorized HINT 128 block (30.78 MQUIPS))
Figure 7.21 provides a comparison of the HINT performance of a single node of a Cray C90 with
other parallel computers. One node of the Cray C90, a vector processor⁴, yielded 30.78 MQUIPS
using the vector version of the HINT code. This is very high performance compared to a
single node of the IBM SP2, a superscalar processor. The graph shows that one node of the Cray C90
is comparable to 32 nodes of the Intel Paragon, 8 nodes of the IBM SP2, around 16 nodes of the T3D, and
128 nodes of the NCUBE-2. All the other machines used the parallel MIMD version of the HINT code.
⁴ A processor capable of operating on all of the elements of an ordered list of operands, usually in a pipelined
manner.
Figure 7.21 Vector versus Parallel Computers (plot titled "HINT for Vector and Parallel Computers"; QUIPS versus time in seconds; curves: NCUBE-2, 256 nodes (42.28 MQUIPS); T3D, 32 nodes (61.20 MQUIPS); IBM SP2, 8 nodes (31.35 MQUIPS); Intel Paragon, 32 nodes (24.65 MQUIPS); Cray C90, 1 node (30.78 MQUIPS); IBM SP2, 4 nodes (16.26 MQUIPS))
7.5.15 Region of Computation
Figure 7.22 is of a scalable parallel computer with 128 nodes. It shows that the region of
computation in a parallel computer is bounded by latency, speed, precision and memory. The
left side of the curve is bounded by latency, the top of the curve is bounded by peak speed,
and the right side of the curve is bounded by precision or memory. All computation has to
be done within this region. Clearly, we can enlarge this region by reducing latency, increasing
the speed, and increasing the precision or memory of the computer.
7.5.16 Superset of Other Benchmarks 
HINT can predict the performance of other benchmarks with high accuracy. In most cases
the correlation was found to be higher than 0.995, with monotone ranking. Most of the fixed-size
parallel and serial benchmarks, such as LINPACK, NAS and SPEC, are found to be sample points
of the generic HINT curve in Figure 7.23. Small-problem-size benchmarks such as Fhourstone and
Whetstone correlate well with the left side of the HINT curve, indicating low memory bandwidth
and computation requirements. Benchmarks such as STREAM and SPECfp correlate well with
the right side of the HINT curve, indicating high traffic on the memory bus and a large problem
size.
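A check of this kind, correlation plus agreement in ranking, can be sketched as follows. The sketch assumes that "correlation" means the Pearson coefficient between HINT-based ratings and another benchmark's ratings over the same machines, and that "monotone ranking" means both rate the machines in the same order; the function names are hypothetical and no measured data is included.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def same_ranking(xs, ys):
    """True if both sequences order the machines identically."""
    order_x = sorted(range(len(xs)), key=lambda i: xs[i])
    order_y = sorted(range(len(ys)), key=lambda i: ys[i])
    return order_x == order_y
```

Here `xs` would hold the HINT-derived rating for each machine and `ys` the rating from the benchmark being compared, with machines in the same order in both lists.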
Figure 7.22 Region of Computation (logarithmic QUIPS versus time in seconds; the region of computation is marked as limited by latency on the left, by peak speed at the top, and by precision or memory on the right)
7.5.17 Problem Detection using HINT 
HINT graphs come in handy for visualizing performance bottlenecks and machine misconfigurations.
After observing a couple of graphs, one can easily learn what kind of graph to
expect for an updated system or a newly acquired or to-be-acquired machine. In the
previous sections we have seen a number of examples of HINT graphs, and we now
understand what each portion of the graph represents. We present some
situations where HINT was helpful in detecting problems that would not have been
easily detected by other benchmarks. Some problems that can be observed from a HINT graph
have already been shown: cache-dependent
(unbalanced) or cache-independent (balanced) systems as in Figure 7.11; machines with interrupts
as in Figure 7.12; and scalable and unscalable parallel computers as in Figures 7.13 and
7.14.
Figure 7.23 Superset of Other Benchmarks (QUIPS versus memory in bytes, logarithmic x-axis; benchmarks marked as sample points along the HINT curve: Whetstone; SPECint; LINPACK 100x100; Fhourstone, Dhrystone, Tower of Hanoi, Queens, Fibonacci, etc.; SPECfp; STREAM; LINPACK 1000x1000)
7.5.18 Identical Machines Varied Performance
HINT is the only benchmark program to clearly identify anomalies in the performance of
supposedly identical machines. The two curves in Figure 7.24 are from two machines with
identical hardware, motherboard, memory and cache sizes, operating system and compilers,
with all the same version numbers (even the purchase date and vendor are the same). Yet their
HINT graphs are quite different. This difference was later diagnosed to be due to misalignment
of double data caused by a difference in the runtime environment. Such differences can be easily
visualized through HINT curves. This experiment was performed by Dr. Don Heller at Ames
Laboratory.
Figure 7.24 Identical Machine Varied Performance (plot titled Hint DOUBLE, "identical" systems; x-axis: time)
Troy Benjegerdes of Ames Laboratory reported a performance difference among 8 identical nodes
of a MOSIX⁵-enhanced Linux cluster. The cluster is composed of dual-processor 450
MHz Pentium II personal computers. It is installed at the Scalable Computing Laboratory, Ames Laboratory
facility. The HINT graph of the 8-node system running serial code is shown in Figure 7.25.
It can be seen that the performance curve of node 2 behaves differently from the rest of the
nodes. There is a sudden drop in the QUIPS performance of node 2 once the problem size exceeds
that of the secondary cache. The results show either the absence of memory or poor performance
of the memory subsystem beyond the secondary cache due to reconfiguration. Such a problem is
rapidly and easily detected with the help of HINT.
⁵ MOSIX [Barak and La'adan, 1998] is a software package that enhances the Linux kernel with cluster computing
capabilities. The enhanced kernel allows a single system image for any size of cluster of X86/Pentium
based workstations and servers to work cooperatively.
Figure 7.25 Mosix Cluster's Identical Nodes Perform Differently (plot titled "Serial HINT on 8-node Mosix cluster"; QUIPS versus time in seconds; annotations: only node 2 of the 8 identical cluster nodes performs differently; all other nodes perform similarly)
7.5.19 Bug in Motherboard's BIOS software
Dr. David Turner, scientist at Ames Laboratory, reported that with HINT he was able to
detect problems related to timers. Figure 7.26 shows two QUIPS curves of the same AlphaPC
LX 533 MHz. One is with a faulty BIOS⁶ and the other is with a corrected BIOS. This problem
was detected while comparing a newly purchased AlphaPC LX running at 533 MHz with the
older AlphaPC LX at 300 MHz and 500 MHz. The bug in the BIOS on the LX machine caused
the clock ticks to be counted incorrectly. The clock was slow by a factor of two, so the HINT
benchmark results were abnormally high by a factor of two. I would like to acknowledge
Brian Smith of the SCL system group for restoring the older LX data from the archive.
⁶ BIOS stands for Basic Input/Output System. The system BIOS is the lowest-level software in the computer;
it acts as an interface between the hardware and the operating system.
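The factor of two follows directly if QUIPS is taken as quality divided by measured time. The short derivation below is a sketch under that assumption; the symbols Q (quality reached), t_true (actual elapsed time) and t_measured (time reported by the slow clock) are introduced here only for illustration.

```latex
\[
\mathrm{QUIPS}_{\text{reported}}
  = \frac{Q}{t_{\text{measured}}}
  = \frac{Q}{t_{\text{true}}/2}
  = 2\,\frac{Q}{t_{\text{true}}}
  = 2\,\mathrm{QUIPS}_{\text{true}}
\]
```

A clock that counts half as many ticks reports half the elapsed time for the same quality, so every point on the curve is doubled and shifted left on the time axis.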
Figure 7.26 Bug in Alpha LX motherboard's BIOS (plot titled "Timer Problem of Alpha 533 MHz LX System"; QUIPS versus time in seconds; curves: faulty BIOS, annotated "Clock tick is slow by a factor of two due to bug in BIOS", and corrected BIOS)
7.5.20 Dual processors Pentium machine with Slow Memory Bandwidth
A number of commodity clusters [Baker and Buyya, 1999], [Baker et al., 1999], such as the
Ames Laboratory Integrated Cluster Environment (ALICE) [Todi et al., 2000], have cost-constrained
dual (or more) SMP processors with a slow shared memory bus. Unfortunately, low
memory bandwidth restricts the performance advantage of dual (or more) processors to only
certain problem sizes: those that are not so big that they contend for memory bandwidth with other
processors, and not so small that they cannot be efficiently partitioned into two or more
tasks. Smaller tasks have the added disadvantage of relatively high overhead due to the
latency of the threads or processes used to parallelize the task.
Figure 7.27 shows one such system with two Pentium II processors running at 300
MHz. The memory bus bandwidth was around 528 MBytes/sec⁷. The operating system under
measurement was Windows NT, and its native thread functions were used to parallelize the
HINT code. The graph clearly shows the huge latency to start the threads. Also, the performance of
two threads on two processors is greater than the serial code
performance on a single processor only at certain memory sizes. The improvement in performance
in those regions is not the factor of two one might expect.
If the problem size increases beyond a certain point, there is so much memory
activity that the memory bus saturates and contention for the shared memory bus increases.
In that case there is no performance improvement from having two processors over one processor.
Thus, for an application to show improved performance on a dual-processor system
over a single-processor system, its problem size has to correspond to the region in the figure
showing improved performance.
⁷ 64-bit data bus running at 66 MHz
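The bandwidth figure follows from the bus parameters given in the footnote. The sketch below shows the arithmetic, assuming one transfer per bus clock and 1 MByte = 10^6 bytes; the function name is hypothetical.

```python
def peak_bus_bandwidth_bytes_per_s(bus_width_bits, clock_hz):
    """Peak bandwidth of a bus that moves one full-width word per clock cycle."""
    return bus_width_bits * clock_hz / 8


# 64-bit data bus at 66 MHz, as stated in the footnote.
bandwidth = peak_bus_bandwidth_bytes_per_s(64, 66_000_000)
print(bandwidth / 1e6, "MBytes/sec")
```

This is a peak value shared by both processors, which is why large problems that stream through memory gain nothing from the second processor.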
Figure 7.27 Serial versus Threaded HINT on Dual Processors 300 MHz Pentium II (plot titled "Serial Vs Threaded HINT code on Dual Pentium II 300 MHz on shared 66 MHz memory bus"; x-axis: time in seconds; curves: 1 thread 2.76 MQUIPS, 2 threads 3.28 MQUIPS, serial 9.40 MQUIPS; annotations: high latency cost to start threads; only place where two threads on two processors perform better than the serial code on one processor; memory bandwidth is saturated, hence it is a bottleneck for large problem sizes)
CHAPTER 8 Application Signature 
Application Signature is a machine-independent abstraction of the inherent characteristics
of an application. The distinguishing signature is convolved with a hardware signature to
predict the application performance. This chapter reviews the history of Application Signature
technology and informally defines it. Towards the end of the chapter, the hardware and applications
used for the experiments are listed. The terms Application map (APPMAP) and Application
signature are used interchangeably throughout this thesis.
8.1 History of Application Signature 
Dr. John Gustafson coined the term Application Signature in a keynote speech at POC'96 in
France [Gustafson, b]. The presentation of Application Signature was an attempt by Dr.
Gustafson to develop a scientific framework to speedily estimate the performance of an application.
As discussed in Chapter 5, most of the contemporary metrics used in performance
analysis are not rigorous. In the absence of a good metric, several scientific presentations
use relative speedup to compare different design alternatives. In doing so these presentations
invariably keep the work done constant by keeping many aspects of hardware and software
the same, and thus use absolute time as a metric. However, the true performance of a machine is
an intricate interplay between applications and machines, and in practice achieving the best performance
means optimizing and changing parameters such as cost, memory size, precision, algorithm,
and architecture simultaneously. Is there a way to say how well-suited a given computer is for
an application from the perspective of actual use?
We have seen in Chapter 7 that HINT graphs visually characterize many hardware
permutations such as memory sizes, precisions, memory subsystems and computer architecture
designs. Also, the performance metric QUIPS (discussed in Chapters 5 and 7) is mathematically
sound and meets all the criteria of a good metric.
If HINT and QUIPS constitute a good model for a hardware signature, does a similar 
model exist for an application signature? Is such an application signature independent of the 
machine on which the application is run? Here is a summary of two hypothetical application 
signatures [Gustafson, b]. 
Figure 8.1 Hypothetical Application Signature for (a) Word Processing Application (b) Computational Fluid Dynamics (both panels: x-axis: time in seconds; y-axis: fraction of activity, total area = 1)
Consider Figure 8.1(a) for a word-processing application. The horizontal axis is time, and it
uses the same time scale as the HINT graph. The vertical axis is a density function showing the
amount of activity for a given time. Events or tasks in applications with
a small memory footprint, such as small loops, contribute towards the left side of the graph. Events or tasks
with enormous memory activity, such as large-stride loops, contribute
towards the right side of the graph. In the figure, the word-processing application shows spikes
at the left for events such as keystrokes. Events such as save, document merge, and
search, which take a much larger amount of time to complete than tiny loops, appear towards
the right side of the figure.
Figure 8.1(b) is a hypothetical application signature for a computational fluid dynamics
(CFD) simulation. In a CFD application, all the memory available to the application is used to store
the state of the fluid. For every change of state of the fluid, the application has to traverse
the memory several times. Therefore, almost all the events are recorded
towards the right side of the application signature graph.
8.2 What is Application Signature? 
Application Signature is a set of predictive values of the application that, when convolved
with the hardware signature, yields the application runtime. To be specific, Application Signature
is a set of weights at different memory points or time points of the QUIPS-Memory or QUIPS-Time
graph, respectively. The application weights at different memory points or time points
reveal the demand of the application to perform a certain function.
Our conjectures for Application Signatures are as follows: 
• Such an application profile of the demands of the application is machine-independent.
• Such an application profile is specific to the underlying behavior of an application rather than to a
specific implementation. For example, different word-editing programs may have similar profiles.
• An application signature that is a function of time would remain the same even after years.
This means that over time the profile of the application remains the same unless there is some
paradigm shift, for example, if instead of inputting data
by hand into word-editing software we start inputting data by voice.
Definition 5 If H is the hardware signature and A is the application signature, then the Application
Signature time or APPMAP time, AT, is the convolution of A and H, where the
convolution is defined to be the weighted harmonic mean¹ of H, the weights being given by A.
Ideally the above definition implies
(8.1) 
¹ The weighted harmonic mean is defined in Equation 6.10.
However, in practice the HINT and Application Signature graphs are divided into n memory
points or time points. Let H_i denote the hardware speed at the ith memory or time point and
A_i denote the application weight at the ith memory or time point. The application time AT
is then given by the following equation.
\[ AT = 1 \Big/ \sum_{i=1}^{n} (A_i / H_i) \tag{8.2} \]
where
\[ \sum_{i=1}^{n} A_i = 1 \tag{8.3a} \]
\[ 0 \le A_i \le 1 \quad \forall i \tag{8.3b} \]
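Equation 8.2 with its constraints can be sketched directly in Python. This is a minimal illustration rather than the implementation used for the experiments: the function name `appmap_time`, the tolerance on the weight sum, and the sample values in the usage note are assumptions.

```python
def appmap_time(weights, speeds, tol=1e-9):
    """Weighted harmonic mean of hardware speeds H_i with application weights A_i.

    Follows Equation 8.2: AT = 1 / sum(A_i / H_i), subject to Equations 8.3a and 8.3b.
    """
    if len(weights) != len(speeds):
        raise ValueError("weights and speeds must have the same length")
    if any(a < 0 or a > 1 for a in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    if any(h <= 0 for h in speeds):
        raise ValueError("hardware speeds must be positive")
    return 1.0 / sum(a / h for a, h in zip(weights, speeds))
```

For example, `appmap_time([0.5, 0.5], [2.0, 4.0])` weights two memory points equally; if all the weight sits on one point, the result is simply the hardware speed at that point.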
An application's behavior changes constantly with time. At one instant an application
can be doing an input-output task, polling for an event to happen, and at another instant it can
be doing a floating-point-intensive task that stresses memory bandwidth by loading and
storing variables. Hence, the Application Signature, A, is a function of time that captures
multiple tasks being done sequentially [Todi, 2001].
Let T be the total runtime of an application. Let us assume that we observe the behavior
of the application in p time intervals. Assume each interval is of equal length T/p. In each T/p
interval a task or a group of tasks is being executed. However, there is overhead in measuring p
time intervals. A high value of p will yield more intervals, implying finer resolution to quantify
each task, but at the expense of high measurement overhead. On the other hand, a lower value
of p will average out the task behavior but will incur only a low measurement overhead.
The value of Application Signature A at the ith interval is given by
\[ A^{i} = [A^{i}_{1}, \ldots, A^{i}_{n}] \quad \text{where } 1 \le i \le p \tag{8.4} \]
So from the above definition, A^1 is the value of Application Signature as observed in the
first interval. The application signature for a complete run of an application observed in p
intervals can then be specified as follows:
\[ A = (A^{1}, A^{2}, \ldots, A^{p}) \tag{8.5} \]
The hardware signature H will remain the same throughout the runtime T. The application
time as defined in Equation 8.2 can now be rewritten as follows:
\[ AT = \sum_{i=1}^{p} \left( 1 \Big/ \sum_{j=1}^{n} (A^{i}_{j} / H_{j}) \right) \tag{8.6} \]
The importance of the above equation is that in an application signature the final application
weights represent a weighted harmonic mean over the different tasks in the p time intervals
(phases) of an actual run. The same equation is valid whether we take a single task or a group
of tasks. As the run progresses, the weights in the application signature keep re-adjusting
with the run. For the purpose of this thesis, p is taken to be 1, implying that we observe the
application runtime in a single interval from start to end. The application signature then
indicates the cumulative behavior of all tasks in an application.
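The weighted harmonic mean referred to above can be illustrated numerically. The following
is a minimal sketch for the single-interval case (p = 1), assuming hypothetical application
weights and hypothetical hardware rates; the function name and all numbers are illustrative
and are not taken from the thesis.

```python
def weighted_harmonic_mean(weights, rates):
    """Weighted harmonic mean of rates.

    weights: application weights, each in [0, 1] and summing to 1 (Equations 8.3a, 8.3b)
    rates:   hardware rates at the corresponding operating points (hypothetical values)
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    return 1.0 / sum(w / h for w, h in zip(weights, rates))


# Hypothetical signature and hardware rates, for illustration only.
signature = [0.2, 0.3, 0.5]
hardware = [4.0e6, 2.0e6, 1.0e6]
effective_rate = weighted_harmonic_mean(signature, hardware)
print(effective_rate)
```

Because the mean is harmonic, the slowest operating point dominates whenever it carries a
large weight, which is the behavior the signature is meant to expose.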
8.3 Characteristics of Application Signature 
Now that we understand the definition of Application Signature, we can summarize the 
desired characteristics of Application Signature. They are listed as follows: 
1. Machine Independent: An application signature should be machine-independent. 
2. Reliable: When used with the hardware signature, the application signature should ac­
curately predict the actual runtime of the application. One can validate the results using 
standard statistical techniques. 
3. Easy to obtain: The application signature should be easy to obtain and should require
no special hardware. Moreover, the profile should be available within a few hours.
4. Visually Meaningful: Application signatures should be visually helpful in comparing
different applications. If application A and application B have the same executable
path but are run with different input data sets, then the application signatures of A and
B should look alike except for the relative weights at different time or memory points.
Similar application signatures would signify similar tasks being done by applications
A and B.
5. Application Characteristics Revealing: If an application A is a memory-intensive
application then its application signature should expose that pattern. Just by comparing the
application signatures of application A and application B, one should be able to compare
the memory-related demands of the applications.
6. Sustain Moore's Law: Using the Application Signature as a function of time instead of
problem size would allow the application signature to remain constant over the years. A
good discussion of the advantages of using time as the horizontal axis for the application
signature and hardware signature can be found in [Gustafson et al., 1991], [Gustafson, b].
7. Easy to Store: Unlike instruction traces that are bulky and hard to store, the application 
profile should be easy to store. 
8.4 Modeling Application-Architecture Performance: A Car Transportation Analogy
A car transportation model [Gustafson, 1998] is the easiest to understand and is the closest
analogy to the proposed application performance model. Every car has a different accelerating
speed, peak speed, and stopping speed depending on the make of the engine. The different
speeds come into play in different states of the car. The speed of a car also depends on the
road type and condition. For example, a sports car can take a sharper turn whereas a racing
car may not be able to do so.
There are many ways to estimate the time it takes to go from Ames, Iowa to Chicago,
Illinois. The most precise way is probably to estimate the distance traveled by the car at
each different speed. An approximate way to estimate the travel time is to know the average
car speed and the distance traveled in each of the major segments, such as interstate
highway or residential area. For example, let us assume a trip of 62 miles (as shown in Table
8.1) is divided into three segments: residential area, state highway, and interstate highway.
The total time can be estimated by measuring the average car speed in each segment and
then using the distance traveled in each segment to calculate the time taken in that segment.
The sum of the times taken in the segments gives the approximate time taken for the trip.
Table 8.1 A Simple Car Analogy to Calculate Time Taken for a Trip
Travel Segment         Distance   Average Speed      Time Taken
                       (miles)    (miles per hour)   (hours)
Residence Area         2          20                 0.1
State Highway          10         40                 0.25
Inter State Highway    50         70                 0.71
In the above example a car and its driver represent a computer system (hardware as well
as operating system, compilers, etc.). A trip can be thought of as an application of the car. A
trip consists of many different kinds of roads; similarly, an application contains different
characteristics. Different cars/machines will have different average speeds on different
roads/applications. Hence, different trips/applications take different times to complete. One
can develop a more realistic model of car performance by subdividing the travel segments at a
finer granularity, where each segment indicates similar car performance.
Readers are referred to [Gustafson, 1998] for a more complex car analogy.
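The segment-by-segment estimate of Table 8.1 can be written out as a short calculation. This
is a minimal sketch using the distances and average speeds from the table; the function and
variable names are illustrative only.

```python
# (segment name, distance in miles, average speed in miles per hour), from Table 8.1
segments = [
    ("Residence Area", 2.0, 20.0),
    ("State Highway", 10.0, 40.0),
    ("Inter State Highway", 50.0, 70.0),
]


def segment_times(trip):
    """Time in hours for each segment: distance divided by average speed."""
    return [(name, distance / speed) for name, distance, speed in trip]


def trip_time(trip):
    """Total trip time in hours: the sum of the per-segment times."""
    return sum(hours for _, hours in segment_times(trip))


total_distance = sum(distance for _, distance, _ in segments)
total_hours = trip_time(segments)
for name, hours in segment_times(segments):
    print(f"{name}: {hours:.2f} h")
print(f"Total: {total_distance:.0f} miles in {total_hours:.2f} h")
```

The same structure carries over to the performance model: segments play the role of
operating regimes, distance the role of work done in a regime, and speed the role of the
machine's rate in that regime.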
8.5 Application Performance Model 
In order to establish a model using Application and Hardware Signatures, we must first 
understand the essential aspects of the real applications and essential features of computers 
that we should measure. 
8.5.1 Hardware Performance Predictors 
For the past few decades, much of the algorithms and architecture literature has defined the
design guideline for algorithm-architecture fit as follows [Gustafson, 1998]:
• Measure the number of arithmetic operations in the application. The smaller the number,
the faster it will run.
• Measure how many arithmetic operations per second the computer design can produce.
The higher the number, the faster it will run.
However, there is ample evidence that this design guideline no longer works. For
example, as shown in Figure 4.2, the "Peak Advertised Performance" number correlates
inversely with the application performance. Here are a few other observations on why the
above model no longer works.
[Plot: time in seconds versus year, 1930 to 2000, with one curve for scalar arithmetic and one
for memory latency; labeled points include Atanasoff-Berry, Cray 1, and Cray 2, and labeled
values include 15 sec, 16 msec, 138 nsec, and 29 nsec.]
Figure 8.2 Gustafson's Great Crossover: The crossover of memory and 
arithmetic performance 
Moore's law states that every aspect of computer performance doubles every 18 months.
This statement seems to be true for processor speed but not for memory speed. The
improvement in memory bandwidth, though exponential, proceeds at a much slower pace than
that of processor speed. Furthermore, memory latency is becoming worse with the
time it takes to address larger memories.
In fact, there was a crossover in the 1970s regarding which was more expensive: memory
references or arithmetic [Gustafson, 1998], [Gustafson and Todi, 1999a], as illustrated in
Figure 8.2. The figure demonstrates how over three decades memory operations have become
costlier than arithmetic operations. In 1970, ILLIAC IV took one microsecond either
to fetch a word from memory or to do a floating-point operation. At about the same time, the
total cost of the wire connections began to overtake the cost of gate transistors, and the total
time spent moving signals began to overtake the total switching time. In 1996, the Pentium
III based computers took 1 nsec for arithmetic operations but 100 nsec for memory
operations. Dr. John Gustafson coined the term The Great Crossover for this phenomenon in
design, which is present in all kinds of conventional computers from mini-PCs to supercomputers.
Claim 8.5.1 Operations are free; data motion isn't. 
The obvious implication of The Great Crossover is that any scientific measurement of
application performance using hardware performance counters can show that the cost of memory
operations is substantial [Vasiliu, 2000]. Thus memory speed, that is, memory bandwidth and
memory latency, is a good predictor of hardware performance.
8.5.2 Application Performance Predictors 
Different applications/benchmarks sample different instruction mixes, and different mem­
ory regimes. This is a major reason why rankings differ depending on applications/benchmarks. 
Clearly, one could define a problem that depended heavily on disk output and another prob­
lem that depended on floating-point multiply-add operations, and show different rankings for 
business and scientific computers. 
While it is obvious that the possible differences in feature emphasis are vast, we conjecture 
the following [Gustafson and Todi, 1999b]: 
Claim 8.5.2 The main difference between feature emphasis in applications and benchmarks is 
the dominant data type. 
Thus, graphics benchmarks operate on pixels, scientific benchmarks use floating-point num­
bers, business benchmarks use integers and character data. We can select a HINT measurement 
that uses that dominant data type. 
The other major difference between benchmarks is the "computational intensity", or the
ratio of operations to memory references. For example, matrix multiplication and LINPACK
have very high computational intensity, whereas Dhrystone and STREAM have very low
computational intensity. But the issue may not be the ratio of operations to references so much
as the location of the data. Every operation implicitly references data, and perusal of the code
may reveal a register or memory source. But on modern architectures, the data actually can come
from registers, primary cache, secondary cache, main memory, or mass storage (virtual memory),
with performance that can range over a factor of a thousand. This leads to our second conjecture:
Claim 8.5.3 The main difference between "computational intensity" or "size" of a computer 
benchmark is the time spent in different memory regimes. 
8.5.3 The Proposed Computer Design Model 
The proposed computer design model [Gustafson, 1998] using the Application Signature and
the HINT benchmark is as follows. Preliminary results for the design model were presented in
[Gustafson and Todi, 1999b]. The model is further explored and refined in this thesis.
1. Measure the AppMaps of the applications in question.
2. Use HINT to measure the hierarchical memory performance.
3. Verify that the HINT-AppMaps combination predicts actual performance by measuring
statistical correlation and maximum deviation from linearity.
4. Model the cost of varying the computer features.
5. Vary proposed computer designs with AHINT to optimize performance or performance
divided by cost.
8.6 Experiment Setup 
8.6.1 Machines Used 
For the purpose of this thesis, eight different machines were used. These are coded
as M1 through M8. All these machines are based on the MIPS architecture and were
manufactured by SGI, using R10000 and R12000 microprocessors. Readers who would like to
see results based on different architectures and parallel computers are referred to
[Gustafson and Todi, 1999b].
Tables 8.2 and 8.3 provide the processor type, cache, and memory configurations of the tested
machines. These machines vary in memory subsystems and clock speed. Machines M2 and M3
are different nodes of the same machine.
Table 8.2 Machines Processors and Cache Configurations
Machine  Machine   Processor  Clock Speed  L1 I     L1 D     L2
Id       Name      Type       (MHz)        (bytes)  (bytes)  (bytes)
M1       hydra     R10000     194          32768    32768    2097152
M2       helix     R10000     194          32768    32768    2097152
M3       helix     R10000     194          32768    32768    1048576
M4       chronus   R10000     195          32768    32768    1048576
M5       tajar     R12000     270          32768    32768    1048576
M6       hermes    R10000     180          32768    32768    1048576
M7       dc        R10000     250          32768    32768    1048576
M8       exiguus   R10000     225          32768    32768    1048576
Table 8.3 Machines Memory Configurations
Machine  Machine   Max. Number    Machine                           Machine
Id       Name      of Processors  Configuration                     Name
M1       hydra     8              2048 Mbytes, 8-way interleaved    Power Onyx
M2       helix     6              1536 Mbytes, 2-way interleaved    Power Challenge
M3       helix     6              1536 Mbytes, 2-way interleaved    Power Challenge
M4       chronus   2              256 Mbytes                        Indigo Impact
M5       tajar     1              384 Mbytes                        O2
M6       hermes    4              1024 Mbytes, 4-way interleaved    Origin 200
M7       dc        2              256 Mbytes, 2-way interleaved     Octane
M8       exiguus   1              128 Mbytes                        O2
The HINT benchmark was run on all the machines, M1 to M8, to collect the hardware
signatures. Two data types, int and double, were used for the computation. The QUIPS-Time
graph for machines M1-M8 using double computation is provided in Figure 8.3. The figure
shows that many of the graphs cross each other, indicating that relative system performance
differs with problem size. Different rankings of computers at different tasks can
often be traced to HINT curves that cross.
Appendix E provides detailed HINT data and QUIPS-Time and QUIPS-Memory graphs
using double and int data types for computation. Appendix F provides the memory latency
data from the LMBENCH benchmark run on all the machines. In particular, Table F.1
summarizes the memory latency for each of the machines. Appendix G provides the memory
bandwidth data from the STREAM benchmark run on all the machines. In particular, Table
G.1 is the summary of memory bandwidth for each of the machines.
[Plot: QUIPS (0 to 2e+06) versus time in seconds on a logarithmic axis for machines M1-M8;
legible legend entries include "M3 Helix N3 15.82 MQuips" and "M7 DC NO 19.72 MQuips".]
Figure 8.3 HINT (Double) QUIPS-Time Graph for Machines M1-M8
8.6.2 Benchmarks Used
For the purpose of the thesis, SPEC's CPU95 benchmarks were used. SPEC CPU95 is
explained in Chapter 3. Table 8.4 lists the CINT95 benchmarks, which are integer
benchmarks. These integer benchmarks are coded as I1 through I17. Table 8.5 lists the CFP95
benchmarks, which are floating-point benchmarks. These floating-point benchmarks are coded
as F1 through F11. SPEC's CPU95 benchmarks have three datasets: ref, train, and
test. We use the ref dataset, which is the largest of the three and is also the dataset whose
results SPEC uses for rating machines. For this thesis, a benchmark run with
two different inputs is considered as two different benchmarks. For instance, in Table 8.4,
the benchmark 099.go has four different inputs, implying four separate benchmarks.
Table 8.4 Integer Benchmarks
Application
Id     Name            Input
I1     099.go          null.in
I2     099.go          null1.in
I3     099.go          5stone21.in
I4     099.go          9stone21.in
I5     147.vortex      vortex.in
I6     132.ijpeg       penguin.ppm
I7     132.ijpeg       specmun.ppm
I8     132.ijpeg       vigo.ppm
I9     126.gcc         1expr.i
I10    126.gcc         1recog.i
I11    126.gcc         1reload1.i
I12    126.gcc         2stmt.i
I13    124.m88ksim     ctl.raw
I14    124.m88ksim     test.raw
I15    129.compress    bigtest.in
I16    129.compress    test.in
I17    130.li          -
All the benchmarks were compiled to non-shared binaries. The advantage of non-shared
binaries is that the binaries include all the related libraries. We compiled the binaries on
machine M1 and then ported the binaries to the rest of the machines.
Appendix C and Appendix D summarize the application characterizations for these
benchmarks using hardware event counters found on R10000 and R12000 microprocessors. These
counters were measured using the perfex tool provided by SGI.
Table 8.5 Floating-Point Benchmarks
Application
Id     Name            Input
F1     103.su2cor      su2cor.in
F2     102.swim        swim.in
F3     102.swim        swim2.in
F4     110.applu       applu.in
F5     145.fpppp       natoms.in
F6     141.apsi        apsi.in
F7     146.wave5       wave5.in
F8     107.mgrid       mgrid.in
F9     125.turb3d      turb3d.in
F10    101.tomcatv     tomcatv.in
F11    104.hydro2d     hydro2d.in
8.7 Summary
This chapter provides an overview of the Application Signature. Chapter 9 provides
the notations and definitions used for formally defining the hardware and application
signatures. Chapters 10, 11, 12, and 13 provide four different models to determine the
application signatures.
CHAPTER 9 Definitions and Notations 
Let us assume there are N machines and M applications. The same binary executable
run with different input data is considered a different application.
Let M_1, M_2, ..., M_N represent the N machines. Let H^p_1, ..., H^p_N be the
corresponding hardware signatures (HINT), where p is the precision, which is either D or I.
The value p equal to D implies that the HINT data type used for computation is double
precision, whereas the value p equal to I implies that the HINT data type used for computation
is integer precision. HINT requires another precision for indexing of intervals. In measuring
HINT graphs H^p_i, for all p and 1 <= i <= N, integer precision is used for indexing.
Let m_1, m_2, ..., m_{Mem} be the memory points where QUIPS was either computed or
statistically derived; m_{Mem} is the maximum memory operating point where HINT was measured.
The memory points are logarithmically spaced vectors.
Definition 6 The relation m_i < m_j implies that memory size m_i is less than the memory size
m_j.
Definition 7 Quality is the reciprocal of the difference between the upper and lower bounds of 
the function in Equation 7.1. 
There are several important lemmas that are related to quality of the answer. 
Lemma 1 Quality is linear in time (on a logarithmic scale) or subintervals.
Lemma 2 Quality has unlimited scalability and can scale with the precision and memory
available.
Lemma 3 Without precision loss, the quality is equal to subintervals.
Lemma 4 With precision loss, the quality is less than subintervals.
Definition 8 Given two machines k_i and k_j, a relation k_i > k_j implies that machine k_i is
ranked higher than machine k_j, i.e., the performance of machine k_i is superior to machine k_j.
Similarly, a relation k_i = k_j implies that the machines k_i and k_j are ranked the same.
Definition 9 Instantaneous Quality (Q_{p,i})_k is the quality attained at a memory operating
point i in a HINT graph with precision p, where p ∈ {D, I}, i ∈ {m_1, m_2, ..., m_{Mem}}, and
k ∈ {1, ..., N}.
Lemma 5 The function (1 - x)/(1 + x), where 0 <= x <= 1, used for hierarchical integration in
HINT is monotonically decreasing.
Proof: The proof is by contradiction. Assume the function is not monotonically decreasing.
Then there exist two points x_1 and x_2 such that x_2 > x_1 and
(1 - x_2)/(1 + x_2) > (1 - x_1)/(1 + x_1). This implies
(1 - x_2)(1 + x_1) > (1 + x_2)(1 - x_1). Simplifying the inequality leads to
1 >= x_1 > x_2 >= 0, which is a contradiction. Hence proved.
The monotonically decreasing property is also illustrated in figure 7.1. According to
HINT rules, the only knowledge permissible for interval subdivision is that the function
is monotonically decreasing. •
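The monotonicity claim can also be checked directly. The short derivation below is a sketch
that assumes the HINT integrand is f(x) = (1 - x)/(1 + x) on [0, 1], which is the form the
cross-multiplied inequality in the proof suggests; the symbol f is introduced here only for
illustration.

```latex
% Assumption: f(x) = (1 - x)/(1 + x) on 0 <= x <= 1 (the HINT integrand).
\[
  f(x_1) - f(x_2)
  = \frac{(1 - x_1)(1 + x_2) - (1 - x_2)(1 + x_1)}{(1 + x_1)(1 + x_2)}
  = \frac{2\,(x_2 - x_1)}{(1 + x_1)(1 + x_2)} > 0
  \quad \text{whenever } 0 \le x_1 < x_2 \le 1 .
\]
% Equivalently, the derivative is negative on the whole interval:
\[
  f'(x) = -\frac{2}{(1 + x)^2} < 0 .
\]
```

Both forms give the same conclusion as the proof by contradiction: a larger argument always
yields a smaller function value.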
Lemma 6 For a machine k using precision p for computation, if (Q_{p,i})_k > (Q_{p,j})_k then
t_i >= t_j, where t_i is the time taken to achieve quality (Q_{p,i})_k and t_j is the time taken
to achieve quality (Q_{p,j})_k.
Proof: The relation (Q_{p,i})_k > (Q_{p,j})_k implies that the number of subdivisions required
to attain (Q_{p,i})_k is greater than the number of subdivisions required to attain (Q_{p,j})_k.
Thus the number of time steps required to attain (Q_{p,i})_k is greater than the number of time
steps required to attain (Q_{p,j})_k. Thus t_i > t_j.
However, in practice there can be instances, especially when the ratio between (Q_{p,i})_k and
(Q_{p,j})_k is small, in which measurement error or certain cache effects give t_i < t_j. In
such cases the HINT algorithm forces t_i = t_j. •
Lemma 7 For a machine k using precision p for computation, QUIPS-Time graph and QUIPS-
Memory graph are monotonically decreasing. 
Proof: This follows from lemmas 5 and 6. • 
Definition 10 If there is sufficient memory available and HINT cannot refine further then 
HINT is bounded by precision. 
Definition 11 If there is sufficient precision available for refinement, but HINT cannot allo­
cate more memory for its computation, then HINT is bounded by memory. 
The rest of the definitions and lemmas assume that computation is not precision bound or 
memory bound. 
Lemma 8 For arbitrary machines k_1, k_2 using arbitrary precisions p_1, p_2 for computation
respectively, if m_i = m_j then (Q_{p_1,m_i})_{k_1} = (Q_{p_2,m_j})_{k_2}.
Proof: By definition, for the same problem size, the quality achieved by two machines is the
same for any precision. •
Lemma 9 For arbitrary machines k_1, k_2 using arbitrary precisions p_1, p_2 for computation
respectively, if m_i < m_j, with computation at m_i and m_j, then
(Q_{p_1,m_i})_{k_1} < (Q_{p_2,m_j})_{k_2}.
Proof: By definition, the memory available for refinement at m_j is greater than the memory
available at m_i. Thus the hierarchical refinement at m_j would be greater than the hierarchical
refinement at m_i. This implies that (Q_{p,m_i})_{k_1} < (Q_{p,m_j})_{k_2}. •
Definition 12 Instantaneous QUIPS (IQ_{p,i})_k is (Q_{p,i})_k / t as measured at a memory
operating point i in a HINT graph with precision p, where p ∈ {D, I}, i ∈ {m_1, m_2, ...,
m_{Mem}}, k ∈ {1, 2, ..., N}, and t is the time taken to achieve the quality (Q_{p,i})_k.
It should be noted that instantaneous QUIPS is measured using actual hardware
for only about 60% to 70% of the memory points between m_1 and m_{Mem}. From the measured
instantaneous QUIPS values, the values at the remaining memory points among m_2, m_3, ...,
m_{Mem-1} are derived using cubic spline interpolation. The MATLAB function interp1 is used
for this purpose with the interpolation method selected as 'spline'. No extrapolation is
allowed for out-of-range memory points. Having instantaneous QUIPS at more memory points
gives the ability to quantify the performance of the system for a larger set of problem sizes.
Lemma 10 For a machine k using precision p for computation, if m_i > m_j then
(IQ_{p,m_i})_k < (IQ_{p,m_j})_k.
Proof: By lemma 7, the HINT QUIPS-Memory graph is monotonically decreasing. This implies
that a larger problem size requires more computing time. Hence, for the same machine k using
any precision p, m_i > m_j implies (IQ_{p,m_i})_k < (IQ_{p,m_j})_k. •
Lemma 11 For a machine k using precision p for computation, let j > 0 be the number of
iterations in which instantaneous QUIPS (IQ_{p,m_i})^j_k is measured at memory point m_i. The
final instantaneous QUIPS is (IQ_{p,m_i})_k = max_j((IQ_{p,m_i})^j_k).
Proof: By definition, for the same precision and the same machine, there is only one HINT
QUIPS-Memory graph that represents the best time taken to achieve the quality. Hence, only
the minimum runtime taken over the j iterations is used as the time to achieve the quality. •
Lemma 12 Let k_1, k_2 be arbitrary machines using a precision p for computation. For a
memory point m_i ∈ {m_1, ..., m_{Mem}}, if (IQ_{p,m_i})_{k_1} > (IQ_{p,m_i})_{k_2} then machine
k_1 is faster than k_2. Thus k_1 > k_2, or in other words machine k_1 is ranked higher than
machine k_2 using precision p for computation and the problem size m_i.
Proof: Let us assume that t_1 and t_2 are the times taken by machines k_1 and k_2 to achieve
quality (Q_{p,m_i})_{k_1} and (Q_{p,m_i})_{k_2} respectively. From lemma 8, for the same memory
point m_i and the same precision p, (Q_{p,m_i})_{k_1} = (Q_{p,m_i})_{k_2}. Since
(IQ_{p,m_i})_{k_1} = (Q_{p,m_i})_{k_1} / t_1 and (IQ_{p,m_i})_{k_2} = (Q_{p,m_i})_{k_2} / t_2,
the relation (IQ_{p,m_i})_{k_1} > (IQ_{p,m_i})_{k_2} implies that t_1 < t_2. Thus machine k_1
takes less time to achieve quality (Q_{p,m_i}) than machine k_2. Hence, machine k_1 is faster
than machine k_2 for memory point m_i. Thus k_1 > k_2. •
[Plot: HINT (Double) QUIPS versus memory size for machines k_1 and k_2, with the QUIPS values
q_1, q_2, q_3, and q_4 marked on the vertical axis.]
Figure 9.1 Two Different Rankings of Machines k_1 and k_2 at Memory Points m_1 and m_2. At
memory point m_1, k_1 > k_2, whereas at memory point m_2, k_2 > k_1.
One important consequence of lemma 12 is that HINT, as a broad-spectrum benchmark, can
rank a set of machines in more than one way depending on the memory size of the HINT
program. Many widely available narrow-spectrum benchmarks lack this property.
Consider figure 9.1 for an example. Let H^p_1 and H^p_2 be the HINT graphs for machines k_1
and k_2 respectively. Both HINT graphs use double precision for computation, implying
p = D. The instantaneous QUIPS at memory point m_1 for machines k_1 and k_2 are q_1 and
q_2 respectively. Similarly, the instantaneous QUIPS at memory point m_2 for machines k_1
and k_2 are q_4 and q_3 respectively. The instantaneous QUIPS q_1, q_2, q_3, and q_4 can also
be written as (IQ_{D,m_1})_{k_1}, (IQ_{D,m_1})_{k_2}, (IQ_{D,m_2})_{k_2}, and
(IQ_{D,m_2})_{k_1} respectively. From figure 9.1, q_1 > q_2, implying
(IQ_{D,m_1})_{k_1} > (IQ_{D,m_1})_{k_2}. From lemma 12 this implies k_1 > k_2.
Thus machine k_1 is ranked higher than machine k_2. Similarly, from figure 9.1, q_3 > q_4,
implying (IQ_{D,m_2})_{k_2} > (IQ_{D,m_2})_{k_1}. From lemma 12 this implies k_2 > k_1. Thus
machine k_2 is ranked higher than machine k_1.
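The way rankings can flip between memory points can be sketched in a few lines. The example
below assumes hypothetical quality values and run times for two machines at two memory
points; it takes the best (minimum) time over repeated runs, as in lemma 11, and ranks by
instantaneous QUIPS, as in lemma 12. All names and numbers are illustrative.

```python
def instantaneous_quips(quality, run_times):
    """Quality divided by the best (minimum) time over repeated runs."""
    return quality / min(run_times)


def rank_machines(iq_by_machine):
    """Machine names ordered from highest to lowest instantaneous QUIPS."""
    return sorted(iq_by_machine, key=iq_by_machine.get, reverse=True)


# Hypothetical quality attained at each memory point (same for every machine, lemma 8).
quality = {"m1": 1.0e3, "m2": 1.0e6}

# Hypothetical repeated run times in seconds for each machine at each memory point.
times = {
    "k1": {"m1": [0.0010, 0.0012], "m2": [2.5, 2.6]},
    "k2": {"m1": [0.0016, 0.0017], "m2": [2.0, 2.2]},
}

rankings = {}
for point, q in quality.items():
    iq = {machine: instantaneous_quips(q, runs[point]) for machine, runs in times.items()}
    rankings[point] = rank_machines(iq)

print(rankings)
```

With these made-up numbers the two machines swap places between the small and the large
memory point, which is the situation figure 9.1 depicts.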
9.1 Measured Time, APPMAP Time, and Projected Time 
Let the measured time to run an application A_j on machine M_i be
MT_ij, where 1 <= i <= N and 1 <= j <= M. Using application signature methods, we derive
the APPMAP application time (APPTIME).
Definition 13 Measured time MT_ij is the actual runtime of application j on machine i.
The MT can be either wall-time or CPU-time. The CPU-time is the runtime cost
of the user code, whereas the wall-time also includes the runtime of system code,
any runtime cost associated with context switching of the process during its execution, and
the runtime cost of input and output (I/O) performed by the application.
Let the APPTIME of an application A_j on machine M_i be AT_ij, where 1 <= i <= N
and 1 <= j <= M.
Let (w_ij)_{m_k} be the application weight for application A_j on machine M_i at a memory point
m_k, where 1 <= i <= N, 1 <= j <= M, and m_1 <= m_k <= m_{Mem}. Let the vector W(j) be the
weight vector of the application A_j for all machines i such that 1 <= i <= N and 1 <= j <= M.
    W(j) = (w_{ij})_{m_1}, (w_{ij})_{m_2}, \ldots, (w_{ij})_{m_{Mem}}    (9.1)
It is desired that W(j) be independent of the machines. In mathematical form this is given by
equation 9.2 for an application j such that j ∈ {1, 2, ..., M}, where weights are derived for
all memory points m_k such that m_k ∈ {m_1, m_2, ..., m_{Mem}}. The machine-independence
property implies that for each memory point m_k there exists just one weight for a given
application j across all the machines.
    (w_{1j})_{m_k} = (w_{2j})_{m_k} = \cdots = (w_{Nj})_{m_k}    (9.2)
There are four independent techniques developed to compute W(j) for an application j.
These techniques are discussed in detail in later chapters. Three of these
techniques are independent of machines. However, one technique based on cache-miss metrics
is machine-dependent.
The application-specific weight vector W(j) is then applied to the HINT graphs H^p_1, ..., H^p_N
to compute the APPMAP application time AT_ij for each application A_j on each machine i. The
general method is as follows. For each machine k using precision p for
computation, instantaneous QUIPS, (IQ_{p,m_i})_k, are calculated at Mem logarithmically spaced
memory points.
The precision p for a HINT graph H^p_k is chosen to match the application type. The
application type is predefined for all the applications used. For example, the SPEC CINT95
suite is of integer type whereas the SPEC CFP95 suite is of floating-point type. Hence, for
SPEC CINT95, p is chosen to be I, whereas for SPEC CFP95, p is chosen to be D. One
benchmark in the SPEC CINT2000 suite, Eon, a probabilistic ray tracer based on
Kajiya's algorithm, is actually floating-point intensive. One can use a mixed approach where,
depending on the percentage of floating-point instructions out of total instructions, a
combination of I and D is chosen for p.
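The derived application time for application A_j on the N machines M_1, ..., M_N is
calculated as follows.
Ideally, the weight (w_ij)_{m_i} would be machine-independent. Also, the number of memory
points at which the weight is defined is infinite. Let H^p_1, H^p_2, ..., H^p_N be the
corresponding hardware signatures (HINT), where p is the precision, which is either D or I.
Given an independent weight (refer to equation 9.2) for an application A_j, where 1 <= i <= N
and 1 <= j <= M:
    \begin{pmatrix} AT_{1j} \\ AT_{2j} \\ \vdots \\ AT_{Nj} \end{pmatrix}
    =
    \begin{pmatrix}
      \frac{1}{(IQ_{p,m_1})_{k_1}} & \frac{1}{(IQ_{p,m_2})_{k_1}} & \cdots & \frac{1}{(IQ_{p,m_{Mem}})_{k_1}} \\
      \frac{1}{(IQ_{p,m_1})_{k_2}} & \frac{1}{(IQ_{p,m_2})_{k_2}} & \cdots & \frac{1}{(IQ_{p,m_{Mem}})_{k_2}} \\
      \vdots & \vdots & & \vdots \\
      \frac{1}{(IQ_{p,m_1})_{k_N}} & \frac{1}{(IQ_{p,m_2})_{k_N}} & \cdots & \frac{1}{(IQ_{p,m_{Mem}})_{k_N}}
    \end{pmatrix}
    \begin{pmatrix} (w_{ij})_{m_1} \\ (w_{ij})_{m_2} \\ \vdots \\ (w_{ij})_{m_{Mem}} \end{pmatrix}
    (9.3)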
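Equation 9.3 amounts to a matrix-vector product in which each machine's row holds the
reciprocals of its instantaneous QUIPS and the vector holds the application weights. The
following is a minimal sketch under that reading, with hypothetical QUIPS values and weights;
the function and variable names are illustrative.

```python
def appmap_times(iq, weights):
    """APPMAP time of one application on every machine.

    iq[i][k]   : instantaneous QUIPS of machine i at memory point k
    weights[k] : application weight at memory point k (assumed machine-independent)
    Returns one value per machine: the sum over k of weights[k] / iq[i][k].
    """
    for row in iq:
        if len(row) != len(weights):
            raise ValueError("each machine needs one QUIPS value per memory point")
    return [sum(w / q for w, q in zip(weights, row)) for row in iq]


# Hypothetical instantaneous QUIPS for two machines at three memory points.
iq = [
    [4.0e6, 2.0e6, 1.0e6],  # machine 1: fast in cache, slow in main memory
    [2.0e6, 2.0e6, 2.0e6],  # machine 2: flat across the memory hierarchy
]
# Hypothetical weights for one application, heaviest at the largest memory point.
weights = [0.2, 0.3, 0.5]

at = appmap_times(iq, weights)
print(at)
```

With these made-up numbers the machine with the flat curve obtains the smaller APPMAP time,
because the application places most of its weight at the largest memory point.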
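Definition 14 APPMAP or Application Signature is the set of application-specific,
machine-independent weights (w_ij)_{m_j} applied to machine-specific HINT graphs.
Let the projected time be PT_ij, where 1 <= i <= N and 1 <= j <= M. Equation 9.4 shows
how PT_ij is calculated from AT_ij and α_j. To linearly project MT_ij from AT_ij, an
application-dependent, machine-independent scalar α_j is applied to each application APPTIME.
    \begin{pmatrix}
      PT_{11} & PT_{12} & \cdots & PT_{1M} \\
      PT_{21} & PT_{22} & \cdots & PT_{2M} \\
      \vdots & \vdots & & \vdots \\
      PT_{N1} & PT_{N2} & \cdots & PT_{NM}
    \end{pmatrix}
    =
    \begin{pmatrix}
      \alpha_1 AT_{11} & \alpha_2 AT_{12} & \cdots & \alpha_M AT_{1M} \\
      \alpha_1 AT_{21} & \alpha_2 AT_{22} & \cdots & \alpha_M AT_{2M} \\
      \vdots & \vdots & & \vdots \\
      \alpha_1 AT_{N1} & \alpha_2 AT_{N2} & \cdots & \alpha_M AT_{NM}
    \end{pmatrix}
    (9.4)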
9.2 Validation Strategy for the Models 
The objective of the thesis is to compare two sets of machine performance data: the runtime
measured on actual hardware, and the runtime derived using hardware signatures (HINT)
and different models of application signatures (APPMAP).
Let there be N machines and M applications. Let MT_ij be the measured time
of an application A_j on a machine i, where 1 <= i <= N and 1 <= j <= M. Using the same
notation, let AT_ij be the derived time of the application obtained through the various
Application Signature models. Let PT_ij be the projected time obtained by multiplying AT_ij by
an application-specific scalar α_j, as given in equation 9.4.
For an application A_j on a set of machines i such that 1 <= i <= N, the following tests are
used to validate each model.
• Are the vector MT_ij and the vector AT_ij correlated? How strong is the correlation?
• Is there a linear fit between MT_ij and AT_ij? If so, is there a scalar α_j such that the
relative error between the projected time PT_ij and the measured time MT_ij is low?
• Is the ranking of machines given by MT_ij correlated with the ranking of machines given
by AT_ij? What is the strength of the rank correlation?
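The three validation questions above can be sketched as follows for a single application
measured on several machines. This is a minimal illustration with made-up times. The
least-squares fit through the origin used for the scalar is an assumption of this sketch,
since the text does not say how the scalar is chosen, and all function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def fit_scalar(at, mt):
    """Least-squares scalar alpha minimizing sum((alpha*at - mt)^2) (assumed method)."""
    return sum(a * m for a, m in zip(at, mt)) / sum(a * a for a in at)


def max_relative_error(pt, mt):
    """Largest |projected - measured| / measured over all machines."""
    return max(abs(p - m) / m for p, m in zip(pt, mt))


# Hypothetical derived (AT) and measured (MT) times on four machines.
at = [1.0, 2.0, 3.0, 4.0]
mt = [10.0, 21.0, 29.0, 40.0]

alpha = fit_scalar(at, mt)
pt = [alpha * a for a in at]
print(pearson(at, mt), spearman(at, mt), alpha, max_relative_error(pt, mt))
```

A high Pearson coefficient answers the first question, a small maximum relative error after
scaling answers the second, and a high Spearman coefficient answers the third.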
CHAPTER 10 Model 1: Application Signature Using Instantaneous 
QUIPS 
HINT is a lightweight benchmark. It takes less than 10 minutes to run the HINT benchmark
and get validated and consistent results. The SPEC benchmarks we ran took over 8
hours. For validated SPEC benchmark results, one has to repeat the runs at least
three times.
This chapter investigates the following questions:
1. How is application performance correlated with HINT performance? Does the wide
spectrum of the HINT benchmark help to predict other applications and benchmarks?
2. Is there a distinguishing signature of each application that can be used along with
HINT to predict the performance of other applications and benchmarks?
3. Is such a signature hardware independent?
In particular, this chapter uses instantaneous QUIPS and NetQUIPS, a single
rating by HINT, to predict SPEC benchmarks.
10.1 Model 
As defined in chapter 9, let (w_ij)_{m_k} be the application weight for application A_j on
machine k_i at a memory point m_k, where 1 <= i <= N, 1 <= j <= M, and m_1 <= m_k <= m_{Mem}.
Also, as defined before, let AT_ij be the application time and let MT_ij be the measured time.
Let m_i be the memory point where instantaneous QUIPS is being calculated.
Definition 15 MTv(j) is a vector MT_1j, MT_2j, ..., MT_Nj representing measured time for
application A_j on machines k_1, k_2, ..., k_N respectively.
Definition 16 ATv(j) is a vector AT_1j, AT_2j, ..., AT_Nj representing APPMAP time for
application A_j on machines k_1, k_2, ..., k_N respectively.
Definition 17 IQv(m_i) is a vector (IQ_{p,m_i})_{k_1}, (IQ_{p,m_i})_{k_2}, ..., (IQ_{p,m_i})_{k_N}
representing instantaneous QUIPS at memory point m_i for machines k_1, k_2, ..., k_N
respectively.
Definition 18 Let $corr(j, m_i)$ be the absolute value of the correlation between $IQ_v(m_i)$ and $1/MT_v(j)$.
Taking the absolute value ensures that $0 \le corr(j, m_i) \le 1$.
$corr(j, m_i)$ can also be written as follows, where $r$ is defined in 6.1.
$$corr(j, m_i) = \left| r\big(IQ_v(m_i),\ 1/MT_v(j)\big) \right| \quad (10.1)$$
10.1.1 Using Instantaneous QUIPS as Application Signature
Solution 1 The weight $W_{mw}(j, m_i)$ for an application $A_j$ is defined at the earliest memory
point $m_i$ where $corr(j, m_i)$ is the maximum over all memory points.
$$W_{mw}(j, m_i) = \begin{cases} corr(j, m_i) & \text{for memory point} = m_i \\ 0 & \text{for memory point} \ne m_i \end{cases} \quad (10.2)$$
10.1.2 Using NetQUIPS as Application Signature
HINT defines a single-number rating called $NETQUIPS_p$ for a machine using precision
$p$ for computation. The definition of NETQUIPS is provided in equation 7.2. A relatively fast
solution is to use NETQUIPS as the application performance predictor.
Definition 19 $NETQUIPS_p^i$ is the NETQUIPS as computed by HINT using precision $p$ for
computation on machine $k_i$.
Solution 2 For an application $A_j$ on machine $k_i$, the APPMAP time $AT_{ij}$ is defined as
$1/NETQUIPS_p^i$, where $p$ is the predefined application type.
Hence, $AT_v(j)$ is equal to the vector $(1/NETQUIPS_p^1, \ldots, 1/NETQUIPS_p^N)$, where $AT_v(j)$
is as in Definition 16.
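Solutions 1 and 2 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the code used to produce the results: the function names, the data layout (one list of per-machine QUIPS values for each memory point), and the choice to return 0.0 for an undefined correlation are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation r; returns 0.0 if either sequence is constant
    (an illustrative choice, since r is undefined in that case)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def corr_vector(iq, measured_times):
    """corr(j, m_i) for every memory point of one application.

    iq[m][k] is the instantaneous QUIPS of machine k at memory point m
    (hypothetical layout); measured_times[k] is the measured time on machine k.
    """
    inv_mt = [1.0 / t for t in measured_times]
    return [abs(pearson(row, inv_mt)) for row in iq]


def solution1_weights(iq, measured_times):
    """Solution 1: corr at the earliest best memory point, 0 elsewhere."""
    c = corr_vector(iq, measured_times)
    best = c.index(max(c))  # list.index returns the earliest maximum
    return [c[m] if m == best else 0.0 for m in range(len(c))]


def solution2_times(netquips):
    """Solution 2: APPMAP time on each machine taken as 1/NetQUIPS."""
    return [1.0 / q for q in netquips]


# Made-up numbers: three memory points, three machines.
iq = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [3.0, 1.0, 2.0]]
measured = [10.0, 5.0, 2.5]
print(solution1_weights(iq, measured))
print(solution2_times([2.0, 4.0]))
```

In the made-up data, the second memory point is exactly proportional to the reciprocal of the measured times, so it is the one that receives the nonzero weight.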
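10.1.3 Using NetQUIPS and Instantaneous QUIPS as Application Signature
A hybrid of instantaneous QUIPS (Solution 1) and NetQUIPS (Solution 2) is
used to get the best solution: for each application, whichever of the two predictors
gives the higher correlation is taken as the signature.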
10.1.4 Using Correlation Vector as Application Signature 
Definition 20 $corr_v(j)$ is a vector $(corr(j, m_1), corr(j, m_2), \ldots, corr(j, m_{Mem}))$ representing the
correlation between the measured time for application $A_j$ on all $N$ machines and instantaneous QUIPS
at memory points $m_1, m_2, \ldots, m_{Mem}$ on all $N$ machines.
Solution 3 The weight $W_{mw}(j)$ for an application $A_j$ is defined to be $corr_v(j)$.
$$W_{mw}(j, m_i) = corr(j, m_i) \quad \forall m_i \in \{m_1, m_2, \ldots, m_{Mem}\} \quad (10.3)$$
10.2 Results 
Tables H.1 and H.2 summarize the results using just NetQUIPS (Solution 2) as the Application
Signature. With the exception of benchmark I16, the correlation between integer benchmarks
and NetQUIPS is very high. However, the correlation between floating-point benchmarks and
NetQUIPS is weak.
Table 10.1 Column Definitions for Tables 10.2, 10.3
Column  Definition
C1      Maximum correlation coefficient
C2      Memory point for instantaneous QUIPS, or NetQUIPS, used to calculate the maximum
        correlation coefficient (C1)
C3      Linear fit: NetQUIPS or instantaneous QUIPS multiplied by this number
        yields the predicted time for the benchmark
C4      Error: the maximum relative error in the projection
C5      Spearman's rank correlation coefficient corresponding to the maximum
        correlation coefficient (C1)
C6      Maximum Spearman's rank correlation coefficient
C7      Memory point for instantaneous QUIPS, or NetQUIPS, used to calculate the maximum
        rank correlation coefficient (C6)
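The quantities behind columns C1, C3, C4, and C5 can be computed as in the following Python sketch. It rests on assumptions that the table does not spell out: the linear fit is taken as a least-squares line through the origin, the relative error is normalized by the measured time, and tied values receive average ranks. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def linear_fit_through_origin(predictor, measured):
    """Scale factor c minimizing the sum of (measured - c * predictor)^2."""
    return sum(p * m for p, m in zip(predictor, measured)) / sum(p * p for p in predictor)


def max_relative_error(predictor, measured, c):
    """Largest |c * predictor - measured| / measured over all machines."""
    return max(abs(c * p - m) / m for p, m in zip(predictor, measured))
```

Given a predictor vector and the measured times of one benchmark across the machines, `pearson` corresponds to C1, `linear_fit_through_origin` to C3, `max_relative_error` to C4, and `spearman` to C5.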
Table 10.2 Application Signature Results using Instantaneous QUIPS or
NetQUIPS for Integer Applications
ID   C1      C2             C3  C4      C5      C6      C7
I1   0.9940  4.38e+03   0.0085  0.0235  0.9762  1.0000  NetQUIPS
I2   0.9939  4.38e+03   0.0039  0.0222  0.9762  1.0000  NetQUIPS
I3   0.9922  4.38e+03   0.0042  0.0295  1.0000  1.0000  8.80e+01
I4   0.9934  4.38e+03   0.0042  0.0223  0.9762  1.0000  NetQUIPS
I5   0.9434  4.45e+02   0.0014  0.0634  0.9286  0.9762  4.68e+05
I6   0.9902  8.67e+02   0.0076  0.0326  0.8333  0.9762  2.07e+02
I7   0.9894  8.67e+02   0.0085  0.0326  0.8333  0.9762  2.07e+02
I8   0.9900  8.67e+02   0.0074  0.0317  0.8333  0.9762  2.07e+02
I9   0.9395  4.45e+02   0.0906  0.0684  0.8333  0.9048  4.68e+05
I10  0.9459  4.45e+02   0.3247  0.0679  0.8810  0.9286  4.68e+05
I11  0.8975  4.68e+05   0.1377  0.1053  0.9048  0.9048  4.68e+05
I12  0.9610  4.45e+02   0.1672  0.0549  0.8810  0.9286  4.68e+05
I13  0.9981  4.38e+03   0.0017  0.0138  1.0000  1.0000  8.80e+01
I14  0.9937  9.53e+02   0.2474  0.0309  0.8571  1.0000  2.07e+02
I15  0.9918  8.67e+02   0.0026  0.0374  0.8333  0.9762  2.07e+02
I16  0.8288  1.00e+06  51.6381  0.1161  0.9048  0.9048  9.11e+05
I17  0.9808  4.38e+03   0.0017  0.0444  0.8333  0.9762  2.07e+02
Tables 10.2 and 10.3 summarize the results of using the hybrid of instantaneous QUIPS (Solution 1)
and NetQUIPS (Solution 2) as the application signature. Table 10.1 provides the column
definitions for Tables 10.2 and 10.3. Using the hybrid solution, the correlation between the
projected application time and both the integer and floating-point benchmarks is strong.
Also, the rank correlation between the projected application time and both the integer and
floating-point benchmarks is strong and in some cases perfect.
Appendix H lists the detailed results from the model developed in this chapter. 
10.3 Summary 
In this chapter we have seen that weights at different memory points of HINT correlate
highly with the SPEC benchmarks. So a basic application signature can be a single weight
on the HINT graph. Chapter 11 finds the best results by using an optimization technique.
Table 10.3 Application Signature Results using Instantaneous QUIPS or
NetQUIPS for Floating-point Applications
ID   C1      C2        C3      C4      C5      C6      C7
F1   0.9801  3.39e+06  0.0070  0.1874  0.9524  0.9762  1.56e+07
F2   0.9868  5.91e+07  0.0075  0.1101  0.8810  0.9048  1.56e+07
F3   0.9867  5.91e+07  0.0083  0.1102  0.8810  0.9048  1.56e+07
F4   0.9933  2.07e+07  0.0054  0.0646  0.9762  1.0000  1.56e+07
F5   0.9878  4.65e+04  0.0013  0.0553  0.8095  0.8333  1.68e+02
F6   0.9496  9.82e+05  0.0045  0.1110  0.9048  0.9048  9.82e+05
F7   0.9311  3.73e+06  0.0088  0.2855  0.9524  0.9762  2.10e+06
F8   0.9673  5.91e+07  0.0059  0.2271  0.8333  0.8571  1.56e+07
F9   0.9230  NetQUIPS  0.0001  0.0917  0.6429  0.7619  9.82e+05
CHAPTER 11 Model 2: Application Signature Using Optimization 
Method 
We have seen in the previous chapter that weights applied at different memory points
in the HINT graph correlate strongly with the performance of applications or benchmarks.
This chapter finds the best possible correlation by searching the weight-vector space. These
application weights can then be used for projecting future performance by varying just the
hardware signature.
11.1 Model 
As defined in chapter 9, let $(w_{ij})_{m_k}$ be the application weight for application $A_j$ on machine
$k_i$ at a memory point $m_k$, where $1 \le i \le N$, $1 \le j \le M$, and $m_1 \le m_k \le m_{Mem}$. Also, as
defined before, let $AT_{ij}$ be the application time and let $MT_{ij}$ be the measured time. Let $m_i$ be
the memory point where instantaneous QUIPS is being calculated.
We used the quasi-Newton search method to optimize the cost function given by Equation
11.1, which is a quadratic equation. The quasi-Newton search method is one of the most popular
algorithms in nonlinear optimization. The MATLAB function fmincon was used.
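$$f(k) = \max(corr(AT_{ij}, MT_{ij})) \quad (11.1)$$
where
$$\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \le \begin{pmatrix} (w_{ij})_{m_1} \\ (w_{ij})_{m_2} \\ \vdots \\ (w_{ij})_{m_{Mem}} \end{pmatrix} \le \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \quad (11.2)$$
The non-negativity constraint is imposed on the weight vector because performing a task
cannot subtract from the runtime.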
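The constrained search can be illustrated with a small Python sketch. It is not the quasi-Newton method or fmincon: a plain coordinate search with clamping to the bounds stands in for it so that the example needs only the standard library. The form of the projected time (a weighted sum of reciprocal QUIPS over the memory points) is an assumption made for illustration, as are all function names and numbers.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def projected_times(weights, inv_quips):
    """Assumed model: AT on each machine is sum_k w_k * (1 / QUIPS_k)."""
    return [sum(w * q for w, q in zip(weights, row)) for row in inv_quips]


def objective(weights, inv_quips, measured):
    """Correlation between projected and measured times (Equation 11.1)."""
    return pearson(projected_times(weights, inv_quips), measured)


def coordinate_search(inv_quips, measured, start, step=0.25, tol=1e-6):
    """Maximize the objective while keeping every weight within [0, 1]."""
    w = list(start)
    best = objective(w, inv_quips, measured)
    while step > tol:
        improved = False
        for k in range(len(w)):
            for delta in (step, -step):
                trial = list(w)
                trial[k] = min(1.0, max(0.0, trial[k] + delta))  # bounds of 11.2
                score = objective(trial, inv_quips, measured)
                if score > best + 1e-12:
                    w, best, improved = trial, score, True
        if not improved:
            step /= 2  # refine the step once no move helps
    return w, best


def multi_start(inv_quips, measured, restarts=5, seed=0):
    """Repeat the search from random initial estimates."""
    rng = random.Random(seed)
    return [
        coordinate_search(inv_quips, measured, [rng.random() for _ in inv_quips[0]])
        for _ in range(restarts)
    ]


# Made-up data: four machines, two memory points; times follow the first point.
inv_quips = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
measured = [10.0, 20.0, 30.0, 40.0]
print(coordinate_search(inv_quips, measured, [0.5, 0.5]))
```

`multi_start` mirrors the idea of changing the initial estimate at random and checking whether the search settles on the same solution.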
11.2 Results 
The quasi-Newton search method stabilized within a few seconds in all cases. The total
number of iterations was less than 30 for each benchmark. To check that the solution
is unique, the initial estimate was changed randomly. Even with different initial estimates, the
quasi-Newton method stabilized to the same solution.
The results of the search are summarized in Tables 11.1, 11.2, I.1, and I.2.
Table 11.1 Search Method (function of problem size) Results for Integer
Applications
Id   Correlation  Linear Fit  Max Rel. Err  Rank Corr.
I1   0.9937           0.0085  0.0258        0.9762
I2   0.9936           0.0039  0.0247        0.9762
I3   0.9920           0.0042  0.0295        1.0000
I4   0.9932           0.0042  0.0253        0.9762
I5   0.9792           0.0016  0.0454        0.9762
I6   0.9908           0.0077  0.0331        0.8571
I7   0.9903           0.0085  0.0336        0.8333
I8   0.9907           0.0074  0.0326        0.8571
I9   0.9775           0.1006  0.0481        0.9048
I10  0.9773           0.3560  0.0496        0.9286
I11  0.9661           0.1158  0.0598        0.9048
I12  0.9843           0.1811  0.0400        0.9286
I13  0.9979           0.0017  0.0159        1.0000
I14  0.9935           0.2471  0.0271        0.8571
I15  0.9912           0.0026  0.0393        0.8333
I16  0.8658          42.3942  0.0949        0.9524
I17  0.9804           0.0017  0.0477        0.8333
Table 11.2 Search Method (function of problem size) Results for Floating-Point
Applications
Id   Correlation  Linear Fit  Max Rel. Err  Rank Corr.
F1   0.9971           0.0047  0.0540        0.9762
F2   0.9885           0.0055  0.1008        0.9048
F3   0.9885           0.0062  0.1011        0.9048
F4   0.9986           0.0039  0.1192        1.0000
F5   0.9995           0.0013  0.0118        0.9286
F6   0.9682           0.0044  0.1127        0.9762
F7   0.9985           0.0051  0.0263        0.9762
F8   0.9767           0.0036  0.0859        0.8571
F9   0.9884           0.0017  0.0310        0.8333
F10  0.9907           0.0053  0.1373        0.9762
F11  0.9991           0.0035  0.2141        1.0000
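Comparing the results from the search method (Tables 11.1, 11.2) with the results from
instantaneous QUIPS or NetQUIPS (Tables 10.2, 10.3), it can be seen that the search method
generally improves both the correlation and the rank correlation, most visibly for the benchmarks
that correlated least well in Chapter 10 (for example I5, I9, and I16).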
Appendix I lists the detailed results from the model developed in this chapter. 
CHAPTER 12 Model 3: Application Signature Using Cache Misses 
For a system designer it is often very simple to develop a functional simulator that can
provide the cache misses of the benchmarks at different memory regimes. This method of finding
the Application Signature is based on cache misses in the memory hierarchy.
To obtain the application weights, the hardware signature (HINT graph) is divided into three
broad regions reflecting three levels of performance. The first region reflects system performance
when the workload is in the processor or in the first level of cache. The second region reflects
system performance when the workload is in the second level of cache. The third region reflects
system performance when the workload is in memory.
12.1 Model 
The hardware signature of a given machine is divided into three regions reflecting three levels
of memory performance. The application weight of a given application for each
region is derived from the number of memory access hits in the region. A convolution of
the hardware weights and the application weights yields the performance of the
application on the machine under evaluation.
There are many ways to divide HINT graphs into three broad regions depending on the
size of the caches. The memory regimes can be detected automatically from HINT graphs. Let
$Q_{i1}, Q_{i2}, \ldots, Q_{in}$ be the QUIPS at $n$ memory points in memory regime $i$, where $i = 1, 2, 3$. The
mean hardware weight $H_i$ is given as follows:
$$H_i = (Q_{i1} + Q_{i2} + \cdots + Q_{in})/n \quad (12.1)$$
For all three memory regimes, the application weights for an application $A_j$ are calculated
by the following equations.
$$w_{1j} = 1 - \frac{\text{Level 1 Miss}}{\text{Load} + \text{Store}} \quad (12.2)$$
$$w_{2j} = \frac{\text{Level 1 Miss}}{\text{Load} + \text{Store}} \quad (12.3)$$
$$w_{3j} = \frac{\text{Level 2 Miss}}{\text{Load} + \text{Store}} \quad (12.4)$$
These application weights are first normalized and then applied to the hardware signature
given by $H_i$ to obtain the application time.
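A minimal Python sketch of this model follows. Equations 12.1 to 12.4 are taken from the text; two steps are assumptions made only for illustration, because the text does not spell them out: the weights are normalized so that they sum to one, and the weights are applied to the hardware signature by dividing each weight by the mean QUIPS of its regime and summing. The function names and counter values are hypothetical.

```python
def mean_hardware_weight(quips_in_regime):
    """Equation 12.1: mean QUIPS over the memory points of one regime."""
    return sum(quips_in_regime) / len(quips_in_regime)


def application_weights(loads, stores, l1_misses, l2_misses):
    """Equations 12.2 to 12.4, followed by normalization (assumed: sum to 1)."""
    refs = loads + stores
    w1 = 1.0 - l1_misses / refs   # references served by the first level
    w2 = l1_misses / refs         # references that miss in level 1
    w3 = l2_misses / refs         # references that miss in level 2
    total = w1 + w2 + w3
    return [w1 / total, w2 / total, w3 / total]


def projected_time(weights, hardware_weights):
    """Assumed combination: time grows with weight / speed in each regime."""
    return sum(w / h for w, h in zip(weights, hardware_weights))


# Made-up QUIPS samples for the three regimes and made-up counter values.
regimes = [[8.0e6, 7.0e6], [4.0e6, 3.0e6], [1.0e6, 1.0e6]]
h = [mean_hardware_weight(r) for r in regimes]
w = application_weights(loads=600, stores=400, l1_misses=100, l2_misses=20)
print(h, w, projected_time(w, h))
```

The projected value is a relative score; a linear fit against measured times, as in the earlier chapters, would be needed to turn it into seconds.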
12.2 Results 
The results of the model are shown in Tables 12.1 and 12.2. From the tables it can be seen that
this simple method gives high correlation and rank correlation for most benchmarks. However, the
method is not consistent: application I3 has a weak negative correlation (-0.0825).
Table 12.1 Model 3 Results for Floating-Point Applications
Machine Id  Correlation  Rank Correlation
F1          0.9739       0.9048
F2          0.9838       1.0000
F3          0.8413       0.9762
F4          0.8086       0.9762
F5          0.9466       0.9762
F6          0.9110       0.9762
F7          0.8220       0.8333
F8          0.9694       0.9762
F9          0.9418       1.0000
Table 12.2 Model 3 Results for Integer Applications
Machine Id  Correlation  Rank Correlation
I1           0.8479       0.9524
I2           0.7904       0.9524
I3          -0.0825      -0.3333
I4           0.5851       0.6190
I5           0.8201       0.8333
I6           0.8205       0.8333
I7           0.9086       0.9048
I8           0.9151       0.9286
I9           0.9005       0.9048
I10          0.9213       0.9286
I11          0.8559       0.9524
I12          0.8556       0.8095
I13          0.8292       0.8333
I14          0.7152       0.5952
I15          0.7814       0.7857
I16          0.9503       0.8571
I17          0.7104       0.6190
CHAPTER 13 Model 4: Application Signature Using Cache Sensitivity 
In all the previous models, we used the hardware signatures to find the
application signature. However, it is desirable to obtain an application signature that is
independent of the hardware signature. A good analogy is that we have been using the trip reports
of multiple cars and the performance of each car to estimate the highway. A much preferred way
would be to measure the highway separately and then use the car performance to estimate the trip
report. This chapter develops a model of application signature that is hardware independent.
[Figure 13.1: panel (a) "Ideal Working Set of SU2COR Benchmark"; panel (b) "Cache Sensitivity of SU2COR Benchmark"; x-axis in both panels: fully associative cache, cache size varied from 1 to infinity]
Figure 13.1 Working Set Method for SU2COR: (a) Ideal Cache Miss (b)
Cache Sensitivity or Application Signature
13.1 Model 
A fully associative cache model was used in a functional simulator (Dinero IV), and the
cache size was varied from one word to infinite size. In a fully associative cache there are
no conflict misses. Hence, the cache misses are just the sum of the capacity
misses and the compulsory misses. For a fixed-size cache, both kinds of misses depend on
the algorithmic behavior of the application. It should be noted that the application graph is
independent of any hardware.
[Figure 13.2: panel (a) "Ideal Working Set of VORTEX Benchmark"; panel (b) "Cache Sensitivity of VORTEX Benchmark"; x-axis in both panels: fully associative cache, cache size varied from 1 to infinity]
Figure 13.2 Working Set Method for VORTEX: (a) Ideal Cache Miss (b)
Cache Sensitivity or Application Signature
Cache sensitivity is defined as the relative change in cache misses. It is a dimensionless
measure and is used as the Application Signature.
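The working-set method can be illustrated with a small Python sketch. It is not Dinero IV: a toy fully associative LRU cache replays a trace of block addresses for several cache sizes, and the relative change in misses between consecutive sizes is then computed. The exact formula for the relative change (the drop in misses divided by the misses at the smaller size) is an assumption, and the names and the trace are made up.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Misses of a fully associative LRU cache that holds `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses


def cache_sensitivity(miss_counts):
    """Relative change in misses between consecutive cache sizes (assumed form)."""
    return [
        (prev - cur) / prev if prev else 0.0
        for prev, cur in zip(miss_counts, miss_counts[1:])
    ]


# Made-up trace that cycles over three blocks.
trace = [1, 2, 3, 1, 2, 3, 1, 2, 3]
sizes = [1, 2, 3, 4]
misses = [lru_misses(trace, s) for s in sizes]
print(misses)
print(cache_sensitivity(misses))
```

For this trace the misses stay flat until the cache can hold all three blocks, so the only nonzero sensitivity appears at that size; this is the kind of step that marks a working set.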
13.2 Results 
We took one benchmark each from the integer suite and the floating-point suite to test the model.
Figure 13.1(a) shows the cache misses on an ideal machine for the SU2COR benchmark.
Figure 13.1(b) shows the cache sensitivity of the SU2COR benchmark. Figure 13.2(a) shows the
cache misses on an ideal machine for the VORTEX benchmark. Figure 13.2(b) shows the cache
sensitivity of the VORTEX benchmark.
Table 13.1 shows the results of using cache sensitivity as the Application Signature. It can
be seen that both the correlation and the rank correlation between the predicted application time
and the measured time are strong.
Id   Correlation  Linear Fit  Max Rel. Err  Rank Corr.
F1   0.9826           0.0042  0.1241        0.9048
I5   0.9069           0.0014  0.0868        0.9048
Table 13.1 Working Set Method Results for SU2COR (F1) and VORTEX
(I5) Applications
CHAPTER 14 Applications of APPMAP technology 
HINT and APPMAP can play an important role where traditional benchmarks fall short.
Together they can predict the performance of a real application on either a real machine or a
hypothetical machine. HINT generates graphs for machines, and APPMAP complements HINT graphs by
providing application-specific weights. The convolution of HINT graphs and application-specific
weights yields the application's runtime performance. Both HINT graphs and application
weights can be pre-computed, and there is no need to run applications on the real hardware to
estimate the real performance. Thus, HINT and APPMAP together can be applied in novel
applications that are not possible with conventional benchmarks. This chapter discusses a few
such examples.
14.1 System Design 
System designers and component (processor and chipset) designers are often faced with
the problem of comparing two or more design alternatives. The designers work under time
constraints, and it is often nearly impossible to come up with a system cycle simulator or to
add features to the existing cycle simulators.
Even if they are able to design cycle simulators, cycle simulators are slow [Todi, 2001], and
studies suggest that even cycle simulators are sometimes off by 50% or more
when compared with real hardware. So, there is always a desire to take a second approach. An
analytical approach is the only other choice. However, analytical modeling is difficult to do as
it requires advanced skills. Further, the specifications of the design alternatives are often
not well defined except for a few broad guidelines such as the number of functional units, pipeline
stages, cache hierarchy features, clock frequency, etc. These broad-level specifications make it
difficult to develop a cycle-accurate simulator.
In all the above cases, HINT and APPMAP technology may be used. The technology is an
analytical modeling technique that is not only easy to comprehend but also easy to use. It is
relatively simple to come up with HINT signatures for the new systems. Since the application-specific
APPMAP remains the same for an application binary, one can easily compute the cycles-per-instruction
information from HINT and APPMAP.
Similarly, HINT and APPMAP technology can also be very useful when comparing different
configurations of the same system. System designers often provide incremental upgrades to
the customers. This technology would aid in planning and in illustrating how incremental
upgrades are useful in the real environment.
14.2 Selecting System on Applications 
Among the best uses of Application Signature technology is answering queries on how to
upgrade a system or how to plan a purchase within a fixed budget. When a customer selects a new
personal computer or notebook, vendors like HP and Dell offer many choices.
Customers can easily customize servers or personal computers on the vendors'
websites. Unfortunately, customers select a particular configuration over others based
on intuition. With power usage becoming a prime concern for customers besides performance,
the choice is becoming increasingly complex. Should one choose a CPU with a higher clock speed
or one with a lower clock speed but lower power consumption? Most buyers
rely on their experience; they over-estimate or under-estimate the required machine
specification and usually tend to overspend.
However, none of the vendors so far offers suggestions based on the applications that one might
be running. The following are steps that a customer may take when using APPMAP and HINT
technology to tailor a computer to his needs.
1. Specify the applications you will be using. Are they concurrent applications? 
2. Specify the amount of money you are willing to spend.
3. Specify any preference for your operating systems, processors, etc. 
The above steps differ in many ways from the traditional ways of selecting a computer.
Instead of intuitively choosing from the available range of computers, a customer is actually
choosing a computer based on his needs. The customer can vary the dollar amount and
the choice of applications to determine easily what is best for him.
14.3 Multiprocessor Scheduling 
In a multiprocessing environment such as commodity-based clusters, one big challenge is
to distribute work evenly across the available processors. A balanced workload environment
ensures that all the processors are equally busy. The workload scheduling requirement gets
even more complex when heterogeneous computers are involved. In a heterogeneous system it is
hard to estimate the time it would take a process or a workload to complete. This problem is
especially relevant in Grid Computing Architecture [Foster et al., 2001], where large capacity
workloads are distributed across heterogeneous, geographically dispersed environments.
Application Signature technology provides an easy way to estimate the runtime of a workload.
HINT signatures from all the machines in a multiprocessor environment can be collected
or derived analytically. This process is done once in the lifetime of a machine. Also, APPMAP
signatures for the application binary targeted for all the different architectures in the environment
can be collected. This process is done once in the lifetime of an application. Combining
the HINT signatures and the APPMAP signatures gives a quick estimate of the runtime of
the workload. The quick estimate can be used by the workload scheduler before evenly
distributing the workload across the system.
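How a scheduler might use such estimates can be sketched in Python. The sketch assumes that a table of estimated runtimes (one row per job, one column per machine) has already been derived from the HINT and APPMAP signatures. The greedy rule used here, placing the longest jobs first on the machine that would finish them earliest, is a common heuristic chosen for illustration; it is not a method described in this thesis, and the function name and numbers are hypothetical.

```python
def schedule(estimated_runtime):
    """Greedy assignment of jobs to heterogeneous machines.

    estimated_runtime[job][machine] is the predicted runtime in seconds.
    Returns (assignment, busy_until), where assignment maps job -> machine
    and busy_until[machine] is the time at which that machine becomes free.
    """
    n_machines = len(estimated_runtime[0])
    busy_until = [0.0] * n_machines
    assignment = {}
    # Longest jobs first, judged by their best-case runtime.
    order = sorted(range(len(estimated_runtime)),
                   key=lambda j: -min(estimated_runtime[j]))
    for j in order:
        # Pick the machine on which this job would finish earliest.
        m = min(range(n_machines),
                key=lambda k: busy_until[k] + estimated_runtime[j][k])
        busy_until[m] += estimated_runtime[j][m]
        assignment[j] = m
    return assignment, busy_until


# Made-up estimates: machine 0 is twice as fast as machine 1.
runtimes = [[4.0, 8.0], [3.0, 6.0], [2.0, 4.0]]
print(schedule(runtimes))
```

With these made-up estimates both machines finish at the same time, which is the balanced outcome the scheduler aims for.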
14.4 Utility based Computing 
Recent years have seen the development of utility-based computing [Hoffman, 2003], driven by the
desire of the top computer vendors like HP, Sun, and IBM to sell computing power along with
servers and storage. The idea is to build data centers and lease out the computing power.
The charging system is based on a unit-of-computing metric, which HP calls a computon.
The pricing model is similar to the one used by utilities to charge their customers.
However, because there are so many variables, the concept of a unit-of-computing metric is
complex. Suppose a computing power provider decides to charge its customers based on
CPU cycle usage. This charging system would easily become controversial and unfair to the
customers, as different servers take different numbers of CPU cycles to complete a job. Another
related goal of the provider is to optimize the usage of the data-center resources (e.g. network
bandwidth, server capacity, server utilization, etc.).
Using HINT and APPMAP technology, the runtime of the customers' workloads can
be estimated in advance. This can aid in developing a better charging model that optimizes
data-center usage as well as the cost to the customers.
14.5 Power versus Performance 
Until recently, the design of a processor was about improving its performance while
optimizing the silicon area. The demand for lower-power CPUs is changing the
game. As the underlying CMOS technology moves from 0.13 um to 0.09 um to 0.065 um, the
power consumption due to leakage current increases. The leakage current accounts for a significant
portion of the power consumption and does not enhance performance. This means that the
power/performance tradeoff may not hold steady for future generations of CPUs. Thus for
the next generation of computer systems, the goal may not be more performance but adequate
performance with low power. HINT and APPMAP technology can be used to answer what
adequate performance is for an application or a set of applications. Would an application
gain useful performance from added performance at the cost of more power?
14.6 Chapter Summary 
HINT and APPMAP together open many avenues for how to use benchmarks. Some of
them are listed above. Since technology is getting more complex day by day, it is imperative to
develop performance modeling techniques that are easy to develop, deploy, and comprehend.
HINT and APPMAP provide one such practical solution.
CHAPTER 15 Conclusion and Future Directions 
In this thesis we have proposed four different ways to obtain application signatures. These
techniques vary in accuracy as well as ease of use. All techniques project results that have a
strong linear relationship with the measured data. The projected results are also highly
monotonic with respect to the measured data.
The application signatures obtained through the quasi-Newton search method were the most
accurate. The application signatures obtained through NetQUIPS and instantaneous QUIPS
were the fastest to obtain. The application signatures based on cache sensitivity were completely
machine-independent and treated the application as a black box. The application signatures using
just the three levels of cache misses were the easiest to obtain.
The Application Signature model would be of immense interest to members of the high-performance
community who are system designers, benchmarkers, performance evaluators, or capacity
planners, and to those with an interest in comparative performance analysis. There are two main
benefits of the APPMAP and hardware signature (HINT) combination. First, the combination
provides a simple what-if model for designing cost-effective hardware. Given an APPMAP, one
can vary the hardware features to obtain the best configuration for a fixed cost. Secondly,
it provides an easy, accurate, and quick way to predict application performance and compare
performance among a set of machines. It illustrates clearly why one application performs
better on machine A than on machine B while another application performs better on machine B
than on machine A. This method demystifies the poor performance of an application
on a system with a higher clock-speed processor but a slower memory subsystem, or on a
system with a larger but slower secondary cache, or under several such tradeoffs.
15.1 Original Contributions of the Thesis 
We achieved most of the objectives that we set out to achieve in this thesis. The original
contributions of the thesis are as follows:
1. We established the existence of the Application Signature model envisioned in [Gustafson, b].
2. We also showed that HINT is a superset of many conventional serial and parallel
benchmarks [Gustafson and Todi, 1998].
3. We devised four different methods to obtain the application signature of SPEC benchmarks.
4. We convolved the application signature with the hardware signature to project the
application time.
5. We validated the proposed Application Signature model using statistical correlation,
rank correlation, and deviation from linearity.
The Application Signature model is a black-box methodology and can be applied to any
application or benchmark.
15.2 Future Directions 
This section describes some possible extensions of the models
and techniques developed in the thesis.
1. Power Fourier Analysis: We tried power Fourier analysis [Bloomfield, 2000] on a benchmark
with a small footprint, Dhrystone [Weicker, 1984] [Aburto, ]. Even though the results
were promising, we could not extend our study to benchmarks or applications
with larger footprints. However, with the availability of cheaper high-density disks this
experiment is now feasible. The steps for doing Fourier analysis are outlined as
follows:
• Collect instruction and data traces of a given application.
• Build a simple LRU stack model [Spirn, 1977] with the sequence of memory references
from the collected trace.
• For each memory reference, find its distance from the top of the stack, that is, the
number of memory references encountered while going down the LRU stack before the
same reference is found. If the reference is not found, its distance is the size of the
LRU stack. In either case, update the LRU stack by removing the reference if it is
present and pushing it on top of the stack. For a sequence of memory
references this yields a sequence of distances.
• Use the sequence of distances to do power Fourier analysis.
• Match the power FFT results with the hardware signatures to project application
performance.
2. Future Workloads: It would be interesting to compare the application signatures of
present and future workloads of a similar kind. For example, one can compare similar
benchmarks in the SPEC2000 and SPEC2004 benchmark suites.
3. Phases within an application: An application has several phases in its execution
[Todi, 2001]. There are hardware performance counter based tools to separate the
application's phase boundaries. An interesting study would be to derive the application
signatures for each phase and to compare them to the overall application signature of a
given application.
4. Study effects of compiler optimization: We can study the effect of compiler
optimization on the application signature. For example, how does the application profile change
when using application binaries compiled with no optimization option, basic optimization
(often compiled with the +Ofast option), and profile-based optimization (often compiled
with the +O4 option with some kind of profile data)?
5. TLB sensitive applications: Since the HINT benchmark has a small instruction
footprint, it would be interesting to study how well applications known to be
TLB sensitive are predicted with the Application Signature model. If the error is large,
an alternate hardware profile can be proposed to capture the TLB-sensitive part.
6. Parallel Computers: The application signature model was successful in predicting
NAS Parallel Benchmarks [Gustafson and Todi, 1998]. It would be interesting to extend
the model to communication-sensitive applications. The hardware profile can be
enhanced using tools like NetPIPE [David Turner, Quinn Snell, Armin Mikler, 2003] that
can model application communication overhead for different sizes of communication
traffic.
7. Power HINT: It would be fruitful to extend the hardware profile to model power 
requirement and heat dissipation of the hardware. Using the power enhanced hardware 
profile, does Application Signature help in building optimum power efficient systems? 
8. Different architecture: We used several different architectures in [Gustafson and Todi, 1998]
to investigate the Application Signature model. In this thesis we restricted ourselves to
SGI's MIPS-based architecture. Does the application profile remain the same across
architectures? Does the application signature reveal the inherent properties of how the
application functions?
9. Adaptive computing: Adaptive computing, such as self-modifying code and intelligent
machines, changes its execution course based on the workload (hardware performance
counters). We believe that the Application Signature model would be of immense help in
evaluating such complex machine-application combinations, as it would be a tremendously
difficult and time-consuming task for conventional cycle-accurate simulators to model the
dynamic behavior of adaptive computing. A study to support this claim would be useful.
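The LRU stack-distance steps outlined under item 1 (Power Fourier Analysis) above can be sketched in Python as follows. This is an illustration under assumptions, not the code used in the Dhrystone experiment: distances are counted from zero at the top of the stack, the power spectrum is a direct discrete Fourier transform normalized by the sequence length, and the function names and the trace are made up.

```python
import cmath


def stack_distances(trace):
    """LRU stack distance of every reference in the trace.

    The distance is the number of entries above the reference in the LRU
    stack; a reference that is not found gets the current stack size.
    """
    stack = []       # index 0 is the top (most recently used)
    distances = []
    for ref in trace:
        if ref in stack:
            depth = stack.index(ref)
            stack.pop(depth)          # remove the old copy
        else:
            depth = len(stack)
        distances.append(depth)
        stack.insert(0, ref)          # push the reference on top
    return distances


def power_spectrum(seq):
    """|X_k|^2 / n for k = 0 .. n // 2, using a direct O(n^2) DFT."""
    n = len(seq)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                  for i, x in enumerate(seq))
        spectrum.append(abs(x_k) ** 2 / n)
    return spectrum


trace = ["a", "b", "c", "a", "a", "b"]   # made-up memory references
d = stack_distances(trace)
print(d)
print(power_spectrum(d))
```

The list-based stack and the direct transform keep the sketch short; for traces of realistic length a tree-based stack and an FFT routine would be needed.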
APPENDIX A Cache Memory Subsystem 
Cache Memory 
Cache memory bridges the gap between processor and memory speed. CPUs have been getting
faster due to innovation in superpipelined and superscalar design. In contrast, memory chips
have improved in performance by only seven percent each year. Hence main memory cannot
match the bandwidth required by the CPU. Small, expensive, high-speed memory (called
cache) fills this gap by storing a subset of the most recently used data and instructions.
[Figure A.1: "A Typical Memory Hierarchy"; recoverable labels: Secondary Cache, Tertiary Cache]
Figure A.1 Memory Hierarchy
The memory hierarchy in today's computers consists of five to six levels. They are: registers (L0),
primary cache (L1), secondary cache (L2), tertiary cache (L3), main memory (L4), magnetic
disk (L5), and magnetic tapes (L6). Some scalar computers have no secondary or tertiary
cache, although such caches are a recent trend. Memory can be classified by the following five
parameters: access time, memory size, cost per byte, transfer bandwidth, and unit of transfer.
Cache memories are effective because program behavior tends to follow the principle of locality
[Hennessy and Patterson, 2003], [Zomaya, 1996].
There are three dimensions of locality for multiprocessor systems, as listed below. The
first two dimensions are important on uniprocessors, and the third one [Zomaya, 1996] is specific
to parallel machines.
1. Spatial locality refers to the high likelihood that neighbours of the existing
memory reference will be referenced in the near future. For example, instructions are
executed mostly sequentially.
2. Temporal locality refers to the high likelihood that the existing reference
will be referenced again in the near future. For example, an instruction in a loop will be
referenced many times during a short period.
3. Processor locality refers to the high likelihood that the existing data reference will be
referenced by the same processor in the near future.
There are four aspects of cache design [Hill and Smith, 1984]:
1. Maximizing the hit ratio, that is, the probability of finding the required memory reference in
the cache.
2. Minimizing the access time, that is, the time to bring a reference from the cache to the register.
3. Minimizing the miss penalty, that is, the delay cost due to a miss.
4. Minimizing the overheads associated with multilevel cache consistency and cache
coherency in the case of shared multiprocessors.
3 Cs - Capacity, Compulsory, Conflict 
Cache memory is usually divided into blocks, or lines. A block is the smallest chunk of 
consecutive data or instructions brought from the next level of the memory hierarchy. If the 
requested block is not found in the cache, the block must be obtained from the next level of 
memory. This is called a cache miss. A cache miss can be classified into one of three 
categories, the Three Cs: 
1. Compulsory: Also called a cold-start miss. This is the miss that occurs when a block is 
first brought into the cache. 
2. Capacity: A miss on a block that was previously discarded because of the limited cache size. 
3. Conflict: Also called a collision miss or interference miss. It occurs when a block 
is discarded and later retrieved because too many references were mapped to the same 
set. This miss identifies a problem in the placement strategy. Conflict misses occur with 
direct-mapped and set-associative placement. There are no conflict misses in a fully 
associative cache. 
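The classification above can be sketched with a small simulation. This is a minimal illustration under stated assumptions, not the method used in this dissertation: it models a direct-mapped cache alongside a fully associative LRU cache of the same capacity, and counts a repeat miss as a conflict miss when the fully associative cache would have hit and as a capacity miss otherwise. The function name `classify_misses` and the traces are hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_lines):
    """Classify misses of a direct-mapped cache with num_lines lines.

    block_trace is a sequence of block numbers (addresses already divided
    by the block size). Returns a dict of hit and miss counts.
    """
    direct = [None] * num_lines   # direct-mapped cache: one block per line
    fully = OrderedDict()         # fully associative LRU cache, same capacity
    seen = set()                  # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Look up and update the fully associative reference cache.
        fully_hit = block in fully
        if fully_hit:
            fully.move_to_end(block)
        else:
            if len(fully) >= num_lines:
                fully.popitem(last=False)   # evict the least recently used block
            fully[block] = True

        # Look up and update the direct-mapped cache.
        line = block % num_lines
        if direct[line] == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1       # first reference to this block
        elif fully_hit:
            counts["conflict"] += 1         # only the placement caused the miss
        else:
            counts["capacity"] += 1         # even full associativity would miss
        direct[line] = block
        seen.add(block)

    return counts


# Blocks 0 and 4 map to the same line of a 4-line direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_lines=4))
# Five distinct blocks do not fit in four lines at all.
print(classify_misses([0, 1, 2, 3, 4, 0], num_lines=4))
```

In the first trace the two repeat references are conflict misses, because a fully associative cache of the same size would hold both blocks. In the second trace the final reference is a capacity miss, because the block has been evicted even from the fully associative cache.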
APPENDIX B HINT Database 
Database Input Form 
One of the unique characteristics of the HINT database [Gustafson et al., ] is that it 
requires submitters to fill in a considerable amount of information. This procedure has the 
following advantages. 
Firstly, it provides a complete view of the system as seen by the person running the 
benchmark. This is helpful because the results can be reproduced by a third party. Secondly, 
no submitted result¹ is allowed into the database without a proper identity. As a 
consequence, every result is backed by someone. Strictly following this procedure has added 
a great deal of credibility to the database of HINT results. 
A sample form used by the HINT database for each entry is shown in Table B.1. A consumer 
guide index, developed by Dr. John Gustafson and based on the HINT benchmark, is shown in 
Figure B.1. 
¹ Ideally, the publication of benchmark results should be like the publication of a research paper. Just as 
each research paper is reviewed independently by referees who are experts in the area, benchmark results 
should also be reviewed, either by a panel of experts or by independent referees familiar with the benchmarked 
machine. 
Table B.1 A Sample HINT Database Entry Form for a Typical Workstation 
Personal Information 
Name/Email: Rajat Todi, todi@scl.ameslab.gov 
Company/Address: Ames Laboratory, USDOE 
Date: June, 1997 
Computer Architecture 
Computer Vendor: Silicon Graphics Inc. 
Computer Model name/number: Indy SC 
Operating System: IRIX 6.2 
Graphics: Indy 24-bit graphics 
Standalone or Multiuser: Multiuser 
Processor: MIPS R4000 
Floating Point Coprocessor: MIPS R4010 
Clock Rate: 100 MHz † 
# of Processors (# used): 1 (1) 
Communication Architecture 
Topology: Not applicable 
Interprocessor Latency, B/W: Not applicable 
Physical Statistics 
Serial Number: 
Physical Dimensions: 41cm x 35cm x 8cm 
Weight, Power Consumption: 
Purchase Date: 9/30/93 
List Price at Purchase Date: 
Memory Architecture 
Memory Architecture: 
Memory Size: 64 MB 
Memory Interleave: 
Memory Bandwidth: 267 MB/sec † 
Primary Cache Size: 8 KB data / 8 KB instruction † 
Secondary Cache Size: 1 MB † 
Cache Associativity: 
Type of Hard Disk (Virtual Mem): SCSI 2 
Hard Disk brand name: Seagate 
Hard Disk Size, access time: 1.0 GB † 
HINT Executable Statistics and Results 
HINT version: 1.0 
Language Used: C (ANSI) 
Compiler (Options Used): cc (-Ofast) 
Data Type for Computation: double 
Data Type for Indexing: int 
Net QUIPS: 2.63 MQUIPS 
† Not personally measured 
[Figure B.1 appears here: a sample "ComputeGuide" label for a Wintel personal computer (Model 9600, 64 megabytes main memory). The label rates this model at 15.6 net MQUIPS (32-bit integer rating) on a scale running from 3.2 (June 1998 model with lowest performance) to 17.9 (June 1998 model with highest performance); only single-user personal computers are used on this scale. A second panel, "Estimated performance over a range of uses", plots performance for 32-bit integer and 64-bit floating point data against the time for a computing task (1 microsecond to 1 second).] 
Figure B.1 HINT Graphs can be used for a Consumer Computer Performance Guide 
APPENDIX C Application Characteristics - I 
Table C.1 Hardware Event Counters for R10000 and R12000 
Event Number Event Name 
C1 Quadwords Written Back from Scache 
C2 Graduated Loads 
C3 Primary Instruction Cache Misses 
C4 Primary Data Cache Misses 
C5 Secondary Instruction Cache Misses 
C6 Secondary Data Cache Misses 
C7 Graduated Instructions 
C8 Graduated Floating Point Instructions 
C9 Issued Instructions 
C10 TLB Misses 
C11 Graduated Stores 
C12 Cycles 
C13 Mispredicted Branches 
Table C.2 Characteristics for 099.go using input null.in (I1) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 2.32e+05 9.48e+05 4.35e+05 1.12e+06 6.53e+05 9.10e+05 1.91e+06 1.55e+06 
C2 5.21e+09 5.21e+09 5.21e+09 5.21e+09 5.21e+09 5.21e+09 5.21e+09 5.21e+09 
C3 8.44e+07 8.47e+07 8.44e+07 8.78e+07 8.75e+07 8.47e+07 8.98e+07 9.13e+07 
C4 1.45e+08 1.46e+08 1.45e+08 1.54e+08 1.56e+08 1.51e+08 1.55e+08 1.52e+08 
C5 2.53e+03 4.69e+03 3.19e+04 2.20e+05 3.27e+05 1.57e+05 4.65e+05 4.63e+05 
C6 7.84e+03 5.55e+04 2.81e+04 2.30e+05 2.93e+05 1.30e+05 3.85e+05 4.82e+05 
C7 1.60e+10 1.60e+10 1.60e+10 1.60e+10 1.60e+10 1.60e+10 1.60e+10 1.60e+10 
C8 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 
C9 1.79e+10 1.79e+10 1.79e+10 1.79e+10 2.39e+10 1.79e+10 1.71e+10 1.72e+10 
C10 3.26e+03 6.02e+03 3.36e+03 8.51e+03 2.12e+06 3.41e+03 6.27e+03 2.16e+06 
C11 1.50e+09 1.50e+09 1.50e+09 1.50e+09 1.50e+09 1.50e+09 1.50e+09 1.50e+09 
C12 1.70e+10 1.70e+10 1.70e+10 1.72e+10 1.79e+10 1.71e+10 1.79e+10 1.80e+10 
C13 6.26e+08 6.26e+08 6.26e+08 6.26e+08 5.97e+08 6.26e+08 6.29e+08 6.30e+08 
Table C.3 Characteristics for 099.go using input null1.in (I2) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 4.22e+05 1.82e+06 1.06e+06 1.72e+06 1.43e+06 2.19e+06 1.05e+06 1.63e+06 
C2 1.13e+10 1.13e+10 1.13e+10 1.13e+10 1.13e+10 1.13e+10 1.13e+10 1.13e+10 
C3 1.77e+08 1.77e+08 1.76e+08 1.84e+08 1.84e+08 1.77e+08 1.89e+08 1.92e+08 
C4 3.49e+08 3.50e+08 3.49e+08 3.67e+08 3.72e+08 3.60e+08 3.70e+08 3.63e+08 
C5 2.11e+04 2.51e+04 1.64e+05 5.96e+05 1.05e+06 1.91e+05 1.69e+05 2.42e+06 
C6 5.49e+03 6.04e+04 1.45e+05 4.71e+05 1.08e+06 1.90e+05 1.76e+05 1.04e+06 
C7 3.48e+10 3.48e+10 3.48e+10 3.48e+10 3.48e+10 3.48e+10 3.48e+10 3.48e+10 
C8 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 
C9 3.94e+10 3.94e+10 3.94e+10 3.94e+10 5.26e+10 3.94e+10 3.76e+10 3.76e+10 
C10 6.43e+03 1.14e+04 6.63e+03 1.84e+04 5.12e+06 6.94e+03 1.25e+04 5.27e+06 
C11 3.22e+09 3.22e+09 3.22e+09 3.22e+09 3.22e+09 3.22e+09 3.22e+09 3.22e+09 
C12 3.72e+10 3.73e+10 3.73e+10 3.76e+10 3.91e+10 3.74e+10 3.92e+10 3.95e+10 
C13 1.40e+09 1.40e+09 1.40e+09 1.40e+09 1.33e+09 1.40e+09 1.41e+09 1.41e+09 
Table C.4 Characteristics for 099.go using input 5stone21.in (I3) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 5.09e+05 1.93e+06 1.20e+06 1.92e+06 8.83e+05 1.35e+06 2.19e+06 4.00e+06 
C2 1.04e+10 1.04e+10 1.04e+10 1.04e+10 1.04e+10 1.04e+10 1.04e+10 1.04e+10 
C3 1.72e+08 1.72e+08 1.71e+08 1.78e+08 1.78e+08 1.72e+08 1.83e+08 1.85e+08 
C4 3.08e+08 3.09e+08 3.08e+08 3.24e+08 3.28e+08 3.18e+08 3.27e+08 3.21e+08 
C5 4.28e+04 3.62e+05 3.90e+05 5.24e+05 4.87e+05 3.59e+05 5.72e+05 2.74e+06 
C6 3.30e+04 1.07e+05 2.08e+05 4.51e+05 4.40e+05 1.56e+05 4.94e+05 4.11e+06 
C7 3.18e+10 3.18e+10 3.18e+10 3.18e+10 3.18e+10 3.18e+10 3.18e+10 3.18e+10 
C8 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 
C9 3.60e+10 3.60e+10 3.60e+10 3.60e+10 4.80e+10 3.60e+10 3.44e+10 3.44e+10 
C10 5.80e+03 1.03e+04 5.69e+03 1.47e+04 4.38e+06 6.18e+03 1.18e+04 4.58e+06 
C11 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 
C12 3.40e+10 3.41e+10 3.41e+10 3.44e+10 3.57e+10 3.42e+10 3.58e+10 3.65e+10 
C13 1.28e+09 1.28e+09 1.28e+09 1.28e+09 1.22e+09 1.28e+09 1.28e+09 1.28e+09 
Table C.5 Characteristics for 099.go using input 9stone21.in (I4) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 3.15e+05 1.78e+06 1.58e+06 2.40e+06 2.01e+06 1.99e+07 1.96e+06 2.21e+06 
C2 1.05e+10 1.05e+10 1.05e+10 1.05e+10 1.05e+10 1.05e+10 1.05e+10 1.05e+10 
C3 1.64e+08 1.64e+08 1.64e+08 1.71e+08 1.70e+08 1.64e+08 1.76e+08 1.80e+08 
C4 3.02e+08 3.03e+08 3.02e+08 3.19e+08 3.23e+08 3.13e+08 3.22e+08 3.17e+08 
C5 1.78e+04 2.47e+04 2.87e+05 6.29e+05 1.24e+06 2.11e+05 7.21e+05 1.46e+06 
C6 8.09e+02 7.10e+04 8.44e+04 4.53e+05 1.31e+06 1.40e+05 4.20e+05 1.15e+06 
C7 3.22e+10 3.22e+10 3.22e+10 3.22e+10 3.22e+10 3.22e+10 3.22e+10 3.22e+10 
C8 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 
C9 3.62e+10 3.62e+10 3.62e+10 3.63e+10 4.83e+10 3.62e+10 3.46e+10 3.46e+10 
C10 5.77e+03 9.79e+03 5.63e+03 1.43e+04 3.77e+06 6.09e+03 1.12e+04 3.87e+06 
C11 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 2.96e+09 
C12 3.42e+10 3.42e+10 3.43e+10 3.46e+10 3.59e+10 3.44e+10 3.61e+10 3.64e+10 
C13 1.27e+09 1.27e+09 1.27e+09 1.27e+09 1.21e+09 1.27e+09 1.28e+09 1.28e+09 
Table C.6 Characteristics for 147.vortex using input vortex.in (I5) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 1.07e+08 1.09e+08 1.34e+08 1.40e+08 9.43e+07 1.43e+08 1.38e+08 8.52e+07 
C2 2.98e+10 2.98e+10 2.98e+10 2.98e+10 2.98e+10 2.98e+10 2.98e+10 2.98e+10 
C3 5.27e+08 5.30e+08 5.29e+08 5.65e+08 5.90e+08 5.49e+08 5.61e+08 6.03e+08 
C4 4.10e+08 4.12e+08 4.12e+08 4.29e+08 4.82e+08 4.24e+08 4.29e+08 4.79e+08 
C5 1.35e+06 1.69e+06 5.23e+06 6.18e+06 5.67e+06 5.84e+06 6.39e+06 7.42e+06 
C6 3.14e+07 3.25e+07 5.19e+07 5.26e+07 6.90e+07 5.25e+07 5.48e+07 6.71e+07 
C7 8.50e+10 8.50e+10 8.50e+10 8.50e+10 8.53e+10 8.50e+10 8.50e+10 8.53e+10 
C8 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 
C9 9.16e+10 9.16e+10 9.17e+10 9.16e+10 1.18e+11 9.15e+10 8.78e+10 8.91e+10 
C10 1.90e+08 1.90e+08 1.90e+08 1.87e+08 4.97e+08 1.92e+08 1.90e+08 4.97e+08 
C11 1.53e+10 1.53e+10 1.53e+10 1.53e+10 1.53e+10 1.53e+10 1.53e+10 1.53e+10 
C12 9.78e+10 9.88e+10 1.02e+11 9.97e+10 1.18e+11 9.70e+10 1.02e+11 1.13e+11 
C13 2.61e+09 2.61e+09 2.61e+09 2.59e+09 2.69e+09 2.58e+09 2.61e+09 2.61e+09 
Table C.7 Characteristics for 132.ijpeg using input penguin.ppm (I6) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 3.00e+07 2.91e+07 3.01e+07 3.06e+07 3.01e+07 2.99e+07 3.07e+07 2.95e+07 
C2 5.58e+09 5.58e+09 5.58e+09 5.58e+09 5.58e+09 5.58e+09 5.58e+09 5.58e+09 
C3 3.86e+05 5.70e+05 5.23e+05 1.24e+06 9.75e+05 5.95e+05 1.09e+06 1.42e+06 
C4 6.99e+07 7.01e+07 6.99e+07 7.21e+07 7.29e+07 7.06e+07 7.27e+07 7.27e+07 
C5 1.21e+05 1.12e+05 1.43e+05 1.35e+05 2.32e+05 1.42e+05 1.37e+05 2.23e+05 
C6 7.12e+06 7.17e+06 7.67e+06 7.82e+06 1.54e+07 7.70e+06 7.73e+06 1.54e+07 
C7 2.85e+10 2.85e+10 2.85e+10 2.85e+10 2.85e+10 2.85e+10 2.85e+10 2.85e+10 
C8 6.78e+03 6.78e+03 6.78e+03 6.27e+03 6.27e+03 6.27e+03 6.78e+03 6.27e+03 
C9 2.78e+10 2.78e+10 2.78e+10 2.79e+10 3.06e+10 2.79e+10 2.76e+10 2.76e+10 
C10 7.69e+04 7.83e+04 7.71e+04 7.86e+04 2.86e+05 7.69e+04 7.76e+04 2.94e+05 
C11 2.38e+09 2.38e+09 2.38e+09 2.38e+09 2.38e+09 2.38e+09 2.38e+09 2.38e+09 
C12 1.94e+10 1.94e+10 1.95e+10 1.91e+10 2.03e+10 1.87e+10 1.92e+10 2.05e+10 
C13 2.19e+08 2.19e+08 2.19e+08 2.20e+08 2.20e+08 2.20e+08 2.21e+08 2.21e+08 
Table C.8 Characteristics for 132.ijpeg using input specmun.ppm (I7) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 2.76e+07 2.72e+07 2.82e+07 2.82e+07 2.80e+07 2.78e+07 2.80e+07 2.73e+07 
C2 5.08e+09 5.08e+09 5.08e+09 5.08e+09 5.08e+09 5.08e+09 5.08e+09 5.08e+09 
C3 3.59e+05 5.36e+05 5.06e+05 1.12e+06 1.00e+06 5.89e+05 9.75e+05 1.29e+06 
C4 6.73e+07 6.74e+07 6.72e+07 6.87e+07 6.77e+07 6.71e+07 6.92e+07 6.92e+07 
C5 1.34e+05 1.10e+05 1.70e+05 1.38e+05 3.83e+05 1.60e+05 1.49e+05 2.27e+05 
C6 6.73e+06 6.75e+06 7.15e+06 7.15e+06 1.44e+07 7.13e+06 7.15e+06 1.42e+07 
C7 2.58e+10 2.58e+10 2.58e+10 2.58e+10 2.58e+10 2.58e+10 2.58e+10 2.58e+10 
C8 6.78e+03 6.78e+03 6.78e+03 6.27e+03 6.27e+03 6.27e+03 6.78e+03 6.27e+03 
C9 2.52e+10 2.52e+10 2.52e+10 2.52e+10 2.77e+10 2.52e+10 2.50e+10 2.50e+10 
C10 7.12e+04 7.20e+04 7.10e+04 7.32e+04 2.66e+05 7.08e+04 7.11e+04 2.72e+05 
C11 2.15e+09 2.15e+09 2.15e+09 2.15e+09 2.15e+09 2.15e+09 2.15e+09 2.15e+09 
C12 1.75e+10 1.75e+10 1.75e+10 1.72e+10 1.83e+10 1.68e+10 1.73e+10 1.85e+10 
C13 1.94e+08 1.94e+08 1.94e+08 1.95e+08 1.95e+08 1.95e+08 1.97e+08 1.96e+08 
Table C.9 Characteristics for 132.ijpeg using input vigo.ppm (I8) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 3.11e+07 3.03e+07 3.11e+07 3.14e+07 3.15e+07 3.11e+07 3.13e+07 3.04e+07 
C2 5.80e+09 5.80e+09 5.80e+09 5.80e+09 5.80e+09 5.80e+09 5.80e+09 5.80e+09 
C3 3.92e+05 5.91e+05 5.36e+05 1.26e+06 1.02e+06 6.02e+05 1.36e+06 1.48e+06 
C4 7.58e+07 7.60e+07 7.57e+07 7.74e+07 7.73e+07 7.59e+07 7.80e+07 7.86e+07 
C5 1.26e+05 1.12e+05 1.47e+05 1.44e+05 2.31e+05 1.36e+05 1.48e+05 2.18e+05 
C6 7.58e+06 7.63e+06 7.98e+06 8.04e+06 1.59e+07 8.00e+06 8.01e+06 1.60e+07 
C7 2.95e+10 2.95e+10 2.95e+10 2.95e+10 2.95e+10 2.95e+10 2.95e+10 2.95e+10 
C8 6.78e+03 6.78e+03 6.78e+03 6.27e+03 6.27e+03 6.27e+03 6.78e+03 6.27e+03 
C9 2.88e+10 2.88e+10 2.88e+10 2.88e+10 3.18e+10 2.88e+10 2.86e+10 2.86e+10 
C10 7.99e+04 8.11e+04 8.00e+04 8.15e+04 3.02e+05 7.97e+04 7.99e+04 3.09e+05 
C11 2.48e+09 2.48e+09 2.48e+09 2.48e+09 2.48e+09 2.48e+09 2.48e+09 2.48e+09 
C12 2.01e+10 2.01e+10 2.02e+10 1.98e+10 2.11e+10 1.94e+10 1.99e+10 2.12e+10 
C13 2.35e+08 2.35e+08 2.35e+08 2.36e+08 2.36e+08 2.36e+08 2.37e+08 2.37e+08 
Table C.10 Characteristics for 126.gcc using input 1expr.i (I9) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 2.53e+06 2.57e+06 3.37e+06 3.57e+06 2.92e+06 3.44e+06 3.43e+06 2.86e+06 
C2 3.68e+08 3.68e+08 3.68e+08 3.68e+08 3.68e+08 3.68e+08 3.68e+08 3.68e+08 
C3 1.55e+07 1.55e+07 1.55e+07 1.56e+07 1.58e+07 1.53e+07 1.63e+07 1.66e+07 
C4 9.60e+06 9.58e+06 9.59e+06 9.84e+06 1.01e+07 9.62e+06 9.91e+06 1.02e+07 
C5 6.68e+04 8.30e+04 2.26e+05 2.12e+05 3.34e+05 2.43e+05 2.53e+05 2.96e+05 
C6 3.50e+05 3.73e+05 6.18e+05 6.09e+05 1.02e+06 6.34e+05 6.52e+05 9.95e+05 
C7 1.30e+09 1.30e+09 1.30e+09 1.30e+09 1.30e+09 1.30e+09 1.30e+09 1.31e+09 
C8 7.19e+04 7.19e+04 7.19e+04 7.19e+04 7.19e+04 7.19e+04 7.19e+04 7.19e+04 
C9 1.34e+09 1.34e+09 1.34e+09 1.34e+09 1.70e+09 1.34e+09 1.29e+09 1.31e+09 
C10 1.43e+05 1.42e+05 1.42e+05 1.27e+05 2.83e+06 1.31e+05 1.33e+05 2.86e+06 
C11 1.78e+08 1.78e+08 1.78e+08 1.78e+08 1.78e+08 1.78e+08 1.78e+08 1.78e+08 
C12 1.54e+09 1.56e+09 1.62e+09 1.62e+09 1.88e+09 1.53e+09 1.64e+09 1.82e+09 
C13 5.02e+07 5.02e+07 5.02e+07 5.05e+07 4.34e+07 5.05e+07 5.04e+07 5.13e+07 
Table C.11 Characteristics for 126.gcc using input 1recog.i (I10) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 3.86e+05 3.81e+05 6.54e+05 6.93e+05 5.76e+05 6.31e+05 6.63e+05 5.72e+05 
C2 1.04e+08 1.04e+08 1.04e+08 1.04e+08 1.04e+08 1.04e+08 1.04e+08 1.04e+08 
C3 4.95e+06 4.96e+06 4.96e+06 4.97e+06 5.01e+06 4.87e+06 5.18e+06 5.26e+06 
C4 2.32e+06 2.33e+06 2.35e+06 2.41e+06 2.50e+06 2.34e+06 2.42e+06 2.51e+06 
C5 3.11e+04 3.19e+04 7.14e+04 9.75e+04 1.28e+05 7.44e+04 8.46e+04 1.25e+05 
C6 4.20e+04 4.74e+04 9.30e+04 1.04e+05 1.69e+05 9.75e+04 1.03e+05 1.68e+05 
C7 3.64e+08 3.64e+08 3.64e+08 3.65e+08 3.66e+08 3.65e+08 3.65e+08 3.66e+08 
C8 2.73e+04 2.73e+04 2.73e+04 2.73e+04 2.73e+04 2.73e+04 2.73e+04 2.73e+04 
C9 3.71e+08 3.71e+08 3.71e+08 3.73e+08 4.72e+08 3.73e+08 3.59e+08 3.61e+08 
C10 9.73e+03 1.00e+04 9.37e+03 9.16e+03 6.59e+05 8.63e+03 9.15e+03 6.69e+05 
C11 4.81e+07 4.81e+07 4.81e+07 4.82e+07 4.82e+07 4.82e+07 4.82e+07 4.82e+07 
C12 4.29e+08 4.31e+08 4.47e+08 4.53e+08 5.22e+08 4.32e+08 4.63e+08 5.07e+08 
C13 1.45e+07 1.45e+07 1.45e+07 1.46e+07 1.23e+07 1.46e+07 1.46e+07 1.47e+07 
Table C.12 Characteristics for 126.gcc using input 1reload1.i (I11) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 3.10e+06 3.07e+06 3.68e+06 3.87e+06 3.37e+06 3.86e+06 3.98e+06 3.39e+06 
C2 3.22e+08 3.22e+08 3.22e+08 3.22e+08 3.22e+08 3.22e+08 3.22e+08 3.22e+08 
C3 1.35e+07 1.35e+07 1.36e+07 1.37e+07 1.37e+07 1.34e+07 1.42e+07 1.44e+07 
C4 9.85e+06 9.87e+06 9.89e+06 1.00e+07 1.05e+07 9.89e+06 1.01e+07 1.05e+07 
C5 7.25e+04 8.13e+04 1.97e+05 1.83e+05 2.59e+05 1.67e+05 2.01e+05 3.16e+05 
C6 4.05e+05 4.19e+05 6.38e+05 6.58e+05 1.10e+06 6.36e+05 6.76e+05 1.10e+06 
C7 1.15e+09 1.15e+09 1.15e+09 1.15e+09 1.16e+09 1.15e+09 1.15e+09 1.16e+09 
C8 1.15e+05 1.15e+05 1.15e+05 1.15e+05 1.15e+05 1.15e+05 1.15e+05 1.15e+05 
C9 1.18e+09 1.18e+09 1.19e+09 1.19e+09 1.48e+09 1.19e+09 1.14e+09 1.15e+09 
C10 1.08e+05 1.08e+05 1.07e+05 9.91e+04 2.76e+06 9.73e+04 9.94e+04 2.78e+06 
C11 1.67e+08 1.67e+08 1.67e+08 1.67e+08 1.68e+08 1.67e+08 1.67e+08 1.68e+08 
C12 1.38e+09 1.38e+09 1.45e+09 1.45e+09 1.71e+09 1.36e+09 1.45e+09 1.66e+09 
C13 4.18e+07 4.18e+07 4.19e+07 4.21e+07 3.60e+07 4.22e+07 4.21e+07 4.26e+07 
Table C.13 Characteristics for 126.gcc using input 2stmt.i (I12) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 5.00e+05 5.14e+05 9.84e+05 1.02e+06 7.88e+05 1.02e+06 9.98e+05 7.79e+05 
C2 2.03e+08 2.03e+08 2.03e+08 2.03e+08 2.03e+08 2.03e+08 2.03e+08 2.03e+08 
C3 9.51e+06 9.53e+06 9.53e+06 9.56e+06 9.51e+06 9.36e+06 9.98e+06 1.00e+07 
C4 3.55e+06 3.56e+06 3.60e+06 3.74e+06 3.88e+06 3.59e+06 3.79e+06 3.95e+06 
C5 5.11e+04 5.29e+04 1.54e+05 1.80e+05 2.57e+05 1.84e+05 1.83e+05 2.53e+05 
C6 6.56e+04 7.04e+04 1.41e+05 1.70e+05 2.49e+05 1.63e+05 1.72e+05 2.54e+05 
C7 7.08e+08 7.08e+08 7.08e+08 7.09e+08 7.11e+08 7.09e+08 7.09e+08 7.11e+08 
C8 4.49e+04 4.49e+04 4.49e+04 4.49e+04 4.49e+04 4.49e+04 4.49e+04 4.49e+04 
C9 7.29e+08 7.29e+08 7.30e+08 7.32e+08 9.36e+08 7.31e+08 7.05e+08 7.10e+08 
C10 3.63e+04 3.67e+04 3.64e+04 3.33e+04 1.37e+06 3.41e+04 3.65e+04 1.39e+06 
C11 9.44e+07 9.44e+07 9.44e+07 9.44e+07 9.44e+07 9.44e+07 9.44e+07 9.44e+07 
C12 8.35e+08 8.36e+08 8.72e+08 8.77e+08 1.01e+09 8.46e+08 9.04e+08 9.72e+08 
C13 2.95e+07 2.95e+07 2.95e+07 2.96e+07 2.53e+07 2.97e+07 2.96e+07 2.99e+07 
Table C.14 Characteristics for 124.m88ksim using input ctl.raw (I13) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 4.75e+06 6.94e+06 5.70e+06 4.70e+06 3.29e+06 6.74e+06 5.38e+06 2.95e+06 
C2 1.88e+10 1.88e+10 1.88e+10 1.88e+10 1.88e+10 1.88e+10 1.88e+10 1.88e+10 
C3 1.11e+09 1.11e+09 1.11e+09 1.12e+09 1.15e+09 1.11e+09 1.12e+09 1.12e+09 
C4 1.31e+07 1.42e+07 1.27e+07 2.62e+07 2.73e+07 2.00e+07 2.92e+07 2.86e+07 
C5 8.22e+03 5.88e+03 2.55e+04 6.76e+04 9.42e+04 1.51e+05 2.34e+04 6.06e+04 
C6 5.60e+05 6.18e+05 6.96e+05 7.25e+05 1.04e+06 8.63e+05 1.33e+06 1.25e+06 
C7 8.00e+10 8.00e+10 8.00e+10 8.00e+10 8.00e+10 8.00e+10 8.00e+10 8.00e+10 
C8 4.39e+02 4.39e+02 4.39e+02 4.39e+02 4.39e+02 4.39e+02 4.39e+02 4.39e+02 
C9 7.54e+10 7.54e+10 7.54e+10 7.54e+10 9.29e+10 7.55e+10 7.43e+10 7.44e+10 
C10 8.66e+05 8.76e+05 8.68e+05 8.94e+05 2.69e+06 8.79e+05 8.85e+05 2.80e+06 
C11 8.40e+09 8.40e+09 8.40e+09 8.40e+09 8.40e+09 8.40e+09 8.40e+09 8.40e+09 
C12 8.34e+10 8.34e+10 8.34e+10 8.41e+10 8.81e+10 8.38e+10 8.69e+10 8.74e+10 
C13 1.69e+09 1.69e+09 1.69e+09 1.69e+09 1.65e+09 1.69e+09 1.70e+09 1.72e+09 
Table C.15 Characteristics for 124.m88ksim using input test.raw (I14) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 1.24e+04 3.33e+04 1.61e+04 1.79e+04 1.32e+04 2.62e+04 2.03e+04 1.05e+04 
C2 1.20e+08 1.20e+08 1.20e+08 1.20e+08 1.20e+08 1.20e+08 1.20e+08 1.20e+08 
C3 7.57e+06 7.57e+06 7.56e+06 7.58e+06 7.79e+06 7.55e+06 7.54e+06 7.51e+06 
C4 1.30e+05 1.33e+05 1.26e+05 2.07e+05 2.32e+05 1.70e+05 2.22e+05 2.17e+05 
C5 9.83e+02 1.17e+03 1.35e+03 1.70e+03 2.52e+03 1.21e+03 1.94e+03 2.57e+03 
C6 4.23e+02 7.33e+02 1.03e+03 2.52e+03 3.70e+03 2.95e+03 2.48e+03 3.93e+03 
C7 5.44e+08 5.44e+08 5.44e+08 5.45e+08 5.45e+08 5.45e+08 5.45e+08 5.45e+08 
C8 1.65e+02 1.65e+02 1.65e+02 1.65e+02 1.65e+02 1.65e+02 1.65e+02 1.65e+02 
C9 5.11e+08 5.11e+08 5.11e+08 5.12e+08 6.46e+08 5.12e+08 5.05e+08 5.05e+08 
C10 2.79e+02 3.43e+02 3.04e+02 3.85e+02 5.17e+03 2.88e+02 3.67e+02 6.35e+03 
C11 4.56e+07 4.56e+07 4.56e+07 4.56e+07 4.56e+07 4.56e+07 4.56e+07 4.50e+07 
C12 5.90e+08 5.90e+08 5.90e+08 5.91e+08 6.11e+08 5.89e+08 6.17e+08 6.16e+08 
C13 1.30e+07 1.30e+07 1.30e+07 1.30e+07 1.30e+07 1.30e+07 1.31e+07 1.32e+07 
Table C.16 Characteristics for 129.compress using input bigtest.in (I15) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 3.41e+07 3.41e+07 3.52e+07 3.73e+07 3.75e+07 3.55e+07 3.99e+07 3.53e+07 
C2 1.57e+10 1.57e+10 1.57e+10 1.57e+10 1.57e+10 1.57e+10 1.57e+10 1.57e+10 
C3 7.29e+04 1.43e+05 1.53e+05 4.69e+05 3.46e+05 1.98e+05 7.01e+05 7.74e+05 
C4 1.40e+09 1.40e+09 1.40e+09 1.40e+09 1.39e+09 1.40e+09 1.43e+09 1.40e+09 
C5 9.09e+04 1.29e+05 3.34e+04 3.49e+04 3.61e+05 2.71e+04 1.70e+04 5.45e+05 
C6 8.62e+06 8.78e+06 8.88e+06 9.23e+06 1.84e+07 9.02e+06 9.77e+06 1.91e+07 
C7 5.17e+10 5.17e+10 5.17e+10 5.17e+10 5.18e+10 5.17e+10 5.17e+10 5.18e+10 
C8 1.89e+08 1.89e+08 1.89e+08 1.89e+08 1.89e+08 1.89e+08 1.89e+08 1.89e+08 
C9 5.53e+10 5.53e+10 5.53e+10 5.53e+10 7.01e+10 5.53e+10 5.29e+10 5.32e+10 
C10 5.01e+04 5.41e+04 5.02e+04 5.65e+04 5.95e+07 5.00e+04 5.31e+04 6.03e+07 
C11 6.15e+09 6.15e+09 6.15e+09 6.15e+09 6.15e+09 6.15e+09 6.15e+09 6.15e+09 
C12 5.65e+10 5.65e+10 5.65e+10 5.61e+10 6.04e+10 5.56e+10 5.85e+10 6.10e+10 
C13 1.84e+09 1.84e+09 1.84e+09 1.84e+09 1.82e+09 1.84e+09 1.87e+09 1.87e+09 
Table C.17 Characteristics for 129.compress using input test.in (I16) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 4.74e+03 2.50e+03 4.51e+03 1.12e+04 9.68e+03 9.99e+03 1.06e+04 9.20e+03 
C2 5.71e+05 5.71e+05 5.71e+05 5.71e+05 5.71e+05 5.71e+05 5.71e+05 5.71e+05 
C3 2.32e+03 2.34e+03 2.31e+03 2.38e+03 4.36e+03 2.35e+03 6.38e+03 4.27e+03 
C4 2.22e+05 2.22e+05 2.22e+05 2.27e+05 2.36e+05 2.27e+05 2.28e+05 2.37e+05 
C5 4.63e+02 3.89e+02 6.29e+02 6.79e+02 2.05e+03 7.17e+02 7.07e+02 1.12e+03 
C6 6.54e+02 4.30e+02 8.32e+02 2.22e+03 3.93e+03 2.22e+03 1.99e+03 3.45e+03 
C7 3.41e+06 3.41e+06 3.41e+06 4.04e+06 4.04e+06 4.04e+06 4.18e+06 4.11e+06 
C8 5.05e+04 5.05e+04 5.05e+04 5.05e+04 5.05e+04 5.05e+04 5.05e+04 5.05e+04 
C9 3.62e+06 3.62e+06 3.61e+06 4.35e+06 4.67e+06 4.33e+06 4.18e+06 4.10e+06 
C10 1.54e+02 1.72e+02 1.59e+02 1.24e+02 1.36e+03 1.33e+02 1.48e+02 1.36e+03 
C11 1.91e+06 1.91e+06 1.91e+06 1.95e+06 1.95e+06 1.95e+06 1.96e+06 1.96e+06 
C12 4.31e+06 4.42e+06 4.57e+06 5.24e+06 6.88e+06 5.10e+06 5.22e+06 5.72e+06 
C13 2.59e+04 2.59e+04 2.59e+04 5.22e+04 5.15e+04 5.21e+04 4.90e+04 4.97e+04 
Table C.18 Characteristics for 130.li using input - (I17) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 1.34e+06 3.98e+06 1.14e+06 4.11e+06 2.42e+06 3.08e+06 2.17e+06 7.41e+05 
C2 2.26e+10 2.26e+10 2.26e+10 2.26e+10 2.26e+10 2.26e+10 2.26e+10 2.26e+10 
C3 1.62e+07 1.76e+07 1.69e+07 7.51e+06 8.22e+06 3.45e+06 1.09e+07 1.08e+07 
C4 3.62e+08 3.63e+08 3.62e+08 3.65e+08 3.68e+08 3.62e+08 3.68e+08 3.70e+08 
C5 1.00e+03 4.15e+03 5.29e+04 6.37e+04 1.06e+04 4.35e+04 9.81e+05 5.77e+04 
C6 4.89e+02 3.53e+04 9.11e+04 3.42e+05 6.35e+05 3.14e+05 1.12e+06 1.55e+06 
C7 6.87e+10 6.87e+10 6.87e+10 6.87e+10 6.87e+10 6.87e+10 6.87e+10 6.87e+10 
C8 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 
C9 6.92e+10 6.92e+10 6.92e+10 6.92e+10 9.15e+10 6.92e+10 6.47e+10 6.47e+10 
C10 4.69e+03 7.18e+03 4.90e+03 1.42e+04 6.21e+05 4.82e+03 1.07e+04 7.86e+05 
C11 1.11e+10 1.11e+10 1.11e+10 1.11e+10 1.11e+10 1.11e+10 1.11e+10 1.11e+10 
C12 8.32e+10 8.32e+10 8.33e+10 8.33e+10 8.82e+10 8.31e+10 9.03e+10 9.02e+10 
C13 3.06e+09 3.06e+09 3.06e+09 3.06e+09 3.07e+09 3.06e+09 3.11e+09 3.11e+09 
Table C.19 Characteristics for 103.su2cor using input su2cor.in (F1) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 1.84e+08 1.90e+08 4.43e+08 4.51e+08 4.34e+08 4.38e+08 4.41e+08 4.33e+08 
C2 1.18e+10 1.18e+10 1.18e+10 1.18e+10 1.18e+10 1.18e+10 1.18e+10 1.18e+10 
C3 3.98e+06 4.13e+06 4.35e+06 1.67e+06 2.05e+06 1.40e+06 1.74e+06 2.00e+06 
C4 1.47e+09 1.47e+09 1.47e+09 1.47e+09 1.47e+09 1.47e+09 1.47e+09 1.47e+09 
C5 3.70e+05 2.72e+05 4.62e+05 6.36e+05 1.07e+06 7.32e+05 7.06e+05 7.39e+05 
C6 8.04e+07 8.08e+07 1.26e+08 1.28e+08 2.59e+08 1.27e+08 1.28e+08 2.58e+08 
C7 3.81e+10 3.81e+10 3.81e+10 3.81e+10 3.81e+10 3.81e+10 3.81e+10 3.81e+10 
C8 1.06e+10 1.06e+10 1.06e+10 1.06e+10 1.06e+10 1.06e+10 1.06e+10 1.06e+10 
C9 4.58e+10 4.58e+10 4.60e+10 4.62e+10 4.05e+10 4.56e+10 3.74e+10 3.75e+10 
C10 6.20e+05 6.28e+05 6.20e+05 6.51e+05 7.03e+06 6.51e+05 6.36e+05 7.14e+06 
C11 3.26e+09 3.26e+09 3.26e+09 3.26e+09 3.26e+09 3.26e+09 3.26e+09 3.26e+09 
C12 4.67e+10 4.69e+10 5.55e+10 5.55e+10 8.61e+10 4.13e+10 4.50e+10 8.07e+10 
C13 1.69e+08 1.68e+08 1.68e+08 1.65e+08 1.60e+08 1.63e+08 1.68e+08 1.63e+08 
Table C.20 Characteristics for 102.swim using input swim.in (F2) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 8.12e+08 8.06e+08 8.08e+08 7.96e+08 7.76e+08 8.07e+08 8.02e+08 7.74e+08 
C2 8.29e+09 8.29e+09 8.29e+09 8.29e+09 8.29e+09 8.29e+09 8.29e+09 8.29e+09 
C3 2.36e+05 2.70e+05 4.09e+05 5.11e+05 8.65e+05 3.92e+05 4.71e+05 6.73e+05 
C4 8.83e+08 8.83e+08 8.83e+08 8.95e+08 9.04e+08 8.85e+08 8.66e+08 8.92e+08 
C5 1.13e+05 1.09e+05 1.87e+05 1.91e+05 3.80e+05 1.86e+05 2.03e+05 3.59e+05 
C6 1.99e+08 1.99e+08 1.99e+08 1.99e+08 3.91e+08 1.99e+08 1.99e+08 3.91e+08 
C7 3.43e+10 3.43e+10 3.43e+10 3.43e+10 3.44e+10 3.43e+10 3.43e+10 3.44e+10 
C8 1.51e+10 1.51e+10 1.51e+10 1.51e+10 1.51e+10 1.51e+10 1.51e+10 1.51e+10 
C9 4.63e+10 4.63e+10 4.63e+10 4.65e+10 3.45e+10 4.52e+10 3.44e+10 3.44e+10 
C10 1.36e+06 1.36e+06 1.36e+06 1.37e+06 5.68e+06 1.37e+06 1.37e+06 5.80e+06 
C11 3.08e+09 3.08e+09 3.08e+09 3.08e+09 3.08e+09 3.08e+09 3.08e+09 3.08e+09 
C12 6.06e+10 6.05e+10 6.05e+10 6.35e+10 1.04e+11 3.91e+10 4.38e+10 9.58e+10 
C13 1.71e+06 1.71e+06 1.71e+06 1.76e+06 1.76e+06 1.77e+06 1.76e+06 1.98e+06 
Table C.21 Characteristics for 102.swim using input swim2.in (F3) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 7.31e+08 7.26e+08 7.28e+08 7.21e+08 6.98e+08 7.27e+08 7.22e+08 6.97e+08 
C2 7.46e+09 7.46e+09 7.46e+09 7.46e+09 7.46e+09 7.46e+09 7.46e+09 7.46e+09 
C3 2.13e+05 2.59e+05 3.65e+05 4.41e+05 7.70e+05 3.53e+05 3.94e+05 6.07e+05 
C4 7.94e+08 7.95e+08 7.94e+08 8.05e+08 8.14e+08 7.98e+08 7.79e+08 8.03e+08 
C5 1.01e+05 9.95e+04 1.66e+05 1.67e+05 3.09e+05 1.67e+05 1.72e+05 3.29e+05 
C6 1.79e+08 1.79e+08 1.79e+08 1.79e+08 3.52e+08 1.79e+08 1.79e+08 3.52e+08 
C7 3.09e+10 3.09e+10 3.09e+10 3.09e+10 3.09e+10 3.09e+10 3.09e+10 3.09e+10 
C8 1.36e+10 1.36e+10 1.36e+10 1.36e+10 1.36e+10 1.36e+10 1.36e+10 1.36e+10 
C9 4.17e+10 4.17e+10 4.17e+10 4.19e+10 3.10e+10 4.07e+10 3.09e+10 3.10e+10 
C10 1.22e+06 1.23e+06 1.22e+06 1.24e+06 5.13e+06 1.24e+06 1.23e+06 5.24e+06 
C11 2.77e+09 2.77e+09 2.77e+09 2.77e+09 2.77e+09 2.77e+09 2.77e+09 2.77e+09 
C12 5.46e+10 5.45e+10 5.45e+10 5.72e+10 9.39e+10 3.52e+10 3.94e+10 8.62e+10 
C13 1.57e+06 1.57e+06 1.57e+06 1.63e+06 1.63e+06 1.63e+06 1.62e+06 1.82e+06 
Table C.22 Characteristics for 110.applu using input applu.in (F4) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 9.29e+08 9.26e+08 9.68e+08 9.62e+08 9.55e+08 9.67e+08 9.62e+08 9.48e+08 
C2 1.39e+10 1.39e+10 1.39e+10 1.39e+10 1.39e+10 1.39e+10 1.39e+10 1.39e+10 
C3 1.21e+06 1.32e+06 2.24e+06 2.92e+06 4.27e+06 2.18e+06 3.16e+06 6.12e+06 
C4 1.10e+09 1.10e+09 1.10e+09 1.10e+09 1.10e+09 1.10e+09 1.10e+09 1.11e+09 
C5 5.19e+05 5.28e+05 1.06e+06 9.51e+05 1.49e+06 1.01e+06 1.02e+06 1.40e+06 
C6 2.48e+08 2.49e+08 2.63e+08 2.63e+08 5.19e+08 2.62e+08 2.63e+08 5.19e+08 
C7 5.53e+10 5.53e+10 5.53e+10 5.53e+10 5.53e+10 5.53e+10 5.53e+10 5.53e+10 
C8 1.69e+10 1.69e+10 1.69e+10 1.69e+10 1.69e+10 1.69e+10 1.69e+10 1.69e+10 
C9 6.76e+10 6.76e+10 6.85e+10 6.84e+10 6.08e+10 6.62e+10 5.56e+10 5.58e+10 
C10 2.02e+07 2.02e+07 2.02e+07 2.02e+07 3.59e+07 2.22e+07 1.97e+07 3.59e+07 
C11 5.06e+09 5.06e+09 5.06e+09 5.06e+09 5.06e+09 5.06e+09 5.06e+09 5.06e+09 
C12 7.90e+10 7.92e+10 8.17e+10 8.55e+10 1.51e+11 5.58e+10 6.15e+10 1.33e+11 
C13 6.43e+08 6.39e+08 6.42e+08 6.45e+08 6.41e+08 6.45e+08 6.66e+08 6.66e+08 
Table C.23 Characteristics for 145.fpppp using input natoms.in (F5) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 7.50e+05 5.62e+06 1.54e+06 1.11e+07 3.42e+05 3.11e+06 1.52e+06 4.87e+05 
C2 5.68e+10 5.68e+10 5.68e+10 5.68e+10 5.68e+10 5.68e+10 5.68e+10 5.68e+10 
C3 2.62e+09 2.63e+09 2.62e+09 2.64e+09 2.64e+09 2.61e+09 2.66e+09 2.66e+09 
C4 1.54e+08 1.56e+08 1.53e+08 1.85e+08 1.87e+08 1.74e+08 1.88e+08 1.87e+08 
C5 2.55e+03 1.42e+04 2.10e+04 1.57e+06 1.13e+05 1.50e+05 7.41e+05 8.72e+05 
C6 4.33e+02 7.06e+04 4.15e+05 3.50e+06 2.70e+05 2.56e+05 8.17e+05 2.16e+06 
C7 1.40e+11 1.40e+11 1.40e+11 1.40e+11 1.40e+11 1.40e+11 1.40e+11 1.40e+11 
C8 5.60e+10 5.60e+10 5.60e+10 5.60e+10 5.60e+10 5.60e+10 5.60e+10 5.60e+10 
C9 1.60e+11 1.60e+11 1.60e+11 1.60e+11 1.46e+11 1.60e+11 1.40e+11 1.41e+11 
C10 6.13e+03 9.47e+03 6.02e+03 1.82e+04 5.34e+05 5.90e+03 1.23e+04 6.15e+05 
C11 1.98e+10 1.98e+10 1.98e+10 1.98e+10 1.98e+10 1.98e+10 1.98e+10 1.98e+10 
C12 1.17e+11 1.17e+11 1.18e+11 1.16e+11 1.22e+11 1.15e+11 1.17e+11 1.18e+11 
C13 3.04e+08 3.00e+08 2.99e+08 2.97e+08 2.89e+08 2.96e+08 3.02e+08 2.96e+08 
Table C.24 Characteristics for 141.apsi using input apsi.in (F6) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 2.59e+07 2.85e+07 2.50e+08 2.63e+08 1.77e+08 2.51e+08 2.70e+08 1.84e+08 
C2 7.35e+09 7.35e+09 7.35e+09 7.35e+09 7.35e+09 7.35e+09 7.35e+09 7.35e+09 
C3 1.34e+06 1.64e+06 2.47e+06 2.71e+06 4.25e+06 1.73e+06 9.58e+06 5.90e+06 
C4 6.89e+08 6.90e+08 7.05e+08 7.05e+08 7.00e+08 7.05e+08 7.07e+08 7.07e+08 
C5 1.36e+05 3.56e+05 7.54e+05 1.60e+06 2.03e+06 3.57e+06 7.82e+05 9.20e+05 
C6 3.00e+06 6.02e+06 8.00e+07 8.40e+07 1.09e+08 8.48e+07 8.19e+07 1.08e+08 
C7 3.45e+10 3.45e+10 3.45e+10 3.45e+10 3.46e+10 3.45e+10 3.45e+10 3.46e+10 
C8 9.84e+09 9.84e+09 9.84e+09 9.84e+09 9.84e+09 9.84e+09 9.84e+09 9.84e+09 
C9 3.80e+10 3.81e+10 3.86e+10 3.86e+10 3.55e+10 3.82e+10 3.35e+10 3.40e+10 
C10 8.53e+04 9.96e+04 8.87e+04 1.22e+05 4.07e+07 9.16e+04 1.04e+05 4.08e+07 
C11 3.69e+09 3.69e+09 3.69e+09 3.69e+09 3.69e+09 3.69e+09 3.69e+09 3.69e+09 
C12 3.63e+10 3.71e+10 4.95e+10 5.19e+10 5.73e+10 4.16e+10 4.37e+10 5.61e+10 
C13 8.26e+07 8.26e+07 8.26e+07 8.37e+07 8.11e+07 8.28e+07 8.30e+07 8.38e+07 
Table C.25 Characteristics for 146.wave5 using input wave5.in (F7) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 1.95e+08 1.94e+08 2.87e+08 2.88e+08 2.55e+08 2.87e+08 2.90e+08 2.68e+08 
C2 6.38e+09 6.38e+09 6.38e+09 6.38e+09 6.38e+09 6.38e+09 6.38e+09 6.38e+09 
C3 3.66e+06 3.77e+06 4.15e+06 6.34e+06 7.19e+06 4.47e+06 6.14e+06 7.96e+06 
C4 1.91e+09 1.91e+09 1.91e+09 1.92e+09 1.94e+09 1.91e+09 1.92e+09 1.93e+09 
C5 3.83e+05 3.02e+05 7.78e+05 5.18e+05 1.97e+06 6.65e+05 7.82e+05 1.10e+06 
C6 3.86e+07 3.90e+07 6.32e+07 6.41e+07 1.15e+08 6.33e+07 6.47e+07 1.23e+08 
C7 2.78e+10 2.78e+10 2.78e+10 2.78e+10 2.81e+10 2.78e+10 2.78e+10 2.81e+10 
C8 8.80e+09 8.80e+09 8.80e+09 8.80e+09 8.80e+09 8.80e+09 8.80e+09 8.80e+09 
C9 3.66e+10 3.66e+10 3.75e+10 3.77e+10 3.66e+10 3.64e+10 2.74e+10 3.13e+10 
C10 5.35e+06 5.37e+06 5.35e+06 5.49e+06 3.06e+08 6.65e+06 5.42e+06 3.04e+08 
C11 3.99e+09 3.99e+09 3.99e+09 3.99e+09 3.99e+09 3.99e+09 3.99e+09 3.99e+09 
C12 3.82e+10 3.83e+10 4.26e+10 4.32e+10 6.11e+10 3.54e+10 3.80e+10 5.77e+10 
C13 7.60e+07 7.69e+07 7.69e+07 7.51e+07 9.10e+07 7.58e+07 8.09e+07 9.87e+07 
Table C.26 Characteristics for 107.mgrid using input mgrid.in (F8) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 7.35e+08 7.31e+08 7.84e+08 7.85e+08 7.17e+08 7.84e+08 7.84e+08 7.13e+08 
C2 2.70e+10 2.70e+10 2.70e+10 2.70e+10 2.70e+10 2.70e+10 2.70e+10 2.70e+10 
C3 1.45e+06 1.58e+06 1.92e+06 2.20e+06 2.52e+06 1.87e+06 2.26e+06 3.05e+06 
C4 1.37e+09 1.37e+09 1.37e+09 1.37e+09 1.38e+09 1.37e+09 1.37e+09 1.38e+09 
C5 5.80e+05 5.52e+05 7.48e+05 8.25e+05 1.43e+06 7.89e+05 7.80e+05 1.49e+06 
C6 1.92e+08 1.93e+08 2.08e+08 2.09e+08 4.04e+08 2.08e+08 2.09e+08 4.00e+08 
C7 7.29e+10 7.29e+10 7.29e+10 7.29e+10 7.29e+10 7.29e+10 7.29e+10 7.29e+10 
C8 2.94e+10 2.94e+10 2.94e+10 2.94e+10 2.94e+10 2.94e+10 2.94e+10 2.94e+10 
C9 8.61e+10 8.62e+10 8.71e+10 8.86e+10 7.30e+10 8.45e+10 7.26e+10 7.27e+10 
C10 1.28e+06 1.29e+06 1.28e+06 1.30e+06 6.08e+06 1.28e+06 1.28e+06 6.09e+06 
C11 1.41e+09 1.41e+09 1.41e+09 1.41e+09 1.41e+09 1.41e+09 1.41e+09 1.41e+09 
C12 7.45e+10 7.47e+10 7.74e+10 7.26e+10 1.16e+11 5.37e+10 6.07e+10 1.07e+11 
C13 3.61e+07 3.61e+07 3.61e+07 3.61e+07 3.65e+07 3.60e+07 3.62e+07 3.65e+07 
Table C.27 Characteristics for 125.turb3d using input turb3d.in (F9) 
Events M1 M2 M3 M4 M5 M6 M7 M8 
C1 6.39e+08 6.30e+08 6.42e+08 6.32e+08 6.32e+08 6.42e+08 6.37e+08 6.30e+08 
C2 1.96e+10 1.96e+10 1.96e+10 1.96e+10 1.96e+10 1.96e+10 1.96e+10 1.96e+10 
C3 1.22e+06 1.41e+06 1.72e+06 3.42e+06 4.90e+06 2.35e+06 6.65e+06 8.60e+06 
C4 1.32e+09 1.32e+09 1.32e+09 1.33e+09 1.34e+09 1.32e+09 1.34e+09 1.36e+09 
C5 3.12e+05 2.62e+05 6.29e+05 4.23e+05 7.18e+05 5.24e+05 5.07e+05 3.30e+06 
C6 1.02e+08 1.02e+08 1.04e+08 1.04e+08 2.08e+08 1.04e+08 1.04e+08 2.12e+08 
C7 1.12e+11 1.12e+11 1.12e+11 1.12e+11 1.12e+11 1.12e+11 1.12e+11 1.12e+11 
C8 2.30e+10 2.30e+10 2.30e+10 2.30e+10 2.30e+10 2.30e+10 2.30e+10 2.30e+10 
C9 1.23e+11 1.23e+11 1.23e+11 1.23e+11 1.24e+11 1.22e+11 1.10e+11 1.10e+11 
C10 1.57e+08 1.57e+08 1.57e+08 1.48e+08 2.15e+08 1.58e+08 1.48e+08 2.11e+08 
C11 1.44e+10 1.44e+10 1.44e+10 1.44e+10 1.44e+10 1.44e+10 1.44e+10 1.44e+10 
C12 9.86e+10 9.85e+10 9.90e+10 1.02e+11 1.25e+11 8.72e+10 9.24e+10 1.22e+11 
C13 1.23e+09 1.23e+09 1.23e+09 1.24e+09 1.24e+09 1.23e+09 1.25e+09 1.25e+09 
APPENDIX D Application Characteristics
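The tables in this appendix express each event count relative to the number of graduated instructions. A minimal Python sketch of that normalization follows, using the 107.mgrid (F8) counts for machine M1 from Table C.26. The mapping of C4 to primary data cache misses, C7 to graduated instructions, and C12 to cycles is an assumption inferred from numerical agreement with Tables D.8 and D.23; it is not stated in this part of the thesis, and the function names are illustrative.

```python
# Raw counter values for 107.mgrid (F8) on machine M1, copied from Table C.26.
# ASSUMPTION: C4 = primary data cache misses, C7 = graduated instructions,
# C12 = cycles. This mapping is inferred from the numbers, not stated here.
RAW_F8_M1 = {"C4": 1.37e9, "C7": 7.29e10, "C12": 7.45e10}


def per_thousand(event_count, graduated_instructions):
    """Return events per 1000 graduated instructions."""
    return 1000.0 * event_count / graduated_instructions


def cycles_per_instruction(cycles, graduated_instructions):
    """Return CPI as cycles divided by graduated instructions."""
    return cycles / graduated_instructions


l1d_misses = per_thousand(RAW_F8_M1["C4"], RAW_F8_M1["C7"])
cpi = cycles_per_instruction(RAW_F8_M1["C12"], RAW_F8_M1["C7"])

print(f"primary data cache misses per 1000 instructions: {l1d_misses:.2f}")
print(f"cycles per instruction: {cpi:.2f}")
```

The Appendix C counts are printed to three significant digits, so the recomputed miss rate (about 18.8) agrees with the 18.82 listed for F8 on M1 in Table D.8 only to within rounding.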
Table D.1 Quadwords Written Back from Scache per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 0.01 0.06 0.03 0.07 0.04 0.06 0.12 0.10
I2 0.01 0.05 0.03 0.05 0.04 0.06 0.03 0.05
I3 0.02 0.06 0.04 0.06 0.03 0.04 0.07 0.13
I4 0.01 0.06 0.05 0.07 0.06 0.62 0.06 0.07
F5 0.01 0.04 0.01 0.08 0.00 0.02 0.01 0.00
I5 1.26 1.29 1.58 1.65 1.10 1.68 1.63 1.00
I6 1.05 1.02 1.06 1.08 1.06 1.05 1.08 1.04
I7 1.07 1.06 1.09 1.10 1.09 1.08 1.09 1.06
I8 1.05 1.03 1.06 1.07 1.07 1.06 1.06 1.03
I9 1.95 1.98 2.59 2.75 2.24 2.65 2.64 2.19
I10 1.06 1.05 1.80 1.90 1.57 1.73 1.82 1.56
I11 2.69 2.66 3.19 3.36 2.91 3.35 3.45 2.93
I12 0.71 0.73 1.39 1.44 1.11 1.44 1.41 1.10
I13 0.06 0.09 0.07 0.06 0.04 0.08 0.07 0.04
I14 0.02 0.06 0.03 0.03 0.02 0.05 0.04 0.02
I15 0.66 0.66 0.68 0.72 0.72 0.69 0.77 0.68
I16 1.39 0.73 1.32 2.78 2.40 2.47 2.53 2.24
I17 0.02 0.06 0.02 0.06 0.04 0.04 0.03 0.01
Table D.2 Quadwords Written Back from Scache per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 4.83 5.00 11.62 11.83 11.39 11.49 11.58 11.35
F2 23.63 23.46 23.54 23.18 22.59 23.50 23.35 22.54
F3 23.64 23.47 23.53 23.33 22.58 23.50 23.34 22.54
F4 16.81 16.76 17.52 17.41 17.27 17.50 17.40 17.15
F6 0.75 0.82 7.24 7.61 5.12 7.28 7.81 5.32
F7 7.01 6.97 10.30 10.34 9.07 10.29 10.43 9.51
F8 10.08 10.02 10.75 10.77 9.83 10.75 10.75 9.78
F9 5.69 5.61 5.72 5.63 5.62 5.72 5.68 5.61
F10 27.91 27.77 28.39 28.16 28.14 28.37 28.11 28.03
F11 28.95 28.88 34.91 34.92 34.77 34.90 34.86 34.52
Table D.3 Graduated Loads per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 326.15 326.15 326.15 326.13 326.09 326.14 326.13 326.09
I2 325.36 325.36 325.36 325.35 325.30 325.35 325.35 325.30
I3 325.66 325.66 325.66 325.65 325.60 325.66 325.65 325.60
I4 325.25 325.25 325.25 325.24 325.20 325.24 325.24 325.20
F5 404.01 404.01 404.01 404.01 404.00 404.01 404.01 404.00
I5 350.88 350.88 350.88 350.87 349.57 350.86 350.86 349.57
I6 195.78 195.78 195.78 195.78 195.76 195.78 195.78 195.77
I7 197.35 197.35 197.35 197.34 197.32 197.34 197.34 197.34
I8 196.78 196.78 196.78 196.77 196.75 196.77 196.77 196.77
I9 283.76 283.76 283.76 283.44 282.32 283.44 283.46 282.30
I10 285.81 285.81 285.81 285.12 284.53 285.12 285.07 284.48
I11 279.21 279.21 279.21 278.91 278.12 278.91 278.92 278.10
I12 286.50 286.50 286.50 286.04 285.47 286.04 286.05 285.43
I13 235.24 235.24 235.24 235.24 235.22 235.24 235.24 235.23
I14 220.08 220.08 220.08 219.81 219.80 219.81 219.75 219.78
I15 304.08 304.08 304.08 304.07 303.66 304.07 304.07 303.65
I16 167.57 167.57 167.57 141.33 141.28 141.33 136.50 139.09
I17 328.55 328.55 328.55 328.54 328.54 328.55 328.54 328.54
Table D.4 Graduated Loads per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 309.09 309.08 309.09 308.98 308.92 308.98 309.01 308.92 
F2 241.31 241.31 241.31 241.30 241.26 241.30 241.30 241.26 
F3 241.31 241.31 241.31 241.30 241.27 241.31 241.30 241.27 
F4 251.25 251.25 251.25 251.25 251.17 251.24 251.25 251.17 
F6 212.94 212.94 212.94 212.93 212.68 212.93 212.93 212.67 
F7 229.03 229.03 229.03 229.02 226.55 229.01 229.02 226.57 
F8 369.61 369.61 369.61 369.60 369.57 369.61 369.60 369.57 
F9 174.45 174.45 174.45 174.46 174.36 174.45 174.46 174.36 
F10 264.10 264.10 264.10 264.05 264.01 264.05 264.06 264.01 
F11 215.68 215.68 215.68 215.67 215.61 215.67 215.67 215.61
Table D.5 Primary Instruction Cache Misses per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 5.29 5.30 5.29 5.50 5.48 5.31 5.63 5.72
I2 5.07 5.08 5.06 5.29 5.28 5.08 5.42 5.51
I3 5.39 5.40 5.38 5.60 5.58 5.41 5.74 5.82
I4 5.10 5.10 5.08 5.31 5.29 5.11 5.46 5.58
F5 18.68 18.69 18.68 18.80 18.79 18.60 18.92 18.91
I5 6.20 6.23 6.22 6.65 6.92 6.46 6.59 7.07
I6 0.01 0.02 0.02 0.04 0.03 0.02 0.04 0.05
I7 0.01 0.02 0.02 0.04 0.04 0.02 0.04 0.05
I8 0.01 0.02 0.02 0.04 0.03 0.02 0.05 0.05
I9 11.90 11.91 11.90 12.04 12.10 11.75 12.52 12.73
I10 13.60 13.62 13.61 13.62 13.68 13.34 14.20 14.37
I11 11.74 11.76 11.76 11.90 11.88 11.63 12.35 12.45
I12 13.44 13.46 13.46 13.48 13.38 13.20 14.08 14.07
I13 13.89 13.89 13.88 14.00 14.35 13.86 14.02 13.98
I14 13.91 13.91 13.90 13.92 14.29 13.85 13.84 13.78
I15 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.01
I16 0.68 0.69 0.68 0.59 1.08 0.58 1.52 1.04
I17 0.24 0.26 0.25 0.11 0.12 0.05 0.16 0.16
Table D.6 Primary Instruction Cache Misses per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 0.10 0.11 0.11 0.04 0.05 0.04 0.05 0.05
F2 0.01 0.01 0.01 0.01 0.03 0.01 0.01 0.02 
F3 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 
F4 0.02 0.02 0.04 0.05 0.08 0.04 0.06 0.11 
F6 0.04 0.05 0.07 0.08 0.12 0.05 0.28 0.17 
F7 0.13 0.14 0.15 0.23 0.26 0.16 0.22 0.28 
F8 0.02 0.02 0.03 0.03 0.03 0.03 0.03 0.04 
F9 0.01 0.01 0.02 0.03 0.04 0.02 0.06 0.08 
F10 0.10 0.10 0.11 0.09 0.09 0.09 0.05 0.04 
F11 0.05 0.05 0.06 0.06 0.06 0.05 0.04 0.07
Table D.7 Primary Data Cache Misses per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 9.11 9.13 9.11 9.62 9.77 9.43 9.74 9.52
I2 10.01 10.03 10.01 10.52 10.67 10.35 10.62 10.43
I3 9.68 9.70 9.68 10.17 10.30 10.00 10.28 10.09
I4 9.40 9.41 9.38 9.90 10.05 9.72 10.00 9.85
F5 1.10 1.11 1.09 1.32 1.33 1.24 1.33 1.33
I5 4.82 4.84 4.85 5.04 5.65 4.99 5.05 5.61
I6 2.46 2.46 2.45 2.53 2.56 2.48 2.55 2.55
I7 2.61 2.62 2.61 2.66 2.63 2.61 2.69 2.69
I8 2.57 2.58 2.57 2.63 2.62 2.58 2.65 2.67
I9 7.39 7.38 7.39 7.57 7.76 7.40 7.62 7.82
I10 6.38 6.40 6.44 6.59 6.83 6.42 6.63 6.85
I11 8.55 8.57 8.58 8.71 9.07 8.57 8.78 9.10
I12 5.01 5.03 5.08 5.27 5.46 5.06 5.35 5.56
I13 0.16 0.18 0.16 0.33 0.34 0.25 0.36 0.36
I14 0.24 0.24 0.23 0.38 0.43 0.31 0.41 0.40
I15 27.03 27.04 27.04 27.13 26.88 27.04 27.59 26.98
I16 65.24 65.26 65.25 56.24 58.41 56.23 54.50 57.70
I17 5.27 5.28 5.28 5.32 5.36 5.28 5.36 5.38
Table D.8 Primary Data Cache Misses per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 38.61 38.61 38.62 38.61 38.59 38.60 38.64 38.62
F2 25.70 25.71 25.70 26.06 26.32 25.78 25.20 25.96
F3 25.69 25.72 25.70 26.03 26.32 25.79 25.19 25.96
F4 19.91 19.92 19.93 19.94 19.97 19.91 19.93 20.02 
F6 19.95 20.00 20.41 20.42 20.26 20.41 20.49 20.44 
F7 68.67 68.66 68.68 68.84 68.78 68.73 68.81 68.71 
F8 18.82 18.82 18.83 18.85 18.88 18.83 18.85 18.88 
F9 11.73 11.73 11.73 11.82 11.89 11.74 11.91 12.12 
F10 34.17 34.19 34.18 34.36 35.09 34.21 34.29 34.32 
F11 51.36 51.38 51.38 51.51 51.83 51.39 51.47 51.65
Table D.9 Secondary Instruction Cache Misses per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 0.00 0.00 0.00 0.01 0.02 0.01 0.03 0.03
I2 0.00 0.00 0.00 0.02 0.03 0.01 0.00 0.07
I3 0.00 0.01 0.01 0.02 0.02 0.01 0.02 0.09
I4 0.00 0.00 0.01 0.02 0.04 0.01 0.02 0.05
F5 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.01
I5 0.02 0.02 0.06 0.07 0.07 0.07 0.08 0.09
I6 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.01
I7 0.01 0.00 0.01 0.01 0.01 0.01 0.01 0.01
I8 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.01
I9 0.05 0.06 0.17 0.16 0.26 0.19 0.19 0.23
I10 0.09 0.09 0.20 0.27 0.35 0.20 0.23 0.34
I11 0.06 0.07 0.17 0.16 0.22 0.14 0.17 0.27
I12 0.07 0.07 0.22 0.25 0.36 0.26 0.26 0.36
I13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I15 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01
I16 0.14 0.11 0.18 0.17 0.51 0.18 0.17 0.27
I17 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00
Table D.10 Secondary Instruction Cache Misses per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 0.01 0.01 0.01 0.02 0.03 0.02 0.02 0.02
F2 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 
F3 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 
F4 0.01 0.01 0.02 0.02 0.03 0.02 0.02 0.03 
F6 0.00 0.01 0.02 0.05 0.06 0.10 0.02 0.03 
F7 0.01 0.01 0.03 0.02 0.07 0.02 0.03 0.04 
F8 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 
F9 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.03 
F10 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 
F11 0.01 0.01 0.02 0.02 0.03 0.02 0.02 0.03
Table D.11 Secondary Data Cache Misses per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 0.00 0.00 0.00 0.01 0.02 0.01 0.02 0.03
I2 0.00 0.00 0.00 0.01 0.03 0.01 0.01 0.03
I3 0.00 0.00 0.01 0.01 0.01 0.00 0.02 0.13
I4 0.00 0.00 0.00 0.01 0.04 0.00 0.01 0.04
F5 0.00 0.00 0.00 0.02 0.00 0.00 0.01 0.02
I5 0.37 0.38 0.61 0.62 0.81 0.62 0.64 0.79
I6 0.25 0.25 0.27 0.27 0.54 0.27 0.27 0.54
I7 0.26 0.26 0.28 0.28 0.56 0.28 0.28 0.55
I8 0.26 0.26 0.27 0.27 0.54 0.27 0.27 0.54
I9 0.27 0.29 0.48 0.47 0.78 0.49 0.50 0.76
I10 0.12 0.13 0.26 0.29 0.46 0.27 0.28 0.46
I11 0.35 0.36 0.55 0.57 0.95 0.55 0.59 0.95
I12 0.09 0.10 0.20 0.24 0.35 0.23 0.24 0.36
I13 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.02
I14 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01
I15 0.17 0.17 0.17 0.18 0.36 0.17 0.19 0.37
I16 0.19 0.13 0.24 0.55 0.97 0.55 0.48 0.84
I17 0.00 0.00 0.00 0.00 0.01 0.00 0.02 0.02
Table D.12 Secondary Data Cache Misses per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 2.11 2.12 3.32 3.35 6.79 3.34 3.37 6.76
F2 5.79 5.79 5.80 5.79 11.37 5.79 5.80 11.38
F3 5.79 5.79 5.80 5.79 11.37 5.80 5.80 11.38
F4 4.49 4.50 4.75 4.75 9.39 4.75 4.76 9.39
F6 0.09 0.17 2.32 2.43 3.15 2.46 2.37 3.12
F7 1.39 1.40 2.27 2.30 4.09 2.27 2.32 4.37
F8 2.64 2.64 2.86 2.86 5.54 2.86 2.87 5.49
F9 0.91 0.91 0.93 0.93 1.85 0.92 0.93 1.89
F10 6.70 6.70 6.88 6.87 13.73 6.88 6.88 13.72
F11 6.78 6.83 11.11 11.13 22.25 11.11 11.15 22.24
Table D.13 Graduate Instructions in Billions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 15.96 15.96 15.96 15.96 15.96 15.96 15.96 15.96
I2 34.83 34.83 34.83 34.83 34.84 34.83 34.83 34.84
I3 31.84 31.84 31.84 31.84 31.84 31.84 31.84 31.84
I4 32.17 32.17 32.17 32.17 32.18 32.17 32.17 32.18
F5 140.47 140.47 140.47 140.47 140.47 140.47 140.47 140.47
I5 85.02 85.02 85.02 85.02 85.34 85.02 85.02 85.34
I6 28.49 28.49 28.49 28.49 28.49 28.49 28.49 28.49
I7 25.76 25.76 25.76 25.76 25.77 25.76 25.76 25.76
I8 29.47 29.47 29.47 29.47 29.47 29.47 29.47 29.47
I9 1.30 1.30 1.30 1.30 1.30 1.30 1.30 1.31
I10 0.36 0.36 0.36 0.37 0.37 0.37 0.37 0.37
I11 1.15 1.15 1.15 1.15 1.16 1.15 1.15 1.16
I12 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71
I13 80.03 80.03 80.03 80.03 80.04 80.03 80.03 80.03
I14 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54
I15 51.73 51.74 51.73 51.74 51.81 51.74 51.74 51.81
I16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I17 68.66 68.66 68.66 68.67 68.67 68.67 68.67 68.67
Table D.14 Graduated Floating Point Instructions per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
F5 398.87 398.87 398.87 398.86 398.86 398.87 398.86 398.86
I5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I9 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
I10 0.08 0.08 0.08 0.07 0.07 0.07 0.07 0.07
I11 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10
I12 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
I13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I15 3.66 3.66 3.66 3.66 3.66 3.66 3.66 3.66
I16 14.81 14.81 14.81 12.49 12.49 12.49 12.06 12.29
I17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Table D.15 Graduated Floating Point Instructions per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 279.32 279.32 279.32 279.22 279.17 279.22 279.24 279.17 
F2 440.22 440.22 440.22 440.20 440.14 440.20 440.20 440.13 
F3 440.20 440.20 440.20 440.18 440.11 440.18 440.18 440.11 
F4 306.03 306.03 306.03 306.02 305.93 306.01 306.02 305.93 
F6 285.12 285.11 285.11 285.10 284.77 285.11 285.10 284.76 
F7 316.04 316.04 316.04 316.03 312.62 316.02 316.03 312.64 
F8 402.51 402.51 402.51 402.50 402.47 402.50 402.50 402.47 
F9 204.71 204.71 204.71 204.72 204.60 204.70 204.72 204.61 
F10 351.42 351.42 351.42 351.35 351.31 351.36 351.37 351.31 
F11 316.42 316.42 316.42 316.40 316.32 316.41 316.41 316.32
Table D.16 Issued Instructions per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 1123.14 1123.07 1124.18 1124.12 1498.06 1124.01 1073.38 1074.63
I2 1130.26 1130.39 1130.37 1131.43 1509.93 1131.51 1079.22 1080.20
I3 1130.06 1130.05 1130.49 1131.36 1508.65 1131.12 1079.07 1080.24
I4 1125.76 1125.75 1125.64 1126.82 1500.80 1126.54 1075.53 1076.02
F5 1141.64 1141.65 1141.61 1138.95 1038.10 1138.28 1000.19 1000.24
I5 1077.28 1077.37 1078.38 1077.03 1378.32 1076.43 1032.24 1043.62
I6 977.50 977.50 977.50 978.33 1074.82 978.16 970.15 970.11
I7 977.36 977.36 977.35 977.25 1073.55 977.08 969.42 969.55
I8 978.60 978.60 978.60 978.57 1077.39 978.42 970.92 970.82
I9 1031.31 1031.49 1032.57 1034.20 1300.33 1032.40 995.65 1000.33
I10 1019.03 1019.03 1019.84 1021.80 1289.76 1020.95 982.78 986.92
I11 1027.90 1027.92 1029.31 1030.88 1280.78 1028.52 992.00 997.55
I12 1029.70 1029.72 1030.44 1032.21 1317.35 1031.31 994.79 998.65
I13 941.55 941.56 941.54 942.55 1160.38 943.37 928.92 929.25
I14 938.62 938.68 938.61 939.54 1185.99 939.28 926.28 926.11
I15 1068.29 1068.30 1068.45 1068.79 1353.09 1068.44 1022.92 1027.52
I16 1062.77 1062.72 1060.42 1075.69 1156.39 1072.26 998.49 999.72
I17 1007.42 1007.39 1007.89 1008.02 1333.20 1007.52 942.36 942.24
Table D.17 Issued Instructions per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 1202.82 1202.78 1208.56 1211.88 1062.61 1196.99 981.91 983.89 
F2 1348.83 1348.51 1349.46 1354.54 1003.26 1316.85 1000.18 1001.66 
F3 1348.98 1348.81 1349.64 1355.07 1003.29 1316.73 1000.15 1001.63 
F4 1223.40 1223.27 1240.47 1238.26 1100.87 1198.80 1006.78 1010.16 
F6 1100.74 1104.01 1117.89 1118.81 1026.57 1107.00 970.94 982.66 
F7 1313.61 1314.70 1346.98 1355.64 1300.70 1307.61 985.95 1111.87 
F8 1180.80 1181.67 1194.18 1214.24 1000.81 1158.47 995.43 996.18 
F9 1091.48 1091.27 1091.41 1092.18 1104.44 1087.35 976.21 982.57 
F10 1321.02 1321.38 1323.31 1343.22 1010.60 1261.82 973.13 974.15 
F11 1234.22 1235.33 1292.55 1318.43 1333.33 1211.17 968.68 972.99
Table D.18 TLB Misses per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.14
I2 0.00 0.00 0.00 0.00 0.15 0.00 0.00 0.15
I3 0.00 0.00 0.00 0.00 0.14 0.00 0.00 0.14
I4 0.00 0.00 0.00 0.00 0.12 0.00 0.00 0.12
F5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I5 2.23 2.23 2.23 2.20 5.83 2.26 2.23 5.82
I6 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01
I7 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01
I8 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01
I9 0.11 0.11 0.11 0.10 2.17 0.10 0.10 2.19
I10 0.03 0.03 0.03 0.03 1.80 0.02 0.03 1.83
I11 0.09 0.09 0.09 0.09 2.39 0.08 0.09 2.40
I12 0.05 0.05 0.05 0.05 1.93 0.05 0.05 1.95
I13 0.01 0.01 0.01 0.01 0.03 0.01 0.01 0.03
I14 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01
I15 0.00 0.00 0.00 0.00 1.15 0.00 0.00 1.16
I16 0.05 0.05 0.05 0.03 0.34 0.03 0.04 0.33
I17 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01
Table D.19 TLB Misses per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 0.02 0.02 0.02 0.02 0.18 0.02 0.02 0.19 
F2 0.04 0.04 0.04 0.04 0.17 0.04 0.04 0.17 
F3 0.04 0.04 0.04 0.04 0.17 0.04 0.04 0.17 
F4 0.37 0.37 0.37 0.37 0.65 0.40 0.36 0.65 
F6 0.00 0.00 0.00 0.00 1.18 0.00 0.00 1.18 
F7 0.19 0.19 0.19 0.20 10.88 0.24 0.19 10.80 
F8 0.02 0.02 0.02 0.02 0.08 0.02 0.02 0.08 
F9 1.40 1.40 1.40 1.32 1.91 1.41 1.32 1.88 
F10 0.03 0.03 0.03 0.03 0.13 0.03 0.03 0.13 
F11 0.04 0.04 0.04 0.05 0.26 0.04 0.05 0.26
Table D.20 Graduated Stores per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 93.95 93.95 93.95 93.95 93.94 93.95 93.95 93.94
I2 92.40 92.40 92.40 92.40 92.39 92.40 92.40 92.39
I3 92.99 92.99 92.99 92.99 92.98 92.99 92.99 92.97
I4 92.12 92.12 92.12 92.12 92.11 92.12 92.12 92.11
F5 140.78 140.78 140.78 140.78 140.79 140.78 140.78 140.78
I5 180.07 180.07 180.07 180.07 179.40 180.06 180.06 179.40
I6 83.42 83.42 83.42 83.42 83.42 83.42 83.42 83.42
I7 83.59 83.59 83.59 83.59 83.58 83.59 83.59 83.59
I8 84.21 84.21 84.21 84.20 84.20 84.21 84.21 84.20
I9 137.14 137.14 137.14 137.01 136.76 137.01 137.04 136.75
I10 132.10 132.10 132.10 131.91 131.66 131.91 131.92 131.64
I11 145.32 145.32 145.32 145.19 144.83 145.19 145.22 144.83
I12 133.26 133.26 133.26 133.10 132.83 133.10 133.13 132.82
I13 104.98 104.98 104.98 104.98 104.98 104.98 104.98 104.98
I14 83.72 83.72 83.72 83.70 83.70 83.70 83.70 83.70
I15 118.84 118.84 118.84 118.84 118.69 118.84 118.84 118.68
I16 560.01 560.01 560.03 483.46 483.31 483.46 469.34 476.84
I17 161.70 161.70 161.70 161.70 161.70 161.70 161.70 161.69
Table D.21 Graduated Stores per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 85.54 85.54 85.54 85.54 85.53 85.54 85.54 85.52 
F2 89.66 89.66 89.66 89.66 89.65 89.66 89.66 89.64 
F3 89.64 89.64 89.64 89.64 89.63 89.64 89.64 89.63 
F4 91.57 91.57 91.57 91.56 91.54 91.56 91.57 91.54 
F6 106.84 106.84 106.84 106.84 106.72 106.84 106.84 106.71 
F7 143.20 143.20 143.20 143.19 141.66 143.19 143.19 141.66 
F8 19.33 19.33 19.33 19.33 19.33 19.33 19.33 19.33 
F9 128.37 128.36 128.37 128.37 128.30 128.36 128.37 128.30 
F10 120.68 120.68 120.68 120.67 120.67 120.67 120.68 120.66 
F11 78.01 78.01 78.01 78.01 78.00 78.01 78.01 77.99
Table D.22 Cycles Per Instruction for Integer-Type Benchmarks 
M1 M2 M3 M4 M5 M6 M7 M8
I1 1.06 1.06 1.07 1.08 1.12 1.07 1.12 1.13
I2 1.07 1.07 1.07 1.08 1.12 1.07 1.13 1.13
I3 1.07 1.07 1.07 1.08 1.12 1.07 1.12 1.15
I4 1.06 1.06 1.07 1.08 1.12 1.07 1.12 1.13
F5 0.84 0.84 0.84 0.83 0.87 0.82 0.83 0.84
I5 1.15 1.16 1.20 1.17 1.39 1.14 1.20 1.33
I6 0.68 0.68 0.68 0.67 0.71 0.66 0.67 0.72
I7 0.68 0.68 0.68 0.67 0.71 0.65 0.67 0.72
I8 0.68 0.68 0.69 0.67 0.71 0.66 0.68 0.72
I9 1.19 1.20 1.25 1.25 1.44 1.18 1.26 1.39
I10 1.18 1.18 1.23 1.24 1.43 1.18 1.27 1.39
I11 1.19 1.20 1.26 1.26 1.48 1.17 1.25 1.43
I12 1.18 1.18 1.23 1.24 1.42 1.19 1.27 1.37
I13 1.04 1.04 1.04 1.05 1.10 1.05 1.09 1.09
I14 1.08 1.08 1.08 1.08 1.12 1.08 1.13 1.13
I15 1.09 1.09 1.09 1.08 1.17 1.07 1.13 1.18
I16 1.26 1.30 1.34 1.30 1.70 1.26 1.25 1.39
I17 1.21 1.21 1.21 1.21 1.28 1.21 1.31 1.31
Table D.23 Cycles Per Instruction for Float-Type Benchmarks 
M1 M2 M3 M4 M5 M6 M7 M8
F1 1.23 1.23 1.46 1.46 2.26 1.08 1.18 2.12 
F2 1.76 1.76 1.76 1.85 3.04 1.14 1.28 2.79 
F3 1.76 1.76 1.76 1.85 3.04 1.14 1.28 2.79 
F4 1.43 1.43 1.48 1.55 2.73 1.01 1.11 2.41 
F6 1.05 1.07 1.43 1.50 1.66 1.21 1.27 1.62 
F7 1.37 1.38 1.53 1.55 2.17 1.27 1.36 2.05 
F8 1.02 1.02 1.06 1.00 1.59 0.74 0.83 1.47 
F9 0.88 0.88 0.88 0.90 1.11 0.78 0.82 1.09 
F10 1.92 1.92 1.95 1.96 3.52 1.21 1.36 3.25 
F11 2.10 2.12 2.74 2.98 5.67 1.64 1.88 4.97
Table D.24 Mispredicted Branches per 1000 Graduate Instructions for Integer-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
I1 39.21 39.22 39.22 39.23 37.43 39.23 39.44 39.49
I2 40.17 40.16 40.17 40.19 38.28 40.18 40.38 40.37
I3 40.14 40.14 40.14 40.16 38.26 40.16 40.35 40.35
I4 39.46 39.47 39.48 39.50 37.64 39.49 39.69 39.69
F5 2.17 2.14 2.13 2.12 2.06 2.11 2.15 2.10
I5 30.66 30.65 30.67 30.42 31.58 30.40 30.72 30.60
I6 7.69 7.69 7.69 7.73 7.72 7.72 7.78 7.76
I7 7.54 7.54 7.54 7.56 7.58 7.56 7.63 7.62
I8 7.97 7.97 7.97 8.00 8.00 7.99 8.06 8.04
I9 38.64 38.65 38.66 38.86 33.26 38.87 38.79 39.28
I10 39.73 39.74 39.75 39.90 33.54 39.91 39.88 40.20
I11 36.30 36.31 36.33 36.52 31.13 36.55 36.50 36.84
I12 41.60 41.60 41.62 41.81 35.63 41.82 41.78 42.03
I13 21.06 21.06 21.07 21.10 20.60 21.09 21.24 21.49
I14 23.86 23.86 23.86 23.88 23.85 23.86 24.01 24.30
I15 35.61 35.61 35.62 35.62 35.14 35.62 36.14 36.11
I16 7.61 7.60 7.59 12.93 12.74 12.90 11.71 12.10
I17 44.60 44.59 44.60 44.63 44.65 44.61 45.30 45.31
Table D.25 Mispredicted Branches per 1000 Graduate Instructions for Float-Type Benchmarks
M1 M2 M3 M4 M5 M6 M7 M8
F1 4.44 4.41 4.41 4.32 4.19 4.28 4.40 4.27
F2 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.06 
F3 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.06 
F4 11.64 11.57 11.62 11.67 11.59 11.68 12.05 12.05 
F6 2.39 2.39 2.39 2.42 2.35 2.40 2.40 2.42
F7 2.73 2.76 2.76 2.70 3.23 2.72 2.91 3.51 
F8 0.49 0.50 0.49 0.50 0.50 0.49 0.50 0.50 
F9 10.97 10.99 10.99 11.00 11.03 10.99 11.08 11.12 
F10 0.72 0.71 0.72 0.79 0.65 0.78 0.78 0.70 
F11 18.53 18.53 18.55 18.60 19.27 18.58 18.58 18.67
APPENDIX E Machine Characteristics using HINT 
This appendix contains the HINT tables and graphs for machines M1 to M8. The
data files have been truncated to suit the thesis format. Tables E.2, E.4, E.6, E.8, E.10,
E.12, E.14, and E.16 list results for the integer data type. Tables E.1, E.3, E.5, E.7, E.9, E.11,
E.13, and E.15 list results for the double data type. The tables list only alternate data points.
In most cases, the reader will be able to interpolate the missing data points from the table
entries or from Figures E.1, E.2, E.3, and E.4. The complete data will be posted on a website in
the future, or can be obtained by writing to the author.
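Because only alternate data points are listed, an omitted point can be estimated from its two listed neighbours. The sketch below interpolates QUIPS linearly in the logarithm of memory size, using four rows from Table E.1 (Hydra Node 1, DOUBLE). Interpolating in log(memory) is a choice made for this illustration, since the listed sizes are roughly geometrically spaced; it is not a method prescribed by the thesis, and `interpolate_quips` is a hypothetical helper. That a 336-byte point is among the omitted ones is also an assumption.

```python
import math

# (memory_bytes, quips) pairs copied from Table E.1 (Hydra Node 1, DOUBLE).
LISTED = [
    (168, 1277112.46),
    (252, 1420212.36),
    (420, 1463470.66),
    (588, 1504709.58),
]


def interpolate_quips(memory, points):
    """Estimate QUIPS at `memory` by linear interpolation in log(memory)."""
    ordered = sorted(points)
    for (m0, q0), (m1, q1) in zip(ordered, ordered[1:]):
        if m0 <= memory <= m1:
            # Fractional position of `memory` between m0 and m1 on a log axis.
            t = (math.log(memory) - math.log(m0)) / (math.log(m1) - math.log(m0))
            return q0 + t * (q1 - q0)
    raise ValueError("memory size outside the listed range")


# ASSUMPTION: a 336-byte data point lies between the listed 252 and 420 rows.
estimate = interpolate_quips(336, LISTED)
print(f"estimated QUIPS at 336 bytes: {estimate:.0f}")
```

The estimate always lies between the two neighbouring table entries, so it cannot reproduce a local peak or dip that the omitted point may have had.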
The data in the tables is organized in five columns: time, QUIPS, Quality, subintervals, and
memory use in bytes. The first two columns (time and QUIPS) are used to produce the
traditional QUIPS-time graphs in Figures E.1 and E.3. The fifth (memory) and second (QUIPS)
columns are also useful for relating problem size to the performance of the machine. The
resulting QUIPS-memory graphs (Figures E.2 and E.4) look similar to the traditional QUIPS-time
graphs. The fifth (memory) column is also useful for finding the sizes of memory regimes, such
as where the primary cache saturates or where virtual memory is first invoked.
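One way to locate a memory regime boundary from the memory and QUIPS columns is to look for a sharp fall in QUIPS between adjacent rows. The following sketch does this for seven rows of Table E.1 (Hydra Node 1, DOUBLE). The 25% threshold and the function name `regime_boundaries` are assumptions made for illustration and do not come from the thesis.

```python
# (memory_bytes, quips) rows copied from Table E.1 (Hydra Node 1, DOUBLE).
ROWS = [
    (412356, 1173868.96),
    (653352, 1176494.41),
    (1035300, 1149712.23),
    (1640604, 1116440.53),
    (2599968, 678825.39),
    (4120452, 557929.89),
    (6530076, 507980.05),
]


def regime_boundaries(rows, min_drop=0.25):
    """Return (smaller, larger) memory sizes between which QUIPS falls sharply.

    min_drop is the fractional QUIPS loss between adjacent rows that counts
    as a regime change; 0.25 is an arbitrary threshold chosen for this sketch.
    """
    ordered = sorted(rows)
    boundaries = []
    for (m0, q0), (m1, q1) in zip(ordered, ordered[1:]):
        if (q0 - q1) / q0 >= min_drop:
            boundaries.append((m0, m1))
    return boundaries


boundaries = regime_boundaries(ROWS)
print(boundaries)
```

For these rows the only drop above the threshold lies between roughly 1.6 MB and 2.6 MB of memory use, which suggests that one memory regime gives way to the next in that range on this machine. Because only alternate data points are listed, the boundary cannot be placed more precisely than the gap between two rows.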
The third and fourth columns are Quality and subintervals. These columns can be used 
to check for loss of quality due to insufficient precision and poor choices of which rectangle to 
split. 
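A short sketch of such checks follows, using the rows with 30952 subintervals from Table E.1 (DOUBLE) and Table E.2 (INT) for machine M1. It relies on QUIPS being quality improvement per second, so the QUIPS column should equal Quality divided by time up to rounding. Treating Quality per subinterval as the indicator of lost quality is an assumption made for this illustration, and `check_row` is a hypothetical helper.

```python
# (time_s, quips, quality, subintervals, memory_bytes) for machine M1, copied
# from the 30952-subinterval rows of Table E.1 (DOUBLE) and Table E.2 (INT).
ROWS = {
    "double": (0.04558442, 678825.39, 30943.86, 30952, 2599968),
    "int": (0.02727300, 770815.90, 21022.46, 30952, 1361888),
}


def check_row(time_s, quips, quality, subintervals, memory_bytes):
    """Derive consistency and quality indicators from one HINT table row."""
    return {
        # QUIPS should match Quality / time up to rounding in the table.
        "quips_rel_error": abs(quality / time_s - quips) / quips,
        # ASSUMPTION: Quality gained per subinterval is used here as the
        # indicator; a lower ratio at equal subintervals signals lost quality.
        "quality_per_subinterval": quality / subintervals,
        "bytes_per_subinterval": memory_bytes / subintervals,
    }


report = {name: check_row(*row) for name, row in ROWS.items()}
for name, stats in report.items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

With 30952 subintervals in both runs, the DOUBLE row keeps a ratio of about 1.00 while the INT row falls to about 0.68, which is the kind of precision-related quality loss these columns are meant to expose. The same rows show 84 bytes of memory per subinterval for DOUBLE and 44 bytes for INT.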
[Graph: QUIPS (0 to 2e+06) versus time in seconds (logarithmic axis, 1e-07 to 1000) for machines M1-M8. Legible curve labels: "M3 Helix N3 15.82 MQuips" and "M7 DC N0 19.72 MQuips".]
Figure E.l HINT (Double) QUIPS-Time Graph for Machines M1-M8 
[Graph: QUIPS (0 to 2e+06) versus memory in bytes (logarithmic axis, 100 to 1e+10) for machines M1-M8. Legible curve labels: "M3 Helix N3 15.82 MQuips" and "M7 DC N0 19.72 MQuips".]
Figure E.2 HINT (Double) QUIPS-Memory Graph for Machines M1-M8 
[Graph: QUIPS (0 to 2e+06) versus time in seconds (logarithmic axis, 1e-07 to 1000) for machines M1-M8. Legible curve labels: "M3 Helix N3 15.82 MQuips" and "M7 DC N0 19.72 MQuips".]
Figure E.3 HINT (Int) QUIPS-Time Graph for Machines M1-M8 
[Graph: QUIPS (0 to 2e+06) versus memory in bytes (logarithmic axis, 100 to 1e+10) for machines M1-M8. Legible curve labels: "M3 Helix N3 15.82 MQuips" and "M7 DC N0 19.72 MQuips".]
Figure E.4 HINT (Int) QUIPS-Memory Graph for Machines M1-M8 
Table E.1 Truncated HINT Data (DOUBLE) for Hydra (Node 1) (M1)
Time QUIPS Quality Subintervals Memory 
42.42725003 401613.98 17039376.58 19517142 1639439928 
26.78001499 421209.57 11279998.72 12314982 1034458488 
17.14560997 428406.24 7345286.33 7770543 652725612 
10.84049904 436352.44 4730278.20 4903080 411858720 
6.81497002 443736.89 3024053.63 3093760 259875840 
4.31247497 446176.24 1924123.89 1952110 163977240 
2.71368897 449774.00 1220546.75 1231749 103466916 
1.70773709 452491.97 772737.32 777213 65285892 
1.06704593 457920.92 488622.66 490409 41194356 
0.66740108 462581.25 308727.23 309440 25992960 
0.41452801 470335.82 194967.37 195252 16401168 
0.25369298 485184.94 123088.01 123202 10348968 
0.15294498 507980.05 77693.00 77739 6530076 
0.08788574 557929.89 49034.08 49053 4120452 
0.04558442 678825.39 30943.86 30952 2599968 
0.01749055 1116440.53 19527.16 19531 1640604 
0.01071822 1149712.23 12322.87 12325 1035300 
0.00660993 1176494.41 7776.55 7778 653352 
0.00418089 1173868.96 4907.82 4909 412356 
0.00258201 1199423.41 3096.93 3098 260232 
0.00160800 1215156.38 1953.97 1955 164220 
0.00098774 1248293.99 1232.99 1234 103656 
0.00059173 1316460.22 779.00 780 65520 
0.00035557 1383697.58 492.00 493 41412 
0.00019651 1582624.14 311.00 312 26208 
0.00012411 1579243.56 196.00 197 16548 
0.00007831 1583352.87 124.00 125 10500 
0.00004995 1581615.57 79.00 80 6720 
0.00003175 1574566.37 50.00 51 4284 
0.00002035 1572321.50 32.00 33 2772 
0.00001342 1564627.08 21.00 22 1848 
0.00000898 1558687.30 14.00 15 1260 
0.00000586 1536510.38 9.00 10 840 
0.00000399 1504709.58 6.00 7 588 
0.00000273 1463470.66 4.00 5 420 
0.00000141 1420212.36 2.00 3 252 
0.00000078 1277112.46 1.00 2 168 
Table E.2 Truncated HINT Data (INT) for Hydra (Node 1) (M1)
Time QUIPS Quality Subintervals Memory 
0.02727300 770815.90 21022.46 30952 1361888 
0.01705447 882234.22 15046.04 19531 859364 
0.01061601 977129.39 10373.22 12325 542300 
0.00666016 1043809.81 6951.94 7778 342232 
0.00418389 1091332.67 4566.02 4909 215996 
0.00261269 1131872.72 2957.23 3098 136312 
0.00164462 1153707.72 1897.41 1955 86020 
0.00098913 1223508.41 1210.21 1234 54296 
0.00060019 1282654.93 769.83 780 34320 
0.00036877 1324221.46 488.33 493 21692 
0.00023341 1326112.32 309.53 312 13728 
0.00014633 1335434.63 195.41 197 8668 
0.00009288 1332560.35 123.77 125 5500 
0.00005926 1331383.17 78.90 80 3520 
0.00003748 1333108.51 49.96 51 2244 
0.00002399 1333137.39 31.98 33 1452 
0.00001593 1318038.29 20.99 22 968 
0.00001055 1326171.04 14.00 15 660 
0.00000682 1318650.70 9.00 10 440 
0.00000464 1294318.52 6.00 7 308 
0.00000313 1276074.38 4.00 5 220 
0.00000164 1220519.99 2.00 3 132 
0.00000085 1171940.14 1.00 2 88 
Table E.3 Truncated HINT Data (DOUBLE) for Helix (Node 0) (M2) 
Time QUIPS Quality Subintervals Memory 
57.11850703 243318.49 13897988.99 15503330 1302279720 
22.00605297 414331.14 9117793.01 9782336 821716224 
13.82344699 426890.81 5901102.55 6172487 518488908 
8.79489803 430352.07 3784902.60 3894734 327157656 
5.48751998 439783.79 2413322.35 2457511 206430924 
3.47678304 440906.61 1532936.64 1550648 130254432 
2.15685892 450354.42 971350.96 978433 82188372 
1.36955190 448721.37 614547.20 617375 51859500 
0.85494900 454326.09 388425.63 389554 32722536 
0.53598690 457756.84 245351.67 245802 20647368 
0.32971704 469851.30 154917.98 155098 13028232 
0.20007098 488790.01 97792.69 97865 8220660 
0.12000501 514333.54 61722.60 61752 5187168 
0.06559060 593876.16 38952.69 38965 3273060 
0.02806289 875943.28 24581.50 24587 2065308 
0.01386643 1118687.62 15512.21 15515 1303260 
0.00839221 1166472.27 9789.29 9791 822444 
0.00524705 1177368.21 6177.72 6179 519036 
0.00327624 1190048.73 3898.89 3900 327600 
0.00206498 1191274.18 2459.95 2461 206724 
0.00127735 1214998.09 1551.98 1553 130452 
0.00078425 1249587.05 979.99 981 82404 
0.00046051 1344141.53 619.00 620 52080 
0.00026120 1496951.53 391.00 392 32928 
0.00015581 1585222.78 247.00 248 20832 
0.00009855 1582983.39 156.00 157 13188 
0.00006269 1579220.88 99.00 100 8400 
0.00003996 1576496.96 63.00 64 5376 
0.00002530 1580919.83 40.00 41 3444 
0.00001659 1566893.40 26.00 27 2268 
0.00001091 1558518.90 17.00 18 1512 
0.00000710 1549026.05 11.00 12 1008 
0.00000460 1521388.58 7.00 8 672 
0.00000335 1492185.15 5.00 6 504 
0.00000208 1444711.77 3.00 4 336 
0.00000078 1277759.88 1.00 2 168 
Table E.4 Truncated HINT Data (INT) for Helix (Node 0) (M2) 
Time QUIPS Quality Subintervals Memory 
0.02744360 766024.35 21022.46 30952 1361888 
0.01711232 879251.79 15046.04 19531 859364 
0.01064383 974575.75 10373.22 12325 542300 
0.00667560 1041394.57 6951.94 7778 342232 
0.00419241 1089115.04 4566.02 4909 215996 
0.00261852 1129352.23 2957.23 3098 136312 
0.00164815 1151234.58 1897.41 1955 86020 
0.00099099 1221207.94 1210.21 1234 54296 
0.00060108 1280745.73 769.83 780 34320 
0.00036942 1321876.99 488.33 493 21692 
0.00023384 1323664.09 309.53 312 13728 
0.00014652 1333716.33 195.41 197 8668 
0.00009304 1330310.20 123.77 125 5500 
0.00005936 1329291.13 78.90 80 3520 
0.00003753 1331059.00 49.96 51 2244 
0.00002405 1329874.80 31.98 33 1452 
0.00001594 1316939.83 20.99 22 968 
0.00001057 1323916.61 14.00 15 660 
0.00000683 1316762.03 9.00 10 440 
0.00000464 1292497.46 6.00 7 308 
0.00000314 1274271.93 4.00 5 220 
0.00000164 1216691.39 2.00 3 132 
0.00000085 1170362.57 1.00 2 88 
Table E.5 Truncated HINT Data (DOUBLE) for Helix (Node 3) (M3) 
Time QUIPS Quality Subintervals Memory 
46.48404598 298984.06 13897988.99 15503330 1302279720 
21.97535610 414909.91 9117793.01 9782336 821716224 
13.81081200 427281.36 5901102.55 6172487 518488908 
8.79201603 430493.14 3784902.60 3894734 327157656 
5.49001396 439584.01 2413322.35 2457511 206430924 
3.48220098 440220.61 1532936.64 1550648 130254432 
2.15983593 449733.68 971350.96 978433 82188372 
1.37862992 445766.62 614547.20 617375 51859500 
0.86618495 448432.68 388425.63 389554 32722536 
0.54574001 449576.11 245351.67 245802 20647368 
0.34120095 454037.37 154917.98 155098 13028232 
0.21094447 463594.48 97792.69 97865 8220660 
0.12967869 475965.66 61722.60 61752 5187168 
0.07839699 496864.65 38952.69 38965 3273060 
0.04485722 547994.17 24581.50 24587 2065308 
0.02326351 666804.36 15512.21 15515 1303260 
0.00893078 1096128.68 9789.29 9791 822444 
0.00536208 1152111.04 6177.72 6179 519036 
0.00327807 1189385.07 3898.89 3900 327600 
0.00206520 1191147.55 2459.95 2461 206724 
0.00127706 1215275.42 1551.98 1553 130452 
0.00078456 1249093.58 979.99 981 82404 
0.00046072 1343529.81 619.00 620 52080 
0.00026131 1496285.62 391.00 392 32928 
0.00015590 1584369.53 247.00 248 20832 
0.00009860 1582110.49 156.00 157 13188 
0.00006269 1579245.57 99.00 100 8400 
0.00003995 1577148.11 63.00 64 5376 
0.00002530 1580774.67 40.00 41 3444 
0.00001659 1566841.44 26.00 27 2268 
0.00001091 1558404.37 17.00 18 1512 
0.00000711 1548168.46 11.00 12 1008 
0.00000460 1523118.29 7.00 8 672 
0.00000335 1491716.45 5.00 6 504 
0.00000208 1444543.12 3.00 4 336 
0.00000078 1276215.73 1.00 2 168 
Table E.6 Truncated HINT Data (INT) for Helix (Node 3) (M3) 
Time QUIPS Quality Subintervals Memory 
0.03650756 575838.66 21022.46 30952 1361888 
0.01785582 842640.55 15046.04 19531 859364 
0.01076537 963573.10 10373.22 12325 542300 
0.00668713 1039599.90 6951.94 7778 342232 
0.00418611 1090752.58 4566.02 4909 215996 
0.00261438 1131140.78 2957.23 3098 136312 
0.00164544 1153130.89 1897.41 1955 86020 
0.00098952 1223023.27 1210.21 1234 54296 
0.00060040 1282212.63 769.83 780 34320 
0.00036913 1322912.53 488.33 493 21692 
0.00023350 1325591.07 309.53 312 13728 
0.00014638 1334949.63 195.41 197 8668 
0.00009291 1332051.55 123.77 125 5500 
0.00005928 1330946.47 78.90 80 3520 
0.00003749 1332614.33 49.96 51 2244 
0.00002400 1332895.06 31.98 33 1452 
0.00001593 1317543.72 20.99 22 968 
0.00001056 1325755.61 14.00 15 660 
0.00000683 1318295.59 9.00 10 440 
0.00000464 1294098.93 6.00 7 308 
0.00000314 1275707.67 4.00 5 220 
0.00000164 1217877.00 2.00 3 132 
0.00000085 1171406.65 1.00 2 88 
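The columns of the HINT tables in this appendix are related: QUIPS stands for quality improvements per second, so each QUIPS entry should be approximately the Quality entry divided by the Time entry. The Python sketch below checks this on three rows copied from Table E.6. The helper names `quips` and `max_relative_gap` are illustrative assumptions, and small differences are expected because the printed times are rounded.

```python
# Rows copied from Table E.6: (time in seconds, tabulated QUIPS, quality).
ROWS = [
    (0.03650756, 575838.66, 21022.46),
    (0.00164544, 1153130.89, 1897.41),
    (0.00009291, 1332051.55, 123.77),
]


def quips(quality, seconds):
    """Quality improvements per second (hypothetical helper name)."""
    return quality / seconds


def max_relative_gap(rows):
    """Largest relative difference between tabulated and recomputed QUIPS."""
    return max(
        abs(quips(quality, seconds) - tabulated) / tabulated
        for seconds, tabulated, quality in rows
    )


if __name__ == "__main__":
    for seconds, tabulated, quality in ROWS:
        print(f"time={seconds:.8f}  table={tabulated:.2f}  "
              f"recomputed={quips(quality, seconds):.2f}")
```

For these rows the recomputed values agree with the tabulated ones to well within one percent.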
Table E.7 Truncated HINT Data (DOUBLE) for Chronus (M4) 
Time QUIPS Quality Subintervals Memory 
124.05720603 24376.28 3024053.63 3093760 259875840 
4.75572503 404591.07 1924123.89 1952110 163977240 
2.97769797 409896.09 1220546.75 1231749 103466916 
1.88922501 409023.45 772737.32 777213 65285892 
1.18087697 413779.48 488622.66 490409 41194356 
0.74090099 416691.61 308727.23 309440 25992960 
0.46331298 420811.37 194967.37 195252 16401168 
0.28823507 427040.39 123088.01 123202 10348968 
0.17659998 439937.76 77693.00 77739 6530076 
0.10575774 463645.29 49034.08 49053 4120452 
0.06221833 497343.23 30943.86 30952 2599968 
0.03301690 591429.26 19527.16 19531 1640604 
0.01417412 869392.27 12322.87 12325 1035300 
0.00727919 1068326.41 7776.55 7778 653352 
0.00424717 1155551.11 4907.82 4909 412356 
0.00262020 1181945.69 3096.93 3098 260232 
0.00163320 1196409.54 1953.97 1955 164220 
0.00100149 1231156.13 1232.99 1234 103656 
0.00060382 1290116.14 779.00 780 65520 
0.00036072 1363949.52 492.00 493 41412 
0.00019958 1558302.87 311.00 312 26208 
0.00012582 1557755.32 196.00 197 16548 
0.00007920 1565696.28 124.00 125 10500 
0.00005051 1563959.98 79.00 80 6720 
0.00003211 1557306.85 50.00 51 4284 
0.00002053 1558541.31 32.00 33 2772 
0.00001356 1549188.26 21.00 22 1848 
0.00000898 1558915.66 14.00 15 1260 
0.00000591 1522400.66 9.00 10 840 
0.00000402 1492722.77 6.00 7 588 
0.00000277 1444734.09 4.00 5 420 
0.00000146 1374275.23 2.00 3 252 
0.00000081 1237487.94 1.00 2 168 
Table E.8 Truncated HINT Data (INT) for Chronus (M4) 
Time QUIPS Quality Subintervals Memory 
0.03593762 584970.96 21022.46 30952 1361888 
0.01846160 814991.18 15046.04 19531 859364 
0.01112920 932072.47 10373.22 12325 542300 
0.00676614 1027458.87 6951.94 7778 342232 
0.00425771 1072411.60 4566.02 4909 215996 
0.00266016 1111673.37 2957.23 3098 136312 
0.00166978 1136322.10 1897.41 1955 86020 
0.00100554 1203536.60 1210.21 1234 54296 
0.00060998 1262063.02 769.83 780 34320 
0.00037420 1305011.39 488.33 493 21692 
0.00023609 1311075.80 309.53 312 13728 
0.00014846 1316240.26 195.41 197 8668 
0.00009386 1318665.76 123.77 125 5500 
0.00005993 1316697.93 78.90 80 3520 
0.00003784 1320236.14 49.96 51 2244 
0.00002444 1308637.25 31.98 33 1452 
0.00001602 1310511.67 20.99 22 968 
0.00001074 1303248.91 14.00 15 660 
0.00000687 1310272.94 9.00 10 440 
0.00000470 1275335.33 6.00 7 308 
0.00000311 1286553.29 4.00 5 220 
0.00000165 1210588.78 2.00 3 132 
0.00000086 1166732.61 1.00 2 88 
Table E.9 Truncated HINT Data (DOUBLE) for Tajar (M5) 
Time QUIPS Quality Subintervals Memory 
44.10228300 85821.01 3784902.60 3894734 327157656 
7.99518394 301847.01 2413322.35 2457511 206430924 
5.09887409 300642.18 1532936.64 1550648 130254432 
3.20488906 303084.11 971350.96 978433 82188372 
1.99016297 308792.40 614547.20 617375 51859500 
1.25871408 308589.25 388425.63 389554 32722536 
0.78522396 312460.76 245351.67 245802 20647368 
0.49015892 316056.64 154917.98 155098 13028232 
0.30073404 325179.99 97792.69 97865 8220660 
0.18266451 337901.44 61722.60 61752 5187168 
0.10687050 364484.97 38952.69 38965 3273060 
0.05739850 428260.28 24581.50 24587 2065308 
0.02719990 570303.82 15512.21 15515 1303260 
0.00777333 1259342.98 9789.29 9791 822444 
0.00460592 1341256.12 6177.72 6179 519036 
0.00268285 1453261.54 3898.89 3900 327600 
0.00170394 1443688.16 2459.95 2461 206724 
0.00105883 1465751.83 1551.98 1553 130452 
0.00064519 1518910.91 979.99 981 82404 
0.00037680 1642787.60 619.00 620 52080 
0.00022780 1716440.23 391.00 392 32928 
0.00014148 1745786.69 247.00 248 20832 
0.00008900 1752886.29 156.00 157 13188 
0.00005648 1752907.89 99.00 100 8400 
0.00003575 1762308.00 63.00 64 5376 
0.00002289 1747169.59 40.00 41 3444 
0.00001484 1752328.06 26.00 27 2268 
0.00000980 1734383.85 17.00 18 1512 
0.00000637 1727818.12 11.00 12 1008 
0.00000410 1709324.67 7.00 8 672 
0.00000297 1684408.92 5.00 6 504 
0.00000182 1646038.99 3.00 4 336 
0.00000068 1460874.97 1.00 2 168 
Table E.10 Truncated HINT Data (INT) for Tajar (M5) 
Time QUIPS Quality Subintervals Memory 
0.04353391 482898.60 21022.46 30952 1361888 
0.01957742 768540.55 15046.04 19531 859364 
0.01154909 898184.61 10373.22 12325 542300 
0.00520132 1336572.48 6951.94 7778 342232 
0.00326296 1399349.17 4566.02 4909 215996 
0.00204442 1446489.69 2957.23 3098 136312 
0.00128992 1470957.32 1897.41 1955 86020 
0.00077756 1556427.07 1210.21 1234 54296 
0.00046925 1640570.55 769.83 780 34320 
0.00028035 1741880.78 488.33 493 21692 
0.00017621 1756530.87 309.53 312 13728 
0.00011131 1755529.46 195.41 197 8668 
0.00007035 1759158.27 123.77 125 5500 
0.00004491 1757071.90 78.90 80 3520 
0.00002840 1759286.36 49.96 51 2244 
0.00001819 1757957.42 31.98 33 1452 
0.00001185 1771383.51 20.99 22 968 
0.00000812 1724780.13 14.00 15 660 
0.00000538 1673541.80 9.00 10 440 
0.00000358 1673619.32 6.00 7 308 
0.00000243 1645949.47 4.00 5 220 
0.00000125 1595491.76 2.00 3 132 
0.00000065 1539549.87 1.00 2 88 
Table E.11 Truncated HINT Data (DOUBLE) for Hermes (Node 0) (M6) 
Time QUIPS Quality Subintervals Memory 
85.12695301 107108.18 9117793.01 9782336 821716224 
10.15006006 581385.97 5901102.55 6172487 518488908 
6.24585009 605986.78 3784902.60 3894734 327157656 
3.93785894 612851.40 2413322.35 2457511 206430924 
2.48283100 617414.81 1532936.64 1550648 130254432 
1.57110596 618259.36 971350.96 978433 82188372 
0.98658609 622902.76 614547.20 617375 51859500 
0.61522901 631351.30 388425.63 389554 32722536 
0.38924301 630330.33 245351.67 245802 20647368 
0.24405450 634767.99 154917.98 155098 13028232 
0.15171532 644580.23 97792.69 97865 8220660 
0.09386940 657536.98 61722.60 61752 5187168 
0.05733538 679383.20 38952.69 38965 3273060 
0.03376875 727936.18 24581.50 24587 2065308 
0.01899680 816569.52 15512.21 15515 1303260 
0.00948394 1032196.41 9789.29 9791 822444 
0.00577172 1070342.14 6177.72 6179 519036 
0.00357558 1090422.19 3898.89 3900 327600 
0.00223817 1099090.50 2459.95 2461 206724 
0.00138501 1120560.29 1551.98 1553 130452 
0.00084972 1153314.06 979.99 981 82404 
0.00049880 1240960.62 619.00 620 52080 
0.00028363 1378539.91 391.00 392 32928 
0.00016904 1461154.24 247.00 248 20832 
0.00010687 1459759.18 156.00 157 13188 
0.00006803 1455221.60 99.00 100 8400 
0.00004328 1455513.87 63.00 64 5376 
0.00002749 1454985.84 40.00 41 3444 
0.00001799 1444987.58 26.00 27 2268 
0.00001186 1433213.25 17.00 18 1512 
0.00000773 1422669.98 11.00 12 1008 
0.00000497 1408101.18 7.00 8 672 
0.00000362 1380049.40 5.00 6 504 
0.00000227 1319209.54 3.00 4 336 
0.00000087 1152663.11 1.00 2 168 
Table E.12 Truncated HINT Data (INT) for Hermes (Node 0) (M6) 
Time QUIPS Quality Subintervals Memory 
0.03338103 629772.77 21022.46 30952 1361888 
0.01904379 790075.91 15046.04 19531 859364 
0.01167911 888185.77 10373.22 12325 542300 
0.00722927 961637.16 6951.94 7778 342232 
0.00454336 1004987.50 4566.02 4909 215996 
0.00283571 1042853.71 2957.23 3098 136312 
0.00178382 1063678.88 1897.41 1955 86020 
0.00107423 1126582.97 1210.21 1234 54296 
0.00065144 1181737.95 769.83 780 34320 
0.00039966 1221856.45 488.33 493 21692 
0.00025248 1225927.50 309.53 312 13728 
0.00015902 1228854.56 195.41 197 8668 
0.00010055 1230874.41 123.77 125 5500 
0.00006426 1227825.03 78.90 80 3520 
0.00004049 1233976.31 49.96 51 2244 
0.00002614 1223692.84 31.98 33 1452 
0.00001715 1224323.76 20.99 22 968 
0.00001151 1216529.87 14.00 15 660 
0.00000735 1223544.79 9.00 10 440 
0.00000504 1190708.16 6.00 7 308 
0.00000332 1206454.95 4.00 5 220 
0.00000176 1135091.39 2.00 3 132 
0.00000092 1090609.30 1.00 2 88 
Table E.13 Truncated HINT Data (DOUBLE) for DC (Node 0) (M7) 
Time QUIPS Quality Subintervals Memory 
21.45749795 112469.89 2413322.35 2457511 206430924 
2.23301101 686488.62 1532936.64 1550648 130254432 
1.35233307 718277.90 971350.96 978433 82188372 
0.84862602 724167.29 614547.20 617375 51859500 
0.52757704 736244.38 388425.63 389554 32722536 
0.33351505 735653.98 245351.67 245802 20647368 
0.20864695 742488.59 154917.98 155098 13028232 
0.13000266 752236.12 97792.69 97865 8220660 
0.07992041 772300.86 61722.60 61752 5187168 
0.04817499 808566.67 38952.69 38965 3273060 
0.02785208 882573.26 24581.50 24587 2065308 
0.01532844 1011988.73 15512.21 15515 1303260 
0.00715149 1368846.07 9789.29 9791 822444 
0.00433713 1424377.06 6177.72 6179 519036 
0.00265470 1468672.02 3898.89 3900 327600 
0.00165476 1486588.71 2459.95 2461 206724 
0.00102835 1509193.59 1551.98 1553 130452 
0.00062976 1556125.07 979.99 981 82404 
0.00037176 1665034.68 619.00 620 52080 
0.00020954 1865986.43 391.00 392 32928 
0.00012554 1967489.03 247.00 248 20832 
0.00007837 1990665.23 156.00 157 13188 
0.00004993 1982651.56 99.00 100 8400 
0.00003172 1985989.68 63.00 64 5376 
0.00002038 1963178.85 40.00 41 3444 
0.00001328 1957517.29 26.00 27 2268 
0.00000869 1956091.46 17.00 18 1512 
0.00000565 1945339.55 11.00 12 1008 
0.00000367 1908044.75 7.00 8 672 
0.00000264 1893517.17 5.00 6 504 
0.00000166 1806974.84 3.00 4 336 
0.00000062 1624693.93 1.00 2 168 
Table E.14 Truncated HINT Data (INT) for DC (Node 0) (M7) 
Time QUIPS Quality Subintervals Memory 
0.02541852 827052.89 21022.46 30952 1361888 
0.01410879 1066430.41 15046.04 19531 859364 
0.00874866 1185692.07 10373.22 12325 542300 
0.00530704 1309944.73 6951.94 7778 342232 
0.00333221 1370266.84 4566.02 4909 215996 
0.00207871 1422629.36 2957.23 3098 136312 
0.00130666 1452110.95 1897.41 1955 86020 
0.00078636 1539009.07 1210.21 1234 54296 
0.00047881 1607820.67 769.83 780 34320 
0.00029286 1667429.26 488.33 493 21692 
0.00018311 1690347.69 309.53 312 13728 
0.00011629 1680431.07 195.41 197 8668 
0.00007332 1687934.85 123.77 125 5500 
0.00004676 1687371.95 78.90 80 3520 
0.00002958 1688997.30 49.96 51 2244 
0.00001898 1684954.29 31.98 33 1452 
0.00001245 1686377.53 20.99 22 968 
0.00000847 1652372.90 14.00 15 660 
0.00000535 1683145.28 9.00 10 440 
0.00000363 1652761.21 6.00 7 308 
0.00000246 1627001.11 4.00 5 220 
0.00000126 1581363.59 2.00 3 132 
0.00000069 1456387.62 1.00 2 88 
Table E.15 Truncated HINT Data (DOUBLE) for Exiguus (M8) 
Time QUIPS Quality Subintervals Memory 
16.77947009 72740.48 1220546.75 1231749 103466916 
2.52151108 306458.03 772737.32 777213 65285892 
1.58289492 308689.26 488622.66 490409 41194356 
0.98676205 312868.97 308727.23 309440 25992960 
0.62015307 314385.89 194967.37 195252 16401168 
0.38083506 323205.58 123088.01 123202 10348968 
0.23236650 334355.42 77693.00 77739 6530076 
0.13885232 353138.36 49034.08 49053 4120452 
0.08047659 384507.63 30943.86 30952 2599968 
0.04145850 471004.92 19527.16 19531 1640604 
0.01534350 803132.97 12322.87 12325 1035300 
0.00646320 1203204.23 7776.55 7778 653352 
0.00374381 1310917.35 4907.82 4909 412356 
0.00230764 1342031.90 3096.93 3098 260232 
0.00143128 1365195.59 1953.97 1955 164220 
0.00088279 1396690.51 1232.99 1234 103656 
0.00053061 1468099.64 779.00 780 65520 
0.00031795 1547409.37 492.00 493 41412 
0.00017665 1760552.75 311.00 312 26208 
0.00010952 1789644.36 196.00 197 16548 
0.00006937 1787600.45 124.00 125 10500 
0.00004427 1784347.74 79.00 80 6720 
0.00002814 1776557.75 50.00 51 4284 
0.00001819 1758868.16 32.00 33 2772 
0.00001191 1762734.33 21.00 22 1848 
0.00000805 1739706.16 14.00 15 1260 
0.00000519 1733653.40 9.00 10 840 
0.00000356 1685669.56 6.00 7 588 
0.00000238 1682856.36 4.00 5 420 
0.00000125 1600859.34 2.00 3 252 
0.00000069 1442643.38 1.00 2 168 
Table E.16 Truncated HINT Data (INT) for Exiguus (M8) 
Time QUIPS Quality Subintervals Memory 
0.03817634 550667.29 21022.46 30952 1361888 
0.01688462 891109.06 15046.04 19531 859364 
0.01000013 1037307.76 10373.22 12325 542300 
0.00596293 1165859.09 6951.94 7778 342232 
0.00372771 1224884.62 4566.02 4909 215996 
0.00232030 1274503.10 2957.23 3098 136312 
0.00146253 1297349.21 1897.41 1955 86020 
0.00088700 1364389.45 1210.21 1234 54296 
0.00053661 1434634.69 769.83 780 34320 
0.00032848 1486636.21 488.33 493 21692 
0.00020592 1503144.40 309.53 312 13728 
0.00013072 1494933.10 195.41 197 8668 
0.00008263 1497860.30 123.77 125 5500 
0.00005265 1498705.19 78.90 80 3520 
0.00003318 1505798.99 49.96 51 2244 
0.00002124 1506018.86 31.98 33 1452 
0.00001402 1497709.42 20.99 22 968 
0.00000939 1490333.15 14.00 15 660 
0.00000616 1459876.40 9.00 10 440 
0.00000405 1482748.55 6.00 7 308 
0.00000274 1459594.05 4.00 5 220 
0.00000142 1409760.24 2.00 3 132 
0.00000074 1353597.36 1.00 2 88 
APPENDIX F LMBENCH 
LMBENCH [McVoy and Staelin, 1996] is a suite of simple, portable micro-benchmarks written 
in C. It compares the performance of different UNIX systems. The benchmarks in LMBENCH can 
be divided into three main categories. First, a set of benchmarks measures bandwidth for 
cached file read; memory copy; memory read; memory write; interprocess communication such 
as pipes; and network protocols such as TCP, UDP, and RPC. Second, a set of benchmarks 
measures latency for process context switching; process creation; signal handling; 
system call overhead; memory read operations; file system creation and deletion; establishing 
connections over UNIX pipes, TCP, UDP, and RPC; and UNIX file creation and 
deletion. Third, a miscellaneous benchmark calculates the processor clock rate. 
Among all the benchmarks in LMBENCH, the memory read latency micro-benchmark 
is the most useful and popular. It measures memory read latency while varying 
memory sizes and strides. The results are reported in nanoseconds per load. The entire memory 
hierarchy is measured, including onboard cache latency and main memory latency. 
Only data accesses are measured; the latencies of instruction cache and TLB accesses are 
not. The LMBENCH author, Larry McVoy, claims that the benchmark has been verified accurate to 
within a few nanoseconds on an SGI Indy. 
The algorithm for the memory read latency benchmark consists of two loops. The outer 
loop iterates on the stride size. The inner loop iterates on the array size. For each array size, 
a ring of pointers is created that point forward one stride. Traversing the array is done by 
statement F.1 in an unrolled for loop. The benchmark stops after doing one million loads. 
p = (char **) *p;    (F.1) 
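The ring construction and the dependent-load traversal described above can be sketched in Python as follows. This is a structural illustration only, under stated assumptions: list indices stand in for the C pointers, the function names `build_ring` and `chase` are hypothetical, and no timing is attempted because interpreter overhead in Python would dominate any memory latency. LMBENCH itself is written in C.

```python
def build_ring(array_size, stride):
    """Return a list in which slot i holds the index of the next slot to load.

    Every visited slot points forward one stride; the last visited slot wraps
    back to slot 0, so the chain forms a ring. Indices stand in for pointers.
    """
    if stride <= 0 or array_size < stride:
        raise ValueError("need 0 < stride <= array_size")
    ring = [0] * array_size
    slots = list(range(0, array_size, stride))
    for here, there in zip(slots, slots[1:] + slots[:1]):
        ring[here] = there
    return ring


def chase(ring, loads):
    """Perform `loads` dependent loads; the analogue of p = (char **) *p."""
    p = 0
    for _ in range(loads):
        p = ring[p]  # each load depends on the result of the previous one
    return p


if __name__ == "__main__":
    ring = build_ring(array_size=64, stride=16)
    print(chase(ring, loads=4))  # back at the start after one full lap
```

Because each load needs the result of the previous one, the loads cannot overlap, which is why this access pattern exposes latency rather than bandwidth.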
Table F.1 Memory Latencies in Nanoseconds (using LMBENCH) 
Host  OS          MHz  L1  L2  Main memory 
M1    IRIX64 6.2  194  10  61  1087 
M2    IRIX64 6.2  194  10  61  1094 
M3    IRIX64 6.2  194  10  61  1094 
M4    IRIX64 6.5  195  10  62   929 
M5    IRIX 6.5    270   7  60   856 
M6    IRIX64 6.5  180  11  66   523 
M7    IRIX64 6.4  250   8  48   463 
M8    IRIX 6.5    225   8  54   906 
LMBENCH version 1 (version 2 has since been released) was modified to increase the 
data size from the default 8 megabytes to 256 megabytes, so that it exceeds the largest 
secondary cache of all the machines M1 to M8. The larger data size minimizes the cache 
effect. 
LMBENCH was run on all the machines and all the detailed benchmark results were collected. Table 
F.1 lists the main memory latency and the latencies of the level 1 and level 2 data caches. These 
latencies can also be read from Figures F.1(a), F.1(b), F.1(c), F.1(d), F.2(a), F.2(b), 
F.2(c), and F.2(d). 
As in the HINT graphs, the memory hierarchy can be easily seen in the memory read latency 
graphs. For small array sizes, once the data is loaded into cache, repeated loads are fast 
and hence the memory latency is seen to be lower. In addition, Table F.1 makes clear 
that larger systems like M1, M2, and M3 have a higher main memory latency cost than the workstations. 
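Because the machines in Table F.1 run at different clock rates, it can help to express each latency in processor clock cycles as well as in nanoseconds. The sketch below does that arithmetic on the values copied from Table F.1. The conversion is an illustration added here and is not part of the original analysis; the names `ns_to_cycles` and `main_memory_cycles` are hypothetical.

```python
# (host, clock in MHz, L1 ns, L2 ns, main memory ns), copied from Table F.1.
TABLE_F1 = [
    ("M1", 194, 10, 61, 1087),
    ("M2", 194, 10, 61, 1094),
    ("M3", 194, 10, 61, 1094),
    ("M4", 195, 10, 62, 929),
    ("M5", 270, 7, 60, 856),
    ("M6", 180, 11, 66, 523),
    ("M7", 250, 8, 48, 463),
    ("M8", 225, 8, 54, 906),
]


def ns_to_cycles(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to clock cycles at the given rate.

    One MHz is 10^6 cycles per second and one nanosecond is 10^-9 seconds,
    so cycles = ns * MHz / 1000.
    """
    return latency_ns * clock_mhz / 1000.0


def main_memory_cycles(table):
    """Map each host to its main memory latency expressed in clock cycles."""
    return {host: ns_to_cycles(main_ns, mhz) for host, mhz, _, _, main_ns in table}


if __name__ == "__main__":
    for host, cycles in main_memory_cycles(TABLE_F1).items():
        print(f"{host}: {cycles:.1f} cycles to main memory")
```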
For completeness, the other interesting graphs for machine M1 are presented here. 
Figure F.4 illustrates context switching time between processes. The size and number of processes 
are varied. 
Figure F.3 illustrates memory reread bandwidth. Finally, Figure F.5 illustrates memory 
bandwidth for various memory operations. 
[Figure panels: "Memory Latency for Hydra Node 1 (M1)"; "Memory Latency for Helix Node 0 (M2)"] 
Figure F.1 LMBENCH: Memory Latency Graph for Machines (a) M1 (b) 
M2 (c) M3 (d) M4 
[Figure panels: "Memory Latency for Hermes (M6)"; "Memory Latency for DC Node 0 (M7)"; "Memory Latency for Exiguus (M8)"] 
Figure F.2 LMBENCH: Memory Latency Graph for Machines (a) M5 (b) 
M6 (c) M7 (d) M8 
[Figure panel: "Reread bandwidth for Hydra Node 1 (M1)"; x-axis: Memory size; curves: libc bcopy unaligned, Memory read bandwidth] 
Figure F.3 LMBENCH: Memory reread bandwidth for Machine M1 
Figure F.4 LMBENCH: Context Switch Latency for Machine M1 
[Figure panels for Hydra Node 1 (M1): libc bcopy unaligned; libc bcopy aligned; unrolled bcopy unaligned; 
unrolled partial bcopy unaligned; Memory read bandwidth; Memory partial read bandwidth; 
Memory write bandwidth; Memory partial write bandwidth; Memory partial read/write bandwidth; 
Memory bzero bandwidth] 
Figure F.5 LMBENCH: Memory bandwidth for Machine M1 
APPENDIX G Machine Profile Using Stream Benchmark 
The STREAM benchmark is described briefly in an earlier chapter. The STREAM benchmark was modified 
to scale from 2 array elements (48 bytes) up to 32,000,000 array elements (768 megabytes). 
Tables G.2, G.3, G.4, G.5, G.6, G.7, G.8, and G.9 show the results of the runs. Figures G.1(a), 
G.1(b), G.1(c), and G.1(d) show the results of the modified STREAM benchmark for the copy kernel, 
scale kernel, sum kernel, and triad kernel, respectively. Table G.1 gives the results of the STREAM 
benchmark with 4,000,000 array elements (91.6 megabytes) for machines M1 to M8. In the 
table, the numbers are the bandwidth to main memory, as the array size is big enough to 
exceed the secondary cache on all the machines. 
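The memory sizes quoted above follow from STREAM's three arrays of 8-byte elements, so N elements occupy 24N bytes: 48 bytes for N = 2 and 96,000,000 bytes (about 91.6 binary megabytes) for N = 4,000,000. The Python sketch below shows that arithmetic together with the four kernels and the byte counting used by the reference STREAM code. It illustrates the arithmetic only and makes no attempt to measure bandwidth; the function names are hypothetical, and the assumption that the modified benchmark reports 10^6 bytes per second follows the reference code rather than anything stated in this appendix.

```python
BYTES_PER_DOUBLE = 8

# Arrays touched per loop iteration by each kernel (reads plus writes).
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def memory_size_bytes(n):
    """Total footprint of the three arrays a, b and c of n doubles each."""
    return 3 * BYTES_PER_DOUBLE * n


def run_kernel(name, a, b, c, scalar=3.0):
    """Apply one STREAM kernel in place, element by element."""
    for j in range(len(a)):
        if name == "copy":
            c[j] = a[j]
        elif name == "scale":
            b[j] = scalar * c[j]
        elif name == "add":
            c[j] = a[j] + b[j]
        elif name == "triad":
            a[j] = b[j] + scalar * c[j]
        else:
            raise ValueError(f"unknown kernel: {name}")


def bandwidth_mb_per_s(name, n, seconds):
    """Bytes moved by the kernel divided by elapsed time, in 10^6 bytes/s."""
    return 1e-6 * ARRAYS_TOUCHED[name] * BYTES_PER_DOUBLE * n / seconds


if __name__ == "__main__":
    n = 4_000_000
    print(memory_size_bytes(n), round(memory_size_bytes(n) / 2**20, 1))
```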
Table G.1 STREAM Benchmark for Memory Size 91.6 MB 
Machine  Copy      Scale     Add       Triad 
         (MB/s)    (MB/s)    (MB/s)    (MB/s) 
M1       165.8745  169.4696  188.3321  189.3678 
M2       164.4183  166.5777  186.8762  188.2434 
M3       164.1164  166.2843  188.5159  189.4668 
M4        99.4816  100.5170  110.5358  111.7675 
M5        68.1302   68.8822   69.8274   70.1060 
M6       269.9613  273.7558  284.1236  280.0409 
M7       326.5440  333.1355  376.2833  376.2994 
M8        68.5972   68.0162   69.7830   69.9155 
Table G.2 STREAM Benchmark for Hydra, Processor 1 (M1) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 6.3913 6.3913 9.5870 9.5870 
4 96 13.0944 13.0944 19.1740 19.1740 
8 192 26.1888 25.5653 38.3479 39.2832 
16 384 52.3776 51.1306 76.6958 78.5665 
32 768 104.7553 102.2611 153.3917 157.1329 
64 1536 204.5223 204.5223 257.6980 257.6980 
128 3072 343.5974 343.5974 515.3961 515.3961 
256 6144 592.4093 592.4093 687.1948 687.1948 
512 12288 827.9455 827.9455 945.6809 945.6809 
1000 24000 945.1953 1073.7418 1043.1430 1001.6248 
2000 48000 1104.6727 865.9208 675.5926 657.9300 
4000 96000 752.9746 653.1276 827.6530 738.8132 
8000 192000 703.1708 699.5061 824.2645 790.2908 
16000 384000 695.6539 731.4318 825.9552 818.8168 
32000 768000 691.9554 742.0469 826.6972 837.4433 
64000 1536000 643.6336 752.9086 810.9833 777.6981 
128000 3072000 405.7023 342.3035 156.7506 155.3474 
256000 6144000 172.1947 165.6753 117.8954 126.6594 
512000 12288000 166.7074 167.7591 116.9595 127.5880 
1000000 24000000 166.6649 174.7163 195.2601 196.7825 
2000000 48000000 166.1500 173.5810 188.1534 190.1132 
4000000 96000000 165.8745 169.4696 188.3321 189.3678 
8000000 192000000 165.7877 173.0499 187.9874 191.3445 
16000000 384000000 163.6958 169.4650 189.9488 191.9013 
32000000 768000000 163.4732 170.4292 116.6685 129.1620 
Table G.3 STREAM Benchmark for Helix, Processor 1 (M2) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 6.3913 6.3913 9.8208 9.5870 
4 96 12.7826 12.7826 19.6416 19.6416 
8 192 26.1888 26.1888 38.3479 38.3479 
16 384 51.1306 51.1306 78.5665 76.6958 
32 768 102.2611 104.7553 153.3917 153.3917 
64 1536 204.5223 204.5223 257.6980 257.6980 
128 3072 343.5974 343.5974 444.3070 444.3070 
256 6144 582.3684 582.3684 687.1948 687.1948 
512 12288 827.9455 818.0890 945.6809 945.6809 
1000 24000 888.8591 1073.7418 1094.1663 1043.1430 
2000 48000 1069.4640 888.8591 675.5926 657.9300 
4000 96000 761.5190 653.1276 827.6530 744.2758 
8000 192000 703.6316 699.5061 820.4853 796.5444 
16000 384000 695.6539 727.2210 825.7435 815.2937 
32000 768000 693.8558 743.0739 827.5467 836.5733 
64000 1536000 639.1796 745.8483 806.7686 772.6614 
128000 3072000 405.7789 341.3852 152.4263 151.3449 
256000 6144000 171.4310 164.6433 116.9261 125.3340 
512000 12288000 165.6789 166.8159 116.3978 125.8500 
1000000 24000000 165.1187 169.3497 185.4685 188.4675 
2000000 48000000 164.7192 170.4359 186.8017 188.8663 
4000000 96000000 164.4183 166.5777 186.8762 188.2434 
8000000 192000000 163.7323 169.0818 186.1327 189.4491 
16000000 384000000 161.1851 165.2433 188.1313 189.7767 
32000000 768000000 160.8151 165.9589 115.9834 132.1660 
Table G.4 STREAM Benchmark for Helix, Processor 3 (M3) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 6.5472 6.3913 9.5870 9.5870 
4 96 12.7826 13.0944 19.6416 19.1740 
8 192 16.0260 16.0260 24.0390 24.0390 
16 384 52.3776 52.3776 78.5665 78.5665 
32 768 102.2611 104.7553 153.3917 153.3917 
64 1536 204.5223 204.5223 257.6980 257.6980 
128 3072 343.5974 343.5974 444.3070 444.3070 
256 6144 373.4754 373.4754 515.3961 515.3961 
512 12288 827.9455 818.0890 945.6809 945.6809 
1000 24000 945.1953 1073.7418 1094.1663 1043.1430 
2000 48000 1069.4640 888.8591 675.5926 657.9300 
4000 96000 718.7027 639.8938 813.4408 744.2758 
8000 192000 680.8762 684.3479 800.1057 771.3663 
16000 384000 691.8440 731.4318 822.3706 811.8008 
32000 768000 614.7084 736.7011 764.9550 762.6910 
64000 1536000 526.7636 339.5231 156.7659 152.3813 
128000 3072000 170.2123 162.5912 185.2397 165.2769 
256000 6144000 166.5781 165.4615 116.9417 125.5441 
512000 12288000 162.6947 165.9510 116.0351 125.9184 
1000000 24000000 164.6090 170.4612 187.4853 190.0478 
2000000 48000000 164.1177 170.9666 186.8482 187.9426 
4000000 96000000 164.1164 166.2843 188.5159 189.4668 
8000000 192000000 164.1486 169.8266 187.0098 189.6364 
16000000 384000000 161.6765 166.3872 188.7864 190.2729 
32000000 768000000 161.3087 167.1077 115.9890 132.3012 
Table G.5 STREAM Benchmark for Chronus (M4) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 6.3913 6.3913 9.5870 9.8208 
4 96 12.7826 12.7826 19.1740 19.1740 
8 192 25.5653 21.4748 39.2832 32.2123 
16 384 51.1306 51.1306 76.6958 64.4245 
32 768 85.8993 85.8993 128.8490 128.8490 
64 1536 171.7987 171.7987 257.6980 257.6980 
128 3072 343.5974 296.2046 444.3070 444.3070 
256 6144 512.8319 512.8319 687.1948 687.1948 
512 12288 818.0890 818.0890 945.6809 945.6809 
1000 24000 945.1953 1065.2201 1094.1663 1094.1663 
2000 48000 1065.2201 916.1620 696.6318 656.8567 
4000 96000 780.3356 660.3578 842.3707 756.1562 
8000 192000 715.3510 715.3510 834.9470 806.9202 
16000 384000 711.3228 746.4316 845.9101 836.6819 
32000 768000 604.4993 756.2894 752.8866 733.5137 
64000 1536000 494.2141 283.5804 103.1361 98.4490 
128000 3072000 120.1524 102.0788 113.5759 104.3975 
256000 6144000 108.3056 101.3060 97.6136 99.2167 
512000 12288000 103.1452 101.8335 99.1176 100.3691 
1000000 24000000 101.3633 102.3889 110.8714 112.1417 
2000000 48000000 100.0194 102.0070 111.1857 111.9421 
4000000 96000000 99.4816 100.5170 110.5358 111.7675 
8000000 192000000 99.3848 100.3319 111.1048 111.0826 
Table G.6 STREAM Benchmark for Tajar (M5) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 10.7374 10.7374 16.1061 16.1061 
4 96 21.4748 21.4748 32.2123 32.2123 
8 192 42.9497 42.9497 64.4245 64.4245 
16 384 85.8993 85.8993 128.8490 128.8490 
32 768 171.7987 171.7987 257.6980 257.6980 
64 1536 343.5974 343.5974 390.4516 390.4516 
128 3072 520.6021 520.6021 780.9031 780.9031 
256 6144 838.0424 818.0890 1030.7922 1030.7922 
512 12288 1184.8186 1374.3895 1538.4957 1538.4957 
1000 24000 1458.8883 1597.8301 1597.8301 1597.8301 
2000 48000 1525.2015 1104.6727 906.8765 842.3707 
4000 96000 1102.4043 820.9035 1054.0659 940.7785 
8000 192000 962.9972 883.0114 1032.4441 994.8195 
16000 384000 951.8988 920.8763 1040.7837 1040.7837 
32000 768000 639.9892 901.5465 786.0482 825.7435 
64000 1536000 356.0595 140.0061 70.7119 70.4393 
128000 3072000 81.9822 69.9022 71.1407 69.8721 
256000 6144000 70.7133 69.3556 70.4207 69.9111 
512000 12288000 69.9238 69.5522 69.6488 70.1590 
1000000 24000000 68.6831 69.6227 70.2346 70.5689 
2000000 48000000 68.1704 69.5565 69.6755 69.7249 
4000000 96000000 68.1302 68.8822 69.8274 70.1060 
8000000 192000000 68.2603 68.8473 69.5424 69.7911 
16000000 384000000 56.9420 8.2763 10.9330 10.6785 
Table G.7 STREAM Benchmark for Hermes (M6) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 8.1344 6.5472 12.2016 9.5870 
4 96 16.2688 13.0944 24.4032 19.1740 
8 192 32.5376 32.5376 48.8064 47.3710 
16 384 51.1306 65.0753 76.6958 76.6958 
32 768 102.2611 126.3226 153.3917 153.3917 
64 1536 204.5223 204.5223 306.7834 306.7834 
128 3072 343.5974 343.5974 444.3070 444.3070 
256 6144 512.8319 512.8319 687.1948 757.9354 
512 12288 818.0890 818.0890 945.6809 945.6809 
1000 24000 945.1953 1001.6248 1043.1430 1043.1430 
2000 48000 1001.6248 888.8591 716.4647 593.0091 
4000 96000 711.0873 603.9043 768.4221 690.6573 
8000 192000 653.1276 643.3444 759.0069 735.7756 
16000 384000 643.1517 677.2260 764.9550 758.8281 
32000 768000 583.7933 680.0138 711.7944 715.1128 
64000 1536000 558.9494 380.2370 294.7613 264.6042 
128000 3072000 254.7279 271.2582 272.8263 264.9661 
256000 6144000 235.1860 256.2572 158.8092 180.3932 
512000 12288000 231.9042 258.3167 158.8028 181.2790 
1000000 24000000 267.9034 276.9797 282.8487 281.2645 
2000000 48000000 268.9055 277.0681 282.8854 280.3740 
4000000 96000000 269.9613 273.7558 284.1236 280.0409 
8000000 192000000 269.8925 276.1197 282.0949 280.4479 
16000000 384000000 237.9872 273.3185 274.5120 272.6947 
32000000 768000000 233.5659 252.5599 153.3088 170.9795 
Table G.8 STREAM Benchmark for DC (M7) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 10.7374 8.1344 12.2016 16.1061 
4 96 21.4748 21.4748 32.2123 30.9733 
8 192 42.9497 42.9497 64.4245 64.4245 
16 384 63.1613 65.0753 123.8933 94.7419 
32 768 130.1505 130.1505 195.2258 195.2258 
64 1536 260.3010 260.3010 390.4516 378.9677 
128 3072 520.6021 520.6021 613.5668 628.5318 
256 6144 818.0890 818.0890 1030.7922 1030.7922 
512 12288 1025.6638 1184.8186 1227.1335 1241.9183 
1000 24000 1231.3553 1458.8883 1417.7929 1333.2887 
2000 48000 1335.4998 1231.3553 800.5033 749.8197 
4000 96000 1014.8789 810.9833 1032.4441 950.7749 
8000 192000 928.0396 883.0114 1037.7659 1005.3762 
16000 384000 895.1578 927.2382 1046.1921 1052.0005 
32000 768000 817.9332 935.9266 977.1653 996.0499 
64000 1536000 575.5786 536.1337 363.4669 366.7569 
128000 3072000 318.2577 337.7343 362.3935 351.9311 
256000 6144000 294.2111 319.3017 195.7062 211.8766 
512000 12288000 291.5623 320.0751 195.3297 212.6945 
1000000 24000000 329.8965 338.8678 375.4279 377.2994 
2000000 48000000 326.1777 338.7857 371.9718 371.5747 
4000000 96000000 326.5440 333.1355 376.2833 376.2994 
Table G.9 STREAM Benchmark for Exiguus (M8) 
Array Size Memory Size Copy Scale Add Triad 
(Elements) (Bytes) (MB/s) (MB/s) (MB/s) (MB/s) 
2 48 10.7374 10.7374 16.1061 16.1061 
4 96 21.4748 16.2688 24.4032 32.2123 
8 192 42.9497 32.5376 48.8064 48.8064 
16 384 65.0753 65.0753 97.6129 97.6129 
32 768 130.1505 130.1505 195.2258 195.2258 
64 1536 260.3010 260.3010 306.7834 306.7834 
128 3072 520.6021 409.0445 628.5318 613.5668 
256 6144 687.1948 687.1948 888.6139 888.6139 
512 12288 1025.6638 1025.6638 1227.1335 1241.9183 
1000 24000 1231.3553 1342.1773 1417.7929 1417.7929 
2000 48000 1335.4998 1231.3553 774.3330 749.8197 
4000 96000 1001.6248 831.0695 1021.9624 950.7749 
8000 192000 914.6012 901.5465 1032.4441 989.9279 
16000 384000 904.5845 931.2592 1040.7837 1046.5320 
32000 768000 511.4883 906.3024 728.7016 691.3243 
64000 1536000 397.0572 157.8800 71.7658 71.4716 
128000 3072000 82.0841 68.8796 70.9535 69.8785 
256000 6144000 68.8670 68.2520 63.2795 65.6459 
512000 12288000 68.5127 68.3236 63.9610 65.2018 
1000000 24000000 69.2764 68.8279 70.2737 70.4316 
2000000 48000000 68.6639 68.4501 69.8821 70.0484 
4000000 96000000 68.5972 68.0162 69.7830 69.9155 
[Figure panels (a) to (d): STREAM Benchmark, one panel per kernel; x-axis: Problem Size in Bytes] 
Figure G.1 STREAM Benchmark: System Bandwidth using (a) Copy Kernel 
(b) Scale Kernel (c) Sum Kernel (d) Triad Kernel 
APPENDIX H More Modell Results 
Table H.1 NetQUIPS Results for Integer Applications 
Id   Correlation  Linear Fit  Max Rel. Err  Rank Corr. 
I1   0.9804       0.0010      0.0383        1.0000 
I2   0.9802       0.0004      0.0402        1.0000 
I3   0.9781       0.0005      0.0411        0.9762 
I4   0.9794       0.0005      0.0407        1.0000 
I5   0.9214       0.0002      0.0797        0.9524 
I6   0.9835       0.0009      0.0417        0.8571 
I7   0.9827       0.0009      0.0417        0.8571 
I8   0.9832       0.0008      0.0408        0.8571 
I9   0.9202       0.0100      0.0848        0.8810 
I10  0.9290       0.0360      0.0845        0.9048 
I11  0.8675       0.0112      0.1053        0.8810 
I12  0.9480       0.0185      0.0712        0.9048 
I13  0.9887       0.0002      0.0317        0.9762 
I14  0.9791       0.0278      0.0460        0.8333 
I15  0.9814       0.0003      0.0466        0.8571 
I16  0.3645       3.2078      0.2404        0.5000 
I17  0.9598       0.0002      0.0532        0.8571 
Table H.2 NetQUIPS Results for Floating-Point Applications 
Id Correlation Linear Fit Max Rel. Err Rank Corr. 
F1 0.7407 0.0002 0.3604 0.1905 
F2 0.6984 0.0002 0.4545 0.0952 
F3 0.6989 0.0002 0.4543 0.0952 
F4 0.6944 0.0002 0.4699 0.2381 
F5 0.6220 0.0001 0.1793 0.5952 
F6 0.7888 0.0003 0.1449 0.8571 
F7 0.8349 0.0003 0.2219 0.3333 
F8 0.7081 0.0002 0.2912 -0.0238 
F9 0.9230 0.0001 0.0917 0.6429 
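The Correlation and Rank Corr. columns of Tables H.1 and H.2 can be read as a Pearson correlation and a Spearman rank correlation between predicted and measured values across the machines. The rank values in the tables (for example 0.9762, 0.9524, and 0.8571) are consistent with Spearman's formula for n = 8 data points, one per machine M1 to M8. The Python sketch below shows both statistics under those assumptions. It is not the code used to produce the tables; the Linear Fit and Max Rel. Err columns are not reproduced because their exact definitions are not given in this appendix, and the sample data in the usage lines is made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank of each value, 1 for the smallest; assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Made-up values for eight machines; two neighbours swap places in rank.
    predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    measured = [1.1, 2.2, 4.1, 3.9, 5.3, 6.1, 7.4, 8.2]
    print(round(pearson(predicted, measured), 4))
    print(round(spearman(predicted, measured), 4))
```

With eight machines, swapping one adjacent pair in the ordering gives a sum of squared rank differences of 2 and therefore a rank correlation of 1 - 12/504, which rounds to 0.9762.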
[Figure panels: "Correlation of Inst. Quips and Measured Time" for I1 to I6; x-axis: Memory Points where Inst. Quips was interpolated (LOG)] 
Figure H.1 Correlation of Instantaneous QUIPS and Measured Time for 
Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6 
[Figure panels: "Correlation of Inst. Quips and Measured Time" for I7 to I12; x-axis: Memory Points where Inst. Quips was interpolated (LOG)] 
Figure H.2 Correlation of Instantaneous QUIPS and Measured Time for 
Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12 
[Figure panels: "Correlation of Inst. Quips and Measured Time" for I13 to I17; x-axis: Memory Points where Inst. Quips was interpolated (LOG)] 
Figure H.3 Correlation of Instantaneous QUIPS and Measured Time for 
Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17 
215 
Correlation ol Inst. Quips and Measured Time lor F1 Correlation ol Inst. Quips and Measured Time lor F2 
f 
f ^ 
: AJ 
. YX \ 
(a) 
Memory Points where Inst. Quips was interpolated (LOG) 
Correlation of Inst. Quips and Measured Time lor F3 
(b) 
Memory Pomls wnere Inst. Quips was interpolated (LOO) 
Correlation of Insl. Quips and Measured Time for F4 
Memory Points where Inst. Quips was interpolated (LOG) 
0.9 
r
 
i 
0. 1 ' 
S 1 
l ° g  fl !.. V 
105 S 'V l s "  
0.4 A 
A 
0.3 
-
02 
0.2 
" \ 0 1 
2 10* 10* 10* 10s 107 10e 
Memory Points where Inst. Quips was interpolated (LOG) 
O1 10s 104 10' 10* 10' 1 
Memory Points where Inst. Quips was interpolated (LOG) 
(c) (d) 
Correlation of Inst. Quips and Measured T melorFS Correlation ol Inst Quips and Measured ' me for F6 
1 0, " 
I 04 I I °'4 
0 
1 
T' 
1 f 
\ 
I 0.2 
3 
|0.6 
5 \ 
0 0.5 
\ / 
^>.2 
I 
0.4 
\ j  
M (0 
Figure H.4 Correlation of Instantaneous QUIPS and Measured Time for 
Applications (a) FI (b) F2 (c) F3 (d) F4 (e) F5 (f) F6 
216 
Correlation ol Insl. Quips and Measured Time for F7 Correlation of Insl. Quips and Measured Time lor F8 
,/ 
ilj 
V," 1 ! 
(a) 
10 10 10 10 10 10 
Memory Points where Inst Quips was interpolated (LOG) Memory Pcnrls where Inst Quips was interpolated (LOG) 
(b) 
Correlation ol Insl. Quips and Measured Time lor F9 
3 
--A 
(c) 
Figure H.5 Correlation of Instantaneous QUIPS and Measured Time for 
Applications (a) F7 (b) F8 (c) F9 
Figure H.6 Rank Correlation of Instantaneous QUIPS and Measured Time
for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6
Figure H.7 Rank Correlation of Instantaneous QUIPS and Measured Time
for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12
Figure H.8 Rank Correlation of Instantaneous QUIPS and Measured Time
for Signature for Applications (a) I13 (b) I14 (c) I15 (d) I16
(e) I17
Figure H.9 Rank Correlation of Instantaneous QUIPS and Measured Time
for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6
Figure H.10 Rank Correlation of Instantaneous QUIPS and Measured
Time for Applications (a) F7 (b) F8 (c) F9
[Figures H.6 to H.10: each panel is titled "Rank Correlation of Inst. Quips
and Measured Time for <application>"; the x-axis is "Memory Points where
Inst. Quips was interpolated (LOG)".]
Figure H.11 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and
Measured Time for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e)
I5 (f) I6
Figure H.12 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and
Measured Time for Applications (a) I7 (b) I8 (c) I9 (d) I10
(e) I11 (f) I12
Figure H.13 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and
Measured Time for Applications (a) I13 (b) I14 (c) I15 (d)
I16 (e) I17
Figure H.14 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and
Measured Time for Applications (a) F1 (b) F2 (c) F3 (d) F4
(e) F5 (f) F6
Figure H.15 Best Fit Between Instantaneous QUIPS (or NetQUIPS) and
Measured Time for Applications (a) F7 (b) F8 (c) F9
[Figures H.11 to H.15: each panel is titled "Best Linear Fit using Inst.
Quips or NetQUIPS for <application>"; the x-axis label has the form
"Instant Quips at <value>". Values legible in the scan with a clear panel
assignment: I2 4381.0959, I5 444.7934, I11 467545.9898, I12 444.7934,
I13 4381.0959, I17 4381.0959, F1 3389600.6605, F2 59146505.8048,
F3 59146505.8048, F4 20730489.4602.]
APPENDIX I More Model2 Results 
Table I.1 Search Method (function of time) Results for Integer Applications
Id   Correlation  Linear Fit  Max Rel. Err  Rank Corr.
I1   0.9952       0.0085      0.0319        0.9762
I2   0.9951       0.0039      0.0306        0.9762
I3   0.9940       0.0043      0.0267        1.0000
I4   0.9948       0.0042      0.0308        0.9762
I5   0.9801       0.0017      0.0439        0.9762
I6   0.9928       0.0077      0.0254        0.8333
I7   0.9924       0.0085      0.0266        0.8333
I8   0.9927       0.0074      0.0269        0.8333
I9   0.9755       0.1086      0.0486        0.9048
I10  0.9773       0.3940      0.0495        0.9286
I11  0.9628       0.1274      0.0626        0.9048
I12  0.9854       0.1997      0.0388        0.9286
I13  0.9985       0.0017      0.0210        1.0000
I14  0.9943       0.2475      0.0311        0.8571
I15  0.9929       0.0026      0.0331        0.8333
I16  0.9016       48.3001     0.1087        0.9524
I17  0.9831       0.0017      0.0530        0.8333
Table I.2 Search Method (function of time) Results for Floating-Point Applications
Id   Correlation  Linear Fit  Max Rel. Err  Rank Corr.
F1   0.9990       0.0058      0.0427        0.9762
F2   0.9940       0.0073      0.1366        0.9048
F3   0.9941       0.0082      0.1377        0.9048
F4   0.9996       0.0053      0.0999        1.0000
F5   0.9997       0.0013      0.0091        0.9286
F6   0.9895       0.0043      0.1186        0.9762
F7   0.9997       0.0054      0.0266        1.0000
F8   0.9860       0.0056      0.2871        0.8571
F9   0.9496       0.0010      0.9698        0.7143
F10  0.9949       0.0064      0.0684        0.9762
F11  0.9997       0.0033      0.3352        1.0000
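The four columns reported in the tables of this appendix can be illustrated with a short sketch. This appendix does not restate how each column is defined, so the following are assumptions made only for illustration: "Correlation" is taken to be the Pearson coefficient, "Linear Fit" the slope of a least-squares line, "Max Rel. Err" the largest value of |projected - measured| / |measured|, and "Rank Corr." the Spearman rank correlation. All function names and the sample numbers are hypothetical and are not taken from the thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_slope(x, y):
    """Slope of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def max_relative_error(projected, measured):
    """Largest |projected - measured| / |measured| over all systems."""
    return max(abs(p - m) / abs(m) for p, m in zip(projected, measured))


if __name__ == "__main__":
    # Hypothetical run times in seconds on four systems (not thesis data).
    measured = [120.0, 95.0, 60.0, 42.0]
    projected = [118.0, 97.5, 58.0, 44.0]
    print("correlation :", pearson(projected, measured))
    print("fit slope   :", fit_slope(projected, measured))
    print("max rel err :", max_relative_error(projected, measured))
    print("rank corr   :", spearman(projected, measured))
```

A rank correlation of 1.0000 in this sketch means the projected times order the systems exactly as the measured times do, even when the individual projections are off by a few percent.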
Figure I.1 SEARCH Result: Application Signature as a function of time
for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6
Figure I.2 SEARCH Result: Application Signature as a function of time
for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12
Figure I.3 SEARCH Result: Application Signature as a function of time
for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17
Figure I.4 SEARCH Result: Application Signature as a function of time
for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6
Figure I.5 SEARCH Result: Application Signature as a function of time
for Applications (a) F7 (b) F8 (c) F9
[Figures I.1 to I.5: each panel is titled "Search Method for <benchmark>
<input> (<Id>)"; the x-axis is "Time (Log)". Panel titles legible in the
scan: 099.go null.in (I1), 099.go 5stone21.in (I3), 099.go 9stone21.in (I4),
147.vortex vortex.in (I5), 132.ijpeg penguin.ppm (I6), 132.ijpeg
specmun.ppm (I7), 132.ijpeg vigo.ppm (I8), 126.gcc 1expr.i (I9), 126.gcc
1recog.i (I10), 130.li (I17), 102.swim (F3), 110.applu applu.in (F4),
145.fpppp natoms.in (F5), 141.apsi apsi.in (F6), 146.wave5 wave5.in (F7),
107.mgrid mgrid.in (F8), 125.turb3d turb3d.in (F9).]
Figure I.6 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of time for Applications (a) I1
(b) I2 (c) I3 (d) I4 (e) I5 (f) I6
Figure I.7 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of time for Applications (a) I7
(b) I8 (c) I9 (d) I10 (e) I11 (f) I12
Figure I.8 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of time for Applications (a) I13
(b) I14 (c) I15 (d) I16 (e) I17
Figure I.9 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of time for Applications (a) F1
(b) F2 (c) F3 (d) F4 (e) F5 (f) F6
Figure I.10 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of time for Applications (a)
F7 (b) F8 (c) F9
[Figures I.6 to I.10: panels are titled "Best Linear Fit using Search
Method for <application>"; the legible axis label is "Projected Time (sec)".]
Figure I.11 SEARCH Result: Application Signature as a function of
problem size for Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6
Figure I.12 SEARCH Result: Application Signature as a function of
problem size for Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12
Figure I.13 SEARCH Result: Application Signature as a function of
problem size for Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17
Figure I.14 SEARCH Result: Application Signature as a function of
problem size for Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6
Figure I.15 SEARCH Result: Application Signature as a function of
problem size for Applications (a) F7 (b) F8 (c) F9
[Figures I.11 to I.15: each panel is titled "Search Method for <benchmark>
<input> (<Id>)"; the x-axis is "Memory Points where Inst. Quips was
interpolated (LOG)".]
Figure I.16 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of problem size for
Applications (a) I1 (b) I2 (c) I3 (d) I4 (e) I5 (f) I6
Figure I.17 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of problem size for
Applications (a) I7 (b) I8 (c) I9 (d) I10 (e) I11 (f) I12
Figure I.18 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of problem size for
Applications (a) I13 (b) I14 (c) I15 (d) I16 (e) I17
Figure I.19 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of problem size for
Applications (a) F1 (b) F2 (c) F3 (d) F4 (e) F5 (f) F6
Figure I.20 SEARCH Result: Projected Time Vs Measured Time for
Application Signature as a function of problem size for
Applications (a) F7 (b) F8 (c) F9
[Figures I.16 to I.20: the legible axis label is "Projected Time (sec)".]
249 
Bibliography 
[Int, ] Intel IA-64 Architecture Software Developer's Manual. Intel Corporation. Volume 4: 
Itanium Processor Programmer's Guide, Document number 245320-002, rev 1.1, 2000. 
[Aburto, ] Aburto, J. A. Benchmarks developed and maintained at the Naval Command Con­
trol and Ocean Surveillance mirrored at the NIST Web site, ftp site: ftp.nosc.mil in directory 
pub/aburto. This is mirrored at the NIST Web site, (date retrieved: November 19, 2000). 
[Agarwal et al., 1988] Agarwal, A., Hennessy, J., and Horowitz, M. (1988). Cache performance 
of operating system and multiprogramming. ACM Transactions on Computer Systems, 
6(4):393-431. 
[Ahuja et al., 1995] Ahuja, P. S., Clark, D. W., and Rogers, A. (1995). The performance 
impact of incomplete bypassing in processor pipelines. In Proceedings: MICRO-28, Intl. 
Symposium on Microarchitecture. 
[Amdahl, 1967] Amdahl, G. M. (1967). Validity of the single processor approach to achieving 
large scale computing capabilities. AFIPS Proc. of the SJCC. 31:483-485. 
[Amdahl, 1988] Amdahl, G. M. (1988). Limits of expectation. The International Journal of 
Supercomputer Applications, 2(l):88-94. 
[Bailey, 1991] Bailey, D. H. (1991). Twelve ways to fool the masses when giving performance 
results on parallel supercomputers. Technical Report RNR-91-020, NASA Ames Research 
Center, Moffett Field, CA 94035. 
[Bailey et al., 1991a] Bailey, D. H., Barszcz, E., Barton, J. T., Browning, D. S., Carter, R. L., 
Dagum, D., Fatoohi, R. A., Frederickson, P. O., Lasinski, T. A., Schreiber, R. S., Simon, 
250 
H. D., Venkatakrishnan, V., and Weeratunga, S. K. (1991a). The NAS parallel benchmarks. 
The International Journal of Supercomputer Applications, 5(3):63-73. 
[Bailey et al., 1991b] Bailey, D. H., Barton, J., Lasinski, T., and Simon, H. (1991b). The NAS 
parallel benchmarks. Technical Report RNR-91-002, NASA Ames Research Center, Moffett 
Field, CA 94035. 
[Baker and Buyya, 1999] Baker, M. and Bnyya, R. (1999). Cluster computing: the commodity 
supercomputer. Software Practice and Experience, 29(6):551-576. 
[Baker et al., 1999] Baker, M., Buyya, R., and Hyde, D. (1999). Technical activities forum: 
Cluster computing: A high-performance contender. Computer, 32(7):79-80. 
[Balbo and Denning, 1979] Balbo, G. and Denning, P. J. (1979). Performance of Com­
puter Systems, chapter Homogeneous approximations of general queueing networks. North-
Holland, Amsterdam. 
[Barak and Laâdam, 1998] Barak, A. and Laâdam, O. (1998). The MOSIX multicomputer 
operating system for high performance cluster computing. Journal of Future Generation 
Computer Systems, 13(4-5) :361-372. 
[Bard, 1979] Bard, Y. (1979). Some extensions to multiclass queueing network analysis. In 
Arato, M., editor, Performance of Computer Systems. North-Holland, The Netherlands. 
[Barkley and Schimmel, 1988] Barkley, R. E. and Schimmel, C. F. (1988). A performance 
study of the unix system V fork system call using casper. AT&T Technical Journal, 
67(5):100-109. 
[Baskett and Hennessy, 1993] Baskett, F. and Hennessy, J. L. (1993). Microprocessors : From 
desktops to supercomputers. Science, 261(5123):864--?? 
[Berry et al., 1994] Berry, M. W., Dongarra, J. J., Larosei, B. H., and Letsche, T. A. (1994). 
PDS: A performance database server. Scientific Programming, 3(2): 147-156. 
251 
[Bhuyan et al., 1989] Bhuyan, L. N., Yang, Q., and Agrawal, D. P. (1989). Performance of 
multiprocessor interconnection networks. Computer, 22(2):25-37. 
[Bloomfield, 2000] Bloomfield, P. (2000). Fourier Analysis of Time Series: An Introduction. 
John Wiley & Sons, New York, 2nd edition. 
[Blume and Eigenmann, 93] Blume, W. and Eigenmann, R. (93). Performance analysis of 
parallelizing compilers on the perfect benchmarks programs. Technical Report TR-1218, 
Center for Supercomputing Research and Development (CSRD). 
[Blyler, 1998] Blyler, J. (1998). What's Size Got to Do with It? IEEE Press, 445 Hoes Lane, 
PO Box 1331, Piscataway, NJ 08855-1331. 
[Bode and Dongarra, 1997] Bode, A. and Dongarra, J. (1997). Performance evaluation and 
prediction. Lecture Notes in Computer Science, 1300:969-970. 
[Bray, 1993] Bray, B. K. (1993). Specialized caches to improve data access performance. Tech­
nical Report CSL-TR-93-574, Computer Systems Laboratory, Stanford University. 
[Carmona and Rice, 1991a] Carmona, E. A. and Rice, M. D. (1991a). Modeling the serial and 
parallel fractions of a parallel algorithm. Journal of Parallel and Distributed Computing, 
13(3):286-298. 
[Carmona and Rice, 1991b] Carmona, E. A. and Rice, M. D. (1991b). Modeling the serial and 
parallel fractions of a parallel algorithm. Journal of Parallel and Distributed Computing, 
13(3)=286-298. 
[Chen and Patterson, 1993] Chen, P. M. and Patterson, D. A. (1993). Storage performance-
metrics and benchmarks. In Proceedings of IEEE, volume 81. 
[Chen and Patterson, 1994] Chen, P. M. and Patterson, D. A. (1994). A new approach to I/O 
performance evaluation - self-scaling I/O benchmarks, predicted I/O performance. ACM 
Transactions on Computer Systems, 12, 4:309-339. 
[Chih-Ming and Lu, ] Chih-Ming, C. and Lu, S.-L. Performance issues on micropipelines. 
252 
[Colwell et al., 1988] Colwell, R. R, Gehringer, E. F., and Jensen, E. D. (1988). Performance 
effects of architectural complexity in the intel 432. ACM Transactions on Computer Systems, 
6(3):296-339. 
[Crandall et al., 1995] Crandall, P. E., Aydt, R. A., Chien, A. A., and Reed, D. A. (1995). 
Input/output characteristics of scalable parallel applications. In Proceedings of Supercom-
puting '95, San Diego, CA. IEEE Computer Society Press. 
[Cs, ] Cs, C. D. The simplescalar tool set, version 2.0. 
[Culler et al., 1999] Culler, D., Singh, J. P., and Gupta, A. (1999). Parallel Computer Archi­
tecture. Morgan Kaufman. 
[Culler et al., 1993] Culler, D. E., Karp, R. M., Patterson, D. A., Sahay, A., Schauser, K. E., 
Santos, E., Subramonian, R., and von Eicken, T. (1993). LogP: towards a realistic model of 
parallel computation. ACM SIGPLAN Notices, 28(7):1-12. 
[Curnow and Wichmann, 1976] Curnow, H. J. and Wichmann, B. A. (1976). A synthetic 
benchmark. The Computer Journal, 19(l):43-49. 
[David Turner, Quinn Snell, Armin Mikler, 2003] David Turner, Quinn Snell, Armin Mikler, 
J. G. (2003). NETpipe Webpage. http://www.scl.ameslab.gov/netpipe/; (date re­
trieved: November 19, 2003). 
[del Rosario and Choudhary, 1994] del Rosario, J. M. and Choudhary, A. N. (1994). High-
performance I/O for massively parallel computers: Problems and prospects. Computer, 
27(3):59-68. 
[Diane et al., 1991] Diane, R., Carter, M., and Gustafson, J. (1991). Performance visualization 
of SLALOM. In Proceedings of the Sixth Distributed Memory Computing Conference, New 
York. IEEE Computer Society. 
[Ditzel et al., 1990] Ditzel, D. R., Hennessy, J. L., Rudin, B., Smith, A. J., Squires, S. L., 
Zalcstein, Z., and Hill, M. D. (1990). Big science versus little science - do you have to build 
253 
it? In Baer, J.-L. and Snyder, L., editors, Proceedings of the 17th Annual International 
Symposium on Computer Architecture, pages 136-137, Seattle, WA. IEEE Computer Society 
Press. 
[Diwan et al., 1993] Diwan, A., Tarditi, D., and Moss, E. (1993). Memory subsystem perfor­
mance of programs with intensive heap allocation. CS 93-227, Carnegie Mellon University. 
[Dongarra, 1984] Dongarra, J. J. (1984). Performance of various computers using standard 
linear equations software in a Fortran environment. ACM SIGNUM Newsletter, 19(1):23-
26. 
[Dongarra, 1987] Dongarra, J. J. (1987). The LINPACK benchmark: An explanation. In 
Houstis, E. N.; Papatheodorou, T. S.; Polychronopoulos, C. D., editor, Proceedings of the 1st 
International Conference on Super computing, volume 297 of LNCS, pages 456-474, Athens, 
Greece. Springer. 
[Dongarra, 1992] Dongarra, J. J. (1992). Performance of various computers using standard 
linear equations software. Computer architecture news, 20(3):22-44. 
[Dongarra, 94] Dongarra, J. J. (94). The complete Unpack report. Technical report, University 
of Tennesse. 
[Dongarra and Gentzsch, 1993] Dongarra, J. J. and Gentzsch, W. (1993). Computer Bench­
marks. North Holland, Amsterdam. 
[Dongarra and Hey, 1995] Dongarra, J. J. and Hey, T. (1995). The ParkBench benchmark 
collection. Supercomputer, ll(2-3):94-114. 
[Dongarra et al., 1996] Dongarra, J. J., Hey, T., and Strohmaier, E. (1996). PARKBENCH: 
methodology, relations and results. In Liddell, H. M., Colbrook, A., Hertzberger, B., and 
Sloot, P., editors, High-performance computing and networking: international conference 
and exhibition, HPCN EUROPE 1966, Brussels, Belgium, April 15-19, 1996: proceedings, 
volume 1067 of Lecture Notes in Computer Science, pages 770-777, Berlin, Germany / 
Heidelberg, Germany / London, UK / etc. Springer-Verlag. 
254 
[Dongarra et al., 1979] Dongarra, J. J., Moler, C. B., Bunch, J. R., and Stewart, G. W. (1979). 
LINPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 
USA. 
[Dowd and Severance, 1998] Dowd, K. and Severance, C. (1998). High Performance Comput­
ing. OReilly, 101 Morris Street, Sevastopol, CA, 95472, 2nd edition. 
[Dujmovic, 1999] Dujmovic, J. (1999). Universal Benchmark Suites. In Proceedings of 7th Int. 
Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems 
Conference, pages 197-205. 
[Dujmovic, 2001] Dujmovic, J. (2001). Universal Benchmark Suites - A Quantative Approach 
to Benchmark Design. In Eigenmann, R., editor, Performance Evaluation and Benchmarking 
with Realistic Applications, page 304. MIT Press, Cambridge, CA, USA. 
[Eigenmann, 2001] Eigenmann, R., editor (2001). Performance Evaluation and Benchmarking 
with Realistic Applications. MIT Press, Cambridge, CA, USA. 
[Fagerstorm and Kuszmaul, 00] Fagerstorm, F. C. and Kuszmaul, C. L. (00). FTIO bench­
mark. Technical Report RNR-91-020, NAS Applied Research Branch (RNR). 
[Foster et al., 2001] Foster, I., Kesselman, C., and Tuecke, S. (2001). The anatomy of the Grid: 
Enabling scalable virtual organization. The International Journal of High Performance 
Computing Applications, 15(3):200-222. 
[Fredericks and Holtzman, 1988] Fredericks, A. A. and Holtzman, J. M. (1988). An introduc­
tion to performance modeling and analysis. AT&T Technical Journal, 67(5):2-3. 
[Ganger et al., 1994] Ganger, G. R., Worthington, B. L., Hou, R. Y., and Patt, Y. N. (1994). 
Disk arrays: High-performance, high-reliability storage subsystems. Computer, 27(3):30-37. 
[Giorgi et al., 1997a] Giorgi, R., Prete, C. A., and Prina, G. (1997a). An approach for in­
vestigating design and tuning performance of embedded systems. In Proc. Int'l Conf. on 
Innovation and Quality in Education for Electrical and Information Engineering. 
255 
[Giorgi et al., 1997b] Giorgi, R., Prete, C. A., and Prina, G. (1997b). Cache memory design for 
embedded systems based on program locality analysis. In Proc. Int'l Conf. on Microelectronic 
System Education. 
[Giorgi et al., 1996] Giorgi, R., Prete, C. A., Prina, G., and Ricciardi, L. (1996). A hybrid 
approach to trace generation for performance evaluation of shared-bus multiprocessors. In 
Proc. 22nd EuroMicro Intl Conf., Prague, pages 207-214. 
[Giorgi et al., 1997c] Giorgi, R., Prete, C. A., Prina, G., and Ricciardi, L. (1997c). A workload 
generation envirnoment for trace-driven simulation of shared-bus multiprocessor. In Proc. 
30th HICSS, Hawaii, pages 266-275. 
[Grama et al., 1993a] Grama, A. Y., Gupta, A., and Kumar, V. (1993a). Isoefficiency: Mea­
suring the scalability of parallel algorithms and architectures. IEEE parallel and distributed 
technology: systems and applications, 1(3): 12-21. 
[Grama et al., 1993b] Grama, A. Y., Gupta, A., and Kumar, V. (1993b). Isoefficiency: mea­
suring the scalability of parallel algorithms and architectures. IEEE parallel and distributed 
technology: systems and applications, 1(3): 12—21. 
[Gropp et al., 1999a] Gropp, B., Lusk, R., and Skjellum, A. (1999a). Using MPI. MIT Press, 
Five Cambridge Center, Cambridge, MA 02142-1493 USA, 2nd edition. 
[Gropp et al., 1999b] Gropp, B., Lusk, R., and Thakur, R. (1999b). Using MPI-2. MIT Press, 
Five Cambridge Center, Cambridge, MA 02142-1493 USA, 1st edition. 
[Gustafson, a] Gustafson, J. HINT for Human, http://hint.byu.edu/tutorials/hfh/; 
(date retrieved: November 19,2003). 
[Gustafson, b] Gustafson, J. A new approach to computer performance prediction, http:// 
www.scl.ameslab.gov/scl/Publications/France/France.html; (date retrieved: Novem-
ber 19,2003). 
256 
[Gustafson et al., ] Gustafson, J., Heller, D., Amit, S., Todi, R., and Snell, Q. HINT Home­
page. http://hint.byu.edu; (date retrieved: November 19,2003). 
[Gustafson et al., 1988] Gustafson, J., Montry, G., and Benner, R. (1988). Development of 
parallel methods for a 1024 processor hypercube. SIAM Journal on Scientific and Statistical 
Computing, 9(4):609-638. 
[Gustafson et al., 1991] Gustafson, J., Rover, D., Elbert, S., and Carter, M. (1991). The 
design of a scalable fixed-time computer benchmark. Journal of Parallel and Distributed 
Computing, 12(4):388-401. 
[Gustafson and Snell, 1994] Gustafson, J. and Snell, Q. (1994). HINT: A new way to measure 
computer performanace. Technical Report IS-5109, Ames Laboratory, Ames, Iowa, 50011-
3020. 
[Gustafson and Todi, 1999a] Gustafson, J. and Todi, R. (1999a). Letters: Operations are free; 
data motion isn't;. Computer, 32(12):4. 
[Gustafson and Todi, 2000] Gustafson, J. and Todi, R. (2000). Conventional benchmarks as a 
sample of the performance spectrum. In Eigenmann, R., editor, Benchmarking and Perfor­
mance Evaluation. MIT Press, Five Cambridge Center, Cambridge, MA 02142-1493 USA. 
To be published. 
[Gustafson et al., 2000] Gustafson, J., Todi, R., and Heller, D. (2000). APPMAP: A new way 
to predict application performance in minutes. In ACM, editor, SC2000: High Performance 
Networking and Computing. Dallas Convention Center, Dallas, TX, USA, November 1^-10, 
2000, pages 146-146, New York, NY 10036, USA and 1109 Spring Street, Suite 300, Silver 
Spring, MD 20910, USA. ACM Press and IEEE Computer Society Press. 
[Gustafson, 1988] Gustafson, J. L. (1988). Reevaluating Amdahl's law. Commun, of the ACM, 
31, 5:532-533. 
[Gustafson, 1998] Gustafson, J. L. (1998). Making computer design a science instead of an 
art. In Schaefer, J., editor, High Performance Computing Systems and Applications (Proc. 
257 
12th International Symposium on High Performance Computing Systems and Applications 
(HPCS'98). Kluwer Academic Publishers, Edmonton, Canada. 
[Gustafson et al., 1989] Gustafson, J. L., Benner, R. E., Sears, M. P., and Sullivan, T. D. 
(1989). A radar simulation program for a 1024-processor hypercube. In ACM, editor, 
Proceedings, Supercomputing '89: November 13-17, 1989, Reno, Nevada, pages 96-105, 
New York, NY 10036, USA. ACM Press. 
[Gustafson and Snell, 1995a] Gustafson, J. L. and Snell, Q. O. (1995a). HINT: A new way to 
measure computer performance. In El-Rewini, H. and Shriver, B. D., editors, Proceedings of 
the 28th Annual Hawaii International Conference on System Sciences. Volume 2: Software 
Technology, pages 392-401, Los Alamitos, CA, USA. IEEE Computer Society Press. 
[Gustafson and Snell, 1995b] Gustafson, J. L. and Snell, Q. O. (1995b). HINT: A new way to 
measure computer performance. In El-Rewini, H. and Shriver, B. D., editors, Proceedings of 
the 28th Annual Hawaii International Conference on System Sciences. Volume 2: Software 
Technology, pages 392-401, Los Alamitos, CA, USA. IEEE Computer Society Press. 
[Gustafson and Todi, 1998] Gustafson, J. L. and Todi, R. (1998). Conventional benchmarks 
as a sample of the performance spectrum. In El-Rewini, H. and Shriver, B. D., editors, 
Proceedings of the 31st Annual Hawaii International Conference on System Sciences, Los 
Alamitos, CA, USA. IEEE Computer Society Press. 
[Gustafson and Todi, 1999b] Gustafson, J. L. and Todi, R. (1999b). Conventional benchmarks 
as a sample of the performance spectrum. The Journal of Supercomputing, 13(3):321-342. 
[Heller, ] Heller, D. Rabbit: A performance counter library, http://www.scl.ameslab.gov/ 
Projects/Rabbit/; (date retrieved: November 19, 2000). 
[Hennessey, 1999] Hennessey, J. (1999). The future of systems research. IEEE Computer 
Magazine, pages 27-33. 
[Hennessy et al., 1982a] Hennessy, J., Jouppi, N., Baskett, F., Gross, T., and Gill, J, (1982a). 
Hardware/software tradeoffs for increased performance. In Proceedings of the Symposium 
258 
on Architectural Support for Programming Languages and Operating Systems, pages 2-11, 
Palo Alto, California. ACM SIGARCH, SIGOPS, and SIGPLAN. 
[Hennessy et al., 1982b] Hennessy, J., Jouppi, N., Przybylski, S., Rowen, C., Gross, T., Baskett,
F., and Gill, J. (1982b). MIPS: A microprocessor architecture. In 15th Annual Workshop
on Microprogramming, pages 17-22, Palo Alto, California. IEEE Computer Society.
[Hennessy and Patterson, 2003] Hennessy, J. L. and Patterson, D. A. (2003). Computer
Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Los Altos, CA 94022,
USA, third edition. 
[Henning, 2000a] Henning, J. (2000a). SPEC CPU2000: Measuring CPU Performance in the 
New Millennium. IEEE Computer. 
[Henning, 2000b] Henning, J. L. (2000b). SPEC CPU2000: Measuring CPU performance in 
the new millennium. IEEE Computer Magazine, pages 28-35. 
[Hewlett Packard, 2001] Hewlett Packard (2001). HP Caliper Performance Analyzer.
http://hpdrdev.fc.hp.com/devresource/Tools/caliper/; (date retrieved: November 19, 2003).
[Hill, 1990] Hill, M. D. (1990). What is scalability? Computer Architecture News, 18(4):18-21.
[Hill and Smith, 1984] Hill, M. D. and Smith, A. J. (1984). Experimental evaluation of on-chip 
microprocessor cache memories. In Proceedings of the 11th Annual International Symposium 
on Computer Architecture, pages 158-166, Ann Arbor, Michigan. IEEE Computer Society 
and ACM SIGARCH. 
[Hily and Seznec, 1997] Hily, S. and Seznec, A. (1997). Contention on 2nd level cache may 
limit the effectiveness of simultaneous multithreading. Technical Report PI-1086, IRISA, 
University of Rennes 1, 35042 Rennes, France. 
[H.J. Newton, J.H. Carroll, N. Wang, 2003] H.J. Newton, J.H. Carroll, N. Wang, D. W. 
(2003). Statistics 30X Class Notes. 
[Hockney, 1985] Hockney, R. W. (1985). (r(inf), n(1/2), s(1/2)) measurements on the 2-CPU
Cray X-MP. Parallel Computing, 2:1-14.
[Hockney and Jesshope, 1988] Hockney, R. W. and Jesshope, C. R. (1988). Parallel Computers 
2: Architecture, Programming, and Algorithms. A. Hilger, Bristol, England, 2nd edition. 
[Hoffman, 2003] Hoffman, T. (2003). HP takes new pricing path for utility-based computing. 
[Holliday and M.K. Vernon, 1987] Holliday, M. and Vernon, M. K. (1987). A Generalised Timed
Petri Net Model for Performance Analysis. IEEE Transactions on Software Engineering, 
13(12):1279-1310. 
[Howard et al., 1988] Howard, J. H., Kazar, M. L., Menees, S. G., Nichols, D. A.,
Satyanarayanan, M., Sidebotham, R. N., and West, M. J. (1988). Scale and performance in a
distributed file system. ACM Transactions on Computer Systems, 6(1):51-81.
[Hsu et al., 1999] Hsu, W. W., Smith, A. J., and Young, H. C. (1999). Analysis of the
characteristics of production database workloads and comparison with the TPC benchmarks.
Technical Report CSD-99-1070, University of California, Berkeley. 
[Hundt, 2000] Hundt, R. (2000). HP Caliper: A framework for performance analysis tools. 
IEEE Concurrency, 8(4):64-71.
[Intel, ] Intel. Intel's comparative microprocessor (iComp) index 2.0. http://www.intel.com/;
(date retrieved: November 19, 2003).
[Jain, 1991a] Jain, R. (1991a). The Art of Computer Systems Performance Analysis. Wiley
Professional Computing. Excellent text! Must-read for all simulation novices/experts.
[Jain, 1991b] Jain, R. (1991b). The Art of Computer Systems Performance Analysis:
Techniques for Experimental Design, Measurement, Simulation, and Modeling.
Wiley-Interscience, New York, NY, USA.
[Joseph et al., 2000] Joseph, E., Williard, C., and Goldfarb, D. (2000). Workstations and high 
performance systems. IDC, 2:38-46. 
[Karanam et al., 1988] Karanam, V. R., Sriram, K., and Bowker, D. O. (1988). Performance
evaluation of variable-bit-rate voice in packet-switched networks. AT&T Technical Journal,
67(5):57-71.
[KleinOsowski et al., 2000] KleinOsowski, A., Flynn, J., Meares, N., and Lilja, D. (2000). 
Adapting the SPEC benchmark suite for simulation based computer architecture research. 
In Proceedings of the Third IEEE Annual Workshop on Workload Characterization, pages 
73-82. 
[Kumar, 1988] Kumar, A. (1988). SNA*/SDLC performance over ISDN frame-relay,
virtual-circuit data networks. AT&T Technical Journal, 67(5):27-40.
[Kumar et al., 1994a] Kumar, V., Grama, A., Gupta, A., and Karypis, G. (1994a).
Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings,
Redwood City, CA. 
[Kumar et al., 1994b] Kumar, V., Grama, A., Gupta, A., and Karypis, G. (1994b).
Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings,
Redwood City, CA. 
[Kumar and Gupta, 1994] Kumar, V. P. and Gupta, A. (1994). Analyzing scalability of parallel 
algorithms and architectures. Journal of Parallel and Distributed Computing, 22(3):379-391. 
[Leutenegger and Sun, 1993] Leutenegger, S. T. and Sun, X.-H. (1993). Distributed computing
feasibility in a non-dedicated homogeneous distributed system. In Proceedings of
Supercomputing '93, pages 143-152.
[Lilja, 2002] Lilja, D. J. (2002). Measuring Computer Performance: a practitioner's guide. 
Cambridge University Press, New York, NY. 
[Luan and Lucantoni, 1988] Luan, D. T. and Lucantoni, D. M. (1988). The effect of bandwidth 
management on the performance of a window-based flow control. AT&T Technical Journal, 
67(5):17-26. 
[Mangione-Smith et al., 1991] Mangione-Smith, W., Abraham, S. G., and Davidson, E. S. 
(1991). A performance comparison of the IBM RS/6000 and the Astronautics ZS-1.
Computer, 24(1):39-47.
[McCalpin, a] McCalpin, J. QG.GYRE and memory bandwidth.
http://home.austin.rr.com/mccalpin/papers/balance/; (date retrieved: November 19, 2003).
[McCalpin, b] McCalpin, J. STREAM: Measuring sustainable memory bandwidth in high
performance computers. http://www.cs.virginia.edu/stream/; (date retrieved: November
19, 2003).
[McCalpin, c] McCalpin, J. STREAM2. http://www.cs.virginia.edu/stream/stream2; 
(date retrieved: November 19, 2003). 
[McCalpin, 1995] McCalpin, J. (1995). Memory bandwidth and machine balance in current 
high performance computers. 
[McClave and II, 1988] McClave, J. T. and Dietrich, F. H., II (1988). Statistics. Dellen Publishing
Company, 499 Pacific Avenue, San Francisco, 1st edition. 
[McMahon, 1986] McMahon, F. H. (1986). The Livermore Fortran kernels: a computer test 
of the numerical performance range. Technical report, Lawrence Livermore National
Laboratory, Livermore, CA, USA.
[McVoy and Staelin, 1996] McVoy, L. and Staelin, C. (1996). lmbench: Portable tools for 
performance analysis. In USENIX, editor, Proceedings of the USENIX 1996 annual
technical conference: January 22-26, 1996, San Diego, California, USA, USENIX Conference
Proceedings 1996, pages 279-294, Berkeley, CA, USA. USENIX. 
[Mejia and O'Keefe, 1993] Mejia, J. C. and O'Keefe, M. T. (1993). High performance
instruction memory design for multiprocessors. In 26th Hawaii Int. Conference on System Science
(HICSS), pages 224-231.
[Miller and Hollingsworth, 1994] Miller, B. P. and Hollingsworth, J. K. (1994). Slack: A new 
performance metric for parallel programs. Technical Report UWMADISONCS CS-TR-95-1260,
University of Wisconsin - Madison, Department of Computer Science.
[Mulupuru, 1996] Mulupuru, J. (1996). Automatic fitting of HINT curves. Master's thesis, 
Iowa State University. Major Professor: Dr. John Gustafson.
[Muppala et al., 1991] Muppala, J. K., Woolet, S. P., and Trivedi, K. S. (1991).
Real-time-systems performance in the presence of failures. Computer, 24(5):37-47.
[Nielsen and Kishinevsky, 1994] Nielsen, C. D. and Kishinevsky, M. (1994). Performance
analysis based on timing simulation. In Proc. ACM/IEEE Design Automation Conference, pages
70-76. 
[NULLSTONE, a] NULLSTONE. EQNTOTT benchmark.
http://www.nullstone.com/eqntott/eqntott.htm; (date retrieved: November 19, 2003).
[NULLSTONE, b] NULLSTONE. NULLSTONE benchmark for C and Java.
http://www.nullstone.com/eqntott/eqntott.htm; (date retrieved: November 19, 2003).
[Nussbaum and Agarwal, 1991a] Nussbaum, D. and Agarwal, A. (1991a). Scalability of
parallel machines. Communications of the ACM, 34(3).
[Nussbaum and Agarwal, 1991b] Nussbaum, D. and Agarwal, A. (1991b). Scalability of
parallel machines. Communications of the ACM, 34(3):56-61.
[Pagnoni, 1987] Pagnoni, A. (1987). Stochastic Nets and Performance Evaluation. In Brauer, 
W., Reisig, W., and Rozenberg, G., editors, Advances in Petri Nets 1986 Part I: Petri Nets, 
central models and their properties, volume 254 of Lecture Notes in Computer Science, pages 
460-478. Springer-Verlag, Berlin. 
[Pease et al., 1991] Pease, D., Ghafoor, A., Ahmad, I., Andrews, O. L., Foudil-Bey, K.,
Karpinski, T. E., Mikki, M. A., and Zerrouki, M. (1991). PAWS: A performance evaluation tool
for parallel computing systems. Computer, 24(1):18-30.
[Pfister, 1995] Pfister, G. F. (1995). In Search of Clusters. Prentice Hall PTR, Upper Saddle
River, NJ, 1st edition.
[Poursepanj, 1994] Poursepanj, A. (1994). The PowerPC performance modeling methodology. 
Communications of the ACM, 37(6):47-55. 
[Preparata, 1995] Preparata, F. P. (1995). Should Amdahl's law be repealed? Lecture Notes 
in Computer Science, 1004:311-?? 
[Ramamoorthy and Ho, 1980] Ramamoorthy, C. V. and Ho, G. S. (1980). Performance
evaluation of asynchronous concurrent systems using Petri nets. IEEE Transactions on Software
Engineering, SE-6(5):440-449.
[Reed, 1993] Reed, D. A. (1993). Performance instrumentation techniques for parallel systems. 
Lecture Notes in Computer Science, 729:463-?? 
[Reed et al., 1993] Reed, D. A., Aydt, R. A., Noe, R. J., Roth, P. C., Shields, K. A., Schwartz, 
B. W., and Tavera, L. F. (1993). Scalable Performance Analysis: The Pablo Performance 
Analysis Environment. In Proc. Scalable Parallel Libraries Conf., pages 104-113. IEEE 
Computer Society. 
[Reilly, 1996] Reilly, J. (1996). A brief introduction to the SPEC CPU95 benchmarks. IEEE 
Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 
1004. 
[Rothberg et al., 1993] Rothberg, E., Singh, J. P., and Gupta, A. (1993). Working sets, cache 
sizes, and node granularity issues for large-scale multiprocessors. In Bic, L., editor,
Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 14-26,
San Diego, CA. IEEE Computer Society Press. 
[Saavedra and Smith, 1992] Saavedra, R. H. and Smith, A. J. (1992). Analysis of Benchmark 
Characteristics and Benchmark Performance Prediction. Technical Report USC-CS-92-524, 
Computer Science Division, University of California, Berkeley, CA.
[Saavedra and Smith, 1996] Saavedra, R. H. and Smith, A. J. (1996). Analysis of benchmark 
characteristics and benchmark performance prediction. ACM Transactions on Computer 
Systems, 14(4):344-384. 
[Saavedra-Barrera et al., 1989a] Saavedra-Barrera, R., Smith, A., and Miya, E. (1989a).
Performance prediction by benchmark and machine characterization.
[Saavedra-Barrera, 1988] Saavedra-Barrera, R. H. (1988). Machine characterization and 
benchmark performance prediction. Technical Report CSD-88-437, University of
California, Berkeley.
[Saavedra-Barrera, 1990] Saavedra-Barrera, R. H. (1990). Performance prediction by
benchmark and machine analysis. Technical Report CSD-90-607, University of California,
Berkeley.
[Saavedra-Barrera, 1992] Saavedra-Barrera, R. H. (1992). CPU performance evaluation and 
execution time prediction using narrow spectrum benchmarking. Technical Report
CSD-92-684, University of California, Berkeley.
[Saavedra-Barrera and Culler, 1991] Saavedra-Barrera, R. H. and Culler, D. E. (1991). An 
analytical solution for a Markov chain modeling multithreaded execution. Report UCB/CSD 
91/623, University of California, Berkeley, Computer Science Division, Berkeley, CA, USA. 
[Saavedra-Barrera et al., 1990] Saavedra-Barrera, R. H., Culler, D. E., and von Eicken, T.
(1990). Analysis of multithreaded architectures for parallel computing. Report UCB/CSD 
90/569, University of California, Berkeley, Computer Science Division, Berkeley, CA, USA. 
To appear in the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 
Crete, Greece, July 1990. 
[Saavedra-Barrera and Smith, 1992] Saavedra-Barrera, R. H. and Smith, A. J. (1992).
Performance characterization of optimizing compilers. Technical Report UCB/CSD-92-699,
University of California Berkeley, Department of Computer Science.
[Saavedra-Barrera and Smith, 1993] Saavedra-Barrera, R. H. and Smith, A. J. (1993).
Measuring cache and TLB performance and their effect on benchmark run times. Technical
Report CSD-93-767, University of California, Berkeley. 
[Saavedra-Barrera et al., 1989b] Saavedra-Barrera, R. H., Smith, A. J., and Miya, E. (1989b). 
Machine characterization based on an abstract high level machine. Technical Report
CSD-89-494, University of California, Berkeley.
[Sahni and Thanvantri, 1996a] Sahni, S. and Thanvantri, V. (1996a). Performance Metrics: 
Keeping the focus on runtime. IEEE parallel and distributed technology: systems and
applications, 4(1):43-56.
[Sahni and Thanvantri, 1996b] Sahni, S. and Thanvantri, V. (1996b). Performance metrics: 
Keeping the focus on runtime. IEEE parallel and distributed technology: systems and
applications, 4(1):43-56.
[Schmidt et al., 1993] Schmidt, M., Baldridge, K., Boatz, J., Elbert, S., Gordon, M., et al.
(1993). General atomic and molecular electronic structure system. Journal of Computational 
Chemistry, 14:1347-63. 
[SGI, ] SGI. Pixie man pages on IRIX 6.2. http://www.sgi.com/; (date retrieved: August 
16, 2001). 
[SGI, 1997] SGI (1997). Origin2000 & Onyx2 4 MB cache performance report. Technical
Report 1.07, Silicon Graphics Inc., Mountain View, CA, USA.
[Simon et al., 1997] Simon, J., Vieth, M., and Weicker, R. (1997). Workload analysis of
computation intensive tasks: Case study on SPEC CPU95 benchmarks. In Euro-Par'97 -
Performance evaluation and benchmarking workshop, Lecture Notes in Computer Science 1300,
pages 971-984. Springer. 
[Snell and Gustafson, 1996] Snell, Q. O. and Gustafson, J. L. (1996). An analytical model of 
the HINT performance metric. In ACM, editor, Supercomputing '96 Conference Proceedings:
November 17-22, Pittsburgh, PA, pages ??-??, New York, NY 10036, USA and 1109 Spring 
Street, Suite 300, Silver Spring, MD 20910, USA. ACM Press and IEEE Computer Society 
Press. 
[SPEC, a] SPEC. SPEC CPU2000. http://www.spec.org/osg/cpu2000/; (date retrieved: 
March 11, 2001). 
[SPEC, b] SPEC. SPEC CPU95 Q & A. http://www.spec.org/osg/cpu95/qanda.html; 
(date retrieved: November 19, 2003). 
[SPEC, 2003] SPEC (2003). SPEC Webpage. http://www.spec.org/; (date retrieved: 
November 19, 2003). 
[Spinellis and Papadopoulos, 1997] Spinellis, D. and Papadopoulos, H. T. (1997). A simulated 
annealing approach for buffer allocation in reliable production lines. In International
Workshop on Performance Evaluation and Optimization of Production Lines, pages 365-375,
Samos, Greece. University of the Aegean, Department of Mathematics. 
[Spirn, 1977] Spirn, J. R. (1977). Program Behavior: Models and Measurement. Elsevier 
North-Holland, Inc., 52 Vanderbilt Avenue, New York, New York, 10017, 1st edition. 
[Sun, 1998] Sun, X.-H. (1998). Performance range comparison via crossing point analysis. 
Lecture Notes in Computer Science, 1388:1025-?? 
[Sun and Gustafson, 1991a] Sun, X.-H. and Gustafson, J. L. (1991a). Sizeup: a new
parallel performance metric. In Proceedings of the 1991 International Conference on Parallel
Processing, volume II, Software, pages II-298-II-299, Boca Raton, FL. CRC Press.
[Sun and Gustafson, 1991b] Sun, X.-H. and Gustafson, J. L. (1991b). Toward a better parallel 
performance metric. Parallel Computing, 17(10-11):1093-1109.
[Sun and Ni, 1992] Sun, X.-H. and Ni, L. M. (1992). Scalable problems and memory-bounded 
speedup. Technical Report MSU-CPS-ACS-21, Advanced Computer Systems Group,
Computer Science Department, Michigan State University.
[Sun and Ni, 1993] Sun, X.-H. and Ni, L. M. (1993). Scalable problems and memory-bounded 
speedup. J. Parallel and Distributed Computing, 19(1):27-37.
[Sun and Rover, 1994] Sun, X.-H. and Rover, D. T. (1994). Scalability of parallel
algorithm-machine combinations. IEEE Transactions on Parallel and Distributed Systems,
5(6):599-613.
[Sun and Zhu, 1994] Sun, X.-H. and Zhu, J. (1994). Shared virtual memory and generalized 
speedup. In Siegel, H. J., editor, Proceedings of the 8th International Symposium on Parallel 
Processing, pages 637-643, Los Alamitos, CA, USA. IEEE Computer Society Press. 
[Sun and Zhu, 1995] Sun, X.-H. and Zhu, J. (1995). Performance prediction of scalable
computing: A case study. In El-Rewini, H. and Shriver, B. D., editors, Proceedings of the 28th
Annual Hawaii International Conference on System Sciences. Volume 2: Software
Technology, pages 456-466, Los Alamitos, CA, USA. IEEE Computer Society Press.
[Sun and Zhu, 1996] Sun, X.-H. and Zhu, J. (1996). Performance Prediction: A case study 
using a scalable shared-virtual-memory machine. IEEE parallel and distributed technology: 
systems and applications, 4(4):36-49.
[Todi, 2001] Todi, R. (2001). SPEClite: Using Representative Samples to reduce SPEC2000 
workload. In Proceedings of the Fourth IEEE Annual Workshop on Workload
Characterization.
[Todi, 2003] Todi, R. (2003). SPEClite: An Accurate Microprocessor Simulation of SPEC2000
in an Hour. 
[Todi et al., 2000] Todi, R., Prabhu, G., Alexeev, U., and Gustafson, J. (2000). Performance 
evaluation of parallel file systems using realistic I/O workloads. In PARALLEL
COMPUTING, Fundamentals and Applications, Proceedings of the International Conference ParCo99,
Delft, The Netherlands, page 788, South Kensington, London, SW7 2AZ. Imperial College
Press. 
[Truong et al., 1996] Truong, D. N., Bodin, F., and Seznec, A. (1996). Accurate data layout 
into blocks may boost cache performance. In Second Workshop on Interaction between
Compilers and Computer Architecture (Interact-2), San Antonio, Texas, IEEE TCCA
Newsletter, June 1997, pages 55-57. 
[Vasiliu, 2000] Vasiliu, B. (2000). Performance Characterization of Automatically Optimized 
Basic Linear Algebra Subprograms. Master's thesis, Iowa State University. Major Professor:
Dr. Don Heller. 
[Wang and Baer, 1991] Wang, W.-H. and Baer, J.-L. (1991). Efficient trace-driven
simulation methods for cache performance analysis. ACM Transactions on Computer Systems,
9(3):222-241.
[Weicker, 1984] Weicker, R. P. (1984). Dhrystone: A synthetic systems programming
benchmark. Communications of the ACM, 27(10):1013-1030.
[Weicker, 1990] Weicker, R. P. (1990). An overview of common benchmarks. Computer, 
23(12):65-75. 
[Weicker, 1991] Weicker, R. P. (1991). A detailed look at some popular benchmarks. Parallel 
Computing, 17(10-11):1153-1172.
[Woods, 2001] Woods, G. (2001). Blizzard: An IA64 processor simulator. Hewlett Packard 
Internal Notes. 
[Wulf and McKee, 1995] Wulf, W. A. and McKee, S. A. (1995). Hitting the memory wall: 
Implications of the obvious. Computer Architecture News, 23(1):20-24.
[Zhou, 1989] Zhou, X. (1989). Bridging the gap between Amdahl's law and Sandia laboratory's 
result. Communications of the ACM, 32(6):1014-1015. 
[Zomaya, 1996] Zomaya, A. Y. (1996). Parallel and Distributed Computing Handbook. McGraw 
Hill, New York, NY, 1st edition. 
