Using Dynamic Binary Instrumentation To Create Faster, Validated, Multi-Core Simulations by Weaver, Vincent
USING DYNAMIC BINARY INSTRUMENTATION
TO CREATE FASTER, VALIDATED, MULTI-CORE
SIMULATIONS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University





© 2010 Vincent Michael Weaver
ALL RIGHTS RESERVED
USING DYNAMIC BINARY INSTRUMENTATION TO CREATE FASTER,
VALIDATED, MULTI-CORE SIMULATIONS
Vincent Michael Weaver, Ph.D.
Cornell University 2010
The Memory Wall continues to be a problem with modern systems design.
While the steady increase in processor speeds has abated somewhat, Moore’s
Law continues to provide more transistors to chip designers. This leads to an
increase in the number of processors and threads located per chip, which in-
creases the demands on memory systems. Current simulation technology is not
able to keep up, leading to sacrifices in methodology and accuracy in order to
get results in reasonable time.
Because cycle-accurate simulators are so slow, various methods are used to reduce
execution time. Unfortunately, these methods can introduce variations in results
of 10-50% when compared to full reference input sets.
Limitations of academic simulators also constrain the architectures under study,
with results generated for obsolete or uninteresting systems.
We analyze the performance and accuracy of various limited-execution
methodologies. We investigate how deterministic execution affects the mea-
surement of error. We then evaluate using Dynamic Binary Instrumentation
(DBI) as an alternative to cycle-accurate simulation. We compare our results to
actual systems using hardware performance counters. We look first at a simple
32-bit RISC system, and then look at more complex 64-bit x86-based systems. Finally,
we investigate the feasibility of using the same methodology for modern
multi-processor simulations.
BIOGRAPHICAL SKETCH
Vincent Weaver was born in 1978 and grew up in Joppatowne, Maryland. He
attended Joppatowne Elementary and Magnolia Middle schools before moving
on to The John Carroll School. He received his B.S. in Electrical Engineering
from the University of Maryland, College Park in December of 2000. After graduation
he briefly worked at Frontpath, a maker of tablet PCs located in Billerica,
Massachusetts. The dot-com bust caught up with the company, and after a
round of layoffs Vince returned to Maryland and worked as a contractor for the
U.S. Army creating web front-ends for legacy Fortran applications. In the Fall
of 2003 he entered the M.S./Ph.D. program at Cornell University. He obtained
an M.S. degree in Electrical and Computer Engineering from Cornell in January
of 2009 and his Ph.D. in May of 2010. Vince is a Linux enthusiast who is often
accompanied by guinea pigs. He enjoys retro-computing and can program in
over 20 types of assembly language.
To Kristina and Elena, for their unfailing support.
I dedicate this thesis to lovely Kristina. Without her I never would have made
it this far. I also would like to thank Princess Elena Suzanne for making me




ACKNOWLEDGMENTS
First and foremost I would like to thank my advisor, Sally McKee, for her
leadership and guidance throughout my time in grad school. Without her help
and support none of this would have been possible. Most notable is her amaz-
ing ability to accumulate computing clusters, without which this work would
not have been finished in a reasonable amount of time. I would like to thank
my other committee members, Rajit Manohar and David Albonesi, for their insights
and feedback that have been instrumental in improving this thesis. In
addition I would like to thank Bruce Jacob from the University of Maryland. It
was his computer organization and computer architecture classes that set me on
the path that resulted in this research.
I also would like to thank all of the members of the Fusion group, past and
present, for all their help and support. This includes Martin, Pete, Brian, Chris,
Cat, Karan, and Major as well as many others who were not around as long but
were just as important.
This work was helped by many open source software projects. I would like
to thank the developers of the Linux kernel, especially Linus Torvalds. I would
like to thank the perfmon2 developers, especially Stéphane Eranian, as well as
the developers of Qemu, Valgrind, and m5.
Additional thanks to the Intel Corporation for donating processors which
are used in our Sampaka and Domori clusters.
Part of this work is supported by the National Science Foundation under




TABLE OF CONTENTS
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 Introduction 1
2 Related Work 5
2.1 Reduced Execution Validations . . . . . . . . . . . . . . . . . . . . 5
2.2 SimPoint Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Performance Counter Validation . . . . . . . . . . . . . . . . . . . 9
2.4 Single-core DBI-Based Simulation . . . . . . . . . . . . . . . . . . . 11
2.4.1 Valgrind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 Pin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3 Qemu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.4 TAXI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Multi-core Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 CMP$im . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.2 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Cycle-Accurate x86 Simulators . . . . . . . . . . . . . . . . . . . . 14
2.7 Simulator Validations . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Multi-processor Phase Detection . . . . . . . . . . . . . . . . . . . 17
2.9 Deterministic Execution . . . . . . . . . . . . . . . . . . . . . . . . 18
2.10 Performance Counter based CPI Prediction . . . . . . . . . . . . . 19
3 Methods of Reducing Simulation Time 21
3.1 Running a Small Portion from the Beginning . . . . . . . . . . . . 22
3.2 Un-guided Fast-forwarding . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Reduced Input Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Statistics-based Sampling . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 SimPoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.1 BBV Generation . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5.2 x86 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.3 x86_64 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.4 Cross-Platform MIPS Results . . . . . . . . . . . . . . . . . 38
3.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 SimPoint Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Single-Core Validation Concerns 45
4.1 Hardware Performance Counters . . . . . . . . . . . . . . . . . . . 45
4.1.1 Performance Counter Evaluation . . . . . . . . . . . . . . . 47
4.1.2 Sources of Hardware Counter Variation . . . . . . . . . . . 49
4.1.3 Counter Variation Findings . . . . . . . . . . . . . . . . . . 52
4.1.4 Intra-machine results . . . . . . . . . . . . . . . . . . . . . . 53
4.1.5 Inter-machine Results . . . . . . . . . . . . . . . . . . . . . 54
4.2 Deterministic Execution . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.1 Virtual Memory Layout . . . . . . . . . . . . . . . . . . . . 57
4.2.2 System Effects . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.3 Sources of DBI Tool Variation . . . . . . . . . . . . . . . . . 61
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 32-Bit RISC Results 65
5.1 SESC Cycle-accurate Simulator . . . . . . . . . . . . . . . . . . . . 68
5.2 Reference Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 DBI-based Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5.1 Absolute Results . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5.2 Relative Results . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 64-Bit CISC Results 84
6.1 RISC/CISC differences . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Modern CPU Features . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3 µop Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.1 Valgrind DBI-based Simulator . . . . . . . . . . . . . . . . 90
6.4.2 m5 Cycle-accurate Simulator . . . . . . . . . . . . . . . . . 91
6.4.3 Reference Hardware . . . . . . . . . . . . . . . . . . . . . . 93
6.4.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Absolute Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5.1 Phase Behavior Results . . . . . . . . . . . . . . . . . . . . 97
6.5.2 L1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . 97
6.5.3 Data Accesses per Thousand Instructions . . . . . . . . . . 98
6.5.4 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 L2 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.7 Branch Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.8 CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.9 Relative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.9.1 L1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . 106
6.9.2 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.9.3 L2 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.9.4 Branch Predictor . . . . . . . . . . . . . . . . . . . . . . . . 108
6.9.5 CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Multi-Core Validation Concerns 111
7.1 Performance Counters . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Deterministic Execution . . . . . . . . . . . . . . . . . . . . . . . . 111
8 Multi-Core Results 114
8.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.1.1 Performance Counters . . . . . . . . . . . . . . . . . . . . . 115
8.1.2 DBI Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.1.3 Cycle-accurate Simulation . . . . . . . . . . . . . . . . . . . 116
8.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9 Conclusion and Future Work 120
9.1 Results Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A The Lost Art of Assembly Language Programming 123
A.1 Benefits of Code Density . . . . . . . . . . . . . . . . . . . . . . . . 123
A.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.3 Architectural Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.4 Code Density Findings . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.5 Density of Compiler-Generated Binaries . . . . . . . . . . . . . . . 133
A.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
A.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 136
B Cache Latencies 138
C Instruction Counts 141
D Simulation Timings 153
E CPI Phase Plots 164
E.1 32-bit x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
E.2 64-bit x86_64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
F Multi-architecture Phase Plots 254
G L1 Data Cache Accesses per Instruction Phase Plots 279
H L1 Data Cache Accesses per µop Phase Plots 320
I Valgrind exp-bbv Tool Code Listing 341
J Qemu BBV Patch Code Listing 353
K R12000 Branch Predictor Kernel Module 363




LIST OF TABLES
3.1 Machines used for x86 SimPoint evaluation. . . . . . . . . . . . . 29
3.2 Machines used for x86_64 SimPoint evaluation. . . . . . . . . . . 36
4.1 Machines used for this study. . . . . . . . . . . . . . . . . . . . . . 48
4.2 Dynamic count of fldcw instructions, showing all benchmarks
with over 100 million. This instruction is counted as two instruc-
tions on Pentium 4 machines but only as one instruction on all
other implementations. . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Potential overcounted dynamic instructions due to the rep pre-
fix (only benchmarks with more than 10 billion are shown). . . . 62
5.1 Configuration of SGI Octane2 machine used for comparison . . . 69
5.2 Comparison of simulation times . . . . . . . . . . . . . . . . . . . 73
5.3 Summary of results. The weighted average is across all of the
SPEC 2000 benchmarks which ran to completion on all three
platforms: 23 integer and 11 floating point (this is unfortunately
only a portion of the 48 available benchmark/input combinations). 83
5.4 Summary of relative results, which compare the change when
moving from a 2-bit branch predictor to either always-taken or
static. The error shown is the relative error between the relative
average means of all benchmarks on actual hardware versus the
predicted relative average means of the simulated results. The
results represent the 33 SPEC CPU 2000 benchmarks
which ran to completion on all three platforms. . . . . . . . . . . 83
6.1 Hardware performance counters used for µop experiments . . . 87
6.2 Number of µops required for an assortment of x86 instructions . 89
6.3 Configuration of AMD Phenom machine used for comparison . 91
6.4 Hardware performance counters used for our experiments. We
did not use all of the counters listed. Some of the counters have
known errata. We gathered this list from PAPI [102] and the
AMD and Intel reference manuals [10, 72]. . . . . . . . . . . . . . 94
A.1 Summary of investigated architectures . . . . . . . . . . . . . . . 125
A.2 Correlations of architectural features to binary size . . . . . . . . 129
B.1 L1 Cache latencies on Fusion group machines . . . . . . . . . . . 139
B.2 L2 Cache latencies on Fusion group machines . . . . . . . . . . . 140
C.1 Retired instructions for Alpha SPEC CPU2000, showing Qemu
and m5 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
C.2 Retired instructions for MIPS SPEC CPU2000, showing both
Qemu and actual hardware. . . . . . . . . . . . . . . . . . . . . . . 144
C.3 Retired instructions for PPC SPEC CPU 2000, showing Qemu
and Valgrind results. . . . . . . . . . . . . . . . . . . . . . . . . . . 145
C.4 Retired instructions for SPARC SPEC CPU2000, showing actual
hardware and Qemu results. . . . . . . . . . . . . . . . . . . . . . 146
C.5 Retired instructions for SPARC SPEC CPU2006, showing actual
hardware and Qemu results (part 1) . . . . . . . . . . . . . . . . . 147
C.6 Retired instructions for SPARC SPEC CPU2006, showing actual
hardware and Qemu results (part 2) . . . . . . . . . . . . . . . . . 148
C.7 Retired instructions for x86 SPEC CPU2000, showing both Qemu
and actual hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . 149
C.8 Retired instructions for x86 SPEC CPU2006, showing Pin,
Valgrind, Qemu, and Pentium D (part 1). . . . . . . . . . . . . . . 150
C.9 Retired instructions for x86 SPEC CPU2006, showing Pin,
Valgrind, Qemu, and Pentium D (part 2). . . . . . . . . . . . . . . 151
C.10 Retired instructions for x86_64 SPEC CPU2000, showing both
Qemu and actual hardware. . . . . . . . . . . . . . . . . . . . . . . 152
D.1 Summary of slowdown compared to Pentium D node running
x86_64 binaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
D.2 x86 32-bit versus 64-bit run time anomaly for sixtrack. Some
benchmarks perform markedly worse when compiled as 64-bit. . 156
D.3 Elapsed times for running the SPEC CPU 2000 benchmarks on
various Alpha simulators. domori is time on our reference Pen-
tium D machine. bmul is an actual Alpha 21264 system. . . . . . . 157
D.4 Elapsed times for running the SPEC CPU 2000 benchmarks on
various MIPS simulators. domori is time on our reference Pen-
tium D machine. hershey is an actual MIPS R12000 system. The
pre-compiled SPEC benchmarks from the SESC site are used;
some (such as gzip) are modified to have shorter run-times,
which is why the R12000 runs them faster than the Pentium D. . 158
D.5 Elapsed times for running the SPEC CPU 2000 benchmarks on
various SPARC simulators. domori is time on our reference Pen-
tium D machine. niagara is an actual SPARC niagara system. . . 159
D.6 Times for x86 architecture . . . . . . . . . . . . . . . . . . . . . . . 160
D.7 Times for x86_64 architecture comparing simulators. . . . . . . . 161
D.8 Times for x86_64 DBI . . . . . . . . . . . . . . . . . . . . . . . . . . 162
D.9 Times for x86_64 DBI utilities running cache simulations. . . . . . 163
LIST OF FIGURES
1.1 Weighted slowdowns of various simulators when running SPEC
CPU2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Instruction set diversity across various domains. Recent com-
puter architecture conference papers (ICCD’09, ISCA’09, MI-
CRO’08 and ASPLOS’09) match years-old high-performance
computing diversity rather than modern trends in computing . . 3
3.1 L1 Data Cache and CPI behavior for twolf: behavior is uniform,
with one phase representing the entire program. . . . . . . . . . . 25
3.2 L1 Data Cache and CPI behavior for mcf: several recurring
phases are evident. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 L1 Data Cache and CPI behavior for gcc.200: this program ex-
hibits complex behavior that is hard to capture with phase de-
tection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Architectures supported by Pin, Qemu, and Valgrind: x86 is the
ideal platform for comparison, as it is well supported by all three
of the tools. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Average CPI error for SPEC CPU2000 when using first, un-
guided fast-forward, and SimPoint selected intervals on various
x86 machines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Percent error in CPI on a Pentium D when using up to 20 Sim-
Points on CPU2000 FP: the error with facerec and fma3d is
due to extreme swings in the phase behavior that SimPoint has
trouble capturing. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Percent error in CPI on a Pentium D when using up to 20 Sim-
Points on CPU2000 INT: the large error with the gcc benchmarks
is due to spikes in the phase behavior that SimPoint does not
capture well. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.8 Average CPI error for CPU2006 on a selection of x86 machines
when using first, unguided fast-forward, and SimPoint selected
intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9 Percent error in CPI on a Pentium D when using up to 20
SimPoints on CPU2006 FP: the large variation in results for
cactusADM and GemsFDTD is due to unresolved inaccuracies
in the way the tools count instructions. . . . . . . . . . . . . . . . 35
3.10 Percent error in CPI on a Pentium D when using up to 20 Sim-
Points on CPU2006 INT: the large error with the gcc and bzip2
benchmarks is due to spikes in the phase behavior not captured
by SimPoint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.11 Average CPI error for CPU2000 on three x86_64 machines when
using first, unguided fast-forward, and SimPoint selected inter-
vals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.12 x86_64 CPI Error for SPEC CPU2000 floating point benchmarks . 38
3.13 x86_64 CPI Error for SPEC CPU2000 integer benchmarks . . . . . 38
3.14 Phase plot for mcf across various architectures. While the phases
look similar, the interval numbers are not. . . . . . . . . . . . . . 39
3.15 Phase plot for equake across various compilers and compile
options. The interval numbers vary widely. . . . . . . . . . . . . . 40
3.16 MIPS R12000 SimPoint results for SPEC CPU2000. The BBVs for
the SimPoints were generated cross-platform on an x86 machine
using Qemu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.17 MIPS CPI Error for SPEC CPU2000 floating point . . . . . . . . . 42
3.18 MIPS CPI Error for SPEC CPU2000 integer benchmarks . . . . . 42
3.19 Percent average CPI error for SPEC CPU2000 as more SimPoints
are added per benchmark. After 20 SimPoints the average does
not decrease, even up to 100 points per benchmark (this is equiv-
alent to running 2% of all of the benchmarks). . . . . . . . . . . . 44
4.1 SPEC 2000 Coefficient of variation. The top graph shows integer
benchmarks, the bottom, floating point. The error variation from
mesa, perlbmk, vpr, twolf and eon is primarily due to the
fldcw miscount on the Pentium 4 systems. Variation after our
adjustments becomes negligible. . . . . . . . . . . . . . . . . . . . 51
4.2 SPEC 2006 Coefficient of variation. The top graph shows inte-
ger benchmarks, bottom, floating point. The original variation
is small compared to the large numbers of instructions in these
benchmarks. The largest variation is in sphinx3, due to fldcw
instruction issues. Variation after our adjustments becomes or-
ders of magnitude smaller. . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Intra-machine results for SPEC CPU2000 (above) and CPU2006
(below). Outliers are indicated by the first letter of the bench-
mark name and a distinctive color. For CPU2000, the perlbmk
benchmarks (represented by gray ‘p’s) are a large source of vari-
ation. For CPU2006, the perlbench (green ‘p’) and povray
(gray ‘p’) are the common outliers. Order of plotted letters for
outliers has no intrinsic meaning, but tries to make the graphs
as readable as possible. Horizontal lines summarize results
for remaining benchmarks (they’re all similar). The message
here is that most platforms have few outliers, and there’s much
consistency with respect to measurements across benchmarks;
Core Duo and Core2 Q6600 have many more outliers, especially
for CPU2006. Our technical report provides detailed perfor-
mance information — these plots are merely intended to indi-
cate trends. Standard deviations decrease drastically with our
updated methods, but there is still room for improvement. . . . 54
4.4 Inter-machine results for SPEC CPU2000. We choose five repre-
sentative benchmarks and show the individual machine differ-
ences contributing to the standard deviations. Often there is a
single outlier affecting results; the outlying machine is often dif-
ferent. DBI results are shown, but not incorporated into standard
deviations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Inter-machine results for SPEC CPU2006. We choose five rep-
resentative benchmarks and show the individual machine dif-
ferences contributing to the standard deviations. Often there is
a single outlier affecting results; the outlying machine is often
different. DBI results are shown, but not incorporated into the
standard deviations. . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6 The typical layout of virtual memory for a process on 32-bit x86
Linux. If process space randomization is enabled, then the BSS,
Heap, mmap and stack can have different offsets. . . . . . . . . . 58
5.1 The precompiled SPEC 2000 benchmarks available from the
SESC website have potentially been modified to reduce runtime.
A phase chart gathered with hardware performance counters
shows behavior of the provided precompiled binary on top and
that of a binary we compiled from original SPEC sources (with
gcc) on bottom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Instruction cache miss rate with integer benchmarks above and
floating point below. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 L1 data cache miss rate with integer benchmarks above and
floating point below. . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 L2 cache miss rate with integer above and floating point below.
None of the simulations captures mcf’s behavior well. None of
the simulation methods predicts the art benchmarks well. . . . 76
5.5 Branch miss rate with integer above and floating point below.
The hardware can have up to four outstanding branches; Qemu
and SESC do not model wrong-path execution. . . . . . . . . . . 77
5.6 CPI results with integer above and floating point below. . . . . . 78
5.7 Always taken branch predictor miss rate, normalized against dy-
namic two-bit results. . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.8 Static branch predictor miss rate, normalized against dynamic
two-bit results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.9 L2 cache miss rates with the always-taken predictor, normalized
against two-bit results. . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.10 L2 cache miss rates with the static predictor, normalized against
two-bit results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.11 TLB misses with always taken, normalized against two-bit. . . . 81
5.12 TLB misses with static predictor, normalized against two-bit. . . 81
5.13 CPI with always taken normalized against two-bit results. . . . . 82
5.14 CPI with static predictor normalized against two-bit results. . . . 82
6.1 Data cache accesses per µop for gzip.program . . . . . . . . . . 87
6.2 Normalized µops per benchmark for three x86_64 implementations,
a 32-bit x86, the m5 simulator, and two representative RISC
architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3 L1 data cache accesses per instruction. This plot shows that
cache accesses per instruction is consistent across all actual ma-
chines, as well as the simulators. The MIPS results are very dif-
ferent. SimPoint results are shown for comparison . . . . . . . . 96
6.4 Average bytes per x86 instruction. For integer benchmarks the
average is 4.0, for floating point it is 5.1. These values are needed
when extrapolating cache miss rates when given only total re-
tired instruction count. . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5 Instruction cache miss rate with integer benchmarks above and
floating point below. . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 Data Accesses per Thousand Instructions for the SPEC CPU2000
benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.7 L1 data cache miss rate with integer benchmarks above and
floating point below. . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.8 Dcache miss rates for Phenom-style cache . . . . . . . . . . . . . 102
6.9 L2 cache miss rates, actual and simulated. The simulators are
pessimistic; in the case of gcc severely so. . . . . . . . . . . . . . 103
6.10 Branch predictor results for Valgrind and actual hardware. m5
currently cannot simulate branch prediction for x86_64 . . . . . . 104
6.11 CPI results with integer above and floating point below. Val-
grind cycle times are estimated based on cache and branch pre-
dictor behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.12 Relative instruction cache miss rate ratios when moving from
32-bit to 64-bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.13 Relative L1 data cache miss rate ratios when moving from 32-bit
to 64-bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.14 Relative L1 data cache miss rate ratios when moving from 32-bit
to 64-bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.15 Relative branch predictor miss rate ratios when moving from
32-bit to 64-bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.16 Relative CPI ratios when moving from 32-bit to 64-bit . . . . . . . 109
8.1 equake_m run times for varying number of threads, both on
actual hardware and Valgrind . . . . . . . . . . . . . . . . . . . . . 116
8.2 equake_m retired instruction counts for varying number of
threads, both on real hardware and Valgrind . . . . . . . . . . . . 117
8.3 equake_m L1 dcache access counts for varying number of
threads, both on real hardware and Valgrind . . . . . . . . . . . . 118
9.1 Speed vs Accuracy tradeoffs of the various simulation methods
on SPEC CPU2000, assuming perfect simulation . . . . . . . . . . 121
A.1 Sample output from the linux_logo benchmark . . . . . . . . . 124
A.2 Total size of benchmarks (includes some platform-specific code,
so does not strictly reflect code density) . . . . . . . . . . . . . . . 130
A.3 Size of LZSS decompression code . . . . . . . . . . . . . . . . . . 130
A.4 Size of string concatenation code (machines with auto-increment
addressing modes and dedicated string instructions perform
better) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
A.5 Size of string searching code (unaligned load instructions help,
since four bytes at arbitrary offsets can be compared at once.
CISC architectures as well as avr32 and MIPS benefit) . . . . . . 130
A.6 Size of integer printing code (hardware divide helps code density) 131
A.7 Total size of generated executables, stripped of debugging infor-
mation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
E.1 CPI phase plot for gzip.graph (INT, C, Compression) . . . . . 165
E.2 CPI phase plot for gzip.log (INT, C, Compression) . . . . . . . 166
E.3 CPI phase plot for gzip.prog (INT, C, Compression) . . . . . . 167
E.4 CPI phase plot for gzip.rand (INT, C, Compression) . . . . . . 168
E.5 CPI phase plot for gzip.src (INT, C, Compression) . . . . . . . 169
E.6 CPI phase plot for wupwise (FP, F77, Quantum Chromodynamics) 170
E.7 CPI phase plot for swim (FP, F77, Meteorology/Water) . . . . . . 171
E.8 CPI phase plot for mgrid (FP, F77, Multi-Grid Solver) . . . . . . 172
E.9 CPI phase plot for applu (FP, F77, Fluid Dynamics) . . . . . . . . 173
E.10 CPI phase plot for vpr.place (INT, C, FPGA Place/Route) . . . 174
E.11 CPI phase plot for vpr.route (INT, C, FPGA Place/Route) . . . 175
E.12 CPI phase plot for gcc.166 (INT, C, C Compiler) . . . . . . . . . 176
E.13 CPI phase plot for gcc.200 (INT, C, C Compiler) . . . . . . . . . 177
E.14 CPI phase plot for gcc.expr (INT, C, C Compiler) . . . . . . . . 178
E.15 CPI phase plot for gcc.int (INT, C, C Compiler) . . . . . . . . . 179
E.16 CPI phase plot for gcc.sci (INT, C, C Compiler) . . . . . . . . . 180
E.17 CPI phase plot for mesa (FP, C, 3D-graphics) . . . . . . . . . . . . 181
E.18 CPI phase plot for galgel (FP, F90, Fluid Dynamics) . . . . . . . 182
E.19 CPI phase plot for art.110 (FP, C, Neural Networks) . . . . . . 183
E.20 CPI phase plot for art.470 (FP, C, Neural Networks) . . . . . . 184
E.21 CPI phase plot for mcf (INT, C, Combinatorial Opt) . . . . . . . . 185
E.22 CPI phase plot for equake (FP, C, Seismic Propagation) . . . . . 186
E.23 CPI phase plot for crafty (INT, C, Chess) . . . . . . . . . . . . . 187
E.24 CPI phase plot for facerec (FP, F90, Facial Recognition) . . . . . 188
E.25 CPI phase plot for ammp (FP, C, Chemistry) . . . . . . . . . . . . . 189
E.26 CPI phase plot for lucas (FP, F90, Number Theory) . . . . . . . 190
E.27 CPI phase plot for fma3d (FP, F90, Crash Simulation) . . . . . . . 191
E.28 CPI phase plot for parser (INT, C, Word Processing) . . . . . . . 192
E.29 CPI phase plot for sixtrack (FP, F77, Nuclear Physics) . . . . . 193
E.30 CPI phase plot for eon.cook (INT, C++, Computer Graphics) . 194
E.31 CPI phase plot for eon.kaj (INT, C++, Computer Graphics) . . 195
E.32 CPI phase plot for eon.rush (INT, C++, Computer Graphics) . 196
E.33 CPI phase plot for perlbmk.535 (INT, C, Scripting Language) . 197
E.34 CPI phase plot for perlbmk.704 (INT, C, Scripting Language) . 198
E.35 CPI phase plot for perlbmk.850 (INT, C, Scripting Language) . 199
E.36 CPI phase plot for perlbmk.957 (INT, C, Scripting Language) . 200
E.37 CPI phase plot for perlbmk.diff (INT, C, Scripting Language) 201
E.38 CPI phase plot for perlbmk.mkrnd (INT, C, Scripting Language) 202
E.39 CPI phase plot for perlbmk.perf (INT, C, Scripting Language) 203
E.40 CPI phase plot for gap (INT, C, Group Theory) . . . . . . . . . . 204
E.41 CPI phase plot for vortex.1 (INT, C, Database) . . . . . . . . . 205
E.42 CPI phase plot for vortex.2 (INT, C, Database) . . . . . . . . . 206
E.43 CPI phase plot for vortex.3 (INT, C, Database) . . . . . . . . . 207
E.44 CPI phase plot for bzip2.graph (INT, C, Compression) . . . . . 208
E.45 CPI phase plot for bzip2.prog (INT, C, Compression) . . . . . 209
E.46 CPI phase plot for bzip2.src (INT, C, Compression) . . . . . . 210
E.47 CPI phase plot for twolf (INT, C, Place/Route) . . . . . . . . . . 211
E.48 CPI phase plot for apsi (FP, F77, Meteorology/Pollution) . . . . 212
E.49 CPI phase plot for gzip.graph (INT, C, Compression) . . . . . 214
E.50 CPI phase plot for gzip.log (INT, C, Compression) . . . . . . . 215
E.51 CPI phase plot for gzip.prog (INT, C, Compression) . . . . . . 216
E.52 CPI phase plot for gzip.rnd (INT, C, Compression) . . . . . . . 217
E.53 CPI phase plot for gzip.src (INT, C, Compression) . . . . . . . 218
E.54 CPI phase plot for wupwise (FP, F77, Quantum Chromodynamics) 219
E.55 CPI phase plot for swim (FP, F77, Meteorology/Water) . . . . . . 220
E.56 CPI phase plot for mgrid (FP, F77, Multi-Grid Solver) . . . . . . 221
E.57 CPI phase plot for applu (FP, F77, Fluid Dynamics) . . . . . . . . 222
E.58 CPI phase plot for vpr.place (INT, C, FPGA Place/Route) . . . 223
E.59 CPI phase plot for vpr.route (INT, C, FPGA Place/Route) . . . 224
E.60 CPI phase plot for gcc.166 (INT, C, C Compiler) . . . . . . . . . 225
E.61 CPI phase plot for gcc.200 (INT, C, C Compiler) . . . . . . . . . 226
E.62 CPI phase plot for gcc.expr (INT, C, C Compiler) . . . . . . . . 227
E.63 CPI phase plot for gcc.int (INT, C, C Compiler) . . . . . . . . . 228
E.64 CPI phase plot for gcc.sci (INT, C, C Compiler) . . . . . . . . . 229
E.65 CPI phase plot for mesa (FP, C, 3D-graphics) . . . . . . . . . . . . 230
E.66 CPI phase plot for galgel (FP, F90, Fluid Dynamics) . . . . . . . 231
E.67 CPI phase plot for art.110 (FP, C, Neural Networks) . . . . . . 232
E.68 CPI phase plot for art.470 (FP, C, Neural Networks) . . . . . . 233
E.69 CPI phase plot for mcf (INT, C, Combinatorial Opt) . . . . . . . . 234
E.70 CPI phase plot for equake (FP, C, Seismic Propagation) . . . . . 235
E.71 CPI phase plot for crafty (INT, C, Chess) . . . . . . . . . . . . . 236
E.72 CPI phase plot for facerec (FP, F90, Facial Recognition) . . . . . 237
E.73 CPI phase plot for ammp (FP, C, Chemistry) . . . . . . . . . . . . . 238
E.74 CPI phase plot for lucas (FP, F90, Number Theory) . . . . . . . 239
E.75 CPI phase plot for fma3d (FP, F90, Crash Simulation) . . . . . . . 240
E.76 CPI phase plot for parser (INT, C, Word Processing) . . . . . . . 241
E.77 CPI phase plot for sixtrack (FP, F77, Nuclear Physics) . . . . . 242
E.78 CPI phase plot for eon.cook (INT, C++, Computer Graphics) . 243
E.79 CPI phase plot for eon.kaj (INT, C++, Computer Graphics) . . 244
E.80 CPI phase plot for eon.rush (INT, C++, Computer Graphics) . 245
E.81 CPI phase plot for perlbmk.mkrnd (INT, C, Scripting Language) 246
E.82 CPI phase plot for perlbmk.perf (INT, C, Scripting Language) 247
E.83 CPI phase plot for gap (INT, C, Group Theory) . . . . . . . . . . 248
E.84 CPI phase plot for bzip2.graph (INT, C, Compression) . . . . . 249
E.85 CPI phase plot for bzip2.prog (INT, C, Compression) . . . . . 250
E.86 CPI phase plot for bzip2.src (INT, C, Compression) . . . . . . 251
E.87 CPI phase plot for twolf (INT, C, Place/Route) . . . . . . . . . . 252
E.88 CPI phase plot for apsi (FP, F77, Meteorology/Pollution) . . . . 253
F.1 Multi-arch CPI plot for gzip.graph (INT, C, Compression) . . . 254
F.2 Multi-arch CPI plot for gzip.log (INT, C, Compression) . . . . 255
F.3 Multi-arch CPI plot for gzip.prog (INT, C, Compression) . . . 255
F.4 Multi-arch CPI plot for gzip.rand (INT, C, Compression) . . . 256
F.5 Multi-arch CPI plot for gzip.src (INT, C, Compression) . . . . 256
F.6 Multi-arch CPI plot for wupwise (FP, F77, Quantum Chromody-
namics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
F.7 Multi-arch CPI plot for swim (FP, F77, Meteorology/Water) . . . 257
F.8 Multi-arch CPI plot for mgrid (FP, F77, Multi-Grid Solver) . . . . 258
F.9 Multi-arch CPI plot for applu (FP, F77, Fluid Dynamics) . . . . . 258
F.10 Multi-arch CPI plot for vpr.place (INT, C, FPGA Place/Route) 259
F.11 Multi-arch CPI plot for vpr.route (INT, C, FPGA Place/Route) 259
F.12 Multi-arch CPI plot for gcc.166 (INT, C, C Compiler) . . . . . . 260
F.13 Multi-arch CPI plot for gcc.200 (INT, C, C Compiler) . . . . . . 260
F.14 Multi-arch CPI plot for gcc.expr (INT, C, C Compiler) . . . . . 261
F.15 Multi-arch CPI plot for gcc.integrate (INT, C, C Compiler) . 261
F.16 Multi-arch CPI plot for gcc.scilab (INT, C, C Compiler) . . . . 262
F.17 Multi-arch CPI plot for mesa (FP, C, 3D-graphics) . . . . . . . . . 262
F.18 Multi-arch CPI plot for galgel (FP, F90, Fluid Dynamics) . . . . 263
F.19 Multi-arch CPI plot for art.110 (FP, C, Neural Networks) . . . 263
F.20 Multi-arch CPI plot for art.470 (FP, C, Neural Networks) . . . 264
F.21 Multi-arch CPI plot for mcf (INT, C, Combinatorial Opt) . . . . . 264
F.22 Multi-arch CPI plot for equake (FP, C, Seismic Propagation) . . 265
F.23 Multi-arch CPI plot for crafty (INT, C, Chess) . . . . . . . . . . 265
F.24 Multi-arch CPI plot for facerec (FP, F90, Facial Recognition) . . 266
F.25 Multi-arch CPI plot for ammp (FP, C, Chemistry) . . . . . . . . . . 266
F.26 Multi-arch CPI plot for lucas (FP, F90, Number Theory) . . . . . 267
F.27 Multi-arch CPI plot for fma3d (FP, F90, Crash Simulation) . . . . 267
F.28 Multi-arch CPI plot for parser (INT, C, Word Processing) . . . . 268
F.29 Multi-arch CPI plot for sixtrack (FP, F77, Nuclear Physics) . . 268
F.30 Multi-arch CPI plot for eon.cook (INT, C++, Computer Graphics) 269
F.31 Multi-arch CPI plot for eon.kajiya (INT, C++, Computer
Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
F.32 Multi-arch CPI plot for eon.rushmeier (INT, C++, Computer
Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
F.33 Multi-arch CPI plot for perlbmk.535 (INT, C, Scripting Lan-
guage) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
F.34 Multi-arch CPI plot for perlbmk.704 (INT, C, Scripting Lan-
guage) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
F.35 Multi-arch CPI plot for perlbmk.850 (INT, C, Scripting Lan-
guage) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
F.36 Multi-arch CPI plot for perlbmk.957 (INT, C, Scripting Lan-
guage) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
F.37 Multi-arch CPI plot for perlbmk.diff (INT, C, Scripting Lan-
guage) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
F.38 Multi-arch CPI plot for perlbmk.mkrnd (INT, C, Scripting) . . . 273
F.39 Multi-arch CPI plot for perlbmk.perf (INT, C, Scripting) . . . 273
F.40 Multi-arch CPI plot for gap (INT, C, Group Theory) . . . . . . . . 274
F.41 Multi-arch CPI plot for vortex.1 (INT, C, Database) . . . . . . . 274
F.42 Multi-arch CPI plot for vortex.2 (INT, C, Database) . . . . . . . 275
F.43 Multi-arch CPI plot for vortex.3 (INT, C, Database) . . . . . . . 275
F.44 Multi-arch CPI plot for bzip2.graph (INT, C, Compression) . . 276
F.45 Multi-arch CPI plot for bzip2.prog (INT, C, Compression) . . . 276
F.46 Multi-arch CPI plot for bzip2.src (INT, C, Compression) . . . 277
F.47 Multi-arch CPI plot for twolf (INT, C, Place/Route) . . . . . . . 277
F.48 Multi-arch CPI plot for apsi (FP, F77, Meteorology/Pollution) . 278
G.1 L1 dcache accesses per instruction plot for gzip.graph (INT, C,
Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
G.2 L1 dcache accesses per instruction plot for gzip.log (INT, C,
Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
G.3 L1 dcache accesses per instruction plot for gzip.prog (INT, C,
Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
G.4 L1 dcache accesses per instruction plot for gzip.rand (INT, C,
Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
G.5 L1 dcache accesses per instruction plot for gzip.src (INT, C,
Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
G.6 L1 dcache accesses per instruction plot for wupwise (FP, F77,
Quantum Chromodynamics) . . . . . . . . . . . . . . . . . . . . . 285
G.7 L1 dcache accesses per instruction plot for swim (FP, F77, Mete-
orology/Water) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
G.8 L1 dcache accesses per instruction plot for mgrid (FP, F77, Multi-
Grid Solver) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
G.9 L1 dcache accesses per instruction plot for applu (FP, F77, Fluid
Dynamics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
G.10 L1 dcache accesses per instruction plot for vpr.place (INT, C,
FPGA Place/Route) . . . . . . . . . . . . . . . . . . . . . . . . . . 289
G.11 L1 dcache accesses per instruction plot for vpr.route (INT, C,
FPGA Place/Route) . . . . . . . . . . . . . . . . . . . . . . . . . . 290
G.12 L1 dcache accesses per instruction plot for gcc.166 (INT, C, C
Compiler) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
G.13 L1 dcache accesses per instruction plot for gcc.200 (INT, C, C
Compiler) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
G.14 L1 dcache accesses per instruction plot for gcc.expr (INT, C, C
Compiler) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
G.15 L1 dcache accesses per instruction plot for gcc.int (INT, C, C
Compiler) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
G.16 L1 dcache accesses per instruction plot for gcc.sci (INT, C, C
Compiler) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
G.17 L1 dcache accesses per instruction plot for mesa (FP, C, 3D-
graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
G.18 L1 dcache accesses per instruction plot for galgel (FP, F90,
Fluid Dynamics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
G.19 L1 dcache accesses per instruction plot for art.110 (FP, C, Neu-
ral Networks) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
G.20 L1 dcache accesses per instruction plot for art.470 (FP, C, Neu-
ral Networks) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
G.21 L1 dcache accesses per instruction plot for mcf (INT, C, Combi-
natorial Opt) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
G.22 L1 dcache accesses per instruction plot for equake (FP, C,
Seismic Propagation) . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
G.23 L1 dcache accesses per instruction plot for crafty (INT, C, Chess) 302
G.24 L1 dcache accesses per instruction plot for facerec (FP, F90,
Facial Recognition) . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
G.25 L1 dcache accesses per instruction plot for ammp (FP, C, Chemistry) 304
G.26 L1 dcache accesses per instruction plot for lucas (FP, F90, Num-
ber Theory) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
G.27 L1 dcache accesses per instruction plot for fma3d (FP, F90, Crash
Simulation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
G.28 L1 dcache accesses per instruction plot for parser (INT, C,
Word Processing) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
G.29 L1 dcache accesses per instruction plot for sixtrack (FP, F77,
Nuclear Physics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
G.30 L1 dcache accesses per instruction plot for eon.cook (INT, C++,
Computer Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . 309
G.31 L1 dcache accesses per instruction plot for eon.kaj (INT, C++,
Computer Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . 310
G.32 L1 dcache accesses per instruction plot for eon.rush (INT, C++,
Computer Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . 311
G.33 L1 dcache accesses per instruction plot for perlbmk.mkrnd
(INT, C, Scripting Language) . . . . . . . . . . . . . . . . . . . . . 312
G.34 L1 dcache accesses per instruction plot for perlbmk.perf (INT,
C, Scripting Language) . . . . . . . . . . . . . . . . . . . . . . . . 313
G.35 L1 dcache accesses per instruction plot for gap (INT, C, Group
Theory) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
G.36 L1 dcache accesses per instruction plot for bzip2.graph (INT,
C, Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
G.37 L1 dcache accesses per instruction plot for bzip2.prog (INT, C,
Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
G.38 L1 dcache accesses per instruction plot for bzip2.src (INT, C,
Compression) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
G.39 L1 dcache accesses per instruction plot for twolf (INT, C,
Place/Route) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
G.40 L1 dcache accesses per instruction plot for apsi (FP, F77, Mete-
orology/Pollution) . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
H.1 L1 D$ accesses per µop for gzip.graph (INT, C, Compression) . 320
H.2 L1 D$ accesses per µop for gzip.log (INT, C, Compression) . . 321
H.3 L1 D$ accesses per µop for gzip.prog (INT, C, Compression) . 321
H.4 L1 D$ accesses per µop for gzip.rand (INT, C, Compression) . 322
H.5 L1 D$ accesses per µop for gzip.src (INT, C, Compression) . . 322
H.6 L1 D$ accesses per µop for wupwise (FP, F77, Quantum Chro-
modynamics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
H.7 L1 D$ accesses per µop for swim (FP, F77, Meteorology/Water) . 323
H.8 L1 D$ accesses per µop for mgrid (FP, F77, Multi-Grid Solver) . . 324
H.9 L1 D$ accesses per µop for applu (FP, F77, Fluid Dynamics) . . . 324
H.10 L1 D$ accesses per µop for vpr.place (INT, C, FPGA
Place/Route) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
H.11 L1 D$ accesses per µop for vpr.route (INT, C, FPGA
Place/Route) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
H.12 L1 D$ accesses per µop for gcc.166 (INT, C, C Compiler) . . . . 326
H.13 L1 D$ accesses per µop for gcc.200 (INT, C, C Compiler) . . . . 326
H.14 L1 D$ accesses per µop for gcc.expr (INT, C, C Compiler) . . . 327
H.15 L1 D$ accesses per µop for gcc.int (INT, C, C Compiler) . . . . 327
H.16 L1 D$ accesses per µop for gcc.sci (INT, C, C Compiler) . . . . 328
H.17 L1 D$ accesses per µop for mesa (FP, C, 3D-graphics) . . . . . . . 328
H.18 L1 D$ accesses per µop for galgel (FP, F90, Fluid Dynamics) . . 329
H.19 L1 D$ accesses per µop for art.110 (FP, C, Neural Networks) . 329
H.20 L1 D$ accesses per µop for art.470 (FP, C, Neural Networks) . 330
H.21 L1 D$ accesses per µop for mcf (INT, C, Combinatorial Opt) . . . 330
H.22 L1 D$ accesses per µop for equake (FP, C, Seismic Propagation) 331
H.23 L1 D$ accesses per µop for crafty (INT, C, Chess) . . . . . . . . 331
H.24 L1 D$ accesses per µop for facerec (FP, F90, Facial Recognition) 332
H.25 L1 D$ accesses per µop for ammp (FP, C, Chemistry) . . . . . . . . 332
H.26 L1 D$ accesses per µop for lucas (FP, F90, Number Theory) . . . 333
H.27 L1 D$ accesses per µop for fma3d (FP, F90, Crash Simulation) . . 333
H.28 L1 D$ accesses per µop for parser (INT, C, Word Processing) . . 334
H.29 L1 D$ accesses per µop for sixtrack (FP, F77, Nuclear Physics) 334
H.30 L1 D$ accesses per µop for eon.cook (INT, C++, Computer
Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
H.31 L1 D$ accesses per µop for eon.kaj (INT, C++, Computer
Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
H.32 L1 D$ accesses per µop for eon.rush (INT, C++, Computer
Graphics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
H.33 L1 D$ accesses per µop for perlbmk.mkrnd (INT, C, Scripting
Language) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
H.34 L1 D$ accesses per µop for perlbmk.perf (INT, C, Scripting
Language) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
H.35 L1 D$ accesses per µop for gap (INT, C, Group Theory) . . . . . . 337
H.36 L1 D$ accesses per µop for bzip2.graph (INT, C, Compression) 338
H.37 L1 D$ accesses per µop for bzip2.prog (INT, C, Compression) . 338
H.38 L1 D$ accesses per µop for bzip2.src (INT, C, Compression) . 339
H.39 L1 D$ accesses per µop for twolf (INT, C, Place/Route) . . . . . 339




We investigate various methods of speeding up computer architectural sim-
ulation, validating the results against real hardware using performance coun-
ters. We evaluate RISC, CISC, and CMP-CISC systems. We find that a Dy-
namic Binary Instrumentation (DBI) based simulation methodology improves
run-time over cycle-accurate simulation by at least an order of magnitude, en-
abling more complete results using full input sets.
Our primary motivation is the Memory Wall [162], the observation that modern
system performance is held back by the speed of the memory system. While the
steady increase in processor speeds has abated somewhat, Moore’s Law con-
tinues to provide more transistors to chip designers. This leads to an increase
in the number of processors and threads located per chip, which increases the
demands on already overloaded memory systems.
In order to address the Memory Wall and other performance problems, the
underlying architecture must be studied in detail. The only practical way to do
this is with simulators, usually fully in software, that simulate various parts of
a computer. As systems get more complicated, simulators get larger, slower,
and harder to understand. With the decline of RISC processors and the rise of
the Intel x86 architecture, it has become increasingly difficult to create relevant
cycle-accurate simulators. Each additional generation of features slows simulators
further; development time is spent enhancing micro-architectural simulation
rather than tuning for speed. Often external effects such as I/O and DRAM





















































































Figure 1.1: Weighted slowdowns of various simulators when running
SPEC CPU2000
these externalities are critical to overall performance.
Simulation speed is critical in architectural research. Unfortunately even the
fastest DBI methods slow execution by almost a factor of 30, and cycle-accurate
simulators slow execution by over a factor of 100 (see Figure 1.1 for slowdowns
from various common academic simulation methods). These slowdowns are
enough to make a minutes-long benchmark take over a day to execute. Some
simulators, especially those of complex out-of-order processors, can slow execu-
tion by over 1000 times, making single simulations take weeks to months. This
drastically increases the hardware required for experiments, as large clusters of
computers are required to run simulations in parallel to mitigate long execution
times. Simulator bugs become difficult to find, as it might take days to repro-
duce problems, leading to inefficient debugging sessions. Validation against
real hardware also suffers, as proper validation requires many iterations of runs

























Figure 1.2: Instruction set diversity across various domains. Recent com-
puter architecture conference papers (ICCD’09, ISCA’09, MI-
CRO’08 and ASPLOS’09) match years-old high-performance
computing diversity rather than modern trends in computing
Because cycle-accurate simulators are so slow, various methods have been
proposed for reducing execution time. Unfortunately these methods can introduce
differences in results of 10-50% when compared to full reference
input sets [168]. We investigate the effectiveness of many of these methods in
Chapter 3. The reduced methods involve simulating small amounts of code, often
only a few hundred million to a few billion instructions. On a modern multi-gigahertz
chip this equates to less than a second of run time; the reduced execution
can miss longer timescale events such as disk and network I/O, operating
system context-switches, thermal events, etc. To allow for longer-running input
sets, we evaluate using faster Dynamic Binary Instrumentation (DBI) methods
of execution.
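The time scales discussed above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a measurement: the 3 GHz clock, the CPI of 1.0, and the 15-minute benchmark length are assumed values chosen for illustration, while the slowdown factors of 30, 100, and 1000 are the approximate figures quoted in the text.

```python
# Illustrative arithmetic only. Clock rate, CPI, and benchmark length are
# assumptions for this sketch; they are not measured values.

def native_seconds(instructions, clock_hz=3.0e9, cpi=1.0):
    """Wall-clock time on real hardware for a given instruction count."""
    return instructions * cpi / clock_hz


def simulated_seconds(native_s, slowdown):
    """Wall-clock time when the same work runs under a simulator."""
    return native_s * slowdown


if __name__ == "__main__":
    # Typical reduced-execution windows: a few hundred million to a few billion.
    for window in (100e6, 500e6, 2e9):
        print(f"{window:.0e} instructions: {native_seconds(window):.3f} s native")

    benchmark_native_s = 15 * 60  # assumed 15-minute benchmark
    for name, slowdown in (("DBI", 30),
                           ("cycle-accurate", 100),
                           ("detailed out-of-order", 1000)):
        hours = simulated_seconds(benchmark_native_s, slowdown) / 3600
        print(f"{name} ({slowdown}x): {hours:.1f} hours")
```

Under these assumptions a two-billion-instruction window covers well under a second of native execution, while the 15-minute benchmark needs more than a day at a slowdown of 100.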
Another current limitation is the lack of accurate and up-to-date academic
simulators. Once available simulators become “good enough” there is little in-
centive to do more than incremental improvements. The computer industry
moves quickly and simulators cannot keep pace. This leads to simulators supporting
older, simpler architectures. As shown in Figure 1.2, the mix of architectures
simulated in papers from recent conferences (ICCD’09, ISCA’09, MICRO’08
and ASPLOS’09) is unlike that of any current workload (gaming [1], embedded [4],
desktop, or high performance computing [5]); in fact the most similar
workload we find is the Top 500 list [5] from seven years ago. This dependence on
older architectures makes it difficult to determine if suggested improvements
apply to current implementations. Speedups for obsolete systems might not be
relevant to the vastly different chips being produced today. We attempt to avoid
these limitations by investigating current 64-bit x86 architectures in addition to
a more traditional RISC MIPS simulation environment.
One final barrier to accurate simulation is the recent proliferation of multi-
core machines. Processor designers are limited by how much performance they
can squeeze out of a design before thermal and power issues make the design
infeasible. Moore’s Law of ever-increasing transistor counts still holds, but the
most common solution is to place more cores per chip. Most non-embedded
processors have at least two cores, if not more, per package. This is unfortunate,
as adding cores complicates simulators and makes them slower. Most simula-
tors are single-threaded themselves, and can only simulate multiple cores by in-
terleaving execution. This causes a linear increase in slowdowns as more cores
are added, compounding the already critical slowness in simulations. In addi-
tion, many of the techniques used to improve single-core simulation either do
not work for CMP or else have not been tested thoroughly enough for their accuracy
tradeoffs to be known. We undertake preliminary examinations to address the CMP problem




Slow simulators are an ever-present limitation, holding back the work of
computer architects. There is much work attempting to mitigate the problem,
some more successful than others. The problem has been attacked from many
angles, and there are many related topics that also must be investigated. Our
work encompasses many of these different areas, requiring comparison to a
large body of related work.
2.1 Reduced Execution Validations
One way to speed simulations is to simulate smaller workloads. Yi et al. [168,
169, 170] investigate the six most common ways of workload reduction:
representative sampling (SimPoint [132]); statistics-based sampling (SMARTS [163]);
reduced input sets (SPEC training inputs, MinneSPEC [77]); simulating the first
X million instructions; fast-forwarding Y million instructions and then simulating
X million; and fast-forwarding Y million, performing architectural warmup,
and then simulating X million. They conclude that SimPoint and SMARTS give the
most accurate results. We find similar results, although we do not investigate
statistics-based sampling. Their work uses the WATTCH [28] simulator to char-
acterize their results using ten SPEC CPU2000 benchmarks; we use hardware
performance counters and the complete CPU2000 and 2006 benchmarks while
exploring more architectures and compilers.
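The three truncation-style reductions in the list above differ only in which instruction ranges receive detailed simulation. The sketch below expresses each as a plan of instruction windows. The function names, the `Window` tuple, and the mode labels are hypothetical and introduced only for illustration; they are not taken from Yi et al. or from any simulator.

```python
from typing import List, Tuple

# (first instruction, one past the last instruction, mode)
Window = Tuple[int, int, str]

M = 1_000_000


def first_x(x: int) -> List[Window]:
    """Simulate only the first x instructions in detail."""
    return [(0, x, "detailed")]


def ff_then_x(y: int, x: int) -> List[Window]:
    """Fast-forward y instructions, then simulate x in detail."""
    return [(0, y, "fast-forward"), (y, y + x, "detailed")]


def ff_warmup_then_x(y: int, w: int, x: int) -> List[Window]:
    """Fast-forward y, warm architectural state for w, then simulate x."""
    return [(0, y, "fast-forward"),
            (y, y + w, "warmup"),
            (y + w, y + w + x, "detailed")]


def detailed_fraction(plan: List[Window], total_instructions: int) -> float:
    """Fraction of the full run that receives detailed simulation."""
    detailed = sum(end - start for start, end, mode in plan if mode == "detailed")
    return detailed / total_instructions


if __name__ == "__main__":
    plan = ff_warmup_then_x(1000 * M, 10 * M, 100 * M)
    print(plan)
    # Assumed run length of 100 billion instructions, for illustration only.
    print(detailed_fraction(plan, 100_000 * M))
```

With these example values only 0.1% of the run is simulated in detail, which is why such schemes are fast and also why they can miss long-timescale behavior.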
Another method of reducing run time is to avoid running redundant bench-
marks. Phansalkar et al. [122] use various techniques to determine redundancy
in the SPEC CPU2006 suite and propose eliminating the need to run all of the
benchmarks. They use Pin, as well as hardware performance counters on four
different architectures (Power, SPARC, Itanium, x86) and find that 6 of 12 integer
and 8 of 17 FP benchmarks can capture most of the overall benchmark behavior.
Eeckhout et al. [49] propose a hybrid method of reduced inputs and sam-
pling, determining in advance which method works best on a benchmark-by-
benchmark basis.
There have been attempts to speed simulation by moving to a hardware
based approach. Chiou et al. [33, 32] propose FAST, which is a timing simulator
implemented in an FPGA. This speeds simulation by having the slow timing
simulation implemented in fast hardware. The Qemu tool is used to gener-
ate traces which are fed into the timing simulator; Qemu is modified to handle
wrong-path execution. Qemu also handles correctness and operating system
issues. This work is not validated, and its speed is limited by the Qemu trace
generation.
2.2 SimPoint Validation
There have been many papers published that investigate the SimPoint method-
ology; our work encompasses more architectures and more implementations
than any previous work. We also explore in detail the speed and accuracy of
Basic Block Vector (BBV) generation, which is a critical step in undertaking Sim-
Point analysis (an extension of our HiPEAC work [155]). BBV generation is not
discussed in depth in previous papers. We validate our results with hardware
performance counters, whereas most previous work validates solely using sim-
ulation.
Sherwood, Perelman, and Calder [131] introduce the SimPoint methodology,
which uses basic block distribution to investigate phase behavior. They use
SimpleScalar [30] to generate the BBVs, as well as to evaluate the results for
the Alpha architecture. They show preliminary results for three of the SPEC95
benchmarks and three of the SPEC CPU2000 benchmarks. They build on this
work and introduce the original SimPoint tool [132]. They use ATOM [135] to
collect the BBVs and SimpleScalar to evaluate the results for the SPEC CPU2000
benchmark suite. They use an interval of 10M instructions, and find an average
18% IPC error when using one simulation point for each benchmark, and 3% IPC
error when using between 6 and 10 simulation points. These results roughly match ours.
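To make the notion of a Basic Block Vector concrete, the following minimal sketch builds one normalized vector per fixed-size instruction interval from a stream of executed basic blocks. It rests on stated assumptions: the trace format of `(block_id, instruction_count)` pairs is hypothetical, each entry is weighted by the number of instructions executed in that block (our understanding of the usual BBV convention), and an interval is closed at the first block boundary at or past the interval size. The clustering step that selects simulation points is omitted.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def normalize(counts: Counter) -> Dict[int, float]:
    """Scale a vector so that its entries sum to 1."""
    total = sum(counts.values())
    return {block: count / total for block, count in counts.items()}


def build_bbvs(trace: Iterable[Tuple[int, int]], interval: int) -> List[Dict[int, float]]:
    """Build one BBV per interval from (block_id, instructions_in_block) pairs."""
    bbvs: List[Dict[int, float]] = []
    current: Counter = Counter()
    executed = 0
    for block_id, length in trace:
        current[block_id] += length   # weight each block by its instruction count
        executed += length
        if executed >= interval:      # close the interval at a block boundary
            bbvs.append(normalize(current))
            current = Counter()
            executed = 0
    if current:                       # keep a trailing partial interval
        bbvs.append(normalize(current))
    return bbvs


def manhattan(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Manhattan distance between two BBVs; 0 means identical code mix."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


if __name__ == "__main__":
    # Toy trace: one phase alternating blocks 1 and 2, then a phase in block 3.
    trace = [(1, 5), (2, 5)] * 2 + [(3, 10)] * 2
    vectors = build_bbvs(trace, interval=20)
    print(vectors)
    print(manhattan(vectors[0], vectors[1]))
```

Intervals that execute the same code in the same proportions produce nearby vectors, which is what allows a small number of intervals to stand in for the whole run.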
Perelman, Hamerly and Calder [118] investigate finding “early” simulation
points that can minimize fast-forwarding in the simulator. We do not investi-
gate early points as that functionality is no longer available in current versions
of the SimPoint tool. When looking at a configuration similar to ours, with 43
of the SPEC2000 reference input combinations, 100M instruction intervals, and
up to 10 simulations per benchmark, they find an average CPI error of 2.6%.
This is better than what we find using performance counters. They collect BBVs
and evaluate results with SimpleScalar, showing that the results on one archi-
tectural configuration track the results on other configurations while using the
same simulation points. We also find this to be true when comparing different
implementations of the same instruction set architecture.
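The CPI error figures quoted in this section compare a whole-program CPI against an estimate assembled from a few simulation points. A minimal sketch of that calculation follows, assuming each simulation point carries a weight equal to the fraction of intervals in its phase; the function names and the numeric values are illustrative and are not results from any of the cited papers.

```python
from typing import List, Tuple


def weighted_cpi(points: List[Tuple[float, float]]) -> float:
    """Estimate whole-program CPI from (weight, cpi) simulation points."""
    total_weight = sum(weight for weight, _ in points)
    return sum(weight * cpi for weight, cpi in points) / total_weight


def percent_error(estimate: float, reference: float) -> float:
    """Absolute relative error of an estimate, in percent."""
    return abs(estimate - reference) / reference * 100.0


if __name__ == "__main__":
    # Made-up weights and per-point CPI values, for illustration only.
    points = [(0.5, 1.0), (0.3, 2.0), (0.2, 1.5)]
    estimate = weighted_cpi(points)
    reference = 1.35  # stands in for a full-run CPI from counters or simulation
    print(f"estimated CPI {estimate:.3f}, error {percent_error(estimate, reference):.1f}%")
```

Because the weights are fractions of executed instructions, an arithmetic weighted mean is the consistent choice for CPI; the same weights applied to IPC would call for a harmonic mean instead.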
When reporting results that use the SimPoint methodology, often no men-
tion is made of how the underlying BBV files are collected. If not specified, it
is usually assumed that the original method described by Sherwood et al. [132]
is used, which involves ATOM [135] or SimpleScalar [30]. The SimPoint web-
site provides pre-generated simulation points for a set of Alpha SPEC CPU2000
binaries; the use of these makes gathering BBV files unnecessary. Some works
mention BBV generation briefly, with no indication of any validation. For exam-
ple, Nagpurkar and Krintz [107] implement BBV collection in a modified Java
Virtual Machine in order to analyze Java phase behavior, but do not specify the
accuracy of the resulting phase detection.
Patil et al.’s PinPoints [115] use the Pin [87] tool to gather BBVs, and then val-
idate the results on the Itanium architecture using performance counters. This
work predates the existence of Pin for x86, so no x86 results are shown. Their
results show that 95% of the SPEC CPU2000 benchmarks have under 8% CPI
error when using up to ten 250M instruction intervals. All their benchmarks
complete with under 12% error, which is more accurate than our results. This is
potentially due to their use of much longer intervals. They also investigate com-
mercial benchmarks, and find that the results are not as accurate as the SPEC
results.
Perelman et al. [120] look at cross-binary simulation points. These rely on
code path traces instead of plain BBVs, which allows the SimPoint methodology
to be applied across different compilations and even different architectures (as
long as they are compiled from the same source code). This does require a more
complicated data collection infrastructure, and requires gathering data on all
architectures of interest. Using CMP$im [75] they find error similar to using
regular SimPoints. In our work we generate cross-platform SimPoints by using
a cross-platform simulator, which is a much simpler solution.
Nair and John’s [108] work on SPEC CPU2006 SimPoints postdates ours.
They use PinPoints and performance counters on a Pentium 4, simulating up to
30 simulation points. They find average error of 2.45% for SPEC CPU2006 and
2.15% for SPEC CPU2000. This is better than our results, but they simulate more
of the benchmarks by at least a factor of three.
Ganesan et al. [55] look at SimPoint results for SPEC CPU2006 using Alpha
binaries on the sim-alpha simulator. No validation against real hardware is
performed.
2.3 Performance Counter Validation
We notice irregularities when validating our BBV generation methods using
hardware performance counters, leading us to validate the counters themselves.
The previous work on the topic is not as comprehensive as our investigations,
first presented at IISWC’08 [153].
Black et al. [24] use performance counters to investigate the total number of
retired instructions and cycles on the PowerPC 604 platform. Unlike our work,
they compare their results against a cycle-accurate simulator. The study uses
a small number of benchmarks (including some from SPEC92), and the total
number of instructions executed is many orders of magnitude fewer than in our
work.
Patil et al. [115] validate SimPoint generation using CPI from Itanium per-
formance counters. They compare different machines, but only the SimPoint-
generated CPI values, not the raw performance counter results.
Sherwood et al. [132] compare results from performance counters on the Alpha
architecture with SimpleScalar [13] and the ATOM [135] DBI tool. They do
not investigate changes in counts across more than one machine.
Korn, Teller, and Castillo [78] validate performance counters of the
MIPS R12000 processor via microbenchmarks. They compare counter re-
sults to estimated (simulator-generated) results, but do not investigate the
instructions graduated metric (the MIPS equivalent of retired instruc-
tions). They report up to 25% error with the instructions decoded counter
on long-running benchmarks, though this is possibly due to the 20% error in-
herent in the simulator itself [43].
Maxwell et al. [95] look at accuracy of performance counters on a variety of
architectures, including a Pentium III system. They report less than 1% error on
the retired instruction metric, but only for microbenchmarks and only on one
system.
Mathur and Cook [94] look at hand-instrumented versions of nine of the
SPEC 2000 benchmarks on a Pentium III. They only report relative error of using
sampled versus aggregate counts, and do not investigate overall error.
DeRose et al. [40] look at variation and error with performance counters on
a Power3 system, but only for startup and shutdown costs. They do not report
total benchmark behavior.
Zaparanuks et al. [173] investigate the accuracy of the cycle count on various
x86 processors, as gathered by three different measurement infrastructures.
Mytkowicz et al. [105] investigate sources of non-deterministic execution,
but look at causes for variations in run-time rather than retired instruction count.
Keeton et al. [76] use performance counters to thoroughly investigate the
behavior of a parallel Pentium Pro based system. They swap CPU boards to
vary cache sizes, allowing analysis to investigate changing hardware parame-
ters. They did not compare against simulators.
2.4 Single-core DBI-Based Simulation
The DBI-based simulation methodology we use is inherently similar to trace-
based simulation [144]. The idea of generating traces on the fly and feeding
architectural simulation is not new. Our contribution is in validating the gen-
erated results against reduced input methods, hardware performance counters
and cycle-accurate simulators.
2.4.1 Valgrind
Valgrind [113] is a dynamic binary instrumentation tool for the PowerPC, x86,
x86_64 and ARM architectures. It is a generic and flexible DBI utility originally
designed to detect application memory allocation errors. It comes with a single-
core memory simulator called cachegrind.
2.4.2 Pin
Pin [87] is a fast DBI tool that runs on Intel architectures (including x86, x86_64,
and Itanium), and supports the Linux and Windows operating systems. Pin
comes with some simple cache and branch simulator tools. It can be used to
generate more complicated cache simulations; see CMP$im in Section 2.5.1.
2.4.3 Qemu
Qemu [18] is a DBI-based simulator that can simulate a large number of plat-
forms, and also can simulate full operating systems. Qemu has no native cache
simulation; any simulation done has to be patched into the binary. It is the only
DBI tool that we investigate that can simulate cross-platform. We use a patched
version of Qemu in conjunction with the Dinero [48] cache simulator in our
WDDD’08 [152] work.
2.4.4 TAXI
Vlaovic and Davidson develop TAXI [148], which uses a Bochs-based front end
to generate traces that are fed to a cycle-accurate simulator modeling an earlier
x86 machine. They attempt to validate this method using performance coun-
ters, and find their major limiting factor to be lack of documentation for the
architecture they are trying to model.
2.5 Multi-core Simulation
A number of academic cycle-accurate simulators have various levels of multi-core
simulation support, the most popular being SESC [125], m5 [22], and Simics/GEMS [91].
We find these to be slow, and look to DBI for speed gains. Some
simulators incorporate DBI methods for speed.
2.5.1 CMP$im
CMP$im [75, 74] is the most similar project to ours. The Pin DBI tool feeds the
results from x86 simulation into a custom CMP cache simulator. Their results
match an unspecified cycle-accurate model to within 13% (4% for benchmarks
with low branch predictor misses) on SPEC CPU2006 with full input sets. They
also run ammp from SPEC OMP and multi-programmed SPEC CPU2006 work-
loads (no validation was done on these results). Their cache-simulator imple-
ments a MESI-like coherence protocol. It is configurable in the number of levels,
privacy, inclusion, associativity, allocation, and replacement policy. Unlike our
work they only compare results against a simulator and not against actual hard-
ware. Their simulation runs at a speed of 4-10MIPS.
2.5.2 Other
PTLsim [172, 171] is a cycle-accurate simulator that uses DBI internally for
speed. It is described more fully in Section 2.6.
Goldschmidt and Hennessy [57] investigate multi-threaded trace simulation,
as compared to cycle-accurate simulation. They find the methods to be equiv-
alent, except in cases where synchronization matters or where the metric mea-
sured has a small value.
Li et al. [84] generate traces using IBM’s Turandot/PowerTimer (a cycle-
accurate simulator) but then use these traces multiple times to feed Zauber, a
cache simulator. Re-using the traces mitigates the overhead of using a cycle-
accurate simulator.
Lee et al. [81] propose Composable Performance Regression which uses
uniprocessor and contention models to predict multiprocessor performance.
Once trained, the models can predict multiprocessor performance with median
errors of under 7%.
Donald and Martonosi [47] create a parallel version of PowerPC Sim-
pleScalar that can run CMP simulations, and is multi-threaded itself, giving
a speedup of 2-3x on a multi-core system. This is still much slower than the
benefits achievable by using DBI.
Muzahid et al. [103] detect data races using a Pin-based simulation method
that feeds into an unspecified MESI cache simulator. They do not investigate
performance or perform any validation.
Luo et al. [89] investigate speculative threads using a custom version of Sim-
pleScalar fed by Pin. They do not validate their results or comment on perfor-
mance.
2.6 Cycle-Accurate x86 Simulators
The x86 architecture has been the dominant desktop platform for a long time,
and more recently it has begun dominating in server and high-performance
computing situations. There is an ongoing push to use the architecture more
frequently in embedded systems as well. Any architectural study that avoids
investigating x86 limits the relevance of the results. Unfortunately academic
simulators are just now catching up to using x86 and many studies still use ob-
solete RISC architectures.
Loh et al. [86] present Zesto, which is a detailed x86 cycle-accurate simulator
based on SimpleScalar. It is designed with accuracy, not speed, in mind. They
have validated it versus wall-clock time on a series of microbenchmarks and
found around 5% error.
m5 [22] has recently acquired x86 support (with many contributions by us).
It has not been validated except by the work in this thesis. It currently cannot
run in detailed out-of-order or in-order modes.
PTLsim [172, 171] is a DBI-based full-system simulator that runs x86 binaries.
There is an SMT mode available but it uses a simplistic cache coherence
scheme and does not model system memory at all. MPTLsim [174], an enhanced
CMP version, is described but is not currently available.
2.7 Simulator Validations
Cycle-accurate simulators are often used without concern that results match real
hardware. This limits architectural studies, as the magnitude of error in the re-
sults is unknown. We list previous studies that attempt validation of simulators.
Gibson et al. [56] validate various MIPS simulators against their R10000-
based FLASH system. They find that even their most carefully designed simu-
lators have surprisingly large errors. They, like we, call into question the value
of highly detailed simulators that are not validated against real hardware.
Black et al. [24, 25] create a model of the PowerPC 604 processor and validate
it using hardware performance counters. They use a small set of benchmarks
for validation, and try to reduce error. Interestingly, they find that fixing bugs
in the simulator can actually increase the error in simulation because previous
errors masked other bugs.
Desikan, Burger, Keckler and Austin [42, 43] validate the sim-alpha cycle-
accurate simulator. They find that the generic sim-outorder simulator has up-
wards of 40% error, and even a fine-tuned attempt to match an actual Alpha
machine still yields errors of around 15%. They run 22 of the SPEC CPU2000
benchmarks.
SimOS [127] is a full-system simulator. It was the first simulator to use DBI
internally; its DBI implementation is called Embra [157]. It is only 3-9x slower
than actual hardware, although with the cache simulator enabled it is 7-20 times
slower. It models a 32-bit MIPS R3000 and can run SGI IRIX. Parallel Embra can
run parallel simulations which scale with multiple host cores. It models DASH-
like directory memory coherence and has been validated against the MIPSy
simulator (and MIPSy has been validated against a real machine, within 1-2%
for uniprocessor) [128, 17]. Currently the project is not under development,
although a version for PowerPC running AIX is developed by IBM [2] and a
version that can run Linux is also developed [159].
XTREM [37] is a validated ARM simulator. It matches real hardware to
within 4% for thermal measurements and 7% on IPC using some MiBench and
Java benchmarks. The IPC numbers are collected using hardware performance
counters. XEEMU [66] is another ARM simulator validated with performance
counters. They claim better results than XTREM. Varma et al. [147] look at
power estimation on ARM using a simulator based on Intel’s Xsim, which is a
simulator found to be within 2% of hardware for memory accesses.
SIGMA [41] is a memory system simulator validated to match real hardware
within 1% using performance counters for a Power3 system. They only validate
against one benchmark, swim, from SPEC CPU2000.
Barroso et al. [16] use hardware counters to validate SimOS on Alpha, as
well as to characterize a memory subsystem. They also use the static binary
instrumentation tool ATOM, but they in the end do not elaborate on their use of
ATOM to gather traces.
2.8 Multi-processor Phase Detection
Many of the reduced execution methods previously mentioned, including Sim-
Point, will not work for multi-processor workloads. This severely limits multi-
processor studies. Some attempts have been made to solve this problem.
Perelman et al. [121] find multi-processor SimPoints by gathering info for
each thread individually, then aggregating the count.
Namkung et al. [109] synthesize samples from similar phase combinations.
They find that they can reduce sampling by 90% with error of less than 5%.
Van Biesbrouck et al. [146] use a technique called a Co-Phase Matrix to com-
bine single-threaded phase behaviors in order to estimate performance on SMT
systems. They found an error rate of 4% while only requiring 1% to be run for
28 pairs of Alpha SPEC CPU2000 benchmarks on the m5 simulator. This work
is extended [145] to consider multiple benchmark starting points for higher ac-
curacy.
Ekman and Stenstrom [50] use matched-pair comparison in conjunction
with statistical sampling to reduce the amount of simulation needed for multi-
processor simulations.
Gonzalez et al. [58] propose using hardware performance counters in con-
junction with density-based clustering algorithms to detect phases in parallel
applications.
2.9 Deterministic Execution
Comparing performance results of CMP systems and simulators is difficult, as
inherent non-determinism in the executions makes it nearly impossible to compare
results fairly. There have been many attempts at creating practical deterministic
execution environments for CMPs; the methods proposed often require
hardware modification and thus are not available on commodity processors.
Examples of this are Capo [98], DMP [44], Delorean [97], and Flight Data
Recorder [165].
The most promising implementation that requires no hardware modifica-
tion is Kendo [114]. They use performance counters to enforce deterministic
context switching. The retired stores counter is used; they, like us, found that
on x86 the retired instructions counter includes interrupt counts. They modify
the pthreads package to have a new type of deterministic lock. When running
in deterministic mode there is an overall overhead of 16%.
Pereira et al. [117] present a method of deterministic execution that uses
DBI instrumentation to gather dependence information. Data collection is 27x
slower, but running in a simulator is actually slightly faster due to elimination
of stalls.
Alameldeen et al. [8] propose using random perturbations and statistical
methodology to mitigate non-determinism in simulation.
Narayanasamy et al. [110] log operating system effects in order to have
deterministic multi-threaded workloads, but only when simulating multiple
threads on a single core.
Lepak et al. [83] enhance a simulator to enable deterministic execution by
recording various sources of non-determinism.
2.10 Performance Counter based CPI Prediction
DBI simulations have no concept of cycle time, making prediction of metrics
such as CPI or IPC difficult. Various other groups have looked at estimating
cycles from other performance metrics.
Amato et al. [9] use performance counters on the R10000 to predict parallel
application performance.
Marin et al. [90] predict execution time and cache misses on R12000 proces-
sors, and compare the results with hardware performance counters.
Luo et al. [88] estimate CPI values based on memory performance counter
results on MIPS R10000. They find good results.
Eyerman et al. [52] use interval analysis on out of order processors and try
to determine the causes of stalls that impact CPI. The Power5 processor has
performance counter hardware dedicated to generating these CPI stacks.
Bhargava et al. [21] enhance CPI numbers by modeling speculative instruc-
tion execution when generating program traces. A “resurrection” tree can be




METHODS OF REDUCING SIMULATION TIME
A common way of reducing simulation time is using reduced execution
methods. This involves running only a small part of a workload and extrapolat-
ing total behavior. This inherently adds error to the results, but the dramatically
decreased runtime is often deemed worth it.
Running reduced inputs can have problems besides accuracy. For one, not
running full inputs means the final benchmark results are not generated, which
is an important step in determining if the simulator is working properly. Subtle
bugs that are not severe enough to crash the simulation but are significant enough to
skew results can be hidden if the program subset being run does not generate I/O that
can be compared to known good results. Another problem with reduced inputs
is the loss of results that can only be observed over relatively long time peri-
ods. For example, temperature fluctuations happen on the order of many sec-
onds, and reduced input methods often reduce simulation times to sub-second
lengths of time.
Yi et al. [168] investigate common methods of speeding up simulations,
which they break up into six categories (some additional methods are described
in Section 2.1):
• Representative sampling (SimPoint [132]),
• Statistics based sampling (SMARTS [163]),
• Reduced input sets (such as training inputs, or MinneSPEC [77]),
• Simulating the first X Million instructions,
• Fast-forwarding Y Million instructions and simulating X Million, and
• Fast-forwarding Y Million, performing architectural warmup, then simu-
lating X Million.
They conclude that SimPoint and SMARTS give the most accurate results, with
differences in the 10% range. The other methods can have upward of 50% differ-
ence when compared to running full benchmarks. They investigated 10 years
(from 1995 to 2005) of HPCA, ISCA, and MICRO papers and found that over
70% use reduced simulation methods. This shows how critical fast simulation
is to the architectural community, and how important it is to understand the
accuracy tradeoffs introduced by these methods.
We evaluate several of the reduced execution methods in order to compare
the results with our dynamic binary instrumentation based approach that uses
full input sets.
3.1 Running a Small Portion from the Beginning
The simplest form of reduced execution is simply to start at the beginning and
execute for some number of instructions, usually a few billion. It turns out that
this has poor accuracy, as often the beginning of a program is one-time initial-
ization and startup routines and is not representative of full program execution.
We look at this method in our analysis.
3.2 Un-guided Fast-forwarding
Another method is to fast-forward deeper into a program (most simulators support
running a faster, functional mode that can then be switched into a slower
cycle-accurate mode). Usually the program is fast-forwarded by a billion or
more instructions before starting detailed simulation. This usually avoids the
startup region of a program, but it is still not necessarily representative of the
rest of the program. We also look at this method in our analysis.
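As a rough illustration of the two unguided methods in Sections 3.1 and 3.2, the following Python sketch estimates CPI from a per-interval trace of (instructions, cycles) pairs. The trace layout, the function names, and the numbers are assumptions made for this example; they are not taken from any simulator or from our measurements.

```python
def window_cpi(trace, skip, length):
    """Estimate CPI from `length` intervals after skipping `skip` intervals.

    `trace` is a list of (instructions, cycles) pairs, one per fixed-size
    interval (a hypothetical layout, e.g. one entry per 100M instructions).
    """
    window = trace[skip:skip + length]
    if not window:
        raise ValueError("window lies beyond the end of the trace")
    instructions = sum(i for i, _ in window)
    cycles = sum(c for _, c in window)
    return cycles / instructions


def first_n_cpi(trace, length):
    """Section 3.1: simulate only the first `length` intervals."""
    return window_cpi(trace, 0, length)


def fast_forward_cpi(trace, skip, length):
    """Section 3.2: fast-forward `skip` intervals, then simulate `length`."""
    return window_cpi(trace, skip, length)


# Synthetic trace: two slow start-up intervals followed by steady state.
trace = [(100, 400), (100, 300)] + [(100, 150)] * 8
full_cpi = window_cpi(trace, 0, len(trace))
print(first_n_cpi(trace, 1), fast_forward_cpi(trace, 2, 1), full_cpi)
```

In this synthetic trace neither window matches the whole-program value: the first interval is dominated by start-up, and the fast-forwarded window misses it entirely.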
3.3 Reduced Input Sets
Yi et al. found that using reduced input sets, such as MinneSPEC or the SPEC
training input sets, had worse accuracy than SimPoint while requiring much
more execution. We do present some results for the SPEC training inputs in
Section 3.5.3 which agree with that analysis.
3.4 Statistics-based Sampling
Statistics-based sampling, such as SMARTS [163], is a method of reducing runtime
by gathering detailed statistics from various parts of the execution. It has
high accuracy, but it requires large amounts (gigabytes) of disk space. Yi et al.
found that the results were not much better than using multiple SimPoints. We
did not investigate this type of reduced execution.
3.5 SimPoint
SimPoint [62, 118, 119, 131, 132] exploits the phase behavior of programs. Many
applications exhibit cyclic behavior: code executing at one point in time behaves
similarly to code running at some other point. Entire program behavior can be
approximated by modeling only a representative set of intervals (in our case,
simulation points or SimPoints).
Figures 3.1, 3.2, and 3.3 show examples of program phase behavior at a gran-
ularity of 100M instructions; these are captured using hardware performance
counters from representative SPEC CPU2000 benchmarks. Each figure shows
two metrics: the top is L1 D-Cache miss rate, and the bottom is cycles per in-
struction (CPI). Figure 3.1 shows twolf, which exhibits almost completely uni-
form behavior. For this type of program, one interval is enough to approximate
whole-program behavior. Figure 3.2 shows the mcf benchmark, which has more
complex behavior. Periodic behavior is evident: representative intervals from
the various phases can be used to approximate total behavior. The last example,
Figure 3.3, shows the extremely complex behavior of gcc running the 200.i
input set. Few patterns are apparent; this type of program is difficult to approx-
imate with the SimPoint methodology (smaller phase intervals are needed to
recognize patterns, and variable-size phases are possible, but choosing appro-
priate interval lengths is non-trivial). A complete set of CPI phase plots for x86
and x86_64 can be found in Appendix E.

Figure 3.1: L1 Data Cache and CPI behavior for twolf: behavior is uni-
form, with one phase representing the entire program.

Figure 3.2: L1 Data Cache and CPI behavior for mcf: several recurring
phases are evident.

Figure 3.3: L1 Data Cache and CPI behavior for gcc.200: this program




To generate the simulation points for a program, the SimPoint tool needs a Ba-
sic Block Vector (BBV) describing the code’s execution. Dynamic execution is
split into intervals (often fixed size, although that is not strictly necessary). In-
terval size is measured by number of committed instructions, usually 1M-300M
instructions. Smaller sizes enable finer grained phase detection; larger sizes
mitigate warmup error when fast-forwarding (without explicit state warmup) in a
simulator. We use 100M instruction intervals, which is a common compromise.
During execution, entry into all basic blocks is tracked along with a count of
how many times each block is executed. The block count is weighted by the
number of instructions in each block to ensure that instructions in smaller basic
blocks are not given disproportionate significance. When total instruction count
reaches the interval size, the basic block list and frequency count are appended
to the BBV file.
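The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of our Qemu or Valgrind tools: the `block_stream` input is a hypothetical stand-in for the per-block callbacks a DBI tool provides, and the `T:id:count` line layout reflects our understanding of the frequency-vector format that SimPoint reads, so treat it as an assumption.

```python
from collections import Counter


def generate_bbv(block_stream, interval_size):
    """Yield one weighted basic block vector per interval.

    `block_stream` yields (block_id, instruction_count) pairs, one per
    dynamic basic block entry (hypothetical input for this sketch).
    """
    vector = Counter()
    executed = 0
    for block_id, size in block_stream:
        # Weight each entry by the number of instructions in the block.
        vector[block_id] += size
        executed += size
        if executed >= interval_size:
            yield dict(vector)
            vector = Counter()
            executed = 0
    if vector:
        yield dict(vector)  # final, partial interval


def format_bbv_line(vector):
    """Format one interval as a 'T:id:count :id:count' line (assumed layout)."""
    pairs = " ".join(f":{bid}:{count}" for bid, count in sorted(vector.items()))
    return "T" + pairs


stream = [(1, 4), (2, 6), (1, 4), (3, 10), (2, 6)]
for vec in generate_bbv(stream, interval_size=12):
    print(format_bbv_line(vec))
```

Note that an interval here closes at the first block boundary at or past the interval size, so intervals can slightly overshoot; how a real tool handles that boundary is a design choice this sketch does not try to reproduce.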
The SimPoint methodology uses K-means clustering of the BBV file to find
simulation points of interest. The algorithm selects one representative interval
from each phase identified by clustering. The number of phases can be specified
directly, or the tool can search within a given range for an appropriate number
of phases.
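To make the clustering step concrete, the following Python sketch runs a plain k-means over normalized BBVs and picks, for each cluster, the interval closest to the centroid. It is a simplified illustration under stated assumptions: the farthest-point initialization is chosen here only for determinism, the number of clusters is given directly, and refinements in the real tool (such as dimensionality reduction and the search over the number of clusters) are omitted. All names and data are hypothetical.

```python
import math


def normalize(vec):
    """Scale a BBV so its entries sum to 1."""
    total = sum(vec)
    return [v / total for v in vec]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    # Farthest-point initialization: deterministic, chosen for this sketch.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points, key=lambda p: min(distance(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


def choose_simpoints(bbvs, k):
    """Return {interval_index: weight}: per cluster, the interval nearest
    the centroid, weighted by the cluster's share of all intervals."""
    points = [normalize(v) for v in bbvs]
    labels, centroids = kmeans(points, k)
    chosen = {}
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members,
                       key=lambda i: distance(points[i], centroids[c]))
            chosen[best] = len(members) / len(points)
    return chosen


# Seven intervals over three basic blocks, forming two obvious phases.
bbvs = [[90, 10, 0], [80, 20, 0], [88, 12, 0], [86, 14, 0],
        [10, 0, 90], [20, 0, 80], [12, 0, 88]]
print(choose_simpoints(bbvs, k=2))
```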
The final step in using SimPoint is to gather statistics for all chosen simula-
tion points. For multiple simulation points, the SimPoint tool generates weights
to apply to the intervals. By scaling the statistics by the corresponding weights,
an accurate approximation of entire program behavior can be estimated quickly
(within a small fraction of whole-application simulation time).
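The weighting step amounts to a weighted average. The sketch below assumes equal-length intervals, so that weighting per-interval CPI by the share of intervals is equivalent to weighting by instructions; the interval indices, CPI values, and weights are hypothetical.

```python
def estimate_cpi(simpoint_cpi, weights):
    """Weighted average of the CPI measured at each simulation point.

    Both arguments map an interval index to a value; the weights are
    assumed to sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[i] * simpoint_cpi[i] for i in weights)


# Hypothetical values: three simulation points and their weights.
cpi = {12: 1.20, 340: 2.50, 977: 0.80}
weights = {12: 0.50, 340: 0.30, 977: 0.20}
print(estimate_cpi(cpi, weights))
```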
The SimPoint website only provides BBV generation tools using ATOM [135]
and SimpleScalar sim-alpha [13]. These are useful for experiments involving
the Alpha processor, but that architecture has declined in significance. There is
a PinPoints tool that enables generation of BBVs using Intel’s Pin [87] tool, but
that only works for Intel supported architectures. We investigate using other
tools to generate BBVs for a wider range of architectures.
We modify the Qemu [18] and Valgrind [113] Dynamic Binary Instrumenta-
tion tools to generate SimPoint BBV files. The changes to Qemu are available
from our website [3] and are also shown in Appendix J. The tool we develop for
Valgrind, exp-bbv, was merged into the main Valgrind project as of version
3.5 (the code is also included as Appendix I). We tried using DynInst [29] to
generate BBV files, but we were unsuccessful. Unfortunately the version of the
tool available at the time only worked with dynamically linked applications and
had a large overhead, often exceeding 4GB of RAM used for some benchmarks.
To evaluate our BBV generation methods, we compare results gathered on
the x86 architecture, as this is the one architecture supported by Qemu, Valgrind
and Pin. Figure 3.4 shows architectures supported by each tool.
3.5.2 x86 Evaluation
To evaluate the BBV generation tools, we use the SPEC CPU2000 [136] and
CPU2006 [138] benchmarks with full reference inputs. We compile the bench-
marks on SuSE Linux 10.2 with gcc 4.1 and -O2 optimization (except for
vortex, which we compile without optimization because it crashes otherwise).

Figure 3.4: Architectures supported by Pin, Qemu, and Valgrind: x86 is
the ideal platform for comparison, as it is well supported by all
three of the tools.
to gather data. The choice to use static linking is not due to tool dependencies;
all three handle both dynamic and static executables. We use the Perfmon2 [51]
interface to gather hardware performance counter results for the platforms de-
scribed in Table 3.1.
We use the Cycles Per Instruction (CPI) metric to evaluate our tools. The per-
formance counter infrastructure is set to dump the cycles performance counter
results every 100M instructions. The same performance counter data are used
to evaluate all three tools, to avoid any variation between runs. Basic Block Vec-
tor files are generated using the three tools, and SimPoint version 3.2 is used
to generate the simulation points and weights. We calculate actual overall CPI
for the benchmarks by using the performance counter data, and use this as a
basis for our error calculations. Note that calculated statistics are ideal, with
full warmup. If we were analyzing via a simulation, the results would likely
vary in accuracy depending on how architectural state is warmed up after
fast-forwarding between simulation points.
Table 3.1: Machines used for x86 SimPoint evaluation.
Type         Frequency  Memory  L1 I/D      L2/L3 Cache  Performance counters used
Pentium Pro  200MHz     256MB   8KB/8KB     512KB        inst_retired, cpu_clk_unhalted
Pentium II   400MHz     256MB   16KB/16KB   512KB        inst_retired, cpu_clk_unhalted
Pentium III  550MHz     512MB   16KB/16KB   512KB        inst_retired, cpu_clk_unhalted
Itanium      800MHz     1GB     16KB/16KB   96KB/3MB     ia32_inst_retired, cpu_cycles
Atom N270    1.6GHz     1GB     32KB/24KB   512KB        instructions_retired, unhalted_core_cycles
Core Duo     1.66GHz    1GB     32KB/32KB   1MB          instructions_retired, unhalted_core_cycles
Athlon MP    1.733GHz   512MB   64KB/64KB   256KB        retired_instructions, cpu_clk_unhalted
Athlon64 X2  2GHz       1GB     64KB/64KB   512KB        retired_instructions, cpu_clk_unhalted
AMD Phenom   2.2GHz     2GB     64KB/64KB   512KB/2MB    retired_instructions, cpu_clk_unhalted
Core2 Q6600  2.4GHz     2GB     32KB/32KB   4MB          instructions_retired, unhalted_core_cycles
Pentium 4    2.8GHz     2GB     12Kµ/16KB   512KB        instr_retired:nbogusntag, global_power_events:running
Pentium D    3.46GHz    4GB     12Kµ/16KB   2MB          instr_retired:nbogusntag, global_power_events:running
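The error calculation described above can be sketched as follows. This is an illustration only: it assumes every interval retires exactly the same number of instructions, it reports a signed percent error, and the cycle counts are hypothetical rather than measured.

```python
def overall_cpi(cycles_per_interval, interval_instructions):
    """Whole-program CPI from per-interval cycle counts.

    Every interval is assumed to retire exactly `interval_instructions`
    instructions; a real trace would end with a shorter final interval.
    """
    total_cycles = sum(cycles_per_interval)
    total_instructions = interval_instructions * len(cycles_per_interval)
    return total_cycles / total_instructions


def percent_error(estimated, actual):
    """Signed percent error of an estimate relative to the measured value."""
    return 100.0 * (estimated - actual) / actual


# Hypothetical cycle counts, one per 100M-instruction interval.
INTERVAL = 100_000_000
cycles = [250_000_000, 150_000_000, 150_000_000, 250_000_000]
actual = overall_cpi(cycles, INTERVAL)
print(actual, percent_error(1.8, actual))
```

A negative result means the reduced-execution estimate is below the measured whole-program CPI.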
Figure 3.5 shows results for reduced input methods for the SPEC CPU2000
benchmarks across 12 different implementations of the x86 architecture. The
results shown are the average error for CPI when compared against a full ref-
erence input run, as measured with hardware performance counters. Each ma-
chine has three sets of plots; one for detailed simulation of 100 million instruc-
tions, one for 500 million, and one for 1 billion.
The first plot in each set is just starting from the beginning of the program

Figure 3.5: Average CPI error for SPEC CPU2000 when using first, un-

[Plot: FP results, up to 20 SimPoints, Pentium D; series: Pin, Qemu, Valgrind]
Figure 3.6: Percent error in CPI on a Pentium D when using up to 20 Sim-
Points on CPU2000 FP: the error with facerec and fma3d is

Figure 3.7: Percent error in CPI on a Pentium D when using up to 20 Sim-
Points on CPU2000 INT: the large error with the gcc bench-
marks is due to spikes in the phase behavior that SimPoint does
not capture well.
simulating more instructions can sometimes get results as close as 20% error.
The second plot in each set is fast forwarding by 1 billion instructions, in an
attempt to avoid startup effects. This often, but not always, has better accuracy
than starting from the beginning, and can also obtain results approaching 20%
error, though usually higher.
The next three plots show the results using the SimPoint methodology, with
BBV files generated by Pin, Qemu and Valgrind respectively. Even when only
simulating one simulation point, the results are much better than the unguided
results. In general they are within the 10-20% error range. Moving on to up to 5
chosen SimPoints (approximately 500M instructions per benchmark) helps even
more, with error in the 5-10% range for all machines. Moving on to simulating
up to 10 SimPoints does not help much, and in fact it can have worse results!
This might be unexpected, but there is no guarantee in the SimPoint methodol-
ogy that adding more points helps error (Figure 3.19 shows this with regard to
the x86_64 architecture).
The last plot shown is the “oracle” result. This shows how low the
error would be if the optimal interval was picked for each benchmark. This
is an extremely low value, which shows that each benchmark has an interval
that matches program behavior well. Unfortunately this interval varies from
machine to machine, so it would not be possible to have a tool that can find this
in a generic fashion.
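One way to picture the oracle is as a search for the single interval whose CPI is closest to the whole-program CPI. The sketch below assumes equal-length intervals, and the per-interval values for the two machines are hypothetical; it only illustrates why the best interval can differ between machines.

```python
def oracle_interval(interval_cpi):
    """Return (index, percent error) of the single interval whose CPI is
    closest to the whole-program CPI, assuming equal-length intervals."""
    actual = sum(interval_cpi) / len(interval_cpi)
    best = min(range(len(interval_cpi)),
               key=lambda i: abs(interval_cpi[i] - actual))
    return best, 100.0 * (interval_cpi[best] - actual) / actual


# Hypothetical per-interval CPI for the same program on two machines.
machine_a = [3.0, 1.0, 1.9, 2.1, 1.5]
machine_b = [2.0, 1.2, 0.9, 1.0, 1.1]
print(oracle_interval(machine_a)[0], oracle_interval(machine_b)[0])
```

Because the search needs the full per-interval trace from the target machine, it cannot be done ahead of time, which is why the oracle serves only as a lower bound.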
The notable feature of these results is the small amount of simulation time
required. When allowing SimPoint to choose up to 10 simulation points per
benchmark, the average CPI error is roughly 5-10% on every machine tested,
while requiring only a small amount of execution. The intervals
chosen are not many; Pin chooses 354 SimPoints, Qemu 363, and Valgrind 346;
this represents only 0.4% of the total execution length, making the simulations
finish 250 times faster than if run to completion. It is reassuring that all three
BBV methods pick a similar number of intervals, and in many cases they pick
the same intervals.
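The relationship between the fraction of execution simulated and the resulting speedup is simple arithmetic, sketched below. It assumes run time is proportional to the number of instructions simulated in detail, ignoring fast-forwarding and warmup costs; the function names and the small example counts are hypothetical.

```python
def fraction_simulated(num_points, interval_size, total_instructions):
    """Share of the full execution covered by the chosen intervals."""
    return num_points * interval_size / total_instructions


def speedup_from_fraction(fraction):
    """Ideal speedup if run time scales with instructions simulated."""
    return 1.0 / fraction


# The 0.4% figure quoted in the text corresponds to a 250x ideal speedup.
print(speedup_from_fraction(0.004))
# Hypothetical small example: 4 intervals of 100 out of 100,000 instructions.
print(fraction_simulated(4, 100, 100_000))
```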
Figures 3.6 and 3.7 break out the Pentium D results by benchmark. For floating
point applications, facerec and fma3d have significantly more error than
the others. This is because those programs feature phases which exhibit extreme
shifts in CPI from interval to interval, a behavior that SimPoint often has trouble
capturing. For the integer benchmarks, the biggest source of error is the
gcc benchmarks. The reason gcc behaves so poorly is that there are intervals
during its execution where the CPI and other metrics spike. These huge spikes
do not repeat, and only happen for one interval; because of this, SimPoint does
not weight them as being important, and they therefore are omitted from the
chosen simulation points. These high peaks are what cause the actual average
results to be much higher than what is predicted by SimPoint. It might be pos-
sible to work around this problem by choosing a smaller interval size, which
would break the problematic intervals into multiple smaller ones that would be
more easily seen by SimPoint.
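The effect of a single unrepresented spike can be shown with a small synthetic example. The sketch assumes a flat trace with one spiking interval and an estimate built only from the non-spike behavior; the numbers are hypothetical and are not taken from the gcc measurements.

```python
def spike_effect(base_cpi, spike_cpi, num_intervals):
    """Compare the true mean CPI of a flat trace containing one spike
    with an estimate built only from the non-spike behavior."""
    trace = [base_cpi] * (num_intervals - 1) + [spike_cpi]
    actual = sum(trace) / len(trace)
    estimate = base_cpi  # the lone spike is assumed not to be selected
    return actual, 100.0 * (estimate - actual) / actual


# Hypothetical trace: 99 intervals at CPI 1.0 and one interval at CPI 21.0.
actual, error = spike_effect(1.0, 21.0, 100)
print(actual, error)
```

Even though the spike covers only one interval in a hundred, omitting it leaves the estimate well below the true average, which matches the direction of the error described above.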
We also use our BBV tools on the SPEC CPU2006 benchmarks. These runs
use the same tools as for CPU2000, without any modifications. These tools
yield good results without requiring any special knowledge of the newer bench-
marks. We do not have results for the zeusmp benchmark for Valgrind; it
uses a 1GB data segment which Valgrind was unable to handle. Unlike the
CPU2000 results, we only have performance counter data from six of the ma-
chines. Many of the CPU2006 benchmarks have working sets of over 1GB, and
many of our machines have less RAM than that. On those machines the bench-
marks take months to run, with the operating system paging constantly to disk.
The CPU2006 results shown in Figure 3.8 are as favorable as the CPU2000 re-
sults. When allowing SimPoint to choose up to 10 simulation points per bench-
mark, the average error for CPI is less than 10% for all of the BBV generation

Figure 3.8: Average CPI error for CPU2006 on a selection of x86 machines
when using first, unguided fast-forward, and SimPoint se-
lected intervals.
would require simulating only 0.056% of the total benchmark suite. This is an
impressive speedup, considering the long running time of these benchmarks.
Error when simulating the first 100M instructions peaks at over 100%, show-
ing that this continues to be a poor way to choose simulation intervals. Fast-
forwarding 1B instructions and then simulating produces average errors in the
range of 20-40%. Using only a single simulation point again always does better
than unguided simulation.
Figures 3.9 and 3.10 show CPI errors for individual benchmarks on the Pen-
tium D machine. For floating point applications, there are outlying results for
cactusADM and GemsFDTD. As with the CPU2000 results, the biggest source

Figure 3.9: Percent error in CPI on a Pentium D when using up to 20
SimPoints on CPU2006 FP: the large variation in results for
cactusADM and GemsFDTD are due to unresolved inaccuracies

Figure 3.10: Percent error in CPI on a Pentium D when using up to 20
SimPoints on CPU2006 INT: the large error with the gcc and
bzip2 benchmarks is due to spikes in the phase behavior not
captured by SimPoint.
described previously: SimPoint cannot handle the spikes in the phase behavior.
The bzip2 benchmarks in CPU2006 exhibit the same problem that gcc has. In-
puts used in CPU2006 have spiky behavior that the CPU2000 inputs do not. The
other outliers, perlbench and astar, require further investigation.
Table 3.2: Machines used for x86_64 SimPoint evaluation.
Processor    Cores  Speed    Memory  L1 I/D Cache  L2/L3 Cache  Retired Instruction Counter / Cycles Counter
AMD Phenom   4      2.2GHz   2GB     64KB/64KB     512KB/2MB    retired_instructions / cpu_clk_unhalted
Core2 Q6600  4      2.4GHz   2GB     32KB/32KB     4MB          instructions_retired / unhalted_core_cycles
Pentium D    2x2    3.46GHz  4GB     12Kµ/16KB     2MB          instr_completed:nbogus / global_power_events:running
3.5.3 x86_64 Results
The x86_64 architecture is a 64-bit extension of the x86 architecture. While it
is very similar to x86, it has features that change program behavior: the move
to 64 bits changes memory access widths, there are more registers (which
reduces register spills), and SSE vector instructions can be used by default
(which allows for saner floating point math and optimized memory transfers).
We extend our original x86 SimPoint work by generating results for x86_64.
The machines used are described in Table 3.2. The SPEC CPU2000
benchmarks were used, compiled with -O3 -msse3 -funroll-all-loops
-ffast-math -static using gcc-4.2. With that configuration, some of the
perlbmk benchmarks and all of the vortex benchmarks fail to run due to
memory access errors inherent in the benchmarks that are exhibited with recent
compilers.
Unlike the x86 results, we use only SimPoints from Valgrind-generated BBV
files. In Section 3.5.2 we show that the Valgrind generated BBV files have similar
characteristics to those generated by other tools. We generate the BBV files using
our exp-bbv tool as included in Valgrind 3.5.














Figure 3.11: Average CPI error for CPU2000 on three x86_64 machines
when using first, unguided fast-forward, and SimPoint selected intervals.
Figure 3.11 shows CPI error for three different x86_64 implementations on
the SPEC CPU2000 benchmarks. On all of the machines, the SimPoint results are
much better than the unguided results. Increasing the number of simulation
points helps accuracy, and on all machines accuracy of better than 5% can be
found when using up to 10 SimPoints per benchmark. This is better than the
average results found using the x86 binaries, even on the same machines. This
is primarily because the outliers are much better behaved on 64-bit systems;
since the metric is an average, it is the outliers that cause the high percent
error results.
Figures 3.12 and 3.13 show broken out results for the Phenom when up to 10
SimPoints are used per benchmark. It is somewhat unsurprising to note that the











































































Valgrind FP results, up to 10 SimPoints, AMD Phenom





















































































































































Figure 3.13: x86_64 CPI Error for SPEC CPU2000 integer benchmarks
3.5.4 Cross-Platform MIPS Results
A common situation found when performing architectural simulation is using
simulators for machines for which you do not have any actual hardware. This
makes for difficult development, involving setting up cross-compiler toolchains
to generate binaries. It becomes hard to determine when bugs are in the
toolchain or in the simulator when there is no real hardware for comparison.
Generating SimPoints for an unavailable platform is also difficult; it might be
tempting to just re-use SimPoints generated for another architecture, but this









































Figure 3.14: Phase plot for mcf across various architectures. While the
phases look similar, the interval numbers are not.
multiple architectures. (A complete set of multi-architecture phase plots can be
seen in Appendix F). While the phases are similar, the actual interval values are
very different, and SimPoints generated for one of the architectures would not
work for any of the others.
Another way to avoid generating SimPoints is to re-use those already gener-
ated by someone else. This can cause problems unless you have the exact same
binaries used to generate the original SimPoints. Figure 3.15 shows that on x86
the compiler chosen and the compiler flags used can vastly affect the interval
numbers for a benchmark (in this case, equake).
There have been studies done on the possibility of generating true cross-
platform SimPoints [120], but the methods involve time-consuming profiling
on multiple machines, and the results are not practical.



















































































Figure 3.15: Phase plot for equake across various compilers and compile













Figure 3.16: MIPS R12000 SimPoint results for SPEC CPU2000. The BBVs
for the SimPoints were generated cross-platform on an x86
machine using Qemu.
An option we explore is to use DBI simulation to generate BBV files for a
different platform. The Qemu DBI tool can run executables cross-platform. By
using our BBV-generation patched version of Qemu, we can generate BBV files
for Alpha, SPARC, MIPS, PPC and ARM while still running on an x86 machine.
Figure 3.16 shows results using SimPoints generated for the MIPS archi-
tecture using MIPS binaries while running on an x86 machine. These Sim-
Points were then used on performance counter data collected on an actual MIPS
R12000 processor. The results are very similar to those found for the other ar-
chitectures investigated, and have 5% CPI error when using up to 5 SimPoints.
This shows that Qemu is a valuable tool for generating SimPoints for platforms
where native hardware is not available.
Figures 3.17 and 3.18 break out the results per-benchmark. The results are
markedly different from the x86 and x86 64 results seen previously. The gcc
benchmarks are not outliers, in this case mcf has a large error, and the floating























































































































































































































Figure 3.18: MIPS CPI Error for SPEC CPU2000 integer benchmarks
3.5.5 Summary
On actual x86 hardware, using the SimPoint methodology can give CPI error
of under 10% while only running 0.4% of the total SPEC CPU2000 suite on full
reference inputs across 12 different machines. Our code generates under 12%
CPI error when running under 0.06% of SPEC CPU2006 (excepting zeusmp)
with full reference inputs across 6 different machines.
We also investigate x86_64 and cross-platform generated MIPS SimPoints
and find results that compare favorably to the x86 results.
3.6 SimPoint Limitations
While these results are good, there are some limitations to using this
methodology. The SimPoint error can only add to the error generated with
cycle-accurate simulators (for example, 20% with sim-alpha [43]). Also, it is
unclear if it is possible to use the SimPoint methodology for multi-threaded
workloads (see discussion in Section 2.8).
We find more variation in our results than we originally expected. This led
us to investigate our evaluation methods to try to determine the source of the
differences. For example, we would expect that the different DBI tools, since
they are running the same executables with the same inputs on the same
machines, should produce identical BBV files, but they do not. This turns out
to be because the different DBI tools have different ideas of what constitutes
a basic block. For performance reasons the DBI tools try to make blocks as
large as possible, and will use “super-blocks” which, unlike basic blocks, can
have multiple exits but only one entry. Also, the tools discover basic blocks
at run-time, so they are often in the situation where a program jumps to the
middle of a block (on x86, it is even technically legal to jump to the middle
of an instruction). A new block then has to be created out of the old one, and
the DBI tools differ in how statistics are accounted in that situation. The
SimPoint methodology generates different SimPoint files depending on the BBV
inputs, and even a single extra instruction in a block can change which points
are chosen. Because of this, even slight differences in BBV accounting can
cause different results.
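The effect of block accounting can be illustrated with a small sketch. It is not the accounting used by any particular DBI tool; the two policies, the function names, and the toy trace are assumptions made for illustration. One policy keys each block by its entry address (so a jump into the middle of a block creates a new, overlapping block), while the other splits the block at every known entry point.

```python
from collections import Counter

def bbv_overlapping(entries, block_end):
    """Key each block by its entry address; a mid-block entry creates
    a new block that overlaps the tail of the original one."""
    bbv = Counter()
    for entry in entries:
        bbv[entry] += block_end - entry + 1   # instructions executed
    return dict(bbv)

def bbv_split(entries, block_end):
    """Split the block at every known entry point, then credit each
    piece that an execution passes through."""
    leaders = sorted(set(entries))
    bbv = Counter()
    for entry in entries:
        for i, leader in enumerate(leaders):
            if leader < entry:
                continue                      # piece lies before the entry
            end = leaders[i + 1] - 1 if i + 1 < len(leaders) else block_end
            bbv[leader] += end - leader + 1
    return dict(bbv)

# Toy trace: a 10-instruction block (addresses 0..9) entered three times
# at its start and twice at address 5.
entries = [0, 0, 0, 5, 5]
overlapping = bbv_overlapping(entries, block_end=9)
split = bbv_split(entries, block_end=9)
print(overlapping, split)
```

Both policies account for the same total number of instructions, yet the per-block vectors differ, which is enough for a clustering step to see different inputs.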
Even setting aside the DBI differences, we found that on real hardware the
performance counts for retired instructions differed from machine to machine,
which was often unexpected. To get accurate SimPoint results you need fairly
accurate instruction counts, as you need to fast-forward to the exact start of
the phase. On programs with a high amount of phase variability, being a
million instructions off could end up in a completely different phase than
the one intended, causing poor results. This exposes many hardware counter
and deterministic execution issues that we investigate in detail in Chapter 4.
Figure 3.19: Percent average CPI error for SPEC CPU2000 as more SimPoints
are added per benchmark. After 20 SimPoints the average does not decrease,
even up to 100 points per benchmark (this is equivalent to running 2% of
all of the benchmarks).
Another problem with the SimPoint methodology is that it is not possible to
predict what the error will be. Figure 3.19 shows the average CPI error on three
different x86_64 machines with SPEC CPU2000 as the number of SimPoints per
benchmark is raised from 1 to 100. The error does not always decrease, and after
a certain point (roughly around 20) a steady-state is reached and the error does
not get better and in fact can get worse.
SimPoint is a valuable tool, and is much better than using unguided simula-
tion. However, we believe that many of its limitations cannot be fully addressed,




Hardware performance counters are a useful tool for validation. These coun-
ters are available on most modern processors, and keep track in real time of var-
ious architectural statistics. The counters must be used with caution, as hard-
ware engineers are reluctant to certify the accuracy of the counters. Before using
a counter in research, it needs to be checked to ensure it is delivering reasonable
results.
When using hardware performance counters to validate the SimPoint
methodology in Chapter 3 we noticed discrepancies in the results. Some of
these could be attributed to variations in how the DBI tools generate BBV files,
but some results indicate that the retired instructions counters were varying
both run-to-run and across machines. These unexpected variations can be
as much as 2%. The retired instruction counter should not vary this much; it
is high profile enough to be heavily debugged by hardware engineers. Retired
instruction count is one of the few counters that should be the same for the same
executable/input set across all implementations of an ISA.
In order to trust the results from our SimPoint study we investigate the ac-
curacy of the retired instruction performance counter and how it relates to de-
terministic execution on the x86 architecture.
4.1 Hardware Performance Counters
When used in aggregate counting mode (as opposed to sampling mode), per-
formance counters provide architectural statistics at full hardware speed with
minimal overhead. Most modern processors support some form of counters.
Although originally implemented for debugging hardware designs during de-
velopment, they have come to be used extensively for performance analysis and
for validating tools and simulators. The types and numbers of events tracked
and the methodologies for using these performance counters vary widely, not
only across architectures, but also across systems sharing an ISA. For example,
the Pentium III tracks 80 different events, measuring only two at a time, but
the Pentium 4 tracks 48 different events, measuring up to 18 at a time. Chips
manufactured by different companies have even more divergent counter archi-
tectures: for instance, AMD and Intel implementations have little in common,
despite their supporting the same ISA. Verifying that measurements generate
meaningful results across arrays of implementations is essential to using coun-
ters for research.
Comparison across diverse machines requires a common subset of equiva-
lent counters. Many counters are unsuitable due to microarchitectural or timing
differences. Furthermore, counters used for architectural comparisons must be
available on all machines of interest. We choose a counter that meets these re-
quirements: number of retired instructions. For a given statically linked binary,
the retired instruction count should be the same on all machines implement-
ing the same ISA, since the number of retired instructions excludes speculation
and cache effects that complicate cross-machine correlation. When validating
SimPoints (as described in Chapter 3) the retired instruction count was not as
regular as expected. This count is especially relevant, since it is a component
of both the Cycles per Instruction (CPI) and (conversely) Instructions per Cycle
(IPC) metrics commonly used to describe machine performance.
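As a minimal sketch of why the retired instruction count matters for these metrics, the following computes CPI and IPC from a pair of counter readings and shows how an instruction overcount propagates into CPI. The counter values are hypothetical and the function names are ours.

```python
def cpi(cycles, retired_instructions):
    """Cycles per Instruction from two aggregate counter readings."""
    return cycles / retired_instructions

def ipc(cycles, retired_instructions):
    """Instructions per Cycle, the reciprocal of CPI."""
    return retired_instructions / cycles

def percent_error(estimate, reference):
    return 100.0 * (estimate - reference) / reference

# Hypothetical readings: same cycle count, instruction count off by 2%.
reference_cpi = cpi(cycles=3_000_000, retired_instructions=2_000_000)
skewed_cpi = cpi(cycles=3_000_000, retired_instructions=2_040_000)
cpi_error = percent_error(skewed_cpi, reference_cpi)
print(reference_cpi, skewed_cpi, cpi_error)
```

A 2% overcount of instructions with an unchanged cycle count lowers the computed CPI by roughly the same proportion.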
The CPI and IPC metrics are important in computer architecture research; on
the rare occasion that a simulator is actually validated [116, 37, 42, 152] these
metrics are usually the ones used for comparison. Retired instruction count and
IPC are also used for vertical profiling [64] and trace alignment [106], which are
methods of synchronizing data from various trace streams for analysis.
Retired instruction counts are also important when generating basic block
vectors (BBVs) for use with the SimPoint [62] tool. When investigating the use
of DBI tools to generate BBVs [155], we find that even a single extra instruction
counted in a basic block can change which simulation points the SimPoint tool
chooses to be most representative of whole program execution.
All these uses of retired instruction counters assume that generated results
are repeatable, relatively deterministic, and have minimal variation across ma-
chines with the same ISA. Here we explore whether these assumptions hold
by comparing the hardware-based counts from a variety of machines, as well as
comparing to counts generated by Dynamic Binary Instrumentation (DBI) tools.
4.1.1 Performance Counter Evaluation
We run experiments on multiple generations of x86 machines, listed in Table
4.1. All machines run the Linux 2.6.25.4 kernel patched to enable performance
counter collection with the perfmon2 [51] infrastructure. We use the entire SPEC
CPU2000 [136] and CPU2006 [138] benchmark suites with the full reference in-
put sets. We compile the SPEC benchmarks on a SuSE Linux 10.1 system with
version 4.1 of the gcc compiler and -O2 optimization (except for vortex, which
crashes when compiled with optimization). All benchmarks are statically linked
Table 4.1: Machines used for this study.
Processor    Speed     Bits  Memory  L1 I/D Cache  L2 Cache  Retired Instruction Counter / Cycles Counter
Pentium Pro  200MHz    32    256MB   8KB/8KB       512KB     inst_retired / cpu_clk_unhalted
Pentium II   400MHz    32    256MB   16KB/16KB     512KB     inst_retired / cpu_clk_unhalted
Pentium III  550MHz    32    512MB   16KB/16KB     512KB     inst_retired / cpu_clk_unhalted
Pentium 4    2.8GHz    32    2GB     12Kµ/16KB     512KB     instr_retired:nbogusntag / global_power_events:running
Pentium D    3.46GHz   64    4GB     12Kµ/16KB     2MB       instr_completed:nbogus / global_power_events:running
Athlon XP    1.733GHz  32    768MB   64KB/64KB     256KB     retired_instructions / cpu_clk_unhalted
AMD Phenom   2.2GHz    64    2GB     64KB/64KB     512KB     retired_instructions / cpu_clk_unhalted
Core Duo     1.66GHz   32    1GB     32KB/32KB     1MB       instructions_retired / unhalted_core_cycles
Core2 Q6600  2.4GHz    64    2GB     32KB/32KB     4MB       instructions_retired / unhalted_core_cycles
to avoid variations due to the C library. We use the same 32-bit, statically linked
binaries for all experiments on all machines.
We gather Pin [87] results using a simple instruction count utility via Pin
version pin-2.0-10520-gcc.4.0.0-ia32-linux. We patch Valgrind [113] 3.3.0 and
Qemu [18] 0.9.1 to generate retired instruction counts. We gather the DBI results
on a cluster of Pentium D machines identical to the one described in Table 4.1. We
configure pfmon [51] to gather complete aggregate retired instruction counts,
without any sampling. The tool runs as a separate process, enabling counting
in the OS; it requires no changes to the application of interest and induces mini-
mal overhead during execution. We count user-level instructions specific to the
benchmark.
We collect at least seven data points for every benchmark/input combina-
tion on each machine and with each DBI method. The CPU2006 benchmarks
require at least 1GB of RAM to finish in a reasonable amount of time. Given
this, we do not run them on the Pentium Pro or Pentium II, and we do not run
bwaves, GemsFDTD, mcf, or zeusmp on machines with small memories. Fur-
thermore, we omit results for zeusmp with DBI tools, since they cannot handle
the large 1GB data segment the application requires.
4.1.2 Sources of Hardware Counter Variation
We focus on two types of variation when gathering performance counter results.
One is inter-machine variations, the differences between counts on two different
systems. The other is intra-machine variations, those found when running the
same benchmark multiple times on the same system. We investigate methods
for reducing both types.
Specific Instructions Counted Differently
For instruction counts to match on two machines, the instructions involved
must be counted the same way. If not, this can cause large divergences in
total counts. On Pentium 4 systems, the instr_retired:nbogusntag performance
counter counts fldcw as two retired instructions; on all other x86
implementations fldcw counts as one. This instruction is common in floating
point code: it is used in converting between floating point and integer values.
It alone accounts for a significant divergence in the mesa and sphinx3
benchmarks. Table 4.2 lists occurrences in the SPEC benchmarks where the count
is over 100 million. We modify Valgrind to count the fldcw instructions, and use
these counts to adjust results when presenting Pentium 4 data. It should be
possible to use statistical methods to automatically determine which type of opcode
Table 4.2: Dynamic count of fldcw instructions, showing all benchmarks
with over 100 million. This instruction is counted as two instructions
on Pentium 4 machines but only as one instruction on all
other implementations.





456.hmmer retro          561,271,823        0.03%
175.vpr place            405,499,739        0.37%
300.twolf                379,247,681        0.12%
483.xalancbmk            358,907,611        0.03%
416.gamess cytosine      255,142,184        0.02%
435.gromacs              230,286,959        0.01%
252.eon kajiya           159,579,683        0.15%
252.eon cook             107,592,203        0.13%
causes divergence in cases like this; this is part of ongoing work. We isolated
the fldcw problem by using a tedious binary search of the mesa source code.
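The Pentium 4 adjustment described above can be sketched as follows. This is an illustration under stated assumptions rather than the scripts used for our results: the function names and the sample numbers are hypothetical, and the percentage is computed relative to the adjusted total, which may not match how the table column was derived.

```python
def adjust_pentium4_count(raw_count, fldcw_count):
    """Remove the extra count the Pentium 4 counter adds per fldcw."""
    # fldcw is reported as two retired instructions instead of one,
    # so the excess is one count per dynamic fldcw.
    return raw_count - fldcw_count

def percent_overcount(raw_count, fldcw_count):
    """Overcount as a percentage of the adjusted total (an assumed definition)."""
    adjusted = adjust_pentium4_count(raw_count, fldcw_count)
    return 100.0 * (raw_count - adjusted) / adjusted

raw = 1_002_000_000   # hypothetical Pentium 4 counter reading
fldcw = 2_000_000     # hypothetical dynamic fldcw count from a DBI tool
adjusted = adjust_pentium4_count(raw, fldcw)
print(adjusted, percent_overcount(raw, fldcw))
```

The dynamic fldcw count has to come from a separate source, such as an instrumented run, since the hardware counter alone cannot distinguish the doubled instructions.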
Using the Proper Counter
Pentium 4 systems newer than model 6 support an instr_completed:nbogus
counter, which is more accurate than the instr_retired:nbogusntag
counter found on previous models. This newer counter does not suffer the
fldcw problem described in Section 4.1.2. Unfortunately, not all systems
include this counter; our Pentium D can use it, but our older Pentium 4 systems
cannot. This counter is not well documented, and thus it was not originally
available within the perfmon infrastructure. We contributed counter support






































































































































































































































































Figure 4.1: SPEC 2000 coefficient of variation. The top graph shows integer
benchmarks, the bottom, floating point. The error variation
from mesa, perlbmk, vpr, twolf and eon is primarily due
to the fldcw miscount on the Pentium 4 systems. Variation
after our adjustments becomes negligible.
Processor Errata
There are built-in limitations to performance counter accuracy. Some are in-
tended, and some are unintentional by-products of the processor design. Our
results for our 32-bit Athlon exhibit some unexplained divergences, leading us
to investigate existing errata for this processor [6]. The errata mention vari-
ous counter limitations that can result in incorrect total instruction counts. Re-





























































































































































































































































































































Figure 4.2: SPEC 2006 coefficient of variation. The top graph shows integer
benchmarks, the bottom, floating point. The original variation
is small compared to the large numbers of instructions in
these benchmarks. The largest variation is in sphinx3, due to
fldcw instruction issues. Variation after our adjustments becomes
orders of magnitude smaller.
4.1.3 Counter Variation Findings
Figure 4.1 shows the coefficient of variation for SPEC CPU2000 benchmarks
before and after our adjustments. Large variations in mesa, perlbmk, vpr,
twolf, and eon are due to the Pentium 4 fldcw problem described in Sec-
tion 4.1.2. Once adjustments are applied, variation drops below 0.0006% in
all cases. Figure 4.2 shows similar results for SPEC CPU2006 benchmarks.
Larger variations for sphinx3 and povray are again due to the fldcw in-
struction. Once adjustments are made, variations drop below 0.002%. Overall,
the CPU2006 variations are much lower than for CPU2000; the higher absolute
differences are counterbalanced by the much larger numbers of total retired
instructions. These results can be misleading: a billion-instruction
difference appears small in percentage terms when part of a three trillion
instruction program, but in absolute terms it is large. When attempting to
capture phase behavior accurately using SimPoint with an interval size of
100 million instructions, a phase’s being offset by one billion instructions
can alter final results.
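The contrast between relative and absolute differences can be made concrete with a short sketch. The two counts below are hypothetical, and the use of the population standard deviation is an assumption about how a coefficient of variation might be computed.

```python
from statistics import mean, pstdev

def coefficient_of_variation(counts):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * pstdev(counts) / mean(counts)

def intervals_off(count_difference, interval_size=100_000_000):
    """How many whole SimPoint intervals a count difference spans."""
    return count_difference // interval_size

# Two hypothetical machines disagreeing by one billion instructions
# on a three trillion instruction benchmark.
counts = [3_000_000_000_000, 3_001_000_000_000]
cv = coefficient_of_variation(counts)
shift = intervals_off(counts[1] - counts[0])
print(cv, shift)
```

The coefficient of variation is well under a tenth of a percent, yet the difference spans ten full 100 million instruction intervals.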
4.1.4 Intra-machine results
Figure 4.3 shows the standard deviations of results across the CPU2000 and
CPU2006 benchmarks for each machine and DBI method. DBI results are
shown, but not incorporated into standard deviations. In all but one case
the standard deviation improves, often by at least an order of magnitude.
For CPU2000 benchmarks, perlbmk has large variation for every generation
method. We are still investigating the cause. In addition, the Pin DBI tool has a
large outlier with the parser benchmark, most likely due to issues with consis-
tent heap locations. Improvements for CPU2006 benchmarks are less dramatic,
with large standard deviations due to high outlying results. On AMD machines,
perlbench has larger variation than on other machines, for unknown reasons.
The povray benchmark is an outlier on all machines (and on the DBI tools); this
requires further investigation. The Valgrind DBI tool actually has worse stan-
dard deviations after our methods are applied due to a large increase in varia-
tion with the perlbench benchmarks. For the CPU2006 benchmarks, similar
platforms have similar outliers: the two AMD machines share outliers, as do










































































































































































Figure 4.3: Intra-machine results for SPEC CPU2000 (above) and CPU2006
(below). Outliers are indicated by the first letter of the benchmark
name and a distinctive color. For CPU2000, the perlbmk
benchmarks (represented by gray ‘p’s) are a large source of
variation. For CPU2006, the perlbench (green ‘p’) and
povray (gray ‘p’) are the common outliers. Order of plotted
letters for outliers has no intrinsic meaning, but tries to make
the graphs as readable as possible. Horizontal lines summarize
results for remaining benchmarks (they’re all similar). The
message here is that most platforms have few outliers, and
there’s much consistency with respect to measurements across
benchmarks; Core Duo and Core2 Q6600 have many more outliers,
especially for CPU2006. Our technical report provides
detailed performance information; these plots are merely
intended to indicate trends. Standard deviations decrease drastically
with our updated methods, but there is still room for
improvement.
4.1.5 Inter-machine Results
Figure 4.4 shows results for each SPEC 2000 benchmark (DBI values are shown
























































































































































































































































































































































































Figure 4.4: Inter-machine results for SPEC CPU2000. We choose five rep-
resentative benchmarks and show the individual machine dif-
ferences contributing to the standard deviations. Often there
is a single outlier affecting results; the outlying machine is of-






































































































































































































































































































































































































































Figure 4.5: Inter-machine results for SPEC CPU2006. We choose five rep-
resentative benchmarks and show the individual machine dif-
ferences contributing to the standard deviations. Often there is
a single outlier affecting results; the outlying machine is often
different. DBI results are shown, but not incorporated into the
standard deviations.
for five representative benchmarks to show individual machine contributions
to deviations. (Detailed plots for all benchmarks are available in our technical
report [154].) Our variation-reduction methods help integer benchmarks more
than floating point. The Pentium III, Core Duo and Core 2 machines often over-
count instructions. Since they share the same base design, this is probably due to
architectural reasons. The Athlon frequently is an outlier, often under-counting.
DBI results closely match the Pentium 4’s, likely because the Pentium 4 counter
apparently ignores many OS effects that other machines cannot.
Figure 4.5 shows inter-machine results for each SPEC 2006 benchmark.
These results have much higher variation than the SPEC 2000 results. Machines
with the smallest memories (Pentium III, Athlon, and Core Duo) behave
similarly, possibly due to excessive OS paging activity. The Valgrind DBI tool
behaves poorly compared to the others, often overcounting by at least a million
instructions.
4.2 Deterministic Execution
We found various issues that affect deterministic execution.
4.2.1 Virtual Memory Layout
It may seem counter-intuitive, but some benchmarks behave differently de-
pending on where in memory their data structures reside. This causes much
of the intra-machine variation we see across the benchmark suites. In theory,























Figure 4.6: The typical layout of virtual memory for a process on 32-bit x86
Linux. If process space randomization is enabled, then the BSS,
Heap, mmap and stack can have different offsets.
and perlbench exhibit this problem. To understand how this can happen, it
is important to understand the layout of virtual memory on x86 Linux (see Fig-
ure 4.6). In general, program code resides near the bottom of memory, with
initialized and uninitialized data immediately above. Above these is the heap,
which grows upward and the mmap region, which on newer kernels grows
downward. Near the top of virtual memory is the stack, which grows down-
ward. At the very top of the stack is process information, including command
line arguments and environment variables.
Typical programs are insensitive to virtual address assignments for data
structures. Languages that allow pointers to data structures make the virtual
address space “visible”. Different pointer values only affect instruction counts
if programs act on those values. Both parser and perlbench use pointers as
hash table keys. Differing table layouts can cause hash lookups to use differ-
ent numbers of instructions, causing noticeable changes in retired instruction
counts.
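A toy model shows how pointer-valued hash keys can change the amount of work done. This is not the hashing code from parser or perlbench; the table size, the hash function, and the addresses are assumptions chosen only to make the effect visible. Slots examined stand in for instructions executed.

```python
def probes_for_keys(addresses, table_size=8):
    """Insert address-valued keys into a linear-probing table and
    return the total number of slots examined."""
    table = [None] * table_size
    probes = 0
    for addr in addresses:
        slot = (addr >> 3) % table_size      # hash derived from address bits
        while True:
            probes += 1
            if table[slot] is None:
                table[slot] = addr
                break
            slot = (slot + 1) % table_size   # collision: try the next slot
    return probes

heap_objects = [0x1000, 0x1008]              # same in both runs
run_a = heap_objects + [0x7f00]              # stack object, layout A
run_b = heap_objects + [0x7f10]              # same object, stack shifted 16 bytes
probes_a = probes_for_keys(run_a)
probes_b = probes_for_keys(run_b)
print(probes_a, probes_b)
```

Shifting one object by 16 bytes removes two collisions, so the same logical insertions examine 5 slots in one layout and 3 in the other.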
There are multiple reasons why memory layout can vary from machine to
machine. On Linux the environment variables are placed above the stack; a
differing number of environment variables can change the addresses of local
variables on the stack. The same is true of the executable name (so a program
run from a different directory path could change this offset). Also, from kernel
to kernel the number of ELF auxiliary vectors changes, and unfortunately these
too are above the stack. If the addresses of local variables are used as hash keys
then the size and number of any of these executable parameters can affect the
total instruction count. This happens with perlbench; Mytkowicz et al. [104]
document the effect, finding that it causes execution time differences of up to
5%.
A machine’s word size can have unexpected effects on virtual memory lay-
out. Systems running in 64-bit mode can run 32-bit executables in a compat-
ibility mode. By default, however, the stack is placed at a higher address to
free extra virtual memory space. This can cause inter-machine variations, as lo-
cal variables have different addresses on a 64-bit machine (even when running
a 32-bit binary) than on a true 32-bit machine. Running the Linux command
linux32 -3 before executing a 32-bit program forces the stack to be in the
same place it would be on a 32-bit machine.
Another cause of varied layout is due to virtual memory randomization. For
security reasons, recent Linux kernels randomize the start of the text, data, bss,
stack, heap, and mmap() regions. This feature makes buffer-overrun attacks
more difficult, but the result is that programs have different memory address
layouts each time they are run. This causes programs (like parser) that use
heap-allocated addresses as hash keys to have different instruction counts every
time. This behavior is disabled system wide by the command:
echo 0 > /proc/sys/kernel/randomize_va_space
It is disabled at a per-process level with the -R option to the linux32 com-
mand. For our final runs, we use the linux32 -3 -R command to ensure
consistent virtual memory layout, and we use a shell script to force environ-
ment variables to be exactly 422 bytes on all systems.
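The setup above can be sketched in Python as a stand-in for the shell script. The size definition (each variable counted as NAME=value plus a terminating NUL), the padding variable name PAD, and the benchmark command line are assumptions for illustration; only the linux32 -3 -R invocation and the 422-byte target come from the text.

```python
def env_size(env):
    """Bytes taken by the 'NAME=value' strings, each with a NUL terminator.
    This is an assumed definition of environment size; ASCII is assumed."""
    return sum(len(name) + 1 + len(value) + 1 for name, value in env.items())

def pad_env(env, target_bytes, pad_name="PAD"):
    """Return a copy of env padded with one filler variable to an exact size."""
    padded = dict(env)
    padded.pop(pad_name, None)
    overhead = len(pad_name) + 2             # '=' and the NUL terminator
    slack = target_bytes - env_size(padded) - overhead
    if slack < 0:
        raise ValueError("environment already exceeds the target size")
    padded[pad_name] = "x" * slack
    return padded

def launch_command(benchmark_argv):
    """Prefix a command with linux32 -3 -R as described in the text."""
    return ["linux32", "-3", "-R"] + list(benchmark_argv)

env = pad_env({"PATH": "/usr/bin:/bin", "HOME": "/root"}, target_bytes=422)
cmd = launch_command(["./parser", "2.1.dict"])   # hypothetical benchmark argv
print(env_size(env), cmd)
```

The resulting command and environment could then be handed to something like subprocess.run(cmd, env=env), so that every run starts with the same bytes above the stack.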
4.2.2 System Effects
Any Operating System or C library call that returns non-deterministic values
can potentially lead to divergences. This includes calls to random number gen-
erators; anything involving the time, process ID, or thread synchronizations;
and any I/O that might involve errors or partial returns. In general, the SPEC
benchmarks carefully avoid most such causes of non-determinism; this would
not be the case for many real world applications.
OS activity can further perturb counts. For example, we find that perfor-
mance counters for all but the Pentium 4 increase once for every page fault
caused by a process. This can cause instruction counts to be several thousands
higher, depending on the application’s memory footprint. Another source of
higher instruction counts is related to the number of timer interrupts incurred
when a program executes; this is possibly proportional to the number of context
switches. The timer based perturbation is most noticeable on slower machines,
where longer benchmark run times allow more interrupts to occur. Again, the
Pentium 4 counter is not affected by this, but all of the other processors are. In
our final results, we account for perturbations due to timer interrupts but not for
those related to page faults. There are potentially other OS-related effects which
have not yet been discovered.
4.2.3 Sources of DBI Tool Variation
In addition to actual performance counter results, computer architects use
various tools to generate retired instruction counts. Dynamic Binary
Instrumentation (DBI) is a fast way to analyze benchmarks, and it is important
to know how closely tool results match actual hardware counts.
Table 4.3: Potential overcounted dynamic instructions due to the rep prefix
(only benchmarks with more than 10 billion are shown).
benchmark                 rep counts         % overcount
464.h264ref sss main      443,109,753,850    15.7%
464.h264ref fore main      45,947,752,893    14.2%
482.sphinx3                33,734,602,541     1.2%
403.gcc s04                33,691,268,130    18.8%
403.gcc c-typeck           30,532,770,775    21.7%
403.gcc expr2              26,145,709,200    16.3%
403.gcc g23                23,490,076,359    12.1%
403.gcc expr               18,526,142,466    15.7%
483.xalancbmk              15,102,464,207     1.2%
403.gcc cp-decl            14,936,880,311    13.6%
450.soplex pds-50          11,760,258,188     2.5%
453.povray                 10,303,766,848     0.9%
403.gcc 200                10,260,100,762     6.1%
The rep Prefix
An issue with the Qemu and Valgrind tools involves the x86 rep prefix. The
rep prefix can come before string instructions, causing the string instruction
to repeat while decrementing the ecx register until it reaches zero. A naive
implementation of this prefix counts each repetition as a committed instruction,
and Valgrind and Qemu do this by default. This can cause many excess retired
instructions to be counted, as shown in Table 4.3. The count can be up to 443
billion too high for the SPEC benchmarks. We modify the DBI tools to count
only the rep prefixed instruction as a single instruction, as per the relevant
hardware manuals. (Note that older versions of Pin matched real hardware
with regards to rep, but versions newer than 29972 do not, possibly requiring
extra care when measuring instruction counts).
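The difference between the two counting conventions can be illustrated with a small sketch. The trace format below (a mnemonic, a flag for the rep prefix, and the number of repetitions executed) is hypothetical and is not the internal representation used by Qemu, Valgrind, or Pin.

```python
from collections import namedtuple

# Hypothetical trace record: one entry per executed instruction.
TraceEntry = namedtuple("TraceEntry", ["mnemonic", "has_rep", "iterations"])


def count_instructions(trace, collapse_rep=True):
    """Count retired instructions in a trace.

    collapse_rep=True  counts a rep-prefixed instruction once (hardware convention).
    collapse_rep=False counts every repetition (the naive convention).
    """
    total = 0
    for entry in trace:
        if entry.has_rep and not collapse_rep:
            total += entry.iterations
        else:
            total += 1
    return total


example = [
    TraceEntry("mov", False, 1),
    TraceEntry("rep movsb", True, 100),  # copies 100 bytes
    TraceEntry("add", False, 1),
]
naive = count_instructions(example, collapse_rep=False)
hardware_like = count_instructions(example, collapse_rep=True)
```

With long string copies the naive count grows with the data size, while the hardware-style count does not, which is consistent with the large overcounts in Table 4.3.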
Floating Point Rounding
Dynamic Binary Instrumentation tools can make floating point problematic, es-
pecially for x86 architectures. Default x86 floating point mode is 80-bit FP math,
not commonly found in other architectures. When translating x86 instructions,
Valgrind uses 64-bit FP instructions for portability. In theory, this should cause
no problems with well written programs, but, in practice, it occasionally does.
The move to SSE-type FP implementations on newer machines decreases the
problem’s impact, although new instructions may also be sources of variation.
The art benchmark. The art benchmark uses many fewer instructions on
Valgrind than on real hardware. This is due to the use of the “==” C operator
to compare floating point numbers. Rounding errors between 80-bit and 64-bit
versions of the code cause the 64-bit versions to finish with significantly differ-
ent instruction counts (while still generating the proper reference output). This
is because a loop waiting for a value being divided to fall below a certain limit
can happen faster when the lowest bits are being truncated. The proper fix is to
update the DBI tools to handle 80-bit floating point properly. A few temporary
workarounds can be used: passing a compiler option to use only 64-bit floating
point, having the compiler generate SSE rather than x87 floating point instruc-
tions, or adding an instruction to the offending source code to force the FPU
into 64-bit mode.
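The effect of precision on a loop that terminates on an exact floating point comparison can be demonstrated with a rough analogue. Python has no 80-bit type, so this sketch substitutes 32-bit versus 64-bit rounding for the 64-bit versus 80-bit case discussed above; the Newton iteration is an arbitrary stand-in and is not taken from the art source code.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def iterations_until_stable(round_fn, start=1.0, max_iter=1000):
    """Iterate x = (x + 2/x) / 2 until two successive values compare equal.

    round_fn models the precision of the stored result. The loop uses an
    exact '==' test, like the comparison described in the text.
    """
    x = round_fn(start)
    for count in range(1, max_iter + 1):
        new = round_fn((x + 2.0 / x) / 2.0)
        if new == x:
            return count
        x = new
    return max_iter  # guard against oscillation between two neighbors


narrow = iterations_until_stable(to_float32)  # lower precision
wide = iterations_until_stable(float)         # full 64-bit precision
```

The lower-precision version reaches a fixed point in fewer iterations, so the same source code executes fewer instructions, which mirrors the direction of the discrepancy reported for art.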
The dealII benchmark. The dealII SPEC CPU2006 benchmark is problematic
for Valgrind, much like art. In this case, the issue is more critical: the
program enters an infinite loop. It waits for a floating point value to reach an
epsilon value smaller than can be represented with 64-bit floating point. The
authors of dealII are aware of this possibility, since the source code already has a
#define to handle this issue on non-x86 architectures.
Virtual Memory Layout
When instrumenting a binary, DBI tools need room for their own code. The
tools try to keep layout as close as possible to what a normal process would see,
but this is not always possible, and some data structures are moved to avoid
conflicts with memory needed by the tool. This leads to perturbations in the
instruction counts similar to those exhibited in Section 4.2.1.
4.3 Summary
Even though originally included in processor architectures for hardware debug-
ging purposes, when used correctly, performance counters can be used produc-
tively for many types of research (as well as application performance debug-
ging). We have shown that with some simple methodology changes, the x86
retired instruction performance counters can be made to have a coefficient of
variation of less than 0.002%. We have also done some preliminary examina-





Cycle-accurate simulators are one of the prevailing modeling tools in com-
puter architecture research. Unfortunately, the results generated by academic
“cycle-accurate” simulators can be misleading due to unknown levels of error.
More importantly, similar results can often be generated much faster using
simulation techniques based on dynamic binary instrumentation (DBI). (Henceforth,
we use cycle-accurate simulations to refer to tools and results generated in
academia. Industry researchers and developers may have much more accurate
simulators, but since source code is not generally available to academics, we do
not discuss them here.)
In spite of their popularity, cycle-accurate simulators have several draw-
backs.
• Speed: Simulators are slow, often multiple orders of magnitude slower
than native execution. Many researchers commonly use “reduced-
execution” methods to compensate, yet these techniques can compound
simulation error if not applied carefully. We investigate these methods in
detail in Chapter 3.
• Obscurity: The simulation tools are rarely used outside the specialized
field of computer architecture research. Since the simulators themselves
are generally used to run a limited set of benchmark suites, bugs can lurk
in the code base.
• Code Forks: The code base for an academic simulation tool can quickly
become fragmented among the groups using it, or may cease to be main-
tained entirely. Bugs may be fixed at different times at different institu-
tions. The source codes diverge so much that when a paper claims it uses
a particular simulator, that statement may have little meaning, since the
code used differs from the mainline (potentially so much so as to be un-
recognizable).
• Generalization: Simulators are often highly configurable, since the au-
thors usually want a flexible tool that can model a multitude of different
hardware configurations. The end result is that a single simulator might
model many architectures, but it may not model any particular architec-
ture well. Furthermore, the more flexible a simulator, the easier it is to
configure it improperly, often in non-obvious ways.
• Validation: Most simulators are not validated against real hardware, and
when they are, the results are rarely within 10% error, even after extensive
effort to model a known architecture as closely as possible [25, 56, 43].
Exceptions exist, of course, but the most commonly used academic tools
have diverged widely from any versions for which validation has been
attempted.
• Documentation: Simulators are often poorly documented, both at a high
level and at the source-code level. This alone probably accounts for more
errors in simulation than any overt programming bugs. Researchers sim-
ply do not have the information needed to use them correctly.
• Obsolescence: Most simulators are already outdated by the time they be-
come mature enough to run useful workloads. It is difficult to gain suf-
ficient documentation on modern processors to accurately implement in-
ternals, so well understood but obsolete processors are often modeled, in-
stead.
• Tools: Many simulators require a special tool-chain to build suitable ex-
ecutables. The difficulty of using out-of-date toolchains (many need old
versions of libraries that are no longer available, for instance) leads to the
use of pre-compiled benchmarks that are rarely updated. New advance-
ments in compiler technology are thus lost, since the toolchain is rarely
complete enough to compile whole benchmark suites. Some of the more
interesting benchmarks may simply be left out due to toolchain difficul-
ties. This is yet another source of error in simulations [35].
• Operating System: Many simulators cannot model full operating systems.
Cain et al. [31] find that removing the OS from the simulation equation can
have a greater impact on results than ignoring effects of speculation.
These problems result in part from the lack of funding for building and
maintaining solid academic architectural tools. One or two students cannot
create and maintain a tool and use it for their doctoral research in a reason-
able amount of time, given today’s complicated architectures. Many academic
researchers end up using an unvalidated or poorly documented simulator mod-
eling a decade-old processor to run only small portions of a decade-old bench-
mark suite (that was compiled with a decade-old compiler). Needless to say,
using such an infrastructure is unlikely to represent “best practices” when per-
forming cutting-edge computer architecture research. Taking that setup and
scaling the configuration to match a hypothetical processor only tangentially
related to the original design can compound the accuracy problem. Eventu-
ally it becomes critical to know how big the potential error is; a small average
speedup of 5-10% (which is often sufficient for publication) might, in reality,
be dwarfed by cumulative errors of the infrastructure (see footnote 1). To that
end, we configure one commonly used cycle-accurate simulator to model a MIPS R12000,
and compare simulation versus machine results for five performance metrics.
To better understand the tradeoffs between types of simulation tools, we then
compare machine results to simulation results generated by a dynamic binary
instrumentation tool based on Qemu.
5.1 SESC Cycle-accurate Simulator
SESC [125] is a widely used cycle-accurate simulator. It can simulate CMP sys-
tems, but for comparison purposes, we only model a single-core system. The
simulator was originally built to model out-of-order MIPS processors, and thus
it runs MIPS binaries. It uses an elaborate configuration file that can specify ar-
chitectures very different from the initially modeled platform. No documenta-
tion of peer-reviewed validation is publicly available for SESC. The documenta-
tion distributed with the simulator includes a README.validation file show-
ing that results for a few microbenchmarks match hardware execution times
within about 20% for R10000 and R4400 MIPS-based machines.
We configure SESC to match our reference platform as closely as possible
(this required the help of the tool’s original author), which turns out to be dif-
ficult, despite our machine’s being almost exactly the same as the simulator’s
original design point. Major differences are that the R12000 has a unified 2-page,
64-entry software-controlled TLB (SESC apparently only handles separate data
and instruction TLBs), and the R12000’s off-chip L2 cache with a way-predictor
Footnote 1: We do not discuss issues involved with averages chosen to represent
simulation statistics, but see John Mashey [92].





Memory Subsystem    L1i: 32kB, 2-way, 64B
                    L1d: 32kB, 2-way, 32B
                    L2 : 2MB, 2-way, 128B
                    2GB SDRAM, 1.0GB/s
Branch Predictor    2048 entry 2-bit
TLB                 Unified 64-entry
(which can affect L2 cache latencies in a way not easily modeled with SESC).
The branch predictor in the R12000 is deceptively non-trivial, and again it is not
possible to model exactly. (Many of the arcane architectural details are not suffi-
ciently documented for any simulator author to model exactly without “inside”
industrial information.)
We make a best attempt to configure SESC properly. The configuration format
is poorly documented, and many necessary options are not described. Sample
configurations lack necessary information, and source code is not well commented.
In the end, after we spent much time carefully researching and crafting
our configuration file, SESC’s author found 40 errors. This does not bode well
for others attempting to configure the tool without input from SESC authors.
The configuration file we used can be found in Appendix L.
We use a default version of SESC, checked out from the CVS server on
7 April 2008 and compiled with gcc version 4.2.4. We use the -k0x800000
-h0x23400000 -p2 command line options when running benchmarks.
5.2 Reference Hardware
Our reference platform is an SGI Octane2 [156] with an R12000 MIPS proces-
sor [167, 111]. A summary of key features is listed in Table 5.1. The machine
runs Linux 2.6.22 patched to provide Octane support. The kernel is modified to
include the perfmon2 [51] performance counter infrastructure.
The R12000 allows the processor’s branch prediction method to be config-
ured at runtime (it is unusual for a processor to be that configurable). We
create a custom kernel module (available in Appendix K) that sets the proper
Branch Diagnostic Register bits (cp0 register 22) to change the branch predic-
tion method on the fly. The processor defaults to a 2048-entry two-bit saturat-
ing counter dynamic prediction scheme. This can be changed to various static
schemes: always taken, always not-taken, and backward-taken/forward-not-taken.
A global pattern history table with a configurable number of bits can
be enabled, and the Branch Target Address Cache (BTAC) and Branch Return
Cache (BRC) can be individually disabled.
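The default dynamic scheme mentioned above, a table of two-bit saturating counters, can be sketched as follows. Only the table size (2048 entries) and the two-bit counter width come from the text; the indexing function and the initial counter state are assumptions made for illustration, and the real R12000 predictor has additional subtleties that are not modeled here.

```python
class TwoBitPredictor:
    """Table of two-bit saturating counters (0-1 predict not-taken, 2-3 predict taken)."""

    def __init__(self, entries=2048, initial=1):
        self.entries = entries
        self.table = [initial] * entries  # initial state is an assumption
        self.branches = 0
        self.misses = 0

    def _index(self, pc):
        # Assumed indexing: MIPS instructions are 4-byte aligned, so drop
        # the low two bits and wrap around the table size.
        return (pc >> 2) % self.entries

    def update(self, pc, taken):
        """Predict the branch at pc, record a miss if wrong, then train the counter."""
        i = self._index(pc)
        predicted_taken = self.table[i] >= 2
        self.branches += 1
        if predicted_taken != taken:
            self.misses += 1
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        return predicted_taken

    def miss_rate(self):
        return self.misses / self.branches if self.branches else 0.0


# A loop branch taken ten times, falling through once, then taken again.
predictor = TwoBitPredictor()
for outcome in [True] * 10 + [False, True]:
    predictor.update(0x400100, outcome)
```

The counter's hysteresis means the single fall-through costs one misprediction but does not flip the prediction for the next iteration of the loop.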
We run microbenchmarks to verify that the performance counters work
properly. We use pfmon [51] to collect performance statistics. This tool enables
performance monitoring by a separate process, so the bookkeeping is handled
entirely by the OS kernel, inducing very little user-space overhead. Counts are
collected in aggregate for the full program, with no sampling.
There has been concern about the accuracy of MIPS performance counters:
Korn et al. [78] find up to 25% error with some counters on the R12000 and
R10000 under SGI IRIX. We do not detect similar error; potentially, the differ-
ences they see are due to their use of sim-outorder as a reference, which Desikan
et al. [42, 43] found to have similar levels of error.
5.3 DBI-based Simulator
We use Qemu [18] to generate traces consumed by a set of small independent
simulators. Qemu uses dynamic retranslation at the basic-block level to convert
from one architecture (in this case MIPS) to another (in this case x86). We add
code hooks to output needed trace data.
For cache simulation we use the Dinero IV [48] Cache Simulator. Qemu
passes trace information in the Dinero file format over a named-pipe to Dinero
(which runs in a separate process). To determine branch prediction information
we write a custom branch predictor (source available on our website). This pre-
dictor runs in a separate process and obtains the full instruction stream (both
address and instruction value) from Qemu over a named-pipe. The predictor
decodes MIPS instructions and determines which are branches (taking special
care to handle the “predict taken” beql instructions properly). A branch is de-
termined to be taken or not by buffering an additional two instructions to see if
the address after the delay slot is PC+8.
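The taken/not-taken test described above can be sketched as follows. This is an illustration rather than the custom predictor's source: the trace is assumed to be a list of (address, instruction word) pairs, the decoder covers only the common conditional branch encodings, and the handling of branch-likely instructions assumes that a nullified delay slot does not appear in the trace.

```python
ORDINARY_OPCODES = {4, 5, 6, 7}        # beq, bne, blez, bgtz
LIKELY_OPCODES = {20, 21, 22, 23}      # beql, bnel, blezl, bgtzl
REGIMM_ORDINARY_RT = {0, 1, 16, 17}    # bltz, bgez, bltzal, bgezal
REGIMM_LIKELY_RT = {2, 3, 18, 19}      # bltzl, bgezl, bltzall, bgezall


def classify(insn):
    """Return 'branch', 'likely', or None for a 32-bit MIPS instruction word."""
    opcode = (insn >> 26) & 0x3F
    rs = (insn >> 21) & 0x1F
    rt = (insn >> 16) & 0x1F
    if opcode in ORDINARY_OPCODES:
        return "branch"
    if opcode in LIKELY_OPCODES:
        return "likely"
    if opcode == 1:                    # REGIMM group
        if rt in REGIMM_ORDINARY_RT:
            return "branch"
        if rt in REGIMM_LIKELY_RT:
            return "likely"
    if opcode == 17 and rs == 8:       # floating point condition branches
        return "likely" if rt & 2 else "branch"
    return None


def branch_outcomes(trace):
    """Yield (pc, taken) for each conditional branch in a list of (pc, insn)."""
    for i, (pc, insn) in enumerate(trace):
        kind = classify(insn)
        if kind == "branch" and i + 2 < len(trace):
            # Not taken if execution continues at PC+8 after the delay slot.
            # A taken branch whose target is PC+8 would be misclassified.
            yield pc, trace[i + 2][0] != pc + 8
        elif kind == "likely" and i + 1 < len(trace):
            # Assumption: the delay slot at PC+4 is executed only when taken.
            yield pc, trace[i + 1][0] == pc + 4


taken_trace = [(0x1000, 0x10000003), (0x1004, 0), (0x1010, 0), (0x1014, 0)]
fallthrough_trace = [(0x2000, 0x14220003), (0x2004, 0), (0x2008, 0)]
likely_trace = [(0x3000, 0x50000003), (0x3008, 0)]
```

The outcomes produced this way are what a predictor model would consume, one (address, taken) pair per branch.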
Because each of our tools runs in a separate process, we can take advan-
tage of CMP and SMP systems better than most cycle-accurate simulators. Each
process can live on its own core, and running the branch predictor process at the
same time as the cache process adds negligible overhead on a four-processor
machine. The limiting factor here is the cache simulator, not dynamic translation
and execution of the binary.
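The trace hand-off to the cache simulator described in this section can be sketched as follows. The record layout follows the traditional din trace format (an access label followed by a hexadecimal address, with 0 for a data read, 1 for a data write, and 2 for an instruction fetch). The function names and the FIFO path argument are hypothetical, and the actual hooks added to Qemu are not shown in the text.

```python
import os

DIN_READ, DIN_WRITE, DIN_IFETCH = 0, 1, 2  # din access labels


def din_record(label, address):
    """Format one memory access as a din trace line."""
    return f"{label} {address:x}\n"


def stream_trace(fifo_path, accesses):
    """Write (label, address) pairs to a named pipe for a separate consumer.

    Opening the FIFO for writing blocks until a reader (for example the
    cache simulator process) opens the other end.
    """
    if not os.path.exists(fifo_path):
        os.mkfifo(fifo_path)
    with open(fifo_path, "w") as pipe:
        for label, address in accesses:
            pipe.write(din_record(label, address))
```

Because the producer only writes text lines to a pipe, the consumer can run on a different core, which is the property the paragraph above relies on.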
Figure 5.1: The precompiled SPEC 2000 benchmarks available from the
SESC website have potentially been modified to reduce run-
time. A phase chart gathered with hardware performance
counters shows behavior of the provided precompiled binary
on top and that of a binary we compiled from original SPEC
sources (with gcc) on bottom.
5.4 Benchmarks
To evaluate the various simulation methods, we use SPEC CPU2000 [136]
benchmarks. To enable comparison with past uses of the SESC simulator, we
use the pre-compiled versions of the benchmarks provided on the SESC web-
site. All three of our test platforms can run these benchmarks unmodified.
Unfortunately the pre-compiled benchmarks have some limitations. Al-
though not documented as such, they are not plain CPU2000 binaries. Extra
printf() commands have been scattered throughout the code (presumably
for debugging purposes or for controlling partial simulation experiments), and
some benchmarks have been modified for faster run times. As an example, see
Figure 5.1, which shows that gzip, as provided, only executes a small fraction
of the full benchmark. In addition, not all of the CPU2000 benchmarks are
included with the precompiled binaries. We run full reference input sets for all
experiments.
Table 5.2: Comparison of simulation times
Method   Fastest                Slowest                     Mean Slowdown
R12000   15s (gzip.log)         57m23s (swim)               –
QEMU     13m52s (gzip.log)      1d20h20m47s (sixtrack)      38x
SESC     2h17m38s (gzip.log)    16d02h53m15s (mgrid)        393x
5.5 Results
We run as many SPEC 2000 benchmarks as possible on the various platforms.
Relative run times are shown in Table 5.2. For the simulated results, we run on a
large cluster of 4-processor 3.46GHz Pentium D nodes, each with 4GB of RAM.
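The run times in Table 5.2 can be compared programmatically with a small helper. This sketch parses the duration notation used in the table and computes the slowdown for a single benchmark (gzip.log); the function names are hypothetical. The mean slowdowns reported in the table are averages over the benchmarks, so a single-benchmark ratio need not match them.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '1d20h20m47s' or '15s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", text)
    # Reject strings with anything other than number/unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def slowdown(simulated, native):
    """Ratio of simulated run time to native run time."""
    return to_seconds(simulated) / to_seconds(native)


# gzip.log rows from Table 5.2
qemu_gzip = slowdown("13m52s", "15s")
sesc_gzip = slowdown("2h17m38s", "15s")
```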
5.5.1 Absolute Results
Figure 5.2 shows actual and predicted L1 instruction cache miss rates. Our three
methods calculate instruction cache misses in different ways. For the perfor-
mance counter results, these graphs show decoded instructions versus instruc-
tion cache misses; for SESC and Qemu the graphs show graduated instructions
versus instruction cache misses. The number of instruction cache misses in the
floating point case is so small that a small absolute error can cause a large per-
centage error. Qemu has problems with the art benchmarks, which we are
investigating.
[Plot: L1 Instruction Cache Miss Rate; series: actual r12k, Qemu/Dinero, SESC]
Figure 5.2: Instruction cache miss rate with integer benchmarks above and
floating point below.
issues with the performance counters. Memory accesses that occur while the
benchmark process is not running can change values in the cache. While we
attempt to run the benchmarks on an otherwise quiet system, other processes
and even the operating system can evict cache lines on the real system in ways
that cannot be modeled in the simulator. Similarly, values stored into cache
may not be accounted for by the performance counters if the actual write-back
to memory happens when in a different processor context.
Qemu does not follow wrong-path execution (see footnote 2), which can account for some
of the differences from actual hardware. Likewise, SESC does not follow wrong-path
execution; the code path that models speculation is out of date, and is
thus disabled in the default configuration. Despite not executing wrong-path
Footnote 2: There has been work done to enable wrong-path execution support on Qemu [33, 32] but
[Plot: L1 Data Cache Miss Rate; series: actual r12k, Qemu/Dinero, SESC; labeled values: 51.6, 51.7]
Figure 5.3: L1 data cache miss rate with integer benchmarks above and
floating point below.
instructions, results are quite accurate; this shows that full cycle-accuracy is not
always needed to generate good cache simulation results (and further supports
the conclusions of Cain et al. [31] regarding OS impact versus speculation, at
least in the case of Qemu).
Figure 5.3 shows L1 data cache miss rates, and Figure 5.4 shows L2 miss
rates. The latter is important, since L2 cache misses must traverse the processor
bus of a multiprocessor system. If the tool used records vastly incorrect numbers
of misses, multiprocessor simulations will generate erroneous data that could
influence a final design. SESC generally does poorly predicting L2 miss rates for
floating point benchmarks. This could indicate that the floating point pipeline
[Plot: L2 Cache Miss Rate; series: actual r12k, Qemu/Dinero, SESC]
Figure 5.4: L2 cache miss rate with integer above and floating point below.
None of the simulations captures mcf’s behavior well. None of
the simulation methods predicts the art benchmarks well.
The R12000 has a complicated off-chip cache. In order to save pins, the ma-
chine incurs significant overhead in changing cache ways. To mitigate this, it
uses a cache way-predictor, with a penalty on a miss. None of the simulators
model this aspect of the system, which can potentially become another source
of modeling error.
Figure 5.5 shows branch predictor results. The R12000 can predict and fetch
past up to four branches, so many speculative instructions can be in flight.
Qemu and SESC cannot model this. In fact, the R12000 branch predictor has
many hardware subtleties that neither Qemu nor SESC can model.
[Plot: Branch Miss Rate; series: actual r12k, Qemu/Dinero, SESC; labeled values: 33.9, 27.6]
Figure 5.5: Branch miss rate with integer above and floating point below.
The hardware can have up to four outstanding branches; Qemu
and SESC do not model wrong-path execution.
Figure 5.6 shows CPI results. Qemu does not model time, so we approximate
the cycle count as
$$\mathit{cycles} = \frac{I_g \cdot L1_{ht}}{\mathit{ifs}} + DL1_a \cdot L1_{ht} + L1_m \cdot L1_{mt} + L2_m \cdot L2_{mt} + Br_m \cdot Br_{mt}$$
where $I_g$ is graduated instructions, $L1_{ht}$ is L1 hit time (2 cycles), $\mathit{ifs}$ is the
instruction fetch size (4 words), $DL1_a$ is L1 data accesses, $L1_m$ is L1 misses, $L1_{mt}$
is L1 miss time (14 cycles), $L2_m$ is L2 misses, $L2_{mt}$ is L2 miss time (120 cycles),
$Br_m$ is number of branch misses, and $Br_{mt}$ is branch miss delay (2 cycles).
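The cycle approximation can be written as a short function. This is a sketch under the assumption that the first term of the formula is the graduated instruction count multiplied by the L1 hit time and divided by the instruction fetch size; the constant and parameter names are ours, and the latency values are those listed in the definitions above.

```python
L1_HIT_TIME = 2          # cycles
IFETCH_SIZE = 4          # words fetched per instruction cache access
L1_MISS_TIME = 14        # cycles
L2_MISS_TIME = 120       # cycles
BRANCH_MISS_DELAY = 2    # cycles


def approx_cycles(grad_insns, dl1_accesses, l1_misses, l2_misses, branch_misses):
    """Empirical cycle estimate built from event counts."""
    return (grad_insns * L1_HIT_TIME / IFETCH_SIZE
            + dl1_accesses * L1_HIT_TIME
            + l1_misses * L1_MISS_TIME
            + l2_misses * L2_MISS_TIME
            + branch_misses * BRANCH_MISS_DELAY)


def approx_cpi(grad_insns, dl1_accesses, l1_misses, l2_misses, branch_misses):
    """Cycles per graduated instruction under the same model."""
    return approx_cycles(grad_insns, dl1_accesses, l1_misses,
                         l2_misses, branch_misses) / grad_insns


# Made-up event counts, chosen only to exercise each term once.
example_cycles = approx_cycles(1000, 300, 10, 1, 20)
example_cpi = approx_cpi(1000, 300, 10, 1, 20)
```

With no data accesses, misses, or mispredictions the model yields half a cycle per instruction, which is the floor set by the instruction fetch term.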
This is an empirical model that was arbitrarily chosen because it seems to
match well against the parameters we have. It is similar in idea to CPI gen-
eration functions for the R10000 presented by Luo et al. [88]. The L1 icache
parameter might be spurious; its primary effect is to limit the maximum IPC to
two, which is what is found on the SPEC benchmarks. In theory the R12000
[Plot: Cycles Per Instruction; series: actual r12k, Qemu/Dinero, SESC; labeled values: 5.3, 5.2]
Figure 5.6: CPI results with integer above and floating point below.
ancy. The data cache misses should be hidden by out-of-order execution too,
although depending on the memory subsystem design this might not happen.
Luo et al. [88] found up to an 80% stall rate for one configuration of an R10000
processor.
CPI is the metric most often used in validation, so it is important to have
these values match hardware as closely as possible. There are many architec-
tural and software causes of cycle variation not modeled by either simulator.
[Plot: Relative Branch Miss Rate 2bit/Taken; series: actual r12k, Qemu, SESC]
[Plot: Relative Branch Miss Rate 2bit/Static; series: actual r12k, Qemu, SESC]
Figure 5.8: Static branch predictor miss rate, normalized against dynamic
two-bit results.
5.5.2 Relative Results
Many researchers hold that absolute results are not as important with cycle-
accurate simulation, but that relative results are what matter most. As long as
the trends are consistent, then a simulator is still useful, even if the simulator is
unvalidated and the error is large. To investigate this, we configure our R12000
to use different branch predictors. We plot relative differences in the metrics to
see if consistent trends are visible.
Figure 5.7 shows the relative reduction in branch predictor miss rate when
[Plot: Relative L2 Miss Rate 2bit/Taken; series: actual r12k, Qemu, SESC]
Figure 5.9: L2 cache miss rates with the always-taken predictor, normalized
against two-bit results.
[Plot: Relative L2 Miss Rate 2bit/Static; series: actual r12k, Qemu, SESC]
Figure 5.10: L2 cache miss rates with the static predictor, normalized
against two-bit results.
shows that trends are similar across all benchmarks, although Qemu results are
optimistic and SESC results are pessimistic. Figure 5.8 compares a static back-
ward/taken forward/not-taken predictor to the dynamic two-bit predictor.
Figure 5.9 shows how the always-taken predictor affects the L2 cache miss
rate compared to the two-bit predictor. Neither Qemu nor SESC models wrong-path
execution, so they exhibit identical memory access behavior even with different
branch predictors. Neither simulation method can predict the significant
predictor-based changes in L2 behavior observed on actual hardware. Results
[Plot: Relative TLB 2bit/Taken; series: actual r12k, SESC]
[Plot: Relative CPI 2bit/Static (title as printed); series: actual r12k, SESC]
Figure 5.12: TLB misses with static predictor, normalized against two-bit.
Figures 5.11 and 5.12 show TLB behavior. Results are not shown for Qemu
because a trace-based TLB simulator was not available. On actual hardware,
the branch predictor seems to have minimal impact on TLB behavior. The MIPS
TLB is managed in software, usually with random replacement. This means
that it is easy for results to diverge. Also, MIPS has a unified instruction/data
TLB, which SESC cannot model.
Figure 5.13 and Figure 5.14 show the relative results for CPI. Qemu results
are close to those for the R12000, despite the cycle counts being based solely on
[Plot: Relative CPI 2bit/Taken; series: actual r12k, Qemu, SESC]
[Plot: Relative CPI 2bit/Static; series: actual r12k, Qemu, SESC]
Figure 5.14: CPI with static predictor normalized against two-bit results.
5.5.3 Summary
A summary of the absolute results is shown in Table 5.3. The weighted average
of the various metrics is taken across all benchmarks that run to completion
on all three platforms. This is a total of 22 benchmarks (19 integer, 3 floating
point) which, unfortunately, only represents a portion of the 48 SPEC CPU2000
benchmark/input pairs. SESC does not perform noticeably better than Qemu,
despite taking an order of magnitude longer to run.
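The error columns in Table 5.3 appear consistent with a signed relative error of each simulated value against the hardware value. The sketch below makes that assumption explicit; the function name is ours, and small differences from the printed percentages are expected because the table entries are rounded.

```python
def percent_error(actual, simulated):
    """Signed relative error of a simulated value, in percent of the actual value."""
    return (simulated - actual) / actual * 100.0


# Integer-benchmark rows from Table 5.3 (rates in percent).
qemu_l1d_int = percent_error(3.928, 4.260)    # table reports 8.5%
sesc_l1d_int = percent_error(3.928, 4.726)    # table reports 20.3%
sesc_brpred_int = percent_error(18.9, 27.0)   # table reports 42.9%
```

A negative result means the simulator underestimates the metric, and a positive result means it overestimates it.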
Table 5.4 shows the percent error of the average relative performance differ-
ences. The CPI results show that these methods can be used to predict perfor-
mance with an average error of 15%. The L2 Cache results show that sometimes
Table 5.3: Summary of results. The weighted average is across all of the
SPEC 2000 benchmarks which ran to completion on all three
platforms: 23 integer and 11 floating point (this is unfortunately
only a portion of the 48 available benchmark/input combinations).
Metric             Bench   R12000     Qemu                 SESC
                   Type    Weighted   Weighted   %         Weighted   %
                           Average    Average    Error     Average    Error
L1I$ Miss Rate     Int     0.233%     0.334%     43.5%     0.248%     6.4%
                   FP      0.008%     0.001%     -83.9%    0.006%     -23.9%
L1D$ Miss Rate     Int     3.928%     4.260%     8.5%      4.726%     20.3%
                   FP      5.230%     6.406%     22.5%     6.485%     24.0%
L2$ Miss Rate      Int     0.058%     0.051%     -11.9%    0.042%     -27.6%
                   FP      0.127%     0.107%     -16.2%    0.128%     0.4%
BrPred Miss Rate   Int     18.9%      18.4%      -2.7%     27.0%      42.9%
                   FP      12.7%      18.2%      43.2%     15.0%      18.4%
CPI                Int     1.20       1.03       -14.6%    1.47       22.6%
                   FP      1.09       1.41       29.3%     1.60       46.4%
Table 5.4: Summary of relative results. The relative results compare the
relative results when moving from 2-bit branch predictor to either
taken or static. The error shown is the relative error between
the relative average means of all benchmarks on actual hardware
versus the predicted relative average means of the simulated
results. The results represent the 33 of the SPEC CPU 2000
BrPred Miss Rate   Taken     64.1%    -28.0%
                   Static   -11.0%    -44.9%
L2$ Miss Rate      Taken      5.6%      6.1%
                   Static     7.1%      7.4%
CPI                Taken     11.5%     -7.1%
                   Static     0.1%    -10.9%
results can be deceptive; even though neither QEMU nor SESC models wrong-




Our work in Chapter 5 finds acceptable results when using DBI methods to
simulate an obsolete RISC processor; we extend this work to a more modern 64-
bit x86 platform. Memory access patterns on modern CISC (Complex Instruc-
tion Set Computer) systems differ from older RISC systems, with variable-sized
instructions, aggressive prefetching, and SSE vector-like memory accesses. Un-
fortunately CISC simulations run slower than RISC. The exact slowdown de-
pends on the simulator, but on the m5 simulator moving from Alpha to x86 has
a slowdown of at least a factor of two.
6.1 RISC/CISC differences
RISC chips, even sophisticated ones such as the MIPS R10000 or Alpha 21264 (as
simulated by common simulators), are missing many features found in entry-
level x86 processors.
Here are some CISC “features” that most RISC implementations do not have
to worry about:
• Unaligned instructions
• Variable length instructions
• Instructions that cross cache lines
• Complicated lock instructions
• Complicated string instructions
• Hardware square-root and transcendental functions
• µop Decoder Cache
• Complex µop issue logic, “fusing”
• Self-modifying code
• Micro-code assist on complex instructions (NaN, Denormals, Div/0, Un-
derflows)
6.2 Modern CPU Features
Cycle-accurate simulators tend to model older implementations of architec-
tures. Modern architectural features are often left out of a simulator as they do
not affect correctness, but can affect behavior. Modern implementations of RISC
chips (such as ARM, MIPS, Power and SPARC) might have these features, but
many simulators do not support them. Recent x86 binaries make use of these
features, and since comprehensive x86 simulators are a recent development, the
simulators have to handle these newer features to run the binaries properly.
There are many features that can affect architectural simulation but are not
commonly found in simulators:
• Vector instructions (most modern RISC architectures support them, but they
are not commonly used)
• Hardware prefetch
• Various software prefetch types (including non-temporal)
• Large pages (2MB, 1GB)
• Memory disambiguation predictor
• Executing small loops out of the instruction fetch unit without accessing
the cache (Loop Stream Detector, LSD)
• Trace caches
• Thermal trip support
• CPU frequency scaling
• MTRR/PAT Page attributes (set cache behavior at page level)
• ECC memory
• Return address prediction
• Stack pointer prediction
• Sophisticated branch prediction schemes
• Complicated memory hierarchies
• On-chip memory controllers
6.3 µop Concerns
Modern x86 implementations do not directly execute complex CISC instructions.
During fetch and decode these complex instructions are broken down into
RISC-like instructions known as µops.
Since µops are “RISC-like”, RISC simulators can be repurposed to act
as backends for CISC simulators. This is a common simulation methodology
[53, 23, 129, 141, 134, 26] which, as far as we know, has not been validated.
Figure 6.1 shows L1 data cache accesses per µop on MIPS and three x86_64
architectures for the gzip.program benchmark (complete µop phase diagrams
can be found in Appendix H). We measure µop counts using the counters listed
in Table 6.1.
Table 6.1: Hardware performance counters used for µop experiments
machine      Retired Instructions      Retired µops
Phenom       retired_instructions      retired_uops
Core2        instructions_retired      uops_retired:any
Pentium D    instr_completed:nbogus    uops_retired:nbogus
Pentium Pro  inst_retired              uops_retired
Atom         instructions_retired      uops_retired:any















[Figure 6.1 panels: Phenom Full (intervals = 1349), Pentium D Full
(intervals = 1669), Core2 Full (intervals = 1476); Avg dpu = 0.40]
Figure 6.1: Data cache accesses per µop for gzip.program
An unexpected result is that the µop behavior varies between implementations
of the same architecture. The set of µops is not fixed and architects are free
to change it at any time. Figure 6.1 shows that the MIPS instruction trace would
make a believable x86_64 µop stream; however, it does not closely match any of
the existing machines. Care should be taken when using RISC results as a µop
substitute.





























































































































































































































[Figure 6.2 plot: normalized µops; series: Phenom, Phenom (32), Core2,
Pentium D, MIPS, PPC, m5; value labels 3.2, 2.9]
Figure 6.2: Normalized µops per benchmark for three x86_64 implementations,
a 32-bit x86, the m5 simulator, and two representative RISC architectures.
Table 6.2: Number of µops required for an assortment of x86 instructions
instruction                               Phenom  Core2  Pentium D  Pentium Pro  Atom
add %eax,%edx   (32-bit int add)               1      1          1            1     1
add mem,%eax    (32-bit add from mem)          1      1          2            2     1
imul %eax,%edx  (32-bit int multiply)          2      3          2            3     3
rep stosb       (repeated string store)      0.3   0.43       0.55          0.6     3
fadd 1.0,pi     (floating point add)          23      1          1            4     1
fsincos         (floating point sincos)       60    101        150          107   118
haddps          (128-bit horizontal add)       1      6          3          N/A     5
pslldq          (128-bit shift)                1      2          1          N/A     1
all of the SPEC CPU2000 benchmarks. The relative number of µops varies by
benchmark, even on the same architecture. The 32-bit machine has many more
µops, especially on floating point benchmarks; this is because the 32-bit
program uses x87 floating point, which produces many more µops than the
SSE-based floating point used on the x86_64 machines. The two comparison
RISC machines are roughly the same as the x86 machines. The m5 counts are in
general much too high; this is because the simulator’s µop generation has not
yet been matched to that of an actual machine.
Table 6.2 breaks out µop counts for a few selected instructions, to show why
it is difficult to make generic statements about µop behavior. An additional
challenge is that µop counts may vary from run to run, because unlike the retired
instruction counters, the µop counts include microcode, exception, interrupt,
and various other effects [10, 72]. There is not always a static mapping between
µops and instructions; operations like floating point transcendental functions
can take varying numbers of µops depending on the operands involved.
Another issue with µops is that hardware performance counters do not al-
ways measure the same results across architectures. Kenneth Hoste [68] found
that some architectures “fuse” the µops, making it difficult to compare results,
specifically between Nehalem and Core2 implementations.
Due to all of the issues found with µops, retired instructions may be the
best base metric to use when comparing x86 implementations. This might seem
counter-intuitive, because it sacrifices some of the fine detail provided by the
knowledge of µop behavior.
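
To make the variation concrete, the following sketch applies the per-instruction µop counts from Table 6.2 to a small instruction mix. The mix is hypothetical and chosen only for illustration; the µop counts are the table's values for the Phenom, Core2, and Pentium D.

```python
# µops per instruction, taken from Table 6.2 (Phenom, Core2, Pentium D columns).
UOPS_PER_INSN = {
    "add reg,reg": {"Phenom": 1, "Core2": 1, "Pentium D": 1},
    "add mem,reg": {"Phenom": 1, "Core2": 1, "Pentium D": 2},
    "imul":        {"Phenom": 2, "Core2": 3, "Pentium D": 2},
    "fsincos":     {"Phenom": 60, "Core2": 101, "Pentium D": 150},
}

# Hypothetical dynamic instruction mix (counts are illustrative only).
EXAMPLE_MIX = {"add reg,reg": 700, "add mem,reg": 250, "imul": 49, "fsincos": 1}


def uops_per_instruction(mix, machine):
    """Return the average number of µops per retired instruction."""
    total_insns = sum(mix.values())
    total_uops = sum(count * UOPS_PER_INSN[insn][machine]
                     for insn, count in mix.items())
    return total_uops / total_insns


if __name__ == "__main__":
    for machine in ("Phenom", "Core2", "Pentium D"):
        print(machine, round(uops_per_instruction(EXAMPLE_MIX, machine), 3))
```

The retired instruction count is the same for all three machines in this example, while the µop totals differ, which is the motivation for using retired instructions as the base metric.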
6.4 Evaluation Methodology
We evaluate x86 simulation using three different methods: the Valgrind DBI
tool, the m5 cycle-accurate simulator, and hardware performance counters.
6.4.1 Valgrind DBI-based Simulator
To test DBI-based simulation we use the Cachegrind [112] tool that comes with
the Valgrind [113] DBI infrastructure. This tool simulates a configurable single-
core cache and also can simulate a simple branch predictor.
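
As an illustration of what a simple branch predictor of this kind involves, here is a minimal sketch of a table of 2-bit saturating up/down counters indexed by branch address (Section 6.7 describes the Valgrind predictor as a simplistic 16k 2-bit up/down counter predictor). The indexing function and the initial counter state are assumptions made for the example; this is not Cachegrind's actual implementation.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters; 0-1 predict not taken, 2-3 taken."""

    def __init__(self, entries=16 * 1024):
        self.entries = entries
        self.table = [1] * entries      # assumed initial state: weakly not taken
        self.branches = 0
        self.mispredicts = 0

    def access(self, address, taken):
        """Predict the branch at `address`, then train on the real outcome."""
        index = address % self.entries  # assumed indexing: low bits of address
        predicted_taken = self.table[index] >= 2
        self.branches += 1
        if predicted_taken != taken:
            self.mispredicts += 1
        if taken:
            self.table[index] = min(3, self.table[index] + 1)
        else:
            self.table[index] = max(0, self.table[index] - 1)
        return predicted_taken

    def miss_rate(self):
        return self.mispredicts / self.branches if self.branches else 0.0


if __name__ == "__main__":
    bp = TwoBitPredictor()
    # A loop branch taken 9 times and then falling through, repeated 10 times.
    for _ in range(10):
        for i in range(10):
            bp.access(0x400123, taken=(i < 9))
    print(f"miss rate: {bp.miss_rate():.2%}")
```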
We configure the cache simulator to have the same basic cache configuration
as the Phenom hardware described in Table 6.3, which means the command line




L1 Instruction Cache   64kB, 2-way, 64B
                       prefetch 2 lines on miss
L1 Data Cache          64kB, 2-way, 64B
                       write-allocate, write-back
                       LRU, ECC, MOESI, 3-cycles
L2 Cache               512kB, 16-way, 64B
                       non-inclusive victim
                       9-cycles, per-core
L3 Cache               2MB, 32-way, 64B
                       non-inclusive victim
                       shared by all cores






The average slowdown while running Cachegrind is 29x over baseline.
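
A Cachegrind invocation for a Phenom-like configuration can be assembled from the Table 6.3 parameters. The sketch below uses the `--I1`, `--D1`, and `--L2` options (each taking size in bytes, associativity, and line size) and `--branch-sim=yes`. `--L2` is the option name in Valgrind 3.5 and earlier; later releases rename it to `--LL`. The benchmark path and input name are placeholders, and the exact command used for the experiments is not reproduced here. Cachegrind models two cache levels, so this sketch assumes the shared L3 is the level left out.

```python
def cachegrind_command(binary, args=(), l2_option="--L2"):
    """Build a Cachegrind command line for a Phenom-like cache configuration."""
    kib = 1024
    return [
        "valgrind", "--tool=cachegrind",
        f"--I1={64 * kib},2,64",            # L1 instruction: 64kB, 2-way, 64B lines
        f"--D1={64 * kib},2,64",            # L1 data: 64kB, 2-way, 64B lines
        f"{l2_option}={512 * kib},16,64",   # L2: 512kB, 16-way, 64B lines
        "--branch-sim=yes",                 # enable the simple branch predictor
        binary, *args,
    ]


if __name__ == "__main__":
    # "./benchmark" and "input.ref" are placeholders, not names from the experiments.
    print(" ".join(cachegrind_command("./benchmark", ["input.ref"])))
```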
6.4.2 m5 Cycle-accurate Simulator
We use the m5 [22] simulator as a reference cycle-accurate simulator for our
study. It is currently one of only two readily available academic simulators
capable of running x86 binaries, the other being PTLsim [172].
m5 can simulate multiple architectures, but we are primarily interested in
x86 emulation. m5 can run standalone statically linked binaries in syscall
emulation mode, as well as full operating systems in full system mode.
Unfortunately full system mode has not been tested for x86, so we are limited
to using syscall emulation mode.
m5’s x86 support is new; so new that it was not working when we started
this work. We contribute a large number of patches that allowed the SPEC
CPU2000 benchmarks to run correctly to completion on the simulator, and most
of these patches have been merged into the project. There are still some limita-
tions to x86 support, most notably that x87 floating point is not implemented;
only binaries compiled to use SSE instructions will work.
Another issue with m5 is that x86 support is so new that only the simple
atomic model of execution is supported. This treats each instruction as a sin-
gle atomic entity. The detailed (in-order) and out-of-order models are not sup-
ported, which limits the experiments that can be run. This is unfortunate, but
the only real alternative (PTLsim) has show-stopping issues as well, leaving us
with no clear best choice for our experiments.
We configure m5 to match our Phenom machine described in Table 6.3 as
closely as possible without requiring code changes to the simulator. This limits
our changes primarily to cache parameter settings. We cannot model a branch
predictor, as that requires the non-working detailed execution model; the same
is true for speculative execution.
We use a development version of m5 checked out of the code repository on
16 November 2009, with patches added that enable full x86 support (mainly
some missing syscalls and instruction corner cases). We also add code that
dumps extra statistics (printing the instruction count as well as the µop
count, and dumping stats at regular intervals).
The average slowdown of m5 running in simple atomic mode with caches
enabled is 2882x over baseline.
6.4.3 Reference Hardware
Table 3.2 lists the machines used in our experiments. The performance counters
we use are listed in Table 6.4.
We primarily use the Phenom (summarized in Table 6.3) for gathering results,
as the cache simulator we intend to use supports the MOESI protocol for
AMD-style machines. The Phenom has a complicated memory hierarchy. It has
a 64KB, 2-way L1 instruction cache with a 64-byte line size; on a miss it
assumes spatial locality and fetches two lines, the missing line and the one
following. It has a 64KB, 2-way L1 data cache with a 64-byte line size, which
is write-allocate and write-back, with ECC and an LRU replacement policy.
Cache coherence is maintained with a MOESI-like protocol, and the L1 data
cache latency is 3 cycles. The L2 cache is per-core, 512KB, 16-way, with a
64-byte line size; it is a non-inclusive victim cache with a latency of
9 cycles. The L3 cache is system-wide, 2MB, 32-way, with a 64-byte line size,
and also behaves as a non-inclusive victim cache. The CPU has an integrated
memory controller with a built-in prefetcher.
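
To make the L1 data cache geometry concrete, the following minimal sketch simulates a set-associative cache with LRU replacement using the 64KB, 2-way, 64-byte-line parameters described above. It is an illustration only: it ignores write policy, coherence, latency, and prefetching, and it is not the cache model used by Cachegrind or m5.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set-associative cache with LRU replacement (tags only, no data)."""

    def __init__(self, size_bytes=64 * 1024, ways=2, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Look up one address; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        self.accesses += 1
        if tag in lines:
            lines.move_to_end(tag)          # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)       # evict the least recently used line
        lines[tag] = True
        return False


if __name__ == "__main__":
    cache = SetAssociativeCache()
    # Three lines that map to the same set thrash a 2-way cache.
    stride = cache.num_sets * cache.line_bytes
    for _ in range(4):
        for way in range(3):
            cache.access(way * stride)
    print(cache.num_sets, cache.accesses, cache.misses)
```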
Table 6.4: Hardware performance counters used for our experiments. We
did not use all of the counters listed. Some of the counters have
known errata. We gathered this list from PAPI [102] and the
AMD and Intel reference manuals [10, 72].
stat                  Phenom                       Core 2                  Pentium D
Retired Instructions  retired_instructions         instructions_retired    instr_completed:nbogus
Retired µops          retired_uops                 uops_retired:any        uops_retired:nbogus
Elapsed Cycles        cpu_clk_unhalted             unhalted_core_cycles    global_power_events:running
L1 dCache Accesses    data_cache_accesses          l1d_all_ref             front_end_event:NBOGUS
                                                                           uops_type:TAGLOADS:TAGSTORES
L1 dCache Misses      data_cache_misses            l1d_pend_miss           n/a
L1 iCache References  instruction_cache_fetches    l1i_reads               uop_queue_writes:from_tc_build:from_tc_deliver
L1 iCache Misses      instruction_cache_misses     l1i_misses              bpu_fetch_request:tcmiss
L2 Cache References   data_cache_misses +          l2_rqsts:self:any:mesi  bsq_cache_reference:rd_2ndL_miss:rd_2ndL_hits:
                      instruction_cache_misses                             rd_2ndL_hite:RD_2ndL_hitm
L2 Cache Misses       l2_cache_miss:data +         l2_lines_in:self:any    bsq_cache_reference:RD_2ndL_MISS
                      l2_cache_miss:instructions
Branch Instructions   retired_branch_instructions  br_inst_exec            branch_retired:mmnp:mmnm:mmtp:mmtm




We use the SPEC CPU2000 [136] benchmarks for evaluation purposes, as they
are long enough to provide interesting results, but at the same time short
enough that the cycle-accurate results have a chance of finishing within a few
weeks.
The benchmarks were compiled with -O3 -msse3 -funroll-all-loops
-ffast-math -static.
The vortex benchmarks and some of the perlbmk benchmarks did not
run; this is a limitation of the benchmarks themselves with modern compilers,
and not an issue with our simulation methods. The same benchmarks fail on
actual hardware.
We run full reference input sets for all experiments.
We ran all of the simulations on a large cluster of 3.4GHz Pentium D ma-
chines, identical to the system herein referred to as “Pentium D”.
6.5 Absolute Results
We first investigate the absolute results returned by our various simulation
methods. These are the results for one hardware configuration, without varying
any of the simulation parameters.



















[Figure 6.3 panels: Unguided FF (Phenom), intervals = 1; 1 SimPoint (Phenom),
intervals = 1; 5 SimPoints (Phenom), intervals = 4; 10 SimPoints (Phenom),
intervals = 8; Phenom Full, intervals = 256; Pentium D Full, intervals = 262;
Core2 Full, intervals = 260; Avg dpi = 0.55]
Figure 6.3: L1 data cache accesses per instruction. This plot shows that
cache accesses per instruction are consistent across all actual machines,
as well as the simulators. The MIPS results are very different. SimPoint
results are shown for comparison.
6.5.1 Phase Behavior Results
Figure 6.3 shows the phase behavior of L1 data cache accesses per instruction
for gcc.166 (full results for SPEC CPU2000 can be found in Appendix G). The
three actual hardware implementations give practically identical plots for this
metric, which is encouraging. Valgrind and m5 also give similar results. The
MIPS results, while showing similar patterns, involve many more instructions,
so direct comparisons cannot be made. Also shown on the graph are the
SimPoint results and the results of un-guided simulation.
6.5.2 L1 Instruction Cache
Figure 6.5 shows actual and predicted L1 instruction cache miss rates. Ac-
tual hardware measures icache references, while the DBI tools measure total
instructions. In order to convert between the two, the average instruction size
is needed. On RISC this is a fixed value, but x86 has variable-sized instructions.
We scale the results based on an average number of bytes per instruction (shown
in Figure 6.4). On our Phenom reference platform, a 16-byte load from icache
is considered an instruction reference. The actual hardware does aggressive
prefetching, always fetching the next block. The Valgrind rates are relatively
close to actual hardware. m5 reports results much lower than real hardware, we






























































































































































































































[Figure 6.4 plot: average bytes per x86 instruction; series:
valgrind/cachegrind]
Figure 6.4: Average bytes per x86 instruction. For integer benchmarks the
average is 4.0, for floating point it is 5.1. These values are
needed when extrapolating cache miss rates when given only
total retired instruction count.
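
The scaling step can be sketched as follows. The function converts a retired instruction count into an estimated number of 16-byte instruction fetches using the average instruction size, then forms a miss rate against that estimate. The exact scaling used for Figure 6.5 is not spelled out here, so treat this as one plausible reading; the input counts in the example are made up.

```python
FETCH_BYTES = 16  # a 16-byte icache load counts as one reference on the Phenom


def estimated_icache_references(instructions, avg_bytes_per_insn):
    """Estimate icache references from a retired instruction count."""
    return instructions * avg_bytes_per_insn / FETCH_BYTES


def scaled_icache_miss_rate(misses, instructions, avg_bytes_per_insn):
    """Miss rate per estimated 16-byte fetch rather than per instruction."""
    return misses / estimated_icache_references(instructions, avg_bytes_per_insn)


if __name__ == "__main__":
    # Hypothetical counts; 4.0 and 5.1 are the integer and floating point
    # averages reported in Figure 6.4.
    for label, avg in (("integer", 4.0), ("floating point", 5.1)):
        refs = estimated_icache_references(1_000_000, avg)
        rate = scaled_icache_miss_rate(500, 1_000_000, avg)
        print(f"{label}: {refs:.0f} references, miss rate {rate:.4%}")
```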
6.5.3 Data Accesses per Thousand Instructions
Figure 6.6 shows data cache accesses per thousand instructions for the SPEC
CPU2000 benchmarks. Most of the architectures show consistent results, and
Valgrind and m5 are roughly the same. The one confusing point is the 32-bit
result; the 32-bit binary generates many more cache accesses on the same ma-
chine and kernel than a 64-bit binary. This could be a compiler difference; it


























































































































































































































[Figure 6.5 plot: L1 instruction cache miss rate; series: actual x86_64
Phenom, valgrind/cachegrind, m5; value labels 3.4, 2.2, 0.4]
Figure 6.5: Instruction cache miss rate with integer benchmarks above and
floating point below.
6.5.4 L1 Data Cache
Figure 6.7 shows L1 data cache miss rates for CPU2000. The gcc benchmarks
have very high miss rates, in ways that the simulators do not expect. The rate is
much higher than the miss rate when running the equivalent 32-bit binary. This
is possibly due to the expansion to 64-bit pointers, as gcc is a pointer-heavy
code. We conduct extra performance counter measurements that show the gcc
benchmarks software prefetch more than the other benchmarks; this could be
polluting the cache.
Figure 6.8 adds additional points to the previous graphs. All of the results
presented are either simulating a Phenom-like cache or else running on an ac-


































































[Figure plot legend, repeated for three panels: Phenom, Phenom (32), Core2,
Pentium D, MIPS, valgrind, m5]































































































































































































































































































































































[Figure 6.7 plot: L1 data cache miss rate; series: actual x86_64 Phenom,
valgrind/cachegrind, m5; value labels 45.3, 56.3, 45.3, 56.5]
Figure 6.7: L1 data cache miss rate with integer benchmarks above and
floating point below.
marks have vastly fewer cache misses than the equivalent 64-bit versions. The
32-bit floating point benchmark results also differ from the 64-bit results;
this is possibly due to x87 versus SSE math differences. In most cases the
simulators are overly pessimistic about the data cache miss rates. This is
possibly because none of the simulators model hardware prefetching, nor do
they properly model the cache as exclusive. PPC results are shown too, from
PPC Valgrind configured with the same cache parameters as x86 Valgrind.
Those results are dif-


































































[Figure plot legend, repeated for three panels: Phenom, 32b Phenom,
32b Valgrind, Valgrind, PPC Valgrind, m5, 32b m5]


























































































































































































































































[Figure 6.9 plot: L2 cache miss rate; series: actual x86_64 Phenom,
valgrind/cachegrind, m5]
Figure 6.9: L2 cache miss rates, actual and simulated. The simulators are
pessimistic; in the case of gcc severely so.
6.6 L2 Cache
Figure 6.9 shows L2 miss rates for CPU2000 on x86_64, both actual and
simulated. The simulated results are pessimistic, severely so in the cases of
gcc and swim. While large in relative terms, the absolute differences in the
rates are relatively small. The benchmarks with the largest L2 error are also
the ones that have large error with the L1 data cache, so this error might just
be the L1 error




























































































































































































































[Figure 6.10 plot: branch miss rate; series: actual x86_64 Phenom,
valgrind/cachegrind, m5]
Figure 6.10: Branch predictor results for Valgrind and actual hardware.
m5 currently cannot simulate branch prediction for x86_64
6.7 Branch Predictor
Figure 6.10 shows branch predictor results for Valgrind and actual hardware.
m5 results are not shown, as m5 currently cannot simulate branch predictors on
x86_64. The results match surprisingly well for the integer codes, considering
Valgrind is modeling a simplistic 16k 2-bit up/down counter predictor. The
results are not as good for the floating point codes, which is a bit surprising
as typically floating point branches should be easier to predict. This could mean










































































































































































































[Figure 6.11 plot: cycles per instruction; series: actual x86_64 Phenom,
valgrind/cachegrind, m5; value labels 4.3, 20.7, 20.5, 5.9]
Figure 6.11: CPI results with integer above and floating point below.
Valgrind cycle times are estimated based on cache and branch
predictor behavior.
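
One simple way to form a cycle estimate from cache and branch predictor behavior is to charge a fixed penalty for each miss event on top of a base cost per instruction. The sketch below does this. It is not the formula from Section 5.5.1, and every latency and penalty in it is a placeholder chosen for illustration (only the 9-cycle L2 latency echoes the Phenom description in Section 6.4.3).

```python
def estimated_cpi(instructions, l1_misses, l2_misses, branch_mispredicts,
                  base_cpi=1.0, l2_latency=9, memory_latency=200,
                  mispredict_penalty=10):
    """Estimate CPI as a base cost plus fixed penalties per miss event.

    All default values are illustrative assumptions, not measured parameters.
    """
    cycles = (instructions * base_cpi
              + l1_misses * l2_latency            # L1 misses served by the L2
              + l2_misses * memory_latency        # L2 misses served by memory
              + branch_mispredicts * mispredict_penalty)
    return cycles / instructions


if __name__ == "__main__":
    # Hypothetical event counts for a run of one million instructions.
    print(round(estimated_cpi(1_000_000, 20_000, 1_000, 5_000), 3))
```

An estimate of this form inherits any error in the miss counts it is fed, which is consistent with poor cache results skewing the cycle estimate.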
6.8 CPI
Figure 6.11 shows CPI results. The Valgrind cycle counts are estimated, based
on a formula similar to the one in Section 5.5.1. The Valgrind results are im-
pressively good for integer benchmarks, though they are off for gcc (which is
unsurprising as the data cache results for gcc are poor, skewing the cycle esti-
mate). The Valgrind floating point results are poor, possibly due to the lack of
good branch prediction results. The m5 results are poor overall, as it is simu-
lating a simple atomic CPU where only one instruction finishes at a time. Since
it lacks any super-scalar simulation at all, the cycles are always going to be off










































































































































































































[Figure 6.12 plot: relative icache miss rate, i686/x86_64; series: actual
Phenom, valgrind, m5]
Figure 6.12: Relative instruction cache miss rate ratios when moving from
32-bit to 64-bit
6.9 Relative Results
As with the RISC results, we present results that compare how the various
methods predict improvement when changing an architectural feature. Unlike the
RISC case, none of the chips we have support changing architectural features
on the fly. Instead, we compare results when moving from 32-bit to 64-bit on
the same machine. Real hardware, Valgrind, and m5 all support running both
32-bit and 64-bit x86 binaries. Unfortunately m5 cannot run the full CPU2000
benchmarks in 32-bit mode due primarily to unimplemented x87 floating point
support. This severely limits the number of benchmarks that can be compared
using m5.
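
The comparison in the following subsections reduces to one ratio per benchmark and per method. A minimal sketch, assuming the ratio is the 32-bit (i686) value divided by the 64-bit (x86_64) value, as the figure axis labels suggest; the numbers in the example are invented.

```python
def relative_ratio(value_32bit, value_64bit):
    """Ratio of a metric on the i686 binary to the same metric on x86_64."""
    return value_32bit / value_64bit


def ratio_error(actual_ratio, simulated_ratio):
    """Signed relative error of the simulated ratio against real hardware."""
    return (simulated_ratio - actual_ratio) / actual_ratio


if __name__ == "__main__":
    # Hypothetical miss rates (percent) for one benchmark.
    hardware = relative_ratio(2.0, 4.0)
    simulator = relative_ratio(3.0, 5.0)
    print(f"hardware {hardware:.2f}, simulator {simulator:.2f}, "
          f"error {ratio_error(hardware, simulator):+.0%}")
```

A simulator can be wrong in absolute terms and still useful if this ratio tracks the hardware ratio closely.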
6.9.1 L1 Instruction Cache
Figure 6.12 shows relative icache results when moving from 32-bit to 64-bit.









































































































































































































[Figure 6.13 plot: relative dcache miss rate, i686/x86_64; series: actual
Phenom, valgrind, m5]
Figure 6.13: Relative L1 data cache miss rate ratios when moving from 32-bit
to 64-bit
the ratio is not as close as it could be. This might have to do with actual hard-
ware engaging in hardware prefetching.
6.9.2 L1 Data Cache
Figure 6.13 shows relative L1 data cache miss rate ratios when moving from 32-
bit to 64-bit. These results track much better than the instruction cache results.
The primary outliers seem to be the gcc benchmarks (discussed earlier) and the
eon benchmarks.
6.9.3 L2 Cache
Figure 6.14 shows relative L2 miss rate ratios. Unfortunately the integer results
are not good. Valgrind and m5 cannot predict gzip or gcc behavior; this is
possibly because those benchmarks’ actual performance is dramatically worse








































































































































































































[Figure 6.14 plot: relative L2 miss rate, i686/x86_64; series: actual Phenom,
valgrind, m5]










































































































































































































[Figure 6.15 plot: relative branch miss rate, i686/x86_64; series: actual
Phenom, valgrind]




Figure 6.15 shows the relative change in branch predictor results when moving
from 32-bit to 64-bit. m5 is not represented, as currently it lacks branch predictor





































































































































































































[Figure 6.16 plot: relative CPI, i686/x86_64; series: actual Phenom,
valgrind, m5]
Figure 6.16: Relative CPI ratios when moving from 32-bit to 64-bit
erly predicts a subset of the integer benchmarks. One unexpected data point
is the wildly different branch predictor accuracy when moving to 64-bit on real
hardware.
6.9.5 CPI
Figure 6.16 shows relative CPI results. Unfortunately Valgrind does a poor job
of predicting, although this is not surprising as Valgrind’s cycle count is only
an estimate. m5 is even worse, but it has its own issues with cycle count, as
described in the absolute results section. Without major changes to the simula-




Unlike the RISC results found in Chapter 5, we find that current tools and sim-
ulators are not up to the task of predicting performance on CISC systems. It is
possible that a more faithful model of the underlying architectures would
generate better results. It is also possible that the 32-bit to 64-bit
comparison has too
many variables; the RISC branch-prediction study might have been an easier




Chip multi-processing (CMP) systems retain all the validation concerns
found with single-core systems (as described in Chapter 4) while adding new
and more complex issues. We briefly address the issues encountered when ex-
tending our simulation methodology work to handle multiple cores.
7.1 Performance Counters
Most CMP systems support per-core performance counter measurements.
There are some counter issues that do not occur on single core machines; for
example most CMP systems have some number of resources that are shared
between the cores, such as L3 caches or memory controllers. When measuring
statistics for these structures, it can be unclear which core owns these counts.
These troublesome shared resources are sometimes referred to as the “uncore”
and the perfmon2 tool makes it possible to count these. Unfortunately this often
involves extra work, or else forces counts to be taken system-wide even if the
thread of interest is only running on one of the cores.
7.2 Deterministic Execution
The problem of deterministic execution becomes even more pronounced once
more cores are added to a system. The theoretical rock of stability in our previ-
ous analysis, the retired instruction count, no longer has any guarantees. Once
multiple threads are running, most hope of deterministic execution is lost. The
Operating System takes on a larger role, as scheduling decisions by the OS can
vastly change overall system performance.
When performing validation, it is important that the simulator is running
the same exact code as the real hardware. This is much harder on CMP systems.
Many cycle-accurate simulators do not even model the Operating System at all,
and even if they did, synching the scheduling decisions between a simulator
and actual hardware is not trivial.
If execution cannot be made deterministic, then comparisons between simu-
lation and hardware are meaningless.
There has been a lot of work toward deterministic multi-thread execution
(see Section 2.9). Unfortunately many of the implementations are at the hard-
ware level and thus require low-level architectural changes. Some recent exam-
ples of such solutions are Capo [98], DMP [44], Delorean [97] and Flight Data
Recorder [165].
An ideal deterministic execution method for validation work would be
software-only, require limited changes to the executables being run, and should
work unmodified on both real hardware and in a simulator. The recent
Kendo [114] project meets all of these criteria. Kendo uses hardware per-
formance counters to enforce deterministic context switching via a modified
version of the pthreads library. The retired stores performance counter is
used as a reference count, as they (like us) found that other counters (such
as retired instructions) include interrupt counts and other undesirable noise.
Using Kendo adds an average overhead of 16% to execution time, which is
unfortunate, but worth the sacrifice.
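
The idea of ordering synchronization by a deterministic logical clock can be illustrated with a toy model. In the sketch below each thread's clock is its count of retired stores, and a thread may take its next lock only when its clock is the smallest among threads still waiting (ties go to the lower thread id). This is a simplification written for illustration under those stated assumptions; it is not Kendo's implementation, and the workload numbers are invented.

```python
def deterministic_lock_order(stores_before_each_lock):
    """Return the order in which threads acquire locks in the toy model.

    stores_before_each_lock[t] lists, for thread t, how many stores it
    retires before each of its lock acquisitions. The resulting order
    depends only on these counts, not on wall-clock timing.
    """
    clocks = [work[0] if work else 0 for work in stores_before_each_lock]
    acquired = [0] * len(stores_before_each_lock)
    pending = {t for t, work in enumerate(stores_before_each_lock) if work}
    order = []
    while pending:
        # The waiting thread with the smallest (clock, id) goes next.
        t = min(pending, key=lambda tid: (clocks[tid], tid))
        order.append(t)
        acquired[t] += 1
        if acquired[t] < len(stores_before_each_lock[t]):
            clocks[t] += stores_before_each_lock[t][acquired[t]]
        else:
            pending.remove(t)
    return order


if __name__ == "__main__":
    # Hypothetical workload: two threads, two lock acquisitions each.
    print(deterministic_lock_order([[5, 5], [3, 10]]))
```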
For our validation work we would have liked to use Kendo, but unfortu-
nately despite originally saying it would be available for download, at the time
of writing this the authors were still not ready to release it. Thus our validation





Our original plan was to generate multi-core results using Valgrind/Ruby,
m5, and actual hardware performance counters and then compare the results.
This would have been a natural extension to the RISC results in Chapter 5 and
the single-core CISC results in Chapter 6.
Unfortunately the standalone Ruby CMP cache simulator was not mature
enough to do this type of research. The m5 simulator’s x86_64 support was also
not ready for this type of experiment, nor were other x86_64 simulators such as
PTLsim.
We investigated using other architectures, but were limited to hardware that
we had access to and on which performance counters were fully working. This
eliminated Alpha, MIPS and SPARC.
In the end, what we present are some preliminary results showing that we
get sane memory access patterns across DBI, cycle-accurate and real hardware.
However the actual end results of the cache simulations cannot be compared.
8.1 Methodology
We run experiments using some of the SPEC OMP [137] benchmarks. We com-
pile with the Intel ICC compiler, as the benchmarks for some reason do not scale
when compiled with gcc 4.4.
8.1.1 Performance Counters
We run our tests on a 4-core AMD Phenom system running the 2.6.29 kernel
patched to enable perfmon2 [51] performance counter support.
The Phenom system has rich performance counter support, allowing counts
on a per-core basis. Detailed overall system counts are available for shared re-
sources, such as the L3 caches and the memory controller.
8.1.2 DBI Simulation
Various DBI tools have support for CMP simulation. For user-space only tools,
this involves intercepting the various thread and process creation system calls
and handling the situation appropriately. Some DBI tools, such as Valgrind,
handle the multi-thread case but can only themselves run one thread at a time.
This in effect serializes the multi-thread execution. Despite this serialization,
CMP results can still be effectively used if the traces generated have enough in-
formation to re-create the parallel execution. Running in a serial fashion though
does cause a linear slowdown in execution for each additional thread being run.
Not all DBI tools force serialization on multi-thread executions. The Pin tool
is capable of spawning a separate DBI instance for each thread, allowing a pro-
gram to be simulated in a manner much closer to native execution [65]. This
could lead to faster trace generation than with Valgrind.
For collecting DBI CMP memory traces we use Valgrind 3.5 with a custom
tool (based on our exp-bbv tool) that generates memory and instruction traces.
These traces are fed into an external program via a named pipe that counts and
















[Figure 8.1 plot: value labels 43x, 43x, 97x, 97x, 99x, 133x]
Figure 8.1: equake_m run times for varying number of threads, both on
actual hardware and Valgrind
analyzes the references.
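
A consumer of such a trace can be very small. The sketch below reads trace records from a stream (for example a named pipe opened for reading) and counts instruction, load, and store references per thread. The record format (`thread_id kind hex_address`, with kind one of I, L, S) is invented for this example; the format emitted by the custom tool is not described here.

```python
import io
from collections import Counter, defaultdict


def count_references(stream):
    """Count references per thread and per kind from a text trace stream."""
    counts = defaultdict(Counter)
    for line in stream:
        fields = line.split()
        if len(fields) != 3:
            continue                      # skip blank or malformed records
        thread_id, kind, _address = fields
        counts[int(thread_id)][kind] += 1
    return counts


if __name__ == "__main__":
    # In real use the stream would come from opening the named pipe; an
    # in-memory trace keeps the example self-contained.
    trace = io.StringIO(
        "0 I 400000\n0 L 7fff0010\n1 I 400004\n1 S 601000\n0 S 601008\n")
    for tid, kinds in sorted(count_references(trace).items()):
        print(tid, dict(kinds))
```

Because the consumer only needs a line-oriented stream, the same function works whether the trace arrives through a pipe or from a saved file.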
8.1.3 Cycle-accurate Simulation
We had hoped to use m5 for x86_64 multi-threaded cache simulation. However
the user-mode support for multi-core is not working, nor is the full-system
mode that would allow running multithreaded benchmarks on top of a full
simulated operating system. In the end we did not conduct any cycle-accurate
multicore simulations.
8.2 Results
Figure 8.1 shows run times when running equake_m on real hardware and on
Valgrind. On real hardware the benchmark scales with number of CPUs, al-
though not purely linearly. Valgrind has interesting behavior; one would expect














[Figure 8.2 plot: panels Phenom and Valgrind; series Thread 1, Thread 2,
Thread 3, Thread 4]
Figure 8.2: equake_m retired instruction counts for varying number of
threads, both on real hardware and Valgrind
the total run time to stay approximately the same, as the same amount of
total work is being done; it is just being split between cores. For the
one and two thread cases this holds, but adding additional threads dramatically
increases the run times. This could be due to an artifact with Valgrind’s internal
thread scheduling mechanism, possibly conflicting with the way the OpenMP
library distributes the work among threads.
Figure 8.2 shows per-thread retired instruction counts for equake_m on real
hardware and on Valgrind. In each case there is a helper scheduler thread
running in addition to the shown threads, but it executes proportionally so
few instructions that it is not visible on the graph. The overall retired
instruction counts grow as threads are added due to multi-threading overheads.
This overhead is higher on real hardware, due to locking overheads from
concurrent execution that do not occur under the Valgrind DBI tool. It is
encouraging that the relative ratio of instructions per thread is consistent
between real hardware and simulation. The Valgrind tool allows running
experiments on more threads than the actual hardware has available; this
allows running experiments for machines with more
118



















) Thread 1 Thread 2Thread 3 Thread 4
Phenom Valgrind
Figure 8.3: equake m L1 dcache access counts for varying number of
threads, both on real hardware and Valgrind
cores than currently available.
Figure 8.3 shows per-thread L1 DCache accesses when running equake_m
on real hardware and Valgrind. As before, there is an additional helper thread
too small to be visible on the plots. It is encouraging that the relative
ratios of memory accesses per thread are similar between hardware and
Valgrind. Note especially that thread one has proportionately more accesses in
both instances. We cannot explain why the total number of memory accesses
drops when moving to 8 threads on Valgrind. We do not have 8-core hardware, so
we do not know if the same thing happens on an actual machine. We find it
encouraging that the cache results for Valgrind match so closely, as it gives
hope that once a multi-threaded cache simulator is available, with proper
tuning it can produce outputs just as good as its inputs.
8.3 Summary
Even though we are not able to generate the results from CMP cache simulation
on x86, we have conducted preliminary experiments that show that the data
cache accesses that would be fed into the simulator are sane and match real
hardware. This gives hope that once a CMP simulator becomes available, that




CONCLUSION AND FUTURE WORK
Our goal is to reduce the run time of architectural simulations without
affecting accuracy. First we investigate reduced execution methods,
concentrating on the SimPoint methodology. We find that SimPoint has much
higher accuracy than other commonly used methods, but it can still require
long run times when attempting to generate high accuracy results. We next look
at using DBI tools, which run orders of magnitude faster than cycle-accurate
simulation, to generate results using full input sets. We find that it is
simple to get good results using DBI on RISC platforms. Unfortunately we find
it is not as simple to get good results on more modern CISC machines. We begin
preliminary investigations of whether the DBI method of simulation would work
on CMP systems, as opposed to the single-core machines previously investigated.
9.1 Results Summary
Simulation time is of critical importance to most computer architects. Many are
willing to trade accuracy by any means necessary so that their experiments can
finish in a reasonable amount of time.
Figure 9.1 shows speed versus accuracy tradeoffs for the various simulation
methods that we investigate. The results are for the CPU2000 benchmarks. Not
all of these are measured results; some are approximations that assume DBI
simulation is 376x slower than native, functional simulation is 390x slower
than native, and cycle-accurate simulation is 3900x slower than native (these
values match the ones found with Qemu and SESC in Chapter 5). The results also
assume perfect results, that is, that the simulators generate the same results
that performance counters would. Actual error rates will be worse,
accumulating error from the simulator. The results show that for accuracy,
nothing can beat DBI. Full inputs can be run in the time it takes to run 20
SimPoints in the cycle-accurate simulator. Assuming that the same accuracy can
be obtained with DBI as with cycle-accurate, using DBI is almost always the
winner. There are other simulation methods that might also compare favorably;
the SimPoint results assume functional fast-forwarding for each run. If a
method such as SimSnap [143] is used to leverage snapshots, so that fast
forwarding is instantaneous, then the slowdown times would be reduced even
more.

Figure 9.1: Speed vs Accuracy tradeoffs of the various simulation methods
on SPEC CPU2000, assuming perfect simulation. Plotted series: Start from
Beginning (100m, 500m, 1B, 2B); FFWD 1B (100m, 500m, 1B, 2B); SimPoint (1, 5,
10, 20 intervals); Training Input Set; SESC (8060h) -->

The best possible comparison would involve having the full SimPoint
measurements and accuracy including simulator overhead for both cycle-accurate
and DBI, but unfortunately we did not have time to generate those results.
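As a rough illustration of the arithmetic behind these approximations, the
sketch below scales a native run time by the slowdown factors quoted above.
The function names are hypothetical, and the native time is inferred by
assuming that the "SESC (8060h)" label in Figure 9.1 denotes a full
cycle-accurate run; neither comes from our tools.

```python
# Slowdown factors relative to native execution, as quoted in the text.
SLOWDOWN = {"dbi": 376, "functional": 390, "cycle_accurate": 3900}


def simulated_hours(native_hours, method):
    """Estimate wall-clock hours for a full run under the given method."""
    return native_hours * SLOWDOWN[method]


def implied_native_hours(simulated, method):
    """Invert the estimate: native hours implied by a simulated run time."""
    return simulated / SLOWDOWN[method]


# Assumption: the 8060 hour figure is a full cycle-accurate (SESC) run.
native = implied_native_hours(8060, "cycle_accurate")
for method in SLOWDOWN:
    print(f"{method:15s} {simulated_hours(native, method):8.0f} hours")
```

Under these assumptions the full-input DBI run takes roughly a tenth of the
cycle-accurate time, which is the gap that Figure 9.1 is meant to convey.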
9.2 Future Work
The most important future work is the completion of the CMP work started
in Chapter 8. The various projects involved, most notably m5 and gem5/ruby,
are under heavy development and may become usable at any time. Barring
that, PTLsim may also gain full CMP support and be ready for the experiments
we need. Once that happens, there is hope that DBI methods can be validated
against both hardware and cycle-accurate simulators on CMP systems.
Another avenue for future work is to make use of the faster execution times
enabled by DBI-based simulation. One major use would be modeling DRAM systems
in full detail, possibly by using DRAMsim [150]. The main problem holding back
detailed DRAM simulation is slow simulation time, something that is addressed
by our DBI simulation methods.
9.3 Conclusion
DBI-based methods make the best of the speed versus accuracy tradeoff in
computer architectural simulation. We encourage researchers to use DBI methods
where possible, to allow running longer, more complete simulations, including
simulations of subsystems that are often overlooked due to speed, such as DRAM
and I/O. Modern systems continue to grow in complexity, and without moving to
faster methodologies, such as DBI, we will rapidly lose the ability to have
any confidence in simulation results.
APPENDIX A
THE LOST ART OF ASSEMBLY LANGUAGE PROGRAMMING
When debugging simulators and DBI tools, being well versed in various
assembly languages helps immensely. Assembly is ideal for designing small
test cases, especially ones where the simulator fails before getting past the
C library. Test cases for obscure bugs, and reproducible test cases for
external distribution, are also best written in assembly.
Once you are well versed in writing tiny assembly language programs, all the
tools are available to explore the nature of code density on modern processors.
A.1 Benefits of Code Density
Dense code yields many benefits. The L1 instruction cache can hold more in-
structions, which usually results in fewer cache misses [139]. Less bandwidth
is required to fetch instructions from memory and disk [38], and less storage
is needed to hold program images. With fewer instructions, more data fits
in a combined L2 cache. Also, on modern multi-threaded processors, multi-
ple threads share limited L1 cache space, so having fewer instructions can be
advantageous. Denser code causes fewer TLB misses, since the code requires
fewer virtual memory pages. Modern Intel processors, for instance, can exe-
cute compact loops entirely from the instruction buffer, removing the need for
L1 I-cache accesses. Finally, the ability to consistently generate denser code can
conserve power, since it enables smaller microarchitectural structures and uses
less bandwidth [63, 149, 177, 19, 15].

Figure A.1: Sample output from the linux_logo benchmark
might require larger (and thus slower) pipeline decode stages, more compli-
cated compilers, smaller logical register set sizes (due to limitations in the num-
ber of bits available in instructions), or even slower and more complex func-
tional units. Compilers tend to optimize for performance, not size (even though
the two are inextricably related): obtaining optimal code density often requires
hand-tuned assembly language, which represents yet another tradeoff in terms
of programmer time and maintainability. The current push for using CISC chips
in the embedded market [133] forces a re-evaluation of existing ISAs.
A.2 Methodology
Investigations of code density often use microbenchmarks (which tend to be
short and not representative of actual workloads) or else industry standard
benchmarks (which are written in high-level languages and thus are limited
by compiler code generation capabilities). As a compromise, we take an actual
system utility, but convert it into pure assembly language in order to directly
interact with the underlying ISA. We hand-optimize it for size, attempting to
create the smallest binary possible, even if this potentially creates slower
code.

Table A.1: Summary of investigated architectures
Type      Arch       Endian⋆ Bits   Instr len Op   GP int           Unaligned Auto-inc HW        Status Branch   Predi-
                                    (bytes)   args regs             ld/st     address  div       flags  delay    cation
VLIW      IA64       little  64     16/3†     3    127,zero         no        yes      no        yes    no       yes
RISC      Alpha      little  64     4         3    31,zero          no        no       no        no     no       no
          ARM        little  32     4         3    15,PC            no        yes      no        yes    no       yes
          m88k       big     32     4         3    31,zero          no        no       Q only    no     optional no
          MicroBlaze big     32     4         3    31,zero          no        no       Q only⋆⋆  no     optional no
          MIPS       big     32/64  4         3    31,hi/lo,zero    yes⋆⋆     no       yes       no     yes      no
          PA-RISC    big     32/64  4         3    31,zero          no        no       part      no     yes      no
          PPC        big     32/64  4         3    32               yes       yes      Q only    yes    no       no
          SPARC      big     32/64  4         3    63-527,zero‡     no        no       Q only    yes    yes      no
CISC      m68k       big     32     2-22      2    16               yes       yes      yes       yes    no       no
          s390       big     32/64  2-6       2    16               yes       no       yes       yes    no       no
          VAX        big     32     1-54      3    16               yes       yes      yes       yes    no       no
          x86        little  32     1-15      2    8                yes       yes      yes       yes    no       no
          x86_64     little  32/64  1-15      2    16               yes       yes      yes       yes    no       no
Embedded  AVR32      big     32     2         2    15,PC            yes       yes      yes       yes    no       no
          CRISv32    little  32     2-6       2    16,zero,special  yes       yes      part      yes    yes      no
          SH3        little  32     2         2    16,MAC           no        yes      part      yes    yes      no
          THUMB      little  32     2         2    8/15,PC          no        yes      no        yes    no       no
8/16-bit  6502       little  8      1-3       1    3                yes       no       no        yes    no       no
          PDP-11     little  16     2-6       2    6,sp,pc          no        yes      yes⋆⋆     yes    no       no
          z80        little  8      1-4       2    18               no        lim      no        yes    no       no
⋆ on the machine we used  † 16-byte bundle has 3 instructions  ‡ register windows, only 32 visible  ⋆⋆ many implementations
The program we choose, linux_logo [151], is a utility available with many
Linux distributions. When given a sufficiently large input set, its characteristics
are similar to the stringsearch benchmark included with the MiBench [60]
suite. The program executes various syscalls to gather system information, then
displays this info along with a colorful ASCII penguin (Figure A.1 shows sam-
ple output).
The stock linux_logo program contains a multitude of features and command
line options; we remove all but the minimum for simplicity. Remaining
code is divided into two parts: the first decodes and displays the text logo,
which is packed using LZSS compression [176, 140]; the second prints system
information, which is gathered by reading the Linux /proc/cpuinfo file, in
addition to invoking the uname() and sysinfo() syscalls. Major subroutines
include string copying, string searching, integer to ASCII conversion, and cen-
tering routines. The code makes system calls directly to avoid C library over-
heads. Code is assembled with the GNU assembler and is linked with GNU ld.
Executables are stripped of non-essential data using the sstrip “super strip”
program [124], an enhanced version of the UNIX strip command. Executables
are tested on actual hardware or under an emulator where hardware is unavail-
able.
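To make the logo decode step concrete, the following is a minimal sketch of a
generic LZSS decoder. The stream layout (a flag byte governing up to eight
items, with 12-bit distances and 4-bit lengths) and the name lzss_decode are
assumptions chosen for illustration; they are not necessarily the format used
by linux_logo.

```python
MIN_MATCH = 3  # assumed: shortest back-reference worth encoding


def lzss_decode(data: bytes) -> bytes:
    """Decode an illustrative LZSS stream (layout is an assumption)."""
    out = bytearray()
    i = 0
    while i < len(data):
        flags = data[i]
        i += 1
        for bit in range(8):
            if i >= len(data):
                break
            if flags & (1 << bit):
                # Flag bit set: the next byte is a literal.
                out.append(data[i])
                i += 1
            else:
                # Flag bit clear: two bytes hold distance and length.
                if i + 1 >= len(data):
                    raise ValueError("truncated back-reference")
                pair = data[i] | (data[i + 1] << 8)
                i += 2
                distance = (pair & 0x0FFF) + 1
                length = (pair >> 12) + MIN_MATCH
                if distance > len(out):
                    raise ValueError("back-reference before start of output")
                # Copy byte by byte so overlapping matches repeat correctly.
                for _ in range(length):
                    out.append(out[-distance])
    return bytes(out)


# Three literals "abc", then a match of length 6 at distance 3.
stream = bytes([0x07, ord("a"), ord("b"), ord("c"), 0x02, 0x30])
print(lzss_decode(stream))
```

The inner copy loop is the part that benefits from dense string or
auto-increment instructions when written in assembly.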
We attempt to optimize each architecture’s code to the minimum possible
size without compromising correctness. For RISC architectures with
fixed-length instructions this is easier: typically, there is only one way to
express an operation, so there is limited room for clever implementations.
Optimizations are limited to trying to load 32-bit constants in a small area,
using registers instead of memory, and using tail merging to shorten procedure
lengths. CISC architectures provide many more opportunities to decrease code
size, but it is much more difficult to track optimizations due to
variable-length instructions. Optimizing for density requires frequent
disassembler checks to verify sizes of individual instructions. Interestingly,
we find that the “do-everything” super-CISC instructions available on these
systems can often be implemented in fewer bytes with a set of simpler CISC
instructions.
A.3 Architectural Notes
Table A.1 lists relevant features of the architectures of interest. We present a
broad overview of these architectures.
VLIW: Very Long Instruction Word (VLIW) architectures are designed to
take advantage of parallelism in code. If the code is not inherently parallel (and
ours is not), code density suffers, and many operations are wasted as nops.
Writing compact VLIW code can be hard: resolving dependences correctly is
a difficult task for compilers, and an even more difficult task for programmers
writing assembly by hand. VLIW can be designed with code density in mind:
e.g., the WM [161] architecture could exploit two operations per instruction in
over two-thirds of all cases. The only VLIW architecture we investigate is Intel’s
IA64 [71].
RISC: Reduced Instruction Set Computers (RISC) emphasize simple
architectures with easy to decode instructions. Instruction length is fixed at
four bytes, which necessitates inefficiency in instruction encoding. These are
load-store architectures, which require moving memory values into registers
before operating on them (this negatively impacts code density). Some of these
architectures stretch the definition of “reduced”; the PowerPC architecture
has nine different add instructions, and has the rlwimi (rotate left word
immediate then mask insert) instruction, which takes five parameters. We
investigate the Alpha [36], ARM [11], m88k [100], MicroBlaze [164], MIPS [96],
PA-RISC [67], PowerPC [70], and SPARC [142] ISAs.
CISC: Complex Instruction Set Computers (CISC) tend to have high code
density. Most CISC architectures have variable-sized instructions, which makes
processor decode more complicated, but allows for dense code. An example of a
dense “complex” instruction is the x86 one-byte lodsb instruction, which both
loads a byte from memory and increments a pointer. Another impressively
complex instruction is the VAX matchc, which does a full “find substring x
inside of string y in memory.” Compilers often have difficulty using these
instructions appropriately, so this potential for density can be wasted. Also,
these instructions may not be shorter or faster than a set of simpler
instructions performing the same operations. We investigate the m68k [101],
s390 [69], VAX [46], x86 [73], and AMD64 [7] ISAs.
Embedded: Modern advances in CPU design have pushed the limits of what
qualifies as “embedded”. We use the term to refer to any architecture with a
fixed two-byte instruction length, but capable of running a modern 32-bit Linux
kernel. These processors tend to have consistently small code sizes, but can
still be beaten by variable-instruction length CISC systems. We investigate the
AVR32 [12], CRISv32 [14], SH3 [126], and ARM THUMB [11] ISAs.

Correlation  Architectural feature
 0.9381      Minimum possible instruction length
 0.9116      Number of integer registers
 0.7823      Virtual address of first instruction
 0.6607      Architecture has a zero register
 0.6159      Bit-width
 0.4982      Number of operands in each instruction
 0.3129      Year the architecture was introduced
-0.0021      Branch delay slot
-0.0809      Machine is big-endian
-0.2121      Auto-incrementing addressing scheme
-0.2521      Hardware status flags (zero/overflow/etc.)
-0.3653      Unaligned load/store available
-0.3854      Hardware divide in ALU
8 and 16 bit: For comparison purposes we investigate older processors with
smaller word sizes. Such CPUs are still used for embedded systems, and they
are designed for use where code density is a much more critical concern. We
investigate the 6502 [99], PDP-11 [45], and z80 [175] ISAs.
A.4 Code Density Findings
Table A.2 shows how architectural features contribute to code size. A positive
correlation means that high values of the feature increase code size; a negative
correlation means that high values decrease code size. Figure A.2 shows to-
tal binary sizes across the investigated architectures and Figures A.3, A.4, A.5,

Figure A.2: Total size of benchmarks (includes some platform-specific

Figure A.4: Size of string concatenation code (machines with auto-

Figure A.5: Size of string searching code (unaligned load instructions help,
since four bytes at arbitrary offsets can be compared at once)

Figure A.6: Size of integer printing code (hardware divide helps code den-
sity)
Minimum instruction length: Short instruction encodings help most with
respect to increasing density. Architectures with variable-length
instructions, especially those with useful single-byte instructions (like x86
and VAX), can accomplish much work with little code. Fixed-length ISAs can be
dense if all instructions are 16-bit (like AVR32 and SH3); RISCs with fixed
32-bit instructions generate less dense code; and the VLIW generates the least
dense code of all platforms studied. Figure A.3’s LZSS decompression clearly
demonstrates this.
Number of integer registers: Having fewer registers reduces the number of
bits needed to encode instructions, increasing code density. There is a tradeoff,
in that having fewer registers generates more loads/stores from spilling in load-
store architectures.
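The encoding cost of registers can be estimated with simple arithmetic: each
register operand needs enough bits to name any register. The sketch below is
a simplification (the helper name is hypothetical, and real encodings such as
x86 addressing bytes complicate the picture), using register counts from
Table A.1 with the zero register counted as an encodable register.

```python
def register_field_bits(num_registers, operands):
    """Bits spent on register specifiers in a single instruction."""
    # ceil(log2(n)) bits are needed to name one of n registers.
    bits_per_register = max(1, (num_registers - 1).bit_length())
    return bits_per_register * operands


# (architecture, encodable registers, register operands per instruction)
for arch, regs, ops in [("x86", 8, 2), ("Alpha", 32, 3), ("IA64", 128, 3)]:
    print(arch, register_field_bits(regs, ops), "bits")
```

A two-operand ISA with 8 registers spends 6 bits on register fields, while a
three-operand ISA with 32 registers spends 15, nearly half of a 32-bit
instruction word.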
Virtual address of first instruction: Operating system design decisions af-
fect code density. If the virtual address space is configured so programs start
near the bottom of virtual memory, then a 16-bit constant is enough to point to a
small program’s entire memory. Constant 32-bit pointer loads are at least dou-
ble the size of 16-bit loads on most architectures, and 64-bit pointer loads are
even more wasteful. Using small system call numbers can help, too; avoiding
large immediate constants saves space in executables.
Existence of a zero register: Zero registers are normally found in RISC ar-
chitectures, so they tend to correlate with less dense code. A zero register can
be simulated using one load instruction and sacrificing a register, so the feature
offers few benefits with regards to code density.
Bit width: Having a narrower bit-width leads to denser code, mainly due to
shorter immediate values for pointer loads and branch offsets.
Number of operands in instruction: Operand count directly affects the
size of instruction encoding.
Year of introduction: Somewhat surprisingly, age does not correlate highly
with code density. This is due to the many embedded architectures introduced
recently.
Branch delay slots: Branch delay slots can decrease code density due to
added nops. For our benchmark, slots can often be filled, so branch delay slots
cause no problem.
Endianness: Endianness has little impact on code density unless the program
operates on data in a non-native format.
Status flags: Upon completion of ALU operations, these flags (or condition
codes) are set as side effects to indicate that the result was zero, negative, an
overflow, etc. These flags can lead to denser code by eliminating the need for
comparison instructions before conditional branches. Most RISC designs avoid
status flags, as they add complexity and ordering dependencies to out-of-order
processors.
Auto-increment addressing: Auto-increment addressing modes allow ac-
cessing consecutive memory addresses without requiring separate increment
instructions. This is especially useful for accessing arrays, of which C strings
are a subset. String copying and concatenation, as in Figure A.4, benefit from
these instructions.
Unaligned memory access: Allowing unaligned loads and stores leads to
smaller code, especially for string manipulation. Unaligned 16 and 32 bit loads
permit arbitrary simultaneous access to consecutive bytes in memory. If align-
ment is enforced, achieving the same results requires a series of memory, shift,
and logical operations. Results in Figure A.5 demonstrate benefits of this fea-
ture.
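The extra work required when alignment is enforced can be sketched as follows.
This is an illustrative model only: memory is a Python bytes object, loads are
little-endian, and the helper names are hypothetical. It shows the two aligned
loads, two shifts, and one OR that replace a single unaligned 32-bit load.

```python
def load_aligned32(mem: bytes, addr: int) -> int:
    """Model of a 32-bit little-endian load that requires alignment."""
    if addr % 4 != 0:
        raise ValueError("unaligned access")
    return int.from_bytes(mem[addr:addr + 4], "little")


def load_unaligned32(mem: bytes, addr: int) -> int:
    """Emulate an unaligned 32-bit load using only aligned loads."""
    base = addr & ~3            # round down to the containing aligned word
    shift = (addr & 3) * 8      # byte offset within that word, in bits
    low = load_aligned32(mem, base)
    if shift == 0:
        return low
    high = load_aligned32(mem, base + 4)
    # Combine the tail of the first word with the head of the second.
    return ((low >> shift) | (high << (32 - shift))) & 0xFFFFFFFF


memory = bytes(range(16))
print(hex(load_unaligned32(memory, 1)))
```

On an ISA with unaligned loads the whole second function collapses into one
instruction, which is where the density benefit comes from.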
Hardware division: A hardware divide instruction is often slower than us-
ing the equivalent multiply by the reciprocal [59] or lookup table-based division
routines, but it almost always takes fewer bytes in the instruction stream. Some
architectures only implement single-bit division routines that require software
pipelining; this can lead to less space-efficient code than otherwise undesirable
algorithms such as iterative subtraction. Integer printing code benefits greatly
from hardware divide, as in Figure A.6.
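The tradeoff between a divide instruction and the multiply-by-reciprocal
method can be illustrated with integer printing. The sketch below assumes
unsigned 32-bit inputs; 0xCCCCCCCD with a shift of 35 is the commonly used
constant pair for dividing such values by ten, and the function names are
hypothetical.

```python
def div10_reciprocal(n: int) -> int:
    """Divide by 10 using multiply and shift (valid for 0 <= n < 2**32)."""
    return (n * 0xCCCCCCCD) >> 35


def to_ascii(n: int, div10=lambda value: value // 10) -> str:
    """Convert a non-negative integer to decimal text, one digit per divide."""
    digits = []
    while True:
        quotient = div10(n)
        # The remainder is recovered with a multiply and subtract.
        digits.append(chr(ord("0") + n - quotient * 10))
        n = quotient
        if n == 0:
            break
    return "".join(reversed(digits))


print(to_ascii(1234567890))                    # plain divide
print(to_ascii(1234567890, div10_reciprocal))  # multiply by reciprocal
```

Both variants produce the same digits; in machine code the reciprocal version
needs a 32-bit constant load, a widening multiply, and a shift, where the
divide version needs a single instruction.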
A.5 Density of Compiler-Generated Binaries
Hand-optimizing large programs in assembly language is impractical under
most circumstances. We therefore evaluate compact code generation using more
traditional methods. We choose to experiment with the x86 architecture due to

Figure A.7: Total size of generated executables, stripped of debugging
information. Categories shown: glibc/static, glibc/dynamic, uClibc, syscall
only, asm.
We use a variety of C compilers and libraries to determine how small an
executable we can generate using off-the-shelf tools. We use the GNU gcc 4.2
compiler (gcc 4.1 for uClibc runs), the Intel C compiler version 9.1.038, and the
SunStudio 12 compiler, all under Linux. We use GNU libc 2.7 and the embedded
uClibc 0.9.27.
We experiment with different compiler optimizations. In general, we use
-O3; this usually optimizes for maximum performance. We also evaluate -Os,
which optimizes for size. In practice, resulting executables are very similar.
The primary differences are lack of loop unrolling, use of the hardware divide
instruction instead of the faster multiply by reciprocal method, lack of function
inlining, and less aggressive padding of function entry points.
Figure A.7 shows that executable sizes vary by many orders of magnitude.
This is because statically linked programs contain the entire C library, which
represents an overhead of at least 450KB (when using glibc).
By writing code that avoids the C library (and using system calls directly),
we obtain executables only twice as large as hand-optimized codes. The
remaining reasons for larger code are:
• setting up the stack frame pointer at function entry — this can be turned
off with the compiler option -fomit-frame-pointer;
• writing back to memory using 32-bit constants — due to pointer aliasing
issues the compiler must frequently write values to memory using 5-byte
instructions. The optimized assembler avoids aliasing and places more
values in registers;
• loading of constants inefficiently — there are various slow (but smaller)
ways to load small constants on x86; and
• avoiding string instructions — the compiler simply does not use the x86
specialized string instructions.
A.6 Related Work
Most code density research addresses the compressibility of instruction code
[158, 82, 39, 20, 80, 166, 85, 149, 160, 130, 27]. Usually what is compressed is
compiler-generated RISC or VLIW code, with compression ratios typically in
the 50-70% range. We show here that embedded and CISC ISAs yield smaller bi-
naries than RISC. Adding compression to a RISC architecture likely negates the
speed benefits and decoder simplicity that initially motivated the move away
from CISC.
Previous work compares multiple architectures, but our work is unique in
the number (21) considered. Kozuch and Wolfe [79] measure entropy and
compressibility of six different architectures (VAX, MIPS, SPARC, m68k, RS6000
and PowerPC). Hasegawa et al. [63] compare SH3 code density to that of code
generated by gcc on 10 other platforms (m68k, IA32, i960, Sparclite, SPARC,
MIPS, AMD29k, m88k, Alpha, and RS6000). They find results roughly similar to
ours, though they find the SH3 architecture generates smaller code than the
x86 and m68k by a small margin. Flynn, Mitchell, and Mulder [54] compare code
density of synthetic architectures that do not model actual systems.
Phelan [123] investigates features added to Thumb-2 to increase code den-
sity. Thumb-2 uses specialized instructions for enhanced constant support, lim-
ited predication, and compare-against-zero. These are similar features to those
we find useful for density in Section A.4. Halambi et al. [61] investigate the ben-
efits of using a reduced Instruction Set Architecture (rISA), such as THUMB and
MIPS-16. They test hypothetical architectures, finding that a hybrid approach
unlike any current reduced architecture should perform best.
Massalin’s Superoptimizer [93] cleverly generates extremely dense (and
non-intuitive) m68k and IA32 code by exhaustive search, but it only operates
on small blocks of code (i.e., it’s a highly tuned peephole optimizer).
A.7 Conclusions and Future Work
A 1987 article by Chow and Horowitz [34] quotes an early MIPS-X design doc-
ument:
“The goal of any instruction format should be: 1. simple decode, 2.
simple decode, and 3. simple decode. Any attempts at improved
code density at the expense of CPU performance should be ridiculed
at every opportunity.”
Two decades later, the debate between prioritizing code density versus decoder
simplicity in ISAs continues.
We investigate code density of 21 different architectures, and find that very
high density levels can be achieved with proper planning of an ISA. To thor-
oughly exploit ISA density there must be cooperation between the operating
system, system libraries, and compiler. On the x86 architecture, even after elim-
inating the C library and choosing maximum compiler options, a factor of two
in code density can still be realized by hand-optimizing the assembly code. This
is much greater than the 25% average size difference between RISC and CISC
codes.
New ISAs, especially embedded ones, are continually being developed.
Now that FPGAs are powerful enough to contain competitive CPUs, this trend
of creating custom ISAs will likely increase. To aid in this development, we
show which architectural features contribute most to code density, but also
show that the entire system stack must be optimized to avoid wasting an ISA’s
inherent potential for density.
Ongoing work applies some of what we have learned to much bigger bench-
marks to see what the performance and power implications are of using smaller
libraries and different compiler options on larger applications. We hope to raise
awareness of the importance of code density on all modern architectures, not




Tables B.1 and B.2 show cache latencies for various machines available in
CSL. This is useful when judging how realistic cache settings are in
simulators. The results were generated with the lmbench tool
(http://www.bitmover.com/lmbench/), and were spot-checked against actual
documentation to make sure the results were sane. Many of the older chips with
extremely high L2 latencies have off-chip L2s.
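The cycle counts in the tables follow from the measured latency and the clock
frequency. The sketch below shows the conversion for three rows of Table B.1;
the function name is hypothetical and the rounding to whole cycles is our
reading of how the tabulated values relate, not a description of lmbench
itself.

```python
def latency_cycles(latency_ns, freq_mhz):
    """Convert a latency in nanoseconds to clock cycles at freq_mhz."""
    # cycles = seconds * Hz = (ns * 1e-9) * (MHz * 1e6) = ns * MHz / 1000
    return latency_ns * freq_mhz / 1000.0


# (machine, frequency in MHz, DL1 latency in ns), taken from Table B.1
rows = [("sampaka12", 2790, 0.703), ("cluster", 2820, 1.066),
        ("venchi", 2212, 1.360)]
for machine, mhz, ns in rows:
    print(f"{machine:10s} {latency_cycles(ns, mhz):.2f} cycles")
```

Each result lands within a few hundredths of the whole-cycle value listed in
the table.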
Table B.1: L1 Cache latencies on Fusion group machines
Machine     CPU          Freq   DL1 latency        DL1   DL1
                         (MHz)  (ns)     (cycles)  size  assoc
sampaka12   P4 Xeon      2790   0.703    2         8k    4-way
cluizel     P4 Xeon      2787   0.724    2         8k    4-way
dolfin      P4           2394   0.853    2         8k    4-way
cluster     Core2 E5440  2820   1.066    3         32k   8-way
cluster-026 Core2 E5430  2660   1.128    3         32k   8-way
domori25    Pentium D    3463   1.155    4         16k   8-way
ithaca      P4 Xeon      2988   1.343    4         16k   8-way
tasse       Core2 Q6600  2399   1.250    3         32k   8-way
venchi      Phenom 9500  2212   1.360    3         64k   2-way
ps3         Cell         1591   1.545    2.5?      32k   ?-way
old-milka   Athlon MP    1729   1.735    3         64k   2-way
tobler      Athlon XP    1663   1.804    3         64k   2-way
chocovic    Core T2300   1600   1.812    3         32k   8-way
atom-power  Atom N270    1597   1.899    3         24k   6-way
lindt       ARM v5te     1197   2.599    3         16k   4-way
valor       Niagara1     1000   3.115    3         8k    ?-way
elrey09     Power3       375    5.335    2         64k   128-way
spruengli   Pentium III  545    5.497    3         16k   4-way
carnivore   USPARC II    359    5.597    2         16k   1-way
bmul        Alpha EV6    496    6.042    3         64k   2-way
hershey     MIPS R12k    300    6.630    2         32k   2-way
nestle      Pentium II   400    7.495    3         16k   4-way
perugina    MIPS R5k     178    11.500   2         32k   2-way
ancient     Pentium Pro  198    15.100   3         8k    2-way
Table B.2: L2 Cache latencies on Fusion group machines
Machine     CPU          Freq   L2 latency         L2
                         (MHz)  (ns)     (cycles)  size
cluster     Core2 E5440  2820   5.365    15        6144k
cluster-026 Core2 E5430  2660   5.665    15        6144k
tasse       Core2 Q6600  2399   5.863    14        4096k
sampaka12   P4 Xeon      2790   6.568    18        512k
cluizel     P4 Xeon      2787   6.575    18        512k
venchi      Phenom 9500  2212   6.950    15        512k
dolfin      P4           2394   7.720    18        512k
domori25    Pentium D    3463   7.982    28        2048k
chocovic    Core T2300   1600   8.436    14        2048k
ithaca      P4 Xeon      2988   9.431    28        1024k
atom-power  Atom N270    1597   10.200   16        512k
old-milka   Athlon MP    1729   11.600   20        256k
tobler      Athlon XP    1663   12.000   20        256k
ps3         Cell         1591   12.600   20        512k
lindt       ARM v5te     1197   21.000   25        256k
valor       Niagara1     1000   22.100   22        3072k
carnivore   USPARC II    359    27.900   10        4096k
bmul        Alpha EV6    496    30.300   15        4096k
elrey09     Power3       375    32.000   12        512k
spruengli   Pentium III  545    33.000   18        512k
ancient     Pentium Pro  198    35.400   7         512k
hershey     MIPS R12k    300    47.300   14        2048k
nestle      Pentium II   400    55.000   22        512k




This appendix contains retired instruction counts for various architectures,
comparing simulators and hardware performance counters (if available). Re-
tired instructions are a useful metric, as the results should be the same (within
reason, see Chapter 4) across all implementations, including simulators. These
tables show that our results are reasonable across independent implementa-
tions, including actual hardware and the various simulators.
Table C.1 shows Alpha retired instruction counts for SPEC CPU2000 on m5
and on Qemu.
Table C.2 shows MIPS retired instruction counts for an actual R12000 proces-
sor as well as Qemu.
Table C.3 shows PPC retired instruction counts for Qemu and Valgrind.
Some preliminary performance counter results are available for a G3
processor; however, the retired instruction counter on that architecture does
not count many branch instructions (so-called “folded” branch instructions),
so the hardware undercounts by a large amount.
Table C.4 shows SPARC retired instruction counts on a Niagara processor
as compared to Qemu for SPEC CPU2000. Tables C.5 and C.6 show SPARC re-
tired instruction counts on a Niagara processor as compared to Qemu for SPEC
CPU2006.
Table C.7 shows x86 retired instruction counts for Pin, Valgrind and Qemu
as well as native Pentium D for SPEC CPU2000. Tables C.8 and C.9 show the
DBI results for SPEC CPU2006.
Table C.10 shows x86_64 retired instruction counts for Valgrind as well as m5
and native Pentium D for SPEC CPU2000.

Table C.2: Retired instructions for MIPS SPEC CPU2000, showing both
Qemu and actual hardware.

Table C.4: Retired instructions for SPARC SPEC CPU2000, showing actual
hardware and Qemu results.
Benchmark niagara Qemu % diff
perlbmk.mkrnd 1,404,947,767 1,404,994,700 0.0033%
gcc.expr 7,438,524,056 7,439,822,058 0.0174%
gcc.integrate 7,496,590,557 7,496,932,953 0.0046%
perlbmk.perf 22,090,018,022 22,117,246,402 0.1233%
gcc.166 24,096,748,165 24,099,211,420 0.0102%
gzip.log 32,220,222,634 32,220,258,877 0.0001%
perlbmk.diffml 32,527,608,598 32,552,061,978 0.0752%
gcc.scilab 38,943,582,033 38,954,242,610 0.0274%
perlbmk.535 52,797,417,192 52,812,588,889 0.0287%
art.110 55,919,697,739 55,919,809,366 0.0002%
perlbmk.704 56,138,421,439 56,155,913,315 0.0312%
eon.rushmeier 59,734,334,097 59,754,206,796 0.0333%
art.470 61,320,265,673 61,320,378,433 0.0002%
gzip.source 64,028,560,356 64,028,627,716 0.0001%
mcf 66,952,070,888 66,955,397,677 0.0050%
gcc.200 69,304,422,957 69,323,496,419 0.0275%
gzip.random 72,129,328,188 72,129,329,697 0.0000%
vpr.route 80,003,670,096 80,004,767,633 0.0014%
eon.cook 81,546,300,471 81,582,283,446 0.0441%
gzip.graphic 84,794,373,750 84,794,614,880 0.0003%
bzip2.source 85,705,886,914 85,705,888,072 0.0000%
perlbmk.957 93,025,850,757 93,047,199,542 0.0229%
bzip2.program 104,202,484,846 104,202,485,985 0.0000%
vortex.1 104,232,857,833 104,324,773,140 0.0882%
eon.kajiya 105,264,967,103 105,318,413,139 0.0508%
perlbmk.850 107,063,868,164 107,084,662,958 0.0194%
vortex.2 112,156,726,713 112,244,643,448 0.0784%
vpr.place 115,176,477,366 115,176,656,963 0.0002%
vortex.3 116,025,640,113 116,128,639,345 0.0888%
bzip2.graphic 120,800,611,046 120,800,612,184 0.0000%
gzip.program 123,680,369,194 123,680,901,121 0.0004%
equake 151,889,001,382 151,891,686,411 0.0018%
crafty 206,305,279,572 206,315,053,194 0.0047%
gap 210,517,827,839 210,558,299,855 0.0192%
swim 240,514,963,796 240,515,886,318 0.0004%
lucas 266,811,475,823 266,811,478,992 0.0000%
twolf 305,337,539,213 305,339,039,589 0.0005%
facerec 326,679,452,956 326,696,166,196 0.0051%
fma3d 341,030,439,528 341,075,610,275 0.0132%
wupwise 341,661,289,809 341,661,294,714 0.0000%
mesa 346,548,676,244 346,551,169,132 0.0007%
ammp 349,598,884,109 349,606,001,911 0.0020%
parser 355,866,671,084 356,177,932,828 0.0875%
galgel 384,843,413,509 384,843,478,318 0.0000%
apsi 404,410,763,702 404,412,372,555 0.0004%
applu 483,553,093,892 483,553,210,988 0.0000%
mgrid 523,649,830,576 523,650,635,035 0.0002%
sixtrack 531,777,045,890 531,779,537,683 0.0005%
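The % diff column above can be reproduced with a short calculation. The following is a minimal sketch, assuming the difference is taken relative to the hardware (niagara) count; the function name `percent_diff` is hypothetical and the two rows are copied from Table C.4.

```python
def percent_diff(reference: int, measured: int) -> float:
    """Relative difference of `measured` from `reference`, in percent.

    Assumption: the hardware count is the reference (denominator).
    """
    return abs(measured - reference) / reference * 100.0


# Two rows copied from Table C.4: (niagara hardware count, Qemu count).
rows = {
    "perlbmk.mkrnd": (1_404_947_767, 1_404_994_700),
    "gcc.expr": (7_438_524_056, 7_439_822_058),
}

for name, (hardware, qemu) in rows.items():
    print(f"{name}: {percent_diff(hardware, qemu):.4f}%")
```

At four decimal places the choice of denominator (hardware or Qemu count) does not change these rows, since the counts agree to well under one percent.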
Table C.5: Retired instructions for SPARC SPEC CPU2006, showing actual
























Table C.6: Retired instructions for SPARC SPEC CPU2006, showing actual
hardware and Qemu results (part 2)
Benchmark niagara Qemu
bzip2.source 525,214,607,479 525,214,762,807


































Table C.7: Retired instructions for x86 SPEC CPU2000, showing Pin,
Qemu, Valgrind, and actual hardware (Pentium D).
Benchmark Pentium D Pin Qemu Valgrind
188.ammp 333,169,333,372 333,169,294,670 333,169,294,696 333,169,329,139
173.applu 554,510,033,405 554,509,978,381 554,509,978,455 554,509,978,924
301.apsi 648,607,278,050 648,607,219,730 648,607,218,356 648,607,225,337
179.art 110 117,967,839,911 117,967,839,150 117,967,839,198 58,632,609,034
179.art 470 121,326,001,662 121,326,000,207 121,326,000,255 64,317,418,082
256.bzip2 graphic 117,528,983,508 117,528,935,447 117,528,935,461 117,528,935,918
256.bzip2 program 103,252,292,827 103,252,244,840 103,252,244,854 103,252,245,311
256.bzip2 source 86,640,101,418 86,640,053,044 86,640,053,064 86,640,053,515
186.crafty 215,657,884,011 215,657,814,911 215,657,814,944 215,657,815,392
252.eon cook 85,146,651,778 85,146,645,565 85,146,645,795 85,146,647,511
252.eon kajiya 109,342,530,157 109,342,523,528 109,342,523,841 109,342,278,681
252.eon rushmeier 62,973,702,423 62,973,695,532 62,973,695,924 62,973,696,148
183.equake 144,985,852,691 144,985,830,752 144,985,830,784 144,985,809,902
187.facerec 309,900,303,948 309,897,997,695 309,897,997,844 309,897,997,983
191.fma3d 320,946,898,097 320,946,792,622 320,946,792,780 320,946,772,563
178.galgel 370,730,832,412 370,730,602,525 370,730,602,672 370,916,153,512
254.gap 221,616,787,940 221,616,650,209 221,616,650,177 221,616,650,640
176.gcc 166 22,310,940,703 22,310,842,538 22,310,842,738 22,311,064,104
176.gcc 200 72,618,452,686 72,618,200,323 72,618,200,652 72,618,892,100
176.gcc expr 7,287,040,224 7,286,992,242 7,286,992,508 7,287,101,582
176.gcc integrate 7,295,131,438 7,295,099,630 7,295,099,672 7,295,204,413
176.gcc scilab 39,177,416,341 39,177,182,848 39,177,182,732 39,176,446,405
164.gzip graphic 73,929,735,977 73,929,689,944 73,929,689,958 73,929,690,409
164.gzip log 29,339,108,465 29,339,062,611 29,339,062,625 29,339,063,076
164.gzip program 105,592,070,042 105,592,024,103 105,592,024,117 105,592,024,568
164.gzip random 60,368,090,965 60,368,044,987 60,368,044,996 60,368,045,458
164.gzip source 56,026,942,353 56,026,896,002 56,026,896,011 56,026,896,473
189.lucas 299,119,908,646 299,119,860,603 299,119,860,756 299,141,824,974
181.mcf 69,384,440,147 69,385,080,146 69,385,080,176 69,385,080,617
177.mesa 282,923,973,831 282,923,962,440 282,923,962,466 282,923,962,929
172.mgrid 502,690,354,308 502,690,322,067 502,690,322,141 502,690,797,081
197.parser 372,095,799,952 372,120,124,402 372,139,619,752 372,100,648,992
253.perlbmk 535 54,500,531,177 54,501,566,628 54,500,611,900 54,494,161,136
253.perlbmk 704 57,746,134,915 57,747,635,083 57,747,177,503 57,740,259,411
253.perlbmk 957 95,767,811,078 95,767,410,497 95,768,071,803 95,757,389,828
253.perlbmk 850 110,760,424,439 110,760,912,962 110,760,662,259 110,748,559,089
253.perlbmk diffmail 32,815,507,767 32,809,491,692 32,811,408,056 32,805,199,186
253.perlbmk mkrnd 1,265,082,975 1,265,125,871 1,265,119,126 1,265,122,789
253.perlbmk perfect 21,360,587,029 21,358,546,517 21,359,077,277 21,359,269,806
200.sixtrack 907,227,353,765 907,226,845,666 907,226,845,733 907,226,834,827
171.swim 301,163,912,858 301,163,859,925 301,163,859,993 301,163,890,220
300.twolf 311,868,478,360 311,868,472,123 311,868,472,155 311,868,472,603
255.vortex 1 144,373,942,830 144,373,882,650 144,373,882,668 144,369,991,734
255.vortex 2 162,519,416,945 162,519,362,767 162,519,362,785 162,517,282,072
255.vortex 3 160,888,128,313 160,888,065,924 160,888,065,942 160,890,052,476
175.vpr place 110,294,409,743 110,294,407,436 110,294,407,461 110,294,407,909
175.vpr route 93,441,428,087 93,441,408,681 93,441,408,697 93,441,437,858
168.wupwise 502,204,546,200 502,204,501,014 502,204,501,084 502,204,501,530
Table C.8: Retired instructions for x86 SPEC CPU2006, showing Pin,
Valgrind, Qemu, and Pentium D (part 1).
Benchmark Pentium D Pin Qemu Valgrind
473.astar BigLakes 435,510,885,945 435,510,704,863 435,510,704,889 435,571,189,405
473.astar rivers 870,943,327,262 870,943,274,640 870,943,274,666 870,945,429,505
410.bwaves 2,495,857,175,894 2,495,855,313,228 2,488,693,693,543 2,497,901,489,742
401.bzip2 chicken 199,232,656,452 199,232,627,434 199,232,627,466 199,232,627,914
401.bzip2 combined 364,136,091,950 364,135,929,341 364,135,929,355 364,135,929,812
401.bzip2 html 706,417,018,345 706,416,797,441 706,416,797,473 706,416,797,921
401.bzip2 liberty 346,361,794,588 346,361,765,172 346,361,765,204 346,361,765,652
401.bzip2 program 593,333,086,611 593,332,865,463 593,332,865,483 593,332,865,934
401.bzip2 source 452,012,609,560 452,012,385,798 452,012,385,812 452,012,386,269
436.cactusADM 3,149,915,322,930 3,149,914,624,261 3,149,914,624,429 3,149,914,680,629
454.calculix 8,687,262,977,320 8,687,259,261,309 8,687,259,304,233 8,687,445,799,664
447.dealII 2,334,573,872,592 2,334,572,223,794 2,334,572,223,809 2,330,760,559,334
416.gamess cytosine 1,143,014,974,777 1,143,014,915,456 1,143,014,915,621 1,142,857,629,518
416.gamess h2ocu2 867,682,898,659 867,682,786,674 867,682,786,822 867,681,851,546
416.gamess triazolium 4,215,197,021,876 4,215,196,654,946 4,215,196,655,094 4,215,183,173,721
403.gcc 166 85,720,786,510 85,719,045,618 85,719,045,707 85,729,929,027
403.gcc 200 166,630,914,986 166,629,909,806 166,629,909,895 166,629,517,693
403.gcc c-typeck 140,819,919,763 140,813,681,279 140,813,681,371 140,836,875,542
403.gcc cp-decl 109,542,581,703 109,541,663,142 109,541,663,224 109,553,354,039
403.gcc expr 118,136,049,520 118,131,076,133 118,131,076,225 118,152,895,761
403.gcc expr2 160,294,450,226 160,288,195,981 160,288,196,080 160,319,263,901
403.gcc g23 193,775,955,105 193,769,398,187 193,769,398,269 193,795,174,499
403.gcc s04 179,205,087,128 179,202,306,373 179,202,306,455 179,225,687,180
403.gcc scilab 64,696,677,236 64,696,579,050 64,696,579,111 64,697,188,635
Table C.9: Retired instructions for x86 SPEC CPU2006, showing Pin,
Valgrind, Qemu, and Pentium D (part 2).
Benchmark Pentium D Pin Qemu Valgrind
445.gobmk 13x13 238,220,175,611 238,220,161,269 238,220,161,299 238,220,161,795
445.gobmk nngs 631,487,346,762 631,487,324,668 631,487,324,687 631,487,325,199
445.gobmk score2 345,153,272,758 345,153,264,793 345,153,264,841 345,153,265,332
445.gobmk trevorc 236,505,606,986 236,505,585,769 236,505,585,812 236,505,586,316
445.gobmk trevord 340,188,792,640 340,188,777,068 340,188,777,111 340,188,777,615
459.GemsFDTD 2,511,548,122,944 2,511,544,724,520 2,511,544,724,652 2,511,544,820,559
435.gromacs 2,929,271,991,015 2,929,271,975,841 2,929,271,975,828 2,929,271,976,443
464.h264ref forebase 564,679,796,012 564,679,752,397 564,679,752,405 564,679,821,693
464.h264ref foremain 323,101,393,242 323,101,365,998 323,101,366,009 323,101,161,152
464.h264ref sss 2,814,673,346,854 2,814,673,203,792 2,814,673,203,803 2,814,672,542,474
456.hmmer nph3 1,039,884,719,301 1,039,884,652,135 1,039,884,652,113 1,039,884,652,662
456.hmmer retro 2,212,798,693,606 2,212,798,691,472 2,212,798,691,507 2,212,798,691,985
470.lbm 1,495,737,680,772 1,495,737,572,643 1,495,737,572,670 1,495,736,273,152
437.leslie3d 2,534,171,419,277 2,534,170,033,329 2,534,170,033,474 2,534,170,033,260
462.libquantum 3,884,593,324,266 3,884,593,087,057 3,884,593,087,091 3,884,593,087,595
429.mcf 449,894,848,644 449,895,233,570 449,895,233,605 449,895,234,103
433.milc 1,386,822,701,292 1,386,801,295,324 1,386,801,294,849 1,386,800,309,412
444.namd 2,895,739,745,975 2,895,739,724,790 2,895,739,724,842 2,895,739,729,006
471.omnetpp 764,012,416,098 764,012,359,528 764,012,359,553 762,846,798,782
400.perlbench checkspam 148,065,633,450 148,067,449,377 148,061,319,677 148,020,669,811
400.perlbench diffmail 401,910,091,346 401,888,866,913 401,932,464,865 401,831,332,925
400.perlbench splitmail 714,326,937,255 714,290,352,654 714,309,327,813 714,461,865,506
453.povray 1,204,156,309,806 1,204,157,467,167 1,204,159,849,673 1,204,125,183,903
458.sjeng 2,530,950,089,415 2,530,950,014,161 2,530,950,014,181 2,530,950,014,688
450.soplex pds-50 450,970,473,747 450,957,516,054 450,959,827,905 476,893,662,178
450.soplex ref 459,067,223,899 459,034,016,857 459,020,702,694 477,057,983,713
482.sphinx3 2,827,878,655,277 2,827,878,538,864 2,827,878,538,664 2,828,020,092,268
465.tonto 2,895,603,788,430 2,895,583,404,681 2,895,626,773,725 2,895,531,451,323
481.wrf 4,117,176,411,328 4,117,161,034,700 4,117,161,034,831 4,116,831,956,200
483.xalancbmk 1,313,433,329,271 1,313,431,575,521 1,313,431,575,551 1,314,798,555,882
434.zeusmp 2,397,609,296,047 n/a n/a n/a
Table C.10: Retired instructions for x86_64 SPEC CPU2000, showing m5,
Valgrind, and actual hardware (Pentium D).
Benchmark Pentium D m5 Valgrind
perlbmk.mkrnd 1,090,919,227 1,090,879,129 1,090,746,089
gcc.expr 7,350,887,801 7,257,774,662 7,258,023,131
gcc.integrate 7,698,302,743 7,598,617,570 7,597,927,209
perlbmk.pfct 19,654,889,034 19,649,264,066 19,674,125,598
gcc.166 26,258,578,150 26,053,572,133 26,053,249,578
gzip.log 27,720,223,414 27,629,578,439 27,630,555,769
art.110 37,684,130,154 37,684,106,361 37,684,089,804
gcc.scilab 39,085,872,433 38,718,233,041 38,719,744,344
art.470 41,815,575,277 41,815,549,206 41,814,960,116
eon.rushmeier 46,652,449,332 46,652,438,765 46,652,447,077
mcf 47,178,238,767 47,178,770,487 47,178,758,942
gzip.random 50,716,078,217 50,552,564,398 50,553,545,097
eon.cook 59,432,883,084 59,432,871,622 59,432,880,124
gzip.source 63,638,496,739 63,533,923,887 63,534,804,993
vpr.route 65,842,168,801 65,842,101,031 65,842,410,972
gzip.graphic 66,140,686,787 65,984,284,025 65,985,226,242
gcc.200 69,752,973,526 69,333,015,398 69,350,744,008
bzip2.source 75,737,059,115 75,736,212,867 75,737,065,461
eon.kajiya 79,548,196,338 79,548,182,772 79,548,189,789
vpr.place 91,801,882,750 91,627,577,007 91,801,833,351
equake 91,831,665,346 91,831,629,328 91,831,292,111
bzip2.program 92,195,189,138 92,194,260,068 92,195,239,731
bzip2.graphic 104,716,089,604 104,715,201,159 104,716,114,878
gzip.program 134,301,541,033 134,183,019,555 134,184,027,716
crafty 140,491,641,621 140,491,608,144 140,491,506,813
gap 183,443,821,679 183,443,733,395 183,443,755,451
lucas 205,651,052,365 205,650,963,195 205,650,990,335
swim 211,145,979,309 211,145,850,745 211,145,887,898
mesa 225,141,182,114 225,141,105,441 225,141,115,104
facerec 249,466,728,521 249,465,506,605 249,433,555,885
fma3d 252,621,825,649 252,621,687,157 252,621,712,799
parser 263,269,230,283 263,269,185,444 263,218,164,789
galgel 265,315,494,177 265,315,409,019 265,319,397,124
ammp 282,273,753,920 282,273,684,014 282,273,805,462
twolf 294,395,392,631 294,395,327,575 294,395,331,989
mgrid 317,902,282,935 317,901,442,889 317,901,782,490
applu 329,640,061,785 329,639,906,447 329,639,978,210
apsi 335,998,339,268 335,998,752,850 335,998,224,351
wupwise 360,553,449,666 360,553,370,094 360,553,381,385
sixtrack 542,751,559,882 542,751,311,580 542,751,677,787
perlbmk.535 n/a n/a n/a
perlbmk.704 n/a n/a n/a
perlbmk.850 n/a n/a n/a
perlbmk.957 n/a n/a n/a
perlbmk.diff n/a n/a n/a
vortex.1 n/a n/a n/a
vortex.2 n/a n/a n/a




This appendix contains details of how long various simulators take to simu-
late the SPEC CPU2000 benchmarks.
These results are only approximate. Times for simulators were measured on
the domori cluster (3.46GHz Pentium D) with varying loads. The measurements
were taken over a 5 year span; during that time the operating systems
and the various simulators were updated. Some points are missing
due to broken simulators, cluster crashes, or broken toolchains. The
vortex and some of the perlbmk benchmarks are missing, as they do not run
out of the box with modern compilers on our comparison Pentium D machine.
Often multiple runs were taken; in that case the lowest value was chosen.
For the native runs, at least three runs were made, with the middle value
chosen as the reference time. The average given is the weighted average across
all benchmarks in the suite. If benchmarks are missing, the comparison is made
over the subset of benchmarks that completed.
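The slowdown figures can be sketched as follows. This is an illustrative reading, not the original scripts: per-benchmark slowdown is taken as simulator time divided by native (domori) time, and the weighted average is assumed to be total simulator time over total native time for the benchmarks that completed. The helper names are hypothetical; the sample times are copied from Table D.3.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(elapsed: str) -> int:
    """Convert strings such as '1d13h19m29s' or '5m09s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", elapsed)
    # Reject anything that is not fully made of <number><unit> pieces.
    if not parts or "".join(n + u for n, u in parts) != elapsed:
        raise ValueError(f"unrecognized elapsed time: {elapsed!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def slowdown(simulated: str, native: str) -> float:
    """Slowdown of one benchmark relative to its native run."""
    return to_seconds(simulated) / to_seconds(native)


def weighted_average(pairs) -> float:
    """pairs: iterable of (native, simulated); simulated may be 'n/a'.

    Assumption: weights are the native run times, so the result equals
    total simulated time divided by total native time.
    """
    done = [(to_seconds(n), to_seconds(s)) for n, s in pairs if s != "n/a"]
    return sum(s for _, s in done) / sum(n for n, _ in done)


# mcf and galgel on the Alpha 21264 (bmul) versus domori, from Table D.3.
print(round(slowdown("15m52s", "5m09s"), 2))
print(round(slowdown("23m12s", "1m54s"), 2))
```

Benchmarks marked n/a are dropped from both sums, which matches the statement that comparisons use only the subset of benchmarks that completed.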
Be careful if using these numbers for anything but rough estimates or orders-
of-magnitude comparisons.
An overall summary can be found in Table D.1.
Table D.3 shows Alpha results. The binaries were compiled with gcc 4.2
using -O2 optimization. The Qemu results are from git Qemu as of 11 January
2010 with a few extra patches applied to enable proper floating-point and fstat64
support. The many missing points for sim-alpha are due to that simulator not
supporting modern Linux binaries.
Table D.4 shows MIPS results. The binaries used are the pre-compiled ones
from the SESC website. Care needs to be taken, as those binaries are incomplete
(not all are available) and some of them, most notably gzip, have been modified
for shorter runtime. This is why the gzip benchmarks seem to run faster on the
older R12k machine than on the modern Pentium D machine. qemu cache is
Qemu custom-patched to feed into the Dinero cache simulator.
Table D.5 shows SPARC results. qemu bbv is Qemu patched to generate
basic block vectors, hence a bit slower than stock Qemu. The niagara system
shown for comparison has a simple FPU, which is why the floating point
benchmarks perform relatively poorly.
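As an illustration of what generating basic block vectors involves, the sketch below accumulates per-block instruction counts over fixed-size intervals and emits one line per interval. It is not the Qemu patch itself: the function names are hypothetical, the trace is a toy input, and the `T:id:count` line layout is assumed to follow the format commonly consumed by SimPoint.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def format_line(counts: Counter) -> str:
    """Render one interval as 'T:id:count :id:count ...' (assumed layout)."""
    return "T" + " ".join(f":{b}:{c}" for b, c in sorted(counts.items()))


def bbv_lines(trace: Iterable[Tuple[int, int]], interval: int) -> Iterator[str]:
    """Yield one basic block vector per `interval` executed instructions.

    `trace` yields (block_id, instructions_in_block) each time a basic
    block executes; a block's weight is its total instructions executed.
    """
    counts: Counter = Counter()
    executed = 0
    for block_id, n_instr in trace:
        counts[block_id] += n_instr
        executed += n_instr
        if executed >= interval:
            yield format_line(counts)
            counts.clear()
            executed = 0
    if counts:  # flush a final, partial interval
        yield format_line(counts)


# Toy trace with a tiny interval; real intervals are far larger.
for line in bbv_lines([(1, 3), (2, 2), (1, 3), (3, 4)], interval=5):
    print(line)
```

The extra bookkeeping on every executed block is the kind of work that would make a patched emulator somewhat slower than a stock one.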
Table D.6 shows x86 results for SPEC CPU2000. These use pinkit
pin-2.0-10520-gcc.4.0.0-ia32-linux, qemu 0.9.1 and Valgrind 3.3.0. According to the
pin 2005 PLDI paper [87], the perlbmk and gcc slowdowns occur because there is
not much code reuse, so the JIT overhead is higher. Integer codes perform worse
due to the large number of indirect or unpredictable jumps. The m5 results are
incomplete, as m5 x86 support currently lacks x87 floating point.
In many cases the 32-bit code performs better than the 64-bit code. Table D.2
breaks this out for the sixtrack benchmark. The difference seems to be
inherent in the 64-bit aspect of the code: running on different implementations
shows the same problem, and turning on SSE instructions does not help (SSE
instructions are limited to 64-bit floating point, so issues due to 80-bit floating
point would be uncovered that way). The cause of this is still under investigation.
Table D.1: Summary of slowdown compared to Pentium D node running
x86_64 binaries.
Minimum Maximum Weighted
Arch. Method Slowdown Slowdown Average
Alpha
Alpha 21264 3.08 mcf 12.21 galgel 5.88
qemu-alpha 3.67 mcf 261.04 mgrid 61.19
m5-nocache 105.82 mcf 2696.46 mgrid 1312.59
sim-alpha 617.19 mcf 8250.56 gap 3772.03
MIPS
MIPS R12k 0.65 gzip 22.95 swim 10.01
qemu cache 34.92 mcf 966.31 mgrid 376.21
sesc 377.90 gzip 12435.67 mgrid 3909.04
SPARC
SPARC niagara 2.25 mcf 78.13 mgrid 20.95
qemu bbv 1.75 mcf 181.18 mgrid 44.58
x86
Pentium D 0.25 sixtrk 4.00 perl.rnd 0.97
pin 1.47 sixtrk 47.00 perl.rnd 8.30
qemu 2.69 mcf 84.38 wupwise 26.50
valgrind 4.08 mcf 99.52 eon.cook 32.10
m5-nocache 2943.13 gcc.166 7495.53 gap 5624.99
x86_64
pin baseline 0.89 sixtrk 4.50 gcc.expr 1.27
qemu baseline 1.31 mcf 20.59 eon.cook 8.78
valgrind baseline 1.64 mcf 42.44 mesa 6.92
pin cache 14.31 mcf 352.66 wupwise 101.68
qemu cache 50.44 mcf 17936.83 eon.cook 1036.74
valgrind cache 5.73 mcf 97.77 perl.pfct 29.13
m5-nocache 358.71 mcf 6784.89 eon.rush 2694.39
m5-cache 841.46 sixtrk 7104.07 wupwise 2882.13
ptlsim 1070.03 mcf 14398.58 wupwise 7534.09
Table D.7 shows x86_64 results for simulations. The m5 version is the current
mercurial tree as of 10 January 2010, with a few additional patches. Benchmarks
are compiled with -O3 -msse3 -funroll-all-loops -ffast-math on
gcc 4.3.2. Table D.8 shows unmodified x86_64 DBI results, showing the fastest
possible speed. Valgrind is version 3.5; Qemu is the git tree as of 10 December
2009. Table D.9 shows x86_64 DBI results when including a cache simulator.
The Qemu results are shown feeding the Dinero cache simulator via a named
pipe with addresses truncated to 32 bits. The pin and Valgrind results use
the cache simulators that come with the tools, simulating a Phenom-like
cache infrastructure.
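The effect of truncating addresses before handing them to a trace-driven cache simulator can be sketched as below. This is an assumption-laden illustration rather than the patch used here: it assumes the classic Dinero "din" text records (an access-type label followed by a hexadecimal address), and the function names are hypothetical.

```python
import io

# Access-type labels of the classic "din" trace format (assumed here).
READ, WRITE, IFETCH = 0, 1, 2
MASK32 = 0xFFFFFFFF


def din_record(kind: int, address: int) -> str:
    """One trace line: label, then the address truncated to 32 bits (hex)."""
    return f"{kind} {address & MASK32:x}\n"


def write_trace(out, accesses) -> None:
    """Write (kind, address) pairs to any writable text stream.

    For streaming, `out` could be a FIFO created with os.mkfifo and
    opened for writing while the cache simulator reads the other end.
    """
    for kind, address in accesses:
        out.write(din_record(kind, address))


buffer = io.StringIO()
write_trace(buffer, [(IFETCH, 0x7FFF12345678), (READ, 0x100000040)])
print(buffer.getvalue(), end="")
```

One consequence of truncation is that 64-bit addresses differing only in their upper 32 bits become indistinguishable to the cache simulator.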
Table D.2: x86 32-bit versus 64-bit run time anomaly for sixtrack. Some
benchmarks perform markedly worse when compiled as 64-bit.
Pentium D Phenom Core2 Q6600
32-bit -O2 4:18 4:52 3:21
64-bit -O2 20:12 23:34 19:14
64-bit -msse3 23:00 37:08 18:57
Table D.3: Elapsed times for running the SPEC CPU 2000 benchmarks on
various Alpha simulators. domori is time on our reference Pen-
tium D machine. bmul is an actual Alpha 21264 system.
Benchmark domori bmul qemu m5-nocache sim-alpha
perlbmk.makerand 1s 5s 21s 23m09s n/a
gcc.integrate 5s 20s 1m45s 1h36m42s 6h12m38s
gcc.expr 4s 19s 1m51s 1h33m20s 6h04m18s
gzip.log 13s 1m27s 5m04s 6h16m20s n/a
gcc.166 23s 1m30s 5m18s 5h13m42s 20h51m23s
eon.rushmeier 18s 1m38s 22m57s 9h32m53s 1d13h19m29s
gcc.scilab 23s 1m40s 11m02s 8h11m27s 1d08h16m48s
gzip.random 24s 2m35s 10m40s 13h34m50s n/a
gzip.source 29s 2m50s 11m51s 12h56m51s n/a
gzip.graphic 29s 2m56s 12m43s 18h03m23s 2d09h32m45s
gcc.200 40s 3m00s 17m09s 14h27m11s 2d21h34m53s
eon.kajiya 34s 3m07s 40m48s 17h34m35s n/a
eon.cook 22s 3m09s 31m09s 13h07m53s 2d01h30m28s
bzip2.source 40s 3m38s 13m00s 17h07m12s n/a
art.110 1m15s 4m24s 29m31s 8h08m28s 1d03h59m01s
bzip2.program 41s 4m24s 15m06s 22h34m37s n/a
gzip.program 52s 4m43s 19m28s 23h53m45s n/a
art.470 1m22s 4m57s 26m30s 8h51m15s 1d06h10m39s
vpr.place 1m28s 6m13s 31m21s n/a n/a
vpr.route 1m25s 6m20s 22m56s 15h47m11s n/a
crafty 1m12s 6m21s 32m58s 1d12h58m47s 4d11h59m51s
bzip2.graphic 51s 8m00s 16m43s 1d02h36m04s n/a
mesa 1m36s 8m44s 1h52m33s 2d03h57m47s n/a
gap 1m13s 9m34s 38m51s 2d00h09m00s 6d23h18m11s
wupwise 1m40s 9m59s 3h12m48s 2d10h28m58s 7d09h32m35s
lucas 2m02s 10m07s 2h21m56s 1d09h02m13s 2d20h01m11s
equake 1m09s 10m13s 1h27m57s 1d02h20m24s 3d12h05m15s
sixtrack 2m46s 13m23s 8h18m57s 3d04h58m54s n/a
facerec 3m56s 14m15s 2h14m20s n/a n/a
mcf 5m09s 15m52s 18m53s 9h04m58s 2d04h58m32s
applu 2m42s 15m57s 5h06m50s 3d04h53m25s 6d04h57m11s
fma3d 5m31s 17m57s 3h47m10s 2d21h39m19s n/a
ammp 2m47s 18m35s 5h38m13s 3d10h47m49s 7d06h45m53s
mgrid 1m52s 19m42s 8h07m16s 3d11h53m24s 8d12h12m06s
galgel 1m54s 23m12s 3h24m22s 2d14h51m58s 9d17h05m47s
twolf 3m26s 23m22s 1h15m44s 2d21h56m13s 9d03h35m12s
parser 3m08s 23m32s 1h21m47s 2d23h50m36s n/a
apsi 3m33s 23m37s 3h44m49s 3d00h08m42s 8d05h20m38s
swim 2m30s 27m45s 3h04m57s 1d17h35m53s 5d01h14m23s
Table D.4: Elapsed times for running the SPEC CPU 2000 benchmarks on
various MIPS simulators. domori is time on our reference Pen-
tium D machine. hershey is an actual MIPS R12000 system. The
pre-compiled SPEC benchmarks from the SESC site are used;
some (such as gzip) are modified to have shorter run-times,
which is why the R12000 runs them faster than the Pentium D.
Benchmark domori hershey qemu cache sesc
164.gzip.log 13s 15s 13m52s 2h17m38s
164.gzip.source 29s 25s 24m00s 3h58m48s
164.gzip.program 52s 34s 32m49s 5h27m31s
176.gcc.expr 4s 37s 32m16s 6h12m10s
176.gcc.integrate 5s 39s 34m31s 7h05m16s
164.gzip.random 24s 39s 36m16s 6h01m46s
164.gzip.graphic 29s 47s 49m07s 8h02m23s
256.bzip2.source 40s 2m09s 2h16m34s 1d00h25m02s
176.gcc.166 23s 2m35s 2h03m00s 1d07h14m07s
256.bzip2.program 41s 2m43s 2h58m24s 1d07h32m39s
256.bzip2.graphic 51s 3m07s 3h39m48s 2d00h52m22s
176.gcc.scilab 23s 3m17s 2h47m22s 1d11h13m34s
176.gcc.200 40s 5m39s 6h49m07s 2d15h11m51s
179.art.110 1m15s 10m23s 1h57m46s 2d08h51m01s
179.art.470 1m22s 11m26s 2h06m30s 2d11h09m05s
175.vpr.place 1m28s 13m16s 5h54m56s 2d12h27m59s
183.equake 1m09s 13m41s 6h35m58s 3d17h27m29s
168.wupwise 1m40s 14m55s 21h00m46s 7d08h57m22s
177.mesa 1m36s 14m58s 14h06m14s 5d16h45m08s
254.gap 1m13s 15m36s 11h40m59s 5d00h02m12s
186.crafty 1m12s 16m30s 12h22m53s 4d13h06m39s
200.sixtrack 2m46s 17m38s 1d20h20m47s 10d03h12m14s
175.vpr.route 1m25s 17m38s 6h02m32s 2d02h27m09s
181.mcf 5m09s 28m55s 2h59m49s 2d00h28m10s
197.parser 3m08s 29m05s 17h09m46s 7d23h33m44s
188.ammp 2m47s 29m31s 17h41m38s 7d10h05m58s
173.applu 2m42s 31m24s 20h40m23s 11d04h04m19s
172.mgrid 1m52s 31m59s 1d06h03m47s 16d02h53m15s
300.twolf 3m26s 33m50s 15h13m50s 6d13h20m15s
301.apsi 3m33s 53m35s 1d01h06m40s 10d23h44m24s
171.swim 2m30s 57m23s 11h52m41s 6d12h03m34s
Table D.5: Elapsed times for running the SPEC CPU 2000 benchmarks on
various SPARC simulators. domori is time on our reference Pen-
tium D machine. niagara is an actual SPARC niagara system.
Benchmark domori niagara qemu bbv
perlbmk.makernd 1s 5s 15s
gcc.expr 4s 25s 1m40s
gcc.integrate 5s 28s 1m40s
perlbmk.perf 13s 1m23s 5m05s
gcc.166 23s 1m40s 5m40s
gzip.log 13s 1m22s 2m27s
gcc.scilab 23s 2m07s 8m39s
art.110 1m15s 9m46s 17m22s
eon.rushmeier 18s 7m52s 17m22s
art.470 1m22s 10m41s 18m08s
gzip.source 29s 2m53s 5m05s
mcf 5m09s 11m35s 9m02s
gcc.200 40s 3m45s 13m22s
gzip.random 24s 2m41s 5m19s
vpr.route 1m25s 10m06s n/a
eon.cook 22s 2m54s 25m31s
gzip.graphic 29s 3m05s 6m45s
bzip2.source 40s 4m25s 7m42s
bzip2.program 41s 5m06s 8m38s
eon.kajiya 34s 14m44s 39m07s
vpr.place 1m28s 10m00s n/a
bzip2.graphic 51s 6m02s 9m43s
gzip.program 52s 5m46s 9m07s
equake 1m09s 38m51s 1h16m43s
crafty 1m12s 15m04s 33m39s
gap 1m13s 10m01s 26m26s
swim 2m30s 1h23m48s 2h24m19s
lucas 2m02s 52m04s 2h07m59s
twolf 3m26s 22m41s 23m47s
facerec 3m56s 46m18s n/a
fma3d 5m31s 1h13m00s 2h43m36s
wupwise 1m40s 1h01m14s 2h37m21s
mesa 1m36s 31m04s 1h06m40s
ammp 2m47s 1h31m38s 3h32m39s
parser 3m08s n/a n/a
galgel 1m54s 1h28m53s 2h44m34s
apsi 3m33s 1h25m21s 3h05m47s
applu 2m42s 1h35m48s 3h46m28s
mgrid 1m52s 2h25m51s 5h38m12s
sixtrack 2m46s 2h38m15s n/a
Table D.6: Times for x86 architecture
Benchmark 32-bit 64-bit pin qemu valgrind m5
perlbmk.mkrnd 4s 1s 47s 31s 43s n/a
gcc.expr 5s 5s 2m33s 2m06s 3m15s 5h42m46s
gcc.integrate 5s 5s 2m31s 1m49s 3m05s 5h50m11s
perlbmk.prfct 14s 13s 5m20s 7m49s 10m51s n/a
gzip.log 17s 13s 4m28s 4m17s 7m43s 20h23m11s
eon.rushmeier 31s 18s 8m06s 21m43s 27m54s n/a
eon.cook 37s 23s 9m52s 30m24s 38m09s n/a
gcc.166 12s 23s 5m07s 5m09s 9m08s 18h48m12s
gcc.scilab 20s 24s 10m02s 10m20s 16m48s 1d05h57m27s
gzip.random 28s 24s 8m35s 8m57s 16m22s 1d20h32m49s
gzip.graphic 35s 30s 10m00s 11m24s 19m54s 2d06h59m35s
gzip.source 28s 31s 9m47s 8m27s 15m20s 1d16h01m59s
eon.kajiya 54s 35s 13m09s 39m22s 50m49s n/a
bzip2.program 48s 41s 13m00s 15m38s 26m50s 3d01h45m46s
gcc.200 34s 41s 14m30s 16m55s 28m09s 2d05h54m01s
bzip2.source 47s 42s 11m40s 13m35s 23m03s 2d15h18m22s
bzip2.graphic 59s 53s 15m12s 17m47s 30m04s 3d12h43m40s
gzip.program 53s 53s 17m22s 15m31s 29m57s 3d03h46m29s
equake 1m27s 1m11s 7m59s 35m55s 41m24s n/a
crafty 1m42s 1m12s 26m41s 1h16m48s 1h44m35s 5d21h50m20s
art.110 1m27s 1m12s 9m55s 27m02s 18m15s n/a
gap 1m33s 1m14s 33m21s 1h02m41s 1h20m34s 6d10h04m29s
art.470 1m52s 1m16s 10m38s 27m27s 20m08s n/a
vpr.route 1m39s 1m26s 10m53s 20m06s 29m12s n/a
vpr.place 1m07s 1m30s 14m27s 30m27s 48m03s n/a
wupwise 3m20s 1m46s 43m12s 2h29m04s 2h51m27s n/a
galgel 2m23s 1m58s 30m15s 1h01m42s 1h43m08s n/a
mesa 3m13s 1m58s 24m36s 1h45m27s 2h51m15s n/a
lucas 4m50s 2m09s 17m09s 1h34m40s 1h23m06s n/a
ammp 4m35s 2m52s 29m43s 1h50m27s 1h52m03s n/a
parser 3m21s 3m18s 56m41s 1h33m21s 1h55m06s 11d07h24m21s
applu 8m18s 3m21s 28m18s 2h53m12s 3h12m13s n/a
twolf 3m15s 3m28s 34m17s 1h57m09s 2h09m56s n/a
mgrid 3m46s 3m49s 15m27s 1h53m41s 2h06m23s n/a
swim 3m18s 3m51s 10m16s 1h16m52s 1h19m14s n/a
facerec 3m53s 4m10s 24m11s 1h11m21s 1h30m02s n/a
apsi 4m28s 4m47s 30m16s 2h38m45s 3h06m16s n/a
mcf 2m45s 5m24s 14m22s 14m32s 22m01s n/a
fma3d 5m21s 5m35s 33m35s 2h04m33s 2h25m38s n/a
sixtrack 4m40s 18m24s 27m08s 4h02m57s 4h21m14s n/a
Table D.7: Times for x86_64 architecture comparing simulators.
Benchmark domori m5-nocache m5-cache ptlsim
perlbmk.makerand 1s 53m48s 58m52s n/a
gcc.expr 5s 4h53m12s 5h33m00s 10h30m07s
gcc.integrate 5s 4h50m33s 5h52m55s 10h16m38s
perlbmk.perfect 13s 14h32m05s 15h53m58s n/a
gzip.log 13s 19h39m50s 20h24m41s 1d14h58m40s
eon.rushmeier 18s 1d09h55m28s 1d07h58m31s 2d17h09m28s
eon.cook 23s 1d19h02m39s 1d16h51m53s 3d09h41m21s
gcc.166 23s 16h37m58s 20h38m33s 1d09h50m05s
gcc.scilab 24s 1d01h55m48s 1d05h23m15s 2d08h30m16s
gzip.random 24s 1d13h24m45s 1d15h12m09s 3d08h57m53s
gzip.graphic 30s 2d01h39m03s 2d03h34m21s 4d18h23m40s
gzip.source 31s 1d21h33m05s 1d23h23m02s 3d21h11m05s
eon.kajiya 35s 2d09h24m17s 2d07h21m29s 5d07h00m46s
bzip2.program 41s 2d15h17m39s 2d19h00m34s 6d00h17m33s
gcc.200 41s 1d21h23m22s 2d07h14m04s 4d02h12m45s
bzip2.source 42s 2d03h19m57s 2d06h53m50s 5d03h11m06s
bzip2.graphic 53s 3d01h38m12s 3d05h03m02s 6d16h48m50s
gzip.program 53s 4d01h02m04s 4d05h04m40s 7d21h37m40s
equake 1m11s 2d22h59m11s 2d14h41m01s 7d16h12m56s
crafty 1m12s 4d04h15m47s 4d08h47m26s 9d01h36m55s
art.110 1m12s 21h25m23s 1d06h28m02s 2d05h29m40s
gap 1m14s 5d07h28m52s 5d15h42m02s 11d14h07m02s
art.470 1m16s 1d04h40m49s 1d09h41m38s 2d11h07m00s
vpr.route 1m26s 1d21h55m53s 2d03h26m07s n/a
vpr.place 1m30s 2d16h46m57s 2d21h32m58s n/a
wupwise 1m46s 8d07h15m28s 8d17h10m31s 17d15h57m30s
galgel 1m58s 6d16h04m16s 7d16h52m55s 18d08h38m56s
mesa 1m58s 7d02h31m53s 6d15h05m59s 12d10h17m04s
lucas 2m09s 6d03h01m15s 5d12h26m48s 12d16h59m44s
ammp 2m52s 6d19h42m04s 7d23h34m04s 18d00h55m59s
parser 3m18s 8d03h55m04s 8d18h34m43s 18d04h31m22s
applu 3m21s 8d07h21m55s 9d09h47m46s 20d17h07m13s
twolf 3m28s 8d07h34m14s 9d03h33m07s 17d21h00m02s
mgrid 3m49s 7d20h02m09s 9d00h43m12s 19d08h28m21s
swim 3m51s 4d18h36m35s 5d19h04m44s 13d06h49m15s
facerec 4m10s 6d08h53m58s 7d00h03m23s n/a
apsi 4m47s 9d08h01m41s 9d05h47m29s 20d11h39m32s
mcf 5m24s 1d08h17m02s 3d11h32m14s 4d00h18m09s
fma3d 5m35s 7d23h11m24s 8d02h44m38s 17d23h13m33s
sixtrack 18m24s 11d14h34m06s 10d18h02m56s n/a
Table D.8: Times for x86_64 DBI
Benchmark domori pin qemu valgrind
perlbmk.makerand 1s 2s n/a 6s
gcc.integrate 5s 17s 1m21s 33s
gcc.expr 4s 18s 1m03s 38s
perlbmk.perfect 13s 53s n/a 2m30s
gzip.log 13s 15s 2m39s 2m56s
eon.rushmeier 18s 37s 5m47s 5m23s
eon.cook 22s 41s 7m33s 7m27s
gcc.166 23s 40s 1m51s 1m33s
gcc.scilab 23s 52s 3m42s 3m18s
gzip.random 24s 32s 4m47s 5m34s
gzip.source 29s 48s 5m58s 2m23s
gzip.graphic 29s 56s 7m12s 2m56s
eon.kajiya 34s 1m09s 10m16s 9m53s
gcc.200 40s 1m17s 6m27s 5m17s
bzip2.source 40s 53s 7m43s 3m35s
bzip2.program 41s 1m02s 8m51s 4m01s
bzip2.graphic 51s 1m15s 9m58s 4m35s
gzip.program 52s 1m18s 12m01s 4m36s
equake 1m09s 1m20s 8m09s 4m15s
crafty 1m12s 3m00s 17m23s 15m12s
gap 1m13s 2m22s 18m29s 11m25s
art.110 1m15s 1m31s 4m28s 4m02s
art.470 1m22s 1m43s 4m50s 4m09s
vpr.route 1m25s 1m36s 7m42s 6m33s
vpr.place 1m28s 1m39s 10m03s 23m35s
mesa 1m36s 2m09s 26m40s 1h07m54s
wupwise 1m40s 2m25s 25m10s 15m04s
mgrid 1m52s 1m55s 30m53s 13m21s
galgel 1m54s 2m02s 17m31s 10m21s
lucas 2m02s 2m42s 19m51s 25m37s
swim 2m30s 2m47s 17m47s 9m51s
applu 2m42s 2m51s 25m29s 13m40s
sixtrack 2m46s 2m28s 36m30s 15m26s
ammp 2m47s 2m55s 24m35s 12m52s
parser 3m08s 4m51s 24m56s 15m03s
twolf 3m26s 4m13s 32m42s 27m35s
apsi 3m33s 4m19s 26m03s 16m30s
facerec 3m56s 3m56s 17m14s 19m48s
mcf 5m09s 5m41s 6m46s 8m28s
fma3d 5m31s 5m55s 35m42s 16m08s
Table D.9: Times for x86_64 DBI utilities running cache simulations.
Benchmark domori pin cache qemu cache valgrind cache
perlbmk.makerand 1s 1m56s n/a 51s
gcc.expr 5s 11m20s 32m00s 3m56s
gcc.integrate 5s 11m53s 39m38s 3m41s
perlbmk.perfect 13s 35m38s n/a 21m11s
gzip.log 13s 26m21s 1h12m24s 9m27s
eon.rushmeier 18s 1h05m32s 3d00h03m24s 23m54s
eon.cook 23s 1h25m14s 4d18h35m47s 31m36s
gcc.166 23s 42m49s 3h08m00s 13m08s
gcc.scilab 24s 58m32s 2h56m21s 20m45s
gzip.random 24s 54m41s 2h16m35s 16m57s
gzip.graphic 30s 1h11m05s 2h56m45s 22m54s
gzip.source 31s 1h00m47s 2h43m03s 21m19s
eon.kajiya 35s 1h55m44s 5d17h29m44s 43m32s
bzip2.program 41s 1h34m03s 4h21m22s 31m45s
gcc.200 41s 1h40m16s 5h48m13s 34m47s
bzip2.source 42s 1h19m10s 3h43m54s 27m01s
bzip2.graphic 53s 1h51m39s 11h35m49s 36m09s
gzip.program 53s 2h00m24s 5h43m41s 42m57s
equake 1m11s 2h01m17s n/a 35m11s
crafty 1m12s 3h35m02s 6h32m49s 1h28m35s
art.110 1m12s 1h10m26s 1d21h39m24s 19m02s
gap 1m14s 4h54m09s 17h30m24s 1h33m39s
art.470 1m16s 1h17m53s 2d00h29m12s 20m47s
vpr.route 1m26s 1h44m57s 4h00m26s 30m12s
vpr.place 1m30s 2h52m41s 17h15m57s 1h02m51s
wupwise 1m46s 10h23m02s n/a 2h12m48s
galgel 1m58s 5h55m11s n/a 1h24m49s
mesa 1m58s 5h38m14s 15h38m15s 2h54m54s
lucas 2m09s 4h33m53s n/a 1h20m45s
ammp 2m52s 6h26m20s n/a 1h48m44s
parser 3m18s 6h16m05s 18h18m15s 1h59m23s
applu 3m21s 7h52m02s n/a 2h01m57s
twolf 3m28s 10h54m13s 23h36m00s 2h37m57s
mgrid 3m49s 7h26m15s n/a 1h45m15s
swim 3m51s 6h06m45s n/a 1h05m14s
facerec 4m10s 5h06m00s 10h59m14s 1h31m52s
apsi 4m47s 7h55m53s n/a 2h09m19s
mcf 5m24s 1h17m15s 4h32m23s 30m55s
fma3d 5m35s 7h03m39s n/a 2h15m55s




Program phases can vary even among implementations of the same archi-

























[Figure E.1 panels, 739 intervals each: Core Duo, Atom N270, Pentium D, Pentium 4, itanium (x86), Athlon64 XP, Athlon MP, Pentium III, Pentium II, Pentium Pro (Avg cpi = 0.98).]
Figure E.1: CPI phase plot for gzip.graph (INT, C, Compression)



















[Figure E.2 panels, 293 intervals each: Core Duo, Atom N270, Pentium D, Pentium 4, itanium (x86), Athlon64 XP, Athlon MP, Pentium III, Pentium II, Pentium Pro (Avg cpi = 1.02).]
Figure E.2: CPI phase plot for gzip.log (INT, C, Compression)





















[Figure E.3 panels, 1055 intervals each: Core Duo, Atom N270, Pentium D, Pentium 4, itanium (x86), Athlon64 XP, Athlon MP, Pentium III, Pentium II, Pentium Pro (Avg cpi = 0.99).]
Figure E.3: CPI phase plot for gzip.prog (INT, C, Compression)





















[Figure E.4 panels, 603 intervals each: Core Duo, Atom N270, Pentium D, Pentium 4, itanium (x86), Athlon64 XP, Athlon MP, Pentium III, Pentium II, Pentium Pro (Avg cpi = 0.92).]
Figure E.4: CPI phase plot for gzip.rand (INT, C, Compression)



















I Core Duo 
Intervals = 560







I Atom N270 
Intervals = 560







I Pentium D 
Intervals = 560







I Pentium 4 
Intervals = 560







I itanium (x86) 
Intervals = 560
















I Athlon64 XP 
Intervals = 560







I Athlon MP 
Intervals = 560







I Pentium III 
Intervals = 560







I Pentium II 
Intervals = 560







I Pentium Pro 
Intervals = 560
Avg cpi = 1.03
Figure E.5: CPI phase plot for gzip.src (INT, C, Compression)
[Plot panels and interval counts: Core Duo = 5021; Atom N270 = 5021; Pentium D = 5021; Pentium 4 = 5021; itanium (x86) = 5022; Athlon64 XP = 5021; Athlon MP = 5021; Pentium III = 5022; Pentium II = 5022; Pentium Pro = 5021. Avg cpi = 0.94]
Figure E.6: CPI phase plot for wupwise (FP, F77, Quantum Chromodynamics)
[Plot panels and interval counts: Core Duo = 3011; Atom N270 = 3011; Pentium D = 3011; Pentium 4 = 3011; itanium (x86) = 3011; Athlon64 XP = 3011; Athlon MP = 3011; Pentium III = 3011; Pentium II = 3011; Pentium Pro = 3011. Avg cpi = 3.16]
Figure E.7: CPI phase plot for swim (FP, F77, Meteorology/Water)
[Plot panels and interval counts: Core Duo = 5026; Atom N270 = 5026; Pentium D = 5026; Pentium 4 = 5026; itanium (x86) = 5026; Athlon64 XP = 5026; Athlon MP = 5026; Pentium III = 5026; Pentium II = 5026; Pentium Pro = 5026. Avg cpi = 1.95]
Figure E.8: CPI phase plot for mgrid (FP, F77, Multi-Grid Solver)
[Plot panels and interval counts: Core Duo = 5544; Atom N270 = 5544; Pentium D = 5544; Pentium 4 = 5544; itanium (x86) = 5545; Athlon64 XP = 5544; Athlon MP = 5544; Pentium III = 5545; Pentium II = 5545; Pentium Pro = 5544. Avg cpi = 1.69]
Figure E.9: CPI phase plot for applu (FP, F77, Fluid Dynamics)
[Plot panels and interval counts: Core Duo = 1102; Atom N270 = 1102; Pentium D = 1106; Pentium 4 = 1106; itanium (x86) = 1102; Athlon64 XP = 1102; Athlon MP = 1102; Pentium III = 1102; Pentium II = 1102; Pentium Pro = 1102. Avg cpi = 1.49]
Figure E.10: CPI phase plot for vpr.place (INT, C, FPGA Place/Route)
[Plot panels and interval counts: Core Duo = 934; Atom N270 = 934; Pentium D = 934; Pentium 4 = 934; itanium (x86) = 934; Athlon64 XP = 934; Athlon MP = 934; Pentium III = 934; Pentium II = 934; Pentium Pro = 934. Avg cpi = 1.56]
Figure E.11: CPI phase plot for vpr.route (INT, C, FPGA Place/Route)
[Plot panels and interval counts: Core Duo = 223; Atom N270 = 223; Pentium D = 223; Pentium 4 = 223; itanium (x86) = 223; Athlon64 XP = 223; Athlon MP = 223; Pentium III = 223; Pentium II = 223; Pentium Pro = 223. Avg cpi = 3.85]
Figure E.12: CPI phase plot for gcc.166 (INT, C, C Compiler)
[Plot panels and interval counts: Core Duo = 726; Atom N270 = 726; Pentium D = 726; Pentium 4 = 726; itanium (x86) = 726; Athlon64 XP = 726; Athlon MP = 726; Pentium III = 726; Pentium II = 726; Pentium Pro = 726. Avg cpi = 1.68]
Figure E.13: CPI phase plot for gcc.200 (INT, C, C Compiler)
[Plot panels and interval counts: Core Duo = 72; Atom N270 = 72; Pentium D = 72; Pentium 4 = 72; itanium (x86) = 72; Athlon64 XP = 72; Athlon MP = 72; Pentium III = 72; Pentium II = 72; Pentium Pro = 72. Avg cpi = 1.61]
Figure E.14: CPI phase plot for gcc.expr (INT, C, C Compiler)
[Plot panels and interval counts: Core Duo = 72; Atom N270 = 72; Pentium D = 72; Pentium 4 = 72; itanium (x86) = 72; Athlon64 XP = 72; Athlon MP = 72; Pentium III = 72; Pentium II = 72; Pentium Pro = 72. Avg cpi = 1.89]
Figure E.15: CPI phase plot for gcc.int (INT, C, C Compiler)
[Plot panels and interval counts: Core Duo = 391; Atom N270 = 391; Pentium D = 391; Pentium 4 = 391; itanium (x86) = 391; Athlon64 XP = 391; Athlon MP = 391; Pentium III = 391; Pentium II = 391; Pentium Pro = 391. Avg cpi = 1.80]
Figure E.16: CPI phase plot for gcc.sci (INT, C, C Compiler)
[Plot panels and interval counts: Core Duo = 2829; Atom N270 = 2829; Pentium D = 2898; Pentium 4 = 2897; itanium (x86) = 2829; Athlon64 XP = 2829; Athlon MP = 2829; Pentium III = 2829; Pentium II = 2829; Pentium Pro = 2829. Avg cpi = 1.62]
Figure E.17: CPI phase plot for mesa (FP, C, 3D-graphics)
[Plot panels and interval counts: Core Duo = 3707; Atom N270 = 3707; Pentium D = 3706; Pentium 4 = 3706; itanium (x86) = 3707; Athlon64 XP = 3707; Athlon MP = 3707; Pentium III = 3707; Pentium II = 3707; Pentium Pro = 3707. Avg cpi = 1.97]
Figure E.18: CPI phase plot for galgel (FP, F90, Fluid Dynamics)
[Plot panels and interval counts: Core Duo = 562; Atom N270 = 562; Pentium D = 562; Pentium 4 = 562; itanium (x86) = 562; Athlon64 XP = 562; Athlon MP = 562; Pentium III = 562; Pentium II = 562; Pentium Pro = 562. Avg cpi = 6.25]
Figure E.19: CPI phase plot for art.110 (FP, C, Neural Networks)
[Plot panels and interval counts: Core Duo = 619; Atom N270 = 619; Pentium D = 619; Pentium 4 = 619; itanium (x86) = 619; Athlon64 XP = 619; Athlon MP = 619; Pentium III = 619; Pentium II = 619; Pentium Pro = 619. Avg cpi = 6.74]
Figure E.20: CPI phase plot for art.470 (FP, C, Neural Networks)
[Plot panels and interval counts: Core Duo = 693; Atom N270 = 693; Pentium D = 693; Pentium 4 = 693; itanium (x86) = 693; Athlon64 XP = 693; Athlon MP = 693; Pentium III = 693; Pentium II = 693; Pentium Pro = 693. Avg cpi = 5.86]

[Plot panels and interval counts: Core Duo = 1449; Atom N270 = 1449; Pentium D = 1449; Pentium 4 = 1449; itanium (x86) = 1449; Athlon64 XP = 1449; Athlon MP = 1449; Pentium III = 1449; Pentium II = 1449; Pentium Pro = 1449. Avg cpi = 2.82]
Figure E.22: CPI phase plot for equake (FP, C, Seismic Propagation)
[Plot panels and interval counts: Core Duo = 2156; Atom N270 = 2156; Pentium D = 2156; Pentium 4 = 2156; itanium (x86) = 2156; Athlon64 XP = 2156; Athlon MP = 2156; Pentium III = 2156; Pentium II = 2156; Pentium Pro = 2156. Avg cpi = 1.25]
Figure E.23: CPI phase plot for crafty (INT, C, Chess)
[Plot panels and interval counts: Core Duo = 3098; Atom N270 = 3101; Pentium D = 3097; Pentium 4 = 3096; itanium (x86) = 3098; Athlon64 XP = 3102; Athlon MP = 3098; Pentium III = 3099; Pentium II = 3099; Pentium Pro = 3098. Avg cpi = 1.71]
Figure E.24: CPI phase plot for facerec (FP, F90, Facial Recognition)
[Plot panels and interval counts: Core Duo = 3331; Atom N270 = 3331; Pentium D = 3331; Pentium 4 = 3331; itanium (x86) = 3331; Athlon64 XP = 3331; Athlon MP = 3331; Pentium III = 3331; Pentium II = 3331; Pentium Pro = 3331. Avg cpi = 2.78]
Figure E.25: CPI phase plot for ammp (FP, C, Chemistry)
[Plot panels and interval counts: Core Duo = 2991; Atom N270 = 2991; Pentium D = 2991; Pentium 4 = 2991; itanium (x86) = 2991; Athlon64 XP = 2991; Athlon MP = 2990; Pentium III = 2991; Pentium II = 2991; Pentium Pro = 2991. Avg cpi = 1.72]
Figure E.26: CPI phase plot for lucas (FP, F90, Number Theory)
[Plot panels and interval counts: Core Duo = 3209; Atom N270 = 3210; Pentium D = 3209; Pentium 4 = 3208; itanium (x86) = 3209; Athlon64 XP = 3212; Athlon MP = 3209; Pentium III = 3209; Pentium II = 3209; Pentium Pro = 3209. Avg cpi = 2.30]
Figure E.27: CPI phase plot for fma3d (FP, F90, Crash Simulation)
[Plot panels and interval counts: Core Duo = 3721; Atom N270 = 3720; Pentium D = 3721; Pentium 4 = 3720; itanium (x86) = 3720; Athlon64 XP = 3720; Athlon MP = 3721; Pentium III = 3721; Pentium II = 3720; Pentium Pro = 3721. Avg cpi = 1.33]
Figure E.28: CPI phase plot for parser (INT, C, Word Processing)
[Plot panels and interval counts: Core Duo = 9071; Atom N270 = 9071; Pentium D = 9069; Pentium 4 = 9070; itanium (x86) = 9072; Athlon64 XP = 9070; Athlon MP = 9070; Pentium III = 9072; Pentium II = 9072; Pentium Pro = 9071. Avg cpi = 0.72]
Figure E.29: CPI phase plot for sixtrack (FP, F77, Nuclear Physics)
[Plot panels and interval counts: Core Duo = 851; Atom N270 = 851; Pentium D = 852; Pentium 4 = 852; itanium (x86) = 851; Athlon64 XP = 851; Athlon MP = 851; Pentium III = 851; Pentium II = 851; Pentium Pro = 851. Avg cpi = 1.50]
Figure E.30: CPI phase plot for eon.cook (INT, C++, Computer Graphics)
[Plot panels and interval counts: Core Duo = 1093; Atom N270 = 1093; Pentium D = 1095; Pentium 4 = 1094; itanium (x86) = 1093; Athlon64 XP = 1093; Athlon MP = 1093; Pentium III = 1093; Pentium II = 1093; Pentium Pro = 1093. Avg cpi = 1.50]
Figure E.31: CPI phase plot for eon.kaj (INT, C++, Computer Graphics)
[Plot panels and interval counts: Core Duo = 629; Atom N270 = 629; Pentium D = 630; Pentium 4 = 630; itanium (x86) = 629; Athlon64 XP = 629; Athlon MP = 629; Pentium III = 629; Pentium II = 629; Pentium Pro = 629. Avg cpi = 1.41]
Figure E.32: CPI phase plot for eon.rush (INT, C++, Computer Graphics)
[Plot panels and interval counts: Core Duo = 545; Atom N270 = 545; Pentium D = 545; Pentium 4 = 545; itanium (x86) = 545; Athlon64 XP = 545; Athlon MP = 545; Pentium III = 545; Pentium II = 545; Pentium Pro = 545. Avg cpi = 0.86]
Figure E.33: CPI phase plot for perlbmk.535 (INT, C, Scripting Language)
[Plot panels and interval counts: Core Duo = 577; Atom N270 = 577; Pentium D = 577; Pentium 4 = 577; itanium (x86) = 577; Athlon64 XP = 577; Athlon MP = 577; Pentium III = 577; Pentium II = 577; Pentium Pro = 577. Avg cpi = 0.89]
Figure E.34: CPI phase plot for perlbmk.704 (INT, C, Scripting Language)
[Plot panels and interval counts: Core Duo = 1107; Atom N270 = 1107; Pentium D = 1107; Pentium 4 = 1107; itanium (x86) = 1107; Athlon64 XP = 1107; Athlon MP = 1107; Pentium III = 1107; Pentium II = 1107; Pentium Pro = 1107. Avg cpi = 0.84]
Figure E.35: CPI phase plot for perlbmk.850 (INT, C, Scripting Language)
[Plot panels and interval counts: Core Duo = 957; Atom N270 = 957; Pentium D = 957; Pentium 4 = 957; itanium (x86) = 957; Athlon64 XP = 957; Athlon MP = 957; Pentium III = 957; Pentium II = 957; Pentium Pro = 957. Avg cpi = 0.88]
Figure E.36: CPI phase plot for perlbmk.957 (INT, C, Scripting Language)
[Plot panels and interval counts: Core Duo = 328; Atom N270 = 328; Pentium D = 328; Pentium 4 = 328; itanium (x86) = 328; Athlon64 XP = 328; Athlon MP = 328; Pentium III = 328; Pentium II = 328; Pentium Pro = 328. Avg cpi = 1.19]

[Plot panels and interval counts: Core Duo = 12; Atom N270 = 12; Pentium D = 12; Pentium 4 = 12; itanium (x86) = 12; Athlon64 XP = 12; Athlon MP = 12; Pentium III = 12; Pentium II = 12; Pentium Pro = 12. Avg cpi = 1.46]
Figure E.38: CPI phase plot for perlbmk.mkrnd (INT, C, Scripting Language)
[Plot panels and interval counts: Core Duo = 213; Atom N270 = 213; Pentium D = 213; Pentium 4 = 213; itanium (x86) = 213; Athlon64 XP = 213; Athlon MP = 213; Pentium III = 213; Pentium II = 213; Pentium Pro = 213. Avg cpi = 1.56]
Figure E.39: CPI phase plot for perlbmk.perf (INT, C, Scripting Language)
[Plot panels and interval counts: Core Duo = 2216; Atom N270 = 2216; Pentium D = 2216; Pentium 4 = 2215; itanium (x86) = 2216; Athlon64 XP = 2216; Athlon MP = 2216; Pentium III = 2214; Pentium II = 2216; Pentium Pro = 2216. Avg cpi = 1.20]

[Plot panels and interval counts: Core Duo = 1443; Pentium D = 1443; Pentium 4 = 1443; itanium (x86) = 1443; Athlon64 XP = 1443; Athlon MP = 1443; Pentium III = 1443; Pentium II = 1443; Pentium Pro = 1443. Avg cpi = 1.03]

[Plot panels and interval counts: Core Duo = 1625; Pentium D = 1625; Pentium 4 = 1625; itanium (x86) = 1625; Athlon64 XP = 1625; Athlon MP = 1625; Pentium III = 1625; Pentium II = 1625; Pentium Pro = 1625. Avg cpi = 1.03]

[Plot panels and interval counts: Core Duo = 1608; Pentium D = 1608; Pentium 4 = 1608; itanium (x86) = 1608; Athlon64 XP = 1608; Athlon MP = 1608; Pentium III = 1608; Pentium II = 1608; Pentium Pro = 1608. Avg cpi = 1.04]
Figure E.43: CPI phase plot for vortex.3 (INT, C, Database)
[Plot panels and interval counts: Core Duo = 1175; Atom N270 = 1175; Pentium D = 1175; Pentium 4 = 1175; itanium (x86) = 1175; Athlon64 XP = 1175; Athlon MP = 1175; Pentium III = 1175; Pentium II = 1175; Pentium Pro = 1175. Avg cpi = 1.12]
Figure E.44: CPI phase plot for bzip2.graph (INT, C, Compression)
[Plot panels and interval counts: Core Duo = 1032; Atom N270 = 1032; Pentium D = 1032; Pentium 4 = 1032; itanium (x86) = 1032; Athlon64 XP = 1032; Athlon MP = 1032; Pentium III = 1032; Pentium II = 1032; Pentium Pro = 1032. Avg cpi = 1.33]
Figure E.45: CPI phase plot for bzip2.prog (INT, C, Compression)
[Plot panels and interval counts: Core Duo = 866; Atom N270 = 866; Pentium D = 866; Pentium 4 = 866; itanium (x86) = 866; Athlon64 XP = 866; Athlon MP = 866; Pentium III = 866; Pentium II = 866; Pentium Pro = 866. Avg cpi = 1.55]
Figure E.46: CPI phase plot for bzip2.src (INT, C, Compression)
[Plot panels and interval counts: Core Duo = 3118; Atom N270 = 3118; Pentium D = 3122; Pentium 4 = 3122; itanium (x86) = 3118; Athlon64 XP = 3118; Athlon MP = 3118; Pentium III = 3118; Pentium II = 3118; Pentium Pro = 3118. Avg cpi = 1.77]
Figure E.47: CPI phase plot for twolf (INT, C, Place/Route)
[Plot panels and interval counts: Core Duo = 6485; Atom N270 = 6486; Pentium D = 6485; Pentium 4 = 6485; itanium (x86) = 6486; Athlon64 XP = 6485; Athlon MP = 6485; Pentium III = 6486; Pentium II = 6486; Pentium Pro = 6485. Avg cpi = 1.53]
Figure E.48: CPI phase plot for apsi (FP, F77, Meteorology/Pollution)
E.2 64-bit x86_64
Due to issues inherent in the benchmarks themselves, there are no plots for the
vortex benchmarks, nor for perlbmk.535, perlbmk.704, perlbmk.850,
perlbmk.957, or perlbmk.diffmail.
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 393; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 7; Phenom Full = 657; Pentium D Full = 661; Core2 Full = 659. Avg cpi = 0.76]
Figure E.49: CPI phase plot for gzip.graph (INT, C, Compression)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 393; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 7; Phenom Full = 274; Pentium D Full = 277; Core2 Full = 276. Avg cpi = 0.81]

[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 393; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 3; 10 SimPoints (Phenom) = 5; Phenom Full = 1339; Pentium D Full = 1343; Core2 Full = 1341. Avg cpi = 0.70]
Figure E.51: CPI phase plot for gzip.prog (INT, C, Compression)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 393; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 4; Phenom Full = 503; Pentium D Full = 507; Core2 Full = 505. Avg cpi = 0.77]
Figure E.52: CPI phase plot for gzip.rnd (INT, C, Compression)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 393; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 7; Phenom Full = 633; Pentium D Full = 636; Core2 Full = 635. Avg cpi = 0.79]
Figure E.53: CPI phase plot for gzip.src (INT, C, Compression)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 530; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 7; Phenom Full = 3604; Pentium D Full = 3605; Core2 Full = 3605. Avg cpi = 0.50]
Figure E.54: CPI phase plot for wupwise (FP, F77, Quantum Chromodynamics)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 78; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 9; Phenom Full = 2111; Pentium D Full = 2111; Core2 Full = 2111. Avg cpi = 1.66]
Figure E.55: CPI phase plot for swim (FP, F77, Meteorology/Water)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 169; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 8; Phenom Full = 3179; Pentium D Full = 3178; Core2 Full = 3178. Avg cpi = 0.69]
Figure E.56: CPI phase plot for mgrid (FP, F77, Multi-Grid Solver)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 127; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 9; Phenom Full = 3296; Pentium D Full = 3296; Core2 Full = 3296. Avg cpi = 1.08]
Figure E.57: CPI phase plot for applu (FP, F77, Fluid Dynamics)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 113; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 9; Phenom Full = 918; Pentium D Full = 918; Core2 Full = 918. Avg cpi = 1.24]
Figure E.58: CPI phase plot for vpr.place (INT, C, FPGA Place/Route)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 6; Phenom Full = 658; Pentium D Full = 658; Core2 Full = 658. Avg cpi = 1.60]
Figure E.59: CPI phase plot for vpr.route (INT, C, FPGA Place/Route)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 31; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 8; Phenom Full = 256; Pentium D Full = 262; Core2 Full = 260. Avg cpi = 1.11]
Figure E.60: CPI phase plot for gcc.166 (INT, C, C Compiler)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 31; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 10; Phenom Full = 686; Pentium D Full = 697; Core2 Full = 693. Avg cpi = 0.86]
Figure E.61: CPI phase plot for gcc.200 (INT, C, C Compiler)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 31; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 8; Phenom Full = 70; Pentium D Full = 73; Core2 Full = 72. Avg cpi = 0.90]
Figure E.62: CPI phase plot for gcc.expr (INT, C, C Compiler)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 31; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 9; Phenom Full = 74; Pentium D Full = 76; Core2 Full = 75. Avg cpi = 0.93]
Figure E.63: CPI phase plot for gcc.int (INT, C, C Compiler)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 31; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 9; Phenom Full = 381; Pentium D Full = 390; Core2 Full = 387. Avg cpi = 0.86]
Figure E.64: CPI phase plot for gcc.sci (INT, C, C Compiler)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 996; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 5; Phenom Full = 2251; Pentium D Full = 2251; Core2 Full = 2251. Avg cpi = 0.65]
Figure E.65: CPI phase plot for mesa (FP, C, 3D-graphics)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 256; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 8; Phenom Full = 2652; Pentium D Full = 2653; Core2 Full = 2653. Avg cpi = 0.67]
Figure E.66: CPI phase plot for galgel (FP, F90, Fluid Dynamics)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 69; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 3; 10 SimPoints (Phenom) = 4; Phenom Full = 376; Pentium D Full = 376. Avg cpi = 6.37]
Figure E.67: CPI phase plot for art.110 (FP, C, Neural Networks)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 69; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 3; 10 SimPoints (Phenom) = 5; Phenom Full = 418; Pentium D Full = 418. Avg cpi = 6.48]
Figure E.68: CPI phase plot for art.470 (FP, C, Neural Networks)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 69; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 8; Phenom Full = 471; Pentium D Full = 471; Core2 Full = 471. Avg cpi = 5.43]
Figure E.69: CPI phase plot for mcf (INT, C, Combinatorial Opt)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 158; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 2; 10 SimPoints (Phenom) = 7; Phenom Full = 918; Pentium D Full = 918; Core2 Full = 918. Avg cpi = 1.56]

[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 200; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 9; Phenom Full = 1404; Pentium D Full = 1404; Core2 Full = 1404. Avg cpi = 0.74]
Figure E.71: CPI phase plot for crafty (INT, C, Chess)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 468; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 7; Phenom Full = 2497; Pentium D Full = 2494; Core2 Full = 2494. Avg cpi = 0.85]
Figure E.72: CPI phase plot for facerec (FP, F90, Facial Recognition)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 387; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 4; 10 SimPoints (Phenom) = 7; Phenom Full = 2822; Pentium D Full = 2822; Core2 Full = 2822. Avg cpi = 1.03]
Figure E.73: CPI phase plot for ammp (FP, C, Chemistry)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 586; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 9; Phenom Full = 2056; Pentium D Full = 2056; Core2 Full = 2056. Avg cpi = 1.17]
Figure E.74: CPI phase plot for lucas (FP, F90, Number Theory)
[Plot panels and interval counts: Unguided FF (Phenom) = 1; Train (Phenom) = 2139; 1 SimPoint (Phenom) = 1; 5 SimPoints (Phenom) = 5; 10 SimPoints (Phenom) = 7; Phenom Full = 2531; Pentium D Full = 2526; Core2 Full = 2526. Avg cpi = 1.51]
Figure E.75: CPI phase plot for fma3d (FP, F90, Crash Simulation)
241

















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 66; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 2633; Pentium D Full 2631; Core2 Full 2634. Avg cpi = 1.25]
Figure E.76: CPI phase plot for parser (INT, C, Word Processing)

















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 1159; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 3; 10 SimPoints (Phenom) 3; Phenom Full 5427; Pentium D Full 5427; Core2 Full 5427. Avg cpi = 0.55]
Figure E.77: CPI phase plot for sixtrack (FP, F77, Nuclear Physics)















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 13; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 7; Phenom Full 594; Pentium D Full 594; Core2 Full 594. Avg cpi = 0.69]
Figure E.78: CPI phase plot for eon.cook (INT, C++, Computer Graphics)















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 70; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 795; Pentium D Full 795; Core2 Full 795. Avg cpi = 0.73]
Figure E.79: CPI phase plot for eon.kaj (INT, C++, Computer Graphics)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 6; Phenom Full 466; Pentium D Full 466; Core2 Full 466. Avg cpi = 0.67]
Figure E.80: CPI phase plot for eon.rush (INT, C++, Computer Graphics)















[Panels (intervals): Unguided FF (Phenom) 10; Train (Phenom) 184; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 3; 10 SimPoints (Phenom) 3; Phenom Full 10; Pentium D Full 10; Core2 Full 10. Avg cpi = 0.73]
Figure E.81: CPI phase plot for perlbmk.mkrnd (INT, C, Scripting Language)















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 127; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 5; Phenom Full 196; Pentium D Full 196; Core2 Full 196. Avg cpi = 1.11]
Figure E.82: CPI phase plot for perlbmk.perf (INT, C, Scripting Language)



















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 59; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 7; Phenom Full 1834; Pentium D Full 1834; Core2 Full 1834. Avg cpi = 0.71]
Figure E.83: CPI phase plot for gap (INT, C, Group Theory)



















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 448; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 1047; Pentium D Full 1047; Core2 Full 1047. Avg cpi = 0.68]
Figure E.84: CPI phase plot for bzip2.graph (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 448; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 921; Pentium D Full 921; Core2 Full 921. Avg cpi = 0.69]
Figure E.85: CPI phase plot for bzip2.prog (INT, C, Compression)

















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 448; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 757; Pentium D Full 757; Core2 Full 757. Avg cpi = 0.76]
Figure E.86: CPI phase plot for bzip2.src (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 106; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 8; Phenom Full 2943; Pentium D Full 2943; Core2 Full 2943. Avg cpi = 1.01]
Figure E.87: CPI phase plot for twolf (INT, C, Place/Route)

















[Panels (intervals): Unguided FF (Phenom) 1; Train (Phenom) 113; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 3359; Pentium D Full 3359; Core2 Full 3359. Avg cpi = 1.12]




Program phases are often similar across architectures; the overall work being
done is the same even though the platforms involved have different underlying
microarchitectures. Despite the gross similarities, the cycle and retired
instruction counts differ enough that SimPoint intervals gathered for one
architecture cannot be used to analyze executions on another. Below are phase
plots for the cycles per instruction (CPI) metric for the SPEC CPU2000
benchmarks, running on the MIPS, ia64, x86, and x86_64 platforms. Some of the
plots are missing; this is due to the performance counter results being
unavailable for
































Avg CPI = 0.75
Figure F.1: Multi-arch CPI plot for gzip.graph (INT, C, Compression)































Avg CPI = 0.69
Figure F.2: Multi-arch CPI plot for gzip.log (INT, C, Compression)































Avg CPI = 0.64
Figure F.3: Multi-arch CPI plot for gzip.prog (INT, C, Compression)































Avg CPI = 0.75
Figure F.4: Multi-arch CPI plot for gzip.rand (INT, C, Compression)































Avg CPI = 0.68
Figure F.5: Multi-arch CPI plot for gzip.src (INT, C, Compression)



































Avg CPI = 0.92
Figure F.6: Multi-arch CPI plot for wupwise (FP, F77, Quantum Chromodynamics)



































Avg CPI = 1.98
Figure F.7: Multi-arch CPI plot for swim (FP, F77, Meteorology/Water)































Avg CPI = 1.07
Figure F.8: Multi-arch CPI plot for mgrid (FP, F77, Multi-Grid Solver)



































Avg CPI = 1.30
Figure F.9: Multi-arch CPI plot for applu (FP, F77, Fluid Dynamics)



































Avg CPI = 1.13





































Avg CPI = 1.66
Figure F.11: Multi-arch CPI plot for vpr.route (INT, C, FPGA Place/Route)







































Avg CPI = 0.00
Figure F.12: Multi-arch CPI plot for gcc.166 (INT, C, C Compiler)



































Avg CPI = 0.00
Figure F.13: Multi-arch CPI plot for gcc.200 (INT, C, C Compiler)































Avg CPI = 0.98
Figure F.14: Multi-arch CPI plot for gcc.expr (INT, C, C Compiler)



































Avg CPI = 0.00
Figure F.15: Multi-arch CPI plot for gcc.integrate (INT, C, C Compiler)



































Avg CPI = 0.00
Figure F.16: Multi-arch CPI plot for gcc.scilab (INT, C, C Compiler)































Avg CPI = 1.09
Figure F.17: Multi-arch CPI plot for mesa (FP, C, 3D-graphics)











































Avg CPI = 1.38
Figure F.18: Multi-arch CPI plot for galgel (FP, F90, Fluid Dynamics)



































Avg CPI = 0.00
Figure F.19: Multi-arch CPI plot for art.110 (FP, C, Neural Networks)



































Avg CPI = 0.00








































Avg CPI = 4.26
Figure F.21: Multi-arch CPI plot for mcf (INT, C, Combinatorial Opt)



































Avg CPI = 1.62
Figure F.22: Multi-arch CPI plot for equake (FP, C, Seismic Propagation)











































Avg CPI = 0.93
Figure F.23: Multi-arch CPI plot for crafty (INT, C, Chess)































Avg CPI = 1.00
Figure F.24: Multi-arch CPI plot for facerec (FP, F90, Facial Recognition)



































Avg CPI = 1.28
Figure F.25: Multi-arch CPI plot for ammp (FP, C, Chemistry)











































Avg CPI = 1.31
Figure F.26: Multi-arch CPI plot for lucas (FP, F90, Number Theory)







































Avg CPI = 1.15
Figure F.27: Multi-arch CPI plot for fma3d (FP, F90, Crash Simulation)







































Avg CPI = 1.13
Figure F.28: Multi-arch CPI plot for parser (INT, C, Word Processing)



































Avg CPI = 0.67

































Avg CPI = 1.02
Figure F.30: Multi-arch CPI plot for eon.cook (INT, C++, Computer Graphics)































Avg CPI = 1.02
Figure F.31: Multi-arch CPI plot for eon.kajiya (INT, C++, Computer Graphics)































Avg CPI = 1.01
Figure F.32: Multi-arch CPI plot for eon.rushmeier (INT, C++, Computer Graphics)































Avg CPI = 0.79
Figure F.33: Multi-arch CPI plot for perlbmk.535 (INT, C, Scripting Language)































Avg CPI = 0.81
Figure F.34: Multi-arch CPI plot for perlbmk.704 (INT, C, Scripting Language)































Avg CPI = 0.78


































Avg CPI = 0.81
Figure F.36: Multi-arch CPI plot for perlbmk.957 (INT, C, Scripting Language)































Avg CPI = 1.02
Figure F.37: Multi-arch CPI plot for perlbmk.diff (INT, C, Scripting Language)































Avg CPI = 1.14
Figure F.38: Multi-arch CPI plot for perlbmk.mkrnd (INT, C, Scripting)































Avg CPI = 1.24
Figure F.39: Multi-arch CPI plot for perlbmk.perf (INT, C, Scripting)



































Avg CPI = 1.04
































Avg CPI = 0.92
Figure F.41: Multi-arch CPI plot for vortex.1 (INT, C, Database)































Avg CPI = 0.89
Figure F.42: Multi-arch CPI plot for vortex.2 (INT, C, Database)































Avg CPI = 0.90
Figure F.43: Multi-arch CPI plot for vortex.3 (INT, C, Database)































Avg CPI = 0.87
Figure F.44: Multi-arch CPI plot for bzip2.graph (INT, C, Compression)































Avg CPI = 0.82
Figure F.45: Multi-arch CPI plot for bzip2.prog (INT, C, Compression)































Avg CPI = 0.88
Figure F.46: Multi-arch CPI plot for bzip2.src (INT, C, Compression)































Avg CPI = 1.29
Figure F.47: Multi-arch CPI plot for twolf (INT, C, Place/Route)



































Avg CPI = 1.44




L1 DATA CACHE ACCESSES PER INSTRUCTION PHASE PLOTS
When investigating memory subsystems, it is important that the simulation
method faithfully reproduce the processor’s memory access patterns. One way to
measure this is L1 data cache accesses per retired instruction (DPI). The
following figures contain phase plots showing data cache accesses per
instruction for SPEC CPU2000 on three actual x86_64 machines. Results from the
SimPoint, unguided fast-forwarding, and start-from-the-beginning reduced
execution methods are also shown. Simulator results for m5 and Valgrind are
included for comparison, as are results from the MIPS architecture.
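As a minimal sketch of how the DPI metric plotted here can be computed, the following Python assumes a hypothetical list of per-interval counter samples, each holding an L1 data cache access count and a retired instruction count. The function names and sample values are illustrative assumptions, not measurements or tooling from this work.

```python
from typing import List, Tuple

# Each sample is (l1_dcache_accesses, retired_instructions) for one interval.
Sample = Tuple[int, int]


def dpi_per_interval(samples: List[Sample]) -> List[float]:
    """Return L1 data cache accesses per retired instruction for each interval."""
    return [accesses / instructions for accesses, instructions in samples]


def overall_dpi(samples: List[Sample]) -> float:
    """Whole-run DPI: total accesses divided by total retired instructions."""
    total_accesses = sum(accesses for accesses, _ in samples)
    total_instructions = sum(instructions for _, instructions in samples)
    return total_accesses / total_instructions


# Hypothetical counter samples (assumed values, not data from the figures).
samples = [
    (50_000_000, 100_000_000),
    (40_000_000, 100_000_000),
    (45_000_000, 90_000_000),
]
phase = dpi_per_interval(samples)   # one point per interval of a phase plot
overall = overall_dpi(samples)      # single summary value for the run
```

Plotting `phase` against the interval index gives a curve of the kind shown in the figures; the whole-run value weights each interval by its instruction count rather than averaging the per-interval ratios.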















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 7; Phenom Full 657; Pentium D Full 661; Core2 Full 659. Avg dpi = 0.50]
Figure G.1: L1 dcache accesses per instruction plot for gzip.graph (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 7; Phenom Full 274; Pentium D Full 277; Core2 Full 276. Avg dpi = 0.43]
Figure G.2: L1 dcache accesses per instruction plot for gzip.log (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 3; 10 SimPoints (Phenom) 5; Phenom Full 1339; Pentium D Full 1342; Core2 Full 1341. Avg dpi = 0.44]
Figure G.3: L1 dcache accesses per instruction plot for gzip.prog (INT, C, Compression)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 4; Phenom Full 503; Pentium D Full 507; Core2 Full 505. Avg dpi = 0.48]
Figure G.4: L1 dcache accesses per instruction plot for gzip.rand (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 7; Phenom Full 633; Pentium D Full 636; Core2 Full 635. Avg dpi = 0.44]
Figure G.5: L1 dcache accesses per instruction plot for gzip.src (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 7; Phenom Full 3604; Pentium D Full 3605; Core2 Full 3605. Avg dpi = 0.31]
Figure G.6: L1 dcache accesses per instruction plot for wupwise (FP, F77, Quantum Chromodynamics)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 2111; Pentium D Full 2111; Core2 Full 2111. Avg dpi = 0.32]
Figure G.7: L1 dcache accesses per instruction plot for swim (FP, F77, Meteorology/Water)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 3179; Pentium D Full 3179; Core2 Full 3178. Avg dpi = 0.48]
Figure G.8: L1 dcache accesses per instruction plot for mgrid (FP, F77, Multi-Grid Solver)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 3296; Pentium D Full 3296; Core2 Full 3296. Avg dpi = 0.45]




















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 9; Phenom Full 918; Pentium D Full 918; Core2 Full 918. Avg dpi = 0.58]
Figure G.10: L1 dcache accesses per instruction plot for vpr.place (INT, C, FPGA Place/Route)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 6; Phenom Full 658; Pentium D Full 658; Core2 Full 658. Avg dpi = 0.68]
Figure G.11: L1 dcache accesses per instruction plot for vpr.route (INT, C, FPGA Place/Route)



















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 8; Phenom Full 256; Pentium D Full 262; Core2 Full 260. Avg dpi = 0.55]
Figure G.12: L1 dcache accesses per instruction plot for gcc.166 (INT, C, C Compiler)



















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 10; Phenom Full 686; Pentium D Full 697; Core2 Full 693. Avg dpi = 0.49]
Figure G.13: L1 dcache accesses per instruction plot for gcc.200 (INT, C, C Compiler)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 70; Pentium D Full 73; Core2 Full 72. Avg dpi = 0.51]
Figure G.14: L1 dcache accesses per instruction plot for gcc.expr (INT, C, C Compiler)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 9; Phenom Full 74; Pentium D Full 76; Core2 Full 75. Avg dpi = 0.52]
Figure G.15: L1 dcache accesses per instruction plot for gcc.int (INT, C, C Compiler)



















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 381; Pentium D Full 390; Core2 Full 387. Avg dpi = 0.51]
Figure G.16: L1 dcache accesses per instruction plot for gcc.sci (INT, C, C Compiler)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 5; Phenom Full 2251; Pentium D Full 2251; Core2 Full 2251. Avg dpi = 0.47]
Figure G.17: L1 dcache accesses per instruction plot for mesa (FP, C, 3D-graphics)



















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 2653; Pentium D Full 2652; Core2 Full 2652. Avg dpi = 0.45]
Figure G.18: L1 dcache accesses per instruction plot for galgel (FP, F90, Fluid Dynamics)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 3; 10 SimPoints (Phenom) 4; Phenom Full 376; Pentium D Full 376. Avg dpi = 0.32]
Figure G.19: L1 dcache accesses per instruction plot for art.110 (FP, C, Neural Networks)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 3; 10 SimPoints (Phenom) 5; Phenom Full 418; Pentium D Full 418. Avg dpi = 0.31]
Figure G.20: L1 dcache accesses per instruction plot for art.470 (FP, C, Neural Networks)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 8; Phenom Full 471; Pentium D Full 471; Core2 Full 471. Avg dpi = 0.55]
Figure G.21: L1 dcache accesses per instruction plot for mcf (INT, C, Combinatorial Opt)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 2; 10 SimPoints (Phenom) 7; Phenom Full 918; Pentium D Full 918; Core2 Full 918. Avg dpi = 0.60]


















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 9; Phenom Full 1404; Pentium D Full 1404; Core2 Full 1404. Avg dpi = 0.44]
Figure G.23: L1 dcache accesses per instruction plot for crafty (INT, C, Chess)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 7; Phenom Full 2497; Pentium D Full 2494; Core2 Full 2494. Avg dpi = 0.36]
Figure G.24: L1 dcache accesses per instruction plot for facerec (FP, F90, Facial Recognition)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 7; Phenom Full 2822; Pentium D Full 2822; Core2 Full 2822. Avg dpi = 0.43]
Figure G.25: L1 dcache accesses per instruction plot for ammp (FP, C, Chemistry)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 2056; Pentium D Full 2056; Core2 Full 2056. Avg dpi = 0.40]
Figure G.26: L1 dcache accesses per instruction plot for lucas (FP, F90, Number Theory)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 7; Phenom Full 2531; Pentium D Full 2526; Core2 Full 2526. Avg dpi = 0.56]
Figure G.27: L1 dcache accesses per instruction plot for fma3d (FP, F90, Crash Simulation)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 2632; Pentium D Full 2633; Core2 Full 2633. Avg dpi = 0.53]
Figure G.28: L1 dcache accesses per instruction plot for parser (INT, C, Word Processing)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 3; 10 SimPoints (Phenom) 3; Phenom Full 5427; Pentium D Full 5427; Core2 Full 5427. Avg dpi = 0.21]
Figure G.29: L1 dcache accesses per instruction plot for sixtrack (FP, F77, Nuclear Physics)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 7; Phenom Full 594; Pentium D Full 594; Core2 Full 594. Avg dpi = 0.44]


















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 795; Pentium D Full 795; Core2 Full 795. Avg dpi = 0.46]
Figure G.31: L1 dcache accesses per instruction plot for eon.kaj (INT, C++, Computer Graphics)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 6; Phenom Full 466; Pentium D Full 466; Core2 Full 466. Avg dpi = 0.43]
Figure G.32: L1 dcache accesses per instruction plot for eon.rush (INT, C++, Computer Graphics)

















[Panels (intervals): Unguided FF (Phenom) 10; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 3; 10 SimPoints (Phenom) 3; Phenom Full 10; Pentium D Full 10; Core2 Full 10. Avg dpi = 0.63]
Figure G.33: L1 dcache accesses per instruction plot for perlbmk.mkrnd (INT, C, Scripting Language)

















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 5; Phenom Full 196; Pentium D Full 196; Core2 Full 196. Avg dpi = 0.65]
Figure G.34: L1 dcache accesses per instruction plot for perlbmk.perf (INT, C, Scripting Language)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 7; Phenom Full 1834; Pentium D Full 1834; Core2 Full 1834. Avg dpi = 0.46]




















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 1047; Pentium D Full 1047; Core2 Full 1047. Avg dpi = 0.49]


















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 921; Pentium D Full 921; Core2 Full 921. Avg dpi = 0.48]
Figure G.37: L1 dcache accesses per instruction plot for bzip2.prog (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 9; Phenom Full 757; Pentium D Full 757; Core2 Full 757. Avg dpi = 0.49]
Figure G.38: L1 dcache accesses per instruction plot for bzip2.src (INT, C, Compression)















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 4; 10 SimPoints (Phenom) 8; Phenom Full 2943; Pentium D Full 2943; Core2 Full 2943. Avg dpi = 0.45]
Figure G.39: L1 dcache accesses per instruction plot for twolf (INT, C, Place/Route)



















[Panels (intervals): Unguided FF (Phenom) 1; 1 SimPoint (Phenom) 1; 5 SimPoints (Phenom) 5; 10 SimPoints (Phenom) 8; Phenom Full 3359; Pentium D Full 3359; Core2 Full 3359. Avg dpi = 0.43]




L1 DATA CACHE ACCESSES PER µOP PHASE PLOTS
On the x86 architecture, instructions are decoded into RISC-like µops before
execution. Each implementation of the architecture has a different set of µops,
making comparisons difficult. What follows are phase plots showing data cache
accesses per µop for SPEC CPU2000 on three different x86_64 machines, as well
as MIPS (to show how RISC-like the µops are).
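To illustrate how the µop sets differ between implementations, the sketch below estimates µops per instruction for gzip.graph from the interval counts listed for Figures G.1 and H.1. It rests on an assumption not stated in the text: that the per-instruction and per-µop plots use intervals of the same fixed length, so that the ratio of interval counts approximates the ratio of total µops to total instructions. The function name is hypothetical.

```python
def uops_per_instruction(instr_intervals: int, uop_intervals: int) -> float:
    """Approximate µop expansion factor from two interval counts.

    ASSUMPTION: both counts come from equal-size intervals, so their ratio
    approximates total µops divided by total retired instructions.
    """
    if instr_intervals <= 0:
        raise ValueError("instruction interval count must be positive")
    return uop_intervals / instr_intervals


# (instruction intervals, µop intervals) for gzip.graph, as listed for
# Figures G.1 and H.1.
counts = {
    "Phenom": (657, 672),
    "Pentium D": (661, 884),
    "Core2": (659, 734),
}

ratios = {
    machine: uops_per_instruction(instr, uops)
    for machine, (instr, uops) in counts.items()
}

for machine, ratio in ratios.items():
    print(f"{machine}: ~{ratio:.2f} µops per instruction")
```

Under that assumption the Phenom comes out near 1.02 µops per instruction, the Core2 near 1.11, and the Pentium D near 1.34, which is consistent with each implementation decoding the same instruction stream into a different number of µops.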















[Panels (intervals): Phenom Full 672; Pentium D Full 884; Core2 Full 734. Avg dpu = 0.45]
Figure H.1: L1 D$ accesses per µop for gzip.graph (INT, C, Compression)















[Panels (intervals): Phenom Full 279; Pentium D Full 346; Core2 Full 299. Avg dpu = 0.40]
Figure H.2: L1 D$ accesses per µop for gzip.log (INT, C, Compression)















[Panels (intervals): Phenom Full 1349; Pentium D Full 1669; Core2 Full 1476. Avg dpu = 0.40]
Figure H.3: L1 D$ accesses per µop for gzip.prog (INT, C, Compression)















[Panels (intervals): Phenom Full 512; Pentium D Full 660; Core2 Full 565. Avg dpu = 0.43]
Figure H.4: L1 D$ accesses per µop for gzip.rand (INT, C, Compression)















[Panels (intervals): Phenom Full 640; Pentium D Full 795; Core2 Full 695. Avg dpu = 0.40]
Figure H.5: L1 D$ accesses per µop for gzip.src (INT, C, Compression)















[Panels (intervals): Phenom Full 3665; Pentium D Full 4314; Core2 Full 3709. Avg dpu = 0.30]
Figure H.6: L1 D$ accesses per µop for wupwise (FP, F77, Quantum Chromodynamics)















[Panels (intervals): Phenom Full 2111; Pentium D Full 2463; Core2 Full 2112. Avg dpu = 0.32]
Figure H.7: L1 D$ accesses per µop for swim (FP, F77, Meteorology/Water)















[Panels (intervals): Phenom Full 3183; Pentium D Full 4107; Core2 Full 3186. Avg dpu = 0.48]
Figure H.8: L1 D$ accesses per µop for mgrid (FP, F77, Multi-Grid Solver)















[Panels (intervals): Phenom Full 3313; Pentium D Full 4441; Core2 Full 3368. Avg dpu = 0.44]

















[Panels (intervals): Phenom Full 1020; Pentium D Full 1487; Core2 Full 1121. Avg dpu = 0.48]
Figure H.10: L1 D$ accesses per µop for vpr.place (INT, C, FPGA Place/Route)

















[Panels (intervals): Phenom Full 670; Pentium D Full 910; Core2 Full 679. Avg dpu = 0.66]
Figure H.11: L1 D$ accesses per µop for vpr.route (INT, C, FPGA Place/Route)



















[Panels (intervals): Phenom Full 268; Pentium D Full 355; Core2 Full 306. Avg dpu = 0.47]
Figure H.12: L1 D$ accesses per µop for gcc.166 (INT, C, C Compiler)

















[Panels (intervals): Phenom Full 721; Pentium D Full 925; Core2 Full 761. Avg dpu = 0.45]
Figure H.13: L1 D$ accesses per µop for gcc.200 (INT, C, C Compiler)

















[Panels (intervals): Phenom Full 76; Pentium D Full 98; Core2 Full 82. Avg dpu = 0.45]
Figure H.14: L1 D$ accesses per µop for gcc.expr (INT, C, C Compiler)

















[Panels (intervals): Phenom Full 79; Pentium D Full 101; Core2 Full 86. Avg dpu = 0.45]
Figure H.15: L1 D$ accesses per µop for gcc.int (INT, C, C Compiler)

















[Figure panels: Phenom (Full Intervals = 407); Pentium D (Full Intervals = 522); Core2 (Full Intervals = 431). Avg dpu = 0.46]
Figure H.16: L1 D$ accesses per µop for gcc.sci (INT, C, C Compiler)















[Figure panels: Phenom (Full Intervals = 2540); Pentium D (Full Intervals = 3221); Core2 (Full Intervals = 2428). Avg dpu = 0.43]
Figure H.17: L1 D$ accesses per µop for mesa (FP, C, 3D-graphics)

















[Figure panels: Phenom (Full Intervals = 2748); Pentium D (Full Intervals = 3516); Core2 (Full Intervals = 2818). Avg dpu = 0.42]
Figure H.18: L1 D$ accesses per µop for galgel (FP, F90, Fluid Dynamics)

















[Figure panels: Phenom (Full Intervals = 0); Pentium D (Full Intervals = 454); Core2 (Full Intervals = 0). Avg dpu = 0.00]
Figure H.19: L1 D$ accesses per µop for art.110 (FP, C, Neural Networks)

















[Figure panels: Phenom (Full Intervals = 0); Pentium D (Full Intervals = 503); Core2 (Full Intervals = 0). Avg dpu = 0.00]
Figure H.20: L1 D$ accesses per µop for art.470 (FP, C, Neural Networks)

















[Figure panels: Phenom (Full Intervals = 472); Pentium D (Full Intervals = 554); Core2 (Full Intervals = 473). Avg dpu = 0.55]
Figure H.21: L1 D$ accesses per µop for mcf (INT, C, Combinatorial Opt)

















[Figure panels: Phenom (Full Intervals = 937); Pentium D (Full Intervals = 1199); Core2 (Full Intervals = 951). Avg dpu = 0.58]
Figure H.22: L1 D$ accesses per µop for equake (FP, C, Seismic Propagation)















[Figure panels: Phenom (Full Intervals = 1438); Pentium D (Full Intervals = 1871); Core2 (Full Intervals = 1513). Avg dpu = 0.41]
Figure H.23: L1 D$ accesses per µop for crafty (INT, C, Chess)















[Figure panels: Phenom (Full Intervals = 2702); Pentium D (Full Intervals = 4295); Core2 (Full Intervals = 2884). Avg dpu = 0.31]
Figure H.24: L1 D$ accesses per µop for facerec (FP, F90, Facial Recognition)















[Figure panels: Phenom (Full Intervals = 2832); Pentium D (Full Intervals = 3674); Core2 (Full Intervals = 2848). Avg dpu = 0.43]
Figure H.25: L1 D$ accesses per µop for ammp (FP, C, Chemistry)















[Figure panels: Phenom (Full Intervals = 2110); Pentium D (Full Intervals = 2625); Core2 (Full Intervals = 2077). Avg dpu = 0.40]
Figure H.26: L1 D$ accesses per µop for lucas (FP, F90, Number Theory)

















[Figure panels: Phenom (Full Intervals = 2918); Pentium D (Full Intervals = 5475); Core2 (Full Intervals = 3199). Avg dpu = 0.43]
Figure H.27: L1 D$ accesses per µop for fma3d (FP, F90, Crash Simulation)















[Figure panels: Phenom (Full Intervals = 4465); Pentium D (Full Intervals = 4200); Core2 (Full Intervals = 3428). Avg dpu = 0.41]
Figure H.28: L1 D$ accesses per µop for parser (INT, C, Word Processing)















[Figure panels: Phenom (Full Intervals = 5430); Pentium D (Full Intervals = 6025); Core2 (Full Intervals = 5433). Avg dpu = 0.21]
Figure H.29: L1 D$ accesses per µop for sixtrack (FP, F77, Nuclear Physics)















[Figure panels: Phenom (Full Intervals = 639); Pentium D (Full Intervals = 862); Core2 (Full Intervals = 665). Avg dpu = 0.40]

















[Figure panels: Phenom (Full Intervals = 855); Pentium D (Full Intervals = 1154); Core2 (Full Intervals = 892). Avg dpu = 0.41]
Figure H.31: L1 D$ accesses per µop for eon.kaj (INT, C++, Computer Graphics)















[Figure panels: Phenom (Full Intervals = 499); Pentium D (Full Intervals = 670); Core2 (Full Intervals = 517). Avg dpu = 0.38]
Figure H.32: L1 D$ accesses per µop for eon.rush (INT, C++, Computer Graphics)















[Figure panels: Phenom (Full Intervals = 15); Pentium D (Full Intervals = 17); Core2 (Full Intervals = 12). Avg dpu = 0.52]
Figure H.33: L1 D$ accesses per µop for perlbmk.mkrnd (INT, C, Scripting Language)

















[Figure panels: Phenom (Full Intervals = 217); Pentium D (Full Intervals = 287); Core2 (Full Intervals = 228). Avg dpu = 0.56]
Figure H.34: L1 D$ accesses per µop for perlbmk.perf (INT, C, Scripting Language)















[Figure panels: Phenom (Full Intervals = 2026); Pentium D (Full Intervals = 2498); Core2 (Full Intervals = 2008). Avg dpu = 0.42]



















[Figure panels: Phenom (Full Intervals = 1063); Pentium D (Full Intervals = 1321); Core2 (Full Intervals = 1112). Avg dpu = 0.46]

















[Figure panels: Phenom (Full Intervals = 935); Pentium D (Full Intervals = 1157); Core2 (Full Intervals = 977). Avg dpu = 0.45]
Figure H.37: L1 D$ accesses per µop for bzip2.prog (INT, C, Compression)















[Figure panels: Phenom (Full Intervals = 769); Pentium D (Full Intervals = 972); Core2 (Full Intervals = 811). Avg dpu = 0.46]
Figure H.38: L1 D$ accesses per µop for bzip2.src (INT, C, Compression)















[Figure panels: Phenom (Full Intervals = 4969); Pentium D (Full Intervals = 4287); Core2 (Full Intervals = 3187). Avg dpu = 0.41]
Figure H.39: L1 D$ accesses per µop for twolf (INT, C, Place/Route)



















[Figure panels: Phenom (Full Intervals = 3526); Pentium D (Full Intervals = 4229); Core2 (Full Intervals = 3546). Avg dpu = 0.41]




VALGRIND EXP-BBV TOOL CODE LISTING
Here is the BBV generating plugin for Valgrind, as found in Valgrind 3.5.0.
//--------------------------------------------------------------------*/
//--- BBV: a SimPoint basic block vector generator      bbv_main.c ---*/
//--------------------------------------------------------------------*/

/*
   This file is part of BBV, a Valgrind tool for generating SimPoint
   basic block vectors.

   Copyright (C) 2006-2009 Vince Weaver
      vince _at_ csl.cornell.edu

   pcfile code is Copyright (C) 2006-2009 Oriol Prat
      oriol.prat _at_ bsc.es

   This program is free software; you can redistribute it and/or
   modify it under the terms of the GNU General Public License as
   published by the Free Software Foundation; either version 2 of the
   License, or (at your option) any later version.

   This program is distributed in the hope that it will be useful, but
   WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307, USA.

   The GNU General Public License is contained in the file COPYING.
*/

#include "pub_tool_basics.h"
#include "pub_tool_tooliface.h"
#include "pub_tool_options.h"    /* command line options */
#include "pub_tool_vki.h"        /* vki_stat */
#include "pub_tool_libcbase.h"   /* VG_(strlen) */
#include "pub_tool_libcfile.h"   /* VG_(write) */
#include "pub_tool_libcprint.h"  /* VG_(printf) */
#include "pub_tool_libcassert.h" /* VG_(exit) */
#include "pub_tool_mallocfree.h" /* plain_free */
#include "pub_tool_machine.h"    /* VG_(fnptr_to_fnentry) */
#include "pub_tool_debuginfo.h"  /* VG_(get_fnname) */
#include "pub_tool_oset.h"       /* ordered set stuff */

/* instruction special cases */
#define REP_INSTRUCTION   0x1
#define FLDCW_INSTRUCTION 0x2

/* interval variables */
#define DEFAULT_GRAIN_SIZE 100000000  /* 100 million by default */
static Int interval_size=DEFAULT_GRAIN_SIZE;

/* filenames */
static UChar *clo_bb_out_file="bb.out.%p";
static UChar *clo_pc_out_file="pc.out.%p";
static UChar *pc_out_file=NULL;
static UChar *bb_out_file=NULL;

/* output parameters */
static Bool instr_count_only=False;
static Bool generate_pc_file=False;

/* write buffer */
static UChar buf[1024];

/* Global values */
static OSet* instr_info_table;  /* table that holds the basic block info */
static Int block_num=1;         /* global next block number */
static Int current_thread=0;
static Int allocated_threads=1;
struct thread_info *bbv_thread=NULL;

/* Per-thread variables */
struct thread_info {
   ULong dyn_instr;         /* Current retired instruction count */
   ULong total_instr;       /* Total retired instruction count   */
   Addr last_rep_addr;      /* rep counting values */
   ULong rep_count;
   ULong global_rep_count;
   ULong unique_rep_count;
   ULong fldcw_count;       /* fldcw count */
   Int bb_trace_fd;         /* file descriptor */
};

#define FUNCTION_NAME_LENGTH 20

struct BB_info {
   Addr BB_addr;            /* used as key, must be first */
   Int n_instrs;            /* instructions in the basic block */
   Int block_num;           /* unique block identifier */
   Int *inst_counter;       /* times entered * num_instructions */
   Bool is_entry;           /* is this block a function entry point */
   UChar fn_name[FUNCTION_NAME_LENGTH];  /* Function block is in */
};

/* dump the optional PC file, which contains basic block number to */
/*   instruction address and function name mappings                */
static void dumpPcFile(void)
{
   struct BB_info *bb_elem;
   Int pctrace_fd;
   SysRes sres;

   pc_out_file =
      VG_(expand_file_name)("--pc-out-file", clo_pc_out_file);

   sres = VG_(open)(pc_out_file, VKI_O_CREAT|VKI_O_TRUNC|VKI_O_WRONLY,
                    VKI_S_IRUSR|VKI_S_IWUSR|VKI_S_IRGRP|VKI_S_IWGRP);
   if (sr_isError(sres)) {
      VG_(umsg)("Error: cannot create pc file %s\n", pc_out_file);
      VG_(exit)(1);
   } else {
      pctrace_fd = sr_Res(sres);
   }

      /* Loop through the table, printing the number, address, */
      /*    and function name for each basic block             */
   VG_(OSetGen_ResetIter)(instr_info_table);
   while ( (bb_elem = VG_(OSetGen_Next)(instr_info_table)) ) {
      VG_(write)(pctrace_fd, "F", 1);
      VG_(sprintf)( buf, ":%d:%x:%s\n",
                    bb_elem->block_num,
                    (Int)bb_elem->BB_addr,
                    bb_elem->fn_name);
      VG_(write)(pctrace_fd, (void*)buf, VG_(strlen)(buf));
   }

   VG_(close)(pctrace_fd);
}

static Int open_tracefile(Int thread_num)
{
   SysRes sres;
   UChar temp_string[2048];

      /* For thread 1, don't append any thread number  */
      /* This lets the single-thread case not have any */
      /* extra values appended to the file name.       */
   if (thread_num==1) {
      VG_(strncpy)(temp_string, bb_out_file, 2047);
   }
   else {
      VG_(sprintf)(temp_string, "%s.%d", bb_out_file, thread_num);
   }

   sres = VG_(open)(temp_string, VKI_O_CREAT|VKI_O_TRUNC|VKI_O_WRONLY,
                    VKI_S_IRUSR|VKI_S_IWUSR|VKI_S_IRGRP|VKI_S_IWGRP);

   if (sr_isError(sres)) {
      VG_(umsg)("Error: cannot create bb file %s\n", temp_string);
      VG_(exit)(1);
   }

   return sr_Res(sres);
}

static void handle_overflow(void)
{
   struct BB_info *bb_elem;

   if (bbv_thread[current_thread].dyn_instr > interval_size) {

      if (!instr_count_only) {

            /* If our output fd hasn't been opened, open it */
         if (bbv_thread[current_thread].bb_trace_fd < 0) {
            bbv_thread[current_thread].bb_trace_fd=open_tracefile(current_thread);
         }

            /* put an entry to the bb.out file */
         VG_(write)(bbv_thread[current_thread].bb_trace_fd, "T", 1);

         VG_(OSetGen_ResetIter)(instr_info_table);
         while ( (bb_elem = VG_(OSetGen_Next)(instr_info_table)) ) {
            if ( bb_elem->inst_counter[current_thread] != 0 ) {
               VG_(sprintf)( buf, ":%d:%d ",
                             bb_elem->block_num,
                             bb_elem->inst_counter[current_thread]);
               VG_(write)(bbv_thread[current_thread].bb_trace_fd,
                          (void*)buf, VG_(strlen)(buf));
               bb_elem->inst_counter[current_thread] = 0;
            }
         }

         VG_(write)(bbv_thread[current_thread].bb_trace_fd, "\n", 1);
      }

      bbv_thread[current_thread].dyn_instr -= interval_size;
   }
}

static void close_out_reps(void)
{
   bbv_thread[current_thread].global_rep_count+=bbv_thread[current_thread].rep_count;
   bbv_thread[current_thread].unique_rep_count++;
   bbv_thread[current_thread].rep_count=0;
}

/* Generic function to get called each instruction */
static VG_REGPARM(1) void per_instruction_BBV(struct BB_info *bbInfo)
{
   Int n_instrs=1;

   tl_assert(bbInfo);

      /* we finished rep but didn't clear out count */
   if (bbv_thread[current_thread].rep_count) {
      n_instrs++;
      close_out_reps();
   }

   bbInfo->inst_counter[current_thread]+=n_instrs;

   bbv_thread[current_thread].total_instr+=n_instrs;
   bbv_thread[current_thread].dyn_instr +=n_instrs;

   handle_overflow();
}

/* Function to get called if instruction has a rep prefix */
static VG_REGPARM(1) void per_instruction_BBV_rep(Addr addr)
{
      /* handle back-to-back rep instructions */
   if (bbv_thread[current_thread].last_rep_addr!=addr) {
      if (bbv_thread[current_thread].rep_count) {
         close_out_reps();
         bbv_thread[current_thread].total_instr++;
         bbv_thread[current_thread].dyn_instr++;
      }
      bbv_thread[current_thread].last_rep_addr=addr;
   }

   bbv_thread[current_thread].rep_count++;
}

/* Function to call if our instruction has a fldcw instruction */
static VG_REGPARM(1) void per_instruction_BBV_fldcw(struct BB_info *bbInfo)
{
   Int n_instrs=1;

   tl_assert(bbInfo);

      /* we finished rep but didn't clear out count */
   if (bbv_thread[current_thread].rep_count) {
      n_instrs++;
      close_out_reps();
   }

      /* count fldcw instructions */
   bbv_thread[current_thread].fldcw_count++;

   bbInfo->inst_counter[current_thread]+=n_instrs;

   bbv_thread[current_thread].total_instr+=n_instrs;
   bbv_thread[current_thread].dyn_instr +=n_instrs;

   handle_overflow();
}

/* Check if the instruction pointed to is one that needs */
/*   special handling.  If so, set a bit in the return   */
/*   value indicating what type.                         */
static Int get_inst_type(Int len, Addr addr)
{
   int result=0;

#if defined(VGA_x86) || defined(VGA_amd64)

   unsigned char *inst_pointer;
   unsigned char inst_byte;
   int i, possible_rep;

      /* rep prefixed instructions are counted as one instruction on */
      /*     x86 processors and must be handled as a special case    */

      /* Also, the rep prefix is re-used as part of the opcode for   */
      /*     SSE instructions.  So we need to specifically check for */
      /*     the following: movs, cmps, scas, lods, stos, ins, outs  */

   inst_pointer=(unsigned char *)addr;
   i=0;
   inst_byte=0;
   possible_rep=0;

   while (i<len) {

      inst_byte=*inst_pointer;

      if ( (inst_byte == 0x67) ||            /* size override prefix */
           (inst_byte == 0x66) ||            /* size override prefix */
           (inst_byte == 0x48) ) {           /* 64-bit prefix */
      } else if ( (inst_byte == 0xf2) ||     /* rep prefix */
                  (inst_byte == 0xf3) ) {    /* repne prefix */
         possible_rep=1;
      } else {
         break;                              /* other byte, exit */
      }

      i++;
      inst_pointer++;
   }

   if ( possible_rep &&
        ( ( (inst_byte >= 0xa4) &&     /* movs, cmps, scas */
            (inst_byte <= 0xaf) ) ||   /* lods, stos */
          ( (inst_byte >= 0x6c) &&
            (inst_byte <= 0x6f) ) ) ) {  /* ins, outs */

      result|=REP_INSTRUCTION;
   }

      /* fldcw instructions are double-counted by the hardware */
      /*     performance counters on pentium 4 processors so it is */
      /*     useful to have that count when doing validation work. */

   inst_pointer=(unsigned char *)addr;
   if (len>1) {
         /* FLDCW detection */
         /* opcode is 0xd9/5, ie 1101 1001 oo10 1mmm */
      if ( (*inst_pointer==0xd9) &&
           (*(inst_pointer+1)<0xb0) &&  /* need this case of fldz, etc, count */
           ( (*(inst_pointer+1) & 0x38) == 0x28) ) {
         result|=FLDCW_INSTRUCTION;
      }
   }
#endif

   return result;
}

/* Our instrumentation function       */
/*    sbIn = super block to translate */
/*    layout = guest layout           */
/*    gWordTy = size of guest word    */
/*    hWordTy = size of host word     */
static IRSB* bbv_instrument ( VgCallbackClosure* closure,
                              IRSB* sbIn, VexGuestLayout* layout,
                              VexGuestExtents* vge,
                              IRType gWordTy, IRType hWordTy )
{
   Int i, n_instrs=1;
   IRSB *sbOut;
   IRStmt *st;
   struct BB_info *bbInfo;
   Addr64 origAddr, ourAddr;
   IRDirty *di;
   IRExpr **argv, *arg1;
   Int regparms, opcode_type;

      /* We don't handle a host/guest word size mismatch */
   if (gWordTy != hWordTy) {
      VG_(tool_panic)("host/guest word size mismatch");
   }

      /* Set up SB */
   sbOut = deepCopyIRSBExceptStmts(sbIn);

      /* Copy verbatim any IR preamble preceding the first IMark */
   i = 0;
   while ( (i < sbIn->stmts_used) && (sbIn->stmts[i]->tag != Ist_IMark) ) {
      addStmtToIRSB(sbOut, sbIn->stmts[i]);
      i++;
   }

      /* Get the first statement */
   tl_assert(sbIn->stmts_used > 0);
   st = sbIn->stmts[i];

      /* double check we are at a Mark statement */
   tl_assert(Ist_IMark == st->tag);

   origAddr=st->Ist.IMark.addr;

      /* Get the BB_info */
   bbInfo = VG_(OSetGen_Lookup)(instr_info_table, &origAddr);

   if (bbInfo==NULL) {

         /* BB never translated before (at this address, at least;          */
         /* could have been unloaded and then reloaded elsewhere in memory) */

         /* allocate and initialize a new basic block structure */
      bbInfo=VG_(OSetGen_AllocNode)(instr_info_table, sizeof(struct BB_info));
      bbInfo->BB_addr = origAddr;
      bbInfo->n_instrs = n_instrs;
      bbInfo->inst_counter=VG_(calloc)("bbv_instrument",
                                       allocated_threads,
                                       sizeof(Int));

         /* assign a unique block number */
      bbInfo->block_num=block_num;
      block_num++;

         /* get function name and entry point information */
      VG_(get_fnname)(origAddr, bbInfo->fn_name, FUNCTION_NAME_LENGTH);
      bbInfo->is_entry=VG_(get_fnname_if_entry)(origAddr, bbInfo->fn_name,
                                                FUNCTION_NAME_LENGTH);

         /* insert structure into table */
      VG_(OSetGen_Insert)(instr_info_table, bbInfo);
   }

      /* Iterate through the basic block, putting the original   */
      /* instructions in place, plus putting a call to updateBBV */
      /* for each original instruction                           */

      /* This is less efficient than only instrumenting the BB   */
      /* But it gives proper results given the fact that         */
      /* valgrind uses superblocks (not basic blocks) by default */

   while (i < sbIn->stmts_used) {
      st=sbIn->stmts[i];

      if (st->tag == Ist_IMark) {
         ourAddr = st->Ist.IMark.addr;
         opcode_type=get_inst_type(st->Ist.IMark.len, ourAddr);

         regparms=1;
         arg1= mkIRExpr_HWord( (HWord)bbInfo );
         argv= mkIRExprVec_1(arg1);

         if (opcode_type&REP_INSTRUCTION) {
            arg1= mkIRExpr_HWord(ourAddr);
            argv= mkIRExprVec_1(arg1);
            di= unsafeIRDirty_0_N( regparms, "per_instruction_BBV_rep",
                                   VG_(fnptr_to_fnentry)( &per_instruction_BBV_rep ),
                                   argv);
         }
         else if (opcode_type&FLDCW_INSTRUCTION) {
            di= unsafeIRDirty_0_N( regparms, "per_instruction_BBV_fldcw",
                                   VG_(fnptr_to_fnentry)( &per_instruction_BBV_fldcw ),
                                   argv);
         }
         else {
            di= unsafeIRDirty_0_N( regparms, "per_instruction_BBV",
                                   VG_(fnptr_to_fnentry)( &per_instruction_BBV ),
                                   argv);
         }

            /* Insert our call */
         addStmtToIRSB( sbOut, IRStmt_Dirty(di) );
      }

         /* Insert the original instruction */
      addStmtToIRSB( sbOut, st );

      i++;
   }

   return sbOut;
}

static struct thread_info *allocate_new_thread(struct thread_info *old,
                                               Int old_number, Int new_number)
{
   struct thread_info *temp;
   struct BB_info *bb_elem;
   Int i;

   temp=VG_(realloc)("bbv_main.c allocate_threads",
                     old,
                     new_number*sizeof(struct thread_info));

      /* init the new thread */
      /* We loop in case the new thread is not contiguous */
   for (i=old_number; i<new_number; i++) {
      temp[i].last_rep_addr=0;
      temp[i].dyn_instr=0;
      temp[i].total_instr=0;
      temp[i].global_rep_count=0;
      temp[i].unique_rep_count=0;
      temp[i].rep_count=0;
      temp[i].fldcw_count=0;
      temp[i].bb_trace_fd=-1;
   }

      /* expand the inst_counter on all allocated basic blocks */
   VG_(OSetGen_ResetIter)(instr_info_table);
   while ( (bb_elem = VG_(OSetGen_Next)(instr_info_table)) ) {
      bb_elem->inst_counter =
         VG_(realloc)("bbv_main.c inst_counter",
                      bb_elem->inst_counter,
                      new_number*sizeof(Int));
      for (i=old_number; i<new_number; i++) {
         bb_elem->inst_counter[i]=0;
      }
   }

   return temp;
}

static void bbv_thread_called ( ThreadId tid, ULong nDisp )
{
   if (tid >= allocated_threads) {
      bbv_thread=allocate_new_thread(bbv_thread, allocated_threads, tid+1);
      allocated_threads=tid+1;
   }
   current_thread=tid;
}

/*--------------------------------------------------------------------*/
/*--- Setup                                                        ---*/
/*--------------------------------------------------------------------*/

static void bbv_post_clo_init(void)
{
   bb_out_file =
      VG_(expand_file_name)("--bb-out-file", clo_bb_out_file);

      /* Try a closer approximation of basic blocks  */
      /* This is the same as the command line option */
      /* --vex-guest-chase-thresh=0                  */
   VG_(clo_vex_control).guest_chase_thresh = 0;
}

/* Parse the command line options */
static Bool bbv_process_cmd_line_option(Char* arg)
{
   if VG_INT_CLO       (arg, "--interval-size", interval_size) {}
   else if VG_STR_CLO  (arg, "--bb-out-file", clo_bb_out_file) {}
   else if VG_STR_CLO  (arg, "--pc-out-file", clo_pc_out_file) {
      generate_pc_file = True;
   }
   else if VG_BOOL_CLO (arg, "--instr-count-only", instr_count_only) {}
   else {
      return False;
   }

   return True;
}

static void bbv_print_usage(void)
{
   VG_(printf)(
"   --bb-out-file=<file>       filename for BBV info\n"
"   --pc-out-file=<file>       filename for BB addresses and function names\n"
"   --interval-size=<num>      interval size\n"
"   --instr-count-only=yes|no  only print total instruction count\n"
   );
}

static void bbv_print_debug_usage(void)
{
   VG_(printf)("    (none)\n");
}

static void bbv_fini(Int exitcode)
{
   Int i;

   if (generate_pc_file) {
      dumpPcFile();
   }

   for (i=0; i<allocated_threads; i++) {

      if (bbv_thread[i].total_instr!=0) {

         VG_(sprintf)(buf, "\n\n"
                           "# Thread %d\n"
                           "#   Total intervals: %d (Interval Size %d)\n"
                           "#   Total instructions: %lld\n"
                           "#   Total reps: %lld\n"
                           "#   Unique reps: %lld\n"
                           "#   Total fldcw instructions: %lld\n\n",
                      i,
                      (Int)(bbv_thread[i].total_instr/(ULong)interval_size),
                      interval_size,
                      bbv_thread[i].total_instr,
                      bbv_thread[i].global_rep_count,
                      bbv_thread[i].unique_rep_count,
                      bbv_thread[i].fldcw_count);

            /* Print results to display */
         VG_(umsg)("%s\n", buf);

            /* open the output file if it hasn't already */
         if (bbv_thread[i].bb_trace_fd < 0) {
            bbv_thread[i].bb_trace_fd=open_tracefile(i);
         }

            /* Also print to results file */
         VG_(write)(bbv_thread[i].bb_trace_fd, (void*)buf, VG_(strlen)(buf));
         VG_(close)(bbv_thread[i].bb_trace_fd);
      }
   }
}

static void bbv_pre_clo_init(void)
{
   VG_(details_name)            ("exp-bbv");
   VG_(details_version)         (NULL);
   VG_(details_description)     ("a SimPoint basic block vector generator");
   VG_(details_copyright_author)(
      "Copyright (C) 2006-2009 Vince Weaver");
   VG_(details_bug_reports_to)  (VG_BUGS_TO);

   VG_(basic_tool_funcs)          (bbv_post_clo_init,
                                   bbv_instrument,
                                   bbv_fini);

   VG_(needs_command_line_options)(bbv_process_cmd_line_option,
                                   bbv_print_usage,
                                   bbv_print_debug_usage);

   VG_(track_start_client_code)( bbv_thread_called );

   instr_info_table = VG_(OSetGen_Create)(/*keyOff*/0,
                                          NULL,
                                          VG_(malloc), "bbv.1", VG_(free));

   bbv_thread=allocate_new_thread(bbv_thread, 0, allocated_threads);
}

VG_DETERMINE_INTERFACE_VERSION(bbv_pre_clo_init)

/*--------------------------------------------------------------------*/
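
The interval bookkeeping in handle_overflow can be illustrated outside of Valgrind. The following is a minimal sketch only, not part of the tool: it assumes a fixed-size table of counters (NUM_BLOCKS, count_instrs and emit_interval are hypothetical names introduced here) in place of the OSet, writes into a caller-supplied string instead of a file descriptor, and assumes a single space after each ":id:count" pair, as the format string appears in the listing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_BLOCKS 4  /* hypothetical fixed table size for this sketch */

static unsigned long inst_counter[NUM_BLOCKS]; /* per-block counts, index = id - 1 */
static unsigned long dyn_instr;                /* instructions in current interval */

/* Credit n_instrs retired instructions to basic block block_num (1-based). */
static void count_instrs(int block_num, unsigned long n_instrs)
{
    assert(block_num >= 1 && block_num <= NUM_BLOCKS);
    inst_counter[block_num - 1] += n_instrs;
    dyn_instr += n_instrs;
}

/* If more than interval_size instructions have accumulated, format one
   "T:id:count :id:count \n" record into out, clear the per-block counters,
   carry the excess into the next interval, and return 1. Otherwise return 0. */
static int emit_interval(unsigned long interval_size, char *out, size_t out_len)
{
    size_t used;
    int i;

    assert(out != NULL && out_len > 1);
    if (dyn_instr <= interval_size)
        return 0;

    snprintf(out, out_len, "T");
    for (i = 0; i < NUM_BLOCKS; i++) {
        if (inst_counter[i] != 0) {
            used = strlen(out);
            snprintf(out + used, out_len - used, ":%d:%lu ", i + 1, inst_counter[i]);
            inst_counter[i] = 0;
        }
    }
    used = strlen(out);
    snprintf(out + used, out_len - used, "\n");

    dyn_instr -= interval_size;  /* keep the overshoot, as the tool does */
    return 1;
}
```

As in the tool, the interval counter is reduced by the interval size rather than reset to zero, so instructions that overshoot the boundary are carried into the next interval instead of being dropped.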



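The byte-level checks in get_inst_type can also be exercised on their own. The sketch below is an illustration under stated assumptions, not the tool's code: classify_inst is a hypothetical name, and it reads from a caller-supplied byte buffer rather than from a guest address. The prefix and opcode ranges are the ones used in the listing.

```c
#include <assert.h>
#include <stddef.h>

#define REP_INSTRUCTION   0x1
#define FLDCW_INSTRUCTION 0x2

/* Classify the x86 instruction stored in bytes[0..len-1]; returns a bit mask. */
static int classify_inst(const unsigned char *bytes, size_t len)
{
    int result = 0, possible_rep = 0;
    unsigned char b = 0;
    size_t i;

    assert(bytes != NULL);

    /* Skip prefix bytes, remembering whether 0xf2 (REPNE) or 0xf3 (REP)
       was seen. The loop stops at the first non-prefix byte. */
    for (i = 0; i < len; i++) {
        b = bytes[i];
        if (b == 0x66 || b == 0x67 || b == 0x48) {
            continue;                 /* size override or 64-bit prefix */
        } else if (b == 0xf2 || b == 0xf3) {
            possible_rep = 1;
        } else {
            break;                    /* first opcode byte */
        }
    }

    /* Only the string instructions count as rep-prefixed; SSE opcodes
       reuse 0xf2/0xf3 but start with 0x0f and fall outside these ranges. */
    if (possible_rep &&
        ((b >= 0xa4 && b <= 0xaf) || (b >= 0x6c && b <= 0x6f)))
        result |= REP_INSTRUCTION;

    /* fldcw is 0xd9 /5: ModRM reg field equals 5 and ModRM is a memory form. */
    if (len > 1 && bytes[0] == 0xd9 && bytes[1] < 0xb0 &&
        (bytes[1] & 0x38) == 0x28)
        result |= FLDCW_INSTRUCTION;

    return result;
}
```

The `< 0xb0` test matters because register-form 0xd9 opcodes such as fldz (0xd9 0xee) also have the bit pattern 0x28 in the masked field and would otherwise be miscounted as fldcw.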

QEMU BBV PATCH CODE LISTING
Here is the BBV generating patch for Qemu, against the git development tree
as of 13 January 2010. A few additional patches are required for proper Alpha
and MIPS support (hopefully those will be merged soon).
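
The patch gives every translation block a unique_id and calls a dump_pc helper with that id for each translated instruction; at exit it calls do_dump_pc(0xffffffff). The helper itself lives in bbv_routines.h, which is not reproduced in this listing. The following is therefore only a guess at the shape such a helper could take, written as a standalone sketch: sketch_dump_pc, counts, total_instr and flushed are hypothetical names, and treating 0xffffffff as a "flush at exit" sentinel is an assumption drawn from how the patch calls it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FLUSH_ID 0xffffffffu   /* value the patch passes at exit (assumed sentinel) */

static uint64_t *counts;       /* hypothetical per-block instruction counts */
static size_t    capacity;     /* number of entries allocated in counts */
static uint64_t  total_instr;  /* instructions seen so far */
static int       flushed;      /* set once the sentinel has been received */

/* Record one executed instruction belonging to block unique_id, growing the
   table on demand; the sentinel id marks the point where results would be
   written out. */
static void sketch_dump_pc(uint32_t unique_id)
{
    size_t i, new_cap;
    uint64_t *grown;

    if (unique_id == FLUSH_ID) {
        flushed = 1;
        return;
    }

    if (unique_id >= capacity) {
        new_cap = capacity ? capacity : 16;
        while (new_cap <= unique_id)
            new_cap *= 2;
        grown = realloc(counts, new_cap * sizeof *grown);
        assert(grown != NULL);
        for (i = capacity; i < new_cap; i++)
            grown[i] = 0;            /* new entries start at zero */
        counts = grown;
        capacity = new_cap;
    }

    counts[unique_id]++;
    total_instr++;
}
```

Indexing by a small sequential id, rather than by guest address, is what makes a flat array workable here; that is presumably the reason the patch adds the tb_count counter to tb_alloc.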
diff --git a/exec-all.h b/exec-all.h
index 820b59e..fd22d13 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -151,6 +151,7 @@ struct TranslationBlock {
     struct TranslationBlock *jmp_next[2];
     struct TranslationBlock *jmp_first;
     uint32_t icount;
+    uint32_t unique_id;
 };
 static inline unsigned int tb_jmp_cache_hash_page(target_ulong pc)
diff --git a/exec.c b/exec.c
index 1190591..1997af1 100644
--- a/exec.c
+++ b/exec.c
@@ -1167,6 +1167,8 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 #endif /* TARGET_HAS_SMC */
 }
+int tb_count=0;
+
 /* Allocate a new translation block. Flush the translation buffer if
    too many translation blocks or too much generated code. */
 TranslationBlock *tb_alloc(target_ulong pc)
@@ -1179,6 +1181,8 @@ TranslationBlock *tb_alloc(target_ulong pc)
     tb = &tbs[nb_tbs++];
     tb->pc = pc;
     tb->cflags = 0;
+    tb->unique_id = tb_count;
+    tb_count++;
     return tb;
 }
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 1acf1f5..3b59366 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -86,6 +86,8 @@
 #include "qemu.h"
 #include "qemu-common.h"
+void do_dump_pc(uint32_t);
+
 #if defined(CONFIG_USE_NPTL)
 #define CLONE_NPTL_FLAGS2 (CLONE_SETTLS | \
     CLONE_PARENT_SETTID | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID)
@@ -4194,6 +4196,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 #ifdef TARGET_GPROF
         _mcleanup();
 #endif
+        do_dump_pc(0xffffffff);
         gdb_exit(cpu_env, arg1);
         _exit(arg1);
         ret = 0; /* avoid warning */
@@ -5718,6 +5721,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 #ifdef TARGET_GPROF
         _mcleanup();
 #endif
+        do_dump_pc(0xffffffff);
         gdb_exit(cpu_env, arg1);
         ret = get_errno(exit_group(arg1));
         break;
diff --git a/target-alpha/helper.c b/target-alpha/helper.c
index be7d37b..340aadd 100644
--- a/target-alpha/helper.c
+++ b/target-alpha/helper.c
@@ -25,6 +25,8 @@
 #include "exec-all.h"
 #include "softfloat.h"
+#include "../bbv_routines.h"
+
 uint64_t cpu_alpha_load_fpcr (CPUState *env)
 {
     uint64_t ret = 0;
diff --git a/target-alpha/helper.h b/target-alpha/helper.h
index bedd3c0..efc145d 100644
--- a/target-alpha/helper.h
+++ b/target-alpha/helper.h
@@ -1,5 +1,7 @@
 #include "def-helper.h"
+DEF_HELPER_1(dump_pc, void, i32)
+
 DEF_HELPER_2(excp, void, int, int)
 DEF_HELPER_0(load_pcc, i64)
 DEF_HELPER_0(rc, i64)
diff --git a/target-alpha/translate.c b/target-alpha/translate.c
index 87813e7..53a0315 100644
--- a/target-alpha/translate.c
+++ b/target-alpha/translate.c
@@ -2626,6 +2626,15 @@ static inline void gen_intermediate_code_internal(CPUState *env,
         if (num_insns + 1 == max_insns && (tb->cflags & CF_LAST_IO))
             gen_io_start();
         insn = ldl_code(ctx.pc);
+        {
+            /* vmw */
+            TCGv const1;
+
+            const1 = tcg_const_i32(tb->unique_id);
+            gen_helper_dump_pc(const1);




         if (unlikely(qemu_loglevel_mask(CPU_LOG_TB_OP))) {
diff --git a/target-arm/helper.c b/target-arm/helper.c
index b3aec99..caa7549 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -9,6 +9,8 @@
 #include "qemu-common.h"
 #include "host-utils.h"
+#include "../bbv_routines.h"
+
 static uint32_t cortexa9_cp15_c0_c1[8] =
 { 0x1031, 0x11, 0x000, 0, 0x00100103, 0x20000000, 0x01230000, 0x00002111 };
diff --git a/target-arm/helpers.h b/target-arm/helpers.h
index 0d1bc47..5f58e87 100644
--- a/target-arm/helpers.h
+++ b/target-arm/helpers.h
@@ -1,5 +1,7 @@
 #include "def-helper.h"
+DEF_HELPER_1(dump_pc, void, i32)
+
 DEF_HELPER_1(clz, i32, i32)
 DEF_HELPER_1(sxtb16, i32, i32)
 DEF_HELPER_1(uxtb16, i32, i32)
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 5cf3e06..312d5e6 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -5964,7 +5964,7 @@ static void gen_store_exclusive(DisasContext *s,
 }
 #endif
-static void disas_arm_insn(CPUState * env, DisasContext *s)
+static void disas_arm_insn(CPUState * env, DisasContext *s, int unique_id)
 {
     unsigned int cond, insn, val, op1, i, shift, rm, rs, rn, rd, sh;
     TCGv tmp;
@@ -5975,7 +5975,16 @@ static void disas_arm_insn(CPUState * env, DisasContext *s)
     insn = ldl_code(s->pc);





+        /* vmw */
+        TCGv const1;
+
+        const1 = tcg_const_i32(unique_id);
+        gen_helper_dump_pc(const1);
+        tcg_temp_free(const1);
+    }
+
     /* M variants do not implement ARM mode. */
     if (IS_M(env))
         goto illegal_op;




-    disas_arm_insn(env, dc);
+    disas_arm_insn(env, dc, tb->unique_id);
 }
 if (num_temps) {
     fprintf(stderr, "Internal resource leak before %08x\n", dc->pc);
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 049fccf..4e4b7b3 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -30,6 +30,8 @@
 //#define DEBUG_MMU
+#include "../bbv_routines.h"
+
 /* feature flags taken from "Intel Processor Identification and the CPUID
  * Instruction" and AMD's "CPUID Specification". In cases of disagreement
  * about feature names, the Linux name is used. */
diff --git a/target-i386/helper.h b/target-i386/helper.h
index 6b518ad..4a5fa43 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -1,5 +1,7 @@
 #include "def-helper.h"
+DEF_HELPER_1(dump_pc, void, i32)
+
 DEF_HELPER_FLAGS_1(cc_compute_all, TCG_CALL_PURE, i32, int)
 DEF_HELPER_FLAGS_1(cc_compute_c, TCG_CALL_PURE, i32, int)
diff --git a/target-i386/op_helper.c b/target-i386/op_helper.c
index 5eea322..4d93fa5 100644
--- a/target-i386/op_helper.c
+++ b/target-i386/op_helper.c
@@ -23,7 +23,6 @@
 //#define DEBUG_PCALL
-
 #ifdef DEBUG_PCALL
 #define LOG_PCALL(...) qemu_log_mask(CPU_LOG_PCALL, ## __VA_ARGS__)
 #define LOG_PCALL_STATE(env) \
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 511a4ea..25797c5 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -4075,7 +4075,8 @@ static void gen_sse(DisasContext *s, int b, target_ulong
 /* convert one instruction. s->is_jmp is set if the translation must
    be stopped. Return the next pc value */
-static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
+static target_ulong disas_insn(DisasContext *s, target_ulong pc_start,
+                               int unique_id)
 {
     int b, prefixes, aflag, dflag;
     int shift, ot;
@@ -4208,6 +4209,20 @@ static target_ulong disas_insn(DisasContext *s,
     if (prefixes & PREFIX_LOCK)
         gen_helper_lock();
+    {
+        /* vmw */
+        TCGv const1;
+
+        if (prefixes & (PREFIX_REPZ | PREFIX_REPNZ)) {
+            const1 = tcg_const_i32(unique_id|0x80000000);
+        }
+        else {
+            const1 = tcg_const_i32(unique_id);
+        }
+        gen_helper_dump_pc(const1);
+        tcg_temp_free(const1);
+    }
+
     /* now check opcode */
  reswitch:
     switch(b) {
@@ -7849,7 +7864,7 @@ static inline void gen_intermediate_code_internal(
         if (num_insns + 1 == max_insns && (tb->cflags & CF_LAST_IO))
             gen_io_start();
-        pc_ptr = disas_insn(dc, pc_ptr);
+        pc_ptr = disas_insn(dc, pc_ptr, tb->unique_id);
         num_insns++;
         /* stop translation if indicated */
         if (dc->is_jmp)
diff --git a/target-mips/helper.c b/target-mips/helper.c
index 903987b..dd0e4f9 100644
--- a/target-mips/helper.c
+++ b/target-mips/helper.c
@@ -34,6 +34,8 @@ enum {
     TLBRET_MATCH = 0
 };
+#include "../bbv_routines.h"
+
 /* no MMU emulation */
 int no_mmu_map_address (CPUState *env, target_phys_addr_t *physical,
                         target_ulong address, int rw, int access_type)
diff --git a/target-mips/helper.h b/target-mips/helper.h
index ab47b1a..e4faff4 100644
--- a/target-mips/helper.h
+++ b/target-mips/helper.h
@@ -1,5 +1,7 @@
 #include "def-helper.h"
+DEF_HELPER_1(dump_pc, void, i32)
+
 DEF_HELPER_2(raise_exception_err, void, i32, int)
 DEF_HELPER_1(raise_exception, void, i32)
 DEF_HELPER_0(interrupt_restart, void)
diff --git a/target-mips/translate.c b/target-mips/translate.c
index dfea6f6..f700599 100644
--- a/target-mips/translate.c
+++ b/target-mips/translate.c
@@ -9524,9 +9524,17 @@ gen_intermediate_code_internal (CPUState *env,
         if (!(ctx.hflags & MIPS_HFLAG_M16)) {
             ctx.opcode = ldl_code(ctx.pc);
             insn_bytes = 4;
+            {
+                /* vmw */
+                gen_helper_0i(dump_pc, tb->unique_id);
+            }
             decode_opc(env, &ctx, &is_branch);
         } else if (env->insn_flags & ASE_MIPS16) {
             ctx.opcode = lduw_code(ctx.pc);
+            {
+                /* vmw */
+                gen_helper_0i(dump_pc, tb->unique_id);
+            }
             insn_bytes = decode_mips16_opc(env, &ctx, &is_branch);
         } else {
             generate_exception(&ctx, EXCP_RI);
diff --git a/target-ppc/helper.c b/target-ppc/helper.c
index b233d4f..75adac4 100644
--- a/target-ppc/helper.c
+++ b/target-ppc/helper.c
@@ -29,6 +29,8 @@
 #include "qemu-common.h"
 #include "kvm.h"
+#include "../bbv_routines.h"
+
 //#define DEBUG_MMU
 //#define DEBUG_BATS
 //#define DEBUG_SLB
diff --git a/target-ppc/helper.h b/target-ppc/helper.h
index 40d4ced..b34a83e 100644
--- a/target-ppc/helper.h
+++ b/target-ppc/helper.h
@@ -1,5 +1,7 @@
 #include "def-helper.h"
+DEF_HELPER_1(dump_pc, void, i32)
+
 DEF_HELPER_2(raise_exception_err, void, i32, i32)
 DEF_HELPER_1(raise_exception, void, i32)
 DEF_HELPER_3(tw, void, tl, tl, i32)
diff --git a/target-ppc/translate.c b/target-ppc/translate.c
index d4e81ce..623c045 100644
--- a/target-ppc/translate.c
+++ b/target-ppc/translate.c
@@ -9029,6 +9029,16 @@ static inline void gen_intermediate_code_internal(
     } else {
         ctx.opcode = ldl_code(ctx.nip);
     }
+    {
+        /* vmw */
+        TCGv const1;
+
+        const1 = tcg_const_i32(tb->unique_id);
+        gen_helper_dump_pc(const1);
+
+        tcg_temp_free(const1);
+    }
+
     LOG_DISAS("translate opcode %08x (%02x %02x %02x) (%s)\n",
               ctx.opcode, opc1(ctx.opcode), opc2(ctx.opcode),
               opc3(ctx.opcode), little_endian ? "little" : "big");
diff --git a/target-sparc/helper.c b/target-sparc/helper.c
index e801474..9a32974 100644
--- a/target-sparc/helper.c
+++ b/target-sparc/helper.c
@@ -38,6 +38,8 @@ static int cpu_sparc_find_by_name(sparc_def_t *cpu_def,
 static spinlock_t global_cpu_lock = SPIN_LOCK_UNLOCKED;
+#include "../bbv_routines.h"
+
 void cpu_lock(void)
 {
     spin_lock(&global_cpu_lock);
diff --git a/target-sparc/helper.h b/target-sparc/helper.h
index 6f103e7..7e74a21 100644
--- a/target-sparc/helper.h
+++ b/target-sparc/helper.h
@@ -1,5 +1,7 @@
 #include "def-helper.h"
+DEF_HELPER_1(dump_pc, void, i32)
+
 #ifndef TARGET_SPARC64
 DEF_HELPER_0(rett, void)
 DEF_HELPER_1(wrpsr, void, tl)
diff --git a/target-sparc/translate.c b/target-sparc/translate.c
index 7e9f0cf..58405c5 100644
--- a/target-sparc/translate.c
+++ b/target-sparc/translate.c
@@ -1695,7 +1695,7 @@ static inline void gen_load_trap_state_at_tl(TCGv_ptr
     goto nfpu_insn;
 /* before an instruction, dc->pc must be static */
-static void disas_sparc_insn(DisasContext * dc)
+static void disas_sparc_insn(DisasContext * dc, int unique_id)
 {
     unsigned int insn, opc, rs1, rs2, rd;
     target_long simm;
@@ -1703,6 +1703,16 @@ static void disas_sparc_insn(DisasContext * dc)
     if (unlikely(qemu_loglevel_mask(CPU_LOG_TB_OP)))
         tcg_gen_debug_insn_start(dc->pc);
     insn = ldl_code(dc->pc);
+
+    {
+        /* vmw */
+        TCGv const1;
+
+        const1 = tcg_const_i32(unique_id);
+        gen_helper_dump_pc(const1);
+        tcg_temp_free(const1);
+    }
+
     opc = GET_FIELD(insn, 0, 1);
     rd = GET_FIELD(insn, 2, 6);
@@ -4732,7 +4742,7 @@ static inline void gen_intermediate_code_internal(
         if (num_insns + 1 == max_insns && (tb->cflags & CF_LAST_IO))
             gen_io_start();
         last_pc = dc->pc;
-        disas_sparc_insn(dc);
+        disas_sparc_insn(dc, tb->unique_id);
         num_insns++;
         if (dc->is_br)
diff --git a/bbv_routines.h b/bbv_routines.h
new file mode 100644
index 0000000..16f31d0
--- /dev/null
+++ b/bbv_routines.h
@@ -0,0 +1,101 @@
+/* vmw */
+
+void do_dump_pc(uint32_t bb);
+
+#if !defined(TARGET_ARM)
+void helper_dump_pc(uint32_t bb);
+#endif
+




+#define MAX_BBS 100000
+#define INTERVAL_SIZE 100000000 /* 100 million */
+
+void do_dump_pc(unsigned int bb) {
+
+    static unsigned long total_count=0,intervals=0;
+    static int bbvs[MAX_BBS];
+    int i;
+    static FILE *bbv_file=NULL;
+
+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+    static int rep_count=0;
+    int rep;
+    static long long total_reps=0;
+#endif
+
+    if (bb==0xffffffff) {
+        if (bbv_file!=NULL) {
+            long long total;
+            total = ((long long)intervals*INTERVAL_SIZE)+ (long long)total_count;
+            fprintf(bbv_file,"# Total count: %lld\n",total);
+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+            fprintf(bbv_file,"# Rep count: %lld\n",total_reps);
+#endif





+    if (bbv_file==NULL) {
+        bbv_file=fopen("qemusim.bbv","w");
+        if (bbv_file==NULL) {
+            printf("Error! Could not open file %s\n","qemusim.bbv");




+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+    rep=bb&0x80000000;
+    bb&=0x7fffffff;
+#endif
+
+    /* valid indices into bbvs[] are 0..MAX_BBS-1 */
+    if (bb>=MAX_BBS) {
+        printf("Error! Not enough BBS %u\n",bb);
+        exit(-1);
+    }
+
+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+
+    if (rep) {
+        rep_count++;





+    if ((rep_count) && (!rep)) {
+        rep_count=0;
+        /* count all reps as one instruction (as per docs) */
+        /* this makes things match perf-ctr results */
+        total_count++;




+    total_count++;
+    bbvs[bb]++;
+
+    if (total_count>=INTERVAL_SIZE) {
+        intervals++;
+        fprintf(bbv_file,"T");
+        for(i=0;i<MAX_BBS;i++) {
+            if (bbvs[i]) {
+                /* simpoint can't handle a basic block starting at zero? */
+                fprintf(bbv_file,":%d:%d ",i+1,bbvs[i]);
+            }
+        }
+        fprintf(bbv_file,"\n");
+
+        /* clear the stats */
+        total_count=0;
+        for(i=0;i<MAX_BBS;i++) {





+/* grrr, why is this needed on x86 */
+void helper_dump_pc(unsigned int bb) {




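The interval logic in do_dump_pc above can be illustrated outside of QEMU. The following is a minimal sketch, not the patch itself: the names bbv_state and bbv_record are hypothetical, the table and interval sizes are shrunk for demonstration, and the x86 rep-prefix handling is left out. It counts one instruction per call against a basic block id and, once an interval is full, writes one line in the same "T:id:count" layout that the patch emits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_BBS        16  /* demo value; the patch uses 100000 */
#define INTERVAL_SIZE  4   /* demo value; the patch uses 100 million */

/* Hypothetical container for the counters the patch keeps as statics. */
struct bbv_state {
    unsigned long total_count;  /* instructions seen in the current interval */
    unsigned long intervals;    /* number of completed intervals */
    int bbvs[MAX_BBS];          /* per-basic-block instruction counts */
};

/* Record one executed instruction that belongs to basic block `bb`.
 * When the interval fills up, write one "T:id:count ..." line to `out`
 * and clear the counters. Returns 1 if a line was written, 0 otherwise. */
int bbv_record(struct bbv_state *st, unsigned int bb, FILE *out)
{
    int i;

    assert(bb < MAX_BBS);
    st->total_count++;
    st->bbvs[bb]++;

    if (st->total_count < INTERVAL_SIZE)
        return 0;

    st->intervals++;
    fprintf(out, "T");
    for (i = 0; i < MAX_BBS; i++) {
        if (st->bbvs[i])
            fprintf(out, ":%d:%d ", i + 1, st->bbvs[i]);  /* ids printed 1-based */
    }
    fprintf(out, "\n");

    /* clear the stats for the next interval */
    st->total_count = 0;
    memset(st->bbvs, 0, sizeof st->bbvs);
    return 1;
}
```

With these demo sizes, recording blocks 0, 0, 2, 0 in that order completes one interval and writes the line "T:1:3 :3:1 ", since block ids are shifted by one on output.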
R12000 BRANCH PREDICTOR KERNEL MODULE
This kernel module for MIPS Linux allows setting the branch predictor
behavior on an R12000 processor, as described in Chapter 5.
/*
 * brpred_config.c - configure branch predictor on R12000
 * TODO - configurable by module parameter, not by
 *        recompiling
 */
#include <linux/module.h> /* Needed by all modules */
#include <linux/kernel.h> /* Needed for KERN_INFO */
#include <asm/mipsregs.h>
int init_module(void) {
    unsigned int x;
    x=__read_32bit_c0_register($22,0);
    printk(KERN_INFO "Hello world. %x\n",x);
    /*
     * A non 0 return means init_module failed; module can't be loaded.
     */
    __write_32bit_c0_register($22,0,0x20300000); /* default 2-bit */
    // __write_32bit_c0_register($22,0,0x20310000); /* not-taken */
    // __write_32bit_c0_register($22,0,0x20320000); /* taken */
    // __write_32bit_c0_register($22,0,0x20330000); /* fwd=not, back=yes */
    return 0;
}
void cleanup_module(void) {
    unsigned int x;
    x=__read_32bit_c0_register($22,0);




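The four constants written to register $22 in the module above differ only in bits 17:16. The sketch below makes that pattern explicit. It is inferred purely from those four constants and their comments, not from the R12000 manual; the names brpred_mode, BRPRED_BASE and brpred_config_value are hypothetical, and the meaning of the remaining bits in the base value is not addressed here.

```c
#include <assert.h>
#include <stdint.h>

/* Prediction modes, numbered to match the four constants in the listing. */
enum brpred_mode {
    BRPRED_2BIT      = 0,  /* default 2-bit counters       -> 0x20300000 */
    BRPRED_NOT_TAKEN = 1,  /* always predict not-taken     -> 0x20310000 */
    BRPRED_TAKEN     = 2,  /* always predict taken         -> 0x20320000 */
    BRPRED_BTFN      = 3   /* forward not-taken, back taken -> 0x20330000 */
};

#define BRPRED_BASE       0x20300000u  /* bits shared by all four constants */
#define BRPRED_MODE_SHIFT 16           /* assumption: mode lives in bits 17:16 */

/* Build the value to write for a given mode (userspace arithmetic only;
 * the actual register write must still happen in kernel mode). */
uint32_t brpred_config_value(enum brpred_mode mode)
{
    assert((unsigned)mode <= BRPRED_BTFN);
    return BRPRED_BASE | ((uint32_t)mode << BRPRED_MODE_SHIFT);
}
```

A module parameter, as suggested by the TODO in the listing, could select one of these modes at load time instead of requiring a recompile.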
SESC R12000 CONFIGURATION FILE
We use this SESC configuration file to model an R12000 processor with 2-bit
branch prediction (as described in Chapter 5).
############################
# General Processor Options
############################
nCPUs = 1 # We have a single core
cpucore[0:0] = 'issueX' # single core
# Parameters
procsPerNode = 1 # our machine is single processor




technology = 'techParam'
[techParam]
tech = 250 # nm
frequency = 300e6 # Hz
###############################
# PROCESSOR CONFIGURATION #
###############################
# r12k p 13
# int register file
# 3 write ports, 7 read ports
# each ALU has two read and 1 write
# addr calc unit has 2 read ports
# last read shared between store, jr, and move-to-fp
# last write shared between load, bal, and move-from-fp
# no rob
# special "condition" file for conditional move instructions
# fp register file
# 3 write ports, 5 read ports
# + adder and mul each has 2 dedicated read and one write
# last read is shared between store and move
[issueX]
issueWrongPath = true # only if compiled with SESC_MISPATH
inorder = false # r12k paper page 1
fetchWidth = 4 # r12k paper page 1
instQueueSize = 4 # Renau - This is a different type of structure
# ! (r10k paper page 33)
issueWidth = 4 # Renau - By issue SESC means max rename per cycle
# ! (from r12k paper, page 1)
retireWidth = 4 # r10k paper page 33
decodeDelay = 1 # Renau
renameDelay = 1 # Renau
maxBranches = 4 # does this mean max branches we can run through? r12k
bb4Cycle = 1 # ??
maxIRequests = 3 # ??
interClusterLat = 1 # Renau
cluster[0] = 'FXClusterIssueX'
cluster[1] = 'FPClusterIssueX'
cluster[2] = 'AddressIssueX' # r12k has separate address queue
stForwardDelay = 1 # ??
maxLoads = 16 # Renau - ld/st share a single queue with 16 entries
maxStores = 16 # Renau
regFileDelay = 1 # Renau
robSize = 48 # ?? r12k does not have a ROB but has a 48-entry active list (p9)
intRegs = 64 # r12k p8
fpRegs = 64 # r12k p8
bpred = 'BPredIssueX'
dataSource = "DataL1 DL1"
instrSource = "InstL1 IL1"
enableICache = true
dtlb = 'TLB'
itlb = 'TLB'
OSType = 'std'





# r12k has a 64-entry unified TLB
# each entry points to two consecutive pages
# it is fully associative
# it is not possible to represent this in SESC?
# replacement is done in software. Often the bottom 8 entries
# are fixed and the rest is random?
# more info, r12k paper p16
# also, typically 8 of the entries are pinned to the OS
[TLB]
deviceType = 'tlb'
size = 64*8 # is this bytes?
assoc = 64 # we want fully-associative
bsize = 8 # block size ???
numPorts = 1 # have no idea
replPolicy = 'LRU' # ??
## Pipeline clusters
#
# Int ALU
#
# Either ALU can add/sub/logical/move hilo/trap
# ALU1 branches, shift, lui, conditional moves
# ALU2 mul, div
# Load/Store unit
[FXClusterIssueX]
blockName = "IntWin"
winSize = 16 # ??
recycleAt = 'Execute' # Recycle entries at : Execute|Retire
# it looks like Execute is right for r12k
schedNumPorts = 2 # Renau - 2 execution units max
schedPortOccp = 1 # ??
wakeUpNumPorts = 0 # Renau - No split wakeup/select cycle in R12k
wakeUpPortOccp = 0 # Renau
wakeupDelay = 0 # Renau
schedDelay = 0 # Renau
iALUUnit = 'ALUIssueX'
iALULat = 1 # r12k paper, p13
iBJUnit = 'ALUIssueX' # Branch jump?
iBJLat = 1 # r12k paper, p13
iDivUnit = 'MDIssueX'
iDivLat = 35 # r12k, 34/35 cycles for 32 bit, 66/67 for 64 bit
# r12k paper, p13
iMultUnit = 'MDIssueX'
iMultLat = 6 # r12k, 5/6 for 32-bit, unsigned is 1 extra
# 9/10 for 64-bit
# r12k paper, p13
[AddressIssueX]
blockName = "IntWin"
winSize = 16 # ??
recycleAt = 'Execute' # Recycle entries at : Execute|Retire
schedNumPorts = 2 # ??
schedPortOccp = 1 # ??
wakeUpNumPorts = 0 # ??
wakeUpPortOccp = 0 # ??
wakeupDelay = 0 # ??
schedDelay = 0 # ??
iStoreUnit = 'LDSTIssueX' # Renau - shared LD/ST cache port
iStoreLat = 1 # ??
iLoadUnit = 'LDSTIssueX'
iLoadLat = 1 # r12k paper, p13 if in l1 cache.




# For r12k, there are 5 floating point units
# they are 3-stage pipelined, with a 1-cycle repeat rate
# adder
# multiplier
# divide
# square root
# load/store
# latency and repeat rate are not necessarily the same
[FPClusterIssueX]
blockName = "FPWin"
winSize = 16 # ??
recycleAt = 'Execute' # Recycle entries at : Execute|Retire
schedNumPorts = 2 # Renau
schedPortOccp = 1 # ??
wakeUpNumPorts = 0 # Renau
wakeUpPortOccp = 0 # Renau
wakeupDelay = 0 # Renau
schedDelay = 0 # ??
fpALUUnit = 'FP0IssueX'
fpALULat = 2 # r12k paper p16
fpMultUnit = 'FP1IssueX' # Renau
fpMultLat = 2 # 4 if it is a multiply/add
# r12k p16
fpDivUnit = 'FP1IssueX' # Renau
fpDivLat = 12 # 12 for 32 bit, 19 for 64-bit
# r12k p16
[LDSTIssueX]
Num = 1 # ??
Occ = 1 # ??
[ALUIssueX]
Num = 1 # Renau
Occ = 1 # ??
[MDIssueX] # Renau - muldiv for int
Num = 1
Occ = 8 # Renau - Shared by mult and div. 32b mult 6, 64b mult 10,
# 32 bit div 35, 64b div 67.
# What is the real mix? Not easy to model with sesc
[FP0IssueX]
Num = 1 # ??
Occ = 1 # ??
[FP1IssueX] # Renau - FP Mult/Div unit
Num = 1




# bits 12:2 used to index into 2048 entry saturating
# 2-bit counter
#
# 4 "shadow" copies of reg file. When misprediction,
# takes 2 cycles to recover on r12k (1 cycle on r10k)
#
# r12k has optional 8-bit global history that can be hashed
# into main branch index. Linux leaves this alone, disabled
# by default (r12k p19)
[BPredIssueX]
type = "2bit"
size = 2048 # r12k paper, p5
rasSize = 4 # r12k manual page 236
btbSize = 32 # r12k paper p5
btbAssoc = 2 # r12k paper p5
btbBsize = 1 # ??
btbReplPolicy = 'LRU' # ??
bits = 2 # (saturating bits. why is this not
# documented better)
BTACDelay = 2 # Renau
###############################
# physically tagged
# 32kB 2-way 64-byte lines
# from dmesg on actual machine
# also has a parity bit
# stored in a 36-bit pre-decoded format
# LRU
# Can fetch 4 consecutive instructions, but cannot cross a block boundary
[InstL1]
deviceType = 'icache'
blockName = "Icache"
size = 32*1024
assoc = 2
bsize = 64
writePolicy = 'WB' # n/a
replPolicy = 'LRU' #
numPorts = 1 # Renau
portOccp = 1 # ??
hitDelay = 2 # cycles? if so, r12k p6
missDelay = 0 # ??
MSHR = 'InstL1MSHR' # ??
lowerLevel = "L2Cache L2 shared"
[InstL1MSHR]
size = 4 # Renau
# ?? is this bytes? entries?
type = 'full' # ??
bsize = 64 # ??
################################
# Memory Subsystem (L1)
################################
# 32kB, 2-way, line size 32 bytes
# each byte has parity bit
# two identical banks selected by address bit 5
# physically tagged
[DataL1]
deviceType = 'cache'
blockName = "Dcache"
MSHR = "DL1MSHR"
size = 32*1024
assoc = 2
skew = false # ??
bsize = 32
replPolicy = 'LRU' #
numPorts = 2 # Renau
portOccp = 1 # ??
hitDelay = 2 # cycles? if so r12k p6
# lmbench shows 6.6ns, so that matches 2 cycles
missDelay = 1 # ??
writePolicy = "WB" # r12k p7
lowerLevel = "CommonBus Bus shared" # ??
[DL1MSHR]
type = 'full' # ??
size = 8 # Renau - 4 MSHR entries in R10K
# but can handle up to 4 pending per entry
# SESC cannot do this, approx with 8?
bsize = 64 # ??
##############################
# 2MB, 2-way, 128 bytes
# from dmesg
# each quadword has 9-bit ECC and a parity bit
# correction pipeline takes 2 cycles
# latency ~10 cycles
# has a 16kB way-predict table
# the tags are on-chip, cache itself off-chip
[L2Cache]
deviceType = 'cache'
blockName = "L2"
size = 2*1024*1024
assoc = 2
bsize = 128
writePolicy = 'WB' # r12k p7
replPolicy = 'LRU' # ??
numPorts = 1 # ??
portOccp = 1 # ??
hitDelay = 10 # r12k p 6
# lmbench measurements show 47.3ns or about 14 cycles
missDelay = 4 # ??
MSHR = 'MSHRL2' # ??
lowerLevel = "MemoryBus"
[MSHRL2]
type = 'full' # ??
size = 32 # ??
bsize = 64 # ??
[CommonBus]
deviceType = 'bus'
busWidth = 32
busLength = 7500 # 7.5 mm ??
numPorts = 1 # Renau
portOccp = 1
delay = 1 # Renau
buffWCReqs = 1
lowerLevel = "L2Cache L2 shared"
[MemoryBus]
deviceType = 'bus'
numPorts = 1
portOccp = 168 # Renau - Since the processor operates at 300MHz and
# the L2 cache has 128 bytes, to have around 300MB/s
# when you just need 1 request every 128 cycles. To
# make it closer to 220MB/s use 168 cycles port
# occupancy (128/168*300 = 220MB/s)
# lmbench shows ~220MB/s bandwidth
# 220MB/s / 128B = 1.802e6 cache lines/s
# = .005946 cache lines/cycle
# so, 168 cycles/cache line?
delay = 15
lowerLevel = "Memory Memory"
# The Octane we have has a peak bandwidth 1.0GB/s system bus
# 2GB of SDRAM memory, possibly PC100 (100MHz)
[Memory]
deviceType = 'niceCache'
size = 128
assoc = 1
bsize = 64
writePolicy = 'WB'
replPolicy = 'LRU'
numPorts = 1
portOccp = 1
hitDelay = 113 # According to lmbench our machine has 404.6ns delay
# clock cycle is 300MHz so 3.3ns
# so approximately 120 clock cycles
# renau - You have to discount the miss time
# (21 cycles total), so 113 clk should be
# fine (404/3-21 ~ 113).
missDelay = 500
MSHR = NoMSHR
lowerLevel = 'voidDevice'
[NoMSHR]
type = 'none'
size = 128
bsize = 64
[voidDevice]
deviceType = 'void'
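Several comments in the configuration file above convert lmbench measurements into cycle counts at the 300MHz clock. The conversions can be sketched as follows. This is only an illustration of the arithmetic, with hypothetical function names, and it assumes that MB in the bandwidth comments means 2^20 bytes, which is what the 1.802e6 cache lines/s figure suggests.

```c
#include <assert.h>

/* Cycles the bus is occupied per cache line for a target bandwidth. */
double cycles_per_line(double clock_hz, double line_bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return clock_hz * line_bytes / bytes_per_sec;
}

/* Bandwidth in bytes/s implied by a given port occupancy in cycles. */
double bandwidth_bytes_per_sec(double clock_hz, double line_bytes,
                               double occupancy_cycles)
{
    assert(occupancy_cycles > 0.0);
    return clock_hz * line_bytes / occupancy_cycles;
}

/* Convert a measured latency in nanoseconds to clock cycles. */
double ns_to_cycles(double ns, double clock_hz)
{
    return ns * 1e-9 * clock_hz;
}
```

Under these assumptions, a 220MB/s target with 128-byte lines gives roughly 166 cycles per line, and a port occupancy of 168 cycles corresponds to roughly 218MB/s, so both are in line with the ~220MB/s that lmbench reports. The latency conversions give about 2 cycles for 6.6ns, about 14 cycles for 47.3ns, and about 121 cycles for 404.6ns.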
BIBLIOGRAPHY
[1] Chart Get!: Media create sales: 04/06 - 04/12.
http://chartget.com/2009/04/media-create-sales-0406-0412-hardware.html.
[2] IBM research SimOS website.
http://www.research.ibm.com/arl/projects/SimOSppc.html.
[3] QEMU BBV website.
http://www.csl.cornell.edu/~vince/projects/qemusim/.
[4] Snapshot of the embedded Linux market – May, 2006.
http://www.linuxfordevices.com/c/a/Linux-For-Devices-Articles/Snapshot-of-the-embedded-Linux-market-May-2006/.
[5] Top 500 supercomputing sites. http://www.top500.org/.
[6] Advanced Micro Devices. AMD Athlon Processor Model 6 Revision Guide,
2003.
[7] Advanced Micro Devices. AMD64 Architecture Programmer’s Manual, 2006.
[8] A.R. Alameldeen and D.A. Wood. Variability in architectural simulations
of multi-threaded commercial workloads. In Proc. 9th IEEE Symposium on
High Performance Computer Architecture, 2003.
[9] N.M. Amato, J. Perdue, M.M. Mathis, A. Pietracaprina, and G. Pucci.
Predicting performance on SMPs. A case study: The SGI Power Challenge. In
Proc. 14th IEEE/ACM International Parallel and Distributed Processing
Symposium, page 729, 2000.
[10] AMD. AMD Family 10h Processor BIOS and Kernel Developer Guide, 2009.
[11] ARM Limited. ARM Architecture Reference Manual, 2000.
[12] Atmel. AVR32 Architecture Document, 2006.
[13] T. Austin. SimpleScalar 4.0 release note. http://www.simplescalar.com/.
[14] Axis Communications AB. ETRAX FS Designer’s Reference, 2007.
[15] R. Balasubramonian, D.H. Albonesi, A. Buyuktosunoglu, and
S. Dwarkadas. Memory hierarchy reconfiguration for energy and
performance in general-purpose processor architectures. In Proc.
IEEE/ACM 33rd International Symposium on Microarchitecture, pages
245–257, December 2000.
[16] L. Barroso, K. Gharachorloo, and E. Bugnion. Memory system
characterization of commercial workloads. In Proc. 25th IEEE/ACM
International Symposium on Computer Architecture, June 1998.
[17] R.C. Bedichek. Talisman: Fast and accurate multicomputer simulation.
In Proc. ACM International Conference on Measurement and Modeling of
Computer Systems, May 1995.
[18] F. Bellard. QEMU, a fast and portable dynamic translator. In Proc. USENIX
Annual Technical Conference, FREENIX Track, pages 41–46, April 2005.
[19] L. Benini, A. Macii, and A. Nannarelli. Cached-code compression for en-
ergy minimization in embedded processors. In Proc. IEEE/ACM Interna-
tional Symposium on Low Power Electronics and Design, pages 322–327, Au-
gust 2001.
[20] Á. Beszédes, R. Ferenc, T. Gyimóthy, A. Dolenc, and K. Karsisto. Survey
of code-size reduction methods. ACM Computing Surveys, 35(3):223–267,
September 2003.
[21] R. Bhargava, L.K. John, and F. Matus. Accurately modeling speculative
instruction fetching in trace-driven simulation. pages 65–71, 1999.
[22] N.L. Binkert, R.G. Dreslinski, L.R. Hsu, K.T. Lim, A.G. Saidi, and S.K.
Reinhardt. The m5 simulator: Modeling networked systems. IEEE Micro,
26(4):52–60, 2006.
[23] R. Bitirgen, E. İpek, and J.F. Martínez. Coordinated management of
multiple interacting resources in chip multiprocessors: A machine learning
approach. In Proc. IEEE/ACM 41st Annual International Symposium on
Microarchitecture, pages 318–329, December 2008.
[24] B. Black, A. Huang, M. Lipasti, and J. Shen. Can trace-driven simulators
accurately predict superscalar performance? In Proc. IEEE International
Conference on Computer Design, pages 478–485, October 1996.
[25] B. Black and J. P. Shen. Calibration of microprocessor performance mod-
els. IEEE Computer, 31(5):59–65, May 1998.
[26] C. Blundell, M.M.K. Martin, and T.F. Wenisch. InvisiFence: Performance
transparent memory ordering in conventional multiprocessors. In Proc.
36th IEEE/ACM International Symposium on Computer Architecture, pages
233–244, June 2009.
[27] T. Bonny and J. Henkel. Efficient code density through look-up table com-
pression. In Proc. ACM/IEEE Design, Automation and Test in Europe Confer-
ence and Exposition, pages 809–814, April 2007.
[28] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for
architectural-level power analysis and optimizations. In Proc. 27th
IEEE/ACM International Symposium on Computer Architecture, pages 83–94,
June 2000.
[29] B.R. Buck and J.K. Hollingsworth. An API for runtime code patching. The
International Journal of High Performance Computing Applications, 14(4):317–
329, 2000.
[30] D. Burger and T.M. Austin. The SimpleScalar toolset, version 2.0. Techni-
cal Report 1342, University of Wisconsin, June 1997.
[31] H. Cain, K. Lepak, B. Schwartz, and M. Lipasti. Precise and accurate pro-
cessor simulation. In Workshop on Computer Architecture Evaluation Using
Commercial Workloads, pages 13–22, February 2002.
[32] D. Chiou, D. Sunwoo, H. Angepat, J. Kim, N.A. Patil, W. Reinhart,
and D.E. Johnson. Parallelizing computer system simulators. In Proc.
22nd IEEE/ACM International Parallel and Distributed Processing Symposium,
pages 1–5, April 2008.
[33] D. Chiou, D. Sunwoo, J. Kim, N.A. Patil, W. Reinhart, D.E. Johnson,
J. Keefe, and H. Angepat. FPGA-accelerated simulation technologies
(FAST): Fast, full-system, cycle-accurate simulators. In Proc. IEEE/ACM
40th Annual International Symposium on Microarchitecture, pages 249–261,
December 2007.
[34] P. Chow and M. Horowitz. Architectural tradeoffs in the design of MIPS-
X. In Proc. 14th IEEE/ACM International Symposium on Computer Architec-
ture, pages 300–308, June 1987.
[35] D. Citron. MisSPECulation: Partial and misleading use of SPEC CPU2000
in computer architecture conferences. In Proc. 30th IEEE/ACM Interna-
tional Symposium on Computer Architecture, pages 52–62, June 2003.
[36] Compaq Computer Corporation. Alpha Architecture Handbook, 1998.
[37] G. Contreras, M. Martonosi, J. Peng, R. Ju, and G. Lueh. XTREM: A power
simulator for the Intel XScale core. In Proc. ACM SIGPLAN Workshop on
Languages, Compilers, and Tools for Embedded Systems, pages 115–125, 2004.
[38] J.W. Davidson and R.A. Vaughan. The effect of instruction set complexity
on program size and memory performance. In Proc. 2nd ACM Symposium
on Architectural Support for Programming Languages and Operating Systems,
pages 60–64, October 1987.
[39] B. De Sutter, B. De Bus, K. De Bosschere, and S. Debray. Combining
global code and data compaction. In Proc. ACM SIGPLAN Workshop on
Languages, Compilers, and Tools for Embedded Systems, pages 29–38, 2001.
[40] L.A. DeRose. The hardware performance monitor toolkit. In Proc. 7th
International Euro-Par Conference, pages 122–132, August 2001.
[41] L.A. DeRose, K. Ekanadham, J.K. Hollingsworth, and S. Sbaraglia.
SIGMA: A simulator infrastructure to guide memory analysis. In Proc.
IEEE/ACM Supercomputing International Conference on High Performance
Computing, Networking, Storage and Analysis, number 6, November 2002.
[42] R. Desikan, D. Burger, and S. Keckler. Measuring experimental error in
microprocessor simulation. In Proc. 28th IEEE/ACM International
Symposium on Computer Architecture, pages 266–277, June 2001.
[43] R. Desikan, D. Burger, S. Keckler, and T. Austin. Sim-alpha: a validated,
execution-driven Alpha 21264 simulator. Technical Report TR-01-23, De-
partment of Computer Sciences, The University of Texas at Austin, 2001.
[44] J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: Deterministic shared
memory multiprocessing. In Proc. 14th ACM Symposium on Architectural
Support for Programming Languages and Operating Systems, 2009.
[45] Digital Equipment Corp. pdp11/40 Processor Handbook, 1972.
[46] Digital Equipment Corp. VAX Architecture Reference Manual, 1987.
[47] J. Donald and M. Martonosi. An efficient, practical parallelization
methodology for multicore architecture simulation. Computer Architecture
Letters, August 2006.
[48] J. Edler and M.D. Hill. Dinero IV trace-driven uniprocessor cache
simulator. http://www.cs.wisc.edu/~markhill/DineroIV, 2003.
[49] L. Eeckhout, A. Georges, and K. De Bosschere. Selecting a reduced but
representative workload. In OOPSLA 2003 Workshop on Middleware
Benchmarking: Approaches, Results and Experiences, 2003.
[50] M. Ekman and P. Stenstrom. Enhancing multiprocessor architecture sim-
ulation speed using matched-pair comparison. In Proc. IEEE International
Symposium on Performance Analysis of Systems and Software, 2005.
[51] S. Eranian. Perfmon2: a flexible performance monitoring interface for
Linux. In Proc. 2006 Ottawa Linux Symposium, pages 269–288, July 2006.
[52] S. Eyerman, L. Eeckhout, T. Karkhanis, and J.E. Smith. A performance
counter architecture for computing accurate CPI components. In Proc.
12th ACM Symposium on Architectural Support for Programming Languages
and Operating Systems, pages 175–184, 2006.
[53] M. Ferdman, T.F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos.
Temporal instruction fetch streaming. In Proc. IEEE/ACM 41st Annual
International Symposium on Microarchitecture, December 2008.
[54] M.J. Flynn, C.L. Mitchell, and J.M. Mulder. And now a case for more
complex instruction sets. IEEE Computer, 20(9):71–83, September 1987.
[55] K. Ganesan, D. Panwar, and L.K. John. Generalization, validation and
analysis of SPEC CPU2006 simulation points based on branch, memory and
TLB characteristics. In SPEC Benchmark Workshop, January 2009.
[56] J. Gibson, R. Kunz, D. Ofelt, M. Horowitz, J. Hennessy, and M. Heinrich.
FLASH vs. (simulated) FLASH: Closing the simulation loop. In Proc. 9th
ACM Symposium on Architectural Support for Programming Languages and
Operating Systems, pages 49–58, November 2000.
[57] S.R. Goldschmidt and J.L. Hennessy. The accuracy of trace-driven
simulations of multiprocessors. In Proc. ACM International Conference on
Measurement and Modeling of Computer Systems, pages 146–157, May 1993.
[58] J. Gonzalez, J. Gimenez, and J. Labarta. Automatic detection of parallel
applications computation phases. In Proc. 23rd IEEE/ACM International
Parallel and Distributed Processing Symposium, pages 1–11, May 2009.
[59] T. Granlund and L. Montgomery. Division by invariant integers using
multiplication. In Proc. ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation, pages 61–72, June 1994.
[60] M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, and R.B.
Brown. MiBench: A free, commercially representative embedded bench-
mark suite. In Proc. IEEE 4th Workshop on Workload Characterization, pages
3–14, December 2001.
[61] A. Halambi, A. Shrivastava, P. Biswas, N. Dutt, and A. Nicolau. A design
space exploration framework for reduced bit-width instruction set archi-
tecture (rISA) design. In Proc. 15th IEEE/ACM International Symposium on
System Synthesis, pages 120–125, November 2002.
[62] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and
more flexible program analysis. In Workshop on Modeling, Benchmarking
and Simulation, June 2005.
[63] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, and
P. Biswas. SH3: High code density, low power. IEEE Micro, 15(6):11–19,
1995.
[64] M. Hauswirth, A. Diwan, P.F. Sweeney, and M.C. Mozer. Automating
vertical profiling. In Proc. 20th ACM Conference on Object-Oriented Pro-
gramming Systems, Languages and Applications, pages 281–296, 2005.
[65] K. Hazelwood, G. Lueck, and R. Cohn. Scalable support for multi-
threaded applications on dynamic binary instrumentation systems. In
Proc. International Symposium on Memory Management, June 2009.
[66] Z. Herczeg, Á. Kiss, D. Schmidt, N. Wehn, and T. Gyimóthy. Xeemu: An
improved XScale power simulator. In PATMOS, pages 300–309, 2007.
[67] Hewlett Packard. PA-RISC 1.1 Architecture and Instruction Set Reference
Manual, 1994.
[68] K. Hoste. Personal communication, 2009.
[69] IBM. Enterprise Systems Architecture/390: Principles of Operation, 1999.
[70] IBM. PowerPC Microprocessor Family: The Programming Environments for
32-bit Microprocessors, 2000.
[71] Intel. Intel Itanium Architecture Software Developer’s Manual, 2000.
[72] Intel. Intel Architecture Software Developer’s Manual, Volume 3: System Pro-
gramming Guide, 2009.
[73] Intel Corp. Intel 64 and IA-32 Architectures Software Developer’s Manual,
2007.
[74] A. Jaleel, R. Cohn, C.-K. Luk, and B. Jacob. CMP$im: A binary instrumen-
tation approach to modeling memory behavior of workloads on CMPs.
Technical Report UMD-SCA-2006-01, University of Maryland, 2006.
[75] A. Jaleel, R.S. Cohn, C.-K. Luk, and B. Jacob. CMP$im: A Pin-based on-
the-fly multi-core cache simulator. In Proc. Workshop on Modeling, Bench-
marking, and Simulation, pages 28–36, June 2008.
[76] K. Keeton, D. Patterson, Y. He, R. Raphael, and W. Baker. Performance
characterization of a quad Pentium Pro SMP using OLTP workloads. In
Proc. 28th IEEE/ACM International Symposium on Computer Architecture,
June 2001.
[77] AJ KleinOsowski and D.J. Lilja. MinneSPEC: A new SPEC benchmark
workload for simulation-based computer architecture research. Computer
Architecture Letters, 1, June 2002.
[78] W. Korn, P.J. Teller, and G. Castillo. Just how accurate are performance
counters? In 20th IEEE International Performance, Computing, and Commu-
nication Conference, pages 303–310, April 2001.
[79] M. Kozuch and A. Wolfe. Compression of embedded system programs.
In Proc. IEEE International Conference on Computer Design, pages 270–277,
October 1994.
[80] J. Lau, S. Schoenmackers, T. Sherwood, and B. Calder. Reducing code
size with echo instructions. In Proc. 7th ACM International Conference on
Compilers, Architectures and Synthesis for Embedded Systems, pages 84–94,
October 2003.
[81] B.C. Lee, J. Collins, H. Wang, and D. Brooks. CPR: Composable perfor-
mance regression for scalable multiprocessor models. In Proc. IEEE/ACM
41st Annual International Symposium on Microarchitecture, pages 270–281,
December 2008.
[82] H. Lekatsas and W. Wolf. SAMC: A code compression algorithm for em-
bedded processors. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, 18(12):1689–1701, 1999.
[83] K.M. Lepak, H.W. Cain, and M.H. Lipasti. Redeeming IPC as a perfor-
mance metric for multithreaded programs. In Proc. IEEE/ACM Interna-
tional Conference on Parallel Architectures and Compilation Techniques, page
232, 2003.
[84] Y. Li, B. Lee, D. Brooks, Z. Hu, and K. Skadron. CMP design space ex-
ploration subject to physical constraints. In Proc. 12th IEEE Symposium on
High Performance Computer Architecture, pages 15–26, February 2006.
[85] C.H. Lin, Y. Xie, and W. Wolf. LZW-based code compression for VLIW
embedded systems. In Proc. ACM/IEEE Design, Automation and Test in
Europe Conference and Exposition, pages 76–81, February 2004.
[86] G.H. Loh, S. Subramaniam, and Y. Xie. Zesto: A cycle-level simulator for
highly detailed microarchitecture exploration. In Proc. IEEE International
Symposium on Performance Analysis of Systems and Software, April 2009.
[87] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace,
V.J. Reddi, and K. Hazelwood. Pin: Building customized program anal-
ysis tools with dynamic instrumentation. In Proc. ACM SIGPLAN Confer-
ence on Programming Language Design and Implementation, pages 190–200,
June 2005.
[88] Y. Luo, O.M. Lubeck, H. Wasserman, F. Bassetti, and K.W. Cameron.
Development and validation of a hierarchical memory model incorporating
CPU- and memory-operation overlap model. In Workshop on Software
Performance, pages 152–163, 1998.
[89] Y. Luo, V. Packirisamy, W.-C. Hsu, A. Zhai, N. Mungre, and A. Tarkas.
Dynamic performance tuning for speculative threads. In Proc. 36th
IEEE/ACM International Symposium on Computer Architecture, pages 462–
473, June 2009.
[90] G. Marin and J. Mellor-Crummey. Cross-architecture performance
predictions for scientific applications using parameterized models. In Proc. ACM
International Conference on Measurement and Modeling of Computer Systems,
pages 2–13, June 2004.
[91] M.M.K. Martin, D.J. Sorin, B.M. Beckmann, M.R. Marty, M. Xu, A.R.
Alameldeen, K.E. Moore, M.D. Hill, and D.A. Wood. Multifacet’s gen-
eral execution-driven multiprocessor simulator (GEMS) toolset. Computer
Architecture News, 2005.
[92] J.R. Mashey. War of the benchmark means: Time for a truce. ACM
SIGARCH Computer Architecture News, 32:1–14, September 2004.
[93] H. Massalin. Superoptimizer: a look at the smallest program. In Proc. 2nd
ACM Symposium on Architectural Support for Programming Languages and
Operating Systems, pages 122–126, October 1987.
[94] W. Mathur and J. Cook. Improved estimation for software multiplexing of
performance counting. In Proc. 13th IEEE International Symposium on Mod-
eling, Analysis and Simulation of Computer and Telecommunication Systems,
pages 23–34, September 2005.
[95] M.E. Maxwell, P.J. Teller, L. Salayandia, and S. Moore. Accuracy of
performance monitoring hardware. In Proc. Los Alamos Computer Science Institute
Symposium, October 2002.
[96] MIPS Technologies, Inc. MIPS32 Architecture for Programmers, 2001.
[97] P. Montesinos, L. Ceze, and J. Torrellas. Delorean: Recording and de-
terministically replaying shared-memory multiprocessor execution effi-
ciently. In Proc. 35th IEEE/ACM International Symposium on Computer Ar-
chitecture, pages 289–300, June 2008.
[98] P. Montesinos, M. Hicks, S.T. King, and J. Torrellas. Capo: A software-
hardware interface for practical deterministic multiprocessor replay. In
Proc. 14th ACM Symposium on Architectural Support for Programming Lan-
guages and Operating Systems, March 2009.
[99] MOS Technology Inc. MCS6500 Microcomputer Family Hardware Manual,
1975.
[100] Motorola, Inc. Motorola MC88110 User’s Manual, 1991.
[101] Motorola, Inc. Motorola M68000 Family Programmer’s Reference Manual,
1992.
[102] P. J. Mucci, S. Browne, C. Deane, and G. Ho. PAPI: A portable interface
to hardware performance counters. In Proc. Department of Defense HPCMP
User Group Conference, June 1999.
[103] A. Muzahid, D. Suárez, S. Qi, and J. Torrellas. SigRace: Signature-based
data race detection. In Proc. 36th IEEE/ACM International Symposium on
Computer Architecture, pages 337–348, June 2009.
[104] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. Sweeney. We have it easy,
but do we have it right? In NSF Next Generation Systems Workshop, pages
1–5, April 2008.
[105] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. Sweeney. Producing wrong
data without doing anything obviously wrong! In Proc. 14th ACM Sym-
posium on Architectural Support for Programming Languages and Operating
Systems, March 2009.
[106] T. Mytkowicz, P.F. Sweeney, M. Hauswirth, and A. Diwan. Time interpo-
lation: So many metrics, so few registers. In Proc. IEEE/ACM 41st Annual
International Symposium on Microarchitecture, pages 286–300, 2007.
[107] P. Nagpurkar and C. Krintz. Visualization and analysis of phased behav-
ior in Java programs. In Proc. ACM 3rd international symposium on Princi-
ples and practice of programming in Java, pages 27–33, June 2004.
[108] A.A. Nair and L.K. John. Simulation points for SPEC CPU 2006. In Proc.
IEEE International Conference on Computer Design, pages 397–403, 2008.
[109] J. Namkung, D. Kim, R. Gupta, I. Kozintsev, J.-Y. Bouguet, and C. Dulong.
Phase guided sampling for efficient parallel application simulation.
In Proc. 4th IEEE/ACM/IFIP International Conference on Hardware/Software
Codesign and System Synthesis, pages 187–192, 2006.
[110] S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Automatic
logging of operating system effects to guide application-level architecture
simulation. In Proc. ACM International Conference on Measurement and
Modeling of Computer Systems, pages 216–227, 2006.
[111] NEC. VR10000 Series 64-/32-bit Microprocessor User’s Manual, 2001.
[112] N. Nethercote. Dynamic Binary Analysis and Instrumentation. PhD thesis,
University of Cambridge, 2004.
[113] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight
dynamic binary instrumentation. In Proc. ACM SIGPLAN Conference on
Programming Language Design and Implementation, pages 89–100, June 2007.
[114] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: Efficient determin-
istic multithreading in software. In Proc. 14th ACM Symposium on Archi-
tectural Support for Programming Languages and Operating Systems, March
2009.
[115] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi.
Pinpointing representative portions of large Intel Itanium programs with
dynamic instrumentation. In Proc. IEEE/ACM 37th Annual International
Symposium on Microarchitecture, pages 81–93, December 2004.
[116] D.A. Penry, D.L. August, and M. Vachharajani. Rapid development of a
flexible validated processor model. In Proc. Workshop on Modeling, Bench-
marking, and Simulation, pages 21–30, June 2005.
[117] C. Pereira, H. Patil, and B. Calder. Reproducible simulation of multi-
threaded workloads for architectural design exploration. In Proc.
IEEE International Symposium on Workload Characterization, pages 173–182,
September 2008.
[118] E. Perelman, G. Hamerly, and B. Calder. Picking statistically valid and
early simulation points. In Proc. IEEE/ACM International Conference on
Parallel Architectures and Compilation Techniques, pages 244–256, Septem-
ber 2003.
[119] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder.
Using SimPoint for accurate and efficient simulation. In Proc. ACM
International Conference on Measurement and Modeling of Computer Systems, pages
318–319, June 2003.
[120] E. Perelman, J. Lau, H. Patil, A. Jaleel, G. Hamerly, and B. Calder. Cross
binary simulation points. In Proc. IEEE International Symposium on Perfor-
mance Analysis of Systems and Software, 2007.
[121] E. Perelman, M. Polito, J.-Y. Bouguet, J. Sampson, B. Calder, and C. Du-
long. Detecting phases in parallel applications on shared memory archi-
tectures. In Proc. 20th IEEE/ACM International Parallel and Distributed Pro-
cessing Symposium, 2006.
[122] A. Phansalkar, A. Joshi, and L.K. John. Analysis of redundancy and ap-
plication balance in the SPEC CPU2006 benchmark suite. In Proc. 34th
IEEE/ACM International Symposium on Computer Architecture, pages 412–
413, June 2007.
[123] R. Phelan. Improving ARM Code Density and Performance: New Thumb Ex-
tensions to the ARM Architecture. ARM Limited, 2003.
[124] B. Raiter. http://www.muppetlabs.com/~breadbox/software/elfkickers.html,
2007.
[125] J. Renau. SESC. http://sesc.sourceforge.net/index.html, 2002.
[126] Renesas Technology. SH-3/SH-3E/SH3-DSP Software Manual, 2006.
[127] M. Rosenblum, E. Bugnion, S. Devine, and S. Herrod. Using the SimOS
machine simulator to study complex computer systems. ACM Transactions
on Modeling and Computer Simulation, 7(1):78–103, 1997.
[128] M. Rosenblum, E. Bugnion, S.A. Herrod, E. Witchel, and A. Gupta. The
impact of architectural trends on operating system performance. In Proc.
15th ACM Symposium on Operating Systems Principles, 1995.
[129] S. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas. EVAL: Utilizing pro-
cessors with variation-induced timing errors. In Proc. IEEE/ACM 41st An-
nual International Symposium on Microarchitecture, pages 423–434, Decem-
ber 2008.
[130] S. Seong and P. Mishra. A bitmask-based code compression technique
for embedded systems. In Proc. International Conference on Computer Aided
Design, pages 251–254, November 2006.
[131] T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis
to find periodic behavior and simulation points in applications. In Proc.
IEEE/ACM International Conference on Parallel Architectures and Compilation
Techniques, pages 3–14, September 2001.
[132] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically char-
acterizing large scale program behavior. In Proc. 10th ACM Symposium
on Architectural Support for Programming Languages and Operating Systems,
pages 45–57, October 2002.
[133] B. Smith. ARM and Intel battle over the mobile chip’s future. IEEE Com-
puter, 41(5):16–19, May 2008.
[134] S. Somogyi, T.F. Wenisch, A. Ailamaki, and B. Falsafi. Spatio-temporal
memory streaming. In Proc. 36th IEEE/ACM International Symposium on
Computer Architecture, pages 69–80, June 2009.
[135] A. Srivastava and A. Eustace. ATOM: a system for building customized
program analysis tools. In Proc. ACM SIGPLAN Conference on Programming
Language Design and Implementation, pages 196–205, June 1994.
[136] Standard Performance Evaluation Corporation. SPEC CPU benchmark
suite. http://www.specbench.org/osg/cpu2000/, 2000.
[137] Standard Performance Evaluation Corporation. SPEC OMP benchmark
suite. http://www.specbench.org/hpg/omp2001/, 2001.
[138] Standard Performance Evaluation Corporation. SPEC CPU benchmark
suite. http://www.specbench.org/osg/cpu2006/, 2006.
[139] P. Steenkiste. The impact of code density on instruction cache perfor-
mance. In Proc. 16th IEEE/ACM International Symposium on Computer Ar-
chitecture, pages 252–259, June 1989.
[140] J. Storer and T. Szymanski. Data compression via textual substitution.
Journal of the ACM, 29:928–951, 1982.
[141] J. Suh and M. Dubois. Dynamic MIPS rate stabilization in out-of-order
processors. In Proc. 36th IEEE/ACM International Symposium on Computer
Architecture, pages 46–56, June 2009.
[142] Sun Microsystems. The SPARC Architecture Manual Version 9, 1994.
[143] P.K. Szwed, D. Marques, R.M. Buels, S.A. McKee, and M. Schulz. Sim-
Snap: Fast-forwarding via native execution and application-level check-
pointing. In Proc. 8th IEEE Workshop on Interaction between Compilers and
Computer Architectures, February 2004.
[144] R.A. Uhlig and T.N. Mudge. Trace-driven memory simulation: A survey.
ACM Computing Surveys, 29(2):128–170, June 1997.
[145] M. Van Biesbrouck, L. Eeckhout, and B. Calder. Considering all starting
points for simultaneous multithreading simulation. In Proc. IEEE
International Symposium on Performance Analysis of Systems and Software, March
2006.
[146] M. Van Biesbrouck, T. Sherwood, and B. Calder. A co-phase matrix to
guide simultaneous multithreading simulation. In Proc. IEEE International
Symposium on Performance Analysis of Systems and Software, pages 45–56,
March 2004.
[147] A. Varma, E. Debes, I. Kozintsev, P. Klein, and B. Jacob. Accurate and fast
system-level power modeling. ACM Transactions on Embedded Computing
Systems, 7(3), 2008.
[148] S. Vlaovic and E.S. Davidson. TAXI: Trace analysis for X86 interpretation.
In Proc. IEEE International Conference on Computer Design, pages 508–514,
September 2002.
[149] E. Wanderley Netto, R. Azevedo, P. Centoducatte, and G. Araujo. Multi-
profile based code compression. In Proc. 41st ACM/IEEE Design Automa-
tion Conference, pages 244–249, June 2004.
[150] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob.
DRAMsim: A memory-system simulator. Computer Architecture News,
33(4):100–107, September 2005.
[151] V.M. Weaver. http://www.deater.net/weave/vmwprod/linux_logo/,
2009.
[152] V.M. Weaver and S.A. McKee. Are cycle accurate simulations a waste of
time? In Proc. 7th Workshop on Duplicating, Deconstructing, and Debunking,
pages 40–53, June 2008.
[153] V.M. Weaver and S.A. McKee. Can hardware performance counters be
trusted? In Proc. IEEE International Symposium on Workload Characteriza-
tion, pages 141–150, September 2008.
[154] V.M. Weaver and S.A. McKee. Can hardware performance counters be
trusted? Technical Report CSL-TR-2008-1051, Cornell University, August
2008.
[155] V.M. Weaver and S.A. McKee. Using dynamic binary instrumentation to
generate multi-platform simpoints: Methodology and accuracy. In Proc.
3rd International Conference on High Performance Embedded Architectures and
Compilers, pages 305–319, January 2008.
[156] I. Williams. An illustration of the benefits of the MIPS R12000 micropro-
cessor and OCTANE system architecture. White Paper, SGI, 1999.
[157] E. Witchel and M. Rosenblum. Embra: Fast and flexible machine
simulation. In Proc. ACM International Conference on Measurement and Modeling of
Computer Systems, pages 68–79, May 1996.
[158] A. Wolfe and A. Chanin. Executing compressed programs on an embed-
ded RISC architecture. In Proc. IEEE/ACM 25th International Symposium on
Microarchitecture, pages 81–91, November 1992.
[159] C. Won, B. Lee, C. Yu, S. Moh, Y.-Y. Kim, and K. Park. Linux/SimOS -
a simulation environment for evaluating high-speed communication sys-
tems. In Proc. International Conference on Parallel Processing, pages 193–199,
2002.
[160] Y. Wu, M. Breternitz, Jr., H. Hum, R. Peri, and J. Pickett. Enhanced code
density of embedded CISC processors with echo technology. In Proc. 3rd
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and
System Synthesis, pages 160–165, October 2005.
[161] W.A. Wulf. Evaluation of the WM architecture. In Proc. 19th IEEE/ACM
International Symposium on Computer Architecture, pages 382–390, 1992.
[162] Wm.A. Wulf and S.A. McKee. Hitting the memory wall: Implications of
the obvious. Computer Architecture News, 23(1):20–24, March 1995.
[163] R.E. Wunderlich, T.F. Wenisch, B. Falsafi, and J.C. Hoe. SMARTS:
Accelerating microarchitecture simulation via rigorous statistical sampling.
In Proc. 30th IEEE/ACM International Symposium on Computer Architecture,
pages 84–95, June 2003.
[164] Xilinx. MicroBlaze Processor Reference Guide, 2004.
[165] M. Xu, R. Bodik, and M. Hill. A flight data recorder for enabling full-
system multiprocessor deterministic replay. In Proc. 30th IEEE/ACM In-
ternational Symposium on Computer Architecture, pages 122–135, June 2003.
[166] X.H. Xu, C.T. Clarke, and S.R. Jones. High performance code compression
architecture for the embedded ARM/THUMB processor. In Proc. ACM
Computing Frontiers Conference, pages 451–456, April 2004.
[167] K.C. Yeager. The MIPS R12000 superscalar microprocessor. White Paper,
SGI, 2000.
[168] J.J. Yi, S. Kodakara, R. Sendag, D.J. Lilja, and D.M. Hawkins. Charac-
terizing and comparing prevailing simulation techniques. In Proc. 11th
IEEE Symposium on High Performance Computer Architecture, pages 266–
277, February 2005.
[169] J.J. Yi and D.J. Lilja. Simulation of computer architectures: Simulators,
benchmarks, methodologies, and recommendations. IEEE Transactions on
Computers, 55(3):268–280, March 2006.
[170] J.J. Yi, R. Sendag, D.J. Lilja, and D.M. Hawkins. Speed and accuracy
trade-offs in microarchitectural simulations. IEEE Transactions on
Computers, 56(11):1549–1563, November 2007.
[171] M. Yourst. PTLsim User’s Guide and Reference, 2007.
[172] M.T. Yourst. PTLsim: A cycle accurate full system x86-64 microarchitec-
tural simulator. In Proc. IEEE International Symposium on Performance Anal-
ysis of Systems and Software, pages 23–34, April 2007.
[173] D. Zaparanuks, M. Jovic, and M. Hauswirth. Accuracy of performance
counter measurements. In Proc. IEEE International Symposium on Perfor-
mance Analysis of Systems and Software, pages 23–32, April 2009.
[174] H. Zeng, M. Yourst, K. Ghose, and D. Ponomarev. MPTLsim: a simulator
for X86 multicore processors. In Proc. 46th ACM/IEEE Design Automation
Conference, pages 226–231, 2009.
[175] Zilog. Z80 family CPU User Manual, 2001.
[176] J. Ziv and A. Lempel. A universal algorithm for sequential data compres-
sion. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
[177] A. Zmily and C. Kozyrakis. Simultaneously improving code size, per-
formance, and energy in embedded processors. In Proc. ACM/IEEE De-
sign, Automation and Test in Europe Conference and Exposition, pages 224–
229, March 2006.
